5.13 pm, Friday 9 May 2008

The Yardstik Approach

League-based competition avoids the problems of the "better" comparator by ensuring that whatever inequity is involved is equally shared. Yardstik, by contrast, addresses the problem head-on. It generates a scale of comparative match-playing skill using the results from those matches that share the greatest measure of internal consistency.

Although match results needn't "behave properly", they often do: if you can beat me and I can beat John, then more often than not you can beat John easily. If I lose heavily to John and you lose marginally, then "better" has performed like "taller". We can agree a ranking, perhaps even a scale of our relative abilities, without being conscious of doing so.

The problem occurs when anomalies arise, that is, when matches don't go the way previous evidence suggests they should. When the evidence of thousands of matches involving hundreds of players is analysed, anomalies are inevitable. The Yardstik algorithm calculates a skill rating for each player that minimises the number and severity of these anomalies.
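To make the idea of "ratings that minimise anomalies" concrete, here is a minimal sketch. It is not the Yardstik algorithm, whose objective is not given on this page. It swaps in ordinary least squares: each match's anomaly is taken to be the gap between the rating difference and the observed points margin, and ratings are adjusted by gradient descent until the squared anomalies are as small as possible. The function name fit_ratings, the (winner, loser, margin) input format and the step settings are all assumptions made for illustration.

```python
def fit_ratings(matches, steps=500, rate=0.25):
    """Least-squares ratings from (winner, loser, margin) tuples.

    Illustrative only: the squared-error objective is an assumption,
    not Yardstik's own definition of an anomaly.
    """
    ratings = {}
    for winner, loser, _ in matches:
        ratings.setdefault(winner, 0.0)
        ratings.setdefault(loser, 0.0)

    # Dividing by the match count keeps each step small enough to be stable.
    step = rate / len(matches)
    for _ in range(steps):
        grad = dict.fromkeys(ratings, 0.0)
        for winner, loser, margin in matches:
            # The "anomaly": how far the ratings are from explaining this result.
            residual = (ratings[winner] - ratings[loser]) - margin
            grad[winner] += 2 * residual
            grad[loser] -= 2 * residual
        for player in ratings:
            ratings[player] -= step * grad[player]
    return ratings


# Hypothetical, perfectly consistent results: no anomalies remain after fitting.
results = [("you", "me", 2), ("me", "John", 3), ("you", "John", 5)]
ratings = fit_ratings(results)
```

With consistent data like this the fitted rating differences reproduce every margin. Once a contradictory result is added, no set of ratings can explain everything, and the fit settles on the compromise with the smallest total squared anomaly.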

Pull & push anomalies

To perform this calculation properly one has to understand what constitutes an anomaly, and what doesn't. It is essential, for example, to treat a "whitewash" (a game in which the loser scores nothing) in a very different way to a tight match. A tight match is evidence of very similar levels of skill. A whitewash says little about the magnitude of the skill gap. It suggests a lower limit but not an upper one.

The idea can be generalised. Most of us "play to win", but we don't play to humiliate. It's bad manners and a waste of energy. So a comfortable 9:4 win might, with a little more effort, have been a 9:2 win or even a 9:0 thrashing. In general, the greater the difference in score, the less weight should be accorded to the lower score. Player ratings that imply a greater disparity than the 9:4 result represent less of an anomaly than player ratings that imply a closer result. Or more succinctly: push anomalies become increasingly important relative to pull anomalies, the greater the difference in score.
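The asymmetry can be sketched as a cost function. What follows is an illustration under stated assumptions, not Yardstik's formula: the name anomaly_cost, the idea of an "implied margin" measured in points, and the linear discount are all hypothetical. The only properties taken from the text are that ratings implying a closer result than the one observed count in full, while ratings implying a greater disparity count for less as the score gap widens, and for nothing at all against a whitewash.

```python
def anomaly_cost(observed_margin, implied_margin, max_margin=9):
    """Cost of the mismatch between a result and what the ratings imply.

    observed_margin: winner's points minus loser's points (1..max_margin).
    implied_margin:  margin the ratings would predict, from the winner's side.
    The linear discount below is an assumption made for illustration.
    """
    gap = implied_margin - observed_margin
    if gap <= 0:
        # Ratings imply a closer result (or the wrong winner): full weight.
        return float(-gap)
    # Ratings imply a greater disparity: discount as the observed gap widens.
    slack = observed_margin / max_margin  # near 0 for a tight game, 1 for a whitewash
    return gap * (1.0 - slack)


tight = anomaly_cost(1, 3)        # 9:8 game, ratings imply a wider gap
comfortable = anomaly_cost(5, 7)  # 9:4 game, ratings imply a wider gap
whitewash = anomaly_cost(9, 12)   # 9:0 game, ratings imply a wider gap
```

The same two-point overshoot costs less against the 9:4 result than against the 9:8 one, and an overshoot against a 9:0 result costs nothing, which matches the observation that a whitewash sets a lower limit on the skill gap but not an upper one.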

Discounting history

The more matches Yardstik can use to calculate its player ratings, the more reliable those ratings are likely to be, so it is good practice to accumulate results over several months, or even a year or two. But if the evidence of last year were considered to be as valid as that of this year, an improving player would be forever handicapped by his or her history. So Yardstik embodies a user-specified discount factor. Provided matches are dated, the discount factor will reduce the significance of matches according to their age.
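One plausible reading of the discount factor is a weight that shrinks geometrically with the age of the match. The sketch below assumes exactly that; the page does not say what form Yardstik's discount takes, and the function name match_weight and the per-year convention are hypothetical.

```python
from datetime import date


def match_weight(played_on, today, annual_discount):
    """Weight of a dated match, assuming geometric discounting per year.

    annual_discount is the fraction of significance retained after one
    year (1.0 means history is never discounted). Assumed form, not
    taken from Yardstik itself.
    """
    if not 0.0 < annual_discount <= 1.0:
        raise ValueError("annual_discount must be in (0, 1]")
    age_years = max((today - played_on).days, 0) / 365.25
    return annual_discount ** age_years


today = date(2008, 5, 9)
recent = match_weight(date(2008, 4, 9), today, 0.5)
older = match_weight(date(2006, 5, 9), today, 0.5)
```

Here the month-old match keeps most of its weight and the two-year-old match keeps roughly a quarter, so an improving player's recent results dominate the rating without older evidence being thrown away entirely.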

Yardstik's memory of distant events gets dimmer with time. In that respect, Yardstik's treatment of time is much like ours, and quite different from that of league-based systems, for which history has very little meaning at all.

Disjoint groups

Yardstik generates player ratings on the basis of whatever data it is given. It doesn't require any particular playing regime and there are few practical limits to the number of matches it can process simultaneously. It doesn't require that players belong to the same club or even live on the same continent. That makes it rather likely that the entire player population will comprise a number of disjoint groups whose members can be compared internally, but not with other groups, because no match has yet been played which might link them. Yardstik identifies disjoint groups and generates player ratings that apply within those groups. If two disjoint groups deliberately organise a tournament between them, then the ratings generated by subsequent Yardstik runs apply to the merged group.
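Finding disjoint groups amounts to finding the connected components of the graph whose nodes are players and whose edges are matches. How Yardstik does this internally is not described here; the sketch below uses a standard union-find structure, and the function name disjoint_groups and the player names are illustrative.

```python
def disjoint_groups(matches):
    """Return the sets of players linked, directly or indirectly, by matches.

    matches: iterable of (player_a, player_b) pairs.
    """
    parent = {}

    def find(player):
        parent.setdefault(player, player)
        while parent[player] != player:
            parent[player] = parent[parent[player]]  # path halving
            player = parent[player]
        return player

    for a, b in matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for player in list(parent):
        groups.setdefault(find(player), set()).add(player)
    return list(groups.values())


club_matches = [("Ann", "Bob"), ("Bob", "Cy"), ("Dee", "Eve")]
before = disjoint_groups(club_matches)                   # two separate groups
after = disjoint_groups(club_matches + [("Cy", "Dee")])  # one merged group
```

A single match between members of two groups is enough to merge them, which is why a tournament between two previously unconnected clubs lets later runs rate everyone on one scale.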

Scoring: "Summary" versus "Detail"

Summary Scoring

Expressing match results as (say) best of 5 games yields possible scores of 3:0, 3:1, 3:2, 2:3, 1:3, 0:3. Differences in skill are compressed into six levels — two of which (3:0 and 0:3) we have already recognised to be degenerate. Yardstik certainly works with such data, but the level of discrimination is coarse.

Detail Scoring

It would be nice to be able to distinguish a 3:0 result arising from 9:2, 9:4, 9:2 games from one based on 9:7, 10:8, 9:6 games. This is certainly possible, and one approach would be to treat the three game results as three independent "best of 9" matches. Although this usually generates entirely plausible rating differences, the assumption of independence is not tenable, and in extreme cases this method can generate erroneous results. Consider 0:9, 9:7, 0:9, 9:7, 9:7. If these games are independent, then running on default settings, Yardstik ranks the away player above the home player. Considered as a match, the home player has the edge when it really matters and is therefore considered the winner. Yardstik resolves this problem by allowing match results to be expressed as a sequence of games from which it extracts the median result. In this case that would be a 9:7 home win.

 
© 2005 Ralph Seeley. Designed by Ralph Seeley.