The Yardstik Approach
League-based competition avoids the problems
of the "better" comparator by ensuring that whatever inequity
is involved is equally shared. Yardstik, by contrast, addresses
the problem head-on. It generates a scale of comparative match-playing
skill using the results from those matches that share the
greatest measure of internal consistency.
Although match results needn't "behave properly", they often
do: if you can beat me and I can beat John, then more often
than not, you are able to beat John easily. If I lose heavily
to John and you lose marginally, then "better" has performed
like "taller". We can agree a ranking, perhaps even a scale
of our relative abilities, without being conscious of doing
so.
The problem occurs when anomalies arise: when matches don't
go the way previous evidence suggests they should. When the
evidence of thousands of matches, involving hundreds of players,
is analysed, anomalies are inevitable. The Yardstik algorithm
calculates a skill rating for each player that minimises the
number and severity of these anomalies.
Pull & push anomalies
To perform this calculation properly
one has to understand what constitutes an anomaly, and what
doesn't. It is essential, for example, to treat a "whitewash"
(a game in which the loser scores nothing) in a very different
way to a tight match. A tight match is evidence of very similar
levels of skill. A whitewash says little about the magnitude
of the skill gap. It suggests a lower limit but not an upper
one.
The idea can be generalised. Most of us "play to win", but
we don't play to humiliate. It's bad manners and a waste of
energy. So a comfortable 9:4 win might, with a little more
effort, have been a 9:2 win or even a 9:0 thrashing. In general,
the greater the difference in score, the less weight should
be accorded the lower score. Player ratings that imply a
greater disparity than the 9:4 result represent less of
an anomaly than player ratings that imply a closer result.
Or more succinctly: the greater the difference in score, the
more important push anomalies become relative to pull anomalies.
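The asymmetry can be illustrated with a small Python sketch. It is not Yardstik's actual formula, which is not given here. It assumes that a "push" anomaly is one where the result argues for pushing the ratings further apart (the ratings implied a closer result) and a "pull" anomaly is one where the result argues for pulling them together (the ratings implied a greater disparity). The function name `anomaly_cost`, the idea of an `implied_margin` in points, and the loser/winner weighting are all hypothetical.

```python
def anomaly_cost(implied_margin, winner_score, loser_score):
    """Hypothetical cost of one result against what the ratings implied.

    implied_margin: points margin the current ratings would predict
                    (an assumed quantity, not defined by Yardstik here).
    """
    if winner_score <= 0 or loser_score < 0 or loser_score > winner_score:
        raise ValueError("expected winner_score > 0 and 0 <= loser_score <= winner_score")

    observed_margin = winner_score - loser_score
    gap = implied_margin - observed_margin

    if gap <= 0:
        # Push anomaly: the ratings implied a closer result than was played.
        # Counted at full weight.
        return float(-gap)

    # Pull anomaly: the ratings implied a greater disparity than was played.
    # Assumed weighting: the lower score counts for less as the gap widens,
    # so a whitewash (loser_score == 0) gives no upper limit at all.
    pull_weight = loser_score / winner_score
    return pull_weight * gap


# A 9:4 win: ratings that imply a closer result cost more than
# ratings that imply a wider one by the same number of points.
push = anomaly_cost(implied_margin=3, winner_score=9, loser_score=4)
pull = anomaly_cost(implied_margin=7, winner_score=9, loser_score=4)
```

Under these assumptions a whitewash never produces a pull anomaly, while a tight match such as 9:8 penalises errors in either direction almost equally.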
Discounting history
The more matches Yardstik can use to calculate its player ratings,
the more reliable those ratings are likely to be, so it is
good practice to accumulate results over several months, or
even a year or two. But if the evidence of last year were
considered to be as valid as that of this year, an improving
player would be forever handicapped by his or her history.
So Yardstik embodies a user-specified discount factor. Provided
matches are dated, the discount factor will reduce the significance
of matches according to their age.
Yardstik's memory of distant events gets dimmer with time.
In that respect, Yardstik's treatment of time is much like ours,
and quite different from league-based systems for which history
has very little meaning at all.
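One plausible way to picture the discount factor is as a weight that shrinks with the age of a match. The sketch below assumes an exponential form with a per-year factor. The text above only says that the factor is user-specified and depends on match age, so the function `match_weight`, the parameter `annual_discount` and the exponential shape are illustrative assumptions.

```python
from datetime import date


def match_weight(played_on, as_of, annual_discount=0.5):
    """Assumed weighting: a match loses a fixed fraction of its
    significance for every year of age (exponential decay)."""
    if not 0.0 < annual_discount <= 1.0:
        raise ValueError("annual_discount must be in (0, 1]")
    age_years = (as_of - played_on).days / 365.25
    if age_years < 0:
        raise ValueError("match is dated after the reference date")
    return annual_discount ** age_years


today = date(2023, 1, 1)
w_today = match_weight(date(2023, 1, 1), today)     # played today
w_last_year = match_weight(date(2022, 1, 1), today) # about one year old
w_older = match_weight(date(2021, 1, 1), today)     # about two years old
```

With `annual_discount=1.0` every match counts equally regardless of age, which corresponds to the "forever handicapped" case described above.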
Disjoint groups
Yardstik generates player ratings on the basis of whatever
data it is given. It doesn't require any particular playing
regime and there are few practical limits to the number of
matches it can process simultaneously. It doesn't require
that players belong to the same club or even live on the same
continent. That makes it rather likely that the entire player
population will comprise a number of disjoint groups whose
members can be compared internally, but not with other groups,
because no match has yet been played which might link them.
Yardstik identifies disjoint groups and generates player ratings
that apply within those groups. If two disjoint groups deliberately
organise a tournament between them, then the ratings generated
by subsequent Yardstik runs apply to the merged group.
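Finding disjoint groups amounts to finding the connected components of a graph in which players are nodes and matches are edges. The following sketch uses a standard union-find structure for this. It illustrates the idea only: how Yardstik identifies groups internally is not stated, and the function name and data layout are assumptions.

```python
def disjoint_groups(matches):
    """Group players so that any two players in a group are linked
    by a chain of matches. `matches` is an iterable of (player, player)."""
    parent = {}

    def find(player):
        # Register unseen players as their own group, then walk to the root,
        # shortening the path as we go.
        parent.setdefault(player, player)
        while parent[player] != player:
            parent[player] = parent[parent[player]]
            player = parent[player]
        return player

    for a, b in matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # a match links the two groups

    groups = {}
    for player in list(parent):
        groups.setdefault(find(player), set()).add(player)
    return sorted(sorted(group) for group in groups.values())


club_matches = [("Ann", "Bob"), ("Bob", "Cy"), ("Dee", "Eve")]
before = disjoint_groups(club_matches)

# A single inter-group match is enough to merge the two groups.
after = disjoint_groups(club_matches + [("Cy", "Dee")])
```

Here `before` contains two groups and `after` contains one, mirroring the effect of a tournament played between previously unconnected groups.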
Scoring: "Summary" versus "Detail"
Summary Scoring
Expressing match results as (say) best of 5 games yields
possible scores of 3:0, 3:1, 3:2, 2:3, 1:3, 0:3. Differences
in skill are compressed into six levels — two of which (3:0
and 0:3) we have already recognised to be degenerate. Yardstik
certainly works with such data, but the level of discrimination
is coarse.
Detail Scoring
It would be nice to be able to distinguish a 3:0 result arising
from 9:2, 9:4, 9:2 games from one based on 9:7, 10:8, 9:6
games. This is certainly possible, and one approach would
be to treat the three game results as three independent "best
of 9" matches. Although this usually generates entirely plausible
rating differences, the assumption of independence is not
tenable, and in extreme cases this method can generate erroneous
results. Consider 0:9, 9:7, 0:9, 9:7, 9:7. If these games
are independent, then running on default settings, Yardstik
ranks the away player above the home player. Considered as
a match, the home player has the edge when it really matters
and is therefore considered the winner. Yardstik resolves
this problem by allowing match results to be expressed as
a sequence of games from which it extracts the median result.
In this case that would be a 9:7 home win.
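The median extraction can be sketched in a few lines of Python. The sketch assumes that games are ordered by the home player's points margin and that, for an even number of games, the upper of the two middle games is taken. Neither detail is specified above, and `median_game` is a hypothetical name.

```python
def median_game(games):
    """Return the median game of a match.

    `games` is a list of (home_points, away_points) pairs. Games are
    ordered by home margin (an assumed ordering); ties are broken by
    the home score.
    """
    if not games:
        raise ValueError("a match needs at least one game")
    ordered = sorted(games, key=lambda g: (g[0] - g[1], g[0]))
    return ordered[len(ordered) // 2]


match = [(0, 9), (9, 7), (0, 9), (9, 7), (9, 7)]

# Treated as independent games, the away player scores more points overall...
home_points = sum(h for h, _ in match)
away_points = sum(a for _, a in match)

# ...but the median game is a narrow home win.
result = median_game(match)
```

For the example match the home player totals 27 points against the away player's 39, yet the median game is 9:7 to the home player, which agrees with the 3:2 match result.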