2.20 am, Thursday 11 March 2010
Computerised Player Ratings for Two Contestant Games

(such as Squash, Tennis, Badminton, Table Tennis, … )

 Stop Press: This algorithm has been greatly enhanced and updated. A description is available here and a Linux version of the program can be downloaded here.

Ralph Seeley
Pivotal, 1998

Players of tennis, badminton, table tennis, squash, ... can have no measure of their skill since there is absolute nothing absolute against which to measure it. But they can refer to players they have beaten, and the margins by which, and frequencies with which, they did so. So, given that informal comparisons certainly are possible, can the process be made sufficiently rigorous to support comparative player ratings? If such a scheme had a wide enough coverage might it not provide players with something comparable to a golfer’s handicap?

The discussion that follows should be applicable to all two-player games. It is described with respect to squash, since it was that game to which this technique was applied, and for which, the examples apply.

[Squash is an indoor racket game played between two players, generally for a period of 30 to 45 minutes. A game is won by the first player to gain nine points by a clear margin of two. A match is generally decided on the best of five games. It is similar to Badminton in the method of scoring and the exhausting pace at which it is generally played]

Discussion

If two squash players can find one or more other players against whom both have recorded results, then they may be able to estimate their relative ability. On that basis they can decide whether a game is likely to be good fun or merely an embarrassment. It is a very informal means of inter-comparison and it is fraught with practical problems, but it is potentially very powerful. In theory, anybody who has played squash for any period of time should have played somebody — who has played somebody else — ... — who has played a world champion!

Ranking and Ratings

It is therefore possible to conceive of a means of providing every club squash player with a "rating" by which means he or she could be compared with every other player. Using a personal computer it is now relatively easy to derive such a rating scheme for a club of (say) 300 active members. If every club within a county subscribed to a scheme, and if the top players of each club played within a "super league" with associated "super ratings" then a common measure of squash playing ability could be established for an entire county.

Ladders and Leagues

Measures of comparative squash ability already exist of course. Most clubs have a "ladder" or a hierarchy of "leagues". The higher the position "in the leagues" or "on the ladder", the better, supposedly, is the player. For average players "the leagues" are more important because they provide not only a rough ranking scheme, but also a regular programme of matches against players of supposedly similar ability.

Leagues

A club can run 80 or more leagues, for both men and women. Men’s leagues usually admit more ambitious women. Typically, each league comprises 5 or 6 players who are expected to play each other over a period of 6 weeks or so. Each match is played on a "best of 5 games" basis, 10 points being shared between the two contestants according to the result. At the end of a six-week period the points are added up, and the highest scoring player is (typically) promoted two leagues and the second highest player is promoted one league. The lower scoring players are relegated in a complementary manner.

Unfortunately this scheme if often far from satisfactory:

  • At best it provides only a ranking. The degree by which one player is superior to another is not well measured. Since most players tend to be of average ability, a difference in ranking of 25 players (or five leagues) in the middle of a club can still provide for a very close game. The same difference at the top of the same club may separate a first team player from an "also-ran".
  • Playing ability inevitably gets confused with commitment. Irregular players tend to score fewer points and thus get demoted. This corrupts the ranking of course, but more importantly it means that members of lower leagues get a worse deal. They are more likely to find themselves matched against these irregular players. If they get a game they are likely to be "whitewashed"; if they don’t, they miss out on the points and the competition. Either way they lose.
  • The system is very slow to adjust itself. In a large club it could, in theory, take the world champion more than three years to rise from the bottom of the leagues to the top. Less committed and less consistent players, playing in a club with a rapidly changing membership, can (and do) give up before reaching their proper position.
  • The system is very inflexible. To work properly every match must be played. When only two thirds of games are played, (as is often the case) the essentially arbitrary treatment of missed games becomes disproportionately important.

On the credit side, such a system is simple to implement and easy to understand. Its justice may be rough and ready, but theoretically at least, it is unbiased. In practice of course, adding-up, sorting out and writing 300 names (even on pre-prepared forms) is fairly onerous. After special pleading and extraordinary circumstances have been allowed for, the result is usually in error and the system’s rough and ready justice has often been thoroughly compromised.

Statistical Inefficiencies

The inherent weakness of any league system is the extraordinarily modest use that it makes of match results. A set of leagues are calculated every six weeks or so, and they are to taken to represent everything that is worth knowing about comparative player abilities up to that date. Actual match scores are, metaphorically at least, thrown away. A good result against a new player who subsequently proves to be first team material provides no positive evidence as to the loser's ability — in fact it may have caused his demotion.

Alternative: A Player Rating System

Rather than using ladders or leagues, the results of individual matches can be used directly to estimate individual player ratings.

Table A
  John Paul Fred David Geoff
John   3 (8) 3 (6) 3 (8) 3 (10)
Paul 1 (2)   2 (4) 3 (6) 3 (8)
Fred 2 (4) 3 (6)   3 (6) 3 (8)
David 1 (2) 2 (4) 2 (4)   3 (6)
Geoff 0 (0) 1 (2) 1 (2) 2 (4)  

Example: Consider five squash players: John, Paul, Fred, David and Geoff.

Suppose they play each other with the following results as shown in Table A:

Each cell records the number of games won by the player named in column 1 of the row against the player named in row 1 of the column. Thus John won 3 games in his game against Paul. (Since Paul won 1 game in his game against John the match score was 3:1 to John.)

Ten points have been allocated between the two players according to the result of the match. (These are the bracketed and italicised figures). A 3:0 game is allocated 10:0, a 3:1 game is allocated 8:2 and a 3:2 game 6:4. An allocation of 5:5 is given for tied matches (matches that could not be decided in the allocated time). Theoretically there ought to be rules for 2:1 and 1:0 results, but such results are very rare.

This procedure ensures that all contestants score positively — an important social consideration. From the perspective of comparative performance, however, it rather fogs the issue. We can obtain more readily interpreted comparisons either by:

Table B
John Paul Fred David Geoff
John 2 (3) 1 (1) 2 (3) 3 (5)
Paul -2 (-3) -1 (-1) 1 (1) 2 (3)
Fred -1 (-1) 1 (1) 2 (1) 2 (3)
David -2 (-3) -1 (-1) -1 (-1) 1 (1)
Geoff -3 (-5) -2 (-3) -2 (-3) -1 (-1)
  • subtracting five points from each contestant’s score (5 points representing a draw and thus parity). This yields the italicised and bracketed figures in Table B. We call this the rating difference.
  • recording the advantage implied by the number of games won (3:1 scoring +2, 3:2 scoring +1 etc.). This yields the un-bracketed, plain text figures in Table B. We call this the game difference.
Rating difference is clearly not a linear function of game difference and it might therefore be considered an inferior, and rather arbitrary, measure of comparative ability.

Table C: Rating Difference versus Score Difference

Game Difference (GD) Rating Difference (RD) RD per GD
0 0  
1 1 1
2 3 1.5
3 5 1.67

In fact there is logic for narrowing the differential for 3:2 games. Drawing a squash match requires an unusual combination of endurance and consistency. An arbitrary lapse — either way — is far more likely. A 3:2 result need not, therefore, signify much difference in ability. For this reason, and because this allocation of 10 points is an established and credible one, we have adopted it as our fundamental inter-player comparative measure.

If we look at the top line of Table B we can see that John is "better" than Paul, Fred, David and Geoff to the extent of 3, 1, 3 and 5 respectively. If any of these scores were negative there might be scope for doubt, but in this case John is clearly the best player of our group of five.

Table D
John 5
Paul 2
Fred 3
David 1
Geoff 0

We might also like to know to what extent John is "better" than Paul, Fred, David and Geoff and similarly how Paul relates to Fred, David and Geoff; Fred to David and Geoff; David to Geoff etc.

Suppose we could allocate a "rating" to each of John, Paul, Fred, David and Geoff so that we could forecast the extent to which John would beat (or lose to) Fred in an actual match by subtracting John's rating from that of Fred's.

Consider, for example, the test ratings shown in table D:

If we subtract John's rating from Paul's we get: 5-2 = 3. Similarly, subtracting Paul's rating from Fred's, we get -1 (2-3).

Table E: Predictions

  Rating John Paul Fred David Geoff
John 5   3 (3) 2 (1) 4 (3) 5 (5)
Paul 2 -3 (-3)   -1 (-1) 1 (1) 2 (3)
Fred 3 -2 (-1) 1 (1)   2 (1) 3 (3)
David 1 -4 (-3) -1 (-1) -2 (-1)   1 (1)
Geoff 0 -5 (-5) -2 (–3) -3 (-3) -1 (-1)  

We can use these ratings to predict the results of games in table B. We calculate the rating difference only. The actuals results (from table B) are the italicised and bracketed figures.

By inspection we can see that the ratings of table D are not perfect, but that (in this case) they have always predicted the winner correctly.

Rating accuracy may be estimated (in table F) by calculating a table of "errors" obtained by subtracting the predicted scores of table E from the actual scores of table B and then ignoring the sign.

Table F: Errors

  John Paul Fred David Geoff
John   0 1 1 0
Paul     0 1 1
Fred       0 0
David         0
Geoff          

By calculating the average of these errors we obtain some measure of the quality of the ratings.

In this (synthetic) example, the ratings have predicted 10 games with total error of 4: i.e. the average predictive error is 0.4.

Real life errors are generally larger than this. There are many reasons. For a start, regardless of who we might be playing, our own performance is not consistent. But there are more serious issues. Assuming that squash ability can be measured along one axis is questionable. So is the assumption that these Squash ratings are transitive: if A is 2 "points" better than L, and L is -1 "points" better than B, is it true that A is 1 "point" better than B? Or would a comparison of A and B via M yield a different result? Perhaps A plays a faster, harder game with L than he is allowed to play with M.

Having got some of the caveats out of the way, we can proceed. Poor performance will be indicated by high predictive errors. Individual anomalies should correct themselves after a few new results. At least the process and its effectiveness is explicit. Many other comparisons of ability have very little objectivity, no clear methodology and no appeal.

The Algorithm

Given any set of results we could, by some means or other, estimate a set of ratings and then assess their quality by performing the calculation described above. We would have to repeat the procedure until we could be reasonably sure that no one — particularly an interested party — could surprise us with a set of ratings yielding a smaller overall error. However this would be a tedious and unsatisfactory exercise. Clearly such a scheme is only going to be practical if some method of calculation can be found which can be guaranteed to find the best set of ratings.

Fortunately mathematical algorithms do exist that will yield a set of ratings which cannot be improved. This does not mean that no other set of ratings will be as good — in general this cannot be guaranteed — but it does mean that no other set of ratings can be found with a lower total error. We can guarantee a best set, not the best set, of ratings

The example supposed an ideal environment, a competition among five players, in which all possible matches are played. The proposed algorithm is not as restrictive however — no special match organisation is required at all — the method only uses the evidence of the matches that were actually played. Irregular players may be retained or dropped as the club wishes — they are no longer a serious problem. An injured player may retire for a few months for example — and still retain his rating. Alternatively, since it is no longer crucial that all matches be played, the leagues could be increased in size and less enthusiastic members tolerated.

The Mathematics

The obvious approach is to use the games that were played to estimate the ratings by minimizing the total discrepancy between the results as played and the results as predicted.

Using R(i) to mean player i's rating we have:

  R(i) - R(j) + Error(i,j) = G(i,j) where: G(i,j) = Result of the game between i & j
  G(j,i) = -G(i,j)  
  for all i,j player combinations
where games were played
 

If we can calculate the R(i)'s to minimise the total of the Error(i,j)'s we should be close to our goal. We are not quite there however. The measure of error must be independent of sign; we cannot have negative errors cancelling out positive ones.

The 3:0 "whitewash" issue

We must also recognise the problem posed by "whitewashes". If A beats C by 3:0 this does not mean, even in our terms, that "A is 5 points better than C" but merely that he is at least 5 points better. If A goes on to beat B by 3:0 and B subsequently beats C by 3:0 then a rating differential of 10 is far more appropriate. If this point is neglected it is possible for the poorest club members to enhance their ratings simply by losing sufficiently frequently against the club's best players.

We can now formalise the model:

To minimise  
  S a (i,j)*EPos(i,j) + b (i,j)*ENeg(i,j)  
where:    
  b (i,j) = 0 if G(i,j) = 5 (i.e. i beat j by 3:0)
  b (i,j) = 1 otherwise
  a (i,j) = 0 if G(i,j) = -5 (i.e. j beat i by 3:0)
  a (i,j) = 1 otherwise
Subject to:  
  R(i) - R(j) + EPos(i,j) - ENeg(i,J) = G(i,j)  
  EPos(i,j) >= 0  
  ENeg(i,j) >= 0  

This defines the R(i) comparatively, but not absolutely. To eliminate the remaining degree of freedom we must add a further constraint. We could consider requiring the R(i) to have a predetermined average value of (say) 100:

  ( R(1) + R(2) + R(3) + ... + R(m) ) / m = 100  

But this might violate the convenience of working with integer R(i). In practice we can give an arbitrarily fixed value to any single R(i) and determine the remaining R(i) with respect to it. (In fact we may be required to assign more than one arbitrary value. See "Groups: disjoint competitions")

The Practical Problem

This might look daunting, but in fact we have formulated the problem as a standard Linear Programming model (an "LP" model). That means that we can use standard LP software to solve it. It may be too large for primer or tutorial level software, however. There are as many rows as there were games played and the number of columns is equal to twice the number of games played plus the number of players involved in those games.

For example: 275 players each playing 4 games (in 55 leagues of 5 players) implies that 550 games are played every 6 weeks or so. Furthermore we most process at least two sets of league results simultaneously to get the necessary inter-league connectivity. In practice, and in the examples that follow, 3 successive sets of results have been processed simultaneously. With missed games working out at 30% to 40%, this yields an LP of ~1000 rows and somewhat less than 2300 columns.

The Practical Solution

In fact problems considerably larger than this can be solved without recourse to heavyweight hardware or full-blown LP software. The first and most important step in achieving this, is reformulation of the problem:

  • Using LP’s concept of "duality" the model can be "turned on its side". Rather than a minimisation problem involving some 1000 rows and 2300 columns, it becomes a maximisation problem involving 1000 variables or columns (one per game) each subject to an upper and lower bound whose value depend on whether or not the game was won 3:0. In this example, there are 290 remaining constraints (or rows) corresponding to the players who played at least once during the three sets of leagues analysed.
  • The second step is to recognise this as a Network Flow or Transportation model. Network Flow models represent an important sub-class of the standard LP model for which some very useful solution economies are possible:
  • The problem can be solved entirely in the integer domain, thus avoiding slower floating point operations and also problems of accumulating rounding error.
  • Specific Network Flow Programming algorithms exist which represent the problem structure as a network (rather than LP’s more general, but more cumbersome, product matrix representation).
  • By using a Network Flow Programming algorithm we are able to solve problems such as this one very quickly and with quite modest equipment. In fact the first solutions to this example pre-date the IBM personal computer. They were done within 64K bytes of RAM under CP/M

Figure G is a screen dump of the optimisation engine at work. Before computers became so powerful, this screen would show the progress towards achieving feasibility and then optimisation, dynamically. Using a 233MHz Pentium II it still does, but it is all over too quickly to be visible.

An optimum solution is reported when no further improvement is possible.

Fig G

NETWORK FLOW PROGRAMMING ALGORITHM:
Version 7.0, 8:10:1990 (C++: May 5 1995)
File Name: c:\testm
NODES: 288 ARCS: 1008
Data input complete after 0 minutes and 0.11 seconds
Artificial basis generated after 0 minutes and 0.11 seconds
Feasible Solution found after 0 minutes and 0.11 seconds
Optimum Solution found after 0 minutes and 0.22 seconds
Solution Cost: -1436
Solution file written after 0 minutes and 0.33 seconds
ALGORITHM STATUS: Complete
- ARC EXCHANGE - POTENTIAL
ITERATION IN: OUT: CHANGE
2581 -(210->212) +(84->210) 1
DIAGNOSTICS:
ITERATION STATUS DESCRIPTION
PRIMAL COMPLETE Iterations: 2668 (No flow change in: 2201)

The Results

The optimisation complete, it is necessary to merge the information back into the members’ file. Although calculating ratings is the most challenging aspect of the program, an obvious by-product of processing match results is the ability to identify dormant or less active players. If player inactivity is no longer penalised by demotion, a club must decide when a player has effectively voted himself out of club-organised competition.

As a result of merging the new ratings into the members’ file it is possible to produce a listing like Table H.

The members are sorted in descending order of their ratings: best at the top, worst at the bottom. Below the active player listing are the names of dormant players whose ratings have been retained.

Table H

No. of Players: 288  
No. of Games: 1008
No. of Groups: 6

Average Rating Error: 1.42 (per game)

 
Name Member Number Group Rating Omissions Tel. No.
           
Steve Green 1222 6 124 2  
Peter Blundell 1170 6 124 0  
Brian Wagstaff 1082 6 122 3  
Ted Oakley 1421 6 122 6 01123 80054
Karl Walker 1219 6 122 5  
Peter Cassidy 1035 6 122 0  
Eddie Cawthorne 1311 6 121 2  
John Grundy 1068 6 120 0  
Paul Oakley 1013 6 120 4  
G Wilby 1014 6 119 1  
Dave Rothwell 1136 6 118 1  
Greg Roworth 1120 6 118 0 01234 66949
Allan Jackson 1443 6 117 4  
Brian Crennell 1370 6 117 6  
Robert Dixon 1295 6 117 1  
Mike Tarbuck 1269 6 117 1 01345 62359
Dave Wallis 1132 6 117 2  
Harold Godwin 1066 6 117 5  
David Stanworth 1034 6 117 1  
Steve Orrell 1402 6 116 3  
Dave Thomas 1266 6 116 4  
Dave O’Byrne 1514 6 115 1 01456 67337
·          
·          
·          
Julie Birch 1089 6 77 1  
Kurt Boland j140 6 72 6  
Daniel Lawrenson j103 6 72 3  
Martin Mitchell 1150 6 72 6  
A Noble 1337 5 100 2  
Gordon Greaves 1840 4 100 6  
Keith Fishwick 1561 2 100 8  
           
Dormant Members          
Rob Jamieson 1287 6 124 2  
Colin Benson 1576 6 122 2  
Iain Halliwell 1016 6 121 0  
Jim Garner 1174 6 116 5  
·          
·          
·          
Ken Pratt 1110 6 87 2  
Arvon Ginley 1420 6 85 2  
P Briggs 1330 3 100 4  

Table H has been abbreviated and mildly censored. It has been abbreviated by omitting the vast bulk of average players — those with ratings between 115 and 77 – and censored by modifying names and telephone numbers slightly.

Most elements of this output should be self-explanatory: "Name", "Tel. No.", and "Rating" are what they seem. The "Member Number" is an unambiguous player identifier. "Omissions" is a count of un-played games. Since ratings are calculated only on the basis of played games, one of the obvious remaining sanctions against players who consistently fail to play their games is to make the fact public.

The Average Rating Error provides justification for the whole exercise. The value of 1.42 given here suggests that a 3:1 prediction might well be a draw, but that the predicted winner of a 3:0 match is unlikely to lose.

Groups: disjoint competitions

The third column, titled "Group" is very important. A group is a set of players linked in a tree-like network by some of the matches in which they played. The matches that are chosen to make this structure are those that the algorithm considers to provide the most accurate representation of their comparative ability. If, by accident or design, two (or more) groups exist without any "bridging" matches then their respective ratings sets cannot be compared. The club network is then not so much a tree as a shrub.

In practice significantly-sized secondary groups are unusual, but since the possibility exists, it must be treated. Although six groups are present in Table H, only one of them (group 6) is other than degenerate. Particularly badly planned matches can cause non-degenerate secondary groups to occur. Most obviously, allowing (or requiring) that all new joiners start at the bottom aggravates this risk. Once a league comprises new members only, then those members’ initial ratings will relate to each other but not to the club as a whole.

It is a simple matter to reformat the output from Table B into conventional league tables like Table I.

Table I

Mem #: Player
Rating, ()
Tel. No.
0222 0170 0082 0421 0219  
0222: Steve Green,
124, (2 6)
01345 45678
           
0170: Peter Blundell,
124, (0 6)
 
           
0082: Brian Wagstaff,
124, , (3 6)
 
           
0421: Ted Oakley,
122, (6 6)
01456 80054
           
021: Karl Walker,
122, (5 6)
01234 64295
           
(Write In)
 
 
           

Analysis and Inspection

Examination of this example output is encouraging. The solutions make sense; players that were recognised to be good are given higher ratings.

It is also possible to consider these ratings from a statistical perspective. The average value is, of course, completely without significance. It is a matter of deliberate choice. 100 seems to provide a good range without any serious likelihood of awarding negative ratings.

The distribution of squash ability — as represented by ratings and calculated from the example data — is shown in the histograms in Charts J and K. The ratings for men and women are independent of each other. The histogram of women’s ratings has a fairly standard, symmetric "bell shaped" distribution. By contrast the men's distribution seems to be truncated to the left. This may be a genuine reflection of the way male squash skills were distributed or it may suggest a boundary below which public competition was considered embarrassing.

Further Developments

Currently all games are equally weighted, regardless of date. Including the results of older games is particularly valuable when it provides comparative evidence that would not otherwise be available. Using older match results would be more justifiable if they were allocated smaller weights.

If this algorithm is able to generate stable, non-contentious ratings, then it could become a victim of its own success. Stable ratings might not only yield repetitive playing schedules, but disjoint groupings as well.

Since ratings need no longer be a function of league position there would be merit in inducing some limited, but deliberate variation in player schedules.

A secondary problem might therefore be to generate a match schedule for each player, so as to provide him or her with four new matches against players with whom he or she has not played recently. The current ratings of the proposed contestants should be within a reasonable range (say ± 7 rating points) and evenly balanced between expected wins and expected losses.

Conclusion

Calculating player ratings for two contestant games such as Squash is an eminently practical proposition that can confer significant advantages. The worked example could be scaled up to meet the needs of the largest single club, or even networks of clubs.

Although the example describes a system in which the optimisation algorithm was central, in a practical implementation most of the data could be taken from, and returned to, an existing membership database system. If match results were not recorded electronically however, systems such as this certainly encourage it.

 Stop Press: This algorithm has been greatly enhanced and updated. A description is available here and a Linux version of the program can be downloaded here.


References:

  1. Beale, E. M. L. "Mathematical Programming in Practice." Sir Isaac Pitman & Sons, London 1968
  2. Wagner, H. M. "Principles of Operations Research." Prentice- Hall International Inc., 1975
  3. Kalan, J. E. "Aspects of Large-Scale In-Core Linear Programming." Proceedings of the 26th A.C.M. Chicago, Ill, Aug '71 pages 304-313.
  4. Jenson, P.A. & Barnes J.W. "Network Flow Programming",Wiley 1980
  5. Barr R, Glover F & Klinkman D "Enhancements of Spanning Tree Labeling Procedures for Network Optimization", Research Report CCS 262, University of Texas, 1976
 
© 2005 Ralph Seeley sitemap Designed by Ralph Seeley