Friday, 25 January 2019

World Champions and bad(?) stats

"The Rating of Chessplayers, Past and Present" by Arpad Elo is basically the textbook for the FIDE rating system. Written in 1978, it describes how the Elo system was developed and looks at the ratings of players from when it was written, and before.
One section of the book is about estimating the ratings of players from the 1850s and '60s, when there was of course no rating system to speak of. Based on his calculations, he estimated Morphy's strength at 2695, with Anderssen the second highest at 2552.
As an exercise I attempted to apply the same process to all World Champions (as listed in this post), based solely on results between each other. In doing so I came across some interesting statistics, as well as drawing the conclusion that the method Elo used may not have been entirely accurate.
The method he used was to take a set of results between a group of players and, based on each player's percentage score, find their D(P). (This is the number of points added to the average of the opponents' ratings to give an estimated performance rating for the player.) He gave every player a starting rating of 500 and then added each player's D(P) to calculate a new rating. He then averaged each player's opponents' new ratings (weighted by the number of games played), and added the D(P) to this, to get a 'better' rating. After a few iterations the ratings stabilised, which Elo considered 'remarkable'.
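The procedure above can be sketched in a few lines of Python. This is my reading of the method, not code from the book, and instead of Elo's printed D(P) table it uses the continuous formula behind it, D(P) = 200√2 · Φ⁻¹(P) (Elo models each performance as normal with σ = 200, so a rating difference has spread 200√2). The formula gives 72 for a 60% score and about -148 for 30%, close to the table values quoted later.

```python
from statistics import NormalDist

# Spread of a rating *difference* in Elo's normal model: 200 * sqrt(2).
SIGMA_DIFF = 200 * 2 ** 0.5  # ~282.84

def dp(score: float) -> float:
    """Rating margin over the opposition average implied by a fractional score."""
    return SIGMA_DIFF * NormalDist().inv_cdf(score)

def iterate_ratings(games, start=500.0, rounds=5):
    """games maps player -> list of (opponent, games_played, points_scored).
    Each round replaces a rating with the games-weighted average of the
    opponents' current ratings plus the player's fixed D(P)."""
    margins = {}
    for p, opps in games.items():
        n = sum(g for _, g, _ in opps)
        margins[p] = dp(sum(s for _, _, s in opps) / n)
    ratings = {p: start for p in games}
    for _ in range(rounds):
        ratings = {
            p: sum(ratings[o] * g for o, g, _ in opps)
               / sum(g for _, g, _ in opps) + margins[p]
            for p, opps in games.items()
        }
    return ratings
```

With everyone starting on 500, the first round simply adds each player's D(P) to 500, exactly as described above.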
I'd have to agree with him in this case, as I've never been able to replicate this result with other data. One issue he doesn't mention is that the D(P) values across a pool aren't symmetrical. An example: in a 3-player group, Player A scores 7/10 against Player C. Player B also scores 7/10 against Player C. Players A and B score 5/10 against each other. A and B each score 60%, which is a D(P) of 72. Player C scores 30%, which is a D(P) of -149. So the first iteration has A and B rated 572 and C rated 351. After the next iteration A and B are rated 533.5 while C is rated 423. Iteration 3 has A and B at 550.3 and C at 384.5. At first glance the ratings look like they are heading in the right direction (albeit by overshooting and undershooting), but at each iteration the average rating of the pool decreases slightly. This is because the D(P)s of A and B add up to +144, while the D(P) of C is -149. So each iteration the pool as a whole loses 5 points, or about 1.67 points per player (5 divided by the 3 players).
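The drain is easy to reproduce. The sketch below hard-codes the two D(P) values quoted in the example (+72 and -149) and prints the ratings and the pool average after each round; the average falls by exactly 5/3 per round.

```python
# The three-player example above: A and B score 60% (D(P) = +72),
# C scores 30% (D(P) = -149). Every pair played 10 games, so each
# player's two opponents are weighted equally.
margins = {'A': 72.0, 'B': 72.0, 'C': -149.0}
opponents = {'A': ['B', 'C'], 'B': ['A', 'C'], 'C': ['A', 'B']}

ratings = {p: 500.0 for p in margins}
for i in range(3):
    ratings = {p: sum(ratings[o] for o in opponents[p]) / 2 + margins[p]
               for p in margins}
    pool_avg = sum(ratings.values()) / len(ratings)
    print(i + 1, {p: round(r, 1) for p, r in ratings.items()}, round(pool_avg, 2))
# Pool average after rounds 1-3: 498.33, 496.67, 495.0 -- down 5/3 each time.
```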
I first encountered this bug years ago when trying to set up starting ratings for a group of unrated junior players. I re-encountered it when trying to compare the performance of World Champions using their results against each other. The more iterations I ran, the lower the ratings became, which wasn't what I was looking for at all.
To get something sensible out of the experiment, I had to add a scaling factor. At the end of each iteration I calculated the net loss of rating points and then 'gave' it back to each player. This kept the average rating of the pool at 500, and at least made the results look sensible. However, my lack of statistical training makes me wonder whether I have introduced another error into the calculations.
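One way to implement that give-back step (my reconstruction, not the author's actual code): after each iteration, shift every rating by whatever amount restores the pool average to 500. On the three-player example this pinned iteration actually converges rather than draining away.

```python
# Same three-player example, D(P) values from the text (+72, +72, -149).
margins = {'A': 72.0, 'B': 72.0, 'C': -149.0}
opponents = {'A': ['B', 'C'], 'B': ['A', 'C'], 'C': ['A', 'B']}

ratings = {p: 500.0 for p in margins}
for _ in range(50):
    ratings = {p: sum(ratings[o] for o in opponents[p]) / 2 + margins[p]
               for p in margins}
    # Hand the net loss back equally, pinning the pool average at 500.
    correction = 500.0 - sum(ratings.values()) / len(ratings)
    ratings = {p: r + correction for p, r in ratings.items()}
```

For this pool the ratings settle around 549 for A and B and 402 for C, with the average held at exactly 500.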
As for the actual results, Kasparov comes out with the highest rating at 565 (add 2250 to this to give a kind of rating typical of today's top players). This isn't particularly surprising, nor is Steinitz bringing up the rear with a rating of 354.
What is a little more surprising is that between Fischer (527) and Kasparov sit Carlsen (548), Karpov (538), Anand (530) and Kramnik (541). The scores for Anand and Kramnik surprised me a little, although the fact that they played lots of games against Kasparov and Karpov may have lifted their ratings.
For the older World Champions, Lasker (500) and Capablanca (512) are slightly ahead of everyone up until Fischer, while the run of Soviet champions from Botvinnik to Spassky is tightly bunched in the 489 to 499 band (probably because they played each other a lot, and drew a lot of those games).
What does this prove? Probably nothing. The data set makes no distinction between the ages of the players (e.g. Anand has games against Smyslov, Tal and Spassky), so results from when a player was improving are weighted the same as those from the end of their career.
As an exercise in calculation (and programming) it was an interesting one, but until I check my working, it may not be an accurate one.


Anonymous said...

Those interested in comparative ratings could check out Edo Historical Chess Ratings and Chessmetrics.

Mark Patterson, Esq. said...

Back in those days when your model drifted off track you just had to fudge with a pencil. The methods of the time normalised.