Sunday, 1 April 2012

Rating Randomness

What is the probability of obtaining a rating increase of 400 points or more in one year, purely by chance, with no real improvement?  There are perhaps 7.5 million registered players worldwide, see:

Given such a large number of players, we might expect someone, somewhere, at some time to achieve a 400 point apparent improvement purely by chance.  To estimate the probability of this happening, we need a simple model.  This model need not be completely accurate.  We only need a rough answer here.

For simplicity, I will consider a player whose true rating is the same as the average true rating of his opponents.  His score after an infinite number of games should then be 50%.  I will assume that for every game, a coin is tossed twice.  If we get two heads, our player wins.  If we get two tails, he loses.  If we get one head and one tail, the game is a draw.  If our player plays N games, we toss the coin 2*N times, and his fractional score is the number of times that the coin comes up heads, divided by 2*N.  For a simple account of the mathematics of coin tossing see:

My simple model has some limitations.  The proportion of draws may not be 50%.  A higher proportion of draws will reduce the variability of the score, and vice versa.  (N.B. An increased variability of the score increases the chances of an inaccurate rating.)  Our player’s opponents may not all be of roughly equal strength.  Some may be very strong and some very weak, in which case the very strong players will nearly always win, and the very weak players nearly always lose.  This will reduce the statistical variability of the score.  Nonetheless, the simple model should be good enough for making rough estimates.  These problems are, in any case, ignored by the Elo rating system.
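The coin-toss model is easy to simulate.  The sketch below is my own illustration (the 100,000-trial count and the seed are arbitrary choices): it estimates how often a player of exactly average strength scores 75% or more over 12 games purely by chance.  Note that 75% of 24 half points is 18, so this corresponds to the d = 200, N = 12 case worked out later in the article.

```python
import random

def simulate_score(n_games, rng):
    """Fractional score over n_games under the two-coin model:
    each of the 2*n_games tosses is worth half a point when it lands heads."""
    heads = sum(rng.random() < 0.5 for _ in range(2 * n_games))
    return heads / (2 * n_games)

# How often does a 50%-strength player score 75% or more over 12 games?
rng = random.Random(1)     # fixed seed for reproducibility
trials = 100_000
freak = sum(simulate_score(12, rng) >= 0.75 for _ in range(trials)) / trials
print(f"P(score >= 75% in 12 games) ~ {freak:.4f}")
```

The estimate should come out near 0.011, i.e. roughly 1 in 90, in line with the exact binomial figure below.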

For the USCF version of the Elo rating system, the expected fractional score s is given by:

s = 1 / (1 + 10^(-d/400))

where d is the player’s rating minus that of his opponent.  See:

The table below gives the expected percentage score for rating differences of 100, 200 and 400 Elo points:

 d       s
100    64.01%
200    75.97%
400    90.91%
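The table can be reproduced directly from the formula; a minimal Python check:

```python
def expected_score(d):
    """USCF/Elo expected fractional score for a rating advantage of d points."""
    return 1 / (1 + 10 ** (-d / 400))

for d in (100, 200, 400):
    print(f"d = {d:3d}: s = {expected_score(d):.2%}")
# d = 100: s = 64.01%
# d = 200: s = 75.97%
# d = 400: s = 90.91%
```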

For N = 12 and d = 200, we expect to score 75.97%.  To the nearest integer, we expect 18 heads when we toss the coin 24 times.  The probability of scoring 18 or more half points can be found using the binomial distribution calculator:

The probability that we will receive a rating that is 200 or more points higher than our true rating is 0.01133, i.e. about 1 in 88.  (N.B. The probability that we will receive a rating that is 200 or more points less than our true rating is the same, because of the symmetry between wins and losses. This is easily verified using the calculator.)

For N = 12 and d = 100, we expect to score 64.01%.  We expect 15 heads when we toss the coin 24 times.  The probability of scoring 15 or more half points is 0.1537, i.e. about 1 in 6.5.

For N = 12 and d = 400, we expect to score 90.91%.  We expect 22 heads when we toss the coin 24 times.  The probability of scoring 22 or more half points is 0.00001794, i.e. 1 in 55,741.

The probability that we will receive a rating that is 200 or more points lower than our true rating from the results of 12 games, and a rating that is 200 or more points higher than our true rating over another 12 games is about 1 in 88^2 = 7,744.

For N = 24 and d = 100, we expect 31 heads when we toss the coin 48 times.  The probability of scoring 31 or more half points is 0.02973, i.e. about 1 in 34.

For N = 24 and d = 200, we expect 36 heads when we toss the coin 48 times.  The probability of scoring 36 or more half points is 0.0003586, i.e. about 1 in 2,789.

The probability that we will receive a rating that is 200 or more points lower than our true rating from the results of 24 games, and a rating that is 200 or more points higher than our true rating over another 24 games is about 1 in 2,789^2 = 7,778,521.
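All of the tail probabilities quoted above can be computed exactly from the binomial distribution using nothing but the standard library (my own check, not part of the original calculations):

```python
from math import comb

def p_at_least(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the chance of k or more half points."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 12 games = 24 coin tosses (half points)
print(p_at_least(15, 24))  # d = 100: ~0.1537
print(p_at_least(18, 24))  # d = 200: ~0.01133
print(p_at_least(22, 24))  # d = 400: ~0.00001794

# 24 games = 48 coin tosses
print(p_at_least(31, 48))  # d = 100: ~0.02973
print(p_at_least(36, 48))  # d = 200: ~0.0003586
```

The down-then-up combined odds are then simply the square of the one-sided figure, e.g. (1 / 0.0003586)² ≈ 7.8 million for the 24-game case.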

I have ignored the effect of statistical variations in the opponents’ ratings in these calculations.  These variations will increase the chances of a freak result.  Nonetheless, they tend to average out, and it turns out that we can ignore them to a first approximation (unless the opponents are playing far fewer games).

In my experience, most competitive players play fewer than 24 rated games per season.  For the USCF rating system, 25 games are needed for a full rating.  For 24 games, the odds against a spurious 400 point rating improvement in one year are about the same as the number of competitive players (according to the estimate quoted above).  On that basis, we would expect this feat to be achieved by someone, somewhere, about once a year.  Of course, this achievement would not be at all persuasive, even to the statistically naive, unless the player happened to have no previous track record, and promptly retired from chess.  That increases the odds, but playing fewer games reduces them.

In this article, I have been looking at purely random variations in the player’s results.  In practice the probability of a freak result will be greatly increased by any systematic inaccuracies in the opponents’ ratings (e.g. as a result of geographic variation, rapidly improving juniors, or the treatment of unrated players).  There are also a variety of personal factors that can depress a player’s performance.

Michael de la Maza’s 400 points in 400 days may not be quite what it appears to be. [Michael de la Maza turns out to have played a large number of rated games.  See my next article for an analysis of his results.] Others report suspiciously rapid training results too.  Jeremy Silman said:

“I get hundreds of letters from students worldwide that gain hundreds of points in a few months from reading my strategically oriented books.”  See:

Perhaps these results are not what they appear to be.



  2. Thank you for that link. Michael de la Maza clearly played too many games for random fluctuations in his results to be a significant factor. If there were any other material factors affecting his performance, he has not told us about them.

  3. I could not understand all of the mathematics here, but I will assume they are right. Just a question: did you calculate the probability of the PERFORMANCE rating, or the probability of the new calculated rating (because in the formula the old rating has a huge weight)? The chances of achieving such a performance result might be small. But how about the new rating?
    Here you need a performance of roughly 800 points above your old rating in order to gain 400 points. Unless you did not have an initial rating, which is not true for most organised players. (Only DLM was so "lucky" as to have no record of his old rating, so his rating was his performance. Either there was no record, or it was too old and got lost. I believe he once had a record of approximately 2000 FIDE Elo, but it got lost, or was not recorded in the American USCF rating system.)

    Also, I understood you take the probability of a win as a 50% chance? It is more likely to be 40% for a win, 20% for a draw, 40% for a loss.
    How about the probabilities of 3 wins in a row then?
    0.4 x 0.4 x 0.4 gives a probability of 6.4%,
    compared to a probability of
    0.5 x 0.5 x 0.5 = 12.5%
    (which is almost twice as much as 6.4%).

    But anyway: thanks for the calculation of the probabilities. It means that DLM's success is not a matter of chance, because his success is way too big to be pure "luck". It is much more likely that DLM either got very strong during those 400 days, or that he was already very strong before he started playing chess in tournaments (but didn't have a rating).

    The "several hundred" points increase within few months after reading Jerimy Silmans book(s) are really amazing if I think about that the new rating is usually a mixture derived form the old rating plus the performance. So if the gain is 200 points (minimum to call it "several hundreds", though I would not like to talk of "several hundreds" if I only mean 2x100) that requires approximately a performance that is 400 points higher than the old rating? Impressive.

    But as you said: "perhaps these results are not what they appear to be EITHER".

    1. I have deleted the "either". My syntax was a bit loose there. I did not mean to prejudge the issue.

  4. I have tried to keep things dead simple here. I have assumed that we calculate each rating from a fresh set of results, ignoring all previous results, as was done annually in the old British system. Nowadays, the ECF says that they “may” carry results over from previous seasons when fewer than 30 games are played in the current season. I have not studied the USCF system in any detail, but I would expect that if you play enough games, the old games will not greatly influence the result.

    40% for a win, 20% for a draw, 40% for a loss is closer to one toss of a coin for a win or loss than two tosses of the coin for half a point each, so the statistical variability would be greater. MdlM had only about 10% draws according to the link. His 52 games for the very low rating probably only count as about 30 in my calculation. The odds against that result being 200+ points adrift by chance are not going to be much more than 100 to 1. Chance should not have greatly affected the results for the other two years.
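    To make the variability point concrete, here is a quick check (my own illustration) of the per-game score variance under the two models, both of which have a mean score of 0.5:

```python
def score_variance(p_win, p_draw, p_loss):
    """Variance of a single game's score, where a win scores 1,
    a draw 0.5, and a loss 0."""
    assert abs(p_win + p_draw + p_loss - 1) < 1e-12
    mean = p_win + 0.5 * p_draw
    second_moment = p_win + 0.25 * p_draw
    return second_moment - mean ** 2

print(score_variance(0.25, 0.50, 0.25))  # two-coin model: 0.125
print(score_variance(0.40, 0.20, 0.40))  # 40/20/40 split: 0.2
```

    The 40/20/40 split gives a per-game variance of 0.2 against 0.125 for the two-coin model, confirming that fewer draws mean a more variable score.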

    If you are just looking at MdlM, it is very unlikely that his results were entirely due to chance. The odds against it happening are longer than those against winning the lottery. However, people do win lotteries, despite the long odds, because so many people enter. Nonetheless, there are more dishonest people in the world than lottery winners.

  5. Sorry, not 100 to 1. I was looking at the wrong number. It will be thousands to one. I am a bit tired at the moment!

  6. DLM did participate in a lot of rating tournaments, U1700 and so on. He started 1999 with a very bad performance

    and 2000:

    Well, I did not calculate that performance gain, but it is gigantic.
    This was not 400 points in 400 days; that was 1000 points (performance rating) in one week.

    The only miracle really happening is that people still believe in miracles ;-)

  7. Always fun to see these sorts of ratings analyses. In a real-world setup (not pure theory) I'd expect to see many more of these 400-point gains in the Class D to Class E range, rather than Class C and above, due to the general squishiness (to use the technical term) of those rating categories, because of the prevalence of scholastic and first-time (provisional) ratings, which are not as robust statistically.