Empirical Rabbit: How Many Rating Points Is That?

I have been asked how many rating points I have improved. As I have said, unless you make a very large improvement, it is not possible to estimate your rating improvement with a useful degree of accuracy from the results of a practicable number of games. For most players, year-to-year rating variations are mostly random. However, given the large volume of data that I now have, I can roughly estimate the number of rating points by which I have improved at spotting simple tactics quickly.

My first clue on how to estimate my improvement was the rating system used by Chess Tactics Server (CTS). With CTS, the problems are given ratings and treated as opponents. Solving a problem quickly counts as a win for the user, and a failure or a slow success counts as a loss. For a correct solution, CTS assigns the user a score between 0 and 1, depending on the time spent solving the problem:

See: http://chess.emrald.net/time.php. I approximated the CTS scoring graph shown there with exp(-0.099021*(t-3)) for t>3:

Smoothed CTS Scoring Function

This graph closely tracks that used by CTS. The CTS link above also says that the time for which the result is 0.5 is extended at the rate of 1 second for each 20 Elo points difference between the rating of the problem and the user. I used a fixed 30 second time limit in the Bain, Woolum and CHP experiments, so the unmodified graph looks appropriate. Extending the graph for higher rated problems is highly questionable anyway. It would make no sense to give me extra time on the clock in a game against a stronger opponent, and then fail to take this into account when working out my rating! The precise shape of the scoring graph does not appear to matter very much. I get similar results if I score 1 whenever I get the solution in under 5 seconds and 0 otherwise. Other tactical servers use very different graphs, e.g. see: http://www.chess.com/tactics/help.html#rating and http://chesstempo.com/user-guide/en/tacticRatingSystem.html#blitzRating.
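The two scoring rules mentioned above can be sketched in Python as follows. This is only an illustration: the function names are made up, and two details are assumptions rather than statements from the text, namely that a correct answer within 3 seconds scores a full 1, and that anything beyond the 30 second limit scores 0.

```python
import math

DECAY = 0.099021         # decay constant from the approximation in the text
FULL_SCORE_SECS = 3.0    # ASSUMPTION: full score for answers within 3 seconds
TIME_LIMIT_SECS = 30.0   # fixed limit used in the experiments


def smoothed_score(seconds, solved=True):
    """Score in [0, 1] for one problem under the smoothed CTS-style curve."""
    # ASSUMPTION: a failure, or a solution after the time limit, scores 0.
    if not solved or seconds > TIME_LIMIT_SECS:
        return 0.0
    if seconds <= FULL_SCORE_SECS:
        return 1.0
    return math.exp(-DECAY * (seconds - FULL_SCORE_SECS))


def threshold_score(seconds, solved=True, cutoff=5.0):
    """Cruder alternative: 1 for a solution in under `cutoff` seconds, else 0."""
    return 1.0 if solved and seconds < cutoff else 0.0


if __name__ == "__main__":
    for t in (2, 5, 10, 20, 30):
        print(t, round(smoothed_score(t), 3), threshold_score(t))
```

With this decay constant the smoothed score passes through 0.5 at 10 seconds, since 0.099021 * 7 is very nearly ln 2.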

How do we convert these scores into rating points? My clue here was the calculation used by the English Chess Federation (ECF) rating system. In this system, your rating is calculated by adding 50 points to your opponent’s rating if you win, adding nothing to it if you draw, and subtracting 50 points if you lose. Your rating for next year is then the average of these values for this year’s games. There are some refinements to this system that need not concern us here, see: http://en.wikipedia.org/wiki/Chess_rating_system.
With my simplifications, your new rating is the average rating of your opponents, plus a rating difference, which is the average of 50 points whenever you win, no points whenever you draw, and -50 points whenever you lose. Of course I do not know the average rating of the problems that I am solving, but this cancels out when we calculate rating differences. (N.B. The ratings of the problems will depend on the time limit that I impose. If I reduce the time limit, the problems become more difficult to solve within that time limit, and their ratings will therefore be higher.) We can find my score for each problem in a problem batch using the graph above, work out my average score, multiply it by 100 and subtract 50, to give an ECF rating point difference. We can convert this to Elo points by multiplying by 8. Here are my results for the first pass through each batch for the Bain Experiment:

Bain: Rating Difference vs. Problems Learned

The horizontal axis of this graph is the number of problems learned, and the vertical axis is the rating difference, calculated as above. The red dots represent the rating differences for each of the problem batches, and the green line is the least squares best fit to the data. Each red dot represents my average score for 65 problems, and is positioned at the mid point of these problems on the horizontal axis. It is reasonable to assume that my improvement started with the first problem that I learned and continued until the last problem, so I have extended the line to the first problem in the first batch and the last problem in the last batch. The graph suggests that my ability to spot simple tactics (very simple tactics in the case of Bain) quickly improved by about 300 Elo points in the Bain Experiment. (This improvement was in my ability to solve problems that I had never seen before, not the problems that I was practicing.)
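The conversion from per-problem scores to a rating difference, as described before the graph, amounts to a couple of lines of arithmetic. Here is a minimal Python sketch; the function names and the sample scores are hypothetical and are not data from the experiments.

```python
def ecf_difference(scores):
    """Average score in [0, 1] mapped to an ECF difference in [-50, +50]."""
    if not scores:
        raise ValueError("need at least one score")
    average = sum(scores) / len(scores)
    return 100.0 * average - 50.0


def elo_difference(scores):
    """ECF difference converted to Elo points using the factor of 8."""
    return 8.0 * ecf_difference(scores)


if __name__ == "__main__":
    # HYPOTHETICAL per-problem scores for one small batch.
    batch = [1.0, 0.82, 0.0, 0.55, 1.0, 0.31]
    print(round(ecf_difference(batch), 1), round(elo_difference(batch), 1))
```

An average score of 0.5 gives a difference of zero, and every extra 0.01 of average score is worth 1 ECF point, or 8 Elo points.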

For the Bain Experiment, I removed all the problems that were exact duplicates, but many near duplicates remained. The remaining level of pattern duplication still looks larger than that in tactics randomly selected from real games, or indeed from a large collection of problem books. However, my pattern matching model puts the level of remaining pattern duplication in Bain at about 40%, and the level of pattern duplication in Woolum at about 30%. The duplication in Bain is more blatant and annoying than in Woolum, but perhaps it is not as bad as it appears. Nonetheless, any excess pattern duplication in Bain will show up as a spurious improvement on this graph. See my earlier article Tactics Performance Measurement for further discussion. Here are the corresponding results for the Woolum Experiment:

Woolum: Rating Difference vs. Problems Learned

This graph is less dramatic, but roughly 100 points in 42 days still looks impressive! The drop from +200 at the end of Bain to 0 at the start of Woolum suggests that Woolum is about 200 points harder than Bain. However, I believe that this drop is partly a reflection of the larger number of patterns sampled by Woolum. (I would not have done as well on new patterns, even if the problems containing them were no harder.) Here are the results for Heisman + Pandolfini from the CHP Experiment:

Heisman+Pandolfini: Rating Difference vs. Problems Learned

I again appear to have improved by roughly 100 points in 42 days. The drop of about 80 points from the end of Woolum to the start of Heisman + Pandolfini suggests that Heisman + Pandolfini is about 80 points harder than Woolum.

How accurate are these numbers? Of course, these graphs are just estimating my performance improvement at spotting simple tactics quickly, not my improvement at the game as a whole. The numbers here are also subject to random variation. For Bain, my estimated rate of progress is about three standard deviations above zero (according to the standard formula based on the least squares residuals). For Woolum, it is about two standard deviations. For Heisman + Pandolfini, the standard formula puts my estimated rate of progress at 1.2 standard deviations. The larger scatter on this graph appears to be due to chance variations in my performance. Nonetheless, I cannot claim a good level of accuracy for this problem set.
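For readers who want to reproduce this kind of check, here is a minimal Python sketch of a least squares line fit together with the standard error of its slope. It assumes that the "standard formula" is the usual one, with the residual variance taken on n - 2 degrees of freedom. The batch values in the usage section are hypothetical and are not the figures behind the graphs; only the batch size of 65 comes from the text.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (slope, intercept, slope_std_error). The standard error is
    sqrt(residual variance / Sxx), with n - 2 degrees of freedom.
    """
    n = len(xs)
    if n < 3 or n != len(ys):
        raise ValueError("need at least three (x, y) pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ssr = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    slope_std_error = math.sqrt(ssr / (n - 2) / sxx)
    return slope, intercept, slope_std_error


if __name__ == "__main__":
    batch_size = 65
    # HYPOTHETICAL Elo differences for six batches, for illustration only.
    elo_diffs = [-60.0, -30.0, 40.0, 50.0, 130.0, 150.0]
    # Each point sits at the midpoint of its batch on the horizontal axis.
    midpoints = [batch_size * (i + 0.5) for i in range(len(elo_diffs))]
    slope, intercept, std_error = fit_line(midpoints, elo_diffs)
    # Extend the line from the first problem to the last one.
    total_gain = slope * batch_size * len(elo_diffs)
    print("gain over all problems:", round(total_gain))
    print("slope / standard error:", round(slope / std_error, 1))
```

The ratio of the slope to its standard error is the "number of standard deviations" by which the estimated rate of progress differs from zero.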

Can we just add my improvements together? That would be too optimistic. My pattern matching model suggests that the patterns in Bain were selected from a pool about a third as big as that for Woolum. This suggests that my 300 point gain for Bain would be diluted to about a 100 point gain for the Woolum problem set. (My pattern matching model also suggests that I learned about 200 patterns from Bain, and about 300 from Woolum, so a 100 point gain looks reasonable from this point of view.) It is also possible that my improvement at Woolum might not be fully reflected in my improvement at Heisman + Pandolfini. There are many uncertainties here, but an overall improvement of 200-300 points looks likely for solving problems at this level. [See my later article Rating Points Revisited for the improvements that I made to this method.]

2 comments:

AoxomoxoA wondering, 1 August 2011 at 02:35
ECF is OK if the ratings are "close" together, but these puzzles were "easy" for you, so your score was always far from 50%. I think the best rating system is Glicko, but the "RDs" are unknown. So the Elo system is the best available rating system (imo). See http://en.wikipedia.org/wiki/Elo_rating_system#Mathematical_details

To get a stable rating, Ra' needs to equal Ra, so Sa needs to equal Ea
(Not surprising: the expected score Ea needs to equal the real score Sa)

(Score = 100% = 1 if your performance was the best possible
Score = 0% = 0 if your performance was the worst possible)

If you had at the beginning of your training a score S1 then your rating was?

1/(1+10^((delta R1)/400))=S1
1+10^((delta R1)/400)=1/S1
10^((delta R1)/400)=1/S1-1
(delta R1)/400=log(1/S1-1)
delta R1=400*log(1/S1-1)

so:

If your score was 50%=0.5 then delta R1 was 0 the Opponent ( the Problems ) had the same rating.

If your score was 75%=0.75 then delta R1 was ~ -190, meaning the problems had ~ Elo 190 less than you.

If S2 is the Score after the Training then
delta R2=400*log(1/S2-1)

and your Rating gain by the training is

Gain = delta R1 - delta R2 = 400*log(1/S1-1)-400*log(1/S2-1)
----------------------------------------------

this is still "easy" to calculate ( but far from linear )
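
The derivation in this comment can be written as a short Python sketch. The function names are hypothetical, and the logarithm is taken to be base 10, which is what the Elo expected score formula uses.

```python
import math


def rating_gap(score):
    """Problem rating minus solver rating implied by an average score.

    Inverts the Elo expected score formula S = 1 / (1 + 10 ** (gap / 400)).
    """
    if not 0.0 < score < 1.0:
        raise ValueError("score must lie strictly between 0 and 1")
    return 400.0 * math.log10(1.0 / score - 1.0)


def rating_gain(score_before, score_after):
    """Solver's gain against the same problems: delta R1 - delta R2."""
    return rating_gap(score_before) - rating_gap(score_after)


if __name__ == "__main__":
    print(round(rating_gap(0.5), 1))
    print(round(rating_gap(0.75), 1))
    print(round(rating_gain(0.5, 0.75), 1))
```

Scores of exactly 0 or 1 are rejected because the formula gives an infinite rating gap there.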

Geoff Fergusson, 1 August 2011 at 06:47
Yes, I should get 100% against Bain with a standard time limit, but I had a big time handicap here! The scores for the best fit to my six data points for Bain ranged from 0.411 to 0.713, so my scores were reasonably close to 0.5. (0.5 corresponds to a rating difference of zero.) For Heisman + Pandolfini they ranged from 0.525 to 0.613. Yes, I have extended the line a little in either direction, but that does not make too much difference. (I have added a note to the text to say that the rating of the problems depends on the time limit.)

ECF is fair enough here. The numbers are ballpark at best!
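
As a rough check on how much the choice of formula matters over the score ranges quoted above, the Python sketch below compares the linear ECF-style conversion with the Elo logistic formula. The function names are hypothetical. The four scores are the fitted endpoints given in this reply.

```python
import math


def linear_elo(score):
    """ECF-style conversion: (100 * score - 50) ECF points, times 8."""
    return 8.0 * (100.0 * score - 50.0)


def logistic_elo(score):
    """Solver rating minus problem rating from the Elo expected score."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def gains(score_before, score_after):
    """Return (linear gain, logistic gain) in Elo points."""
    linear = linear_elo(score_after) - linear_elo(score_before)
    logistic = logistic_elo(score_after) - logistic_elo(score_before)
    return linear, logistic


if __name__ == "__main__":
    for name, before, after in (("Bain", 0.411, 0.713),
                                ("Heisman + Pandolfini", 0.525, 0.613)):
        linear, logistic = gains(before, after)
        print(name, round(linear, 1), round(logistic, 1))
```

For the Bain range this comes out at roughly 242 points on the linear scale against roughly 221 on the logistic one, so the two methods agree to within the accuracy that can be claimed for these numbers.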

My learning curve is not necessarily linear. The pattern matching model implies an exponential curve, but the curvature is insignificant in these examples.

For harder problem sets, I am extending the time limit, and I can change the exponent in the scoring function to get scores nearer to 0.5 for these sets.

Monday, 1 August 2011
