The purpose of this article is to investigate improvements to the rating calculations used by the online chess problem servers, e.g. Chess Tactics Server (CTS), chess.com and Chess Tempo. For these servers, whenever you tackle a problem, you are given a score based on whether you succeeded (or how completely you succeeded in the case of chess.com) - and, if you are successful, the time that you took to solve the problem. These scores are used as input to the Glicko rating system, with the problems being treated as opponents.
With CTS, if you fail to solve a problem correctly, your score for that problem is 0. If you solve a problem correctly within 3 seconds, your score is 1; thereafter, your score decreases linearly, reaching 0.5 at 10 seconds. If the problem's rating is greater than yours, the score then remains at 0.5 for one additional second per 20 Elo points of rating difference. After that, your score decreases exponentially.
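The CTS rule can be sketched as follows. The function and parameter names are mine, and the exact exponential decay rate after the plateau is not published, so `decay` is a placeholder assumption:

```python
import math

def cts_score(solved, t, player_rating, problem_rating, decay=0.1):
    """Sketch of the CTS scoring rule; `decay` is an assumed constant."""
    if not solved:
        return 0.0
    # Full credit up to 3 seconds.
    if t <= 3:
        return 1.0
    # Linear decrease from 1.0 at 3 seconds to 0.5 at 10 seconds.
    if t <= 10:
        return 1.0 - 0.5 * (t - 3) / 7
    # Plateau at 0.5: one extra second per 20 Elo points of rating deficit.
    plateau = max(0.0, (problem_rating - player_rating) / 20)
    if t <= 10 + plateau:
        return 0.5
    # Exponential decay after the plateau (rate is an assumption).
    return 0.5 * math.exp(-decay * (t - 10 - plateau))
```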
For chess.com, you receive a “move score” of 1 if you solve a problem correctly, or a fractional move score depending on how many moves you got right. The move score is then multiplied by a factor that depends on how your solution time compares with the average (correct or partially correct) solution time for that problem.
With Chess Tempo, if you fail to solve a problem correctly, your score for that problem is 0. If you solve a problem correctly in less than the average correct solution time for that problem, minus one standard deviation, your score is 1.25. If your correct solution time lies between this value and the average correct solution time, your score is 1. If you solve the problem correctly in more than the average correct solution time, your score is the average correct solution time for that problem divided by your correct solution time.
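The Chess Tempo rule, as described, is fully determined, and can be sketched directly (function and parameter names are mine):

```python
def chess_tempo_score(solved, t, mean_time, std_time):
    """Chess Tempo scoring rule as described above.

    mean_time and std_time are the mean and standard deviation of the
    correct solution times for this problem; t is your solution time.
    """
    if not solved:
        return 0.0
    if t < mean_time - std_time:
        return 1.25          # faster than one standard deviation below the mean
    if t <= mean_time:
        return 1.0           # between (mean - std) and the mean
    return mean_time / t     # slower than the mean: score shrinks as 1/t
```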
There are inherent difficulties with this approach. The first difficulty is that the longer you take to solve a problem, the more likely you are to get it right. Giving problems fixed ratings makes no sense unless there is a fixed time to solve the problem. The creators of CTS appear to have recognised this fact: “Due to the fact, that the problems are invariant to time control, their rating will decrease systematically if the time regime is relaxed. The ratings of tacticians will increase vice versa.” Nonetheless, they still give a longer time limit to more difficult problems. It makes no sense to give a player more time on the clock than his opponent, and then fail to take this into account in the rating calculation! The other tactics servers have the same problem. They all have a single rating per problem, and none of them imposes a fixed time for solving these problems, or tries to compensate for not having done so.
A second difficulty with this approach is that although the servers take account of the time taken to solve problems successfully, they ignore the time taken on unsuccessful attempts. For my first pass through Coakley, my average time to solve a problem incorrectly was nearly 50% longer than my average time to solve a problem correctly, and the problems that I failed to solve at all took me nearly three times as long (the time limit); see A Three Parameter Model.
A third difficulty with this approach is that no clear rationale is offered for the scoring methods.
A chess game can be viewed as a sequence of problems, one for each move. Your playing strength depends on the degree of success you achieve in solving these problems, and the total time taken. The simplest approach to scoring is to allocate points for solving each problem, and add up these points to give a total score, as in a “How Good is Your Chess?” magazine article. For simplicity, I will assume that one point is allocated per problem, as for Chess Tactics Server and Chess Tempo. I will also assume that we measure the score as the total number of problems solved correctly, divided by the number of problems attempted. I believe that the most appropriate measure for the time taken is the average time taken per problem (including those that were solved incorrectly or not solved at all). (N.B. This approach is an improvement on that used by the tactics servers, but is less than ideal. There are difficulties in relating the probability of winning a game to the probability of solving a problem, as discussed in my earlier article Rating Points Revisited.)
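The proposed score and time measures are easy to state in code. In this sketch, the `(solved, seconds)` attempt format is assumed for illustration:

```python
def session_summary(attempts):
    """Score and time measures proposed above.

    attempts: list of (solved, seconds) pairs, one per problem attempted.
    Returns (score, average_time), where score is the fraction of problems
    solved correctly, and average_time is the mean time per problem,
    counting unsuccessful attempts as well.
    """
    score = sum(1 for solved, _ in attempts if solved) / len(attempts)
    avg_time = sum(t for _, t in attempts) / len(attempts)
    return score, avg_time
```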
Clearly, we cannot ensure that all players take the same time solving a problem, or that all problems are solved in the same time. However, we can, in principle, use the problem scheduling to ensure that all players take the same average time per problem, and that all problems have the same time spent on them, averaged over all the players that attempt them. We can make the players take the same average time per problem by serving up harder problems whenever their average time is less than the standard average time, and easier problems when their average time is greater. Similarly, we can ensure that the same average time is spent on each problem by serving them up to stronger players when the average time is less than the standard time, and to weaker players when it is greater than the standard time. We can then update the rating for the player or problem concerned when the average time is very close to the standard time. If the average time deviates significantly from the standard time, we can simply accumulate the data for later use.
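The player side of this scheduling rule might look something like the following sketch. The 5% tolerance for “very close” is an assumed parameter, as are the names:

```python
def schedule_decision(avg_time, standard_time, tolerance=0.05):
    """Scheduling rule sketched above, from the player's side.

    Returns (next_problem, ready_to_rate): whether to serve a harder or
    easier problem next, and whether the running average time is close
    enough to the standard time for a rating update; otherwise the data
    is simply accumulated for later use.
    """
    deviation = (avg_time - standard_time) / standard_time
    ready_to_rate = abs(deviation) <= tolerance
    if deviation < 0:
        next_problem = "harder"   # player is too fast: raise the difficulty
    elif deviation > 0:
        next_problem = "easier"   # player is too slow: lower the difficulty
    else:
        next_problem = "same"
    return next_problem, ready_to_rate
```

The problem side is symmetrical: serve the problem to stronger players when its average time is below standard, and to weaker players when it is above.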
This simple method should work very well for a server that does not allow the user to choose which problems are scheduled. The main limitation for such a server is that forcing all the players to take the same average time per problem is not likely to be popular. Some players will want to practise with a faster time limit than others. If we have enough traffic, it should be possible to offer a choice of average times, each with its own rating system and different ratings for both the problems and the players. Since some players will be better at blitz and others at slower time limits, different ratings for different speeds make a lot of sense.
An alternative approach is to adjust the scores of both the players and the problems to what they would have been if their average times were equal to a standard time. My earlier article, Rating, Time and Score, suggests a possible calculation. This calculation is based on the assumption that each doubling of a player’s speed adds K points to his rating. K appears to be about 200 Elo points for human players (see An Important Discovery and Rating vs. Time on the Clock). For this calculation, we adjust the players’ scores to what they would have been if their average time per problem had been equal to the standard time. We also adjust the problems’ scores (using the unadjusted players’ scores) to what they would have been if the average time per problem had been equal to the standard time. We can then apply a standard rating calculation to update the ratings of both the players and the problems. (It is also possible to use this approach in conjunction with the previous method to correct for small deviations from the standard time. The calculation does not have to be particularly accurate to work well in this case.)
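Under the assumption that each doubling of speed adds K points, one plausible way to carry out the adjustment (a sketch only, not necessarily the exact calculation from Rating, Time and Score) is to map the score to an Elo rating difference, shift it by the rating value of the time difference, and map it back:

```python
import math

def elo_to_score(d):
    """Expected score for a rating advantage of d Elo points."""
    return 1.0 / (1.0 + 10 ** (-d / 400.0))

def score_to_elo(s):
    """Inverse: rating advantage implied by an expected score s in (0, 1)."""
    return -400.0 * math.log10(1.0 / s - 1.0)

def adjust_score(score, avg_time, standard_time, K=200):
    """Adjust a score to what it might have been at the standard time,
    assuming each doubling of speed is worth K rating points.
    A faster-than-standard player gains the rating value of the unused
    time; a slower one loses it.
    """
    offset = K * math.log2(standard_time / avg_time)
    return elo_to_score(score_to_elo(score) + offset)
```

Scores of exactly 0 or 1 would need clamping away from the endpoints before the logarithm is taken; in practice the adjustment would be applied to average scores, which lie strictly between the two.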
We can use the data collected by the server to optimise the value of K. One possible method is to calculate two sets of ratings for the problems. The first set uses data from only those solution attempts where the player’s time was below the average. The second set uses data from only those solution attempts where the player’s time was above the average. We can assess the accuracy of the fit by calculating the sum of the squares of the differences between the corresponding ratings for the problems in these two sets. We can carry out this calculation for various values of K, and find the value of K that provides the best fit to the data. [Suggested method amended 31/7/2012.]
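The K-fitting procedure might be sketched as follows. A deliberately crude per-problem rating estimate, with every opponent assumed to be rated 1500, stands in for a full Glicko calculation; the attempt format and all names are assumptions for illustration:

```python
import math

def score_to_elo(s):
    """Rating advantage implied by an expected score s."""
    return -400.0 * math.log10(1.0 / s - 1.0)

def estimate_problem_rating(subset, K, standard_time, player_rating=1500):
    """Crude rating estimate for one problem from a subset of
    (solved, seconds) attempts. Every player is assumed to be rated
    player_rating, and each doubling of speed is assumed to cost K
    points of effective playing strength."""
    mean_score = sum(1.0 for solved, _ in subset if solved) / len(subset)
    mean_score = min(max(mean_score, 0.01), 0.99)   # keep the logit finite
    mean_time = sum(t for _, t in subset) / len(subset)
    # Players moving faster than the standard time are effectively weaker.
    effective = player_rating - K * math.log2(standard_time / mean_time)
    return effective - score_to_elo(mean_score)

def best_K(fast, slow, standard_time, candidates=range(0, 401, 25)):
    """Grid-search for K. For the right K, problem ratings fitted from
    below-average-time attempts (fast) and above-average-time attempts
    (slow) should agree, so we minimise the sum of squared differences.
    Both arguments map a problem id to its list of attempts."""
    def sse(K):
        return sum((estimate_problem_rating(fast[p], K, standard_time)
                    - estimate_problem_rating(slow[p], K, standard_time)) ** 2
                   for p in fast)
    return min(candidates, key=sse)
```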
The optimum value of K may be both problem and player dependent. Players with good calculation skills are more likely to benefit from extra time. Extra time may also be more advantageous for problems where the candidate moves are obvious, but significant calculation is needed. Given sufficient data, it should be possible to fit different K values to different players and problems, but a “one size fits all” value of K may be adequate. The mathematical relationship between speed and rating may not precisely match my assumption: K points for each doubling in speed. Nonetheless, that relationship is a good starting point. In principle, we can find the precise relationship by collecting enough data, and adjusting the calculation accordingly.
I believe that the methods of this article are a significant improvement on those used by the popular servers. From the standpoint of accuracy, the most promising approach is to use problem scheduling to equalise the average times. Equalising these times by calculation allows more user features to be supported, but is likely to be less accurate. Nonetheless, this method should still be more accurate than the methods currently in use. Combining both methods may be the best approach.
Also see my later articles More Rating, Time and Score and Problem Server Ratings Revisited.