## Sunday, 1 January 2012

### Rethinking Chess Problem Server Ratings

The purpose of this article is to investigate improvements to the rating calculations used by the online chess problem servers, e.g. Chess Tactics Server (CTS), chess.com and Chess Tempo.  For these servers, whenever you tackle a problem, you are given a score based on whether you succeeded (or, in the case of chess.com, how completely you succeeded) and, if you are successful, on the time that you took to solve the problem.  These scores are used as input to the Glicko rating system, with the problems being treated as opponents.

With CTS, if you fail to solve a problem correctly, your score for that problem is 0.  If you solve a problem correctly within 3 seconds, your score is 1; after that, your score decreases linearly to 0.5 at 10 seconds.  If the problem rating is greater than your rating, the score remains at 0.5 for an additional (problem rating - your rating) / 20 seconds, i.e. one extra second per 20 Elo rating points of difference.  Your score then decreases exponentially.
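
The CTS rules above can be sketched as a scoring function.  The exponential decay constant below is an assumption for illustration only; CTS does not publish the exact rate.

```python
import math

def cts_score(solved, t, player_rating, problem_rating, decay=0.1):
    """Score for one attempt under the CTS rules described above.

    `decay` is an assumed exponential decay constant (CTS does not
    publish the exact rate)."""
    if not solved:
        return 0.0
    if t <= 3:
        return 1.0
    if t <= 10:
        # linear from 1.0 at 3 s down to 0.5 at 10 s
        return 1.0 - 0.5 * (t - 3) / 7
    # plateau at 0.5: one extra second per 20 Elo points of deficit
    plateau = max(0.0, (problem_rating - player_rating) / 20)
    if t <= 10 + plateau:
        return 0.5
    # assumed exponential decay beyond the plateau
    return 0.5 * math.exp(-decay * (t - 10 - plateau))
```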

For chess.com, you receive a “move score” of 1 if you solve a problem correctly, or a fractional move score depending on how many moves you got right.  The move score is then multiplied by a factor that depends on how your solution time compares with the average (correct or partially correct) solution time for that problem.

With Chess Tempo, if you fail to solve a problem correctly, your score for that problem is 0.  If you solve a problem correctly in less than the average correct solution time for that problem, minus one standard deviation, your score is 1.25.  If your correct solution time lies between this value and the average correct solution time, your score is 1.  If you solve the problem correctly in more than the average correct solution time, your score is the average correct solution time for that problem divided by your solution time.
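
The Chess Tempo rules translate directly into code.  Here `avg_time` and `std_dev` are the average and standard deviation of the correct solution times for the problem:

```python
def chess_tempo_score(solved, t, avg_time, std_dev):
    """Score for one attempt under the Chess Tempo rules described above."""
    if not solved:
        return 0.0
    if t < avg_time - std_dev:
        return 1.25          # faster than average minus one standard deviation
    if t <= avg_time:
        return 1.0           # between that value and the average
    return avg_time / t      # slower than average: fractional score
```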

There are inherent difficulties with this approach.  The first difficulty is that the longer you take to solve a problem, the more likely you are to get it right.  Giving problems fixed ratings makes no sense unless there is a fixed time to solve the problem.  The creators of CTS appear to have recognised this fact: “Due to the fact, that the problems are invariant to time control, their rating will decrease systematically if the time regime is relaxed. The ratings of tacticians will increase vice versa.”  Nonetheless, they still give a longer time limit to more difficult problems.  It makes no sense to give a player more time on the clock than his opponent, and then fail to take this into account in the rating calculation!  The other tactics servers have the same problem.  They all have a single rating per problem, and none of them imposes a fixed time for solving these problems, or tries to compensate for not having done so.

A second difficulty with this approach is that although the servers take account of the time taken to solve problems successfully, they ignore the time taken on unsuccessful attempts.  For my first pass through Coakley, my average time to solve a problem incorrectly was nearly 50% longer than my average time to solve a problem correctly, and the problems that I failed to solve took me nearly three times as long (the time limit); see A Three Parameter Model.

A third difficulty with this approach is that no clear rationale is offered for the scoring methods.

A chess game can be viewed as a sequence of problems, one for each move.  Your playing strength depends on the degree of success you achieve in solving these problems, and the total time taken.  The simplest approach to scoring is to allocate points for solving each problem, and add up these points to give a total score, as in a “How Good is Your Chess?” magazine article.  For simplicity, I will assume that one point is allocated per problem, as for Chess Tactics Server and Chess Tempo.  I will also assume that we measure the score as the total number of problems solved correctly, divided by the number of problems attempted.  I believe that the most appropriate measure for the time taken is the average time taken per problem (including those that were solved incorrectly or not solved at all).  (N.B. This approach is an improvement on that used by the tactics servers, but is less than ideal.  There are difficulties in relating the probability of winning a game to the probability of solving a problem, as discussed in my earlier article Rating Points Revisited.)
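
As a minimal illustration of these two measures - note that the score counts only correct solutions, while the average time includes incorrect and unsolved attempts:

```python
def session_stats(results, times):
    """Score and average time for a session.

    `results` are 1/0 flags for correct/incorrect; `times` are the
    times spent on every attempt, including failures."""
    score = sum(results) / len(results)
    avg_time = sum(times) / len(times)
    return score, avg_time
```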

Clearly, we cannot ensure that all players take the same time solving a problem, or that all problems are solved in the same time.  However, we can, in principle, use the problem scheduling to ensure that all players take the same average time per problem, and that all problems have the same time spent on them, averaged over all the players that attempt them.  We can make the players take the same average time per problem by serving up harder problems whenever their average time is less than the standard average time, and easier problems when their average time is greater.  Similarly, we can ensure that the same average time is spent on each problem by serving them up to stronger players when the average time is less than the standard time, and to weaker players when it is greater than the standard time.  We can then update the rating for the player or problem concerned when the average time is very close to the standard time.  If the average time deviates significantly from the standard time, we can simply accumulate the data for later use.
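
A minimal sketch of this scheduling rule, with illustrative `step` and `tolerance` parameters (the article does not specify values):

```python
def pick_target_rating(player_rating, avg_time, standard_time,
                       step=50, tolerance=0.05):
    """Choose a target problem rating for the next problem served.

    If the player's running average time per problem is below the
    standard, serve a harder problem; if above, serve an easier one.
    `step` and `tolerance` are illustrative tuning parameters."""
    deviation = (avg_time - standard_time) / standard_time
    if abs(deviation) <= tolerance:
        return player_rating          # on pace: serve a peer-rated problem
    if deviation < 0:
        return player_rating + step   # too fast: harder problem
    return player_rating - step       # too slow: easier problem

def ready_to_rate(avg_time, standard_time, tolerance=0.05):
    """Update ratings only while the average time is very close to the
    standard time; otherwise accumulate the data for later use."""
    return abs(avg_time - standard_time) <= tolerance * standard_time
```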

This simple method should work very well for a server that does not allow the user to choose which problems are scheduled.  The main limitation for such a server is that forcing all the players to take the same average time per problem is not likely to be popular.  Some players will want to practise with a faster time limit than others.  If we have enough traffic, it should be possible to offer a choice of average times, each with its own rating system and different ratings for both the problems and the players.  Since some players will be better at blitz and others at slower time limits, different ratings for different speeds makes a lot of sense.

An alternative approach is to adjust the scores of both the players and the problems to what they would have been if their average times were equal to a standard time.  My earlier article, Rating, Time and Score suggests a possible calculation.  This calculation is based on the assumption that each doubling of a player’s speed adds K points to his rating.  K appears to be about 200 Elo points for human players (see An Important Discovery and Rating vs. Time on the Clock).  For this calculation, we adjust the players’ scores to what they would have been if their average time per problem was equal to the standard time.  We also adjust the problems’ scores (using the unadjusted players’ scores) to what they would have been if the average time per problem was equal to the standard time.  We can then apply a standard rating calculation to update the ratings of both the players and the problems.  (It is also possible to use this approach in conjunction with the previous method to correct for small deviations from the standard time.  The calculation does not have to be particularly accurate to work well in this case.)
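
One plausible realisation of this adjustment, assuming the Elo logistic expected-score curve is used to map the K-based rating shift back onto the score.  This is a sketch under those assumptions, not the exact formula from the earlier article:

```python
import math

def adjust_score(score, avg_time, standard_time, K=200, eps=1e-6):
    """Adjust a score to what it might have been at the standard time.

    Assumes each doubling of speed is worth K rating points, and uses
    the Elo logistic expected-score curve to convert between scores
    and rating differences.  One plausible realisation only."""
    s = min(max(score, eps), 1 - eps)        # clamp scores of exactly 0 or 1
    # implied rating advantage of the player over the problem
    d = 400 * math.log10(s / (1 - s))
    # extra rating the player would gain (or lose) at the standard time
    shift = K * math.log2(standard_time / avg_time)
    return 1 / (1 + 10 ** (-(d + shift) / 400))
```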

We can use the data collected by the server to optimise the value of K.  One possible method is to calculate two sets of ratings for the problems.  The first set uses data from only those solution attempts where the player’s time was below the average.  The second set uses data from only those solution attempts where the player’s time was above the average.  We can assess the accuracy of the fit by calculating the sum of the squares of the differences between the corresponding ratings for the problems in these two sets.  We can carry out this calculation for various values of K, and find the value of K that provides the best fit to the data. [Suggested method amended 31/7/2012.]
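
The grid search over K might look like this.  Here `rate_problems` is a hypothetical function that computes problem ratings from a set of attempts using a given K, and the `fast` flag marks attempts with below-average solution times; both are assumed interfaces, not part of any existing server:

```python
def best_K(rate_problems, attempts, candidates=range(0, 401, 25)):
    """Find the K that best fits the data.

    Computes two sets of problem ratings - one from below-average-time
    attempts, one from above-average-time attempts - and picks the K
    that minimises the sum of squared differences between them."""
    def fit_error(K):
        fast = rate_problems([a for a in attempts if a["fast"]], K)
        slow = rate_problems([a for a in attempts if not a["fast"]], K)
        common = fast.keys() & slow.keys()
        return sum((fast[p] - slow[p]) ** 2 for p in common)

    return min(candidates, key=fit_error)
```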

The optimum value of K may be both problem and player dependent.  Players with good calculation skills are more likely to benefit from extra time.  Extra time may also be more advantageous for problems where the candidate moves are obvious, but significant calculation is needed. Given sufficient data, it should be possible to fit different K values to different players and problems, but a “one size fits all” value of K may be adequate.  The mathematical relationship between speed and rating may not precisely match my assumption: K points for each doubling in speed.  Nonetheless, that relationship is a good starting point.  In principle, we can find the precise relationship by collecting enough data, and adjusting the calculation accordingly.

I believe that the methods of this article are a significant improvement on those used by the popular servers.  From the standpoint of accuracy, the most promising approach is to use problem scheduling to equalise the average times.  Equalising these times by calculation allows more user features to be supported, but is likely to be less accurate. Nonetheless, this method should still be more accurate than the methods currently in use.  Combining both methods may be the best approach.

Also see my later articles More Rating, Time and Score and Problem Server Ratings Revisited.

1. I don't think that any of the present servers will change their rating system. They have a system which works. "Never touch a running system", "never change a winning team". I suggested many times at several servers to use "fixed" times for the problems, for example 3 min/problem, which would be the time for a move OTB. This regime would help to get a "right" time feeling for OTB games too. But there is seemingly no interest in that...

2. Do the servers have a rating system that works? They have a system that gives numbers, but these numbers do not have any sound mathematical or empirical basis. Their users do not know that, and even if they did, and wanted something better, they do not have an alternative. It might be worthwhile for a new server provider to provide a better rating system to attract users away from the existing servers.

3. The tactics servers use their proprietary ratings - which as you point out aren't exactly well-founded - in a Pavlovian fashion to encourage the servers' continued use. I ignore the ratings and just use CTS for free chess problem solving, for which I thank the server admins. The stat I pay attention to is therefore % correct rather than the arbitrary rating assigned.

4. If each problem has about the same average solving time - I can't see that this improves a rating system. All it does is tell you in what time you should solve it.
It is clear that you already have an advantage over OTB if you know that the puzzle has a solution. OTB you do NOT know that you can find a decisive winning move.
Now, if you make every puzzle have the same average solution time, then you get another hint: it can't be a very simple solution (you approximately know which puzzle rating range you get served).

For a chess puzzle server it is important to get a stable puzzle rating. It is less important to get a stable user rating. But well, of course the user rating is of interest for oneself in order to measure progress.
The CT approach with the FIDE estimate based on first-timers is a good one. To make it better would require more input by the users: they should provide their FIDE rating.
This new FIDE estimate approach eliminates the problem of punishing a user if he has seen a puzzle before. But I guess it would be even better if CT would not only offer a FIDE estimate (based on first-timers) but simply give a CT Blitz rating based on first-timers, too.

Also: the rating system of CT does not reward or punish a user much for solving a puzzle very fast or very slowly, but simply discourages taking too much time. The more you take solving time into account, the further you get from the main point: getting the puzzle correct.
The CT Blitz rating system punishes "guessing": it doubles the time you needed after you played your first move.
And being honest: I hardly believe I will achieve a CT Blitz rating or a similar FIDE estimate of 2300 without achieving a similar FIDE rating at the same time.
I reckon that the main problem with the FIDE correlation is the FIDE Elo system itself, which is not as good as a Glicko rating system. And even the Glicko rating system is not the best. There were competitions of rating systems carried out, and many of them were able to forecast the result between two players even better than the Glicko system.

5. If each problem has about the same average solving time - I can't see that this improves a rating system.

Suppose that I play a match against several GMs, and I have five hours on the clock while they each have five minutes. Scoring 50% should not entitle me to a GM rating. Similarly, suppose that the system's users get two problems right 50% of the time, but their average solution time for the first problem is 5 minutes and for the second 5 seconds; the first problem is likely to be much harder than the second.

All it does is tell you in what time you should solve it.

Not necessarily. I am not proposing that an equal time should be spent on solving each problem. That would be impossible. The difficulty of a problem is user dependent; and a problem that I found difficult last month might be easy for me now, and vice versa, depending on whether I have seen a similar problem recently. What I am proposing is that the time spent, averaged over all the rated users who tackled the problem, should be equal. We can achieve this by moving the peak of the distribution of the ratings of the users to whom the problem is served higher if the average is too large, and lower if it is too small. This need not be noticeable to the user. Similarly, for the user, it is the average solution time that is fixed, not the time to tackle each position, just as in a game. Alternatively, if we use calculation to compensate for different average problem solution times, this need not affect the scheduling at all.

It is clear that you already have an advantage over OTB if you know that the puzzle has a solution.

The problems could be random positions from games.

I reckon that the main problem with the FIDE correlation is the FIDE Elo system itself, which is not as good as a Glicko rating system. And even the Glicko rating system is not the best. There were competitions of rating systems carried out, and many of them were able to forecast the result between two players even better than the Glicko system.

Glicko is a little better than Elo, but the best system is much better:

http://blog.kaggle.com/2011/04/24/the-deloittefide-chess-competition-play-by-play/

The methods that I have proposed can be applied to any rating system.

6. Your link about the rating systems competition is very interesting. The final winner was Tim Salimans, 33.9% better than Elo! (capped binomial deviation was 0.246463)

However, Salimans himself commented that the design of the contest was not ideal for the goal of finding the best rating system. See his conclusion here:

http://people.few.eur.nl/salimans/chess.html

Here are the final standings for the rating systems (by the way: "our" Uri Blass got place 7, with 16.4% better than Elo. Well done Uri!):
