Saturday, 1 October 2011

Rating Points Revisited

In my previous article, How Many Rating Points Is That?, I introduced a method for estimating my tactical rating point improvement from my improvement in solution times. After applying this method to the results of my tactics training experiments, it has become clear that the method can be improved upon.

In my earlier article, I used a scoring graph that closely followed that used by Chess Tactics Server (CTS).   With CTS (and other tactical servers), the problems are given ratings and treated as opponents.  Solving a problem quickly counts as a win for the user, and a failure or a slow success counts as a loss.  For a correct solution, the scoring graph provides a score between 0 and 1, depending on the time spent solving the problem:

I found that I got similar results when, in place of this scoring graph, I simply scored 1 whenever I solved the problem in under 5 seconds and 0 otherwise. Superficially, it appears that just counting the number of solution times that fall within a time limit should be less accurate than making use of the precise values of all those solution times. In practice, however, the standard deviations given by the simple time limit method were often smaller (in relation to the rating improvement) than those obtained using the scoring graph. The main problem with the CTS method (and those used by other tactics servers) is that the resulting score does not relate directly to what happens in a real game. The score given by the simple time limit method, on the other hand, does have a direct relationship with what happens in a real game.

The score given by the simple time limit method estimates the probability that you will find the tactic within the time available. If there is a single win-or-lose tactic per game at the level of the tactics problem, this probability is the same as your probability of winning the game (provided that the time limit matches the time available in a game). In practice, it is more likely that if you fail to spot a tactic, you will lose (or fail to gain) half a point. If there is only one such tactical chance per game, the score given by the time limit method will, in this case, overestimate your game score. However, if there are two tactical chances per game (attacking or defensive), and spotting each tactic in time earns you half a point, the time limit method gives a realistic estimate of the probability of winning the game.

(Suppose that there is a probability p that you will spot a tactic to earn half a point, and hence a probability 1 - p that you will not spot it. Suppose also that there is the same probability p that you will spot a second tactic to earn another half point. The probability that you will earn two half points is p ^ 2, the probability that you will earn exactly one half point is 2 * p * (1 - p), and the probability that you will earn no points is (1 - p) ^ 2. Counting each half point as 0.5, you will on average gain 1 * p ^ 2 + 0.5 * 2 * p * (1 - p) = p ^ 2 + p * (1 - p) = p points.)
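The arithmetic in this parenthesis can be checked with a short sketch (the function name is mine, and the p values are purely illustrative):

```python
# Expected game score when there are two independent half-point
# tactical chances, each spotted with probability p.
def expected_score(p):
    two_halves = p ** 2          # spot both tactics: a full point
    one_half = 2 * p * (1 - p)   # spot exactly one: half a point
    return 1.0 * two_halves + 0.5 * one_half

# The expectation simplifies to p itself, as the text claims.
for p in (0.2, 0.5, 0.8):
    assert abs(expected_score(p) - p) < 1e-12
```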

If there is more than one tactical chance per player per game, the time limit method underestimates the rating benefit of spotting those chances. The number of tactical chances per game will clearly depend on how sharp the positions are. You can get a feel for the numbers here by analysing your own games on a computer, or simply by playing against a computer. Taking the score given by the time limit method as the average number of points that you can expect to win tactically per game (at the level of tactical difficulty of the problems concerned) is likely to be conservative for lower rated players.

Previously, I converted my scores for solving chess problems into rating points using the English Chess Federation (ECF) method.  This was adequate when the scores were near to 0.5, but the Elo method gives good results over a wider range:

For this method, your expected score s is given by:

s = 1 / (1 + 10 ^  -(d/400))

where d is your rating minus that of your opponent (or of the problem set here).

(The ECF method approximates this curve with a straight line from the bottom left hand corner (-400,0) to the top right hand corner (400,1).)  Solving the Elo equation for d gives:

d = -400log(1/s - 1)

In this context, the score s is taken to be the fraction of the problems that you were able to solve within the time limit.  To use this result, we can:

(1).  Time ourselves solving a series of equally difficult problem batches.
(2).  Calculate the values of s for each batch.
(3).  Calculate the values of d for each batch.
(4).  Plot the values of d on a graph.
(5).  Fit a straight line to the graph.
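Steps (1) to (5) can be sketched in Python as follows (the batch scores here are made-up numbers for illustration, not my experimental results):

```python
import math

# Fractions of each batch solved within the time limit (illustrative).
batch_scores = {"A": 0.50, "B": 0.54, "C": 0.57, "D": 0.60}

def elo_difference(s):
    """Rating difference d implied by an expected score s (Elo formula)."""
    return -400 * math.log10(1 / s - 1)

xs = list(range(len(batch_scores)))   # batch index, a proxy for problems done
ds = [elo_difference(s) for s in batch_scores.values()]

# Least-squares straight-line fit: d = slope * x + intercept
n = len(xs)
mean_x = sum(xs) / n
mean_d = sum(ds) / n
slope = sum((x - mean_x) * (d - mean_d) for x, d in zip(xs, ds)) \
        / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_d - slope * mean_x

print(f"estimated improvement over the run: {slope * (n - 1):.0f} Elo points")
```

With these particular numbers, the fitted line rises by roughly 70 Elo points from the first batch to the last, matching the direct calculation d(0.6) - d(0.5).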

(N.B. I assume that we time ourselves on our first pass through each problem batch, that there are no duplicate problems, and that the batches are all representative of the tactics that we will meet in real games.)  Here is the graph for the Ivashchenko 1b Experiment, with a time limit of 55 seconds:

The line extends from the first problem of batch A to the last problem of batch D, and the red dots are at the midpoint of each problem batch.  The graph suggests that I improved by about 100 Elo points, but the standard deviation is also about 100 Elo points (because of the large scatter), so we cannot draw any firm conclusions here.

(N.B. In my experiments, I stop the clock as soon as I believe that I have found the solution.  This protocol enables me to estimate the number of problems that I can solve at different time limits; but I would get a higher score if I continued checking until the time limit concerned expired.  However, the resulting underestimation of my performance is probably not significant, given all the other uncertainties.)

I believe that this method is an improvement on my previous one, and on those used by the online tactical servers.  However, it is clear from the discussion above that all these methods have serious limitations.  The only really sound approach here is to test a large number of accurately rated players at solving the problem set, as discussed in my earlier article: Tactics Performance Measurement.


  1. Hi,

    I still have my problems with your rating technique. A rating on "known" problems is imo of very! limited value. (It might help to monitor the training, though.)

    Argument 1:

    Thesis 1: If I do better and quicker than anyone else, then my rating should be higher than that of anyone else.

    Thesis 2: I can learn a (small) set of problems so I can solve them quicker than anyone else (unprepared, at this time :)

    Conclusion: My rating on this set is suddenly 2800+++

    Argument 2:

    Step 1: I create a (small) set of problems where I need t+1 sec to solve every! one of them
    Step 2: I do some training on it till I solve every! problem in t-1 sec
    Step 3: I calculate a rating with the cut-off time t

    Conclusion: My rating jumps from 0 towards eternity

    Argument 3:

    Step 1: I create a (small) set of problems where I need 100 sec to solve every! one of them
    Step 2: I do some training on it till I solve every! problem in 30 sec
    Step 3: I calculate a rating with the cut-off time 29

    Conclusion: My rating stays the same

    I did some analysis on the matter of speed and rating (for example here: ), but I don't have a concrete "result" for how to calculate it.

    I think the time-score functions of CTS and CT were just "selected" and are not based on empirical data. CTS was the first tactics server anyway; how could they know what the real speed-rating relation is?

    With a lot of work it should be possible to extract the necessary data from CT, but I found my solution to this problem: I look at my performance on "problems I never saw before" at chesstempo.

Suppose that I got 50% of the problems in batch A right in under T seconds, on my first pass through that batch. Suppose also that I got 60% of the problems in batch D right in under T seconds, on my first pass through that batch. The value of d for batch A is -400log(1/0.5 - 1) = 0. The value of d for batch D is -400log(1/0.6 - 1) = 70. I therefore estimate that I improved by 70 Elo points as a result of practicing batches A, B and C.

    Any improvement that I make by practicing batches A, B and C will not be fully reflected in an improvement at solving batch D (assuming that batch D does not contain any of the problems in batches A, B and C). Indeed, if batches A, B and C are chess and batch D is checkers, practicing A, B and C probably will not help at all with batch D. However, if batches A, B, C and D are all randomly selected from chess games, practicing A, B and C should help with solving batch D to some extent.

    Actually, my problems are randomly selected from problem books rather than chess games. The problems in problem books are usually less alike than those randomly selected from chess games - but, as I have said, Bain has a worryingly high level of duplication.

    An alternative approach is to have a large test set that I solve with a very long repetition interval (one year, say). I can then argue that I will have forgotten anything that I learned on the previous passes. There are problems with both methods!

    I will make the text more explicit.

  3. So Argument 1 is wrong if you measure on an "unknown" set D.
    Arguments 2 and 3 don't work directly; you start with 50%. But I still think the rating improvement depends on the distribution of the problem complexity; I think it's parametric:

    In my training, my average "end" speed is ~2++ times quicker than the first time I see the problems.
    If the problems are "close" together (= low "variance") in their "complexity" (::= ~ the time I need for them ;) then I have 50% at the beginning and 100% at the end. But if the complexity is not close together (high variance), if there are many problems of extremely high "complexity", then those will stay unsolved. If, for example, 25% of the problems are ones which are only solvable by computers, then it is not possible to score higher than ~75%.

  4. If the problems are all of about the same complexity, then the rating is OK. That should be the case in "good" books.

  5. I go to a lot of trouble to find problem sets in which the difficulty range is as narrow as possible. The rationale is that every game contains easy tactics that both 2000 players will spot, and difficult tactics that neither player will spot. What decides games is the tactics that one player spots and the other misses. If the problem set is easy, I set a short time limit, and if it is harder, I set a longer one, to make the problems realistic differentiators.

    Suppose that 25% of the problems are impossible, and that the percentage that I get under the time limit on the remaining problems increases from 50% to 60%. My percentage on the problem set as a whole then increases from 0.75 * 50% = 37.5% to 0.75 * 60% = 45%, so that is a 7.5% increase rather than a 10% increase. Nonetheless, I still get a rough estimate of my improvement.
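    A quick sketch of this effect in Elo terms (using the same illustrative numbers as above, and my d formula from the article):

```python
import math

def elo_difference(s):
    # Elo rating difference implied by a score fraction s
    return -400 * math.log10(1 / s - 1)

impossible = 0.25          # fraction of problems nobody can solve (illustrative)
before, after = 0.50, 0.60 # scores on the solvable problems only

# Scores on the whole set are scaled down by the solvable fraction 0.75.
gain_clean = elo_difference(after) - elo_difference(before)
gain_mixed = (elo_difference((1 - impossible) * after)
              - elo_difference((1 - impossible) * before))

print(round(gain_clean), round(gain_mixed))
```

    The impossible problems compress the measured gain (here from about 70 to about 54 Elo points), so the estimate remains rough but usable.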

    If I used average solution times, this problem would be much worse, because the very large solution times for the very difficult problems would dominate in the average. This is one reason why I do not use average solution times!

  6. The last two comments are out of sequence because of the time difference between England and Germany!

    The main remaining uncertainty is the average number of times per game that a potentially half point winning or losing tactic occurs. The games of weak club players often have many changes of fortune, and many missed tactical opportunities; but tactics rarely decide GM games. At the 2000 level, two half point tactical opportunities per game looks plausible.

  7. "The main remaining uncertainty is the average number of times per game that a potentially half point winning or losing tactic occurs."..

    Igor Khmelnitsky wrote an award-winning book: Chess Exam ( ).

    In this book Khmelnitsky gives a rating for several "subskills" of chess ability, such as strategy, tactics, openings and so on. His formula indicates that a gain of x points in (slow) tactics will increase the Elo rating by 0.4 * x.

    His exam is now online for free:

  8. His exam may be based on the performances of reliably rated players at his test, rather than assigning ratings to problems and using the Elo formula. If so, 0.4 x looks plausible, but I would expect the multiplier to be larger for beginners and smaller for GMs, and larger for tactical players and smaller for positional players.

    With my method the tactical rating directly reflects your probability of solving a single tactical problem within the time available. However, if you find a single tactic that your opponent misses, you will not necessarily win, and if you fail to find it, you will not necessarily lose. To convert your probability of finding a single tactic into your probability of winning rather than losing a whole point, you need to know how many tactical chances there are per game. (Easy chances that both players nearly always spot do not count, and neither do chances that neither player has much chance of spotting.)

    The methods used by the tactical servers have the same limitations as mine - plus the limitation of muddling up many different time limits in an opaque way!

    By the way, I have just found my spam folder, and found an old comment of yours about Chess Hero, which should now be on the blog.