Empirical Rabbit: Tactics Performance Measurement

Friday, 1 April 2011

Tactics Performance Measurement

One method of measuring your performance at solving a set of chess problems is to measure the average time spent solving each problem and the percentage score achieved, but there are difficulties with this method:

* If you solve the problems faster, but your score goes down, you do not know whether you are doing better or worse.

* Similarly, if you solve the problems more slowly, but your score goes up, you do not know whether you are doing better or worse.

* You are rewarded for giving up quickly whenever you encounter a difficult problem.

(You might say that you are being rewarded for good time management, but the time management skill here is different from that in a game, and your skill at time management really ought to be measured separately. What we really want to measure here is just your ability to get the right answer quickly.)

These difficulties can be avoided by measuring the solution times for individual problems. The smallest time limit applied to each problem individually that would still have allowed you to score 50% (the median solution time) does not suffer from the difficulties identified above, provided that you are always able to get at least 50% right. The smallest time limit applied to each problem individually that would still have allowed you to solve at least 85% would also work, provided that you scored at least 85%. Similarly for any other percentage. The number of problems that can be solved within a fixed time limit applied to each problem individually also does not suffer from the difficulties identified above. The cumulative distribution of the solution times gives us the percentage that were solved with any such time limit that we may choose, assuming that failures are counted as infinitely long solution times. Here is the cumulative distribution for my first pass through batches E+F in the Bain Experiment:
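For readers who want to compute such a distribution from their own results, here is a minimal Python sketch. It is not the code used for the Bain Experiment: the function names and the sample times are hypothetical, and failures are recorded as math.inf, following the convention of treating them as infinitely long solution times.

```python
import math

def fraction_solved_within(times, limit):
    """Value of the cumulative distribution at `limit` seconds.

    `times` holds one solution time per problem; a failure is math.inf.
    """
    return sum(1 for t in times if t <= limit) / len(times)

def smallest_time_limit(times, target):
    """Smallest per-problem time limit that still scores at least `target`.

    `target` is a fraction in (0, 1]. Returns math.inf if that score was
    never reached, i.e. too many problems were failed.
    """
    ordered = sorted(times)
    # Number of problems that must be solved; the small epsilon guards
    # against floating-point error in target * len(ordered).
    needed = max(1, math.ceil(target * len(ordered) - 1e-9))
    return ordered[needed - 1]

# Hypothetical solution times in seconds for ten problems (two failures).
times = [4.0, 7.5, 12.0, math.inf, 3.2, 20.0, 9.1, math.inf, 5.5, 15.0]

median_time = smallest_time_limit(times, 0.5)    # 9.1 seconds
limit_for_85 = smallest_time_limit(times, 0.85)  # math.inf: only 80% solved
solved_in_10s = fraction_solved_within(times, 10.0)  # 0.5
```

With these sample times, the 85% measure is infinite because only 80% of the problems were solved, which is the proviso mentioned above.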

Measuring performance at solving a set of chess problems is all very well, but it does not necessarily have a direct relationship with your tactical ability in a chess game. The problem here is that the tactics in the chess problems may not necessarily be representative of what you are likely to meet in practice. You might do very well at solving the chess problems, but this skill might turn out to be of little practical value. Alternatively, you might do badly at solving the chess problems, but this might not matter much in practice. The solution to this problem is to construct a set of problems that is statistically representative of what you are likely to meet in practice.

There are computer programs that will automatically extract tactics problems from the games in a chess database. The problems on the Chess Tempo tactics server were constructed in this way, see: http://chesstempo.com/faq.html#tactics. In principle, we could use one of these programs to construct random samples of chess tactics as they occur in practice, and use these to measure our tactical ability. (We need a collection of tactics exams to measure our progress, because we can only use each exam once, for this purpose.) The difficulty here is the same one that opinion pollsters face: you need a very large sample to achieve an acceptable level of accuracy. The solution to this difficulty is statistical profiling. We can reduce the sample size needed by ensuring that each sample of tactical problems has the same statistical profile as the whole population of chess tactics as they occur in practice.

We could, in principle, use problem sets that have the same statistical profile as the whole population of chess tactics for training - but the sets would contain a very high proportion of trivial tactics, and tactics so difficult that they can only be found by a computer. It makes more sense to construct samples in which the level of difficulty is restricted to a narrow band, so that we can choose problem sets that are at an appropriate level of difficulty for us. The level of difficulty could be assessed by a computer program, which could tell us how many half moves deep the solution is - or better, perhaps, it could tell us the total number of half moves in all the variations of the solution. Alternatively, the difficulty for human players could be assessed by carrying out tests.

If we are going to statistically profile our problem sets, we also need to classify the problems by type, e.g. by primary and secondary motif, or something more sophisticated. We could, in principle, program a computer to carry out this task. Alternatively, we could use human assessment. We need to ensure that each set of sample problems has the same distribution of problem types as the whole population, and the same distribution of difficulty within each problem type. This not only ensures that each set of problems is representative, but also avoids the practical difficulty that some players might be better at some types of problems than others.

If we measure our performance at solving these problem sets, we are measuring our absolute performance at finding chess tactics, as it occurs in practice, rather than our relative performance against the competition. Clearly, we would also like to know how well the competition does for each problem set as a whole - and for each problem type / level of difficulty within each set - so that we can identify our relative strengths and weaknesses. We can relate our scores (e.g. median solution time) to those of other players by plotting them on a scatter graph against their ratings. A typical scatter graph might look like this:

Clearly, there is not going to be a one to one relationship between tactics performance and rating (even for players with reliable real world ratings), because some players will be better at tactics, and others at other aspects of the game.

The method of constructing the problem sets described above is essentially the same as the one that I used to construct the batches of problems in the Bain Experiment. Bain is one of many problem books in which each chapter contains problems of a different type, and the problems within each chapter are sorted into ascending order of difficulty. If we want to divide the problems into two representative sets, which both have the same distribution of difficulty for each type of problem, we can take the first set to be the odd numbered problems, and the second set to be the even numbered problems. Alternatively, if we want six sets, and there are six diagrams per page, we can take the first set to be the first diagram on each page, the second set to be the second diagram on each page, and so on.
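The odd/even and one-diagram-per-page schemes are both forms of dealing the problems out in rotation. A minimal Python sketch of the idea follows; the function name and the problem labels are hypothetical. It assumes that each chapter holds one problem type sorted by ascending difficulty, and it restarts the rotation at each chapter, which matches numbering straight through the book only when every chapter length is a multiple of the number of sets.

```python
def split_into_sets(chapters, n_sets):
    """Deal each chapter's problems out in rotation into n_sets sets.

    `chapters` is a sequence of lists, one list per problem type, each
    already sorted into ascending order of difficulty.
    """
    sets = [[] for _ in range(n_sets)]
    for chapter in chapters:
        for index, problem in enumerate(chapter):
            sets[index % n_sets].append(problem)
    return sets

# Hypothetical book with two chapters of four problems each.
chapters = [
    ["pin-1", "pin-2", "pin-3", "pin-4"],
    ["fork-1", "fork-2", "fork-3", "fork-4"],
]

odd_set, even_set = split_into_sets(chapters, 2)
# odd_set  -> ['pin-1', 'pin-3', 'fork-1', 'fork-3']
# even_set -> ['pin-2', 'pin-4', 'fork-2', 'fork-4']
```

Each set ends up with the same number of problems of each type, spread across the same range of difficulty.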

If we carry out this process with a problem book, the statistical distribution of problems will not necessarily reflect that of tactics as they occur in real games. As we saw in the Bain Experiment, this makes it difficult to relate an improvement at solving those problems to an improvement at finding tactics in real games. However, this would be less of a concern with a larger problem set, and it is possible that problems that have been selected for their instructional value are more effective for training purposes than problems that have been randomly extracted from games.

The approach outlined here is different from that typically adopted by tactical training software, which usually gives tactical ratings to its users based on their performance at solving problems. The tactical ratings assigned by the online tactical servers usually use the Glicko rating system, with the problems given ratings and treated as opponents. Solving a problem in a timely fashion is counted as a win for the user, and a failure or a slow success is counted as a loss. For a correct solution, Chess Tactics Server assigns the user a result between 0 and 1, according to the time spent solving the problem:

(The Chess Tactics Server website says that this graph was chosen to make their rating system work, and that the short time limits discourage cheating, which affects the rating of the problems, see: http://chess.emrald.net/time.php.)

With tactical training software, you are invariably allowed to solve the problems repeatedly - which will improve your performance at solving those problems - but this improvement will not be fully reflected in your ability to solve fresh problems. Consequently, you are given a false impression of progress. (A worse problem is that you often do not get the opportunity to repeat the same problems to your chosen schedule.) I did a Google search and found forum posts that said that one of the online tactical servers gives average players International Master ratings, whereas another gives International Masters the ratings of average players!

See my later articles Rating Points Revisited and Rethinking Problem Server Ratings for further discussion.

12 comments:

King and Pawn, 1 April 2011 at 06:37
I've set up a custom set of problems on chesstempo.com that are 1100 to 1200, do not contain mates, and have been rated as 4 or better out of 5.

The rating is based on the users who have solved and failed to solve, so takes care of the difficulty of the problem. The rating level means that they are single motif or fairly simple multiple motif problems. The non-mate means that I'm training on the tactics that come up more often in games.

They have a great ability to create another custom set which adds the ability to filter by whether I've ever gotten a problem wrong, or whether I've ever solved it under 30 seconds, etc.

I pay a small fee to be able to create these sets and have them track my success and my daily progress, but it's more than worth it for the features, time and convenience. Having everything available from anywhere internet-connected is brilliant.

They haven't given me anything for my advocacy, except a great experience.

King and Pawn, 1 April 2011 at 07:03
Two follow up notes:
My aim for this set is to master recognition of basic tactical patterns, rather than to practice solving and calculation.

Second, I've independently come up with the same visualization you have, the Pareto curve of correct responses. So I'm aiming for each repetition of the set (550 problems) to be both faster (left) and higher (more correct) than the previous repetitions. I've been building it in Excel based on the .csv downloads of my progress from the site, but I'm going to suggest that Richard from chesstempo.com includes this chart as part of the premium service.

Geoff Fergusson, 1 April 2011 at 12:46
Improvement at the problems that you are practicing is not really what matters. What you really need to monitor is your improvement at problems that you have never seen before. To monitor this progress, you need many sets of problems of equal difficulty. I recommend that you solve one or two sets in a day, so that you can precisely set the number of days between repetitions. 550 problems per set is too many for this unless you have more time and stamina than me! The Pareto Distribution is not a good fit for my solution time distributions.

Is it possible to get Chess Tempo to always show white at the bottom of the diagram?

AoxomoxoA, 10 July 2011 at 15:17
It is possible to monitor the improvement in tactics at CT. The number of tactics problems is increasing, so any tactician sees new, never-before-seen problems from time to time during regular blitz training. An analysis of the .csv download can show the progress on these problems only.

Geoff Fergusson, 11 July 2011 at 00:07
I have looked at Chess Tempo. There does not appear to be a way of setting up equally difficult problem sets with the same statistical profile of problems. It does, however, appear to be possible to set up problem sets within narrow rating ranges with a random collection of problem types within each of them. Hopefully, the csv download would give me the solution times and results for each pass through each problem set, and the rating of each problem. I could then compensate for the difference in rating between the sets.

I have been put off using Chess Tempo by the coloured highlighting of squares on the board, but I believe it is possible to turn this off. There does not appear to be an option to make black always be at the top of the board though, but I would probably get used to that. Five star membership for ever more would be expensive, but I believe that the FEN is made easily available at this level.

One of next month's articles addresses the topic of how many rating points I have improved, but it raises as many problems as it solves!

AoxomoxoA, 11 July 2011 at 03:06
The columns of the CT-history:

time problem_id problem_rating_at_time problem_rating_now type av_secs seconds after_first user_rating user_rating_change correct wrong_move

AoxomoxoA, 11 July 2011 at 14:51
"There does not appear to be a way of setting up equally difficult problem sets with the same statistical profile of problems"

You can do it "by hand". You can create a tag "set 1" and apply it to all problems of "SET 1". Then you create a custom set with the problems tagged "set 1" and name it SET 1.

Geoff Fergusson, 11 July 2011 at 23:30
Thank you for that. If you do it like that, it should be possible to make problem sets that are more equal than is practicable with my problem books. Having sets that gradually increase in difficulty from one to the next by a known amount looks helpful. The advantage of the servers is that they have ratings for the problems - but these numbers may not mean very much at the end of the day. If I just want to track my tactical rating progress, I could just do a few problems a day. I think you can do 4 a day for free on chess.com, which appears to have the best board display. Best of all would be the FEN, solution and problem rating on a CSV file, so I could make my own user interface. However, as I have said, perhaps the problems in the better books are especially instructive. On the other hand, a rating based on a random sample from real games is more immediately meaningful.

AoxomoxoA, 12 July 2011 at 00:54
I had my problems ;-) with the problems of the chess.com Tactics Trainer. Many of them are not freshly generated but taken from freely available problem collections on the internet. So there are, for example, many problems from Fred Reinfeld in there, which are well known to many tacticians. I did not know them, so I was very astonished that such complicated problems were rated that low. My rating at TT was always jumping up and down by 300+ points a day, depending on how many of Reinfeld's problems I got.

As far as I know, nowhere do you get the FEN, the solution and a rating together.

I measure my progress (that's the point) on a training set by the increase of my average score on that set. At CT I use
Score = (av_secs / 2 + 2.5) / seconds if I solved it, and Score = 0 if I did not solve it.
So my score is 0 if I did not solve it, low if I solved it slowly, and high if I solved it quickly. The score is 1.0 if I do the problem in half the time of av_secs, which is the speed of a tactician who is "usually" 600 points stronger than the problem. (See my blog for the "calculations".)

I cut off the score if it reaches 1.5.

av_secs is the average number of seconds taken by the tacticians at CT; seconds is the time I needed to solve it.
The 2.5 is a "delay" I calculated from my data (see my blog).
av_secs might be replaced by a number like 5; that doesn't change much.
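For anyone who wants to try this score on their own data, here is a minimal Python sketch of the formula as described in this comment. The function names and the sample attempts are hypothetical; only the formula, the 2.5 second delay and the cutoff at 1.5 come from the comment itself.

```python
def training_score(solved, seconds, av_secs, delay=2.5, cap=1.5):
    """Score for one attempt: 0 for a failure, else
    (av_secs / 2 + delay) / seconds, cut off at `cap`."""
    if not solved:
        return 0.0
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return min((av_secs / 2 + delay) / seconds, cap)

def average_score(attempts):
    """Average score over a set; each attempt is (solved, seconds, av_secs)."""
    return sum(training_score(*attempt) for attempt in attempts) / len(attempts)

# Hypothetical attempts: (solved, my seconds, av_secs for the problem).
attempts = [
    (True, 12.5, 20.0),   # (10 + 2.5) / 12.5 = 1.0
    (True, 25.0, 20.0),   # (10 + 2.5) / 25.0 = 0.5
    (True, 2.0, 20.0),    # 6.25, cut off at 1.5
    (False, 40.0, 20.0),  # failure scores 0
]

set_score = average_score(attempts)  # (1.0 + 0.5 + 1.5 + 0.0) / 4 = 0.75
```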

Geoff Fergusson, 12 July 2011 at 04:27
That is interesting. I think I have only looked at chess.com's problem of the day. The trainer is probably different. I know all the reasonable Reinfelds, and a good proportion of the other popular problems, so chess.com looks pretty useless for monitoring my performance.

An exponential might be better for measuring your performance. exp(-0.099021*(t-3)) for t>3 matches the CTS graph well, and you could adjust the exponent according to av_secs.
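A minimal Python sketch of that exponential follows. The function names are hypothetical, and two things are assumptions rather than part of the suggestion above: a full score of 1 for solutions within the first 3 seconds, and the idea of choosing the exponent so that the score falls to one half at some chosen time.

```python
import math

def exponential_score(t, rate=0.099021, grace=3.0):
    """Score for a correct solution found after t seconds.

    Assumption: solutions within the `grace` period score 1.0; after that
    the score decays as exp(-rate * (t - grace)).
    """
    if t <= grace:
        return 1.0
    return math.exp(-rate * (t - grace))

def rate_for_half_score(half_time, grace=3.0):
    """Exponent that makes the score fall to 0.5 at `half_time` seconds.

    Hypothetical helper: one possible way to adjust the exponent, for
    example with half_time derived from av_secs.
    """
    return math.log(2) / (half_time - grace)

# With the default exponent, the score is one half at 10 seconds.
score_at_10 = exponential_score(10.0)

# A slower curve for a harder problem: one half at 30 seconds.
slow_rate = rate_for_half_score(30.0)
score_at_30 = exponential_score(30.0, rate=slow_rate)
```

The default exponent 0.099021 is ln(2)/7 to six decimal places, so the default curve passes through one half at 10 seconds.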

AoxomoxoA, 12 July 2011 at 05:28
There is the rating system of CTS
http://chess.emrald.net/time.php
the rating system of Tactics trainer
http://www.chess.com/tactics/help.html#rating
and the rating system of chesstempo
http://chesstempo.com/user-guide/en/tacticRatingSystem.html#blitzRating

At CTS, after 3 seconds the result decreases linearly, with 1/2:1/2 at 10 seconds.
That's just "chosen". CTS was the first system; the system of CT is the latest (from what I know!), and Richard did a lot of thinking about how to choose it.
So I started my personal rating calculations oriented on the system of CT. The system of CT seemingly (I don't have enough data) overrates low-rated problems. One of the problems of CT is the "overscoring" of quickly solved problems. But good tacticians solve easy problems not only with fewer blunders but with higher speed too.... Well, the result of my "analysis" is that I just look at the average score and its progress. For the purpose of monitoring the improvement on a set of problems, your performance function will do very well, but I can't see that it is somehow "better".

If you want to monitor your improvement at tactics, you should base your calculation on the calculation of the server you use. CT gives a time penalty if you don't make your moves after move #1 quickly, because it wants to force you to think first and move afterwards. That's different from CTS.
To monitor my "real" improvement in tactics, I calculate a rating closely related to the CT rating only on problems I see for the first time. This is the most interesting rating, because OTB I "always" see new problems.

Geoff Fergusson, 12 July 2011 at 06:01
Thanks for the links. There is certainly some food for thought there.