Friday 7 September 2012

A Simpler Server Rating Method

I have been looking at a simpler chess problem server rating method.  With this method, the ratings calculated are what they would be if the users always took the target time on their first attempt at every problem.  (N.B. Only first attempts at solving problems are used for rating purposes.)  I assume that a user's effective rating increases by K Elo rating points for each doubling of his thinking time (see my earlier article Rating, Time and Score).  With this method, the problem ratings are adjusted prior to calculating a user rating, and the user ratings are adjusted prior to calculating a problem rating.  Once these adjustments have been made, a standard rating calculation can be used.

Rating a User

If a user takes the target time T on a problem, the USCF version of the Elo formula gives his expected score as:

sU = 1 / (1 + 10^[(RP - RU)/400])

Suppose that the user takes time t, i.e. t/T times the target time T.  By hypothesis, this makes him K*log2(t/T) rating points stronger, increasing his expected score.  It follows that:

sU = 1 / (1 + 10^[(RP - RU - K*log2(t/T))/400])
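As a minimal Python sketch of this formula (the function name and argument layout are my own; the default K = 150 matches the graph below):

    import math

    def expected_score(r_user, r_problem, t, target_t, k_elo=150):
        # Expected first-attempt score of a user against a problem, treating
        # the user as k_elo points stronger per doubling of thinking time.
        effective = r_user + k_elo * math.log2(t / target_t)
        return 1 / (1 + 10 ** ((r_problem - effective) / 400))

    print(expected_score(1500, 1500, 100, 100))  # 0.5 at the target time
    print(expected_score(1500, 1500, 200, 100))  # ~0.703 at double the time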

The graph below plots the user's expected score against the time he takes on an equally rated problem with a target time of 100 seconds:

[Graph: expected score against time taken on an equally rated problem]
For this graph, K = 150.  In this example, the user's expected score is 0.5 when he takes 100 seconds on the problem.  (N.B. This is essentially the same as the graph in Rating, Time and Score.  Only the scaling is different.)  The longer the user takes, the greater his expected score.  Suppose that the user's actual score against a single problem is 0.5.  When his expected score is equal to his actual score:

1 / (1 + 10^[(RP - RU - K*log2(t/T))/400]) = 0.5

10^[(RP - RU - K*log2(t/T))/400] = 1

RP - RU - K*log2(t/T) = 0

RU = RP - K*log2(t/T)

The longer the user takes, the lower his estimated rating.
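For example, with RP = 1500 and K = 150, a user who takes twice the target time (t = 2T) has an estimated rating of RU = 1500 - 150*log2(2) = 1350.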

Rating a Problem

The expected score of the problem is:

sP = 1 - sU

This reduces algebraically to:

sP = 1 / (1 + 10^[(RU - RP + K*log2(t/T))/400])

When the problem's actual score against a single user is 0.5:

RP = RU + K*log2(t/T)

The longer the user takes, the higher the problem's estimated rating.
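Continuing the example above, a problem on which the user takes twice the target time has an estimated rating of RP = 1500 + 150 = 1650.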

Calculation Details

To minimise the size of the rating adjustments needed, it is sensible to make the target time T roughly equal to the average time taken by all the users on all the problems.  With this method, we can use the rating calculation described in my previous article Rating by Expected Score.  This calculation avoids the difficulties with the basic Elo calculation identified in an earlier article Rating by Maximum Likelihood.   The formula for the expected score simplifies here:

10^[(RP - K*log2(t/T))/400] =

10^(RP/400) / 10^log2[(t/T)^(K/400)]

Using log2(z) = log(z) / log(2) gives:

10^(RP/400) / (t/T)^[K/(400*log(2))] =

10^(RP/400) / (t/T)^k

where k = K / (400*log(2))

The only additional work that we need to do when calculating user ratings is to divide the xi = 10^(Ri/400) values for the problems by (ti/T)^k.

Similarly:

10^[(RU + K*log2(t/T))/400] = 10^(RU/400) * (t/T)^k

The only additional work that we need to do when calculating problem ratings is to multiply the xi = 10^(Ri/400) values for the users by (ti/T)^k.
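Both adjustments in a minimal Python sketch (the function names are my own; K = 150 as in the example above):

    import math

    K_ELO = 150                        # assumed rating gain per doubling of time
    k = K_ELO / (400 * math.log10(2))  # ~1.246 when K = 150

    def problem_x(r_problem, t, target_t):
        # x-value of a problem when rating a user: divide by (t/T)^k
        return 10 ** (r_problem / 400) / (t / target_t) ** k

    def user_x(r_user, t, target_t):
        # x-value of a user when rating a problem: multiply by (t/T)^k
        return 10 ** (r_user / 400) * (t / target_t) ** k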

No Initial Ratings  For a new server, where there are initially no ratings, the simplest problem scheduling method is to have all the users solve the problems in the same sequence.  To calculate reliable ratings, I expect that we will need about fifty users who have solved at least a hundred problems.  For simplicity, we can do our initial calculations with just these users, and just these problems.  We can initially set the problem ratings at time T to 1500, and calculate user ratings at time T (relative to 1500), as above, using an assumed value of K.  Given a suitable value of K, this calculation should be reasonably accurate, because all the users have tackled all the problems, and have therefore faced equal opposition.  We can then use these user ratings to calculate ratings for the problems.
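The calculation itself is described in Rating by Expected Score.  As a rough sketch, assuming that calculation solves for the rating whose total expected score equals the actual total score, the bootstrap could look like this (all names are my own):

    def rating_from_results(opponent_xs, total_score, lo=0, hi=4000):
        # Bisect for the rating whose total expected score against the given
        # x-values equals the actual total score (needs at least one success
        # and one failure, i.e. 0 < total_score < len(opponent_xs)).
        for _ in range(100):
            mid = (lo + hi) / 2
            x = 10 ** (mid / 400)
            if sum(x / (x + xi) for xi in opponent_xs) < total_score:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Bootstrap: start every problem at 1500, rate each user from the
    # time-adjusted problem x-values (problem_x above), then recompute the
    # problem ratings from the time-adjusted user x-values (user_x above).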

Finding the Best Value of K  [I later found that it was not possible to find K by this method; see Varying K.]  Suppose that we have fifty users who have each solved five hundred problems.  (N.B. We need at least one success and one failure to calculate a meaningful rating.)  We can divide the results for each user into ten deciles, according to the time he took on each problem.  We can then calculate ratings for each of these deciles from the problem ratings.  We can also calculate the variance of these ten ratings.  We can then add together these variances for all the users, and find the value of K that minimises this total.
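Although the bracketed note above records that this search did not succeed in the end, here is a sketch of the procedure as described (the per-user layout of (problem rating, time taken, score) triples is hypothetical; rating_from_results and the x-value adjustment are from the sketches above):

    import math
    import statistics

    def decile_variance(user_results, k_elo, target_t):
        # Variance of the ten per-decile ratings for one user, where the
        # deciles partition his first attempts by time taken.
        ordered = sorted(user_results, key=lambda r: r[1])
        n = len(ordered) // 10
        kexp = k_elo / (400 * math.log10(2))
        ratings = []
        for d in range(10):
            chunk = ordered[d * n:(d + 1) * n]
            xs = [10 ** (rp / 400) / (t / target_t) ** kexp for rp, t, _ in chunk]
            ratings.append(rating_from_results(xs, sum(s for _, _, s in chunk)))
        return statistics.variance(ratings)

    def best_k(results_by_user, target_t, candidates=range(50, 301, 10)):
        # the K that minimises the summed decile variances across all users
        return min(candidates,
                   key=lambda K: sum(decile_variance(u, K, target_t)
                                     for u in results_by_user))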

Rating Accuracy  We can estimate the variance in the rating for a single decile by applying Bessel's correction.  In the case above, we multiply the variance for the user by ten, and divide it by nine.  We can also estimate the variance of the mean of these ten ratings by dividing the answer by ten.  This provides a reasonable estimate of the variance of a rating calculated from all the user's results.  The corresponding standard deviation (square root of the variance) provides a reasonable estimate of the likely error in the user's rating.  (N.B. This calculation overestimates the effect of any inaccuracy in the K relationship.  A more accurate answer can be obtained by randomly selecting the problems for each decile.)  The main sources of inaccuracy in the rating calculation are random variation, inaccuracy in the K relationship, and any change in the user's performance.  Ideally, we should ignore any results during a user's familiarisation period.  The effect of changes in user performance on the problem ratings can be minimised by presenting the problems in a different order to each user (e.g. by selecting them at random).
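As a sketch, given the ten per-decile ratings for a user:

    import math
    import statistics

    def rating_standard_error(decile_ratings):
        # statistics.variance applies Bessel's correction (the multiply-by-
        # ten, divide-by-nine step for ten deciles); dividing by the number
        # of deciles then gives the variance of the mean rating.
        return math.sqrt(statistics.variance(decile_ratings) / len(decile_ratings))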

This method is simple, statistically sound, flexible, easy to implement, and a big improvement on the server rating methods currently in use.  (See my original article on this topic, Rethinking Problem Server Ratings, for a critique of these methods.)

Also see my later article: More Rating, Time and Score.
