Empirical Rabbit: Simulating Solution Times and Scores

The purpose of this article is to describe a model to simulate chess problem server user solution times and scores using random numbers. These simulated times and scores can be used as test data to validate the algorithms that I have proposed in my recent articles.

In my last article, Rating, Time and Score Revisited, I analysed the distribution of solution times on the assumption that the user always checks his solutions perfectly. But what if he is not so careful? The obvious approach is to assume a distribution of (correct or incorrect) solution times, and use the success probability function discussed in the previous article to generate distributions of correct and incorrect solutions. This is fine mathematically, but there is a problem. With this method, the smaller the solution time, the larger the success probability. My practical experience is the opposite. If I see the solution to a problem quickly, my success probability is high. If I struggle to find the solution, my success probability is lower. It is true that the problems on which I struggle are typically harder, but I still believe that this approach is unrealistic. Even if the problem rating is calculated perfectly from the results of millions of users, its perceived difficulty to me on the occasion that I first encounter it may be different. I will find the problem easier if I have seen a similar problem recently, or if the first move that I consider works. I believe that a more realistic approach is to add a normally distributed random variable to the problem rating to give the rating as it is perceived by the user. I do not know precisely what the standard deviation of the random variable should be, but 100 Elo points is a reasonable starting point.

A reasonable assumption is that a user takes less time on the easier problems, and more time on the harder ones, in an attempt to keep his expected score at a constant level, unless the solution is taking too long, in which case he gives up. The simplest assumption is that he gives up after a fixed time limit. Lets see how these ideas work out mathematically, using the notation of the previous article. The perceived rating R is assumed to be sampled from a normal distribution with the problem rating as its mean, and variance σ squared.

R ∼ N(R_P, σ²)

The success probability S is:

S = 1 / (1 + (T/t)^k * 10^[(R-R_U)/400])

(As before, T is the target time at which the user and problem ratings are valid, k = K/(400*log(2)), where K = rating points gained for each doubling in the thinking time, and R_U is the user rating.) Solving for t gives:

(T/t)^k * 10^[(R-R_U)/400] = 1/S - 1 = (1-S)/S

t = [S/(1-S)]^1/k * T * 10^[(R-R_U)/(400*k)]

The solution time t is log-normally distributed (i.e. the logarithm of t is normally distributed). The time to score 50% is:

h = T * 10^[(R-R_U)/(400*k)]

The time to score S is therefore:

h * [S/(1-S)]^1/k

Solution times of less than one second are not realistic. We could avoid this situation by adding a latency time. A simpler rough and ready solution is to round up this time to the next integer. The solution time is then the smaller of this time and the time limit L:

min{ceil(h*[S/(1-S)]^1/k), L}

The expected score at this solution time is:

s = 1 / (1 + (T/min{ceil(h*[S/(1-S)]^1/k), L})^k * 10^[(R-R_U)/400])

We can decide the actual score by comparing s with a random number that is uniformly distributed in the range 0 to 1. The actual score is 1 if the random number is less than s, and 0 if it is more. The chart below shows the histogram of (correct or incorrect) solution times for a user rated 1700 and problem with a mean rating of 1500 with a standard deviation of 100 points:

For this chart, T = 100 seconds, L = 400 seconds, S = 0.75, and K=200. (The histogram has 1001 data points.) This chart looks very plausible The chart below shows the corresponding histogram when the mean problem rating is 1500:

This chart again looks plausible, but perhaps a standard deviation of 50 points is more realistic. The sum of two normal distributions is another normal distribution, so we can accommodate a standard deviation in the problem ratings simply by increasing the standard deviation in this simulation. The histogram below is based on a 300 point standard deviation, with a user rating 300 points more than the mean problem rating, a success probability of 0.8, and K=160:

This chart looks remarkably similar to those that I obtained in my problem book experiments. In those experiments, the histogram closely matched a negative exponential, and I typically exceed the time limit on about 10% of the problems.

When the user's rating is low in relation to the mean rating of the problems, most of his solution times will exceed his time limit if we set him a high target success probability. This is not realistic, and modest increases in the time limit have little effect because the histogram is relatively flat in this range. A possible solution is to set the target success probability so that most of the user's solution times fall within the time limit. 90% of the sampled values for a normal distribution fall below the mean plus 1.282 standard deviations. The target success probability for which 10% of the times fall outside the time limit is given by:

L = [(S/(1-S)]^1/k * T * 10^[(R_P+1.282*σ-R_U)/(400*k)]

where σ is the standard deviation applied to R_P. Solving for S gives:

(1-S)/S = (T/L)^k * 10^[(R_P+1.282*σ-R_U)/400]

1/S - 1 = (T/L)^k * 10^[(R_P+1.282*σ-R_U)/400]

S = 1/(1 + (T/L)^k * 10^[(R_P+1.282*σ-R_U)/400])

Rearranging the equation for L gives:

[S/(1-S)]^1/k = (L/T) * 10^[(R_U-R_P-1.282*σ)/(400*k)]

We also have:

t = [S/(1-S)]^1/k * T * 10^[(R-R_U)/(400*k)]

= (L/T) * 10^[(R_U-R_P-1.282*σ)/400*k] * T * 10^[(R-R_U)/(400*k)]

= L * 10^[R-R_P-1.282*σ)/(400*k)]

With this model, the distribution of solution times is independent of the user rating and the target time T. It is also independent of the problem rating, because R - R_P is just a normally distributed random variable with mean zero and standard deviation σ.

When the user rating is equal to the mean problem rating, and σ = 300, S falls to about 0.5 (for K in the range 160-200). Nonetheless, on Chess Tactics Server, very few users score less than 50%. Users probably give up unless their rating is higher than the mean rating of the problems that they tackle.

T is an arbitrary scale factor in the mathematics, and I have made it 100 seconds for clarity. However, the average solution time on Chess Tempo is about 45 seconds, so T = 30 seconds and L = 120 seconds would be more realistic.

This is not a perfect model of user behaviour, but it should be good enough to generate reasonably realistic test data.

Empirical Rabbit

Thursday, 27 September 2012

Simulating Solution Times and Scores

No comments:

Post a Comment