Sunday 1 January 2012

Improving on Least Squares

Initially, I used the method of least squares to fit exponential curves to my chess tactics training results.  See my previous articles: An Important Discovery, and A Three Parameter Model.  This method served me well, but does have some limitations.  Firstly, it does not treat all the results symmetrically.  The shortest solution time contributes to all the data points on the cumulative distribution, whereas the longest solution time that is within the time limit contributes to only one.  More importantly, perhaps, because of their much greater numbers, the shorter solution times are weighted much more heavily than the later solution times.  The least squares curves, as a result, often do not fit the tail of the distribution well.  It would also be more satisfying if the calculation was based on quantities that are more directly meaningful.

It turns out to be possible to calculate the parameters a and b of the cumulative probability distribution P = a*(1 - exp(-t/b)) from the average correct solution time and the number of problems solved correctly.  The probability distribution corresponding to P is:

dP/dt = a*exp(-t/b)/b

If we multiply this expression by t and integrate by parts from 0 to t, we get:

a*(b - (b + t)*exp(-t/b))

The average correct solution time is therefore:

tav = a*(b - (b + t)*exp(-t/b)) / (a*(1 - exp(-t/b)))

which simplifies to:

tav = b - t * exp(-t/b) / (1 - exp(-t/b))

We also have:

1 - exp(-t/b) = P/a

and

exp(-t/b) = 1 - P/a

Eliminating the exponentials gives:

tav = b - t*(1 - P/a) / (P/a) = b - t*(a - P) / P

which can be rewritten as:

b = tav - t*(a - P)/P

This is a simple linear equation connecting a and b.  (For P = 1 - exp(-t/c), this equation becomes c = tav - t*(1 - P)/P.  For my data, this value of c matches the value given by least squares very closely indeed.)  Eliminating b gives:

exp(-t/(tav - t*(a - P)/P)) = 1 - P/a

We can solve this equation numerically using Newton’s method, using P as the first approximation for a.

f(a) = exp(-t/(tav - t*(a - P)/P)) - 1 + P/a

f'(a) = (t / (tav - t*(a - P)/P))^2 * exp(-t / (tav - t*(a - P)/P)) / P - P/a^2

h = -f(a) / f'(a)

a' = a + h

where a’ is the next approximation for a.

(Note that f(a) can be rewritten as:

f(a) = exp(-1/(tav/t-(a - P)/P)) - 1 + P/a

The solution for a (and therefore b) therefore depends on tav/t, rather than on tav and t separately.)

For a fixed time limit t, we can calculate a, b and b/a from two clearly meaningful quantities:

tav = average of the correct solution times within the time limit, and

P = number of correct solutions found within the time limit (applied to each problem individually), divided by the total number of problems in the problem set.

This calculation involves all the solution times symmetrically and matches the tail of the distribution well.  It is also easy to carry out with a spreadsheet:







N.B. An average successful solution time of zero gives numerical problems here, as does a success probability of zero, but these should not be realistic scenarios.  The calculated value of a can be more than 1 in some cases.  Here is a plot of the P against tav for a = 1 and a = 1.1, with t = 100 seconds:














The value of a is more than 1 when P is too large in relation to tav.

No comments:

Post a Comment