## Wednesday 1 June 2011

### Distinct Random Selections

In this section, I derive a mathematical result that enables us to estimate the number of problems that we need to solve to achieve a given level of coverage of the underlying tactical patterns.  If you find that your eyes glaze over, just skip down to the graph!

Suppose that we make M selections from N distinguishable items, such that for each selection, it is equally likely that any of the N items will be selected.  This is called sampling with replacement, because whenever one of the N items is extracted, it is replaced with another identical item, so that item can be selected again.  Since items can be selected again, it is possible that some of the selected M items will duplicate one another.  If we remove the duplicates, on average, how many distinct items will remain?

(1).  The first item selected will always be distinct.  After this item has been selected, the number of distinct items will be 1.

(2).  There is a 1/N chance that the second item will match the first.  There is therefore a   1 - 1/N chance that the second item will add one to the total of distinct items.  On average, the second item will add 1 - 1/N to the total of distinct items.

(3).  There is a 1/N chance that the third item will match the first item and a 1/N chance that it will match the second item.  There is therefore a 1- 1/N chance that the third item will not match the first item and a 1 - 1/N chance that it will not match the second item. There is therefore a (1 - 1/N) * (1 - 1/N) chance that the third item will not match either the first item or the second item.  If the third item does not match either item it will add 1 to the number of distinct items.  If the third item matches either or both items, it will not add anything to the total of distinct items, irrespective of whether they duplicated one another. On average, the third item will add (1 - 1/N)2 to the total of distinct items.

(4).  Similarly, the fourth item will, on average, add (1 - 1/N)3 to the total of distinct items.

The average number of distinct items selected is therefore:

1 + (1 + 1/N) + (1 - 1/N)2 + (1 - 1/N)3 +….+ (1 - 1/N)M-1

The sum of this geometric series is N*(1 - (1 - 1/N)M).  (N.B. I have checked this important formula with a Monte Carlo simulation.)  The average fraction of the items that will have been selected after M tries is given by 1 - (1 - 1/N)M  = 1 - {(1 - 1/N)N}M/N.  For large N, this expression becomes 1 - e-M/N.

In a chess context, 1 - (1 - 1/N)M estimates the fraction of the N patterns that we will have encountered after solving M problems (assuming that there is one pattern per problem).  The graph below plots this fraction (as a percentage) against M for three values of N:

Is sampling with replacement a realistic model to apply to the patterns underlying the problems in elementary tactics books?

*  Firstly, does this model over estimate the number of problems that we need to solve? For this to happen, the authors would need to have carefully studied the rival books and had a high rate of success at avoiding repetition of the patterns in those books.  In reality, most of these authors are not at all good at avoiding exact and near duplication within their own books, and frequently reuse problems from other books.  They also usually claim to offer comprehensive tactics course, which is not compatible with missing out all the common patterns that occur in other books.

*  Secondly, does this model over underestimate the number of problems that we need to solve?  For this to happen, the level of duplication in elementary tactics books would have to be larger than the level of duplication that would arise from sampling with replacement. This is possible, but I think that sampling with replacement is a reasonable approximation here.