Digging further into the Statistical Abyss
Author: Jeff Rollason

Copyright Notice: This article is Copyright AI Factory Ltd. Ideas and code belonging to AI Factory may only be used with the direct written permission of AI Factory Ltd.

With our articles Backgammon Programs Cheat: Urban Myth?? and Statistical Minefields with Version Testing in mind, we have returned to the difficult area of assessing the significance of statistical results. This topic has been drawn further to our attention by the 1,000,000+ people who have played our Backgammon program. Traffic from these users has illustrated the issues that people can have when assessing the significance of test results: many are certain that our program rolls non-random dice, but this is a false perception...

Such assessments can also (paradoxically) be further undermined by the ill-directed application of common statistical theory, which can easily lead to misleading conclusions.

My general thesis is that while humans are astonishingly capable of complex intellectual and intuitive reasoning, which is why we not only dominate this world but now have footprints on other worlds too, the human intuitive grasp of statistical probability is actually quite poor.

This article takes another look at the application of statistical probability to games development.


The Background

First, I am not going to claim that I have some profound intuitive grasp of this topic. In reality I have also allowed myself to be mesmerised at times by results that appeared to convey meaning but actually said very little.

In defence against that frailty I created a program, BINOM.EXE, that reverse-engineers the binomial distribution so that a set of results can be assessed to determine not only the significance of the results but also the probability of the different possible p and q values that would fit the observed distribution. This was my first arm of defence.

This very useful little program is already detailed in Statistical Minefields with Version Testing. It allows the user to provide win and loss totals and from those deduce the probability that X > Y, or that the probability of X is greater than a given value, etc.

For example, if player A beats player B 7 times and loses 3 times, then this looks like it might be a good indicator that A is stronger than B. However, a dip into BINOM.EXE gives the following analysis (showing only part of its output):

  	BINOMIAL v6 Aug2004 - Jeff Rollason
	
  	Population n=10, a : b 7 : 3 'p' calculated from a/n is 0.7000
	
  		The probability that A > B (p>q) is 0.886719
    				    and that A < B (p<q) is 0.113281
					
  		p=0.887 expressed as a fraction/ratio is ( 7.8 : 1 )
		
  		Results 7 + 3 give the following range of values for 'p' (as %) within 0.01%
		
    			'p' within 50% probability within range 58.8% <-- 70.0% --> 78.8%
    			'p' within 67% probability within range 54.2% <-- 70.0% --> 82.3%
    			'p' within 95% probability within range 38.0% <-- 70.0% --> 91.7%
    			'p' within 99% probability within range 29.0% <-- 70.0% --> 95.4%
  

This tells us that it is indeed more probable that A is stronger than B, but that the chance that A > B is actually just 88.7%, so B might still be stronger (an 11.3% chance): this is simply too small a sample.

If we extended the play trial further and reached a score of 49 wins for A and 33 for B, then this is a much less impressive win percentage, but now we can feel much more confident that A is stronger:

      BINOMIAL v6 Aug2004 - Jeff Rollason
	
      Population n=82,  a : b   49 : 33   'p' calculated from a/n is 0.5976
	
     		The probability that A > B (p>q) is 0.960790
    				    and that A < B (p<q) is 0.039210
				
    	 	p=0.961 expressed as a fraction/ratio is ( 24.5 : 1 )
	
    	 	Results 49 + 33 give the following range of values for 'p' (as %) within 0.01%
  
     			'p' within 50% probability within range 56.0% <-- 59.8% --> 63.3%
    			'p' within 67% probability within range 54.5% <-- 59.8% --> 64.8%
    			'p' within 95% probability within range 48.9% <-- 59.8% --> 69.9%
     			'p' within 99% probability within range 45.5% <-- 59.8% --> 72.9%

Now our confidence is 96% that A is stronger.
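
As an aside, although BINOM.EXE itself is not listed here, the "probability that A > B" figures in the two outputs above can be reproduced with a short Bayesian calculation: give the unknown win probability p a uniform prior, so that after w wins and l losses the posterior is Beta(w+1, l+1), and ask how much of that posterior lies above p = 0.5. The sketch below (Python with scipy) is my own reconstruction on that assumption, not the program's actual code, but it matches the figures shown above.

    # Reconstruction (not BINOM.EXE's own code) of the P(A > B) figure:
    # assume a uniform prior on the win probability p, giving a
    # Beta(wins + 1, losses + 1) posterior, and take the posterior mass
    # above p = 0.5.
    from scipy.stats import beta

    def prob_a_stronger(wins, losses):
        """Posterior probability that the underlying win rate exceeds 0.5."""
        return beta.sf(0.5, wins + 1, losses + 1)   # sf = 1 - CDF

    for w, l in [(7, 3), (49, 33)]:
        p = prob_a_stronger(w, l)
        print(f"{w} : {l}   P(A > B) = {p:.6f}   ({p / (1 - p):.1f} : 1)")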

So What is the Problem?

The above process invites potentially dubious practices. If you have a set of results that do not conform to the confidence limits you desire, the temptation is to just keep churning out results until the threshold is satisfied. This has some validity, but it invites the user to stop at the point where randomness has briefly pushed the confidence bounds over the desired threshold. If the experiment had been continued there is a good probability that the confidence level would drop back, even if only briefly. It is also possible that the threshold will be crossed briefly, never to return on more substantial sampling.
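
The cost of this kind of "peeking" is easy to demonstrate. The Monte Carlo sketch below (Python with scipy; the run length and threshold are purely illustrative choices of mine) pits two identical players against each other, checks the confidence after every game using the same Beta-posterior measure as above, and stops the moment 95% confidence is reached. Even though neither player is stronger, a run declared "significant" at some point turns up far more often than the nominal 5% would suggest.

    # Sketch of the "stop as soon as the threshold is crossed" trap.
    # Two identical players (true p = 0.5); confidence is checked after every
    # game and the trial stops the moment it exceeds 95%. The maximum run
    # length and the threshold are illustrative choices.
    import random
    from scipy.stats import beta

    def peeking_run(max_games=1000, threshold=0.95):
        wins = losses = 0
        for _ in range(max_games):
            if random.random() < 0.5:
                wins += 1
            else:
                losses += 1
            # Same confidence measure as the BINOM.EXE examples above.
            if beta.sf(0.5, wins + 1, losses + 1) > threshold:
                return True    # stop here and (wrongly) declare A stronger
        return False

    trials = 2000
    false_alarms = sum(peeking_run() for _ in range(trials))
    print(f"Declared 'significant' at some point: {false_alarms / trials:.1%}")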

This stopping trap highlights the key assumption on which these statistical distributions depend: the results must be random samples from the distribution, not samples the observer has somehow selected. Abuse of this assumption easily leads to a number of paradoxes:

The traps of inadvertently selecting your "random" samples

Armed with your trusty BINOM.EXE, you can march on, accepting and rejecting sets of results depending on whether they reach the level of confidence you require in order to assert that the results are good.

A simple example where such a strategy is flawed is as follows:

Imagine that you have a games-playing program with a tuneable, parameterised evaluation function. Given this, a plausible means of finding the correct values for the evaluation is simply to run multiple trials with randomised values. From these you pick the trial with the best result and this then becomes your "best" set of parameters. To add confidence you run that result through BINOM.EXE to confirm that, within statistical bounds, these are indeed good parameters.

This all sounds very reasonable, but let's construct an example that shows just how flaky this strategy can be.

A Multiple trial Paradox

In the pseudo-example below we emulate the testing of 10 possible parameter sets by instead just running 10 identical random trials of "heads-or-tails". This clearly should show no meaningful differentiation between trials, as they all test exactly the same thing. The following is a completely random trial run on this basis.

 
 	  1. Wins  52  Losses  48
 	  2. Wins  48  Losses  52
 	  3. Wins  61  Losses  39
 	  4. Wins  50  Losses  50
 	  5. Wins  53  Losses  47
 	  6. Wins  45  Losses  55
 	  7. Wins  54  Losses  46
 	  8. Wins  65  Losses  35
 	  9. Wins  47  Losses  53
	10. Wins  44  Losses  56

However here we can see that our 3rd and 8th trials are skewed away from the population expectation of 50/50. If we run the most skewed of these, trial 8, through BINOM.EXE:

	Population n=100,  a : b    65 : 35    'p' calculated from a/n is 0.6500
	
		The probability that A > B (p>q) is 0.998673
    				    and that A < B (p<q) is 0.001327
					
  		p=0.999 expressed as a fraction/ratio is ( 752.6 : 1 )
		
  		Results 65 + 35 give the following range of values for 'p' (as %) within 0.01%
		
    			'p' within 50% probability within range 61.7% <-- 65.0% --> 68.1%
    			'p' within 67% probability within range 60.3% <-- 65.0% --> 69.4%
    			'p' within 95% probability within range 55.3% <-- 65.0% --> 73.9%
    			'p' within 99% probability within range 52.2% <-- 65.0% --> 76.4%
  

This analysis is telling us that the probability that A > B is now 99.8673%!! This is clearly nonsense. However, if you ran such a test with real parameters it would appear that trial 8 is a good result. Here this simplistic approach to statistics has not been very helpful. Of course, in the limiting case, we could run a huge number of trials, in which case the conclusions would be relatively sound; with infinite trials the results would be exactly correct. However running huge numbers of trials is not helpful, as it will undoubtedly take too long.
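
The effect is easy to quantify with a small simulation. The sketch below (Python with scipy; the trial sizes and the 99% threshold are illustrative choices of mine) repeatedly runs 10 identical fair trials, takes the best of the 10, and asks how often that best trial alone would clear a 99% confidence level. It also runs the same test with trials ten times longer which, as the next section finds empirically, does not make the problem go away: under the null, the naive confidence of a single trial barely depends on the trial length, so neither does the chance that the best of 10 crosses a fixed threshold.

    # Sketch of the multiple-trial trap: 10 identical "parameter sets"
    # (all fair coins), yet the best of the 10 often looks convincingly
    # strong when judged on its own. Trial sizes and the 99% threshold
    # are illustrative.
    from scipy.stats import beta, binom

    def best_of_ten_confidence(sets=10, games=100):
        """Run 'sets' identical fair trials; return the naive confidence of the best."""
        best_wins = max(binom.rvs(games, 0.5) for _ in range(sets))
        return beta.sf(0.5, best_wins + 1, games - best_wins + 1)

    runs = 2000
    for games in (100, 1000):
        fooled = sum(best_of_ten_confidence(games=games) > 0.99 for _ in range(runs))
        print(f"{games} games per trial: best-of-10 clears 99% confidence "
              f"in {fooled / runs:.0%} of runs")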

So maybe we can Tweak this?

We could just run each trial with 10 times the number of games:

 
 	  1. Wins  501  Losses  499
 	  2. Wins  507  Losses  493
 	  3. Wins  495  Losses  505
 	  4. Wins  495  Losses  505
 	  5. Wins  518  Losses  482
 	  6. Wins  492  Losses  508
 	  7. Wins  492  Losses  508
 	  8. Wins  525  Losses  475
 	  9. Wins  483  Losses  517
	10. Wins  504  Losses  496

Now the figures look flatter. Even the more skewed of these, such as the 5th, for which BINOM.EXE gives the chance that A > B as 87.2%, are poor enough for us to ignore.

Are we safe?

On repeating this same trial again I got:

 
 	  1. Wins  497  Losses  503
 	  2. Wins  516  Losses  484
 	  3. Wins  518  Losses  482
 	  4. Wins  521  Losses  479
 	  5. Wins  485  Losses  515
 	  6. Wins  497  Losses  503
 	  7. Wins  518  Losses  482
 	  8. Wins  489  Losses  511
 	  9. Wins  543  Losses  457
	10. Wins  490  Losses  510

Now we have one substantially skewed result, the 9th one, and this time BINOM.EXE tells us that the chance of A > B is 99.67%, which again is nonsense.

So we have shown that this method is looking highly suspect and simply piling in substantially longer runs is not going to remove the problem.

At this point I do not have a magic statistical method in my toolbox to offer for testing the significance of such trials. However, as it stands, this approach clearly does not work.

It is this paradox that also contributes to many people observing "clearly non-random dice rolls" in our backgammon program. While the dice rolls look normal, the user ignores them. Once they eventually see a long sequence of doubles, they present some maths showing that the chance of this small-sample result is very low, but this ignores all the times that the rolls were normal.

Looking at the backgammon paradox another way, millions of people are playing our program, and statistically most of these people will see normal dice rolls, but a large number will also see skewed ones. The chance that some people will encounter such skewed results is very high indeed.

To add a further twist to this paradox, just as many people will see the same effect giving them "good" dice rolls as see "bad" rolls, but they never complain, as they could not imagine that the program might be cheating in their favour!
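
The scale of this effect is worth illustrating, even with made-up numbers. The sketch below (plain Python; the session length, streak length and player count are purely illustrative assumptions of mine, not measurements from our game) estimates how often a session of perfectly fair rolls contains a run of four or more doubles, and then how likely it is that at least one player in a million sees such a run. The individual streak looks unlikely; across the whole player base it is all but guaranteed.

    # Illustration of the backgammon observation: with perfectly fair dice, a
    # run of doubles that looks suspicious to one player is near-certain to be
    # seen by someone in a large population. All numbers below are
    # illustrative assumptions, not measurements from the real game.
    import random

    def session_has_streak(rolls=300, streak=4):
        """Does one session of fair rolls contain 'streak' consecutive doubles?"""
        run = 0
        for _ in range(rolls):
            if random.randint(1, 6) == random.randint(1, 6):   # a double: 1 in 6
                run += 1
                if run >= streak:
                    return True
            else:
                run = 0
        return False

    sessions = 20000
    per_player = sum(session_has_streak() for _ in range(sessions)) / sessions
    print(f"P(a single session shows 4+ doubles in a row) ~ {per_player:.1%}")
    print(f"P(at least one of 1,000,000 players sees it)  ~ "
          f"{1 - (1 - per_player) ** 1_000_000:.6f}")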

Defensive strategies for testing multiple trials

The above paradox demonstrates an issue that has, at many points in my game development history, managed to lead me astray. I have seen it also undermine colleagues pursuing similar development goals. The constraints imposed by such development require you to limit the amount of testing so that you can efficiently progress to the next stage and test more parameters. This need for limited testing naturally steers you into making this kind of mistake.

Superficially this looks a bit terminal, given that radically increasing the sample sizes does not solve the problem. If nothing else, you need to be aware of the paradox.

However there are ways to reduce the danger of following bad test results. In many ways the following method mirrors some of the methods I use in game evaluation. The basis of this is not to prove the validity of the results at each stage but to reduce the probability that bad results are selected for a given effort.

If we consider the original trial, but add in the emulation of one "significant" set (set 1 is skewed so that it wins 55% of the time), we can now repeat this trial 3 times:

     1.  Wins 58  Lose 42    Wins 64  Lose 36    Wins 62  Lose 38    WWW
     2.  Wins 49  Lose 51    Wins 51  Lose 49    Wins 52  Lose 48    LWW
     3.  Wins 47  Lose 53    Wins 44  Lose 56    Wins 56  Lose 44    LLW
     4.  Wins 45  Lose 55    Wins 48  Lose 52    Wins 61  Lose 39    LLW
     5.  Wins 49  Lose 51    Wins 52  Lose 48    Wins 36  Lose 64    LWL
     6.  Wins 45  Lose 55    Wins 46  Lose 54    Wins 56  Lose 44    LLW
     7.  Wins 50  Lose 50    Wins 47  Lose 53    Wins 43  Lose 57    =LL
     8.  Wins 41  Lose 59    Wins 56  Lose 44    Wins 49  Lose 51    LWL
     9.  Wins 62  Lose 38    Wins 56  Lose 44    Wins 53  Lose 47    WWW
    10.  Wins 52  Lose 48    Wins 46  Lose 54    Wins 51  Lose 49    WLW

In the output above, the net result of each run (W, L or =) is marked at the end so that the winning and losing sets can be seen easily. From this our "significant" test set 1 did achieve a positive result, but so did sets 9 and 10; in fact set 9 scored a "better" result.

A strategy for handling this test set is simply to run the first of these trials and then weed out "unpromising" results. We could just take the top 50% of these sets, 1, 2, 5, 9 and 10, even though some of these were losing results. We then re-test these sets in a second trial and repeat the process to narrow down to a further subset of 3 for a 3rd trial.

Taking this simplistic approach we would then examine sets 1, 5 and 9 from the 2nd trial and test just these on the 3rd trial. From that the skewed set 1 is correctly chosen in the 3rd trial.

Of course this is just one trivial example trial, but repeating this model over a long sample gave a 55% chance of the skewed sample set correctly being selected as the top set.

How does this compare to a flat trial of 180 games per set? ("180" matches the total number of games in the 3-stage trial above, i.e. 10x180 = 10x100 + 5x100 + 3x100.)

This 10x180 flat trial gives a 45% chance of the top set being selected, so the staged trials with eliminations offer a better probability that the best set is chosen than a flat trial using the same total number of games.
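
The figures above came from repeated runs of the author's model, but the comparison is easy to re-create. The sketch below (Python with scipy) pits the staged 100 + 100 + 100 elimination scheme against a flat 180-games-per-set trial, with one set winning 55% of its games and the other nine being 50/50. One assumption of mine: each elimination stage ranks the sets on that stage's wins only, as in the worked example above. The printed percentages will vary a little from run to run, but should show the same staged-versus-flat ordering.

    # Monte Carlo sketch of staged elimination versus a flat trial. Set 0 is
    # the "skewed" 55% set; the other nine are 50/50. The per-stage selection
    # rule (rank on that stage's wins only) is an assumption of this sketch,
    # and ties resolve in favour of lower-numbered sets, which is fine here.
    from scipy.stats import binom

    def staged_pick(sets=10, games=100, survivors=(5, 3, 1)):
        """100 games each, keep the top 5, 100 more, keep 3, 100 more, keep 1."""
        candidates = list(range(sets))
        for keep in survivors:
            wins = {s: binom.rvs(games, 0.55 if s == 0 else 0.5) for s in candidates}
            candidates = sorted(candidates, key=lambda s: wins[s], reverse=True)[:keep]
        return candidates[0] == 0          # did the 55% set survive to the end?

    def flat_pick(sets=10, games=180):
        """A single flat run of 180 games per set; pick the top scorer."""
        wins = [binom.rvs(games, 0.55 if s == 0 else 0.5) for s in range(sets)]
        return wins.index(max(wins)) == 0

    runs = 5000
    print(f"staged: {sum(staged_pick() for _ in range(runs)) / runs:.0%} pick the 55% set")
    print(f"flat  : {sum(flat_pick() for _ in range(runs)) / runs:.0%} pick the 55% set")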

Conclusion

This is a simple test example, but the general idea holds. It offers a better chance of finding the best parameter set by simply eliminating the likely weaker sets early on. If a parameter set does badly in the first trial there is a good probability that it is not the best set, so why not drop it from the trial?

Arguably this is dangerous, as we are throwing away sets that might actually be the best on the basis of just a few tests, but note that a larger flat trial still carries the danger that the best set does not come top.

The success of this makes sense: it is squeezing more relevant information from limited testing. It tests the sets that are more likely to be the correct sets to test, at the expense of not having a complete ordered assessment of all sets. This has some parallel with alpha-beta, which also strives to select the best result, but at the expense of losing the ordering of "losing" moves. It also has a parallel with a knock-out tournament compared to a round-robin: the former has fewer matches but still does a good job of finding the "best" competitor, at the expense of losing the ordering of the losing competitors.

Note that the model above is very simple. In practice it can be made much more effective by setting cut-offs based on distance from the best set, so that an ordered set of percentage wins of 55, 54, 50, 20, 20, 15, 12, 11, 10 might select just 55, 54, 50, whereas 55, 54, 50, 50, 49, 15, 12, 11, 10 might select 55, 54, 50, 50, 49. This squeezes more information from the limited trials and allows more efficient testing.
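
A cut-off of this kind can be expressed in a couple of lines. In the sketch below the margin of 6 percentage points is an arbitrary choice of mine, made so that it reproduces the two example selections above; in practice it would be tuned, or derived from the confidence bounds discussed earlier.

    # Sketch of a distance-from-best cut-off: keep every set whose win
    # percentage is within 'margin' points of the current leader. The margin
    # of 6 is an arbitrary illustrative choice.
    def select_survivors(win_percentages, margin=6.0):
        best = max(win_percentages)
        return [w for w in win_percentages if w >= best - margin]

    print(select_survivors([55, 54, 50, 20, 20, 15, 12, 11, 10]))
    # -> [55, 54, 50]
    print(select_survivors([55, 54, 50, 50, 49, 15, 12, 11, 10]))
    # -> [55, 54, 50, 50, 49]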

Jeff Rollason - May 2011