Something will get that best score: Avoiding delusions in block testing

Author: Jeff Rollason

Copyright Notice: This article is Copyright AI Factory Ltd. Ideas and code belonging to AI Factory may only be used with the direct written permission of AI Factory Ltd.

In previous articles we have addressed the issue of AI testing and how to avoid the pitfalls of quirks in AI behaviour and testing regimes. These showed how easily you can fool yourself into believing that your testing was viable and useful. Some of these articles also considered the pitfalls of testing purely from a statistical point of view. The primary focus has been on the incremental modification of a heuristically-driven linear evaluation, a style of development that has dominated game-playing AI over much of the last 50 years. In more recent times neural-net-driven AI based on deep learning or automated tuning has become more common. However, even then you still need a framework to test whether your latest AI version is an improvement or not, so this still shares the same pitfalls identified here.

This article separates the nature of the AI from the testing itself and looks at a detailed emulation of testing driven purely by random numbers. This removes the characteristics of the AI and instead just tests the numbers.

From this it examines the properties of various approaches from a statistical perspective. This confines the examination to exclude any vagaries of AI behaviour, but still demands decisions that fall outside pure statistics!


Background

This article also attempts to draw together all previous articles I have published in this area and drill down further to test a practical agenda, supported by real test results. Given this, and the large number of previous articles, it seems prudent to provide a summary of these previous articles to offer a solid back reference to complete the background.

Below is a review of the previous articles on AI testing in chronological order.

Developing Competition-Level Games Intelligence:

This starts head-on with the issue of raw statistical significance and how marginal improvements need much more testing. It moves on to automated testing, testing against multiple opponents, test suites and the creepy issues where bugs can appear to improve performance!

Statistical Minefields with Version Testing:

This expands further on the above, offering up our bespoke BINOM.EXE program for reverse engineering the binomial distribution, our automated round-robin test suite and the need to continue round robin until results are consistent.

Digging further into the Statistical Abyss:

This exposes the problems of using statistical tests on selected trials from many trials, using coin toss as the base example. It proposes running multiple trials of a set of test cases, eliminating the weakest test cases and continuing until a best test case is consistently top. The object here is increasing the efficiency of block testing given the high volume of testing that might be needed.

Lessons from the Road - Championship level AI development:

This looks at a particular historical example, the Japanese game Shogi, where running large blocks of games to assess statistical significance was not an option, as fast hardware was scarce. It instead took a heuristic approach. The system, created in 1996, compared game records and noted where deviations led to wins or losses. All this testing was against just one fixed rival opponent. From this, the variations tested were ranked by how often their deviations led to a win or a loss. This provides a highly targeted assessment that depends only on comparing the points where there were deviations, not just gross win/loss game counts. This data was injected into a rating program (see The Rating Sidekick) that allowed the results of many version head-to-head comparisons to be combined, and from this a "rating" was created for each version.

This system proved to be extremely efficient and quickly delivered a competitive Shogi program that defeated 3 world champions in the year that they were world champion and in two years this program was officially ranked #3 in the world.

Developing in Fog:

This article examines the issues of testing and assessing Monte Carlo Tree Search (MCTS) with imperfect information (such as Spades) instead of Minimax with perfect information (such as Shogi and Chess). This introduces more opacity, making many development assessments that depend on analysing a linear evaluation impossible, as MCTS only provides a statistical output. There is no hash tree that can be lifted after a move to trace the defective variation.

Keeping on the Tightrope:

This looks at the idea of converting the multiple binaries used in Lessons from the Road - Championship level AI development above into a single source system that supports all variations. This offers no opportunity for automated testing against external opponents, as this generic system needed to be shared by all our game engines.

This exposes the risk of corrupting your test procedures. An overlooked bad edit might create a crippling weakness, but the incestuous testing procedure may not detect this and instead deliver the same weakness across all variations tested. This is high risk. The article also considers a generic system of running games against other versions using moves written to the file system. This would avoid the risks of accidental corruption, as any version could be compared against older historical binary versions. This system had not been built, as we did not have a generic framework that supported all game types. However the existing generic facility for exporting games did allow old binaries to test games played by new binaries. This however was a game-by-game test, not a block test. It identified the problem and the desirable goal, but did not actually deliver a solution.

Mixing Development Practices:

This takes the broad sweep of how you can analyse a search-based program's progress, from the black-box approach of just counting the win numbers to the "right-in-close" approach of tinkering with the evaluation to solve some single node test case. These are extreme opposites. The article outlines 5 additional strategies to bridge these two extremes. In the end developers are likely to mix and match such approaches, probably driven by some combination of hunches and convenience.

Adding New Muscle to Our AI Testing:

This article shows the experience of using deeper tools that make use of AI character control, the RATING program and a new suite to cross-analyse different program versions. This significant system, introduced for the game Spades, shared some of the favourable characteristics of the successful system used for Shogi outlined in Lessons from the Road - Championship level AI development. Its primary benefit was an effective framework for automatically choosing the direction of development. This worked very well, lifting the burden of manual test selection from the developer and avoiding the easy errors that arise from manually assessing test results. This new automated system did a much better job.

Unwelcome Testing Realities:

This article follows on from the article Adding New Muscle to Our AI Testing above. It raises again the issue of why managing block testing is critical. In this case, using a 32-processor multicore setup, the solution is to run early speculative follow-on trials before the first trial completes. It shows how the more complex breakdown detailed in the above article can help expose flaws.

Testing Methodology Traps:


The article provides 4 strategies for dealing with these testing traps, none of which is definitively effective. However these may still be the best paths available.

Avoiding Issues of Test Result Significance by Running Parallel Test Rigs:

This looks at exploiting hardware where you have many processors available. You can simply run the same test suite many times in parallel. The purpose is to continue the tests until they converge to agree. This simple approach avoids needing to spin out a stats program to assess results to figure out when the test should end. It also provides an assessment means when there is not an easy statistical approach to take (e.g. not binomial).

However this short article also looks at the valuable idea of running early speculative follow-on trials before the first trial completes. This draws conclusions from early test results to anticipate which follow-on tests are most likely to be needed. As the first test progresses the conclusion may change, in which case the earlier speculative tests can be junked and replaced by new tests. This removes the stress of the question "do we have enough information to be confident about the test result so far?" Instead the developer can start follow-on tests when only marginally confident, knowing that the anticipated tests may save considerable testing time and, at worst, result in the final follow-on tests only being run at the last moment. The net effect is that you are no worse off than running purely serial tests, but in practice the total test time is likely to be substantially reduced.

The Issues of Multiple Trials:

This article returns to the problems that arise when you run multiple tests. A single test can be statistically analysed with credibility, but this goes out the window when you apply the same statistical test to each of multiple trials. As soon as you are testing many trials, some fraction of these will fall outside the bounds that the stats would project for a single trial. This issue was core to two more articles in this periodical that are not listed above: The Self-Perpetuating Conspiracy Paradox and Computer Backgammon Programs Cheat: Urban Myth??. In this case the anomaly that statistics that work for a single trial do not work for many repeated trials was the foundation for undermining the credibility of our Backgammon program, which (like all Backgammon programs) was accused of cheating because throws fell outside the results you would expect of a solitary trial.

Here we will test some properties of multiple trials, with no AI-specific input to fog the results.

The Setup for the Test

First we will factor out any issue of AI evaluation and simply run a trial based on random numbers. To do this we set up 32 imagined testable emulated program versions that are initially rated 50% each, or "strength" 50. Each of these is then modified to provide a range of ratings by randomly adding a value in the range -5.5 to +5.5 to the rating "N" times and then dividing that net added value by N. So if N is very large then the added value will approximate to zero. If N is small then the value of each rating will be somewhere in the absolute range of 50 +/- (5.5 * N). We actually used a value of N=6. If you run this with N=6 for 2 million repetitions it delivers a bell-shaped distribution of emulated AI opponents as shown in figure 1 below:


Figure 1: The Test distribution of Emulated AI Strength

This distribution gives 50% of emulated ratings inside the bounds 49.1% to 50.9% and 75% within 48.5% to 51.5%. This emulates a fairly narrow range of performance that, in chess, would correspond to about +/- 6 ELO points or less for 50% and +/- 10 ELO points or less for 75% of emulated versions.
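
To make this setup concrete, here is a minimal Python sketch of the generator just described. It follows the verbal description literally (N uniform offsets, normalised by N); the original test harness is not published, so treat this as an illustrative assumption rather than the actual code.

    import random

    N = 6          # number of random offsets per emulated version
    OFFSET = 5.5   # each offset is uniform in the range -5.5 to +5.5
    BASE = 50.0    # neutral starting strength

    def emulated_rating(base=BASE):
        # Add N uniform offsets to the base rating, normalised by N as described.
        return base + sum(random.uniform(-OFFSET, OFFSET) for _ in range(N)) / N

    # Re-create a large sample of emulated opponents, as behind figure 1.
    sample = sorted(emulated_rating() for _ in range(2_000_000))
    print("central 50% of ratings lie between",
          sample[len(sample) // 4], "and", sample[3 * len(sample) // 4])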

This is a reasonable basis for imagining testing tweaks to AI, where a single incremental version is unlikely to create shifts in ELO greater than the values above. Note that an ELO difference of 100 would correspond to the stronger side scoring about 64%, so a very substantial shift. It is more probable that tested AI changes would mostly deliver much less than this.
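
For reference, the standard Elo expected-score formula behind these conversions (a general chess convention, not specific to this article) is:

    expected score = 1 / (1 + 10^(-ELO difference / 400))

so a 100-point difference gives 1 / (1 + 10^-0.25), which is roughly 0.64, while an expected score of 50.9% corresponds to a difference of roughly 6 points.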

These probabilities carry over to the 32-sample distribution below.

The value of N of 6 gives a single example distribution like the one in figure 2 below:


Figure 2: Distribution of emulated AI strength for just 32 samples

This single example skews mostly below 50 for the 32 emulated opponents, each with 6 added values between -5.5 and +5.5. In this case the best emulated AI has a strength of 51.8 and the weakest 47.3.

This is then just an example sample of an emulated set of opponents that would arise for testing.

These emulated versions can then be tested against randomised opponent values in the range 1 to 100. If the version's strength is greater than the random value it scores a win, otherwise a loss. The net win count then becomes the performance. To increase the accuracy of such a test this can be repeated 100,000 times to compensate for random variation, with each of these 100,000 tests re-creating a new batch of emulated versions. The average test result would then track the average best possible strength and the average actual achieved strength.
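
A minimal sketch of this scoring step, reusing the hypothetical emulated_rating() helper from the earlier sketch (the win test against a random value in the range 1 to 100 is one plausible reading of the description above):

    import random

    N, OFFSET, BASE = 6, 5.5, 50.0

    def emulated_rating(base=BASE):
        return base + sum(random.uniform(-OFFSET, OFFSET) for _ in range(N)) / N

    def score(rating, games=200, lo=1, hi=100):
        # One "game": a win if the version's strength exceeds a random opponent value.
        return sum(rating > random.uniform(lo, hi) for _ in range(games))

    batch = [emulated_rating() for _ in range(32)]    # 32 candidate versions
    scores = [score(r) for r in batch]                # 200 games each
    print("promoted by score:", batch[scores.index(max(scores))])
    print("actually strongest:", max(batch))

Running this a few times quickly shows that the top scorer is often not the truly strongest version.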

Running the Trials

There are multiple things that could be tested. In this first case we can look at testing 32 versions and promoting the one with the largest winning score to a follow-on re-trial. This promotion can be repeated multiple times to emulate the creation of multiple improved versions. This emulates a developer testing versions and promoting the best as a base for future testing. Note that with each promotion the version chosen will not necessarily be the strongest. Promotion uses the top score to select a version, but what is carried forward is that version's actual preset emulated rating, not its test score. The next iteration then generates a new range of emulated versions around the promoted rating, again adding 6 values in the range -5.5 to +5.5. In the first iteration this would be 50 plus -5.5 to +5.5. In the second iteration it might be 53.3 plus -5.5 to +5.5 or even 49.6 plus -5.5 to +5.5. The implication is that iteration 2 might start with an inferior version.
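
As a sketch, one promotion step under these rules might look like the following, reusing emulated_rating() and score() from the sketch above (the exact selection mechanics in the real harness may differ):

    # Select by test score, but carry forward the chosen version's real preset
    # rating (not its score) as the base for the next batch of candidates.
    def promote(base, versions=32, games=200):
        batch = [emulated_rating(base) for _ in range(versions)]
        scores = [score(r, games) for r in batch]
        return batch[scores.index(max(scores))]

    base = 50.0
    for iteration in range(10):     # emulate 10 successive "improved" versions
        base = promote(base)
    print("final promoted rating:", base)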

First let us look at just a single iteration testing 32 versions, each for 200 test results. This exposes a common issue.

The diagram below shows a single randomly chosen example iteration of this process with 32 tests, each randomly tested 200 times, comparing a random value of 1->100 against the emulated preset. If the random value is lower then the emulated version scores a point. With an infinite number of tests an emulated test version with a neutral rating of 50 would score hits on 50% of randomised trials in the range 1->100.

In the table below, figure 3, the 32 tests are sorted by the real emulated rating, which in this case has a maximum value 52.75 and minimum 47.45. So the top line Rank 1 is the strongest emulated rating. The score in the last column is the result of the testing. The bar graph is in proportion to this score. The green section shows where the score overestimates and red where it underestimates the rating of the version tested.


Figure 3: A single block emulation of 32 versions

The obvious feature here, in this random test sample, is that the top scoring version was actually ranked 21 of 32, with a real emulated rating of 49.71, giving it the highest score of 61. This is a bad result and shows that even with 200 tests (200 games per emulated version) the process has delivered a version with a real rating of 49.71 as the best version. The actual best version is rated 52.75.

From the bar graph though you can see that the typical error is generally much lower. This makes the point that you need just one flaky test among many to show a big deviation, and that single deviation may well then be flagged as the strongest.

If you apply the binomial distribution to that single test, which actually scored 122 wins to 78 losses, this gives the following value:

The probability that A > B (p>q) is 0.999080

So the score of 122 wins and 78 losses implies that the tested version is 99.908% certain to be stronger, but it is actually weaker. That corresponds to one chance in 1086 that it is weaker. This shows how you cannot use the binomial distribution to test cherry-picked results. This point is already made in many of the articles listed above.
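
For readers who want to reproduce a confidence figure of this kind, one common formulation is the exact one-sided binomial tail: the chance of scoring at least 122 wins from 200 games if the two versions were really equal. Our BINOM.EXE may well use a different formulation, so the exact digits may not match, but the value lands in the same region:

    from math import comb

    def prob_stronger(wins, losses):
        n = wins + losses
        # Chance of doing at least this well by luck alone (fair 50/50 games).
        tail = sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n
        return 1.0 - tail

    print(prob_stronger(122, 78))   # prints a value of roughly 0.999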

However let us look at the tangible impact of iterating on the basis of such tests.

Running an Iterated Trial

We want to simulate a developer creating 10 successive versions of the program, each an "improved" version of the previous one, thereby emulating a period of time over which a developer might work on a program. The emulation will give some indication of the developer's likely progress, given the testing regime they follow.

The emulation will progress through 10 successive versions. If no version tested appears to improve the current base version for any iteration, then that iteration is reset and re-played, i.e. reaching 10 new versions might need 11 or more iterations. We need to model this around a plausible AI testing situation. We have a defined range of emulated test versions, with rating 50 +/- 5.5 points applied 6 times. The choice of testing 32 versions stems from the idea that we might plausibly have access to 32 processors, which invites the idea of simultaneously testing 32 different versions.

We then have to consider how many "games" each version must play to provide an assessment (Note this set of games will be replayed 100,000 times, each with a new set of emulated versions to test, to increase the accuracy of the final emulation).

The chart below in figure 4 shows an emulated trial run for 50, 100, 200 and 400 tests, emulating win/loss counts.


Figure 4: 32 versions tests for 50,100,200 and 400, testing against 1->100

The chart above shows this emulated trial run for 50, 100, 200 and 400 tests, emulating win counts. This is repeated 100,000 times for each. This plots the following:

(1) The average best possible improved evaluation, assuming that for each iteration the best (highest value) version is actually promoted. This is the theoretical ideal, assuming a perfect selection each time.
(2) The average actual achieved improved evaluation. This is dependent on the number of tests (50, 100, 200 or 400) used to select the expected best version to promote. Each time the promoted version is the highest-scoring one, which might not be the actual strongest version.
(3) The worst achieved progression. This is simply the worst case from the 100,000 repeated runs.
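
The following Python sketch pulls the whole emulation together under the assumptions already noted. It is illustrative, not the original harness; in particular the exact rule for judging that "no version appears to improve the base" is not pinned down above, so here the top score is compared against the score the base rating would be expected to achieve.

    import random

    N, OFFSET = 6, 5.5

    def emulated_rating(base):
        return base + sum(random.uniform(-OFFSET, OFFSET) for _ in range(N)) / N

    def score(rating, games, lo=1, hi=100):
        return sum(rating > random.uniform(lo, hi) for _ in range(games))

    def expected_score(rating, games, lo=1, hi=100):
        # The score a version of known rating would average against this opponent range.
        p = min(max((rating - lo) / (hi - lo), 0.0), 1.0)
        return p * games

    def one_run(games, versions=32, promotions=10, lo=1, hi=100):
        base = ideal = 50.0
        for _ in range(promotions):
            # "Best possible" path: always promote the truly strongest candidate.
            ideal = max(emulated_rating(ideal) for _ in range(versions))
            # "Achieved" path: promote the top scorer, re-running the iteration
            # if no version appears to beat the current base.
            while True:
                batch = [emulated_rating(base) for _ in range(versions)]
                scores = [score(r, games, lo, hi) for r in batch]
                if max(scores) > expected_score(base, games, lo, hi):
                    break
            base = batch[scores.index(max(scores))]
        return ideal, base

    runs = [one_run(games=200) for _ in range(1000)]   # the article uses 100,000
    print("average best possible:", sum(i for i, a in runs) / len(runs))
    print("average achieved:     ", sum(a for i, a in runs) / len(runs))
    print("worst achieved:       ", min(a for i, a in runs))

The narrowed opponent ranges discussed later are then just different lo and hi values passed to the same run.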

Examining these charts it can be seen that choosing from 50 tests offers very poor progression, as might be expected. The orange plot falls well below the blue optimum progression, scoring just 12.3% of the improvement of the average theoretical best ((52.92-50)/(73.66-50)), and the grey plot shows that the final version could fall substantially below the original version, therefore much weaker. We can also see that 20.75% of all runs will deliver a final weaker version! Considering this is for just 50 tests, perhaps this is to be expected.

However note that if you are testing some AI that might take some considerable time to complete a game, then only running 50 tests might be tempting. Further to this though, even if we increase to 400 tests (games) then it is still possible that the final version might be weaker than the first, although only with a 0.07% chance. However the expected achieved performance is still less than half of the average theoretical best at 44.4%. That is still disappointing.

Tweaks to improve this

A defect in the above is that each emulated version is tested against an opponent range 1->100, which is wasteful as our emulated versions will lie inside the absolute bounds 50-(5.5*6) to 50+(5.5*6), or 17 to 83. Any test value less than 17 or greater than 83 would always yield the same result. We can instead confine the opponent range. Emulating this for the ranges 20->80, 25->75, 30->70 and 35->65 showed that 30->70 was the optimum range, as follows in figure 5:


Figure 5 : 32 versions tests for 50,100,200 and 400, testing against 30->70

With this narrowed range of opponents the performance with just 50 tests is still poor, but the chance of a weaker final version has reduced from 20.75% to just 0.68%. The average achieved version is stronger but still weak at about 36% of the improvement of average theoretical best at 73.66. Jumping ahead to 400 tests the chance of a weaker version is eliminated and the average achieved version is much closer to the average theoretical maximum, achieving some 79% of the maximum average theoretical improvement.

If you are expecting to test versions of the AI with a full evaluation then 400 games may well be just too many. However the test could be modified to test only 8 versions instead of 32, and instead run 4 times as many games per version to deliver the same total number of tests. We then have 8 x 1600 replacing 32 x 400. This delivers the following result in figure 6:


Figure 6: 8 versions tests for 50,100,200 and 400, testing against 30->70

Of course we have made no time saving here but have just generated a more accurate result in the same time. The first obvious detail is that the average theoretical maximum value has diminished from 73.66 to just 66.56 (the blue graph is lower), so the average best possible target has reduced. The average achieved results are much closer to this lower maximum, managing 94% of the possible improvement (the consequence of replacing 400 by 1600). However this drops to 66% if compared against the average theoretical maximum reached by the 32-version test.

This appears to be a curious result, where testing versions for longer delivers an inferior outcome. Of course it is because we are testing fewer versions, so there is a greater chance that we overlook a strong version we never got to try.

However the object of reducing 32 test versions to 8 was to run the test faster, replacing 32 x 400 by 8 x 4 x 100, so that each test version is run on 4 processors, overall reducing the test time 4-fold. This is the equivalent of 400 games per version as above, which yields an average achieved rating of 63.38, managing only 56% of the average theoretical level when testing 32 versions. That is significantly less than the 68.8 scored by 32 x 400.
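
To spell out the arithmetic on a 32-processor rig:

    32 versions x 400 games    = 12,800 games, a wall-clock cost of 400 games per processor
    8 versions x 1,600 games   = 12,800 games, the same wall-clock cost, but 4x the games per version
    8 versions x 4 CPUs x 100  =  3,200 games, still 400 games per version, in a quarter of the wall-clock time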

Testing for more Marginal AI Improvements

The examples above considered a range of 50 +/- 5.5 (x6). What if the testing was for more marginal improvements, perhaps in the range 50 +/- 2.75? For N=6 this would give a maximum range of 33.5 to 66.5 and 50% of all values would be inside 49.55% to 50.45%. This may reflect the level of enhancement one might be looking for later in development, when most of the low-hanging fruit has already gone.


Figure 7: 32 versions tests for narrow variation, testing against 30->70

In this example above, in figure 7, the lower variation (33.5 to 66.5) naturally results in a lower average theoretical maximum, as expected. However also note that the average achieved result is relatively inferior to the 17 to 83 variation test. Quantifying this, the 33.5 to 66.5 range achieves 53% of the average theoretical maximum instead of 79.5% for the wider 17 to 83. Note also that the worst result dips below 50, whereas the 17 to 83 test range is comfortably above it at 58.8. This means that smaller levels of improvement will require much more testing.

Conclusions

We can extract some meaning from this mesmerising collection of results.

This is not offering any sunny uplands for testing, but it does expose the fragility of block testing. If you test version A against version B for 400 games you might expect that it would deliver a decisive result. However if the difference between A and B is narrow, this demands an escalating volume of testing. If testing multiple versions, as has been shown, then the rules of significance go out the window and a block of versions tested will throw up some spurious high results that would not appear in solitary tests. The emulation exposes this fragility.

Also, a potential flaw in the very last simulation might be that if you choose to test 8 versions instead of 32, then the overlooked 24 versions would likely be inferior to the 8 actually tested. This depends on how these sets are constructed. If we assumed human infallibility then we might only test one version, as we would know which one to test, but that would imply we did not need to test it at all, as we already knew it was good! If you instead treat this as a probability measure, where we assume version 1 has a 55% chance of improvement and version 2 has 45%, then we could order our choices and test the best 8 from our 32 possible options.

However this seems to underestimate human frailty. I recall when creating a solver for our Move it! puzzle, which could require solutions of over depth 100, that the sophisticated probability-based search really struggled and mostly failed. What was needed was an added random choice at each search level, at which point it performed very well indeed. The well-crafted best choice was inferior to well-crafted + random choice.

The conclusion is that we need to allow for non-heuristic choices or at least not just choices confined by developer preference. This axiom is already in play in deep learning where humans play no part in early parameter variations. Choices in parameter changes are mostly pretty random at the start.

If you are developing game engines via deep learning or just auto-tuning, you may still need to separate these out and test the different versions created in direct external play. This may still direct you towards multi-version tests such as those above. Such testing may allow you to progress further than you might achieve using only an incestuous internal self-play improvement loop.

Jeff Rollason - August 2022