This article looks at developing a games-playing program in the light of the methods I used to develop the Shogi program Shotest, which has been successful in the CSA World Computer Shogi Championships over a six-year period. This has been more a test of engineering than of Shogi expertise, as Shotest was created using very limited Shogi knowledge.
Step 1: Creating the game engine in the first place.
This program was started in 1997 with no background experience whatever in Shogi: I had not even played the game. This imposes special problems, but Shogi shares many features with chess, so I analysed it from that perspective, and many ideas used earlier in chess were picked up and applied to the Shogi engine. However Shogi also brings extra problems: it is more complex than chess, and it has the difficult property that pieces never leave play. A captured piece joins the capturer's hand and can later be dropped back onto the board, almost anywhere, as a move in its own right. Therefore the technique of playing out captures to simplify analysis, commonly used in chess, simply does not work. This key feature was central to the design effort, and gradually a working program was put together.
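As a purely illustrative sketch (not Shotest's actual code), the outline below shows one way the "pieces in hand" rule might be represented and why drop moves have to be generated alongside ordinary board moves; the names, the board encoding and the omitted rule details are all assumptions made for the example.

```cpp
// A minimal, illustrative sketch of "pieces in hand" and drop-move generation.
#include <vector>

enum PieceType { PAWN, LANCE, KNIGHT, SILVER, GOLD, BISHOP, ROOK, PIECE_TYPES };

struct Position {
    int board[81];                 // 9x9 board; 0 = empty (encoding is illustrative)
    int inHand[2][PIECE_TYPES];    // count of each piece type held by each side

    // When a piece is captured it does not leave play: it joins the capturer's
    // hand (promoted pieces revert to their basic type first).
    void capture(int side, PieceType basicType) { ++inHand[side][basicType]; }
};

struct Move { int from, to; PieceType dropped; bool isDrop; };

// Drop moves must be generated alongside ordinary board moves, which is a
// large part of why Shogi's branching factor stays high to the very end.
void generateDrops(const Position& pos, int side, std::vector<Move>& out) {
    for (int sq = 0; sq < 81; ++sq) {
        if (pos.board[sq] != 0) continue;              // may only drop on empty squares
        for (int t = 0; t < PIECE_TYPES; ++t) {
            if (pos.inHand[side][t] == 0) continue;
            // The real rules add restrictions (no second unpromoted pawn on a
            // file, no drop that leaves a piece with no legal move, no
            // checkmate by pawn drop); these are omitted here for brevity.
            out.push_back(Move{ -1, sq, (PieceType)t, true });
        }
    }
}
```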
Step 2: The program works! Now make it strong!
Not easy. Of course a combination of game knowledge and knowledge of how tree searching works gives you a basis for making changes. However this is also a black art where you need to keep an open mind: adding "clear" improvements to your program can often make it much weaker. This phenomenon demonstrates the complex behaviour of games-playing programs. The Shogi program, like many such games, depends on evaluation and tree search. The interaction of these is obscure and can be very hard to understand. It is not easy to intuit how tree search behaves, and it is often almost impossible to know exactly why a move has been selected. Changes have to be supported by rigorous testing. In addition, you may find that different components of the evaluation interact in strange ways. The maxim "a little knowledge is a dangerous thing" can hold. There are times when your program lacks knowledge and so makes optimistically naive choices in the search, believing it has found advantages that are not really there. As your program gets more "clever" it sees fewer of these opportunities and becomes more passive: since it cannot prove that an advantage can be forced, it starts doing nothing, and then plays badly. (This topic will be discussed in a later article.)
Step 3: Mixing development and testing of your program.
This requires thought. Obviously you play the program directly and watch for errors and for positions where it fails to make a reasonable move. This is the common iterated cycle used to develop most such programs, but it does not guarantee good progress. A more rigorous testing procedure can show how easily a program can be "improved" in ways that actually result in inferior play. However, if you rely entirely on the cycle of manual testing and adjusting, you are probably doomed to a poor competition result. You need more than this, as follows:
Step 4: The unavoidable: Testing against other programs!
An obvious means of progressing is to test against other programs. This is a meaningful exercise, in that success in such testing is needed for success in competition. However it has its own pitfalls.
1. Statistical Minefields
For some reason the human brain is poor at instinctively assessing probability and statistical significance. For example, this deficiency makes it hard to guess the answer to even simple probability questions, such as the likelihood that 2 of the 22 players in a soccer game might share the same birthday. This general failing is shared by most of us, including those who understand the mathematics, like myself. The problem is that it is instinctively easy to read results into far too little information. If the program wins a game against some opponent, it is easy to feel that it has done well. This ignores the fact that, were a further 9 games to be played, it might lose all of them.
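To make the birthday example concrete, the short calculation below works it through; most people guess the chance to be far smaller than it actually is, when in fact it comes out close to even.

```cpp
// The birthday example worked through: the chance that at least two of the
// 22 players in a soccer game share a birthday (leap years ignored).
#include <cstdio>

int main() {
    double pAllDifferent = 1.0;
    for (int i = 0; i < 22; ++i)
        pAllDifferent *= (365.0 - i) / 365.0;   // each new player avoids all earlier birthdays
    std::printf("P(shared birthday among 22 people) = %.1f%%\n",
                (1.0 - pAllDifferent) * 100.0);  // prints roughly 47.6%
    return 0;
}
```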
There is an exact tool available here, the binomial distribution, which allows the significance of results to be accurately assessed. As it stands you cannot just sit down with a formula and easily work it out, as the results of the distribution need to be reverse-engineered to test probability ranges. This requires a program, which I duly wrote and used.
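The sketch below shows one way such a calculator might be written; it is not the program described above, and the table that follows may have been produced by a somewhat different formulation, but the idea is the same. It assumes each game is an independent trial with a fixed, unknown win probability and, taking a uniform prior on that probability, computes how likely it is that the probability really exceeds 50%.

```cpp
// A minimal significance calculator (illustrative, not the original program):
// given W wins and L losses, and assuming independent games with an unknown
// fixed win probability p (uniform prior), report P(p > 0.5), i.e. the
// chance that the winner really is the stronger program.
#include <cstdio>
#include <cmath>

double logChoose(int n, int k) {
    return std::lgamma(n + 1.0) - std::lgamma(k + 1.0) - std::lgamma(n - k + 1.0);
}

// With a uniform prior, P(p > 0.5 | W wins, L losses) equals the probability
// that a fair coin tossed W + L + 1 times shows at most W heads.
double probStronger(int wins, int losses) {
    int n = wins + losses + 1;
    double total = 0.0;
    for (int k = 0; k <= wins; ++k)
        total += std::exp(logChoose(n, k) - n * std::log(2.0));
    return total;
}

int main() {
    int results[][2] = { {3, 1}, {8, 4}, {20, 12}, {57, 43}, {550, 450} };
    for (const auto& r : results)
        std::printf("%4d wins, %4d losses: P(stronger) = %.3f\n",
                    r[0], r[1], probStronger(r[0], r[1]));
    return 0;
}
```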
Given this, you can assess the meaning of a result. So your program wins 3 games and loses 1: what does that mean? The table below shows a few possible win/loss results and their significance:
White wins | Black wins | Win rate | Probability that White is stronger than Black
3          | 1          | 75%      | 81%
8          | 4          | 66%      | 86%
20         | 12         | 62%      | 91%
57         | 43         | 57%      | 92%
550        | 450        | 55%      | 99.918%
From this you can see that each successive result has a lower win rate, falling from 75% down to 55%, but that the chance that White is stronger improves each time. However none of the first 4 results is that convincing. Even the best of these gives a 92% chance that White is stronger, which still leaves a far from negligible 8% chance that White is actually the weaker program. Note that with a much larger sample, 550 wins to 450, and the lowest win rate of all, the probability that White is stronger is very high.
This is bad news. A test of 100 games needs a win margin of 59 to 41 before you can be more than 95% confident that White is the stronger program. So you need to run many games, which leads to the next technique.
2. Block testing
If you are going to gather 100 or more results, then you need to play lots of games, and sitting down in front of 2 computers typing in moves is not really getting you anywhere. This has to be automated. Many programs support serial-port protocols that allow two machines to play across a serial cable, but this still only gives you one game at a time.
What you really need is a harness that runs your opponent's program itself, operating its new-game controls for it. That means a program that can read pixels off the screen to detect the moves being played, and then issue emulated mouse clicks to work the opponent's controls. This is a lot of work, but it allowed me to run some 100,000 test games with Shotest.
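The two low-level primitives this kind of harness rests on are sampling screen pixels and injecting mouse input. The fragment below is a minimal Win32 sketch of just those primitives, assuming a Windows environment; the real work of recognising the opponent's board and moves sits on top of them.

```cpp
// A minimal Win32 sketch of the two primitives an automated test harness
// needs: sampling a screen pixel (to spot the opponent's display changing)
// and issuing an emulated mouse click (to drive the opponent's controls).
#include <windows.h>

// Read the colour currently shown at screen coordinates (x, y).
COLORREF readScreenPixel(int x, int y) {
    HDC screen = GetDC(NULL);                 // device context for the whole screen
    COLORREF colour = GetPixel(screen, x, y);
    ReleaseDC(NULL, screen);
    return colour;
}

// Move the cursor to (x, y) and send a left-button press and release.
void clickAt(int x, int y) {
    SetCursorPos(x, y);
    INPUT input[2] = {};
    input[0].type = INPUT_MOUSE;
    input[0].mi.dwFlags = MOUSEEVENTF_LEFTDOWN;
    input[1].type = INPUT_MOUSE;
    input[1].mi.dwFlags = MOUSEEVENTF_LEFTUP;
    SendInput(2, input, sizeof(INPUT));
}
```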
3. It gets worse
Of course, what you need is to be able to improve your program, not just show that it is stronger. It is not always easy to make big jumps in performance, so you often need to make many small improvements, which might mean turning a 55% win rate into 58%. That is a subtle difference statistically. Also, the statistical method above only tells you the probability of one side being stronger than the other. How do you test improvements? One method is to get the program to play against another version of itself; you are then looking for a marginal advantage, and that means running a great many games, as the rough calculation below suggests. There is a further caveat here: game results are often not truly random (programs playing the same opponent tend to repeat the same games and mistakes), and this undermines the significance of the purely statistical view.
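As a rough, illustrative calculation (the confidence and power levels are assumptions, not figures from this article), the standard normal-approximation formula for comparing two win rates gives an idea of the scale involved in distinguishing 55% from 58%:

```cpp
// A rough illustration of why a 55% -> 58% improvement needs so many games:
// the standard normal-approximation sample-size formula for comparing two
// observed win rates (95% two-sided confidence and 80% power assumed).
#include <cstdio>
#include <cmath>

int main() {
    double p1 = 0.55, p2 = 0.58;          // old and new win rates to distinguish
    double zAlpha = 1.96, zBeta = 0.84;   // 95% confidence, 80% power
    double variance = p1 * (1.0 - p1) + p2 * (1.0 - p2);
    double n = std::pow(zAlpha + zBeta, 2.0) * variance / std::pow(p2 - p1, 2.0);
    std::printf("Roughly %.0f games are needed for each version.\n", n);
    return 0;
}
```

The answer runs to several thousand games per version, which is exactly why the automated block testing described above is needed.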
Of course, as any game programmer will tell you, self-play tests are suspect! The chance of a critical good move being overlooked is magnified, as both sides will fail to see it! One means of increasing the credibility of such tests is to play against more than one version of the program, and to set up special test versions with deliberately different characteristics. This helps.
4. Multiple opponents
Testing against just one computer opponent is potentially a recipe for failure. Your program may get better and better at beating the test opponent while actually getting weaker against all other opponents. Playing against one opponent again and again may mean that your program comes to depend on a particular type of error by that opponent, and on its particular style of play. It may also be thoroughly untested in the vast majority of position types it might reach. Making your program weaker by this method is not just a possibility but quite likely. You need to test against more than one program.
5. Test Suites
There are published suites of test positions with known solutions that you can run your program against. These are seductive, as they come with published results for other programs. Unfortunately this may say very little about the general performance of your program. Tuning against suites can be dangerous, as it will probably unbalance your program towards solving set problems, whereas actual play across the board requires much wider abilities. If your program works harder at looking for tactical tricks, it is probably neglecting other things it should be looking out for. These suites can be good for getting bugs out of your program and for developing certain base skills, but they contribute little to its real strength.
Step 5: More strangeness: Bugs can improve your program!
This is not God playing dice, but simply a further demonstration of the complex behaviour of games-playing programs, and something I have hit a number of times. You may have an evaluation term that is clearly wrong, but when you fix it the program's play gets worse. There may be more than one reason for this. The first is that the "bug" has been compensated for by tweaks you have made to other parts of the evaluation; once the bug is fixed, the rest of the compensating evaluation is left badly out of tune. The second reason is that sometimes your program can benefit from getting it wrong. It can even assess a position as "won" when it is not, and play better for it: this kind of error can lead the program to make generally unsound but successful aggressive attacks, as it stumbles its way into a winning position it did not see coming.
Step 6: The nightmare of time control.
In chess competitions, the rules for time control are reasonable: you get a time allocation for 40 moves and, when that limit is reached, you get a further allocation. Life is not always that easy, and the World Computer Shogi Championship uses a rule under which you are allocated 30 minutes in which to make all of your moves. This is scary, as it means you have to estimate how many more moves are still to be played. It is one of Shotest's strong areas: it uses a statistical approach, assessing the current evaluation to estimate how many more moves will be needed to complete the game. In competition it has won many games through handling time control better than the opponent, never getting into time trouble and always managing to make good use of its available time. Shotest is probably one of the world's best programs at handling time control.
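The article does not spell the method out beyond calling it a statistical approach, so the sketch below is only an illustration of the general idea, with every constant and the shape of the estimator invented for the example: guess how many of your own moves remain from the evaluation and the move count, then spread the remaining budget across them.

```cpp
// An illustrative time-allocation sketch only: Shotest's actual method is
// described here just as "a statistical approach". Below, the number of
// remaining moves is guessed from the move count and how decisive the
// current evaluation looks, and the time budget is spread across them.
#include <algorithm>
#include <cmath>

// Guess how many more of our own moves remain (all constants are invented
// for the example; the evaluation scale is arbitrary).
int estimateMovesLeft(int evaluation, int movesPlayed) {
    int base = std::max(10, 60 - movesPlayed / 2);               // games tend to shorten as they progress
    double decisiveness = std::min(1.0, std::fabs(evaluation / 2000.0));
    return std::max(5, (int)(base * (1.0 - 0.5 * decisiveness)));
}

// Seconds to spend on the next move, given the time left for the whole game.
double allocateTime(double secondsRemaining, int evaluation, int movesPlayed) {
    double slice = secondsRemaining / estimateMovesLeft(evaluation, movesPlayed);
    return std::min(slice, secondsRemaining * 0.2);              // never gamble too much on one move
}
```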
Step 7: The Competition: Do's and Don'ts.
The advantage of entering computer competitions is that you do not need to play yourself, as your program does this for you. The disadvantage is that you cannot help your program when it gets into trouble, but are reduced to watching in agony as it slides into difficulties that are out of your control. Each year the programmers turn up for these competitions, usually having worked round the clock right up until the competition date. There are generally agreed maxims to follow at these events, but most of us cannot resist breaking them! The most important is "Do not make changes to your program at the tournament venue". This is a recipe for disaster, as the change cannot be backed up by proper testing; statistically, a change made during a competition will probably make the program weaker. Of course there are sometimes major problems discovered in round one that must be fixed. Usually, though, the best thing is to leave it alone.
So, how did our Shogi program Shotest do?
Shotest has entered 6 world championships in Tokyo (the 7th to the 12th), 2 ISF invitation tournaments in Tokyo and 2 Computer Olympiads. In the World Championship it came 13th, 3rd, 3rd, 7th, 11th and then 7th, competing against up to 60 programs. It has, at all times, been the world's strongest Shogi program developed by a Westerner, and it also won Gold in one of the 2 Olympiads. Its performance has been impressive, and its history (with photos) can be traced in the articles published at:
https://www2.teu.ac.jp/gamelab/SHOGI/articlesmain.html
Also at the official CSA World Championship site at:
http://www2.computer-shogi.org/index_e.html
It has also generated many publications, in the journal "Artificial Intelligence", in various books and elsewhere, both in English and in Japanese translation. A major paper on this program can be found at:
http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/r/Rollason:Jeff.html
or download SUPER-SOMA.doc directly from here.
Of course, the Shogi program is also covered on the AI Factory website www.aifactory.co.uk under "Competitions".
Jeff Rollason: 28th January 2005