Playing Stronger by Learning
Author: Jeff Rollason



Automatic game-learning is one of the areas of AI where it is hard to mimic human intelligence. Any commercial game-playing program that does learn will look rather primitive compared to a human learner. Such programs may outclass human play, but their ability to learn is much less impressive.

However this is not a lost cause: it is still possible to employ simple learning techniques to gain practical improvements in play. The approach described here uses simple rote learning, which can actually generate good results.

The example used here is AI Factory's "Gobang" program, which has significantly benefited from learning.

How does this work?

Gobang is similar to Go-Moku, in that the goal is to win by creating a row of 5 pieces. However Gobang is made more interesting by the addition of a capturing rule, which allows a pair of pieces to be surrounded and captured.

The Gobang program uses minimax to select moves. However it also has an opening book to choose early moves. This book was entirely machine generated by rote learning: learning by simple trial repetition of moves to find out which worked and which failed, without any attempt to consider why a move is good or bad.

The program uses the minimax search to choose its moves, and if the game is won, all the moves played are added to the book. If the game is lost, the opponent's moves are added instead.
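
The following is a minimal sketch, in Python, of how such a rote-learned book might be stored and updated. The BookNode trie, the function names and the simplification of recording the whole played line together with the result for the program are illustrative assumptions rather than the actual Gobang code.

    class BookNode:
        def __init__(self):
            self.children = {}   # move -> BookNode for the next move in the line
            self.games = 0       # recorded games that reached this node
            self.wins = 0        # of those, games the program went on to win

    def learn_from_game(book_root, moves, program_won):
        """Rote learning: walk the played line into the book trie and
        record whether the program won, with no attempt to judge why."""
        node = book_root
        for move in moves:
            node = node.children.setdefault(move, BookNode())
            node.games += 1
            if program_won:
                node.wins += 1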

How is this used to choose moves?

The simple scheme above sounds plausible, but it is not obvious how it might be used. There are a number of ways such an accumulation of simple knowledge could be employed.

1. Choosing moves that proportionally win the most often
This is reasonably attractive, as a move that leads to a win will keep succeeding, so it must be a good candidate to try again. However this approach is very slow, and it responds poorly when the opponent later finds a refutation: the learned knowledge base takes some time before a move that previously won most of the time starts to lose most of the time.

2. Performing a minimax on the stored learned knowledge
In this scheme a move that refutes a previously successful move has an immediate impact. The program minimaxes the learned knowledge and chooses an alternative for the next game. This makes the learning system very responsive.

The second scheme above may look a little fickle, as it seemingly abandons a previously successful move so quickly, but this does work.
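
As a continuation of the sketch above, the second scheme might minimax the stored trie roughly as follows; using the recorded win proportion as the leaf value is an assumption made for illustration.

    def book_minimax(node, program_to_move):
        """Minimax over the learned book: the value of a node is the best
        the program can force within the recorded lines, falling back to
        the recorded win proportion at the leaves."""
        if not node.children:
            return node.wins / node.games if node.games else 0.5
        values = [book_minimax(child, not program_to_move)
                  for child in node.children.values()]
        return max(values) if program_to_move else min(values)

A reply that refutes a previously successful line immediately drags the minimaxed value of that whole line down, which is what makes the scheme so responsive.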

With Gobang the program can be left to autoplay, and thereby explore a varied range of possible moves. If the program finds some obscure move that works, then the autotesting will concentrate on that move, trying out different responses until it finds a way to refute it.
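
The autoplay testing itself need be no more elaborate than a loop that feeds each finished game back into the book. In the sketch below, play_one_game is a hypothetical callback standing in for the real engine playing a complete game with its current book and search.

    def autoplay_learning(book_root, play_one_game, num_games):
        """Autoplay loop: repeatedly play complete games and apply the
        rote-learning update to the shared book after each one."""
        for _ in range(num_games):
            moves, program_won = play_one_game(book_root)
            learn_from_game(book_root, moves, program_won)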

Of course, it may be that every response to an apparently winning move fails. In this case the move selection tie-breaks using the proportion of winning games that followed each move. This allows a previously successful move that was later refuted to be re-considered early on (essentially falling back to technique (1) above).
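
Continuing the same sketch, the tie-break might look like this: the minimaxed book value is the primary key and the raw win proportion the secondary key, so a refuted but previously successful move is retried when nothing else in the book looks any better.

    def choose_book_move(node):
        """Pick the program's move at a book node: primary key is the
        minimaxed book value, secondary key is the raw win proportion."""
        best_key, best_move = None, None
        for move, child in node.children.items():
            value = book_minimax(child, program_to_move=False)
            ratio = child.wins / child.games if child.games else 0.0
            key = (value, ratio)
            if best_key is None or key > best_key:
                best_key, best_move = key, move
        return best_move   # None if the position is off-book

In play the program would follow the moves made so far down the trie and call choose_book_move at the resulting node, falling back to the normal minimax search once the position is off-book.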

How well does this work?

Perhaps contrary to expectation, this works well. The version left to learn using autoplay testing overwhelmingly defeats the version that had no learning.

This may seem an artificial test, but the program went on to continue learning against strong human opposition. In this case the opponent, Thong Dinh, was an accomplished Gobang player who initially beat the program quite easily.

However, as he repeatedly played the program, his relative performance diminished. Thong must have been improving, but the program apparently improved more quickly. Eventually he found it very hard to beat the program at all.

Conclusion

The autoplay creation of the Gobang book was initially a revelation, as the original program played moves that reflected my personal understanding of the game. After autoplay learning, however, the program arrived at a completely different opening system from the one I had assumed was best, and demonstrated this by demolishing my attempts to play the openings I had believed were good.

The rapidity of its learning against Thong Dinh was unexpected, and it seemed remarkable that the program grew stronger more quickly than Thong could.

The conclusion is that it is indeed possible to generate useful learning behaviour using primitive techniques. The Treebeard program uses a technique very similar to the one described above. It should be realised that the motivation for doing this is not just to create better play, but to create an opponent that evolves as you play it. Such evolution is necessary if games are not to become repetitively dull: an opponent that exposes a weakness and then keeps repeating that failing is not an attractive one.

This technique is particularly valuable when testing a new game that has no established body of knowledge behind it. The minimaxed opening book provides an effective means of determining how the opening of that game should be played, in a way that simply using a longer minimax search could not accomplish. AI Factory will continue to use this technique in future new games.

Jeff Rollason: March 2006