All previous articles on game AI in this series have been concerned with game heuristics and search based on Minimax. In recent years, though, the alternative and apparently less promising branch of AI, "Monte Carlo search", has scored some notable successes in areas otherwise considered extremely difficult to solve. This is a new area for us, and we have subsequently become involved with the Artificial Intelligence Research Group at the University of Bradford, which works with a wider MCTS group that meets regularly at Imperial College. They have been working with our Dou Di Zhu engine and have published papers based on work using this engine.
Our attention has therefore switched to this approach, and we have experimented with some simple schemes to try to re-use our existing heuristics within this type of framework. This article details some of our findings.
Our testbed for trying out MCTS (Monte Carlo Tree Search) was our card-playing Spades program, and later Hearts. Both of these engines are built around probabilities, so that as the game progresses the program can gradually infer what the opponents' hands contain. Some of this is absolute, such as knowing that player A has no Clubs if they discarded off-suit on a Club lead; other inferences are weaker, such as when a player fails to win a trick they might have been expected to try to win, which allows the program to infer that they are unlikely to hold the needed winning card. Both Spades and Hearts use many such inferences to gradually build up a view of the opponents' hands. From this the program can estimate the probability that any one card play will win a trick. This is combined with heuristics to determine the merits of playing any one card.
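As a simple illustration, the sketch below shows how such absolute inferences might be tracked as a shrinking set of cards each opponent could still hold. This is a minimal Python sketch with hypothetical names; the real engines combine many more, and weaker, inferences than this.

```python
# Minimal sketch of tracking absolute card inferences (hypothetical names;
# the real Spades/Hearts engines are considerably more elaborate).
SUITS = "CDHS"
RANKS = "23456789TJQKA"
DECK = [r + s for s in SUITS for r in RANKS]

class OpponentModel:
    def __init__(self, players):
        # Start by assuming every card could be in any opponent's hand.
        self.possible = {p: set(DECK) for p in players}

    def saw_discard(self, player, card, suit_led):
        """Absolute inference: discarding off-suit means the player is void."""
        if card[1] != suit_led:
            self.possible[player] -= {r + suit_led for r in RANKS}

    def card_played(self, card):
        # A played card is no longer in anyone's hand.
        for hand in self.possible.values():
            hand.discard(card)
```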
This prescriptive heuristic approach to these card games with no search (as with many such games) suffers from likely knowledge vulnerabilities. The program may handle many classic play situations very well, where standard plays are required, but on reaching parts of the game that demand more difficult decisions, the AI may make mistakes that humans might not make. So broadly the play is reasonable, but with the potential for some significant holes in play at different times.
In its crudest form MCTS works by simply playing out a game to its conclusion along a single non-branched path, all the way to the end of the game, and registering a win or loss (a "rollout"). This is repeated many times, and the counts of wins and losses are recorded to determine which root move has the highest percentage of wins.
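In code terms the crude scheme is tiny. The following is a minimal Python sketch, assuming hypothetical helper functions (legal_moves, play, is_over, ai_won) supplied by the game engine:

```python
import random

def flat_monte_carlo(state, legal_moves, play, is_over, ai_won, n_rollouts=1000):
    """Crudest MCTS: for each root move, play random games to the end and
    pick the move with the best win rate. All helper names are hypothetical."""
    wins = {m: 0 for m in legal_moves(state)}
    plays = {m: 0 for m in wins}
    for _ in range(n_rollouts):
        root_move = random.choice(list(wins))
        s = play(state, root_move)
        while not is_over(s):                 # single non-branched random path
            s = play(s, random.choice(legal_moves(s)))
        plays[root_move] += 1
        wins[root_move] += 1 if ai_won(s) else 0
    return max(wins, key=lambda m: wins[m] / max(1, plays[m]))
```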
Superficially this sounds like a very unpromising AI method, as it is hard to imagine, for chess for example, that this could possibly deliver a meaningful result. Comparing the first move Pe4 with Pd4 by playing to the end of the game looks like a hopeless way of choosing between these options. The merits of these two moves surely lie in positions much closer to the first move than in some deep endgame win. The returned win/loss results would carry so much noise, with no account taken of refutations, that it would seem reasonable to dismiss this as a viable method of choosing moves.
MCTS indeed seemed to be the AI method to try "once all else has failed".
The evolution of MCTS that has radically changed the perception of the method is UCT (Upper Confidence bounds applied to Trees), which allows moves in the rollout to be selected on the basis of their estimated probability of success. Each rollout not only records the final result at the root, but also provides votes in the shallow part of the tree, marking which moves led to success or failure. These votes are used to determine the probability that each move is explored in subsequent rollouts. This deals with the concept of refutation, which is handled well by minimax but missing in simple MCTS, and draws on recent advances in bandit algorithms.
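The selection rule most commonly used in UCT is UCB1, which balances a move's observed win rate against an exploration bonus for rarely-tried moves. A minimal sketch of that rule, not taken from any particular engine:

```python
import math

def ucb1_select(node_children, exploration=1.41):
    """Standard UCB1 rule used by UCT to pick which child to explore next:
    mean reward plus an exploration bonus that shrinks as a move is visited.
    node_children: list of (wins, visits) pairs, one per candidate move."""
    total_visits = sum(v for _, v in node_children)
    best, best_score = None, -1.0
    for i, (wins, visits) in enumerate(node_children):
        if visits == 0:
            return i                      # always try unvisited moves once
        score = wins / visits + exploration * math.sqrt(
            math.log(total_visits) / visits)
        if score > best_score:
            best, best_score = i, score
    return best
```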
I first saw this in action at a Computer Olympiad, where a program playing Go (Wei-chi), regarded as an extremely hard game for computer AI, was known to be using MCTS. At the time this was regarded with some disdain by other competitors, as it was mixing the world's hardest game with the world's worst AI method. However it did well, and suddenly a radically better method for computer Go was born; now all serious Go programs use this method.
If it could be made to work, a key, very attractive feature of MCTS is that
it is potentially heuristic-free, needing no understanding of strategy in
order to be able to play the game. It only needs to be able to recognise a
win. If this could be made to work then you have a system that can absorb
any number of sub rules or any type of game situation with the prospect that
it could always be able to play plausible moves. The idea of heuristic knowledge
here however manifests in UCT by having the search learn the heuristic for
move selection by learning via vote-taking as the search progresses.
Therefore it need not lean on heuristics in order to play and therefore would
not suffer from the potential holes in knowledge that might harm a heuristically-driven
program.
Note that one difficult issue with vote-taking during UCT rollouts is that if the game has imperfect information, then the process of randomising the expected game state means that votes cannot be transmitted between different randomised states. Each projected randomised game state would have to have its own version of the UCT vote tree.
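A minimal sketch of what this implies structurally, with hypothetical names:

```python
# Hedged sketch: with imperfect information, each randomised deal
# ("determinization") keeps its own vote statistics, since votes cannot be
# shared between states built from different imagined deals.
class PerDealVoteTrees:
    def __init__(self):
        self.trees = {}                      # deal_id -> {move: [wins, visits]}

    def record(self, deal_id, move, won):
        tree = self.trees.setdefault(deal_id, {})
        stats = tree.setdefault(move, [0, 0])
        stats[0] += 1 if won else 0          # votes accumulate per deal only
        stats[1] += 1
```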
Having indicated that MCTS has advanced to UCT, we did not in fact elect to use UCT in the few engines we needed to update. An issue here is that switching to UCT inevitably requires a substantial re-structure, and time constraints meant that we instead had to make what advances we could with the minimum of structural disruption.
Therefore we "branched" into another take on MCTS, to determine whether we could marry our existing heuristic programs with an MCTS-based method.
Firstly, our plan was to roll out not to the end of the game but to the end of the hand, delivering not a win/loss result but an evaluation score. Our system would simply return a score to the root play and determine the average score for that root move. The move with the highest average score would be chosen.
Since we already had a heuristic evaluator, there was merit in using this to select moves during the rollout. The rollout would then not be random but a sequence of plausible choices. This, of course, is expensive in terms of time and so limits the number of rollouts, but at least it allows the engine to generate reasonable choices from very few rollouts.
This type of idea also appears in "Looking for Alternatives to Quiescence Search" with "Super Soma", which attempts to predict tactical sequences with no branching. This method successfully allowed our Shogi engine "Shotest" to reach world ranking #3 in the CSA World Championship in Tokyo, as detailed in other articles. If you can make a single move path deliver a result, rather than a branched tree, then the speed advantage can be huge.
In practice, with multiple rollouts, it was also necessary to try alternative moves within repeated rollouts from the same root move, to have some chance of picking up alternative variations.
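Putting these pieces together, a rollout in our modified scheme might look something like the sketch below. All function names here are illustrative assumptions rather than the engine's actual API, and the deviation rate is likewise a made-up parameter:

```python
import random

def evaluate_root_move(state, root_move, play, is_hand_over, legal_moves,
                       heuristic_best, evaluate_hand,
                       n_rollouts=20, deviation=0.15):
    """Hedged sketch: roll out only to the end of the hand, choose moves with
    the existing heuristic rather than at random, occasionally deviating to
    sample alternative variations, and average the hand evaluation score."""
    total = 0.0
    for _ in range(n_rollouts):
        s = play(state, root_move)
        while not is_hand_over(s):
            if random.random() < deviation:
                move = random.choice(legal_moves(s))   # alternative variation
            else:
                move = heuristic_best(s)               # plausible heuristic play
            s = play(s, move)
        total += evaluate_hand(s)
    return total / n_rollouts
```

The root then simply plays whichever move returns the highest average score.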
An issue here is that we have imperfect information: we do not know our opponents' cards. However, to complete a rollout we must have cards to play, so we need to generate a randomised instance of a possible game state in order to play it out.
The simple approach would be to generate random cards for each player, excluding cards already known (already played, in the AI player's own hand, or known not to be in an opponent's hand). This is rather unsatisfactory, though, as it ignores information the AI player should know. Resolving this is complex and has pitfalls, as follows:
Clearly if you know that an opponent has gone "nil" (must lose all tricks), then in Spades that player cannot have the AS, and almost certainly not the KS, QS, etc. This has to be accounted for, otherwise the rollout might be meaningless. So you could roll out taking simple probability inferences into account.
A further horrible part of this is that part-way through a hand you need to take into account how many tricks that player has already made. If they have taken 3 tricks and bid 4, you could infer that their hand is likely to deliver 1 more trick, and randomise with that in mind. However, that player may already have over- or under-reached, so that their hand is already imbalanced.
A further nasty is that the process of distributing skewed randomised cards to the opponents risks giving the wrong cards to players, such that the remaining cards to be distributed conflict with what you know about the remaining players. It is a disaster to dole out cards and find that the last remaining card to distribute is the AS, and that it must go to the player bidding "nil". That is the extreme case, but the problem has a generally severe impact on this process of skewed randomised deals.
Note that multiple skewed randomised deals need to be tested before choosing a move. One way to attempt such a constrained deal is sketched below.
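The sketch uses simple rejection sampling: try a random assignment that respects per-player constraints, and if it dead-ends, throw the whole deal away and retry. The names and the constraint representation are assumptions for illustration, not our engine's actual scheme:

```python
import random

def deal_unseen(unseen_cards, opponents, hand_sizes, allowed, max_tries=200):
    """Hedged sketch of a constrained randomised deal. `allowed[p]` is the set
    of cards player p may plausibly hold (e.g. no AS for a nil bidder).
    Assumes len(unseen_cards) == sum of hand_sizes. A naive left-to-right deal
    can paint itself into a corner, so a dead-ended deal is simply retried."""
    for _ in range(max_tries):
        cards = list(unseen_cards)
        random.shuffle(cards)
        hands = {p: [] for p in opponents}
        ok = True
        for card in cards:
            # Give the card to some player who may hold it and still needs cards.
            candidates = [p for p in opponents
                          if card in allowed[p] and len(hands[p]) < hand_sizes[p]]
            if not candidates:
                ok = False        # dead end, e.g. only the nil bidder left for AS
                break
            hands[random.choice(candidates)].append(card)
        if ok:
            return hands
    return None                   # constraints could not be satisfied
```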
There are few free lunches, and MCTS is not one of them. A common issue with MCTS is that, although it may win many games, it is frequently inclined to play "bad" moves that appear to lead to a win as often as a good move does. The principle being played out here is that provided you win, anything is OK. This causes the top Go programs to appear to play badly when actually they are doing quite well.
Even though we knew this bear trap existed, it still bit our Spades engine, and indeed we had a barrage of complaints about very poor plays. The program would often play apparently absurd cards, as it could see that it still appeared to reach a good result even with this "bad" play. The new program easily beat the old program, but statistically it played many more "bad" moves than the old program. This is a curious paradox, but very bad for the human opponent.
To "cure" this issue we found we needed to use the heuristics to skew the choices made at ply 1, and also to perform other tweaks to the rollout to get the needed result. Some rollouts mixed random and heuristic choices, while other branches were entirely heuristically driven. Within these heuristic choices, randomisation needed to be added to avoid repeated identical rollouts.
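As an illustration of the kind of tweak involved, the sketch below picks a move with probability weighted by its heuristic score, so that rollouts favour sensible plays while retaining enough randomness to avoid repeating identical rollouts. This is one plausible mechanism sketched under assumed names, not the engine's actual code:

```python
import random

def skewed_choice(moves, heuristic_score, temperature=1.0):
    """Pick a move with probability weighted by its heuristic score.
    Higher temperature flattens the weights towards uniform randomness."""
    scores = [heuristic_score(m) for m in moves]
    low = min(scores)
    # Shift scores to be non-negative and keep every weight strictly positive.
    weights = [(s - low) / temperature + 1e-3 for s in scores]
    return random.choices(moves, weights=weights, k=1)[0]
```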
How well did this do? Testing the original heuristic program against one that used our MCTS model above, MCTS showed a win rate of 81.4% in terms of the rate of accumulated points. The program plays quite well and has long been the top Spades program on Android. At the time of publication it is still the highest-rated Spades program.
There is clearly work still to be done, as Spades is a complex game and the needs of partnered play add significant complexities. We will continue to work on this.
Jeff Rollason - February 2012