After a lifetime of working with AI based on tree search with heuristic linear evaluation, the switch to Monte Carlo Tree Search (MCTS) was a minor culture shock. MCTS can function with no heuristic knowledge at all, totally discarding the whole basis of conventional heuristic tree search. However MCTS alone still needs heuristic knowledge to perform very well, and mixing MCTS selection with heuristic choice is a bit like mixing chalk and cheese.
This article, starting from the paper cited at the end, discusses the issues and then shows how "Progressive bias" allows MCTS and heuristics to co-exist in a meaningful manner.
A heuristic linear evaluation embedded in minimax delivers something that maps well to an intuitive understanding of what evaluation should be. In chess you can look at two linear evaluations and see that one score indicates a value of two pawns more than the other, with marginal deviations indicating some compensating or detrimental positional difference. If you plotted the returned values of such a search throughout a game, you could see at which points one side edged ahead, with jumps in evaluation marking the points where material loss became inevitable.
MCTS provides no such mapping. Choices are made simply by voting on moves: selection picks the move searched most often. Plotting the number of votes throughout a game offers nothing in terms of assessing progress; it might even be flat if the MCTS was designed to stop once any one move had received "N" searches. If you instead plot the number of nodes for that move divided by all nodes searched, then you have something, but only a measure of how obvious that move was, not of the net advantage.
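To make the contrast concrete, here is a minimal sketch in Python (hypothetical names and numbers, not our engine code) of how MCTS "votes": the final choice is just the most-visited root child, and the visit ratio measures how obvious that choice was rather than how large the advantage is.

    from dataclasses import dataclass

    @dataclass
    class Child:
        move: str
        visits: int  # simulations routed through this move -- its "votes"

    def choose_move(children: list[Child]) -> Child:
        # The final decision is purely by vote count, not by any value score.
        return max(children, key=lambda c: c.visits)

    children = [Child("Qxf7", 620), Child("Nc3", 240), Child("h3", 140)]
    best = choose_move(children)
    total = sum(c.visits for c in children)
    print(best.move, best.visits / total)  # 'Qxf7' 0.62 -- obviousness, not advantage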
So the two systems are basically quite alien to each other.
AI Factory's primary use of MCTS has been via Information Set MCTS (ISMCTS) for Spades and Euchre, and plain MCTS for Gomoku. Of these, Spades and Euchre have been the candidates that would most benefit from heuristic input, as the lack of perfect information adds significant fuzziness to the analysis.
However a major characteristic of all MCTS that poses an issue is that where two moves lead to the same game result, it could play either. So where a win is inevitable, any random play might happen. This is particularly acute in imperfect information games such as Spades, where players mostly cannot exactly calculate all outcomes simply because the number of possible game situations is extremely large, and so have to play basic, efficient, routine "good play" moves, even if a fully calculated outcome would have shown that all plays clearly and inevitably lead to the same game result.
In Spades the number of opportunities for such irregular plays is very great. This phenomenon is touched on in "Mixing MCTS with Conventional Static Evaluation: Experiments and shortcuts en-route to full UCT" and "Tuning Spades", which discuss these issues.
In development this actually hit us quite hard. After conversion from heuristics to MCTS, the points win rate for MCTS over heuristics was some 90%, which is an impressive result. Given this good outcome we went ahead and published the new app, and were hit by a barrage of complaints as the program was perceived to be much, much weaker. We hurriedly withdrew it and re-worked some heuristics into MCTS.
This last anecdote makes it plain that this is not just a marginal issue but a core one that absolutely needed to be addressed.
As pointed out, MCTS is a voting system while heuristics are typically based on linear evaluation. If we were using the latter then for the most part this whole issue would not arise: we would simply skew all heuristic values below the root's favoured move to bias the choice of that ply-1 move. A common reason to want to do that might be to avoid marginal minimax horizon effects, where a heuristically stupid move might be chosen at the root simply because it wastes time and pushes the key event beyond the search horizon. This bias can use some exact heuristic value.
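For illustration, a minimal sketch of this kind of skew, with assumed names and values: because minimax scores share one currency (say centipawns), a fixed bonus on the favoured root move has an exact, predictable meaning.

    PREFERRED_BONUS = 15  # hypothetical bonus, in centipawns

    def biased_root_choice(root_scores: dict[str, int], preferred: str) -> str:
        # Skew the favoured move's minimax score; since everything is in one
        # currency, we know exactly how much advantage a rival move needs
        # to overturn the bias.
        adjusted = {m: s + (PREFERRED_BONUS if m == preferred else 0)
                    for m, s in root_scores.items()}
        return max(adjusted, key=adjusted.get)

    # "Re1" wins unless another move scores more than 15 centipawns higher.
    print(biased_root_choice({"Re1": 40, "d5": 50, "Qh5": 30}, preferred="Re1"))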
In MCTS we cannot easily do this, as the search at least will not use simple heuristic values, and may even work on just win/loss, so we cannot easily predict the outcome of such a skew. Moves are chosen by votes, not values.
If we wanted to skew a ply-1 choice for MCTS, the natural solution is to skew the choice of each ply-1 move as it is analysed. This might manifest in more than one way, as follows:
(1) Any evaluation delivered inside the sub-branch could be modified so that it appeared to be better than it was. This would trickle down the tree. Note that if the MCTS were working with just win/draw/loss, this would need a randomised flip of a loss to a win.
(2) Simply modify the root decision about which moves are favoured to be explored, so that we heuristically bias the search to try moves we might like better (a sketch of this follows the list).
(3) A third, and horrible, option is to simply modify the number of votes cast, but this is really not good!
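A minimal sketch of option (2), with assumed names and weights: a fixed heuristic bonus folded into the UCT selection value at the root. Note that the bonus never decays, which is the source of the trouble described next.

    import math

    def uct_with_fixed_bias(wins, visits, parent_visits, heuristic_bonus, c=1.4):
        if visits == 0:
            return float("inf")    # unvisited moves are always tried first
        exploit = wins / visits    # mean simulation result for this move
        explore = c * math.sqrt(math.log(parent_visits) / visits)
        return exploit + explore + heuristic_bonus  # constant skew, never decays

    print(uct_with_fixed_bias(wins=55, visits=100, parent_visits=400,
                              heuristic_bonus=0.1))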
This may sound plausible, but we still have the problem that the final choice is made by counting the number of votes. Fiddling with some heuristic value is therefore likely to be unstable and unpredictable: a maxim that a certain play is worth "x" points may not deliver what you expect. Tuning this would be hard, requiring some compromise that stops the heuristic from overturning decisions it should not, while weakening it would likely render it an ineffective measure.
The solution is to stop trying to treat the MCTS vote and the linear heuristic as if they were some common compatible currency. They are not, and so a different approach is needed.
That approach is to treat heuristics as advisors while leaving MCTS in control. The very nice way of doing this was suggested to me by Daniel Whitehouse at the University of York, a suggestion I was very grateful for: "Progressive bias", which uses heuristics to skew the MCTS to favour particular plays, but with a bias that tails off as the search progresses.
The implication of this may not be obvious. It essentially points MCTS to like certain plays, so it searches them more. The strength of this advice weakens during the search and eventually vanishes, so that ultimately MCTS is still in control.
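Concretely, the Chaslot et al. paper cited at the end adds a heuristic term divided by the visit count to the UCT selection value, so the advice dominates early and fades as genuine simulation results accumulate. A minimal sketch, assuming a heuristic value scaled to be comparable with the win rate:

    import math

    def uct_progressive_bias(wins, visits, parent_visits, heuristic_value, c=1.4):
        # "visits + 1" keeps everything defined at zero visits; the bias term
        # H / (n + 1) follows the Chaslot et al. formulation and fades away
        # as real simulation results accumulate.
        exploit = wins / (visits + 1)
        explore = c * math.sqrt(math.log(parent_visits + 1) / (visits + 1))
        bias = heuristic_value / (visits + 1)
        return exploit + explore + bias

    # The advice dominates an unvisited move but is negligible by 1000 visits.
    for n in (0, 10, 1000):
        print(n, round(uct_progressive_bias(n * 0.5, n, 2 * n + 10,
                                            heuristic_value=2.0), 3))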
In practice the advice might be rejected right away, where the search shows that the heuristic choice is actually a blunder. Alternatively it may just cause MCTS to appear to favour the move at first, but as the advice fades the choice does not persist because other plays return more convincing results, so the advice is rejected. If the move was good, MCTS will reinforce it and the chosen move will receive more votes than it would have without Progressive bias.
This system works very well where the MCTS search would otherwise have delivered a tie that could have flipped either way. The progressive bias is enough to act as a reliable tie-break, without risking overturning an MCTS choice on the strength of what might be a defective or "out of context" heuristic.
This method is clean and avoids the kludge of trying to hack MCTS to incorporate heuristics. By not making the final choice a mix of the two, we avoid mixing the immiscible. Instead MCTS is always in control, with a single currency: votes. Heuristics simply offer guidance.
This has worked very well in our Spades, solving many problems with irregular or incomprehensible play. New ideas for move choice fit this advisor framework very naturally, and our user community is very happy.
"Progressive Strategies for Monte-Carlo Tree Search" - G.M.J.B. Chaslot, M.H.M. Winands, J.W.H.M. Uiterwijk and H.J. van den Herik
Jeff Rollason - October 2015