A Rating System for Developers

Author: Jeff Rollason

Copyright Notice: This article is Copyright AI Factory Ltd. Ideas and code belonging to AI Factory may only be used with the direct written permission of AI Factory Ltd.

This expands on the previous article Lessons from the Road - Championship level AI development, which detailed the big picture of how we approached championship-level AI for competition. This article picks up on the RATING program that was an essential cog in this process.

The Need for a Rating System

In developing computer AI it is easy to become over-focussed on solving small decision defects one by one, but each change made to a program, even one that solves the previously unsolvable, can easily have unexpected effects on overall performance. These need not simply be bugs, but shifts in style that give the new version an inferior head-to-head performance against other opponents.

An example comes from Shotest, where the program had a bug that over-optimistically believed a type of move led to a substantial advantage, when bug-free analysis showed it actually gave only a marginal one. Once fixed, the program did less well. The reason is that it now saw these same moves, assessed them as offering only a marginal advantage, and instead played some more conservative move. The program was better off keeping the bug and playing the over-optimistic move. If you project this principle onto computer chess, where the strongest programs are now so much better than any human player, you will see a solid, safe style. Modern programs do not play like the former world champion Mikhail Tal, widely regarded as the best attacking chess player of all time, because in absolute terms his play was likely often unsound; it nevertheless gave him wins against imperfect but perhaps theoretically stronger opponents.

Testing therefore needs to be comprehensive to maintain an accurate picture of program strength, ideally against multiple opponents. Multiple opponents, however, mean multiple win rates, so you need a rating system to estimate how strong your program actually is.

The Reason that Simple ELO is Not Enough

A problem with ELO is that it is not retrospective. If A plays B and earns a rating from that result, and B then beats C, who already has a high rating, B's rating increases but A's does not change. Yet if A scored 50% against B, and B then achieves some outstanding further result, that result also says something about A.

Of course, ELO is designed to be a snapshot in time, on the assumption that in a multi-player community each player's strength will drift anyway.

However in computer testing you have the helpful opportunity of working with fixed opponents of rigidly standard strength, so that if A>B, B=C and C>D, then the performance of C over D can give us information about A. It is therefore possible to accumulate results over a long period, such that the very first result is just as relevant as the most recent. From this it is possible to accumulate the results of multiple versions of the program under development and compare them with each other and with fixed, known reference opponents.

Finally, ELO requires all players to have ratings in advance, which may not be possible.
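For reference, the standard ELO update works incrementally on the two participants only, which is exactly what makes it non-retrospective. A minimal sketch (using the conventional 400-point scale and a typical K-factor of 32; this is generic ELO, not AI Factory code):

```python
# Generic ELO rating update (conventional constants, for illustration only).

def elo_expected(ra, rb):
    """Expected score of player A against player B on the 400-point scale."""
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))

def elo_update(ra, rb, score_a, k=32):
    """Return A's new rating after scoring `score_a` (1 win, 0.5 draw, 0 loss).
    Note that only the participants' ratings change: when B later beats C,
    A's rating is never revisited."""
    return ra + k * (score_a - elo_expected(ra, rb))
```

For example, a win between two equally rated 1500 players moves the winner to 1516; nothing that happens later between other players feeds back into that number.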

As detailed in the previous article, AI Factory testing for Shogi was also carried one step further by the ANALGAME program, which assessed the relevance of any one win: a win achieved along a line you normally win yields fewer points than a win following a line where you normally lose.

This leads to fractional results, such as 0.75 of a win, which further rules out ELO as a viable method and in turn leads to the method described in this article, as follows:

"RATING": a Cyclic Rating Program

This program was actually designed (by this author) for rating pool players, but was then switched to game development. It has also been put to a variety of other side uses, but we can look at an illustrative example here.

The input to this program is a text file of results, such as the following:

Anne 25 Joe 20
Sue 5 Anne 5

This specifies Anne beat Joe by 25 to 20 points and Sue drew with Anne with 5 points each.
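This line format is simple enough to parse mechanically. As a sketch (the real RATING file format may differ in details; this assumes alternating name/score tokens, and it happily accepts the fractional scores that ANALGAME produces):

```python
# Parse a results file of lines like "Anne 25 Joe 20" into a list of
# matches, each a list of (player, points) pairs. A hypothetical helper,
# not the original RATING parser.

def parse_results(text):
    """Return a list of matches; scores may be fractional (e.g. 0.75)."""
    matches = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        # pair up alternating (name, score) tokens
        match = [(tokens[i], float(tokens[i + 1]))
                 for i in range(0, len(tokens), 2)]
        matches.append(match)
    return matches
```

Because a line simply lists every participant with their points, the same format also accommodates matches with more than two players.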

RATING takes this input and assigns each player a base rating of 1500 and then cycles the following:

For each player, starting with sum = 0:

FOR each match the player is in:

sum = sum + (player's points in match x (sum of (participant ratings - 1500)))

It then calculates a new rating from the following formula:

rating = 1500 + sum/(total points scored in all matches the player has played)

This then repeats, seeding each subsequent loop with the ratings just calculated in the previous loop, until the rating for each player stabilises.
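The original RATING source is not shown here, so the sketch below is only one plausible reading of the loop described above: on each pass, every player's new rating is a points-weighted share of the rating "pools" of the matches they played, repeated until the numbers stop moving. The iteration count, the exact inner sum and the normalisation are all assumptions, so the converged numbers will not match the article's output exactly, though the qualitative behaviour (winners drift above losers, results link together) is the same.

```python
# A hedged sketch of a cyclic redistribution rater in the spirit of RATING.
# Details of the real program's formula are not published; this is an
# illustrative reconstruction, not AI Factory code.

BASE = 1500.0  # every player starts from the same injected base rating

def rate(matches, iterations=500):
    """matches: list of matches, each a list of (player, points) pairs."""
    ratings = {p: BASE for match in matches for p, _ in match}
    for _ in range(iterations):
        new_ratings = {}
        for player in ratings:
            weighted = 0.0   # points-weighted share of match rating pools
            total = 0.0      # total points scored in the player's matches
            for match in matches:
                scores = dict(match)
                if player not in scores:
                    continue
                pool = sum(ratings[q] for q in scores)  # match rating pool
                weighted += scores[player] * pool
                total += sum(scores.values())
            new_ratings[player] = weighted / total if total else BASE
        ratings = new_ratings  # seed the next loop with the new ratings
    return ratings

ratings = rate([[("Anne", 25), ("Joe", 20)], [("Sue", 5), ("Anne", 5)]])
```

Running this on the example file, Anne and Sue settle above Joe, reproducing the ranking (though not the exact figures) of the output below.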

From the above example the output is:

Player Rating Points Points
1. Anne 2045 ( 30 pts) ********************************************
2. Sue 2038 ( 5 pts) ********************************************
3. Joe 1937 ( 20 pts) ************************************

What was that??

The above is all the information you need to replicate this, but maybe it is not obvious what is actually happening here.

This system injects rating points into each player and then, in each round, re-distributes rating points in proportion to the number of points played, times the relative share of points in each match. Rating points therefore flow between the players. A winning performance of A over B may initially award a low credit, but in subsequent loops B may also earn more rating points, so that A receives more rating points next time around.

This operates a bit like a fluid system, with rating points "flowing" between players, rather like William Phillips' famous hydraulic economics computer (the MONIAC), now in the Science Museum.

This system is very sensitive and links multiple results together well. Here you can see that Sue has played very few points, yet earns a rating close to the strongest player, Anne, who beat Joe. Although Anne and Sue tied in points, it is Anne who gets the slightly higher rating. This intuitively makes some sense.

Let's see how this behaves.

Testing RATING

To recap, here is the input file above and its rating:

Anne 25 Joe 20
Sue 5 Anne 5

Player Rating Points Points
1. Anne 2045 ( 30 pts) ********************************************
2. Sue 2038 ( 5 pts) ********************************************
3. Joe 1937 ( 20 pts) ************************************

If we now modify the score to

Anne 25 Joe 20
Sue 5 Anne 5
Sue 20 Anne 20

We see:

Player Rating Points Points
1. Anne 2025 ( 50 pts) *********************************************
2. Sue 2021 ( 25 pts) ********************************************
3. Joe 1923 ( 20 pts) ************************************

Here the relative score in the Sue vs Anne contest is the same, but the number of points (or wins) has increased. The net effect is that Sue and Anne have converged, diminishing the range of ratings, as the tied match now has a more dominant influence.

If we throw in an extra match result:

Anne 25 Joe 20
Sue 5 Anne 5
Sue 20 Anne 20
Joe 10 Sue 1

Here Joe gains a big advantage over Sue, even though the total points for this match are smaller than for the other two matches. This gives:

Player Rating Points Points
1. Joe 2058 ( 30 pts) *********************************************
2. Anne 2031 ( 50 pts) ******************************************
3. Sue 1898 ( 26 pts) ********************************

Now Joe has jumped to the top ranking, even though he lost to Anne. The difference between Anne and Sue has also widened.

When this is applied to hundreds of test results, adding a single extra result can cause many other ratings to adjust. In the environment of play-testing computer AI this may mean that an otherwise middling test version of the program suddenly looks better because of its strong performance against some new common opponent. This may cause the developer to return and re-assess that version.


This is not backed by a solid mathematical model, but instead depends on a mechanism that appears intuitively reasonable. It is flexible, as it can work with any points system, even in games with more than two players per game, and it does not need seed ratings. As with ELO, it can throw up results that you may be suspicious of, perhaps because of the influence of some small number of added results. The imperative then is to perform extra test runs that fill the gaps in the test matrix. In the example above the missing result was Joe vs Sue. Perhaps the matrix is already complete, in which case add more test results in critical matches where the number of results is still low.

This system was the backbone that drove the development of the Shogi-playing program Shotest, and it allowed the number of tests to be reduced, so that the play matrix did not need to be complete. It steered development towards versions that achieved better short-term results, as the more promising versions bubbled to the top, and it was very good at drawing attention to which range of versions warranted investigation.

The end product was a highly successful Shogi program that rose to top ranking faster than any other competing program at the time.

Jeff Rollason - September 2014