Zero knowledge AI for Tic-tac-toe and Gomoku (Five in a row) games

Starting from zero

When building AI for games, zero knowledge simply means that when the system starts learning the game, it only understands what it means to move, win, or lose. In some situations it is certainly easier and faster to simply instruct the machine how to play instead of building a system that can infer knowledge from gameplay: for simple or very constrained problems, like Tic-tac-toe, an algorithmic approach can be an excellent solution.

However, as the game's complexity increases, it is not always easy to tell the system how to play - sometimes advanced players can employ different strategies and beat the computer every time. In these situations, a self-learning system might be a better approach.

Artificial neural networks

Neural networks are great at recognising complex patterns and building predictions based on them, although it's worth keeping in mind that they aren't the best solution to every problem: as the "knowledge" is stored in matrix weights, most if not all intuition about why a certain prediction was generated from some input data is usually lost.

Generally speaking, ANNs are a great choice in situations where the rules can be very complex and some prediction error is acceptable. On the other hand, when the rules are reasonably simple (e.g. they can be phrased in plain human language), other approaches, like a manually created set of rules, are potentially better for solving the problem.

Generate training data

To teach an ANN we need as much training data as we can get. Ideally, we would have access to expert players' previous games and start from there, but that's not always feasible - maybe we are inventing a new game and want to see what strategies would be good, or we simply don't have access to expert gameplays.

Also, the training data should be labelled so the system can learn what is a "better" or "worse" move - simply showing gameplays without proper labels makes the data impossible to learn from.
Imagine the following gameplay:


It is very obvious to a human player that X could have won the game, so depending on how we label the data we could either conclude that placing the X in the middle position will make X lose the game, or that X's 3rd move was a terrible idea. It all depends on how we label the data across many gameplays.

Obviously, the naive approach of labelling the moves solely based on who won isn't right. What we could do instead is label each move based on how good an idea it has turned out to be in previous games, so eventually the labels converge to the correct winning percentages for that setup.
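As a rough sketch of what such a label could look like (the exact formula is my assumption, not taken from the original code), each stored position gets a value in the -1 to 1 range derived from its accumulated win/draw/loss counts:

# Sketch: turn accumulated game statistics for one board position into a
# training label in the [-1, 1] range. The formula is illustrative only.
def position_label(wins, draws, losses):
    games = wins + draws + losses
    if games == 0:
        return 0.0                    # unseen position, neutral value
    return (wins - losses) / games    # +1 = always won, -1 = always lost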

Labelling based on the Monte Carlo Tree

Monte Carlo Tree Search was invented to guide game-playing AI based on previous successes and failures of the same move, while adding some randomness to explore new ideas, in case those turn out to be better than the current ones.

For the actual gameplay we aren't using MCTS, but we do use the same idea for generating the training data - after each game we update the success and failure counts of the boards in our training dataset and retrain our model on the updated values.
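A minimal sketch of that bookkeeping (the dictionary-based storage and the function name are illustrative assumptions, not the actual implementation):

# stats maps a board position to its (wins, draws, losses) counts.
stats = {}

def record_game(positions):
    """Update the success/failure counts after one finished game.
    positions: list of (board, outcome) pairs, where each board is a numpy
    array normalised so the side to move is X, and outcome is the final
    result from that side's perspective: 1 = win, 0 = draw, -1 = loss."""
    for board, outcome in positions:
        key = board.tobytes()
        wins, draws, losses = stats.get(key, (0, 0, 0))
        if outcome == 1:
            wins += 1
        elif outcome == 0:
            draws += 1
        else:
            losses += 1
        stats[key] = (wins, draws, losses)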

Even if, based on the first game above, the middle X position seems to be a losing one, eventually, over many, many games, it will have a higher "win" ratio than a "draw" or "lose" ratio, and our model will learn that it is the best next move from an empty board. The same logic applies to all moves, so with enough training data a correctly trained model learns what the best move was in similar situations.

And the "similar situations" is the key here - while the MCTS cannot answer similarity questions, our artificial neural network can. It will learn what patterns generally lead to winning and losing and the predicted weights will reflect that.

Randomisation

When our model starts learning, it knows nothing about the game - all predicted moves are probably incorrect (although not random: the network has some initial weights, so it will consistently predict poor moves rather than random ones).

The first question is: should we generate totally random games, label those moves, and train the model on them first, or should we have the model play against a somewhat randomised version of itself right away?

Also, when the model is playing against itself, how random should the opponent player be?
To answer the first question, it seems that the more complex the game, the worse an idea pre-training on totally random games is, as the totally random moves will significantly distort the weights - just think about our previous example. For an 8x8 board where 5 in a row is required to win there are 3^64 potential board setups, which is about 3.43e30. The random games would be so diverse that it would be very difficult to learn from them.

The second question is a bit trickier - one approach is to randomise based on how bad the situation is. If the weights predict a guaranteed winning move, don't randomise too much, let's say only 1 in 20 moves (5%). If the prediction is a draw in that situation, randomise more (25%), but if it's a guaranteed losing setup we may as well try something new (50%).
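Expressed as a sketch - the 5%/25%/50% rates come from the description above, while the exact thresholds on the predicted value and the function names are my assumptions:

import random

def exploration_rate(predicted_value):
    # predicted_value is the value net's evaluation in the [-1, 1] range
    if predicted_value > 0.5:      # looks like a winning position
        return 0.05                # randomise only 1 in 20 moves
    if predicted_value > -0.5:     # roughly a draw
        return 0.25
    return 0.5                     # losing anyway, so explore aggressively

def choose_move(best_move, legal_moves, predicted_value):
    if random.random() < exploration_rate(predicted_value):
        return random.choice(legal_moves)
    return best_move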

Rotating, flipping, and whose turn it is

The other tricky question is how to store the boards. Looking at the following setup, guess who will win:



It depends on whose turn it is! We need to make sure that we store each board in a way that is consistent for learning, so we store it as if it were X's move. If it was actually O's move, we swap the pieces so that it looks like it was X's move. Without doing this the system would have to learn inconsistent values: the setup above would sometimes be a winning and sometimes a losing one.
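With the board stored as a numpy array of 1 for X, -1 for O and 0 for empty (the encoding is my assumption), this normalisation is just a sign flip:

import numpy as np

def normalise_to_x(board: np.ndarray, x_to_move: bool) -> np.ndarray:
    """Return the board as if it were X's turn.
    board: numpy array with 1 = X, -1 = O, 0 = empty.
    If it is actually O's turn, swap the pieces by negating the array."""
    return board if x_to_move else -board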

Also, because we know the game doesn't have a direction (unlike chess), we can flip the boards horizontally and vertically and rotate them by 90, 180 and 270 degrees after each game, storing the same label with each variant - increasing the training data without playing more games.
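A sketch of that augmentation with numpy (whether the original code generates all eight symmetries exactly this way is an assumption):

import numpy as np

def augment(board, label):
    """Yield the 8 symmetric variants of a board with the same label:
    the 4 rotations and their horizontal mirrors."""
    for k in range(4):
        rotated = np.rot90(board, k)
        yield rotated, label
        yield np.fliplr(rotated), label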

Neural Network Architecture

As the game is about local features (N pieces in a row next to each other, not scattered anywhere on the board), it seems logical to use convnets. In my case I used the following architecture:

Conv2D > Conv2D > MaxPooling > Conv2D > Conv2D > Flatten > Dense > Dense (1)

The size (or filter count) for each layer really depends on the problem at hand, but for the board size of 6 and win length of 5 the

128 > 128 > 256 > 256 > 128 > 1

setup seemed sufficient.
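Put together in Keras, the network looks roughly like this - only the layer order and filter counts come from the description above; the kernel sizes, padding, activations and loss are my assumptions:

from tensorflow import keras
from tensorflow.keras import layers

BOARD_SIZE = 6

# Input: a single 6x6 plane with 1 = X, -1 = O, 0 = empty, normalised to X's turn.
model = keras.Sequential([
    keras.Input(shape=(BOARD_SIZE, BOARD_SIZE, 1)),
    layers.Conv2D(128, (3, 3), padding='same', activation='relu'),
    layers.Conv2D(128, (3, 3), padding='same', activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(256, (3, 3), padding='same', activation='relu'),
    layers.Conv2D(256, (3, 3), padding='same', activation='relu'),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(1, activation='tanh'),   # board value in the [-1, 1] range
])
model.compile(optimizer='adam', loss='mse')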

One drawback of this approach is that it only evaluates the current board with a value between -1 and 1; it does not predict potential next moves. To get the next move we need to iterate through the board, place a piece on an empty spot, evaluate that setup, and remove the piece. If this produces a better evaluation than the best seen so far, we recommend this move.
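As a sketch of that search loop (the helper name and the board encoding are the same illustrative assumptions as above):

import numpy as np

def best_move(model, board):
    """Try every empty cell, evaluate the resulting board with the value net,
    and return the placement with the highest predicted value.
    board: numpy array with 1 = X, -1 = O, 0 = empty, X to move."""
    best, best_value = None, -np.inf
    for row, col in zip(*np.where(board == 0)):
        board[row, col] = 1                       # tentatively place the piece
        value = model.predict(board[np.newaxis, ..., np.newaxis], verbose=0)[0, 0]
        board[row, col] = 0                       # remove it again
        if value > best_value:
            best, best_value = (row, col), value
    return best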

However, for larger boards this is quite slow, which makes the training slow. For the sizes mentioned above we already have 1,068,289 tunable parameters, so predicting a single value for an empty cell is quite computationally expensive, and for each move we need to do it as many times as there are empty cells - making it an O(N^2) approach for an NxN board.

This can be considered the "value net"; AlphaGo Zero used a policy net as well, which predicts which moves are worth evaluating at all, dramatically reducing the cost of finding the best next move.

Does it work?

The model isn't perfect but it is reasonably good. Here is the model (O) beating me (X) in an actual game. It learned very well how to block me from creating a straight line of 5 pieces. In other words, it adapted very well to this board size - the same tactics wouldn't work on a larger board.



The code

The full source code is available on GitHub
