[From the Stanford link above]
Unsurprisingly, there's a neural network at the core of things. The neural network is parameterised by $\theta$ and takes as input the state $s$ of the board. It has two outputs: a continuous value $v_\theta(s) \in [-1, 1]$ of the board state from the perspective of the current player, and a policy $\vec{p}_\theta(s)$ that is a probability vector over all possible actions.
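To make the two-headed output concrete, here is a minimal numpy sketch. It uses a toy fully connected layer rather than the CNN described here, and the sizes (36 squares for a 6x6 board, 37 actions including a pass) are illustrative assumptions, not values from the paper:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class TwoHeadNet:
    """Toy network with a shared body, a policy head and a value head.
    Layer sizes are illustrative assumptions for 6x6 Othello."""
    def __init__(self, n_squares=36, n_actions=37, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.1, (n_squares, hidden))  # shared body
        self.Wp = rng.normal(0, 0.1, (hidden, n_actions))  # policy head
        self.Wv = rng.normal(0, 0.1, (hidden, 1))          # value head

    def predict(self, board):
        h = np.tanh(board.flatten() @ self.W1)
        p = softmax(h @ self.Wp)         # probability vector over actions
        v = float(np.tanh(h @ self.Wv))  # scalar value in [-1, 1]
        return p, v

net = TwoHeadNet()
p, v = net.predict(np.zeros(36))
```

The key point is only the shape of the interface: one forward pass returns both a probability vector over actions and a scalar value in $[-1, 1]$.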
When training the network, at the end of each game of self-play, the neural network is provided training examples of the form $(s_t, \vec{\pi}_t, z_t)$. Here $\vec{\pi}_t$ is an estimate of the policy from state $s_t$ (we'll get to how $\vec{\pi}_t$ is arrived at in the next section), and $z_t \in \{-1, +1\}$ is the final outcome of the game from the perspective of the player at $s_t$ (+1 if the player wins, -1 if the player loses). The neural network is then trained to minimise the following loss function (excluding regularisation terms):

$$l = \sum_t \left( v_\theta(s_t) - z_t \right)^2 - \vec{\pi}_t \cdot \log \vec{p}_\theta(s_t)$$

The underlying idea is that over time, the network will learn what states eventually lead to wins (or losses). In addition, learning the policy gives a good estimate of the best action from a given state. The neural network architecture in general depends on the game. Most board games such as Go can use a multi-layer CNN architecture. In the paper by DeepMind, they use 20 residual blocks, each with 2 convolutional layers. I was able to get a 4-layer CNN followed by a few feedforward layers to work for 6x6 Othello.
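The loss above can be computed directly; this is a minimal numpy sketch (the function name and the small `eps` added for numerical stability are my own, and a real implementation would use an autodiff framework to get gradients):

```python
import numpy as np

def alphazero_loss(v_pred, z, pi, p_pred, eps=1e-8):
    """Sum over examples of (v - z)^2 minus pi . log(p),
    i.e. squared value error plus policy cross-entropy.
    Regularisation terms are excluded, as in the text."""
    value_loss = np.sum((v_pred - z) ** 2)
    policy_loss = -np.sum(pi * np.log(p_pred + eps))
    return value_loss + policy_loss
```

For a single example where the value prediction is exact and both policies are uniform over two actions, the value term vanishes and the loss reduces to the cross-entropy $\log 2$.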
Monte Carlo Tree Search for Policy Improvement
Given a state $s$, the neural network provides an estimate of the policy $\vec{p}_\theta(s)$. During the training phase, we wish to improve these estimates. This is accomplished using a Monte Carlo Tree Search (MCTS). In the search tree, each node represents a board configuration. A directed edge exists between two nodes if a valid action can cause a state transition from state $s$ to $s'$. Starting with an empty search tree, we expand the search tree one node (state) at a time. When a new node is encountered, instead of performing a rollout, the value of the new node is obtained from the neural network itself. This value is propagated up the search path. Let's sketch this out in more detail.
For the tree search, we maintain the following:
- $Q(s, a)$: the expected reward for taking action $a$ from state $s$, i.e. the Q values
- $N(s, a)$: the number of times we took action $a$ from state $s$ across simulations
- $P(s, \cdot) = \vec{p}_\theta(s)$: the initial estimate of taking an action from state $s$ according to the policy returned by the current neural network.
From these, we can calculate $U(s, a)$, the upper confidence bound on the Q-values, as

$$U(s, a) = Q(s, a) + c_{puct} \cdot P(s, a) \cdot \frac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a)}$$

Here $c_{puct}$ is a hyperparameter that controls the degree of exploration. To use MCTS to improve the initial policy returned by the current neural network, we initialise our empty search tree with $s$ as the root. A single simulation proceeds as follows. We compute the action $a$ that maximises the upper confidence bound $U(s, a)$. If the next state $s'$ (obtained by playing action $a$ on state $s$) exists in our tree, we recursively call the search on $s'$. If it does not exist, we add the new state to our tree, initialise $P(s', \cdot) = \vec{p}_\theta(s')$ and the value $v(s') = v_\theta(s')$ from the neural network, and initialise $Q(s', a)$ and $N(s', a)$ to 0 for all $a$. Instead of performing a rollout, we then propagate $v(s')$ up along the path seen in the current simulation and update all $Q(s, a)$ values. On the other hand, if we encounter a terminal state, we propagate the actual reward (+1 if the player wins, else -1).
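The selection step (pick the action maximising the upper confidence bound) can be sketched as follows. The dictionary-based bookkeeping keyed on $(s, a)$ pairs and the default $c_{puct} = 1$ are illustrative assumptions, not part of the algorithm's specification:

```python
import math

def select_action(s, Q, N, P, valid_actions, c_puct=1.0):
    """Return the action a maximising
    U(s,a) = Q(s,a) + c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a)).
    Q, N, P are dicts keyed on (state, action) pairs; missing Q/N entries
    are treated as 0, as for freshly initialised nodes."""
    total_n = sum(N.get((s, b), 0) for b in valid_actions)
    best, best_u = None, -float("inf")
    for a in valid_actions:
        u = Q.get((s, a), 0.0) + (
            c_puct * P[(s, a)] * math.sqrt(total_n) / (1 + N.get((s, a), 0))
        )
        if u > best_u:
            best, best_u = a, u
    return best
```

Note that when no action has been tried yet, the exploration term is zero for every action; implementations often add a small epsilon under the square root so that the prior $P(s, a)$ still breaks ties on the first visit.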
After a few simulations, the $N(s, a)$ values at the root provide a better approximation for the policy. The improved stochastic policy $\vec{\pi}(s)$ is simply the normalised counts $N(s, \cdot) / \sum_b N(s, b)$. During self-play, we perform MCTS and pick a move by sampling from the improved policy $\vec{\pi}(s)$. Below is a high-level implementation of one simulation of the search algorithm.