In May 2017, YouTube was broadcasting an epic game between the AI AlphaGo and the human player Ke Jie.
AlphaGo is an AI trained chiefly by playing against itself; its later version, AlphaGo Zero, started with zero human game data and learned exclusively through self-play.
We can think of AlphaGo as made of 3 components:
. policy iteration
. Monte Carlo tree search (MCTS)
. policy evaluation
// Starting from scratch:
// setting the initial position S
-----------------------------------------------------
| ................................................. |
| ................................................. |
| ................................................. |
| ................................................. |
| ................................................. |
| ................................................. |
| ................................................. |
| ................................................. |
| ................................................. |
| ....... S ....................................... |
| ................................................. |
| ................................................. |
| ................................................. |
| ................................................. |
| ................................................. |
| ................................................. |
-----------------------------------------------------
Policy, or the game scheme
Given a blank board to operate on, AlphaGo picks a root position S on it and starts working through policy iteration and evaluation. Policies can be seen as possible trajectories, or game schemes.
From each position, AlphaGo's neural network guides an MCTS that estimates which move is most likely to be the best one. Once a stone is placed, MCTS computes a winning-probability score based on the stone's position and the current policy, and on the basis of that score it evaluates and improves the policy itself.
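To make "policy iteration and evaluation" concrete, here is a minimal sketch of the classic reinforcement-learning loop on a toy four-state problem. This is purely illustrative and not AlphaGo's implementation: AlphaGo replaces the lookup table below with deep neural networks and uses search and self-play results as its evaluation signal.

# Classic policy iteration on a tiny toy problem, for illustration only.
# States 0..3 sit on a line; reaching state 3 gives reward 1, every step costs 0.04.
import random

states = [0, 1, 2, 3]            # state 3 is terminal
actions = [-1, +1]               # step left or step right
gamma = 0.9                      # discount factor

def step(s, a):
    s2 = min(max(s + a, 0), 3)
    reward = 1.0 if s2 == 3 else -0.04
    return s2, reward

policy = {s: random.choice(actions) for s in states if s != 3}
V = {s: 0.0 for s in states}

stable = False
while not stable:
    # policy evaluation: estimate the value of each state under the current policy
    for _ in range(50):
        for s in policy:
            s2, r = step(s, policy[s])
            V[s] = r + gamma * V[s2]
    # policy improvement: act greedily with respect to the estimated values
    stable = True
    for s in policy:
        best = max(actions, key=lambda a: step(s, a)[1] + gamma * V[step(s, a)[0]])
        if best != policy[s]:
            policy[s], stable = best, False

print(policy)   # expected: every non-terminal state steps right, toward the reward

AlphaGo plays the same evaluate-then-improve game, only its "policy" is a neural network and its evaluation comes from search and self-play rather than a table of values.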
Monte Carlo tree search
Monte Carlo tree search is widely used in video game AI and is at the very core of how AlphaGo plays.
It is a heuristic algorithm, i.e. it uses a general approach to solve the task: it is fast (it always finishes within a bounded time) but not optimal (it does not always find the best solution).
In full game-tree search, a computer can in principle calculate every single state and every single outcome of the game before placing a move. In a game like tic-tac-toe, every state offers at most 9 possible moves, and all scenarios can be computed rapidly. But a Go position offers up to 361 possible moves (around 250 on average), far too many to enumerate.
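A back-of-the-envelope count shows why brute force stops working. The numbers below ignore captures and other rules, so they are only rough upper bounds:

# Rough upper bounds on the number of possible move sequences, ignoring game rules.
import math

tic_tac_toe_sequences = math.factorial(9)        # 9 * 8 * ... * 1
go_sequences_10_moves = math.perm(361, 10)       # just the first 10 moves on a 19x19 board

print(tic_tac_toe_sequences)    # 362880 -- the whole game, easily enumerable
print(go_sequences_10_moves)    # ~3.3e25 -- and that is only 10 moves deep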
MCTS does not take every possible outcome into account: it picks a move, simulates its results, grows a "tree" of positions and feeds the outcome back. The more the tree grows, the more it learns about how to play.
MCTS proceeds in 4 phases: SELECTION (of a branch) > EXPANSION (of a new leaf) > SIMULATION (of a playout) > BACKPROPAGATION (of the result), as illustrated below.
// SELECTION
          Node0
        /   |   \
  state1  state2  state3

// EXPANSION
          Node0
        /   |   \
  state1  state2  state3
             |
             ↓
     new possible move

// SIMULATION
          Node0
        /   |   \
  state1  state2  state3
             |
             ↓
     new possible move
             |
             ↓
      game simulation

// BACK-PROPAGATION
          Node0
        /   |   \
  state1  state2  state3
             ▲
             |
     new possible move
             ▲
             |
      game simulation result,
      propagated back up to Node0
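To see the four phases working together, here is a self-contained toy MCTS for a simple Nim game (10 stones, each turn a player removes 1 to 3, whoever takes the last stone wins). It is an illustrative sketch, not AlphaGo's code, and its selection step uses the UCT rule described in the next section:

# Toy MCTS on Nim (10 stones, remove 1-3 per turn, taking the last stone wins).
# Illustrative sketch only -- not AlphaGo's implementation.
import math, random

class Node:
    def __init__(self, stones, player, parent=None, move=None):
        self.stones, self.player = stones, player    # 'player' is the one to move
        self.parent, self.move = parent, move
        self.children = []
        self.wins, self.visits = 0, 0

    def untried_moves(self):
        tried = {c.move for c in self.children}
        return [m for m in (1, 2, 3) if m <= self.stones and m not in tried]

    def uct_child(self, c=1.4):
        # SELECTION rule: exploitation (win ratio) + exploration (UCT bonus)
        return max(self.children, key=lambda ch:
                   ch.wins / ch.visits + c * math.sqrt(math.log(self.visits) / ch.visits))

def mcts(root, iterations=2000):
    for _ in range(iterations):
        node = root
        # 1. SELECTION: descend through fully expanded nodes using UCT
        while node.stones > 0 and not node.untried_moves():
            node = node.uct_child()
        # 2. EXPANSION: add one child for a move not tried yet
        if node.stones > 0:
            move = random.choice(node.untried_moves())
            child = Node(node.stones - move, 3 - node.player, parent=node, move=move)
            node.children.append(child)
            node = child
        # 3. SIMULATION: play random moves until the game ends
        stones, player = node.stones, node.player
        while stones > 0:
            stones -= random.choice([m for m in (1, 2, 3) if m <= stones])
            player = 3 - player
        winner = 3 - player            # the player who took the last stone
        # 4. BACK-PROPAGATION: push the result back up to the root
        while node is not None:
            node.visits += 1
            if winner != node.player:  # a win for the player who moved into this node
                node.wins += 1
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits).move

print(mcts(Node(stones=10, player=1)))  # typically prints 2 (leaving a multiple of 4)

Each iteration adds exactly one node to the tree, so after a couple of thousand iterations this toy tree is essentially fully explored. AlphaGo's search follows the same four phases, with the random playout largely replaced by a neural-network evaluation of the position.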
Now, how does MCTS choose its move? Through UCT (Upper Confidence Bound for Trees). UCT is based on the multi-armed bandit problem (or, the how-to-find-the-most-rewarding-slot-machine problem).
The UCT formula goes as:
Vi + C * √(ln N / ni)
where,
. Vi is the child node's win ratio
. C is a constant (the exploration parameter)
. ln is the natural logarithm
. N is the number of visits to the parent node
. ni is the number of visits to the child node
. i is the index of the child node
In other words, Vi is the estimated value of the node (how well it has performed so far), the square-root term grows when the node has been visited rarely compared with its parent, and C is a tunable bias parameter that balances exploitation against exploration.
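As a quick numeric illustration (the counts below are arbitrary example values), the formula is a one-liner:

import math

def uct(wins_i, visits_i, visits_parent, c=1.41):
    # exploitation term (win ratio) + exploration term (grows when the child is rarely visited)
    return wins_i / visits_i + c * math.sqrt(math.log(visits_parent) / visits_i)

print(uct(wins_i=1,  visits_i=2,  visits_parent=100))   # ~2.64: barely explored, big bonus
print(uct(wins_i=12, visits_i=20, visits_parent=100))   # ~1.28: better win ratio, smaller bonus

The child with the higher UCT score gets the next visit, which is how the search balances trying promising moves against revisiting neglected ones.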