In May 2017, YouTube was broadcasting an epic game between the AI AlphaGo and the human player Ke Jie.
AlphaGo is an AI trained chiefly by playing against itself; its later version, AlphaGo Zero, started with zero human game data and learned exclusively through self-play.
We can think of AlphaGo as made of 3 components:
. policy iteration
. Monte Carlo tree search (MCTS)
. policy evaluation
// Starting from scratch:
// setting the initial position S
-----------------------------------------------------
| ................................................. |
| ................................................. |
| ................................................. |
| ................................................. |
| ................................................. |
| ................................................. |
| ................................................. |
| ................................................. |
| ................................................. |
| ....... S ....................................... |
| ................................................. |
| ................................................. |
| ................................................. |
| ................................................. |
| ................................................. |
| ................................................. |
-----------------------------------------------------
Policy, or the game scheme
Given a blank board to operate on, AlphaGo picks a root position S on it and starts working through policy iteration and evaluation. Policies can be seen as possible trajectories, or game schemes.
From each position, AlphaGo's neural network guides an MCTS that estimates which move is most likely to be the best one. Once a stone is placed, MCTS computes a winning-probability score based on the stone's position and the current policy, and on the basis of that score it evaluates and improves the policy itself.
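To make "policy iteration and evaluation" concrete, here is a minimal sketch of the classic reinforcement-learning loop on a toy four-state problem. This is purely illustrative and not AlphaGo's implementation: AlphaGo replaces the lookup table below with deep neural networks and uses search and self-play results as its evaluation signal.

# Classic policy iteration on a tiny toy problem, for illustration only.
# States 0..3 sit on a line; reaching state 3 gives reward 1, every step costs 0.04.
import random

states = [0, 1, 2, 3]            # state 3 is terminal
actions = [-1, +1]               # step left or step right
gamma = 0.9                      # discount factor

def step(s, a):
    s2 = min(max(s + a, 0), 3)
    reward = 1.0 if s2 == 3 else -0.04
    return s2, reward

policy = {s: random.choice(actions) for s in states if s != 3}
V = {s: 0.0 for s in states}

stable = False
while not stable:
    # policy evaluation: estimate the value of each state under the current policy
    for _ in range(50):
        for s in policy:
            s2, r = step(s, policy[s])
            V[s] = r + gamma * V[s2]
    # policy improvement: act greedily with respect to the estimated values
    stable = True
    for s in policy:
        best = max(actions, key=lambda a: step(s, a)[1] + gamma * V[step(s, a)[0]])
        if best != policy[s]:
            policy[s], stable = best, False

print(policy)   # expected: every non-terminal state steps right, toward the reward

AlphaGo plays the same evaluate-then-improve game, only its "policy" is a neural network and its evaluation comes from search and self-play rather than a table of values.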
Monte Carlo tree search
Monte Carlo tree search is widely used in video game AI and is at the very core of how AlphaGo plays.
It is a heuristic algorithm, i.e. it uses a general approach to solve the task: it is fast (it always finishes within a bounded time) but not optimal (it does not always find the best solution).
In full game-tree search, a computer can in principle calculate every single state and every single outcome of the game before placing a move. In a game like tic-tac-toe, every state offers at most 9 possible moves, and all scenarios can be computed rapidly. But a Go position offers up to 361 possible moves (around 250 on average), far too many to enumerate.
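A back-of-the-envelope count shows why brute force stops working. The numbers below ignore captures and other rules, so they are only rough upper bounds:

# Rough upper bounds on the number of possible move sequences, ignoring game rules.
import math

tic_tac_toe_sequences = math.factorial(9)        # 9 * 8 * ... * 1
go_sequences_10_moves = math.perm(361, 10)       # just the first 10 moves on a 19x19 board

print(tic_tac_toe_sequences)    # 362880 -- the whole game, easily enumerable
print(go_sequences_10_moves)    # ~3.3e25 -- and that is only 10 moves deep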
MCTS does not take every possible outcome into account: it picks a move, simulates its results, grows a "tree" of positions and feeds the outcome back. The more the tree grows, the more it learns about how to play.
MCTS proceeds in 4 phases: SELECTION (of a branch) > EXPANSION (of a new leaf) > SIMULATION (of a playout) > BACKPROPAGATION (of the result), as illustrated below.
// SELECTION
          Node0
        /   |   \
  state1  state2  state3

// EXPANSION
          Node0
        /   |   \
  state1  state2  state3
             |
             ↓
     new possible move

// SIMULATION
          Node0
        /   |   \
  state1  state2  state3
             |
             ↓
     new possible move
             |
             ↓
      game simulation

// BACK-PROPAGATION
          Node0
        /   |   \
  state1  state2  state3
             ▲
             |
     new possible move
             ▲
             |
      game simulation result,
      propagated back up to Node0
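To see the four phases working together, here is a self-contained toy MCTS for a simple Nim game (10 stones, each turn a player removes 1 to 3, whoever takes the last stone wins). It is an illustrative sketch, not AlphaGo's code, and its selection step uses the UCT rule described in the next section:

# Toy MCTS on Nim (10 stones, remove 1-3 per turn, taking the last stone wins).
# Illustrative sketch only -- not AlphaGo's implementation.
import math, random

class Node:
    def __init__(self, stones, player, parent=None, move=None):
        self.stones, self.player = stones, player    # 'player' is the one to move
        self.parent, self.move = parent, move
        self.children = []
        self.wins, self.visits = 0, 0

    def untried_moves(self):
        tried = {c.move for c in self.children}
        return [m for m in (1, 2, 3) if m <= self.stones and m not in tried]

    def uct_child(self, c=1.4):
        # SELECTION rule: exploitation (win ratio) + exploration (UCT bonus)
        return max(self.children, key=lambda ch:
                   ch.wins / ch.visits + c * math.sqrt(math.log(self.visits) / ch.visits))

def mcts(root, iterations=2000):
    for _ in range(iterations):
        node = root
        # 1. SELECTION: descend through fully expanded nodes using UCT
        while node.stones > 0 and not node.untried_moves():
            node = node.uct_child()
        # 2. EXPANSION: add one child for a move not tried yet
        if node.stones > 0:
            move = random.choice(node.untried_moves())
            child = Node(node.stones - move, 3 - node.player, parent=node, move=move)
            node.children.append(child)
            node = child
        # 3. SIMULATION: play random moves until the game ends
        stones, player = node.stones, node.player
        while stones > 0:
            stones -= random.choice([m for m in (1, 2, 3) if m <= stones])
            player = 3 - player
        winner = 3 - player            # the player who took the last stone
        # 4. BACK-PROPAGATION: push the result back up to the root
        while node is not None:
            node.visits += 1
            if winner != node.player:  # a win for the player who moved into this node
                node.wins += 1
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits).move

print(mcts(Node(stones=10, player=1)))  # typically prints 2 (leaving a multiple of 4)

Each iteration adds exactly one node to the tree, so after a couple of thousand iterations this toy tree is essentially fully explored. AlphaGo's search follows the same four phases, with the random playout largely replaced by a neural-network evaluation of the position.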
Now, how does MCTS choose its move? Through UCT (Upper Confidence Bound for Trees). UCT is based on the multi-armed bandit problem (or, the how-to-find-the-most-rewarding-slot-machine problem).
The UCT formula goes as:
Vi + C * √(ln N / ni)
where,
. Vi is the child node's win ratio
. C is a constant (the exploration parameter)
. ln is the natural logarithm
. N is the number of visits to the parent node
. ni is the number of visits to the child node
. i is the index of the child node
In other words, Vi is the estimated value of the node (how well it has performed so far), the square-root term grows when the node has been visited rarely compared with its parent, and C is a tunable bias parameter that balances exploitation against exploration.
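As a quick numeric illustration (the counts below are arbitrary example values), the formula is a one-liner:

import math

def uct(wins_i, visits_i, visits_parent, c=1.41):
    # exploitation term (win ratio) + exploration term (grows when the child is rarely visited)
    return wins_i / visits_i + c * math.sqrt(math.log(visits_parent) / visits_i)

print(uct(wins_i=1,  visits_i=2,  visits_parent=100))   # ~2.64: barely explored, big bonus
print(uct(wins_i=12, visits_i=20, visits_parent=100))   # ~1.28: better win ratio, smaller bonus

The child with the higher UCT score gets the next visit, which is how the search balances trying promising moves against revisiting neglected ones.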