
For some reason, AlphaGo Zero isn't getting as much publicity as the original AlphaGo, despite its incredible results. Starting from scratch, it has already beaten AlphaGo Master and passed numerous other benchmarks. Even more incredibly, it did this in 40 days. Google calls it "arguably the best Go player in the world".

DeepMind claims this is a "novel form of reinforcement learning" - is this technique truly novel? Or have there been other cases where this technique was used, and if so, what were the results? I think the requirements I'm talking about are 1) no human intervention and 2) no historical play, but these are flexible.

This appears to be a similar question, but all the answers seem to start from the assumption that AlphaGo Zero is the first of its kind.


1 Answer


The AlphaGo Zero article from Nature, "Mastering the Game of Go without Human Knowledge", claims four major differences from the earlier version:

  1. Self-learning only (not trained on human games).
  2. Using only the board and stones as input (no hand-written features).
  3. Using a single neural network for policies and values.
  4. A new tree-search algorithm that uses this combined policy/value network to guide where to search for good moves (see the selection-rule sketch after this list).
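To make (4) a little more concrete: instead of random rollouts, each edge in the search tree stores the network's prior probability for that move plus a running value estimate, and the search descends by balancing the two. Here is a minimal, hypothetical sketch of that kind of PUCT-style selection rule; the `Node` fields (`children`, `visit_count`, `value_sum`, `prior`) and the `c_puct` constant are illustrative names of mine, not the paper's code.

```python
import math

def select_child(node, c_puct=1.0):
    """Pick the move maximising Q(s, a) + U(s, a), where U is driven by the
    network's prior P(s, a) and the visit counts (a PUCT-style rule)."""
    total_visits = sum(child.visit_count for child in node.children.values())
    best_move, best_child, best_score = None, None, float("-inf")
    for move, child in node.children.items():
        q = child.value_sum / child.visit_count if child.visit_count else 0.0
        u = c_puct * child.prior * math.sqrt(total_visits) / (1 + child.visit_count)
        if q + u > best_score:
            best_move, best_child, best_score = move, child, q + u
    return best_move, best_child
```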

Points (1) and (2) are not new in reinforcement learning, but they improve on the previous AlphaGo software, as stated in the comments to your question. It just means they are now using pure reinforcement learning starting from randomly initialized weights. This is enabled by better, faster learning algorithms.
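As a rough illustration of what that means in practice: the only training data is generated by the current network playing against itself, starting from random weights. A minimal, hypothetical sketch of the outer loop follows; `new_game`, the `run_mcts` search (returning a visit-count distribution over moves) and `net.train_on` are assumed helpers of mine, not AlphaGo Zero's actual code.

```python
import random

def self_play_training(net, new_game, run_mcts, iterations=100, games_per_iter=25):
    """Pure self-play RL loop: no human games, no hand-crafted features; the
    weights start random and improve only from games the network plays itself."""
    replay_buffer = []                                   # (state, search_policy, outcome)
    for _ in range(iterations):
        for _ in range(games_per_iter):
            game, history = new_game(), []
            while not game.is_over():
                pi = run_mcts(net, game)                 # dict: move -> visit-count probability
                history.append((game.state(), pi, game.to_play()))   # to_play(): +1 or -1
                move = random.choices(list(pi), weights=list(pi.values()))[0]
                game.play(move)
            z = game.winner()                            # +1 if black won, -1 if white won
            replay_buffer += [(s, p, z if player == +1 else -z)
                              for s, p, player in history]
        batch = random.sample(replay_buffer, min(2048, len(replay_buffer)))
        net.train_on(batch)                              # combined policy/value loss (see below)
    return net
```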

Their claim here is "Our primary contribution is to demonstrate that superhuman performance can be achieved without human domain knowledge." (p. 22).

Points (3) and (4) are novel in the sense that their algorithm is simpler and more general than their previous approach. They also mention that it is an improvement on previous work by Guo et al.

Unifying the policy and value networks (3) enables them to implement a more efficient variant of Monte Carlo tree search to find good moves while simultaneously using the search tree to train the network faster (4). This is very powerful.
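As a toy illustration of (3), here is a minimal PyTorch sketch of a single network with a shared trunk, a policy head and a value head, trained with one combined loss. The sizes are illustrative and nowhere near the residual tower described in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyValueNet(nn.Module):
    """One network, two heads: a shared convolutional trunk feeding a policy head
    over the 19*19 + 1 moves (including pass) and a scalar value head in [-1, 1].
    Sizes here are illustrative, far smaller than the residual tower in the paper."""

    def __init__(self, in_planes=17, channels=64, board=19):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.policy_head = nn.Linear(channels * board * board, board * board + 1)
        self.value_head = nn.Sequential(
            nn.Linear(channels * board * board, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Tanh(),
        )

    def forward(self, x):                       # x: (batch, in_planes, board, board)
        h = self.trunk(x).flatten(1)
        return F.log_softmax(self.policy_head(h), dim=1), self.value_head(h)


def combined_loss(log_p, v, target_pi, target_z):
    """Single joint objective: value regression towards the game outcome plus
    cross-entropy towards the search's visit-count policy (weight decay, the
    paper's third term, can be handled by the optimizer)."""
    value_loss = F.mse_loss(v.squeeze(-1), target_z)
    policy_loss = -(target_pi * log_p).sum(dim=1).mean()
    return value_loss + policy_loss
```

Because both heads share a single forward pass, the evaluation that supplies the move priors for the tree search also supplies the leaf value, which is what makes (4) cheap.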

Furthermore, they describe a number of interesting implementation details, like batching and reusing data structures, to optimize the search for new moves.
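For example, instead of one network call per leaf, pending leaf positions can be queued and evaluated together in small mini-batches (if I remember correctly the paper uses a batch size of 8 here). A rough sketch, reusing the hypothetical network from above:

```python
import torch

def evaluate_leaves(net, leaf_states, batch_size=8):
    """Evaluate queued MCTS leaf positions in small batches instead of one network
    call per leaf. `leaf_states` is a list of (in_planes, board, board) tensors and
    `net` is a policy/value network like the sketch above."""
    priors, values = [], []
    with torch.no_grad():
        for i in range(0, len(leaf_states), batch_size):
            x = torch.stack(leaf_states[i:i + batch_size])
            log_p, v = net(x)
            priors.extend(log_p.exp())           # back to probabilities, used as P(s, a)
            values.extend(v.squeeze(-1).tolist())
    return priors, values
```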

The effect is that it needs less computing power, running on 4 TPUs rather than the 176 GPUs and 48 TPUs used by previous versions of their software.

This definitely makes it "novel" in the context of Go software. I believe that (3) and (4) are also "novel" in a broader context and will be applicable in other reinforcement learning domains such as robotics.
