David Silver

Research and Development of AlphaGo Zero

London, United Kingdom

Experience

Jan 2016 - Oct 2017
1 year 10 months
London, United Kingdom

Research and Development of AlphaGo Zero

DeepMind

  • Introduced a novel algorithm based solely on reinforcement learning for the game of Go, without requiring human data, guidance, or domain knowledge beyond game rules.

  • Developed AlphaGo Zero to learn tabula rasa, acting as its own teacher by training a neural network to predict its own move selections and game outcomes.

  • The neural network architecture combined the policy and value networks into a single network with two output heads, built from residual blocks of convolutional layers with batch normalisation and rectifier non-linearities.

  • Trained the system using a reinforcement learning algorithm with self-play, where a Monte-Carlo Tree Search (MCTS), guided by the neural network, generated improved move probabilities and game data for iterative network updates.

  • Each edge of the MCTS tree stored a prior probability, visit count, and action-value; simulations selected moves that maximized an upper confidence bound, and leaf nodes were evaluated by the neural network.

  • Neural network parameters were updated to minimize the error between predicted values and self-play outcomes and to maximize the similarity between the network's move probabilities and the MCTS search probabilities, using the loss function l = (z − v)^2 − π^T log p + c||θ||^2 (both the search bound and this objective are illustrated in the sketch following this list).

  • An initial training instance (20 residual blocks) ran for approximately 3 days, generating 4.9 million self-play games (1,600 MCTS simulations per move), achieving superhuman performance and defeating AlphaGo Lee 100-0 using a single machine with 4 TPUs.

  • A second, larger instance (40 residual blocks) trained for approximately 40 days, generating 29 million self-play games, achieving an Elo rating of 5,185 and defeating AlphaGo Master 89-11.

  • Discovered extensive Go knowledge from first principles, including fundamental concepts (fuseki, tesuji, life-and-death, ko, yose) and novel strategies, surpassing traditional Go knowledge.

  • The system learned using only raw board history as input features and minimal domain knowledge: game rules, Tromp-Taylor scoring, 19x19 board structure, and symmetries (rotation, reflection, color transposition).

  • Key team contributions for the "Mastering the Game of Go without Human Knowledge" publication (Nature, October 2017) included: design and implementation of the reinforcement learning algorithm, MCTS search algorithm, and evaluation framework; project management and advisement; and authorship of the paper.
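
A minimal sketch, in Python, of the search-and-train loop summarized above. It is illustrative only: the exploration constant C_PUCT, the node layout, and the single-position loss helper are assumptions rather than the published implementation; only the form of the upper confidence bound and of the loss l = (z − v)^2 − π^T log p + c||θ||^2 follows the paper.

    import math
    import numpy as np

    C_PUCT = 1.5  # assumed exploration constant (illustrative value)

    class Node:
        """One search-tree node; each edge stores a prior P, visit count N, and value sum W."""
        def __init__(self, prior):
            self.prior = prior        # P(s, a) from the policy head
            self.visits = 0           # N(s, a)
            self.value_sum = 0.0      # W(s, a); Q(s, a) = W / N
            self.children = {}        # move -> Node

        def q(self):
            return self.value_sum / self.visits if self.visits else 0.0

    def select_child(node):
        """Pick the child maximizing Q + U, the upper confidence bound used during simulation."""
        total_visits = sum(child.visits for child in node.children.values())
        def ucb(child):
            u = C_PUCT * child.prior * math.sqrt(total_visits) / (1 + child.visits)
            return child.q() + u
        return max(node.children.items(), key=lambda item: ucb(item[1]))

    def loss(z, v, pi, p, theta, c=1e-4):
        """Per-position loss l = (z - v)^2 - pi^T log p + c * ||theta||^2."""
        value_error = (z - v) ** 2
        policy_error = -np.dot(pi, np.log(p + 1e-10))   # cross-entropy vs. search probabilities
        l2_penalty = c * np.sum(theta ** 2)             # weight regularization
        return value_error + policy_error + l2_penalty

In training this objective would be averaged over mini-batches of positions sampled from recent self-play games; the sketch shows only the per-position form.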

Sep 2016 - Jan 2017
5 months
London, United Kingdom

Research and Development of AlphaGo Master

DeepMind

  • Developed AlphaGo Master, a program that defeated top human professional Go players 60–0 in online games in January 2017.
  • Utilized the same neural network architecture, reinforcement learning algorithm, and MCTS algorithm as AlphaGo Zero.
  • Differed from AlphaGo Zero by incorporating handcrafted features and rollouts derived from AlphaGo Lee.
  • Training was initialized using supervised learning from human game data (see the sketch after this list).
  • Operated on a single machine with 4 TPUs during evaluation games.
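
A rough sketch of that supervised initialization step, assuming a PyTorch-style policy network trained by cross-entropy on expert (position, move) pairs; all names, shapes, and hyperparameters here are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def supervised_init_step(policy_net, optimizer, positions, expert_moves):
        """One supervised step: fit the policy output to human expert moves
        before any reinforcement learning.

        positions:    float tensor of board feature planes, shape (batch, planes, 19, 19)
        expert_moves: long tensor of expert move indices, shape (batch,)
        """
        logits = policy_net(positions)                # (batch, 19*19 + 1) logits over moves and pass
        loss = F.cross_entropy(logits, expert_moves)  # match the human move choices
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()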
Nov 2015 - Mar 2016
5 months
London, United Kingdom

Research and Development of AlphaGo Lee

DeepMind

  • Developed AlphaGo Lee, the program that defeated 18-time world champion Lee Sedol 4–1 in March 2016.
  • Based on a similar architecture to AlphaGo Fan, with significant enhancements.
  • The value network was trained on outcomes of fast self-play games generated by AlphaGo itself, with an iterated training procedure representing an early step towards tabula rasa learning (see the sketch after this list).
  • Featured larger policy and value networks compared to AlphaGo Fan (12 convolutional layers with 256 planes each) and underwent more extensive training.
  • Operated as a distributed system utilizing 48 TPUs for faster neural network evaluations during search.
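
A minimal sketch of the value-network update described above: regressing the predicted value toward the outcome of the self-play game each position came from. It assumes the same PyTorch-style setup as the previous sketch; names and shapes are illustrative.

    import torch
    import torch.nn.functional as F

    def value_regression_step(value_net, optimizer, positions, outcomes):
        """One training step: push v(s) toward the game outcome z in {-1, +1}.

        positions: float tensor (batch, planes, 19, 19)
        outcomes:  float tensor (batch,) of final results from fast self-play games
        """
        v = value_net(positions).squeeze(-1)  # predicted outcome in [-1, 1]
        loss = F.mse_loss(v, outcomes)        # mean of (z - v)^2 over the batch
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()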
Jan 2015 - Oct 2015
10 months
London, United Kingdom

Research and Development of AlphaGo Fan

DeepMind

  • Developed AlphaGo Fan, the program that defeated European Go champion Fan Hui in October 2015 (results published in Nature, 2016).
  • Employed two deep neural networks: a policy network to predict move probabilities and a value network to evaluate board positions.
  • The policy network was initially trained via supervised learning on human expert moves, then refined using policy-gradient reinforcement learning.
  • The value network was trained to predict game winners from games played by the policy network against itself.
  • Combined these neural networks with a Monte-Carlo Tree Search (MCTS) algorithm for lookahead search.
  • The MCTS used the policy network to narrow the search to high-probability moves, and evaluated positions in the tree with the value network combined with Monte-Carlo rollouts under a fast rollout policy (leaf evaluation sketched after this list).
  • Operated as a distributed system across many machines, utilizing 176 GPUs.
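
A sketch of that leaf evaluation, blending the value network's estimate with the result of one fast rollout as V(s) = (1 − λ)·v(s) + λ·z, the mixing used in the 2016 Nature paper; the callables and the default mixing weight are illustrative assumptions.

    def evaluate_leaf(value_net, fast_rollout, leaf_position, lam=0.5):
        """Blend learned evaluation with a rollout result for one leaf position.

        value_net(position)    -> float in [-1, 1], the value network's estimate
        fast_rollout(position) -> +1 or -1, the result of playing the game out
                                  with the fast rollout policy
        """
        v = value_net(leaf_position)      # learned positional evaluation
        z = fast_rollout(leaf_position)   # outcome of one fast simulated game
        return (1.0 - lam) * v + lam * z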

Languages

English
Native
Chinese
Advanced

Education

Oct 2014 - Jun 2015

Imperial College London

Master's, Using Deep Reinforcement Learning to Play Chess · London, United Kingdom

Sep 2004 - Jun 2009

University of Alberta

PhD, Reinforcement Learning and Simulation-Based Search in Computer Go · Edmonton, Canada