Beating My Brother in Chess

Replicating AlphaGo for Chess

Peter Yu
12 min read · Apr 13, 2018

I have a brother who’s two years younger than me, and as with all brothers, we grew up competing with each other in pretty much everything. Our competitions more often than not resembled the story of The Tortoise and the Hare. I, ever the hare, would be the first one to take an interest in a game, quickly learn how to play, and lose interest as fast as I had learned. My brother, ever the tortoise, would learn the game, usually after I introduced him to it, then slowly, with the patience of a tortoise, improve his skills and eventually beat me.

Chess was one of the games that followed this pattern. I learned of chess in elementary school, after learning Janggi from my grandfather and father. I asked my parents to buy me a chess board, and because no one around me knew how to play chess (chess is quite obscure among Koreans, who prefer Go or Janggi), I taught myself how to play with the manual that came with the board. The only person I could practice chess with was my brother, so naturally he started playing too.

My brother and I weren’t always this civil when we played.

By the time I was in high school, I had already lost interest in chess, but my brother joined the chess club at his middle school. With proper training, my brother beat me at chess in a matter of weeks. To this day, I haven’t been able to beat him at chess.

By now, you might be wondering why I’m telling you this story. Well, it’s because I found a way to beat my brother at chess, and it’s by using deep learning. After my first deep learning project, I started looking for a new project to continue my studies in machine learning. After consulting with some colleagues and friends, I thought it would be fun practice to replicate the results of AlphaGo and AlphaZero at a smaller scale. Originally, I planned to try two approaches: replicate AlphaGo for the smaller 9x9 Go board, and AlphaZero for chess. This soon proved to be difficult, because there simply isn’t enough training data for 9x9 Go. As a result, I adjusted my plan and decided to replicate AlphaGo for chess, which comes with an added bonus: I might be able to beat my brother at chess, albeit by proxy.

AlphaGo

Before diving into my chess engine, it’d be helpful to go over what AlphaGo actually does. In order to do so, we first have to understand what it means to be good at Go, chess, or any other perfect information game, i.e. a game in which all the information is available to every player. Simply put, you have to play the optimal move at each turn. Humans search for the optimal moves using their intuition and by running simulations in their heads (looking ahead a few moves). Computers, on the other hand, have traditionally relied on brute force search using their computational prowess. For example, Deep Blue, the first chess program to defeat a world champion, employed brute force search using an algorithm called Minimax combined with hand-crafted heuristics developed by chess experts. Run on state-of-the-art hardware, this was enough to find moves at each turn better than any human could, despite the computational intractability of searching the full game tree.
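
To make this concrete, here is a minimal sketch of Minimax in Python. The game-state interface (is_game_over, legal_moves, apply) and the evaluate function are hypothetical placeholders; in Deep Blue’s case, evaluate would correspond to the hand-crafted heuristics mentioned above.

```python
# A minimal Minimax sketch over a hypothetical game-state interface.
def minimax(state, depth, maximizing):
    if depth == 0 or state.is_game_over():
        return state.evaluate()  # heuristic score from the maximizing player's point of view
    if maximizing:
        # our turn: pick the move that maximizes the resulting value
        return max(minimax(state.apply(m), depth - 1, False) for m in state.legal_moves())
    else:
        # opponent's turn: assume they pick the move that minimizes our value
        return min(minimax(state.apply(m), depth - 1, True) for m in state.legal_moves())
```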

Unfortunately, Deep Blue’s approach did not quite work for Go, because Go’s complexity is orders of magnitude larger than that of chess. To tackle this seemingly insurmountable challenge, other search algorithms were explored, and one called Monte Carlo tree search (MCTS) proved to be the most promising. Although MCTS alone was not enough to beat top humans at Go, AlphaGo proved that it can beat the best human Go players when supplemented by deep learning models.

I’m definitely not cool enough. (source: https://xkcd.com/1875/)

Monte Carlo Tree Search

The name comes from the casino in Monte Carlo, Monaco, and just as the name suggests, the algorithm involves random chance. Instead of analyzing moves deterministically with mathematical formulas like Minimax, MCTS relies on simulations to evaluate them. First, it represents the game as a tree, with each node being a state of the game and each edge connecting two nodes representing a move. Then, it repeatedly executes the following 4 steps:

  1. Selection: Starting from the root, repeatedly descend the tree by choosing the best move at each turn until you reach a node with no children (called a leaf).
  2. Expansion: If the leaf node is not terminal (i.e. it’s not the end of the game), add child nodes to it and randomly choose one child.
  3. Simulation: From the selected child, run a simulation until the game is over, and calculate the value of the outcome (e.g. 1 if the player won the game, -1 otherwise).
  4. Backpropagation: For each node on the path back to the root, update the visit count and add in the value calculated from the simulation.

After repeating these steps enough times, we pick the move that maximizes a value calculated from the visit count and the simulation value. If you want to learn more about MCTS, here are some resources I referenced extensively when I implemented it: What is MCTS?, Video from Udacity.
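
To make the four steps concrete, here is a bare-bones sketch of the MCTS loop in Python. The Node class, the UCB-style selection score, and the game interface (is_terminal, legal_moves, apply, result) are illustrative assumptions rather than an excerpt from any particular implementation, and the sketch glosses over details like flipping the value’s sign between the two players.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb_score(node, c=1.4):
    if node.visits == 0:
        return float("inf")  # always try unvisited children first
    exploit = node.value / node.visits
    explore = c * math.sqrt(math.log(node.parent.visits) / node.visits)
    return exploit + explore

def mcts(root, iterations):
    for _ in range(iterations):
        # 1. Selection: descend the tree, picking the "best" child at each level
        node = root
        while node.children:
            node = max(node.children, key=ucb_score)
        # 2. Expansion: if the game isn't over, add children and pick one at random
        if not node.state.is_terminal():
            node.children = [Node(node.state.apply(m), node) for m in node.state.legal_moves()]
            node = random.choice(node.children)
        # 3. Simulation: play random moves until the game ends and score the outcome
        state = node.state
        while not state.is_terminal():
            state = state.apply(random.choice(state.legal_moves()))
        value = state.result()  # e.g. 1 for a win, -1 for a loss
        # 4. Backpropagation: update visit counts and values along the path to the root
        while node is not None:
            node.visits += 1
            node.value += value
            node = node.parent
    # play the move leading to the most-visited child
    return max(root.children, key=lambda n: n.visits)
```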

We now have the general structure of the algorithm, but there are a couple of details that are still unclear:

  1. How do we select the best move in Selection?
  2. How do we calculate the value of a node in Simulation?

And these are the two places where AlphaGo used deep learning to achieve superhuman performance.

Selection

How does deep learning help in selecting the best move in the Selection step of MCTS? The model the AlphaGo team developed, which they called the policy network, basically looks at the Go board and tells us what it “thinks” the next best move is. More specifically, the current state of the board is first transformed into a stack of images. Each image in the stack encodes one type of information, such as the stone positions or various stone patterns (you can find the complete list in their paper). This stack of images is then fed into the policy network, which tells us the next move. The policy network is a convolutional neural network (CNN), a type of neural network whose structure takes inspiration from how our own visual cortex works. Thanks to this structure, CNNs excel at visual tasks and have been heavily employed in areas like face and object recognition. Since determining the next best move from the current board configuration can itself be regarded as a visual task, a CNN was a good choice for the policy network.
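
For illustration, here is a rough sketch in PyTorch of what such a convolutional policy network could look like. The number of input planes and filters are placeholder assumptions, not the actual AlphaGo architecture; the point is simply: image stack in, a probability for every board point out.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    def __init__(self, in_planes=18, board_size=19, filters=64):
        super().__init__()
        self.board_size = board_size
        self.conv = nn.Sequential(
            nn.Conv2d(in_planes, filters, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(filters, filters, kernel_size=3, padding=1),
            nn.ReLU(),
            # 1x1 convolution collapses the feature maps into one score per board point
            nn.Conv2d(filters, 1, kernel_size=1),
        )

    def forward(self, planes):
        # planes: (batch, in_planes, board_size, board_size)
        logits = self.conv(planes).view(-1, self.board_size * self.board_size)
        return torch.softmax(logits, dim=1)  # one probability per possible move
```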

The AlphaGo team then trained the policy network on 30 million positions extracted from expert online Go games, a process called supervised learning. To improve on the supervised learning network (SL network), they also employed reinforcement learning, essentially having it repeatedly play against itself, akin to playing practice matches. At this point, the reinforcement learning network (RL network) was able to win 85% of its games against Pachi, a highly sophisticated MCTS-based Go engine, relying solely on “intuition” without any search. However, despite its superior playing strength, the RL network did not perform as well as the SL network when used as the policy network in MCTS, presumably because the RL network “optimizes for the single best move”, whereas humans, and by extension the SL network, “select a diverse beam of promising moves.”

Simulation

Now that we know how to select the best move in Selection, let’s explore how the AlphaGo team figured out how to calculate the value of a node in the Simulation step of MCTS. They did this by combining two methods: 1. actually simulating the game to its end from the given node, and 2. using another CNN-based network to estimate the value. The key insight here is that speed matters more than accuracy, because MCTS is guaranteed to converge to an optimal move given enough time. As a result, the AlphaGo team came up with a simpler but faster policy network called the rollout policy network for method 1, and a value network that efficiently estimates the value of a node for method 2. The rollout policy network was trained on expert positions just like the SL network, but with simpler images and a smaller network size for speed. They tried to train the value network on expert positions as well, but this resulted in the network basically memorizing the values instead of generalizing to new positions (a phenomenon called overfitting). This was due to the strong correlation between successive positions in a game, which differ by only a single move. As a result, the AlphaGo team decided to simulate 30 million games using the SL network and the RL network and sample one position from each game. This eliminated the strong correlation between positions and resulted in a value network that generalized well.
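
The value network can be sketched in the same spirit as the policy network: a similar convolutional trunk, but with a head that outputs a single number between -1 and 1 estimating who is winning. The layer sizes below are again illustrative assumptions rather than the actual AlphaGo architecture.

```python
import torch.nn as nn

class ValueNetwork(nn.Module):
    def __init__(self, in_planes=18, board_size=19, filters=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_planes, filters, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(filters, filters, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Linear(filters * board_size * board_size, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
            nn.Tanh(),  # squash to [-1, 1]: -1 means losing, +1 means winning
        )

    def forward(self, planes):
        features = self.conv(planes).flatten(start_dim=1)
        return self.head(features)  # one scalar value estimate per position
```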

Beating Humans in Go

With all the pieces in place, AlphaGo was able to perform MCTS more efficiently than ever before. The team went even further and parallelized most of the computations. As a result, AlphaGo could find the optimal move faster and more accurately than any human, beating the 18-time world champion Lee Sedol in four of the five matches, and the then world number one Ke Jie in all three matches. The Chinese champion even described AlphaGo as “the God of Go” after the matches.

Yureka

After studying how AlphaGo works, I began writing my chess engine, Yureka. One of my good friends, an avid chess player who has been of great help on this project, came up with the name by combining my last name with Archimedes’ famous exclamation. The overall architecture is pretty much the same as AlphaGo’s, except for some adjustments I needed to make due to the differences between Go and chess, data availability, and hardware limitations.

Engine

Before I dove into the machine learning part of the project, I needed to write code for actually playing chess. This included a couple of components:

  1. A component that transforms the chess board into a stack of images that could be fed into my neural networks.
  2. A component that translates the output of the networks into an actual chess move.

I used a wonderful open source library called python-chess to write these components. I also implemented the Universal Chess Interface (UCI) so that my engine can be hooked up to any chess GUI software. In fact, most of the code I wrote for the engine was for these two components rather than the neural networks.
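
As an illustration of the first component, here is a simplified sketch of how a chess board can be turned into an image stack with python-chess. I use 12 binary planes here (one per piece type and color); the actual feature set in Yureka may differ, so treat this particular encoding as an assumption.

```python
import chess
import numpy as np

def board_to_planes(board: chess.Board) -> np.ndarray:
    planes = np.zeros((12, 8, 8), dtype=np.float32)
    for square, piece in board.piece_map().items():
        rank, file = chess.square_rank(square), chess.square_file(square)
        # piece_type runs 1..6 (pawn..king); offset black pieces by 6
        plane = (piece.piece_type - 1) + (0 if piece.color == chess.WHITE else 6)
        planes[plane, rank, file] = 1.0
    return planes

# Example: encode the position after 1. e4, pushed as a UCI move string,
# which is also the format a UCI front end uses to talk to the engine.
board = chess.Board()
board.push_uci("e2e4")
print(board_to_planes(board).shape)  # (12, 8, 8)
```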

Selection

I created my policy network using a new library called PyTorch and trained it on the expert chess games I downloaded from the KingBase Chess Database. I generated about 10 million positions and ran the training algorithm for about three days. At this point the supervised learning policy network (SL network) became good enough to play against, so a few of my friends and my brother played against it. While it was acceptable in the opening and middlegame, it had no idea what to do once it reached the endgame. This is presumably because the training dataset mainly consists of opening and middlegame positions (the network simply doesn’t see as many endgame positions), and also because it’s hard to finish a game on “intuition” alone, since that requires looking ahead a few moves. I mitigated the first problem by adding positions from an endgame database, and deferred the second, concluding that it would be addressed by MCTS. When I evaluated the policy network against the state-of-the-art chess engine Stockfish at its easiest level, I found that adding the endgame positions indeed helped.
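
For reference, the core of the supervised training loop looks roughly like the sketch below: predict the expert’s move from the encoded position and minimize the cross-entropy against it. The stand-in network, the 4,096-move output space (from-square times to-square), and the hyperparameters are all assumptions for illustration, not the exact code I used.

```python
import torch
import torch.nn as nn

# Stand-in for the CNN policy network: 12x8x8 planes in, 64x64 move logits out.
policy_net = nn.Sequential(nn.Flatten(), nn.Linear(12 * 8 * 8, 4096))
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(planes, expert_moves):
    # planes: (batch, 12, 8, 8) float tensor; expert_moves: (batch,) move indices
    optimizer.zero_grad()
    loss = loss_fn(policy_net(planes), expert_moves)
    loss.backward()
    optimizer.step()
    return loss.item()
```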

After my supervised learning policy network reached an acceptable level, I attempted to improve it further using reinforcement learning. Unfortunately, this proved to be difficult, as the network failed to improve via self-play. The underlying reasons weren’t entirely clear, but my guess is that chess, unlike Go, has draws, and because my SL network was bad at the endgame, it more often than not drew with itself, which didn’t provide a strong enough signal for improvement. I tried to adjust my algorithm to strengthen the signal, but that proved fruitless. After a few more attempts to mitigate the problem, I decided to move forward without an RL network.

Simulation

Following what the AlphaGo team did, I tried to implement both approaches. I started with the value network, since its training method is similar to that of the policy network I trained earlier. I tried to generate the training data the same way the AlphaGo team did, by simulating millions of games with the SL network, but that simply took too long on my desktop. Instead, I sampled 10 million positions from the expert games in the KingBase Chess Database that I had used to train the SL network. Since these positions were randomly sampled, I figured they wouldn’t be too correlated with each other. Thankfully, my intuition proved to be correct, and I didn’t observe overfitting.
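
In code, the sampling step is as simple as something like the sketch below, where games is assumed to be a list of (positions, result) pairs parsed from the PGN files; the real pipeline has more bookkeeping around parsing and encoding.

```python
import random

def sample_value_positions(games, n_samples=10_000_000):
    # Pool every position, labeled with its game's final result...
    pool = [(position, result) for positions, result in games for position in positions]
    # ...then draw a random subset so nearly identical successive positions don't dominate.
    return random.sample(pool, min(n_samples, len(pool)))
```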

The actual simulation proved to be difficult. As I mentioned earlier, the key is to make the simulation fast so that we can run through as many MCTS iterations as possible, so I tried to shrink the policy network and its feature set to decrease evaluation time, and to parallelize the simulations. However, it still was not fast enough. Parallelization was especially difficult because of the nature of the Python multiprocessing library and the hardware limitations of my desktop. Eventually, I had to give up on this approach and rely on the value network alone.

Beating My Brother in Chess

With the SL policy network for Selection and the value network for Simulation, Yureka was finally complete. I tried to do some formal evaluations by having Yureka play Stockfish and calculating its Elo rating, but it was taking too long on my desktop, especially with time controls, so I had to give up on that. I am planning to run these formal evaluations on better machines (possibly on AWS) in the future, and I will update this blog with the results.

Most importantly, Yureka played my brother, and well, here’s what happened:

Black checkmate. Yureka (Black) vs My Brother (White)

I finally beat my brother in chess, albeit by proxy. You can see the full game with computer analysis here.

Conclusion and Future Work

Based on the techniques described in the AlphaGo paper, I successfully implemented a competent chess engine that can beat some human players. It consists of three main parts: 1. the “brain” of the engine, based on deep neural networks, 2. a translation component that turns the board configuration into an image stack the neural networks can understand, and 3. an implementation of the Monte Carlo tree search algorithm that uses the output of the neural networks to search for the best move. As the AlphaGo team remarked in their paper, I also observed that the efficiency of a conventional algorithm like MCTS can be greatly enhanced when it is combined with machine learning models like deep neural networks. Expanding on this point, another big insight I gained was that good machine learning models are only one part of the whole engine; they require other components that utilize their output to generate meaningful results.

It was also quite eye-opening to draw parallels between how Yureka plays chess and how humans play chess. Magnus Carlsen, the current World Chess Champion, once said this in an interview when asked how he plays chess:

I think it’s partly subconscious, because sometimes there will be a decision-making process in my mind and then suddenly I make a move and I really don’t know why I did that. I really cannot sometimes sense the moment when I decide on that particular move… and at some point I just make it.

You could say Yureka has developed a similar kind of “subconscious” on how to play chess through the training process as it looks at the board and “intuitively” figures out what the next move should be and how good the current state of the board is. The moment when Yureka showed this “sign of life” was truly incredible, and this will forever keep me motivated to explore the world of artificial intelligence.

Building on this success, I plan to work on a few remaining tasks for my chess engine, such as cleaning up the code, combining the policy network and value network as described in the AlphaGo Zero paper, and training the network on a bigger dataset. I also plan to implement a Go engine that can play on the 9x9 board (as opposed to the standard 19x19 board), based on the AlphaGo Zero paper. Of course, I will continue to write new blog posts for future projects, so please stay tuned!

Update 10/22/2018: I have updated the code to combine the policy network and value network. I have not tried to write a Go engine, as the computing power required seemed too great for me. You can check out the code here. I wish I had more time to clean up the code further, but for now, I think this is good enough. I also presented Yureka at my company’s engineering education session last spring, with great success! The demo at the end was very exciting.


Peter Yu

PhD Student at UMich Researching NLP and Cognitive Architectures • Previously Real-time Distributed System Engineer turned NLP Research Engineer at ASAPP