Many recent successes of Reinforcement Learning are fueled by a massive increase in the scale of experiments. This article traces the evolution of the distributed architectures used to support these models and discusses how algorithm families (e.g. on-policy vs. off-policy) affect implementation choices at datacenter scale.
In Supervised Learning, the training data distribution is stationary. Gradient Descent in SL is an (almost) "embarrassingly parallel" problem in which multiple workers can progress independently with few communication bottlenecks. Data parallelism, where different workers train in parallel on different chunks of the training dataset, is a great fit for Stochastic Gradient Descent.
The archetype for this approach is DistBelief. 1 Multiple workers sample batches of training data, compute gradients, and send them to a global parameter server which applies them to the shared instance of the model. Since the training data distribution is stationary, the parameter server can simply average incoming gradients before applying them (i.e. averaging gradients and then applying them is equivalent to applying them one by one). Individual workers can also take advantage of model parallelism and break the model down into multiple parts to fit within limited GPU memory. The parameter server itself can also be sharded to cope with the traffic generated by many workers.
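To make the pattern concrete, here is a minimal in-process sketch of the data-parallel loop. The single-shard ParameterServer and the user-supplied compute_gradients function are hypothetical; real systems like DistBelief shard the server across machines and let workers run asynchronously over the network.

```python
class ParameterServer:
    """Toy single-shard parameter server (illustrative sketch only)."""

    def __init__(self, params, lr=0.01):
        # params: dict of name -> NumPy array (or any array supporting -=)
        self.params = {k: v.copy() for k, v in params.items()}
        self.lr = lr

    def apply_gradients(self, grads):
        # Because the supervised data distribution is stationary, gradients
        # from different workers can simply be summed/averaged and applied.
        for k, g in grads.items():
            self.params[k] -= self.lr * g

    def get_params(self):
        return {k: v.copy() for k, v in self.params.items()}


def worker_step(server, batch, compute_gradients):
    # Each worker: pull the latest parameters, compute gradients on a local
    # batch, and push the gradients back to the server.
    params = server.get_params()
    grads = compute_gradients(params, batch)
    server.apply_gradients(grads)
```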
Reinforcement Learning doesn't enjoy the same luxury. For one, models need to go out and fetch their own data from the environment. Even more worryingly, the training data itself depends on the policy, and therefore can't be reused willy-nilly (this is true even for off-policy algorithms). As a result, each RL breakthrough relies on a delicate interplay of algorithms, distributed architectures, and plain hacks.
In the next sections, we'll follow the history of distributed architectures for the two main families of RL algorithms: value-based and policy-based. We'll see how they recycle and iterate on the same ideas and have somewhat converged over the past few years. Basic knowledge of RL fundamentals and vocabulary is assumed.
Q-Learning algorithms are value-based off-policy algorithms especially suited for discrete action spaces.
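As a refresher, the heart of Q-Learning is a bootstrapped target that never references the behavior policy, which is what makes the algorithm off-policy. A minimal sketch:

```python
import numpy as np

def q_learning_target(reward, next_q_values, gamma=0.99, done=False):
    """One-step Q-Learning target (illustrative sketch).

    The max over next-state action values does not depend on the policy
    that generated the transition, which is what makes Q-Learning
    off-policy and, as we'll see, lets stale experience stay useful."""
    if done:
        return reward
    return reward + gamma * np.max(next_q_values)
```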
Google's 2015 paper about Gorila describes the archetypal architecture for distributed Q-Learning at scale. 2 Gorila follows the same blueprint as DistBelief, with independent agents that collect samples from the environment, update their parameters, and communicate gradients back to a global parameter server. In the "bundled" configuration analyzed in the paper, each worker also has its own dedicated replay memory to store state transitions and periodically downloads an updated copy of the Q model from the parameter server. As sketched out in the paper, workers are designed to run on different machines within a datacenter, and coordinate over the network through the parameter server.
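A rough sketch of one "bundled" worker's loop, with entirely illustrative names (env, local_replay, q_net, and server stand in for the real components, which communicate over the network):

```python
def gorila_bundled_worker(env, server, local_replay, q_net, sync_every=100):
    """Sketch of one Gorila 'bundled' worker loop (names are illustrative)."""
    state = env.reset()
    for step in range(1, 1_000_001):
        # Act in the private environment and store the transition locally.
        action = q_net.act_epsilon_greedy(state)
        next_state, reward, done = env.step(action)
        local_replay.add((state, action, reward, next_state, done))
        state = env.reset() if done else next_state

        # Learn from the private replay memory and ship gradients upstream.
        batch = local_replay.sample(32)
        grads = q_net.compute_td_gradients(batch)
        server.apply_gradients(grads)

        # Periodically refresh the local Q-network from the global copy.
        if step % sync_every == 0:
            q_net.load_params(server.get_params())
```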
DeepMind's Ape-X architecture from 2018 is another approach for distributed Q-Learning where the focus is not on parallelizing learning (as in Gorila), but on data collection. 3 Again, we have independent workers that periodically sync with a global copy of the Q-Network. Compared to Gorila, however, workers share experiences rather than gradients. This is achieved through a global replay buffer which receives experience samples from all independent workers.
Since Ape-X takes advantage of Prioritized Experience Replay, a single GPU worker has enough capacity to keep up with the experiences produced by hundreds of CPU workers. The insight behind sharing experiences instead of gradients is that they remain useful for longer because they're not as strongly dependent on the current Q function.
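A toy version of the shared buffer illustrates the division of labor (illustrative only; the real system distributes the buffer and batches priority updates): actors push transitions tagged with an initial priority such as their absolute TD error, while the single learner samples proportionally to priority and writes refreshed priorities back.

```python
import numpy as np

class SharedPrioritizedReplay:
    """Toy global replay buffer in the spirit of Ape-X (sketch only)."""

    def __init__(self, capacity=1_000_000, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.storage, self.priorities = [], []

    def add(self, transition, priority):
        # Called by actors; priority is typically the absolute TD error.
        if len(self.storage) >= self.capacity:
            self.storage.pop(0); self.priorities.pop(0)
        self.storage.append(transition)
        self.priorities.append(priority ** self.alpha)

    def sample(self, batch_size):
        # Called by the single learner: sample proportionally to priority.
        probs = np.array(self.priorities) / sum(self.priorities)
        idx = np.random.choice(len(self.storage), batch_size, p=probs)
        return idx, [self.storage[i] for i in idx]

    def update_priorities(self, idx, new_priorities):
        # The learner writes back priorities computed from fresh TD errors.
        for i, p in zip(idx, new_priorities):
            self.priorities[i] = p ** self.alpha
```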
Actor-critic algorithms are a family of on-policy algorithms that combine a policy-based model (the actor) and a value-based model (the critic) with the goal of improving the performance of vanilla policy gradients (mainly by reducing variance).
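The variance reduction comes from using the critic as a baseline. A minimal sketch of the two per-step losses, assuming NumPy arrays of per-step log-probabilities, value estimates, and returns:

```python
import numpy as np

def actor_critic_losses(log_probs, values, returns):
    """Per-step actor and critic losses with a learned baseline (sketch).

    Subtracting the critic's value estimate from the return yields an
    advantage, which lowers the variance of the policy gradient without
    changing its expectation."""
    advantages = returns - values          # in practice, no gradient flows through this term
    actor_loss = -np.mean(log_probs * advantages)
    critic_loss = np.mean((returns - values) ** 2)  # regression toward observed returns
    return actor_loss, critic_loss
```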
The landmark architecture for Actor-Critic methods is A3C, which came out in 2016. 4 A3C also uses asynchronous actor-learners, but implements them as CPU threads within a single physical host, thereby eliminating network communication overhead. Each worker maintains a local copy of the value and policy networks, and periodically synchronizes them with the global copy.
The algorithmic insight behind A3C is to take advantage of parallelism to reduce the adverse effects of correlation in the training data. In other words, aggregating values from multiple concurrent actors effectively achieves a similar result as random sampling from the replay buffer in Q-Learning.
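The per-thread loop looks roughly like this (a sketch with illustrative method names; the original applies updates lock-free, Hogwild-style, whereas the sketch takes a lock for simplicity):

```python
import threading

GLOBAL_LOCK = threading.Lock()

def a3c_worker(global_net, make_env, t_max=5):
    """Sketch of one A3C actor-learner thread (method names are illustrative)."""
    env = make_env()
    local_net = global_net.clone()
    state = env.reset()
    while True:
        # Sync the local networks with the shared global copy.
        local_net.load_params(global_net.get_params())

        # Collect a short rollout with the local policy.
        rollout = []
        for _ in range(t_max):
            action = local_net.sample_action(state)
            next_state, reward, done = env.step(action)
            rollout.append((state, action, reward, done))
            state = env.reset() if done else next_state
            if done:
                break

        # Compute gradients locally and apply them to the global network.
        grads = local_net.actor_critic_gradients(rollout)
        with GLOBAL_LOCK:   # the paper's updates are lock-free (Hogwild-style)
            global_net.apply_gradients(grads)
```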
A3C doesn't scale well beyond 16 concurrent actor threads because the local policy gradients quickly become outdated as the updates from other threads are included asynchronously in the global model. On-policy algorithms like Actor-Critic don't cope well with this policy lag problem.
In practice, it turns out that A3C is little more than a thought exercise. A synchronous version of A3C, A2C, obtains the same performance with simpler code. 5 A2C achieves a higher throughput than A3C by (synchronously) batching together updates from multiple environments and running gradient descent on a GPU. In effect, the regularization and exploration benefits in A3C should not be attributed to async-induced noise, but to the presence of multiple independent exploration policies.
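One synchronous iteration, sketched with illustrative names (envs is a list of environment instances stepped in lockstep, so every action is sampled from the latest policy in a single batched forward pass):

```python
def a2c_iteration(global_net, envs, states, t_max=5):
    """One synchronous A2C update (sketch; method names are illustrative)."""
    batch = []
    for _ in range(t_max):
        # All environments are stepped in lockstep with the latest policy.
        actions = global_net.sample_actions(states)        # one batched forward pass
        steps = [env.step(a) for env, a in zip(envs, actions)]
        next_states, rewards, dones = zip(*steps)
        batch.append((states, actions, rewards, dones))
        states = [env.reset() if d else s
                  for env, s, d in zip(envs, next_states, dones)]

    # A single gradient step over the whole batch runs on the GPU.
    grads = global_net.actor_critic_gradients(batch)       # one batched backward pass
    global_net.apply_gradients(grads)
    return states
```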
DeepMind's IMPALA from 2018 takes A2C further and converges towards the multiple-actor, single-learner model proposed in Ape-X. 6 In contrast to A3C, IMPALA workers communicate experiences, not gradients, just like Ape-X. However, Ape-X runs off-policy algorithms which can learn from any state transition, whereas IMPALA is an Actor-Critic that can only learn from on-policy experience. In other words, experience collected by workers under out-of-date policies corrupts the policy gradient (policy lag). IMPALA uses a novel importance sampling algorithm (V-trace) to work around this problem without throwing away samples.
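The core of V-trace is a backward recursion over the trajectory that rescales TD errors by clipped importance ratios between the learner's current policy and the (possibly stale) behavior policy. A NumPy sketch of the value targets, ignoring episode boundaries and the λ parameter for brevity:

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap_value, rhos, gamma=0.99,
                   rho_bar=1.0, c_bar=1.0):
    """V-trace value targets computed backwards over a trajectory (sketch).

    rewards, values, rhos: arrays of length T for one trajectory;
    rhos are the importance ratios pi(a|x) / mu(a|x) between the learner's
    policy pi and the behavior policy mu; bootstrap_value is V(x_T)."""
    T = len(rewards)
    clipped_rhos = np.minimum(rho_bar, rhos)
    clipped_cs = np.minimum(c_bar, rhos)

    # One-step TD errors, scaled by the clipped importance ratios.
    values_t_plus_1 = np.append(values[1:], bootstrap_value)
    deltas = clipped_rhos * (rewards + gamma * values_t_plus_1 - values)

    # Backward recursion: (v_s - V(x_s)) = delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1})).
    vs_minus_v = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):
        acc = deltas[t] + gamma * clipped_cs[t] * acc
        vs_minus_v[t] = acc
    return values + vs_minus_v  # the v_s targets for the value function
```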
Another success story of large-scale RL based on Actor-Critic methods was reported by OpenAI when their OpenAI Five agent defeated the Dota 2 world champions. 7 As in IMPALA, actors submit samples to a pool of GPU optimizers, each of which maintains a local experience buffer and samples from it to produce policy gradients for the PPO algorithm. Gradients are aggregated across the GPU pool and applied synchronously.
Both the Dota 2 game engine and the OpenAI Five model are significantly more complex than the Atari games used in other experiments. Under these conditions, it's more efficient to create a dedicated pool of GPU machines to run the forward pass on the latest policy, rather than leaving this responsibility to the actors. Coupled with the synchronous policy updates, this likely explains why the paper doesn't make any mention of policy-lag corrections.
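The synchronous aggregation step can be sketched as an average over the optimizer pool (illustrative names; in practice this is an NCCL-style allreduce across GPUs rather than a Python loop):

```python
def synchronized_update(optimizers):
    """Sketch of one synchronous update across a pool of GPU optimizers."""
    # Each optimizer computes PPO gradients from its own experience buffer.
    local_grads = [opt.compute_ppo_gradients(opt.sample_local_buffer())
                   for opt in optimizers]

    # Average elementwise across the pool (an allreduce in real systems),
    # then apply the same averaged gradients everywhere so all replicas
    # stay in lockstep on the same policy version.
    avg_grads = {k: sum(g[k] for g in local_grads) / len(local_grads)
                 for k in local_grads[0]}
    for opt in optimizers:
        opt.apply_gradients(avg_grads)
```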
From the architecture point of view, the main difference between value-based and policy-based algorithms is that the former are generally off-policy, whereas the latter usually need on-policy samples. Nevertheless, the most recent architectures for both families have converged on a similar design:
Both Ape-X and IMPALA share experiences instead of gradients to reduce communication overhead. Q-Learning adapts naturally to this change, but Actor-Critics need correction factors.
Likewise, distributing the learner itself has fallen out of favor given the rapid increase in GPU capacity. In recent designs, each GPU learner can service hundreds of CPU actors, reducing communication overhead.
If we take for granted that interactions with the environment (i.e. actors) need to be parallelized (that's the whole point of this exercise in cloud madness), the real choice is whether learning should be parallelized as well. The recent consensus seems to be no: a single GPU learner, or a small pool of them, is enough.
Probably the most enduring lesson from this review is to be careful about recognizing where accidental complexity and implementation details begin to obscure novel ideas. A case in point is A3C's supposedly better exploration behavior due to asynchrony, which later turned out to be inferior to synchronous methods in both performance and cost.
Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. Large scale distributed deep networks. In NIPS. 2012.
Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Vedavyas Panneershelvam, Mustafa Suleyman, Charles Beattie, Stig Petersen, Shane Legg, Volodymyr Mnih, Koray Kavukcuoglu, and David Silver. Massively parallel methods for deep reinforcement learning. 2015. arXiv:1507.04296.
Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado van Hasselt, and David Silver. Distributed prioritized experience replay. 2018. arXiv:1803.00933.
Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. 2016. arXiv:1602.01783.
Adam Stooke and Pieter Abbeel. Accelerated methods for deep reinforcement learning. 2018. arXiv:1803.02811.
Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. 2018. arXiv:1802.01561.
Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, Rafal Józefowicz, Scott Gray, Catherine Olsson, Jakub Pachocki, Michael Petrov, Henrique Pondé de Oliveira Pinto, Jonathan Raiman, Tim Salimans, Jeremy Schlatter, Jonas Schneider, Szymon Sidor, Ilya Sutskever, Jie Tang, Filip Wolski, and Susan Zhang. Dota 2 with large scale deep reinforcement learning. 2019. arXiv:1912.06680.