US 11,783,182 B2
Asynchronous deep reinforcement learning
Volodymyr Mnih, Toronto (CA); Adrià Puigdomènech Badia, London (GB); Alexander Benjamin Graves, London (GB); Timothy James Alexander Harley, London (GB); David Silver, Hitchin (GB); and Koray Kavukcuoglu, London (GB)
Assigned to DeepMind Technologies Limited, London (GB)
Filed by DeepMind Technologies Limited, London (GB)
Filed on Feb. 8, 2021, as Appl. No. 17/170,316.
Application 17/170,316 is a continuation of application No. 16/403,388, filed on May 3, 2019, granted, now 11,334,792.
Application 16/403,388 is a continuation of application No. 15/977,923, filed on May 11, 2018, granted, now 10,346,741, issued on Jul. 9, 2019.
Application 15/977,923 is a continuation of application No. 15/349,950, filed on Nov. 11, 2016, granted, now 10,936,946.
Claims priority of provisional application 62/254,701, filed on Nov. 12, 2015.
Prior Publication US 2021/0166127 A1, Jun. 3, 2021
Int. Cl. G06N 3/08 (2023.01); G06N 3/045 (2023.01); G06N 3/04 (2023.01)
CPC G06N 3/08 (2013.01) [G06N 3/04 (2013.01); G06N 3/045 (2023.01)] 20 Claims
OG exemplary drawing
 
1. A method of training a deep neural network having a plurality of parameters that is used to select actions to be performed by an agent that interacts with an environment by performing actions selected from a predetermined set of actions, the method comprising:
using a plurality of workers to generate training data for training the parameters of the deep neural network;
wherein for each worker:
the worker is configured to operate independently of each other worker;
the worker is associated with a respective actor that interacts with a respective replica of the environment in accordance with a respective exploration policy;
the exploration policy is parameterized by a set of exploration policy parameters, wherein values of the exploration policy parameters are specific to the worker and are different from values of exploration policy parameters of each of one or more other workers of the plurality of workers; and
wherein each worker is configured to generate training data by repeatedly performing operations comprising:
determining current values of the parameters of the deep neural network;
receiving a current observation characterizing a current state of the environment replica interacted with by the actor associated with the worker;
selecting a current action to be performed by the actor associated with the worker in response to the current observation in accordance with the exploration policy for the worker and using one or more outputs generated by the deep neural network in accordance with the current values of the parameters of the deep neural network;
identifying an actual reward resulting from the actor performing the current action when the environment replica is in the current state;
receiving a next observation characterizing a next state of the environment replica interacted with by the actor, wherein the environment replica transitioned into the next state from the current state in response to the actor performing the current action; and
adding the current action, the actual reward, and the next observation to the training data generated by the worker;
applying a reinforcement learning technique to the training data generated by each of the plurality of workers to determine one or more current gradients; and
determining updated values of the parameters of the deep neural network using the current gradients.
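For illustration only, the following is a minimal Python sketch of the asynchronous training scheme the claim recites: several independent workers, each with its own environment replica and worker-specific exploration-policy parameters (here, per-worker epsilon values for an epsilon-greedy policy), each generating transition data and periodically applying a reinforcement learning technique (here, one-step Q-learning with a linear action-value function) to that data to determine gradients and update shared parameters. Every concrete choice in the sketch, including the class and variable names (ToyEnvReplica, theta, BATCH_SIZE), the linear Q-function, and the batching interval, is an assumption made for the example and is not prescribed by the patent.

```python
"""Illustrative sketch of asynchronous multi-worker RL training; not the patented method."""
import threading
import numpy as np

N_WORKERS = 4          # number of independent workers (hypothetical choice)
N_ACTIONS = 3          # size of the predetermined set of actions
OBS_DIM = 5            # dimensionality of an observation
GAMMA = 0.99           # discount factor for one-step Q-learning
LEARNING_RATE = 0.01
BATCH_SIZE = 5         # how many transitions a worker collects before computing gradients
STEPS_PER_WORKER = 1000

# Shared parameters of the (here, linear) action-value function: Q(s, a) = theta[a] @ s.
theta = np.zeros((N_ACTIONS, OBS_DIM))
theta_lock = threading.Lock()


class ToyEnvReplica:
    """Stand-in environment replica; any environment with the same interface would do."""

    def __init__(self, seed):
        self.rng = np.random.default_rng(seed)
        self.state = self.rng.normal(size=OBS_DIM)

    def observe(self):
        return self.state.copy()

    def step(self, action):
        # Reward favours one arbitrary action so that learning has something to find.
        reward = 1.0 if action == 0 else 0.0
        self.state = self.rng.normal(size=OBS_DIM)
        return reward, self.state.copy()


def worker(worker_id, epsilon):
    """One worker: worker-specific epsilon, its own environment replica, async updates."""
    env = ToyEnvReplica(seed=worker_id)
    rng = np.random.default_rng(1000 + worker_id)
    training_data = []  # transitions (obs, action, reward, next_obs) generated by this worker
    obs = env.observe()

    for step in range(1, STEPS_PER_WORKER + 1):
        # Determine the current values of the shared parameters (a local snapshot).
        with theta_lock:
            theta_snapshot = theta.copy()

        # Select an action under this worker's exploration policy (epsilon-greedy here).
        q_values = theta_snapshot @ obs
        if rng.random() < epsilon:
            action = int(rng.integers(N_ACTIONS))
        else:
            action = int(np.argmax(q_values))

        # Perform the action; identify the actual reward and receive the next observation.
        reward, next_obs = env.step(action)

        # Add the transition to the training data generated by this worker.
        training_data.append((obs, action, reward, next_obs))
        obs = next_obs

        # Periodically apply an RL technique (one-step Q-learning) to the worker's data
        # to determine gradients, then update the shared parameters asynchronously.
        if step % BATCH_SIZE == 0:
            grad = np.zeros_like(theta_snapshot)
            for s, a, r, s_next in training_data:
                target = r + GAMMA * np.max(theta_snapshot @ s_next)
                td_error = target - (theta_snapshot[a] @ s)
                grad[a] += td_error * s  # semi-gradient TD update direction for row a
            training_data.clear()
            with theta_lock:
                theta += LEARNING_RATE * grad


if __name__ == "__main__":
    # Give each worker different exploration-policy parameter values (different epsilons).
    epsilons = [0.5, 0.3, 0.1, 0.05]
    threads = [threading.Thread(target=worker, args=(i, epsilons[i])) for i in range(N_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("Learned Q-value weights per action (rows):")
    print(theta)
```

A lock guards the shared parameter vector here purely for clarity; in practice asynchronous implementations of this kind often apply lock-free (Hogwild-style) updates, and the "reinforcement learning technique" applied to each worker's data could equally be an actor-critic or n-step method rather than one-step Q-learning.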