US 11,948,085 B2
Distributional reinforcement learning for continuous control tasks
David Budden, London (GB); Matthew William Hoffman, London (GB); and Gabriel Barth-Maron, London (GB)
Assigned to DeepMind Technologies Limited, London (GB)
Filed by DeepMind Technologies Limited, London (GB)
Filed on Apr. 19, 2023, as Appl. No. 18/303,117.
Application 18/303,117 is a continuation of application No. 17/945,622, filed on Sep. 15, 2022, granted, now 11,663,475.
Application 17/945,622 is a continuation of application No. 16/759,519, granted, now 11,481,629, issued on Oct. 25, 2022, previously published as PCT/EP2018/079526, filed on Oct. 29, 2018.
Claims priority of provisional application 62/578,389, filed on Oct. 27, 2017.
Prior Publication US 2023/0409907 A1, Dec. 21, 2023
Int. Cl. G06N 3/08 (2023.01); G06N 3/045 (2023.01)
CPC G06N 3/08 (2013.01) [G06N 3/045 (2023.01)] 20 Claims
OG exemplary drawing
 
1. A computer-implemented method for training an action selection neural network having a plurality of action selection parameters and used to select actions to be performed by an agent interacting with an environment, wherein the action selection neural network is configured to receive an input observation characterizing a state of the environment and to map the input observation to an action, the method comprising:
maintaining a respective replica of the action selection neural network;
receiving an observation characterizing a current state of an instance of the environment;
generating a respective transition starting from the received observation by selecting actions to be performed by the agent using the action selection neural network replica and in accordance with current values of the action selection parameters;
storing respective data for the respective transition in a memory; and
using a transition sampled from the memory to train the action selection neural network, the sampled transition comprising at least an observation-action-reward triple, and the training comprising:
processing an observation-action pair in the observation-action-reward triple of the sampled transition to generate, for the triple, a distribution over possible returns that could result if the action is performed in response to the observation; and
determining an update to the action selection parameters using the distribution over the possible returns.
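
Claim 1 walks through an actor-style data-generation loop: a replica of the action selection network is maintained, actions are selected with it for an instance of the environment, and the resulting transitions are stored in a memory. Below is a minimal illustrative sketch of that loop in Python/NumPy. The environment, the linear policy, the exploration noise, and every name used here (ToyEnv, LinearPolicy, Transition, generate_transitions) are assumptions introduced purely for illustration; they are not taken from the patent.

```python
import numpy as np
from collections import deque
from dataclasses import dataclass


@dataclass
class Transition:
    """Data stored in the memory for one transition: an observation-action-reward
    triple plus the next observation."""
    observation: np.ndarray
    action: np.ndarray
    reward: float
    next_observation: np.ndarray


class LinearPolicy:
    """Stand-in action selection network: maps an input observation to an action."""

    def __init__(self, obs_dim, act_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.weights = 0.1 * rng.standard_normal((obs_dim, act_dim))

    def replica(self):
        """A replica of the network holding the current parameter values."""
        clone = LinearPolicy(*self.weights.shape)
        clone.weights = self.weights.copy()
        return clone

    def __call__(self, observation):
        return np.tanh(observation @ self.weights)


class ToyEnv:
    """Tiny continuous-control environment, present only so the sketch runs end to end."""

    def __init__(self, obs_dim=4, act_dim=2, seed=0):
        self._rng = np.random.default_rng(seed)
        self.obs_dim, self.act_dim = obs_dim, act_dim

    def reset(self):
        self._state = self._rng.standard_normal(self.obs_dim)
        return self._state

    def step(self, action):
        self._state = 0.9 * self._state + 0.1 * self._rng.standard_normal(self.obs_dim)
        reward = -float(np.sum(np.asarray(action) ** 2))  # rewards small actions
        return self._state, reward


def generate_transitions(env, policy_replica, memory, num_steps, noise_scale=0.1, seed=0):
    """Act in one environment instance with the policy replica and store transitions."""
    rng = np.random.default_rng(seed)
    observation = env.reset()
    for _ in range(num_steps):
        # Select an action with the replica, in accordance with its current
        # parameter values, plus Gaussian exploration noise.
        action = policy_replica(observation) + noise_scale * rng.standard_normal(env.act_dim)
        next_observation, reward = env.step(action)
        memory.append(Transition(observation, action, reward, next_observation))
        observation = next_observation


memory = deque(maxlen=100_000)               # replay memory
policy = LinearPolicy(obs_dim=4, act_dim=2)  # action selection network
generate_transitions(ToyEnv(), policy.replica(), memory, num_steps=1_000)
```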
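The training step of the claim processes an observation-action pair from a sampled transition to generate a distribution over possible returns. One common way to parameterize such a distribution is a categorical distribution over a fixed grid of return values ("atoms"); the claim text shown here does not commit to that choice, so the critic below, its atom support, and its softmax parameterization are assumptions for illustration, continuing the sketch above. Training of the critic itself (for example toward a distributional Bellman target) is omitted.

```python
class CategoricalCritic:
    """Maps an observation-action pair to a categorical distribution over possible
    returns, represented as probabilities over a fixed grid of return atoms."""

    def __init__(self, obs_dim, act_dim, num_atoms=51, v_min=-10.0, v_max=10.0, seed=0):
        rng = np.random.default_rng(seed)
        self.atoms = np.linspace(v_min, v_max, num_atoms)  # support of possible returns
        self.weights = 0.01 * rng.standard_normal((obs_dim + act_dim, num_atoms))

    def distribution(self, observation, action):
        """Probability assigned to each return atom for this observation-action pair."""
        logits = np.concatenate([observation, action]) @ self.weights
        probs = np.exp(logits - logits.max())
        return probs / probs.sum()

    def expected_return(self, observation, action):
        """Mean of the return distribution, used below for the policy update."""
        return float(self.distribution(observation, action) @ self.atoms)


critic = CategoricalCritic(obs_dim=4, act_dim=2)

# Sample a stored transition and generate the return distribution for its
# observation-action pair.
sampled = memory[int(np.random.default_rng(1).integers(len(memory)))]
return_distribution = critic.distribution(sampled.observation, sampled.action)
```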
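Finally, the claim determines an update to the action selection parameters using the distribution over possible returns. One simple reading of this step is gradient ascent on the expected return under the critic's distribution; the sketch below estimates that gradient by finite differences purely to stay dependency-free, whereas a practical implementation would backpropagate through the critic. The policy_update function and the update rule are, again, illustrative assumptions continuing the same sketch rather than the patented method itself.

```python
def policy_update(policy, critic, observation, learning_rate=1e-3, eps=1e-4):
    """Nudge the action selection parameters toward actions whose expected return,
    under the critic's return distribution, is higher. The gradient with respect to
    the policy parameters is estimated by finite differences for simplicity."""
    base = critic.expected_return(observation, policy(observation))
    grad = np.zeros_like(policy.weights)
    for idx in np.ndindex(*policy.weights.shape):
        policy.weights[idx] += eps
        grad[idx] = (critic.expected_return(observation, policy(observation)) - base) / eps
        policy.weights[idx] -= eps
    policy.weights += learning_rate * grad  # ascend the expected return


# Determine an update to the action selection parameters using the sampled
# transition's observation and the critic's return distribution.
policy_update(policy, critic, sampled.observation)
```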