CPC: G06N 3/08 (2013.01); G06N 3/045 (2023.01). 20 Claims.
1. A method performed by one or more computers, the method comprising:
jointly training an action selection neural network and a state value neural network, wherein:
the action selection neural network is configured to process an observation of an environment, in accordance with current values of a set of action selection neural network parameters, to generate an output that defines a score distribution over a set of actions that can be performed by an agent to interact with the environment;
the state value neural network is configured to process an input comprising an observation of the environment to generate a state value for the observation that defines an estimate of a cumulative reward that will be received by the agent, starting from a state of the environment represented by the observation, by selecting actions using a current action selection policy defined by the current values of the set of action selection neural network parameters;
the training comprising:
obtaining an off-policy trajectory that characterizes interaction of the agent with the environment over a sequence of time steps as the agent performed actions selected in accordance with an off-policy action selection policy that is different than the current action selection policy;
training the state value neural network on the off-policy trajectory, comprising:
determining a state value target that defines a prediction target for the state value neural network, wherein the state value target is a combination of:
(i) a state value for a first observation in the off-policy trajectory; and
(ii) a correction term that accounts for a discrepancy between the current action selection policy and the off-policy action selection policy;
training the state value neural network to reduce a discrepancy between the state value target and a state value generated by the state value neural network by processing the first observation in the off-policy trajectory; and
training the action selection neural network on the off-policy trajectory using the state value neural network.
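The state value target described above, a first-observation state value plus a correction term for the policy discrepancy, resembles the clipped importance-sampling corrections used in off-policy actor-critic methods such as V-trace. A minimal sketch under that assumption; the function name, the clipping constants, and the specific correction scheme are illustrative assumptions, not the claimed implementation:

```python
import numpy as np

def off_policy_value_target(values, rewards, pi_probs, mu_probs,
                            gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace-style state value target for the first observation (a sketch).

    values:   V(s_0..s_T) from the state value network (length T+1)
    rewards:  r_0..r_{T-1} received along the off-policy trajectory
    pi_probs: current-policy probabilities of the actions actually taken
    mu_probs: off-policy (behavior) probabilities of the same actions
    """
    T = len(rewards)
    ratios = pi_probs / mu_probs
    rhos = np.minimum(rho_bar, ratios)   # clipped importance ratios
    cs = np.minimum(c_bar, ratios)       # trace-cutting coefficients
    correction = 0.0                     # term (ii): policy-discrepancy correction
    trace = 1.0
    for t in range(T):
        # importance-weighted temporal-difference error at step t
        delta = rhos[t] * (rewards[t] + gamma * values[t + 1] - values[t])
        correction += (gamma ** t) * trace * delta
        trace *= cs[t]
    # target = (i) state value of the first observation + (ii) correction
    return values[0] + correction
```

When the current and off-policy action selection policies coincide (all ratios equal 1) and gamma is 1, the correction telescopes and the target reduces to the undiscounted return plus the bootstrap value of the final observation, matching the on-policy special case.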