US 11,842,281 B2
Reinforcement learning with auxiliary tasks
Volodymyr Mnih, Toronto (CA); Wojciech Czarnecki, London (GB); Maxwell Elliot Jaderberg, London (GB); Tom Schaul, London (GB); David Silver, Hitchin (GB); and Koray Kavukcuoglu, London (GB)
Assigned to DeepMind Technologies Limited, London (GB)
Filed by DeepMind Technologies Limited, London (GB)
Filed on Feb. 24, 2021, as Appl. No. 17/183,618.
Application 17/183,618 is a continuation of application No. 16/403,385, filed on May 3, 2019, granted, now 10,956,820.
Application 16/403,385 is a continuation of application No. PCT/IB2017/056906, filed on Nov. 4, 2017.
Claims priority of provisional application 62/418,120, filed on Nov. 4, 2016.
Prior Publication US 2021/0182688 A1, Jun. 17, 2021
Int. Cl. G06N 3/08 (2023.01); G06N 20/00 (2019.01); G06N 3/00 (2023.01); G06N 3/04 (2023.01); G06N 3/084 (2023.01); G06N 3/006 (2023.01); G06N 3/044 (2023.01); G06N 3/045 (2023.01)

CPC G06N 3/084 (2013.01) [G06N 3/006 (2013.01); G06N 3/044 (2023.01); G06N 3/045 (2023.01); G06N 20/00 (2019.01)]

20 Claims

1. A method performed by one or more data processing apparatus, the method comprising:

training an action selection policy neural network using a first reinforcement learning technique,

wherein the action selection policy neural network has a plurality of network parameters and is used in selecting actions to be performed by an agent interacting with an environment,

wherein the action selection policy neural network is configured to receive an input comprising an observation input and to process the input in accordance with the network parameters to generate an action selection policy output, and

wherein training the action selection policy neural network comprises adjusting values of the action selection policy network parameters;

during the training of the action selection neural network using the first reinforcement learning technique:

training a reward prediction neural network on interactions of the agent with the environment during the training of the action selection neural network,

wherein the reward prediction neural network has reward prediction parameters and is configured to:

receive a plurality of intermediate outputs generated by the action selection policy neural network, wherein the plurality of intermediate outputs are generated by one or more hidden layers of the action selection policy neural network in response to processing a sequence of multiple observation images that result from interactions of the agent with the environment, and

process the plurality of intermediate outputs, generated by the hidden layers of the action selection policy neural network in response to processing the sequence of multiple observation images, in accordance with the reward prediction parameters to generate a predicted reward that is an estimate of a reward that will be received with a next observation image that follows a last observation image in the sequence; and

wherein training the reward prediction neural network comprises:

determining gradients based on predicted rewards generated by the reward prediction neural network; and

adjusting values of the reward prediction parameters and the action selection policy network parameters using the gradients.
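The reward prediction step recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation; it only shows the data flow under stated assumptions. Every name in it (`hidden_features`, `predict_reward`, `reward_prediction_step`, `W`, `v`) is hypothetical. A single tanh layer stands in for the hidden layers of the action selection policy neural network, observations are small vectors rather than images, and a squared-error loss is assumed (the claim only requires gradients based on the predicted rewards). The first reinforcement learning technique that trains the policy itself is omitted.

```python
import math
import random

def hidden_features(W, obs):
    """Shared hidden layer tanh(W @ obs); stands in for a policy-network hidden layer."""
    return [math.tanh(sum(w * x for w, x in zip(row, obs))) for row in W]

def predict_reward(W, v, obs_seq):
    """Concatenate the intermediate outputs for every observation, then apply a linear head."""
    feats = [h for obs in obs_seq for h in hidden_features(W, obs)]
    return sum(vi * fi for vi, fi in zip(v, feats)), feats

def reward_prediction_step(W, v, obs_seq, next_reward, lr=0.05):
    """One gradient step on 0.5 * (predicted - actual)^2 (assumed loss).

    Updates both the reward prediction parameters v and the shared
    policy-network parameters W in place, and returns the loss before the update.
    """
    pred, feats = predict_reward(W, v, obs_seq)
    err = pred - next_reward
    n_hidden = len(W)

    # Gradient for the shared parameters: back through each observation's features.
    grad_W = [[0.0] * len(W[0]) for _ in W]
    for t, obs in enumerate(obs_seq):
        for j in range(n_hidden):
            h = feats[t * n_hidden + j]
            d_pre = err * v[t * n_hidden + j] * (1.0 - h * h)  # tanh derivative
            for k, x in enumerate(obs):
                grad_W[j][k] += d_pre * x

    # Gradient for the reward prediction head.
    grad_v = [err * f for f in feats]

    for i in range(len(v)):
        v[i] -= lr * grad_v[i]
    for j in range(n_hidden):
        for k in range(len(W[0])):
            W[j][k] -= lr * grad_W[j][k]
    return 0.5 * err * err

# Toy usage: a sequence of three observations and the reward received with the next one.
rng = random.Random(0)
obs_dim, n_hidden, seq_len = 4, 3, 3
W = [[rng.uniform(-0.5, 0.5) for _ in range(obs_dim)] for _ in range(n_hidden)]
v = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden * seq_len)]
W_initial = [row[:] for row in W]
obs_seq = [[rng.uniform(-1, 1) for _ in range(obs_dim)] for _ in range(seq_len)]
next_reward = 1.0
losses = [reward_prediction_step(W, v, obs_seq, next_reward) for _ in range(200)]
```

In this sketch the same error signal adjusts `v` (the reward prediction parameters) and `W` (the parameters shared with the policy network), which mirrors the final element of the claim: the auxiliary task shapes the representation that the policy also uses.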