CPC G06N 3/084 (2013.01) [G06N 3/006 (2013.01); G06N 3/044 (2023.01); G06N 3/045 (2023.01); G06N 20/00 (2019.01)] | 20 Claims |
1. A method performed by one or more data processing apparatus, the method comprising:
training an action selection policy neural network using a first reinforcement learning technique,
wherein the action selection policy neural network has a plurality of network parameters and is used in selecting actions to be performed by an agent interacting with an environment,
wherein the action selection policy neural network is configured to receive an input comprising an observation input and to process the input in accordance with the network parameters to generate an action selection policy output, and
wherein training the action selection policy neural network comprises adjusting values of the action selection policy network parameters;
during the training of the action selection neural network using the first reinforcement learning technique:
training a reward prediction neural network on interactions of the agent with the environment during the training of the action selection neural network,
wherein the reward prediction neural network has reward prediction parameters and is configured to:
receive a plurality of intermediate outputs generated by the action selection policy neural network, wherein the plurality of intermediate outputs are generated by one or more hidden layers of the action selection policy neural network in response to processing a sequence of multiple observation images that result from interactions of the agent with the environment, and
process the plurality of intermediate outputs, generated by the hidden layers of the action selection policy neural network in response to processing the sequence of multiple observation images, in accordance with the reward prediction parameters to generate a predicted reward that is an estimate of a reward that will be received with a next observation image that follows a last observation image in the sequence; and
wherein training the reward prediction neural network comprises:
determining gradients based on predicted rewards generated by the reward prediction neural network; and
adjusting values of the reward prediction parameters and the action selection policy network parameters using the gradients.
|
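As an illustration only, the training step recited at the end of claim 1 (determining gradients from predicted rewards, then adjusting both the reward prediction parameters and the policy network parameters) can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation. It assumes a single tanh hidden layer standing in for the policy network's hidden layers, flat vectors standing in for observation images, a linear reward prediction head, and a squared-error loss. The names `encode`, `predict_reward`, `reward_prediction_step`, `w_enc`, and `w_rew` are hypothetical. The first reinforcement learning technique and the policy output layer are omitted.

```python
import math
import random

def encode(w_enc, obs):
    """Hidden layer standing in for the policy network: h = tanh(W_enc @ obs)."""
    return [math.tanh(sum(w * x for w, x in zip(row, obs))) for row in w_enc]

def predict_reward(w_enc, w_rew, obs_seq):
    """Encode each observation with the shared hidden layer, concatenate the
    intermediate outputs, and apply a linear reward head (an assumption)."""
    hiddens = [encode(w_enc, obs) for obs in obs_seq]
    flat = [h for hidden in hiddens for h in hidden]
    return sum(w * h for w, h in zip(w_rew, flat)), hiddens

def reward_prediction_step(w_enc, w_rew, obs_seq, next_reward, lr=0.05):
    """One gradient step on 0.5 * (pred - next_reward)**2.
    Updates the reward head AND the shared hidden-layer weights in place."""
    pred, hiddens = predict_reward(w_enc, w_rew, obs_seq)
    err = pred - next_reward
    n_hidden = len(w_enc)

    # Backpropagate through the reward head into the shared hidden layer,
    # using the reward-head weights as they were in the forward pass.
    grad_enc = [[0.0] * len(obs_seq[0]) for _ in range(n_hidden)]
    for t, (obs, hidden) in enumerate(zip(obs_seq, hiddens)):
        for j, h in enumerate(hidden):
            d_pre = err * w_rew[t * n_hidden + j] * (1.0 - h * h)
            for i, x in enumerate(obs):
                grad_enc[j][i] += d_pre * x

    # Adjust the reward prediction parameters.
    flat = [h for hidden in hiddens for h in hidden]
    for k, h in enumerate(flat):
        w_rew[k] -= lr * err * h

    # Adjust the shared policy-network (hidden layer) parameters.
    for j in range(n_hidden):
        for i in range(len(grad_enc[j])):
            w_enc[j][i] -= lr * grad_enc[j][i]

    return 0.5 * err * err

# Toy usage with made-up sizes and random data (all values are illustrative).
rng = random.Random(0)
OBS_DIM, HIDDEN, SEQ_LEN = 4, 3, 3
w_enc = [[rng.uniform(-0.5, 0.5) for _ in range(OBS_DIM)] for _ in range(HIDDEN)]
w_rew = [rng.uniform(-0.5, 0.5) for _ in range(HIDDEN * SEQ_LEN)]
obs_seq = [[rng.uniform(-1, 1) for _ in range(OBS_DIM)] for _ in range(SEQ_LEN)]
enc_before = [row[:] for row in w_enc]
losses = [reward_prediction_step(w_enc, w_rew, obs_seq, next_reward=1.0)
          for _ in range(200)]
```

In this sketch the point to notice is that `w_enc` changes even though only the reward prediction loss is being minimized: the gradient of the predicted reward flows back through the intermediate outputs into the parameters shared with the policy network, which mirrors the final limitation of the claim.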