US 12,468,779 B1
Training action-selection neural networks from demonstrations using multiple losses
Todd Andrew Hester, Seattle, WA (US)
Assigned to GDM Holding LLC, Mountain View, CA (US)
Filed by DeepMind Technologies Limited, London (GB)
Filed on May 6, 2024, as Appl. No. 18/656,462.
Application 18/656,462 is a continuation of application No. 18/120,912, filed on Mar. 13, 2023, granted, now 12,008,077.
Application 18/120,912 is a continuation of application No. 16/174,148, filed on Oct. 29, 2018, granted, now 11,604,941, issued on Mar. 14, 2023.
Claims priority of provisional application 62/578,367, filed on Oct. 27, 2017.
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 18/214 (2023.01); G06F 17/11 (2006.01); G06N 3/045 (2023.01); G06N 3/08 (2023.01)
CPC G06F 18/2148 (2023.01) [G06F 17/11 (2013.01); G06N 3/045 (2023.01); G06N 3/08 (2013.01)]
20 Claims
 
1. A method of training a neural network system, the method comprising:
training an action selection neural network to perform a task,
wherein the action selection neural network is configured to receive inputs comprising observations of an environment and to process the inputs to generate action selection outputs indicating actions to perform the task, and
during training of the action selection neural network:
training an auxiliary prediction task neural network, wherein the auxiliary prediction task neural network is configured to receive an intermediate output from the action selection neural network and to generate a prediction output which indicates a predicted characteristic of the task,
wherein training the auxiliary prediction task neural network comprises training the auxiliary prediction task neural network and the action selection neural network using demonstration data for the task by backpropagating gradients determined from an auxiliary learning loss function through the auxiliary prediction task neural network and into the action selection neural network to bring the predicted characteristic closer to a corresponding observed characteristic of the task from the demonstration data.
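
The gradient flow recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the class name TinyAgent, the method aux_step, the layer sizes, the tanh nonlinearity, the squared-error auxiliary loss, the fixed initial weights, the learning rate, and the toy demonstration pairs are all assumptions made for illustration. The separate loss that trains the action selection network on the task itself is omitted; only the auxiliary update is shown.

```python
import math

def matvec(W, x):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def fixed_init(rows, cols):
    """Deterministic small positive weights (illustrative assumption)."""
    return [[0.1 * (1 + (i * cols + j) % 3) for j in range(cols)]
            for i in range(rows)]

class TinyAgent:
    """Hypothetical two-headed model: action selection plus auxiliary head."""

    def __init__(self, n_obs, n_hidden, n_actions):
        self.W1 = fixed_init(n_hidden, n_obs)      # action selection trunk
        self.W2 = fixed_init(n_actions, n_hidden)  # action selection output layer
        self.w3 = fixed_init(1, n_hidden)[0]       # auxiliary prediction network

    def forward(self, obs):
        h = [math.tanh(z) for z in matvec(self.W1, obs)]  # intermediate output
        logits = matvec(self.W2, h)                       # action selection output
        pred = sum(w * hi for w, hi in zip(self.w3, h))   # predicted characteristic
        return h, logits, pred

    def aux_step(self, obs, observed, lr=0.1):
        """One gradient step on the auxiliary loss 0.5 * (pred - observed)**2."""
        h, _, pred = self.forward(obs)
        err = pred - observed
        # Gradients for the auxiliary head and for the intermediate output.
        grad_w3 = [err * hi for hi in h]
        grad_h = [err * w for w in self.w3]
        # Backpropagate through tanh into the action selection trunk.
        grad_z = [g * (1.0 - hi * hi) for g, hi in zip(grad_h, h)]
        self.w3 = [w - lr * g for w, g in zip(self.w3, grad_w3)]
        self.W1 = [[w - lr * gz * xi for w, xi in zip(row, obs)]
                   for row, gz in zip(self.W1, grad_z)]
        return 0.5 * err * err  # loss measured before the update

def total_aux_loss(agent, demos):
    return sum(0.5 * (agent.forward(o)[2] - y) ** 2 for o, y in demos)

# Toy demonstration data: (observation, observed characteristic) pairs.
demos = [([1.0, 0.5], 1.0), ([0.5, 1.0], 0.8)]

agent = TinyAgent(n_obs=2, n_hidden=3, n_actions=2)
before = total_aux_loss(agent, demos)
for _ in range(200):
    for obs, observed in demos:
        agent.aux_step(obs, observed)
after = total_aux_loss(agent, demos)
print("auxiliary loss before:", before, "after:", after)
```

In this sketch the auxiliary loss updates both w3 (the auxiliary prediction network) and W1 (the part of the action selection network that produces the intermediate output), while W2 is left to whatever loss trains the action selection outputs.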