US 12,343,874 B2
Reinforcement and imitation learning for a task
Saran Tunyasuvunakool, London (GB); Yuke Zhu, Stanford, CA (US); Joshua Merel, Chicago, IL (US); János Kramár, London (GB); Ziyu Wang, Markham (CA); and Nicolas Manfred Otto Heess, London (GB)
Assigned to DeepMind Technologies Limited, London (GB)
Filed by DeepMind Technologies Limited, London (GB)
Filed on Apr. 25, 2023, as Appl. No. 18/306,711.
Application 18/306,711 is a continuation of application No. 16/174,112, filed on Oct. 29, 2018, abandoned.
Claims priority of provisional application 62/578,368, filed on Oct. 27, 2017.
Prior Publication US 2023/0330848 A1, Oct. 19, 2023
Int. Cl. B25J 9/16 (2006.01); G06N 3/008 (2023.01); G06N 3/044 (2023.01); G06N 3/045 (2023.01); G06N 3/08 (2023.01); G06N 3/084 (2023.01)
CPC B25J 9/163 (2013.01) [B25J 9/161 (2013.01); B25J 9/1697 (2013.01); G06N 3/008 (2013.01); G06N 3/044 (2023.01); G06N 3/045 (2023.01); G06N 3/08 (2013.01); G06N 3/084 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A computer-implemented method for controlling a robotic agent to perform a task, the method comprising:
obtaining, for each of a plurality of performances of the task by a real-world agent controlled by an operator in a real-world environment, a respective demonstration dataset characterizing the corresponding performance of the task in the real-world environment; and
training a neural network for controlling a simulated robotic agent to perform the task in a simulated environment using the demonstration datasets, the training comprising:
obtaining (i) simulated image data encoding simulated camera images characterizing a current state of the simulated environment and (ii) simulated proprioceptive data comprising one or more variables characterizing configurations of the simulated robotic agent;
processing at least (i) the simulated image data and (ii) the simulated proprioceptive data using the neural network, according to current values of parameters of the neural network, to generate one or more sets of control commands for controlling movements of a plurality of components of the simulated robotic agent;
for each set of control commands, computing a task reward value characterizing how successfully the task is carried out upon implementation of the set of control commands by the simulated robotic agent in the simulated environment; and
adjusting the parameters of the neural network based on a hybrid reward function including (i) an imitation reward term derived using the demonstration datasets obtained for the real-world environment and the sets of control commands generated for the simulated environment and (ii) a task reward term computed using the task reward values; and
using the trained neural network to control the real-world robotic agent to perform the task in the real-world environment.
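For illustration only (not part of the claim text), the training step recited above can be sketched as follows. Every name, dimension, reward definition, and the random-search parameter update here is a hypothetical stand-in: the claim does not specify a network architecture, an optimizer, or concrete reward formulas, and the linear policy below merely substitutes for the claimed neural network so the hybrid imitation-plus-task objective is concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: flattened simulated-camera features plus
# proprioceptive joint variables in, control commands out.
IMG_DIM, PROPRIO_DIM, ACT_DIM = 8, 4, 3

def policy(params, image_feats, proprio):
    """Linear stand-in for the claimed neural network: maps simulated image
    data and simulated proprioceptive data to a set of control commands."""
    obs = np.concatenate([image_feats, proprio])
    return np.tanh(params @ obs)  # bounded control commands

def task_reward(commands, goal):
    """Task reward value: how successfully the task is carried out, modeled
    here as negative distance to a goal command vector."""
    return -np.linalg.norm(commands - goal)

def imitation_reward(commands, demo_commands):
    """Imitation term derived from a demonstration dataset, modeled here as
    similarity to the operator's demonstrated commands."""
    return -np.linalg.norm(commands - demo_commands)

def hybrid_reward(commands, goal, demo_commands, lam=0.5):
    """Hybrid objective combining the imitation term and the task term."""
    return (lam * imitation_reward(commands, demo_commands)
            + (1.0 - lam) * task_reward(commands, goal))

# One toy "performance" of the task: a fixed observation, a goal, and one
# demonstrated set of control commands.
image_feats = rng.normal(size=IMG_DIM)
proprio = rng.normal(size=PROPRIO_DIM)
goal = np.array([0.2, -0.1, 0.3])
demo_commands = np.array([0.25, -0.05, 0.35])

# Parameter adjustment by random search: keep a perturbation only when it
# improves the hybrid objective (a stand-in for gradient-based training).
params = rng.normal(scale=0.1, size=(ACT_DIM, IMG_DIM + PROPRIO_DIM))
initial = hybrid_reward(policy(params, image_feats, proprio), goal, demo_commands)
best = initial
for _ in range(200):
    candidate = params + rng.normal(scale=0.05, size=params.shape)
    r = hybrid_reward(policy(candidate, image_feats, proprio), goal, demo_commands)
    if r > best:
        params, best = candidate, r
```

The weighting `lam` trades off the two terms: at `lam=1.0` only the imitation term matters, at `lam=0.0` only the task term, mirroring the claim's combination of demonstration-derived and task-success signals in one objective.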