US 11,712,799 B2
Data-driven robot control
Serkan Cabi, London (GB); Ziyu Wang, St. Albans (GB); Alexander Novikov, London (GB); Ksenia Konyushkova, London (GB); Sergio Gomez Colmenarejo, London (GB); Scott Ellison Reed, New York, NY (US); Misha Man Ray Denil, London (GB); Jonathan Karl Scholz, London (GB); Oleg O. Sushkov, London (GB); Rae Chan Jeong, London (GB); David Barker, Reading (GB); David Budden, London (GB); Mel Vecerik, London (GB); Yusuf Aytar, London (GB); and Joao Ferdinando Gomes de Freitas, London (GB)
Assigned to DeepMind Technologies Limited, London (GB)
Filed by DeepMind Technologies Limited, London (GB)
Filed on Sep. 14, 2020, as Appl. No. 17/020,294.
Claims priority of provisional application 62/900,407, filed on Sep. 13, 2019.
Prior Publication US 2021/0078169 A1, Mar. 18, 2021
Int. Cl. B25J 9/16 (2006.01)

CPC B25J 9/161 (2013.01) [B25J 9/163 (2013.01); B25J 9/1661 (2013.01)]

20 Claims

1. A computer-implemented method comprising:

maintaining robot experience data characterizing robot interactions with an environment, the robot experience data comprising a plurality of experiences that each comprise an observation and an action performed by a respective robot in response to the observation;

obtaining annotation data that assigns, to each experience in a first subset of the experiences in the robot experience data, a respective task-specific reward for a particular task, wherein the first subset of experiences comprises experiences from a plurality of different task episodes of the particular task;

training, on the annotation data, a reward model that receives as input an input observation and generates as output a reward prediction that is a prediction of a task-specific reward for the particular task that should be assigned to the input observation, wherein training the reward model comprises training the reward model to optimize a loss function that includes a term that measures, for a given pair of experiences from a same task episode of the plurality of different task episodes, (i) a difference in a respective reward prediction generated by the reward model for a first observation in a first experience in the given pair and a respective reward prediction generated by the reward model for a second observation in a second experience in the given pair and (ii) a difference in a respective task-specific reward for the first experience in the given pair and a respective task-specific reward for the second experience in the given pair;

generating task-specific training data for the particular task that associates each of a plurality of experiences with a task-specific reward for the particular task, comprising, for each experience in a second subset of the experiences in the robot experience data:

processing the observation in the experience using the trained reward model to generate a reward prediction, and

associating the reward prediction with the experience; and

training a policy neural network on the task-specific training data for the particular task, wherein the policy neural network is configured to receive a network input comprising an observation and to generate a policy output that defines a control policy for a robot performing the particular task.
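
For illustration only, the pairwise loss term and the relabeling step recited in claim 1 can be sketched in plain Python. The claim requires a term that compares (i) the difference in predicted rewards with (ii) the difference in annotated rewards for two experiences from the same task episode, but it does not fix the functional form. The hinge-style ranking penalty below is one assumed choice, not the claimed method. The names `pairwise_ranking_term`, `episode_ranking_loss`, and `relabel`, and the margins `label_margin` and `pred_margin`, are hypothetical. The reward model is stood in for by any callable that maps an observation to a float.

```python
from typing import Callable, List, Sequence, Tuple

Observation = Sequence[float]
RewardModel = Callable[[Observation], float]


def pairwise_ranking_term(
    pred_a: float,
    pred_b: float,
    label_a: float,
    label_b: float,
    label_margin: float = 0.1,  # assumed: minimum annotated gap that counts as an ordering
    pred_margin: float = 0.05,  # assumed: gap the predictions should reproduce
) -> float:
    """Hinge-style penalty for one pair of experiences from the same episode.

    Compares (i) the difference in predicted rewards with (ii) the
    difference in annotated rewards; penalizes predictions that do not
    preserve a clear ordering present in the annotations.
    """
    label_diff = label_a - label_b
    pred_diff = pred_a - pred_b
    if label_diff > label_margin:    # annotation says a is better than b
        return max(0.0, pred_margin - pred_diff)
    if label_diff < -label_margin:   # annotation says b is better than a
        return max(0.0, pred_margin + pred_diff)
    return 0.0                       # annotated rewards too close to rank


def episode_ranking_loss(
    model: RewardModel,
    observations: List[Observation],
    rewards: List[float],
) -> float:
    """Average the pairwise term over all pairs within a single episode."""
    preds = [model(obs) for obs in observations]
    total, pairs = 0.0, 0
    for i in range(len(preds)):
        for j in range(i + 1, len(preds)):
            total += pairwise_ranking_term(preds[i], preds[j], rewards[i], rewards[j])
            pairs += 1
    return total / pairs if pairs else 0.0


def relabel(
    model: RewardModel,
    experiences: List[Tuple[Observation, str]],
) -> List[Tuple[Observation, str, float]]:
    """Associate a reward prediction with each (observation, action) experience."""
    return [(obs, action, model(obs)) for obs, action in experiences]
```

In this sketch, pairs are drawn only from within one episode, matching the "same task episode" limitation of the claim. The output of `relabel` plays the role of the task-specific training data on which a policy would then be trained; the policy training itself is not shown.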