US 12,240,117 B2
Optimizing policy controllers for robotic agents using image embeddings
Yevgen Chebotar, Los Angeles, CA (US); Pierre Sermanet, Palo Alto, CA (US); and Corey Lynch, San Francisco, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Jan. 23, 2023, as Appl. No. 18/157,919.
Application 18/157,919 is a continuation of application No. 16/649,596, granted, now 11,559,887, previously published as PCT/US2018/052078, filed on Sep. 20, 2018.
Claims priority of provisional application 62/561,133, filed on Sep. 20, 2017.
Prior Publication US 2023/0150127 A1, May 18, 2023
This patent is subject to a terminal disclaimer.
Int. Cl. B25J 9/16 (2006.01); G05B 13/02 (2006.01); G06N 3/084 (2023.01); G06N 20/00 (2019.01)
CPC B25J 9/163 (2013.01) [B25J 9/1664 (2013.01); B25J 9/1697 (2013.01); G05B 13/0205 (2013.01); G05B 13/027 (2013.01); G06N 3/084 (2013.01); G06N 20/00 (2019.01)] 20 Claims
OG exemplary drawing
 
1. A method of optimizing a policy controller used to select actions to be performed by a robotic agent interacting with an environment to perform a specified task, the method comprising:
obtaining a demonstration sequence of demonstration images of another agent performing a version of the specified task;
for each respective demonstration image in the demonstration sequence, generating a respective demonstration embedding of the respective demonstration image by processing the respective demonstration image using a time contrastive neural network that has been trained on time-sequenced images of a training environment to minimize a loss that includes a difference between embeddings generated for co-occurring input images captured from different viewpoints or by different modalities;
obtaining a robot sequence of robot images of the robotic agent performing the specified task by performing actions selected using a current policy controller, wherein each robot image in the robot sequence corresponds to a respective demonstration image in the demonstration sequence;
for each respective robot image in the robot sequence, generating a respective robot embedding for the respective robot image by processing the respective robot image using the same time contrastive neural network that has been trained; and
updating the current policy controller by performing an iteration of a reinforcement learning technique to optimize a reward function that depends on, for each demonstration image, a distance between (i) the demonstration embedding that has been generated by processing the demonstration image using the time contrastive neural network and (ii) the robot embedding that has been generated by processing the corresponding robot image using the same time contrastive neural network.
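The time contrastive neural network recited in claim 1 is trained so that embeddings of co-occurring frames (e.g., the same moment captured from different viewpoints or modalities) are pulled together while embeddings of temporally distant frames are pushed apart. The following is a minimal sketch of such a triplet-style time-contrastive loss in Python/NumPy; the margin value and the frame-sampling scheme are illustrative assumptions, not the specific training procedure of the patent.

```python
import numpy as np

def time_contrastive_triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet-style time-contrastive loss (illustrative sketch).

    anchor:   embedding of a frame from viewpoint 1 at time t
    positive: embedding of the co-occurring frame from viewpoint 2 at time t
    negative: embedding of a temporally distant frame from viewpoint 1

    The loss is zero once the anchor is closer, by at least `margin`,
    to its co-occurring positive than to the temporally distant negative.
    """
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)
```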
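The final limitation of claim 1 makes the reward depend, for each demonstration image, on the distance between its embedding and the embedding of the corresponding robot image. The sketch below computes such per-timestep rewards; the negative squared Euclidean distance and the scaling coefficient `alpha` are assumptions chosen for illustration, since the claim only requires that the reward depend on the embedding distance.

```python
import numpy as np

def embedding_distance_rewards(demo_embeddings, robot_embeddings, alpha=1.0):
    """Per-timestep rewards from embedding distances (illustrative sketch).

    demo_embeddings:  array of shape [T, D], one embedding per demonstration image
    robot_embeddings: array of shape [T, D], embedding of the corresponding robot image
    alpha:            illustrative scaling coefficient (assumption)

    Each reward increases as the robot's embedding at time t approaches the
    demonstrator's embedding at the corresponding time, so maximizing the
    summed reward pushes the policy to track the demonstration in embedding space.
    """
    sq_dists = np.sum((demo_embeddings - robot_embeddings) ** 2, axis=1)  # shape [T]
    return -alpha * sq_dists

# The resulting per-step rewards would then drive an iteration of any
# reinforcement learning update (e.g., a policy-gradient step) to improve
# the current policy controller, per the last limitation of claim 1.
```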