US 11,992,945 B2
System and methods for training robot policies in the real world
Jie Tan, Mountain View, CA (US); Sehoon Ha, Atlanta, GA (US); Peng Xu, Santa Clara, CA (US); Sergey Levine, Berkeley, CA (US); and Zhenyu Tan, Sunnyvale, CA (US)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Nov. 10, 2020, as Appl. No. 17/094,521.
Prior Publication US 2022/0143819 A1, May 12, 2022
Int. Cl. B25J 9/16 (2006.01); B25J 13/08 (2006.01); G05D 1/00 (2006.01); G06N 3/08 (2023.01)
CPC B25J 9/163 (2013.01) [B25J 9/162 (2013.01); B25J 9/1689 (2013.01); B25J 13/089 (2013.01); G05D 1/02 (2013.01); G06N 3/08 (2013.01)] 17 Claims
OG exemplary drawing
 
1. A method implemented by one or more processors, the method comprising:
determining a pose of a mobile robot within a real world training workspace;
selecting, from a plurality of disparate policy networks each being for a corresponding component of locomotion, a corresponding policy network, wherein selecting the corresponding policy network is based on comparing the pose to a position of a center of the real world training workspace to determine the corresponding policy network with the corresponding component of locomotion that will move the mobile robot towards the center of the real world training workspace; and
for each of a plurality of iterations, and until one or more conditions are satisfied:
determining current state data of the mobile robot,
using the selected policy network and the corresponding current state data to determine one or more corresponding actions to move the mobile robot towards the center of the real world training workspace,
storing a corresponding training instance in association with the selected corresponding policy network, the corresponding training instance including at least the corresponding current state data and the one or more corresponding actions, and
implementing the one or more corresponding actions at the mobile robot;
updating one or more portions of the selected policy network using the training instances stored in association with the selected policy network in the iterations;
after cessation of the plurality of iterations:
determining an additional pose of the mobile robot within the real world training workspace, wherein the additional pose is distinct from the pose of the mobile robot, and wherein the additional pose of the mobile robot corresponds to an additional component of locomotion distinct from the component of locomotion corresponding to the pose of the mobile robot;
selecting, from the plurality of disparate policy networks, an additional policy network corresponding to the additional component of locomotion,
wherein selecting the additional policy network is based on comparing the additional pose to the position of the center of the real world training workspace to determine the additional policy network with the corresponding additional component of locomotion that will move the mobile robot towards the center of the real world training workspace,
wherein the additional policy network is distinct from the policy network, and
wherein the additional component of locomotion corresponding to the additional policy network is distinct from the component of locomotion corresponding to the policy network;
for each of a plurality of additional iterations, and until the one or more conditions are satisfied:
determining additional current state data of the mobile robot,
using the selected additional policy network and the corresponding additional current state data to generate one or more corresponding additional actions to move the mobile robot towards the center of the real world training workspace,
storing a corresponding additional training instance in association with the selected additional policy network, the corresponding additional training instance including at least the corresponding additional current state data and the one or more corresponding additional actions, and
implementing the one or more corresponding additional actions at the mobile robot; and
updating one or more portions of the selected additional policy network using the additional training instances stored in association with the selected additional policy network in the additional iterations.
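
Claim 1 recites a concrete algorithm: pick, from several per-skill policy networks, the one whose component of locomotion will drive the robot back toward the workspace center; roll that policy out while storing each (state, action) pair as a training instance tied to that policy; update the policy from its own stored instances; then repeat from the robot's new pose with whichever policy now points it home. The sketch below is a minimal, hypothetical Python rendering of that loop, not the patent's implementation: the four-skill decomposition (forward, backward, turn left, turn right), the Pose, PolicyNetwork, and FakeRobot classes, the pi/4 heading thresholds, and the stop condition are all illustrative assumptions.

import math
import random
from dataclasses import dataclass, field


@dataclass
class Pose:
    x: float
    y: float
    yaw: float  # heading in radians


@dataclass
class PolicyNetwork:
    """Stand-in for one learned policy, one per component of locomotion."""
    name: str
    buffer: list = field(default_factory=list)  # training instances stored per policy

    def act(self, state):
        # Placeholder: a real policy would run a neural network on the state.
        return [random.uniform(-1.0, 1.0) for _ in range(8)]

    def update(self):
        # Placeholder: a real system would run an RL update (e.g., an
        # actor-critic step) on the instances stored for this policy.
        print(f"updating '{self.name}' on {len(self.buffer)} training instances")
        self.buffer.clear()


POLICIES = {name: PolicyNetwork(name)
            for name in ("forward", "backward", "turn_left", "turn_right")}
CENTER = (0.0, 0.0)  # center of the real-world training workspace


def select_policy(pose: Pose) -> PolicyNetwork:
    """Compare the pose to the workspace center and pick the component of
    locomotion that moves the robot back toward it. The pi/4 heading
    thresholds are arbitrary illustrative choices."""
    bearing = math.atan2(CENTER[1] - pose.y, CENTER[0] - pose.x)
    rel = (bearing - pose.yaw + math.pi) % (2 * math.pi) - math.pi
    if abs(rel) < math.pi / 4:          # center roughly ahead
        return POLICIES["forward"]
    if abs(rel) > 3 * math.pi / 4:      # center roughly behind
        return POLICIES["backward"]
    return POLICIES["turn_left"] if rel > 0 else POLICIES["turn_right"]


class FakeRobot:
    """Minimal robot stand-in so the sketch runs end to end."""
    def __init__(self):
        self.pose = Pose(1.5, -0.5, 0.3)
        self.steps = 0

    def get_state(self):
        return (self.pose.x, self.pose.y, self.pose.yaw)

    def apply(self, actions):
        # Pretend the actions nudge the robot toward the center; a real
        # robot would execute motor commands here.
        self.pose.x *= 0.98
        self.pose.y *= 0.98
        self.steps += 1

    def done(self):
        # Stand-in for the claim's "one or more conditions": e.g., the
        # robot fell, left the workspace, or hit a step limit.
        return self.steps % 200 == 0


def run_episode(robot: FakeRobot):
    """One pass of the claimed loop: select, roll out, store, update."""
    policy = select_policy(robot.pose)          # selection based on the pose
    while True:                                 # "plurality of iterations"
        state = robot.get_state()               # current state data
        actions = policy.act(state)             # actions toward the center
        policy.buffer.append((state, actions))  # store a training instance
        robot.apply(actions)                    # implement on the robot
        if robot.done():                        # condition satisfied
            break
    policy.update()                             # update the selected policy only


if __name__ == "__main__":
    robot = FakeRobot()
    run_episode(robot)  # first pose -> first policy network
    # After cessation the robot is in a distinct pose, so a different
    # component of locomotion may now point it back toward the center.
    run_episode(robot)

The pose-based selection presumably serves a practical purpose: whichever skill is being trained, the chosen policy drifts the robot back toward the center of the physical workspace, so real-world data collection can continue without manual repositioning, and each network is updated only on experience gathered under its own control.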