US 11,897,133 B2
Deep reinforcement learning for robotic manipulation
Sergey Levine, Berkeley, CA (US); Ethan Holly, San Francisco, CA (US); Shixiang Gu, Mountain View, CA (US); and Timothy Lillicrap, London (GB)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Aug. 1, 2022, as Appl. No. 17/878,186.
Application 17/878,186 is a continuation of application No. 16/333,482, granted, now Pat. No. 11,400,587, previously published as PCT/US2017/051646, filed on Sep. 14, 2017.
Claims priority of provisional application 62/395,340, filed on Sep. 15, 2016.
Prior Publication US 2022/0388159 A1, Dec. 8, 2022
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 17/00 (2019.01); B25J 9/16 (2006.01); G05B 13/02 (2006.01); G06N 3/08 (2023.01); G06N 3/008 (2023.01); G06N 3/045 (2023.01); G05B 19/042 (2006.01)
CPC B25J 9/161 (2013.01) [B25J 9/163 (2013.01); B25J 9/1664 (2013.01); G05B 13/027 (2013.01); G05B 19/042 (2013.01); G06N 3/008 (2013.01); G06N 3/045 (2023.01); G06N 3/08 (2013.01); G05B 2219/32335 (2013.01); G05B 2219/33033 (2013.01); G05B 2219/33034 (2013.01); G05B 2219/39001 (2013.01); G05B 2219/39298 (2013.01); G05B 2219/40499 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A method implemented by one or more processors, comprising:
receiving a given instance of robot experience data generated by a given robot of a plurality of robots, wherein the given instance of the robot experience data is generated during a given episode of explorations of performing a task based on a given version of policy parameters of a policy network utilized by the given robot in generating the given instance;
receiving additional instances of robot experience data from additional robots of the plurality of robots, the additional instances generated during episodes, by the additional robots, of explorations of performing the task based on the policy network;
while the given robot and the additional robots continue the episodes of explorations of performing the task, generating a new version of the policy parameters of the policy network based on training of the policy network based at least in part on the given instance and the additional instances; and
providing the new version of the policy parameters to the given robot for performing of an immediately subsequent episode of explorations of performing the task by the given robot based on the new version of the policy parameters.
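Claim 1 recites an asynchronous collect-and-train arrangement: multiple robots generate episodes of exploration under a current version of the policy parameters, a trainer produces a new version of those parameters from the pooled experience while the robots continue exploring, and each robot receives the new version for its immediately subsequent episode. The following is a minimal Python sketch of that data flow only, not the patented implementation; the shared queue, the stub parameter update, and all identifiers (robot_worker, trainer, get_latest_params) are hypothetical illustrations introduced here for clarity.

```python
# Sketch of the asynchronous collect/train loop described in claim 1.
# The "training" step is a toy placeholder; a real system would perform
# gradient-based updates of a deep policy network instead.
import random
import threading
import queue

experience_queue = queue.Queue()   # instances of robot experience data
param_lock = threading.Lock()
policy_params = {"version": 0, "weights": [0.0] * 8}  # current policy parameters


def get_latest_params():
    # Each robot fetches the newest version of the policy parameters
    # before starting its next episode.
    with param_lock:
        return dict(policy_params)


def robot_worker(robot_id, num_episodes=5):
    # A robot runs episodes of exploration with whatever parameter version
    # it last received and reports the resulting experience data.
    for _ in range(num_episodes):
        params = get_latest_params()
        episode = {
            "robot": robot_id,
            "param_version": params["version"],
            "transitions": [(random.random(), random.random()) for _ in range(10)],
        }
        experience_queue.put(episode)  # one instance of robot experience data


def trainer(num_updates=10):
    # The trainer consumes experience from all robots and, while they keep
    # exploring, publishes new versions of the policy parameters.
    for _ in range(num_updates):
        episode = experience_queue.get()  # blocks until experience arrives
        with param_lock:
            # Stub update: nudge weights toward the mean observed value.
            mean_obs = sum(t[0] for t in episode["transitions"]) / len(episode["transitions"])
            policy_params["weights"] = [w + 0.01 * (mean_obs - w) for w in policy_params["weights"]]
            policy_params["version"] += 1  # new version for subsequent episodes


if __name__ == "__main__":
    workers = [threading.Thread(target=robot_worker, args=(i,)) for i in range(3)]
    train_thread = threading.Thread(target=trainer)
    for w in workers:
        w.start()
    train_thread.start()
    for w in workers:
        w.join()
    train_thread.join()
    print("final parameter version:", policy_params["version"])
```

In a deployed system the stub update would be replaced by training of the policy network (for example, with an off-policy reinforcement learning algorithm), and parameters would be distributed to the robots over a network rather than through a shared in-process dictionary.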