US 12,240,113 B2
Deep reinforcement learning for robotic manipulation
Sergey Levine, Berkeley, CA (US); Ethan Holly, San Francisco, CA (US); Shixiang Gu, Mountain View, CA (US); and Timothy Lillicrap, London (GB)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Filed by GOOGLE LLC, Mountain View, CA (US)
Filed on Dec. 1, 2023, as Appl. No. 18/526,443.
Application 18/526,443 is a continuation of application No. 17/878,186, filed on Aug. 1, 2022, granted, now 11,897,133.
Application 17/878,186 is a continuation of application No. 16/333,482, granted, now 11,400,587, issued on Aug. 2, 2022, previously published as PCT/US2017/051646, filed on Sep. 14, 2017.
Claims priority of provisional application 62/395,340, filed on Sep. 15, 2016.
Prior Publication US 2024/0131695 A1, Apr. 25, 2024
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 17/00 (2019.01); B25J 9/16 (2006.01); G05B 13/02 (2006.01); G05B 19/042 (2006.01); G06N 3/008 (2023.01); G06N 3/045 (2023.01); G06N 3/08 (2023.01)
CPC B25J 9/161 (2013.01) [B25J 9/163 (2013.01); B25J 9/1664 (2013.01); G05B 13/027 (2013.01); G05B 19/042 (2013.01); G06N 3/008 (2013.01); G06N 3/045 (2023.01); G06N 3/08 (2013.01); G05B 2219/32335 (2013.01); G05B 2219/33033 (2013.01); G05B 2219/33034 (2013.01); G05B 2219/39001 (2013.01); G05B 2219/39298 (2013.01); G05B 2219/40499 (2013.01)] 21 Claims
OG exemplary drawing
 
1. A system, comprising:
memory storing instructions;
one or more processors operable to execute the instructions, stored in the memory, to:
during performance of a plurality of episodes by each of a plurality of robots, each of the episodes including performing a task based on a policy neural network representing a reinforcement learning policy for the task:
store, in a buffer, instances of robot experience data generated during the episodes by the plurality of robots, each of the instances of the robot experience data being generated during a corresponding one of the episodes, and being generated based at least in part on corresponding output generated using the policy neural network with corresponding policy parameters for the policy neural network for the corresponding episode;
iteratively generate updated policy parameters of the policy neural network, wherein in each of the iterations of iteratively generating the updated policy parameters one or more of the processors are to generate the updated policy parameters using a group of one or more of the instances of the robot experience data in the buffer during the iteration; and
by each of the robots in conjunction with a start of each of a plurality of the episodes performed by the robot, update the policy neural network to be used by the robot in the episode, wherein updating the policy neural network comprises using the updated policy parameters of a most recent iteration of the iteratively generating the updated policy parameters.
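The claimed system pairs a shared experience buffer with a trainer that iteratively generates updated policy parameters from groups of buffered instances, while each robot refreshes its policy to the most recent parameters at the start of every episode. A minimal single-process sketch of that loop follows; the one-parameter "policy network", the toy reward, and all names here are illustrative assumptions, not taken from the patent:

```python
import random

class ReplayBuffer:
    """Shared store for experience instances generated by all robots."""
    def __init__(self):
        self.data = []

    def add(self, experience):
        self.data.append(experience)

    def sample(self, k):
        # A group of one or more buffered instances for one trainer iteration.
        return random.sample(self.data, min(k, len(self.data)))

def run_episode(params, steps=5):
    """One robot episode: act with the current policy parameters."""
    experiences = []
    for _ in range(steps):
        action = params["b"] + random.gauss(0.0, 0.1)  # noisy policy output
        reward = -(1.0 - action) ** 2                  # toy task: drive action to 1.0
        experiences.append((action, reward))
    return experiences

def update_params(params, batch, lr=0.05):
    """One iteration of generating updated policy parameters from a sample."""
    b = params["b"]
    for action, _reward in batch:
        b += lr * (1.0 - action)  # gradient step on the toy squared loss
    return {"b": b}

random.seed(0)
buffer = ReplayBuffer()
params = {"b": 0.0}
num_robots = 2

for _episode in range(20):
    # Each robot updates its policy to the latest parameters in conjunction
    # with the start of its episode, then stores experience in the buffer.
    for _robot in range(num_robots):
        for exp in run_episode(dict(params)):
            buffer.add(exp)
    # Trainer iteration: generate updated parameters from buffered instances.
    params = update_params(params, buffer.sample(8))

print(len(buffer.data))
```

In the patented arrangement these two roles run asynchronously across machines, so the parameters a robot pulls at episode start may lag the trainer by a few iterations; this sketch serializes them only for clarity.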