US 12,277,194 B2
Task prioritized experience replay algorithm for reinforcement learning
Varun Kompella, Kanata (CA); James MacGlashan, Riverside, RI (US); Peter Wurman, Acton, MA (US); and Peter Stone, Austin, TX (US)
Assigned to SONY GROUP CORPORATION, Tokyo (JP)
Filed by Sony Corporation, Tokyo (JP); and Sony Corporation of America, New York, NY (US)
Filed on Sep. 29, 2020, as Appl. No. 17/036,913.
Prior Publication US 2022/0101064 A1, Mar. 31, 2022
Int. Cl. G06N 7/00 (2023.01); G06F 18/21 (2023.01); G06F 18/214 (2023.01); G06N 20/00 (2019.01)
CPC G06F 18/2178 (2023.01) [G06F 18/214 (2023.01); G06N 20/00 (2019.01)] 18 Claims
OG exemplary drawing
 
1. A method of training an agent in a control loop, comprising:
performing, by the agent, an action (at) sampled from a behavior policy (πb) for an observation (st), wherein the observation comprises information the agent receives, by any means, about an environment of the agent or the agent itself, wherein the information includes one or more of sensory information or signals received through sensory devices; compiled, abstract, or situational information compiled from a collection of the sensory devices combined with stored information; information about people or customers, or characteristics of the people or the customers; information about internal parts of the agent; proprioceptive information; information regarding current or past actions of the agent; information about an internal state of the agent; information already computed or processed by the agent; and a termination value for each task of a plurality of tasks for which the agent is being trained;
storing a transition tuple in a main buffer of the agent, the transition tuple including {(st, at, rt, st+1)}, where rt is a reward vector for each task of the plurality of tasks for the agent in an environment and st+1 is a next environment state after action (at);
storing a priority value, p(i), of the transition tuple with index i in the main buffer;
determining a probability, P(i), of sampling the transition tuple with the index i from the main buffer;
updating transition priorities for each transition tuple stored in the main buffer;
sampling a minibatch of transition tuples to update the task networks based on the stored priority value p(i) thereof;
determining an action probability distribution parameter, πi(st), of updated task policies for the observation st; and
optimizing the task policies from the updated task networks with an off-policy algorithm, wherein:
data that is prioritized for one task is shared with one or more other tasks to transfer learning between multiple tasks.
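
The priority and sampling-probability steps recited in claim 1 follow the general prioritized experience replay pattern, here maintained separately per task over a single shared main buffer. Below is a minimal sketch, assuming the common proportional rule P(i) = p(i)^alpha / sum_k p(k)^alpha, which the claim itself does not fix; the class name TaskPrioritizedBuffer and the parameters alpha and epsilon are illustrative and are not taken from the patent.

import numpy as np

class TaskPrioritizedBuffer:
    """Main buffer of transition tuples (st, at, rt, st+1) with one priority p(i) per task."""

    def __init__(self, capacity, num_tasks, alpha=0.6, epsilon=1e-6):
        self.capacity = capacity
        self.alpha = alpha                # degree of prioritization (0 = uniform sampling)
        self.epsilon = epsilon            # keeps every priority strictly positive
        self.buffer = []                  # stored transition tuples
        self.priorities = np.zeros((capacity, num_tasks))  # p(i) per transition, per task
        self.pos = 0

    def store(self, s_t, a_t, r_t, s_next):
        # rt is a reward vector with one entry per task; new transitions receive the
        # current maximum priority so they are sampled at least once
        max_p = self.priorities.max() if self.buffer else 1.0
        if len(self.buffer) < self.capacity:
            self.buffer.append((s_t, a_t, r_t, s_next))
        else:
            self.buffer[self.pos] = (s_t, a_t, r_t, s_next)
        self.priorities[self.pos, :] = max_p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, task):
        # P(i) = p(i)^alpha / sum_k p(k)^alpha, computed over this task's priorities
        p = self.priorities[:len(self.buffer), task] ** self.alpha
        probs = p / p.sum()
        idx = np.random.choice(len(self.buffer), size=batch_size, p=probs)
        return idx, [self.buffer[i] for i in idx]

    def update_priorities(self, idx, task, errors):
        # refresh p(i) for the sampled transitions, e.g. from per-task TD errors
        self.priorities[idx, task] = np.abs(errors) + self.epsilon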
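A hedged sketch of how the claimed control loop might use such a buffer: the agent acts from the behavior policy πb, every transition is stored once in the shared main buffer, and each task network is then updated off-policy from minibatches drawn with that task's sampling probabilities, so data prioritized for one task remains available to the other tasks. The env, behavior_policy, and update_fn callables below are assumed placeholders, not part of the patent text.

def train(env, behavior_policy, buffer, task_networks, update_fn, steps, batch_size=32):
    # update_fn(network, batch, task) is assumed to perform one off-policy update
    # (e.g. a Q-learning-style step) and return per-transition errors for re-prioritization.
    s_t = env.reset()
    for _ in range(steps):
        a_t = behavior_policy(s_t)               # action at sampled from behavior policy pi_b
        s_next, r_t, done = env.step(a_t)        # rt: reward vector with one entry per task
        buffer.store(s_t, a_t, r_t, s_next)      # single shared main buffer
        for task, network in enumerate(task_networks):
            idx, batch = buffer.sample(batch_size, task)   # minibatch drawn with this task's P(i)
            errors = update_fn(network, batch, task)       # off-policy update of the task network
            buffer.update_priorities(idx, task, errors)    # refresh p(i) for the sampled tuples
        s_t = env.reset() if done else s_next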