US 12,293,283 B2
Reinforcement learning using meta-learned intrinsic rewards
Zeyu Zheng, Ann Arbor, MI (US); Junhyuk Oh, London (GB); and Satinder Singh Baveja, Ann Arbor, MI (US)
Assigned to DeepMind Technologies Limited, London (GB)
Filed by DeepMind Technologies Limited, London (GB)
Filed on Sep. 25, 2020, as Appl. No. 17/033,410.
Claims priority of provisional application 62/905,964, filed on Sep. 25, 2019.
Prior Publication US 2021/0089910 A1, Mar. 25, 2021
Int. Cl. G06N 3/08 (2023.01); G06N 3/04 (2023.01); G06N 3/044 (2023.01); G06N 3/045 (2023.01); G06N 3/084 (2023.01)
CPC G06N 3/08 (2013.01) [G06N 3/044 (2023.01); G06N 3/045 (2023.01); G06N 3/084 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A method of training a reinforcement learning system, the reinforcement learning system comprising an agent configured to perform actions based upon a policy and an intrinsic reward system configured to generate intrinsic reward values for the agent based upon the actions taken by the agent, the method comprising:
training the reinforcement learning system by updating the agent's policy based upon, for each of a plurality of task episodes performed by the agent and each corresponding to a respective task, a first objective that measures an expected return accumulated within the task episode, the training comprising, for each of the plurality of task episodes:
at each of a plurality of time steps within the task episode, determining a respective intrinsic reward value for the time step using the intrinsic reward system conditioned on a history of the agent at earlier time steps within the task episode and at time steps during previous task episodes across a lifetime of the agent, wherein the lifetime of the agent comprises the plurality of task episodes; and
updating the intrinsic reward system based upon a second objective that measures, for a particular time step within a particular one of the plurality of task episodes, an expected return accumulated across a remainder of the lifetime of the agent subsequent to the particular time step, wherein the expected return accumulated across the remainder of the lifetime of the agent is determined based on extrinsic reward values at one or more subsequent time steps and on an approximation of a lifetime extrinsic reward for subsequent task episodes during the remainder of the lifetime of the agent generated by a lifetime return function.
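For context (illustration only, not part of the published claims), the two-level training scheme recited in claim 1 can be sketched as follows in PyTorch: an inner update trains the policy on returns computed from the meta-learned intrinsic rewards (the first objective), and an outer, meta-gradient update adjusts the intrinsic reward system toward the extrinsic return over the remainder of the agent's lifetime (the second objective). The toy chain environment, the linear policy, the network sizes, and the scalar `lifetime_value` bootstrap standing in for the claimed lifetime return function are all hypothetical assumptions, not taken from the patent.

```python
# Hypothetical minimal sketch of the claimed two-level training loop (PyTorch).
# Environment, architectures, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

N_STATES, N_ACTIONS, GAMMA, POLICY_LR = 5, 2, 0.9, 0.1

def run_episode(policy_params, reward_net, steps=10):
    """Roll out one task episode; return log-probs, extrinsic and intrinsic rewards."""
    state = torch.zeros(N_STATES); state[0] = 1.0
    logps, r_ext, r_int = [], [], []
    pos = 0
    for _ in range(steps):
        logits = state @ policy_params                        # linear policy
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        logps.append(dist.log_prob(action))
        # toy chain environment: action 1 moves right, extrinsic reward at the end
        pos = min(pos + int(action.item()), N_STATES - 1)
        r_ext.append(1.0 if pos == N_STATES - 1 else 0.0)
        # intrinsic reward conditioned on (state, action) -- a simplified stand-in
        # for the history-conditioned intrinsic reward system of the claim
        sa = torch.cat([state, nn.functional.one_hot(action, N_ACTIONS).float()])
        r_int.append(reward_net(sa).squeeze())
        state = torch.zeros(N_STATES); state[pos] = 1.0
    return torch.stack(logps), torch.tensor(r_ext), torch.stack(r_int)

def discounted_returns(rewards, gamma=GAMMA):
    returns, g = [], torch.zeros(())
    for r in reversed(list(rewards)):
        g = r + gamma * g
        returns.append(g)
    return torch.stack(list(reversed(returns)))

# Intrinsic reward system (meta-learned) and a learned scalar that approximates
# the extrinsic return of subsequent episodes ("lifetime return function").
reward_net = nn.Sequential(nn.Linear(N_STATES + N_ACTIONS, 16), nn.Tanh(), nn.Linear(16, 1))
lifetime_value = nn.Parameter(torch.zeros(()))
meta_opt = torch.optim.Adam(list(reward_net.parameters()) + [lifetime_value], lr=1e-3)

policy_params = torch.zeros(N_STATES, N_ACTIONS, requires_grad=True)

for episode in range(200):                                    # the agent's "lifetime"
    # Inner update: train the policy on intrinsic returns (first objective).
    logps, r_ext, r_int = run_episode(policy_params, reward_net)
    inner_loss = -(logps * discounted_returns(r_int)).sum()
    grads = torch.autograd.grad(inner_loss, policy_params, create_graph=True)[0]
    updated_params = policy_params - POLICY_LR * grads        # differentiable update

    # Outer update: meta-gradient on the lifetime extrinsic return (second objective).
    logps2, r_ext2, _ = run_episode(updated_params, reward_net)
    # Lifetime return = extrinsic rewards of the new episode plus a learned
    # bootstrap for episodes in the remainder of the lifetime.
    lifetime_return = discounted_returns(r_ext2) + GAMMA ** len(r_ext2) * lifetime_value
    meta_loss = -(logps2 * lifetime_return.detach()).sum() \
                + (lifetime_value - r_ext2.sum().detach()).pow(2)
    meta_opt.zero_grad()
    meta_loss.backward()            # backpropagates through the inner policy update
    meta_opt.step()

    # Commit the inner update to the policy before the next episode.
    policy_params = updated_params.detach().requires_grad_(True)
```

Because the inner policy update is written as a differentiable expression (`create_graph=True`), the outer loss can propagate gradients through it into the intrinsic reward network, which is the mechanism by which the intrinsic reward system is updated toward the lifetime extrinsic objective in this sketch.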