US 12,086,714 B2
Training neural networks using a prioritized experience memory
Tom Schaul, London (GB); John Quan, London (GB); and David Silver, Hitchin (GB)
Assigned to DeepMind Technologies Limited, London (GB)
Filed by DeepMind Technologies Limited, London (GB)
Filed on Jan. 30, 2023, as Appl. No. 18/103,416.
Application 18/103,416 is a continuation of application No. 16/866,365, filed on May 4, 2020, granted, now Pat. No. 11,568,250.
Application 16/866,365 is a continuation of application No. 15/349,894, filed on Nov. 11, 2016, granted, now Pat. No. 10,650,310, issued on May 12, 2020.
Claims priority of provisional application 62/254,610, filed on Nov. 12, 2015.
Prior Publication US 2023/0244933 A1, Aug. 3, 2023
This patent is subject to a terminal disclaimer.
Int. Cl. G06N 3/08 (2023.01); G06N 3/088 (2023.01)
CPC G06N 3/08 (2013.01) [G06N 3/088 (2013.01); Y04S 10/50 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A method for controlling an agent in an environment to perform a task, the method comprising:
receiving a current observation characterizing a current state of the environment;
processing the current observation using a neural network to generate an output that specifies an action to be performed by the agent in response to the current observation, wherein the neural network has been trained through reinforcement learning to determine trained values of parameters of the neural network using a plurality of pieces of selected experience data selected from a prioritized experience memory that stored, during the training of the neural network through reinforcement learning, a plurality of pieces of experience data in association with expected learning progress measures,
wherein each piece of experience data is a training tuple that comprises a training current observation characterizing a training current state of the environment, and a training current action performed by the agent in response to the training current observation, and wherein, for each piece of experience data, a respective value of an expected learning progress measure that is stored in association with the piece of experience data in the prioritized experience memory is derived from a result of a preceding time that values of the parameters of the neural network were updated using the piece of experience data during the training, and
wherein, during the training, the plurality of pieces of selected experience data were selected from the prioritized experience memory based on the respective values of the expected learning progress measures that are stored in association with the plurality of pieces of experience data in the prioritized experience memory; and
causing the agent to perform the action specified by the output in response to the current observation.
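 
For orientation, the following is a minimal sketch of the kind of prioritized experience memory the claim describes: experience tuples are stored alongside an expected-learning-progress measure derived from the result of the last parameter update that used the tuple (here, the absolute TD error), and sampling for the next update is biased by those stored measures. This is an illustrative assumption in the spirit of prioritized experience replay, not the patented implementation; all names (PrioritizedMemory, alpha, q_network, etc.) are hypothetical.

```python
import numpy as np


class PrioritizedMemory:
    """Stores experience tuples in association with an
    expected-learning-progress measure (here: the absolute TD error
    from the preceding time the tuple was used to update the
    network's parameters)."""

    def __init__(self, capacity, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha      # how strongly the measures bias sampling
        self.tuples = []        # (observation, action, reward, next_obs)
        self.priorities = []    # one learning-progress measure per tuple

    def add(self, experience, initial_priority=1.0):
        # New tuples get a high initial measure so each is used at least once.
        if len(self.tuples) >= self.capacity:
            self.tuples.pop(0)
            self.priorities.pop(0)
        self.tuples.append(experience)
        self.priorities.append(initial_priority)

    def sample(self, batch_size):
        # Selection probability grows with the stored measure, so tuples
        # expected to yield more learning progress are replayed more often.
        p = np.asarray(self.priorities) ** self.alpha
        p /= p.sum()
        idx = np.random.choice(len(self.tuples), size=batch_size, p=p)
        return idx, [self.tuples[i] for i in idx]

    def update_priorities(self, idx, td_errors):
        # Derive each tuple's new measure from the result of the update
        # that just used it, as the claim recites.
        for i, err in zip(idx, td_errors):
            self.priorities[i] = abs(err) + 1e-6  # keep strictly positive


if __name__ == "__main__":
    mem = PrioritizedMemory(capacity=100)
    for t in range(10):
        mem.add((f"obs{t}", t % 3, 0.0, f"obs{t+1}"))
    idx, batch = mem.sample(batch_size=4)
    # td_errors would come from the network update; random values stand in.
    mem.update_priorities(idx, td_errors=np.random.randn(4))
```

At inference time the trained network is simply applied to the current observation, e.g. selecting the action with the highest predicted value (`action = int(np.argmax(q_network(observation)))` for a hypothetical `q_network`), and the agent is caused to perform that action.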