US 11,907,837 B1
Selecting actions from large discrete action sets using reinforcement learning
Gabriel Dulac-Arnold, Paris (FR); Richard Andrew Evans, London (GB); and Benjamin Kenneth Coppin, Cottenham (GB)
Assigned to DeepMind Technologies Limited, London (GB)
Filed by DeepMind Technologies Limited, London (GB)
Filed on Dec. 22, 2020, as Appl. No. 17/131,500.
Application 17/131,500 is a continuation of application No. 15/382,383, filed on Dec. 16, 2016, granted, now Pat. No. 10,885,432.
Claims priority of provisional application 62/268,406, filed on Dec. 16, 2015.
This patent is subject to a terminal disclaimer.
Int. Cl. G06N 3/08 (2023.01)
CPC G06N 3/08 (2013.01) 24 Claims
OG exemplary drawing
 
1. A method performed by one or more computers, the method comprising:
receiving a particular observation representing a particular state of an environment; and
selecting an action from a discrete set of actions to be performed by an agent interacting with the environment, wherein each action in the discrete set of actions is represented by a respective point in a multi-dimensional space, and wherein selecting the action comprises:
processing the particular observation using an actor policy network having a plurality of parameters, wherein the actor policy network is a neural network that is configured to receive the particular observation and to process the particular observation to generate an ideal point in the multi-dimensional space in accordance with first values of the parameters of the actor policy network, wherein the ideal point does not represent any of the actions in the discrete set of actions;
determining, from the points that represent actions in the discrete set of actions, k points based on the ideal point, wherein k is an integer greater than one;
for each point of the k points:
processing the point and the particular observation using a Q network to generate a respective Q value for the action represented by the point, wherein the Q network is a neural network that has a plurality of parameters and that is configured to receive the particular observation and the point and to process the particular observation and the point to generate the respective Q value for the action represented by the point in accordance with first values of the parameters of the Q network; and
selecting, as the action to be performed by the agent, an action from the k actions represented by the k points based on the respective Q values for the k actions represented by the k points.
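The selection procedure recited in the claim can be sketched in code: an actor network maps the observation to an ideal point, the k action points nearest that ideal point are retrieved, each candidate is scored by a Q network, and the highest-scoring candidate is chosen. The sketch below is illustrative only, not the patented implementation; the names `actor_policy`, `q_network`, and `action_points` are hypothetical stand-ins (real systems would use trained neural networks and, at scale, approximate nearest-neighbour search).

```python
import numpy as np

# Hypothetical embedding: each of the discrete actions is represented
# by a point in a multi-dimensional space.
rng = np.random.default_rng(0)
NUM_ACTIONS, DIM = 100, 4
action_points = rng.normal(size=(NUM_ACTIONS, DIM))

def actor_policy(observation):
    """Stand-in for the actor policy network: maps an observation to an
    'ideal point' in the action space. The ideal point generally does
    not coincide with any actual action's point."""
    return np.tanh(observation[:DIM])

def q_network(observation, point):
    """Stand-in for the Q network: scores an (observation, action-point)
    pair. A trained network would be used in practice; this toy scorer
    just prefers points close to a projection of the observation."""
    return -float(np.linalg.norm(point - observation[:DIM]))

def select_action(observation, k=5):
    """Select an action per the claimed procedure: ideal point,
    k nearest action points, Q-value scoring, argmax."""
    ideal = actor_policy(observation)
    # Determine the k points nearest the ideal point (exact search here).
    dists = np.linalg.norm(action_points - ideal, axis=1)
    candidates = np.argsort(dists)[:k]
    # Score each candidate action with the Q network and pick the best.
    q_values = [q_network(observation, action_points[a]) for a in candidates]
    return int(candidates[int(np.argmax(q_values))])

obs = rng.normal(size=8)
chosen = select_action(obs, k=5)
```

Restricting the Q-network evaluation to k candidates (rather than all actions) is what makes the approach tractable for very large discrete action sets.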