US 12,147,899 B2
Training action selection neural networks using look-ahead search
Karen Simonyan, London (GB); David Silver, Hitchin (GB); and Julian Schrittwieser, London (GB)
Assigned to DeepMind Technologies Limited, London (GB)
Filed by DeepMind Technologies Limited, London (GB)
Filed on Dec. 4, 2023, as Appl. No. 18/528,640.
Application 18/528,640 is a continuation of application No. 17/948,016, filed on Sep. 19, 2022, granted, now 11,836,625.
Application 17/948,016 is a continuation of application No. 16/617,478, granted, now 11,449,750, issued on Sep. 20, 2022, previously published as PCT/EP2018/063869, filed on May 28, 2018.
Claims priority of provisional application 62/511,945, filed on May 26, 2017.
Prior Publication US 2024/0185070 A1, Jun. 6, 2024
Int. Cl. G06N 3/08 (2023.01); G06N 7/01 (2023.01)

CPC G06N 3/08 (2013.01) [G06N 7/01 (2023.01)]

20 Claims

1. A method of selecting, using a neural network, actions to be performed by an agent interacting with an environment to perform a task in an attempt to achieve a specified result,

wherein the neural network has a plurality of network parameters and is configured to receive an input observation characterizing a state of the environment and to process the input observation in accordance with the network parameters to generate a network output that comprises an action selection output that defines an action selection policy for selecting an action to be performed by the agent in response to the input observation, and

wherein the method comprises:

receiving a current observation characterizing a current state of the environment;

determining a target action selection output for the current observation by performing, using the neural network and in accordance with current values of the network parameters, a look ahead search of possible future states of the environment starting from the current state until the environment reaches a possible future state that satisfies one or more termination criteria, wherein the look ahead search is a tree search of a state tree having nodes representing states of the environment starting from a root node that represents the current state, and wherein performing the look ahead search comprises adding noise to prior probabilities for the root node that are used to traverse from the root node to other nodes in the state tree; and

selecting an action to be performed by the agent in response to the current observation using the target action selection output generated by performing the look ahead search.
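
As an illustration only, the root-noise step and the final action selection recited in claim 1 might be sketched as follows. The claim does not specify the noise distribution, the mixing weight, or how the search result becomes a target action selection output. The Dirichlet noise, the `epsilon` and `alpha` defaults, the use of normalized visit counts as the target, and all function names below are assumptions made for this sketch and are not taken from the claim text.

```python
import random


def add_root_noise(priors, epsilon=0.25, alpha=0.3, rng=None):
    """Mix root prior probabilities with noise (assumed here: Dirichlet noise).

    priors  : prior probabilities for the root node's actions, summing to 1.
    epsilon : assumed mixing weight given to the noise.
    alpha   : assumed Dirichlet concentration parameter.
    """
    rng = rng or random.Random()
    # Sample Dirichlet(alpha, ..., alpha) by normalizing independent Gamma draws.
    gammas = [rng.gammavariate(alpha, 1.0) for _ in priors]
    total = sum(gammas)
    if total == 0.0:
        # Degenerate draw: fall back to uniform noise.
        noise = [1.0 / len(priors)] * len(priors)
    else:
        noise = [g / total for g in gammas]
    return [(1.0 - epsilon) * p + epsilon * n for p, n in zip(priors, noise)]


def target_from_visit_counts(visit_counts):
    """Assumed form of the target output: normalized root visit counts."""
    total = sum(visit_counts)
    return [c / total for c in visit_counts]


def select_action(target):
    """Pick the index of the action with the highest target probability."""
    return max(range(len(target)), key=lambda a: target[a])


if __name__ == "__main__":
    rng = random.Random(0)
    noisy = add_root_noise([0.5, 0.3, 0.2], rng=rng)
    print("noisy root priors:", noisy)
    # Hypothetical visit counts that a tree search might have accumulated.
    target = target_from_visit_counts([10, 25, 5])
    print("selected action:", select_action(target))
```

In this sketch the noisy priors would be used only when traversing from the root node to its children; priors at deeper nodes are left unchanged, in line with the claim language.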