US 12,277,493 B2
Selecting action slates using reinforcement learning
Peter Goran Sunehag, London (GB)
Assigned to DeepMind Technologies Limited, London (GB)
Filed by DeepMind Technologies Limited, Mountain View, CA (US)
Filed on May 18, 2020, as Appl. No. 16/876,866.
Application 16/876,866 is a continuation of application No. 15/367,094, filed on Dec. 1, 2016, granted, now Pat. No. 10,699,187.
Claims priority of provisional application 62/261,781, filed on Dec. 1, 2015.
Prior Publication US 2020/0279162 A1, Sep. 3, 2020
This patent is subject to a terminal disclaimer.
Int. Cl. G06N 3/08 (2023.01); G06N 3/088 (2023.01); G06F 16/9032 (2019.01)
CPC G06N 3/08 (2013.01) [G06N 3/088 (2013.01); G06F 16/90324 (2019.01)] 20 Claims
OG exemplary drawing
 
1. A method of providing a slate of actions to an action selector that interacts with an environment by selecting and performing actions, wherein the slate of actions includes a plurality of actions selected from a predetermined set of actions to fill a plurality of slots in an action slate, and wherein the environment transitions between states in response to actions performed by the action selector, the method comprising:
receiving an observation characterizing a current state of the environment;
dividing the plurality of slots into a plurality of subsets;
selecting actions to fill the slots in each subset, comprising, for a given subset of the plurality of subsets:
generating a plurality of candidate action slates for the given subset of slots, wherein each candidate action slate comprises a plurality of actions from the predetermined set of actions, wherein each candidate action slate for the given subset of slots has the same predetermined number of slots, with each slot being filled with a respective action from the predetermined set of actions, and wherein, for each candidate action slate, the slots are filled with a different combination of candidate actions from each of the other candidate action slates for the given subset of slots;
processing each candidate action slate using a deep neural network, wherein, for each candidate action slate, the deep neural network:
receives an input that comprises the plurality of actions in the candidate action slate and the observation, and
generates, as output, a slate Q value for the candidate action slate that is an estimate of a long-term reward resulting from providing the candidate action slate, comprising the plurality of actions, to the action selector in response to the observation;
selecting a candidate action slate from the plurality of candidate action slates based on the slate Q values generated as output by the deep neural network for the candidate action slates; and
selecting, as the actions in the slots in the given subset, the actions in the slots in the selected candidate action slate;
generating a final action slate, wherein the final action slate comprises the selected actions for the slots in each subset; and
providing the final action slate to the action selector in response to the observation.
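
To make the claimed control flow concrete, the following is a minimal sketch in Python/PyTorch of the per-subset loop in claim 1: enumerate candidate action slates for each subset of slots, score each candidate with a deep neural network that receives the observation together with the slate, keep the highest-scoring slate per subset, and assemble the final action slate. Every name here (SlateQNetwork, select_final_slate, the one-hot slate encoding, full enumeration via permutations) is an illustrative assumption, not drawn from the patent; the claim leaves the network architecture and the candidate-generation strategy open.

```python
import itertools
import torch
import torch.nn as nn

class SlateQNetwork(nn.Module):
    # Hypothetical architecture: the claim only requires a deep neural network
    # that maps (observation, candidate slate) to a scalar slate Q value.
    def __init__(self, obs_dim, num_actions, subset_size, hidden=64):
        super().__init__()
        self.num_actions = num_actions
        self.net = nn.Sequential(
            nn.Linear(obs_dim + subset_size * num_actions, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # scalar slate Q value
        )

    def forward(self, obs, slate):
        # Encode each action in the candidate slate as a one-hot vector and
        # concatenate with the observation (one possible input encoding).
        one_hots = [
            nn.functional.one_hot(torch.tensor(a), self.num_actions).float()
            for a in slate
        ]
        return self.net(torch.cat([obs] + one_hots)).squeeze(-1)

def select_final_slate(obs, q_net, action_set, slot_subsets):
    # Fill each subset of slots independently: enumerate candidate slates,
    # score them with the slate Q network, keep the highest-scoring one.
    # Assumes every subset has the same predetermined number of slots,
    # matching the fixed input width of q_net.
    final_slate = {}
    for subset in slot_subsets:
        candidates = list(itertools.permutations(action_set, len(subset)))
        with torch.no_grad():
            q_values = torch.stack([q_net(obs, c) for c in candidates])
        best = candidates[int(torch.argmax(q_values))]
        for slot, action in zip(subset, best):
            final_slate[slot] = action
    # The final action slate combines the selected actions across subsets,
    # ordered by slot index, and is provided to the action selector.
    return [final_slate[s] for s in sorted(final_slate)]

# Usage: 5 actions, 4 slots split into two 2-slot subsets.
obs = torch.randn(8)
q_net = SlateQNetwork(obs_dim=8, num_actions=5, subset_size=2)
print(select_final_slate(obs, q_net, action_set=range(5), slot_subsets=[(0, 1), (2, 3)]))
```

The sketch covers only the selection path recited in claim 1; how the slate Q network is trained (e.g., with a reinforcement-learning target over slates) is outside the scope of this claim and of the example above.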