| CPC G06N 3/08 (2013.01) [G06N 3/088 (2013.01); G06F 16/90324 (2019.01)] | 20 Claims |

|
1. A method of providing a slate of actions to an action selector that interacts with an environment by selecting and performing actions, wherein the slate of actions includes a plurality of actions selected from a predetermined set of actions to fill a plurality of slots in an action slate, and wherein the environment transitions states in response to actions performed by the action selector, the method comprising:
receiving an observation characterizing a current state of the environment;
dividing the plurality of slots into a plurality of subsets;
selecting actions to fill in each subset, comprising, for a given subset of the plurality of subsets:
generating a plurality of candidate action slates for the given subset of slots, wherein each candidate action slate comprises a plurality of actions from the predetermined set of actions, wherein each candidate action slate for the given subset of slots has a same, predetermined number of slots with each slot being filled with a respective action from the predetermined set of actions, and wherein, for each candidate action slate, the slots are filled with a different combination of candidate actions from each of other candidate action slates for the given subset of slots;
for each candidate action slate, processing the candidate action slate using a deep neural network, wherein, for each candidate action slate, the deep neural network:
receives an input that comprises the plurality of actions in the candidate action slate and the observation, and
generates, as output, a slate Q value for the candidate action slate that is an estimate of a long-term reward resulting from the candidate action slate comprising the plurality of actions being provided to the action selector in response to the observation;
selecting a candidate action slate from the plurality of candidate action slates based on the slate Q values generated as output by the deep neural network for the candidate action slates; and
selecting, as the actions in the slots in the given subset, the actions in the slots in the selected candidate action slate;
generating a final action slate, wherein the final action slate comprises the selected actions for the slots in each subset; and
providing the final action slate to the action selector in response to the observation.
|
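The selection procedure recited in claim 1 can be illustrated with a minimal sketch. The claim does not specify how the slots are divided, how candidate slates are enumerated, or the architecture of the deep neural network, so everything below is an assumption made for illustration: `split_slots` uses consecutive fixed-size subsets, `candidate_slates` takes the first `max_candidates` combinations of the action set, and `toy_slate_q` is a hypothetical stand-in for the network that maps (observation, candidate slate) to a slate Q value. All function names are invented here and do not come from the patent.

```python
from itertools import combinations, islice
from typing import Callable, List, Sequence, Tuple

Slate = Tuple[int, ...]
# Hypothetical signature for the slate Q estimator: (observation, slate) -> Q value.
SlateQ = Callable[[Sequence[float], Slate], float]


def split_slots(num_slots: int, subset_size: int) -> List[List[int]]:
    """Divide slot indices into consecutive subsets (one assumed division scheme)."""
    slots = list(range(num_slots))
    return [slots[i:i + subset_size] for i in range(0, num_slots, subset_size)]


def candidate_slates(actions: Sequence[int], size: int, max_candidates: int) -> List[Slate]:
    """Return up to max_candidates slates, each a different combination of `size` actions."""
    return list(islice(combinations(actions, size), max_candidates))


def build_final_slate(
    observation: Sequence[float],
    actions: Sequence[int],
    num_slots: int,
    subset_size: int,
    slate_q: SlateQ,
    max_candidates: int = 50,
) -> List[int]:
    """Fill each subset of slots with the candidate slate that has the highest slate Q value."""
    final_slate: List[int] = []
    for subset in split_slots(num_slots, subset_size):
        candidates = candidate_slates(actions, len(subset), max_candidates)
        # Score every candidate against the same observation and keep the best one.
        best = max(candidates, key=lambda slate: slate_q(observation, slate))
        final_slate.extend(best)
    return final_slate


def toy_slate_q(observation: Sequence[float], slate: Slate) -> float:
    """Toy stand-in for the deep neural network: sums per-action scores from the observation."""
    return sum(observation[action] for action in slate)


if __name__ == "__main__":
    obs = [0.1, 0.9, 0.4, 0.7, 0.2]  # made-up observation, one score per action
    print(build_final_slate(obs, range(5), num_slots=4, subset_size=2, slate_q=toy_slate_q))
```

Because this sketch scores each subset independently against the same observation and the same action set, the same actions can be chosen for more than one subset. Whether the final slate should avoid such repeats is not addressed by the claim text and would be a further design decision.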