US 12,444,182 B2
Training action selection neural networks using auxiliary tasks of controlling observation embeddings
Markus Wulfmeier, Balgheim (DE); Tim Hertweck, Lauchringen (DE); and Martin Riedmiller, Balgheim (DE)
Assigned to GDM Holding LLC, Mountain View, CA (US)
Appl. No. 18/016,746
Filed by DeepMind Technologies Limited, London (GB)
PCT Filed Jul. 27, 2021, PCT No. PCT/EP2021/071078
§ 371(c)(1), (2) Date Jan. 18, 2023,
PCT Pub. No. WO2022/023385, PCT Pub. Date Feb. 3, 2022.
Claims priority of provisional application 63/057,795, filed on Jul. 28, 2020.
Prior Publication US 2023/0290133 A1, Sep. 14, 2023
Int. Cl. G06V 10/82 (2022.01); G06V 10/70 (2022.01)
CPC G06V 10/82 (2022.01) [G06V 10/87 (2022.01)] 20 Claims
OG exemplary drawing
 
20. A system comprising:
one or more computers; and
one or more storage devices communicatively coupled to the one or more computers,
wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform operations for training an action selection neural network having a plurality of parameters that is used to select actions to be performed by an agent interacting with an environment,
wherein the action selection neural network is configured to process an input comprising an observation characterizing a state of the environment to generate an action selection output that comprises a respective action score for each action in a set of possible actions that can be performed by the agent, and select the action to be performed by the agent from the set of possible actions based on the action scores,
the operations comprising:
obtaining an observation characterizing a state of the environment at a time step;
processing the observation using an embedding model to generate a lower-dimensional embedding of the observation, wherein the lower-dimensional embedding of the observation has a plurality of dimensions;
determining an auxiliary task reward for the time step based on a value of a particular dimension of the embedding, wherein the auxiliary task reward corresponds to an auxiliary task of controlling the value of the particular dimension of the embedding;
determining an overall reward for the time step based at least in part on the auxiliary task reward for the time step; and
determining an update to values of the plurality of parameters of the action selection neural network based on the overall reward for the time step using a reinforcement learning technique.
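The claimed operations can be sketched in code. This is a minimal illustrative sketch, not the patent's implementation: the random-projection embedding model, the linear action selection network, the reward shapes, and the REINFORCE-style update are all assumptions introduced here to make the four steps concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: high-dimensional observation, lower-dimensional
# embedding, and a small discrete action set (all illustrative choices).
OBS_DIM, EMBED_DIM, NUM_ACTIONS = 32, 4, 3

# Stand-in "embedding model": a fixed random projection to fewer dimensions.
W_embed = rng.normal(size=(EMBED_DIM, OBS_DIM)) / np.sqrt(OBS_DIM)

# Stand-in "action selection neural network": a linear map producing a
# respective action score for each action in the set of possible actions.
theta = np.zeros((NUM_ACTIONS, OBS_DIM))


def embed(observation):
    """Lower-dimensional embedding of the observation."""
    return W_embed @ observation


def auxiliary_reward(embedding, dim, target=1.0):
    """Auxiliary task reward based on the value of one embedding dimension:
    the auxiliary task here is driving that dimension toward a target."""
    return -abs(float(embedding[dim]) - target)


def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()


def training_step(observation, task_reward, aux_dim, alpha=0.5, lr=0.01):
    """One training step following the claimed operations."""
    # 1. Process the observation to generate its lower-dimensional embedding.
    e = embed(observation)
    # 2. Auxiliary task reward from a particular dimension of the embedding.
    r_aux = auxiliary_reward(e, aux_dim)
    # 3. Overall reward based at least in part on the auxiliary task reward
    #    (here: a weighted sum with an external task reward).
    r = task_reward + alpha * r_aux
    # 4. Parameter update via a reinforcement learning technique
    #    (here: a single REINFORCE-style policy-gradient step).
    probs = softmax(theta @ observation)
    action = rng.choice(NUM_ACTIONS, p=probs)
    grad_log_pi = -np.outer(probs, observation)
    grad_log_pi[action] += observation
    theta_update = lr * r * grad_log_pi
    return r, theta_update
```

In practice the embedding model and policy would be trained neural networks and the update would come from an off-the-shelf RL algorithm; the sketch only mirrors the ordering of the four claimed operations.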