CPC G06V 10/82 (2022.01) [G06V 10/87 (2022.01)]    20 Claims

20. A system comprising:
one or more computers; and
one or more storage devices communicatively coupled to the one or more computers,
wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform operations for training an action selection neural network having a plurality of parameters that is used to select actions to be performed by an agent interacting with an environment, wherein the action selection neural network is configured to process an input comprising an observation characterizing a state of the environment to generate an action selection output that comprises a respective action score for each action in a set of possible actions that can be performed by the agent, and select the action to be performed by the agent from the set of possible actions based on the action scores, the operations comprising:
obtaining an observation characterizing a state of the environment at a time step;
processing the observation using an embedding model to generate a lower-dimensional embedding of the observation, wherein the lower-dimensional embedding of the observation has a plurality of dimensions;
determining an auxiliary task reward for the time step based on a value of a particular dimension of the embedding, wherein the auxiliary task reward corresponds to an auxiliary task of controlling the value of the particular dimension of the embedding;
determining an overall reward for the time step based at least in part on the auxiliary task reward for the time step; and
determining an update to values of the plurality of parameters of the action selection neural network based on the overall reward for the time step using a reinforcement learning technique.
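The sketch below is one hypothetical way the operations recited in claim 20 could be carried out; it is not the claimed implementation. The module names (EmbeddingModel, ActionSelectionNetwork), the use of PyTorch, the choice of a REINFORCE-style update as the "reinforcement learning technique," and the additive reward combination are all illustrative assumptions, since the claim covers any embedding model, reward combination, and reinforcement learning technique.

```python
# Hypothetical sketch of one time step of the operations in claim 20.
# All names and the REINFORCE-style update are assumptions, not the claimed method.
import torch
import torch.nn as nn


class EmbeddingModel(nn.Module):
    """Maps an observation to a lower-dimensional embedding (assumed architecture)."""
    def __init__(self, obs_dim: int, embed_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, embed_dim))

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)


class ActionSelectionNetwork(nn.Module):
    """Produces a respective score for each action in the set of possible actions."""
    def __init__(self, obs_dim: int, num_actions: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, num_actions))

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)  # unnormalized action scores


def training_step(obs, action, task_reward,
                  policy, embedder, optimizer,
                  aux_dim: int = 0, aux_weight: float = 0.1):
    """One time step of the claimed operations (hypothetical REINFORCE variant)."""
    # (1) process the observation with the embedding model to get a
    #     lower-dimensional embedding
    with torch.no_grad():
        embedding = embedder(obs)

    # (2) auxiliary task reward based on the value of one particular embedding
    #     dimension; rewarding a larger value is one assumed way of "controlling" it
    aux_reward = embedding[aux_dim].item()

    # (3) overall reward based at least in part on the auxiliary task reward
    #     (here simply added to the task reward with a fixed weight)
    overall_reward = task_reward + aux_weight * aux_reward

    # (4) reinforcement learning update to the action selection network parameters
    scores = policy(obs)
    log_prob = torch.log_softmax(scores, dim=-1)[action]
    loss = -overall_reward * log_prob  # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return overall_reward
```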