CPC G10L 15/22 (2013.01) [G06F 40/284 (2020.01); G06F 40/40 (2020.01); G06V 10/774 (2022.01); G06V 10/82 (2022.01); G06V 20/50 (2022.01); G10L 13/02 (2013.01); G10L 15/063 (2013.01); G10L 15/16 (2013.01); G10L 2015/223 (2013.01)]

33 Claims

1. A method, performed by one or more computers, for controlling an agent interacting with an environment, the method comprising, at each of a plurality of time steps:
receiving an observation image characterizing a state of the environment at the time step;
receiving a natural language text sequence for the time step that characterizes a task being performed by the agent in the environment at the time step;
processing the observation image using an image embedding neural network to generate a plurality of image embeddings that represent the observation image;
processing the natural language text sequence using a text embedding neural network to generate a plurality of text embeddings that represent the natural language text sequence;
processing an input comprising the image embeddings, the text embeddings, and a set of one or more dedicated embeddings using a multi-modal Transformer neural network to generate an aggregated embedding, wherein the multi-modal Transformer neural network is configured to (i) apply self-attention over at least the text embeddings and the image embeddings to generate respective updated embeddings for at least the one or more dedicated embeddings and (ii) generate the aggregated embedding from at least the respective updated embeddings for the one or more dedicated embeddings;
selecting, using the aggregated embedding, one or more actions to be performed by the agent in response to the observation image; and
causing the agent to perform the one or more selected actions.
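To make the data flow recited in claim 1 concrete, the following is a minimal, hypothetical PyTorch sketch of the claimed pipeline. The class `AgentPolicy`, all dimensions, the patch-projection image embedder, the token-table text embedder, and the linear `action_head` are illustrative assumptions; the claim does not fix any particular architecture for the image embedding network, the text embedding network, or the action-selection step.

```python
# Hypothetical sketch of the pipeline in claim 1; names and sizes are assumptions.
import torch
import torch.nn as nn

class AgentPolicy(nn.Module):
    def __init__(self, d_model=256, num_actions=16, num_dedicated=1,
                 vocab_size=32000):
        super().__init__()
        # Image embedding neural network: a toy patch projector standing in for
        # any encoder that yields a plurality of image embeddings.
        self.image_embedder = nn.Linear(3 * 16 * 16, d_model)
        # Text embedding neural network: a token-embedding table as a stand-in.
        self.text_embedder = nn.Embedding(vocab_size, d_model)
        # Dedicated embeddings: learned vectors appended to the input sequence.
        self.dedicated = nn.Parameter(torch.randn(num_dedicated, d_model))
        # Multi-modal Transformer applying self-attention over all embeddings.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        # Assumed action head mapping the aggregated embedding to action logits.
        self.action_head = nn.Linear(d_model, num_actions)
        self.num_dedicated = num_dedicated

    def forward(self, image_patches, token_ids):
        # Embed each modality separately.
        img = self.image_embedder(image_patches)                     # [B, P, d]
        txt = self.text_embedder(token_ids)                          # [B, T, d]
        ded = self.dedicated.expand(img.size(0), -1, -1)             # [B, K, d]
        # (i) Self-attention over the text, image, and dedicated embeddings.
        updated = self.transformer(torch.cat([txt, img, ded], dim=1))
        # (ii) Aggregate from the updated dedicated embeddings (mean over K).
        aggregated = updated[:, -self.num_dedicated:, :].mean(dim=1)
        # Action logits for selecting one or more actions.
        return self.action_head(aggregated)

# Usage: greedy selection of a single action for one observation/text pair.
policy = AgentPolicy()
patches = torch.randn(1, 64, 3 * 16 * 16)        # flattened 16x16 RGB patches
tokens = torch.randint(0, 32000, (1, 12))        # tokenized task description
action = policy(patches, tokens).argmax(dim=-1)
```

The sketch uses a single learned dedicated embedding and mean-pools its updated value into the aggregated embedding; the claim's "set of one or more dedicated embeddings" equally covers multiple such vectors and other aggregation functions.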