CPC G10L 15/22 (2013.01) [G06F 40/284 (2020.01); G06F 40/40 (2020.01); G06V 10/774 (2022.01); G06V 10/82 (2022.01); G06V 20/50 (2022.01); G10L 13/02 (2013.01); G10L 15/063 (2013.01); G10L 15/16 (2013.01); G10L 2015/223 (2013.01)]

33 Claims

1. A method, performed by one or more computers, for controlling an agent interacting with an environment, the method comprising, at each of a plurality of time steps:
receiving an observation image characterizing a state of the environment at the time step;
receiving a natural language text sequence for the time step that characterizes a task being performed by the agent in the environment at the time step;
processing the observation image using an image embedding neural network to generate a plurality of image embeddings that represent the observation image;
processing the natural language text sequence using a text embedding neural network to generate a plurality of text embeddings that represent the natural language text sequence;
processing an input comprising the image embeddings, the text embeddings, and a set of one or more dedicated embeddings using a multi-modal Transformer neural network to generate an aggregated embedding, wherein the multi-modal Transformer neural network is configured to (i) apply self-attention over at least the text embeddings and the image embeddings to generate respective updated embeddings for at least the one or more dedicated embeddings and (ii) generate the aggregated embedding from at least the respective updated embeddings for the one or more dedicated embeddings;
selecting, using the aggregated embedding, one or more actions to be performed by the agent in response to the observation image; and
causing the agent to perform the one or more selected actions.
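To make the data flow recited in claim 1 concrete, the following is a minimal, hypothetical PyTorch sketch of the claimed pipeline. The class `AgentPolicy`, all dimensions, the patch-projection image embedder, the token-table text embedder, and the linear `action_head` are illustrative assumptions; the claim does not fix any particular architecture for the image embedding network, the text embedding network, or the action-selection step.

```python
# Hypothetical sketch of the pipeline in claim 1; names and sizes are assumptions.
import torch
import torch.nn as nn

class AgentPolicy(nn.Module):
    def __init__(self, d_model=256, num_actions=16, num_dedicated=1,
                 vocab_size=32000):
        super().__init__()
        # Image embedding neural network: a toy patch projector standing in for
        # any encoder that yields a plurality of image embeddings.
        self.image_embedder = nn.Linear(3 * 16 * 16, d_model)
        # Text embedding neural network: a token-embedding table as a stand-in.
        self.text_embedder = nn.Embedding(vocab_size, d_model)
        # Dedicated embeddings: learned vectors appended to the input sequence.
        self.dedicated = nn.Parameter(torch.randn(num_dedicated, d_model))
        # Multi-modal Transformer applying self-attention over all embeddings.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        # Assumed action head mapping the aggregated embedding to action logits.
        self.action_head = nn.Linear(d_model, num_actions)
        self.num_dedicated = num_dedicated

    def forward(self, image_patches, token_ids):
        # Embed each modality separately.
        img = self.image_embedder(image_patches)                     # [B, P, d]
        txt = self.text_embedder(token_ids)                          # [B, T, d]
        ded = self.dedicated.expand(img.size(0), -1, -1)             # [B, K, d]
        # (i) Self-attention over the text, image, and dedicated embeddings.
        updated = self.transformer(torch.cat([txt, img, ded], dim=1))
        # (ii) Aggregate from the updated dedicated embeddings (mean over K).
        aggregated = updated[:, -self.num_dedicated:, :].mean(dim=1)
        # Action logits for selecting one or more actions.
        return self.action_head(aggregated)

# Usage: greedy selection of a single action for one observation/text pair.
policy = AgentPolicy()
patches = torch.randn(1, 64, 3 * 16 * 16)        # flattened 16x16 RGB patches
tokens = torch.randint(0, 32000, (1, 12))        # tokenized task description
action = policy(patches, tokens).argmax(dim=-1)
```

The sketch uses a single learned dedicated embedding and mean-pools its updated value into the aggregated embedding; the claim's "set of one or more dedicated embeddings" equally covers multiple such vectors and other aggregation functions.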