US 12,236,192 B2
Task-specific text generation based on multimodal inputs
Xudong Lin, New York, NY (US); Gediminas Bertasius, Boston, MA (US); Jue Wang, Cambridge, MA (US); Devi Niru Parikh, Menlo Park, CA (US); and Lorenzo Torresani, Norwich, VT (US)
Assigned to Meta Platforms, Inc., Menlo Park, CA (US)
Filed by Meta Platforms, Inc., Menlo Park, CA (US)
Filed on Jun. 4, 2021, as Appl. No. 17/339,759.
Claims priority of provisional application 63/135,456, filed on Jan. 8, 2021.
Prior Publication US 2022/0222435 A1, Jul. 14, 2022
Int. Cl. G06F 40/284 (2020.01); G06F 40/30 (2020.01); G06N 3/084 (2023.01); G06N 7/01 (2023.01)
CPC G06F 40/284 (2020.01) [G06F 40/30 (2020.01); G06N 3/084 (2013.01); G06N 7/01 (2023.01)]
20 Claims
1. A method comprising, by a computing device:
accessing at least one first set of tokens associated with a desired task and one or more modalities associated with a context of the desired task;
determining, for the one or more modalities, a second set of tokens using a classifier network associated with at least one modality;
generating a plurality of embedding vectors comprising a first set of embedding vectors mapped to the at least one first set of tokens and a second set of embedding vectors mapped to the second set of tokens, the at least one first set of tokens and the second set of tokens associated with the one or more modalities, wherein the first set of embedding vectors and the second set of embedding vectors are different and are mapped to an embedding space; and
producing a sequence of words addressing the desired task based on determining probability distributions of the words to determine whether to select the words for the sequence and based on processing the plurality of embedding vectors with an encoder-decoder network.
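
The data flow recited in claim 1 can be illustrated with a minimal sketch. This is not the patented implementation: the vocabulary, the label set, the 4-value feature vector, the random untrained weights, the mean-pooling encoder, and the greedy decoder are all assumptions chosen to keep the example small, and every name (classify_modality, embed, encode, decode, generate) is hypothetical.

```python
import math
import random

random.seed(0)  # toy, untrained weights; outputs carry no meaning

# Hypothetical vocabulary and classifier label set (not from the patent).
VOCAB = ["<bos>", "<eos>", "caption", ":", "dog", "ball", "run", "park", "a", "in"]
LABELS = ["dog", "ball", "run", "park"]
DIM = 8        # size of the shared embedding space
FEATURES = 4   # size of the stand-in modality feature vector


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


EMBED = {tok: [random.uniform(-1, 1) for _ in range(DIM)] for tok in VOCAB}
CLASSIFIER_W = rand_matrix(len(LABELS), FEATURES)
OUTPUT_W = rand_matrix(len(VOCAB), DIM)


def classify_modality(features, top_k=2):
    """Stand-in classifier network: modality features -> second set of tokens."""
    probs = softmax(matvec(CLASSIFIER_W, features))
    ranked = sorted(range(len(LABELS)), key=lambda i: probs[i], reverse=True)
    return [LABELS[i] for i in ranked[:top_k]]


def embed(tokens):
    """Map tokens of either set to vectors in one shared embedding space."""
    return [EMBED[t] for t in tokens]


def encode(vectors):
    """Toy encoder: mean-pool all embedding vectors into a context vector."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def decode(context, max_len=5):
    """Toy decoder: at each step, build a probability distribution over
    VOCAB and greedily select the most probable word."""
    words, prev = [], EMBED["<bos>"]
    for _ in range(max_len):
        state = [c + p for c, p in zip(context, prev)]
        probs = softmax(matvec(OUTPUT_W, state))
        word = VOCAB[max(range(len(VOCAB)), key=lambda i: probs[i])]
        if word == "<eos>":
            break
        words.append(word)
        prev = EMBED[word]
    return words


def generate(task_tokens, features):
    modality_tokens = classify_modality(features)            # second set of tokens
    vectors = embed(task_tokens) + embed(modality_tokens)    # plurality of embeddings
    return decode(encode(vectors))                           # sequence of words


if __name__ == "__main__":
    print(generate(["caption", ":"], [0.2, -0.1, 0.7, 0.4]))
```

Because the weights are random, the printed words are arbitrary. The sketch only shows the order of operations: task tokens and classifier-derived tokens are embedded into one space, and an encoder-decoder turns those embeddings into words by selecting from per-step probability distributions.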