US 12,288,549 B2
Spoken query processing for image search
Ajay Jain, San Jose, CA (US); Sanjeev Tagra, Redmond, WA (US); Sachin Soni, New Delhi (IN); Ryan Rozich, Austin, TX (US); Nikaash Puri, New Delhi (IN); and Jonathan Roeder, San Jose, CA (US)
Assigned to Adobe Inc., San Jose, CA (US)
Filed by Adobe Inc., San Jose, CA (US)
Filed on Aug. 15, 2022, as Appl. No. 17/887,959.
Prior Publication US 2024/0054991 A1, Feb. 15, 2024
Int. Cl. G10L 15/06 (2013.01); G06F 3/16 (2006.01); G06F 16/532 (2019.01); G06F 40/284 (2020.01); G06F 40/30 (2020.01); G06V 10/774 (2022.01); G10L 15/183 (2013.01); G10L 15/22 (2006.01)

CPC G10L 15/063 (2013.01) [G06F 3/167 (2013.01); G06F 16/532 (2019.01); G06F 40/284 (2020.01); G06F 40/30 (2020.01); G06V 10/7747 (2022.01); G10L 15/183 (2013.01); G10L 15/22 (2013.01)]

19 Claims

1. One or more computer storage media storing computer-useable instructions that, when used by a computing device, cause the computing device to perform operations, the operations comprising:

training a spoken language model using audio data to provide a trained spoken language model, wherein the spoken language model comprises a first model and a second model, and wherein the spoken language model is trained by:

in a first stage, training the first model on the audio data without training the second model, and

in a second stage after training the first model in the first stage, combining the first model and the second model and training the second model on the audio data; and

after training the spoken language model, jointly training, using a training dataset comprising a plurality of spoken queries and one or more images associated with each spoken query, the trained spoken language model and an image processing model to provide a multi-modal model comprising a retrained spoken language model and a trained image processing model that generate a relevance score for an input spoken query processed by the retrained spoken language model and an input image processed by the trained image processing model.
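
The training order recited in claim 1 can be illustrated with a toy sketch. This is not the patented implementation: each model is reduced to a single scalar weight, the loss is squared error, the first model is assumed to be held fixed during the second stage, and the relevance score is assumed to be the product of the two one-dimensional embeddings. The claim specifies none of these choices, and every name below (`Scalar`, `train_stage_one`, `train_stage_two`, `joint_train`, `relevance`) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Scalar:
    """One-parameter stand-in for a neural network (hypothetical toy model)."""
    w: float

    def __call__(self, x: float) -> float:
        return self.w * x


def sgd_step(models, grads, lr=0.05):
    """Update only the models passed in; any model left out stays untouched."""
    for model, grad in zip(models, grads):
        model.w -= lr * grad


def train_stage_one(first, audio, epochs=200):
    """First stage: train the first model on audio data; the second model is not involved."""
    for _ in range(epochs):
        for x, first_target, _ in audio:
            err = first(x) - first_target
            sgd_step([first], [2 * err * x])


def train_stage_two(first, second, audio, epochs=200):
    """Second stage: stack the second model on the first and train the second model.

    Assumption: the first model is kept fixed here; the claim does not say so.
    """
    for _ in range(epochs):
        for x, _, second_target in audio:
            h = first(x)
            err = second(h) - second_target
            sgd_step([second], [2 * err * h])


def relevance(first, second, image_model, query, image):
    """Relevance score as a product of the two embeddings (an assumption)."""
    return second(first(query)) * image_model(image)


def joint_train(first, second, image_model, pairs, epochs=200):
    """Joint stage: both branches receive gradient from the relevance loss."""
    for _ in range(epochs):
        for query, image, label in pairs:
            h = first(query)
            s = second(h)
            v = image_model(image)
            err = s * v - label
            grads = [
                2 * err * v * second.w * query,  # d(loss)/d(first.w)
                2 * err * v * h,                 # d(loss)/d(second.w)
                2 * err * s * image,             # d(loss)/d(image_model.w)
            ]
            sgd_step([first, second, image_model], grads)


# Made-up data: (audio feature, first-model target, second-model target).
audio_data = [(1.0, 2.0, 1.0), (0.5, 1.0, 0.5)]
# Made-up data: (spoken query feature, image feature, relevance label).
query_image_pairs = [(1.0, 1.0, 1.0), (1.0, -1.0, -1.0)]

first_model, second_model, image_model = Scalar(0.1), Scalar(0.1), Scalar(0.1)
train_stage_one(first_model, audio_data)
train_stage_two(first_model, second_model, audio_data)
joint_train(first_model, second_model, image_model, query_image_pairs)
score = relevance(first_model, second_model, image_model, 1.0, 1.0)
```

In this sketch the staging is enforced purely by which models are handed to `sgd_step`. In a real framework the same effect would typically come from freezing parameters or from choosing which parameters the optimizer receives.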