| CPC G10L 15/063 (2013.01) [G06F 3/167 (2013.01); G06F 16/532 (2019.01); G06F 40/284 (2020.01); G06F 40/30 (2020.01); G06V 10/7747 (2022.01); G10L 15/183 (2013.01); G10L 15/22 (2013.01)] | 19 Claims |

|
1. One or more computer storage media storing computer-useable instructions that, when used by a computing device, cause the computing device to perform operations, the operations comprising:
training a spoken language model using audio data to provide a trained spoken language model, wherein the spoken language model comprises a first model and a second model, and wherein the spoken language model is trained by:
in a first stage, training the first model on the audio data without training the second model, and
in a second stage after training the first model in the first stage, combining the first model and the second model and training the second model on the audio data; and
after training the spoken language model, jointly training, using a training dataset comprising a plurality of spoken queries and one or more images associated with each spoken query, the trained spoken language model and an image processing model to provide a multi-modal model comprising a retrained spoken language model and a trained image processing model that generate a relevance score for an input spoken query processed by the retrained spoken language model and an input image processed by the trained image processing model.
|
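The training order recited in claim 1 can be illustrated with a minimal sketch. It is a toy illustration, not the claimed implementation: each model is reduced to a single scalar weight (`a` for the first model, `b` for the second, `c` for the image processing model), the regression targets and the logistic relevance score are hypothetical, and holding the first model fixed during the second stage is an assumption that the claim does not state.

```python
import math

def train_stage1(a, audio, lr=0.05, epochs=200):
    # Stage 1: update only the first model (weight a); the second model is not used.
    for _ in range(epochs):
        for x, target in audio:
            a -= lr * 2 * (a * x - target) * x
    return a

def train_stage2(a, b, audio, lr=0.05, epochs=200):
    # Stage 2: combine both models and update only the second model (weight b).
    # Assumption: the first model's weight a stays fixed here.
    for _ in range(epochs):
        for x, target in audio:
            h = a * x  # output of the first model feeds the second
            b -= lr * 2 * (b * h - target) * h
    return b

def relevance(a, b, c, query, image):
    # Hypothetical relevance score: sigmoid of the product of the two embeddings.
    speech_emb = b * (a * query)
    image_emb = c * image
    return 1.0 / (1.0 + math.exp(-speech_emb * image_emb))

def train_joint(a, b, c, pairs, lr=0.1, epochs=50):
    # Joint training: the spoken language model (a, b) and the image model (c)
    # are all updated together on (spoken query, image, label) triples.
    for _ in range(epochs):
        for query, image, label in pairs:
            err = relevance(a, b, c, query, image) - label  # logistic-loss gradient
            qi = query * image
            ga, gb, gc = err * b * c * qi, err * a * c * qi, err * a * b * qi
            a, b, c = a - lr * ga, b - lr * gb, c - lr * gc
    return a, b, c

# Toy "audio data": scalar features with made-up targets for each stage.
xs = [-1.0, -0.5, 0.5, 1.0]
a = train_stage1(0.0, [(x, 2 * x) for x in xs])
b = train_stage2(a, 0.0, [(x, 6 * x) for x in xs])

# Toy spoken-query/image pairs: label 1 when the pair matches, 0 otherwise.
pairs = [(1.0, 1.0, 1), (-1.0, -1.0, 1), (1.0, -1.0, 0), (-1.0, 1.0, 0)]
a2, b2, c2 = train_joint(a, b, 0.1, pairs)

print(relevance(a2, b2, c2, 1.0, 1.0), relevance(a2, b2, c2, 1.0, -1.0))
```

In this sketch the weights `a2` and `b2` play the role of the retrained spoken language model and `c2` the trained image processing model. A matching query and image should receive a score near 1, and a mismatched pair a score near 0.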