CPC: G10L 15/005 (2013.01) [G10L 15/07 (2013.01); G10L 15/16 (2013.01); G10L 2015/0631 (2013.01)]
18 Claims
1. A computer-implemented method of jointly performing speech recognition and language prediction using a sequence-to-sequence speech recognition model, the method, when executed on data processing hardware, causing the data processing hardware to perform operations comprising:
receiving audio data characterizing a spoken utterance;
processing, using the sequence-to-sequence speech recognition model, the audio data to generate, at each of a plurality of time steps:
a probability distribution over a predetermined set of linguistic units; and
a predicted language of the spoken utterance among multiple different languages the speech recognition model has been trained to recognize; and
providing, as an output from the sequence-to-sequence speech recognition model, a transcription of the utterance based on the probability distribution over the predetermined set of linguistic units and the predicted language generated at each of the plurality of time steps,
wherein the speech recognition model is trained using multi-task learning using:
a first objective function corresponding to grapheme prediction; and
a second objective function corresponding to a language or dialect classification cost, the first objective function and the second objective function being weighted such that the speech recognition model is trained to learn hidden representations that are effective for both language and dialect classification and grapheme prediction.
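
The claim recites a shared model that emits, at each time step, both a distribution over linguistic units (graphemes) and a language prediction, and that is trained with a weighted combination of a grapheme-prediction cost and a language or dialect classification cost. The following is a minimal sketch of that arrangement, not the patented implementation: it assumes a PyTorch bidirectional-LSTM encoder with two per-time-step output heads, and the module names, dimensions, interpolation weight lam, and the frame-level cross-entropy used as a stand-in for the grapheme-prediction cost (a sequence loss such as CTC would be typical in practice) are all illustrative assumptions.

    # Minimal sketch (illustrative only) of joint grapheme / language prediction
    # with a weighted multi-task loss over a shared hidden representation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class JointASRLangID(nn.Module):
        """Shared encoder with a grapheme head and a language-ID head per time step."""

        def __init__(self, n_mels=80, hidden=256, n_graphemes=75, n_languages=9):
            super().__init__()
            # Shared hidden representation (assumed: a 2-layer bidirectional LSTM).
            self.encoder = nn.LSTM(n_mels, hidden, num_layers=2,
                                   batch_first=True, bidirectional=True)
            # Per-time-step distribution over the predetermined set of linguistic units.
            self.grapheme_head = nn.Linear(2 * hidden, n_graphemes)
            # Per-time-step language / dialect classification.
            self.language_head = nn.Linear(2 * hidden, n_languages)

        def forward(self, features):            # features: (batch, time, n_mels)
            hidden, _ = self.encoder(features)  # (batch, time, 2 * hidden)
            return self.grapheme_head(hidden), self.language_head(hidden)


    def multitask_loss(grapheme_logits, language_logits,
                       grapheme_targets, language_targets, lam=0.3):
        """Weighted sum of the grapheme-prediction and language-classification costs.

        lam is an assumed interpolation weight; the claim only requires that the two
        objectives be weighted so the shared representation serves both tasks.
        """
        asr_loss = F.cross_entropy(grapheme_logits.transpose(1, 2), grapheme_targets)
        lid_loss = F.cross_entropy(language_logits.transpose(1, 2), language_targets)
        return (1.0 - lam) * asr_loss + lam * lid_loss

In such a sketch, the per-time-step language logits can be averaged (or otherwise pooled) over time to obtain an utterance-level language decision that conditions or rescores the transcription, consistent with the claim's use of the predicted language generated at each time step.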