CPC G10L 15/063 (2013.01) [G10L 13/02 (2013.01); G10L 15/16 (2013.01); G10L 15/22 (2013.01); G10L 2015/0635 (2013.01)]
18 Claims

1. A computer-implemented method that, when executed by data processing hardware, causes the data processing hardware to perform operations comprising:
receiving a set of training utterances, each training utterance in the set of training utterances comprising a non-synthetic speech representation of a corresponding utterance;
for each training utterance in the set of training utterances, converting, using a text-to-speech (TTS) model, a ground-truth transcription of the corresponding utterance into one or more synthetic speech representations of the same corresponding utterance;
receiving audio data corresponding to an utterance by:
receiving one of the non-synthetic speech representations of the corresponding utterance; or
receiving one of the one or more synthetic speech representations of the corresponding utterance;
generating, using a data augmentation module, a pair of positive audio data examples, each positive audio data example in the pair of positive audio data examples comprising a respective augmented copy of the received audio data corresponding to the utterance;
for each respective positive audio data example in the pair of positive audio data examples:
generating, using a neural network encoder, a respective sequence of encoder outputs; and
projecting, using a convolutional neural network (CNN), the respective sequence of encoder outputs for the respective positive audio data example into a contrastive loss space;
determining an L2 distance between each pair of corresponding encoder outputs in the projected respective sequences of encoder outputs for the pair of positive audio data examples;
determining a per-utterance consistency loss by averaging the set of L2 distances determined across the projected respective sequences of encoder outputs;
generating, using a speech recognition model, corresponding speech recognition results for each respective positive audio data example in the pair of positive audio data examples; and
updating parameters of the speech recognition model based on a respective supervised loss term associated with each corresponding speech recognition result and the per-utterance consistency loss.
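A minimal sketch, assuming PyTorch, of the consistency-loss operations recited in claim 1. The callables `augment`, `encoder`, and `projection_cnn` are hypothetical stand-ins for the claimed data augmentation module, neural network encoder, and convolutional neural network (CNN) projection; the claim does not prescribe any particular implementation of them.

```python
import torch

def per_utterance_consistency_loss(audio, augment, encoder, projection_cnn):
    """Per-utterance consistency loss for one received audio example."""
    # Pair of positive audio data examples: two independently augmented
    # copies of the same received audio.
    x_a, x_b = augment(audio), augment(audio)

    # For each positive example, generate a sequence of encoder outputs
    # and project it into the contrastive loss space with the CNN.
    # Assumed shapes: (time, dim) -> (time, proj_dim).
    z_a = projection_cnn(encoder(x_a))
    z_b = projection_cnn(encoder(x_b))

    # L2 distance between each pair of corresponding (time-aligned)
    # projected encoder outputs, one distance per frame.
    distances = torch.linalg.norm(z_a - z_b, dim=-1)

    # Average the per-frame L2 distances into one per-utterance loss.
    return distances.mean()
```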
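In the same spirit, a sketch of one parameter update combining the respective supervised loss terms with the per-utterance consistency loss. Here `recognizer` and `asr_loss` are hypothetical stand-ins for the speech recognition model and its supervised objective (for example, a CTC or RNN-T loss), and the weight `lam` is an assumption; the claim requires only that the update be based on both kinds of loss term.

```python
import torch

def training_step(audio, transcript, augment, encoder, projection_cnn,
                  recognizer, asr_loss, optimizer, lam=1.0):
    # Pair of positive audio data examples: two independently augmented
    # copies of the same received audio. The same pair feeds both the
    # supervised loss terms and the consistency loss.
    x_a, x_b = augment(audio), augment(audio)

    # Per-utterance consistency loss over the projected encoder outputs,
    # as in the previous sketch.
    z_a = projection_cnn(encoder(x_a))
    z_b = projection_cnn(encoder(x_b))
    consistency = torch.linalg.norm(z_a - z_b, dim=-1).mean()

    # Respective supervised loss term for the speech recognition result
    # generated for each positive example, scored against the
    # ground-truth transcription.
    supervised = (asr_loss(recognizer(x_a), transcript)
                  + asr_loss(recognizer(x_b), transcript))

    # Update parameters based on both loss terms; lam is an assumed
    # weighting not specified by the claim.
    loss = supervised + lam * consistency
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```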