US 12,230,249 B2
Supervised and unsupervised training with contrastive loss over sequences
Andrew Rosenberg, Brooklyn, NY (US); Bhuvana Ramabhadran, Mt. Kisco, NY (US); Zhehuai Chen, Edgewater, NJ (US); Yuan Wang, Brooklyn, NY (US); Yu Zhang, Mountain View, CA (US); and Jesse Emond, Mountain View, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Mar. 22, 2022, as Appl. No. 17/655,903.
Claims priority of provisional application 63/166,908, filed on Mar. 26, 2021.
Prior Publication US 2022/0310065 A1, Sep. 29, 2022
Int. Cl. G10L 15/06 (2013.01); G10L 13/02 (2013.01); G10L 15/16 (2006.01); G10L 15/22 (2006.01)
CPC G10L 15/063 (2013.01) [G10L 13/02 (2013.01); G10L 15/16 (2013.01); G10L 15/22 (2013.01); G10L 2015/0635 (2013.01)] 18 Claims
OG exemplary drawing
 
1. A computer-implemented method that, when executed by data processing hardware, causes the data processing hardware to perform operations comprising:
receiving a set of training utterances, each training utterance in the set of training utterances comprising a non-synthetic speech representation of a corresponding utterance;
for each training utterance in the set of training utterances, converting, using a text-to-speech (TTS) model, a ground-truth transcription of the corresponding utterance to generate one or more synthetic speech representations of the same corresponding utterance;
receiving audio data corresponding to an utterance by:
receiving one of the non-synthetic speech representations of the corresponding utterance; or
receiving one of the one or more synthetic speech representations of the corresponding utterance;
generating, using a data augmentation module, a pair of positive audio data examples, each positive audio data example in the pair of positive audio data examples comprising a respective augmented copy of the received audio data corresponding to the utterance;
for each respective positive audio data example in the pair of positive audio data examples:
generating, using a neural network encoder, a respective sequence of encoder outputs; and
projecting, using a convolutional neural network (CNN), the respective sequence of encoder outputs for the respective positive audio data example into a contrastive loss space;
determining an L2 distance between each corresponding encoder output in each projected respective sequence of encoder outputs for the pair of positive audio data examples;
determining a per-utterance consistency loss by averaging the set of L2 distances determined between the corresponding encoder outputs in the projected respective sequences of encoder outputs;
generating, using a speech recognition model, corresponding speech recognition results for each respective positive audio data example in the pair of positive audio data examples; and
updating parameters of the speech recognition model based on a respective supervised loss term associated with each corresponding speech recognition result and the per-utterance consistency loss.
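The per-utterance consistency loss recited in claim 1 can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch example, not the patented implementation: the module names, tensor shapes, and the 1-D convolution used as the projection network are assumptions chosen for illustration. It takes the encoder output sequences for the two augmented copies of the same utterance, projects each into a shared space with a CNN, computes the frame-wise L2 distance between corresponding projected outputs, and averages over the sequence.

```python
# Minimal sketch under assumed shapes and module names; not the patented implementation.
import torch
import torch.nn as nn

class ConsistencyHead(nn.Module):
    """Projects encoder output sequences into a contrastive-loss space with a 1-D CNN."""
    def __init__(self, encoder_dim: int = 512, proj_dim: int = 128):
        super().__init__()
        self.proj = nn.Conv1d(encoder_dim, proj_dim, kernel_size=3, padding=1)

    def forward(self, encoder_outputs: torch.Tensor) -> torch.Tensor:
        # encoder_outputs: (batch, time, encoder_dim)
        x = encoder_outputs.transpose(1, 2)        # (batch, encoder_dim, time)
        return self.proj(x).transpose(1, 2)        # (batch, time, proj_dim)

def per_utterance_consistency_loss(enc_a: torch.Tensor,
                                   enc_b: torch.Tensor,
                                   head: ConsistencyHead) -> torch.Tensor:
    """Average frame-wise L2 distance between the two projected sequences.

    enc_a, enc_b: encoder outputs for the pair of positive audio data examples
    (two augmented copies of the same utterance), each (batch, time, encoder_dim).
    """
    proj_a = head(enc_a)
    proj_b = head(enc_b)
    # L2 distance between corresponding encoder outputs at each frame.
    frame_dist = torch.linalg.vector_norm(proj_a - proj_b, dim=-1)  # (batch, time)
    # Average over the sequence to obtain the per-utterance consistency loss.
    return frame_dist.mean(dim=-1)  # (batch,)

# Illustrative combined objective (weighting is an assumption, not from the claim):
# total_loss = supervised_asr_loss + consistency_weight * per_utterance_loss.mean()
```

In this sketch the consistency term is simply added to the supervised speech-recognition loss before the parameter update, mirroring the final step of claim 1; the relative weighting of the two terms is an illustrative choice, not specified by the claim.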