US 12,315,499 B2
Semi-supervised training scheme for speech recognition
Soheil Khorram, Redwood City, CA (US); Anshuman Tripathi, Mountain View, CA (US); Kim Jaeyoung, Cupertino, CA (US); Han Lu, Redmond, WA (US); Qian Zhang, Mountain View, CA (US); and Hasim Sak, Santa Clara, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Dec. 14, 2022, as Appl. No. 18/065,685.
Prior Publication US 2024/0203406 A1, Jun. 20, 2024
Int. Cl. G10L 15/00 (2013.01); G10L 15/06 (2013.01); G10L 15/183 (2013.01); G10L 15/22 (2006.01); G10L 15/02 (2006.01)

CPC G10L 15/183 (2013.01) [G10L 15/063 (2013.01); G10L 15/22 (2013.01); G10L 15/02 (2013.01)]

24 Claims

1. A cross-training network for training a speech recognition model, the cross-training network comprising an unsupervised subnetwork trained on a plurality of unlabeled audio samples corresponding to spoken utterances not paired with corresponding transcriptions, the unsupervised subnetwork comprising:

a target branch configured to:

receive, as input to a supervised audio encoder of the speech recognition model, a sequence of acoustic frames extracted from the unlabeled audio samples; and

at each of a plurality of output steps, generate a target higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames input to the supervised audio encoder at a corresponding output step; and

an augmented branch configured to:

augment the sequence of acoustic frames extracted from the unlabeled audio samples by masking one or more acoustic frames in the sequence of acoustic frames; and

at each of the plurality of output steps, generate, as output from an unsupervised audio encoder of the speech recognition model, a predicted higher order feature representation for a corresponding augmented acoustic frame in the sequence of augmented acoustic frames,

wherein the unsupervised subnetwork is configured to:

at each of the plurality of output steps, determine an unsupervised loss term based on the target higher order feature representation generated by the target branch at the corresponding output step and the predicted higher order feature representation generated by the augmented branch at the corresponding output step; and

update parameters of the speech recognition model based on the unsupervised loss term determined at each of the plurality of output steps.
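
The data flow recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and everything concrete in it is an assumption made for illustration: the toy `encode` function (a per-feature scale applied to a three-frame average) stands in for both audio encoders, masking is done by zeroing frames, the unsupervised loss term is taken to be a mean squared error, the target is treated as a constant, and only the predicting encoder's weights are updated by plain gradient descent. The claim itself does not fix any of these choices, and the names `mask_frames`, `context`, `encode`, and `unsupervised_step` are hypothetical.

```python
import random


def mask_frames(frames, mask_prob, rng):
    """Augmented branch: replace randomly chosen acoustic frames with zeros (assumed masking)."""
    return [[0.0] * len(f) if rng.random() < mask_prob else list(f) for f in frames]


def context(frames):
    """Average each frame with its immediate neighbours so masked steps still see context."""
    dim = len(frames[0])
    out = []
    for t in range(len(frames)):
        window = frames[max(0, t - 1): t + 2]
        out.append([sum(f[d] for f in window) / len(window) for d in range(dim)])
    return out


def encode(frames, weights):
    """Toy stand-in for an audio encoder: per-feature scaling of the context vectors."""
    return [[w * c for w, c in zip(weights, ctx)] for ctx in context(frames)]


def unsupervised_step(frames, w_target, w_pred, mask_prob=0.3, lr=0.1, rng=None):
    """One update: per-output-step loss between target and predicted features."""
    if rng is None:
        rng = random.Random(0)
    # Target branch: encode the unmasked sequence of acoustic frames.
    targets = encode(frames, w_target)
    # Augmented branch: mask frames, then encode the augmented sequence.
    augmented = mask_frames(frames, mask_prob, rng)
    ctxs = context(augmented)
    preds = [[w * c for w, c in zip(w_pred, ctx)] for ctx in ctxs]

    dim = len(w_pred)
    losses = []          # one unsupervised loss term per output step
    grad = [0.0] * dim   # gradient of the summed loss w.r.t. w_pred
    for tgt, pred, ctx in zip(targets, preds, ctxs):
        diff = [p - t for p, t in zip(pred, tgt)]
        losses.append(sum(e * e for e in diff) / dim)  # assumed MSE loss
        for d in range(dim):
            grad[d] += 2.0 * diff[d] * ctx[d] / dim

    # Update parameters from the loss terms of all output steps (averaged).
    steps = len(frames)
    new_w_pred = [w - lr * g / steps for w, g in zip(w_pred, grad)]
    return losses, new_w_pred


if __name__ == "__main__":
    # Hypothetical unlabeled input: four acoustic frames with two features each.
    frames = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0], [1.5, 1.0]]
    losses, new_w = unsupervised_step(frames, [1.0, 1.0], [0.0, 0.0])
    print(losses, new_w)
```

In this sketch, the per-step losses correspond to the unsupervised loss term determined at each output step, and the final weight update corresponds to updating parameters based on those loss terms. A practical system would replace the toy encoder with the speech recognition model's actual encoders and could use a different loss.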