US 12,334,059 B2
Contrastive Siamese network for semi-supervised speech recognition
Jaeyoung Kim, Cupertino, CA (US); Soheil Khorram, Redwood City, CA (US); Hasim Sak, Santa Clara, CA (US); Anshuman Tripathi, Mountain View, CA (US); Han Lu, Redmond, WA (US); and Qian Zhang, Mountain View, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Mar. 28, 2024, as Appl. No. 18/619,684.
Application 18/619,684 is a continuation of application No. 17/644,337, filed on Dec. 14, 2021, granted, now 11,961,515.
Claims priority of provisional application 63/261,895, filed on Sep. 30, 2021.
Prior Publication US 2024/0242712 A1, Jul. 18, 2024
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 15/00 (2013.01); G06N 3/088 (2023.01); G10L 15/16 (2006.01); G10L 15/18 (2013.01)
CPC G10L 15/16 (2013.01) [G06N 3/088 (2013.01); G10L 15/1815 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A computer-implemented method that, when executed on data processing hardware, causes the data processing hardware to perform operations comprising:
receiving training samples comprising:
a plurality of unlabeled audio samples corresponding to spoken utterances not paired with corresponding transcriptions; and
a plurality of labeled audio samples corresponding to spoken utterances paired with corresponding transcriptions;
executing a semi-supervised training process for training a speech recognition model, the semi-supervised training process comprising an unsupervised subnetwork training process and a supervised subnetwork training process;
during execution of the unsupervised subnetwork training process:
performing augmentation on the unlabeled audio samples;
generating, using an audio encoder of the speech recognition model, a sequence of augmented encoder outputs for the augmented unlabeled audio samples;
generating, using a prediction network configured to receive the sequence of augmented encoder outputs, predictions of a sequence of target branch outputs; and
determining an unsupervised loss term based on the sequence of target branch outputs and the predictions of the sequence of target branch outputs;
during execution of the supervised subnetwork training process:
generating, using the speech recognition model, speech recognition results for the labeled audio samples; and
determining a supervised loss term based on the speech recognition results for the labeled audio samples and the corresponding transcriptions of the labeled audio samples; and
updating parameters of the speech recognition model based on the unsupervised loss term and the supervised loss term.
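The training flow recited in claim 1 can be sketched in code. The following is a minimal, hypothetical NumPy illustration, not the patented implementation: the augmentation, encoder, prediction network, and both loss functions are toy stand-ins (additive noise, a single linear-ReLU layer, a linear projection, mean-squared error, and cross-entropy, respectively), chosen only to show how the unsupervised and supervised branches combine into one objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(x, noise_scale=0.1):
    # Stand-in augmentation: additive noise (the patent's augmentation
    # step could instead be masking, time-warping, etc.).
    return x + noise_scale * rng.standard_normal(x.shape)

def encoder(x, W_enc):
    # Toy audio encoder: one linear layer with ReLU.
    return np.maximum(0.0, x @ W_enc)

def prediction_network(h, W_pred):
    # Prediction network mapping augmented encoder outputs
    # to predictions of the target branch outputs.
    return h @ W_pred

def unsupervised_loss(pred, target):
    # Distance between predicted and actual target-branch outputs.
    return float(np.mean((pred - target) ** 2))

def supervised_loss(logits, labels):
    # Cross-entropy between recognition outputs and reference labels.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(labels)), labels]))

# Hypothetical dimensions and parameters.
d_in, d_hid, n_vocab = 8, 16, 5
W_enc = rng.standard_normal((d_in, d_hid)) * 0.1
W_pred = rng.standard_normal((d_hid, d_hid)) * 0.1
W_out = rng.standard_normal((d_hid, n_vocab)) * 0.1

# Unsupervised subnetwork: augment unlabeled audio, encode both views,
# and compare target-branch outputs against their predictions.
x_unlab = rng.standard_normal((4, d_in))
target_branch = encoder(x_unlab, W_enc)          # target (clean) branch
augmented_enc = encoder(augment(x_unlab), W_enc) # augmented branch
l_unsup = unsupervised_loss(
    prediction_network(augmented_enc, W_pred), target_branch)

# Supervised subnetwork: recognize labeled audio and compare
# against the paired transcription labels.
x_lab = rng.standard_normal((4, d_in))
labels = rng.integers(0, n_vocab, size=4)
l_sup = supervised_loss(encoder(x_lab, W_enc) @ W_out, labels)

# Parameters would be updated from the combined objective.
total_loss = l_unsup + l_sup
```

In practice the two loss terms are typically weighted before summation, and the gradient of `total_loss` drives the parameter update; the weighting and optimizer are left out here for brevity.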