| CPC G10L 15/16 (2013.01) [G06N 3/088 (2013.01); G10L 15/1815 (2013.01)] | 20 Claims |

|
1. A computer-implemented method that, when executed on data processing hardware, causes the data processing hardware to perform operations comprising:
receiving training samples comprising:
a plurality of unlabeled audio samples corresponding to spoken utterances not paired with corresponding transcriptions; and
a plurality of labeled audio samples corresponding to spoken utterances paired with corresponding transcriptions;
executing a semi-supervised training process for training a speech recognition model, the semi-supervised training process comprising an unsupervised subnetwork training process and a supervised subnetwork training process;
during execution of the unsupervised subnetwork training process:
performing augmentation on the unlabeled audio samples;
generating, using an audio encoder of the speech recognition model, a sequence of augmented encoder outputs for the augmented unlabeled audio samples;
generating, using a prediction network configured to receive the sequence of augmented encoder outputs, predictions of a sequence of target branch outputs; and
determining an unsupervised loss term based on the sequence of target branch outputs and the predictions of the sequence of target branch outputs;
during execution of the supervised subnetwork training process:
generating, using the speech recognition model, speech recognition results for the labeled audio samples; and
determining a supervised loss term based on the speech recognition results for the labeled audio samples and the corresponding transcriptions of the labeled audio samples; and
updating parameters of the speech recognition model based on the unsupervised loss term and the supervised loss term.
|
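The operations of claim 1 can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the claimed implementation: the audio encoder and prediction network are each reduced to a single scalar weight, augmentation is random frame masking, both loss terms are mean squared error, the target branch outputs are taken from the encoder applied to unaugmented audio with frozen weights, and gradients are estimated by finite differences. All function names, the weighting factor `lam`, and these modelling choices are hypothetical; the claim does not specify them.

```python
import random

def augment(frames, rng, mask_prob=0.3):
    # Stand-in augmentation (assumption): zero out random frames.
    return [0.0 if rng.random() < mask_prob else x for x in frames]

def encode(frames, w):
    # Toy "audio encoder": one scalar weight applied to every frame.
    return [w * x for x in frames]

def predict(encoded, v):
    # Toy "prediction network" operating on the encoder outputs.
    return [v * h for h in encoded]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def total_loss(params, w_target, unlabeled, labeled, seed=0, lam=0.5):
    w, v = params
    rng = random.Random(seed)  # fixed seed: identical augmentation on every call
    # Unsupervised subnetwork: predictions from augmented audio versus
    # target branch outputs (here: frozen encoder on clean audio, an assumption).
    unsup = 0.0
    for frames in unlabeled:
        targets = encode(frames, w_target)
        preds = predict(encode(augment(frames, rng), w), v)
        unsup += mse(preds, targets)
    unsup /= len(unlabeled)
    # Supervised subnetwork: recognition results versus transcriptions
    # (numeric vectors and MSE stand in for text and a sequence loss).
    sup = 0.0
    for frames, transcript in labeled:
        sup += mse(encode(frames, w), transcript)
    sup /= len(labeled)
    return sup + lam * unsup

def train_step(params, unlabeled, labeled, lr=0.05, eps=1e-5):
    # Update parameters from both loss terms via central finite differences.
    w_target = params[0]  # freeze the target branch at the current encoder weight
    grads = []
    for i in range(len(params)):
        hi = list(params)
        lo = list(params)
        hi[i] += eps
        lo[i] -= eps
        diff = (total_loss(hi, w_target, unlabeled, labeled)
                - total_loss(lo, w_target, unlabeled, labeled))
        grads.append(diff / (2 * eps))
    return tuple(p - lr * g for p, g in zip(params, grads))

if __name__ == "__main__":
    unlabeled = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]
    labeled = [([1.0, 2.0], [2.0, 4.0])]
    params = (0.5, 0.5)
    print(train_step(params, unlabeled, labeled))
```

A real system would replace the scalar weights with neural networks, the finite differences with automatic differentiation, and the supervised MSE with a sequence loss suited to transcriptions; the sketch only shows how the two loss terms feed one joint parameter update.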