| CPC G10L 15/183 (2013.01) [G10L 15/063 (2013.01); G10L 15/22 (2013.01); G10L 15/02 (2013.01)] | 24 Claims |

|
1. A cross-training network for training a speech recognition model, the cross-training network comprising an unsupervised subnetwork trained on a plurality of unlabeled audio samples corresponding to spoken utterances not paired with corresponding transcriptions, the unsupervised subnetwork comprising:
a target branch configured to:
receive, as input to a supervised audio encoder of the speech recognition model, a sequence of acoustic frames extracted from the unlabeled audio samples; and
at each of a plurality of output steps, generate a target higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames input to the supervised audio encoder at a corresponding output step; and
an augmented branch configured to:
augment the sequence of acoustic frames extracted from the unlabeled audio samples by masking one or more acoustic frames in the sequence of acoustic frames; and
at each of the plurality of output steps, generate, as output from an unsupervised audio encoder of the speech recognition model, a predicted higher order feature representation for a corresponding augmented acoustic frame in the sequence of augmented acoustic frames,
wherein the unsupervised subnetwork is configured to:
at each of the plurality of output steps, determine an unsupervised loss term based on the target higher order feature representation generated by the target branch at the corresponding output step and the predicted higher order feature representation generated by the augmented branch at the corresponding output step; and
update parameters of the speech recognition model based on the unsupervised loss term determined at each of the plurality of output steps.
|
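The data flow recited in claim 1 can be sketched as a toy training step. This is only an illustration under stated assumptions, not the claimed implementation. The claim does not specify the encoder architecture, the masking value, the form of the loss, or the optimizer. The sketch assumes a linear encoder over neighbor-averaged frames, zero-masking, a squared-error loss per output step, a target branch held constant during the update, and plain gradient descent. All function and variable names (`add_context`, `encode`, `mask_frames`, `unsupervised_step`) are hypothetical.

```python
def add_context(frames):
    """Average each frame with its neighbours (assumed stand-in for encoder context)."""
    out = []
    for t, f in enumerate(frames):
        left = frames[t - 1] if t > 0 else f
        right = frames[t + 1] if t + 1 < len(frames) else f
        out.append([(a + b + c) / 3.0 for a, b, c in zip(left, f, right)])
    return out


def encode(weights, frame):
    """Toy linear encoder: maps one frame to one higher order feature vector."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]


def mask_frames(frames, masked_idx):
    """Augment the sequence by zeroing the selected frames (zero-masking is assumed)."""
    return [[0.0] * len(f) if i in masked_idx else list(f)
            for i, f in enumerate(frames)]


def unsupervised_step(w_target, w_pred, frames, masked_idx, lr=0.01):
    """One update: returns the per-output-step loss terms and the updated weights."""
    clean_ctx = add_context(frames)                           # target branch input
    aug_ctx = add_context(mask_frames(frames, masked_idx))    # augmented branch input
    losses = []
    grad = [[0.0] * len(w_pred[0]) for _ in w_pred]
    for clean, aug in zip(clean_ctx, aug_ctx):
        target = encode(w_target, clean)    # target representation, held constant
        predicted = encode(w_pred, aug)     # predicted representation
        # Unsupervised loss term for this output step (squared error, assumed).
        losses.append(sum((p - t) ** 2 for p, t in zip(predicted, target)))
        # Accumulate the analytic gradient of the squared error w.r.t. w_pred.
        for i, (p, t) in enumerate(zip(predicted, target)):
            for j, x in enumerate(aug):
                grad[i][j] += 2.0 * (p - t) * x
    new_w = [[w - lr * g for w, g in zip(w_row, g_row)]
             for w_row, g_row in zip(w_pred, grad)]
    return losses, new_w


# Example run on four two-dimensional "acoustic frames", masking frame 1.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
w_target = [[1.0, 0.0], [0.0, 1.0]]
w_pred = [[0.0, 0.0], [0.0, 0.0]]
losses, w_pred = unsupervised_step(w_target, w_pred, frames, {1})
print(losses)
```

Calling `unsupervised_step` repeatedly with the returned weights lowers the summed loss in this toy setting. A real system would replace the linear map with the speech recognition model's audio encoder and would likely use a different loss.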