CPC G10L 15/063 (2013.01) [G10L 15/02 (2013.01); G10L 15/16 (2013.01); G10L 15/19 (2013.01)] (20 Claims)

1. A joint unsupervised and supervised training (JUST) framework for training a multilingual automatic speech recognition (ASR) model, the JUST framework comprising:
a feature encoder configured to:
receive, as input, audio features corresponding to an utterance of speech; and
generate, at each of a plurality of time steps, a latent speech representation;
a quantizer configured to:
receive, as input, the latent speech representations generated by the feature encoder at each of the plurality of time steps; and
generate, at each of the plurality of time steps, a target quantized vector token and a target token index for a corresponding latent speech representation generated by the feature encoder, wherein the target token index maps the corresponding latent speech representation to the target quantized vector token stored in a codebook;
a contrastive net configured to:
receive, as input, the latent speech representations generated by the feature encoder at each of the plurality of time steps after masking a subset of the latent speech representations;
generate, at each of the plurality of time steps, a contrastive context vector for the corresponding unmasked or masked latent speech representation; and
derive, at each of the plurality of time steps, a contrastive self-supervised loss based on the corresponding contrastive context vector and the corresponding target quantized vector token generated by the quantizer for the corresponding latent speech representation;
a masked language modeling (MLM) module configured to:
receive, as input, the contrastive context vector generated by the contrastive net at each of the plurality of time steps;
generate, at each of the plurality of time steps, a high-level context vector; and
for each high-level context vector, learn to predict the target token index at the corresponding time step using a cross-entropy loss based on the target token index generated by the quantizer at the corresponding time step; and
a decoder configured to:
receive, as input, the high-level context vector generated by the MLM module at each of the plurality of time steps; and
predict speech recognition hypotheses for the utterance,
wherein the multilingual ASR model is trained on:
an unsupervised loss based on the contrastive self-supervised loss and the cross-entropy loss; and
a supervised loss based on the predicted speech recognition hypotheses and a ground-truth transcription of the utterance.
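
The claim recites a w2v-BERT-style stack: a feature encoder feeding both a quantizer and a partially masked contrastive net, an MLM module on top, and a decoder, trained on the sum of an unsupervised loss (contrastive plus cross-entropy) and a supervised loss. Below is a minimal PyTorch sketch of how such an objective could be wired together. Everything concrete in it is an assumption for illustration, not taken from the claim: the layer sizes, the nearest-neighbor quantizer, the full-codebook distractor set in the contrastive loss, the CTC head standing in for the claimed decoder, and the unweighted sum of losses.

import torch
import torch.nn as nn
import torch.nn.functional as F

class JUSTSketch(nn.Module):
    """Illustrative JUST-style model; all names and sizes are assumptions."""

    def __init__(self, feat_dim=80, d_model=256, codebook_size=320, vocab_size=100):
        super().__init__()
        # Feature encoder: audio features -> latent speech representation per time step.
        self.feature_encoder = nn.Sequential(
            nn.Linear(feat_dim, d_model), nn.GELU(), nn.Linear(d_model, d_model))
        # Quantizer codebook of target quantized vector tokens.
        self.codebook = nn.Parameter(torch.randn(codebook_size, d_model))
        self.mask_embedding = nn.Parameter(torch.randn(d_model))
        make_layer = lambda: nn.TransformerEncoderLayer(
            d_model, nhead=4, dim_feedforward=1024, batch_first=True)
        # Contrastive net: contrastive context vectors from (masked) latents.
        self.contrastive_net = nn.TransformerEncoder(make_layer(), num_layers=2)
        # MLM module: high-level context vectors; head predicts target token indices.
        self.mlm_net = nn.TransformerEncoder(make_layer(), num_layers=2)
        self.token_index_head = nn.Linear(d_model, codebook_size)
        # Decoder: a CTC projection stands in for the claimed decoder.
        self.decoder = nn.Linear(d_model, vocab_size)

    def quantize(self, latents):
        # Map each latent to its nearest codebook entry: a target quantized
        # vector token plus the target token index into the codebook.
        dists = torch.cdist(
            latents, self.codebook.unsqueeze(0).expand(latents.size(0), -1, -1))
        idx = dists.argmin(-1)            # target token indices, shape (B, T)
        return self.codebook[idx], idx    # target quantized vectors, shape (B, T, D)

    def forward(self, feats, targets, target_lens, mask_prob=0.3):
        latents = self.feature_encoder(feats)          # (B, T, D)
        q_vecs, q_idx = self.quantize(latents.detach())
        # Mask a random subset of latent representations before the contrastive net.
        mask = torch.rand(latents.shape[:2], device=feats.device) < mask_prob
        masked = torch.where(mask.unsqueeze(-1),
                             self.mask_embedding.expand_as(latents), latents)
        ctx = self.contrastive_net(masked)             # contrastive context vectors
        # Contrastive self-supervised loss: each masked position's context vector
        # must pick out its target quantized vector (q_vecs) against the rest of
        # the codebook; with the full codebook as distractors this reduces to
        # classification over target token indices.
        logits = torch.einsum('btd,kd->btk', F.normalize(ctx, dim=-1),
                              F.normalize(self.codebook, dim=-1)) / 0.1
        contrastive_loss = F.cross_entropy(logits[mask], q_idx[mask])
        # MLM module: high-level context vectors, trained with cross-entropy to
        # predict the quantizer's target token index at each time step.
        high_ctx = self.mlm_net(ctx)
        mlm_loss = F.cross_entropy(
            self.token_index_head(high_ctx).flatten(0, 1), q_idx.flatten())
        # Supervised loss: decoder hypotheses vs. the ground-truth transcription.
        log_probs = self.decoder(high_ctx).log_softmax(-1).transpose(0, 1)  # (T, B, V)
        in_lens = torch.full((feats.size(0),), feats.size(1), dtype=torch.long)
        supervised_loss = F.ctc_loss(log_probs, targets, in_lens, target_lens)
        # Train on the unsupervised (contrastive + cross-entropy) plus supervised loss.
        return (contrastive_loss + mlm_loss) + supervised_loss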
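A hypothetical training step over one synthetic batch (all shapes illustrative):

model = JUSTSketch()
feats = torch.randn(8, 200, 80)                      # (batch, frames, log-mel features)
targets = torch.randint(1, 100, (8, 30))             # ground-truth transcription tokens
target_lens = torch.full((8,), 30, dtype=torch.long)
loss = model(feats, targets, target_lens)            # unsupervised + supervised loss
loss.backward()

A production implementation along the lines of the claim would more likely use Conformer blocks for the contrastive and MLM nets and an RNN-T or attention-based decoder to produce the speech recognition hypotheses; the sketch substitutes vanilla Transformer layers and CTC only to stay short and self-contained.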