US 12,249,317 B2
Joint unsupervised and supervised training for multilingual ASR
Bo Li, Fremont, CA (US); Junwen Bai, Mountain View, CA (US); Yu Zhang, Mountain View, CA (US); Ankur Bapna, Sunnyvale, CA (US); Nikhil Siddhartha, Mountain View, CA (US); Khe Chai Sim, Dublin, CA (US); and Tara N. Sainath, Jersey City, NJ (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Sep. 6, 2022, as Appl. No. 17/929,934.
Claims priority of provisional application 63/262,174, filed on Oct. 6, 2021.
Prior Publication US 2023/0104228 A1, Apr. 6, 2023
Int. Cl. G10L 15/16 (2006.01); G10L 15/02 (2006.01); G10L 15/06 (2013.01); G10L 15/187 (2013.01); G10L 15/19 (2013.01)
CPC G10L 15/063 (2013.01) [G10L 15/02 (2013.01); G10L 15/16 (2013.01); G10L 15/19 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A joint unsupervised and supervised training (JUST) framework for training a multilingual automatic speech recognition (ASR) model, the JUST framework comprising:
a feature encoder configured to:
receive, as input, audio features corresponding to an utterance of speech; and
generate, at each of a plurality of time steps, a latent speech representation;
a quantizer configured to:
receive, as input, the latent speech representations generated by the feature encoder at each of the plurality of time steps; and
generate, at each of the plurality of time steps, a target quantized vector token and a target token index for a corresponding latent speech representation generated by the feature encoder, wherein the target token index maps the corresponding latent speech representation to the target quantized vector token stored in a codebook;
a contrastive net configured to:
receive, as input, the latent speech representations generated by the feature encoder at each of the plurality of time steps after masking a subset of the latent speech representations;
generate, at each of the plurality of time steps, a contrastive context vector for the corresponding unmasked or masked latent speech representation; and
derive, at each of the plurality of time steps, a contrastive self-supervised loss based on the corresponding contrastive context vector and the corresponding target quantized vector token generated by the quantizer for the corresponding latent speech representation;
a masked language modeling (MLM) module configured to:
receive, as input, the contrastive context vector generated by the contrastive net at each of the plurality of time steps;
generate, at each of the plurality of time steps, a high-level context vector; and
for each high-level context vector, learn to predict the target token index at the corresponding time step using a cross-entropy loss based on the target token index generated by the quantizer at the corresponding time step; and
a decoder configured to:
receive, as input, the high-level context vector generated by the MLM module at each of the plurality of time steps; and
predict speech recognition hypotheses for the utterance,
wherein the multilingual ASR model is trained on:
an unsupervised loss based on the contrastive self-supervised loss and the cross-entropy loss; and
a supervised loss based on the predicted speech recognition hypotheses and a ground-truth transcription of the utterance.
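For illustration only (the claim recites functions, not an implementation), the sketches below render the recited loss computations in plain Python/NumPy. First, the quantizer: a minimal sketch assuming nearest-neighbor lookup into a learned codebook; the function name quantize and the squared-Euclidean distance metric are assumptions, not taken from the patent.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent speech representation to its nearest codebook entry.

    latents:  (T, D) array, one latent vector per time step.
    codebook: (V, D) array of target quantized vector tokens.

    Returns the target quantized vector tokens (T, D) and the target
    token indices (T,) recited in the claim.
    """
    # Squared Euclidean distance from every latent to every codebook entry.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)    # target token index per time step
    return codebook[idx], idx     # target quantized vector tokens, indices
```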
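Next, the contrastive self-supervised loss derived by the contrastive net. This sketch assumes a wav2vec 2.0-style InfoNCE objective in which the positive for each contrastive context vector is the co-located target quantized vector token and the negatives are tokens sampled from other time steps; the cosine similarity, the temperature, and the number of negatives are all assumptions.

```python
import numpy as np

def contrastive_loss(context, targets, num_negatives=10, temperature=0.1,
                     rng=None):
    """InfoNCE-style contrastive self-supervised loss (sketch).

    context: (T, D) contrastive context vectors from the contrastive net.
    targets: (T, D) target quantized vector tokens from the quantizer.
    """
    rng = rng or np.random.default_rng(0)
    T = context.shape[0]
    losses = []
    for t in range(T):
        others = np.delete(np.arange(T), t)             # distractor steps
        neg = targets[rng.choice(others, size=num_negatives, replace=True)]
        cands = np.vstack([targets[t:t + 1], neg])      # (1 + K, D)
        # Cosine similarity between the context vector and each candidate.
        sims = cands @ context[t] / (
            np.linalg.norm(cands, axis=1) * np.linalg.norm(context[t]) + 1e-8)
        logits = sims / temperature
        # Cross-entropy with the positive candidate at index 0.
        logits -= logits.max()
        losses.append(-logits[0] + np.log(np.exp(logits).sum()))
    return float(np.mean(losses))
```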
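The MLM module's cross-entropy loss, assuming the high-level context vectors have already been projected to logits over the V codebook entries (that projection is outside this sketch):

```python
import numpy as np

def mlm_loss(logits, target_idx):
    """Cross-entropy between MLM predictions and the quantizer's indices.

    logits:     (T, V) per-time-step scores over the V codebook entries.
    target_idx: (T,)   target token indices from the quantizer.
    """
    # Numerically stable log-softmax over the codebook vocabulary.
    shifted = logits - logits.max(axis=1, keepdims=True)
    logp = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-logp[np.arange(len(target_idx)), target_idx].mean())
```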
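Finally, the training objective. The claim states only that the unsupervised loss is based on the contrastive and cross-entropy losses, and that the model is also trained on a supervised loss computed from the decoder's hypotheses and the ground-truth transcription; the additive combination and the weight alpha below are assumptions.

```python
def just_loss(l_contrastive, l_mlm, l_supervised, alpha=0.1):
    """Total JUST training objective (sketch).

    The unsupervised loss sums the contrastive self-supervised loss and
    the MLM cross-entropy loss; the supervised loss scores the decoder's
    speech recognition hypotheses against the ground-truth transcription.
    The additive form and the weight `alpha` are assumptions.
    """
    l_unsupervised = l_contrastive + l_mlm
    return l_supervised + alpha * l_unsupervised
```

In an end-to-end setting these scalars would be computed on the same batch and backpropagated jointly; they are combined here purely to show how the claim's two training signals interact.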