US 12,354,600 B2
Fast and efficient text only adaptation for factorized neural transducer
Rui Zhao, Bellevue, WA (US); Jian Xue, Bellevue, WA (US); Sarangarajan Parthasarathy, Mountain View, CA (US); and Jinyu Li, Bellevue, WA (US)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by MICROSOFT TECHNOLOGY LICENSING, LLC, Redmond, WA (US)
Filed on Nov. 9, 2022, as Appl. No. 17/983,660.
Claims priority of provisional application 63/416,892, filed on Oct. 17, 2022.
Prior Publication US 2024/0135919 A1, Apr. 25, 2024
Int. Cl. G10L 15/16 (2006.01); G10L 15/06 (2013.01); G10L 15/197 (2013.01); G10L 15/22 (2006.01)
CPC G10L 15/16 (2013.01) [G10L 15/063 (2013.01); G10L 15/197 (2013.01); G10L 15/22 (2013.01)] 16 Claims
OG exemplary drawing
 
1. A method implemented by a computing system for performing improved automatic speech recognition using a factorized neural transducer, the method comprising:
accessing a factorized neural transducer comprising a first set of layers for predicting blank tokens and a second set of layers for predicting vocabulary tokens, wherein the second set of layers has been modified to facilitate an improvement in an accuracy of the factorized neural transducer in performing automatic speech recognition,
the first set of layers comprising a blank predictor, an encoder, and a joint network, wherein a blank predictor output from the blank predictor and an encoder output from the encoder are processed by the joint network for predicting a blank token,
the second set of layers comprising a language model that comprises a vocabulary predictor which is a separate predictor from the blank predictor, wherein a vocabulary predictor output from the vocabulary predictor and the encoder output are used for predicting a vocabulary token;
receiving electronic content comprising speech data as input to the factorized neural transducer;
predicting a blank token and a vocabulary token for a particular portion of the speech data, the particular portion of speech data associated with a new domain;
obtaining a set of adaptation data associated with the new domain;
accessing an N-gram model trained on the set of adaptation data;
generating an N-gram output based on receiving the particular portion of speech data;
prior to predicting the vocabulary token, interpolating the N-gram output from the N-gram model with the vocabulary predictor output; and
using the blank token and the vocabulary token to perform speech recognition on the speech data.
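The adaptation step recited in the claim — interpolating the N-gram model's output with the vocabulary predictor's output before the vocabulary token is predicted — can be sketched as follows. The patent text does not fix a particular interpolation formula; a probability-space mixture with weight `lam` is a common choice and is used here as an illustrative assumption, and all function and variable names are hypothetical.

```python
import math

def interpolate_vocab_logprobs(vp_logprobs, ngram_logprobs, lam):
    """Mix the vocabulary predictor's distribution with an external N-gram
    model trained on new-domain text: p = (1 - lam) * p_vp + lam * p_ngram.
    Both inputs are per-token log-probabilities over the same vocabulary.
    The mixture form and the weight `lam` are illustrative assumptions."""
    return [
        math.log((1.0 - lam) * math.exp(p) + lam * math.exp(q))
        for p, q in zip(vp_logprobs, ngram_logprobs)
    ]

def predict_vocab_token(vp_logprobs, ngram_logprobs, lam=0.5):
    """Interpolate first (as the claim requires), then pick the
    highest-scoring vocabulary token."""
    mixed = interpolate_vocab_logprobs(vp_logprobs, ngram_logprobs, lam)
    return max(range(len(mixed)), key=mixed.__getitem__)
```

Setting `lam = 0` recovers the unadapted vocabulary predictor. Note that only the language-model branch (the second set of layers) is touched; the blank predictor, encoder, and joint network in the first set of layers are unchanged, which is what allows adaptation from text-only data without retraining on paired speech.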