| CPC G10L 15/16 (2013.01) [G10L 2015/081 (2013.01)] | 20 Claims |

|
1. A system, comprising:
at least one processor; and
a memory storing program instructions that, when executed by the at least one processor, cause the at least one processor to implement a speech recognition system, configured to:
receive audio data to recognize speech in the audio data;
cause the audio data to be processed through a machine learning model trained for speech recognition that outputs original token predictions corresponding to respective words recognized in the audio data by the machine learning model;
generate a plurality of masked versions of the audio data, wherein individual ones of the plurality of masked versions of the audio data comprise different respective masked portions of the audio data;
cause the plurality of masked versions of the audio data to be processed through the machine learning model to output internal language model estimation token predictions corresponding to respective words recognized in the plurality of masked versions of the audio data by the machine learning model;
compare the internal language model estimation token predictions with the original token predictions to determine modifications to one or more of the original token predictions according to differences between the internal language model estimation token predictions and the original token predictions that are above a difference threshold;
apply the modifications to the one or more original token predictions; and
generate a speech prediction in the audio data according to the original token predictions including the modified one or more original token predictions.
|
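
The sequence of steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the claimed implementation, and the claim does not specify any of the details below. All names (`make_masked_versions`, `recognize`, `toy_model`, `num_masks`, `difference_threshold`) are hypothetical. The sketch assumes that masking means zeroing one contiguous span of samples per version, that the "difference" at a token position is the fraction of masked runs whose token disagrees with the original token, and that a modification replaces the original token with the most frequent disagreeing token. The model is a lookup-table stand-in, not a trained speech recognizer.

```python
from collections import Counter
from typing import Callable, List, Sequence

# Hypothetical model interface: audio samples in, one token per recognized word out.
Model = Callable[[Sequence[float]], List[str]]


def make_masked_versions(audio: Sequence[float], num_masks: int) -> List[List[float]]:
    """Return num_masks copies of audio, each with a different span zeroed.

    Assumption: masking is zeroing a contiguous span; the claim only requires
    that each version masks a different portion of the audio.
    """
    if num_masks < 1:
        raise ValueError("num_masks must be at least 1")
    n = len(audio)
    span = max(1, n // num_masks)
    versions = []
    for i in range(num_masks):
        start = min(n, i * span)
        # The last version absorbs any leftover samples.
        end = n if i == num_masks - 1 else min(n, start + span)
        masked = list(audio)
        masked[start:end] = [0.0] * (end - start)
        versions.append(masked)
    return versions


def recognize(
    audio: Sequence[float],
    model: Model,
    num_masks: int = 4,
    difference_threshold: float = 0.5,
) -> List[str]:
    """Sketch of the claimed flow: original pass, masked passes, compare, modify."""
    original = model(audio)
    masked_runs = [model(m) for m in make_masked_versions(audio, num_masks)]

    result = list(original)
    for pos, token in enumerate(original):
        # A masked run that produced fewer tokens contributes None at this position.
        candidates = [run[pos] if pos < len(run) else None for run in masked_runs]
        disagreeing = [c for c in candidates if c != token]
        # Assumed difference measure: fraction of masked runs that disagree.
        difference = len(disagreeing) / len(candidates)
        replacements = [c for c in disagreeing if c is not None]
        if difference > difference_threshold and replacements:
            # Assumed modification rule: most frequent disagreeing token.
            result[pos] = Counter(replacements).most_common(1)[0][0]
    return result


# Lookup-table stand-in for a trained model, for illustration only.
_TOY_OUTPUTS = {
    (1.0, 2.0, 3.0): ["the", "cat", "sat"],
    (0.0, 2.0, 3.0): ["the", "hat", "sat"],
    (1.0, 0.0, 3.0): ["the", "hat", "sat"],
    (1.0, 2.0, 0.0): ["the", "cat", "sat"],
}


def toy_model(audio: Sequence[float]) -> List[str]:
    return list(_TOY_OUTPUTS[tuple(audio)])
```

Calling `recognize([1.0, 2.0, 3.0], toy_model, num_masks=3)` runs the stand-in model once on the unmasked samples and once per masked version, then modifies only those token positions whose disagreement fraction exceeds `difference_threshold`. Raising the threshold makes the sketch more conservative about changing the original token predictions.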