US 12,387,718 B1
Removing bias from automatic speech recognition models using internal language model estimates
Nilaksh Das, Seattle, WA (US); Monica Lakshmi Sunkara, San Jose, CA (US); Sravan Babu Bodapati, Fremont, CA (US); Jinglun Cai, Seattle, WA (US); Devang Kulshreshtha, Montreal (CA); Jeffrey John Farris, Crystal Lake, IL (US); Nicholas G Aldridge, Seattle, WA (US); Srikanth Ronanki, San Jose, CA (US); and Katrin Kirchhoff, Seattle, WA (US)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on May 3, 2023, as Appl. No. 18/311,849.
Int. Cl. G10L 15/16 (2006.01); G10L 15/08 (2006.01)
CPC G10L 15/16 (2013.01) [G10L 2015/081 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A system, comprising:
at least one processor; and
a memory storing program instructions that, when executed by the at least one processor, cause the at least one processor to implement a speech recognition system configured to:
receive audio data to recognize speech in the audio data;
cause the audio data to be processed through a machine learning model trained for speech recognition that outputs original token predictions corresponding to respective words recognized in the audio data by the machine learning model;
generate a plurality of masked versions of the audio data, wherein individual ones of the plurality of masked versions of the audio data comprise different respective masked portions of the audio data;
cause the plurality of masked versions of the audio data to be processed through the machine learning model to output internal language model estimation token predictions corresponding to respective words recognized in the plurality of masked versions of the audio data by the machine learning model;
compare the internal language model estimation token predictions with the original token predictions to determine modifications to one or more of the original token predictions according to differences between the internal language model estimation token predictions and the original token predictions that are above a difference threshold;
apply the modifications to the one or more original token predictions; and
generate a speech prediction in the audio data according to the original token predictions including the modified one or more original token predictions.
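The claimed steps can be illustrated with a minimal sketch. Everything below is hypothetical: `toy_asr_model` is a random stand-in for the trained speech-recognition model, the span-based masking, the averaging of masked outputs into an internal-language-model (ILM) estimate, and the subtraction weight `lam` are illustrative choices, not the patented implementation.

```python
import numpy as np

VOCAB_SIZE = 5

def toy_asr_model(audio, seed=0):
    # Stand-in for the trained ASR model (hypothetical): maps audio frames
    # of shape (num_frames, num_features) to per-frame token log-probabilities.
    rng = np.random.default_rng(seed)
    weights = rng.standard_normal((audio.shape[1], VOCAB_SIZE))
    logits = audio @ weights
    return logits - np.logaddexp.reduce(logits, axis=-1, keepdims=True)

def masked_versions(audio, num_masks=4):
    # Generate masked versions of the audio data: each version zeroes out a
    # different contiguous span of frames (one illustrative masking scheme).
    span = max(1, audio.shape[0] // num_masks)
    versions = []
    for i in range(num_masks):
        masked = audio.copy()
        masked[i * span:(i + 1) * span] = 0.0
        versions.append(masked)
    return versions

def debias(audio, threshold=0.5, lam=0.3):
    # 1. Original token predictions from the unmasked audio.
    orig = toy_asr_model(audio)
    # 2-3. ILM estimation token predictions: average the model's outputs
    # over the masked versions of the audio data.
    ilm = np.mean([toy_asr_model(v) for v in masked_versions(audio)], axis=0)
    # 4-5. Where the difference between ILM estimate and original prediction
    # exceeds the threshold, modify the original prediction by subtracting a
    # scaled ILM estimate; otherwise keep the original prediction.
    diff = orig - ilm
    adjusted = np.where(np.abs(diff) > threshold, orig - lam * ilm, orig)
    # 6. Final speech prediction from the (partly modified) token predictions.
    return adjusted.argmax(axis=-1)
```

The masked passes expose how much of each prediction is driven by the model's internal language prior rather than the acoustics; subtracting that estimate where it dominates is the de-biasing step.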