US 12,334,068 B1
Detecting corrupted speech in voice-based computer interfaces
Di Wang, Seattle, WA (US); Deshen Wang, Redmond, WA (US); Lan Ma, Lake Forest Park, WA (US); Shu Wang, Bellevue, WA (US); Wenbo Yan, Redmond, WA (US); and Prathap Ramachandra, Kirkland, WA (US)
Assigned to AMAZON TECHNOLOGIES, INC., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Sep. 29, 2022, as Appl. No. 17/956,003.
Int. Cl. G10L 15/22 (2006.01); G06F 40/279 (2020.01); G10L 15/18 (2013.01); G10L 15/26 (2006.01)
CPC G10L 15/22 (2013.01) [G06F 40/279 (2020.01); G10L 15/18 (2013.01); G10L 15/26 (2013.01); G10L 2015/225 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A computer-implemented method comprising:
receiving first input data comprising first audio data representing a user utterance;
generating, using a recurrent neural network-based audio encoder, a first vector representing the first audio data;
generating, by inputting the first audio data into an automatic speech recognition (ASR) component, first text data representing the user utterance;
generating, using a pre-trained transformer-based language model, a second vector representing the first text data;
generating a combined vector by concatenating the first vector and the second vector, wherein the combined vector represents acoustic characteristics and textual characteristics of the first input data;
sending the combined vector to a fully-connected classifier configured to classify the first input data as pertaining to one of a plurality of corruption states;
generating, using the fully-connected classifier, first output data indicating that the first audio data pertains to a first corruption state of the plurality of corruption states; and
generating second output data indicating that the first audio data was not clearly received due to the first corruption state.
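The claim describes a two-branch architecture: a recurrent audio encoder and a pre-trained transformer text encoder (applied to the ASR transcript) each produce a vector, the two vectors are concatenated, and a fully-connected classifier maps the combined vector to one of several corruption states. The sketch below is a minimal, illustrative PyTorch rendering of that flow under stated assumptions; the module names, feature dimensions, GRU choice for the recurrent encoder, and the example corruption-state labels are assumptions for illustration, not details taken from the patent.

```python
# Minimal sketch of the claimed pipeline. All names, dimensions, and the
# corruption-state labels below are illustrative assumptions.
import torch
import torch.nn as nn

# Hypothetical set of corruption states the classifier distinguishes.
CORRUPTION_STATES = ["clean", "truncated", "clipped", "noisy"]

class AudioEncoder(nn.Module):
    """Recurrent (GRU-based) encoder: frame-level audio features -> one vector."""
    def __init__(self, feat_dim=80, hidden_dim=256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)

    def forward(self, audio_feats):     # audio_feats: (batch, frames, feat_dim)
        _, h_n = self.rnn(audio_feats)  # h_n: (1, batch, hidden_dim)
        return h_n.squeeze(0)           # "first vector": (batch, hidden_dim)

class CorruptionClassifier(nn.Module):
    """Fully-connected head over the concatenated audio + text vectors."""
    def __init__(self, audio_dim=256, text_dim=768,
                 num_states=len(CORRUPTION_STATES)):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(audio_dim + text_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_states),
        )

    def forward(self, audio_vec, text_vec):
        # Concatenation yields the "combined vector" carrying both
        # acoustic and textual characteristics of the input.
        combined = torch.cat([audio_vec, text_vec], dim=-1)
        return self.head(combined)      # logits, one per corruption state

# Example forward pass with dummy tensors. In practice text_vec would come
# from a pre-trained transformer language model run over the ASR transcript.
audio_feats = torch.randn(1, 120, 80)   # e.g., 120 frames of log-mel features
text_vec = torch.randn(1, 768)          # e.g., a sentence-level embedding
audio_vec = AudioEncoder()(audio_feats)
logits = CorruptionClassifier()(audio_vec, text_vec)
state = CORRUPTION_STATES[logits.argmax(dim=-1).item()]
print(f"Predicted corruption state: {state}")
```

If the predicted state is anything other than clean, a downstream step (the claim's "second output data") would surface a message that the audio was not clearly received due to that corruption state.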