| CPC G10L 15/22 (2013.01) [G06F 40/279 (2020.01); G10L 15/18 (2013.01); G10L 15/26 (2013.01); G10L 2015/225 (2013.01)] | 20 Claims |

1. A computer-implemented method comprising:
receiving first input data comprising first audio data representing a user utterance;
generating, using a recurrent neural network-based audio encoder, a first vector representing the first audio data;
generating, by inputting the first audio data into an automatic speech recognition (ASR) component, first text data representing the user utterance;
generating, using a pre-trained transformer-based language model, a second vector representing the first text data;
generating a combined vector by concatenating the first vector and the second vector, wherein the combined vector represents acoustic characteristics and textual characteristics of the first input data;
sending the combined vector to a fully-connected classifier configured to classify the first input data as pertaining to one of a plurality of corruption states;
generating, using the fully-connected classifier, first output data indicating that the first audio data pertains to a first corruption state of the plurality of corruption states; and
generating second output data indicating that the first audio data was not clearly received due to the first corruption state.
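The pipeline recited in claim 1 can be sketched as follows. This is a minimal illustrative mock-up, not an implementation of the patented method: every component name, dimension, and weight below is a hypothetical stand-in (fixed random projections substitute for the trained recurrent audio encoder, ASR output, and transformer language model), chosen only to show the data flow of encode, concatenate, and classify into one of a plurality of corruption states.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes for illustration only; the claim does not specify dimensions.
AUDIO_DIM, TEXT_DIM, N_STATES = 8, 8, 3

def audio_encoder(audio: np.ndarray) -> np.ndarray:
    """Stand-in for the recurrent audio encoder: maps (time, features)
    audio frames to a single first vector by pooling over time."""
    W = rng.standard_normal((audio.shape[-1], AUDIO_DIM))
    return np.tanh(audio @ W).mean(axis=0)

def text_encoder(text: str) -> np.ndarray:
    """Stand-in for the pre-trained transformer language model: maps the
    ASR text for the utterance to a single second vector."""
    codes = np.array([ord(c) for c in text], dtype=float) / 128.0
    W = rng.standard_normal((1, TEXT_DIM))
    return np.tanh(codes[:, None] @ W).mean(axis=0)

def classify_corruption_state(audio: np.ndarray, asr_text: str) -> int:
    """Concatenate the acoustic and textual vectors and apply a
    fully-connected layer over the corruption states."""
    combined = np.concatenate([audio_encoder(audio), text_encoder(asr_text)])
    W = rng.standard_normal((AUDIO_DIM + TEXT_DIM, N_STATES))
    logits = combined @ W
    return int(np.argmax(logits))  # index of the predicted corruption state

# Example: 20 frames of 4-dimensional audio features plus hypothetical ASR text.
state = classify_corruption_state(rng.standard_normal((20, 4)),
                                  "turn on the lights")
print(f"predicted corruption state: {state}")
```

In a real system the two encoders would be trained networks and the classifier's weights learned; the concatenation step is the key structural point, since it is what lets the classifier condition on both acoustic and textual characteristics of the input.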