CPC G10L 15/197 (2013.01) [G10L 15/02 (2013.01); G10L 15/04 (2013.01); G10L 15/063 (2013.01); G10L 15/16 (2013.01); G10L 15/22 (2013.01); G10L 15/30 (2013.01); G10L 15/32 (2013.01); G10L 2015/025 (2013.01)]; 24 Claims

1. A computer-implemented method that, when executed on data processing hardware, causes the data processing hardware to perform operations comprising:
receiving, from a first speech recognizer comprising an encoder and a decoder, a speech recognition result corresponding to a transcription of an utterance spoken by a user, the speech recognition result comprising a sequence of hypothesized sub-word units that form one or more words of the transcription of the utterance, each sub-word unit output from the first speech recognizer at a corresponding output step;
using a confidence estimation module, for each sub-word unit in the sequence of hypothesized sub-word units:
obtaining a respective confidence embedding associated with the corresponding output step when the corresponding sub-word unit was output from the first speech recognizer;
generating, using a first attention mechanism that self-attends to the respective confidence embedding for the corresponding sub-word unit and the confidence embeddings obtained for earlier sub-word units in the sequence of hypothesized sub-word units that correspond to the same word from the one or more words as the corresponding sub-word unit, a confidence feature vector;
generating, using a second attention mechanism that cross-attends to a sequence of encodings each associated with a corresponding acoustic frame segmented from audio data that corresponds to the utterance, an acoustic context vector; and
generating, as output from an output layer of the confidence estimation module, a respective confidence output score for the corresponding sub-word unit based on the confidence feature vector and the acoustic context vector received as input by the output layer of the confidence estimation module;
determining, based on the respective confidence output score generated for each sub-word unit in the sequence of hypothesized sub-word units, an utterance-level confidence score for the transcription of the utterance; and
training the confidence estimation module and at least one of the encoder or the decoder of the first speech recognizer jointly on an utterance-level loss.
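The data flow recited in claim 1 can be pictured as a small attention-based head on top of the recognizer: confidence embeddings self-attended within the current word, cross-attention from each sub-word unit to the acoustic frame encodings, and an output layer producing a per-sub-word confidence that is aggregated into an utterance-level score. The sketch below is a minimal, hypothetical PyTorch rendering of that flow; the module name, dimensions, masking details, and the choice of mean aggregation are illustrative assumptions, not the claimed implementation.

```python
# Minimal, hypothetical sketch of the confidence estimation module of claim 1.
# Names, dimensions, and the aggregation choice are illustrative assumptions only.
import torch
import torch.nn as nn


class ConfidenceEstimationModule(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # First attention mechanism: self-attention over confidence embeddings.
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Second attention mechanism: cross-attention to the acoustic encodings.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Output layer mapping the two context vectors to a confidence score.
        self.output_layer = nn.Linear(2 * dim, 1)

    def forward(
        self,
        conf_embeddings: torch.Tensor,     # (batch, num_subwords, dim), one per output step
        acoustic_encodings: torch.Tensor,  # (batch, num_frames, dim), one per acoustic frame
        word_mask: torch.Tensor,           # (num_subwords, num_subwords) bool; True blocks
                                           # attention to sub-words at later output steps or
                                           # belonging to a different word
    ) -> tuple[torch.Tensor, torch.Tensor]:
        # Confidence feature vectors: each sub-word self-attends to its own confidence
        # embedding and to those of earlier sub-words of the same word.
        conf_features, _ = self.self_attn(
            conf_embeddings, conf_embeddings, conf_embeddings, attn_mask=word_mask
        )
        # Acoustic context vectors: each sub-word cross-attends to the frame encodings.
        acoustic_context, _ = self.cross_attn(
            conf_embeddings, acoustic_encodings, acoustic_encodings
        )
        # Per-sub-word confidence output scores in [0, 1].
        subword_conf = torch.sigmoid(
            self.output_layer(torch.cat([conf_features, acoustic_context], dim=-1))
        ).squeeze(-1)                      # (batch, num_subwords)
        # Utterance-level confidence: mean aggregation is one simple choice among many.
        utterance_conf = subword_conf.mean(dim=-1)
        return subword_conf, utterance_conf
```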
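The joint-training element of the claim can likewise be sketched as back-propagating an utterance-level loss through both the confidence estimation module and at least part of the recognizer. The snippet below reuses the ConfidenceEstimationModule sketched above; the stand-in encoder, the dummy tensors, the binary labeling of utterances, and the loss and optimizer choices are all assumptions made for illustration.

```python
# Hypothetical joint-training step on an utterance-level loss, using the
# ConfidenceEstimationModule sketched above. All specifics here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, num_subwords, num_frames = 256, 6, 40
conf_module = ConfidenceEstimationModule(dim)
encoder_stub = nn.Linear(dim, dim)  # stand-in for the first speech recognizer's encoder

# Optimizing both parameter sets realizes "jointly training" the confidence
# estimation module and (here) the recognizer's encoder on the same loss.
optimizer = torch.optim.Adam(
    list(conf_module.parameters()) + list(encoder_stub.parameters()), lr=1e-4
)

# Dummy inputs standing in for confidence embeddings and acoustic frames.
conf_embeddings = torch.randn(1, num_subwords, dim)
acoustic_frames = torch.randn(1, num_frames, dim)
acoustic_encodings = encoder_stub(acoustic_frames)

# Stand-in mask: causal only; per the claim, it would also block attention to
# sub-words belonging to other words.
word_mask = torch.triu(torch.ones(num_subwords, num_subwords), diagonal=1).bool()

_, utterance_conf = conf_module(conf_embeddings, acoustic_encodings, word_mask)
utterance_target = torch.tensor([1.0])  # e.g., 1.0 if the transcription was fully correct

loss = F.binary_cross_entropy(utterance_conf, utterance_target)
optimizer.zero_grad()
loss.backward()   # the utterance-level loss updates both the confidence module
optimizer.step()  # and the (stand-in) recognizer encoder in one step
```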