US 12,456,460 B2
Multi-task learning for end-to-end automated speech recognition confidence and deletion estimation
David Qiu, Brookline, MA (US); Yanzhang He, Mountain View, CA (US); Yu Zhang, Mountain View, CA (US); Qiujia Li, Mountain View, CA (US); Liangliang Cao, Mountain View, CA (US); and Ian McGraw, Mountain View, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Dec. 11, 2021, as Appl. No. 17/643,826.
Claims priority of provisional application 63/166,399, filed on Mar. 26, 2021.
Prior Publication US 2022/0310080 A1, Sep. 29, 2022
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 15/197 (2013.01); G10L 15/02 (2006.01); G10L 15/04 (2013.01); G10L 15/06 (2013.01); G10L 15/16 (2006.01); G10L 15/22 (2006.01); G10L 15/30 (2013.01); G10L 15/32 (2013.01)
CPC G10L 15/197 (2013.01) [G10L 15/02 (2013.01); G10L 15/04 (2013.01); G10L 15/063 (2013.01); G10L 15/16 (2013.01); G10L 15/22 (2013.01); G10L 15/30 (2013.01); G10L 15/32 (2013.01); G10L 2015/025 (2013.01)] 24 Claims
OG exemplary drawing
 
1. A computer-implemented method that, when executed on data processing hardware, causes the data processing hardware to perform operations comprising:
receiving, from a first speech recognizer comprising an encoder and a decoder, a speech recognition result corresponding to a transcription of an utterance spoken by a user, the speech recognition result comprising a sequence of hypothesized sub-word units that form one or more words of the transcription of the utterance, each sub-word unit output from the first speech recognizer at a corresponding output step;
using a confidence estimation module, for each sub-word unit in the sequence of hypothesized sub-word units:
obtaining a respective confidence embedding associated with the corresponding output step when the corresponding sub-word unit was output from the first speech recognizer;
generating, using a first attention mechanism that self-attends to the respective confidence embedding for the corresponding sub-word unit and the confidence embeddings obtained for earlier sub-word units in the sequence of hypothesized sub-word units that correspond to the same word from the one or more words as the corresponding sub-word unit, a confidence feature vector;
generating, using a second attention mechanism that cross-attends to a sequence of encodings each associated with a corresponding acoustic frame segmented from audio data that corresponds to the utterance, an acoustic context vector; and
generating, as output from an output layer of the confidence estimation module, a respective confidence output score for the corresponding sub-word unit based on the confidence feature vector and the acoustic context vector received as input by the output layer of the confidence estimation module;
determining, based on the respective confidence output score generated for each sub-word unit in the sequence of hypothesized sub-word units, an utterance-level confidence score for the transcription of the utterance; and
training the confidence estimation module and at least one of the encoder or the decoder of the first speech recognizer jointly on an utterance-level loss.
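The claimed confidence-estimation procedure can be sketched in plain Python with numpy. This is an illustrative toy, not the patented implementation: the embedding size, the scaled dot-product attention, the sigmoid output layer over concatenated feature vectors, and the arithmetic-mean utterance-level aggregation are all assumptions chosen for clarity; the claim itself does not fix these choices. The function and variable names (`confidence_scores`, `word_ids`, `w_out`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # illustrative embedding size (an assumption, not from the claim)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector (assumed form)."""
    scores = keys @ query / np.sqrt(query.shape[-1])
    return softmax(scores) @ values

def confidence_scores(conf_embs, word_ids, acoustic_encs, w_out, b_out):
    """Per-sub-word confidence, following the claim's steps:
    (1) self-attend over this sub-word's confidence embedding and those of
        earlier sub-words belonging to the same word -> confidence feature vector;
    (2) cross-attend to the sequence of acoustic-frame encodings -> acoustic
        context vector;
    (3) an output layer (here a sigmoid over the concatenation) yields the
        respective confidence output score."""
    scores = []
    for t, emb in enumerate(conf_embs):
        # causal, same-word self-attention context
        same_word = [conf_embs[j] for j in range(t + 1)
                     if word_ids[j] == word_ids[t]]
        ctx = np.stack(same_word)
        feat = attend(emb, ctx, ctx)                          # confidence feature vector
        acoustic = attend(emb, acoustic_encs, acoustic_encs)  # acoustic context vector
        logit = np.concatenate([feat, acoustic]) @ w_out + b_out
        scores.append(1.0 / (1.0 + np.exp(-logit)))           # sigmoid output layer
    return np.array(scores)

# Toy inputs: 5 hypothesized sub-word units forming 2 words, 12 acoustic frames.
conf_embs = rng.normal(size=(5, D))
word_ids = [0, 0, 1, 1, 1]            # maps each sub-word unit to its word
acoustic_encs = rng.normal(size=(12, D))
w_out = rng.normal(size=2 * D)
sub_word_conf = confidence_scores(conf_embs, word_ids, acoustic_encs, w_out, 0.0)

# Utterance-level confidence from the sub-word scores; a simple mean is used
# here purely for illustration -- the claim leaves the aggregation unspecified.
utterance_conf = float(sub_word_conf.mean())
```

In training (the claim's final step), `utterance_conf` would feed an utterance-level loss back-propagated jointly through the confidence estimation module and the recognizer's encoder and/or decoder; that optimization loop is omitted here.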