US 12,073,824 B2
Two-pass end-to-end speech recognition
Tara N. Sainath, Jersey City, NJ (US); Yanzhang He, Palo Alto, CA (US); Bo Li, Fremont, CA (US); Arun Narayanan, Milpitas, CA (US); Ruoming Pang, New York, NY (US); Antoine Jean Bruguier, Milpitas, CA (US); Shuo-Yiin Chang, Mountain View, CA (US); and Wei Li, Fremont, CA (US)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Appl. No. 17/616,135
Filed by GOOGLE LLC, Mountain View, CA (US)
PCT Filed Dec. 3, 2020, PCT No. PCT/US2020/063012
§ 371(c)(1), (2) Date Dec. 2, 2021,
PCT Pub. No. WO2021/113443, PCT Pub. Date Jun. 10, 2021.
Claims priority of provisional application 62/943,703, filed on Dec. 4, 2019.
Prior Publication US 2022/0238101 A1, Jul. 28, 2022
Int. Cl. G10L 15/00 (2013.01); G06N 3/08 (2023.01); G10L 15/05 (2013.01); G10L 15/06 (2013.01); G10L 15/16 (2006.01); G10L 15/22 (2006.01)
CPC G10L 15/16 (2013.01) [G06N 3/08 (2013.01); G10L 15/05 (2013.01); G10L 15/063 (2013.01); G10L 15/22 (2013.01); G10L 2015/0635 (2013.01)] 14 Claims
OG exemplary drawing
 
1. A method implemented by one or more processors, the method comprising:
receiving audio data comprising a sequence of segments and capturing an utterance spoken by a human speaker;
for each of the segments, and in the sequence:
processing the segment using a first-pass portion of an automatic speech recognition (“ASR”) model to generate recurrent neural network transducer (“RNN-T”) output, wherein processing the segment using the first-pass portion of the ASR model comprises:
processing the segment using a shared encoder portion to generate shared encoder output,
adding the shared encoder output as the next item in a shared encoder buffer, and
processing the shared encoder output using an RNN-T decoder portion to generate a corresponding portion of RNN-T output;
determining one or more first-pass candidate text representations of the utterance based on the RNN-T output;
determining the human speaker has finished speaking the utterance;
in response to determining the human speaker has finished speaking the utterance:
processing the shared encoder output from the shared encoder buffer using an additional encoder to generate additional encoder output;
generating listen attend spell (“LAS”) output based on processing, using a second-pass LAS decoder portion of the ASR model, the additional encoder output along with at least one of (a) the RNN-T output or (b) the one or more first-pass candidate text representations of the utterance; and
generating a final text representation of the utterance based on the LAS output.
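The following is a minimal Python sketch of the two-pass flow recited in claim 1: a streaming first pass that buffers shared-encoder output while emitting RNN-T hypotheses, and a second pass that runs an additional encoder over the buffer and an LAS decoder over the result. All class names and interfaces (SharedEncoder, RNNTDecoder, AdditionalEncoder, LASDecoder, two_pass_recognize) are hypothetical stand-ins for illustration; they are not taken from the patent or any real library, and the stubbed bodies stand in for trained neural networks.

    # Structural sketch of the claimed two-pass ASR method.
    # All names and interfaces below are assumptions, not the
    # patent's or any library's actual API.
    from typing import List

    class SharedEncoder:
        """First-pass streaming encoder; consumes one audio segment at a time."""
        def encode(self, segment: List[float]) -> List[float]:
            return segment  # stand-in: a real encoder is a neural network

    class RNNTDecoder:
        """First-pass streaming decoder over shared-encoder output."""
        def decode(self, enc_out: List[float]) -> str:
            return "candidate"  # stand-in for a per-segment RNN-T hypothesis

    class AdditionalEncoder:
        """Second-pass encoder run over the full buffered shared-encoder output."""
        def encode(self, buffered: List[List[float]]) -> List[float]:
            return [x for seg in buffered for x in seg]  # stand-in

    class LASDecoder:
        """Second-pass LAS decoder; here it simply rescores the first-pass
        candidates against the additional encoder output (stand-in logic)."""
        def decode(self, add_enc_out: List[float],
                   first_pass_hyps: List[str]) -> str:
            return max(first_pass_hyps, key=len)

    def two_pass_recognize(segments: List[List[float]]) -> str:
        shared, rnnt = SharedEncoder(), RNNTDecoder()
        buffer: List[List[float]] = []   # the "shared encoder buffer"
        first_pass_hyps: List[str] = []

        # First pass: process each segment in sequence as it arrives.
        for segment in segments:
            enc_out = shared.encode(segment)
            buffer.append(enc_out)                     # buffer encoder output
            first_pass_hyps.append(rnnt.decode(enc_out))  # streaming RNN-T output

        # Endpoint: the speaker has finished the utterance, so run the
        # second pass over everything buffered during the first pass.
        add_enc_out = AdditionalEncoder().encode(buffer)
        # LAS decoder conditions on the additional encoder output together
        # with the first-pass candidates, yielding the final text.
        return LASDecoder().decode(add_enc_out, first_pass_hyps)

    if __name__ == "__main__":
        print(two_pass_recognize([[0.1, 0.2], [0.3, 0.4], [0.5]]))

The key structural point the sketch illustrates is that the shared encoder runs only once, during streaming; the second pass reuses its buffered output rather than re-encoding the audio, which is what lets the LAS pass refine the first-pass result without repeating the first pass's computation.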