US 12,444,408 B2
Two-pass end to end speech recognition
Tara N. Sainath, Jersey City, NJ (US); Ruoming Pang, New York, NY (US); David Rybach, Mountain View, CA (US); Yanzhang He, Palo Alto, CA (US); Rohit Prabhavalkar, Mountain View, CA (US); Wei Li, Fremont, CA (US); Mirkó Visontai, Mountain View, CA (US); Qiao Liang, Redwood City, CA (US); Trevor Strohman, Sunnyvale, CA (US); Yonghui Wu, Fremont, CA (US); Ian C. McGraw, Menlo Park, CA (US); and Chung-Cheng Chiu, Sunnyvale, CA (US)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Appl. No. 17/616,129
Filed by GOOGLE LLC, Mountain View, CA (US)
PCT Filed Jun. 3, 2020, PCT No. PCT/US2020/035912
§ 371(c)(1), (2) Date Dec. 2, 2021,
PCT Pub. No. WO2020/247489, PCT Pub. Date Dec. 10, 2020.
Claims priority of provisional application 62/943,703, filed on Dec. 4, 2019.
Claims priority of provisional application 62/856,815, filed on Jun. 4, 2019.
Prior Publication US 2022/0310072 A1, Sep. 29, 2022
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 15/16 (2006.01); G10L 15/05 (2013.01); G10L 15/32 (2013.01)
CPC G10L 15/16 (2013.01) [G10L 15/05 (2013.01); G10L 15/32 (2013.01)] 18 Claims
OG exemplary drawing
 
9. A method implemented by one or more processors, the method comprising:
receiving audio data comprising a sequence of segments and capturing an utterance spoken by a human speaker;
for each segment in the sequence of segments, and in sequence:
processing the segment using a first-pass portion of an automatic speech recognition (“ASR”) model to generate recurrent neural network transducer (“RNN-T”) output, wherein processing the segment using the first-pass portion of the ASR model comprises:
processing the segment using a shared encoder portion to generate shared encoder output,
adding the shared encoder output as the next item in a shared encoder buffer, and
processing the shared encoder output using an RNN-T decoder portion to generate a corresponding portion of RNN-T output;
determining one or more first-pass candidate text representations of the utterance based on the RNN-T output;
determining the human speaker has finished speaking the utterance;
in response to determining the human speaker has finished speaking the utterance, generating listen, attend and spell (“LAS”) output based on processing, using a second-pass LAS decoder portion of the ASR model, the shared encoder output from the shared encoder buffer along with at least one of (a) the RNN-T output or (b) the one or more first-pass candidate text representations of the utterance; and
generating a final text representation of the utterance based on the LAS output.
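For orientation, the flow recited in claim 9 — a streaming first pass in which a shared encoder feeds an RNN-T decoder segment by segment while the encoder outputs are buffered, followed, once the speaker is determined to have finished, by an LAS second pass over the buffered encoder outputs together with the first-pass candidates — can be sketched in a few lines of Python. This is a minimal toy illustration under assumed interfaces; every function name and the trivial scoring rule are placeholders, not the patented implementation:

```python
# A self-contained sketch of the claimed two-pass decode flow.
# All names (toy_encoder, toy_rnnt_decoder, toy_las_rescorer) and the
# dummy math are hypothetical stand-ins, not the patent's models.

from typing import List

def toy_encoder(segment: List[float]) -> List[float]:
    """Stand-in for the shared encoder: one feature vector per segment."""
    return [sum(segment) / max(len(segment), 1)]

def toy_rnnt_decoder(enc_out: List[float]) -> str:
    """Stand-in for the streaming RNN-T decoder over one encoder output."""
    return "hi" if enc_out[0] > 0 else "bye"

def toy_las_rescorer(buffer: List[List[float]], cands: List[str]) -> str:
    """Stand-in for the LAS second pass: attends over the whole buffered
    utterance and selects (rescores) among the first-pass candidates."""
    return max(cands, key=len)  # trivial 'rescoring' for illustration

def recognize(segments: List[List[float]]) -> str:
    encoder_buffer: List[List[float]] = []  # the "shared encoder buffer"
    first_pass: List[str] = []

    # First pass: process each segment in sequence, as audio streams in.
    for segment in segments:
        enc_out = toy_encoder(segment)
        encoder_buffer.append(enc_out)       # retained for the second pass
        first_pass.append(toy_rnnt_decoder(enc_out))

    # First-pass candidate text representation(s) of the utterance.
    candidates = [" ".join(first_pass)]

    # Second pass: runs only after endpointing (speaker finished speaking);
    # the LAS decoder consumes the buffered encoder outputs plus candidates.
    return toy_las_rescorer(encoder_buffer, candidates)

if __name__ == "__main__":
    print(recognize([[0.1, 0.3], [-0.2, 0.4], [0.5]]))
```

Buffering the shared encoder outputs is what lets the non-streaming LAS pass attend over the entire utterance without re-encoding the audio, while sharing the encoder keeps the two passes from duplicating front-end computation.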