US 12,254,869 B2
One model unifying streaming and non-streaming speech recognition
Anshuman Tripathi, Mountain View, CA (US); Hasim Sak, Santa Clara, CA (US); Han Lu, Redmond, WA (US); Qian Zhang, Mountain View, CA (US); and Jaeyoung Kim, Cupertino, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Jul. 24, 2023, as Appl. No. 18/357,225.
Application 18/357,225 is a continuation of application No. 17/210,465, filed on Mar. 23, 2021, granted, now 11,741,947.
Claims priority of provisional application 63/087,817, filed on Oct. 5, 2020.
Prior Publication US 2023/0368779 A1, Nov. 16, 2023
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 15/16 (2006.01); G06N 3/04 (2023.01); G06N 3/088 (2023.01); G10L 15/06 (2013.01); G10L 15/197 (2013.01); G10L 15/22 (2006.01); G10L 15/30 (2013.01)
CPC G10L 15/16 (2013.01) [G06N 3/04 (2013.01); G06N 3/088 (2013.01); G10L 15/063 (2013.01); G10L 15/197 (2013.01); G10L 15/22 (2013.01); G10L 15/30 (2013.01)] 18 Claims
OG exemplary drawing
 
1. A computer-implemented method executed on data processing hardware that causes the data processing hardware to perform operations comprising:
receiving audio data corresponding to a spoken utterance;
encoding, by an initial stack of multi-head attention layers, the audio data to compute shared activations;
while receiving the audio data corresponding to the spoken utterance:
encoding, by a final stack of multi-head attention layers while applying a first look ahead audio context, the shared activations to compute low latency activations; and
decoding the low latency activations into partial speech recognition results for the spoken utterance; and
after the audio data corresponding to the spoken utterance is received:
encoding, by the final stack of multi-head attention layers while applying a second look ahead audio context, the shared activations to compute high latency activations; and
decoding the high latency activations into a final speech recognition result for the spoken utterance,
wherein:
the initial stack of multi-head attention layers is trained with zero look ahead audio context; and
the final stack of multi-head attention layers is trained with variable look ahead audio context.
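
The following is a minimal, hypothetical sketch (not the patented implementation) of the cascaded-encoder scheme recited in claim 1, written in Python with PyTorch: an initial stack of multi-head attention layers runs with zero look ahead audio context to compute shared activations, and a final stack re-encodes those shared activations with either a small look ahead audio context (for partial, streaming results) or a full-utterance look ahead audio context (for the final result). The layer sizes, the mask construction, and the linear layer standing in for a speech recognition decoder are all illustrative assumptions.

    # Hypothetical sketch of the two-stack, variable look-ahead encoder in
    # claim 1. Names, dimensions, and the decoder stand-in are assumptions.
    import torch
    import torch.nn as nn


    def attention_mask(num_frames: int, look_ahead: int) -> torch.Tensor:
        """Boolean mask where True marks positions a frame may NOT attend to.

        look_ahead = 0 -> strictly causal (zero look ahead audio context)
        look_ahead = k -> each frame may also attend to k future frames
        """
        idx = torch.arange(num_frames)
        return idx[None, :] > (idx[:, None] + look_ahead)


    class AttentionStack(nn.Module):
        """A stack of multi-head self-attention (Transformer encoder) layers."""

        def __init__(self, dim: int = 256, heads: int = 4, layers: int = 2):
            super().__init__()
            layer = nn.TransformerEncoderLayer(
                d_model=dim, nhead=heads, batch_first=True)
            self.stack = nn.TransformerEncoder(layer, num_layers=layers)

        def forward(self, x: torch.Tensor, look_ahead: int) -> torch.Tensor:
            mask = attention_mask(x.size(1), look_ahead).to(x.device)
            return self.stack(x, mask=mask)


    class UnifiedEncoder(nn.Module):
        """Initial stack (zero look-ahead) feeding a final stack (variable look-ahead)."""

        def __init__(self, dim: int = 256):
            super().__init__()
            self.initial = AttentionStack(dim)  # zero look ahead audio context
            self.final = AttentionStack(dim)    # variable look ahead audio context
            self.decoder = nn.Linear(dim, 32)   # stand-in for an RNN-T/CTC decoder

        def forward(self, audio_features: torch.Tensor, look_ahead: int):
            shared = self.initial(audio_features, look_ahead=0)  # shared activations
            activations = self.final(shared, look_ahead=look_ahead)
            return self.decoder(activations)                     # per-frame logits


    if __name__ == "__main__":
        model = UnifiedEncoder()
        frames = torch.randn(1, 20, 256)                    # (batch, time, feature)
        partial = model(frames, look_ahead=2)                # low latency (streaming) pass
        final = model(frames, look_ahead=frames.size(1))     # high latency (full-context) pass
        print(partial.shape, final.shape)

Because the initial stack is strictly causal in this sketch, its shared activations can be computed once while audio is being received and then reused by the final stack for the second, larger look ahead pass, which reflects the partial-then-final recognition flow described in the claim.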