US 11,741,947 B2
Transformer transducer: one model unifying streaming and non-streaming speech recognition
Anshuman Tripathi, Mountain View, CA (US); Hasim Sak, Santa Clara, CA (US); Han Lu, Santa Clara, CA (US); Qian Zhang, Mountain View, CA (US); and Jaeyoung Kim, Mountain View, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Mar. 23, 2021, as Appl. No. 17/210,465.
Claims priority of provisional application 63/087,817, filed on Oct. 5, 2020.
Prior Publication US 2022/0108689 A1, Apr. 7, 2022
Int. Cl. G10L 15/16 (2006.01); G06N 3/04 (2023.01); G06N 3/088 (2023.01); G10L 15/06 (2013.01); G10L 15/197 (2013.01); G10L 15/22 (2006.01); G10L 15/30 (2013.01)
CPC G10L 15/16 (2013.01) [G06N 3/04 (2013.01); G06N 3/088 (2013.01); G10L 15/063 (2013.01); G10L 15/197 (2013.01); G10L 15/22 (2013.01); G10L 15/30 (2013.01)] 27 Claims
OG exemplary drawing
 
1. A single transformer-transducer model for unifying streaming and non-streaming speech recognition, the single transformer-transducer model comprising:
an audio encoder configured to:
receive, as input, a sequence of acoustic frames; and
generate, at each of a plurality of time steps, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames;
a label encoder configured to:
receive, as input, a sequence of non-blank symbols output by a final softmax layer; and
generate, at each of the plurality of time steps, a dense representation; and
a joint network configured to:
receive, as input, the higher order feature representation generated by the audio encoder at each of the plurality of time steps and the dense representation generated by the label encoder at each of the plurality of time steps; and
generate, at each of the plurality of time steps, a probability distribution over possible speech recognition hypotheses at the corresponding time step,
wherein the audio encoder comprises a neural network having a plurality of transformer layers, the plurality of transformer layers comprising:
an initial stack of transformer layers each trained with zero look ahead audio context; and
a final stack of transformer layers each trained with a variable look ahead audio context.
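
The claim recites an audio encoder, a label encoder, and a joint network, with the audio encoder split into an initial stack of zero look-ahead transformer layers and a final stack of variable look-ahead layers. The sketch below illustrates one way such an architecture could be realized; it is not the patented implementation. Layer counts, model dimensions, the attention-mask scheme, and the choice of an LSTM label encoder are illustrative assumptions only.

```python
# Minimal, hypothetical PyTorch sketch of a transformer-transducer of the
# kind recited in the claim. All hyperparameters and design details are
# assumptions for illustration, not the patented model.
import torch
import torch.nn as nn


def causal_mask(t: int, look_ahead: int) -> torch.Tensor:
    """Boolean mask letting frame i attend only to frames <= i + look_ahead."""
    idx = torch.arange(t)
    return idx[None, :] > (idx[:, None] + look_ahead)  # True = masked out


class AudioEncoder(nn.Module):
    """Initial stack with zero look-ahead; final stack with variable look-ahead."""

    def __init__(self, d_model=256, n_heads=4, n_initial=10, n_final=5):
        super().__init__()
        def layer():
            return nn.TransformerEncoderLayer(
                d_model, n_heads, dim_feedforward=1024, batch_first=True)
        self.initial = nn.ModuleList([layer() for _ in range(n_initial)])
        self.final = nn.ModuleList([layer() for _ in range(n_final)])

    def forward(self, frames, look_ahead=0):
        # frames: (batch, time, d_model) sequence of acoustic frames
        t = frames.size(1)
        x = frames
        for lyr in self.initial:                       # zero look ahead audio context
            x = lyr(x, src_mask=causal_mask(t, 0))
        for lyr in self.final:                         # variable look ahead audio context
            x = lyr(x, src_mask=causal_mask(t, look_ahead))
        return x                                       # higher order feature representations


class LabelEncoder(nn.Module):
    """Maps previously emitted non-blank symbols to a dense representation.
    An LSTM is used here purely for brevity; the claim does not fix the type."""

    def __init__(self, vocab_size, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)

    def forward(self, labels):
        out, _ = self.lstm(self.embed(labels))
        return out


class JointNetwork(nn.Module):
    """Combines audio and label representations into an output distribution."""

    def __init__(self, d_model=256, vocab_size=1000):
        super().__init__()
        self.proj = nn.Linear(2 * d_model, d_model)
        self.out = nn.Linear(d_model, vocab_size + 1)  # +1 for the blank symbol

    def forward(self, audio_enc, label_enc):
        # audio_enc: (B, T, D), label_enc: (B, U, D) -> log-probs (B, T, U, V+1)
        a = audio_enc.unsqueeze(2).expand(-1, -1, label_enc.size(1), -1)
        l = label_enc.unsqueeze(1).expand(-1, audio_enc.size(1), -1, -1)
        h = torch.tanh(self.proj(torch.cat([a, l], dim=-1)))
        return self.out(h).log_softmax(dim=-1)
```

In this sketch, setting `look_ahead=0` at inference yields streaming behavior, while a larger `look_ahead` (or full context) yields non-streaming recognition from the same weights; this masking-based switch is one plausible reading of the "variable look ahead audio context" limitation, not a statement of how the patented model is implemented.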