US 12,136,415 B2
Mixture model attention for flexible streaming and non-streaming automatic speech recognition
Kartik Audhkhasi, Mountain View, CA (US); Bhuvana Ramabhadran, Mt. Kisco, NY (US); Tongzhou Chen, Mountain View, CA (US); and Pedro J. Moreno Mengibar, Jersey City, NJ (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Dec. 15, 2021, as Appl. No. 17/644,343.
Claims priority of provisional application 63/166,347, filed on Mar. 26, 2021.
Prior Publication US 2022/0310073 A1, Sep. 29, 2022
Int. Cl. G10L 15/16 (2006.01); G06F 1/03 (2006.01); G06N 3/04 (2023.01); G06N 3/0455 (2023.01); G10L 19/16 (2013.01)
CPC G10L 15/16 (2013.01) [G06F 1/03 (2013.01); G06N 3/04 (2013.01); G06N 3/0455 (2023.01); G10L 19/167 (2013.01)] 18 Claims
 
1. Data processing hardware executing instructions stored on memory hardware that cause the data processing hardware to execute an automated speech recognition (ASR) model for unifying streaming and non-streaming speech recognition, the ASR model comprising:
an audio encoder configured to:
receive, as input, a sequence of acoustic frames; and
generate, at each of a plurality of time steps, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames; and
a joint network configured to:
receive, as input, the higher order feature representation generated by the audio encoder at each of the plurality of time steps; and
generate, at each of the plurality of time steps, a probability distribution over possible speech recognition hypotheses at the corresponding time step,
wherein the audio encoder comprises a neural network that applies mixture model (MiMo) attention to compute an attention probability distribution function (PDF) using a set of mixture components of softmaxes over a context window spanning from a left+center context to a right context, the set of mixture components of softmaxes comprising:
a first mixture component that operates over the left+center context; and
a second mixture component that operates over the right context,
wherein the ASR model switches between streaming and non-streaming modes by adjusting mixture weights of the MiMo attention.
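
For illustration only, the mixture of softmaxes recited in claim 1 can be sketched for a single query frame as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the function names `softmax` and `mimo_attention_pdf`, the use of a single scalar `right_weight` with the left+center weight taken as `1 - right_weight`, and the example scores are all hypothetical. The claim itself states only that the mixture weights are adjusted to switch modes.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats (empty in, empty out)."""
    if not scores:
        return []
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mimo_attention_pdf(scores, center, right_weight):
    """Attention PDF for one query frame as a mixture of two softmaxes.

    scores[: center + 1] is treated as the left+center context and
    scores[center + 1 :] as the right context. The parameterization by a
    single right_weight in [0, 1] is an assumption made for this sketch.
    """
    if not 0.0 <= right_weight <= 1.0:
        raise ValueError("right_weight must lie in [0, 1]")
    left = softmax(scores[: center + 1])   # first mixture component
    right = softmax(scores[center + 1:])   # second mixture component
    if not right:
        # No future frames are available, so all mass goes to left+center.
        right_weight = 0.0
    left_weight = 1.0 - right_weight
    return [left_weight * p for p in left] + [right_weight * p for p in right]


# Hypothetical attention scores for a 5-frame context window, query at index 2.
scores = [0.5, 1.0, -0.3, 2.0, 0.1]

# Streaming mode in this sketch: the right-context weight is set to zero.
streaming = mimo_attention_pdf(scores, center=2, right_weight=0.0)

# Non-streaming mode in this sketch: some mass is placed on future frames.
non_streaming = mimo_attention_pdf(scores, center=2, right_weight=0.4)
```

In this sketch, both settings yield a valid probability distribution over the context window. With `right_weight=0.0` the frames after the query receive exactly zero attention, which is the behavior a streaming mode needs, while a nonzero `right_weight` lets the same scores attend to the right context.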