CPC G10L 15/16 (2013.01) [G06F 1/03 (2013.01); G06N 3/04 (2013.01); G06N 3/0455 (2023.01); G10L 19/167 (2013.01)] | 18 Claims |
1. Data processing hardware executing instructions stored on memory hardware that cause the data processing hardware to execute an automated speech recognition (ASR) model for unifying streaming and non-streaming speech recognition, the ASR model comprising:
an audio encoder configured to:
receive, as input, a sequence of acoustic frames; and
generate, at each of a plurality of time steps, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames; and
a joint network configured to:
receive, as input, the higher order feature representation generated by the audio encoder at each of the plurality of time steps; and
generate, at each of the plurality of time steps, a probability distribution over possible speech recognition hypotheses at the corresponding time step,
wherein the audio encoder comprises a neural network that applies mixture model (MiMo) attention to compute an attention probability distribution function (PDF) using a set of mixture components of softmaxes over a context window spanning from a left+center context to a right context, the set of mixture components of softmaxes comprising:
a first mixture component that operates over the left+center context; and
a second mixture component that operates over the right context,
wherein the ASR model switches between streaming and non-streaming modes by adjusting mixture weights of the MiMo attention.
|
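The attention mechanism recited in claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation. It assumes a single query frame, two mixture components whose weights sum to one, and a streaming mode obtained by setting the right-context weight to zero; the claim itself does not state how the weights are chosen. The names `softmax`, `mimo_attention_pdf`, and `right_weight` are hypothetical and appear nowhere in the claim.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    if not scores:
        return []
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mimo_attention_pdf(left_center_scores, right_scores, right_weight):
    """Mix two softmaxes into one attention PDF over the context window.

    Assumption (not from the claim): `right_weight` is the mixture weight
    of the right-context component, and the left+center component
    receives 1 - right_weight.
    """
    if not 0.0 <= right_weight <= 1.0:
        raise ValueError("right_weight must lie in [0, 1]")
    if not left_center_scores:
        raise ValueError("the left+center context must contain the center frame")
    if not right_scores:
        # No future frames available, so all mass goes to left+center.
        right_weight = 0.0

    # First mixture component: softmax over the left+center context only.
    left_pdf = softmax(left_center_scores)
    # Second mixture component: softmax over the right context only.
    right_pdf = softmax(right_scores)

    left_weight = 1.0 - right_weight
    return ([left_weight * p for p in left_pdf]
            + [right_weight * p for p in right_pdf])


# Hypothetical attention scores for one query frame.
left_center = [0.2, 1.0, 0.5]   # two left frames plus the center frame
right = [0.3, -0.1]             # two future frames

# Streaming mode: zero weight on the right context, so no future frame is used.
streaming_pdf = mimo_attention_pdf(left_center, right, right_weight=0.0)

# Non-streaming mode: part of the probability mass is assigned to future frames.
full_pdf = mimo_attention_pdf(left_center, right, right_weight=0.4)
```

Because each component is normalized separately, the combined vector stays a valid probability distribution for any weight in [0, 1], and changing the weight alone moves the same parameters between the two modes.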