US 11,715,458 B2
Efficient streaming non-recurrent on-device end-to-end model
Tara Sainath, Jersey City, NJ (US); Arun Narayanan, Milpitas, CA (US); Rami Botros, Mountain View, CA (US); Yanzhang He, Mountain View, CA (US); Ehsan Variani, Mountain View, CA (US); Cyril Allauzen, Mountain View, CA (US); David Rybach, Aachen (DE); Ruoming Pang, New York, NY (US); and Trevor Strohman, Mountain View, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on May 10, 2021, as Appl. No. 17/316,198.
Claims priority of provisional application 63/165,068, filed on Mar. 23, 2021.
Prior Publication US 2022/0310062 A1, Sep. 29, 2022
Int. Cl. G10L 15/00 (2013.01); G10L 15/06 (2013.01); G10L 15/02 (2006.01); G10L 15/22 (2006.01); G10L 15/30 (2013.01)
CPC G10L 15/063 (2013.01) [G10L 15/02 (2013.01); G10L 15/22 (2013.01); G10L 15/30 (2013.01)] 20 Claims
OG exemplary drawing
 
1. Memory hardware storing instructions that, when executed by data processing hardware, cause the data processing hardware to implement an automated speech recognition (ASR) model, the ASR model comprising:
a first encoder configured to:
receive, as input, a sequence of acoustic frames corresponding to an utterance; and
generate, at each of a plurality of output steps, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames;
a second encoder configured to:
receive, as input, the first higher order feature representation generated by the first encoder at each of the plurality of output steps; and
generate, at each of the plurality of output steps, a second higher order feature representation for a corresponding first higher order feature representation;
a decoder configured to:
receive, as input, the second higher order feature representation generated by the second encoder at each of the plurality of output steps; and
generate, at each of the plurality of output steps, a first probability distribution over possible speech recognition hypotheses; and
a language model configured to:
receive, as input, the first probability distribution over possible speech recognition hypotheses; and
generate, at each of the plurality of output steps, a rescored probability distribution over possible speech recognition hypotheses to generate a transcription for the utterance.
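The claimed pipeline — a first encoder feeding a cascaded second encoder, a decoder emitting a per-step probability distribution over speech recognition hypotheses, and a language model rescoring that distribution — can be sketched as follows. This is a hypothetical, minimal illustration only: the layer shapes, the plain linear projections, the log-linear LM rescoring, and all names (`Encoder`, `Decoder`, `LanguageModel`, `rescore`) are assumptions for exposition, not the patented implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class Encoder:
    """Maps each input frame to a higher order feature representation."""
    def __init__(self, d_in, d_out):
        self.w = rng.standard_normal((d_in, d_out)) * 0.1
    def __call__(self, frames):           # (T, d_in) -> (T, d_out)
        return np.tanh(frames @ self.w)

class Decoder:
    """Emits a first probability distribution over hypotheses per output step."""
    def __init__(self, d_in, vocab):
        self.w = rng.standard_normal((d_in, vocab)) * 0.1
    def __call__(self, feats):            # (T, d_in) -> (T, vocab)
        return softmax(feats @ self.w)

class LanguageModel:
    """Rescores the decoder distribution with per-token LM scores (assumed log-linear)."""
    def __init__(self, vocab):
        self.scores = rng.standard_normal(vocab) * 0.1
    def rescore(self, probs):             # (T, vocab) -> (T, vocab)
        return softmax(np.log(probs) + self.scores)

# One utterance: T acoustic frames of dimension 80 (e.g., log-mel features).
T, d_acoustic, d_enc1, d_enc2, vocab = 6, 80, 32, 32, 10
frames = rng.standard_normal((T, d_acoustic))

first_encoder = Encoder(d_acoustic, d_enc1)
second_encoder = Encoder(d_enc1, d_enc2)
decoder = Decoder(d_enc2, vocab)
lm = LanguageModel(vocab)

h1 = first_encoder(frames)                # first higher order feature representations
h2 = second_encoder(h1)                   # second higher order feature representations
first_probs = decoder(h2)                 # first probability distribution per step
rescored = lm.rescore(first_probs)        # rescored probability distribution per step
transcription = rescored.argmax(axis=-1)  # greedy token per step -> transcription
```

Per the claim, the decoder consumes only the second encoder's output; a streaming deployment would instead run the first (causal) encoder incrementally and the second encoder over accumulated features, a detail this sketch omits.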