US 12,190,869 B2
Optimizing inference performance for conformer
Tara N. Sainath, Jersey City, NJ (US); Rami Botros, Mountain View, CA (US); Anmol Gulati, Mountain View, CA (US); Krzysztof Choromanski, Mountain View, CA (US); Ruoming Pang, New York, NY (US); Trevor Strohman, Mountain View, CA (US); Weiran Wang, Mountain View, CA (US); and Jiahui Yu, Mountain View, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Sep. 29, 2022, as Appl. No. 17/936,547.
Claims priority of provisional application 63/262,140, filed on Oct. 5, 2021.
Prior Publication US 2023/0130634 A1, Apr. 27, 2023
Int. Cl. G10L 15/16 (2006.01); G10L 15/06 (2013.01); G10L 15/22 (2006.01)
CPC G10L 15/16 (2013.01) [G10L 15/063 (2013.01); G10L 15/22 (2013.01); G10L 2015/223 (2013.01)] 18 Claims
OG exemplary drawing
 
1. A system comprising data processing hardware and memory hardware in communication with the data processing hardware and storing instructions that, when executed on the data processing hardware, cause the data processing hardware to perform operations comprising implementing an automated speech recognition (ASR) model, the ASR model comprising:
a causal encoder comprising a stack of causal encoder layers, the causal encoder configured to:
receive, as input, a sequence of acoustic frames; and
generate, at each of a plurality of output steps, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames; and
a decoder configured to:
receive, as input, the first higher order feature representation generated by the causal encoder at each of the plurality of output steps; and
generate, at each of the plurality of output steps, a first probability distribution over possible speech recognition hypotheses,
wherein each causal encoder layer in the stack of causal encoder layers includes a Recurrent Neural Network (RNN) Attention-Performer module that applies linear attention,
wherein, during pre-training of the ASR model, each causal encoder layer comprises:
a first feedforward module;
a convolution module;
a multi-head attention module;
a second feedforward module; and
a layernorm module.
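For orientation only, the following PyTorch sketch illustrates the two architectural ideas named in claim 1: (a) a kernelized linear-attention function in the spirit of the Performer, here using a simple elu-based positive feature map rather than the FAVOR+ random features of the Performer, and (b) a conformer-style encoder layer in the pre-training layout recited in the claim (first feedforward, convolution, multi-head attention, second feedforward, layernorm). This is not the patented implementation; all dimensions, module names, residual scaling, and other details beyond the claim text are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def linear_attention(q, k, v):
    """Kernelized (linear) attention: a positive feature map replaces softmax,
    so cost is linear in sequence length. Shapes: (batch, time, heads, dim).
    Unmasked form shown; a causal/streaming encoder would instead carry
    running prefix sums of the key-value summary and the normalizer."""
    q = F.elu(q) + 1.0          # simple positive feature map (assumption;
    k = F.elu(k) + 1.0          # the Performer uses FAVOR+ random features)
    kv = torch.einsum("bthd,bthe->bhde", k, v)            # key-value summary
    z = 1.0 / (torch.einsum("bthd,bhd->bth", q, k.sum(dim=1)) + 1e-6)
    return torch.einsum("bthd,bhde,bth->bthe", q, kv, z)


class FeedForward(nn.Module):
    """Pre-norm feedforward module with an illustrative 4x expansion."""

    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, expansion * dim),
            nn.SiLU(),
            nn.Linear(expansion * dim, dim),
        )

    def forward(self, x):
        return self.net(x)


class ConformerLayer(nn.Module):
    """One encoder layer in the pre-training layout of claim 1: feedforward,
    convolution, multi-head attention, feedforward, layernorm. Hyperparameters
    are illustrative, not taken from the patent."""

    def __init__(self, dim: int = 256, heads: int = 4, kernel_size: int = 15):
        super().__init__()
        self.ff1 = FeedForward(dim)
        self.conv_norm = nn.LayerNorm(dim)
        # Depthwise convolution over time (causal padding omitted for brevity).
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff2 = FeedForward(dim)
        self.out_norm = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, time, dim); each module is applied with a residual.
        x = x + 0.5 * self.ff1(x)
        y = self.conv_norm(x).transpose(1, 2)          # (batch, dim, time)
        x = x + self.conv(y).transpose(1, 2)
        y = self.attn_norm(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        x = x + 0.5 * self.ff2(x)
        return self.out_norm(x)


if __name__ == "__main__":
    frames = torch.randn(2, 100, 256)                  # dummy acoustic frames
    print(ConformerLayer()(frames).shape)              # torch.Size([2, 100, 256])
```

In this sketch the standard softmax multi-head attention is used during pre-training; for inference, the claim contemplates swapping in a linear-attention mechanism such as the `linear_attention` function above so that per-frame cost does not grow with the attention context.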