US 12,417,770 B2
Unified cascaded encoder ASR model for dynamic model sizes
Shaojin Ding, Mountain View, CA (US); Yangzhang He, Mountain View, CA (US); Xin Wang, Mountain View, CA (US); Weiran Wang, Palo Alto, CA (US); Trevor Strohman, Mountain View, CA (US); Tara N. Sainath, Jersey City, NJ (US); Rohit Prakash Prabhavalkar, Palo Alto, CA (US); Robert David, Mountain View, CA (US); Rina Panigrahy, Mountain View, CA (US); Rami Botros, Mountain View, CA (US); Qiao Liang, Mountain View, CA (US); Ian Mcgraw, Mountain View, CA (US); Ding Zhao, Mountain View, CA (US); and Dongseong Hwang, Mountain View, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Mar. 13, 2023, as Appl. No. 18/182,925.
Claims priority of provisional application 63/269,703, filed on Mar. 21, 2022.
Prior Publication US 2023/0326461 A1, Oct. 12, 2023
Int. Cl. G10L 15/32 (2013.01); G10L 15/16 (2006.01); G10L 15/22 (2006.01)
CPC G10L 15/32 (2013.01) [G10L 15/16 (2013.01); G10L 15/22 (2013.01); G10L 2015/223 (2013.01)] 19 Claims
OG exemplary drawing
 
1. An automated speech recognition (ASR) model comprising:
a first encoder configured to:
receive, as input, a sequence of acoustic frames; and
generate, at each of a plurality of output steps, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames;
a first decoder configured to:
receive, as input, the first higher order feature representation generated by the first encoder at each of the plurality of output steps; and
generate, at each of the plurality of output steps, a first probability distribution over possible speech recognition hypotheses;
a second encoder configured to:
receive, as input, the first higher order feature representation generated by the first encoder at each of the plurality of output steps; and
generate, at each of the plurality of output steps, a second higher order feature representation for a corresponding first higher order feature frame; and
a second decoder configured to:
receive, as input, the second higher order feature representation generated by the second encoder at each of the plurality of output steps; and
generate, at each of the plurality of output steps, a second probability distribution over possible speech recognition hypotheses,
wherein the first decoder and the second decoder each comprise a respective recurrent neural network-transducer (RNN-T) architecture having a same number of parameters.
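The claim describes a cascaded data flow: acoustic frames pass through a first encoder whose output feeds both a first decoder and a second encoder, and the second encoder's output feeds a second decoder, with the two decoders sharing one parameter count. The sketch below illustrates only that wiring; it is not the patented implementation. All names are hypothetical, and single affine projections stand in for the Conformer/LSTM encoder stacks and full RNN-T decoders an actual ASR model would use.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(d_in, d_out):
    # Hypothetical helper: a random affine map standing in for a trained layer.
    W = rng.standard_normal((d_in, d_out)) * 0.1
    b = np.zeros(d_out)
    return lambda x: x @ W + b

def softmax(z):
    # Probability distribution over possible speech recognition hypotheses.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

D_ACOUSTIC, D_ENC, N_VOCAB = 80, 64, 128  # assumed illustrative sizes

# First encoder: sequence of acoustic frames -> first higher order features.
first_encoder = linear(D_ACOUSTIC, D_ENC)
# Second encoder is cascaded: it consumes the FIRST encoder's output,
# not the raw acoustic frames.
second_encoder = linear(D_ENC, D_ENC)

# In the patent each decoder is an RNN-T; here a single projection stands in.
# Both are built with identical shapes, so they have the same number of
# parameters, mirroring the final "wherein" clause of the claim.
first_decoder = linear(D_ENC, N_VOCAB)
second_decoder = linear(D_ENC, N_VOCAB)

frames = rng.standard_normal((10, D_ACOUSTIC))  # 10 acoustic frames
h1 = first_encoder(frames)                      # first higher order features
p1 = softmax(first_decoder(h1))                 # first distribution per step
h2 = second_encoder(h1)                         # second higher order features
p2 = softmax(second_decoder(h2))                # second distribution per step
```

A deployment can run only the first encoder/decoder pair for a small, low-latency model, or the full cascade for a larger, more accurate one, which is the "dynamic model sizes" idea in the title.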