US 12,106,749 B2
Speech recognition with sequence-to-sequence models
Rohit Prakash Prabhavalkar, Mountain View, CA (US); Zhifeng Chen, Sunnyvale, CA (US); Bo Li, Fremont, CA (US); Chung-cheng Chiu, Sunnyvale, CA (US); Kanury Kanishka Rao, Santa Clara, CA (US); Yonghui Wu, Fremont, CA (US); Ron J. Weiss, New York, NY (US); Navdeep Jaitly, Mountain View, CA (US); Michiel A. U. Bacchiani, Summit, NJ (US); Tara N. Sainath, Jersey City, NJ (US); Jan Kazimierz Chorowski, Poland (PL); Anjuli Patricia Kannan, Berkeley, CA (US); Ekaterina Gonina, Sunnyvale, CA (US); and Patrick An Phu Nguyen, Palo Alto, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Sep. 20, 2021, as Appl. No. 17/448,119.
Application 17/448,119 is a continuation of application No. 16/516,390, filed on Jul. 19, 2019, granted, now Pat. No. 11,145,293.
Claims priority of provisional application 62/701,237, filed on Jul. 20, 2018.
Prior Publication US 2022/0005465 A1, Jan. 6, 2022
Int. Cl. G10L 15/00 (2013.01); G06N 3/08 (2023.01); G10L 15/02 (2006.01); G10L 15/06 (2013.01); G10L 15/16 (2006.01); G10L 15/22 (2006.01); G10L 25/30 (2013.01); G10L 15/26 (2006.01)
CPC G10L 15/16 (2013.01) [G06N 3/08 (2013.01); G10L 15/02 (2013.01); G10L 15/063 (2013.01); G10L 15/22 (2013.01); G10L 25/30 (2013.01); G10L 2015/025 (2013.01); G10L 15/26 (2013.01)] 20 Claims
OG exemplary drawing
 
11. A system comprising:
data processing hardware; and
memory hardware in communication with the data processing hardware and storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising:
obtaining an n-best list of decoded speech recognition hypotheses for a training utterance;
training, using a loss function having a minimum word error rate (MWER) criterion, a recurrent neural network model by determining a word error rate expectation for the training utterance that is restricted to the n-best list of decoded speech recognition hypotheses for the training utterance; and
generating, using the trained recurrent neural network model, a transcription for audio data indicating acoustic characteristics of an utterance.
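The claim's training operation computes an expected word error rate restricted to the n-best list: hypothesis probabilities are renormalized over the n-best entries, and each hypothesis's word-error count is weighted by that renormalized probability. The following is a minimal illustrative sketch of that expectation, not code from the patent; the function name, arguments, and the mean-error baseline subtraction (a common variance-reduction choice in MWER training) are assumptions for illustration.

```python
import math

def mwer_loss(nbest_log_probs, nbest_word_errors):
    """Expected word-error loss restricted to an n-best list.

    nbest_log_probs: model log-probabilities, one per decoded hypothesis.
    nbest_word_errors: word-level edit-distance errors of each hypothesis
        against the reference transcription of the training utterance.
    """
    # Renormalize the hypothesis probabilities over the n-best list only
    # (a softmax over the list), so the expectation is restricted to it.
    m = max(nbest_log_probs)
    exps = [math.exp(lp - m) for lp in nbest_log_probs]
    z = sum(exps)
    probs = [e / z for e in exps]

    # Subtract the mean error over the list as a baseline; this leaves the
    # gradient direction unchanged while reducing its variance.
    mean_err = sum(nbest_word_errors) / len(nbest_word_errors)
    return sum(p * (w - mean_err)
               for p, w in zip(probs, nbest_word_errors))
```

With two equally likely hypotheses having 0 and 2 word errors, the baseline-subtracted expectation is 0; skewing probability toward the lower-error hypothesis drives the loss negative, which is the behavior the MWER criterion rewards during training.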