US 12,094,453 B2
Fast emit low-latency streaming ASR with sequence-level emission regularization utilizing forward and backward probabilities between nodes of an alignment lattice
Jiahui Yu, Mountain View, CA (US); Chung-cheng Chiu, Sunnyvale, CA (US); Bo Li, Fremont, CA (US); Shuo-yiin Chang, Sunnyvale, CA (US); Tara Sainath, Jersey City, NJ (US); Wei Han, Mountain View, CA (US); Anmol Gulati, Mountain View, CA (US); Yanzhang He, Mountain View, CA (US); Arun Narayanan, Santa Clara, CA (US); Yonghui Wu, Fremont, CA (US); and Ruoming Pang, New York, NY (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Sep. 9, 2021, as Appl. No. 17/447,285.
Claims priority of provisional application 63/094,274, filed on Oct. 20, 2020.
Prior Publication US 2022/0122586 A1, Apr. 21, 2022
Int. Cl. G10L 15/06 (2013.01); G10L 15/16 (2006.01); G10L 15/187 (2013.01); G10L 15/22 (2006.01); G10L 15/30 (2013.01)
CPC G10L 15/063 (2013.01) [G10L 15/16 (2013.01); G10L 15/22 (2013.01); G10L 15/30 (2013.01); G10L 15/187 (2013.01)]
24 Claims
1. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations for training a streaming speech recognition model, the operations comprising:
receiving, as input to the streaming speech recognition model, a sequence of acoustic frames, the streaming speech recognition model configured to learn an alignment probability between the sequence of acoustic frames and an output sequence of vocabulary tokens, the vocabulary tokens comprising a plurality of label tokens and a blank token;
generating an alignment lattice comprising a plurality of nodes, the alignment lattice defined as a matrix with T columns of nodes and U rows of nodes, each column of the T columns corresponding to a corresponding step of the plurality of output steps, each row of the U rows corresponding to a label that textually represents the sequence of acoustic frames;
at each node location in the matrix of the alignment lattice:
determining a forward probability for predicting a subsequent node adjacent to the respective node; and
determining, from the subsequent node adjacent to the respective node, a backward probability of including the respective subsequent node in an output sequence of vocabulary tokens;
at each step of a plurality of output steps:
determining a first probability of emitting one of the label tokens; and
determining a second probability of emitting the blank token, wherein the forward probability comprises the first probability and the second probability;
generating the alignment probability at a sequence level based on the first probability of emitting one of the label tokens and the second probability of emitting the blank token at each output step; and
applying a tuning parameter to the alignment probability at the sequence level to maximize the first probability of emitting one of the label tokens, the tuning parameter applied to the alignment probability independent of any speech-word alignment information.
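The lattice operations recited in claim 1 can be illustrated with a small sketch. It is not the patented implementation; it assumes the conventional transducer formulation, in which a forward variable alpha(t, u) and a backward variable beta(t, u) are accumulated over a lattice of T output steps and rows u = 0..U, and the sequence-level alignment probability is the sum over all paths. The function names `forward_backward` and `regularized_grads`, the parameter `lam` (standing in for the tuning parameter), and the choice to scale only the label-emission gradient by (1 + lam) are assumptions made for illustration.

```python
def forward_backward(label_p, blank_p):
    """Forward/backward variables on a T x (U+1) alignment lattice.

    label_p[t][u]: probability of emitting the next label token at node (t, u)
                   (the entry at u == U is unused).
    blank_p[t][u]: probability of emitting the blank token at node (t, u).
    """
    T, U = len(blank_p), len(blank_p[0]) - 1
    # alpha[t][u]: probability of reaching node (t, u) from node (0, 0).
    alpha = [[0.0] * (U + 1) for _ in range(T)]
    alpha[0][0] = 1.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:  # arrive by a blank emitted at (t-1, u)
                alpha[t][u] += alpha[t - 1][u] * blank_p[t - 1][u]
            if u > 0:  # arrive by a label emitted at (t, u-1)
                alpha[t][u] += alpha[t][u - 1] * label_p[t][u - 1]
    # beta[t][u]: probability of completing the sequence from node (t, u).
    beta = [[0.0] * (U + 1) for _ in range(T)]
    beta[T - 1][U] = blank_p[T - 1][U]  # final blank closes the alignment
    for t in reversed(range(T)):
        for u in reversed(range(U + 1)):
            if t == T - 1 and u == U:
                continue
            if t < T - 1:
                beta[t][u] += blank_p[t][u] * beta[t + 1][u]
            if u < U:
                beta[t][u] += label_p[t][u] * beta[t][u + 1]
    return alpha, beta


def regularized_grads(label_p, blank_p, lam):
    """Gradients of -log P(alignment) w.r.t. label and blank probabilities.

    Assumption: the tuning parameter `lam` scales only the label-emission
    gradient by (1 + lam); the blank gradient is left unchanged. No
    speech-word alignment information is used.
    """
    alpha, beta = forward_backward(label_p, blank_p)
    T, U = len(blank_p), len(blank_p[0]) - 1
    total = beta[0][0]  # sequence-level alignment probability
    g_label = [[0.0] * (U + 1) for _ in range(T)]
    g_blank = [[0.0] * (U + 1) for _ in range(T)]
    for t in range(T):
        for u in range(U + 1):
            if u < U:
                g_label[t][u] = -(1.0 + lam) * alpha[t][u] * beta[t][u + 1] / total
            # After a blank at the last step, only row U ends a valid alignment.
            if t < T - 1:
                after_blank = beta[t + 1][u]
            else:
                after_blank = 1.0 if u == U else 0.0
            g_blank[t][u] = -alpha[t][u] * after_blank / total
    return total, g_label, g_blank


# Toy lattice: T = 2 output steps, U = 1 label, all probabilities 0.5.
toy_label = [[0.5, 0.5], [0.5, 0.5]]
toy_blank = [[0.5, 0.5], [0.5, 0.5]]
p_seq, grad_label, grad_blank = regularized_grads(toy_label, toy_blank, lam=0.5)
```

On this toy lattice there are two valid alignments, so `p_seq` is their summed probability. Increasing `lam` enlarges the magnitude of the label-emission gradients while the blank gradients stay the same, which is how a sequence-level parameter of this kind would push a model toward emitting label tokens earlier.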