US 12,322,383 B2
Predicting word boundaries for on-device batching of end-to-end speech recognition models
Shaan Jagdeep Patrick Bijwadia, San Francisco, CA (US); Tara N. Sainath, Jersey City, NJ (US); Jiahui Yu, Mountain View, CA (US); Shuo-yiin Chang, Sunnyvale, CA (US); and Yangzhang He, Mountain View, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Sep. 21, 2022, as Appl. No. 17/934,184.
Claims priority of provisional application 63/262,141, filed on Oct. 5, 2021.
Prior Publication US 2023/0107493 A1, Apr. 6, 2023
Int. Cl. G10L 15/05 (2013.01); G06N 3/045 (2023.01); G06N 3/0455 (2023.01); G06N 3/048 (2023.01); G06N 3/0499 (2023.01); G06N 3/09 (2023.01); G06N 3/096 (2023.01); G10L 15/06 (2013.01); G10L 15/16 (2006.01); G10L 15/183 (2013.01); G10L 15/26 (2006.01); G10L 15/32 (2013.01)
CPC G10L 15/183 (2013.01) [G10L 15/05 (2013.01); G10L 15/26 (2013.01)]
26 Claims
1. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations comprising:
receiving a sequence of input audio frames corresponding to an utterance captured by a user device, the utterance comprising a plurality of uttered words;
for each respective input audio frame, predicting, using a word boundary detection model configured to receive the sequence of input audio frames as an input, whether the respective input audio frame is a word boundary between an adjacent pair of uttered words of the plurality of uttered words, the word boundary detection model trained on transcript labels augmented with a special boundary token inserted between each adjacent pair of words;
batching the sequence of input audio frames into a plurality of batches based on the input audio frames predicted as word boundaries, wherein each respective batch comprises a corresponding plurality of batched input audio frames representing a respective one of the plurality of uttered words of the utterance; and
for each respective batch of the plurality of batches, processing, using a speech recognition model, the corresponding plurality of batched input audio frames in parallel to generate a speech recognition result for the respective one of the plurality of uttered words.
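The operations recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the "<wb>" spelling of the boundary token, and the choice to keep each boundary frame at the end of the batch it closes are all assumptions made for illustration. The word boundary detection model and the speech recognition model are stood in for by plain callables, and the parallel processing of frames within a batch is left to whatever recognizer is supplied.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")
Result = TypeVar("Result")

# Hypothetical spelling of the special boundary token; the claim does not name it.
BOUNDARY_TOKEN = "<wb>"


def augment_transcript(transcript: str) -> List[str]:
    """Insert the boundary token between each adjacent pair of words."""
    labels: List[str] = []
    for index, word in enumerate(transcript.split()):
        if index > 0:
            labels.append(BOUNDARY_TOKEN)
        labels.append(word)
    return labels


def batch_by_boundaries(
    frames: Sequence[Frame], is_boundary: Sequence[bool]
) -> List[List[Frame]]:
    """Group frames into per-word batches using per-frame boundary predictions.

    Assumption: a frame predicted as a boundary closes the current batch and
    is kept as that batch's last frame, so no audio is discarded.
    """
    if len(frames) != len(is_boundary):
        raise ValueError("expected one boundary prediction per frame")
    batches: List[List[Frame]] = []
    current: List[Frame] = []
    for frame, boundary in zip(frames, is_boundary):
        current.append(frame)
        if boundary:
            batches.append(current)
            current = []
    if current:  # trailing frames after the last predicted boundary
        batches.append(current)
    return batches


def recognize_utterance(
    frames: Sequence[Frame],
    detect_boundaries: Callable[[Sequence[Frame]], Sequence[bool]],
    recognize_batch: Callable[[List[Frame]], Result],
) -> List[Result]:
    """Predict boundaries, batch the frames, and recognize each batch."""
    flags = detect_boundaries(frames)
    return [recognize_batch(batch) for batch in batch_by_boundaries(frames, flags)]
```

In use, `detect_boundaries` would wrap a trained boundary detector that returns one boolean per frame, and `recognize_batch` would wrap a recognizer that consumes all frames of a batch at once; `augment_transcript` shows how training labels of the kind described in the claim might be prepared.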