| CPC G10L 15/183 (2013.01) [G10L 15/05 (2013.01); G10L 15/26 (2013.01)] | 26 Claims |

|
1. A computer-implemented method that, when executed on data processing hardware, causes the data processing hardware to perform operations comprising:
receiving a sequence of input audio frames corresponding to an utterance captured by a user device, the utterance comprising a plurality of uttered words;
for each respective input audio frame, predicting, using a word boundary detection model configured to receive the sequence of input audio frames as an input, whether the respective input audio frame is a word boundary between an adjacent pair of uttered words of the plurality of uttered words, the word boundary detection model trained on transcript labels augmented with a special boundary token inserted between each adjacent pair of words;
batching the sequence of input audio frames into a plurality of batches based on the input audio frames predicted as word boundaries, wherein each respective batch comprises a corresponding plurality of batched input audio frames representing a respective one of the plurality of uttered words of the utterance; and
for each respective batch of the plurality of batches, processing, using a speech recognition model, the corresponding plurality of batched input audio frames in parallel to generate a speech recognition result for the respective one of the plurality of uttered words.
|
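The operations recited in claim 1 can be illustrated with a minimal Python sketch. It is not the claimed implementation. The token name `<wb>`, the function names, the toy one-number-per-frame representation, and the choice to drop boundary frames from the batches are all assumptions made for illustration; the claim does not specify them. The two models are passed in as plain callables, and the recognizer receives all frames of a batch in one call, which stands in for the parallel processing of the batched frames.

```python
from typing import Callable, List, Sequence

Frame = float  # toy stand-in (assumption): one number per audio frame
BOUNDARY_TOKEN = "<wb>"  # hypothetical token name; the claim does not name it


def augment_transcript(transcript: str, token: str = BOUNDARY_TOKEN) -> List[str]:
    """Build training labels: insert the boundary token between adjacent words."""
    labels: List[str] = []
    for i, word in enumerate(transcript.split()):
        if i > 0:
            labels.append(token)
        labels.append(word)
    return labels


def batch_by_boundaries(
    frames: Sequence[Frame], is_boundary: Sequence[bool]
) -> List[List[Frame]]:
    """Group frames into one batch per word, splitting at predicted boundaries.

    Assumption: frames predicted as boundaries are dropped, not kept in a batch.
    """
    if len(frames) != len(is_boundary):
        raise ValueError("expected one boundary prediction per frame")
    batches: List[List[Frame]] = []
    current: List[Frame] = []
    for frame, boundary in zip(frames, is_boundary):
        if boundary:
            if current:  # close the word collected so far
                batches.append(current)
                current = []
        else:
            current.append(frame)
    if current:  # last word has no trailing boundary
        batches.append(current)
    return batches


def recognize_utterance(
    frames: Sequence[Frame],
    boundary_model: Callable[[Sequence[Frame]], List[bool]],
    word_recognizer: Callable[[List[Frame]], str],
) -> List[str]:
    """Predict boundaries, batch the frames per word, and recognize each batch."""
    is_boundary = boundary_model(frames)  # one prediction per input frame
    batches = batch_by_boundaries(frames, is_boundary)
    # Each call hands the recognizer a whole batch at once; a real model would
    # process those frames jointly (for example as a single tensor).
    return [word_recognizer(batch) for batch in batches]
```

With a toy boundary model that flags frames equal to `0.0` and a toy recognizer that reports the batch length, `recognize_utterance` returns one result per word-sized batch, mirroring the one-result-per-uttered-word structure of the final step.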