US 11,935,542 B2
Hypothesis stitcher for speech recognition of long-form audio
Naoyuki Kanda, Bellevue, WA (US); Xuankai Chang, Baltimore, MD (US); Yashesh Gaur, Redmond, WA (US); Xiaofei Wang, Bellevue, WA (US); Zhong Meng, Mercer Island, WA (US); and Takuya Yoshioka, Bellevue, WA (US)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on Jan. 19, 2023, as Appl. No. 18/157,070.
Application 18/157,070 is a continuation of application No. 17/127,938, filed on Dec. 18, 2020, granted, now 11,574,639.
Prior Publication US 2023/0154468 A1, May 18, 2023
Int. Cl. G10L 15/00 (2013.01); G10L 15/22 (2006.01); G10L 15/26 (2006.01); G10L 17/02 (2013.01); G10L 19/022 (2013.01); G10L 21/0272 (2013.01)
CPC G10L 17/02 (2013.01) [G10L 15/22 (2013.01); G10L 15/26 (2013.01); G10L 19/022 (2013.01); G10L 21/0272 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A computer-implemented method, comprising:
segmenting an audio stream into a plurality of audio segments;
identifying a speaker within one of the audio segments using characteristics of the identified speaker;
generating, via a trained automatic speech recognition (ASR) model, a short-segment hypothesis for the audio segment, wherein the trained ASR model is trained to correct inserted errors in training data during training, the inserted errors comprising incorrect words or incorrect speakers;
merging a first portion of the short-segment hypothesis into a merged hypothesis set specific to the speaker;
inserting stitching symbols into the merged hypothesis set, the stitching symbols including a window change (WC) symbol; and
outputting a transcription of the hypothesis for the speaker, the output transcription including the stitching symbols.
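The claimed flow can be sketched as follows. This is a minimal illustrative sketch, not the patented implementation: the `SegmentHypothesis` type, the `stitch_hypotheses` function, and the `<WC>` token string are all assumptions chosen for readability; the claim specifies only that a window change (WC) stitching symbol is inserted when merging per-segment hypotheses into a speaker-specific merged hypothesis set.

```python
# Illustrative sketch of per-speaker hypothesis stitching (assumed names/shapes,
# not the patented implementation).
from dataclasses import dataclass

WC = "<WC>"  # assumed surface form of the window-change stitching symbol


@dataclass
class SegmentHypothesis:
    """A short-segment ASR hypothesis attributed to one identified speaker."""
    speaker: str
    words: list


def stitch_hypotheses(short_hyps):
    """Merge short-segment hypotheses into per-speaker merged hypothesis sets,
    inserting a WC stitching symbol at each segment (window) boundary."""
    merged = {}
    for hyp in short_hyps:
        words = merged.setdefault(hyp.speaker, [])
        if words:                  # boundary between two windows for this speaker
            words.append(WC)
        words.extend(hyp.words)
    return merged


def transcription_for(merged, speaker):
    """Output the transcription for one speaker, stitching symbols included."""
    return " ".join(merged[speaker])


# Example: three short-segment hypotheses from two speakers.
hyps = [
    SegmentHypothesis("spk1", ["hello", "world"]),
    SegmentHypothesis("spk2", ["hi"]),
    SegmentHypothesis("spk1", ["again"]),
]
merged = stitch_hypotheses(hyps)
print(transcription_for(merged, "spk1"))  # hello world <WC> again
```

In practice (per the claim) the per-segment hypotheses would come from a trained ASR model applied to segments of the audio stream, with speaker identity determined from speaker characteristics; the stitching step above only shows the merge-and-insert-symbol bookkeeping.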