CPC G10L 17/02 (2013.01) [G10L 15/22 (2013.01); G10L 15/26 (2013.01); G10L 19/022 (2013.01); G10L 21/0272 (2013.01)] | 20 Claims |
1. A computer-implemented method, comprising:
segmenting an audio stream into a plurality of audio segments;
identifying a speaker within one of the audio segments using characteristics of the identified speaker;
generating, via a trained automatic speech recognition (ASR) model, a short-segment hypothesis for the audio segment, wherein the trained ASR model is trained to correct inserted errors in training data during training, the inserted errors comprising incorrect words or incorrect speakers;
merging a first portion of the short-segment hypothesis into a merged hypothesis set specific to the speaker;
inserting stitching symbols into the merged hypothesis set, the stitching symbols including a window change (WC) symbol; and
outputting a transcription of the hypothesis for the speaker, the output transcription including the stitched symbols.
|
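The steps recited in claim 1 can be pictured with a minimal sketch. It is illustrative only and is not the claimed implementation. The names `segment`, `transcribe`, `identify_speaker`, `asr`, and `keep` are hypothetical, as is the `<WC>` spelling of the window change symbol. Speaker identification and the trained ASR model are supplied by the caller as stand-in functions, so the error-correcting training recited in the claim is not reproduced. Placing one WC symbol between consecutive windows of the same speaker, and taking the "first portion" to be whatever `keep` returns, are also assumptions.

```python
from collections import defaultdict

WC = "<WC>"  # assumed token spelling for the window change (WC) symbol


def segment(samples, window, hop):
    """Split a sample sequence into fixed-length, possibly overlapping windows."""
    last_start = max(len(samples) - window, 0)
    return [samples[i:i + window] for i in range(0, last_start + 1, hop)]


def transcribe(segments, identify_speaker, asr, keep):
    """Build one stitched transcription per speaker.

    identify_speaker(seg) -> speaker label (stand-in for speaker identification)
    asr(seg)              -> list of words (stand-in for the short-segment hypothesis)
    keep(words)           -> the portion of the hypothesis to merge (assumed policy)
    """
    merged = defaultdict(list)  # speaker -> merged hypothesis tokens
    for seg in segments:
        speaker = identify_speaker(seg)
        hypothesis = asr(seg)
        portion = keep(hypothesis)
        if merged[speaker]:
            # A new window is being stitched onto existing text for this speaker.
            merged[speaker].append(WC)
        merged[speaker].extend(portion)
    return {speaker: " ".join(tokens) for speaker, tokens in merged.items()}


# Toy usage: each "segment" is already a (speaker, text) pair.
toy_segments = [
    ("A", "hello there how"),
    ("B", "fine thanks"),
    ("A", "how are you"),
]
result = transcribe(
    toy_segments,
    identify_speaker=lambda seg: seg[0],
    asr=lambda seg: seg[1].split(),
    keep=lambda words: words[:2],  # keep the first two words of each hypothesis
)
print(result)
```

In this toy run the per-speaker output keeps the WC marker in the text, mirroring the claim's requirement that the output transcription include the stitching symbols.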