US 12,469,490 B2
Streaming punctuation for long-form dictation
Piyush Behre, Santa Clara, CA (US); Sharman W Tan, Fremont, CA (US); Shuangyu Chang, Davis, CA (US); Padma Varadharajan, San Jose, CA (US); Sayan Dev Pathak, Kirkland, WA (US); and Ravikant Gupta, Sunnyvale, CA (US)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by MICROSOFT TECHNOLOGY LICENSING, LLC, Redmond, WA (US)
Filed on Apr. 29, 2022, as Appl. No. 17/732,971.
Prior Publication US 2023/0352009 A1, Nov. 2, 2023
Int. Cl. G10L 15/19 (2013.01); G06F 40/103 (2020.01); G06F 40/232 (2020.01); G06F 40/58 (2020.01); G10L 15/04 (2013.01); G10L 15/22 (2006.01)

CPC G10L 15/19 (2013.01) [G06F 40/103 (2020.01); G06F 40/58 (2020.01); G10L 15/04 (2013.01); G10L 15/22 (2013.01)]

23 Claims

1. A method for generating transcription data for speech data comprising one or more spoken language utterances, the method comprising:

generating one or more partial segments of the one or more spoken language utterances, each partial segment comprising one or more words recognized in the speech data;

causing the one or more partial segments to be transmitted to and displayed at a remote client device in a first visual format corresponding to a first font style that indicates to a user that the transcription data being displayed on a user display are partial segments;

generating one or more decoder segments based on the one or more partial segments and a first set of segmentation boundaries, each decoder segment comprising one or more consecutive words recognized in the speech data;

generating one or more formatted segments with punctuation based on the one or more decoder segments by assigning a punctuation tag selected from a plurality of punctuation tags at each segmentation boundary included in the first set of segmentation boundaries;

causing the one or more formatted segments to be transmitted to and displayed at the remote client device in a second visual format corresponding to a second font style, different than the first font style, that indicates to the user that the transcription data being displayed on the user display are formatted segments;

subsequent to the one or more formatted segments being transmitted to the remote client device, generating a second set of segmentation boundaries such that at least one segmentation boundary included in the second set of segmentation boundaries is determined to be a final segmentation boundary corresponding to an end of a sentence included in the one or more spoken language utterances;

applying the second set of segmentation boundaries to the one or more decoder segments, the second set of segmentation boundaries being different than the first set of segmentation boundaries;

in response to applying the second set of segmentation boundaries to the one or more decoder segments, generating one or more final sentences from the one or more decoder segments, wherein the one or more final sentences have different punctuation than the punctuation of the one or more formatted segments; and

causing the one or more final sentences to be transmitted to and displayed at the remote client device in a third visual format corresponding to a third font style, different from the first font style and the second font style, that indicates to the user that the transcription data being displayed on the user display are final sentences.
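
The three display stages recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the names `Stage`, `Message`, `punctuate`, and `run_pipeline` are hypothetical, the font styles are arbitrary placeholders, and the segmentation boundaries are passed in as plain dictionaries rather than produced by a recognizer or punctuation model. Each boundary set maps the index of the word that precedes a boundary to the punctuation tag assigned there.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    # Hypothetical font styles; the claim only requires three distinct ones.
    PARTIAL = "italic"
    FORMATTED = "regular"
    FINAL = "bold"


@dataclass
class Message:
    """One unit of transcription data sent to the client for display."""
    stage: Stage
    text: str


def punctuate(words, boundaries):
    """Append the punctuation tag assigned at each segmentation boundary.

    `boundaries` maps a word index to the tag placed after that word.
    """
    return " ".join(w + boundaries.get(i, "") for i, w in enumerate(words))


def run_pipeline(partials, first_boundaries, second_boundaries):
    """Return the messages a client would receive, in order.

    `partials` is a list of word lists, standing in for recognizer output.
    """
    messages = []
    words = []
    for partial in partials:
        # Stage 1: partial segments, shown in the first font style.
        messages.append(Message(Stage.PARTIAL, " ".join(partial)))
        words.extend(partial)
    # Stage 2: formatted segments using the first set of boundaries.
    messages.append(Message(Stage.FORMATTED, punctuate(words, first_boundaries)))
    # Stage 3: final sentences using the second, different set of boundaries.
    messages.append(Message(Stage.FINAL, punctuate(words, second_boundaries)))
    return messages


if __name__ == "__main__":
    partials = [["send", "it", "today"], ["before", "the", "meeting"]]
    first = {2: ".", 5: "."}   # early guess: a sentence ends after "today"
    second = {5: "."}          # revised: one sentence ending after "meeting"
    for msg in run_pipeline(partials, first, second):
        print(f"[{msg.stage.value}] {msg.text}")
```

In this sketch the formatted segment initially places a period after "today", and the final sentence drops it once the second set of boundaries marks the true end of the sentence, so the final text carries different punctuation than the formatted text, as the claim requires.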