US 11,862,171 B2
Multithreaded speech data preprocessing
Xiaolong Li, Cary, NC (US); Xiaozhuo Cheng, Cary, NC (US); Samuel Norris Henderson, Raleigh, NC (US); and Xu Yang, Cary, NC (US)
Assigned to SAS Institute Inc., Cary, NC (US)
Filed by SAS Institute Inc., Cary, NC (US)
Filed on Nov. 23, 2022, as Appl. No. 17/993,385.
Application 17/993,385 is a continuation in part of application No. 17/851,264, filed on Jun. 28, 2022, granted, now 11,538,481.
Application 17/851,264 is a continuation in part of application No. 17/498,811, filed on Oct. 12, 2021, granted, now 11,373,655, issued on Jun. 28, 2022.
Application 17/498,811 is a continuation in part of application No. 17/370,441, filed on Jul. 8, 2021, granted, now 11,404,053, issued on Aug. 2, 2022.
Application 17/370,441 is a continuation of application No. PCT/CN2021/082572, filed on Mar. 24, 2021.
Application 17/498,811 is a continuation in part of application No. 17/205,871, filed on Mar. 18, 2021, granted, now 11,145,309, issued on Oct. 12, 2021.
Application 17/205,871 is a continuation in part of application No. 17/138,521, filed on Dec. 30, 2020, granted, now 11,049,502, issued on Jun. 29, 2021.
Application 17/138,521 is a continuation of application No. 17/138,445, filed on Dec. 30, 2020, granted, now 11,138,979, issued on Oct. 5, 2021.
Claims priority of provisional application 62/991,275, filed on Mar. 18, 2020.
Claims priority of provisional application 63/297,002, filed on Jan. 6, 2022.
Claims priority of provisional application 63/288,385, filed on Dec. 10, 2021.
Prior Publication US 2023/0107312 A1, Apr. 6, 2023
Int. Cl. G10L 15/22 (2006.01); G10L 15/26 (2006.01); G10L 15/04 (2013.01); G10L 25/78 (2013.01); G10L 25/30 (2013.01); G10L 15/02 (2006.01)
CPC G10L 15/26 (2013.01) [G10L 15/02 (2013.01); G10L 15/04 (2013.01); G10L 25/30 (2013.01); G10L 25/78 (2013.01); G10L 2025/783 (2013.01)] 30 Claims
OG exemplary drawing
 
1. An apparatus comprising at least one processor and a storage to store instructions that, when executed by the at least one processor, cause the at least one processor to perform operations comprising:
receive, from a requesting device via a network, a request to perform speech-to-text conversion of a specified speech data set representing speech audio;
in response to the request, the at least one processor is caused to perform preprocessing operations comprising:
within a first thread of a thread pool that comprises multiple threads of execution supported by the at least one processor, perform a first pause detection technique to identify a first set of likely sentence pauses in the speech audio;
within a second thread of the thread pool, perform a second pause detection technique to identify a second set of likely sentence pauses in the speech audio; and
perform a speaker diarization technique to identify a set of likely speaker changes in the speech audio; and
in response to the request, the at least one processor is caused to perform speech-to-text processing operations comprising:
divide the speech data set into multiple data segments that each represent a speech segment of multiple speech segments of the speech audio based on a combination of at least the first set of likely sentence pauses, the second set of likely sentence pauses, and the set of likely speaker changes;
use at least an acoustic model with each data segment of the multiple data segments to identify likely speech sounds in the speech audio; and
generate a transcript of the speech data set based, at least in part, on the identified likely speech sounds, or transmit an indication of the generation of the transcript to the requesting device.
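The preprocessing and segmentation flow recited in claim 1 can be sketched in code: two pause-detection techniques run concurrently on separate threads of a thread pool, a speaker-diarization pass contributes likely speaker changes, and the combined boundaries divide the audio into segments that are each scored by an acoustic model before a transcript is assembled. This is a minimal illustrative sketch, not the patented implementation: the specific techniques here (frame-energy thresholding, zero-crossing rate, a fixed-position diarization stub, an energy-based acoustic-model stand-in) and all function names are hypothetical substitutes for the unspecified techniques in the claim.

```python
from concurrent.futures import ThreadPoolExecutor

def energy_pauses(samples, frame=100, threshold=0.05):
    """First pause-detection technique: frames with low mean |amplitude|."""
    return [i for i in range(0, len(samples) - frame, frame)
            if sum(abs(s) for s in samples[i:i + frame]) / frame < threshold]

def zcr_pauses(samples, frame=100, threshold=2):
    """Second pause-detection technique: frames with few zero crossings."""
    pauses = []
    for i in range(0, len(samples) - frame, frame):
        win = samples[i:i + frame]
        crossings = sum(1 for a, b in zip(win, win[1:]) if a * b < 0)
        if crossings < threshold:
            pauses.append(i)
    return pauses

def diarize(samples):
    """Stand-in speaker-change detector: pretends a speaker change occurs
    at the midpoint. A real diarization technique would go here."""
    return [len(samples) // 2]

def segment(samples, boundaries):
    """Divide the data set at the combined pause/speaker-change boundaries."""
    cuts = sorted(b for b in set(boundaries) if 0 < b < len(samples))
    segs, start = [], 0
    for c in cuts:
        segs.append(samples[start:c])
        start = c
    segs.append(samples[start:])
    return segs

def acoustic_model(seg):
    """Stand-in acoustic-model scoring: labels a segment by its energy."""
    energy = sum(abs(s) for s in seg) / max(len(seg), 1)
    return "speech" if energy > 0.05 else "(pause)"

# Synthetic "speech audio": tone, silence, tone.
samples = [0.5, -0.5] * 200 + [0.0] * 200 + [0.5, -0.5] * 200

# Run the two pause-detection techniques on separate threads of a pool.
with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(energy_pauses, samples)
    f2 = pool.submit(zcr_pauses, samples)
    set1, set2 = f1.result(), f2.result()

# Combine both pause sets with the likely speaker changes, then segment.
segments = segment(samples, set1 + set2 + diarize(samples))

# Apply the acoustic model to each segment and assemble a transcript.
transcript = " ".join(acoustic_model(s) for s in segments)
```

On the synthetic input above, both techniques flag the silent middle frames, the combined boundaries yield three segments, and the transcript marks the middle segment as a pause. Running the detectors on pool threads mirrors the claim's point that the two techniques are independent and can proceed concurrently before their results are combined.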