CPC H04N 19/159 (2014.11) [G10L 15/04 (2013.01); G10L 15/22 (2013.01); G10L 15/26 (2013.01); G10L 17/00 (2013.01); G10L 25/78 (2013.01); H04N 19/172 (2014.11); H04N 19/184 (2014.11); H04N 19/187 (2014.11); H04N 19/30 (2014.11); H04N 19/70 (2014.11); G10L 15/005 (2013.01); G10L 15/01 (2013.01); G10L 15/063 (2013.01); G10L 15/142 (2013.01); G10L 15/16 (2013.01); G10L 15/32 (2013.01); G10L 25/51 (2013.01)]

17 Claims

1. A computer-implemented method for improving computer speed and accuracy of automatic speech transcription, comprising:
generating, by at least one processor, at least one speech recognition model specification for a plurality of distinct speech-to-text transcription engines;
wherein each distinct speech-to-text transcription engine corresponds to a respective distinct speech recognition model;
wherein, for each distinct speech-to-text transcription engine, the at least one speech recognition model specification at least identifies:
i) a respective value for at least one pre-transcription evaluation parameter, and
ii) a respective value for at least one post-transcription evaluation parameter;
wherein the generating the at least one speech recognition model specification comprises:
receiving, by the at least one processor, at least one training audio recording and at least one truth transcript of the at least one training audio recording;
segmenting, by the at least one processor, the at least one training audio recording into a plurality of training audio segments and the at least one truth transcript into a plurality of corresponding truth training segment transcripts;
applying, by the at least one processor, at least one pre-transcription audio classifier to each training audio segment of the plurality of training audio segments to generate first metadata classifying each training audio segment based at least on:
i) language,
ii) audio quality, and
iii) accent;
applying, by the at least one processor, at least one text classifier to each corresponding truth training segment transcript of the plurality of corresponding truth training segment transcripts to generate second metadata classifying each corresponding truth training segment transcript based at least on at least one content category;
combining, by the at least one processor, the plurality of training audio segments, the plurality of corresponding truth training segment transcripts, the first metadata, and the second metadata to form at least one benchmark set;
receiving at least one audio recording representing at least one speech of at least one person;
segmenting the at least one audio recording into a plurality of audio segments;
wherein each audio segment corresponds to a respective single phrase of a respective single person that has been bounded by points of silence in the at least one audio recording;
determining, based on the respective value of the at least one pre-transcription evaluation parameter of the respective distinct speech recognition model in the at least one speech recognition model specification, a respective distinct speech-to-text transcription engine from the plurality of distinct speech-to-text transcription engines to be utilized to transcribe a respective audio segment of the plurality of audio segments;
submitting the respective audio segment to the respective distinct speech-to-text transcription engine;
receiving, from the respective distinct speech-to-text transcription engine, at least one hypothesis for the respective audio segment;
accepting the at least one hypothesis for the respective audio segment based on the respective value of the at least one post-transcription evaluation parameter of the respective distinct speech recognition model in the at least one speech recognition model specification to obtain a respective accepted hypothesis for the respective audio segment of the plurality of audio segments of the at least one audio recording;
wherein the accepting of the at least one hypothesis for each respective audio segment as the respective accepted hypothesis for the respective audio segment removes a need to submit the respective audio segment to another distinct speech-to-text transcription engine from the plurality of distinct speech-to-text transcription engines, resulting in the improved computer speed and accuracy of automatic speech transcription;
generating at least one transcript of the at least one audio recording from respective accepted hypotheses for the plurality of audio segments; and
outputting the at least one transcript of the at least one audio recording.
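For orientation, the sketches below illustrate the main limitations of claim 1 in Python. None of this code comes from the patent: every name, threshold, and interface is a hypothetical stand-in, and each sketch is one plausible reading of the claim language rather than the claimed implementation. The first sketch models the silence-bounded segmentation recited for both the training recordings and the runtime recordings: split the waveform wherever frame energy stays below a floor long enough to count as a point of silence.

```python
import numpy as np

def segment_at_silence(samples, rate, frame_ms=30, rms_floor=0.01,
                       min_silence_frames=10):
    """Split a mono float waveform into phrase segments bounded by points
    of silence. All thresholds are illustrative defaults, not values taken
    from the patent."""
    frame = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame
    frames = samples[: n_frames * frame].reshape(n_frames, frame)
    silent = np.sqrt((frames ** 2).mean(axis=1)) < rms_floor  # per-frame RMS test
    segments, start, run = [], None, 0
    for i, is_silent in enumerate(silent):
        if not is_silent:
            if start is None:
                start = i              # first voiced frame of a new segment
            run = 0
        elif start is not None:
            run += 1
            if run >= min_silence_frames:  # enough silence: close the segment
                segments.append(samples[start * frame:(i - run + 1) * frame])
                start, run = None, 0
    if start is not None:                  # keep trailing voiced audio
        segments.append(samples[start * frame:])
    return segments
```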
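A second sketch, under the same caveats, assembles the benchmark set of the specification-generation steps: each training segment is paired with its truth segment transcript, the first metadata from an assumed audio classifier (language, audio quality, accent), and the second metadata from an assumed text classifier (content category). It reuses `segment_at_silence` from the first sketch; `align_transcript`, `classify_audio`, and `classify_text` are placeholders for models the claim leaves unspecified.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class BenchmarkEntry:
    audio_segment: Any       # one silence-bounded training waveform segment
    truth_transcript: str    # corresponding truth segment transcript
    audio_metadata: dict     # first metadata: language, audio quality, accent
    text_metadata: dict      # second metadata: content category

def build_benchmark_set(recording, truth, rate,
                        align_transcript, classify_audio, classify_text):
    """Combine training segments, truth segment transcripts, and both
    metadata sets into one benchmark set (all callables are assumed)."""
    segments = segment_at_silence(recording, rate)
    truth_segments = align_transcript(truth, segments)  # assumed aligner
    return [
        BenchmarkEntry(
            audio_segment=seg,
            truth_transcript=txt,
            audio_metadata=classify_audio(seg),  # {"language", "quality", "accent"}
            text_metadata=classify_text(txt),    # {"content_category": ...}
        )
        for seg, txt in zip(segments, truth_segments)
    ]
```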
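Finally, a sketch of the per-segment engine selection and hypothesis acceptance. The pre-transcription evaluation is modeled here as a match score between an engine's model specification and the segment's metadata, and the post-transcription evaluation as a confidence floor; both are assumptions, since the claim does not fix the parameter semantics. Accepting a hypothesis breaks out of the loop, which mirrors the claim's speed argument: an accepted hypothesis is never resubmitted to another engine.

```python
def transcribe_recording(recording, rate, engines, model_spec, classify_audio):
    """Route each silence-bounded segment to the engine whose model spec
    best matches the segment's metadata, then accept the first hypothesis
    that passes that engine's post-transcription check. All interfaces
    (engines[e].transcribe, hyp.text, hyp.confidence, the spec fields)
    are hypothetical."""
    accepted = []
    for segment in segment_at_silence(recording, rate):
        meta = classify_audio(segment)  # language, audio quality, accent
        # Pre-transcription evaluation: rank engines by how well their
        # parameter values match this segment's metadata, best first.
        ranked = sorted(
            engines,
            key=lambda e: model_spec[e]["pre_match"](meta),
            reverse=True,
        )
        for engine_id in ranked:
            hyp = engines[engine_id].transcribe(segment)
            # Post-transcription evaluation: accepting here ends work on
            # this segment, so it is never sent to another engine.
            if hyp.confidence >= model_spec[engine_id]["min_confidence"]:
                accepted.append(hyp.text)
                break
        else:
            accepted.append(hyp.text)  # no engine passed; keep last hypothesis
    return " ".join(accepted)          # the generated transcript
```

The fallback ordering is a design choice in this sketch: engines are ranked once per segment, and a lower-ranked engine is consulted only when the current hypothesis fails the acceptance test, so the common case costs exactly one engine call per segment.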