US 12,407,777 B2
Performance optimization for real-time large language speech to text systems
Emad Noorizadeh, Plano, TX (US); Emmanuel Dibia, Dallas, TX (US); Jennifer Russell, Dallas, TX (US); and Rajan Jhaveri, Plano, TX (US)
Assigned to Bank of America Corporation, Charlotte, NC (US)
Filed by Bank of America Corporation, Charlotte, NC (US)
Filed on Jun. 2, 2023, as Appl. No. 18/204,981.
Prior Publication US 2024/0406314 A1, Dec. 5, 2024
Int. Cl. H04M 3/493 (2006.01); G10L 15/26 (2006.01)
CPC H04M 3/4936 (2013.01) [G10L 15/26 (2013.01)]
15 Claims
 
1. A method for maintaining accuracy in transcribing a communication, the method comprising:
in a first environment:
receiving a communication; and
transcribing the communication into a first transcription using a robust speech recognition machine learning model, said robust speech recognition machine learning model using noisy sources to provide a supervision signal for labeling training data;
in a second environment:
receiving the communication;
splitting the communication into a plurality of communication segments, each communication segment comprising two or more words;
identifying a number of communication segments included in the plurality of communication segments;
instantiating an instance of the robust speech recognition machine learning model for each communication segment included in the plurality of communication segments;
assigning each communication segment, included in the plurality of communication segments, to a corresponding instance of the robust speech recognition machine learning model;
transcribing, using parallel processing, each communication segment into a transcribed communication segment, the transcribing using the assigned instance of the robust speech recognition machine learning model;
combining the transcribed communication segments into a combined transcription; and
correcting the combined transcription using a correction module, said correction module operable to tune transcriptions specific to a discipline;
in a test environment:
identifying a first resources consumed value, said first resources consumed value corresponding to a number of resources consumed by transcribing the communication in the first environment;
identifying a first accuracy level of the first transcription;
identifying a second resources consumed value, said second resources consumed value corresponding to a number of resources consumed by transcribing the communication in the second environment;
determining that the first resources consumed value exceeds the second resources consumed value by more than a predetermined resources value threshold;
identifying a second accuracy level of the combined transcription;
determining that the first accuracy level exceeds the second accuracy level by more than a predetermined accuracy level threshold;
identifying a third resources consumed value, said third resources consumed value corresponding to a number of resources consumed by correcting the combined transcription in the second environment;
identifying a third accuracy level of the combined transcription upon completion of correcting the combined transcription using the correction module;
determining that the third accuracy level is equivalent to or greater than the first accuracy level;
identifying a fourth resources consumed value, the fourth resources consumed value comprising the second resources consumed value and the third resources consumed value; and
determining that the fourth resources consumed value is less than the first resources consumed value.
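
The second-environment steps and the test-environment comparisons recited in claim 1 can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the patented implementation: the communication is modeled as a list of words rather than audio, the model is a stub callable, and every name (`split_into_segments`, `transcribe_segmented`, `make_stub_model`, `correct_banking_terms`, `Measurement`, `conditions_hold`) is hypothetical. The banking vocabulary in the correction step is likewise only an assumed example of a discipline.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-ins: a "communication" is a list of words, and a
# "model instance" is any callable that turns one segment into text.
Segment = List[str]
Model = Callable[[Segment], str]


def split_into_segments(words: List[str], segment_size: int) -> List[Segment]:
    """Split words into segments of two or more words (input permitting)."""
    if segment_size < 2:
        raise ValueError("segment_size must be at least 2")
    segments = [words[i:i + segment_size] for i in range(0, len(words), segment_size)]
    # Fold a trailing one-word segment into its predecessor.
    if len(segments) > 1 and len(segments[-1]) < 2:
        tail = segments.pop()
        segments[-1].extend(tail)
    return segments


def transcribe_segmented(words: List[str], segment_size: int,
                         make_model: Callable[[], Model],
                         correct: Callable[[str], str]) -> str:
    """Split, assign one model instance per segment, transcribe in parallel,
    combine in the original order, then apply the correction step."""
    segments = split_into_segments(words, segment_size)
    models = [make_model() for _ in segments]  # one instance per segment
    with ThreadPoolExecutor(max_workers=max(1, len(segments))) as pool:
        # pool.map returns results in input order, so the parts stay aligned.
        parts = list(pool.map(lambda job: job[0](job[1]), zip(models, segments)))
    return correct(" ".join(parts))


def make_stub_model() -> Model:
    """Stub model: lowercases its segment, standing in for real recognition."""
    return lambda segment: " ".join(segment).lower()


def correct_banking_terms(text: str) -> str:
    """Assumed discipline-specific correction: restore one banking term."""
    return text.replace("roth ira", "Roth IRA")


@dataclass
class Measurement:
    resources: float  # resources consumed, in any consistent unit
    accuracy: float   # accuracy level, e.g. a fraction between 0 and 1


def conditions_hold(first: Measurement, second: Measurement,
                    correction_resources: float, third_accuracy: float,
                    resources_threshold: float, accuracy_threshold: float) -> bool:
    """Check the four test-environment determinations of claim 1."""
    # The fourth value is the second (parallel transcription) plus the third (correction).
    fourth_resources = second.resources + correction_resources
    return (
        first.resources - second.resources > resources_threshold
        and first.accuracy - second.accuracy > accuracy_threshold
        and third_accuracy >= first.accuracy
        and fourth_resources < first.resources
    )
```

For example, `transcribe_segmented("Please move funds to my Roth IRA today".split(), 3, make_stub_model, correct_banking_terms)` splits the eight words into three segments, transcribes them on separate stub instances, and corrects the combined text. A thread pool is used here only to keep the sketch short; a real deployment with a compute-bound model would more likely use separate processes or devices.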