US 12,361,924 B2
Systems and methods for audio transcription switching based on real-time identification of languages in an audio stream
Sushant Hiray, Redwood City, CA (US); Prashant Kukde, Milpitas, CA (US); and Shashi Kant Gupta, Hazaribagh (IN)
Assigned to RingCentral, Inc., Belmont, CA (US)
Filed by RingCentral, Inc., Belmont, CA (US)
Filed on Dec. 28, 2022, as Appl. No. 18/147,130.
Prior Publication US 2024/0221721 A1, Jul. 4, 2024
Int. Cl. G10L 15/22 (2006.01); G10L 13/08 (2013.01); G10L 15/00 (2013.01); G10L 15/16 (2006.01)
CPC G10L 13/086 (2013.01) [G10L 15/005 (2013.01); G10L 15/16 (2013.01)] 19 Claims
OG exemplary drawing
 
1. A computer-implemented method for machine-generated transcription of different languages that are spoken in audio streams, the computer-implemented method comprising:
receiving an audio stream involving one or more speakers speaking multiple languages;
identifying a first user that speaks during a first snippet of the audio stream;
selecting a user model comprising different languages spoken by the first user;
filtering a plurality of vectors from a trained neural network to a subset of the plurality of vectors based on the different languages from the user model, wherein each vector of the plurality of vectors is used to detect a different language and the subset of vectors corresponds to vectors of the trained neural network that are used to detect the different languages spoken by the first user;
determining, using the subset of vectors of the trained neural network, that a first language is spoken in the first snippet of the audio stream based on a first vector of the subset of vectors outputting a first probability value that satisfies a threshold probability for a first set of features of the first snippet and the first probability value being greater than probability values output by other vectors of the subset of vectors for the first set of features;
transcribing the first snippet from the first language to a target language in response to determining that the first language is spoken in the first snippet;
detecting, using the trained neural network, a transition from the first language to a new language based on the first vector outputting a probability value that does not satisfy the threshold probability for a second set of features from a second snippet of the audio stream;
transcribing the second snippet from the new language to the target language in response to detecting the transition;
determining, using the trained neural network, that a second language is spoken in a third snippet of the audio stream based on a second vector of the trained neural network outputting a second probability value that satisfies the threshold probability for a third set of features of the third snippet; and
transcribing the third snippet from the second language to the target language in response to determining that the second language is spoken in the third snippet.
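The claimed flow — filtering the network's per-language vectors to the speaker's user-model languages, accepting the top language only when its probability satisfies a threshold, and flagging a transition when it does not — can be sketched as below. This is purely illustrative: the toy weight vectors, the softmax scoring, and the 0.6 threshold are assumptions standing in for a trained neural network's actual outputs, and the function and variable names are hypothetical.

```python
import math

# Illustrative stand-in for the claim's per-language "vectors": each entry
# scores a snippet's feature set for one language. A real system would use
# a trained neural network; these weights and THRESHOLD are assumptions.
LANGUAGE_VECTORS = {
    "en": [2.0, 0.0, 0.0],
    "es": [0.0, 2.0, 0.0],
    "hi": [0.0, 0.0, 2.0],
}
THRESHOLD = 0.6

def identify_language(features, user_languages):
    """Return (language, transition_detected) for one snippet's features.

    Filters the full vector set down to the languages in the speaker's
    user model, converts each vector's score to a probability, and accepts
    the highest-probability language only if it satisfies the threshold;
    otherwise a transition to a new language is flagged.
    """
    # Filter the plurality of vectors to the subset for this speaker.
    subset = {lang: LANGUAGE_VECTORS[lang] for lang in user_languages}
    scores = {lang: sum(w * f for w, f in zip(vec, features))
              for lang, vec in subset.items()}
    # Softmax over the subset's scores yields per-language probabilities.
    total = sum(math.exp(s) for s in scores.values())
    probs = {lang: math.exp(s) / total for lang, s in scores.items()}
    best = max(probs, key=probs.get)
    if probs[best] >= THRESHOLD:
        return best, False   # confident identification of the language
    return None, True        # no vector satisfies the threshold: transition
```

Walking the claim's three snippets through this sketch: an English-feature snippet with a bilingual (en/es) user model yields a confident `("en", False)`; a snippet whose features match neither known language yields `(None, True)`, the detected transition; and the same features scored against a model that includes `"hi"` yield a confident `("hi", False)`, after which each snippet would be transcribed to the target language.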