US 12,431,127 B2
System and method for neural network multilingual speech recognition
Purvi Agrawal, Hyderabad (IN); Vikas Joshi, Bengaluru (IN); Basil Abraham, Hyderabad (IN); Tejaswi Seeram, Kakinada (IN); and Rupeshkumar Rasiklal Mehta, Hyderabad (IN)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on Jun. 29, 2022, as Appl. No. 17/853,055.
Prior Publication US 2024/0005912 A1, Jan. 4, 2024
Int. Cl. G10L 15/00 (2013.01); G10L 15/06 (2013.01); G10L 15/16 (2006.01); G10L 15/22 (2006.01)
CPC G10L 15/16 (2013.01) [G10L 15/005 (2013.01); G10L 15/063 (2013.01); G10L 15/22 (2013.01)]
19 Claims
 
1. A computer-implemented method for improved recognition of multiple languages in audio data, the method comprising:
training a multilingual neural network model on first input audio data, the multilingual neural network model including shared acoustic model layers and a single projection layer, and the first input audio data including speech in a primary language and a secondary language;
splitting the single projection layer of the multilingual neural network model to produce a split head multilingual neural network model;
training the split head multilingual neural network model on second input audio data, the second input audio data including speech in the primary language and the secondary language to generate a trained split head multilingual neural network model, the trained split head multilingual neural network model including shared acoustic model layers and a plurality of projection layers, each projection layer of the plurality of projection layers corresponding to a language that the trained split head multilingual neural network model recognizes;
receiving audio data, the audio data including speech in a plurality of languages in the audio data, the speech in the plurality of languages corresponding to the language recognized by a projection layer of the plurality of projection layers of the trained split head multilingual neural network model; and
classifying one or more languages of the speech of the audio data using the trained split head multilingual neural network model.
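
The splitting step recited in claim 1 can be illustrated with a minimal sketch. The claim does not state how the per-language projection layers are initialized, so this sketch assumes each one starts as an independent copy of the single projection layer while the shared acoustic model layers are reused unchanged. All names here (MultilingualModel, init_single_head, split_projection_layer, forward, and the head keys "single", "primary", "secondary") are hypothetical, and the two training steps are omitted.

```python
import copy
import random
from dataclasses import dataclass


@dataclass
class MultilingualModel:
    # Stand-in for the shared acoustic model layers: a list of weight matrices.
    shared_layers: list
    # Projection layers keyed by name; each is a matrix (output units x hidden).
    heads: dict


def init_single_head(hidden_dim, num_outputs, seed=0):
    """Build a toy model with shared layers and one projection layer."""
    rng = random.Random(seed)
    shared = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
              for _ in range(hidden_dim)]
    proj = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
            for _ in range(num_outputs)]
    return MultilingualModel(shared_layers=[shared], heads={"single": proj})


def split_projection_layer(model, languages):
    """Replace the single projection layer with one head per language.

    Assumption: every head is initialized as a deep copy of the single
    projection layer; the shared layers are carried over as-is.
    """
    if list(model.heads) != ["single"]:
        raise ValueError("expected a model with exactly one projection layer")
    single = model.heads["single"]
    heads = {lang: copy.deepcopy(single) for lang in languages}
    return MultilingualModel(shared_layers=model.shared_layers, heads=heads)


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def forward(model, features, language):
    """Run the shared layers (with ReLU), then the head for one language."""
    hidden = features
    for layer in model.shared_layers:
        hidden = [max(0.0, a) for a in matvec(layer, hidden)]
    return matvec(model.heads[language], hidden)


base = init_single_head(hidden_dim=4, num_outputs=3)
split = split_projection_layer(base, ["primary", "secondary"])
scores = forward(split, [1.0, 0.5, -0.5, 2.0], "secondary")
```

Because each head is a separate copy under this assumption, the second training step could update the primary-language and secondary-language projection layers independently while still updating the shared layers on both languages.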
 
9. A system for improved recognition of multiple languages in audio data, the system including:
a data storage device that stores instructions for improved recognition of multiple languages in audio data; and
a processor configured to execute the instructions to perform a method including:
training a multilingual neural network model on first input audio data, the multilingual neural network model including shared acoustic model layers and a single projection layer, and the first input audio data including speech in a primary language and a secondary language;
splitting the single projection layer of the multilingual neural network model to produce a split head multilingual neural network model;
training the split head multilingual neural network model on second input audio data, the second input audio data including speech in the primary language and the secondary language to generate a trained split head multilingual neural network model, the trained split head multilingual neural network model including shared acoustic model layers and a plurality of projection layers, each projection layer of the plurality of projection layers corresponding to a language that the trained split head multilingual neural network model recognizes;
receiving audio data, the audio data including speech in a plurality of languages in the audio data, the speech in the plurality of languages corresponding to the language recognized by a projection layer of the plurality of projection layers of the trained split head multilingual neural network model; and
classifying one or more languages of the speech of the audio data using the trained split head multilingual neural network model.