US 12,249,336 B2
Canonical training for highly configurable multilingual speech
Jinyu Li, Bellevue, WA (US); Long Zhou, Beijing (CN); Xie Sun, Bellevue, WA (US); and Shujie Liu, Beijing (CN)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Appl. No. 18/573,846
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US); Jinyu Li, Bellevue, WA (US); Long Zhou, Beijing (CN); Xie Sun, Bellevue, WA (US); and Shujie Liu, Beijing (CN)
PCT Filed Jun. 29, 2021, PCT No. PCT/CN2021/102947 § 371(c)(1), (2) Date Dec. 22, 2023, PCT Pub. No. WO2023/272466, PCT Pub. Date Jan. 5, 2023.
Prior Publication US 2024/0265924 A1, Aug. 8, 2024
Int. Cl. G10L 15/32 (2013.01); G10L 15/00 (2013.01); G10L 15/06 (2013.01); G10L 15/30 (2013.01)

CPC G10L 15/32 (2013.01) [G10L 15/005 (2013.01); G10L 15/063 (2013.01); G10L 15/30 (2013.01); G10L 2015/0635 (2013.01)]

20 Claims

1. A computing system comprising:

one or more processors; and

one or more hardware storage devices storing one or more computer-readable instructions that are executable by the one or more processors to configure the computing system to at least:

obtain a plurality of language-specific automatic speech recognition modules, each language-specific automatic speech recognition module of the plurality of language-specific automatic speech recognition modules having been trained on a different language-specific training dataset and such that each of the plurality of language-specific automatic speech recognition modules is configured to recognize speech in a correspondingly different language of a plurality of different languages;

obtain a universal automatic speech recognition module trained on a multi-language training dataset comprising training data corresponding to each of the plurality of different languages and such that the universal automatic speech recognition module is trained to recognize speech in all of the plurality of different languages;

compile the universal automatic speech recognition module with the plurality of language-specific automatic speech recognition modules as a configurable multilingual model that is configured to selectively and dynamically utilize a sub-set of the plurality of language-specific automatic speech recognition modules with the universal automatic speech recognition module to process audio content in response to user input identifying one or more target languages associated with the audio content; and

train the configurable multilingual model to recognize user input for selecting combinations of the plurality of different languages when configuring the configurable multilingual model into a user-specific automatic speech recognition model by providing the configurable multilingual model with user choice input vectors corresponding to different combinations of the plurality of different languages.
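
The arrangement recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. The claim does not say how the universal and language-specific outputs are combined, so the additive combination below is an assumption, as are the multi-hot encoding of the user choice vector, the three-language set, the toy modules, and every name used (`choice_vector`, `all_choice_vectors`, `ConfigurableMultilingualModel`).

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Module = Callable[[Vector], Vector]

# Hypothetical language set; the claim only requires "a plurality of different languages".
LANGUAGES = ["en", "zh", "de"]


def choice_vector(targets: Sequence[str], languages: Sequence[str] = LANGUAGES) -> List[int]:
    """Encode the user's target languages as a multi-hot vector (assumed encoding)."""
    unknown = set(targets) - set(languages)
    if unknown:
        raise ValueError(f"unsupported languages: {sorted(unknown)}")
    return [1 if lang in targets else 0 for lang in languages]


def all_choice_vectors(languages: Sequence[str] = LANGUAGES) -> List[List[int]]:
    """Enumerate every non-empty language combination, e.g. to sample from during training."""
    vectors = []
    for size in range(1, len(languages) + 1):
        for combo in combinations(languages, size):
            vectors.append(choice_vector(combo, languages))
    return vectors


class ConfigurableMultilingualModel:
    """Universal module plus language-specific modules selected by a user choice vector."""

    def __init__(self, universal: Module, specific: Dict[str, Module],
                 languages: Sequence[str] = LANGUAGES) -> None:
        self.universal = universal
        self.specific = specific
        self.languages = list(languages)

    def forward(self, features: Vector, choice: Sequence[int]) -> Vector:
        # The universal module always runs.
        output = list(self.universal(features))
        # Only the selected subset of language-specific modules is used.
        # Summing the outputs is an assumption, not something the claim states.
        for lang, selected in zip(self.languages, choice):
            if selected:
                extra = self.specific[lang](features)
                output = [a + b for a, b in zip(output, extra)]
        return output


# Toy stand-ins for trained modules (hypothetical, for illustration only).
universal_module: Module = lambda x: list(x)
specific_modules: Dict[str, Module] = {
    "en": lambda x: [1.0] * len(x),
    "zh": lambda x: [10.0] * len(x),
    "de": lambda x: [100.0] * len(x),
}

model = ConfigurableMultilingualModel(universal_module, specific_modules)
user_choice = choice_vector(["en", "de"])
result = model.forward([0.0, 0.5], user_choice)
```

In this sketch, a user who selects English and German gets a model that runs the universal module and only those two language-specific modules. A training loop would draw vectors from `all_choice_vectors()` so that the model sees the different language combinations.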