US 12,217,745 B2
Unified speech representation learning
Yao Qian, Bellevue, WA (US); Yu Wu, Beijing (CN); Kenichi Kumatani, Sammamish, WA (US); Shujie Liu, Beijing (CN); Furu Wei, Beijing (CN); Nanshan Zeng, Bellevue, WA (US); Xuedong David Huang, Yarrow Point, WA (US); and Chengyi Wang, Jinan (CN)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on Jul. 3, 2023, as Appl. No. 18/217,888.
Application 18/217,888 is a continuation of application No. 17/320,496, filed on May 14, 2021, granted, now U.S. Pat. No. 11,735,171.
Prior Publication US 2023/0368782 A1, Nov. 16, 2023
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 15/187 (2013.01); G06N 20/00 (2019.01); G10L 15/02 (2006.01); G10L 15/06 (2013.01); G10L 15/22 (2006.01)
CPC G10L 15/187 (2013.01) [G06N 20/00 (2019.01); G10L 15/02 (2013.01); G10L 15/063 (2013.01); G10L 15/22 (2013.01); G10L 2015/025 (2013.01)] 20 Claims
 
1. A computing system comprising:
one or more processors; and
one or more computer-readable instructions that are executable by the one or more processors to configure the computing system to at least:
obtain a first training data set comprising labeled speech data or both labeled and unlabeled data corresponding to a high-resource data set, as well as latent speech representations based on the first training data set;
train a machine learning model on the first training data set to learn phonetically aware speech representations corresponding to the first training data set;
apply the latent speech representations from the machine learning model to a transformer context network to generate contextual representations;
align each contextual representation included in the contextual representations with a phoneme label to generate phonetically-aware contextual representations; and
cause a refinement engine to further refine the machine learning model based on a refinement dataset, wherein the refinement engine fine-tunes the machine learning model on a limited labeled dataset corresponding to a low-resource target language and/or target domain; and
transform at least some of the contextual representations by randomly replacing a sub-set of the contextual representations with quantized latent speech representations.
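
For illustration only, the final step of claim 1 (randomly replacing a sub-set of the contextual representations with quantized latent speech representations) might be sketched as follows. The claim does not specify how the sub-set is chosen, so the per-frame replacement probability `replace_prob`, the one-to-one pairing of contextual frames with quantized latents, and the names `replace_with_quantized`, `contextual`, and `quantized` are all assumptions made for this sketch rather than details taken from the patent.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]


def replace_with_quantized(
    contextual: Sequence[Vector],
    quantized: Sequence[Vector],
    replace_prob: float,
    rng: random.Random,
) -> Tuple[List[Vector], List[int]]:
    """Swap a random sub-set of contextual frames for their quantized latents.

    Assumption: frame t of `contextual` and frame t of `quantized` describe
    the same time step, and each frame is replaced independently with
    probability `replace_prob`.
    """
    if len(contextual) != len(quantized):
        raise ValueError("expected one quantized latent per contextual frame")

    mixed: List[Vector] = []
    replaced: List[int] = []  # indices of the frames that were swapped
    for t, (ctx, quant) in enumerate(zip(contextual, quantized)):
        if rng.random() < replace_prob:
            mixed.append(list(quant))
            replaced.append(t)
        else:
            mixed.append(list(ctx))
    return mixed, replaced


# Toy data: four frames of two-dimensional representations.
contextual = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
quantized = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

mixed, replaced = replace_with_quantized(
    contextual, quantized, replace_prob=0.5, rng=random.Random(0)
)
```

In this sketch the returned `replaced` indices record which frames were swapped, so that a downstream training loss could, hypothetically, treat replaced and unreplaced frames differently.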