US 11,735,171 B2
Unified speech representation learning
Yao Qian, Bellevue, WA (US); Yu Wu, Beijing (CN); Kenichi Kumatani, Sammamish, WA (US); Shujie Liu, Beijing (CN); Furu Wei, Beijing (CN); Nanshan Zeng, Bellevue, WA (US); Xuedong David Huang, Yarrow Point, WA (US); and Chengyi Wang, Jinan (CN)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by MICROSOFT TECHNOLOGY LICENSING, LLC, Redmond, WA (US)
Filed on May 14, 2021, as Appl. No. 17/320,496.
Prior Publication US 2022/0366898 A1, Nov. 17, 2022
Int. Cl. G10L 15/187 (2013.01); G06N 20/00 (2019.01); G10L 15/06 (2013.01); G10L 15/22 (2006.01); G10L 15/02 (2006.01)
CPC G10L 15/187 (2013.01) [G06N 20/00 (2019.01); G10L 15/02 (2013.01); G10L 15/063 (2013.01); G10L 15/22 (2013.01); G10L 2015/025 (2013.01)]
20 Claims
 
1. A computing system comprising:
one or more processors; and
one or more computer-readable instructions that are executable by the one or more processors to configure the computing system to at least:
obtain a first training data set comprising labeled speech data or both labeled and unlabeled data sets;
apply the first training data set to a feature extractor of a machine learning model to generate latent speech representations;
apply the latent speech representations to a quantizer to generate quantized latent speech representations;
apply the latent speech representations to a transformer context network to generate contextual representations;
align each contextual representation included in the contextual representations with a phoneme label to generate phonetically aware contextual representations;
align quantized latent representations with phoneme labels to generate phonetically aware latent speech representations;
randomly replace a sub-set of the contextual representations with quantized latent speech representations during their alignments to phoneme labels; and
align the phonetically aware latent speech representations to the contextual representations using supervised learning.
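
As an illustration only, the quantizing and random-replacement steps recited in claim 1 can be sketched in a few lines of Python. The sketch makes several assumptions that the claim does not state: the quantizer is modeled as a nearest-neighbor codebook lookup, the transformer context network is stood in for by a fixed offset, replacement is decided independently per frame with a probability `prob`, and the names `quantize` and `random_replace` are hypothetical. The alignment to phoneme labels and the supervised training itself are not shown.

```python
import random


def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry.

    Nearest-neighbor lookup is an assumption made for this sketch;
    the claim only recites "a quantizer".
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [min(codebook, key=lambda entry: sqdist(z, entry)) for z in latents]


def random_replace(contextual, quantized, prob, rng):
    """Swap a random subset of contextual vectors for quantized ones.

    Each frame is replaced independently with probability `prob`
    (a hypothetical parameter). Returns the mixed sequence and a
    per-frame flag telling which frames were replaced.
    """
    if len(contextual) != len(quantized):
        raise ValueError("sequences must have the same number of frames")
    mixed, replaced = [], []
    for c, q in zip(contextual, quantized):
        swap = rng.random() < prob
        mixed.append(q if swap else c)
        replaced.append(swap)
    return mixed, replaced


# Toy latent speech representations: three frames, two dimensions each.
latents = [[0.1, 0.2], [0.9, 0.8], [0.3, 0.4]]
codebook = [[0.0, 0.0], [1.0, 1.0]]

quantized = quantize(latents, codebook)

# Stand-in for the transformer context network output (not a real model).
contextual = [[v + 0.05 for v in z] for z in latents]

# Seeded generator so the random subset is reproducible.
mixed, replaced = random_replace(contextual, quantized, prob=0.5,
                                 rng=random.Random(0))
```

In a fuller pipeline, the `mixed` sequence would be what is aligned with the phoneme labels, so that the contextual and quantized representations are both trained against the same phonetic targets.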