US 11,996,083 B2
Global prosody style transfer without text transcriptions
Kaizhi Qian, Champaign, IL (US); Yang Zhang, Cambridge, MA (US); Shiyu Chang, Elmsford, NY (US); Jinjun Xiong, Goldens Bridge, NY (US); Chuang Gan, Cambridge, MA (US); and David Cox, Somerville, MA (US)
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION, Armonk, NY (US)
Filed by INTERNATIONAL BUSINESS MACHINES CORPORATION, Armonk, NY (US)
Filed on Jun. 3, 2021, as Appl. No. 17/337,518.
Prior Publication US 2022/0392429 A1, Dec. 8, 2022
Int. Cl. G10L 13/10 (2013.01); G06N 20/00 (2019.01); G10L 17/04 (2013.01); G10L 21/013 (2013.01); G10L 25/63 (2013.01)
CPC G10L 13/10 (2013.01) [G06N 20/00 (2019.01); G10L 17/04 (2013.01); G10L 21/013 (2013.01); G10L 25/63 (2013.01)] 17 Claims
OG exemplary drawing
 
1. A computer-implemented method of using a machine learning model for disentanglement of prosody in spoken natural language, the method comprising:
encoding, by a computing device, the spoken natural language to produce content code;
resampling, by the computing device without text transcriptions, the content code to obscure the prosody by applying an unsupervised technique to the machine learning model to generate prosody-obscured content code, the content code being resampled using a similarity-based random resampling technique configured to shorten, using similarity-based downsampling, or lengthen, using similarity-based upsampling, content code segments having a similarity above a prosody similarity threshold, such that the content code segments are of equal length to each other to form the prosody-obscured content code; and
decoding, by the computing device, the prosody-obscured content code to synthesize speech indirectly based upon the content code.
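The resampling step recited in the claim can be illustrated with a short sketch. The following is a minimal, hypothetical implementation, not the patented method itself: it assumes content code is a sequence of frame vectors, groups adjacent frames whose cosine similarity exceeds an assumed `sim_threshold`, and resamples every such segment to an assumed fixed `target_len` (shortening longer segments, lengthening shorter ones) so the segments become equal in length, obscuring the original rhythm.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two code frames; epsilon guards zero vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def resample_content_code(codes, sim_threshold=0.9, target_len=2):
    """Similarity-based resampling sketch (illustrative only).

    codes: (T, D) array of content-code frames.
    Adjacent frames with cosine similarity above `sim_threshold` form a
    segment; each segment is resampled to `target_len` frames, so all
    segments end up equal in length and the original timing is obscured.
    """
    # 1. Segment: start a new segment when similarity drops below threshold.
    segments, start = [], 0
    for t in range(1, len(codes)):
        if cosine(codes[t - 1], codes[t]) < sim_threshold:
            segments.append(codes[start:t])
            start = t
    segments.append(codes[start:])

    # 2. Equalize: nearest-index resampling shortens (downsampling) or
    #    lengthens (upsampling) each segment to `target_len` frames.
    out = []
    for seg in segments:
        idx = np.linspace(0, len(seg) - 1, target_len).round().astype(int)
        out.append(seg[idx])
    return np.concatenate(out, axis=0)

# Usage: a 5-frame plateau followed by a 3-frame plateau collapses to
# two equal-length segments of 2 frames each.
codes = np.array([[1.0, 0.0]] * 5 + [[0.0, 1.0]] * 3)
resampled = resample_content_code(codes, sim_threshold=0.9, target_len=2)
```

In the claimed method this resampling is applied to encoder output and the result is decoded back to speech; the sketch above only shows the segment-equalization idea in isolation.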