US 12,266,342 B2
Multi-speaker neural text-to-speech synthesis
Yan Deng, Redmond, WA (US); and Lei He, Redmond, WA (US)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Appl. No. 17/293,640
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
PCT Filed Dec. 11, 2018, PCT No. PCT/CN2018/120300
§ 371(c)(1), (2) Date May 13, 2021,
PCT Pub. No. WO2020/118521, PCT Pub. Date Jun. 18, 2020.
Prior Publication US 2022/0013106 A1, Jan. 13, 2022
Int. Cl. G10L 13/08 (2013.01); G06N 3/045 (2023.01); G10L 13/047 (2013.01)
CPC G10L 13/08 (2013.01) [G06N 3/045 (2023.01); G10L 13/047 (2013.01)] 19 Claims
OG exemplary drawing
 
1. A method for generating speech through multi-speaker neural text-to-speech (TTS) synthesis, comprising:
receiving a text input at an acoustic feature predictor, the acoustic feature predictor including an encoder and a decoder, the decoder having:
a linear projection;
a feed-forward layer that receives a first output from the linear projection;
a memory layer; and
convolution layers;
providing, through at least one speaker model, speaker latent space information of a target speaker, wherein the speaker latent space information comprises a first speaker embedding vector;
using the feed-forward layer, the memory layer, the convolution layers, and a combination of a second output of the linear projection with an output of the convolution layers to predict at least one acoustic feature based on the text input and the speaker latent space information;
receiving, at a quasi-recurrent neural network of a first neural network, the at least one acoustic feature;
transforming, at the first neural network, the at least one acoustic feature from a first dimension to a second dimension;
providing, to a second neural network, a second speaker embedding vector, the first speaker embedding vector and the second speaker embedding vector being associated with a same speaker, wherein the speaker latent space information comprises the second speaker embedding vector;
transforming, at the second neural network, the second speaker embedding vector from a third dimension to the second dimension;
combining the transformed second speaker embedding vector output from the second neural network and the transformed at least one acoustic feature output from the first neural network to generate a combined input; and
generating, through a neural vocoder, a speech waveform corresponding to the text input based on the combined input and the speaker latent space information having the first speaker embedding vector and the second speaker embedding vector.
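
The decoder wiring recited in claim 1 (a linear projection whose first output feeds a feed-forward layer and whose second output is combined with the convolution layers' output) can be illustrated with a brief sketch. The following PyTorch code is a hypothetical reading of the claim, not the patented implementation; the use of multi-head attention as the "memory layer," the layer sizes, and all class and parameter names are assumptions.

    import torch
    import torch.nn as nn

    class DecoderBlockSketch(nn.Module):
        """Illustrative wiring of the claimed decoder; sizes are assumptions."""
        def __init__(self, dim=256):
            super().__init__()
            self.linear_proj = nn.Linear(dim, dim)
            self.feed_forward = nn.Sequential(
                nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
            # "Memory layer": attention over encoder outputs is an assumption;
            # the claim does not name a specific mechanism.
            self.memory = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
            self.convs = nn.Sequential(
                nn.Conv1d(dim, dim, kernel_size=5, padding=2), nn.ReLU(),
                nn.Conv1d(dim, dim, kernel_size=5, padding=2))

        def forward(self, x, encoder_out):
            # The projection is used twice, matching the claim's "first output"
            # (into the feed-forward layer) and "second output" (combined with
            # the convolution layers' output).
            p = self.linear_proj(x)
            h = self.feed_forward(p)
            h, _ = self.memory(h, encoder_out, encoder_out)
            c = self.convs(h.transpose(1, 2)).transpose(1, 2)
            return p + c  # combination used to predict the acoustic feature(s)

    x = torch.randn(2, 50, 256)        # decoder inputs (batch, time, dim)
    enc = torch.randn(2, 30, 256)      # encoder outputs
    out = DecoderBlockSketch()(x, enc) # -> (2, 50, 256)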
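The vocoder conditioning path (the "first" and "second" neural networks and their combination) can likewise be sketched. This is a minimal illustration under assumed dimensions: 80 for the acoustic features, 256 for the speaker embedding, and 512 for the shared second dimension. The quasi-recurrent layer follows the f-pooling formulation of Bradbury et al. (2016), which the claim does not itself specify.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class QRNNLayer(nn.Module):
        """Simplified quasi-recurrent layer: causal convolutions compute
        candidate and forget gates, then an elementwise recurrence mixes
        them over time (f-pooling)."""
        def __init__(self, in_dim, out_dim, kernel=2):
            super().__init__()
            self.kernel = kernel
            self.conv_z = nn.Conv1d(in_dim, out_dim, kernel)
            self.conv_f = nn.Conv1d(in_dim, out_dim, kernel)

        def forward(self, x):                     # x: (batch, time, in_dim)
            xt = F.pad(x.transpose(1, 2), (self.kernel - 1, 0))  # causal pad
            z = torch.tanh(self.conv_z(xt))       # candidate, (batch, out, time)
            f = torch.sigmoid(self.conv_f(xt))    # forget gate
            h = torch.zeros(x.size(0), z.size(1), device=x.device)
            outs = []
            for t in range(z.size(2)):            # elementwise recurrence
                h = f[:, :, t] * h + (1 - f[:, :, t]) * z[:, :, t]
                outs.append(h)
            return torch.stack(outs, dim=1)       # (batch, time, out_dim)

    class VocoderConditioner(nn.Module):
        """First NN: QRNN mapping acoustic features from a first dimension
        (80, assumed) to a second dimension (512, assumed). Second NN: linear
        map taking the second speaker embedding vector from a third dimension
        (256, assumed) to that same second dimension."""
        def __init__(self, mel_dim=80, spk_dim=256, cond_dim=512):
            super().__init__()
            self.first_nn = QRNNLayer(mel_dim, cond_dim)
            self.second_nn = nn.Linear(spk_dim, cond_dim)

        def forward(self, mels, spk_emb):
            a = self.first_nn(mels)                   # (batch, time, cond_dim)
            s = self.second_nn(spk_emb).unsqueeze(1)  # (batch, 1, cond_dim)
            return a + s  # combined input, broadcast over time

    cond = VocoderConditioner()
    mels = torch.randn(2, 100, 80)  # predicted acoustic features
    spk = torch.randn(2, 256)       # second speaker embedding vector
    combined = cond(mels, spk)      # conditioning input for a neural vocoder
    print(combined.shape)           # torch.Size([2, 100, 512])

Summing the broadcast speaker projection with the time-aligned acoustic features is one plausible realization of the claimed "combining" step; concatenation followed by a projection would satisfy the same claim language.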