CPC G10L 13/02 (2013.01) [G10L 21/10 (2013.01)]
18 Claims

1. A method for text to speech, comprising:
encoding a reference waveform of a first speaker in a feature separation encoder of a processor-based machine learning system to obtain an encoded style feature separated from a second speaker, wherein the feature separation encoder implements a feature separation algorithm based on entropy minimization utilizing pair-wise weights computed between features of respective pairs of multiple speakers;
transferring, in a synthesizer of the processor-based machine learning system, the encoded style feature to a spectrogram obtained by encoding an input text in a text encoder of the synthesizer, to obtain a style-transferred spectrogram; and
converting the style-transferred spectrogram into a time-domain speech waveform in a vocoder of the processor-based machine learning system;
wherein encoding the reference waveform of the first speaker to obtain the encoded style feature separated from the second speaker comprises:
inputting, into the feature separation encoder, a feature of the second speaker, an output of a speaker encoder obtained by encoding the first speaker with the speaker encoder, and an output of the synthesizer obtained by passing a random text through the synthesizer; and
performing feature learning based on the feature separation algorithm in the feature separation encoder, and encoding the reference waveform of the first speaker, to obtain the encoded style feature separated from the second speaker.
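The claim describes a three-stage pipeline: a feature separation encoder produces a style feature from a reference waveform, a synthesizer with a text encoder applies that style to a spectrogram generated from input text, and a vocoder converts the result to a time-domain waveform. The sketch below shows one way such a pipeline could be wired up; the module names, layer choices, and dimensions are illustrative assumptions, not the patented implementation.

```python
# A minimal pipeline sketch, assuming mel-spectrogram features and a fixed-size
# style embedding; all module internals here are stand-ins.
import torch
import torch.nn as nn

N_MELS, STYLE_DIM, TEXT_DIM, VOCAB = 80, 128, 256, 100


class FeatureSeparationEncoder(nn.Module):
    """Encodes a reference mel-spectrogram into a style embedding."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(N_MELS, STYLE_DIM, batch_first=True)

    def forward(self, ref_mel):                  # (B, T_ref, N_MELS)
        _, h = self.rnn(ref_mel)                 # final hidden state as style code
        return h.squeeze(0)                      # (B, STYLE_DIM)


class Synthesizer(nn.Module):
    """Text encoder plus decoder; the style embedding conditions the decoder."""
    def __init__(self):
        super().__init__()
        self.text_encoder = nn.Sequential(
            nn.Embedding(VOCAB, TEXT_DIM),
            nn.GRU(TEXT_DIM, TEXT_DIM, batch_first=True))
        self.decoder = nn.Linear(TEXT_DIM + STYLE_DIM, N_MELS)

    def forward(self, text_ids, style):          # (B, T_text), (B, STYLE_DIM)
        enc, _ = self.text_encoder(text_ids)     # (B, T_text, TEXT_DIM)
        style = style.unsqueeze(1).expand(-1, enc.size(1), -1)
        return self.decoder(torch.cat([enc, style], dim=-1))  # style-transferred mel


class Vocoder(nn.Module):
    """Stand-in vocoder: maps each mel frame to a block of waveform samples."""
    def __init__(self, hop=256):
        super().__init__()
        self.proj = nn.Linear(N_MELS, hop)

    def forward(self, mel):                      # (B, T, N_MELS)
        return self.proj(mel).flatten(1)         # (B, T * hop) time-domain samples


# Usage: a reference waveform of the first speaker yields a style feature,
# which is transferred onto the spectrogram of new input text.
ref_mel = torch.randn(1, 120, N_MELS)
text_ids = torch.randint(0, VOCAB, (1, 30))
style = FeatureSeparationEncoder()(ref_mel)
mel_out = Synthesizer()(text_ids, style)
wav = Vocoder()(mel_out)
print(wav.shape)                                 # torch.Size([1, 7680])
```

A production system would typically replace these single-layer stand-ins with an attention- or duration-based synthesizer and a neural vocoder, but the data flow matches the claimed encoder, synthesizer, and vocoder stages.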
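The "wherein" clause ties the style feature to a feature separation objective based on entropy minimization over pair-wise weights between speaker features, computed from three inputs: a feature of the second speaker, the speaker encoder's output for the first speaker, and a synthesizer output for a random text. The sketch below is one plausible reading of that objective, using cosine similarities as the pair-wise weights; the loss form, the temperature, and the pooled input shapes are assumptions, not the claimed algorithm.

```python
# A minimal sketch of an entropy-minimization objective over pair-wise weights
# between speaker features; one plausible reading of the claim, not the
# patented algorithm. Names and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

STYLE_DIM = 128


def feature_separation_loss(features: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """features: (S, D), one D-dimensional feature per speaker or source.

    Pair-wise weights are cosine similarities between every pair of features;
    minimizing the entropy of each row's weight distribution makes the weights
    peaked, so each feature is tied strongly to few sources and weakly to the
    rest, which is how this sketch interprets "separation".
    """
    normed = F.normalize(features, dim=-1)              # unit-norm features
    sim = normed @ normed.t() / temperature             # (S, S) pair-wise weights
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    sim = sim.masked_fill(mask, float("-inf"))          # drop self-pairs
    probs = F.softmax(sim, dim=-1)                      # row-wise weight distribution
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=-1)
    return entropy.mean()                               # lower entropy = sharper separation


# The claim's three inputs, stood in for here by pooled embeddings:
second_speaker_feat = torch.randn(STYLE_DIM)      # feature of the second speaker
speaker_encoder_out = torch.randn(STYLE_DIM)      # speaker encoder applied to the first speaker
synth_random_text = torch.randn(STYLE_DIM)        # synthesizer output for a random text (pooled)

inputs = torch.stack([second_speaker_feat,
                      speaker_encoder_out,
                      synth_random_text]).requires_grad_(True)
loss = feature_separation_loss(inputs)
loss.backward()                                   # gradients drive the feature learning step
print(float(loss))
```

In this reading, the gradient of the entropy term with respect to the stacked inputs is what performs the "feature learning based on the feature separation algorithm" recited in the claim, yielding a style feature for the first speaker that is separated from the second speaker's feature.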