CPC G10L 13/02 (2013.01) [G10L 21/10 (2013.01)]
18 Claims

1. A method for text to speech, comprising:
encoding a reference waveform of a first speaker in a feature separation encoder of a processor-based machine learning system to obtain an encoded style feature separated from a second speaker, wherein the feature separation encoder implements a feature separation algorithm based on entropy minimization utilizing pair-wise weights computed between features of respective pairs of multiple speakers;
transferring, in a synthesizer of the processor-based machine learning system, the encoded style feature to a spectrogram obtained by encoding an input text in a text encoder of the synthesizer, to obtain a style-transferred spectrogram; and
converting the style-transferred spectrogram into a time-domain speech waveform in a vocoder of the processor-based machine learning system;
wherein encoding the reference waveform of the first speaker to obtain the encoded style feature separated from the second speaker comprises:
inputting, into the feature separation encoder, a feature of the second speaker, an output of a speaker encoder obtained by encoding the first speaker with the speaker encoder, and an output of the synthesizer obtained by passing a random text through the synthesizer; and
performing feature learning based on the feature separation algorithm in the feature separation encoder, and encoding the reference waveform of the first speaker, to obtain the encoded style feature separated from the second speaker.
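The claim describes a three-stage pipeline: a feature separation encoder produces a style feature from a reference waveform, a synthesizer with a text encoder applies that style to a spectrogram generated from input text, and a vocoder converts the result to a time-domain waveform. The sketch below shows one way such a pipeline could be wired up; the module names, layer choices, and dimensions are illustrative assumptions, not the patented implementation.

```python
# A minimal pipeline sketch, assuming mel-spectrogram features and a fixed-size
# style embedding; all module internals here are stand-ins.
import torch
import torch.nn as nn

N_MELS, STYLE_DIM, TEXT_DIM, VOCAB = 80, 128, 256, 100


class FeatureSeparationEncoder(nn.Module):
    """Encodes a reference mel-spectrogram into a style embedding."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(N_MELS, STYLE_DIM, batch_first=True)

    def forward(self, ref_mel):                  # (B, T_ref, N_MELS)
        _, h = self.rnn(ref_mel)                 # final hidden state as style code
        return h.squeeze(0)                      # (B, STYLE_DIM)


class Synthesizer(nn.Module):
    """Text encoder plus decoder; the style embedding conditions the decoder."""
    def __init__(self):
        super().__init__()
        self.text_encoder = nn.Sequential(
            nn.Embedding(VOCAB, TEXT_DIM),
            nn.GRU(TEXT_DIM, TEXT_DIM, batch_first=True))
        self.decoder = nn.Linear(TEXT_DIM + STYLE_DIM, N_MELS)

    def forward(self, text_ids, style):          # (B, T_text), (B, STYLE_DIM)
        enc, _ = self.text_encoder(text_ids)     # (B, T_text, TEXT_DIM)
        style = style.unsqueeze(1).expand(-1, enc.size(1), -1)
        return self.decoder(torch.cat([enc, style], dim=-1))  # style-transferred mel


class Vocoder(nn.Module):
    """Stand-in vocoder: maps each mel frame to a block of waveform samples."""
    def __init__(self, hop=256):
        super().__init__()
        self.proj = nn.Linear(N_MELS, hop)

    def forward(self, mel):                      # (B, T, N_MELS)
        return self.proj(mel).flatten(1)         # (B, T * hop) time-domain samples


# Usage: a reference waveform of the first speaker yields a style feature,
# which is transferred onto the spectrogram of new input text.
ref_mel = torch.randn(1, 120, N_MELS)
text_ids = torch.randint(0, VOCAB, (1, 30))
style = FeatureSeparationEncoder()(ref_mel)
mel_out = Synthesizer()(text_ids, style)
wav = Vocoder()(mel_out)
print(wav.shape)                                 # torch.Size([1, 7680])
```

A production system would typically replace these single-layer stand-ins with an attention- or duration-based synthesizer and a neural vocoder, but the data flow matches the claimed encoder, synthesizer, and vocoder stages.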
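The "wherein" clause ties the style feature to a feature separation objective based on entropy minimization over pair-wise weights between speaker features, computed from three inputs: a feature of the second speaker, the speaker encoder's output for the first speaker, and a synthesizer output for a random text. The sketch below is one plausible reading of that objective, using cosine similarities as the pair-wise weights; the loss form, the temperature, and the pooled input shapes are assumptions, not the claimed algorithm.

```python
# A minimal sketch of an entropy-minimization objective over pair-wise weights
# between speaker features; one plausible reading of the claim, not the
# patented algorithm. Names and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

STYLE_DIM = 128


def feature_separation_loss(features: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """features: (S, D), one D-dimensional feature per speaker or source.

    Pair-wise weights are cosine similarities between every pair of features;
    minimizing the entropy of each row's weight distribution makes the weights
    peaked, so each feature is tied strongly to few sources and weakly to the
    rest, which is how this sketch interprets "separation".
    """
    normed = F.normalize(features, dim=-1)              # unit-norm features
    sim = normed @ normed.t() / temperature             # (S, S) pair-wise weights
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    sim = sim.masked_fill(mask, float("-inf"))          # drop self-pairs
    probs = F.softmax(sim, dim=-1)                      # row-wise weight distribution
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=-1)
    return entropy.mean()                               # lower entropy = sharper separation


# The claim's three inputs, stood in for here by pooled embeddings:
second_speaker_feat = torch.randn(STYLE_DIM)      # feature of the second speaker
speaker_encoder_out = torch.randn(STYLE_DIM)      # speaker encoder applied to the first speaker
synth_random_text = torch.randn(STYLE_DIM)        # synthesizer output for a random text (pooled)

inputs = torch.stack([second_speaker_feat,
                      speaker_encoder_out,
                      synth_random_text]).requires_grad_(True)
loss = feature_separation_loss(inputs)
loss.backward()                                   # gradients drive the feature learning step
print(float(loss))
```

In this reading, the gradient of the entropy term with respect to the stacked inputs is what performs the "feature learning based on the feature separation algorithm" recited in the claim, yielding a style feature for the first speaker that is separated from the second speaker's feature.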