| CPC G10L 13/08 (2013.01) [G10L 13/04 (2013.01)] | 13 Claims |

|
1. A method for phrase-based end-to-end text-to-speech (TTS) synthesis, comprising:
obtaining a text;
identifying a target phrase in the text;
determining a phrase context of the target phrase;
obtaining a reference audio;
generating an acoustic feature corresponding to the target phrase based at least on the target phrase, the phrase context and the reference audio, wherein the generating an acoustic feature comprises:
generating a context embedding representation of the phrase context;
generating an acoustic embedding representation of the reference audio; and
generating the acoustic feature through an acoustic model conditioned by the context embedding representation and the acoustic embedding representation; and
generating a speech waveform corresponding to the target phrase based on the acoustic feature.
|
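The data flow recited in claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation: every function name below (`split_phrases`, `phrase_context`, `embed_context`, `embed_reference_audio`, `acoustic_model`, `vocoder`, `synthesize_phrase`) is hypothetical, and the embeddings, acoustic model, and vocoder are toy arithmetic stand-ins for what would in practice be trained neural networks. The sketch assumes that phrases are delimited by punctuation and that the phrase context consists of the neighboring phrases; the claim itself does not specify either choice.

```python
import math
import re
from typing import List, Tuple


def split_phrases(text: str) -> List[str]:
    # Assumption: phrases are delimited by punctuation marks.
    return [p.strip() for p in re.split(r"[,.;!?]", text) if p.strip()]


def phrase_context(phrases: List[str], idx: int) -> Tuple[str, str]:
    # Assumption: the context is the previous and next phrase (empty at edges).
    prev_p = phrases[idx - 1] if idx > 0 else ""
    next_p = phrases[idx + 1] if idx + 1 < len(phrases) else ""
    return prev_p, next_p


def embed_context(context: Tuple[str, str], dim: int = 4) -> List[float]:
    # Toy context embedding: fold character codes into `dim` buckets.
    vec = [0.0] * dim
    for i, ch in enumerate(" ".join(context)):
        vec[i % dim] += ord(ch) / 1000.0
    return vec


def embed_reference_audio(samples: List[float], dim: int = 4) -> List[float]:
    # Toy acoustic embedding: mean absolute amplitude of `dim` chunks.
    if not samples:
        return [0.0] * dim
    chunk = max(1, len(samples) // dim)
    out = []
    for k in range(dim):
        part = samples[k * chunk:(k + 1) * chunk]
        out.append(sum(abs(s) for s in part) / len(part) if part else 0.0)
    return out


def acoustic_model(phrase: str, ctx_emb: List[float],
                   ac_emb: List[float]) -> List[List[float]]:
    # Toy acoustic model: one feature frame per character, conditioned on
    # both the context embedding and the acoustic embedding.
    cond = [c + a for c, a in zip(ctx_emb, ac_emb)]
    return [[ord(ch) / 255.0 + c for c in cond] for ch in phrase]


def vocoder(frames: List[List[float]], samples_per_frame: int = 8) -> List[float]:
    # Toy vocoder: each frame drives a short sinusoid segment.
    wave = []
    for frame in frames:
        freq = sum(frame) / len(frame)
        for n in range(samples_per_frame):
            wave.append(math.sin(2 * math.pi * freq * n / samples_per_frame))
    return wave


def synthesize_phrase(text: str, target_idx: int,
                      reference_audio: List[float]) -> List[float]:
    phrases = split_phrases(text)                    # obtain text, find phrases
    target = phrases[target_idx]                     # identify target phrase
    ctx = phrase_context(phrases, target_idx)        # determine phrase context
    ctx_emb = embed_context(ctx)                     # context embedding
    ac_emb = embed_reference_audio(reference_audio)  # acoustic embedding
    frames = acoustic_model(target, ctx_emb, ac_emb) # acoustic feature
    return vocoder(frames)                           # speech waveform
```

Calling `synthesize_phrase("Hello there, how are you today, fine thanks.", 1, [0.1, -0.2, 0.3, -0.4])` treats "how are you today" as the target phrase and returns a list of waveform samples. The point of the sketch is only the ordering of the steps and the fact that the acoustic model receives both embeddings as conditioning inputs.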