CPC G10L 13/00 (2013.01) [G10L 19/02 (2013.01)] | 18 Claims |
1. A speech synthesis method performed at a computer device having one or more processors and memory storing one or more programs to be executed by the one or more processors, the method comprising:
obtaining linguistic data;
encoding the linguistic data, to obtain encoded linguistic data;
obtaining reference linguistic data and corresponding target reference speech data;
encoding the reference linguistic data, to obtain encoded reference linguistic data;
decoding the encoded reference linguistic data, to obtain synthesized reference speech data;
determining a residual between the target reference speech data and the synthesized reference speech data;
obtaining an embedded vector for speech feature conversion, the embedded vector representing a speaking style feature of a target user and being generated according to the residual between the synthesized reference speech data and the target reference speech data, both of which correspond to the same reference linguistic data, the reference linguistic data being different from the linguistic data; and
decoding the encoded linguistic data by performing the speech feature conversion on the encoded linguistic data according to the embedded vector, to obtain target synthesized speech data corresponding to the linguistic data.
|
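The data flow recited in claim 1 can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names (`encode`, `decode`, `residual`, `embed_residual`, `embedded_vector_from_reference`, `synthesize`), the two-dimensional vectors, and the arithmetic stand-ins for the encoder, decoder, and residual embedding are hypothetical. The claim does not specify these components, and a real system would presumably use trained models in their place.

```python
from typing import List, Optional

Vector = List[float]


def encode(linguistic_data: str) -> List[Vector]:
    """Toy encoder (hypothetical): one 2-dimensional vector per character."""
    return [[ord(ch) / 128.0, 1.0] for ch in linguistic_data]


def decode(encoded: List[Vector],
           embedded_vector: Optional[Vector] = None) -> List[float]:
    """Toy decoder (hypothetical): one 'speech frame' value per encoded step.

    If an embedded vector is supplied, it is added to every encoded step
    before decoding, standing in for the speech feature conversion.
    """
    frames = []
    for step in encoded:
        if embedded_vector is not None:
            step = [s + e for s, e in zip(step, embedded_vector)]
        frames.append(sum(step))
    return frames


def residual(target: List[float], synthesized: List[float]) -> List[float]:
    """Frame-wise difference between target and synthesized speech data."""
    return [t - s for t, s in zip(target, synthesized)]


def embed_residual(res: List[float], dim: int = 2) -> Vector:
    """Toy embedding (hypothetical): compress a variable-length residual
    into a fixed-size vector by spreading its mean over `dim` entries."""
    mean = sum(res) / len(res) if res else 0.0
    return [mean / dim] * dim


def embedded_vector_from_reference(reference_text: str,
                                   target_reference_speech: List[float]) -> Vector:
    """Reference branch: encode, decode without style, take the residual
    against the target reference speech, and embed that residual."""
    encoded_ref = encode(reference_text)
    synthesized_ref = decode(encoded_ref)
    return embed_residual(residual(target_reference_speech, synthesized_ref))


def synthesize(text: str, reference_text: str,
               target_reference_speech: List[float]) -> List[float]:
    """Main branch: encode the input text, then decode it conditioned on
    the embedded vector derived from the reference pair."""
    encoded = encode(text)
    style = embedded_vector_from_reference(reference_text, target_reference_speech)
    return decode(encoded, style)
```

In this sketch the reference text and the text to be synthesized are different strings, mirroring the claim's requirement that the reference linguistic data differ from the linguistic data. The embedded vector depends only on the reference pair, so it can be computed once and reused across many inputs.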