| CPC G10L 13/08 (2013.01) [G06F 40/30 (2020.01); G10L 13/047 (2013.01)] | 20 Claims |

|
1. A speech synthesis method, comprising:
acquiring target text corresponding to each sentence in a plurality of sentences included in text to be synthesized;
for each sentence, inputting the target text corresponding to the sentence, historical text, and historical audio into a pre-trained speech synthesis model to acquire target audio corresponding to the sentence which is output by the speech synthesis model, wherein the historical text is target text corresponding to an associated sentence of each sentence in the text to be synthesized, and the historical audio is target audio corresponding to the historical text; and
synthesizing target audio corresponding to respective sentences to obtain total audio corresponding to the text to be synthesized,
wherein the speech synthesis model is configured to:
obtain, based on the target text corresponding to the sentence, a text feature corresponding to the target text corresponding to the sentence,
obtain, based on the historical text, a historical text feature corresponding to the historical text, and obtain, based on the historical audio, a historical audio feature corresponding to the historical audio,
obtain, based on the text feature, the historical text feature, and the historical audio feature, a semantic feature corresponding to the target text corresponding to the sentence, and
obtain, based on the semantic feature corresponding to the target text corresponding to the sentence, the target audio corresponding to the sentence.
|
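The data flow recited in claim 1 can be illustrated with a minimal sketch. This is not the patented implementation: `ToySpeechModel`, `synthesize`, and every method name are hypothetical, the "features" are trivial numeric placeholders rather than learned embeddings, and the "audio" is a list of floats. The sketch also assumes that the associated sentence is the immediately preceding sentence and that the first sentence receives empty history; the claim does not specify either point.

```python
from typing import List, Tuple

Audio = List[float]    # toy stand-in for a waveform
Feature = List[float]  # toy stand-in for an embedding


class ToySpeechModel:
    """Hypothetical stand-in for the pre-trained model, mirroring the four claimed steps."""

    def __init__(self) -> None:
        # Records (target text, historical text, historical audio length) per call.
        self.calls: List[Tuple[str, str, int]] = []

    def text_feature(self, text: str) -> Feature:
        # Placeholder text encoder: word count and character count.
        return [float(len(text.split())), float(len(text))]

    def audio_feature(self, audio: Audio) -> Feature:
        # Placeholder audio encoder: sample count and mean amplitude.
        mean = sum(audio) / len(audio) if audio else 0.0
        return [float(len(audio)), mean]

    def semantic_feature(self, text_f: Feature, hist_text_f: Feature,
                         hist_audio_f: Feature) -> Feature:
        # Placeholder fusion: weighted element-wise sum of the three features.
        return [t + 0.1 * ht + 0.1 * ha
                for t, ht, ha in zip(text_f, hist_text_f, hist_audio_f)]

    def decode(self, semantic: Feature, n_samples: int) -> Audio:
        # Placeholder decoder: constant-level audio, one sample per word.
        level = sum(semantic) / len(semantic)
        return [level] * n_samples

    def __call__(self, target_text: str, historical_text: str,
                 historical_audio: Audio) -> Audio:
        self.calls.append((target_text, historical_text, len(historical_audio)))
        text_f = self.text_feature(target_text)
        hist_text_f = self.text_feature(historical_text)
        hist_audio_f = self.audio_feature(historical_audio)
        semantic = self.semantic_feature(text_f, hist_text_f, hist_audio_f)
        return self.decode(semantic, n_samples=len(target_text.split()))


def synthesize(sentences: List[str], model: ToySpeechModel) -> Audio:
    """Synthesize each sentence with history, then join the per-sentence audio."""
    total: Audio = []
    historical_text: str = ""      # assumption: no history for the first sentence
    historical_audio: Audio = []
    for target_text in sentences:
        target_audio = model(target_text, historical_text, historical_audio)
        total.extend(target_audio)  # simple concatenation as the final synthesis step
        # Assumption: the associated sentence is the one just synthesized.
        historical_text, historical_audio = target_text, target_audio
    return total


model = ToySpeechModel()
total_audio = synthesize(["Hello there.", "How are you today?"], model)
```

In this sketch, the second call to the model receives the first sentence's text and the audio that the model itself produced for it, which is the feedback loop the claim describes: each sentence's output becomes the historical audio for a later sentence.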