US 12,456,451 B2
Speech synthesis method, apparatus, readable medium, and electronic device
Junjie Pan, Beijing (CN)
Assigned to BEIJING YOUZHUJU NETWORK TECHNOLOGY CO., LTD., Beijing (CN)
Appl. No. 18/017,570
Filed by BEIJING YOUZHUJU NETWORK TECHNOLOGY CO., LTD., Beijing (CN)
PCT Filed Oct. 25, 2021, PCT No. PCT/CN2021/126146
§ 371(c)(1), (2) Date Jan. 23, 2023
PCT Pub. No. WO2022/105545, PCT Pub. Date May 27, 2022.
Claims priority of application No. 202011312059.6 (CN), filed on Nov. 20, 2020.
Prior Publication US 2023/0298562 A1, Sep. 21, 2023
Int. Cl. G10L 13/08 (2013.01); G06F 40/30 (2020.01); G10L 13/047 (2013.01)
CPC G10L 13/08 (2013.01) [G06F 40/30 (2020.01); G10L 13/047 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A speech synthesis method, comprising:
acquiring target text corresponding to each sentence in a plurality of sentences included in text to be synthesized;
for each sentence, inputting the target text corresponding to the sentence, historical text, and historical audio into a pre-trained speech synthesis model to acquire target audio corresponding to the sentence which is output by the speech synthesis model, wherein the historical text is the target text corresponding to an associated sentence of the sentence in the text to be synthesized, and the historical audio is the target audio corresponding to the historical text; and
synthesizing target audio corresponding to respective sentences to obtain total audio corresponding to the text to be synthesized,
wherein the speech synthesis model is configured to:
obtain, based on the target text corresponding to the sentence, a text feature corresponding to the target text corresponding to the sentence,
obtain, based on the historical text, a historical text feature corresponding to the historical text, and obtain, based on the historical audio, a historical audio feature corresponding to the historical audio,
obtain, based on the text feature, the historical text feature, and the historical audio feature, a semantic feature corresponding to the target text corresponding to the sentence, and
obtain, based on the semantic feature corresponding to the target text corresponding to the sentence, the target audio corresponding to the sentence.
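The steps of claim 1 can be sketched as a per-sentence loop in which each sentence is synthesized conditioned on the text and audio of preceding (associated) sentences, and the resulting clips are concatenated into the total audio. The sketch below is illustrative only: the stub model, all function names, and the dummy waveform are hypothetical, since the claim does not prescribe any particular implementation of the pre-trained speech synthesis model.

```python
# Hypothetical sketch of the claimed method; not the patented implementation.
from typing import List

def speech_synthesis_model(target_text: str,
                           historical_text: List[str],
                           historical_audio: List[List[float]]) -> List[float]:
    """Stub standing in for the pre-trained speech synthesis model.

    Per the claim, the model internally derives a text feature from the
    target text, historical text/audio features from the context, fuses
    them into a semantic feature, and decodes that feature into audio.
    Here we simply return a dummy waveform sized to the text length.
    """
    return [0.0] * len(target_text)

def synthesize(sentences: List[str]) -> List[float]:
    """Synthesize each sentence conditioned on the historical text and
    historical audio of preceding sentences, then concatenate the
    per-sentence target audio into the total audio."""
    historical_text: List[str] = []
    historical_audio: List[List[float]] = []
    total_audio: List[float] = []
    for sentence in sentences:
        audio = speech_synthesis_model(sentence, historical_text, historical_audio)
        historical_text.append(sentence)   # this sentence becomes context
        historical_audio.append(audio)     # ...for the sentences that follow
        total_audio.extend(audio)          # splice into the total audio
    return total_audio
```

Feeding prior audio back in as conditioning context is what distinguishes this loop from sentence-by-sentence synthesis in isolation: each clip can match the prosody of what came before.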