US 12,444,401 B2
Method, apparatus, computer readable medium, and electronic device of speech synthesis
Haopeng Lin, Beijing (CN); and Zejun Ma, Beijing (CN)
Assigned to Beijing Youzhuju Network Technology Co., Ltd., Beijing (CN)
Filed by Beijing Youzhuju Network Technology Co., Ltd., Beijing (CN)
Filed on Aug. 26, 2024, as Appl. No. 18/815,598.
Application 18/815,598 is a continuation of application No. PCT/CN2023/077478, filed on Feb. 21, 2023.
Claims priority of application No. CN202210179831.4 (CN), filed on Feb. 25, 2022.
Prior Publication US 2024/0420678 A1, Dec. 19, 2024
Int. Cl. G10L 13/08 (2013.01); G10L 13/02 (2013.01)
CPC G10L 13/08 (2013.01) [G10L 13/02 (2013.01)] 14 Claims
OG exemplary drawing
 
1. A method of speech synthesis, comprising:
obtaining a phoneme sequence corresponding to text to be synthesized;
inputting the phoneme sequence and the text to be synthesized into a speech synthesis model;
generating, via the speech synthesis model, a phonemic-level tones and break indices (TOBI) representation sequence and a prosodic-acoustic feature corresponding to the text to be synthesized based on the phoneme sequence and the text to be synthesized, and generating acoustic feature information corresponding to the text to be synthesized based on the TOBI representation sequence and the prosodic-acoustic feature; and
generating first audio information corresponding to the text to be synthesized based on the acoustic feature information,
wherein the speech synthesis model comprises an encoding network, an attention network, a decoding network, a prosodic language feature prediction module, a prosodic-acoustic feature prediction module, an embedded layer, a first splicing module, a second splicing module, and a third splicing module,
the prosodic language feature prediction module is configured to generate, based on the text to be synthesized, a phonemic-level TOBI representation sequence corresponding to the text to be synthesized,
the embedded layer is configured to generate a phoneme representation sequence corresponding to the text to be synthesized based on the phoneme sequence,
the first splicing module is configured to splice the phonemic-level TOBI representation sequence and the phoneme representation sequence to obtain a first splicing sequence,
the encoding network is configured to encode the first splicing sequence to generate a coded sequence,
the second splicing module is configured to splice the coded sequence and the phonemic-level TOBI representation sequence to obtain a second splicing sequence,
the prosodic-acoustic feature prediction module is configured to generate the prosodic-acoustic feature corresponding to the text to be synthesized based on the second splicing sequence,
the third splicing module is configured to splice the coded sequence and the prosodic-acoustic feature to obtain a third splicing sequence,
the attention network is configured to generate, based on the third splicing sequence, a semantic representation corresponding to the text to be synthesized, and
the decoding network is configured to generate, based on the semantic representation, acoustic feature information corresponding to the text to be synthesized.
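 
The following is a minimal, illustrative sketch of the data flow recited in claim 1, assuming PyTorch. It is not taken from the patent: the class name SpeechSynthesisModelSketch, all dimensions, the placeholder sub-networks (an embedding layer, linear ToBI and prosody predictors, a GRU encoder, multi-head self-attention, a linear decoder), and the hypothetical text_features input are assumptions, since the claim does not specify the internals of any module; the vocoder step that would turn the acoustic feature information into the first audio information is omitted.

# Sketch of the claim-1 data flow, assuming PyTorch. All internals are
# illustrative placeholders; the patent does not specify them.
import torch
import torch.nn as nn


class SpeechSynthesisModelSketch(nn.Module):
    def __init__(self, n_phonemes=100, phoneme_dim=128, tobi_dim=32,
                 enc_dim=256, prosody_dim=16, mel_dim=80):
        super().__init__()
        # Embedded layer: phoneme sequence -> phoneme representation sequence.
        self.phoneme_embedding = nn.Embedding(n_phonemes, phoneme_dim)
        # Prosodic language feature prediction module (placeholder):
        # text features -> phonemic-level TOBI representation sequence.
        self.tobi_predictor = nn.Linear(phoneme_dim, tobi_dim)
        # Encoding network: first splicing sequence -> coded sequence.
        self.encoder = nn.GRU(phoneme_dim + tobi_dim, enc_dim, batch_first=True)
        # Prosodic-acoustic feature prediction module: second splicing
        # sequence -> prosodic-acoustic feature.
        self.prosody_predictor = nn.Linear(enc_dim + tobi_dim, prosody_dim)
        # Attention network (placeholder self-attention over the third splice).
        self.attention = nn.MultiheadAttention(enc_dim + prosody_dim,
                                               num_heads=4, batch_first=True)
        # Decoding network: semantic representation -> acoustic feature
        # information (a mel-spectrogram-like output here).
        self.decoder = nn.Linear(enc_dim + prosody_dim, mel_dim)

    def forward(self, phoneme_ids, text_features):
        # (1) Phoneme representation sequence from the embedded layer.
        phon_repr = self.phoneme_embedding(phoneme_ids)           # [B, T, Dp]
        # (2) Phonemic-level TOBI representation sequence from the text.
        tobi_seq = self.tobi_predictor(text_features)             # [B, T, Dt]
        # (3) First splicing module: TOBI sequence + phoneme representation.
        first_splice = torch.cat([tobi_seq, phon_repr], dim=-1)
        # (4) Encoding network produces the coded sequence.
        coded_seq, _ = self.encoder(first_splice)                 # [B, T, De]
        # (5) Second splicing module: coded sequence + TOBI sequence.
        second_splice = torch.cat([coded_seq, tobi_seq], dim=-1)
        # (6) Prosodic-acoustic feature prediction.
        prosody = self.prosody_predictor(second_splice)           # [B, T, Dr]
        # (7) Third splicing module: coded sequence + prosodic-acoustic feature.
        third_splice = torch.cat([coded_seq, prosody], dim=-1)
        # (8) Attention network yields the semantic representation.
        semantic, _ = self.attention(third_splice, third_splice, third_splice)
        # (9) Decoding network yields the acoustic feature information;
        #     a separate vocoder (not shown) would generate the first audio
        #     information from it.
        return self.decoder(semantic)


if __name__ == "__main__":
    model = SpeechSynthesisModelSketch()
    phoneme_ids = torch.randint(0, 100, (2, 20))      # [batch, phoneme sequence]
    text_features = torch.randn(2, 20, 128)           # hypothetical text features
    mel = model(phoneme_ids, text_features)
    print(mel.shape)                                  # torch.Size([2, 20, 80])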