US 11,908,451 B2
Text-based virtual object animation generation method, apparatus, storage medium, and terminal
Congyi Wang, Shanghai (CN); Yu Chen, Shanghai (CN); and Jinxiang Chai, Shanghai (CN)
Assigned to Mofa (Shanghai) Information Technology Co., Ltd., Shanghai (CN); and Shanghai Movu Technology Co., Ltd., Shanghai (CN)
Appl. No. 18/024,021
Filed by MOFA (SHANGHAI) INFORMATION TECHNOLOGY CO., LTD., Shanghai (CN); and SHANGHAI MOVU TECHNOLOGY CO., LTD., Shanghai (CN)
PCT Filed Aug. 9, 2021, PCT No. PCT/CN2021/111424
§ 371(c)(1), (2) Date Feb. 28, 2023,
PCT Pub. No. WO2022/048405, PCT Pub. Date Mar. 10, 2022.
Claims priority of application No. 202010905539.7 (CN), filed on Sep. 1, 2020.
Prior Publication US 2023/0267916 A1, Aug. 24, 2023
Int. Cl. G10L 13/10 (2013.01); G06T 13/00 (2011.01); G10L 13/033 (2013.01); G10L 13/047 (2013.01); G10L 15/02 (2006.01); G10L 15/26 (2006.01)
CPC G10L 13/10 (2013.01) [G06T 13/00 (2013.01); G10L 13/033 (2013.01); G10L 13/047 (2013.01); G10L 15/02 (2013.01); G10L 15/26 (2013.01); G10L 2013/105 (2013.01)] 17 Claims
OG exemplary drawing
 
1. A text-based virtual object animation generation method, comprising:
acquiring text information, wherein the text information comprises an original text of a virtual object animation to be generated;
analyzing an emotional feature and a rhyme boundary of the text information;
performing speech synthesis according to the emotional feature, the rhyme boundary, and the text information to obtain audio information, wherein the audio information comprises emotional speech obtained by conversion based on the original text; and
generating a corresponding virtual object animation based on the text information and the audio information, wherein the virtual object animation is synchronized in time with the audio information;
wherein generating the corresponding virtual object animation based on the text information and the audio information comprises:
receiving input information, wherein the input information comprises the text information and the audio information;
converting the audio information into pronunciation units and time codes based on speech recognition technology and a preset pronunciation dictionary, wherein the text information is used for determining a duration of each piece of speech in the audio information;
performing a time alignment operation on the pronunciation units according to the time codes so as to obtain a time-aligned pronunciation unit sequence;
performing a feature analysis on the time-aligned pronunciation unit sequence to obtain a corresponding linguistic feature sequence; and
inputting the linguistic feature sequence into a preset temporal sequence mapping model to generate the corresponding virtual object animation based on the linguistic feature sequence;
wherein performing the speech synthesis according to the emotional feature, the rhyme boundary, and the text information to obtain the audio information comprises:
inputting the text information, the emotional feature, and the rhyme boundary into a preset speech synthesis model, wherein the preset speech synthesis model is used for converting an inputted text sequence into a speech sequence in a temporal sequence, and speech in the speech sequence carries emotion of the text at a corresponding point in time; and
acquiring the audio information outputted by the preset speech synthesis model.
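The claimed pipeline (text analysis, emotion-aware speech synthesis, conversion to time-coded pronunciation units, time alignment, linguistic feature extraction, and temporal sequence mapping to animation) can be sketched as follows. This is a minimal illustrative sketch only: every function, duration heuristic, and data shape below is an assumption made for demonstration, not the patented speech synthesis model or temporal sequence mapping model.

```python
# Illustrative sketch of the claimed text-to-animation pipeline.
# All functions are hypothetical stubs standing in for the claimed components.

def analyze_text(text):
    """Analyze an emotional feature and rhyme boundaries of the text information."""
    emotion = "neutral"                          # stand-in emotional feature
    boundaries = [len(w) for w in text.split()]  # stand-in rhyme boundaries
    return emotion, boundaries

def synthesize_speech(text, emotion, boundaries):
    """Stand-in preset speech synthesis model: text sequence -> timed speech
    sequence, where each unit carries the emotion at its point in time."""
    t, audio = 0.0, []
    for word in text.split():
        dur = 0.1 * len(word)                    # toy duration model
        audio.append({"unit": word, "start": t, "end": t + dur,
                      "emotion": emotion})
        t += dur
    return audio

def align_pronunciation_units(audio):
    """Perform a time alignment on the pronunciation units by their time codes,
    yielding a time-aligned pronunciation unit sequence."""
    return sorted(audio, key=lambda u: u["start"])

def linguistic_features(aligned):
    """Feature analysis: aligned pronunciation units -> linguistic feature
    sequence (unit identity, duration, carried emotion)."""
    return [(u["unit"], u["end"] - u["start"], u["emotion"]) for u in aligned]

def temporal_mapping_model(features):
    """Stand-in preset temporal sequence mapping model: linguistic feature
    sequence -> animation frames synchronized in time with the audio."""
    frames, t = [], 0.0
    for unit, dur, emotion in features:
        frames.append({"time": round(t, 3), "pose": f"{unit}:{emotion}"})
        t += dur
    return frames

def generate_animation(text):
    emotion, boundaries = analyze_text(text)
    audio = synthesize_speech(text, emotion, boundaries)
    aligned = align_pronunciation_units(audio)
    features = linguistic_features(aligned)
    return temporal_mapping_model(features)

frames = generate_animation("hello world")
```

Because each animation frame inherits its timestamp from the synthesized speech units, the output animation is synchronized with the audio by construction, which mirrors the synchronization requirement in the claim.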