| CPC G06T 13/40 (2013.01) [G06T 13/205 (2013.01); G06T 17/00 (2013.01)] | 8 Claims |

1. An audio-driven three-dimensional facial animation model generation method, comprising:
acquiring sample data including sample audio data, sample speaking style data, and a sample blend shape value, wherein the sample audio data and the sample speaking style data belong to a same user, the sample speaking style data is used for representing a facial expression of the user, and the sample blend shape value is obtained by preprocessing the sample audio data;
performing feature extraction on the sample audio data to obtain a sample audio feature;
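The claim does not fix a particular feature extractor, so the step above could be any standard audio front end. A minimal sketch using framed short-time spectral magnitudes (all function names, frame sizes, and the log transform are illustrative assumptions, not the claimed method) might look like:

```python
import numpy as np

def extract_audio_features(waveform, frame_len=400, hop=160):
    """Hypothetical feature extractor: frame the waveform and take
    log spectral magnitudes per frame (the claim names no specific features)."""
    n_frames = 1 + (len(waveform) - frame_len) // hop
    frames = np.stack([waveform[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    window = np.hanning(frame_len)
    spectra = np.abs(np.fft.rfft(frames * window, axis=1))
    return np.log1p(spectra)  # shape: (n_frames, frame_len // 2 + 1)

# example: 1 second of 16 kHz audio -> a (98, 201) sample audio feature
feat = extract_audio_features(np.random.randn(16000))
```

With a 400-sample window and 160-sample hop at 16 kHz this yields one feature row per 10 ms of audio, a common granularity for lip-sync models.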
performing convolution on the sample audio feature based on a to-be-trained audio-driven three-dimensional facial animation model to obtain at least one intermediate audio feature;
matching every two intermediate audio features in the at least one intermediate audio feature based on the to-be-trained audio-driven three-dimensional facial animation model to obtain an intermediate audio feature group corresponding to every two intermediate audio features in the at least one intermediate audio feature;
merging two intermediate audio features in each intermediate audio feature group in at least one intermediate audio feature group based on the to-be-trained audio-driven three-dimensional facial animation model to obtain an initial audio feature, wherein each intermediate audio feature in the at least one intermediate audio feature corresponds to one convolutional calculation channel, and sequence values of the two convolutional calculation channels corresponding to the intermediate audio feature group are not adjacent to each other; and
performing encoding on the sample speaking style data based on the to-be-trained audio-driven three-dimensional facial animation model to obtain a sample speaking style feature;
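The convolution, non-adjacent channel pairing, and merging described above can be sketched as follows. The pairing rule (every pair of channel indices at distance two or more), the merge operator (averaging followed by a mean over groups), and all names are assumptions for illustration; the claim only requires that the two channels in each group have non-adjacent sequence values:

```python
import numpy as np

def conv1d(x, kernels):
    """1-D convolution over a (T, F) feature map; each kernel yields one
    intermediate audio feature, i.e. one convolutional calculation channel."""
    T, _ = x.shape
    k = kernels.shape[1]
    out = []
    for w in kernels:  # one output channel per kernel
        chan = np.array([(x[t:t + k] * w[:, None]).sum() for t in range(T - k + 1)])
        out.append(chan)
    return np.stack(out)  # shape: (n_channels, T - k + 1)

def pair_nonadjacent(n_channels):
    """Group every two channels whose sequence values are not adjacent,
    e.g. (0, 2), (0, 3), (1, 3) for four channels (illustrative rule)."""
    return [(i, j) for i in range(n_channels) for j in range(i + 2, n_channels)]

def merge_pairs(feats, pairs):
    """Merge each group by averaging, then combine groups into one initial feature."""
    merged = [0.5 * (feats[i] + feats[j]) for i, j in pairs]
    return np.mean(merged, axis=0)

x = np.random.randn(100, 8)          # sample audio feature (T=100, F=8)
kernels = np.random.randn(4, 5)      # 4 calculation channels, kernel length 5
inter = conv1d(x, kernels)           # 4 intermediate audio features
pairs = pair_nonadjacent(4)          # [(0, 2), (0, 3), (1, 3)]
initial = merge_pairs(inter, pairs)  # initial audio feature
```

Skipping adjacent channels gives each merged group two features computed by kernels that are farther apart in the channel sequence, which is the structural constraint the claim states.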
performing encoding on the initial audio feature and the sample speaking style feature based on the to-be-trained audio-driven three-dimensional facial animation model to obtain an output blend shape value;
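The encoding step above maps the two features to an output blend shape value. A minimal sketch, assuming a concatenate-then-linear head with a sigmoid to keep blend shape values in [0, 1] (the layer shape, the 52-dimensional output, and the sigmoid are all assumptions, not the claimed architecture):

```python
import numpy as np

def encode_blendshapes(audio_feat, style_feat, W, b):
    """Hypothetical head: concatenate the initial audio feature with the
    speaking-style feature and map linearly to blend shape values in [0, 1]."""
    z = np.concatenate([audio_feat, style_feat])
    return 1.0 / (1.0 + np.exp(-(W @ z + b)))  # sigmoid squashes to [0, 1]

rng = np.random.default_rng(0)
audio_feat = rng.standard_normal(96)        # initial audio feature
style_feat = rng.standard_normal(16)        # sample speaking style feature
W = rng.standard_normal((52, 112)) * 0.1    # 52 blend shapes (ARKit-style count)
b = np.zeros(52)
bs = encode_blendshapes(audio_feat, style_feat, W, b)  # output blend shape value
```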
performing calculation on the sample blend shape value and the output blend shape value to obtain a loss function value; and
updating a model parameter of the to-be-trained audio-driven three-dimensional facial animation model based on the loss function value.
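The final two steps, computing a loss function value from the sample and output blend shape values and updating a model parameter from it, can be sketched as a single training iteration. The mean squared error loss, plain gradient descent, the linear output layer, and the learning rate are all assumptions; the claim names no specific loss or optimizer:

```python
import numpy as np

# Minimal sketch of the loss-and-update step on a linear output layer.
rng = np.random.default_rng(1)
z = rng.standard_normal(112)              # encoded audio + style feature
W = rng.standard_normal((52, 112)) * 0.01  # model parameter being trained
sample_bs = rng.uniform(0.0, 1.0, 52)      # sample blend shape value (target)

for _ in range(200):
    output_bs = W @ z                      # output blend shape value
    err = output_bs - sample_bs
    loss = np.mean(err ** 2)               # loss function value (MSE)
    grad_W = 2.0 / len(err) * np.outer(err, z)  # dL/dW for the MSE loss
    W -= 0.01 * grad_W                     # update the model parameter

final_loss = np.mean((W @ z - sample_bs) ** 2)
```

Each iteration shrinks the error by a constant factor on this fixed sample, so the loss function value decreases toward zero as the parameter update repeats.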