US 12,260,481 B1
Method for generating a dynamic image based on audio, device, and storage medium
Huapeng Sima, Nanjing (CN); Maolin Zhang, Nanjing (CN); and Liyan Mao, Nanjing (CN)
Assigned to NANJING SILICON INTELLIGENCE TECHNOLOGY CO., LTD., Nanjing (CN)
Filed by NANJING SILICON INTELLIGENCE TECHNOLOGY CO., LTD., Nanjing (CN)
Filed on Jul. 19, 2024, as Appl. No. 18/777,676.
Claims priority of application No. 202410022841.6 (CN), filed on Jan. 8, 2024.
Int. Cl. G06T 13/00 (2011.01); G06T 7/70 (2017.01); G06T 7/73 (2017.01); G06T 13/20 (2011.01); G06V 10/77 (2022.01); G06V 10/778 (2022.01); G06V 20/40 (2022.01); G06V 40/16 (2022.01); G10L 15/02 (2006.01); G10L 25/24 (2013.01)
CPC G06T 13/00 (2013.01) [G06T 7/74 (2017.01); G06V 10/7715 (2022.01); G06V 10/778 (2022.01); G06V 20/46 (2022.01); G06V 40/174 (2022.01); G10L 15/02 (2013.01); G10L 25/24 (2013.01); G06T 2207/10016 (2013.01); G06T 2207/20081 (2013.01); G06T 2207/30201 (2013.01)]
9 Claims
1. A method implemented by a computer device and for generating a target dynamic image based on audio, comprising:
obtaining a reference image and reference audio input by a user;
determining a target head pose feature and a target expression coefficient feature based on the reference image and a trained generation network model, wherein the trained generation network model is configured to generate a plurality of predicted images based on the input reference image, and determine a target head pose feature and a target expression coefficient feature based on a difference between each predicted image and the reference image;
adjusting parameters of the trained generation network model based on the target head pose feature and the target expression coefficient feature, to obtain a target generation network model; and
processing a to-be-processed image based on the reference audio, the reference image, and the target generation network model, to obtain a target dynamic image, wherein the target dynamic image represents a dynamic image indicating that a target person in the to-be-processed image changes a facial expression based on the reference audio, an image object in the to-be-processed image is the same as that in the reference image, and the target generation network model is configured to obtain a target driving feature based on the input reference audio and reference image, and drive a target area in the input to-be-processed image based on the target driving feature to output the target dynamic image,
wherein the determining a target head pose feature and a target expression coefficient feature based on the reference image and a trained generation network model comprises:
obtaining reference data based on a video with preset duration that is generated based on the reference image, wherein the reference data is obtained by processing the reference image;
extracting the target head pose feature and the target expression coefficient feature from the reference data by using the trained generation network model; and
wherein the target generation network model comprises an affine subnetwork and a driving subnetwork; and
the processing a to-be-processed image based on the reference audio, the reference image, and the target generation network model, to obtain a target dynamic image comprises:
processing the to-be-processed image by using the affine subnetwork to obtain a to-be-processed feature map, and obtaining a deformation feature map based on the reference audio, the reference image, and the to-be-processed feature map by using the affine subnetwork, wherein the affine subnetwork is configured to determine a target Mel-frequency cepstral coefficient corresponding to the reference video, perform feature extraction on the reference image to obtain a reference feature image, and perform affine transformation on the reference feature image to obtain the deformation feature image; and
processing the to-be-processed image based on the deformation feature map by using the driving subnetwork, to obtain the target dynamic image, wherein the driving subnetwork is configured to drive the to-be-processed image to obtain the target dynamic image.
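The affine-transformation step recited for the affine subnetwork can be illustrated with a minimal sketch. It is not the patented implementation: the function name affine_warp, the use of NumPy, the single-channel 2-D feature map, the nearest-neighbour sampling, and the hand-supplied transform parameters are all assumptions made for illustration. In the claim, the transform would presumably be derived by the trained network from the reference audio and reference image, which this sketch does not model.

```python
import numpy as np


def affine_warp(feature_map, matrix, offset):
    """Warp a 2-D feature map with an affine transform (hypothetical helper).

    Uses inverse mapping: `matrix` (2x2) and `offset` (length 2) map each
    OUTPUT (row, col) coordinate to the INPUT coordinate it is sampled from.
    Sampling is nearest-neighbour; positions that fall outside the input
    are filled with zeros.
    """
    feature_map = np.asarray(feature_map)
    matrix = np.asarray(matrix, dtype=float)
    offset = np.asarray(offset, dtype=float)

    h, w = feature_map.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    out_coords = np.stack([rows.ravel(), cols.ravel()])  # shape (2, h*w)

    # Affine map from output coordinates to source coordinates.
    src = matrix @ out_coords + offset[:, None]
    src = np.rint(src).astype(int)

    # Keep only samples that land inside the input feature map.
    valid = (
        (src[0] >= 0) & (src[0] < h) &
        (src[1] >= 0) & (src[1] < w)
    )

    warped = np.zeros(h * w, dtype=feature_map.dtype)
    warped[valid] = feature_map[src[0, valid], src[1, valid]]
    return warped.reshape(h, w)


# Example: a stand-in "reference feature image" shifted one column to the left.
reference_features = np.arange(12, dtype=float).reshape(3, 4)
deformation_features = affine_warp(
    reference_features,
    matrix=np.eye(2),          # no rotation or scaling (assumed values)
    offset=np.array([0, 1]),   # sample from one column to the right
)
```

In a learned model, the warp would more likely be applied per channel with differentiable (for example bilinear) sampling so that gradients can flow to the predicted transform; nearest-neighbour sampling is used here only to keep the sketch short.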