| CPC G06T 13/00 (2013.01) [G06T 7/74 (2017.01); G06V 10/7715 (2022.01); G06V 10/778 (2022.01); G06V 20/46 (2022.01); G06V 40/174 (2022.01); G10L 15/02 (2013.01); G10L 25/24 (2013.01); G06T 2207/10016 (2013.01); G06T 2207/20081 (2013.01); G06T 2207/30201 (2013.01)] | 9 Claims |

1. A method, implemented by a computer device, for generating a target dynamic image based on audio, the method comprising:
obtaining a reference image and reference audio input by a user;
determining a target head pose feature and a target expression coefficient feature based on the reference image and a trained generation network model, wherein the trained generation network model is configured to generate a plurality of predicted images based on the input reference image, and determine the target head pose feature and the target expression coefficient feature based on a difference between each predicted image and the reference image;
adjusting parameters of the trained generation network model based on the target head pose feature and the target expression coefficient feature, to obtain a target generation network model; and
processing a to-be-processed image based on the reference audio, the reference image, and the target generation network model, to obtain a target dynamic image, wherein the target dynamic image represents a dynamic image indicating that a target person in the to-be-processed image changes a facial expression based on the reference audio, an image object in the to-be-processed image is the same as that in the reference image, and the target generation network model is configured to obtain a target driving feature based on the input reference audio and reference image, and drive a target area in the input to-be-processed image based on the target driving feature to output the target dynamic image,
wherein the determining a target head pose feature and a target expression coefficient feature based on the reference image and a trained generation network model comprises:
obtaining reference data based on a video with a preset duration that is generated based on the reference image, wherein the reference data is obtained by processing the reference image;
extracting the target head pose feature and the target expression coefficient feature from the reference data by using the trained generation network model; and
wherein the target generation network model comprises an affine subnetwork and a driving subnetwork; and
the processing a to-be-processed image based on the reference audio, the reference image, and the target generation network model, to obtain a target dynamic image comprises:
processing the to-be-processed image by using the affine subnetwork to obtain a to-be-processed feature map, and obtaining a deformation feature map based on the reference audio, the reference image, and the to-be-processed feature map by using the affine subnetwork, wherein the affine subnetwork is configured to determine a target Mel-frequency cepstral coefficient corresponding to the reference audio, perform feature extraction on the reference image to obtain a reference feature map, and perform an affine transformation on the reference feature map to obtain the deformation feature map; and
processing the to-be-processed image based on the deformation feature map by using the driving subnetwork, to obtain the target dynamic image, wherein the driving subnetwork is configured to drive the to-be-processed image to obtain the target dynamic image.
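The Mel-frequency cepstral step recited above can be illustrated with a short sketch. The following Python fragment, assuming the librosa library, a 16 kHz mono recording, and an illustrative file name, shows how such coefficients are commonly computed; none of these names come from the patent itself.

    import librosa

    # Load the reference audio as a mono waveform (file name and sample rate
    # are assumptions for illustration only).
    audio, sr = librosa.load("reference_audio.wav", sr=16000, mono=True)

    # Compute Mel-frequency cepstral coefficients over short frames; a feature
    # of this kind could serve as the "target Mel-frequency cepstral
    # coefficient" the affine subnetwork is configured to determine.
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13, hop_length=160)
    print(mfcc.shape)  # (13, number_of_frames)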
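Similarly, the affine transformation applied to the reference feature map can be read as a learned warp of a feature tensor. Below is a minimal PyTorch sketch, assuming the standard affine_grid/grid_sample operators and dummy shapes; in practice the 2x3 matrix would be predicted from the audio and image features rather than hard-coded.

    import torch
    import torch.nn.functional as F

    # Dummy reference feature map of shape (N, C, H, W).
    feature_map = torch.randn(1, 64, 32, 32)

    # Illustrative 2x3 affine parameters per batch item; here a slight shear
    # and vertical shift stand in for values a network would predict.
    theta = torch.tensor([[[1.0, 0.1, 0.0],
                           [0.0, 1.0, 0.05]]])

    # Sample the feature map through the affine grid to obtain a deformation
    # feature map, one plausible reading of the claimed transformation.
    grid = F.affine_grid(theta, feature_map.size(), align_corners=False)
    deformation_map = F.grid_sample(feature_map, grid, align_corners=False)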
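Finally, the overall fine-tune-then-drive flow of claim 1 can be sketched end to end. The PyTorch fragment below is a hypothetical outline only: the model class, its methods (extract_target_features, feature_consistency_loss, affine, driver), and all hyperparameters are invented placeholders, not the patent's implementation.

    import torch

    def fine_tune_and_drive(model, reference_image, reference_audio,
                            source_image, steps=100, lr=1e-4):
        # Step 1: the pretrained model derives target head-pose and
        # expression-coefficient features from differences between its
        # predicted images and the reference image (internals hypothetical).
        pose, expression = model.extract_target_features(reference_image)

        # Step 2: adjust the model's parameters toward those target features,
        # yielding the "target generation network model".
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(steps):
            optimizer.zero_grad()
            loss = model.feature_consistency_loss(reference_image, pose,
                                                  expression)
            loss.backward()
            optimizer.step()

        # Step 3: the affine subnetwork produces a deformation feature map
        # from the audio and images; the driving subnetwork uses it to drive
        # the to-be-processed (source) image into a frame sequence.
        with torch.no_grad():
            deformation = model.affine(reference_audio, reference_image,
                                       source_image)
            frames = model.driver(source_image, deformation)
        return frames  # tensor of output video frames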