| CPC G06T 13/40 (2013.01) [G06N 3/0464 (2023.01); G06N 3/08 (2013.01); G06T 13/205 (2013.01); G06T 15/04 (2013.01); G06T 17/00 (2013.01); G10L 25/57 (2013.01); G06T 2200/08 (2013.01)] | 18 Claims |

|
1. A method for processing video, comprising:
generating, based on a reference image and a first frame of a video comprising an object, a two-dimensional avatar image of the object;
generating a base three-dimensional avatar of the object by performing a three-dimensional transformation on the two-dimensional avatar image and the object in the first frame; and
generating a three-dimensional avatar video corresponding to the video based on the base three-dimensional avatar and features of the video, the features comprising image differences of the object between adjacent frames of the video;
wherein the method is implemented using a generation model for three-dimensional avatar videos, and the generation model is trained based on a loss function that includes a plurality of cross-modality loss components including an image-audio loss component, an image-text loss component, and an audio-text loss component; and
wherein the loss function is determined at least in part by:
determining the image-audio loss component as an image-audio contrastive loss function based on image data and audio data;
determining the image-text loss component as an image-text contrastive loss function based on the image data and text data;
determining the audio-text loss component as an audio-text contrastive loss function based on the audio data and the text data; and
determining the loss function based on the image-audio contrastive loss function, the image-text contrastive loss function, and the audio-text contrastive loss function.
|
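As an illustration of the loss structure recited in claim 1, the following is a minimal sketch of one way three pairwise contrastive losses could be combined into a single loss function. The claim does not specify the form of the contrastive loss or how the three components are combined. The symmetric InfoNCE-style formulation, the temperature value, the weighted sum, and all function and parameter names (`contrastive_loss`, `cross_modality_loss`, `weights`) are assumptions made for this example only. Each input is assumed to be a batch of embedding vectors in which row i of every modality describes the same sample.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def log_softmax_at(row, index):
    """Return log-softmax of `row` evaluated at `index` (numerically stable)."""
    peak = max(row)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
    return row[index] - log_z


def contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE-style loss between two batches of paired embeddings.

    Assumption: a[i] and b[i] form the positive pair; all other
    combinations in the batch are treated as negatives.
    """
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    n = len(a)
    # logits[i][j] is the scaled cosine similarity between a[i] and b[j].
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Direction a -> b: each row should score highest on its diagonal entry.
    a_to_b = -sum(log_softmax_at(logits[i], i) for i in range(n)) / n
    # Direction b -> a: same check applied to the columns.
    columns = [list(col) for col in zip(*logits)]
    b_to_a = -sum(log_softmax_at(columns[j], j) for j in range(n)) / n
    return 0.5 * (a_to_b + b_to_a)


def cross_modality_loss(image, audio, text, weights=(1.0, 1.0, 1.0)):
    """Combine the three pairwise components (weighted sum is an assumption)."""
    w_image_audio, w_image_text, w_audio_text = weights
    return (
        w_image_audio * contrastive_loss(image, audio)
        + w_image_text * contrastive_loss(image, text)
        + w_audio_text * contrastive_loss(audio, text)
    )


# Toy embeddings: two samples, two dimensions per modality.
image = [[1.0, 0.0], [0.0, 1.0]]
audio = [[0.9, 0.1], [0.1, 0.9]]
text = [[0.8, 0.2], [0.2, 0.8]]
total = cross_modality_loss(image, audio, text)
```

In this sketch, the loss is small when corresponding rows of the three modalities point in similar directions and grows when the pairing is broken, for example by swapping the two audio rows.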