US 12,482,163 B2
Method, device, and computer program product for processing video
Zhisong Liu, Shenzhen (CN); Zijia Wang, Weifang (CN); and Zhen Jia, Shanghai (CN)
Assigned to Dell Products L.P., Round Rock, TX (US)
Filed by Dell Products L.P., Round Rock, TX (US)
Filed on Nov. 23, 2022, as Appl. No. 17/993,025.
Claims priority of application No. 202211296657.8 (CN), filed on Oct. 21, 2022.
Prior Publication US 2024/0185494 A1, Jun. 6, 2024
Int. Cl. G06T 13/40 (2011.01); G06N 3/0464 (2023.01); G06N 3/08 (2023.01); G06T 13/20 (2011.01); G06T 15/04 (2011.01); G06T 17/00 (2006.01); G10L 25/57 (2013.01)
CPC G06T 13/40 (2013.01) [G06N 3/0464 (2023.01); G06N 3/08 (2013.01); G06T 13/205 (2013.01); G06T 15/04 (2013.01); G06T 17/00 (2013.01); G10L 25/57 (2013.01); G06T 2200/08 (2013.01)] 18 Claims
OG exemplary drawing
 
1. A method for processing video, comprising:
generating, based on a reference image and a first frame of a video comprising an object, a two-dimensional avatar image of the object;
generating a base three-dimensional avatar of the object by performing a three-dimensional transformation on the two-dimensional avatar image and the object in the first frame; and
generating a three-dimensional avatar video corresponding to the video based on the base three-dimensional avatar and features of the video, the features comprising image differences of the object between adjacent frames of the video;
wherein the method is implemented using a generation model for three-dimensional avatar videos, and the generation model is trained based on a loss function that includes a plurality of cross-modality loss components including an image-audio loss component, an image-text loss component, and an audio-text loss component; and
wherein the loss function is determined at least in part by:
determining the image-audio loss component as an image-audio contrastive loss function based on image data and audio data;
determining the image-text loss component as an image-text contrastive loss function based on the image data and text data;
determining the audio-text loss component as an audio-text contrastive loss function based on the audio data and the text data; and
determining the loss function based on the image-audio contrastive loss function, the image-text contrastive loss function, and the audio-text contrastive loss function.
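The claim's final limitations combine three pairwise contrastive loss components (image-audio, image-text, audio-text) into a single training loss. The patent text above does not specify the form of each contrastive term or how they are weighted; the following is a minimal illustrative sketch only, assuming a symmetric InfoNCE-style contrastive loss for each modality pair and a simple weighted sum. The function names, the temperature parameter, and the equal default weights are assumptions, not part of the claim.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss between two batches of
    embeddings a and b, each of shape [batch, dim]. Rows with the same
    index are treated as positive pairs; all other rows are negatives."""
    # L2-normalize so the dot product is cosine similarity
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature          # [batch, batch] similarity matrix
    labels = np.arange(len(a))

    def xent(l):
        # cross-entropy of each row against its diagonal (positive) entry
        l = l - l.max(axis=1, keepdims=True)                # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average the a->b and b->a directions
    return 0.5 * (xent(logits) + xent(logits.T))

def cross_modality_loss(img_emb, aud_emb, txt_emb, weights=(1.0, 1.0, 1.0)):
    """Total loss as a weighted sum of the three pairwise contrastive
    components named in the claim: image-audio, image-text, audio-text."""
    w_ia, w_it, w_at = weights
    return (w_ia * info_nce(img_emb, aud_emb)
            + w_it * info_nce(img_emb, txt_emb)
            + w_at * info_nce(aud_emb, txt_emb))
```

With aligned embeddings (identical rows pairing up) each pairwise term approaches zero, while misaligned pairings drive it up; the generation model described in the claim would be trained to minimize the combined loss so that matching image, audio, and text inputs map to nearby embeddings.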