| CPC G06V 10/774 (2022.01) [G06V 20/41 (2022.01); G06V 20/49 (2022.01); G06V 40/172 (2022.01); G06V 40/40 (2022.01); G10L 17/02 (2013.01); G10L 17/04 (2013.01); G10L 17/26 (2013.01); G10L 25/57 (2013.01)] | 18 Claims |

1. A method for training a model comprising a visual encoder, an audio encoder, an audio-to-visual (A2V) network, a visual-to-audio (V2A) network, and a classifier for classifying videos as real or fake, the method comprising:
generating a sequence of image tiles from image data from an input video;
generating a plurality of data segments representing audio data from the input video;
generating a sequence of image embeddings based on the sequence of image tiles using the visual encoder;
generating a sequence of audio embeddings based on the plurality of data segments using the audio encoder;
transforming, using the V2A network, a first subset of the sequence of image embeddings into one or more synthetic audio embeddings, wherein the first subset of the sequence of image embeddings corresponds to a first set of time points in the input video;
transforming, using the A2V network, a first subset of the sequence of audio embeddings into one or more synthetic image embeddings, wherein the first subset of the sequence of audio embeddings corresponds to a second set of time points in the input video complementary to the first set of time points;
updating the sequence of image embeddings by replacing a second subset of the sequence of image embeddings with the one or more synthetic image embeddings, wherein the second subset of the sequence of image embeddings corresponds to the second set of time points;
updating the sequence of audio embeddings by replacing a second subset of the sequence of audio embeddings with the one or more synthetic audio embeddings, wherein the second subset of the sequence of audio embeddings corresponds to the first set of time points;
training the visual encoder, the audio encoder, the V2A network, and the A2V network based on the updated sequence of image embeddings and the updated sequence of audio embeddings;
training the classifier to classify videos as real or fake using the trained visual encoder, the trained audio encoder, the trained V2A network, and the trained A2V network, wherein the classifier is configured to receive image embeddings for the videos from the trained visual encoder, audio embeddings for the videos from the trained audio encoder, synthetic image embeddings for the videos from the trained A2V network, and synthetic audio embeddings for the videos from the trained V2A network.
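The cross-modal replacement recited in the claim can be sketched as follows. This is a minimal illustration in plain Python, not the claimed implementation: the `v2a` and `a2v` functions below are hypothetical stand-ins (simple scalings) for the trained visual-to-audio and audio-to-visual networks, and `cross_modal_replace` shows only the bookkeeping of splitting time points into a first set and its complement, generating synthetic embeddings for each, and substituting them into the opposite-modality sequence.

```python
def v2a(img_emb):
    """Hypothetical stand-in for the trained V2A network:
    maps an image embedding to a synthetic audio embedding."""
    return [x * 0.5 for x in img_emb]

def a2v(aud_emb):
    """Hypothetical stand-in for the trained A2V network:
    maps an audio embedding to a synthetic image embedding."""
    return [x * 2.0 for x in aud_emb]

def cross_modal_replace(img_seq, aud_seq, t1):
    """Given per-time-point image and audio embeddings and a first
    set of time points t1, return the updated sequences:
      - image embeddings at the complementary time points t2 are
        replaced by synthetic image embeddings from the audio, and
      - audio embeddings at t1 are replaced by synthetic audio
        embeddings from the images."""
    num_steps = len(img_seq)
    assert len(aud_seq) == num_steps
    t1 = set(t1)
    # Second set of time points, complementary to the first set.
    t2 = [t for t in range(num_steps) if t not in t1]

    # Synthetic audio from images at t1; synthetic images from audio at t2.
    synth_aud = {t: v2a(img_seq[t]) for t in t1}
    synth_img = {t: a2v(aud_seq[t]) for t in t2}

    # Each time point keeps one real modality and gains one synthetic one.
    new_img = [img_seq[t] if t in t1 else synth_img[t] for t in range(num_steps)]
    new_aud = [synth_aud[t] if t in t1 else aud_seq[t] for t in range(num_steps)]
    return new_img, new_aud
```

After this step, every time point is represented by exactly one real embedding and one cross-modally predicted embedding, which is what lets the encoders, the V2A network, and the A2V network be trained jointly on the updated sequences before the real/fake classifier is trained on their outputs.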