US 12,243,290 B2
Video transformer for deepfake detection with incremental learning
Sohail Ahmed Khan, Abu Dhabi (AE); and Hang Dai, Abu Dhabi (AE)
Assigned to Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi (AE)
Filed by Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi (AE)
Filed on Jun. 8, 2022, as Appl. No. 17/835,453.
Prior Publication US 2023/0401824 A1, Dec. 14, 2023
Int. Cl. G06V 10/77 (2022.01); G06T 7/40 (2017.01); G06V 10/54 (2022.01); G06V 10/776 (2022.01); G06V 10/82 (2022.01); G06V 20/40 (2022.01); G06V 20/70 (2022.01); G06V 40/16 (2022.01); G06V 40/40 (2022.01); H04N 21/44 (2011.01); H04N 21/637 (2011.01)

CPC G06V 10/7715 (2022.01) [G06T 7/40 (2013.01); G06V 10/54 (2022.01); G06V 10/776 (2022.01); G06V 10/82 (2022.01); G06V 20/41 (2022.01); G06V 20/49 (2022.01); G06V 20/70 (2022.01); G06V 40/161 (2022.01); G06V 40/40 (2022.01); H04N 21/44008 (2013.01); H04N 21/637 (2013.01); G06T 2207/20081 (2013.01); G06T 2207/20084 (2013.01); G06T 2207/30201 (2013.01)]

16 Claims

1. A system for detecting DeepFake videos, comprising:

an input device for inputting a potential DeepFake video, wherein the input device is configured to input a sequence of video frames of the potential DeepFake video;

processing circuitry that

detects faces frame by frame in the potential DeepFake video to obtain consecutive face images,

creates UV texture maps from the face images,

inputs both face images and corresponding UV texture maps,

extracts image feature maps, by a convolution neural network (CNN) backbone, from the input face images and corresponding UV texture maps and forms an input data structure,

receives the input data structure, by a video transformer model that includes multiple encoders,

computes, by the video transformer model, a classification of the video as being Real or Fake; and

a display device that plays back the potential DeepFake video and an indication that the potential DeepFake video is Real or Fake,

the system further comprising:

learnable segment embeddings, wherein the learnable segment embeddings are a fixed token for the face image and a fixed token for the UV texture map, and

wherein the processing circuitry forms the input data structure including the extracted image feature maps and the learnable segment embeddings,

wherein all tokens belonging to the face image are assigned to a first vector (index 0), and all tokens belonging to the UV texture map are assigned to a second vector (index 1), and the first vector and the second vector are concatenated into a single feature vector.
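
The segment-embedding limitation of claim 1 can be illustrated with a minimal sketch. It shows one possible reading only: it assumes BERT-style additive segment embeddings, where every face-image token receives the vector at index 0, every UV-texture-map token receives the vector at index 1, and the two token groups are then concatenated into one sequence. The function names (`init_segment_embeddings`, `build_input_sequence`), the embedding dimension, and the random initialisation are hypothetical and do not come from the patent; in a trained model the two segment vectors would be learned parameters.

```python
import random
from typing import List, Tuple

Vector = List[float]


def init_segment_embeddings(dim: int, seed: int = 0) -> List[Vector]:
    """Return two segment vectors: index 0 = face image, index 1 = UV texture map.

    Assumption: small random values stand in for parameters that a real
    model would learn during training.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(2)]


def build_input_sequence(
    face_tokens: List[Vector],
    uv_tokens: List[Vector],
    segment_embeddings: List[Vector],
) -> Tuple[List[Vector], List[int]]:
    """Tag each token with its segment vector and concatenate both groups."""
    # All face-image tokens map to index 0, all UV-map tokens to index 1.
    segment_ids = [0] * len(face_tokens) + [1] * len(uv_tokens)
    tokens = face_tokens + uv_tokens  # concatenation into a single sequence
    # Assumption: the segment vector is added element-wise to each token.
    sequence = [
        [x + s for x, s in zip(token, segment_embeddings[seg_id])]
        for token, seg_id in zip(tokens, segment_ids)
    ]
    return sequence, segment_ids


if __name__ == "__main__":
    dim = 4
    face = [[1.0] * dim for _ in range(3)]  # stand-in CNN features, face image
    uv = [[2.0] * dim for _ in range(2)]    # stand-in CNN features, UV map
    seq, ids = build_input_sequence(face, uv, init_segment_embeddings(dim))
    print(ids)  # [0, 0, 0, 1, 1]
```

Under this reading, the transformer encoders receive a single sequence in which the source of each token (face image or UV texture map) is recoverable from the segment vector added to it.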