CPC G06V 10/7715 (2022.01) [G06T 7/40 (2013.01); G06V 10/54 (2022.01); G06V 10/776 (2022.01); G06V 10/82 (2022.01); G06V 20/41 (2022.01); G06V 20/49 (2022.01); G06V 20/70 (2022.01); G06V 40/161 (2022.01); G06V 40/40 (2022.01); H04N 21/44008 (2013.01); H04N 21/637 (2013.01); G06T 2207/20081 (2013.01); G06T 2207/20084 (2013.01); G06T 2207/30201 (2013.01)] | 16 Claims |
1. A system for detecting DeepFake videos, comprising:
an input device for inputting a potential DeepFake video, wherein the input device is configured to input a sequence of video frames of the potential DeepFake video;
processing circuitry that
detects faces frame by frame in the potential DeepFake video to obtain consecutive face images,
creates UV texture maps from the face images,
inputs both face images and corresponding UV texture maps,
extracts image feature maps, by a convolutional neural network (CNN) backbone, from the input face images and corresponding UV texture maps and forms an input data structure,
receives the input data structure, by a video transformer model that includes multiple encoders,
computes, by the video transformer model, a classification of the video as being Real or Fake; and
a display device that plays back the potential DeepFake video and an indication that the potential DeepFake video is Real or Fake,
the system further comprising:
learnable segment embeddings, wherein the learnable segment embeddings are a fixed token for the face image and a fixed token for the UV texture map, and
wherein the processing circuitry forms the input data structure including the extracted image feature maps and the learnable segment embeddings,
wherein all tokens belonging to the face image are assigned to a first vector (index 0), and all tokens belonging to the UV texture map are assigned to a second vector (index 1), and the first vector and the second vector are concatenated into a single feature vector.
|
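The segment-embedding limitation of claim 1 can be illustrated with a minimal sketch. It assumes one common reading of the claim: each token extracted from the face image is tagged with segment index 0, each token extracted from the UV texture map is tagged with segment index 1, the learnable vector for that index is added element-wise to the token, and the two token groups are concatenated into one sequence for the video transformer. The claim does not specify how the embeddings are combined with the feature maps, so the element-wise addition, the function names `init_segment_embeddings` and `build_input`, and the initialization scale are all hypothetical choices made for illustration.

```python
import random
from typing import List, Tuple

Vector = List[float]


def init_segment_embeddings(dim: int, seed: int = 0) -> List[Vector]:
    """Create two vectors: index 0 for face-image tokens, index 1 for UV-map tokens.

    In a trained model these would be learnable parameters; here they are
    only randomly initialized (assumed scale: std 0.02).
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(2)]


def build_input(
    face_tokens: List[Vector],
    uv_tokens: List[Vector],
    segment_embeddings: List[Vector],
) -> Tuple[List[Vector], List[int]]:
    """Tag tokens by source, add the matching segment vector, and concatenate.

    Returns the combined token sequence and the per-token segment indices.
    """
    # Face-image tokens get index 0, UV-texture-map tokens get index 1.
    segment_ids = [0] * len(face_tokens) + [1] * len(uv_tokens)
    tokens = face_tokens + uv_tokens  # concatenation into a single sequence

    sequence: List[Vector] = []
    for token, seg in zip(tokens, segment_ids):
        emb = segment_embeddings[seg]
        if len(token) != len(emb):
            raise ValueError("token and segment embedding dimensions differ")
        # Assumed combination rule: element-wise addition.
        sequence.append([t + e for t, e in zip(token, emb)])
    return sequence, segment_ids


if __name__ == "__main__":
    # Toy 2-dimensional tokens standing in for flattened CNN feature maps.
    face = [[0.0, 0.5]]
    uv = [[0.0, 0.5], [1.0, 2.0]]
    seq, ids = build_input(face, uv, init_segment_embeddings(dim=2))
    print(ids)
    print(len(seq))
```

Because the same vector is shared by every token of a given source, the transformer can distinguish face-image features from UV-texture features regardless of their position in the sequence.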