| CPC G06V 20/41 (2022.01) [G06F 18/22 (2023.01); G06N 3/045 (2023.01); G06N 3/048 (2023.01); G06N 3/08 (2013.01); G06V 20/46 (2022.01); G06V 20/48 (2022.01); G06V 40/174 (2022.01); G10L 25/30 (2013.01); G10L 25/63 (2013.01)] | 20 Claims |

|
1. An apparatus comprising:
a first feature extraction module to receive visual content of a video and produce facial features therefrom, the facial features including facial modalities and facial affective cues, including facial emotions;
a second feature extraction module to receive audio content of the video and produce speech features therefrom, the speech features including speech modalities and speech affective cues, including speech emotions;
a neural network including:
a first network responsive to the facial modalities to produce a facial modality embedding of the facial modalities;
a second network responsive to the speech modalities to produce a speech modality embedding of the speech modalities;
a third network responsive to the facial affective cues to produce an embedding of the facial affective cues, including a facial emotion embedding of the facial emotions; and
a fourth network responsive to the speech affective cues to produce an embedding of the speech affective cues, including a speech emotion embedding of the speech emotions;
a comparison module to determine a first measure of a similarity between the facial modality embedding and the speech modality embedding and further to determine a second measure of a similarity between the embedding of the facial affective cues and the embedding of the speech affective cues, where the first measure of the similarity comprises a first distance between the facial modality embedding and the speech modality embedding and the second measure of the similarity comprises a second distance between the embedding of the facial affective cues, including the facial emotion embedding of the facial emotions, and the embedding of the speech affective cues, including the speech emotion embedding of the speech emotions; and
a classification module to determine the video to be real or fake dependent upon the first and second measures of similarity, where the classification module is configured to classify the video as fake when a sum of the first distance and the second distance exceeds a threshold distance and to classify the video as real when the sum of the first distance and the second distance does not exceed the threshold distance.
|
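The decision rule recited in the classification module of claim 1 can be illustrated with a minimal sketch. The claim does not name a distance metric, so Euclidean distance is assumed here. The function name `classify_video`, its parameter names, and the threshold value in the usage example are hypothetical and chosen only for illustration. The embeddings are taken as already produced by the four networks.

```python
import math
from typing import Sequence


def classify_video(
    face_modality_emb: Sequence[float],
    speech_modality_emb: Sequence[float],
    face_affect_emb: Sequence[float],
    speech_affect_emb: Sequence[float],
    threshold: float,
) -> str:
    """Label a video "fake" or "real" from four precomputed embeddings.

    Assumption: Euclidean distance stands in for the unspecified
    "distance" of the claim; any other metric could be substituted.
    """
    # First distance: facial modality embedding vs. speech modality embedding.
    first_distance = math.dist(face_modality_emb, speech_modality_emb)
    # Second distance: facial affective-cue embedding vs. speech affective-cue embedding.
    second_distance = math.dist(face_affect_emb, speech_affect_emb)
    # Fake when the sum exceeds the threshold; real otherwise (including equality).
    return "fake" if first_distance + second_distance > threshold else "real"


# Hypothetical usage with toy 2-D embeddings and an arbitrary threshold.
label = classify_video([0.0, 0.0], [3.0, 4.0], [1.0, 1.0], [1.0, 1.0], threshold=5.0)
```

In this sketch a sum equal to the threshold yields "real", matching the claim language that the video is real when the sum "does not exceed" the threshold distance.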