US 12,347,238 B2
Deepfake detection using synchronous observations of machine learning residuals
Guy G. Michaeli, Seattle, WA (US); Mandip S. Bhuller, San Carlos, CA (US); Timothy D. Cline, Gainesville, VA (US); and Kenny C. Gross, Escondido, CA (US)
Assigned to Oracle International Corporation, Redwood Shores, CA (US)
Filed by ORACLE INTERNATIONAL CORPORATION, Redwood Shores, CA (US)
Filed on Oct. 17, 2022, as Appl. No. 17/967,254.
Prior Publication US 2024/0127630 A1, Apr. 18, 2024
Int. Cl. G06V 40/40 (2022.01); G06V 10/764 (2022.01); G06V 10/774 (2022.01); G06V 20/40 (2022.01); G06V 40/16 (2022.01); G10L 17/10 (2013.01); G10L 17/18 (2013.01); G10L 17/26 (2013.01)

CPC G06V 40/40 (2022.01) [G06V 10/764 (2022.01); G06V 10/774 (2022.01); G06V 20/41 (2022.01); G06V 20/46 (2022.01); G06V 40/168 (2022.01); G06V 40/172 (2022.01); G10L 17/18 (2013.01); G10L 17/26 (2013.01); G10L 17/10 (2013.01)]

20 Claims

1. A non-transitory computer-readable medium that includes stored thereon computer-executable instructions that when executed by at least a processor of a computer cause the computer to:

convert an audio-visual signal that includes speech by a human speaker into a set of time series signals that includes a video subset of time series signals for the video and an audio subset of time series signals for the audio;

generate a set of residual time series signals from the set of time series signals and a set of estimates for the time series signals made by a machine learning model, wherein the machine learning model generates the estimates to be consistent with authentic speech by the human speaker;

place residual values from one synchronous observation of the set of residual time series signals into a two-dimensional array that is divided into a video partition and an audio partition, wherein residual values generated for the video subset are placed within the video partition, and residual values generated for the audio subset are placed in the audio partition;

perform a sequential analysis of the residual values across two dimensions of the two-dimensional array to detect an anomaly in the residual values; and

in response to detection of the anomaly, generate an alert that deepfake content that misrepresents the human speaker or the speech is detected in the audio-visual signal.
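
The residual, array-placement, and sequential-analysis steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation, and it rests on several assumptions that the claim does not state: the sequential analysis is taken to be Wald's sequential probability ratio test (SPRT) for a mean shift in Gaussian residuals, the two dimensions are covered by one row-major and one column-major scan, and short rows are padded with 0.0. All function names and parameters (`residuals`, `to_partitioned_array`, `sprt_alarm`, `detect_anomaly`, `width`, `shift`, `sigma`, `alpha`, `beta`) are hypothetical.

```python
import math


def residuals(observed, estimated):
    """Residual = observed value minus the model's estimate, per signal."""
    return [o - e for o, e in zip(observed, estimated)]


def to_partitioned_array(video_res, audio_res, width):
    """Place one observation's residuals into rows of `width` cells.

    Video rows come first (video partition), audio rows follow (audio
    partition). Padding short rows with 0.0 is an assumption of this sketch.
    Returns the grid and the number of rows in the video partition.
    """
    def rows(values):
        padded = list(values) + [0.0] * (-len(values) % width)
        return [padded[i:i + width] for i in range(0, len(padded), width)]

    video_rows = rows(video_res)
    audio_rows = rows(audio_res)
    return video_rows + audio_rows, len(video_rows)


def sprt_alarm(values, shift=1.0, sigma=1.0, alpha=0.01, beta=0.01):
    """Assumed SPRT: test mean 0 against mean +shift and mean -shift."""
    upper = math.log((1 - beta) / alpha)   # crossing this accepts "shifted"
    lower = math.log(beta / (1 - alpha))   # crossing this accepts "normal"
    for sign in (1.0, -1.0):
        llr = 0.0
        for x in values:
            # Log-likelihood ratio increment for a Gaussian mean shift.
            llr += (shift / sigma ** 2) * (sign * x - shift / 2)
            if llr >= upper:
                return True
            if llr <= lower:
                llr = 0.0  # restart the test after a "normal" decision
    return False


def detect_anomaly(grid, **sprt_kwargs):
    """Scan the grid along rows and along columns; alarm if either scan does."""
    if not grid:
        return False
    row_scan = [x for row in grid for x in row]
    col_scan = [row[c] for c in range(len(grid[0])) for row in grid]
    return sprt_alarm(row_scan, **sprt_kwargs) or sprt_alarm(col_scan, **sprt_kwargs)
```

In use, one synchronous observation would supply `video_res` and `audio_res` via `residuals(...)`, the grid would be built with `to_partitioned_array(...)`, and a `True` result from `detect_anomaly(grid)` would correspond to the condition that triggers the deepfake alert. The thresholds follow Wald's standard approximations for the chosen false-alarm rate `alpha` and missed-alarm rate `beta`.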