| CPC G10L 17/26 (2013.01) [G10L 17/18 (2013.01); G10L 25/57 (2013.01)] | 22 Claims |

1. A method for generating a visual representation of manipulations in an audio signal, comprising:
inputting the audio signal into a trained machine-learning model, wherein the machine-learning model is trained by:
generating, based on a training bona fide audio signal, a training bona fide time-frequency representation;
generating, based on a training spoofed audio signal, a training spoofed time-frequency representation, wherein the training spoofed audio signal is a manipulated version of the training bona fide audio signal;
generating a training visual representation of manipulations in the training spoofed audio signal based at least on a difference between the training bona fide time-frequency representation and the training spoofed time-frequency representation; and
training the machine-learning model based on the training visual representation of the manipulations in the training spoofed audio signal; and
generating, by the machine-learning model, the visual representation of the manipulations in the audio signal.
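The training-data generation recited in the claim (spectrograms of paired bona fide and spoofed signals, with the manipulation map derived from their difference) can be sketched as follows. This is a minimal NumPy illustration, not the patented method: the magnitude-spectrogram parameters, the thresholded-difference rule, and all function names (`time_frequency_representation`, `manipulation_map`) are illustrative assumptions.

```python
import numpy as np

def time_frequency_representation(signal, frame_len=256, hop=128):
    # Illustrative time-frequency representation: a Hann-windowed
    # magnitude spectrogram computed with a short-time FFT.
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)).T  # shape: (freq_bins, n_frames)

def manipulation_map(bona_fide, spoofed, threshold=0.1):
    # Visual representation of manipulations, here assumed to be a binary
    # mask marking time-frequency cells where the spoofed spectrogram
    # deviates from the bona fide one by more than a fraction of the
    # largest observed difference.
    tf_bona = time_frequency_representation(bona_fide)
    tf_spoof = time_frequency_representation(spoofed)
    diff = np.abs(tf_spoof - tf_bona)
    return (diff > threshold * diff.max()).astype(np.float32)

# Toy pair: the "spoofed" signal is the bona fide signal with a
# 1200 Hz tone injected into its second half.
sr = 8000
t = np.arange(sr) / sr
bona = np.sin(2 * np.pi * 440 * t)
spoof = bona.copy()
spoof[sr // 2:] += 0.5 * np.sin(2 * np.pi * 1200 * t[sr // 2:])
mask = manipulation_map(bona, spoof)
```

In this sketch `mask` would serve as the per-example training target for the detection model; the unmanipulated first half of the signal yields an all-zero region, while frames covering the injected tone are flagged.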