US 12,189,712 B1
Audio spoof detection using attention-based contrastive learning
Gaurav Bharaj, Los Angeles, CA (US); Chirag Goel, Montreal (CA); Surya Koppisetti, Coquitlam (CA); Ben Colman, New York, NY (US); and Ali Shahriyari, Las Vegas, NV (US)
Assigned to Reality Defender, Inc., New York, NY (US)
Filed by Reality Defender, Inc., New York, NY (US)
Filed on Jan. 29, 2024, as Appl. No. 18/426,016.
Int. Cl. G10L 25/30 (2013.01); G06F 18/2132 (2023.01); G06F 18/2415 (2023.01); G10L 21/14 (2013.01); G10L 25/18 (2013.01); G10L 25/51 (2013.01)

CPC G06F 18/2132 (2023.01) [G06F 18/2415 (2023.01); G10L 21/14 (2013.01); G10L 25/18 (2013.01); G10L 25/30 (2013.01); G10L 25/51 (2013.01)]

18 Claims

1. A method for detecting fake audios, comprising:

converting audio data into an image representation of the audio data;

providing the image representation of the audio data to a trained machine-learning model, the machine learning model:

generating, using a trained self-attention branch, one or more representation embeddings corresponding to the image representation of the audio data; and

receiving, using a trained classifier component, the one or more representation embeddings and outputting a classification result; and

wherein the machine-learning model is trained by:

in a first stage, training one or more self- and cross-attention components via contrastive learning,

wherein the one or more self- and cross-attention components comprise a first self-attention branch, a second self-attention branch, and a cross-attention branch, and

wherein the trained self-attention branch is based on the first self-attention branch or the second self-attention branch of the one or more self- and cross-attention components; and

in a second stage, training the classifier component; and

providing the classification result.
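
The inference path recited in claim 1 (audio data, image representation, representation embedding, classification result) can be illustrated with a minimal Python sketch. This is an illustration under stated assumptions, not the patented implementation: the claim does not specify the image representation, the attention architecture, or the classifier, so the log-magnitude spectrogram, the single-head self-attention with identity projections, the logistic classifier, and every function name and parameter value below are hypothetical stand-ins. The two-stage training (contrastive learning of the self- and cross-attention branches, followed by classifier training) is not shown, and the weights are untrained placeholders.

```python
import cmath
import math


def spectrogram(samples, frame_len=64, hop=32):
    """Log-magnitude STFT: rows are time frames, columns are frequency bins.

    Assumption: a spectrogram is one possible "image representation";
    the claim does not name a specific one.
    """
    # Hann window to reduce spectral leakage between neighbouring bins.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    image = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        row = []
        for k in range(frame_len // 2 + 1):
            # Direct DFT of one bin (slow, but dependency-free).
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n, x in enumerate(frame))
            row.append(math.log1p(abs(acc)))
        image.append(row)
    return image


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_embedding(image):
    """Hypothetical stand-in for the trained self-attention branch.

    Scaled dot-product self-attention over time frames with identity
    query/key/value projections, followed by mean pooling over frames.
    """
    d = len(image[0])
    attended = []
    for query in image:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in image]
        weights = softmax(scores)
        attended.append([sum(w * value[j] for w, value in zip(weights, image))
                         for j in range(d)])
    return [sum(row[j] for row in attended) / len(attended) for j in range(d)]


def classify(embedding, weights, bias):
    """Hypothetical logistic classifier: returns a score in (0, 1)."""
    z = sum(w * x for w, x in zip(weights, embedding)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Synthetic input: 256 samples of a 440 Hz tone at an assumed 8 kHz rate.
samples = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(256)]

image = spectrogram(samples)                  # image representation
embedding = self_attention_embedding(image)   # representation embedding
placeholder_weights = [0.01] * len(embedding) # untrained, illustrative only
score = classify(embedding, placeholder_weights, bias=0.0)
label = "fake" if score >= 0.5 else "real"    # threshold is an assumption
```

Because the classifier weights are placeholders, the resulting `label` carries no meaning; the sketch only shows how the data flows between the claimed steps. In a trained system the embedding function would be the first or second self-attention branch learned in the first stage, and the classifier would be fitted in the second stage.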