| CPC G10L 17/04 (2013.01) [G10L 17/18 (2013.01)] | 18 Claims |

|
1. A method for cross-domain speech deepfake detection, the method comprising a training stage and a testing stage, wherein the training stage comprises:
(a) inputting a source speech and generating an auxiliary speech by simulating a transmission distortion and simulating a codec compression, wherein the source speech comprises a first genuine speech and a fake speech, the auxiliary speech only comprises a second genuine speech, and the auxiliary speech is cross-domain;
(b) extracting a frame-level feature from the auxiliary speech or the source speech using an SSL pretrained model, and projecting the frame-level feature into a lower-dimensional feature space via a projection network, to obtain a first feature vector and a second feature vector respectively;
(c) domain-invariant representation learning: applying at least two sets of adversarial domain classifiers to ensure that the compact features from the auxiliary speech, even when converted from the original speech of different speakers and under various conditions, have a domain-invariant representation, thereby making the various speech features more compactly distributed in the lower-dimensional feature space, and generating a cross-entropy loss for each set of the adversarial domain classifiers based on the second feature vector;
(d) one-class learning: using a one-class learning classifier to learn a compact boundary of the first genuine speech in the lower-dimensional feature space based on the first feature vector, facilitating a separation of the fake speech, and generating a one-class loss; wherein no sequential order exists between the step (c) and the step (d);
wherein the SSL pretrained model, the projection network, the one-class learning classifier, and the at least two sets of adversarial domain classifiers are updated according to at least one of the cross-entropy loss and the one-class loss;
wherein the testing stage comprises:
(e) generating a judgment based on a testing score output by a testing model and a predefined threshold to classify a test speech as genuine or fake.
|
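For illustration only, the per-sample quantities named in steps (c) to (e) could be computed along the following lines. This is a minimal sketch under stated assumptions, not the claimed implementation: the claim does not specify the form of the one-class loss, how the losses are combined, or the direction of the testing score. The distance-to-center loss, the `margin` and `lam` parameters, the weighted-sum combination, the "higher score means genuine" convention, and all function names are hypothetical. The adversarial update of step (c) (for example, via gradient reversal) is not shown; only the cross-entropy value itself is computed.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, as one domain classifier might produce."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def one_class_loss(embedding, center, is_genuine, margin=1.0):
    """Hypothetical one-class loss (assumption, not from the claim).

    Genuine embeddings are pulled toward `center`; fake embeddings are
    penalized only while they lie closer than `margin` to it.
    """
    dist = math.dist(embedding, center)
    if is_genuine:
        return dist ** 2
    return max(0.0, margin - dist) ** 2


def total_loss(oc_loss, domain_ce_losses, lam=0.1):
    """Assumed combination: one-class loss plus a weighted sum of the
    cross-entropy losses from each set of domain classifiers."""
    return oc_loss + lam * sum(domain_ce_losses)


def judge(score, threshold):
    """Step (e): compare the testing score with a predefined threshold.

    Assumes a higher score indicates genuine speech.
    """
    return "genuine" if score >= threshold else "fake"


if __name__ == "__main__":
    ce_a = cross_entropy([2.0, 0.5], label=0)       # e.g. a first set of domain classifiers
    ce_b = cross_entropy([0.1, 0.3, 0.2], label=1)  # e.g. a second set of domain classifiers
    oc = one_class_loss([0.2, 0.1], center=[0.0, 0.0], is_genuine=True)
    print(total_loss(oc, [ce_a, ce_b]))
    print(judge(score=0.8, threshold=0.5))
```

In a real system the embeddings would come from the projection network of step (b), and the threshold would be fixed in advance on held-out data; both are outside the scope of this sketch.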