US 12,217,762 B1
System and method for detecting synthetic speech based on prosody analysis
Denys Shyrman, Dnipropetrovsk Region (UA)
Assigned to CORSOUND AI LTD., Tel Aviv (IL)
Filed by Corsound AI Ltd, Tel Aviv (IL)
Filed on Jun. 6, 2024, as Appl. No. 18/735,525.
Int. Cl. G10L 13/10 (2013.01); G10L 13/047 (2013.01); G10L 17/02 (2013.01); G10L 17/04 (2013.01); G10L 17/06 (2013.01); G10L 17/26 (2013.01); G10L 25/18 (2013.01)
CPC G10L 17/26 (2013.01) [G10L 13/047 (2013.01); G10L 13/10 (2013.01); G10L 17/02 (2013.01); G10L 17/04 (2013.01); G10L 17/06 (2013.01); G10L 25/18 (2013.01)]
20 Claims
 
1. A method for detecting synthetic speech, the method comprising, using a processor:
training a prosody extractor comprising a first neural network by:
generating a channel degraded speech sample by providing a training speech sample to an encoder-decoder (codec) model, wherein the codec model comprises a second neural network that represents effects of a transmission channel on the speech sample, and wherein the channel degraded speech sample comprises a spectrogram that is degraded as if the training speech sample has been transferred through the transmission channel represented by the codec model;
generating a prosody embedding by providing the channel degraded speech sample to a prosody extractor;
generating a spectrogram of the training speech sample by providing to a speech synthesis model the prosody embedding, a codec embedding representing the codec model, speaker identity information and a text representation, wherein the speech synthesis model comprises a third neural network;
training the speech synthesis model and the prosody extractor using a loss function defined based on the spectrogram generated by the speech synthesis model compared with a spectrogram of the channel degraded speech sample;
receiving an examined speech sample;
determining whether the examined speech sample includes synthetic speech using the trained prosody extractor; and
providing a notice in case the examined speech sample is determined to include synthetic speech, wherein the notice comprises an indication that the examined speech sample is suspected as being a synthetic voice sample.
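
The data flow recited in claim 1 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation. The names `l1_spectrogram_loss`, `training_step_loss`, `examine`, and `TrainingExample` are hypothetical. The claim does not specify the form of the loss function, so a mean absolute error is assumed here. The claim also does not say how the trained prosody extractor is used to reach a decision, so a separate `classifier` that scores the prosody embedding against a threshold is assumed. The three neural networks are passed in as plain callables, and the gradient update that would jointly train the speech synthesis model and the prosody extractor is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Spectrogram = List[List[float]]  # frames x frequency bins
Embedding = List[float]


@dataclass
class TrainingExample:
    """One training item: speech spectrogram, speaker identity, text."""
    speech: Spectrogram
    speaker_id: int
    text: str


def l1_spectrogram_loss(predicted: Spectrogram, target: Spectrogram) -> float:
    """Mean absolute error between two same-shaped spectrograms.

    Assumption: the claim only requires a loss that compares the two
    spectrograms; L1 is one common choice, not the claimed one.
    """
    if len(predicted) != len(target):
        raise ValueError("spectrograms differ in number of frames")
    total, count = 0.0, 0
    for p_frame, t_frame in zip(predicted, target):
        if len(p_frame) != len(t_frame):
            raise ValueError("spectrograms differ in number of bins")
        for p, t in zip(p_frame, t_frame):
            total += abs(p - t)
            count += 1
    return total / count


def training_step_loss(
    example: TrainingExample,
    codec: Callable[[Spectrogram], Spectrogram],
    codec_embedding: Embedding,
    prosody_extractor: Callable[[Spectrogram], Embedding],
    synthesizer: Callable[[Embedding, Embedding, int, str], Spectrogram],
) -> float:
    """Forward pass of one training step; returns the loss to minimize."""
    # 1. Simulate the transmission channel with the codec model.
    degraded = codec(example.speech)
    # 2. Extract a prosody embedding from the degraded sample.
    prosody = prosody_extractor(degraded)
    # 3. Re-synthesize a spectrogram from prosody, codec embedding,
    #    speaker identity, and text.
    reconstructed = synthesizer(
        prosody, codec_embedding, example.speaker_id, example.text
    )
    # 4. Compare against the channel degraded spectrogram.
    return l1_spectrogram_loss(reconstructed, degraded)


def examine(
    sample: Spectrogram,
    prosody_extractor: Callable[[Spectrogram], Embedding],
    classifier: Callable[[Embedding], float],
    threshold: float = 0.5,
) -> Optional[str]:
    """Return a notice if the sample is scored as synthetic, else None.

    Assumption: `classifier` maps a prosody embedding to a score in [0, 1];
    the claim does not describe this component.
    """
    score = classifier(prosody_extractor(sample))
    if score >= threshold:
        return ("Notice: the examined speech sample is suspected "
                "as being a synthetic voice sample.")
    return None
```

In a real system each callable would be a trainable network, and the value returned by `training_step_loss` would be backpropagated through both the synthesizer and the prosody extractor. Because the reconstruction target is the degraded spectrogram and the synthesizer receives the codec embedding separately, the prosody embedding is encouraged to carry prosody rather than channel characteristics.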