| CPC G10L 17/26 (2013.01) [G10L 13/047 (2013.01); G10L 13/10 (2013.01); G10L 17/02 (2013.01); G10L 17/04 (2013.01); G10L 17/06 (2013.01); G10L 25/18 (2013.01)] | 20 Claims |

|
1. A method for detecting synthetic speech, the method comprising, using a processor:
training a prosody extractor comprising a first neural network by:
generating a channel degraded speech sample by providing a training speech sample to an encoder-decoder (codec) model, wherein the codec model comprises a second neural network that represents effects of a transmission channel on the speech sample, and wherein the channel degraded speech sample comprises a spectrogram that is degraded as if the training speech sample has been transferred through the transmission channel represented by the codec model;
generating a prosody embedding by providing the channel degraded speech sample to the prosody extractor;
generating a spectrogram of the training speech sample by providing to a speech synthesis model the prosody embedding, a codec embedding representing the codec model, speaker identity information and a text representation, wherein the speech synthesis model comprises a third neural network;
training the speech synthesis model and the prosody extractor using a loss function defined based on the spectrogram generated by the speech synthesis model compared with a spectrogram of the channel degraded speech sample;
receiving an examined speech sample;
determining whether the examined speech sample includes synthetic speech using the trained prosody extractor; and
providing a notice in case the examined speech sample is determined to include synthetic speech, wherein the notice comprises an indication that the examined speech sample is suspected as being a synthetic voice sample.
|
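The data flow of the training steps recited in claim 1 can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in rather than the claimed implementation: the three "networks" are tiny linear functions instead of neural networks, the toy codec operates on a spectrogram rather than on a raw speech sample, the embeddings are single scalars, and a simple accept-if-better random search replaces whatever optimizer an actual embodiment would use. The sketch only shows which inputs feed which component and that the loss compares the synthesized spectrogram with the channel degraded one.

```python
import random

# All function names, shapes, and constants below are illustrative assumptions.

def codec_model(spec, attenuation=0.5):
    """Stand-in for the codec model (second network): simulate a channel
    by attenuating the upper half of the frequency bins in every frame."""
    n = len(spec[0])
    return [[v * (attenuation if i >= n // 2 else 1.0) for i, v in enumerate(frame)]
            for frame in spec]

def prosody_extractor(spec, w):
    """Stand-in for the prosody extractor (first network): one scalar
    'prosody embedding' per frame, a weighted sum over the bins."""
    return [sum(wi * v for wi, v in zip(w, frame)) for frame in spec]

def synthesis_model(prosody, codec_emb, speaker_emb, text_emb, out_w):
    """Stand-in for the speech synthesis model (third network): rebuild a
    spectrogram from the prosody embedding plus the conditioning inputs."""
    cond = codec_emb + speaker_emb + text_emb
    return [[ow * (p + cond) for ow in out_w] for p in prosody]

def spectrogram_loss(pred, target):
    """Mean squared error between two spectrograms of equal shape."""
    n = sum(len(frame) for frame in target)
    return sum((a - b) ** 2
               for fp, ft in zip(pred, target)
               for a, b in zip(fp, ft)) / n

def forward_loss(params, spec, codec_emb, speaker_emb, text_emb):
    """One pass through the pipeline: codec, extractor, synthesis, loss."""
    n_bins = len(spec[0])
    extractor_w, synth_w = params[:n_bins], params[n_bins:]
    degraded = codec_model(spec)
    prosody = prosody_extractor(degraded, extractor_w)
    pred = synthesis_model(prosody, codec_emb, speaker_emb, text_emb, synth_w)
    # The target is the channel degraded spectrogram, not the clean one.
    return spectrogram_loss(pred, degraded)

def train(spec, steps=200, seed=0):
    """Jointly update extractor and synthesis parameters. A perturbation is
    kept only if it lowers the loss (random search, not backpropagation)."""
    rng = random.Random(seed)
    n_bins = len(spec[0])
    params = [0.1] * (2 * n_bins)
    conditioning = (0.2, 0.1, 0.3)  # toy codec, speaker, and text embeddings
    initial = best = forward_loss(params, spec, *conditioning)
    for _ in range(steps):
        candidate = [p + rng.gauss(0.0, 0.05) for p in params]
        loss = forward_loss(candidate, spec, *conditioning)
        if loss < best:
            params, best = candidate, loss
    return initial, best, params

# Toy "training speech sample": 3 frames of 4 frequency bins each.
training_spec = [[1.0, 0.8, 0.6, 0.4],
                 [0.9, 0.7, 0.5, 0.3],
                 [0.2, 0.3, 0.1, 0.1]]
initial_loss, final_loss, trained_params = train(training_spec)
```

Because the extractor and the synthesis model share one loss, the extractor is pushed to encode whatever the other conditioning inputs (codec embedding, speaker identity, text) do not already supply. The claim does not specify how the trained extractor is then used to decide whether an examined sample is synthetic, so the sketch stops at training.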