US 12,394,431 B2
Synthetic voice detection method based on biological sound, recording medium and apparatus for performing the same
Souhwan Jung, Seoul (KR); Kihun Hong, Seoul (KR); and Thien-Phuc Doan, Seoul (KR)
Assigned to FOUNDATION OF SOONGSIL UNIVERSITY-INDUSTRY COOPERATION, Seoul (KR)
Filed by Foundation of Soongsil University-Industry Cooperation, Seoul (KR)
Filed on Dec. 2, 2022, as Appl. No. 18/073,779.
Claims priority of application No. 10-2022-0111400 (KR), filed on Sep. 2, 2022; and application No. 10-2022-0129615 (KR), filed on Oct. 11, 2022.
Prior Publication US 2024/0079027 A1, Mar. 7, 2024
Int. Cl. G10L 25/78 (2013.01); G10L 21/10 (2013.01)
CPC G10L 25/78 (2013.01) [G10L 21/10 (2013.01)] 9 Claims
OG exemplary drawing
 
1. A method for detecting a synthetic voice based on a biological sound, comprising:
receiving an audio stream;
extracting a biological feature vector corresponding to a meaningless voice from the audio stream;
extracting a synthetic voice feature vector from the audio stream;
combining the biological feature vector and the synthetic voice feature vector to generate a combined feature vector; and
determining whether the audio stream is a synthetic voice based on the combined feature vector,
wherein extracting the biological feature vector comprises extracting the biological feature vector by inputting the audio stream to a pre-trained biological sound segmentation model, encoding the biological feature vector using a sequence model to extract encoded data corresponding to the last hidden state of the sequence model, and converting the encoded data into a scoring embedding vector of length H through a fully connected layer without an activation function,
wherein the biological sound segmentation model extracts the biological feature vector by converting the audio stream into a spectrogram, dividing the spectrogram into a plurality of frames, classifying a biological sound type for each divided frame, and assigning a corresponding ID to each classified biological sound type.
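The claimed pipeline (spectrogram framing, per-frame biological-sound-type classification with ID assignment, sequence-model encoding keeping the last hidden state, a fully connected layer without activation producing a length-H scoring embedding, and combination with a synthetic voice feature vector) can be sketched as follows. This is a minimal illustrative sketch, not the patented implementation: the frame classifier, the toy Elman RNN, the random weights, and the placeholder synthetic-voice vector and detector are all hypothetical stand-ins for the trained models the claim presumes.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 16  # length of the scoring embedding vector ("H" in the claim)

def spectrogram(audio, n_fft=256, hop=128):
    # Convert the audio stream into a magnitude spectrogram and divide
    # it into frames (simplified short-time FFT, hypothetical parameters).
    frames = [audio[i:i + n_fft] for i in range(0, len(audio) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames) * np.hanning(n_fft), axis=1))

def classify_bio_frames(spec, n_types=4):
    # Stand-in for the pre-trained biological sound segmentation model:
    # classify a biological sound type per frame and assign an ID
    # (hypothetical rule; the real model is a trained classifier).
    return spec.sum(axis=1).astype(int) % n_types

def rnn_last_hidden(ids, H):
    # Toy sequence model: embed the type IDs, run a simple Elman RNN,
    # and keep only the last hidden state, as the claim specifies.
    emb = rng.standard_normal((8, H))
    Wx = rng.standard_normal((H, H)) * 0.1
    Wh = rng.standard_normal((H, H)) * 0.1
    h = np.zeros(H)
    for i in ids:
        h = np.tanh(emb[i] @ Wx + h @ Wh)
    return h

audio = rng.standard_normal(4000)          # placeholder audio stream
spec = spectrogram(audio)
ids = classify_bio_frames(spec)
encoded = rnn_last_hidden(ids, H)

# Fully connected layer with no activation -> length-H scoring embedding.
W_fc = rng.standard_normal((H, H))
bio_vec = encoded @ W_fc

# Placeholder synthetic voice feature vector; combine the two vectors
# and score them with a placeholder linear detector + sigmoid.
synth_vec = rng.standard_normal(H)
combined = np.concatenate([bio_vec, synth_vec])
score = 1.0 / (1.0 + np.exp(-(combined @ rng.standard_normal(2 * H))))
print(bio_vec.shape, combined.shape)
```

In this sketch the final `score` plays the role of the determination step; in the patented method that decision would be made by a trained model operating on the combined feature vector.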