US 12,406,673 B2
Real-time speaker identification system utilizing meta learning to process short utterances in an open-set environment
Jeng-Shin Sheu, Yunlin County (TW); and Cheng-Hsuan Lu, Yunlin County (TW)
Assigned to NATIONAL YUNLIN UNIVERSITY OF SCIENCE AND TECHNOLOGY, Yunlin County (TW)
Filed by NATIONAL YUNLIN UNIVERSITY OF SCIENCE AND TECHNOLOGY, Yunlin County (TW)
Filed on Aug. 8, 2023, as Appl. No. 18/366,979.
Prior Publication US 2025/0054499 A1, Feb. 13, 2025
Int. Cl. G10L 17/06 (2013.01); G10L 17/02 (2013.01); G10L 17/04 (2013.01)
CPC G10L 17/06 (2013.01) [G10L 17/02 (2013.01); G10L 17/04 (2013.01)]
6 Claims
 
1. A real-time speaker identification system, utilizing meta learning to process short utterances in an open-set environment, comprising:
a speaker embedding generator, comprising a speaker model and a Mel-filter bank, the speaker model converting acoustic feature vectors extracted by the Mel-filter bank into a speaker embedding vector, wherein the speaker model is trained with a plurality of episodes based on meta learning with a composite objective function composed of two loss functions, each of the plurality of episodes comprises a support set of long utterances and a query set of short utterances, and gradients of the composite objective function are backpropagated to update the speaker model;
wherein the speaker identification system converts an input registration utterance of each of a plurality of enrolled speakers into a prototype vector by the speaker embedding generator to complete an enrollment process, and after the enrollment process is completed, the speaker identification system is provided for each tester to perform the following steps:
a spoofing attack identification step to determine whether an input test utterance of a tester is a spoofing attack, wherein the spoofing attack identification step comprises converting the input test utterance of the tester into a test embedding vector by the speaker embedding generator, calculating a cosine similarity between the test embedding vector and the prototype vector of each of the plurality of enrolled speakers, and determining whether the cosine similarity between the test embedding vector of the tester and the prototype vector of each of the plurality of enrolled speakers exceeds an impostor threshold, if all the cosine similarities exceed the impostor threshold, the input test utterance of the tester is determined as a real speech; otherwise, the input test utterance of the tester is determined as the spoofing attack, and the tester is rejected for login;
and an impostor and enrolled speaker identification step to determine whether the input test utterance of the tester is from an impostor or one of the plurality of enrolled speakers, wherein the impostor and enrolled speaker identification step comprises randomly dividing the input test utterance of the tester into three sound segments which are continuous parts of the input test utterance, converting the three sound segments into three segment speaker embedding vectors by the speaker embedding generator, and calculating a similarity score between each of the three segment speaker embedding vectors and the prototype vector of each of the plurality of enrolled speakers, and determining whether the maximum similarity scores of the three segment speaker embedding vectors of the tester all point to a specific enrolled speaker of the plurality of enrolled speakers and are all greater than the impostor threshold, if yes, the tester is determined as that specific enrolled speaker; otherwise, the tester is determined as an impostor and is rejected for login;
wherein in the impostor and enrolled speaker identification step, a unanimity vote holds when all the three segment speaker embedding vectors have the maximum similarity scores greater than the impostor threshold and point to the same enrolled speaker; the tester is rejected for login if the enrolled speaker with the highest cosine similarity in the spoofing attack identification step is different from the enrolled speaker identified through the unanimity vote in the impostor and enrolled speaker identification step.
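
The decision flow recited in claim 1 (spoofing check, three-segment unanimity vote, and the final cross-check) can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `identify`, `split_into_three`, `toy_embed`, and `cosine_similarity` are hypothetical, and `toy_embed` is only a stand-in for the trained speaker embedding generator. The sketch also assumes that the segment similarity score is the cosine similarity and that "randomly dividing" means choosing two random cut points; the claim fixes neither.

```python
import math
import random
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Frame = Tuple[float, ...]   # one acoustic feature vector (assumed representation)
Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_embed(frames: Sequence[Frame]) -> List[float]:
    """Hypothetical stand-in for the speaker model: element-wise sum of frames."""
    return [sum(column) for column in zip(*frames)]


def split_into_three(frames: Sequence[Frame], rng: random.Random) -> List[Sequence[Frame]]:
    """Cut the utterance at two random points into three contiguous, non-empty parts."""
    n = len(frames)
    if n < 3:
        raise ValueError("need at least three frames to form three segments")
    cut1, cut2 = sorted(rng.sample(range(1, n), 2))
    return [frames[:cut1], frames[cut1:cut2], frames[cut2:]]


def identify(
    frames: Sequence[Frame],
    prototypes: Dict[str, Vector],
    embed: Callable[[Sequence[Frame]], Vector],
    impostor_threshold: float,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Return the accepted enrolled speaker's name, or None if login is rejected."""
    rng = rng or random.Random()

    # Spoofing attack identification step: every cosine similarity between the
    # whole-utterance embedding and the prototypes must exceed the threshold.
    test_vec = embed(frames)
    sims = {name: cosine_similarity(test_vec, proto) for name, proto in prototypes.items()}
    if not all(score > impostor_threshold for score in sims.values()):
        return None  # treated as a spoofing attack
    best_whole = max(sims, key=lambda name: sims[name])

    # Impostor and enrolled speaker identification step: unanimity vote over
    # three randomly cut contiguous segments.
    winners = []
    for segment in split_into_three(frames, rng):
        seg_vec = embed(segment)
        seg_sims = {name: cosine_similarity(seg_vec, proto) for name, proto in prototypes.items()}
        top = max(seg_sims, key=lambda name: seg_sims[name])
        if seg_sims[top] <= impostor_threshold:
            return None  # a segment's best score fails the threshold: impostor
        winners.append(top)
    if len(set(winners)) != 1:
        return None  # segments disagree: impostor

    # Final cross-check: the vote winner must match the best whole-utterance match.
    return winners[0] if winners[0] == best_whole else None
```

In this sketch a call such as `identify(frames, prototypes, toy_embed, 0.5)` returns a speaker name only when all three conditions of the claim hold; every rejection path returns `None`, so the caller cannot distinguish a spoofing rejection from an impostor rejection. A real system would likely report the reason separately.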