| CPC G10L 17/04 (2013.01) [G10L 17/02 (2013.01); G10L 21/0208 (2013.01); G10L 2021/02082 (2013.01)] | 20 Claims |

1. A method for speaker identification and verification, the method being executed by at least one processor, and the method comprising:
training a dynamic clustering model for speaker identification and verification, the training comprising:
receiving, by a first encoder, a plurality of original speech segments;
receiving, by a second encoder, a plurality of augmented speech segments based on the plurality of the original speech segments;
generating, by the first encoder, first speaker representations based on the plurality of original speech segments;
generating, by the second encoder, second speaker representations based on the plurality of augmented speech segments;
prior to estimating a cluster for the second speaker representations, inputting the second speaker representations into a memory queue;
dynamically determining a number of clusters available for classification of the plurality of original speech segments based on statistical characteristics of the second speaker representations in the memory queue;
assigning a respective cluster to each second speaker representation among the second speaker representations, the cluster being one from among the dynamically determined number of clusters; and
generating a contrastive loss based on the first speaker representations and the second speaker representations with their assigned clusters, wherein the contrastive loss is based on queue-level centroids associated with the dynamically determined number of clusters instead of dataset-level centroids; and
verifying an identity of a speaker of a first audio by applying the first audio as input into the trained dynamic clustering model for speaker identification and verification.
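The training step recited in the claim can be sketched as follows. This is a minimal illustrative assumption, not the patented implementation: every name (`MemoryQueue`, `choose_num_clusters`, the spread heuristic, the tiny k-means, and the InfoNCE-style loss over queue-level centroids) is hypothetical, chosen only to show how the recited steps could fit together.

```python
import numpy as np

rng = np.random.default_rng(0)

class MemoryQueue:
    """FIFO buffer of second-encoder speaker representations (illustrative)."""
    def __init__(self, max_size):
        self.max_size = max_size
        self._items = []
    def push(self, reps):
        self._items.extend(list(reps))
        self._items = self._items[-self.max_size:]  # keep most recent entries
    def array(self):
        return np.stack(self._items)

def choose_num_clusters(reps, min_k=2, max_k=8):
    """Stand-in for 'statistical characteristics': more spread -> more clusters."""
    spread = float(np.std(reps))
    k = int(np.clip(min_k + round(spread * 4), min_k, max_k))
    return min(k, len(reps))

def kmeans(reps, k, iters=10):
    """Tiny k-means: assigns each queued representation to a cluster and
    returns the queue-level centroids used by the contrastive loss."""
    centroids = reps[rng.choice(len(reps), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(reps[:, None] - centroids[None], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = reps[assign == j].mean(axis=0)
    return assign, centroids

def contrastive_loss(first_reps, assign, centroids, temp=0.1):
    """InfoNCE against queue-level centroids: the centroid of each
    representation's assigned cluster is the positive, the rest negatives."""
    logits = first_reps @ centroids.T / temp
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-logp[np.arange(len(assign)), assign].mean())

# One simulated training step with toy data (shapes are arbitrary choices).
dim, batch, queue_size = 16, 8, 64
queue = MemoryQueue(queue_size)
for _ in range(3):  # warm the queue with a few batches
    first = rng.normal(size=(batch, dim))                   # original segments
    second = first + 0.05 * rng.normal(size=(batch, dim))   # augmented segments
    queue.push(second)

reps = queue.array()
k = choose_num_clusters(reps)           # dynamically determined cluster count
assign, centroids = kmeans(reps, k)     # cluster assignment over the queue
loss = contrastive_loss(first, assign[-batch:], centroids)
```

The loss pairs each first-encoder representation with the queue-level centroid of its augmented counterpart's cluster, mirroring the claim's contrast against centroids of the dynamically determined clusters rather than dataset-level centroids.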