US 12,277,939 B2
Progressive contrastive learning framework for self-supervised speaker verification
Chunlei Zhang, Palo Alto, CA (US); and Dong Yu, Palo Alto, CA (US)
Assigned to TENCENT AMERICA LLC, Palo Alto, CA (US)
Filed by TENCENT AMERICA LLC, Palo Alto, CA (US)
Filed on May 2, 2022, as Appl. No. 17/734,471.
Prior Publication US 2023/0352029 A1, Nov. 2, 2023
Int. Cl. G10L 17/04 (2013.01); G10L 17/02 (2013.01); G10L 21/0208 (2013.01)
CPC G10L 17/04 (2013.01) [G10L 17/02 (2013.01); G10L 21/0208 (2013.01); G10L 2021/02082 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A method for speaker identification and verification, the method being executed by at least one processor, and the method comprising:
training a dynamic clustering model for speaker identification and verification, the training comprising:
receiving, by a first encoder, a plurality of original speech segments;
receiving, by a second encoder, a plurality of augmented speech segments based on the plurality of original speech segments;
generating, by the first encoder, first speaker representations based on the plurality of original speech segments;
generating, by the second encoder, second speaker representations based on the plurality of augmented speech segments;
prior to estimating a cluster for the second speaker representations, inputting the second speaker representations into a memory queue;
dynamically determining a number of clusters available for classification of the plurality of original speech segments based on statistical characteristics of the second speaker representations in the memory queue;
assigning a respective cluster to each second speaker representation among the second speaker representations, the cluster being one from among the dynamically determined number of clusters; and
generating a contrastive loss based on the first speaker representations and the assigned second speaker representations, wherein the contrastive loss is based on queue-level centroids associated with the dynamically determined number of clusters instead of dataset-level centroids; and
verifying an identity of a speaker of a first audio by applying the first audio as input into the trained dynamic clustering model for speaker identification and verification.
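The claimed training step can be illustrated with a minimal, self-contained sketch. This is not the patented implementation: the toy linear "encoders", the k-means routine, the elbow-style heuristic standing in for the claim's "statistical characteristics of the second speaker representations", and all function names (`encode`, `estimate_num_clusters`, `contrastive_loss`) are assumptions made for illustration. The sketch follows the claim's flow: both views are encoded, the second-view embeddings enter a memory queue, a cluster count is determined dynamically from the queue, each queued embedding is assigned a cluster, and a contrastive loss is computed against queue-level centroids rather than dataset-level ones.

```python
# Illustrative sketch only; encoders, heuristics, and names are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    """Toy 'encoder': linear projection + L2 normalization."""
    z = x @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def kmeans(X, k, iters=20):
    """Plain k-means over the queued embeddings."""
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(0)
    return labels, centroids

def estimate_num_clusters(queue, k_max=6):
    """Illustrative stand-in for the claim's dynamic cluster-count
    estimation: stop adding clusters once within-cluster variance
    drops by less than 10%."""
    prev = None
    for k in range(2, k_max + 1):
        _, cent = kmeans(queue, k)
        inertia = ((queue[:, None] - cent[None]) ** 2).sum(-1).min(1).sum()
        if prev is not None and inertia > 0.9 * prev:
            return k - 1
        prev = inertia
    return k_max

def contrastive_loss(z1, labels, centroids, temp=0.1):
    """Cross-entropy of first-view embeddings against queue-level
    centroids (not dataset-level ones), per the claim."""
    logits = z1 @ centroids.T / temp
    logits -= logits.max(1, keepdims=True)          # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(1, keepdims=True))
    return -log_prob[np.arange(len(z1)), labels].mean()

# --- one training iteration on synthetic 'speech segments' ---
dim_in, dim_emb, n_seg = 40, 16, 64
segments = rng.normal(size=(n_seg, dim_in))                     # original segments
augmented = segments + 0.05 * rng.normal(size=segments.shape)   # augmented views
W1 = rng.normal(size=(dim_in, dim_emb))                         # first encoder (toy)
W2 = rng.normal(size=(dim_in, dim_emb))                         # second encoder (toy)

z1 = encode(segments, W1)       # first speaker representations
queue = encode(augmented, W2)   # second representations pushed into the memory queue

k = estimate_num_clusters(queue)        # dynamically determined number of clusters
labels, centroids = kmeans(queue, k)    # assign each queued embedding a cluster
centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
loss = contrastive_loss(z1, labels, centroids)
```

In a real system the encoders would be deep speaker-embedding networks, the queue would persist across mini-batches, and the loss would be backpropagated; the sketch only traces the data flow of the claim.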