US 12,482,470 B2
Speaker-turn-based online speaker diarization with constrained spectral clustering
Quan Wang, Hoboken, NJ (US); Han Lu, Santa Clara, CA (US); Evan Clark, San Francisco, CA (US); Ignacio Lopez Moreno, Brooklyn, NY (US); Hasim Sak, Santa Clara, CA (US); Wei Xia, Mountain View, CA (US); Taral Joglekar, Sunnyvale, CA (US); and Anshuman Tripathi, Mountain View, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Dec. 14, 2021, as Appl. No. 17/644,261.
Claims priority of provisional application 63/261,536, filed on Sep. 23, 2021.
Prior Publication US 2023/0089308 A1, Mar. 23, 2023
Int. Cl. G10L 15/26 (2006.01); G10L 15/06 (2013.01); G10L 15/16 (2006.01)

CPC G10L 15/26 (2013.01) [G10L 15/063 (2013.01); G10L 15/16 (2013.01); G10L 2015/0631 (2013.01)]

24 Claims

1. A computer-implemented method that, when executed on data processing hardware, causes the data processing hardware to perform operations comprising:

receiving an input audio signal corresponding to utterances spoken by multiple speakers;

before segmenting the input audio signal, processing, using a speech recognition model, the input audio signal to jointly generate as output from the speech recognition model:

a transcription of the utterances; and

a sequence of speaker turn tokens based on semantic information of the transcription, each speaker turn token indicating a location of a respective speaker turn detected in the transcription and located between a respective pair of adjacent terms of the transcription spoken by different speakers;

segmenting the input audio signal into a plurality of speaker segments based on the sequence of speaker turn tokens;

for each speaker segment of the plurality of speaker segments, extracting a corresponding speaker-discriminative embedding from the speaker segment;

performing spectral clustering on the speaker-discriminative embeddings extracted from the plurality of speaker segments to cluster the plurality of speaker segments into k classes; and

for each respective class of the k classes, assigning a respective speaker label to each speaker segment clustered into the respective class that is different than the respective speaker label assigned to the speaker segments clustered into each other class of the k classes.
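
The segmenting, clustering, and labeling operations recited above can be illustrated with a minimal sketch. It is not the claimed implementation. Several things are assumptions made only for illustration: the turn token spelling `<st>`, the helper names `split_on_turns`, `spectral_cluster`, and `assign_labels`, the toy two-dimensional vectors standing in for speaker-discriminative embeddings, and the `speaker_N` label format. The sketch splits a token sequence rather than audio, and it uses plain unnormalized spectral clustering with a simple k-means step, not the constrained variant named in the title.

```python
import numpy as np

TURN = "<st>"  # hypothetical spelling of a speaker turn token


def split_on_turns(tokens):
    """Split a token sequence into speaker segments at each turn token."""
    segments, current = [], []
    for tok in tokens:
        if tok == TURN:
            if current:
                segments.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        segments.append(current)
    return segments


def spectral_cluster(embeddings, k, iters=20):
    """Cluster embeddings into k classes; returns one class index per row."""
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    # Cosine similarity rescaled to [0, 1] so it can serve as an affinity.
    affinity = (x @ x.T + 1.0) / 2.0
    # Unnormalized graph Laplacian L = D - A.
    laplacian = np.diag(affinity.sum(axis=1)) - affinity
    # eigh returns eigenvalues in ascending order; keep the k smallest.
    _, vecs = np.linalg.eigh(laplacian)
    spec = vecs[:, :k]
    # Deterministic farthest-point initialization for k-means.
    centers = [spec[0]]
    while len(centers) < k:
        d = np.min([np.linalg.norm(spec - c, axis=1) for c in centers], axis=0)
        centers.append(spec[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        dist = np.linalg.norm(spec[:, None, :] - centers[None, :, :], axis=2)
        assign = dist.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = spec[assign == j].mean(axis=0)
    return assign


def assign_labels(classes):
    """Give every class its own label, numbered by first appearance."""
    names = {}
    for c in classes:
        names.setdefault(int(c), f"speaker_{len(names) + 1}")
    return [names[int(c)] for c in classes]


# Toy input: a transcription with turn tokens, and one made-up embedding
# per resulting segment (segments 0 and 2 resemble each other, as do 1 and 3).
tokens = "how are you <st> fine thanks <st> good to hear <st> bye".split()
segments = split_on_turns(tokens)
toy_embeddings = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 1.0]]
labels = assign_labels(spectral_cluster(toy_embeddings, k=2))
for seg, label in zip(segments, labels):
    print(label, " ".join(seg))
```

Numbering labels by first appearance keeps the output stable regardless of which internal class index the clustering step happens to give each group. In a real system the number of classes k would typically be estimated rather than fixed by hand.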