| CPC G10L 15/26 (2013.01) [G10L 15/063 (2013.01); G10L 15/16 (2013.01); G10L 2015/0631 (2013.01)] | 24 Claims |

|
1. A computer-implemented method that, when executed on data processing hardware, causes the data processing hardware to perform operations comprising:
receiving an input audio signal corresponding to utterances spoken by multiple speakers;
before segmenting the input audio signal, processing, using a speech recognition model, the input audio signal to jointly generate as output from the speech recognition model:
a transcription of the utterances; and
a sequence of speaker turn tokens based on semantic information of the transcription, each speaker turn token indicating a location of a respective speaker turn detected in the transcription and located between a respective pair of adjacent terms of the transcription spoken by different speakers;
segmenting the input audio signal into a plurality of speaker segments based on the sequence of speaker turn tokens;
for each speaker segment of the plurality of speaker segments, extracting a corresponding speaker-discriminative embedding from the speaker segment;
performing spectral clustering on the speaker-discriminative embeddings extracted from the plurality of speaker segments to cluster the plurality of speaker segments into k classes; and
for each respective class of the k classes, assigning a respective speaker label to each speaker segment clustered into the respective class that is different than the respective speaker label assigned to the speaker segments clustered into each other class of the k classes.
|
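
The segmenting and label-assignment operations recited in claim 1 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the claimed implementation. The token string `<st>`, the `Word` record with per-word timestamps, and the function names `segment_by_turn_tokens` and `assign_speaker_labels` are all hypothetical, since the claim names none of them. The embedding extraction and spectral clustering steps are left out, so the cluster index for each segment is simply taken as input.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical speaker turn marker; the claim does not specify a token string.
TURN_TOKEN = "<st>"


@dataclass
class Word:
    """One recognizer output item with assumed start/end times in seconds."""
    text: str
    start: float
    end: float


def segment_by_turn_tokens(words: List[Word]) -> List[Tuple[float, float, List[str]]]:
    """Split the recognizer output into speaker segments at each turn token.

    Returns one (start_time, end_time, terms) tuple per segment.
    """
    groups: List[List[Word]] = []
    current: List[Word] = []
    for word in words:
        if word.text == TURN_TOKEN:
            # A turn token closes the running segment (if it has any terms).
            if current:
                groups.append(current)
                current = []
        else:
            current.append(word)
    if current:
        groups.append(current)
    return [(g[0].start, g[-1].end, [w.text for w in g]) for g in groups]


def assign_speaker_labels(cluster_ids: List[int]) -> List[str]:
    """Give every segment in the same class the same label, distinct per class.

    cluster_ids[i] is the class index of segment i, assumed to come from a
    clustering step (for example spectral clustering) that is not shown here.
    """
    label_of: Dict[int, str] = {}
    labels: List[str] = []
    for cid in cluster_ids:
        if cid not in label_of:
            # Labels are numbered in order of first appearance.
            label_of[cid] = f"speaker_{len(label_of) + 1}"
        labels.append(label_of[cid])
    return labels
```

In this sketch a turn token only marks a boundary and never appears inside a segment, which mirrors the claim language placing each token between adjacent terms spoken by different speakers. A real system would map the segment time spans back to the input audio before extracting speaker-discriminative embeddings.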