US 12,087,307 B2
Method and apparatus for performing speaker diarization on mixed-bandwidth speech signals
Myungjong Kim, Milpitas, CA (US); Vijendra Raj Apsingekar, San Jose, CA (US); Aviral Anshu, Santa Clara, CA (US); and Taeyeon Ki, Milpitas, CA (US)
Assigned to SAMSUNG ELECTRONICS CO., LTD., Suwon-si (KR)
Filed by SAMSUNG ELECTRONICS CO., LTD., Suwon-si (KR)
Filed on Nov. 30, 2021, as Appl. No. 17/538,604.
Prior Publication US 2023/0169981 A1, Jun. 1, 2023
Int. Cl. G10L 17/06 (2013.01); G10L 17/02 (2013.01); G10L 17/18 (2013.01); G10L 21/0272 (2013.01); G10L 21/0308 (2013.01)
CPC G10L 17/06 (2013.01) [G10L 17/02 (2013.01); G10L 17/18 (2013.01); G10L 21/0272 (2013.01); G10L 21/0308 (2013.01)]
12 Claims
 
1. An apparatus for processing speech data, the apparatus comprising:
a memory storing instructions; and
a processor configured to execute the instructions to:
separate an input speech into speech signals;
identify a bandwidth of each of the speech signals;
obtain speaker embeddings from each of a plurality of different speech embedding extraction models by inputting the speech signals having different bandwidths to the different speech embedding extraction models, wherein at least one neural network of each of the plurality of different speech embedding extraction models is trained based on the different bandwidths;
cluster the speaker embeddings for each of the different bandwidths separately, to obtain bandwidth-dependent embedding clusters for each of the different bandwidths; and
combine the bandwidth-dependent embedding clusters based on a vector dissimilarity between the bandwidth-dependent clusters, to obtain cross-bandwidth embedding clusters as speaker clusters, each of the speaker clusters corresponding to a speaker identity,
wherein each of the plurality of different speech embedding extraction models comprises a plurality of frame-level layers, a pooling layer, a plurality of segment-level layers, and an output layer; and
wherein the processor is further configured to obtain the speaker embeddings by inputting bandwidth information to one of the plurality of frame-level layers, and to the plurality of segment-level layers.
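
The clustering and combining steps recited in claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation, and it rests on several assumptions that do not come from the claim: the function names (cosine_distance, centroid, cluster_within_bandwidth, diarize) are hypothetical, cosine distance stands in for the "vector dissimilarity", a simple greedy threshold rule stands in for the clustering method, the threshold value 0.3 is arbitrary, and embeddings produced by the different bandwidth-specific models are assumed to be directly comparable. Speech separation, bandwidth identification, and the embedding extraction models themselves are outside the sketch, which starts from (bandwidth, embedding) pairs.

```python
import math
from collections import defaultdict


def cosine_distance(a, b):
    """Cosine dissimilarity: 0 for the same direction, up to 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cluster_within_bandwidth(embeddings, threshold):
    """Greedy clustering (an assumption): each embedding joins the closest
    existing cluster if its centroid is nearer than `threshold`, otherwise
    it opens a new cluster."""
    clusters = []
    for emb in embeddings:
        best, best_dist = None, threshold
        for cluster in clusters:
            dist = cosine_distance(emb, centroid(cluster))
            if dist < best_dist:
                best, best_dist = cluster, dist
        if best is None:
            clusters.append([emb])
        else:
            best.append(emb)
    return clusters


def diarize(segments, threshold=0.3):
    """segments: iterable of (bandwidth_label, embedding) pairs.
    Returns a list of speaker clusters, each a list of embeddings."""
    # Group the embeddings by the bandwidth of the signal they came from.
    by_bandwidth = defaultdict(list)
    for bandwidth, emb in segments:
        by_bandwidth[bandwidth].append(emb)

    speakers = []  # each entry: {"bandwidths": set, "members": list}
    for bandwidth, embs in by_bandwidth.items():
        # Step 1: bandwidth-dependent clusters, built separately per bandwidth.
        for cluster in cluster_within_bandwidth(embs, threshold):
            # Step 2: combine with the closest speaker cluster that does not
            # already contain this bandwidth (a design choice of this sketch).
            best, best_dist = None, threshold
            for spk in speakers:
                if bandwidth in spk["bandwidths"]:
                    continue
                dist = cosine_distance(centroid(cluster), centroid(spk["members"]))
                if dist < best_dist:
                    best, best_dist = spk, dist
            if best is None:
                speakers.append({"bandwidths": {bandwidth}, "members": list(cluster)})
            else:
                best["bandwidths"].add(bandwidth)
                best["members"].extend(cluster)
    return [spk["members"] for spk in speakers]
```

Calling diarize with a list such as [("nb", [1.0, 0.0]), ("wb", [0.9, 0.2]), ("nb", [0.0, 1.0])] returns one list of embeddings per inferred speaker. Restricting the second step to clusters from other bandwidths keeps clusters that were already separated within one bandwidth from being merged again; this restriction is a choice made for the sketch and is not stated in the claim.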