US 12,315,517 B2
Method and system for correcting speaker diarization using speaker change detection based on text
Namkyu Jung, Seongnam-si (KR); Geonmin Kim, Seongnam-si (KR); Youngki Kwon, Seongnam-si (KR); Hee Soo Heo, Seongnam-si (KR); Bong-Jin Lee, Seongnam-si (KR); and Chan Kyu Lee, Seongnam-si (KR)
Assigned to NAVER CORPORATION, Seongnam-si (KR); and LINE WORKS CORP., Tokyo (JP)
Filed by NAVER CORPORATION, Seongnam-si (KR); and LINE WORKS CORP., Tokyo (JP)
Filed on Feb. 7, 2022, as Appl. No. 17/665,672.
Claims priority of application No. 10-2021-0017814 (KR), filed on Feb. 8, 2021.
Prior Publication US 2022/0254351 A1, Aug. 11, 2022
Int. Cl. G10L 17/14 (2013.01); G06F 40/284 (2020.01); G10L 15/26 (2006.01); G10L 17/22 (2013.01); G10L 21/028 (2013.01)
CPC G10L 17/14 (2013.01) [G10L 17/22 (2013.01); G10L 21/028 (2013.01)]
8 Claims
1. A speaker diarization correction method of a computer apparatus comprising at least one processor, the method, which uses the at least one processor, comprising:
performing speaker diarization on an input audio stream;
recognizing a speech included in the input audio stream and converting the speech to text;
detecting a speaker change based on the converted text; and
correcting the speaker diarization based on the detected speaker change,
wherein the detecting of the speaker change comprises:
receiving a speech recognition result for each utterance section, wherein each utterance section consists of at least one word unit, and further wherein each word unit comprises a single word of text;
encoding text included in the speech recognition result for each utterance section to one or more word units of text, wherein the encoding of the text to the one or more word units of text comprises encoding an EndPoint Detection (EPD) unit text included in the speech recognition result for each utterance section to the one or more word units of text using sentence Bidirectional Encoder Representations from Transformers (sBERT);
encoding each of the word units of text to consider a conversation context; and
determining whether a speaker change compared to a previous word unit of text is present for each word unit of text, individually, in which the conversation context is considered.
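The final correcting step of claim 1 can be illustrated with a minimal sketch. The claim does not state how the diarization is corrected once text-based speaker changes are known, so the majority-vote rule below is an assumption made only for illustration, not the patented procedure. The names `WordUnit`, `diarized_speaker`, `change_detected`, and `correct_diarization` are hypothetical. The per-word change flags are taken as given inputs; in the claim they would come from the sBERT-based encoding and the context-aware determination, which are not modeled here.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class WordUnit:
    """One word unit of recognized text (hypothetical container)."""
    text: str
    diarized_speaker: str  # label assigned by audio-based speaker diarization
    change_detected: bool  # text-based flag: speaker differs from the previous word unit


def correct_diarization(units: List[WordUnit]) -> List[str]:
    """Return one corrected speaker label per word unit.

    Assumed rule (not taken from the claim): text-detected change points
    define segment boundaries, and every word unit in a segment receives
    the diarization label that occurs most often within that segment.
    """
    # Start a new segment at the first word unit and at every detected change.
    segments: List[List[WordUnit]] = []
    for index, unit in enumerate(units):
        if index == 0 or unit.change_detected:
            segments.append([])
        segments[-1].append(unit)

    # Relabel each segment with its most frequent diarization label.
    corrected: List[str] = []
    for segment in segments:
        counts = Counter(unit.diarized_speaker for unit in segment)
        majority_label = counts.most_common(1)[0][0]
        corrected.extend([majority_label] * len(segment))
    return corrected
```

As a usage example, suppose audio-based diarization labels the words "how are you today I am fine" as A, A, A, B, B, B, B, while the text-based detector flags a change only at "I". Under the assumed rule, "today" is relabeled from B to A, because it falls in the first text-defined segment, where A is the majority label.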