US 12,217,760 B2
Metadata-based diarization of teleconferences
Eilon Reshef, Tel Aviv (IL); Hanan Shteingart, Herzliya (IL); Zohar Shay, Even Yehuda (IL); and Shlomi Medalion, Ramat Gan (IL)
Assigned to GONG.IO Ltd., Ramat Gan (IL)
Filed by GONG.IO LTD., Ramat Gan (IL)
Filed on Jan. 30, 2022, as Appl. No. 17/588,296.
Application 17/588,296 is a continuation-in-part of application No. 16/297,757, filed on Mar. 11, 2019, granted, now Pat. No. 11,276,407.
Claims priority of provisional application 62/658,604, filed on Apr. 17, 2018.
Prior Publication US 2022/0157322 A1, May 19, 2022
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 17/06 (2013.01); G06V 30/19 (2022.01); G06V 40/16 (2022.01); G10L 17/02 (2013.01); G10L 17/04 (2013.01); G10L 17/18 (2013.01); H04L 65/403 (2022.01); G10L 21/028 (2013.01)
CPC G10L 17/06 (2013.01) [G06V 30/19 (2022.01); G06V 40/172 (2022.01); G10L 17/02 (2013.01); G10L 17/04 (2013.01); G10L 17/18 (2013.01); H04L 65/403 (2022.01); G10L 21/028 (2013.01)] 35 Claims
OG exemplary drawing
 
1. A method for audio processing, comprising:
receiving, in a computer, a recording of a teleconference among multiple participants over a network, the recording including an audio stream containing speech uttered by the participants and information outside the audio stream;
processing the audio stream by the computer to identify speech segments, in which one or more of the participants were speaking, interspersed with intervals of silence in the audio stream;
extracting speaker identifications, which are indicative of the participants who spoke during periods of the teleconference, from the information outside the audio stream in the received recording, so as to provide, for a plurality of periods of the teleconference, corresponding speaker identifications;
labeling a first set of the identified speech segments from the audio stream with the speaker identifications extracted from the information outside the audio stream in the received recording, wherein each speech segment from the audio stream, in the first set, is labeled with a speaker identification of a period corresponding to a time of the speech segment;
extracting acoustic features from the speech segments in the first set;
learning a correlation between the speaker identifications with which the segments in the first set are labeled and the acoustic features extracted from the corresponding segments of the first set; and
relabeling one or more of the identified speech segments of the first set, using the learned correlation, to indicate the participants who spoke, in place of the speaker identifications extracted from the information outside the audio stream.
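
The segmentation and metadata-labeling elements of claim 1 can be illustrated in code. The sketch below is a minimal Python illustration, not the patented implementation: it assumes the audio is a mono NumPy array, uses a simple energy-threshold voice activity detector (the claim does not require any particular segmentation technique), and assumes the speaker identifications have already been parsed from the out-of-stream metadata into (start_sec, end_sec, speaker_id) periods. All function names, parameters, and thresholds are hypothetical.

import numpy as np

def segment_speech(audio, sr, frame_ms=30, energy_ratio=0.1):
    # Energy-based voice activity detection (illustrative only): frames
    # whose RMS energy falls below a fraction of the mean RMS are treated
    # as silence. Returns speech segments as (start_sec, end_sec) tuples.
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    frames = audio[:n_frames * frame_len].astype(float).reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    voiced = rms > energy_ratio * rms.mean()

    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i
        elif not v and start is not None:
            segments.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None
    if start is not None:
        segments.append((start * frame_ms / 1000, n_frames * frame_ms / 1000))
    return segments

def label_segments(segments, periods):
    # Label each speech segment with the speaker identification of the
    # metadata period (start_sec, end_sec, speaker_id) covering the
    # segment's midpoint; segments outside every period get None and
    # fall outside the "first set".
    labels = []
    for seg_start, seg_end in segments:
        mid = 0.5 * (seg_start + seg_end)
        labels.append(next((spk for p0, p1, spk in periods if p0 <= mid < p1), None))
    return labels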
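
The remaining elements, extracting acoustic features, learning the correlation, and relabeling, can be sketched the same way. For brevity the features here are log band energies of the segment spectrum and the learned correlation is a per-speaker centroid with nearest-centroid relabeling; these are illustrative stand-ins (claim 1 does not name specific features or a specific model), and all names are again hypothetical.

import numpy as np

def extract_features(audio, sr, seg, n_bands=20):
    # Crude acoustic features for one segment: log mean magnitudes in
    # n_bands linearly spaced frequency bands (a stand-in for MFCCs or
    # speaker embeddings). Assumes segments long enough to fill the bands.
    start, end = int(seg[0] * sr), int(seg[1] * sr)
    spectrum = np.abs(np.fft.rfft(audio[start:end].astype(float)))
    return np.log1p(np.array([band.mean() for band in np.array_split(spectrum, n_bands)]))

def learn_speaker_models(features, labels):
    # "Learn the correlation" as a per-speaker centroid of the feature
    # vectors of the segments carrying that metadata-derived label.
    models = {}
    for spk in {l for l in labels if l is not None}:
        models[spk] = np.mean([f for f, l in zip(features, labels) if l == spk], axis=0)
    return models

def relabel(features, models):
    # Relabel every segment with the acoustically nearest speaker centroid,
    # overriding the metadata-derived label wherever the two disagree.
    return [min(models, key=lambda s: np.linalg.norm(f - models[s])) for f in features]

A possible end-to-end use of these pieces, again purely illustrative:

segments = segment_speech(audio, sr)
labels = label_segments(segments, periods)                  # metadata pass
feats = [extract_features(audio, sr, s) for s in segments]
models = learn_speaker_models(feats, labels)
labels = relabel(feats, models)                             # acoustic pass

The two-pass structure is the substance of the claim: the out-of-stream metadata supplies cheap, approximate labels, and the acoustic model trained on those labels then corrects segments that the metadata timed poorly.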