US 12,190,890 B2
System, method and programmed product for uniquely identifying participants in a recorded streaming teleconference
Shlomi Medalion, Ramat Gan (IL); Omri Allouche, Tel Aviv (IL); and Maxim Bulanov, Ramat Gan (IL)
Assigned to GONG.IO LTD, Ramat Gan (IL)
Filed by GONG.IO LTD, Ramat Gan (IL)
Filed on Mar. 5, 2024, as Appl. No. 18/596,327.
Application 18/596,327 is a continuation of application No. 17/651,208, filed on Feb. 15, 2022, granted, now 11,978,456.
Prior Publication US 2024/0212691 A1, Jun. 27, 2024
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 17/06 (2013.01); G06Q 30/01 (2023.01); G06V 10/82 (2022.01); G06V 20/40 (2022.01); G06V 40/16 (2022.01); G10L 17/02 (2013.01); G10L 17/18 (2013.01); G10L 21/028 (2013.01); G10L 25/57 (2013.01); H04L 65/403 (2022.01)
CPC G10L 17/06 (2013.01) [G06V 10/82 (2022.01); G06V 20/40 (2022.01); G06V 40/171 (2022.01); G10L 17/02 (2013.01); G10L 17/18 (2013.01); G10L 21/028 (2013.01); G10L 25/57 (2013.01); H04L 65/403 (2013.01); G06Q 30/01 (2013.01)] 17 Claims
OG exemplary drawing
 
1. A system for using visual information in a video stream of a first recorded teleconference among a plurality of participants to diarize speech, comprising:
one or more processors; and
non-transitory computer-readable memory operatively connected to the one or more processors, the non-transitory computer-readable memory including machine readable instructions that, when executed by the one or more processors, cause the one or more processors to perform steps of:
(a) obtaining components of the first recorded teleconference among the plurality of participants conducted over a network, wherein the components include:
(1) an audio component including utterances of respective participants that spoke during the first recorded teleconference;
(2) a video component including a video feed as to respective participants that spoke during the first recorded teleconference;
(3) teleconference metadata associated with the first recorded teleconference and including a first plurality of timestamp information and respective speaker identification information associated with each respective timestamp information; and
(4) transcription data associated with the first recorded teleconference, wherein said transcription data is indexed by timestamp;
(b) parsing the audio component into a plurality of speech segments in which one or more participants were speaking during the first recorded teleconference, wherein each respective speech segment is associated with a respective time segment including a start timestamp indicating a first time in the first recorded teleconference when the respective speech segment begins, and a stop timestamp associated with a second time in the first recorded teleconference when the respective speech segment ends;
(c) tagging each respective speech segment with the respective speaker identification information based on the teleconference metadata associated with the respective time segment; and
(d) diarizing the first recorded teleconference, wherein diarizing the first recorded teleconference includes:
(1) indexing the transcription data in accordance with respective speech segments and the respective speaker identification information to generate a segmented transcription data set for the first recorded teleconference;
(2) identifying respective speaker information associated with respective speech segments using a neural network with at least a portion of the video feed corresponding in time to at least a portion of the segmented transcription data set determined according to the indexing as an input, and providing source indication information for each respective speech segment as an output and using a training set including visual content tagged with prior source indication information,
wherein the portion of the video feed includes a first artificial visual representation not including a face generated by telephone conferencing software in the visual content associated with a first participant that spoke during a first speech segment of the first recorded teleconference, and
the portion of the video feed does not include any artificial visual representation associated with a second participant that did not speak during the first speech segment of the first recorded teleconference, and
the source indication information is based at least on presence of the first artificial visual representation; and
(3) labeling each respective speech segment based on the identified respective speaker information associated with the respective speech segment;
wherein the identified respective speaker information is based on the source indication information.
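The claimed pipeline — parsing timestamped speech segments (b), tagging them from teleconference metadata (c), indexing the transcript into segments (d)(1), and labeling each segment from a video-based speaker model (d)(2)–(3) — can be sketched in code. This is only an illustrative reading of the claim, not the patented implementation: the data shapes (`SpeechSegment`, timestamped metadata and transcript tuples) and the stand-in `video_speaker_model` callable are hypothetical, and a real system would substitute a trained neural network operating on the video feed.

```python
from dataclasses import dataclass

@dataclass
class SpeechSegment:
    start: float          # start timestamp in the recording (seconds)
    stop: float           # stop timestamp (seconds)
    speaker_id: str = ""  # filled in from teleconference metadata, step (c)
    text: str = ""        # transcript words falling in this segment, step (d)(1)

def tag_segments(segments, metadata):
    """Step (c): tag each segment with the metadata speaker active at its start.

    `metadata` is a list of (timestamp, speaker_id) pairs, as in claim element (a)(3).
    """
    for seg in segments:
        # the active speaker is the latest metadata entry at or before the segment start
        active = [spk for ts, spk in metadata if ts <= seg.start]
        seg.speaker_id = active[-1] if active else "unknown"
    return segments

def index_transcript(segments, transcript):
    """Step (d)(1): index timestamp-keyed transcript words into their segments."""
    for ts, word in transcript:
        for seg in segments:
            if seg.start <= ts < seg.stop:
                seg.text = (seg.text + " " + word).strip()
    return segments

def diarize(segments, video_speaker_model):
    """Steps (d)(2)-(3): label each segment with the speaker the video-based
    model attributes to it (e.g. from presence of an artificial visual
    representation such as an 'active speaker' tile in the video feed)."""
    return {(seg.start, seg.stop): video_speaker_model(seg) for seg in segments}
```

As a usage sketch, a trivial stand-in model that simply echoes the metadata tag shows the data flow; in the claim, the model instead consumes the portion of the video feed aligned to the segment and emits source indication information.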