US 11,710,496 B2
Adaptive diarization model and user interface
Aaron Donsbach, Seattle, WA (US); and Dirk Padfield, Seattle, WA (US)
Assigned to Google LLC, Mountain View, CA (US)
Appl. No. 17/596,861
Filed by Google LLC, Mountain View, CA (US)
PCT Filed Jul. 1, 2019, PCT No. PCT/US2019/040111
§ 371(c)(1), (2) Date Dec. 20, 2021,
PCT Pub. No. WO2021/002838, PCT Pub. Date Jul. 1, 2021.
Prior Publication US 2022/0310109 A1, Sep. 29, 2022
Int. Cl. G10L 15/26 (2006.01); G10L 15/08 (2006.01); G10L 21/0308 (2013.01); G06F 3/0481 (2022.01); G06F 3/16 (2006.01); G10L 17/06 (2013.01); G10L 17/24 (2013.01); G10L 21/028 (2013.01)
CPC G10L 21/0308 (2013.01) [G06F 3/0481 (2013.01); G06F 3/167 (2013.01); G10L 17/06 (2013.01); G10L 17/24 (2013.01); G10L 21/028 (2013.01)]
20 Claims
 
1. A method comprising:
displaying, by way of a user interface of a computing device, a visual prompt for input of identity data;
receiving, by the computing device, a first audio waveform captured during an initial time window and representing a first plurality of utterances by a first speaker and a second plurality of utterances by a second speaker;
receiving, by the computing device, the identity data of a first type indicating that (i) a first utterance of the first plurality of utterances corresponds to the first speaker and (ii) a second utterance of the second plurality of utterances corresponds to the second speaker;
determining, by the computing device and based on the first utterance, the second utterance, and the identity data of the first type, a diarization model configured to distinguish between utterances by the first speaker and utterances by the second speaker;
determining an accuracy of the diarization model in distinguishing between the first plurality of utterances and the second plurality of utterances;
determining that the accuracy exceeds a threshold accuracy;
based on determining that the accuracy exceeds the threshold accuracy, modifying the user interface to remove therefrom the visual prompt;
receiving, by the computing device and exclusively of receiving further identity data of the first type indicating a source speaker of a third utterance, a second audio waveform captured during a subsequent time window and representing the third utterance; and
determining, by the computing device, by way of the diarization model, and independently of the further identity data of the first type, the source speaker of the third utterance, wherein the source speaker is determined to be the first speaker or the second speaker.
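
The steps of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: it assumes each utterance has already been reduced to a numeric feature vector, and it swaps in a nearest-centroid classifier as a stand-in for the diarization model. The class name DiarizationSession, its methods, the 0.9 threshold, and the sample vectors are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from math import dist
from typing import Dict, List, Tuple

Embedding = Tuple[float, ...]          # assumed per-utterance feature vector
Labeled = List[Tuple[Embedding, str]]  # (utterance features, speaker identity)


@dataclass
class DiarizationSession:
    threshold: float = 0.9       # assumed threshold accuracy
    prompt_visible: bool = True  # visual prompt for identity data is shown
    centroids: Dict[str, Embedding] = field(default_factory=dict)

    def fit(self, labeled: Labeled) -> None:
        """Determine the model: one mean vector per identified speaker."""
        groups: Dict[str, List[Embedding]] = {}
        for emb, speaker in labeled:
            groups.setdefault(speaker, []).append(emb)
        self.centroids = {
            speaker: tuple(sum(col) / len(embs) for col in zip(*embs))
            for speaker, embs in groups.items()
        }

    def predict(self, emb: Embedding) -> str:
        """Return the speaker whose centroid is closest to the utterance."""
        return min(self.centroids, key=lambda s: dist(self.centroids[s], emb))

    def accuracy(self, labeled: Labeled) -> float:
        """Fraction of labeled utterances attributed to the right speaker."""
        correct = sum(self.predict(emb) == speaker for emb, speaker in labeled)
        return correct / len(labeled)

    def update_prompt(self, labeled: Labeled) -> bool:
        """Remove the prompt once accuracy exceeds the threshold."""
        if self.accuracy(labeled) > self.threshold:
            self.prompt_visible = False
        return self.prompt_visible


# Initial time window: utterances accompanied by identity data.
initial: Labeled = [
    ((0.1, 0.2), "first speaker"),
    ((0.2, 0.1), "first speaker"),
    ((0.9, 0.8), "second speaker"),
    ((0.8, 0.9), "second speaker"),
]

session = DiarizationSession()
session.fit(initial)
session.update_prompt(initial)

# Subsequent time window: a third utterance arrives without identity data.
third_speaker = session.predict((0.85, 0.85))
```

In this sketch the accuracy is measured on the same utterances used to build the model, which keeps the example short; a practical system would more plausibly evaluate on utterances held out from training.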