US 11,735,176 B2
Speaker diarization using speaker embedding(s) and trained generative model
Ignacio Lopez Moreno, New York, NY (US); and Luis Carlos Cobo Rus, San Francisco, CA (US)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Mar. 29, 2021, as Appl. No. 17/215,129.
Application 17/215,129 is a continuation of application No. 16/607,977, now Pat. No. 10,978,059, previously published as PCT/US2018/052724, filed on Sep. 25, 2018.
Prior Publication US 2021/0217411 A1, Jul. 15, 2021
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 15/00 (2013.01); G10L 15/20 (2006.01); G10L 15/30 (2013.01); G10L 15/02 (2006.01); G10L 15/22 (2006.01); G10L 21/0208 (2013.01); G10L 15/06 (2013.01)
CPC G10L 15/20 (2013.01) [G10L 15/02 (2013.01); G10L 15/22 (2013.01); G10L 15/30 (2013.01); G10L 15/063 (2013.01); G10L 21/0208 (2013.01)] 19 Claims
OG exemplary drawing
 
1. A method implemented by one or more processors, the method comprising:
invoking an automated assistant client at a client device, wherein invoking the automated assistant client is in response to detecting one or more invocation cues in received user interface input;
in response to invoking the automated assistant client:
processing initial spoken input received via one or more microphones of the client device;
generating a responsive action based on the processing of the initial spoken input;
causing performance of the responsive action;
automatically monitoring for additional spoken input after causing performance of at least part of the responsive action;
receiving audio data during the automatically monitoring;
generating a refined version of the audio data, wherein the refined version isolates any of the audio data that is from the human speaker that provided the initial spoken input, wherein generating the refined version of the audio data comprises:
identifying a speaker embedding for the human speaker that provided the initial spoken input; and
processing the audio data and the speaker embedding using a trained generative model; and
determining, based on the refined version of the audio data, whether the audio data includes additional spoken input that is from the same human speaker that provided the initial spoken input; and
in response to determining that the audio data does include additional spoken input that is from the same human speaker:
performing certain further processing that is based on the additional spoken input.
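The refinement and same-speaker determination recited in claim 1 can be sketched as follows. This is an illustrative toy only, not the patented implementation: the cosine-similarity masking stands in for the trained generative model, and the function names, threshold values, and frame-level features are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / max(nu * nv, 1e-8)

def refine_audio(frames, speaker_embedding, threshold=0.5):
    """Toy stand-in for the trained generative model: zero out frames whose
    features are dissimilar to the target speaker embedding, isolating any
    audio attributable to the speaker that provided the initial input."""
    return [f if cosine(f, speaker_embedding) > threshold else [0.0] * len(f)
            for f in frames]

def contains_target_speech(refined, energy_threshold=1e-3):
    """Attribute additional spoken input to the same speaker when the
    refined audio retains non-trivial energy after masking."""
    total = sum(x * x for f in refined for x in f)
    count = sum(len(f) for f in refined)
    return total / count > energy_threshold

# Usage: one frame resembling the target speaker, one from another source.
emb = [1.0, 0.0]
frames = [[0.9, 0.1],   # aligned with the speaker embedding -> kept
          [0.0, 1.0]]   # orthogonal to it -> zeroed out
refined = refine_audio(frames, emb)
```

In this sketch, a positive determination (`contains_target_speech` returning `True`) would trigger the further processing of the additional spoken input; a real system would use a learned generative model and a trained speaker-verification threshold rather than fixed constants.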