US 11,854,533 B2
Speaker awareness using speaker dependent speech model(s)
Ignacio Lopez Moreno, New York, NY (US); Quan Wang, Hoboken, NJ (US); Jason Pelecanos, New York, NY (US); Li Wan, New York, NY (US); Alexander Gruenstein, Mountain View, CA (US); and Hakan Erdogan, Belmont, MA (US)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Filed by GOOGLE LLC, Mountain View, CA (US)
Filed on Jan. 28, 2022, as Appl. No. 17/587,424.
Application 17/587,424 is a continuation of application No. 17/251,163, granted, now 11,238,847, previously published as PCT/US2019/064501, filed on Dec. 4, 2019.
Prior Publication US 2022/0157298 A1, May 19, 2022
Int. Cl. G10L 15/16 (2006.01); G10L 15/06 (2013.01); G10L 15/07 (2013.01); G10L 15/20 (2006.01); G10L 17/04 (2013.01); G10L 17/20 (2013.01); G10L 21/0208 (2013.01); G10L 15/08 (2006.01)
CPC G10L 15/063 (2013.01) [G10L 15/07 (2013.01); G10L 15/20 (2013.01); G10L 17/04 (2013.01); G10L 17/20 (2013.01); G10L 21/0208 (2013.01); G10L 2015/088 (2013.01)]
12 Claims
1. A method implemented by one or more processors, the method comprising:
receiving an instance of audio data that captures one or more spoken utterances of a user of a client device, wherein the instance of audio data is captured using one or more microphones of the client device;
determining a speaker embedding corresponding to a target user of the client device, the speaker embedding being generated based on audio data from the target user;
processing the instance of audio data along with the speaker embedding using a speaker dependent voice activity detection model (SD VAD model) to generate output indicating whether the audio data comprises voice activity that is of the target user of the client device, wherein the SD VAD model is personalizable to any user of the client device;
performing one or more actions based on the output, wherein performing the one or more actions based on the output comprises:
determining, based on the output, whether the audio data comprises voice activity that is of the target user;
in response to determining the audio data does not comprise voice activity that is of the target user:
determining an additional speaker embedding corresponding to an additional target user of the client device, the additional speaker embedding being generated based on additional audio data from the additional target user; and
processing the instance of audio data, along with the additional speaker embedding and using the SD VAD model, to generate additional output indicating whether the audio data comprises voice activity that is of the additional target user.
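The control flow recited in claim 1 can be illustrated with a minimal sketch. This is not the patented implementation: `toy_sd_vad` is a hypothetical stand-in that scores audio against a speaker embedding by cosine similarity, whereas the claim leaves the SD VAD model's internals unspecified. The feature vectors, the `threshold` value, and the names `detect_target_speaker` and `enrolled` are likewise assumptions made for illustration. The sketch shows only the claimed structure: one shared model is conditioned on a target user's embedding, and if the output indicates no voice activity from that user, the same audio is reprocessed with an additional user's embedding.

```python
import math
from typing import Callable, Dict, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_sd_vad(audio_features: Vector, speaker_embedding: Vector) -> float:
    """Hypothetical stand-in for the SD VAD model (an assumption, not the
    patent's model): a score in [0, 1] suggesting whether the audio contains
    voice activity of the speaker described by the embedding."""
    return max(0.0, cosine(audio_features, speaker_embedding))


def detect_target_speaker(
    audio_features: Vector,
    enrolled: Dict[str, Vector],
    model: Callable[[Vector, Vector], float] = toy_sd_vad,
    threshold: float = 0.8,  # illustrative value, not from the patent
) -> Optional[str]:
    """Process the same audio with each enrolled user's embedding in turn,
    always using the same model; return the first user whose output indicates
    voice activity, or None if no enrolled user matches."""
    for user, embedding in enrolled.items():
        if model(audio_features, embedding) >= threshold:
            return user
    return None


# Example: two users enrolled on one client device (toy 2-D embeddings).
enrolled = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
result = detect_target_speaker([0.1, 0.9], enrolled)
```

In this example the first embedding ("alice") does not match, so the same audio is reprocessed with the additional embedding ("bob"), mirroring the fallback branch of the claim. The loop generalizes the claim's two-user case to any number of enrolled users, which is a design choice of this sketch rather than something the claim requires.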