US 12,437,750 B2
Speaker awareness using speaker dependent speech model(s)
Ignacio Lopez Moreno, New York, NY (US); Quan Wang, Hoboken, NJ (US); Jason Pelecanos, New York, NY (US); Li Wan, New York, NY (US); Alexander Gruenstein, Mountain View, CA (US); and Hakan Erdogan, Belmont, MA (US)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Filed by GOOGLE LLC, Mountain View, CA (US)
Filed on Dec. 22, 2023, as Appl. No. 18/394,632.
Application 18/394,632 is a continuation of application No. 17/587,424, filed on Jan. 28, 2022, granted, now 11,854,533.
Application 17/587,424 is a continuation of application No. 17/251,163, granted, now 11,238,847, issued on Feb. 1, 2022, previously published as PCT/US2019/064501, filed on Dec. 4, 2019.
Prior Publication US 2024/0203400 A1, Jun. 20, 2024
Int. Cl. G10L 15/16 (2006.01); G10L 15/06 (2013.01); G10L 15/07 (2013.01); G10L 15/20 (2006.01); G10L 17/04 (2013.01); G10L 17/20 (2013.01); G10L 21/0208 (2013.01); G10L 15/08 (2006.01)
CPC G10L 15/063 (2013.01) [G10L 15/07 (2013.01); G10L 15/20 (2013.01); G10L 17/04 (2013.01); G10L 17/20 (2013.01); G10L 21/0208 (2013.01); G10L 2015/088 (2013.01)]
18 Claims
 
1. A method implemented by one or more processors, the method comprising:
receiving an instance of audio data that captures one or more spoken utterances of a user of a client device, wherein the instance of audio data is captured using one or more microphones of the client device;
determining a speaker embedding corresponding to a target user of the client device, the speaker embedding being generated based on prior audio data from the target user, wherein the target user is a primary user of the client device;
processing the instance of audio data along with the speaker embedding using a speaker dependent (SD) voice activity detection (VAD) model to generate output indicating whether the audio data comprises voice activity that is of the target user of the client device,
wherein the SD VAD model is personalizable to the target user of the client device based on processing of the speaker embedding generated based on the prior audio data from the target user,
wherein the SD VAD model is also personalizable to at least one additional user who is not the target user based on an additional prior speaker embedding corresponding to the at least one additional user who is not the target user,
wherein the SD VAD model is trained based on a trained speaker independent (SI) VAD model prior to receiving the instance of audio data, and
wherein training the SD VAD model based on the trained SI VAD model prior to receiving the instance of audio data comprises:
identifying a training instance of training audio data that captures one or more training spoken utterances of a target training user;
determining a training speaker embedding for the target training user, the training speaker embedding being generated based on audio data from the target training user;
generating a noisy instance of training audio data by combining the training instance of training audio data with one or more additional sounds that are not from the target training user;
processing the training instance of training audio data using the SI VAD model to generate training SI VAD output;
processing the noisy instance of the training audio data, along with the training speaker embedding for the target training user, using the SD speech model to generate training SD VAD output;
generating a training loss based on the training SI output and the training SD output; and
updating one or more portions of the SD model based on the generated training loss;
performing one or more actions based on the output, wherein performing the one or more actions based on the output comprises:
determining, based on the output, whether the audio data comprises voice activity that is of the target user; and
in response to determining the audio data does not comprise voice activity that is of the target user, rendering output indicating the audio data does not comprise voice activity that is of the target user via one or more user interface output devices of the client device.
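
The training steps recited in claim 1 resemble a teacher-student arrangement: the trained SI VAD model processes the clean training instance, the SD VAD model processes the noisy instance together with the training speaker embedding, and a loss computed from the two outputs updates the SD model. The following is a minimal illustrative sketch under stated assumptions, not the patented implementation. The claim does not specify model architectures, feature formats, or the form of the loss, so every function name, the logistic "student" model, the energy-based "teacher", the cross-entropy loss, and the 0.5 threshold are hypothetical choices made only for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def si_vad(frames):
    """Hypothetical stand-in for the trained (frozen) SI VAD model.

    Maps each frame amplitude to a speech probability.
    """
    return [sigmoid(10.0 * (abs(x) - 0.3)) for x in frames]


def sd_vad(params, frames, embedding):
    """Hypothetical SD VAD model: logistic model over frame and embedding."""
    emb_term = sum(w * e for w, e in zip(params["w_emb"], embedding))
    return [sigmoid(params["w_frame"] * abs(x) + emb_term + params["b"])
            for x in frames]


def training_loss(si_out, sd_out, eps=1e-7):
    """Assumed loss: mean cross-entropy of SD output against SI output."""
    total = 0.0
    for t, s in zip(si_out, sd_out):
        s = min(max(s, eps), 1.0 - eps)
        total -= t * math.log(s) + (1.0 - t) * math.log(1.0 - s)
    return total / len(si_out)


def train_step(params, clean, extra_sounds, embedding, lr=0.5):
    """One update of the SD model, following the order of the claimed steps."""
    # Noisy instance: clean target-user audio plus sounds from other sources.
    noisy = [c + n for c, n in zip(clean, extra_sounds)]
    si_out = si_vad(clean)                      # SI model sees clean audio
    sd_out = sd_vad(params, noisy, embedding)   # SD model sees noisy audio
    loss = training_loss(si_out, sd_out)
    # Gradient of the cross-entropy with respect to the logistic parameters.
    errors = [s - t for s, t in zip(sd_out, si_out)]
    n = len(errors)
    mean_err = sum(errors) / n
    params["w_frame"] -= lr * sum(e * abs(x) for e, x in zip(errors, noisy)) / n
    params["w_emb"] = [w - lr * mean_err * v
                       for w, v in zip(params["w_emb"], embedding)]
    params["b"] -= lr * mean_err
    return loss


def act_on_output(sd_out, threshold=0.5):
    """Return a message to render if no frame is attributed to the target user."""
    if any(p >= threshold for p in sd_out):
        return None
    return "No voice activity from the target user detected."


# Toy data (fabricated for illustration only): 20 speech and 20 silent frames.
random.seed(0)
clean = [random.uniform(0.5, 1.0) for _ in range(20)] + [0.0] * 20
extra_sounds = [random.uniform(-0.2, 0.2) for _ in clean]
embedding = [0.6, 0.8]  # hypothetical speaker embedding
params = {"w_frame": 0.0, "w_emb": [0.0, 0.0], "b": 0.0}

losses = [train_step(params, clean, extra_sounds, embedding)
          for _ in range(200)]
```

Because the SI model only ever sees audio from the target training user, its output can serve as a label for "target user is speaking", while the SD model must reproduce that output from audio that also contains other sounds. In this toy setup the loss falls over the 200 steps; a real system would use learned neural models and a large, varied training set.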