US 12,260,857 B2
Selective adaptation and utilization of noise reduction technique in invocation phrase detection
Christopher Hughes, Redwood City, CA (US); Yiteng Huang, Basking Ridge, NJ (US); Turaj Zakizadeh Shabestary, San Francisco, CA (US); and Taylor Applebaum, Mountain View, CA (US)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Filed by GOOGLE LLC, Mountain View, CA (US)
Filed on May 13, 2024, as Appl. No. 18/662,334.
Application 18/662,334 is a continuation of application No. 17/886,726, filed on Aug. 12, 2022, granted, now 11,984,117.
Application 17/886,726 is a continuation of application No. 16/886,139, filed on May 28, 2020, granted, now 11,417,324, issued on Aug. 16, 2022.
Application 16/886,139 is a continuation of application No. 16/609,619, granted, now 10,706,842, issued on Jul. 7, 2020, previously published as PCT/US2019/013479, filed on Jan. 14, 2019.
Claims priority of provisional application 62/620,885, filed on Jan. 23, 2018.
Prior Publication US 2024/0304187 A1, Sep. 12, 2024
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 15/22 (2006.01); G10L 15/02 (2006.01); G10L 15/08 (2006.01); G10L 15/20 (2006.01); G10L 21/0232 (2013.01); G10L 25/84 (2013.01); G10L 21/0216 (2013.01)
CPC G10L 15/20 (2013.01) [G10L 15/02 (2013.01); G10L 15/08 (2013.01); G10L 15/22 (2013.01); G10L 21/0232 (2013.01); G10L 25/84 (2013.01); G10L 2015/025 (2013.01); G10L 2015/088 (2013.01); G10L 2015/223 (2013.01); G10L 2021/02166 (2013.01)] 15 Claims
OG exemplary drawing
 
1. A client device comprising:
one or more microphones;
memory storing instructions; and
one or more processors operable to execute the instructions to:
receive a stream of audio data frames that are based on output from the one or more microphones;
process each of the audio data frames of the stream using a trained machine learning model to generate respective output indicating one or more corresponding probabilities of the presence of one or more corresponding invocation phonemes;
store the audio data frames of the stream in a buffer, along with output indications for the audio data frames, each of the output indications being for a respective one of the audio data frames and being based on the corresponding output generated based on processing of the respective one of the audio data frames using the trained machine learning model;
determine, at a first instance, that the output indications in the buffer at the first instance indicate that the audio data frames in the buffer at the first instance all fail to include any of the one or more corresponding invocation phonemes;
in response to the determination at the first instance:
use at least one of the audio data frames in the buffer at the first instance to adapt a noise reduction filter;
determine, at a second instance after the first instance, that the output indications in the buffer at the second instance indicate that at least one of the audio data frames in the buffer at the second instance potentially includes at least one of the one or more corresponding invocation phonemes;
in response to the determination at the second instance:
generate filtered data frames based on processing of a plurality of the audio data frames in the buffer at the second instance using the noise reduction filter as adapted at least in part in response to the determination at the first instance; and
determine whether the filtered data frames indicate presence of an invocation phrase based on processing the filtered data frames using the trained machine learning model or an additional trained machine learning model; and
in response to determining that the filtered data frames indicate presence of the invocation phrase:
cause at least one function of an automated assistant to be activated.
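The claimed pipeline can be illustrated with a minimal, non-authoritative sketch. All names here (`InvocationDetector`, `push_frame`, `THRESHOLD`) are hypothetical, and an exponential moving average stands in for whatever adaptive noise-reduction filter an implementation would actually use; the model is abstracted as any callable mapping a frame to a phoneme probability.

```python
# Hypothetical sketch of the claimed two-phase detection pipeline.
# Frames scored by a trained model are buffered with their output
# indications; the noise-reduction filter is adapted only from buffers
# judged noise-only (the "first instance"), and the adapted filter is
# applied and the buffer re-scored only when at least one frame
# potentially contains an invocation phoneme (the "second instance").
from collections import deque

THRESHOLD = 0.5  # probability above which a frame "potentially includes" a phoneme


class InvocationDetector:
    def __init__(self, model, buffer_size=8):
        self.model = model                       # frame -> phoneme probability
        self.buffer = deque(maxlen=buffer_size)  # (frame, probability) pairs
        self.noise_estimate = None               # running per-sample noise mean

    def _adapt_filter(self, frame):
        # Stand-in for adapting a real noise-reduction filter: an
        # exponential moving average over noise-only frames.
        if self.noise_estimate is None:
            self.noise_estimate = list(frame)
        else:
            self.noise_estimate = [0.9 * n + 0.1 * x
                                   for n, x in zip(self.noise_estimate, frame)]

    def _apply_filter(self, frame):
        # Stand-in for noise reduction: subtract the adapted noise estimate.
        if self.noise_estimate is None:
            return list(frame)
        return [x - n for x, n in zip(frame, self.noise_estimate)]

    def push_frame(self, frame):
        """Score, buffer, then either adapt the filter (noise-only buffer)
        or filter and re-score the buffered frames. Returns True if the
        filtered frames indicate presence of the invocation phrase."""
        prob = self.model(frame)
        self.buffer.append((frame, prob))

        if all(p < THRESHOLD for _, p in self.buffer):
            # First instance: buffer appears noise-only -> adapt the filter.
            self._adapt_filter(frame)
            return False

        # Second instance: filter the buffered frames with the filter as
        # adapted so far, then re-score the filtered frames with the model.
        filtered = [self._apply_filter(f) for f, _ in self.buffer]
        return any(self.model(f) >= THRESHOLD for f in filtered)
```

As a usage illustration, a toy "model" that scores a frame by its mean amplitude produces low probabilities on quiet (noise) frames, which adapt the filter, and a high probability on a loud frame, which triggers filtering and re-scoring of the whole buffer.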