| CPC G10L 15/20 (2013.01) [G10L 15/02 (2013.01); G10L 15/08 (2013.01); G10L 15/22 (2013.01); G10L 21/0232 (2013.01); G10L 25/84 (2013.01); G10L 2015/025 (2013.01); G10L 2015/088 (2013.01); G10L 2015/223 (2013.01); G10L 2021/02166 (2013.01)] | 15 Claims |

|
1. A client device comprising:
one or more microphones;
memory storing instructions; and
one or more processors operable to execute the instructions to:
receive a stream of audio data frames that are based on output from the one or more microphones;
process each of the audio data frames of the stream using a trained machine learning model to generate respective output indicating one or more corresponding probabilities of the presence of one or more corresponding invocation phonemes;
store the audio data frames of the stream in a buffer, along with output indications for the audio data frames, each of the output indications being for a respective one of the audio data frames and being based on the corresponding output generated based on processing of the respective one of the audio data frames using the trained machine learning model;
determine, at a first instance, that the output indications in the buffer at the first instance indicate that the audio data frames in the buffer at the first instance all fail to include any of the one or more corresponding invocation phonemes;
in response to the determination at the first instance:
use at least one of the audio data frames in the buffer at the first instance to adapt a noise reduction filter;
determine, at a second instance after the first instance, that the output indications in the buffer at the second instance indicate that at least one of the audio data frames in the buffer at the second instance potentially includes at least one of the one or more corresponding invocation phonemes;
in response to the determination at the second instance:
generate filtered data frames based on processing of a plurality of the audio data frames in the buffer at the second instance using the noise reduction filter as adapted at least in part in response to the determination at the first instance; and
determine whether the filtered data frames indicate presence of the invocation phrase based on processing the filtered data frames using the trained machine learning model, or an additional trained machine learning model; and
in response to determining that the filtered data frames indicate presence of the invocation phrase:
cause at least one function of the automated assistant to be activated.
|
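The control flow recited in claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation. The names `NoiseFilter`, `InvocationDetector`, and `toy_model`, and the thresholds `noise_below` and `phrase_at`, are hypothetical and introduced only for illustration. The per-sample running noise estimate is a toy stand-in for whatever adaptive noise reduction technique an embodiment would use. The sketch also assumes that the same model scores both raw and filtered frames, which is one of the two options the claim permits, and that adaptation uses the oldest buffered frame.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

Frame = List[float]
Model = Callable[[Frame], float]  # assumed: returns a phoneme probability in [0, 1]


class NoiseFilter:
    """Toy stand-in: keeps a running per-sample noise estimate and subtracts it."""

    def __init__(self, frame_len: int, rate: float = 0.5) -> None:
        self.noise = [0.0] * frame_len
        self.rate = rate

    def adapt(self, frame: Frame) -> None:
        # Exponential moving average toward the frame presumed to be noise.
        self.noise = [(1 - self.rate) * n + self.rate * x
                      for n, x in zip(self.noise, frame)]

    def apply(self, frame: Frame) -> Frame:
        return [x - n for x, n in zip(frame, self.noise)]


class InvocationDetector:
    def __init__(self, model: Model, frame_len: int, buffer_size: int = 4,
                 noise_below: float = 0.3, phrase_at: float = 0.8) -> None:
        self.model = model
        # Each buffer entry pairs a frame with its output indication.
        self.buffer: Deque[Tuple[Frame, float]] = deque(maxlen=buffer_size)
        self.filter = NoiseFilter(frame_len)
        self.noise_below = noise_below  # hypothetical threshold
        self.phrase_at = phrase_at      # hypothetical threshold

    def push(self, frame: Frame) -> bool:
        """Process one frame; return True if the assistant would be activated."""
        self.buffer.append((frame, self.model(frame)))
        if all(p < self.noise_below for _, p in self.buffer):
            # "First instance": no buffered frame appears to hold an invocation
            # phoneme, so adapt the filter on the oldest buffered frame.
            self.filter.adapt(self.buffer[0][0])
            return False
        # "Second instance": at least one frame potentially holds a phoneme.
        # Filter the buffered frames with the filter as adapted so far.
        filtered = [self.filter.apply(f) for f, _ in self.buffer]
        # Re-score the filtered frames to decide whether to activate.
        return any(self.model(f) >= self.phrase_at for f in filtered)


def toy_model(frame: Frame) -> float:
    """Hypothetical model: peak amplitude, clipped to 1.0, treated as probability."""
    return min(1.0, max(abs(x) for x in frame))


if __name__ == "__main__":
    detector = InvocationDetector(toy_model, frame_len=2)
    for _ in range(3):
        detector.push([0.2, 0.2])          # noise-only frames adapt the filter
    print(detector.push([1.2, 0.2]))       # loud frame survives filtering
```

In this sketch a frame that merely looks like a candidate before filtering does not trigger activation unless it still scores at or above `phrase_at` after the adapted noise estimate has been subtracted.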