CPC G10L 17/08 (2013.01) [G06F 3/167 (2013.01); G06F 21/32 (2013.01); G10L 15/22 (2013.01); G10L 17/02 (2013.01); G10L 17/04 (2013.01); G10L 17/10 (2013.01); G10L 17/14 (2013.01); G10L 17/18 (2013.01); G10L 17/22 (2013.01); G10L 17/24 (2013.01); G10L 2015/088 (2013.01); G10L 2015/227 (2013.01)] — 18 Claims

1. A method implemented by one or more processors, the method comprising:
performing automatic speech recognition, based on processing audio data that captures a spoken utterance of a user and that is detected via one or more microphones of an assistant device, to generate a recognition of the spoken utterance;
determining, based on processing the recognition of the spoken utterance, at least one assistant action to perform responsive to the spoken utterance;
determining whether performance of the assistant action requires authentication;
in response to determining that performance of the assistant action requires authentication:
determining, based on the recognition, that a portion of the audio data includes a set of one or more terms for a text dependent speaker verification (TD-SV) for the user, the set of one or more terms being in addition to any general invocation wake words, the portion of the audio data being one of multiple portions of the audio data that captures the spoken utterance, and the portion being non-overlapping with at least an additional portion of the multiple portions;
processing the portion of the audio data to generate utterance features that correspond to the portion of the audio data, wherein processing the portion of the audio data is responsive to determining the portion includes the set of one or more terms for the TD-SV for the user;
performing a comparison of the utterance features to stored speaker features for the TD-SV for the user to generate a metric for the TD-SV for the user;
determining, based on the recognition, that the additional portion of the audio data includes an additional set of one or more terms for an additional TD-SV for the user, the additional set of one or more terms being in addition to any general invocation wake words;
processing the additional portion of the audio data to generate additional utterance features that correspond to the additional portion of the audio data;
performing an additional comparison of the additional utterance features to additional stored speaker features for the additional TD-SV for the user to generate an additional metric for the additional TD-SV for the user; and
determining, based on the metric and the additional metric, whether to cause performance of the assistant action in response to the spoken utterance.
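The verification flow recited in claim 1 — extracting features from each TD-SV portion of the audio, comparing each against the stored speaker features to produce a metric, and then gating the assistant action on both metrics — can be illustrated with a minimal sketch. This is not the patented implementation: the cosine-similarity comparison, the averaging of the two metrics, the threshold value, and all function names (`td_sv_metric`, `authorize`) are illustrative assumptions; real TD-SV systems typically compare learned speaker embeddings produced by a neural network.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two feature vectors (assumed comparison;
    # the claim only recites "a comparison" yielding "a metric").
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def td_sv_metric(utterance_features, stored_speaker_features):
    # One TD-SV comparison: features extracted from one portion of the
    # audio data versus the user's stored speaker features for that
    # portion's set of terms.
    return cosine_similarity(utterance_features, stored_speaker_features)

def authorize(metric, additional_metric, threshold=0.7):
    # Determine, based on the metric and the additional metric, whether
    # to cause performance of the assistant action. Averaging and a
    # fixed threshold are hypothetical choices for illustration.
    combined = (metric + additional_metric) / 2.0
    return combined >= threshold
```

For example, with toy feature vectors, `authorize(td_sv_metric([1.0, 0.0], [1.0, 0.1]), td_sv_metric([0.0, 1.0], [0.1, 1.0]))` returns `True`, while two dissimilar portions (e.g. metrics near zero) fail authentication and the action is not performed.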