US 12,266,358 B2
Dynamically determining whether to perform candidate automated assistant action determined from spoken utterance
Konrad Miller, Zurich (CH); Ágoston Weisz, Zurich (CH); and Herbert Jordan, Zurich (CH)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Filed by GOOGLE LLC, Mountain View, CA (US)
Filed on Sep. 1, 2022, as Appl. No. 17/901,513.
Claims priority of provisional application 63/396,108, filed on Aug. 8, 2022.
Prior Publication US 2024/0046925 A1, Feb. 8, 2024
Int. Cl. G10L 15/22 (2006.01); G06F 40/30 (2020.01); G10L 15/26 (2006.01)
CPC G10L 15/22 (2013.01) [G06F 40/30 (2020.01); G10L 15/26 (2013.01); G10L 2015/223 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A method implemented by one or more processors, the method comprising:
performing, independent of any explicit invocation of an automated assistant, automatic speech recognition (ASR) on audio data, to generate ASR text that predicts a spoken utterance of a user, wherein the audio data is detected via one or more microphones of a client device in an environment and captures the spoken utterance of the user;
generating, based on processing the ASR text:
a candidate automated assistant action that corresponds to the ASR text, and
a confidence measure for the candidate automated assistant action;
generating one or more environment features that each reflect a corresponding current value for a corresponding dynamic state of the environment,
wherein generating the one or more environment features is based on processing data from the client device and/or from one or more additional client devices in the environment, and
wherein the one or more environment features comprise one or more of:
a temporal feature indicative of one or more current temporal conditions,
a spoken utterance origin feature indicative of an origination location and/or origination direction of the spoken utterance,
a quantity of people feature that is indicative of a quantity of people in the environment,
a user activity feature that is indicative of one or more activities in which the user is currently engaged, or
an environment location feature that is indicative of one or more semantic classifications of the environment;
determining whether to cause automatic performance of the candidate automated assistant action responsive to the spoken utterance, wherein determining whether to cause automatic performance of the candidate automated assistant action is based on processing both:
the confidence measure for the candidate automated assistant action, and
the one or more environment features; and
in response to determining to cause automatic performance of the candidate automated assistant action:
causing automatic performance of the candidate automated assistant action responsive to the spoken utterance;
in response to not determining to cause automatic performance of the candidate automated assistant action:
suppressing any automatic performance of the candidate automated assistant action responsive to the spoken utterance.
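
The claim recites a self-contained control flow: ASR without explicit invocation, derivation of a candidate action and confidence measure, collection of environment features, and a joint decision that either performs or suppresses the action. Below is a minimal, hypothetical Python sketch of that flow. The patent discloses no implementation here, so every name (EnvironmentFeatures, interpret, should_auto_perform), every stub, and every threshold value is an assumption used only to illustrate how a confidence measure and environment features might jointly gate automatic performance.

from dataclasses import dataclass
from typing import Tuple


@dataclass
class EnvironmentFeatures:
    # Each field reflects a current value for a dynamic state of the environment.
    hour_of_day: int        # temporal feature
    origin_direction: str   # spoken utterance origin feature, e.g. "toward_device"
    people_count: int       # quantity of people feature
    user_activity: str      # user activity feature, e.g. "in_conversation"
    location_class: str     # environment location feature, e.g. "home", "office"


@dataclass
class CandidateAction:
    name: str

    def perform(self) -> None:
        print(f"Performing assistant action: {self.name}")


def interpret(asr_text: str) -> Tuple[CandidateAction, float]:
    # Hypothetical stand-in for the NLU stage that maps ASR text to a
    # candidate action and a confidence measure; a real system would use
    # a semantic parser or learned model.
    if "lights" in asr_text.lower():
        return CandidateAction("turn_on_lights"), 0.9
    return CandidateAction("unknown"), 0.3


def should_auto_perform(confidence: float, env: EnvironmentFeatures,
                        base_threshold: float = 0.8) -> bool:
    # Illustrative policy (values are assumptions): adjust the required
    # confidence using the environment features, then compare against the
    # candidate action's confidence measure.
    threshold = base_threshold
    if env.people_count > 1:
        threshold += 0.15   # utterance may be directed at another person
    if env.user_activity == "in_conversation":
        threshold += 0.05
    if env.origin_direction == "toward_device":
        threshold -= 0.05   # speech aimed at the device is more likely a command
    if env.location_class == "home" and 6 <= env.hour_of_day <= 22:
        threshold -= 0.05   # more permissive in a private setting during waking hours
    return confidence >= threshold


def handle_utterance(asr_text: str, env: EnvironmentFeatures) -> None:
    action, confidence = interpret(asr_text)
    if should_auto_perform(confidence, env):
        action.perform()    # automatic performance of the candidate action
    else:
        print("Suppressed automatic performance.")  # no action taken


if __name__ == "__main__":
    env = EnvironmentFeatures(hour_of_day=20, origin_direction="toward_device",
                              people_count=1, user_activity="idle",
                              location_class="home")
    handle_utterance("turn on the lights", env)   # threshold 0.70 -> performed

    env.people_count = 4
    env.user_activity = "in_conversation"
    env.origin_direction = "away_from_device"
    handle_utterance("turn on the lights", env)   # threshold 0.95 -> suppressed

The sketch folds the environment features into a dynamic confidence threshold, but the claim only requires that both inputs be processed together when deciding; a learned classifier taking the confidence measure and environment features as inputs would fit the same recited structure.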