US 12,451,142 B2
Non-wake word invocation of an automated assistant from certain utterances related to display content
Pu-sen Chao, Los Altos, CA (US); Alex Fandrianto, Los Gatos, CA (US); and Muhammad Umair, San Jose, CA (US)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Filed by GOOGLE LLC, Mountain View, CA (US)
Filed on Jul. 28, 2022, as Appl. No. 17/876,156.
Prior Publication US 2024/0038246 A1, Feb. 1, 2024
Int. Cl. G10L 17/22 (2013.01); G06F 3/0481 (2022.01); G06F 3/16 (2006.01); G10L 13/02 (2013.01); G10L 17/06 (2013.01); G10L 15/22 (2006.01)

CPC G10L 17/22 (2013.01) [G06F 3/0481 (2013.01); G06F 3/167 (2013.01); G10L 13/02 (2013.01); G10L 17/06 (2013.01); G10L 2015/223 (2013.01)]

20 Claims

1. A method implemented by one or more processors, the method comprising:

determining, by an automated assistant, that a user has invoked the automated assistant with a first spoken utterance that includes an invocation phrase and that includes a request for the automated assistant to initialize performance of one or more operations,

wherein the user invokes the automated assistant via a computing device that includes a display interface;

determining, by the automated assistant, that non-selectable display elements are being rendered at the display interface of the computing device,

wherein the non-selectable display elements are rendered in response to initialization of one or more of the operations by the automated assistant;

comparing the non-selectable display elements to one or more operations capable of being initialized by the automated assistant, wherein the comparing includes:

generating, from the non-selectable elements rendered on the display interface, one or more embeddings which are mapped to a latent space, wherein the latent space includes other embeddings associated with operations capable of being performed by the automated assistant, and

determining whether one or more of the embeddings generated from the non-selectable elements are within a threshold distance in the latent space to one or more of the other embeddings associated with the operations;

based on the comparing, identifying a set of the non-selectable display elements that relate to one or more operations capable of being initialized by the automated assistant;

based on the set of the non-selectable display elements, updating a dynamic set of one or more spoken inputs,

wherein in response to at least one spoken input of the dynamic set of one or more spoken inputs being detected by the automated assistant while the non-selectable display elements are being rendered at the display interface and without the user providing an additional invocation phrase subsequent to the first spoken utterance, the automated assistant initializes performance of a particular operation related to a particular non-selectable display element of the non-selectable display elements;

determining that the user has provided a second spoken utterance that corresponds to a particular spoken input of the dynamic set of one or more spoken inputs; and

causing, in response to determining that the user has provided the second spoken utterance that corresponds to the particular spoken input, the automated assistant to initialize performance of the particular operation corresponding to the particular display element of the non-selectable display elements.
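
The comparing, updating, and second-utterance steps of claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in the patent: the two-dimensional latent space, the hard-coded embedding tables (standing in for a learned encoder), the Euclidean distance metric, the threshold value, and every function and operation name are hypothetical.

```python
import math

# Hypothetical embeddings of operations the assistant can initialize
# (assumed 2-D latent space; a real system would use a learned encoder).
OPERATION_EMBEDDINGS = {
    "show_weather_forecast": (1.0, 0.0),
    "play_music": (0.0, 1.0),
}

# Hypothetical embeddings for text of non-selectable display elements.
ELEMENT_EMBEDDINGS = {
    "tomorrow": (0.9, 0.1),
    "now playing": (0.1, 0.95),
    "copyright 2022": (5.0, 5.0),
}

THRESHOLD = 0.5  # assumed threshold distance in the latent space


def embed(element_text):
    """Look up a toy embedding for a display element; None if unknown."""
    return ELEMENT_EMBEDDINGS.get(element_text.strip().lower())


def nearest_operation(element_text, threshold=THRESHOLD):
    """Return the closest operation within the threshold distance, else None."""
    vec = embed(element_text)
    if vec is None:
        return None
    best_op, best_dist = None, float("inf")
    for op, op_vec in OPERATION_EMBEDDINGS.items():
        dist = math.dist(vec, op_vec)  # Euclidean distance
        if dist < best_dist:
            best_op, best_dist = op, dist
    return best_op if best_dist <= threshold else None


def update_dynamic_set(display_elements):
    """Build the dynamic set: spoken input -> operation it would initialize."""
    dynamic_set = {}
    for element in display_elements:
        op = nearest_operation(element)
        if op is not None:
            dynamic_set[element.strip().lower()] = op
    return dynamic_set


def handle_utterance(utterance, dynamic_set):
    """Return the operation for an utterance spoken without an invocation
    phrase, or None if the utterance is not in the dynamic set."""
    return dynamic_set.get(utterance.strip().lower())
```

In this sketch, calling `update_dynamic_set(["Tomorrow", "Now Playing", "Copyright 2022"])` keeps the first two elements, because their toy embeddings fall within the assumed threshold of an operation embedding, and drops the third. A later call to `handle_utterance("Tomorrow", dynamic_set)` then resolves to the hypothetical `show_weather_forecast` operation without any invocation phrase being checked.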