US 11,853,649 B2
Voice-controlled entry of content into graphical user interfaces
Srikanth Pandiri, Zurich (CH); Luv Kothari, Sunnyvale, CA (US); Behshad Behzadi, Freienbach (CH); Zaheed Sabur, Baar (CH); Domenico Carbotta, Zurich (CH); Akshay Kannan, Fremont, CA (US); Qi Wang, Palo Alto, CA (US); Gokay Baris Gultekin, Palo Alto, CA (US); Angana Ghosh, Mountain View, CA (US); Xu Liu, San Jose, CA (US); Yang Lu, Los Altos, CA (US); and Steve Cheng, Los Altos, CA (US)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Appl. No. 17/619,414
Filed by GOOGLE LLC, Mountain View, CA (US)
PCT Filed Dec. 13, 2019, PCT No. PCT/US2019/066211 § 371(c)(1), (2) Date Dec. 15, 2021, PCT Pub. No. WO2021/076166, PCT Pub. Date Apr. 22, 2021.
Claims priority of provisional application 62/915,607, filed on Oct. 15, 2019.
Prior Publication US 2022/0253277 A1, Aug. 11, 2022
Int. Cl. G06F 40/117 (2020.01); G06F 3/16 (2006.01); G06F 40/143 (2020.01); G06F 40/30 (2020.01); G06F 40/174 (2020.01); G06F 3/0481 (2022.01); G06F 3/0484 (2022.01); G06F 3/04886 (2022.01); G10L 15/22 (2006.01); G10L 15/26 (2006.01)

CPC G06F 3/167 (2013.01) [G06F 3/0481 (2013.01); G06F 3/0484 (2013.01); G06F 3/04886 (2013.01); G06F 40/117 (2020.01); G06F 40/143 (2020.01); G06F 40/174 (2020.01); G06F 40/30 (2020.01); G10L 15/22 (2013.01); G10L 15/26 (2013.01)]

17 Claims

1. A method implemented by one or more processors, the method comprising:

determining that a selection of an entry field of a graphical user interface of an application that is being rendered at a computing device was provided,

wherein the computing device provides access to an automated assistant that is separate from the application and utilizes one or more speech-to-text models stored at the computing device;

receiving, subsequent to determining that the entry field was selected, a spoken utterance from a user;

generating, based on the spoken utterance, a candidate text string that characterizes at least a portion of the spoken utterance provided by the user,

wherein the candidate text string is generated using the one or more speech-to-text models stored at the computing device;

determining, by the automated assistant and based on the candidate text string, whether to incorporate the candidate text string into the entry field or whether to incorporate non-textual visual content into the entry field and in lieu of the candidate text string,

wherein determining, by the automated assistant and based on the candidate text string, whether to incorporate the candidate text string into the entry field or whether to incorporate the non-textual visual content into the entry field comprises:

determining, based on processing the candidate text string, the non-textual visual content;

identifying, based on the non-textual visual content, one or more non-textual visual content properties of the non-textual visual content; and

determining, based on comparing the one or more non-textual visual content properties of the non-textual visual content to one or more entry field properties of the entry field, whether to incorporate the candidate text string into the entry field or whether to incorporate the non-textual visual content into the entry field;

when a determination is made to incorporate the non-textual visual content into the entry field:

causing the non-textual visual content to be provided as input to the entry field of the graphical user interface, wherein the non-textual visual content is determined via performance of one or more automated assistant actions that are based on the candidate text string; and

when a different determination is made to incorporate the candidate text string into the entry field:

causing the candidate text string to be provided as input to the entry field of the graphical user interface.
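
The decision recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `VisualContent`, `EntryField`, `choose_input`, and `demo_resolver` are hypothetical, and the "properties" being compared are reduced to a single assumed content-kind check; a real system could compare other properties of the content and the field. The `resolve_visual` callable stands in for the automated assistant actions that derive non-textual visual content from the candidate text string.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional, Union


@dataclass(frozen=True)
class VisualContent:
    """Hypothetical non-textual visual content (e.g. an emoji or image)."""
    kind: str      # assumed content property, e.g. "emoji" or "image"
    payload: str   # stand-in for the actual content or a reference to it


@dataclass(frozen=True)
class EntryField:
    """Hypothetical description of the selected entry field."""
    accepted_kinds: FrozenSet[str]  # assumed entry field property


def choose_input(
    candidate_text: str,
    field: EntryField,
    resolve_visual: Callable[[str], Optional[VisualContent]],
) -> Union[str, VisualContent]:
    """Return what should be provided as input to the entry field."""
    # Determine non-textual visual content from the candidate text string.
    visual = resolve_visual(candidate_text)
    # Compare the content's properties against the entry field's properties.
    if visual is not None and visual.kind in field.accepted_kinds:
        return visual          # incorporate the visual content instead of the text
    return candidate_text      # otherwise incorporate the candidate text string


def demo_resolver(text: str) -> Optional[VisualContent]:
    """Toy resolver for illustration only; maps one phrase to an emoji."""
    known = {"thumbs up emoji": VisualContent("emoji", "\U0001F44D")}
    return known.get(text.strip().lower())
```

In this sketch, a field that accepts emoji would receive the emoji for the utterance "thumbs up emoji", while a text-only field would receive the literal text string. An utterance with no associated visual content is always passed through as text.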