US 12,462,797 B2
Rendering responses to a spoken utterance of a user utilizing a local text-response map
Yuli Gao, Sunnyvale, CA (US); and Sangsoo Sung, Palo Alto, CA (US)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Dec. 10, 2020, as Appl. No. 17/118,463.
Application 17/118,463 is a continuation of application No. 16/609,403, granted, now Pat. No. 10,891,958, previously published as PCT/US2018/039850, filed on Jun. 27, 2018.
Prior Publication US 2021/0097999 A1, Apr. 1, 2021
Int. Cl. G06F 3/16 (2006.01); G06F 40/35 (2020.01); G10L 15/22 (2006.01); G10L 15/26 (2006.01); G10L 15/30 (2013.01)

CPC G10L 15/22 (2013.01) [G06F 3/167 (2013.01); G10L 15/26 (2013.01); G10L 15/30 (2013.01); G06F 40/35 (2020.01)]

18 Claims

1. A method implemented by one or more processors of a client device, the method comprising:

capturing, via at least one microphone of the client device, audio data that captures a spoken utterance of a user;

processing the audio data to generate current text that corresponds to the spoken utterance, wherein processing the audio data to generate the current text utilizes a voice-to-text model stored locally on the client device;

accessing a text-response map stored locally on the client device, wherein the text-response map includes a plurality of mappings, each of the mappings defining a corresponding direct relationship between corresponding text and a corresponding response based on the corresponding text being previously generated from previous audio data captured by the client device and based on the corresponding response being previously received from a remote system in response to transmitting, to the remote system, at least one of the previous audio data and the corresponding text;

determining, by the client device, that the corresponding texts of the text-response map fail to match the current text;

in response to determining that the corresponding texts of the text-response map fail to match the current text, transmitting, to a remote system and based on determining that the corresponding texts of the text-response map fail to match the current text, the audio data or the current text;

receiving, from the remote system in response to transmitting the audio data or the current text, a response and an indication that the response is static only until an expiration event occurs, wherein the expiration event comprises determining that the user is no longer present at a particular location;

updating, in response to the indication that is received from the remote system indicating that the response is static, the text-response map by adding a given text mapping and including an indication of the expiration event with the given text mapping, the given text mapping defining a direct relationship between the current text and the response;

capturing, subsequent to updating the text-response map, second audio data;

processing the second audio data to generate a second text utilizing the voice-to-text model stored locally on the client device;

determining, based on the text-response map, that the current text matches the second text;

in response to determining that the current text matches the second text, and based on the text-response map including the given text mapping that defines the direct relationship between the current text and the response:

causing the response, from the text-response map, to be implemented; and

removing the given text mapping from the text-response map when the expiration event occurs.
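
The flow recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the names `TextResponseMap`, `Mapping`, `handle_text`, and `on_location_change`, and the shape of the remote reply (response, static flag, location), are assumptions made for illustration only. The local voice-to-text step is omitted, so the sketch starts from text that has already been recognized on the device.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Mapping:
    """One direct text -> response relationship (hypothetical structure)."""
    response: str
    # Location tied to the expiration event; None means no expiration recorded.
    valid_location: Optional[str] = None


class TextResponseMap:
    """Locally stored map from recognized text to a previously received response."""

    def __init__(self) -> None:
        self._mappings: Dict[str, Mapping] = {}

    def lookup(self, text: str) -> Optional[str]:
        mapping = self._mappings.get(text)
        return mapping.response if mapping is not None else None

    def add(self, text: str, response: str,
            valid_location: Optional[str] = None) -> None:
        self._mappings[text] = Mapping(response, valid_location)

    def on_location_change(self, current_location: str) -> None:
        """Expiration event: drop mappings whose location the user has left."""
        stale = [
            text for text, mapping in self._mappings.items()
            if mapping.valid_location is not None
            and mapping.valid_location != current_location
        ]
        for text in stale:
            del self._mappings[text]


# Assumed remote reply: (response, is_static, location bound to the expiration).
RemoteReply = Tuple[str, bool, Optional[str]]


def handle_text(text: str, cache: TextResponseMap,
                remote: Callable[[str], RemoteReply]) -> Tuple[str, str]:
    """Return (response, source), where source is "local" or "remote"."""
    cached = cache.lookup(text)
    if cached is not None:
        # A stored mapping matches: serve the response without a remote call.
        return cached, "local"
    # No stored text matches: fall back to the remote system.
    response, is_static, location = remote(text)
    if is_static:
        # Store only responses the remote system marked as static.
        cache.add(text, response, location)
    return response, "remote"
```

In this sketch the first call for a given text goes to the remote system, a repeated call is answered from the local map, and calling `on_location_change` with a different location removes the mapping so that the next call goes remote again.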