| CPC G10L 15/22 (2013.01) [G06F 3/167 (2013.01); G10L 15/26 (2013.01); G10L 15/30 (2013.01); G06F 40/35 (2020.01)] | 18 Claims |

|
1. A method implemented by one or more processors of a client device, the method comprising:
capturing, via at least one microphone of the client device, audio data that captures a spoken utterance of a user;
processing the audio data to generate current text that corresponds to the spoken utterance, wherein processing the audio data to generate the current text utilizes a voice-to-text model stored locally on the client device;
accessing a text-response map stored locally on the client device, wherein the text-response map includes a plurality of mappings, each of the mappings defining a corresponding direct relationship between corresponding text and a corresponding response based on the corresponding text being previously generated from previous audio data captured by the client device and based on the corresponding response being previously received from a remote system in response to transmitting, to the remote system, at least one of the previous audio data and the corresponding text;
determining, by the client device, that the corresponding texts of the text-response map fail to match the current text;
in response to determining that the corresponding texts of the text-response map fail to match the current text, transmitting, to a remote system and based on determining that the corresponding texts of the text-response map fail to match the current text, the audio data or the current text;
receiving, from the remote system in response to transmitting the audio data or the current text, a response and an indication that the response is static only until an expiration event occurs, wherein the expiration event comprises determining that the user is no longer present at a particular location;
updating, in response to the indication that is received from the remote system indicating that the response is static, the text-response map by adding a given text mapping and including an indication of the expiration event with the given text mapping, the given text mapping defining a direct relationship between the current text and the response;
capturing, subsequent to updating the text-response map, second audio data;
processing the second audio data to generate a second text utilizing the voice-to-text model stored locally on the client device;
determining, based on the text-response map, that the current text matches the second text;
in response to determining that the current text matches the second text, and based on the text-response map including the given text mapping that defines the direct relationship between the current text and the response:
causing the response, from the text-response map, to be implemented; and
removing the given text mapping from the text-response map when the expiration event occurs.
|
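The control flow recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. All names here (`Mapping`, `TextResponseMap`, `handle_utterance`, `remote`, `user_at_location`) are hypothetical and do not appear in the claim. The sketch assumes the on-device voice-to-text step has already produced the text, that matching is exact string equality, and that the remote system is a callable returning a response together with a flag indicating that the response is static until the user leaves a particular location.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Mapping:
    """One text -> response entry plus its expiration check (hypothetical)."""
    response: str
    # Returns True once the expiration event has occurred.
    expired: Callable[[], bool]


@dataclass
class TextResponseMap:
    """Locally stored map from recognized text to a cached response."""
    entries: Dict[str, Mapping] = field(default_factory=dict)

    def purge_expired(self) -> None:
        # Remove every mapping whose expiration event has occurred.
        for text in [t for t, m in self.entries.items() if m.expired()]:
            del self.entries[text]

    def lookup(self, text: str) -> Optional[str]:
        # Assumption: exact string match between current and stored text.
        self.purge_expired()
        mapping = self.entries.get(text)
        return mapping.response if mapping else None

    def add(self, text: str, response: str, expired: Callable[[], bool]) -> None:
        self.entries[text] = Mapping(response, expired)


def handle_utterance(
    text: str,
    cache: TextResponseMap,
    remote: Callable[[str], Tuple[str, bool]],
    user_at_location: Callable[[], bool],
) -> Tuple[str, str]:
    """Return (response, source), where source is 'local' or 'remote'."""
    cached = cache.lookup(text)
    if cached is not None:
        # Map hit: implement the stored response without contacting the server.
        return cached, "local"

    # Map miss: transmit the text and receive the response plus static flag.
    response, static_until_user_leaves = remote(text)
    if static_until_user_leaves:
        # Store the mapping together with its expiration event:
        # the user is no longer present at the particular location.
        cache.add(text, response, expired=lambda: not user_at_location())
    return response, "remote"
```

In this sketch the first occurrence of an utterance goes to the remote system and populates the map, a repeat of the same text is served locally, and the mapping is dropped the next time the map is consulted after the user has left the location. Purging lazily on lookup is a simplification; an event-driven removal at the moment the expiration event occurs would serve the same purpose.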