CPC G10L 15/26 (2013.01) [G06F 3/0488 (2013.01); G06N 20/00 (2019.01); G10L 15/18 (2013.01); G10L 15/22 (2013.01); G10L 2015/223 (2013.01)]
20 Claims
1. A method implemented by one or more processors, the method comprising:
receiving audio data that captures a spoken utterance of a user of a client device, the audio data being generated by one or more microphones of the client device;
determining whether touch input of the user is being simultaneously directed to a transcription, that is displayed at the client device via a software application accessible at the client device, at the same time the audio data that captures the spoken utterance is received;
in response to determining that no touch input of the user is being simultaneously directed to the transcription at the same time the audio data that captures the spoken utterance is received:
determining to incorporate recognized text, that corresponds to the spoken utterance, into the transcription;
in response to determining that touch input of the user is being simultaneously directed to the transcription at the same time the audio data that captures the spoken utterance is received:
determining, based on one or more terms of the spoken utterance, whether to:
incorporate the recognized text, that corresponds to the spoken utterance, into the transcription, or
perform an assistant command that is associated with the transcription and that is based on the recognized text that corresponds to the spoken utterance;
in response to determining to incorporate the recognized text that corresponds to the spoken utterance into the transcription:
automatically incorporating the recognized text that corresponds to the spoken utterance into the transcription; and
in response to determining to perform the assistant command that is associated with the transcription and that is based on the recognized text that corresponds to the spoken utterance:
causing an automated assistant to perform the assistant command that is associated with the transcription and that is based on the recognized text that corresponds to the spoken utterance.
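The decision flow recited in claim 1 can be illustrated with a short sketch. This is a minimal Python illustration under stated assumptions, not the patented implementation. The names `Transcription`, `route_utterance`, and `HYPOTHETICAL_COMMAND_TERMS` are hypothetical. Speech recognition and touch detection are assumed to have already produced `recognized_text` and a boolean `touch_active`. The claim only says the choice is made "based on one or more terms of the spoken utterance", so the leading-keyword test below is an illustrative stand-in. The assistant command is recorded rather than executed.

```python
from dataclasses import dataclass, field
from typing import List

# HYPOTHETICAL: leading terms treated as assistant commands. The claim does not
# specify which terms trigger a command; this keyword set is an assumption.
HYPOTHETICAL_COMMAND_TERMS = {"delete", "undo", "send", "capitalize"}


@dataclass
class Transcription:
    """Hypothetical model of the transcription displayed by the software application."""
    text: str = ""
    executed_commands: List[str] = field(default_factory=list)


def _incorporate(recognized_text: str, transcription: Transcription) -> None:
    # Append the recognized text, separated by a space if text already exists.
    if transcription.text:
        transcription.text += " " + recognized_text
    else:
        transcription.text = recognized_text


def route_utterance(recognized_text: str, touch_active: bool,
                    transcription: Transcription) -> str:
    """Return "dictation" or "command" to indicate which branch was taken."""
    if not touch_active:
        # No simultaneous touch input: the recognized text is always dictation.
        _incorporate(recognized_text, transcription)
        return "dictation"

    # Touch input is simultaneously directed at the transcription:
    # inspect the utterance's terms to choose between dictation and a command.
    terms = recognized_text.lower().split()
    if terms and terms[0] in HYPOTHETICAL_COMMAND_TERMS:
        # Stand-in for causing an automated assistant to perform the command.
        transcription.executed_commands.append(recognized_text)
        return "command"

    _incorporate(recognized_text, transcription)
    return "dictation"
```

Touch acts as a gate here. Without it, even "delete that" is dictated verbatim. With it, only utterances whose terms look like commands are diverted away from the transcription.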