CPC G10L 15/22 (2013.01) [G06F 3/167 (2013.01); G06F 18/24 (2023.01); G06V 10/40 (2022.01); G06V 40/10 (2022.01); G06V 40/20 (2022.01); G10L 13/08 (2013.01); G10L 15/02 (2013.01); G10L 15/063 (2013.01); G10L 15/08 (2013.01); G10L 15/20 (2013.01); G10L 15/222 (2013.01); G10L 15/24 (2013.01); G10L 2015/0635 (2013.01); G10L 2015/088 (2013.01); G10L 2015/223 (2013.01); G10L 2015/227 (2013.01)] | 20 Claims |
1. A computer-implemented method comprising:
receiving, by a user device comprising at least one microphone and at least one speaker, output audio data representing synthesized speech of a list of entries;
using the at least one speaker, beginning playback of audio corresponding to the output audio data;
during the playback of the audio, detecting, by the at least one microphone, user speech;
determining input audio data representing the user speech;
determining a first time corresponding to the beginning of the playback;
determining a second time corresponding to detection of the user speech;
determining, by the user device, offset time data representing a difference between the first time and the second time;
using a first trained machine learning (ML) model, processing the input audio data to determine that the user speech is system directed;
based at least in part on determining that the user speech is system directed, sending, to at least one remote device, the input audio data and the offset time data;
performing automatic speech recognition (ASR) processing on the input audio data to determine ASR output data representing a transcript of the input audio data;
using a second trained ML model, performing natural language understanding (NLU) processing on the ASR output data to determine NLU output data representing at least an intent corresponding to the user speech;
based at least in part on the NLU output data, determining that the user speech refers to an entry that is absent from the user speech;
based at least in part on the entry being absent from the user speech, processing the offset time data to determine the entry is a first entry in the list of entries; and
causing an action to be performed based at least in part on the first entry.
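As an illustration only, the offset-time steps of claim 1 (determining the difference between the first time and the second time, then using that offset to identify the entry the user speech refers to) might be sketched as follows. The claim does not state how offset time data is mapped to an entry. This sketch assumes that a start time for each entry within the synthesized audio is available, and it optionally subtracts an assumed user reaction delay. The names `EntrySpan`, `offset_time_ms`, `resolve_entry`, and `reaction_ms` are hypothetical and do not come from the claim.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class EntrySpan:
    """Hypothetical timing record for one entry of the spoken list."""
    entry: str
    start_ms: int  # when the entry begins, measured from the start of playback


def offset_time_ms(playback_start_ms: int, speech_detected_ms: int) -> int:
    """Difference between the first time (playback start) and the
    second time (detection of user speech), in milliseconds."""
    if speech_detected_ms < playback_start_ms:
        raise ValueError("speech must be detected after playback begins")
    return speech_detected_ms - playback_start_ms


def resolve_entry(
    spans: Sequence[EntrySpan],
    offset_ms: int,
    reaction_ms: int = 0,  # assumption: optional allowance for user reaction delay
) -> Optional[str]:
    """Return the entry that was playing at the (adjusted) offset.

    `spans` must be sorted by `start_ms`. Returns None when the offset
    falls before the first entry, for example during an introductory prompt.
    """
    if not spans:
        return None
    adjusted = max(offset_ms - reaction_ms, 0)
    starts = [span.start_ms for span in spans]
    # Index of the last entry whose start time is <= the adjusted offset.
    index = bisect_right(starts, adjusted) - 1
    if index < 0:
        return None
    return spans[index].entry


# Example: the user says "that one" 3.1 seconds into playback.
spans = [
    EntrySpan("pizza", 1000),
    EntrySpan("sushi", 2500),
    EntrySpan("tacos", 4000),
]
offset = offset_time_ms(playback_start_ms=10_000, speech_detected_ms=13_100)
selected = resolve_entry(spans, offset)
```

In this sketch the offset would be computed on the user device and sent with the input audio data, while the lookup would run wherever the entry timings are known. A binary search over sorted start times is just one convenient choice; any mapping from offset to entry would serve the same illustrative purpose.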