| CPC G10L 15/16 (2013.01) [G06N 3/045 (2023.01); G10L 15/063 (2013.01)] | 19 Claims |

|
1. A computer-implemented method, comprising:
receiving, at a first device, first audio data corresponding to a first portion of a spoken input;
processing, at the first device, the first audio data using a first trained model to generate first embedding data representing acoustic features corresponding to the first audio data, the first embedding data corresponding to at least a first vector output by at least a first layer of the first trained model;
storing the first embedding data in a computer-readable medium;
after storing the first embedding data and prior to performing automatic speech recognition (ASR) processing using the first embedding data, determining, at the first device and using a second trained model configured to detect device-directed speech, that the first audio data includes first device-directed speech;
in response to determining that the first audio data includes the first device-directed speech, retrieving the first embedding data from the computer-readable medium and sending the first embedding data to an ASR component;
processing, using the ASR component, the first embedding data to determine first ASR output data corresponding to the first audio data;
after receiving the first audio data, receiving, at the first device, second audio data corresponding to a second portion of the spoken input;
sending the second audio data to the ASR component; and
processing, using the ASR component, the first ASR output data and the second audio data to determine second ASR output data corresponding to the spoken input.
|
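The ordering of steps in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: every name here (`toy_encoder`, `toy_directed_detector`, `ToyASR`, `Device`) is hypothetical, the two "trained models" are replaced by trivial arithmetic stand-ins, and passing the raw audio to the detector is an assumption, since the claim does not fix the detector's input. The sketch only shows the control flow: the embedding is computed and stored first, the device-directed check runs before any ASR processing, the stored embedding is forwarded to ASR only on a positive detection, and the second audio portion is then processed together with the first ASR output.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Vector = List[float]


def toy_encoder(audio: Sequence[float]) -> Vector:
    """Stand-in for the first trained model: returns a toy 2-d 'embedding'."""
    n = len(audio) or 1
    mean = sum(audio) / n
    energy = sum(x * x for x in audio) / n
    return [mean, energy]


def toy_directed_detector(audio: Sequence[float]) -> bool:
    """Stand-in for the second trained model (assumed: an energy threshold)."""
    n = len(audio) or 1
    return sum(x * x for x in audio) / n > 0.1


class ToyASR:
    """Hypothetical ASR component that accepts embeddings or raw audio."""

    def __init__(self) -> None:
        self.log: List[Tuple[str, object]] = []

    def from_embedding(self, emb: Vector) -> str:
        self.log.append(("embedding", list(emb)))
        return f"<partial:{len(emb)}d>"

    def continue_with_audio(self, prior: str, audio: Sequence[float]) -> str:
        self.log.append(("audio", len(audio)))
        return f"{prior}+<audio:{len(audio)}>"


class Device:
    """Hypothetical first device that gates ASR on device-directed speech."""

    def __init__(self, asr: ToyASR) -> None:
        self.asr = asr
        self.store: Dict[str, Vector] = {}  # stands in for the storage medium
        self.first_output: Optional[str] = None

    def on_first_audio(self, audio: Sequence[float]) -> Optional[str]:
        # Compute and store the embedding before any detection or ASR.
        self.store["first"] = toy_encoder(audio)
        # Detection runs after storing and before ASR processing.
        if not toy_directed_detector(audio):
            return None
        # Positive detection: retrieve the stored embedding, send it to ASR.
        stored = self.store.pop("first")
        self.first_output = self.asr.from_embedding(stored)
        return self.first_output

    def on_second_audio(self, audio: Sequence[float]) -> Optional[str]:
        # The later audio goes to ASR as audio, combined with the first output.
        if self.first_output is None:
            return None
        return self.asr.continue_with_audio(self.first_output, audio)
```

In this sketch, a `Device` whose first audio portion fails the detector never calls the ASR component at all, which is the gating behavior the claim's step ordering implies.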