US 12,412,567 B1
Low latency audio processing techniques
Bjorn Hoffmeister, Seattle, WA (US); Ariya Rastrow, Seattle, WA (US); and Grant Strimel, Presto, PA (US)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on May 5, 2021, as Appl. No. 17/308,550.
Int. Cl. G10L 15/16 (2006.01); G06N 3/045 (2023.01); G10L 15/06 (2013.01)

CPC G10L 15/16 (2013.01) [G06N 3/045 (2023.01); G10L 15/063 (2013.01)]

19 Claims

1. A computer-implemented method, comprising:

receiving, at a first device, first audio data corresponding to a first portion of a spoken input;

processing, at the first device, the first audio data using a first trained model to generate first embedding data representing acoustic features corresponding to the first audio data, the first embedding data corresponding to at least a first vector output by at least a first layer of the first trained model;

storing the first embedding data in a computer-readable medium;

after storing the first embedding data and prior to performing automatic speech recognition (ASR) processing using the first embedding data, determining, at the first device and using a second trained model configured to detect device-directed speech, that the first audio data includes first device-directed speech;

in response to determining that the first audio data includes the first device-directed speech, retrieving the first embedding data from the computer-readable medium and sending the first embedding data to an ASR component;

processing, using the ASR component, the first embedding data to determine first ASR output data corresponding to the first audio data;

after receiving the first audio data, receiving, at the first device, second audio data corresponding to a second portion of the spoken input;

sending the second audio data to the ASR component; and

processing, using the ASR component, the first ASR output data and the second audio data to determine second ASR output data corresponding to the spoken input.
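
The ordering of steps recited in claim 1 can be illustrated with a minimal sketch. This is not the patented implementation. The names `toy_encoder`, `toy_directed_detector`, `ToyAsr`, and `Device` are hypothetical, and the two "trained models" and the ASR component are replaced by trivial stand-in functions. The sketch only shows the control flow: the embedding is computed and stored first, the device-directed check runs before any ASR processing, the stored embedding is forwarded to ASR only on a positive check, and the later audio is combined with the first ASR output.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

Vector = List[float]


def toy_encoder(audio: List[float]) -> Vector:
    """Hypothetical stand-in for the first trained model.

    Returns a two-value 'embedding' (mean and mean energy) in place of a
    vector output by a layer of a real acoustic model.
    """
    mean = sum(audio) / len(audio)
    energy = sum(x * x for x in audio) / len(audio)
    return [mean, energy]


def toy_directed_detector(audio: List[float]) -> bool:
    """Hypothetical stand-in for the second trained model.

    Treats audio with a peak amplitude above 0.5 as device-directed.
    """
    return max(abs(x) for x in audio) > 0.5


@dataclass
class ToyAsr:
    """Hypothetical ASR component; records which inputs it has processed."""
    log: List[str] = field(default_factory=list)

    def process_embedding(self, embedding: Vector) -> str:
        self.log.append("embedding")
        return f"<part1 dim={len(embedding)}>"

    def process_audio(self, previous_output: str, audio: List[float]) -> str:
        self.log.append("audio")
        return f"{previous_output} <part2 samples={len(audio)}>"


@dataclass
class Device:
    asr: ToyAsr
    store: Dict[str, Vector] = field(default_factory=dict)  # stand-in storage medium
    first_output: Optional[str] = None

    def on_first_audio(self, audio: List[float]) -> Optional[str]:
        # Generate and store the embedding before any ASR processing.
        self.store["first"] = toy_encoder(audio)
        # Check for device-directed speech after storing, before ASR.
        if not toy_directed_detector(audio):
            return None
        # Positive check: retrieve the stored embedding and send it to ASR.
        cached = self.store["first"]
        self.first_output = self.asr.process_embedding(cached)
        return self.first_output

    def on_second_audio(self, audio: List[float]) -> Optional[str]:
        # Later audio goes to ASR together with the first ASR output.
        if self.first_output is None:
            return None
        return self.asr.process_audio(self.first_output, audio)
```

In this sketch, calling `Device(asr=ToyAsr()).on_first_audio(...)` with quiet audio leaves the embedding in `store` and never invokes the ASR stand-in, while loud audio triggers `process_embedding` and allows a subsequent `on_second_audio` call to produce a combined result.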