US 11,961,514 B1
Streaming self-attention in a neural network
Chia-Jung Chang, Cambridge, MA (US); Qingming Tang, Cambridge, MA (US); Ming Sun, Winchester, MA (US); and Chao Wang, Newton, MA (US)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Dec. 10, 2021, as Appl. No. 17/547,610.
Int. Cl. G10L 15/16 (2006.01); G10L 15/14 (2006.01); G10L 17/16 (2013.01)

CPC G10L 15/16 (2013.01)

20 Claims

1. A method comprising:

receiving a plurality of audio frames representing audio captured by a device;

determining a first subset of the plurality of audio frames;

determining a second subset of the plurality of audio frames, wherein the second subset is different from the first subset but includes at least one audio frame from the first subset;

processing, using a convolutional recurrent neural network encoder (CRNN) configured to determine audio features in the plurality of audio frames, the first subset to generate first hidden state data of the CRNN;

determining preliminary embedding data using at least the first hidden state data;

processing, using the CRNN, the second subset and the first hidden state data to generate second hidden state data of the CRNN;

determining interim embedding data using the second hidden state data and the preliminary embedding data;

determining, using at least the interim embedding data, final embedding data representing audio features representing the audio;

processing the final embedding data with respect to stored data to determine results data, the stored data representing audio of an event; and

determining, based at least in part on the results data, that an instance of the event has occurred.
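
The framing and matching steps of claim 1 can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the claim: the helper names `overlapping_subsets`, `cosine_similarity`, and `event_detected`, the window size and hop, the use of cosine similarity as the comparison against stored data, and the 0.8 threshold. The CRNN itself is omitted; the sketch only shows how two subsets can differ while sharing frames, and how a final embedding might be compared with a stored one.

```python
import math


def overlapping_subsets(frames, size, hop):
    """Split frames into windows of `size` frames that advance by `hop` frames.

    With hop < size, consecutive windows share size - hop frames.
    """
    if not 0 < hop < size:
        raise ValueError("hop must be positive and smaller than size")
    return [frames[i:i + size] for i in range(0, len(frames) - size + 1, hop)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def event_detected(final_embedding, stored_embedding, threshold=0.8):
    """Hypothetical decision rule: similarity at or above a threshold counts as a match."""
    return cosine_similarity(final_embedding, stored_embedding) >= threshold


# Ten placeholder frames; real frames would be feature vectors, not integers.
frames = list(range(10))
subsets = overlapping_subsets(frames, size=4, hop=2)
first_subset, second_subset = subsets[0], subsets[1]
shared_frames = set(first_subset) & set(second_subset)
```

Here `first_subset` and `second_subset` are different windows that share two frames, matching the requirement that the second subset include at least one frame from the first.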

5. A method comprising:

receiving a first portion of audio data;

receiving a second portion of audio data;

processing, using a recurrent neural network (RNN), the first portion of audio data to generate first data representing a first hidden state of the RNN;

processing the first data to determine a first variable value;

determining second data representing a first embedding using at least the first data;

processing, using the RNN, the second portion of audio data and the first data to generate third data representing a second hidden state of the RNN;

determining, using the third data, a second variable value;

determining fourth data representing a second embedding using the third data and the first embedding; and

determining, using at least the fourth data and the second variable value, fifth data representing audio features representing the audio data.
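
One possible reading of claim 5 is a streaming form of attention pooling, which can be written as a short recurrence. The notation is an assumption made for illustration and does not appear in the claim: $h_t$ is the hidden state after portion $t$, $w$ is a hypothetical scoring vector, $z_t$ stands in for the variable value, $e_t$ for the embedding, and $c_2$ for the resulting audio features. Treating the variable value as a running softmax normalizer is likewise an assumption.

```latex
\begin{aligned}
s_t &= w^{\top} h_t, \\
z_1 &= \exp(s_1), & e_1 &= \exp(s_1)\, h_1, \\
z_2 &= z_1 + \exp(s_2), & e_2 &= e_1 + \exp(s_2)\, h_2, \\
c_2 &= \frac{e_2}{z_2}
     = \sum_{i=1}^{2} \frac{\exp(s_i)}{\exp(s_1) + \exp(s_2)}\, h_i .
\end{aligned}
```

Under this reading, $c_2$ equals softmax attention over both hidden states, yet only $e_1$ and $z_1$ have to be kept between portions, so earlier audio never needs to be reprocessed.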

13. A system, comprising:

at least one processor; and

at least one memory comprising instructions that, when executed by the at least one processor, cause the system to:

receive a first portion of audio data;

receive a second portion of audio data;

process, using a recurrent neural network (RNN), the first portion of audio data to generate first data representing a first hidden state of the RNN;

process the first data to determine a first variable value;

determine second data representing a first embedding using at least the first data;

process, using the RNN, the second portion of audio data and the first data to generate third data representing a second hidden state of the RNN;

determine, using the third data, a second variable value;

determine fourth data representing a second embedding using the third data and the first embedding; and

determine, using at least the fourth data and the second variable value, fifth data representing audio features representing the audio data.
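
A minimal sketch of how instructions of this kind might be organized as a stateful object that consumes one portion of audio data at a time. It is an illustration under stated assumptions, not the patented implementation: the class `StreamingAttentionPooler`, its toy `_rnn_step` (a stand-in for a trained RNN), the scoring weights, and the interpretation of the variable value as a running sum of exponentiated scores are all hypothetical. The helper `batch_attention` is included only to check that the streamed result agrees with ordinary softmax attention computed over all hidden states at once.

```python
import math


class StreamingAttentionPooler:
    """Attention pooling that retains only a running sum and a running normalizer."""

    def __init__(self, score_weights):
        self.w = list(score_weights)            # hypothetical scoring vector
        self.hidden = [0.0] * len(self.w)       # current RNN hidden state
        self.embedding = [0.0] * len(self.w)    # running weighted sum of hidden states
        self.normalizer = 0.0                   # running sum of exp(score)

    def _rnn_step(self, portion):
        # Toy stand-in for a trained RNN: mix the previous state with the input.
        return [math.tanh(0.5 * h + x) for h, x in zip(self.hidden, portion)]

    def update(self, portion):
        """Consume one portion, update the carried state, return current features."""
        self.hidden = self._rnn_step(portion)
        score = sum(w * h for w, h in zip(self.w, self.hidden))
        weight = math.exp(score)
        self.normalizer += weight
        self.embedding = [e + weight * h for e, h in zip(self.embedding, self.hidden)]
        return self.features()

    def features(self):
        if self.normalizer == 0.0:
            raise ValueError("no audio has been processed yet")
        return [e / self.normalizer for e in self.embedding]


def batch_attention(hiddens, w):
    """Reference: softmax attention over all hidden states at once."""
    weights = [math.exp(sum(wi * hi for wi, hi in zip(w, h))) for h in hiddens]
    total = sum(weights)
    return [sum(wt * h[d] for wt, h in zip(weights, hiddens)) / total
            for d in range(len(w))]


pooler = StreamingAttentionPooler([0.3, -0.2])
portions = [[0.1, 0.4], [0.5, -0.3], [-0.2, 0.2]]  # made-up two-dimensional inputs
hiddens = []
for portion in portions:
    streamed = pooler.update(portion)
    hiddens.append(pooler.hidden)
reference = batch_attention(hiddens, pooler.w)
```

The design point is that `update` touches only the newest portion and the carried state. A production version would likely also track a running maximum score and rescale the sums to avoid overflow in `exp`; that refinement is left out here for brevity.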