US 12,431,143 B1
Neural coding for redundant audio information transmission
Jean-Marc Valin, Montreal (CA); Jan Buethe, Munich (DE); and Ahmed Mustafa, Aachen (DE)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Jun. 30, 2023, as Appl. No. 18/345,838.
Int. Cl. G10L 19/00 (2013.01); G10L 19/04 (2013.01)

CPC G10L 19/0017 (2013.01) [G10L 19/04 (2013.01)]

20 Claims

1. A system, comprising:

a first device, comprising:

an audio sensor;

a first processor; and

a first memory, storing program instructions that when executed by the first processor, cause the first processor to:

receive a stream of audio data captured by the audio sensor for transmission over a network to a recipient;

encode acoustic features of the stream of audio data for a plurality of individual frames of the audio data according to an autoencoder technique, wherein the encoding processes the stream of audio data through a first recurrent neural network (RNN) trained to apply continuous forward encoding between the individual frames to output respective latent vectors for the individual frames of the audio data and respective initial states for decoding backward in the stream of audio data starting from the initial states;

generate a plurality of network packets corresponding to different, overlapping portions of the stream of audio data, wherein individual ones of the plurality of network packets comprise:

a subset of the respective latent vectors for a subset of the plurality of audio frames that corresponds to the portion of the stream of audio data; and

one of the respective initial states that corresponds to a most recent audio frame in the subset of the audio frames; and

send the plurality of network packets to the recipient over the network;

a second device, comprising:

a second processor; and

a second memory, storing further program instructions that when executed by the second processor, cause the second processor to:

receive one of the plurality of network packets;

decode the one network packet, wherein the decoding processes the subset of the respective latent vectors for the subset of the plurality of audio frames and the initial state as input to a second RNN trained to apply backward decoding from the most recent audio frame in the subset of audio frames to generate a decoded version for at least one of the subset of the plurality of audio frames.
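
The data flow recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the trained RNNs are replaced by fixed placeholder arithmetic, and every name here (`toy_encode`, `make_packets`, `toy_decode`, `Packet`, `span`, `stride`) is hypothetical and chosen only for illustration. The sketch shows a forward pass that emits one latent vector and one decoder initial state per frame, packets that cover overlapping runs of frames and carry the initial state of their most recent frame, and a decoder that walks backward from that frame.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

Vector = List[float]


def toy_encode(frames: List[Vector]) -> Tuple[List[Vector], List[Vector]]:
    """Forward recurrent pass (placeholder weights, not a trained RNN).

    Returns one latent vector and one decoder initial state per frame.
    """
    hidden = [0.0, 0.0]
    latents, initial_states = [], []
    for features in frames:
        mean = sum(features) / len(features)
        # The hidden state is carried continuously from frame to frame.
        hidden = [math.tanh(0.5 * hidden[0] + mean),
                  math.tanh(0.5 * hidden[1] - mean)]
        latents.append(list(hidden))
        initial_states.append([0.5 * h for h in hidden])
    return latents, initial_states


@dataclass
class Packet:
    newest_frame: int        # index of the most recent frame covered
    latents: List[Vector]    # latent vectors, ordered oldest to newest
    initial_state: Vector    # initial state belonging to newest_frame


def make_packets(latents: List[Vector], initial_states: List[Vector],
                 span: int, stride: int) -> List[Packet]:
    """Build packets over runs of `span` frames; they overlap when stride < span."""
    packets = []
    for newest in range(span - 1, len(latents), stride):
        oldest = newest - span + 1
        packets.append(Packet(newest, latents[oldest:newest + 1],
                              initial_states[newest]))
    return packets


def toy_decode(packet: Packet) -> Dict[int, Vector]:
    """Backward recurrent pass starting from the packet's most recent frame."""
    state = list(packet.initial_state)
    decoded: Dict[int, Vector] = {}
    for offset, latent in enumerate(reversed(packet.latents)):
        state = [math.tanh(0.5 * s + z) for s, z in zip(state, latent)]
        decoded[packet.newest_frame - offset] = list(state)
    return decoded


# Ten made-up feature frames; the second packet is treated as lost in transit.
frames = [[0.1 * i, 0.2 * i] for i in range(10)]
latents, initial_states = toy_encode(frames)
packets = make_packets(latents, initial_states, span=4, stride=2)
received = [p for i, p in enumerate(packets) if i != 1]
recovered: Dict[int, Vector] = {}
for p in received:
    recovered.update(toy_decode(p))
```

In this toy run, every frame index remains recoverable despite the dropped packet, because the neighbouring packets cover the same frames. Because each packet carries its own initial state, the decoder in this sketch needs no state from any other packet.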