US 11,875,810 B1
Echo cancellation using neural networks for environments with unsynchronized devices for audio capture and rendering
Karim Helwani, Mountain View, CA (US); and Emmanouil Theodosis, Somerville, MA (US)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Sep. 29, 2021, as Appl. No. 17/489,538.
Int. Cl. G10L 21/02 (2013.01); G10L 21/0208 (2013.01); H04M 9/08 (2006.01); G06N 3/045 (2023.01); G10L 21/0216 (2013.01)
CPC G10L 21/0208 (2013.01) [G06N 3/045 (2023.01); H04M 9/082 (2013.01); G10L 2021/02082 (2013.01); G10L 2021/02166 (2013.01)]
20 Claims
1. A system, comprising:
one or more computing devices;
wherein the one or more computing devices include instructions that upon execution on or across the one or more computing devices:
obtain, as input at a neural network-based multi-layer echo canceler comprising a first layer which includes a non-linear effects handler and a second layer which includes a linear effects handler, (a) output of a first microphone in a first communication environment comprising one or more microphones and one or more speakers, and (b) a reference signal received at the first communication environment from a second communication environment and directed to a first speaker of the one or more speakers;
generate, at the non-linear effects handler, a first output obtained at least in part by applying a first learned compensation for a first set of properties of the output of the first microphone, wherein the first set of properties includes (a) a first non-linearity resulting from a clock skew between the first speaker and the first microphone, and (b) a second non-linearity in an audio reproduction capability of the first speaker, wherein applying the first learned compensation comprises modifying one or more weights of a first neural network based at least in part on processing of the reference signal and the output of the first microphone;
provide, as input to the linear effects handler, at least the output of the non-linear effects handler;
generate, at the linear effects handler, a second output obtained at least in part by applying a second learned compensation for a second set of properties of the output of the non-linear effects handler, wherein the second set of properties includes a first echo resulting from capturing audio output of the first speaker at the first microphone, and wherein applying the second learned compensation comprises utilizing, at a second neural network, a learned linear model of an acoustic path between the first speaker and the first microphone; and
transmit the second output to the second communication environment.
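
Illustrative sketch (not part of the claims). The two-layer arrangement recited in claim 1 can be pictured with a small NumPy simulation, offered here only as a rough aid under stated assumptions. The claim recites neural networks with learned weights. This sketch instead substitutes a fixed parametric model (tanh saturation plus resampling for clock skew, with the skew and drive values assumed known) for the non-linear effects handler, and a classical NLMS adaptive filter for the learned linear model of the acoustic path. All function names and parameter values (speaker_and_skew, nlms_cancel, simulate, erle_db, 200 ppm, drive of 2.0) are hypothetical and do not come from the patent.

```python
import numpy as np

def speaker_and_skew(ref, skew_ppm, drive):
    """Stand-in for the non-linear stage: speaker saturation, then clock skew.
    Assumption: skew and drive are known here rather than learned."""
    saturated = np.tanh(drive * ref)            # loudspeaker non-linearity
    n = np.arange(len(ref))
    t = n * (1.0 + skew_ppm * 1e-6)             # capture clock drifts vs. render clock
    return np.interp(t, n, saturated)           # resample onto the skewed time grid

def nlms_cancel(x, mic, taps=32, mu=0.5, eps=1e-6):
    """Stand-in for the linear stage: NLMS estimate of the acoustic path.
    Returns the residual (mic minus estimated echo) and the filter weights."""
    w = np.zeros(taps)
    buf = np.zeros(taps)
    residual = np.empty(len(mic))
    for i in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = x[i]
        e = mic[i] - w @ buf                    # subtract the predicted echo
        w += mu * e * buf / (buf @ buf + eps)   # normalized LMS weight update
        residual[i] = e
    return residual, w

def simulate(n=8000, seed=0):
    """Synthetic far-end reference and the echo it produces at the microphone."""
    rng = np.random.default_rng(seed)
    ref = 0.5 * rng.standard_normal(n)
    path = rng.standard_normal(16) * np.exp(-0.3 * np.arange(16))  # toy room response
    played = speaker_and_skew(ref, skew_ppm=200.0, drive=2.0)
    echo = np.convolve(played, path)[:n]
    return ref, echo

def erle_db(echo, residual, tail=2000):
    """Echo return loss enhancement over the last `tail` samples, in dB."""
    num = np.mean(echo[-tail:] ** 2)
    den = np.mean(residual[-tail:] ** 2) + 1e-30
    return 10.0 * np.log10(num / den)

if __name__ == "__main__":
    ref, echo = simulate()
    # Two stages: non-linear compensation first, then the linear path model.
    two_stage, _ = nlms_cancel(speaker_and_skew(ref, 200.0, 2.0), echo)
    # Baseline: linear path model applied to the raw reference only.
    linear_only, _ = nlms_cancel(ref, echo)
    print("two-stage ERLE (dB):  ", round(erle_db(echo, two_stage), 1))
    print("linear-only ERLE (dB):", round(erle_db(echo, linear_only), 1))
```

In this toy setup the microphone signal contains only echo, so the residual measures cancellation directly. Feeding the linear stage a reference that has already been compensated for saturation and skew leaves it a purely linear problem, which is the ordering the claim describes. In a real system the near-end talker's speech would remain in the second output that is transmitted to the other environment.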