US 11,924,367 B1
Joint noise and echo suppression for two-way audio communication enhancement
Jean-Marc Valin, Montreal (CA); Karim Helwani, Mountain View, CA (US); Srikanth Venkata Tenneti, Sunnyvale, CA (US); Erfan Soltanmohammadi, Silver Spring, MD (US); Mehmet Umut Isik, Menlo Park, CA (US); Richard Newman, Pullman, WA (US); Michael Mark Goodwin, Scotts Valley, CA (US); and Arvindh Krishnaswamy, Palo Alto, CA (US)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Feb. 9, 2022, as Appl. No. 17/668,297.
Int. Cl. H04M 3/00 (2006.01); G10L 21/0232 (2013.01); G10L 21/034 (2013.01); G10L 25/18 (2013.01); H04S 3/00 (2006.01); G10L 21/0208 (2013.01)
CPC H04M 3/002 (2013.01) [G10L 21/0232 (2013.01); G10L 21/034 (2013.01); G10L 25/18 (2013.01); H04S 3/008 (2013.01); G10L 2021/02082 (2013.01); H04S 2400/01 (2013.01); H04S 2400/03 (2013.01)]
20 Claims
 
1. A system, comprising:
at least one processor; and
a memory, storing program instructions that when executed by the at least one processor, cause the at least one processor to implement an audio enhancement system, configured to:
receive, via an interface for the audio enhancement system, first audio data captured by a microphone at a first communication device as part of a two-way audio communication between the first communication device and a second communication device;
receive second audio data transmitted from the second communication device to the first communication device for playback through a speaker at the first communication device as part of the two-way audio communication;
apply a machine learning model trained to determine respective gain values for a plurality of different spectrum bands of the first audio data to suppress noise and suppress echo captured in the first audio data from playback of the second audio data through a speaker at the first communication device, wherein the machine learning model accepts respective input features extracted from the second audio data as a reference signal and extracted from the first audio data based on respective representations of the second audio data and the first audio data in respective sets of frequency bands; and
apply an envelope post-filter that individually modifies the respective gain values according to a monotonically increasing function applied to the respective gain values;
perform an inverse transform on the plurality of different spectrum bands with the respectively modified gain values to generate an enhanced version of the first audio data; and
send the enhanced version of the first audio data to the second communication device for playback at the second communication device.
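
The signal path recited in claim 1 (band-wise features from the microphone and reference signals, per-band gains, a monotonically increasing post-filter on those gains, and an inverse transform) can be illustrated with a minimal single-frame sketch. Everything named here is an assumption made for illustration and is not taken from the patent: `placeholder_gain_model` is a simple energy-ratio heuristic standing in for the trained machine learning model, the bands are contiguous groups of FFT bins described by a hypothetical `edges` array, and the post-filter uses g * sin(pi/2 * g), which is only one possible monotonically increasing function on [0, 1]. Windowing and overlap-add are omitted.

```python
import numpy as np

def band_energies(spectrum, edges):
    """Sum the power of the FFT bins inside each band [edges[i], edges[i+1])."""
    power = np.abs(spectrum) ** 2
    return np.array([power[lo:hi].sum() for lo, hi in zip(edges[:-1], edges[1:])])

def placeholder_gain_model(mic_bands, ref_bands, eps=1e-9):
    """Hypothetical stand-in for the trained model (NOT the claimed model).

    Returns one gain in [0, 1] per band: bands where the reference (far-end)
    energy is large relative to the microphone energy are attenuated.
    """
    return np.clip(1.0 - ref_bands / (mic_bands + eps), 0.0, 1.0)

def envelope_post_filter(gains):
    """Modify each gain individually with g -> g * sin(pi/2 * g).

    Assumed example function: it is monotonically increasing on [0, 1],
    leaves g = 0 and g = 1 unchanged, and lowers intermediate gains.
    """
    g = np.clip(gains, 0.0, 1.0)
    return g * np.sin(0.5 * np.pi * g)

def enhance_frame(mic_frame, ref_frame, edges):
    """Process one frame: transform, per-band gains, post-filter, inverse transform."""
    mic_spec = np.fft.rfft(mic_frame)
    ref_spec = np.fft.rfft(ref_frame)
    gains = placeholder_gain_model(
        band_energies(mic_spec, edges), band_energies(ref_spec, edges)
    )
    gains = envelope_post_filter(gains)
    # Expand each band gain to all FFT bins in that band.
    bin_gains = np.repeat(gains, np.diff(edges))
    return np.fft.irfft(mic_spec * bin_gains, n=len(mic_frame))

# Usage: a 480-sample frame split into 8 bands covering all 241 rfft bins.
frame_len = 480
edges = np.linspace(0, frame_len // 2 + 1, 9).astype(int)
rng = np.random.default_rng(0)
mic = rng.standard_normal(frame_len)   # near-end microphone frame
ref = np.zeros(frame_len)              # silent far-end reference frame
enhanced = enhance_frame(mic, ref, edges)
```

With a silent reference the placeholder gains are all 1, so the frame passes through essentially unchanged; a real system would replace the heuristic with the trained model's gain estimates and process overlapping windowed frames.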