US 12,014,748 B1
Speech enhancement machine learning model for estimation of reverberation in a multi-task learning framework
Ritwik Giri, Sunnyvale, CA (US); Mehmet Umut Isik, Menlo Park, CA (US); Neerad Dilip Phansalkar, Half Moon Bay, CA (US); Jean-Marc Valin, Montreal (CA); Karim Helwani, Mountain View, CA (US); and Arvindh Krishnaswamy, Palo Alto, CA (US)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Aug. 7, 2020, as Appl. No. 16/988,423.
Int. Cl. G10L 21/0208 (2013.01); G06N 5/04 (2023.01); G06N 20/00 (2019.01); G10L 21/034 (2013.01)

CPC G10L 21/0208 (2013.01) [G06N 5/04 (2013.01); G06N 20/00 (2019.01); G10L 21/034 (2013.01); G10L 2021/02082 (2013.01)]

19 Claims

1. A computer-implemented method comprising:

receiving, at a machine learning service of a provider network, a plurality of training audio files and a request to create a machine learning model;

training, by the machine learning service of the provider network, an algorithm into the machine learning model that generates a clean speech portion of an audio file and a reverb only portion of the audio file;

generating, by the machine learning model, a reverb only portion and a clean speech portion of at least one of the plurality of training audio files;

determining a direct to reverberant ratio of the at least one of the plurality of training audio files based on the reverb only portion and the clean speech portion of the at least one of the plurality of training audio files;

filtering out the at least one of the plurality of training audio files having the direct to reverberant ratio below a reverberance threshold to generate a proper subset of the plurality of training audio files;

performing a training iteration with the proper subset of the plurality of training audio files to update the machine learning model;

receiving an inference request for an input audio file from a computing device of a user located outside the provider network;

generating, by the machine learning model, a clean speech portion of the input audio file and a reverb only portion of the input audio file;

generating an inference based at least in part on the clean speech portion of the input audio file and the reverb only portion of the input audio file; and

transmitting the inference to a client application or to a storage location.
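
The filtering steps of claim 1 (estimating a direct to reverberant ratio from the two model outputs, then dropping files that fall below a threshold) can be illustrated with a minimal sketch. The claim does not fix a formula, so the sketch assumes the common definition of the ratio as clean-speech energy over reverb-only energy, expressed in decibels. The names `energy`, `direct_to_reverberant_ratio_db`, `filter_training_files`, the `separate` callable standing in for the trained model, and the `eps` guard are all hypothetical and are not taken from the patent.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Signal = Sequence[float]


def energy(signal: Signal) -> float:
    """Sum of squared samples."""
    return sum(x * x for x in signal)


def direct_to_reverberant_ratio_db(clean: Signal, reverb: Signal,
                                   eps: float = 1e-12) -> float:
    """Assumed definition: 10 * log10(clean energy / reverb-only energy).

    eps is a hypothetical guard against division by zero and log(0).
    """
    return 10.0 * math.log10((energy(clean) + eps) / (energy(reverb) + eps))


def filter_training_files(
    files: Dict[str, Signal],
    separate: Callable[[Signal], Tuple[Signal, Signal]],
    threshold_db: float,
) -> List[str]:
    """Return the names of files whose estimated ratio is at or above the threshold.

    `separate` is a placeholder for the model that yields
    (clean speech portion, reverb only portion) for one audio file.
    Files whose ratio is below `threshold_db` are filtered out.
    """
    kept = []
    for name, samples in files.items():
        clean, reverb = separate(samples)
        if direct_to_reverberant_ratio_db(clean, reverb) >= threshold_db:
            kept.append(name)
    return kept
```

In this sketch the returned list plays the role of the subset used for the next training iteration. How the threshold is chosen, and whether the ratio is computed per file or per frame, are left open here because the claim does not specify them.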