US 12,308,042 B2
Multistage low power, low latency, and real-time deep learning single microphone noise suppression
Mouna Elkhatib, Irvine, CA (US); and Adil Benyassine, Irvine, CA (US)
Assigned to AONDEVICES, INC., Irvine, CA (US)
Filed by AONDEVICES, INC., Irvine, CA (US)
Filed on Mar. 11, 2022, as Appl. No. 17/654,462.
Claims priority of provisional application 63/159,893, filed on Mar. 11, 2021.
Prior Publication US 2022/0293119 A1, Sep. 15, 2022
Int. Cl. G10L 21/0232 (2013.01); G10L 21/034 (2013.01); G10L 25/18 (2013.01); G10L 25/21 (2013.01); G10L 25/30 (2013.01)

CPC G10L 21/0232 (2013.01) [G10L 21/034 (2013.01); G10L 25/18 (2013.01); G10L 25/21 (2013.01); G10L 25/30 (2013.01)]

13 Claims

1. A multi-stage noise suppression system for cleaning a noisy input speech signal with an underlying speech combined with noise from a surrounding environment as captured from a single transducer source, comprising:

a first noise gain extractor generating a set of ideal noise gain values for each of a spectrum of discrete frequency segments in a frequency domain representation of the noisy input speech signal based upon estimates of the noise components in the noisy input speech signal, the first noise gain extractor being a first neural network specifically trained to generate the first set of ideal noise gain values based upon an identification of optimal neural network weight values from predetermined criteria tuned for speech captured from noisy environments;

a first noise signal processor applying the set of ideal noise gain values to the spectrum of discrete frequency segments of the noisy input speech signal with estimated noise power spectrum values being generated therefrom;

a noise subtractor receptive to the estimated noise power spectrum values and the noisy input speech signal, the noise subtractor generating partially denoised signal spectrum values as first stage outputs from the noisy input speech signal reduced by the estimated noise power spectrum values;

a second noise gain extractor generating a set of ideal signal gain values for each of the spectrum of discrete frequency segments in the frequency domain representation of the noisy input speech signal as an interdependent function of the partially denoised signal spectrum values, the second noise gain extractor being a second neural network independently trained on the first stage outputs to progressively derive the clean signal power spectrum values as a refinement of the partially denoised signal spectrum values from the first stage based upon identifying optimal neural network weight values from predetermined criteria tuned for speech captured from noisy environments;

a second noise signal processor applying the set of ideal signal gain values to the frequency domain representation of the noisy input speech signal with clean signal power spectrum values being generated therefrom; and

a signal reconstructor receptive to the clean signal power spectrum values and the noisy input speech signal, a set of time-domain clean signal values representative of a cleaned underlying speech being generated by the signal reconstructor.
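
The data flow recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The two trained neural networks are replaced here by simple heuristic stand-ins (`placeholder_noise_gain` and `placeholder_signal_gain`, both hypothetical names). The frame length, the Hann window, the 50% overlap, and the reuse of the noisy phase during reconstruction are likewise assumptions chosen only to make the example runnable.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero


def placeholder_noise_gain(power):
    """Hypothetical stand-in for the first trained network.

    Returns per-bin noise gains in [0, 1] from a crude median noise floor.
    """
    noise_floor = np.median(power)
    return np.clip(noise_floor / (power + EPS), 0.0, 1.0)


def placeholder_signal_gain(partial, power):
    """Hypothetical stand-in for the second trained network.

    Returns per-bin signal gains in [0, 1] from the first-stage output.
    """
    return np.clip(partial / (power + EPS), 0.0, 1.0)


def denoise_frame(frame, window):
    """Run both stages on one frame; return (time_frame, noisy_power, clean_power)."""
    spectrum = np.fft.rfft(frame * window)
    power = np.abs(spectrum) ** 2

    # Stage 1: noise gains applied to the noisy spectrum give a noise power estimate,
    # which is subtracted from the noisy power (floored at zero).
    noise_power = placeholder_noise_gain(power) * power
    partial = np.maximum(power - noise_power, 0.0)

    # Stage 2: signal gains derived from the partially denoised values are
    # applied to the noisy spectrum to give clean power spectrum values.
    clean_power = placeholder_signal_gain(partial, power) * power

    # Reconstruction: clean magnitude combined with the noisy phase (an assumption).
    clean_spectrum = np.sqrt(clean_power) * np.exp(1j * np.angle(spectrum))
    return np.fft.irfft(clean_spectrum, n=len(frame)), power, clean_power


def denoise(signal, frame_len=256):
    """Frame the signal with 50% overlap, denoise each frame, and overlap-add."""
    hop = frame_len // 2
    window = np.hanning(frame_len)
    out = np.zeros(len(signal))
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame_out, _, _ = denoise_frame(signal[start:start + frame_len], window)
        out[start:start + frame_len] += frame_out
    return out
```

Calling `denoise(noisy)` on a one-dimensional NumPy array returns an array of the same length. In a system along the lines of the claim, the two placeholder functions would be replaced by the independently trained first and second neural networks.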