CPC G10L 21/0208 (2013.01) | 20 Claims |
1. A method for generating enhanced target speech from audio data, performed by a computing device, the method comprising:
receiving audio data corresponding to one or more speakers;
generating an estimated target speech, an estimated noise, and an estimated echo simultaneously based on the audio data using a jointly trained complex ratio mask;
predicting frame-level multi-tap time-frequency (T-F) spatio-temporal-echo filter weights using a first intermediate concatenation generated by concatenating the estimated target speech and the estimated echo and a second intermediate concatenation by concatenating the estimated noise and the estimated echo using a trained neural network model,
wherein the estimated target speech and the estimated echo are jointly modeled using the trained neural network model; and
predicting enhanced target speech based on the frame-level multi-tap T-F spatio-temporal-echo filter weights.
|
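The data flow recited in claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation: the trained neural network is replaced by a hypothetical stand-in (`make_passthrough_predictor`), and applying the weights to the mixture channels stacked with the estimated echo is an assumption, since the claim does not say which signals the filter acts on. All function names and the `[channel][frame][freq]` layout are illustrative only.

```python
from typing import Callable, Dict, List

Spec = List[List[List[complex]]]        # [channel][frame][freq] complex spectrogram
Weights = List[List[List[List[complex]]]]  # [frame][freq][stacked channel][tap]


def apply_mask(mask: Spec, mix: Spec) -> Spec:
    """Element-wise complex ratio mask: estimate = mask * mixture."""
    return [[[m * y for m, y in zip(mf, yf)] for mf, yf in zip(mc, yc)]
            for mc, yc in zip(mask, mix)]


def apply_multitap_filter(weights: Weights, stacked: Spec, taps: int) -> List[List[complex]]:
    """out[t][f] = sum over channels k and taps l of conj(w[t][f][k][l]) * x[k][t-l][f].

    Frames before t = 0 are treated as zero.
    """
    n_frames, n_freqs = len(stacked[0]), len(stacked[0][0])
    out = [[0j] * n_freqs for _ in range(n_frames)]
    for t in range(n_frames):
        for f in range(n_freqs):
            acc = 0j
            for k, chan in enumerate(stacked):
                for l in range(taps):
                    if t - l >= 0:
                        acc += weights[t][f][k][l].conjugate() * chan[t - l][f]
            out[t][f] = acc
    return out


def make_passthrough_predictor(taps: int) -> Callable[[Spec, Spec], Weights]:
    """Hypothetical stand-in for the trained network: it ignores the content of
    both concatenations and puts weight 1 on channel 0, tap 0."""
    def predict(speech_echo: Spec, noise_echo: Spec) -> Weights:
        n_ch, n_frames, n_freqs = len(speech_echo), len(speech_echo[0]), len(speech_echo[0][0])
        return [[[[1 + 0j if (k == 0 and l == 0) else 0j for l in range(taps)]
                  for k in range(n_ch)]
                 for _ in range(n_freqs)]
                for _ in range(n_frames)]
    return predict


def enhance(mix: Spec, masks: Dict[str, Spec],
            predict_weights: Callable[[Spec, Spec], Weights], taps: int) -> List[List[complex]]:
    # Step 1: estimate speech, noise, and echo from the same mixture.
    speech = apply_mask(masks["speech"], mix)
    noise = apply_mask(masks["noise"], mix)
    echo = apply_mask(masks["echo"], mix)
    # Step 2: form the two intermediate concatenations along the channel axis.
    speech_echo = speech + echo
    noise_echo = noise + echo
    # Step 3: predict frame-level multi-tap T-F filter weights.
    weights = predict_weights(speech_echo, noise_echo)
    # Step 4 (assumption): filter the mixture stacked with the estimated echo.
    return apply_multitap_filter(weights, mix + echo, taps)
```

With the pass-through stand-in, `enhance` returns the first mixture channel unchanged, which makes the plumbing easy to check before a real model is substituted for `predict_weights`.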