| CPC G10L 15/16 (2013.01) [G10L 25/21 (2013.01)] | 20 Claims |

|
1. A method of speech recognition, executable by a processor, comprising:
receiving audio data corresponding to one or more speakers;
applying, by a complex ratio filter, a complex ratio mask to estimate a covariance matrix of a target speech and a covariance matrix of a noise associated with the received audio data;
estimating, by inputting the covariance matrix of the target speech to a first gated recurrent unit-based network (GRU-Net), a steering vector of the target speech;
estimating, by inputting the covariance matrix of the noise to a second GRU-Net, an inverse of the covariance matrix of the noise; and
generating, by using a minimum variance distortionless response (MVDR) jointly trained with a recurrent neural network (RNN), a predicted target waveform corresponding to a target speaker from among the one or more speakers based on the estimated steering vector and inverse, wherein MVDR coefficients are calculated using the first GRU-Net and the second GRU-Net and the predicted target waveform is used for speech recognition.
|
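The beamforming arithmetic behind claim 1 can be illustrated with a minimal NumPy sketch. It is not the claimed method. The claim estimates the steering vector and the inverse noise covariance with two GRU-Nets, while this sketch swaps in the classical closed-form substitutes (principal eigenvector and direct matrix inversion) so that it runs without a trained model. It also uses a per-bin complex mask as a simplified stand-in for the complex ratio filter. The function names, the array layout (frequency, frame, microphone), and the normalisation by frame count are all assumptions made for illustration.

```python
import numpy as np

def masked_covariance(Y, mask):
    """Covariance of a masked multichannel STFT (hypothetical helper).

    Y:    (F, T, M) complex STFT, F bins, T frames, M microphones.
    mask: (F, T) complex ratio mask for speech or for noise.
    Returns an (F, M, M) Hermitian matrix per frequency bin.
    """
    S = mask[..., None] * Y                      # apply mask to every channel
    # Sum of outer products s s^H over frames, averaged by frame count
    # (the averaging choice is an assumption of this sketch).
    return np.einsum('ftm,ftn->fmn', S, S.conj()) / Y.shape[1]

def principal_steering_vector(speech_cov):
    """Classical substitute for the first GRU-Net: principal eigenvector."""
    _, vecs = np.linalg.eigh(speech_cov)         # eigenvalues in ascending order
    return vecs[..., -1]                         # (F, M), largest eigenvalue

def mvdr_weights(steering, noise_cov_inv):
    """MVDR coefficients w = inv(Phi_nn) v / (v^H inv(Phi_nn) v)."""
    num = np.einsum('fmn,fn->fm', noise_cov_inv, steering)
    den = np.einsum('fm,fm->f', steering.conj(), num)
    return num / den[:, None]

def apply_beamformer(w, Y):
    """Beamformed STFT w^H y for every bin and frame, shape (F, T)."""
    return np.einsum('fm,ftm->ft', w.conj(), Y)

# Usage with random data standing in for real recordings and learned masks.
rng = np.random.default_rng(0)
F, T, M = 5, 40, 4
Y = rng.standard_normal((F, T, M)) + 1j * rng.standard_normal((F, T, M))
speech_mask = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))
noise_mask = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))

phi_ss = masked_covariance(Y, speech_mask)
# Small diagonal loading keeps the noise covariance safely invertible.
phi_nn = masked_covariance(Y, noise_mask) + 1e-6 * np.eye(M)

v = principal_steering_vector(phi_ss)
w = mvdr_weights(v, np.linalg.inv(phi_nn))       # substitute for the second GRU-Net
enhanced = apply_beamformer(w, Y)                # predicted target STFT
```

In the claimed arrangement the eigendecomposition and the explicit inversion would be replaced by the outputs of the first and second GRU-Nets respectively. One motivation for such a replacement is that direct matrix inversion can be numerically unstable when a beamformer is trained jointly with a neural network.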