US 12,456,457 B2
All deep learning minimum variance distortionless response beamformer for speech separation and enhancement
Yong Xu, Bellevue, WA (US); Meng Yu, Bellevue, WA (US); Shi-Xiong Zhang, Redmond, WA (US); and Dong Yu, Bellevue, WA (US)
Assigned to TENCENT AMERICA LLC, Palo Alto, CA (US)
Filed by TENCENT AMERICA LLC, Palo Alto, CA (US)
Filed on May 23, 2022, as Appl. No. 17/750,973.
Application 17/750,973 is a continuation of application No. 17/038,498, filed on Sep. 30, 2020, granted, now 11,380,307.
Prior Publication US 2022/0284885 A1, Sep. 8, 2022
Int. Cl. G10L 15/16 (2006.01); G10L 25/21 (2013.01)
CPC G10L 15/16 (2013.01) [G10L 25/21 (2013.01)]
20 Claims
1. A method of speech recognition, executable by a processor, comprising:
receiving audio data corresponding to one or more speakers;
applying, by a complex ratio filter, a complex ratio mask to estimate a covariance matrix of a target speech and a covariance matrix of a noise associated with the received audio data;
estimating, by inputting the covariance matrix of the target speech to a first gated recurrent unit-based network (GRU-Net), a steering vector of the target speech;
estimating, by inputting the covariance matrix of the noise to a second GRU-Net, an inverse of the covariance matrix of the noise; and
generating, by using a minimum variance distortionless response (MVDR) jointly trained with a recurrent neural network (RNN), a predicted target waveform corresponding to a target speaker from among the one or more speakers based on the estimated steering vector and inverse, wherein MVDR coefficients are calculated using the first GRU-Net and the second GRU-Net and the predicted target waveform is used for speech recognition.
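The data flow recited in claim 1 can be illustrated with a minimal sketch for a single frequency bin. It is not the patented implementation: the mask is applied per time-frequency bin instead of through a trained complex ratio filter, and the steering vector and inverse noise covariance, which the claim obtains from the first and second GRU-Nets, are simply passed in as given values. The combination step uses the conventional MVDR formula w = Φ_nn⁻¹ v / (vᴴ Φ_nn⁻¹ v). The function names `masked_covariance`, `mvdr_weights`, and `beamform` are hypothetical and chosen only for this illustration.

```python
def masked_covariance(mask, mix):
    """Covariance of mask-filtered multichannel STFT frames at one frequency bin.

    mask: list of T complex mask values (one per frame, shared across channels)
    mix:  list of T frames, each a list of M complex channel values
    Assumption: a per-bin complex ratio mask stands in for the complex ratio filter.
    """
    n_ch = len(mix[0])
    cov = [[0j] * n_ch for _ in range(n_ch)]
    for m, frame in zip(mask, mix):
        est = [m * y for y in frame]  # masked (estimated) source, per channel
        for i in range(n_ch):
            for j in range(n_ch):
                cov[i][j] += est[i] * est[j].conjugate()
    # Average the outer products over the T frames.
    return [[c / len(mix) for c in row] for row in cov]


def mvdr_weights(noise_cov_inv, steering):
    """Conventional MVDR weights: w = Phi_nn^{-1} v / (v^H Phi_nn^{-1} v).

    Assumption: noise_cov_inv and steering are supplied directly here; in the
    claim they are the outputs of the second and first GRU-Net, respectively.
    """
    n_ch = len(steering)
    num = [
        sum(noise_cov_inv[i][j] * steering[j] for j in range(n_ch))
        for i in range(n_ch)
    ]
    den = sum(steering[i].conjugate() * num[i] for i in range(n_ch))
    return [x / den for x in num]


def beamform(weights, frame):
    """Apply the beamformer to one multichannel frame: w^H y."""
    return sum(w.conjugate() * y for w, y in zip(weights, frame))


if __name__ == "__main__":
    # Two-channel toy values, invented purely for illustration.
    steering = [1 + 0j, 0.5 - 0.5j]
    noise_cov_inv = [[2 + 0j, 0j], [0j, 1 + 0j]]
    w = mvdr_weights(noise_cov_inv, steering)
    # The distortionless constraint requires w^H v to equal 1.
    print(abs(beamform(w, steering) - 1) < 1e-12)
```

The check at the end exercises the distortionless constraint, which holds for any steering vector and inverse covariance with a nonzero denominator. A jointly trained system as described in the claim would learn these quantities end to end, which this sketch does not attempt.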