US 12,094,481 B2
ADL-UFE: all deep learning unified front-end system
Yong Xu, Bellevue, WA (US); Meng Yu, Bellevue, WA (US); Shi-Xiong Zhang, Redmond, WA (US); and Dong Yu, Bellevue, WA (US)
Assigned to TENCENT AMERICA LLC, Palo Alto, CA (US)
Filed by TENCENT AMERICA LLC, Palo Alto, CA (US)
Filed on Nov. 18, 2021, as Appl. No. 17/455,497.
Prior Publication US 2023/0154480 A1, May 18, 2023
Int. Cl. G10L 21/0208 (2013.01); G06N 3/044 (2023.01); G06N 3/08 (2023.01); G10L 21/0216 (2013.01); G10L 21/0264 (2013.01); G10L 25/30 (2013.01)
CPC G10L 21/0208 (2013.01) 20 Claims
OG exemplary drawing
 
1. A method for generating enhanced target speech from audio data, performed by a computing device, the method comprising:
receiving audio data corresponding to one or more speakers;
generating an estimated target speech, an estimated noise, and an estimated echo simultaneously based on the audio data using a jointly trained complex ratio mask;
predicting frame-level multi-tap time-frequency (T-F) spatio-temporal-echo filter weights, using a trained neural network model, based on a first intermediate concatenation generated by concatenating the estimated target speech and the estimated echo and a second intermediate concatenation generated by concatenating the estimated noise and the estimated echo,
wherein the estimated target speech and the estimated echo are jointly modeled using the trained neural network model; and
predicting enhanced target speech based on the frame-level multi-tap T-F spatio-temporal-echo filter weights.
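
The following is a minimal, illustrative sketch (in PyTorch) of the pipeline recited in claim 1, not the patented implementation: a jointly trained complex-ratio-mask network estimates target speech, noise, and echo from the mixture STFT; a second network predicts frame-level multi-tap T-F filter weights from the two intermediate concatenations [target; echo] and [noise; echo]; and the weights are applied across past frames to produce the enhanced target speech. All module names, layer choices (LSTM), and dimensions (N_FREQ, HIDDEN, TAPS) are assumptions made for illustration only.

```python
# Illustrative sketch only; hypothetical sizes and layers, not the patented design.
import torch
import torch.nn as nn

N_FREQ = 257   # STFT frequency bins (assumed)
HIDDEN = 256   # hidden units (assumed)
TAPS = 5       # filter taps per T-F bin (assumed)

class ComplexRatioMaskNet(nn.Module):
    """Jointly estimates target speech, noise, and echo via complex ratio masks."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(input_size=2 * N_FREQ, hidden_size=HIDDEN, batch_first=True)
        # Three complex masks (real + imaginary parts each), predicted simultaneously.
        self.mask_head = nn.Linear(HIDDEN, 3 * 2 * N_FREQ)

    def forward(self, mix_spec):
        # mix_spec: (batch, frames, freq) complex STFT of the microphone signal.
        feats = torch.cat([mix_spec.real, mix_spec.imag], dim=-1)
        h, _ = self.rnn(feats)
        masks = self.mask_head(h).view(*h.shape[:2], 3, 2, N_FREQ)
        cmask = torch.complex(masks[..., 0, :], masks[..., 1, :])
        # Apply the jointly trained complex ratio masks to the mixture spectrum.
        target_est = cmask[:, :, 0] * mix_spec
        noise_est = cmask[:, :, 1] * mix_spec
        echo_est = cmask[:, :, 2] * mix_spec
        return target_est, noise_est, echo_est

class FilterWeightNet(nn.Module):
    """Predicts frame-level multi-tap T-F filter weights from the two concatenations."""
    def __init__(self):
        super().__init__()
        # Input: [target; echo] and [noise; echo], each split into real/imaginary parts.
        self.rnn = nn.LSTM(input_size=8 * N_FREQ, hidden_size=HIDDEN, batch_first=True)
        self.weight_head = nn.Linear(HIDDEN, 2 * TAPS * N_FREQ)

    def forward(self, target_est, noise_est, echo_est):
        cat1 = torch.cat([target_est, echo_est], dim=-1)  # first intermediate concatenation
        cat2 = torch.cat([noise_est, echo_est], dim=-1)   # second intermediate concatenation
        feats = torch.cat([cat1.real, cat1.imag, cat2.real, cat2.imag], dim=-1)
        h, _ = self.rnn(feats)
        w = self.weight_head(h).view(*h.shape[:2], 2, TAPS, N_FREQ)
        # Complex filter weights per frame, tap, and frequency bin.
        return torch.complex(w[..., 0, :, :], w[..., 1, :, :])

def apply_multitap_filter(mix_spec, weights):
    """Applies the predicted multi-tap filter over past frames of the mixture STFT."""
    # Wrap-around at the first frames is a simplification of causal zero-padding.
    frames = [torch.roll(mix_spec, shifts=t, dims=1) for t in range(TAPS)]
    stacked = torch.stack(frames, dim=2)            # (batch, frames, taps, freq)
    return (weights.conj() * stacked).sum(dim=2)    # enhanced target speech STFT
```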