US 12,073,828 B2
Method and apparatus for speech source separation based on a convolutional neural network
Jundai Sun, Beijing (CN); Zhiwei Shuang, Beijing (CN); Lie Lu, Dublin, CA (US); Shaofan Yang, Beijing (CN); and Jia Dai, Beijing (CN)
Assigned to Dolby Laboratories Licensing Corporation, San Francisco, CA (US)
Appl. No. 17/611,121
Filed by Dolby Laboratories Licensing Corporation, San Francisco, CA (US)
PCT Filed May 13, 2020, PCT No. PCT/US2020/032762
§ 371(c)(1), (2) Date Nov. 12, 2021,
PCT Pub. No. WO2020/232180, PCT Pub. Date Nov. 19, 2020.
Claims priority of provisional application 62/856,888, filed on Jun. 4, 2019.
Claims priority of application No. PCT/CN2019/086769 (WO), filed on May 14, 2019; and application No. 19188010 (EP), filed on Jul. 24, 2019.
Prior Publication US 2022/0223144 A1, Jul. 14, 2022
Int. Cl. G10L 15/20 (2006.01); G06N 3/08 (2023.01); G10L 15/16 (2006.01); G10L 15/22 (2006.01); G10L 21/0308 (2013.01); G10L 25/18 (2013.01)
CPC G10L 15/20 (2013.01) [G06N 3/08 (2013.01); G10L 15/16 (2013.01); G10L 15/22 (2013.01); G10L 21/0308 (2013.01); G10L 25/18 (2013.01)] 26 Claims
OG exemplary drawing
 
1. A method for Convolutional Neural Network (CNN) based speech source separation, wherein the method includes:
providing multiple frames of a time-frequency transform of an original noisy speech signal;
inputting the time-frequency transform of said multiple frames into an aggregated multi-scale CNN having a plurality of parallel convolution paths, wherein each parallel convolution path, out of the plurality of parallel convolution paths of the CNN, includes a cascade of L convolution layers, wherein L is a natural number > 1, wherein an l-th layer among the L layers has N_l filters, with l = 1 … L, wherein the filter size of the filters is different between different parallel convolution paths and wherein the filter size of the filters is the same within each parallel convolution path;
extracting and outputting, by each parallel convolution path, features from the input time-frequency transform of said multiple frames;
obtaining an aggregated output of the outputs of the parallel convolution paths; and
generating an output mask for extracting speech from the original noisy speech signal based on the aggregated output.
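The following is a minimal sketch of the kind of architecture claim 1 describes: parallel convolution paths that share a kernel size within a path but differ across paths, each path a cascade of L layers with N_l filters, whose outputs are aggregated and mapped to a mask applied to the noisy spectrogram. The framework (PyTorch), the specific kernel sizes and filter counts, the aggregation by summation, and the sigmoid mask non-linearity are illustrative assumptions, not details taken from the patent.

```python
# Hypothetical sketch of an aggregated multi-scale CNN for mask-based speech
# source separation, loosely following claim 1. Layer counts, filter counts,
# kernel sizes, the aggregation rule (summation) and the mask non-linearity
# (sigmoid) are assumptions for illustration only.
import torch
import torch.nn as nn


class ConvPath(nn.Module):
    """One parallel path: a cascade of L conv layers sharing one kernel size."""

    def __init__(self, kernel_size, filters=(32, 32, 16), in_channels=1):
        super().__init__()
        layers = []
        for n_l in filters:                          # l-th layer has N_l filters
            layers += [
                nn.Conv2d(in_channels, n_l, kernel_size,
                          padding=kernel_size // 2),  # preserve time-frequency shape
                nn.ReLU(),
            ]
            in_channels = n_l
        self.path = nn.Sequential(*layers)

    def forward(self, x):
        return self.path(x)


class MultiScaleMaskNet(nn.Module):
    """Parallel paths with different kernel sizes; outputs a speech mask."""

    def __init__(self, kernel_sizes=(3, 5, 7), filters=(32, 32, 16)):
        super().__init__()
        self.paths = nn.ModuleList(
            [ConvPath(k, filters) for k in kernel_sizes]
        )
        # 1x1 convolution collapses the aggregated features to a one-channel mask
        self.mask_head = nn.Conv2d(filters[-1], 1, kernel_size=1)

    def forward(self, spec):
        # spec: (batch, 1, frames, freq_bins) magnitude time-frequency transform
        aggregated = torch.stack([p(spec) for p in self.paths]).sum(dim=0)
        return torch.sigmoid(self.mask_head(aggregated))  # mask values in [0, 1]


# Usage: multiply the estimated mask with the noisy magnitude spectrogram
# to extract the speech component (assumed multiplicative masking).
net = MultiScaleMaskNet()
noisy = torch.randn(1, 1, 64, 257).abs()   # 64 frames x 257 frequency bins
speech_estimate = net(noisy) * noisy
```

In this sketch the different kernel sizes (3, 5, 7) give each path a different receptive field, so the aggregated features combine fine and coarse time-frequency structure before the mask is produced; the patent itself does not prescribe these particular sizes or the summation used here.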