CPC G10L 21/0224 (2013.01) [G10L 15/063 (2013.01); G10L 15/22 (2013.01); G10L 25/30 (2013.01); G10L 2015/223 (2013.01); G10L 2021/02082 (2013.01)]

16 Claims

1. An audio signal processing method, comprising:
eliminating at least part of a linear echo signal from a mixed voice signal, to obtain an intermediate processing signal; wherein the mixed voice signal is obtained by mixing a target voice signal with an echo signal, and the echo signal is generated in an environment where the target voice signal is located and comprises the linear echo signal and a nonlinear echo signal; and
removing the nonlinear echo signal and a residual part of the linear echo signal from the intermediate processing signal, by using a target fully convolutional neural network model, to obtain an approximate target voice signal, wherein the target fully convolutional neural network model comprises at least two convolutional layers, and a convolutional layer in the target fully convolutional neural network model is configured to perform convolution processing on audio frames in the intermediate processing signal and remove the nonlinear echo signal and the residual part of the linear echo signal from the intermediate processing signal;
wherein the audio frames on which the convolution processing is performed by the convolutional layer in a time dimension comprise: a t-th audio frame at time t, a (t−1)-th audio frame at time t−1, …, and a (t−N)-th audio frame at time t−N; wherein N is an integer greater than or equal to 1, t is an integer greater than or equal to 1, and the time t is a current time; and
in a case that a value of t is 1, a 1-st audio frame represents a first audio frame in the intermediate processing signal, and a 0-th audio frame to a (1−N)-th audio frame are preset frames;
wherein the method further comprises: setting the N preset frames before the first audio frame in the intermediate processing signal, to update the intermediate processing signal, so that the first N frames of the updated intermediate processing signal are the preset frames; wherein for each convolutional layer after a first convolutional layer, before the convolution processing of the convolutional layer, the N preset frames are set for the convolutional layer, and N is equal to a size of a convolution kernel of the convolutional layer in the time dimension minus 1.
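For illustration only, the following minimal sketch shows one way the claim's first step, eliminating at least part of the linear echo signal to obtain the intermediate processing signal, could be realized. The claim does not name a linear echo canceller; the NLMS adaptive filter, the far-end reference input, the filter length, and the step size below are assumptions, not the patented method.

```python
# Minimal sketch of the linear echo cancellation step (an assumption: the claim
# does not specify the canceller; NLMS with a far-end reference is used here
# purely for illustration).
import numpy as np


def nlms_linear_aec(mixed, far_end, filter_len=256, step=0.5, eps=1e-8):
    """Return an intermediate processing signal: the mixed voice signal minus
    an estimate of the linear echo of the far-end (loudspeaker) reference."""
    w = np.zeros(filter_len)                    # adaptive filter taps
    intermediate = np.zeros(len(mixed))
    for n in range(len(mixed)):
        # Most recent far-end samples, newest first, zero-padded at the start.
        lo = max(0, n - filter_len + 1)
        x = far_end[lo:n + 1][::-1]
        x = np.concatenate([x, np.zeros(filter_len - len(x))])
        echo_est = np.dot(w, x)                 # estimated linear echo sample
        e = mixed[n] - echo_est                 # residual = intermediate sample
        w += step * e * x / (np.dot(x, x) + eps)  # NLMS tap update
        intermediate[n] = e
    return intermediate
```

The residual returned by this sketch still contains the nonlinear echo signal and a residual part of the linear echo signal, which is what the convolutional model recited in the remainder of the claim is meant to remove.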
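The padding scheme recited above, prepending N preset frames before each convolutional layer with N equal to the size of that layer's convolution kernel in the time dimension minus 1, can be sketched as a small fully convolutional model. This is an illustrative sketch under stated assumptions, not the patented implementation: the feature dimension, channel counts, kernel size, activations, and the use of zero-valued frames as the preset frames are all assumptions.

```python
# Minimal sketch of the causal padding recited in the claim: N preset frames
# (zeros here, an assumption) are prepended before each convolutional layer,
# with N = kernel size in the time dimension minus 1. Feature size, channels,
# kernel size, and activations are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalConvBlock(nn.Module):
    """One convolutional layer preceded by its own N preset frames."""

    def __init__(self, in_ch, out_ch, kernel_size, activation=True):
        super().__init__()
        self.n_preset = kernel_size - 1       # N for this layer
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size)
        self.activation = activation

    def forward(self, x):
        # x: (batch, channels, frames). Prepending N frames means the output
        # at frame t depends only on input frames t, t-1, ..., t-N.
        x = F.pad(x, (self.n_preset, 0))
        y = self.conv(x)
        return torch.relu(y) if self.activation else y


class FullyConvolutionalEchoRemover(nn.Module):
    """At least two convolutional layers, each with its own preset frames."""

    def __init__(self, feat_dim=257, hidden=64, kernel_size=4):
        super().__init__()
        self.block1 = CausalConvBlock(feat_dim, hidden, kernel_size)
        self.block2 = CausalConvBlock(hidden, feat_dim, kernel_size,
                                      activation=False)

    def forward(self, intermediate):
        # intermediate: per-frame features of the intermediate processing
        # signal, shape (batch, feat_dim, frames); the output approximates
        # the target voice signal frame by frame.
        return self.block2(self.block1(intermediate))


if __name__ == "__main__":
    frames = torch.randn(1, 257, 100)         # 100 feature frames
    out = FullyConvolutionalEchoRemover()(frames)
    print(out.shape)                          # torch.Size([1, 257, 100])
```

Because every layer is padded on the left only, the output at frame t depends only on frames t, t−1, …, t−N of that layer's input, matching the causal frame set recited in the claim, and the number of frames is preserved through the network.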