US 12,236,974 B2
Method and apparatus for processing signal, computer readable medium
Libiao Yu, Beijing (CN); Guochang Zhang, Beijing (CN); and Jianqiang Wei, Beijing (CN)
Assigned to Beijing Baidu Netcom Science Technology Co., Ltd., Beijing (CN)
Filed by Beijing Baidu Netcom Science Technology Co., Ltd., Beijing (CN)
Filed on Jul. 25, 2022, as Appl. No. 17/872,909.
Claims priority of application No. 202111440574.7 (CN), filed on Nov. 30, 2021.
Prior Publication US 2022/0358951 A1, Nov. 10, 2022
Int. Cl. G10L 25/51 (2013.01); G10L 15/16 (2006.01); G10L 21/0224 (2013.01); G10L 25/18 (2013.01); G10L 15/06 (2013.01); G10L 21/0208 (2013.01)

CPC G10L 25/51 (2013.01) [G10L 15/16 (2013.01); G10L 21/0224 (2013.01); G10L 25/18 (2013.01); G10L 2015/0631 (2013.01); G10L 2021/02082 (2013.01)]

18 Claims

1. A method for processing a signal, the method comprising:

acquiring a reference signal of a to-be-tested voice, the reference signal being a signal output to a voice output device, wherein the voice output device outputs the to-be-tested voice after obtaining the reference signal;

receiving, from a voice input device, an echo signal of the to-be-tested voice, the echo signal being a signal of the to-be-tested voice collected by the voice input device;

performing signal preprocessing on the reference signal and the echo signal respectively; and

inputting the processed reference signal and the processed echo signal into a pre-trained time delay estimation model, to obtain a time difference between the reference signal and the echo signal output by the time delay estimation model, the time delay estimation model being used to represent a corresponding relationship between the reference signal, the echo signal and the time difference,

wherein the time delay estimation model is configured to extract a feature of the reference signal and a feature of the echo signal, and is obtained by training operations based on long-term correlations between features of reference signals and features of echo signals;

the time delay estimation model comprises: a convolutional neural network, a temporal convolutional network, and a fully connected layer, wherein the convolutional neural network, the temporal convolutional network and the fully connected layer are connected in sequence,

the convolutional neural network is configured to extract and deeply fuse the feature of the reference signal and the feature of the echo signal,

the temporal convolutional network is configured to learn the long-term correlation between the feature of the reference signal and the feature of the echo signal, and

the fully connected layer is configured to extract the time delay between the reference signal and the echo signal.
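
Illustrative note (not part of the claim): the quantity the claimed model outputs is the time difference between the reference signal sent to the voice output device and the echo picked up by the voice input device. The sketch below makes that input/output relationship concrete using a classical cross-correlation search, which is a stand-in technique and not the claimed neural time delay estimation model. The function names `estimate_delay` and `simulate_echo`, the 16 kHz sample rate, the echo gain, and the noise level are all assumptions chosen for illustration.

```python
import random


def estimate_delay(reference, echo, max_lag):
    """Return the lag (in samples, 0..max_lag) that maximizes the
    cross-correlation between `reference` and `echo`.

    Classical baseline only; it is NOT the CNN + TCN + fully connected
    model recited in the claim.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        # Correlate reference[i] with echo[i + lag] over the overlap.
        n = min(len(reference), len(echo) - lag)
        score = sum(reference[i] * echo[i + lag] for i in range(n))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def simulate_echo(reference, delay, gain=0.6, noise=0.05, seed=0):
    """Hypothetical echo path: the reference delayed by `delay` samples,
    attenuated by `gain`, with additive Gaussian noise."""
    rng = random.Random(seed)
    echo = [0.0] * delay + [gain * x for x in reference]
    return [y + rng.gauss(0.0, noise) for y in echo]


SAMPLE_RATE = 16000  # Hz (assumed; the claim does not specify a rate)

# Synthetic white-noise "reference signal" standing in for the test voice.
rng = random.Random(1)
reference = [rng.uniform(-1.0, 1.0) for _ in range(2000)]

true_delay = 37  # samples (assumed ground truth for the simulation)
echo = simulate_echo(reference, true_delay)

lag = estimate_delay(reference, echo, max_lag=200)
delay_ms = 1000.0 * lag / SAMPLE_RATE
print(f"estimated delay: {lag} samples ({delay_ms:.4f} ms)")
```

A plain correlation search of this kind works on a clean, broadband test signal like the synthetic one used here. The claim instead recites a trained model that extracts features of both signals and learns their long-term correlation.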