US 11,967,316 B2
Audio recognition method, and method, apparatus, and device for positioning target audio
Jimeng Zheng, Shenzhen (CN); Ian Ernan Liu, Shenzhen (CN); Yi Gao, Shenzhen (CN); and Weiwei Li, Shenzhen (CN)
Assigned to TENCENT TECHNOLOGY (SHENZHEN) COMPANY LIMITED, Shenzhen (CN)
Filed by Tencent Technology (Shenzhen) Company Limited, Shenzhen (CN)
Filed on Feb. 23, 2021, as Appl. No. 17/183,209.
Application 17/183,209 is a continuation of application No. PCT/CN2019/121946, filed on Nov. 29, 2019.
Claims priority of application No. 201811455880.6 (CN), filed on Nov. 30, 2018.
Prior Publication US 2021/0174792 A1, Jun. 10, 2021
Int. Cl. G10L 15/22 (2006.01); G01S 3/80 (2006.01); G01S 3/802 (2006.01); G10L 15/08 (2006.01); G10L 15/20 (2006.01); G10L 21/0224 (2013.01); G10L 21/0232 (2013.01); G10L 25/51 (2013.01); G10L 21/0208 (2013.01); G10L 21/0216 (2013.01)
CPC G10L 15/20 (2013.01) [G01S 3/8006 (2013.01); G01S 3/802 (2013.01); G10L 15/08 (2013.01); G10L 15/22 (2013.01); G10L 21/0224 (2013.01); G10L 21/0232 (2013.01); G10L 25/51 (2013.01); G10L 2015/088 (2013.01); G10L 2021/02082 (2013.01); G10L 2021/02166 (2013.01)] 20 Claims
OG exemplary drawing
 
1. An audio recognition method, comprising:
obtaining audio signals collected in a plurality of directions in a space, the audio signals comprising a target-audio direct signal;
performing echo cancellation on the audio signals;
obtaining weights of a plurality of time-frequency points in the echo-canceled audio signals, a respective weight of each time-frequency point indicating a relative proportion of the target-audio direct signal in the echo-canceled audio signals at the time-frequency point;
weighting time-frequency components of the audio signals at the plurality of time-frequency points separately for each of the plurality of directions by using the weights of the plurality of time-frequency points, to obtain a weighted audio signal energy distribution of the audio signals in the plurality of directions, further including:
obtaining a weighted covariance matrix of each of the plurality of time-frequency points based at least in part on the obtained weights of the plurality of time-frequency points, starting and ending time points of a target wakeup word in the echo-canceled audio signals, and time-frequency domain expressions of the echo-canceled audio signals; and
performing weighted calculation on a spatial spectrum of the audio signals by using the weighted covariance matrix, to obtain the spatial spectrum of the audio signals weighted at the plurality of time-frequency points;
obtaining a sound source azimuth corresponding to the target-audio direct signal in the audio signals by using the weighted audio signal energy distribution of the audio signals in the plurality of directions; and
performing audio recognition on the audio signals based on the sound source azimuth corresponding to the target-audio direct signal.
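
For illustration only (not part of the claim text), the following is a minimal Python sketch of the weighting, weighted-covariance, spatial-spectrum, and azimuth-estimation steps recited in claim 1. It assumes a uniform linear microphone array, an STFT front end, and a simple normalized-energy mask as the per-time-frequency-point weight; the array geometry, the mask formula, the steered-response spectrum, and all names (tf_weights, weighted_covariance, MIC_SPACING, and so on) are hypothetical placeholders, not taken from the patent.

    # Hypothetical sketch of the claim-1 data flow: per-bin weights ->
    # weighted covariance over the wake-word interval -> weighted spatial
    # spectrum -> sound source azimuth. Geometry and mask are assumptions.
    import numpy as np

    C = 343.0            # speed of sound (m/s)
    FS = 16000           # sampling rate (Hz)
    N_FFT = 512          # STFT size
    N_MICS = 4
    MIC_SPACING = 0.035  # metres between adjacent microphones (assumed geometry)


    def tf_weights(stft, noise_floor=1e-3):
        """Weight per time-frequency point: relative proportion of the
        target-audio direct signal, approximated here by a normalized-energy
        (SNR-like) mask with values in (0, 1)."""
        energy = np.abs(stft).mean(axis=0) ** 2        # (frames, bins)
        return energy / (energy + noise_floor)


    def weighted_covariance(stft, weights, start_frame, end_frame):
        """Weighted covariance matrix per frequency bin, accumulated only over
        the frames between the wake-word starting and ending time points."""
        n_mics, _, n_bins = stft.shape
        r = np.zeros((n_bins, n_mics, n_mics), dtype=complex)
        for t in range(start_frame, end_frame):
            for f in range(n_bins):
                x = stft[:, t, f][:, None]             # (mics, 1) snapshot
                r[f] += weights[t, f] * (x @ x.conj().T)
        return r


    def steering_vector(azimuth_rad, freq_hz):
        """Far-field steering vector for the assumed uniform linear array."""
        mic_pos = np.arange(N_MICS) * MIC_SPACING
        delays = mic_pos * np.cos(azimuth_rad) / C
        return np.exp(-2j * np.pi * freq_hz * delays)


    def weighted_spatial_spectrum(r, candidate_azimuths):
        """Steered-response power using the weighted covariance matrices:
        P(theta) = sum_f a(theta, f)^H R_w[f] a(theta, f)."""
        n_bins = r.shape[0]
        freqs = np.fft.rfftfreq(N_FFT, 1.0 / FS)[:n_bins]
        spectrum = np.zeros(len(candidate_azimuths))
        for i, theta in enumerate(candidate_azimuths):
            for f, freq in enumerate(freqs):
                a = steering_vector(theta, freq)
                spectrum[i] += np.real(a.conj() @ r[f] @ a)
        return spectrum


    if __name__ == "__main__":
        # Placeholder echo-cancelled multi-channel STFT: (mics, frames, bins).
        rng = np.random.default_rng(0)
        shape = (N_MICS, 100, N_FFT // 2 + 1)
        stft = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

        w = tf_weights(stft)
        r_w = weighted_covariance(stft, w, start_frame=20, end_frame=60)

        azimuths = np.deg2rad(np.arange(0, 181, 5))    # candidate directions
        p = weighted_spatial_spectrum(r_w, azimuths)
        print("estimated azimuth (deg):", np.rad2deg(azimuths[np.argmax(p)]))

In this sketch the wake-word start and end frames simply bound the accumulation window of the weighted covariance matrices; a practical system would obtain them from a keyword-spotting module and would replace the toy mask with a more principled estimate of the direct-path proportion at each time-frequency point.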