US 11,676,619 B2
Noise spatial covariance matrix estimation apparatus, noise spatial covariance matrix estimation method, and program
Tomohiro Nakatani, Tokyo (JP); Marc Delcroix, Tokyo (JP); Keisuke Kinoshita, Tokyo (JP); Shoko Araki, Tokyo (JP); and Yuki Kubo, Tokyo (JP)
Assigned to NIPPON TELEGRAPH AND TELEPHONE CORPORATION, Tokyo (JP)
Appl. No. 17/437,701
Filed by NIPPON TELEGRAPH AND TELEPHONE CORPORATION, Tokyo (JP)
PCT Filed Feb. 28, 2020, PCT No. PCT/JP2020/008216
§ 371(c)(1), (2) Date Sep. 9, 2021,
PCT Pub. No. WO2020/184210, PCT Pub. Date Sep. 17, 2020.
Claims priority of application No. JP2019-045649 (JP), filed on Mar. 13, 2019.
Prior Publication US 2022/0130406 A1, Apr. 28, 2022
Int. Cl. G10L 21/0232 (2013.01); G10K 11/175 (2006.01); G10L 21/028 (2013.01)
CPC G10L 21/0232 (2013.01) [G10K 11/1752 (2020.05); G10L 21/028 (2013.01)]
5 Claims
 
1. A noise spatial covariance matrix estimation device comprising processing circuitry configured to:
use time-frequency-divided observation signals x_{t,f} and mask information λ_{t,f}^{(j)} to acquire, for each noise source j, time-independent first noise spatial covariance matrices ψ_f^{(j)} corresponding to the time-frequency-divided observation signals x_{t,f} and the mask information λ_{t,f}^{(j)} for all t∈L, wherein j is a positive integer expressing a noise source number, J is a positive integer expressing a number of the noise sources, j=1, . . . , J holds, t is a positive integer expressing a time frame number, f is a positive integer expressing a frequency band number, L is a long time interval, the time-frequency-divided observation signals x_{t,f} are based on observation signals acquired using one or more microphones by collecting acoustic signals emitted from one or a plurality of sound sources, and the mask information λ_{t,f}^{(j)} expresses an occupancy probability of a component corresponding to each noise source j in each of the time-frequency-divided observation signals x_{t,f};
use the mask information λ_{t,f}^{(j)} for t∈B_k of each of a plurality of different short time intervals B_1, . . . , B_K to acquire a mixture weight μ_{k,f}^{(j)} corresponding to each noise source j in each short time interval B_k, wherein K is an integer greater than 1, k=1, . . . , K, each short time interval B_k is shorter than the long time interval L, and each short time interval B_k is a part of L; and
acquire and output a time-variant third noise spatial covariance matrix R_{k,f} for a noise of the acoustic signals based on a time-variant second noise spatial covariance matrix and a weighted sum of the first noise spatial covariance matrices ψ_f^{(j)} with the mixture weights μ_{k,f}^{(j)} for each short time interval B_k, wherein the second noise spatial covariance matrix corresponds to the time-frequency-divided observation signals x_{t,f} and the mask information λ_{t,f}^{(j)} for the noise source j and t∈B_k of each short time interval B_k, and the noise is formed by all of the noise sources j=1, . . . , J.
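
The three steps recited in claim 1 can be illustrated with a minimal NumPy sketch. The claim does not state the formulas, so everything specific here is an assumption made for illustration only: the first matrices are taken as mask-normalized averages of the outer products of the observation vectors over all frames, the mixture weights as the mean mask value within each short interval, the second matrix as the mask-weighted average over that interval, and the third matrix as a convex combination controlled by a hypothetical parameter `alpha`. The function names, array shapes, and the choice of equal-length, non-overlapping short intervals are likewise hypothetical and are not taken from the patent.

```python
import numpy as np


def masked_covariance(x, mask, eps=1e-10):
    """Mask-normalized average of x x^H over time.

    x: (T, F, M) complex observations, mask: (T, F) non-negative weights.
    Returns an array of shape (F, M, M).
    """
    num = np.einsum("tf,tfm,tfn->fmn", mask, x, x.conj())
    den = mask.sum(axis=0)[:, None, None] + eps
    return num / den


def estimate_noise_scm(x, masks, block_len, alpha=0.5, eps=1e-10):
    """Sketch of a time-variant noise spatial covariance estimate.

    x: (T, F, M) observations, masks: (J, T, F) noise-source masks.
    Returns R of shape (K, F, M, M) with K = T // block_len
    (trailing frames that do not fill a block are ignored).
    """
    J = masks.shape[0]
    T = x.shape[0]

    # Step 1 (assumed form): time-independent matrix per noise source,
    # computed over the long interval L, here taken as all T frames.
    psi = np.stack([masked_covariance(x, masks[j], eps) for j in range(J)])

    R = []
    for k in range(T // block_len):
        sl = slice(k * block_len, (k + 1) * block_len)

        # Step 2 (assumed form): mixture weight = mean mask in block B_k.
        mu = masks[:, sl].mean(axis=1)  # shape (J, F)

        # Second matrix (assumed form): average over B_k of x x^H weighted
        # by the mask summed over all noise sources.
        noise_mask = masks[:, sl].sum(axis=0)  # shape (block_len, F)
        second = np.einsum(
            "tf,tfm,tfn->fmn", noise_mask, x[sl], x[sl].conj()
        ) / block_len

        # Step 3 (assumed form): combine the short-interval estimate with
        # the weighted sum of the long-interval matrices.
        mixture = np.einsum("jf,jfmn->fmn", mu, psi)
        R.append((1.0 - alpha) * second + alpha * mixture)

    return np.stack(R)
```

Under these assumptions, `alpha` near 0 trusts the short-interval estimate alone and `alpha` near 1 relies on the long-interval matrices reweighted per interval. A call such as `estimate_noise_scm(x, masks, block_len=10)` with `x` of shape `(T, F, M)` and `masks` of shape `(J, T, F)` returns one matrix per short interval and frequency band.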