US 12,451,148 B2
Sound signal refinement method, sound signal decode method, apparatus thereof, program, and storage medium
Ryosuke Sugiura, Tokyo (JP); Takehiro Moriya, Tokyo (JP); and Yutaka Kamamoto, Tokyo (JP)
Assigned to NTT, Inc., Tokyo (JP)
Appl. No. 18/031,579
Filed by NTT, Inc., Tokyo (JP)
PCT Filed Nov. 5, 2020, PCT No. PCT/JP2020/041400 § 371(c)(1), (2) Date Apr. 12, 2023, PCT Pub. No. WO2022/097237, PCT Pub. Date May 12, 2022.
Prior Publication US 2023/0377585 A1, Nov. 23, 2023
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 19/008 (2013.01); G10L 21/043 (2013.01)

CPC G10L 19/008 (2013.01) [G10L 21/043 (2013.01)]

14 Claims

1. A sound signal purification method for obtaining a purified decoded sound signal representing a sound signal of a channel of stereo,

the sound signal purification method comprising:

a decoded sound common signal estimation step of obtaining, for each frame, a decoded sound common signal Y_M that is a signal common to all channels of the stereo by using at least all of one or more and N or less n-th channel decoded sound signals X_n, wherein n represents each integer of 1 or more and N or less, the n-th channel decoded sound signal X_n represents a decoded sound signal of the each channel of the stereo, the n-th channel decoded sound signal X_n is obtained by decoding a stereo code CS and a monaural decoded sound signal X_M represents a monaural decoded sound signal, the monaural decoded sound signal X_M is obtained by decoding a monaural code CM that is a code different from the stereo code CS, the n-th channel decoded sound signal X_n is obtained by decoding the stereo code CS without using either information obtained by decoding the monaural code CM or the monaural code CM;

a decoded sound common signal upmixing step of obtaining, for the each frame, an n-th channel upmixed common signal Y_Mn, wherein the n-th channel upmixed common signal Y_Mn is obtained by upmixing the decoded sound common signal Y_M for the each channel by an upmixing process using the decoded sound common signal Y_M and inter-channel relationship information, and the inter-channel relationship information indicates a relationship between the channels of the stereo;

a monaural decoded sound upmixing step of obtaining, for the each frame, an n-th channel upmixed monaural decoded sound signal X_Mn, wherein the n-th channel upmixed monaural decoded sound signal X_Mn is obtained by upmixing the monaural decoded sound signal X_M for the each channel by an upmixing process using the monaural decoded sound signal X_M and the inter-channel relationship information;

an n-th channel signal purification step of obtaining, for the each frame and for each corresponding sample t with respect to the each channel n, a sequence based on a value ^˜y_Mn(t) = (1−α_Mn) × y_Mn(t) + α_Mn × x_Mn(t) as an n-th channel purified upmixed signal ^˜Y_Mn, wherein the value ^˜y_Mn(t) is obtained by adding a value α_Mn × x_Mn(t) and a value (1−α_Mn) × y_Mn(t), the value α_Mn × x_Mn(t) is obtained by multiplying an n-th channel purification weight α_Mn by a sample value x_Mn(t) of the n-th channel upmixed monaural decoded sound signal X_Mn, the value (1−α_Mn) × y_Mn(t) is obtained by multiplying a value (1−α_Mn) by a sample value y_Mn(t) of the n-th channel upmixed common signal Y_Mn, and the value (1−α_Mn) is obtained by subtracting the n-th channel purification weight α_Mn from 1;

an n-th channel separation combination weight estimation step of obtaining, for the each frame with respect to the each channel n, a normalized inner product value for the n-th channel upmixed common signal Y_Mn of the n-th channel decoded sound signal X_n as an n-th channel separation combination weight β_n; and

an n-th channel separation combination step of obtaining, for the each frame and for each corresponding sample t with respect to the each channel n, a sequence based on a value ^˜x_n(t) = x_n(t) − β_n × y_Mn(t) + β_n × ^˜y_Mn(t) as an n-th channel purified decoded sound signal ^˜X_n, wherein the value ^˜x_n(t) is obtained by subtracting a value β_n × y_Mn(t) from a sample value x_n(t) of the n-th channel decoded sound signal X_n and adding a value β_n × ^˜y_Mn(t), the value β_n × y_Mn(t) is obtained by multiplying the n-th channel separation combination weight β_n by the sample value y_Mn(t) of the n-th channel upmixed common signal Y_Mn, and the value β_n × ^˜y_Mn(t) is obtained by multiplying the n-th channel separation combination weight β_n by a sample value ^˜y_Mn(t) of the n-th channel purified upmixed signal ^˜Y_Mn.
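
For illustration only, the last three steps of claim 1 (signal purification, separation combination weight estimation, and separation combination) can be sketched for one channel n and one frame as below. This is a minimal sketch and not the patented implementation. The function names are hypothetical. The claim does not spell out the normalization of the inner product, so the sketch assumes β_n is the inner product of X_n and Y_Mn divided by the energy of Y_Mn, and assumes a weight of 0 when that energy is zero. The purification weight α_Mn and the upmixed signals Y_Mn and X_Mn are taken as given inputs, since the claim does not fix how they are computed.

```python
from typing import List, Sequence


def purify_upmixed(y_mn: Sequence[float], x_mn: Sequence[float],
                   alpha_mn: float) -> List[float]:
    """Per-sample weighted sum: (1 - alpha_Mn) * y_Mn(t) + alpha_Mn * x_Mn(t)."""
    return [(1.0 - alpha_mn) * y + alpha_mn * x for y, x in zip(y_mn, x_mn)]


def separation_weight(x_n: Sequence[float], y_mn: Sequence[float]) -> float:
    """Normalized inner product of X_n with Y_Mn.

    Assumption: normalization by the energy of Y_Mn, with 0.0 returned
    for a zero-energy frame. Neither detail is stated in the claim.
    """
    energy = sum(y * y for y in y_mn)
    if energy == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(x_n, y_mn)) / energy


def purify_channel(x_n: Sequence[float], y_mn: Sequence[float],
                   x_mn: Sequence[float], alpha_mn: float) -> List[float]:
    """Per-sample result: x_n(t) - beta_n * y_Mn(t) + beta_n * purified y_Mn(t)."""
    y_purified = purify_upmixed(y_mn, x_mn, alpha_mn)
    beta_n = separation_weight(x_n, y_mn)
    return [x - beta_n * y + beta_n * yp
            for x, y, yp in zip(x_n, y_mn, y_purified)]
```

With α_Mn = 0 the purified upmixed signal equals Y_Mn, so the subtracted and added terms cancel and the channel signal passes through unchanged. With α_Mn = 1 the common component of X_n is replaced by the upmixed monaural decoded signal, scaled by β_n.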