US 11,790,926 B2
Method and apparatus for processing audio signal
Mi Suk Lee, Daejeon (KR); Seung Kwon Beack, Daejeon (KR); Jongmo Sung, Daejeon (KR); Tae Jin Lee, Daejeon (KR); Jin Soo Choi, Daejeon (KR); Minje Kim, Bloomington, IN (US); and Kai Zhen, Bloomington, IN (US)
Assigned to Electronics and Telecommunications Research Institute, Daejeon (KR); and The Trustees of Indiana University, Indianapolis, IN (US)
Filed by Electronics and Telecommunications Research Institute, Daejeon (KR); and The Trustees of Indiana University, Indianapolis, IN (US)
Filed on Jan. 22, 2021, as Appl. No. 17/156,006.
Claims priority of provisional application 62/966,917, filed on Jan. 28, 2020.
Claims priority of application No. 10-2020-0056492 (KR), filed on May 12, 2020.
Prior Publication US 2021/0233547 A1, Jul. 29, 2021
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 19/038 (2013.01); G10L 19/028 (2013.01); G10L 25/18 (2013.01); G10L 25/21 (2013.01); G10L 25/30 (2013.01)

CPC G10L 19/038 (2013.01) [G10L 19/028 (2013.01); G10L 25/18 (2013.01); G10L 25/21 (2013.01); G10L 25/30 (2013.01)]

3 Claims

1. A processing method comprising:

acquiring a final audio signal for an initial audio signal using a plurality of neural network models generating output audio signals by encoding and decoding input audio signals;

acquiring a masking threshold and a power spectral density for the initial audio signal through a psychoacoustic model;

determining a weight based on a relationship between the masking threshold and the power spectral density for each frequency;

calculating a difference between a power spectral density of the initial audio signal and a power spectral density of the final audio signal for each frequency based on the determined weight;

training the neural network models based on a result of the calculating; and

generating a new final audio signal distinguished from the final audio signal from the initial audio signal using the trained neural network models,

wherein the plurality of neural networks is in a consecutive relationship, where an i-th neural network model generates an output audio signal using, as an input audio signal, a difference between an output audio signal of an (i−1)-th neural network model and an input audio signal of the (i−1)-th neural network model,

wherein the masking threshold is a criterion for masking noise generated in an encoding and decoding process of the plurality of neural network models, respectively.
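
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and several parts are assumptions made only for illustration: the weight formula log10(1 + PSD / masking threshold) is one possible "relationship" between the two quantities, the masking threshold is a constant placeholder rather than the output of a psychoacoustic model, each neural network model is replaced by a hypothetical uniform quantizer (make_toy_codec), the residual is taken as stage input minus stage output, and the final signal is taken as the sum of the stage outputs. No training step is shown; the sketch only evaluates the weighted difference that such training would use.

```python
import cmath
import math


def power_spectral_density(frame):
    """Naive DFT power spectrum |X[k]|^2 / N for bins 0..N/2 (linear scale)."""
    n = len(frame)
    psd = []
    for k in range(n // 2 + 1):
        x_k = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        psd.append(abs(x_k) ** 2 / n)
    return psd


def perceptual_weights(psd, mask, eps=1e-12):
    """Assumed weighting: larger where the signal PSD exceeds the masking threshold."""
    return [math.log10(1.0 + p / (m + eps)) for p, m in zip(psd, mask)]


def weighted_psd_loss(initial, final, mask):
    """Weighted squared PSD difference, summed over frequency bins."""
    p_init = power_spectral_density(initial)
    p_final = power_spectral_density(final)
    w = perceptual_weights(p_init, mask)
    return sum(wk * (a - b) ** 2 for wk, a, b in zip(w, p_init, p_final))


def make_toy_codec(step):
    """Hypothetical stand-in for one encoder/decoder model: a uniform quantizer."""
    return lambda signal: [step * round(s / step) for s in signal]


def cascade(initial, codecs):
    """Each stage codes the residual of the previous stage; outputs are summed."""
    stage_input = list(initial)
    final = [0.0] * len(initial)
    for codec in codecs:
        stage_output = codec(stage_input)
        final = [f + o for f, o in zip(final, stage_output)]
        # Input of the next stage: previous stage input minus previous stage output.
        stage_input = [i - o for i, o in zip(stage_input, stage_output)]
    return final


# Toy usage: a 32-sample sine frame and a flat placeholder masking threshold.
frame = [math.sin(2 * math.pi * 3 * t / 32) for t in range(32)]
mask = [0.01] * 17  # one value per bin 0..16; a real psychoacoustic model would supply this
one_stage = cascade(frame, [make_toy_codec(0.5)])
two_stage = cascade(frame, [make_toy_codec(0.5), make_toy_codec(0.05)])
loss1 = weighted_psd_loss(frame, one_stage, mask)
loss2 = weighted_psd_loss(frame, two_stage, mask)
print(loss1, loss2)
```

In this toy setting, adding a second, finer stage that codes the residual of the first lowers the weighted loss, which mirrors the consecutive relationship between models described in the claim. Bins where the signal power is far below the masking threshold receive a weight near zero, so errors there contribute little to the result.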