US 11,790,926 B2
Method and apparatus for processing audio signal
Mi Suk Lee, Daejeon (KR); Seung Kwon Beack, Daejeon (KR); Jongmo Sung, Daejeon (KR); Tae Jin Lee, Daejeon (KR); Jin Soo Choi, Daejeon (KR); Minje Kim, Bloomington, IN (US); and Kai Zhen, Bloomington, IN (US)
Assigned to Electronics and Telecommunications Research Institute, Daejeon (KR); and The Trustees of Indiana University, Indianapolis, IN (US)
Filed by Electronics and Telecommunications Research Institute, Daejeon (KR); and The Trustees of Indiana University, Indianapolis, IN (US)
Filed on Jan. 22, 2021, as Appl. No. 17/156,006.
Claims priority of provisional application 62/966,917, filed on Jan. 28, 2020.
Claims priority of application No. 10-2020-0056492 (KR), filed on May 12, 2020.
Prior Publication US 2021/0233547 A1, Jul. 29, 2021
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 19/038 (2013.01); G10L 19/028 (2013.01); G10L 25/18 (2013.01); G10L 25/21 (2013.01); G10L 25/30 (2013.01)
CPC G10L 19/038 (2013.01) [G10L 19/028 (2013.01); G10L 25/18 (2013.01); G10L 25/21 (2013.01); G10L 25/30 (2013.01)]
3 Claims
 
1. A processing method comprising:
acquiring a final audio signal for an initial audio signal using a plurality of neural network models generating output audio signals by encoding and decoding input audio signals;
acquiring a masking threshold and a power spectral density for the initial audio signal through a psychoacoustic model;
determining a weight based on a relationship between the masking threshold and the power spectral density for each frequency;
calculating a difference between a power spectral density of the initial audio signal and a power spectral density of the final audio signal for each frequency based on the determined weight;
training the neural network models based on a result of the calculating; and
generating a new final audio signal distinguished from the final audio signal from the initial audio signal using the trained neural network models,
wherein the plurality of neural networks is in a consecutive relationship, where an i-th neural network model generates an output audio signal using, as an input audio signal, a difference between an output audio signal of an (i−1)-th neural network model and an input audio signal of the (i−1)-th neural network model, and
wherein the masking threshold is a criterion for masking noise generated in an encoding and decoding process of the plurality of neural network models, respectively.
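
The steps recited in claim 1 can be illustrated with a minimal NumPy sketch. It is not the patented implementation, and every name in it (`psd`, `perceptual_weight`, `weighted_psd_loss`, `cascade`) is hypothetical. The claim does not fix several details, so the sketch assumes the following: a log-ratio form for the weight, a squared difference between the power spectral densities, input minus output as the sign of the residual, and a final signal formed by summing the outputs of the cascaded models. The psychoacoustic model itself is not implemented; the masking threshold is passed in as an array, and the "models" are plain callables standing in for trained encoder-decoder networks.

```python
import numpy as np

def psd(signal):
    """Periodogram estimate of the power spectral density of one frame."""
    spectrum = np.fft.rfft(signal)
    return (np.abs(spectrum) ** 2) / len(signal)

def perceptual_weight(psd_ref, mask, eps=1e-12):
    """Per-frequency weight (assumed form): grows as the signal PSD
    rises above the masking threshold, approaches 0 when it is masked."""
    return np.log10(psd_ref / (mask + eps) + 1.0)

def weighted_psd_loss(initial, final, mask):
    """Weighted per-frequency PSD difference (assumed squared difference)
    between the initial and final audio signals."""
    p_init = psd(initial)
    p_final = psd(final)
    w = perceptual_weight(p_init, mask)
    return float(np.sum(w * (p_init - p_final) ** 2))

def cascade(initial, models):
    """Run models consecutively: each model after the first receives the
    residual (input minus output) of the previous model. The final signal
    is assumed to be the sum of all model outputs."""
    residual = initial
    final = np.zeros_like(initial)
    for model in models:
        out = model(residual)
        final = final + out
        residual = residual - out
    return final

# Stand-in "models" that each reconstruct half of their input.
frame = np.array([1.0, 2.0, 3.0, 4.0])
toy_models = [lambda x: 0.5 * x, lambda x: 0.5 * x]
reconstructed = cascade(frame, toy_models)

# Flat masking threshold, used here only as a placeholder.
flat_mask = np.ones(len(frame) // 2 + 1)
loss = weighted_psd_loss(frame, reconstructed, flat_mask)
```

In a training loop, a value such as `loss` would be the quantity minimized when updating the models. The weighting makes spectral errors count most at frequencies where the signal is well above the masking threshold, where coding noise is most likely to be audible.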