US 12,223,426 B2
Method and apparatus for designing and testing audio codec by using white noise modeling
Jongmo Sung, Daejeon (KR); Seung Kwon Beack, Daejeon (KR); Tae Jin Lee, Daejeon (KR); Woo-taek Lim, Daejeon (KR); Inseon Jang, Daejeon (KR); Byeongho Cho, Daejeon (KR); Young Cheol Park, Wonju-si (KR); Joon Byun, Wonju-si (KR); and Seungmin Shin, Wonju-si (KR)
Assigned to ELECTRONICS AND TELECOMMUNICATIONS RESEARCH INSTITUTE, Daejeon (KR); and YONSEI UNIVERSITY WONJU INDUSTRY-ACADEMIC COOPERATION FOUNDATION, Wonju-si (KR)
Filed by ELECTRONICS AND TELECOMMUNICATIONS RESEARCH INSTITUTE, Daejeon (KR); and YONSEI UNIVERSITY WONJU INDUSTRY-ACADEMIC COOPERATION FOUNDATION, Wonju-si (KR)
Filed on Feb. 8, 2023, as Appl. No. 18/166,407.
Claims priority of application No. 10-2022-0025344 (KR), filed on Feb. 25, 2022.
Prior Publication US 2023/0274141 A1, Aug. 31, 2023
Int. Cl. G10L 19/00 (2013.01); G06N 3/08 (2023.01); G10L 19/028 (2013.01); G10L 19/038 (2013.01); G10L 25/30 (2013.01); G10L 25/60 (2013.01); G10L 25/69 (2013.01); G06N 3/084 (2023.01); G10L 15/00 (2013.01); G10L 19/22 (2013.01)

CPC G06N 3/08 (2013.01) [G10L 19/028 (2013.01); G10L 19/038 (2013.01); G10L 25/30 (2013.01); G10L 25/60 (2013.01); G10L 25/69 (2013.01); G06N 3/084 (2013.01); G10L 15/00 (2013.01); G10L 19/00 (2013.01); G10L 19/22 (2013.01)]

12 Claims

1. A method of designing a neural network-based audio codec, the method comprising:

generating a quantized latent vector and a reconstructed signal corresponding to an input signal by using a white noise modeling-based quantization process;

computing a total loss for training of the neural network-based audio codec, based on the input signal, the reconstructed signal, and the quantized latent vector;

training the neural network-based audio codec by using the total loss; and

validating the trained neural network-based audio codec to select the best neural network-based audio codec,

wherein the computing of the total loss comprises:

calculating a reconstruction loss term as mean squared error (MSE) between the input signal and the reconstructed signal, a bit-rate control loss term as an entropy of the quantized latent vector, and a perceptual loss term reflecting human perceptual characteristics, respectively; and

calculating the total loss by adding the reconstruction loss term, the bit-rate control loss term, and the perceptual loss term,

wherein the reconstruction loss term is determined based on a square of an L2-norm of a difference between the input signal and the reconstructed signal,

wherein the bit-rate control loss term is determined based on probability distribution for a latent vector with added random noise,

wherein the latent vector with added random noise is generated by adding a random noise to a latent vector output from an encoder of the neural network-based audio codec that receives the input signal, and

wherein the quantized latent vector is generated by de-warping the latent vector with added random noise.
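
As an illustration only, the loss computation recited in claim 1 might be sketched as follows. The claim does not fix the noise distribution, the de-warping function, the probability model, or the perceptual measure, so the uniform noise, the mu-law style `dewarp`, the logistic model in `rate_loss`, and the log-spectral `perceptual_loss` below are all assumptions chosen for the sketch. The `encoder` and `decoder` arguments are hypothetical callables standing in for the trained networks.

```python
import numpy as np


def add_white_noise(latent, rng, step=1.0):
    """Add uniform white noise of width `step` (assumed noise model)."""
    noise = rng.uniform(-0.5 * step, 0.5 * step, size=latent.shape)
    return latent + noise


def dewarp(w, mu=255.0):
    """Hypothetical mu-law style expansion; the claim only names 'de-warping'."""
    return np.sign(w) * np.expm1(np.abs(w) * np.log1p(mu)) / mu


def reconstruction_loss(x, x_hat):
    """MSE, i.e. the squared L2-norm of the difference divided by its length."""
    diff = x - x_hat
    return float(np.sum(diff ** 2) / diff.size)


def rate_loss(noisy_latent, scale=1.0, step=1.0):
    """Entropy estimate (bits per element) of the noise-added latent vector
    under an assumed logistic probability model."""
    def cdf(v):
        return 1.0 / (1.0 + np.exp(-v / scale))

    p = cdf(noisy_latent + 0.5 * step) - cdf(noisy_latent - 0.5 * step)
    return float(np.mean(-np.log2(np.maximum(p, 1e-12))))


def perceptual_loss(x, x_hat, eps=1e-8):
    """Stand-in perceptual term: mean squared log-magnitude spectral distance.
    The psychoacoustic measure actually intended is not given in the claim."""
    log_mag = np.log(np.abs(np.fft.rfft(x)) + eps)
    log_mag_hat = np.log(np.abs(np.fft.rfft(x_hat)) + eps)
    return float(np.mean((log_mag - log_mag_hat) ** 2))


def total_loss(x, encoder, decoder, rng):
    """Compute the three loss terms and their plain sum for one input frame."""
    latent = encoder(x)                    # latent vector from the encoder
    noisy = add_white_noise(latent, rng)   # latent vector with added random noise
    quantized = dewarp(noisy)              # quantized latent vector
    x_hat = decoder(quantized)             # reconstructed signal
    terms = {
        "reconstruction": reconstruction_loss(x, x_hat),
        "rate": rate_loss(noisy),
        "perceptual": perceptual_loss(x, x_hat),
    }
    terms["total"] = terms["reconstruction"] + terms["rate"] + terms["perceptual"]
    return terms


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.sin(2 * np.pi * 440 * np.arange(512) / 16000)
    # Identity encoder/decoder used only to exercise the loss computation.
    print(total_loss(x, lambda s: s, lambda q: q, rng))
```

The terms are summed without weights because the claim recites only addition; a practical training setup would likely scale each term, which is left out here since the claim does not specify it.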