US 12,334,043 B2
Time-varying and nonlinear audio processing using deep neural networks
Marco Antonio Martinez Ramirez, London (GB); Joshua Daniel Reiss, London (GB); and Emmanouil Benetos, London (GB)
Assigned to WAVESHAPER TECHNOLOGIES INC., Montreal (CA)
Appl. No. 17/924,701
Filed by WAVESHAPER TECHNOLOGIES INC., Montreal (CA)
PCT Filed May 12, 2020, PCT No. PCT/GB2020/051150
§ 371(c)(1), (2) Date Nov. 11, 2022,
PCT Pub. No. WO2021/229197, PCT Pub. Date Nov. 18, 2021.
Prior Publication US 2023/0197043 A1, Jun. 22, 2023
Int. Cl. G10H 1/00 (2006.01); G06N 3/0442 (2023.01); G06N 3/045 (2023.01); G06N 3/0499 (2023.01); G06N 3/08 (2023.01); G10H 1/16 (2006.01)
CPC G10H 1/0091 (2013.01) [G06N 3/0442 (2023.01); G06N 3/045 (2023.01); G06N 3/0499 (2023.01); G06N 3/08 (2013.01); G10H 1/16 (2013.01); G10H 2210/215 (2013.01); G10H 2210/281 (2013.01); G10H 2210/311 (2013.01); G10H 2250/025 (2013.01); G10H 2250/311 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A computer-implemented method of processing audio data, the method comprising:
receiving input audio data (x) comprising a time-series of amplitude values;
transforming the input audio data (x) into an input frequency band decomposition (X1) of the input audio data (x);
transforming the input frequency band decomposition (X1) into a first latent representation (Z);
processing the first latent representation (Z) by a first deep neural network to obtain a second latent representation (Ẑ, Ẑ1);
transforming the second latent representation (Ẑ, Ẑ1) to obtain a discrete approximation (X̂3);
element-wise multiplying the discrete approximation (X̂3) and a residual feature map (R, X̂5) to obtain a modified feature map, wherein the residual feature map (R, X̂5) is derived from the input frequency band decomposition (X1);
processing a pre-shaped frequency band decomposition by a waveshaping unit to obtain a waveshaped frequency band decomposition (X̂1, X̂1.2), wherein the pre-shaped frequency band decomposition is derived from the input frequency band decomposition (X1), wherein the waveshaping unit comprises a second deep neural network;
summing the waveshaped frequency band decomposition (X̂1, X̂1.2) and a modified frequency band decomposition (X̂2, X̂1.1) to obtain a summation output (X̂0), wherein the modified frequency band decomposition (X̂2, X̂1.1) is derived from the modified feature map; and
transforming the summation output (X̂0) to obtain target audio data (ŷ).
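The chain recited in claim 1 can be sketched in code. The following is an illustrative NumPy toy, not the patented implementation: fixed random matrices stand in for the learned analysis/synthesis transforms and for both deep neural networks, all layer sizes are assumptions, and the "modified frequency band decomposition" is taken to be the modified feature map itself (the claim only requires that it be derived from it).

```python
import numpy as np

rng = np.random.default_rng(0)
FRAMES, BANDS, LATENT = 64, 16, 8  # assumed sizes, for illustration only

# Stand-in learned parameters (random for illustration).
A  = rng.standard_normal((BANDS, BANDS)) / BANDS    # analysis transform
S  = np.linalg.inv(A)                               # synthesis transform
E  = rng.standard_normal((BANDS, LATENT)) / BANDS   # projection to latent Z
W1 = rng.standard_normal((LATENT, LATENT))          # "first deep neural network"
D  = rng.standard_normal((LATENT, BANDS)) / LATENT  # back to the band domain
Ws = rng.standard_normal((BANDS, BANDS)) / BANDS    # "second DNN" (waveshaping unit)

def process(x):
    frames = x.reshape(FRAMES, BANDS)
    X1 = frames @ A          # input frequency band decomposition (X1)
    Z  = X1 @ E              # first latent representation (Z)
    Zh = np.tanh(Z @ W1)     # second latent representation (first DNN)
    X3 = Zh @ D              # discrete approximation
    R  = np.abs(X1)          # residual feature map derived from X1
    X2 = X3 * R              # element-wise product -> modified feature map
    Xw = np.tanh(X1 @ Ws)    # waveshaped decomposition (second DNN)
    X0 = Xw + X2             # summation output
    return (X0 @ S).ravel()  # transform back to target audio data (y)

x = rng.standard_normal(FRAMES * BANDS)  # input audio time series
y = process(x)
```

In this sketch the element-wise product X3 * R plays the gating role the claim assigns to the modified feature map, while the tanh waveshaping path runs in parallel on the same decomposition and the two are summed before synthesis.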