US 12,456,481 B2
Method and apparatus for audio processing using a nested convolutional neural network architecture
Jundai Sun, Beijing (CN); Lie Lu, Dublin, CA (US); and Zhiwei Shuang, Beijing (CN)
Assigned to DOLBY LABORATORIES LICENSING CORPORATION, San Francisco, CA (US)
Appl. No. 18/032,325
Filed by DOLBY LABORATORIES LICENSING CORPORATION, San Francisco, CA (US)
PCT Filed Oct. 19, 2021, PCT No. PCT/US2021/055691
§ 371(c)(1), (2) Date Apr. 17, 2023,
PCT Pub. No. WO2022/087025, PCT Pub. Date Apr. 28, 2022.
Claims priority of provisional application 63/164,028, filed on Mar. 22, 2021.
Claims priority of provisional application 63/112,220, filed on Nov. 11, 2020.
Claims priority of application No. PCT/CN2020/121829 (WO), filed on Oct. 19, 2020; application No. 20211501 (EP), filed on Dec. 3, 2020; and application No. PCT/CN2021/078705 (WO), filed on Mar. 2, 2021.
Prior Publication US 2023/0386500 A1, Nov. 30, 2023
Int. Cl. G10L 25/30 (2013.01); G06N 3/0464 (2023.01); G10L 25/84 (2013.01)
CPC G10L 25/30 (2013.01) [G06N 3/0464 (2023.01); G10L 25/84 (2013.01)]
19 Claims
 
1. A computing system implementing a convolutional neural network (CNN) architecture, the CNN architecture comprising a multi-scale input block and a multi-scale nested block, wherein the multi-scale input block is configured to:
receive input data based on an audio signal; and
generate a first downsampled input data set by downsampling the input data;
wherein the multi-scale nested block comprises:
a first encoding layer configured to generate a first encoded data set by performing a convolution based on the input data,
a second encoding layer configured to generate a second encoded data set by performing a convolution based on the first downsampled input data set,
a first convolutional layer configured to generate a first output data set by performing a convolution based on the first encoded data set and an upsampled second encoded data set, wherein the upsampled second encoded data set is obtained by upsampling the second encoded data set, and
a second convolutional layer configured to generate a second output data set based on the first output data set;
wherein the computing system is configured to output an output audio signal based on the second output data set.
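
The data flow recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: it assumes single-channel 1-D data, a downsampling factor of 2 by decimation, nearest-neighbour upsampling, and fixed kernels rather than trained ones, none of which the claim specifies. All function and key names (conv1d_same, multi_scale_input_block, multi_scale_nested_block, "conv1_a", and so on) are hypothetical. The first convolutional layer is modelled as a two-channel convolution over the first encoded data set and the upsampled second encoded data set, which is one possible reading of "based on".

```python
def conv1d_same(x, kernel):
    """Zero-padded 1-D convolution (cross-correlation); assumes an odd-length kernel."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(w * padded[i + j] for j, w in enumerate(kernel))
        for i in range(len(x))
    ]


def downsample(x, factor=2):
    """Assumed downsampling: keep every `factor`-th sample."""
    return list(x[::factor])


def upsample(x, factor=2, length=None):
    """Assumed upsampling: repeat each sample, then trim to `length` if given."""
    up = [v for v in x for _ in range(factor)]
    return up if length is None else up[:length]


def multi_scale_input_block(x):
    """Return the input data and the first downsampled input data set."""
    return list(x), downsample(x)


def multi_scale_nested_block(x, x_down, k):
    """Follow the claim's layer order; `k` maps hypothetical layer names to kernels."""
    enc1 = conv1d_same(x, k["enc1"])          # first encoding layer
    enc2 = conv1d_same(x_down, k["enc2"])     # second encoding layer
    enc2_up = upsample(enc2, length=len(enc1))
    # First convolutional layer: a two-channel convolution, written as the
    # sum of one single-channel convolution per input channel.
    a = conv1d_same(enc1, k["conv1_a"])
    b = conv1d_same(enc2_up, k["conv1_b"])
    out1 = [p + q for p, q in zip(a, b)]
    # Second convolutional layer, based on the first output data set.
    return conv1d_same(out1, k["conv2"])


# Identity kernels make the data flow easy to follow by hand.
IDENTITY = [0.0, 1.0, 0.0]
kernels = {name: IDENTITY
           for name in ("enc1", "enc2", "conv1_a", "conv1_b", "conv2")}

signal = [1, 2, 3, 4, 5, 6, 7, 8]
x, x_down = multi_scale_input_block(signal)
output = multi_scale_nested_block(x, x_down, kernels)
```

With identity kernels the output is simply the input plus its decimated-then-repeated copy, which makes the contribution of the lower-resolution branch visible. A practical system would presumably use learned multi-channel kernels and operate on a time-frequency representation of the audio signal, but claim 1 leaves those details open.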