US 12,080,319 B2
Weakly-supervised sound event detection method and system based on adaptive hierarchical pooling
Qirong Mao, Jiangsu (CN); Lijian Gao, Jiangsu (CN); Yaxin Shen, Jiangsu (CN); Qinghua Ren, Jiangsu (CN); Yongzhao Zhan, Jiangsu (CN); and Keyang Cheng, Jiangsu (CN)
Assigned to Jiangsu University, Jiangsu (CN)
Appl. No. 18/035,934
Filed by Jiangsu University, Jiangsu (CN)
PCT Filed Jun. 27, 2022, PCT No. PCT/CN2022/101361
§ 371(c)(1), (2) Date May 9, 2023,
PCT Pub. No. WO2023/221237, PCT Pub. Date Nov. 23, 2023.
Claims priority of application No. 202210528373.0 (CN), filed on May 16, 2022.
Prior Publication US 2024/0105211 A1, Mar. 28, 2024
Int. Cl. G10L 25/78 (2013.01); G10L 25/18 (2013.01); G10L 25/30 (2013.01)
CPC G10L 25/78 (2013.01) [G10L 25/18 (2013.01); G10L 25/30 (2013.01)]
18 Claims
1. A weakly-supervised sound event detection method based on adaptive hierarchical pooling, comprising:
extracting an acoustic feature of a pre-processed audio signal, inputting the acoustic feature to an acoustic model, dividing a frame-level prediction probability sequence predicted by the acoustic model into a plurality of consecutive sub-bags, calculating significant information of each of the sub-bags through maximum pooling to obtain a sub-bag-level prediction set, and obtaining an average probability of the sub-bag-level prediction set through mean pooling as a sentence-level prediction probability;
jointly optimizing the acoustic model and a relaxation parameter until convergence to obtain an optimal model weight and an optimal relaxation parameter, and formulating an optimal pooling strategy for each category of sound event based on the optimal relaxation parameter; and
performing pre-processing and feature extraction on a given unknown audio signal to obtain a pre-processed audio signal, inputting the pre-processed audio signal to a trained acoustic model to obtain frame-level prediction probabilities of all target sound events to complete an audio locating task, and obtaining sentence-level prediction probabilities of all categories of the target sound events based on the optimal pooling strategy of each category of the target sound events to complete an audio classification task.
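The pooling step recited in claim 1 (consecutive sub-bags, maximum pooling within each sub-bag, then mean pooling across sub-bags) can be illustrated with a minimal sketch for a single sound event category. The function names `split_into_sub_bags` and `hierarchical_pool`, the `num_sub_bags` argument, and the even splitting rule are assumptions made for illustration only. The claim does not state how the relaxation parameter determines the sub-bags, so the sketch takes the number of sub-bags as a fixed input rather than a learned quantity.

```python
from typing import List, Sequence


def split_into_sub_bags(frame_probs: Sequence[float], num_sub_bags: int) -> List[List[float]]:
    """Split frame-level probabilities into consecutive, near-equal sub-bags.

    The even split is an assumption; the claim only requires that the
    sub-bags be consecutive.
    """
    n = len(frame_probs)
    if not 1 <= num_sub_bags <= n:
        raise ValueError("num_sub_bags must be between 1 and the number of frames")
    # Integer boundaries guarantee that every sub-bag holds at least one frame.
    bounds = [i * n // num_sub_bags for i in range(num_sub_bags + 1)]
    return [list(frame_probs[bounds[i]:bounds[i + 1]]) for i in range(num_sub_bags)]


def hierarchical_pool(frame_probs: Sequence[float], num_sub_bags: int) -> float:
    """Max-pool inside each sub-bag, then mean-pool the sub-bag predictions."""
    sub_bags = split_into_sub_bags(frame_probs, num_sub_bags)
    sub_bag_preds = [max(bag) for bag in sub_bags]   # sub-bag-level prediction set
    return sum(sub_bag_preds) / len(sub_bag_preds)   # sentence-level probability


if __name__ == "__main__":
    # Hypothetical frame-level probabilities for one category.
    probs = [0.1, 0.9, 0.2, 0.3, 0.0, 0.4]
    for k in (1, 3, 6):
        print(k, hierarchical_pool(probs, k))
```

Under these assumptions, a single sub-bag reduces the computation to global maximum pooling, and one frame per sub-bag reduces it to mean pooling, so the number of sub-bags interpolates between the two. A per-category pooling strategy would correspond to choosing a different value for each sound event category.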