US 12,094,484 B2
General speech enhancement method and apparatus using multi-source auxiliary information
Jingsong Li, Hangzhou (CN); Zhenchuan Zhang, Hangzhou (CN); Tianshu Zhou, Hangzhou (CN); and Yu Tian, Hangzhou (CN)
Assigned to ZHEJIANG LAB, Hangzhou (CN)
Filed by ZHEJIANG LAB, Zhejiang (CN)
Filed on Jul. 28, 2023, as Appl. No. 18/360,838.
Claims priority of application No. 202210902896.7 (CN), filed on Jul. 29, 2022.
Prior Publication US 2024/0079022 A1, Mar. 7, 2024
Int. Cl. G10L 21/0232 (2013.01); G10L 17/02 (2013.01); G10L 17/04 (2013.01); G10L 25/30 (2013.01)
CPC G10L 21/0232 (2013.01) [G10L 17/02 (2013.01); G10L 17/04 (2013.01); G10L 25/30 (2013.01)] 9 Claims
OG exemplary drawing
 
1. A general speech enhancement method using multi-source auxiliary information, comprising the steps of:
step S1: building a training data set;
step S2: building a speech enhancement model from three sub-networks: an encoder module, an attention module, and a decoder module, and using the training data set to learn network parameters of the speech enhancement model;
step S3: building a sound source information database in a pre-collection or on-site collection mode;
step S4: acquiring an input of the speech enhancement model, wherein the input comprises a noisy original signal to be processed, auxiliary sound signals of a target source group, and auxiliary sound signals of an interference source group, the auxiliary sound signals being obtained by using the sound source information database;
step S5: taking the noisy original signal as a main input of the speech enhancement model, taking the auxiliary sound signals of the target source group and the auxiliary sound signals of the interference source group as side inputs of the speech enhancement model for speech enhancement, and obtaining an enhanced speech signal;
step S51: obtaining an original signal representation from the noisy original signal by the corresponding encoder module; and obtaining an auxiliary sound signal representation of the target source group and an auxiliary sound signal representation of the interference source group from the auxiliary sound signals of the target source group and the auxiliary sound signals of the interference source group, respectively, by the corresponding encoder module;
step S52: sequentially reading a first signal representation pair and a second signal representation pair from the original signal representation, the auxiliary sound signal representation of the target source group and the auxiliary sound signal representation of the interference source group by the attention module, and obtaining an auxiliary sound signal representation mask of the target source group and an auxiliary sound signal representation mask of the interference source group, wherein the first signal representation pair comprises the original signal representation and the auxiliary sound signal representation of the target source group, and the second signal representation pair comprises the original signal representation and the auxiliary sound signal representation of the interference source group;
step S53: fusing the auxiliary sound signal representation mask of the target source group and the auxiliary sound signal representation mask of the interference source group through attention fusion, and obtaining a fusion mask;
step S54: obtaining an enhanced representation from the original signal representation using the fusion mask; and
step S55: converting the enhanced representation into an enhanced speech signal by the decoder module.
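The data flow of steps S51 through S55 can be sketched numerically. The sketch below is illustrative only: the claim does not specify the encoder, attention, fusion, or decoder operations, so the framing-plus-tanh encoder, scaled dot-product attention with a sigmoid mask, multiplicative fusion, and arctanh decoder used here are hypothetical stand-ins for the learned sub-networks.

```python
import numpy as np

def encode(signal, dim=16):
    # Hypothetical encoder (step S51): frame the waveform into
    # `dim`-sample frames and apply a nonlinearity as a stand-in
    # for the learned encoder module.
    frames = signal[: (len(signal) // dim) * dim].reshape(-1, dim)
    return np.tanh(frames)

def attention_mask(orig_rep, aux_rep):
    # Sketch of step S52: scaled dot-product attention between the
    # original signal representation (queries) and an auxiliary sound
    # signal representation (keys/values), squashed by a sigmoid to
    # yield a (0, 1)-valued representation mask.
    scores = orig_rep @ aux_rep.T / np.sqrt(orig_rep.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)      # row-wise softmax
    return 1.0 / (1.0 + np.exp(-(weights @ aux_rep)))  # sigmoid mask

def fuse_masks(target_mask, interf_mask):
    # Sketch of step S53: fuse the two masks so that components
    # attributed to the target source group are kept and components
    # attributed to the interference source group are suppressed.
    return target_mask * (1.0 - interf_mask)

def decode(rep):
    # Hypothetical decoder (step S55): invert the encoder nonlinearity
    # and flatten the frames back into a waveform.
    return np.arctanh(rep).reshape(-1)

# Inputs per step S4 (random placeholders for real recordings).
rng = np.random.default_rng(0)
noisy = rng.standard_normal(256)       # noisy original signal
target_aux = rng.standard_normal(256)  # auxiliary signal, target source group
interf_aux = rng.standard_normal(256)  # auxiliary signal, interference source group

orig_rep = encode(noisy)                                # step S51
target_mask = attention_mask(orig_rep, encode(target_aux))   # step S52
interf_mask = attention_mask(orig_rep, encode(interf_aux))   # step S52
fusion_mask = fuse_masks(target_mask, interf_mask)      # step S53
enhanced_rep = orig_rep * fusion_mask                   # step S54
enhanced = decode(enhanced_rep)                         # step S55
```

Because the fusion mask lies strictly in (0, 1), the enhanced representation is an attenuated copy of the original representation: energy the attention module attributes to the interference group is scaled down, while energy attributed to the target group passes through.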