US 12,334,092 B2
Speech separation method, electronic device, chip, and computer-readable storage medium
Henghui Lu, Beijing (CN); Lei Qin, Shenzhen (CN); Peng Zhang, Beijing (CN); Jiaming Xu, Beijing (CN); and Bo Xu, Beijing (CN)
Assigned to Huawei Technologies Co., Ltd., Shenzhen (CN); and Institute of Automation, Chinese Academy of Sciences, Beijing (CN)
Appl. No. 18/026,960
Filed by Huawei Technologies Co., Ltd., Shenzhen (CN); and INSTITUTE OF AUTOMATION, CHINESE ACADEMY OF SCIENCES, Beijing (CN)
PCT Filed Aug. 24, 2021, PCT No. PCT/CN2021/114204
§ 371(c)(1), (2) Date Mar. 17, 2023.
PCT Pub. No. WO2022/062800, PCT Pub. Date Mar. 31, 2022.
Claims priority of application No. 202011027680.8 (CN), filed on Sep. 25, 2020.
Prior Publication US 2023/0335148 A1, Oct. 19, 2023
Int. Cl. G10L 21/0208 (2013.01); G06V 20/40 (2022.01); G10L 21/055 (2013.01)
CPC G10L 21/0208 (2013.01) [G06V 20/46 (2022.01); G10L 21/055 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A method, comprising:
obtaining, by an electronic device, audio information and video information, the audio information including a user speech, the video information including a user face of a user, wherein the audio information and the video information correspond to a speaking process of the user;
coding, by the electronic device, the audio information to obtain a mixed acoustic feature;
extracting, by the electronic device, a visual semantic feature of the user from the video information, the visual semantic feature comprising a feature of a facial motion of the user in the speaking process;
inputting, by the electronic device, the mixed acoustic feature and the visual semantic feature into a preset visual speech separation network to obtain an acoustic feature of the user, wherein obtaining the acoustic feature of the user comprises:
performing regularization and one-dimensional convolutional layer processing on the mixed acoustic feature to obtain a deep mixed acoustic feature;
upsampling the visual semantic feature to obtain a deep visual semantic feature that is time-synchronized with the deep mixed acoustic feature;
connecting the deep mixed acoustic feature and the deep visual semantic feature in a channel dimension;
performing dimension transformation to obtain a fused visual and auditory feature;
predicting a mask value of the user speech based on the fused visual and auditory feature;
performing mapping and output processing on the mask value to obtain a mask output; and
performing a matrix dot product calculation on the mask output and the mixed acoustic feature to obtain the acoustic feature of the user; and
decoding, by the electronic device, the acoustic feature of the user to obtain a speech signal of the user.
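
The following is a minimal, illustrative sketch of the claimed pipeline in PyTorch-style Python. The Conv-TasNet-style encoder/decoder, the use of group normalization for the claimed "regularization," linear interpolation for the claimed upsampling, a sigmoid for the claimed mask mapping, and all module names and dimensions are assumptions introduced here for illustration; the claim does not specify an implementation.

    # Illustrative sketch only; module choices and dimensions are assumptions,
    # not taken from the patent.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AVSpeechSeparator(nn.Module):
        def __init__(self, n_filters=256, kernel_size=16, stride=8,
                     visual_dim=512, hidden_dim=256):
            super().__init__()
            # Coding: 1-D conv encoder turns the mixed waveform into a
            # mixed acoustic feature of shape (B, n_filters, T).
            self.encoder = nn.Conv1d(1, n_filters, kernel_size, stride=stride)
            # Regularization + one-dimensional convolutional layer processing
            # -> deep mixed acoustic feature.
            self.norm = nn.GroupNorm(1, n_filters)
            self.bottleneck = nn.Conv1d(n_filters, hidden_dim, 1)
            # Dimension transformation after channel-wise concatenation.
            self.fuse = nn.Conv1d(hidden_dim + visual_dim, hidden_dim, 1)
            # Mask prediction (a real system would use a deeper temporal network).
            self.mask_net = nn.Sequential(
                nn.Conv1d(hidden_dim, hidden_dim, 3, padding=1), nn.PReLU(),
                nn.Conv1d(hidden_dim, n_filters, 1),
            )
            # Decoding: transposed conv maps the user's acoustic feature
            # back to a time-domain speech signal.
            self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size,
                                              stride=stride)

        def forward(self, mixture, visual_feat):
            # mixture: (B, 1, samples); visual_feat: (B, visual_dim, T_video)
            mixed = self.encoder(mixture)                  # mixed acoustic feature
            deep_audio = self.bottleneck(self.norm(mixed)) # deep mixed feature
            # Upsample the visual semantic feature so it is time-synchronized
            # with the deep mixed acoustic feature.
            deep_visual = F.interpolate(visual_feat, size=deep_audio.shape[-1],
                                        mode="linear", align_corners=False)
            # Connect in the channel dimension, then transform dimensions.
            fused = self.fuse(torch.cat([deep_audio, deep_visual], dim=1))
            # Predict the mask value, then map it into (0, 1) for output.
            mask = torch.sigmoid(self.mask_net(fused))
            # Element-wise (dot) product of mask output and mixed acoustic
            # feature isolates the user's acoustic feature.
            user_feat = mask * mixed
            return self.decoder(user_feat)                 # user speech signal

    # Example usage with dummy tensors (shapes are illustrative):
    model = AVSpeechSeparator()
    mixture = torch.randn(2, 1, 16000)   # 1 s of 16 kHz mixed audio
    visual = torch.randn(2, 512, 25)     # 25 video frames of face/lip features
    speech = model(mixture, visual)      # separated speech signal of the user

In this sketch the visual semantic feature is assumed to be precomputed from the video of the user's face (e.g., by a lip-reading front end), and the "matrix dot product" of the claim is realized as an element-wise product between the predicted mask and the mixed acoustic feature.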