CPC G10L 21/0208 (2013.01) [G06V 20/46 (2022.01); G10L 21/055 (2013.01)]

20 Claims

1. A method, comprising:
obtaining, by an electronic device, audio information and video information, the audio information including a user speech, the video information including a user face of a user, wherein the audio information and the video information correspond to a speaking process of the user;
coding, by the electronic device, the audio information to obtain a mixed acoustic feature;
extracting, by the electronic device, a visual semantic feature of the user from the video information, the visual semantic feature comprising a feature of a facial motion of the user in the speaking process;
inputting, by the electronic device, the mixed acoustic feature and the visual semantic feature into a preset visual speech separation network to obtain an acoustic feature of the user, wherein obtaining the acoustic feature of the user comprises:
performing regularization and one-dimensional convolutional layer processing on the mixed acoustic feature to obtain a deep mixed acoustic feature;
upsampling the visual semantic feature to obtain a deep visual semantic feature that is time-synchronized with the deep mixed acoustic feature;
connecting the deep mixed acoustic feature and the deep visual semantic feature in a channel dimension;
performing dimension transformation to obtain a fused visual and auditory feature;
predicting a mask value of the user speech based on the fused visual and auditory feature;
performing mapping and output processing on the mask value to obtain a mask output; and
performing a matrix dot product calculation on the mask output and the mixed acoustic feature to obtain the acoustic feature of the user; and
decoding, by the electronic device, the acoustic feature of the user to obtain a speech signal of the user.
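The separation step recited in claim 1 describes a masking-based audio-visual pipeline: the regularized and convolved mixed acoustic feature is fused with a time-upsampled visual semantic feature, a mask is predicted from the fused feature, and the mask output is multiplied element-wise with the mixed acoustic feature. The sketch below illustrates that data flow under stated assumptions only; the PyTorch framework, the module and parameter names, the channel sizes, GroupNorm as the regularization, nearest-neighbour upsampling, 1x1 convolutions for the dimension transformation, and a sigmoid as the mapping and output processing are all illustrative choices not specified by the claim.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualSpeechSeparationSketch(nn.Module):
    # Hypothetical module mirroring the data flow of the claimed visual
    # speech separation network; layer choices are assumptions, not the
    # claimed implementation.
    def __init__(self, audio_channels=256, visual_channels=512, hidden_channels=256):
        super().__init__()
        # Regularization and one-dimensional convolutional layer processing
        # applied to the mixed acoustic feature.
        self.audio_norm = nn.GroupNorm(1, audio_channels)
        self.audio_conv = nn.Conv1d(audio_channels, hidden_channels, kernel_size=1)
        # Dimension transformation applied after channel-wise concatenation.
        self.fusion_conv = nn.Conv1d(hidden_channels + visual_channels,
                                     hidden_channels, kernel_size=1)
        # Mask prediction over the fused visual and auditory feature.
        self.mask_predictor = nn.Sequential(
            nn.Conv1d(hidden_channels, hidden_channels, kernel_size=3, padding=1),
            nn.PReLU(),
            nn.Conv1d(hidden_channels, audio_channels, kernel_size=1),
        )

    def forward(self, mixed_acoustic, visual_semantic):
        # mixed_acoustic:  (batch, audio_channels, audio_frames)
        # visual_semantic: (batch, visual_channels, video_frames)
        deep_acoustic = self.audio_conv(self.audio_norm(mixed_acoustic))
        # Upsample the visual semantic feature along time so it is
        # synchronized with the deep mixed acoustic feature.
        deep_visual = F.interpolate(visual_semantic,
                                    size=deep_acoustic.shape[-1],
                                    mode="nearest")
        # Connect in the channel dimension, then transform the dimensionality
        # to obtain the fused visual and auditory feature.
        fused = self.fusion_conv(torch.cat([deep_acoustic, deep_visual], dim=1))
        # Predict the mask value, then map it to (0, 1) as the mask output.
        mask = torch.sigmoid(self.mask_predictor(fused))
        # Element-wise (matrix dot) product with the mixed acoustic feature
        # yields the acoustic feature of the target user.
        return mask * mixed_acoustic

# Example call with arbitrary shapes (assumed, for illustration only):
# net = VisualSpeechSeparationSketch()
# user_feature = net(torch.randn(1, 256, 1000), torch.randn(1, 512, 25))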