CPC G10L 21/0208 (2013.01) [G06V 20/46 (2022.01); G10L 21/055 (2013.01)]

20 Claims

1. A method, comprising:
obtaining, by an electronic device, audio information and video information, the audio information including a user speech, the video information including a user face of a user, wherein the audio information and the video information correspond to a speaking process of the user;
coding, by the electronic device, the audio information to obtain a mixed acoustic feature;
extracting, by the electronic device, a visual semantic feature of the user from the video information, the visual semantic feature comprising a feature of a facial motion of the user in the speaking process;
inputting, by the electronic device, the mixed acoustic feature and the visual semantic feature into a preset visual speech separation network to obtain an acoustic feature of the user, wherein obtaining the acoustic feature of the user comprises:
performing regularization and one-dimensional convolutional layer processing on the mixed acoustic feature to obtain a deep mixed acoustic feature;
upsampling the visual semantic feature to obtain a deep visual semantic feature that is time-synchronized with the deep mixed acoustic feature;
connecting the deep mixed acoustic feature and the deep visual semantic feature in a channel dimension;
performing dimension transformation to obtain a fused visual and auditory feature;
predicting a mask value of the user speech based on the fused visual and auditory feature;
performing mapping and output processing on the mask value to obtain a mask output; and
performing a matrix dot product calculation on the mask output and the mixed acoustic feature to obtain the acoustic feature of the user; and
decoding, by the electronic device, the acoustic feature of the user to obtain a speech signal of the user.
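The separation step recited in claim 1 describes a masking-based audio-visual pipeline: the regularized and convolved mixed acoustic feature is fused with a time-upsampled visual semantic feature, a mask is predicted from the fused feature, and the mask output is multiplied element-wise with the mixed acoustic feature. The sketch below illustrates that data flow under stated assumptions only; the PyTorch framework, the module and parameter names, the channel sizes, GroupNorm as the regularization, nearest-neighbour upsampling, 1x1 convolutions for the dimension transformation, and a sigmoid as the mapping and output processing are all illustrative choices not specified by the claim.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualSpeechSeparationSketch(nn.Module):
    # Hypothetical module mirroring the data flow of the claimed visual
    # speech separation network; layer choices are assumptions, not the
    # claimed implementation.
    def __init__(self, audio_channels=256, visual_channels=512, hidden_channels=256):
        super().__init__()
        # Regularization and one-dimensional convolutional layer processing
        # applied to the mixed acoustic feature.
        self.audio_norm = nn.GroupNorm(1, audio_channels)
        self.audio_conv = nn.Conv1d(audio_channels, hidden_channels, kernel_size=1)
        # Dimension transformation applied after channel-wise concatenation.
        self.fusion_conv = nn.Conv1d(hidden_channels + visual_channels,
                                     hidden_channels, kernel_size=1)
        # Mask prediction over the fused visual and auditory feature.
        self.mask_predictor = nn.Sequential(
            nn.Conv1d(hidden_channels, hidden_channels, kernel_size=3, padding=1),
            nn.PReLU(),
            nn.Conv1d(hidden_channels, audio_channels, kernel_size=1),
        )

    def forward(self, mixed_acoustic, visual_semantic):
        # mixed_acoustic:  (batch, audio_channels, audio_frames)
        # visual_semantic: (batch, visual_channels, video_frames)
        deep_acoustic = self.audio_conv(self.audio_norm(mixed_acoustic))
        # Upsample the visual semantic feature along time so it is
        # synchronized with the deep mixed acoustic feature.
        deep_visual = F.interpolate(visual_semantic,
                                    size=deep_acoustic.shape[-1],
                                    mode="nearest")
        # Connect in the channel dimension, then transform the dimensionality
        # to obtain the fused visual and auditory feature.
        fused = self.fusion_conv(torch.cat([deep_acoustic, deep_visual], dim=1))
        # Predict the mask value, then map it to (0, 1) as the mask output.
        mask = torch.sigmoid(self.mask_predictor(fused))
        # Element-wise (matrix dot) product with the mixed acoustic feature
        # yields the acoustic feature of the target user.
        return mask * mixed_acoustic

# Example call with arbitrary shapes (assumed, for illustration only):
# net = VisualSpeechSeparationSketch()
# user_feature = net(torch.randn(1, 256, 1000), torch.randn(1, 512, 25))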