| CPC G10L 17/08 (2013.01) [G10L 17/04 (2013.01); G10L 25/21 (2013.01); G10L 25/78 (2013.01)] | 20 Claims |

|
1. A computer-implemented method for target speaker extraction, comprising:
receiving, by a target speaker extraction system, an audio frame of an audio signal and a corresponding video, wherein the target speaker extraction system comprises a trained multi-speaker detection machine-learning (“ML”) model, a trained lip-movement-based (“LM-based”) target speaker voice activity detection (VAD) ML model, and a trained speech separation ML model;
responsive to determining, by the trained multi-speaker detection ML model of the target speaker extraction system, a single speaker within the audio frame:
inputting, by the target speaker extraction system, the audio frame and the video to the trained LM-based target speaker VAD ML model; and
suppressing, by the trained LM-based target speaker VAD ML model of the target speaker extraction system and based on the video, speech in the audio frame from a non-target speaker, wherein suppressing the speech in the audio from a non-target speaker comprises comparing the audio frame to a voiceprint of a target speaker; and
responsive to determining, by the trained multi-speaker detection ML model of the target speaker extraction system, a plurality of speakers within the audio frame:
inputting, by the target speaker extraction system, the audio frame to the trained speech separation ML model; and
separating, by the trained speech separation ML model of the target speaker extraction system, the voice of the target speaker from a voice mixture in the audio frame.
|
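The branching recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the class name `TargetSpeakerExtractor`, its field names, and the callable signatures are hypothetical stand-ins for the three trained ML models, and the pass-through behavior when no speaker is detected is an assumption, since the claim does not address that case.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

Frame = List[float]  # hypothetical representation of one audio frame


@dataclass
class TargetSpeakerExtractor:
    # Each callable is a placeholder for a trained model named in the claim.
    count_speakers: Callable[[Frame], int]          # multi-speaker detection model
    lm_vad: Callable[[Frame, Any, Any], Frame]      # LM-based target speaker VAD model
    separate: Callable[[Frame], Frame]              # speech separation model
    voiceprint: Any                                 # voiceprint of the target speaker

    def process(self, frame: Frame, video: Any) -> Frame:
        n = self.count_speakers(frame)
        if n == 1:
            # Single speaker: the VAD model uses the video and the voiceprint
            # to suppress speech from a non-target speaker.
            return self.lm_vad(frame, video, self.voiceprint)
        if n > 1:
            # Plural speakers: separate the target voice from the mixture.
            return self.separate(frame)
        # Assumption: no speaker detected, so the frame is returned unchanged.
        return frame
```

In this sketch the multi-speaker detection result alone selects the branch, so the video is consulted only on the single-speaker path, matching the order of the steps in the claim.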