US 12,217,761 B2
Target speaker mode
Yuhui Chen, San Jose, CA (US); Qiyong Liu, Singapore (SG); Zhengwei Wei, Jiangxi (CN); and Yangbin Zeng, Zhejiang (CN)
Assigned to Zoom Video Communications, Inc., San Jose, CA (US)
Filed by Zoom Video Communications, Inc., San Jose, CA (US)
Filed on Oct. 31, 2021, as Appl. No. 17/515,480.
Claims priority of application No. 202111122227.X (CN), filed on Sep. 24, 2021.
Prior Publication US 2023/0095526 A1, Mar. 30, 2023
Int. Cl. G10L 17/08 (2013.01); G10L 17/04 (2013.01); G10L 25/21 (2013.01); G10L 25/78 (2013.01)

CPC G10L 17/08 (2013.01) [G10L 17/04 (2013.01); G10L 25/21 (2013.01); G10L 25/78 (2013.01)]

20 Claims

1. A computer-implemented method for target speaker extraction, comprising:

receiving, by a target speaker extraction system, an audio frame of an audio signal and a corresponding video, wherein the target speaker extraction system comprises a trained multi-speaker detection machine-learning (“ML”) model, a trained lip-movement-based (“LM-based”) target speaker voice activity detection (VAD) ML model, and a trained speech separation ML model;

responsive to determining, by the trained multi-speaker detection ML model of the target speaker extraction system, a single speaker within the audio frame:

inputting, by the target speaker extraction system, the audio frame and the video to the trained LM-based target speaker VAD ML model; and

suppressing, by the trained LM-based target speaker VAD ML model of the target speaker extraction system and based on the video, speech in the audio frame from a non-target speaker, wherein suppressing the speech in the audio from a non-target speaker comprises comparing the audio frame to a voiceprint of a target speaker; and

responsive to determining, by the trained multi-speaker detection ML model of the target speaker extraction system, a plurality of speakers within the audio frame:

inputting, by the target speaker extraction system, the audio frame to the trained speech separation ML model; and

separating, by the trained speech separation ML model of the target speaker extraction system, the voice of the target speaker from a voice mixture in the audio frame.
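The branching in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, only a sketch of the claimed control flow under stated assumptions. The class name `TargetSpeakerExtractor`, the three callables standing in for the trained ML models, the representation of a frame as a list of samples, and the choice to realize "suppressing" by zeroing the frame are all hypothetical. The claim does not say what happens when no speaker is detected, so the pass-through in that case is also an assumption.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence

Frame = Sequence[float]  # assumed representation: one audio frame as PCM samples


@dataclass
class TargetSpeakerExtractor:
    """Hypothetical container for the three trained models named in claim 1."""

    # Stand-in for the multi-speaker detection model: returns a speaker count.
    count_speakers: Callable[[Frame], int]
    # Stand-in for the LM-based target speaker VAD model: given the frame, the
    # video, and the target voiceprint, returns True if the target is speaking.
    lm_vad: Callable[[Frame, Any, Any], bool]
    # Stand-in for the speech separation model: returns the target's voice.
    separate: Callable[[Frame], Frame]
    # Target speaker voiceprint (format is not specified by the claim).
    voiceprint: Any

    def process(self, frame: Frame, video: Any) -> List[float]:
        n = self.count_speakers(frame)
        if n == 1:
            # Single speaker: route the frame and video to the LM-based VAD.
            if self.lm_vad(frame, video, self.voiceprint):
                return list(frame)
            # Assumption: suppression is modeled as silencing the frame.
            return [0.0] * len(frame)
        if n > 1:
            # Plurality of speakers: route the frame to speech separation.
            return list(self.separate(frame))
        # Assumption: no speaker detected, pass the frame through unchanged.
        return list(frame)


# Usage with toy stand-ins for the models (not real ML models).
extractor = TargetSpeakerExtractor(
    count_speakers=lambda f: 1,
    lm_vad=lambda f, video, vp: False,  # pretend the speaker is not the target
    separate=lambda f: [x * 0.5 for x in f],
    voiceprint=None,
)
silenced = extractor.process([0.5, -0.25], video=None)
```

Keeping the three models behind plain callables makes the dispatch testable without any trained weights, which is why the sketch takes that shape; a real system would presumably wrap model inference behind similar interfaces.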