US 11,790,900 B2
System and method for audio-visual multi-speaker speech separation with location-based selection
Yaniv Shaked, Binyamina (IL); Yoav Ramon, Tel Aviv (IL); Eyal Shapira, Kiryat Ono (IL); and Roy Baharav, Tel Aviv (IL)
Assigned to HI AUTO LTD., Tel Aviv (IL)
Filed by Hi Auto LTD., Tel Aviv (IL)
Filed on Apr. 6, 2020, as Appl. No. 16/841,142.
Prior Publication US 2021/0312915 A1, Oct. 7, 2021
Int. Cl. G10L 15/20 (2006.01); G10L 21/0272 (2013.01); G10L 17/18 (2013.01)
CPC G10L 15/20 (2013.01) [G10L 17/18 (2013.01); G10L 21/0272 (2013.01)] 29 Claims
OG exemplary drawing
 
1. A method for audio-visual multi-speaker speech separation, comprising:
receiving audio signals captured by at least one microphone;
receiving video signals captured by at least one camera; and
providing the audio signals and the video signals to a sync engine configured to:
derive an audio vector from the audio signals and a video vector from the video signals;
compute a correlation score by shifting either the audio vector or the video vector and comparing the shifted vector against the remaining unshifted vector, wherein the correlation score is based on a number of shifts needed to achieve a match;
extract facial characteristics of each speaker from multi-speaker synchronized video signals to provide for mutual influence between audio and video to assist in an audio-visual separation; and
apply audio-visual separation to the received audio signals and the video signals by simultaneously analyzing each of the multi-speaker synchronized video signals to provide isolation of sounds from the at least one microphone and the at least one camera based on the correlation score by generating an audio output comprising any of:
a time-shifted variant of the audio signal based on a number of shifts of the audio signal assigned a highest correlation score;
a time-shifted variant of the video signal based on a number of shifts of the video signal assigned a highest correlation score; and
the audio signal time-shifted to synchronize with lip movements in the video signal,
wherein the audio-visual separation is based, in part, on angle positions of at least one speaker relative to the at least one camera.
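The shift-based correlation scoring and time-shifted output recited in the claim admit a compact illustration. The following Python sketch is an assumption for illustration only, not the patented implementation: it derives a per-frame audio energy vector and pairs it with a lip-opening vector taken from the video, shifts one vector against the other, keeps the shift count assigned the highest correlation score, and returns the correspondingly time-shifted variant of the audio so that it synchronizes with the lip movements. The specific features (RMS energy per video frame, mouth-opening height) and the wrap-around shift are simplifications.

```python
import numpy as np

def frame_audio_energy(audio, sr, fps):
    # One RMS-energy value per video frame (assumed audio feature).
    audio = np.asarray(audio, dtype=np.float64)
    hop = int(round(sr / fps))
    n_frames = len(audio) // hop
    frames = audio[: n_frames * hop].reshape(n_frames, hop)
    return np.sqrt((frames ** 2).mean(axis=1))

def best_shift(audio_vec, video_vec, max_shift=15):
    # Shift one vector against the other; keep the shift count with
    # the highest normalized correlation score.
    best_k, best_score = 0, -np.inf
    for k in range(-max_shift, max_shift + 1):
        shifted = np.roll(audio_vec, k)           # circular shift as a simplification
        n = min(len(shifted), len(video_vec))
        score = np.corrcoef(shifted[:n], video_vec[:n])[0, 1]
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score

def align_audio_to_video(audio, sr, mouth_openings, fps):
    # Emit the time-shifted variant of the audio corresponding to the
    # shift count that scored highest against the lip-movement vector.
    audio_vec = frame_audio_energy(audio, sr, fps)
    k, _ = best_shift(audio_vec, mouth_openings)
    samples = int(round(k * sr / fps))
    return np.roll(audio, samples)
```

In this sketch the same loop could equally be run with the video vector as the shifted operand, matching the claim's recitation that either vector may be shifted and that the output may be a time-shifted variant of either signal.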
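The claim further conditions the separation, in part, on angle positions of a speaker relative to the camera. As a hedged sketch of one way such an angle could be derived from the video stream, the horizontal offset of a detected face within the frame can be mapped to an angle under a pinhole-camera assumption; the horizontal field of view is a hypothetical camera parameter, and the claim does not specify this mechanism.

```python
import math

def speaker_angle_deg(face_center_x, frame_width, horizontal_fov_deg=70.0):
    # Approximate horizontal angle of a detected face relative to the
    # camera's optical axis (pinhole-camera assumption).
    half_width = frame_width / 2.0
    focal_px = half_width / math.tan(math.radians(horizontal_fov_deg / 2.0))
    return math.degrees(math.atan2(face_center_x - half_width, focal_px))
```

Such an angle could, for example, be used to associate a separated audio stream with a visible speaker or to weight microphone channels; those uses are illustrative assumptions rather than limitations recited in the claim.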