US 12,266,361 B2
Systems and methods for correlating speech and lip movement
Yadong Wang, Campbell, CA (US); and Shilpa Jois Rao, Cupertino, CA (US)
Assigned to Netflix, Inc., Los Gatos, CA (US)
Filed by Netflix, Inc., Los Gatos, CA (US)
Filed on Jun. 24, 2020, as Appl. No. 16/911,247.
Prior Publication US 2021/0407510 A1, Dec. 30, 2021
Int. Cl. G10L 15/25 (2013.01); G10L 25/78 (2013.01)
CPC G10L 15/25 (2013.01) [G10L 25/78 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A computer-implemented method comprising:
training at least one machine learning model to identify auditory speech content within a media file, the machine learning model comprising a voice activity detection algorithm;
using the trained voice activity detection (VAD) algorithm, identifying auditory speech content within the media file;
detecting, solely among those segments of the media file that include auditory speech content, visible movement of a speaker's lips rendered in media content of the media file, wherein the visible movement is detected when variance for the segments of the media file that include auditory speech content is above a heuristic threshold;
correlating the detected visible movement of the speaker's lips with the auditory speech content identified using the trained voice activity detection algorithm according to a cross-correlation metric, wherein the cross-correlation metric is weighted using the trained VAD algorithm, and wherein the heuristic threshold for those segments of the media file that include auditory speech is adjusted using the trained VAD algorithm;
determining, based on the correlation between the visible movement of the speaker's lips and the identified auditory speech content, that a portion of the auditory speech content is visibly synchronized with the visible movement of the speaker's lips; and
recording, based on the determination that the portion of the auditory speech content is visibly synchronized with the visible movement of the speaker's lips, an indicator of an importance of visual synchrony between a dubbing of the auditory speech content and the visible movement of the speaker's lips as metadata of the media file.