US 12,272,384 B1
Synchronization of lip movement images to audio voice signal
Kyrylo Sydorchuk, Kharkov (UA); Volodymyr Cherniavskyi, Kharkov (UA); Stanislav Mihailevschii, Ribnita (MD); Oleh Vallas, Kharkov (UA); Ivan Shuhaienko, Kharkov (UA); Daniil Krasylnikov, Kharkov (UA); and Yurii Astafiev, Kharkov (UA)
Assigned to Pheon, Inc., San Francisco, CA (US)
Filed by Pheon, Inc., San Francisco, CA (US)
Filed on Aug. 15, 2024, as Appl. No. 18/805,819.
Int. Cl. G11B 27/031 (2006.01); G06V 10/82 (2022.01); G06V 20/40 (2022.01); G06V 40/16 (2022.01); G10L 15/02 (2006.01); G10L 15/04 (2013.01); G10L 15/16 (2006.01); G10L 25/57 (2013.01)

CPC G11B 27/031 (2013.01) [G06V 10/82 (2022.01); G06V 20/46 (2022.01); G06V 20/49 (2022.01); G06V 40/169 (2022.01); G06V 40/175 (2022.01); G10L 15/02 (2013.01); G10L 15/04 (2013.01); G10L 15/16 (2013.01); G10L 25/57 (2013.01)]

20 Claims

1. A method comprising:

acquiring, by a computing device, a source video;

dividing, by the computing device, the source video into a set of image frames and a set of audio frames;

generating, by the computing device, a vector database based on the set of image frames and the set of audio frames, wherein a vector of the vector database includes a face vector and an audio vector, the face vector being determined based on an image frame of the set of image frames and the audio vector being determined based on an audio frame of the set of audio frames, the audio frame corresponding to the image frame;

receiving, by the computing device, a target image frame and a target audio frame, the target audio frame being selected from a target audio record, wherein the source video is a first record and the target audio record is a second record, the second record being different from the first record;

determining, by the computing device, a target image vector based on the target image frame and a target audio vector based on the target audio frame;

searching, by the computing device, the vector database to select a pre-determined number of vectors corresponding to the target image vector and the target audio frame; and

generating, by the computing device and based on the pre-determined number of vectors, an output image frame of an output video, the output video being a third record, the third record being different from the first record.
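
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and every name in it (Entry, embed_face, embed_audio, build_database, search, generate_output_frame) is hypothetical. The sketch assumes that frames are already available as short lists of numbers, that the embedding functions are identity placeholders standing in for real face and audio encoders, that the search ranks entries by the sum of Euclidean distances in face space and audio space, and that the output frame is formed by averaging the selected face vectors. The claim does not specify any of these choices.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence

Vector = List[float]


@dataclass
class Entry:
    """One vector of the database: a face vector paired with an audio vector."""
    face: Vector
    audio: Vector


def embed_face(image_frame: Sequence[float]) -> Vector:
    # Placeholder (assumption): a real system would run a face encoder here.
    return [float(x) for x in image_frame]


def embed_audio(audio_frame: Sequence[float]) -> Vector:
    # Placeholder (assumption): a real system would run an audio encoder here.
    return [float(x) for x in audio_frame]


def build_database(image_frames, audio_frames) -> List[Entry]:
    # Pair each image frame with its corresponding audio frame.
    return [Entry(embed_face(img), embed_audio(aud))
            for img, aud in zip(image_frames, audio_frames)]


def search(db: List[Entry], target_face: Vector, target_audio: Vector,
           k: int) -> List[Entry]:
    # Assumed metric: sum of Euclidean distances in face and audio space.
    def score(entry: Entry) -> float:
        return (math.dist(entry.face, target_face)
                + math.dist(entry.audio, target_audio))
    return sorted(db, key=score)[:k]


def generate_output_frame(neighbors: List[Entry]) -> Vector:
    # Assumed generator: element-wise mean of the selected face vectors.
    n = len(neighbors)
    dim = len(neighbors[0].face)
    return [sum(e.face[d] for e in neighbors) / n for d in range(dim)]


# Toy "source video" (first record), already divided into frames.
source_image_frames = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
source_audio_frames = [[0.0], [1.0], [4.0]]
database = build_database(source_image_frames, source_audio_frames)

# Target image frame and a target audio frame from a different record.
target_face = embed_face([1.0, 1.0])
target_audio = embed_audio([1.2])

# Select a pre-determined number of vectors (here k = 2) and build one frame.
neighbors = search(database, target_face, target_audio, k=2)
output_frame = generate_output_frame(neighbors)
```

In this toy run the two selected entries are the source frames whose face and audio vectors lie closest to the target pair. A production system would more likely use a learned generator and an approximate nearest-neighbor index than a full sort.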