US 12,452,477 B2
Video and audio synchronization with dynamic frame and sample rates
Clara Fernandez Labrador, Zurich (CH); Cafer Mertcan Akcay, Zurich (CH); Christopher Richard Schroers, Uster (CH); Joan Massich Vall, Zurich (CH); Scott Labrozzi, Cary, NC (US); Mitchel Jacobs, Malibu, CA (US); Katherine Hinsen, Los Angeles, CA (US); and Eitan Abecassis, Raleigh, NC (US)
Assigned to Disney Enterprises, Inc., Burbank, CA (US)
Filed by Disney Enterprises, Inc., Burbank, CA (US)
Filed on May 24, 2024, as Appl. No. 18/674,558.
Claims priority of provisional application 63/521,604, filed on Jun. 16, 2023.
Prior Publication US 2024/0422380 A1, Dec. 19, 2024
Int. Cl. H04N 21/43 (2011.01); H04N 19/60 (2014.01)
CPC H04N 21/4307 (2013.01) [H04N 19/60 (2014.11)]
24 Claims
 
1. A system comprising:
a hardware processor; and
a memory storing a video/audio (V/A) synchronizer including a video encoder and an audio encoder;
the hardware processor configured to execute the V/A synchronizer to:
receive raw video and raw audio extracted from media content;
partition the raw video into a plurality of video frame patches;
partition the raw audio into a plurality of audio samples;
pre-process the plurality of video frame patches for encoding to provide a plurality of pre-processed video frame patches;
pre-process the plurality of audio samples for encoding to provide a plurality of pre-processed audio samples;
encode, using the video encoder, the plurality of pre-processed video frame patches to provide a plurality of pre-processed and encoded video frame patches;
encode, using the audio encoder, the plurality of pre-processed audio samples to provide a plurality of pre-processed and encoded audio samples;
provide, using one or more of the plurality of pre-processed and encoded video frame patches, a latent representation of the raw video;
provide, using the plurality of pre-processed and encoded audio samples, a latent representation of the raw audio; and
synchronize, using the latent representation of the raw video and the latent representation of the raw audio, the raw audio with the raw video.
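
The sequence of steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The claim does not specify the encoders or how the latent representations are compared, so everything below is an assumption made for illustration: the "encoders" are hand-written stand-ins (mean patch brightness and windowed RMS energy), the offset is found by plain cross-correlation of the two latent sequences, the frame rate and sample rate are held constant, and every function name is hypothetical.

```python
import math

def partition_video(frames, patch):
    """Split each frame (a 2D list of pixels) into patch x patch blocks."""
    out = []
    for f in frames:
        h, w = len(f), len(f[0])
        out.append([[row[x:x + patch] for row in f[y:y + patch]]
                    for y in range(0, h, patch) for x in range(0, w, patch)])
    return out

def partition_audio(samples, window):
    """Split the audio into consecutive, non-overlapping windows of samples."""
    return [samples[i:i + window]
            for i in range(0, len(samples) - window + 1, window)]

def preprocess_patch(p):
    """Assumed pre-processing: scale 8-bit pixel values to [0, 1]."""
    return [[v / 255.0 for v in row] for row in p]

def preprocess_window(w):
    """Assumed pre-processing: remove the DC offset of a window."""
    m = sum(w) / len(w)
    return [s - m for s in w]

def encode_patch(p):
    """Stand-in video encoder (assumption): mean brightness of the patch."""
    vals = [v for row in p for v in row]
    return sum(vals) / len(vals)

def encode_window(w):
    """Stand-in audio encoder (assumption): RMS energy of the window."""
    return math.sqrt(sum(s * s for s in w) / len(w))

def video_latent(frames, patch):
    """One latent value per frame, averaged over its encoded patches."""
    latent = []
    for patches in partition_video(frames, patch):
        codes = [encode_patch(preprocess_patch(p)) for p in patches]
        latent.append(sum(codes) / len(codes))
    return latent

def audio_latent(samples, window):
    """One latent value per audio window."""
    return [encode_window(preprocess_window(w))
            for w in partition_audio(samples, window)]

def estimate_offset(v, a, max_lag):
    """Return the lag (in frames) for which a[t + lag] best matches v[t]."""
    def centered(x):
        m = sum(x) / len(x)
        return [xi - m for xi in x]
    v, a = centered(v), centered(a)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = sum(v[t] * a[t + lag]
                    for t in range(len(v)) if 0 <= t + lag < len(a))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def synchronize(samples, lag, window):
    """Shift the raw audio by lag frames so it lines up with the raw video."""
    if lag >= 0:
        return samples[lag * window:]
    return [0.0] * (-lag * window) + samples

def demo():
    window = 8  # audio samples per video frame (assumed constant here)
    # Synthetic video: 40 frames of 4x4 pixels, bright flashes at frames 10 and 25.
    frames = [[[255 if t in (10, 25) else 10] * 4 for _ in range(4)]
              for t in range(40)]
    # Synthetic audio: bursts at frames 13 and 28, i.e. 3 frames late.
    samples = []
    for t in range(40):
        amp = 1.0 if t in (13, 28) else 0.0
        samples += [amp * (1 if i % 2 == 0 else -1) for i in range(window)]
    lag = estimate_offset(video_latent(frames, 2),
                          audio_latent(samples, window), max_lag=10)
    return lag, synchronize(samples, lag, window)

if __name__ == "__main__":
    lag, aligned = demo()
    print("estimated audio delay in frames:", lag)
```

In this toy setup the audio bursts trail the video flashes by three frames, so the estimated lag is 3 and `synchronize` drops the first 24 audio samples. A system of the kind claimed would use learned encoders in place of the brightness and RMS stand-ins, and would need to handle frame and sample rates that vary, which this sketch does not attempt.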