US 11,659,217 B1
Event based audio-video sync detection
Hooman Mahyar, Seattle, WA (US); Avijit Vajpayee, West Windsor, NJ (US); Abhinav Jain, Atlanta, GA (US); Arjun Cholkar, Bothell, WA (US); and Vimal Bhat, Redmond, WA (US)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Mar. 29, 2021, as Appl. No. 17/301,212.
Int. Cl. H04N 21/242 (2011.01); H04N 21/234 (2011.01); H04N 21/233 (2011.01)
CPC H04N 21/242 (2013.01) [H04N 21/233 (2013.01); H04N 21/234 (2013.01)] 17 Claims
1. A method, comprising:
receiving a media presentation, wherein the media presentation comprises a video component that includes a sequence of video frames, and an audio component that includes a sequence of audio bins, each audio bin corresponding to one of the video frames relative to a media timeline;
generating a first audio vector based on a first audio bin of the sequence of audio bins, the first audio vector representing a plurality of features of the first audio bin;
generating a first object vector based on a first video frame that corresponds to the first audio bin, wherein the first object vector represents one or more objects in the first video frame;
generating a first object attribute vector that represents one or more features of the one or more objects represented by the first object vector;
generating a first confidence score using a first machine learning model, the first audio vector, the first object vector, and the first object attribute vector, the first confidence score representing a measure of correlation between the first audio bin and the first video frame;
generating a predicted object vector and a predicted object attribute vector using a second machine learning model and the first audio vector, the predicted object vector and the predicted object attribute vector representing a hypothetical video frame having a high degree of correlation with the first audio bin;
generating a second confidence score that represents a measure of correlation between the predicted object vector and the predicted object attribute vector, and the first video frame;
determining, based on the first confidence score and the second confidence score, that the audio component and the video component are desynchronized; and
modifying one or both of the audio component and the video component to improve synchronization of the audio and video components of the media presentation.
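
The scoring and decision steps of claim 1 can be illustrated with a minimal sketch. The claim does not specify how the second confidence score is computed or how the two scores are combined, so everything below is an assumption made for illustration: cosine similarity stands in for the "measure of correlation", the two scores are averaged, and the threshold of 0.5 is arbitrary. The function names (`cosine_similarity`, `second_confidence`, `is_desynchronized`) are hypothetical and do not come from the patent; the machine learning models that would produce the vectors and the first confidence score are not modeled.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def second_confidence(
    predicted_object: Sequence[float],
    predicted_attribute: Sequence[float],
    frame_object: Sequence[float],
    frame_attribute: Sequence[float],
) -> float:
    """Assumed stand-in for the second confidence score.

    Concatenates the predicted object and attribute vectors, does the same
    for the vectors derived from the actual video frame, and compares the
    two with cosine similarity.
    """
    predicted = list(predicted_object) + list(predicted_attribute)
    actual = list(frame_object) + list(frame_attribute)
    return cosine_similarity(predicted, actual)


def is_desynchronized(
    first_score: float, second_score: float, threshold: float = 0.5
) -> bool:
    """Assumed decision rule: flag desynchronization when the mean of the
    two confidence scores falls below the threshold."""
    return (first_score + second_score) / 2.0 < threshold
```

In this sketch, a frame whose object and attribute vectors match the vectors predicted from the audio bin yields a second score near 1, and low values of both scores together lead to the audio and video components being flagged as desynchronized.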