CPC G06V 10/774 (2022.01) [G06V 10/761 (2022.01); G06V 20/46 (2022.01); G06V 20/49 (2022.01); G10L 25/03 (2013.01); G10L 25/57 (2013.01)]    18 Claims

1. A computer-implemented method comprising:
receiving a training input including a media sequence, the media sequence including a video sequence paired with an audio sequence;
segmenting the media sequence into a set of video sequence segments and a set of audio sequence segments;
extracting visual features for each video sequence segment of the set of video sequence segments and audio features for each audio sequence segment of the set of audio sequence segments;
generating, by transformer networks, contextualized visual features from the extracted visual features and contextualized audio features from the extracted audio features, the transformer networks including a visual transformer and an audio transformer;
generating, using a similarity metric, predicted video sequence segment and audio sequence segment pairings based on the contextualized visual features and the contextualized audio features;
ranking the predicted video sequence segment and audio sequence segment pairings based on their corresponding similarity values to identify mismatched pairings; and
training the visual transformer to generate the contextualized visual features and the audio transformer to generate the contextualized audio features based on calculating a loss using ground truth pairings and the mismatched pairings.
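The following is a minimal, illustrative sketch of one plausible reading of claim 1, not the patented implementation: two transformer encoders contextualize per-segment visual and audio features, segment pairings are scored with cosine similarity, the highest-scoring mismatched pairings are identified by ranking, and a contrastive loss over ground-truth and mismatched pairings trains both encoders. All dimensions, layer counts, class and function names, and the specific loss form are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SegmentPairingModel(nn.Module):
    """Hypothetical model: a visual transformer and an audio transformer that
    contextualize per-segment features projected into a shared space."""

    def __init__(self, visual_dim=512, audio_dim=128, model_dim=256,
                 num_layers=2, num_heads=4):
        super().__init__()
        # Project raw segment features into a shared model dimension.
        self.visual_proj = nn.Linear(visual_dim, model_dim)
        self.audio_proj = nn.Linear(audio_dim, model_dim)

        def make_encoder():
            layer = nn.TransformerEncoderLayer(
                d_model=model_dim, nhead=num_heads, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=num_layers)

        # Separate transformers for the visual and audio streams.
        self.visual_transformer = make_encoder()
        self.audio_transformer = make_encoder()

    def forward(self, visual_feats, audio_feats):
        # visual_feats: (batch, num_segments, visual_dim)
        # audio_feats:  (batch, num_segments, audio_dim)
        v = self.visual_transformer(self.visual_proj(visual_feats))
        a = self.audio_transformer(self.audio_proj(audio_feats))
        return v, a  # contextualized per-segment features


def pairing_loss(v, a, num_hard_negatives=4, temperature=0.07):
    """Score all video/audio segment pairings, rank the mismatched ones by
    similarity, and contrast ground-truth pairings against the top-ranked
    mismatched pairings (an assumed hard-negative contrastive loss)."""
    v = F.normalize(v, dim=-1)
    a = F.normalize(a, dim=-1)
    # Cosine similarity between every video segment and every audio segment.
    sim = torch.matmul(v, a.transpose(1, 2)) / temperature   # (B, S, S)
    batch, num_segments, _ = sim.shape

    # Ground-truth pairings sit on the diagonal (segment i pairs with segment i).
    eye = torch.eye(num_segments, device=sim.device, dtype=torch.bool)
    mismatched = sim.masked_fill(eye, float('-inf'))
    # Rank mismatched pairings by similarity; keep the highest-scoring ones.
    hard_negatives, _ = mismatched.topk(num_hard_negatives, dim=-1)

    positives = sim.diagonal(dim1=1, dim2=2).unsqueeze(-1)    # (B, S, 1)
    logits = torch.cat([positives, hard_negatives], dim=-1)   # (B, S, 1+K)
    labels = torch.zeros(batch, num_segments, dtype=torch.long,
                         device=sim.device)
    return F.cross_entropy(logits.flatten(0, 1), labels.flatten())


# Example usage with random stand-in features (segmentation and feature
# extraction are assumed to have happened upstream).
model = SegmentPairingModel()
visual = torch.randn(2, 8, 512)   # 2 media sequences, 8 video segments each
audio = torch.randn(2, 8, 128)    # matching audio segments
v_ctx, a_ctx = model(visual, audio)
loss = pairing_loss(v_ctx, a_ctx)
loss.backward()
```

In this reading, "ranking the predicted pairings to identify mismatched pairings" is treated as hard-negative mining over the similarity matrix; the claim itself does not fix the loss function, so the cross-entropy contrastive form above is only one possible instantiation.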