US 12,340,563 B2
Self-supervised audio-visual learning for correlating music and video
Justin Salamon, San Francisco, CA (US); Bryan Russell, San Francisco, CA (US); and Didac Suris Coll-Vinent, New York, NY (US)
Assigned to Adobe Inc., San Jose, CA (US)
Filed by Adobe Inc., San Jose, CA (US)
Filed on May 11, 2022, as Appl. No. 17/742,322.
Prior Publication US 2023/0368503 A1, Nov. 16, 2023
Int. Cl. G06V 10/774 (2022.01); G06V 10/74 (2022.01); G06V 20/40 (2022.01); G10L 25/03 (2013.01); G10L 25/57 (2013.01)
CPC G06V 10/774 (2022.01) [G06V 10/761 (2022.01); G06V 20/46 (2022.01); G06V 20/49 (2022.01); G10L 25/03 (2013.01); G10L 25/57 (2013.01)] 18 Claims
OG exemplary drawing
 
1. A computer-implemented method comprising:
receiving a training input including a media sequence, the media sequence including a video sequence paired with an audio sequence;
segmenting the media sequence into a set of video sequence segments and a set of audio sequence segments;
extracting visual features for each video sequence segment of the set of video sequence segments and audio features for each audio sequence segment of the set of audio sequence segments;
generating, by transformer networks, contextualized visual features from the extracted visual features and contextualized audio features from the extracted audio features, the transformer networks including a visual transformer and an audio transformer;
generating, using a similarity metric, predicted video sequence segment and audio sequence segment pairings based on the contextualized visual features and the contextualized audio features;
ranking the predicted video sequence segment and audio sequence segment pairings based on their corresponding similarity values to identify mismatched pairings; and
training the visual transformer to generate the contextualized visual features and the audio transformer to generate the contextualized audio features based on calculating a loss using ground truth pairings and the mismatched pairings.
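The claim recites a self-supervised training loop: per-segment visual and audio features are contextualized by two transformers, scored with a similarity metric, ranked to surface mismatched (negative) pairings, and used with the ground-truth pairings to compute a training loss. The following is a minimal sketch of that loop, assuming PyTorch, pre-extracted per-segment features, cosine similarity as the metric, and an InfoNCE-style contrastive loss with top-k hard-negative mining; the module names, dimensions, and the specific loss form are illustrative assumptions rather than the patent's own implementation.

```python
# Hypothetical sketch of the claimed training step (not the patented implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SegmentTransformer(nn.Module):
    """Contextualizes per-segment features across a sequence (visual or audio)."""

    def __init__(self, feat_dim=512, n_heads=8, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):                      # x: (batch, n_segments, feat_dim)
        return self.encoder(x)                 # contextualized segment features


def train_step(visual_tf, audio_tf, optimizer, vis_feats, aud_feats, k_hard=4):
    """One step: contextualize, score pairings, mine mismatches, compute loss, update."""
    z_v = F.normalize(visual_tf(vis_feats), dim=-1)    # (B, S, D)
    z_a = F.normalize(audio_tf(aud_feats), dim=-1)     # (B, S, D)

    # Flatten segments and compute a similarity matrix between every video
    # segment and every audio segment (cosine similarity on normalized features).
    z_v = z_v.reshape(-1, z_v.shape[-1])                # (N, D) with N = B*S
    z_a = z_a.reshape(-1, z_a.shape[-1])                # (N, D)
    sim = z_v @ z_a.t()                                 # (N, N) predicted pairings

    n = sim.shape[0]
    diag = torch.eye(n, dtype=torch.bool, device=sim.device)

    # Rank the off-diagonal (mismatched) pairings by similarity and keep the
    # k highest-scoring ones per video segment as hard negatives.
    neg_sim = sim.masked_fill(diag, float('-inf'))
    hard_negs, _ = neg_sim.topk(k_hard, dim=1)          # (N, k_hard)

    # Contrastive loss over {ground-truth pairing, mined mismatched pairings}:
    # the ground-truth (diagonal) pairing is the target class at index 0.
    pos = sim.diagonal().unsqueeze(1)                   # (N, 1)
    logits = torch.cat([pos, hard_negs], dim=1)         # (N, 1 + k_hard)
    targets = torch.zeros(n, dtype=torch.long, device=sim.device)
    loss = F.cross_entropy(logits, targets)

    # Update both transformers jointly.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In use, `vis_feats` and `aud_feats` would come from the per-segment feature extraction step of the claim (for example, a video backbone and an audio spectrogram encoder applied to each segment); those extractors, and the segmentation of the media sequence itself, are omitted from this sketch.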