CPC G06V 10/774 (2022.01) [G06V 10/761 (2022.01); G06V 20/46 (2022.01); G06V 20/49 (2022.01); G10L 25/03 (2013.01); G10L 25/57 (2013.01)]    18 Claims

1. A computer-implemented method comprising:
receiving a training input including a media sequence, the media sequence including a video sequence paired with an audio sequence;
segmenting the media sequence into a set of video sequence segments and a set of audio sequence segments;
extracting visual features for each video sequence segment of the set of video sequence segments and audio features for each audio sequence segment of the set of audio sequence segments;
generating, by transformer networks, contextualized visual features from the extracted visual features and contextualized audio features from the extracted audio features, the transformer networks including a visual transformer and an audio transformer;
generating, using a similarity metric, predicted video sequence segment and audio sequence segment pairings based on the contextualized visual features and the contextualized audio features;
ranking the predicted video sequence segment and audio sequence segment pairings based on their corresponding similarity values to identify mismatched pairings; and
training the visual transformer to generate the contextualized visual features and the audio transformer to generate the contextualized audio features based on calculating a loss using ground truth pairings and the mismatched pairings.
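The following is a minimal, illustrative sketch of one plausible reading of claim 1, not the patented implementation: two transformer encoders contextualize per-segment visual and audio features, segment pairings are scored with cosine similarity, the highest-scoring mismatched pairings are identified by ranking, and a contrastive loss over ground-truth and mismatched pairings trains both encoders. All dimensions, layer counts, class and function names, and the specific loss form are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SegmentPairingModel(nn.Module):
    """Hypothetical model: a visual transformer and an audio transformer that
    contextualize per-segment features projected into a shared space."""

    def __init__(self, visual_dim=512, audio_dim=128, model_dim=256,
                 num_layers=2, num_heads=4):
        super().__init__()
        # Project raw segment features into a shared model dimension.
        self.visual_proj = nn.Linear(visual_dim, model_dim)
        self.audio_proj = nn.Linear(audio_dim, model_dim)

        def make_encoder():
            layer = nn.TransformerEncoderLayer(
                d_model=model_dim, nhead=num_heads, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=num_layers)

        # Separate transformers for the visual and audio streams.
        self.visual_transformer = make_encoder()
        self.audio_transformer = make_encoder()

    def forward(self, visual_feats, audio_feats):
        # visual_feats: (batch, num_segments, visual_dim)
        # audio_feats:  (batch, num_segments, audio_dim)
        v = self.visual_transformer(self.visual_proj(visual_feats))
        a = self.audio_transformer(self.audio_proj(audio_feats))
        return v, a  # contextualized per-segment features


def pairing_loss(v, a, num_hard_negatives=4, temperature=0.07):
    """Score all video/audio segment pairings, rank the mismatched ones by
    similarity, and contrast ground-truth pairings against the top-ranked
    mismatched pairings (an assumed hard-negative contrastive loss)."""
    v = F.normalize(v, dim=-1)
    a = F.normalize(a, dim=-1)
    # Cosine similarity between every video segment and every audio segment.
    sim = torch.matmul(v, a.transpose(1, 2)) / temperature   # (B, S, S)
    batch, num_segments, _ = sim.shape

    # Ground-truth pairings sit on the diagonal (segment i pairs with segment i).
    eye = torch.eye(num_segments, device=sim.device, dtype=torch.bool)
    mismatched = sim.masked_fill(eye, float('-inf'))
    # Rank mismatched pairings by similarity; keep the highest-scoring ones.
    hard_negatives, _ = mismatched.topk(num_hard_negatives, dim=-1)

    positives = sim.diagonal(dim1=1, dim2=2).unsqueeze(-1)    # (B, S, 1)
    logits = torch.cat([positives, hard_negatives], dim=-1)   # (B, S, 1+K)
    labels = torch.zeros(batch, num_segments, dtype=torch.long,
                         device=sim.device)
    return F.cross_entropy(logits.flatten(0, 1), labels.flatten())


# Example usage with random stand-in features (segmentation and feature
# extraction are assumed to have happened upstream).
model = SegmentPairingModel()
visual = torch.randn(2, 8, 512)   # 2 media sequences, 8 video segments each
audio = torch.randn(2, 8, 128)    # matching audio segments
v_ctx, a_ctx = model(visual, audio)
loss = pairing_loss(v_ctx, a_ctx)
loss.backward()
```

In this reading, "ranking the predicted pairings to identify mismatched pairings" is treated as hard-negative mining over the similarity matrix; the claim itself does not fix the loss function, so the cross-entropy contrastive form above is only one possible instantiation.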