US 11,989,941 B2
Systems and methods for video and language pre-training
Dongxu Li, Canberra (AU); Junnan Li, Singapore (SG); and Chu Hong Hoi, Singapore (SG)
Assigned to Salesforce, Inc., San Francisco, CA (US)
Filed by Salesforce, Inc., San Francisco, CA (US)
Filed on Dec. 30, 2021, as Appl. No. 17/566,173.
Claims priority of provisional application 63/279,986, filed on Nov. 16, 2021.
Prior Publication US 2023/0154188 A1, May 18, 2023
Int. Cl. G06V 10/00 (2022.01); G06F 40/279 (2020.01); G06F 40/284 (2020.01); G06V 10/26 (2022.01); G06V 10/74 (2022.01); G06V 10/774 (2022.01); G06V 10/776 (2022.01); G06V 10/80 (2022.01); G06V 20/40 (2022.01)
CPC G06V 20/41 (2022.01) [G06F 40/279 (2020.01); G06F 40/284 (2020.01); G06V 10/26 (2022.01); G06V 10/761 (2022.01); G06V 10/774 (2022.01); G06V 10/776 (2022.01); G06V 10/806 (2022.01); G06V 20/46 (2022.01); G06V 20/47 (2022.01)] 20 Claims
(OG exemplary drawing not reproduced)
 
1. A method of video-and-language alignment contrastive pretraining, the method comprising:
obtaining, via a data interface, a plurality of video frames and a plurality of text descriptions corresponding to the plurality of video frames;
encoding, by a video encoder, the plurality of video frames into video feature representations;
encoding, by a text encoder, the plurality of text descriptions into text feature representations;
computing similarity scores between the video feature representations and the text feature representations;
computing a first contrastive loss based at least in part on an exponential of computed similarity scores corresponding to a matching pair of a first video frame and a first text description and one or more pairs of the first video frame and text descriptions that do not match the first video frame;
computing a second contrastive loss based at least in part on an exponential of computed similarity scores corresponding to the matching pair of the first video frame and the first text description and one or more pairs of the first text description and video frames that do not match the first text description;
computing a video-text contrastive loss by taking a weighted sum of the first and the second contrastive losses; and
updating the video encoder and the text encoder based at least in part on the video-text contrastive loss.
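The loss recited in claim 1 is a symmetric video-text contrastive objective: a video-to-text term and a text-to-video term, each built from exponentials of similarity scores between matched and mismatched pairs, combined as a weighted sum. As a minimal sketch only (the function name, equal 0.5 weighting, and temperature value are illustrative assumptions, not taken from the patent), the computation can be written as:

```python
import numpy as np

def video_text_contrastive_loss(video_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched video/text features.

    Row i of video_feats is assumed to match row i of text_feats;
    all other rows in the batch serve as negative (non-matching) pairs.
    """
    # L2-normalize features so the dot product is a cosine similarity score.
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    sim = (v @ t.T) / temperature  # similarity scores for all video-text pairs

    # First contrastive loss (video-to-text): for each video, the exponential of
    # the matching pair's score is normalized against the exponentials of that
    # video paired with every text in the batch, including non-matching ones.
    exp_v2t = np.exp(sim - sim.max(axis=1, keepdims=True))
    loss_v2t = -np.log(np.diag(exp_v2t / exp_v2t.sum(axis=1, keepdims=True)))

    # Second contrastive loss (text-to-video): the same matching pair, normalized
    # against that text paired with every video in the batch.
    exp_t2v = np.exp(sim - sim.max(axis=0, keepdims=True))
    loss_t2v = -np.log(np.diag(exp_t2v / exp_t2v.sum(axis=0, keepdims=True)))

    # Video-text contrastive loss: weighted sum of the two terms
    # (equal weights here, purely for illustration).
    return 0.5 * loss_v2t.mean() + 0.5 * loss_t2v.mean()
```

In practice this scalar would be backpropagated through both encoders to update their parameters, per the final step of the claim; the NumPy version above only illustrates the forward computation.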