US 12,106,541 B2
Systems and methods for contrastive pretraining with video tracking supervision
Brian Chen, New York, NY (US); Ramprasaath Ramasamy Selvaraju, Atlanta, GA (US); Juan Carlos Niebles Duque, Palo Alto, CA (US); and Nikhil Naik, Mountain View, CA (US)
Assigned to Salesforce, Inc., San Francisco, CA (US)
Filed by Salesforce.com, Inc., San Francisco, CA (US)
Filed on Jan. 31, 2022, as Appl. No. 17/589,709.
Claims priority of provisional application 63/280,083, filed on Nov. 16, 2021.
Prior Publication US 2023/0154139 A1, May 18, 2023
Int. Cl. G06V 10/00 (2022.01); G06V 10/44 (2022.01); G06V 10/46 (2022.01); G06V 10/62 (2022.01)

CPC G06V 10/454 (2022.01) [G06V 10/462 (2022.01); G06V 10/62 (2022.01)]

20 Claims

20. A non-transitory processor-readable storage medium storing a plurality of processor-executable instructions for pretraining a vision model, the instructions being executed by a processor to perform operations comprising:

receiving, via a communication interface, an input video;

extracting a first set of video frames from the input video subject to a temporal constraint, wherein each of the first set of video frames corresponds to a respective area of salient region that is non-zero;

generating a first set of saliency maps as tracking masks corresponding to the first set of video frames;

generating a key crop and a query crop from the first set of video frames subject to a spatial constraint that the key crop and the query crop satisfy an intersection over union (IOU) threshold with a respective tracking mask from the tracking masks;

encoding, by a momentum encoder, the key crop into a key feature representation;

encoding, by an encoder of the vision model, the query crop into a query feature representation;

computing a contrastive loss based on the key feature representation and the query feature representation; and

updating the vision model based at least in part on the contrastive loss.
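
For illustration only, the spatial constraint, the contrastive loss, and the momentum encoder recited above can be sketched in a few NumPy functions. This is a minimal sketch under stated assumptions, not the claimed implementation: the claim does not say how crops are sampled, which contrastive loss is used, or how the momentum encoder is updated. The rejection-sampling loop, the InfoNCE form of the loss, the exponential-moving-average update, and every name and default value (`box_mask_iou`, `sample_crop`, `info_nce`, `momentum_update`, `iou_threshold=0.3`, `temperature=0.2`, `m=0.999`) are hypothetical choices made for this example.

```python
import numpy as np


def box_mask_iou(box, mask):
    """IoU between a crop box (y0, x0, y1, x1) and a binary tracking mask."""
    y0, x0, y1, x1 = box
    box_area = (y1 - y0) * (x1 - x0)
    mask_area = int(mask.sum())
    inter = int(mask[y0:y1, x0:x1].sum())
    union = box_area + mask_area - inter
    return inter / union if union > 0 else 0.0


def sample_crop(mask, rng, iou_threshold=0.3, max_tries=100, min_size=8):
    """Rejection-sample a crop box whose IoU with the mask meets the threshold.

    Assumption: rejection sampling is one simple way to honor the spatial
    constraint; the claim does not prescribe a sampling procedure.
    """
    h, w = mask.shape
    for _ in range(max_tries):
        ch = int(rng.integers(min_size, h + 1))  # crop height in [min_size, h]
        cw = int(rng.integers(min_size, w + 1))  # crop width in [min_size, w]
        y0 = int(rng.integers(0, h - ch + 1))
        x0 = int(rng.integers(0, w - cw + 1))
        box = (y0, x0, y0 + ch, x0 + cw)
        if box_mask_iou(box, mask) >= iou_threshold:
            return box
    return None  # no valid crop found; the caller chooses a fallback


def info_nce(q, k_pos, k_neg, temperature=0.2):
    """InfoNCE loss for one query: one positive key against negative keys.

    Assumption: InfoNCE stands in for the unspecified contrastive loss.
    """
    q = q / np.linalg.norm(q)
    k_pos = k_pos / np.linalg.norm(k_pos)
    k_neg = k_neg / np.linalg.norm(k_neg, axis=1, keepdims=True)
    logits = np.concatenate(([q @ k_pos], k_neg @ q)) / temperature
    logits = logits - logits.max()  # shift for numerical stability
    return float(-logits[0] + np.log(np.exp(logits).sum()))


def momentum_update(key_params, query_params, m=0.999):
    """Exponential-moving-average update of the momentum (key) encoder.

    Assumption: an EMA of the query encoder weights, as is common for
    momentum encoders; the claim does not state the update rule.
    """
    return [m * kp + (1.0 - m) * qp for kp, qp in zip(key_params, query_params)]
```

In a training loop under these assumptions, `sample_crop` would be called once for the key frame and once for the query frame, each against its own tracking mask. The two crops would then pass through the momentum encoder and the vision model's encoder, with `info_nce` giving the loss used to update the model and `momentum_update` refreshing the key encoder after each step.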