CPC G06V 10/454 (2022.01) [G06V 10/462 (2022.01); G06V 10/62 (2022.01)] | 20 Claims |
20. A non-transitory processor-readable storage medium storing a plurality of processor-executable instructions for pretraining a vision model, the instructions being executed by a processor to perform operations comprising:
receiving, via a communication interface, an input video;
extracting a first set of video frames from the input video subject to a temporal constraint, wherein each of the first set of video frames corresponds to a respective area of salient region that is non-zero;
generating a first set of saliency maps as tracking masks corresponding to the first set of video frames;
generating a key crop and a query crop from the first set of video frames subject to a spatial constraint that the key crop and the query crop satisfy an intersection over union (IOU) threshold with a respective tracking mask from the tracking masks;
encoding, by a momentum encoder, the key crop into a key feature representation;
encoding, by an encoder of the vision model, the query crop into a query feature representation;
computing a contrastive loss based on the key feature representation and the query feature representation; and
updating the vision model based at least in part on the contrastive loss.
|
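For illustration only, the spatial constraint, momentum encoding, and contrastive loss recited in claim 20 might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the claim does not specify the loss form, the momentum coefficient, or the use of negative keys, so the InfoNCE loss, the exponential-moving-average update, and every function and parameter name below (`crop_mask_iou`, `satisfies_spatial_constraint`, `momentum_update`, `info_nce`, `m`, `temperature`) are hypothetical choices in the style of common momentum-contrast methods. Tracking masks are assumed to be binary grids, and crops are assumed to be axis-aligned boxes lying inside the frame.

```python
import math


def crop_mask_iou(mask, box):
    """IoU between a binary tracking mask and a crop box.

    mask: list of rows of 0/1 values (assumed representation).
    box:  (top, left, bottom, right), end-exclusive, inside the frame.
    """
    top, left, bottom, right = box
    mask_area = sum(sum(row) for row in mask)
    inter = sum(sum(row[left:right]) for row in mask[top:bottom])
    crop_area = (bottom - top) * (right - left)
    union = crop_area + mask_area - inter
    return inter / union if union else 0.0


def satisfies_spatial_constraint(mask, key_box, query_box, threshold):
    """True when both crops meet the IoU threshold against the mask."""
    return (crop_mask_iou(mask, key_box) >= threshold
            and crop_mask_iou(mask, query_box) >= threshold)


def momentum_update(key_params, query_params, m=0.999):
    """Assumed EMA rule: key encoder weights slowly track the query encoder."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """Assumed InfoNCE loss for one query; the positive key is index 0."""
    q = _normalize(query)
    keys = [positive_key] + list(negative_keys)
    logits = [sum(a * b for a, b in zip(q, _normalize(k))) / temperature
              for k in keys]
    peak = max(logits)  # subtract the peak for a stable log-sum-exp
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[0]


# Toy usage: a 4x4 mask whose salient region is the central 2x2 block.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
accepted = satisfies_spatial_constraint(mask, (1, 1, 3, 3), (0, 0, 4, 4), 0.2)
loss = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
```

In this sketch a crop pair is rejected and resampled whenever either crop falls below the threshold, and the loss would be backpropagated only through the query encoder while the momentum encoder is refreshed by `momentum_update`; both behaviors are assumptions about one plausible training loop rather than limitations of the claim.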