US 12,254,691 B2
Cooperative-contrastive learning systems and methods
Nishant Rai, Stanford, CA (US); Ehsan Adeli Mosabbeb, Menlo Park, CA (US); Kuan-Hui Lee, Los Altos, CA (US); Adrien Gaidon, Los Altos, CA (US); and Juan Carlos Niebles, Mountain View, CA (US)
Assigned to TOYOTA RESEARCH INSTITUTE, INC., Los Altos, CA (US); and THE BOARD OF TRUSTEES OF THE LELAND STANFORD JUNIOR UNIVERSITY, Stanford, CA (US)
Filed by TOYOTA RESEARCH INSTITUTE, INC., Los Altos, CA (US); and The Board of Trustees of the Leland Stanford Junior University, Stanford, CA (US)
Filed on Dec. 3, 2020, as Appl. No. 17/111,352.
Prior Publication US 2022/0180101 A1, Jun. 9, 2022
Int. Cl. G06V 20/40 (2022.01); G06N 20/00 (2019.01); G06V 30/194 (2022.01)

CPC G06V 20/41 (2022.01) [G06N 20/00 (2019.01); G06V 30/194 (2022.01)]

20 Claims

1. A method for multi-view self-supervised learning, comprising:

receiving a plurality of video sequences, the video sequences comprising a plurality of image frames;

applying selected images of a first and second video sequence of the plurality of video sequences to a plurality of different encoders to derive a plurality of embeddings for different views of the selected images of the first and second video sequences, the plurality of embeddings comprising RGB embeddings, flow embeddings, and KeyPoint embeddings;

determining distances of the derived plurality of embeddings for the selected images of the first and second video sequences;

detecting inconsistencies between distances of the RGB embeddings, distances of the flow embeddings, and distances of the KeyPoint embeddings outside a threshold distance; and

predicting semantics of a future image based on the determined distances.
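
The distance and inconsistency steps of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. The claim does not name a distance metric or say how distances from different views are compared, so the cosine distance, the pairwise absolute-difference rule, the threshold value, and every function name below are assumptions made for illustration. The encoders are replaced by hand-written toy vectors, and the final prediction step is omitted.

```python
import numpy as np

# View names follow the claim; the embeddings below are toy stand-ins
# for the outputs of the (unspecified) per-view encoders.
VIEWS = ("rgb", "flow", "keypoint")


def cosine_distance(a, b):
    """Cosine distance 1 - cos(a, b) between two 1-D embedding vectors.

    Assumption: the claim does not specify which distance is used.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def view_distances(seq1, seq2):
    """Distance between two sequences' embeddings, computed per view."""
    return {v: cosine_distance(seq1[v], seq2[v]) for v in VIEWS}


def inconsistent_pairs(distances, threshold):
    """Return pairs of views whose distances differ by more than threshold.

    Assumption: an "inconsistency" is read here as two views disagreeing
    about how far apart the same two sequences are.
    """
    found = []
    for i, u in enumerate(VIEWS):
        for v in VIEWS[i + 1:]:
            if abs(distances[u] - distances[v]) > threshold:
                found.append((u, v))
    return found


# Toy embeddings: the two sequences agree in RGB and keypoint space
# but are orthogonal in flow space.
seq1 = {"rgb": [1.0, 0.0], "flow": [1.0, 0.0], "keypoint": [1.0, 0.0]}
seq2 = {"rgb": [1.0, 0.0], "flow": [0.0, 1.0], "keypoint": [1.0, 0.0]}

distances = view_distances(seq1, seq2)
flagged = inconsistent_pairs(distances, threshold=0.5)
print(distances)
print(flagged)
```

In this toy case the flow view places the two sequences far apart while the RGB and keypoint views place them together, so both pairs involving the flow view are flagged.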