CPC G06F 18/214 (2023.01) [G06N 3/04 (2013.01); G06N 3/088 (2013.01); G06V 10/751 (2022.01); G06V 10/95 (2022.01); G06V 20/46 (2022.01)] | 14 Claims |
1. A computer-implemented unsupervised learning method of training a video feature extractor, wherein the video feature extractor is configured to extract a feature representation from a video sequence, the method comprising the following steps:
accessing training data representing multiple training video sequences, and model data representing a set of parameters of the video feature extractor;
training the video feature extractor by:
selecting from a training video sequence of the multiple training video sequences: a current subsequence; a preceding subsequence preceding the current subsequence; and a succeeding subsequence succeeding the current subsequence;
applying the video feature extractor to the current subsequence to extract a current feature representation of the current subsequence;
deriving a training signal from a joint predictability of the preceding and succeeding subsequences given the current feature representation, wherein deriving the training signal includes extracting a positive comparative example from the preceding subsequence followed by the succeeding subsequence, extracting a negative comparative example from the succeeding subsequence followed by the preceding subsequence, and determining a contrastive loss based on comparing the current feature representation to the positive and negative comparative examples;
updating the set of parameters of the video feature extractor based on the training signal;
and outputting the trained video feature extractor.
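The training step recited above can be sketched in code. The following is a minimal illustrative sketch, not the patented implementation: the feature extractor is stubbed as an order-sensitive pooling over frame features (a real extractor would be, e.g., a 3D CNN or video transformer), the subsequence lengths and the InfoNCE-style form of the contrastive loss are assumptions, and the parameter-update step is only indicated in a comment.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_features(subsequence):
    """Stand-in for the video feature extractor (hypothetical).

    Position-weighted mean pooling so that frame ORDER affects the output;
    this is what lets the positive and negative comparative examples differ.
    """
    t = np.linspace(0.0, 1.0, len(subsequence))
    return (subsequence * t[:, None]).sum(axis=0) / t.sum()

def contrastive_loss(current, positive, negative, temperature=0.1):
    """InfoNCE-style loss over one positive and one negative comparative example."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    s_pos = cos(current, positive) / temperature
    s_neg = cos(current, negative) / temperature
    m = max(s_pos, s_neg)  # subtract max for numerical stability
    return -(s_pos - m) + np.log(np.exp(s_pos - m) + np.exp(s_neg - m))

# One training video sequence: 12 frames of 4-dim frame features (toy data).
video = rng.normal(size=(12, 4))

# Select the preceding, current, and succeeding subsequences.
prec, curr, succ = video[0:4], video[4:8], video[8:12]

# Current feature representation.
f_current = extract_features(curr)

# Positive comparative example: preceding followed by succeeding.
f_positive = extract_features(np.concatenate([prec, succ], axis=0))

# Negative comparative example: succeeding followed by preceding.
f_negative = extract_features(np.concatenate([succ, prec], axis=0))

# Training signal derived from the contrastive comparison; a real
# implementation would backpropagate this loss to update the extractor's
# parameters (the updating step of the claim).
loss = contrastive_loss(f_current, f_positive, f_negative)
```

Swapping the temporal order of the two outer subsequences is what turns the same frames into a negative example, so the extractor is pushed to encode whether the current clip is consistent with the surrounding clips played forward rather than reversed.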