US 11,921,817 B2
Unsupervised training of a video feature extractor
Mehdi Noroozi, Leonberg (DE); and Nadine Behrmann, Stuttgart (DE)
Assigned to ROBERT BOSCH GMBH, Stuttgart (DE)
Filed by Robert Bosch GmbH, Stuttgart (DE)
Filed on Sep. 28, 2021, as Appl. No. 17/449,184.
Claims priority of application No. 20203782 (EP), filed on Oct. 26, 2020.
Prior Publication US 2022/0129699 A1, Apr. 28, 2022
Int. Cl. G06F 18/214 (2023.01); G06N 3/04 (2023.01); G06N 3/088 (2023.01); G06V 10/75 (2022.01); G06V 10/94 (2022.01); G06V 20/40 (2022.01)

CPC G06F 18/214 (2023.01) [G06N 3/04 (2013.01); G06N 3/088 (2013.01); G06V 10/751 (2022.01); G06V 10/95 (2022.01); G06V 20/46 (2022.01)]

14 Claims

1. A computer-implemented unsupervised learning method of training a video feature extractor, wherein the video feature extractor is configured to extract a feature representation from a video sequence, the method comprising the following steps:

accessing training data representing multiple training video sequences, and model data representing a set of parameters of the video feature extractor;

training the video feature extractor by:

selecting from a training video sequence of the multiple training video sequences: a current subsequence, a preceding subsequence preceding the current subsequence, and a succeeding subsequence succeeding the current subsequence;

applying the video feature extractor to the current subsequence to extract a current feature representation of the current subsequence;

deriving a training signal from a joint predictability of the preceding and succeeding subsequences given the current feature representation, wherein deriving the training signal includes extracting a positive comparative example from the preceding subsequence followed by the succeeding subsequence, extracting a negative comparative example from the succeeding subsequence followed by the preceding subsequence, and determining a contrastive loss based on comparing the current feature representation to the positive and negative comparative examples;

updating the set of parameters of the video feature extractor based on the training signal;

outputting the trained video feature extractor.
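
The selection and contrastive-loss steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the linear toy extractor, the mean-pooled block embedding, the dot-product similarity, and the two-way InfoNCE-style loss are all assumptions chosen for illustration, and the parameter update step is omitted. The sketch only shows how a positive example (preceding followed by succeeding) and a negative example (succeeding followed by preceding) might be built and compared against the current feature representation.

```python
import math
import random


def select_subsequences(video, length, rng):
    """Pick three adjacent, non-overlapping blocks: preceding, current, succeeding."""
    start = rng.randrange(0, len(video) - 3 * length + 1)
    past = video[start:start + length]
    present = video[start + length:start + 2 * length]
    future = video[start + 2 * length:start + 3 * length]
    return past, present, future


def mean_frame(frames):
    """Average a list of equally sized frame vectors element-wise."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def embed_ordered(first, second):
    """Order-sensitive toy embedding (assumption): concatenate the two block means."""
    return mean_frame(first) + mean_frame(second)


def extract_current(present, weights):
    """Toy linear stand-in for the video feature extractor (assumption)."""
    pooled = mean_frame(present)
    return [sum(w * x for w, x in zip(row, pooled)) for row in weights]


def contrastive_loss(anchor, positive, negative, temperature=1.0):
    """Two-way InfoNCE-style loss with dot-product similarity (assumption)."""
    s_pos = sum(a * p for a, p in zip(anchor, positive)) / temperature
    s_neg = sum(a * n for a, n in zip(anchor, negative)) / temperature
    top = max(s_pos, s_neg)  # subtract the max for numerical stability
    log_denominator = top + math.log(math.exp(s_pos - top) + math.exp(s_neg - top))
    return log_denominator - s_pos


# Hypothetical toy data: 12 frames, each a 2-value vector.
rng = random.Random(0)
video = [[float(t), math.sqrt(t)] for t in range(12)]
past, present, future = select_subsequences(video, 3, rng)

# Hypothetical extractor parameters: 4 outputs from 2 inputs, matching the
# 4-value ordered embeddings below.
weights = [[rng.gauss(0.0, 0.1) for _ in range(2)] for _ in range(4)]

current_feature = extract_current(present, weights)
positive = embed_ordered(past, future)   # preceding followed by succeeding
negative = embed_ordered(future, past)   # succeeding followed by preceding
loss = contrastive_loss(current_feature, positive, negative)
print("contrastive loss:", loss)
```

In a full training loop, this loss would serve as the training signal whose gradient updates the extractor parameters. Because the negative example contains the same frames as the positive one in swapped order, the loss can only be reduced by features that encode temporal order rather than appearance alone.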