US 12,437,514 B2
Video domain adaptation via contrastive learning for decision making
Yi-Hsuan Tsai, Santa Clara, CA (US); Xiang Yu, Mountain View, CA (US); Bingbing Zhuang, San Jose, CA (US); Manmohan Chandraker, Santa Clara, CA (US); and Donghyun Kim, Mukilteo, WA (US)
Assigned to NEC Corporation, Tokyo (JP)
Filed by NEC Laboratories America, Inc., Princeton, NJ (US)
Filed on Nov. 8, 2021, as Appl. No. 17/521,057.
Claims priority of provisional application 63/114,120, filed on Nov. 16, 2020.
Claims priority of provisional application 63/113,464, filed on Nov. 13, 2020.
Claims priority of provisional application 63/111,766, filed on Nov. 10, 2020.
Prior Publication US 2022/0147761 A1, May 12, 2022
This patent is subject to a terminal disclaimer.
Int. Cl. G06V 10/74 (2022.01); G06F 18/214 (2023.01); G06F 18/22 (2023.01); G06N 3/08 (2023.01); G06V 10/75 (2022.01); G06V 10/774 (2022.01)

CPC G06V 10/774 (2022.01) [G06F 18/2155 (2023.01); G06F 18/22 (2023.01); G06N 3/08 (2013.01); G06V 10/74 (2022.01); G06V 10/751 (2022.01)]

13 Claims

1. A computer-implemented video method, comprising:

extracting features of a first modality and a second modality from a labeled first training dataset in a first domain and an unlabeled second training dataset in a second domain;

training a video analysis model using contrastive learning on the extracted features, including optimization of a loss function that includes a cross-domain regularization part that compares features from a first training data from the first training dataset and a second training data from the second training dataset, the second training data having a pseudo label that matches the label of the first training data, and a cross-modality regularization part that compares features from different cue types in a same domain, with the cross-domain loss regularization part being expressed as

where $\phi_{+}^{st}(F_{s_i}^{k}, F_{t_{i+}}^{l})$ measures similarity between features having a same modality and different domains for positive samples and $\phi_{-}^{st}(F_{s_i}^{k}, F_{t_{i-}}^{l})$ measures similarity between features having a same modality and different domains for negative samples, and further including generating pseudo-labels for the unlabeled dataset.
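
The structure of the two regularization parts recited in claim 1 can be illustrated with a minimal Python sketch. It is not the claimed expression, which is not reproduced in this text. It assumes an InfoNCE-style ratio of positive to total similarity, and an exponentiated cosine similarity with a temperature of 0.1 as the similarity measure. The function names `phi`, `contrastive_term`, `cross_domain_loss`, and `cross_modality_loss` are hypothetical.

```python
import math


def phi(u, v, temperature=0.1):
    """Assumed similarity: exp(cosine similarity / temperature).

    Feature vectors are plain lists of floats and must be non-zero.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.exp(dot / (norm * temperature))


def contrastive_term(anchor, positives, negatives, temperature=0.1):
    """Assumed InfoNCE-style term: -log(pos / (pos + neg))."""
    pos = sum(phi(anchor, p, temperature) for p in positives)
    neg = sum(phi(anchor, n, temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


def cross_domain_loss(src_feats, src_labels, tgt_feats, tgt_pseudo,
                      temperature=0.1):
    """Compare same-modality features across domains.

    A target feature is a positive for a source feature when its pseudo
    label matches the source label, and a negative otherwise.
    """
    terms = []
    for f_s, y in zip(src_feats, src_labels):
        positives = [f for f, p in zip(tgt_feats, tgt_pseudo) if p == y]
        negatives = [f for f, p in zip(tgt_feats, tgt_pseudo) if p != y]
        if positives:  # skip anchors with no matching pseudo label
            terms.append(
                contrastive_term(f_s, positives, negatives, temperature))
    return sum(terms) / len(terms) if terms else 0.0


def cross_modality_loss(feats_a, feats_b, temperature=0.1):
    """Compare features of two cue types within one domain.

    Assumption: index i in both lists refers to the same video, so
    feats_b[i] is the positive and all other entries are negatives.
    """
    terms = []
    for i, f_a in enumerate(feats_a):
        negatives = [f for j, f in enumerate(feats_b) if j != i]
        terms.append(
            contrastive_term(f_a, [feats_b[i]], negatives, temperature))
    return sum(terms) / len(terms)


# Toy data: two source clips, two target clips, 2-D features.
source = [[1.0, 0.0], [0.0, 1.0]]
labels = [0, 1]
target = [[0.9, 0.1], [0.1, 0.9]]

aligned = cross_domain_loss(source, labels, target, [0, 1])
misaligned = cross_domain_loss(source, labels, target, [1, 0])
```

In this toy setup, `aligned` is smaller than `misaligned`: the loss is low when target features with a matching pseudo label lie close to the corresponding source features, and high when the pseudo labels point to dissimilar features.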