US 12,437,517 B2
Video domain adaptation via contrastive learning for decision making
Yi-Hsuan Tsai, Santa Clara, CA (US); Xiang Yu, Mountain View, CA (US); Bingbing Zhuang, San Jose, CA (US); Manmohan Chandraker, Santa Clara, CA (US); and Donghyun Kim, Mukilteo, WA (US)
Assigned to NEC Corporation, Tokyo (JP)
Filed by NEC Laboratories America, Inc., Princeton, NJ (US)
Filed on Oct. 11, 2023, as Appl. No. 18/484,826.
Application 18/484,826 is a continuation of application No. 17/521,057, filed on Nov. 8, 2021.
Claims priority of provisional application 63/114,120, filed on Nov. 16, 2020.
Claims priority of provisional application 63/113,464, filed on Nov. 13, 2020.
Claims priority of provisional application 63/111,766, filed on Nov. 10, 2020.
Prior Publication US 2024/0037186 A1, Feb. 1, 2024
This patent is subject to a terminal disclaimer.
Int. Cl. G06V 10/74 (2022.01); G06F 18/214 (2023.01); G06F 18/22 (2023.01); G06N 3/08 (2023.01); G06V 10/75 (2022.01); G06V 10/774 (2022.01)

CPC G06V 10/774 (2022.01) [G06F 18/2155 (2023.01); G06F 18/22 (2023.01); G06N 3/08 (2013.01); G06V 10/74 (2022.01); G06V 10/751 (2022.01)]

8 Claims

1. A computer-implemented video method, comprising:

extracting features of a first modality and a second modality from a labeled first training dataset in a first domain and an unlabeled second training dataset in a second domain, the labeled first training dataset including source videos and action labels, the source videos being received from a camera, the action labels indicating a patient's interactions with therapeutic equipment and use of medications in healthcare;

training a video analysis model using contrastive learning on the extracted features, including optimization of a loss function that includes a cross-domain regularization part that compares features from a first training data from the first training dataset and a second training data from the second training dataset, the second training data having a pseudo label that matches the label of the first training data, and a cross-modality regularization part that compares features from different cue types in a same domain, with the cross-domain regularization part being expressed as

where $\phi_{+}^{st}(F_{s_i}^{k}, F_{t_{i+}}^{l})$ measures similarity between features having a same modality and different domains for positive samples and $\phi_{-}^{st}(F_{s_i}^{k}, F_{t_{i-}}^{l})$ measures similarity between features having a same modality and different domains for negative samples, and further including generating pseudo-labels for the unlabeled dataset.
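
The pseudo-labeling and cross-domain comparison recited in claim 1 can be illustrated with a minimal Python sketch for a single modality. The claim's own expression is not reproduced in this text, so the sketch assumes an InfoNCE-style ratio of positive to positive-plus-negative similarities. The function names (`pseudo_labels`, `cross_domain_loss`), the confidence `threshold`, the temperature `tau`, and the use of exponentiated cosine similarity are all hypothetical choices made for illustration, not the claimed formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pseudo_labels(probs, threshold=0.8):
    """Assign the most likely class to each unlabeled target sample.

    Samples whose top class probability is below `threshold` get None
    and are ignored later. The threshold rule is an assumption.
    """
    labels = []
    for p in probs:
        best = max(range(len(p)), key=lambda c: p[c])
        labels.append(best if p[best] >= threshold else None)
    return labels


def cross_domain_loss(src_feats, src_labels, tgt_feats, tgt_pseudo, tau=0.5):
    """Assumed InfoNCE-style cross-domain term for one modality.

    For each labeled source feature, target features whose pseudo-label
    matches the source label are positives; target features with a
    different pseudo-label are negatives.
    """
    losses = []
    for f_s, y in zip(src_feats, src_labels):
        pos, neg = 0.0, 0.0
        for f_t, p in zip(tgt_feats, tgt_pseudo):
            if p is None:
                continue  # no confident pseudo-label, skip this target sample
            sim = math.exp(cosine(f_s, f_t) / tau)
            if p == y:
                pos += sim
            else:
                neg += sim
        if pos > 0.0:
            losses.append(-math.log(pos / (pos + neg)))
    # Average over source samples that had at least one positive.
    return sum(losses) / len(losses) if losses else 0.0


# Toy data: two action classes, 2-D features from one modality.
src_feats = [[1.0, 0.0], [0.0, 1.0]]
src_labels = [0, 1]
tgt_probs = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
tgt_pseudo = pseudo_labels(tgt_probs)

aligned_tgt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
misaligned_tgt = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

aligned_loss = cross_domain_loss(src_feats, src_labels, aligned_tgt, tgt_pseudo)
misaligned_loss = cross_domain_loss(src_feats, src_labels, misaligned_tgt, tgt_pseudo)
print(tgt_pseudo, aligned_loss, misaligned_loss)
```

In this toy setup the loss is lower when target features with a matching pseudo-label lie close to the corresponding source features, which is the behavior a cross-domain regularizer of this kind is meant to encourage. A cross-modality term could be sketched the same way by comparing features of different cue types within one domain.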