| CPC G06F 18/214 (2023.01) [G06F 18/253 (2023.01); G06F 40/126 (2020.01); G06V 20/46 (2022.01); G06V 40/168 (2022.01)] | 20 Claims |

1. A method for training a model, comprising:
determining image features, audio features, and text features of a reference object, in respective ones of a video encoder, an audio encoder, and a text encoder, based on reference image information, reference audio information, and reference text information associated with the reference object, respectively;
constructing a feature tensor from the image features, the audio features, and the text features, the feature tensor defining a multi-dimensional space in which a given position within the multi-dimensional space corresponds to a combination of a corresponding image feature of the image features, a corresponding audio feature of the audio features, and a corresponding text feature of the text features, wherein a match type indicating a relationship between the corresponding image feature, the corresponding audio feature, and the corresponding text feature is represented in the feature tensor as a particular one of a plurality of values each indicating a different match type;
decomposing the feature tensor into a first feature vector, a second feature vector, and a third feature vector corresponding to the image features, the audio features, and the text features, respectively, to determine a loss function value of the model, wherein the loss function value is computed at least in part based on a combination of a first absolute value of a difference between the first feature vector and one or more of the image features, a second absolute value of a difference between the second feature vector and one or more of the audio features, and a third absolute value of a difference between the third feature vector and one or more of the text features; and
updating parameters of the model based on the loss function value;
wherein the method is performed by at least one processing device comprising a processor coupled to a memory.
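The claimed steps — building a match-type tensor over image, audio, and text feature combinations, decomposing it into three factor vectors, and forming a loss from the absolute differences between each factor vector and the corresponding modality's features — can be sketched as follows. This is a minimal illustrative sketch, not the patented implementation: the feature dimensions, the 0/1/2 match-type encoding, the rank-1 CP-style alternating-least-squares decomposition, and the random stand-ins for encoder outputs are all assumptions, since the claim does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for encoder outputs (hypothetical; real features would come
# from a video encoder, an audio encoder, and a text encoder).
n = 4                      # number of reference feature entries per modality
img = rng.normal(size=n)   # image features
aud = rng.normal(size=n)   # audio features
txt = rng.normal(size=n)   # text features

# Feature tensor: position (i, j, k) holds a match-type value for the
# combination (img[i], aud[j], txt[k]).  The 0/1/2 encoding below is a
# toy assumption standing in for "a plurality of values each indicating
# a different match type".
T = np.zeros((n, n, n))
for i in range(n):
    for j in range(n):
        for k in range(n):
            T[i, j, k] = (i == j) + (j == k)  # toy match-type rule

# Decompose T into three factor vectors (one per modality) via rank-1
# CP-style alternating least squares -- a standard technique; the claim
# does not name a particular decomposition algorithm.
u, v, w = np.ones(n), np.ones(n), np.ones(n)
for _ in range(50):
    u = np.einsum('ijk,j,k->i', T, v, w) / ((v @ v) * (w @ w))
    v = np.einsum('ijk,i,k->j', T, u, w) / ((u @ u) * (w @ w))
    w = np.einsum('ijk,i,j->k', T, u, v) / ((u @ u) * (v @ v))

# Loss: combination of absolute differences between each factor vector
# and the corresponding modality's features, as recited in the claim.
loss = (np.abs(u - img).sum()
        + np.abs(v - aud).sum()
        + np.abs(w - txt).sum())
print(float(loss))
```

In a full training loop, this loss value would then drive a parameter update of the model (e.g., by backpropagation through the encoders), corresponding to the claim's final step.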