| CPC G06F 18/214 (2023.01) [G06F 18/253 (2023.01); G06F 40/126 (2020.01); G06V 20/46 (2022.01); G06V 40/168 (2022.01)] | 20 Claims |

1. A method for training a model, comprising:
determining image features, audio features, and text features of a reference object, in respective ones of a video encoder, an audio encoder, and a text encoder, based on reference image information, reference audio information, and reference text information associated with the reference object, respectively;
constructing a feature tensor from the image features, the audio features, and the text features, the feature tensor defining a multi-dimensional space in which a given position within the multi-dimensional space corresponds to a combination of a corresponding image feature of the image features, a corresponding audio feature of the audio features, and a corresponding text feature of the text features, wherein a match type indicating a relationship between the corresponding image feature, the corresponding audio feature, and the corresponding text feature is represented in the feature tensor as a particular one of a plurality of values each indicating a different match type;
decomposing the feature tensor into a first feature vector, a second feature vector, and a third feature vector corresponding to the image features, the audio features, and the text features, respectively, to determine a loss function value of the model, wherein the loss function value is computed at least in part based on a combination of a first absolute value of a difference between the first feature vector and one or more of the image features, a second absolute value of a difference between the second feature vector and one or more of the audio features, and a third absolute value of a difference between the third feature vector and one or more of the text features; and
updating parameters of the model based on the loss function value;
wherein the method is performed by at least one processing device comprising a processor coupled to a memory.
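The claimed steps — building a match-type tensor over image, audio, and text feature combinations, decomposing it into three factor vectors, and forming a loss from the absolute differences between each factor vector and the corresponding modality's features — can be sketched as follows. This is a minimal illustrative sketch, not the patented implementation: the feature dimensions, the 0/1/2 match-type encoding, the rank-1 CP-style alternating-least-squares decomposition, and the random stand-ins for encoder outputs are all assumptions, since the claim does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for encoder outputs (hypothetical; real features would come
# from a video encoder, an audio encoder, and a text encoder).
n = 4                      # number of reference feature entries per modality
img = rng.normal(size=n)   # image features
aud = rng.normal(size=n)   # audio features
txt = rng.normal(size=n)   # text features

# Feature tensor: position (i, j, k) holds a match-type value for the
# combination (img[i], aud[j], txt[k]).  The 0/1/2 encoding below is a
# toy assumption standing in for "a plurality of values each indicating
# a different match type".
T = np.zeros((n, n, n))
for i in range(n):
    for j in range(n):
        for k in range(n):
            T[i, j, k] = (i == j) + (j == k)  # toy match-type rule

# Decompose T into three factor vectors (one per modality) via rank-1
# CP-style alternating least squares -- a standard technique; the claim
# does not name a particular decomposition algorithm.
u, v, w = np.ones(n), np.ones(n), np.ones(n)
for _ in range(50):
    u = np.einsum('ijk,j,k->i', T, v, w) / ((v @ v) * (w @ w))
    v = np.einsum('ijk,i,k->j', T, u, w) / ((u @ u) * (w @ w))
    w = np.einsum('ijk,i,j->k', T, u, v) / ((u @ u) * (v @ v))

# Loss: combination of absolute differences between each factor vector
# and the corresponding modality's features, as recited in the claim.
loss = (np.abs(u - img).sum()
        + np.abs(v - aud).sum()
        + np.abs(w - txt).sum())
print(float(loss))
```

In a full training loop, this loss value would then drive a parameter update of the model (e.g., by backpropagation through the encoders), corresponding to the claim's final step.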