| CPC G06V 30/19147 (2022.01) [G06F 16/583 (2019.01); G06V 30/1916 (2022.01)] | 17 Claims |

|
1. A computer-implemented method, comprising:
obtaining a sample text and a sample image corresponding to the sample text;
labeling a true semantic tag for the sample text according to a first preset rule;
inputting the sample text into a text coding sub-model of an image-text matching model, and obtaining a text feature representation of the sample text and a predicted semantic tag output by the text coding sub-model, wherein an output of the text coding sub-model further comprises a predicted attribute tag;
inputting the sample image into an image coding sub-model of the image-text matching model, and obtaining an image feature representation of the sample image output by the image coding sub-model;
calculating a first loss based on the true semantic tag and the predicted semantic tag;
calculating a contrast loss based on the text feature representation of the sample text and the image feature representation of the sample image;
labeling a true attribute tag for the sample text according to a second preset rule;
calculating a second loss based on the true attribute tag and the predicted attribute tag;
adjusting one or more parameters of the text coding sub-model based on the first loss, the second loss, and the contrast loss; and
adjusting one or more parameters of the image coding sub-model based on the contrast loss.
|
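The loss routing recited in claim 1 can be illustrated with a minimal sketch. This is not the patented implementation. The claim does not specify the form of any loss, so the sketch assumes cross-entropy for the first and second losses and a symmetric InfoNCE-style form for the contrast loss. All function names, the temperature value, the unweighted sum, and the toy inputs are hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, true_index):
    # Negative log-probability of the true class (assumed loss form).
    return -math.log(softmax(logits)[true_index])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def contrast_loss(text_feats, image_feats, temperature=0.1):
    # Assumed symmetric InfoNCE: the i-th text and i-th image form the
    # positive pair; every other pairing in the batch is a negative.
    t = [normalize(v) for v in text_feats]
    im = [normalize(v) for v in image_feats]
    n = len(t)
    sim = [[dot(t[i], im[j]) / temperature for j in range(n)] for i in range(n)]
    text_to_image = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    image_to_text = sum(
        cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (text_to_image + image_to_text)

def training_objectives(semantic_logits, true_semantic,
                        attribute_logits, true_attribute,
                        text_feats, image_feats):
    n = len(text_feats)
    # First loss: true vs. predicted semantic tag.
    first = sum(cross_entropy(l, y)
                for l, y in zip(semantic_logits, true_semantic)) / n
    # Second loss: true vs. predicted attribute tag.
    second = sum(cross_entropy(l, y)
                 for l, y in zip(attribute_logits, true_attribute)) / n
    contrast = contrast_loss(text_feats, image_feats)
    return {
        "first_loss": first,
        "second_loss": second,
        "contrast_loss": contrast,
        # Text coding sub-model: adjusted on all three losses
        # (unweighted sum is an assumption).
        "text_objective": first + second + contrast,
        # Image coding sub-model: adjusted on the contrast loss only.
        "image_objective": contrast,
    }

# Toy batch of two text-image pairs (all values are illustrative).
objectives = training_objectives(
    semantic_logits=[[2.0, 0.1, -1.0], [0.3, 1.5, 0.2]],
    true_semantic=[0, 1],
    attribute_logits=[[1.2, -0.4], [-0.7, 0.9]],
    true_attribute=[0, 1],
    text_feats=[[1.0, 0.0], [0.0, 1.0]],
    image_feats=[[0.9, 0.1], [0.2, 0.8]],
)
print(objectives)
```

In a gradient-based framework, the same routing would typically follow from backpropagating the text objective and the image objective through their respective sub-models. The tag losses depend only on the text coding sub-model's outputs, so they contribute nothing to the image coding sub-model's update.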