US 12,333,837 B2
Method for training image-text matching model, computing device, and storage medium
Feng He, Beijing (CN); Qi Wang, Beijing (CN); Hu Yang, Beijing (CN); Shuai Chen, Beijing (CN); Zhifan Feng, Beijing (CN); and Chunguang Chai, Beijing (CN)
Assigned to BEIJING BAIDU NETCOM SCIENCE TECHNOLOGY CO., LTD., Beijing (CN)
Filed by BEIJING BAIDU NETCOM SCIENCE TECHNOLOGY CO., LTD., Beijing (CN)
Filed on Sep. 13, 2022, as Appl. No. 17/943,458.
Claims priority of application No. 202111101658.8 (CN), filed on Sep. 18, 2021.
Prior Publication US 2023/0005284 A1, Jan. 5, 2023
Int. Cl. G06V 30/194 (2022.01); G06F 16/583 (2019.01); G06V 30/19 (2022.01)

CPC G06V 30/19147 (2022.01) [G06F 16/583 (2019.01); G06V 30/1916 (2022.01)]

17 Claims

1. A computer-implemented method, comprising:

obtaining a sample text and a sample image corresponding to the sample text;

labeling a true semantic tag for the sample text according to a first preset rule;

inputting the sample text into a text coding sub-model of an image-text matching model, and obtaining a text feature representation of the sample text and a predicted semantic tag output by the text coding sub-model, wherein an output of the text coding sub-model further comprises a predicted attribute tag;

inputting the sample image into an image coding sub-model of the image-text matching model, and obtaining an image feature representation of the sample image output by the image coding sub-model;

calculating a first loss based on the true semantic tag and the predicted semantic tag;

calculating a contrast loss based on the text feature representation of the sample text and the image feature representation of the sample image;

labeling a true attribute tag for the sample text according to a second preset rule;

calculating a second loss based on the true attribute tag and the predicted attribute tag;

adjusting one or more parameters of the text coding sub-model based on the first loss, the second loss, and the contrast loss; and

adjusting one or more parameters of the image coding sub-model based on the contrast loss.
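
The way claim 1 routes the three losses can be illustrated with a minimal sketch. The claim does not fix the form of any loss, so everything concrete here is an assumption made for illustration: softmax cross-entropy for the first and second losses (with single-label tags), a symmetric InfoNCE-style objective with a temperature of 0.07 for the contrast loss, an unweighted sum for the text side, and all function and key names, which are hypothetical. The sketch only computes the scalar objectives; the gradient step that actually adjusts parameters is omitted.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of class `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def contrast_loss(text_feats, image_feats, temperature=0.07):
    """Assumed symmetric InfoNCE: pair i is the positive, other pairs in the batch are negatives."""
    t = [normalize(v) for v in text_feats]
    im = [normalize(v) for v in image_feats]
    n = len(t)
    # sims[i][j] = cosine similarity of text i and image j, scaled by temperature
    sims = [[sum(a * b for a, b in zip(t[i], im[j])) / temperature
             for j in range(n)] for i in range(n)]
    text_to_image = sum(cross_entropy(sims[i], i) for i in range(n)) / n
    image_to_text = sum(
        cross_entropy([sims[i][j] for i in range(n)], j) for j in range(n)) / n
    return 0.5 * (text_to_image + image_to_text)


def training_losses(sem_logits, sem_true, attr_logits, attr_true,
                    text_feats, image_feats):
    """Compute the three losses and the per-sub-model objectives of claim 1."""
    n = len(sem_true)
    # First loss: true vs. predicted semantic tag
    first = sum(cross_entropy(l, y) for l, y in zip(sem_logits, sem_true)) / n
    # Second loss: true vs. predicted attribute tag
    second = sum(cross_entropy(l, y) for l, y in zip(attr_logits, attr_true)) / n
    # Contrast loss: text features vs. image features
    contrast = contrast_loss(text_feats, image_feats)
    return {
        "first_loss": first,
        "second_loss": second,
        "contrast_loss": contrast,
        # Text coding sub-model: first + second + contrast (unweighted sum assumed)
        "text_encoder_loss": first + second + contrast,
        # Image coding sub-model: contrast loss only
        "image_encoder_loss": contrast,
    }
```

In a framework with automatic differentiation, summing the three losses and backpropagating once would give the same routing, provided the tag-prediction heads read only from the text coding sub-model: the first and second losses then contribute no gradient to the image coding sub-model.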