US 12,277,169 B2
Method and apparatus for training an image-text mutual retrieval model, image-text mutual retrieval method, and device
Rengang Li, Jiangsu (CN); Li Wang, Jiangsu (CN); Zhenhua Guo, Jiangsu (CN); and Baoyu Fan, Jiangsu (CN)
Assigned to SUZHOU METABRAIN INTELLIGENT TECHNOLOGY CO., LTD., Jiangsu (CN)
Appl. No. 18/724,836
Filed by SUZHOU METABRAIN INTELLIGENT TECHNOLOGY CO., LTD., Jiangsu (CN)
PCT Filed Nov. 24, 2022, PCT No. PCT/CN2022/134092
§ 371(c)(1), (2) Date Jun. 27, 2024,
PCT Pub. No. WO2024/011815, PCT Pub. Date Jan. 18, 2024.
Claims priority of application No. 202210829134.9 (CN), filed on Jul. 15, 2022.
Prior Publication US 2024/0419725 A1, Dec. 19, 2024
Int. Cl. G06F 16/583 (2019.01); G06F 16/353 (2025.01); G06F 40/30 (2020.01)
CPC G06F 16/5846 (2019.01) [G06F 16/353 (2019.01); G06F 40/30 (2020.01)]
20 Claims
 
1. A method for training an image-text mutual retrieval model, comprising:
acquiring training data pairs, wherein the training data pairs comprise text training data and image training data, the text training data comprises long text data, the long text data is text data containing a plurality of target texts, and the target text is a sentence or a phrase;
inputting the training data pairs into an initial model, extracting text coding features of the text training data by using a text coding module in the initial model, and extracting image coding features of the image training data by using an image coding module in the initial model, respectively, wherein the text coding module comprises multi-layer Long-Short Term Memory (LSTM) networks, the multi-layer LSTM networks comprising a first LSTM network layer and a second LSTM network layer, the first LSTM network layer being configured to acquire a feature of each target text based on a feature of each word in each target text, and the second LSTM network layer being configured to acquire a feature of the long text data based on the feature of each target text;
calculating a training loss based on the text coding features and the image coding features, and performing parameter adjustment on the initial model based on the training loss; and
in response to the training loss meeting a convergence condition, determining the initial model after the parameter adjustment as the image-text mutual retrieval model.
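
The two-level text coding recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the class TinyLSTM, the function encode_long_text, the random weights, the vector sizes, and the choice of the final hidden state as the "feature" are all assumptions made for illustration. The sketch only shows the data flow, in which a first LSTM turns the word features of each target text into one feature per target text, and a second LSTM turns that sequence of target-text features into one feature for the long text.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTM:
    """Single-layer LSTM with random, untrained weights (illustration only)."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        cols = input_size + hidden_size
        # One weight matrix and bias per gate:
        # i = input, f = forget, g = cell candidate, o = output.
        self.w = {g: [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                      for _ in range(hidden_size)] for g in "ifgo"}
        self.b = {g: [0.0] * hidden_size for g in "ifgo"}

    def _gate(self, name, xh, act):
        return [act(sum(w * v for w, v in zip(row, xh)) + b)
                for row, b in zip(self.w[name], self.b[name])]

    def encode(self, sequence):
        """Run over a sequence of vectors and return the final hidden state.

        Using the final hidden state as the sequence feature is an assumption.
        """
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        for x in sequence:
            xh = list(x) + h
            i = self._gate("i", xh, sigmoid)
            f = self._gate("f", xh, sigmoid)
            g = self._gate("g", xh, math.tanh)
            o = self._gate("o", xh, sigmoid)
            c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
            h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
        return h


def encode_long_text(long_text, word_lstm, text_lstm):
    """long_text is a list of target texts; each target text is a list of word vectors."""
    # First LSTM layer: word features -> one feature per target text.
    target_features = [word_lstm.encode(words) for words in long_text]
    # Second LSTM layer: target-text features -> one feature for the long text.
    return text_lstm.encode(target_features)


rng = random.Random(0)
word_dim, hidden = 4, 3                      # hypothetical sizes
word_lstm = TinyLSTM(word_dim, hidden, rng)  # stands in for the first LSTM network layer
text_lstm = TinyLSTM(hidden, hidden, rng)    # stands in for the second LSTM network layer

# Hypothetical long text made of two target texts with 3 and 2 words.
long_text = [
    [[rng.random() for _ in range(word_dim)] for _ in range(3)],
    [[rng.random() for _ in range(word_dim)] for _ in range(2)],
]
text_feature = encode_long_text(long_text, word_lstm, text_lstm)
```

In a trainable version, the resulting text_feature would be compared with an image coding feature to compute the training loss; the claim does not specify the loss function, so none is shown here.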