US 12,277,401 B2
Method and apparatus for acquiring pre-trained model
Guocheng Niu, Beijing (CN); Wei Li, Beijing (CN); Can Gao, Beijing (CN); Xinyan Xiao, Beijing (CN); and Hua Wu, Beijing (CN)
Assigned to BEIJING BAIDU NETCOM SCIENCE TECHNOLOGY CO., LTD., Beijing (CN)
Filed by BEIJING BAIDU NETCOM SCIENCE TECHNOLOGY CO., LTD., Beijing (CN)
Filed on Oct. 15, 2021, as Appl. No. 17/502,108.
Claims priority of application No. 202110274515.0 (CN), filed on Mar. 15, 2021.
Prior Publication US 2022/0292269 A1, Sep. 15, 2022
Int. Cl. G06F 18/25 (2023.01); G06F 40/205 (2020.01); G06F 40/47 (2020.01); G06F 40/58 (2020.01); G06N 3/02 (2006.01)
CPC G06F 40/58 (2020.01) [G06F 18/256 (2023.01); G06F 40/205 (2020.01); G06F 40/47 (2020.01); G06N 3/02 (2013.01)] 7 Claims
OG exemplary drawing
 
1. A method for training a pre-trained model, comprising:
acquiring training data, the training data comprising a single-modal language material comprising an image or a text, and a multi-modal language material comprising an image-text pair in which the image and the text are semantically paired, the multi-modal language material further comprising a language material pair formed by a first-modal language material which is a text in a first language and a second-modal language material which is a text in a second language;
performing at least one of rewriting extension and retrieval extension on the multi-modal language material in the training data, and adding the extended multi-modal language material into the training data;
performing a multi-task training operation on a pre-trained model using the training data, the multiple tasks comprising at least one cross-modal contrastive learning task and at least one single-modal learning task;
wherein the cross-modal contrastive learning task which is trained utilizing the multi-modal language material comprises: determining similarity between the first-modal language material and the second-modal language material in the multi-modal language material utilizing a vector representation of the first-modal language material and a vector representation of the second-modal language material in the multi-modal language material by the pre-trained model, with a training target of maximizing the similarity between the first-modal language material and the second-modal language material in a positive multi-modal language material and minimizing the similarity between the first-modal language material and the second-modal language material in a negative multi-modal language material; and
the single-modal learning task which is trained utilizing the single-modal language material comprises: predicting a second part of content in the single-modal language material utilizing a vector representation of a first part of content in the single-modal language material by the pre-trained model, with a training target of minimizing a difference between the predicted second part of content and the second part of content in the single-modal language material,
wherein in the cross-modal contrastive learning task, the similarity between the first-modal language material and the second-modal language material in the multi-modal language material obtained by the retrieval extension is determined by: calculating similarity between a vector representation of the first-modal language material obtained by the pre-trained model and a vector representation of the second-modal language material obtained by the pre-trained model; and
the similarity between the first-modal language material and the second-modal language material in the multi-modal language material obtained by the rewriting extension is determined by: stitching the first-modal language material and the second-modal language material, and mapping a vector representation of the stitched language material obtained by the pre-trained model into a similarity value,
wherein parameters of the pre-trained model are updated using a constructed total loss function when the multi-task training operation is performed;
the total loss function is obtained as an arithmetic sum of the loss function of the at least one cross-modal contrastive learning task and the loss function of the at least one single-modal learning task, wherein the loss function of the at least one cross-modal contrastive learning task is constructed according to: a similarity value calculated by stitching the image and the text and mapping a vector representation of the stitched language material obtained by the pre-trained model into the similarity value; or a similarity value obtained by calculating cosine similarity between a vector representation of the image obtained by the pre-trained model and a vector representation of the text obtained by the pre-trained model, and the loss function of the at least one single-modal learning task comprises the loss function of a visual learning task and the loss function of a text learning task, and
finely adjusting the pre-trained model according to training data corresponding to a downstream task comprising a text classification task, an image classification task, a task of generating questions and answers for images, or a task of generating images for texts.
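
For readers tracing the claim's two similarity paths, the following minimal PyTorch sketch illustrates one plausible reading: retrieval-extended pairs are scored by cosine similarity between separately encoded vector representations, while rewriting-extended pairs are stitched (concatenated) and mapped to a scalar similarity value. The class name CrossModalScorer, the feature dimensions, and the InfoNCE-style contrastive loss with temperature 0.07 are all illustrative assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalScorer(nn.Module):
    """Illustrative sketch of the claim's two similarity paths (names and
    dimensions are assumptions, not the patented implementation)."""

    def __init__(self, dim=256):
        super().__init__()
        # Stand-ins for the pre-trained model's per-modality encoders.
        self.encode_a = nn.Linear(512, dim)  # first-modal material (e.g., image)
        self.encode_b = nn.Linear(512, dim)  # second-modal material (e.g., text)
        # Head that maps a "stitched" (concatenated) representation to a score.
        self.stitch_head = nn.Linear(2 * dim, 1)

    def retrieval_similarity(self, feat_a, feat_b):
        # Retrieval-extended pairs: encode each material separately, then
        # compare the two vector representations by cosine similarity.
        va = F.normalize(self.encode_a(feat_a), dim=-1)
        vb = F.normalize(self.encode_b(feat_b), dim=-1)
        return va @ vb.t()  # (batch, batch) similarity matrix

    def rewriting_similarity(self, feat_a, feat_b):
        # Rewriting-extended pairs: stitch the two materials and map the joint
        # vector representation into a single similarity value.
        stitched = torch.cat([self.encode_a(feat_a), self.encode_b(feat_b)], dim=-1)
        return self.stitch_head(stitched).squeeze(-1)

scorer = CrossModalScorer()
feat_a, feat_b = torch.randn(8, 512), torch.randn(8, 512)
sim = scorer.retrieval_similarity(feat_a, feat_b)  # diagonal entries = positive pairs
# Contrastive target: maximize positive-pair similarity and minimize
# negative-pair similarity (here via an InfoNCE-style cross-entropy over
# in-batch negatives, an assumption of this sketch).
loss_cmcl = F.cross_entropy(sim / 0.07, torch.arange(8))
```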
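The single-modal learning task predicts a "second part" of a language material from the vector representation of its "first part". The sketch below shows one deliberately minimal way to realize that for text; the GRU encoder, the 1,000-token vocabulary, and the 10/10 split point are all assumptions of this sketch.

```python
import torch
import torch.nn as nn

vocab = 1000
embed = nn.Embedding(vocab, 32)
encoder = nn.GRU(input_size=32, hidden_size=64, batch_first=True)
predict_head = nn.Linear(64, vocab)  # predicts tokens of the held-out part

tokens = torch.randint(0, vocab, (4, 20))
first_part, second_part = tokens[:, :10], tokens[:, 10:]  # split is illustrative

_, h = encoder(embed(first_part))                # vector representation of the first part
logits = predict_head(h[-1])                     # (batch, vocab)
logits = logits.unsqueeze(1).expand(-1, 10, -1)  # one prediction per held-out token
# Training target: minimize the difference between the predicted and the
# actual second part (cross-entropy as the difference measure).
loss_sm = nn.functional.cross_entropy(logits.reshape(-1, vocab), second_part.reshape(-1))
```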
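Claim 1 combines the task losses by plain arithmetic sum and uses that total to update the model's parameters. A tiny self-contained sketch (the scalar loss values are dummies standing in for the task losses above):

```python
import torch

# Dummy stand-ins for the per-task losses sketched above.
loss_cmcl = torch.tensor(0.8, requires_grad=True)    # cross-modal contrastive task
loss_visual = torch.tensor(0.5, requires_grad=True)  # single-modal visual task
loss_text = torch.tensor(0.3, requires_grad=True)    # single-modal text task

# Total loss = arithmetic (unweighted) sum of the task losses, per the claim.
total_loss = loss_cmcl + loss_visual + loss_text
total_loss.backward()  # gradients from the sum drive the parameter update
```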
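Finally, the claim finely adjusts (fine-tunes) the pre-trained model on data for a downstream task. The snippet below sketches one conventional fine-tuning step for the text classification case; the backbone, head, learning rate, and data are placeholders, not the patent's prescription.

```python
import torch
import torch.nn as nn

backbone = nn.Linear(512, 256)  # placeholder for the pre-trained model
head = nn.Linear(256, 4)        # e.g., a 4-way text classification head
optimizer = torch.optim.AdamW(
    list(backbone.parameters()) + list(head.parameters()), lr=1e-5
)

features = torch.randn(8, 512)      # stand-in for encoded downstream inputs
labels = torch.randint(0, 4, (8,))  # stand-in task labels

logits = head(backbone(features))
loss = nn.functional.cross_entropy(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()  # one fine-tuning update of backbone + task head
```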