CPC G06F 40/58 (2020.01) [G06F 18/256 (2023.01); G06F 40/205 (2020.01); G06F 40/47 (2020.01); G06N 3/02 (2013.01)] | 7 Claims |
1. A method for training a pre-trained model, comprising:
acquiring training data, the training data comprising a single-modal language material comprising an image or a text, and a multi-modal language material comprising an image-text pair which is semantically paired, the multi-modal language material further comprising a language material pair formed by a first-modal language material which is a text in a first language and a second-modal language material which is a text in a second language;
performing at least one of rewriting extension and retrieval extension on the multi-modal language material in the training data, and adding the extended multi-modal language material into the training data;
performing a multi-task training operation on a pre-trained model using the training data, the multi-task training operation comprising at least one cross-modal contrastive learning task and at least one single-modal learning task;
wherein the cross-modal contrastive learning task which is trained utilizing the multi-modal language material comprises: determining similarity between the first-modal language material and the second-modal language material in the multi-modal language material utilizing a vector representation of the first-modal language material and a vector representation of the second-modal language material in the multi-modal language material by the pre-trained model, with a training target of maximizing the similarity between the first-modal language material and the second-modal language material in a positive multi-modal language material and minimizing the similarity between the first-modal language material and the second-modal language material in a negative multi-modal language material; and
the single-modal learning task which is trained utilizing the single-modal language material comprises: predicting a second part of content in the single-modal language material utilizing a vector representation of a first part of content in the single-modal language material by the pre-trained model, with a training target of minimizing a difference between the predicted second part of content and the second part of content in the single-modal language material,
wherein in the cross-modal contrastive learning task, the similarity between the first-modal language material and the second-modal language material in the multi-modal language material obtained by the retrieval extension is determined by: calculating similarity between a vector representation of the first-modal language material obtained by the pre-trained model and a vector representation of the second-modal language material obtained by the pre-trained model; and
the similarity between the first-modal language material and the second-modal language material in the multi-modal language material obtained by the rewriting extension is determined by: stitching the first-modal language material and the second-modal language material, and mapping a vector representation of the stitched language material obtained by the pre-trained model into a similarity value,
wherein parameters of the pre-trained model are updated using a constructed total loss function when the multi-task training operation is performed;
the total loss function is obtained as an arithmetic sum of the loss function of the at least one cross-modal contrastive learning task and the loss function of the at least one single-modal learning task, wherein the loss function of the at least one cross-modal contrastive learning task is constructed according to: a similarity value obtained by stitching the image and the text and mapping a vector representation of the stitched language material obtained by the pre-trained model into the similarity value; or a similarity obtained by calculating cosine similarity between a vector representation of the image obtained by the pre-trained model and a vector representation of the text obtained by the pre-trained model; and the loss function of the at least one single-modal learning task comprises the loss function of a visual learning task and the loss function of a text learning task, and
fine-tuning the pre-trained model according to training data corresponding to a downstream task comprising a text classification task, an image classification task, a task of generating questions and answers for images, or a task of generating images for texts.
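Outside the claim language, the two similarity pathways recited above can be illustrated with a minimal sketch. For a pair obtained by retrieval extension, similarity is the cosine of the two independently encoded modality vectors; for a pair obtained by rewriting extension, the materials are stitched and the joint representation is mapped to a similarity value. The function names, the use of simple vector concatenation in place of re-encoding the stitched pair with the model, and the learned projection `w`/`b` with a sigmoid mapping are all illustrative assumptions, not elements of the claimed method:

```python
import numpy as np

def cosine_similarity(u, v):
    """Retrieval-extension pathway: compare the two modality
    vectors produced independently by the pre-trained model."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def stitched_similarity(u, v, w, b=0.0):
    """Rewriting-extension pathway: 'stitch' (here, concatenate)
    the two materials' representations and map the joint vector to
    a scalar similarity value; the projection weights `w`, bias
    `b`, and sigmoid squashing are illustrative assumptions."""
    joint = np.concatenate([u, v])          # stitched representation
    return float(1.0 / (1.0 + np.exp(-(w @ joint + b))))
```

In the claimed method the stitched language material itself would be re-encoded by the pre-trained model before the mapping; the concatenation of pre-computed vectors above only stands in for that step.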
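The construction of the total loss function as an arithmetic sum of the per-task losses can likewise be sketched. The claim does not fix the concrete objectives, so the contrastive term below assumes a softmax cross-entropy over one positive and several negative similarities (scaled by an assumed temperature), and the single-modal term assumes mean squared error as the "difference" between the predicted and actual second part of content:

```python
import numpy as np

def contrastive_loss(sim_pos, sims_neg, temperature=0.07):
    """Maximize the positive pair's similarity and minimize the
    negative pairs': softmax cross-entropy over similarity logits,
    with the positive pair at index 0."""
    logits = np.array([sim_pos] + list(sims_neg)) / temperature
    return float(np.log(np.exp(logits).sum()) - logits[0])

def single_modal_loss(predicted, actual):
    """Penalize the difference between the predicted second part of
    content and the actual second part (MSE stands in for the
    unspecified difference measure)."""
    diff = np.asarray(predicted) - np.asarray(actual)
    return float(np.mean(diff ** 2))

def total_loss(sim_pos, sims_neg, predicted, actual):
    """Arithmetic sum of the contrastive and single-modal losses,
    as recited for the total loss function."""
    return contrastive_loss(sim_pos, sims_neg) + single_modal_loss(predicted, actual)
```

Raising the positive pair's similarity lowers the contrastive term, matching the stated training target; the model parameters would then be updated by gradients of this sum.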