| CPC G06V 10/806 (2022.01) [G06F 40/279 (2020.01); G06V 10/40 (2022.01); G06V 10/7715 (2022.01); G06V 10/774 (2022.01); G06V 10/86 (2022.01)] | 18 Claims |

|
1. A multi-modal model training method, wherein the method comprises:
acquiring sample images and text feature vectors corresponding to the sample images;
inputting the sample images into a feature extraction network of an initial multi-modal model, so as to generate image feature vectors corresponding to the sample images, wherein the feature extraction network is used for encoding the sample images, and generating the image feature vectors according to association relationships between features to be generated and generated features;
inputting the text feature vectors and the image feature vectors into a transformer structure of the initial multi-modal model, and outputting candidate texts corresponding to the sample images; and
updating parameters of the initial multi-modal model according to target texts corresponding to the text feature vectors, and the candidate texts, so as to determine a target multi-modal model;
wherein inputting the sample images into the feature extraction network of the initial multi-modal model, so as to generate the image feature vectors corresponding to the sample images comprises: acquiring a feature generation sequence corresponding to each sample image; and generating the image feature vectors according to the association relationships between the features to be generated and the generated features and the feature generation sequence;
wherein generating the image feature vectors according to the association relationships between the features to be generated and the generated features and the feature generation sequence comprises: determining the generated features around each feature to be generated; and generating, from outside to inside, all the image feature vectors according to the association relationships between the features to be generated and the generated features around the features to be generated and the feature generation sequence.
|
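The "from outside to inside" generation recited in claim 1 can be illustrated with a short sketch. The claim does not fix a particular ordering or neighbourhood, so the following are assumptions made only for illustration: features are laid out on a rectangular grid, the feature generation sequence proceeds ring by ring from the border toward the centre (raster order within a ring), and "the generated features around" a position means its 8-connected neighbours that appear earlier in the sequence. The names `outside_in_order` and `generated_neighbors` are hypothetical and do not come from the patent.

```python
from typing import Dict, List, Tuple

Pos = Tuple[int, int]  # (row, column) of one feature on the grid


def outside_in_order(rows: int, cols: int) -> List[Pos]:
    """Return all grid positions ring by ring, outermost ring first.

    Assumption: ring index = distance to the nearest grid border.
    """
    def ring(pos: Pos) -> int:
        r, c = pos
        return min(r, c, rows - 1 - r, cols - 1 - c)

    positions = [(r, c) for r in range(rows) for c in range(cols)]
    # sorted() is stable, so positions keep raster order inside each ring.
    return sorted(positions, key=ring)


def generated_neighbors(order: List[Pos]) -> Dict[Pos, List[Pos]]:
    """Map each position to its 8-connected neighbours generated earlier."""
    step = {pos: i for i, pos in enumerate(order)}
    context: Dict[Pos, List[Pos]] = {}
    for pos in order:
        r, c = pos
        context[pos] = [
            (r + dr, c + dc)
            for dr in (-1, 0, 1)
            for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0)
            and (r + dr, c + dc) in step
            and step[(r + dr, c + dc)] < step[pos]
        ]
    return context


# Usage: a 3x3 grid has one outer ring of 8 positions and a single centre.
order = outside_in_order(3, 3)
context = generated_neighbors(order)
```

In this sketch the first position in the sequence has no generated neighbours, while the centre of the 3x3 grid is generated last and can draw on all eight surrounding features. A feature extraction network along the lines of the claim would condition each new feature on the entries of `context` for its position; how that conditioning is implemented is not specified here.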