US 12,260,629 B2
Multi-modal model training method and apparatus, image recognition method and apparatus, and electronic device
Chong Shen, Jiangsu (CN); and Feng Li, Jiangsu (CN)
Assigned to SUZHOU METABRAIN INTELLIGENT TECHNOLOGY CO., LTD., Jiangsu (CN)
Appl. No. 18/697,418
Filed by SUZHOU METABRAIN INTELLIGENT TECHNOLOGY CO., LTD., Jiangsu (CN)
PCT Filed Sep. 28, 2022, PCT No. PCT/CN2022/122303 § 371(c)(1), (2) Date Mar. 29, 2024, PCT Pub. No. WO2023/159945, PCT Pub. Date Aug. 31, 2023.
Claims priority of application No. 202210174577.9 (CN), filed on Feb. 25, 2022.
Prior Publication US 2024/0331370 A1, Oct. 3, 2024
Int. Cl. G06V 10/77 (2022.01); G06F 40/279 (2020.01); G06V 10/40 (2022.01); G06V 10/774 (2022.01); G06V 10/80 (2022.01); G06V 10/86 (2022.01)

CPC G06V 10/806 (2022.01) [G06F 40/279 (2020.01); G06V 10/40 (2022.01); G06V 10/7715 (2022.01); G06V 10/774 (2022.01); G06V 10/86 (2022.01)]

18 Claims

1. A multi-modal model training method, wherein the method comprises:

acquiring sample images and text feature vectors corresponding to the sample images;

inputting the sample images into a feature extraction network of an initial multi-modal model, so as to generate image feature vectors corresponding to the sample images, wherein the feature extraction network is used for encoding the sample images, and generating the image feature vectors according to association relationships between features to be generated and generated features;

inputting the text feature vectors and the image feature vectors into a transformer structure of the initial multi-modal model, and outputting candidate texts corresponding to the sample images; and

updating parameters of the initial multi-modal model according to target texts corresponding to the text feature vectors, and the candidate texts, so as to determine a target multi-modal model;

wherein inputting the sample images into the feature extraction network of the initial multi-modal model, so as to generate the image feature vectors corresponding to the sample images comprises: acquiring a feature generation sequence corresponding to each sample image; and generating the image feature vectors according to the association relationships between the features to be generated and the generated features and the feature generation sequence;

wherein generating the image feature vectors according to the association relationships between the features to be generated and the generated features and the feature generation sequence comprises: determining the generated features around each feature to be generated; and generating, from outside to inside, all the image feature vectors according to the association relationships between the features to be generated and the generated features around same and the feature generation sequence.
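
The outside-to-inside generation order recited in claim 1 can be illustrated with a minimal sketch. It assumes that the features lie on a rectangular grid, that the feature generation sequence walks concentric rings starting at the border, and that the "generated features around" a position are its already-visited 8-connected neighbours. None of these specifics is stated in the claim, and the names `outside_in_sequence` and `generated_neighbours` are hypothetical.

```python
from typing import Dict, List, Tuple

Pos = Tuple[int, int]  # (row, column) of one feature on the grid


def outside_in_sequence(height: int, width: int) -> List[Pos]:
    """Hypothetical feature generation sequence: ring by ring, border first."""
    order: List[Pos] = []
    top, left, bottom, right = 0, 0, height - 1, width - 1
    while top <= bottom and left <= right:
        for c in range(left, right + 1):              # top edge, left to right
            order.append((top, c))
        for r in range(top + 1, bottom + 1):          # right edge, downward
            order.append((r, right))
        if top < bottom:
            for c in range(right - 1, left - 1, -1):  # bottom edge, right to left
                order.append((bottom, c))
        if left < right:
            for r in range(bottom - 1, top, -1):      # left edge, upward
                order.append((r, left))
        # Move one ring inward.
        top, left, bottom, right = top + 1, left + 1, bottom - 1, right - 1
    return order


def generated_neighbours(order: List[Pos]) -> Dict[Pos, List[Pos]]:
    """For each position, list the 8-connected neighbours generated before it."""
    done = set()
    context: Dict[Pos, List[Pos]] = {}
    for r, c in order:
        context[(r, c)] = [
            (r + dr, c + dc)
            for dr in (-1, 0, 1)
            for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0) and (r + dr, c + dc) in done
        ]
        done.add((r, c))
    return context


if __name__ == "__main__":
    seq = outside_in_sequence(3, 3)
    ctx = generated_neighbours(seq)
    for pos in seq:
        print(pos, "conditioned on", ctx[pos])
```

Under these assumptions the first border position has no generated neighbours, while the innermost position of a 3 x 3 grid is conditioned on all eight surrounding features. A feature extraction network of the kind claimed would use such a context when producing each image feature vector; how that conditioning is computed is not specified in the claim and is not modelled here.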