US 12,236,699 B1
Multimodal data heterogeneous transformer-based asset recognition method, system, and device
Feiran Huang, Guangzhou (CN); Zhenghang Yang, Guangzhou (CN); Zhibo Zhou, Guangzhou (CN); Shuyuan Lin, Guangzhou (CN); and Jian Weng, Guangzhou (CN)
Assigned to JINAN UNIVERSITY, Guangzhou (CN)
Filed by JINAN UNIVERSITY, Guangzhou (CN)
Filed on Nov. 22, 2024, as Appl. No. 18/956,232.
Claims priority of application No. 202410257623.0 (CN), filed on Mar. 7, 2024.
Int. Cl. G06V 30/24 (2022.01); G06V 20/62 (2022.01); G06V 30/16 (2022.01); G06V 30/19 (2022.01)
CPC G06V 30/2552 (2022.01) [G06V 20/62 (2022.01); G06V 30/16 (2022.01); G06V 30/19127 (2022.01); G06V 30/19147 (2022.01); G06V 30/19173 (2022.01)] 8 Claims
OG exemplary drawing
 
1. A multimodal data heterogeneous Transformer-based asset recognition method, comprising:
collecting multimodal information of an asset, comprising text information and image information;
building an A Lite Bidirectional Encoder Representations from Transformers, ALBERT, model, a Vision Transformer, ViT, model, and a Contrastive Language-Image Pre-Training, CLIP, model;
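A minimal sketch of this model-building step, assuming PyTorch and the Hugging Face transformers library with publicly available pretrained checkpoints; neither the library nor the specific checkpoint names below are specified by the claim and they are illustrative only.

from transformers import (
    AlbertModel, AlbertTokenizer,
    ViTModel, ViTImageProcessor,
    CLIPModel, CLIPProcessor,
)

# Text channel backbone (ALBERT) and its tokenizer.
albert = AlbertModel.from_pretrained("albert-base-v2")
albert_tok = AlbertTokenizer.from_pretrained("albert-base-v2")

# Image channel backbone (ViT) and its image preprocessor.
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
vit_proc = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")

# Image-text matching backbone (CLIP) and its joint text/image processor.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")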
by the ALBERT model, extracting a text information feature: using a multilayer Transformer encoder to learn a context relation in a text sequence; connecting an output of the ALBERT model to a fully connected layer; and outputting final classification information, wherein this step comprises:
preprocessing the text information; converting the preprocessed text information into a vector representation; adding an identifier to indicate a start or an end; performing padding and truncating; randomly replacing part of the text with [MASK] tokens; and by a Masked Language Modeling, MLM, model, performing inferential prediction;
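A minimal sketch of this preprocessing sub-step, assuming the ALBERT tokenizer loaded above; the function name preprocess, the 128-token length, and the 15% masking rate are illustrative assumptions rather than claim limitations.

import torch

def preprocess(texts, tokenizer, max_len=128, mask_prob=0.15):
    # Tokenize, add start/end identifiers ([CLS]/[SEP]), pad/truncate to max_len,
    # and randomly replace a fraction of the real tokens with [MASK] for MLM.
    enc = tokenizer(texts, padding="max_length", truncation=True,
                    max_length=max_len, return_tensors="pt")
    ids = enc["input_ids"].clone()
    # Never mask padding or the special [CLS]/[SEP] identifiers.
    special = torch.tensor(
        [tokenizer.get_special_tokens_mask(row, already_has_special_tokens=True)
         for row in enc["input_ids"].tolist()],
        dtype=torch.bool)
    maskable = enc["attention_mask"].bool() & ~special
    mask = (torch.rand(ids.shape) < mask_prob) & maskable
    ids[mask] = tokenizer.mask_token_id
    return ids, enc["attention_mask"], mask

# Example (placeholder text): ids, attn, mask = preprocess(["example asset description"], albert_tok)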
generating a token embedding vector E_token, a segment embedding vector E_seg, and a position embedding vector E_pos; and representing a generated embedding by:
E = E_token ∥ E_seg ∥ E_pos
wherein the ∥ denotes concatenation;
randomly initializing a token embedding matrix and selecting a corpus for training, wherein the training comprises updating values in the embedding matrix to fit the corpus; taking the token embedding vector memorized upon termination of the training as the final embedding vector; learning a paragraph containing a word based on the segment embedding vector; and learning a relative position of a word based on the position embedding vector;
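A minimal sketch of this embedding step, assuming PyTorch; unlike the additive embeddings of standard BERT/ALBERT, the claim concatenates the three embeddings, so the sketch concatenates along the feature dimension. The class name ConcatEmbedding and the layer sizes are assumptions.

import torch
import torch.nn as nn

class ConcatEmbedding(nn.Module):
    # E = E_token || E_seg || E_pos, where || is concatenation along the
    # feature dimension.
    def __init__(self, vocab_size, max_len, dim, n_segments=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, dim)  # randomly initialized, then trained to fit the corpus
        self.seg = nn.Embedding(n_segments, dim)    # which paragraph/segment contains the word
        self.pos = nn.Embedding(max_len, dim)       # relative position of the word

    def forward(self, token_ids, seg_ids):
        pos_ids = torch.arange(token_ids.size(1), device=token_ids.device).unsqueeze(0)
        e_pos = self.pos(pos_ids).expand(token_ids.size(0), -1, -1)
        return torch.cat([self.token(token_ids), self.seg(seg_ids), e_pos], dim=-1)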
feeding the generated embedding into a multilayer perceptron to obtain a vector E_obj; feeding the vector E_obj into the Transformer encoder to generate a presence vector E_present, denoted as:
E_present = Transformer_encoder(E_obj)
wherein the Transformer_encoder denotes the Transformer encoder;
passing the presence vector E_present through the fully connected layer MLP_class and a softmax function for classification to obtain a recognition type as:
type = softmax(MLP_class(E_present));
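A minimal sketch of this perceptron/encoder/classifier chain, assuming PyTorch; the layer sizes, the mean-pooling of the encoder output before classification, and the class name TextChannel are illustrative assumptions.

import torch.nn as nn
import torch.nn.functional as F

class TextChannel(nn.Module):
    # E_obj = MLP(E); E_present = Transformer_encoder(E_obj);
    # type = softmax(MLP_class(E_present)).
    def __init__(self, in_dim, d_model, n_classes, n_layers=4, n_heads=8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, d_model), nn.GELU(),
                                 nn.Linear(d_model, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.mlp_class = nn.Linear(d_model, n_classes)  # fully connected classification layer

    def forward(self, E):
        E_obj = self.mlp(E)
        E_present = self.encoder(E_obj)     # presence vectors, one per token
        pooled = E_present.mean(dim=1)      # assumption: mean-pool tokens into one presence vector
        return F.softmax(self.mlp_class(pooled), dim=-1), pooled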
by the ViT model, extracting an image information feature: dividing the image information into tokens; using a Transformer encoder to capture and learn content information from the divided image tokens; and using a classification head to map an image feature to class information;
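A minimal sketch of the image channel, assuming the ViT backbone loaded above, which performs the patch-token splitting and Transformer encoding internally; taking the [CLS] token as the image feature and the class name ImageChannel are illustrative assumptions.

import torch.nn as nn
import torch.nn.functional as F

class ImageChannel(nn.Module):
    # The ViT backbone splits the image into patch tokens and encodes them with a
    # Transformer; a linear classification head maps the image feature to classes.
    def __init__(self, vit_backbone, n_classes):
        super().__init__()
        self.vit = vit_backbone  # e.g. the ViTModel loaded earlier
        self.head = nn.Linear(self.vit.config.hidden_size, n_classes)

    def forward(self, pixel_values):
        out = self.vit(pixel_values=pixel_values)
        E_image = out.last_hidden_state[:, 0]  # [CLS] token taken as the image feature vector
        return F.softmax(self.head(E_image), dim=-1), E_image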
by the CLIP model, extracting an image-text matching information feature: building sample pairs of images with matching text descriptions; encoding the image information and the text information to obtain an image feature representation vector and a text feature representation vector; linearly projecting the image feature representation vector and the text feature representation vector into a multimodal space; and calculating a similarity between the two modalities to obtain a matching degree between the image information and the text information;
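A minimal sketch of the matching-degree computation, assuming the Hugging Face CLIP model and processor loaded above; using CLIP's image-text similarity logits as the matching degree is an assumption about how the claim's similarity is realized.

def clip_matching_degree(clip_model, clip_processor, images, texts):
    # Encode both modalities, project them into CLIP's shared multimodal space,
    # and use the image-text similarity logits as the matching degree.
    inputs = clip_processor(text=texts, images=images,
                            return_tensors="pt", padding=True)
    out = clip_model(**inputs)
    # logits_per_image[i, j]: scaled cosine similarity of image i and text j.
    return out.logits_per_image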
by different channels, applying asset type recognition to information in different modalities; outputting classification information from the different channels; by the CLIP model, generating asset void information; and
discriminatively fusing the classification information from the different channels with the matching degree between the image information and the text information obtained by the CLIP model, and outputting final asset class information, wherein this step comprises:
for an asset having both an image and a text, performing a discriminative fusion training, comprising:
obtaining final feature embedding vectors from the text channel and the image channel, respectively, wherein the distances between the text and image feature embedding vectors in their respective modal spaces are denoted as:
Dis_s = (E_present_i − E_image_i)
Dis_n = (E_present_i − E_image_j), i ≠ j
wherein the E_present_i denotes a text feature embedding vector, the E_image_i denotes an image feature embedding vector, the Dis_s denotes a distance between the feature embedding vectors of image and text having matching information, and the Dis_n denotes a distance between the feature embedding vectors of non-matching image and text;
wherein, in different modalities, distances between embedding vectors representing different information are denoted as:
Dis(e_p1, e_p2) = Dis(e_i1, e_i2)
Dis(e_p1, e_p2) = e_p1 − e_p2
Dis(e_i1, e_i2) = e_i1 − e_i2
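A minimal sketch of these distances, assuming PyTorch; the claim writes the distances as vector differences, so reducing them to scalars with a Euclidean norm, and forming non-matching pairs by rolling the batch, are illustrative assumptions.

import torch

def dis(a, b):
    # The claim writes Dis(x, y) as x - y; a Euclidean norm of that difference is
    # taken here so the distance can enter a scalar loss (assumption).
    return torch.norm(a - b, dim=-1)

def matching_distances(E_present, E_image):
    # Dis_s: distance between matching text/image embeddings (same sample i).
    # Dis_n: distance between non-matching pairs (i != j); rolling the image batch
    # by one position is an illustrative negative-sampling choice.
    dis_s = dis(E_present, E_image)
    dis_n = dis(E_present, torch.roll(E_image, shifts=1, dims=0))
    return dis_s, dis_n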
using the matching degree output from the CLIP model as an accumulation term to build a loss for the discriminative fusion training, denoted as:
Loss_critic = min α(−Σ log σ(Dis_s − Dis_n)) + β(Dis(e_i1, e_i2) + Dis(e_p1, e_p2)) + γ·Sim
wherein α, β, and γ are automatically learned and generated for different datasets, the σ is a sigmoid activation function, and the Sim denotes the matching degree;
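A minimal sketch of this loss as written, assuming PyTorch and reusing the dis() and matching_distances() helpers from the previous sketch; modeling α, β, γ as learnable scalar parameters is an assumption, since the claim only states they are automatically learned per dataset.

import torch
import torch.nn as nn

class CriticLoss(nn.Module):
    # Loss_critic = α(−Σ log σ(Dis_s − Dis_n)) + β(Dis(e_i1, e_i2) + Dis(e_p1, e_p2)) + γ·Sim,
    # implemented as written in the claim.
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(1.0))
        self.beta = nn.Parameter(torch.tensor(1.0))
        self.gamma = nn.Parameter(torch.tensor(1.0))

    def forward(self, dis_s, dis_n, dis_i, dis_p, sim):
        # dis_i / dis_p: intra-modal distances Dis(e_i1, e_i2) / Dis(e_p1, e_p2)
        # computed with the dis() helper above; sim: CLIP matching degree.
        ranking = -torch.log(torch.sigmoid(dis_s - dis_n)).sum()
        return (self.alpha * ranking
                + self.beta * (dis_i + dis_p).sum()
                + self.gamma * sim.mean())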
after the training, obtaining the discriminatively fused embedding vector representation E_final, and passing the same through a softmax classifier for classification to obtain the final asset class information, denoted as:
Class_final = softmax(critic(E_present, E_image, Sim))
wherein the E_present denotes the text channel feature embedding vector, the E_image denotes the image channel feature embedding vector, and the Class_final denotes the final asset class information.
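A minimal sketch of the final fusion classifier, assuming PyTorch; the claim does not fix the internal form of the critic, so the concatenation-plus-linear-projection below and the class name CriticFusion are purely illustrative.

import torch
import torch.nn as nn

class CriticFusion(nn.Module):
    # Class_final = softmax(critic(E_present, E_image, Sim)); the critic here simply
    # concatenates the two channel embeddings with the matching degree and projects
    # them to class logits.
    def __init__(self, text_dim, image_dim, n_classes):
        super().__init__()
        self.proj = nn.Linear(text_dim + image_dim + 1, n_classes)

    def forward(self, E_present, E_image, sim):
        # sim: per-sample matching degree of shape (batch,).
        E_final = torch.cat([E_present, E_image, sim.unsqueeze(-1)], dim=-1)  # fused embedding E_final
        return torch.softmax(self.proj(E_final), dim=-1)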