US 12,236,699 B1
Multimodal data heterogeneous transformer-based asset recognition method, system, and device
Feiran Huang, Guangzhou (CN); Zhenghang Yang, Guangzhou (CN); Zhibo Zhou, Guangzhou (CN); Shuyuan Lin, Guangzhou (CN); and Jian Weng, Guangzhou (CN)
Assigned to JINAN UNIVERSITY, Guangzhou (CN)
Filed by JINAN UNIVERSITY, Guangzhou (CN)
Filed on Nov. 22, 2024, as Appl. No. 18/956,232.
Claims priority of application No. 202410257623.0 (CN), filed on Mar. 7, 2024.
Int. Cl. G06V 30/24 (2022.01); G06V 20/62 (2022.01); G06V 30/16 (2022.01); G06V 30/19 (2022.01)
CPC G06V 30/2552 (2022.01) [G06V 20/62 (2022.01); G06V 30/16 (2022.01); G06V 30/19127 (2022.01); G06V 30/19147 (2022.01); G06V 30/19173 (2022.01)] 8 Claims
OG exemplary drawing
 
1. A multimodal data heterogeneous Transformer-based asset recognition method, comprising:
collecting multimodal information of an asset, comprising text information and image information;
building an A Lite Bidirectional Encoder Representations from Transformers, ALBERT, model, a Vision Transformer, ViT, model, and a Contrastive Language-Image Pre-Training, CLIP, model;
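A minimal sketch of this model-building step, assuming PyTorch and the Hugging Face transformers library with publicly available pretrained checkpoints; neither the library nor the specific checkpoint names below are specified by the claim and they are illustrative only.

from transformers import (
    AlbertModel, AlbertTokenizer,
    ViTModel, ViTImageProcessor,
    CLIPModel, CLIPProcessor,
)

# Text channel backbone (ALBERT) and its tokenizer.
albert = AlbertModel.from_pretrained("albert-base-v2")
albert_tok = AlbertTokenizer.from_pretrained("albert-base-v2")

# Image channel backbone (ViT) and its image preprocessor.
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
vit_proc = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")

# Image-text matching backbone (CLIP) and its joint text/image processor.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")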
by the ALBERT model, extracting a text information feature: using a multilayer Transformer encoder to learn a context relation in a text sequence; connecting an output of the ALBERT model to a fully connected layer; and outputting final classification information, wherein this step comprises:
preprocessing the text information; converting the preprocessed text information into a vector representation; adding an identifier to indicate a start or an end; performing padding and truncating; randomly replacing part of the text with [MASK] tokens; and by a Masked Language Modeling, MLM, model, performing inferential prediction;
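A minimal sketch of this preprocessing sub-step, assuming the ALBERT tokenizer loaded above; the function name preprocess, the 128-token length, and the 15% masking rate are illustrative assumptions rather than claim limitations.

import torch

def preprocess(texts, tokenizer, max_len=128, mask_prob=0.15):
    # Tokenize, add start/end identifiers ([CLS]/[SEP]), pad/truncate to max_len,
    # and randomly replace a fraction of the real tokens with [MASK] for MLM.
    enc = tokenizer(texts, padding="max_length", truncation=True,
                    max_length=max_len, return_tensors="pt")
    ids = enc["input_ids"].clone()
    # Never mask padding or the special [CLS]/[SEP] identifiers.
    special = torch.tensor(
        [tokenizer.get_special_tokens_mask(row, already_has_special_tokens=True)
         for row in enc["input_ids"].tolist()],
        dtype=torch.bool)
    maskable = enc["attention_mask"].bool() & ~special
    mask = (torch.rand(ids.shape) < mask_prob) & maskable
    ids[mask] = tokenizer.mask_token_id
    return ids, enc["attention_mask"], mask

# Example (placeholder text): ids, attn, mask = preprocess(["example asset description"], albert_tok)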
generating a token embedding vector E_token, a segment embedding vector E_seg, and a position embedding vector E_pos; and representing a generated embedding by:
E = E_token ∥ E_seg ∥ E_pos
wherein the ∥ denotes concatenation;
randomly initializing a token embedding matrix and selecting a corpus for training, wherein the training comprises updating values in the embedding matrix to fit the corpus; taking the token embedding vector memorized upon termination of the training as the final embedding vector; learning a paragraph containing a word based on the segment embedding vector; and learning a relative position of a word based on the position embedding vector;
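A minimal sketch of this embedding step, assuming PyTorch; unlike the additive embeddings of standard BERT/ALBERT, the claim concatenates the three embeddings, so the sketch concatenates along the feature dimension. The class name ConcatEmbedding and the layer sizes are assumptions.

import torch
import torch.nn as nn

class ConcatEmbedding(nn.Module):
    # E = E_token || E_seg || E_pos, where || is concatenation along the
    # feature dimension.
    def __init__(self, vocab_size, max_len, dim, n_segments=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, dim)  # randomly initialized, then trained to fit the corpus
        self.seg = nn.Embedding(n_segments, dim)    # which paragraph/segment contains the word
        self.pos = nn.Embedding(max_len, dim)       # relative position of the word

    def forward(self, token_ids, seg_ids):
        pos_ids = torch.arange(token_ids.size(1), device=token_ids.device).unsqueeze(0)
        e_pos = self.pos(pos_ids).expand(token_ids.size(0), -1, -1)
        return torch.cat([self.token(token_ids), self.seg(seg_ids), e_pos], dim=-1)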
feeding the generated embedding into a multilayer perceptron to obtain a vector E_obj; feeding the vector E_obj into the Transformer encoder to generate a presence vector E_present, denoted as:
E_present = Transformer_encoder(E_obj)
wherein the Transformer_encoder denotes the Transformer encoder;
passing the presence vector E_present through the fully connected layer MLP_class and a softmax function for classification to obtain a recognition type as:
type = softmax(MLP_class(E_present));
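A minimal sketch of this perceptron/encoder/classifier chain, assuming PyTorch; the layer sizes, the mean-pooling of the encoder output before classification, and the class name TextChannel are illustrative assumptions.

import torch.nn as nn
import torch.nn.functional as F

class TextChannel(nn.Module):
    # E_obj = MLP(E); E_present = Transformer_encoder(E_obj);
    # type = softmax(MLP_class(E_present)).
    def __init__(self, in_dim, d_model, n_classes, n_layers=4, n_heads=8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, d_model), nn.GELU(),
                                 nn.Linear(d_model, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.mlp_class = nn.Linear(d_model, n_classes)  # fully connected classification layer

    def forward(self, E):
        E_obj = self.mlp(E)
        E_present = self.encoder(E_obj)     # presence vectors, one per token
        pooled = E_present.mean(dim=1)      # assumption: mean-pool tokens into one presence vector
        return F.softmax(self.mlp_class(pooled), dim=-1), pooled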
by the ViT model, extracting an image information feature: dividing the image information into tokens; using a Transformer encoder to capture and learn content information from the divided image tokens; and using a classification head to map an image feature to class information;
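A minimal sketch of the image channel, assuming the ViT backbone loaded above, which performs the patch-token splitting and Transformer encoding internally; taking the [CLS] token as the image feature and the class name ImageChannel are illustrative assumptions.

import torch.nn as nn
import torch.nn.functional as F

class ImageChannel(nn.Module):
    # The ViT backbone splits the image into patch tokens and encodes them with a
    # Transformer; a linear classification head maps the image feature to classes.
    def __init__(self, vit_backbone, n_classes):
        super().__init__()
        self.vit = vit_backbone  # e.g. the ViTModel loaded earlier
        self.head = nn.Linear(self.vit.config.hidden_size, n_classes)

    def forward(self, pixel_values):
        out = self.vit(pixel_values=pixel_values)
        E_image = out.last_hidden_state[:, 0]  # [CLS] token taken as the image feature vector
        return F.softmax(self.head(E_image), dim=-1), E_image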
by the CLIP model, extracting an image-text matching information feature: building sample pairs of images with matching text descriptions; encoding the image information and the text information to obtain an image feature representation vector and a text feature representation vector; linearly projecting the image feature representation vector and the text feature representation vector into a multimodal space; and calculating a similarity between the two modalities to obtain a matching degree between the image information and the text information;
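A minimal sketch of the matching-degree computation, assuming the Hugging Face CLIP model and processor loaded above; using CLIP's image-text similarity logits as the matching degree is an assumption about how the claim's similarity is realized.

def clip_matching_degree(clip_model, clip_processor, images, texts):
    # Encode both modalities, project them into CLIP's shared multimodal space,
    # and use the image-text similarity logits as the matching degree.
    inputs = clip_processor(text=texts, images=images,
                            return_tensors="pt", padding=True)
    out = clip_model(**inputs)
    # logits_per_image[i, j]: scaled cosine similarity of image i and text j.
    return out.logits_per_image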
by different channels, applying asset type recognition to information in different modalities; outputting classification information from the different channels; by the CLIP model, generating asset void information; and
discriminatively fusing the classification information from the different channels with the matching degree between the image information and the text information obtained by the CLIP model, and outputting final asset class information, wherein this step comprises:
for an asset having both an image and a text, performing a discriminative fusion training, comprising:
obtaining final feature embedding vectors from the text channel and the image channel, respectively, wherein the distances between the text and image feature embedding vectors in their respective modal spaces are denoted as:
Dis_s = (E_present_i − E_image_i)
Dis_n = (E_present_i − E_image_j), i ≠ j
wherein the E_present_i denotes a text feature embedding vector, the E_image_i denotes an image feature embedding vector, the Dis_s denotes a distance between the feature embedding vectors of image and text having matching information, and the Dis_n denotes a distance between the feature embedding vectors of non-matching image and text;
wherein, in different modalities, distances between embedding vectors representing different information are denoted as:
Dis(e_p1, e_p2) = Dis(e_i1, e_i2)
Dis(e_p1, e_p2) = e_p1 − e_p2
Dis(e_i1, e_i2) = e_i1 − e_i2
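A minimal sketch of these distances, assuming PyTorch; the claim writes the distances as vector differences, so reducing them to scalars with a Euclidean norm, and forming non-matching pairs by rolling the batch, are illustrative assumptions.

import torch

def dis(a, b):
    # The claim writes Dis(x, y) as x - y; a Euclidean norm of that difference is
    # taken here so the distance can enter a scalar loss (assumption).
    return torch.norm(a - b, dim=-1)

def matching_distances(E_present, E_image):
    # Dis_s: distance between matching text/image embeddings (same sample i).
    # Dis_n: distance between non-matching pairs (i != j); rolling the image batch
    # by one position is an illustrative negative-sampling choice.
    dis_s = dis(E_present, E_image)
    dis_n = dis(E_present, torch.roll(E_image, shifts=1, dims=0))
    return dis_s, dis_n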
using the matching degree output from the CLIP model as an accumulation term to build a loss for the discriminative fusion training, denoted as:
Loss_critic = min α(−Σ log σ(Dis_s − Dis_n)) + β(Dis(e_i1, e_i2) + Dis(e_p1, e_p2)) + γ·Sim
wherein α, β, and γ are automatically learned and generated for different datasets, the σ is a sigmoid activation function, and the Sim denotes the matching degree;
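A minimal sketch of this loss as written, assuming PyTorch and reusing the dis() and matching_distances() helpers from the previous sketch; modeling α, β, γ as learnable scalar parameters is an assumption, since the claim only states they are automatically learned per dataset.

import torch
import torch.nn as nn

class CriticLoss(nn.Module):
    # Loss_critic = α(−Σ log σ(Dis_s − Dis_n)) + β(Dis(e_i1, e_i2) + Dis(e_p1, e_p2)) + γ·Sim,
    # implemented as written in the claim.
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(1.0))
        self.beta = nn.Parameter(torch.tensor(1.0))
        self.gamma = nn.Parameter(torch.tensor(1.0))

    def forward(self, dis_s, dis_n, dis_i, dis_p, sim):
        # dis_i / dis_p: intra-modal distances Dis(e_i1, e_i2) / Dis(e_p1, e_p2)
        # computed with the dis() helper above; sim: CLIP matching degree.
        ranking = -torch.log(torch.sigmoid(dis_s - dis_n)).sum()
        return (self.alpha * ranking
                + self.beta * (dis_i + dis_p).sum()
                + self.gamma * sim.mean())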
after the training, obtaining the discriminatively fused embedding vector representation E_final, and passing the same through a softmax classifier for classification to obtain the final asset class information, denoted as:
Class_final = softmax(critic(E_present, E_image, Sim))
wherein the E_present denotes the text channel feature embedding vector, the E_image denotes the image channel feature embedding vector, and the Class_final denotes the final asset class information.
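A minimal sketch of the final fusion classifier, assuming PyTorch; the claim does not fix the internal form of the critic, so the concatenation-plus-linear-projection below and the class name CriticFusion are purely illustrative.

import torch
import torch.nn as nn

class CriticFusion(nn.Module):
    # Class_final = softmax(critic(E_present, E_image, Sim)); the critic here simply
    # concatenates the two channel embeddings with the matching degree and projects
    # them to class logits.
    def __init__(self, text_dim, image_dim, n_classes):
        super().__init__()
        self.proj = nn.Linear(text_dim + image_dim + 1, n_classes)

    def forward(self, E_present, E_image, sim):
        # sim: per-sample matching degree of shape (batch,).
        E_final = torch.cat([E_present, E_image, sim.unsqueeze(-1)], dim=-1)  # fused embedding E_final
        return torch.softmax(self.proj(E_final), dim=-1)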