| CPC G06V 10/778 (2022.01) [G06V 10/751 (2022.01); G06V 10/761 (2022.01)] | 3 Claims |

1. A method for re-recognizing an object image based on multi-feature information capture and correlation analysis, comprising:
a) collecting a plurality of object images to form an object image re-recognition database, labeling identifier (ID) information of an object image in the object image re-recognition database, and dividing the object image re-recognition database into a training set and a test set;
b) establishing an object image re-recognition model by using the multi-feature information capture and correlation analysis;
c) optimizing an objective function of the object image re-recognition model by using a cross-entropy loss function and a triplet loss function to obtain an optimized object image re-recognition model;
d) marking the object images with the ID information to obtain marked object images, inputting the marked object images into the optimized object image re-recognition model in step c) for training to obtain a trained object image re-recognition model and storing the trained object image re-recognition model;
e) inputting a to-be-retrieved object image into the trained object image re-recognition model in step d) to obtain a feature of a to-be-retrieved object; and
f) comparing the feature of the to-be-retrieved object with features of the object images in the test set and sorting comparison results by a similarity measurement,
wherein step b) comprises the following steps:
b-1) setting an image input network to two branch networks comprising a first feature branch network and a second feature branch network;
b-2) inputting an object image h in the training set into the first feature branch network, wherein h∈ℝ^(e×w×3), ℝ represents a real number space, e represents a number of horizontal pixels of the object image h, w represents a number of vertical pixels of the object image h, and 3 represents a number of channels of each red, green, and blue (RGB) image; processing the object image h by using a convolutional layer to obtain a feature map f; processing the feature map f by using a channel attention mechanism; performing a global average pooling and a global maximum pooling on the feature map f to obtain two one-dimensional vectors; normalizing the two one-dimensional vectors through a convolution, a Rectified Linear Unit (ReLU) activation function, a 1*1 convolution, and sigmoid function operations in turn to weight the feature map f to obtain a weighted feature map f; performing a maximum pooling and an average pooling on all channels at each position in the weighted feature map f by using a spatial attention mechanism to obtain a maximum pooled feature map and an average pooled feature map; stitching the maximum pooled feature map and the average pooled feature map to obtain a stitched feature map; performing a 7*7 convolution on the stitched feature map, and then normalizing the stitched feature map by using a batch normalization layer and a sigmoid function to obtain a normalized stitched feature map; and multiplying the normalized stitched feature map by the feature map f to obtain a new feature;
b-3) inputting the object image h in the training set into the second feature branch network, wherein h∈ℝ^(e×w×3); dividing the image h into n two-dimensional blocks; representing embeddings of the two-dimensional blocks as a one-dimensional vector hl∈ℝ^(n×(p²·3)) by using a linear transformation layer, wherein p represents a resolution of an image block, and n=ew/p²; calculating an average embedding ha of all the two-dimensional blocks according to a formula ha=(1/n)(h1+h2+ . . . +hn), wherein hi represents an embedding of an ith block obtained through a Gaussian distribution initialization, and i∈{1, . . . , n}; calculating an attention coefficient ai of the ith block according to a formula ai=qTσ(W1h0+W2hi+W3ha), wherein qT represents a weight, σ represents the sigmoid function, h0 represents a class marker, and W1, W2, and W3 are weights; calculating a new embedding hl of each of the two-dimensional blocks according to a formula; and calculating a new class marker h′0 according to a formula h′0=W4[h0∥h1], wherein W4 represents a weight;
b-4) taking the new class marker h′0 and a sequence with an input size of hl∈ℝ^(n×dc) as an overall representation of a new image, wherein dc=d*m, d represents a dimension size of a head of each self-attention mechanism in a multi-head attention mechanism, and m represents a number of heads of the multi-head attention mechanism; adding position information to the new image, and then taking the new image as an input of a transformer encoder to complete the establishment of the object image re-recognition model.
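The sketches that follow illustrate one possible PyTorch reading of the steps recited in claim 1. None of them is the patented implementation; every layer size, margin, and hyperparameter is an assumption introduced only for illustration. The channel and spatial attention recited in step b-2) follows a squeeze-and-weight pattern: pooled channel statistics pass through a small 1*1-convolution MLP with a sigmoid, and per-position max/average maps pass through a 7*7 convolution with batch normalization and a sigmoid. In the sketch below, the reduction ratio and the choice to apply the spatial map to the already channel-weighted feature are assumptions, not taken from the claim.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Sketch of the channel + spatial attention described in step b-2)."""
    def __init__(self, channels, reduction=16):  # reduction ratio is an assumption
        super().__init__()
        # Channel attention: global avg/max pooling -> 1*1 conv, ReLU, 1*1 conv -> sigmoid
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )
        # Spatial attention: 7*7 conv over the stitched max/avg maps, then BN + sigmoid
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)
        self.spatial_bn = nn.BatchNorm2d(1)

    def forward(self, f):
        # Channel attention weights the feature map f
        ca = torch.sigmoid(self.mlp(self.avg_pool(f)) + self.mlp(self.max_pool(f)))
        f_weighted = f * ca
        # Spatial attention: max and average over all channels at each position
        max_map, _ = f_weighted.max(dim=1, keepdim=True)
        avg_map = f_weighted.mean(dim=1, keepdim=True)
        stitched = torch.cat([max_map, avg_map], dim=1)
        sa = torch.sigmoid(self.spatial_bn(self.spatial_conv(stitched)))
        # Multiply the normalized map by the (weighted) feature map to obtain the new feature;
        # applying it to the weighted map rather than the raw f is an assumed reading.
        return f_weighted * sa
```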
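Step b-3) splits the image into p*p blocks, embeds each block with a linear transformation, and scores each block against the class marker with ai=qTσ(W1h0+W2hi+W3ha). A minimal sketch follows; the patch size, embedding width, and the softmax-weighted aggregation used before forming the new class marker are assumptions, since the claim does not spell out the per-block update formula.

```python
import torch
import torch.nn as nn

class PatchAttentionEmbed(nn.Module):
    """Sketch of the second feature branch in step b-3)."""
    def __init__(self, p=16, in_ch=3, dim=768):  # p and dim are illustrative choices
        super().__init__()
        self.p = p
        self.proj = nn.Linear(p * p * in_ch, dim)        # linear transformation layer
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))  # class marker h0
        self.W1 = nn.Linear(dim, dim, bias=False)
        self.W2 = nn.Linear(dim, dim, bias=False)
        self.W3 = nn.Linear(dim, dim, bias=False)
        self.q = nn.Parameter(torch.randn(dim))          # weight q
        self.W4 = nn.Linear(2 * dim, dim, bias=False)    # weight for the new class marker

    def forward(self, h):
        # h: (batch, 3, e, w); e and w are assumed divisible by p
        b, c, e, w = h.shape
        p = self.p
        # Divide the image into n = e*w / p^2 two-dimensional blocks and flatten each
        blocks = h.unfold(2, p, p).unfold(3, p, p)                  # (b, c, e/p, w/p, p, p)
        blocks = blocks.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        hi = self.proj(blocks)                                      # block embeddings h_i
        h0 = self.cls.expand(b, -1, -1)                             # class marker h0
        ha = hi.mean(dim=1, keepdim=True)                           # average embedding h_a
        # Attention coefficient a_i = q^T * sigmoid(W1 h0 + W2 h_i + W3 h_a)
        a = torch.sigmoid(self.W1(h0) + self.W2(hi) + self.W3(ha)) @ self.q  # (b, n)
        # ASSUMPTION: an attention-weighted aggregation of the blocks, used here only to
        # illustrate how a new class marker h'_0 = W4[h0 || .] could be formed.
        agg = (a.softmax(dim=1).unsqueeze(-1) * hi).sum(dim=1, keepdim=True)
        h0_new = self.W4(torch.cat([h0, agg], dim=-1))              # new class marker h'_0
        return h0_new, hi, a
```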
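Step b-4) prepends the new class marker to the block embeddings, adds position information, and feeds the resulting sequence to a transformer encoder. A sketch under an assumed encoder depth and head count m:

```python
import torch
import torch.nn as nn

class ReIDTransformerHead(nn.Module):
    """Sketch of step b-4): class marker + block sequence -> transformer encoder."""
    def __init__(self, n_blocks, dim=768, m=12, depth=4):  # dim, m, depth are assumptions
        super().__init__()
        # Learnable position information for the class marker plus n blocks
        self.pos = nn.Parameter(torch.zeros(1, n_blocks + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=m, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, cls_token, block_embeds):
        # Overall representation of the new image: [h'_0 ; h_l] plus position information
        x = torch.cat([cls_token, block_embeds], dim=1) + self.pos
        x = self.encoder(x)
        return x[:, 0]  # encoded class marker used as the re-recognition feature
```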
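Step c) optimizes the model with a cross-entropy loss and a triplet loss. A minimal sketch of such a combined objective, with an assumed margin and weighting factor:

```python
import torch.nn as nn

# Sketch of the objective in step c): identity classification plus a triplet loss.
ce_loss = nn.CrossEntropyLoss()
triplet_loss = nn.TripletMarginLoss(margin=0.3)  # margin is an assumed value

def reid_objective(logits, labels, anchor, positive, negative, w_tri=1.0):
    """Combined loss L = L_ce + w_tri * L_triplet; w_tri is an assumed weighting."""
    return ce_loss(logits, labels) + w_tri * triplet_loss(anchor, positive, negative)
```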
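Step f) compares the feature of the to-be-retrieved object against the test-set features and sorts the results by a similarity measurement. The claim does not name the measure; cosine similarity is assumed in the sketch below, and Euclidean distance would work analogously.

```python
import torch
import torch.nn.functional as F

def rank_gallery(query_feat, gallery_feats):
    """Sketch of step f): rank gallery (test-set) features by similarity to the query.

    query_feat: (d,) feature of the to-be-retrieved object
    gallery_feats: (N, d) features of the test-set object images
    """
    q = F.normalize(query_feat, dim=-1)
    g = F.normalize(gallery_feats, dim=-1)
    sims = g @ q                                   # cosine similarity per gallery image
    order = torch.argsort(sims, descending=True)   # most similar first
    return order, sims[order]
```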