US 12,386,883 B2
Fine-grained visual content search platform
Chun Ming Chan, Hong Kong (HK); Zheng Long Li, Hong Kong (HK); Yi Ping Tse, Hong Kong (HK); and Sung Ho Cheung, Hong Kong (HK)
Assigned to Hong Kong Applied Science and Technology Research Institute Company Limited, Hong Kong (HK)
Filed by Hong Kong Applied Science and Technology Research Institute Company Limited, Hong Kong (HK)
Filed on Feb. 22, 2023, as Appl. No. 18/172,356.
Claims priority of provisional application 63/330,311, filed on Apr. 12, 2022.
Prior Publication US 2023/0325434 A1, Oct. 12, 2023
Int. Cl. G06F 16/532 (2019.01); G06F 16/538 (2019.01); G06F 16/55 (2019.01); G06N 3/045 (2023.01); G06N 3/0464 (2023.01)
CPC G06F 16/532 (2019.01) [G06F 16/538 (2019.01); G06F 16/55 (2019.01); G06N 3/045 (2023.01); G06N 3/0464 (2023.01)] 8 Claims
OG exemplary drawing
 
1. A method for training an apparatus for multi-focus fine-grained (MFFG) image search and retrieval, wherein the apparatus comprises:
a feature extraction network executed by at least one processor configured to extract one or more basic query features of a query object from a query image;
a class learning module executed by at least one processor configured to generate one or more first specific query features from the basic query features, wherein the first specific query features represent an overall appearance of the query object;
a local description module executed by at least one processor configured to generate one or more second specific query features from the basic query features, wherein the second specific query features represent local details of the query object;
an image search engine executed by at least one processor configured to:
combine the first specific query features and the second specific query features to form one or more image-query joint features;
obtain one or more features of each of a plurality of gallery image objects belonging to a meta-category of the query object;
determine a cosine distance between the image-query joint features and the features of each of the gallery image objects;
sort the gallery image objects by the cosine distances from most similar to the query object to least similar to the query object, wherein the gallery image object having the shortest cosine distance between the image-query joint features and the features of the gallery image object is the most similar to the query object, and the gallery image object having the longest cosine distance between the image-query joint features and the features of the gallery image object is the least similar to the query object; and
output the N gallery images of the sorted gallery image objects that are most similar to the query object; and
an outline description module executed by at least one processor configured to generate one or more third specific query features from the basic query features, wherein the third specific query features represent an outline of the query object;
wherein the image search engine is further configured to combine the first specific query features, the second specific query features, and the third specific query features to form one or more image-query joint features of the query object;
wherein the method for training the apparatus comprises:
obtaining a training dataset comprising a plurality of original images each containing one of a plurality of sample objects belonging to one of a plurality of sub-categories belonging to a single meta-category;
generating a Region Confusion Mechanism (RCM) image for each of the original images by an augmentation module, wherein the RCM image is generated by separating the corresponding original image into a plurality of blocks followed by randomly reshuffling positions of the blocks and one or more of vertical flipping and horizontal flipping of the blocks;
extracting one or more sample features of the sample object from each of the original images and each of the RCM images by the feature extraction network;
iteratively training the class learning module, the local description module, and the outline description module with the sample features until classification models of all of the modules converge, comprising:
minimizing a first pair-wise loss of the class learning module;
minimizing a second pair-wise loss of the local description module; and
minimizing a third pair-wise loss of the outline description module.
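The ranking behavior recited for the image search engine (combining the first, second, and third specific query features into image-query joint features, determining cosine distances to the gallery image objects, sorting from most to least similar, and outputting the N most similar gallery images) can be illustrated with a minimal sketch. This is an illustrative reading, not the patented implementation: combining by concatenation is an assumption (the claim does not fix how the features are combined), and the names `joint_features` and `rank_gallery` are hypothetical.

```python
import numpy as np

def joint_features(class_feats, local_feats, outline_feats):
    # Combine the first, second, and third specific query features into one
    # image-query joint feature vector; concatenation is an assumed choice.
    return np.concatenate([class_feats, local_feats, outline_feats])

def rank_gallery(query_joint, gallery_feats, n):
    # Cosine distance = 1 - cosine similarity; the shortest distance marks
    # the gallery image object most similar to the query object.
    q = query_joint / np.linalg.norm(query_joint)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    distances = 1.0 - g @ q
    order = np.argsort(distances)          # most similar first
    return order[:n], distances[order[:n]]

# Illustrative usage with random stand-in features.
rng = np.random.default_rng(0)
query = joint_features(rng.normal(size=128), rng.normal(size=128),
                       rng.normal(size=128))
gallery = rng.normal(size=(1000, 384))     # features of 1000 gallery objects
top_idx, top_dist = rank_gallery(query, gallery, n=5)
```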
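The Region Confusion Mechanism step of the training method (separating an original image into blocks, randomly reshuffling the block positions, and vertically and/or horizontally flipping blocks) also admits a short sketch. The block grid size and flip probability below are illustrative assumptions; the claim fixes neither, and the image dimensions are assumed divisible by the grid size.

```python
import random
import numpy as np

def rcm_image(img: np.ndarray, blocks_per_side: int = 4,
              flip_prob: float = 0.5) -> np.ndarray:
    # img: H x W x C array, with H and W divisible by blocks_per_side.
    h, w = img.shape[:2]
    bh, bw = h // blocks_per_side, w // blocks_per_side
    blocks = [img[r*bh:(r+1)*bh, c*bw:(c+1)*bw].copy()
              for r in range(blocks_per_side) for c in range(blocks_per_side)]
    random.shuffle(blocks)                  # reshuffle block positions
    out = np.empty_like(img)
    for i, block in enumerate(blocks):
        if random.random() < flip_prob:
            block = block[::-1, :]          # vertical flip
        if random.random() < flip_prob:
            block = block[:, ::-1]          # horizontal flip
        r, c = divmod(i, blocks_per_side)
        out[r*bh:(r+1)*bh, c*bw:(c+1)*bw] = block
    return out
```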
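Finally, a hedged sketch of the iterative training step, which minimizes a pair-wise loss for each of the class learning, local description, and outline description modules over features extracted from original and RCM images. The claim specifies neither the loss function, the network architectures, nor the convergence test; the cosine embedding loss over original/RCM feature pairs and the linear stand-in backbone below are assumptions made only to keep the example self-contained and runnable (the cited classification codes suggest the actual feature extraction network is convolutional).

```python
import torch
import torch.nn as nn

feat_dim, spec_dim = 512, 128
# Stand-in feature extraction network; a CNN backbone would be used in practice.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, feat_dim))
# One branch per module: class learning, local description, outline description.
branches = {name: nn.Linear(feat_dim, spec_dim)
            for name in ("class", "local", "outline")}

params = (list(backbone.parameters())
          + [p for b in branches.values() for p in b.parameters()])
opt = torch.optim.Adam(params, lr=1e-3)
pair_loss = nn.CosineEmbeddingLoss()       # assumed pair-wise loss

def train_step(orig_batch, rcm_batch):
    # One iteration: extract sample features from original and RCM images,
    # then jointly minimize the first, second, and third pair-wise losses.
    f_orig, f_rcm = backbone(orig_batch), backbone(rcm_batch)
    target = torch.ones(orig_batch.size(0))  # positive pairs: same object
    loss = sum(pair_loss(b(f_orig), b(f_rcm), target)
               for b in branches.values())
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Illustrative usage; in practice rcm would come from rcm_image() above,
# and iteration would continue until all branch models converge.
orig = torch.randn(8, 3, 64, 64)
rcm = torch.randn(8, 3, 64, 64)
for _ in range(10):
    loss = train_step(orig, rcm)
```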