US 12,437,030 B2
Fine-grained classification of retail products
Avishek Kumar Shaw, Kolkata (IN); Shilpa Yadukumar Rao, Chennai (IN); Pranoy Hari, Chennai (IN); Dipti Prasad Mukherjee, Kolkata (IN); and Bikash Santra, Kolkata (IN)
Assigned to TATA CONSULTANCY SERVICES LIMITED, Mumbai (IN)
Filed by Tata Consultancy Services Limited, Mumbai (IN)
Filed on Oct. 5, 2021, as Appl. No. 17/450,066.
Claims priority of application No. 202021044605 (IN), filed on Oct. 13, 2020.
Prior Publication US 2022/0114403 A1, Apr. 14, 2022
Int. Cl. G06F 18/2411 (2023.01); G06F 18/22 (2023.01); G06F 18/23213 (2023.01); G06N 3/045 (2023.01); G06N 3/08 (2023.01)
CPC G06F 18/2411 (2023.01) [G06F 18/22 (2023.01); G06F 18/23213 (2023.01); G06N 3/045 (2023.01); G06N 3/08 (2013.01)] 11 Claims
 
1. A processor-implemented method for fine-grained classification of a product image from images of a plurality of similar-looking products comprising steps of:
receiving, via an input/output interface, at least one template image of each of the plurality of similar-looking products and the product image from a user;
pre-processing, via one or more hardware processors, the received at least one template image of each of the plurality of similar-looking products and the product image according to one or more predefined standards;
augmenting, via the one or more hardware processors, the pre-processed at least one template image of each of the plurality of similar-looking products based on a predefined photometric transformation and a geometric transformation;
training, via the one or more hardware processors, a reconstruction-classification network (RC-Net) and a stacked convolutional Long Short-Term Memory (conv-LSTM) network using the augmented at least one template image of each of the plurality of similar-looking products;
capturing, via the one or more hardware processors, an object-level information of the product image using the trained RC-Net, wherein the object-level information represents an underlying pattern of the product image;
estimating, via the one or more hardware processors, an object-level classification score of the product image using the trained RC-Net based on the captured object-level information;
identifying, via the one or more hardware processors, one or more key points on the product image using a predefined Binary Robust Invariant Scalable Keypoints (BRISK) model, wherein each of the one or more identified key points is represented by a predefined co-ordinate;
generating, via the one or more hardware processors, one or more part-proposals of the product image based on the identified one or more key points, wherein a feature for each of the part-proposals is extracted from a last convolution layer among convolution layers of an encoder of the trained RC-Net, and the features are derived by resizing the part-proposal into a size of a receptive field of the last convolution layer, thereby destroying a spatial relationship between neighboring pixels that indicates the product's part-level information, and the resized part-proposal is forward propagated through the convolution layers of the encoder in the trained RC-Net, and wherein the product image is untemplated during the part-proposal extraction in such a way that a dimension of the one or more part-proposals matches with that of a fixed-size receptive field;
clustering, via the one or more hardware processors, the generated one or more part-proposals of the product image into one or more clusters using a predefined K-means clustering model based on the predefined co-ordinates of the one or more key points;
extracting, via the one or more hardware processors, a feature vector from each of the one or more part-proposals of the product image using the trained RC-Net;
calculating, via the one or more hardware processors, a cosine similarity score between the extracted feature vector of one or more part-proposals in each of the one or more clusters using the trained RC-Net;
creating, via the one or more hardware processors, a symmetric matrix using the calculated cosine similarity score for each of the one or more clusters to determine a discriminative part-proposal from the one or more part-proposals in each of the one or more clusters;
sequencing, via the one or more hardware processors, the determined discriminative part-proposal in each of the one or more clusters based on the predefined co-ordinate, wherein in the sequencing of the determined discriminative part-proposal from the one or more part-proposals, the template image is also included as a last member of the sequence to relate parts with the template image;
estimating, via the one or more hardware processors, a part-level classification score of the sequenced discriminative part-proposal and the product image using the trained stacked conv-LSTM network;
combining, via the one or more hardware processors, the object-level and part-level classification score to get a final classification score of the product image, wherein
the final classification score l_F for the product image is obtained as:
l_F = l′ + γ l″
where l is a label vector of the product image,
l′ and l″ are predicted vectors,
γ ∈ [0, 1] is a boost factor; and
classifying, via the one or more hardware processors, a product image from the plurality of similar-looking products based on the final classification score.
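
The later steps of claim 1 (the cosine-similarity matrix, the choice of a discriminative part-proposal per cluster, the sequencing by co-ordinate, and the score combination) can be illustrated with a minimal Python sketch. It is not the patented implementation. All function names are hypothetical, and feature vectors are plain lists standing in for RC-Net encoder outputs. The claim does not state how the symmetric matrix yields the discriminative part-proposal, so the sketch assumes it is the proposal with the highest total similarity to the rest of its cluster. It likewise assumes a top-to-bottom, then left-to-right ordering for the sequencing step, and an arg-max over the final score for classification.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(features):
    """Symmetric matrix S with S[i][j] = cosine(features[i], features[j])."""
    n = len(features)
    S = [[1.0] * n for _ in range(n)]  # diagonal is set to 1.0
    for i in range(n):
        for j in range(i + 1, n):
            S[i][j] = S[j][i] = cosine(features[i], features[j])
    return S


def pick_discriminative(features):
    """Index of the proposal with the largest total similarity to the others.

    ASSUMPTION: the claim does not specify this selection rule.
    """
    S = similarity_matrix(features)
    totals = [sum(row) - row[i] for i, row in enumerate(S)]  # drop self-similarity
    return max(range(len(totals)), key=totals.__getitem__)


def sequence_parts(clusters):
    """Pick one (xy, feature) pair per cluster and order the picks.

    clusters: list of clusters, each a list of ((x, y), feature_vector).
    ASSUMPTION: ordering is by y first, then x (raster order).
    """
    chosen = []
    for members in clusters:
        idx = pick_discriminative([feat for _, feat in members])
        chosen.append(members[idx])
    return sorted(chosen, key=lambda m: (m[0][1], m[0][0]))


def fuse_scores(object_scores, part_scores, gamma=0.5):
    """Final score = object-level score + gamma * part-level score, per class."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return [o + gamma * p for o, p in zip(object_scores, part_scores)]


def classify(object_scores, part_scores, gamma=0.5):
    """Index of the class with the highest final score (ASSUMPTION: arg-max)."""
    final = fuse_scores(object_scores, part_scores, gamma)
    return max(range(len(final)), key=final.__getitem__)
```

In this sketch, `sequence_parts` would be called on the K-means clusters of part-proposals, and the caller would append the template image as the last member of the returned sequence before passing it to the conv-LSTM. `classify` then combines the two per-class score vectors, with `gamma` playing the role of the boost factor.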