CPC G06Q 30/0627 (2013.01) [G06F 16/9535 (2019.01); G06F 16/9577 (2019.01); G06F 18/214 (2023.01); G06F 18/22 (2023.01); G06F 40/284 (2020.01); G06F 40/30 (2020.01); G06N 3/04 (2013.01); G06N 3/08 (2013.01); G06Q 30/0631 (2013.01); G06Q 30/0643 (2013.01); G06V 30/19147 (2022.01); G06V 30/19173 (2022.01); G06V 30/274 (2022.01)]

20 Claims
15. A non-transitory computer-readable medium storing instructions which, when executed by a processing device, cause the processing device to perform operations comprising, iteratively:
computing a visual semantic embedding for a training image that has been categorized, the visual semantic embedding comprising feature vectors mapped within a type-specific feature space;
executing, on the visual semantic embedding, one or more sets of fully connected neural network (NN) layers and rectified linear unit (ReLU) layers to generate intermediate NN vector outputs of an NN regressor model;
executing a complete vector predictor on the intermediate NN vector outputs to predict values of a complete text vector corresponding to the training image, the complete text vector comprising a subset of characteristic terms of a text lexicon of item characteristics that are predicted as a group with one or more NN layers;
executing an individual term predictor on the intermediate NN vector outputs to separately predict individual term values using corresponding individual NN neurons, wherein the individual term values are separately related to respective characteristic terms of the text lexicon;
minimizing a loss function using the predicted values of the complete text vector and the predicted individual term values generated by the individual term predictor to generate a final subset of predicted term values that trains the NN regressor model to generate a string of the characteristic terms most associated with the training image;
receiving, via a communication interface, a multi-modal query from a browser of a client device, the multi-modal query comprising at least a first image of an item;
executing the trained NN regressor model on the first image to identify a plurality of second items that are one of similar to or compatible with the item depicted in the first image;
generating, by the processing device and using characteristic terms corresponding to the highest values of the final subset of predicted term values over multiple training iterations, structured text that explains, within one of a phrase or a sentence, why the plurality of second items are relevant to the item; and
returning, to the browser of the client device, a set of search results comprising a set of images corresponding to the plurality of second items, and the structured text.
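To make the training loop recited in the claim concrete, the following is a minimal sketch in PyTorch of how the claimed NN regressor could be structured: a fully connected/ReLU trunk producing the intermediate NN vector outputs, a complete vector predictor that scores every lexicon term together as a group, an individual term predictor with one dedicated neuron per term, and a loss minimized jointly over both predictions. All identifiers (`TermRegressor`, `trunk`, `lexicon_size`) and the choices of MSE loss and the Adam optimizer are illustrative assumptions, not elements disclosed in the patent.

```python
import torch
import torch.nn as nn

class TermRegressor(nn.Module):
    """NN regressor run on a precomputed visual semantic embedding."""

    def __init__(self, embed_dim: int, hidden_dim: int, lexicon_size: int):
        super().__init__()
        # One or more sets of fully connected + ReLU layers; their output
        # plays the role of the claim's "intermediate NN vector outputs".
        self.trunk = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Complete vector predictor: predicts values for every characteristic
        # term of the text lexicon together, as a single group.
        self.complete_head = nn.Linear(hidden_dim, lexicon_size)
        # Individual term predictor: one dedicated output neuron per term,
        # so each term value is predicted separately.
        self.term_heads = nn.ModuleList(
            [nn.Linear(hidden_dim, 1) for _ in range(lexicon_size)]
        )

    def forward(self, embedding: torch.Tensor):
        h = self.trunk(embedding)                       # intermediate outputs
        complete = self.complete_head(h)                # (batch, lexicon_size)
        individual = torch.cat([head(h) for head in self.term_heads], dim=-1)
        return complete, individual

model = TermRegressor(embed_dim=512, hidden_dim=256, lexicon_size=1000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
mse = nn.MSELoss()

def training_step(embedding: torch.Tensor, target_terms: torch.Tensor) -> float:
    """One training iteration: minimize a joint loss over both predictors."""
    complete, individual = model(embedding)
    loss = mse(complete, target_terms) + mse(individual, target_terms)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```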
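The query-handling limitations (receiving the multi-modal query, executing the trained model, and returning images plus structured text) could be sketched as follows, reusing `model` from the sketch above. Here `embed_image`, `catalog_embeddings`, the averaging of the two prediction heads, and the explanation template are hypothetical stand-ins: the claim itself does not specify how the second items are retrieved or how the phrase is composed.

```python
import torch
import torch.nn.functional as F

def embed_image(image) -> torch.Tensor:
    """Stand-in for the upstream visual-semantic-embedding step."""
    return torch.randn(1, 512)  # placeholder type-specific feature vector

# Precomputed embeddings for candidate "second items" (hypothetical catalog).
catalog_embeddings = torch.randn(10_000, 512)

def handle_query(first_image, lexicon: list[str],
                 top_terms: int = 3, top_items: int = 5) -> dict:
    embedding = embed_image(first_image)
    with torch.no_grad():
        complete, individual = model(embedding)      # trained model from above
        scores = (complete + individual) / 2         # combined term values
    term_idx = torch.topk(scores.squeeze(0), k=top_terms).indices.tolist()
    terms = [lexicon[i] for i in term_idx]           # highest-valued terms
    # Nearest-neighbor lookup standing in for "similar or compatible" items.
    sims = F.cosine_similarity(embedding, catalog_embeddings)
    item_ids = torch.topk(sims, k=top_items).indices.tolist()
    # Structured text explaining, in one sentence, why the results are relevant.
    text = "These items are relevant because they share: " + ", ".join(terms) + "."
    return {"item_ids": item_ids, "structured_text": text}
```

In practice the catalog lookup and the term-to-phrase templating would be replaced by whatever retrieval index and text generator the deployed system actually uses; the sketch only illustrates the claimed data flow from query image to images-plus-explanation.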