CPC G06Q 30/0627 (2013.01) [G06F 16/9535 (2019.01); G06F 16/9577 (2019.01); G06F 18/214 (2023.01); G06F 18/22 (2023.01); G06F 40/284 (2020.01); G06F 40/30 (2020.01); G06N 3/04 (2013.01); G06N 3/08 (2013.01); G06Q 30/0631 (2013.01); G06Q 30/0643 (2013.01); G06V 30/19147 (2022.01); G06V 30/19173 (2022.01); G06V 30/274 (2022.01)]

20 Claims
15. A non-transitory computer-readable medium storing instructions which, when executed by a processing device, cause the processing device to perform operations comprising, iteratively:
computing a visual semantic embedding for a training image that has been categorized, the visual semantic embedding comprising feature vectors mapped within a type-specific feature space;
executing, on the visual semantic embedding, one or more sets of fully connected neural network (NN) layers and rectified linear unit (ReLU) layers to generate intermediate NN vector outputs of an NN regressor model;
executing a complete vector predictor on the intermediate NN vector outputs to predict values of a complete text vector corresponding to the training image, the complete text vector comprising a subset of characteristic terms of a text lexicon of item characteristics that are predicted as a group with one or more NN layers;
executing an individual term predictor on the intermediate NN vector outputs to separately predict individual term values using corresponding individual NN neurons, wherein the individual term values are separately related to respective characteristic terms of the text lexicon;
minimizing a loss function using the predicted values of the complete text vector and the predicted individual term values generated by the individual term predictor to generate a final subset of predicted term values that trains the NN regressor model to generate a string of the characteristic terms most associated with the training image;
receiving, via a communication interface, a multi-modal query from a browser of a client device, the multi-modal query comprising at least a first image of an item;
executing the trained NN regressor model on the first image to identify a plurality of second items that are one of similar to or compatible with the item depicted in the first image;
generating, by the processing device and using characteristic terms corresponding to the highest values of the final subset of predicted term values over multiple training iterations, structured text that explains, within one of a phrase or a sentence, why the plurality of second items are relevant to the item; and
returning, to the browser of the client device, a set of search results comprising a set of images corresponding to the plurality of second items, and the structured text.
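To make the training loop recited in the claim concrete, the following is a minimal sketch in PyTorch of how the claimed NN regressor could be structured: a fully connected/ReLU trunk producing the intermediate NN vector outputs, a complete vector predictor that scores every lexicon term together as a group, an individual term predictor with one dedicated neuron per term, and a loss minimized jointly over both predictions. All identifiers (`TermRegressor`, `trunk`, `lexicon_size`) and the choices of MSE loss and the Adam optimizer are illustrative assumptions, not elements disclosed in the patent.

```python
import torch
import torch.nn as nn

class TermRegressor(nn.Module):
    """NN regressor run on a precomputed visual semantic embedding."""

    def __init__(self, embed_dim: int, hidden_dim: int, lexicon_size: int):
        super().__init__()
        # One or more sets of fully connected + ReLU layers; their output
        # plays the role of the claim's "intermediate NN vector outputs".
        self.trunk = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Complete vector predictor: predicts values for every characteristic
        # term of the text lexicon together, as a single group.
        self.complete_head = nn.Linear(hidden_dim, lexicon_size)
        # Individual term predictor: one dedicated output neuron per term,
        # so each term value is predicted separately.
        self.term_heads = nn.ModuleList(
            [nn.Linear(hidden_dim, 1) for _ in range(lexicon_size)]
        )

    def forward(self, embedding: torch.Tensor):
        h = self.trunk(embedding)                       # intermediate outputs
        complete = self.complete_head(h)                # (batch, lexicon_size)
        individual = torch.cat([head(h) for head in self.term_heads], dim=-1)
        return complete, individual

model = TermRegressor(embed_dim=512, hidden_dim=256, lexicon_size=1000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
mse = nn.MSELoss()

def training_step(embedding: torch.Tensor, target_terms: torch.Tensor) -> float:
    """One training iteration: minimize a joint loss over both predictors."""
    complete, individual = model(embedding)
    loss = mse(complete, target_terms) + mse(individual, target_terms)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```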
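The query-handling limitations (receiving the multi-modal query, executing the trained model, and returning images plus structured text) could be sketched as follows, reusing `model` from the sketch above. Here `embed_image`, `catalog_embeddings`, the averaging of the two prediction heads, and the explanation template are hypothetical stand-ins: the claim itself does not specify how the second items are retrieved or how the phrase is composed.

```python
import torch
import torch.nn.functional as F

def embed_image(image) -> torch.Tensor:
    """Stand-in for the upstream visual-semantic-embedding step."""
    return torch.randn(1, 512)  # placeholder type-specific feature vector

# Precomputed embeddings for candidate "second items" (hypothetical catalog).
catalog_embeddings = torch.randn(10_000, 512)

def handle_query(first_image, lexicon: list[str],
                 top_terms: int = 3, top_items: int = 5) -> dict:
    embedding = embed_image(first_image)
    with torch.no_grad():
        complete, individual = model(embedding)      # trained model from above
        scores = (complete + individual) / 2         # combined term values
    term_idx = torch.topk(scores.squeeze(0), k=top_terms).indices.tolist()
    terms = [lexicon[i] for i in term_idx]           # highest-valued terms
    # Nearest-neighbor lookup standing in for "similar or compatible" items.
    sims = F.cosine_similarity(embedding, catalog_embeddings)
    item_ids = torch.topk(sims, k=top_items).indices.tolist()
    # Structured text explaining, in one sentence, why the results are relevant.
    text = "These items are relevant because they share: " + ", ".join(terms) + "."
    return {"item_ids": item_ids, "structured_text": text}
```

In practice the catalog lookup and the term-to-phrase templating would be replaced by whatever retrieval index and text generator the deployed system actually uses; the sketch only illustrates the claimed data flow from query image to images-plus-explanation.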