US 12,008,331 B2
Utilizing visual and textual aspects of images with recommendation systems
Xun Luan, Sunnyvale, CA (US); Aman Gupta, San Jose, CA (US); Sirjan Kafle, San Diego, CA (US); Ananth Sankar, Palo Alto, CA (US); Di Wen, Sunnyvale, CA (US); Saurabh Kataria, Newark, CA (US); Ying Xuan, Sunnyvale, CA (US); Sakshi Verma, Haryana (IN); Bharat Kumar Jain, Hyderabad (IN); Xue Xia, Los Angeles, CA (US); Bhargavkumar Kanubhai Patel, Gujarat (IN); Vipin Gupta, Bangalore (IN); and Nikita Gupta, Delhi (IN)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on Dec. 23, 2021, as Appl. No. 17/560,436.
Prior Publication US 2023/0206010 A1, Jun. 29, 2023
Int. Cl. G10L 15/22 (2006.01); G06F 40/40 (2020.01); G06N 3/04 (2023.01); G06V 30/19 (2022.01)

CPC G06F 40/40 (2020.01) [G06N 3/04 (2013.01); G06V 30/19147 (2022.01)]

20 Claims

8. A system comprising:

a memory storage device for storing computer-executable instructions; and

at least one processor, which, when executing the computer-executable instructions, causes the system to:

with a machine learning algorithm, train an encoder-decoder model to generate a caption for an image by generating with the encoder an embedding, and then decoding the embedding with the decoder to generate the caption, wherein a dataset comprising a plurality of images with associated captions is used to train the encoder-decoder model;

generate a first embedding for the image by:

detecting words present in an image with an optical character recognition (OCR) algorithm;

using pre-trained word embeddings to derive a word embedding for each word detected in the image; and

performing an average pooling operation on the word embeddings for each word detected in the image, wherein the result of the average pooling operation is the first embedding for the image;

generate a second embedding for the image by:

using the image as input to the pre-trained encoder-decoder model, generate from the image with the encoder the second embedding for the image;

concatenate the first embedding with the second embedding to derive for the image a final embedding; and

store the final embedding for the image.
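
The embedding steps recited in claim 8 can be illustrated with a minimal sketch. It is not the patented implementation. The OCR step, the pre-trained word embeddings, and the trained encoder are replaced by hypothetical stand-ins (a hard-coded word list, a toy lookup table named TOY_WORD_VECTORS, and a fixed vector), and every function name below is an assumption made for illustration. Skipping words that have no embedding, and returning a zero vector when no words are found, are also assumptions; the claim does not address either case.

```python
from typing import Dict, List, Sequence

Vector = List[float]

# Hypothetical toy table standing in for pre-trained word embeddings.
TOY_WORD_VECTORS: Dict[str, Vector] = {
    "sale": [1.0, 0.0, 2.0],
    "today": [3.0, 2.0, 0.0],
}


def average_pool(vectors: Sequence[Vector], dim: int) -> Vector:
    """Element-wise mean of the vectors; a zero vector if none are given."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def text_embedding(ocr_words: Sequence[str],
                   table: Dict[str, Vector],
                   dim: int) -> Vector:
    """First embedding: average-pool the embeddings of the detected words."""
    # Assumption: words missing from the table are skipped.
    vectors = [table[w.lower()] for w in ocr_words if w.lower() in table]
    return average_pool(vectors, dim)


def final_embedding(first: Vector, second: Vector) -> Vector:
    """Concatenate the text-derived and encoder-derived embeddings."""
    return list(first) + list(second)


# Words as an OCR step might return them (hypothetical input).
ocr_words = ["SALE", "today", "unknownword"]

# Stand-in for the embedding produced by the encoder of the captioning model.
second = [0.5, -0.5]

first = text_embedding(ocr_words, TOY_WORD_VECTORS, dim=3)
final = final_embedding(first, second)

# In-memory stand-in for storing the final embedding, keyed by an image id.
store: Dict[str, Vector] = {"image-001": final}
```

In this sketch the final embedding has as many dimensions as the two parts combined (3 + 2 = 5), so a downstream consumer such as a recommendation model sees the textual and visual signals side by side in one vector.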