US 11,720,651 B2
Text-conditioned image search based on transformation, aggregation, and composition of visio-linguistic features
Pinkesh Badjatiya, Ujain (IN); Surgan Jandial, Jammu (IN); Pranit Chawla, Delhi (IN); Mausoom Sarkar, New Delhi (IN); and Ayush Chopra, Cambridge, MA (US)
Assigned to Adobe Inc., San Jose, CA (US)
Filed by Adobe Inc., San Jose, CA (US)
Filed on Jan. 28, 2021, as Appl. No. 17/160,893.
Prior Publication US 2022/0245391 A1, Aug. 4, 2022
Int. Cl. G06F 18/25 (2023.01); G06N 3/04 (2023.01); G06F 16/583 (2019.01); G06F 16/532 (2019.01); G06F 16/538 (2019.01); G06F 18/214 (2023.01)
CPC G06F 18/253 (2023.01) [G06F 16/532 (2019.01); G06F 16/538 (2019.01); G06F 16/5846 (2019.01); G06F 18/214 (2023.01); G06F 18/251 (2023.01); G06N 3/04 (2013.01)] 20 Claims
OG exemplary drawing
 
8. A system for performing an image search, the system comprising:
one or more non-transitory computer readable media;
one or more processors configured to receive a source image and a text query defining a target image attribute, wherein the source image includes visual features and the text query includes textual features;
a first neural network (NN) trained to decompose the source image into a first visual feature vector associated with a first level of granularity, and a second visual feature vector associated with a second level of granularity;
a second NN trained to decompose the text query into a first text feature vector associated with the first level of granularity, a second text feature vector associated with the second level of granularity, and a global text feature vector, wherein the global text feature vector spans multiple levels of granularity;
a semantic feature transformation module, encoded on the one or more non-transitory computer readable media, configured to generate a first image-text embedding based on the first visual feature vector and the first text feature vector, and a second image-text embedding based on the second visual feature vector and the second text feature vector, wherein the first and second image-text embeddings each encode information from the visual features and the textual features;
a visio-linguistic composition module, encoded on the one or more non-transitory computer readable media, configured to compose a visio-linguistic representation based on a hierarchical aggregation of the first image-text embedding with the second image-text embedding, wherein the visio-linguistic representation encodes a combination of visual and textual information at multiple levels of granularity; and
a selection module, encoded on the one or more non-transitory computer readable media, configured to identify a target image based on the visio-linguistic representation and the global text feature vector, so that the target image relates to the target image attribute, the target image to be provided as a result of the image search.
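
The claim recites two decomposition networks but does not fix their architectures. The sketch below is one illustrative realization in PyTorch, assuming a ResNet-18 image backbone whose intermediate and final blocks stand in for the two levels of granularity and a GRU text encoder; the class names, layer choices, and embedding dimension are hypothetical, not the patented design.

```python
# Hypothetical sketch of the two decomposition networks recited in the claim,
# assuming a ResNet-18 image backbone and a GRU text encoder; the claim itself
# does not specify any particular architecture, granularity, or embedding size.
import torch
import torch.nn as nn
import torchvision.models as tvm


class ImageDecomposer(nn.Module):
    """First NN: decomposes the source image into visual feature vectors at two
    levels of granularity (here: a mid-level conv block vs. the final block)."""

    def __init__(self, dim: int = 512):
        super().__init__()
        backbone = tvm.resnet18(weights=None)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                                  backbone.maxpool, backbone.layer1, backbone.layer2)
        self.deep = nn.Sequential(backbone.layer3, backbone.layer4)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fine_proj = nn.Linear(128, dim)    # layer2 of ResNet-18 has 128 channels
        self.coarse_proj = nn.Linear(512, dim)  # layer4 has 512 channels

    def forward(self, image):                   # image: (B, 3, H, W)
        mid = self.stem(image)                  # fine-granularity feature map
        deep = self.deep(mid)                   # coarse-granularity feature map
        v_fine = self.fine_proj(self.pool(mid).flatten(1))
        v_coarse = self.coarse_proj(self.pool(deep).flatten(1))
        return v_fine, v_coarse                 # first / second visual feature vectors


class TextDecomposer(nn.Module):
    """Second NN: decomposes the text query into per-granularity text feature
    vectors plus a global text feature vector spanning both granularities."""

    def __init__(self, vocab_size: int = 10000, dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.fine_head = nn.Linear(dim, dim)
        self.coarse_head = nn.Linear(dim, dim)
        self.global_head = nn.Linear(dim, dim)

    def forward(self, tokens):                  # tokens: (B, T) int64
        states, last = self.gru(self.embed(tokens))
        mean_state = states.mean(dim=1)         # word-level summary (fine granularity)
        t_fine = self.fine_head(mean_state)
        t_coarse = self.coarse_head(last.squeeze(0))
        t_global = self.global_head(mean_state + last.squeeze(0))
        return t_fine, t_coarse, t_global
```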
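For the semantic feature transformation and visio-linguistic composition modules, one plausible reading is a gated residual fusion applied per granularity level, followed by a learned weighted aggregation across levels. The following sketch assumes those choices; the gating block, residual MLP, and softmax level weights are illustrative stand-ins rather than the claimed mechanism.

```python
# A minimal sketch of the transformation and composition modules, assuming the
# per-level image-text fusion is a gated residual block and the hierarchical
# aggregation is a learned weighted sum; both choices are assumptions made for
# illustration only.
import torch
import torch.nn as nn


class SemanticFeatureTransform(nn.Module):
    """Fuses one visual feature vector with the text feature vector of the same
    granularity into a single image-text embedding."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.residual = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                      nn.Linear(dim, dim))

    def forward(self, visual, text):            # visual, text: (B, dim)
        joint = torch.cat([visual, text], dim=-1)
        return self.gate(joint) * visual + self.residual(joint)


class VisioLinguisticComposition(nn.Module):
    """Hierarchically aggregates the per-granularity image-text embeddings into
    one visio-linguistic representation."""

    def __init__(self, dim: int = 512, levels: int = 2):
        super().__init__()
        self.level_weights = nn.Parameter(torch.ones(levels))
        self.merge = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                   nn.Linear(dim, dim))

    def forward(self, embeddings):              # list of (B, dim), one per level
        weights = torch.softmax(self.level_weights, dim=0)
        stacked = torch.stack(embeddings, dim=0)           # (levels, B, dim)
        aggregated = (weights[:, None, None] * stacked).sum(dim=0)
        return self.merge(aggregated)            # visio-linguistic representation
```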
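The selection module could, for instance, rank a gallery of pre-encoded candidate images by cosine similarity to the composed query. The helper below, select_target_image, is a hypothetical example; combining the visio-linguistic representation with the global text feature vector by simple addition, and scoring by cosine similarity, are assumptions not mandated by the claim.

```python
# One possible scoring scheme for the selection module, assuming candidate
# images are pre-encoded into the same embedding space and relevance is
# measured by cosine similarity to the composed query.
import torch
import torch.nn.functional as F


def select_target_image(composed_query: torch.Tensor,
                        global_text: torch.Tensor,
                        gallery: torch.Tensor,
                        top_k: int = 5) -> torch.Tensor:
    """composed_query: (dim,) visio-linguistic representation of the query.
    global_text: (dim,) global text feature vector.
    gallery: (N, dim) embeddings of candidate target images.
    Returns the indices of the top-k candidate images."""
    # Combining the two query-side signals by addition is an assumption here.
    query = F.normalize(composed_query + global_text, dim=-1)
    candidates = F.normalize(gallery, dim=-1)
    scores = candidates @ query                 # cosine similarities, shape (N,)
    return scores.topk(top_k).indices


# Example usage with random embeddings standing in for real encodings.
if __name__ == "__main__":
    dim, n_gallery = 512, 1000
    idx = select_target_image(torch.randn(dim), torch.randn(dim),
                              torch.randn(n_gallery, dim))
    print("top matches:", idx.tolist())
```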