US 12,481,705 B1
Natural language selection of objects in image data
Ahmet Emre Barut, Boston, MA (US); Chengwei Su, Belmont, MA (US); Weitong Ruan, Revere, MA (US); and Wael Hamza, Yorktown Heights, NY (US)
Assigned to AMAZON TECHNOLOGIES, INC., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Jun. 17, 2024, as Appl. No. 18/745,530.
Application 18/745,530 is a continuation of application No. 17/031,062, filed on Sep. 24, 2020, granted, now Pat. No. 12,045,288.
Int. Cl. G06F 16/00 (2019.01); G06F 16/532 (2019.01); G06F 16/583 (2019.01); G06F 16/9032 (2019.01); G06V 20/20 (2022.01); G06N 20/00 (2019.01)
CPC G06F 16/90332 (2019.01) [G06F 16/532 (2019.01); G06F 16/583 (2019.01); G06V 20/20 (2022.01); G06N 20/00 (2019.01)] 20 Claims
OG exemplary drawing
 
1. A computer-implemented method comprising:
receiving input query data comprising a word token;
receiving first image data;
generating first embedding data comprising the word token and a corresponding token based on the input query data;
identifying, by an object detector model, a first image region representing at least a first portion of the first image data, and a second image region representing at least a second portion of the first image data;
generating second embedding data comprising the first image region;
generating cluster centroid data representing the first image region;
generating a spatial attention score by applying a first activation function to a product of a representation of the first image region and the cluster centroid data;
generating a spatial attention map by multiplying the spatial attention score and the representation of the first image region;
generating a down-sampled spatial attention map by down-sampling the spatial attention map;
calculating a channel attention score by applying a second activation function to a product of the down-sampled spatial attention map and a channel attention weight;
generating a channel attention map by multiplying the channel attention score and the representation of the first image region;
generating third embedding data comprising the channel attention map;
generating fourth embedding data comprising a representation of the second image region;
storing the first embedding data, the second embedding data, the third embedding data, and the fourth embedding data in at least one non-transitory computer readable memory;
inputting the first embedding data, the second embedding data, the third embedding data, and the fourth embedding data into a multi-modal natural language understanding model to determine an output score quantifying how the input query data relates to the first image region;
determining derived query data representing the first image region, wherein the derived query data comprises a search engine query;
inputting the derived query data into a search interface;
receiving a first search result from the search interface; and
outputting the first search result in response to the input query data.
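
The attention computation recited in claim 1 (a spatial attention score from the region representation and the cluster centroid, a down-sampled spatial attention map, then a channel attention score and map) can be illustrated with a minimal NumPy sketch. This is not the patented implementation: the array shapes, the sigmoid activations, the global-average-pooling down-sampling, and the names region, centroid, and channel_weight are all illustrative assumptions, since the claim specifies only generic activation functions and an unspecified down-sampling step.

import numpy as np

def sigmoid(x):
    # Assumed activation; the claim recites only "a first/second activation function".
    return 1.0 / (1.0 + np.exp(-x))

# Representation of the first image region: C channels over an H x W grid (assumed shapes).
C, H, W = 8, 4, 4
region = np.random.rand(C, H, W)        # "representation of the first image region"
centroid = np.random.rand(C)            # "cluster centroid data" for the region
channel_weight = np.random.rand(C, C)   # "channel attention weight" (assumed C x C)

# Spatial attention score: activation of the product of the region representation and the centroid.
spatial_score = sigmoid(np.einsum("chw,c->hw", region, centroid))   # H x W

# Spatial attention map: the score multiplied by the region representation.
spatial_map = spatial_score[None, :, :] * region                    # C x H x W

# Down-sampled spatial attention map (here: global average pooling, an assumption).
down_sampled = spatial_map.mean(axis=(1, 2))                        # C

# Channel attention score: activation of the product of the down-sampled map and the channel weight.
channel_score = sigmoid(channel_weight @ down_sampled)              # C

# Channel attention map: the score multiplied by the region representation.
channel_map = channel_score[:, None, None] * region                 # C x H x W

print(spatial_score.shape, down_sampled.shape, channel_map.shape)

In the claimed method, an embedding of the resulting channel attention map (the "third embedding data") is combined with the query-token, region, and second-region embeddings as input to the multi-modal natural language understanding model that scores how the query relates to the first image region.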