| CPC G06F 16/90332 (2019.01) [G06F 16/532 (2019.01); G06F 16/583 (2019.01); G06V 20/20 (2022.01); G06N 20/00 (2019.01)] | 20 Claims |

|
1. A computer-implemented method comprising:
receiving input query data comprising a word token;
receiving first image data;
generating first embedding data comprising the word token and a corresponding token based on the input query data;
identifying, by an object detector model, a first image region representing at least a first portion of the first image data, and a second image region representing at least a second portion of the first image data;
generating second embedding data comprising the first image region;
generating cluster centroid data representing the first image region;
generating a spatial attention score by applying a first activation function to a product of a representation of the first image region and the cluster centroid data;
generating a spatial attention map by multiplying the spatial attention score and the representation of the first image region;
generating a down-sampled spatial attention map by down-sampling the spatial attention map;
calculating a channel attention score by applying a second activation function to a product of the down-sampled spatial attention map and a channel attention weight;
generating a channel attention map by multiplying the channel attention score and the representation of the first image region;
generating third embedding data comprising the channel attention map;
generating fourth embedding data comprising a representation of the second image region;
storing the first embedding data, the second embedding data, the third embedding data, and the fourth embedding data in at least one non-transitory computer readable memory;
inputting the first embedding data, the second embedding data, the third embedding data, and the fourth embedding data into a multi-modal natural language understanding model to determine an output score quantifying how the input query data relates to the first image region;
determining derived query data representing the first image region, wherein the derived query data comprises a search engine query;
inputting the derived query data into a search interface;
receiving a first search result from the search interface; and
outputting the first search result in response to the input query data.
|
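
The spatial and channel attention steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation, and everything the claim leaves open is an assumption here: the region representation is taken to be an N x C grid of feature vectors, both activation functions are taken to be a sigmoid, down-sampling is taken to be an average over positions, and the channel attention weight is taken to be a C x C matrix. The function and parameter names (`spatial_channel_attention`, `region`, `centroid`, `channel_weight`) are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function, assumed here for both activation functions."""
    return 1.0 / (1.0 + math.exp(-x))


def spatial_channel_attention(region, centroid, channel_weight):
    """Sketch of the attention steps in claim 1 (assumed shapes).

    region:         N x C list of feature vectors (image region representation)
    centroid:       length-C vector (cluster centroid data)
    channel_weight: C x C matrix (channel attention weight)
    """
    # Spatial attention score: activation of (region feature . centroid),
    # one score per spatial position.
    spatial_scores = [
        sigmoid(sum(f * c for f, c in zip(row, centroid))) for row in region
    ]

    # Spatial attention map: each score multiplied by its region feature.
    spatial_map = [[s * f for f in row] for s, row in zip(spatial_scores, region)]

    # Down-sampled spatial attention map: average over positions (assumption),
    # leaving one value per channel.
    n = len(spatial_map)
    pooled = [sum(col) / n for col in zip(*spatial_map)]

    # Channel attention score: activation of (channel weight x pooled map).
    channel_scores = [
        sigmoid(sum(w * p for w, p in zip(w_row, pooled)))
        for w_row in channel_weight
    ]

    # Channel attention map: channel scores multiplied by the region features.
    channel_map = [
        [a * f for a, f in zip(channel_scores, row)] for row in region
    ]
    return spatial_scores, channel_scores, channel_map


if __name__ == "__main__":
    # Toy region with three positions and two channels (made-up values).
    region = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    # Stand-in centroid: the mean feature vector, not a real clustering result.
    centroid = [sum(col) / len(region) for col in zip(*region)]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(spatial_channel_attention(region, centroid, identity))
```

In this sketch the channel attention map has the same shape as the region representation, so it could be flattened into the "third embedding data" of the claim. How that embedding is actually constructed is not specified in the claim text.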