| CPC G06V 30/413 (2022.01) [G06V 10/454 (2022.01); G06V 10/76 (2022.01)] | 17 Claims |

|
1. A method for building a machine learning model for finding visual targets from text queries, the method comprising the steps of:
receiving a set of training data comprising text attribute labelled images, wherein each image has more than one text attribute label;
receiving a first vector space comprising a mapping of words, the mapping defining relationships between words;
generating a visual feature vector space by grouping images of the set of training data having similar attribute labels;
mapping each attribute label within the training data set on to the first vector space to form a second vector space;
fusing the visual feature vector space and the second vector space to form a third vector space;
generating a similarity matching model from the third vector space; and
obtaining a global textual embedding, $z^{glo}$, according to:
![]() where $w_1$ and $w_2$ are learnable parameters and Tanh is a non-linear activation function of a neuron in a Convolutional Neural Network, CNN,
wherein mapping each attribute label within the training data set on to the first vector space to form a second vector space further comprises embedding each attribute label, $z_i^{loc}$, $i \in \{1, \ldots, N_{att}\}$.
|
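The steps of claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation. The formula for the global textual embedding is not legible in this text, so the sketch assumes a two-layer form, tanh(w2 · tanh(w1 · [z_1^loc, ..., z_Natt^loc])), applied to the concatenated local embeddings. The toy word vectors, the element-wise product used for fusion, the cosine score, and all function names are hypothetical choices made only for illustration.

```python
import math
import random


def tanh_layer(weights, vec):
    """Return tanh(W @ vec), with W given as a list of rows."""
    return [math.tanh(sum(w * x for w, x in zip(row, vec))) for row in weights]


def global_text_embedding(local_embeddings, w1, w2):
    """Assumed form: z_glo = tanh(w2 @ tanh(w1 @ concat(z_loc_1..z_loc_Natt)))."""
    concat = [x for z in local_embeddings for x in z]
    return tanh_layer(w2, tanh_layer(w1, concat))


def fuse(visual, textual):
    """Hypothetical fusion: element-wise product of equal-length vectors."""
    return [v * t for v, t in zip(visual, textual)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def random_matrix(rows, cols, rng):
    """Stand-in for learnable parameters (randomly initialised, untrained)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# First vector space: a toy word-to-vector mapping (made-up values).
word_vectors = {
    "red": [1.0, 0.0, 0.0],
    "jacket": [0.0, 1.0, 0.0],
    "backpack": [0.0, 0.0, 1.0],
}

rng = random.Random(0)
labels = ["red", "jacket"]                 # attribute labels of one image
z_loc = [word_vectors[l] for l in labels]  # local embeddings z_i^loc

w1 = random_matrix(4, len(labels) * 3, rng)  # maps the concatenation to 4 dims
w2 = random_matrix(4, 4, rng)
z_glo = global_text_embedding(z_loc, w1, w2)

visual_feature = [0.2, -0.1, 0.4, 0.3]     # made-up visual feature vector
joint = fuse(visual_feature, z_glo)        # a point in the fused (third) space
score = cosine(visual_feature, z_glo)      # toy similarity score
```

In a trained system, `w1` and `w2` would be learned jointly with the visual branch rather than drawn at random, and the similarity matching model would be fitted on the fused vectors rather than computed as a fixed cosine score.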