US 12,322,198 B2
Text based image search
Shaogang Gong, Pinner (GB); Qi Dong, London (GB); and Xiatian Zhu, Cambridge (GB)
Assigned to VERITONE, INC., Denver, CO (US)
Appl. No. 17/635,108
Filed by Veritone, Inc., Denver, CO (US)
PCT Filed Aug. 5, 2020, PCT No. PCT/GB2020/051872
§ 371(c)(1), (2) Date Feb. 14, 2022,
PCT Pub. No. WO2021/028656, PCT Pub. Date Feb. 18, 2021.
Claims priority of application No. 1911724 (GB), filed on Aug. 15, 2019.
Prior Publication US 2022/0343626 A1, Oct. 27, 2022
Int. Cl. G06V 30/413 (2022.01); G06N 3/045 (2023.01); G06V 10/44 (2022.01); G06V 10/75 (2022.01)
CPC G06V 30/413 (2022.01) [G06V 10/454 (2022.01); G06V 10/76 (2022.01)]
17 Claims
1. A method for building a machine learning model for finding visual targets from text queries, the method comprising the steps of:
receiving a set of training data comprising text attribute labelled images, wherein each image has more than one text attribute label;
receiving a first vector space comprising a mapping of words, the mapping defining relationships between words;
generating a visual feature vector space by grouping images of the set of training data having similar attribute labels;
mapping each attribute label within the training data set on to the first vector space to form a second vector space;
fusing the visual feature vector space and the second vector space to form a third vector space;
generating a similarity matching model from the third vector space; and
obtaining a global textual embedding, z^glo, according to:
[equation not reproduced in the source text]
where w_1 and w_2 are learnable parameters and Tanh is a non-linear activation function of a neuron in a Convolutional Neural Network, CNN,
wherein mapping each attribute label within the training data set on to the first vector space to form a second vector space further comprises embedding each attribute label, z_i^loc, i ∈ {1, ..., N_att}.
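
The text side of the claimed method can be illustrated with a minimal sketch, under several stated assumptions. The word vectors, gallery features, and all function names below are hypothetical. The pooling step is a plain mean used as a stand-in: the claim's formula for z^glo (with learnable w_1, w_2 and Tanh) is not legible in this text, so it is not implemented here. The sketch also assumes the visual features have already been projected into the same space as the word vectors, which replaces the claimed fusing step and learned similarity matching model with a simple cosine score.

```python
import math

# Hypothetical word-vector space, standing in for the claim's "first vector space".
WORD_VECTORS = {
    "red":    [1.0, 0.0, 0.0],
    "blue":   [0.0, 1.0, 0.0],
    "jacket": [0.0, 0.0, 1.0],
}

# Hypothetical visual features, ASSUMED to be already projected into the word space.
GALLERY = {
    "img_a": [0.9, 0.1, 0.8],  # toy feature resembling "red jacket"
    "img_b": [0.1, 0.9, 0.7],  # toy feature resembling "blue jacket"
}


def embed_labels(labels):
    """Map each attribute label onto the word space: one local embedding per label."""
    return [WORD_VECTORS[label] for label in labels]


def mean_pool(local_embeddings):
    """Stand-in aggregation (plain mean). This is NOT the claimed z^glo formula."""
    n = len(local_embeddings)
    return [sum(column) / n for column in zip(*local_embeddings)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_gallery(query_labels, gallery):
    """Return gallery image names ordered from most to least similar to the query."""
    query = mean_pool(embed_labels(query_labels))
    scored = [(cosine(query, feature), name) for name, feature in gallery.items()]
    return [name for _, name in sorted(scored, reverse=True)]


ranking = rank_gallery(["red", "jacket"], GALLERY)
```

Calling `rank_gallery(["blue", "jacket"], GALLERY)` reverses the order, since the pooled query vector then points toward the second toy feature. In a trained system the mean and cosine steps would be replaced by the learned aggregation and matching model that the claim describes.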