US 12,322,198 B2
	Text based image search
Shaogang Gong, Pinner (GB); Qi Dong, London (GB); and Xiatian Zhu, Cambridge (GB)
Assigned to VERITONE, INC., Denver, CO (US)
Appl. No. 17/635,108
Filed by Veritone, Inc., Denver, CO (US)
PCT Filed Aug. 5, 2020, PCT No. PCT/GB2020/051872 § 371(c)(1), (2) Date Feb. 14, 2022, PCT Pub. No. WO2021/028656, PCT Pub. Date Feb. 18, 2021.
Claims priority of application No. 1911724 (GB), filed on Aug. 15, 2019.
Prior Publication US 2022/0343626 A1, Oct. 27, 2022
Int. Cl. G06V 30/413 (2022.01); G06N 3/045 (2023.01); G06V 10/44 (2022.01); G06V 10/75 (2022.01)

CPC G06V 30/413 (2022.01) [G06V 10/454 (2022.01); G06V 10/76 (2022.01)]

17 Claims

1. A method for building a machine learning model for finding visual targets from text queries, the method comprising the steps of:

receiving a set of training data comprising text attribute labelled images, wherein each image has more than one text attribute label;

receiving a first vector space comprising a mapping of words, the mapping defining relationships between words;

generating a visual feature vector space by grouping images of the set of training data having similar attribute labels;

mapping each attribute label within the training data set on to the first vector space to form a second vector space;

fusing the visual feature vector space and the second vector space to form a third vector space;

generating a similarity matching model from the third vector space; and

obtaining a global textual embedding, z^glo, according to:

where w₁ and w₂ are learnable parameters and Tanh is a non-linear activation function of a neuron in a Convolutional Neural Network, CNN,

wherein mapping each attributed label within the training data set on to the first vector space to form a second vector space further comprises embedding each attribute label, z_i^loc, i ∈ {1, …, N_att}.
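
The data flow recited in claim 1 can be illustrated with a minimal sketch in plain Python. It is not the patented implementation. The formula for z^glo is not reproduced in this text, so the two-layer Tanh form below, the element-wise fusion, and the logistic matching score are all assumptions chosen only to show how local embeddings z_i^loc, a global embedding z^glo, and visual features might fit together. All names (word_vectors, local_embeddings, global_embedding, match_score, w_match) and all dimensions are hypothetical, and the weights are random rather than trained.

```python
import math
import random

random.seed(0)
# Hypothetical sizes; N_ATT > 1 because each image carries more than one label.
WORD_DIM, HIDDEN_DIM, JOINT_DIM, N_ATT = 8, 16, 4, 2

def rand_vec(n, scale=1.0):
    return [random.gauss(0.0, scale) for _ in range(n)]

def rand_mat(rows, cols, scale=0.1):
    return [rand_vec(cols, scale) for _ in range(rows)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

# Stand-in for the "first vector space": a word -> vector mapping.
# A real system would load pretrained word embeddings instead of random ones.
word_vectors = {w: rand_vec(WORD_DIM) for w in ["red", "blue", "jacket", "backpack"]}

def local_embeddings(labels):
    """One local embedding z_i^loc per attribute label (the 'second vector space')."""
    return [word_vectors[label] for label in labels]

# w1 and w2 play the role of the learnable parameters; here they are untrained.
w1 = rand_mat(HIDDEN_DIM, N_ATT * WORD_DIM)
w2 = rand_mat(JOINT_DIM, HIDDEN_DIM)

def global_embedding(z_loc):
    """ASSUMED form: Tanh(w2 . Tanh(w1 . concat(z_loc))); not taken from the claim."""
    flat = [x for z in z_loc for x in z]
    hidden = [math.tanh(x) for x in matvec(w1, flat)]
    return [math.tanh(x) for x in matvec(w2, hidden)]

def match_score(visual, z_glo, w_match):
    """ASSUMED fusion (element-wise product) and matching (logistic score)."""
    fused = [v * t for v, t in zip(visual, z_glo)]
    logit = sum(w * f for w, f in zip(w_match, fused))
    return 1.0 / (1.0 + math.exp(-logit))

w_match = rand_vec(JOINT_DIM)
# Stand-in visual features for a gallery of five images.
gallery = [rand_vec(JOINT_DIM) for _ in range(5)]

z_loc = local_embeddings(["red", "jacket"])   # text query with two attributes
z_glo = global_embedding(z_loc)
scores = [match_score(v, z_glo, w_match) for v in gallery]
ranking = sorted(range(len(gallery)), key=lambda i: -scores[i])  # best match first
```

In this sketch a text query such as "red" plus "jacket" yields one score per gallery image, and ranking lists the image indices from most to least similar. With trained parameters the same structure would rank images by how well they match the queried attributes.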