US 11,928,854 B2
	Open-vocabulary object detection in images
Matthias Johannes Lorenz Minderer, Zurich (CH); Alexey Alexeevich Gritsenko, Amsterdam (NL); Austin Charles Stone, San Francisco, CA (US); Dirk Weissenborn, Berlin (DE); Alexey Dosovitskiy, Berlin (DE); and Neil Matthew Tinmouth Houlsby, Zurich (CH)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on May 5, 2023, as Appl. No. 18/144,045.
Claims priority of provisional application 63/339,165, filed on May 6, 2022.
Prior Publication US 2023/0360365 A1, Nov. 9, 2023
Int. Cl. G06K 9/00 (2022.01); G06F 40/40 (2020.01); G06V 10/22 (2022.01); G06V 10/74 (2022.01); G06V 10/764 (2022.01); G06V 10/774 (2022.01); G06V 10/776 (2022.01); G06V 10/82 (2022.01)

CPC G06V 10/764 (2022.01) [G06F 40/40 (2020.01); G06V 10/225 (2022.01); G06V 10/761 (2022.01); G06V 10/774 (2022.01); G06V 10/776 (2022.01); G06V 10/82 (2022.01)]

20 Claims

1. A method performed by one or more computers, the method comprising:

obtaining: (i) an image, and (ii) a set of one or more query embeddings, wherein each query embedding represents a respective category of object;

processing the image and the set of query embeddings using an object detection neural network to generate object detection data for the image, comprising:

processing the image using an image encoding subnetwork of the object detection neural network to generate a set of object embeddings, wherein the image encoding subnetwork comprises one or more self-attention neural network layers;

processing each object embedding using a localization subnetwork of the object detection neural network to generate localization data defining a corresponding region of the image; and

processing: (i) the set of object embeddings, and (ii) the set of query embeddings, using a classification subnetwork of the object detection neural network to generate, for each object embedding, a respective classification score distribution over the set of query embeddings,

wherein the respective classification score distribution for each of the object embeddings defines, for each query embedding, a likelihood that the region of the image corresponding to the object embedding depicts an object that is included in the category represented by the query embedding.