US 12,230,011 B2
Open-vocabulary object detection in images
Matthias Johannes Lorenz Minderer, Zürich (CH); Alexey Alexeevich Gritsenko, Amsterdam (NL); Austin Charles Stone, San Francisco, CA (US); Dirk Weissenborn, Berlin (DE); Alexey Dosovitskiy, Berlin (DE); and Neil Matthew Tinmouth Houlsby, Zürich (CH)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Jan. 25, 2024, as Appl. No. 18/422,887.
Application 18/422,887 is a continuation of application No. 18/144,045, filed on May 5, 2023, granted, now Pat. No. 11,928,854.
Claims priority of provisional application 63/339,165, filed on May 6, 2022.
Prior Publication US 2024/0161459 A1, May 16, 2024
Int. Cl. G06K 9/00 (2022.01); G06F 40/40 (2020.01); G06V 10/22 (2022.01); G06V 10/74 (2022.01); G06V 10/764 (2022.01); G06V 10/774 (2022.01); G06V 10/776 (2022.01); G06V 10/82 (2022.01)
CPC G06V 10/764 (2022.01) [G06F 40/40 (2020.01); G06V 10/225 (2022.01); G06V 10/761 (2022.01); G06V 10/774 (2022.01); G06V 10/776 (2022.01); G06V 10/82 (2022.01)] 20 Claims
OG exemplary drawing
 
1. A method performed by one or more computers, the method comprising:
pre-training an image encoding subnetwork and a text encoding subnetwork of an object detection neural network, wherein the image encoding subnetwork comprises one or more self-attention neural network layers, and wherein the pre-training includes repeatedly performing operations comprising:
obtaining: (i) a training image, (ii) a positive text sequence, wherein the positive text sequence characterizes the training image, and (iii) one or more negative text sequences, wherein the negative text sequences do not characterize the training image;
generating an embedding of the training image using the image encoding subnetwork, comprising:
processing the training image using the image encoding subnetwork to generate a set of object embeddings for the training image; and
processing the object embeddings using an embedding neural network to generate the embedding of the training image;
generating respective embeddings of the positive text sequence and each of the negative text sequences using the text encoding subnetwork; and
jointly training the image encoding subnetwork and the text encoding subnetwork to encourage: (i) greater similarity between the embedding of the training image and the embedding of the positive text sequence, and (ii) lesser similarity between the embedding of the training image and the embeddings of the negative text sequences; and
fine-tuning the image encoding subnetwork and the text encoding subnetwork on a task of object detection.
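The joint training operation recited in claim 1 amounts to a contrastive objective: the embedding of the training image is pushed toward the embedding of the positive text sequence and away from the embeddings of the negative text sequences. The sketch below is illustrative only and is not the patented implementation: the encoder outputs are replaced by random stand-in tensors, the "embedding neural network" is assumed to be a mean-pool followed by a single linear layer, and the embedding sizes and temperature are arbitrary.

# Illustrative contrastive pre-training step (not the patented implementation).
# Assumptions: the image encoder's object embeddings and the text encoder's
# outputs are replaced by random stand-ins; the "embedding neural network"
# is a mean-pool followed by one linear layer; sizes and temperature are arbitrary.
import jax
import jax.numpy as jnp

def image_embedding(object_embeddings, params):
    # Embedding neural network over the object embeddings (claim 1):
    # mean-pool the per-object embeddings, then apply a linear projection.
    pooled = object_embeddings.mean(axis=0)                 # (d_model,)
    return params["w"] @ pooled + params["b"]               # (d_embed,)

def contrastive_loss(image_emb, pos_text_emb, neg_text_embs, temperature=0.07):
    # Encourages greater similarity with the positive text embedding and
    # lesser similarity with the negative text embeddings.
    normalize = lambda v: v / (jnp.linalg.norm(v, axis=-1, keepdims=True) + 1e-8)
    image_emb = normalize(image_emb)
    text_embs = normalize(jnp.concatenate([pos_text_emb[None, :], neg_text_embs], axis=0))
    logits = text_embs @ image_emb / temperature            # (1 + num_negatives,)
    return -jax.nn.log_softmax(logits)[0]                   # positive sequence at index 0

key = jax.random.PRNGKey(0)
k1, k2, k3, k4 = jax.random.split(key, 4)
params = {"w": jax.random.normal(k1, (512, 768)) * 0.02, "b": jnp.zeros(512)}
object_embeddings = jax.random.normal(k2, (100, 768))       # stand-in image encoder output
pos_text_emb = jax.random.normal(k3, (512,))                # stand-in positive text embedding
neg_text_embs = jax.random.normal(k4, (7, 512))             # stand-in negative text embeddings
loss = contrastive_loss(image_embedding(object_embeddings, params), pos_text_emb, neg_text_embs)
print(float(loss))

In practice the stand-in tensors would be replaced by the outputs of the image encoding subnetwork and the text encoding subnetwork, and the loss would be backpropagated through both subnetworks so that they are trained jointly, as the claim requires.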
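Claim 1 does not spell out the detection head used during fine-tuning. The fragment below assumes a common open-vocabulary arrangement in which each object embedding produced by the image encoding subnetwork is projected with a shared embedding head and scored against text embeddings of the query class names; the function name, parameters, and head design are hypothetical and are not asserted to be the patented architecture.

# Hypothetical per-object scoring head for the fine-tuning stage; the claim
# only recites fine-tuning "on a task of object detection", so this head is
# an assumption, not the patented architecture.
import jax
import jax.numpy as jnp

def detection_scores(object_embeddings, class_text_embs, params):
    # Project each object embedding with the shared embedding head, then score
    # it against the text embeddings of the query class names by cosine similarity.
    normalize = lambda v: v / (jnp.linalg.norm(v, axis=-1, keepdims=True) + 1e-8)
    projected = object_embeddings @ params["w"].T + params["b"]    # (num_objects, d_embed)
    logits = normalize(projected) @ normalize(class_text_embs).T   # (num_objects, num_classes)
    return jax.nn.softmax(logits, axis=-1)

key = jax.random.PRNGKey(1)
k1, k2, k3 = jax.random.split(key, 3)
params = {"w": jax.random.normal(k1, (512, 768)) * 0.02, "b": jnp.zeros(512)}
scores = detection_scores(jax.random.normal(k2, (100, 768)),       # stand-in object embeddings
                          jax.random.normal(k3, (5, 512)),         # stand-in class-name embeddings
                          params)
print(scores.shape)                                                # (100, 5)

Because the class names enter only through the text encoding subnetwork, such a head can score object embeddings against an arbitrary, open vocabulary of query classes at inference time.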