CPC G06V 10/764 (2022.01) [G06F 40/40 (2020.01); G06V 10/225 (2022.01); G06V 10/761 (2022.01); G06V 10/774 (2022.01); G06V 10/776 (2022.01); G06V 10/82 (2022.01)] | 20 Claims |
1. A method performed by one or more computers, the method comprising:
obtaining: (i) an image, and (ii) a set of one or more query embeddings, wherein each query embedding represents a respective category of object;
processing the image and the set of query embeddings using an object detection neural network to generate object detection data for the image, comprising:
processing the image using an image encoding subnetwork of the object detection neural network to generate a set of object embeddings, wherein the image encoding subnetwork comprises one or more self-attention neural network layers;
processing each object embedding using a localization subnetwork of the object detection neural network to generate localization data defining a corresponding region of the image; and
processing: (i) the set of object embeddings, and (ii) the set of query embeddings, using a classification subnetwork of the object detection neural network to generate, for each object embedding, a respective classification score distribution over the set of query embeddings,
wherein the respective classification score distribution for each of the object embeddings defines, for each query embedding, a likelihood that the region of the image corresponding to the object embedding depicts an object that is included in the category represented by the query embedding.
|
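The data flow recited in claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation. It rests on several assumptions that the claim does not state: the image is already split into patch vectors, the image encoding subnetwork is a single self-attention layer with identity projections, each output token serves as one object embedding, the localization subnetwork is a linear map followed by a sigmoid that yields a normalized (cx, cy, w, h) box, and the classification subnetwork is a softmax over dot products between object embeddings and query embeddings. All function names, dimensions, and random weights are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def encode_image(patches):
    """Assumed image encoding subnetwork: one self-attention layer.

    Identity query/key/value projections keep the sketch short; each
    output token is treated as one object embedding.
    """
    d = len(patches[0])
    object_embeddings = []
    for q in patches:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in patches])
        object_embeddings.append(
            [sum(w * v[j] for w, v in zip(weights, patches)) for j in range(d)]
        )
    return object_embeddings


def localize(object_embedding, w_box):
    """Assumed localization subnetwork: linear map plus sigmoid.

    Returns a hypothetical normalized box (cx, cy, w, h) in (0, 1).
    """
    logits = [dot(row, object_embedding) for row in w_box]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def classify(object_embeddings, query_embeddings):
    """Assumed classification subnetwork.

    For each object embedding, returns a score distribution over the
    query embeddings (softmax of dot-product similarities).
    """
    return [
        softmax([dot(o, q) for q in query_embeddings])
        for o in object_embeddings
    ]


def detect(patches, query_embeddings, w_box):
    """Runs the three subnetworks in the order recited in the claim."""
    object_embeddings = encode_image(patches)
    boxes = [localize(o, w_box) for o in object_embeddings]
    scores = classify(object_embeddings, query_embeddings)
    return boxes, scores


# Toy inputs: 6 image patches, 3 category queries, embedding size 4.
rng = random.Random(0)
DIM = 4
patches = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(6)]
queries = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(3)]
w_box = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(4)]

boxes, scores = detect(patches, queries, w_box)
for box, dist in zip(boxes, scores):
    print([round(b, 3) for b in box], [round(s, 3) for s in dist])
```

Each printed row pairs one region with its score distribution over the query embeddings. Because the categories enter only as query embeddings, the set of categories can be changed at inference time without altering the network weights in this sketch.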