US 12,387,510 B2
Instance level scene recognition with a vision language model
Harshit Kharbanda, Pleasanton, CA (US); Boris Bluntschli, Canton of Zurich (CH); Vibhuti Mahajan, Los Angeles, CA (US); and Louis Wang, San Francisco, CA (US)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Mar. 28, 2024, as Appl. No. 18/620,136.
Application 18/620,136 is a continuation of application No. 18/496,402, filed on Oct. 27, 2023, granted, now 11,978,271.
Prior Publication US 2025/0140006 A1, May 1, 2025
This patent is subject to a terminal disclaimer.
Int. Cl. G06V 20/70 (2022.01); G06V 10/764 (2022.01); G06V 20/40 (2022.01)

CPC G06V 20/70 (2022.01) [G06V 10/764 (2022.01); G06V 20/41 (2022.01)]

20 Claims

1. A computer-implemented method, the method comprising:

obtaining, by a computing system comprising one or more processors, image data and text data, wherein the image data comprises an input image, and wherein the text data comprises a query associated with the input image;

processing, by the computing system, the input image with an object recognition model to generate a fine-grained object recognition output, wherein the fine-grained object recognition output is descriptive of identification details for an object depicted in the input image;

processing, by the computing system, the input image and the text data with a vision language model to generate a language output, wherein the language output comprises a set of predicted words predicted to be responsive to the query and based on the input image, wherein the set of predicted words comprise a coarse-grained term descriptive of predicted identification of the object depicted in the input image; and

processing, by the computing system, the fine-grained object recognition output and the language output to generate an augmented language output, wherein the augmented language output comprises the set of predicted words with the coarse-grained term replaced with the fine-grained object recognition output.
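
The four steps of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the names `LanguageOutput`, `augment`, `answer`, `stub_recognizer`, and `stub_vlm` are hypothetical, both models are replaced by fixed stubs, and the sketch assumes the vision language model reports which of its predicted words is the coarse-grained term, a detail the claim does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class LanguageOutput:
    """Hypothetical container for the vision language model's output."""
    words: List[str]   # predicted words responsive to the query
    coarse_term: str   # assumed: the word identifying the object coarsely


def augment(language_output: LanguageOutput, fine_grained: str) -> str:
    """Replace the coarse-grained term with the fine-grained output."""
    replaced = [
        fine_grained if word == language_output.coarse_term else word
        for word in language_output.words
    ]
    return " ".join(replaced)


def answer(
    image: bytes,
    query: str,
    recognize: Callable[[bytes], str],
    vlm: Callable[[bytes, str], LanguageOutput],
) -> str:
    """Run both models on the same image, then merge their outputs."""
    fine_grained = recognize(image)       # object recognition model
    language_output = vlm(image, query)   # vision language model
    return augment(language_output, fine_grained)


# Stub models standing in for real ones (illustrative values only).
def stub_recognizer(image: bytes) -> str:
    return "Golden Gate Bridge"


def stub_vlm(image: bytes, query: str) -> LanguageOutput:
    return LanguageOutput(
        words=["The", "bridge", "is", "shown", "at", "sunset."],
        coarse_term="bridge",
    )


if __name__ == "__main__":
    print(answer(b"", "What is in this image?", stub_recognizer, stub_vlm))
```

With the stubs above, the coarse-grained term "bridge" is swapped for the recognizer's "Golden Gate Bridge", while the remaining predicted words are left unchanged.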