| CPC G06V 20/70 (2022.01) [G06V 10/764 (2022.01); G06V 20/41 (2022.01)] | 20 Claims |

|
1. A computer-implemented method, the method comprising:
obtaining, by a computing system comprising one or more processors, image data and text data, wherein the image data comprises an input image, and wherein the text data comprises a query associated with the input image;
processing, by the computing system, the input image with an object recognition model to generate a fine-grained object recognition output, wherein the fine-grained object recognition output is descriptive of identification details for an object depicted in the input image;
processing, by the computing system, the input image and the text data with a vision language model to generate a language output, wherein the language output comprises a set of predicted words predicted to be responsive to the query and based on the input image, wherein the set of predicted words comprises a coarse-grained term descriptive of predicted identification of the object depicted in the input image; and
processing, by the computing system, the fine-grained object recognition output and the language output to generate an augmented language output, wherein the augmented language output comprises the set of predicted words with the coarse-grained term replaced with the fine-grained object recognition output.
|
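The four steps of claim 1 can be illustrated with a minimal Python sketch. It is not the claimed implementation: both models are replaced by hypothetical stubs (`recognize_object`, `vision_language_model`) that return fixed values, and the sketch assumes the vision language model's output already marks which predicted word is the coarse-grained term, which the claim does not specify. All names and example strings below are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class LanguageOutput:
    """Hypothetical container for the vision language model's output."""
    words: list[str]   # the set of predicted words responsive to the query
    coarse_term: str   # assumed: the coarse-grained term is already marked


def recognize_object(image: bytes) -> str:
    """Stub for the object recognition model (fine-grained output)."""
    return "golden retriever"  # illustrative fixed value, not a real prediction


def vision_language_model(image: bytes, query: str) -> LanguageOutput:
    """Stub for the vision language model (coarse-grained language output)."""
    return LanguageOutput(
        words=["The", "dog", "is", "sitting", "on", "the", "grass"],
        coarse_term="dog",
    )


def augment(language_output: LanguageOutput, fine_grained: str) -> str:
    """Replace the coarse-grained term with the fine-grained recognition output."""
    replaced = [
        fine_grained if word == language_output.coarse_term else word
        for word in language_output.words
    ]
    return " ".join(replaced)


def answer_query(image: bytes, query: str) -> str:
    """Run the steps in the order recited in the claim."""
    fine_grained = recognize_object(image)                  # object recognition
    language_output = vision_language_model(image, query)   # language output
    return augment(language_output, fine_grained)           # augmented output


if __name__ == "__main__":
    print(answer_query(b"", "What is in this picture?"))
```

In this sketch the replacement is a simple exact-match token substitution; a real system would need some way to locate the coarse-grained term in the predicted words, which is left open here.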