US 12,148,233 B1
Systems and methods for AI generation of image captions enriched with multiple AI modalities
Frédéric Petitpont, Velizy Villacoublay (FR); Yannis Tevissen, Paris (FR); and Khalil Guetari, Le Chesnay-Rocquencourt (FR)
Assigned to Newsbridge SAS, Boulogne Billancourt (FR)
Filed by Newsbridge SAS, Boulogne Billancourt (FR)
Filed on May 9, 2024, as Appl. No. 18/659,800.
Int. Cl. G06K 9/00 (2022.01); G06V 10/764 (2022.01); G06V 20/40 (2022.01); G06V 20/70 (2022.01); G06V 40/16 (2022.01)

CPC G06V 20/70 (2022.01) [G06V 10/764 (2022.01); G06V 20/41 (2022.01); G06V 40/172 (2022.01)]

20 Claims

1. A method comprising:

obtaining, by at least one processor, at least one image;

obtaining, by the at least one processor, an artificial intelligence (AI)-generated caption comprising at least one textual description of the at least one image;

wherein the at least one textual description comprises at least one identification of at least one item in the at least one image;

inputting, by the at least one processor, the at least one image and the at least one textual description into at least one vision transformer model to produce at least one heat map for the at least one image;

wherein the at least one heat map comprises a representation of a degree of significance of at least one portion of the at least one image to the at least one identification of the at least one item in the at least one textual description based at least in part on the at least one gradient;

inputting, by the at least one processor, the at least one image into an expert recognition machine learning model to output at least one bounding box comprising at least one label representative of the at least one item;

determining, by the at least one processor, for the at least one image, a spatial alignment within the at least one image between the at least one bounding box and the at least one portion of the at least one heat map; and

modifying, by the at least one processor, the at least one textual description of the AI-generated caption to comprise the at least one label of the at least one item based on the spatial alignment within the at least one image so as to produce a modified AI-generated caption associated with the at least one item.
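
The last two steps of claim 1 (determining a spatial alignment between a bounding box and a heat-map portion, then modifying the caption to include the box's label) can be illustrated with a minimal sketch. The claim does not say how the alignment is computed, so everything here is an assumption made for illustration: the alignment score (the share of total heat-map mass falling inside the box), the 0.5 threshold, and the hypothetical names `Box`, `alignment_score`, and `enrich_caption`. The heat map is assumed to be a plain 2D grid of non-negative values, already produced for one caption phrase.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical bounding box from an expert recognition model."""
    label: str  # e.g. a recognized person's name
    x0: int     # left column (inclusive)
    y0: int     # top row (inclusive)
    x1: int     # right column (exclusive)
    y1: int     # bottom row (exclusive)


def alignment_score(heat_map, box):
    """Assumed metric: fraction of total heat-map mass inside the box."""
    total = sum(sum(row) for row in heat_map)
    if total == 0:
        return 0.0
    inside = sum(sum(row[box.x0:box.x1]) for row in heat_map[box.y0:box.y1])
    return inside / total


def enrich_caption(caption, item_phrase, heat_map, boxes, threshold=0.5):
    """Replace item_phrase with the label of the best-aligned box.

    The caption is returned unchanged if the phrase is absent, if there
    are no boxes, or if no box reaches the (assumed) threshold.
    """
    if not boxes or item_phrase not in caption:
        return caption
    best = max(boxes, key=lambda b: alignment_score(heat_map, b))
    if alignment_score(heat_map, best) < threshold:
        return caption
    return caption.replace(item_phrase, best.label, 1)


# Toy 4x4 heat map for the phrase "a person": mass concentrated in the centre.
heat_map = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.8, 0.0],
    [0.0, 0.7, 0.6, 0.0],
    [0.0, 0.0, 0.0, 0.1],
]
boxes = [
    Box("Alex Example", 1, 1, 3, 3),     # covers the hot centre
    Box("Sam Placeholder", 3, 3, 4, 4),  # covers one weak corner cell
]
print(enrich_caption("a person speaking at a podium", "a person", heat_map, boxes))
# prints "Alex Example speaking at a podium"
```

Other overlap measures, such as intersection-over-union against a thresholded heat-map region, would fit the same structure. The mass-fraction score is used here only because it needs no extra thresholding step.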