| CPC G06V 10/86 (2022.01) [G06F 40/284 (2020.01); G06V 10/761 (2022.01); G06V 10/774 (2022.01); G06V 10/82 (2022.01); G06V 20/70 (2022.01)] | 20 Claims |

|
11. A system, comprising:
a processor programmed to:
(i) receive a plurality of input images indicative of radar, sonar, video, picture, sound, or LiDar information;
(ii) generate a visual matrix utilizing the plurality of input images and an image encoder of the machine learning network, wherein the visual matrix includes a list of encoded images;
(iii) receive a plurality of text prompts;
(iv) select a first one of the text prompts from the plurality of text prompts;
(v) send the first one of the text prompts to a large language model (LLM) to generate a candidate list of tokens, wherein the candidate list of tokens is generated by selecting a subset of tokens from every token associated with the first one of the text prompts, wherein the subset includes highest-probable tokens associated with the first one of the text prompts, wherein the highest-probable tokens are calculated in response to output of the LLM;
(vi) select one or more tokens from the candidate list;
(vii) convert the one of the text prompts into updated text prompts via appending the one or more selected tokens associated to the plurality of text prompts;
(viii) generate a text matrix utilizing both (1) the updated text prompt that include one or more tokens and (2) a text encoder of the machine learning network, wherein the text matrix includes a list of encoded visual descriptors that includes the updated text prompt with one or more tokens;
(ix) multiply the text matrix and the visual matrix to generate an image-text similarity matrix, wherein the image-text similarity matrix assigns a numerical value indicating similarities between each of encoded visual descriptors and each of the encoded images, wherein similarities are indicated by entries of the image-text similarity matrix having numerical values;
(x) utilizing the numerical values assigned at the image-text similarity matrix, determine a score associated with the image-text similarity matrix;
(xi) when the score falls below a threshold, repeating steps (vi-xi) for a second token for the first one of the text prompts, and when the score exceeds the threshold, adding the one or more tokens to the updated text prompts and repeating steps (iv-xi) for a remainder of each of the plurality of text prompts; and
(xii) output a final token to the updated text prompt in response to identifying a highest score associated with the final token after evaluating each of the plurality of text prompts.
|
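The loop recited in steps (iv) through (xii) of claim 11 can be illustrated with a minimal sketch. It is not the patented implementation. The names `refine_prompts`, `propose_tokens`, and `encode_text` are hypothetical, and the toy word vectors stand in for a real LLM and for real image and text encoders. The scoring rule (the mean of each descriptor's best image match) is also an assumption, since the claim does not say how the score is derived from the similarity matrix.

```python
import math
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def similarity_matrix(text_matrix: List[Vector], visual_matrix: List[Vector]) -> List[List[float]]:
    """Step (ix): text matrix times transposed visual matrix.
    Rows are encoded descriptors, columns are encoded images."""
    return [[sum(t * i for t, i in zip(trow, irow)) for irow in visual_matrix]
            for trow in text_matrix]


def score(sim: List[List[float]]) -> float:
    """Step (x). ASSUMPTION: mean of each row's best match; the claim leaves this open."""
    return sum(max(row) for row in sim) / len(sim)


def top_k_tokens(token_probs: Dict[str, float], k: int) -> List[str]:
    """Step (v): keep the k highest-probability tokens."""
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [token for token, _ in ranked[:k]]


def refine_prompts(
    prompts: List[str],
    visual_matrix: List[Vector],
    propose_tokens: Callable[[str], Dict[str, float]],  # hypothetical stand-in for the LLM
    encode_text: Callable[[str], Vector],               # hypothetical stand-in for the text encoder
    threshold: float,
    k: int = 3,
) -> Tuple[List[str], Optional[str], float]:
    updated = list(prompts)
    best_token: Optional[str] = None
    best_score = float("-inf")
    for idx, prompt in enumerate(prompts):                    # step (iv)
        for token in top_k_tokens(propose_tokens(prompt), k):  # steps (v), (vi)
            trial = updated[:idx] + [f"{prompt} {token}"] + updated[idx + 1:]  # step (vii)
            text_matrix = [normalize(encode_text(p)) for p in trial]          # step (viii)
            s = score(similarity_matrix(text_matrix, visual_matrix))          # steps (ix), (x)
            if s > best_score:
                best_token, best_score = token, s  # tracked for step (xii)
            if s > threshold:                      # step (xi): keep the token, go to next prompt
                updated = trial
                break
            # otherwise fall through and try the next candidate token
    return updated, best_token, best_score


# Toy data, purely illustrative: three-dimensional "embeddings" keyed by word.
WORD_VECS = {
    "dog": [1.0, 0.0, 0.0],
    "cat": [0.0, 1.0, 0.0],
    "furry": [0.5, 0.5, 0.0],
    "metal": [0.0, 0.0, 1.0],
}


def toy_encode_text(prompt: str) -> Vector:
    """Sum the vectors of known words; unknown words contribute nothing."""
    total = [0.0, 0.0, 0.0]
    for word in prompt.split():
        for i, x in enumerate(WORD_VECS.get(word, [0.0, 0.0, 0.0])):
            total[i] += x
    return total


def toy_propose_tokens(prompt: str) -> Dict[str, float]:
    """Fixed token probabilities in place of real LLM output."""
    return {"metal": 0.5, "furry": 0.3, "shiny": 0.15, "old": 0.05}


# Step (ii) stand-in: two already-encoded images.
visual_matrix = [normalize([1.5, 0.5, 0.0]), normalize([0.5, 1.5, 0.0])]
updated, best_token, best_score = refine_prompts(
    ["dog", "cat"], visual_matrix, toy_propose_tokens, toy_encode_text,
    threshold=0.95, k=2,
)
print(updated, best_token, round(best_score, 3))
```

In this toy run the most probable token ("metal") scores below the threshold for both prompts, so the loop falls back to the second candidate, which mirrors the retry branch of step (xi). A real system would replace the toy encoders with the image and text encoders of the machine learning network and would take token probabilities from the LLM output.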