US 11,900,067 B1
Multi-modal machine learning architectures integrating language models and computer vision systems
Michael Love, Marble Falls, TX (US); Blake Love, Austin, TX (US); and Tiago Soromenho, Austin, TX (US)
Assigned to SURGETECH, LLC, Austin, TX (US)
Filed by SurgeTech, LLC, Austin, TX (US)
Filed on Sep. 21, 2023, as Appl. No. 18/471,934.
Application 18/471,934 is a continuation of application No. 18/191,746, filed on Mar. 28, 2023, granted, now 11,803,710.
Int. Cl. G06F 40/35 (2020.01); G06V 10/96 (2022.01); G06F 16/532 (2019.01); G06V 10/764 (2022.01); G06V 10/82 (2022.01); G06F 40/40 (2020.01); G06V 10/774 (2022.01); G06V 10/80 (2022.01)
CPC G06F 40/35 (2020.01) [G06F 16/532 (2019.01); G06F 40/40 (2020.01); G06V 10/764 (2022.01); G06V 10/774 (2022.01); G06V 10/803 (2022.01); G06V 10/82 (2022.01); G06V 10/96 (2022.01)]
20 Claims
1. A method implemented via execution of computing instructions by one or more processors and stored on one or more non-transitory computer-readable storage devices, the method comprising:
providing a multi-modal application that includes:
a computer vision system configured to execute one or more computer vision tasks and generate computer vision outputs; and
a client interface that facilitates interactions between an end-user and a language model that generates language model outputs for communicating with the end-user;
executing, using the computer vision system, at least one computer vision task on an image to generate a computer vision output;
comparing the computer vision output generated by the computer vision system with one or more trigger events;
in response to detecting a match between the computer vision output and the one or more trigger events, executing, by the language model, at least one natural language processing (NLP) task to generate a preemptive output; and
outputting the preemptive output via the client interface.
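
The steps recited in claim 1 can be read as a short control flow: run a vision task, compare its output against trigger events, and only on a match ask the language model for a preemptive output that is sent to the client interface. The following is a minimal illustrative sketch in Python, not the patented implementation. Every name in it (VisionOutput, TriggerEvent, find_trigger, run_pipeline, and the two stub functions) is hypothetical, and matching by label plus a confidence threshold is an assumption, since the claim does not say how the comparison is performed.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class VisionOutput:
    """Hypothetical computer vision output: a label with a confidence score."""
    label: str
    confidence: float


@dataclass(frozen=True)
class TriggerEvent:
    """Hypothetical trigger event (assumed form: label plus minimum confidence)."""
    label: str
    min_confidence: float = 0.5

    def matches(self, output: VisionOutput) -> bool:
        # Assumed matching rule; the claim only requires "detecting a match".
        return (output.label == self.label
                and output.confidence >= self.min_confidence)


def find_trigger(output: VisionOutput,
                 triggers: Iterable[TriggerEvent]) -> Optional[TriggerEvent]:
    """Compare the vision output with each trigger; return the first match."""
    for trigger in triggers:
        if trigger.matches(output):
            return trigger
    return None


def run_pipeline(image: bytes,
                 vision_task: Callable[[bytes], VisionOutput],
                 triggers: Iterable[TriggerEvent],
                 language_model: Callable[[str], str],
                 client_interface: Callable[[str], None]) -> Optional[str]:
    """Run the claimed steps in order; return the preemptive output, if any."""
    # Step 1: execute a computer vision task on the image.
    output = vision_task(image)
    # Step 2: compare the vision output with the trigger events.
    trigger = find_trigger(output, triggers)
    if trigger is None:
        return None  # No match, so the language model is not invoked.
    # Step 3: on a match, have the language model generate a preemptive output.
    prompt = (f"The camera detected '{output.label}' "
              f"(confidence {output.confidence:.2f}). Notify the user.")
    preemptive_output = language_model(prompt)
    # Step 4: output the preemptive output via the client interface.
    client_interface(preemptive_output)
    return preemptive_output


def stub_vision_task(image: bytes) -> VisionOutput:
    """Stand-in for a real vision model; always reports the same detection."""
    return VisionOutput(label="smoke", confidence=0.91)


def stub_language_model(prompt: str) -> str:
    """Stand-in for a real language model; wraps the prompt in a fixed phrase."""
    return f"Heads up: {prompt}"
```

In this sketch the client interface is any callable that accepts a string, so passing `print` or a list's `append` method is enough to observe the behavior. With a trigger such as `TriggerEvent("smoke", 0.8)` the stub detection matches and a message is delivered; with a trigger for a different label, `run_pipeline` returns `None` and nothing is sent.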