US 12,475,151 B1
Fault tolerant multi-agent generative AI applications
Vijay Madisetti, Alpharetta, GA (US); and Arshdeep Bahga, Chandigarh (IN)
Assigned to Vijay Madisetti, Alpharetta, GA (US)
Filed by Vijay Madisetti, Alpharetta, GA (US)
Filed on Oct. 21, 2024, as Appl. No. 18/921,852.
Application 18/921,852 is a continuation in part of application No. 18/812,707, filed on Aug. 22, 2024.
Application 18/812,707 is a continuation in part of application No. 18/470,487, filed on Sep. 20, 2023, granted, now 12,147,461.
Application 18/470,487 is a continuation of application No. 18/348,692, filed on Jul. 7, 2023, granted, now 12,001,462.
Claims priority of provisional application 63/693,351, filed on Sep. 11, 2024.
Claims priority of provisional application 63/647,092, filed on May 14, 2024.
Claims priority of provisional application 63/607,647, filed on Dec. 8, 2023.
Claims priority of provisional application 63/607,112, filed on Dec. 7, 2023.
Claims priority of provisional application 63/535,118, filed on Aug. 29, 2023.
Claims priority of provisional application 63/534,974, filed on Aug. 28, 2023.
Claims priority of provisional application 63/529,177, filed on Jul. 27, 2023.
Claims priority of provisional application 63/469,571, filed on May 30, 2023.
Claims priority of provisional application 63/463,913, filed on May 4, 2023.
Int. Cl. G06F 16/3329 (2025.01); G06F 40/284 (2020.01)

CPC G06F 16/3329 (2019.01) [G06F 40/284 (2020.01)]

14 Claims

1. A method executable by a server for large language model (LLM) agent failure detection of an LLM system comprising one or more LLMs and one or more agents interacting with the one or more LLMs comprising:

listening for an agent heartbeat signal from an agent of the one or more agents;

upon receiving the agent heartbeat signal from the agent, updating a most recent heartbeat timestamp for the agent;

upon a heartbeat threshold length of time elapsing without receiving the agent heartbeat signal from the agent, designating the agent as a suspect agent;

probing the agent for a response;

upon receiving the response from the agent:

removing the suspect agent designation from the agent; and

updating the most recent heartbeat timestamp;

upon a response threshold length of time elapsing without receiving the response from the agent:

designating the agent as a failed agent;

sending a message to at least one of a service mesh or a message pool associated with the one or more agents regarding the agent being designated a failed agent;

receiving a requirement comprising at least one of an application to be executed by the agent or a task to be performed by the agent;

analyzing the requirement to determine whether the requirement requires a replacement with a lesser degree of similarity or requires a replacement with a greater degree of similarity;

activating a shadow agent associated with the agent responsive to analyzing the requirement, comprising upon determining the requirement requires a replacement with a greater degree of similarity:

removing the agent from the at least one of the service mesh or message pool between the one or more agents; and

launching a shadow agent configured to have a greater degree of similarity to the agent relative to a replacement with a lesser degree of similarity;

integrating the shadow agent into the at least one of the service mesh or message pool between the one or more agents; and

restoring a known good state of the agent to the shadow agent, comprising:

retrieving the known good state of the agent from a checkpoint storage;

applying the known good state to the shadow agent;

receiving an alert that the agent has failed from a watchdog process associated with the agent; and

resuming operation of the LLM system with the shadow agent replacing the agent.
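
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the names `FailureDetector`, `AgentRecord`, `Status`, and `replace_with_shadow` are hypothetical, time is passed in explicitly as `now` so the logic is deterministic, the service mesh or message pool is stood in for by a plain set of agent identifiers, and the checkpoint storage by a dictionary. The watchdog alert and the lesser-similarity replacement path are not modeled.

```python
import copy
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Optional, Set


class Status(Enum):
    HEALTHY = "healthy"
    SUSPECT = "suspect"
    FAILED = "failed"


@dataclass
class AgentRecord:
    last_heartbeat: float
    status: Status = Status.HEALTHY
    probe_sent_at: Optional[float] = None


class FailureDetector:
    """Hypothetical heartbeat, probe, and fail state machine."""

    def __init__(self, heartbeat_threshold: float, response_threshold: float,
                 probe: Callable[[str], None], notify: Callable[[str], None]):
        self.heartbeat_threshold = heartbeat_threshold
        self.response_threshold = response_threshold
        self.probe = probe      # assumed callback that pings the agent
        self.notify = notify    # assumed callback that messages the mesh or pool
        self.agents: Dict[str, AgentRecord] = {}

    def on_heartbeat(self, agent_id: str, now: float) -> None:
        # Update the most recent heartbeat timestamp for the agent.
        rec = self.agents.setdefault(agent_id, AgentRecord(last_heartbeat=now))
        rec.last_heartbeat = now

    def on_probe_response(self, agent_id: str, now: float) -> None:
        # A probe response clears the suspect designation and refreshes the timestamp.
        rec = self.agents[agent_id]
        if rec.status is Status.SUSPECT:
            rec.status = Status.HEALTHY
            rec.probe_sent_at = None
            rec.last_heartbeat = now

    def tick(self, now: float) -> List[str]:
        """Evaluate both thresholds; return agents newly designated as failed."""
        newly_failed = []
        for agent_id, rec in self.agents.items():
            if (rec.status is Status.HEALTHY
                    and now - rec.last_heartbeat > self.heartbeat_threshold):
                rec.status = Status.SUSPECT
                rec.probe_sent_at = now
                self.probe(agent_id)
            elif (rec.status is Status.SUSPECT
                    and now - rec.probe_sent_at > self.response_threshold):
                rec.status = Status.FAILED
                self.notify(agent_id)
                newly_failed.append(agent_id)
        return newly_failed


def replace_with_shadow(agent_id: str, needs_high_similarity: bool,
                        pool: Set[str], checkpoints: Dict[str, dict]) -> Optional[dict]:
    """Swap a failed agent for a shadow restored from its checkpoint."""
    if not needs_high_similarity:
        return None  # the lesser-similarity path is not detailed in the claim
    pool.discard(agent_id)                       # remove the failed agent
    shadow = {"id": f"{agent_id}-shadow",        # launch the shadow agent
              "state": copy.deepcopy(checkpoints[agent_id])}  # restore known good state
    pool.add(shadow["id"])                       # integrate the shadow agent
    return shadow
```

In this sketch a caller would invoke `tick` periodically with the current time, then pass each returned agent identifier to `replace_with_shadow` along with the outcome of the requirement analysis. Keeping the clock outside the detector is a design choice made here for testability; a deployed server would more likely read a monotonic clock internally.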