| CPC G06F 16/3329 (2019.01) [G06F 40/284 (2020.01)] | 14 Claims |

|
1. A method executable by a server for large language model (LLM) agent failure detection of an LLM system comprising one or more LLMs and one or more agents interacting with the one or more LLMs comprising:
listening for an agent heartbeat signal from an agent of the one or more agents;
upon receiving the agent heartbeat signal from the agent, updating a most recent heartbeat timestamp for the agent;
upon a heartbeat threshold length of time elapsing without receiving the agent heartbeat signal from the agent, designating the agent as a suspect agent;
probing the agent for a response;
upon receiving the response from the agent:
removing the suspect agent designation from the agent; and
updating the most recent heartbeat timestamp;
upon a response threshold length of time elapsing without receiving the response from the agent:
designating the agent as a failed agent;
sending a message to at least one of a service mesh or a message pool associated with the one or more agents regarding the agent being designated a failed agent;
receiving a requirement comprising at least one of an application to be executed by the agent or receiving a task to be performed by the agent;
analyzing the requirement to determine whether the requirement requires a replacement with a lesser degree of similarity or requires a replacement with a greater degree of similarity;
activating a shadow agent associated with the agent responsive to analyzing the requirement, comprising upon determining the requirement requires a replacement with a greater degree of similarity:
removing the agent from the at least one of the service mesh or messaging pool between the one or more agents; and
launching a shadow agent configured to have a greater degree of similarity to the agent relative to a replacement with a lesser degree of similarity;
integrating the shadow agent into the at least one of the service mesh or messaging pool between the one or more agents; and
restoring a known good state of the agent to the shadow agent, comprising:
retrieving the known good state of the agent from a checkpoint storage;
applying the known good state to the shadow agent;
receiving an alert that the agent has failed from a watchdog process associated with the agent; and
resuming operation of the LLM system with the shadow agent replacing the agent.
|
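
By way of illustration only, the detection and replacement steps recited in claim 1 might be sketched as follows. This is a minimal, non-limiting sketch under stated assumptions and is not the claimed implementation: the names `FailureDetector`, `AgentRecord`, `Status`, `tick`, and `activate_shadow` are hypothetical, timestamps are passed in explicitly as plain numbers, a list stands in for the message sent to the service mesh or message pool, a set stands in for mesh membership, and a dictionary stands in for the checkpoint storage. The requirement analysis and the watchdog alert are omitted.

```python
import copy
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, List, Optional, Set, Tuple


class Status(Enum):
    HEALTHY = "healthy"
    SUSPECT = "suspect"
    FAILED = "failed"


@dataclass
class AgentRecord:
    last_heartbeat: float               # most recent heartbeat timestamp
    status: Status = Status.HEALTHY
    probed_at: Optional[float] = None   # when the suspect agent was probed


class FailureDetector:
    """Hypothetical tracker for heartbeat, suspect, and failed designations."""

    def __init__(self, heartbeat_threshold: float, response_threshold: float):
        self.heartbeat_threshold = heartbeat_threshold
        self.response_threshold = response_threshold
        self.agents: Dict[str, AgentRecord] = {}
        # Assumption: stand-in for notifying a service mesh or message pool.
        self.failure_messages: List[str] = []

    def on_heartbeat(self, agent_id: str, now: float) -> None:
        # Register the agent on first contact, then update its timestamp.
        rec = self.agents.setdefault(agent_id, AgentRecord(last_heartbeat=now))
        rec.last_heartbeat = now

    def on_probe_response(self, agent_id: str, now: float) -> None:
        # A probe response removes the suspect designation and
        # updates the most recent heartbeat timestamp.
        rec = self.agents[agent_id]
        if rec.status is Status.SUSPECT:
            rec.status = Status.HEALTHY
            rec.probed_at = None
            rec.last_heartbeat = now

    def tick(self, now: float) -> None:
        # Periodic check of both thresholds for every known agent.
        for agent_id, rec in self.agents.items():
            if rec.status is Status.HEALTHY:
                if now - rec.last_heartbeat >= self.heartbeat_threshold:
                    rec.status = Status.SUSPECT
                    rec.probed_at = now  # a real system would send the probe here
            elif rec.status is Status.SUSPECT:
                if now - rec.probed_at >= self.response_threshold:
                    rec.status = Status.FAILED
                    self.failure_messages.append(agent_id)


def activate_shadow(
    agent_id: str, mesh: Set[str], checkpoints: Dict[str, Any]
) -> Tuple[str, Any]:
    """Swap the failed agent for a shadow and restore its known good state."""
    mesh.discard(agent_id)                 # remove the failed agent
    shadow_id = f"{agent_id}-shadow"       # hypothetical naming scheme
    mesh.add(shadow_id)                    # integrate the shadow agent
    state = copy.deepcopy(checkpoints[agent_id])  # retrieve the checkpoint
    return shadow_id, state
```

In this sketch a caller would invoke `tick` on a timer, route incoming heartbeat and probe-response messages to `on_heartbeat` and `on_probe_response`, and call `activate_shadow` for each agent identifier that appears in `failure_messages`.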