US 12,248,883 B1
Generative artificial intelligence model prompt injection classifier
Jacob Rideout, Raleigh, NC (US); Tanner Burns, Austin, TX (US); Kwesi Cappel, Austin, TX (US); and Kenneth Yeung, Ottawa (CA)
Assigned to HiddenLayer, Inc., Austin, TX (US)
Filed by HiddenLayer, Inc., Austin, TX (US)
Filed on Mar. 14, 2024, as Appl. No. 18/605,337.
Int. Cl. G06N 3/094 (2023.01); G06F 21/55 (2013.01); G06N 20/00 (2019.01)
CPC G06N 3/094 (2023.01) [G06F 21/55 (2013.01); G06N 20/00 (2019.01)]
24 Claims
 
1. A computer-implemented method comprising:
receiving, by an analysis engine from a model environment executing a generative artificial intelligence (GenAI) model, data characterizing a prompt for ingestion by the GenAI model, the analysis engine executing in a monitoring environment remote from the model environment;
determining, by the analysis engine using an ensemble of machine learning-based prompt injection classifiers, whether the prompt comprises malicious content or elicits malicious actions, a first of the prompt injection classifiers being trained to identify a first type of prompt injection attack and a second of the prompt injection classifiers being trained to identify a second, different type of prompt injection attack; and
providing data characterizing the determination to a consuming application or process to (i) initiate a remediation action to ensure that the GenAI model does not operate in an undesired manner when it is determined that the prompt comprises malicious content or elicits malicious actions and (ii) allow the prompt to be ingested when it is determined that the prompt does not comprise malicious content or elicit malicious actions;
wherein the initiated remediation action prevents the prompt from being ingested by the GenAI model to ensure that the GenAI model does not operate in an undesired manner and is based on whether the ensemble of machine learning-based prompt injection classifiers identifies the first type of prompt injection attack or the second type of prompt injection attack.
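
The flow recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the two keyword-matching functions merely stand in for trained machine learning classifiers, and every name in it (`instruction_override_classifier`, `prompt_leak_classifier`, `ENSEMBLE`, `REMEDIATION`, `THRESHOLD`, `analyze_prompt`, the attack-type labels, and the action strings) is a hypothetical chosen for illustration. The sketch assumes each classifier returns a score between 0 and 1 and that the highest-scoring classifier determines the attack type.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical stand-ins for trained classifiers; each returns a score in [0, 1].
Classifier = Callable[[str], float]


def instruction_override_classifier(prompt: str) -> float:
    # Assumed first attack type: attempts to override prior instructions.
    return 0.95 if "ignore previous instructions" in prompt.lower() else 0.05


def prompt_leak_classifier(prompt: str) -> float:
    # Assumed second, different attack type: attempts to extract hidden context.
    return 0.90 if "reveal your system prompt" in prompt.lower() else 0.05


@dataclass(frozen=True)
class Determination:
    malicious: bool
    attack_type: Optional[str]
    action: str


# The ensemble maps an attack-type label to the classifier trained for it.
ENSEMBLE: Dict[str, Classifier] = {
    "instruction_override": instruction_override_classifier,
    "prompt_leak": prompt_leak_classifier,
}

# Both remediations keep the prompt from the model; they differ by attack type.
REMEDIATION: Dict[str, str] = {
    "instruction_override": "block",
    "prompt_leak": "block_and_alert",
}

THRESHOLD = 0.5  # assumed decision threshold, not taken from the patent


def analyze_prompt(prompt: str) -> Determination:
    """Score the prompt with every classifier and pick a remediation."""
    scores = {name: clf(prompt) for name, clf in ENSEMBLE.items()}
    attack_type, top_score = max(scores.items(), key=lambda kv: kv[1])
    if top_score >= THRESHOLD:
        return Determination(True, attack_type, REMEDIATION[attack_type])
    # No classifier fired: the prompt may be ingested by the model.
    return Determination(False, None, "allow")
```

In a deployment matching the claim, `analyze_prompt` would run in a monitoring environment remote from the model environment, and the returned `Determination` would be passed to the consuming application or process, which either forwards the prompt to the GenAI model or applies the selected remediation.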