CPC H04L 63/1466 (2013.01) [G06N 20/00 (2019.01); H04L 63/1416 (2013.01)] | 30 Claims |
1. A computer-implemented method comprising:
receiving data characterizing a prompt for ingestion by a generative artificial intelligence (GenAI) model;
capturing an intermediate result generated by an intermediate layer of the GenAI model, the GenAI model comprising a plurality of transformer layers and the intermediate result comprising activations in residual streams generated by one or more of the transformer layers;
determining, using a prompt injection classifier and based on the intermediate result, whether the prompt comprises malicious content or elicits undesired model behavior;
initiating at least one remediation action when it is determined that the prompt comprises malicious content or elicits undesired model behavior; and
returning an output of the GenAI model responsive to the prompt when it is determined that the prompt does not comprise malicious content or elicit undesired model behavior.
|
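The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation: `ToyModel`, `LinearProbe`, `guarded_generate`, the two-feature embedding, the layer functions, and the probe weights are all hypothetical names and values chosen for illustration. The sketch assumes a residual stream that each layer updates additively, a logistic-regression probe as the prompt injection classifier, and blocking the output as the remediation action.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


@dataclass
class ToyModel:
    """Hypothetical stand-in for a GenAI model with a residual stream."""
    layers: Sequence[Callable[[Vector], Vector]]

    def embed(self, prompt: str) -> Vector:
        # Toy embedding (assumption): count of the token "ignore" and scaled length.
        tokens = prompt.lower().split()
        return [float(tokens.count("ignore")), len(tokens) / 10.0]

    def forward(self, prompt: str, capture_layer: int) -> Tuple[str, Vector]:
        resid = self.embed(prompt)
        captured: Vector = []
        for i, layer in enumerate(self.layers):
            # Each layer adds its update to the residual stream.
            resid = [r + u for r, u in zip(resid, layer(resid))]
            if i == capture_layer:
                captured = list(resid)  # the captured intermediate result
        return f"response to: {prompt}", captured


@dataclass
class LinearProbe:
    """Hypothetical prompt injection classifier: logistic regression on activations."""
    weights: Vector
    bias: float
    threshold: float = 0.5

    def is_malicious(self, activations: Vector) -> bool:
        logit = sum(w * a for w, a in zip(self.weights, activations)) + self.bias
        return 1.0 / (1.0 + math.exp(-logit)) >= self.threshold


def guarded_generate(model: ToyModel, probe: LinearProbe, prompt: str,
                     capture_layer: int = 0) -> Tuple[str, str]:
    # Receive the prompt and capture activations from an intermediate layer.
    output, activations = model.forward(prompt, capture_layer)
    # Classify based on the intermediate result, then remediate or return.
    if probe.is_malicious(activations):
        return ("blocked", "Request blocked by policy.")  # remediation action
    return ("allowed", output)


# Hand-set layers and probe weights, for illustration only.
model = ToyModel(layers=[lambda x: [0.5 * x[0], 0.0], lambda x: [0.0, 0.1 * x[1]]])
probe = LinearProbe(weights=[4.0, 0.0], bias=-3.0)
```

Calling `guarded_generate(model, probe, prompt)` returns a status and a text. In this sketch the model runs to completion before the decision is applied; a real system could equally halt generation at the captured layer, which the claim language does not rule out.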