US 12,137,118 B1
Prompt injection classifier using intermediate results
Amelia Kawasaki, Corvallis, OR (US); and Andrew Davis, Portland, OR (US)
Assigned to HiddenLayer, Inc., Austin, TX (US)
Filed by HiddenLayer, Inc., Austin, TX (US)
Filed on Jul. 29, 2024, as Appl. No. 18/787,768.
Application 18/787,768 is a continuation of application No. 18/648,252, filed on Apr. 26, 2024, granted, now 12,107,885.
This patent is subject to a terminal disclaimer.
Int. Cl. H04L 9/40 (2022.01); G06N 20/00 (2019.01)
CPC H04L 63/1466 (2013.01) [G06N 20/00 (2019.01); H04L 63/1416 (2013.01)]
30 Claims
1. A computer-implemented method comprising:
receiving data characterizing a prompt for ingestion by a generative artificial intelligence (GenAI) model;
capturing an intermediate result generated by an intermediate layer of the GenAI model, the GenAI model comprising a plurality of transformer layers and the intermediate result comprising activations in residual streams generated by one or more of the transformer layers;
determining, using a prompt injection classifier and based on the intermediate result, whether the prompt comprises malicious content or elicits undesired model behavior;
initiating at least one remediation action when it is determined that the prompt comprises malicious content or elicits undesired model behavior; and
returning an output of the GenAI model responsive to the prompt when it is determined that the prompt does not comprise malicious content or elicits undesired model behavior.
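
The steps recited in claim 1 can be illustrated with a minimal sketch. This is not the patented implementation: the toy embedding, the stand-in "layers", the linear probe used as the classifier, the threshold, and every name below (embed, toy_layer, classify, handle_prompt, Result) are hypothetical and chosen only for illustration. The sketch also assumes the classifier runs as soon as the intermediate layer has produced its activations, so a flagged prompt never reaches the later layers; the claim itself does not require that ordering.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional

Vector = List[float]


def embed(prompt: str, dim: int = 4) -> Vector:
    """Toy embedding (assumption): fold character codes into `dim` slots."""
    vec = [0.0] * dim
    for i, ch in enumerate(prompt):
        vec[i % dim] += ord(ch) / 1000.0
    return vec


def toy_layer(scale: float) -> Callable[[Vector], Vector]:
    """Stand-in for a transformer layer: adds an update to the residual stream."""
    def layer(residual: Vector) -> Vector:
        return [x + math.tanh(scale * x) for x in residual]
    return layer


@dataclass
class Result:
    blocked: bool          # True when a remediation action was taken
    score: float           # classifier score in (0, 1)
    output: Optional[str]  # model output, or None when blocked


def classify(activations: Vector, weights: Vector, bias: float) -> float:
    """Hypothetical prompt injection classifier: linear probe plus sigmoid."""
    z = sum(w * a for w, a in zip(weights, activations)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def handle_prompt(
    prompt: str,
    layers: List[Callable[[Vector], Vector]],
    capture_at: int,
    weights: Vector,
    bias: float,
    threshold: float = 0.5,
) -> Result:
    # Receive the prompt and run it up to the intermediate layer.
    residual = embed(prompt, dim=len(weights))
    for layer in layers[: capture_at + 1]:
        residual = layer(residual)

    # Capture the intermediate result (residual-stream activations).
    captured = list(residual)

    # Classify based on the intermediate result.
    score = classify(captured, weights, bias)
    if score >= threshold:
        # Remediation action (assumption): block the prompt, return no output.
        return Result(blocked=True, score=score, output=None)

    # Otherwise finish the forward pass and return the model output.
    for layer in layers[capture_at + 1:]:
        residual = layer(residual)
    return Result(blocked=False, score=score, output=f"response({sum(residual):.3f})")
```

Calling handle_prompt with three toy layers and capture_at=1 classifies on the activations produced by the second layer. In a real deployment the probe weights would be learned from labeled benign and malicious prompts, and remediation could take other forms than blocking, such as logging or sanitizing the prompt.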