US 12,137,118 B1
	Prompt injection classifier using intermediate results
Amelia Kawasaki, Corvallis, OR (US); and Andrew Davis, Portland, OR (US)
Assigned to HiddenLayer, Inc., Austin, TX (US)
Filed by HiddenLayer, Inc., Austin, TX (US)
Filed on Jul. 29, 2024, as Appl. No. 18/787,768.
Application 18/787,768 is a continuation of application No. 18/648,252, filed on Apr. 26, 2024, granted, now 12,107,885.
This patent is subject to a terminal disclaimer.
Int. Cl. H04L 9/40 (2022.01); G06N 20/00 (2019.01)

CPC H04L 63/1466 (2013.01) [G06N 20/00 (2019.01); H04L 63/1416 (2013.01)]

30 Claims

1. A computer-implemented method comprising:

receiving data characterizing a prompt for ingestion by a generative artificial intelligence (GenAI) model;

capturing an intermediate result generated by an intermediate layer of the GenAI model, the GenAI model comprising a plurality of transformer layers and the intermediate result comprising activations in residual streams generated by one or more of the transformer layers;

determining, using a prompt injection classifier and based on the intermediate result, whether the prompt comprises malicious content or elicits undesired model behavior;

initiating at least one remediation action when it is determined that the prompt comprises malicious content or elicits undesired model behavior; and

returning an output of the GenAI model responsive to the prompt when it is determined that the prompt does not comprise malicious content or elicits undesired model behavior.