US 12,437,058 B1
Security threat mitigation for large language models
Jimit Majmudar, Sunnyvale, CA (US); Ben Smith, Laguna Beach, CA (US); Rahul Gupta, Waltham, MA (US); Sachin P. Joglekar, Bothell, WA (US); and Lili Gehorsam, Seattle, WA (US)
Assigned to AMAZON TECHNOLOGIES, INC., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Dec. 1, 2023, as Appl. No. 18/526,452.
Claims priority of provisional application 63/599,209, filed on Nov. 15, 2023.
Int. Cl. G06F 21/54 (2013.01)

CPC G06F 21/54 (2013.01) [G06F 2221/033 (2013.01)]

20 Claims

1. A computer-implemented method comprising:

receiving a first natural language input;

generating first prompt data comprising context data and the first natural language input;

generating, by a large language model (LLM) using the first prompt data, first action plan data, wherein the first action plan data comprises a first application programming interface (API) call to a first API of a first computer-implemented service;

determining that the first API and arguments for the call to the first API are permissible using a first allow list stored in memory;

executing the first API call;

receiving first result data from the first API in response to the first API call, wherein the first result data is received by the first computer-implemented service from a first source;

generating first encoded data by encoding the first result data using a sequence-to-sequence encoder model;

sending the first encoded data to a first binary classifier model, wherein the first binary classifier model is trained to determine whether API results comprise indirect prompt injection instructions received from the first source, wherein the first binary classifier model is trained using positive examples of the indirect prompt injection instructions and negative examples that comprise valid API results;

determining, by the first binary classifier model, that the first result data comprises a first indirect prompt injection instruction;

sending, by the first binary classifier model, first data to an action plan executor component, the first data indicating that the first result data is invalid; and

terminating, by the action plan executor component, a current session of LLM processing in response to the first data.
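
The control flow recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: every name in it (ApiCall, ALLOW_LIST, is_permitted, encode, looks_injected, execute_plan, SessionTerminated) is hypothetical. The allow list is assumed to map an API name to its permitted argument names, and the sequence-to-sequence encoder and trained binary classifier are replaced by trivial placeholders (whitespace tokenization and a phrase match) so that the example runs without a model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class ApiCall:
    """One API call taken from LLM-generated action plan data (hypothetical shape)."""
    api: str
    args: Dict[str, str]


# Assumption: the allow list maps each permitted API to its permitted argument names.
ALLOW_LIST: Dict[str, Set[str]] = {"weather.get": {"city"}}


class SessionTerminated(Exception):
    """Raised when the executor ends the current session of LLM processing."""


def is_permitted(call: ApiCall) -> bool:
    """Check both the API and its arguments against the allow list."""
    allowed_args = ALLOW_LIST.get(call.api)
    return allowed_args is not None and set(call.args) <= allowed_args


def encode(result: str) -> List[str]:
    """Placeholder for the encoder model: lowercase whitespace tokens."""
    return result.lower().split()


def looks_injected(encoded: List[str]) -> bool:
    """Placeholder for the binary classifier: a fixed phrase match, not a trained model."""
    return "ignore previous instructions" in " ".join(encoded)


def execute_plan(plan: List[ApiCall], apis: Dict[str, Callable[..., str]]) -> List[str]:
    """Run each call in the plan, gating on the allow list and on the result check."""
    results: List[str] = []
    for call in plan:
        if not is_permitted(call):
            raise PermissionError(f"call to {call.api} is not on the allow list")
        result = apis[call.api](**call.args)
        if looks_injected(encode(result)):
            # The result is flagged invalid, so the session stops here and the
            # result is never returned for further LLM processing.
            raise SessionTerminated(f"result of {call.api} flagged as invalid")
        results.append(result)
    return results
```

In this sketch, a plan whose only call is weather.get with a city argument returns the API result unchanged, while a result containing the phrase "ignore previous instructions" raises SessionTerminated before anything is passed back. Raising an exception is one simple way to model termination by the action plan executor component; the claim does not prescribe a mechanism.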