US 12,431,131 B1
Cache techniques for large language model processing
Kartik Balasubramaniam, Framingham, MA (US); Venkata Siva Sai Krishna Balakavi, Jersey City, NJ (US); and Austin Doolittle, Roslindale, MA (US)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Sep. 19, 2023, as Appl. No. 18/469,858.
Int. Cl. G10L 15/197 (2013.01); G10L 15/18 (2013.01); G10L 15/22 (2006.01); G10L 15/28 (2013.01)
CPC G10L 15/197 (2013.01) [G10L 15/1815 (2013.01); G10L 15/22 (2013.01); G10L 15/285 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A computer-implemented method comprising:
receiving first input data representing a first user input;
generating a first prompt including at least the first input data, the first prompt being a first input for a large language model (LLM) to determine a response to the first user input;
determining, using the LLM, first encoded representations corresponding to the first prompt;
storing, using a cache associated with the LLM, the first encoded representations;
performing, using the LLM and the first encoded representations, a first iteration of processing to determine a response to the first user input, the first iteration of processing resulting in generation of first processing data;
determining second encoded representations corresponding to the first processing data;
storing, using the cache, the second encoded representations;
performing, using the LLM, the first encoded representations, and the second encoded representations, a second iteration of processing to determine a first response corresponding to the first user input;
causing presentation of the first response;
based on the LLM determining the first response, deleting, from the cache, the second encoded representations;
receiving second input data representing a second user input;
generating a second prompt including at least the first input data and the second input data, the second prompt being a second input for the LLM to determine a response to the second user input;
determining, from the cache, the first encoded representations corresponding to a first portion of the second prompt, wherein the first portion of the second prompt includes the first input data;
determining, using the LLM, third encoded representations corresponding to a second portion of the second prompt, wherein the second portion of the second prompt includes the second input data;
determining, using the LLM, the first encoded representations, and the third encoded representations, a second response to the second user input; and
causing presentation of the second response.
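A minimal sketch of the cache flow recited in claim 1, written in Python with toy stand-ins for the LLM and the cache (every function and variable name below is hypothetical, not taken from the patent): the prompt-prefix representations are cached and reused across turns, while the representations produced during generation are deleted once the response is complete.

```python
# Hypothetical sketch of the claimed cache flow; encode() and decode_step()
# are toy stand-ins for the LLM, not the patent's actual model or cache API.

from dataclasses import dataclass, field


@dataclass
class PromptCache:
    """Cache keyed by text, holding encoded representations (e.g., the
    key/value tensors an LLM computes per token)."""
    entries: dict = field(default_factory=dict)

    def store(self, key: str, encoded: list) -> None:
        self.entries[key] = encoded

    def lookup(self, key: str):
        return self.entries.get(key)

    def delete(self, key: str) -> None:
        self.entries.pop(key, None)


def encode(text: str) -> list:
    """Stand-in for the LLM forward pass that produces encoded
    representations (one opaque value per token here)."""
    return [hash(tok) for tok in text.split()]


def decode_step(encoded: list) -> str:
    """Stand-in for one iteration of LLM generation over the
    accumulated encoded representations."""
    return f"tok{len(encoded)}"


cache = PromptCache()

# --- First turn -----------------------------------------------------------
first_input = "what is the weather"
first_prompt = f"[system] assistant [user] {first_input}"

first_reprs = encode(first_prompt)            # first encoded representations
cache.store(first_prompt, first_reprs)

first_processing = decode_step(first_reprs)   # first iteration -> processing data
second_reprs = encode(first_processing)       # second encoded representations
cache.store(first_processing, second_reprs)

first_response = decode_step(first_reprs + second_reprs)  # second iteration
print("response 1:", first_response)

# Response complete: evict the generation-time representations,
# keep the prompt-prefix representations for reuse.
cache.delete(first_processing)

# --- Second turn ----------------------------------------------------------
second_input = "and tomorrow"
second_prompt = f"{first_prompt} [user] {second_input}"

# The first portion of the second prompt is a cache hit; only the new
# portion is run through the model.
prefix_reprs = cache.lookup(first_prompt)     # first encoded representations, reused
third_reprs = encode(f"[user] {second_input}")  # third encoded representations

second_response = decode_step(prefix_reprs + third_reprs)
print("response 2:", second_response)
```

Under these assumptions, deleting the second encoded representations after the first response bounds cache growth across iterations, while the prefix hit on the second turn avoids recomputing the portion of the conversation the two prompts share.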