US 12,346,252 B1
Efficient key-value cache management for large language models
Aditya Dhakal, Santa Clarita, CA (US); Pedro H. R. Bruel, San Jose, CA (US); Gourav Rattihalli, Milpitas, CA (US); Sai Rahul Chalamalasetti, Newark, CA (US); and Dejan S. Milojicic, Palo Alto, CA (US)
Assigned to Hewlett Packard Enterprise Development LP, Spring, TX (US)
Filed by Hewlett Packard Enterprise Development LP, Spring, TX (US)
Filed on Apr. 3, 2024, as Appl. No. 18/626,045.
Int. Cl. G06F 12/08 (2016.01); G06F 12/0802 (2016.01)
CPC G06F 12/0802 (2013.01) [G06F 2212/60 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A method, comprising:
monitoring, by a first network interface controller (NIC), a key-value cache associated with a large language model (LLM), the LLM being executed by a compute node to infer a query from a user, the compute node comprising the first NIC and an accelerator, and the key-value cache being stored in a memory associated with the accelerator;
in response to detecting that the key-value cache is updated by the accelerator, transferring, by the first NIC on behalf of the accelerator, a copy of the key-value cache update to a remote storage node;
deleting the key-value cache from the memory after the query is inferred, thereby allowing the memory to be used for storing key-value caches associated with other users;
in response to receiving a follow-up query from the user, determining, by the first NIC, a storage location on the remote storage node that stores the key-value cache corresponding to the user; and
sending a key-value (KV)-cache-transfer request to a second NIC on the remote storage node, the KV-cache-transfer request specifying the storage location, thereby facilitating the second NIC to transfer the key-value cache corresponding to the user from the specified storage location to the memory associated with the accelerator.
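The claimed method amounts to a small offload-and-restore protocol between the compute node's NIC and a remote storage node. The sketch below is a minimal illustration of that flow, not the patented implementation: it assumes plain Python dictionaries stand in for accelerator memory and remote storage, and the class and method names (RemoteStorageNode, ComputeNodeNIC, on_kv_cache_update, and so on) are hypothetical; a real system would move KV tensors between the NICs over the fabric (e.g., RDMA) rather than copying bytes in-process.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class RemoteStorageNode:
    """Remote node whose NIC (the 'second NIC') serves KV-cache-transfer requests."""
    store: Dict[str, bytes] = field(default_factory=dict)

    def write(self, location: str, kv_cache: bytes) -> None:
        # Persist a copy of the KV cache at the given storage location.
        self.store[location] = kv_cache

    def handle_transfer_request(self, location: str) -> bytes:
        # Return the KV cache held at the requested storage location.
        return self.store[location]


@dataclass
class ComputeNodeNIC:
    """First NIC: mirrors KV-cache updates off-node and restores them on demand."""
    accelerator_memory: Dict[str, bytes]                      # user_id -> KV cache
    remote: RemoteStorageNode
    locations: Dict[str, str] = field(default_factory=dict)   # user_id -> storage location

    def on_kv_cache_update(self, user_id: str) -> None:
        # Detected that the accelerator updated the cache: copy it to remote storage.
        location = f"kv/{user_id}"
        self.remote.write(location, self.accelerator_memory[user_id])
        self.locations[user_id] = location

    def on_query_inferred(self, user_id: str) -> None:
        # Query answered: evict the cache so the memory can serve other users.
        del self.accelerator_memory[user_id]

    def on_follow_up_query(self, user_id: str) -> None:
        # Follow-up query: determine the storage location, then ask the remote
        # NIC to transfer the cache back into accelerator memory.
        location = self.locations[user_id]
        self.accelerator_memory[user_id] = self.remote.handle_transfer_request(location)


if __name__ == "__main__":
    remote = RemoteStorageNode()
    nic = ComputeNodeNIC(accelerator_memory={"alice": b"kv-tensors-v1"}, remote=remote)

    nic.on_kv_cache_update("alice")    # accelerator updated the cache; NIC mirrors it
    nic.on_query_inferred("alice")     # query inferred; cache evicted from accelerator memory
    nic.on_follow_up_query("alice")    # follow-up query; cache restored from remote storage
    assert nic.accelerator_memory["alice"] == b"kv-tensors-v1"
```

The sketch keeps the division of labor described in the claim: the accelerator only reads and writes its local memory, while both the offload copy and the later restore are driven by the NICs, so the model's execution path never blocks on remote storage bookkeeping.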