| CPC G06F 12/0802 (2013.01) [G06F 2212/60 (2013.01)] | 20 Claims |

|
1. A method, comprising:
monitoring, by a first network interface controller (NIC), a key-value cache associated with a large language model (LLM), the LLM being executed by a compute node to infer a query from a user, the compute node comprising the first NIC and an accelerator, and the key-value cache being stored in a memory associated with the accelerator;
in response to detecting that the key-value cache is updated by the accelerator, transferring, by the first NIC on behalf of the accelerator, a copy of the key-value cache update to a remote storage node;
deleting the key-value cache from the memory after the query is inferred, thereby allowing the memory to be used for storing key-value caches associated with other users;
in response to receiving a follow-up query from the user, determining, by the first NIC, a storage location on the remote storage node that stores the key-value cache corresponding to the user; and
sending a key-value (KV)-cache-transfer request to a second NIC on the remote storage node, the KV-cache-transfer request specifying the storage location, thereby facilitating the second NIC to transfer the key-value cache corresponding to the user from the specified storage location to the memory associated with the accelerator.
|
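The steps recited in claim 1 can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions, not the claimed implementation: the names `StorageNode`, `ComputeNic`, `run_query`, and the `kv/<user>` location format are hypothetical. The NIC's "monitoring" is modeled as a direct callback on each cache update, and accelerator memory is modeled as a plain dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class StorageNode:
    """Hypothetical stand-in for the remote storage node and its second NIC."""
    blocks: dict = field(default_factory=dict)  # storage location -> KV entries

    def append(self, location, update):
        # Store a copy of one key-value cache update at the given location.
        self.blocks.setdefault(location, []).append(update)

    def handle_transfer_request(self, location, user, accel_memory):
        # Second NIC: copy the cache at the specified location into the
        # memory associated with the accelerator.
        accel_memory[user] = list(self.blocks[location])


@dataclass
class ComputeNic:
    """Hypothetical stand-in for the first NIC on the compute node."""
    storage: StorageNode
    locations: dict = field(default_factory=dict)  # user -> storage location

    def on_cache_update(self, user, update):
        # Assumption: monitoring is reduced to a callback per cache update.
        location = self.locations.setdefault(user, f"kv/{user}")
        self.storage.append(location, update)

    def restore(self, user, accel_memory):
        # Determine the storage location for this user, then send the
        # KV-cache-transfer request that names that location.
        location = self.locations[user]
        self.storage.handle_transfer_request(location, user, accel_memory)


def run_query(nic, accel_memory, user, tokens):
    """Simulate inference: extend the cache, mirror each update, then evict."""
    cache = accel_memory.setdefault(user, [])
    for token in tokens:
        entry = ("k:" + token, "v:" + token)  # placeholder key/value tensors
        cache.append(entry)
        nic.on_cache_update(user, entry)
    # Delete the cache after the query so other users can use the memory.
    del accel_memory[user]


storage = StorageNode()
nic = ComputeNic(storage)
accel_memory = {}

run_query(nic, accel_memory, "alice", ["hello", "world"])  # initial query
nic.restore("alice", accel_memory)                         # follow-up arrives
run_query(nic, accel_memory, "alice", ["again"])           # reuses restored cache
```

In this sketch the accelerator never initiates a transfer itself: the first NIC mirrors updates outward, and the second NIC pushes the stored cache back on request. That division of work is what the claim's "on behalf of the accelerator" language describes.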