US 12,452,345 B2
Managing artificial intelligence inference requests that are directed to an AI model external to a distributed cloud computing network
Michelle Chen, New York, NY (US); Dane Orion Knecht, Austin, TX (US); Celso Martinho, Lisbon (PT); Yoav Moshe, Amsterdam (NL); and Simona Andreea Badoiu, Lisbon (PT)
Assigned to CLOUDFLARE, INC., San Francisco, CA (US)
Filed by CLOUDFLARE, INC., San Francisco, CA (US)
Filed on Sep. 26, 2024, as Appl. No. 18/898,508.
Claims priority of provisional application 63/585,593, filed on Sep. 26, 2023.
Prior Publication US 2025/0103744 A1, Mar. 27, 2025
Int. Cl. G06F 21/62 (2013.01); G06F 9/50 (2006.01); H04L 41/16 (2022.01); H04L 67/1014 (2022.01); H04L 67/63 (2022.01)
CPC H04L 67/63 (2022.05) [G06F 21/6218 (2013.01); H04L 41/16 (2013.01); H04L 67/1014 (2013.01); G06F 9/5027 (2013.01)] 18 Claims
OG exemplary drawing
 
1. A method in a first compute server of a plurality of compute servers of a distributed cloud computing network, comprising:
receiving a first inference request directed to a first AI model of a plurality of AI models that are each hosted at different destinations external to the distributed cloud computing network, wherein at least some of the plurality of AI models are provided by different third-party providers, wherein the first inference request is received at the first compute server of the distributed cloud computing network instead of a first server of a first third-party provider that hosts the first AI model as a result of a first endpoint for the first AI model pointing to the distributed cloud computing network instead of to the first third-party provider, and wherein the first endpoint for the first AI model identifies the first third-party provider and an account;
applying a rate limiting rule to the first inference request, the rate limiting rule to prevent excessive or suspicious inference requests targeting the first AI model from being transmitted to the first AI model, wherein the first inference request complies with the rate limiting rule;
accessing, based on a hash key generated from the first inference request, a distributed data store of the distributed cloud computing network that stores representations of previous inference requests with corresponding inference responses for the first AI model as key-value pairs, wherein the hash key generated from the first inference request does not have a matching key-value pair in the distributed data store;
transmitting the first inference request to the first AI model responsive to not finding a matching key-value pair in the distributed data store for the hash key generated from the first inference request;
receiving a first inference response from the first AI model in response to the transmitted first inference request;
transmitting the first inference response in response to the first inference request;
storing the hash key generated from the first inference request and the first inference response as a key-value pair in the distributed data store;
causing information about the first inference request and the first inference response to be logged including information that identifies: the first AI model, the first third-party provider, whether the first inference response was served from the distributed data store, a first number of tokens in, and a first number of tokens out;
receiving a second inference request directed to a second AI model of the plurality of AI models, wherein the first AI model and the second AI model are provided by different third-party providers, wherein the second inference request is received at the first compute server of the distributed cloud computing network instead of a second server of a second third-party provider that hosts the second AI model as a result of a second endpoint for the second AI model pointing to the distributed cloud computing network instead of to the second third-party provider, and wherein the second endpoint for the second AI model identifies the second third-party provider and the account;
applying a second rate limiting rule to the second inference request, the second rate limiting rule to prevent excessive or suspicious inference requests targeting the second AI model from being transmitted to the second AI model, wherein the second inference request complies with the second rate limiting rule;
accessing, based on a hash key generated from the second inference request, a distributed data store of the distributed cloud computing network that stores representations of previous inference requests with corresponding inference responses for the second AI model as key-value pairs, wherein the hash key generated from the second inference request has a matching key-value pair in the distributed data store;
retrieving a second inference response of the matching key-value pair in the distributed data store;
transmitting the second inference response in response to the second inference request; and
causing information about the second inference request and the second inference response to be logged including information that identifies: the second AI model, the second third-party provider, whether the second inference response was served from the distributed data store, a second number of tokens in, and a second number of tokens out, wherein the logged information about the first inference request and the first inference response and the logged information about the second inference request and the second inference response are used in an analytics service provided by the distributed cloud computing network that aggregates the logged information for the account.
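To make the claimed flow concrete, the sketches below walk through its main steps in TypeScript. They are illustrative only: the URL layout, provider slugs, origins, and helper names are assumptions made for the examples, not the scheme disclosed in the patent. The first sketch shows an endpoint that identifies the third-party provider and the account, so an inference request addressed to it reaches a compute server of the distributed cloud computing network rather than the provider's own server, and can later be forwarded to the provider on a cache miss.

// Sketch only: the gateway hostname, path layout, and provider origins below
// are assumptions for illustration, not the patent's published scheme.
type Provider = "openai" | "anthropic";

interface ParsedEndpoint {
  accountId: string;    // account the endpoint identifies
  provider: Provider;   // third-party provider hosting the AI model
  upstreamPath: string; // remainder forwarded to the provider's API
}

// Example map from provider slug to the provider's API origin.
const PROVIDER_ORIGINS: Record<Provider, string> = {
  openai: "https://api.openai.com",
  anthropic: "https://api.anthropic.com",
};

// Parse a gateway URL such as
//   https://gateway.example.com/<accountId>/<provider>/v1/chat/completions
// into the provider and account it identifies.
function parseEndpoint(url: URL): ParsedEndpoint | null {
  const [accountId, provider, ...rest] = url.pathname.split("/").filter(Boolean);
  if (!accountId || !(provider in PROVIDER_ORIGINS)) return null;
  return {
    accountId,
    provider: provider as Provider,
    upstreamPath: "/" + rest.join("/"),
  };
}

// Build the upstream URL an inference request would be forwarded to on a cache miss.
function upstreamUrl(endpoint: ParsedEndpoint): string {
  return PROVIDER_ORIGINS[endpoint.provider] + endpoint.upstreamPath;
}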
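The claim next requires applying a rate limiting rule so that excessive or suspicious inference requests targeting a model are not forwarded to it. A minimal sketch, assuming a fixed-window counter kept per account and model; the rule shape and the in-memory counter store are assumptions, not the claimed implementation.

// Sketch only: a fixed-window counter is one assumed way to realize the
// claimed rate limiting rule; over-limit requests are simply not forwarded.
interface RateLimitRule {
  maxRequests: number;   // requests allowed per window
  windowSeconds: number; // window length
}

// In-memory stand-in for per-server (or distributed) counters.
const counters = new Map<string, { windowStart: number; count: number }>();

function complies(accountId: string, model: string, rule: RateLimitRule, now = Date.now()): boolean {
  const key = `${accountId}:${model}`;
  const windowMs = rule.windowSeconds * 1000;
  const entry = counters.get(key);
  if (!entry || now - entry.windowStart >= windowMs) {
    // Start a new window for this account/model pair.
    counters.set(key, { windowStart: now, count: 1 });
    return true;
  }
  entry.count += 1;
  return entry.count <= rule.maxRequests; // false: do not transmit to the model
}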
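The caching steps derive a hash key from the inference request, look it up in a distributed data store of key-value pairs, forward the request to the model only when no matching pair exists, and store the response so later matching requests can be answered without contacting the provider. The sketch below uses an in-memory map as a stand-in for the distributed data store and SHA-256 over the model name and request body as the hash key; both choices are assumptions for illustration.

// Sketch only: in-memory map standing in for the distributed key-value data store.
const inferenceCache = new Map<string, string>();

// Derive a deterministic hash key from the target model and request body (assumed scheme).
async function hashKey(model: string, requestBody: string): Promise<string> {
  const data = new TextEncoder().encode(`${model}\n${requestBody}`);
  const digest = await crypto.subtle.digest("SHA-256", data);
  return Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
}

// Serve from the data store when a matching key-value pair exists; otherwise
// forward to the model's provider, store the response, and return it.
async function handleInference(
  model: string,
  requestBody: string,
  forwardToModel: (body: string) => Promise<string>,
): Promise<{ response: string; cached: boolean }> {
  const key = await hashKey(model, requestBody);
  const hit = inferenceCache.get(key);
  if (hit !== undefined) {
    return { response: hit, cached: true };        // second-request path in the claim
  }
  const response = await forwardToModel(requestBody); // first-request path: cache miss
  inferenceCache.set(key, response);
  return { response, cached: false };
}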
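Finally, each request/response pair is logged with the AI model, the third-party provider, whether the response was served from the distributed data store, and the number of tokens in and out, and an analytics service aggregates the logged information for the account. One assumed shape for such a record and for the per-account aggregation:

// Sketch only: the record fields mirror the items the claim requires to be
// logged; the aggregation is an assumed example of the analytics service.
interface InferenceLogRecord {
  accountId: string;
  provider: string; // third-party provider that hosts the model
  model: string;    // AI model the request was directed to
  cached: boolean;  // whether the response was served from the data store
  tokensIn: number; // number of tokens in
  tokensOut: number; // number of tokens out
}

interface AccountUsage {
  requests: number;
  cachedResponses: number;
  tokensIn: number;
  tokensOut: number;
}

// Aggregate logged records per account, as an analytics service might.
function aggregateByAccount(records: InferenceLogRecord[]): Map<string, AccountUsage> {
  const usage = new Map<string, AccountUsage>();
  for (const r of records) {
    const u = usage.get(r.accountId) ?? { requests: 0, cachedResponses: 0, tokensIn: 0, tokensOut: 0 };
    u.requests += 1;
    if (r.cached) u.cachedResponses += 1;
    u.tokensIn += r.tokensIn;
    u.tokensOut += r.tokensOut;
    usage.set(r.accountId, u);
  }
  return usage;
}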