CPC G06F 9/5083 (2013.01) [G06F 9/5016 (2013.01); G06F 11/3442 (2013.01); G06F 12/0238 (2013.01); H04L 41/0896 (2013.01); G06F 9/505 (2013.01); G06F 11/3409 (2013.01); G06F 2209/508 (2013.01); G06F 2212/1024 (2013.01); G06N 20/00 (2019.01)] | 18 Claims |
1. A method, comprising:
sending, by one or more processors, input data to a plurality of computing devices configured to process the input data, wherein a respective portion of the input data is sent to each of the plurality of computing devices according to a load-balancing distribution;
receiving, from a first computing device of the plurality of computing devices, data characterizing memory bandwidth for a memory device on the first computing device over a period of time, wherein the respective portion of the input data sent to the first computing device comprises a request to the first computing device to return output data by processing the respective portion using a machine learning model;
determining, based at least on the data characterizing the memory bandwidth and on a memory bandwidth saturation point for the first computing device, that the first computing device can process additional data within a predetermined latency threshold, the predetermined latency threshold specifying a maximum tolerated rate of change of the access latency with respect to the memory bandwidth, and the access latency being a measure of the time to access data stored on the memory device, wherein the memory bandwidth saturation point for the first computing device is determined by:
receiving measures of access latency at different memory bandwidths, each memory bandwidth corresponding to a respective measure of access latency, and
identifying the memory bandwidth saturation point for the first computing device as the point at which a graph of the access latency as a function of the memory bandwidth begins to exhibit a non-linear relationship, such that the increase in access latency per unit increase of memory bandwidth becomes greater than the increase in access latency per unit increase of memory bandwidth exhibited while the access latency and the memory bandwidth are in a linear relationship;
identifying a first memory bandwidth corresponding to the memory bandwidth saturation point; and
in response to determining that the first computing device can process additional data within the predetermined latency threshold, sending the additional data to the first computing device for processing by the first computing device.
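The saturation-point identification recited above can be illustrated with a short sketch: given (memory bandwidth, access latency) measurement pairs, estimate the slope of the initial linear region and report the first bandwidth at which the latency begins to climb disproportionately faster. The function name, the slope tolerance, and the two-sample baseline are illustrative assumptions, not limitations drawn from the claim.

```python
# Hypothetical sketch of saturation-point detection from latency/bandwidth
# samples. Names and the slope_tolerance value are assumptions for
# illustration only.

def find_saturation_point(samples, slope_tolerance=1.5):
    """samples: list of (memory_bandwidth, access_latency) pairs,
    assumed sorted by strictly increasing bandwidth.

    Returns the first memory bandwidth at which the latency-vs-bandwidth
    curve stops looking linear, i.e. where the local slope exceeds the
    slope of the initial (linear) region by slope_tolerance times."""
    if len(samples) < 3:
        raise ValueError("need at least three measurements")

    # Slope of the initial, assumed-linear region (first two samples).
    (bw0, lat0), (bw1, lat1) = samples[0], samples[1]
    baseline_slope = (lat1 - lat0) / (bw1 - bw0)

    for (bw_prev, lat_prev), (bw, lat) in zip(samples[1:], samples[2:]):
        local_slope = (lat - lat_prev) / (bw - bw_prev)
        # Non-linear knee: latency now rises much faster per unit of
        # bandwidth than it did in the linear region.
        if local_slope > slope_tolerance * baseline_slope:
            return bw_prev  # first memory bandwidth at the saturation point

    return samples[-1][0]  # no knee observed in the measured range


# Example: latency grows ~1 unit per bandwidth unit, then jumps after 30.
samples = [(10, 100), (20, 110), (30, 120), (40, 160), (50, 260)]
print(find_saturation_point(samples))  # -> 30
```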
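The dispatch decision in the final step can be sketched the same way: additional data is sent only while a device's measured bandwidth sits below its saturation point and its observed latency slope stays within the predetermined threshold. The data class, the headroom margin, and the send callback are hypothetical scaffolding, not elements of the claimed method.

```python
# Hypothetical sketch of the "send additional data" decision. All names,
# the headroom factor, and the send callback are assumptions for
# illustration only.

from dataclasses import dataclass


@dataclass
class DeviceStatus:
    device_id: str
    current_bandwidth: float      # measured over the reporting period
    saturation_bandwidth: float   # first bandwidth at the saturation point
    latency_slope: float          # observed change in latency per unit bandwidth


def can_accept_more(status: DeviceStatus, max_latency_slope: float,
                    headroom: float = 0.9) -> bool:
    """True if the device can take additional data within the latency
    threshold: it operates below its saturation point (with some headroom)
    and its latency is not already rising faster than tolerated."""
    below_saturation = status.current_bandwidth < headroom * status.saturation_bandwidth
    within_threshold = status.latency_slope <= max_latency_slope
    return below_saturation and within_threshold


def dispatch_additional(statuses, batch, max_latency_slope, send):
    """Send the additional batch to the first device reporting capacity;
    return its id, or None if every device is at or past saturation."""
    for status in statuses:
        if can_accept_more(status, max_latency_slope):
            send(status.device_id, batch)
            return status.device_id
    return None  # no device has headroom; hold or queue the batch
```

Keeping the headroom factor below 1.0 is a conservative choice in this sketch, so that a device is never loaded right up to its saturation point on the basis of a single measurement window.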