US 11,836,642 B2
Method, system, and computer program product for dynamically scheduling machine learning inference jobs with different quality of services on a shared infrastructure
Yinhe Cheng, Austin, TX (US); Yu Gu, Austin, TX (US); Igor Karpenko, Dublin, CA (US); Peter Walker, Cedar Park, TX (US); Ranglin Lu, Austin, TX (US); and Subir Roy, Austin, TX (US)
Assigned to Visa International Service Association, San Francisco, CA (US)
Filed by Visa International Service Association, San Francisco, CA (US)
Filed on Dec. 23, 2022, as Appl. No. 18/088,193.
Application 18/088,193 is a continuation of application No. 16/745,932, filed on Jan. 17, 2020, granted, now Pat. No. 11,562,263.
Prior Publication US 2023/0130887 A1, Apr. 27, 2023
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 9/46 (2006.01); G06N 5/04 (2023.01); G06F 9/50 (2006.01); G06N 20/00 (2019.01)

CPC G06N 5/04 (2013.01) [G06F 9/5011 (2013.01); G06N 20/00 (2019.01)]

20 Claims

1. A computer-implemented method, comprising:

receiving or determining, with at least one processor, a plurality of performance profiles associated with a plurality of system resources, wherein each performance profile is associated with a machine learning model, wherein each performance profile for each system resource includes a latency associated with the machine learning model for that system resource, a throughput associated with the machine learning model for that system resource, and an availability of that system resource for processing an inference job associated with the machine learning model, and wherein the plurality of system resources includes at least one central processing unit (CPU) and at least one graphics processing unit (GPU);

receiving, with at least one processor, a request for system resources for the inference job associated with the machine learning model;

determining, with at least one processor, a system resource of the plurality of system resources for processing the inference job associated with the machine learning model based on the plurality of performance profiles and a quality of service requirement associated with the inference job; and

assigning, with at least one processor, the system resource to the inference job for processing the inference job, wherein the system resource assigned to the inference job executes the machine learning model associated with the inference job to process the inference job.
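
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. Every name in it (PerformanceProfile, QosRequirement, determine_resource, assign, the resource and model labels) is hypothetical, and the claim does not state the form of the quality of service requirement or how to choose among several qualifying resources. The sketch assumes a latency ceiling plus a throughput floor, and assumes that the slowest qualifying resource is chosen so that faster resources stay free for stricter jobs.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class PerformanceProfile:
    """One profile per (system resource, model) pair, as recited in claim 1."""
    resource_id: str         # hypothetical label, e.g. "cpu-0" or "gpu-0"
    model_id: str            # hypothetical model identifier
    latency_ms: float        # latency of the model on this resource
    throughput_per_s: float  # throughput of the model on this resource
    available: bool          # availability of the resource for an inference job


@dataclass(frozen=True)
class QosRequirement:
    """Assumed QoS shape: a latency ceiling and a throughput floor."""
    max_latency_ms: float
    min_throughput_per_s: float


def determine_resource(
    profiles: Sequence[PerformanceProfile],
    model_id: str,
    qos: QosRequirement,
) -> Optional[str]:
    """Pick a resource for the model from the profiles and the QoS requirement."""
    eligible = [
        p for p in profiles
        if p.model_id == model_id
        and p.available
        and p.latency_ms <= qos.max_latency_ms
        and p.throughput_per_s >= qos.min_throughput_per_s
    ]
    if not eligible:
        return None
    # Assumed policy (not from the claim): take the slowest resource that
    # still meets the requirement, keeping faster ones free for stricter jobs.
    return max(eligible, key=lambda p: p.latency_ms).resource_id


def assign(
    job_id: str,
    model_id: str,
    qos: QosRequirement,
    profiles: Sequence[PerformanceProfile],
    assignments: Dict[str, str],
) -> Optional[str]:
    """Record the chosen resource for the job; return None if nothing qualifies."""
    resource_id = determine_resource(profiles, model_id, qos)
    if resource_id is not None:
        assignments[job_id] = resource_id
    return resource_id


# Illustrative data only: one CPU and one GPU profile for the same model.
profiles = [
    PerformanceProfile("cpu-0", "model-a", 40.0, 100.0, True),
    PerformanceProfile("gpu-0", "model-a", 5.0, 2000.0, True),
]
assignments: Dict[str, str] = {}
strict = QosRequirement(max_latency_ms=10.0, min_throughput_per_s=500.0)
relaxed = QosRequirement(max_latency_ms=100.0, min_throughput_per_s=50.0)
strict_choice = assign("job-1", "model-a", strict, profiles, assignments)
relaxed_choice = assign("job-2", "model-a", relaxed, profiles, assignments)
```

With this illustrative data, the strict job is assigned to gpu-0 because only the GPU profile meets the 10 ms ceiling, while the relaxed job is assigned to cpu-0 under the assumed policy. A different selection policy, such as lowest latency first, would satisfy the same claim language equally well.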