US 12,001,511 B2
Systems and methods of resource configuration optimization for machine learning workloads
Lianjie Cao, Milpitas, CA (US); Faraz Ahmed, Milpitas, CA (US); Puneet Sharma, Milpitas, CA (US); and Ali Tariq, Palo Alto, CA (US)
Assigned to Hewlett Packard Enterprise Development LP, Spring, TX (US)
Filed by HEWLETT PACKARD ENTERPRISE DEVELOPMENT LP, Houston, TX (US)
Filed on Mar. 11, 2021, as Appl. No. 17/199,294.
Prior Publication US 2022/0292303 A1, Sep. 15, 2022
Int. Cl. G06F 18/214 (2023.01); G06F 9/50 (2006.01); G06F 11/30 (2006.01); G06F 11/34 (2006.01); G06F 18/2415 (2023.01); G06N 3/0464 (2023.01); G06N 3/063 (2023.01); G06N 3/0985 (2023.01); G06N 7/01 (2023.01); G06N 20/00 (2019.01); G06V 40/16 (2022.01)
CPC G06F 18/214 (2023.01) [G06F 9/5022 (2013.01); G06F 9/5027 (2013.01); G06F 9/505 (2013.01); G06F 9/5061 (2013.01); G06F 11/3414 (2013.01); G06F 18/24155 (2023.01); G06N 20/00 (2019.01)]
20 Claims
 
1. A computer-implemented method comprising:
determining a plurality of computing resource configurations used to perform machine learning model training jobs, wherein a computing resource configuration comprises:
a first tuple including numbers of worker nodes and parameter server nodes, and
a second tuple including resource allocations for the worker nodes and parameter server nodes;
executing at least one machine learning training job using a first computing resource configuration having a first set of values associated with the first tuple, wherein during the executing the machine learning training job:
monitors resource usage of the worker nodes and parameter server nodes caused by a second set of values associated with the second tuple; and
determines whether to adjust the second set of values;
determining whether a stopping criterion is satisfied; and
selecting one of the plurality of computing resource configurations.
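
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the class and function names, the doubling/halving adjustment rule, the utilization thresholds, the trial-budget stopping criterion, and the cost-based score are all assumptions made for illustration. The sketch models a configuration as two tuples (node counts and per-node allocations), runs a simulated job that monitors usage and decides whether to adjust the allocations, checks a stopping criterion, and selects one configuration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Topology:
    """First tuple: numbers of worker and parameter server nodes."""
    workers: int
    param_servers: int


@dataclass(frozen=True)
class Allocation:
    """Second tuple: per-node resource allocations (CPU cores, as an assumption)."""
    worker_cpu: float
    ps_cpu: float


@dataclass(frozen=True)
class ResourceConfig:
    topology: Topology
    allocation: Allocation


UsageFn = Callable[[Topology, Allocation], Tuple[float, float]]


def adjust_allocation(alloc: Allocation, worker_util: float, ps_util: float,
                      low: float = 0.5, high: float = 0.9) -> Allocation:
    """Hypothetical rule: double a saturated allocation, halve an idle one."""
    def scale(cpu: float, util: float) -> float:
        if util > high:
            return cpu * 2
        if util < low:
            return cpu / 2
        return cpu
    return Allocation(scale(alloc.worker_cpu, worker_util),
                      scale(alloc.ps_cpu, ps_util))


def run_training_job(config: ResourceConfig, measure_usage: UsageFn,
                     steps: int = 5) -> ResourceConfig:
    """Keep the first tuple fixed; monitor usage and adjust the second tuple."""
    alloc = config.allocation
    for _ in range(steps):
        worker_util, ps_util = measure_usage(config.topology, alloc)
        new_alloc = adjust_allocation(alloc, worker_util, ps_util)
        if new_alloc == alloc:  # decided that no adjustment is needed
            break
        alloc = new_alloc
    return ResourceConfig(config.topology, alloc)


def select_configuration(candidates: Iterable[ResourceConfig],
                         measure_usage: UsageFn,
                         score: Callable[[ResourceConfig], float],
                         max_trials: int) -> Optional[ResourceConfig]:
    """Try candidates until the (assumed) trial budget is spent; return the best."""
    best, best_score = None, float("-inf")
    for trial, config in enumerate(candidates):
        if trial >= max_trials:  # stopping criterion: trial budget exhausted
            break
        tuned = run_training_job(config, measure_usage)
        tuned_score = score(tuned)
        if tuned_score > best_score:
            best, best_score = tuned, tuned_score
    return best


def toy_usage(topology: Topology, alloc: Allocation) -> Tuple[float, float]:
    """Simulated monitor: fixed total demand split evenly across nodes."""
    worker_demand = 6.0 / topology.workers
    ps_demand = 2.0 / topology.param_servers
    return (min(1.0, worker_demand / alloc.worker_cpu),
            min(1.0, ps_demand / alloc.ps_cpu))


def negative_total_cpu(config: ResourceConfig) -> float:
    """Hypothetical score: prefer the configuration using the least total CPU."""
    t, a = config.topology, config.allocation
    return -(t.workers * a.worker_cpu + t.param_servers * a.ps_cpu)


candidates = [
    ResourceConfig(Topology(2, 1), Allocation(1.0, 1.0)),
    ResourceConfig(Topology(3, 1), Allocation(1.0, 1.0)),
]
best = select_configuration(candidates, toy_usage, negative_total_cpu, max_trials=10)
```

In this toy run both candidates settle at 4 CPU cores per node, so the two-worker topology is selected because its total allocation is lower. A real system would replace `toy_usage` with actual resource monitoring and would likely score configurations on measured training performance rather than cost alone.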