CPC G06F 18/214 (2023.01) [G06F 9/5022 (2013.01); G06F 9/5027 (2013.01); G06F 9/505 (2013.01); G06F 9/5061 (2013.01); G06F 11/3414 (2013.01); G06F 18/24155 (2023.01); G06N 20/00 (2019.01)] | 20 Claims |
1. A computer-implemented method comprising:
determining a plurality of computing resource configurations used to perform machine learning model training jobs, wherein a computing resource configuration comprises:
a first tuple including numbers of worker nodes and parameter server nodes, and
a second tuple including resource allocations for the worker nodes and parameter server nodes;
executing at least one machine learning training job using a first computing resource configuration having a first set of values associated with the first tuple, wherein, during the executing, the machine learning training job:
monitors resource usage of the worker nodes and parameter server nodes caused by a second set of values associated with the second tuple; and
determines whether to adjust the second set of values;
determining whether a stopping criterion is satisfied; and
selecting one of the plurality of computing resource configurations.
|
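The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the claimed implementation: the names `ResourceConfig`, `adjust`, `execute_job`, `search`, and `simulated_step` are hypothetical, the usage thresholds, the trial-budget stopping criterion, and the throughput-per-cost selection rule are assumptions chosen only for illustration, and the toy `simulated_step` stands in for a real training job.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Usage = Tuple[float, float]  # assumed: utilization fraction for (workers, parameter servers)


@dataclass(frozen=True)
class ResourceConfig:
    node_counts: Tuple[int, int]      # first tuple: (worker nodes, parameter server nodes)
    allocations: Tuple[float, float]  # second tuple: assumed CPU units per (worker, server)


StepFn = Callable[[ResourceConfig], Tuple[Usage, float]]


def adjust(alloc: float, usage: float, low: float = 0.5, high: float = 0.9) -> float:
    """Decide whether to change one allocation (thresholds are assumptions)."""
    if usage > high:
        return alloc * 1.5   # saturated: grant more
    if usage < low:
        return alloc * 0.75  # mostly idle: reclaim some
    return alloc


def execute_job(config: ResourceConfig, step_fn: StepFn, steps: int = 5):
    """Run one job with fixed node counts, monitoring usage and tuning allocations."""
    current = config
    for _ in range(steps):
        usage, _ = step_fn(current)
        new_alloc = tuple(adjust(a, u) for a, u in zip(current.allocations, usage))
        if new_alloc == current.allocations:
            break  # no adjustment needed
        current = ResourceConfig(current.node_counts, new_alloc)
    _, throughput = step_fn(current)  # final measurement on the tuned allocations
    return current, throughput


def cost(config: ResourceConfig) -> float:
    workers, servers = config.node_counts
    w_alloc, ps_alloc = config.allocations
    return workers * w_alloc + servers * ps_alloc


def search(candidates: List[ResourceConfig], step_fn: StepFn, budget: int) -> ResourceConfig:
    """Try configurations until the assumed trial budget is reached, then select one."""
    results = []
    for trials, cfg in enumerate(candidates, start=1):
        tuned, throughput = execute_job(cfg, step_fn)
        results.append((throughput / cost(tuned), tuned))
        if trials >= budget:  # stopping criterion (assumed: fixed number of trials)
            break
    return max(results, key=lambda r: r[0])[1]


def simulated_step(config: ResourceConfig) -> Tuple[Usage, float]:
    """Toy stand-in for a training step; the demand figures are invented."""
    workers, servers = config.node_counts
    w_alloc, ps_alloc = config.allocations
    w_demand = 2.0
    ps_demand = 0.5 * workers / servers
    usage = (min(1.0, w_demand / w_alloc), min(1.0, ps_demand / ps_alloc))
    throughput = workers * min(1.0, w_alloc / w_demand, ps_alloc / ps_demand)
    return usage, throughput


if __name__ == "__main__":
    candidates = [
        ResourceConfig((2, 1), (1.0, 1.0)),
        ResourceConfig((4, 1), (1.0, 1.0)),
        ResourceConfig((4, 2), (4.0, 4.0)),
    ]
    print(search(candidates, simulated_step, budget=3))
```

In this sketch the node counts stay fixed for the duration of a job and only the per-node allocations are adjusted while it runs, which mirrors the split between the first and second tuples in the claim. Any other stopping criterion or selection rule could be substituted in `search` without changing the overall structure.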