CPC G06F 18/2163 (2023.01) [G06F 9/4881 (2013.01); G06F 9/5066 (2013.01); G06F 9/54 (2013.01); G06F 18/211 (2023.01); G06N 20/00 (2019.01); G06F 2209/5017 (2013.01)] | 17 Claims |
1. A system for automatic partitioning of machine learning models and parallel execution management for training across a plurality of devices for different machine learning frameworks, the system comprising:
at least one processor; and
a memory, storing program instructions that when executed by the at least one processor, cause the at least one processor to:
receive a training job for a machine learning model that includes a request for automatic partitioning of the machine learning model across the plurality of devices, wherein the training job is a code file or a script;
evaluate the request to determine one feature specified in an optimization parameter included in the request for automatic partitioning, wherein the optimization parameter configures application of a partitioning technique applied to determine the different respective partitions in order to optimize the one feature specified in the optimization parameter out of a plurality of features that can be optimized;
determine different respective partitions of the machine learning model based, at least in part, on a number of partitions and the optimization parameter, and wherein to determine the different respective partitions of the machine learning model, the program instructions cause the at least one processor to:
execute a first training run to construct a version of the machine learning model in a central processing unit (CPU) memory; and
apply a selection of a tree-based partitioning algorithm or a graph-based partitioning algorithm using the constructed version of the machine learning model;
generate a schedule for executing the training job across the plurality of devices according to the different respective partitions of the machine learning model; and
cause the training job to be executed according to the schedule.
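The partition-determination step recited above (a first tracing run constructs the model in CPU memory, then a tree-based or graph-based algorithm splits it into a requested number of partitions according to an optimization parameter) can be sketched as follows. This is a minimal illustrative sketch, not the patented implementation: the `Node` structure, the greedy tree-based partitioner, and the convention of dispatching on an optimization parameter such as `"memory"` are all assumptions; the claim names no concrete APIs.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A traced model operator with an abstract cost (e.g. parameter memory)."""
    name: str
    cost: float
    children: list = field(default_factory=list)

def tree_partition(root: Node, num_partitions: int) -> dict:
    """Greedy tree-based split: walk the traced tree and assign each node
    to the currently lightest partition, balancing total cost."""
    loads = [0.0] * num_partitions
    assignment = {}

    def visit(node: Node):
        target = loads.index(min(loads))   # lightest partition so far
        assignment[node.name] = target
        loads[target] += node.cost
        for child in node.children:
            visit(child)

    visit(root)
    return assignment

def determine_partitions(root: Node, num_partitions: int,
                         optimization_parameter: str) -> dict:
    # The claim selects between a tree-based and a graph-based algorithm;
    # dispatching on the optimized feature is an assumed convention here.
    if optimization_parameter == "memory":
        return tree_partition(root, num_partitions)
    raise NotImplementedError("graph-based partitioning not sketched here")
```

A graph-based alternative would treat the traced model as a general dataflow graph and minimize cross-partition edge cost (as in METIS-style partitioners) rather than walking a tree greedily.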
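The schedule-generation and execution steps of the claim can likewise be sketched: given partition identifiers and a set of devices, emit an ordered list of per-device execution steps for one training iteration. The round-robin placement, the device names, and the pipeline-style forward/backward ordering are illustrative assumptions only.

```python
def generate_schedule(partition_ids: list, devices: list) -> list:
    """Map each partition to a device round-robin and order one iteration's
    steps: forward passes in partition order, backward passes in reverse."""
    placement = {p: devices[i % len(devices)]
                 for i, p in enumerate(partition_ids)}
    forward = [(placement[p], f"forward:{p}") for p in partition_ids]
    backward = [(placement[p], f"backward:{p}") for p in reversed(partition_ids)]
    return forward + backward

def execute(schedule: list) -> None:
    # Stand-in for dispatching each step to its assigned device.
    for device, step in schedule:
        print(f"{device} -> {step}")
```

For example, `generate_schedule([0, 1, 2], ["gpu0", "gpu1"])` yields six steps: three forward passes followed by three backward passes in reverse partition order, with partitions 0 and 2 placed on `gpu0` and partition 1 on `gpu1`.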