CPC G06N 20/00 (2019.01) [G06F 16/164 (2019.01)] | 20 Claims |
1. A computer-implemented method comprising:
receiving, at an endpoint of a multi-tenant provider network, one or more messages indicating a request to train a machine learning (ML) model;
initiating, by a model training system including memory storing model training system instructions and one or more processors for executing the model training system instructions, at least one training instance to train a machine learning (ML) model within a provider network using an iterative training process;
obtaining, by a training monitor of the model training system during the iterative training process of the ML model by the at least one training instance, training data including resource utilization data associated with the iterative training process and model state data including current weights associated with portions of the ML model at a current iteration in the iterative training process;
generating, by a ML analysis system separate from the model training system and based on the training data, including the resource utilization data associated with the iterative training process, and including the model state data including the current weights associated with portions of the ML model at the current iteration in the iterative training process, a plurality of feature importance metric values, each indicating a relative importance of a corresponding feature within the iterative training process, wherein the ML analysis system includes memory storing ML analysis system instructions and one or more processors for executing the ML analysis system instructions;
determining, by the ML analysis system based at least in part on a first of the plurality of feature importance metric values that corresponds to a first feature of the plurality of features, that a modification condition is satisfied;
causing the at least one training instance to modify a utilization or importance of the first feature in at least a subsequent iteration of the iterative training process, the modification affecting a numeric convergence of the training; and
storing, at a conclusion of the iterative training process, one or more model artifacts for the ML model at a location of a storage service of the provider network.
|
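The monitoring loop recited in the claim (obtain current weights during training, derive per-feature importance values, test a modification condition, and change a feature's utilization in later iterations) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the linear model, the normalized absolute weight used as the importance metric, the fixed `threshold` used as the modification condition, and the names `feature_importance`, `train`, and `make_data`. Resource utilization data, the separate analysis system, and artifact storage are omitted for brevity.

```python
import random


def feature_importance(weights):
    """Hypothetical metric: each feature's share of the total absolute weight."""
    total = sum(abs(w) for w in weights) or 1.0
    return [abs(w) / total for w in weights]


def make_data(n=200, seed=0):
    """Synthetic data where the third feature has no effect on the target."""
    rng = random.Random(seed)
    rows = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(n)]
    targets = [3 * r[0] - 2 * r[1] for r in rows]
    return rows, targets


def train(rows, targets, n_iter=300, lr=0.1, threshold=0.05, warmup=100):
    """Gradient descent on mean squared error with a per-iteration monitor."""
    n_feat = len(rows[0])
    weights = [0.0] * n_feat
    active = [True] * n_feat  # features still utilized by the trainer

    for it in range(n_iter):
        # Training step: accumulate the MSE gradient over active features only.
        grads = [0.0] * n_feat
        for x, y in zip(rows, targets):
            pred = sum(w * xi for w, xi, a in zip(weights, x, active) if a)
            err = pred - y
            for j in range(n_feat):
                if active[j]:
                    grads[j] += 2 * err * x[j] / len(rows)
        weights = [w - lr * g for w, g in zip(weights, grads)]

        # Monitor step: read the current weights (model state data) and
        # compute one importance value per feature.
        if it >= warmup:
            importance = feature_importance(weights)
            for j, score in enumerate(importance):
                # Assumed modification condition: importance below threshold.
                if active[j] and score < threshold:
                    active[j] = False  # stop utilizing this feature
                    weights[j] = 0.0

    return weights, active


if __name__ == "__main__":
    rows, targets = make_data()
    weights, active = train(rows, targets)
    print("active features:", active)
    print("weights:", [round(w, 3) for w in weights])
```

In this sketch the monitor waits for a warm-up period before testing the condition, so that features are not dropped on the basis of early, unconverged weights. Dropping the uninformative feature changes the gradient updates in every later iteration, which is one simple way a modification can affect convergence.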