CPC G06N 20/00 (2019.01) [G06N 5/046 (2013.01); G06F 11/36 (2013.01)] | 20 Claims |
1. A system, comprising:
one or more computing devices respectively comprising one or more processors configured to implement a machine learning training cluster comprising a plurality of training instances, wherein the machine learning training cluster, using the one or more processors, is configured to:
train a machine learning model; and
iteratively collect data produced from the plurality of training instances from the training of the machine learning model on the one or more computing devices for debugging the training of the machine learning model, wherein the data produced from the training of the machine learning model is collected using agent software on the one or more computing devices, and wherein the data produced from the training of the machine learning model comprises tensor-level numerical values; and
one or more computing devices comprising one or more processors configured to implement a machine learning analysis system, wherein the machine learning analysis system, using the one or more processors, is configured to, for respective iterations of the data collected from the training of the machine learning model:
aggregate the data produced from the plurality of training instances for a plurality of iterations from the training of the machine learning model to reduce the amount of data produced from the training of the machine learning model for analysis, wherein aggregating the data produced from the training of the machine learning model comprises generating one or more aggregated tensor-level numerical values from the tensor-level numerical values, wherein the one or more aggregated tensor-level numerical values comprise at least one of a minimum tensor-level numerical value from the tensor-level numerical values, a maximum tensor-level numerical value from the tensor-level numerical values, or an average tensor-level numerical value of the tensor-level numerical values;
perform an analysis of the aggregated data produced from the training of the machine learning model to detect one or more problems associated with the training of the machine learning model for debugging the training of the machine learning model, wherein the analysis of aggregated data comprises a comparison between one or more aggregated tensor-level numerical values from the aggregated data and one or more threshold values;
detect the one or more problems associated with the training of the machine learning model based at least in part on the analysis of the aggregated data; and
generate one or more alarms describing the one or more problems from the training of the machine learning model.
|
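The aggregation, threshold comparison, and alarm steps recited in claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation: the names `TensorSummary`, `aggregate`, and `detect_problems`, the threshold keys `max_abs` and `min_abs`, and the "exploding" and "vanishing" problem labels are all hypothetical, since the claim does not name specific thresholds or problem types.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Dict, List, Sequence


@dataclass
class TensorSummary:
    """Aggregated statistics for one tensor at one training iteration."""
    name: str
    iteration: int
    minimum: float
    maximum: float
    average: float


def aggregate(name: str, iteration: int,
              per_instance_values: Sequence[Sequence[float]]) -> TensorSummary:
    """Reduce raw tensor values from all training instances to min, max, and average."""
    flat = [v for values in per_instance_values for v in values]
    return TensorSummary(name, iteration, min(flat), max(flat), fmean(flat))


def detect_problems(summary: TensorSummary,
                    thresholds: Dict[str, float]) -> List[str]:
    """Compare aggregated values against thresholds and return alarm messages."""
    alarms: List[str] = []
    # Largest magnitude seen in the tensor, derived from the aggregated min and max.
    largest = max(abs(summary.minimum), abs(summary.maximum))
    if largest > thresholds["max_abs"]:  # hypothetical upper threshold
        alarms.append(
            f"iteration {summary.iteration}: {summary.name} magnitude "
            f"{largest:g} exceeds {thresholds['max_abs']:g} (possible exploding values)"
        )
    if largest < thresholds["min_abs"]:  # hypothetical lower threshold
        alarms.append(
            f"iteration {summary.iteration}: {summary.name} magnitude "
            f"{largest:g} is below {thresholds['min_abs']:g} (possible vanishing values)"
        )
    return alarms


# Example: gradient values reported by two training instances at iteration 7.
gradients = [[0.01, -0.02], [250.0, 0.03]]
summary = aggregate("layer1/grad", 7, gradients)
for alarm in detect_problems(summary, {"max_abs": 100.0, "min_abs": 1e-7}):
    print(alarm)
```

Only the three aggregated values per tensor are passed to the analysis step, which mirrors the claim's stated purpose of reducing the amount of data analyzed. A real system would also need to handle cases this sketch ignores, such as empty tensors or NaN values.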