| CPC G06F 11/3684 (2013.01) [G06F 11/3688 (2013.01); G06F 11/3692 (2013.01); G06F 9/5077 (2013.01); G06F 9/5083 (2013.01)] | 17 Claims |

|
1. A system comprising:
at least one hardware processor;
a workload abstractor implemented using the at least one hardware processor and configured for:
receiving monitored traffic in a distributed computing system performing a machine learning task, wherein the machine learning task includes using the distributed computing system to train a machine learning model or perform inferencing using the machine learning model;
generating, using the monitored traffic, a test environment-agnostic workload model for the machine learning task, wherein generating, using the monitored traffic, the test environment-agnostic workload model for the machine learning task comprises removing one or more deployment-specific dependencies and attributes from the monitored traffic, wherein removing the one or more deployment-specific dependencies includes removing attributes relating to network configuration used by the distributed computing system; and
storing the test environment-agnostic workload model in a workload model repository with one or more other workload models; and
a test controller implemented using the at least one hardware processor and configured for:
selecting a test case for the machine learning task and a testbed mode for the test case;
executing the test case by translating the test environment-agnostic workload model into a testbed-specific workload model for the testbed mode, including generating an input feed stream and providing the input feed stream to a testbed corresponding to the testbed mode to test at least one aspect of a machine learning cluster that executes the machine learning task and uses a different transport or network topology than the distributed computing system; and
reporting, based on executing the test case, one or more performance metrics for the machine learning task.
|
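The data flow recited in claim 1 (abstract the monitored traffic, store the model, translate it for a chosen testbed mode, report metrics) can be illustrated with a minimal sketch. This is not the claimed implementation. The attribute names (`src_ip`, `dst_ip`, `transport`, and so on), the `WorkloadModel` class, the dictionary used as a repository, and the `summarize_feed` stand-in for metric reporting are all hypothetical, chosen only to make the steps concrete.

```python
from dataclasses import dataclass

# Assumption: attribute names treated as deployment-specific. The claim only
# requires that network-configuration attributes be among those removed.
DEPLOYMENT_SPECIFIC_KEYS = {"src_ip", "dst_ip", "vlan", "transport", "switch_port"}


@dataclass
class WorkloadModel:
    """Hypothetical test environment-agnostic workload model."""
    task: str
    events: list  # one dict per monitored message, deployment attributes removed


def abstract_workload(task, monitored_traffic):
    """Drop deployment-specific attributes from each monitored record."""
    events = [
        {k: v for k, v in record.items() if k not in DEPLOYMENT_SPECIFIC_KEYS}
        for record in monitored_traffic
    ]
    return WorkloadModel(task=task, events=events)


def translate_for_testbed(model, testbed_mode, testbed_config):
    """Yield an input feed stream: each abstract event plus testbed-specific attributes."""
    for event in model.events:
        yield {**event, "testbed_mode": testbed_mode, **testbed_config}


def summarize_feed(feed):
    """Stand-in for metric reporting; a real testbed would measure latency, throughput, etc."""
    return {"messages": len(feed), "total_bytes": sum(e["bytes"] for e in feed)}


# Hypothetical monitored traffic from a training job running over RoCEv2.
traffic = [
    {"op": "all_reduce", "bytes": 4096, "src_rank": 0, "dst_rank": 1,
     "src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "transport": "RoCEv2"},
    {"op": "all_gather", "bytes": 2048, "src_rank": 1, "dst_rank": 0,
     "src_ip": "10.0.0.2", "dst_ip": "10.0.0.1", "transport": "RoCEv2"},
]

repository = {}                                   # workload model repository (assumed: a dict)
model = abstract_workload("training", traffic)
repository[model.task] = model

# Test controller side: pick a testbed mode that uses a different transport.
feed = list(translate_for_testbed(repository["training"], "emulated", {"transport": "TCP"}))
metrics = summarize_feed(feed)
```

In this sketch the stored model keeps only collective-operation semantics (`op`, `bytes`, ranks), so the same model could be translated for a testbed with a different transport or topology simply by passing a different `testbed_config`.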