CPC G06F 9/5083 (2013.01) [G06N 5/04 (2013.01); G06N 20/00 (2019.01)] | 18 Claims |
1. A method for managing containers in a machine learning (ML) serving infrastructure, the method comprising:
receiving or detecting an update of container metrics including resource usage and serviced requests per ML model or per container, where a plurality of ML models are hosted by and distributed amongst a plurality of containers;
processing the container metrics per ML model or per container to determine recent resource usage and serviced requests per ML model or per container;
rebalancing the distribution of ML models to containers in response to detecting a load imbalance between containers or detecting a stressed container;
identifying the plurality of containers as available to execute ML models;
updating an expected model assignment for each container in the plurality of containers;
sending the expected model assignment to a container manager to implement loading or unloading of ML models at each container of the plurality of containers;
updating the expected model assignment for each container in the plurality of containers in response to the rebalancing of the distribution of ML models to the plurality of containers; and
sending the updated expected model assignment to the container manager to implement moving of ML models between containers according to the updated expected model assignment.
|
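The rebalancing steps recited in claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation: the claim specifies neither how a load imbalance or a stressed container is detected nor which model is moved. The names `ContainerMetrics`, `find_move`, `rebalance`, `STRESS_THRESHOLD`, and `IMBALANCE_GAP`, the threshold values, and the choice to move the least-requested model from the busiest container to the idlest one are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical thresholds; the claim does not specify any values.
STRESS_THRESHOLD = 0.85  # usage at or above this marks a container as stressed
IMBALANCE_GAP = 0.30     # usage gap between containers that counts as imbalance


@dataclass
class ContainerMetrics:
    """Recent metrics for one container (assumed shape)."""
    resource_usage: float              # fraction of capacity in use, 0.0 to 1.0
    requests_by_model: dict[str, int]  # recently serviced requests per ML model


def find_move(metrics: dict[str, ContainerMetrics]) -> tuple[str, str, str] | None:
    """Return (model, source, target) if rebalancing is warranted, else None."""
    if len(metrics) < 2:
        return None
    busiest = max(metrics, key=lambda c: metrics[c].resource_usage)
    idlest = min(metrics, key=lambda c: metrics[c].resource_usage)
    if busiest == idlest:
        return None
    gap = metrics[busiest].resource_usage - metrics[idlest].resource_usage
    stressed = metrics[busiest].resource_usage >= STRESS_THRESHOLD
    if not stressed and gap < IMBALANCE_GAP:
        return None
    models = metrics[busiest].requests_by_model
    if len(models) < 2:
        # Moving the only hosted model would merely relocate the load.
        return None
    # Assumption: move the least-requested model to limit disruption.
    model = min(models, key=lambda m: models[m])
    return model, busiest, idlest


def rebalance(
    assignment: dict[str, set[str]],
    metrics: dict[str, ContainerMetrics],
) -> dict[str, set[str]]:
    """Return an updated expected model assignment; the input is not mutated."""
    updated = {container: set(models) for container, models in assignment.items()}
    move = find_move(metrics)
    if move is not None:
        model, source, target = move
        updated[source].discard(model)
        updated[target].add(model)
    return updated
```

In this sketch, the returned mapping plays the role of the updated expected model assignment. A container manager (not modelled here) would compare it with what each container currently hosts and load or unload models accordingly.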