US 11,715,033 B2
Dynamically scaled training fleets for machine learning
Leo Parker Dirac, Seattle, WA (US); Rakesh Madhavan Nambiar, Seattle, WA (US); and Avinash Aghoram Ravichandran, Seattle, WA (US)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Jan. 14, 2020, as Appl. No. 16/742,768.
Application 16/742,768 is a continuation of application No. 14/720,166, filed on May 22, 2015, granted, now Pat. No. 10,540,608.
Prior Publication US 2020/0151606 A1, May 14, 2020
This patent is subject to a terminal disclaimer.
Int. Cl. G06N 5/04 (2023.01); G06F 30/20 (2020.01); G06N 3/08 (2023.01); G06N 20/00 (2019.01)
CPC G06N 20/00 (2019.01) [G06N 5/04 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A method, comprising:
performing, at one or more computing devices:
providing access to a respective partition of a training data set of a machine learning model to a plurality of computing resources, including a first computing resource and a second computing resource, wherein the first computing resource is assigned to perform operations of a training technique on a first partition of the training data set, and wherein the second computing resource is assigned to perform operations of the training technique on a second partition of the training data set;
executing a training phase of the machine learning model on the first computing resource and the second computing resource according to the training technique;
detecting, during the training phase of the machine learning model, that a measure of progress of operations of the training technique through the first partition at the first computing resource exceeds a measure of progress of operations of the training technique through the second partition at the second computing resource;
configuring, during the training phase, based at least in part on said detecting, one or more additional computing resources to perform at least a subset of remaining operations of the training technique on the second partition; and
executing, on the one or more additional computing resources, the at least a subset of remaining operations of the training technique on the second partition.
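
Claim 1 describes detecting, during a training phase, that one computing resource has progressed further through its partition of the training data than another, and then configuring additional resources to take over the remaining work on the lagging partition. The following is a minimal, self-contained Python sketch of that general control flow. It is illustrative only and not the patented implementation; every name in it (Worker, rebalance, train_phase, lag_threshold, worker-A, extra-1, and so on) is hypothetical, and the training work itself is simulated by advancing a cursor through a list of example indices.

```python
# Illustrative sketch only -- not the patented implementation. All names are
# hypothetical, and training work is simulated by advancing a cursor.
from dataclasses import dataclass
from typing import List


@dataclass
class Worker:
    """A computing resource assigned to one partition of the training data set."""
    name: str
    partition: List[int]   # indices of the training examples in this partition
    cursor: int = 0        # how far the training technique has progressed
    scaled: bool = False   # True once this partition's remainder has been redistributed

    def progress(self) -> float:
        """Measure of progress: fraction of the partition already processed."""
        return self.cursor / len(self.partition) if self.partition else 1.0

    def step(self, batch: int) -> None:
        """Perform one round of training-technique operations on the next batch."""
        self.cursor = min(self.cursor + batch, len(self.partition))


def rebalance(slow: Worker, extra_names: List[str]) -> List[Worker]:
    """Hand the slow worker's remaining examples to additional computing resources."""
    remaining = slow.partition[slow.cursor:]
    slow.partition = slow.partition[:slow.cursor]   # slow worker keeps only finished work
    shares = [remaining[i::len(extra_names)] for i in range(len(extra_names))]
    return [Worker(name, share, scaled=True) for name, share in zip(extra_names, shares)]


def train_phase(workers: List[Worker], batch: int = 8,
                lag_threshold: float = 0.25) -> List[Worker]:
    """Run one training phase, scaling out when the progress gap exceeds the threshold."""
    pool = list(workers)
    spares = iter(["extra-1", "extra-2"])   # stand-ins for provisionable resources
    while any(w.progress() < 1.0 for w in pool):
        for w in pool:
            # A smaller per-step budget for worker-B simulates slower hardware.
            w.step(batch if w.name != "worker-B" else batch // 4)
        fastest = max(pool, key=lambda w: w.progress())
        slowest = min(pool, key=lambda w: w.progress())
        if (fastest.progress() - slowest.progress() > lag_threshold
                and slowest.progress() < 1.0 and not slowest.scaled):
            extra = next(spares, None)
            if extra is not None:
                pool.extend(rebalance(slowest, [extra]))
    return pool


if __name__ == "__main__":
    data = list(range(200))
    pool = train_phase([Worker("worker-A", data[:100]), Worker("worker-B", data[100:])])
    print([(w.name, len(w.partition), w.progress()) for w in pool])
```

In the claimed method the additional computing resources would be actual instances configured during the training phase and handed the model state along with the unprocessed remainder of the second partition; the sketch stands in for that hand-off by splitting a list of example indices among new Worker objects.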