CPC G06F 9/4887 (2013.01) [G06F 9/5038 (2013.01); G06F 9/5072 (2013.01); H04L 67/1044 (2013.01)] | 20 Claims |
1. A computer-implemented method when executed on data processing hardware of a job scheduler that causes the job scheduler to perform operations comprising:
receiving a request to perform a job comprising one or more steps;
initiating performance of one step from the one or more steps of the job by a first worker system of a plurality of distributed worker systems, the plurality of distributed worker systems executing on respective different compute nodes of a plurality of compute nodes of a distributed computing system, the first worker system executing on a first compute node of the plurality of distributed compute nodes;
receiving, from the first worker system of the plurality of distributed worker systems, a status update comprising results of performing the one step from the one or more steps of the job;
storing the status update to a shared data store shared among each worker system of the plurality of distributed worker systems;
determining, based on the status update, that the one step from the one or more steps of the job failed; and
in response to determining that the one step from the one or more steps of the job failed, initiating performance of the one step from the one or more steps of the job by a second worker system of the plurality of distributed worker systems using the status update stored at the shared data store, the second worker system executing on a second compute node of the plurality of distributed compute nodes, the second compute node different from the first compute node.
|