US 12,236,268 B2
Distributed job scheduling system
Ilya Beyer, Mill Valley, CA (US); Ievgen Ignatiev, Kyiv (UA); and Maksym Skrynnik, Kyiv (UA)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Oct. 27, 2021, as Appl. No. 17/452,571.
Application 17/452,571 is a continuation of application No. 16/137,921, filed on Sep. 21, 2018, granted, now 11,182,209.
Prior Publication US 2022/0050713 A1, Feb. 17, 2022
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 9/48 (2006.01); G06F 9/50 (2006.01); H04L 67/104 (2022.01)

CPC G06F 9/4887 (2013.01) [G06F 9/5038 (2013.01); G06F 9/5072 (2013.01); H04L 67/1044 (2013.01)]

20 Claims

1. A computer-implemented method when executed on data processing hardware of a job scheduler that causes the job scheduler to perform operations comprising:

receiving a request to perform a job comprising one or more steps;

initiating performance of one step from the one or more steps of the job by a first worker system of a plurality of distributed worker systems, the plurality of distributed worker systems executing on respective different compute nodes of a plurality of compute nodes of a distributed computing system, the first worker system executing on a first compute node of the plurality of distributed compute nodes;

receiving, from the first worker system of the plurality of distributed worker systems, a status update comprising results of performing the one step from the one or more steps of the job;

storing the status update to a shared data store shared among each worker system of the plurality of distributed worker systems;

determining, based on the status update, that the one step from the one or more steps of the job failed; and

in response to determining that the one step from the one or more steps of the job failed, initiating performance of the one step from the one or more steps of the job by a second worker system of the plurality of distributed worker systems using the status update stored at the shared data store, the second worker system executing on a second compute node of the plurality of distributed compute nodes, the second compute node different from the first compute node.
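
The operations recited in claim 1 can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the patented implementation: the names `StatusUpdate`, `SharedStore`, `Worker`, and `JobScheduler` are hypothetical, the workers run in-process rather than on separate compute nodes, and the shared data store is modeled as an in-memory dictionary. The choice of the first worker and the `items_done` progress field are likewise assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class StatusUpdate:
    """Results reported by a worker after attempting one step."""
    step: str
    worker: str
    node: str
    succeeded: bool
    results: dict = field(default_factory=dict)


class SharedStore:
    """Hypothetical stand-in for the data store shared among all workers."""

    def __init__(self) -> None:
        self._updates: Dict[str, List[StatusUpdate]] = {}

    def save(self, update: StatusUpdate) -> None:
        self._updates.setdefault(update.step, []).append(update)

    def latest(self, step: str) -> Optional[StatusUpdate]:
        history = self._updates.get(step, [])
        return history[-1] if history else None


# A handler takes (step, previous status update) and returns (succeeded, results).
Handler = Callable[[str, Optional[StatusUpdate]], Tuple[bool, dict]]


@dataclass
class Worker:
    """Simulated worker system; `node` names the compute node it runs on."""
    name: str
    node: str
    handler: Handler

    def perform(self, step: str, previous: Optional[StatusUpdate]) -> StatusUpdate:
        succeeded, results = self.handler(step, previous)
        return StatusUpdate(step, self.name, self.node, succeeded, results)


class JobScheduler:
    def __init__(self, workers: List[Worker], store: SharedStore) -> None:
        self.workers = workers
        self.store = store

    def run_job(self, steps: List[str]) -> List[StatusUpdate]:
        """Run each step; on failure, retry once on a worker on another node."""
        final: List[StatusUpdate] = []
        for step in steps:
            first = self.workers[0]  # assumption: always start with worker 0
            update = first.perform(step, None)
            self.store.save(update)  # store the status update
            if not update.succeeded:  # determine failure from the update
                # Pick a second worker on a different compute node and hand it
                # the stored status update so it can build on the first attempt.
                second = next(w for w in self.workers if w.node != first.node)
                update = second.perform(step, self.store.latest(step))
                self.store.save(update)
            final.append(update)
        return final


def flaky(step: str, previous: Optional[StatusUpdate]) -> Tuple[bool, dict]:
    # Simulates a worker that fails after finishing 2 of 5 items.
    return False, {"items_done": 2}


def steady(step: str, previous: Optional[StatusUpdate]) -> Tuple[bool, dict]:
    # Resumes from the progress recorded in the stored status update, if any.
    start = previous.results.get("items_done", 0) if previous else 0
    return True, {"items_done": 5, "resumed_from": start}


store = SharedStore()
scheduler = JobScheduler(
    [Worker("worker-1", "node-a", flaky), Worker("worker-2", "node-b", steady)],
    store,
)
final_updates = scheduler.run_job(["copy-files"])
```

In this sketch the second worker reads the first worker's stored status update and resumes from the recorded progress instead of starting over, which is one plausible reading of "using the status update stored at the shared data store."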