CPC G06F 11/1407 (2013.01) [G06F 11/0772 (2013.01); H04L 67/1029 (2013.01)] | 20 Claims |
1. A distributed computer system comprising:
a plurality of computer nodes, wherein the plurality of computer nodes comprise respective programs to cooperate to perform a workload, and wherein a first computer node of the plurality of computer nodes comprises:
a communication proxy between the program of the first computer node and a communication library that supports communications between the program of the first computer node and the programs of other computer nodes of the plurality of computer nodes; and
a fault management service to:
monitor a health of the other computer nodes, and
in response to a detection of a fault of a second computer node of the plurality of computer nodes, relaunch the communication proxy,
wherein the relaunched communication proxy is to select, from a plurality of states, a common state to which the programs are to roll back.
|
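Illustrative sketch (not part of the claim): the final limitation of claim 1 recites that the relaunched communication proxy selects, from a plurality of states, a common state to which the programs roll back. The claim does not say how that selection is made. One possible reading, assumed here purely for illustration, is that each program keeps a set of numbered checkpoints and the proxy picks the newest checkpoint that every participating node still holds. The function name `select_common_state`, the node names, and the integer checkpoint ids are all hypothetical.

```python
from typing import Dict, Iterable, Optional, Set


def select_common_state(checkpoints_by_node: Dict[str, Iterable[int]]) -> Optional[int]:
    """Return the newest checkpoint id held by every node, or None if there is none.

    Assumption: checkpoint ids are integers that increase over time, so the
    largest id shared by all nodes is the most recent common state.
    """
    common: Optional[Set[int]] = None
    for ids in checkpoints_by_node.values():
        node_ids = set(ids)
        # Intersect each node's checkpoints with the running common set.
        common = node_ids if common is None else common & node_ids
    if not common:
        # No nodes were given, or the nodes share no checkpoint.
        return None
    return max(common)


if __name__ == "__main__":
    # Hypothetical checkpoint inventory reported after a fault is detected.
    saved = {
        "node0": [1, 2, 3, 4],
        "node1": [1, 2, 3],
        "node2": [2, 3, 4],
    }
    print(select_common_state(saved))  # prints 3
```

In this sketch, checkpoints 2 and 3 are the only ones present on all three nodes, so the programs would roll back to checkpoint 3. Other selection policies (for example, a policy based on timestamps or on which states passed validation) would be equally consistent with the claim language.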