US 11,966,292 B2
Fault management in a distributed computer system
Nicholas Hill, Bloomingdale, MN (US); Peter J. Mendygral, Bloomingdale, MN (US); Kent D. Lee, Bloomington, MN (US); and Benjamin James Keen, Enfield, CT (US)
Assigned to Hewlett Packard Enterprise Development LP, Spring, TX (US)
Filed by HEWLETT PACKARD ENTERPRISE DEVELOPMENT LP, Houston, TX (US)
Filed on May 27, 2022, as Appl. No. 17/804,392.
Prior Publication US 2023/0385152 A1, Nov. 30, 2023
Int. Cl. G06F 11/00 (2006.01); G06F 11/07 (2006.01); G06F 11/14 (2006.01); H04L 67/1029 (2022.01)
CPC G06F 11/1407 (2013.01) [G06F 11/0772 (2013.01); H04L 67/1029 (2013.01)]
20 Claims
 
1. A distributed computer system comprising:
a plurality of computer nodes, wherein the plurality of computer nodes comprise respective programs to cooperate to perform a workload, and wherein a first computer node of the plurality of computer nodes comprises:
a communication proxy between the program of the first computer node and a communication library that supports communications between the program of the first computer node and the programs of other computer nodes of the plurality of computer nodes; and
a fault management service to:
monitor a health of the other computer nodes, and
in response to a detection of a fault of a second computer node of the plurality of computer nodes, relaunch the communication proxy,
wherein the relaunched communication proxy is to select, from a plurality of states, a common state to which the programs are to roll back.
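
The following Python sketch is illustrative only and is not taken from the patent. It models one possible reading of claim 1: a fault management service on the first node checks the health of the other nodes, relaunches the communication proxy when it detects a fault, and the relaunched proxy selects a common state for rollback. All names (`Node`, `FaultManagementService`, `select_common_state`, `poll`) are hypothetical, and the selection rule used here (the newest checkpoint held by every surviving node) is an assumption; the claim only requires that a common state be selected from a plurality of states.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for one computer node and its saved states."""
    name: str
    healthy: bool = True
    # Identifiers of the states (checkpoints) this node's program has saved.
    checkpoints: set = field(default_factory=set)


def select_common_state(nodes):
    """Return the newest checkpoint id held by every healthy node, or None.

    Assumption: larger ids are newer, and "common" means present on all
    surviving nodes. The claim does not prescribe this rule.
    """
    survivors = [n for n in nodes if n.healthy]
    if not survivors:
        return None
    common = set.intersection(*(n.checkpoints for n in survivors))
    return max(common) if common else None


class FaultManagementService:
    """Hypothetical service running on the first computer node."""

    def __init__(self, local, peers):
        self.local = local
        self.peers = peers
        self.proxy_launches = 1      # the proxy is launched once at start-up
        self.rollback_target = None  # state chosen by the relaunched proxy
        self._handled = set()        # peer names whose faults were already handled

    def poll(self):
        """Check each peer once; on a new fault, relaunch the proxy.

        Returns the names of peers newly detected as faulty.
        """
        new_faults = [p.name for p in self.peers
                      if not p.healthy and p.name not in self._handled]
        if new_faults:
            self._handled.update(new_faults)
            self.proxy_launches += 1  # models "relaunch the communication proxy"
            # The relaunched proxy selects the common rollback state.
            self.rollback_target = select_common_state([self.local] + self.peers)
        return new_faults


# Usage: node2 fails; the survivors node0 and node1 share checkpoints 1 and 2.
local = Node("node0", checkpoints={1, 2, 3})
peers = [Node("node1", checkpoints={1, 2}), Node("node2", checkpoints={1, 2, 3})]
service = FaultManagementService(local, peers)
peers[1].healthy = False
detected = service.poll()
print(detected, service.rollback_target)
```

In this sketch the failed node is excluded from the intersection, so the rollback target depends only on states the surviving programs can all restore. A real system would also need to decide how the failed node's share of the workload is recovered, which the sketch leaves out.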