| CPC G06F 3/0617 (2013.01) [G06F 3/0647 (2013.01); G06F 3/0679 (2013.01); G06F 12/0828 (2013.01); G06F 2212/271 (2013.01); G06F 2212/621 (2013.01)] | 18 Claims |

|
1. A method for replacing a failing node with a spare node in a non-uniform memory access (NUMA) system, the method comprising:
in response to determining that a node-migration condition is met, initializing a node controller of the spare node such that accesses to a memory local to the spare node are to be processed by the node controller;
quiescing the failing node and the spare node to allow state information of processors on the failing node to be migrated to processors on the spare node;
subsequent to unquiescing the failing node and the spare node, migrating data from the failing node to the spare node while maintaining cache coherence in the NUMA system and while the NUMA system remains in operation, thereby facilitating continuous execution of processes previously executed on the failing node;
maintaining, at the node controller of the spare node, a partial directory of the local memory;
wherein initializing the node controller comprises marking every cache line in the local memory as corrupted; and
wherein migrating the data comprises, in response to determining that a requested cache line in the local memory of the spare node is marked as corrupted, coherently fetching the cache line from the failing node and writing the fetched cache line to the local memory of the spare node.
|
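The corrupted-line mechanism recited in claim 1 can be illustrated with a minimal sketch. It is a toy software model, not the claimed hardware implementation: the class names `FailingNode` and `SpareNodeController`, the `coherent_read` method, the `fetches` counter, and the `LINE_COUNT` constant are all hypothetical, and the quiesce/unquiesce steps, processor-state migration, and the partial directory are omitted. It shows only that marking every local line as corrupted at initialization causes the first access to each line to be fetched from the failing node and written locally, after which the line is served from the spare node's own memory.

```python
from dataclasses import dataclass, field

LINE_COUNT = 8  # hypothetical number of cache lines per node


@dataclass
class FailingNode:
    """Hypothetical source node that still holds the authoritative data."""
    memory: dict[int, bytes]
    fetches: int = 0  # counts fetches served, for illustration only

    def coherent_read(self, line: int) -> bytes:
        # Stand-in for a coherent fetch over the NUMA fabric (assumption).
        self.fetches += 1
        return self.memory[line]


@dataclass
class SpareNodeController:
    """Hypothetical model of the spare node's node controller."""
    source: FailingNode
    memory: dict[int, bytes] = field(default_factory=dict)
    corrupted: set[int] = field(default_factory=set)

    def initialize(self, line_count: int) -> None:
        # Mark every cache line in local memory as corrupted, so the
        # first access to each line is redirected to the failing node.
        self.corrupted = set(range(line_count))

    def read(self, line: int) -> bytes:
        if line in self.corrupted:
            # Fetch the line from the failing node, write it to local
            # memory, and clear the corrupted mark.
            self.memory[line] = self.source.coherent_read(line)
            self.corrupted.discard(line)
        return self.memory[line]


failing = FailingNode({i: bytes([i]) * 4 for i in range(LINE_COUNT)})
spare = SpareNodeController(source=failing)
spare.initialize(LINE_COUNT)

first = spare.read(3)   # line 3 is marked corrupted: fetched from failing node
second = spare.read(3)  # line 3 is now local: no further fetch
```

In this model, data moves on demand, one line per first access, which is why the system can keep running during migration rather than waiting for a bulk copy.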