| CPC G06F 11/0793 (2013.01) [G06F 11/0727 (2013.01); G06F 16/1744 (2019.01)] | 12 Claims |

|
1. A method for improving fault tolerance and computational efficiency in a distributed parallel computing environment when processing extremely large data sets, the method comprising the steps of:
at a plurality of mappers operating in parallel in a computing cluster having at least one master node and a plurality of worker nodes configured with solid state storage devices to facilitate parallel processing, wherein each mapper is configured to perform a map function comprising filtering and sorting on a subset of an input data set to produce key-value pairs, executing a map function against each subset of the input data set of at least 150 Terabytes in parallel to produce an initial mapped data set;
executing a shuffle phase against the initial mapped data set to produce a shuffled data set by redistributing the initial mapped data set based on the key-value pairs such that all data belonging to a particular key is located on a common worker node in the distributed parallel computing environment;
at a plurality of reducers operating in parallel in the computing cluster, wherein each reducer is configured to apply a summation to its key-value pairs of the shuffled data set to convert the key-value pairs into a final key-value pair set, executing at each reducer in parallel a reduce function against the shuffled data set to produce a reduced data set comprising the final key-value pair set;
compressing the reduced data set using a non-block compression algorithm that maintains file integrity and prevents file splitting during subsequent processing to produce a plurality of compressed files, wherein the non-block compression ensures that each compressed file can be processed as a complete unit by a single map-only task;
assigning each of the plurality of compressed files to a separate map-only task within a single map-only reducer to produce a plurality of partial results in parallel;
aggregating each of the plurality of partial results to produce a final result set; and
in response to detecting a failure at one of the map-only tasks, rerunning only the failed map-only task without rerunning all map and reduce functions, thereby improving fault tolerance of the distributed computing environment by eliminating redundant processing and reducing recovery time by at least fifty percent compared to traditional MapReduce implementations that require complete reprocessing after failures.
|
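The flow recited in claim 1 can be illustrated with a small single-process sketch. It is not the patented implementation: it runs sequentially on toy data rather than in parallel on a cluster, and every function name (`map_fn`, `shuffle`, `write_compressed`, `run_map_only`, `run_pipeline`) is hypothetical. It also assumes gzip as a stand-in for the claimed "non-block compression", because a gzip file is not splittable and must be read as a whole, and it assumes a simple bounded retry loop for the recovery step.

```python
import gzip
import json
import os
import tempfile
from collections import defaultdict


def map_fn(record):
    # Filter (alphabetic tokens only) and emit (key, 1) pairs.
    return [(w.lower(), 1) for w in record.split() if w.isalpha()]


def run_map(splits):
    # One "mapper" per input split; each mapper's output is sorted by key.
    return [sorted(kv for rec in split for kv in map_fn(rec)) for split in splits]


def shuffle(mapped, n_reducers):
    # Route every value for a given key to the same partition ("worker").
    partitions = [defaultdict(list) for _ in range(n_reducers)]
    for mapper_out in mapped:
        for key, value in mapper_out:
            idx = sum(key.encode()) % n_reducers  # deterministic toy partitioner
            partitions[idx][key].append(value)
    return partitions


def reduce_fn(partition):
    # Summation reducer: collapse each key's values into one final pair.
    return {key: sum(values) for key, values in partition.items()}


def write_compressed(reduced, out_dir):
    # One gzip file per reducer output (assumed stand-in codec, not splittable).
    paths = []
    for i, part in enumerate(reduced):
        path = os.path.join(out_dir, f"part-{i:05d}.json.gz")
        with gzip.open(path, "wt", encoding="utf-8") as fh:
            json.dump(part, fh)
        paths.append(path)
    return paths


def map_only_task(path):
    # A map-only task reads exactly one compressed file as a complete unit.
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return sum(json.load(fh).values())


def run_map_only(paths, task=map_only_task, max_attempts=3):
    # Rerun only the task that failed; map, shuffle and reduce are not repeated.
    results, attempts = {}, defaultdict(int)
    pending = list(paths)
    while pending:
        path = pending.pop()
        attempts[path] += 1
        try:
            results[path] = task(path)
        except Exception:
            if attempts[path] >= max_attempts:
                raise
            pending.append(path)
    return results, dict(attempts)


def run_pipeline(splits, n_reducers=2, task=map_only_task):
    mapped = run_map(splits)
    reduced = [reduce_fn(p) for p in shuffle(mapped, n_reducers)]
    with tempfile.TemporaryDirectory() as out_dir:
        paths = write_compressed(reduced, out_dir)
        partials, attempts = run_map_only(paths, task)
    # Aggregate the partial results into the final result.
    return sum(partials.values()), attempts
```

Calling `run_pipeline([["a b a", "c"], ["b a"]])` returns the total token count together with a per-file attempt counter. Passing a `task` that raises once for a single file shows that only that file is reprocessed while the other partial results are kept. The sketch makes no attempt to reproduce the claimed data volume or the stated recovery-time reduction.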