US 12,443,572 B2
Method to track and clone data artifacts associated with distributed data processing pipelines
Annmary Justine Koomthanam, Bangalore (IN); Suparna Bhattacharya, Bangalore (IN); Aalap Tripathy, Houston, TX (US); Sergey Serebryakov, Milpitas, CA (US); Martin Foltin, Ft. Collins, CO (US); and Paolo Faraboschi, Milpitas, CA (US)
Assigned to Hewlett Packard Enterprise Development LP, Spring, TX (US)
Filed by HEWLETT PACKARD ENTERPRISE DEVELOPMENT LP, Houston, TX (US)
Filed on Jun. 28, 2022, as Appl. No. 17/851,546.
Prior Publication US 2023/0418792 A1, Dec. 28, 2023
Int. Cl. G06F 16/215 (2019.01); G06F 16/25 (2019.01); G06F 16/27 (2019.01); G06F 18/214 (2023.01); G06N 20/00 (2019.01)
CPC G06F 16/215 (2019.01) [G06F 16/254 (2019.01); G06F 16/27 (2019.01); G06F 18/214 (2023.01); G06N 20/00 (2019.01)]
20 Claims
 
1. A method, comprising:
receiving, from a first data processing site, a first hash content value that identifies a data artifact and a first indication that the data artifact was an output from a first processing stage of a data processing pipeline, wherein the first processing stage is performed at the first data processing site;
receiving, from a second data processing site, a second hash content value that identifies the data artifact and a second indication that the data artifact was an input to a second processing stage of the data processing pipeline, wherein the second processing stage is performed at the second data processing site;
generating a first node of the first hash content value from the first data processing site and a second node of the second hash content value from the second data processing site, wherein output of the second processing stage generates a descendant node that depends on the first node;
in response to determining that the first hash content value and the second hash content value are a same hash, determining that the first node and the second node represent a same data artifact based on the same hash;
merging the first node and the second node to generate a merged node of the same hash;
constructing a data lineage representation that comprises the merged node as a predecessor of the descendant node, the data lineage representation comprising lineal relationships of the data artifact at the first processing stage and the second processing stage that associates the merged node and the descendant node;
exporting the data lineage representation to the first data processing site or the second data processing site; and
enabling the first data processing site or the second data processing site to locally reproduce the data processing pipeline for the data artifact using the data lineage representation.
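
The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes SHA-256 as the content hash, and every name in it (`Report`, `Node`, `merge_nodes`, `build_lineage`, `export_lineage`, the site and stage labels) is hypothetical. Each site reports a hash together with an input or output indication, one node is created per report, nodes with the same hash are merged, and the merged node becomes the predecessor of any artifact produced by a stage that consumed it. Local reproduction of the pipeline is not modeled; the sketch stops at exporting the lineage as JSON.

```python
import hashlib
import json
from dataclasses import dataclass, field


def content_hash(data: bytes) -> str:
    """Identify an artifact by its bytes (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class Report:
    """What a site sends: a hash plus an input/output indication."""
    site: str
    stage: str
    artifact_hash: str
    role: str  # "input" or "output"


@dataclass
class Node:
    artifact_hash: str
    sites: set = field(default_factory=set)
    produced_by: set = field(default_factory=set)  # (site, stage) pairs
    consumed_by: set = field(default_factory=set)  # (site, stage) pairs


def node_from_report(r: Report) -> Node:
    """Generate one node for one report."""
    if r.role not in ("input", "output"):
        raise ValueError(f"unknown role: {r.role!r}")
    node = Node(r.artifact_hash, {r.site})
    target = node.produced_by if r.role == "output" else node.consumed_by
    target.add((r.site, r.stage))
    return node


def merge_nodes(a: Node, b: Node) -> Node:
    """Merge two nodes only when their hashes are the same."""
    if a.artifact_hash != b.artifact_hash:
        raise ValueError("hashes differ; nodes are different artifacts")
    return Node(a.artifact_hash, a.sites | b.sites,
                a.produced_by | b.produced_by,
                a.consumed_by | b.consumed_by)


def build_lineage(reports):
    """Return (nodes keyed by hash, set of (predecessor, descendant) edges)."""
    nodes = {}
    for r in reports:
        new = node_from_report(r)
        old = nodes.get(new.artifact_hash)
        nodes[new.artifact_hash] = merge_nodes(old, new) if old else new
    edges = set()
    for parent in nodes.values():
        for child in nodes.values():
            # An artifact consumed by a stage precedes that stage's outputs.
            if parent is not child and parent.consumed_by & child.produced_by:
                edges.add((parent.artifact_hash, child.artifact_hash))
    return nodes, edges


def export_lineage(nodes, edges) -> str:
    """Serialize the lineage so it can be shipped back to a site."""
    return json.dumps({
        "nodes": [{"hash": h,
                   "sites": sorted(n.sites),
                   "produced_by": sorted(n.produced_by),
                   "consumed_by": sorted(n.consumed_by)}
                  for h, n in sorted(nodes.items())],
        "edges": sorted(edges),
    })


# Hypothetical two-site run: site-A emits a dataset, site-B trains on it.
hx = content_hash(b"cleaned dataset")
hm = content_hash(b"trained model")
reports = [
    Report("site-A", "preprocess", hx, "output"),
    Report("site-B", "train", hx, "input"),
    Report("site-B", "train", hm, "output"),
]
nodes, edges = build_lineage(reports)
exported = export_lineage(nodes, edges)
```

Keying nodes by hash means the two sites never need to agree on file names or paths: matching content is enough to join the output of the first stage to the input of the second.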