US 12,423,591 B2
Annotation of a machine learning pipeline with operational semantics to support distributed lineage tracking
Mudhakar Srivatsa, White Plains, NY (US); Raghu Kiran Ganti, White Plains, NY (US); Carlos Henrique Andrade Costa, White Plains, NY (US); Linsong Chu, White Plains, NY (US); and Joshua M. Rosenkranz, White Plains, NY (US)
Assigned to International Business Machines Corporation, Armonk, NY (US)
Filed by International Business Machines Corporation, Armonk, NY (US)
Filed on Nov. 30, 2021, as Appl. No. 17/538,309.
Prior Publication US 2023/0169354 A1, Jun. 1, 2023
Int. Cl. G06F 9/44 (2018.01); G06F 8/75 (2018.01); G06F 8/77 (2018.01); G06F 11/3604 (2025.01); G06N 5/02 (2023.01); G06N 20/00 (2019.01)
CPC G06N 5/02 (2013.01) [G06F 8/75 (2013.01); G06F 8/77 (2013.01); G06F 11/3608 (2013.01); G06N 20/00 (2019.01)]
19 Claims
 
1. A computer system comprising:
a processor set;
one or more computer readable storage media; and
program instructions stored on the one or more computer readable storage media to cause the processor set to perform operations comprising:
pre-processing a pipeline configured to train a machine learning (ML) model, the pipeline represented in a data flow graph (DFG), including:
annotating one or more nodes of the DFG with two or more operational semantics for pipeline operations; and
selectively annotating one or more output object references to a corresponding input object and a prior node state for an output object from the DFG;
executing the pipeline represented in the DFG with the selectively annotated one or more output object references, including capturing an object lineage to provide data of how an object in the pipeline is produced with respect to the output object using the selectively annotated one or more output object references;
identifying provenance of one or more objects represented in the pipeline corresponding to a generated output including performance of the executed pipeline using the object lineage;
selectively applying a remediation action to the DFG based on the provenance of the one or more objects corresponding to the generated output; and
restarting the pipeline by executing the pipeline from a location in a sub-graph of the DFG where the remediation action was selectively applied.
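
The operations recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. Every name in it (Node, Pipeline, annotate, provenance, remediate, the semantics labels such as "stateless") is hypothetical, and the "model" is just a mean over a small list. The sketch assumes nodes are supplied in topological order and uses a per-node version counter as a stand-in for the prior node state.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Node:
    name: str
    fn: Callable[[list], object]
    inputs: List[str] = field(default_factory=list)     # upstream node names
    semantics: List[str] = field(default_factory=list)  # operational-semantics labels


class Pipeline:
    """Toy data flow graph (DFG); nodes are assumed to be in topological order."""

    def __init__(self, nodes: List[Node]):
        self.nodes: Dict[str, Node] = {n.name: n for n in nodes}
        self.outputs: Dict[str, object] = {}
        self.versions: Dict[str, int] = {n.name: 0 for n in nodes}
        # output name -> (input names, node-state version before this execution)
        self.lineage: Dict[str, Tuple[Tuple[str, ...], int]] = {}
        self.runs: List[str] = []  # execution log

    def annotate(self, name: str, *semantics: str) -> None:
        self.nodes[name].semantics.extend(semantics)

    def downstream(self, start: str) -> set:
        # The sub-graph consisting of `start` and everything reachable from it.
        reach = {start}
        for name, node in self.nodes.items():
            if any(i in reach for i in node.inputs):
                reach.add(name)
        return reach

    def run(self, start: Optional[str] = None) -> None:
        todo = set(self.nodes) if start is None else self.downstream(start)
        for name, node in self.nodes.items():
            if name not in todo:
                continue  # reuse the cached output from the earlier run
            args = [self.outputs[i] for i in node.inputs]
            # Record the output's reference to its inputs and prior node state.
            self.lineage[name] = (tuple(node.inputs), self.versions[name])
            self.outputs[name] = node.fn(args)
            self.versions[name] += 1
            self.runs.append(name)

    def provenance(self, name: str) -> List[str]:
        # Walk lineage references backwards from an output.
        seen: List[str] = []
        stack = [name]
        while stack:
            cur = stack.pop()
            if cur not in seen:
                seen.append(cur)
                stack.extend(self.lineage[cur][0])
        return seen

    def remediate(self, name: str, new_fn: Callable[[list], object]) -> None:
        self.nodes[name].fn = new_fn


pipe = Pipeline([
    Node("load", lambda a: [1.0, 2.0, -999.0, 3.0]),
    Node("stats", lambda a: len(a[0]), ["load"]),
    Node("clean", lambda a: list(a[0]), ["load"]),  # bug: keeps the -999.0 sentinel
    Node("train", lambda a: sum(a[0]) / len(a[0]), ["clean"]),
    Node("evaluate", lambda a: abs(a[0] - 2.0), ["train"]),
])
pipe.annotate("clean", "stateless", "map")     # two semantics on one node
pipe.annotate("train", "stateful", "reduce")

pipe.run()
error_before = pipe.outputs["evaluate"]        # 250.25
suspects = pipe.provenance("evaluate")         # evaluate, train, clean, load

# Remediate the faulty node, then restart only from that point in the DFG.
pipe.remediate("clean", lambda a: [x for x in a[0] if x != -999.0])
pipe.run(start="clean")
error_after = pipe.outputs["evaluate"]         # 0.0
```

In this sketch the restart re-executes only "clean", "train", and "evaluate"; "load" and the unrelated "stats" branch keep their cached outputs, which mirrors the idea of restarting from the location in the sub-graph where the remediation was applied.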