US 12,380,077 B2
System and method for automated discovery of fine-grain lineage of transactional data
Olusegun Oshin, Edison, NJ (US); Colin E Alexander, Palm Harbor, FL (US); Anshul Agarwal, Ridgewood, NJ (US); and Meisam Hosseini, New York, NY (US)
Assigned to JPMORGAN CHASE BANK, N.A., New York, NY (US)
Filed by JPMorgan Chase Bank, N.A., New York, NY (US)
Filed on May 15, 2023, as Appl. No. 18/197,379.
Prior Publication US 2024/0386005 A1, Nov. 21, 2024
Int. Cl. G06F 16/22 (2019.01); G06F 16/25 (2019.01)
CPC G06F 16/221 (2019.01) [G06F 16/254 (2019.01)]
16 Claims
 
1. A method for performing automated data lineage discovery between a source record and a target record within a network, the method comprising:
identifying, by a processor, a plurality of source records and at least one target record, wherein each of the plurality of source records and the at least one target record includes a plurality of columns, wherein the at least one target record is formed from data from the plurality of source records, wherein data from each of the plurality of columns is associated with a common key, and wherein the common key is a transaction identifier;
identifying, by the processor and from the plurality of source records and the at least one target record, a plurality of pairs of a source column and a target column for each column included in the at least one target record;
identifying, by the processor, at least one common key pair for at least one pair of a source column and a target column among the plurality of pairs, wherein the identifying of the at least one common key pair includes:
performing the following operations for each pair of the plurality of pairs:
obtaining a uniqueness score in the target column of the particular pair, wherein the uniqueness score is determined by dividing a number of unique values in the target column by a number of all values in the target column;
obtaining an intersection score between the target column and the source column of the particular pair, wherein the intersection score is calculated by a length function that indicates a size or cardinality of a resulting intersection between a set of unique values of the target column and a set of unique values of the source column;
calculating, by the processor, a common key score for the particular pair using the uniqueness score and the intersection score, wherein the common key score is calculated by a product of the uniqueness score and the intersection score; and
selecting a pair among the plurality of pairs with highest common key score as the at least one pair;
performing, by the processor and based on the at least one common key pair identified, at least one preprocessing of data included in the at least one pair for augmentation;
performing, by the processor and by using the at least one common key pair, data lineage discovery for each of the at least one pair among the plurality of pairs for identifying one or more data lineages;
calculating, by the processor, a data lineage score for each of the one or more data lineages;
selecting, by the processor and among the one or more data lineages, a target data lineage for the at least one pair based on the calculated data lineage score, wherein values included in a source column and values included in a target column of the at least one pair consistently correspond to one another, and wherein the target data lineage between the source column and the target column of the at least one pair is determined without knowledge of manipulation performed on the values of the source column for obtaining the values of the target column for more efficient determination of the target data lineage; and
displaying, on a display and by the processor, the target data lineage for the at least one pair.
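
The common key scoring recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes each record set is held as a dictionary mapping column names to lists of values, and that every source column is paired with every target column. The function names and the sample data are hypothetical. Ties in the score are resolved arbitrarily by taking the first maximum found.

```python
from itertools import product


def uniqueness_score(target_values):
    """Number of unique values in the target column divided by all values in it."""
    if not target_values:
        return 0.0
    return len(set(target_values)) / len(target_values)


def intersection_score(source_values, target_values):
    """Cardinality of the intersection of the two columns' unique-value sets."""
    return len(set(source_values) & set(target_values))


def common_key_score(source_values, target_values):
    """Product of the uniqueness score and the intersection score."""
    return uniqueness_score(target_values) * intersection_score(
        source_values, target_values
    )


def select_common_key_pair(source, target):
    """Score every (source column, target column) pair and return the best one.

    Assumption: pairs are formed as the full cross product of column names.
    """
    scores = {
        (s_col, t_col): common_key_score(source[s_col], target[t_col])
        for s_col, t_col in product(source, target)
    }
    best_pair = max(scores, key=scores.get)
    return best_pair, scores


# Hypothetical sample data: four transactions in each record set.
source = {
    "txn_id": ["T1", "T2", "T3", "T4"],
    "amount": [100, 250, 100, 75],
}
target = {
    "transaction_ref": ["T1", "T2", "T3", "T4"],
    "amt_usd": [100, 250, 100, 75],
}

best_pair, scores = select_common_key_pair(source, target)
print(best_pair, scores[best_pair])
```

In this sample, the pair (txn_id, transaction_ref) scores 1.0 × 4 = 4.0, while (amount, amt_usd) scores 0.75 × 3 = 2.25 because the repeated value 100 lowers the uniqueness of the target column. The transaction identifier pair is therefore selected as the common key pair.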