US 12,248,445 B2
Method and system for providing data quality capability within distributed data lakes
Sreenivas Vittal, Bangalore (IN); and Raghuram Sampathkrishna, Cypress, TX (US)
Assigned to JPMORGAN CHASE BANK, N.A., New York, NY (US)
Filed by JPMorgan Chase Bank, N.A., New York, NY (US)
Filed on Oct. 21, 2022, as Appl. No. 17/970,743.
Claims priority of application No. 202211051387 (IN), filed on Sep. 8, 2022.
Prior Publication US 2024/0086380 A1, Mar. 14, 2024
Int. Cl. G06F 16/215 (2019.01); G06F 16/25 (2019.01); G06F 16/27 (2019.01)
CPC G06F 16/215 (2019.01) [G06F 16/258 (2019.01); G06F 16/27 (2019.01)]
20 Claims
 
1. A method for providing an integrated data quality capability for distributed data repositories, the method being implemented by at least one processor, the method comprising:
identifying, by the at least one processor from a data stream, an indication that at least one job corresponding to a source data set has been started, the at least one job relating to at least one from among a data ingestion job and a data transformation job;
triggering, by the at least one processor, at least one data reconciliation action based on the identified indication,
wherein the at least one data reconciliation action is triggered according to a variable threshold; and
wherein the variable threshold includes an available processing bandwidth of a host system;
persisting, by the at least one processor in a repository, a first result of the at least one data reconciliation action,
wherein the first result includes information that relates to a system metric, the at least one data reconciliation action, the source data set, and a target data set;
initiating, by the at least one processor, at least one data quality action on the source data set that has been processed by the at least one data reconciliation action based on the first result, wherein the at least one data quality action determines at least one shortcoming related to inaccuracy or security, and wherein the at least one data quality action is initiated only after the at least one data reconciliation action is determined to have succeeded and not before;
persisting, by the at least one processor in the repository, a second result of the at least one data quality action, the second result including information that relates to at least one from among profile metadata and exception data;
initiating, by the at least one processor, at least one scan action based on the second result, the at least one scan action identifying personally identifiable information in the data stream;
persisting, by the at least one processor in the repository, a third result of the at least one scan action; and
generating, by the at least one processor, at least one graphical element and at least one report based on an analysis of the first result, the second result, and the third result, the at least one graphical element is displayable via a graphical user interface.
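
The ordering recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. Every name in it (Repository, reconcile, quality_check, scan_pii, run_pipeline, min_bandwidth) is hypothetical, and the concrete checks are assumptions chosen for brevity: a row-count comparison stands in for the data reconciliation action, a null-value check for the data quality action, and an e-mail pattern match over the source rows for the scan of the data stream for personally identifiable information. The graphical element is omitted; the report is reduced to a summary dictionary.

```python
import re
from dataclasses import dataclass, field

# Hypothetical PII pattern: e-mail addresses only (an assumption for this sketch).
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


@dataclass
class Repository:
    """In-memory stand-in for the repository that persists each result."""
    results: list = field(default_factory=list)

    def persist(self, kind, payload):
        self.results.append((kind, payload))


def reconcile(source, target):
    # Assumed reconciliation check: source and target row counts must match.
    return {
        "source_rows": len(source),
        "target_rows": len(target),
        "succeeded": len(source) == len(target),
    }


def quality_check(rows):
    # Assumed quality rule: a row containing any None value is an exception.
    exceptions = [i for i, r in enumerate(rows) if any(v is None for v in r.values())]
    return {"profile": {"row_count": len(rows)}, "exceptions": exceptions}


def scan_pii(rows):
    # Indices of rows in which any string value contains an e-mail address.
    return [
        i for i, r in enumerate(rows)
        if any(isinstance(v, str) and EMAIL.search(v) for v in r.values())
    ]


def run_pipeline(event, source, target, repo, available_bandwidth, min_bandwidth=0.25):
    # Step 1: react only to a "job started" indication for ingestion/transformation.
    if event.get("status") != "STARTED" or event.get("job_type") not in {"ingestion", "transformation"}:
        return "ignored"
    # Step 2: the trigger is gated on available processing bandwidth of the host.
    if available_bandwidth < min_bandwidth:
        return "deferred"
    first = reconcile(source, target)
    first["system_metric"] = {"available_bandwidth": available_bandwidth}
    repo.persist("reconciliation", first)
    # Step 3: the quality action starts only after reconciliation has succeeded.
    if not first["succeeded"]:
        return "reconciliation_failed"
    second = quality_check(source)
    repo.persist("data_quality", second)
    # Step 4: the PII scan follows the quality result.
    third = {"pii_rows": scan_pii(source)}
    repo.persist("pii_scan", third)
    # Step 5: a summary built from all three results stands in for the report.
    return {
        "reconciled": first["succeeded"],
        "exceptions": len(second["exceptions"]),
        "pii_rows": len(third["pii_rows"]),
    }
```

In this sketch, a failed reconciliation leaves only the first result in the repository, which mirrors the claim's requirement that the data quality action is initiated only after the reconciliation is determined to have succeeded. Treating the bandwidth threshold as a fixed fraction is a simplification; the claim describes the threshold as variable.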