US 12,013,827 B2
Duplicate determination in a graph using different versions of the graph
Lars Bremer, Boeblingen (DE); Thuany Karoline Stuart, Nice (FR); Hemanth Kumar Babu, Böblingen (DE); and Martin Anton Oberhofer, Sindelfingen (DE)
Assigned to International Business Machines Corporation, Armonk, NY (US)
Filed by International Business Machines Corporation, Armonk, NY (US)
Filed on Jan. 18, 2022, as Appl. No. 17/648,194.
Prior Publication US 2023/0229644 A1, Jul. 20, 2023
Int. Cl. G06F 16/21 (2019.01); G06F 16/215 (2019.01); G06F 18/22 (2023.01)
CPC G06F 16/219 (2019.01) [G06F 16/215 (2019.01); G06F 18/22 (2023.01)]
16 Claims
 
1. A computer-implemented method for determining duplicates in a graph in a hybrid master data management system based on different versions of the graph, comprising:
providing a first version of a graph, with the first version of the graph being a previous version of the graph stored on a virtual master data management (MDM) system, the virtual MDM system being configured to store and create data in a distributed arrangement across one or more source systems;
identifying at least two target nodes of the graph, wherein each node of the at least two target nodes has a set of entity attributes and for each entity attribute of the set of entity attributes:
comparing each version of an entity attribute of one target node with each version of the entity attribute of a second target node, with each comparison resulting in an individual data similarity score;
weighting with a penalty weight the individual data similarity scores that resulted from a comparison involving a first version of the entity attribute that is different from a second version of the entity attribute;
selecting a highest data similarity score of the individual data similarity scores of the entity attribute; and
combining the selected highest data similarity scores of the set of entity attributes for obtaining a comparison score;
comparing the first version and a second version of the graph for determining the comparison score indicative of a similarity between the two target nodes, the second version being a current version of the graph stored on a physical MDM system, the physical MDM system being configured to store and create data in a centralized system; and
using the comparison score for determining whether the two target nodes are duplicates with respect to each other.
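
The per-attribute scoring recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The claim does not specify the similarity function, the penalty value, how the per-attribute scores are combined, or the decision threshold, so the following are all assumptions made for illustration: the use of `difflib.SequenceMatcher` as the similarity measure, a penalty weight of 0.8, an arithmetic mean as the combination, and a threshold of 0.85. The version labels "previous" and "current" and all function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product

# Hypothetical data model: a node maps each entity attribute name to a dict
# of {version_label: value}, e.g. {"name": {"previous": "...", "current": "..."}}.


def similarity(a: str, b: str) -> float:
    """Individual data similarity score in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def attribute_score(versions_a: dict, versions_b: dict, penalty: float = 0.8) -> float:
    """Compare every version of an attribute on one node with every version
    on the other node, penalize cross-version comparisons, keep the highest."""
    best = 0.0
    for (label_a, value_a), (label_b, value_b) in product(
        versions_a.items(), versions_b.items()
    ):
        score = similarity(value_a, value_b)
        if label_a != label_b:
            # The comparison involves different versions: apply the penalty weight.
            score *= penalty
        best = max(best, score)
    return best


def comparison_score(node_a: dict, node_b: dict, penalty: float = 0.8) -> float:
    """Combine the highest per-attribute scores (assumption: arithmetic mean
    over the attributes that both nodes carry)."""
    shared = node_a.keys() & node_b.keys()
    if not shared:
        return 0.0
    total = sum(attribute_score(node_a[k], node_b[k], penalty) for k in shared)
    return total / len(shared)


def are_duplicates(node_a: dict, node_b: dict,
                   threshold: float = 0.85, penalty: float = 0.8) -> bool:
    """Use the comparison score to decide duplication (assumed threshold)."""
    return comparison_score(node_a, node_b, penalty) >= threshold


if __name__ == "__main__":
    node_a = {
        "name": {"previous": "Jon Smith", "current": "John Smith"},
        "city": {"current": "Berlin"},
    }
    node_b = {
        "name": {"current": "John Smith"},
        "city": {"previous": "Berlin"},
    }
    print(comparison_score(node_a, node_b))
    print(are_duplicates(node_a, node_b))
```

In this sketch, an exact match found only across different versions is capped at the penalty weight, so a same-version match on the same value always scores higher than a cross-version one.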