CPC G06F 16/906 (2019.01) [G06F 16/9024 (2019.01); G06N 20/00 (2019.01)] | 20 Claims |
1. A computer-implemented method, comprising:
receiving a plurality of data records from one or more data sources;
providing at least a subset of the data records to a scoring model that determines scores for various pairings of the data records, a score for a given pair of the data records representing a probability that the given pair of data records contains data elements about the same entity;
generating a graph data structure that includes a plurality of nodes, each individual node of the plurality of nodes representing a different record from the plurality of data records, where edges between given node pairs are associated with corresponding determined scores for respective pairs of data records;
performing a connected component analysis of the graph data structure, including pruning one or more edges that fall below a threshold score;
performing optimal weighted clustering of the graph data structure to determine final clusters of the plurality of nodes, wherein computer processing time is reduced in performing the optimal weighted clustering at least in part by reduction to a linear programming problem such that only a subset of millions of potential clusters possible from the plurality of data records are analyzed in determining the final clusters;
assigning a different unique identifier to each individual cluster of the final clusters, where different identifiers represent different entities; and
responding to a request for data regarding a given entity by providing aggregated data elements from those data records of the plurality of data records associated with a cluster of the final clusters having an identifier that represents the given entity.
|
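The graph-building, edge-pruning, connected-component, identifier-assignment, and aggregation steps recited in claim 1 can be illustrated with a minimal sketch. This is not the patented implementation. The function names `cluster_records` and `aggregate`, the `entity-N` identifier format, the sample records, the scores, and the 0.5 threshold are all hypothetical. The scoring model is replaced by a hard-coded score table. The optimal weighted clustering by reduction to a linear programming problem is omitted entirely, so each connected component is simply treated as a final cluster here.

```python
from collections import defaultdict


def cluster_records(record_ids, pair_scores, threshold):
    """Prune edges scoring below `threshold`, then group records into
    connected components using union-find. Each component is given a
    hypothetical unique identifier of the form 'entity-N'."""
    parent = {r: r for r in record_ids}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in pair_scores.items():
        if score >= threshold:  # edges below the threshold are pruned
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = defaultdict(list)
    for r in record_ids:
        groups[find(r)].append(r)

    # Assumption: components stand in for the final clusters; the
    # LP-based optimal weighted clustering step is not modeled.
    return {f"entity-{i}": members
            for i, members in enumerate(groups.values(), start=1)}


def aggregate(records, clusters, entity_id):
    """Collect the distinct values of every field across one cluster."""
    merged = defaultdict(set)
    for record_id in clusters[entity_id]:
        for field, value in records[record_id].items():
            merged[field].add(value)
    return {field: sorted(values) for field, values in merged.items()}


# Hypothetical sample data; the scores stand in for scoring-model output.
records = {
    "r1": {"name": "Ann Lee", "email": "ann@example.com"},
    "r2": {"name": "A. Lee", "email": "ann@example.com"},
    "r3": {"name": "Bob Ray", "email": "bob@example.com"},
}
pair_scores = {("r1", "r2"): 0.93, ("r1", "r3"): 0.05, ("r2", "r3"): 0.10}

clusters = cluster_records(list(records), pair_scores, threshold=0.5)
print(clusters)
print(aggregate(records, clusters, "entity-1"))
```

With this sample data only the r1–r2 edge survives pruning, so r1 and r2 share one identifier and r3 receives its own. A request for the first entity returns the merged names and the single shared email address.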