US 11,941,065 B1
Single identifier platform for storing entity data
Hua Li, San Diego, CA (US); Sophie Liu, San Diego, CA (US); Yi He, San Diego, CA (US); Zhixuan Wang, San Diego, CA (US); Chi Zhang, San Diego, CA (US); Kevin Chen, San Diego, CA (US); Shanji Xiong, San Diego, CA (US); Christer Dichiara, Carlsbad, CA (US); Mason Carpenter, Richmond, VA (US); Mark Hirn, Hermosa Beach, CA (US); and Julian Yarkony, Jersey City, NJ (US)
Assigned to Experian Information Solutions, Inc., Costa Mesa, CA (US)
Filed by Experian Information Solutions, Inc., Costa Mesa, CA (US)
Filed on Sep. 11, 2020, as Appl. No. 17/018,953.
Claims priority of provisional application 63/015,333, filed on Apr. 24, 2020.
Claims priority of provisional application 62/900,341, filed on Sep. 13, 2019.
Int. Cl. G06F 16/906 (2019.01); G06F 16/901 (2019.01); G06N 20/00 (2019.01)

CPC G06F 16/906 (2019.01) [G06F 16/9024 (2019.01); G06N 20/00 (2019.01)]

20 Claims

1. A computer-implemented method, comprising:

receiving a plurality of data records from one or more data sources;

providing at least a subset of the data records to a scoring model that determines scores for various pairings of the data records, a score for a given pair of the data records representing a probability that the given pair of data records contains data elements about the same entity;

generating a graph data structure that includes a plurality of nodes, each individual node of the plurality of nodes representing a different record from the plurality of data records, where edges between given node pairs are associated with corresponding determined scores for respective pairs of data records;

performing a connected component analysis of the graph data structure, including pruning one or more edges that fall below a threshold score;

performing optimal weighted clustering of the graph data structure to determine final clusters of the plurality of nodes, wherein computer processing time is reduced in performing the optimal weighted clustering at least in part by reduction to a linear programming problem such that only a subset of millions of potential clusters possible from the plurality of data records are analyzed in determining the final clusters;

assigning a different unique identifier to each individual cluster of the final clusters, where different identifiers represent different entities; and

responding to a request for data regarding a given entity by providing aggregated data elements from those data records of the plurality of data records associated with a cluster of the final clusters having an identifier that represents the given entity.
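
For illustration only, the steps recited in claim 1 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the patented implementation. The function `toy_score` is a hypothetical stand-in for the scoring model, the threshold of 0.5 and the `E0001`-style identifiers are arbitrary, and the record fields are invented sample data. The sketch also treats each connected component that survives pruning as a final cluster; it does not attempt the optimal weighted clustering step or its reduction to a linear programming problem.

```python
from collections import defaultdict
from itertools import combinations


def toy_score(a, b):
    """Hypothetical scoring model: fraction of shared fields with equal values."""
    keys = [k for k in a if k in b]
    if not keys:
        return 0.0
    return sum(a[k] == b[k] for k in keys) / len(keys)


def cluster_records(records, threshold=0.5, score=toy_score):
    """Score record pairs, prune weak edges, and label connected components."""
    parent = list(range(len(records)))  # union-find forest over record indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Each record is a node; an edge is kept only if its score meets the threshold.
    for i, j in combinations(range(len(records)), 2):
        if score(records[i], records[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(len(records)):
        groups[find(i)].append(i)

    # Assumption: every surviving component is treated as a final cluster.
    ordered = sorted(groups.values(), key=lambda members: members[0])
    return {f"E{n:04d}": members for n, members in enumerate(ordered, start=1)}


def aggregate(records, clusters, entity_id):
    """Merge the data elements of all records in the cluster for entity_id."""
    merged = defaultdict(set)
    for i in clusters[entity_id]:
        for key, value in records[i].items():
            merged[key].add(value)
    return {key: sorted(values) for key, values in merged.items()}


# Invented sample records; the first two differ only in phone number.
records = [
    {"name": "Ann Lee", "zip": "92101", "phone": "555-0100"},
    {"name": "Ann Lee", "zip": "92101", "phone": "555-0199"},
    {"name": "Bob Ray", "zip": "23220", "phone": "555-0142"},
]
clusters = cluster_records(records)
print(clusters)
print(aggregate(records, clusters, "E0001"))
```

With this sample data the first two records score 2/3 and fall into one cluster, while the third record forms its own cluster. Scoring every pair is quadratic in the number of records, so the sketch is suitable only for small inputs.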