US 12,038,933 B2
Systems and methods for automatic clustering and canonical designation of related data in various data structures
Lawrence Manning, New York, NY (US); Rahul Mehta, New York, NY (US); Daniel Erenrich, Mountain View, CA (US); Guillem Palou Visa, London (GB); Roger Hu, New York, NY (US); Xavier Falco, London (GB); Rowan Gilmore, London (GB); Eli Bingham, New York, NY (US); Jason Prestinario, New York, NY (US); Yifei Huang, Jersey City, NJ (US); Daniel Fernandez, New York, NY (US); Jeremy Elser, New York, NY (US); Clayton Sader, San Francisco, CA (US); Rahul Agarwal, San Francisco, CA (US); Matthew Elkherj, Menlo Park, CA (US); Nicholas Latourette, San Francisco, CA (US); and Aleksandr Zamoshchin, Aurora, CO (US)
Assigned to Palantir Technologies Inc., Denver, CO (US)
Filed by Palantir Technologies Inc., Denver, CO (US)
Filed on May 30, 2023, as Appl. No. 18/325,616.
Application 18/325,616 is a continuation of application No. 17/812,984, filed on Jul. 15, 2022, granted, now 11,704,325.
Application 17/812,984 is a continuation of application No. 16/189,040, filed on Nov. 13, 2018, granted, now 11,392,591, issued on Jul. 19, 2022.
Application 16/189,040 is a continuation of application No. 15/233,149, filed on Aug. 10, 2016, granted, now 10,127,289, issued on Nov. 13, 2018.
Claims priority of provisional application 62/207,335, filed on Aug. 19, 2015.
Prior Publication US 2023/0297582 A1, Sep. 21, 2023
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 16/00 (2019.01); G06F 16/2457 (2019.01); G06F 16/28 (2019.01); G06F 16/35 (2019.01); G06F 16/9535 (2019.01); G06F 18/23 (2023.01)

CPC G06F 16/24578 (2019.01) [G06F 16/285 (2019.01); G06F 16/35 (2019.01); G06F 16/9535 (2019.01); G06F 18/23 (2023.01)]

20 Claims

1. A computer-implemented method comprising:

generating a plurality of record pairs, wherein each record pair in the plurality of record pairs comprises a respective first record from a first plurality of records and a respective second record from a second plurality of records;

applying a machine learning model to determine respective probabilities, for each of the plurality of record pairs, that the respective first record and second record of the respective record pairs are associated with a respective same entity;

causing a client computing device to present any indeterminate record pairs to a user, wherein indeterminate record pairs are identified based at least in part on the respective determined probabilities for individual record pairs of the plurality of record pairs being below a pre-established threshold;

receiving, from the client computing device, user feedback indicating whether the first and second record of an indeterminate record pair are associated with the same entity;

retraining the machine learning model and revising the probability of the indeterminate record pair based at least in part on the user feedback;

determining, based at least in part on the probabilities, respective entities associated with one or more clusters of record pairs; and

outputting the clusters of record pairs and the respective entities associated with each cluster to the client computing device.
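
The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation, and everything named in it is an assumption made for illustration: the `features` function, the small logistic regression `PairModel`, the union-find `cluster` helper, the `ask_user` callback standing in for the client computing device, the 0.9 threshold, and the sample records. The model starts untrained, so every pair scores 0.5 and is treated as indeterminate on the first pass. For brevity the sketch groups record indices drawn from the matched pairs and labels each group with the name of its first record.

```python
import math
from itertools import product

def features(a, b):
    """Hypothetical pair features: bias, name-token Jaccard overlap, same-city flag."""
    ta, tb = set(a["name"].lower().split()), set(b["name"].lower().split())
    jaccard = len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0
    return [1.0, jaccard, 1.0 if a["city"] == b["city"] else 0.0]

class PairModel:
    """Tiny logistic regression, a stand-in for the claimed machine learning model."""
    def __init__(self, n_features):
        self.w = [0.0] * n_features

    def predict(self, x):
        z = sum(w * xi for w, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def fit(self, X, y, epochs=1000, lr=0.5):
        # Plain per-sample gradient descent on the logistic loss.
        for _ in range(epochs):
            for x, t in zip(X, y):
                err = self.predict(x) - t
                self.w = [w - lr * err * xi for w, xi in zip(self.w, x)]

def cluster(pairs, probs, n_records, match_at=0.5):
    """Union-find over record indices, merging pairs scored at or above match_at."""
    parent = list(range(n_records))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for (i, j), p in zip(pairs, probs):
        if p >= match_at:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(n_records):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def resolve(first, second, ask_user, threshold=0.9):
    records = first + second
    # Generate record pairs: one record from each plurality of records.
    pairs = [(i, len(first) + j)
             for i, j in product(range(len(first)), range(len(second)))]
    X = [features(records[i], records[j]) for i, j in pairs]
    # Apply the model to obtain a same-entity probability per pair.
    model = PairModel(len(X[0]))
    probs = [model.predict(x) for x in X]
    # Pairs scoring below the pre-established threshold are indeterminate.
    indeterminate = [k for k, p in enumerate(probs) if p < threshold]
    # Present indeterminate pairs and collect feedback (simulated by ask_user).
    labels = {k: ask_user(records[pairs[k][0]], records[pairs[k][1]])
              for k in indeterminate}
    # Retrain on the feedback and revise the probabilities.
    if labels:
        model.fit([X[k] for k in labels],
                  [1.0 if labels[k] else 0.0 for k in labels])
        probs = [model.predict(x) for x in X]
    # Determine clusters and label each with a (hypothetical) canonical name.
    clusters = cluster(pairs, probs, len(records))
    entities = {records[c[0]]["name"]: c for c in clusters}
    return probs, indeterminate, entities

first = [{"name": "Acme Corp", "city": "Denver"},
         {"name": "Globex Inc", "city": "London"}]
second = [{"name": "ACME Corp", "city": "Denver"},
          {"name": "Globex", "city": "London"}]
# Simulated reviewer answers; a real system would collect these from a user.
same = {("Acme Corp", "ACME Corp"), ("Globex Inc", "Globex")}
probs, asked, entities = resolve(
    first, second, lambda a, b: (a["name"], b["name"]) in same)
print(entities)
```

In this toy run all four candidate pairs are sent for review, and after retraining the two Acme records and the two Globex records are expected to fall into separate clusters. A production system would more likely start from a pre-trained model so that only a small share of pairs falls below the threshold.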