CPC G06F 16/215 (2019.01) [G06F 16/24556 (2019.01)]
20 Claims
1. A method of data unification, the method comprising:
receiving a plurality of data records, each data record of the plurality of data records comprising a plurality of data fields;
performing a self-conflation process for the plurality of data records, the self-conflation process comprising:
performing a partition-based clustering, in parallel, for a plurality of partitions, wherein the plurality of data records are distributed among the plurality of partitions; and
producing a unified data record from the plurality of records;
selecting, from among the plurality of data fields, a subset of the data fields, the subset of the data fields being fewer in number than the plurality of data fields, wherein selecting the subset of the data fields comprises:
applying a first rule to select at least a first one of the data fields within the unified data record for inclusion in the subset of the data fields;
using content of the subset of the data fields, generating a stable identifier (stableID) for the unified data record; and
inserting the stableID into a primary key data field of the unified data record.
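As an illustration only, the steps recited in claim 1 can be sketched in Python. The claim does not specify the clustering criterion, the field-selection rule, the hash function, or the parallelism mechanism, so everything named below is a hypothetical choice made for the sketch: the fields `name` and `postal_code` stand in for the subset chosen by the "first rule", SHA-256 stands in for the stableID function, exact matching on normalized key fields stands in for the clustering, and a thread pool stands in for parallel execution across partitions.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical choices; none of these are specified by the claim.
KEY_FIELDS = ("name", "postal_code")  # subset selected by an assumed "first rule"
NUM_PARTITIONS = 4


def normalize(value):
    """Lowercase and collapse whitespace so trivially different values match."""
    return " ".join(str(value).lower().split())


def partition_of(record):
    """Assign a record to a partition deterministically from an assumed blocking key."""
    key = normalize(record.get("postal_code", ""))
    return hashlib.sha256(key.encode("utf-8")).digest()[0] % NUM_PARTITIONS


def cluster_partition(records):
    """Cluster the records of one partition by exact match on normalized key fields."""
    clusters = {}
    for record in records:
        key = tuple(normalize(record.get(f, "")) for f in KEY_FIELDS)
        clusters.setdefault(key, []).append(record)
    return list(clusters.values())


def unify(cluster):
    """Merge a cluster into one record, keeping the first non-empty value per field."""
    unified = {}
    for record in cluster:
        for field, value in record.items():
            if value not in (None, "") and field not in unified:
                unified[field] = value
    return unified


def stable_id(record):
    """Derive an identifier from the content of the selected subset of fields only."""
    content = "\x1f".join(normalize(record.get(f, "")) for f in KEY_FIELDS)
    return hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]


def self_conflate(records):
    """Partition, cluster each partition in parallel, unify, and key each result."""
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for record in records:
        partitions[partition_of(record)].append(record)

    # Threads are used here only to show the per-partition structure.
    with ThreadPoolExecutor(max_workers=NUM_PARTITIONS) as pool:
        clustered = list(pool.map(cluster_partition, partitions))

    unified_records = []
    for clusters in clustered:
        for cluster in clusters:
            unified = unify(cluster)
            unified["primary_key"] = stable_id(unified)  # insert the stableID
            unified_records.append(unified)
    return unified_records
```

Because the identifier in this sketch depends only on the selected subset of fields, a later change to a field outside that subset (for example a phone number) would leave the identifier unchanged, which is one plausible reading of "stable" in the claim.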