CPC G06F 16/285 (2019.01) [G06F 16/221 (2019.01); G06F 16/282 (2019.01); G06F 18/23 (2023.01); G06F 40/177 (2020.01); G06F 40/18 (2020.01); G06N 7/01 (2023.01); G06N 20/00 (2019.01)] | 17 Claims |
1. A method comprising:
performing pairwise comparisons on a set of records, the pairwise comparisons comprising, for a given record, comparing the given record to other records in the set of records;
generating feature signatures associated with each of the pairwise comparisons, a given feature signature comprising a vector representing a likelihood that two respective records associated with the feature signature relate to a single entity;
inputting the feature signatures into a trained ordinal classifier to obtain a first set of match scores predicted by the trained ordinal classifier, wherein the trained ordinal classifier is configured using ordinal training data and hard conflict rules, and wherein the ordinal classifier generates non-binary output labels indicating at least one of a strong match, a moderate match, a weak match, an unknown match, and a hard conflict;
generating, based on the first set of match scores, a first cluster of records and a second cluster of records;
inputting the first cluster of records and the second cluster of records into the ordinal classifier to obtain a second set of match scores;
determining whether a hard conflict exists between the first cluster of records and the second cluster of records based on the second set of match scores;
generating a hierarchical clustering based on the first set of match scores, the second set of match scores, and the determination of whether a hard conflict exists;
assigning hierarchical cluster identifiers to records in the set of records based on the hierarchical clustering, wherein a hierarchical cluster identifier for a given record comprises a series of values, each value reflecting a respective tier within the hierarchical clustering; and
generating a processed database table with the hierarchical cluster identifiers, wherein the hierarchical cluster identifiers allow selection of clusters according to different degrees of confidence.
|
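The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the claimed implementation. Everything named here is an assumption made for illustration: the field names, the integer label encoding, the rule-based `classify` function that stands in for the trained ordinal classifier, the birth-date hard conflict rule, and the greedy merge in `cluster_at`. The cluster-level conflict check is also simplified: instead of passing two clusters back through the classifier, the sketch blocks a merge whenever any cross-cluster pair was labeled a hard conflict.

```python
from itertools import combinations

# Hypothetical ordinal encoding: a higher value means a stronger match.
HARD_CONFLICT, UNKNOWN, WEAK, MODERATE, STRONG = -1, 0, 1, 2, 3
FIELDS = ("name", "email", "dob")  # assumed record fields

def feature_signature(a, b):
    """Per-field agreement vector: 1.0 agree, 0.0 disagree, 0.5 if a value is missing."""
    sig = []
    for field in FIELDS:
        x, y = a.get(field), b.get(field)
        sig.append(0.5 if x is None or y is None else float(x == y))
    return tuple(sig)

def classify(sig):
    """Rule-based stand-in for the trained ordinal classifier (assumption)."""
    if sig[FIELDS.index("dob")] == 0.0:  # hypothetical hard conflict rule
        return HARD_CONFLICT
    agreements = sum(1 for v in sig if v == 1.0)
    return {3: STRONG, 2: MODERATE, 1: WEAK}.get(agreements, UNKNOWN)

def cluster_at(n, scores, threshold):
    """Greedily merge clusters linked at >= threshold, strongest links first,
    refusing any merge that would span a hard conflict."""
    cluster_of = {i: {i} for i in range(n)}
    for (i, j), label in sorted(scores.items(), key=lambda kv: -kv[1]):
        if label < threshold:
            continue
        ci, cj = cluster_of[i], cluster_of[j]
        if ci is cj:
            continue
        if any(scores[tuple(sorted((p, q)))] == HARD_CONFLICT
               for p in ci for q in cj):
            continue
        merged = ci | cj
        for k in merged:
            cluster_of[k] = merged
    return cluster_of

def hierarchical_ids(records):
    """Return one dotted identifier per record, one value per tier (loosest first)."""
    n = len(records)
    scores = {(i, j): classify(feature_signature(records[i], records[j]))
              for i, j in combinations(range(n), 2)}
    parts = [[] for _ in range(n)]
    for threshold in (WEAK, MODERATE, STRONG):
        clusters = cluster_at(n, scores, threshold)
        for i in range(n):
            parts[i].append(min(clusters[i]))  # smallest member index names the cluster
    return [".".join(str(v) for v in p) for p in parts]

records = [
    {"name": "Ann Lee", "email": "ann@x.org", "dob": "1990-01-01"},
    {"name": "Ann Lee", "email": "ann@x.org", "dob": "1990-01-01"},
    {"name": "Ann Lee", "email": None, "dob": "1990-01-01"},
    {"name": "A. Lee", "email": "ann@x.org", "dob": None},
    {"name": "Ann Lee", "email": "ann@x.org", "dob": "1985-05-05"},
]
ids = hierarchical_ids(records)
print(ids)  # ['0.0.0', '0.0.0', '0.0.2', '0.3.3', '4.4.4']
```

In this sketch, selecting rows by the first value of the identifier returns the loosest (weak-or-better) clusters, while matching on the full identifier returns only the strong-match clusters. The last record shares a name and email with the others but is kept apart at every tier because of the assumed birth-date conflict rule.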