| CPC G06N 5/025 (2013.01) [G06F 16/00 (2019.01)] | 20 Claims |

|
1. A computer-implemented method comprising:
identifying a first tree person from a first genealogical tree and a second tree person from a second genealogical tree, wherein both the first genealogical tree and the second genealogical tree comprise a plurality of interconnected tree persons corresponding to individuals that are related to each other;
extracting, from first tree data of the first genealogical tree, a first set of features for the first tree person and, from second tree data of the second genealogical tree, a second set of features for the second tree person;
based on extracting the first set of features for the first tree person and the second set of features for the second tree person, generating a metric function, by comparing like features from the first set of features for the first tree person with corresponding features from the second set of features for the second tree person;
generating a plurality of feature weights for similarity metrics of the metric function using a machine learning model configured to output the plurality of feature weights based on receiving an input comprising the first set of features and the second set of features, wherein the machine learning model is trained by:
providing training data comprising pairs of tree persons to the machine learning model; and
modifying the machine learning model using an error computed based on an output of the machine learning model when provided with the training data;
generating a plurality of weighted similarity metrics by multiplying similarity metrics of the metric function with corresponding feature weights from the plurality of feature weights;
generating a similarity score indicating a likelihood of the first tree person and the second tree person being duplicates by calculating a sum of the plurality of weighted similarity metrics; and
modifying a cluster in a genealogical database based on the likelihood of the first tree person and the second tree person being duplicates.
|
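The scoring steps recited in claim 1 (comparing like features, multiplying each similarity metric by its feature weight, and summing the products) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the patent: the feature keys `name` and `birth_year`, the two comparison functions, the ten-year tolerance, and the fixed weights. In the claimed method the weights are output by a trained machine learning model that receives both feature sets as input; the constant weight dictionary below only stands in for that output.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Hypothetical string metric: case-insensitive match ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def year_similarity(a: int, b: int, tolerance: int = 10) -> float:
    """Hypothetical numeric metric: 1.0 for equal years, falling
    linearly to 0.0 at `tolerance` years apart (assumed value)."""
    return max(0.0, 1.0 - abs(a - b) / tolerance)


# One comparison function per feature; the feature names are assumptions.
METRICS = {"name": name_similarity, "birth_year": year_similarity}


def similarity_metrics(features_1: dict, features_2: dict) -> dict:
    """Compare like features: only features present in both sets are scored."""
    return {
        key: fn(features_1[key], features_2[key])
        for key, fn in METRICS.items()
        if key in features_1 and key in features_2
    }


def similarity_score(metrics: dict, weights: dict) -> float:
    """Multiply each similarity metric by its feature weight, then sum."""
    weighted = {key: metrics[key] * weights[key] for key in metrics}
    return sum(weighted.values())


# Placeholder for the model output described in the claim (assumed values).
example_weights = {"name": 0.6, "birth_year": 0.4}

person_1 = {"name": "Mary Smith", "birth_year": 1850}
person_2 = {"name": "Mary Smith", "birth_year": 1852}
score = similarity_score(similarity_metrics(person_1, person_2), example_weights)
```

A higher `score` would indicate a higher likelihood that the two tree persons are duplicates. How that likelihood is turned into a cluster modification, for example by comparing it to a threshold, is not specified in the claim and is left out of the sketch.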