| CPC G06F 16/24558 (2019.01) [G06F 16/2246 (2019.01); G06F 16/24578 (2019.01); G16B 10/00 (2019.02)] | 18 Claims |

|
1. A computer-implemented method comprising:
receiving a target individual dataset associated with a target individual;
identifying a plurality of candidate individual datasets that are potentially related to the target individual dataset;
identifying a related individual dataset from the plurality of candidate individual datasets, wherein the related individual dataset has data bits that match at least a portion of data bits in the target individual dataset, wherein identifying the related individual dataset comprises:
phasing a genotype corresponding to the target individual and a genotype corresponding to the related individual;
identifying identity by descent (IBD) segments shared between the phased genotype of the target individual dataset and the phased genotype of the related individual dataset;
determining a total length of the IBD segments shared between the phased genotype of the target individual dataset and the phased genotype of the related individual dataset; and
determining that the total length of shared IBD segments exceeds a threshold;
identifying a parent node that is a common parent node for both the related individual dataset and the target individual dataset;
retrieving a data tree to which the parent node belongs, the data tree describing inter-relationships among datasets in the data tree;
identifying, based on strings of matched data bits and number of the strings of matched data bits between the target individual dataset and the datasets in the data tree, a descendant position to the common parent node in the data tree to which the target individual dataset is assigned, wherein identifying the descendant position to the common parent node in the data tree comprises:
generating a plurality of candidate data trees that have the individual dataset assigned to different candidate descendant positions to the common parent node, wherein generating the plurality of candidate data trees comprises:
generating a first candidate data tree that is based on the data tree to which the parent node belongs, the first candidate data tree being the data tree with a first new descendant position, the first candidate data tree adding the individual dataset to the first new descendant position, and
generating a second candidate data tree that is based on the data tree to which the parent node belongs, the second candidate data tree being the data tree with a second new descendant position, the second candidate data tree adding the individual dataset to the second new descendant position,
calculating, for each of the candidate data trees with new descendant positions added for the individual dataset to the common parent node, a plurality of pairwise relationship likelihoods, each pairwise relationship likelihood measuring a likelihood between a candidate descendant position and another dataset that also represents a descendant of the common parent node, and
selecting a candidate data tree as the data tree based on the plurality of pairwise relationship likelihoods; and
outputting the data tree with the target individual dataset located in the descendant position.
|
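
The steps recited in claim 1 can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the claimed implementation. It takes phasing and IBD segment detection as already done and covers only the total-length threshold test and the selection among candidate data trees. The `Segment` class, the child-to-parent dictionary used as a tree, the 20 cM threshold, the figure of 6800 cM, and the Gaussian scoring in `pairwise_log_likelihood` are all hypothetical choices made for this example; the claim does not specify any of them.

```python
from dataclasses import dataclass
from math import log


@dataclass(frozen=True)
class Segment:
    """One shared IBD segment in centimorgans (hypothetical representation)."""
    start_cm: float
    end_cm: float


def total_ibd_length(segments):
    """Sum the lengths of all shared IBD segments."""
    return sum(s.end_cm - s.start_cm for s in segments)


def exceeds_threshold(segments, threshold_cm):
    """True when the total shared IBD length exceeds the threshold."""
    return total_ibd_length(segments) > threshold_cm


def depth_below(tree, node, ancestor):
    """Count parent links from node up to ancestor; tree maps child -> parent."""
    depth = 0
    while node != ancestor:
        node = tree[node]
        depth += 1
    return depth


def pairwise_log_likelihood(meioses, observed_cm):
    """Toy Gaussian log-likelihood (constant term dropped).

    Assumption: expected sharing halves with each meiosis separating the pair,
    and the spread is a fixed fraction of the expectation.
    """
    expected = 6800.0 / 2 ** meioses
    sd = 0.25 * expected
    z = (observed_cm - expected) / sd
    return -0.5 * z * z - log(sd)


def score_tree(tree, target, parent, observed):
    """Sum pairwise log-likelihoods between the target's candidate position
    and every other descendant of the common parent with observed IBD.

    Assumption: the target's branch meets each other descendant's branch
    only at the common parent node.
    """
    target_depth = depth_below(tree, target, parent)
    return sum(
        pairwise_log_likelihood(target_depth + depth_below(tree, other, parent), cm)
        for other, cm in observed.items()
    )


def select_tree(candidates, target, parent, observed):
    """Return the label of the candidate tree with the highest total score."""
    return max(
        candidates,
        key=lambda label: score_tree(candidates[label], target, parent, observed),
    )


# Existing data tree: A is a child of common parent P, B is a child of A.
base = {"A": "P", "B": "A"}
# First candidate: target T added as a direct child of P.
first = {**base, "T": "P"}
# Second candidate: T added one generation lower, via a placeholder node X.
second = {**base, "X": "P", "T": "X"}
# Hypothetical total IBD (cM) observed between T and each existing descendant.
observed = {"A": 1650.0, "B": 800.0}
best = select_tree({"first": first, "second": second}, "T", "P", observed)
```

In this toy setup the first candidate is selected, because the observed sharing with A and B lies much closer to the values the toy model expects for a child of P than for a grandchild of P. A real system would replace the Gaussian scoring with an empirically calibrated relationship likelihood.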