US 12,229,141 B2
Linking individual datasets to a database
Shiya Song, San Mateo, CA (US); Jingwen Pei, San Mateo, CA (US); Brett Frederick Jorgensen, Draper, UT (US); Aaron James Stern, Berkeley, CA (US); and Ross E. Curtis, Cedar Hills, UT (US)
Assigned to Ancestry.com DNA, LLC, Lehi, UT (US)
Filed by Ancestry.com DNA, LLC, Lehi, UT (US)
Filed on Jul. 20, 2022, as Appl. No. 17/868,775.
Application 17/868,775 is a continuation of application No. 17/128,009, filed on Dec. 19, 2020, granted, now Pat. No. 11,429,615.
Claims priority of provisional application 62/951,646, filed on Dec. 20, 2019.
Prior Publication US 2022/0365934 A1, Nov. 17, 2022
Int. Cl. G06F 15/16 (2006.01); G06F 16/22 (2019.01); G06F 16/2455 (2019.01); G06F 16/2457 (2019.01); G16B 10/00 (2019.01)
CPC G06F 16/24558 (2019.01) [G06F 16/2246 (2019.01); G06F 16/24578 (2019.01); G16B 10/00 (2019.02)]
18 Claims
1. A computer-implemented method comprising:
receiving a target individual dataset associated with a target individual;
identifying a plurality of candidate individual datasets that are potentially related to the target individual dataset;
identifying a related individual dataset from the plurality of candidate individual datasets, wherein the related individual dataset has data bits that match at least a portion of data bits in the target individual dataset, wherein identifying the related individual dataset comprises:
phasing a genotype corresponding to the target individual and a genotype corresponding to the related individual;
identifying identity by descent (IBD) segments shared between the phased genotype of the target individual dataset and the phased genotype of the related individual dataset;
determining a total length of the IBD segments shared between the phased genotype of the target individual dataset and the phased genotype of the related individual dataset; and
determining that the total length of shared IBD segments exceeds a threshold;
identifying a parent node that is a common parent node for both the related individual dataset and the target individual dataset;
retrieving a data tree to which the parent node belongs, the data tree describing inter-relationships among datasets in the data tree;
identifying, based on strings of matched data bits and number of the strings of matched data bits between the target individual dataset and the datasets in the data tree, a descendant position to the common parent node in the data tree to which the target individual dataset is assigned, wherein identifying the descendant position to the common parent node in the data tree comprises:
generating a plurality of candidate data trees that have the individual dataset assigned to different candidate descendant positions to the common parent node, wherein generating the plurality of candidate data trees comprises:
generating a first candidate data tree that is based on the data tree to which the parent node belongs, the first candidate data tree being the data tree with a first new descendant position, the first candidate data tree adding the individual dataset to the first new descendant position, and
generating a second candidate data tree that is based on the data tree to which the parent node belongs, the second candidate data tree being the data tree with a second new descendant position, the second candidate data tree adding the individual dataset to the second new descendant position,
calculating, for each of the candidate data trees with new descendant positions added for the individual dataset to the common parent node, a plurality of pairwise relationship likelihoods, each pairwise relationship likelihood measuring a likelihood between a candidate descendant position and another dataset that also represents a descendant of the common parent node, and
selecting a candidate data tree as the data tree based on the plurality of pairwise relationship likelihoods; and
outputting the data tree with the target individual dataset located in the descendant position.
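
For illustration only, the following Python sketch walks through two steps recited in claim 1: testing whether the total length of shared IBD segments exceeds a threshold, and selecting among candidate data trees using pairwise relationship likelihoods. It is not the patented implementation. Every name in it (Segment, is_related, candidate_trees, select_tree), the child-to-parent dictionary used as the data tree, the rule that expected sharing halves with each meiosis, and the Gaussian scoring with sigma_cm are assumptions introduced here and do not appear in the claim.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A shared IBD segment, measured in centimorgans (hypothetical type)."""
    start_cm: float
    end_cm: float


def total_ibd_length(segments):
    """Sum the lengths of all shared IBD segments."""
    return sum(s.end_cm - s.start_cm for s in segments)


def is_related(segments, threshold_cm):
    """True when the total shared IBD length exceeds the threshold."""
    return total_ibd_length(segments) > threshold_cm


def ancestors(tree, node):
    """Return [node, parent, grandparent, ...]; tree maps child -> parent."""
    chain = [node]
    while node in tree:
        node = tree[node]
        chain.append(node)
    return chain


def meioses_between(tree, a, b):
    """Count parent-child steps on the path from a to b via their nearest common ancestor."""
    chain_a, chain_b = ancestors(tree, a), ancestors(tree, b)
    for steps_a, node in enumerate(chain_a):
        if node in chain_b:
            return steps_a + chain_b.index(node)
    raise ValueError("no common ancestor")


def candidate_trees(tree, target, attach_points):
    """Build one candidate tree per position, each adding target under a different node."""
    return [{**tree, target: point} for point in attach_points]


def tree_log_likelihood(tree, target, observed_cm, sigma_cm=200.0):
    """Sum toy pairwise log-likelihoods between target and each other descendant.

    Assumption: expected sharing is 6800 / 2**m cM for m meioses through a
    single common ancestor, scored with an unnormalized Gaussian.
    """
    total = 0.0
    for other, shared in observed_cm.items():
        expected = 6800.0 / 2 ** meioses_between(tree, target, other)
        total += -0.5 * ((shared - expected) / sigma_cm) ** 2
    return total


def select_tree(trees, target, observed_cm):
    """Pick the candidate tree with the highest summed pairwise log-likelihood."""
    return max(trees, key=lambda t: tree_log_likelihood(t, target, observed_cm))


# Usage: "rel" is a known child of common parent "P"; target "T" shares 1700 cM with "rel".
shared = [Segment(0.0, 12.5), Segment(40.0, 47.5)]
related = is_related(shared, threshold_cm=15.0)
trees = candidate_trees({"rel": "P"}, "T", attach_points=["P", "rel"])
best = select_tree(trees, "T", observed_cm={"rel": 1700.0})
```

In this toy example the first candidate tree places T directly under P and the second places T under rel. Under the assumed sharing model, the observed 1700 cM fits the first placement better, so that candidate tree is selected.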