US 12,367,948 B2
Computer-implemented method of analysing genetic data about an organism
Christopher Charles Alan Spencer, Oxford (GB); Gerard Anton Lunter, Oxford (GB); Peter James Donnelly, Oxford (GB); and Vincent Yann Marie Plagnol, Oxford (GB)
Assigned to GENOMICS LIMITED, Oxford (GB)
Appl. No. 15/733,547
Filed by GENOMICS LIMITED, Oxford (GB)
PCT Filed Feb. 26, 2019, PCT No. PCT/GB2019/050525
§ 371(c)(1), (2) Date Aug. 25, 2020,
PCT Pub. No. WO2019/166792, PCT Pub. Date Sep. 6, 2019.
Claims priority of application No. 1803202.9 (GB), filed on Feb. 27, 2018.
Prior Publication US 2020/0402614 A1, Dec. 24, 2020
Int. Cl. G01N 33/48 (2006.01); G06F 18/20 (2023.01); G06N 3/00 (2023.01); G16B 5/20 (2019.01); G16B 20/40 (2019.01); G16H 20/13 (2018.01); G16H 70/40 (2018.01)
CPC G16B 20/40 (2019.02) [G06F 18/295 (2023.01); G06N 3/002 (2013.01); G16B 5/20 (2019.02); G16H 20/13 (2018.01); G16H 70/40 (2018.01); C12Q 2600/112 (2013.01)]
30 Claims
 
1. A computer-implemented method of analysing genetic data about an organism, comprising:
accessing a plurality of genome-wide association studies;
deriving, using the plurality of genome-wide association studies, at least 50 input units, wherein:
each input unit is a data structure derived from a genome-wide association study of the plurality of genome-wide association studies that provides summary statistic data including an inferred effect size of each of a plurality of genetic variants along a genome of the organism on a phenotype corresponding to the input unit and a standard error of the inferred effect size, and
the deriving comprises:
completing each genome-wide association study having missing data about an association between one or more genetic variants and the phenotype corresponding to the input unit by modeling the association between the one or more genetic variants and the phenotype corresponding to the input unit, and
determining a set of characteristics for each input unit based on the summary statistic data associated with the genome-wide association study, the set of characteristics comprising probability metrics which quantify evidence for each of the plurality of genetic variants being causal for the phenotype corresponding to the input unit;
selecting a region or regions of the genome of the organism;
for each of the selected region or regions, assigning each of the input units to one or more of a plurality of clusters, the assigning being an iterative process for exploring a space of possible assignments using a Markov Chain Monte Carlo (MCMC) algorithm designed to effectively explore an exponentially large space of study assignments, the iterative process comprising:
determining a degree of similarity between i) the set of characteristics of the input unit, and ii) a set of characteristics of each of the plurality of clusters, wherein the set of characteristics of each cluster is either pre-determined or calculated by combining the sets of characteristics of input units already assigned to the cluster, and
either a) assigning the input unit to one or more existing clusters of the plurality of clusters with a probability dependent on the corresponding degree of similarity, or
b) creating a new cluster in the plurality of clusters and assigning the input unit to the new cluster with a probability dependent on the set of characteristics of the input unit and the sets of characteristics of the existing clusters of the plurality of clusters;
identifying that the phenotypes corresponding to the input units assigned to the same cluster share underlying biological mechanisms; and
displaying, for each of the selected region or regions, an analysis across all input units using a computer-based interface that summarizes a membership of each of the clusters and the assignment of each of the genetic variants to each cluster based on the assigning and the identifying.
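
Illustrative sketch (not part of the claims). The assigning step of claim 1 can be pictured with a small Python example. Everything below is an assumption made for illustration and is not taken from the patent: the probability metrics are computed as Wakefield-style approximate Bayes factors normalised under a single-causal-variant assumption, the similarity is the dot product of two per-variant probability vectors, a cluster's characteristics are the mean of its members' vectors, and the MCMC move is a Gibbs-style sweep with a Chinese-restaurant-process-like weight for opening a new cluster. The function names and the parameters `prior_var` and `alpha` are hypothetical.

```python
import math
import random


def approx_bayes_factor(beta, se, prior_var=0.04):
    """Approximate Bayes factor (association vs. null) from an effect size and its standard error."""
    v = se * se
    shrink = prior_var / (prior_var + v)
    z2 = (beta / se) ** 2
    return math.sqrt(1.0 - shrink) * math.exp(0.5 * z2 * shrink)


def causal_probabilities(betas, ses, prior_var=0.04):
    """Per-variant probability of being causal, assuming exactly one causal variant in the region."""
    bfs = [approx_bayes_factor(b, s, prior_var) for b, s in zip(betas, ses)]
    total = sum(bfs)
    return [bf / total for bf in bfs]


def similarity(p, q):
    """Probability that two profiles point at the same variant (dot product)."""
    return sum(a * b for a, b in zip(p, q))


def cluster_profile(members):
    """Combine the characteristics of the units already in a cluster (here: their mean)."""
    n = len(members)
    return [sum(col) / n for col in zip(*members)]


def assignment_probabilities(p, clusters, alpha=1.0):
    """Probabilities of joining each existing cluster, followed by that of opening a new one."""
    weights = [len(m) * similarity(p, cluster_profile(m)) for m in clusters]
    # New cluster: similarity of p to a uniform profile is 1 / number of variants.
    weights.append(alpha / len(p))
    total = sum(weights)
    return [w / total for w in weights]


def gibbs_sweep(units, labels, alpha=1.0, rng=random):
    """One pass over all input units, resampling each unit's cluster label in turn."""
    for i, p in enumerate(units):
        labels[i] = None  # take unit i out of its current cluster
        ids = sorted({l for l in labels if l is not None})
        clusters = [[units[j] for j, l in enumerate(labels) if l == c] for c in ids]
        probs = assignment_probabilities(p, clusters, alpha)
        choice = rng.choices(range(len(probs)), weights=probs)[0]
        labels[i] = ids[choice] if choice < len(ids) else max(ids, default=-1) + 1
    return labels


# Three toy input units over three variants (made-up summary statistics).
ses = [0.05, 0.05, 0.05]
unit_a = causal_probabilities([0.30, 0.02, 0.01], ses)
unit_b = causal_probabilities([0.01, 0.02, 0.28], ses)
unit_c = causal_probabilities([0.29, 0.01, 0.02], ses)

probs_c = assignment_probabilities(unit_c, [[unit_a], [unit_b]])
labels = gibbs_sweep([unit_a, unit_b, unit_c], [0, 1, 2], rng=random.Random(0))
```

In this toy setup `unit_c` has its signal at the same variant as `unit_a`, so joining the cluster containing `unit_a` receives most of the probability mass, while opening a new cluster stays possible at every step. Repeating `gibbs_sweep` many times and tallying the resulting labels is one way to summarise cluster membership; the claimed method may differ in every one of these choices.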