US 11,990,206 B2
Methods for detecting variants in next-generation sequencing genomic data
Lin Song, Saint Sulpice (CH); and Zhenyu Xu, Commugny (CH)
Assigned to SOPHIA GENETICS S.A., Rolle (CH)
Appl. No. 16/467,265
Filed by SOPHIA GENETICS S.A., Saint Sulpice (CH)
PCT Filed Dec. 7, 2017, PCT No. PCT/EP2017/081863 § 371(c)(1), (2) Date Jun. 6, 2019, PCT Pub. No. WO2018/104466, PCT Pub. Date Jun. 14, 2018.
Claims priority of application No. 16202691 (EP), filed on Dec. 7, 2016.
Prior Publication US 2019/0311779 A1, Oct. 10, 2019
Int. Cl. G16B 30/00 (2019.01); G06F 17/18 (2006.01); G16B 20/00 (2019.01); G16B 20/20 (2019.01); G16B 20/30 (2019.01); G16B 30/10 (2019.01)

CPC G16B 30/00 (2019.02) [G06F 17/18 (2013.01); G16B 20/00 (2019.02); G16B 20/20 (2019.02); G16B 20/30 (2019.02); G16B 30/10 (2019.02)]

13 Claims

1. A method for detecting and characterizing, with a processor, a genomic variant scenario as a combination of at least two genomic sequence variants in poly-T and poly-TG tracts of CFTR gene alleles from a cystic fibrosis patient sample, the method comprising:

(a) obtaining a plurality of patient data sequence reads from an enriched genomic sample of a patient using next generation sequencing, wherein obtaining the enriched genomic sample of the patient comprises targeting and enriching sub-regions of a genomic sample of the patient corresponding to the poly-T and poly-TG tracts of CFTR gene alleles;

(b) obtaining a plurality of control data sequence reads from an enriched genomic control sample using next generation sequencing, wherein obtaining the enriched genomic sample of the control data sequence reads comprises targeting and enriching sub-regions of a genomic control sample corresponding to the poly-T and poly-TG tracts of CFTR gene alleles;

(c) measuring a probability distribution of the length of a poly-TG repeat pattern, in the plurality of control data sequence reads;

(d) for each possible genomic variant of the poly-TG repeat pattern relative to the control data poly-TG repeat pattern, estimating the expected probability distribution of the length of the poly-TG repeat pattern for this genomic variant as a function of this genomic variant and of the measured probability distribution of the length of the poly-TG repeat pattern in the plurality of control data sequence reads;

(e) measuring a patient probability distribution of the length of the poly-TG repeat pattern in the plurality of patient data sequence reads;

(f) comparing the measured patient probability distribution and the expected probability distribution of the length of the poly-TG repeat pattern in the control sample;

(g) for each possible genomic variant of the poly-TG repeat pattern, comparing the measured patient probability distribution and the expected probability distribution of the length of the poly-TG repeat pattern for this genomic variant;

(h) selecting an estimated patient probability distribution of the poly-TG repeat pattern length characterizing a poly-TG genomic sequence variant of the cystic fibrosis patient sample as the expected probability distribution which results in the closest comparison with the measured patient probability distribution;

(i) measuring a probability distribution of the length of a poly-T repeat pattern in the plurality of control data sequence reads as the expected probability distribution of the length of the poly-T repeat pattern in the control sample;

(j) for each possible genomic variant of the poly-T repeat pattern relative to the control data poly-T repeat pattern, estimating the expected probability distribution of the length of the poly-T repeat pattern for this genomic variant as a function of this genomic variant and of the measured probability distribution of the length of the poly-T repeat pattern in the plurality of control data sequence reads;

(k) measuring a patient probability distribution of the length of the poly-T repeat pattern in the plurality of patient data sequence reads;

(l) comparing the measured patient probability distribution and the expected probability distribution of the length of the poly-T repeat pattern in the control sample;

(m) for each possible genomic variant of the poly-T repeat pattern, comparing the measured patient probability distribution and the expected probability distribution of the length of the poly-T repeat pattern for this genomic variant;

(n) selecting an estimated patient probability distribution of the poly-T repeat pattern length characterizing a poly-T genomic sequence variant of the cystic fibrosis patient sample as the expected probability distribution which results in the closest comparison with the measured patient probability distribution;

(o) estimating a first expected joint probability distribution of the length of the poly-TG repeat pattern and of the length of the poly-T repeat pattern for at least one first genomic variant scenario in the plurality of control data sequence reads, the first genomic variant scenario being characterized by: a first allele of the poly-TG repeat pattern genomic variant selected in step (h) and a first allele of the poly-T repeat pattern genomic variant selected in step (n) are on the same reads, while a second allele of the poly-TG repeat pattern genomic variant selected in step (h) and a second allele of the poly-T repeat pattern genomic variant selected in step (n) are on the same reads;

(p) estimating a second expected joint probability distribution of the length of the poly-TG repeat pattern and of the length of the poly-T repeat pattern for at least one second genomic variant scenario in the plurality of control data sequence reads, the second genomic variant scenario being characterized by: the first allele of the poly-TG repeat pattern genomic variant selected in step (h) and the second allele of the poly-T repeat pattern genomic variant selected in step (n) are on the same reads, while the second allele of the poly-TG repeat pattern genomic variant selected in step (h) and the first allele of the poly-T repeat pattern genomic variant selected in step (n) are on the same reads;

(q) measuring, read by read, the patient joint probability distribution for the length of the poly-TG repeat pattern and the length of the poly-T repeat pattern in the plurality of patient data sequence reads;

(r) comparing the measured patient joint probability distribution and the first expected joint probability distribution for the first genomic variant scenario;

(s) comparing the measured patient joint probability distribution and the second expected joint probability distribution for the second genomic variant scenario; and

(t) selecting the genomic variant scenario characterizing the actual genomic variant scenario for the patient as the scenario which results in the closest comparison.
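
The flow of steps (c) through (t) can be illustrated with a minimal sketch. It is not the patented implementation, and everything in it beyond the claim wording is an assumption made for illustration: the control sample is taken to be homozygous with a known repeat count, the expected distribution for a variant allele is modeled as the control distribution shifted by the difference in repeat count, a diploid sample is modeled as an equal-weight mixture of two alleles, and "closest comparison" is taken to be the smallest L1 distance. The function names (`measure`, `shift`, `mix`, `l1`, `call_genotype`, `call_scenario`) and the read counts are hypothetical and synthetic.

```python
from collections import Counter
from itertools import combinations_with_replacement


def measure(observations):
    """Empirical probability distribution of hashable observations (steps c, e, i, k, q)."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


def shift(dist, delta):
    """Assumed model: a variant allele shows the control profile shifted by delta repeats."""
    return {length + delta: p for length, p in dist.items()}


def mix(d1, d2):
    """Equal-weight mixture of two allele distributions (assumed diploid sample)."""
    return {k: 0.5 * d1.get(k, 0.0) + 0.5 * d2.get(k, 0.0) for k in set(d1) | set(d2)}


def l1(p, q):
    """L1 distance, used here as the (assumed) comparison measure."""
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def call_genotype(control_lengths, patient_lengths, control_allele, candidates):
    """Steps (d)-(h) or (j)-(n): pick the allele pair whose expected profile is closest."""
    control = measure(control_lengths)
    patient = measure(patient_lengths)
    best_pair, best_score = None, float("inf")
    for a1, a2 in combinations_with_replacement(sorted(candidates), 2):
        expected = mix(shift(control, a1 - control_allele),
                       shift(control, a2 - control_allele))
        score = l1(patient, expected)
        if score < best_score:
            best_pair, best_score = (a1, a2), score
    return best_pair


def call_scenario(control_pairs, patient_pairs, control_tg, control_t, tg_geno, t_geno):
    """Steps (o)-(t): compare the read-by-read joint profile with both pairings."""
    control = measure(control_pairs)  # joint distribution over (TG length, T length)

    def allele(tg, t):
        # Shift the control joint profile along both axes (assumed model).
        return {(i + tg - control_tg, j + t - control_t): p
                for (i, j), p in control.items()}

    first = mix(allele(tg_geno[0], t_geno[0]), allele(tg_geno[1], t_geno[1]))
    second = mix(allele(tg_geno[0], t_geno[1]), allele(tg_geno[1], t_geno[0]))
    patient = measure(patient_pairs)
    return "first" if l1(patient, first) <= l1(patient, second) else "second"


# Synthetic reads as (poly-TG length, poly-T length) pairs; control assumed TG11/T7.
control_pairs = [(11, 7)] * 80 + [(10, 7)] * 10 + [(11, 6)] * 10
patient_pairs = ([(11, 7)] * 40 + [(10, 7)] * 5 + [(11, 6)] * 5
                 + [(12, 5)] * 40 + [(11, 5)] * 5 + [(12, 4)] * 5)

tg_geno = call_genotype([tg for tg, _ in control_pairs],
                        [tg for tg, _ in patient_pairs], 11, {9, 10, 11, 12, 13})
t_geno = call_genotype([t for _, t in control_pairs],
                       [t for _, t in patient_pairs], 7, {5, 7, 9})
scenario = call_scenario(control_pairs, patient_pairs, 11, 7, tg_geno, t_geno)
print(tg_geno, t_geno, scenario)
```

In this sketch the two tracts are genotyped independently from their marginal length distributions, and only then are the two possible allele pairings compared against the joint distribution measured read by read. The marginals alone cannot distinguish the two scenarios, since both pairings produce the same per-tract length profiles, which is why the read-level joint measurement of step (q) is needed.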