US 12,237,053 B2
Detecting cross-contamination in sequencing data
Onur Sakarya, Redwood City, CA (US); and John Lamping, Los Altos, CA (US)
Assigned to GRAIL, Inc., Menlo Park, CA (US)
Filed by GRAIL, Inc., Menlo Park, CA (US)
Filed on Jun. 26, 2018, as Appl. No. 16/019,315.
Claims priority of provisional application 62/633,008, filed on Feb. 20, 2018.
Claims priority of provisional application 62/534,868, filed on Jul. 20, 2017.
Claims priority of provisional application 62/525,655, filed on Jun. 27, 2017.
Prior Publication US 2018/0373832 A1, Dec. 27, 2018
Int. Cl. G16B 5/20 (2019.01); C12Q 1/6806 (2018.01); G06N 20/00 (2019.01); G16B 5/00 (2019.01); G16B 20/20 (2019.01); G16B 30/00 (2019.01); G16B 30/10 (2019.01); G16B 30/20 (2019.01); G16B 40/10 (2019.01)

CPC G16B 5/20 (2019.02) [C12Q 1/6806 (2013.01); G16B 5/00 (2019.02); G16B 20/20 (2019.02); G16B 30/00 (2019.02); G06N 20/00 (2019.01); G16B 30/10 (2019.02); G16B 30/20 (2019.02); G16B 40/10 (2019.02)]

19 Claims

1. A method for identifying contamination in a test sequence, the method comprising:

receiving a plurality of test sequences including a plurality of single nucleotide polymorphisms (SNPs) that are homozygous and derived from a targeted sequencing of nucleic acid fragments from a cell-free DNA (cfDNA) sample, with each test sequence comprising one or more SNPs;

filtering the plurality of test sequences by removing one or more test sequences wherein remaining test sequences together form a population, and wherein filtering comprises:

generating a first distribution based on a minor allele depth of the SNPs for the cfDNA sample, a total depth of major alleles and minor alleles of the SNPs for the cfDNA sample, and a heterozygosity level,

determining a second test sequence includes a loss of heterozygosity by applying the distribution to the one or more SNPs of the second test sequence, and

excluding the second test sequence including the loss of heterozygosity from the population;

determining a prior contamination probability for each SNP of the population, the prior contamination probability based on a minor allele frequency for the SNP;

applying a contamination model including a first likelihood test to a first test sequence of the population to determine a first current contamination probability for the first test sequence using the one or more prior contamination probabilities for the one or more SNPs of the first test sequence, the first current contamination probability representing the likelihood that the first test sequence is contaminated, wherein applying the contamination model further comprises comparing a set of generated contaminated test sequences to a set of previously obtained non-contaminated test sequences as part of determining the first current contamination probability;

detecting a contamination event by determining that the first current contamination probability of the first test sequence is above a first threshold; and

responsive to detecting the contamination event, discarding the cfDNA sample without performing variant calling on the plurality of test sequences derived from the cfDNA sample.
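
The prior, likelihood-test, and threshold steps recited above can be illustrated with a minimal sketch. It is not the claimed implementation: the binomial read-count model, the Hardy-Weinberg genotype priors, the fixed hypothesized contamination level `alpha`, the sequencing `error_rate`, the logistic conversion of the log-likelihood ratio into a probability (which assumes equal prior odds), and all function names are assumptions introduced here for illustration. The loss-of-heterozygosity filtering and the comparison against generated contaminated and previously obtained non-contaminated sequences are not modeled.

```python
import math


def binom_logpmf(k, n, p):
    """Log binomial probability of k minor-allele reads out of n total reads."""
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep the logs finite
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)


def genotype_priors(maf):
    """Assumed Hardy-Weinberg priors for 0, 1, or 2 minor-allele copies
    in a hypothetical contaminating genome."""
    return {0: (1 - maf) ** 2, 1: 2 * maf * (1 - maf), 2: maf ** 2}


def contaminant_prior(maf):
    """Prior probability that a contaminant carries the minor allele at a SNP
    where the host is homozygous for the major allele (assumption)."""
    return 1.0 - (1.0 - maf) ** 2


def snp_log_likelihood(minor_depth, total_depth, maf, alpha, error_rate):
    """Log-likelihood of the observed minor-allele depth at contamination
    level alpha, marginalized over the contaminant genotype."""
    total = 0.0
    for copies, prior in genotype_priors(maf).items():
        expected_fraction = error_rate + alpha * copies / 2.0
        total += prior * math.exp(binom_logpmf(minor_depth, total_depth, expected_fraction))
    return math.log(max(total, 1e-300))  # guard against underflow to zero


def contamination_probability(snps, alpha=0.01, error_rate=0.001):
    """Likelihood-ratio test of 'contaminated at alpha' versus 'clean',
    mapped to a probability under assumed equal prior odds.

    snps: iterable of (minor_depth, total_depth, maf) for homozygous SNPs.
    """
    llr = sum(
        snp_log_likelihood(k, n, maf, alpha, error_rate)
        - snp_log_likelihood(k, n, maf, 0.0, error_rate)
        for k, n, maf in snps
    )
    if llr >= 0:
        return 1.0 / (1.0 + math.exp(-llr))
    e = math.exp(llr)  # stable form for large negative llr
    return e / (1.0 + e)


def should_discard(snps, threshold=0.5):
    """Flag a contamination event when the probability exceeds the threshold."""
    return contamination_probability(snps) > threshold


# Hypothetical read counts: (minor_depth, total_depth, population MAF)
clean_sample = [(1, 1000, 0.3)] * 10
mixed_sample = [(1, 1000, 0.3)] * 5 + [(6, 1000, 0.3)] * 4 + [(11, 1000, 0.3)]
```

In this sketch, `should_discard(clean_sample)` evaluates to `False` and `should_discard(mixed_sample)` to `True`, so only the second hypothetical sample would be discarded before variant calling. The threshold of 0.5 is arbitrary; the claim does not state a value.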