US 12,006,533 B2
Detecting cross-contamination in sequencing data using regression techniques
Onur Sakarya, San Francisco, CA (US); and Catalin Barbacioru, Fremont, CA (US)
Assigned to GRAIL, LLC, Menlo Park, CA (US)
Filed by Grail, Inc., Menlo Park, CA (US)
Filed on Feb. 20, 2018, as Appl. No. 15/900,645.
Application 15/900,645 is a continuation of application No. PCT/IB2018/050979, filed on Feb. 17, 2018.
Claims priority of provisional application 62/525,653, filed on Jun. 27, 2017.
Claims priority of provisional application 62/460,268, filed on Feb. 17, 2017.
Prior Publication US 2018/0237838 A1, Aug. 23, 2018
Int. Cl. C12Q 1/6827 (2018.01); C12Q 1/6809 (2018.01); G16B 20/20 (2019.01); G16B 30/10 (2019.01); G16B 30/20 (2019.01); G16B 40/20 (2019.01); G16B 40/30 (2019.01)
CPC C12Q 1/6827 (2013.01) [C12Q 1/6809 (2013.01); G16B 20/20 (2019.02); G16B 30/10 (2019.02); G16B 40/20 (2019.02); G16B 40/30 (2019.02); G16B 30/20 (2019.02)]
14 Claims
 
1. A method for identifying contamination in a test sequence using a processor, the method comprising:
accessing one or more physical samples from a first subject including one or more test sequences that are indicative of a cancer presence;
sequencing the one or more physical samples using a next-generation sequencing machine to produce a plurality of test sequences each comprising at least one single nucleotide polymorphism (SNP) from the first subject and collectively forming an initial population;
calling a plurality of variant alleles in the plurality of test sequences, each called variant allele identified as a SNP across the plurality of test sequences having a variant allele frequency (VAF) indicating contamination of the initial population with test sequences from a second subject;
identifying a plurality of population minor allele frequencies (MAFs) for the plurality of test sequences, each population minor allele frequency (MAF) quantifying a MAF for a SNP at a test site of a plurality of test sites across the plurality of test sequences;
filtering at least some of the SNPs of the plurality of test sequences in the initial population to form a filtered population, the filtering comprising, for each test sequence of the plurality of test sequences in the initial population:
selecting SNPs having a VAF in either a first range or a second range, both ranges indicative of homozygosity, and the first range different from the second range,
for each selected SNP in test sequences whose identified VAF is in the first range, setting a MAF for the selected SNP to the population MAF corresponding to the site of the selected SNP, and
for each selected SNP in test sequences whose identified VAF is in the second range, setting a MAF for the selected SNP to a quantity one minus the population MAF corresponding to the site of the selected SNP;
generating a noise model that estimates a measure of background noise level present in the plurality of test sequences in the filtered population based on measures of background noise levels present in a plurality of test sequences indicative of healthy individuals;
applying a contamination model to a test sequence of the plurality of test sequences in the filtered population using the identified plurality of population MAFs of the plurality of test sequences, the identified VAFs for SNPs across the plurality of test sequences, and the generated noise model based on the plurality of test sequences indicative of healthy individuals to obtain a confidence score representing a likelihood the test sequence originates from the second subject and is contaminating the initial population, wherein the confidence score is below a threshold, indicating the test sequence originates from the second subject; and
responsive to the confidence score indicating the test sequence originates from the second subject, discarding the one or more physical samples due to contamination.
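
The filtering step recited in claim 1 (select SNPs whose VAF falls in one of two homozygous ranges, then set each selected SNP's MAF to the population MAF or to one minus the population MAF) can be illustrated with a minimal Python sketch. This is not the patented implementation. The claim does not state numeric bounds for the two ranges, so FIRST_RANGE and SECOND_RANGE below are assumed values chosen only for illustration, and the Snp record, in_range, and filter_snps are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed bounds: the claim only requires two different VAF ranges,
# both indicative of homozygosity. These numbers are illustrative.
FIRST_RANGE = (0.0, 0.2)   # assumed: VAF near 0
SECOND_RANGE = (0.8, 1.0)  # assumed: VAF near 1


@dataclass
class Snp:
    """Hypothetical record for one called SNP at one test site."""
    site: str
    vaf: float             # observed variant allele frequency
    population_maf: float  # population minor allele frequency at the site


def in_range(value: float, bounds: Tuple[float, float]) -> bool:
    """Return True if value lies within the inclusive bounds."""
    low, high = bounds
    return low <= value <= high


def filter_snps(snps: List[Snp]) -> List[Tuple[str, float, float]]:
    """Keep SNPs in either range and assign each a MAF as the claim describes.

    Returns (site, vaf, assigned_maf) tuples for the filtered population.
    """
    filtered = []
    for snp in snps:
        if in_range(snp.vaf, FIRST_RANGE):
            # First range: MAF is set to the population MAF for the site.
            assigned_maf = snp.population_maf
        elif in_range(snp.vaf, SECOND_RANGE):
            # Second range: MAF is set to one minus the population MAF.
            assigned_maf = 1.0 - snp.population_maf
        else:
            # VAF in neither range: the SNP is not selected.
            continue
        filtered.append((snp.site, snp.vaf, assigned_maf))
    return filtered


# Usage with made-up values (not data from the patent).
example = [
    Snp("site_a", 0.01, 0.30),
    Snp("site_b", 0.50, 0.25),  # outside both assumed ranges, dropped
    Snp("site_c", 0.99, 0.25),
]
result = filter_snps(example)
```

With the assumed bounds, site_a keeps its population MAF, site_b is dropped, and site_c receives one minus its population MAF. The noise model and contamination model recited in the later steps are not sketched here because the claim does not specify their form.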