US 11,929,147 B2
Direct variant phasing in long reads to detect quasispecies
Darya Filippova, Mountain View, CA (US); Khai Luong, Oakland, CA (US); and Garima Kushwaha, San Francisco, CA (US)
Assigned to Roche Sequencing Solutions, Inc., Pleasanton, CA (US)
Appl. No. 16/646,962
Filed by Roche Sequencing Solutions, Inc., Pleasanton, CA (US)
PCT Filed Sep. 12, 2018, PCT No. PCT/EP2018/074639
§ 371(c)(1), (2) Date Mar. 12, 2020,
PCT Pub. No. WO2019/053076, PCT Pub. Date Mar. 21, 2019.
Claims priority of provisional application 62/558,696, filed on Sep. 14, 2017.
Prior Publication US 2020/0211675 A1, Jul. 2, 2020
Int. Cl. G16B 30/00 (2019.01); G16B 20/20 (2019.01); C12Q 1/6883 (2018.01)
CPC G16B 30/00 (2019.02) [G16B 20/20 (2019.02); C12Q 1/6883 (2013.01); C12Q 2600/156 (2013.01)]
19 Claims
 
1. A method comprising:
performing an assay on DNA molecules in one or more samples to obtain a plurality of long sequence reads such that at least some variants in the long sequence reads occur at a frequency above a sequencer noise floor, wherein the one or more samples include a group of organisms of a same species, wherein at least some of the organisms of the same species have different genomes, wherein the plurality of long sequence reads includes at least 100 long sequence reads, and wherein each long sequence read is at least 1000 bases in length;
generating, by a computing system, a variant matrix characterized by a plurality of rows and a plurality of columns, wherein each row of the plurality of rows of the variant matrix represents a long sequence read, wherein each column of the plurality of columns of the variant matrix corresponds to a variant that satisfies one or more quality criteria, wherein the one or more quality criteria comprise: (i) a number of long sequence reads at a locus of the variant is greater than a predetermined threshold, (ii) a frequency of occurrence of the variant at a locus is greater than a predetermined threshold, or (iii) both (i) and (ii), and wherein generating the variant matrix comprises:
determining a total number of the long sequence reads;
comparing the long sequence reads to a reference sequence to identify variant loci;
determining a total number of variant loci that meet the one or more quality criteria;
generating the variant matrix based on the total number of the long sequence reads and the total number of variant loci that satisfy the one or more quality criteria; and
populating column values of the columns of each row of the variant matrix based on a presence or absence of the variants that satisfy the one or more quality criteria in each of the long sequence reads;
creating, by the computing system, a hierarchy of clusters for the plurality of rows of the variant matrix based on differences in the column values among the plurality of rows of the variant matrix;
splitting, by the computing system, the hierarchy of clusters into clusters representing different quasispecies of the genomes; and
identifying one or more quasispecies for a sample in the one or more samples based on the clusters representing the different quasispecies of the genomes.
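
By way of non-limiting illustration only, the computational steps recited in claim 1 (generating the variant matrix, creating a hierarchy of clusters, and splitting the hierarchy into clusters) may be sketched as follows. The sketch is not the claimed implementation and rests on assumptions that are not in the claim: the reads are taken to be already aligned to the reference and to differ from it only by substitutions, the distance between rows is a Hamming distance, the hierarchy is built by average linkage, and the hierarchy is split by stopping merges at a distance threshold. The function names, parameter names, threshold values, and toy reads are all hypothetical, and the toy input is far smaller than the at least 100 reads of at least 1000 bases recited in the claim.

```python
from collections import Counter
from itertools import combinations


def build_variant_matrix(reads, reference, min_depth=2, min_freq=0.2):
    """Return (columns, matrix): one row per read, one column per retained variant.

    Assumption: reads are pre-aligned strings of the same length as the
    reference, with '-' marking positions a read does not cover.
    """
    columns = []  # (position, alternate base) pairs that pass both criteria
    for pos, ref_base in enumerate(reference):
        covering = [r[pos] for r in reads if r[pos] != "-"]
        depth = len(covering)
        if depth <= min_depth:  # criterion (i): read count at the locus
            continue
        for base, count in sorted(Counter(covering).items()):
            # criterion (ii): frequency of the variant at the locus
            if base != ref_base and count / depth > min_freq:
                columns.append((pos, base))
    # Column value is 1 if the read carries the variant, otherwise 0.
    matrix = [[1 if r[pos] == base else 0 for pos, base in columns] for r in reads]
    return columns, matrix


def hamming(a, b):
    """Number of columns in which two rows differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_rows(matrix, max_distance):
    """Average-linkage agglomerative clustering of the matrix rows.

    Merging stops once the closest pair of clusters is farther apart than
    max_distance, which here stands in for splitting the hierarchy.
    """
    clusters = [[i] for i in range(len(matrix))]

    def linkage(c1, c2):
        total = sum(hamming(matrix[i], matrix[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        (a, b), dist = min(
            (((a, b), linkage(clusters[a], clusters[b]))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if dist > max_distance:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters


def cluster_variants(columns, matrix, cluster):
    """Variants carried by more than half of the reads in a cluster."""
    return [col for k, col in enumerate(columns)
            if 2 * sum(matrix[i][k] for i in cluster) > len(cluster)]


# Toy input: three haplotypes plus one read with an isolated error at position 7.
reference = "ACGTACGTAC"
reads = [
    "ACGTACGTAC", "ACGTACGTAC",  # match the reference
    "TCGTACGTAG", "TCGTACGTAG",  # variants at positions 0 and 9
    "ACGTTCGTAC", "ACGTTCGTAC",  # variant at position 4
    "ACGTACGAAC",                # single-read error, filtered by min_freq
]
columns, matrix = build_variant_matrix(reads, reference)
clusters = cluster_rows(matrix, max_distance=0.5)
for cluster in clusters:
    print(cluster, cluster_variants(columns, matrix, cluster))
```

In this toy input the substitution at position 7 appears in one of seven reads, which falls below the assumed frequency threshold, so it does not become a column and the read carrying it groups with the reference-like reads. Each resulting cluster, together with the variants shared by most of its reads, corresponds to one candidate quasispecies in the sense of the final step of the claim.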