US 11,894,105 B2
Methods for detection of fusions using compressed molecular tagged nucleic acid
Rajesh Gottimukkala, Fremont, CA (US); Cheng-Zong Bai, Taipei (TW); Dumitru Brinza, Montara, CA (US); Jeoffrey Schageman, Austin, TX (US); and Varun Bagai, Austin, TX (US)
Assigned to Life Technologies Corporation, Carlsbad, CA (US)
Filed by LIFE TECHNOLOGIES CORPORATION, Carlsbad, CA (US)
Filed on Sep. 20, 2018, as Appl. No. 16/136,463.
Claims priority of provisional application 62/560,745, filed on Sep. 20, 2017.
Prior Publication US 2019/0087539 A1, Mar. 21, 2019
Int. Cl. G16B 30/00 (2019.01); C12Q 1/6853 (2018.01); G16B 30/10 (2019.01); G16B 20/20 (2019.01); G16B 50/50 (2019.01)
CPC G16B 30/00 (2019.02) [C12Q 1/6853 (2013.01); G16B 20/20 (2019.02); G16B 30/10 (2019.02); G16B 50/50 (2019.02)]
20 Claims
 
1. A method for compressing molecular tagged nucleic acid sequence data for fusion detection, comprising:
amplifying a nucleic acid sample in a presence of primers to produce a plurality of amplicons, the primers including a 5′ primer and a 3′ primer, the 5′ primer and the 3′ primer flanking a breakpoint region associated with a gene fusion, wherein a prefix tag is appended to the 5′ primer and a suffix tag is appended to the 3′ primer, the prefix tag and suffix tag comprising a unique molecular tag for a polynucleotide molecule in the nucleic acid sample;
sequencing the plurality of amplicons to generate a plurality of nucleic acid sequence reads;
mapping the nucleic acid sequence reads to a reference sequence to produce a plurality of sequence alignments, the reference sequence including a targeted fusion reference sequence;
receiving, at a processor, the plurality of nucleic acid sequence reads and the plurality of sequence alignments for a plurality of families of sequence reads, wherein each sequence read is associated with a molecular tag sequence, the molecular tag sequence identifying a family of sequence reads resulting from a particular polynucleotide molecule in the nucleic acid sample, each family having a number of sequence reads, wherein a portion of the sequence alignments corresponds to sequence reads mapped to the targeted fusion reference sequence;
determining a consensus sequence read for each family of sequence reads based on flow space signal measurements corresponding to the sequence reads for the family;
determining a consensus sequence alignment for each family of sequence reads, comprising selecting the sequence alignment having a highest mapping quality corresponding to the family of sequence reads and comparing the sequence read corresponding to the sequence alignment having the highest mapping quality to the consensus sequence read for the family, wherein a portion of the consensus sequence alignments corresponds to the consensus sequence reads aligned with the targeted fusion reference sequence;
generating a compressed data structure comprising consensus compressed data, the consensus compressed data including the consensus sequence read and the consensus sequence alignment for each family, wherein a data volume of the compressed data structure is less than an original data volume of the plurality of nucleic acid sequence reads and the plurality of sequence alignments;
storing the compressed data structure in a memory, wherein an amount of memory for the storing the compressed data structure is less than an original amount of memory for storing the plurality of nucleic acid sequence reads and the plurality of sequence alignments; and
detecting a fusion using the consensus sequence reads and the consensus sequence alignments from the compressed data structure, wherein the detecting comprises identifying an eligible consensus sequence read based on characteristics of the consensus sequence alignment of the consensus sequence read with the targeted fusion reference sequence and determining whether a number of families corresponding to the eligible consensus sequence reads aligned with the targeted fusion reference sequence is greater than or equal to a minimum molecular count threshold.
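
Illustrative sketch (not part of the claim). The computational steps of claim 1 (grouping reads into families by molecular tag, deriving a consensus read and consensus alignment per family, storing one compressed record per family, and applying a minimum molecular count threshold) might be sketched in Python roughly as follows. Everything concrete here is an assumption made for illustration and is not taken from the patent: the `Read` fields, the `TACG` flow order, averaging flow signals and rounding them to homopolymer lengths as the consensus rule, the use of a mapping-quality cutoff as the alignment "characteristic" that makes a consensus read eligible, the default thresholds, and the reference name `EML4-ALK.E13A20`. The claim does not say what follows from comparing the best-mapped read to the consensus read; this sketch simply records whether the two agree and requires agreement for eligibility.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

FLOW_ORDER = "TACG"  # hypothetical repeating nucleotide flow order


@dataclass
class Read:
    tag: str        # molecular tag (prefix tag + suffix tag)
    flows: list     # flow space signal measurements, one value per flow
    ref_name: str   # name of the reference sequence the read mapped to
    mapq: int       # mapping quality of that alignment


def call_bases(flows):
    """Round each flow signal to a homopolymer length and emit the bases."""
    return "".join(
        FLOW_ORDER[i % len(FLOW_ORDER)] * max(0, round(signal))
        for i, signal in enumerate(flows)
    )


def compress(reads):
    """Return one compressed record per family (keyed by molecular tag)."""
    families = defaultdict(list)
    for read in reads:
        families[read.tag].append(read)

    compressed = {}
    for tag, family in families.items():
        # Assumed consensus rule: average the signals flow by flow.
        n_flows = min(len(r.flows) for r in family)
        mean_flows = [mean(r.flows[i] for r in family) for i in range(n_flows)]
        consensus_read = call_bases(mean_flows)

        # Consensus alignment: take the alignment with the highest mapping
        # quality and compare its read to the consensus read.
        best = max(family, key=lambda r: r.mapq)
        compressed[tag] = {
            "consensus_read": consensus_read,
            "ref_name": best.ref_name,
            "mapq": best.mapq,
            "best_matches_consensus":
                call_bases(best.flows[:n_flows]) == consensus_read,
            "family_size": len(family),
        }
    return compressed


def detect_fusion(compressed, fusion_ref, min_mapq=20, min_molecular_count=2):
    """Count eligible families on fusion_ref and apply the count threshold."""
    eligible = [
        tag for tag, rec in compressed.items()
        if rec["ref_name"] == fusion_ref
        and rec["mapq"] >= min_mapq          # assumed eligibility criterion
        and rec["best_matches_consensus"]    # assumed eligibility criterion
    ]
    return len(eligible) >= min_molecular_count, len(eligible)


# Toy input: six reads from three tagged molecules (all values made up).
reads = [
    Read("AAA-TTT", [1.0, 0.1, 2.1, 0.0], "EML4-ALK.E13A20", 40),
    Read("AAA-TTT", [0.9, 0.0, 1.9, 0.1], "EML4-ALK.E13A20", 30),
    Read("AAA-TTT", [1.1, 0.2, 2.0, 0.0], "EML4-ALK.E13A20", 35),
    Read("CCC-GGG", [1.0, 1.0, 0.0, 1.0], "EML4-ALK.E13A20", 50),
    Read("CCC-GGG", [1.0, 1.0, 0.0, 1.0], "EML4-ALK.E13A20", 45),
    Read("GGG-AAA", [0.0, 1.0, 1.0, 0.0], "ALK_wildtype", 60),
]
compressed = compress(reads)
fusion_called, molecule_count = detect_fusion(compressed, "EML4-ALK.E13A20")
```

In this toy input, six reads and six alignments collapse to three family records, which is the sense in which the compressed data structure is smaller than the original data. Two of those families support the hypothetical fusion reference, so the call is positive at a threshold of 2 and would be negative at a threshold of 3.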