| CPC G16B 50/50 (2019.02) [G16B 20/00 (2019.02); G16B 20/20 (2019.02); G16B 20/40 (2019.02); G16B 30/00 (2019.02); G16B 40/00 (2019.02); G16B 40/10 (2019.02); G16B 50/00 (2019.02); H03M 7/70 (2013.01); C12Q 1/6869 (2013.01); G16B 30/10 (2019.02)] | 20 Claims |

|
1. A method for compressing molecular tagged nucleic acid sequence data, comprising:
receiving a plurality of nucleic acid sequence reads, wherein each sequence read is associated with a molecular tag sequence, the molecular tag sequence identifying a family of sequence reads resulting from a particular polynucleotide molecule in a nucleic acid sample;
grouping sequence reads associated with a same molecular tag sequence to form a family of sequence reads, each family having a number of members;
calculating an arithmetic mean of vectors of flow space signal measurements corresponding to the sequence reads for the family to form a vector of consensus flow space signal measurements for the family, wherein each vector of flow space signal measurements corresponds with one of the sequence reads;
determining a consensus base sequence based on the vector of consensus flow space signal measurements for the family;
determining a consensus sequence alignment by comparing the consensus base sequence to the sequence read having a highest mapping quality of sequence alignments corresponding to the sequence reads for the family, wherein each sequence alignment corresponds with one of the sequence reads; and
generating a compressed data structure comprising consensus compressed data, the consensus compressed data including for each family, the consensus base sequence, the consensus sequence alignment, the vector of consensus flow space signal measurements and the number of members.
|
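The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. It relies on several assumptions that are not in the claim: the cyclic flow order `TACG`, base calling by rounding each consensus flow signal to the nearest non-negative integer homopolymer length, an alignment stored as a hypothetical `"position:CIGAR"` string, and a comparison rule that reuses the highest-mapping-quality read's alignment only when the consensus bases equal that read's bases (otherwise the alignment is left as `None` for realignment). The names `Read`, `FamilyRecord`, `call_bases`, and `compress_families` are illustrative.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional

FLOW_ORDER = "TACG"  # assumed cyclic nucleotide flow order (not from the claim)


@dataclass
class Read:
    tag: str                   # molecular tag sequence identifying the family
    bases: str                 # base sequence called for this read
    flow_signals: List[float]  # vector of flow space signal measurements
    alignment: str             # hypothetical "position:CIGAR" alignment string
    mapq: int                  # mapping quality of this read's alignment


@dataclass
class FamilyRecord:
    consensus_bases: str
    consensus_alignment: Optional[str]  # None means realignment is needed
    consensus_signals: List[float]
    n_members: int


def call_bases(signals: List[float], flow_order: str = FLOW_ORDER) -> str:
    """Simplified base calling: round each signal to a homopolymer length."""
    called = []
    for i, signal in enumerate(signals):
        length = max(0, int(signal + 0.5))  # round half up, clamp at zero
        called.append(flow_order[i % len(flow_order)] * length)
    return "".join(called)


def compress_families(reads: List[Read]) -> Dict[str, FamilyRecord]:
    # Group reads that share the same molecular tag sequence.
    families: Dict[str, List[Read]] = defaultdict(list)
    for read in reads:
        families[read.tag].append(read)

    compressed: Dict[str, FamilyRecord] = {}
    for tag, members in families.items():
        n = len(members)
        # Element-wise arithmetic mean of the members' flow signal vectors.
        # Assumes all vectors in a family have the same length.
        columns = zip(*(m.flow_signals for m in members))
        mean_signals = [sum(col) / n for col in columns]

        consensus = call_bases(mean_signals)

        # Compare the consensus to the read with the highest mapping quality.
        best = max(members, key=lambda m: m.mapq)
        alignment = best.alignment if consensus == best.bases else None

        compressed[tag] = FamilyRecord(consensus, alignment, mean_signals, n)
    return compressed
```

Calling `compress_families` on a list of `Read` objects returns one `FamilyRecord` per molecular tag, holding the four items the claim places in the compressed data structure: the consensus base sequence, the consensus sequence alignment, the consensus flow space signal vector, and the number of members.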