| CPC G06F 16/1744 (2019.01) [G06F 16/162 (2019.01)] | 20 Claims |

|
1. A method for data compression of genomic data comprising:
receiving a data file comprised of sequence bases, quality scores, and identifiers, the sequence bases comprising regular bases and irregular bases, the regular bases comprising adenine (A), cytosine (C), guanine (G), and thymine (T);
applying an optimization algorithm to the quality scores, sequence k-mers from the sequence bases, and the identifiers;
splitting long reads of the sequence bases and the quality scores into smaller segments;
deleting duplicated and semi-duplicated reads for the sequence bases and the quality scores;
performing a dimensionality reduction on the sequence bases;
storing a template of the identifier that is consistent across the data file;
detecting and storing the location and type of each irregular base;
encoding the data file in a binary format such that each regular base is represented by one of the four two-digit binary numbers and all irregular bases are represented by one of the four two-digit binary numbers; and
compressing the encoded data file.
|
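The encoding and irregular-base steps of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the claim does not specify a bit assignment, a placeholder code, or a compressor, so the `CODE` mapping, the `PLACEHOLDER` value, the function names `encode` and `decode`, and the use of `zlib` are all assumptions made for illustration. Each regular base is packed into two bits; every irregular base (for example `N`) is packed as one of the same four two-bit codes, and its position and type are kept in a side list so the original symbol can be restored.

```python
import zlib
from typing import List, Tuple

# Hypothetical 2-bit assignment; the claim does not fix a particular mapping.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"
PLACEHOLDER = 0b00  # assumed: the code reused for every irregular base


def encode(seq: str) -> Tuple[bytes, int, List[Tuple[int, str]]]:
    """Pack a sequence at 2 bits per base; record irregular bases separately."""
    irregular: List[Tuple[int, str]] = []
    packed = bytearray((len(seq) + 3) // 4)  # four bases per byte
    for i, base in enumerate(seq):
        code = CODE.get(base)
        if code is None:
            # Irregular base: store its location and type, pack a placeholder.
            irregular.append((i, base))
            code = PLACEHOLDER
        packed[i // 4] |= code << (2 * (3 - i % 4))
    return bytes(packed), len(seq), irregular


def decode(packed: bytes, length: int, irregular: List[Tuple[int, str]]) -> str:
    """Unpack the 2-bit codes, then restore the recorded irregular bases."""
    bases = [
        BASES[(packed[i // 4] >> (2 * (3 - i % 4))) & 0b11]
        for i in range(length)
    ]
    for pos, symbol in irregular:
        bases[pos] = symbol
    return "".join(bases)


if __name__ == "__main__":
    packed, length, irregular = encode("ACGTNACGT")
    compressed = zlib.compress(packed)  # stand-in for the final compression step
    restored = decode(zlib.decompress(compressed), length, irregular)
    print(restored == "ACGTNACGT")
```

The sequence length is stored alongside the packed bytes because the last byte may be only partly filled; without it, trailing zero bits would decode as extra `A` bases under this assumed mapping.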