US 12,287,764 B2
FastQ/FastA compression systems and methods
Foad Nazari, Malvern, PA (US); Sneh Patel, Malvern, PA (US); Emma K. Murray, Malvern, PA (US); and Giana J. Schena, Malvern, PA (US)
Assigned to Rajant Health Incorporated, Malvern, PA (US)
Filed by Rajant Health Incorporated, Malvern, PA (US)
Filed on Nov. 18, 2022, as Appl. No. 17/990,361.
Claims priority of provisional application 63/409,993, filed on Sep. 26, 2022.
Claims priority of provisional application 63/280,721, filed on Nov. 18, 2021.
Prior Publication US 2024/0134825 A1, Apr. 25, 2024
Int. Cl. G06F 16/00 (2019.01); G06F 16/16 (2019.01); G06F 16/174 (2019.01)

CPC G06F 16/1744 (2019.01) [G06F 16/162 (2019.01)]

20 Claims

1. A method for data compression of genomic data comprising:

receiving a data file comprised of sequence bases, quality scores, and identifiers, the sequence bases comprising regular bases and irregular bases, the regular bases comprising adenine (A), cytosine (C), guanine (G), and thymine (T);

applying an optimization algorithm to the quality scores, sequence k-mers from the sequence bases, and the identifiers;

splitting long reads of the sequence bases and the quality scores into smaller segments;

deleting duplicated and semi-duplicated reads for the sequence bases and the quality scores;

performing a dimensionality reduction on the sequence bases;

storing a template of the identifier that is consistent across the data file;

detecting and storing the location and type of each irregular base;

encoding the data file in a binary format such that each regular base is represented by one of the four two-digit binary numbers and all irregular bases are represented by one of the four two-digit binary numbers; and

compressing the encoded data file.
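
The last three steps of claim 1 (recording the location and type of each irregular base, two-bit encoding, and compression) can be illustrated with a short Python sketch. It is not taken from the patent. The bit assignment A=00, C=01, G=10, T=11, the reuse of 00 as the shared code for irregular bases, the function names, and the choice of zlib as the final compressor are all assumptions made for illustration.

```python
import zlib

# Assumed mapping: the claim only requires that each regular base gets
# one of the four two-digit binary numbers, not this particular order.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = "ACGT"
PLACEHOLDER = 0b00  # assumed code shared by every irregular base


def encode_sequence(seq):
    """Pack a sequence at 2 bits per base.

    Returns (packed_bytes, irregulars, length), where irregulars is a
    list of (position, symbol) pairs for every base outside A/C/G/T.
    """
    irregulars = []
    packed = bytearray((len(seq) + 3) // 4)  # four bases per byte
    for i, base in enumerate(seq):
        code = BASE_TO_BITS.get(base)
        if code is None:
            # Store location and type separately, then encode a placeholder.
            irregulars.append((i, base))
            code = PLACEHOLDER
        packed[i // 4] |= code << (2 * (3 - i % 4))
    return bytes(packed), irregulars, len(seq)


def decode_sequence(packed, irregulars, length):
    """Invert encode_sequence using the stored irregular-base records."""
    bases = []
    for i in range(length):
        code = (packed[i // 4] >> (2 * (3 - i % 4))) & 0b11
        bases.append(BITS_TO_BASE[code])
    for pos, symbol in irregulars:
        bases[pos] = symbol  # restore each irregular base at its location
    return "".join(bases)


def compress_sequence(seq):
    """Two-bit encode, then apply a general-purpose compressor (zlib here)."""
    packed, irregulars, length = encode_sequence(seq)
    return zlib.compress(packed), irregulars, length


def decompress_sequence(blob, irregulars, length):
    """Undo compress_sequence."""
    return decode_sequence(zlib.decompress(blob), irregulars, length)
```

Under these assumptions, `encode_sequence("ACGTN")` packs the four regular bases into the single byte 0x1B, fills the first slot of a second byte with the placeholder code, and records `(4, "N")` so that the irregular base can be restored on decoding. Keeping irregular bases in a side list is what allows the main stream to stay at two bits per base.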