CPC G06F 16/1744 (2019.01) | 20 Claims |
1. A method, comprising:
receiving an input file into a compression engine;
inputting hyperparameters associated with the compression engine, wherein the hyperparameters include one or more of initial splits and a minimal consensus length reduction;
aligning the input file, based on the hyperparameters, to create a compression matrix that includes sequences, wherein aligning the input file includes splitting the input file one or more times starting from the initial splits to generate sequences, wherein each of the sequences corresponds to a different portion of the input file, wherein the sequences are arranged as rows of the compression matrix, wherein each column of the sequences in the compression matrix includes identical data after the sequences are aligned in the compression matrix,
wherein the data in a column of the compression matrix are identical when each cell of the column, after the sequences are aligned in the compression matrix, comprises the same characters or the same characters and at least one dash;
determining a consensus sequence from the compression matrix, wherein the consensus sequence includes an entry from each of the columns and does not include any dashes; and
generating a compressed file that includes the consensus sequence and pointer pairs, wherein each pointer pair identifies a subsequence of the consensus sequence and each of the rows of the compression matrix is replaced by at least one of the pointer pairs, the pointer pairs being smaller than the corresponding sequences.
|
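The consensus and pointer-pair steps recited in claim 1 can be illustrated with a minimal sketch. It assumes the alignment step has already produced a compression matrix whose rows are equal-length strings padded with dashes; the splitting and alignment procedure and the hyperparameters are not modeled. The function names (`consensus_of`, `pointer_pairs`, `compress`, `decompress`) and the half-open `(start, end)` encoding of a pointer pair are hypothetical choices made for illustration and are not taken from the patent.

```python
def consensus_of(matrix):
    """Build the consensus from the single non-dash character in each column."""
    consensus = []
    for column in zip(*matrix):
        chars = set(column) - {"-"}
        if len(chars) != 1:
            # A column must hold one distinct character, optionally with dashes.
            raise ValueError(f"column is not identical: {column}")
        consensus.append(chars.pop())
    return "".join(consensus)


def pointer_pairs(row):
    """Encode a row as half-open (start, end) ranges over the consensus.

    Assumption: each run of consecutive non-dash cells becomes one pair.
    """
    pairs, start = [], None
    for i, ch in enumerate(row):
        if ch != "-" and start is None:
            start = i                  # a run of real characters begins
        elif ch == "-" and start is not None:
            pairs.append((start, i))   # the run ended just before this dash
            start = None
    if start is not None:
        pairs.append((start, len(row)))
    return pairs


def compress(matrix):
    """Return the consensus plus one list of pointer pairs per row."""
    return consensus_of(matrix), [pointer_pairs(row) for row in matrix]


def decompress(consensus, all_pairs):
    """Rebuild each original sequence by concatenating consensus slices."""
    return ["".join(consensus[s:e] for s, e in pairs) for pairs in all_pairs]


# Toy aligned matrix: each row is a different portion of the input.
matrix = ["ABCD--", "--CDEF", "AB--EF"]
consensus, pairs = compress(matrix)
restored = decompress(consensus, pairs)
```

In this toy input the consensus is `ABCDEF`, and the third row needs two pointer pairs because a gap splits it into two runs. The example is too small to show any size reduction; it only demonstrates that the rows can be recovered from the consensus and the pointer pairs.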