| CPC G06F 16/1752 (2019.01) [G06F 3/0608 (2013.01); G06F 3/0641 (2013.01); G06F 3/067 (2013.01)] | 2 Claims |

|
1. A system for random-access manipulation of compacted data files, comprising:
a computing system comprising a memory, a processor, and a non-volatile data storage device;
a deconstruction subsystem comprising a first plurality of programming instructions stored in the memory and operable on the processor, wherein the first plurality of programming instructions, when operating on the processor, cause the computing system to:
deconstruct a data stream into a plurality of sourceblocks;
encode the data stream using a reference codebook by:
retrieving a codeword for each sourceblock from the reference codebook;
where there is no codeword for a first sourceblock, generating a hash code as a new codeword and storing the first sourceblock and its newly-created codeword in the reference codebook; and
storing the codewords corresponding to the data stream in a compacted data file;
a reconstruction subsystem comprising a third plurality of programming instructions stored in the memory and operable on the processor, wherein the third plurality of programming instructions, when operating on the processor, cause the computing system to:
retrieve a plurality of codewords from the compacted data file received from a requesting process;
decode each of the plurality of retrieved codewords by, for each retrieved codeword, retrieving the sourceblock associated with the respective codeword from the reference codebook; and
provide the retrieved sourceblocks as a data stream to the requesting process; and
a random-access subsystem comprising a second plurality of programming instructions stored in the memory and operable on the processor, wherein the second plurality of programming instructions, when operating on the processor, cause the computing system to:
receive a data search query;
estimate, using an estimator module, a first starting bit location in the compacted data file;
refine the first starting bit location by:
determining whether a bit sequence starting at the first starting bit location corresponds to a codeword boundary and, if not, traversing the reference codebook until a codeword boundary is located at a new starting bit;
traversing from the new starting bit until a start codeword corresponding to the beginning of the data search query is identified; and
sending the start codeword and a plurality of immediately following codewords from the compacted data file to the reconstruction subsystem for decoding.
|
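
The flow recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and it rests on several simplifying assumptions that do not come from the claim: sourceblocks are a fixed 4 bytes, codewords are a fixed 16 bits taken from a truncated SHA-256 hash (with linear probing on collision), the compacted data file is modeled as a string of '0' and '1' characters, and the search query is assumed to begin on a sourceblock boundary. All names (`Codebook`, `encode`, `decode`, `estimate_start_bit`, `refine_to_boundary`, `random_access_read`) are hypothetical. Because the codewords here are fixed-width, boundary refinement reduces to rounding up to a multiple of the codeword width; with variable-length codewords the boundary would instead have to be found by consulting the codebook, as the claim describes.

```python
import hashlib

BLOCK_SIZE = 4   # bytes per sourceblock (assumption for this sketch)
CODE_BITS = 16   # bits per codeword (assumption: fixed width)


class Codebook:
    """Reference codebook mapping sourceblocks to codewords and back."""

    def __init__(self):
        self.block_to_code = {}
        self.code_to_block = {}

    def codeword_for(self, block: bytes) -> int:
        """Return the codeword for a block, creating one from a hash if absent."""
        if block not in self.block_to_code:
            code = int.from_bytes(hashlib.sha256(block).digest()[:2], "big")
            while code in self.code_to_block:      # probe past hash collisions
                code = (code + 1) % (1 << CODE_BITS)
            self.block_to_code[block] = code
            self.code_to_block[code] = block
        return self.block_to_code[block]


def encode(data: bytes, book: Codebook) -> str:
    """Deconstruct data into sourceblocks and emit their codewords as bits."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    return "".join(format(book.codeword_for(b), f"0{CODE_BITS}b") for b in blocks)


def decode(bits: str, book: Codebook) -> bytes:
    """Look up each codeword in the codebook and rebuild the data stream."""
    codes = [int(bits[i:i + CODE_BITS], 2) for i in range(0, len(bits), CODE_BITS)]
    return b"".join(book.code_to_block[c] for c in codes)


def estimate_start_bit(bits: str, data_len: int, approx_byte_offset: int) -> int:
    """Estimator: scale a position in the original data to a compacted bit offset."""
    return len(bits) * approx_byte_offset // max(data_len, 1)


def refine_to_boundary(bit: int) -> int:
    """Round up to the next codeword boundary (a multiple of CODE_BITS here)."""
    return -(-bit // CODE_BITS) * CODE_BITS


def random_access_read(bits, book, query: bytes, start_bit: int, count: int):
    """Scan forward from the refined start for the query's first codeword,
    then decode it together with the `count` codewords that follow."""
    target = book.block_to_code.get(query[:BLOCK_SIZE])
    if target is None:
        return None
    pos = refine_to_boundary(start_bit)
    while pos + CODE_BITS <= len(bits):
        if int(bits[pos:pos + CODE_BITS], 2) == target:
            return decode(bits[pos:pos + CODE_BITS * (1 + count)], book)
        pos += CODE_BITS
    return None


book = Codebook()
data = b"ABCDEFGHIJKLMNOPABCD"
compacted = encode(data, book)
guess = estimate_start_bit(compacted, len(data), approx_byte_offset=6)
result = random_access_read(compacted, book, b"IJKL", guess, count=1)
```

In this run the estimator lands mid-codeword, the refinement step moves it to the next boundary, and only the matching codeword and its successor are decoded rather than the whole file. The sketch scans forward only, so an estimate that overshoots the target would miss it; a fuller version would need to back off or search in both directions.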