CPC G06F 3/0608 (2013.01) [G06F 3/0641 (2013.01); G06F 3/0673 (2013.01); G06F 12/084 (2013.01); G06F 2212/62 (2013.01)] | 18 Claims |
1. A system for caching and deduplicating a plurality of segments of data, comprising:
at least one server having one or more processors;
at least one database; and
non-transitory memory comprising instructions that, when executed by the one or more processors, cause the one or more processors to:
receive, at the server, a first segment of data from the plurality of segments of data;
identify a value of a first data field in the first segment of data, the value of the first data field comprising a unique source identifier;
perform a transformation on the value of the first data field to obtain a transformed source identifier;
identify a value of a second data field in the first segment of data, the second data field being densely populated by values in the plurality of received segments of data;
partition the value of the second data field into a first partition comprising more significant bits and a second partition comprising less significant bits;
generate a first key based on the transformed source identifier and the first partition comprising more significant bits;
store, in the at least one database, an entry associating the first key with a bitmap, the bitmap having a maximum length equal to a maximum number of possible values a bitmap of equal length to the second partition could validly take;
set a single bit of the bitmap, corresponding to a value of the second partition, to true;
receive, at the server, a second segment of data from the plurality of segments of data;
likewise generate a second key based on a transformed source identifier in the second segment of data and a value of a first partition of the second data field in the second segment of data;
retrieve a bitmap associated with the second key; and
based on a set bit in the retrieved bitmap corresponding to a value of the second partition of the second data field of the second segment of data, determine that the second segment of data had previously been received by the server.
|
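The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation, and several details are assumptions the claim does not specify: the transformation is taken to be a truncated SHA-256 digest, the second partition is taken to be 8 bits wide (so each bitmap holds 2^8 = 256 bits), the database is stood in for by an in-memory dictionary, and the names `DedupCache`, `transform_source_id`, `partition`, `source_id`, and `sequence` are hypothetical.

```python
import hashlib

# Assumed width of the second (less significant) partition; not given in the claim.
LOW_BITS = 8
# A low partition of LOW_BITS bits can take 2**LOW_BITS values, so the bitmap needs that many bits.
BITMAP_LEN = 1 << LOW_BITS


def transform_source_id(source_id: str) -> str:
    """Hypothetical transformation: a truncated SHA-256 hex digest of the source identifier."""
    return hashlib.sha256(source_id.encode("utf-8")).hexdigest()[:16]


def partition(value: int) -> tuple[int, int]:
    """Split a densely populated field value into (more significant bits, less significant bits)."""
    return value >> LOW_BITS, value & (BITMAP_LEN - 1)


class DedupCache:
    """In-memory stand-in for the database that maps each key to a bitmap."""

    def __init__(self) -> None:
        # Key: (transformed source identifier, high partition). Value: bitmap held in an int.
        self._db: dict[tuple[str, int], int] = {}

    def seen_before(self, segment: dict) -> bool:
        """Return True if the segment was already recorded; otherwise record it and return False."""
        high, low = partition(segment["sequence"])
        key = (transform_source_id(segment["source_id"]), high)
        bitmap = self._db.get(key, 0)
        mask = 1 << low  # the single bit corresponding to the low partition's value
        if bitmap & mask:
            return True
        self._db[key] = bitmap | mask
        return False


cache = DedupCache()
segment = {"source_id": "sensor-A", "sequence": 1000}
print(cache.seen_before(segment))  # False: first receipt, bit 232 is set under high partition 3
print(cache.seen_before(segment))  # True: same key and same bit, so a duplicate
```

Under these assumptions, up to 256 consecutive values of the second field from one source share a single database entry, which is why the claim calls for a field that is densely populated: most bits of each stored bitmap end up in use.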