CPC G06F 16/137 (2019.01) [G06F 16/116 (2019.01); G06F 16/185 (2019.01); G06F 16/215 (2019.01); G06F 16/2246 (2019.01); G06F 16/2255 (2019.01); G06F 16/258 (2019.01); H04L 9/0643 (2013.01)] | 20 Claims |
1. A system for using content defined trees to efficiently index and deduplicate data stored in multiple databases, the system comprising:
one or more processors; and
a non-transitory, computer readable medium having instructions recorded thereon that, when executed by the one or more processors, cause operations comprising:
obtaining a request to integrate first data of a legacy database with second data of a content addressed storage (CAS) database;
generating a first content defined tree corresponding to the legacy database, wherein the first content defined tree comprises a first set of parent nodes, each parent node of the first set of parent nodes corresponding to a set of hashes that have been determined using a rolling hash and a grouping condition, wherein each parent node comprises a hash of a concatenation of each hash in a corresponding set of hashes, wherein the first set of parent nodes form a tier of the first content defined tree, and wherein each hash in each set of hashes corresponds to a portion of data in the legacy database;
obtaining a second content defined tree corresponding to the CAS database, wherein the second content defined tree comprises a second set of parent nodes, each parent node in the second set of parent nodes comprising a concatenated hash corresponding to a set of leaf nodes; and
based on comparing the first content defined tree with the second content defined tree, removing a duplicate portion of data from the legacy database or the CAS database.
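For illustration only, the operations recited above can be sketched in Python. This is a minimal sketch under stated assumptions, not the claimed implementation: SHA-256 as the hash function, a windowed polynomial hash standing in for the rolling hash, a low-bits-zero test as the grouping condition, and all names (`Tree`, `build_tree`, `deduplicate`, `WINDOW`, `MASK`) are hypothetical and do not appear in the claim. The sketch builds a single tier of parent nodes and removes duplicates from the legacy side only.

```python
import hashlib
from dataclasses import dataclass
from typing import List

# Hypothetical parameters; the claim does not specify any of these values.
WINDOW = 4             # number of recent leaf hashes covered by the window
MASK = 0x03            # grouping condition: low two bits of the window hash are zero
BASE = 257
MOD = (1 << 31) - 1


def sha(data: bytes) -> bytes:
    """Hash function assumed for both leaf and parent nodes."""
    return hashlib.sha256(data).digest()


@dataclass
class Tree:
    groups: List[List[bytes]]  # leaf hashes, split into content defined sets
    parents: List[bytes]       # one parent hash per set (a single tier)


def window_hash(hashes: List[bytes]) -> int:
    """Polynomial hash of a window of leaf hashes. It is recomputed per window
    here for clarity; an incremental rolling update would yield the same value."""
    value = 0
    for h in hashes:
        value = (value * BASE + int.from_bytes(h[:4], "big")) % MOD
    return value


def build_tree(portions: List[bytes]) -> Tree:
    """Hash each data portion, group the hashes where the grouping condition
    fires, and hash the concatenation of each group to form its parent node."""
    leaves = [sha(p) for p in portions]
    groups: List[List[bytes]] = []
    current: List[bytes] = []
    for i, leaf in enumerate(leaves):
        current.append(leaf)
        window = leaves[max(0, i - WINDOW + 1): i + 1]
        if window_hash(window) & MASK == 0:  # boundary: close the current set
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    parents = [sha(b"".join(g)) for g in groups]
    return Tree(groups=groups, parents=parents)


def deduplicate(legacy: List[bytes], legacy_tree: Tree, cas_tree: Tree) -> List[bytes]:
    """Compare the two trees and drop legacy portions already present in CAS."""
    cas_parents = set(cas_tree.parents)
    cas_leaves = {leaf for g in cas_tree.groups for leaf in g}
    duplicates = set()
    for parent, group in zip(legacy_tree.parents, legacy_tree.groups):
        if parent in cas_parents:
            duplicates.update(group)  # whole set matches; no need to descend
        else:
            duplicates.update(leaf for leaf in group if leaf in cas_leaves)
    return [p for p in legacy if sha(p) not in duplicates]
```

Because set boundaries depend only on the hashed content, identical runs of data tend to produce identical parent nodes in both trees, which allows a whole set to be matched with one comparison before falling back to leaf-level checks.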