US 12,079,164 B2
Cross-silo data storage and deduplication
Yucheng Low, Seattle, WA (US); Ajit Banerjee, Seattle, WA (US); and Rajat Arya, Seattle, WA (US)
Assigned to XETDATA INC., Seattle, WA (US)
Filed by Xetdata Inc., Seattle, WA (US)
Filed on Nov. 3, 2022, as Appl. No. 17/980,537.
Claims priority of provisional application 63/299,832, filed on Jan. 14, 2022.
Prior Publication US 2023/0229643 A1, Jul. 20, 2023
Int. Cl. G06F 16/13 (2019.01); G06F 16/11 (2019.01); G06F 16/185 (2019.01); G06F 16/215 (2019.01); G06F 16/22 (2019.01); G06F 16/25 (2019.01); H04L 9/00 (2022.01); H04L 9/06 (2006.01)

CPC G06F 16/137 (2019.01) [G06F 16/116 (2019.01); G06F 16/185 (2019.01); G06F 16/215 (2019.01); G06F 16/2246 (2019.01); G06F 16/2255 (2019.01); G06F 16/258 (2019.01); H04L 9/0643 (2013.01)]

20 Claims

1. A system for using content defined trees to efficiently index and deduplicate data stored in multiple databases, the system comprising:

one or more processors; and

a non-transitory, computer readable medium having instructions recorded thereon that, when executed by the one or more processors, cause operations comprising:

obtaining a request to integrate first data of a legacy database with second data of a content addressed storage (CAS) database;

generating a first content defined tree corresponding to the legacy database, wherein the first content defined tree comprises a first set of parent nodes, each parent node of the first set of parent nodes corresponding to a set of hashes that have been determined using a rolling hash and a grouping condition, wherein each parent node comprises a hash of a concatenation of each hash in a corresponding set of hashes, wherein the first set of parent nodes form a tier of the first content defined tree, and wherein each hash in each set of hashes corresponds to a portion of data in the legacy database;

obtaining a second content defined tree corresponding to the CAS database, wherein the second content defined tree comprises a second set of parent nodes, each parent node in the second set of parent nodes comprising a concatenated hash corresponding to a set of leaf nodes; and

based on comparing the first content defined tree with the second content defined tree, removing a duplicate portion of data from the legacy database or the CAS database.
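
The tree construction and comparison recited in claim 1 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the polynomial rolling hash, the window size, the boundary mask, the use of SHA-256, the modulus-based grouping condition, and the names `chunk`, `build_parents`, and `duplicate_parents` are all hypothetical choices that the claim does not specify.

```python
import hashlib
from typing import Dict, List, Set

# All constants below are assumptions chosen for illustration only.
WINDOW = 16        # number of bytes covered by the rolling hash
CHUNK_MASK = 0x3F  # boundary when the low 6 bits of the rolling hash are zero
GROUP_MODULUS = 4  # grouping condition: close a group when leaf hash % 4 == 0
BASE = 257
MOD = (1 << 31) - 1


def chunk(data: bytes) -> List[bytes]:
    """Split data at content-defined boundaries using a polynomial rolling hash."""
    chunks: List[bytes] = []
    start = 0
    h = 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, b in enumerate(data):
        if i >= WINDOW:
            # Drop the oldest byte so h covers only the last WINDOW bytes.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + b) % MOD
        # Enforce a minimum chunk length of WINDOW bytes.
        if i + 1 - start >= WINDOW and (h & CHUNK_MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes form the last chunk
    return chunks


def build_parents(chunks: List[bytes]) -> Dict[bytes, List[bytes]]:
    """Build one tier of parent nodes: {parent_hash: [leaf_hash, ...]}."""
    parents: Dict[bytes, List[bytes]] = {}
    group: List[bytes] = []
    for c in chunks:
        leaf = hashlib.sha256(c).digest()  # hash of one portion of data
        group.append(leaf)
        # Grouping condition (assumed): the leaf hash itself decides
        # where a group of hashes ends.
        if int.from_bytes(leaf[:4], "big") % GROUP_MODULUS == 0:
            # Parent node = hash of the concatenation of its child hashes.
            parents[hashlib.sha256(b"".join(group)).digest()] = group
            group = []
    if group:
        parents[hashlib.sha256(b"".join(group)).digest()] = group
    return parents


def duplicate_parents(first: Dict[bytes, List[bytes]],
                      second: Dict[bytes, List[bytes]]) -> Set[bytes]:
    """Return parent hashes present in both tiers (candidate duplicates)."""
    return set(first) & set(second)
```

Under these assumptions, building one tier for the legacy data and one for the CAS data, then passing both to `duplicate_parents`, yields the parent hashes whose underlying portions of data are candidates for removal from one of the two stores. Because both the chunk boundaries and the group boundaries depend only on content, an insertion near the start of one dataset is expected to disturb only nearby nodes rather than every node that follows.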