US 11,669,495 B2
Probabilistic algorithm to check whether a file is unique for deduplication
Wenguang Wang, Santa Clara, CA (US); Junlong Gao, Mountain View, CA (US); Marcos K. Aguilera, Palo Alto, CA (US); Richard P. Spillane, Mountain View, CA (US); Christos Karamanolis, Los Gatos, CA (US); and Maxime Austruy, Lausanne (CH)
Assigned to VMware, Inc., Palo Alto, CA (US)
Filed by VMware, Inc., Palo Alto, CA (US)
Filed on Aug. 27, 2019, as Appl. No. 16/552,908.
Prior Publication US 2021/0064579 A1, Mar. 4, 2021
Int. Cl. G06F 7/00 (2006.01); G06F 16/174 (2019.01); G06F 16/14 (2019.01)

CPC G06F 16/1752 (2019.01) [G06F 16/152 (2019.01)]

20 Claims

1. A method of deduplicating a first file, the method comprising:

separating the first file into a first plurality of chunks;

choosing a first chunk of the first file;

determining a hash of the first chunk is not in a chunk hash data structure stored in a chunk store;

determining, for a subset of the first plurality of chunks that is a percentage of the first plurality of chunks and that is less than all the first plurality of chunks, whether a hash of each of the chunks of the subset is in the chunk hash data structure, wherein each chunk of the subset of the first plurality of chunks is randomly selected from among all of the first plurality of chunks of the first file; and

based on the determining for the subset that none of the hashes of the chunks of the subset are in the chunk hash data structure, including at least one of the chunks of the subset in the chunk hash data structure without including all of the first plurality of chunks in the chunk hash data structure.
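
The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes fixed-size chunking, SHA-256 as the chunk hash, and an in-memory set standing in for the chunk hash data structure. The names `split_into_chunks`, `chunk_hash`, `probe_and_register`, `sample_fraction`, and `chunk_size` are hypothetical and do not appear in the claim.

```python
import hashlib
import random
from typing import List, Optional, Set


def split_into_chunks(data: bytes, chunk_size: int = 4096) -> List[bytes]:
    """Separate the file contents into fixed-size chunks (an assumption;
    the claim does not specify how chunk boundaries are chosen)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def chunk_hash(chunk: bytes) -> str:
    """Hash a chunk. SHA-256 is an illustrative choice."""
    return hashlib.sha256(chunk).hexdigest()


def probe_and_register(
    data: bytes,
    chunk_hashes: Set[str],
    sample_fraction: float = 0.1,
    chunk_size: int = 4096,
    rng: Optional[random.Random] = None,
) -> bool:
    """Return True if the file looks unique according to the sampled probe.

    `chunk_hashes` is a plain set standing in for the chunk hash data
    structure. On True, only the sampled hashes are added to it, never
    the hashes of all chunks.
    """
    rng = rng or random.Random()
    chunks = split_into_chunks(data, chunk_size)
    if len(chunks) < 2:
        # A proper, non-empty subset needs at least two chunks.
        return False

    # Choose a first chunk and check that its hash is absent.
    if chunk_hash(chunks[0]) in chunk_hashes:
        return False

    # Randomly select a percentage of the chunks, strictly fewer than all.
    k = max(1, min(len(chunks) - 1, int(len(chunks) * sample_fraction)))
    sampled = rng.sample(chunks, k)
    sampled_hashes = [chunk_hash(c) for c in sampled]

    # If any sampled hash is already known, the file is not treated as unique.
    if any(h in chunk_hashes for h in sampled_hashes):
        return False

    # None of the sampled hashes were found: record the sampled chunks only.
    chunk_hashes.update(sampled_hashes)
    return True
```

In this sketch a return value of False only means the probe found a possible duplicate or could not form a valid sample. What the system does next (for example, a conventional per-chunk deduplication pass) is outside claim 1 and is left to the caller.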