US 11,748,015 B2
Extending similarity-based deduplication to adjacent data
Uri Shabi, Tel Mond (IL); and Amitai Alkalay, Kadima (IL)
Assigned to EMC IP Holding Company LLC, Hopkinton, MA (US)
Filed by EMC IP Holding Company LLC, Hopkinton, MA (US)
Filed on Apr. 23, 2021, as Appl. No. 17/238,303.
Prior Publication US 2022/0342574 A1, Oct. 27, 2022
Int. Cl. G06F 3/06 (2006.01)

CPC G06F 3/0641 (2013.01) [G06F 3/067 (2013.01); G06F 3/0608 (2013.01); G06F 3/0659 (2013.01)]

17 Claims

1. A method of performing data reduction, comprising:

receiving a sequence of datasets to be written in a data storage system, the sequence of datasets including a candidate dataset and an adjacent candidate dataset;

upon detecting a match between similarity hashes of the candidate dataset and a target dataset, performing a similarity assessment between the adjacent candidate dataset and an adjacent target dataset adjacent to the target dataset; and

in response to the similarity assessment determining that the adjacent candidate dataset and the adjacent target dataset are similar to at least a predetermined degree, performing a data reduction operation on the adjacent candidate dataset with reference to the adjacent target dataset,

wherein performing the similarity assessment includes:

accessing P hash values calculated from the adjacent candidate dataset and P hash values calculated from the adjacent target dataset;

selecting N of the P hash values, N<P and being less than one-tenth of P, of the adjacent candidate block based on a selection rule;

selecting N of the P hash values of the adjacent target block based on the same selection rule; and

determining that the adjacent candidate block is similar to the adjacent target block based at least in part on a number of matches between the N selected hash values of the adjacent candidate block and the N selected hash values of the adjacent target block.
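
The similarity assessment recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. The claim does not say how the P hash values are computed, what the selection rule is, or how many matches are required, so those are assumptions here: the hash values are passed in as plain integers, the selection rule is "keep the N smallest distinct values" (hypothetical name smallest_n), and the threshold is a caller-supplied parameter (hypothetical name min_matches).

```python
from typing import Callable, List, Sequence

# A selection rule maps (P hash values, N) to the N values it selects.
SelectionRule = Callable[[Sequence[int], int], List[int]]


def smallest_n(hashes: Sequence[int], n: int) -> List[int]:
    """Assumed selection rule: keep the N smallest distinct hash values."""
    return sorted(set(hashes))[:n]


def count_matches(
    candidate_hashes: Sequence[int],
    target_hashes: Sequence[int],
    n: int,
    rule: SelectionRule = smallest_n,
) -> int:
    """Apply the same rule to both sets of P hashes and count shared values."""
    p = len(candidate_hashes)
    if len(target_hashes) != p:
        raise ValueError("both datasets must supply the same number P of hashes")
    # Claim 1 requires N < P and N less than one-tenth of P.
    if n <= 0 or 10 * n >= p:
        raise ValueError("N must be positive and less than one-tenth of P")
    selected_candidate = set(rule(candidate_hashes, n))
    selected_target = set(rule(target_hashes, n))
    return len(selected_candidate & selected_target)


def is_similar(
    candidate_hashes: Sequence[int],
    target_hashes: Sequence[int],
    n: int,
    min_matches: int,
    rule: SelectionRule = smallest_n,
) -> bool:
    """Treat the datasets as similar when enough selected hashes match."""
    return count_matches(candidate_hashes, target_hashes, n, rule) >= min_matches


if __name__ == "__main__":
    # P = 100 hashes per dataset, N = 8 selected from each.
    candidate = list(range(100))
    target = [0, 1, 2, 3, 4, 5, 1000, 1001] + list(range(8, 100))
    print(count_matches(candidate, target, n=8))
    print(is_similar(candidate, target, n=8, min_matches=5))
```

Because only N of the P values are compared, the cost of the check scales with N rather than P. Any deterministic rule would serve, provided the same rule is applied to both the candidate and the target.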