US 12,436,700 B2
Performance of dispersed location-based deduplication
Reut Cohen, Tel Aviv (IL); Jonathan Fischer-Toubol, Tel Aviv (IL); Afief Halumi, Tel Aviv (IL); Danny Harnik, Tel Mond (IL); Ety Khaitzin, Petah Tikva (IL); Sergey Marenkov, Tel Aviv (IL); Asaf Porat-Stoler, Ramat Gan (IL); Yosef Shatsky, Karnei Shomron (IL); and Tom Sivan, Tel Aviv (IL)
Assigned to International Business Machines Corporation, Armonk, NY (US)
Filed by International Business Machines Corporation, Armonk, NY (US)
Filed on Feb. 1, 2022, as Appl. No. 17/590,367.
Application 17/590,367 is a continuation of application No. 15/793,109, filed on Oct. 25, 2017, granted, now Pat. No. 11,269,531.
Prior Publication US 2022/0155987 A1, May 19, 2022
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 3/06 (2006.01); G06F 11/14 (2006.01); G06F 16/174 (2019.01); G06F 16/2455 (2019.01)
CPC G06F 3/0641 (2013.01) [G06F 3/0673 (2013.01); G06F 11/1453 (2013.01); G06F 16/1748 (2019.01); G06F 16/1752 (2019.01); G06F 16/24556 (2019.01)]
20 Claims
 
1. A method, in a data processing system, comprising:
configuring a referrer memory region, in a set of memory regions of a data storage system, to have a predetermined maximum number of corresponding owner memory regions in the set of memory regions, wherein
the referrer memory region stores a set of references to locations of data chunks of one or more data files stored in the corresponding owner memory regions, and
the predetermined maximum number limits a number of the corresponding owner memory regions to which the referrer memory region is permitted to have references in the set of references;
receiving, by the data storage system, a request to write a first data file to the referrer memory region;
generating, based on the receiving of the request, a hash value for each data chunk of the first data file;
comparing the generated hash value, of each data chunk of the first data file, to a set of hash values associated with a set of data chunks, of the data chunks, stored in a subset of owner memory regions associated with the referrer memory region;
determining, based on the comparison, that one or more data chunks of the first data file do not exist in the subset of owner memory regions; and
based on the determining, and for each data chunk of the one or more data chunks:
storing the data chunk in a specific owner memory region different from the subset of owner memory regions;
updating a popularity tracking metric for the specific owner memory region based on accessing of the specific owner memory region;
adding the specific owner memory region to the subset of owner memory regions based on a first policy and a second policy, wherein
the first policy adds the specific owner memory region to the subset of owner memory regions until the predetermined maximum number of corresponding owner memory regions is reached, and
the second policy adds the specific owner memory region to the subset of owner memory regions based on each of the updated popularity tracking metric of the specific owner memory region and a predetermined popularity criterion; and
storing a reference to the data chunk in the referrer memory region.
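
The write path recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and every name in it (OwnerRegion, ReferrerRegion, write_file, spill_owner, popularity_threshold, CHUNK_SIZE) is hypothetical. The sketch makes several assumptions: fixed-size chunks hashed with SHA-256, a popularity metric that is a simple counter incremented on each store, and a conjunctive reading of the two policies, under which a region joins the subset only while the maximum has not been reached and only once its counter meets the threshold. The claim does not spell out how the two policies combine, so other readings are possible.

```python
import hashlib
from dataclasses import dataclass, field

CHUNK_SIZE = 4  # hypothetical chunk size, kept tiny for illustration


@dataclass
class OwnerRegion:
    name: str
    chunks: dict = field(default_factory=dict)  # hash value -> chunk bytes
    popularity: int = 0  # assumed metric: a plain access counter


@dataclass
class ReferrerRegion:
    max_owners: int  # predetermined maximum number of owner regions
    owners: list = field(default_factory=list)  # subset of owner regions
    refs: list = field(default_factory=list)  # (owner name, hash value)


def chunk_hash(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()


def write_file(referrer, data, spill_owner, popularity_threshold):
    """Write `data` through `referrer`, deduplicating against its owner subset.

    `spill_owner` stands in for the "specific owner memory region" that
    receives chunks not found in the subset.
    """
    for start in range(0, len(data), CHUNK_SIZE):
        chunk = data[start:start + CHUNK_SIZE]
        h = chunk_hash(chunk)

        # Compare the hash against the chunks held by the subset only.
        owner = next((o for o in referrer.owners if h in o.chunks), None)

        if owner is None:
            # Chunk is missing from the subset: store it in the specific
            # owner region and update that region's popularity metric.
            spill_owner.chunks[h] = chunk
            spill_owner.popularity += 1
            owner = spill_owner

            if spill_owner not in referrer.owners:
                # First policy (assumed form): admit only below the maximum.
                under_cap = len(referrer.owners) < referrer.max_owners
                # Second policy (assumed form): admit only if popular enough.
                popular = spill_owner.popularity >= popularity_threshold
                if under_cap and popular:
                    referrer.owners.append(spill_owner)

        # Store a reference to the chunk in the referrer region.
        referrer.refs.append((owner.name, h))
    return referrer.refs
```

For example, writing b"aaaabbbbaaaa" through a referrer with max_owners=1 and popularity_threshold=2 stores the first two chunks in the spill region, admits that region to the subset after the second store, and resolves the third chunk as a duplicate of the first. A later write that targets a different spill region finds the subset full, so under this reading that region is not admitted, although references to its chunks are still recorded.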