US 11,748,307 B2
Selective data compression based on data similarity
Uri Shabi, Tel Mond (IL); Alexei Kabishcer, Ramat Gan (IL); and Jonathan Volij, Meitar (IL)
Assigned to EMC IP Holding Company LLC, Hopkinton, MA (US)
Filed by EMC IP Holding Company LLC, Hopkinton, MA (US)
Filed on Oct. 13, 2021, as Appl. No. 17/500,246.
Prior Publication US 2023/0113436 A1, Apr. 13, 2023
Int. Cl. G06F 16/00 (2019.01); G06F 16/174 (2019.01); H03M 7/30 (2006.01); G06F 16/13 (2019.01)
CPC G06F 16/1744 (2019.01) [G06F 16/137 (2019.01); H03M 7/3064 (2013.01)]
13 Claims
 
1. A method comprising:
generating a corresponding set of hash values for each one of a plurality of candidate pages to be compressed;
selecting, from the candidate pages and responsive to the sets of hash values generated for the candidate pages, a set of similar candidate pages, wherein the set of similar candidate pages comprises a subset of the candidate pages that includes less than all the candidate pages, at least in part by:
comparing the sets of hash values corresponding to the candidate pages at least in part by generating, for each pair of candidate pages, a similarity index using the sets of hash values corresponding to that pair of candidate pages,
identifying a set of candidate pages within which each candidate page has a corresponding set of hash values with at least a minimum similarity index value with respect to the corresponding set of hash values of each other candidate page, and
selecting the set of candidate pages with corresponding sets of hash values having at least the minimum threshold level of similarity to each other as the set of similar candidate pages; and
compressing the set of similar candidate pages as a single unit, and separately from one or more other ones of the candidate pages that were not selected to be included in the set of similar candidate pages.
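
The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and several choices are assumptions made only for illustration: the hash set of a page is taken to be the truncated SHA-1 digests of its fixed-size 64-byte chunks, the similarity index is taken to be the Jaccard index of two hash sets, the group is found with a simple greedy search, and zlib stands in for the compressor. The function names (page_hashes, similarity_index, select_similar, compress_selectively) are hypothetical. The sketch also does not enforce the claim's condition that the selected set contain fewer than all candidate pages.

```python
import hashlib
import zlib
from itertools import combinations


def page_hashes(page: bytes, chunk: int = 64) -> frozenset:
    """Assumed hashing scheme: one truncated SHA-1 digest per fixed-size chunk."""
    return frozenset(
        hashlib.sha1(page[i:i + chunk]).digest()[:8]
        for i in range(0, len(page), chunk)
    )


def similarity_index(a: frozenset, b: frozenset) -> float:
    """Assumed similarity index: Jaccard index of two hash sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def select_similar(pages: list, min_index: float) -> list:
    """Return indices of a group in which every pair meets min_index."""
    hashes = [page_hashes(p) for p in pages]
    n = len(pages)
    # Similarity index for each pair of candidate pages.
    sim = {
        (i, j): similarity_index(hashes[i], hashes[j])
        for i, j in combinations(range(n), 2)
    }

    def similar(i: int, j: int) -> bool:
        return sim[(min(i, j), max(i, j))] >= min_index

    best = []
    for seed in range(n):
        group = [seed]
        for cand in range(n):
            # Add a page only if it is similar to every page already in the group.
            if cand != seed and all(similar(cand, m) for m in group):
                group.append(cand)
        if len(group) > len(best):
            best = group
    # A single page is not a "set of similar pages".
    return sorted(best) if len(best) > 1 else []


def compress_selectively(pages: list, min_index: float = 0.5):
    """Compress the similar group as one unit and every other page separately."""
    group = select_similar(pages, min_index)
    unit = zlib.compress(b"".join(pages[i] for i in group)) if group else b""
    rest = {
        i: zlib.compress(p) for i, p in enumerate(pages) if i not in group
    }
    return group, unit, rest
```

The greedy search is a heuristic: it returns a group whose members are all pairwise similar, but not necessarily the largest such group. Compressing the group as one unit lets the compressor reuse content shared across the pages, which is the motivation for grouping similar pages in the first place.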