US 12,235,811 B2
Data deduplication in a disaggregated storage system
Yosef Shatsky, Karnei Shomron (IL); and Doron Tal, Geva Carmel (IL)
Assigned to EMC IP Holding Company LLC, Hopkinton, MA (US)
Filed by EMC IP Holding Company LLC, Hopkinton, MA (US)
Filed on Jun. 18, 2021, as Appl. No. 17/351,733.
Prior Publication US 2022/0405254 A1, Dec. 22, 2022
Int. Cl. G06F 16/215 (2019.01); G06F 16/22 (2019.01)
CPC G06F 16/215 (2019.01) [G06F 16/2255 (2019.01)]
20 Claims
1. A method, comprising:
performing a data deduplication process in a data storage system, the data storage system comprising storage nodes, and storage control nodes comprising at least a first storage control node and a second storage control node, wherein each of the storage control nodes can access data directly from each of the storage nodes, wherein the data deduplication process comprises:
determining, by the first storage control node, whether a given data block is at least a potential duplicate of an original data block that is managed by another storage control node of the data storage system; and
in response to the first storage control node determining that the given data block is a potential duplicate of an original data block that is managed by the second storage control node:
sending, by the first storage control node, a first message to the second storage control node, wherein the first message comprises a request to initiate a deduplication process for determining whether the given data block is an actual duplicate of the original data block managed by the second storage control node;
incrementing, by the second storage control node, a reference counter associated with the original data block managed by the second storage control node, wherein the reference counter maintains a count indicative of a number of other storage control nodes that hold a reference to the original data block managed by the second storage control node, wherein the second storage control node increments the reference counter prior to determining whether or not the given data block is an actual duplicate of the original data block;
sending, by the second storage control node, a second message to the first storage control node, wherein the second message comprises metadata which comprises information to enable the first storage control node to read the original data block from a given storage node;
reading, by the first storage control node, the original data block from the given storage node based on the metadata of the second message;
performing, by the first storage control node, a data compare process to determine whether the given data block is an actual duplicate of the original data block;
creating, by the first storage control node, a reference to the original data block, in response to determining that the given data block is an actual duplicate of the original data block;
in response to receiving, by the second storage control node, a notification message from the first storage control node that the given data block is not an actual duplicate of the original data block, the second storage control node decrementing the count of the reference counter associated with the original data block; and
in an absence of receiving, by the second storage control node, a notification message from the first storage control node that the given data block is not an actual duplicate of the original data block, the second storage control node maintaining the count of the reference counter associated with the original data block.
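
The message flow recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the class and method names (StorageNode, ControlNode, try_dedup, handle_dedup_request, handle_not_duplicate), the shared directory lookup, and the deliberately weak fingerprint function are all assumptions introduced for illustration, and the claim itself does not specify how a potential duplicate is detected. The sketch only shows the ordering the claim requires: the owning node increments the reference counter before the byte-for-byte compare, and decrements it only if it is told the block is not an actual duplicate.

```python
def fingerprint(data: bytes) -> tuple:
    # Hypothetical, deliberately weak fingerprint (length + first byte) so that
    # a "potential but not actual" duplicate is easy to produce in a demo.
    return (len(data), data[:1])


class StorageNode:
    """Shared storage that every control node can read directly."""

    def __init__(self):
        self.blocks = {}  # address -> bytes


class ControlNode:
    def __init__(self, name, storage, directory):
        self.name = name
        self.storage = storage
        self.directory = directory  # assumed shared map: fingerprint -> owning node
        self.owned = {}             # fingerprint -> address of an original block
        self.refcount = {}          # address -> count of references held by other nodes
        self.references = {}        # local block id -> (owner name, address)

    def store_original(self, address, data):
        self.storage.blocks[address] = data
        self.owned[fingerprint(data)] = address
        self.refcount[address] = 0
        self.directory[fingerprint(data)] = self

    # Owner side (the "second storage control node").
    def handle_dedup_request(self, fp):
        address = self.owned[fp]
        # Increment before anyone has verified that the data really matches.
        self.refcount[address] += 1
        # "Second message": metadata that lets the requester read the block.
        return {"address": address}

    def handle_not_duplicate(self, address):
        # Undo the optimistic increment; without this message the count stays.
        self.refcount[address] -= 1

    # Requester side (the "first storage control node").
    def try_dedup(self, block_id, data):
        owner = self.directory.get(fingerprint(data))
        if owner is None or owner is self:
            return False  # no potential duplicate managed by another node
        metadata = owner.handle_dedup_request(fingerprint(data))  # first message
        original = self.storage.blocks[metadata["address"]]       # direct read
        if original == data:                                      # data compare
            self.references[block_id] = (owner.name, metadata["address"])
            return True
        owner.handle_not_duplicate(metadata["address"])           # notification
        return False
```

In this sketch, if node B stores b"aaaa" and node A then calls try_dedup with b"aaaa", A creates a reference and B's counter stays at 1 with no further message. If A instead calls try_dedup with b"abcd", the weak fingerprint matches but the compare fails, so B's counter is incremented and then decremented back to its previous value.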