US 11,947,497 B2
Partial in-line deduplication and partial post-processing deduplication of data chunks
Zhihuan Qiu, San Jose, CA (US); and Yu Liu, Milpitas, CA (US)
Assigned to Cohesity, Inc., San Jose, CA (US)
Filed by Cohesity, Inc., San Jose, CA (US)
Filed on Aug. 24, 2021, as Appl. No. 17/410,745.
Prior Publication US 2023/0062644 A1, Mar. 2, 2023
Int. Cl. G06F 7/00 (2006.01); G06F 16/174 (2019.01)
CPC G06F 16/1752 (2019.01)
18 Claims
 
1. A method, comprising:
ingesting at a storage system data received from a source system, wherein ingesting the data includes performing partial in-line deduplication at least in part by:
generating a plurality of data chunks corresponding to the ingested data, wherein the plurality of data chunks includes a first data chunk, a second data chunk, and a third data chunk;
determining corresponding chunk identifiers for the plurality of data chunks corresponding to the ingested data; and
verifying, for each of the plurality of data chunks, whether the corresponding chunk identifier is included in a first data structure tracking identifiers of data chunks that were already stored in a storage of the storage system before the ingesting of the data from the source system, including by:
determining that a first chunk identifier associated with the first data chunk is included in the first data structure;
in response to determining that the first chunk identifier associated with the first data chunk is included in the first data structure, deduplicating the first data chunk against a first copy of the first data chunk that was already stored in the storage of the storage system before the ingesting of the data from the source system;
determining that a second chunk identifier associated with the second data chunk is not included in the first data structure;
in response to determining that the second chunk identifier associated with the second data chunk is not included in the first data structure, storing a copy of the second data chunk in the storage of the storage system;
determining that a third chunk identifier associated with the third data chunk is not included in the first data structure, wherein the third chunk identifier associated with the third data chunk matches the second chunk identifier associated with the second data chunk; and
in response to determining that the third chunk identifier associated with the third data chunk is not included in the first data structure, storing a copy of the third data chunk in the storage of the storage system, wherein the copy of the second data chunk and the copy of the third data chunk are stored in different chunk files; and
after the ingesting of the data from the source system is completed, performing partial post-processing deduplication of the ingested data stored in the storage having a same chunk identifier and updating the first data structure based on the partial post-processing deduplication, wherein performing the partial post-processing deduplication of the ingested data stored in the storage includes performing deduplication on the copy of the second data chunk and the copy of the third data chunk, wherein the second data chunk and the third data chunk store a same data.
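
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The names Store, ingest, post_process, split_chunks and chunk_id are hypothetical, and fixed-size chunking, SHA-256 chunk identifiers and one chunk per chunk file are assumptions chosen for brevity. The sketch shows the key behavior: the first data structure is consulted but not updated during ingest, so two identical new chunks are both written to different chunk files, and the later post-processing pass collapses them and then updates the first data structure.

```python
import hashlib
from dataclasses import dataclass, field


def chunk_id(chunk: bytes) -> str:
    """Chunk identifier; SHA-256 is an assumption, the claim names no hash."""
    return hashlib.sha256(chunk).hexdigest()


def split_chunks(data: bytes, size: int = 4) -> list[bytes]:
    """Fixed-size chunking, assumed here only to keep the example small."""
    return [data[i:i + size] for i in range(0, len(data), size)]


@dataclass
class Store:
    # "First data structure": chunk identifier -> chunk file holding the kept copy.
    index: dict[str, int] = field(default_factory=dict)
    # Chunk file number -> (chunk identifier, chunk bytes); one chunk per file.
    chunk_files: dict[int, tuple[str, bytes]] = field(default_factory=dict)
    next_file: int = 0

    def _write(self, cid: str, chunk: bytes) -> int:
        file_no = self.next_file
        self.next_file += 1
        self.chunk_files[file_no] = (cid, chunk)
        return file_no

    def ingest(self, data: bytes, size: int = 4) -> list[int]:
        """Partial in-line deduplication: the index is read but never updated."""
        refs = []
        for chunk in split_chunks(data, size):
            cid = chunk_id(chunk)
            if cid in self.index:
                # Already stored before this ingest: reference the existing copy.
                refs.append(self.index[cid])
            else:
                # Not in the index: store a copy, even if an identical chunk
                # was written a moment ago during this same ingest.
                refs.append(self._write(cid, chunk))
        return refs

    def post_process(self, refs: list[int]) -> list[int]:
        """Partial post-processing deduplication, then index update."""
        new_refs = []
        for file_no in refs:
            cid, chunk = self.chunk_files[file_no]
            keeper = self.index.setdefault(cid, file_no)
            if keeper != file_no and self.chunk_files[keeper][1] == chunk:
                # Same identifier and same data: drop the redundant copy.
                del self.chunk_files[file_no]
                new_refs.append(keeper)
            else:
                new_refs.append(file_no)
        return new_refs


store = Store()
# Seed the storage so that b"AAAA" is already tracked before the next ingest.
store.post_process(store.ingest(b"AAAA"))

# First chunk is already tracked; second and third are identical and new.
refs = store.ingest(b"AAAABBBBBBBB")
print(refs)                      # [0, 1, 2]  second and third in different files
print(store.post_process(refs))  # [0, 1, 1]  third now points at the second's copy
print(len(store.chunk_files))    # 2
```

In this sketch the first chunk plays the role of the first data chunk in the claim, and the two b"BBBB" chunks play the roles of the second and third data chunks. Skipping index updates during ingest is the design choice the claim describes; the sketch does not attempt to model why a real system would make that trade-off.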