US 12,216,547 B1
Granular data source identification for obtaining deduplication storage efficiency within a clustered environment
Abhishek Rajimwale, San Jose, CA (US); George Mathew, Belmont, CA (US); Murthy Mamidi, San Jose, CA (US); and Donna Barry Lewis, Holly Springs, NC (US)
Assigned to EMC IP Holding Company LLC, Hopkinton, MA (US)
Filed by EMC IP Holding Company LLC, Hopkinton, MA (US)
Filed on Aug. 21, 2019, as Appl. No. 16/547,234.
Int. Cl. G06F 11/14 (2006.01); G06F 16/174 (2019.01)
CPC G06F 11/1453 (2013.01) [G06F 16/1748 (2019.01); G06F 2201/84 (2013.01)]
20 Claims
 
1. A system comprising:
one or more processors; and
a non-transitory computer readable medium storing a plurality of instructions, which when executed, cause the one or more processors to:
receive, at a backup system, data to be backed up from a first data source and a second data source through a client system to a cluster storage system as part of a backup process, the first data source and the second data source being components of the client system;
determine, at a backup system, a plurality of portions of the received data as a basis for creating a plurality of backup files to be provided to the cluster storage system by analyzing the received data as part of the backup process, the portions including at least a first portion of the received data that is to be used to create a first backup file;
obtain, at the backup system from the client system, first data source information for the first portion of the data by querying the components of the client system as part of the backup process in response to determining the first portion is a basis for creating the first backup file, the first data source information including an identity of a device that is a component of the client system and is identified as the first data source from which the first portion of the data was obtained;
generate, at the backup system, a first data source identifier associated with the first portion of data based on the obtained first data source information associated with the first data source as part of the backup process; and
provide, by the backup system, the first data source identifier and the first portion of data to the clustered storage system, the first data source identifier used by the clustered storage system to determine a first destination storage node for the first backup file created using the first portion of data, the first destination storage node storing backup files of data from the first data source.
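The flow recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. The claim does not say how the identifier is generated or how the clustered storage system maps an identifier to a node, so the SHA-256 hash and the modulo placement below are assumptions, as are all names (`Portion`, `query_source`, `make_source_id`, `Cluster`, `backup`) and the inventory dictionary that stands in for querying the components of the client system.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Portion:
    """A slice of received data that becomes one backup file (hypothetical type)."""
    name: str
    payload: bytes


def query_source(client_inventory: dict, portion: Portion) -> str:
    """Stand-in for querying the client system: return the device identity
    recorded for this portion. A real system would ask the client itself."""
    return client_inventory[portion.name]


def make_source_id(device: str) -> str:
    """Derive a stable data source identifier from the device identity.
    Hashing is an assumption; the claim only requires that the identifier
    be based on the data source information."""
    return hashlib.sha256(device.encode("utf-8")).hexdigest()[:16]


@dataclass
class Cluster:
    """Toy clustered storage system that pins each identifier to one node."""
    node_count: int
    placement: dict = field(default_factory=dict)  # source id -> node index
    nodes: dict = field(default_factory=dict)      # node index -> file names

    def store(self, source_id: str, portion: Portion) -> int:
        # The first file from a source selects a node; later files reuse it,
        # so data from one source lands where deduplication can find matches.
        if source_id not in self.placement:
            self.placement[source_id] = int(source_id, 16) % self.node_count
        node = self.placement[source_id]
        self.nodes.setdefault(node, []).append(portion.name)
        return node


def backup(portions, client_inventory, cluster):
    """Tag each portion with its source identifier and hand it to the cluster.
    Returns a mapping of portion name to destination node index."""
    result = {}
    for portion in portions:
        device = query_source(client_inventory, portion)
        result[portion.name] = cluster.store(make_source_id(device), portion)
    return result


# Example: two portions come from device "vm-1", one from device "db-7".
inventory = {"vm1-a": "vm-1", "db-a": "db-7", "vm1-b": "vm-1"}
portions = [Portion("vm1-a", b"aaa"), Portion("db-a", b"bbb"), Portion("vm1-b", b"aaa")]
cluster = Cluster(node_count=4)
destinations = backup(portions, inventory, cluster)
```

In this sketch both portions from `vm-1` resolve to the same identifier and therefore the same destination node, which mirrors the claim's requirement that the first destination storage node store backup files of data from the first data source.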