US 12,282,453 B2
Optimal cluster selection in hierarchical clustering of files
Tony Tzeming Wong, Milpitas, CA (US); and Smriti Thakkar, San Jose, CA (US)
Assigned to EMC IP Holding Company LLC, Hopkinton, MA (US)
Filed by EMC IP Holding Company LLC, Hopkinton, MA (US)
Filed on Aug. 18, 2021, as Appl. No. 17/405,272.
Prior Publication US 2023/0057692 A1, Feb. 23, 2023
Int. Cl. G06F 16/11 (2019.01); G06F 16/13 (2019.01); G06F 16/906 (2019.01)

CPC G06F 16/122 (2019.01) [G06F 16/137 (2019.01)]

18 Claims

1. A system for optimized hierarchical clustering of files, comprising:

one or more processors; and

a non-transitory computer readable medium storing a plurality of instructions, which when executed, cause the one or more processors to:

partition files, comprising segments identified by fingerprints, into corresponding clusters;

generate hash values by applying a hash function to the fingerprints of the segments in the clusters of the file;

count common fingerprints by counting fingerprints which correspond to both a cluster of a file and a cluster of another file;

count unique fingerprints by counting fingerprints which correspond to at least one of the cluster of the file and the cluster of the other file;

approximate a distance, based on the count of common fingerprints and the count of unique fingerprints, between the cluster of the file and the cluster of the other file;

identify a smallest of distances which are approximated between all clusters of the files;

merge the cluster of the file and the cluster of the other file into a cluster of the file and the other file, in response to a determination that the approximated distance is the smallest of the distances;

determine a distance index number for the merged cluster by one of dividing a next smallest of distances between all the clusters of the files by a smallest of the distances between all the clusters of the files or subtracting the smallest of distances between all the clusters of the files from the next smallest of the distances between all the clusters of the files;

determine distance index numbers for merges of all clusters of the files using a smallest and a next smallest of the distances between the clusters of the files that have not been merged after the determination of the distance index number of the merged cluster; and

identify, based on a maximum of the distance index numbers, a corresponding maximum clustering distance threshold as an optimal clustering of the files.
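
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and it rests on several assumptions that the claim does not state: each file starts as its own cluster; the distance is taken to be the Jaccard distance, 1 - common/unique; a merged cluster holds the union of its members' fingerprints; the distance index number uses the subtraction variant (next smallest merge distance minus the current merge distance); and the hashing step is omitted, so the sketch works directly on sets of fingerprints. The names jaccard_distance and optimal_clustering are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Approximate a distance from common and unique fingerprint counts."""
    common = len(a & b)   # fingerprints present in both clusters
    unique = len(a | b)   # fingerprints present in at least one cluster
    return 1.0 - common / unique if unique else 0.0


def optimal_clustering(files):
    """files maps a file name to its set of fingerprints.

    Returns (threshold, clustering) for the merge with the largest
    distance index number, or None if fewer than two merges occur.
    """
    # Assumption: every file starts in a cluster of its own.
    clusters = {frozenset([name]): set(fps) for name, fps in files.items()}
    merges = []  # one (merge distance, clustering after merge) per merge

    while len(clusters) > 1:
        # Identify the smallest distance between all current clusters.
        dist, a, b = min(
            ((jaccard_distance(clusters[x], clusters[y]), x, y)
             for x, y in combinations(clusters, 2)),
            key=lambda t: t[0],
        )
        # Merge the closest pair; the merged cluster is the fingerprint union.
        clusters[a | b] = clusters.pop(a) | clusters.pop(b)
        merges.append((dist, sorted(sorted(c) for c in clusters)))

    # Distance index number (subtraction variant): gap to the next merge.
    gaps = [merges[i + 1][0] - merges[i][0] for i in range(len(merges) - 1)]
    if not gaps:
        return None
    best = max(range(len(gaps)), key=gaps.__getitem__)
    return merges[best]


# Hypothetical data: two pairs of near-duplicate files.
files = {
    "a": {1, 2, 3, 4},
    "b": {1, 2, 3, 5},
    "c": {10, 11, 12},
    "d": {10, 11, 13},
}
print(optimal_clustering(files))  # (0.5, [['a', 'b'], ['c', 'd']])
```

In this example the merge distances are 0.4, 0.5, and 1.0, so the index numbers are 0.1 and 0.5. The largest gap follows the second merge, which gives a threshold of 0.5 and two clusters. The division variant recited in the claim would replace the subtraction with a ratio and would need a guard against a zero distance.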