US 12,455,970 B2
Privacy preserving and high performance data clustering
Pedro Miguel Barbas, Dunboyne (IE); Deepak Kulkarni, Hyderabad (IN); Christian Cesar Bones, Sao Carlos (BR); Guilherme Rodrigues de Abreu, Campinas (BR); and Rodrigo Cravo Dorea Arnez, Rio de Janeiro (BR)
Assigned to International Business Machines Corporation, Armonk, NY (US)
Filed by INTERNATIONAL BUSINESS MACHINES CORPORATION, Armonk, NY (US)
Filed on Jun. 26, 2023, as Appl. No. 18/340,935.
Prior Publication US 2024/0427907 A1, Dec. 26, 2024
Int. Cl. G06F 21/60 (2013.01); G06F 18/23213 (2023.01); G06F 40/12 (2020.01)
CPC G06F 21/60 (2013.01) [G06F 18/23213 (2023.01); G06F 40/12 (2020.01)]
20 Claims
1. A method for clustering data objects, said method comprising:
accessing, by one or more processors of a computer system, a set of data objects arranged in an initial sequential order, wherein the set of data objects consists of S data objects, wherein S is at least 2, wherein each data object includes a code and a score, wherein each code represents an instance of the data object and each code is a positive integer subject to codes collectively consisting of positive integers 1, 2, . . . , N subject to N≤S, wherein each score is a positive real number denoting a measure of a parameter pertaining to the instance that is represented by the code, wherein scores collectively consist of B unique scores subject to B≤S;
sorting, by the one or more processors, the data objects using the score as a sort key to rearrange the data objects in an ascending order of the score, wherein each unique score has a sequence number in the sorted data objects, resulting in B consecutive sequence numbers;
transforming, by the one or more processors, the S data objects into respective S binary words, wherein each binary word corresponding to a data object of the S data objects consists of B bits characterized by: (i) a 1 bit in a bit position of the binary word corresponding to respective sequence number of sorted unique score and (ii) a 0 bit in all other bit positions of the binary word;
encoding, by the one or more processors, the S data objects into a sequence of N blocks, wherein each block consists of B bits in a binary format, and wherein the N blocks are sequenced and have bit configurations that depend on the initial sequential order of the data objects and the sequence numbers of sorted unique scores;
generating, by the one or more processors from the N blocks, M block clusters respectively comprising M respective cluster centers, wherein each cluster center is a different block of the N blocks, wherein R remaining blocks of the N blocks are distributed into the M block clusters in a manner that minimizes a weighted bit separation distance between each of the R remaining blocks and each of the M cluster centers, wherein M+R=N, and wherein 2<M<N;
converting, by the one or more processors, the M block clusters into respective M word clusters of binary words, wherein the S binary words are distributed into the respective M word clusters; and
for each word cluster of the respective M word clusters having J binary words in each word cluster, reconfiguring, by the one or more processors, an M word cluster, or the respective M word clusters, into L word clusters into which the J binary words are distributed, by minimizing a total number of deviations in the L word clusters, wherein L is at least 1.
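As an illustration only, the sorting, transforming, encoding, and generating steps of claim 1 can be sketched in Python under several assumptions that the claim does not itself state. The block encoding is assumed to be the bitwise OR of the binary words that share a code. The weighted bit separation distance is assumed to be a weighted Hamming distance. The distribution of the R remaining blocks is assumed to be a nearest-center assignment over cluster centers that are given as input. All function names, the weights, and the sample data are hypothetical. The final reconfiguring step is not sketched, because the claim does not define how deviations are counted.

```python
from typing import Dict, List, Tuple

DataObject = Tuple[int, float]  # (code, score), kept in the initial sequential order


def sequence_numbers(objects: List[DataObject]) -> Dict[float, int]:
    """Map each unique score to its 1-based position in ascending order."""
    unique_scores = sorted({score for _, score in objects})
    return {score: i + 1 for i, score in enumerate(unique_scores)}


def to_binary_words(objects: List[DataObject]) -> List[List[int]]:
    """Build one B-bit word per data object, with a single 1 bit at the
    position given by the sequence number of the object's score."""
    seq = sequence_numbers(objects)
    b = len(seq)
    words = []
    for _, score in objects:
        bits = [0] * b
        bits[seq[score] - 1] = 1
        words.append(bits)
    return words


def to_blocks(objects: List[DataObject]) -> List[List[int]]:
    """ASSUMPTION: block n is the bitwise OR of the words whose code is n."""
    words = to_binary_words(objects)
    n = max(code for code, _ in objects)
    b = len(words[0])
    blocks = [[0] * b for _ in range(n)]
    for (code, _), word in zip(objects, words):
        block = blocks[code - 1]
        for i, bit in enumerate(word):
            block[i] |= bit
    return blocks


def weighted_bit_distance(x: List[int], y: List[int], weights: List[float]) -> float:
    """ASSUMPTION: sum of the weights at bit positions where x and y differ."""
    return sum(w for a, c, w in zip(x, y, weights) if a != c)


def assign_to_centers(blocks: List[List[int]], center_indices: List[int],
                      weights: List[float]) -> List[int]:
    """Return, for every block, the index (0..M-1) of its cluster.
    Center blocks map to their own cluster; each remaining block goes to the
    center with the smallest weighted bit distance (first center wins ties)."""
    assignment = []
    for i, block in enumerate(blocks):
        if i in center_indices:
            assignment.append(center_indices.index(i))
            continue
        distances = [weighted_bit_distance(block, blocks[c], weights)
                     for c in center_indices]
        assignment.append(distances.index(min(distances)))
    return assignment


# Hypothetical sample: S = 7 data objects, N = 5 codes, B = 4 unique scores.
objects = [(1, 2.0), (2, 5.0), (3, 2.0), (4, 9.0), (5, 5.0), (1, 5.0), (3, 7.0)]
blocks = to_blocks(objects)
# M = 3 centers (blocks 1, 3 and 4), R = 2 remaining blocks, so 2 < M < N holds.
clusters = assign_to_centers(blocks, center_indices=[0, 2, 3], weights=[1, 2, 3, 4])
```

With this sample, the two remaining blocks (codes 2 and 5) both join the cluster centered on block 1, since they differ from it in only the lowest-weight bit position. How the M cluster centers are chosen is outside the sketch.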