CPC G06F 21/60 (2013.01) [G06F 17/12 (2013.01); G06F 18/23 (2023.01)] | 13 Claims |
1. A processor-implemented method for data anonymization comprising:
obtaining a dataset comprising a plurality of records for anonymization, via one or more hardware processors, the plurality of records comprising a plurality of attributes arranged in a taxonomy tree structure;
clustering the dataset into a plurality of clusters using an extended M-mode clustering technique, via the one or more hardware processors;
for each cluster of the plurality of clusters, performing, via the one or more hardware processors:
generating a set of patterns for an initial level of generalization of a set of records associated with the cluster, wherein each pattern of the set of patterns is representative of a distinct level of generalization and a distinct generalization loss;
calculating a generalized information loss and a beta value for each pattern of the set of patterns, wherein the beta value for a pattern from amongst the set of patterns is indicative of a possibility of a record being anonymized by the pattern, and wherein the generalized information loss captures a penalty incurred when generalizing an attribute from amongst the plurality of attributes;
solving an integer linear programming (ILP) model using the generalized information loss and the beta value to obtain a set of anonymized records by generated patterns and a set of suppressed records;
determining whether the solution of the ILP model is acceptable or not, wherein for a solution, the generalized information loss comprises a sum over the set of anonymized records and the set of suppressed records, and wherein the acceptance of the solution is determined based on a percentage of reduction in the generalized information loss in a current iteration as compared to a previous iteration; and
on determination that the solution is unacceptable for the set of anonymized records and the set of suppressed records, iteratively generating patterns with a subsequent level of generalization of the set of records, calculating the generalized information loss, and solving the ILP model to obtain one or more solutions in one or more subsequent iterations until the solution in the one or more subsequent iterations is determined to be improved by a threshold percentage.
|
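The loss calculation and the acceptance test recited in claim 1 can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the claim: the `PARENT` taxonomy is a hypothetical single-attribute tree, the penalty formula (leaves covered by the generalized value, normalized by the leaves under the root) is one commonly used loss metric rather than the claimed one, and `solve_level` is a hypothetical stand-in for generating patterns at a given level and solving the ILP model. The stopping rule follows a literal reading of the final step, namely stop once the loss falls by at least the threshold percentage relative to the previous iteration.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical taxonomy tree for one attribute: child -> parent (root maps to None).
PARENT: Dict[str, Optional[str]] = {
    "Any": None,
    "Europe": "Any", "Asia": "Any",
    "France": "Europe", "Spain": "Europe",
    "India": "Asia", "Japan": "Asia", "China": "Asia",
}


def leaves_under(node: str) -> int:
    """Count the leaf values that `node` covers in the taxonomy tree."""
    children = [child for child, parent in PARENT.items() if parent == node]
    if not children:
        return 1
    return sum(leaves_under(child) for child in children)


def generalization_loss(node: str, root: str = "Any") -> float:
    """Assumed penalty: 0.0 for a leaf value, 1.0 for the root, proportional between."""
    return (leaves_under(node) - 1) / (leaves_under(root) - 1)


def percent_reduction(previous: float, current: float) -> float:
    """Percentage by which the loss dropped from the previous iteration."""
    if previous == 0:
        return 0.0
    return 100.0 * (previous - current) / previous


def iterate_until_accepted(
    solve_level: Callable[[int], float],
    threshold_pct: float,
    max_levels: int,
) -> Tuple[int, float]:
    """Raise the generalization level until the loss improves by the threshold.

    `solve_level(level)` is a hypothetical callable that returns the total
    generalized information loss (anonymized plus suppressed records) of the
    solution found at that level.
    """
    previous = solve_level(0)
    for level in range(1, max_levels):
        current = solve_level(level)
        if percent_reduction(previous, current) >= threshold_pct:
            return level, current  # accepted: improvement reached the threshold
        previous = current
    return max_levels - 1, previous  # no level met the threshold
```

For example, with per-level losses of 10.0, 9.5, 7.0 and 6.9 and a 20 percent threshold, `iterate_until_accepted(lambda lvl: [10.0, 9.5, 7.0, 6.9][lvl], 20.0, 4)` stops at level 2, where the loss drops from 9.5 to 7.0 (about 26 percent).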