CPC G06F 21/6254 (2013.01) [G06F 16/221 (2019.01); G06F 16/282 (2019.01); G06F 21/6227 (2013.01)] | 27 Claims |
1. A method comprising:
receiving data from a data source;
transforming the data, using structured query language (SQL), into integer data;
generating a plurality of generalizations of the integer data;
sending the plurality of generalizations of the integer data to a plurality of execution nodes, each node of the plurality of execution nodes operating on a set of data separable from a set of data on another execution node, wherein each of the plurality of execution nodes includes computational resources to compute a candidate generalization using an information loss scoring function, wherein computing the candidate generalization includes grouping the plurality of generalizations into a set of equivalence classes and pruning the set of equivalence classes;
receiving a set of candidate generalizations from the plurality of execution nodes, wherein a candidate generalization includes:
an equivalence class size approximation, wherein the equivalence class size approximation is determined by computing an upper bound for an equivalence class size, based at least in part on a k-value, a number of records, a number of records in a set of equivalence classes with a size greater than or equal to the k-value, a number of records in a set of equivalence classes with a size less than the k-value, and a maximum equivalence class size;
a suppression approximation comprising a number of suppressed records, the suppression approximation based at least in part on the number of records and the k-value; and
an information loss score;
selecting a preferred generalization from the set of candidate generalizations;
generating an anonymized view of the integer data using the preferred generalization; and
providing a view of the integer data in which a data subject is unidentifiable directly or indirectly.
|
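For illustration only, the grouping, suppression, scoring, and selection steps recited in claim 1 can be sketched in a few lines of Python. This is a minimal single-process sketch, not the claimed implementation. It assumes the data has already been transformed into integer tuples, and it treats a generalization as a tuple of bucket widths applied by integer division. It computes exact class sizes and suppression counts rather than the upper-bound approximations the claim recites, because the claim does not give those formulas. It also uses the well-known discernibility metric as a stand-in for the information loss scoring function. The function names, the sample records, and the candidate widths are all hypothetical.

```python
from collections import Counter


def generalize(record, widths):
    """Coarsen each integer attribute into a bucket index of the given width."""
    return tuple(value // width for value, width in zip(record, widths))


def score(records, widths, k):
    """Summarize one candidate generalization using exact counts (assumption:
    the claim's upper-bound approximations are replaced by exact values)."""
    # Records sharing the same generalized tuple form one equivalence class.
    classes = Counter(generalize(r, widths) for r in records)
    n = len(records)
    kept = sum(size for size in classes.values() if size >= k)
    suppressed = n - kept  # records in classes smaller than k
    # Discernibility metric (stand-in scoring function): a kept record costs
    # its class size, a suppressed record costs the full table size.
    loss = sum(s * s for s in classes.values() if s >= k) + suppressed * n
    return {
        "widths": widths,
        "max_class_size": max(classes.values()),
        "records_kept": kept,
        "records_suppressed": suppressed,
        "loss": loss,
    }


def select_preferred(records, candidates, k):
    """Pick the candidate generalization with the lowest information loss."""
    return min((score(records, w, k) for w in candidates),
               key=lambda c: c["loss"])


# Hypothetical integer data: (age, postal code).
records = [(23, 10115), (27, 10117), (25, 10119),
           (41, 20095), (44, 20097), (67, 30159)]
# Hypothetical candidate generalizations, from no coarsening to full coarsening.
candidates = [(1, 1), (10, 100), (100, 100000)]

best = select_preferred(records, candidates, k=2)
print(best)
```

With k = 2, the middle candidate groups the six records into classes of sizes 3, 2, and 1, so one record is suppressed and the score is 9 + 4 + 6 = 19. The other two candidates each score 36, one because every record is suppressed and one because all records collapse into a single class. In a distributed setting, each execution node would run something like `score` on its own separable partition and return the resulting candidate.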