| CPC G06F 21/6254 (2013.01) | 20 Claims |

|
1. A computer-implemented method comprising:
automatically, using a trained machine-learning model, detecting attributes in unstructured data;
determining an amount of undetected attributes and detected attributes in the unstructured data;
simulating additional attributes for the unstructured data according to the amount of undetected attributes, wherein the simulating comprises, for each undetected attribute:
sampling a population distribution for a sampled value, wherein the population distribution is an externally supplied reference distribution;
computing a sampling frequency according to the sampled value; and
assigning the sampling frequency as an additional attribute;
analyzing a risk of disclosure in the unstructured data using the detected attributes and the simulated additional attributes, wherein the analyzing comprises:
assigning a first information value to each detected attribute according to samples retrieved from a first statistical distribution used to simulate the additional attributes;
assigning a second information value to each simulated additional attribute according to samples retrieved from a second statistical distribution, wherein the second statistical distribution is generated based on attributes that change with respect to time;
aggregating the first information value for each detected attribute and the second information value for each simulated additional attribute into an aggregated value;
determining an anonymity value using the first information value, the second information value, the aggregated value, and a size of a population associated with the unstructured data; and
determining the risk of disclosure in the unstructured data using the determined anonymity value;
modifying the detected attributes according to the analyzed risk of disclosure; and
replacing the detected attributes with the modified detected attributes in the unstructured data.
|
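
For illustration only, the simulating and analyzing steps of claim 1 can be sketched in Python. The claim does not recite any formulas, so everything concrete here is an assumption: the function names are hypothetical, the "sampling frequency" is taken to be the relative frequency of the sampled value in the population distribution, each information value is taken to be self-information in bits (-log2 of a frequency), the aggregated value is a plain sum, the anonymity value is log2 of the population size minus that sum, and the risk is the reciprocal of the expected number of matching individuals. The sketch also simplifies by scoring detected attributes directly from supplied frequencies rather than from samples of a first statistical distribution.

```python
import math
import random


def simulate_additional_attributes(n_undetected, population_distribution, rng):
    """For each undetected attribute, sample a value and keep its frequency.

    population_distribution maps a value to its relative frequency
    (hypothetical representation; the claim does not fix one).
    """
    values = list(population_distribution)
    weights = [population_distribution[v] for v in values]
    simulated = []
    for _ in range(n_undetected):
        sampled_value = rng.choices(values, weights=weights, k=1)[0]
        # Assumption: the sampling frequency is the sampled value's
        # relative frequency in the population distribution.
        simulated.append(population_distribution[sampled_value])
    return simulated


def information_value(frequency):
    # Assumption: information is measured as self-information in bits.
    return -math.log2(frequency)


def anonymity_value(detected_freqs, simulated_freqs, population_size):
    first = [information_value(f) for f in detected_freqs]    # detected attributes
    second = [information_value(f) for f in simulated_freqs]  # simulated attributes
    aggregated = sum(first) + sum(second)
    # Assumption: anonymity is the number of bits needed to single out one
    # person in the population minus the bits the attributes disclose.
    return math.log2(population_size) - aggregated


def disclosure_risk(anonymity):
    # Assumption: at or below zero bits the individual is treated as unique;
    # otherwise the risk is 1 / (expected number of matching individuals).
    return 1.0 if anonymity <= 0 else 1.0 / (2 ** anonymity)
```

With detected frequencies of 0.5 and 0.25, one simulated frequency of 0.5, and a population of 1024, the attributes disclose 4 bits against 10 bits needed, so this sketch yields an anonymity value of 6 bits and a risk of 1/64. The modifying and replacing steps of the claim would then act on the detected attributes when that risk exceeds whatever threshold is chosen.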