US 12,455,984 B2
Machine learning for data anonymization
Grant Howard George Middleton, Toronto (CA); and Brian Joseph Rasquinha, Ontario (CA)
Assigned to Privacy Analytics Inc., Ottawa (CA)
Filed by Privacy Analytics Inc., Ottawa (CA)
Filed on Apr. 21, 2023, as Appl. No. 18/305,148.
Claims priority of provisional application 63/333,908, filed on Apr. 22, 2022.
Prior Publication US 2024/0119175 A1, Apr. 11, 2024
Int. Cl. G06F 21/62 (2013.01)

CPC G06F 21/6254 (2013.01)

20 Claims

1. A computer-implemented method comprising:

automatically, using a trained machine-learning model, detecting attributes in unstructured data;

determining an amount of undetected attributes and detected attributes in the unstructured data;

simulating additional attributes for the unstructured data according to the amount of undetected attributes, wherein the sampling comprises, for each undetected attribute:

sampling a population distribution for a sampled value, wherein the population distribution is an externally supplied reference distribution;

computing a sampling frequency according to the sampled value; and

assigning the sampling frequency as an additional attribute;

analyzing a risk of disclosure in the unstructured data using the detected attributes and the simulated additional attributes, wherein the analyzing comprises:

assigning a first information value to each detected attribute according to samples received from a first statistical distribution used to simulate the additional attributes;

assigning a second information value to each simulated additional attribute according to samples retrieved from a second statistical distribution, wherein the second statistical distribution is generated based on attributes that change with respect to time;

aggregating the first information value for each detected attribute and the second information value for each simulated additional attribute into an aggregated value;

determining an anonymity value using the first information value, the second information value, the aggregated value, and a size of a population associated with the unstructured data; and

determining the risk of disclosure in the unstructured data using the determined anonymity value;

modifying the detected attributes according to the analyzed risk of disclosure; and

replacing the detected attributes with the modified detected attributes in the unstructured data.
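
The simulation and risk-analysis steps recited above can be illustrated with a minimal Python sketch. The claim does not state any formulas, so everything numeric here is an assumption made for illustration only: the reference distribution `POPULATION`, the recall-based estimate of undetected attributes, the use of `-log2(frequency)` as an information value, the anonymity value taken as `log2(population size)` minus the aggregated information, and the mapping from anonymity to risk. All function names are hypothetical. The detection, modification, and replacement steps are not shown.

```python
import math
import random

# Hypothetical externally supplied reference distribution
# (value -> population frequency). Not taken from the patent.
POPULATION = {"A": 0.5, "B": 0.25, "C": 0.25}


def estimate_undetected(detected_count, recall):
    """Assumed estimator: if the detector finds a fraction `recall` of all
    attributes, the remainder is treated as undetected."""
    return round(detected_count / recall) - detected_count


def simulate_additional_attributes(n_undetected, population, rng):
    """For each undetected attribute: sample a value from the population
    distribution, look up its frequency, and keep that frequency as the
    simulated additional attribute."""
    values = list(population)
    weights = [population[v] for v in values]
    simulated = []
    for _ in range(n_undetected):
        sampled = rng.choices(values, weights=weights, k=1)[0]
        simulated.append(population[sampled])
    return simulated


def information_bits(frequency):
    """Assumed information value: rarer values carry more bits."""
    return -math.log2(frequency)


def disclosure_risk(detected_freqs, simulated_freqs, population_size):
    """Aggregate the information values and derive an anonymity value and a
    risk score. Both formulas are assumptions, not the claimed method."""
    first = [information_bits(f) for f in detected_freqs]
    second = [information_bits(f) for f in simulated_freqs]
    aggregated = sum(first) + sum(second)
    anonymity = math.log2(population_size) - aggregated
    risk = min(1.0, 2.0 ** (-anonymity))
    return anonymity, risk


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    n_missing = estimate_undetected(detected_count=8, recall=0.8)
    simulated = simulate_additional_attributes(n_missing, POPULATION, rng)
    print(disclosure_risk([0.5, 0.25], simulated, population_size=1024))
```

Under these assumptions, a record whose attributes carry more aggregated information than `log2` of the population size has a non-positive anonymity value and is assigned the maximum risk of 1.0.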