US 12,147,572 B2
Controlling access to de-identified data sets based on a risk of re-identification
Gaston Besanson, Barcelona (ES); Andrea Amorosi, Barcelona (ES); Runar Gunnerud, Oslo (NO); Bartomeu Pou Mulet, Barcelona (ES); Joel Gordillo Solana, Barcelona (ES); Frode Huse Gjendem, Barcelona (ES); Geir Prestegård, Jar (NO); and Rubén Sánchez Fernández, Barcelona (ES)
Assigned to Accenture Global Solutions Limited, Dublin (IE)
Filed by Accenture Global Solutions Limited, Dublin (IE)
Filed on Dec. 2, 2020, as Appl. No. 17/110,193.
Claims priority of application No. 19383071 (EP), filed on Dec. 3, 2019.
Prior Publication US 2021/0165913 A1, Jun. 3, 2021
Int. Cl. G06F 21/62 (2013.01); G06N 3/126 (2023.01); G06N 7/01 (2023.01)
CPC G06F 21/6254 (2013.01) [G06F 21/6227 (2013.01); G06N 3/126 (2013.01); G06N 7/01 (2023.01)]
20 Claims
1. A method, comprising:
receiving, by a device and from one or more data sources, one or more de-identified data sets that include de-identified personal data;
receiving, by the device, a request for a feature set of the one or more de-identified data sets, wherein the feature set includes a set of information included in the de-identified personal data,
wherein each piece of information, of the set of information, comprises data that does not include personally identifiable information, and
wherein the personally identifiable information can be created based on combining data of two or more pieces of information, of the set of information;
determining, by the device, a quantity of equivalence classes for a first group of information, of the set of information;
determining, by the device, a quantity of de-identified data sets included in the one or more de-identified data sets;
determining, by the device, a highest re-identification probability determined for a piece of information, included in the set of information, as compared to other pieces of information included in the set of information;
determining, by the device, a differential risk score based on the quantity of equivalence classes, the quantity of de-identified data sets, and the highest re-identification probability;
determining, by the device, that the differential risk score does not satisfy a threshold;
determining, by the device, an amount of computational resources available for determining a re-identification risk score for the set of information;
selecting, by the device and based on the amount of computational resources available for determining the re-identification risk score for the set of information, a technique to be used to calculate the re-identification risk score;
determining, by the device and based on the differential risk score not satisfying the threshold and in accordance with the selected technique, the re-identification risk score based on:
a maximum value selected from a set that includes a first value and a second value,
wherein the first value is equal to a product of a first re-identification probability of a first information having a highest re-identification probability in a first group of information and a second re-identification probability of a second information having a highest re-identification probability in a second group of information included in the set of information; and
utilizing, by the device, a generative adversarial network algorithm with an autoencoder architecture to generate high-dimensional discrete samples for synthetic data,
wherein the high-dimensional discrete samples for the synthetic data are generated from the one or more de-identified data sets, and
wherein minibatch averaging is utilized to avoid mode collapse; and
selectively outputting, by the device and based on the re-identification risk score, one of:
actual data, from the one or more de-identified data sets, of the feature set if the re-identification risk score satisfies a condition, or
the synthetic data, for the feature set, if the re-identification risk score does not satisfy the condition.
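
The scoring and selection steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and it rests on several assumptions that the claim does not state: an equivalence class is taken to be the set of records sharing the same values for a group of attributes; the re-identification probability of a record in a class of size k is modeled as 1/k; the "second value" is passed in as a parameter because the claim does not define it; and the output condition is assumed to be "risk score at or below a limit". The differential risk score is omitted because the claim gives no formula for it. All function, parameter, and field names are hypothetical.

```python
from collections import Counter


def equivalence_classes(records, attributes):
    """Count records per distinct combination of the given attribute values."""
    return Counter(tuple(r[a] for a in attributes) for r in records)


def highest_reid_probability(records, attributes):
    """Highest re-identification probability within one group of information.

    Assumption: a record in an equivalence class of size k is re-identified
    with probability 1/k, so the smallest class gives the highest probability.
    """
    classes = equivalence_classes(records, attributes)
    return 1.0 / min(classes.values())


def reid_risk_score(records, group_a, group_b, second_value):
    """Maximum of the first value and a caller-supplied second value.

    The first value is the product of the highest probabilities of two groups,
    as recited in the claim. The second value is not defined in the claim.
    """
    first_value = (highest_reid_probability(records, group_a)
                   * highest_reid_probability(records, group_b))
    return max(first_value, second_value)


def select_output(actual, synthetic, risk_score, risk_limit):
    """Return actual data when the assumed condition holds, else synthetic data."""
    return actual if risk_score <= risk_limit else synthetic


# Hypothetical de-identified records.
records = [
    {"age_band": "30-39", "zip3": "080", "sex": "F"},
    {"age_band": "30-39", "zip3": "080", "sex": "F"},
    {"age_band": "40-49", "zip3": "080", "sex": "M"},
    {"age_band": "40-49", "zip3": "081", "sex": "M"},
]
group_a = ["age_band", "zip3"]  # three classes, smallest has one record
group_b = ["sex"]               # two classes of two records each

score = reid_risk_score(records, group_a, group_b, second_value=0.25)
choice = select_output("actual", "synthetic", score, risk_limit=0.2)
```

With these sample records the first group contains a class of a single record, so the score exceeds the assumed limit and the sketch selects the synthetic data.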