US 12,086,287 B2
Horizontally-scalable data de-identification
David Jensen, Ouray, CO (US); and Joseph David Jensen, Santa Clara, CA (US)
Assigned to Snowflake Inc., Bozeman, MT (US)
Filed by SNOWFLAKE INC., Bozeman, MT (US)
Filed on Nov. 3, 2022, as Appl. No. 17/980,371.
Application 17/980,371 is a continuation of application No. 17/352,217, filed on Jun. 18, 2021, granted, now 11,501,021.
Claims priority of provisional application 63/180,047, filed on Apr. 26, 2021.
Prior Publication US 2023/0050290 A1, Feb. 16, 2023
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 16/00 (2019.01); G06F 16/22 (2019.01); G06F 16/28 (2019.01); G06F 21/62 (2013.01)

CPC G06F 21/6254 (2013.01) [G06F 16/221 (2019.01); G06F 16/282 (2019.01); G06F 21/6227 (2013.01)]

27 Claims

1. A method comprising:

receiving data from a data source;

transforming the data, using structured query language (SQL), into integer data;

generating a plurality of generalizations of the integer data;

sending the plurality of generalizations of the integer data to a plurality of execution nodes, each node of the plurality of execution nodes operating on a set of data separable from a set of data on another execution node, wherein each of the plurality of execution nodes includes computational resources to compute a candidate generalization using an information loss scoring function, wherein computing the candidate generalization includes grouping the plurality of generalizations into a set of equivalence classes and pruning the set of equivalence classes;

receiving a set of candidate generalizations from the plurality of execution nodes, wherein a candidate generalization includes:

an equivalence class size approximation, wherein an equivalence class approximation is determined by computing an upper bound for an equivalence class size, based at least in part on a k-value, a number of records, a number of records in a set of equivalence classes with a size greater or equal to the k-value, a number of records in a set of equivalence classes with a size less than the k-value, and a maximum equivalence class size;

a suppression approximation comprising a number of suppressed records, the suppression approximation based at least in part on the number of records and the k-value; and

an information loss score;

selecting a preferred generalization from the set of candidate generalizations;

generating an anonymized view of the integer data using the preferred generalization; and

providing a view of the integer data in which a data subject is unidentifiable directly or indirectly.
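
The steps of claim 1 can be illustrated with a small single-process sketch. It is not the patented implementation. The sample records, the bucket-width hierarchy in WIDTHS, the loss formula, and every function name below are assumptions made for illustration. The sketch computes exact equivalence class sizes and suppression counts on one machine, whereas the claim recites approximations (an upper bound on equivalence class size and a suppression approximation) returned by a plurality of execution nodes, and it starts from data that is already integer-encoded rather than performing the SQL transformation.

```python
from collections import Counter
from itertools import product

# Hypothetical integer-encoded records (age, ZIP code); not taken from the patent.
RECORDS = [(23, 94107), (27, 94109), (25, 94110),
           (41, 80302), (44, 80305), (67, 59715)]

# Hypothetical generalization hierarchy: bucket widths per column, finest first.
WIDTHS = [(1, 10, 100), (1, 100, 100_000)]


def generalize(record, levels):
    """Round each integer down to the lower edge of its bucket."""
    return tuple(v // WIDTHS[col][lvl] * WIDTHS[col][lvl]
                 for col, (v, lvl) in enumerate(zip(record, levels)))


def score_candidate(records, levels, k):
    """Group records into equivalence classes, prune classes smaller than k,
    and score the result with an assumed information loss function."""
    classes = Counter(generalize(r, levels) for r in records)
    kept = {key for key, size in classes.items() if size >= k}
    suppressed = sum(size for key, size in classes.items() if key not in kept)
    # Assumed loss: normalized generalization depth plus suppressed fraction.
    max_level = sum(len(w) - 1 for w in WIDTHS)
    loss = sum(levels) / max_level + suppressed / len(records)
    return {"levels": levels, "kept": kept, "suppressed": suppressed, "loss": loss}


def select_preferred(records, k):
    """Score every combination of levels and keep the lowest-loss candidate."""
    candidates = [score_candidate(records, levels, k)
                  for levels in product(*(range(len(w)) for w in WIDTHS))]
    return min(candidates, key=lambda c: c["loss"])


def anonymized_view(records, candidate):
    """Emit generalized rows, dropping rows whose class was pruned."""
    rows = (generalize(r, candidate["levels"]) for r in records)
    return [row for row in rows if row in candidate["kept"]]


best = select_preferred(RECORDS, k=2)
view = anonymized_view(RECORDS, best)
```

With k=2, the sketch selects ages bucketed by decade and ZIP codes bucketed by hundreds, suppresses the single record that falls alone in its class, and returns a view in which every remaining row shares its values with at least one other row. In a horizontally scaled setting, each call to score_candidate could run on a separate node, since candidates are scored independently of one another.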