US 11,789,982 B2
Order independent data categorization, indication, and remediation across realtime datasets of live service environments
Denis Tumpic, Stratford (CA); Brian Schafer, Austin, TX (US); James Nix, Liberty Hill, TX (US); Shina Aofolaju, Austin, TX (US); and Jesse Campbell, Maple Valley, WA (US)
Assigned to ELECTRONIC ARTS INC., Redwood City, CA (US)
Filed by Electronic Arts Inc., Redwood City, CA (US)
Filed on Sep. 23, 2020, as Appl. No. 17/029,524.
Prior Publication US 2022/0092086 A1, Mar. 24, 2022
Int. Cl. G06F 16/28 (2019.01); G06F 16/2458 (2019.01); G06N 7/02 (2006.01); G06F 16/22 (2019.01)

CPC G06F 16/285 (2019.01) [G06F 16/2255 (2019.01); G06F 16/2468 (2019.01); G06N 7/023 (2013.01)]

19 Claims

1. An order independent computer-implemented method of auditing a large-scale dataset for sensitive data, the method comprising:

receiving content from a real-time dataset of a live service environment;

identifying data snippets of the real-time dataset using a set of one or more attention rules, wherein the one or more attention rules are opportunistic attention rules configured for maximising a recall rate, and each identified data snippet includes either a true positive data item or a false positive data item, wherein a true positive data item is a data item that matches one or more defined criteria associated with sensitive data, and a false positive data item is a data item that does not match the one or more defined criteria associated with sensitive data, wherein the recall rate indicates a portion of identified true positive data items that match the one or more defined criteria associated with sensitive data among true positive data items present in the received content;

categorizing the identified data snippets using fuzzy hashing by assigning them to buckets such that each bucket contains data snippets that are similar to one another according to a similarity measure defined by a fuzzy hashing algorithm for performing the fuzzy hashing;

classifying buckets containing data snippets having more than a threshold number of the true positive data items as true positive buckets and remaining buckets as false positive buckets;

calculating culling rules based on the true positive buckets and the false positive buckets, wherein the culling rules are configured to reduce a total number of false positive data items in the true positive buckets to increase a precision rate; and

using the culling rules to remove the false positive data items that do not meet the one or more defined criteria associated with sensitive data from the true positive buckets, to increase the precision rate that corresponds to a portion of true positive data items that meet the one or more defined criteria associated with sensitive data among data items in the true positive buckets.
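
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and everything specific in it is an assumption made for illustration: e-mail addresses stand in for "sensitive data", the broad `ATTENTION` pattern stands in for an opportunistic attention rule, the strict `CRITERIA` pattern stands in for the defined criteria (and for whatever labeling process identifies true positives), and the `coarse_sig` shape signature is a crude substitute for a real fuzzy hashing algorithm. The culling rules here are simply the fine-grained shapes seen only among false positives.

```python
import re
from collections import defaultdict

# Hypothetical attention rule: deliberately broad, favoring recall over precision.
ATTENTION = re.compile(r"\S*@\S*")
# Hypothetical "defined criteria" for sensitive data: a strict e-mail pattern.
CRITERIA = re.compile(r"[A-Za-z0-9._-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}")

def is_sensitive(snippet):
    """Stand-in for a labeling step: does the snippet meet the criteria?"""
    return CRITERIA.fullmatch(snippet) is not None

def coarse_sig(snippet):
    """Crude substitute for a fuzzy hash: alphanumeric runs collapse to 'w'."""
    return re.sub(r"[A-Za-z0-9]+", "w", snippet)

def fine_sig(snippet):
    """Finer shape used for culling: letter runs -> 'a', digit runs -> '9'."""
    return re.sub(r"[0-9]+", "9", re.sub(r"[A-Za-z]+", "a", snippet))

def precision(items):
    """Share of items that meet the criteria."""
    return sum(is_sensitive(s) for s in items) / len(items)

def audit(content, threshold=1):
    # Identify snippets with the high-recall attention rule.
    snippets = ATTENTION.findall(content)
    # Categorize: snippets with the same coarse signature share a bucket.
    buckets = defaultdict(list)
    for s in snippets:
        buckets[coarse_sig(s)].append(s)
    # Classify buckets by their count of true positive items.
    tp_buckets, fp_buckets = {}, {}
    for sig, items in buckets.items():
        n_tp = sum(is_sensitive(s) for s in items)
        (tp_buckets if n_tp > threshold else fp_buckets)[sig] = items
    # Calculate culling rules from both kinds of bucket:
    # shapes that occur among false positives but never among true positives.
    labeled = [s for b in (tp_buckets, fp_buckets) for items in b.values() for s in items]
    fp_shapes = {fine_sig(s) for s in labeled if not is_sensitive(s)}
    tp_shapes = {fine_sig(s) for s in labeled if is_sensitive(s)}
    culling = fp_shapes - tp_shapes
    # Apply the culling rules to the true positive buckets only.
    # Sorting makes the result independent of the order of the input.
    cleaned = {sig: sorted(s for s in items if fine_sig(s) not in culling)
               for sig, items in tp_buckets.items()}
    return tp_buckets, culling, cleaned

# Illustrative input; the addresses and handles are made up.
CONTENT = ("mail alice@example.com or bob7@mail.org about build@v2.1 "
           "then ping @player1 and @mod42 and cc carol@studio.net")
tp_buckets, culling, cleaned = audit(CONTENT)
```

In this toy input the attention rule picks up six snippets. The bucket with signature `w@w.w` holds three addresses plus the false positive `build@v2.1`, so its precision is 0.75 before culling and 1.0 after. The two `@` handles land in a separate false positive bucket and contribute the second culling shape.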