CPC G06F 16/16 (2019.01) [G06F 21/6227 (2013.01); G06N 3/0475 (2023.01); G06N 3/08 (2013.01)] | 16 Claims |
1. A system for classifying sensitive data elements in a file using machine learning, wherein the system comprises:
one or more hardware processors;
a memory coupled to the one or more hardware processors, wherein the memory comprises a plurality of subsystems in the form of programmable instructions executable by the one or more hardware processors, and wherein the plurality of subsystems comprises:
a processing subsystem hosted on a server, and configured to execute on a network to control bidirectional communications among a plurality of modules comprising:
a receiving module configured to receive an unstructured data file, wherein the unstructured data file is a data source with an inconsistent data structure, organized in the form of unstructured forms and unstructured natural text;
a conversion module operatively coupled to the receiving module, wherein the conversion module is configured to convert the unstructured data file into a machine-readable format;
a machine learning module operatively coupled to the conversion module, wherein the machine learning module comprises:
a feature generation module operatively coupled with the receiving module and configured to:
generate a plurality of sensitive data features, wherein the plurality of sensitive data features represents single elements of the sensitive data;
generate a plurality of adjacent elements corresponding to the single elements of the sensitive data, and analyze the relationship between the single elements of the sensitive data and the plurality of adjacent elements; and
generate a plurality of feature categories, wherein the plurality of feature categories comprises a plurality of node features, a plurality of adjacent node features, and a plurality of edge features;
a feature calculation module operatively coupled to the feature generation module, wherein the feature calculation module is configured to:
aggregate the plurality of adjacent node features and the plurality of edge features by calculating the average of the plurality of adjacent node features and a plurality of adjacent edge features;
calculate the plurality of aggregated adjacent node features and the plurality of aggregated edge features for one or more adjacent nodes and adjacent node edges; and
concatenate the plurality of aggregated adjacent node features and the plurality of aggregated edge features with the features of the individual adjacent node and the individual edge;
a comparison module operatively connected to the feature calculation module, wherein the comparison module is configured to compare the distance of the sensitive data from all of the adjacent sensitive data to calculate the nearest adjacent sensitive data and select the plurality of adjacent node features; and
a classification module operatively coupled with the feature generation module, wherein the classification module is configured to:
classify the sensitive data and predict whether the sensitive data is true positive or false positive sensitive data by using machine learning, in response to receiving the generated sensitive data from the feature generation module, wherein the false positive sensitive data is filtered out and the true positive sensitive data represents accurate sensitive data.
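For illustration only, the comparison and feature calculation steps recited above might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the function names (`mean_vector`, `nearest_neighbors`, `build_feature_vector`), the use of Euclidean distance, the choice of `k`, and the plain-list feature layout are all hypothetical and do not appear in the claim.

```python
from math import dist
from statistics import fmean


def mean_vector(vectors):
    """Element-wise average of equal-length feature vectors."""
    return [fmean(column) for column in zip(*vectors)]


def nearest_neighbors(center, candidates, k):
    """Return the ids of the k candidates closest to `center`.

    Assumption: distance is Euclidean over element positions;
    the claim does not name a specific metric.
    """
    ranked = sorted(candidates, key=lambda cid: dist(center, candidates[cid]))
    return ranked[:k]


def build_feature_vector(node_feats, adj_node_feats, adj_edge_feats):
    """Concatenate a node's own features with the averaged features
    of its adjacent nodes and of the edges leading to them."""
    return (
        list(node_feats)
        + mean_vector(adj_node_feats)
        + mean_vector(adj_edge_feats)
    )


# Hypothetical positions of other detected elements on a page.
positions = {"b": (1.0, 0.0), "c": (5.0, 5.0), "d": (0.0, 2.0)}
neighbors = nearest_neighbors((0.0, 0.0), positions, k=2)

# Hypothetical node, adjacent-node, and edge features.
features = build_feature_vector(
    node_feats=[1.0, 0.0],
    adj_node_feats=[[0.0, 1.0], [1.0, 1.0]],
    adj_edge_feats=[[1.0], [2.0]],
)
print(neighbors)  # ['b', 'd']
print(features)   # [1.0, 0.0, 0.5, 1.0, 1.5]
```

In such a sketch, the concatenated vector would be the input to whatever classifier predicts true positive versus false positive; the claim leaves the classifier type open.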