CPC H04L 63/1425 (2013.01) [G06F 18/214 (2023.01); G06N 20/00 (2019.01); H04L 63/1416 (2013.01); H04L 63/1466 (2013.01); H04L 63/166 (2013.01); H04L 63/20 (2013.01)] | 20 Claims |
1. A computerized method comprising:
accessing an initial set of historical network traffic data from a data store, wherein the historical network traffic data represents transmission of data between source devices and destination devices;
preparing a training set of data prior to training a machine learning model, from the initial set of data, by:
applying a plurality of operations to the initial set of historical network traffic data to obtain a plurality of filtered subsets of network transmissions, wherein each filtered subset of network transmissions represents a corresponding set of beaconing candidates and is labeled by at least a security expert or a machine learning model to form a plurality of sets of labeled results,
wherein the plurality of sets of labeled results are augmented to form an augmented labeled training set, and
storing the augmented labeled training set;
applying a first clustering filter rule to the initial set of historical network traffic data to obtain a first filtered subset of network transmissions that represent a first set of beaconing candidates;
performing a clustering logic to generate a set of one or more clusters from the first set of beaconing candidates;
applying a multivariate anomaly detection logic to the set of one or more clusters to detect and extract outliers in the first set of beaconing candidates;
providing an outlier alert to a system administrator indicating that the outliers have been determined to indicate a presence of beaconing, wherein extraction of the outliers results in a remaining set of beaconing candidates and a sampling subset from each cluster of the remaining set of beaconing candidates is labeled by the security expert to form a first set of labeled results; and
training the machine learning model using the augmented labeled training set, the machine learning model being subsequently used to classify data.
|
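The filtering, clustering, outlier-extraction and sampling steps recited in claim 1 can be illustrated with a minimal Python sketch. Everything specific here is an assumption for illustration and is not taken from the claim: the filter rule (a minimum connection count), the two features (mean gap between connections and its jitter), the clustering stand-in (bucketing by order of magnitude of the mean gap), the anomaly score (a median/MAD-based distance over both features), the threshold of 3.5, and all function names and toy data.

```python
import math
import random
from collections import defaultdict
from statistics import mean, median, pstdev


def filter_candidates(flows, min_events=5):
    """Assumed filter rule: keep (source, destination) pairs with enough connections."""
    return {pair: ts for pair, ts in flows.items() if len(ts) >= min_events}


def features(timestamps):
    """Return (mean gap, jitter), where jitter is the std deviation of gaps over their mean."""
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    avg = mean(gaps)
    return (avg, pstdev(gaps) / avg if avg else 0.0)


def cluster(feats):
    """Stand-in clustering: bucket candidates by order of magnitude of the mean gap."""
    clusters = defaultdict(list)
    for pair, (avg, _) in feats.items():
        key = math.floor(math.log10(avg)) if avg > 0 else -1
        clusters[key].append(pair)
    return dict(clusters)


def outlier_scores(points):
    """Multivariate score: Euclidean norm of per-feature median/MAD z-scores."""
    dims = list(zip(*points))
    meds = [median(d) for d in dims]
    # Guard against a zero MAD so the division below is always defined.
    mads = [median([abs(x - m) for x in d]) or 1e-9 for d, m in zip(dims, meds)]
    return [
        math.sqrt(sum((0.6745 * (x - m) / s) ** 2 for x, m, s in zip(p, meds, mads)))
        for p in points
    ]


def triage(flows, threshold=3.5, sample_size=2, seed=0):
    """Return (outliers to alert on, per-cluster samples for expert labeling)."""
    feats = {pair: features(ts) for pair, ts in filter_candidates(flows).items()}
    rng = random.Random(seed)
    outliers, to_label = [], {}
    for cid, pairs in cluster(feats).items():
        scores = outlier_scores([feats[p] for p in pairs])
        flagged = [p for p, s in zip(pairs, scores) if s > threshold]
        remaining = [p for p in pairs if p not in flagged]
        outliers.extend(flagged)
        # Sample from what is left after the outliers are extracted.
        to_label[cid] = rng.sample(remaining, min(sample_size, len(remaining)))
    return outliers, to_label


# Hypothetical toy data: connection timestamps in seconds per (source, destination).
flows = {
    ("10.0.0.1", "a.example"): [0, 40, 130, 150, 260],
    ("10.0.0.2", "b.example"): [0, 30, 110, 170, 240],
    ("10.0.0.3", "c.example"): [0, 90, 110, 200, 280],
    ("10.0.0.4", "d.example"): [0, 50, 90, 190, 250],
    ("10.0.0.5", "c2.example"): [0, 60, 120, 180, 240],  # perfectly periodic
    ("10.0.0.6", "e.example"): [0, 10],                  # too few events
}
outliers, to_label = triage(flows)
```

With this toy data, only the perfectly periodic pair exceeds the threshold and would trigger the outlier alert; two of the remaining pairs in its cluster are drawn for expert labeling. The resulting labels would then feed the augmented labeled training set, a step the sketch does not cover.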