| CPC H04L 47/2483 (2013.01) [H04L 47/2441 (2013.01); H04L 63/1416 (2013.01)] | 7 Claims |

|
1. A system for classifying encrypted traffic based on a data packet, comprising a traffic capture module, a traffic analysis module, and a traffic classification module, wherein
the traffic capture module is configured to filter data packet information in a network flow by identifying an IP address, a port number, a protocol type, and a flag bit in traffic, to obtain flow data, wherein the network flow refers to all data packets transmitted between two IP addresses and ports corresponding to the two IP addresses;
the traffic analysis module is configured to: extract transport layer security (TLS), hypertext transfer protocol (HTTP), and domain name system (DNS) protocol information and related fields from the flow data; extract information about data packets in the flow data; and perform a cluster analysis on information about sizes, flow directions, and delays of the data packets, to extract spatial-temporal features, header features, load features, and statistical features from the flow data, wherein the spatial-temporal features refer to temporal attributes and spatial attributes of data packets that are normally sent in a network traffic transmission process, the header features comprise 5-tuple information of the traffic, DNS information, and HTTP information, the load features refer to content encapsulated in the flow data, and the statistical features comprise an average packet length, a maximum packet length, an average inter-packet delay, a ratio of a quantity of uplink data packets to a quantity of downlink data packets, and a ratio of a quantity of uplink bytes to a quantity of downlink bytes; and
the traffic classification module is configured to classify normal data packets and malicious data packets through k-means clustering, wherein
an input dataset is in a format of D={x1, x2, . . . , xi}, and an output is a classification result C={C1, C2}, wherein C1 and C2 represent labels of normal traffic and malicious traffic respectively; and a specific classification process comprises: first, randomly selecting two samples from the dataset D, to constitute a centroid set {μ1, μ2}, wherein a centroid of the set is represented by μj; then, calculating a distance between each sample xi and the centroid μj, wherein the distance is calculated based on the following formula:
$d_{ij} = \lVert x_i - \mu_j \rVert_2^2$
next, recalculating a centroid of the set C based on the following formula:
$\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$
subsequently, calculating distances between each sample and two centroids; allocating each sample to a centroid that is closest to the sample, wherein the centroid and the sample that is allocated to the centroid constitute a cluster; and after all samples are allocated, outputting a clustering result if no centroid vector is changed, wherein the following clustering result is finally output:
C={C1,C2}
after categories of the normal data packets and the malicious data packets in the traffic are obtained, calculating a proportion of the normal data packets in the traffic, a proportion of the malicious data packets in the traffic, and a ratio of the normal data packets to the malicious data packets; and adding, as parameters to a feature matrix, the proportion of the normal data packets in the traffic, the proportion of the malicious data packets in the traffic, and the ratio of the normal data packets to the malicious data packets, to finally obtain a sample set S={S_1, S_2|xi∈S}, wherein xi is a sample in the set S;
after the sample set is input, using a light gradient-boosting machine (LightGBM) model for classification, so as to obtain a traffic classification result, wherein a Gini coefficient expression of probability distribution is:
Gini(p)=2p(1−p)
wherein p represents a probability of being normal traffic, a loss function that is used is a log-likelihood loss function, and the log-likelihood loss function is calculated based on the following formula:
$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$
wherein L represents the loss function, N represents a quantity of samples, yi represents a true category of an input instance, and pi represents a predicted probability that the input instance belongs to a normal traffic category.
|
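
The numerical steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the claimed implementation. It assumes feature vectors are plain tuples of floats, and the names `two_means`, `ratio_features`, `gini`, and `log_loss` are hypothetical helpers introduced only for this illustration. Which of the two clusters corresponds to normal traffic is also an assumption that the caller supplies via `normal_label`, since clustering alone does not name the clusters. The LightGBM training step is omitted.

```python
import math
import random


def squared_distance(x, mu):
    # d_ij = ||x_i - mu_j||_2^2
    return sum((a - b) ** 2 for a, b in zip(x, mu))


def two_means(samples, seed=0, max_iter=100):
    """Two-cluster k-means; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Randomly select two samples as the initial centroid set {mu_1, mu_2}.
    centroids = [list(s) for s in rng.sample(samples, 2)]
    labels = [0] * len(samples)
    for _ in range(max_iter):
        # Allocate each sample to the centroid closest to it.
        labels = [
            min((0, 1), key=lambda j: squared_distance(x, centroids[j]))
            for x in samples
        ]
        # Recalculate each centroid as the mean of the samples in its cluster.
        new_centroids = []
        for j in (0, 1):
            members = [x for x, lab in zip(samples, labels) if lab == j]
            if not members:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
                continue
            dim = len(members[0])
            new_centroids.append(
                [sum(m[d] for m in members) / len(members) for d in range(dim)]
            )
        if new_centroids == centroids:
            break  # no centroid vector changed, so output the result
        centroids = new_centroids
    return labels, centroids


def ratio_features(labels, normal_label=0):
    """Proportion of normal, proportion of malicious, and normal/malicious ratio."""
    n = len(labels)
    normal = sum(1 for lab in labels if lab == normal_label)
    malicious = n - normal
    # Assumption: the ratio is reported as infinity when no sample is malicious.
    ratio = normal / malicious if malicious else float("inf")
    return normal / n, malicious / n, ratio


def gini(p):
    # Gini(p) = 2p(1 - p) for a two-class probability distribution.
    return 2 * p * (1 - p)


def log_loss(y_true, p_pred, eps=1e-15):
    # L = -(1/N) * sum(y_i * log(p_i) + (1 - y_i) * log(1 - p_i))
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)
```

In this sketch the three values returned by `ratio_features` would be appended as extra columns to the feature matrix before a gradient-boosting classifier is trained, and `log_loss` would be evaluated on that classifier's predicted probabilities.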