US 12,438,819 B2
System for classifying encrypted traffic based on data packet
Jing Qiu, Guangdong (CN); Jie Ding, Guangdong (CN); Rongrong Chen, Guangdong (CN); Chengliang Gao, Guangdong (CN); Zhihong Tian, Guangdong (CN); Lihua Yin, Guangdong (CN); Guangxia Xu, Guangdong (CN); Shen Su, Guangdong (CN); Xiaoya Ni, Guangdong (CN); Fei Tang, Guangdong (CN); Minghao Hu, Guangdong (CN); Jiaxu Xing, Guangdong (CN); and Qianlong Xiao, Guangdong (CN)
Assigned to Guangzhou University, Guangzhou (CN)
Filed by Guangzhou University, Guangdong (CN)
Filed on Nov. 1, 2023, as Appl. No. 18/386,251.
Application 18/386,251 is a continuation in part of application No. PCT/CN2022/133120, filed on Nov. 21, 2022.
Claims priority of application No. 202210271454.7 (CN), filed on Mar. 18, 2022.
Prior Publication US 2024/0064107 A1, Feb. 22, 2024
Int. Cl. H04L 47/2483 (2022.01); H04L 9/40 (2022.01); H04L 47/2441 (2022.01)

CPC H04L 47/2483 (2013.01) [H04L 47/2441 (2013.01); H04L 63/1416 (2013.01)]

7 Claims

1. A system for classifying encrypted traffic based on a data packet, comprising a traffic capture module, a traffic analysis module, and a traffic classification module, wherein

the traffic capture module is configured to filter data packet information in a network flow by identifying an IP address, a port number, a protocol type, and a flag bit in traffic, to obtain flow data, wherein the network flow refers to all data packets transmitted between two IP addresses and ports corresponding to the two IP addresses;
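A minimal sketch of the flow definition above, assuming packets have already been parsed into simple records. The `Packet` type, its field names, and the helper functions are hypothetical and are not taken from the claim; flag-bit filtering is omitted.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    # Hypothetical minimal packet record; field names are illustrative only.
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    timestamp: float
    length: int


def flow_key(pkt):
    """Direction-independent key, so both directions map to the same flow."""
    a = (pkt.src_ip, pkt.src_port)
    b = (pkt.dst_ip, pkt.dst_port)
    lo, hi = sorted([a, b])
    return (lo, hi, pkt.protocol)


def group_flows(packets):
    """Group packets into flows: all packets between two IP/port endpoints."""
    flows = defaultdict(list)
    for pkt in packets:
        flows[flow_key(pkt)].append(pkt)
    return dict(flows)
```

Sorting the two endpoints before building the key is one simple way to treat uplink and downlink packets as members of the same flow.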

the traffic analysis module is configured to: extract transport layer security (TLS), hypertext transfer protocol (HTTP), and domain name system (DNS) protocol information and related fields from the flow data; extract information about data packets in the flow data; and perform a cluster analysis on information about sizes, flow directions, and delays of the data packets, to extract spatial-temporal features, header features, load features, and statistical features from the flow data, wherein the spatial-temporal features refer to temporal attributes and spatial attributes of data packets that are normally sent in a network traffic transmission process, the header features comprise 5-tuple information of the traffic, DNS information, and HTTP information, the load features refer to content encapsulated in the flow data, and the statistical features comprise an average packet length, a maximum packet length, an average inter-packet delay, a ratio of a quantity of uplink data packets to a quantity of downlink data packets, and a ratio of a quantity of uplink bytes to a quantity of downlink bytes; and
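The statistical features listed above could be computed per flow roughly as follows. This is an illustrative sketch only: the input tuple layout, the dictionary keys, and the handling of empty downlink traffic are assumptions, not part of the claim.

```python
def statistical_features(packets):
    """Compute the five statistical features for one flow.

    packets: list of (timestamp_seconds, length_bytes, is_uplink) tuples.
    This tuple layout is an assumption made for the sketch.
    """
    if not packets:
        raise ValueError("flow has no packets")
    packets = sorted(packets, key=lambda p: p[0])  # order by arrival time
    times = [p[0] for p in packets]
    lengths = [p[1] for p in packets]
    gaps = [t2 - t1 for t1, t2 in zip(times, times[1:])]

    up_pkts = sum(1 for p in packets if p[2])
    down_pkts = len(packets) - up_pkts
    up_bytes = sum(p[1] for p in packets if p[2])
    down_bytes = sum(lengths) - up_bytes

    def ratio(num, den):
        # Assumption: report infinity when there is no downlink traffic.
        return num / den if den else float("inf")

    return {
        "avg_packet_length": sum(lengths) / len(lengths),
        "max_packet_length": max(lengths),
        "avg_inter_packet_delay": sum(gaps) / len(gaps) if gaps else 0.0,
        "uplink_downlink_packet_ratio": ratio(up_pkts, down_pkts),
        "uplink_downlink_byte_ratio": ratio(up_bytes, down_bytes),
    }
```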

the traffic classification module is configured to classify normal data packets and malicious data packets through k-means clustering, wherein

an input dataset is in a format of D={x₁, x₂, …, x_i}, and an output is a classification result C={C₁, C₂}, wherein C₁ and C₂ represent labels of normal traffic and malicious traffic respectively; and a specific classification process comprises: first, randomly selecting two samples from the dataset D, to constitute a centroid set {μ₁, μ₂}, wherein a centroid of the set is represented by μ_j; then, calculating a distance between each sample x_i and the centroid μ_j, wherein the distance is calculated based on the following formula:

d_ij=∥x_i−μ_j∥₂²

next, recalculating a centroid of the set C based on the following formula:

μ_j=(1/|C_j|)Σ_{x∈C_j}x

subsequently, calculating distances between each sample and two centroids; allocating each sample to a centroid that is closest to the sample, wherein the centroid and the sample that is allocated to the centroid constitute a cluster; and after all samples are allocated, outputting a clustering result if no centroid vector is changed, wherein the following clustering result is finally output:

C={C₁,C₂}
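The clustering procedure described above (two random initial centroids, squared Euclidean distance, reassignment until no centroid changes) can be sketched in plain Python as follows. The function names, the `seed` and `max_iter` parameters, and the use of the cluster mean as the recalculated centroid are assumptions of this sketch; the mean is the standard k-means update.

```python
import random


def squared_distance(x, mu):
    """d_ij = ||x_i - mu_j||_2^2 for two equal-length numeric tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, mu))


def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def two_means(dataset, seed=0, max_iter=100):
    """k-means with k=2; returns (clusters, centroids)."""
    rng = random.Random(seed)
    # Randomly select two samples as the initial centroid set {mu_1, mu_2}.
    centroids = [tuple(p) for p in rng.sample(dataset, 2)]
    clusters = [[], []]
    for _ in range(max_iter):
        clusters = [[], []]
        for x in dataset:
            d = [squared_distance(x, mu) for mu in centroids]
            # Allocate each sample to its closest centroid (ties go to the first).
            clusters[0 if d[0] <= d[1] else 1].append(tuple(x))
        # Recalculate each centroid; keep the old one if its cluster is empty.
        new_centroids = [
            mean(c) if c else centroids[j] for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # no centroid vector changed: output the clustering result
        centroids = new_centroids
    return clusters, centroids
```

Which of the two resulting clusters corresponds to normal traffic and which to malicious traffic is not determined by the clustering itself and would need to be decided separately.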

after categories of the normal data packets and the malicious data packets in the traffic are obtained, calculating a proportion of the normal data packets in the traffic, a proportion of the malicious data packets in the traffic, and a ratio of the normal data packets to the malicious data packets; and adding, as parameters to a feature matrix, the proportion of the normal data packets in the traffic, the proportion of the malicious data packets in the traffic, and the ratio of the normal data packets to the malicious data packets, to finally obtain a sample set S={S₁, S₂|x_i∈S}, wherein x_i is a sample in the set S;
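As an illustration of the step above, the three proportion parameters might be appended to each row of a feature matrix as follows. The label encoding (0 for normal, 1 for malicious), the per-flow grouping of packet labels, and the function names are assumptions of this sketch.

```python
def cluster_proportions(labels):
    """Return [normal share, malicious share, normal-to-malicious ratio].

    labels: packet labels for one flow, 0 = normal, 1 = malicious (assumed).
    """
    labels = list(labels)
    total = len(labels)
    if total == 0:
        raise ValueError("no packets to summarise")
    malicious = sum(labels)
    normal = total - malicious
    # Assumption: report infinity when no packet was labelled malicious.
    ratio = normal / malicious if malicious else float("inf")
    return [normal / total, malicious / total, ratio]


def augment(feature_matrix, per_flow_labels):
    """Append the three proportion parameters to each feature row."""
    return [
        list(row) + cluster_proportions(labels)
        for row, labels in zip(feature_matrix, per_flow_labels)
    ]
```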

after the sample set is input, using a light gradient-boosting machine (LightGBM) model for classification, so as to obtain a traffic classification result, wherein a Gini coefficient expression of probability distribution is:

Gini(p)=2p(1−p)

wherein p represents a probability of being normal traffic, a loss function that is used is a log-likelihood loss function, and the log-likelihood loss function is calculated based on the following formula:

L=−(1/N)Σ_{i=1}^{N}[y_i log(p_i)+(1−y_i)log(1−p_i)]

wherein L represents the loss function, N represents a quantity of samples, y_i represents a true category of an input instance, and p_i represents a predicted probability that the input instance belongs to a normal traffic category.
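The two quantities named in this step can be evaluated numerically as in the sketch below. It assumes the standard binary log-likelihood (cross-entropy) form, L=−(1/N)Σ[y_i log(p_i)+(1−y_i)log(1−p_i)], which is consistent with the symbols defined above. The function names and the clipping constant `eps` are illustrative and are not taken from the claim or from LightGBM.

```python
import math


def gini(p):
    """Gini(p) = 2p(1 - p) for a two-class probability distribution."""
    return 2 * p * (1 - p)


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary log-likelihood loss.

    y_true: true categories (1 = normal traffic, 0 = otherwise; assumed encoding).
    p_pred: predicted probabilities of the normal-traffic category.
    eps: clipping constant to avoid log(0); an assumption of this sketch.
    """
    if len(y_true) != len(p_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)
```

The Gini value is largest (0.5) when p=0.5 and falls to 0 when the node is pure, while the loss shrinks as the predicted probabilities approach the true categories.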