US 12,153,604 B2
Apparatus and method for generating data set
Hyun-Jin Kim, Daejeon (KR); Jong-Hoon Lee, Daejeon (KR); Young-Soo Kim, Sejong-si (KR); Jong-Geun Park, Sejong-si (KR); and Cheol-Hee Park, Gongju-si (KR)
Assigned to Electronics and Telecommunications Research Institute, Daejeon (KR)
Filed by ELECTRONICS AND TELECOMMUNICATIONS RESEARCH INSTITUTE, Daejeon (KR)
Filed on Oct. 18, 2022, as Appl. No. 17/967,999.
Claims priority of application No. 10-2021-0139656 (KR), filed on Oct. 19, 2021.
Prior Publication US 2023/0123045 A1, Apr. 20, 2023
Int. Cl. G06F 16/00 (2019.01); G06F 16/28 (2019.01); G06N 3/088 (2023.01)
CPC G06F 16/285 (2019.01) [G06N 3/088 (2013.01)]
4 Claims
1. An apparatus for generating a data set, comprising:
one or more processors; and
executable memory for storing at least one program executed by the one or more processors,
wherein the at least one program is configured to
classify collected data into numerical feature data and categorical feature data using a filter method, the collected data including network traffic data, system log data, and security event data,
perform correlation analysis on the numerical feature data and the categorical feature data using an analysis of variance (ANOVA) method and a Chi-Squared method,
generate a data set for supervised learning and a data set for unsupervised learning using correlation scores calculated through the correlation analysis, and
generate a neural network model for cyber breach threat detection based on the data set for supervised learning and a data set for unsupervised learning,
wherein the at least one program is configured to:
rank importance of features according to predefined feature criteria using the filter method and measure a correlation between data features based on the ranked importance of the features, thereby classifying the collected data into the numerical feature data and the categorical feature data,
normalize the numerical feature data using a min-max scaling method and convert the categorical feature data into numerical values using a one-hot encoding method, and
determine that data corresponds to the data set for supervised learning as a correlation score calculated using the ANOVA method is higher and a correlation score calculated using the Chi-Squared method is lower.
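The preprocessing and scoring steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. It only shows, in plain Python, what min-max scaling, one-hot encoding, a one-way ANOVA F statistic, and a Pearson Chi-Squared statistic compute. The function names and the sample values (`bytes_sent`, `protocol`, `labels`) are hypothetical and do not come from the patent. The claim's rule for assigning data to the supervised or unsupervised data set is not encoded here, because the claim states no thresholds.

```python
from collections import Counter


def min_max_scale(values):
    """Rescale numbers linearly into [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Return (sorted categories, one 0/1 indicator vector per value)."""
    cats = sorted(set(values))
    index = {c: i for i, c in enumerate(cats)}
    vectors = [[1 if index[v] == i else 0 for i in range(len(cats))]
               for v in values]
    return cats, vectors


def anova_f(feature, labels):
    """One-way ANOVA F statistic of a numerical feature grouped by label.

    Assumes at least two classes and non-zero within-class variance.
    """
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(y, []).append(x)
    n, k = len(feature), len(groups)
    grand_mean = sum(feature) / n
    means = {y: sum(g) / len(g) for y, g in groups.items()}
    ss_between = sum(len(g) * (means[y] - grand_mean) ** 2
                     for y, g in groups.items())
    ss_within = sum((x - means[y]) ** 2
                    for y, g in groups.items() for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def chi_squared(feature, labels):
    """Pearson Chi-Squared statistic of the feature-by-label table."""
    n = len(feature)
    observed = Counter(zip(feature, labels))
    row_totals = Counter(feature)
    col_totals = Counter(labels)
    stat = 0.0
    for f, row_total in row_totals.items():
        for y, col_total in col_totals.items():
            expected = row_total * col_total / n
            stat += (observed[(f, y)] - expected) ** 2 / expected
    return stat


# Hypothetical sample: one numerical and one categorical feature per record.
bytes_sent = [100, 200, 900, 1000]
protocol = ["tcp", "tcp", "udp", "udp"]
labels = [0, 0, 1, 1]  # hypothetical class labels (0 = benign, 1 = threat)

scaled = min_max_scale(bytes_sent)
categories, encoded = one_hot(protocol)
f_score = anova_f(scaled, labels)
chi2_score = chi_squared(protocol, labels)
print(scaled, categories, encoded, f_score, chi2_score)
```

In this sketch the ANOVA score is computed for the numerical feature and the Chi-Squared score for the categorical feature, matching the pairing of methods and feature types in the claim. A production system would more likely use a statistics library than hand-written formulas.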