CPC G06F 18/24137 (2023.01) [G06F 7/24 (2013.01); G06F 18/213 (2023.01); G06F 18/214 (2023.01); G06F 18/22 (2023.01); G06F 18/2431 (2023.01)] | 15 Claims |
1. An unbalanced sample data preprocessing method, comprising:
receiving a data acquisition request, and acquiring initial data according to the data acquisition request;
classifying the initial data according to a preset classification rule to obtain first-class sample sets and second-class sample sets, wherein a number of samples in each of the first-class sample sets is less than a data amount threshold, and wherein a number of samples in each of the second-class sample sets is greater than the data amount threshold;
extracting K first sample points in the first-class sample sets, wherein K is an integer greater than 1;
analyzing characteristics of the K first sample points to obtain a new data characteristic of the first-class sample sets;
obtaining a first-class label corresponding to the first-class sample sets, and generating a new data label of the first-class sample sets according to the first-class label;
respectively obtaining a number of first-class sample sets and a number of second-class sample sets, and calculating a ratio between the number of first-class sample sets and the number of second-class sample sets; and
generating new data of the first-class sample sets according to the new data characteristic and the new data label, and adjusting an amount of new data according to the ratio to increase the number of first-class sample sets,
wherein analyzing the characteristics of the K first sample points to obtain the new data characteristic comprises:
extracting the characteristics of the K first sample points;
analyzing the characteristics to obtain a characteristic attribute;
respectively extracting the characteristics of the K first sample points according to the characteristic attribute, and respectively obtaining common characteristics of the K first sample points;
forming corresponding common characteristic combinations according to the common characteristics, and calculating a number of common characteristics comprised in the common characteristic combinations;
sorting the common characteristic combinations according to the number of common characteristics to obtain a common characteristic combination corresponding to a maximum number; and
generating the new data characteristic according to the common characteristic combination corresponding to the maximum number.
|
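The steps recited in claim 1 can be illustrated with a minimal Python sketch. This is one possible reading, not the claimed implementation. The claim does not fix the data representation, the classification rule, or how the ratio adjusts the amount of new data, so the following are all assumptions made for illustration: each sample is a dict of attribute/value pairs with a label, the "preset classification rule" is grouping by label, a "common characteristic combination" is the set of attribute/value pairs shared by two of the K points, and the hypothetical parameter `base_amount` is divided by the set ratio to decide how many new records to emit. The names `split_by_threshold`, `largest_common_combination`, and `preprocess` are likewise hypothetical.

```python
import random
from itertools import combinations


def split_by_threshold(samples, threshold):
    """Group (features, label) samples by label (assumed classification rule),
    then split the groups into first-class (fewer samples than the threshold)
    and second-class (more samples than the threshold) sample sets."""
    groups = {}
    for features, label in samples:
        groups.setdefault(label, []).append(features)
    first = {lbl: pts for lbl, pts in groups.items() if len(pts) < threshold}
    second = {lbl: pts for lbl, pts in groups.items() if len(pts) > threshold}
    return first, second


def largest_common_combination(points):
    """Assumed reading of the 'wherein' clause: for every pair of points,
    collect the attribute/value pairs they share, sort these combinations by
    how many common characteristics they contain, and return the largest."""
    combos = [
        frozenset(a.items()) & frozenset(b.items())
        for a, b in combinations(points, 2)
    ]
    combos.sort(key=len, reverse=True)  # largest combination first
    return dict(combos[0]) if combos else {}


def preprocess(samples, threshold, k, base_amount=1, seed=0):
    """Generate new (features, label) records for each first-class sample set."""
    rng = random.Random(seed)
    first, second = split_by_threshold(samples, threshold)
    if not first or not second:
        return []  # nothing to balance, or no ratio can be computed
    # Ratio between the number of first-class and second-class sample sets.
    ratio = len(first) / len(second)
    # Hypothetical adjustment rule: a smaller ratio yields more new records.
    n_new = max(1, round(base_amount / ratio))
    new_data = []
    for label, points in first.items():
        if len(points) < 2:
            continue  # the claim requires K > 1 sample points
        chosen = rng.sample(points, min(k, len(points)))  # the K first sample points
        feature = largest_common_combination(chosen)      # new data characteristic
        # The new data label is taken directly from the first-class label.
        new_data.extend((dict(feature), label) for _ in range(n_new))
    return new_data
```

In this sketch a sample set whose size equals the threshold exactly falls into neither class, which mirrors the strict "less than" and "greater than" wording of the claim. A real system would likely perturb or interpolate the generated records rather than emit identical copies; that choice is outside what the claim text specifies.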