US 11,941,087 B2
Unbalanced sample data preprocessing method and device, and computer device
Xiuming Yu, Guangdong (CN); Wei Wang, Guangdong (CN); and Jing Xiao, Guangdong (CN)
Assigned to Ping An Technology (Shenzhen) Co., Ltd., Guangdong (CN)
Filed by Ping An Technology (Shenzhen) Co., Ltd., Guangdong (CN)
Filed on Feb. 2, 2021, as Appl. No. 17/165,640.
Application 17/165,640 is a continuation of application No. PCT/CN2018/123208, filed on Dec. 24, 2018.
Claims priority of application No. 201811018913.0 (CN), filed on Sep. 3, 2018.
Prior Publication US 2021/0158078 A1, May 27, 2021
Int. Cl. G06F 16/00 (2019.01); G06F 7/24 (2006.01); G06F 18/213 (2023.01); G06F 18/214 (2023.01); G06F 18/22 (2023.01); G06F 18/2413 (2023.01); G06F 18/2431 (2023.01)
CPC G06F 18/24137 (2023.01) [G06F 7/24 (2013.01); G06F 18/213 (2023.01); G06F 18/214 (2023.01); G06F 18/22 (2023.01); G06F 18/2431 (2023.01)]
15 Claims
 
1. An unbalanced sample data preprocessing method, comprising:
receiving a data acquisition request, and acquiring initial data according to the data acquisition request;
classifying the initial data according to a preset classification rule to obtain first-class sample sets and second-class sample sets, wherein a number of samples in each of the first-class sample sets is less than a data amount threshold, and wherein a number of samples in each of the second-class sample sets is greater than the data amount threshold;
extracting K first sample points in the first-class sample sets, wherein K is an integer greater than 1;
analyzing characteristics of the K first sample points to obtain a new data characteristic of the first-class sample sets;
obtaining a first-class label corresponding to the first-class sample sets, and generating a new data label of the first-class sample sets according to the first-class label;
respectively obtaining a number of first-class sample sets and a number of second-class sample sets, and calculating a ratio between the number of first-class sample sets and the number of second-class sample sets; and
generating new data of the first-class sample sets according to the new data characteristic and the new data label, and adjusting an amount of new data according to the ratio to increase the number of first-class sample sets,
wherein analyzing the characteristics of the K first sample points to obtain the new data characteristic comprises:
extracting the characteristics of the K first sample points;
analyzing the characteristics to obtain a characteristic attribute;
respectively extracting the characteristics of the K first sample points according to the characteristic attribute, and respectively obtaining common characteristics of the K first sample points;
forming corresponding common characteristic combinations according to the common characteristics, and calculating a number of common characteristics comprised in the common characteristic combinations;
sorting the common characteristic combinations according to the number of common characteristics to obtain a common characteristic combination corresponding to a maximum number; and
generating the new data characteristic according to the common characteristic combination corresponding to the maximum number.
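
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and it rests on several assumptions that the claim does not state: each sample point is modeled as a dictionary of attribute/value pairs, a "common characteristic combination" is taken to be the set of attribute/value pairs shared by a pair of sample points, and the amount of new data is chosen so that the minority class reaches a target ratio of the majority class. The function names `largest_common_combination`, `new_sample_count`, and `generate_new_data`, and the parameter `target_ratio`, are hypothetical.

```python
import math
from itertools import combinations


def largest_common_combination(points):
    """Return the largest set of attribute/value pairs shared by two points.

    Assumption: a "combination" is formed from each pair of the K points.
    """
    if len(points) < 2:
        raise ValueError("K must be greater than 1")
    combos = []
    for a, b in combinations(points, 2):
        # Common characteristics: attributes with identical values in both points.
        combos.append({k: v for k, v in a.items() if k in b and b[k] == v})
    # Sort combinations by the number of common characteristics, largest first.
    combos.sort(key=len, reverse=True)
    return combos[0]


def new_sample_count(n_minority, n_majority, target_ratio=1.0):
    """Number of new minority samples needed to reach target_ratio.

    Assumption: "adjusting according to the ratio" means raising
    n_minority / n_majority up to target_ratio.
    """
    if n_minority / n_majority >= target_ratio:
        return 0
    return math.ceil(target_ratio * n_majority) - n_minority


def generate_new_data(points, label, n_minority, n_majority, target_ratio=1.0):
    """Build (characteristic, label) records for the minority class."""
    characteristic = largest_common_combination(points)
    count = new_sample_count(n_minority, n_majority, target_ratio)
    # Each new record reuses the minority label and the shared characteristic.
    return [(dict(characteristic), label) for _ in range(count)]


# Illustrative data only; attribute names and values are made up.
sample_points = [
    {"age_band": "30-39", "region": "south", "channel": "app"},
    {"age_band": "30-39", "region": "south", "channel": "web"},
    {"age_band": "40-49", "region": "south", "channel": "web"},
]
new_records = generate_new_data(sample_points, "minority", 20, 100)
```

With 20 minority samples, 100 majority samples, and the default target ratio of 1.0, this sketch produces 80 new records, each carrying the minority label and the largest shared characteristic combination. A practical system would likely vary the generated records rather than duplicate one characteristic set; the duplication here only keeps the example short.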