US 11,960,521 B2
Text classification system based on feature selection and method thereof
Yin Lu, Nanjing (CN); Qingyuan Li, Nanjing (CN); Abdusamjan Abdukirim, Nanjing (CN); Jie Hu, Nanjing (CN); Luocheng Wu, Nanjing (CN); and Yongan Guo, Nanjing (CN)
Filed by Nanjing University of Posts and Telecommunications, Nanjing (CN)
Filed on Jan. 16, 2023, as Appl. No. 18/097,329.
Claims priority of application No. 202210479218.4 (CN), filed on May 5, 2022.
Prior Publication US 2023/0214415 A1, Jul. 6, 2023
Int. Cl. G06F 16/35 (2019.01); G06F 16/31 (2019.01)
CPC G06F 16/35 (2019.01) [G06F 16/313 (2019.01)] 6 Claims
OG exemplary drawing
 
1. A text classification method based on feature selection, comprising:
acquiring a text classification data set;
dividing the text classification data set into a training text set and a test text set, and then pre-processing the training text set and the test text set;
extracting feature entries from the pre-processed training text set through an improved chi-square (IMP_CHI) statistical formula to form feature subsets;
using a Term Frequency-Inverse Word Frequency (TF-IWF) algorithm to assign weights to the extracted feature entries;
based on the weighted feature entries, establishing a short text classification model based on a support vector machine; and
classifying the pre-processed test text set by the short text classification model;
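The TF-IWF weighting step named above can be sketched as follows. This is a minimal illustration of one common TF-IWF formulation, in which a term's weight is its in-document frequency times the log of the total corpus word count over the term's corpus frequency; the patent text does not reproduce its exact formula, so this sketch is an assumption.

```python
import math
from collections import Counter

def tf_iwf_weights(docs):
    """Weight each term by TF-IWF (one common formulation):
    w(t, d) = tf(t, d) * log(total corpus word count / corpus frequency of t).
    `docs` is a list of token lists; returns one {term: weight} dict per doc."""
    corpus_counts = Counter(tok for doc in docs for tok in doc)
    total = sum(corpus_counts.values())
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        n = len(doc)
        weighted.append({t: (c / n) * math.log(total / corpus_counts[t])
                         for t, c in tf.items()})
    return weighted
```

Unlike TF-IDF, the inverse factor here penalizes terms by their raw corpus frequency rather than their document frequency, which tends to down-weight very common words more aggressively in short-text corpora.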
wherein the pre-processing comprises first performing standard processing, including removing stop words, on a text, and then selecting the Jieba word segmentation tool to segment the processed short text content to obtain the segmented training text set and test text set, and storing the training text set and the test text set in a text database;
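The pre-processing clause above can be sketched as a small pipeline. Since Jieba is a third-party package, the tokenizer is passed in as a parameter; for Chinese text one would supply `jieba.lcut`, while the default whitespace split stands in for illustration.

```python
def preprocess(texts, stopwords, tokenize=str.split):
    """Standard processing per the claim: segment each short text, then
    drop stop words and empty tokens. For Chinese input, pass
    tokenize=jieba.lcut (requires the jieba package)."""
    segmented = []
    for text in texts:
        tokens = [t for t in tokenize(text) if t.strip() and t not in stopwords]
        segmented.append(tokens)
    return segmented
```

The same function is applied to both the training and the test texts before the segmented sets are stored in the text database.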
wherein extracting the feature entries from the pre-processed training text set through the improved chi-square statistical formula to form feature subsets comprises:
extracting each feature word t and its related category information from the text database;
calculating a word frequency adjustment parameter α(t,ci), an intra-category position parameter β and a negative correlation correction factor γ of the feature word t with respect to each category;
using an improved formula to calculate an IMP_CHI value of an entry with respect to each category;
according to the improved chi-square statistical formula, obtaining an IMP_CHI value of the feature word t with respect to the whole training text set; and
after calculating the IMP_CHI values over the whole training text set, selecting the first M words, in descending order of IMP_CHI value, as the features representing a document to form a final feature subset.
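The feature-selection steps above can be sketched as follows. The claim does not reproduce the formulas for α(t,c), β, or γ, so they are passed in as callables here, and combining them multiplicatively with the classical chi-square statistic is an assumption; taking the maximum over categories as the global score is likewise one common choice, not necessarily the patent's.

```python
from collections import Counter, defaultdict

def chi2(A, B, C, D):
    """Classical chi-square statistic for term t and category c:
    A: docs in c containing t,  B: docs outside c containing t,
    C: docs in c without t,     D: docs outside c without t."""
    N = A + B + C + D
    denom = (A + B) * (C + D) * (A + C) * (B + D)
    return N * (A * D - B * C) ** 2 / denom if denom else 0.0

def select_features(docs, labels, M, alpha, beta, gamma):
    """Score every term with an IMP_CHI-style value and keep the top M.
    alpha(t, c), beta(t, c), gamma(t, c) stand in for the claim's
    correction factors (formulas not given in this excerpt)."""
    cats = set(labels)
    df = defaultdict(lambda: defaultdict(int))  # per-category doc frequency
    for toks, y in zip(docs, labels):
        for t in set(toks):
            df[t][y] += 1
    n_cat = Counter(labels)
    N = len(docs)
    scores = {}
    for t, per_cat in df.items():
        vals = []
        for c in cats:
            A = per_cat.get(c, 0)
            B = sum(per_cat.values()) - A
            C = n_cat[c] - A
            D = N - A - B - C
            vals.append(chi2(A, B, C, D) * alpha(t, c) * beta(t, c) * gamma(t, c))
        scores[t] = max(vals)  # global score over categories (an assumption)
    return sorted(scores, key=scores.get, reverse=True)[:M]
```

The selected top-M terms form the feature subset that the TF-IWF step then weights before training the support-vector-machine classifier.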