| CPC G06N 20/00 (2019.01) [G06F 16/2379 (2019.01); G06F 40/284 (2020.01)] | 17 Claims |

|
1. A computer-implemented method comprising:
receiving, by one or more processors, an imbalanced dataset;
identifying, by the one or more processors, a small class that includes initial text records included in the imbalanced dataset;
generating, by the one or more processors, a balanced dataset from the imbalanced dataset by augmenting the initial text records by using weighted word scores indicating respective measures of importance of words in classes in the imbalanced dataset;
sending, by the one or more processors, the balanced dataset to a supervised machine learning model;
training, by the one or more processors, the supervised machine learning model on the balanced dataset; and
using the supervised machine learning model employing the augmented initial text records, performing, by the one or more processors, a text classification of a new dataset whose domain matches a domain of the imbalanced dataset,
wherein the augmenting the initial text records includes:
receiving an initial text record included in the initial text records;
for a given word in the initial text record, determining a stop word score indicating whether the given word is present or not present in a list of stop words;
for the given word in the initial text record, determining a dependency score indicating a syntactic dependency relationship between the given word and other words in the initial text record;
for the given word in the initial text record, determining a part of speech (POS) score indicating a lexical category of the given word;
for the given word in the initial text record, determining a word priority score as w1*a class word score+w2*the stop word score+w3*the dependency score+w4*the POS score, wherein w1, w2, w3, and w4 are weights;
determining the word priority score for the given word in the initial text record is greater than a threshold word priority score; and
based on the word priority score for the given word being greater than the threshold word priority score, selecting the given word as a word in the initial text record that needs to be replaced.
|
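The word-selection steps at the end of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `WordScores`, `word_priority`, and `select_words_to_replace`, the numeric weights, the threshold, and the score values are all hypothetical. The claim does not say how the stop word, dependency, or POS scores are encoded numerically, so the sketch assumes each is already a float, and assumes a stop word score of 1.0 for a word absent from the stop word list and 0.0 for a word present in it.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class WordScores:
    """Per-word component scores (hypothetical container; values assumed precomputed)."""
    word: str
    class_word: float   # importance of the word within its class
    stop_word: float    # assumed encoding: 1.0 if NOT a stop word, 0.0 if a stop word
    dependency: float   # score derived from the syntactic dependency relationship
    pos: float          # score derived from the part-of-speech category


def word_priority(s: WordScores, weights: Sequence[float]) -> float:
    """Weighted sum w1*class + w2*stop + w3*dependency + w4*POS, as recited in the claim."""
    w1, w2, w3, w4 = weights
    return w1 * s.class_word + w2 * s.stop_word + w3 * s.dependency + w4 * s.pos


def select_words_to_replace(
    scores: List[WordScores], weights: Sequence[float], threshold: float
) -> List[str]:
    """Return the words whose priority score is strictly greater than the threshold."""
    return [s.word for s in scores if word_priority(s, weights) > threshold]


if __name__ == "__main__":
    # Illustrative values only; none of these numbers come from the patent.
    record = [
        WordScores("the", class_word=0.1, stop_word=0.0, dependency=0.2, pos=0.1),
        WordScores("refund", class_word=0.9, stop_word=1.0, dependency=0.8, pos=0.7),
    ]
    print(select_words_to_replace(record, weights=(0.4, 0.2, 0.2, 0.2), threshold=0.5))
```

The comparison is strict (`>`), matching the claim's "greater than a threshold word priority score". How a selected word is then replaced to produce an augmented record is not recited in this part of the claim and is left out of the sketch.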