US 12,229,644 B2
Text augmentation of a minority class in a text classification problem
Kumar Abhishek, Khariar (IN)
Assigned to Kyndryl, Inc., New York, NY (US)
Filed by KYNDRYL, INC., New York, NY (US)
Filed on Apr. 29, 2021, as Appl. No. 17/302,268.
Prior Publication US 2022/0366293 A1, Nov. 17, 2022
Int. Cl. G06N 20/00 (2019.01); G06F 16/23 (2019.01); G06F 40/284 (2020.01)
CPC G06N 20/00 (2019.01) [G06F 16/2379 (2019.01); G06F 40/284 (2020.01)]
17 Claims
 
1. A computer-implemented method comprising:
receiving, by one or more processors, an imbalanced dataset;
identifying, by the one or more processors, a small class that includes initial text records included in the imbalanced dataset;
generating, by the one or more processors, a balanced dataset from the imbalanced dataset by augmenting the initial text records by using weighted word scores indicating respective measures of importance of words in classes in the imbalanced dataset;
sending, by the one or more processors, the balanced dataset to a supervised machine learning model;
training, by the one or more processors, the supervised machine learning model on the balanced dataset; and
using the supervised machine learning model employing the augmented initial text records, performing, by the one or more processors, a text classification of a new dataset whose domain matches a domain of the imbalanced dataset,
wherein the augmenting the initial text records includes:
receiving an initial text record included in the initial text records;
for a given word in the initial text record, determining a stop word score indicating whether the given word is present or not present in a list of stop words;
for the given word in the initial text record, determining a dependency score indicating a syntactic dependency relationship between the given word and other words in the initial text record;
for the given word in the initial text record, determining a part of speech (POS) score indicating a lexical category of the given word;
for the given word in the initial text record, determining a word priority score as w1 * a class word score + w2 * the stop word score + w3 * the dependency score + w4 * the POS score, wherein w1, w2, w3, and w4 are weights;
determining the word priority score for the given word in the initial text record is greater than a threshold word priority score; and
based on the word priority score for the given word being greater than the threshold word priority score, selecting the given word as a word in the initial text record that needs to be replaced.
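
The word selection steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The Token type, the stop word list, the POS and dependency score tables, the 0.5 fallback value, the weights, and the threshold are all hypothetical values chosen for illustration; the claim does not specify them. The sketch also assumes that a word absent from the stop word list receives the higher stop word score, and that POS and dependency labels come from some external tagger and parser.

```python
from dataclasses import dataclass

# Hypothetical stop word list and score tables; the claim does not define them.
STOP_WORDS = {"the", "a", "an", "is", "of", "to", "and"}
POS_SCORES = {"NOUN": 1.0, "VERB": 0.8, "ADJ": 0.6, "DET": 0.0}
DEP_SCORES = {"nsubj": 1.0, "dobj": 0.9, "ROOT": 0.7, "det": 0.0}


@dataclass
class Token:
    text: str
    pos: str  # lexical category, assumed to come from a POS tagger
    dep: str  # dependency label, assumed to come from a dependency parser


def word_priority(token, class_word_scores, weights):
    """Weighted sum of the four per-word scores named in the claim."""
    w1, w2, w3, w4 = weights
    word = token.text.lower()
    class_score = class_word_scores.get(word, 0.0)
    # Assumption: words NOT in the stop word list score 1.0, stop words 0.0.
    stop_score = 0.0 if word in STOP_WORDS else 1.0
    # Assumption: labels missing from the tables get a neutral 0.5.
    dep_score = DEP_SCORES.get(token.dep, 0.5)
    pos_score = POS_SCORES.get(token.pos, 0.5)
    return w1 * class_score + w2 * stop_score + w3 * dep_score + w4 * pos_score


def words_to_replace(tokens, class_word_scores, weights, threshold):
    """Return the words whose priority score exceeds the threshold."""
    return [
        t.text
        for t in tokens
        if word_priority(t, class_word_scores, weights) > threshold
    ]


# Illustrative record from a small class, with made-up class word scores.
record = [
    Token("The", "DET", "det"),
    Token("server", "NOUN", "nsubj"),
    Token("crashed", "VERB", "ROOT"),
]
class_scores = {"server": 0.9, "crashed": 0.4}
selected = words_to_replace(record, class_scores, (0.4, 0.2, 0.2, 0.2), 0.7)
```

With these illustrative values only "server" exceeds the threshold, so it is the word marked for replacement. How the replacement word is chosen is outside the steps recited in claim 1 and is not sketched here.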