US 11,748,393 B2
Creating compact example sets for intent classification
Abhishek Shah, Jersey City, NJ (US); and Tin Kam Ho, Millburn, NJ (US)
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION, Armonk, NY (US)
Filed by INTERNATIONAL BUSINESS MACHINES CORPORATION, Armonk, NY (US)
Filed on Nov. 28, 2018, as Appl. No. 16/203,000.
Prior Publication US 2020/0167604 A1, May 28, 2020
Int. Cl. G06K 9/62 (2022.01); G06F 16/35 (2019.01); G06F 18/214 (2023.01); G06F 18/21 (2023.01); G06F 18/23213 (2023.01); G06V 30/262 (2022.01); G06N 20/00 (2019.01)
CPC G06F 16/35 (2019.01) [G06F 18/217 (2023.01); G06F 18/2148 (2023.01); G06F 18/23213 (2023.01); G06V 30/274 (2022.01); G06N 20/00 (2019.01)]
12 Claims
1. A method for creating compact example subsets for intent classification, by a processor, comprising:
receiving a set of content used for training an intent classifier;
separating entries within the set of content into a first subset and a second subset;
performing a cross-validation operation on the first and second subsets to identify a correctly labeled portion and an incorrectly labeled portion of the set of content, wherein the cross-validation operation further comprises:
performing an initial training of the intent classifier utilizing the first subset to form a first subset trained classifier;
utilizing the first subset trained classifier against the second subset to identify a correctly labeled subset and an incorrectly labeled subset of the second subset;
performing a secondary training of the intent classifier, subsequent to the initial training, to form a second subset trained classifier; and
utilizing the second subset trained classifier against the first subset to identify a correctly labeled subset and an incorrectly labeled subset of the first subset; and
forming a reduced content used for performing a final training of the intent classifier by combining a first number of the entries from the correctly labeled portion and a second number of the entries from the incorrectly labeled portion of the set of content, wherein an anti-clustering procedure, performed separately and independently on each of the first subset and the second subset, is utilized to select members of the first number of the entries and members of the second number of the entries by:
computing a vector representation for each of the entries from the correctly labeled portion and the entries from the incorrectly labeled portion of the set of content;
clustering each of a set of vectors representing the correctly labeled portion and the incorrectly labeled portion into k clusters, wherein k equals a desired size of the reduced content;
selecting a longest entry in each of the k clusters as a cluster representative to yield maximally-spread samples among all of the k clusters, wherein each other in-cluster members of each of the k clusters are ignored; and
using the selected longest entry in each of the k clusters aggregately as the first number of the entries and the second number of the entries comprising the reduced content.
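
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. Everything the claim leaves open is an assumption made here for illustration: the bag-of-words vectors, the nearest-centroid stand-in for the intent classifier, the plain k-means clustering, measuring entry length in characters, and applying the per-portion counts to each subset separately. All function and parameter names (vectorize, cross_label, select_representatives, reduced_set, n_correct, n_wrong) are hypothetical.

```python
import random
from collections import Counter

def build_vocab(examples):
    """Sorted vocabulary over all (text, label) examples."""
    return sorted({w for text, _ in examples for w in text.lower().split()})

def vectorize(text, vocab):
    """Bag-of-words count vector (assumed representation, not from the claim)."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocab]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def train(examples, vocab):
    """Stand-in intent classifier: one centroid per intent label."""
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append(vectorize(text, vocab))
    return {label: mean(vecs) for label, vecs in by_label.items()}

def predict(model, text, vocab):
    v = vectorize(text, vocab)
    return min(model, key=lambda label: sq_dist(v, model[label]))

def cross_label(first, second, vocab):
    """Train on one subset, score the other, then swap roles.
    Returns {subset name: (correctly labeled, incorrectly labeled)}."""
    out = {}
    for name, train_set, held_out in (("second", first, second),
                                      ("first", second, first)):
        model = train(train_set, vocab)
        ok = [ex for ex in held_out if predict(model, ex[0], vocab) == ex[1]]
        bad = [ex for ex in held_out if predict(model, ex[0], vocab) != ex[1]]
        out[name] = (ok, bad)
    return out

def kmeans(vectors, k, iters=20, seed=0):
    """Plain k-means; returns a cluster index for each vector."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    assign = [0] * len(vectors)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: sq_dist(v, centers[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = mean(members)
    return assign

def select_representatives(examples, k, vocab):
    """Cluster into k groups and keep only the longest entry of each group."""
    k = min(k, len(examples))
    if k == 0:
        return []
    assign = kmeans([vectorize(t, vocab) for t, _ in examples], k)
    reps = []
    for c in range(k):
        members = [ex for ex, a in zip(examples, assign) if a == c]
        if members:  # empty clusters contribute no representative
            reps.append(max(members, key=lambda ex: len(ex[0])))
    return reps

def reduced_set(labeled, n_correct, n_wrong, vocab):
    """Combine representatives of the correct and incorrect portions,
    handling each subset separately (assumed reading of the claim)."""
    out = []
    for ok, bad in labeled.values():
        out += select_representatives(ok, n_correct, vocab)
        out += select_representatives(bad, n_wrong, vocab)
    return out
```

A possible use, under the same assumptions: build the vocabulary with build_vocab(first + second), pass both subsets to cross_label, and hand the result to reduced_set with the desired counts. Keeping one entry per cluster is what spreads the selected samples across the vector space; a different vectorizer, classifier, or clustering routine could be swapped in without changing that structure.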