US 12,259,919 B2
Rare topic detection using hierarchical clustering
Raghu Ganti, White Plains, NY (US); Mudhakar Srivatsa, White Plains, NY (US); Shreeranjani Srirangamsridharan, White Plains, NY (US); Yeon-sup Lim, White Plains, NY (US); and Dakshi Agrawal, Monsey, NY (US)
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION, Armonk, NY (US)
Filed by INTERNATIONAL BUSINESS MACHINES CORPORATION, Armonk, NY (US)
Filed on Oct. 8, 2019, as Appl. No. 16/596,399.
Prior Publication US 2021/0103608 A1, Apr. 8, 2021
Int. Cl. G06F 16/353 (2025.01); G06F 16/334 (2025.01); G06N 5/04 (2023.01); G06N 20/00 (2019.01)

CPC G06F 16/353 (2019.01) [G06F 16/3347 (2019.01); G06N 5/04 (2013.01); G06N 20/00 (2019.01)]

20 Claims

1. A method for providing rare topic detection using hierarchical topic modeling by a processor, comprising:

executing machine learning logic to learn a hierarchical topic model from one or more data sources;

in conjunction with learning the hierarchical topic model, executing the machine learning logic to train the hierarchical topic model by iteratively removing one or more dominant words in a selected cluster using the hierarchical topic model during a progressive drilldown operation through a plurality of hierarchical topic modelling executions, wherein, at each iteration of the plurality of hierarchical topic modelling executions, the progressive drilldown operation removes those of the one or more dominant words identified during a previous iteration which are no longer discriminatory for a next execution of the plurality of hierarchical topic modelling executions, and wherein the dominant words comprise words appearing in at least a defined percentage of conversations, relative to other non-dominant words appearing in less than the defined percentage of the conversations, as identified in the one or more data sources, and the dominant words that relate to one or more primary topics of the cluster; and

executing the machine learning logic to further train the learned hierarchical topic model by seeding the learned hierarchical topic model with one or more words, n-grams, phrases, and text snippets to evolve the hierarchical topic model, wherein the removed dominant words are reinstated upon completion of the seeding, and wherein each of the dominant words removed from each iteration and reinstated upon completion of the seeding are together used to form a natural language explanation, provided to a user by way of a user interface, of each of the one or more primary topics resulting from the hierarchical topic model within a corpus of the one or more data sources.
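
The dominant-word removal and reinstatement recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names, the 0.75 threshold, the depth limit, and the toy `select_cluster` step (which stands in for a hierarchical topic modelling execution) are all assumptions made for illustration, and the seeding step is omitted. A word is treated as dominant when it appears in at least the defined fraction of conversations in the current cluster. The words removed at each level are kept so they can be reinstated and joined into a simple explanation string.

```python
from collections import Counter


def dominant_words(conversations, threshold):
    """Return words present in at least `threshold` (a fraction) of conversations."""
    total = len(conversations)
    if total == 0:
        return set()
    doc_freq = Counter()
    for tokens in conversations:
        doc_freq.update(set(tokens))  # count each word once per conversation
    return {word for word, count in doc_freq.items() if count / total >= threshold}


def select_cluster(conversations):
    """Toy stand-in for one topic modelling execution (an assumption, not the
    claimed model): keep the conversations sharing the most widespread word."""
    doc_freq = Counter()
    for tokens in conversations:
        doc_freq.update(set(tokens))
    # Highest document frequency wins; ties are broken alphabetically.
    top_word = min(doc_freq, key=lambda w: (-doc_freq[w], w))
    return [tokens for tokens in conversations if top_word in tokens]


def progressive_drilldown(conversations, threshold=0.75, max_depth=2):
    """Strip dominant words level by level, recording what was removed."""
    current = [list(tokens) for tokens in conversations]
    removed_per_level = []
    for _ in range(max_depth):
        dominant = dominant_words(current, threshold)
        if not dominant:
            break
        removed_per_level.append(sorted(dominant))
        stripped = [[w for w in tokens if w not in dominant] for tokens in current]
        stripped = [tokens for tokens in stripped if tokens]  # drop emptied conversations
        if not stripped:
            current = []
            break
        current = select_cluster(stripped)
    return current, removed_per_level


def explain(removed_per_level):
    """Reinstate the removed words as a general-to-specific topic path."""
    return "Topic path: " + " > ".join(", ".join(level) for level in removed_per_level)


# Hypothetical conversations, already tokenized.
conversations = [
    ["billing", "invoice", "refund", "late"],
    ["billing", "invoice", "refund", "duplicate"],
    ["billing", "invoice", "refund", "late"],
    ["billing", "invoice", "password", "reset"],
    ["billing", "invoice", "password", "locked"],
]
remaining, removed = progressive_drilldown(conversations)
print(explain(removed))
```

In this sketch the first pass removes the words shared by nearly every conversation, so the next pass can separate the refund conversations from the password conversations. Without that removal, the shared words would dominate every cluster and mask the rarer subtopics.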