US 12,254,276 B2
Descriptive topic modeling with LDA on bags of utterance clusters
Javier Miguel Sastre Martinez, County Dublin (IE); Sean Gorman, Goatstown (IE); Aisling Nugent, Dublin (IE); and Anandita Pal, Dublin (IE)
Assigned to Accenture Global Solutions Limited, Dublin (IE)
Filed by ACCENTURE GLOBAL SOLUTIONS LIMITED, Dublin (IE)
Filed on Feb. 28, 2022, as Appl. No. 17/682,368.
Prior Publication US 2023/0274092 A1, Aug. 31, 2023
Int. Cl. G06F 40/35 (2020.01); G06F 18/23213 (2023.01); G06F 40/58 (2020.01)
CPC G06F 40/35 (2020.01) [G06F 18/23213 (2023.01); G06F 40/58 (2020.01)]
18 Claims
 
1. A system for intent discovery, the system comprising:
  a memory storing instructions; and
  a processor in communication with the memory, wherein, when the processor executes the instructions, the instructions are configured to cause the processor to:
    obtain documents comprising a set of utterances;
    extract the set of utterances from the documents;
    generate a set of utterance embeddings based on the set of utterances;
    clusterize the set of utterance embeddings to obtain a plurality of clusters, each cluster comprising one or more utterance embeddings;
    obtain a cluster label for each cluster, wherein to obtain the cluster label for each cluster, the instructions are further configured to cause the processor to:
      calculate a center in an embedding space for the one or more utterance embeddings in a corresponding cluster of the plurality of clusters;
      select a central utterance embedding in the corresponding cluster, the central utterance embedding being a closest utterance embedding to the center among the one or more utterance embeddings in the corresponding cluster; and
      obtain an utterance corresponding to the central utterance embedding as a corresponding cluster label of the corresponding cluster;
    encode each document of the documents based on a number of times each utterance cluster identifier (ID) appears to obtain an encoded document, the encoded document comprising at least one cluster ID with a weight;
    perform latent Dirichlet allocation (LDA) on the encoded documents to obtain K topics, wherein K is a positive integer, and each topic of the K topics corresponds to a list of key clusters with cluster IDs; and
    for each topic, replace the cluster IDs with corresponding cluster labels.
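
The cluster-labeling, document-encoding, and label-replacement steps recited above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the claimed implementation: the function names, the toy two-dimensional embeddings, the cluster assignments, and the example topic are all hypothetical. The center is taken to be the arithmetic mean and closeness is measured by Euclidean distance, neither of which the claim specifies. Embedding generation, clustering, and LDA itself are not implemented here; their outputs are supplied by hand.

```python
from collections import Counter

def cluster_labels(utterances, embeddings, assignments):
    """Label each cluster with the utterance whose embedding is closest to the
    cluster center (assumption: center = mean, closeness = Euclidean distance)."""
    members = {}
    for idx, cid in enumerate(assignments):
        members.setdefault(cid, []).append(idx)
    labels = {}
    for cid, idxs in members.items():
        dim = len(embeddings[idxs[0]])
        center = [sum(embeddings[i][d] for i in idxs) / len(idxs) for d in range(dim)]
        # Squared distance gives the same ordering as Euclidean distance.
        central = min(
            idxs,
            key=lambda i: sum((a - b) ** 2 for a, b in zip(embeddings[i], center)),
        )
        labels[cid] = utterances[central]
    return labels

def encode_document(utterance_indices, assignments):
    """Encode a document as cluster ID -> count of its utterances in that cluster."""
    return dict(Counter(assignments[i] for i in utterance_indices))

def describe_topics(topics, labels):
    """Replace the cluster IDs in each topic with the corresponding cluster labels."""
    return [[labels[cid] for cid in topic] for topic in topics]

# Hypothetical toy data: six utterances with hand-made 2-D embeddings.
utterances = [
    "reset my password", "forgot password", "i cannot log in",
    "track my order", "where is my package", "order status",
]
embeddings = [
    (0.0, 0.0), (0.2, 0.0), (1.0, 0.0),
    (5.0, 5.0), (5.0, 5.4), (5.0, 6.0),
]
assignments = [0, 0, 0, 1, 1, 1]   # assumed output of a clustering step
documents = [[0, 1, 3], [4, 5, 2]]  # each document lists its utterance indices

labels = cluster_labels(utterances, embeddings, assignments)
encoded = [encode_document(doc, assignments) for doc in documents]
topics = [[1, 0]]  # assumed LDA output: one topic listing its key cluster IDs
described = describe_topics(topics, labels)

print(labels)
print(encoded)
print(described)
```

In this sketch the count of each cluster ID serves as the weight in the encoded document. The encoded documents are the form in which a bag-of-words LDA implementation would typically receive them, with cluster IDs playing the role of vocabulary terms.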