US 12,254,276 B2
Descriptive topic modeling with LDA on bags of utterance clusters
Javier Miguel Sastre Martinez, County Dublin (IE); Sean Gorman, Goatstown (IE); Aisling Nugent, Dublin (IE); and Anandita Pal, Dublin (IE)
Assigned to Accenture Global Solutions Limited, Dublin (IE)
Filed by ACCENTURE GLOBAL SOLUTIONS LIMITED, Dublin (IE)
Filed on Feb. 28, 2022, as Appl. No. 17/682,368.
Prior Publication US 2023/0274092 A1, Aug. 31, 2023
Int. Cl. G06F 40/35 (2020.01); G06F 18/23213 (2023.01); G06F 40/58 (2020.01)

CPC G06F 40/35 (2020.01) [G06F 18/23213 (2023.01); G06F 40/58 (2020.01)]

18 Claims

1. A system for intent discovery, the system comprising:

a memory storing instructions; and

a processor in communication with the memory, wherein, when the processor executes the instructions, the instructions are configured to cause the processor to:

obtain documents comprising a set of utterances;

extract the set of utterances from the documents;

generate a set of utterance embeddings based on the set of utterances;

clusterize the set of utterance embeddings to obtain a plurality of clusters, each cluster comprising one or more utterance embeddings;

obtain a cluster label for each cluster, wherein to obtain the cluster label for each cluster, the instructions are further configured to cause the processor to:

calculate a center in an embedding space for the one or more utterance embeddings in a corresponding cluster of the plurality of clusters;

select a central utterance embedding in the corresponding cluster, the central utterance embedding being a closest utterance embedding to the center among the one or more utterance embeddings in the corresponding cluster; and

obtain an utterance corresponding to the central utterance embedding as a corresponding cluster label of the corresponding cluster;

encode each document of the documents based on a number of times each utterance cluster identifier (ID) appears to obtain an encoded document, the encoded document comprising at least one cluster ID with a weight;

perform latent Dirichlet allocation (LDA) on the encoded documents to obtain K topics, wherein K is a positive integer, and each topic of the K topics corresponds to a list of key clusters with cluster IDs; and

for each topic, replace the cluster IDs with corresponding cluster labels.
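
The labeling and encoding steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that utterance embeddings and cluster assignments are already available, uses Euclidean distance (the claim does not name a metric), and takes the per-document count as the weight. The function names, the toy utterances and embeddings, and the `hypothetical_topics` list are all illustrative assumptions. The LDA step itself is not implemented here; its output is stood in for by a hand-written list of cluster IDs per topic.

```python
from collections import Counter
from math import dist


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cluster_labels(utterances, embeddings, assignments):
    """Map each cluster ID to the utterance whose embedding is closest
    to the cluster center (Euclidean distance is an assumption)."""
    members = {}
    for index, cluster_id in enumerate(assignments):
        members.setdefault(cluster_id, []).append(index)
    labels = {}
    for cluster_id, indices in members.items():
        center = centroid([embeddings[i] for i in indices])
        central = min(indices, key=lambda i: dist(embeddings[i], center))
        labels[cluster_id] = utterances[central]
    return labels


def encode_document(utterance_indices, assignments):
    """Encode a document as a bag of cluster IDs: cluster ID -> count."""
    return Counter(assignments[i] for i in utterance_indices)


def label_topics(topics, labels):
    """Replace the cluster IDs in each topic with their cluster labels."""
    return [[labels[cluster_id] for cluster_id in topic] for topic in topics]


# Toy data (illustrative only): six utterances in a 2-D embedding space.
utterances = [
    "reset my password",
    "I forgot my password",
    "password reset please",
    "where is my order",
    "track my order",
    "order status",
]
embeddings = [
    [0.0, 0.0], [1.0, 0.0], [0.4, 0.0],
    [5.0, 5.0], [5.0, 6.0], [5.0, 5.5],
]
assignments = [0, 0, 0, 1, 1, 1]  # cluster ID of each utterance

# Each document is given as the indices of the utterances it contains.
documents = [[0, 1, 3], [2, 4, 5]]

labels = cluster_labels(utterances, embeddings, assignments)
encoded = [encode_document(doc, assignments) for doc in documents]

# Stand-in for LDA output: each topic as a list of key cluster IDs.
hypothetical_topics = [[0], [1, 0]]
labeled_topics = label_topics(hypothetical_topics, labels)

print(labels)
print(encoded)
print(labeled_topics)
```

In a fuller pipeline, the `encoded` counts would be passed to an LDA implementation of the reader's choice, and the resulting key cluster IDs per topic would take the place of `hypothetical_topics`.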