US 12,019,989 B2
Open domain dialog reply method and system based on thematic enhancement
Taihao Li, Hangzhou (CN); and Jiantao Huang, Hangzhou (CN)
Assigned to ZHEJIANG LAB, Hangzhou (CN)
Filed by ZHEJIANG LAB, Hangzhou (CN)
Filed on Apr. 8, 2023, as Appl. No. 18/297,610.
Application 18/297,610 is a continuation of application No. PCT/CN2022/139320, filed on Dec. 15, 2022.
Claims priority of application No. 202210981384.4 (CN), filed on Aug. 16, 2022.
Prior Publication US 2024/0062006 A1, Feb. 22, 2024
Int. Cl. G06F 40/20 (2020.01); G06F 40/268 (2020.01); G06F 40/284 (2020.01); G06F 40/30 (2020.01); G06N 3/08 (2023.01); G06N 20/00 (2019.01)

CPC G06F 40/284 (2020.01) [G06F 40/268 (2020.01); G06F 40/30 (2020.01); G06N 3/08 (2013.01); G06N 20/00 (2019.01)]

8 Claims

1. An open domain dialog reply method based on thematic enhancement, comprising:

collecting and pre-processing open-source text corpora of Chinese open domain dialog, to obtain a Chinese dialog corpus dataset;

performing sentence breaking, word separation, and lexical annotation of dialogs by a public natural language processing toolkit HanLP (Han Language Processing), and extracting noun words by a regular expression;

performing enhancement of semantic and thematic information on each sentence, and learning vector representations of original sentences and enhanced sentences by a pre-trained sentence representation model RoBERTa (Robustly optimized Bidirectional Encoder Representations from Transformers approach);

extracting semantic and thematic information of sentences by a graph convolutional neural network, and performing thematic aggregation enhancement to obtain a sentence vector after the thematic aggregation enhancement; and

inputting the sentence vector after the thematic aggregation enhancement into a generative pre-trained model GPT (Generative Pre-Trained Transformer), generating a candidate set of dialog replies by a decoding strategy of beam search, and finally training a reply ranking selection model by contrastive learning to select the most suitable reply;

wherein the performing sentence breaking, word separation, and lexical annotation of dialogs by the public natural language processing toolkit, and extracting noun words by the regular expression further comprises:

performing sentence breaking of each dialog in the Chinese dialog corpus dataset by the public natural language processing toolkit HanLP, to obtain m sentences {S₁, S₂, S₃, . . . , S_m}, performing word separation of each sentence to obtain n words {t₁, t₂, t₃, . . . , t_n}, performing lexical classification of each word t_x (1≤x≤n) according to a processing specification of a modern Chinese corpus set, giving each word a lexical marker by lexical classification according to components that words assume in a syntactic structure or a language morphology, and extracting, by the regular expression, all words whose lexical marker meets a noun category, the noun categories including adjectives with nouns, nouns, personal names, place names, institutional group names, and proper nouns from the lexical categories; and

the performing enhancement of semantic and thematic information on each sentence, and learning vector representations of original sentences and the enhanced sentences by the pre-trained sentence representation model further comprises:

performing enhancement of semantic data on each sentence S_y(1≤y≤m);

performing enhancement of thematic information on the extracted words that meet the noun;

performing another data enhancement on an enhanced dialog text; and

learning vector representations of the original sentences and the enhanced sentences by the pre-trained sentence representation model.
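
The noun-extraction step recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. It assumes the words have already been segmented and tagged (for example by HanLP) and that the lexical markers follow PKU-style tags, where `an`, `n`, `nr`, `ns`, `nt`, and `nz` are taken to correspond to the six noun categories listed in the claim. The function name `extract_nouns`, the tag mapping, and the hand-tagged sample sentence are all hypothetical.

```python
import re

# Assumed PKU-style lexical markers for the noun categories named in the claim:
# an = adjective with noun, n = noun, nr = personal name,
# ns = place name, nt = institutional group name, nz = other proper noun.
NOUN_TAG = re.compile(r"an|n|nr|ns|nt|nz")


def extract_nouns(tagged_words):
    """Return the words whose lexical marker fully matches a noun category.

    tagged_words: iterable of (word, tag) pairs, assumed to be the output
    of an upstream word separation and lexical annotation step.
    """
    return [word for word, tag in tagged_words if NOUN_TAG.fullmatch(tag)]


# Hand-tagged sample sentence (hypothetical data, not real toolkit output).
tagged = [
    ("我", "r"), ("喜欢", "v"), ("杭州", "ns"), ("的", "u"),
    ("美丽", "a"), ("西湖", "ns"), ("风景", "n"),
]
nouns = extract_nouns(tagged)
```

Using `fullmatch` rather than `search` keeps markers such as `vn` or `ng` from being picked up merely because they contain the letter `n`.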