CPC G06F 40/284 (2020.01) [G06F 40/268 (2020.01); G06F 40/30 (2020.01); G06N 3/08 (2013.01); G06N 20/00 (2019.01)] | 8 Claims |
1. An open domain dialog reply method based on thematic enhancement, comprising:
collecting and pre-processing open-source text corpora of Chinese open domain dialog, to obtain a Chinese dialog corpus dataset;
performing sentence breaking, word separation, and lexical annotation of dialogs by a public natural language processing toolkit HanLP (Han Language Processing), and extracting noun words by a regular expression;
performing enhancement of semantic and thematic information on each sentence, and learning vector representations of original sentences and enhanced sentences by a pre-trained sentence representation model RoBERTa (Robustly optimized Bidirectional Encoder Representations from Transformers approach);
extracting semantic and thematic information of sentences by a graph convolutional neural network, and performing thematic aggregation enhancement to obtain a sentence vector after the thematic aggregation enhancement; and
inputting the sentence vector after the thematic aggregation enhancement into a generative pre-trained model GPT (Generative Pre-Trained Transformer), generating a candidate set of dialog replies by a decoding strategy of beam search, and finally training a reply ranking selection model in a contrastive learning manner to select the most suitable reply;
wherein the performing sentence breaking, word separation, and lexical annotation of dialogs by the public natural language processing toolkit, and extracting noun words by the regular expression further comprises:
performing sentence breaking of each dialog in the Chinese dialog corpus dataset by the public natural language processing toolkit HanLP, to obtain m sentences {S1, S2, S3, . . . , Sm}, performing word separation of each sentence to obtain n words {t1, t2, t3, . . . , tn}, performing lexical classification of each word tx (1≤x≤n) according to a processing specification of a modern Chinese corpus set, giving each word a lexical marker by lexical classification according to the components that words assume in a syntactic structure or a language morphology, and extracting, by the regular expression, all words whose lexical markers meet a noun, the lexical categories that meet the noun including adjectives with nouns, nouns, personal names, place names, institutional group names, and proper nouns; and
the performing enhancement of semantic and thematic information on each sentence, and learning vector representations of original sentences and the enhanced sentences by the pre-trained sentence representation model further comprises:
performing enhancement of semantic data on each sentence Sy (1≤y≤m);
performing enhancement of thematic information on the extracted words that meet the noun;
performing another data enhancement on an enhanced dialog text; and
learning vector representations of the original sentences and the enhanced sentences by the pre-trained sentence representation model.
|
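The noun-extraction step recited in claim 1 can be illustrated with a minimal sketch. The claim does not give the regular expression or the tag names, so the PKU-style tags used here (`an`, `n`, `nr`, `ns`, `nt`, `nz`) are an assumption about how the six recited categories might be labeled, and `extract_nouns` is a hypothetical helper. The input is a hand-tagged sentence standing in for the toolkit's output; the toolkit itself is not called.

```python
import re

# Assumption: PKU-style part-of-speech tags, one per category recited in the claim.
# an = adjective with nominal function, n = noun, nr = personal name,
# ns = place name, nt = institution/group name, nz = other proper noun.
NOUN_TAG = re.compile(r"(?:an|n|nr|ns|nt|nz)")


def extract_nouns(tagged_words):
    """Return the words whose tag fully matches one of the noun categories."""
    return [word for word, tag in tagged_words if NOUN_TAG.fullmatch(tag)]


# Hypothetical, hand-tagged sentence standing in for word separation
# and lexical annotation output.
tagged_sentence = [
    ("小明", "nr"), ("昨天", "t"), ("在", "p"), ("北京", "ns"),
    ("大学", "n"), ("认真", "ad"), ("学习", "v"), ("人工智能", "nz"),
]

nouns = extract_nouns(tagged_sentence)
print(nouns)
```

Matching with `fullmatch` rather than `match` keeps tags that merely begin with a noun tag out of the result; whether such finer-grained tags should be included is a design choice the claim leaves open.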