US 12,229,506 B2
Method and system for filtering ill corpus
Xiaoning Jiang, Zhejiang (CN); Kai Liu, Zhejiang (CN); Yuhan Zhou, Zhejiang (CN); Hongmin Xie, Zhejiang (CN); Yukuan He, Zhejiang (CN); Weijie Liu, Zhejiang (CN); Jie Zhang, Zhejiang (CN); and Zhen Liu, Zhejiang (CN)
Assigned to Zhejiang Gongshang University, Hangzhou (CN)
Filed by Zhejiang Gongshang University, Zhejiang (CN)
Filed on Dec. 16, 2022, as Appl. No. 18/067,428.
Claims priority of application No. 202210905334.8 (CN), filed on Jul. 29, 2022.
Prior Publication US 2024/0037328 A1, Feb. 1, 2024
Int. Cl. G06F 40/232 (2020.01); G06F 40/295 (2020.01); H04L 51/212 (2022.01)
CPC G06F 40/232 (2020.01) [G06F 40/295 (2020.01); H04L 51/212 (2022.05)] 12 Claims
OG exemplary drawing
 
1. A method for filtering an ill corpus, comprising:
S1: acquiring, by at least one processor, a text corpus to be recognized, and preprocessing the text corpus to be recognized to obtain a basic text corpus;
S2: extracting, by at least one processor, entities in the basic text corpus, and performing matching search on the entities of the basic text corpus according to an ill-text knowledge graph to obtain a first recognition result;
S3: detecting and recognizing, by at least one processor, the basic text corpus according to a corpus recognition model to obtain a second recognition result; and
S4: filtering, by at least one processor, the text corpus to be recognized according to at least one of the first recognition result and the second recognition result; and
S5: updating the ill-text knowledge graph according to the second recognition result, so as to update ill information in the form of new words into the ill-text knowledge graph in real time;
wherein a construction of the ill-text knowledge graph comprises:
acquiring a large amount of original ill text information from a network platform, and extracting entities of the original ill text information to obtain a plurality of ill word entities;
performing entity conversion processing on the ill word entities so as to obtain ill word pinyin entities and ill word homophonic entities; and
extracting a relationship among the ill word entities, the ill word pinyin entities and the ill word homophonic entities according to pinyin conversion, homophonic conversion, part-of-speech and term frequency, and constructing triples by entity disambiguation so as to obtain the ill-text knowledge graph;
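The graph construction above can be sketched in Python as a set of (head, relation, tail) triples. The pinyin and homophone tables below are tiny illustrative stand-ins (a real system would use a full pinyin lexicon), and the entity and relation names are hypothetical, not taken from the patent:

```python
# Sketch of ill-text knowledge-graph construction as (head, relation, tail) triples.
# The character maps below are illustrative stand-ins, not a real lexicon.

PINYIN = {"赌": "du", "博": "bo"}        # hypothetical character -> pinyin map
HOMOPHONES = {"赌": "堵", "博": "搏"}     # hypothetical homophone substitutions

def to_pinyin(word):
    """Entity conversion: ill-word entity -> ill-word pinyin entity."""
    return "".join(PINYIN.get(ch, ch) for ch in word)

def to_homophone(word):
    """Entity conversion: ill-word entity -> ill-word homophonic entity."""
    return "".join(HOMOPHONES.get(ch, ch) for ch in word)

def build_graph(ill_words):
    """Link each ill word to its pinyin and homophone forms as triples."""
    triples = set()                      # a set gives simple entity disambiguation:
    for w in ill_words:                  # duplicate triples collapse automatically
        triples.add((w, "pinyin_of", to_pinyin(w)))
        triples.add((w, "homophone_of", to_homophone(w)))
    return triples

graph = build_graph(["赌博"])
```

Storing the graph as a triple set keeps the matching search of step S2 a simple membership test, and makes the real-time update of step S5 an in-place `set.update`.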
wherein obtaining the first recognition result specifically comprises:
screening the entities of the basic text corpus according to the ill-text knowledge graph so as to obtain a preset number of candidate ill entities; and
mapping the basic text corpus and the candidate ill entities into multidimensional vectors using a word2vec model, calculating similarity between the basic text corpus and the candidate ill entities according to a cosine-similarity calculation method, and obtaining the first recognition result according to the similarity;
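The cosine-similarity step above can be sketched as follows. The toy two-dimensional vectors stand in for word2vec embeddings, and the 0.8 decision threshold is an assumed parameter, not specified by the patent:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def first_recognition(corpus_vec, candidate_vecs, threshold=0.8):
    """Flag the corpus as ill if any candidate ill entity is similar enough.

    corpus_vec and candidate_vecs stand in for word2vec embeddings; the
    threshold is an assumed parameter, not taken from the patent.
    """
    sims = {name: cosine_similarity(corpus_vec, vec)
            for name, vec in candidate_vecs.items()}
    best = max(sims.values(), default=0.0)
    return best >= threshold, sims

is_ill, sims = first_recognition(
    [1.0, 0.0],
    {"entity_a": [0.9, 0.1], "entity_b": [0.0, 1.0]},
)
```

Here `entity_a` lies close to the corpus vector and exceeds the threshold, so the first recognition result marks the corpus as ill; `entity_b` is orthogonal and contributes nothing.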
wherein the corpus recognition model is a k-nearest neighbor model;
wherein a construction of the corpus recognition model comprises:
acquiring the ill information fed back by users and collecting normal corpus information;
performing pinyin conversion and homophonic conversion on the ill information and the normal corpus information word by word to obtain pinyin corpus information and homophonic corpus information;
dividing the ill information, the normal corpus information, the pinyin corpus information and the homophonic corpus information as a sample set into a training set and a test set, and mapping the sample set into spatial vectors through the word2vec model; and
performing training on the training set mapped into the spatial vector using a k-nearest neighbor model so as to obtain the corpus recognition model;
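The k-nearest neighbor classification underlying the corpus recognition model can be sketched in a few lines. The two-dimensional vectors and the labels are illustrative stand-ins for the word2vec embeddings of the ill, normal, pinyin and homophonic samples, and k = 3 is an assumed setting:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training vectors.

    `train` is a list of (vector, label) pairs; the vectors stand in for
    word2vec embeddings of the training-set samples.
    """
    by_dist = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

training_set = [
    ([0.9, 0.1], "ill"), ([0.8, 0.2], "ill"), ([0.85, 0.15], "ill"),
    ([0.1, 0.9], "normal"), ([0.2, 0.8], "normal"), ([0.15, 0.85], "normal"),
]
label = knn_predict(training_set, [0.7, 0.3])
```

Because k-NN stores the training vectors rather than fitted weights, newly fed-back ill samples can be appended to `training_set` directly, which matches the claim's emphasis on catching ill information in the form of new words.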
wherein the second recognition result is configured as a supplement to the first recognition result, and the corpus recognition model is constructed to filter out hidden ill information in the form of new words;
wherein there is no sequential relationship between steps S2 and S3.
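The overall decision of steps S4 and S5 can be sketched as follows. Treating the two recognition results as booleans and combining them with a logical OR is one reading of "according to at least one of the first recognition result and the second recognition result"; the patent does not fix the exact combination rule, and the graph-update helper is a hypothetical simplification:

```python
def should_filter(first_result, second_result):
    """S4 sketch: filter the corpus if either recognition path flags it as ill.

    An OR combination is one plausible reading of "at least one of" the two
    results; the claim does not specify the rule.
    """
    return bool(first_result or second_result)

def update_graph(graph, new_triples):
    """S5 sketch: fold triples for newly recognized ill words into the graph.

    Because steps S2 and S3 are independent, this update can run as soon as
    the second recognition result is available.
    """
    graph.update(new_triples)
    return graph

decision = should_filter(False, True)
graph = update_graph(set(), {("新词", "pinyin_of", "xinci")})
```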