US 12,130,871 B2
Front page news prediction and classification method
Kaichen Cao, Sichuan (CN); Lican Dai, Sichuan (CN); Bing Zeng, Sichuan (CN); Wen Sun, Sichuan (CN); Wanli Liu, Sichuan (CN); and Shou Feng, Sichuan (CN)
Assigned to CHINA ELECTRONICS TECHNOLOGY GROUP CORPORATION NO.10 RESEARCH INSTITUTE, Sichuan (CN)
Appl. No. 17/785,428
Filed by CHINA ELECTRONICS TECHNOLOGY GROUP CORPORATION NO.10 RESEARCH INSTITUTE, Sichuan (CN)
PCT Filed Aug. 10, 2021, PCT No. PCT/CN2021/111885
§ 371(c)(1), (2) Date Jun. 15, 2022,
PCT Pub. No. WO2022/037446, PCT Pub. Date Feb. 24, 2022.
Claims priority of application No. 202010845229.0 (CN), filed on Aug. 20, 2020.
Prior Publication US 2023/0244757 A1, Aug. 3, 2023
Int. Cl. G06F 16/951 (2019.01); G06F 18/2415 (2023.01); G06F 40/279 (2020.01)
CPC G06F 16/951 (2019.01) [G06F 18/2415 (2023.01); G06F 40/279 (2020.01)] 8 Claims
 
1. A front page news prediction and classification method, comprising the following steps: constructing a news network topology by using news text data;
inputting keywords to be queried by means of a user interface, collecting web pages on the Internet based on the keywords;
compiling a web crawler by using an object-oriented programming language Python, loading the web crawler into a news and newspaper text data collection module;
storing collected news text information on the web pages, by the news and newspaper text data collection module, in a local database;
performing data cleaning, by a data cleaning module, on original data obtained from a website;
performing word segmentation, by a text word segmentation module, on the cleaned data by using Jieba;
performing vector representation, by a text representation module, by using a Doc2Vec representation algorithm, so as to convert each news text into a low-dimensional text feature vector with a high amount of information;
calculating the similarity between news, by a similarity network construction module, by using a locality-sensitive hashing (LSH) algorithm, so as to obtain a similarity matrix;
constructing a news similarity network by taking the similarity matrix obtained by LSH calculation as an adjacent matrix of a news related network;
introducing an H index into a PageRank algorithm by a front page news prediction module, calculating an H-index supporting contribution matrix according to the similarity network;
determining whether the similarity network is traversed, if the similarity network is traversed, iteratively calculating an HR value of the vector according to the H-index supporting contribution matrix;
performing weight sorting on the news by using the HR value, and predicting top-N pieces of news as front page news;
wherein, the keywords are keywords of the web pages;
Jieba is a Chinese text segmenter in Python;
the front page news prediction module performs weight sorting on the news, predicts top-N pieces of news as front page news, and calculates the value of the ith row and the jth column of the H-index supporting contribution matrix according to the similarity network:

[Equation image not reproduced in this text]
vj∈N(vi), wherein, Aij represents the value of the ith row and the jth column of the adjacent matrix of the network, vi represents a target node, vj represents a node in a domain to which vi belongs, D(vj) represents a degree of the node vj in an adjacent domain, and H(vi) represents an H index of the target node vi;
after traversing and calculating the similarity network, the front page news prediction module iteratively calculates an HR value of the vector according to the proportion of the node vj of the network GSHCM, represented by an adjacency function l(vi,vj), in the total number of nodes in the vi domain NSHCM(vi) and the H-index supporting contribution matrix:

[Equation image not reproduced in this text]
wherein, d represents a damping coefficient, and it is defined that d=0.85, NSHCM(vi) represents the domain of the node vi in the network GSHCM, DSHCM(vj) represents the degree of the node vj in the network GSHCM, Sorti represents the ith element in a sorting sequence obtained on the basis of a certain sorting algorithm, and N represents a prediction length of Top-N prediction.
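
The similarity-network step of the claim (LSH over text feature vectors, with the resulting similarity matrix used as an adjacency matrix) can be illustrated with a minimal sketch. The claim names LSH but not a hash family, so the random-hyperplane scheme below is an assumption, as are the function names `lsh_signature` and `similarity_matrix` and the parameters `num_planes` and `seed`. The input vectors stand in for Doc2Vec outputs.

```python
import random


def lsh_signature(vec, planes):
    """Return one bit per hyperplane: 1 if vec lies on its non-negative side."""
    return [1 if sum(v * p for v, p in zip(vec, plane)) >= 0 else 0
            for plane in planes]


def similarity_matrix(vectors, num_planes=64, seed=0):
    """Estimate pairwise similarity as the fraction of matching signature bits.

    ASSUMPTION: random-hyperplane LSH; the claim does not name a hash family.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    dim = len(vectors[0])
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]
    sigs = [lsh_signature(v, planes) for v in vectors]
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            matches = sum(a == b for a, b in zip(sigs[i], sigs[j]))
            sim[i][j] = matches / num_planes
    return sim
```

Vectors pointing in the same direction share every signature bit, and vectors pointing in opposite directions share none. A real pipeline would likely threshold the matrix and zero its diagonal before treating it as an adjacency matrix. That step is left out here because the claim does not describe it.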
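
The ranking steps of the claim (an H index per node, a damped iteration with d=0.85, and top-N selection) can be sketched as follows. The claim's two formulas are not reproduced in this text, so the contribution weighting below (each edge weighted by the H index of the receiving node, then row-normalized) is an illustrative assumption and not the patented formula. The names `h_index`, `hr_scores`, and `top_n` are hypothetical.

```python
def h_index(adj, i):
    """Largest h such that node i has at least h neighbors of degree >= h."""
    degrees = [sum(1 for w in row if w > 0) for row in adj]
    neighbor_degrees = sorted(
        (degrees[j] for j, w in enumerate(adj[i]) if w > 0), reverse=True)
    h = 0
    for rank, deg in enumerate(neighbor_degrees, start=1):
        if deg >= rank:
            h = rank
        else:
            break
    return h


def hr_scores(adj, d=0.85, iterations=100):
    """PageRank-style iteration over an H-index-weighted network."""
    n = len(adj)
    h = [h_index(adj, i) for i in range(n)]
    # ASSUMED weighting: node j passes score to neighbor i in proportion
    # to A[j][i] * H(v_i). The patent's own matrix formula may differ.
    weights = [[adj[j][i] * h[i] for i in range(n)] for j in range(n)]
    row_sums = [sum(row) for row in weights]
    hr = [1.0 / n] * n
    for _ in range(iterations):
        new = []
        for i in range(n):
            incoming = sum(hr[j] * weights[j][i] / row_sums[j]
                           for j in range(n) if row_sums[j] > 0)
            new.append((1 - d) / n + d * incoming)
        hr = new
    return hr


def top_n(scores, n):
    """Indices of the n highest-scoring nodes, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]
```

For example, on a five-node network with edges 0-1, 0-2, 0-3, 1-2, and 3-4, `top_n(hr_scores(adj), 1)` selects node 0, the best-connected node. The scores sum to 1 as long as every node has at least one neighbor with a nonzero H index.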