US 12,130,871 B2
Front page news prediction and classification method
Kaichen Cao, Sichuan (CN); Lican Dai, Sichuan (CN); Bing Zeng, Sichuan (CN); Wen Sun, Sichuan (CN); Wanli Liu, Sichuan (CN); and Shou Feng, Sichuan (CN)
Assigned to CHINA ELECTRONICS TECHNOLOGY GROUP CORPORATION NO.10 RESEARCH INSTITUTE, Sichuan (CN)
Appl. No. 17/785,428
Filed by CHINA ELECTRONICS TECHNOLOGY GROUP CORPORATION NO.10 RESEARCH INSTITUTE, Sichuan (CN)
PCT Filed Aug. 10, 2021, PCT No. PCT/CN2021/111885 § 371(c)(1), (2) Date Jun. 15, 2022, PCT Pub. No. WO2022/037446, PCT Pub. Date Feb. 24, 2022.
Claims priority of application No. 202010845229.0 (CN), filed on Aug. 20, 2020.
Prior Publication US 2023/0244757 A1, Aug. 3, 2023
Int. Cl. G06F 16/951 (2019.01); G06F 18/2415 (2023.01); G06F 40/279 (2020.01)

CPC G06F 16/951 (2019.01) [G06F 18/2415 (2023.01); G06F 40/279 (2020.01)]

8 Claims

1. A front page news prediction and classification method, comprising the following steps: constructing a news network topology by using news text data;

inputting keywords to be queried by means of a user interface, and collecting web pages on the Internet based on the keywords;

compiling a web crawler by using the object-oriented programming language Python, and loading the web crawler into a news and newspaper text data collection module;

storing collected news text information on the web pages, by the news and newspaper text data collection module, in a local database;

performing data cleaning, by a data cleaning module, on original data obtained from a website;

performing word segmentation, by a text word segmentation module, on the cleaned data by using Jieba;

performing vector representation, by a text representation module, by using a Doc2Vec representation algorithm, so as to convert each news text into a low-dimensional text feature vector with a high amount of information;

calculating the similarity between news, by a similarity network construction module, by using a locality-sensitive hashing (LSH) algorithm, so as to obtain a similarity matrix;

constructing a news similarity network by taking the similarity matrix obtained by LSH calculation as an adjacency matrix of a news related network;

introducing an H index into a PageRank algorithm by a front page news prediction module, and calculating an H-index supporting contribution matrix according to the similarity network;

determining whether the similarity network has been traversed and, if the similarity network has been traversed, iteratively calculating an HR value of the vector according to the H-index supporting contribution matrix;

performing weight sorting on the news by using the HR value, and predicting top-N pieces of news as front page news;

wherein, the keywords are keywords of the web pages;

Jieba is a Chinese text segmenter in Python;

the front page news prediction module performs weight sorting on the news, predicts top-N pieces of news as front page news, and calculates the value of the i-th row and the j-th column of the H-index supporting contribution matrix according to the similarity network:

v_j ∈ N(v_i), wherein A_ij represents the value of the i-th row and the j-th column of the adjacency matrix of the network, v_i represents a target node, v_j represents a node in a domain to which v_i belongs, D(v_j) represents a degree of the node v_j in an adjacent domain, and H(v_i) represents an H index of the target node v_i;

after traversing and calculating the similarity network, the front page news prediction module iteratively calculates an HR value of the vector according to the proportion of the node v_j in the network G_SHCM, represented by an adjacency function l(v_i, v_j), in the total number of nodes in the v_i domain N_SHCM(v_i) and the H-index supporting contribution matrix:

wherein d represents a damping coefficient, and it is defined that d=0.85, N_SHCM(v_i) represents the domain of the node v_i in the network G_SHCM, D_SHCM(v_j) represents the degree of the node v_j in the network G_SHCM, Sort_i represents the i-th element in a sorting sequence obtained on the basis of a certain sorting algorithm, and N represents a prediction length of Top-N prediction.
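
The ranking steps recited in claim 1 (H index, contribution matrix, damped iteration with d=0.85, Top-N selection) can be illustrated with a minimal Python sketch. It is not the claimed formula: the equations of the claim are not reproduced in this text, so the sketch substitutes a generic H-index-weighted PageRank variant. The following are assumptions made only for illustration: the H index of a node is taken as the largest h such that the node has at least h neighbors of degree at least h; each node v_j is assumed to split its score among its neighbors v_i in proportion to A_ij * H(v_i); isolated nodes are assumed to spread their score uniformly; and the function names h_index, node_h_indices, hr_scores and top_n are hypothetical.

```python
def h_index(values):
    """Largest h such that at least h of the values are >= h."""
    ranked = sorted(values, reverse=True)
    h = 0
    for rank, value in enumerate(ranked, start=1):
        if value < rank:
            break
        h = rank
    return h


def node_h_indices(adj):
    """H index of every node, computed from the degrees of its neighbors."""
    n = len(adj)
    degree = [sum(1 for w in row if w > 0) for row in adj]
    return [
        h_index([degree[j] for j in range(n) if adj[i][j] > 0])
        for i in range(n)
    ]


def hr_scores(adj, d=0.85, tol=1e-10, max_iter=1000):
    """Damped iteration over an H-index-weighted transition matrix.

    ASSUMPTION: the weight A_ij * H(v_i) and the column normalization
    below are illustrative choices, not the formula of the claim.
    """
    n = len(adj)
    h = node_h_indices(adj)
    # weight[i][j]: unnormalized share of v_j's score passed to v_i
    weight = [[adj[i][j] * h[i] for j in range(n)] for i in range(n)]
    col_sum = [sum(weight[i][j] for i in range(n)) for j in range(n)]
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        # Score held by isolated nodes is redistributed uniformly.
        dangling = sum(scores[j] for j in range(n) if col_sum[j] == 0)
        updated = []
        for i in range(n):
            received = sum(
                weight[i][j] / col_sum[j] * scores[j]
                for j in range(n)
                if col_sum[j] > 0
            )
            updated.append((1 - d) / n + d * (received + dangling / n))
        delta = sum(abs(a - b) for a, b in zip(updated, scores))
        scores = updated
        if delta < tol:
            break
    return scores


def top_n(scores, n):
    """Indices of the n highest-scoring nodes, best first."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:n]


# Toy similarity network: node 0 is linked to nodes 1, 2 and 3.
adjacency = [
    [0, 1, 1, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
scores = hr_scores(adjacency)
print(top_n(scores, 1))
```

In this toy network the adjacency matrix is binary; a real-valued similarity matrix from the LSH step could be passed in the same way, since any positive entry is treated as an edge and its value scales the weight.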