US 11,755,677 B2
Data mining method, data mining apparatus, electronic device and storage medium
Qin Mao, Beijing (CN); Pei Zou, Beijing (CN); Yue Zhang, Beijing (CN); Yan Liu, Beijing (CN); and Haichao Deng, Beijing (CN)
Assigned to BEIJING BAIDU NETCOM SCIENCE TECHNOLOGY CO., LTD., Beijing (CN)
Filed by Beijing Baidu Netcom Science Technology Co., Ltd., Beijing (CN)
Filed on Dec. 31, 2021, as Appl. No. 17/646,683.
Claims priority of application No. 202110742126.6 (CN), filed on Jun. 30, 2021.
Prior Publication US 2023/0004613 A1, Jan. 5, 2023
Int. Cl. G06F 16/00 (2019.01); G06F 16/9538 (2019.01); G06F 16/955 (2019.01); G06F 16/9537 (2019.01); G06F 16/951 (2019.01); G06F 40/30 (2020.01)
CPC G06F 16/9538 (2019.01) [G06F 16/951 (2019.01); G06F 16/955 (2019.01); G06F 16/9537 (2019.01); G06F 40/30 (2020.01); G06F 2216/03 (2013.01)]
17 Claims
1. A data mining method comprising:
acquiring a current article to be mined;
obtaining information values required for each data identification strategy of a plurality of data identification strategies from the current article, wherein each data identification strategy is used for identifying a preset type of data;
identifying a data type of the current article according to the information values required for each data identification strategy to obtain a data type identification result; and
determining whether the current article belongs to any preset type of data according to the data type identification result;
wherein preset types of the data comprise low quality data, low quality content, and inaccurate sentiment analysis; and obtaining the information values required for each data identification strategy of the plurality of data identification strategies from the current article comprises:
obtaining an article title, an article abstract and an article content from the current article based on a data identification strategy of a low quality data type;
extracting keywords from the current article based on a data identification strategy of a low quality content type; and
obtaining a sentiment polarity label from the current article based on a data identification strategy of an inaccurate sentiment analysis.
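
The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the claim specifies only which information values each strategy obtains, not how a data type is identified from them. The names Article, Strategy, extract_keywords, mine, BLOCKLIST and KNOWN_LABELS are hypothetical, the frequency-based keyword extraction is a stand-in, and every identify rule is a placeholder assumption chosen only to keep the example runnable.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Article:
    title: str
    abstract: str
    content: str
    sentiment_label: str  # assumed to be attached by an upstream sentiment model


@dataclass
class Strategy:
    preset_type: str
    extract: Callable[[Article], Dict[str, Any]]   # obtains the information values
    identify: Callable[[Dict[str, Any]], bool]     # placeholder rule, not from the claim


def extract_keywords(article: Article, top_n: int = 5) -> List[str]:
    # Stand-in keyword extraction: most frequent words longer than 3 letters.
    words = re.findall(r"[a-z]+", article.content.lower())
    counts = Counter(w for w in words if len(w) > 3)
    return [w for w, _ in counts.most_common(top_n)]


BLOCKLIST = {"clickbait"}                           # hypothetical
KNOWN_LABELS = {"positive", "negative", "neutral"}  # hypothetical

STRATEGIES = [
    Strategy(
        "low quality data",
        extract=lambda a: {"title": a.title, "abstract": a.abstract, "content": a.content},
        # Assumption: a blank field marks the article as low quality data.
        identify=lambda v: any(not s.strip() for s in v.values()),
    ),
    Strategy(
        "low quality content",
        extract=lambda a: {"keywords": extract_keywords(a)},
        # Assumption: a blocklisted keyword marks low quality content.
        identify=lambda v: any(k in BLOCKLIST for k in v["keywords"]),
    ),
    Strategy(
        "inaccurate sentiment analysis",
        extract=lambda a: {"sentiment_label": a.sentiment_label},
        # Assumption: an unrecognized label marks an inaccurate analysis.
        identify=lambda v: v["sentiment_label"] not in KNOWN_LABELS,
    ),
]


def mine(article: Article) -> Dict[str, bool]:
    """Return the data type identification result, keyed by preset type."""
    result = {}
    for strategy in STRATEGIES:
        values = strategy.extract(article)          # information values per strategy
        result[strategy.preset_type] = strategy.identify(values)
    return result


def belongs_to_any_preset_type(result: Dict[str, bool]) -> bool:
    return any(result.values())
```

Calling mine(article) yields one boolean per preset type, and belongs_to_any_preset_type reduces that result to the final determination of the claim. Keeping extraction and identification as separate callables mirrors the claim's split between obtaining information values and identifying the data type.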