CPC H04L 63/1425 (2013.01) [G06F 16/906 (2019.01); H04L 63/1416 (2013.01)] | 6 Claims |
1. An extraction apparatus, comprising:
a non-transitory memory;
a network interface configured to receive an input of information about a plurality of web pages including a first hypertext markup language (HTML) element that is known to reach a malicious web page through browser operation and a second HTML element that is known to reach a benign web page through browser operation; and
processing circuitry configured to:
classify the plurality of web pages into clusters;
extract a first HTML element that reaches the malicious web page and a second HTML element that reaches the benign web page from a web page of each cluster that is classified to extract a first character string included in the first and second HTML elements that are extracted;
extract, as a keyword, a second character string that characterizes the first HTML element that reaches the malicious web page from the first character string;
for each of an HTML text element or an HTML attribute value of each cluster of the web page:
generate a first document by integrating a character string related to the first HTML element that is known to reach the malicious web page in the first character string;
generate a second document by integrating a character string related to the second HTML element that is known to reach the benign web page in the first character string;
evaluate an importance of the character strings of the first document and the second document by comparing the first document and the second document; and
extract, as the keyword, the second character string that characterizes the first HTML element that reaches the malicious web page based on the importance for each of the HTML text element, the HTML attribute value or the character string in a region of each cluster of the web page; and
evaluate the importance of each character string in the integrated first document and the integrated second document and extract a character string with the importance exceeding a threshold as the keyword, wherein
the network interface is further configured to output the extracted keyword.
|
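The keyword-extraction steps recited in claim 1 can be illustrated with a minimal sketch. It assumes the web pages have already been classified into clusters and that each extracted HTML element arrives as a `(cluster_id, field, label, text)` tuple, where `field` distinguishes an HTML text element from an HTML attribute value and `label` is `"malicious"` or `"benign"`. The claim does not name an importance measure; the smoothed log-ratio of term frequencies used here is an assumption chosen for illustration, as are the whitespace tokenizer, the function names, and the default threshold.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def extract_keywords(elements, threshold=1.0):
    """Return {(cluster_id, field): [keywords]} from labeled element strings.

    elements: iterable of (cluster_id, field, label, text) tuples, where
    label is "malicious" or "benign" (assumed input layout).
    """
    # For each cluster and field, integrate the strings into a first
    # (malicious) document and a second (benign) document.
    docs = defaultdict(lambda: {"malicious": Counter(), "benign": Counter()})
    for cluster_id, field, label, text in elements:
        docs[(cluster_id, field)][label].update(tokenize(text))

    keywords = {}
    for key, pair in docs.items():
        mal, ben = pair["malicious"], pair["benign"]
        vocab = set(mal) | set(ben)
        n_mal, n_ben, v = sum(mal.values()), sum(ben.values()), len(vocab)
        selected = []
        for term in vocab:
            # Assumed importance: add-one smoothed log-ratio of the term's
            # relative frequency in the malicious vs. the benign document.
            p_mal = (mal[term] + 1) / (n_mal + v)
            p_ben = (ben[term] + 1) / (n_ben + v)
            if math.log(p_mal / p_ben) > threshold:
                selected.append(term)
        keywords[key] = sorted(selected)
    return keywords


if __name__ == "__main__":
    sample = [
        (0, "text", "malicious", "Download now"),
        (0, "text", "malicious", "Download free player"),
        (0, "text", "benign", "About us"),
        (0, "text", "benign", "Contact us"),
    ]
    print(extract_keywords(sample))
```

Comparing the two documents within each cluster and field, rather than across the whole corpus, keeps a string that is common on benign pages of one cluster from masking a string that is distinctive of malicious elements in another.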