CPC H04L 63/1425 (2013.01) [G06F 16/24578 (2019.01); G06F 16/9566 (2019.01); G06F 16/986 (2019.01); G06N 20/00 (2019.01); H04L 63/1416 (2013.01)] | 17 Claims |
1. A method for detecting harmful content on a network, the method comprising:
receiving, via the network, a URL;
obtaining, from the URL, an HTML document associated therewith;
converting the HTML document into a text, the converting comprising extracting a respective text portion from an HTML body of the HTML document;
normalizing the text of the HTML document, the normalizing comprising transforming each word of the text to a respective canonic form thereof, thereby generating a plurality of tokens associated with the text;
aggregating each one of the plurality of tokens into a token vector associated with the HTML document;
applying one or more classifiers to the token vector associated with the HTML document to determine a likelihood parameter indicative of the HTML document containing the harmful content,
the one or more classifiers having been trained to determine the harmful content based on a respective training set of data, the respective training set of data comprising a training token matrix having been generated by:
receiving a plurality of training HTML documents;
converting each one of the plurality of training HTML documents into a respective training text of a plurality of training texts, the converting comprising extracting the respective text portion from an HTML body of each one of the plurality of training HTML documents;
normalizing a given one of the plurality of training texts, the normalizing comprising transforming each word of the given one of the plurality of training texts to a respective canonic form thereof, thereby generating a respective plurality of training tokens associated with the given one of the plurality of training texts,
determining, for each one of the respective plurality of training tokens, a respective significance parameter,
the respective significance parameter associated with a given one of the respective plurality of training tokens being determined based on a product of multiplication of a local frequency of occurrence and an inverse value of a global frequency of occurrence associated therewith,
the local frequency of occurrence being a frequency of occurrence of the given one of the respective plurality of training tokens within the given one of the plurality of training texts associated with a respective one of the plurality of training HTML documents; and
the global frequency of occurrence being a frequency of occurrence of the given one of the respective plurality of training tokens within the plurality of training texts of the plurality of training HTML documents;
aggregating respective pluralities of training tokens of the plurality of training HTML documents associated with respective significance parameters into the training token matrix;
in response to the likelihood parameter being equal to or greater than a predetermined likelihood parameter threshold:
determining that the HTML document located under the URL contains the harmful content; and
storing the URL in a database of harmful URLs.
|
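The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and it rests on several assumptions: lowercasing stands in for the "canonic form" (a real system would likely lemmatize or stem), the HTML is passed in directly rather than fetched from the URL, the classifier is a hypothetical logistic scorer with hand-set weights, and a Python set stands in for the database of harmful URLs. All function names (`html_to_text`, `normalize`, `significance_matrix`, `likelihood`, `check_url`) are illustrative.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class BodyText(HTMLParser):
    """Collects text inside <body>, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if self.in_body and not self.skip:
            self.parts.append(data)


def html_to_text(html):
    """Extract the text portion of the HTML body."""
    parser = BodyText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def normalize(text):
    """Assumption: lowercased alphanumeric words stand in for canonic forms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def significance_matrix(training_htmls):
    """Build the training token matrix: local frequency times inverse global frequency."""
    docs = [Counter(normalize(html_to_text(h))) for h in training_htmls]
    global_count = Counter()
    for doc in docs:
        global_count.update(doc)
    total = sum(global_count.values())
    vocab = sorted(global_count)
    matrix = []
    for doc in docs:
        n = sum(doc.values()) or 1  # avoid division by zero for empty documents
        # local frequency = count / n; inverse global frequency = total / global count
        matrix.append([(doc[t] / n) * (total / global_count[t]) for t in vocab])
    return vocab, matrix


def likelihood(token_vector, weights, bias=0.0):
    """Hypothetical classifier: numerically stable logistic score in [0, 1]."""
    z = bias + sum(w * x for w, x in zip(weights, token_vector))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def check_url(url, html, vocab, weights, threshold, harmful_urls):
    """Score one document and store its URL if the threshold is met."""
    counts = Counter(normalize(html_to_text(html)))
    vector = [counts[t] for t in vocab]  # token vector over the training vocabulary
    score = likelihood(vector, weights)
    if score >= threshold:
        harmful_urls.add(url)
    return score
```

In this sketch the significance parameter is computed over raw token counts. A production system might instead use document frequency with logarithmic damping, as in conventional TF-IDF. The claim language only requires a product of a local frequency and an inverse global frequency.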