US 11,936,673 B2
Method and system for detecting harmful web resources
Nikolay Prudkovskiy, Moscow (RU)
Assigned to GROUP IB, LTD, Moscow (RU)
Filed by Group IB, Ltd, Moscow (RU)
Filed on Dec. 10, 2020, as Appl. No. 17/117,893.
Claims priority of application No. 2020115830 (RU), filed on May 12, 2020.
Prior Publication US 2021/0360012 A1, Nov. 18, 2021
Int. Cl. H04L 9/40 (2022.01); G06F 16/2457 (2019.01); G06F 16/955 (2019.01); G06F 16/958 (2019.01); G06N 20/00 (2019.01)
CPC H04L 63/1425 (2013.01) [G06F 16/24578 (2019.01); G06F 16/9566 (2019.01); G06F 16/986 (2019.01); G06N 20/00 (2019.01); H04L 63/1416 (2013.01)] 17 Claims
 
1. A method for detecting harmful content on a network, the method comprising:
receiving, via the network, a URL;
obtaining, from the URL, an HTML document associated therewith;
converting the HTML document into a text, the converting comprising extracting a respective text portion from an HTML body of the HTML document;
normalizing the text of the HTML document, the normalizing comprising transforming each word of the text to a respective canonic form thereof, thereby generating a plurality of tokens associated with the text;
aggregating each one of the plurality of tokens into a token vector associated with the HTML document;
applying one or more classifiers to the token vector associated with the HTML document to determine a likelihood parameter indicative of the HTML document containing the harmful content,
the one or more classifiers having been trained to determine the harmful content based on a respective training set of data, the respective training set of data comprising a training token matrix having been generated by:
receiving a plurality of training HTML documents;
converting each one of the plurality of training HTML documents into a respective training text of a plurality of training texts, the converting comprising extracting the respective text portion from an HTML body of each one of the plurality of training HTML documents;
normalizing a given one of the plurality of training texts, the normalizing comprising transforming each word of the given one of the plurality of training texts to a respective canonic form thereof, thereby generating a respective plurality of training tokens associated with the given one of the plurality of training texts,
determining, for each one of the respective plurality of training tokens, a respective significance parameter,
the respective significance parameter associated with a given one of the respective plurality of training tokens being determined based on a product of multiplication of a local frequency of occurrence and an inverse value of a global frequency of occurrence associated therewith,
the local frequency of occurrence being a frequency of occurrence of the given one of the respective plurality of training tokens within the given one of the plurality of training texts associated with a respective one of the plurality of training HTML documents; and
the global frequency of occurrence being a frequency of occurrence of the given one of the respective plurality of training tokens within the plurality of training texts of the plurality of training HTML documents;
aggregating respective pluralities of training tokens of the plurality of training HTML documents associated with respective significance parameters into the training token matrix;
in response to the likelihood parameter being equal to or greater than a predetermined likelihood parameter threshold:
determining that the HTML document located under the URL contains the harmful content; and
storing the URL in a database of harmful URLs.
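
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. All function and class names (BodyText, html_to_text, normalize, token_vector, training_matrix, record_if_harmful) are hypothetical. The "canonic form" step is replaced by a naive stand-in (lowercasing and stripping a trailing "s"), where a real system would presumably use a lemmatizer. The global frequency is assumed to be a token's share of all tokens in the training corpus, which is one literal reading of the claim; classical TF-IDF typically uses a logarithm of inverse document frequency instead. The claim does not fix a classifier model, so the likelihood parameter is passed in as a number and the "database of harmful URLs" is modeled as an in-memory set.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class BodyText(HTMLParser):
    """Collects text found inside <body>, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if self.in_body and not self.skip:
            self.parts.append(data)


def html_to_text(html):
    """Extract the text portion of the HTML body."""
    parser = BodyText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts).strip()


def normalize(text):
    """Toy stand-in for reducing words to canonic form (assumption)."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w[:-1] if len(w) > 3 and w.endswith("s") else w for w in words]


def token_vector(tokens, vocab, inv_global):
    """Significance per vocabulary token: local frequency * inverse global frequency."""
    local = Counter(tokens)
    n = len(tokens) or 1  # avoid division by zero for an empty document
    return [(local[t] / n) * inv_global[t] for t in vocab]


def training_matrix(token_lists):
    """Build the vocabulary, inverse global frequencies, and one row per training text."""
    corpus = Counter(t for tokens in token_lists for t in tokens)
    total = sum(corpus.values())
    vocab = sorted(corpus)
    # Assumed reading: global frequency = corpus count / total corpus tokens.
    inv_global = {t: total / corpus[t] for t in vocab}
    rows = [token_vector(tokens, vocab, inv_global) for tokens in token_lists]
    return vocab, inv_global, rows


def record_if_harmful(url, likelihood, threshold, harmful_urls):
    """Store the URL when the likelihood meets or exceeds the threshold."""
    if likelihood >= threshold:
        harmful_urls.add(url)
        return True
    return False
```

In use, each training HTML document would pass through html_to_text and normalize before training_matrix is called; a newly received document would be vectorized with token_vector against the same vocabulary, scored by whichever trained classifier is chosen, and the resulting likelihood handed to record_if_harmful.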