US 12,278,828 B2
System for detecting web page fraud based on wordlist categorization
Md. Rafiul Hassan, Dhahran (SA); and Muhammad Imtiaz Hossain, Dhahran (SA)
Assigned to KING FAHD UNIVERSITY OF PETROLEUM AND MINERALS, Dhahran (SA)
Filed by KING FAHD UNIVERSITY OF PETROLEUM AND MINERALS, Dhahran (SA)
Filed on Jul. 10, 2024, as Appl. No. 18/768,090.
Application 18/768,090 is a continuation of application No. 17/510,458, filed on Oct. 26, 2021, granted, now U.S. Pat. No. 12,095,781.
Prior Publication US 2024/0364719 A1, Oct. 31, 2024
This patent is subject to a terminal disclaimer.
Int. Cl. H04L 29/06 (2006.01); G06N 20/00 (2019.01); H04L 9/40 (2022.01)

CPC H04L 63/1416 (2013.01) [G06N 20/00 (2019.01)]

5 Claims

1. A system for detection of fraudulent activity, including:

processing circuitry configured to

train a machine learning classifier, including

perform a Hidden Markov Model (HMM) for generating log-likelihood scores based on a plurality of attribute value vectors for one class and attribute value vectors for another class for a set of keyword features characterizing a Web page, wherein the generating includes recursively computing the log-likelihood of each state of each of the attribute value vectors, and wherein there are substantially fewer attribute value vectors in the one class than in the another class,

rank the log-likelihood scores generated by the HMM,

group the plurality of attribute value vectors into a predetermined number of bins, wherein the attribute value vectors in each bin are grouped by log-likelihood scores within equal ranges,

apply a one-sided sampling technique on each bin of the predetermined number of bins in order to remove redundant and borderline attribute value vectors of the attribute value vectors of the another class in the respective bin in order to obtain a balanced training dataset between the one class and the another class in each bin, and

train the machine learning classifier using the respective balanced training dataset, and

detect fraudulent activity in Web pages using the trained machine learning classifier,

wherein detecting fraudulent activity includes categorizing the Web pages based on inclusion of a keyword from a fraud indication wordlist.
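
The training steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation, and every function name and parameter in it is hypothetical. It assumes discrete attribute values, an HMM whose log-parameters (log_pi, log_A, log_B) are already available, Euclidean distance between attribute value vectors, and the label 1 for the smaller class. The one-sided step follows the commonly described one-sided selection recipe (keep the small class, keep only the larger-class vectors that a 1-nearest-neighbour rule needs, then drop larger-class vectors that form Tomek links), which may differ in detail from the claimed technique. Ranking is implicit here, since equal-range binning only needs each score's position between the minimum and maximum.

```python
import math
import random
import re

def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) over a list of log-values."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))

def forward_log_likelihood(obs, log_pi, log_A, log_B):
    """Log-likelihood of a discrete sequence via the HMM forward recursion."""
    n = len(log_pi)
    alpha = [log_pi[i] + log_B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [logsumexp([alpha[i] + log_A[i][j] for i in range(n)]) + log_B[j][o]
                 for j in range(n)]
    return logsumexp(alpha)

def equal_range_bins(scores, n_bins):
    """Group sample indices into n_bins score intervals of equal width."""
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / n_bins or 1.0  # all-equal scores fall into bin 0
    bins = [[] for _ in range(n_bins)]
    for idx, s in enumerate(scores):
        bins[min(int((s - lo) / width), n_bins - 1)].append(idx)
    return bins

def dist2(a, b):
    """Squared Euclidean distance (an assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(i, pool, X):
    """Index in pool (excluding i itself) whose vector is closest to X[i]."""
    return min((j for j in pool if j != i), key=lambda j: dist2(X[i], X[j]))

def one_sided_selection(idx, X, y, minority=1, rng=None):
    """Undersample the larger class among the samples listed in idx."""
    rng = rng or random.Random(0)
    small = [i for i in idx if y[i] == minority]
    large = [i for i in idx if y[i] != minority]
    if not small or not large:
        return list(idx)
    seed = rng.choice(large)
    kept = small + [seed]
    # Redundant vectors: already classified correctly by 1-NN on the kept set.
    for i in large:
        if i != seed and y[nearest(i, kept, X)] != y[i]:
            kept.append(i)
    # Borderline vectors: larger-class members of Tomek links.
    drop = set()
    for i in kept:
        if y[i] != minority:
            j = nearest(i, kept, X)
            if y[j] == minority and nearest(j, kept, X) == i:
                drop.add(i)
    return [i for i in kept if i not in drop]

def balance_by_bins(scores, X, y, n_bins):
    """Apply one-sided selection independently inside each score bin."""
    kept = []
    for members in equal_range_bins(scores, n_bins):
        kept.extend(one_sided_selection(members, X, y))
    return sorted(kept)

def contains_fraud_keyword(page_text, wordlist):
    """True if any wordlist entry appears as a whole token in the page text."""
    tokens = set(re.findall(r"[a-z0-9]+", page_text.lower()))
    return any(w.lower() in tokens for w in wordlist)
```

In this sketch, the indices returned by balance_by_bins would select the rows used to fit whichever classifier is chosen, and contains_fraud_keyword stands in for the wordlist-based categorization applied at detection time.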