US 12,231,464 B2
Detecting phishing websites via a machine learning-based system using URL feature hashes, HTML encodings and embedded images of content pages
Ari Azarafrooz, Rancho Santa Margarita, CA (US); Yihua Liao, Fremont, CA (US); Zhi Xu, Cupertino, CA (US); and Najmeh Miramirkhani, Santa Clara, CA (US)
Assigned to Netskope, Inc., Santa Clara, CA (US)
Filed by Netskope, Inc., Santa Clara, CA (US)
Filed on May 16, 2022, as Appl. No. 17/745,701.
Application 17/745,701 is a continuation of application No. 17/475,233, filed on Sep. 14, 2021, granted, now Pat. No. 11,336,689.
Prior Publication US 2023/0082481 A1, Mar. 16, 2023
This patent is subject to a terminal disclaimer.
Int. Cl. H04L 9/40 (2022.01)
CPC H04L 63/1483 (2013.01) [H04L 63/0281 (2013.01); H04L 63/1408 (2013.01)]
20 Claims
 
1. A phishing classifier that classifies a universal resource locator (URL) and a content page accessed via the URL as phishing or not phishing, including:
a URL feature hasher that parses the URL into features and hashes the features to produce URL feature hashes;
a hypertext markup language (HTML) encoder, trained on HTML tokens:
extracted from content pages at example URLs,
encoded into an embedding space, then
decoded to reproduce images captured from rendering of the content pages,
wherein the trained HTML encoder produces an HTML encoding of HTML tokens extracted from the content page; and
phishing classifier layers,
trained on URL feature hashes and HTML encodings of the example URLs, each example URL accompanied by a ground truth classification as phishing or as not phishing,
wherein the phishing classifier layers process the URL feature hashes, and the HTML encoding of the URL to produce at least one likelihood score that the URL and the content page accessed via the URL presents a phishing risk.
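
The data flow recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation, and every name in it (`url_features`, `hash_features`, `phishing_score`, the bucket count, the weights) is a hypothetical choice made for illustration. It assumes a simple tokenization of the URL, the standard signed "hashing trick" for the URL feature hashes, and a single logistic layer standing in for the trained phishing classifier layers. The HTML encoding is assumed to arrive as a plain vector of floats from a separately trained encoder, which is not shown here.

```python
import hashlib
import math
import re
from urllib.parse import urlsplit


def url_features(url):
    """Parse a URL into string features (hypothetical feature scheme)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    feats = [f"scheme={parts.scheme}"]
    feats += [f"host_label={label}" for label in host.split(".") if label]
    feats += [f"path_token={tok}"
              for tok in re.split(r"[^A-Za-z0-9]+", parts.path) if tok]
    feats += [f"query_token={tok}"
              for tok in re.split(r"[^A-Za-z0-9]+", parts.query) if tok]
    return feats


def hash_features(features, n_buckets=64):
    """Map features to a fixed-length vector with the signed hashing trick."""
    vec = [0.0] * n_buckets
    for feat in features:
        digest = hashlib.sha256(feat.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_buckets
        sign = 1.0 if digest[8] & 1 else -1.0  # sign bit reduces collision bias
        vec[index] += sign
    return vec


def phishing_score(url_hashes, html_encoding, weights, bias=0.0):
    """Single logistic layer over the concatenated inputs.

    Assumption: a stand-in for the trained classifier layers; real weights
    would come from training on labeled example URLs.
    """
    x = list(url_hashes) + list(html_encoding)
    if len(x) != len(weights):
        raise ValueError("weights must match the concatenated input length")
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    # Numerically stable sigmoid.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)
```

In use, the URL vector from `hash_features(url_features(url))` would be passed to `phishing_score` together with the HTML encoding of the content page and a weight vector whose length equals the sum of the two input lengths. The hashing step keeps the URL input at a fixed width however many distinct tokens appear in practice, which is the usual reason for hashing features rather than keeping a vocabulary.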