US 12,455,930 B1
Crawling electronic (web) documents
Brennan Troy Robert Seal, Austin, TX (US); Chris Everett Peterson, Austin, TX (US); Nicholas Anthony Esposito, Round Rock, TX (US); Rachel Gabrielle Mazzini, Dallas, TX (US); Sandeep Bola Ratnakar, Bangalore (IN); and Siddharth Sreekumar, Bangalore (IN)
Assigned to Dell Products L.P., Round Rock, TX (US)
Filed by Dell Products L.P., Round Rock, TX (US)
Filed on Apr. 26, 2024, as Appl. No. 18/647,208.
Int. Cl. G06F 16/951 (2019.01); G06F 16/9538 (2019.01); G06F 16/953 (2019.01); G06F 16/9532 (2019.01); G06F 40/205 (2020.01)
CPC G06F 16/951 (2019.01) [G06F 16/953 (2019.01); G06F 16/9532 (2019.01); G06F 16/9538 (2019.01); G06F 40/205 (2020.01)]
18 Claims
 
8. An information handling system comprising a processor having access to memory media storing instructions executable by the processor to perform operations, comprising:
training, at a first time, an electronic document crawling model based on a first set of electronic documents, including:
for each electronic document of the first set of electronic documents:
obtaining the electronic document including obtaining an entirety of HyperText Markup Language (HTML) of the electronic document;
analyzing a copy of the electronic document, including:
identifying a plurality of elements of the electronic document, each element of the plurality of elements including HTML tags, text associated with the HTML tags, HTML attributes, and scripts;
reducing the electronic document by i) removing portions of the electronic document related to scripts that do not expose functionality of the electronic document and ii) maintaining the HTML tags, text associated with the HTML tags, HTML attributes, and scripts that expose functionality of the electronic document;
creating, based on the reduced electronic document, a plurality of clusters of texts based on a similarity of the HTML tags, the text associated with the HTML tags, and the HTML attributes of each of the plurality of elements;
labeling, for each cluster of the plurality of clusters, the cluster based on the text associated with the HTML tags of one element of the cluster;
storing, at a storage device, data indicating the plurality of clusters and the label for each cluster of the plurality of clusters;
updating, for each cluster of the plurality of clusters, the electronic document crawling model with data indicating the label of the cluster; and
training, at a second time after the first time, the electronic document crawling model, after being updated, based on a second set of electronic documents differing from the first set of electronic documents.
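
For illustration only, the operations recited in claim 8 might be sketched in Python as below. This is a minimal sketch under stated assumptions, not the claimed implementation. The claim does not specify how a script is judged to "expose functionality", which similarity measure is used, or what form the crawling model takes, so the marker-string heuristic in exposes_functionality, the Jaccard similarity with a 0.5 threshold, the greedy clustering, and the dictionary-based model are all hypothetical choices, as are every function and variable name. Only the Python standard library is used.

```python
from html.parser import HTMLParser

VOID = {"br", "hr", "img", "input", "link", "meta"}  # elements without end tags

class ElementCollector(HTMLParser):
    """Identifies elements: HTML tag, attributes, and associated text."""
    def __init__(self):
        super().__init__()
        self.elements, self._stack = [], []

    def handle_starttag(self, tag, attrs):
        el = {"tag": tag, "attrs": dict(attrs), "text": ""}
        self.elements.append(el)
        if tag not in VOID:
            self._stack.append(el)

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:  # attach text to the innermost open element
            el = self._stack[-1]
            el["text"] = (el["text"] + " " + text).strip()

    def handle_endtag(self, tag):
        if any(el["tag"] == tag for el in self._stack):
            while self._stack.pop()["tag"] != tag:
                pass

def exposes_functionality(script_text):
    # ASSUMPTION: a crude marker-string heuristic; the claim names no test.
    return any(m in script_text for m in ("addEventListener", "onclick", "onsubmit"))

def reduce_elements(elements):
    """Drops scripts that do not expose functionality; keeps everything else."""
    return [e for e in elements
            if e["tag"] != "script" or exposes_functionality(e["text"])]

def features(el):
    feats = {"tag:" + el["tag"]}
    feats |= {"attr:" + name for name in el["attrs"]}
    feats |= {"class:" + c for c in (el["attrs"].get("class") or "").split()}
    feats |= {"text:" + w.lower() for w in el["text"].split()}
    return feats

def cluster_elements(elements, threshold=0.5):
    # ASSUMPTION: greedy clustering on Jaccard similarity to each cluster's first member.
    clusters = []
    for el in elements:
        f = features(el)
        for cluster in clusters:
            g = features(cluster[0])
            if len(f & g) / len(f | g) >= threshold:
                cluster.append(el)
                break
        else:
            clusters.append([el])
    return clusters

def label_cluster(cluster):
    """Labels a cluster with the text of one member (falls back to its tag)."""
    return next((e["text"] for e in cluster if e["text"]), cluster[0]["tag"])

def train(model, documents):
    for html in documents:
        collector = ElementCollector()
        collector.feed(html)  # the entire HTML of the document
        collector.close()
        for cluster in cluster_elements(reduce_elements(collector.elements)):
            label = label_cluster(cluster)
            # The in-memory list stands in for the storage device.
            model["clusters"].append({"label": label, "size": len(cluster)})
            model["labels"].add(label)  # update the model with the label
    return model

first_set = ['<nav><a class="menu" href="/laptops">Laptops</a>'
             '<a class="menu" href="/desktops">Desktops</a></nav>'
             '<script>console.log("analytics")</script>'
             '<script>document.addEventListener("click", go)</script>'
             '<button class="buy">Add to cart</button>']
second_set = ['<footer><a class="menu" href="/support">Support</a></footer>']

model = {"labels": set(), "clusters": []}
train(model, first_set)   # training at a first time
train(model, second_set)  # training the updated model at a second time
```

In this sketch the two menu links of the first document fall into one cluster because they share a tag, attribute names, and a class value, and the cluster takes its label from the text of its first member. The second call to train reuses the already updated model, mirroring the two training times recited in the claim.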