US 12,455,927 B1
Multimodal techniques for web information extraction
Shrikant G Nayak, Bangalore (IN); Tejas Duseja, Naya Bazar (IN); and Sathya Prakash Podila Venkata Subramanya, Bangalore (IN)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Sep. 30, 2022, as Appl. No. 17/937,384.
Int. Cl. G06F 16/951 (2019.01); G06F 16/904 (2019.01); G06F 16/958 (2019.01); G06F 30/20 (2020.01); G06F 30/27 (2020.01); G06N 3/08 (2023.01); G06N 20/00 (2019.01)

CPC G06F 16/951 (2019.01) [G06N 3/08 (2013.01); G06F 16/904 (2019.01); G06F 16/958 (2019.01); G06F 30/20 (2020.01); G06F 30/27 (2020.01); G06N 20/00 (2019.01)]

20 Claims

1. A system, comprising:

one or more computing devices;

wherein the one or more computing devices include instructions that upon execution on or across the one or more computing devices:

determine, at an analytics service of a cloud provider network, a set of target attributes for which respective values are to be extracted from web pages of one or more web sites, wherein individual ones of the web pages comprise data about one or more entities whose attributes are included in the set of target attributes;

prepare, at the analytics service, a neural network based model to be used to extract the respective values, wherein the neural network based model comprises a transformer encoder, and wherein preparation of the neural network based model comprises:

identifying an input data set for a first training phase of the neural network based model, wherein the input data set comprises, corresponding to a first web page of a first plurality of web pages, at least (a) a screenshot of the first web page, (b) markup language content of the first web page, wherein the markup language content indicates a plurality of elements arranged in accordance with a document model, wherein a particular element of the plurality of elements comprises a value of an attribute of the set of target attributes, and (c) a representation of respective bounding boxes corresponding to at least some elements of the plurality of elements, and wherein the input data set does not comprise a label for the first web page;

generating respective multi-modal embedding representations of the first plurality of web pages, wherein a multi-modal embedding representation of the first web page comprises embeddings of at least (a) respective subdivisions of the screenshot of the first web page, (b) text content of respective elements of the plurality of elements of the first web page, (c) respective bounding boxes included in the input data set for the first web page, and (d) respective document model paths of individual elements of the plurality of elements of the first web page;

conducting the first training phase of the neural network based model, wherein the first training phase comprises self-supervised learning of the transformer encoder, and wherein in the first training phase, parameters of the transformer encoder are learned by jointly optimizing a plurality of multi-modal loss functions corresponding to respective tasks, wherein the plurality of multi-modal loss functions includes:

a first loss function associated with a first task which comprises reconstructing text of a masked element of the plurality of elements of the first web page, and

a second loss function associated with a second task comprising predicting an overlap between (a) a masked subdivision of the screenshot of the first web page and (b) a bounding box of an element of the plurality of elements;

conducting a second training phase of the neural network based model, wherein the second training phase comprises supervised learning to classify elements of web pages, wherein input of the second training phase comprises output of a hidden layer of the transformer encoder, wherein the output of the hidden layer corresponds to a second plurality of web pages, and wherein at least some parameters of the hidden layer were learned in the first training phase; and

generate, at the analytics service, a response to a query pertaining to a particular web page, wherein the response to the query is based at least in part on (a) classification, by a version of the neural network based model obtained after the second training phase, of a particular element of the particular web page, and (b) a determination, from the particular element, of a value of a particular target attribute of the set of target attributes.
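
As an illustration only, the second self-supervised task recited in claim 1 (predicting an overlap between a masked subdivision of the screenshot and a bounding box of an element) needs a target value for each subdivision and element pair. The claim does not state how the overlap is measured or how the loss functions are combined. The sketch below assumes a uniform grid of square subdivisions, an overlap defined as the fraction of the subdivision's area covered by the bounding box, and a weighted sum as the joint objective. All names (`grid_patches`, `overlap_fraction`, `joint_loss`) and the weights are hypothetical and do not come from the patent.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixels; a hypothetical representation.
Box = Tuple[float, float, float, float]


def grid_patches(width: int, height: int, patch: int) -> List[Box]:
    """Split a width x height screenshot into patch x patch subdivisions.

    Patches are returned in row-major order; patches on the right and
    bottom edges are clipped to the screenshot bounds.
    """
    return [
        (float(x), float(y),
         float(min(x + patch, width)), float(min(y + patch, height)))
        for y in range(0, height, patch)
        for x in range(0, width, patch)
    ]


def overlap_fraction(patch_box: Box, element_box: Box) -> float:
    """Fraction of the patch's area covered by the element's bounding box.

    Assumption: this ratio is one plausible overlap target; the claim
    does not define the measure.
    """
    left = max(patch_box[0], element_box[0])
    top = max(patch_box[1], element_box[1])
    right = min(patch_box[2], element_box[2])
    bottom = min(patch_box[3], element_box[3])
    intersection = max(0.0, right - left) * max(0.0, bottom - top)
    area = (patch_box[2] - patch_box[0]) * (patch_box[3] - patch_box[1])
    return intersection / area if area > 0 else 0.0


def joint_loss(text_loss: float, overlap_loss: float,
               w_text: float = 1.0, w_overlap: float = 1.0) -> float:
    """Combine per-task losses as a weighted sum.

    Assumption: a weighted sum is one common way to optimize several
    losses jointly; the weights here are illustrative.
    """
    return w_text * text_loss + w_overlap * overlap_loss


# Usage: a 64 x 32 screenshot split into two 32 x 32 subdivisions, with
# one element whose bounding box straddles both of them.
patches = grid_patches(64, 32, 32)
element = (16.0, 0.0, 48.0, 16.0)
targets = [overlap_fraction(p, element) for p in patches]
```

In this sketch the overlap targets would serve as labels for the masked subdivisions during the first training phase, while the text reconstruction loss would be computed separately and passed to `joint_loss` alongside the overlap loss.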