| CPC G06V 30/414 (2022.01) [G06F 18/214 (2023.01); G06F 18/253 (2023.01); G06F 40/30 (2020.01); G06N 3/02 (2013.01); G06N 20/00 (2019.01)] | 15 Claims |

|
1. A non-transitory computer-readable storage medium having instructions stored thereon, which, when executed by a processor of a computing device, cause the processor to perform actions comprising:
receiving image data that encodes a document;
extracting, from the image data, a set of sentences and a corresponding bounding box for each sentence of the set of sentences;
for each sentence of the set of sentences, generating a set of predicted features using an encoder machine learning (ML) model that performs cross-attention between a set of masked-textual features for the sentence and a set of masked-visual features for the sentence, wherein the set of masked-textual features is based on a masking function and the sentence, and the set of masked-visual features is based on the masking function and the corresponding bounding box for the sentence; and
pretraining a document-encoder ML model based on the set of predicted features for each sentence of the set of sentences and one or more pretraining tasks, including visual-language alignment to enforce alignment between text and image regions, and jointly pretraining, in association with pretraining the document-encoder ML model, an image encoder that derives visual features for semantic regions, wherein at least one visual feature comprises a table, a font size, a style, or a figure.
|
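The masking and cross-attention step recited in claim 1 can be illustrated with a minimal NumPy sketch for a single sentence. This is not the claimed implementation. The helper names (`random_mask`, `apply_mask`, `cross_attention`), the zero-fill masking strategy, the strip-based visual features, the random stand-in embeddings, and the single-head attention without learned projections are all assumptions made for illustration. The pretraining tasks and the joint training of the image encoder are not shown.

```python
import numpy as np

def random_mask(n, ratio, rng):
    """Hypothetical masking function: boolean vector with round(n * ratio) True entries."""
    k = int(round(n * ratio))
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=k, replace=False)] = True
    return mask

def apply_mask(feats, mask):
    """Zero out the feature rows selected by the mask (an assumed masking strategy)."""
    out = feats.copy()
    out[mask] = 0.0
    return out

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention with no learned projections."""
    scores = queries @ keys_values.T / np.sqrt(queries.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values, weights

rng = np.random.default_rng(0)
dim = 8

# Stand-in for the document image and one sentence's bounding box (x0, y0, x1, y1).
image = rng.random((64, 64))
x0, y0, x1, y1 = (8, 16, 40, 32)
region = image[y0:y1, x0:x1]  # pixels inside the bounding box, shape (16, 32)

# Assumed visual features: split the region into 4 vertical strips and
# project each flattened strip to `dim` values with a random matrix.
strips = np.stack([s.ravel() for s in np.split(region, 4, axis=1)])  # (4, 128)
visual_feats = strips @ rng.normal(size=(strips.shape[1], dim))      # (4, dim)

# Stand-in textual features: one random embedding per token of the sentence.
text_feats = rng.normal(size=(5, dim))

# The same masking function is applied to both modalities.
text_mask = random_mask(len(text_feats), 0.4, rng)
visual_mask = random_mask(len(visual_feats), 0.5, rng)
masked_text = apply_mask(text_feats, text_mask)
masked_visual = apply_mask(visual_feats, visual_mask)

# Text positions attend over the visual positions of the same sentence.
predicted, weights = cross_attention(masked_text, masked_visual)
print(predicted.shape, weights.shape)
```

In a trained system the random embeddings would be replaced by the outputs of text and image encoders, and the predicted features would feed pretraining objectives such as the visual-language alignment task named in the claim.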