US 12,333,845 B2
Unified pretraining framework for document understanding
Jiuxiang Gu, Greenbelt City, MD (US); Ani Nenkova Nenkova, Philadelphia, PA (US); Nikolaos Barmpalios, Palo Alto, CA (US); Vlad Ion Morariu, Potomac, MD (US); Tong Sun, San Ramon, CA (US); Rajiv Bhawanji Jain, Falls Church, VA (US); Jason wen yong Kuen, Santa Clara, CA (US); and Handong Zhao, San Jose, CA (US)
Assigned to Adobe Inc., San Jose, CA (US)
Filed by ADOBE INC., San Jose, CA (US)
Filed on Nov. 16, 2021, as Appl. No. 17/528,061.
Prior Publication US 2023/0154221 A1, May 18, 2023
Int. Cl. G06V 30/414 (2022.01); G06F 18/214 (2023.01); G06F 18/25 (2023.01); G06F 40/30 (2020.01); G06N 3/02 (2006.01); G06N 20/00 (2019.01)

CPC G06V 30/414 (2022.01) [G06F 18/214 (2023.01); G06F 18/253 (2023.01); G06F 40/30 (2020.01); G06N 3/02 (2013.01); G06N 20/00 (2019.01)]

15 Claims

1. A non-transitory computer-readable storage medium having instructions stored thereon, which, when executed by a processor of a computing device, cause the processor to perform actions comprising:

receiving image data that encodes a document;

extracting, from the image data, a set of sentences and a corresponding bounding box for each sentence of the set of sentences;

for each sentence of the set of sentences, generating a set of predicted features using an encoder machine learning (ML) model that performs cross-attention between a set of masked-textual features for the sentence and a set of masked-visual features for the sentence, wherein the set of masked-textual features are based on a masking function and the sentence and the set of masked-visual features are based on the masking function and the corresponding bounding box for the sentence; and

pretraining a document-encoder ML model based on the set of predicted features for each sentence of the set of sentences and one or more pretraining tasks including visual-language alignment to enforce alignment between text and image regions and jointly pretraining, in association with pretraining the document-encoder ML model, an image encoder that derives visual features for semantic regions, wherein at least one visual feature comprises a table, a font size, a style, or a figure.
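
The feature-generation step recited above (masking followed by cross-attention between textual and visual features) can be illustrated with a minimal sketch. It is not the claimed implementation. It assumes that each sentence is already represented by a small list of textual feature vectors and a list of visual feature vectors taken from its bounding box, that the masking function simply zeroes the selected vectors, and that attention is plain scaled dot-product attention without learned projections. All function names (mask_rows, cross_attention, predict_features) and the toy values are hypothetical.

```python
import math


def mask_rows(features, mask):
    """Assumed masking function: zero out every feature vector whose flag is True."""
    return [[0.0] * len(row) if flag else list(row)
            for row, flag in zip(features, mask)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Scaled dot-product attention: each query vector attends over all context vectors."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context]
        weights = softmax(scores)
        attended.append([sum(w * c[j] for w, c in zip(weights, context))
                         for j in range(dim)])
    return attended


def predict_features(text_feats, visual_feats, text_mask, visual_mask):
    """Mask both modalities, then let each modality attend over the other."""
    masked_text = mask_rows(text_feats, text_mask)
    masked_visual = mask_rows(visual_feats, visual_mask)
    text_pred = cross_attention(masked_text, masked_visual)    # text queries, visual context
    visual_pred = cross_attention(masked_visual, masked_text)  # visual queries, text context
    return text_pred, visual_pred


# Toy input for one sentence: three textual vectors and two visual vectors (made-up values).
text_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
visual_feats = [[0.5, 0.5], [1.0, 0.0]]
text_mask = [False, True, False]   # the second textual vector is masked
visual_mask = [False, False]       # no visual vector is masked in this example

text_pred, visual_pred = predict_features(text_feats, visual_feats, text_mask, visual_mask)
print(text_pred)
print(visual_pred)
```

In this sketch a masked vector becomes an all-zero query, so its attention weights are uniform and its predicted feature is the average of the other modality's vectors. A pretraining objective such as the visual-language alignment task recited above would then compare predictions of this kind against targets. That objective and any learned parameters are outside the scope of the sketch.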