US 12,393,768 B2
Layout-aware multimodal pretraining for multimodal document understanding
Mingyang Zhang, San Jose, CA (US); Cheng Li, Mountain View, CA (US); Tao Chen, Sunnyvale, CA (US); Spurthi Amba Hombaiah, Mountain View, CA (US); Michael Bendersky, Cupertino, CA (US); Marc Alexander Najork, Palo Alto, CA (US); and Te-Lin Wu, Los Angeles, CA (US)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Appl. No. 17/928,984
Filed by Google LLC, Mountain View, CA (US)
PCT Filed Dec. 22, 2020, PCT No. PCT/US2020/066588
§ 371(c)(1), (2) Date Dec. 1, 2022,
PCT Pub. No. WO2022/139807, PCT Pub. Date Jun. 30, 2022.
Prior Publication US 2023/0222285 A1, Jul. 13, 2023
Int. Cl. G06F 40/166 (2020.01); G06F 40/109 (2020.01); G06F 40/284 (2020.01); G06V 30/413 (2022.01)
CPC G06F 40/166 (2020.01) [G06F 40/109 (2020.01); G06F 40/284 (2020.01); G06V 30/413 (2022.01)]
20 Claims
1. A computer-implemented method for generating layout-aware document representations, the method comprising:
obtaining, by a computing system comprising one or more computing devices, a document, wherein the document comprises text and one or more images, and wherein layout data is associated with the document;
partitioning, by the computing system, the document into a plurality of blocks based at least in part on the layout data;
processing, by the computing system, the plurality of blocks with a hierarchical encoder model to generate a document-level representation, wherein generating the document-level representation comprises:
processing, by the computing system, each of the plurality of blocks with a machine-learned block-level encoder model of the hierarchical encoder model to respectively generate a plurality of block-level representations for the plurality of blocks, wherein, for each of the plurality of blocks, the layout data associated with such block is provided as input to the machine-learned block-level encoder model;
determining a plurality of block local connections based on recognizing relationships between neighboring blocks of the plurality of blocks;
processing, by the computing system, the plurality of block-level representations and the plurality of block local connections with a machine-learned document-level encoder model of the hierarchical encoder model to generate the document-level representation for the document;
providing, by the computing system, the document-level representation as an output; and
processing, by the computing system, the document-level representation to generate an image suggestion descriptive of a particular image from a set of candidate images to suggest to place in the document, wherein the hierarchical encoder model was trained based at least in part on a loss function that evaluates output document-level representations based on an image-text matching prediction, wherein the image-text matching prediction is descriptive of a binary matching prediction of whether a particular image was replaced with one or more training images from a different training document.
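For illustration only, the data-preparation side of claim 1 can be sketched as follows: grouping page elements into blocks from layout data, deriving block local connections between neighboring blocks, and building a training example for the binary image-text matching prediction. This is a minimal sketch under stated assumptions, not the claimed implementation. The names `Element`, `Block`, `partition_into_blocks`, `local_connections`, and `make_itm_example` are hypothetical, as are the vertical-gap grouping rule, the fixed neighbor window, and the replacement probability; the claim does not specify any of them. The block-level and document-level encoder models are omitted.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Element:
    """Hypothetical page element: a text line or an image identifier."""
    content: str
    is_image: bool
    top: float      # assumed layout data: vertical extent on the page
    bottom: float


@dataclass
class Block:
    """Hypothetical block: one image, or a run of adjacent text lines."""
    contents: List[str]
    is_image: bool
    top: float
    bottom: float


def partition_into_blocks(elements: List[Element],
                          max_gap: float = 10.0) -> List[Block]:
    """Group elements into blocks using layout data (assumed rule:
    text lines separated by at most max_gap share a block; each image
    is its own block)."""
    blocks: List[Block] = []
    for el in sorted(elements, key=lambda e: e.top):
        prev = blocks[-1] if blocks else None
        if (prev is not None and not prev.is_image and not el.is_image
                and el.top - prev.bottom <= max_gap):
            prev.contents.append(el.content)
            prev.bottom = el.bottom
        else:
            blocks.append(Block([el.content], el.is_image, el.top, el.bottom))
    return blocks


def local_connections(num_blocks: int, window: int = 1) -> List[List[bool]]:
    """Boolean mask in which block i is connected to block j when
    |i - j| <= window (an assumed notion of 'neighboring')."""
    return [[abs(i - j) <= window for j in range(num_blocks)]
            for i in range(num_blocks)]


def make_itm_example(blocks: List[Block],
                     other_doc_images: List[str],
                     rng: random.Random,
                     replace_prob: float = 0.5) -> Tuple[List[Block], int]:
    """Return (blocks, label) for image-text matching: label is 1 when
    one image block was replaced with an image from a different
    document, and 0 when the document is left unchanged."""
    image_positions = [i for i, b in enumerate(blocks) if b.is_image]
    if (not image_positions or not other_doc_images
            or rng.random() >= replace_prob):
        return list(blocks), 0
    pos = rng.choice(image_positions)
    swapped = list(blocks)
    swapped[pos] = Block([rng.choice(other_doc_images)], True,
                         blocks[pos].top, blocks[pos].bottom)
    return swapped, 1
```

In such a setup, the mask from `local_connections` could restrict which block-level representations the document-level encoder relates to one another, and the label from `make_itm_example` would serve as the target for the binary matching prediction evaluated by the loss function.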