US 12,314,668 B2
Natural language processing text-image-layout transformer
Lukasz Konrad Borchmann, Warsaw (PL); Dawid Andrzej Jurkiewicz, Poznan (PL); Tomasz Dwojak, Poznan (PL); Michal Waldemar Pietruszka, Cracow (PL); and Gabriela Klaudia Palka, Poznan (PL)
Assigned to Snowflake Inc., Bozeman, MT (US)
Filed by APPLICA SP. Z O.O., Warsaw (PL)
Filed on Jul. 31, 2023, as Appl. No. 18/362,886.
Application 18/362,886 is a continuation of application No. 17/651,311, filed on Feb. 16, 2022, granted, now Pat. No. 11,763,087.
Claims priority of provisional application 63/150,271, filed on Feb. 17, 2021.
Prior Publication US 2024/0028832 A1, Jan. 25, 2024
This patent is subject to a terminal disclaimer.
Int. Cl. G06T 11/60 (2006.01); G06F 40/106 (2020.01); G06F 40/295 (2020.01); G06F 40/30 (2020.01); G06N 3/08 (2023.01)
CPC G06F 40/295 (2020.01) [G06F 40/106 (2020.01); G06F 40/30 (2020.01); G06N 3/08 (2013.01); G06T 11/60 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A method comprising:
providing access to a machine learning model for iterative training on Natural Language Processing (NLP) of real-world documents, the providing access to the machine learning model comprising:
receiving, at a text-image-layout transformer (TILT) NLP system of a cloud data platform, multi-modal input data comprising text data, layout data, and image data;
executing multiple NLP models on the multi-modal input data, the multiple NLP models comprising:
an encoder-decoder model configured to generate text-based features not present in the text data;
a spatial model configured to implement spatial relationship features in the layout data; and
a multi-modal model configured to add visual context features to process the image data;
receiving additional data associated with the multi-modal input data, the additional data comprising semantic data associated with the text-based features, sequential distance data associated with the spatial relationship features, and spatial relationship data associated with the visual context features;
maintaining a distinction between the semantic data, the sequential distance data, and the spatial relationship data in the real-world documents, the maintaining the distinction comprising separating the semantic data from the sequential distance data and the spatial relationship data;
providing regularization augmentation to each of the text-based features, the spatial relationship features, and the visual context features while enabling cross-modal learning among the encoder-decoder model, the spatial model, and the multi-modal model; and
training the machine learning model on the multi-modal input data, the text-based features, the spatial relationship features, and the visual context features.
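 
Illustrative sketch (not the patented implementation): claim 1 recites fusing text, layout, and image modalities while keeping the semantic, spatial, and visual signals distinct and applying regularization to each before joint training. The PyTorch fragment below shows one plausible way such a fusion step could be arranged; every module name, dimension, and the use of per-modality dropout to stand in for "regularization augmentation" is an assumption for illustration only, not language from the claim.

# Minimal sketch, assuming PyTorch; all names, dimensions, and the dropout-based
# regularization are illustrative assumptions, not the claimed TILT system.
import torch
import torch.nn as nn


class MultiModalFusion(nn.Module):
    """Fuses text tokens, 2D layout boxes, and per-region image features into one
    sequence embedding. Separate projections and separate dropout keep the
    semantic / spatial / visual contributions distinct until they are summed,
    after which downstream attention can learn cross-modal interactions."""

    def __init__(self, vocab_size=32000, d_model=512, image_feat_dim=2048, p_drop=0.1):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)    # semantic (text) features
        self.layout_proj = nn.Linear(4, d_model)              # spatial features from (x0, y0, x1, y1) boxes
        self.image_proj = nn.Linear(image_feat_dim, d_model)  # visual context features per token region
        # Per-modality dropout illustrates modality-specific regularization.
        self.drop_text = nn.Dropout(p_drop)
        self.drop_layout = nn.Dropout(p_drop)
        self.drop_image = nn.Dropout(p_drop)

    def forward(self, token_ids, boxes, image_feats):
        text = self.drop_text(self.token_emb(token_ids))
        layout = self.drop_layout(self.layout_proj(boxes))
        vision = self.drop_image(self.image_proj(image_feats))
        # Summation fuses the modalities; the separate projections above keep
        # the three signals distinct up to this point.
        return text + layout + vision


if __name__ == "__main__":
    batch, seq_len = 2, 16
    fuser = MultiModalFusion()
    token_ids = torch.randint(0, 32000, (batch, seq_len))
    boxes = torch.rand(batch, seq_len, 4)            # normalized bounding boxes
    image_feats = torch.rand(batch, seq_len, 2048)   # e.g., pooled visual features per token region
    fused = fuser(token_ids, boxes, image_feats)
    print(fused.shape)                               # torch.Size([2, 16, 512])

In a system of the kind claimed, the fused representation would then be consumed by an encoder-decoder model during training; the sketch stops at the fusion step and makes no assumption about that downstream architecture.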