| CPC G06F 40/295 (2020.01) [G06F 40/106 (2020.01); G06F 40/30 (2020.01); G06N 3/08 (2013.01); G06T 11/60 (2013.01)] | 20 Claims |

|
1. A method comprising:
providing access to a machine learning model for iterative training on Natural Language Processing (NLP) of real-world documents, the providing access to the machine learning model comprising:
receiving, at a text-image-layout transformer (TILT) NLP system of a cloud data platform, multi-modal input data comprising text data, layout data, and image data;
executing multiple NLP models on the multi-modal input data, the multiple NLP models comprising:
an encoder-decoder model configured to generate text-based features not present in the text data;
a spatial model configured to implement spatial relationship features in the layout data; and
a multi-modal model configured to add visual context features to process the image data;
receiving additional data associated with the multi-modal input data, the additional data comprising semantic data associated with the text-based features, sequential distance data associated with the spatial relationship features, and spatial relationship data associated with the visual context features;
maintaining a distinction between the semantic data, the sequential distance data, and the spatial relationship data in the real-world documents, the maintaining the distinction comprising separating the semantic data from the sequential distance data and the spatial relationship data;
providing regularization augmentation to each of the text-based features, the spatial relationship features, and the visual context features while enabling cross-modal learning among the encoder-decoder model, the spatial model, and the multi-modal model; and
training the machine learning model on the multi-modal input data, the text-based features, the spatial relationship features, and the visual context features.
|
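
Illustrative note (not part of the claim): the "maintaining a distinction" step recited above can be pictured with a minimal sketch. It assumes each token carries its text and a bounding box, that sequential distance is a signed difference of token indices, and that spatial relationship is a horizontal and vertical offset between box centers. The names `Token`, `sequential_distance`, `spatial_relationship`, and `separate_signals` are hypothetical, and the claim does not specify these particular computations.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Token:
    """Hypothetical token record: text plus a layout bounding box."""
    text: str                                # semantic content
    box: Tuple[float, float, float, float]   # (x0, y0, x1, y1) on the page


def sequential_distance(n: int) -> List[List[int]]:
    """Signed reading-order distance j - i for every token pair (assumed definition)."""
    return [[j - i for j in range(n)] for i in range(n)]


def spatial_relationship(tokens: List[Token]) -> List[List[Tuple[float, float]]]:
    """Horizontal and vertical offsets between box centers (assumed definition)."""
    centers = [((t.box[0] + t.box[2]) / 2, (t.box[1] + t.box[3]) / 2) for t in tokens]
    return [[(cj[0] - ci[0], cj[1] - ci[1]) for cj in centers] for ci in centers]


def separate_signals(tokens: List[Token]) -> Dict[str, object]:
    """Keep the three kinds of data in separate fields rather than one merged feature."""
    return {
        "semantic": [t.text for t in tokens],
        "sequential_distance": sequential_distance(len(tokens)),
        "spatial_relationship": spatial_relationship(tokens),
    }


if __name__ == "__main__":
    page = [Token("Invoice", (0, 0, 10, 2)), Token("Total", (0, 10, 10, 12))]
    signals = separate_signals(page)
    print(signals["semantic"])
    print(signals["sequential_distance"])
    print(signals["spatial_relationship"])
```

Holding the three signals in separate fields means a downstream model could consume each one independently, which is one plausible reading of "separating the semantic data from the sequential distance data and the spatial relationship data".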