| CPC G06F 16/35 (2019.01) [G06F 16/3334 (2019.01)] | 18 Claims |

|
1. A method of extracting data from a set of documents, the method comprising:
for each document of the set of documents stored in a database:
determining, by a processor, a plurality of spatial features based on a set of keywords and a set of entities extracted from each document of the set of documents, wherein the plurality of spatial features comprises a plurality of text features, a plurality of layout features, and a plurality of location features;
determining, by the processor, a variance between at least one spatial feature of the plurality of spatial features determined for each document of the set of documents;
determining, by the processor, a layout based on the plurality of spatial features and the variance;
clustering, by the processor, each document of the set of documents in at least one predefined cluster of a plurality of predefined clusters based on a similarity between the layouts of the set of documents using a first machine learning model executed on a computing device, wherein each document from the plurality of predefined clusters corresponds to a unique document layout from a plurality of document layouts, wherein the first machine learning model is trained based on first training data comprising a plurality of documents corresponding to the plurality of document layouts, and wherein the similarity between layouts of at least two documents of the set of documents is determined by selecting one or more spatial connections between the set of keywords and an entity in each document of the at least two documents based on the variance;
selecting, by the processor, one or more spatial features from the plurality of spatial features using a second machine learning model that is different from the first machine learning model, wherein the second machine learning model is trained to select the one or more spatial features from the plurality of spatial features based on a probability of accuracy of each spatial feature of the plurality of spatial features based on second training data; and
extracting, by the processor, in each document of the set of documents, data of the set of entities corresponding to the set of keywords based on the selection of the one or more spatial features and the similarity between the layouts of at least two documents of the set of documents using a third machine learning model, wherein the third machine learning model is trained to determine a feature-based probability for the extraction of the data of the set of entities corresponding to the set of keywords based on third training data.
|
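
As a non-limiting illustration only, the variance and layout-similarity steps recited above may be sketched as follows. The sketch is not the claimed implementation: it replaces the first machine learning model with a simple tolerance-based grouping, and every name and value in it (`docs`, `offset`, `offset_variance`, `group_by_layout`, `tolerance`, the coordinates) is a hypothetical assumption introduced for the example. It treats the keyword-to-entity displacement on the page as one possible location feature.

```python
from statistics import pvariance

# Hypothetical input: page coordinates (x, y) of one keyword and of the
# entity value associated with it, for each document.
docs = {
    "doc_a": {"keyword": (100, 200), "entity": (220, 200)},
    "doc_b": {"keyword": (102, 340), "entity": (221, 340)},
    "doc_c": {"keyword": (100, 200), "entity": (100, 230)},
}


def offset(doc):
    """Assumed location feature: displacement from keyword to entity."""
    (kx, ky), (ex, ey) = doc["keyword"], doc["entity"]
    return (ex - kx, ey - ky)


def offset_variance(docs):
    """Population variance of the x and y offsets across all documents."""
    offsets = [offset(d) for d in docs.values()]
    return (
        pvariance([o[0] for o in offsets]),
        pvariance([o[1] for o in offsets]),
    )


def group_by_layout(docs, tolerance=10):
    """Greedy grouping (a stand-in for a trained clustering model):
    a document joins the first group whose representative offset is
    within `tolerance` on both axes, otherwise it starts a new group."""
    groups = []  # list of (representative_offset, member_names)
    for name, doc in docs.items():
        dx, dy = offset(doc)
        for rep, members in groups:
            if abs(rep[0] - dx) <= tolerance and abs(rep[1] - dy) <= tolerance:
                members.append(name)
                break
        else:
            groups.append(((dx, dy), [name]))
    return [members for _, members in groups]
```

In this sketch, documents whose entity sits to the right of the keyword fall into one group and a document whose entity sits below the keyword falls into another, which loosely mirrors grouping by a shared spatial connection between keyword and entity.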