| CPC G06F 16/3347 (2019.01) [G01V 11/002 (2013.01); G06F 40/279 (2020.01); G06V 30/412 (2022.01); G06V 30/413 (2022.01); G06V 2201/10 (2022.01)] | 14 Claims |

|
1. A method for extracting text associated with user-defined attributes from a plurality of documents, the method comprising:
identifying relevant documents related to a specific entity from storage, wherein the specific entity is either a specific oil well or is associated with one from a group consisting of a wellbore, an oilfield, and a prospect;
extracting text and spatial coordinates of the text;
identifying significant document entities and associated spatial locations of the significant document entities through page layout analysis;
ranking pages of the relevant documents based on the extracted text and the spatial coordinates using term frequency-inverse document frequency (TFIDF) or Okapi Best Match 25 (Okapi BM25);
extracting user-defined attributes from the pages of the relevant documents using a deep learning language model;
aggregating first attribute values associated with the user-defined attributes from one of the relevant documents into a single record;
aggregating second attribute values associated with the user-defined attributes across the relevant documents;
aggregating an attribute value across multiple sources based on at least one of a majority vote from among the multiple sources, a confidence probability of the attribute value from among the multiple sources, source metadata, and source priority, wherein the majority vote involves determining which attribute value was extracted a majority of the time; and
writing aggregated records to a database.
|
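
The ranking and multi-source aggregation steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the function names `bm25_rank` and `aggregate_attribute`, the whitespace tokenizer, the parameters `k1` and `b`, and the fall-back rule (highest single confidence when no value has a strict majority) are all assumptions made for illustration. The sketch also ignores the spatial coordinates and the deep learning extraction step that the claim recites.

```python
import math
from collections import Counter


def bm25_rank(pages, query, k1=1.5, b=0.75):
    """Return page indices ordered by descending Okapi BM25 score.

    Hypothetical helper: pages are plain strings, tokenized on whitespace.
    """
    if not pages:
        return []
    docs = [page.lower().split() for page in pages]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n or 1.0  # avoid dividing by zero
    terms = query.lower().split()
    # Number of pages containing each query term.
    df = {t: sum(1 for d in docs if t in d) for t in terms}

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in terms:
            # Non-negative IDF variant: log(1 + (n - df + 0.5) / (df + 0.5)).
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


def aggregate_attribute(candidates):
    """Pick one value from (value, confidence) pairs gathered across sources.

    Assumed policy: a value extracted in more than half of the sources wins;
    otherwise fall back to the value with the highest single confidence.
    """
    if not candidates:
        return None
    votes = Counter(value for value, _ in candidates)
    top_value, top_count = votes.most_common(1)[0]
    if top_count > len(candidates) / 2:
        return top_value
    return max(candidates, key=lambda pair: pair[1])[0]


if __name__ == "__main__":
    pages = [
        "well A total depth 3200 m",
        "invoice for catering services",
        "total depth of well A reported as total depth 3200",
    ]
    print(bm25_rank(pages, "total depth"))
    print(aggregate_attribute([("3200 m", 0.9), ("3200 m", 0.7), ("3100 m", 0.95)]))
```

In this sketch the majority vote takes precedence over confidence, so a value extracted twice with moderate confidence beats a value extracted once with high confidence. Source metadata and source priority, which the claim also lists, could be added as further tie-breakers in the same function.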