CPC G06V 30/414 (2022.01) [G06F 40/232 (2020.01); G06F 40/263 (2020.01); G06F 40/284 (2020.01)] | 20 Claims |
1. A computerized method for extracting data from electronic documents using optical character recognition (OCR) and non-OCR based text extraction, the method comprising:
initiating, by a server computing device, non-OCR based text extraction for each of a plurality of pages of an electronic document;
calculating, by the server computing device, a document text coverage percentage corresponding to the non-OCR based text extraction for the electronic document as a whole;
in response to determining that the document text coverage percentage for the electronic document as a whole is below a first threshold, initiating, by the server computing device, OCR for the electronic document as a whole;
calculating, by the server computing device, a page text coverage percentage corresponding to the non-OCR based text extraction for one or more pages of the electronic document;
in response to determining that the page text coverage percentage for one or more pages of the electronic document is below a second threshold, initiating, by the server computing device, OCR for the one or more pages of the electronic document; and
combining, by the server computing device, first text extracted from the electronic document using non-OCR based text extraction and second text extracted from the electronic document using OCR.
|
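The control flow recited in claim 1 can be illustrated with a minimal Python sketch. It is one possible reading of the claim, not the patented implementation. The names `Page`, `coverage_pct`, `ocr_page`, `doc_threshold`, and `page_threshold` are hypothetical, as are the default threshold values. The claim does not define how a coverage percentage is computed, so the sketch takes per-page coverage as a given input and assumes the document-level figure is the mean of the page values. The merge policy in the combining step (prefer OCR text where OCR ran and returned something) is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Page:
    native_text: str     # text from non-OCR extraction (e.g. an embedded text layer)
    coverage_pct: float  # assumed input: 0-100, how much of the page the native text covers


def extract_document(
    pages: List[Page],
    ocr_page: Callable[[int], str],  # hypothetical OCR hook: page index -> recognized text
    doc_threshold: float = 50.0,     # "first threshold" (value is illustrative)
    page_threshold: float = 50.0,    # "second threshold" (value is illustrative)
) -> List[str]:
    """Return one combined text string per page."""
    if not pages:
        return []

    # Document text coverage: assumed here to be the mean of the page values.
    doc_coverage = sum(p.coverage_pct for p in pages) / len(pages)

    if doc_coverage < doc_threshold:
        # Below the first threshold: OCR the document as a whole.
        to_ocr = set(range(len(pages)))
    else:
        # Otherwise OCR only the pages below the second threshold.
        to_ocr = {i for i, p in enumerate(pages) if p.coverage_pct < page_threshold}

    ocr_text: Dict[int, str] = {i: ocr_page(i) for i in sorted(to_ocr)}

    # Combine: use OCR text where OCR ran and produced output,
    # and fall back to the non-OCR text everywhere else.
    return [ocr_text.get(i) or page.native_text for i, page in enumerate(pages)]
```

In this sketch the two thresholds are independent parameters, so a caller could, for example, set a low document-level threshold to avoid OCR on mostly digital documents while still catching individual scanned pages with a stricter page-level threshold.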