CPC G06V 30/133 (2022.01) [G06V 30/155 (2022.01); G06V 30/26 (2022.01); G06V 30/41 (2022.01)] | 17 Claims |
1. A method, comprising:
extracting, by at least one processor, a first set of text from a document using a first optical character recognition (OCR) tool;
extracting, by the at least one processor, a second set of text from the document using a second OCR tool;
comparing, by the at least one processor, a first metric of the first set of text to a second metric of the second set of text, the first metric measuring a first level of OCR quality of the first set of text and the second metric measuring a second level of OCR quality of the second set of text;
selecting, by the at least one processor, a first selected text from the first set of text or the second set of text based on a higher level of OCR quality;
extracting, by the at least one processor, a third set of text from the document using a third OCR tool;
comparing, by the at least one processor, a corresponding metric of the first selected text to a third metric of the third set of text, the third metric measuring a third level of OCR quality of the third set of text;
determining, for the first set of text, the second set of text, and the third set of text, a fourth metric, wherein the fourth metric comprises measuring a fourth level of OCR quality based on a respective number of words in respective ones of the first set, second set, or third set of extracted texts from the document divided by a number of pages in the document;
selecting, by the at least one processor, a second selected text from the first selected text or the third set of text based on a higher level of the third level of OCR quality or the fourth level of OCR quality; and
storing, by the at least one processor, the second selected text as a final text in a searchable format.
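The selection steps recited in claim 1 can be illustrated with a minimal Python sketch. It rests on several assumptions that are not in the claim: the OCR tools are modeled as hypothetical callables that map document bytes to text, the unspecified first through third metrics are replaced by a stand-in `alpha_ratio` score, and the final selection treats the words-per-page value as a tie-breaker after that score. The claim leaves open how the third and fourth levels of OCR quality are combined, so this is only one possible reading. Storing the final text in a searchable format is omitted.

```python
from typing import Callable, Sequence

# Hypothetical OCR tool interface: document bytes in, extracted text out.
OcrTool = Callable[[bytes], str]


def alpha_ratio(text: str) -> float:
    """Stand-in quality metric (an assumption, not from the claim):
    share of whitespace-separated tokens that are purely alphabetic."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(t.isalpha() for t in tokens) / len(tokens)


def words_per_page(text: str, num_pages: int) -> float:
    """Fourth metric as recited: number of extracted words divided by
    the number of pages in the document."""
    if num_pages <= 0:
        raise ValueError("num_pages must be positive")
    return len(text.split()) / num_pages


def pick_higher(a: str, b: str, score: Callable[[str], object]) -> str:
    """Return whichever text scores higher; ties keep the first argument."""
    return a if score(a) >= score(b) else b


def select_final_text(
    document: bytes,
    num_pages: int,
    tools: Sequence[OcrTool],
    quality: Callable[[str], float] = alpha_ratio,
) -> str:
    """Run three OCR tools and return the text selected as final."""
    # Exactly three tools are expected; unpacking fails otherwise.
    first, second, third = (tool(document) for tool in tools)

    # First selection: compare the first and second sets of text.
    first_selected = pick_higher(first, second, quality)

    # Second selection: compare the first selected text with the third set.
    # Assumed combination: quality score first, words per page as tie-breaker.
    def combined(text: str) -> tuple:
        return (quality(text), words_per_page(text, num_pages))

    return pick_higher(first_selected, third, combined)
```

For example, with three stub tools that return "h3llo w0rld", "hello world", and "hello world again today" for a one-page document, the second text wins the first comparison on `alpha_ratio`, and the third text wins the final comparison because both score 1.0 on `alpha_ratio` but the third has more words per page.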
|