US 12,277,789 B2
Smart optical character recognition trainer
Radu Stoicescu, Braselton, GA (US); and Jesse Osborne, Burtonsville, MD (US)
Assigned to Innovative Computing & Applied Technology LLC, Reston, VA (US)
Filed by Innovative Computing & Applied Technology LLC, Reston, VA (US)
Filed on Jul. 14, 2022, as Appl. No. 17/864,929.
Prior Publication US 2024/0020999 A1, Jan. 18, 2024
Int. Cl. G06V 30/19 (2022.01); G06F 16/25 (2019.01); G06V 30/18 (2022.01)
CPC G06V 30/19173 (2022.01) [G06F 16/254 (2019.01); G06V 30/18 (2022.01)]
19 Claims
 
1. An apparatus for converting one or more imaged documents to one or more electronic versions of the imaged documents including encoded characters representing extracted textual data for subsequent processing comprising non-transitory computer readable media having encoded thereon a plurality of instructions for causing a processor to perform a plurality of processes including:
an input queue for accepting said plurality of imaged documents, said input queue to be coupled to a data source, a data migration ETL process, as part of a data versioning repository's storage function, or as part of a data pipeline;
a page extract process to convert each page of an imaged document of the plurality of imaged documents to a rasterized image;
an optical character recognition process to: (i) extract text from a PDF file; (ii) render pages of a PDF document as images; (iii) read and modify the properties of a PDF document; and (iv) build a simple PDF viewer to perform special operations using PDF documents, wherein said optical character recognition process produces as an output encoded textual characters that are machine-readable and can be utilized in word processing software;
an extraction process to extract a plurality of quantitative values from each document, said extraction process including:
a noise detection and characterization functional operation to quantitatively characterize a degree to which each image is affected by noise;
a scanning-artifacts detection and characterization functional operation to quantitatively characterize a degree to which each image is affected by one or more known scanning artifacts;
a page alignment detection functional operation to quantitatively characterize a page alignment during which each document's page contents are geometrically partitioned into one or more page-segments having similar content;
an analyze size and shape of textual characters functional operation to quantitatively characterize a size and shape for an entire distribution of characters in a document, wherein for said entire distribution of characters in the document a number of vertical pixels used to construct each character from top to bottom is determined and a number of horizontal pixels used to construct each character from left to right is determined, and four groups of characters are created based on a general style of representation because of their shared size and shapes;
wherein:
a first group consists of lower-case textual letters from the group consisting of: “a”, “c”, “e”, “m”, “n”, “o”, “r”, “s”, “u”, “v”, “w”, “x”, and “z”;
a second group consists of lower-case textual letters from the group consisting of: “b”, “d”, “f”, “g”, “h”, “l”, “k”, “p”, “q”, “t”, and “y”;
a third group consists of all upper-case textual letters; and
a fourth group consists of Arabic numerals;
a statistical analysis of letter frequency functional operation to enumerate character frequency for each textual character identified in a document, wherein based on a subject matter of a document, preloaded dictionaries of domain-specific acronyms, abbreviations, and initialisms are employed to adjust for expected proportions of letter frequencies;
a letter placement and page alignment functional operation to: (i) detect a placement and composition of letter content; (ii) to determine page segmentation; and (iii) to detect a presence of white space between page segments and margins;
a typography analysis and detection functional operation to detect one or more typefaces used in a document;
a context and dictionary-based spelling analysis functional operation to perform spelling analysis and enumerate an occurrence of misspelled words using common and domain-specific dictionaries to evaluate acceptability of words, wherein based on a subject matter of a document, one or more preloaded dictionaries of domain-specific acronyms, abbreviations, and initialisms are employed;
a classifier algorithm to receive said plurality of quantitative values to perform classification employing a principal components analysis, which performs combinatorial scoring to aggregate the plurality of quantitative values to sort a document into one of a plurality of classes based on a combined score that is derived from multiple conditions determined through analysis, said plurality of classes including:
(i) an extremely poor-quality document, in which analysis has measured that a document's quality is too low to be considered for automated trainer optimization, wherein the quality of said document is so low that its content may be uninterpretable even with human intervention;
(ii) a low-quality document, in which analysis has measured that a document's quality can be improved or enhanced through automated optimization techniques, wherein the quality of a document in this class is below a predetermined acceptable threshold;
(iii) a high-quality document, in which analysis has measured that a document's quality now meets or exceeds said predetermined acceptable threshold; and
a suspense queue for storing a plurality of extremely poor-quality documents deemed not appropriate for automated processing;
an output queue populated with a plurality of high-quality documents determined to meet or exceed said predetermined quality standards, said plurality of high-quality documents including: (i) original high-quality documents; and (ii) derived high-quality documents;
wherein each of said documents includes one or more files reporting on a chain-of-custody for audit purposes;
wherein for said original high-quality documents, said one or more files include: (i) an unadulterated PDF document used as input; and (ii) a text file report stating one or more reasons the unadulterated PDF was deemed acceptable without further processing;
wherein for said derived high-quality documents, said one or more files include: (i) an original unadulterated PDF document used as input; (ii) an enhanced PDF document resulting from optimization, which reflects changes in image and textual content; (iii) a stylized HTML file report on one or more differences between said original document and said enhanced document, which stylized HTML file report includes color-coding of each line of text that differs between said original unadulterated PDF document and said enhanced document, highlighting any character differences existing in each line of text; and (iv) a similarity matrix graphically depicting a location of discovered differences throughout a document's content, which includes a jpg file representing a full context of a document with color symbology employed to show similarity between said original document and said enhanced document using colors where brightness indicates similarity and conversely darkness represents non-similarity; and
an optimizer process to receive the plurality of low-quality documents and to apply one or more tailored optimization techniques based on the plurality of quantitative values extracted by the extraction process for each of the plurality of low-quality documents to improve a quality of said low-quality documents and to return a plurality of low-quality optimized documents to the extraction process for reprocessing.
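
The routing recited in claim 1 (classify each document, send extremely poor-quality documents to the suspense queue, send high-quality documents to the output queue, and return low-quality documents through the optimizer for reprocessing) can be illustrated with a minimal Python sketch. Everything numeric or named here is an assumption made for illustration and does not come from the patent: the thresholds `SUSPENSE_BELOW` and `ACCEPT_AT`, the pass limit `MAX_PASSES`, the metric names, and the toy `optimize` step are all hypothetical. In particular, `combined_score` is a plain unweighted mean standing in for the claimed principal components analysis, and the choice to move documents that are still low-quality after `MAX_PASSES` into the suspense queue is not stated in the claim.

```python
from dataclasses import dataclass, field

# Hypothetical thresholds on a combined score in [0, 1]; the claim gives no numbers.
SUSPENSE_BELOW = 0.2
ACCEPT_AT = 0.8
MAX_PASSES = 3  # assumed cap so the optimizer feedback loop always terminates


@dataclass
class Document:
    name: str
    metrics: dict  # quantitative values from the extraction process, each in [0, 1]
    history: list = field(default_factory=list)  # simple chain-of-custody notes


def combined_score(metrics):
    """Stand-in for the claimed classifier: an unweighted mean, not PCA."""
    return sum(metrics.values()) / len(metrics)


def classify(doc):
    """Sort a document into one of the three classes named in the claim."""
    score = combined_score(doc.metrics)
    if score < SUSPENSE_BELOW:
        return "extremely_poor"
    if score >= ACCEPT_AT:
        return "high"
    return "low"


def optimize(doc):
    """Toy optimizer: raise the weakest metric to mimic a tailored technique."""
    weakest = min(doc.metrics, key=doc.metrics.get)
    doc.metrics[weakest] = min(1.0, doc.metrics[weakest] + 0.3)
    doc.history.append(f"optimized {weakest}")


def route(input_queue):
    """Return (output_queue, suspense_queue) for a list of documents."""
    output_queue, suspense_queue = [], []
    for doc in input_queue:
        label = classify(doc)
        passes = 0
        # Low-quality documents go to the optimizer and are then re-classified.
        while label == "low" and passes < MAX_PASSES:
            optimize(doc)
            label = classify(doc)
            passes += 1
        if label == "high":
            kind = "derived" if passes else "original"
            doc.history.append(f"{kind} high-quality")
            output_queue.append(doc)
        else:
            # Extremely poor documents, plus (by assumption) any that stay low.
            suspense_queue.append(doc)
    return output_queue, suspense_queue


if __name__ == "__main__":
    docs = [
        Document("clean", {"noise": 0.9, "alignment": 0.9, "spelling": 0.9}),
        Document("smudged", {"noise": 0.4, "alignment": 0.9, "spelling": 0.9}),
        Document("ruined", {"noise": 0.1, "alignment": 0.1, "spelling": 0.1}),
    ]
    accepted, suspended = route(docs)
    print([d.name for d in accepted], [d.name for d in suspended])
```

In this sketch the `history` list plays the role of the chain-of-custody record only loosely: it distinguishes original from derived high-quality documents, but it does not produce the text, HTML, or jpg report files that the claim requires.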