US 11,704,352 B2
Automated categorization and assembly of low-quality images into electronic documents
Van Nguyen, Plano, TX (US); Sean Michael Byrne, Tampa, FL (US); Syed Talha, McKinney, TX (US); Aftab Khan, Richardson, TX (US); Beena Khushalani, Moorpark, CA (US); and Sharad K. Kalyani, Coppell, TX (US)
Assigned to Bank of America Corporation, Charlotte, NC (US)
Filed by BANK OF AMERICA CORPORATION, Charlotte, NC (US)
Filed on May 3, 2021, as Appl. No. 17/306,374.
Prior Publication US 2022/0350830 A1, Nov. 3, 2022
Int. Cl. G06F 16/35 (2019.01); G06F 16/33 (2019.01); G06N 20/00 (2019.01); G06V 10/30 (2022.01); G06V 30/416 (2022.01); G06F 18/21 (2023.01)
CPC G06F 16/35 (2019.01) [G06F 16/3347 (2019.01); G06F 18/2178 (2023.01); G06N 20/00 (2019.01); G06V 10/30 (2022.01); G06V 30/416 (2022.01)]
20 Claims
 
1. An apparatus comprising:
a memory configured to store:
an optical character recognition (OCR) algorithm; and
a natural language processing (NLP) algorithm;
a hardware processor communicatively coupled to the memory, the hardware processor configured to:
receive an image of a page of a physical document;
convert, by executing the OCR algorithm, the image into a set of text;
identify one or more errors in the set of text, the one or more errors associated with noise in the image, wherein each error of the one or more errors is assigned to an error type of a plurality of error types;
generate a feature vector from the set of text, the feature vector comprising:
a first plurality of features obtained by executing the NLP algorithm on the set of text; and
a second plurality of features, wherein each feature of the second plurality of features is associated with an error type of the plurality of error types and provides a measure of a quantity of errors of the one or more errors that are assigned to the associated error type;
assign, based on the feature vector, the image to a first document category of a set of document categories, wherein:
documents assigned to the first document category share one or more characteristics; and
the feature vector is associated with a probability that the physical document associated with the image comprises the one or more characteristics, wherein the probability is greater than a threshold; and
in response to assigning the image to the first document category, store the image in a database as a page of an electronic document belonging to the first document category.
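
The processing steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the error types (`digit_in_word`, `stray_symbol`), the vocabulary, the category names, the weights, the 0.6 threshold, and the in-memory dictionary standing in for the database are all hypothetical. The OCR step is assumed to have already produced the text. The sketch only shows how per-error-type counts can be appended to NLP-derived features and how a category is assigned only when its probability exceeds a threshold.

```python
import math
import re
from collections import Counter

# Hypothetical error types; the claim only requires "a plurality of error types".
ERROR_PATTERNS = {
    "digit_in_word": re.compile(r"^(?=.*[a-z])(?=.*\d)[a-z\d]+$"),
    "stray_symbol": re.compile(r"^[^a-z\d]+$"),
}
# Hypothetical vocabulary standing in for the NLP-derived features.
VOCAB = ["loan", "borrower", "invoice", "total"]
# Hypothetical linear weights per category (4 NLP features + 2 error features).
WEIGHTS = {
    "mortgage": [8.0, 8.0, 0.0, 0.0, 0.0, 0.0],
    "invoice": [0.0, 0.0, 8.0, 8.0, 0.0, 0.0],
}


def nlp_features(tokens):
    """First plurality of features: relative frequency of each vocabulary word."""
    counts = Counter(tokens)
    n = max(len(tokens), 1)
    return [counts[word] / n for word in VOCAB]


def error_features(tokens):
    """Second plurality of features: one measure of error quantity per error type."""
    counts = Counter()
    for token in tokens:
        for name, pattern in ERROR_PATTERNS.items():
            if pattern.match(token):
                counts[name] += 1
    n = max(len(tokens), 1)
    return [counts[name] / n for name in ERROR_PATTERNS]


def feature_vector(text):
    """Concatenate the NLP features and the per-error-type features."""
    tokens = text.lower().split()
    return nlp_features(tokens) + error_features(tokens)


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def assign_category(vector, threshold=0.6):
    """Return (category, probability); category is None at or below the threshold."""
    names = list(WEIGHTS)
    scores = [sum(w * x for w, x in zip(WEIGHTS[name], vector)) for name in names]
    probs = softmax(scores)
    best = max(range(len(names)), key=lambda i: probs[i])
    category = names[best] if probs[best] > threshold else None
    return category, probs[best]


def store_page(database, category, image_id):
    """Append the image as a page of the electronic document for its category."""
    database.setdefault(category, []).append(image_id)


# Usage: text as it might come out of OCR on a noisy scan.
database = {}
ocr_text = "l0an borrower loan ~~ borrower"
vector = feature_vector(ocr_text)
category, probability = assign_category(vector)
if category is not None:
    store_page(database, category, "scan_0001.png")
```

In this sketch an image whose best probability does not exceed the threshold is simply left unassigned. Claim 1 does not say what happens in that case, so that behavior is an assumption of the example.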