US 12,292,934 B2
Classifying documents using geometric information
Mahbub Gani, London (GB)
Assigned to Sage Global Services Limited, Newcastle (GB)
Filed by Sage Global Services Limited, Newcastle Upon Tyne (GB)
Filed on Sep. 7, 2022, as Appl. No. 17/939,809.
Prior Publication US 2024/0078270 A1, Mar. 7, 2024
Int. Cl. G06F 16/906 (2019.01); G06F 16/93 (2019.01); G06V 30/414 (2022.01); G06V 30/418 (2022.01)
CPC G06F 16/906 (2019.01) [G06F 16/93 (2019.01); G06V 30/414 (2022.01); G06V 30/418 (2022.01)]
27 Claims
 
1. A computer-implemented method for classifying a document, comprising:
receiving a plurality of reference documents;
at a hardware processing device, for each of the reference documents:
automatically identifying a plurality of bounding boxes, each surrounding a block of content within the reference document; and
automatically identifying a subset of the bounding boxes for each reference document as representing noise;
at the hardware processing device, generating a feature vector for each of the reference documents based on the bounding boxes identified in the reference document that are not included in the identified subset representing noise;
storing the generated feature vectors at a storage device;
receiving a target document for classification;
at the hardware processing device:
automatically identifying a plurality of bounding boxes for the target document, each surrounding a block of content within the target document;
automatically identifying a subset of the bounding boxes for the target document as representing noise;
generating a feature vector based on the bounding boxes identified in the target document that are not included in the identified subset representing noise;
comparing the feature vector for the target document with the feature vectors for the reference documents, to determine which reference document feature vector is most closely aligned with the target document feature vector; and
classifying the target document based on the comparing step; and
at an output device, outputting results of the classifying step.
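The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the claim does not say how bounding boxes are detected, how noise is identified, how the feature vector is built, or which similarity measure is used. The sketch assumes boxes are already available as page-normalized rectangles (for example from a layout or OCR engine), treats boxes below a hypothetical `min_area` threshold as noise, builds the feature vector as area accumulated per cell of a hypothetical 4x4 grid, and compares vectors by cosine similarity. The names `Box`, `drop_noise`, `feature_vector`, `cosine` and `classify` are illustrative only.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Box:
    """Bounding box in page-normalized coordinates (all values in 0..1)."""
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height

    @property
    def area(self) -> float:
        return self.w * self.h


def drop_noise(boxes, min_area=0.001):
    """Assumed noise rule: discard boxes smaller than min_area."""
    return [b for b in boxes if b.area >= min_area]


def feature_vector(boxes, grid=4):
    """Assumed feature: total non-noise box area per grid cell (by box center)."""
    vec = [0.0] * (grid * grid)
    for b in drop_noise(boxes):
        col = min(int((b.x + b.w / 2) * grid), grid - 1)
        row = min(int((b.y + b.h / 2) * grid), grid - 1)
        vec[row * grid + col] += b.area
    return vec


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def classify(target_boxes, reference_vectors):
    """Return the label of the stored reference vector closest to the target."""
    target = feature_vector(target_boxes)
    return max(reference_vectors,
               key=lambda label: cosine(target, reference_vectors[label]))


# Reference documents: feature vectors are generated once and stored.
references = {
    "invoice": feature_vector([Box(0.05, 0.05, 0.3, 0.1),
                               Box(0.1, 0.3, 0.85, 0.5)]),
    "receipt": feature_vector([Box(0.35, 0.05, 0.4, 0.05),
                               Box(0.35, 0.15, 0.4, 0.3)]),
}

# Target document: the last box is small enough to be treated as noise.
target = [Box(0.06, 0.04, 0.3, 0.1),
          Box(0.1, 0.32, 0.85, 0.5),
          Box(0.9, 0.9, 0.02, 0.02)]

result = classify(target, references)
print(result)
```

In this sketch the in-memory `references` dictionary stands in for the storage device, and `print` stands in for the output device. Any other noise criterion, feature construction or distance measure could be substituted without changing the overall flow of the claim.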