| CPC G06F 40/40 (2020.01) | 20 Claims |

|
1. A method comprising:
detecting, by one or more processors, one or more languages from different countries in a document;
assigning, by the one or more processors, a weight to the one or more languages from different countries;
determining, by the one or more processors, document content concepts in the document from a bag of words associated with the document, wherein the document content concepts include an attribute of a type of document;
determining, by the one or more processors, that synonyms are associated with the document content concepts, wherein the synonyms include at least one of a wording variation, abbreviation or acronym of the document content concepts that is used to link the synonyms to the document content concepts;
linking, by the one or more processors, the synonyms to the document content concepts;
assigning, by the one or more processors, a weight to each of the document content concepts, wherein the document content concepts include the synonyms to the respective document content concepts;
scoring, by the one or more processors, the document with a classification score based on the weight of each of the document content concepts and the weights of the one or more languages from different countries in the document;
determining, by the one or more processors, that the classification score meets a threshold;
determining, by the one or more processors, a pattern in the document;
determining, by the one or more processors, an object within the pattern in the document;
creating, by the one or more processors, a region around the object using x-y coordinates;
searching, by the one or more processors, in the region for data relevant to the object;
determining, by the one or more processors, that the document lacks personal health information;
determining, by the one or more processors, that the document lacks personal credit information;
determining, by the one or more processors, that one or more rejected keywords are not in the bag of words;
avoiding, by the one or more processors, portions of the document based on an opt-out request;
classifying, by the one or more processors, the document based on the classification score; and
assigning, by the one or more processors, and based on the classifying, the document to at least one of a release report in response to the document being a valid document or an exemption report in response to the document being a rejected document.
|
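
For illustration only, the scoring, threshold, rejected-keyword, report-assignment, and region steps recited in claim 1 might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the claim does not specify how the concept weights and language weights are combined (a plain sum is assumed here), nor the threshold value, and every name, table entry, and number in the code is hypothetical.

```python
from dataclasses import dataclass

# Hypothetical lookup tables; the claim does not give concrete values.
SYNONYMS = {"inv": "invoice", "bill": "invoice", "po": "purchase order"}
CONCEPT_WEIGHTS = {"invoice": 0.6, "purchase order": 0.3}
LANGUAGE_WEIGHTS = {"en": 0.1, "de": 0.05}
REJECTED_KEYWORDS = {"diagnosis", "credit score"}
THRESHOLD = 0.5  # assumed cut-off for a "valid" document


def score_document(bag_of_words, languages):
    """Return a classification score for one document.

    Synonyms are linked to their concept first, so each concept is counted
    once regardless of which wording variation appears. Summing concept and
    language weights is an assumption; the claim only says the score is
    "based on" both.
    """
    concepts = {SYNONYMS.get(word, word) for word in bag_of_words}
    concept_score = sum(CONCEPT_WEIGHTS.get(c, 0.0) for c in concepts)
    language_score = sum(LANGUAGE_WEIGHTS.get(l, 0.0) for l in set(languages))
    return concept_score + language_score


def assign_report(bag_of_words, languages):
    """Route the document to a release report or an exemption report."""
    # A rejected keyword in the bag of words marks the document as rejected.
    if REJECTED_KEYWORDS & set(bag_of_words):
        return "exemption report"
    score = score_document(bag_of_words, languages)
    # Assumed rule: meeting the threshold makes the document valid.
    return "release report" if score >= THRESHOLD else "exemption report"


@dataclass
class Region:
    """Axis-aligned rectangle in page x-y coordinates."""
    x0: int
    y0: int
    x1: int
    y1: int

    def contains(self, x, y):
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


def region_around(x, y, width, height, margin):
    """Create a search region around an object's bounding box.

    The fixed margin is a hypothetical choice; the claim only requires a
    region defined with x-y coordinates.
    """
    return Region(x - margin, y - margin, x + width + margin, y + height + margin)
```

Under these assumptions, `assign_report(["inv", "po"], ["en"])` routes the document to the release report, while a bag of words containing a rejected keyword is routed to the exemption report whatever its score. Data relevant to an object would then be looked up only among items whose coordinates satisfy `Region.contains`.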