CPC G06Q 10/107 (2013.01) [G06F 40/284 (2020.01); G06N 20/00 (2019.01); H04L 51/212 (2022.05)] | 19 Claims |
1. A method to automatically classify documents, the method comprising:
generating, by a system that includes a processor and memory, a plurality of entity data objects, each of the plurality of entity data objects representing a single person identified in a plurality of documents such that each entity data object of the plurality of entity data objects represents a different one of the persons identified in the plurality of documents and each entity data object of the plurality of entity data objects includes a name;
after generating the plurality of entity data objects, creating, for at least one entity data object of the plurality of entity data objects, using a name in the at least one entity data object, a plurality of name variants for the name, the plurality of name variants stored as data of the at least one entity data object;
extracting, by the system, tokens from the plurality of documents, each token being a word or phrase from the plurality of documents;
after extracting the tokens and creating the plurality of name variants, searching the tokens from the plurality of documents for matches with the data of the plurality of entity data objects, including the plurality of name variants;
after searching the tokens, selecting, by the system, a first document of the plurality of documents in response to the first document including an extracted token that corresponds with data from two or more of the entity data objects of the plurality of entity data objects;
identifying the two or more of the entity data objects as candidate entity data objects;
determining, by the system, a particular entity data object of the candidate entity data objects to which the first document corresponds, wherein the determining comprises:
calculating, for each of the candidate entity data objects using a document network graph, a degree of separation between the candidate entity data objects and one or more persons identified in the first document, the document network graph constructed to represent patterns between the persons identified in the plurality of documents; and
selecting, as the particular entity data object, a candidate entity data object that includes the lowest degree of separation from the persons identified in the first document; and
automatically assigning, by the system, the first document to a category corresponding to the particular entity data object.
|
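The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. Every name in it (`Entity`, `make_variants`, `extract_tokens`, `separation`, `classify`) is hypothetical, as are the sample persons and categories. The sketch assumes that name variants are simple reorderings and initials, that tokens are lowercased single words and two-word phrases, that the document network graph is supplied as an adjacency mapping between person names, and that the "degree of separation" is the shortest hop count (found by breadth-first search) from a candidate to any person identified unambiguously in the document. For brevity it pools the candidates from all ambiguous tokens, whereas the claim speaks of a single token matching two or more entity data objects.

```python
import re
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Hypothetical entity data object: one per person."""
    name: str
    category: str
    variants: set = field(default_factory=set)


def make_variants(name):
    """Assumed variant rules: full name, reversed order, initial + last, last only."""
    parts = name.lower().split()
    if len(parts) < 2:
        return {" ".join(parts)}
    first, last = parts[0], parts[-1]
    return {f"{first} {last}", f"{last} {first}", f"{first[0]} {last}", last}


def extract_tokens(text):
    """Tokens are lowercased words plus adjacent two-word phrases."""
    words = re.findall(r"[a-z]+", text.lower())
    phrases = {f"{a} {b}" for a, b in zip(words, words[1:])}
    return set(words) | phrases


def separation(graph, start, goal):
    """Shortest hop count between two persons (BFS); infinity if unconnected."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return float("inf")


def classify(text, entities, graph):
    """Return the category of the candidate closest to the document's other persons."""
    tokens = extract_tokens(text)
    hits = {}  # token -> entities whose stored variants contain that token
    for entity in entities:
        for token in tokens & entity.variants:
            hits.setdefault(token, []).append(entity)
    # Candidates: entities sharing at least one token with another entity.
    candidates = [e for e in entities
                  if any(len(es) > 1 and e in es for es in hits.values())]
    # Context: persons matched by a token that points to exactly one entity.
    context = {es[0].name for es in hits.values() if len(es) == 1}
    if not candidates or not context:
        return None  # nothing to disambiguate in this sketch
    best = min(candidates,
               key=lambda e: min(separation(graph, e.name, p) for p in context))
    return best.category


# Illustrative data only; none of it comes from the patent.
entities = [
    Entity("John Smith", "john-smith"),
    Entity("James Smith", "james-smith"),
    Entity("Jane Doe", "jane-doe"),
    Entity("Bob Ray", "bob-ray"),
]
for e in entities:  # variants are created after the entity objects exist
    e.variants = make_variants(e.name)

graph = {
    "John Smith": {"Bob Ray"},
    "Bob Ray": {"John Smith", "Jane Doe"},
    "Jane Doe": {"Bob Ray", "James Smith"},
    "James Smith": {"Jane Doe"},
}

result = classify("J. Smith met Jane Doe.", entities, graph)
print(result)  # james-smith
```

In this example the phrase "J. Smith" matches variants of both John Smith and James Smith, so both become candidates. Jane Doe is identified unambiguously, and James Smith is one hop from her in the graph while John Smith is two hops away, so the document is assigned to the category for James Smith.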