US 12,333,236 B2
System and method for automatically tagging documents
Eleftherios Panagiotis Loukas, Agia Paraskevi (GR); Eirini Spyropoulou, Agia Paraskevi (GR); Prodromos Malakasiotis, Agia Paraskevi (GR); Emmanouil Fergadiotis, Agia Paraskevi (GR); Ilias Chalkidis, Agia Paraskevi (GR); Ioannis Androutsopoulos, Agia Paraskevi (GR); and Georgios Paliouras, Agia Paraskevi (GR)
Assigned to National Centre for Scientific Research “Demokritos”, Agia Paraskevi (GR)
Filed by National Centre for Scientific Research “Demokritos”, Agia Paraskevi (GR)
Filed on Jul. 26, 2022, as Appl. No. 17/873,932.
Claims priority of application No. 21386048 (EP), filed on Jul. 26, 2021.
Prior Publication US 2023/0028664 A1, Jan. 26, 2023
Int. Cl. G06F 40/117 (2020.01); G06F 40/143 (2020.01); G06F 40/151 (2020.01); G06F 40/166 (2020.01); G06F 40/284 (2020.01)
CPC G06F 40/117 (2020.01) [G06F 40/143 (2020.01); G06F 40/151 (2020.01); G06F 40/166 (2020.01); G06F 40/284 (2020.01)]
18 Claims
 
1. A computer-implemented method for tagging electronic documents, the computer-implemented method comprising:
receiving, by an input module, an electronic document to be tagged;
preprocessing, by a preprocessing module, the electronic document to be tagged, the preprocessing comprising:
    extracting a text from the electronic document to be tagged;
    replacing at least one of a numerical amount or a date in the extracted text with a predetermined symbol, wherein the predetermined symbol is not used in the extracted text before replacing at least one of the numerical amount or the date in the extracted text, wherein the predetermined symbol includes a special character that is non-numeric and non-alphabetic; and
    tokenizing the extracted text with the predetermined symbol into a plurality of tokens without fragmenting the predetermined symbol, wherein keeping the predetermined symbol unfragmented avoids inaccurate tagging associated with the predetermined symbol;
determining, by a deep learning module, a tag for at least one of the plurality of tokens; and
outputting, by an output module, the determined tag for the at least one of the plurality of tokens.
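
The preprocessing steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the placeholder symbols `[NUM]` and `[DATE]`, the regular expressions, and the function name `preprocess` are all assumptions chosen for illustration. The claim only requires a symbol that does not already occur in the extracted text, that contains a non-numeric, non-alphabetic special character, and that the tokenizer keeps in one piece. The deep learning module that determines the tags is not shown.

```python
import re

# Hypothetical placeholder symbols (assumption). The brackets supply the
# non-numeric, non-alphabetic special character the claim calls for.
NUM_SYMBOL = "[NUM]"
DATE_SYMBOL = "[DATE]"

# Simplified patterns (assumption): ISO dates such as 2022-07-26 and
# month-name dates such as "Jul. 26, 2022".
DATE_RE = re.compile(
    r"\b(?:\d{4}-\d{2}-\d{2}"
    r"|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.? \d{1,2}, \d{4})\b"
)
# Amounts such as 42, 1,234 or 1,234.56.
NUM_RE = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")

# The placeholder alternatives come first, so each placeholder is matched
# as a single token instead of being split into "[", "NUM", "]".
TOKEN_RE = re.compile(r"\[NUM\]|\[DATE\]|\w+|[^\w\s]")


def preprocess(text):
    """Replace dates and amounts with placeholders, then tokenize."""
    # The symbol must not be used in the text before replacement.
    for symbol in (NUM_SYMBOL, DATE_SYMBOL):
        if symbol in text:
            raise ValueError(f"placeholder {symbol!r} already occurs in the text")
    # Dates are replaced first so their digits are not mistaken for amounts.
    text = DATE_RE.sub(DATE_SYMBOL, text)
    text = NUM_RE.sub(NUM_SYMBOL, text)
    return TOKEN_RE.findall(text)


tokens = preprocess("Revenue was $1,234.56 on Jul. 26, 2022.")
print(tokens)
# ['Revenue', 'was', '$', '[NUM]', 'on', '[DATE]', '.']
```

In this sketch every amount collapses to one token. A subword tokenizer could otherwise split `1,234.56` into several fragments, each needing its own tag, which is the kind of inaccurate tagging the claim says an unfragmented symbol avoids.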