CPC G06F 40/295 (2020.01) [G06F 40/30 (2020.01); G06F 40/47 (2020.01); G06F 16/951 (2019.01)] | 15 Claims |
1. A method for extracting and labeling Named-Entity Recognition (NER) data in a target language for use in a multi-lingual software module, comprising:
pre-processing a textual sentence to replace textual organizations and textual persons in the source language using a static list of organizations and names previously stored in a retrievable electronic database for the target language;
translating the textual sentence to the target language using an open source translation module;
identifying a named entity within the translated textual sentence by:
(i) if an exact mapping of a translated named entity is available, using the exact mapping of the translated named entity,
(ii) if the exact mapping is not available and identifying a semantically similar translated named entity that meets a predetermined minimum threshold of similarity as determined by a confidence score in embedded similarity is possible, identifying the semantically similar translated named entity that meets the predetermined minimum threshold of similarity, and
(iii) if the exact mapping is not available and identifying the semantically similar translated named entity that meets the predetermined minimum threshold of similarity as determined by a confidence score in embedded similarity is not possible, utilizing a rule-based library for the target language;
labeling the identified named entity with a predetermined category; and
storing the labeled named entity in a retrievable electronic database for later retrieval by the multi-lingual software module.
|
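
The three-tier identification in steps (i) to (iii) of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the mapping table `EXACT_MAP`, the rule list `RULES`, the function names, the Spanish examples, and the 0.8 threshold are all hypothetical, and a toy character-bigram vector with cosine similarity stands in for whatever embedding model and confidence score an actual system would use.

```python
import math
import re
from collections import Counter

# Hypothetical exact mappings: translated surface form -> category.
EXACT_MAP = {"Banco de España": "ORG", "María García": "PER"}

# Hypothetical rule-based library for the target language (Spanish here).
RULES = [
    (re.compile(r"^(Sr\.|Sra\.|Dr\.)\s"), "PER"),   # honorific prefix
    (re.compile(r"\b(S\.A\.|S\.L\.)$"), "ORG"),     # company suffix
]


def embed(text):
    """Toy embedding: character-bigram counts (stand-in for a real model)."""
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def identify(candidate, threshold=0.8):
    """Return (category, tier) following the fallback order (i), (ii), (iii)."""
    # (i) Use the exact mapping when one is available.
    if candidate in EXACT_MAP:
        return EXACT_MAP[candidate], "exact"

    # (ii) Otherwise take the most similar known entity, but only if its
    # similarity score meets the minimum threshold.
    vec = embed(candidate)
    best, score = max(
        ((name, cosine(vec, embed(name))) for name in EXACT_MAP),
        key=lambda pair: pair[1],
    )
    if score >= threshold:
        return EXACT_MAP[best], "similar"

    # (iii) Otherwise fall back to the rule-based library.
    for pattern, category in RULES:
        if pattern.search(candidate):
            return category, "rule"
    return None, "none"
```

Calling `identify("Banco de España")` resolves at tier (i), a near-miss spelling such as `identify("Banco de Espana")` resolves at tier (ii), and `identify("Dr. Pérez")` has no similar known entity and so falls through to the rules in tier (iii). The returned category corresponds to the labeling step; storing the result is left out of the sketch.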