US 11,853,699 B2
Synthetic crafting of training and test data for named entity recognition by utilizing a rule-based library
Shubham Mehrotra, Sunnyvale, CA (US); and Ankit Chadha, San Jose, CA (US)
Assigned to salesforce.com, inc.
Filed by salesforce.com, inc., San Francisco, CA (US)
Filed on Jan. 29, 2021, as Appl. No. 17/248,583.
Prior Publication US 2022/0245346 A1, Aug. 4, 2022
Int. Cl. G06F 40/295 (2020.01); G06F 40/58 (2020.01); G06F 40/42 (2020.01); G06F 16/951 (2019.01); G06F 40/30 (2020.01); G06F 40/47 (2020.01)

CPC G06F 40/295 (2020.01) [G06F 40/30 (2020.01); G06F 40/47 (2020.01); G06F 16/951 (2019.01)]

15 Claims

1. A method for extracting and labeling Named-Entity Recognition (NER) data in a target language for use in a multi-lingual software module, comprising:

pre-processing a textual sentence to replace textual organizations and textual persons in the source language using a static list of organizations and names previously stored in a retrievable electronic database for the target language;

translating the textual sentence to the target language using an open source translation module;

identifying a named entity within the translated textual sentence by:

(i) if an exact mapping of a translated named entity is available, using the exact mapping of the translated named entity,

(ii) if the exact mapping is not available and identifying a semantically similar translated named entity that meets a predetermined minimum threshold of similarity as determined by a confidence score in embedded similarity is possible, identifying the semantically similar translated named entity that meets the pre-determined minimum threshold of similarity, and

(iii) if the exact mapping is not available and identifying the semantically similar translated named entity that meets the predetermined minimum threshold of similarity as determined by a confidence score in embedded similarity is not possible, utilizing a rule-based library for the target language;

labeling the identified named entity with a pre-determined category; and

storing the labeled named entity in a retrievable electronic database for later retrieval by the multi-lingual software module.
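
The three-tier identification in step (i) through (iii) can be illustrated with a minimal sketch. It is not the patented implementation. Every name in it (`embed`, `cosine`, `identify_entity`, `exact_map`, `rule_pattern`) is hypothetical. The character-bigram vector is a toy stand-in for a real embedding model, the regular expression stands in for a rule-based library for the target language, and treating an exact mapping as "available" only when the mapped string occurs in the translated sentence is an assumption made for this example.

```python
import re
from collections import Counter
from math import sqrt


def embed(text):
    """Toy stand-in for an embedding: counts of character bigrams (assumption)."""
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def identify_entity(sentence, source_entity, exact_map, threshold, rule_pattern):
    """Return (entity_text, tier) for a translated sentence.

    Tiers follow the claim's order: exact mapping, then embedding
    similarity above a threshold, then a rule-based fallback.
    """
    # (i) Exact mapping: use it when the mapped string occurs in the sentence.
    mapped = exact_map.get(source_entity)
    if mapped is not None and mapped in sentence:
        return mapped, "exact"

    # (ii) Semantic similarity: score every span of 1 to 3 tokens against
    # the source entity and keep the best one if it clears the threshold.
    tokens = re.findall(r"\w+", sentence)
    target = embed(source_entity)
    best_span, best_score = None, 0.0
    for n in range(1, 4):
        for i in range(len(tokens) - n + 1):
            span = " ".join(tokens[i:i + n])
            score = cosine(target, embed(span))
            if score > best_score:
                best_span, best_score = span, score
    if best_span is not None and best_score >= threshold:
        return best_span, "similar"

    # (iii) Rule-based fallback for the target language.
    match = re.search(rule_pattern, sentence)
    if match:
        return match.group(0), "rule"
    return None, "none"


# Hypothetical German rule: two or more consecutive capitalized words.
GERMAN_NAME_RULE = r"[A-ZÄÖÜ]\w+(?: [A-ZÄÖÜ]\w+)+"
```

A caller might then label and store the result, for example `labeled = {"text": entity, "category": "ORG"}`, where the category value and the storage format are likewise assumptions. The ordering matters: the cheaper and more precise tiers run first, and the rule-based step is reached only when neither a mapping nor a sufficiently similar span exists.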