US 12,293,153 B2
Fuzzy matching of obscure texts with meaningful terms included in a glossary
Shlomit Ifergan Shachor, Yokneam Eilit (IL); Natalia Razinkov, Haifa (IL); Micha Gideon Moffie, Zichron Yaakov (IL); and Omer Yehuda Boehm, Haifa (IL)
Assigned to International Business Machines Corporation, Armonk, NY (US)
Filed by International Business Machines Corporation, Armonk, NY (US)
Filed on Aug. 22, 2022, as Appl. No. 17/892,169.
Prior Publication US 2024/0062004 A1, Feb. 22, 2024
Int. Cl. G06F 40/284 (2020.01)

CPC G06F 40/284 (2020.01)

15 Claims

1. A computer-implemented method comprising:

obtaining multiple glossary terms each comprising one or more words;

automatically operating a fuzzy token generator to generate multiple fuzzy tokens from each word of each of the glossary terms;

automatically calculating a similarity score for each of the fuzzy tokens, wherein the similarity score denotes a similarity between the respective fuzzy token and its respective word;

obtaining multiple input terms to be matched with the multiple glossary terms;

automatically operating a tokenizer to separate each of the input terms into multiple input tokens;

automatically generating multiple n-grams from each of the input tokens;

automatically comparing the n-grams with the fuzzy tokens, to output a list of matching n-grams and fuzzy tokens;

based on the list of matching n-grams and fuzzy tokens, automatically identifying, from the glossary terms, candidate glossary term matches for each of the input terms; and

automatically calculating one or more scores that quantify the match between each of the candidate glossary term matches and its respective input term,

wherein the calculation of the one or more scores that quantify the match comprises:

calculating a match score between each of the input terms and each of its identified candidate glossary term matches, as:

a sum of the similarity scores of the fuzzy tokens associated with the respective candidate glossary term matches,

the sum being normalized to a length of the fuzzy tokens associated with the respective candidate glossary term matches, relative to a total length of the input terms,

wherein the normalized sum being factored by a ratio between (a) a number of words in each of the candidate glossary term matches whose fuzzy tokens were matched with the automatically comparing step, and (b) a total number of words in each of the candidate glossary term matches.
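
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and several details are assumptions made only for illustration: the fuzzy token generator (prefixes plus a vowel-dropped form), the similarity measure (difflib.SequenceMatcher.ratio from the Python standard library), the tokenizer (splitting on non-alphanumeric characters), and the use of character n-grams with a minimum length of 2. All function names (fuzzy_tokens, build_index, tokenize, ngrams, match) are hypothetical. The claim's normalization is read here as weighting each similarity score by the length of its fuzzy token and dividing by the total length of the input tokens; other readings are possible.

```python
import re
from difflib import SequenceMatcher

VOWELS = set("aeiou")

def fuzzy_tokens(word):
    """Return {fuzzy_token: similarity_to_word} for one glossary word.
    The generation rules here are assumptions, not taken from the claim."""
    word = word.lower()
    variants = {word}
    # Prefixes of at least 3 characters, e.g. "customer" -> "cus", "cust", ...
    variants.update(word[:i] for i in range(3, len(word)))
    # First letter plus remaining consonants, e.g. "customer" -> "cstmr"
    variants.add(word[0] + "".join(c for c in word[1:] if c not in VOWELS))
    return {v: SequenceMatcher(None, v, word).ratio()
            for v in variants if len(v) >= 2}

def build_index(glossary):
    """Map each fuzzy token to (glossary term, word position, similarity)."""
    index = {}
    for term in glossary:
        for pos, word in enumerate(term.split()):
            for tok, sim in fuzzy_tokens(word).items():
                index.setdefault(tok, []).append((term, pos, sim))
    return index

def tokenize(input_term):
    """Split an input term into lowercase alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", input_term.lower()) if t]

def ngrams(token, min_len=2):
    """All contiguous character substrings of length >= min_len."""
    return {token[i:j]
            for i in range(len(token))
            for j in range(i + min_len, len(token) + 1)}

def match(input_term, glossary, index):
    """Return candidate glossary terms with match scores, best first."""
    tokens = tokenize(input_term)
    total_len = sum(len(t) for t in tokens)
    best = {}  # (term, word position) -> (fuzzy token, similarity)
    for token in tokens:
        for gram in ngrams(token):
            for term, pos, sim in index.get(gram, []):
                key = (term, pos)
                # Keep the hit with the largest length-weighted similarity.
                if key not in best or sim * len(gram) > best[key][1] * len(best[key][0]):
                    best[key] = (gram, sim)
    scores = {}
    for term in glossary:
        n_words = len(term.split())
        hits = [best[(term, p)] for p in range(n_words) if (term, p) in best]
        if not hits:
            continue
        # Similarity sum, normalized by matched token length over input length.
        weighted = sum(sim * len(tok) for tok, sim in hits) / total_len
        # Ratio of (a) matched words to (b) total words in the glossary term.
        coverage = len(hits) / n_words
        scores[term] = weighted * coverage
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

glossary = ["customer name", "customer address", "account number"]
index = build_index(glossary)
print(match("CUST_NM", glossary, index))
```

With this example, the obscure input "CUST_NM" ranks "customer name" above "customer address", because both words of the first term are matched while only one word of the second is; "account number" is not a candidate at all. The sketch does not guard against overlapping n-grams being credited to different glossary words, which a fuller implementation would likely need to handle.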