US 12,189,696 B2
Computer implemented method for improving search engine queries
Moeka Shishido, Vélizy-Villacoublay (FR); and Xavier Gréhant, Vélizy-Villacoublay (FR)
Assigned to Dassault Systemes, Vélizy-Villacoublay (FR)
Filed by Dassault Systemes, Vélizy-Villacoublay (FR)
Filed on Dec. 17, 2021, as Appl. No. 17/644,955.
Claims priority of application No. 20306626 (EP), filed on Dec. 18, 2020.
Prior Publication US 2022/0197962 A1, Jun. 23, 2022
Int. Cl. G06F 40/284 (2020.01); G06F 16/9532 (2019.01); G06F 40/289 (2020.01); G06F 40/40 (2020.01)
CPC G06F 16/9532 (2019.01) [G06F 40/284 (2020.01); G06F 40/289 (2020.01); G06F 40/40 (2020.01)]
16 Claims
1. A computer implemented method for improving a search engine comprising:
a. receiving a text corpus;
b. determining a list of n-gram candidates, each being a series of consecutive words of said text corpus, a number of said consecutive words within said series being an integer n superior or equal to two;
c. modifying at least partially said text corpus based on said list of n-gram candidates;
d. performing a machine learning embedding on the text corpus at least partially modified in step c;
e. for each element in said list of n-gram candidates, computing a score based on the embedding of said element and the embeddings of the words making up said element; and
f. adding one or more of the n-gram candidates to a search engine queries items list based on their respective scores,
wherein step c further includes:
parsing the text corpus with a list of n-gram candidates of step b, and each time a series of consecutive words of said text corpus make up a given element of said list of n-gram candidates, determining whether one or more words consecutive to said series of consecutive words of said text corpus make up a different element of said list of n-gram candidates with one or more of endmost words of said series of consecutive words of said text corpus, and, in such case, using a pseudo-random function parametrized by a number of times the series of words making up said given element has already been replaced by a token associated with a corresponding given element within the part of the text corpus already parsed to compute a value determining whether the series of consecutive words making up said given element should be tokenized as a single word token or not, and applying same processing to the series of consecutive words making up said different element.
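The parsing rule in the wherein clause of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation and makes several assumptions the claim does not state: candidates are bigrams only, a replaced series becomes an underscore-joined token, the pseudo-random function is a SHA-256 hash of the n-gram and its replacement count compared against a 0.5 threshold, and a candidate with no overlapping neighbor is always replaced. The names pseudo_random_keep, tokenize_overlapping, and decide are hypothetical.

```python
import hashlib
from collections import Counter


def pseudo_random_keep(ngram, times_replaced, threshold=0.5):
    """Assumed stand-in for the claimed pseudo-random function.

    The decision depends on how many times `ngram` has already been
    replaced by its token in the part of the corpus parsed so far.
    """
    key = f"{' '.join(ngram)}|{times_replaced}".encode("utf-8")
    value = hashlib.sha256(key).digest()[0] / 256.0  # value in [0, 1)
    return value < threshold


def tokenize_overlapping(words, candidates, decide=pseudo_random_keep):
    """Replace bigram candidates by single tokens, arbitrating overlaps."""
    candidates = set(candidates)
    replaced = Counter()  # per-candidate count of replacements so far
    out = []
    i = 0
    while i < len(words):
        given = tuple(words[i:i + 2])
        if len(given) < 2 or given not in candidates:
            out.append(words[i])
            i += 1
            continue
        # Does the endmost word of `given` start a different candidate?
        other = tuple(words[i + 1:i + 3])
        overlap = len(other) == 2 and other in candidates and other != given
        if not overlap or decide(given, replaced[given]):
            # Assumption: without overlap the candidate is always tokenized.
            out.append("_".join(given))
            replaced[given] += 1
            i += 2
        elif decide(other, replaced[other]):
            # Same processing applied to the overlapping candidate.
            out.extend([words[i], "_".join(other)])
            replaced[other] += 1
            i += 3
        else:
            # Neither candidate selected: keep the plain words.
            out.extend(given)
            i += 2
    return out


corpus = "machine learning model machine learning model".split()
cands = {("machine", "learning"), ("learning", "model")}
print(tokenize_overlapping(corpus, cands))
```

Because the decision depends on the running replacement count, repeated occurrences of "machine learning model" can be split differently across the corpus, so both "machine_learning" and "learning_model" may receive enough occurrences to be embedded in step d. Passing a custom decide function (for example one that alternates on the count) makes the behavior easy to inspect.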