US 12,147,402 B1
Method for index structure optimization using index deduplication
Yannis Rivard-Mulrooney, Québec (CA); Daniel Lavoie, Québec (CA); and Pierre Rousseau, Québec (CA)
Assigned to Coveo Solutions Inc., Québec (CA)
Filed by Coveo Solutions Inc., Québec (CA)
Filed on Jul. 11, 2023, as Appl. No. 18/350,013.
Int. Cl. G06F 16/215 (2019.01); G06F 16/248 (2019.01)
CPC G06F 16/215 (2019.01) [G06F 16/248 (2019.01)]
20 Claims
 
1. A method for deduplication of document entries in a database, the method comprising steps of:
retrieving, from the database, entries to be deduplicated, each entry comprising an identification (ID) field, a first data field and a second data field;
parsing the entries and identifying groups of entries therein, each group comprising one or more entries having a same first data field value;
generating a virtual ID value for identifying new entries;
for each group of entries:
identifying a common second data field value; and
creating a deduplicated entry in a deduplicated database with the virtual ID value in the ID field, the first data field value of the group of entries in the first data field and the common second data field value in the second data field, respectively;
for each unique ID field value of the entries to be deduplicated:
generating a correspondence vector having a plurality of vector fields, each vector field associated with one of the group of entries; and
storing, in each vector field, a value indicative of an existence, in the entries to be deduplicated, of a duplicated entry containing the unique ID field value, the first data field value of the group of entries associated with the vector field, and the common second data field of the group of entries; and
adding, to the deduplicated database, remaining entries comprising the entries to be deduplicated excluding duplicated entries, the deduplicated database comprising the deduplicated entries and the remaining entries and having a reduced size relative to a size of the entries to be deduplicated, thereby reducing computing needs for storing the deduplicated database.
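The steps of claim 1 can be illustrated with a minimal sketch. It is one possible reading of the claim language, not the patented implementation. The following are assumptions introduced for illustration only: the names `Entry`, `VIRTUAL_ID` and `deduplicate`; the choice of the most frequent second data field value in a group as its "common" value; the use of a single constant virtual ID; and the encoding of each vector field as 1 or 0.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Entry(NamedTuple):
    id: str      # ID field
    first: str   # first data field
    second: str  # second data field


# Hypothetical constant standing in for the generated virtual ID value.
VIRTUAL_ID = "__virtual__"


def deduplicate(entries):
    """Return (deduplicated_database, correspondence_vectors)."""
    # Identify groups of entries sharing the same first data field value,
    # keeping groups in first-seen order.
    groups = defaultdict(list)
    for e in entries:
        groups[e.first].append(e)
    group_keys = list(groups)

    # Assumption: the "common" second value of a group is its most frequent one.
    common = {
        k: Counter(e.second for e in groups[k]).most_common(1)[0][0]
        for k in group_keys
    }

    # One deduplicated entry per group, carrying the virtual ID.
    deduplicated = [Entry(VIRTUAL_ID, k, common[k]) for k in group_keys]

    # One correspondence vector per unique ID, one vector field per group.
    # A field is 1 when the original entries contain
    # (that ID, the group's first value, the group's common second value).
    present = set(entries)
    unique_ids = list(dict.fromkeys(e.id for e in entries))
    vectors = {
        uid: [1 if Entry(uid, k, common[k]) in present else 0 for k in group_keys]
        for uid in unique_ids
    }

    # Remaining entries: originals that are not covered by a deduplicated entry.
    remaining = [e for e in entries if e.second != common[e.first]]

    return deduplicated + remaining, vectors


sample = [
    Entry("id1", "x", "p"),
    Entry("id2", "x", "p"),
    Entry("id3", "x", "q"),
    Entry("id1", "y", "r"),
    Entry("id2", "y", "r"),
]
database, vectors = deduplicate(sample)
print(len(sample), "->", len(database))
print(vectors)
```

In this sample, five original entries reduce to three stored entries (two deduplicated entries plus one remaining entry), and the correspondence vectors record which IDs each deduplicated entry stands in for.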