US 12,242,439 B2
System and process for data enrichment
Emmanuel Le Huerou, Chatillon (FR); and Mikael Szczerbak, Chatillon (FR)
Assigned to ORANGE, Issy-les-Moulineaux (FR)
Appl. No. 17/599,113
Filed by ORANGE, Issy-les-Moulineaux (FR)
PCT Filed Mar. 20, 2020, PCT No. PCT/FR2020/050609
§ 371(c)(1), (2) Date Sep. 28, 2021,
PCT Pub. No. WO2020/201662, PCT Pub. Date Oct. 8, 2020.
Claims priority of application No. 1903406 (FR), filed on Mar. 29, 2019.
Prior Publication US 2022/0171749 A1, Jun. 2, 2022
Int. Cl. G06F 16/21 (2019.01); G06F 16/215 (2019.01); G06F 16/28 (2019.01)
CPC G06F 16/215 (2019.01) [G06F 16/285 (2019.01)] 15 Claims
 
1. A process for data enrichment implemented by a computing device and comprising the following ordered steps:
a) receiving, at a communication module of the computing device, several sets of data from a data communication network, each set of data comprising one fundamental datum corresponding to a character string, sound or image, and metadata that describes or defines the fundamental datum, the sets of data received from the data communication network being at risk of containing errors in the data;
b) grouping, by the computing device, the received sets of data in one or more groups based on the fundamental datum of each set of data according to a similarity function, the similarity function being a distance defined on a space of M+1 dimensions, wherein M is a number of metadata of the received sets of data, the computing device grouping two sets of data in a common group when the distance between the two sets of data is below a given threshold;
c) enriching, by the computing device, each received set of data with one additional datum called a label which characterizes the group to which the set of data belongs, the enriching providing an enriched set of data, wherein each metadata from the enriched set of data is associated with a weight;
d) for each enriched set of data, searching, by the computing device, for a combination of at least one part of the metadata and the label from the enriched set of data in at least one database, the at least one database storing additional sets of data each comprising metadata and a label;
e) as a result of the searching, determining, by the computing device, whether the combination of the at least one part of the metadata and the label from the enriched set of data is present in or absent from the at least one database, wherein the combination of at least a part of the metadata and the label is present in the at least one database if and only if a value of a presence function, calculated depending on the respective weights of the metadata of the combination present in the database, is greater than or equal to a preset threshold;
f) for at least one enriched set of data, in response to determining that the combination of the at least one part of the metadata and the label from the enriched set of data is absent from the at least one database, the computing device determining that the label is an erroneous label for that set of data and consequently the computing device removing the label from the enriched set of data; and
g) the computing device outputting at least one of the enriched sets of data.
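
Steps b) through g) can be illustrated with a minimal Python sketch. It is not the patented implementation, and every name in it (DataSet, distance, group_and_label, presence, enrich) is hypothetical. The claim fixes neither the distance nor the presence function, so the sketch assumes: a Euclidean distance over M+1 components, where the fundamental-datum component is a string dissimilarity from difflib and each metadata component is 0 on a match and 1 otherwise; a label equal to the fundamental datum of one representative member of the group; a database held as a list of dictionaries; and a presence function equal to the best sum of weights of matching metadata among database records carrying the same label.

```python
import math
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class DataSet:
    fundamental: str              # fundamental datum (a character string here)
    metadata: dict                # M metadata, name -> value
    weights: dict                 # weight associated with each metadata (step c)
    label: Optional[str] = None   # label added in step c, removed in step f


def distance(a, b):
    """Assumed distance on M+1 dimensions: 1 for the datum, M for the metadata."""
    d0 = 1.0 - SequenceMatcher(None, a.fundamental, b.fundamental).ratio()
    keys = set(a.metadata) | set(b.metadata)
    dm = [0.0 if a.metadata.get(k) == b.metadata.get(k) else 1.0 for k in keys]
    return math.sqrt(d0 ** 2 + sum(x ** 2 for x in dm))


def group_and_label(sets, threshold):
    """Steps b and c: link sets whose distance is below threshold, then label."""
    parent = list(range(len(sets)))

    def find(i):
        # Follow parent links to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if distance(sets[i], sets[j]) < threshold:
                parent[find(j)] = find(i)
    for i, s in enumerate(sets):
        # Assumption: the label is the representative's fundamental datum.
        s.label = sets[find(i)].fundamental


def presence(s, database):
    """Assumed presence function: best weighted match among same-label records."""
    best = 0.0
    for record in database:
        if record["label"] != s.label:
            continue
        score = sum(w for k, w in s.weights.items()
                    if k in s.metadata and record["metadata"].get(k) == s.metadata[k])
        best = max(best, score)
    return best


def enrich(sets, database, group_threshold, presence_threshold):
    """Steps b to g: group, label, search, drop absent labels, output."""
    group_and_label(sets, group_threshold)
    for s in sets:
        if presence(s, database) < presence_threshold:
            s.label = None        # step f: label judged erroneous and removed
    return sets                   # step g


weights = {"country": 2.0, "type": 1.0}
received = [
    DataSet("Paris", {"country": "FR", "type": "city"}, dict(weights)),
    DataSet("Pariss", {"country": "FR", "type": "city"}, dict(weights)),
    DataSet("Berlin", {"country": "FR", "type": "city"}, dict(weights)),
]
database = [
    {"label": "Paris", "metadata": {"country": "FR", "type": "city"}},
    {"label": "Berlin", "metadata": {"country": "DE", "type": "city"}},
]
result = enrich(received, database, group_threshold=0.5, presence_threshold=2.5)
print([s.label for s in result])  # ['Paris', 'Paris', None]
```

In this toy run, the misspelled "Pariss" falls in the same group as "Paris" and inherits its label, which the database confirms. The "Berlin" set carries a wrong country, so its presence value (1.0, from the matching type only) stays below the preset threshold of 2.5 and its label is removed.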