US 11,714,790 B2
Data unification
Meiyalagan Balasubramanian, Redmond, WA (US); Lengning Liu, Redmond, WA (US); Aditya Kuppa, Redmond, WA (US); Kirk Hartmann Freiheit, Orlando, FL (US); Kalen Wong, Redmond, WA (US); Paula Budig Greve, Monroe, WA (US); Patrick Clinton Little, Seattle, WA (US); Lucas Pritz, Redmond, WA (US); Yue Wang, Bellevue, WA (US); Vivek Ravindranath Narasayya, Redmond, WA (US); Katchaguy Areekijseree, Woodinville, WA (US); Yeye He, Bellevue, WA (US); Surajit Chaudhuri, Kirkland, WA (US); and Gaurav Ghosh, Bellevue, WA (US)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on Sep. 30, 2021, as Appl. No. 17/490,908.
Prior Publication US 2023/0098926 A1, Mar. 30, 2023
Int. Cl. G06F 16/21 (2019.01); G06F 16/215 (2019.01); G06F 16/2455 (2019.01)
CPC G06F 16/215 (2019.01) [G06F 16/24556 (2019.01)]
20 Claims
 
1. A method of data unification, the method comprising:
receiving a plurality of data records, each data record of the plurality of data records comprising a plurality of data fields;
performing a self-conflation process for the plurality of data records, the self-conflation process comprising:
performing a partition-based clustering, in parallel, for a plurality of partitions, wherein the plurality of data records are distributed among the plurality of partitions; and
producing a unified data record from the plurality of records;
selecting, from among the plurality of data fields, a subset of the data fields, the subset of the data fields being fewer in number than the plurality of data fields, wherein selecting the subset of the data fields comprises:
applying a first rule to select at least a first one of the data fields within the unified data record for inclusion in the subset of the data fields;
using content of the subset of the data fields, generating a stable identifier (stableID) for the unified data record; and
inserting the stableID into a primary key data field of the unified data record.
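
The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the field names (`name`, `email`, `phone`, `primary_key`), the matching rule (identical normalized email, falling back to name), the field-selection rule, and the use of SHA-256 are all assumptions chosen for illustration, and the sketch emits one unified record per detected cluster. A thread pool stands in for whatever parallel execution engine a real system would use.

```python
import hashlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

PRIMARY_KEY = "primary_key"  # hypothetical name for the primary key data field


def normalize(value):
    """Trim and lower-case a value so formatting differences do not matter."""
    return str(value).strip().lower()


def match_key(record):
    # Assumed matching rule: same normalized email (or name, if email is empty).
    return normalize(record.get("email", "")) or normalize(record.get("name", ""))


def cluster_partition(records):
    """Cluster the records of one partition by their match key."""
    clusters = defaultdict(list)
    for record in records:
        clusters[match_key(record)].append(record)
    return list(clusters.values())


def unify(cluster):
    """Merge a cluster, keeping the first non-empty value seen per field."""
    unified = {}
    for record in cluster:
        for field, value in record.items():
            if not unified.get(field):
                unified[field] = value
    return unified


def select_fields(unified):
    # Assumed "first rule": use email when populated, otherwise use name.
    return ["email"] if unified.get("email") else ["name"]


def stable_id(unified, fields):
    """Hash the normalized content of the selected fields only."""
    payload = "\x1f".join(
        f"{field}={normalize(unified.get(field, ''))}" for field in sorted(fields)
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def self_conflate(records):
    # Distribute records among partitions; the first character of the match
    # key is an assumed scheme that keeps potential matches together.
    partitions = defaultdict(list)
    for record in records:
        partitions[match_key(record)[:1]].append(record)

    # Cluster every partition concurrently.
    with ThreadPoolExecutor() as pool:
        clustered = list(pool.map(cluster_partition, partitions.values()))

    unified_records = []
    for clusters in clustered:
        for cluster in clusters:
            unified = unify(cluster)
            unified[PRIMARY_KEY] = stable_id(unified, select_fields(unified))
            unified_records.append(unified)
    return unified_records


records = [
    {"name": "Ann Lee", "email": "ann@example.com", "phone": ""},
    {"name": "Ann Lee", "email": "ANN@example.com ", "phone": "555-0100"},
    {"name": "Bob Roy", "email": "bob@example.com", "phone": ""},
]
unified_records = self_conflate(records)
```

Because the identifier is derived only from the selected subset of fields, it does not change when other fields (here, `phone`) are updated or when the input records arrive in a different order, which is what makes it usable as a primary key across repeated unification runs.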