US 12,236,264 B2
Data shards for distributed processing
Yuxi Zhang, San Francisco, CA (US); and Kexin Xie, San Mateo, CA (US)
Assigned to Salesforce, Inc., San Francisco, CA (US)
Filed by salesforce.com, inc., San Francisco, CA (US)
Filed on Jan. 30, 2021, as Appl. No. 17/163,386.
Prior Publication US 2022/0244988 A1, Aug. 4, 2022
Int. Cl. G06F 9/48 (2006.01); G06F 9/30 (2018.01); G06F 9/38 (2018.01); G06F 18/214 (2023.01); G06N 20/00 (2019.01); H04L 9/06 (2006.01)
CPC G06F 9/4881 (2013.01) [G06F 9/30036 (2013.01); G06F 9/3836 (2013.01); G06F 9/3851 (2013.01); G06F 9/3877 (2013.01); G06F 18/214 (2023.01); G06N 20/00 (2019.01); H04L 9/0643 (2013.01)]
15 Claims
1. A computer-implemented method comprising:
receiving two or more data sets of data for users wherein each of the two or more data sets belongs to a separate one of two or more groups;
hashing user identifiers in the two or more data sets to generate hashed identifiers for the two or more data sets;
replacing the user identifiers in the two or more data sets with the hashed identifiers;
splitting each of the two or more data sets to generate shards, wherein each of the two or more data sets is split into the same number of shards by splitting each of the two or more data sets such that the hashed identifiers that are common to two or more of the two or more data sets are in equivalent shards from the two or more of the two or more data sets;
generating merged shards by merging the shards using a separate running process for each of the merged shards, wherein each of the merged shards is generated using shards from more than one of the two or more data sets;
performing an operation on all of the merged shards wherein performing an operation on all of the merged shards comprises joining the merged shards into a merged data set; and
training a machine learning system using the merged data set.
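
The steps recited in claim 1 can be illustrated with a minimal Python sketch. This is not the patented implementation; it is one possible reading under stated assumptions. SHA-256 as the hash function, the shard count of 4, assigning a shard by taking the hashed identifier modulo the shard count, and all function and variable names (`hash_id`, `hash_and_shard`, `merge_shards`, `build_merged_data_set`) are hypothetical choices made for illustration. A thread pool stands in for the "separate running process" per merged shard, and the final training step is omitted because the claim does not specify a model.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 4  # assumed value; the claim only requires the same count for every data set


def hash_id(user_id: str) -> str:
    """Hash a user identifier (SHA-256 is an assumption, not named in the claim)."""
    return hashlib.sha256(user_id.encode("utf-8")).hexdigest()


def hash_and_shard(data_set: dict, num_shards: int = NUM_SHARDS) -> list:
    """Replace user IDs with hashed IDs, then split the data set into shards.

    The shard index depends only on the hashed ID, so an ID that appears in
    several data sets lands at the same shard index in each of them.
    """
    shards = [{} for _ in range(num_shards)]
    for user_id, record in data_set.items():
        hashed = hash_id(user_id)
        index = int(hashed, 16) % num_shards
        shards[index][hashed] = record
    return shards


def merge_shards(equivalent_shards: list) -> dict:
    """Merge the shards at one index, combining records that share a hashed ID."""
    merged = {}
    for shard in equivalent_shards:
        for hashed, record in shard.items():
            merged.setdefault(hashed, {}).update(record)
    return merged


def build_merged_data_set(data_sets: list, num_shards: int = NUM_SHARDS) -> dict:
    """Shard every data set, merge equivalent shards, and join the results."""
    sharded = [hash_and_shard(ds, num_shards) for ds in data_sets]
    # Group shard i of every data set together.
    groups = [[shards[i] for shards in sharded] for i in range(num_shards)]
    # One worker task per merged shard (threads here, as a stand-in for processes).
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        merged_shards = list(pool.map(merge_shards, groups))
    # Join the merged shards into a single merged data set.
    merged_data_set = {}
    for shard in merged_shards:
        merged_data_set.update(shard)
    return merged_data_set


# Two hypothetical groups that share the user "alice".
group_a = {"alice": {"purchases": 3}, "bob": {"purchases": 1}}
group_b = {"alice": {"clicks": 10}, "carol": {"clicks": 7}}
merged = build_merged_data_set([group_a, group_b])
```

Because the shard index is a function of the hashed identifier alone, each merge task only needs the shards at its own index, which is what allows the merges to run independently of one another in this sketch.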