US 12,229,169 B1
Clustering key selection based on machine-learned key selection models for data processing service
Terry Kim, Bellevue, WA (US); Lin Ma, Ann Arbor, MI (US); Rahul Shivu Mahadev, Santa Clara, CA (US); and Rahul Potharaju, San Ramon, CA (US)
Assigned to Databricks, Inc., San Francisco, CA (US)
Filed by Databricks, Inc., San Francisco, CA (US)
Filed on Nov. 3, 2023, as Appl. No. 18/501,830.
Int. Cl. G06F 16/28 (2019.01); G06F 16/21 (2019.01); G06F 16/22 (2019.01)

CPC G06F 16/285 (2019.01) [G06F 16/211 (2019.01); G06F 16/2246 (2019.01)]

20 Claims

1. A computer-implemented method, comprising:

obtaining a table schema describing at least one or more keys of a data table, the data table stored at a storage system associated with a data processing service;

generating a set of features for the data table, wherein for each key of the data table, the set of features includes at least a name label for the key and a data type associated with the key;

generating a set of tokens from the set of features, wherein the set of tokens are numerical representations of features in the set of features;

generating a set of predictions by applying a machine-learned transformer model to the set of tokens, wherein the set of predictions include a prediction for each token indicating a likelihood that a key associated with the token is a clustering key for the data table;

determining one or more clustering keys based on the set of predictions, wherein the one or more clustering keys are a subset of the one or more keys; and

clustering data records of the data table into one or more data files based on key-values for the one or more clustering keys.
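
The data flow recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: every function name, the growing vocabulary, the 0.5 threshold, and the records-per-file limit are hypothetical choices made for illustration. In particular, the machine-learned transformer model is replaced by a hand-written stub scorer so the example stays self-contained; a trained model would take its place at that step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Token = Tuple[int, int]  # assumption: one (name_id, type_id) token per key


@dataclass(frozen=True)
class KeyFeature:
    """Per-key features named in the claim: a name label and a data type."""
    name: str
    data_type: str


def generate_features(schema: Sequence[Tuple[str, str]]) -> List[KeyFeature]:
    """Build one feature record per key of the table schema."""
    return [KeyFeature(name, dtype) for name, dtype in schema]


def tokenize(features: Sequence[KeyFeature], vocab: Dict[str, int]) -> List[Token]:
    """Map each feature to integer ids, growing the vocabulary as needed."""
    def token_id(text: str) -> int:
        return vocab.setdefault(text.lower(), len(vocab))
    return [(token_id(f.name), token_id(f.data_type)) for f in features]


def make_stub_model(
    vocab: Dict[str, int],
    preferred_types: Sequence[str] = ("date", "timestamp"),
) -> Callable[[Sequence[Token]], List[float]]:
    """Hypothetical stand-in for the transformer: scores by data type only."""
    def model(tokens: Sequence[Token]) -> List[float]:
        preferred = {vocab[t] for t in preferred_types if t in vocab}
        return [0.9 if type_id in preferred else 0.1 for _, type_id in tokens]
    return model


def select_clustering_keys(
    features: Sequence[KeyFeature],
    predictions: Sequence[float],
    threshold: float = 0.5,  # assumption: the claim does not fix a selection rule
) -> List[str]:
    """Keep the keys whose predicted likelihood meets the threshold."""
    return [f.name for f, p in zip(features, predictions) if p >= threshold]


def cluster_records(
    records: Sequence[dict],
    clustering_keys: Sequence[str],
    records_per_file: int = 2,
) -> List[List[dict]]:
    """Sort records by clustering key-values, then split them into 'files'."""
    ordered = sorted(records, key=lambda r: tuple(r[k] for k in clustering_keys))
    return [
        ordered[i:i + records_per_file]
        for i in range(0, len(ordered), records_per_file)
    ]


schema = [("event_date", "date"), ("user_id", "bigint"), ("payload", "string")]
records = [
    {"event_date": "2023-11-03", "user_id": 2, "payload": "b"},
    {"event_date": "2023-11-01", "user_id": 1, "payload": "a"},
    {"event_date": "2023-11-03", "user_id": 3, "payload": "c"},
    {"event_date": "2023-11-01", "user_id": 4, "payload": "d"},
]

vocab: Dict[str, int] = {}
features = generate_features(schema)
tokens = tokenize(features, vocab)
predictions = make_stub_model(vocab)(tokens)
keys = select_clustering_keys(features, predictions)
files = cluster_records(records, keys)
```

In this sketch, records sharing the same value for the selected key end up in the same file, which is the property that lets a query filtering on that key skip unrelated files. Sorting followed by fixed-size chunking is only one simple way to group records by key-values; the claim itself does not prescribe a particular clustering scheme.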