| CPC G06F 16/285 (2019.01) [G06F 16/211 (2019.01); G06F 16/2246 (2019.01)] | 20 Claims |

|
1. A computer-implemented method, comprising:
obtaining a table schema describing at least one or more keys of a data table, the data table stored at a storage system associated with a data processing service;
generating a set of features for the data table, wherein for each key of the data table, the set of features includes at least a name label for the key and a data type associated with the key;
generating a set of tokens from the set of features, wherein the set of tokens are numerical representations of features in the set of features;
generating a set of predictions by applying a machine-learned transformer model to the set of tokens, wherein the set of predictions include a prediction for each token indicating a likelihood that a key associated with the token is a clustering key for the data table;
determining one or more clustering keys based on the set of predictions, wherein the one or more clustering keys are a subset of the one or more keys; and
clustering data records of the data table into one or more data files based on key-values for the one or more clustering keys.
|
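The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation. The schema, the threshold of 0.5, and every function name below are hypothetical, and `placeholder_model` is a fixed rule standing in for the machine-learned transformer model, which the claim does not specify further. Grouping records by exact key-value tuple is likewise only one assumed way of clustering records into data files.

```python
from collections import defaultdict

# Hypothetical table schema: (key name, data type) for each key of the data table.
SCHEMA = [("event_date", "date"), ("user_id", "bigint"), ("payload", "string")]


def build_features(schema):
    """Generate one feature record per key: its name label and its data type."""
    return [{"name": name, "dtype": dtype} for name, dtype in schema]


def tokenize(features, vocab):
    """Map each feature to integer ids (numerical representations).

    The vocabulary is grown on the fly; a trained model would use a fixed one.
    """
    def token_id(text):
        return vocab.setdefault(text, len(vocab))

    return [
        (token_id("name:" + f["name"]), token_id("dtype:" + f["dtype"]))
        for f in features
    ]


def placeholder_model(tokens, vocab):
    """Stand-in for the transformer model (assumption: NOT a transformer).

    Returns one likelihood per key from a made-up rule on the data type.
    """
    id_to_text = {idx: text for text, idx in vocab.items()}
    made_up_scores = {"dtype:date": 0.9, "dtype:bigint": 0.7}
    return [made_up_scores.get(id_to_text[dtype_id], 0.1) for _, dtype_id in tokens]


def select_clustering_keys(schema, predictions, threshold=0.5):
    """Keep the keys whose predicted likelihood meets the (assumed) threshold."""
    return [name for (name, _), p in zip(schema, predictions) if p >= threshold]


def cluster_records(records, clustering_keys):
    """Group records into 'files' keyed by their values for the clustering keys."""
    files = defaultdict(list)
    for record in records:
        files[tuple(record[k] for k in clustering_keys)].append(record)
    return dict(files)


# Example run over three made-up records.
vocab = {}
features = build_features(SCHEMA)
tokens = tokenize(features, vocab)
predictions = placeholder_model(tokens, vocab)
clustering_keys = select_clustering_keys(SCHEMA, predictions)

records = [
    {"event_date": "2024-01-01", "user_id": 1, "payload": "a"},
    {"event_date": "2024-01-01", "user_id": 1, "payload": "b"},
    {"event_date": "2024-01-02", "user_id": 2, "payload": "c"},
]
files = cluster_records(records, clustering_keys)
```

In this sketch the selected clustering keys are always a subset of the schema's keys, matching the claim's requirement, and records that share values for those keys land in the same group.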