CPC G06F 40/40 (2020.01) [G06F 16/35 (2019.01); G06F 40/284 (2020.01)] | 20 Claims |
1. A method performed by a processor, said method comprising:
generating a plurality of sequences of tokens from raw text data to be tagged for generating training data for a machine learning model;
calculating distances between each pair of the sequences of tokens;
training, using the calculated distances, an embedding layer to map the plurality of sequences of tokens into corresponding vector representations;
clustering the vector representations to generate a plurality of clusters;
selecting a single sample vector representation from each cluster of the plurality of clusters; and
tagging text associated with the selected single sample vector representation to generate the training data for the machine learning model while avoiding tagging of text associated with non-selected vector representations.
|
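For illustration only, the steps recited in claim 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions and not the claimed implementation: the claim does not specify the tokenizer, the distance measure, the form of the embedding layer, the clustering algorithm, or the selection rule. The sketch substitutes whitespace tokenization, Jaccard distance, a small embedding table fitted by gradient descent so that Euclidean distances approximate the calculated distances, greedy radius-based clustering, and selection of the member nearest the cluster centroid. All function names, the `radius` parameter, and the `tag_fn` callback are hypothetical.

```python
import random
from itertools import combinations

def tokenize(text):
    # Assumption: whitespace tokenization; the claim does not fix a tokenizer.
    return text.lower().split()

def jaccard_distance(a, b):
    # Assumption: Jaccard distance on token sets stands in for the claimed distance.
    sa, sb = set(a), set(b)
    union = sa | sb
    return 1.0 - len(sa & sb) / len(union) if union else 0.0

def euclidean(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5

def train_embeddings(dist, n, dim=2, epochs=500, lr=0.05, seed=0):
    # Fit one vector per sequence so Euclidean distances approximate `dist`.
    # This embedding table is a stand-in for the claimed embedding layer.
    rng = random.Random(seed)
    emb = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(n)]
    for _ in range(epochs):
        for (i, j), target in dist.items():
            diff = [emb[i][k] - emb[j][k] for k in range(dim)]
            d = sum(x * x for x in diff) ** 0.5 or 1e-9
            g = (d - target) / d  # gradient scale of 0.5 * (d - target)**2
            for k in range(dim):
                emb[i][k] -= lr * g * diff[k]
                emb[j][k] += lr * g * diff[k]
    return emb

def cluster(emb, radius):
    # Greedy clustering: join the first cluster whose first member is within
    # `radius`, otherwise start a new cluster.
    clusters = []
    for i, v in enumerate(emb):
        for members in clusters:
            if euclidean(v, emb[members[0]]) <= radius:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def pick_representative(members, emb):
    # Select the single member closest to the cluster centroid.
    dim = len(emb[0])
    centroid = [sum(emb[i][k] for i in members) / len(members) for k in range(dim)]
    return min(members, key=lambda i: euclidean(emb[i], centroid))

def build_training_data(texts, tag_fn, radius=0.5):
    seqs = [tokenize(t) for t in texts]
    pairs = combinations(range(len(seqs)), 2)
    dist = {(i, j): jaccard_distance(seqs[i], seqs[j]) for i, j in pairs}
    emb = train_embeddings(dist, len(seqs))
    clusters = cluster(emb, radius)
    selected = [pick_representative(m, emb) for m in clusters]
    # Only the selected samples are tagged; all other texts are left untagged.
    tagged = [(texts[i], tag_fn(texts[i])) for i in selected]
    return clusters, selected, tagged

# Hypothetical raw text data and a placeholder tagger (e.g. a human annotator).
texts = [
    "reset my password please",
    "please reset my password",
    "how do I reset my password",
    "refund my last order",
    "I want a refund for my order",
    "refund for my last order please",
]
clusters, selected, tagged = build_training_data(texts, lambda t: "LABEL_PENDING")
print(len(texts), "texts,", len(tagged), "tagged")
```

In this sketch the tagging effort scales with the number of clusters rather than the number of raw texts, which corresponds to the final step of the claim: one sample per cluster is tagged and the remaining texts are left untagged.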