US 11,755,846 B1
Efficient tagging of training data for machine learning models
Itay Margolin, Tel Aviv (IL); and Yair Horesh, Tel Aviv (IL)
Assigned to INTUIT INC., Mountain View, CA (US)
Filed by INTUIT INC., Mountain View, CA (US)
Filed on Oct. 28, 2022, as Appl. No. 18/50,973.
Int. Cl. G06F 40/40 (2020.01); G06F 16/35 (2019.01); G06F 40/284 (2020.01)
CPC G06F 40/40 (2020.01) [G06F 16/35 (2019.01); G06F 40/284 (2020.01)]
20 Claims
 
1. A method performed by a processor, said method comprising:
generating a plurality of sequences of tokens from raw text data to be tagged for generating training data for a machine learning model;
calculating distances between each pair of the sequences of tokens;
training, using the calculated distances, an embedding layer to map the plurality of sequences of tokens into corresponding vector representations;
clustering the vector representations to generate a plurality of clusters;
selecting a single sample vector representation from each cluster of the plurality of clusters; and
tagging text associated with the selected single sample vector representation to generate the training data for the machine learning model while avoiding tagging of text associated with non-selected vector representations.
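
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and it rests on several assumptions the claim does not state: whitespace tokenization, Jaccard distance between token sets, a plain table of vectors fitted by gradient descent on pairwise distance error (standing in for the claimed embedding layer), and a simple leader clustering with a hypothetical `radius` parameter. All function names, including the caller-supplied `tag_fn`, are hypothetical.

```python
import random
from itertools import combinations

def tokenize(text):
    # Assumption: lowercase whitespace tokenization.
    return text.lower().split()

def jaccard_distance(a, b):
    # Assumption: the claim does not name a distance; Jaccard is one option.
    sa, sb = set(a), set(b)
    union = sa | sb
    return 1.0 - len(sa & sb) / len(union) if union else 0.0

def euclid(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5

def train_embeddings(dist, dim=2, epochs=300, lr=0.05, seed=0):
    # Stand-in for the embedding layer: one vector per sequence, fitted so
    # that Euclidean distances approximate the precomputed distances.
    rng = random.Random(seed)
    n = len(dist)
    emb = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(n)]
    for _ in range(epochs):
        for i, j in combinations(range(n), 2):
            diff = [a - b for a, b in zip(emb[i], emb[j])]
            d = euclid(emb[i], emb[j]) or 1e-9
            g = (d - dist[i][j]) / d  # gradient factor of 0.5 * (d - target)^2
            for k in range(dim):
                emb[i][k] -= lr * g * diff[k]
                emb[j][k] += lr * g * diff[k]
    return emb

def leader_cluster(vectors, radius):
    # Each cluster is a list of indices; the first index is its leader.
    clusters = []
    for i, v in enumerate(vectors):
        for c in clusters:
            if euclid(v, vectors[c[0]]) <= radius:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def select_and_tag(texts, tag_fn, radius=0.5):
    seqs = [tokenize(t) for t in texts]
    n = len(seqs)
    dist = [[jaccard_distance(seqs[i], seqs[j]) for j in range(n)]
            for i in range(n)]
    emb = train_embeddings(dist)
    clusters = leader_cluster(emb, radius)
    # Tag only one sample per cluster; all other texts stay untagged.
    tagged = {c[0]: tag_fn(texts[c[0]]) for c in clusters}
    return clusters, tagged

if __name__ == "__main__":
    texts = [
        "payment to acme hardware store",
        "payment to acme hardware shop",
        "monthly salary deposit from employer",
        "monthly salary deposit employer",
        "coffee at corner cafe",
    ]
    clusters, tagged = select_and_tag(texts, tag_fn=lambda t: "NEEDS_REVIEW")
    print(clusters, tagged)
```

In this sketch the tagging cost scales with the number of clusters rather than the number of texts, which is the saving the claim describes. A production system would likely replace the vector table with a trained neural embedding layer and the leader clustering with a standard algorithm such as k-means.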