| CPC G06F 16/285 (2019.01) | 20 Claims |

|
1. A system for identifying data labels for submitting to additional data labeling routines based on embedding clusters, the system comprising:
one or more processors; and
one or more non-transitory, computer-readable mediums comprising instructions that when executed by the one or more processors cause operations comprising:
retrieving an unlabeled dataset, wherein the unlabeled dataset comprises a plurality of unlabeled samples, wherein the plurality of unlabeled samples is based on unstructured text based on linguistic inputs;
generating a plurality of embeddings based on the unlabeled dataset, wherein the plurality of embeddings comprises a respective embedding for each unlabeled sample in the plurality of unlabeled samples;
clustering the plurality of embeddings into a plurality of clusters;
generating a first labeled dataset based on the unlabeled dataset, wherein the first labeled dataset comprises a first plurality of labeled samples, and wherein the first plurality of labeled samples is generated using a first data labeling routine;
determining that a first cluster of the plurality of clusters has a first cluster characteristic, wherein the first cluster characteristic comprises a similarity between data in the first cluster, and wherein the similarity indicates respective distances between the data in the first cluster;
based on the first cluster having the first cluster characteristic:
determining a first labeled sample of the first labeled dataset corresponding to the first cluster; and
determining to submit the first labeled sample to a second data labeling routine;
generating, using the second data labeling routine, a second labeled sample based on the first labeled sample;
deleting the first labeled sample from the first plurality of labeled samples; and
adding the second labeled sample to the first plurality of labeled samples.
|
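
The operations recited in claim 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the function names (`relabel_by_cluster`, `mean_pairwise_distance`), the callables passed in for embedding, clustering, and the two labeling routines, and the `max_mean_distance` threshold are all hypothetical. The claim does not say which similarity value triggers resubmission; the sketch assumes that a loosely packed cluster (large mean pairwise distance between its embeddings) is the one whose labeled samples go to the second routine.

```python
from itertools import combinations
from math import dist


def mean_pairwise_distance(vectors):
    """Average Euclidean distance over all pairs; 0.0 for fewer than two vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


def relabel_by_cluster(samples, embed, cluster, first_routine, second_routine,
                       max_mean_distance):
    """Hypothetical pipeline following the order of operations in claim 1."""
    # One embedding per unlabeled sample.
    embeddings = [embed(sample) for sample in samples]
    # Cluster the embeddings; `cluster` returns one cluster id per embedding.
    assignments = cluster(embeddings)
    # First labeled dataset, produced by the first data labeling routine.
    labeled = [(sample, first_routine(sample)) for sample in samples]

    members_by_cluster = {}
    for index, cluster_id in enumerate(assignments):
        members_by_cluster.setdefault(cluster_id, []).append(index)

    for members in members_by_cluster.values():
        # Cluster characteristic: similarity expressed as distances between members.
        spread = mean_pairwise_distance([embeddings[i] for i in members])
        if spread > max_mean_distance:  # assumed trigger condition
            for i in members:
                # Overwriting the entry stands in for deleting the first labeled
                # sample and adding the second labeled sample.
                labeled[i] = second_routine(labeled[i])
    return labeled


# Toy data (hypothetical): fixed 2-D "embeddings" and trivial labeling routines.
POINTS = {"a": (0.0, 0.0), "b": (0.0, 1.0), "c": (10.0, 10.0), "d": (20.0, 30.0)}

result = relabel_by_cluster(
    samples=["a", "b", "c", "d"],
    embed=lambda s: POINTS[s],
    cluster=lambda vecs: [0 if v[0] < 5 else 1 for v in vecs],
    first_routine=lambda s: "auto",
    second_routine=lambda pair: (pair[0], "reviewed"),
    max_mean_distance=5.0,
)
print(result)
# [('a', 'auto'), ('b', 'auto'), ('c', 'reviewed'), ('d', 'reviewed')]
```

In this toy run the cluster holding "a" and "b" is tight (distance 1.0) and keeps its first-routine labels, while the cluster holding "c" and "d" is spread out (distance about 22.4) and is sent to the second routine. In practice the embedding, clustering, and labeling callables would be replaced by real models or human review steps.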