US 12,488,022 B2
Systems and methods for identifying data labels for submitting to additional data labeling routines based on embedding clusters
Joshua Edwards, Philadelphia, PA (US); Purva Shanker, Arlington, VA (US); Jing Zhu, McLean, VA (US); Zhuqing Zhang, McLean, VA (US); Nathan Wolfe, Silver Spring, MD (US); and Ebony Edwards, McLean, VA (US)
Assigned to Capital One Services, LLC, McLean, VA (US)
Filed by Capital One Services, LLC, McLean, VA (US)
Filed on Nov. 27, 2023, as Appl. No. 18/520,417.
Prior Publication US 2025/0173359 A1, May 29, 2025
Int. Cl. G06F 7/00 (2006.01); G06F 16/28 (2019.01)
CPC G06F 16/285 (2019.01)
20 Claims
 
1. A system for identifying data labels for submitting to additional data labeling routines based on embedding clusters, the system comprising:
one or more processors; and
one or more non-transitory, computer-readable mediums comprising instructions that when executed by the one or more processors cause operations comprising:
retrieving an unlabeled dataset, wherein the unlabeled dataset comprises a plurality of unlabeled samples, wherein the plurality of unlabeled samples is based on unstructured text based on linguistic inputs;
generating a plurality of embeddings based on the unlabeled dataset, wherein the plurality of embeddings comprises a respective embedding for each unlabeled sample in the plurality of unlabeled samples;
clustering the plurality of embeddings into a plurality of clusters;
generating a first labeled dataset based on the unlabeled dataset, wherein the first labeled dataset comprises a first plurality of labeled samples, and wherein the first plurality of labeled samples is generated using a first data labeling routine;
determining that a first cluster of the plurality of clusters has a first cluster characteristic, wherein the first cluster characteristic comprises a similarity between data in the first cluster, and wherein the similarity indicates respective distances between the data in the first cluster;
based on the first cluster having the first cluster characteristic:
determining a first labeled sample of the first labeled dataset corresponding to the first cluster; and
determining to submit the first labeled sample to a second data labeling routine;
generating, using the second data labeling routine, a second labeled sample based on the first labeled sample;
deleting the first labeled sample from the first plurality of labeled samples; and
adding the second labeled sample to the first plurality of labeled samples.
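
By way of non-limiting illustration only, the operations recited in claim 1 might be sketched as follows. Everything named in this sketch is an assumption and is not taken from the patent: the toy vocabulary embedding (`embed`), the plain k-means clustering, the keyword-based `first_routine`, the stand-in `second_routine`, the 0.5 threshold, and the choice of mean pairwise Euclidean distance as the cluster characteristic. In particular, the claim does not say whether high or low similarity triggers the second routine; this sketch assumes that a loosely packed cluster (large distances) is the one sent for relabeling.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy vocabulary; a real system would use a learned embedding model.
VOCAB = ["card", "lost", "stolen", "loan", "rate", "mortgage"]

def embed(text):
    """Toy embedding: L2-normalized counts of vocabulary words."""
    counts = Counter(text.lower().split())
    vec = [float(counts[w]) for w in VOCAB]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def dist(a, b):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=10):
    """Plain k-means; the first k points seed the centroids (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def mean_pairwise_distance(members):
    """Assumed cluster characteristic: average distance between cluster members."""
    pairs = list(combinations(members, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

def first_routine(text):
    """Hypothetical cheap first labeling routine: a single keyword rule."""
    return "card" if "card" in text.lower() else "lending"

def second_routine(text, first_label):
    """Hypothetical stand-in for a costlier second routine (e.g. manual review)."""
    if "stolen" in text.lower() or "lost" in text.lower():
        return "card_fraud"
    return first_label

def relabel(texts, k=2, threshold=0.5):
    embeddings = [embed(t) for t in texts]              # one embedding per sample
    assign = kmeans(embeddings, k)                      # cluster the embeddings
    labeled = [(t, first_routine(t)) for t in texts]    # first labeled dataset
    for c in range(k):
        idx = [i for i, a in enumerate(assign) if a == c]
        # Assumption: a loose cluster (large mean distance) triggers relabeling.
        if mean_pairwise_distance([embeddings[i] for i in idx]) > threshold:
            for i in idx:
                text, label = labeled[i]
                # Replace the first labeled sample with the second labeled sample.
                labeled[i] = (text, second_routine(text, label))
    return labeled, assign

TEXTS = [
    "I lost my card",
    "what is the mortgage rate",
    "my card was stolen",
    "mortgage rate today",
    "new card please",
]

if __name__ == "__main__":
    final_labels, clusters = relabel(TEXTS)
    for (text, label), cluster in zip(final_labels, clusters):
        print(cluster, label, text)
```

In this sketch, replacing the entry in place stands in for the claimed deleting and adding steps; an implementation that keeps an audit trail might instead remove the first labeled sample and append the second as separate operations.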