US 12,271,443 B1
Automatic data curation
Diego Ardila, Oakland, CA (US); Russell Kaplan, San Francisco, CA (US); Vinjai Saraj Vale, Exeter, NH (US); and Jihan Yin, San Francisco, CA (US)
Assigned to SCALE AI, INC., San Francisco, CA (US)
Filed by Scale AI, Inc., San Francisco, CA (US)
Filed on Sep. 23, 2021, as Appl. No. 17/482,860.
Int. Cl. G06F 18/214 (2023.01); G06F 18/21 (2023.01); G06N 3/088 (2023.01); G06N 3/0895 (2023.01); G06N 3/09 (2023.01); G06N 20/00 (2019.01); G06V 10/70 (2022.01); G16B 40/20 (2019.01); G16B 40/30 (2019.01); G06F 16/24 (2019.01); G06F 16/28 (2019.01); G06F 16/332 (2019.01); G06F 16/3329 (2025.01); G06F 16/35 (2019.01); G06F 16/903 (2019.01); G06F 16/9032 (2019.01); G06F 16/906 (2019.01); G06F 18/23 (2023.01); G06N 3/091 (2023.01); G06V 10/82 (2022.01)

CPC G06F 18/2148 (2023.01) [G06F 18/214 (2023.01); G06F 18/2155 (2023.01); G06F 18/2193 (2023.01); G06N 3/088 (2013.01); G06N 3/0895 (2023.01); G06N 3/09 (2023.01); G06N 20/00 (2019.01); G06V 10/70 (2022.01); G16B 40/20 (2019.02); G16B 40/30 (2019.02); G06F 16/24 (2019.01); G06F 16/285 (2019.01); G06F 16/3329 (2019.01); G06F 16/35 (2019.01); G06F 16/903 (2019.01); G06F 16/90332 (2019.01); G06F 16/906 (2019.01); G06F 18/217 (2023.01); G06F 18/23 (2023.01); G06N 3/091 (2023.01); G06V 10/82 (2022.01)]

15 Claims

1. A computer-implemented method for curating a data sample set, the method comprising:

generating a set of outputs by processing at least one data sample using a trained machine learning model;

concurrently determining on a plurality of computing instances a plurality of relevance scores between a plurality of sampling objectives and the set of outputs generated using the trained machine learning model, wherein each sampling objective in the plurality of sampling objectives is associated with a different computing instance in the plurality of computing instances and is associated with a different objective that is to be achieved during retraining of the trained machine learning model, and wherein each relevance score represents a relevance of a given sampling objective in the plurality of sampling objectives to the trained machine learning model;

determining a given sampling objective for the data sample set that is to be achieved during retraining of the trained machine learning model, wherein the determining of the given sampling objective is based on the plurality of relevance scores;

determining one or more data sampling criteria based on the given sampling objective;

selecting, from a set of unlabeled data samples, at least one data sample for labeling and adding to the data sample set based on the one or more data sampling criteria;

for each selected data sample, supplementing the data sample set with the selected data sample and at least one association with a label; and

retraining the trained machine learning model using the supplemented data sample set.
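
The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the two objective names, the 0.6 confidence threshold, the relevance formulas, and the `model`, `request_label`, and `retrain` callables are all hypothetical, and a thread pool stands in for the plurality of computing instances. The sketch also assumes the model outputs are generated from the unlabeled pool itself and that the pool is non-empty.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Assumption: each model output is a (predicted_class, confidence) pair.
LOW_CONFIDENCE = 0.6  # hypothetical threshold, not taken from the claim


def low_confidence_relevance(outputs):
    """Fraction of outputs whose confidence falls below the threshold."""
    return sum(1 for _, conf in outputs if conf < LOW_CONFIDENCE) / len(outputs)


def class_imbalance_relevance(outputs):
    """1 minus the ratio of the rarest to the most common predicted class."""
    counts = Counter(cls for cls, _ in outputs)
    return 1.0 - min(counts.values()) / max(counts.values())


# Hypothetical sampling objectives, each paired with its relevance function.
OBJECTIVES = {
    "improve_low_confidence": low_confidence_relevance,
    "balance_classes": class_imbalance_relevance,
}


def score_objectives(outputs):
    """Score every objective concurrently, one worker per objective."""
    with ThreadPoolExecutor(max_workers=len(OBJECTIVES)) as pool:
        futures = {name: pool.submit(fn, outputs) for name, fn in OBJECTIVES.items()}
        return {name: future.result() for name, future in futures.items()}


def choose_objective(scores):
    """Pick the objective with the highest relevance score."""
    return max(scores, key=scores.get)


def criteria_for(objective, outputs):
    """Turn the chosen objective into a predicate over (class, confidence)."""
    if objective == "improve_low_confidence":
        return lambda cls, conf: conf < LOW_CONFIDENCE
    counts = Counter(cls for cls, _ in outputs)
    rarest = min(counts, key=counts.get)
    return lambda cls, conf: cls == rarest


def curate(model, dataset, unlabeled, request_label, retrain):
    """Run one curation round and return (retrained model, objective, data)."""
    outputs = [model(x) for x in unlabeled]           # generate outputs
    scores = score_objectives(outputs)                # concurrent relevance scores
    objective = choose_objective(scores)              # objective from the scores
    matches = criteria_for(objective, outputs)        # sampling criteria
    selected = [x for x, (cls, conf) in zip(unlabeled, outputs) if matches(cls, conf)]
    supplemented = dataset + [(x, request_label(x)) for x in selected]
    return retrain(supplemented), objective, supplemented
```

Here `retrain` is any callable that accepts the supplemented list of (sample, label) pairs, and `request_label` stands in for whatever labeling step produces the association with a label. In a distributed setting the thread pool could be replaced by separate machines or processes without changing the rest of the flow.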