US 11,948,064 B2
System, method, and computer program product for cleaning noisy data from unlabeled datasets using autoencoders
Qingguo Chen, Round Rock, TX (US); Yiwei Cai, Mercer Island, WA (US); Dan Wang, Cedar Park, TX (US); and Peng Wu, College Station, TX (US)
Assigned to Visa International Service Association, San Francisco, CA (US)
Appl. No. 18/026,742
Filed by Visa International Service Association, San Francisco, CA (US)
PCT Filed Sep. 2, 2022, PCT No. PCT/US2022/042433
§ 371(c)(1), (2) Date Mar. 16, 2023,
PCT Pub. No. WO2023/107164, PCT Pub. Date Jun. 15, 2023.
Claims priority of provisional application 63/287,225, filed on Dec. 8, 2021.
Prior Publication US 2024/0028874 A1, Jan. 25, 2024
Int. Cl. G06N 3/0455 (2023.01); G06N 3/09 (2023.01)
CPC G06N 3/0455 (2023.01) [G06N 3/09 (2023.01)]
20 Claims
 
1. A computer-implemented method, comprising:
receiving, with at least one processor, training data comprising a plurality of noisy samples labeled as noisy and a plurality of other samples not labeled as noisy;
training, with at least one processor, an autoencoder network based on the training data to increase a first metric based on the plurality of noisy samples and a plurality of first outputs generated by the autoencoder network using the plurality of noisy samples and to reduce a second metric based on the plurality of other samples and a plurality of second outputs generated by the autoencoder network using the plurality of other samples;
receiving, with at least one processor, unlabeled data comprising a plurality of unlabeled samples;
generating, with at least one processor, a plurality of third outputs by the autoencoder network based on the plurality of unlabeled samples;
for each respective unlabeled sample of the plurality of unlabeled samples, determining, with at least one processor, a respective third metric based on the respective unlabeled sample and a respective third output of the plurality of third outputs;
for each respective unlabeled sample of the plurality of unlabeled samples, determining, with at least one processor, whether to label the respective unlabeled sample as noisy or clean based on the respective third metric and a threshold; and
for each respective unlabeled sample determined to be labeled as noisy, cleaning, with at least one processor, the respective unlabeled sample.
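
The steps of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation, and everything concrete in it is an assumption made for illustration: the synthetic data, the tiny linear autoencoder, the squared reconstruction error used as the metric, the hinge (`margin`) that bounds the term being increased, the weight `lam`, the 99th-percentile threshold, and the reading of "cleaning" as dropping the flagged sample. The claim itself fixes none of these choices.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, HIDDEN = 8, 2

# Synthetic stand-in data (assumption): samples not labeled as noisy lie near a
# 2-D subspace; samples labeled as noisy are isotropic Gaussian vectors.
basis = np.linalg.qr(rng.normal(size=(DIM, HIDDEN)))[0].T  # orthonormal rows

def make_clean(n):
    return rng.normal(size=(n, HIDDEN)) @ basis + 0.05 * rng.normal(size=(n, DIM))

def make_noisy(n):
    return rng.normal(size=(n, DIM))

def recon_error(x, enc, dec):
    """Per-sample squared reconstruction error (the metric in this sketch)."""
    return np.sum((x @ enc @ dec - x) ** 2, axis=1)

def train(noisy, other, steps=2000, lr=0.05, lam=0.1, margin=4.0):
    """Gradient descent on: mean error(other) + lam * mean hinge(margin - error(noisy))."""
    enc = 0.1 * rng.normal(size=(DIM, HIDDEN))
    dec = 0.1 * rng.normal(size=(HIDDEN, DIM))
    for _ in range(steps):
        # Second metric: reduce reconstruction error on the other samples.
        diff_o = other @ enc @ dec - other
        g_o = 2.0 * diff_o / len(other)
        # First metric: increase error on noisy samples. The hinge keeps the
        # objective bounded (assumption; the claim names no loss form).
        diff_n = noisy @ enc @ dec - noisy
        active = (np.sum(diff_n ** 2, axis=1) < margin)[:, None]
        g_n = -lam * 2.0 * diff_n * active / len(noisy)
        # Backpropagate both terms through the linear decoder and encoder.
        grad_dec = (other @ enc).T @ g_o + (noisy @ enc).T @ g_n
        grad_enc = other.T @ (g_o @ dec.T) + noisy.T @ (g_n @ dec.T)
        enc -= lr * grad_enc
        dec -= lr * grad_dec
    return enc, dec

def label_and_clean(unlabeled, enc, dec, threshold):
    errors = recon_error(unlabeled, enc, dec)  # third metric, one per sample
    is_noisy = errors > threshold              # label each sample noisy or clean
    cleaned = unlabeled[~is_noisy]             # "cleaning" = dropping (assumption)
    return errors, is_noisy, cleaned

noisy_train, other_train = make_noisy(200), make_clean(200)
enc, dec = train(noisy_train, other_train)
# Threshold choice is hypothetical: 99th percentile of error on the other samples.
threshold = np.percentile(recon_error(other_train, enc, dec), 99)

# First 50 rows are clean-like, last 50 are noisy-like; labels are withheld.
unlabeled = np.vstack([make_clean(50), make_noisy(50)])
errors, is_noisy, cleaned = label_and_clean(unlabeled, enc, dec, threshold)
print(f"threshold={threshold:.3f}, flagged {is_noisy.sum()} of {len(unlabeled)}")
```

In this sketch the two training terms pull in opposite directions by design: the autoencoder is rewarded for reconstructing the samples not labeled as noisy and penalized for reconstructing the labeled-noisy ones, so a high reconstruction error on an unlabeled sample serves as the signal for labeling it noisy.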