| CPC G06F 16/906 (2019.01) [G06F 16/9024 (2019.01); G06F 16/93 (2019.01); G06N 3/045 (2023.01); G06N 3/08 (2013.01); G06N 3/09 (2023.01)] | 9 Claims |

|
1. A system for reducing computation resources used by a data processing system for training a machine learning model for classifying documents, the system comprising one or more processors configured to:
receive a training dataset comprising digital copies of sample documents;
create a graph embedding vector for each of the sample documents;
cluster the graph embedding vectors of the sample documents of the training dataset into clusters based on the similarity between the graph embedding vectors;
select a first set of training data comprising a subset of the graph embedding vectors, wherein selecting the first set of training data comprises:
selecting a subset of the clusters; and
selecting graph embedding vectors from the selected clusters, wherein graph embedding vectors are selected from each of the selected clusters, and wherein, for one or more of the selected clusters having a number of graph embedding vectors greater than a predefined upper threshold value, only a subset of the graph embedding vectors of that cluster is included in the first set of training data; and
input the first set of training data as input training data for the machine learning model.
|
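
The clustering and selection steps recited in claim 1 can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the claim: the greedy cosine-similarity clustering stands in for whatever clustering method is actually used, the random choice of clusters and of vectors within an oversized cluster is one possible selection rule, and the 2-dimensional toy vectors stand in for real graph embedding vectors. The function names `cluster_by_similarity` and `select_training_set` and the parameters `min_similarity`, `num_clusters`, and `upper_threshold` are hypothetical.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_by_similarity(vectors, min_similarity=0.9):
    """Greedy clustering: join the first cluster whose seed vector is similar enough.

    Hypothetical stand-in for the clustering step (assumption, not from the claim).
    Returns a list of clusters, each a list of indices into `vectors`.
    """
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vectors[members[0]], vec) >= min_similarity:
                members.append(i)
                break
        else:
            # No existing cluster was similar enough, so start a new one.
            clusters.append([i])
    return clusters


def select_training_set(clusters, num_clusters, upper_threshold, seed=0):
    """Pick a subset of clusters, then cap oversized clusters at `upper_threshold`.

    Every selected cluster contributes at least one vector; a cluster holding
    more than `upper_threshold` vectors contributes only a random subset.
    """
    rng = random.Random(seed)
    chosen = rng.sample(clusters, min(num_clusters, len(clusters)))
    selected = []
    for members in chosen:
        if len(members) > upper_threshold:
            # Oversized cluster: keep only a subset of its vectors.
            members = rng.sample(members, upper_threshold)
        selected.extend(members)
    return sorted(selected)


# Toy embeddings (assumption): five near-duplicates, two near-duplicates, one outlier.
embeddings = [
    [1.0, 0.0], [0.99, 0.05], [0.98, 0.02], [1.0, 0.01], [0.97, 0.03],
    [0.0, 1.0], [0.02, 0.99],
    [-1.0, 0.0],
]
clusters = cluster_by_similarity(embeddings)
training_indices = select_training_set(clusters, num_clusters=2, upper_threshold=3)
first_training_set = [embeddings[i] for i in training_indices]
```

In this sketch `first_training_set` plays the role of the first set of training data that would then be passed to the model. Capping the five-member cluster at three vectors shows how near-duplicate documents can be thinned out, which is one way the amount of training input might be reduced.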