CPC G06F 21/6245 (2013.01) [G06F 16/906 (2019.01); G06F 16/9035 (2019.01); G06F 16/93 (2019.01); G06F 18/23213 (2023.01); G06N 20/00 (2019.01)] | 21 Claims |
1. A method for data management of documents in one or more data repositories in a computer network or cloud infrastructure, the method comprising:
sampling the documents in the one or more data repositories, wherein sampling the documents comprises sampling a document extension and one or more other metadata features of each of the documents;
formulating representative subsets of the sampled documents in response to the document extension and the one or more other metadata features of the sampled documents;
generating sampled data sets of the sampled documents in response to the metadata features of the sampled documents; and
balancing the sampled data sets for further processing of the sampled documents,
wherein the formulation of the representative subsets is performed for identification of at least one of the representative subsets for initial processing.
|
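The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. The names `sample_metadata`, `formulate_subsets`, `generate_data_sets`, `balance`, and `pick_initial_subset` are hypothetical, as are the choice of file size as the "other metadata feature" and the 1 MB threshold. Grouping by extension and size bucket stands in for whatever subset formulation the specification actually uses, and downsampling to the smallest set is only one possible way to balance.

```python
import random
from collections import defaultdict
from pathlib import PurePosixPath


def size_bucket(size, threshold=1_000_000):
    """Assumed metadata feature: coarse size class of a document."""
    return "large" if size >= threshold else "small"


def sample_metadata(docs, k, rng):
    """Sample up to k documents, keeping only extension and size metadata."""
    chosen = rng.sample(docs, min(k, len(docs)))
    return [
        {"path": path, "ext": PurePosixPath(path).suffix.lower(), "size": size}
        for path, size in chosen
    ]


def formulate_subsets(records):
    """Representative subsets keyed by (extension, size bucket)."""
    subsets = defaultdict(list)
    for rec in records:
        subsets[(rec["ext"], size_bucket(rec["size"]))].append(rec)
    return dict(subsets)


def generate_data_sets(records):
    """Sampled data sets keyed by the metadata feature alone (size bucket)."""
    data_sets = defaultdict(list)
    for rec in records:
        data_sets[size_bucket(rec["size"])].append(rec)
    return dict(data_sets)


def balance(data_sets, rng):
    """Downsample every data set to the size of the smallest one."""
    target = min(len(items) for items in data_sets.values())
    return {key: rng.sample(items, target) for key, items in data_sets.items()}


def pick_initial_subset(subsets):
    """Identify one subset for initial processing (here: the largest)."""
    return max(subsets, key=lambda key: len(subsets[key]))


if __name__ == "__main__":
    # Hypothetical repository listing: (path, size in bytes).
    docs = [
        ("a/report.PDF", 2_500_000), ("a/notes.txt", 1_200),
        ("b/scan.pdf", 3_000_000), ("b/todo.txt", 800),
        ("c/data.csv", 5_000_000), ("c/readme.txt", 400),
        ("d/log.txt", 900),
    ]
    rng = random.Random(0)  # fixed seed so the run is repeatable
    records = sample_metadata(docs, len(docs), rng)
    subsets = formulate_subsets(records)
    balanced = balance(generate_data_sets(records), rng)
    print(pick_initial_subset(subsets))
    print({key: len(items) for key, items in balanced.items()})
```

In this sketch the representative subsets depend on both the extension and the other metadata feature, while the sampled data sets depend on the metadata feature only, mirroring the distinction the claim draws between the two steps.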