US 11,675,926 B2
Systems and methods for subset selection and optimization for balanced sampled dataset generation
Christopher Muffat, Singapore (SG); and Tetiana Kodliuk, Singapore (SG)
Assigned to DATHENA SCIENCE PTE LTD, Singapore (SG)
Filed by Dathena Science Pte Ltd, Singapore (SG)
Filed on Dec. 30, 2019, as Appl. No. 16/730,111.
Claims priority of application No. 10201811834U (SG), filed on Dec. 31, 2018.
Prior Publication US 2020/0250241 A1, Aug. 6, 2020
Int. Cl. G06F 16/93 (2019.01); G06F 21/62 (2013.01); G06F 16/9035 (2019.01); G06N 20/00 (2019.01); G06F 16/906 (2019.01); G06F 18/23213 (2023.01)
CPC G06F 21/6245 (2013.01) [G06F 16/906 (2019.01); G06F 16/9035 (2019.01); G06F 16/93 (2019.01); G06F 18/23213 (2023.01); G06N 20/00 (2019.01)]
21 Claims
 
1. A method for data management of documents in one or more data repositories in a computer network or cloud infrastructure, the method comprising:
sampling the documents in the one or more data repositories, wherein sampling the documents comprises sampling a document extension and one or more other metadata features of each of the documents;
formulating representative subsets of the sampled documents in response to the document extension and the one or more other metadata features of the sampled documents;
generating sampled data sets of the sampled documents in response to the metadata features of the sampled documents; and
balancing the sampled data sets for further processing of the sampled documents,
wherein the formulation of the representative subsets is performed for identification of at least one of the representative subsets for initial processing.
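
The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation, only one possible reading under stated assumptions: documents arrive as hypothetical (path, size in bytes) pairs; the "other metadata feature" is a hypothetical size bucket with an arbitrary 100,000-byte threshold; representative subsets are formed by grouping on (extension, size bucket); balancing is done by downsampling every sampled set to the size of the smallest one; and the subset chosen for initial processing is simply the largest. All function and field names are invented for illustration.

```python
import os
import random
from collections import defaultdict
from typing import NamedTuple


class DocMeta(NamedTuple):
    path: str
    extension: str    # document extension, lower-cased
    size_bucket: str  # hypothetical "other metadata feature"


def sample_documents(records):
    """Sample the extension and one other metadata feature of each document."""
    docs = []
    for path, size_bytes in records:
        ext = os.path.splitext(path)[1].lower() or "<none>"
        bucket = "small" if size_bytes < 100_000 else "large"  # assumed threshold
        docs.append(DocMeta(path, ext, bucket))
    return docs


def formulate_subsets(docs):
    """Group sampled documents into subsets keyed by (extension, size bucket)."""
    subsets = defaultdict(list)
    for doc in docs:
        subsets[(doc.extension, doc.size_bucket)].append(doc)
    return dict(subsets)


def generate_sampled_sets(subsets, per_subset, rng):
    """Draw up to per_subset documents from each subset, without replacement."""
    return {
        key: rng.sample(members, min(per_subset, len(members)))
        for key, members in subsets.items()
    }


def balance(sampled_sets, rng):
    """Downsample every sampled set to the size of the smallest one (assumption)."""
    if not sampled_sets:
        return {}
    target = min(len(members) for members in sampled_sets.values())
    return {key: rng.sample(members, target) for key, members in sampled_sets.items()}


def subset_for_initial_processing(subsets):
    """Pick the largest subset first (an assumed heuristic, not from the claim)."""
    return max(subsets, key=lambda key: len(subsets[key]))


if __name__ == "__main__":
    records = [
        ("a.pdf", 10), ("b.pdf", 20), ("c.PDF", 30),
        ("d.docx", 500_000), ("e.docx", 600_000), ("f", 5),
    ]
    rng = random.Random(0)
    subsets = formulate_subsets(sample_documents(records))
    balanced = balance(generate_sampled_sets(subsets, per_subset=2, rng=rng), rng)
    print(subset_for_initial_processing(subsets))
    print({key: len(members) for key, members in balanced.items()})
```

Downsampling to the smallest set is only one way to balance; oversampling the smaller sets or weighting them during later processing would fit the claim language equally well.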