US 11,809,373 B2
Defining redundant array of independent disks level for machine learning training data
Manish Anand Bhide, Hyderabad (IN); Seema Nagar, Bangalore (IN); Prateek Goyal, Indore (IN); and Kuntal Dey, Rampurhat (IN)
Assigned to International Business Machines Corporation, Armonk, NY (US)
Filed by International Business Machines Corporation, Armonk, NY (US)
Filed on Mar. 16, 2021, as Appl. No. 17/202,559.
Prior Publication US 2022/0300453 A1, Sep. 22, 2022
Int. Cl. G06F 17/00 (2019.01); G06F 16/16 (2019.01); G06N 5/04 (2023.01); G06N 20/00 (2019.01)

CPC G06F 16/16 (2019.01) [G06N 5/04 (2013.01); G06N 20/00 (2019.01)]

14 Claims

1. A computer-implemented method comprising:

determining, by one or more computer processors, a storage strategy for each chunked data block in a training dataset based on a respective computed score and a series of score thresholds, wherein the storage strategy comprises RAID strategies that include striping, mirroring, parity, and double parity, wherein the computed score is computed by:

responsive to an identified machine learning task associated with the training dataset, computing, by one or more computer processors, an aggregated information gain value and an aggregated heterogeneity value for each chunked data block;

computing, by one or more computer processors, the score for each chunked data block based on a product of respective computed information gain values and respective computed heterogeneity values; and

distributing, by one or more computer processors, each data block in the training dataset according to the respective determined storage strategy.
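
The scoring and threshold steps recited above can be illustrated with a minimal sketch. It is not the patented implementation: the claim does not say how information gain or heterogeneity is measured, how either is aggregated, or which score range maps to which RAID strategy. The sketch assumes discrete features, entropy-based information gain, per-feature entropy as the heterogeneity measure, aggregation by mean, and an ascending mapping from striping to double parity. All function names and threshold values are hypothetical.

```python
from bisect import bisect_right
from collections import Counter
from math import log2

# Assumed ordering, lowest score to highest; the claim names these four
# strategies but does not fix which score range selects which one.
STRATEGIES = ["striping", "mirroring", "parity", "double parity"]


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def information_gain(feature, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    n = len(labels)
    groups = {}
    for f, y in zip(feature, labels):
        groups.setdefault(f, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def block_score(features, labels):
    """Product of aggregated information gain and aggregated heterogeneity.

    Assumptions: `features` is a non-empty list of columns, aggregation is
    the mean, and heterogeneity is the entropy of each feature column.
    """
    gains = [information_gain(col, labels) for col in features]
    heterogeneity = [entropy(col) for col in features]
    agg_gain = sum(gains) / len(gains)
    agg_heterogeneity = sum(heterogeneity) / len(heterogeneity)
    return agg_gain * agg_heterogeneity


def choose_strategy(score, thresholds):
    """Map a score to a strategy using three ascending score thresholds."""
    return STRATEGIES[bisect_right(thresholds, score)]


# Hypothetical block: one feature column that fully determines the label.
block_features = [[0, 0, 1, 1]]
block_labels = [0, 0, 1, 1]
score = block_score(block_features, block_labels)
strategy = choose_strategy(score, thresholds=[0.25, 0.5, 0.75])
```

Under these assumptions, a block whose features are both informative and varied gets a higher score and therefore a more redundant strategy. A real system would then place the block on storage according to the selected strategy, which this sketch does not attempt.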