US 12,423,382 B2
Method, electronic device, and computer program product for data processing
Zijia Wang, WeiFang (CN); Jiacheng Ni, Shanghai (CN); Wenbin Yang, Shanghai (CN); and Zhen Jia, Shanghai (CN)
Assigned to EMC IP Holding Company LLC, Hopkinton, MA (US)
Filed by EMC IP Holding Company LLC, Hopkinton, MA (US)
Filed on Aug. 9, 2021, as Appl. No. 17/397,518.
Claims priority of application No. 202110839222.2 (CN), filed on Jul. 23, 2021.
Prior Publication US 2023/0028860 A1, Jan. 26, 2023
Int. Cl. G06F 17/16 (2006.01); G06F 18/213 (2023.01); G06F 18/214 (2023.01)

CPC G06F 18/214 (2023.01) [G06F 17/16 (2013.01); G06F 18/213 (2023.01)]

20 Claims

1. A method for data processing, comprising:

determining, in a first machine learning model of a first architectural stage in a multi-stage serial processing architecture of a processor-based machine learning system, a first set of feature vectors representing samples in a data set;

generating, in a second architectural stage in the multi-stage serial processing architecture of the processor-based machine learning system, a second set of feature vectors by performing a first transformation on the first set of feature vectors, wherein distribution skewness of the second set of feature vectors in a feature space is smaller than that of the first set of feature vectors;

generating, in a third architectural stage in the multi-stage serial processing architecture of the processor-based machine learning system, a third set of feature vectors by performing a second transformation on the second set of feature vectors, wherein the third set of feature vectors and the second set of feature vectors have different distances between vectors, and wherein the second transformation utilizes a loss function based on potential energy minimization, and in computing the loss function the second transformation is configured to iteratively combine results of application of a potential energy function to distances between respective pairs of feature vectors in the second set of feature vectors;

selecting, utilizing a series arrangement of fourth and fifth architectural stages in the multi-stage serial processing architecture of the processor-based machine learning system, target samples as representatives from the samples based on a distribution of the third set of feature vectors in the feature space, the selected target samples providing a distilled data set for use in training a second machine learning model of the processor-based machine learning system, wherein the fourth architectural stage implements a designated processor-based filter utilizing a hardware processor;

storing the selected target samples of the distilled data set as a replacement for the samples in the data set to reduce resource consumption in the processor-based machine learning system; and

training the second machine learning model of the processor-based machine learning system utilizing the distilled data set that replaces the samples in the data set, to achieve at least a portion of the reduced resource consumption in conjunction with the training.
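
The data flow recited in claim 1 (skew-reducing first transformation, potential-energy-based second transformation, then selection of representatives) can be illustrated with a minimal sketch. The claim does not fix any of the concrete choices below, so all of them are assumptions made for illustration only: an elementwise square-root power transform stands in for the first transformation, an inverse-distance potential 1/d with backtracking gradient descent stands in for the second, and greedy farthest-point selection stands in for the fourth and fifth stages. The function names `skewness`, `reduce_skew`, `potential_energy`, `spread`, and `select_representatives` are hypothetical and do not come from the patent. The first-stage feature extractor and the designated processor-based filter are not modeled.

```python
import numpy as np


def skewness(x):
    """Mean absolute sample skewness across feature dimensions."""
    centered = x - x.mean(axis=0)
    std = x.std(axis=0) + 1e-12
    return float(np.mean(np.abs((centered ** 3).mean(axis=0) / std ** 3)))


def reduce_skew(features, power=0.5):
    """Assumed first transformation: elementwise power transform.

    Requires non-negative features; the exponent is an illustrative choice.
    """
    return np.power(features, power)


def potential_energy(vectors, eps=1e-9):
    """Sum an assumed inverse-distance potential over all distinct pairs."""
    diff = vectors[:, None, :] - vectors[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(len(vectors), k=1)
    return float(np.sum(1.0 / (dist[upper] + eps)))


def spread(vectors, steps=100, lr=1e-3):
    """Assumed second transformation: descend the pairwise potential energy.

    A step is accepted only if it lowers the energy; otherwise the step
    size is halved, so the returned energy never exceeds the initial one.
    """
    v = np.asarray(vectors, dtype=float).copy()
    energy = potential_energy(v)
    for _ in range(steps):
        diff = v[:, None, :] - v[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        np.fill_diagonal(dist, np.inf)  # ignore self-pairs
        # Gradient of sum_{i<j} 1/d_ij with respect to each v_i.
        grad = -(diff / dist[:, :, None] ** 3).sum(axis=1)
        candidate = v - lr * grad
        candidate_energy = potential_energy(candidate)
        if candidate_energy < energy:
            v, energy = candidate, candidate_energy
        else:
            lr *= 0.5
    return v


def select_representatives(vectors, k):
    """Assumed selection rule: greedy farthest-point sampling of k indices."""
    start = int(np.argmax(np.linalg.norm(vectors - vectors.mean(axis=0), axis=1)))
    chosen = [start]
    min_dist = np.linalg.norm(vectors - vectors[start], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(vectors - vectors[nxt], axis=1))
    return chosen


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    first = rng.lognormal(mean=0.0, sigma=1.0, size=(60, 2))  # stand-in features
    second = reduce_skew(first)
    third = spread(second)
    distilled_idx = select_representatives(third, k=10)
    print("skewness:", skewness(first), "->", skewness(second))
    print("energy:", potential_energy(second), "->", potential_energy(third))
    print("selected sample indices:", distilled_idx)
```

In this sketch the selected indices would be used to subset the original samples, and that subset would be stored and used to train the second model in place of the full data set. Farthest-point selection is only one way to pick representatives from the distribution of the third set of feature vectors; a clustering-based choice would fit the claim language equally well.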