| CPC G06N 3/063 (2013.01) [G06N 3/045 (2023.01); G06N 3/08 (2013.01)] | 9 Claims |

|
1. A processor-implemented training method, comprising:
executing an iterative process, by one or more processors, with training data and an in-training first neural network, configured to perform a first task, to generate a trained first neural network, configured to perform the first task and a second task different from the first task, that has a trained first layer including trained first weights that each have a first bit-width corresponding to a first precision, the iterative process including:
quantizing in-training first weights, having the first bit-width, of an in-training first layer of the in-training first neural network to generate second weights of a first layer of a second neural network, that have a second bit-width that is less than the first bit-width;
executing the second neural network using the second weights, including applying the training data to the first layer of the second neural network and determining loss values, corresponding to the second task, of the first layer of the second neural network;
updating the in-training first weights of the in-training first layer of the in-training first neural network based on the determined loss values; and
performing, for each of the updated in-training first weights, a quantization of a corresponding updated in-training first weight of the updated in-training first weights to generate a corresponding first weight of the trained first weights that includes a nested second weight having the second bit-width that shares bits with the corresponding first weight,
wherein the updating of the in-training first weights comprises updating the in-training first weights of the first bit-width based on statistical information of loss gradients corresponding to the determined loss values,
wherein the updating of the in-training first weights further comprises calculating the statistical information by assigning a high weighted value to a loss gradient corresponding to a weight for which a high priority is set among the second weights of the second bit-width, and
wherein the nested second weight is nested in the corresponding first weight and stored in a same memory space.
|
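
The steps recited in claim 1 can be illustrated with a minimal sketch. It is one possible reading of the claim language, not the patented implementation. Everything named here is an assumption made for illustration: the bit-widths (8 and 4), the uniform signed quantizer, the single-weight model y = w * x, the squared-error loss, the learning rate, and the function names (`quantize`, `nested`, `nested_from_byte`, `weighted_mean`, `train_step`). The "statistical information of loss gradients" is modeled as a priority-weighted mean, and the in-training weight is held as a float stand-in for the higher-precision weight.

```python
# Illustrative sketch only. Bit-widths, model, loss, and priorities are assumptions.
HIGH_BITS, LOW_BITS = 8, 4  # assumed first and second bit-widths


def quantize(w, bits):
    """Uniformly quantize a float in [-1, 1) to a signed integer of `bits` bits."""
    levels = 1 << (bits - 1)
    q = round(w * levels)
    return max(-levels, min(levels - 1, q))  # clamp to the signed range


def dequantize(q, bits):
    """Map a signed `bits`-bit integer back to a float in [-1, 1)."""
    return q / (1 << (bits - 1))


def nested(q_high):
    """Read the low-bit weight as the top LOW_BITS bits of the high-bit weight."""
    return q_high >> (HIGH_BITS - LOW_BITS)  # arithmetic shift keeps the sign


def nested_from_byte(b):
    """Same read, starting from the stored byte (two's complement, 0..255)."""
    top = b >> (HIGH_BITS - LOW_BITS)
    return top - (1 << LOW_BITS) if top >= (1 << (LOW_BITS - 1)) else top


def weighted_mean(grads, priorities):
    """Priority-weighted mean of loss gradients (higher priority, larger weight)."""
    return sum(g * p for g, p in zip(grads, priorities)) / sum(priorities)


def train_step(w, samples, priorities, lr=0.05):
    """One iteration: quantize, run the low-bit model, update the high-bit weight."""
    # Generate the second (low-bit) weight from the in-training first weight.
    w_low = dequantize(nested(quantize(w, HIGH_BITS)), LOW_BITS)
    # Gradient of squared error for y_hat = w_low * x, passed straight through to w.
    grads = [2.0 * (w_low * x - y) * x for x, y in samples]
    # Update the in-training first weight from the gradient statistic.
    return w - lr * weighted_mean(grads, priorities)


if __name__ == "__main__":
    w = train_step(0.5, samples=[(1.0, 1.0)], priorities=[1.0])
    q8 = quantize(w, HIGH_BITS)   # final quantization to the first bit-width
    stored = q8 & 0xFF            # one byte holds both weights
    print(q8, nested(q8), nested_from_byte(stored))
```

In this sketch the 4-bit weight is never stored separately: it is the upper nibble of the byte that holds the 8-bit weight, which is one way to read "shares bits with the corresponding first weight" and "stored in a same memory space".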