US 12,307,750 B2
Scalable knowledge distillation techniques for machine learning
Adit Krishnan, Mountain View, CA (US); Ji Li, San Jose, CA (US); Yixuan Wei, Beijing (CN); Xiaozhi Yu, San Jose, CA (US); Han Hu, Beijing (CN); and Qi Dai, Beijing (CN)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on Jun. 10, 2022, as Appl. No. 17/837,636.
Prior Publication US 2023/0401831 A1, Dec. 14, 2023
Int. Cl. G06V 10/776 (2022.01); G06N 3/045 (2023.01); G06V 10/774 (2022.01); G06V 10/82 (2022.01)
CPC G06V 10/776 (2022.01) [G06N 3/045 (2023.01); G06V 10/7747 (2022.01); G06V 10/82 (2022.01)] 20 Claims
OG exemplary drawing
 
1. A data processing system comprising:
a processor; and
a machine-readable storage medium storing executable instructions that, when executed, cause the processor to perform operations comprising:
instantiating an instance of a teacher model in a memory of the data processing system;
instantiating an instance of a student model in the memory of the data processing system;
dividing training data into a plurality of batches of samples, wherein a size of each of the plurality of batches of samples is based at least in part on an amount of the memory available for training the student model after instantiating the instance of the teacher model and the instance of the student model; and
training the student model to replicate performance of the teacher model using an iterative knowledge distillation process in which the teacher model and the student model are trained in parallel during each iteration of the iterative knowledge distillation process by:
obtaining a respective batch of training data from the plurality of batches of samples in the memory;
training the teacher model using each of the samples in the respective batch of training data;
training the student model using each of the samples in the respective batch of training data;
evaluating performance of the student model compared with the performance of the teacher model; and
providing feedback to the student model to adjust behavior of the student model based on the performance of the student model.
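For illustration, the following is a minimal sketch of the training loop recited in claim 1, not the patented implementation. PyTorch is assumed as the framework (the claim names none), and the class names TeacherNet and StudentNet, the function distill, and the parameters mem_based_batch_size, T, and alpha are hypothetical stand-ins introduced here.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TeacherNet(nn.Module):  # hypothetical stand-in for the teacher model
    def __init__(self):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 10))
    def forward(self, x):
        return self.fc(x)

class StudentNet(nn.Module):  # hypothetical, smaller student model
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(32, 10)
    def forward(self, x):
        return self.fc(x)

def distill(inputs, labels, mem_based_batch_size=64, epochs=1, T=2.0, alpha=0.5):
    # Instantiate instances of the teacher and student models in memory.
    teacher, student = TeacherNet(), StudentNet()
    t_opt = torch.optim.SGD(teacher.parameters(), lr=0.1)
    s_opt = torch.optim.SGD(student.parameters(), lr=0.1)

    # Divide the training data into batches; per the claim, the batch size
    # would be derived from the memory remaining after both model instances
    # are created (approximated here by a fixed parameter).
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(inputs, labels),
        batch_size=mem_based_batch_size, shuffle=True)

    for _ in range(epochs):
        for x, y in loader:
            # Train the teacher on the batch; teacher and student are
            # both updated within each iteration of the loop.
            t_logits = teacher(x)
            t_loss = F.cross_entropy(t_logits, y)
            t_opt.zero_grad()
            t_loss.backward()
            t_opt.step()

            # Train the student on the same batch, evaluate its outputs
            # against the teacher's, and feed that comparison back as a
            # soft-target distillation loss (one common choice of feedback).
            s_logits = student(x)
            hard_loss = F.cross_entropy(s_logits, y)
            soft_loss = F.kl_div(
                F.log_softmax(s_logits / T, dim=1),
                F.softmax(t_logits.detach() / T, dim=1),
                reduction="batchmean") * (T * T)
            s_loss = alpha * hard_loss + (1 - alpha) * soft_loss
            s_opt.zero_grad()
            s_loss.backward()
            s_opt.step()
    return student

A call such as student = distill(torch.randn(512, 32), torch.randint(0, 10, (512,))) runs one epoch of the loop on synthetic data. The temperature-scaled KL term is a standard distillation signal (Hinton et al.); the claim itself only requires that the student's performance be compared with the teacher's and fed back to adjust the student.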