US 11,704,556 B2
Optimization methods for quantization of neural network models
Min Guo, San Diego, CA (US); Manjiang Zhang, Sunnyvale, CA (US); and Shengjin Zhou, San Jose, CA (US)
Assigned to BAIDU USA LLC, Sunnyvale, CA (US)
Filed by Baidu USA LLC, Sunnyvale, CA (US)
Filed on Feb. 6, 2020, as Appl. No. 16/784,223.
Prior Publication US 2021/0248456 A1, Aug. 12, 2021
Int. Cl. G06N 3/08 (2023.01); G06N 3/04 (2023.01); G06F 16/22 (2019.01)
CPC G06N 3/08 (2013.01) [G06F 16/2219 (2019.01); G06N 3/04 (2013.01)] 20 Claims
OG exemplary drawing
 
11. A data processing system, comprising:
one or more processors; and
a memory coupled to the one or more processors to store instructions, which when executed by the one or more processors, cause the one or more processors to perform operations, the operations including
receiving a trained artificial intelligence (AI) model having one or more layers;
receiving first input data for offline inferencing;
applying offline inferencing to the trained AI model based on the first input data to generate offline data distributions for the trained AI model;
identifying outlier points in the offline data distributions;
removing a predetermined number of the identified outlier points to generate updated offline data distributions; and
quantizing one or more tensors of the trained AI model based on the updated offline data distributions to generate a low-bit representation AI model, wherein each layer of the AI model includes the one or more tensors, wherein the one or more tensors include activation, weights, or bias tensors.
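The calibration-and-quantization flow recited in the claim can be sketched as follows. This is an illustrative reconstruction, not the patented implementation: the helper names (`remove_outliers`, `quantize`), the largest-magnitude outlier criterion, and the symmetric int8 scheme are all assumptions; the patent only specifies removing a predetermined number of outlier points from offline data distributions before quantizing.

```python
import numpy as np

def remove_outliers(distribution, num_outliers):
    """Drop the `num_outliers` largest-magnitude points from an
    offline data distribution (hypothetical criterion; the claim
    only says a predetermined number of outliers is removed)."""
    order = np.argsort(np.abs(distribution))
    return distribution[order[:len(distribution) - num_outliers]]

def quantize(tensor, distribution, num_bits=8):
    """Symmetric linear quantization to a low-bit integer range,
    with the scale derived from the outlier-trimmed distribution."""
    qmax = 2 ** (num_bits - 1) - 1          # e.g. 127 for int8
    scale = np.max(np.abs(distribution)) / qmax
    q = np.clip(np.round(tensor / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

# Offline inferencing would populate one distribution per tensor
# (activations, weights, or biases); here we simulate a roughly
# normal activation distribution with two extreme outliers.
rng = np.random.default_rng(0)
observed = np.concatenate([rng.normal(0.0, 1.0, 1000), [50.0, -60.0]])

trimmed = remove_outliers(observed, num_outliers=2)
q, scale = quantize(observed[:5], trimmed, num_bits=8)
dequantized = q.astype(np.float32) * scale
```

Trimming the outliers before computing the scale is what keeps the quantization range tight: without it, the single value of magnitude 60 would stretch the int8 range and waste most quantization levels on values that never occur.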