US 12,443,830 B2
Compressed weight distribution in networks of neural processors
Andrew S. Cassidy, San Jose, CA (US); Rathinakumar Appuswamy, San Jose, CA (US); John V. Arthur, Mountain View, CA (US); Pallab Datta, San Jose, CA (US); Steve Esser, San Jose, CA (US); Myron D. Flickner, San Jose, CA (US); Dharmendra S. Modha, San Jose, CA (US); and Jun Sawada, Austin, TX (US)
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION, Armonk, NY (US)
Filed by INTERNATIONAL BUSINESS MACHINES CORPORATION, Armonk, NY (US)
Filed on Jan. 3, 2020, as Appl. No. 16/733,393.
Prior Publication US 2021/0209450 A1, Jul. 8, 2021
Int. Cl. G06N 3/063 (2023.01); G06N 5/04 (2023.01)
CPC G06N 3/063 (2013.01) [G06N 5/04 (2013.01)] 22 Claims
OG exemplary drawing
 
1. A neural inference chip comprising:
a global weight memory;
at least one neural core, the at least one neural core comprising a local weight memory, the local weight memory comprising a plurality of memory banks,
each of the plurality of memory banks being uniquely addressable by at least one index, wherein the at least one index identifies a column of a compressed weight matrix and a first memory bank of the plurality of memory banks,
each of the plurality of memory banks comprising a comparator, a value mux, and an index mux such that the first memory bank comprises a first comparator, a first value mux, and a first index mux, wherein
the comparator of each memory bank is adapted to compare the at least one index to the index of that memory bank of the plurality of memory banks,
that comparator provides a control line to the value mux of that memory bank,
the value mux is configured to select between zero and a weight value based on the control line,
the index mux is configured to select between the weight value and an index value based on the at least one index, and
the index value is an index of the weight value in an uncompressed weight matrix;
a network-on-chip connecting the global weight memory to the at least one neural core, wherein
the neural inference chip is adapted to store in the global weight memory a compressed weight block comprising at least one compressed weight matrix,
the neural inference chip is adapted to transmit the compressed weight block from the global weight memory to the at least one neural core via the network-on-chip,
the at least one neural core is adapted to decode the at least one compressed weight matrix into a decoded weight matrix and store the decoded weight matrix in its local weight memory, and
the at least one neural core is adapted to apply the decoded weight matrix to a plurality of input activations to produce a plurality of output activations.
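The claim describes a hardware decode path: each local-memory bank holds a non-zero weight together with its column index in the uncompressed matrix; a comparator matches an incoming index against the stored index, and a value mux emits either the stored weight or zero, so the dense row can be reconstructed and applied to input activations. The following is a minimal Python sketch of that decode-and-apply flow, assuming a simple (column index, weight value) pair per bank; the names MemoryBank, decode_row, and apply_weights are illustrative only and do not appear in the patent, and the comparator/mux hardware is modeled as ordinary conditional logic.

from dataclasses import dataclass
from typing import List


@dataclass
class MemoryBank:
    """One local-weight-memory bank holding a single compressed entry (illustrative)."""
    index: int    # column index of the weight in the uncompressed matrix
    value: float  # the stored (non-zero) weight value

    def read(self, target_index: int) -> float:
        # Comparator: compare the incoming index to this bank's stored index.
        match = (self.index == target_index)
        # Value mux: select the stored weight on a match, zero otherwise.
        return self.value if match else 0.0


def decode_row(banks: List[MemoryBank], row_width: int) -> List[float]:
    """Expand one compressed row into a dense row by presenting each column
    index to every bank; at most one bank matches, so the sum reduces to the
    selected weight or zero."""
    return [sum(bank.read(col) for bank in banks) for col in range(row_width)]


def apply_weights(weight_matrix: List[List[float]],
                  activations: List[float]) -> List[float]:
    """Apply a decoded (dense) weight matrix to input activations (plain mat-vec)."""
    return [sum(w * a for w, a in zip(row, activations)) for row in weight_matrix]


# Example: a 1x6 row whose only non-zero weights sit at columns 1 and 4.
banks = [MemoryBank(index=1, value=0.5), MemoryBank(index=4, value=-2.0)]
row = decode_row(banks, row_width=6)                 # [0.0, 0.5, 0.0, 0.0, -2.0, 0.0]
outputs = apply_weights([row], [1, 2, 3, 4, 5, 6])   # [0.5*2 + (-2.0)*5] == [-9.0]

In hardware terms, decode_row plays the role of the comparator plus value mux per bank, while the index mux (selecting between the weight value and its uncompressed-matrix index) corresponds to the bank exposing bank.index versus bank.value; this sketch simply reads both fields directly.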