US 11,769,040 B2
Scalable multi-die deep learning system
Yakun Shao, Santa Clara, CA (US); Rangharajan Venkatesan, San Jose, CA (US); Nan Jiang, St. Louis, MO (US); Brian Matthew Zimmer, Mountain View, CA (US); Jason Clemons, Leander, TX (US); Nathaniel Pinckney, Cedar Park, TX (US); Matthew R Fojtik, Chapel Hill, NC (US); William James Dally, Incline Village, NV (US); Joel S. Emer, Acton, MA (US); Stephen W. Keckler, Austin, TX (US); and Brucek Khailany, Austin, TX (US)
Assigned to NVIDIA CORP., Santa Clara, CA (US)
Filed by NVIDIA Corp., Santa Clara, CA (US)
Filed on Jul. 19, 2019, as Appl. No. 16/517,431.
Claims priority of provisional application 62/729,066, filed on Sep. 10, 2018.
Prior Publication US 2020/0082246 A1, Mar. 12, 2020
Int. Cl. G06F 7/02 (2006.01); G06N 3/049 (2023.01); G06F 9/445 (2018.01); G06F 9/54 (2006.01); G06N 3/082 (2023.01)

CPC G06N 3/049 (2013.01) [G06F 9/44505 (2013.01); G06F 9/544 (2013.01); G06N 3/082 (2013.01)]

19 Claims

1. A semiconductor package comprising:

a plurality of dice each comprising:

a central controller;

a global memory buffer; and

a plurality of processing elements each comprising:

a weight buffer to receive from the central controller weight values for a neural network;

an activation buffer to receive activation values for the neural network;

an accumulation memory buffer to collect partial sum values;

a plurality of multiply-accumulate units to combine, in parallel, the weight values and the activation values into the partial sum values, each of the multiply-accumulate units comprising:

a weight collection buffer disposed in a data flow between the multiply-accumulate unit and the weight buffer; and

a partial sum collection buffer disposed in the data flow between the multiply-accumulate unit and the accumulation memory buffer.
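Claim 1 recites structure rather than behavior. The following minimal Python sketch shows one way data could flow through a single processing element, under assumptions that the claim does not state. Each multiply-accumulate unit is modeled as a vector lane that produces one output channel. Activations arrive in input-channel chunks. The weight collection buffer stages one weight chunk at a time, and the partial sum collection buffer holds a running sum until it is written to the accumulation memory buffer. The names `MacUnit`, `ProcessingElement`, `receive_weights`, `receive_activations`, `accumulate`, `drain`, and `run` are hypothetical and do not come from the patent. The parallel hardware is approximated here with a sequential loop.

```python
from dataclasses import dataclass, field
from typing import List

Vector = List[float]


@dataclass
class MacUnit:
    """Hypothetical vector MAC lane that produces one output channel."""
    weight_collection: Vector = field(default_factory=list)  # weights staged from the PE weight buffer
    psum_collection: float = 0.0  # running partial sum held before write-back

    def load_weights(self, weights: Vector) -> None:
        # Copy one weight chunk from the PE weight buffer into the collection buffer.
        self.weight_collection = list(weights)

    def accumulate(self, activations: Vector) -> None:
        # Dot product of the staged weights with one activation chunk,
        # added to the local partial sum collection buffer.
        self.psum_collection += sum(w * a for w, a in zip(self.weight_collection, activations))

    def drain(self) -> float:
        # Return the collected partial sum and reset the collection buffer.
        value, self.psum_collection = self.psum_collection, 0.0
        return value


class ProcessingElement:
    """Hypothetical PE: weight, activation, and accumulation buffers plus MAC lanes."""

    def __init__(self, num_macs: int) -> None:
        self.weight_buffer: List[List[Vector]] = []  # indexed [mac][chunk]
        self.activation_buffer: List[Vector] = []    # indexed [chunk]
        self.accumulation_buffer: Vector = [0.0] * num_macs
        self.macs = [MacUnit() for _ in range(num_macs)]

    def receive_weights(self, weights: List[List[Vector]]) -> None:
        # Stands in for weight values delivered by the die's central controller.
        self.weight_buffer = weights

    def receive_activations(self, activations: List[Vector]) -> None:
        self.activation_buffer = activations

    def run(self) -> Vector:
        for k, act_chunk in enumerate(self.activation_buffer):
            # In hardware the MAC units would operate in parallel; here we loop.
            for m, mac in enumerate(self.macs):
                mac.load_weights(self.weight_buffer[m][k])
                mac.accumulate(act_chunk)
        # One write per MAC unit to the accumulation memory buffer, not one per chunk.
        for m, mac in enumerate(self.macs):
            self.accumulation_buffer[m] += mac.drain()
        return self.accumulation_buffer


pe = ProcessingElement(num_macs=2)
pe.receive_weights([[[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.5], [1.0, 0.0]]])
pe.receive_activations([[1.0, 1.0], [2.0, 1.0]])
print(pe.run())  # -> [13.0, 3.0]
```

In this sketch the partial sum collection buffer lets each MAC unit accumulate across all input-channel chunks locally and touch the accumulation memory buffer only once per pass. That is one plausible reading of why a collection buffer would sit between the MAC unit and the accumulation memory buffer; the claim itself does not say so.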