US 12,254,395 B2
System and method for processing convolutions on crossbar-based neural network accelerators for increased inference throughput
Glaucimar Da Silva Aguiar, São Paulo (BR); Francisco Plínio Oliveira Silveira, Porto Alegre (BR); Eun Sub Lee, Plano, TX (US); Rodrigo Jose Da Rosa Antunes, Porto Alegre (BR); Joaquim Gomes Da Costa Eulalio De Souza, Porto Alegre (BR); Martin Foltin, Ft. Collins, CO (US); Jefferson Rodrigo Alves Cavalcante, Houston, TX (US); Lucas Leite, Houston, TX (US); Arthur Carvalho Walraven Da Cunha, Houston, TX (US); Monycky Vasconcelos Frazao, Houston, TX (US); and Alex Ferreira Ramires Trajano, Fortaleza de Minas (BR)
Assigned to Hewlett Packard Enterprise Development LP, Spring, TX (US)
Filed by HEWLETT PACKARD ENTERPRISE DEVELOPMENT LP, Houston, TX (US)
Filed on Sep. 21, 2020, as Appl. No. 17/027,628.
Prior Publication US 2022/0092393 A1, Mar. 24, 2022
Int. Cl. G06F 15/78 (2006.01); G06F 17/16 (2006.01); G06N 3/063 (2023.01)
CPC G06N 3/063 (2013.01) [G06F 15/7825 (2013.01); G06F 17/16 (2013.01)] 21 Claims
OG exemplary drawing
 
1. An integrated chip for computing an output for a convolutional neural network, the integrated chip comprising:
a plurality of convolution layers, wherein a convolution layer in the plurality of convolution layers comprises a plurality of kernels and a kernel in the plurality of kernels comprises a respective matrix structure of weights;
for the convolution layer in the plurality of convolution layers, the integrated chip is configured to execute instructions that cause the integrated chip to perform intra-crossbar parallelization and inter-crossbar parallelization that compute simultaneously on input data points via a method comprising:
flattening each kernel in the plurality of kernels into a vector;
grouping the vectors into a vector matrix, wherein the vector matrix comprises a plurality of lines;
replicating and storing duplicates of the vector matrix according to a number and size of the convolution layer of the convolutional neural network and a crossbar size of a crossbar, wherein the duplicates are stored in unused space of the crossbar of the integrated chip to form a crossbar matrix; and
computing a first convolution of the convolution layer as a dot product of an input activation vector and the crossbar matrix, wherein the first convolution of the convolution layer corresponds to smaller weights than a second convolution of the convolution layer, so as to parallelize separate iterations in the convolution layer in alignment with the intra-crossbar parallelization, and wherein the duplicates are computed simultaneously on the input data points to perform the intra-crossbar parallelization and the inter-crossbar parallelization.
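The claimed steps (flattening kernels into vectors, grouping them into a vector matrix, replicating duplicates into otherwise-unused crossbar space, and computing the convolution as a single vector-matrix product) can be sketched in NumPy. This is a minimal software model, not the patented implementation: all dimensions, array names (`XBAR_ROWS`, `n_copies`, `patches`, etc.) are illustrative assumptions, and the analog crossbar is modeled as an ordinary weight matrix.

```python
import numpy as np

# Hypothetical sizes -- illustrative assumptions, not values from the patent.
XBAR_ROWS, XBAR_COLS = 128, 128   # crossbar dimensions
K, C, KH, KW = 4, 3, 3, 3         # kernels, channels, kernel height/width

rng = np.random.default_rng(0)
kernels = rng.standard_normal((K, C, KH, KW))

# Flatten each kernel into a vector and group the vectors into a matrix:
# each column holds one flattened kernel (C*KH*KW = 27 rows, K = 4 columns).
vector_matrix = kernels.reshape(K, -1).T          # shape (27, 4)
rows, cols = vector_matrix.shape

# Replicate the vector matrix along the crossbar diagonal so that unused
# rows/columns store duplicates (intra-crossbar parallelization).
n_copies = min(XBAR_ROWS // rows, XBAR_COLS // cols)
crossbar = np.zeros((XBAR_ROWS, XBAR_COLS))
for i in range(n_copies):
    crossbar[i * rows:(i + 1) * rows, i * cols:(i + 1) * cols] = vector_matrix

# One vector-matrix product now evaluates n_copies convolution windows at
# once: concatenate n_copies flattened input patches into one input vector.
patches = rng.standard_normal((n_copies, rows))   # each row = one window
x = np.zeros(XBAR_ROWS)
x[:n_copies * rows] = patches.reshape(-1)

y = x @ crossbar                                  # single crossbar operation
outputs = y[:n_copies * cols].reshape(n_copies, cols)

# Reference: evaluating each window separately gives the same result.
ref = patches @ vector_matrix
assert np.allclose(outputs, ref)
```

With the assumed sizes, four duplicates fit on one 128x128 crossbar, so four convolution windows are computed per analog operation instead of one, which is the throughput gain the claim attributes to intra-crossbar parallelization; spreading further duplicates across additional crossbars would model the inter-crossbar case.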