US 12,287,843 B2
Systems and methods of instructions to accelerate multiplication of sparse matrices using bitmasks that identify non-zero elements
Dan Baum, Haifa (IL); Chen Koren, Hadera (IL); Elmoustapha Ould-Ahmed-Vall, Chandler, AZ (US); Michael Espig, Newberg, OR (US); Christopher J. Hughes, Santa Clara, CA (US); Raanan Sade, Kibutz Sarid (IL); Robert Valentine, Kiryat Tivon (IL); Mark J. Charney, Lexington, MA (US); and Alexander F. Heinecke, San Jose, CA (US)
Assigned to Intel Corporation, Santa Clara, CA (US)
Filed by Intel Corporation, Santa Clara, CA (US)
Filed on Nov. 6, 2023, as Appl. No. 18/502,291.
Application 18/502,291 is a continuation of application No. 17/485,055, filed on Sep. 24, 2021, granted, now 11,847,185.
Application 17/485,055 is a continuation of application No. 16/234,374, filed on Dec. 27, 2018, abandoned.
Prior Publication US 2024/0078285 A1, Mar. 7, 2024
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 9/30 (2018.01); G06F 9/38 (2018.01); G06F 17/16 (2006.01)

CPC G06F 17/16 (2013.01) [G06F 9/3001 (2013.01); G06F 9/30101 (2013.01); G06F 9/3016 (2013.01); G06F 9/3802 (2013.01); G06F 9/3836 (2013.01); G06F 9/3893 (2013.01)]

20 Claims

1. An apparatus comprising:

a cache; and

a graphics processing unit coupled to the cache, wherein the graphics processing unit comprises:

a first register to store elements from a first matrix that has zero and non-zero values in a compressed format,

a second register to store elements from a second matrix,

a scheduler circuit to schedule an instruction for execution, the instruction comprising fields to specify the first register, the second register, an accumulation matrix, a destination matrix, indications of a logical matrix position of the elements in at least the first matrix in a non-compressed format, and an opcode to indicate the instruction is a sparse matrix instruction and that execution circuitry including a processing engine is to select a proper subset of elements of the second register from the second matrix as an input into a multiply-accumulator circuit of the processing engine based on the indications, multiply the elements from the first matrix with corresponding elements of the proper subset of elements of the second matrix to generate products, accumulate the products with corresponding elements of the accumulation matrix to produce sums, and store the sums in corresponding elements of the destination matrix, and

the execution circuitry, including the processing engine, to execute the instruction according to the opcode,

wherein the first matrix has M rows by K columns, the second matrix has K rows by N columns, the accumulation matrix has M rows by N columns, and the instruction includes a suffix to the opcode that when set to a first value is to explicitly specify a first set of K, M, and N values, and when set to a different second value is to explicitly specify a different second set of K, M, and N values.
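
The arithmetic recited in claim 1 can be illustrated in software. The following is a minimal sketch only, not the claimed hardware or any actual instruction encoding. It assumes the "indications of a logical matrix position" are one bitmask per row of the first matrix, as the title suggests. The names `compress`, `sparse_matmul_accumulate`, and `SHAPE_BY_SUFFIX`, and the two shape entries in the table, are hypothetical and chosen for illustration.

```python
# Hypothetical suffix table: each suffix value explicitly selects one (M, K, N) set.
# The concrete numbers are assumptions for this sketch, not values from the patent.
SHAPE_BY_SUFFIX = {
    0: (2, 2, 2),
    1: (4, 4, 4),
}


def compress(dense):
    """Pack the non-zero elements of a dense matrix in row-major order.

    Returns (values, masks): masks[i] has bit j set when dense[i][j] != 0.
    """
    values, masks = [], []
    for row in dense:
        mask = 0
        for j, x in enumerate(row):
            if x != 0:
                mask |= 1 << j
                values.append(x)
        masks.append(mask)
    return values, masks


def sparse_matmul_accumulate(a_values, a_masks, b, acc, suffix):
    """Compute dest = acc + A @ B, where A is given in compressed form.

    a_values: packed non-zero elements of the M x K first matrix
    a_masks:  M bitmasks giving each element's logical column in A
    b:        K x N second matrix (list of rows)
    acc:      M x N accumulation matrix (list of rows)
    suffix:   key into SHAPE_BY_SUFFIX selecting (M, K, N)
    """
    m, k, n = SHAPE_BY_SUFFIX[suffix]
    dest = [row[:] for row in acc]  # destination starts as a copy of the accumulator
    next_value = 0
    for i in range(m):
        for j in range(k):
            # Only rows of B that line up with a non-zero element of A are
            # selected as multiplier inputs (a proper subset when A has zeros).
            if (a_masks[i] >> j) & 1:
                a = a_values[next_value]
                next_value += 1
                for col in range(n):
                    dest[i][col] += a * b[j][col]
    return dest


a_dense = [[1, 0], [0, 2]]
a_values, a_masks = compress(a_dense)
b = [[1, 2], [3, 4]]
acc = [[10, 10], [10, 10]]
result = sparse_matmul_accumulate(a_values, a_masks, b, acc, suffix=0)
```

In this example only two of the four possible products per output column are computed, because the bitmasks mark the other two elements of the first matrix as zero.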