US 12,461,745 B2
Systems for performing instructions to quickly convert and use tiles as 1D vectors
Bret Toll, Hillsboro, OR (US); Christopher J. Hughes, Santa Clara, CA (US); Dan Baum, Haifa (IL); Elmoustapha Ould-Ahmed-Vall, Gilbert, AZ (US); Raanan Sade, Portland, OR (US); Robert Valentine, Kiryat Tivon (IL); Mark J. Charney, Lexington, MA (US); and Alexander F. Heinecke, San Jose, CA (US)
Assigned to Intel Corporation, Santa Clara, CA (US)
Filed by Intel Corporation, Santa Clara, CA (US)
Filed on Nov. 28, 2023, as Appl. No. 18/521,000.
Application 18/521,000 is a continuation of application No. 17/549,363, filed on Dec. 13, 2021, granted, now 11,954,489.
Application 17/549,363 is a continuation of application No. 17/549,221, filed on Dec. 13, 2021, granted, now 11,714,648, issued on Aug. 1, 2023.
Application 17/549,221 is a continuation of application No. 17/240,882, filed on Apr. 26, 2021, granted, now 11,579,880, issued on Feb. 14, 2023.
Application 17/240,882 is a continuation of application No. 16/145,066, filed on Sep. 27, 2018, granted, now 10,990,396, issued on Apr. 27, 2021.
Prior Publication US 2024/0103867 A1, Mar. 28, 2024
Int. Cl. G06F 9/30 (2018.01)
CPC G06F 9/30145 (2013.01) [G06F 9/30032 (2013.01); G06F 9/30036 (2013.01); G06F 9/30038 (2023.08); G06F 9/30109 (2013.01)]
21 Claims
 
1. A processor comprising:
a plurality of vector registers, each vector register of the plurality of vector registers to store a plurality of matrix data elements;
execution circuitry to execute an instruction to load a first plurality of source data elements from a first column and a second column of a first source tile to a first vector register of the plurality of vector registers, wherein a first subset of the first plurality of source data elements from the first column are to be interleaved with a second subset of the first plurality of source data elements from the second column within the first vector register, the first source tile comprising a group of rows and columns of a first source matrix stored in a memory;
the execution circuitry further comprising:
a set of multipliers to perform a parallel multiplication of each data element of the first plurality of source data elements stored in the first vector register with a corresponding source data element of a second plurality of source data elements stored in a second vector register of the plurality of vector registers to generate a corresponding plurality of products, the second plurality of source data elements from a second source tile of a second source matrix to be multiplied with the first source matrix; and
accumulator circuitry to add groups of the corresponding plurality of products to corresponding accumulated data elements of an accumulation matrix to generate corresponding result data elements of a result matrix.
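
The data flow recited in claim 1 can be illustrated with a minimal software sketch. It is not the claimed circuitry, and the claim does not fix the details assumed here: the interleave order (row by row, first column then second), the group size of two products per accumulated element, the use of plain Python integers, and the function names `load_interleaved`, `multiply_lanes`, and `accumulate_groups` are all hypothetical choices made for illustration.

```python
def load_interleaved(tile, first_col, second_col):
    """Model the load: copy two columns of a tile into one flat
    'vector register', alternating first-column and second-column
    elements row by row (assumed interleave order)."""
    register = []
    for row in tile:
        register.append(row[first_col])
        register.append(row[second_col])
    return register


def multiply_lanes(reg_x, reg_y):
    """Model the set of multipliers: multiply each element of one
    register by the corresponding element of the other."""
    if len(reg_x) != len(reg_y):
        raise ValueError("registers must hold the same number of elements")
    return [x * y for x, y in zip(reg_x, reg_y)]


def accumulate_groups(products, accumulators, group_size=2):
    """Model the accumulator: add each group of `group_size` adjacent
    products to the matching accumulated element (assumed grouping)."""
    if len(products) != group_size * len(accumulators):
        raise ValueError("product count must equal group_size * accumulator count")
    return [
        acc + sum(products[i * group_size:(i + 1) * group_size])
        for i, acc in enumerate(accumulators)
    ]


# Hypothetical 3x3 first source tile; columns 0 and 1 are loaded.
first_tile = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
first_register = load_interleaved(first_tile, 0, 1)   # [1, 2, 4, 5, 7, 8]

# Hypothetical contents of the second vector register.
second_register = [1, 1, 2, 2, 3, 3]

products = multiply_lanes(first_register, second_register)
result = accumulate_groups(products, accumulators=[10, 20, 30])
print(result)  # [13, 38, 75]
```

In this sketch each result element is the prior accumulated value plus the sum of one pair of products, which mirrors the claim's "groups of the corresponding plurality of products" under the assumed group size of two.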