US 12,260,223 B2
Generative AI accelerator apparatus using in-memory compute chiplet devices for transformer workloads
Sudeep Bhoja, Cupertino, CA (US); and Siddharth Sheth, Cupertino, CA (US)
Assigned to d-MATRIX CORPORATION, Santa Clara, CA (US)
Filed by d-MATRIX CORPORATION, Cupertino, CA (US)
Filed on Nov. 23, 2022, as Appl. No. 18/058,706.
Application 18/058,706 is a continuation in part of application No. 17/538,923, filed on Nov. 30, 2021, granted, now 11,847,072.
Prior Publication US 2023/0168899 A1, Jun. 1, 2023
Int. Cl. G06F 9/38 (2018.01); G06F 9/30 (2018.01); G06F 13/16 (2006.01)
CPC G06F 9/3887 (2013.01) [G06F 9/3001 (2013.01); G06F 9/3836 (2013.01)]
26 Claims
 
1. An AI accelerator apparatus configured with in-memory compute, the apparatus comprising:
one or N chiplets, where N is an integer greater than 1, each of the chiplets comprising a plurality of tiles, and each of the tiles comprising:
a plurality of slices,
a central processing unit (CPU) coupled to the plurality of slices, and
a hardware dispatch device coupled to the CPU;
a first clock configured to output a clock signal ranging from 0.5 GHz to 4 GHz;
a plurality of die-to-die (D2D) interconnects coupled to each of the CPUs in each of the tiles;
a peripheral component interconnect express (PCIe) bus coupled to the CPUs in each of the tiles;
a dynamic random access memory (DRAM) interface coupled to the CPUs in each of the tiles;
a global reduced instruction set computer (RISC) interface coupled to each of the CPUs in each of the tiles;
wherein each of the slices includes a digital in memory compute (DIMC) device configured to allow for a throughput of one or more matrix computations provided in the DIMC device such that the throughput is characterized by 512 multiply accumulates per a clock cycle;
wherein the DIMC device is configured to accelerate the one or more matrix computations for a generative AI application;
wherein the DIMC device is coupled to a second clock configured at an output rate of one half of the rate of the first clock; and
a substrate member configured to provide mechanical support and having a surface region and an interposer, the surface region being coupled to support the one or N chiplets, and the one or N chiplets being coupled to each other using the interposer.
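The numeric limitations in claim 1 (a first clock of 0.5 GHz to 4 GHz, a second clock at one half that rate, and 512 multiply accumulates per clock cycle per DIMC device) can be combined into a rough peak-throughput estimate. The following is a minimal sketch, not anything taken from the patent: the class name AcceleratorConfig, its field names, and the example counts of chiplets, tiles, and slices are hypothetical, since the claim does not fix those counts. It also assumes that the 512 multiply accumulates are counted per cycle of the second (DIMC) clock, which the claim does not state explicitly, and that each slice contributes exactly one DIMC device.

```python
from dataclasses import dataclass

# Values taken from claim 1.
MACS_PER_CYCLE_PER_DIMC = 512
FIRST_CLOCK_RANGE_GHZ = (0.5, 4.0)


@dataclass(frozen=True)
class AcceleratorConfig:
    """Hypothetical description of one apparatus instance.

    The three counts are illustrative assumptions; the claim only
    requires "one or N" chiplets and "a plurality" of tiles and slices.
    """
    chiplets: int
    tiles_per_chiplet: int
    slices_per_tile: int
    first_clock_ghz: float

    def __post_init__(self):
        lo, hi = FIRST_CLOCK_RANGE_GHZ
        if not lo <= self.first_clock_ghz <= hi:
            raise ValueError("first clock must be within 0.5 GHz to 4 GHz")
        if self.chiplets < 1:
            raise ValueError("at least one chiplet is required")
        if self.tiles_per_chiplet < 2 or self.slices_per_tile < 2:
            raise ValueError("a plurality means at least two tiles and slices")

    @property
    def dimc_clock_ghz(self) -> float:
        # The second clock runs at one half of the first clock's rate.
        return self.first_clock_ghz / 2

    @property
    def dimc_count(self) -> int:
        # Assumption: one DIMC device per slice.
        return self.chiplets * self.tiles_per_chiplet * self.slices_per_tile

    def peak_macs_per_second(self) -> float:
        # Assumption: the 512 MACs are counted per DIMC (second) clock cycle.
        cycles_per_second = self.dimc_clock_ghz * 1e9
        return self.dimc_count * MACS_PER_CYCLE_PER_DIMC * cycles_per_second


# Example with made-up counts: 1 chiplet, 4 tiles, 4 slices per tile, 2 GHz.
cfg = AcceleratorConfig(chiplets=1, tiles_per_chiplet=4,
                        slices_per_tile=4, first_clock_ghz=2.0)
print(cfg.dimc_clock_ghz, cfg.dimc_count, cfg.peak_macs_per_second())
```

If the 512 multiply accumulates were instead counted against the first clock, the estimate would double. The sketch gives an upper bound only and ignores memory bandwidth, the D2D and PCIe links, and dispatch overhead.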