| CPC G06N 3/063 (2013.01) [G06N 3/0455 (2023.01)] | 31 Claims |

|
1. A server system configured for processing transformer workloads using AI accelerator apparatuses with in-memory compute, the system comprising:
a plurality of first server central processing units (CPUs) and a plurality of second server CPUs, wherein each of the first server CPUs is coupled to one of the second server CPUs, wherein each of the first server CPUs and the second server CPUs is coupled to a plurality of memory devices, and wherein each of the first server CPUs is coupled to a network interface controller (NIC) device;
a plurality of switch devices coupled to each other and to the plurality of first server CPUs and the plurality of second server CPUs, wherein each of the switch devices is coupled to a plurality of AI accelerator apparatuses, each of the AI accelerator apparatuses comprising:
one or N chiplets, where N is an integer greater than 1, each of the chiplets comprising a plurality of tiles, and each of the tiles comprising:
a plurality of slices,
a CPU coupled to the plurality of slices, and
a hardware dispatch device coupled to the CPU;
a first clock configured to output a clock signal of 0.5 GHz to 4 GHz;
a plurality of die-to-die (D2D) interconnects coupled to each of the CPUs in each of the tiles;
a peripheral component interconnect express (PCIe) bus coupled to the CPUs in each of the tiles, wherein each switch device is coupled to one of the plurality of chiplets of each AI accelerator apparatus via the PCIe bus, and one or more of the chiplets of each AI accelerator apparatus are coupled to one other of the chiplets of the AI accelerator apparatus via a bridge connection pathway;
a dynamic random access memory (DRAM) interface coupled to the CPUs in each of the tiles;
a global reduced instruction set computer (RISC) interface coupled to each of the CPUs in each of the tiles;
wherein each of the slices includes a digital in memory compute (DIMC) device coupled to a second clock and configured to allow for a throughput of one or more matrix computations provided in the DIMC device such that the throughput is characterized by 512 multiply accumulates per clock cycle;
wherein the DIMC device is coupled to the second clock configured at an output rate of one half of the rate of the first clock; and
a substrate member configured to provide mechanical support and having a surface region and an interposer, the surface region being coupled to support the one or N chiplets, and the one or N chiplets being coupled to each other using the interposer.
|
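The clock and throughput limitations of claim 1 imply a simple peak-rate calculation, sketched below for illustration only. The sketch assumes that the recited 512 multiply accumulates per clock cycle are counted against the second (DIMC) clock, which the claim does not state explicitly. The function names and the slice, tile, and chiplet counts are hypothetical parameters, since the claim recites only "a plurality" of slices and tiles and "one or N" chiplets.

```python
# Illustrative sketch only; not part of the claim.
# From the claim: 512 multiply-accumulates (MACs) per clock cycle per DIMC device.
MACS_PER_CYCLE = 512

# From the claim: the first clock outputs a signal of 0.5 GHz to 4 GHz.
FIRST_CLOCK_MIN_HZ = 0.5e9
FIRST_CLOCK_MAX_HZ = 4.0e9


def dimc_clock_hz(first_clock_hz: float) -> float:
    """Second (DIMC) clock rate: one half of the first clock rate."""
    if not FIRST_CLOCK_MIN_HZ <= first_clock_hz <= FIRST_CLOCK_MAX_HZ:
        raise ValueError("first clock must be between 0.5 GHz and 4 GHz")
    return first_clock_hz / 2


def dimc_peak_macs_per_second(first_clock_hz: float) -> float:
    """Peak MAC rate of one DIMC device (one per slice).

    Assumption: the 512 MACs per cycle are counted on the second clock.
    """
    return MACS_PER_CYCLE * dimc_clock_hz(first_clock_hz)


def apparatus_peak_macs_per_second(
    first_clock_hz: float,
    slices_per_tile: int,    # hypothetical; the claim says "a plurality of slices"
    tiles_per_chiplet: int,  # hypothetical; the claim says "a plurality of tiles"
    chiplets: int,           # the claim says "one or N chiplets", N > 1
) -> float:
    """Peak MAC rate of one AI accelerator apparatus under the assumptions above."""
    if chiplets < 1 or slices_per_tile < 2 or tiles_per_chiplet < 2:
        raise ValueError("need at least 1 chiplet and a plurality of slices and tiles")
    dimc_devices = slices_per_tile * tiles_per_chiplet * chiplets
    return dimc_devices * dimc_peak_macs_per_second(first_clock_hz)
```

Under these assumptions, a first clock at the low end of the recited range (0.5 GHz) gives a DIMC clock of 0.25 GHz and a per-device peak of 512 × 0.25 × 10^9 multiply accumulates per second; the apparatus-level figure scales linearly with whatever slice, tile, and chiplet counts a given embodiment uses.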