CPC G06T 1/20 (2013.01) [G06F 7/483 (2013.01); G06F 9/30014 (2013.01); G06F 9/30185 (2013.01); G06F 9/3863 (2013.01); G06F 9/5044 (2013.01); G06N 3/044 (2023.01); G06N 3/045 (2023.01); G06N 3/063 (2013.01); G06N 3/084 (2013.01); G06N 20/00 (2019.01); G06F 3/14 (2013.01); G06T 1/60 (2013.01); G06T 15/005 (2013.01)]
27 Claims
1. A multi-chip module accelerator usable to execute tensor data processing instructions, the multi-chip module accelerator comprising:
a multi-chip module comprising:
an interconnect to a host processor;
a plurality of distinct chips integrated on the multi-chip module;
a memory stack including multiple memory dies; and
parallel processor circuitry communicatively coupled to the memory stack, the parallel processor circuitry comprising a plurality of multiprocessor cores distributed across the plurality of distinct chips, each of the plurality of multiprocessor cores configured to execute a single instruction to perform multiple matrix multiplication and accumulate operations;
wherein:
the matrix multiplication and accumulate operations comprise floating-point operations;
the floating-point operations are configurable to comprise two-dimensional matrix multiply and accumulate operations involving inputs that have differing floating-point precisions, the two-dimensional matrix multiply and accumulate operations including a plurality of concurrent multiply operations;
the floating-point operations comprise a first operation at a first precision and a second operation at a second precision; and
the first operation comprises a multiply having at least one 16-bit floating-point input and the second operation comprises an accumulate having a 32-bit floating-point input.
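The mixed-precision arrangement recited in the last limitations of claim 1 (a multiply having at least one 16-bit floating-point input, feeding an accumulate having a 32-bit floating-point input) can be illustrated with a small software model. The following is only a sketch under stated assumptions, not the claimed circuitry. The function names `to_fp16`, `to_fp32`, and `mixed_precision_mma` are hypothetical. The sketch assumes IEEE 754 binary16 and binary32 formats, which the claim does not specify, and it uses scalar Python loops where the claim recites a single instruction performing a plurality of concurrent multiply operations.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 binary16 value (assumed 16-bit format)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round x to the nearest IEEE 754 binary32 value (assumed 32-bit format)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mixed_precision_mma(a, b, c):
    """Model D = A x B + C with 16-bit multiply inputs and a 32-bit accumulator.

    a is an m x k list of lists, b is k x n, c is m x n.
    """
    m, k, n = len(a), len(b), len(b[0])
    d = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            # Second operation: the accumulator is held at 32-bit precision.
            acc = to_fp32(c[i][j])
            for p in range(k):
                # First operation: both multiplicands are rounded to 16 bits.
                # The product of two binary16 values needs at most 22
                # significand bits, so it is representable in binary32.
                prod = to_fp16(a[i][p]) * to_fp16(b[p][j])
                acc = to_fp32(acc + prod)
            d[i][j] = acc
    return d


if __name__ == "__main__":
    # A 32-bit accumulator keeps an increment that 16-bit storage would drop.
    print(mixed_precision_mma([[1.0]], [[1.0]], [[2048.0]]))  # [[2049.0]]
    print(to_fp16(2048.0 + 1.0))                              # 2048.0
```

In this model, the value 2049 cannot be stored in the assumed 16-bit format, where representable values near 2048 are spaced 2 apart, but it is exact in the 32-bit format. That is one motivation for accumulating at the higher precision while keeping the multiply inputs narrow.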