CPC G06F 9/3867 (2013.01) [G06F 9/3851 (2013.01); G06F 9/3873 (2013.01); G06F 9/3875 (2013.01); G06F 9/3885 (2013.01); G06F 15/76 (2013.01)]
2 Claims
1. A system comprising:
emulated shared memory (ESM) comprising a physically distributed and logically shared data memory; and
a plurality of multi-threaded processors, each multi-threaded processor of the plurality of multi-threaded processors comprising an interleaved inter-thread pipeline configured to execute a plurality of threads in a cyclic, interleaved manner such that while a thread of the plurality of threads references the physically distributed and logically shared data memory of the ESM, other threads of the plurality of threads are executed by the interleaved inter-thread pipeline,
wherein each interleaved inter-thread pipeline comprises:
a plurality of segments across the interleaved inter-thread pipeline, the plurality of segments connected in series and comprising:
a first segment beginning at a beginning of the interleaved inter-thread pipeline,
a memory access segment beginning at a first latency from the beginning of the interleaved inter-thread pipeline, and
a second segment beginning at a second latency from the beginning of the interleaved inter-thread pipeline, wherein the second latency is larger than the first latency; and
at least three operatively parallel branches comprising:
a first parallel branch comprising a plurality of arithmetic and logic units (ALUs) that perform integer operations, wherein portions of the first parallel branch corresponding to the first segment and the second segment include at least one ALU from the plurality of ALUs, such that the first segment and the second segment each includes at least one ALU,
a second parallel branch comprising a plurality of floating-point units (FPUs) that perform floating point operations, wherein portions of the second parallel branch corresponding to the first segment, the memory access segment, and the second segment all include at least one FPU from the plurality of FPUs, such that the first segment, the memory access segment, and the second segment each includes at least one FPU, and
a third parallel branch comprising at least one memory unit that references the physically distributed and logically shared data memory of the ESM, wherein a portion of the third parallel branch corresponding to the memory access segment includes the at least one memory unit, such that the memory access segment includes the at least one memory unit, and wherein portions of the third parallel branch corresponding to the first and second segments include no memory units,
wherein:
in the first segment, the at least one ALU of the first segment and the at least one FPU of the first segment execute simultaneously,
in the memory access segment, the at least one FPU of the memory access segment and the at least one memory unit execute simultaneously, and
in the second segment, the at least one ALU of the second segment and the at least one FPU of the second segment execute simultaneously, and
wherein at least one of the plurality of FPUs has a longer execution latency than at least one of the plurality of ALUs.
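The layout recited in claim 1 can be illustrated with a small model. This is a minimal sketch under stated assumptions, not the patented implementation: the names `Segment`, `satisfies_claim_layout`, and `thread_in_stage`, the unit counts, the latency values, the stage offsets, and the round-robin issue rule are all hypothetical choices made for illustration. The claim itself fixes only the structural relations that the checker tests.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Segment:
    name: str
    start: int                  # assumed: offset in pipeline stages from the pipeline start
    alu_latencies: tuple = ()   # one entry per ALU in this segment (integer branch)
    fpu_latencies: tuple = ()   # one entry per FPU in this segment (floating-point branch)
    memory_units: int = 0       # memory units in this segment (memory branch)


def satisfies_claim_layout(first: Segment, mem: Segment, second: Segment) -> bool:
    """Check the structural relations recited in claim 1 for three segments in series."""
    alus = first.alu_latencies + mem.alu_latencies + second.alu_latencies
    fpus = first.fpu_latencies + mem.fpu_latencies + second.fpu_latencies
    return (
        first.start == 0                      # first segment begins at the pipeline start
        and mem.start < second.start          # second latency is larger than first latency
        and len(first.alu_latencies) >= 1     # ALU in the first segment
        and len(second.alu_latencies) >= 1    # ALU in the second segment
        and all(len(s.fpu_latencies) >= 1 for s in (first, mem, second))
        and mem.memory_units >= 1             # memory unit in the memory access segment
        and first.memory_units == 0           # no memory units outside that segment
        and second.memory_units == 0
        and max(fpus) > min(alus)             # some FPU is slower than some ALU
    )


def thread_in_stage(cycle: int, stage: int, num_threads: int) -> Optional[int]:
    """Thread occupying `stage` at `cycle`, assuming one thread is issued per
    cycle in round-robin order; None while the stage has not yet been reached."""
    issued_at = cycle - stage
    return issued_at % num_threads if issued_at >= 0 else None


# Hypothetical configuration: all numbers below are illustrative assumptions.
first = Segment("first", 0, alu_latencies=(1, 1), fpu_latencies=(4,))
mem = Segment("memory access", 2, fpu_latencies=(4,), memory_units=1)
second = Segment("second", 6, alu_latencies=(1,), fpu_latencies=(4,))

print(satisfies_claim_layout(first, mem, second))
# Threads at the start of each segment during cycle 10, with 16 threads assumed.
print([thread_in_stage(10, s.start, 16) for s in (first, mem, second)])
```

In this model, different threads occupy the three segments during the same cycle, which mirrors the recited behavior that other threads continue to execute while one thread references the shared data memory.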