US 11,797,310 B2
Floating-point supportive pipeline for emulated shared memory architectures
Martti Forsell, Oulu (FI)
Assigned to TEKNOLOGIAN TUTKIMUSKESKUS VTT OY, Vtt (FI)
Appl. No. 15/031,285
Filed by TEKNOLOGIAN TUTKIMUSKESKUS VTT OY, Espoo (FI)
PCT Filed Oct. 23, 2014, PCT No. PCT/FI2014/050804
§ 371(c)(1), (2) Date Apr. 22, 2016,
PCT Pub. No. WO2015/059362, PCT Pub. Date Apr. 30, 2015.
Claims priority of application No. 13189861 (EP), filed on Oct. 23, 2013.
Prior Publication US 2016/0283249 A1, Sep. 29, 2016
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 9/38 (2018.01); G06F 15/76 (2006.01)
CPC G06F 9/3867 (2013.01) [G06F 9/3851 (2013.01); G06F 9/3873 (2013.01); G06F 9/3875 (2013.01); G06F 9/3885 (2013.01); G06F 15/76 (2013.01)]
2 Claims
 
1. A system comprising:
emulated shared memory (ESM) comprising a physically distributed and logically shared data memory; and
a plurality of multi-threaded processors, each multi-threaded processor of the plurality of multi-threaded processors comprising an interleaved inter-thread pipeline configured to execute a plurality of threads in a cyclic, interleaved manner such that while a thread of the plurality of threads references the physically distributed and logically shared data memory of the ESM, other threads of the plurality of threads are executed by the interleaved inter-thread pipeline,
wherein each interleaved inter-thread pipeline comprises:
a plurality of segments across the interleaved inter-thread pipeline, the plurality of segments connected in series and comprising:
a first segment beginning at a beginning of the interleaved inter-thread pipeline,
a memory access segment beginning at a first latency from the beginning of the interleaved inter-thread pipeline, and
a second segment beginning at a second latency from the beginning of the interleaved inter-thread pipeline, wherein the second latency is larger than the first latency; and
at least three operatively parallel branches comprising:
a first parallel branch comprising a plurality of arithmetic and logic units (ALUs) that perform integer operations, wherein portions of the first parallel branch corresponding to the first segment and the second segment include at least one ALU from the plurality of ALUs, such that the first segment and the second segment each includes at least one ALU,
a second parallel branch comprising a plurality of floating-point units (FPUs) that perform floating point operations, wherein portions of the second parallel branch corresponding to the first segment, the memory access segment, and the second segment all include at least one FPU from the plurality of FPUs, such that the first segment, the memory access segment, and the second segment each includes at least one FPU, and
a third parallel branch comprising at least one memory unit that references the physically distributed and logically shared data memory of the ESM, wherein a portion of the third parallel branch corresponding to the memory access segment includes the at least one memory unit, such that the memory access segment includes the at least one memory unit, and wherein portions of the third parallel branch corresponding to the first and second segments include no memory units,
wherein:
in the first segment, the at least one ALU of the first segment and the at least one FPU of the first segment execute simultaneously,
in the memory access segment, the at least one FPU of the memory access segment and the at least one memory unit execute simultaneously, and
in the second segment, the at least one ALU of the second segment and the at least one FPU of the second segment execute simultaneously, and
wherein at least one of the plurality of FPUs has a longer execution latency than at least one of the plurality of ALUs.
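
The structural conditions recited in claim 1 can be sketched as a small Python model. This is an illustrative reading only, not the patented implementation: the names Segment, layout_matches_claim and thread_in_stage, all unit counts and latency values, and the assumption that exactly one thread is issued per cycle in round-robin order are hypothetical and do not come from the patent text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One pipeline segment; counts and latencies are illustrative assumptions."""
    start_latency: int      # stages from the beginning of the pipeline
    alus: int = 0           # units in the integer (first) branch
    fpus: int = 0           # units in the floating-point (second) branch
    memory_units: int = 0   # units in the memory (third) branch


def layout_matches_claim(first, memory, second, alu_latencies, fpu_latencies):
    """Check the structural conditions recited in claim 1 (hypothetical helper)."""
    return all([
        # The first segment begins at the beginning of the pipeline.
        first.start_latency == 0,
        # The second latency is larger than the first latency.
        second.start_latency > memory.start_latency,
        # The first and second segments each include at least one ALU.
        first.alus >= 1 and second.alus >= 1,
        # All three segments each include at least one FPU.
        first.fpus >= 1 and memory.fpus >= 1 and second.fpus >= 1,
        # Memory units appear only in the memory access segment.
        memory.memory_units >= 1,
        first.memory_units == 0 and second.memory_units == 0,
        # At least one FPU has a longer latency than at least one ALU.
        max(fpu_latencies) > min(alu_latencies),
    ])


def thread_in_stage(cycle, stage, n_threads):
    """Thread occupying `stage` at `cycle`, assuming one thread is issued
    per cycle in round-robin order; None if the stage is not yet filled."""
    issued_at = cycle - stage
    return issued_at % n_threads if issued_at >= 0 else None


# Example layout with made-up numbers.
first = Segment(start_latency=0, alus=2, fpus=1)
memory = Segment(start_latency=2, fpus=1, memory_units=1)
second = Segment(start_latency=5, alus=1, fpus=1)
ok = layout_matches_claim(first, memory, second,
                          alu_latencies=[1, 1, 1], fpu_latencies=[1, 2, 2])
```

Under the round-robin assumption, each pipeline stage holds a different thread in any given cycle, so while one thread occupies the memory access segment the other threads occupy the remaining stages, which corresponds to the interleaved execution recited in the claim.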