当前位置：文档库 › The Mitosis Speculative Multithreaded Architecture

The Mitosis Speculative Multithreaded Architecture

The Mitosis Speculative Multithreaded

Architecture

C.Madriles,C.Garc′?a Qui?nones,J.S′anchez,

P.Marcuello,A.Gonz′alez

published in

Parallel Computing:

Current&Future Issues of High-End Computing,

Proceedings of the International Conference ParCo2005,

G.R.Joubert,W.E.Nagel,F.J.Peters,O.Plata,P.Tirado,E.Zapata (Editors),

John von Neumann Institute for Computing,J¨ulich,

NIC Series,Vol.33,ISBN3-00-017352-8,pp.27-38,2006.

c 2006by John von Neumann Institute for Computing

Permission to make digital or hard copies of portions of this work for personal or classroom use is granted provided that the copies are not made or distributed for pro?t or commercial advantage and that copies bear this notice and the full citation on the?rst page.To copy otherwise requires prior speci?c permission by the publisher mentioned above.

http://www.fz-juelich.de/nic-series/volume33

The Mitosis Speculative Multithreaded

Architecture

C.Madriles,C.Garc′?a Qui?nones,J.S′anchez,

P.Marcuello,A.Gonz′alez

published in

Parallel Computing:

Current&Future Issues of High-End Computing,

Proceedings of the International Conference ParCo2005,

G.R.Joubert,W.E.Nagel,F.J.Peters,O.Plata,P.Tirado,E.Zapata (Editors),

John von Neumann Institute for Computing,J¨ulich,

NIC Series,Vol.33,ISBN3-00-017352-8,pp.15-26,2006.

c 2006by John von Neumann Institute for Computing

http://www.fz-juelich.de/nic-series/volume33

27 The Mitosis Speculative Multithreaded Architecture

Carlos Madriles,Carlos Garc′?a Qui?n ones,Jes′u s S′a nchez,Pedro Marcuello,Antonio Gonz′a lez Intel Barcelona Research Center,Intel Labs-UPC,Barcelona

Abstract

Speculative multithreading(SpMT)increases the performance by means of exploiting specula-tive thread-level parallelism.In this paper we describe the Mitosis framework,which is a com-bined hardware-software approach to?nding and exploiting speculative thread-level parallelism, even in the presence of frequent dependences between threads.The approach is based on predict-ing/computing thread input values via software,through a piece of code that is added at the begin-ning of each thread,the pre-computation slice(p-slice).A p-slice is expected to compute the correct thread input values most of the time,but not necessarily always.Because of that,aggressive opti-mization techniques can be applied to the slice to make it very short.In this paper,we also describe the microarchitecture that supports this execution model.The main novelty of the microarchitecture is the organization of the register?le and the cache memory in order to support multiple versions of each variable and allow for roll-back in case of misspeculation.We also describe a novel compiler approach to identify points where speculative threads are most effectively spawned,and generate the corresponding p-slices for each.We show that the Mitosis microarchitecture-compiler achieves very important speedups for applications that the compiler cannot parallelize by conventional non-speculative approaches,such as the Olden benchmarks.

1.Introduction

Most high-performance processor vendors have now introduced designs that feature processors that can execute multiple threads simultaneously on the same core,either through multithreading [5][15],multiprocessing[20][22]or a combination of the two.The way thread-level parallelism (TLP)is currently being exploited in these processors is through non-speculative threading–all created threads are committed.Basically,there are two main sources of non-speculative TLP:1) execute different applications in parallel,and2)execute parallel threads from a single application generated through compiler/programmer support.In the former case,running different independent applications in the same processor provides an increase in throughput(number of jobs?nished per time unit)and a reduction in the average response time of jobs over a single-threaded processor. In the latter case,programs are partitioned into smaller threads that are executed in parallel.This partitioning process may signi?cantly reduce the execution time of the parallelized application in comparison with single-threaded execution.

Executing different applications in parallel provides no gain for a single application.On the other hand,partitioning applications into parallel threads may be a straightforward task in regular appli-cations,but becomes much harder for irregular programs,where compilers usually fail to discover suf?cient thread-level parallelism,tipically because of the necessarily conservative approach taken by the compiler.This normally results in the compiler including many unnecessary inter-thread communication/synchronization operations,or more likely,concluding that insuf?cient parallelism exists.Recent studies have proposed the use speculative thread-level parallelism(SpMT).This tech-nique reduces the execution time of applications by executing several speculative threads in parallel.

These threads are speculative in the sense that they may be data and control dependent on previous threads and their correct execution and commitment is not guaranteed.Additional hardware/software is required to validate these threads and eventually commit them.

There are two main strategies for speculative TLP:1)the use of helper threads to reduce the execution time of high-latency instructions by means of side effects[2][4][10][17][24],and2) relaxing the parallelization constraints and parallelizing the applications into speculative threads ([1][3][11][18][19][21]among others).

The Helper Thread paradigm is based on the observation that the cost of some instructions severely impact performance,such as loads that miss in cache or branches that are incorrectly predicted.This paradigm attempts to reduce the execution time of the application by using speculative threads to reduce the cost of these high-latency operations.

In speculative parallelization,each of the speculative threads executes a different portion of the program.This partitioning process is based on relaxing the parallelization constraints and allowing the spawning of speculative threads even where the compiler cannot guarantee correct execution. Once the thread?nishes,the speculative decisions are veri?ed–if they were correct,then the appli-cation has been accelerated.If a misspeculation(control or data)has occurred,then the work done by the speculative thread is discarded and the processor continues with the correct threads.

The Mitosis processor is based on the speculative parallelization approach.The primary distinc-tion from previous works stems from how inter-thread dependences are handled.It is possible to partition a program into enough parallel threads such that there are few or no dependences between them[19].However,for most programs it is necessary to create threads where there are control/data dependences across these partitions to fully exploit the available parallelism.The manner in which such dependences are managed critically affects the performance of the SpMT processor[11].Pre-vious approaches have used hardware communication of produced register values[9],explicit syn-chronization[6][21],and value prediction[12]to manage these dependences.

In contrast,in the Mitosis framework we propose a new,software approach to manage both data and control dependences among threads.We?nd that this both produces values earlier than the hardware-assisted value-passing approaches,and more accurately than hardware value prediction approaches(due to the use of the actual application code).Each speculative thread is prepended with a speculative pre-computation slice(p-slice)that precomputes live-ins,while executing in a fraction of the time of the actual code that produces those live-ins.

In this paper,we present the Mitosis processor,a hardware/software approach to effectively exploit speculative TLP.The Mitosis framework is composed of a compiler that partitions the applications into speculative threads and a speculative multithreaded processor that is able to manage multiple speculative threads.It also provides novel mechanisms to track and manage the different versions of the memory and the register values for the speculative threads.The overall Mitosis processor frame-work is presented,including the compiler architecture,the hardware architecture,and several novel aspects of each.Some very promising early performance results are presented as well.This study focuses on applications that sophisticated parallelizing compilers cannot parallelize.Performance results reported by Mitosis processors show an average speed-up of about2.2x for a subset of the Olden benchmark suite and1.75x over a con?guration that models perfect L1level caches.

The rest of the paper is organized as follows:Section2presents the execution model.The different components of the Mitosis processors are described:the compiler in Section3and the microarchi-tecture in Section4.Section5presents the performance evaluation of this architecture.Finally, Section6summarizes the main conclusions of the work.

29 2.Execution Model of the Mitosis Processor

Mitosis is a framework that combines hardware and software techniques to exploit speculative TLP.The Mitosis compiler is responsible for partitioning the application into speculative threads.It also plays an important role in dealing with data dependences since it generates the code to com-pute the live-in values of the speculative threads.The Mitosis processor architecture provides the hardware support for executing speculative threads and detecting any possible misspeculation. Figure1shows the execution model of the Mitosis processor.Programs are partitioned into spec-ulative threads statically with the Mitosis compiler.The partitioning process done by the compiler is described in Section3.It basically explores the application to insert spawning pairs,that is,pairs of instructions made up of the point where the speculative thread will be spawned(the spawning point,or SP for short)and the point where the speculative thread will start its execution(the control quasi-independent point,or CQIP for short)[13].The Mitosis compiler also computes the p-slice of each spawning pair that predicts the input values of the speculative thread.

This new binary obtained with the Mitosis compiler is executed on the Mitosis processor.Ap-plications run on a Mitosis processor in the same way as on a conventional superscalar processor. However,when a spawn instruction is found,the processor looks for the availability of a free context (or thread unit).On?nding a free one,the p-slice of the corresponding speculative thread is assigned to be executed at that thread unit.The p-slice ends with an unconditional jump to the CQIP,thereby starting the execution of the speculative thread.If no free thread unit is available when the spawn instruction is executed,the system looks for a speculative thread that is more speculative(further in sequential time)than the new thread we want to spawn.If any is found,the most speculative one is cancelled and its thread unit is assigned to the new thread.

Threads in Mitosis processors are committed in program order.The thread executing the oldest instructions in program order is non-speculative,whereas the rest are speculative.When any running thread reaches the CQIP of any other active thread,it stops fetching instructions until it becomes the non-speculative thread.Then,a veri?cation process checks that the next speculative thread has been executed with the correct input values.If the speculation has been correct,the non-speculative thread is committed and its thread unit is freed.Moreover,the next speculative thread becomes the non-speculative.If there is a misspeculation,the next speculative thread and all its successors are squashed,and the non-speculative thread continues executing the instructions beyond the CQIP.

In Mitosis processors,the spawning process is highly general.Any speculative or non-speculative thread can spawn a new thread upon reaching an SP.Moreover,speculative threads can be spawned out of program order,that is,threads can be created in a different order than they will be committed.

3.Mitosis Compiler

The Mitosis compiler has been developed on top of the Open Research Compiler(ORC)[8].The ORC infrastructure is a research IPF compiler that produces code with a quality similar to that of a production compiler.The different Mitosis modi?cations have been implemented in the back-end of the compiler after all the optimizations and before bundle formation.These modi?cations include the following highly coupled steps:1)identi?cation of spawning pairs,and2)generation and optimization of the p-slice.We will describe these two steps below.

3.1.Spawning Pair Identi?cation

The partitioning of applications into speculative threads is performed by means of the identi?ca-tion of spawning pairs.A spawning pair represents two instructions:the spawning point(SP)and

Speculative

Non-speculative

Overhead

execution

Figure1.Mitosis execution model Figure2.Mitosis processor architecture

the control quasi-independent point(CQIP).This pair of instructions could be formed by any pair of instructions in the program,but there are some requirements they should meet in order to pro-vide some bene?t.Examples of these requirements are control and data independence,load balance issues,thread length,etc.

In this work,we propose a static mechanism that selects the spawning pairs by means of a cost model that estimates the expected bene?t of any potential spawning pair.This model also considers the probability/cost of misspeculations and the interaction with other spawning pairs selected.

The Mitosis compiler uses edge pro?ling information,as extracted by many conventional com-pilers,to initially obtain a list of possible spawning pair candidates.This initial set of candidates is obtained pruning the whole set by a certain control independence and a minimum distance between the SP and the CQIP.These candidates have the SP and the CQIP in the same routine.

Once this step has been performed,the p-slice for each of these potential spawning pairs is calcu-lated.The way slices are generated is described in the next subsection.Candidates are pruned again depending on the size of the https://www.wendangku.net/doc/4f8223035.html,rge p-slices imply fewer overlapped instructions.Thus,those spawning pairs whose p-slice size,relative to the expected size of the body of the thread,is higher than a certain threshold are discarded.

Next,a synthetic trace of the program is generated from the edge pro?le information and a greedy algorithm is applied to get the best set of spawning pairs.In this algorithm,the pair that performs the best individually based on the cost model is selected among all the candidate spawning pairs. Taking into account this inserted thread,this process is repeated until the increasing bene?t between two consecutive iterations is lower than a certain threshold.

3.2.Slice Generation

Given a spawning pair,a p-slice is the subset of the instructions executed between the spawning point and the control quasi-independent point needed for computing values used beyond the control quasi-independent point.Therefore,the?rst step to get the p-slice is to determine the thread live-in values.The compiler examines the code following the CQIP,but only for as many instructions as the average distance between the spawning point and the control quasi-independent point(because once the spawning thread reaches the CQIP and veri?es the spawned thread,all values are available non-speculatively).Then,for these thread input values,it is determined which of them are produced 30

between the spawning pair and the corresponding data and control dependence graph for only those values is built.Those instructions that are in the sub-graphs of the thread live-ins are selected to be inserted in the slice.Finally,the slice ends with an unconditional branch to the CQIP.

The identi?cation of the thread input values is relatively straightforward for register values,but harder for memory values.To detect thread input memory values,the Mitosis compiler uses an improved version of the memory dependence pro?ling.

Subroutine calls within the spawning pair that belong to the sub-graph of the thread input values are also included in the p-slice.However,subroutines that perform system calls are not included and then spawning pairs that require a system call in the p-slice are not considered for selection.

A key observation for generating p-slices is that they do not need to be correct.This is because the hardware,as described in the next section,will validate them and squash the thread when they are incorrect.Therefore,the compiler includes aggressive,sometimes unsafe,optimizations such as thread and slice branch pruning and memory dependence https://www.wendangku.net/doc/4f8223035.html,plete information regarding the compiler support and the applied p-slice optimizations can be found at[14].

4.Mitosis Processors

The Mitosis processor has a multi-core design similar to an on-chip multiprocessor(CMP),as shown in Figure2.Each thread unit executes a single thread at a time,and it is similar to a superscalar core.Each speculative thread may have a different version for each logical register and memory location.The different versions are supported by means of a local register?le and a local memory per thread unit.In this section,the most relevant components of the microarchitecture are described, including the Multi-Version Register File and the Multi-Version Memory.

4.1.Spawning process

A speculative thread starts when any active thread fetches a spawn instruction in the code.The spawn instruction triggers the allocation of a thread unit,the initialization of some registers,and the ordering of the new thread with respect to the other active threads.These tasks are handled by the Speculation Engine.

To initialize the register state,the spawn instruction includes a mask that encodes which registers are p-slice live-ins.Those registers included in the mask are copied from the parent thread to the spawned one.On average,we have observed that just6registers need to be initialized for our benchmarks.

Any active thread is allowed to spawn a new thread when it executes a spawn instruction.Addi-tionally,speculative threads can be spawned out of program order,that is,a speculative thread that would be executed later than another thread if executed sequentially can be spawned and executed in reverse order in a Mitosis processor.Some previous studies have shown that out-of-order spawning schemes have a much higher performance potential than in-order approaches[1][11].

It is also necessary to properly order the spawned thread with respect to the rest of the active speculative threads.The order among threads will determine where a thread is going to look for values not produced by itself.Akkary and Driscoll[1]propose a simple mechanism in which the new spawned thread is always assumed to be the most speculative.On the other hand,Marcuello and Gonz′a lez[11]propose an order predictor based on the previous executions of the speculative threads.In this work,this latter scheme is used since it provides better hit ratio(98%with the con?guration described in Section5).

4.2.Multi-Version Register File

To achieve correct execution and high performance,the architecture must simultaneously support the following,seemingly con?icting,goals:a uni?ed view of all committed register state,the co-existence of multiple versions of each register,register dependences that cross thread units,and a latency similar to a single local register?le.

This support is provided by the Mitosis multi-version register?le,shown in Figure3.As can be observed,the register?le has a hierarchical organization.Each thread unit has its own Local Register File(LRF)and there is a Global Register File(GRF)for all the thread units.There is also a table,the Register Versioning Table(RVT)that has as many rows as logical registers and as many columns as thread units,that tracks which thread units have a copy of that logical register.

When a speculative thread is spawned in a thread unit,some register values are copied from the parent thread.These registers are encoded in the spawn instruction,as described above.Slices have the characteristic that they are not more nor less speculative than the parent thread;in fact,they execute a subset of instructions of the parent thread so the speculation degree is the same.Therefore, it needs to access the same register versions as the parent thread would see at the spawning point (while executing in a different thread unit).

When a thread requires a register value,the LRF is checked?rst.If the value is not present,then the RVT is accessed to determine which the closest predecessor thread that has a copy of the value. If there are no predecessors that have the requested register,then the value is obtained from the GRF.

There is an additional structure used for validation purposes:the Register Validation Store(RVS). When a speculative thread reads a register for the?rst time and this value has not been produced by the thread itself the value is copied into this structure.Additionally,those register values generated by the p-slice that have been used by the speculative thread are also inserted in the RVS.When this thread is validated,the values in the RVS are compared with the actual values of the corresponding registers in the predecessor thread.Doing so we ensure that values consumed by the speculative thread would have been the same in a sequential execution.Because we explicitly track values con-sumed,incorrect live-ins produced by the slice that are not consumed do not cause misspeculation.

Finally,when the non-speculative thread commits,all the modi?ed registers in the LRF are copied into the GRF.An evaluation of the performance impact of the Multi-Version Register File design has been done.On average,more than99%of the register accesses are satis?ed from the LRF for the benchmarks evaluated.Thus,the average perceived latency for register access is essentially equal to the latency of the LRF,meeting the goals for our register?le hierarchy.

4.3.Multi-Version Memory System

Similar to the register storage,the memory system provides support for multi-versioning,that is,it allows different threads to have different values for the same memory location.As shown in Figure 4,each thread unit has its own L0and L1data caches,which are responsible for maintaining the speculative values,since speculative threads are not allowed to modify main memory.Moreover, three additional structures are needed at each thread unit–the Slice Buffer(SB),the Old Buffer (OldB)and the Replication Cache(RC).Finally,there is a global L2cache shared among all the thread units which can only be updated by the non-speculative thread,and a centralized logic to handle the order list of the different variables,the Version Control Logic(VCL).

The architecture of this memory system is inspired by the Speculative Versioning Cache(SVC) proposed by Gopal et al.[7]with notable extensions to handle p-slices.As a summary,the Mitosis memory subsystem contains the following novel features:support for p-slice execution and the replication cache.This section focuses on these new features.

A load in a p-slice needs to read the value of that memory location at the SP,while a load in the

thread needs to see the value at the CQIP,if it has not been produced by itself.For this reason,during the execution of a p-slice the processor needs to have an exact view of the machine state of the parent at the time of the spawn instruction.Therefore,when a thread performs a store and any of its children is still executing its p-slice,the value of that memory location needs to be saved before overwriting it since its children may later require that value.The buffers used for storing the values that are overwritten while a child is executing a p-slice are referred to as Old Buffers.Each thread unit has as many Old Buffers as direct child threads are allowed to be executing a p-slice simultaneously. Thus,when a speculative thread that is executing the p-slice performs a load to a memory location, it?rst checks for a local version at its local memory.In case of a miss,it checks in its corresponding Old Buffer from the parent thread.If the value is there,then it is forwarded to the speculative thread. Otherwise,it looks for it at any less speculative thread cache.When a speculative thread?nishes its slice,it sends a noti?cation to its parent thread to deallocate the corresponding Old Buffer.Finally,it is possible that a thread?nishes its execution and some of its children are still executing the p-slice. Then,those Old Buffers cannot be freed until these threads?nish their corresponding p-slices.

The values read during the execution of the p-slice have to be marked in some way to avoid being read by any other thread,and mistaken for state expected to be valid across CQIPs.To prevent a more speculative thread from reading an incorrect value,a new bit is added to the SVC protocol that is referred to as Old Bit.When a thread that is executing the slice performs a load from the parent Old Buffer,the value is inserted into the local cache with the Old Bit set.Then,when a more speculative thread requests this value,if it?nds the Old Bit set it knows that the value stored in that cache may be potentially old and is ignored.Finally,when the slice?nishes,all the lines of the local cache with the Old Bit set are invalidated.

Slice Buffers are used to store the values computed by the live-in p-slice in order to validate if the speculative thread is being executed with the correct input values.When a thread is allocated to a thread unit,the Slice Buffer is reset and all the stores performed during the execution of the slice go directly to Slice Buffer,bypassing the cache.Each entry of the Slice Buffer contains an address,a value,a read bit and a valid bit.Thus,when the slice?nishes and the speculative thread starts its execution,every time a value is read from memory,the Slice Buffer is checked?rst.If the value is there,the read bit is set and the value copied to cache.When the thread becomes the non-speculative one,all the values that have been consumed from the Slice Buffer have to be checked for their

correctness.Thus,those entries that have their read bit set are sent to the previous non-speculative thread to validate their values.

When threads only exploit samethread memory locality,cache miss rates would be extreme.Pre-liminary experiments showed that the miss ratio of the L1cache was very high for the speculative threads.This is due to the fact that when a thread commits,all the lines of the local cache set its commit bit to defer the traf?c burst to main memory.As a result,every newly spawned thread begins with a completely empty cold cache.The impact of this feature could be reduced by introducing to the SVC protocol the stale bit as was proposed elsewhere[7].However,this memory model has the problem of very poor locality exploitation.If all the active threads consume the same memory line,values have to travel from one thread unit to another every time.Mitosis solves that with the Replication Cache(RC).This small cache works as follows:when a thread performs a store,it has to send a bus request to know if any more speculative thread has performed a load on that address. We use this request to send the value and store it in the Replication Cache of all the threads that are more speculative as well as in the free thread units.Thus,when a thread performs a load,L1and the Replication Cache are checked simultaneously.If the value requested is not in L1but it is in the Replication Cache,the value is moved to L1and supplied to the thread unit.This simple mechanism prevents the thread units from starting with cold caches as well as taking advantage of locality.

4.4.Thread Validation

A thread?nishes its execution when it reaches the starting point of any other active thread,that is,the CQIP.At this point,if the thread is nonspeculative,it validates the next thread.Otherwise,it waits until it becomes the non-speculative one.The?rst thing to verify is the order.The CQIP found by the terminating thread is compared with the control quasi-independent point of the following thread in the thread order(as maintained by the order predictor).If they are not the same,then an order misspeculation has occurred and the following thread and all its successors are squashed.

If the order is correct,then the thread input values used by the speculative thread are veri?ed. These comparisons may take some time,depending on the number of values to validate.We have observed that on average,for our workloads,a thread validation requires to check less than1memory and about5register values for this validation.Note that only memory values produced by the slice and then consumed by the thread(values read from the slice buffer)need to be validated when the previous thread?nishes.Other memory values consumed by the thread are dynamically validated as soon as they are produced through the versioning protocol described above.If no misspeculations are detected,the non-speculative thread is committed and the next thread becomes the nonspeculative one.The thread unit assigned to the?nished thread is freed,except when there is a child thread that is still executing the p-slice since it may require values available in the Old Buffer.

5.Experimental Framework

The performance of the Mitosis processors was evaluated through a detailed execution-driven simulation.The Mitosis compiler has been implemented on top of the ORC compiler to generate IPF code.The Mitosis processor simulator models a research Itanium CMP processor with4hardware contexts based upon SMTSIM[23].The main parameters considered are shown in Table1.The numbers in the table are per thread unit.

To evaluate the potential performance of the Mitosis architecture,a subset of the Olden suite[16] has been used.The benchmarks used are the bh,em3d,health,mst and perimeter,with an input set that on average executes around300M instructions.Statistics in the next section correspond to the whole execution of the programs.Different input data sets have been used for pro?ling and

35 Table1

Mitosis processor con?guration

Fetch,in-order issue16K-entry

and commit bandwidth

512instructions Branch Predictor

I-Cache1cycle

4-way16KB–hit:1cycle Global Register File

L1-Cache5cycles

3cycles Validation overhead

Replication cache1K-entry–hit:1cycle

4-way8MB–hit:6cycles;Old buffer

miss:250cycles

Thread size%Slice/thread%Squashes bh422196.5 4.4

em3d3966389.0 1.0

health19849741.6 2.7

mst1367114 5.8 2.3

perimeter49372524.0 3.6

3585.2 2.7 6.0

Statistics corresponding to the characterization of the speculative threads are shown in Table2. The last row shows the arithmetic mean for the evaluated benchmarks.The?rst column shows the number of spawned threads by benchmark and the second column the average number of speculative instructions executed by the speculative threads.It can be observed that bh spawns the fewest number of threads but their average size is about30time larger than for the rest of benchmarks.On the other hand,mst spawns the most but their average size is the lowest.The third column shows the average dynamic size of the slices and the fourth column the relationship between the sizes of the speculative threads and their corresponding slice.This percentage is consistently quite low for all the studied benchmarks and on average represents less than3%.The?fth column shows the average number of thread input values that are computed by the slice,that is,on average it is only necessary to compute 3values to execute the speculative threads.Finally,the right-most column represents the average number of squashed threads.For all the benchmarks,this percentage is rather low except for health where almost1of every4threads is squashed.We have observed that for this particular benchmark,

Figure5.Speed-up over single thread Figure6.Time breakdown for Mitosis memory dependences for the pro?ling and simulated inputs are signi?cantly different,which result in many memory dependence misspeculations.

Figure5shows the speedups of the Mitosis processor over a superscalar in-order processor with about the same resources as one Mitosis thread unit with no speculative threading.For comparison, we also show the speedup of a more aggressive processor,with twice the amount of resources(func-tional units),twice the superscalar width and out-of-order issue(with no speculative threading),and a processor with perfect?rst-level cache(an aggressive upper limit to the performance of helper threads that target cache misses).

It can be observed that the Mitosis processor achieves an average speed-up close to2.2x over single-threaded execution,whereas the rest of the con?gurations provide much lower performance. Perfect memory achieves a speed-up of just1.23x and the more aggressive out-of-order processor only provides a1.26x speed-up.Some results worth further discussion are the similar performance achieved by the Mitosis processor compared with the outof-order double-wide core in em3d and compared with the perfect memory model in mst.In the former case,ILP is quite abundant in this benchmark,which could be additionally exploited with more complex Mitosis con?gurations (with out-of-order thread units).In mst,the performance of the memory system is quite low(for a single-threaded execution,the miss ratio in L1cache is higher than70%and close to50%in L0). Even so,Mitosis is surprisingly competitive with the perfect memory con?guration(within8%). In summary,we?nd Mitosis mirrors an aggressive superscalar when ILP is high,an unattainable memory subsystem when memory parallelism is high,and outperforms both when neither ILP nor memory parallelism is high.

Figure6shows the time breakdown for the execution of the different benchmarks in the Mitosis processors.As expected,most of the time the thread units are executing useful work(the sum of the non-speculative and the speculative execution).On average,this percentage is almost the80% of the time the thread units are working and higher than90%for bh and em3d.The overhead added by this execution model represents less than20%for these benchmarks.The most signi?cant part of this overhead comes from the wait time.This time stands for the time a thread unit has?nished the execution of a speculative thread but it has to wait until becoming non-speculative to commit. The other components of the overhead are the slice execution,the initialization overhead,and the validation and commit overhead.It is worth noting that the execution of the slices only corresponds to4%.Finally,the top of the bars shows the average time thread units are executing incorrect work, that is,threads that are squashed.This percentage is only8%overall,mostly due to health,where 36

the overhead is almost20%.In this case,most of the squashes are due to memory violations and the cascading effect of the squashing mechanism.Recall,however,that health still maintains a3x speedup despite these squashes.

These results strongly validate the effectiveness of the execution model introduced in this paper (precomputation slice based speculative multithreading)in meeting the Mitosis design goals:high performance,resulting from high parallelism(low wait time),high spawn accuracy(very low squash rates),and low spawn and prediction overhead(very low slice overheads).

6.Conclusions

In this work,we have presented and evaluated the Mitosis framework,which exploits speculative TLP through a novel scheme to deal with interthread data dependences.This model is based on predicting via software the thread live-ins.It does so by means of inserting a piece of code in the binary that speculatively computes the values at the starting point of the speculative thread.This code,referred to as a p-slice,is built from a subset of the code after the spawn instruction and before the beginning of the speculative thread.A key feature of Mitosis processors is that p-slices do not need to be correct,which allows the compiler to use aggressive optimizations when generating them. An ef?cient mechanism to partition the code into speculative threads is presented.This mech-anism looks into the code to detect which parts of the program will provide the highest bene?t, taking into account possible misspeculations,overheads,and load balancing.Moreover,some com-piler optimization techniques have been presented in order to reduce the weight of the slices in the speculative threads.

The key microarchitecture components of Mitosis processors have been presented:(1)hardware support for the spawning,execution,and validation of p-slices allows the compiler to create slices with minimal overhead,(2)a novel multi-version register?le organization supports a uni?ed global register view,multiple versions of register values,transparent communication of register depen-dences across processor cores,all with no signi?cant latency increase over traditional register?les, and(3)a memory system that support multiple versions of memory values,and introduces a novel architecture that pushes values towards threads that are executing future code,to effectively mimic the temporal locality available to a single-threaded processor with a single cache.

Finally,the results obtained by the Mitosis processor with4thread units for a subset of the Olden benchmarks are impressive.It outperforms the single-threaded execution by2.5x,and more than 1.75x compared with a double-size out-of-order processor,and over a perfect memory model.These results con?rm that there are large amounts of TLP on codes that can be extracted by speculative techniques such as those of Mitosis on codes that cannot be parallelized by conventional approaches.

Acknowlegdments

We would like to thank Peter Rundberg(currently at Gridcore,Sweden),Hong Wang and John Shen(at Intel Labs,Santa Clara)and Dean Tullsen(at UC San Diego)for his valuable collaboration in this work.Also,we would like to thank the ORC team for their support in the compiler implemen-tation.This work has been partially supported by the Spanish Ministry of Education and Science under contract TIN2004-03072and Feder funds.

References

[1]H.Akkary and M.A.Driscoll:A Dynamic Multithreading Processor.Procs.of the31st Int.Symp.on

Microarchitecture.1998

[2]R.S.Chappel et al.:Simultaneous Subordinate Microthreading(SSMT).Procs.of the27th Int.Symp.

on Computer Architecture.2000

[3]L.Codrescu and D.Wills:On Dynamic Speculative Thread Partitioning and the MEM-Slicing Algo-

rithm.Procs.of the Int.Conf.on Parallel Architectures and Compilation Techniques.1999

[4]J.D.Collins et al:Speculative Precomputation:Long Range Prefetching of Delinquent Loads.Procs.of

the28th Int.Symp.on Computer Architecture.2001

[5]K.Diekendorff:Compaq Chooses SMT for Alpha.Microprocessor Report(December).1999

[6]M.Franklin and G.S.Sohi:The Expandable Split Window Paradigm for Exploiting Fine Grain Paral-

lelism.Procs.of the19th Int.Symp.on Computer Architecture,pp.58-67,1992.

[7]S.Gopal,T.N.Vijaykumar,J.E.Smith and G.S.Sohi:Speculative Versioning Cache.Procs.of the4th

Int.Symp.on High Performance Computer Architecture.1998

[8]http://ipf-orc/https://www.wendangku.net/doc/4f8223035.html,

[9]V.Krishnan and J.Torrellas:Hardware and Software Support for Speculative Execution of Sequential

binaries on a Chip-Multiprocessor.Int.Conf.on Supercomputing.1998

[10]C.Luk:Toleraing Memory Latency through Software-Controlled Pre-Execution in Simultaneous Mul-

tithreading Processors.Procs.of the28th Int.Symp.on Computer Architecture.2001

[11]P.Marcuello:Speculative Multithreaded Processors.Ph.D.Thesis,Universitat Politecnica de Catalunya.

2003

[12]P.Marcuello,J.Tubella and A.Gonz′a lez:Value Prediction for Speculative Multithreaded Architectures.

Procs.of the32nd.Int.Conf,.on Microarchitecture.1999

[13]P.Marcuello and A.Gonz′a lez:Thread-Spawning Schemes for Speculative Multithreaded Architectures.

Procs.of the8th Int.Symp,on High Performance Computer Architectures.2002

[14]C.Garc′?a et.al.:Mitosis Compiler:an Infrastructure for Speculative Threading Based on Pre-

Computation Slices.Proc.of the Int.Symp.on Programming Language Design and Implementation.

2005

[15]T.Marr et al.:Hyper-threading Technology Architecture and Microarchitecture.Intel technology Jour-

nal,6(1).2002

[16]A.Rogers,M.Carliste,J.Reppy and L.Hendren:Supporting Dynamic Data Structures on Distributed

Memory Machines.ACM Trans on Programming Languages and System(March).1995

[17]A.Roth and G.S.Sohi:Speculative Data-Driven Multithreading.Procs.of the7th.Int.Symp.on High

Performance Computer Architecture.2001

[18]G.S.Sohi,S.E.Breach and T.N.Vijaykumar:Multiscalar Processors.Procs.of the22nd Int.Symp.on

Computer Architecture.1995

[19]J.Steffan and T.Mowry:The Potential of Using Thread-level Data Speculation to Facilitate Automatic

Parallelization.Procs.of the4th Int.Symp.on High Performance Computer Architecture.1998 [20]S.Storino an dJ.Borkenhagen:A Multithreaded64-bit PowerPC Commercial RISC Processor Design.

Procs.Of the11th Int.Conf.on High Performance Chips.1999

[21]J.Y.Tsai and P-C.Yew:The Superthreaded Architecture:Thread Pipelining with Run-Time Data De-

pendence Checking and Control Speculation.Procs.of the Int.Conf.on Parallel Architectures and Com-pilation Techniques.1995

[22]M.Tremblay et al.:The MAJC Architecture,a synthesis of Parallelism and Scalability.IEEE Micro,

20(6).2000

[23]D.M.Tullsen,S.J.Eggers and H.M.Levy:Simultaneous Multithreading:Maximizing On-Chip Paral-

lelism.Procs.of the22nd Int.Symp.on Computer Architecture.1995

[24]C.B.Zilles and G.S.Sohi:Execution-Based Prediction Using Speculative Slices.Procs.of the28th Int.

Symp.on Computer Architecture.2001