8. Pipelining

How Pipelining Works
Pipelining, a standard feature in RISC processors, is much like an assembly line. Because the processor works on different steps of the instruction at the same time, more instructions can be executed in a shorter period of time.

A useful method of demonstrating this is the laundry analogy. Let's say that there are four loads of dirty laundry that need to be washed, dried, and folded. We could put the first load in the washer for 30 minutes, dry it for 40 minutes, and then take 20 minutes to fold the clothes. Then pick up the second load and wash, dry, and fold, and repeat for the third and fourth loads. Supposing we started at 6 PM and worked as efficiently as possible, we would still be doing laundry until midnight.

non-pipelined laundry

Source: http://www.ece.arizona.edu/~ece462/Lec03-pipe/

However, a smarter approach to the problem would be to put the second load of dirty laundry into the washer after the first was already clean and whirling happily in the dryer. Then, while the first load was being folded, the second load would dry, and a third load could be added to the pipeline of laundry. Using this method, the laundry would be finished by 9:30.

pipelined laundry

Source: http://www.ece.arizona.edu/~ece462/Lec03-pipe/

Watch a movie of pipelining in action! (Source: http://www.inf.fh-dortmund.de/person/prof/si/risc/intro_to_risc/irt0_index.html)

RISC Pipelines
A RISC processor pipeline operates in much the same way, although the stages in the pipeline are different. While different processors have different numbers of steps, they are basically variations of these five, used in the MIPS R3000 processor:

  1. fetch instructions from memory

  2. read registers and decode the instruction

  3. execute the instruction or calculate an address

  4. access an operand in data memory

  5. write the result into a register

If you glance back at the diagram of the laundry pipeline, you'll notice that although the washer finishes in half an hour, the dryer takes an extra ten minutes, and thus the wet clothes must wait ten minutes for the dryer to free up. The speed of the pipeline is therefore limited by its slowest step: every stage must be allotted as much time as the longest one takes. Because RISC instructions are simpler than those used in pre-RISC processors (now called CISC, or Complex Instruction Set Computer), they are more conducive to pipelining. While CISC instructions vary in length, RISC instructions are all the same length and can be fetched in a single operation. Ideally, each of the stages in a RISC processor pipeline should take one clock cycle, so that the processor finishes an instruction each clock cycle and averages one cycle per instruction (a CPI of 1).
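To make the arithmetic concrete, here is a minimal Python sketch of the cycle counts under this idealized model (every stage takes exactly one clock cycle, no stalls). The function names and the five-stage default are only for illustration.

def sequential_cycles(num_instructions, num_stages=5):
    # Non-pipelined: each instruction runs all stages before the next starts.
    return num_instructions * num_stages

def pipelined_cycles(num_instructions, num_stages=5):
    # Pipelined: the first instruction needs num_stages cycles, then one
    # instruction completes every cycle.
    return num_stages + (num_instructions - 1)

for m in (4, 100, 1_000_000):
    seq, pipe = sequential_cycles(m), pipelined_cycles(m)
    print(f"{m:>9} instructions: {seq:>9} vs {pipe:>9} cycles (CPI = {pipe / m:.3f})")

As the instruction count grows, the CPI of the pipelined machine approaches the ideal of 1, just as described above.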

Pipeline Problems
In practice, however, RISC processors operate at more than one cycle per instruction. The processor might occasionally stall as a result of data dependencies and branch instructions.

A data dependency occurs when an instruction depends on the result of a previous instruction. A particular instruction might need data in a register that has not yet been written, because producing that data is the job of a preceding instruction which has not yet reached the register-writing step of the pipeline.

For example:

add $r3, $r2, $r1
add $r5, $r4, $r3
# ... more instructions that are independent of the first two

In this example, the first instruction tells the processor to add the contents of registers r1 and r2 and store the result in register r3. The second instructs it to add r3 and r4 and store the sum in r5. We place this set of instructions in a pipeline. When the second instruction is in the second stage, the processor will be attempting to read r3 and r4 from the registers. Remember, though, that the first instruction is just one step ahead of the second, so the contents of r1 and r2 are being added, but the result has not yet been written into register r3. The second instruction therefore cannot read from register r3 because it hasn't been written yet, and must wait until the data it needs is stored. Consequently, the pipeline is stalled and a number of empty instructions (known as bubbles) go into the pipeline. Data dependency affects long pipelines more than shorter ones, since it takes a longer period of time for an instruction to reach the final register-writing stage of a long pipeline.

MIPS' solution to this problem is code reordering. If, as in the example above, the following instructions have nothing to do with the first two, the code could be rearranged so that those instructions are executed in between the two dependent instructions and the pipeline could flow efficiently. The task of code reordering is generally left to the compiler, which recognizes data dependencies and attempts to minimize performance stalls.
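The sketch below illustrates, in Python rather than real compiler internals, the kind of check a scheduler might perform: instructions are modeled as (destination, source, source) tuples, an adjacent read-after-write pair is found, and a later, independent instruction is moved in between. The tuple format, register names, and safety rules are simplifications invented for this example.

def reads(instr, reg):
    return reg in instr[1:]

def safe_filler(filler, producer, consumer):
    # Toy rules: the filler must not need the producer's result (or it would
    # stall too), must not depend on the consumer, and must not write a
    # register the consumer reads or writes.
    return (not reads(filler, producer[0])
            and not reads(filler, consumer[0])
            and filler[0] not in consumer[1:]
            and filler[0] != consumer[0])

def fill_bubbles(code):
    # Whenever an instruction immediately needs its predecessor's result,
    # try to pull a later, independent instruction in between them.
    code = list(code)
    i = 0
    while i + 1 < len(code):
        producer, consumer = code[i], code[i + 1]
        if reads(consumer, producer[0]):
            for j in range(i + 2, len(code)):
                if safe_filler(code[j], producer, consumer):
                    code.insert(i + 1, code.pop(j))
                    break
        i += 1
    return code

program = [("r3", "r2", "r1"),    # add $r3, $r2, $r1
           ("r5", "r4", "r3"),    # add $r5, $r4, $r3  <- needs r3 from the previous instruction
           ("r8", "r7", "r6"),    # independent
           ("r11", "r10", "r9")]  # independent
print(fill_bubbles(program))      # one independent add now separates the dependent pair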

Branch instructions are those that tell the processor to make a decision about what the next instruction to be executed should be based on the results of another instruction. Branch instructions can be troublesome in a pipeline if a branch is conditional on the results of an instruction which has not yet finished its path through the pipeline.

For example:

Loop:

add $r3, $r2, $r1
sub $r6, $r5, $r4
beq $r3, $r6, Loop

The example above instructs the processor to add r1 and r2 and put the result in r3, then subtract r4 from r5, storing the difference in r6. In the third instruction, beq stands for branch if equal. If the contents of r3 and r6 are equal, the processor should branch back to the instruction labeled "Loop"; otherwise, it should continue to the next instruction. In this example, the processor cannot make a decision about which branch to take because neither the value of r3 nor that of r6 has been written into the registers yet.

The processor could stall, but a more sophisticated method of dealing with branch instructions is branch prediction. The processor makes a guess about which path to take; if the guess is wrong, anything written into the registers must be cleared, and the pipeline must be started again with the correct instruction. Some methods of branch prediction depend on stereotypical behavior. Branches pointing backward are taken about 90% of the time, since backward-pointing branches are often found at the bottom of loops. On the other hand, branches pointing forward are only taken approximately 50% of the time. Thus, it would be logical for processors to always follow the branch when it points backward, but not when it points forward. Other methods of branch prediction are less static: processors that use dynamic prediction keep a history for each branch and use it to predict future branches. These processors are correct in their predictions 90% of the time.
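As an illustration of dynamic prediction, the following Python sketch keeps a table of 2-bit saturating counters indexed by branch address, one common textbook way of keeping a history for each branch; it is not the predictor of any particular processor.

class TwoBitPredictor:
    def __init__(self):
        self.counters = {}                     # branch address -> counter in 0..3

    def predict(self, branch_addr):
        # 0 or 1 means predict not taken; 2 or 3 means predict taken.
        return self.counters.get(branch_addr, 1) >= 2

    def update(self, branch_addr, taken):
        c = self.counters.get(branch_addr, 1)
        self.counters[branch_addr] = min(3, c + 1) if taken else max(0, c - 1)

# A backward loop branch that is taken nine times and then falls through:
predictor = TwoBitPredictor()
outcomes = [True] * 9 + [False]
correct = 0
for taken in outcomes:
    correct += (predictor.predict(0x400) == taken)
    predictor.update(0x400, taken)
print(f"{correct} of {len(outcomes)} predictions correct")
# The mostly-taken backward branch is mispredicted only while warming up and at loop exit.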

Still other processors forgo the entire branch prediction ordeal. The RISC System/6000 fetches and starts decoding instructions from both sides of the branch. When it determines which branch should be followed, it then sends the correct instructions down the pipeline to be executed.

Pipelining Developments
In order to make processors even faster, various methods of optimizing pipelines have been devised.

Superpipelining refers to dividing the pipeline into more steps. The more pipe stages there are, the faster the pipeline can be clocked, because each stage then does less work. Ideally, a pipeline with five stages should be five times faster than a non-pipelined processor (or rather, a pipeline with one stage). The instructions are executed at the speed at which each stage is completed, and each stage takes one fifth of the amount of time that the non-pipelined instruction takes. Thus, a processor with an 8-step pipeline (the MIPS R4000) will be even faster than its 5-step counterpart. The MIPS R4000 chops its pipeline into more pieces by dividing some steps into two. Instruction fetching, for example, is now done in two stages rather than one. The stages are as shown:

  1. Instruction Fetch (First Half)

  2. Instruction Fetch (Second Half)

  3. Register Fetch

  4. Instruction Execute

  5. Data Cache Access (First Half)

  6. Data Cache Access (Second Half)

  7. Tag Check

  8. Write Back

Superscalar pipelining involves multiple pipelines in parallel. Internal components of the processor are replicated so it can launch multiple instructions in some or all of its pipeline stages. The RISC System/6000 has a forked pipeline with different paths for floating-point and integer instructions. If there is a mixture of both types in a program, the processor can keep both forks running simultaneously. Both types of instructions share two initial stages (Instruction Fetch and Instruction Dispatch) before they fork. Often, however, superscalar pipelining refers to multiple copies of all pipeline stages (in terms of laundry, this would mean four washers, four dryers, and four people who fold clothes). Many of today's machines attempt to find two to six instructions that they can execute in every pipeline stage. If some of the instructions are dependent, however, only the first instruction or instructions are issued.
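Here is a toy Python sketch of the issue decision in a 2-way superscalar pipeline: two instructions leave the issue stage together only if the second one does not read the first one's destination register. The (dest, src1, src2) tuple encoding is the same invented format used in the reordering sketch above, and the rule is deliberately oversimplified.

def can_dual_issue(first, second):
    return first[0] not in second[1:]          # no read-after-write inside the pair

def issue(code):
    # Group instructions into issue slots of one or two per cycle.
    cycles, i = [], 0
    while i < len(code):
        if i + 1 < len(code) and can_dual_issue(code[i], code[i + 1]):
            cycles.append([code[i], code[i + 1]])
            i += 2
        else:
            cycles.append([code[i]])           # dependent pair: only the first issues
            i += 1
    return cycles

program = [("r3", "r2", "r1"),
           ("r5", "r4", "r3"),                 # needs r3, so it cannot pair with the previous add
           ("r8", "r7", "r6"),
           ("r11", "r10", "r9")]
for cycle, group in enumerate(issue(program), 1):
    print(f"cycle {cycle}: {group}")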

Dynamic pipelines have the capability to schedule around stalls. A dynamic pipeline is divided into three units: the instruction fetch and decode unit, five to ten execute or functional units, and a commit unit. Each execute unit has reservation stations, which act as buffers and hold the operands and operations.

dynamic pipelining

While the functional units have the freedom to execute out of order, the instruction fetch/decode and commit units must operate in-order to maintain simple pipeline behavior. When the instruction is executed and the result is calculated, the commit unit decides when it is safe to store the result. If a stall occurs, the processor can schedule other instructions to be executed until the stall is resolved. This, coupled with the efficiency of multiple units executing instructions simultaneously, makes a dynamic pipeline an attractive alternative.

 

 

Instruction-Level Parallelism (ILP)



Instead of waiting until an instruction has completed all five stages of our model machine, we could start a new instruction as soon as the first instruction has cleared stage 1. Notice that we can now have five instructions progressing through our "pipeline" at the same time. Essentially, we're processing five instructions in parallel, referred to as "Instruction-Level Parallelism (ILP)". If it took five clock cycles to completely execute an instruction before we pipelined the machine, we're now able to execute a new instruction every single clock. We made our computer five times faster, just with this "simple" change.

Suddenly, memory fetches have to occur five times faster than before. This implies that the memory system and cache must now run five times as fast, even though each instruction still takes five cycles to completely execute.

We've also made a huge assumption that each stage was taking exactly the same amount of time, since that's the rule that our pipeline clock is enforcing. What about the assumption that the processor was even going to run the next four instructions in that order? We (usually) won't even know until the execute stage whether we need to branch to some other instruction address. Hey, what would happen if the sequence of instructions called for the processor to load some data from memory and then try to perform a math operation using that data in the next instruction? The math operation would likely be delayed, due to memory latency slowing down the process.


What we're describing are called "pipeline hazards", and their effects can get really ugly. There are three types of hazards that can cause our pipeline to come to a screeching halt--or cause nasty errors if we don't put in extra hardware to detect them. The first hazard is a "data hazard", such as the problem of trying to use data before it's available (a "data dependency"). Another type is a "control hazard" where the pipeline contains instructions that come after a branch. A "structural hazard" is caused by resource conflicts where an instruction sequence can cause multiple instructions to need the same processor resource during a given clock cycle. We'd have a structural hazard if we tried to use the same memory port for both instructions and data.


There are ways to reduce the chances of a pipeline hazard occurring, and we'll discuss some of the ways that CPU architects deal with the various cases. In a practical sense, there will always be some hazards that will cause the pipeline to stall. One way to describe the situation is to say that an instruction will "block" part of the pipe (something modern implementations help minimize). When the pipe stalls, every (blocked) instruction behind the stalled stage will have to wait, while the instructions fetched earlier can continue on their way. This opens up a gap (a "pipeline bubble") between blocked instructions and the instructions proceeding down the pipeline in front of the blocked instructions.

When the blocked instruction restarts, the bubble will continue down the pipeline. For some hazards, like the control hazard caused by a (mispredicted) branch instruction, the following instructions in the pipeline need to be killed, since they aren't supposed to execute. If the branch target address isn't in the instruction cache, the pipeline can stall for a large number of clock cycles. The stall would be extended by the latency of accesses to the L2 cache or, worse, accesses to main memory. Stalls due to branches are a serious problem, and this is one of the two major areas where designers have focused their energy (and transistor budget). The other major area, not surprisingly, is when the pipeline goes to memory to load data. Most of our analysis will focus in on these 2 latency-induced problems.


For some data hazards, one commonly-used solution is to forward result data from a completed instruction straight to another instruction yet to execute in the pipeline (data "forwarding", though sometimes called "bypassing"). This is much faster than writing out the data and forcing the other instruction to read it back in. Our case of a math operation needing data from a previous memory load instruction would seem to be a good candidate for this technique. The data loaded from memory into a register can also be forwarded straight to the ALU execute stage, instead of going all the way through the register write-back stage. An instruction in the write-back stage could forward data straight to an instruction in the execute stage.
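A rough Python sketch of that idea follows, assuming the five-stage model above; the register-file dictionary and the (destination, value) pair standing in for the write-back pipeline register are invented for the illustration.

def ex_operand(reg, regfile, writeback):
    # writeback is (dest_reg, value) for the instruction currently in the WB
    # stage, or None if that stage is empty.
    if writeback is not None and writeback[0] == reg:
        return writeback[1]        # forwarded: skip the register-file round trip
    return regfile[reg]            # no hazard: read the register file as usual

regfile = {"$r3": 0, "$r4": 10}
writeback = ("$r3", 12)            # result for $r3 computed but not yet written back
print(ex_operand("$r3", regfile, writeback))   # 12 via forwarding, not the stale 0
print(ex_operand("$r4", regfile, writeback))   # 10, straight from the register file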

Why wait 2 cycles? Why not forward straight from the data access stage? In reality, the data load stage is far from instantaneous and suffers from the same memory latency risk as instruction fetches. The figure below shows how this can occur. What if the data is not in the cache? There would be a huge pipeline bubble. As it turns out, data access is even more challenging than an instruction fetch, since we don't know the memory address until we've calculated the Effective Address. While instructions are usually accessed sequentially, allowing several cache lines to be prefetched from the instruction cache (and main memory) into a fast local buffer near the execution core, data accesses don't always have such nice "locality of reference".

(Figure: pipeline stage diagram by clock number)

 

The Limits of Pipelining
If five stages made us run up to five times faster, why not chop up the work into a bunch more stages? Who cares about pipeline hazards when it gives the marketing folks some really high peak performance numbers to brag about? Well, every x86 processor we'll analyze has a lot more than five stages. Originally called "super-pipelining" until Intel (for no obvious reason) decided to rename it "hyper-pipelining" in their Pentium 4 design, this technique breaks up various processing stages into multiple clock cycles.

This also has the architectural benefit of giving better granularity to operations, so there should be fewer cases where a fast operation waits around while slow operations throttle the clock rate. With some of the clever design techniques we'll examine, the pipeline hazards can be managed, and clock rates can be cranked into the stratosphere. The real limit isn't an architectural issue, but is related to the way digital circuits clock data between pipeline stages.

To pipeline an operation, each new stage of the pipeline must store information passed to it from a prior stage, since each stage will (usually) contain information for a different instruction. This staged data is held in a storage device (usually a "latch"). As you chop up a task into smaller and smaller pipeline stages, the overhead time it takes to clock data into the latch ("set-up and hold" times and allowance for clock "skew" between circuits) becomes a significant percentage of the entire clock period. At some point, there is no time left in the clock cycle to do any real work. There are some exotic circuit tricks that can help, but they would burn a lot of power - not a good trade-off for chips that already exceed 70 watts in some cases.

So now, if pipelining is so great, why not just make a 500-stage pipeline, and call it a day? Unfortunately, there are some issues that prevent that. The biggest issue is cost and complexity--a 500-stage pipeline is a little excessive for today's technology. Just like you'd need a factory that could house 500 automobile workers instead of 4, you'd need a die 500 times larger than our unpipelined processor, and that's too expensive, and yields would be too poor using today's technology. And there are other issues, as well...

Often, one instruction depends on the result of a previous instruction. For example, instruction 2 may be different, depending on the result of instruction 1. Consider the following example:

Instruction 1 - Determine the value of c, given c = a + b.
Instruction 2 - If c is greater than 4, multiply the value of c by 2. d = 2c.
OR
Instruction 2 - If c is less than 4, multiply the value of c by -2. d = -2c.

It has been estimated that about 10-20% of a typical set of x86 code contains branches like those above. In that situation, we cannot execute instruction 2 without knowing the result of instruction 1. So the Fetch Unit, for example, would indeed end up waiting a few cycles for instruction 1 to finish, before it begins with instruction 2. That would equate to wasted clock cycles, and would drive the efficiency down, which is unacceptable. So what's the solution?

One of the possible solutions is called Branch Prediction. Essentially, the processor makes an educated guess as to what it expects the result will be, and proceeds under that assumption. Using the case above, it may predict, for example, that the result from instruction 1 will be c = 3, and go ahead with instruction 2 assuming that c is less than 4. Since most code generally uses somewhat repetitive loops, Branch Prediction isn't exceedingly difficult. In fact, today's Prediction units are able to predict the appropriate path correctly over 90% of the time.

But what happens 10% of the time when the prediction is wrong?

(There are other ways to deal with dependency as well, such as Out of Order execution, but we'll save that for a later date.)

 


Suppose instructions 2, 3, 4, and so on were each dependent on the result from instruction 1, and were executed under a predicted result. Suppose that prediction was incorrect. Instruction 2 was then based on incorrect data and must be repeated with the correct data, as must instructions 3, 4, and so on. This is called Flushing the Pipeline. In effect, the processor must discard all the data in the pipeline and start over, since all of the data in the pipeline was calculated based on a false assumption. The clock cycles that were spent processing those instructions have been wasted.

In a 2-stage pipeline, only half of one other instruction was carried out incorrectly and flushed, so the clock cycles that were wasted weren't excessive. Now consider a 10-stage pipeline. Because instruction 1 takes 10 clock cycles to complete, the processor has been proceeding under a false assumption (misprediction) for nine clock cycles, all of which have now been wasted! Likewise, with a 20-stage pipeline (like Willamette's), if a misprediction occurs, 19 clock cycles have been wasted!

To give you a small idea of the impact, it has been estimated that the less-than-10% of mispredicted branches slow the performance of Intel's Pentium III by anywhere from 20-40%. Considering that only about 10-20% of the instructions are branched to begin with, that means mispredictions occur only about once in every 50-100 instructions (10% of the 10-20% of branched instructions), on average. Restated, if those one in fifty to one-hundred instructions were predicted correctly, the processor would perform 20-40% faster. So you can appreciate the importance of good prediction algorithms.
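The arithmetic behind those figures can be checked in a few lines; the fractions below are the rough estimates quoted above, not measurements of any specific chip.

branch_fraction = 0.15        # roughly 10-20% of instructions are branches
mispredict_rate = 0.10        # predictors are right about 90% of the time

mispredicts_per_instr = branch_fraction * mispredict_rate
print(f"about 1 misprediction every {1 / mispredicts_per_instr:.0f} instructions")

# The cost of each misprediction grows with pipeline depth (roughly the
# number of stages that must be flushed and refilled):
for stages in (2, 10, 20):
    wasted_per_instr = mispredicts_per_instr * (stages - 1)
    print(f"{stages:>2}-stage pipeline: CPI rises from 1.00 to {1 + wasted_per_instr:.2f}")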

Furthermore, you can now see that, if we attempted to implement a 500-stage pipeline, the performance hit from relatively few branch mispredictions would become enormous.

 

 

 

In computers, a pipeline is the continuous and somewhat overlapped movement of instructions to the processor, or of the arithmetic steps taken by the processor to perform an instruction. Pipelining is the use of a pipeline. Without a pipeline, a computer processor gets the first instruction from memory, performs the operation it calls for, and then goes to get the next instruction from memory, and so forth. While fetching (getting) the instruction, the arithmetic part of the processor is idle; it must wait until it gets the next instruction. With pipelining, the computer architecture allows the next instructions to be fetched while the processor is performing arithmetic operations, holding them in a buffer close to the processor until each instruction operation can be performed. The staging of instruction fetching is continuous. The result is an increase in the number of instructions that can be performed during a given time period.

Pipelining is sometimes compared to a manufacturing assembly line in which different parts of a product are being assembled at the same time although ultimately there may be some parts that have to be assembled before others are. Even if there is some sequential dependency, the overall process can take advantage of those operations that can proceed concurrently.

Computer processor pipelining is sometimes divided into an instruction pipeline and an arithmetic pipeline. The instruction pipeline represents the stages in which an instruction is moved through the processor, including its being fetched, perhaps buffered, and then executed. The arithmetic pipeline represents the parts of an arithmetic operation that can be broken down and overlapped as they are performed.

Pipelines and pipelining also apply to computer memory controllers and moving data through various memory staging places.

A pipeline is a series of stages, where some work is done at each stage. The work is not finished until it has passed through all stages. Pipelining is an implementation technique in which multiple instructions are overlapped in execution. Today, pipelining is key to making processors fast. A pipeline is like an assembly line: in both, each step completes one piece of the whole job. Workers on a car assembly line perform small tasks, such as installing seat covers. The power of the assembly line comes from the fact that many cars are completed per day. On a well-balanced assembly line, a new car exits the line in the time it takes to perform one of the many steps. Note that the assembly line does not reduce the time it takes to complete an individual car; it increases the number of cars being built simultaneously and thus the rate at which the cars are started and completed. There are two types of pipelines: an instruction pipeline, in which the different stages of instruction fetch and execution are handled in a pipeline, and an arithmetic pipeline, in which the different stages of an arithmetic operation are handled along the stages of a pipeline.

There are two disadvantages of pipeline architecture. The first is complexity. The second is the inability to continuously run the pipeline at full speed, i.e. the pipeline stalls. There are many reasons why a pipeline cannot run at full speed. There are phenomena called pipeline hazards, which disrupt the smooth execution of the pipeline. The resulting delays in the pipeline flow are called bubbles. These pipeline hazards include the data, control (branch), and structural hazards described earlier.

 

These issues can be, and are, successfully dealt with. But detecting and avoiding the hazards leads to a considerable increase in hardware complexity. The control paths controlling the gating between stages can contain more circuit levels than the data paths being controlled [AND67, p 9]. In 1970, this complexity was one of the reasons that led Foster to call pipelining still controversial [FOS76, p253-256].

 

The one major idea that is still controversial is "instruction look-ahead" [pipelining]...

Why then the controversy? First, there is a considerable increase in hardware complexity. The second problem arises when a branch instruction comes along: it is impossible to know in advance of execution which path the program is going to take and, if the machine guesses wrong, all the partially processed instructions in the pipeline are useless and must be replaced. In the second edition of Foster's book, published in 1976, this passage was gone. Apparently, Foster felt that pipelining was no longer controversial.

  

In retrospect, most of Myers' book Advances in Computer Architecture dealt with his concepts for improvements in computer architecture that would be termed CISC today. With the benefit of hindsight, we can see that pipelining is here today and that most of the new CPUs are in the RISC class. In fact, Myers is one of the co-architects of Intel's series of 32-bit RISC microprocessors, which is fully pipelined [MYE88]. I suspect that Myers no longer considers pipelining a step backwards.

  

By the mid-60s, the importance of pipelining in increasing machine performance was clear to IBM. An article in the January 1967 IBM Journal [AND67, p 8] presents the organizational philosophy utilized in IBM's highest performance computer, the System/360 Model 91. The first section of the paper deals with the development of the assembly-line processing approach adopted for the Model 91.

The assembly-line approach (pipelining) had become the first topic when explaining why IBM's fastest machine was fast. The goal for the 360/91 was a performance increase of 10 to 100 times that of the 7090. Circuit and hardware advances could only provide a fourfold increase, so organizational techniques would have to provide the remaining improvement. The authors state that the primary organizational objective for a high-performance CPU is concurrency -- the parallel execution of different instructions. [...] it is desired to "overlay" the separate instruction functions to the greatest degree possible.

 

The 360/91 CPU loads the instruction fetch buffers with eight double words. This pre-fetch hides the instruction access time for straight-line (no-branching) programs. In order to minimize the disruption of the flow of instructions when a branch is encountered, the CPU fetches two double words down the target path of the branch as a hedge against taking the branch. Maximum performance is achieved when the branch is not taken.

 

With smooth-flowing instruction streams where the pipeline can be kept full, programs can execute at 100 times the speed of the 7090, about 25 times the performance increase due to faster hardware. For applications that do not flow smoothly, performance diminishes to about 10 times the speed of the 7090 ([AND67], p12). This is still 2.5 times better than the performance increase due to faster hardware.

Performance is improved in the IBM 360/91 by a factor of 2.5 to 25 due to pipelining. Pipelining is facilitated by the division of floating-point operations and fixed-point operations into separate units. As long as no dependencies exist, execution is not necessarily in the order in which the instructions were programmed.

The fixed-point unit executes instructions serially. The floating-point unit has additional concurrency; here again, independent instructions may be executed out of order. Interrupts are a major bottleneck to performance in the pipelined architecture. Strict adherence to a sequential flow of instructions would reduce performance. Another approach would be to set aside sufficient information to properly recover from any interrupt which might occur. This approach was too complex and costly for the designers. The solution was to compromise the architecture by allowing the CPU to continue to execute instructions in the pipeline after the interrupt, even if one of the instructions being executed caused the interrupt. This is termed the "imprecise interrupt".

Interrupts associated with an instruction which can be uncovered during instruction decoding are precise. Only those interrupts which result from address, storage, or execution exceptions are imprecise. Considered as a whole, the architecture of the 360/91 was a major step forward. The concepts outlined in [AND67] are frequently cited in subsequent works on pipelining; Hennessy and Patterson [HEN94], for instance, restate these concepts. Many of the ideas in the 360/91 faded from use for nearly 25 years before being broadly employed in the 1990s.

Motorola 68060

The Motorola 68060 is a fully pipelined super-scalar processor. The 68060 allows simultaneous execution of two integer instructions (or 1 integer and 1 float instruction) and one branch during each clock cycle. A branch cache allows most branches to execute in zero cycles. It contains a 4-stage instruction fetch pipeline, and two 6-stage pipelines for the primary operand execution and the secondary operand execution.


IFP (Instruction Fetch Pipeline) stages

  1. IAG - Instruction Address Generation

  2. IC - Instruction cache access

  3. IED - Instruction Early Decode

  4. IB - Instruction Buffer

 

OEP (Operand Execution Pipeline) stages

  1. DS - Decode and Select instructions

  2. AG - Address Generation of the operand

  3. OC - Operand Cache access

  4. EX - Execute

  5. DA - Data Available

  6. ST - Store

 

Intel Pentium

 

The Pentium is a super-scalar fully pipelined microprocessor. A super-scalar processor has the ability to process more than one instruction per clock cycle. The Pentium has two execution pipes (U and V) so it is a super-scalar level 2 processor.

 

The Pentium has dual internal caches, for both code and data, and dual TLBs. The TLB, or Translation Lookaside Buffer, caches translations from virtual page numbers to physical page numbers. This facilitates efficient handling of the pipeline.

 

The Pentium pre-fetches as much as 32 bytes of instructions. It employs branch prediction, a technique that attempts to infer the proper next instruction address, in order to keep the pipeline from stalling. In the Pentium, a super-scalar level 2 processor, two instructions are fetched and decoded simultaneously.

 

Instruction pipeline

  1. PF - Fetch and Align instruction

  2. D1 - Decode Instruction, Generate Control Word

  3. D2 - Decode Control Word, Generate memory address

  4. E - Access data cache or Calculate ALU Result

  5. WB - Write Result

 

Floating-Point Pipeline

  1. PF - Pre-fetch

  2. D1 - First Decode

  3. D2 - Second Decode

  4. E - Operand Fetch

  5. X1 - First Execute

  6. X2 - Second Execute

  7. WF - Write Float

  8. ER - Error Reporting

 

 

 

Part of the excitement in working in the computer industry is the rate of progress. In November 1995, Intel released the P6, the Pentium Pro microprocessor.

 

The P6 (Pentium Pro) is a superscalar level 3 processor. It has three pipelines and is capable of executing three integer instructions per clock cycle. Furthermore, it employs speculative execution to predict program flow and to execute instructions ahead of their normal execution sequence.

 

The results of this speculative execution are stored in the Re-Order Buffer (ROB), so that they may be discarded if program flow changes, or retired, that is, committed, if the instruction results in the ROB are accepted.

 

If the P6 encounters a data dependency, i.e. a data hazard as previously discussed, it will speculatively execute the next following instruction that does not have a data dependency. For example, assume the next four instructions to be executed are at addresses 1, 2, 3, and 4, and furthermore, that instruction 3 depends on a result that is not yet available. The P6 will execute instructions 1, 2, and 4 in a single clock cycle. The results of instruction 4 will not be retired until after instruction 3 has executed. Dynamic Execution is the process of utilizing branch prediction, analyzing for data dependencies, and utilizing speculative execution.

 

What are the results of this advancement in pipelining? In one example shown by Intel, the Pentium completes 17 instructions in 19 cycles, less than one instruction per cycle. When the Pentium has the necessary data, it does very well, but often it stalls. The Pentium Pro, in contrast, completes the same 17 instructions in 9 clock cycles, slightly less than half the time. In addition, another 10 instructions have been speculatively executed and are awaiting retirement. In this example, the Pentium Pro is over twice as fast as the Pentium (both CPUs at the same clock rate).

 

What does the future hold? Increased instruction-level parallelism. ILP is already found in the 68060 and Pentium chips. Other processors have even more; for instance, the IBM RS/6000 series has four pipelines. The gap between CPU speed and memory speed will continue to widen. Some memory designs are becoming pipelined themselves. The sooner the CPU can notify the memory of an anticipated access, the quicker the data can be retrieved. Increased pipelining will be the major component in making computers ever faster. Pipelining is a technique that exploits parallelism among the instructions in a sequential instruction stream. It has the substantial advantage that, unlike some speedup techniques, it is fundamentally invisible to the programmer.

 

Control (Branch) Hazards

Control hazards are the most difficult types of hazards arising from normal operation of a program. In the next section, we will see that exceptions (e.g., overflow) can play particularly interesting havoc with smooth pipeline execution.

The most common type of control hazard is the branch instruction, which has two alternative results: (1) jump to the branch target address if the branch succeeds, or (2) execute the instruction after the branch (at PC+4 of instruction memory) if the branch fails.

The problem with the branch instruction is that we usually do not know which result will occur (i.e., whether or not the branch will be taken) until the branch condition is computed. Often, the branch condition depends on the result of the preceding instruction, so we cannot precompute the branch condition to find out whether or not the branch will be taken.

The following four strategies are employed in resolving control dependencies due to branch instructions.

Assume Branch Not Taken. As we saw previously, we can insert stalls until we find out whether or not the branch is taken. However, this slows pipeline execution unacceptably. A common alternative to stalling is to continue execution of the instruction stream as though the branch was not taken. The intervening instructions between the branch and its target are then executed. If the branch is not taken, this is not a harmful or disruptive technique. However, if the branch is taken, then we must discard the results of the instructions executed after the branch statement. This is done by flushing the IF, ID, and EX stages of the pipeline for the discarded instructions. Execution continues uninterrupted after the branch target.

The cost of this technique is approximately equal to the cost of discarding instructions. For example, if branches are not taken 50 percent of the time, and the cost of discarding results is negligible, then this technique reduces by 50 percent the cost of control hazards.

Dynamic Branch Prediction. It would be useful to be able to predict whether the majority of branches are taken or not taken. This can be done in software, using intelligent compilers, and can also be done at runtime. We concentrate on the software-intensive techniques first, since they are less expensive to implement (being closer to the compiler, which is easier to modify than the hardware).

The most advantageous situation is one where the branch condition does not depend on instructions immediately preceding it, as shown in the following code fragment:

     add $5, $5, $6     # One of the registers for beq comparison is modified
     sub $4, $3, $6     # Nothing important to the branch here
     and $7, $8, $6     # Nothing important to the branch here
     and $9, $6, $6     # Nothing important to the branch here
     beq $5, $6, target

Here, the branch compares Registers 5 and 6, which are last modified in the add instruction. We can therefore precompute the branch condition as sub r, $5, $6, where r denotes a destination register. If r = 0, then we know the branch will be taken, and the runtime module (pipeline loader) can schedule the jump to the branch target address with full confidence that the branch will be taken.

Another approach is to keep a history of branch statements, and to record the addresses to which these statements branch. Since the vast majority of branches are used as tests of loop indices, we know that the branch will almost always jump to the loopback point. If the branch fails, then we know the loop is finished, and this happens only once per loop. Since most loops are designed to have many iterations, branch failure occurs less frequently in loops than does taking the branch.

Thus, it makes good sense to assume that a branch will jump to the place that it jumped to before. However, in dense decision structures (e.g., nested or cascaded if statements), this situation does not always occur. In such cases, one might not be able to tell from the preceding branch whether or not the branching behavior will be repeated. It is then reasonable to use a multi-branch lookahead.

Another clever technique of making branches more efficient is the branch delay slot.

Exceptions as Hazards

Hardware and software must work together in any architecture, especially in a pipeline processor. Here, the ISA and processor control must be designed so that the following steps occur when an exception is detected:

1.      Hardware detects an exception (e.g., overflow in the ALU) and stops the offending instruction at the EX stage.

2.      Pipeline loader and scheduler allow all prior instructions (e.g., those already in the pipeline in MEM and WB) to complete.

3.      All instructions that are present in the pipeline after the exception is detected are flushed from the pipeline.

4.      The address of the offending instruction (usually the address in main memory) is saved in the EPC register, and a code describing the exception is saved in the Cause register.

5.      Hardware control branches to the exception handling routine (part of the operating system).

6.      The exception handler performs one of three actions: (1) notify the user of the exception (e.g., divide-by-zero or arithmetic-overflow) then terminate the program; (2) try to correct or mitigate the exception then restart the offending instruction; or (3) if the exception is a benign interrupt (e.g., an I/O request), then save the program/pipeline state, service the interrupt request, then restart the program at the instruction pointed to by EPC + 4.

In any case, the pipeline is flushed as described.

In general, we can say that, if a pipeline has N segments, and the EX stage is at segment i, where 1 < i < N, then two observations are key to the prediction of pipeline performance:

· Flushing negates the processing of the (i-1) instructions following the offending instruction. These must be reloaded into the pipe, at the cost of i cycles (one cycle to flush, i-1 cycles to reload the i-1 instructions after the exception is processed).

· Completing the N-i instructions that were loaded into the pipeline prior to the offending instruction takes N-i clock cycles, which are executed (a) prior to, or (b) concurrently with, the reloading of the i-1 instructions that followed the i-th instruction (in the EX stage).

It is readily seen that the total number of wasted cycles equals (i-1) + (N-i) = N - 1, which is precisely the number of cycles that it takes to set up or reload the pipeline.

The proliferation of unproductive cycles can be mitigated by the following technique:

1.      Freeze the pipeline state as soon as an exception is detected.

2.      Process the exception via the exception handler, and decide whether or not to halt or restart the pipeline.

3.      If the pipeline is restarted, reload the (i-1) instructions following the offending instruction, concurrently with completing execution of the (N-i) instructions that were being processed prior to the offending instruction.

If Step 3 can be performed as stated, then the best-case penalty is only one cycle, plus the time incurred by executing the exception handler. If the entire pipeline needs to be flushed and restarted, then the worst-case penalty is N cycles incurred by flushing the pipe, then reloading the pipeline after the instructions preceding the offending instruction have been executed. If the offending instruction must be restarted, then a maximum of i cycles are lost (one cycle for flush, plus (i-1) cycles to restart the instructions in the pipe following the offending instruction).

In the next section, we collect the concepts about pipeline performance that we have been discussing, and show how to compute the CPI for a pipeline processor under constraint of stalls, structural hazards, branch penalties, and exception penalties.

Pipeline Performance Analysis

As we said early on in this course, we are trying to teach the technique of performance analysis, which helps one to intelligently determine whether or not a given processor is suitable computationally for a specific application. In this section, we develop performance equations for a pipeline processor, and do so in a stepwise way, so you can see how the various hazards and penalties affect performance.

CPI of a Pipeline Processor

Suppose an N-segment pipeline processes M instructions without stalls or penalties. We know that it takes N-1 cycles to load (setup) the pipeline, and M cycles to complete the instructions. Thus, the number of cycles is given by:

Ncyc = N + M - 1 .

The cycles per instruction are easily computed:

CPI = Ncyc/M = 1 + (N - 1)/M .

Thus, CPI for a finite program will always be greater than one. This stands in sharp contradiction to the first fallacy of pipeline processing, which says:

Fallacy #1: CPI of a pipeline processor is always equal to 1.0, since one instruction is processed per cycle.

This statement is fallacious because it ignores the overhead that we have just discussed. The fallacy is similar to claiming that you only spend eight hours at the office each day, so you must have 16 hours per day of "fun time". However, you have to take time to commute to/from the office, buy groceries, and do all the other homely tasks of life, many of which are in no way related to "fun time". In practice, such tasks are drudgery, a type of overhead.

Effect of Stalls

Now let us add some stalls to the pipeline processing scheme. Suppose that we have a N-segment pipeline processing M instructions, and we must insert K stalls to resolve data dependencies. This means that the pipeline now has a setup penalty of N-1 cycles, as before, a stall penalty of K cycles, and a processing cost (as before) of M cycles to process the M instructions. Thus, our governing equations become:

Ncyc = N + M + K - 1 .

and

CPI = Ncyc/M = 1 + (N + K - 1)/M .

In practice, what does this tell us? Namely, that the stall penalty (and all the other penalties that we will examine) adversely impacts CPI. Here is an example showing how we would analyze the problem of stalls in a pipelined program where the percentage of instructions that incur stalls is specified.

Example. Suppose that an N-segment pipeline executes M instructions, and that a fraction fstall of the instructions require the insertion of K stalls per instruction to resolve data dependencies. The total number of stalls is given by fstall · M · K (the fraction of instructions that incur stalls, times the total number of instructions, times the number of stalls per affected instruction). By substitution, our preceding equations for pipeline performance become:

Ncyc = N + M + (fstall · M · K) - 1 .

and

CPI = Ncyc/M = 1 + (fstall · K) + (N - 1)/M .

So, the CPI penalty due to the combined effects of setup cost and stalls now increases to fstall · K + (N - 1)/M. If fstall = 0.1, K = 3, N = 5, and M = 100, then CPI = 1 + 0.3 + 4/100 = 1.34, which is 34 percent larger than the fallacious assumption of CPI = 1.
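The same calculation can be wrapped in a small Python helper using the symbols above (N segments, M instructions, a fraction fstall of which each incur K stalls):

def cpi_with_stalls(N, M, f_stall, K):
    n_cyc = N + M + f_stall * M * K - 1
    return n_cyc / M

print(f"{cpi_with_stalls(N=5, M=100, f_stall=0.1, K=3):.2f}")   # 1.34, as computed above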

This leads to the next fallacy of pipeline processing:

Fallacy #2: Stalls are not a big problem with pipelines - you only have to worry about the number of stalls, not the percentage of instructions that induce stalls.

This fallacy is particularly dangerous. It is analogous to saying that the only thing that matters is the number of home burglaries each year, not the burglary rate per capita. If you move to a new neighborhood, you want to know both the number and per-capita incidence of crimes in that neighborhood, not just the robbery count. Then you can determine, from the population, whether or not it is safe to live there.

Similarly, with a pipeline processor, you want to determine whether or not the instruction mix or ordering of instructions causes data dependencies, and what the incidence of such dependencies is. For example, a 1,000-instruction program with 20 stalls will run more efficiently than a 1,000-instruction program in which 20 percent of the instructions require one stall each to resolve dependencies.

Effect of Exceptions

For purposes of discussion, assume that we have M instructions executing on an N-segment pipeline with no stalls, but that a fraction fex of the instructions raise an exception in the EX stage. Further assume that each exception requires that (a) the pipeline segments before the EX stage be flushed, (b) that the exception be handled, requiring an average of H cycles per exception, then that (c) the instruction causing the exception and its following instructions be reloaded into the pipeline.

Thus, fex · M instructions will cause exceptions. In the MIPS pipeline, each of these instructions causes three instructions to be flushed out of the pipe (IF, ID, and EX stages), which incurs a penalty of four cycles (one cycle to flush, and three to reload) plus H cycles to handle the exception. Thus, the pipeline performance equations become:

Ncyc = N - 1 + (1 - fex) · M + (fex · M · (H + 4)) ,

which we can rewrite as

Ncyc = M + [N - 1 - M + (1 - fex) · M + (fex · M · (H + 4))] .

Rearranging terms, the equation for CPI can be expressed as

CPI = Ncyc/M = 1 + [1 - fex + (fex · (H+4)) - 1 + (N - 1)/M] .

After combining terms, this becomes:

CPI = Ncyc/M = 1 + [(fex · (H+3)) + (N - 1)/M] .

We can see by examination of this equation and the expression for CPI due to stalls that exceptions have a more detrimental effect, for two reasons. First, the overhead for stalls (K stalls per affected instruction) is K < 4 cycles in the MIPS pipeline (since the pipeline has only five segments). Second, the cost of each exception is H+3 cycles per affected instruction. Since H > 0 for a nontrivial exception handler, the cost of an exception in the MIPS pipeline (under the preceding assumptions) will exceed the cost of remedying a hazard using stalls. The good news, however, is that there are usually fewer exceptions in programs than data or structural dependencies, with the exception of I/O-intensive programs (many I/O interrupts) and arithmetic-intensive programs (possible overflow or underflow exceptions).
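A corresponding helper for the exception formula, with an illustrative (not measured) choice of handler cost H:

def cpi_with_exceptions(N, M, f_ex, H):
    return 1 + f_ex * (H + 3) + (N - 1) / M

# 1% of instructions raise exceptions, each handled in 20 cycles:
print(f"{cpi_with_exceptions(N=5, M=100, f_ex=0.01, H=20):.2f}")   # 1.27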

Effect of Branches

Branches present a more complex picture in pipeline performance analysis. Recall that there are three ways of dealing with a branch: (1) assume the branch is not taken, and if the branch is taken, flush the instructions in the pipe after the branch, then insert the instruction pointed to by the BTA; (2) the converse of (1); and (3) use a delayed branch with a branch delay slot and re-ordering of code (assuming that this can be done).

The first two cases are symmetric. Assume that an error in branch prediction (i.e., taking the branch when you expected not to, and conversely) requires L instructions to be flushed from the pipeline (one cycle for flushing plus L-1 "dead" cycles, since the branch target can be inserted in the IF stage). Thus, the cost of each branch prediction error is L cycles. Further assume that a fraction fbr of the instructions are branches and a fraction fbe of these instructions result in branch prediction errors.

The penalty in cycles for branch prediction errors is thus given by

branch_penalty = fbr · fbe · M instructions · L cycles per instruction .

The pipeline performance equations then become:

Ncyc = N - 1 + (1 - fbr · fbe) · M + (fbr · fbe · M · L) ,

which we can rewrite as

Ncyc = M + [N - 1 - M + (1 - fbr · fbe) · M + (fbr · fbe · M · L)] ,

Rearranging terms, the equation for CPI can be expressed as

CPI = Ncyc/M = 1 + [(1 - fbr · fbe) + (fbr · fbe · L) - 1 + (N - 1)/M] .

After combining terms, this becomes:

CPI = Ncyc/M = 1 + [(fbr · fbe · (L-1)) + (N - 1)/M] .

In the case of the branch delay slot, we assume that the branch target address is computed and the branch condition is evaluated at the ID stage. Thus, if the branch prediction is correct, there is no penalty. Depending on the method by which the pipeline evaluates the branch and fetches (or pre-fetches) the branch target, a maximum of two cycles penalty (one cycle for flushing, one cycle for fetching and inserting the branch target) is incurred for insertion of a stall in the case of a branch prediction error. In this case, the pipeline performance equations become:

Ncyc = N - 1 + (1 - fbr · fbe) · M + (fbr · fbe · 2M) ,

which implies the following equation for CPI as a function of branches and branch prediction errors:

CPI = Ncyc/M = 1 + [fbr · fbe + (N - 1)/M] .

Since fbr << 1 is usual, and fbe is, on average, assumed to be no worse than 0.5, the product fbr · fbe, which represents the additional branch penalty for CPI in the presence of delayed branch and BDS, is generally small.
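For completeness, the two branch-penalty formulas can be expressed the same way; the parameter values are illustrative only and are not taken from any particular CPU.

def cpi_flush(N, M, f_br, f_be, L):
    # Mispredicted branches flush L instructions from the pipe.
    return 1 + f_br * f_be * (L - 1) + (N - 1) / M

def cpi_delay_slot(N, M, f_br, f_be):
    # Branch resolved in ID with a branch delay slot: at most one extra cycle.
    return 1 + f_br * f_be + (N - 1) / M

params = dict(N=5, M=100, f_br=0.15, f_be=0.10)
print(f"{cpi_flush(L=3, **params):.3f}  {cpi_delay_slot(**params):.3f}")   # 1.070 vs 1.055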