ECE 3056: Architecture, Concurrency, and Energy of Computation

Sample Problem Set: Pipelining
1. Consider the following code sequence used often in block copies. This produces a special form of the load hazard and normally introduces one stall cycle.

   lw $7, 0($12)

   sw $7, 0($13)

Show how you can use forwarding to avoid the occurrence of stalls in this situation. Provide your solution by modifying the view of the datapath provided below and precisely state the conditions you would check to implement such forwarding.

(Note: Datapath found in the solution version, as it contains the answer)
2. Consider the following instruction sequence with respect to the pipelined datapath with hazard detection and data forwarding. Assume writes in WB take place in the first half of the cycle and reads take place in the second half of the cycle. Registers $3, $4, and $5 contain 0x11, 0x22 and 0x88 respectively. Branches are assumed not taken with 38% of conditional branches being taken in practice with flushing. Branches are implemented in ID with forwarding.

Cont:

lw $4, 0($3)
slt $5, $4, $2
bne $5, $0, Cont
li $4, 0x0400
addi $4, $4, 0x0004
sw $9, 0($4)
a. If the first instruction starts at location 0x10000000, what are the values of the following variables in the pipeline during cycle 4 assuming the branch is taken?

IF/ID.PC____________________

EX/MEM.aluresult ______

EX/MEM.write.value __________________________

EX/MEM.wreg.addr __________________________

b. For the case of the branch being taken as well as the case of the branch not taken, at what cycle will the last instruction exit the pipeline?
3. Assume branch instructions occur 22% of the time and are predicted as not taken, while in practice they are taken 42% of the time with a penalty of 2 cycles. With forwarding, the load delay slot is one cycle and can be filled 55% of the time with useful instructions. 21% of the instructions are loads and 30% of these introduce load delay hazards.

   a. What is the increase in CPI due to load delay slots and branch hazards? Load slots + branch delay slots

   b. To improve performance, we would like to pipeline the memory accesses. This will reduce the clock cycle time by 20%, but increase the load delay slots to 2 cycles while not affecting branch delays. Is this worthwhile? Provide a quantitative justification for your answer.
4. Consider the following instruction sequence with respect to the pipelined datapath with load hazard detection, data forwarding, and branch prediction (not taken) with flushing and branches are resolved in ID with forwarding. Assume writes in WB take place in the first half of the cycle and reads take place in the second half of the cycle.

<table>
<thead>
<tr>
<th></th>
<th>Instruction</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>add $7, $0, $0</td>
</tr>
<tr>
<td>2</td>
<td>addi $5, $0, 72</td>
</tr>
<tr>
<td>3</td>
<td>addi $3, $0, x10028</td>
</tr>
<tr>
<td>4</td>
<td>loop: lw $4, 0($3)</td>
</tr>
<tr>
<td>5</td>
<td>lw $9, -8($3)</td>
</tr>
<tr>
<td>6</td>
<td>lw $6, -4($3)</td>
</tr>
<tr>
<td>7</td>
<td>add $7, $6, $7</td>
</tr>
<tr>
<td>8</td>
<td>add $7, $9, $7</td>
</tr>
<tr>
<td>9</td>
<td>add $7, $4, $7</td>
</tr>
<tr>
<td>10</td>
<td>addi $3, $3, -12</td>
</tr>
<tr>
<td>11</td>
<td>addi $5, $5, -12</td>
</tr>
<tr>
<td>12</td>
<td>bne $5, $0, loop</td>
</tr>
</tbody>
</table>

a. If the first instruction in the above sequence starts at location 0x400, what are the values of the following variables in the pipeline during cycle 8. The first cycle is cycle 1.

<table>
<thead>
<tr>
<th>Both Forwarding Unit Outputs</th>
<th>Output of the ALU</th>
<th>IF/ID.PC</th>
<th>EX.MEM. RegWrite</th>
<th>ID/EX.Register.Rs</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

The instruction in IF is: ______

b. From the code, you can determine how many times this loop is executed. Furthermore, remember branches are predicted as not taken (with flushing), and full forwarding and load hazard detection. What is the CPI value that you would obtain if you computed this value only over this block of code? Ignore the pipeline setup time and assume there is forwarding to the branch logic in ID.
c. Suppose a page fault occurred on the data reference for instruction 6. State i) the instructions in the pipeline at the time, ii) those which are allowed to complete execution, and iii) at which instruction does execution resume after the page fault has been serviced, i.e., which is the first instruction to be fetched.
5. Assume branch instructions occur 15% of the time and are predicted as not taken, while in practice they are taken 40% of the time with a penalty of 3 cycles. With forwarding, the load delay slot is one cycle and can be filled 60% of the time with useful instructions. 20% of the instructions are loads and 30% of these introduce load delay hazards.

   a. What is the increase in CPI due to load delay slots and branch hazards?

b. To improve performance, we would like to pipeline the ALU. This will reduce the clock cycle time by 15%, but increase the load delay slots to 2 cycles, and the branch delays to 4 cycles. Provide a quantitative basis for determining if this is a good tradeoff.
6. Consider a new high speed SPIM processor design where we have an 8-stage pipeline. Relative to the design in the text, the memory stage has been partitioned into 2 stages and the ALU stage has been partitioned into three stages (hence the new 8 stage pipeline). As a result, the clock cycle time has been reduced to 6 ns.

   a. What is the latency experienced by an instruction and the theoretical maximum throughput of this processor in instructions/second? Ignore hazards.

b. Now we are examining the next generation design, where we will have two such pipelines operating concurrently, i.e., a superscalar design. An instruction fetched from memory may be executed in either pipeline. Memory is fast enough to supply two instructions in each fetch cycle. Repeat (a) for this pipeline.
Consider the state of following program executing within the SPIM pipeline during cycle 6 (start at cycle 1). Assume full hardware support for forwarding, branches predicted as not taken, and flushing when branches are taken and the branch condition is resolved normally in EX. The first instruction is at address 0x400. Assume that the mov instruction is a native instruction.

```plaintext
mov $6, $0
addi $4, $0, 32

loop:
    lw $10, 1000($4)
    blt $10, $6, next -- check if $10 < $6
    mov $10 $6 -- move $6 to $10

next:
    addi $4, $4, -4
    bne $4, $0, loop
```

a. What instructions are in each stage of the pipeline.

b. What are the values of the following fields?

- MEM/WB.ALUData
- EX/MEM.MemToReg
- IF/ID.PC
- ID/EX.ALUOp

<table>
<thead>
<tr>
<th>Field</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>MEM/WB.ALUData</td>
<td>________</td>
</tr>
<tr>
<td>EX/MEM.MemToReg</td>
<td>________</td>
</tr>
<tr>
<td>IF/ID.PC</td>
<td>________</td>
</tr>
<tr>
<td>ID/EX.ALUOp</td>
<td>________</td>
</tr>
</tbody>
</table>

c. Identify all the data and control hazards assuming no forwarding support.

Data Hazards Control Hazards
8. Consider the state of following procedure executing within the SPIM pipeline during cycle 6. Assume that the first instruction is fetched in cycle 0! Assume full hardware support for forwarding, branches predicted as not taken, and flushing when branches are taken. The penalties for hazards are as dictated by the datapath in the figure. Also assume concurrent read/write to the same register in the same cycle. The first instruction is at address 0x00400000. Assume that all instructions are native instructions.

<p>| | |</p>
<table>
<thead>
<tr>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>lw $t0, 0($a0)</td>
</tr>
<tr>
<td>2</td>
<td>add, $a0, $a0, $a1</td>
</tr>
<tr>
<td>3</td>
<td>addi $sp, $sp, -16</td>
</tr>
<tr>
<td>4</td>
<td>sw $fp,4($sp)</td>
</tr>
<tr>
<td>5</td>
<td>addi $fp, $sp, 16</td>
</tr>
<tr>
<td>6</td>
<td>sw $ra, 0($fp)</td>
</tr>
<tr>
<td>7</td>
<td>sw $t0, -4($fp)</td>
</tr>
<tr>
<td>8</td>
<td>addi $a1, $t0, $a0</td>
</tr>
</tbody>
</table>

a. What instructions are in each stage of the pipeline.

<table>
<thead>
<tr>
<th>Stage</th>
<th>Instruction</th>
</tr>
</thead>
<tbody>
<tr>
<td>IF</td>
<td></td>
</tr>
<tr>
<td>ID</td>
<td></td>
</tr>
<tr>
<td>EX</td>
<td></td>
</tr>
<tr>
<td>MEM</td>
<td></td>
</tr>
<tr>
<td>WB</td>
<td></td>
</tr>
</tbody>
</table>

b. What are the values of the following fields? ALL multiplexor inputs are numbered from the top to bottom starting at 0. MuxA is the first one from the top in EX.

<table>
<thead>
<tr>
<th>Field</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>MEM/WB.MemToReg</td>
<td></td>
</tr>
<tr>
<td>ForwardMuxA, ForwardMuxB</td>
<td></td>
</tr>
<tr>
<td>IF/ID.PC</td>
<td></td>
</tr>
<tr>
<td>ID/EX.RegWrite</td>
<td></td>
</tr>
<tr>
<td>EX/MEM.RegWrite</td>
<td></td>
</tr>
<tr>
<td>EX/MEM.WriteRegisterAddr</td>
<td></td>
</tr>
</tbody>
</table>

c. Extensive analysis of the instruction stream of code targeted for the pipelined MIPS with forwarding has produced the following statistics. To support larger caches we have accepted a two cycle access time and therefore the IF and MEM stages have been partitioned into two stages. The alternative is to keep the single cycle hazards but with a clock that is 10% slower. The overall pipeline is now 7 stages. Penalties for load hazards are 2 cycles and those for control hazards are also 2 cycles. We note that 20% of load instructions produce a hazard and 30% of these load delay slots can be filled via
instruction scheduling. Similarly, we can use scheduling can fill 70% of the branch delay slots. Compute the new CPI. If the CPI without additional pipe stages was 1.5, is this design a good idea? Justify your answer.

<table>
<thead>
<tr>
<th>Instruction</th>
<th>Frequency</th>
</tr>
</thead>
<tbody>
<tr>
<td>Loads</td>
<td>22%</td>
</tr>
<tr>
<td>Stores</td>
<td>13%</td>
</tr>
<tr>
<td>ALU Operations</td>
<td>42%</td>
</tr>
<tr>
<td>Branches</td>
<td>18%</td>
</tr>
</tbody>
</table>
9. Extensive analysis of the instruction stream has produced the following statistics. Data hazards incur a 2 cycle penalty while control hazards incur a 3 cycle penalty. You can leave your answers in expression form.

<table>
<thead>
<tr>
<th>Instruction</th>
<th>Frequency</th>
</tr>
</thead>
<tbody>
<tr>
<td>Loads</td>
<td>22%</td>
</tr>
<tr>
<td>Stores</td>
<td>13%</td>
</tr>
<tr>
<td>ALU Operations</td>
<td>42%</td>
</tr>
<tr>
<td>Branches</td>
<td>18%</td>
</tr>
</tbody>
</table>

a. Assuming an ALU or Load/Store instruction has probability of 0.2 of experiencing a data hazard, what is the maximum increase in CPI due to data hazards and control hazards? There is no instruction scheduling and no forwarding.

b. Now assume that you have full hardware support for forwarding, and branches predicted as not taken supported by instruction flushing when branches are taken. Branches are assumed to be not taken and instructions are fetched accordingly. Branches are actually taken 44% of the time. 34% of the load instructions produce a single load delay slot which can be filled 60% of the time. What is the improvement in pipeline CPI from (a)?
c. Rather than using forwarding and hardware support for flushing on branches, suppose that instruction scheduling could successfully fill 70% of the delay slots produced by data and control hazards. Furthermore, the clock cycle time could be reduced by 10% since the multiplexors used for forwarding were on the critical path. Is this option a better one than the use of forwarding? If expressions are not evaluated, show precisely how you can answer this question.
10. Consider the state of following procedure executing within the SPIM pipeline during cycle 8. Assume that the first instruction is fetched in cycle 0! Assume full hardware support for forwarding, branches predicted as not taken, and flushing when branches are taken. The penalties for hazards are as dictated by the datapath in the figure. The first instruction is at address 0x00400000. Assume that all instructions are native instructions.

```
func:   addi $2, $4, $0
       addi $3, $5, $0
       add $3, $3, $2
       addi $9, $0, $0

loop:  lw $22, 0($2)   -- WB
       hazard cycle   -- MEM
       add $9, $9, $22 -- EX
       addi $2, $2, 4  -- ID
       bne $2, $3, loop -- IF
       hazard cycle
       move $2, $9
       jr $31
```

a. What instructions are in each stage of the pipeline.

   IF
   ID
   EX
   MEM
   WB

b. What are the values of the following fields? Mux inputs are numbered from the top to bottom starting at 0. MuxA is the first one from the top in EX.

   MEM/WB.MemToReg
   ForwardMuxA, ForwardMuxB
   IF/ID.PC
   ID/EX.RegWrite
   EX/MEM.RegWrite
   EX/MEM.WriteRegister
c. If there was no forwarding, what is the earliest cycle in which the following instructions could execute for the first time in the stage shown. Remember we start counting cycles at cycle 0!
11. Consider the following code sequence which computes a running sum of an array of integers stored starting at location \texttt{Start} through memory address \texttt{End}:

\begin{verbatim}
add $7, $0, $0
la $4, End
la $3, Start
loop:
lw $6, 0($3)
lw $5, 4($3)
add $7, $6, $7
add $7, $5, $7
addi $3, $3, 8
bne $3, $4, loop
\end{verbatim}

a. Rewrite the code sequence assuming that the pipelined SPIM provides no support for forwarding or branch hazards (branches resolved in ID) and the compiler must ensure correctness by the proper insertion of nops. Assume register file write operations take place in the first half of the cycle and read operations in the second half of the cycle.
b. Now schedule the code in 2(a) to remove as many nops as possible.
12. Consider execution of a MIPS code sequence on a pipelined data path shown on the next page. Consider a 2 GHz MIPS processor with a canonical 5-stage pipeline and 32 general-purpose registers. The branch condition and jump addresses are computed in decode. Branches are assumed not taken. The data path implements forwarding.

a. Show the modifications required to the pipelined datapath overleaf to implement support for the jal instruction.

b. What would be the values of all of the control signals to realize this implementation.

Note that some of these control signal values can be don't care (e.g., RegDst)

<table>
<thead>
<tr>
<th>Instr.</th>
<th>RegDst</th>
<th>AluSrc</th>
<th>MemToReg</th>
<th>RegWrite</th>
<th>MemRead</th>
<th>MemWrite</th>
<th>Branch</th>
<th>AluOp</th>
</tr>
</thead>
<tbody>
<tr>
<td>jal</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
c. Now consider the execution of the following code sequence – specifically the execution of the function func: Why will execution be incorrect in the datapath described on page 2 and what can you do to ensure correct execution? Assume support for the jr instruction is similar to that of the jal instruction and addi and move are supported similar to the add instruction.

```assembly
.data
.start: .word 0x32, 33, 0x34, 35

.text
la $a0, start
addi $a1, $a0, 0x10
jal func
li $v0, 10
syscall

func:   add $t7, $zero, $zero

loop:   lw $t6, 0($a0)
        add $t7, $t7, $t6
        addi $a0, $a0, 4
        bne $a1, $a0, loop
        move $v0, $t7
        jr $31
```
14. We wish to implement an atomic swap instruction – \texttt{swap r1, offset(r2)}. This instruction atomically swaps the contents of a register with a location in memory.

a. What does it mean for the execution of this instruction to be atomic?

b. Using a few instructions, provide a MIPS assembly language implementation of this instruction using the following MIPS instructions.

\begin{enumerate}
\item \texttt{load linked: ll r1, offset(r2)}
\item \texttt{store conditional: sc r1, offset(r2)}
\end{enumerate}

Consider the following block of SPIM code. The text segment starts at 0x00400000 and the data segment starts at 0x10010000. Register \$t1 is initialized to 0x10010000. The remaining registers are initialized to 0. The first three words of memory starting at 0x10010000 contain the values 0x4, 0x8, 0xC. Consider the state of the pipeline with forwarding and hazard detection on cycle 7 (the first cycle is 0!).

\begin{verbatim}
lw \$t3, 8($t1)  
lw \$t0, 0($t1)  
lw \$t2, 4($t1)  
add \$t5, \$t5, \$t2  
add \$t5, \$t5, \$t0  
add \$t5, \$t5, \$t3  
sw \$t5, 12($t1)  
li \$v0, 10  
syscall
\end{verbatim}

List the instruction in each stage of the pipeline and fill in the table below.
IF: __________
ID: __________
EX: __________
MEM: __________
WB: __________

IF/ID.PC4
ID/EX.ForwardA (top mux below)
ID/EX.RegisterRt (contents see below)
EX/MEM.WriteData
MEM/WB.ALUResult
15. Consider the following segment of code. Assume full forwarding and branches resolved in EX and support for the `addi` instruction.

```
loop:   lw $t4, 0($t0)  #fetch array element
       add $t3, $t3, $t4   #update sum
       addi $t0, $t0, 4   #point to next word
       addi $t2, $t2, -1  #decrement count
       bne $t2, $0, loop
```

a. What is the CPI of the above code sequence assuming that the loop executes exactly 4 times? Count all cycles starting with the fetch of the first instruction and until the last instruction exits the pipeline. It should be clear as to exactly how you computed the CPI.

b. Now assuming **no hardware support** for forwarding or hazard detection, reschedule the code using `nops`. Minimize the number of `nops`. Writes take place in the first half of the cycle and reads in the second half.
16. Consider the following instruction dependency that can be caused by copying data.

\[
\begin{align*}
&\text{lw} \ $8, \ 0($9) \\
&\text{sw} \ $8, \ 4($10)
\end{align*}
\]

Modify the pipelined datapath to add forwarding so that this sequence can execute correctly without stall cycles. Describe (functionally) how this should work including values of any new control signals you may add.
17. With forwarding and hazard detection, assume branch instructions occur 10% of the time and are predicted as not taken, while in practice they are taken 40% of the time with a penalty of 3 cycles. With forwarding, the load delay slot is one cycle and can be filled 40% of the time with useful instructions. 25% of the instructions are loads and 30% of these introduce load delay hazards. You may leave your answers in expression form.

a. What is the increase in CPI due to load delay slots and branch hazards?

b. Since memory is much slower than the core, to improve performance, we would like to pipeline all stages that access memories by making them take 3 cycles instead of 1. This will reduce the clock cycle time by 15%. Provide a quantitative basis for determining if this is a good tradeoff.
18. Consider two threads spawned to execute a function on different data sets and sharing the pipeline as shown below – only a segment of the function code is shown. The first instruction of thread #1 is fetched in the first cycle (cycle 0) and instructions from each thread are interleaved on an instruction-by-instruction basis thereafter. All registers are initialized to 0 except for $t0 which is initialized to 0x1001000 for thread #1 and 0x10010024 for thread #2 (initialization not shown). The first instruction (the load) is stored at 0x40040048.

```
Thread #1
lw $t3, 0($t0)
add $t2, $t2, $t3
addi $t0, $t0, 4
addi $t1, $t1, -1
bne $t1, $zero, loop
......
```

```
Thread #2
lw $t3, 0($t0)
add $t2, $t2, $t3
addi $t0, $t0, 4
addi $t1, $t1, -1
bne $t1, $zero, loop
```

a. What are the contents of the following during cycle 9 (the first cycle is 0!)?

Thread #2, register $t0:

ID/EX.PC (include thread ID):

b. Distinguish between fine grain and coarse grain multithreading.