CSC 8400-001

Examination E1

February 21, 2000

 

 

 

PROBLEM 1 [15%]

In this problem, we are considering enhancing a non-pipelined machine by adding some extra special-purpose registers and an extra ALU. The enhancement will decrease the execution time of an average ALU instruction by a factor of 1.4 (1.4 times), increase (careful!) the execution time of an average data transfer instruction by a factor of 1.2, and will not affect the execution time of any other types of instructions. Assume before these changes were made, the machine spent 20% of the time executing ALU instructions and 25% of the time executing data transfer instructions. What would be the speedup from this enhancement?

 

 

 

PROBLEM 2 [15%]

 

In our original DLX’ pipeline (Fig. 3.4), branch penalty was 3 cycles. Figure 3.22 shows a modification of that pipeline which reduces branch penalty to 1 cycle. However, this modification creates a new data hazard (nothing comes for free in this life!). Write a DLX’ code fragment (you need only 2 instructions) which requires a data hazard stall in the modified pipeline while it wouldn’t require any stalls in the old pipeline. Assume that both pipelines use data forwarding wherever possible. Briefly justify your answer.

 

 

 

 

PROBLEM 3 [15%]

 

Describe, in words, a code for which caller saving would work better than callee saving. Then describe a code for which callee saving would work better.

 

 

 

 

PROBLEM 4 [15%]

 

What are the 3 types of pipeline hazards? For each of them, list as many ways as you know to eliminate hazards of that type or to reduce the number of stall cycles caused by those hazards.

 

 

 

PROBLEM 5 [20%]

 

Translate the following fragment of a high-level language code into DLX’:

i = 10;

{While i ¹ 0 do {B[i]=B[i]+5; i=i-1;} };

i=B[12];

Assume i is a 32-bit integer stored at memory location 1000, and B is an array of 32-bit integers stored at the memory location 1004.

  1. What is the size of your code in bytes?
  2. What is the instruction count for your code?
  3. How many (instruction and/or data) bytes will be read from memory?
  4. How many bytes will be written in memory?

e) What is the CPI for your code if it runs on the non-pipelined machine of Figure 3.1?

 

 

 

PROBLEM 6 [20%]

a) Consider the following DLX code:

BEQZ R1, Target

ADD R2, R2, R2

SUB R5, R2, R5

BNEZ R2, Target

ADDI R3, R3, #5

Target: ADD R2,R3,R3

How many clock cycles would this code take in the pipeline of Figure 3.22, assuming that the latter uses forwarding whenever possible, if none of the branches is taken?

  1. Assume we implement the delayed branch scheme in this pipeline, with the delay length of 1. To prevent incorrect execution of the code, we must insert a NOP instruction in every delay slot. Here is the code from subquestion (a) with this modification:

ADD R1, R7,R8

BEQZ R1, Target

NOP

ADD R2, R2, R2

SUB R5, R2, R5

BNEZ R2, Target

NOP

ADDI R3, R3, #5

Target: ADD R2,R3,R3

Re-arrange this code by filling the delay slots (i.e. replacing the NOP instructions) with something "useful" so that the code runs faster than in subquestion (a). For both delay slots, indicate which of the three methods (from before, from target or from fall through) you used.