CSC 8400-001

Examination 3

May 1, 2000



PROBLEM 1 [15%]. Assume that we make an enhancement to a computer that improves a certain mode M of execution by a factor of 20. M is used 5% of the time, measured as a percentage of the execution time when the enhanced mode is in use. What is the speedup we have obtained from fast mode?



PROBLEM 2 [15%]. What are the 3 types of data hazards in pipelines? Briefly explain/characterize each of them. Which of these hazards can occur in our DLX pipeline and why?



PROBLEM 3 [15%]. The ideal speedup from pipelining equals the pipeline depth. However, the real speedup would usually be lower that the ideal speedup. One of the reasons for that is pipeline hazards. List 3 other reasons.



PROBLEM 4 [15%]. Assume we have two pipelined DLX machines, M1 and M2. M2 uses delayed branch scheme with a branch delay of length 2, while M1 uses the regular (no delayed branch) semantics. Hence, there are no stalls in M2 due to control hazards, while every branch instruction causes a 2-cycle stall in M1.

Consider the following code C1 for M1:

Loop: ADDI R1, R1, #- 1

SW R3, 0(R2)

ADD R2, R2, R2

BNEZ R1, Loop

ADD R3, R1, R2

Write an equivalent code C2 for M2. C2 must run faster on M2 than C1 runs on M1.

"Equivalent" means that if initially the two machines have identical contents of registers and memory, then so would they have after running C1 on M1 and C2 on M2.



PROBLEM 5 [15%].

a) [5%] Why do fully associative caches generally have lower miss rates than direct mapped caches?

b) [10%] However, in some special cases, a direct mapped cache can have a lower miss rate than a fully associative cache of the same size.

Consider a cache with 2 block frames, where the block size is 1 word. Write a DLX code whose miss rate would be lower if this cache is direct mapped rather than fully associative with LRU policy.

What are the two corresponding miss rates for your code?

PROBLEM 6 [25%]. Assume the virtual memory size is 4GB, and the physical memory size is 1GB. Pages are 64 KB, and cache blocks are 8 bytes. We have an instruction-only cache, where the miss penalty is 60 cycles. It is 2-way set associative, uses LRU and has 2 sets.

The TLB is fully associative, uses LRU, and has (only) 2 entries. We assume that every TLB entry holds only a virtual page number and the corresponding physical page number. There are no dirty-bit, valid-bit or other fields. The TLB miss penalty is 40 cycles.

For safety/simplicity, assume there is no pipelining. Every instruction takes 5 clock cycles to be executed when there are no misses in the TLB or the cache (in other words, the base CPI, or CPIexecution, is 5 cycles).

a) (2%) How many pages are there in virtual memory?

b) (2%) How many blocks are there in a page?

c) (5%) What is the size of the TLB in bits (i.e. how many bits of storage does the TLB need)?

d) (16%) Consider the following DLX code:

T1: ADDI R1, R0, #1

T2: ADD R1, R1, R1

T3: ADDI R1, R1, #-1

T4: BEQZ R1, T6

T5: BNEZ R1, T3

... (some other instructions)

T6: BEQZ R1, T7

... (some other instructions)

T7: ADD R1, R1, R1

Assume that the whole code is in main memory when it starts running, so that there are no page faults. Assume also that the virtual addresses of these instructions are as follows:

T1: 0000 0000 0000 0000 0000 0000 0000 0000

T2: 0000 0000 0000 0000 0000 0000 0000 0100

T3: 0000 0000 0000 0000 0000 0000 0000 1000

T4: 0000 0000 0000 0000 0000 0000 0000 1100

T5: 0000 0000 0000 0000 0000 0000 0001 0000

T6: 0000 0000 0000 0000 1111 0000 0000 0000

T7: 0000 0000 0000 0001 0000 0000 0000 0000

If both the cache and the TLB are initially empty, what would be the execution time (in clock cycles) for this code?

In answering this question, do the following:

1) (4%) Write down the instruction sequence that will be executed when the code runs;

2) (4%) For each instruction in this sequence, show the contents of the cache and the TLB immediately before and immediately after the instruction is executed (for a TLB entry, just indicate which virtual page number it holds);

3) (4%) Based on (2), for each instruction, indicate if it generates a cache and/or TLB miss;

4) (4%) Using (3), count the total number of clock cycles needed, taking into account the information about the base CPI and the cache and TLB miss penalties.