CSC 8400: Homework Assignments

Homework will be assigned each week and will be collected at the beginning of the class the following week. Late homework will be penalized by 50%, regardless of the reason; no homework will be accepted more than one week late.

Academic integrity: You may discuss and clarify homework questions with your classmates, but you must write out the solutions on your own. That means you should never copy homework or allow someone to copy yours. It is your responsibility to be familiar with the Villanova University Academic Integrity Policy and Procedures.


Grading criteria:

  1. Correct approach, correct answer -- 100%
  2. Basically correct approach but incorrect answer due to some minor error -- 75%
  3. Not quite correct but still meaningful approach, half-way to the right solution -- 50%
  4. Some progress towards solution -- 25%
  5. No work, or no progress towards solution -- 0%.



HOMEWORK 1
Sep 3 / Sep 5

PROBLEM 1[36%]:

Assume that we make an enhancement to a computer that improves a certain mode M of execution.

a) [4%] Suppose M is used 40% of the time. What is the limit for the overall speedup that can be achieved by making M faster (and faster)?

b) [4%] Suppose, again, that M is used 40% of the time, and the enhancement improves M by a factor of 6. What is the speedup we have obtained from the enhancement?

c) [4%] Suppose M is used 75% of the time. What should be the speedup for the enhanced mode to achieve the overall performance improvement by a factor of 2?

d) [4%] Suppose the enhancement improves M by a factor of 4. What % of the time should M be used in order to obtain the overall performance improvement by a factor of 2?

e) [20%] Assume the enhancement improves M by a factor of 5. M is used 10% of the time, measured as a percentage of the execution time when the enhanced mode is in use. What percentage of the original execution time has been converted to fast mode?

Explanation to (e): What we know is that 10% of the new time (the execution time after the enhancement is made) is spent in mode M, and that before the enhancement was made, the time spent in M was 5 times longer. What we need to find is: what % of the old time (the execution time before the enhancement was made) was spent in M? Hint: Trying to apply Amdahl's law directly may take you nowhere.
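For parts (a)-(d), Amdahl's law applies directly. A minimal sketch of the formula, using made-up fractions and factors rather than the ones in the problem:

```python
# Amdahl's law: if a mode occupies fraction f of execution time and is
# sped up by a factor s, overall speedup = 1 / ((1 - f) + f / s).
# The numbers below are illustrative only, not those of Problem 1.
def amdahl_speedup(f, s):
    """Overall speedup when fraction f of the time is sped up by factor s."""
    return 1.0 / ((1.0 - f) + f / s)

# With f = 0.5 and s = 2, half the time is halved: speedup = 1 / 0.75
print(amdahl_speedup(0.5, 2))
# As s grows without bound, the speedup approaches the limit 1 / (1 - f):
print(amdahl_speedup(0.5, 1e9))
```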

PROBLEM 2[32%]:

Assume that at any particular time the machine runs in one of three modes: M1, M2 or M3. The modes do not overlap, in the sense that the machine cannot run in two different modes simultaneously. In particular, 50% of the time the machine runs in M1, 40% of the time in M2, and 10% of the time in M3. We made an enhancement that speeds up M1 by a factor of 2, speeds up M2 by a factor of 3, and slows down M3 by a factor of 4. What is the overall speedup from the enhancement?
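The single-mode formula generalizes to several non-overlapping modes: the new time is the sum of each time fraction divided by its speedup factor (a slowdown is a factor below 1). A sketch with made-up numbers, not the ones in Problem 2:

```python
# Generalized Amdahl's law for non-overlapping modes: fracs are the
# fractions of old execution time, factors the per-mode speedups.
# Illustrative numbers only, not those of Problem 2.
def overall_speedup(fracs, factors):
    new_time = sum(f / s for f, s in zip(fracs, factors))
    return 1.0 / new_time

# Example: 60% of the time sped up 2x, 30% unchanged, 10% slowed down 2x:
print(overall_speedup([0.6, 0.3, 0.1], [2, 1, 0.5]))
```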



PROBLEM 3[32%]:

a) [16%] Exercise 1.1(a)

b) [16%] Exercise 1.1(d)


HOMEWORK 2
Sep 10 / Sep 12

PROBLEM 1[50%] (modified Exercise 2.2):

Several researchers have suggested that switching from a load-store (register-register) architecture to a register-memory architecture might be useful. The idea is to replace sequences of

Load R1, B
Add R2,R2,R1

by

Add R2, B

The latter is the only new type of instruction that will be added to the old instruction set. Assume the new instruction will cause the clock cycle to increase by 10% (1.1 times) and the CPI to increase by 2%.
Use the instruction frequencies for the gcc benchmark on the load-store machine from Figure 2.26, and find what percentage of the loads must be eliminated for the machine with the new instruction to have at least the same performance.


Hints / Explanation:
Adding the new instruction will allow us to eliminate some (but not all!) Load instructions: old programs can be rewritten by replacing some instruction pairs of the type

Load R1, B
Add R2,R2,R1

by

Add R2, B

We don't know what percentage of all load instructions will be eliminated. This depends on the program and on how smart the compiler is. You need to find the minimum percentage of loads that must be eliminated to compensate for the longer clock cycle and the higher CPI.
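One way to set up the break-even inequality, under the assumption that the 2% CPI increase applies to the whole new instruction mix; the load frequency used below is a placeholder, not the Figure 2.26 value:

```python
# Execution time = IC * CPI * clock cycle time.  Each eliminated Load also
# removes one instruction (a two-instruction pair becomes one), so if x is
# the fraction of loads eliminated and f_load the load frequency, IC shrinks
# by x * f_load while CPI grows 2% and the clock cycle 10%.  Break-even:
#   (1 - x * f_load) * 1.02 * 1.10 <= 1
def breakeven_load_fraction(f_load, cpi_growth=1.02, cycle_growth=1.10):
    """Smallest x for which the new machine is at least as fast."""
    return (1.0 - 1.0 / (cpi_growth * cycle_growth)) / f_load

# f_load = 0.25 is a placeholder; substitute the gcc load frequency
# from Figure 2.26 in your actual solution.
print(breakeven_load_fraction(0.25))   # about 0.435
```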

PROBLEM 2[50%]:
Exercise 2.3 with the following modification: instead of the code given in the exercise, use the following:

Students who submit a solution for the code from the textbook will get 0 points.



Hints / Explanation:
"Instruction bytes fetched" here means nothing but the (machine) code size in bytes [warning: this is true only for codes that have no control (branch, jump) instructions].
"Memory-data bytes transferred" means the number of data bytes that will move between (to and from) memory and the CPU when the code runs. In particular, these data bytes are transferred every time we read or write the variables A,B,C and D.




HOMEWORK 3
Sep 17 / Sep 19

PROBLEM 1[60%] (Modified Exercise 2.6): Consider the following fragment of C code:

for (i=1; i<=100; i++)
{A[i] = C - (A[i] + B[i] - 6);}

Assume that A and B are arrays of 32-bit integers, and C and i are 32-bit integers. A, B and C are in memory at the addresses 200, 4000 and 6000, respectively (the address of an array is the address of its first element). Treat i as a temporary value that does not have to be saved in memory. Also, unlike Exercise 2.6, DO NOT assume that the values in registers are lost between iterations of the loop.

a) [40%] Write the code for DLX'.
b) [5%] What is the size of your code in bytes?
c) [5%] What is the instruction count for your code?
d) [5%] How many memory-data references will be executed?
e) [5%] What will be the total memory traffic, i.e., how many (instruction and/or data) bytes will be transferred to and/or from memory while the code runs?

PROBLEM 2[40%]: Exercise 2.10.
Explanation:

  • "The percentage of all memory accesses for data" means the number of memory accesses for data divided by the number of all memory accesses (for data + instructions)
  • "The percentage of data accesses that are reads" means the number of data read accesses divided by the number of all data accesses
  • "The percentage of all memory accesses that are reads" means the number of all read (data+instructions) accesses divided by the number of all memory accesses.




HOMEWORK 4
Oct 1 / Oct 3

PROBLEM 1[40%] (this is a problem from one of the past examinations, posted on the web, with minor modifications)
Remember that stores and branches take 4 clock cycles on the unpipelined (integer) DLX machine, and all the other instructions take 5 clock cycles. Assume for our benchmark programs, stores are 5% of all instructions and branches are 10% of all instructions.

When we pipeline this machine, we have 5 pipe stages and each stage takes 1 clock cycle. Assume that the clock cycle time does not change - it is the same for both the unpipelined and the pipelined machine.

a) What is the CPI for the unpipelined machine?
b) What is the speedup from pipelining if we assume that there are no stalls?
c) Assume each branch instruction causes a 3-cycle stall, and no other instructions cause any stalls. What would be the speedup from pipelining in that case?
d) Would your answers to (b) and (c) be the same or different if the pipeline had not 5, but 10 stages, assuming that all the other conditions are the same? Note that "all the other conditions are the same" implies that each pipe stage, again, takes 1 clock cycle, and the clock cycle time of the pipelined machine is, again, the same as that of the unpipelined machine. If you say "the same", justify your answer in a few words. If you say "different", then re-answer (b) and (c) for the case when we have 10 pipe stages.
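Since the clock cycle time is the same for both machines here, the speedup from pipelining reduces to a ratio of CPIs. A sketch with illustrative numbers, not the ones from Problem 1:

```python
# Speedup = (CPI_unpipelined * cycle) / (CPI_pipelined * cycle); with equal
# cycle times the cycles cancel.  The pipelined CPI is 1 plus the average
# number of stall cycles per instruction.  Numbers below are made up.
def pipeline_speedup(cpi_unpipelined, stalls_per_instruction=0.0):
    cpi_pipelined = 1.0 + stalls_per_instruction
    return cpi_unpipelined / cpi_pipelined

print(pipeline_speedup(4.0))        # no stalls: speedup equals the old CPI
print(pipeline_speedup(4.0, 0.4))   # e.g. 20% branches * 2-cycle stall each
```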


PROBLEM 2[30%] (This, too, is a problem from one of the past examinations, with some numbers changed)
We are comparing two pipelined machines: M1 and M2. The pipe stage for both machines is 1 clock cycle, and its length is 30 ns for M1 and 20 ns for M2. M1 has no stalls, while M2 has (only) control hazard stalls. In particular, in M2 every taken branch causes a 1-cycle stall, and every branch that is not taken causes a 3-cycle stall. We also know that 20% of all instructions are branches and 70% of the branches are taken.
a) What are the CPIs for the two machines?
b) Which of the two machines is faster and how many times faster?
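The accounting behind this kind of comparison can be sketched as follows, with made-up numbers rather than Problem 2's: a machine's CPI is 1 plus its branch frequency times the expected stall per branch, and its average instruction time is CPI times the pipe-stage length.

```python
# CPI with control-hazard stalls.  All numbers below are illustrative,
# not those of Problem 2.
def branch_cpi(f_branch, p_taken, stall_taken, stall_not_taken):
    """CPI = 1 + branch frequency * expected stall cycles per branch."""
    return 1.0 + f_branch * (p_taken * stall_taken
                             + (1.0 - p_taken) * stall_not_taken)

# Illustrative machine: 25% branches, 60% taken, 2-cycle/1-cycle stalls.
cpi = branch_cpi(0.25, 0.6, 2, 1)
print(cpi)        # 1 + 0.25 * 1.6
print(cpi * 25)   # average instruction time in ns, at a 25 ns pipe stage
```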

PROBLEM 3[30%] Exercise 3.4.


HOMEWORK 5
Oct 8 / Oct 10

NOTE: You also have an in-class test on this day. Closed books/notes.

This homework consists of a (slightly modified) problem from one of the past examinations.


PROBLEM 1 [100%] Assume for our benchmark for DLX' we have the following instruction frequencies:
1. ADD --- 20%
2. ADDI --- 15%
3. SUB --- 10%
4. SW --- 10%
5. LW --- 20%
6. BEQZ --- 15%
7. BNEZ --- 10%

We also know the following:
  • 40% of the load instructions are immediately followed by an instruction that uses, as one of the source registers, the register loaded by the load instruction;
  • 70% of all branches (equally for BEQZ and BNEZ) are taken.

a) What would be the CPI for the non-pipelined machine of Figure 3.1?

b) What would be the CPI for the pipelined machine of Figure 3.4, assuming that the latter uses forwarding whenever possible, and does not use any branch penalty reduction schemes?

c) What would be the CPI for the pipelined machine of Figure 3.4, assuming that the latter uses forwarding whenever possible, and also uses the "predict not taken" scheme?

d) The same question as in (c), only for the machine of Figure 3.22.
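Whatever the machine, the CPI is the frequency-weighted average of per-instruction cycle counts, with any stall cycles per instruction added on top. A sketch with made-up frequencies and cycle counts, not those of the benchmark above:

```python
# Frequency-weighted CPI.  Frequencies and cycle counts below are
# illustrative, not those of the DLX' benchmark in Problem 1.
def weighted_cpi(freqs, cycles):
    """Weighted average cycles per instruction."""
    assert abs(sum(freqs) - 1.0) < 1e-9, "frequencies must sum to 1"
    return sum(f * c for f, c in zip(freqs, cycles))

# Three illustrative instruction classes taking 4, 5 and 5 cycles:
print(weighted_cpi([0.5, 0.3, 0.2], [4, 5, 5]))   # 4.5
```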


HOMEWORK 6
Oct 29 / Oct 31

PROBLEM 1[100%]. Assume the following:

  • The memory size is 256 MB;
  • The block size is 2 KB;
  • The cache has 1024 block frames in it, numbered 0,1,...,1023;
  • The cache is 4-way set associative;
  • The machine uses byte addressing.

Let W be the memory word whose address is

0011 1100 0011 0000 0000 1111 1000

a) How many blocks are there in memory?

b) How many sets does the cache have?

c) What is the number of the memory block in which W is sitting?

d) If W is in the cache, in which set would it be? Which block frames does that set have?

e) How many different blocks map to set #1?

f) How far is word W from the beginning of its block, that is, how many words are there to the left of W in the block of W?

g) Write the address of an arbitrary word which is in memory block #18.
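Questions (a)-(g) all come down to splitting a byte address: the low log2(block size) bits are the offset within the block, the block number modulo the number of sets picks the set, and the remaining high bits form the tag. A sketch with made-up parameters, not the ones in Problem 1:

```python
# Set-associative address decomposition.  The cache parameters below
# (64-byte blocks, 128 sets) are illustrative, not those of Problem 1.
def locate(addr, block_size, num_sets):
    """Split a byte address into (block number, set number, byte offset)."""
    block_number = addr // block_size     # which memory block holds addr
    set_number = block_number % num_sets  # which cache set that block maps to
    offset = addr % block_size            # byte position inside the block
    return block_number, set_number, offset

print(locate(0x1234, 64, 128))   # (72, 72, 52)
```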


HOMEWORK 7
Nov 5 / Nov 7

PROBLEM 1 [30%] Assume the memory size in our DLX machine is 32 MB. It has a 2-way set associative instruction cache with LRU replacement. Miss penalty for this cache is 50 cycles. The block size is 2 words, and there are (only) 2 sets in the cache. Consider the following code:
T1: ADDI R1, R0, #2
T2: ADD R0, R0, R0
T3: ADD R0, R0, R0
T4: ADD R0, R0, R0
T5: ADD R0, R0, R0
T6: ADD R0, R0, R0
T7: ADD R0, R0, R0
T8: ADD R0, R0, R0
T9: ADDI R1, R1, #-1
T10: BNEZ R1, T2
These instructions are written as consecutive words in memory, starting at location 0. In other words, the address of T1 is 0 (0000000000000000000000000), the address of T2 is 4 (0000000000000000000000100), the address of T3 is 8 (0000000000000000000001000), etc. Assume the cache is initially empty.

a) What is the miss rate for this code?

b) What is the average number of memory stall cycles per instruction for this code?

c) What is the CPI if the base CPI (CPI_execution) is 1.95?
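Parts (b) and (c) use the standard accounting: memory stall cycles add directly to the base CPI. A sketch with illustrative numbers, not those of Problem 1:

```python
# CPI with memory stalls.  Numbers below are illustrative only.
def cpi_with_stalls(cpi_exec, accesses_per_instr, miss_rate, miss_penalty):
    """CPI = base CPI + memory stall cycles per instruction."""
    return cpi_exec + accesses_per_instr * miss_rate * miss_penalty

# e.g. instruction fetches only (1 access per instruction),
# a 2% miss rate and a 40-cycle miss penalty:
print(cpi_with_stalls(1.5, 1.0, 0.02, 40))   # base CPI plus 0.02 * 40 stall cycles
```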

PROBLEM 2[30%]: The point of this exercise is to show how easy it is to make unfair benchmarks. Here are two machines, B and T, with the same processor and main memory but different cache organizations. The block size for both machines is the same --- 128 bytes. Both caches are unified and have 32 block frames. There are only two differences:

1. B is write back, while T is write through.
2. B is direct mapped, while T is 2-way set associative.

We also adopt the following natural assumptions:

1. Any miss in either cache takes longer than any hit in either cache.
2. Read hit time is the same for both caches.
3. Write hit time is longer for T than for B (because T is write-through and has to go to memory on every write hit; we assume here that there is no write buffer to make write hits in the write through cache as fast as read hits).

You need to write any two (short) DLX' codes that are infinite loops, such that one of them runs faster on B, and the other one runs faster on T. Understand "faster" as "shorter average instruction execution time". You must state all the necessary assumptions regarding the addresses of the instructions of your codes, so that we know, exactly, when we have misses and when we have hits. Without this information, it is impossible to say anything about which of the machines runs the code faster. Make sure you justify your answers/claims.

Hint: When writing the codes, think of the main advantage and disadvantage of each of the caches. The advantage of B is faster write hit time, and the advantage of T is lower miss rate, since it is 2-way set associative. The code that runs faster on a given machine should exploit the advantage of that machine and be immune to its disadvantage.

PROBLEM 3 [40%] (Modified Exercise 5.4)

Assume the following:

  • The miss rate is 4%.
  • The cache block size is 4 words.
  • The processor makes 1,000,000,000 memory references per second.
  • 20% of those references are writes.
  • The memory system can support 800,000,000 words per second, reads or writes.
  • At any given time, 15% of the blocks in the cache have been modified (are "dirty").
  • The cache (no matter whether it is write back or write through) uses write allocate on a write miss.

You are considering adding a peripheral to the system, and you want to know how much of the memory system bandwidth is already used. Calculate the percentage of memory system bandwidth used on average in the two cases below.

a) The cache is write through

b) The cache is write back.

Comments: "Memory bandwidth" refers to the rate at which words are sent to or from the main memory. The traffic between the CPU and the cache doesn't count as a part of memory bandwidth.

As the exercise states, the bandwidth that the memory system can support is 800 million words per second. However, only a fraction of this bandwidth will really be needed (used), and it is just this fraction you need to find.

Note that even though the rate at which the processor produces references is 1 billion words/second, not all of these references will cause memory traffic, because some of them will be satisfied from the cache. That is why it is possible that the memory bandwidth actually used is less than 1 billion words/second.
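The traffic bookkeeping can be sketched like this, with made-up rates rather than the exercise's numbers; what matters is which events generate memory traffic under each policy:

```python
# Words-per-second traffic accounting for write-through vs. write-back,
# with write allocate in both cases.  All rates below are illustrative,
# not those of Exercise 5.4.
refs_per_sec = 10_000_000      # memory references issued by the CPU
write_frac   = 0.3             # fraction of references that are writes
miss_rate    = 0.05
block_words  = 4
dirty_frac   = 0.2             # write back only: dirty victims on eviction
supported    = 100_000_000     # words/sec the memory system can deliver

# Write through: every write sends 1 word; every miss fetches a block.
wt_traffic = (refs_per_sec * write_frac * 1
              + refs_per_sec * miss_rate * block_words)

# Write back: every miss fetches a block, and a dirty victim writes one back.
wb_traffic = refs_per_sec * miss_rate * block_words * (1 + dirty_frac)

print(wt_traffic / supported)   # fraction of bandwidth used, write through
print(wb_traffic / supported)   # fraction of bandwidth used, write back
```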




HOMEWORK 8
Nov 12 / Nov 14

PROBLEM 1 [20%] Problem 1 from Fal'00, Section 1,E2

PROBLEM 2 [20%] Problem 2 from Fal'00, Section 1,E2

PROBLEM 3 [30%] Even though caches with LRU replacement generally have lower miss rates than those with random replacement, in some specially selected cases a small cache with random replacement can outperform one with LRU (in class we have discussed a similar anomaly with direct mapped versus fully associative placement policies).

Assume the block size is 2 words, and we have a fully associative cache with 2 block frames in it. Write a DLX' code for which the miss rate will be lower with random replacement than with LRU.

PROBLEM 4 [30%] The DLX' machine we consider has a unified 2-level (L1 and L2) cache. We know the following:

1. The ideal CPI (i.e., when all memory accesses are hits) is 3.
2. The local read miss rate is 5% in L1 and 10% in L2.
3. The local write miss rate is 7% in L1 and 25% in L2.
4. The read miss penalty in L2 is 100 clock cycles.
5. The write miss penalty in L2 is 250 clock cycles.
6. The hit time is 1 cycle in L1 and 8 cycles in L2.
7. 10% of the instructions are stores.
8. 15% of the instructions are loads.
9. The remaining 75% of the instructions are ALU instructions.

a) What is the global miss rate in L2?

b) What is the average memory access time?

c) What is the real CPI?



HOMEWORK 9
Nov 19 / Nov 21

PROBLEM 1[60%]

We have a three-level cache, where the miss rate in the first level cache is 8%. The global miss rate in the second level cache is 3%, and the local miss rate in the third level cache is 40%.

a) [9%] What is the local miss rate in the second level cache?

b) [9%] What is the global miss rate in the third level cache?

c) [12%] Assume the hit time in the first level cache is 1 cycle, in the second level cache is 6 cycles and in the third level cache is 12 cycles. The miss penalty in the third level cache is 300 cycles. What is the average memory access time (the average time to complete a memory access generated by the CPU)?
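The average access time unwinds from the innermost level outward, using LOCAL miss rates at each level (a level's global miss rate is the product of the local rates down to it). A sketch with made-up numbers, not those of Problem 1:

```python
# Average memory access time for a multi-level cache, given per-level hit
# times, per-level LOCAL miss rates, and the penalty below the last level.
# Numbers below are illustrative, not those of Problem 1.
def amat(hit_times, local_miss_rates, final_penalty):
    """Fold AMAT from the last cache level back to L1."""
    time = final_penalty
    for hit, miss in zip(reversed(hit_times), reversed(local_miss_rates)):
        time = hit + miss * time
    return time

# Illustrative: hit times 1/5/10 cycles, local miss rates 10%/50%/50%,
# and a 200-cycle penalty below the last level:
print(amat([1, 5, 10], [0.1, 0.5, 0.5], 200))   # 1 + 0.1*(5 + 0.5*(10 + 0.5*200))
```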

PROBLEM 2 [40%] Assume the virtual memory size is 16 GB, the physical memory size is 1 GB, and the page size is 1 KB.
a) How many entries would a page table have?
b) How many entries would an inverted page table have?
c) How many bits would each TLB entry require?
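The counts in (a) and (b) follow from one fact: an ordinary page table has one entry per VIRTUAL page, while an inverted page table has one entry per PHYSICAL page frame; a TLB entry holds a virtual page number tag, a physical frame number, and status bits. A sketch with made-up sizes, not those of Problem 2:

```python
import math

# Illustrative sizes only, not those of Problem 2.
GB, MB, KB = 2**30, 2**20, 2**10
virtual_size, physical_size, page_size = 4 * GB, 256 * MB, 4 * KB

virtual_pages = virtual_size // page_size    # ordinary page table entries
physical_pages = physical_size // page_size  # inverted page table entries
vpn_bits = int(math.log2(virtual_pages))     # virtual page number width
pfn_bits = int(math.log2(physical_pages))    # physical frame number width

print(virtual_pages, physical_pages)   # 1048576 65536
print(vpn_bits, pfn_bits)              # 20 16
```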



HOMEWORK 10
Nov 26 / Dec 5

PROBLEM 1[20%] Assume the virtual memory size is 4 GB, the physical memory size is 64 MB, and the page size is 1 KB. Here is the initial segment --- entries 0 to 3 --- of the page table (it is NOT inverted):

0000000000000011
0000000000000010
---
0000001010001001

If the virtual address of a word W is 0000 0000 0000 0000 0000 1100 0101 0000, what would be its physical address?



PROBLEM 2 [40%] Problems 5c and 5d from Spring'00, Section 2,E2


PROBLEM 3[40%] Assume we have the following characteristics for a hard disk:

  • Rotation rate - 7200 RPM
  • Advertised average seek time - 10 ms
  • Controller overhead - 2 ms
  • Transfer rate - 4 MB/sec
  • Sector size - 512 bytes
  • Queuing delay - 20 ms

a) How long would it take to read one sector from the disk, assuming that the CPU, I/O bus and memory are fast enough so that they do not slow down the maximum data rate that the disk can sustain?

b) How long would it take to read 100 sectors? (If you want to say "of course, 100 times the answer to (a)", you are missing something.)

Assume we have a RAID 0 (see the handouts) of 4 disks, where each disk has the same characteristics as the disk from (a) and (b). How long would it take to read

c) one sector?

d) four consecutive sectors?
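For (a), a single-sector read is the sum of five components: queuing delay, seek, average rotational latency (half a revolution), transfer time, and controller overhead. A sketch with made-up disk parameters, not those of Problem 3:

```python
# Single-sector read time.  All parameters below are illustrative,
# not those of Problem 3.
rpm = 5400
seek_ms = 8.0
controller_ms = 1.0
queue_ms = 0.0
transfer_rate = 2e6          # bytes/sec (decimal MB, as disk vendors quote)
sector_bytes = 512

rotation_ms = 0.5 * 60_000 / rpm                 # average wait: half a revolution
transfer_ms = sector_bytes / transfer_rate * 1e3 # time to stream one sector
total_ms = queue_ms + seek_ms + rotation_ms + transfer_ms + controller_ms
print(round(total_ms, 3))
```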




