Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Final: Cache simulation

Due: Wednesday, May 13 at 11:00

This final project consists of two, equally weighted parts. For the first part, you will write a configurable cache simulator using the infrastructure provided. Your cache simulator will read an address trace (a chronological list of memory addresses referenced), simulate the cache, generate cache hit and miss data, and calculate the execution time for the executing program. The address traces have been generated by a simulator executing real programs. Your cache simulator will be graded for accuracy, but is not the end product of this project, rather it is a tool you will use to complete the project.

For the second part of the project, you will experiment with various cache configurations and make conclusions about the optimal cache organization for a set of programs.

Preliminaries

You can work on this project with a partner if you choose. If you decide to work with a partner, you and your partner should check out a single repository. The first partner will create a team name, and the second partner should choose that team name. Please be careful choosing a team, as this cannot be undone. Please name your team something that makes it clear who you are.

If you choose to work with a partner, you and your partner must complete the entire project together. Dividing the project up into pieces and having each partner complete a portion of it on their own will be considered a violation of the honor code. Both you and your partner are expected to fully understand all of the code you submit.

Click on the assignment link.

Once you have accepted the assignment, you can clone the repository on your computer by following the instruction and begin working.

Address trace

An address trace is simply a list of addresses produced by a program running on a processor. These are the addresses resulting from load and store instructions in the code as it is executed. Some address traces would include both instruction fetch addresses and data (load and store) addresses, but you will be simulating only a data cache, so the provided traces in the traces directory only have data addresses.

These traces were generated by a simulator of a RISC processor running three programs, art, mcf, and swim from the SPEC benchmarks. The files are art.trace.gz, mcf.trace.gz, and swim.trace.gz. The number of loads/stores in the traces vary by benchmark. They are all compressed with gzip. You do not need to decompress the traces because the open_input() function in main.rs knows how to read both compressed and uncompressed files. (In addition, there is a test.trace file containing the contents of the example trace shown below.)

Use the following command to run your simulator on the given trace file.

$ cargo run -- [cache args] tracefile

Because your workload is three programs, you will run three simulations for each cache architecture you simulate, and then combine the results in some meaningful way. The simulator arguments should be taken in as command line arguments. For example, to simulate a 32 kB, 4-way set-associative cache with 32-byte blocks, a 4 cycle hit time, and a 34-cycle miss penalty on the traces/mcf.trace.gz trace file, you’d run the following.

$ cargo run -- -s 32 -a 4 -b 32 -h 4 -m 30 traces/mcf.trace.gz

The supported command line arguments are as follows.

$ cargo run --quiet -- --help
Usage: cache [OPTIONS] <PATH>

Arguments:
  <PATH>  Input file path

Options:
  -b, --block-size <BLOCK_SIZE>        Set the block size in bytes. Must be a power of two [default: 32]
  -a, --associativity <ASSOCIATIVITY>  Set the associativity. Must be a power of two [default: 1]
  -s, --size <SIZE>                    Set the cache size in kilobytes (kB). Must be a power of two [default: 32]
  -h, --hit-time <HIT_TIME>            Set the cache hit time [default: 4]
  -m, --miss-penalty <MISS_PENALTY>    Set the cache miss penalty [default: 34]
      --help                           Print help
  -V, --version                        Print version

Your code should support any reasonable values for block size, associativity, size, and miss penalty. Note that main.rs will ensure that the block size, associativity, and cache size in kB will be a power of two.

Format of the address trace

All lines of the address trace are of the format

# LS ADDRESS IC

where LS is a 0 for a load and 1 for a store, ADDRESS is an 8-character hexadecimal number, and IC is the number of instructions executed between the previous memory access and this one (including the load or store instruction itself). There is a single space between each field. The instruction count information will be used to calculate execution time (or at least cycle count). A sample address trace starts out like this:

# 0 7fffed80 1
# 0 10010000 10
# 0 10010060 3
# 1 10010030 4
# 0 10010004 6
# 0 10010064 3
# 1 10010034 4

You should assume that all memory accesses in the program are appropriately aligned for the size of the access. This ensures that each memory access only interacts with a single cache block.

Simulator (40 points)

Your simulator will model an n-way set associative, write-back, write-allocate cache. The cache replacement policy is always least-recently used for associative caches.

When run on a trace file, the provided code will construct an instance of the Cache structure defined in cache.rs using the cache parameters passed on the command line. Note that the size of the cache passed to Cache::new() is the size in bytes, not kilobytes.

let mut cache = Cache::new(args.size * 1024, args.block_size, args.associativity, args.hit_time, args.miss_penalty);

The provided code will print the cache parameters, run the simulation by calling Cache::access() for each line in the trace file, and then print out the simulation results.

Your task is to compute the

  • total execution time in cycles;
  • number of instructions;
  • number of memory access instructions (i.e., number of loads and stores);
  • overall cache miss rate;
  • cache miss rate for load instructions;
  • average number of cycles per instruction (CPI);
  • average memory access time in cycles (see below for cache behavior);
  • number of dirty cache blocks evicted;
  • number of load misses;
  • number of store misses;
  • number of load hits; and
  • number of store hits.

See the examples below.

For execution time, assume the following.

  • Instructions other than loads and stores take one cycle;
  • A load or store takes a configurable number of cycles (four by default) on a cache hit;
  • A load or store takes the number of cycles for a cache hit (four by default) plus the configurable miss penalty on a cache miss;
  • A load or store that misses in the cache and causes the cache to evict a dirty cache block has an additional 2 cycle penalty.

(We’re assuming that on a cache miss that causes a dirty block to be evicted, the write back to memory happens mostly in the background and the additional 2-cycle penalty is to write the dirty block to a write buffer.)

To recap: With the default hit time of four cycles and the base miss penalty of 34 cycles, a load or store instruction takes four cycles for a cache hit, 38 cycles for a cache miss that does not evict a dirty block, and 40 cycles for a cache miss that evicts a dirty block.

Info

These numbers are a simplification of the real cache behavior of modern processors. For comparison, an Intel Broadwell processor has three levels of cache:

  • Level 1 cache hit time is 4 cycles,
  • Level 2 cache hit time is 12 cycles, and
  • Level 3 cache hit time is 34 cycles.

On a level 3 cache miss, fetching data from memory takes several hundred cycles.

For this simulator, you are essentially simulating a Broadwell’s level 1 cache backed by a level 3 cache that never misses, plus some additional simplifications.

To compute the average memory access time, you should compute the actual number of cycles taken for a memory access which includes the hit time, the miss penalties incurred, and the additional penalties for evicting a dirty block. Divide that total by the number of memory accesses.

In the trace shown above, the first 31 instructions take 188 cycles, assuming four cache misses and three cache hits for the five loads and two stores, and a 34-cycle base miss penalty. The average memory access time is about 23.4 cycles which we compute by using the hit time for each memory access plus the miss penalty for the four cache misses. In this short example, there were no dirty evictions. Had there been dirty evictions, that would have impacted the average memory access time.

Each trace contains the memory accesses of just over 5 million instructions. Your simulations should process all of them.

Implementation task

Your task is to implement the cache by modifying src/cache.rs. You need to add some fields to the Cache struct to perform the cache simulation as well as to hold whichever statistics you need in order to print out the simulation results.

I recommend adding fields to hold counts of things like number of instructions and dirty evictions. You can update these counts in each call to Cache::access(). For simulation results that can be calculated from these counters (like miss rates and CPI), I recommend not adding fields for them and instead to compute them in their corresponding functions. (Take a look at src/cache.rs to see what I mean.)

The provided skeleton code contains an eprintln!() in Cache::access() which prints out each access. This is purely for your own debugging needs. Please comment it out before submission. (Indeed, leaving it in will substantially slow down your program as it will produce megabytes of output that have to be printed to the console when run on the larger traces. Leaving it in and running the code on traces/mcf.trace.gz, for example, will print 145 MB of output.)

Questions (40 points)

For the second part of this project, you will use your cache simulator to test out different cache configurations and answer questions about your findings. You must try every configuration below on the art, mcf, and swim cache traces provided to you, and discuss all of them in your answers to the questions. Note that your question answers are worth just as much as the code for this project—you are expected to give detailed answers that clearly backup your claims with your simulator results.

The baseline cache configuration will be 16-byte block size, direct-mapped, 32 kB cache size, write-back, and write-allocate with a miss penalty of 34 cycles. You will re-evaluate some of these parameters one at a time, in the given order. In each case, choose a best value for each parameter, then use that for all subsequent analyses.

The first step is to compare 32 kB, 64 kB, and 128 kB cache sizes. Larger caches take longer to access, but not linearly. Use the following hit times for the given cache sizes.

Cache sizeHit time
32 kB4 cycles
64 kB5 cycles
128 kB6 cycles

Run your simulator on each of the three traces with each of the three cache sizes and associated hit times. Based on those nine results, select the best cache size to use for all subsequent experiments. Make sure you justify your choice with data when answering question 1.

The next step is to compare cache associativity of direct-mapped, 2-way set-associative, and 8-way set-associative. More associativity increases the hit time, but it has a smaller impact than size on the hit time. Increase the hit time of the the cache with the size selected in the pervious step according to the table below.

AssociativityHit time increase
Direct-mapped+0 cycles
2-way set-associative+0 cycles
8-way set-associative+1 cycles

Run your simulator on each of the three traces with each of the three associativities and the associated hit time increase. Based on those nine results, select the best associativity to use for all subsequent experiments. Make sure you justify your choice with data when answering question 1.

The final step is to compare cache block sizes of 16, 32, and 64 bytes. Increasing the block size has a small impact on hit time which we’ll discount but has a large impact on the miss penalty because more data must be moved into the cache from RAM. Use the miss penalties from the table below for the corresponding block size.

Block sizeMiss penalty
1634 cycles
3240 cycles
6446 cycles

Run your simulator on each of the three traces with each of the three block sizes and the associated miss penalties. Based on those nine results, select the best block size. Make sure you justify your choice with data when answering question 1.

  1. (10 points) What is the best cache size, associativity, and block size, given the parameters above?
  2. (10 points) Is the cache miss rate a good indicator of performance? In which cases did the option with the lowest miss rate not have the lowest execution time? Why?
  3. (10 points) Were results uniform across the three programs? In which cases did different programs give different conclusions? Speculate as to why that may have been true.
  4. (10 points) What was the speedup of your final design over the baseline for each trace? Use the definition of speedup covered in class. (The final design may not be faster for all programs in which case the speed up for that program will be less than 1.)

Put the answers to these questions in the README.md file. Feel free to use markdown to format your answers including adding tables or images, if you so desire.

Hints

Think about how to intelligently debug and test your program. Running immediately on the entire input gives you little insight on whether it is working (unless it is way off). To do this create separate memory tests (you can see the text format above) to ensure cache size, cache associativity, block size, and miss penalty are functioning correctly. You do not need to turn them in, but they will help tremendously.

You can also implement additional unit tests. See src/cache.rs for an example of a unit test.

Speed matters. These simulations should take less than a couple minutes (actually, much less) on an unloaded computer. If it is taking much more than that, do yourself a favor and think about what you are doing inefficiently.

Simulations are not the same as hardware. If your tag only takes 16 bits, feel free to use a larger integer for that value, if that’s helpful.

Submission

Submit the project by committing your code and answers to the questions and pushing it to your GitHub repository.

Examples

Here are three example simulation runs and the command line to produce them. Your output should be identical. Note that I’m running the optimized (“release”) build of the code to speed things up.

The test.trace example should be nearly instantaneous regardless, but the other two benefit from the optimizations. My solutions take less than 2 seconds for the mcf example when optimized and about 8 seconds when using the debug build.

$ cargo run --release --quiet -- -s 32 -a 4 -b 32 -h 5 -m 34 traces/test.trace  
Cache parameters:
    Cache Size                         32 kB
    Cache Associativity                 4
    Cache Block Size                   32 bytes
    Hit time                            5 cycles
    Miss penalty                       34 cycles

Simulation results:
    execution time                    195 cycles
    instructions                       31
    memory accesses                     7
    overall miss rate                0.57
    load miss rate                   0.60
    CPI                              6.29
    average memory access time      24.43 cycles
    dirty evictions                     0
    load_misses                         3
    store_misses                        1
    load_hits                           2
    store_hits                          1
$ cargo run --release --quiet -- -s 32 -a 4 -b 32 -h 5 -m 34 traces/art.trace.gz 
Cache parameters:
    Cache Size                         32 kB
    Cache Associativity                 4
    Cache Block Size                   32 bytes
    Hit time                            5 cycles
    Miss penalty                       34 cycles

Simulation results:
    execution time               29941160 cycles
    instructions                  5136716
    memory accesses               1957764
    overall miss rate                0.25
    load miss rate                   0.27
    CPI                              5.83
    average memory access time      13.67 cycles
    dirty evictions                 60015
    load_misses                    475672
    store_misses                    20015
    load_hits                     1256211
    store_hits                     205866
$ cargo run --release --quiet -- -s 32 -a 4 -b 32 -h 5 -m 34 traces/mcf.trace.gz 
Cache parameters:
    Cache Size                         32 kB
    Cache Associativity                 4
    Cache Block Size                   32 bytes
    Hit time                            5 cycles
    Miss penalty                       34 cycles

Simulation results:
    execution time              153650660 cycles
    instructions                 19999998
    memory accesses               6943857
    overall miss rate                0.44
    load miss rate                   0.39
    CPI                              7.68
    average memory access time      20.25 cycles
    dirty evictions               1025771
    load_misses                   2178475
    store_misses                   875163
    load_hits                     3410997
    store_hits                     479222
$ cargo run --release --quiet -- -s 32 -a 4 -b 32 -h 5 -m 34 traces/swim.trace.gz 
Cache parameters:
    Cache Size                         32 kB
    Cache Associativity                 4
    Cache Block Size                   32 bytes
    Hit time                            5 cycles
    Miss penalty                       34 cycles

Simulation results:
    execution time               20545582 cycles
    instructions                  5002552
    memory accesses               1520886
    overall miss rate                0.18
    load miss rate                   0.15
    CPI                              4.11
    average memory access time      11.22 cycles
    dirty evictions                108293
    load_misses                    163143
    store_misses                   108707
    load_hits                      923251
    store_hits                     325785