
CIS341, Spring 2020

Cache2 Lab: Understanding Cache Memories

Assigned: Tuesday, April 7th, 2020

Due: Saturday, May 2nd, 8:59PM

1 Overview

This assignment will help you understand the behavior of cache memories and the impact that cache memories have on performance. You may work individually, or as a group consisting of up to 4 individuals.

The assignment consists of two parts. In the first part, you will deduce the cache parameters, such as block size, number of sets, number of ways, and hit time, for a two-level cache simulator. In the second part, you will optimize a matrix multiply function to minimize the average memory read time. You will turn in not only your C code for the cache-optimized matrix multiply, but also a short report documenting your work.

2 Milestones

The entirety of the assignment is due on May 2nd, 8:59pm, but there are two milestones before the due date. While you may choose to turn in everything on the final due date, hitting the first two milestones on time will grant you bonus points.

2.1 Milestone 1: April 14th, 8:59pm

You must form a group to receive the files to work on the lab. Even a single-person group must be explicitly formed. You will do this by contacting me via e-mail, stating the team name and the list of group members. Please check the announcement from April 8th for a sample of such an e-mail.

You will receive 10 bonus points if you form a group by the first milestone. Any students who do not belong to a group by the first milestone will be automatically and randomly grouped into new teams.

You will receive the files for the assignment on the class server within 24 hours of your group formation.

2.2 Milestone 2: April 21st, 8:59pm

You will receive bonus points if you correctly deduce the cache memory parameters for the cache simulator before the second milestone. You may make only one guess, and I will either confirm or deny the correctness of your deduction.

The number of bonus points awarded depends on the size of your group. For a single-person group, you will receive up to 24 bonus points if you correctly deduce the cache parameters. A two-person group will receive up to 12 bonus points; a three-person group, up to 8 points; and a four-person group, up to 6 points. You may receive partial credit for an educated guess, but you have only one chance to request confirmation for all of the deduced cache parameters. My response will be either a) all the parameters in your deduction are correct, or b) not all parameters are correct. I will not tell you which parameters are correct and which are not.

The maximum attainable bonus points between Milestones 1 and 2 are limited to 20 points.

2.3 Due date: May 2nd, 8:59pm

By the due date, a report that documents your work and the matrix multiply code must be submitted. The report will be uploaded through Blackboard, and the C code will be placed in the turnin directory of a group member. The report must specify the name of the group member who placed the matrix multiply code in his/her turnin directory.

Late penalties apply as in the prior assignments: each late day will reduce your score by 20%. This reduction is applied after bonus points are added. After 5 days, you will not receive any points even if you received bonus points.

3 Communication

One of the cornerstones of a group project is frequent and efficient communication among its members. The Blackboard group system allows group members to have private discussion boards, send e-mails to each other, and write journal entries.

I would like you to develop a habit of documenting your progress, no matter how minor it is. You may start by posting a thread about some of the observations you have made in the lab, or by writing a digest of your group meeting as journal entries. Even if you work alone, you should utilize these features, and part of your grade will be based on how you document your progress.

4 Tasks

Once you form a group, you will receive a tarball for the assignment, placed in your home directory on the class server, lcs-vc-cis341.syr.edu. Untar it, and you will find the relevant files in the created directory.

linux> cd
linux> tar xvf cache2lab-handout_Team*.tar
linux> cd cache2lab-handout

4.1 Task #1: Deducing cache parameters

csim is the binary for the two-level cache simulator whose parameters you must deduce through experimentation. These parameters include the cache block size, number of sets, number of ways, and hit time for each of the level 1 and level 2 caches. Do not delete csim: make will not be able to create this binary for you!

4.1.1 Running and understanding csim

You will run csim by giving it a trace file to run.

linux> ./csim -v -t traces/dave.trace
L 10,4 L2:miss 100
S 18,4 L1:hit
L 20,4 L2:miss 100
S 28,4 L1:hit
S 50,4 L2:miss
L1: hits:2 misses:3 evictions:0
L2: hits:0 misses:3 evictions:0
Average read time: 100.00
linux>

Here, the -v flag tells the simulator to be verbose, and the trace file path must follow the -t flag. When run in verbose mode, the simulator will output the result of each memory access, followed by a summary for each cache and the average memory read time.

L 10,4 L2:miss 100

The first memory trace is a load (memory read) of 4 bytes from address 0x10. It resulted in an L2 miss (implying that it also missed in L1), and incurred a total access time of 100 cycles. This result tells us that the sum of L1 hit time, L2 hit time, and DRAM access time is 100 cycles.

S 18,4 L1:hit

The second memory trace is a store (memory write) of 4 bytes to address 0x18. It resulted in an L1 hit. This result tells us that the cache block size for L1 is greater than 8 bytes, because the previous load from address 0x10 was installed in the L1 cache. Note that the store does not log any access time. We assume that all stores are non-blocking, and that they are buffered and written in the background.

The average read time indicates 100 cycles. Only the two loads were taken into consideration when computing this average read time.

4.1.2 csim policies

Aside from cache parameters of block size, number of sets, number of ways, and hit time, the cache simulator implements a number of policies that determine the behavior of the cache. These policies are discussed here.

• Write-back
A write-back policy means that on a write hit (the cache hits on a store), the write will only be applied to that cache, not to the levels below. For example, if a store hits in L1, it will not be written to L2 nor DRAM. If a write hits in L2 (for example, when a dirty line evicted from L1 is written to L2), it will not be written to DRAM.

• Write-allocate
A write-allocate policy means that on a write miss (the cache misses on a store), the data will first be loaded into the cache before the update is applied. For example, if a store misses in L1, the corresponding cache line will first be brought into L1.

• Least-recently-used (LRU) replacement
An LRU replacement policy means that when a cache line must be evicted, the line that was least recently used will be replaced. In real hardware, it may not be possible to track LRU perfectly; but for our simulator, assume that we achieve perfect LRU.

• Non-inclusive, non-exclusive (NINE)
A NINE policy means that there are no strict inclusivity or exclusivity requirements on the data in L1 and L2. For example, the same data may reside in both L1 and L2, or in neither. It is also possible for data to be only in L1 and not in L2 (through L2's eviction policy), or only in L2 and not in L1 (after the data is evicted from L1). However, on a miss in L1, data will always be brought into both L1 and L2.
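To see how these policies interact, consider the short access sequence below; the addresses are arbitrary and chosen only for illustration, not taken from the handout traces.

S 100,4
L 100,4
L 108,4

The store misses in both caches, so under write-allocate the line containing address 0x100 is brought into L1 and L2, and the write then dirties the L1 copy without being propagated further (write-back). The load of 0x100 then hits in L1, and the load of 0x108 hits as well provided the L1 block size is larger than 8 bytes. The dirty line is written down to L2 only if it is later evicted from L1 under LRU, and to DRAM only if it is subsequently evicted from L2.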

4.1.3 csim parameter assumptions

You should assume the following relations to be true.

• On hit time: L1 hit time is strictly smaller than L2's hit time, and L2's hit time is strictly smaller than DRAM access time.
• On block size: L2's block size is greater than or equal to L1's block size.
• On the number of ways: L2's associativity is greater than or equal to L1's associativity.
• On the number of sets: The number of sets in L2 is greater than or equal to that in L1.
• On capacity: The size of L2 is strictly greater than the size of L1.
• On numbers: The block size, number of sets, and number of ways are all powers of 2.

Also, you should assume that memory accesses are aligned properly, such that a single memory access never crosses block boundaries. By making this assumption, you can ignore the request sizes in the traces.
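Because the block size and the number of sets are powers of 2, an address splits cleanly into a tag, a set index, and a block offset, which is useful when designing traces that deliberately collide in the same set. The sketch below shows the arithmetic for hypothetical parameters; BLOCK_SIZE and NUM_SETS are placeholders for this example, not the csim values.

#include <inttypes.h>
#include <stdio.h>

/* Hypothetical parameters, for illustration only; not the csim values. */
#define BLOCK_SIZE 16u   /* bytes per block (a power of 2) */
#define NUM_SETS   64u   /* number of sets  (a power of 2) */

int main(void)
{
    uint64_t addr = 0x7ff0005c8;   /* an address from the sample trace */

    uint64_t block_offset = addr % BLOCK_SIZE;               /* low log2(B) bits  */
    uint64_t set_index    = (addr / BLOCK_SIZE) % NUM_SETS;  /* next log2(S) bits */
    uint64_t tag          = addr / (BLOCK_SIZE * NUM_SETS);  /* remaining bits    */

    printf("addr=0x%" PRIx64 " offset=%" PRIu64 " set=%" PRIu64 " tag=0x%" PRIx64 "\n",
           addr, block_offset, set_index, tag);
    return 0;
}

Two addresses compete for the same ways exactly when their set indices match, which is the basis for deducing the number of sets and ways from conflict misses.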

4.1.4 Writing traces

In the directory traces, you will find several sample traces. The memory traces have the following form:

I 0400d7d4,8
 M 0421c7f0,4
 L 04f6b868,8
 S 7ff0005c8,8

Each line denotes one or two memory accesses. The format of each line is

[space]operation address,size

The operation field denotes the type of memory access: “I” denotes an instruction load, “L” a data load, “S” a data store, and “M” a data modify (i.e., a data load followed by a data store). There is never a space before each “I”. There is always a space before each “M”, “L”, and “S”. The address field specifies a 64-bit hexadecimal memory address. The size field specifies the number of bytes accessed by the operation.

csim will ignore any instruction load, so I 0400d7d4,8 will not affect the simulation.
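For example, one way to probe the L1 block size is a hand-written trace such as the following sketch, which loads address 0x0 and then addresses at doubling offsets. The specific addresses are arbitrary choices for illustration, not values from the handout.

 L 0,4
 L 8,4
 L 10,4
 L 20,4
 L 40,4

Only the first load can bring the block containing address 0x0 into the caches, so each later load hits in L1 exactly when its address still falls inside that block. Under the power-of-2 assumption, the first probe that misses therefore reveals the block size: for example, if L 8,4 hits but L 10,4 misses, the L1 block size is 16 bytes.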

4.1.5 Generating traces

Instead of handcrafting memory traces to exercise csim, you may generate traces directly from a program. A Linux program called valgrind can trace the memory accesses during the execution of a program. For example, typing

linux> valgrind --log-fd=1 --tool=lackey -v --trace-mem=yes ls -l

on the command line runs the executable program “ls -l”, captures a trace of each of its memory accesses in the order they occur, and prints them on stdout.

Thus, if you so choose, you may write a program in a high-level language to access memory in a particular pattern, compile it to an executable binary, and then run valgrind to gather the memory trace. However, this includes a lot of accesses outside of the memory regions of interest, such as accesses from the standard libraries and accesses to the stack. You may choose to inspect how test-mm.c and tracegen.c create isolated memory traces.
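As a sketch of this approach, the following program walks an array with a fixed stride; the array size and stride are arbitrary choices for illustration, not values tied to the handout. Running it under the valgrind command above yields a trace whose data loads of buf are STRIDE bytes apart.

#include <stdio.h>

#define SIZE   4096   /* bytes to touch; arbitrary value for illustration */
#define STRIDE 16     /* distance in bytes between consecutive accesses   */

/* volatile prevents the compiler from optimizing the reads away */
static volatile char buf[SIZE];

int main(void)
{
    long sum = 0;

    /* Read one byte every STRIDE bytes; each read appears as an "L" record
       in the valgrind/lackey trace. */
    for (int i = 0; i < SIZE; i += STRIDE)
        sum += buf[i];

    printf("%ld\n", sum);
    return 0;
}

The resulting trace will still contain accesses to the stack and to the standard library, so the lines of interest need to be filtered out, for example by their address range.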

4.2 Task #2: Writing a cache-optimized matrix multiply

You will write a matrix multiply code in mm.c, optimized for the csim binary. A reference matrix multiply code is included in mm_ref.c.

Running test-mm will validate the correctness of your implementation in mm.c and evaluate its performance by generating a memory trace and running it on csim. Below is an example of testing a correct but slow implementation of matrix multiply.

linux>
linux> ./test-mm -M 50 -N 51 -P 52
Step 1: Validating and generating memory traces for mm.c
Step 2: Evaluating performance
L1: hits:126418 misses:141337 evictions:141273
L2: hits:142899 misses:978 evictions:0
Average read time: 13.05

There are several approaches that you should consider when writing mm.c.

• Spatial locality
mm_ref.c shows a naïve implementation of matrix multiply. Try changing the order of the loops.

• Temporal locality
We've seen in lectures that chopping up the matrices into smaller blocks allows data already in the cache to be reused. Consider writing matrix multiply such that once data are brought into the cache, they will be re-used.

• Spatio-temporal locality
Try combining the two techniques above (a sketch combining them follows this list).

• Two-level cache
csim has a two-level cache structure. Think about how you would design a matrix multiply for two-level caches.
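For illustration, the sketch below combines a reordered (ikj) loop nest with blocking. The function name, signature, element type, and block size are assumptions made for this example; your actual implementation must use the interface already defined in mm.c and should tune the block size to the cache parameters you deduce.

/* Illustrative sketch only: the real mm.c defines the actual function
   signature and element type. BSIZE is a tuning knob, not a given value. */
#define BSIZE 16

/* C = C + A * B, where A is M x N, B is N x P, and C is M x P (row-major).
   Assumes C has been zero-initialized by the caller. A and B are only read. */
void mm_blocked(int M, int N, int P,
                int A[M][N], int B[N][P], int C[M][P])
{
    /* Blocked ikj loop order: the innermost loop walks a row of B and C,
       giving spatial locality, while blocking keeps small tiles of the
       matrices resident in the cache so they are reused (temporal locality). */
    for (int ii = 0; ii < M; ii += BSIZE)
        for (int kk = 0; kk < N; kk += BSIZE)
            for (int jj = 0; jj < P; jj += BSIZE)
                for (int i = ii; i < M && i < ii + BSIZE; i++)
                    for (int k = kk; k < N && k < kk + BSIZE; k++) {
                        int a = A[i][k];   /* reused across the whole j loop */
                        for (int j = jj; j < P && j < jj + BSIZE; j++)
                            C[i][j] += a * B[k][j];
                    }
}

Choosing BSIZE so that a few BSIZE-by-BSIZE tiles fit in the deduced L1 capacity (and, with a second level of blocking, in L2) is the usual way to exploit the two-level structure.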

Furthermore, below are rules that you must adhere to when writing mm.c.

• You are allowed to define at most 20 local variables of type int in your mm function. The reason for this restriction is that our testing code is not able to count references to the stack. We want you to limit your references to the stack and focus on the access patterns of the source and destination arrays.

• You are not allowed to side-step the previous rule by using any variables of type long or by using any bit tricks to store more than one value in a single variable.

• Your matrix multiply function may not use recursion.

• If you choose to use helper functions, you may not have more than 20 local variables on the stack at a time between your helper functions and your top-level matrix multiply function. For example, if your matrix multiply declares 16 variables, and then you call a function that uses 4 variables, which calls another function that uses 2, you will have 22 variables on the stack, and you will violate the rule.

• Your mm function may not modify array A or B. You may, however, do whatever you want with the contents of array C.

• You are NOT allowed to define any arrays in your code or to use any variant of malloc.

4.3 Task #3: Writing a short report

You will write a short report documenting your work in Task #1 and Task #2. Only one report is needed per group. You must include the following information in the report.

• Team members
A list of team members who were part of the work.

• Deduced cache parameters
A list of the cache parameters that you deduced. There are 9 parameters you must uncover:
– L1: block size, number of sets, number of ways, hit time
– L2: block size, number of sets, number of ways, hit time
– DRAM access time
Even if you have met Milestone #2, you must still explicitly list these parameters.

• Evidence for your deduction of cache parameters
Provide a summary of the experiments you have performed to arrive at the conclusions. You need to include a description of the types of workload traces you exercised, and how you interpreted these results.

• Directory location of the matrix multiply code
The location of a complete and working mm.c must be disclosed in the report. I will look at this location to validate your code. Please include the full path. You can run pwd in the directory to find the full path.

• Description of your matrix multiply implementation
Briefly describe the final version of your matrix multiply code implemented in mm.c.

• Exemplar output from testing matrix multiply
Include the output of a test run of test-mm.

• Your reasoning on the performance evaluation of matrix multiply
Describe why the performance of your mm.c is better than that of the reference mm_ref.c. If you implemented and combined multiple techniques, a step-by-step description of all the intermediary implementations is encouraged, although not required. If you see a significant performance improvement, you must provide your reasoning for it. If you do not see a performance improvement, explain why that is the case.

5 Grading

We will grade the assignment 5 days after the due date to account for late submissions. As with previous assignments, each late day will reduce the total points by 20%.

The assignment itself is worth 100 points, plus up to 20 bonus points. Out of the 100 base points, 70 points are evaluated based on the submission (mm.c and the report), and 30 points are participation points assigned based on your peer evaluation and group activity.

The breakdown of points is as follows.

• 10 points: cache parameters
You will receive partial credit for each correctly deduced parameter (9 parameters total).

• 20 points: matrix multiply
5 points will be awarded for an mm.c that compiles with no syntax errors. Another 5 points will be awarded for an mm.c that is functionally correct. 10 points will be based on the performance improvement of mm.c. Because each group will receive a different csim, the maximum possible improvement is different for each group. Thus, the scale for awarding performance improvement will depend on the csim parameters. Also, I reserve the choice of the exact matrix dimensions I will use for performance evaluation.

• 40 points: report
20 points will be reserved for how well the group approaches Task #1, and the other 20 points for Task #2. The general criteria for judging these points are described in the "Task #3: Writing a short report" section.

• 30 points: participation
A short questionnaire for peer evaluation will be made available after the deadline. It will be used to evaluate 20 points of the participation grade for each team member. The remaining 10 points will be based on your activity on Blackboard: your group's discussion board and journal entries.

• (bonus) 10 points: forming a group by Milestone #1
An additional 10 points will be awarded to groups that are formed before Milestone #1.

• (bonus) 6/8/12/24 points: correctly deducing the parameters by Milestone #2
Points will be awarded to groups that correctly deduce the parameters by Milestone #2. The awarded points depend on the group size (a group of 4 receives up to 6 points, and a group of 1 receives up to 24) and on how many parameters are correctly deduced. You will only have one attempt at submitting your educated guess.
