
Date: 2020-10-07 09:22

COMP SCI 4094/4194/7094 - Distributed Databases and Data Mining

Assignment 2

Important Notes

• Handins:

– You must do this assignment individually and make individual submissions.

– Your program should be coded in C++ and pass test runs on 3 test files. The sample

input and output files are downloadable in “Assignments” of the course home page

(https://myuni.adelaide.edu.au/courses/54718/assignments/176864/).

– You need to use svn to upload and run your source code in the web submission system

following “Web-submission instructions” stated at the end of this sheet. You should

attach your name and student number in your submission.

– Late submissions will attract a penalty: the maximum mark you can obtain will be

reduced by 25% per day (or part thereof) past the due date or any extension you are

granted.

• Marking scheme:

– 12 marks for testing on 3 standard tests: 4 marks per test.

– 3 marks for the code structure.

– Note: If your code is found not to implement the computation tasks required in this assignment, you will receive zero marks regardless of the correctness of the testing output.

If you have any questions, please send them to the student discussion forum. This way you

can all help each other and everyone gets to see the answers.

The assignment

In this assignment you are required to code a traffic packet clustering engine that clusters raw network packets by application, such as HTTP or SMTP. To accomplish this, you should implement a data preprocessing module and a clustering module. The structure is illustrated below:

You have two input files, and you should write two output files.

Input file1.txt contains a distance threshold and the raw network packet information, that is, seven attributes per packet: source address, source port, destination address, destination port, protocol, arrival time, and packet length (see the sample traffic flow information below).

Input file2.txt contains a number K, followed on the next line by K integers giving the indices of the initial set of K medoids.
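A minimal sketch of reading these two inputs is given here. The names `Packet`, `parsePacket`, and `readInitialMedoids` are hypothetical, and the sketch assumes whitespace-separated fields; how the distance threshold and any header line are laid out in file1 is not specified above, so that part is left out.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One raw packet: the seven attributes listed in the assignment.
struct Packet {
    std::string srcAddr;
    int srcPort;
    std::string dstAddr;
    int dstPort;
    int protocol;
    long long arrival;  // arrival time
    int length;         // packet length
};

// Parses one whitespace-separated packet line (assumed layout).
// Returns false for lines that do not hold seven well-formed fields,
// for example a textual header line.
bool parsePacket(const std::string& line, Packet& out) {
    std::istringstream in(line);
    return static_cast<bool>(in >> out.srcAddr >> out.srcPort >> out.dstAddr
                                >> out.dstPort >> out.protocol
                                >> out.arrival >> out.length);
}

// Reads K, then K flow indices, from any input stream.
// Returns an empty vector if the stream is malformed.
std::vector<int> readInitialMedoids(std::istream& in) {
    int k = 0;
    if (!(in >> k) || k <= 0) return {};
    std::vector<int> medoids;
    medoids.reserve(k);
    for (int i = 0; i < k; ++i) {
        int index = 0;
        if (!(in >> index)) return {};
        medoids.push_back(index);
    }
    return medoids;
}
```

Taking a `std::istream&` rather than a file name keeps the parser usable with both `std::ifstream` and in-memory strings.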

In the data preprocessing module, your program should prepare the flow data for clustering from the raw packet data. Two steps are involved. First, merge the packets into flows by the rule: a network flow consists of at least TWO packets with the same source address, source port, destination address, destination port, and protocol. Then calculate two clustering features for each flow: the average transferring time and the average packet length.
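The two features can be written compactly. The symbols are introduced here for illustration only: a flow has n ≥ 2 packets, ordered by arrival time t_1 ≤ … ≤ t_n, with lengths ℓ_1, …, ℓ_n.

```latex
\bar{t} = \frac{1}{n-1}\sum_{i=2}^{n} (t_i - t_{i-1}) = \frac{t_n - t_1}{n-1},
\qquad
\bar{\ell} = \frac{1}{n}\sum_{i=1}^{n} \ell_i
```

Because the sum of consecutive gaps telescopes, only the first and last arrival times of a flow are needed for the average transferring time, provided the packets are processed in arrival order.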

In the clustering module, you need to apply the k-medoids algorithm (course slides Chapter 10, not the book's random method) to find the minimum number of clusters for which the sum of the distances from each flow to its medoid is less than the given threshold. Note: the clustering features come from the data preprocessing module, and the distance measure is the Manhattan distance.
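For orientation only, here is a generic sketch of a deterministic k-medoids refinement for a fixed K. It is an assumption, not the course framework: it tries every (medoid, non-medoid) swap, keeps the one that lowers the absolute error the most, and stops when no swap helps. The names `Flow`, `manhattan`, `absoluteError`, and `refineMedoids` are hypothetical, and the search over the number of clusters against the threshold is not shown.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct Flow { double x; double y; };  // x: average transferring time, y: average length

double manhattan(const Flow& a, const Flow& b) {
    return std::fabs(a.x - b.x) + std::fabs(a.y - b.y);
}

// Absolute error: each flow contributes its distance to the nearest medoid.
// Assumes `medoids` is non-empty and holds valid indices into `flows`.
double absoluteError(const std::vector<Flow>& flows, const std::vector<int>& medoids) {
    double total = 0.0;
    for (const Flow& f : flows) {
        double best = std::numeric_limits<double>::max();
        for (int m : medoids) best = std::min(best, manhattan(f, flows[m]));
        total += best;
    }
    return total;
}

// Applies the single best improving swap per round until no swap lowers the error.
// Updates `medoids` in place and returns the final absolute error.
double refineMedoids(const std::vector<Flow>& flows, std::vector<int>& medoids) {
    double current = absoluteError(flows, medoids);
    bool improved = true;
    while (improved) {
        improved = false;
        std::vector<int> bestSet = medoids;
        double bestCost = current;
        for (std::size_t i = 0; i < medoids.size(); ++i) {
            for (int c = 0; c < static_cast<int>(flows.size()); ++c) {
                // Skip candidates that are already medoids.
                if (std::find(medoids.begin(), medoids.end(), c) != medoids.end()) continue;
                std::vector<int> trial = medoids;
                trial[i] = c;
                double cost = absoluteError(flows, trial);
                if (cost < bestCost) { bestCost = cost; bestSet = trial; }
            }
        }
        if (bestCost < current) {
            medoids = bestSet;
            current = bestCost;
            improved = true;
        }
    }
    return current;
}
```

The loop terminates because the error strictly decreases on every accepted swap and there are finitely many medoid sets. Recomputing the full error for each trial is simple but quadratic per round, which is acceptable for small inputs.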

For your convenience, below is the framework of the k-medoids algorithm which you should

follow:

Example

Sample traffic flow information

src addr         src port  dst addr       dst port  protocol  arrival time  packet length

202.234.224.254  49880     31.65.181.210  80        6         115258        52

202.234.224.254  49880     31.65.181.210  80        6         115307        52

202.234.35.144   55256     74.39.124.220  443       6         115310        46

119.188.179.82   50592     150.79.7.129   80        6         115314        40

202.234.224.254  49880     31.65.181.210  80        6         115341        52

119.188.179.82   50592     150.79.7.129   80        6         115350        40

119.188.179.82   50592     150.79.7.129   80        6         115363        40

Data preprocessing module

In the above traffic flow information, there are two flows. The first, second, and fifth packets belong to the first flow (index 0); the fourth, sixth, and seventh packets belong to the second flow (index 1). The third packet has no matching packet, so it does not form a flow.

The average transferring time of the first flow = ((arrival time of the fifth packet − arrival time of the second packet) + (arrival time of the second packet − arrival time of the first packet)) ÷ (3 − 1) = ((115341 − 115307) + (115307 − 115258)) ÷ 2 = 41.5.

The average length of the first flow = (Σ packet length) ÷ 3 = (52 + 52 + 52) ÷ 3 = 52.

Similarly, the average transferring time of the second flow = 24.5, and the average length of the second flow = 40.

(Arrival time is in microseconds (μs).)
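The merging rule and the two averages can be sketched as follows. This is one possible implementation, not the required one. It assumes packets are given in arrival order and that flow indices follow the order in which each flow's first packet appears (which matches the example above); `Packet`, `Flow`, `FlowKey`, and `buildFlows` are hypothetical names.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <tuple>
#include <vector>

struct Packet {
    std::string srcAddr;
    int srcPort;
    std::string dstAddr;
    int dstPort;
    int protocol;
    long long arrival;
    int length;
};

struct Flow { double avgTime; double avgLength; };

// The five attributes that identify a flow.
using FlowKey = std::tuple<std::string, int, std::string, int, int>;

std::vector<Flow> buildFlows(const std::vector<Packet>& packets) {
    struct Acc { long long first; long long last; long long totalLength; int count; };
    std::map<FlowKey, std::size_t> slot;  // flow key -> position in accs
    std::vector<Acc> accs;                // kept in first-appearance order

    for (const Packet& p : packets) {
        FlowKey key(p.srcAddr, p.srcPort, p.dstAddr, p.dstPort, p.protocol);
        auto it = slot.find(key);
        if (it == slot.end()) {
            slot.emplace(key, accs.size());
            accs.push_back({p.arrival, p.arrival, p.length, 1});
        } else {
            Acc& a = accs[it->second];
            a.last = p.arrival;  // relies on packets being sorted by arrival time
            a.totalLength += p.length;
            ++a.count;
        }
    }

    std::vector<Flow> flows;
    for (const Acc& a : accs) {
        if (a.count < 2) continue;  // a flow needs at least two packets
        flows.push_back({static_cast<double>(a.last - a.first) / (a.count - 1),
                         static_cast<double>(a.totalLength) / a.count});
    }
    return flows;
}
```

Since the consecutive gaps sum to the difference between the last and first arrival times, the accumulator only stores those two timestamps instead of every packet.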

Clustering module

We use the Manhattan distance to measure the distance between flows. In our sample, the distance between the two flows is |41.5 − 24.5| + |52 − 40| = 17 + 12 = 29.

Example input initial_medoids.txt: initial K medoids

1 (K = 1)

0 (start from index 0 as the initial medoid)

Example Output

At the beginning, you should output the flows produced by the data preprocessing module. For each flow, output the index (ID), the average transferring time (X value), and the average length (Y value):

ID X Y

In this case, flow.txt should contain:

0 41.50 52.00

1 24.50 40.00

Round the numbers (X, Y) to 2 decimal places. You can use:

cout << fixed << setprecision(2) << 3.1415926;

or

printf("%0.2f", 3.1415926);
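As an illustration of the first option applied to whole flow lines, the sketch below builds the flow output as a string. The function name `formatFlows`, the single-space separator, and the absence of a header line are assumptions; the sample output files remain the authority on the exact format.

```cpp
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Each pair holds (average transferring time, average length) of one flow.
// The flow index is the position in the vector.
std::string formatFlows(const std::vector<std::pair<double, double>>& flows) {
    std::ostringstream out;
    out << std::fixed << std::setprecision(2);  // applies to every double written below
    for (std::size_t i = 0; i < flows.size(); ++i) {
        out << i << ' ' << flows[i].first << ' ' << flows[i].second << '\n';
    }
    return out.str();
}
```

Setting `std::fixed` and `std::setprecision(2)` once on the stream is enough, because both settings persist for later insertions. The returned string can then be written to an `std::ofstream`.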

After running k-medoids, you will get K clusters. The output has K+2 lines. The first line is the absolute-error criterion. The next line lists the indices of the K medoids. Each of the following K lines lists the indices of the flows that belong to one medoid's cluster.

29.00 (absolute error of the clustering, 2 decimal places)

0 (the medoid is flow 0)

0 1 (this cluster includes 2 flows, index 0 and index 1)
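The K+2 line layout can be produced along these lines. This is a sketch under assumptions: `formatClusters` is a hypothetical helper, indices are separated by single spaces, and the clusters are passed in the same order as their medoids.

```cpp
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>
#include <vector>

std::string formatClusters(double absoluteError,
                           const std::vector<int>& medoids,
                           const std::vector<std::vector<int>>& clusters) {
    std::ostringstream out;
    // Line 1: absolute error with two decimal places.
    out << std::fixed << std::setprecision(2) << absoluteError << '\n';

    // Writes one row of space-separated indices followed by a newline.
    auto writeRow = [&out](const std::vector<int>& row) {
        for (std::size_t i = 0; i < row.size(); ++i) {
            if (i > 0) out << ' ';
            out << row[i];
        }
        out << '\n';
    };

    writeRow(medoids);                                 // line 2: the K medoid indices
    for (const auto& members : clusters) writeRow(members);  // K lines: one per cluster
    return out.str();
}
```

For the example above (error 29, medoid 0, one cluster holding flows 0 and 1), the function returns the three lines `29.00`, `0`, and `0 1`.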

Web-submission instructions

• First, type the following command, all on one line (replacing xxxxxxx with your student ID):

svn mkdir --parents -m "DDDM" https://version-control.adelaide.edu.au/svn/axxxxxxx/2020/s2/dddm/assignment2

• Then, check out this directory and add your files:

svn co https://version-control.adelaide.edu.au/svn/axxxxxxx/2020/s2/dddm/assignment2

cd assignment2

svn add KMedoids.cpp

...

svn commit -m "assignment2 solution"

• Next, go to the web submission system at:

https://cs.adelaide.edu.au/services/websubmission/

Navigate to 2020, Semester 2, Distributed Databases and Data Mining, Assignment 2.

Then, click Tab “Make Submission” for this assignment and indicate that you agree to the

declaration. The automark script will then check whether your code compiles. You can

make as many resubmissions as you like. If your final solution does not compile you won’t

get any marks for this solution.

• Note:

1. Please follow the forms in sample output files.

2. Your local file path will not work with our web-submission system.

3. We have prepared ten test files in the web-submission system. When you submit your program, test files will be allocated to you at random.

4. The auto-marker script compiles and runs the file named "KMedoids.cpp" using the following commands:

g++ -std=c++11 KMedoids.cpp -o runKMedoids

./runKMedoids network_packets.txt initial_medoids.txt

In this assignment, you need to read two files, network_packets.txt (network packet traffic information) and initial_medoids.txt (initial medoids), which are generated randomly by the system.

You should write two output files named med Flow.txt (flow data after preprocessing) and KMedoidsClusters.txt (k-medoids clustering results), as shown in the following two samples:

Example1

input:File1.txt

src addr         src port  dst addr       dst port  protocol  arrival time  packet length

202.234.224.254  49880     31.65.181.210  80        6         115258        52

202.234.224.254  49880     31.65.181.210  80        6         115307        52

202.234.35.144   55256     74.39.124.220  443       6         115310        46

119.188.179.82   50592     150.79.7.129   80        6         115314        40

202.234.224.254  49880     31.65.181.210  80        6         115341        52

119.188.179.82   50592     150.79.7.129   80        6         115350        40

119.188.179.82   50592     150.79.7.129   80        6         115363        40

input:File2.txt

