Date: 2022-10-22 12:16

School of Computing and Information Systems

The University of Melbourne

COMP90073 Security Analytics,

Semester 2 2022

Project 2: Machine learning based cyberattack detection

Release: Tue 30 Aug 2022

Due: 1pm, Tue 11 Oct 2022

Marks: The Project will contribute 25% of your overall mark for the subject;

you will be assigned a mark out of 25, according to the criteria below.

Overview

There are three tasks in this project: Task I aims to develop your skills in applying

unsupervised machine learning techniques for anomaly detection. Task II helps you better

understand how to use gradient descent-based methods to generate adversarial examples

against supervised learning models beyond the computer vision domain. In Task III, you are

asked to read and review a recent paper on adversarial machine learning.

Specifically, (1) for Tasks I and II, two network traffic (NetFlow) datasets are provided, one

for each task. Both datasets contain botnet traffic and normal traffic. You need to identify

botnet IP addresses from both datasets. (2) For Task II, you also need to choose a

botnet IP address, and explain how to manipulate the corresponding raw network traffic

records in order to bypass detection. (3) Each student has been assigned a paper for Task

III, which will be sent individually via email.

Deliverables

1. Task I – Source code (Python) and SPL queries used to do the following:

a. Generate/Select features from the packet capture files (training and test datasets)

using Splunk. You can use apps such as Splunk Machine Learning Toolkit, but all

features have to be generated/selected within Splunk.

b. Use two alternative feature generation/selection methods (filter-based, wrapper-based, etc.) to select features from packet capture files (training and test datasets).

c. Use Python/Splunk to build six models: apply two different anomaly detection

techniques on each of the three sets of features generated/extracted from 1.a. and

1.b.

d. Score the test data such that cyberattacks are assigned the highest (or lowest)¹

scores.

e. Return the IP addresses of attackers and the timestamps of their first and last

attempts to attack the network service (per attack scenario).

¹ Depending on the technique applied, anomalies may instead receive the lowest scores. Some anomaly detection techniques

assign high scores (e.g., a distance measure) to anomalies, and others assign low scores (e.g., a probability)

to anomalies.

f. Compare and discuss the results from different feature extraction and different

anomaly detection techniques.

g. Prepare a TXT file listing every stream ID that your program classifies as attack

traffic, separated by newlines (i.e., one stream ID in each line).
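
Deliverables 1.d and 1.g can be sketched as follows. This is only an illustration under stated assumptions, not a prescribed method: it assumes every stream already has a numeric score from some detector, and the helper names (`orient_scores`, `flag_streams`, `write_stream_ids`), the toy scores, and the threshold value are all hypothetical.

```python
import io


def orient_scores(scores, higher_is_anomalous=True):
    """Return a copy of the scores in which larger always means more anomalous."""
    if higher_is_anomalous:
        return dict(scores)
    # Probability-like scores: negate so that unlikely streams rank highest.
    return {stream_id: -score for stream_id, score in scores.items()}


def flag_streams(scores, threshold):
    """Return stream IDs whose oriented score exceeds the threshold, most anomalous first."""
    flagged = [stream_id for stream_id, score in scores.items() if score > threshold]
    return sorted(flagged, key=scores.get, reverse=True)


def write_stream_ids(stream_ids, handle):
    """Write one stream ID per line, as required for the TXT deliverable."""
    for stream_id in stream_ids:
        handle.write(f"{stream_id}\n")


# Toy likelihoods from a hypothetical density-based detector (low = anomalous).
likelihoods = {101: 0.90, 102: 0.85, 103: 0.02, 104: 0.80, 105: 0.05}
oriented = orient_scores(likelihoods, higher_is_anomalous=False)
flagged = flag_streams(oriented, threshold=-0.10)  # hypothetical threshold

buffer = io.StringIO()  # stands in for open("attack_streams.txt", "w")
write_stream_ids(flagged, buffer)
```

Orienting all scores the same way before thresholding keeps the six models comparable, whichever convention each technique uses.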

2. Task II

a. Source code in Python, including:

i. Building, training and testing the supervised learning model.

ii. Generating adversarial examples for a chosen botnet IP address, i.e., how to

modify its feature values.

b. Explain how to change the raw traffic sent from/to the chosen botnet IP address, in

order to reflect the modified feature values. For example, the following six features

are extracted for each IP address: (1) mean outbound packet size, (2) variance of

outbound packet size, (3) mean packet count per second, (4) max packet count per

second, (5) mean of packet jitter, (6) variance of packet jitter. A supervised model

is trained on these features to decide whether an IP address is malicious. You find

that by manipulating the values of the third and fourth features, a botnet IP address

is labelled as “normal” by the model. Then how do you change the raw traffic

records so that they are consistent with the modified feature values? For instance,

if 1000 raw traffic records were related to the bot, do you change all 1000 records,

or only a subset, e.g., 100/200 of them? How do you change each of the selected

traffic records?
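
To make the example in 2.b concrete, the sketch below recomputes features (3) and (4) from packet timestamps and shows one possible way to lower them: delaying packets so that no one-second bucket carries more than a chosen cap. The helper names `rate_features` and `respace`, the cap of 2 packets per second, and the toy timestamps are hypothetical. Defining the mean packet count per second as the packet count divided by the length of the observed window is also an assumption.

```python
from collections import Counter


def rate_features(timestamps):
    """Return (mean, max) packets per second over the observed window."""
    buckets = Counter(int(t) for t in timestamps)
    window = max(buckets) - min(buckets) + 1  # whole seconds covered by the trace
    return len(timestamps) / window, max(buckets.values())


def respace(timestamps, max_per_second):
    """Delay packets so that no one-second bucket holds more than max_per_second."""
    respaced = []
    load = Counter()
    for t in sorted(timestamps):
        second = int(t)
        if respaced:
            # Never schedule a packet before the one sent ahead of it.
            second = max(second, int(respaced[-1]))
        while load[second] >= max_per_second:
            second += 1
        load[second] += 1
        # Keep the original time when the packet stays in its own bucket.
        respaced.append(t if second == int(t) else float(second))
    return respaced


original = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 1.0, 1.1]
modified = respace(original, max_per_second=2)
changed = sum(a != b for a, b in zip(sorted(original), modified))

before = rate_features(original)  # burst of 6 packets in the first second
after = rate_features(modified)   # same packets spread over a longer window
```

In this toy trace 6 of the 8 records change while the packet count stays the same, so the bot still sends everything, only later. Re-timing also shifts inter-arrival times, so the jitter features (5) and (6) would need to be rechecked.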

* Note that for Task II, (1) the model is trained to classify each IP address, NOT each

traffic record, as demonstrated in the above example. (2) The focus is not to train an

accurate detection model (i.e., do not spend too much effort on improving model

performance), but to understand the difference in generating adversarial examples in

domains other than computer vision: in the vision domain, raw pixels are often taken as

input, and attackers can directly manipulate them. However, in other domains such as cyber

security, raw data cannot be fed into a model directly, and instead features need to be

extracted first. Therefore, although it would not be difficult to know how to manipulate the

features to bypass detection, there will be different ways to change the raw traffic records, in

order to be consistent with the modified feature values and without affecting the botnet

functionality.

3. Task III. In this task, you will learn how to write a review for an academic paper.

Typically, a review should include the following parts:

a. Summary. Your review starts with a brief summary of the main ideas of the paper.

It helps meta-reviewers, program chairs and the authors to determine whether

there are any misunderstandings.

b. Merits. List the main contributions of the paper in this section. Contributions can be

theoretical, methodological, algorithmic, empirical, etc.

c. Main review. Provide a thorough review of the paper, including:

i. Originality: Are the tasks or methods new? Is the work a novel combination of

well-known techniques? Is it clear how this work differs from previous

contributions? Is related work adequately cited?

ii. Quality: Is the submission technically sound? Are claims well supported (e.g., by

theoretical analysis or experimental results)? Are the methods used appropriate?

Is this a complete piece of work or work in progress? Are the authors careful and

honest about evaluating both the strengths and weaknesses of their work?

iii. Clarity: Is the submission clearly written? Is it well organized? If not, please make

constructive suggestions for improving its clarity.

iv. Significance: Are the results important? Are others (researchers or practitioners)

likely to use the ideas or build on them? Does the submission address a difficult

task in a better way than previous work? Does it advance the state of the art in a

demonstrable way? Does it provide unique data, unique conclusions about

existing data, or a unique theoretical or experimental approach?

*Note that (1) the questions listed in c.i -- c.iv are for explanation only. DO NOT write the

main review in the form of Q&A. Write it like an essay instead. (2) Some papers include an

appendix, which may include proofs, additional experimental settings and results. The

appendix helps you better understand the paper, but your review should focus on the main

part of the paper.

For Tasks I & II:

4. A README that briefly details how your program(s)/queries work(s). You may use

any external resources for your program(s) that you wish. You must indicate (cite)

what these external resources are and where you obtained them, in the README

file.

*Note: please submit a separate README file for Tasks I & II.

Technical Report

A technical report of around 2000 words comprising:

Task I:

1. An overview of the test dataset using Splunk, and an explanation of feature

generation/selection using SPL queries and Splunk native functionalities.

2. Description of your methodology for generating features. Briefly explain your method

for the first project, and discuss your modifications and new findings in Project 2.

3. Review of at least two anomaly detection methods that you have used.

4. Description of the experimental setup and evaluation of the (two) methods in

detecting anomalies on the test datasets using features generated in Splunk and also

features generated using alternative methods. Description should also comprise IP

addresses of attacker(s) and victim(s), the attacked service(s), the timestamp, and

the type of the attack per attack scenario identified.

5. Description of your final CSV file, the scoring and thresholding technique you used

for detecting the reported anomalies².

6. Conclusion and discussion: describe which anomaly detection method worked best given

the attack scenario.

² For example, you may choose the best model as your final model or make an ensemble of models.

Task II:

7. Explanation of the generated features and your choice of supervised learning

model. Note that supervised learning is used here, and the model is the target against

which adversarial examples will be generated.

8. Choosing one IP address classified as botnet by your model, and explaining:

a. How to perturb its features via a gradient descent-based method to bypass the

detection of your model;

b. How to change the raw network traffic sent from/to it, in order to be consistent

with the modified feature values and without affecting the botnet functionality.
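
Item 8.a can be illustrated with a small sketch. It assumes a logistic-regression classifier, chosen here only because its gradient is easy to write by hand, and it performs gradient descent on the cross-entropy loss towards the "normal" label while changing only the features the attacker can control. The weights, bias, feature values, step size, and the function names `botnet_probability` and `perturb` are all hypothetical, and the features are assumed to be standardised.

```python
import math


def botnet_probability(x, weights, bias):
    """Probability that the IP is a botnet under a logistic-regression model."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def perturb(x, weights, bias, mutable, step=0.1, max_iters=100):
    """Gradient descent on the loss for the target label 'normal' (y = 0).

    For logistic regression the loss is -log(1 - p), and its gradient with
    respect to feature i is p * w_i. Only indices in `mutable` are updated.
    """
    x = list(x)
    for _ in range(max_iters):
        p = botnet_probability(x, weights, bias)
        if p < 0.5:  # the model now labels the IP as normal
            break
        for i in mutable:
            x[i] -= step * p * weights[i]
    return x


# Six features in the order used in the example of Deliverable 2.b (toy values).
weights = [0.2, 0.1, 1.5, 2.0, 0.3, 0.1]
bias = -1.0
x0 = [0.5, 0.2, 1.2, 1.5, 0.1, 0.3]

# Only the third and fourth features (packet-rate features) are perturbed.
adversarial = perturb(x0, weights, bias, mutable=[2, 3])
```

The perturbed values still have to correspond to traffic that can actually be sent, for example rates cannot be negative, which is what item 8.b asks you to justify.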

You should include a bibliography and citations to relevant research papers and external

resources and code you have used.

Review

A review of 400 to 500 words of the assigned paper.

1. Summary. This part should contain no more than three sentences. Please be brief,

but specific.

2. Merits. List the top three or more main contributions.

3. Main review. In order for your reviews to provide useful feedback to authors, write

this section in a top-down manner and start from the most important aspects. Your

arguments should be objective, specific, concise and polite.

Assessment Criteria

Code quality and README (2 marks)

Technical report (17 marks)

1. Methodology: (4 marks)

You will describe your methodology in a manner that would make your work

reproducible. You should describe in detail:

Tasks I and II

a. The features that were generated and/or selected.

Task I

a. The training data that was used to learn the anomaly detection models. You

should explain how the parameter settings for your methods were performed

(e.g., setting the ν parameter in OCSVM³). You should not use the test

data for setting the parameters.

b. The scoring that was performed in each model to rank the data instances.

c. The thresholding on the scores that was performed in each model to label the

attacks.

³ For anomaly detection methods that require a validation set for parameter tuning, you can add a small

number of anomalies (about 5%) from the attack day dataset to your training set.
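
One way to set a threshold without touching the test data is to take a high quantile of the anomaly scores on the training set. The sketch below assumes scores where higher means more anomalous. The function names, the toy scores, and the 10% outlier fraction are hypothetical. The fraction plays a role loosely similar to ν in OCSVM, which bounds the share of training points treated as outliers.

```python
def quantile_threshold(train_scores, outlier_fraction):
    """Return the score above which about `outlier_fraction` of training points fall."""
    ordered = sorted(train_scores)
    n_outliers = int(len(ordered) * outlier_fraction)
    keep = max(1, len(ordered) - n_outliers)  # points still treated as normal
    return ordered[keep - 1]


def label(scores, threshold):
    """Label a score as an attack (True) when it exceeds the threshold."""
    return [score > threshold for score in scores]


# Toy anomaly scores computed on training data only.
train_scores = [0.1, 0.2, 0.15, 0.3, 0.25, 0.2, 0.1, 0.35, 0.9, 0.22]
threshold = quantile_threshold(train_scores, outlier_fraction=0.1)

# The threshold is then applied, unchanged, to scores from the test data.
test_labels = label([0.12, 0.95, 0.4], threshold)
```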

2. Accuracy of Results: (4 marks)

Task I

Your machine learning based technique should generate a report of detected attacks

on the test datasets. This should be the output of your algorithm and you should not

change it based on your analysis. It should indicate the IP address of the attacker

and the victim, the attacked service, and the period (timestamps) for which the attack

was happening. You are marked out of 4 based on the percentage of successfully

detected attacks by your anomaly detection model.
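
A report of this kind can be produced directly from the flagged flows. The sketch below assumes that the source IP of a flagged flow is the attacker, that the destination port identifies the attacked service, and that timestamps are ISO-formatted strings that sort chronologically. The field names, the function `summarise_attacks`, and the sample flows are hypothetical.

```python
def summarise_attacks(flagged_flows):
    """Group flagged flows by (attacker, victim, service) with first/last timestamps."""
    report = {}
    for flow in flagged_flows:
        key = (flow["src_ip"], flow["dst_ip"], flow["dst_port"])
        ts = flow["timestamp"]
        first, last = report.get(key, (ts, ts))
        report[key] = (min(first, ts), max(last, ts))
    return report


# Made-up flows using documentation-only IP ranges.
flagged = [
    {"src_ip": "192.0.2.10", "dst_ip": "198.51.100.5", "dst_port": 22,
     "timestamp": "2022-09-01T10:05:00"},
    {"src_ip": "192.0.2.10", "dst_ip": "198.51.100.5", "dst_port": 22,
     "timestamp": "2022-09-01T10:00:00"},
    {"src_ip": "192.0.2.10", "dst_ip": "198.51.100.5", "dst_port": 22,
     "timestamp": "2022-09-01T10:20:00"},
    {"src_ip": "192.0.2.77", "dst_ip": "198.51.100.5", "dst_port": 80,
     "timestamp": "2022-09-01T11:00:00"},
]

report = summarise_attacks(flagged)
for (attacker, victim, port), (first, last) in sorted(report.items()):
    print(f"{attacker} -> {victim}:{port} from {first} to {last}")
```

Because the report is built only from the model's output, it stays consistent with the requirement that the results are not adjusted by hand.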

Task II

As explained in Deliverable 2, the focus of Task II is not to obtain an accurate

detection model. Therefore, accuracy will not be marked separately, but together with

the critical analysis – you are required to perturb the feature(s) and raw network

traffic of an IP address classified as botnet by your model, but if that IP address in

fact belongs to a normal user, i.e., your model misclassifies it, you will not get full

marks for the critical analysis, even if your methods for perturbing the features and the

raw network traffic are correct.

3. Critical Analysis: (7 marks)

Task I

a. Use of Splunk for feature generation/selection from packet capture files

(training and test datasets).

b. Discuss the differences in processes, scalability, and results identified using

the Python code developed for anomaly detection.

Task II

a. Explain the steps for generating adversarial examples, including which

features are chosen and how perturbations are calculated.

b. Explain the steps for changing the raw network records.

4. Report Quality: (2 marks)

You will produce a formal report and express your methodology and findings

concisely and clearly. The quality and description of figures, tables, and the

README file should be acceptable.

Review (6 marks)

1 mark each for summary and merits. 4 marks for the main review.

Description of the Data

The two datasets for Project 2 (A2_1.zip & A2_2.zip) contain the NetFlow data for a network

under cyberattacks. Each line of the dataset includes the following 15 fields: (1) stream ID,

(2) timestamp, (3) duration, (4) protocol, (5) source IP address, (6) source port, (7) direction,

(8) destination IP address, (9) destination port, (10) state, (11) source type of service, (12)

destination type of service, (13) the number of total packets, (14) the number of bytes

transferred in both directions, (15) the number of bytes transferred from the source to the

destination.
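
The 15 fields can be loaded into named records before any feature generation in Python. The sketch below assumes comma-separated lines without a header row; the actual delimiter and layout of A2_1.zip and A2_2.zip should be checked first. The field names in `FIELD_NAMES` and the sample line are made up for illustration.

```python
import csv
import io

# Hypothetical names for the 15 fields, in the order listed above.
FIELD_NAMES = [
    "stream_id", "timestamp", "duration", "protocol", "src_ip", "src_port",
    "direction", "dst_ip", "dst_port", "state", "src_tos", "dst_tos",
    "total_packets", "total_bytes", "src_bytes",
]


def parse_flows(handle):
    """Yield one dict per well-formed NetFlow line; values are kept as strings."""
    for row in csv.reader(handle):
        if len(row) != len(FIELD_NAMES):
            continue  # skip malformed or truncated lines
        yield dict(zip(FIELD_NAMES, row))


# A made-up line standing in for the real file.
sample = io.StringIO(
    "1,2022-09-01 10:00:00,1.5,tcp,192.0.2.10,44321,->,198.51.100.5,80,CON,0,0,12,3400,1200\n"
    "broken,line\n"
)
flows = list(parse_flows(sample))

# Bytes sent from destination to source follow from fields (14) and (15).
dst_bytes = int(flows[0]["total_bytes"]) - int(flows[0]["src_bytes"])
```

Keeping the raw values as strings and converting them only when a feature needs them avoids failures on empty or non-numeric entries such as missing ports.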

Changes/Updates to the Project Specifications

If we require any changes or clarifications to the project specifications, they will be posted on

the LMS. Any addendums will supersede information included in this document.

Academic Misconduct

For most people, collaboration will form a natural part of the undertaking of this project.

However, it is still an individual task, and so reuse of ideas or excessive influence in

algorithm choice and development will be considered cheating. We will be checking

submissions for originality and will invoke the University’s Academic Misconduct policy

(http://academichonesty.unimelb.edu.au/policy.html) where inappropriate levels of collusion

or plagiarism are deemed to have taken place.

Late Submission Policy

You are strongly encouraged to submit by the time and date specified above; however, if

circumstances do not permit this, then the marks will be adjusted as follows. Each day (or

part thereof) that this project is submitted after the due date (and time) specified above, 10%

will be deducted from the marks available, up until 5 days have passed, after which regular

submissions will no longer be accepted.
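
As a worked example of the penalty, the sketch below uses one reading of the policy: each day or part of a day removes 10% of the 25 marks available, and a submission more than 5 days late is not accepted. The function name `marks_available` and the choice to return `None` for a submission that is no longer accepted are assumptions.

```python
import math


def marks_available(total_marks, hours_late):
    """Return the marks still available, or None once submissions are closed."""
    if hours_late <= 0:
        return float(total_marks)
    days_late = math.ceil(hours_late / 24)  # a part day counts as a full day
    if days_late > 5:
        return None
    return total_marks * (10 - days_late) / 10


on_time = marks_available(25, 0)       # submitted by the deadline
one_hour = marks_available(25, 1)      # one hour late counts as one day
thirty_hours = marks_available(25, 30) # counts as two days
too_late = marks_available(25, 121)    # just over five days
```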

Extensions

If you require an extension, please email Mark Jiang <yujing.jiang@unimelb.edu.au> using

the subject ‘COMP90073 Extension Request’ at the earliest possible opportunity. We will

then assess whether an extension is appropriate. If you have a medical reason for your

request, you will be asked to provide a medical certificate. Requests for extensions on

medical grounds received after the deadline may be declined. Note that computer systems

are often heavily loaded near project deadlines, and unexpected network or system

downtime can occur. Generally, system downtime or failure will not be considered as

grounds for an extension. You should plan ahead to avoid leaving things to the last minute,

when unexpected problems may occur.

