Homework 3

AM 218 Machine Learning

Due on April 16 at 11:59 PM on Canvas

Part I: Soft-margin SVM (10 pts)

In order to deal with non-separable cases, we introduce the slack variables ξ_i, i = 1, 2, ..., m, allowing a few data points to be misclassified or to fall inside the margin:

    min_{w, b, ξ}   (1/2) ||w||^2 + C Σ_{i=1}^{m} ξ_i
    subject to      y_i (w^T x_i + b) ≥ 1 − ξ_i,   ξ_i ≥ 0,   i = 1, ..., m.

Similar to the linearly separable case, we solve the above optimization problem by applying the Lagrangian method to derive the dual problem.

(1) Derive the dual problem for the SVM with slack variables using the Lagrangian method. (5 pts)

(2) Let α* denote the solution to the dual problem in (1). Derive the solution to the primal problem; in other words, give the expressions for the optimal w* and b* in terms of α*. (5 pts)
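As a starting point for question (1) (a standard construction, not a full solution): introduce Lagrange multipliers α_i ≥ 0 for the margin constraints and μ_i ≥ 0 for the constraints ξ_i ≥ 0, which gives the Lagrangian

    L(w, b, ξ, α, μ) = (1/2) ||w||^2 + C Σ_{i=1}^{m} ξ_i − Σ_{i=1}^{m} α_i [ y_i (w^T x_i + b) − 1 + ξ_i ] − Σ_{i=1}^{m} μ_i ξ_i.

Setting the derivatives with respect to w, b, and ξ_i to zero gives w = Σ_i α_i y_i x_i, Σ_i α_i y_i = 0, and α_i + μ_i = C (hence 0 ≤ α_i ≤ C); substituting these back into L eliminates w, b, and ξ and leaves a problem in α alone, which is what questions (1) and (2) ask you to write out.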

Part II: Implementation of Support Vector Machine: Ham or Spam

Email has been one of the most important forms of communication, and spam is one of the major threats posed to email users. Links in spam emails may lead users to websites with malware or phishing schemes. Therefore, effective spam-filtering technology is a significant contribution to the sustainability of cyberspace and to our society. Our client company provides a collection of spam emails as well as regular work and personal emails that came from its employees, and has asked us to design an effective spam-filtering system. In order to increase the effectiveness of spam detection, we plan to analyze the content of each email to determine whether it is spam.

After pre-processing the content of the emails, we have extracted 57 features from them. Your task now is to build a spam filter that predicts whether an email is spam or not.

Data

In the dataset (see spambase.csv) provided by the client company, the first 48 columns contain the features word_freq_WORD, the percentage of words in the email that match WORD (number of times WORD appears in the email / total number of words in the email * 100). The following 6 columns contain the features char_freq_CHAR, the percentage of characters in the email that match CHAR. The next column contains the feature capital_run_length_longest, the length of the longest uninterrupted sequence of capital letters. The last feature is capital_run_length_total, the total number of capital letters in the email. The last three features have been standardized to improve the efficiency of training the models. The last column indicates whether the email was considered spam (1) or not (0). There are 4600 instances in total in this dataset.
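A minimal loading sketch, assuming spambase.csv stores the 57 feature columns followed by the label column and has no header row (adjust header= if your copy of the file includes column names):

import pandas as pd

# Load the client dataset; header=None assumes the file has no header row.
data = pd.read_csv("spambase.csv", header=None)

# The first 57 columns are features, the last column is the label (1 = spam, 0 = not spam).
X = data.iloc[:, :-1].to_numpy()
y = data.iloc[:, -1].to_numpy()

print(X.shape, y.shape)  # expected: (4600, 57) (4600,)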

Programming assignment

In this assignment, you will implement a Support Vector Machine on the spambase dataset, using Python, to predict whether an email is spam or not. You can either use functions from libraries such as scikit-learn or write the code from scratch to implement the algorithm. After you run the sample code (see svm_demo.py) successfully (this step is not required, but recommended), you should apply the Support Vector Machine following the instructions below. In this assignment, you should use a linear kernel and set the penalty coefficient C to 1 whenever those parameters are not otherwise specified.
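As a sketch of this default configuration (linear kernel, C = 1), assuming X and y have been loaded as in the Data section; for a linear kernel, scikit-learn also exposes the fitted primal parameters, which connects back to Part I:

from sklearn.svm import SVC

# Default setup for Part II: linear kernel, penalty coefficient C = 1.
clf = SVC(kernel="linear", C=1)
clf.fit(X, y)

# For a linear kernel, the fitted primal solution is available directly:
w = clf.coef_[0]        # weight vector w
b = clf.intercept_[0]   # bias b
# clf.dual_coef_ holds y_i * alpha_i for the support vectors (cf. the dual in Part I).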

1. Learn and apply the algorithm (20 pts)

   a. For each W in [7, 17, 27, 37, 47, 57]:

      i. Use the first W features; randomly sample 70% of the data for training and use the remaining 30% for testing.
      ii. Learn a model from the training data and apply it to the test data.
      iii. Measure the performance on both the training and test data using accuracy.
      iv. Repeat 100 times with different random samples (using the same W).
      v. Record and report the average and the standard deviation of accuracy on both the training and test sets across the 100 trials.

   b. Plot curves for the results (the number of features vs. the average accuracy, and the number of features vs. the standard deviation of accuracy) for both training and test data. Discuss the results.
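A minimal sketch of the loop in part 1 above (feature counts, split ratio, and 100 repetitions as specified; X and y are the arrays from the Data section, and variable names are illustrative):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

results = {}
for W in [7, 17, 27, 37, 47, 57]:
    train_acc, test_acc = [], []
    for trial in range(100):
        # Use only the first W features and a fresh random 70/30 split each trial.
        X_train, X_test, y_train, y_test = train_test_split(X[:, :W], y, test_size=0.3)
        clf = SVC(kernel="linear", C=1)
        clf.fit(X_train, y_train)
        train_acc.append(clf.score(X_train, y_train))
        test_acc.append(clf.score(X_test, y_test))
    # Average and standard deviation of accuracy across the 100 trials.
    results[W] = (np.mean(train_acc), np.std(train_acc), np.mean(test_acc), np.std(test_acc))
    print(W, results[W])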

2. Explore the effect of the penalty coefficient C (25 pts)

   a. Randomly sample 80% of the data for training and use the remaining 20% for testing.

   b. For each C in [0.001, 0.01, 0.05, 0.1, 0.5, 1, 10]:

      i. Use 10-fold cross-validation on the training set to evaluate the accuracy of the SVM with the penalty coefficient C.
      ii. Record and report the average and the standard deviation of accuracy over the 10 validation sets.
      iii. Record and report the average margin of the SVM that you trained.

   c. Plot curves for the results (the penalty coefficient vs. the average accuracy, the penalty coefficient vs. the standard deviation of accuracy, and the penalty coefficient vs. the average margin). Discuss the results.

   d. Choose a penalty coefficient C and give the reason why you chose it. Learn the SVM model on the whole training set with the chosen penalty coefficient C and report the accuracy on the test set.
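A sketch of part 2 above, using sklearn.model_selection.KFold as suggested in the Hint section. The margin is computed here as the geometric margin 2/||w|| of the fitted linear SVM; this is one common reading of "average margin" and is an assumption, since the assignment does not define the term precisely:

import numpy as np
from sklearn.model_selection import KFold, train_test_split
from sklearn.svm import SVC

# 80/20 split; the 20% test set is held out until step d.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

for C in [0.001, 0.01, 0.05, 0.1, 0.5, 1, 10]:
    accs, margins = [], []
    for train_idx, val_idx in KFold(n_splits=10, shuffle=True).split(X_train):
        clf = SVC(kernel="linear", C=C)
        clf.fit(X_train[train_idx], y_train[train_idx])
        accs.append(clf.score(X_train[val_idx], y_train[val_idx]))
        # Geometric margin of a fitted linear SVM: 2 / ||w||.
        margins.append(2.0 / np.linalg.norm(clf.coef_[0]))
    print(C, np.mean(accs), np.std(accs), np.mean(margins))

# For step d: refit on the full training set with the chosen C and score on X_test, y_test.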

3. Comparison with Logistic Regression (15 pts)

   a. Evaluate the accuracy of SVM and Logistic Regression using 10-fold cross-validation on the whole dataset.

   b. Record and report the average and the standard deviation of the accuracy of SVM and Logistic Regression on the validation sets.

   c. Record and report the average precision and average recall of SVM and Logistic Regression on the validation sets.

   d. Is one classifier better than the other based on the accuracy metric? Use the paired t-test to justify your conclusion.
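A sketch of part 3 above. It uses sklearn.model_selection.cross_validate with the same folds for both classifiers, which is equivalent to calling the precision and recall functions named in the Hint section fold by fold; the paired t-test comes from scipy.stats.ttest_rel and pairs the per-fold accuracies (one common way to set up the test, assumed here since the assignment does not fix the pairing):

import numpy as np
from scipy.stats import ttest_rel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.svm import SVC

cv = KFold(n_splits=10, shuffle=True, random_state=0)  # fixed seed: identical folds for both models
scoring = ["accuracy", "precision", "recall"]

svm_res = cross_validate(SVC(kernel="linear", C=1), X, y, cv=cv, scoring=scoring)
lr_res = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring=scoring)

for name, res in [("SVM", svm_res), ("Logistic Regression", lr_res)]:
    print(name,
          "accuracy %.3f +/- %.3f" % (res["test_accuracy"].mean(), res["test_accuracy"].std()),
          "precision %.3f" % res["test_precision"].mean(),
          "recall %.3f" % res["test_recall"].mean())

# Paired t-test on the per-fold accuracies (the folds are identical, so the samples are paired).
t_stat, p_value = ttest_rel(svm_res["test_accuracy"], lr_res["test_accuracy"])
print("paired t-test: t = %.3f, p = %.3f" % (t_stat, p_value))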

Hint

1. For part 1, you may use sklearn.svm.SVC() to train the model.

2. For part 2, you may use sklearn.model_selection.KFold to implement k-fold cross-validation.

3. For part 3, you may use sklearn.metrics.recall_score and sklearn.metrics.precision_score to evaluate your models.

Submission

1. The source code in Python.

   Name your file "svm.py". Include the code that you wrote for the assignments above. In addition, prepare the code so that the TA can run it without additional modifications, and the output should report the results for question (b) in part 2 like this:

   $ python svm.py
   The average accuracy with different penalty coefficients:
   [0.111 0.333 0.777 0.212 0.676 0.216 0.313]
   The std. dev. of accuracy with different penalty coefficients:
   [0.013 0.013 0.012 0.011 0.012 0.222 0.213]
   The average margin with different penalty coefficients:
   [1.432 2.212 0.266 3.154 1.089 1.212 3.122]

2. Your evaluation & analysis in .pdf format.

   Note that your analysis should include the graphs as well as a discussion of the results.
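For the graphs required in 1.b and 2.c, a minimal matplotlib sketch; the call at the bottom is illustrative and assumes the averages have been collected into a list in the same order as the W values:

import matplotlib.pyplot as plt

def plot_curve(x, y, xlabel, ylabel, fname):
    # Plot one of the required curves and save it as a PNG for the report.
    plt.figure()
    plt.plot(x, y, marker="o")
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.savefig(fname)

# Example (hypothetical list avg_test_acc collected in part 1):
# plot_curve([7, 17, 27, 37, 47, 57], avg_test_acc,
#            "number of features", "avg. test accuracy", "avg_acc_vs_W.png")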
