QBUS6850, 2018 Semester 2


1. The assignment MUST be submitted electronically to Turnitin through the QBUS6850 Canvas site. Please do NOT submit a zipped file.

2. The assignment is due at 17:00 on Monday, 3 September 2018. The late penalty for the assignment is 10% of the assigned mark per day, starting after 17:00 on the due date. The closing date, 17:00 on Monday, 10 September 2018, is the last date on which an assessment will be accepted for marking.

3. Your answers shall be provided as a word-processed report giving a full explanation and interpretation of any results you obtain. Output without explanation will receive zero marks.

4. Be warned that plagiarism between individuals is always obvious to the markers of the assignment and can be easily detected by Turnitin.

5. The data sets for this assignment can be downloaded from Canvas.

6. Presentation of the assignment is part of the assignment. Markers may deduct up to 10% of the mark for poor clarity and presentation. It is recommended that you include your Python code as an appendix to your report; however, you may insert small sections of your code into the report for better interpretation when necessary. Think about the best and most structured way to present your work, summarise the procedures implemented, support your results/findings, and prove the originality of your work.

7. Numbers with decimals should be reported to three decimal places.

8. The report should be NOT more than 10 pages, including everything (text, figures, tables, small sections of inserted code, etc.) but excluding the appendix containing the Python code.

Tasks

Question 1 (50 Marks)

You will work on the UCI ML housing dataset. A template Python program has been prepared for you. The program can help you get the dataset from the sklearn dataset repository. Please test and play with the template program to fully understand the dataset.

For further information, please visit
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.names

(a) Suppose you are interested in using the house age AGE (proportion of owner-occupied units built prior to 1940) as the first feature x_1 and the full-value property-tax rate TAX as the second feature x_2 to predict MEDV (median value of owner-occupied homes in $1000's) as the target t. Write code to extract these two features and the target from the dataset.

Use the dataset (two chosen features and one target) to plot the loss function

L(\boldsymbol{\beta}) = \frac{1}{2N} \sum_{n=1}^{N} \left( f(\mathbf{x}_n, \boldsymbol{\beta}) - t_n \right)^2

with f(\mathbf{x}_n, \boldsymbol{\beta}) = \beta_1 x_1 + \beta_2 x_2.

That is, we are using a linear regression model without the intercept term β_0.

Hint: This is a 3D plot and you will need to iterate over a range of β_1 and β_2 values.
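
For illustration, here is a minimal sketch of one way to produce this plot. It assumes the template loads the data via sklearn.datasets.load_boston (the loader shipped with 2018-era scikit-learn); the grid ranges for β_1 and β_2 are arbitrary choices you should tune:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection
from sklearn.datasets import load_boston

# Extract the two features and the target
boston = load_boston()
cols = list(boston.feature_names)
age = boston.data[:, cols.index('AGE')]
tax = boston.data[:, cols.index('TAX')]
t = boston.target
N = len(t)

# Evaluate L(beta) over a grid of (beta_1, beta_2) values
b1, b2 = np.meshgrid(np.linspace(-1, 1, 100), np.linspace(-1, 1, 100))
L = np.zeros_like(b1)
for i in range(b1.shape[0]):
    for j in range(b1.shape[1]):
        pred = b1[i, j] * age + b2[i, j] * tax       # f(x_n, beta) without intercept
        L[i, j] = np.sum((pred - t) ** 2) / (2 * N)  # the loss defined above

# 3D surface plot of the loss
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(b1, b2, L)
ax.set_xlabel('beta_1')
ax.set_ylabel('beta_2')
ax.set_zlabel('L(beta)')
plt.show()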

(b) Use the linear regression model LinearRegression in the scikit-learn package to build two linear regression models to predict the target, with and without the intercept term. You may use 90% of the data as your training data and the remaining 10% as your testing data. Compare the performance of the two models and explain the importance of the intercept term.

Hint: The argument fit_intercept of LinearRegression controls whether an intercept term is included in the model, via fit_intercept = True or fit_intercept = False.
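
A minimal sketch, assuming age, tax and t were extracted as in (a); the random_state is illustrative:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X = np.column_stack((age, tax))
X_train, X_test, t_train, t_test = train_test_split(
    X, t, test_size=0.1, random_state=0)   # 90% train / 10% test

# With the intercept term
m1 = LinearRegression(fit_intercept=True).fit(X_train, t_train)
# Without the intercept term
m0 = LinearRegression(fit_intercept=False).fit(X_train, t_train)

print(mean_squared_error(t_test, m1.predict(X_test)))
print(mean_squared_error(t_test, m0.predict(X_test)))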

(c) Take 90% of the data as training data. Construct the centred training dataset by conducting the following steps in your Python code:

(i) Take the mean of all the training target values, then deduct this mean from each training target value MEDV. Take the resulting target values as the new training target values t_new;

(ii) In the training data, take the mean of all the first feature values AGE, then deduct this mean from each of the first feature values. Take the result as the new first feature values x^1_new;

(iii) In the training data, do the same for the second feature TAX. The result is x^2_new.

Now build linear regressions with and without the intercept to fit the new training data. Report and compare the coefficients and the intercept. Compare the performance of the two models over the testing data. Note that, when you take your testing data into the model to calculate performance scores, you shall subtract the relevant training means from the testing features and targets, as sketched below.
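
A minimal sketch of the centring steps, reusing the split from (b); note that the test data are centred with the training means, never their own:

# Training means only
t_mean = t_train.mean()
x_mean = X_train.mean(axis=0)   # means of AGE and TAX

# Centred training data: t_new, x^1_new, x^2_new
t_train_new = t_train - t_mean
X_train_new = X_train - x_mean

m1c = LinearRegression(fit_intercept=True).fit(X_train_new, t_train_new)
m0c = LinearRegression(fit_intercept=False).fit(X_train_new, t_train_new)
print(m1c.intercept_, m1c.coef_)   # intercept should be (numerically) zero
print(m0c.coef_)

# Subtract the SAME training means before scoring on the test data
X_test_new = X_test - x_mean
t_test_new = t_test - t_mean
print(mean_squared_error(t_test_new, m1c.predict(X_test_new)))
print(mean_squared_error(t_test_new, m0c.predict(X_test_new)))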

(d) Consider the closed-form solution of the linear regression below, see slide 25 (the number may change) of Lecture 2:

\boldsymbol{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{t}

where X is the design (data) matrix whose first column is all 1s, and the first component of β is the intercept. Suppose that the data are centred (refer to (c)). Now prove that, in the case of centred data, the intercept β_0 in the solution above is zero.

Hint: You may need the following fact:

\begin{pmatrix} \mathbf{A} & \mathbf{0} \\ \mathbf{0} & \mathbf{B} \end{pmatrix}^{-1} = \begin{pmatrix} \mathbf{A}^{-1} & \mathbf{0} \\ \mathbf{0} & \mathbf{B}^{-1} \end{pmatrix}

where both matrices A and B are invertible.
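
As an outline of the argument (you still need to write out the full proof yourself): write the design matrix as X = [1 X_c], where 1 is the column of 1s and the columns of X_c are the centred features, so that 1^T X_c = 0^T. Then

\mathbf{X}^T \mathbf{X} = \begin{pmatrix} N & \mathbf{1}^T \mathbf{X}_c \\ \mathbf{X}_c^T \mathbf{1} & \mathbf{X}_c^T \mathbf{X}_c \end{pmatrix} = \begin{pmatrix} N & \mathbf{0}^T \\ \mathbf{0} & \mathbf{X}_c^T \mathbf{X}_c \end{pmatrix}

and, by the block-diagonal inverse in the hint, the first component of β = (X^T X)^{-1} X^T t is

\beta_0 = \frac{1}{N} \mathbf{1}^T \mathbf{t} = \bar{t} = 0

since the targets are centred as well.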

Question 2 (50 Marks)

Use Logistic Regression to predict the diagnosis of breast cancer patients on the Breast Cancer Wisconsin (Diagnostic) Dataset (wdbc.data). See the section About Datasets. This question aims to test your ability to program the matrix operations of Logistic Regression.

(a) Write Python code to load the data into your program. For the target feature Diagnosis, change the literal M (malignant) to 0 and B (benign) to 1. Split the data into training and validation sets (80%/20% split). Then define and train a logistic regression model using scikit-learn's LogisticRegression model.
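
A minimal sketch, assuming wdbc.data sits in the working directory in the comma-separated form described under About Datasets (no header row); the random_state is illustrative:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Column 0 is the ID, column 1 the diagnosis, columns 2-31 the 30 features
df = pd.read_csv('wdbc.data', header=None)
t = df[1].map({'M': 0, 'B': 1}).values
X = df.iloc[:, 2:].values

X_train, X_val, t_train, t_val = train_test_split(
    X, t, test_size=0.2, random_state=0)   # 80% / 20% split

clf = LogisticRegression().fit(X_train, t_train)
print(clf.score(X_val, t_val))             # validation accuracy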

(b) Using the logistic regression model function below and the estimated parameters from your model, calculate the probability of sample ID 8510426 (the 20th sample) having a benign diagnosis.

f(\mathbf{x}_n, \boldsymbol{\beta}) = \frac{1}{1 + e^{-\mathbf{x}_n^T \boldsymbol{\beta}}}
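
A minimal sketch of evaluating this function by hand with the parameters estimated in (a); under the 0/1 coding above, f returns the probability of class 1, i.e. a benign diagnosis:

x20 = X[19]                                   # sample ID 8510426 is the 20th row
beta = np.concatenate(([clf.intercept_[0]], clf.coef_[0]))
x20_ext = np.concatenate(([1.0], x20))        # prepend the intercept feature x_n0 = 1
p_benign = 1 / (1 + np.exp(-np.dot(x20_ext, beta)))
print(p_benign)
# Cross-check against scikit-learn:
print(clf.predict_proba(x20.reshape(1, -1))[0, 1])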

(c) The objective of logistic regression is defined as, on slide 17 (the number may change) of Lecture 3,

L(\boldsymbol{\beta}) = -\frac{1}{N} \sum_{n=1}^{N} \left[ t_n \log f(\mathbf{x}_n, \boldsymbol{\beta}) + (1 - t_n) \log \left( 1 - f(\mathbf{x}_n, \boldsymbol{\beta}) \right) \right]

where both the parameter β = (β_0, β_1, …, β_d)^T and the sample x_n = (x_{n0}, x_{n1}, …, x_{nd})^T are (d+1)-dimensional vectors, with the intercept feature x_{n0} = 1. For the Wisconsin Dataset, d = 30. It is easy to prove that (you do not need to prove this)

\frac{\partial L(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \frac{1}{N} \mathbf{X}^T \left( \mathbf{f}(\mathbf{X}, \boldsymbol{\beta}) - \mathbf{t} \right)

where f(X, β) = (f(x_1, β), f(x_2, β), …, f(x_N, β))^T and t = (t_1, t_2, …, t_N)^T.

Write your own Python code that uses this derivative formula to implement the gradient descent algorithm for logistic regression. You may write a Python function named, for example, myLogisticGD, which accepts a data matrix X, an initial parameter beta_0, a number of GD iterations T, and other arguments you see appropriate. Your function should return the learned parameter β.

Hint: In Python, you can use the following way to get the vector F = f(X, β). First define the sigmoid function by

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

then

F = sigmoid(np.dot(X, beta))

or similar.
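
Building on that hint, one possible shape for the whole function is sketched below; it assumes X already carries the leading column of 1s, and the fixed learning rate eta is an extra argument of the kind the question allows:

def myLogisticGD(X, t, beta_0, T, eta=0.01):
    # X: N x (d+1) data matrix whose first column is all 1s
    # t: length-N vector of 0/1 targets
    # beta_0: initial (d+1)-vector; T: number of GD iterations
    N = X.shape[0]
    beta = beta_0.copy()
    for _ in range(T):
        F = sigmoid(np.dot(X, beta))      # f(X, beta)
        grad = np.dot(X.T, F - t) / N     # the derivative formula above
        beta = beta - eta * grad          # one gradient descent step
    return beta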

(d) Based on task (c) and the training data used in (a), write Python code that uses different initial values β = (0, 0, …, 0)^T, β = (1, 1, …, 1)^T, and a random initial β to start the gradient descent algorithm minimising the objective of logistic regression with respect to the parameter β. Set the number of iterations to T = 200. Use each resulting β to re-do task (b). Compare the results and explain the major reasons why you may get different answers with different initial values for β.

Hint: As mentioned on slide 29 of Lecture 2, it is good practice to normalise your data before you send it to your algorithm.
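
A minimal sketch of (d), normalising the training features first (zero-mean/unit-variance scaling is one common choice, an assumption here) and then calling myLogisticGD with each initial value:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)
X_gd = np.hstack([np.ones((X_train.shape[0], 1)),
                  scaler.transform(X_train)])    # add the intercept column

d1 = X_gd.shape[1]                               # d + 1 = 31
np.random.seed(0)
inits = [np.zeros(d1), np.ones(d1), np.random.randn(d1)]
betas = [myLogisticGD(X_gd, t_train, b0, T=200) for b0 in inits]

# Re-do task (b) with each learned beta (scale the sample the same way)
x20_gd = np.concatenate(([1.0], scaler.transform(X[19:20])[0]))
for beta in betas:
    print(sigmoid(np.dot(x20_gd, beta)))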

About Datasets

Breast Cancer Wisconsin (Diagnostic): wdbc.data

Attribute information

1: ID number
2: Diagnosis (M = malignant, B = benign)
3-32: Ten real-valued features are computed for each cell nucleus:

• radius (mean of distances from center to points on the perimeter)
• texture (standard deviation of gray-scale values)
• perimeter
• area
• smoothness (local variation in radius lengths)
• compactness (perimeter^2 / area - 1.0)
• concavity (severity of concave portions of the contour)
• concave points (number of concave portions of the contour)
• symmetry
• fractal dimension ("coastline approximation" - 1)

The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.

