Date: 2022-10-28 09:56

BUSS6002 Assignment 2

October 10, 2022

Instructions

• Due: at 23:59 on Friday, October 28, 2022 (end of week 12).

• You must submit a written report (in PDF) with the following filename format, replacing STUDENTID with your own student ID: BUSS6002 A2 STUDENTID.pdf.

• You must also submit a Jupyter Notebook (.ipynb) file with the following filename format, replacing STUDENTID with your own student ID: BUSS6002 A2 STUDENTID.ipynb.

• There is a limit of 2000 words for your report (excluding equations, tables, and captions).

• All plots, computational tasks, and results must be completed using Python.

• Each section of your report must be clearly labelled with a heading.

• Do not include any Python code as part of your report.

• All figures must be appropriately sized and have readable axis labels and legends (where applicable).

• The submitted .ipynb file must contain all the code used in the development of your report.

• The submitted .ipynb file must be free of any errors, and the results must be reproducible.

• You may submit multiple times but only your last submission will be marked.

• A late penalty applies if you submit your assignment late without a successful special consideration. See the Unit Outline for more details.


Rubric

This assignment is worth 20% of the unit’s marks. The assessment is designed to test your technical ability and statistical knowledge in modelling a real-world dataset, as well as your communication skills in writing a concise and coherent report presenting your approach and results.

Assessment Item        Goal                                          Marks
---------------------  --------------------------------------------  -----
Section 1              Introduction                                      3
Section 2              Candidate models                                 10
Section 3              Model estimation and selection                   12
Section 4              Model evaluation                                  8
Section 5              Conclusion                                        3
Overall Presentation   Clear, concise, coherent, and professional        4
Total                                                                   40

Table 1: Assessment Items and Mark Allocation

Overview

Being able to accurately predict the sale prices of residential properties is crucial to many aspects of the economy. Some companies base their entire business models on providing their clients with predictions of property sale prices. As a data scientist, you are asked to build a model to predict sale prices using data on residential home sales in Ames, a city in the state of Iowa of the United States. The dataset contains sale prices between 2006 and 2010 of all residential properties in Ames, as well as many numerical and categorical features (i.e., variables) associated with each dwelling. The following downloadable files are available on Canvas.

File                    Description
----------------------  --------------------------------------------------------
AmesHousing.txt         Data file containing 2,930 observations and 82 variables
DataDocumentation.txt   Data dictionary containing description of each variable
AmesResidential.pdf     A map of Ames

Table 2: Files Provided

Data

Place the data file AmesHousing.txt in the same location (i.e., directory) as your Jupyter Notebook file (.ipynb), and then read the data into a pandas DataFrame object using exactly the following code.

    import pandas as pd

    data = pd.read_csv(
        'AmesHousing.txt',
        sep='\t',
        keep_default_na=False,
        na_values=[''])
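The two NA-related arguments in the call above change how missing values are detected. The following minimal sketch runs the same options on a made-up two-row table; the column names `Alley` and `LotArea` and all values are illustrative assumptions, not contents of the provided file. With `keep_default_na=False` and `na_values=['']`, pandas keeps the literal string 'NA' as an ordinary value and treats only empty fields as missing, which presumably matters if the data dictionary uses 'NA' as a category label.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample; this is NOT the real AmesHousing.txt.
sample = "Alley\tLotArea\nNA\t8450\n\t9600\n"

demo = pd.read_csv(
    io.StringIO(sample),
    sep='\t',
    keep_default_na=False,  # do not treat strings such as 'NA' as missing
    na_values=[''])         # only empty fields become NaN

# The string 'NA' in the first row is preserved as text.
alley_first = demo['Alley'].iloc[0]
# The empty field in the second row is read as a missing value.
alley_second_missing = pd.isna(demo['Alley'].iloc[1])
```

After loading the real file, `data.shape` can be compared with the 2,930 observations and 82 variables listed in Table 2.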


1 Introduction

In this section, you should

• provide a brief project background so that the reader of your report can understand the general problem that you are solving;

• state the aim of your project;

• briefly describe the dataset;

• briefly summarise your key results.

2 Candidate models

Propose at least three candidate models for predicting the response variable ‘SalePrice’. For $i \in \{1, 2, 3\}$, each candidate model should take the form

$$y = f_i(x_i; \beta_i) + \varepsilon_i,$$

where $y$ is the sale price of a property, and $x_i$, $\beta_i$, and $\varepsilon_i$ are the predictor vector, parameter vector, and the error term of the $i$-th model, respectively. The set of variables chosen for the feature vector $x_i$ should be a subset (or constructed from a subset) of the 81 predictors in the provided dataset. You may label your models M1, M2, and M3. The proposed models should be different in terms of model complexity (i.e., number of parameters). For each proposed model, you should:

• clearly define the function $f_i$, which can be either linear or nonlinear with respect to $x_i$;

• clearly define the feature vector $x_i$;

• justify your choices of $f_i$ and $x_i$;

• state any assumptions on the error term $\varepsilon_i$;

• discuss how the model parameters $\beta_i$ can be estimated.

Hint: one effective way to motivate/justify your choices of $f_i$ and $x_i$ is to present the relevant evidence in the data.
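As one illustration of the model form in this section, a linear candidate with $f_i(x_i; \beta_i) = x_i^\top \beta_i$ can be estimated by ordinary least squares. The sketch below uses synthetic data only: the predictor `living_area`, the coefficients, and the noise level are invented for illustration and are not results from the Ames data. It is one possible estimation approach, not a prescribed one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for one numerical predictor and the response.
n = 200
living_area = rng.uniform(50, 300, size=n)      # hypothetical predictor
noise = rng.normal(0, 10_000, size=n)           # hypothetical error term
price = 20_000 + 1_000 * living_area + noise    # hypothetical true relation

# Design matrix with an intercept column: f(x; beta) = beta_0 + beta_1 * x.
X = np.column_stack([np.ones(n), living_area])

# OLS estimate: the beta that minimises the sum of squared residuals.
beta_hat, *_ = np.linalg.lstsq(X, price, rcond=None)

# Training residuals, which can be plotted to check the error assumptions.
residuals = price - X @ beta_hat
```

A nonlinear candidate can reuse the same estimator by adding constructed columns (for example a squared term) to the design matrix, since such a model is still linear in the parameters.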

3 Model estimation and selection

Select the best model from the set of candidate models proposed in Section 2 using the “validation set” approach. In this section, you should:

• include a description of the model selection procedure that you adopted;

• report and discuss the estimation results (based on the training set) of each candidate model;

• discuss whether each candidate model is correctly specified based on residuals (obtained from fitting each model to the training set);

• report the validation performance (MSE) of each candidate model;

• identify the best model;

• discuss the complexity of the selected model in terms of bias-variance tradeoff.

The description of the model selection procedure (first point above) should provide enough details so that the reader is able to implement exactly what you have done by following your description.
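The validation set approach described in this section can be sketched as follows. This is a minimal illustration on synthetic data, under stated assumptions: the 60/20/20 split proportions, the fixed random seed, and the use of polynomial degree as the measure of complexity are all choices made for the example, not requirements of the assignment.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data standing in for one predictor x and the response y.
n = 300
x = rng.uniform(0, 1, size=n)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0, 0.1, size=n)

# Shuffle once, then split 60/20/20 (the proportions are an assumption).
idx = rng.permutation(n)
train, val, test = idx[:180], idx[180:240], idx[240:]

def design(x, degree):
    """Polynomial design matrix with columns 1, x, ..., x**degree."""
    return np.vander(x, degree + 1, increasing=True)

def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return float(np.mean((y_true - y_pred) ** 2))

# Candidate models M1, M2, M3 differ in complexity (polynomial degree).
candidates = {'M1': 1, 'M2': 2, 'M3': 5}
val_mse = {}
for name, degree in candidates.items():
    # Estimate the parameters on the training set only.
    beta, *_ = np.linalg.lstsq(design(x[train], degree), y[train], rcond=None)
    # Score the fitted model on the validation set only.
    val_mse[name] = mse(y[val], design(x[val], degree) @ beta)

# The selected model is the one with the smallest validation MSE.
best = min(val_mse, key=val_mse.get)
```

The `test` indices are deliberately left unused here, so that the test set plays no part in model selection.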


4 Model evaluation

Evaluate the generalisation performance of the selected model in Section 3 against two benchmark models. The generalisation performance should be measured by the observed MSE calculated using the test set. The two benchmark models are specified as follows.

• Let $C$ be the set constructed by combining (or concatenating) the observed sale prices in the training and validation sets. The first benchmark model (BM1) is the “constant mean” model given by

$$\hat{y}_{\mathrm{BM1}} := \frac{1}{m} \sum_{y \in C} y,$$

where $m > 0$ is the size of the set $C$. That is, BM1 will always give the sample mean of $C$ as its prediction, regardless of the values of any predictors.

• Let $N(x)$ be the subset of $C$ that contains only the sale prices from the neighbourhood $x$. E.g., $N(\text{‘OldTown’})$ contains the sale prices in $C$ that are associated with the neighbourhood ‘OldTown’. The second benchmark model (BM2) is the “neighbourhood mean” model given by

$$\hat{y}_{\mathrm{BM2}} := \frac{1}{m(x)} \sum_{y \in N(x)} y,$$

where $m(x) < m$ is the size of the set $N(x)$. That is, BM2 predicts the sale price by the average price of the corresponding neighbourhood.
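The two benchmark definitions above can be sketched in pandas as follows. The data frames are tiny hypothetical examples with invented prices; the column name `Neighborhood`, the label 'NAmes', and the fallback used when a test neighbourhood does not appear in $C$ are assumptions made for this illustration, since the definitions do not cover that case.

```python
import pandas as pd

# Hypothetical combined training + validation set C and a tiny test set.
combined = pd.DataFrame({
    'Neighborhood': ['OldTown', 'OldTown', 'NAmes', 'NAmes'],
    'SalePrice': [100_000, 120_000, 150_000, 170_000],
})
test = pd.DataFrame({
    'Neighborhood': ['OldTown', 'NAmes'],
    'SalePrice': [115_000, 150_000],
})

# BM1: the sample mean of C, whatever the predictors are.
bm1_value = combined['SalePrice'].mean()
bm1_pred = pd.Series(bm1_value, index=test.index)

# BM2: the mean of C within the test property's neighbourhood.
neighbourhood_mean = combined.groupby('Neighborhood')['SalePrice'].mean()
bm2_pred = test['Neighborhood'].map(neighbourhood_mean)
# Assumption: a neighbourhood absent from C falls back to the BM1 value.
bm2_pred = bm2_pred.fillna(bm1_value)

def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return float(((y_true - y_pred) ** 2).mean())

# Test-set MSE of each benchmark.
bm1_mse = mse(test['SalePrice'], bm1_pred)
bm2_mse = mse(test['SalePrice'], bm2_pred)
```

Both benchmarks are computed from $C$ alone; the test set is used only to score the predictions.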

In this section, you should

• combine the training and validation sets and re-estimate the selected model on the combined set;

• describe the model evaluation procedure;

• describe the two benchmark models;

• report and discuss the generalisation (i.e., test set) performance of the selected model against the two benchmark models.

5 Conclusion

In this section, you should

• discuss your findings;

• discuss any limitations of your project;

• suggest any potential extensions for future work.
