Take Home Exam
Using R for Economics and Statistics
Due: Monday, July 8, 11:59 PM
Your final work should be in .Rmd form.
Please note the due date.
Exercise 1 (Computation Time)
[12 points] For this exercise we will create data via simulation, then assess how well certain methods perform.
Use the code below to create a train and test dataset.
library(mlbench)
set.seed(42)
sim_trn = mlbench.spirals(n = 2500, cycles = 1.5, sd = 0.125)
sim_trn = data.frame(sim_trn$x, class = as.factor(sim_trn$classes))
sim_tst = mlbench.spirals(n = 10000, cycles = 1.5, sd = 0.125)
sim_tst = data.frame(sim_tst$x, class = as.factor(sim_tst$classes))
The training data is plotted below, with colors indicating the class variable, which is the response.
Before proceeding further, set a seed equal to your UIN.
uin = 123456789
set.seed(uin)
We’ll use the following to define 5-fold cross-validation for use with train() from caret.
library(caret)
cv_5 = trainControl(method = "cv", number = 5)
We now tune two models with train(). First, a logistic regression using glm. (This isn't actually "tuned", as there are no parameters to tune, but we use train() to perform cross-validation.) Second, we tune a single decision tree using rpart.
We store the results in sim_glm_cv and sim_tree_cv respectively, but we also wrap both function calls with
system.time() in order to record how long the tuning process takes for each method.
glm_cv_time = system.time({
  sim_glm_cv = train(class ~ ., data = sim_trn, trControl = cv_5, method = "glm")
})

tree_cv_time = system.time({
  sim_tree_cv = train(class ~ ., data = sim_trn, trControl = cv_5, method = "rpart")
})
We see that both methods are tuned via cross-validation in a similar amount of time.
glm_cv_time["elapsed"]
## elapsed
## 0.98
tree_cv_time["elapsed"]
## elapsed
## 1.25
library(rpart.plot)
rpart.plot(sim_tree_cv$finalModel)
Repeat the above analysis using a random forest, twice. The first time use 5-fold cross-validation. (This is
how we had been using random forests before we understood random forests.) The second time, tune the
model using OOB samples. We only have two predictors here, so, for both, use the following tuning grid.
rf_grid = expand.grid(mtry = c(1, 2))
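As a starting point, here is a minimal sketch of how the two random forests might be tuned and timed; the object names (oob, sim_rf_cv, sim_rf_oob) are illustrative, not required.

oob = trainControl(method = "oob")

# Random forest tuned with 5-fold cross-validation, timed as above.
rf_cv_time = system.time({
  sim_rf_cv = train(class ~ ., data = sim_trn, method = "rf",
                    trControl = cv_5, tuneGrid = rf_grid)
})

# Random forest tuned with OOB samples, timed as above.
rf_oob_time = system.time({
  sim_rf_oob = train(class ~ ., data = sim_trn, method = "rf",
                     trControl = oob, tuneGrid = rf_grid)
})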
Create a table summarizing the results of these four models (Logistic with CV, Tree with CV, RF with OOB, RF with CV). For each, report the following (one way to collect these quantities is sketched after the list):
Chosen value of tuning parameter (If applicable)
Elapsed tuning time
Resampled (CV or OOB) Accuracy
Test Accuracy
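A sketch of how these quantities might be extracted, assuming the objects from the sketch above; calc_acc is a hypothetical helper, not part of the assignment code.

# Hypothetical helper for test accuracy.
calc_acc = function(actual, predicted) {
  mean(actual == predicted)
}

sim_rf_oob$bestTune                # chosen value of mtry
rf_oob_time["elapsed"]             # elapsed tuning time
max(sim_rf_oob$results$Accuracy)   # resampled (OOB) accuracy
calc_acc(actual = sim_tst$class,   # test accuracy
         predicted = predict(sim_rf_oob, newdata = sim_tst))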
Exercise 2 (Predicting Baseball Salaries)
[12 points] For this question we will predict the Salary of Hitters. (Hitters is also the name of the
dataset.) We first remove the missing data:
library(ISLR)
Hitters = na.omit(Hitters)
After changing uin to your UIN, use the following code to test-train split the data.
uin = 123456789
set.seed(uin)
hit_idx = createDataPartition(Hitters$Salary, p = 0.6, list = FALSE)
hit_trn = Hitters[hit_idx,]
hit_tst = Hitters[-hit_idx,]
Do the following (a sketch of one possible approach appears after this list):
Tune a boosted tree model using the following tuning grid and 5-fold cross-validation.
gbm_grid = expand.grid(interaction.depth = c(1, 2),
n.trees = c(500, 1000, 1500),
shrinkage = c(0.001, 0.01, 0.1),
n.minobsinnode = 10)
Tune a random forest using OOB resampling and all possible values of mtry.
Create a table summarizing the results of three models:
Tuned boosted tree model
Tuned random forest model
Bagged tree model
For each, report:
Resampled RMSE
Test RMSE
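A minimal sketch of one possible approach. The mtry grid, the bagged-tree construction (a random forest with mtry equal to the number of predictors), and the rmse() helper are assumptions, not prescribed code.

# Boosted tree tuned with 5-fold cross-validation.
hit_gbm = train(Salary ~ ., data = hit_trn, method = "gbm",
                trControl = cv_5, tuneGrid = gbm_grid, verbose = FALSE)

# Random forest tuned with OOB resampling over all possible values of mtry.
oob = trainControl(method = "oob")
rf_grid_hit = expand.grid(mtry = 1:(ncol(hit_trn) - 1))
hit_rf = train(Salary ~ ., data = hit_trn, method = "rf",
               trControl = oob, tuneGrid = rf_grid_hit)

# Bagged tree: a random forest using all predictors at each split.
hit_bag = train(Salary ~ ., data = hit_trn, method = "rf",
                trControl = oob,
                tuneGrid = expand.grid(mtry = ncol(hit_trn) - 1))

# Hypothetical helper for test RMSE.
rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}
rmse(actual = hit_tst$Salary,
     predicted = predict(hit_rf, newdata = hit_tst))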
Exercise 3 (Transforming the Response)
[5 points] Continue with the data from Exercise 2. People often suggest log-transforming the response,
Salary, before fitting a random forest. Is this necessary? Re-tune a random forest as you did in Exercise 2,
except with a log-transformed response. Report the test RMSE for both the untransformed and transformed
models on the original scale of the response variable. (A sketch of one possible approach appears below.)
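A minimal sketch, assuming the oob control, rf_grid_hit grid, and rmse() helper from the Exercise 2 sketch are still in scope; exp() maps the predictions back to the original scale of Salary.

# Re-tune with a log-transformed response.
hit_rf_log = train(log(Salary) ~ ., data = hit_trn, method = "rf",
                   trControl = oob, tuneGrid = rf_grid_hit)

rmse(actual = hit_tst$Salary,                                   # untransformed
     predicted = predict(hit_rf, newdata = hit_tst))
rmse(actual = hit_tst$Salary,                                   # transformed
     predicted = exp(predict(hit_rf_log, newdata = hit_tst)))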
[Figure: histogram of Salary, with "Percent of Total" on the y-axis and Salary (0 to 2500) on the x-axis.]
Exercise 4 (Concept Checks)
[1 point each] Answer the following questions based on your results from the three exercises.
Timing
(a) Compare the time taken to tune each model. Is the difference between the OOB and CV result for the
random forest similar to what you would have expected?
(b) Compare the tuned value of mtry for each of the random forests tuned. Do they choose the same model?
(c) Compare the test accuracy of each of the four procedures considered. Briefly explain these results.
Salary
(d) Report the tuned value of mtry for the random forest.
(e) Create a plot that shows the tuning results for the boosted tree model. (For (e)-(g), see the sketch after this list.)
(f) Create a plot of the variable importance for the tuned random forest.
(g) Create a plot of the variable importance for the tuned boosted tree model.
(h) According to the random forest, what are the three most important predictors?
(i) According to the boosted model, what are the three most important predictors?
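For (e) through (g), caret's plot() and varImp() methods could be used; a sketch, assuming the model objects from the Exercise 2 sketch:

plot(hit_gbm)          # (e) tuning results for the boosted tree
plot(varImp(hit_rf))   # (f) variable importance, tuned random forest
plot(varImp(hit_gbm))  # (g) variable importance, tuned boosted tree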
Transformation
(j) Based on these results, do you think the transformation was necessary?
Exercise 5 (Neural Network)
[11 points]
Neural networks have always been among the most fascinating machine learning models, in my opinion, not
only because of the fancy backpropagation algorithm, but also because of their complexity (think of deep
learning with many hidden layers) and their structure, which is inspired by the brain. In this exercise you are
required to fit a simple neural network using the neuralnet package, and to fit a linear model as a comparison.
We are going to use the Boston dataset in the MASS package. The Boston dataset is a collection of data about
housing values in the suburbs of Boston. Our goal is to predict the median value of owner-occupied homes
(medv) using all the other continuous variables available.
set.seed(500)
library(MASS)
data <- Boston
First we need to check that no data points are missing; otherwise we would need to fix the dataset.
apply(data,2,function(x) sum(is.na(x)))
## crim zn indus chas nox rm age dis rad
## 0 0 0 0 0 0 0 0 0
## tax ptratio black lstat medv
## 0 0 0 0 0
There is no missing data, good. We first randomly split the data into a train set (75% of the total
sample) and a test set:
index <- sample(1:nrow(data),round(0.75*nrow(data)))
train <- data[index,]
test <- data[-index,]
Then do the following:
Fit a linear regression model, test it on the test set (that is, predict the test set), and compute the RMSE
(root mean squared error). Note that you should use the glm() function instead of lm(); this
will become useful later when cross-validating the linear model. (A sketch appears below.)
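A minimal sketch of this baseline; the rmse() helper is a hypothetical name, not from the assignment.

# Linear model via glm() (the default gaussian family reproduces lm()).
lm_fit <- glm(medv ~ ., data = train)
lm_pred <- predict(lm_fit, newdata = test)

# Hypothetical helper for test RMSE.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted) ^ 2))
rmse(test$medv, lm_pred)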
Before fitting a neural network, some preparation needs to be done; neural networks are not that easy to
train and tune. As a first step, we address data preprocessing. It is good practice to normalize
your data before training a neural network. We chose to use the min-max method and scale the data to the
interval [0, 1]. Usually scaling to the interval [0, 1] or [-1, 1] tends to give better results. We therefore scale
and split the data before moving on:
maxs <- apply(data, 2, max)
mins <- apply(data, 2, min)
scaled <- as.data.frame(scale(data, center = mins, scale = maxs - mins))
train_ <- scaled[index,]
test_ <- scaled[-index,]
There is no fixed rule as to how many layers and neurons to use, although there are several more or less
accepted rules of thumb. Usually, if necessary at all, one hidden layer is enough for a vast number of
applications. As far as the number of neurons is concerned, it should be between the input layer size and
the output layer size, usually around 2/3 of the input size. Since this is a toy example, we are going to use 2 hidden
layers with the configuration 13:5:3:1. The input layer has 13 inputs, the two hidden layers have 5 and 3
neurons, and the output layer has, of course, a single output since we are doing regression.
Fit the data with the neuralnet package, then plot the results.
Predict medv using the neural network and compare the RMSE with the result from the linear regression.
(A sketch of one possible approach appears below.)
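A sketch under two assumptions: older versions of neuralnet do not accept the medv ~ . shorthand, hence the pasted formula, and the rmse() helper from the earlier sketch is still in scope. Predictions are rescaled back to the original units of medv before computing RMSE.

library(neuralnet)

# Build the formula explicitly; older neuralnet versions reject "medv ~ .".
nn_vars <- setdiff(names(train_), "medv")
nn_formula <- as.formula(paste("medv ~", paste(nn_vars, collapse = " + ")))

# Two hidden layers with 5 and 3 neurons: the 13:5:3:1 configuration.
nn <- neuralnet(nn_formula, data = train_, hidden = c(5, 3),
                linear.output = TRUE)
plot(nn)

# Predict on the scaled test set, then undo the min-max scaling of medv.
pred_scaled <- compute(nn, test_[, nn_vars])$net.result
nn_pred <- pred_scaled * (maxs["medv"] - mins["medv"]) + mins["medv"]
rmse(test$medv, nn_pred)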