IIMT2641

Introduction to Business Analytics

Due November 7

Fall 2019

Assignment 4

In this problem, we will practice building CART models with a continuous outcome, using the dataset StateData.csv, which has data from the 1970s on all fifty US states. A description of the variables in the dataset is given in Table 1.

Variable        Description
Population      Population estimate of the state in 1975.
Income          Per capita income in the state in 1974.
Illiteracy      Illiteracy rate in 1970, as a percentage of the state’s population.
LifeExp         The life expectancy in years of residents of the state in 1970.
Murder          The murder and non-negligent manslaughter rate per 100,000 population in 1976.
HighSchoolGrad  The high-school graduation rate in the state in 1970.
Frost           The mean number of days with minimum temperature below freezing from 1931 to 1960 in the capital or a large city of the state.
Area            The land area (in square miles) of the state.
Longitude       The longitude of the center of the state.
Latitude        The latitude of the center of the state.
Region          The region (Northeast, South, North Central, or West) that the state belongs to.

Table 1: Variables in the dataset StateData.csv.

(a) Let us start by building a linear regression model. Randomly split the dataset into a training set (70%)

and a test set (30%).

(i) First, build a linear regression model to predict LifeExp using the following seven variables as the independent variables: Population, Murder, Frost, Income, Illiteracy, Area, and HighSchoolGrad. Use the training dataset to build the model. What is the R² of the model on the test set?

(ii) Now, build a linear regression model to predict LifeExp using the following four variables as the independent variables: Population, Murder, Frost, and HighSchoolGrad. Again, use the training dataset to build the model. What is the R² of the model on the test set?

(iii) Compare these two models. What are we achieving by removing independent variables? What

is the equivalent procedure in a CART model?
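
The 70/30 split and the test-set R² used in part (a) can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the assignment's required tooling. The function names, the seed, and the toy numbers are hypothetical, and the choice to measure the total sum of squares against the training-set mean is an assumed convention (some texts use the test-set mean instead).

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=1):
    """Shuffle row indices reproducibly and split the rows into train and test lists."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_train = round(train_fraction * len(rows))
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test

def out_of_sample_r2(actual, predicted, baseline_mean):
    """R^2 = 1 - SSE/SST, with SST measured against a baseline mean.

    Assumption: the baseline is the mean of the outcome in the training set.
    """
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - baseline_mean) ** 2 for a in actual)
    return 1 - sse / sst

# Fifty placeholder rows standing in for the fifty states.
states = list(range(50))
train, test = train_test_split(states)
print(len(train), len(test))  # 35 15

# Made-up life expectancies and predictions, not taken from StateData.csv.
actual = [70.0, 72.0, 71.0]
predicted = [70.5, 71.5, 71.0]
r2 = out_of_sample_r2(actual, predicted, baseline_mean=71.0)
print(r2)  # 0.75
```

Unlike the training-set R², this quantity can be negative: that happens whenever the model predicts the test set worse than the constant baseline does.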

(b) Now, build a CART model to predict LifeExp using the following seven variables as the independent variables: Population, Murder, Frost, Income, Illiteracy, Area, and HighSchoolGrad. Set the parameter minbucket to be 5. Make sure that you are building a regression tree, and not a classification tree, by setting the argument method to “anova” instead of “class”.

(i) Plot the tree. Which of the independent variables appear in the tree? Do you find the linear regression model or the CART model easier to interpret?

(ii) Compute the predicted life expectancies for the test dataset using the CART model, and calculate the R² of the predictions.
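
To make the role of minbucket in part (b) concrete, the sketch below searches for the best single split of a regression tree on one variable. It is a simplified illustration in plain Python, not the implementation of any CART package: it assumes the split is chosen to minimize the total within-node sum of squared errors, and it only accepts splits that leave at least minbucket observations in each child. The function names and the data are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y, minbucket=5):
    """Return (threshold, total_sse) for the best single split on x, or None.

    A split is allowed only if both children keep at least `minbucket` rows.
    """
    pairs = sorted(zip(x, y))
    best = None
    for i in range(minbucket, len(pairs) - minbucket + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (threshold, total)
    return best

# Made-up values: a murder-rate-like predictor and a life-expectancy-like outcome.
murder = list(range(1, 13))
life_exp = [72.0] * 6 + [69.0] * 6

print(best_split(murder, life_exp, minbucket=5))  # (6.5, 0.0)
print(best_split(murder, life_exp, minbucket=7))  # None
```

With twelve observations, a minbucket of 7 rules out every split, because no cut can leave seven rows on both sides. A larger minbucket therefore yields a smaller, simpler tree.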

(c) Now, build a random forest model to predict LifeExp using the same seven variables as the independent variables. Set the parameter nodesize to 5. Compute the predicted life expectancies for the test dataset using the random forest model, and calculate the R² of the predictions.

(d) Which of the four models you built do you think is the best model if out-of-sample accuracy is the most important? What about if interpretability is the most important?
