Term 1, 2019/2020

ACCT648 Applied Statistics for Data Analysis

Assignment 3

Submission deadline: Upload your answer file in Word format to e-Learn before 5 pm on

6 November 2019, and submit a hard copy during class on that day.

1. The owner of a moving company typically has his most experienced manager predict the

total number of labor hours (Hours) that will be required to complete an upcoming move.

This approach has proved useful in the past, but the owner has the business objective

of developing a more accurate method of predicting labor hours. In a preliminary effort

to provide a more accurate method, the owner has decided to use the number of cubic

feet moved (Feet), the number of pieces of large furniture (Large) and whether there is

an elevator in the apartment building (Elevator) as the independent variables and has

collected data for moves in which the origin and destination were within the borough

of Manhattan in New York City and the travel time was an insignificant portion of the

hours worked. The data are organized and stored in Moving2019.csv.

(a) Find the multiple regression equation L1 with all the three main independent variables.

(b) Find the multiple regression equation L2 with all the three main independent variables

with the interaction effect of Feet and Elevator.

(c) Find the multiple regression equation L3 with all the three main independent variables

with the interaction effect of Large and Elevator.

(d) Find the multiple regression equation L4 with all the three main independent variables

with the interaction effect of Feet and Large.

(e) When comparing all four regression models: L1, L2, L3, L4, explain why model L3

is the best model.

(f) Perform a residual analysis on the model L3 and determine whether the regression

assumptions are valid.

(g) Construct a 95% prediction interval estimate for the labor hours for moving 420

cubic feet with 2 large pieces of furniture in an apartment building that does not

have an elevator, using model L3.

(h) Construct a 95% confidence interval estimate for the average labor hours for moving

400 cubic feet with 3 large pieces of furniture in an apartment building that has an

elevator, using model L3.

(i) True or False: Under model L3, for a fixed number of cubic feet and at least one

large piece of furniture, a move in a building with an elevator requires, on average,

fewer labor hours than a move in a building without an elevator. Justify your answer.
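
For reference, the four models in parts (a) to (d) can be written out as below. This is only a notational sketch: the coefficient symbols are introduced here for illustration, and Elevator is assumed to be coded as a 0/1 indicator (1 = elevator present), which should be checked against the data file. Under that assumed coding, the last line shows which combination of L3 coefficients governs the comparison in part (i).

```latex
\begin{align*}
L_1:\ \text{Hours} &= \beta_0 + \beta_1\,\text{Feet} + \beta_2\,\text{Large} + \beta_3\,\text{Elevator} + \varepsilon \\
L_2:\ \text{Hours} &= \beta_0 + \beta_1\,\text{Feet} + \beta_2\,\text{Large} + \beta_3\,\text{Elevator}
                      + \beta_4\,(\text{Feet}\times\text{Elevator}) + \varepsilon \\
L_3:\ \text{Hours} &= \beta_0 + \beta_1\,\text{Feet} + \beta_2\,\text{Large} + \beta_3\,\text{Elevator}
                      + \beta_4\,(\text{Large}\times\text{Elevator}) + \varepsilon \\
L_4:\ \text{Hours} &= \beta_0 + \beta_1\,\text{Feet} + \beta_2\,\text{Large} + \beta_3\,\text{Elevator}
                      + \beta_4\,(\text{Feet}\times\text{Large}) + \varepsilon \\[4pt]
\text{Under } L_3:\quad
& E[\text{Hours}\mid \text{Elevator}=1] - E[\text{Hours}\mid \text{Elevator}=0]
  = \beta_3 + \beta_4\,\text{Large}
\end{align*}
```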


2. Based on the data set given in Question 1,

(a) Fit the multiple regression equation to predict the total number of labor hours with

all independent variables by using the Forward Selection and BIC criterion on the

training set. Plot the graph to show the number of variables versus BIC in each

selection step.

(b) Fit the multiple regression equation to predict the total number of labor hours

with all independent variables by using the Best Subset Selection with adjusted R²

criterion on the training set. Plot the graph to show the number of variables versus

adjusted R² in each selection step.

(c) Use the 5-fold cross-validation approach to fit the models of L1, L2, L3 and L4 and

determine which model is the best under the criterion of their associated cross-validation

errors. (Note: use set.seed(1208))

(d) Use the Leave-One-Out cross-validation approach to fit the models of L1, L2, L3 and

L4 and determine which model is the best under the criterion of their associated

cross-validation errors. (Note: use set.seed(5623))
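
The cross-validation procedure in parts (c) and (d) can be sketched as follows. This is a minimal illustration in Python rather than R, under stated assumptions: it uses a one-predictor least-squares fit instead of models L1 to L4, the data are synthetic (not Moving2019.csv), and the helper names `fit_simple_ols` and `k_fold_cv_mse` are hypothetical. Python's random generator is not the same as R's, so reusing the seeds 1208 and 5623 here will not reproduce R's fold assignments.

```python
import random


def fit_simple_ols(xs, ys):
    """Return (intercept, slope) of the least-squares line for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def k_fold_cv_mse(xs, ys, k, seed):
    """Equal-weight average of the held-out mean squared error over k folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds covering all rows
    fold_mses = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        b0, b1 = fit_simple_ols([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - (b0 + b1 * xs[i])) ** 2 for i in held_out]
        fold_mses.append(sum(errors) / len(errors))
    return sum(fold_mses) / k


# Synthetic data for illustration only (hypothetical values).
rng = random.Random(0)
feet = [rng.uniform(300, 1500) for _ in range(40)]
hours = [5 + 0.03 * f + rng.gauss(0, 3) for f in feet]

cv5 = k_fold_cv_mse(feet, hours, k=5, seed=1208)
# Leave-one-out is the special case k = n, so the seed does not affect the result.
loocv = k_fold_cv_mse(feet, hours, k=len(feet), seed=5623)
print(f"5-fold CV MSE: {cv5:.3f}")
print(f"LOOCV MSE:     {loocv:.3f}")
```

To compare several candidate models, the same fold assignment would be reused for each model and the one with the smallest cross-validation error preferred.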

3. Suppose we collect data for a group of 130 students in a statistical class with two

independent variables X1 = average studying hours per week, X2 = GPA, and one

dependent variable Y = Pass (or Fail).

We fit a logistic regression model, $\log(\text{odds}) = \beta_0 + \beta_1 X_1 + \beta_2 X_2$, to predict whether

a student will pass the course. The R output gives the estimated coefficients

$\hat{\beta}_0 = -9.5447$, $\hat{\beta}_1 = 0.5709$, and $\hat{\beta}_2 = 1.0682$.

The observations for the first five students are as follows:

Student   Y      X1     X2
1         Pass    9.4   3.03
2         Pass   14.5   3.52
3         Pass   12.2   3.14
4         Fail    8.4   2.76
5         Fail   11.3   3.20

(a) Based on the estimated logistic regression model, predict the probability that a

student who studies 11 hours per week on average and has a GPA of 3.40 will pass

the course.

(b) At least how many hours would the student in part (a) need to study to have more

than 70% predicted chance of passing the course?

(c) Find the deviance residuals of the first five observed students.

(d) Using the estimated logistic regression model with a threshold value of 0.55 for

classifying a student as passing the course, determine whether the model misclassifies

any of the five students above. If it does, state the type of error as well.
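
The calculations behind parts (a) to (d) can be sketched in Python as below. Several things here are assumptions rather than part of the question: the intercept is read as negative (−9.5447), "Pass" is treated as the positive class, a student is classified as passing when the predicted probability is strictly greater than the threshold, and the function names are hypothetical.

```python
import math

# Estimated coefficients as stated in the question (intercept assumed negative).
B0, B1, B2 = -9.5447, 0.5709, 1.0682


def pass_probability(hours, gpa):
    """Predicted probability of passing from the fitted logit."""
    eta = B0 + B1 * hours + B2 * gpa
    return 1.0 / (1.0 + math.exp(-eta))


def hours_for_probability(target, gpa):
    """Study hours at which the predicted probability equals `target`."""
    logit = math.log(target / (1.0 - target))
    return (logit - B0 - B2 * gpa) / B1


def deviance_residual(passed, p):
    """Signed square root of the observation's deviance contribution."""
    y = 1 if passed else 0
    log_lik = y * math.log(p) + (1 - y) * math.log(1 - p)
    sign = 1.0 if y == 1 else -1.0  # same sign as y - p, since 0 < p < 1
    return sign * math.sqrt(-2.0 * log_lik)


def classify(passed, p, threshold):
    """Compare the thresholded prediction with the observed outcome."""
    predicted_pass = p > threshold
    if predicted_pass == passed:
        return "correct"
    return "false positive" if predicted_pass else "false negative"


# (passed, X1 = study hours, X2 = GPA) for the five students in the table.
students = [
    (True, 9.4, 3.03),
    (True, 14.5, 3.52),
    (True, 12.2, 3.14),
    (False, 8.4, 2.76),
    (False, 11.3, 3.20),
]

for number, (passed, hours, gpa) in enumerate(students, start=1):
    p = pass_probability(hours, gpa)
    print(number, round(p, 4), round(deviance_residual(passed, p), 4),
          classify(passed, p, 0.55))

print("P(pass | 11 h, GPA 3.40):", round(pass_probability(11, 3.40), 4))
print("Hours for 70%:", round(hours_for_probability(0.70, 3.40), 2))
```

Because the predicted probability increases with study hours, any value above the break-even returned by `hours_for_probability` gives a predicted chance above the target.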


4. The stock prices of Singapore Telecommunications Limited (SingTel) with code (Z74.SI)

and Singapore Airlines Limited (SIA) with code (C6L.SI) from 27 August 2018 to 29

July 2019 are stored in SingTelSIA2019.csv. Suppose a portfolio investment has 8,000

shares of SingTel at price of $3.34 per share and 5,000 shares of SIA at price of $9.42

per share on 29 July 2019. Therefore, the portfolio investment has value of $73,820

(8,000 × 3.34 + 5,000 × 9.42) on 29 July 2019.

(a) Based on the historical approach without any assumption of distribution, calculate

the one-day 99% VaR for this portfolio on 29 July 2019.

(b) Without any assumption of distribution, estimate the one-day 99% VaR for this

portfolio on 29 July 2019 based on the Bootstrap approach with 100,000 repetitions.

(Note: use set.seed(5483))

(c) Obtain a 95% Bootstrap percentile confidence interval for the one-day 99% VaR for

this portfolio on 29 July 2019.
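
One way to organize the three parts is sketched below in Python. It is an illustration under assumptions, not a prescribed method: the price paths are synthetic (not SingTelSIA2019.csv), the quantile uses linear interpolation (the same convention as R's default), the bootstrap resamples the daily profit-and-loss scenarios with replacement, only 2,000 repetitions are run to keep the sketch fast, and all function names are hypothetical. Python's generator differs from R's, so seed 5483 will not reproduce R's draws.

```python
import random


def quantile(values, q):
    """Linear-interpolation sample quantile for 0 <= q <= 1."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def one_day_pnl_scenarios(prices_a, prices_b, shares_a, shares_b):
    """Apply each historical pair of daily returns to the latest holdings."""
    value_a = shares_a * prices_a[-1]
    value_b = shares_b * prices_b[-1]
    scenarios = []
    for t in range(1, len(prices_a)):
        return_a = prices_a[t] / prices_a[t - 1] - 1.0
        return_b = prices_b[t] / prices_b[t - 1] - 1.0
        scenarios.append(value_a * return_a + value_b * return_b)
    return scenarios


def value_at_risk(pnl, level):
    """VaR reported as a positive loss: minus the (1 - level) quantile of P&L."""
    return -quantile(pnl, 1.0 - level)


def bootstrap_var(pnl, level, repetitions, seed):
    """VaR estimates from resampling the P&L scenarios with replacement."""
    rng = random.Random(seed)
    n = len(pnl)
    return [value_at_risk(rng.choices(pnl, k=n), level) for _ in range(repetitions)]


# Synthetic price paths for illustration only (hypothetical values).
rng = random.Random(42)
stock_a, stock_b = [3.00], [9.00]
for _ in range(230):
    stock_a.append(stock_a[-1] * (1 + rng.gauss(0, 0.010)))
    stock_b.append(stock_b[-1] * (1 + rng.gauss(0, 0.012)))

pnl = one_day_pnl_scenarios(stock_a, stock_b, shares_a=8000, shares_b=5000)
historical_var = value_at_risk(pnl, 0.99)

estimates = bootstrap_var(pnl, 0.99, repetitions=2000, seed=5483)
bootstrap_estimate = sum(estimates) / len(estimates)
interval = (quantile(estimates, 0.025), quantile(estimates, 0.975))

print(f"Historical 99% VaR: {historical_var:.2f}")
print(f"Bootstrap 99% VaR:  {bootstrap_estimate:.2f}")
print(f"95% percentile CI:  ({interval[0]:.2f}, {interval[1]:.2f})")
```

Resampling whole scenarios keeps each day's two returns together, so the dependence between the two stocks on a given day is preserved in every bootstrap sample.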

5. The director of undergraduate studies at a college of business wants to predict whether

students in a BBA program can graduate with an honor degree using the independent variables

high school grade point average (GPA), SAT score, gender, and whether the student is a local citizen.

Data from a random sample of 90 students, organized and stored in BBA2019.csv,

show that 46 successfully completed the program with honor degrees (coded as Yes) and

44 without honor degrees (coded as No) under the variable column Graduate.

(a) Develop a logistic regression model, L1, to predict the probability of successfully

completing the BBA program with an honor degree, based on all independent variables.

(b) Develop another logistic regression model, L2, to predict the probability of successfully

completing the BBA program with an honor degree, based on the SAT, Gender,

and Local independent variables.

(c) Develop another logistic regression model, L3, to predict the probability of successfully

completing the BBA program with an honor degree, based on the SAT and Local

independent variables.

(d) Develop another logistic regression model, L4, to predict the probability of successfully

completing the BBA program with an honor degree, based on the SAT independent variable.

(e) Explain why model L4 is the best model among the four models considered. At the

0.05 level of significance, is there evidence that logistic regression model L4 fits

the data well?

(f) Under model L4, predict the probability that a male local citizen with a GPA of 3.45

and an SAT score of 1330 successfully completes the BBA program with an honor

degree.

(g) Find the confusion matrix of model L4 with the threshold value 0.6 for classifying

students as having successfully completed the BBA program with an honor degree.

(h) Find the sensitivity, specificity and total error rate of the model L4 with the threshold

value 0.6.
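
The quantities in parts (g) and (h) follow directly from counting the four outcome types. The sketch below illustrates this in Python under assumptions: the eight outcomes and predicted probabilities are made up (they are not from BBA2019.csv), "Yes" is treated as the positive class, a student is classified as positive when the predicted probability is strictly greater than the threshold, and the function names are hypothetical.

```python
def confusion_matrix(actual, probabilities, threshold):
    """Count true/false positives and negatives at the given threshold."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for is_positive, p in zip(actual, probabilities):
        predicted_positive = p > threshold
        if predicted_positive and is_positive:
            counts["TP"] += 1
        elif predicted_positive and not is_positive:
            counts["FP"] += 1
        elif not predicted_positive and is_positive:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


def summary_rates(counts):
    """Sensitivity, specificity, and total error rate from the counts."""
    total = sum(counts.values())
    sensitivity = counts["TP"] / (counts["TP"] + counts["FN"])
    specificity = counts["TN"] / (counts["TN"] + counts["FP"])
    error_rate = (counts["FP"] + counts["FN"]) / total
    return sensitivity, specificity, error_rate


# Made-up outcomes (True = graduated with honors) and fitted probabilities.
actual = [True, True, True, True, False, False, False, False]
fitted = [0.91, 0.72, 0.55, 0.64, 0.30, 0.61, 0.12, 0.48]

counts = confusion_matrix(actual, fitted, threshold=0.6)
sensitivity, specificity, error_rate = summary_rates(counts)
print(counts)
print(sensitivity, specificity, error_rate)
```

Raising the threshold generally trades sensitivity for specificity, which is why the threshold is stated explicitly in the question.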

-END-
