Date: 2019-12-04 09:51

Statistical learning

Department of Economics

Brock University

1 Assignment 2

1.1 Conceptual questions

1. Suppose that we wish to predict whether a given stock will issue a dividend this year

(“Yes” or “No”) based on X, last year’s percent profit. We examine a large number of

companies and discover that the mean value of X for companies that issued a dividend

was 10, while the mean for those that didn’t was 0. In addition, the variance of X for these

two sets of companies was 36. Finally, 80% of companies issued dividends. Assuming

that X follows a normal distribution, predict the probability that a company will issue a

dividend this year given that its percentage profit was X = 4 last year. Use equation (1)

from your notes on classification.

2. This problem has to do with odds. On average, what fraction of people with odds of

0.37 of defaulting on their credit card payment will in fact default?
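
Both conceptual questions reduce to short formulas. The sketch below assumes that equation (1) in the course notes is Bayes' theorem with one-dimensional normal class densities sharing a common variance (the notes are not reproduced here, so this is an assumption), and that odds are defined as p / (1 - p). The function names posterior_yes and odds_to_probability are illustrative only.

```python
import math


def normal_pdf(x, mean, variance):
    """Density of a normal distribution with the given mean and variance."""
    scale = math.sqrt(2.0 * math.pi * variance)
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / scale


def posterior_yes(x, mean_yes, mean_no, variance, prior_yes):
    """P(Yes | X = x) by Bayes' theorem, assuming both classes are normal
    with a shared variance (an assumption about equation (1))."""
    weighted_yes = prior_yes * normal_pdf(x, mean_yes, variance)
    weighted_no = (1.0 - prior_yes) * normal_pdf(x, mean_no, variance)
    return weighted_yes / (weighted_yes + weighted_no)


def odds_to_probability(odds):
    """Invert odds = p / (1 - p) to recover p."""
    return odds / (1.0 + odds)


if __name__ == "__main__":
    # Values from question 1: means 10 and 0, variance 36, prior 0.8, X = 4.
    print(round(posterior_yes(4, 10, 0, 36, 0.8), 3))
    # Value from question 2: odds of 0.37.
    print(round(odds_to_probability(0.37), 3))
```

Because the two classes share a variance, the normalising constant of the density cancels in the ratio; it is kept here only so that normal_pdf is a correct density on its own.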

1.2 Classification methods

This question should be answered using the Weekly data set, which is part of the ISLR package.

This data is similar in nature to the Smarket data except that it contains 1,089 weekly returns

for 21 years, from the beginning of 1990 to the end of 2010.

1. Produce some numerical and graphical summaries of the Weekly data. Do there appear

to be any patterns? For the numerical summaries focus on the means of the returns

(today and all lags) as well as on the correlation between today’s returns and the lags.

For the graphical summaries create a plot of today’s return versus its first lag and discuss.

2. Use the full data set to perform a logistic regression with Direction as the response and

the five lag variables plus Volume as predictors. Use the summary function to print the

results. Do any of the predictors appear to be statistically significant? If so, which ones?

Compute the predicted probabilities and obtain the following features: min, max, mean.

Discuss those features.

3. Compute the confusion matrix and overall fraction of incorrect predictions. Explain

what the confusion matrix is telling you about the types of mistakes made by the logistic

regression.

4. Use the full data set to perform a linear probability model (LPM) regression with Direction as the response and

the five lag variables plus Volume as predictors. Use the summary function to print

the results. Do any of the predictors appear to be statistically significant? If so, which

ones? Compute the predicted probabilities and obtain the following features: min, max,

mean. Discuss those features. Are the LPM probabilities sensible? Are they similar to those of

the logistic regression? Do you expect the confusion matrix to be similar to that of the

logistic regression?

5. Compute the confusion matrix and overall fraction of incorrect predictions for this LPM.

Is the matrix similar to the one obtained with the logistic regression?

6. Now fit the logistic regression model using a training data period from 1990 to 2008, with

Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of

incorrect predictions for the held-out data (that is, the data from 2009 and 2010).

7. Repeat (6) using LDA.

8. Repeat (6) using KNN with K = 1.

9. Which of these methods (logistic, LDA or KNN) appears to provide the best results on

this data? Why?
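
The mechanics behind questions 3, 6 and 8 (a confusion matrix, the overall error rate, a split by year, and KNN with K = 1) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a solution: the rows below are made-up values, not the Weekly data, and the helper names confusion_matrix, error_rate and knn1_predict are hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (predicted, actual) label pairs; for example ("Up", "Down")
    counts weeks predicted Up that actually went Down."""
    return Counter(zip(predicted, actual))


def error_rate(actual, predicted):
    """Overall fraction of incorrect predictions."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)


def knn1_predict(train_x, train_y, test_x):
    """K = 1 nearest neighbour on a single predictor: each test point takes
    the label of the closest training point."""
    predictions = []
    for x in test_x:
        nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
        predictions.append(train_y[nearest])
    return predictions


# Made-up (Year, Lag2, Direction) rows, NOT taken from the Weekly data set.
rows = [
    (2007, 0.8, "Up"), (2007, -1.2, "Down"), (2008, 0.3, "Up"),
    (2008, -0.4, "Down"), (2009, 0.9, "Up"), (2009, -1.0, "Up"),
    (2010, 0.2, "Down"), (2010, -0.5, "Down"),
]

# Train on years up to 2008 and hold out 2009 and 2010.
train = [r for r in rows if r[0] <= 2008]
test = [r for r in rows if r[0] >= 2009]

predicted = knn1_predict([r[1] for r in train], [r[2] for r in train],
                         [r[1] for r in test])
actual = [r[2] for r in test]

matrix = confusion_matrix(actual, predicted)
test_error = error_rate(actual, predicted)
```

The same confusion_matrix and error_rate helpers apply unchanged to predictions from any of the other classifiers, which makes the comparison in question 9 a matter of swapping the prediction step.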

1.3 Cross-validation

In this question you will use the glm() and predict() functions, and a for loop to compute the

LOOCV error for a simple logistic regression model on the Weekly data set.

1. Fit a logistic regression model that predicts Direction using Lag1 and Lag2.

2. Fit a logistic regression model that predicts Direction using Lag1 and Lag2 using all but

the first observation.

3. Use the model from (2) to predict the direction of the first observation. You can do this by

predicting that the first observation will go up if P(Direction = “Up” | Lag1, Lag2) > 0.5.

Was this observation correctly classified?

4. Write a for loop from i = 1 to i = n, where n is the number of observations in the data

set, that performs each of the following steps:

i. Fit a logistic regression model using all but the ith observation to predict Direction

using Lag1 and Lag2.

ii. Compute the posterior probability of the market moving up for the ith observation.

iii. Use the posterior probability for the ith observation in order to predict whether or

not the market moves up.

iv. Determine whether or not an error was made in predicting the direction for the ith

observation. If an error was made, then indicate this as a 1, and otherwise indicate

it as a 0.

5. Take the average of the n numbers obtained in (4)iv in order to obtain the LOOCV

estimate for the test error. Comment on the results.
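
The loop in steps 4 and 5 can be sketched as follows. This shows only the structure of LOOCV: the fit and predict arguments stand in for glm() and predict(), and the majority-class placeholder model, the toy data and all function names are assumptions made for illustration, not part of the assignment.

```python
from collections import Counter


def fit_majority(train_x, train_y):
    """Placeholder 'model': remembers the most common training label.
    A real solution would fit a logistic regression here instead."""
    return Counter(train_y).most_common(1)[0][0]


def predict_majority(model, x):
    """Placeholder prediction: ignores x and returns the stored label."""
    return model


def loocv_error(xs, ys, fit, predict):
    """Leave-one-out cross-validation estimate of the test error."""
    n = len(ys)
    errors = []
    for i in range(n):
        # Step i: fit on all observations except the i-th.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)
        # Steps ii-iii: predict the held-out observation (a logistic model
        # would threshold its posterior probability at 0.5 here).
        guess = predict(model, xs[i])
        # Step iv: record 1 for an error and 0 otherwise.
        errors.append(1 if guess != ys[i] else 0)
    # Step 5: average the n indicators.
    return sum(errors) / n


# Made-up (Lag1, Lag2) pairs and directions, NOT taken from the Weekly data.
toy_x = [(0.1, -0.3), (0.4, 0.2), (-0.6, 0.5), (0.3, -0.1)]
toy_y = ["Up", "Up", "Up", "Down"]
estimate = loocv_error(toy_x, toy_y, fit_majority, predict_majority)
```

Passing the model-fitting and prediction steps in as functions keeps the loop itself independent of the classifier, so the same skeleton works once a logistic regression replaces the placeholder.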

Notes:

• Have a look at the Course Outline (on Sakai) for more info on how to create tables.

• The report must be typed.

• The report should have a title page, be single-spaced, and be typed in a 12-point font.

• Your computer code and output should be included in the appendix.

• Pay attention to your graphs.

• Descriptive statistics, when applicable, should be reported in a table.

• Regression results should also be presented in a table. The first column of your table

should contain the list of independent variables (starting with the constant). The remaining

columns should contain the results for the different models. The last few rows of the

table should contain the sample size and two measures of goodness of fit.

• When using a test statistic, report the null being tested, the formula for the test statistic,

and how it was computed (e.g., using a regression, and if so, which regression). Make sure

to report a conclusion for that test (e.g., I reject the null because XXXX and this implies

that XXXX).
