Name:
ID:
Discussion Section (A, B, or C):
Final Exam, Stats 101C
You'll need to turn in two documents:
Document #1: a pdf document that has only the answers to the questions on this exam. Make sure each answer is clearly labeled so that we know which question it belongs to. For example:
Part A, 5:
DF: My answer
Sum Sq: My answer
etc
Document #2: a .R file that contains all of the commands you used to generate answers for Part B. Name it lastname.firstname.R
Rules: You may not talk to anyone about this exam other than me or the TAs. You are welcome to use your own notes, the textbook and anything provided via the CCLE. You may not use external resources or online resources with the exception of resources to assist you with R.
This test is not intended to test your skill with R. If you run into error messages or otherwise find that you can't continue because R isn't cooperating, please send us an email right away (include the code, perhaps the entire .R file, so we can look it over). If you are confused by a question, please assume the fault is with the question-writer and not with your own understanding. Ask us about it. If you were wrong in your assumption, we will gently tell you so. My goal is that you should not get a question wrong because you misinterpreted the question.
Part A
This first part is based on the "IPEDS" dataset, a national database of institutions of higher learning. The data you'll examine includes a random sample of four-year, non-profit institutions in 2016. The output you need to answer these questions is given after the last question of Part A. For these and all questions, please type your answers directly into this document.
IPEDS Data:
grad.rate: the percent of students who graduate within four years. Values are integers that represent percents. For example, 23 means the graduation rate is 23%.
sector: indicates whether the school is public or private.
Average.loans: in dollars, the average student debt at graduation.
SAT.reading.25p: the 25th percentile of the SAT reading score for enrolled students.
SAT.math.25p: the 25th percentile of the SAT math score for enrolled students.
SFR: student-faculty ratio. (Number of students per full-time faculty member)
1. (3) According to the output, do public schools and private schools differ in mean graduation rates, controlling for the other variables? Give an answer ("yes" or "no") and explain how you reached this conclusion. Use a 5% significance level.
2. (3) Why do the p-values for SFR differ in the anova and summary tables? Choose the best:
a) The anova table is testing the null hypothesis that the slope for SFR is 0, and the summary table is testing the null hypothesis that the slope for SFR is 1.
b) The anova table is testing the null hypothesis that the slope for SFR is 0 given that SAT.math.25p, SAT.reading.25p, and Average.loans are all in the model, while the summary table is testing that the slope is 0 given that SAT.math.25p, SAT.reading.25p, Average.loans, and sector are in the model.
c) The anova table is based on the F statistic while the summary table is based on the t-statistic.
d) The anova table is testing the null hypothesis that the slope for SFR is 0 given that all of the other variables are included in the model, and the summary table is testing the null hypothesis that the slope for SFR is 0 given that no other variables are included in the model.
3. (2) Which of the following is the best interpretation of the coefficient for sector? (Assume the model is valid.) Indicate the best choice.
a) Among all schools with similar loan amounts, similar SAT reading and math 25th percentiles, and similar student-faculty ratios, the graduation rate at public universities is about 2.6 percentage points lower, on average, than at private universities.
b) The graduation rate at public universities is about 2.6 percentage points lower than at private universities.
c) The mean graduation rate at public universities is about 2.6 percentage points lower than at private universities.
4. (2) What is the interpretation of a 5% significance level in the context of testing whether public and private schools differ in mean graduation rates? Indicate the best interpretation from among these:
a) The probability that we will conclude that public and private schools differ in mean graduation rates when, in fact, they are the same, is 5%.
b) If public and private schools do not differ in mean graduation rates, then the probability of getting a test statistic as extreme or more extreme than 0.976 is 5%.
c) The probability that public and private schools differ in mean graduation rates is 5%.
d) The probability that public and private schools have the same mean graduation rates is 5%.
5. (3) Notice that the last line in the anova table has been removed. SYY = 355,246. Give the values to fill in the rest of the table:
Df:
Sum Sq:
Mean Sq:
F value:
Pr(>F):
6. (2) A politician sees this analysis and notes that the coefficient for student-faculty ratio is negative and statistically significant. He says "This analysis shows that if we lower student-faculty ratios, then graduation rates will increase." This interpretation is
a) Valid
b) Invalid
7. (2) The p-value of 0.04710 for SFR is best interpreted as the probability that the null hypothesis is correct. Is this a valid statement?
a) Valid
b) Invalid
8. (2) The p-value of 0.04710 for SFR is best interpreted as the probability the null hypothesis is wrong. Is this a valid statement?
a) Valid
b) Invalid
9. (2) Suppose you have fit two different models to predict the salary of a worker in the U.S. based on a number of different predictor variables. Model 1 has an R-squared of 90% and its residual plot shows a trend. Its other diagnostic plots look good. Model 2 has an R-squared of 60% and all of its diagnostic plots look good. Which model should you use?
a) Model 1
b) Model 2
10. (2) Explain your choice for (9):
IPEDS: R OUTPUT
> str(fouryearipeds)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2374 obs. of 6 variables:
$ grad.rate : int 22 64 10 31 45 50 43 20 41 44 ...
$ sector : chr "Private not-for-profit, 4-year or above" "Private not-for-profit, 4-year or above" "Private not-for-profit, 4-year or above" "Private not-for-profit, 4-year or above" ...
$ Average.loans : int 3555 7236 2513 6380 7297 2062 5724 7861 6918 8743 ...
$ SAT.reading.25p: int NA 500 320 NA 460 NA NA NA 410 422 ...
$ SAT.math.25p : int NA 500 200 NA 450 NA NA NA 420 390 ...
$ SFR : int 10 13 18 18 15 6 9 17 12 14 ...
> table(fouryearipeds$sector)
Private not-for-profit, 4-year or above Public, 4-year or above
1653 721
> m1 <- lm(grad.rate~Average.loans+SAT.reading.25p+SAT.math.25p+SFR+sector,data=fouryearipeds)
> summary(m1)
Call:
lm(formula = grad.rate ~ Average.loans + SAT.reading.25p + SAT.math.25p +
SFR + sector, data = fouryearipeds)
Residuals:
     Min       1Q   Median       3Q      Max
 -54.768   -6.150    0.596    6.601   38.514
Coefficients:
                                Estimate Std. Error t value Pr(>|t|)
(Intercept)                   -4.331e+01  3.046e+00 -14.221  < 2e-16 ***
Average.loans                  1.573e-03  1.963e-04   8.011 2.77e-15 ***
SAT.reading.25p                8.455e-02  1.178e-02   7.177 1.27e-12 ***
SAT.math.25p                   1.086e-01  1.072e-02  10.138  < 2e-16 ***
SFR                           -1.812e-01  9.119e-02  -1.988  0.04710 *
sectorPublic, 4-year or above -2.602e+00  7.907e-01  -3.291  0.00103 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.855 on 1149 degrees of freedom
(1219 observations deleted due to missingness)
Multiple R-squared: 0.6859, Adjusted R-squared: 0.6845
F-statistic: 501.7 on 5 and 1149 DF, p-value: < 2.2e-16
> anova(m1)
Analysis of Variance Table
Response: grad.rate
                  Df Sum Sq Mean Sq  F value    Pr(>F)
Average.loans      1  11715   11715  120.618 < 2.2e-16 ***
SAT.reading.25p    1 220038  220038 2265.507 < 2.2e-16 ***
SAT.math.25p       1   8646    8646   89.014 < 2.2e-16 ***
SFR                1   2198    2198   22.628 2.215e-06 ***
sector          xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Residuals       1149 111597      97
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
PART B
In this part, please load the provided data set into R. The dataset you load depends on your discussion section. The dataset is in "FinalData.csv".
You are expected to turn in a .R file that includes all commands you used to prepare your answers for Part B. (You need not include any calculations you may have performed for part A.)
The data are from a study you met on the midterm and are based on a random sample of "Coconut" crabs found in Japan that are famous for their extreme pinching force.
The variable size is the sum of the variables ThoraxLength, ClawLength, and ClawHeight.
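R hint: a minimal sketch of reading a CSV file and checking it before you start modeling. The small data frame and the file name FinalData_demo.csv are made up so the example runs on its own; with the real data you would pass the path to "FinalData.csv" to read.csv() instead.

```r
# Hypothetical stand-in for FinalData.csv so this sketch is self-contained.
demo_path <- file.path(tempdir(), "FinalData_demo.csv")
demo <- data.frame(
  PinchingForce = c(310.5, 820.1, NA, 1250.7),
  Weight        = c(210, 615, 480, 990),
  Sex           = c("F", "M", "F", "M")
)
write.csv(demo, demo_path, row.names = FALSE)

# Read the file; stringsAsFactors = TRUE turns Sex into a factor.
crabs <- read.csv(demo_path, stringsAsFactors = TRUE)

str(crabs)             # check variable names and types
colSums(is.na(crabs))  # count missing values in each column
```

Checking for missing values early is worthwhile because lm() silently drops incomplete rows.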
1. Fit a basic linear model using only size, Weight and Sex to predict pinching force. Do not use any transformations or higher-order terms.
a) (2) Write the equation of the model: Predicted_PinchingForce=
b) (3) Comment on the model validity with respect to these three conditions. Type the word "is" or "isn't" and then give your reason.
Linear trend condition [is or isn't?] satisfied
because:
Constant Variance condition [is or isn't?] satisfied
because:
Normal distribution of errors condition [is or isn't?] satisfied
because:
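R hint: a minimal sketch of fitting a model of the form asked for in question 1 and drawing the standard diagnostic plots. The data here are simulated (they are not the crab data); only the column names are borrowed from this exam, and the object names sim and m_basic are placeholders.

```r
# Simulated stand-in data (NOT the real crab measurements).
set.seed(101)
n <- 40
sim <- data.frame(
  Weight = runif(n, 100, 1500),
  Sex    = factor(rep(c("F", "M"), length.out = n))
)
sim$size <- 60 + 0.05 * sim$Weight + rnorm(n, sd = 5)
sim$PinchingForce <- (5 + 0.02 * sim$Weight + rnorm(n, sd = 1.5))^2

# Untransformed model with three predictors.
m_basic <- lm(PinchingForce ~ size + Weight + Sex, data = sim)
coef(m_basic)  # estimates to plug into the model equation

# Default plot.lm panels: residuals vs fitted (linear trend),
# normal Q-Q (normality), scale-location (constant variance),
# and residuals vs leverage.
if (interactive()) {
  op <- par(mfrow = c(2, 2))
  plot(m_basic)
  par(op)
}
```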
2. (2) Create an Inverse Response Plot. What transformation of PinchingForce provides the lowest residual sum of squares?
3. (2) What transformation of PinchingForce is suggested by the Box-Cox transformation?
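R hint: a minimal sketch of the calculations behind questions 2 and 3, on simulated data (not the crab data). The first part reproduces the inverse response idea by hand: regress the fitted values on the response raised to each candidate power and compare residual sums of squares. The second part uses boxcox() from the MASS package. If your course uses the car package instead (the "Rounded" power wording suggests car::powerTransform(), but that is an assumption), the results should be read the same way.

```r
library(MASS)  # provides boxcox()

# Simulated stand-in data (NOT the real crab measurements).
set.seed(101)
n <- 40
sim <- data.frame(
  Weight = runif(n, 100, 1500),
  Sex    = factor(rep(c("F", "M"), length.out = n))
)
sim$size <- 60 + 0.05 * sim$Weight + rnorm(n, sd = 5)
sim$PinchingForce <- (5 + 0.02 * sim$Weight + rnorm(n, sd = 1.5))^2
m_basic <- lm(PinchingForce ~ size + Weight + Sex, data = sim)

# Inverse response idea: regress fitted values on y^lambda
# (log(y) when lambda is 0) and record the residual sum of squares.
yhat <- fitted(m_basic)
y    <- sim$PinchingForce
rss_for <- function(lambda) {
  ty <- if (lambda == 0) log(y) else y^lambda
  sum(resid(lm(yhat ~ ty))^2)
}
lambdas <- c(-1, -0.5, 0, 0.5, 1)
rss <- sapply(lambdas, rss_for)
data.frame(lambda = lambdas, RSS = rss)  # smallest RSS marks the preferred power

# Box-Cox: profile log-likelihood over a grid of lambda values.
bc <- boxcox(m_basic, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat  # grid value with the highest log-likelihood
```

The response must be strictly positive for both methods.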
4. (3) Fit the model using the transformation for PinchingForce based on the Box-Cox power transform (using the "Rounded" power). Which model do you think is better, in terms of model validity: the "basic" model in question 1 or this model? Explain.
5. (3) At the midterm, we found that the pinching force for male crabs was greater than for female crabs. Explain why this is not the case with the current model. (Hint: note that male crabs tend to be bigger and heavier than females.)
6. (2) Give the variance inflation factors for each variable for the transformed model from question 4:
Sex
Weight
size
7. (2) What do these values for VIF tell us in this context?
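R hint: a minimal sketch of how a variance inflation factor is computed, using base R only and simulated data (not the crab data). For each predictor, VIF = 1 / (1 - R-squared), where R-squared comes from regressing that predictor on all the others. The helper vif_manual() is a made-up name for illustration, and the square root is only a placeholder for whatever transformation you chose.

```r
# Simulated stand-in data (NOT the real crab measurements).
set.seed(101)
n <- 40
sim <- data.frame(
  Weight = runif(n, 100, 1500),
  Sex    = factor(rep(c("F", "M"), length.out = n))
)
sim$size <- 60 + 0.05 * sim$Weight + rnorm(n, sd = 5)
sim$PinchingForce <- (5 + 0.02 * sim$Weight + rnorm(n, sd = 1.5))^2

# Placeholder transformation; substitute your own.
m_t <- lm(sqrt(PinchingForce) ~ size + Weight + Sex, data = sim)

# Hypothetical helper: VIF for each column of the design matrix.
vif_manual <- function(model) {
  X <- model.matrix(model)[, -1, drop = FALSE]  # drop the intercept column
  sapply(colnames(X), function(v) {
    others <- X[, colnames(X) != v, drop = FALSE]
    r2 <- summary(lm(X[, v] ~ others))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(m_t)
round(vifs, 2)
```

For a two-level factor such as Sex, the value for its single dummy column is the VIF for the variable.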
8. Perform best subsets regression, forward stepwise, and backward stepwise selection to develop the "best" model, using BIC as the criterion. Use your transformed version of PinchingForce. Include these predictors to start: Weight, ThoraxLength, Sex, ClawLength, ClawHeight, ClawWeight. Note that the three approaches may give you three different models. Choose the one with the lowest BIC. Be sure to state the BIC value for your choice. Use this model to answer these questions:
a) (2) Give the equation for the final model you chose:
b) (2) BIC for final model:
c) (2) Suppose we had just caught a coconut crab with these measurements:
ThoraxLength: 52
Weight: 615
Sex: Male
ClawLength: 67
ClawHeight: 26
ClawWeight: 34
Predict its pinching force at a 95% level (give the appropriate interval).
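R hint: a minimal sketch of stepwise selection with a BIC penalty and of a prediction interval on a transformed response, using simulated data (not the crab data) and only three predictors. The square root is a placeholder transformation, and the object names are made up. Best subsets is not shown; leaps::regsubsets() is one common option if that package is available to you.

```r
# Simulated stand-in data (NOT the real crab measurements).
set.seed(101)
n <- 40
sim <- data.frame(
  Weight = runif(n, 100, 1500),
  Sex    = factor(rep(c("F", "M"), length.out = n))
)
sim$ClawLength <- 40 + 0.045 * sim$Weight + rnorm(n, sd = 4)
sim$PinchingForce <- (5 + 0.02 * sim$Weight + rnorm(n, sd = 1.5))^2
sim$ClawLength[c(3, 17)] <- NA  # mimic missing measurements

# step() needs the same rows in every candidate model, so drop
# incomplete cases once, up front.
sim_cc <- na.omit(sim)
n_cc   <- nrow(sim_cc)

m_full <- lm(sqrt(PinchingForce) ~ Weight + Sex + ClawLength, data = sim_cc)
m_null <- lm(sqrt(PinchingForce) ~ 1, data = sim_cc)

# k = log(n) turns the AIC penalty used by step() into the BIC penalty.
m_back <- step(m_full, direction = "backward", k = log(n_cc), trace = 0)
m_fwd  <- step(m_null, direction = "forward",
               scope = list(lower = ~1, upper = ~ Weight + Sex + ClawLength),
               k = log(n_cc), trace = 0)
c(backward = BIC(m_back), forward = BIC(m_fwd))

# 95% prediction interval for one new crab, on the square-root scale.
new_crab <- data.frame(
  Weight     = 615,
  Sex        = factor("M", levels = levels(sim_cc$Sex)),
  ClawLength = 67
)
pi_sqrt <- predict(m_back, newdata = new_crab,
                   interval = "prediction", level = 0.95)

# Back-transform to the original units (valid here because the
# lower limit on the square-root scale is positive).
pi_sqrt^2
```

The criterion value printed by step() differs from BIC() by a constant, so the ranking of models is the same; report the value from BIC().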
9. Consider the output below:
> summary(mfull)
Call:
lm(formula = sqrt(PinchingForce) ~ Weight + Sex + ClawLength +
ClawHeight + ClawWeight, data = crabs2)
Residuals:
     Min       1Q   Median       3Q      Max
-0.77961 -0.39837 -0.01247  0.28257  1.16845
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.027e-01  8.733e-01  -0.576    0.572
Weight       8.123e-05  8.164e-04   0.100    0.922
SexM         6.776e-02  3.194e-01   0.212    0.834
ClawLength   6.365e-02  5.397e-02   1.179    0.255
ClawHeight   5.816e-02  7.883e-02   0.738    0.471
ClawWeight   4.621e-02  1.396e-01   0.331    0.745
Residual standard error: 0.607 on 17 degrees of freedom
(6 observations deleted due to missingness)
Multiple R-squared: 0.9736, Adjusted R-squared: 0.9659
F-statistic: 125.5 on 5 and 17 DF, p-value: 8.375e-13
a) (2) What null and alternative hypotheses does the F-statistic test?
b) (2) What do you conclude based on the p-value (using a significance level of 0.05)?
c) (3) Extra Credit: What's going on here?