Date: 2019-08-17 10:45

Stat 203V

Final Exam.

Instructions: Solutions should be uploaded to Gradescope before Saturday, August

17 at 11:59 p.m. The deadline is firm: to avoid technical glitches, please start early.

There are 4 problems of equal points value. The datasets binge.txt along with

survey{.csv,.Rmd} and pima.Rmd are in the directory Final on the course website.

In writing solutions to the data problems, R Markdown is highly recommended. Include

only the pieces of output (and plots) needed for your answer. Readability of solutions

is important. Poorly organized presentation and/or errors in additional plots/analyses

may be penalized. For each question, as appropriate, explain clearly (i) your objectives, (ii)

any hypotheses that you are testing, (iii) the statistical procedure, and (iv) the statistical

support for your conclusions. An example (from a HW problem) is given in the files

sample-solution.{Rmd,pdf}.

Honor Code: Please respect the honor code in completing this exam. You can use

books, computers and the internet, but not other people.

1. [25 pts.] Some years back, the Centers for Disease Control issued a report on binge

drinking that received national attention. The data set binge.txt contains data for 48

states (no data for South Dakota and Tennessee) on the age-adjusted prevalence of binge

drinking (as a percentage of adults responding to a telephone survey).

The CDC article stated “Overall, states with the highest age-adjusted prevalence of

adult binge drinking were in the Midwest and New England, and included Alaska and

Hawaii.”

A question of interest might be whether the variation in binge drinking was associated

with climate, in particular the depth of winters. The file binge.txt includes columns with

the average winter temperature (degrees Celsius) and the state population (in millions).

Investigate the relation between prevalence of binge drinking and the predictors. For

example, can the regional variation be ascribed to differences in climate? Summarize your

findings.

2. [25 pts.] The file survey203.csv contains the survey data on lecture attendance and

Freedman practice problems studied (variables Lectures and Practice, with NA used for

no-response cases) merged with the midterm Scores. The R Markdown file survey203.Rmd

preprocesses the data to reorder the levels of the factors. You should add your answers in

your copy of this file.

(a) Create a variable that indicates whether the case (i.e. row) contains a missing value.

Is non-response to the survey associated with the midterm scores?

Use na.omit to create a data frame with no missing values for the rest of this question.
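
The mechanics of flagging incomplete rows and then restricting to complete cases might look like the sketch below. It uses an invented toy data frame, not survey203.csv: the column names follow the problem text, but the values, the level labels and the names Missing and survey_cc are assumptions made for illustration.

```r
# Toy data standing in for the survey file (values and level labels invented).
survey <- data.frame(
  Score    = c(71, 85, 64, 90, 78, 55),
  Lectures = factor(c("most", NA, "all", "all", NA, "some")),
  Practice = factor(c("few", "many", NA, "many", "few", "few"))
)

# TRUE when the row has at least one NA in any column.
survey$Missing <- !complete.cases(survey)

# Row counts by missingness status.
table(survey$Missing)

# Complete-case data frame; the indicator is constant there, so drop it.
survey_cc <- na.omit(survey)
survey_cc$Missing <- NULL
nrow(survey_cc)
```

The indicator is built before na.omit is applied, since na.omit discards exactly the rows the indicator is meant to describe.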

(b) Assess whether the levels of the factor Practice have a significant effect on the midterm

scores, both via ANOVA and by pairwise comparison of means. Summarize your conclusions.

(c) Now consider both factors and construct an ANOVA table with main effects and

interactions and interpret your conclusions.

(d) Use the function as.numeric() to “coerce” each of the factors to numeric variables

PracticeN, LectureN. Is there an association between these (coerced) variables? Fit a linear

model with Score as response and the numeric variables plus interaction as predictors,

and summarize your conclusions, including a comparison with the results from (c).
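
As a small illustration of what as.numeric() does to a factor: it returns the integer level codes, so the result depends on the order of the levels. The level labels below are hypothetical and need not match those in the survey file.

```r
# Hypothetical response levels, given in a meaningful order.
practice <- factor(c("some", "none", "all", "some"),
                   levels = c("none", "some", "all"))

# Level codes follow the declared order: none = 1, some = 2, all = 3.
practice_n <- as.numeric(practice)
practice_n

# With default (alphabetical) levels the codes change: all = 1, none = 2, some = 3.
practice_alpha_n <- as.numeric(factor(c("some", "none", "all", "some")))
practice_alpha_n
```

This is one reason the preprocessing step that reorders the factor levels matters before any coercion to numeric.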

3. [25 pts.] (a) Dropping a predictor that is orthogonal to the others doesn’t change the

coefficient estimates. More specifically, suppose that the $n \times p$ design matrix $X$, assumed

to be of full rank, is partitioned into $[X_A \; X_B]$ with $X_A$ being $n \times p_A$ and $X_B$ being $n \times p_B$,

with $p = p_A + p_B$. Let $\hat\beta$ be the least squares estimate using $X$ and $\hat\beta_A$ that using $X_A$.

Suppose that $X_B$ is orthogonal to $X_A$: $X_B' X_A = 0$. Show that

$\hat\beta_i = \hat\beta_{A,i}$ for $i = 1, \ldots, p_A$.
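
A numerical check on simulated data can make the claim in (a) concrete; it is an illustration only and does not replace the proof. In this sketch the sizes, the seed and the names XA, XB, Z, beta_full and beta_A are arbitrary choices, and the orthogonal block is manufactured by residualizing a random column on XA.

```r
set.seed(1)
n  <- 20
XA <- cbind(1, rnorm(n))        # n x 2 block (intercept plus one covariate)
Z  <- rnorm(n)

# Residuals of Z after projecting onto the columns of XA are orthogonal to XA.
XB <- qr.resid(qr(XA), Z)
y  <- rnorm(n)

# Least squares via the normal equations, with and without the XB column.
X         <- cbind(XA, XB)
beta_full <- solve(crossprod(X), crossprod(X, y))
beta_A    <- solve(crossprod(XA), crossprod(XA, y))

# Orthogonality holds up to rounding error.
orth_err <- max(abs(crossprod(XB, XA)))

# Compare the first pA = 2 coefficients of the full fit with the reduced fit.
same_coef <- all.equal(as.numeric(beta_full[1:2]), as.numeric(beta_A))
```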

(b) Consider a one way ANOVA model

$y_{ij} = \mu + \alpha_i + \epsilon_{ij}, \quad i = 1, \ldots, I, \; j = 1, \ldots, n_i.$

Suppose that the design is balanced, $n_i \equiv n_1$ for all $i$. Consider the design matrices

corresponding to “treatment” and “sum” contrasts. In each of the two cases, is the intercept

column orthogonal to the factor columns? What if the design is unbalanced? Explain.

(c) Now consider the coefficient differences $\alpha_i - \alpha_j$ in the balanced one way ANOVA

model. Do their estimates $\hat\alpha_i - \hat\alpha_j$ depend on whether treatment or sum contrasts are

chosen? Explain.

(d) In question 2(c), do the sums of squares in the ANOVA table change if the order in

which Practice and Lectures appear is switched? Can you explain briefly in words (i.e.

without detailed mathematical argument) why this might be?

4. [25 pts.] The National Institute of Diabetes and Digestive and Kidney Diseases

conducted a study on 768 adult female Pima Indians living near Phoenix. The purpose

of the study was to investigate factors related to diabetes. The data may be found in the

dataset pima in library(faraway). See also pima.Rmd in which some preprocessing is

done: it creates a factor version of the test results. And, as discussed in Ch. 1 of Faraway,

the zero values for variables diastolic, glucose, triceps, insulin and bmi in fact

seem to be missing values, so those are set to NA.

(a) Fit a model with the result of the diabetes test as the response and all the other

variables as predictors. How many observations were used in the model fitting?

(b) Refit the model but now without the insulin and triceps predictors. How many

observations were used in fitting this model? Devise a test to compare this model with

that in the previous question and report your conclusion. Hint: use na.omit() to create a

data frame with no missing values.

(c) Use AIC via the function step() to select a model. You will need to take account

of the missing values. Which predictors are selected? How many cases are used in your

selected model?
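
One common way to take account of missing values with step() is to fix the set of rows before the search, so that every candidate model is fitted to the same cases. The sketch below shows that pattern on simulated data, not on pima; the variable names x1, x2, x3, y and the objects d, d_cc, full and sel are placeholders.

```r
set.seed(42)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- rbinom(n, 1, plogis(0.8 * d$x1))

# Inject missing values into one predictor (40 distinct rows).
d$x3[sample(n, 40)] <- NA

# Restrict to complete cases first so the row set is the same for all models.
d_cc <- na.omit(d)
full <- glm(y ~ x1 + x2 + x3, family = binomial, data = d_cc)

# Stepwise AIC search starting from the full model; trace = 0 suppresses the log.
sel <- step(full, trace = 0)

formula(sel)   # predictors retained by the search
nobs(sel)      # number of cases used in the selected fit
```

If a predictor with many missing values is dropped by the search, the selected model could in principle be refitted on more rows than d_cc contains.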

(d) Create a variable that indicates whether the case contains a missing value. Use this

variable as a predictor of the test result. Is missingness associated with the test result?

Refit the selected model from (c), but now using as much of the data as is feasible. Explain

why it is appropriate to do this.

(e) Using the last fitted model of the previous question, what is the ratio of the odds of

testing positive for diabetes for a woman with a BMI at the first quartile compared with

a woman at the third quartile, assuming that all other factors are held constant? Give a

confidence interval for this ratio.
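
For a logistic regression with coefficient $b$ on a predictor, the odds ratio between two values of that predictor that differ by $\delta$ is $\exp(\delta b)$ when the other predictors are held fixed. A minimal sketch of that computation with a Wald interval follows; it uses simulated data rather than pima, and the model, the coefficients used in the simulation and all object names are assumptions.

```r
set.seed(203)
n   <- 300
bmi <- rnorm(n, mean = 32, sd = 6)     # simulated, not the pima data
age <- round(runif(n, 21, 70))
y   <- rbinom(n, 1, plogis(-5 + 0.09 * bmi + 0.03 * age))

fit <- glm(y ~ bmi + age, family = binomial)

# Difference in the predictor: first quartile minus third quartile.
q     <- quantile(bmi, c(0.25, 0.75))
delta <- unname(q[1] - q[2])

b  <- unname(coef(fit)["bmi"])
se <- sqrt(vcov(fit)["bmi", "bmi"])

# Point estimate of the odds ratio (first quartile relative to third).
or_hat <- exp(delta * b)

# 95% Wald interval on the log-odds scale, exponentiated; delta < 0, so sort.
or_ci <- sort(exp(delta * (b + c(-1, 1) * qnorm(0.975) * se)))
```

An interval based on confint(fit) for the same coefficient, scaled by delta and exponentiated in the same way, is an alternative to the Wald interval.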

(f) Do women who test positive have higher diastolic blood pressures? Is the diastolic

blood pressure significant in the regression model? Explain the distinction between the two

questions and discuss why the answers are only apparently contradictory.

