STOR 556: Spring 2019
Take-home Final Exam With Grade Scheme
Answer all questions.
This is a take-home exam that you are expected to do in your own time and hand in no later
than noon on Thursday, May 2. The exam should be submitted via the “Assignments” tab of the
course Sakai page.
Rules of the Exam. All course resources including text, personal notes and resources available
through R or R-Studio are permitted. Your submitted answers should include full verbal answers to
the questions, illustrated where appropriate by R code, tables or figures. Very long-winded answers
are discouraged; greatest credit will be given for full but concise answers to the questions. Solutions
may be submitted in R-Markdown but this is not required. (A fully acceptable alternative is if
you submit a Word document into which you cut and paste R output as appropriate; however, I
recommend you “save as” a pdf file for the final submission.) Other web resources may be used if
fully acknowledged and referenced. Discussion among yourselves or with an outside party is not
permitted; you are allowed to email the instructor if you find the question ambiguous or if you
think there is an error, but the instructor will not give advice on how to solve the problems.
The datasets have been previously posted (see the “Resources” tab in Sakai for instructions on how
to download them); please download the data first and contact the instructor immediately if you
have any problem with this step.
Please acknowledge you accept these conditions by copying out and signing:
PLEDGE: I will neither give nor receive unauthorized aid in this exam.
SIGNED: (A typed signature will be accepted)
1. The “dolphins” dataset documents the number of dolphin bycatch (dolphins caught by accident
in fishing nets) over six seasons (1989/90 through 1994/95) in two areas in New Zealand
(North Taranaki and South Taranaki), for two types of gear (Bottom and Midwater) and for
both day and night trawls. For each combination of season, area, gear type and time of day,
the dataset documents the number of tows (fishing trawls observed) and number of dolphin
bycatch. Five of the possible 48 rows of data are absent because the number of tows was 0.
(a) A reasonable model is that the number of bycatch $y_{ijk\ell}$ in season $i$, area $j$, type of gear
$k$ and time of day $\ell$ is
$$y_{ijk\ell} \sim \text{Poisson}(\lambda_{ijk\ell} T_{ijk\ell}),$$
$$\log \lambda_{ijk\ell} = \beta_0 + \beta_{S_i}\,\text{Season}_i + \beta_{A_j}\,\text{Area}_j + \beta_{G_k}\,\text{Gear}_k + \beta_{T_\ell}\,\text{Time}_\ell \tag{1}$$
where $T_{ijk\ell}$ is the number of tows and each of the variables Season, Area, Gear and Time
is treated as a factor variable, $\beta_0$ is the intercept and each of $\beta_{S_i}$, $\beta_{A_j}$, $\beta_{G_k}$ and $\beta_{T_\ell}$
represents a regression coefficient for the corresponding variable. Fit the model (1) and
display a table showing the regression coefficients and their standard errors. [6 points]
(b) Can any of the terms from the model (1) be dropped? Test each of the terms in turn,
using an appropriate $\chi^2$ or F test, and state your conclusions. [5 points]
(c) Do the data show evidence of overdispersion? Use appropriate diagnostics and tests,
and if necessary, repeat your calculations of part (b). [5 points]
(d) Would the model (1) be improved by adding interactions? Test, in particular, interactions
among the Area, Gear and Time variables, and state your conclusions. [4 points]
(e) It can be seen that in one season (1990/91), there were no dolphin bycatch. To what
extent does that year’s data bias the conclusions? Refit the model without 1990/91, and
describe any significant features that change. [5 points]
(f) For whatever model you finally accepted based on parts (a)–(e), investigate the data
for (i) non-random patterns among the residuals, (ii) datapoints of high leverage, (iii)
points of high influence. Overall, would you describe this as a satisfactory analysis, or
if not, why not? [7 points] ([32 points for the whole question])
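
As orientation for Question 1, here is a minimal sketch (not a model answer) of how a Poisson regression with an exposure offset might be set up in R. It runs on simulated data, because the actual column names of the “dolphins” dataset are not given in this sheet; the names season, area, gear, time, tows and bycatch, and the simulated rates, are assumptions made purely for illustration. The term offset(log(tows)) plays the role of the multiplier T in model (1), drop1() gives one likelihood-ratio test per term, and the Pearson statistic divided by the residual degrees of freedom is one common rough indicator of overdispersion.

```r
# Simulated stand-in for the "dolphins" data; every column name is an assumption.
set.seed(556)
d <- expand.grid(
  season = factor(c("1989/90", "1990/91", "1991/92",
                    "1992/93", "1993/94", "1994/95")),
  area   = factor(c("North", "South")),
  gear   = factor(c("Bottom", "Midwater")),
  time   = factor(c("Day", "Night"))
)
d$tows <- sample(5:60, nrow(d), replace = TRUE)

# Hypothetical bycatch rate per tow, used only to generate example counts.
rate <- exp(-3 + 0.8 * (d$gear == "Midwater") + 0.5 * (d$time == "Night"))
d$bycatch <- rpois(nrow(d), rate * d$tows)

# Main-effects Poisson model; log(tows) enters as an offset (coefficient fixed at 1).
fit <- glm(bycatch ~ season + area + gear + time + offset(log(tows)),
           family = poisson, data = d)

# Table of regression coefficients and their standard errors.
coef_table <- summary(fit)$coefficients[, c("Estimate", "Std. Error")]
print(round(coef_table, 3))

# Likelihood-ratio (chi-squared) test for dropping each term in turn.
lr_tests <- drop1(fit, test = "Chisq")
print(lr_tests)

# Pearson chi-squared over residual df; values well above 1 suggest overdispersion.
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
print(dispersion)
```

If the dispersion estimate turned out to be clearly larger than 1, one option would be to refit with family = quasipoisson and use test = "F" in drop1().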
2. The “glucose” dataset records the blood glucose levels in six subjects at various times before
or after a test meal, starting 15 minutes before the meal and ending 6 hours later. There are
six separate runs (using the same six subjects on different days), corresponding to different
times of day that the meal is taken (6am, 10am, 2pm,...,2am). This is an example of a
“repeated measures experiment” in which multiple measures are taken for the same subject
at different times. It is expected that the pattern of responses will be different from one run
to another, and there may also be a variation from one subject to another.
(a) For each of the six runs A–F, draw a trace plot that shows the pattern of glucose
responses in all six subjects for that run. [5 points]
(b) Draw a second set of trace plots where you show the response of each subject in each
run using one of a (i) linear, (ii) quadratic, (iii) cubic or (iv) quartic regression through
all ten time points (in other words, including powers of time $t$ up to $t^4$ in the case of
quartic regression). State which one of these gives the best representation, and briefly
justify your choice (formal tests, confidence intervals, etc. are not required). [5 points]
(c) Recast the dataset as a 360×4 matrix where the columns represent all the glucose levels,
times (t = −15, 0, 30, ..., 360), subjects (1 through 6) and runs (A through F). (Hint:
You may need some R command such as y=as.vector(as.matrix(glucose[,3:12]))
to write all the glucose levels as a single column vector.) [5 points]
(d) Now try fitting the data as a single regression model where the covariates are (i) powers
of $t$ up to $t^4$ (you may need to rescale for numerical stability), (ii) Subject treated as a
random effect, (iii) Run treated as a fixed effect. Check for interactions between t (or
powers of t) and either Run or Subject, and don’t forget that the variance of a random
effect can under some circumstances be estimated as 0. What do you conclude? Where
appropriate, use tests of hypotheses (such as the Kenward-Roger test, or a bootstrap
test) to decide between different nested models. [13 points]
(e) Summarize your conclusions, with particular attention to whether there is evidence that
patterns of glucose levels vary among the six Runs. [5 points] ([33 points for the
whole question])
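
To illustrate the reshaping step hinted at in part (c), the following is a minimal sketch on simulated data. The layout of the wide data frame (subject and run in columns 1 and 2, ten glucose columns in columns 3 to 12), the column names, and in particular the vector of measurement times are assumptions; the actual times should be read from the dataset rather than taken from this example.

```r
# Simulated stand-in for the wide "glucose" data; layout and names are assumptions.
set.seed(1)
times <- c(-15, 0, 30, 60, 90, 120, 180, 240, 300, 360)  # assumed; check the real data

wide <- expand.grid(subject = 1:6, run = LETTERS[1:6])    # 36 subject-by-run rows
gl <- matrix(round(rnorm(36 * 10, mean = 4, sd = 0.5), 2), nrow = 36)
colnames(gl) <- paste0("g", seq_along(times))
wide <- cbind(wide, gl)                                   # glucose in columns 3:12

# as.vector() on a matrix stacks it column by column, so the result holds
# all 36 values for the first time point, then all 36 for the second, and so on.
y <- as.vector(as.matrix(wide[, 3:12]))

# Build matching time, subject and run columns in the same stacking order.
long <- data.frame(
  glucose = y,
  t       = rep(times, each = nrow(wide)),
  subject = factor(rep(wide$subject, times = length(times))),
  run     = factor(rep(wide$run, times = length(times)))
)

print(dim(long))
print(head(long))
```

The key design point is that the time vector must be repeated with each = nrow(wide), while subject and run are repeated as whole blocks, so that every row of the long data frame lines up with the column-wise stacking of the glucose values.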
3. The “indonesia” dataset records a number of outcomes from a children’s health study in Indonesia.
Variables include an ID number for each child (repeated up to 6 times), indicators
of respiratory disease, xerophthalmia (used to mark vitamin A deficiency), age (in months,
centered about 36 months), sex, season, height adjusted for age, and an indicator of stuntedness.
There is also an age group variable “agegp” that groups the ages into four groups.
Respiratory disease (coded 0 or 1) is the primary outcome of interest, and we are interested
in how each of the other variables affects it. Since each child appears in the dataset multiple
times, we must account for correlated observations by using either a random effects model or
a generalized estimating equations approach.
(a) Construct a 2×2 table relating incidence of respiratory disease to concurrent incidence of
xerophthalmia. Repeat the same construction separately for each of the four age groups.
What pattern do you notice? Is this an example of Simpson’s paradox? [7 points]
(b) Create a random effects model using the glmer command to relate incidence of respiratory
disease to each of the other variables, treating ID as a random effect to allow for
systematic variation from one child to another. (Don’t include “agegp” in this analysis,
since age is already included as a covariate.) Which variables are significant? [5 points]
(c) Are there alternative models that are superior to the model in (b)? Consider, in particular,
(i) adding square or cubic terms in age, (ii) omitting any of the other variables that
may not be significant. Summarize your conclusions. [5 points]
(d) Based on your results to parts (b) and (c), what do you now say about the relationship
between xerophthalmia and respiratory disease? [3 points]
(e) For whatever model you previously decided was best, test the fit of the model using the
Hosmer-Lemeshow test. What do you conclude from that? [3 points]
(f) For whatever model you previously decided was best, try an alternative fit using (i)
the PQL method, (ii) the generalized estimating equations approach, using your own
judgement (or trial and error) to decide which correlation structure is appropriate. Do
any of your conclusions change? [6 points]
(g) Comment on the dependence of respiratory disease on age, using suitable plots to illustrate
your conclusions. [6 points] ([35 points for the whole question])
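
For part (e), the Hosmer-Lemeshow statistic can be computed by hand from a vector of 0/1 outcomes and a vector of fitted probabilities. The sketch below is one hedged way to do this in base R; the helper name hosmer_lemeshow is hypothetical, the example uses simulated data with an ordinary logistic regression rather than the “indonesia” data, and the choice of ten groups based on quantiles of the fitted probabilities is the conventional default rather than a requirement. For a glmer fit, the fitted probabilities would presumably be taken from fitted() on that model instead.

```r
# Hypothetical helper: Hosmer-Lemeshow test from outcomes y (0/1) and fitted
# probabilities p, using g groups defined by quantiles of p.
hosmer_lemeshow <- function(y, p, g = 10) {
  breaks <- unique(quantile(p, probs = seq(0, 1, length.out = g + 1)))
  grp    <- cut(p, breaks = breaks, include.lowest = TRUE)
  obs    <- tapply(y, grp, sum)      # observed events per group
  expd   <- tapply(p, grp, sum)      # expected events per group
  n      <- tapply(y, grp, length)   # group sizes
  # Sum of (O - E)^2 / (n * pbar * (1 - pbar)), with pbar = E / n.
  stat <- sum((obs - expd)^2 / (expd * (1 - expd / n)))
  df   <- nlevels(grp) - 2
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Simulated example data (not the "indonesia" dataset).
set.seed(2019)
x <- rnorm(500)
y <- rbinom(500, size = 1, prob = plogis(-1 + 0.8 * x))
fit <- glm(y ~ x, family = binomial)

hl <- hosmer_lemeshow(y, fitted(fit))
print(hl)
```

A small p-value would indicate that observed and expected event counts disagree across the probability groups, which is evidence of lack of fit.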