Date: 2019-05-02 10:38

STOR 556: Spring 2019

Take-home Final Exam With Grade Scheme

Answer all questions.

This is a take-home exam that you are expected to do in your own time and hand in no later

than Noon Thursday May 2. The exam should be submitted via the “Assignments” tab of the

course sakai page.

Rules of the Exam. All course resources including text, personal notes and resources available

through R or R-Studio are permitted. Your submitted answers should include full verbal answers to

the questions, illustrated where appropriate by R code, tables or figures. Very long-winded answers

are discouraged; greatest credit will be given for full but concise answers to the questions. Solutions

may be submitted in R-Markdown but this is not required. (A fully acceptable alternative is if

you submit a Word document into which you cut and paste R output as appropriate; however, I

recommend you “save as” a pdf file for the final submission.) Other web resources may be used if

fully acknowledged and referenced. Discussion among yourselves or with an outside party is not

permitted; you are allowed to email the instructor if you find the question ambiguous or if you

think there is an error, but the instructor will not give advice how to solve the problems.

The datasets have been previously posted (see the “Resources” tab in sakai for instructions how

to download them); please download the data first and contact the instructor immediately if you

have any problem with this step.

Please acknowledge you accept these conditions by copying out and signing:

PLEDGE: I will neither give nor receive unauthorized aid in this exam.

SIGNED: (A typed signature will be accepted)

1. The “dolphins” dataset documents number of dolphin bycatch (dolphins caught by accident

in fishing nets) over six seasons (1989/90 through 1994/95) in two areas in New Zealand

(North Taranaki and South Taranaki), for two types of gear (Bottom and Midwater) and for

both day and night trawls. For each combination of season, area, gear type and time of day,

the dataset documents the number of tows (fishing trawls observed) and number of dolphin

bycatch. Five of the possible 48 rows of data are absent because the number of tows was 0.

(a) A reasonable model is that the number of bycatch $y_{ijk\ell}$ in season $i$, area $j$, type of gear $k$ and time of day $\ell$ is

$$y_{ijk\ell} \sim \mathrm{Poisson}(\lambda_{ijk\ell} T_{ijk\ell}),$$

$$\log \lambda_{ijk\ell} = \beta_0 + \beta^S_i\,\mathrm{Season}_i + \beta^A_j\,\mathrm{Area}_j + \beta^G_k\,\mathrm{Gear}_k + \beta^T_\ell\,\mathrm{Time}_\ell \qquad (1)$$

where $T_{ijk\ell}$ is the number of tows and each of the variables Season, Area, Gear and Time is treated as a factor variable, $\beta_0$ is the intercept and each of $\beta^S_i$, $\beta^A_j$, $\beta^G_k$ and $\beta^T_\ell$ represents a regression coefficient for the corresponding variable. Fit the model (1) and display a table showing the regression coefficients and their standard errors. [6 points]
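Because the Poisson mean in model (1) is the rate multiplied by the number of tows, the log of the tows enters the linear predictor as an offset. The sketch below illustrates that setup on simulated data only; the data frame `dolphins_sim`, its column names (`Season`, `Area`, `Gear`, `Time`, `Tows`, `Bycatch`) and the simulated rate are assumptions, not the layout of the posted dataset.

```r
# Hypothetical stand-in for the "dolphins" data; all column names are assumptions.
set.seed(1)
dolphins_sim <- expand.grid(
  Season = factor(c("1989/90", "1990/91", "1991/92",
                    "1992/93", "1993/94", "1994/95")),
  Area   = factor(c("North", "South")),
  Gear   = factor(c("Bottom", "Midwater")),
  Time   = factor(c("Day", "Night"))
)
dolphins_sim$Tows <- sample(5:60, nrow(dolphins_sim), replace = TRUE)

# Simulate counts at an arbitrary rate of 0.05 bycatch per tow.
dolphins_sim$Bycatch <- rpois(nrow(dolphins_sim),
                              lambda = 0.05 * dolphins_sim$Tows)

# log E[y] = log(Tows) + linear predictor, so log(Tows) is an offset.
fit1 <- glm(Bycatch ~ Season + Area + Gear + Time + offset(log(Tows)),
            family = poisson, data = dolphins_sim)

# Coefficient table: estimates and their standard errors.
coef_table <- summary(fit1)$coefficients[, c("Estimate", "Std. Error")]
print(round(coef_table, 3))
```

With treatment contrasts the table has nine rows: the intercept, five season contrasts, and one contrast each for area, gear and time of day.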

(b) Can any of the terms from the model (1) be dropped? Test each of the terms in turn,

using an appropriate $\chi^2$ or $F$ test, and state your conclusions. [5 points]
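One common way to test each term in turn is a likelihood-ratio comparison of the full model against the model with that term removed, which `drop1` in base R automates. The following is a sketch on simulated data; the data frame `d` and its column names are hypothetical, and it is not presented as the intended solution.

```r
# Simulated stand-in data; names and rates are assumptions.
set.seed(2)
d <- expand.grid(Season = factor(1:6),
                 Area   = factor(c("North", "South")),
                 Gear   = factor(c("Bottom", "Midwater")),
                 Time   = factor(c("Day", "Night")))
d$Tows    <- sample(5:60, nrow(d), replace = TRUE)
d$Bycatch <- rpois(nrow(d), lambda = 0.05 * d$Tows)

full <- glm(Bycatch ~ Season + Area + Gear + Time + offset(log(Tows)),
            family = poisson, data = d)

# Likelihood-ratio (chi-square) test for dropping each term in turn.
lrt <- drop1(full, test = "Chisq")
print(lrt)
```

For a Poisson family with fixed dispersion the chi-square form is the natural choice; the F form becomes relevant when the dispersion is estimated, as in a quasi-Poisson fit.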

(c) Do the data show evidence of overdispersion? Use appropriate diagnostics and tests,

and if necessary, repeat your calculations of part (b). [5 points]
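A simple overdispersion diagnostic is the Pearson chi-square statistic divided by the residual degrees of freedom, which should be close to 1 under a Poisson model. The sketch below uses counts that are deliberately simulated as overdispersed; the data, the column names and the negative binomial generator are all assumptions made for illustration.

```r
# Simulated overdispersed counts; names and parameters are assumptions.
set.seed(3)
d <- expand.grid(Season = factor(1:6),
                 Area   = factor(c("North", "South")),
                 Gear   = factor(c("Bottom", "Midwater")),
                 Time   = factor(c("Day", "Night")))
d$Tows    <- sample(5:60, nrow(d), replace = TRUE)
d$Bycatch <- rnbinom(nrow(d), mu = 0.1 * d$Tows, size = 1)

fit <- glm(Bycatch ~ Season + Area + Gear + Time + offset(log(Tows)),
           family = poisson, data = d)

# Pearson statistic and estimated dispersion (about 1 if Poisson holds).
pearson_x2 <- sum(residuals(fit, type = "pearson")^2)
phi_hat    <- pearson_x2 / df.residual(fit)

# Approximate test of the Pearson statistic against chi-square.
p_value <- pchisq(pearson_x2, df = df.residual(fit), lower.tail = FALSE)
print(c(phi_hat = phi_hat, p_value = p_value))

# Quasi-Poisson refit: same coefficients, standard errors scaled by
# sqrt(phi_hat); term tests then use F rather than chi-square.
fit_q <- update(fit, family = quasipoisson)
print(drop1(fit_q, test = "F"))
```

The chi-square approximation for the Pearson statistic can be poor when many expected counts are small, so the p-value here is a rough guide rather than a formal result.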

(d) Would the model (1) be improved by adding interactions? Test, in particular, interactions

among the Area, Gear and Time variables, and state your conclusions. [4 points]

(e) It can be seen that in one season (1990/91), there were no dolphin bycatch. To what

extent does that year’s data bias the conclusions? Refit the model without 1990/91, and

describe any significant features that change. [5 points]

(f) For whatever model you finally accepted based on parts (a)–(e), investigate the data

for (i) non-random patterns among the residuals, (ii) datapoints of high leverage, (iii)

points of high influence. Overall, would you describe this as a satisfactory analysis, or

if not, why not? [7 points] ([32 points for the whole question])

2. The “glucose” dataset records the blood glucose levels in six subjects at various times before

or after a test meal, starting 15 minutes before the meal and ending 6 hours later. There are

six separate runs (using the same six subjects on different days), corresponding to different

times of day that the meal is taken (6am, 10am, 2pm,...,2am). This is an example of a

“repeated measures experiment” in which multiple measures are taken for the same subject

at different times. It is expected that the pattern of responses will be different from one run

to another, and there may also be a variation from one subject to another.

(a) For each of the six runs A–F, draw a trace plot that shows the pattern of glucose

responses in all six subjects for that run. [5 points]

(b) Draw a second set of trace plots where you show the response of each subject in each

run using one of a (i) linear, (ii) quadratic, (iii) cubic or (iv) quartic regression through

all ten time points (in other words, including powers of time $t$ up to $t^4$ in the case of quartic regression). State which one of these gives the best representation, and briefly

justify your choice (formal tests, confidence intervals, etc. are not required). [5 points]

(c) Recast the dataset as a 360×4 matrix where the columns represent all the glucose levels,

times (t = -15, 0, 30, ..., 360), subjects (1 through 6) and runs (A through F). (Hint:

You may need some R command such as y=as.vector(as.matrix(glucose[,3:12]))

to write all the glucose levels as a single column vector.) [5 points]
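The hint relies on the fact that `as.vector` reads a matrix column by column, so the time, subject and run labels have to be repeated in the matching order. Below is a minimal sketch on a simulated wide table; the column names, the order of the first two columns and the intermediate time values are assumptions (the text gives only -15, 0, 30, ..., 360 and ten time points).

```r
# Hypothetical wide layout: 36 rows (6 subjects x 6 runs),
# with the ten glucose readings in columns 3:12.
set.seed(4)
times <- c(-15, 0, 30, 60, 90, 120, 180, 240, 300, 360)  # assumed spacing

wide <- data.frame(Subject = rep(1:6, times = 6),
                   Run     = rep(LETTERS[1:6], each = 6))
readings <- matrix(round(rnorm(36 * 10, mean = 5, sd = 1), 2), nrow = 36,
                   dimnames = list(NULL, paste0("t", seq_along(times))))
wide <- cbind(wide, readings)

# Column-major stacking: all 36 rows at the first time, then the second, ...
y <- as.vector(as.matrix(wide[, 3:12]))

long <- data.frame(
  glucose = y,
  time    = rep(times, each = nrow(wide)),
  subject = factor(rep(wide$Subject, times = length(times))),
  run     = factor(rep(wide$Run, times = length(times)))
)
print(dim(long))  # 360 4
```

A quick check is that row 37 of `long` holds the second time point for the first row of the wide table.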

(d) Now try fitting the data as a single regression model where the covariates are (i) powers

of $t$ up to $t^4$ (you may need to rescale for numerical stability), (ii) Subject treated as a

random effect, (iii) Run treated as a fixed effect. Check for interactions between t (or

powers of t) and either Run or Subject, and don’t forget that the variance of a random

effect can under some circumstances be estimated as 0. What do you conclude? Where

appropriate, use tests of hypotheses (such as the Kenward-Roger test, or a bootstrap

test) to decide between different nested models. [13 points]
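For orientation, one way to write the baseline model without interactions is shown below. The notation is an assumption introduced here, not taken from the exam: $y_{rsm}$ is the glucose level for run $r$, subject $s$ and time point $m$, and $\tilde t_m = t_m / 360$ is one possible rescaling of time.

```latex
y_{rsm} = \beta_0 + \sum_{p=1}^{4} \beta_p \,\tilde t_m^{\,p} + \alpha_r + b_s + \varepsilon_{rsm},
\qquad b_s \sim N(0, \sigma_b^2), \quad \varepsilon_{rsm} \sim N(0, \sigma^2)
```

Here $\alpha_r$ is the fixed effect of run and $b_s$ is the random effect of subject. An interaction between Run and powers of $t$ would replace $\beta_p$ by run-specific coefficients $\beta_{rp}$, and an estimate of $\sigma_b^2 = 0$ would suggest no detectable subject-to-subject variation.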

(e) Summarize your conclusions, with particular attention to whether there is evidence that

patterns of glucose levels vary among the six Runs. [5 points] ([33 points for the

whole question])

3. The “indonesia” dataset records a number of outcomes from a children’s health study in Indonesia.

Variables include an ID number for each child (repeated up to 6 times), indicators

of respiratory disease, xerophthalmia (used to mark vitamin A deficiency), age (in months,

centered about 36 months), sex, season, height adjusted for age, and an indicator of stuntedness.

There is also an age group variable “agegp” that groups the ages into four groups.

Respiratory disease (coded 0 or 1) is the primary outcome of interest, and we are interested

in how each of the other variables affects it. Since each child appears in the dataset multiple

times, we must account for correlated observations by using either a random effects model or

a generalized estimating equations approach.

(a) Construct a 2×2 table relating incidence of respiratory disease to concurrent incidence of

xerophthalmia. Repeat the same construction separately for each of the four age groups.

What pattern do you notice? Is this an example of Simpson’s paradox? [7 points]
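The pooled and stratified tables can both be built with `table`, and comparing an association measure such as the odds ratio across them is one way to look for a Simpson-type reversal. The sketch below uses simulated data with no built-in association; the data frame `kids` and the column names `xero` and `resp` are hypothetical, and only `agegp` is named in the text.

```r
# Simulated stand-in for the "indonesia" data; names are assumptions.
set.seed(5)
n <- 1200
kids <- data.frame(
  agegp = factor(sample(1:4, n, replace = TRUE)),
  xero  = factor(rbinom(n, 1, 0.05), levels = 0:1),
  resp  = factor(rbinom(n, 1, 0.10), levels = 0:1)
)

# Pooled 2x2 table and one 2x2 table per age group.
overall <- table(xero = kids$xero, resp = kids$resp)
by_age  <- table(xero = kids$xero, resp = kids$resp, agegp = kids$agegp)

# Sample odds ratio of a 2x2 table.
odds_ratio <- function(tab) {
  (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
}

print(overall)
print(odds_ratio(overall))
print(apply(by_age, 3, odds_ratio))  # one odds ratio per age group
```

Simpson's paradox would show up as a pooled odds ratio pointing in the opposite direction from the within-group odds ratios.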

(b) Create a random effects model using the glmer command to relate the incidence of respiratory disease to each of the other variables, treating ID as a random effect to allow for

systematic variation from one child to another. (Don’t include “agegp” in this analysis,

since age is already included as a covariate.) Which variables are significant? [5 points]

(c) Are there alternative models that are superior to the model in (b)? Consider, in particular,

(i) adding square or cubic terms in age, (ii) omitting any of the other variables that

may not be significant. Summarize your conclusions. [5 points]

(d) Based on your results to parts (b) and (c), what do you now say about the relationship

between xerophthalmia and respiratory disease? [3 points]

(e) For whatever model you previously decided was best, test the fit of the model using the

Hosmer-Lemeshow test. What do you conclude from that? [3 points]
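The Hosmer-Lemeshow statistic groups observations by fitted probability and compares observed with expected event counts in each group. Below is a hand-rolled sketch in base R, applied to an ordinary logistic regression on simulated data rather than to a glmer fit; the function name `hosmer_lemeshow` and all data are hypothetical.

```r
# Hosmer-Lemeshow statistic with g groups from quantiles of fitted values.
hosmer_lemeshow <- function(y, p_hat, g = 10) {
  breaks <- unique(quantile(p_hat, probs = seq(0, 1, length.out = g + 1)))
  bin    <- cut(p_hat, breaks = breaks, include.lowest = TRUE)
  obs1 <- tapply(y, bin, sum)        # observed events per group
  exp1 <- tapply(p_hat, bin, sum)    # expected events per group
  n_k  <- tapply(y, bin, length)     # group sizes
  # Sum of (O - E)^2 / (n * pbar * (1 - pbar)) with pbar = E / n.
  stat <- sum((obs1 - exp1)^2 / (exp1 * (1 - exp1 / n_k)))
  df   <- length(obs1) - 2
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df = df, lower.tail = FALSE))
}

# Simulated binary outcome with one covariate (assumed, for illustration).
set.seed(6)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-2 + 0.5 * x))
fit <- glm(y ~ x, family = binomial)

hl <- hosmer_lemeshow(y, fitted(fit))
print(round(hl, 3))
```

A large p-value gives no evidence of lack of fit. For a mixed model the same function could be applied to the model's fitted probabilities, although the chi-square reference distribution is then only approximate.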

(f) For whatever model you previously decided was best, try an alternative fit using (i)

the PQL method, (ii) the generalized estimating equations approach, using your own

judgement (or trial and error) to decide which correlation structure is appropriate. Do

any of your conclusions change? [6 points]

(g) Comment on the dependence of respiratory disease on age, using suitable plots to illustrate

your conclusions. [6 points] ([35 points for the whole question])
