
Date: 2019-04-11 10:53

G12SMM/MATH 2011 Statistical Models and Methods

Linear Models, Assessed Coursework — 2018/2019

Please submit your work on Moodle as a pdf file by 3.00pm on Friday 12 April 2019.

Your solutions should contain all relevant R output needed to justify your answers/arguments, together with appropriate discussion, but please do not include pages of irrelevant plots/output which you do not discuss. The easiest way to include R output is to use R Markdown to produce your solutions, but you do not have to do so. You do not need to include your R code, though you can include it if you wish. If you are using R Markdown, and do not wish to include your R code, then you can suppress the R code using the echo = FALSE argument, i.e. enclose the code in an {r, echo=FALSE} chunk in the Markdown file.

There will be a Moodle forum specifically for answering queries about the coursework, so you may post questions and I will answer them there so that everyone receives the same assistance. Please be careful not to inadvertently give away parts of your answer if you do post a question. Note that as this is assessed work, I can only answer queries relating to clarification, and I will only answer queries via the forum so that everyone can see my responses. You can change your settings so that you get email notifications of new posts if you wish (I do not think that this is the default setting). Otherwise, please check the forum to see if your query has already been asked.

Unauthorised late submission will be penalised by 5% of the full mark per day. Work submitted more than one week late will receive zero marks. You are reminded to familiarise yourself with the guidelines concerning plagiarism in assessed coursework (see the student handbook), and note that this applies equally to computer code as it does to written work.

The work contributes 15% to the overall module mark.

The Data

The objective is to build a predictive model for body fat content using 10 body measurement variables. Body fat is difficult to measure, but is important to help medical professionals determine risk of certain conditions. To this end, the body fat content of 202 men was accurately measured using an underwater weighing technique, but this is not practical for general use. Hence, it is desirable to develop a model for predicting body fat content reasonably accurately using easily obtainable measurements.

The data for the 202 individuals is contained in the file Train.txt on Moodle. The body fat measurement is the variable brozek (which refers to Brozek’s equation for body fat content). The remaining 10 variables give the circumference, in centimetres, of neck, chest, abdom, hip, thigh, knee, ankle, biceps, forearm and wrist. This data set is the training data, to be used for model development.

Additionally, the file Test.txt contains the same data for a further 50 individuals. This is to be used for testing the predictive ability of models, and should not be used in any model development.

The Task

(a) Using only the training data, develop a model, or models, for predicting the body fat content (brozek) using the other 10 measurements. You may use whichever methods covered in the module you see fit. However, for this part, you should not use the test data in any way. [40]

(b) Use your chosen “best” model(s) from (a) to predict the body fat content of the individuals in the test data set. Use appropriate numerical summaries/plots to evaluate the quality of your predictions. How do the predictions compare to those of the model of the form

brozek = intercept + neck + chest + abdom + hip + thigh + knee + ankle + biceps + forearm + wrist? [10]
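One common numerical summary for part (b) is the root mean squared prediction error (RMSE) on the test set. The following is only a minimal sketch, not a model answer. The helper rmse() and the simulated data frames are hypothetical stand-ins that reuse a few of the coursework's variable names; in practice the data would be read from Train.txt and Test.txt (for example with read.table(), assuming the files have a header row, which is not stated here).

```r
# Root mean squared prediction error of a fitted model on held-out data.
# `rmse` is a hypothetical helper, not a built-in R function.
rmse <- function(fit, newdata, response = "brozek") {
  pred <- predict(fit, newdata = newdata)
  sqrt(mean((newdata[[response]] - pred)^2))
}

# Simulated stand-in data (an assumption for illustration only).
set.seed(1)
make_data <- function(n) {
  abdom <- rnorm(n, 92, 10)
  wrist <- rnorm(n, 18, 1)
  neck  <- rnorm(n, 38, 2)
  brozek <- -10 + 0.6 * abdom - 1.5 * wrist + rnorm(n, sd = 4)
  data.frame(brozek = brozek, abdom = abdom, wrist = wrist, neck = neck)
}
train <- make_data(202)
test  <- make_data(50)

# Fit on the training data only, then evaluate on the test data.
full    <- lm(brozek ~ ., data = train)             # all available predictors
reduced <- lm(brozek ~ abdom + wrist, data = train) # a candidate smaller model

errs <- c(full = rmse(full, test), reduced = rmse(reduced, test))
print(errs)
```

A plot of predicted against observed values for each model, with the line y = x added, is a natural graphical companion to this summary.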

Notes

An approximate breakdown of marks for part (a) is: Exploratory analysis [10 marks], Model selection [20 marks], Model checking and validation [10 marks]. About half the marks for each are for doing technically correct and relevant things, and half for discussion and interpretation of the output. However, this is only a guide, and the work does not have to be rigidly set out in this manner. There is some natural overlap between these parts, and overall level of presentation and focus of the analysis are also important in the assessment. The above marks are also not indicative of the relative amount of output/discussion needed for each part; it is the quality of what is produced/discussed which matters.

As always, the first step should be to do some exploratory analysis. However, you do not need to go overboard on this. Explore the data yourself, but you only need to report the general picture, plus any findings you think are particularly important.
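As a rough illustration of the kind of exploratory summary meant here, the sketch below prints numerical summaries and the correlation matrix, then ranks the variables by their correlation with the response. The data are simulated (an assumption, including the hypothetical latent variable size used to make the predictors correlated); with the real data, the data frame read from Train.txt would take the place of sim.

```r
# Simulated stand-in data; only the column names mirror the coursework.
set.seed(4)
n <- 202
size <- rnorm(n)  # hypothetical latent "body size" factor (illustration only)
sim <- data.frame(
  abdom = 92 + 10 * size + rnorm(n, sd = 4),
  chest = 100 + 8 * size + rnorm(n, sd = 4),
  wrist = 18 + 0.7 * size + rnorm(n, sd = 0.6)
)
sim$brozek <- -10 + 0.6 * sim$abdom - 1.5 * sim$wrist + rnorm(n, sd = 4)

# Ranges, quartiles and means: useful for spotting implausible values.
print(summary(sim))

# Pairwise correlations: strongly correlated predictors hint at collinearity.
cm <- cor(sim)
print(round(cm, 2))

# Correlation of each variable with the response, largest first.
print(sort(cm[, "brozek"], decreasing = TRUE))
```

pairs(sim) gives the corresponding scatterplot matrix, which is often the quickest way to see outliers and nonlinearity.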

For the model fitting/selection, you can use any of the techniques we have covered this semester to investigate potential models. The automated methods of Chapter 6/Case Study 9 can be used to narrow down the search, but you can still use hypothesis tests, e.g. if two different automated methods/criteria suggest slightly different models.
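The exact methods of Chapter 6/Case Study 9 are not reproduced here, so the following is only a sketch of one possible approach: stepwise selection with R's step() function under two different criteria, followed by a partial F-test between nested models. The data are simulated stand-ins (an assumption), and the choice of AIC and BIC is illustrative rather than prescribed.

```r
# Simulated stand-in data; only the column names mirror the coursework.
set.seed(2)
n <- 202
sim <- data.frame(
  abdom = rnorm(n, 92, 10),
  wrist = rnorm(n, 18, 1),
  neck  = rnorm(n, 38, 2),
  knee  = rnorm(n, 38, 2)
)
sim$brozek <- -10 + 0.6 * sim$abdom - 1.5 * sim$wrist + rnorm(n, sd = 4)

full <- lm(brozek ~ ., data = sim)  # largest model considered
null <- lm(brozek ~ 1, data = sim)  # intercept-only model

# Backward elimination by AIC (step() uses penalty k = 2 by default).
back_aic <- step(full, direction = "backward", trace = 0)

# Forward selection by BIC: set the penalty to k = log(n).
fwd_bic <- step(null, scope = formula(full), direction = "forward",
                k = log(n), trace = 0)

print(formula(back_aic))
print(formula(fwd_bic))

# Partial F-test of the AIC-selected model against the full model.
# This comparison is valid because the selected model is nested in the full one.
print(anova(back_aic, full))
```

If the two criteria select different models and one is nested in the other, the same anova() call can be used to compare them directly.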

Please make use of the help files for R commands. Some functions may require you to change their arguments a little from examples in the notes, or behaviour/output can be controlled by setting optional arguments.

You should check the model assumptions and whether conclusions are materially affected by any influential data points.
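One way to look at influence, sketched below under assumptions, is to compute Cook's distances, refit the model without the most influential observation, and compare the coefficient estimates. The data are simulated stand-ins, and dropping a single point is just an illustrative choice; there is no universal cutoff for what counts as "influential".

```r
# Simulated stand-in data; only the column names mirror the coursework.
set.seed(3)
n <- 202
sim <- data.frame(abdom = rnorm(n, 92, 10), wrist = rnorm(n, 18, 1))
sim$brozek <- -10 + 0.6 * sim$abdom - 1.5 * sim$wrist + rnorm(n, sd = 4)

fit <- lm(brozek ~ abdom + wrist, data = sim)

# Standardised residuals: values far outside roughly +/-2 deserve a look.
std_res <- rstandard(fit)
print(sum(abs(std_res) > 2))

# Cook's distance combines residual size and leverage for each observation.
cd <- cooks.distance(fit)
worst <- unname(which.max(cd))

# Refit without the most influential point and compare the estimates.
refit <- lm(formula(fit), data = sim[-worst, ])
print(round(cbind(all_points = coef(fit), without_worst = coef(refit)), 3))
```

For the assumptions themselves, plot(fit) produces the standard residual plots (residuals against fitted values, normal Q-Q, scale-location, and residuals against leverage).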

The task is deliberately open-ended: as this is a realistic situation with real data, there is not one single correct answer, and different selection methods may suggest different “best” models; this is normal. Your job is to investigate potential models using the information and techniques we have covered. The important point is that you correctly use some of the relevant techniques in a logical and principled manner, and provide a concise but insightful summary of your findings and reasoning. (Note however that you do not have to produce a report in a formal “report” format.)

You do not need to include all your R output, as you will likely generate lots of output when experimenting. You might try a few different things whilst experimenting, and you do not need to give all the details of everything you do, as this will detract from the analysis.
