MA308: Statistical Calculation and Software

Assignment 3 (Dec 24, 2019 - Jan 02, 2020)

3.1 For the “weightgain” dataset from the HSAUR3 package, the data arise from an experiment to study the gain in weight of rats fed on four different diets, distinguished by amount of protein (low and high) and by source of protein (beef and cereal). Ten rats are randomized to each of the four treatments and the weight gain in grams is recorded. The question of interest is how diet affects weight gain.

(a) Summarize the main features of the data by calculating group means and standard deviations; use the plotmeans() function in the gplots package to produce an interaction plot of group means and their confidence intervals.
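
A minimal R sketch for (a), assuming the weightgain data frame from HSAUR3 with columns source, type and weightgain:

    library(HSAUR3)   # weightgain data
    library(gplots)   # plotmeans()

    data("weightgain", package = "HSAUR3")

    # cell means and standard deviations for the four source x type groups
    aggregate(weightgain ~ source + type, data = weightgain,
              FUN = function(x) c(mean = mean(x), sd = sd(x)))

    # group means with 95% confidence intervals
    plotmeans(weightgain ~ interaction(source, type, sep = " "),
              data = weightgain, xlab = "Treatment",
              ylab = "Weight gain (g)", connect = FALSE)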

(b) Use the interaction2wt() function in the HH package to produce a plot of both main effects and two-way interactions for any factorial design of any order. Explain whether there exists an interaction between source and type.
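
One possible sketch for (b), assuming the HH package is installed and weightgain is loaded as in the sketch under (a):

    library(HH)   # interaction2wt()

    interaction2wt(weightgain ~ source * type, data = weightgain)
    # roughly parallel traces in the source:type panels suggest little interaction;
    # clearly non-parallel or crossing traces suggest an interaction is present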

(c) Carry out a two-way factorial ANOVA analysis with and without interaction terms respectively, and explain the corresponding results.
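
A sketch for (c) using base R's aov():

    fit_int  <- aov(weightgain ~ source * type, data = weightgain)   # with interaction
    fit_main <- aov(weightgain ~ source + type, data = weightgain)   # main effects only

    summary(fit_int)    # tests for source, type and the source:type interaction
    summary(fit_main)   # tests for the two main effects only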

(d) What are the assumptions that our data need to satisfy when we implement one-way ANOVA? Now, if we use one-way ANOVA to examine the difference in weightgain between different sources of protein, are these assumptions satisfied?
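
For reference, one-way ANOVA assumes independent observations, approximately normal errors and equal variances across groups; a sketch of one way the last two could be checked (not the only choice of tests):

    fit1 <- aov(weightgain ~ source, data = weightgain)

    shapiro.test(residuals(fit1))                           # normality of the residuals
    bartlett.test(weightgain ~ source, data = weightgain)   # homogeneity of variance
    # independence follows from the randomized design rather than from a test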

(e) Carry out the permutation-test version of the two-way factorial ANOVA analysis of weightgain ~ source * type with the lmPerm package, and compare the result with that in 3.1(c).
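
A sketch for (e), assuming lmPerm's aovp() is available:

    library(lmPerm)   # aovp(): permutation version of aov()

    set.seed(1)
    fit_perm <- aovp(weightgain ~ source * type, data = weightgain, perm = "Prob")
    summary(fit_perm)   # permutation p-values, to be compared with the F-test p-values in (c)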

3.2 For the “planets” dataset from the HSAUR3 package:

(a) Apply complete linkage and average linkage hierarchical clustering to the planets data. Compare the results with the K-means (K=3) clustering results in the lecture notes.
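
A sketch for (a), assuming the planets data frame from HSAUR3 with variables mass, period and eccen; standardizing before clustering is an assumption here, not a requirement of the question:

    library(HSAUR3)
    data("planets", package = "HSAUR3")

    X <- scale(planets)   # standardize mass, period, eccen
    d <- dist(X)          # Euclidean distances

    hc_complete <- hclust(d, method = "complete")
    hc_average  <- hclust(d, method = "average")

    cl_complete <- cutree(hc_complete, k = 3)
    cl_average  <- cutree(hc_average,  k = 3)

    set.seed(1)
    cl_kmeans <- kmeans(X, centers = 3)$cluster

    table(cl_complete, cl_kmeans)   # cross-tabulate the partitions to compare them
    table(cl_average,  cl_kmeans)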


(b) Construct a three-dimensional drop-line scatterplot of the planets data in which the points are labelled with a suitable cluster label; the K-means (K=3) method can be used for clustering.
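
A sketch for (b); scatterplot3d is one package whose type = "h" option draws drop lines, and the point labels come from K-means as suggested (planets loaded as in the sketch under (a)):

    library(scatterplot3d)

    set.seed(1)
    cl <- kmeans(scale(planets), centers = 3)$cluster

    scatterplot3d(planets$mass, planets$period, planets$eccen,
                  type = "h", color = cl, pch = as.character(cl),
                  xlab = "mass", ylab = "period", zlab = "eccen")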

(c) Write an R function to fit a parametric model based on a two-component normal mixture model for the eccen variable in the planets data. (Hint: refer to the “Mixture distribution estimation” section in Chapter 6.)
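
A sketch for (c): direct maximum-likelihood fitting of a two-component normal mixture with optim(); an EM iteration as in the lecture notes would work equally well. The parameter names (p, mu1, sd1, mu2, sd2) are illustrative.

    fit_mix2 <- function(x) {
      # negative log-likelihood of p*N(mu1, sd1^2) + (1-p)*N(mu2, sd2^2),
      # with the weight on a logit scale and the sds on a log scale
      negloglik <- function(par) {
        p <- plogis(par[1])
        -sum(log(p * dnorm(x, par[2], exp(par[3])) +
                 (1 - p) * dnorm(x, par[4], exp(par[5]))))
      }
      start <- c(0, mean(x) - sd(x), log(sd(x)), mean(x) + sd(x), log(sd(x)))
      opt <- optim(start, negloglik, method = "BFGS")
      c(p   = plogis(opt$par[1]),
        mu1 = opt$par[2], sd1 = exp(opt$par[3]),
        mu2 = opt$par[4], sd2 = exp(opt$par[5]))
    }

    fit_mix2(planets$eccen)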

(d) In fact, the mclust package offers high-level functionality for estimating mixture models; apply Mclust to estimate a normal mixture model for the eccen variable in the planets data. Compare the result with that in 3.2(c).
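
A sketch for (d); G = 2 forces two components so the fit is directly comparable with (c), although Mclust can also choose the number of components itself:

    library(mclust)

    mc <- Mclust(planets$eccen, G = 2)
    summary(mc, parameters = TRUE)   # mixing proportions, means and variances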

(e) Implement principal component analysis on the planets data, and find the coefficients for the first two principal components and the principal component scores for each planet.
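
A sketch for (e); scaling the variables before the PCA is an assumption here:

    pca <- prcomp(planets, scale. = TRUE)

    pca$rotation[, 1:2]   # coefficients (loadings) of the first two principal components
    pca$x[, 1:2]          # principal component scores for each planet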

(f) Apply K-means (K=3) clustering to the first two principal components of the planets data. Compare the clustering result with that based on the original data mentioned in 3.2(a).
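
A sketch for (f), reusing the PCA from (e) and the complete-linkage partition from (a) for the comparison:

    pca <- prcomp(planets, scale. = TRUE)   # as in (e)

    set.seed(1)
    cl_pc   <- kmeans(pca$x[, 1:2], centers = 3)$cluster
    cl_orig <- cutree(hclust(dist(scale(planets)), method = "complete"), k = 3)   # as in (a)

    table(cl_pc, cl_orig)   # agreement between the two partitions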

3.3 For the “Default” dataset from the ISLR package, we consider how to predict default for any given value of balance and income. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.

(a) Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.
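
A sketch for (a):

    library(ISLR)   # Default data

    set.seed(1)
    glm_fit <- glm(default ~ income + balance, data = Default, family = binomial)
    summary(glm_fit)$coefficients   # the "Std. Error" column gives the formula-based SEs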

(b) Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model.
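
A sketch for (b):

    boot.fn <- function(data, index) {
      # refit the logistic regression on the observations selected by index
      coef(glm(default ~ income + balance, data = data,
               family = binomial, subset = index))
    }

    boot.fn(Default, 1:nrow(Default))   # full-sample estimates as a sanity check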

(c) Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance.
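
A sketch for (c); R = 1000 bootstrap replicates is an arbitrary but common choice:

    library(boot)

    set.seed(1)
    boot(Default, boot.fn, R = 1000)   # "std. error" column: bootstrap SEs of the coefficients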


(d) Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function.

3.4 For the “Default” dataset from the ISLR package, we consider how to predict default for any given value of balance and income.

(a) Split the sample set into a training set (70%) and a validation set (30%). Fit a multiple logistic regression model (default ~ balance + income) using only the training observations. Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5. Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.

[10 points]
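
A sketch for (a); the 70/30 split below is random, so the error depends on the seed:

    library(ISLR)

    set.seed(1)
    n     <- nrow(Default)
    train <- sample(n, size = round(0.7 * n))   # 70% training indices

    fit  <- glm(default ~ balance + income, data = Default,
                family = binomial, subset = train)

    prob <- predict(fit, newdata = Default[-train, ], type = "response")
    pred <- ifelse(prob > 0.5, "Yes", "No")

    mean(pred != Default$default[-train])   # validation-set misclassification rate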

(b) Apply Classical Decision Tree and Conditional Inference Tree on the Default dataset. Use the plotcp() function to plot the cross-validated error against the complexity parameter and choose the most appropriate tree size.
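
A sketch for (b), using rpart for the classical tree and party's ctree() for the conditional inference tree; it reuses the training split from (a):

    library(rpart)   # classical decision tree, plotcp()
    library(party)   # ctree()

    set.seed(1)
    train <- sample(nrow(Default), round(0.7 * nrow(Default)))   # same split as in (a)

    tree_cls <- rpart(default ~ balance + income, data = Default[train, ],
                      method = "class")
    plotcp(tree_cls)   # cross-validated error against the complexity parameter cp

    # prune at the cp with the smallest cross-validated error
    best_cp <- tree_cls$cptable[which.min(tree_cls$cptable[, "xerror"]), "CP"]
    tree_pruned <- prune(tree_cls, cp = best_cp)

    tree_ci <- ctree(default ~ balance + income, data = Default[train, ])
    plot(tree_ci)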

(c) Write down the algorithm for a random forest, which involves sampling cases and variables to create a large number of decision trees. Implement the random forest algorithm based on traditional decision trees and conditional inference trees, respectively. Use the random forest models built to classify the validation sample and compare the predictive accuracy of the two models.
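
A sketch for (c): randomForest() grows a forest of traditional CART-style trees, while party's cforest() grows a forest of conditional inference trees; ntree = 500 is an arbitrary choice:

    library(randomForest)   # forest of traditional decision trees
    library(party)          # cforest(): forest of conditional inference trees

    set.seed(1)
    train <- sample(nrow(Default), round(0.7 * nrow(Default)))   # same split as in (a)

    rf_cls <- randomForest(default ~ balance + income, data = Default[train, ],
                           ntree = 500)
    rf_ci  <- cforest(default ~ balance + income, data = Default[train, ],
                      controls = cforest_unbiased(ntree = 500))

    pred_cls <- predict(rf_cls, newdata = Default[-train, ])
    pred_ci  <- predict(rf_ci,  newdata = Default[-train, ], OOB = FALSE, type = "response")

    mean(pred_cls == Default$default[-train])   # accuracy, traditional forest
    mean(pred_ci  == Default$default[-train])   # accuracy, conditional forest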

(d) Fit a support vector machine classifier to the Default dataset. Use the tune.svm() function to choose a combination of gamma and cost that may lead to a more effective model. Compare the sensitivity, specificity, positive predictive power and negative predictive power of the SVM, random forest and logistic regression classifiers.
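
A sketch for (d) with e1071; the gamma/cost grids are illustrative, and the four measures are computed from the confusion table by hand, treating "Yes" as the positive class:

    library(e1071)   # svm(), tune.svm()

    set.seed(1)
    train <- sample(nrow(Default), round(0.7 * nrow(Default)))   # same split as in (a)

    tuned <- tune.svm(default ~ balance + income, data = Default[train, ],
                      gamma = 10^(-3:1), cost = 10^(-1:2))
    svm_fit <- tuned$best.model

    svm_pred <- predict(svm_fit, newdata = Default[-train, ])
    tab <- table(Predicted = svm_pred, Actual = Default$default[-train])

    c(sensitivity = tab["Yes", "Yes"] / sum(tab[, "Yes"]),
      specificity = tab["No",  "No"]  / sum(tab[, "No"]),
      ppv         = tab["Yes", "Yes"] / sum(tab["Yes", ]),
      npv         = tab["No",  "No"]  / sum(tab["No",  ]))
    # repeat the same table-based calculation for the random-forest and
    # logistic-regression predictions from (a) and (c) to compare the classifiers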

