Statistics 1 assignment - 2019
March 11, 2019
This computer practical counts 10% towards your final mark and is due
on Friday 22nd March by 5pm.
It should be handed in to the dedicated blue box “Probability and Statistics”
by the entrance of the main building.
Do get started in week 19 and go to the drop-in session in the computer
lab to get help.
You should use R Markdown for your code, output and associated comments
and print the corresponding PDF file. Remember to make clear which
question you are answering and include your name at the beginning of the
document.
Use pen and paper to answer the questions not involving code or numerical
experiments.
Staple the two documents together and make sure that your name appears
clearly on the first page.
In Chapter 2 we saw how a QQ-plot or a probability plot can be used
to assess whether a sample is distributed according to a specific probability
distribution. Although useful, this graphical method would benefit from being
complemented by a statistical hypothesis test, which leads to a more objective and
principled decision. Numerous tests have been proposed in the literature (in
particular to test normality) and we focus here on the Anderson-Darling
test. For an observed sample $x_1,\dots,x_n$, the Anderson-Darling (AD) test statistic
is given by
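$$T(x_1,\dots,x_n) = n \int_{-\infty}^{+\infty} \frac{\left(F_n(y) - F_X(y;\theta)\right)^2}{F_X(y;\theta)\left(1 - F_X(y;\theta)\right)}\, f_X(y;\theta)\,\mathrm{d}y,$$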
where $F_X(y;\theta)$ is the hypothesised cumulative distribution function for the data, $f_X(y;\theta)$ is
the corresponding probability density and
$$F_n(y) = \frac{\#\{i \in \{1,\dots,n\} : x_i \le y\}}{n}$$
is the empirical distribution function of the observed sample.
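As a small illustration of this definition, the sketch below evaluates the empirical distribution function directly from the counting formula and compares it with R's built-in ecdf. The helper name edf and the four-point sample are made up for this example and are not part of the assignment.

```r
# Empirical distribution function F_n(y) = #{i : x_i <= y} / n,
# written out from the definition and compared with R's built-in ecdf().
# `edf` is a hypothetical helper name used only for this illustration.
edf <- function(xs, y) {
  # For each evaluation point v in y, the proportion of observations <= v
  sapply(y, function(v) mean(xs <= v))
}

xs <- c(2.1, 0.4, 3.3, 1.7)   # made-up sample, not the assignment data
ys <- c(0, 1, 2, 4)           # points at which F_n is evaluated

edf(xs, ys)                            # proportion of observations <= each y
all.equal(edf(xs, ys), ecdf(xs)(ys))   # agrees with the built-in version
```

The function is a step function that jumps by 1/n at each observation, which is why only the sorted sample matters when the AD statistic is computed.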
1. (2 marks) State, in words and at most two sentences, the null and alternative
hypotheses in the present scenario.
2. (3 marks) Briefly explain why the AD statistic may be useful to achieve
our goal. In particular, briefly comment on the roles played by the three
terms $(F_n(y) - F_X(y;\theta))^2$, $F_X(y;\theta)(1 - F_X(y;\theta))$ and $f_X(y;\theta)$.
3. (2 marks) Describe the form of a critical region, and give the theoretical
formula for the type I error and the theoretical formula for the p-value
for this test and an observed statistic $t_{obs}$. You should precisely state
the probability distribution of any random variable you may use and can
assume $\theta$ to be known.
While the expression above leads to an intuitive interpretation of what the
statistic can achieve, a more useful expression is given by
$$T(x_1,\dots,x_n) = -n - \frac{1}{n}\sum_{i=1}^{n}(2i-1)\left[\log F_X(x_{(i)};\theta) + \log\left(1 - F_X(x_{(n+1-i)};\theta)\right)\right],$$
where $x_{(1)}, x_{(2)},\dots,x_{(n)}$ are the order statistics of the sample, as defined in Chapter 1.
Most often $\theta$ is unknown and must be estimated from the observed
sample, after which $t_{obs}$ can be computed. From now on assume that we want to
test whether a sample is drawn from a normal distribution. The two datasets
x1 and x2 referred to below can be downloaded using
load(url("https://people.maths.bris.ac.uk/~maxca/stats1/stats1-assignment.RData"))
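To make the order-statistic form of the AD statistic concrete, here is a minimal sketch for the simpler case where the hypothesised distribution is fully specified (that is, $\theta$ is known). The function name ad_stat_known and its cdf argument are hypothetical names chosen for this illustration, and the data are simulated rather than taken from x1 or x2.

```r
# Sketch: AD statistic for a fully specified null CDF (theta known).
# `ad_stat_known` and `cdf` are hypothetical names used for illustration.
ad_stat_known <- function(xs, cdf = pnorm) {
  n <- length(xs)
  u <- cdf(sort(xs))          # F_X evaluated at the order statistics
  i <- seq_len(n)
  # -n - (1/n) * sum_i (2i - 1) * [log F(x_(i)) + log(1 - F(x_(n+1-i)))]
  -n - mean((2 * i - 1) * (log(u) + log(1 - rev(u))))
}

set.seed(1)                   # arbitrary seed, for reproducibility only
z <- rnorm(50)                # simulated data, not the assignment datasets
ad_stat_known(z)              # data really are N(0, 1)
ad_stat_known(z + 2)          # data shifted away from N(0, 1): larger value
```

Reversing the vector u pairs each $F_X(x_{(i)})$ with $F_X(x_{(n+1-i)})$, so the sum is computed without an explicit loop. When $\theta$ is estimated instead, the cdf argument would have to use the estimated parameters.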
4. (4 marks) Write a function compute.ad.test(xs) which takes in a vector
of observations xs and returns the Anderson-Darling statistic. You
should test your function on the two datasets x1 and x2.
[Hint: the ad.test function in the nortest R library (which is not installed
by default), may be a source of inspiration for your code and may
be used to check that your own code produces plausible values (you will
not get marks for using it but some of you may find it useful/reassuring).
You can see the code of the function by simply typing ad.test. Note that
ad.test renormalizes the data and that you should not do this here.]
To complete the statistical procedure we need to compute the p-value. Even
when $\theta$ is assumed known, the distribution of $T(X_1, X_2,\dots,X_n)$ under the null
hypothesis is not tractable, and it is unlikely to be tractable when $\theta$ is estimated.
The numerical method below works in both scenarios.
5. (3 marks) Write pseudo-code describing an algorithm, based on simulation
and similar to the procedure used in Section 4.3 of the lecture notes
to compare the sampling distributions of three estimators, to compute the
p-value for an observed statistic $t_{obs}$.
6. (3 marks) Write the R code corresponding to your pseudo-code to compute
the p-values corresponding to x1 and x2, assuming that the empirical
mean and variance are used to estimate $\theta$. For each of x1 and x2,
plot the histogram of the simulated statistics and draw a vertical line at
the position of the observed test statistic, and on separate graphs plot the
corresponding QQ-plots (you may use the functions qqnorm and qqline).
What do you conclude?
This approach is often referred to as a Monte Carlo method. Note that
statistical tables and approximate formulae have been constructed and derived
for this test: as indicated in Stephens (1974) [1], these are based on Monte Carlo
simulations. Such approximate formulae are used in the ad.test function in
the nortest R library.
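The general idea of a Monte Carlo p-value can be sketched on a toy problem that is deliberately not the AD test. The names mc_pvalue, simulate_null, statistic and n_sim below are hypothetical, and the sketch assumes that large values of the statistic count as evidence against the null hypothesis.

```r
# Sketch of a Monte Carlo p-value: simulate the statistic under the null
# many times and report the proportion of simulated values >= t_obs.
# `mc_pvalue`, `simulate_null` and `statistic` are hypothetical names.
mc_pvalue <- function(t_obs, simulate_null, statistic, n_sim = 2000) {
  t_sim <- replicate(n_sim, statistic(simulate_null()))
  mean(t_sim >= t_obs)        # large statistics count as evidence against H0
}

# Toy illustration (not the AD test): H0 says the data are N(0, 1) and the
# statistic is the absolute sample mean of n = 20 observations.
set.seed(42)                  # arbitrary seed, for reproducibility only
abs_mean <- function(xs) abs(mean(xs))
p_typical <- mc_pvalue(0.1, function() rnorm(20), abs_mean)
p_extreme <- mc_pvalue(1.0, function() rnorm(20), abs_mean)
c(p_typical, p_extreme)       # a typical value versus a very extreme one
```

The precision of the estimated p-value improves as n_sim grows. If parameters are estimated from the data, one reasonable design is to let statistic re-estimate them on each simulated sample so that the simulated values mimic the observed procedure.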
7. (3 marks) Explain in a few lines how you would adapt your code in order
to test whether a sample is drawn from an exponential distribution.
What is your conclusion about the generality of the approach?
[1] Stephens, M. A. “EDF Statistics for Goodness of Fit and Some Comparisons.”
Journal of the American Statistical Association 69, no. 347 (1974):
730-37. doi:10.2307/2286009.