
STAT 231 Assignment 3: What Are You Waiting For?

Due: 11am Eastern on Friday November 4

Total marks: 50

Please review the information on Page 1 of Assignment 1 for full details on how to submit your assignment. As a reminder, to complete this assignment you must:

• Upload your typed/computer-generated Assignment 3 Report file as a PDF to Crowdmark.

• Upload your Assignment 3 R code file as a .R file to the Assignment 3 R Code File dropbox.

As with Assignments 1 and 2, your Report must be typeset, and your R code file should generate all the results presented in your Report.

If you are unsure how to format an answer, please check the Layout Lowdown on pages 5-8! There are also template files available on LEARN for this assignment - you are not required to follow these templates, but you may if you wish!

What’s this assignment about?

This assignment covers the material up to and including Chapter 4, with a focus on interval estimation

techniques. We will seek to model the tweet.gap variate, which measures the time (or ‘gap’) between

the publication of tweets. More precisely, for a particular tweet, tweet.gap gives the number of

seconds since the user’s previous tweet was published.

Data about how often a user is interfacing with a website, service, or product, are valuable for a

variety of reasons. The regularity, and reliability, with which users return (sometimes referred to as

‘stickiness’) is a key metric to assess product performance, as well as for testing the effectiveness of

new features and initiatives.

In addition to providing insights into how often users post tweets, the variate tweet.gap also provides

an opportunity to explore some challenges commonly encountered in real-world data analysis. Many

of you will find that tweet.gap contains some particularly large values, as a result of users not

tweeting for several days, or even weeks. When working with real-world data it is common to

encounter unusual behaviour such as this, which can make finding a suitable statistical model difficult.

In this assignment we will explore two approaches for modelling data with unusual distributions. One

of these is to consider a subset of the data, narrowing the focus of our research question in order to

facilitate meaningful analysis. The other is data transformation, which we have used previously (such

as in taking logs of the likes variate) and will now extend to other, more complex transformation

procedures.

Before we begin

For the purposes of this assignment, the study population is defined as the set of tweets in the

primary dataset from which you downloaded your sample at the start of term.

In this analysis we will include all of the data in your Twitter dataset (that is, all five accounts).

You may find it interesting to re-run your analyses on your personal and organizational accounts

separately, while thinking about why we might expect these accounts to have different distributions

for this variate.

Because tweet.gap is measured in seconds, we will convert this to hours to make it easier to

interpret our results. You should create the variate tweet.gap.hour, just like how we created

time.of.day.hour in Assignment 1.
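The conversion itself is a single division by 3600, the number of seconds in an hour. A minimal sketch, assuming your data frame is called mydata (the name used later in this assignment); the three values below are hypothetical and stand in for your own sample:

```r
# Hypothetical toy data frame standing in for your own sample;
# only the column name tweet.gap comes from the assignment.
mydata <- data.frame(tweet.gap = c(1800, 7200, 90000))

# There are 3600 seconds in one hour
mydata$tweet.gap.hour <- mydata$tweet.gap / 3600
print(mydata$tweet.gap.hour)
```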


Analysis 1: Time Between Tweets and an Exponential Model

In Analyses 1 and 2 we will be exploring the distribution of tweet.gap.hour for tweets that are not

the first tweet of the day. In the following, we refer to two sets of tweets denoted Tweet Set A and

Tweet Set B as follows:

Tweet Set A: All tweets in your dataset.

Tweet Set B: Just tweets that are not the first tweet of the day. Note that these are the tweets

for which first.tweet equals 0.
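In R, the two sets could be extracted as follows. This is a minimal sketch: the toy data frame is hypothetical, and set.A and set.B are names chosen here for illustration; mydata, tweet.gap.hour, and first.tweet follow the names used in this assignment.

```r
# Hypothetical toy data; replace with your own sample
mydata <- data.frame(
  tweet.gap.hour = c(30.2, 0.5, 1.25, 14.0, 3.0),
  first.tweet    = c(1, 0, 0, 1, 0)
)

# Tweet Set A: every tweet in the dataset
set.A <- mydata$tweet.gap.hour

# Tweet Set B: only tweets that are not the first tweet of the day
set.B <- mydata$tweet.gap.hour[mydata$first.tweet == 0]

print(length(set.A))
print(set.B)
```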

1a. [0.25 marks] To facilitate grading, please provide your 8-digit student ID number.

1b. [2 marks] Do you have any concerns about measurement error in the first.tweet variate?

Briefly explain why or why not.

1c. [2 marks] State the sample size, and calculate the sample mean, sample median, sample minimum, sample maximum, and sample standard deviation of tweet.gap.hour for Tweet Set A and Tweet Set B. Display these values in a table in your Report.

1d. [1 mark] Briefly explain why the maximum value of tweet.gap.hour for Tweet Set B should not

be greater than 24. Note: This question is not asking you to simply verify that the maximum

calculated in Analysis 1c is not larger than 24; your answer should explain why, based on how

Tweet Set B is constructed, it should not contain a value larger than 24 for any possible sample.

1e. [4 marks] Generate a relative frequency histogram and an empirical cumulative distribution

function plot of the variate tweet.gap.hour for each of Tweet Set A and Tweet Set B (that

is, you should include a total of four plots, two for each Tweet Set). All plots should feature

a suitable superimposed Exponential probability density or cumulative distribution function

curve. Hint: You may wish to use par(mfrow = c(2, 2)) so that your plots are displayed in

a single image.

1f. [7 marks] For each of Tweet Set A and Tweet Set B, discuss how well an Exponential model

fits the data. Your answer should explain what you would expect to observe if the data were

generated from an Exponential distribution, and compare this with what you observe in your

sample. You should make at least three comparisons (of what you would expect, and what you

observe) for each of Tweet Set A and Tweet Set B, and include an overall conclusion on which

of Tweet Set A and Tweet Set B the Exponential model appears to fit better.
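The computations requested in Analyses 1c and 1e could be sketched in R along the following lines. This is only an illustration: the data are simulated and do not reproduce the structure of a real sample, the helpers summarise.gap and plot.exp.fit and the objects set.A and set.B are names invented here, and fitting each Exponential curve with θ set to the sample mean is one reasonable choice rather than a prescribed one.

```r
# Simulated stand-in for a real sample (hypothetical values, not real tweets)
set.seed(231)
gaps.first <- rexp(90, rate = 1 / 20)   # first tweets of the day: longer gaps
gaps.other <- rexp(210, rate = 1 / 2)   # all other tweets: shorter gaps
set.A <- c(gaps.first, gaps.other)
set.B <- gaps.other

# Helper (name invented for this sketch): the six summaries asked for in 1c.
# Note that sd() uses the n - 1 denominator.
summarise.gap <- function(y) {
  c(n = length(y), mean = mean(y), median = median(y),
    min = min(y), max = max(y), sd = sd(y))
}
summary.table <- rbind(A = summarise.gap(set.A), B = summarise.gap(set.B))
print(round(summary.table, 3))

# Helper (name invented for this sketch): a histogram on the density scale
# and an e.c.d.f., each with the fitted Exponential curve superimposed
plot.exp.fit <- function(y, label) {
  theta.hat <- mean(y)
  hist(y, freq = FALSE, main = paste("Histogram:", label),
       xlab = "tweet.gap.hour")
  curve(dexp(x, rate = 1 / theta.hat), add = TRUE, col = "red", lwd = 2)
  plot(ecdf(y), main = paste("e.c.d.f.:", label), xlab = "tweet.gap.hour")
  curve(pexp(x, rate = 1 / theta.hat), add = TRUE, col = "red", lwd = 2)
}

par(mfrow = c(2, 2))
plot.exp.fit(set.A, "Tweet Set A")
plot.exp.fit(set.B, "Tweet Set B")
```

R's dexp() and pexp() are parameterized by the rate, so a mean of θ corresponds to rate = 1/θ.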

Analysis 2: Interval Estimation Using an Exponential Model

In this analysis we will use an Exponential model to describe the time between tweets that were

not the first tweet of the day. Note that, regardless of your conclusion in Analysis 1f, you should

complete Analysis 2 using Tweet Set B.

Let Y ~ Exponential(θ) denote the value of tweet.gap.hour for a randomly chosen tweet from the

study population that was not the first tweet of the day. You are reminded that in our notation

E[Y ] = θ.
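For reference, the likelihood quantities used in this analysis can be written out under this parameterization. The following is a sketch of the standard derivation, assuming the observations $y_1, \ldots, y_n$ are treated as independent realizations of $Y$; the symbols $\bar{y}$, $\ell$, and $\hat{\theta}$ are the usual sample mean, log likelihood, and maximum likelihood estimate.

```latex
\begin{align*}
f(y;\theta) &= \frac{1}{\theta}\, e^{-y/\theta}, \qquad y \ge 0,\ \theta > 0 \\
L(\theta) &= \prod_{i=1}^{n} \frac{1}{\theta}\, e^{-y_i/\theta}
           = \theta^{-n} \exp\!\left(-\frac{n\bar{y}}{\theta}\right) \\
\ell(\theta) &= -n\log\theta - \frac{n\bar{y}}{\theta} \\
\ell'(\theta) &= -\frac{n}{\theta} + \frac{n\bar{y}}{\theta^{2}} = 0
  \quad\Longrightarrow\quad \hat{\theta} = \bar{y} \\
R(\theta) &= \frac{L(\theta)}{L(\hat{\theta})}
           = \left(\frac{\hat{\theta}}{\theta}\right)^{n}
             \exp\!\left(n\left(1 - \frac{\hat{\theta}}{\theta}\right)\right)
\end{align*}
```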

2a. [0.25 marks] To facilitate grading, please provide your 8-digit student ID number.

2b. [1 mark] What is the maximum likelihood estimate of θ based on your sample?

2c. [3 marks] Generate a plot of R(θ), the relative likelihood function for θ based on your sample

and the assumed Exponential(θ) model. Your plot should include a horizontal line that could

be used to identify the 15% likelihood interval for θ.


2d. [2 marks] Using uniroot() or uniroot.all(), calculate the 15% likelihood interval for θ.

Give your answer to four decimal places.

2e. [3 marks] Calculate approximate 15%, 95%, and 99% confidence intervals for θ based on a

Central Limit Theorem approximation. Your Report should include an explanation of how this

was calculated, which may be expressed algebraically or, if you wish, by including the relevant

R command(s).

2f. [2 marks] Which of the confidence intervals you calculated in Analysis 2e is most similar to the

15% likelihood interval found in Analysis 2d? Is this what you would expect? Briefly explain

why or why not.

2g. [3 marks] Write 1-2 sentences that explain what the 95% confidence interval calculated in

Analysis 2e means in the context of the study. Note: your answer should relate your interval

to the real-world question under consideration, and not simply be written in terms of θ.
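One way the calculations in Analyses 2c to 2e might be organized in R is sketched below. Several things here are assumptions made for illustration: the data are simulated rather than taken from a real sample, rel.lik and clt.ci are names invented for this sketch, the search intervals passed to uniroot() are chosen by eye, and the Central Limit Theorem interval uses the form θ̂ ± z·θ̂/√n, in which θ̂ replaces the unknown θ in the standard deviation.

```r
# Simulated stand-in for Tweet Set B (hypothetical values)
set.seed(231)
y <- rexp(210, rate = 1 / 2)
n <- length(y)
theta.hat <- mean(y)   # maximum likelihood estimate under Exponential(theta)

# Relative likelihood R(theta), computed via logs for numerical stability
rel.lik <- function(theta) {
  exp(n * log(theta.hat / theta) + n * (1 - theta.hat / theta))
}

# Plot R(theta) with a horizontal line at 0.15
theta.grid <- seq(0.7 * theta.hat, 1.4 * theta.hat, length.out = 400)
plot(theta.grid, rel.lik(theta.grid), type = "l",
     xlab = expression(theta), ylab = expression(R(theta)))
abline(h = 0.15, lty = 2, col = "red")

# 15% likelihood interval: one root of R(theta) - 0.15 on each side of
# theta.hat. The default uniroot() tolerance is too loose for four decimal
# places, so a tighter tol is supplied.
g <- function(theta) rel.lik(theta) - 0.15
lower <- uniroot(g, interval = c(0.5 * theta.hat, theta.hat), tol = 1e-9)$root
upper <- uniroot(g, interval = c(theta.hat, 2 * theta.hat), tol = 1e-9)$root
print(round(c(lower, upper), 4))

# Approximate 100p% confidence interval based on the Central Limit Theorem
clt.ci <- function(p) {
  z <- qnorm((1 + p) / 2)
  theta.hat + c(-1, 1) * z * theta.hat / sqrt(n)
}
ci <- t(sapply(c(0.15, 0.95, 0.99), clt.ci))
rownames(ci) <- c("15%", "95%", "99%")
colnames(ci) <- c("lower", "upper")
print(round(ci, 4))
```

Before relying on uniroot(), it is worth checking on the plot that each search interval contains exactly one crossing of the horizontal line.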

Analysis 3: Time Between Tweets and a Gaussian Model

In Analyses 3 and 4 we will be exploring the distribution of tweet.gap.hour for tweets that are

the first tweet of the day. We will exclude tweets that were published more than 24 hours after the

preceding tweet (think about why we might wish to do this). You can create this subset of tweets as

follows:

> tgh.first <- mydata$tweet.gap.hour[mydata$first.tweet == 1 & mydata$tweet.gap.hour <= 24]

Note: We have called the variate tgh.first as shorthand for ‘tweet gap hour first tweets’; you are

welcome to use your own choice of naming convention!

The data in tgh.first are therefore the times between the first tweet sent on a particular day, and

the last tweet sent the preceding day. Hint: Run summary(tgh.first) and check the results make

sense based on how we have defined this variate.

We will explore various transformations of the variate in an attempt to facilitate the use of a Gaussian model. In particular, we will consider the following three transformations, which we first define in general terms for data $y_1, y_2, \ldots, y_n$, recalling that $y_{(n)}$ denotes the maximum value in our sample.

• Square Root: $s_i = \sqrt{(y_{(n)} - y_i) + 1}$

• Log: $l_i = \log\left((y_{(n)} - y_i) + 1\right)$

• Reciprocal: $r_i = \dfrac{1}{(y_{(n)} - y_i) + 1}$

You should generate three new variates as follows (where, again, you are welcome to use your own

naming conventions):

# Square Root

> tf1 <- sqrt(max(tgh.first) - tgh.first + 1)

# Log

> tf2 <- log(max(tgh.first) - tgh.first + 1)

# Reciprocal

> tf3 <- 1/(max(tgh.first) - tgh.first + 1)

We will refer to the non-transformed data as the ‘Original’ data.


3a. [0.25 marks] To facilitate grading, please provide your 8-digit student ID number.

3b. [4 marks] Generate a relative frequency histogram or an empirical cumulative distribution

function plot of the Original, Square Root, Log, and Reciprocal transformations of the variate

defined above as tgh.first. All four plots should be of the same type (that is, your Report

should contain four histograms, or four e.c.d.f. plots). All four plots should feature a suitable

superimposed Gaussian probability density or cumulative distribution function curve. Hint:

You may wish to use par(mfrow = c(2, 2)) as you did in Analysis 1e.

3c. [2 marks] Which of the Square Root, Log, or Reciprocal transformations leads to the best fit of

a Gaussian model? Briefly justify your answer in 1-2 sentences. It is sufficient to refer only to

your results in Analysis 3b, but if you wish to carry out additional analyses you are welcome

to. Note that even if you believe the original dataset exhibits the best fit, you must choose one

of the three transformation options detailed above.
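The four histograms described in Analysis 3b could be produced with a loop such as the one below. This is a sketch under stated assumptions: tgh.first is simulated here (hypothetical values between 0 and 24), the list name variates is invented for illustration, and each Gaussian curve uses the sample mean and sample standard deviation of the variate being plotted.

```r
# Simulated stand-in for tgh.first (hypothetical values in hours)
set.seed(231)
tgh.first <- 24 * rbeta(120, shape1 = 5, shape2 = 2)

# The Original data and the three transformations defined above
variates <- list(
  "Original"    = tgh.first,
  "Square Root" = sqrt(max(tgh.first) - tgh.first + 1),
  "Log"         = log(max(tgh.first) - tgh.first + 1),
  "Reciprocal"  = 1 / (max(tgh.first) - tgh.first + 1)
)

# One density-scale histogram per variate, each with a fitted Gaussian curve
par(mfrow = c(2, 2))
for (name in names(variates)) {
  v <- variates[[name]]
  hist(v, freq = FALSE, main = name, xlab = name)
  curve(dnorm(x, mean = mean(v), sd = sd(v)), add = TRUE, col = "red", lwd = 2)
}
```

To produce e.c.d.f. plots instead, plot(ecdf(v)) with a pnorm() curve can replace the hist() and dnorm() calls.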

Analysis 4: Interval Estimation Using a Gaussian Model

In our final analysis, we will use the transformed variate chosen in Analysis 3c. Let X ~ G(μ, σ)

denote the value of the transformed variate for a randomly chosen tweet from the study population.

Note that all questions in this analysis should be conducted using the transformed variate you chose

in Analysis 3c.

4a. [0.25 marks] To facilitate grading, please provide your 8-digit student ID number, and write

down the name of the transformation you chose in Analysis 3c (that is, Square Root, Log, or

Reciprocal).

4b. [1 mark] State the sample size, and calculate the sample mean and sample standard deviation

for your transformed variate.

4c. [3 marks] Calculate a 95% confidence interval or approximate confidence interval for μ based

on your sample. (You should decide which is the appropriate confidence interval to calculate.)

Your Report should include an explanation of how this was calculated, which may be expressed

algebraically or, if you wish, by including the relevant R command(s).

4d. [1 mark] Is the confidence interval you calculated in Analysis 4c exact or approximate? Briefly justify your answer. (Note: this question concerns whether the interval is theoretically exact or approximate; your answer should not discuss numerical matters such as rounding, or approximations used within R itself.) You may cite results in the Course Notes without proof.

4e. [3 marks] Write 1-2 sentences that explain what the interval calculated in Analysis 4c means

in the context of the study. Note: your answer should relate your interval to the real-world

question under consideration, and not simply be written in terms of μ. Note: Do not transform

your interval back to the original scale on which tweet.gap.hour is measured.

4f. [3 marks] Calculate a 95% confidence interval for σ based on your sample. Your Report should

include an explanation of how this was calculated, which may be expressed algebraically or, if

you wish, by including the relevant R command(s).

4g. [2 marks] You are told that Alex, another STAT 231 student, has a sample which contains

considerably fewer tweets than your sample. Would the interval Alex calculated in Analysis

4f be narrower, wider, or about the same width as the interval you calculated in Analysis 4f?

Justify your answer in 1-2 sentences.
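The interval calculations in Analyses 4c and 4f could be sketched in R as follows. The sketch makes several assumptions: the data are simulated rather than a real transformed variate, the interval for μ is the t-based interval for a Gaussian model with σ unknown, the interval for σ is the equal-tailed interval based on the Chi-squared pivotal quantity (n - 1)S²/σ², and rel.width is a helper invented here to show how the width of the σ interval, relative to s, depends on n.

```r
# Simulated stand-in for the transformed variate (hypothetical values)
set.seed(231)
x <- rnorm(120, mean = 2, sd = 0.6)
n <- length(x)
xbar <- mean(x)
s <- sd(x)

# 95% confidence interval for mu (sigma unknown): t pivotal, n - 1 df
t.crit <- qt(0.975, df = n - 1)
ci.mu <- xbar + c(-1, 1) * t.crit * s / sqrt(n)
print(round(ci.mu, 4))

# 95% confidence interval for sigma: (n - 1) S^2 / sigma^2 ~ Chi-squared(n - 1).
# The upper quantile gives the lower limit, and vice versa.
ci.sigma <- sqrt((n - 1) * s^2 / qchisq(c(0.975, 0.025), df = n - 1))
print(round(ci.sigma, 4))

# Width of the sigma interval divided by s, as a function of n alone
rel.width <- function(n) {
  diff(sqrt((n - 1) / qchisq(c(0.975, 0.025), df = n - 1)))
}
print(c(n30 = rel.width(30), n120 = rel.width(120)))
```

As a cross-check, t.test(x)$conf.int returns the same 95% interval for μ as ci.mu.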

