Date: 2022-12-06 09:27

STA 4373 – Computational Methods in Statistics

Fall 2022

STA 4373 Assignment 2

Instructions.

In this assignment you’ll analyze a COVID dataset and create a PDF of your results using the same Quarto template I posted to Canvas. As before, the filename of the file you turn in should be the group members’ last names separated by dashes and terminated with -2.pdf. For example, if Joe Shmo, Jane Doe, and Mickey Mouse worked together, they would turn in shmo-doe-mouse-2.pdf.

Again, you may use your text and work in groups of up to three. Only one delegate of your team will submit the resulting PDF on Canvas. The PDF should list the names of all collaborators at the top. The main advantage to working in a group is that you can bounce ideas off one another, and hopefully uncover more interesting features of the data.

You may use the internet to access the text’s webpage, other websites directly linked in this document, and answers to general-purpose questions about data science in R. However, you may not read or use any analyses of this or related datasets you find online. Failure to follow this rule may be considered a violation of this course’s academic integrity policy. If you have any questions about this, please contact me.

Please put a new page break before each question so each question starts on its own page (this will facilitate grading) and never provide output that runs over more than one page if you can help it. Be sure to echo all your code!

The COVID19 pandemic in Texas.

The Texas Department of State Health Services (DSHS) is the primary state agency that tracks the spread of the COVID-19 pandemic and makes information available to the public. To that end it has two dashboards, one that monitors case counts, available here, and another that focuses on testing and hospitalization, available here; these were set up in the early days and weeks of the pandemic shutdown in March and April 2020. DSHS also provides web endpoints for related datasets, a listing of which can be found at https://dshs.texas.gov/coronavirus/AdditionalData.aspx. Until relatively recently, this data was updated daily, typically in the 3pm–5pm range.

Questions.

1. Read in the data "Cases over Time by County" ("TexasCOVID-19NewCasesOverTimebyCounty.xlsx") into a variable called new_cases, but don’t clean it yet (that will come in the next steps). Then run the code below to show you’ve succeeded.

Note: You may look at the file you have downloaded in another application, but do not edit it; all manipulations of the file must be done in R.

Hint: Be sure to look at the whole dataset before reading it in. I encourage you to use readxl::cell_limits() with the ul and lr arguments to get the reading right.

new_cases |> select(1:5) |> glimpse()
# Rows: 254
# Columns: 5
# $ County "Anderson", "Andrews", "Angelina", "Aransas", "~
# $ `New Cases 03-04-2020` 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
# $ `New Cases 03-05-2020` 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
# $ `New Cases 03-06-2020` 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
# $ `New Cases 03-07-2020` 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
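
The hint about readxl::cell_limits() can be illustrated on a different workbook. The sketch below uses deaths.xlsx, an example file that ships with readxl (it is not the DSHS file), and assumes its table occupies cells A5:F15 of the first sheet; the corner coordinates for the assignment's file have to be found by inspecting that file.

```r
library(readxl)

# Example workbook bundled with readxl; NOT the DSHS file.
path <- readxl_example("deaths.xlsx")

# ul = upper-left corner, lr = lower-right corner, each given as c(row, column).
# Assumption: the data table sits in A5:F15, surrounded by notes to skip.
deaths <- read_excel(path, range = cell_limits(ul = c(5, 1), lr = c(15, 6)))

dim(deaths)
```

Rows and columns outside the rectangle are ignored, which is how title rows above a table and footnotes below it can be dropped at read time.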

2. Clean the names to match the naming conventions listed below. Run the code below to show you’ve succeeded.

new_cases |> select(1:5) |> glimpse()
# Rows: 254
# Columns: 5
# $ county "Anderson", "Andrews", "Angelina", "Aransas", "Archer", "~
# $ `03_04_2020` 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
# $ `03_05_2020` 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
# $ `03_06_2020` 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
# $ `03_07_2020` 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~

3. Change all count columns to integers (instead of doubles). Run the code below to show you’ve succeeded.

new_cases |> select(1:5) |> glimpse()
# Rows: 254
# Columns: 5
# $ county "Anderson", "Andrews", "Angelina", "Aransas", "Archer", "~
# $ `03_04_2020` 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
# $ `03_05_2020` 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
# $ `03_06_2020` 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
# $ `03_07_2020` 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~

4. Reshape new_cases to have columns county, date, and new_cases, and convert the dates into date objects. Run the code below to show you’ve succeeded.

new_cases
# # A tibble: 158,496 x 3
# county date new_cases
#
# 1 Anderson 2020-03-04 0
# 2 Anderson 2020-03-05 0
# 3 Anderson 2020-03-06 0
# 4 Anderson 2020-03-07 0
# 5 Anderson 2020-03-08 0
# 6 Anderson 2020-03-09 0
# 7 Anderson 2020-03-10 0
# 8 Anderson 2020-03-11 0
# 9 Anderson 2020-03-12 0
# 10 Anderson 2020-03-13 0
# # ... with 158,486 more rows
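
The wide-to-long reshape asked for in Question 4 can be sketched on a toy table. The two counties "A" and "B" and their counts are made up for illustration; tidyr::pivot_longer() is one common way to do this, not necessarily the intended one.

```r
library(tidyr)
library(dplyr)

# Hypothetical wide table: one row per county, one column per date.
wide <- tibble::tibble(
  county = c("A", "B"),
  `03_04_2020` = c(0L, 1L),
  `03_05_2020` = c(2L, 3L)
)

# Stack the date columns into (date, new_cases) pairs, then parse the
# month_day_year column names into Date objects.
long <- wide |>
  pivot_longer(-county, names_to = "date", values_to = "new_cases") |>
  mutate(date = as.Date(date, format = "%m_%d_%Y"))

long
```

Each county contributes one row per date column, so the 2 x 3 toy table becomes 4 rows with columns county, date, and new_cases.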

5. I have included along with this document on Canvas another data file containing the population of each Texas county (this data came from DSHS as well). The file name is county-populations.csv. Read this file and merge its information into new_cases. After merging the population information in, run the code below to show you’ve succeeded.

new_cases
# # A tibble: 158,496 x 4
# county date new_cases population
#
# 1 Anderson 2020-03-04 0 58199
# 2 Anderson 2020-03-05 0 58199
# 3 Anderson 2020-03-06 0 58199
# 4 Anderson 2020-03-07 0 58199
# 5 Anderson 2020-03-08 0 58199
# 6 Anderson 2020-03-09 0 58199
# 7 Anderson 2020-03-10 0 58199
# 8 Anderson 2020-03-11 0 58199
# 9 Anderson 2020-03-12 0 58199
# 10 Anderson 2020-03-13 0 58199
# # ... with 158,486 more rows
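
Merging a per-county lookup table into a long table, as in Question 5, can be sketched with a key-based join. The toy values below are hypothetical, and the sketch assumes the key column is named county in both tables; the real CSV's column names should be checked first.

```r
library(dplyr)

# Hypothetical long table of cases and a one-row-per-county lookup table.
cases <- tibble::tibble(
  county = c("A", "A", "B"),
  new_cases = c(0L, 2L, 1L)
)
pops <- tibble::tibble(
  county = c("A", "B"),
  population = c(1000L, 2500L)
)

# left_join() keeps every row of `cases` and repeats the matching
# population on each of that county's rows.
merged <- left_join(cases, pops, by = "county")

merged
```

The row count is unchanged by the join, which matches the before-and-after outputs above (158,496 rows in both).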

6. Create a line chart showing the incident cases (daily new cases) for the top 9 counties in Texas by population. Plot these on the same graph, differentiating counties by color. Polish the graphic.

Note: The plot will be overplotted.

Hint: What are the aesthetics in this plot?

Hint 2: Determine the top population counties first, then filter new_cases by checking whether the county is in that top list (in a pipeline, don’t re-save new_cases). Then make the plot.

Hint 3: Consider using scale_x_date()!

7. Instead of using color, facet the graphic. Free the scales in the faceting function to allow for easier visibility of the curves. Again, polish the graphic.

8. The slider package allows you to compute windowed functions; here we’ll use it for computing moving averages. Look at this (and think about it!) to see how it works.

library("slider")
x <- 1:5
slide_dbl(x, mean, .before = 1)
# [1] 1.0 1.5 2.5 3.5 4.5
slide_dbl(x, mean, .before = 2)
# [1] 1.0 1.5 2.0 3.0 4.0

Instead of looking at daily new cases, re-create the graphic above using 7-day moving averages.
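
One way to read the slide_dbl() output in Question 8: each result is the mean of the current element and up to .before elements preceding it, with shorter windows at the start. The base-R sketch below reproduces that behavior; trailing_mean() is a hypothetical helper written for illustration and is not part of slider.

```r
x <- 1:5

# Hypothetical helper: mean over the current element and up to `before`
# preceding elements, using a truncated window near the start of the vector.
trailing_mean <- function(x, before) {
  vapply(
    seq_along(x),
    function(i) mean(x[max(1, i - before):i]),
    numeric(1)
  )
}

trailing_mean(x, before = 1)
# [1] 1.0 1.5 2.5 3.5 4.5
trailing_mean(x, before = 2)
# [1] 1.0 1.5 2.0 3.0 4.0
```

Since the window counts the current element, a window of k values corresponds to .before = k - 1.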

9. Re-stack the graphic as colored curves using the smoothed 7-day moving average graphic.

Note: This will just be copy/paste and add one line from the last code chunk.

10. Note that the above graphics don’t necessarily communicate how bad community spread is in each county, since the county populations differ. Make a line chart of the new cases per 10,000 individuals by county: divide the 7-day moving average of daily new cases by the population size and multiply by 10,000. Again, color the lines by county.

What do you notice in this graphic that you couldn’t see in the last one?
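
As a quick check of the per-10,000 arithmetic, the sketch below uses the Anderson County population from the Question 5 output together with a made-up moving average of 12 cases per day (a hypothetical value, chosen only for illustration).

```r
population <- 58199    # Anderson County, from the Question 5 output
avg_new_cases <- 12    # hypothetical 7-day moving average, illustration only

# Rate per 10,000 residents: (cases / population) * 10,000
rate_per_10k <- avg_new_cases / population * 10000

round(rate_per_10k, 2)
# [1] 2.06
```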
