STATS 4014

Advanced Data Science

Assignment 3

Jono Tuke

Semester 1 2019

CHECKLIST

[ ] Have you shown all of your working, including probability notation where necessary?
[ ] Have you given all numbers to 3 decimal places unless otherwise stated?
[ ] Have you included all R output and plots to support your answers where necessary?
[ ] Have you included all of your R code?
[ ] Have you made sure that all plots and tables each have a caption?
[ ] If before the deadline, have you submitted your assignment via the online submission on MyUni?
[ ] Is your submission a single pdf file - correctly orientated, easy to read? If not, penalties apply.
[ ] Penalties for more than one document - 10% of final mark for each extra document. Note that you may resubmit and your final version is marked, but the final document should be a single file.
[ ] Penalties for late submission - within 24 hours 40% of final mark. After 24 hours, assignment is not marked and you get zero.
[ ] Assignments emailed instead of submitted by the online submission on MyUni will not be marked and will receive zero.
[ ] Have you checked that the assignment submitted is the correct one, as we cannot accept other submissions after the due date?

Due date: Friday 3rd May 2019 (Week 7), 5pm.

Q1. Bayesian connection to lasso and ridge regression

a. Suppose that

$$Y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \epsilon_i,$$

where $\epsilon_i \sim \text{iid } N(0, \sigma^2)$.

Write the likelihood for the data.
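As a starting point for Part a, the independence and normality of the errors give a product of normal densities. The sketch below writes $y_i$ for the observed value of $Y_i$ and $n$ for the number of observations; both symbols are assumptions, since the question does not name them.

```latex
L(\beta_0, \dots, \beta_p, \sigma^2 \mid y)
  = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\left\{ -\frac{1}{2\sigma^2}
      \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2 \right\}
```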

b. Let $\beta_j$, $j = 1, \dots, p$, have priors that are iid with density

$$p(\beta_j) = \frac{1}{2b} \exp\left( -\frac{|\beta_j|}{b} \right),$$

i.e., they are i.i.d. with a double-exponential distribution with mean 0 and common scale parameter $b$.

Write out the posterior for $\beta$ given the likelihood in Part a. Show that the lasso estimate is the mode of the posterior.
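A possible outline for Part b, assuming $\sigma^2$ and $b$ are treated as known, $\beta_0$ is given a flat prior, and the double-exponential density is $p(\beta_j) = \frac{1}{2b}\exp(-|\beta_j|/b)$. The symbols $y_i$, $n$ and $\lambda$ are introduced here for illustration only.

```latex
p(\beta \mid y)
  \propto \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n}
      \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2 \right\}
    \prod_{j=1}^{p} \frac{1}{2b} \exp\left\{ -\frac{|\beta_j|}{b} \right\}
  \propto \exp\left\{ -\frac{1}{2\sigma^2} \left[
      \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2
      + \lambda \sum_{j=1}^{p} |\beta_j| \right] \right\},
  \qquad \lambda = \frac{2\sigma^2}{b}
```

Maximising this posterior over $\beta$ is the same as minimising the bracketed term, which has the form of the lasso objective with penalty $\lambda$.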

c. Let $\beta_j$, $j = 1, \dots, p$, have priors that are i.i.d. normal with mean zero and variance $c$.

Write the posterior of $\beta_j$, $j = 1, \dots, p$. Hence show that the ridge regression estimate is both the mean and the mode of the posterior.
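A possible outline for Part c, under the same assumptions as one would make for Part b ($\sigma^2$ and $c$ known, flat prior on $\beta_0$). The symbols $y_i$, $n$ and $\lambda$ are again introduced for illustration only.

```latex
p(\beta \mid y)
  \propto \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n}
      \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2 \right\}
    \prod_{j=1}^{p} \frac{1}{\sqrt{2\pi c}} \exp\left\{ -\frac{\beta_j^2}{2c} \right\}
  \propto \exp\left\{ -\frac{1}{2\sigma^2} \left[
      \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2
      + \lambda \sum_{j=1}^{p} \beta_j^2 \right] \right\},
  \qquad \lambda = \frac{\sigma^2}{c}
```

The exponent is quadratic in $\beta$, so the posterior is a multivariate normal density. Its mode minimises the bracketed term, which has the form of the ridge objective, and for a normal density the mean coincides with the mode.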


Q2. Using data.table

In the following, you are advised to use data.table. Trying to use standard data manipulation may crash your computer or take too long. The data in DNA_combined.csv is real data on DNA methylation in modern and ancient DNA samples.

Each row in the dataset is a segment of DNA for which we have the following information:

- chr: the chromosome the segment is from,
- pos: the starting position of the segment on the chromosome,
- N: the length of the segment in number of bases,
- X: the number of the bases that are methylated,
- type: whether the DNA is modern or ancient, and
- ID: the ID of the individual that the DNA is from.

We also have a spreadsheet of metadata, Data_Info.xlsx. Each row is an individual, with the following information:

- Filename: the filename of the compressed file that held the data (I used this to get the samples for you),
- SampleID: the ID of each individual,
- Sex: the gender of the individual,
- Tissue: the area of the body that the DNA was extracted from,
- Type: whether the DNA is modern or ancient, and
- Age_kyr: the age of the individual in thousands of years.

Our goal is to find, for each tissue / type combination, the proportion of samples whose methylation proportion is higher than the mean for that combination.

Perform the following steps:

a. Read in both datasets.

b. Rename the SampleID column to ID in the metadata data.table.

c. Find which sample IDs are repeated in the metadata.

d. Remove from the metadata any samples that are not Hairpin.

e. What is the total number of samples? What is the total number of modern samples and the total number of ancient samples?

f. Calculate the proportion of methylation for each sample.

g. What is the total number of samples for each combination of tissue and type?

h. Calculate the mean proportion of methylation for each combination of tissue and type.

i. What proportion of samples have a methylation proportion greater than the mean proportion of methylation for each tissue / type combination?
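The data.table idioms behind several of these steps (b, c, f, g, h and i) can be sketched on toy data. Everything in the two tables below is made up for illustration, including the tissue names; with the real files one would presumably start from `fread("DNA_combined.csv")` and a spreadsheet reader such as `readxl::read_excel()`. Step d is left out because the sheet does not say which column identifies Hairpin samples.

```r
library(data.table)

# Toy stand-ins for the two datasets (all values are invented)
dna <- data.table(
  chr  = c("1", "1", "2", "2", "1", "2", "1", "2"),
  pos  = c(100L, 200L, 100L, 200L, 100L, 100L, 100L, 100L),
  N    = c(10L, 10L, 20L, 20L, 10L, 10L, 10L, 10L),
  X    = c(4L, 4L, 10L, 10L, 1L, 1L, 9L, 9L),
  type = rep(c("modern", "ancient"), c(6, 2)),
  ID   = c("A", "A", "B", "B", "C", "C", "D", "D")
)
meta <- data.table(
  SampleID = c("A", "B", "C", "D", "D"),   # "D" is deliberately repeated
  Tissue   = c("bone", "bone", "bone", "tooth", "tooth"),
  Type     = c("modern", "modern", "modern", "ancient", "ancient")
)

# b. Rename a column by reference (no copy of the table is made)
setnames(meta, "SampleID", "ID")

# c. IDs that appear more than once in the metadata
repeated_ids <- meta[, .N, by = ID][N > 1, ID]

# f. Methylation proportion per sample: methylated bases over all bases
sample_prop <- dna[, .(p_meth = sum(X) / sum(N)), by = .(ID, type)]

# Attach the tissue, using one metadata row per ID
sample_prop <- merge(sample_prop, unique(meta[, .(ID, Tissue)]), by = "ID")

# g, h, i. Count, mean, and share of samples above the group mean
summary_tab <- sample_prop[, .(
  n_samples  = .N,
  mean_prop  = mean(p_meth),
  prop_above = mean(p_meth > mean(p_meth))
), by = .(Tissue, type)]

print(repeated_ids)
print(summary_tab)
```

Grouping with `by = .(Tissue, type)` keeps the three summaries in a single pass over the per-sample table, which is one reason data.table copes with large inputs.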

Q3. Webscraping

In this question, we are going to webscrape data from the Internet Movie Database (IMDb). As before, there are marks for webscraping and cleaning the dataset, but if you prefer not to do this, the cleaned dataset is provided.

a. Webscraping the data. The main package for webscraping is rvest:

https://rvest.tidyverse.org/

The Chrome extension SelectorGadget is also really useful for identifying the parts of the webpage that contain the information:

https://selectorgadget.com/

https://rvest.tidyverse.org/articles/selectorgadget.html

I have written a template function to start you off, based on the following tutorial:

https://www.analyticsvidhya.com/blog/2017/03/beginners-guide-on-web-scraping-in-r-using-rvest-with-hands-on-knowledge/

The function is:

## Get libs ----
pacman::p_load(
  rvest, tidyverse, glue
)

#' get top 100
#'
#' Take a year and get information from imdb on the top 100 movies for that year.
#'
#' @param year year to get the data from
#'
#' @return data frame with the information
#'
#' @author Jono Tuke
#'
#' Wednesday 27 Mar 2019
get_top_100 <- function(year){
  # Create url for the given year split to make easier reading
  url <- glue("https://www.imdb.com/search/title?",
              "count=100&release_date={year},{year}",
              "&title_type=feature")

  # Read in the webpage
  html <- try(read_html(url))
  if('try-error' %in% class(html)){
    cat("Cannot load webpage", url, "\n")
    return(NA)
  }

  # Get title of movies
  titles <- html %>%
    html_nodes(".lister-item-header a") %>%
    html_text()

  # Ratings
  ratings <-
    html %>%
    html_nodes(".ratings-imdb-rating strong") %>%
    html_text()

  ## Put together
  info <- tibble(
    year = year,
    title = titles,
    rating = ratings
  )
  return(info)
}

At present, it gets only the year, title and rating for the top 100 movies for a given year.

Write a function that will get the following:

## # A tibble: 6 x 11
##    year title description runtimes genre rating vote  director actors
##   <int> <chr> <chr>       <chr>    <chr> <chr>  <chr> <chr>    <chr>
## 1  1980 The ~ "\n A f~    146 min  "\nD~ 8.4    768,~ Stanley~ Jack ~
## 2  1980 Star~ "\n Aft~    124 min  "\nA~ 8.8    1,03~ Irvin K~ Mark ~
## 3  1980 The ~ "\n Jak~    133 min  "\nA~ 7.9    165,~ John La~ John ~
## 4  1980 Flyi~ "\n A m~    88 min   "\nC~ 7.8    188,~ Jim Abr~ Rober~
## 5  1980 Flas~ "\n A f~    111 min  "\nA~ 6.5    45,2~ Mike Ho~ Sam J~
## 6  1980 The ~ "\n In ~    104 min  "\nA~ 5.7    57,7~ Randal ~ Brook~
## # ... with 2 more variables: metascore <chr>, gross <chr>

Then webscrape the data for 1980 to 2018 inclusive.
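One wrinkle worth anticipating when extending the template: if a field is absent for some movies, selecting it across the whole page returns fewer values than there are titles, and the columns no longer line up. A minimal sketch of one way around this is to select each movie's container first and then pull one node per container, so missing fields become NA. The HTML below is a toy page, and the `.lister-item` and `.metascore` selectors are assumptions for illustration, not confirmed IMDb markup.

```r
library(rvest)

# Toy page: the second movie has no metascore element
page <- read_html('
<div class="lister-item">
  <h3 class="lister-item-header"><a>Movie A</a></h3>
  <div class="ratings-imdb-rating"><strong>8.4</strong></div>
  <span class="metascore">66</span>
</div>
<div class="lister-item">
  <h3 class="lister-item-header"><a>Movie B</a></h3>
  <div class="ratings-imdb-rating"><strong>7.9</strong></div>
</div>')

# One node per movie
items <- html_nodes(page, ".lister-item")

# html_node() (singular) returns exactly one result per item,
# and html_text() gives NA where the element is missing
info <- data.frame(
  title     = html_text(html_node(items, ".lister-item-header a")),
  rating    = html_text(html_node(items, ".ratings-imdb-rating strong")),
  metascore = html_text(html_node(items, ".metascore")),
  stringsAsFactors = FALSE
)

print(info)
```

The same per-item pattern should carry over to any other optional field, provided a reliable selector for it can be found with SelectorGadget.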

b. Cleaning the data. I will leave the decisions on the cleaning to you, but so that you know: I kept the top 20 most prolific directors and the top 20 most prolific actors, and the rest became Other. I also created a Boolean column for each genre.
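The two cleaning choices described here can be sketched in base R on a toy data frame. The directors, genres and the cut-off `n_keep` below are invented for illustration (the sheet uses the top 20), and this is only one of several reasonable ways to do it.

```r
# Toy data: director names and genre strings are made up
movies <- data.frame(
  director = c("A", "A", "B", "C", "B", "A"),
  genre    = c("\nDrama, Horror", "\nAnimation, Comedy", "\nDrama",
               "\nComedy", "\nAnimation", "\nHorror"),
  stringsAsFactors = FALSE
)

# Keep the n_keep most prolific directors; everyone else becomes "Other"
n_keep <- 2
counts <- sort(table(movies$director), decreasing = TRUE)
keep   <- names(counts)[seq_len(n_keep)]
movies$director_lumped <- ifelse(movies$director %in% keep,
                                 movies$director, "Other")

# Strip the leading newline, then split the comma-separated genres
movies$genre <- trimws(movies$genre)
genre_list   <- strsplit(movies$genre, ",\\s*")

# One logical column per genre
for (g in sort(unique(unlist(genre_list)))) {
  movies[[g]] <- vapply(genre_list, function(x) g %in% x, logical(1))
}

print(movies)
```

Matching on the split genre list, rather than searching the raw string, avoids a genre name accidentally matching inside a longer one.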

c. Which movies are the most highly rated and the most lowly rated?

d. Which director has the highest mean rating?

e. Fit a lasso regression to predict rating with the following predictors:

- year,
- runtimes,
- vote,
- metascore,
- gross, and
- Animation [1].

What is the best model? What is the first coefficient to be shrunk to zero as λ increases, and what is the last coefficient?

[1] Just because I am obsessed with animation.
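For the lasso question, a minimal sketch with the glmnet package on simulated data shows how the order in which coefficients reach zero can be read off the fitted path. The predictors `strong`, `weak` and `noise` and the response are simulated, so nothing here says anything about the movie data, and using cross-validation to pick λ is an assumption about what "best model" means.

```r
library(glmnet)

set.seed(1)
n <- 200
x <- matrix(rnorm(n * 3), nrow = n,
            dimnames = list(NULL, c("strong", "weak", "noise")))
y <- 3 * x[, "strong"] + 0.3 * x[, "weak"] + rnorm(n)

# alpha = 1 gives the lasso penalty; lambda is chosen by cross-validation
cv_fit <- cv.glmnet(x, y, alpha = 1)
print(coef(cv_fit, s = "lambda.min"))

# Coefficient path: one row per predictor, one column per lambda
path    <- cv_fit$glmnet.fit
nonzero <- as.matrix(path$beta) != 0

# Largest lambda at which each coefficient is still non-zero
last_nonzero_lambda <- apply(nonzero, 1, function(nz) {
  if (any(nz)) max(path$lambda[nz]) else NA_real_
})

# Smallest value: first to be shrunk to zero as lambda increases;
# largest value: last to be shrunk to zero
print(sort(last_nonzero_lambda))
```

The same ordering can usually be seen by eye from `plot(path, xvar = "lambda", label = TRUE)`, although reading it from `path$beta` is less ambiguous when curves overlap.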

Mark scheme

Part   Marks  Difficulty  Area         Type            Comments
Q1
1a     4      0.00        Lasso/ridge  proof           4 for derivation
1b     7      0.29        Lasso/ridge  proof           5 for derivation; 2 for justification
1c     7      0.29        Lasso/ridge  proof           5 for derivation; 2 for justification
Total  18
Q2
2a     1      0.00        data.table   analysis        1 for coding
2b     1      1.00        data.table   analysis        1 for coding
2c     2      0.50        data.table   analysis        2 for coding
2d     1      0.00        data.table   analysis        1 for coding
2e     2      0.00        data.table   analysis        2 for coding
2f     1      0.00        data.table   analysis        1 for coding
2g     4      0.50        data.table   analysis        4 for coding
2h     2      0.00        data.table   analysis        2 for coding
2i     5      0.60        data.table   analysis        2 for coding; 3 for overall presentation of code in this Q
Total  19
Q3
3a     7      0.29        Lasso/ridge  coding          5 for coding; 2 for quality of code
3b     10     0.20        Lasso/ridge  analysis        5 for code; 5 for explanation of code
3c     2      0.00        Lasso/ridge  analysis        2 for code
3d     2      0.00        Lasso/ridge  analysis        2 for code
3e     8      0.38        Lasso/ridge  interpretation  4 for coding; 4 for interpretation of results
Total  29

Assignment total 66
