Date: 2019-05-08 10:41

Analytics 512: Take Home Final Exam 2019

200 points in ten problems.

This is the take-home portion of the exam. You may use your notes, your books, all material on the course website, and your computer or any computer in the departmental computer lab. You may also use official documentation for R, built-in or on https://cran.r-project.org/, but no other material on the Internet. Provide proper attribution for all such sources. You may not use any human help, except whatever help is provided by me.

Your solution should consist of two files: an .Rmd file that loads all data and all packages, makes all plots, and contains all comments and explanations, and an .html or .pdf file that is produced by the .Rmd file.

Return your solutions by Friday, 5/10/19, 11:59 PM, in one of three ways: submit them in Canvas, hand in printed copies of both files, or fax both files to 202.687.6067.

Part I: Bikeshare Ridership

The first part of the exam uses data on hourly ridership counts for the Capital Bikeshare system in Washington, DC for the years 2011 and 2012. Use the data frame cabi. The data frame contains time related variables and weather related variables, plus two numerical target variables. Each observation contains data for one hour during these two years, with a few gaps.

The data have been adapted from a set at the UCI repository. Link to the original data set: https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset

System data of the Capital Bikeshare system are here: https://www.capitalbikeshare.com/system-data

Time related variables

season = a categorical variable with values 1 (for January - March), 2 (April - June), 3 (July - September), 4 (October - December)

year = the year, with values 2011 and 2012

month = a categorical variable with values 1, 2, ..., 12

wday = 0 for weekends and holidays and 1 otherwise

hr = a numerical variable with values 0, 1, ..., 23

Weather related variables

temp = scaled temperature

atemp = scaled perceived temperature

hum = scaled humidity

windspeed = scaled wind speed

weather = a categorical variable with values 1 (e.g. clear or few clouds or partly cloudy), 2 (e.g. cloudy or broken clouds or foggy), 3 (e.g. snow or rain or thunderstorm)

Target variables

The bikeshare system has casual riders who rent bicycles on the spot (e.g., tourists) and registered riders who have a subscription (e.g., commuters).

casual = total number of casual riders during this hour

registered = total number of registered riders during this hour

Problem 1 (20)

Use numerical summaries, graphs, etc. to answer the following questions. No model fitting or other statistical procedures are required for this. Each graph should help answer one or more of these questions and should be accompanied by explanations.

(a) How do ridership counts depend on the year? The month? The hour of the day? How do casual and registered riders differ in this respect?

(b) How are casual and registered ridership counts related? Does this depend on the year? Does it depend on the type of day (working day or not)?

(c) Is there an association between the weather situation and ridership counts? For casual riders? For registered riders?

(d) There are relations between time related predictors and weather related predictors. Demonstrate this with a few suitable graphs.
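The kind of numerical summary and graph that part (a) asks for might be sketched as follows. This is only an illustration in base R: the data frame toy and its simulated counts are hypothetical stand-ins for cabi, and only the column names hr, casual and registered are taken from the description above.

```r
# Synthetic stand-in for the cabi data frame (the real data are not reproduced here).
set.seed(1)
n <- 2000
toy <- data.frame(
  year = sample(c(2011, 2012), n, replace = TRUE),
  hr   = sample(0:23, n, replace = TRUE),
  wday = sample(0:1, n, replace = TRUE)
)
# Hypothetical counts: registered riders peak at commute hours, casual riders at midday.
toy$registered <- rpois(n, lambda = 20 + 80 * (toy$hr %in% c(8, 17, 18)))
toy$casual     <- rpois(n, lambda = 5 + 15 * (toy$hr %in% 11:16))

# Mean ridership per hour of the day, one column per rider type.
by_hour <- aggregate(cbind(casual, registered) ~ hr, data = toy, FUN = mean)

# Overlay the two hourly profiles in one plot.
matplot(by_hour$hr, as.matrix(by_hour[, c("casual", "registered")]),
        type = "l", lty = 1, col = 1:2,
        xlab = "hour of day", ylab = "mean riders per hour")
legend("topleft", legend = c("casual", "registered"), col = 1:2, lty = 1)
```

The same aggregate() call with year or month on the right-hand side of the formula would give the corresponding summaries for the other time variables.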

For problems 2-4, split the data into a training set (70%) and a test set (30%).
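A random 70/30 split can be sketched in base R as below. The data frame toy is a hypothetical stand-in for cabi, and the seed value is arbitrary; it is set only so that the split can be reproduced when the .Rmd file is knitted again.

```r
set.seed(512)                                     # arbitrary seed, for reproducibility only
toy <- data.frame(id = 1:1000, y = rnorm(1000))   # hypothetical stand-in for cabi

n_train   <- round(0.7 * nrow(toy))               # 70% of the rows
train_idx <- sample(nrow(toy), n_train)           # row numbers drawn without replacement
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]                    # the remaining 30%

c(train = nrow(toy_train), test = nrow(toy_test))
```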

Problem 2 (25)

(a) Fit a multiple regression to predict registered ridership from the other variables (excluding casual ridership), using the training data. Identify the significant variables and comment on their coefficients.

(b) Estimate the RMS prediction error of this model using the test set.

(c) Does the RMS prediction error depend on the month? Answer this question using the test data and suitable tables or graphs.

(d) Make copies of the training and test data in which hr is a categorical variable. Fit a multiple regression model. Compare the summary of this model to the one from part (a). Also estimate the RMS prediction error from the test set.
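The mechanics of parts (b) to (d) might look like the following sketch. It runs on simulated data, not on cabi: the data frame toy, its response and the helper rmse() are hypothetical, and only three predictors are used to keep the example short.

```r
set.seed(2)
n <- 3000
toy <- data.frame(
  month = sample(1:12, n, replace = TRUE),
  hr    = sample(0:23, n, replace = TRUE),
  temp  = runif(n)
)
# Hypothetical response: strongly non-linear in hr, linear in temp.
toy$registered <- 50 + 100 * (toy$hr %in% c(8, 17, 18)) + 60 * toy$temp + rnorm(n, sd = 20)

idx   <- sample(n, round(0.7 * n))
train <- toy[idx, ]
test  <- toy[-idx, ]

# Root mean square prediction error (helper defined for this sketch).
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# hr as a number: the model gets a single slope for the hour of the day.
fit_num  <- lm(registered ~ hr + temp, data = train)
rmse_num <- rmse(test$registered, predict(fit_num, newdata = test))

# RMS error per month, computed from the test set residuals.
res <- test$registered - predict(fit_num, newdata = test)
rmse_by_month <- tapply(res, test$month, function(e) sqrt(mean(e^2)))

# hr as a factor: one coefficient per hour (copies, so the originals stay numeric).
train_f  <- transform(train, hr = factor(hr, levels = 0:23))
test_f   <- transform(test,  hr = factor(hr, levels = 0:23))
fit_fac  <- lm(registered ~ hr + temp, data = train_f)
rmse_fac <- rmse(test_f$registered, predict(fit_fac, newdata = test_f))

c(numeric_hr = rmse_num, factor_hr = rmse_fac)
```

Fixing the factor levels to 0:23 in both copies keeps predict() from failing if some hour happens to be missing from one of the two sets.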

Problem 3 (30)

Use the original cabi data for this problem.

(a) Train artificial neural networks with various numbers of nodes in the hidden layer to predict registered ridership. Use the training data and only weather related variables. Recommend a suitable number of nodes, with explanation.

(b) Repeat part (a), using only time related variables.

(c) Repeat part (a), using two time related and two weather related variables. Explain your choice of variables.

Problem 4 (10)

What do you think are six useful predictors? Use any method you want to answer this question.

Part II: Vegetation Cover

Problems 5 - 8 use data on vegetation cover. Use the data frames covtype.train and covtype.test. The original data are at https://archive.ics.uci.edu/ml/datasets/Covertype

Each data set contains 10,000 observations of 55 variables. These have been collected on 30m × 30m patches of hilly forest land by the US Forest Service.

elev = elevation in meters, slope = slope of the terrain in degrees, aspect = direction of the slope in degrees

h_dist_hydro, v_dist_hydro = Horizontal and vertical distance to nearest water feature in meters

h_dist_road = Horizontal distance to nearest roadway in meters

hillshade_9, hillshade_12, hillshade_3 = Index for hill shade at 9 AM, 12 noon, 3 PM, at summer solstice

h_dist_fire = Horizontal distance to nearest wildfire ignition point in meters

wild1, ... wild4 = binary indicator variables for wilderness designation

soil1, ..., soil40 = binary indicator variables for soil type

cover = Target variable (type of forest cover), with values 2 and 3.

Problem 5 (20)

Fit a logistic model to the training data in order to separate the classes. Choose a classification threshold so that sensitivity and specificity are approximately the same on the training data. Then report sensitivity, specificity, and overall error rate for the test data.
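One possible way to pick such a threshold is a simple grid scan, sketched below on simulated data. Everything here is illustrative: make_data(), rates(), the predictors x1 and x2 and the 0/1 response y are hypothetical. In the exam data the response cover takes the values 2 and 3, so it would first need to be recoded, and which class counts as "positive" is a choice this sketch does not settle.

```r
set.seed(5)
make_data <- function(n) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  p  <- plogis(-1 + 2 * x1 - x2)            # hypothetical true model
  data.frame(x1, x2, y = rbinom(n, 1, p))   # y = 1 plays the role of the positive class
}
train <- make_data(2000)
test  <- make_data(2000)

fit     <- glm(y ~ x1 + x2, data = train, family = binomial)
p_train <- predict(fit, type = "response")                 # fitted probabilities
p_test  <- predict(fit, newdata = test, type = "response")

# Sensitivity, specificity and overall error rate at a given threshold.
rates <- function(y, p, thr) {
  pred <- as.integer(p > thr)
  c(sens  = mean(pred[y == 1] == 1),
    spec  = mean(pred[y == 0] == 0),
    error = mean(pred != y))
}

# Scan a grid and keep the threshold where |sens - spec| is smallest on the training data.
grid <- seq(0.01, 0.99, by = 0.01)
gap  <- sapply(grid, function(t) {
  r <- rates(train$y, p_train, t)
  abs(r[["sens"]] - r[["spec"]])
})
thr <- grid[which.min(gap)]

thr
rates(train$y, p_train, thr)   # sens and spec should be close to each other here
rates(test$y,  p_test,  thr)   # the numbers to report come from the test data
```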

Problem 6 (25)

Fit a support vector machine with radial kernels in order to separate the classes. Tune the cost and gamma parameters so that cross validation gives the best performance on the training data. Then assess the resulting model on the test data. Report sensitivity, specificity, and overall error rate for training and test data.

Problem 7 (10)

Fit a decision tree to the training data in order to separate the two classes. Prune the tree using cross validation and make sure that there are no redundant splits (i.e. splits that lead to leaves with the same classification). Then estimate the classification error rate for the pruned tree from the test data.

Problem 8 (20)

Fit a random forest model to the training data in order to separate the classes. Identify the ten most important variables and fit another random forest model, using only these variables. Use the test data to decide which model has better performance.

Part III: MNIST Digit Data

Problems 9 and 10 use the MNIST image classification data, available as mnist_all.RData in Canvas. We use only the test data (10,000 images).

Problem 9 (20)

(a) Select a random subset of 1000 digits. Use hierarchical clustering with complete linkage on these images and visualize the dendrogram.

(b) Does the dendrogram provide compelling evidence about the “correct” number of clusters? Explain your answer.

(c) Cut the dendrogram to generate a set of clusters that appears to be reasonable. There should be between 5 and 15 clusters. Then find a way to create a visual representation (i.e. a typical image) of each cluster. Explain and describe your approach.
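The steps in (a) and (c) might be sketched as follows. The example does not use MNIST: the matrix X holds 300 simulated 8 × 8 "images" drawn around three hypothetical prototypes, and the pixel-wise cluster mean is only one of several reasonable choices for a typical image. With the real data the images are larger and the number of clusters would be between 5 and 15.

```r
set.seed(9)
# Stand-in for MNIST: 300 fake images of 8 x 8 = 64 pixels, drawn around 3 prototypes.
protos <- matrix(runif(3 * 64), nrow = 3)
label  <- sample(1:3, 300, replace = TRUE)
X      <- protos[label, ] + matrix(rnorm(300 * 64, sd = 0.1), nrow = 300)

hc <- hclust(dist(X), method = "complete")   # Euclidean distance, complete linkage
plot(hc, labels = FALSE, main = "Complete linkage")

k  <- 3                                      # number of clusters, chosen from the dendrogram
cl <- cutree(hc, k = k)                      # cluster membership of every image

# One typical image per cluster: the pixel-wise mean of its members.
centers <- t(sapply(1:k, function(j) colMeans(X[cl == j, , drop = FALSE])))

# Display each mean image as a grayscale picture.
op <- par(mfrow = c(1, k), mar = c(1, 1, 2, 1))
for (j in 1:k) {
  image(matrix(centers[j, ], 8, 8), axes = FALSE, col = gray.colors(32),
        main = paste("cluster", j))
}
par(op)
```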

Problem 10 (20)

Use Principal Component Analysis on the MNIST images.

(a) Make a plot of the proportion of variance explained vs. number of principal components. What fraction of the variance is explained by the first two principal components? What fraction is explained by the first ten principal components?

(b) Plot the scores of the first two principal components of all digits against each other, color coded by the digit that is represented. Comment on the plot. Does it appear that digits may be separated by these scores?

(c) Find three digits which are reasonably well separated by the plot that you made in part (b). Illustrate this with a color coded plot like the one in (b) for just these three digits. Don’t expect perfect separation.

(d) Find three other digits which are not well separated by the plot that you made in part (b). Illustrate this with another color coded plot like the one in (b) for just these three digits.
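The computations behind (a) and (b) can be sketched with prcomp() from base R. The matrix X and the two-valued vector digit below are simulated stand-ins, not MNIST, and the class shift in the first three columns is a hypothetical device to make the plot show some structure.

```r
set.seed(10)
# Stand-in for the MNIST pixel matrix: 500 rows, 20 columns, two classes shifted apart.
digit <- sample(c(0, 1), 500, replace = TRUE)
X <- matrix(rnorm(500 * 20), nrow = 500)
X[, 1:3] <- X[, 1:3] + 4 * digit       # hypothetical class separation in three columns

pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Proportion of variance explained (PVE) and its running total.
pve     <- pca$sdev^2 / sum(pca$sdev^2)
cum_pve <- cumsum(pve)
plot(cum_pve, type = "b", xlab = "number of principal components",
     ylab = "cumulative proportion of variance explained")

cum_pve[c(2, 10)]                      # first two and first ten components

# Scores of the first two components, color coded by class.
plot(pca$x[, 1], pca$x[, 2], col = digit + 1, pch = 20, xlab = "PC1", ylab = "PC2")
legend("topright", legend = c("0", "1"), col = 1:2, pch = 20)
```

The sketch leaves scale. = FALSE on purpose: prcomp() stops with an error when asked to rescale a column that is constant, which is likely to matter for image data in which some border pixels never change.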
