
CSC334/424

Assignment #2

Deliverables: Turn in your answers in a single PDF file. Copy any R output relevant to your answer into your document, explain your answers thoroughly, and include a copy of the full analysis in your report along with your conclusions.

1) (Due by Thursday April 25) We will be trying to finalize the groups this week, so the first assignment here is to post to the final project forum with one of the following:

1. An introduction with what kind of data you are interested in looking at. If you already have a dataset, give a short description of the dataset, along with a description of its scope (# metric variables, # categorical variables, # samples, multiple related tables?).

2. A response to one or more other posts expressing interest in their project idea.

3. A post with a fully formed group (some of you have already formed a group). In this case, you should also give a description of your dataset along the lines of item 1.

In addition, as you are forming your groups, remember the following requirements for datasets and groups:

a. Your group should have 3-4 people in it.

b. Your group should have at least one in-class student and one on-line student. This helps me check in with each group if I have at least one in-class student in each, and also helps get remote students involved with those of you in Chicago. I will consider making exceptions here, but your group will need to contact me to discuss it.

c. Your dataset should be a real and rich dataset with at least 15 to 20 variables mixed between categorical and metric, but should definitely have a good set of numeric variables to work with. It should have at least 10 * #var (but better yet 15 to 20 * #var) samples; we will see that some techniques like PCA require this for significance/stability. So the more variables your dataset has, the larger the sample size should be. See me if you have any doubts about this.

2) (Due Monday April 29th) Do one of the following:

1. Finalize your choice of a group that is forming online. Your group should post its final composition to the “Group Finalization” forum. Create a thread and have each member of your group post to that thread. (This helps me with tracking who has not found a group.) When each member posts here, they should include whether they are an online or in-class student.

2. Post your name, a list of three areas of data interest, and whether you are in-class or online, to the alternate group formation site (I will form groups out of the remaining people, either creating new groups or filling empty slots in existing groups).

We will finalize group formation by next Friday.

3) (10 points) For each of the following datasets, identify the variables in the dataset as correlated or uncorrelated. If they are correlated, draw the principal components and, for their lengths, estimate the size of the eigenvalue (i.e. the variance in the direction of the principal component). Note: you do not need to do this precisely, but you should be able to get a rough estimate from the graph.

[Four scatter plots, labeled a) through d), appear here in the original assignment.]

4) (10 points) Answer each of the following by hand for the following matrices/vectors, and then verify your answers with R code:

M = [ 1     1
      1     2.5 ]

N = [ 0.40  0.88  0.28
      0.88  1.10  0.98
      0.28  0.98  2.26 ]

v = (1, 2, 3)^T

a) Compute the eigenvalues and eigenvectors of M.

b) Verify that v is an eigenvector of N.

c) For the eigenvector in b), what is the corresponding eigenvalue? (Note: you do not need to solve for it. Hint: what does it mean that v is an eigenvector? Also note that it will be approximate; there may be a difference in the second or third decimal place.)
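As a rough sketch, the R verification for parts a)-c) might look like the following, assuming the entries of M, N, and v as given above:

```r
# Matrices and vector as given in the problem.
M <- matrix(c(1, 1,
              1, 2.5), nrow = 2, byrow = TRUE)
N <- matrix(c(0.40, 0.88, 0.28,
              0.88, 1.10, 0.98,
              0.28, 0.98, 2.26), nrow = 3, byrow = TRUE)
v <- c(1, 2, 3)

# a) Eigenvalues and eigenvectors of M.
eigen(M)

# b) If v is an eigenvector of N, then N %*% v is a scalar multiple of v,
#    so the element-wise ratio (N %*% v) / v should be (roughly) constant.
Nv <- N %*% v
Nv / v

# c) That constant ratio is the approximate eigenvalue corresponding to v.
```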

5) (15 points, Regularized Regression) The data in mapleTrain.csv and mapleTest.csv is the same Maple dataset from the first homework but divided into a test and training set. Perform the following analyses to build models on the training set and compare their predictions on the test set with the root mean squared error (rmse) that we used in class.

1. Compute, as you did last week, a regression using the two predictors of Latitude and JulyTemp.

2. Compute a cross-validated Ridge regression. Due to the small sample size, you may have to set the number of folds for the cross-validation (“nfolds”) to something less than 10, say 8.

3. Compute a cross-validated Lasso regression.

4. Compute an Elastic Net regression with an alpha of .5 (i.e. a perfect mix of ridge and lasso).

5. (5 pts, Extra Credit for both classes) With a for loop, perform a search on alpha to find the lowest rmse.
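The following is a minimal sketch of one way to organize this in R with the glmnet package. The file names and the predictors Latitude and JulyTemp come from the assignment; the response column name `Response` is a placeholder for the actual outcome variable in the Maple dataset, and the alpha grid in the extra-credit loop is an arbitrary choice.

```r
library(glmnet)

train <- read.csv("mapleTrain.csv")
test  <- read.csv("mapleTest.csv")

# Placeholder: substitute the actual outcome column from the Maple dataset.
yname <- "Response"
xvars <- c("Latitude", "JulyTemp")

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# 1. Ordinary least-squares regression with the two predictors.
fit.lm <- lm(reformulate(xvars, yname), data = train)
rmse(test[[yname]], predict(fit.lm, newdata = test))

# glmnet expects design matrices rather than a formula.
xtrain <- as.matrix(train[, xvars])
xtest  <- as.matrix(test[, xvars])
ytrain <- train[[yname]]

# 2. Cross-validated ridge regression (alpha = 0), with fewer folds for the small sample.
fit.ridge <- cv.glmnet(xtrain, ytrain, alpha = 0, nfolds = 8)
rmse(test[[yname]], predict(fit.ridge, newx = xtest, s = "lambda.min"))

# 3. Cross-validated lasso (alpha = 1).
fit.lasso <- cv.glmnet(xtrain, ytrain, alpha = 1, nfolds = 8)
rmse(test[[yname]], predict(fit.lasso, newx = xtest, s = "lambda.min"))

# 4. Elastic net with alpha = 0.5.
fit.enet <- cv.glmnet(xtrain, ytrain, alpha = 0.5, nfolds = 8)
rmse(test[[yname]], predict(fit.enet, newx = xtest, s = "lambda.min"))

# 5. (Extra credit) For-loop search over alpha for the lowest test rmse.
alphas <- seq(0, 1, by = 0.1)
errs <- numeric(length(alphas))
for (i in seq_along(alphas)) {
  fit <- cv.glmnet(xtrain, ytrain, alpha = alphas[i], nfolds = 8)
  errs[i] <- rmse(test[[yname]], predict(fit, newx = xtest, s = "lambda.min"))
}
alphas[which.min(errs)]
```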

6) (Principal Component Analysis - 20 points): The data given in the file ‘Employment.txt’ is the percentage employed in different industries in European countries during 1979. Techniques such as Principal Component Analysis (PCA) can be used to examine which countries have similar employment patterns. There are 26 countries in the file and 10 variables as follows:

Variable Names:

1. Country: Name of country

2. Agr: Percentage employed in agriculture

3. Min: Percentage employed in mining

4. Man: Percentage employed in manufacturing

5. PS: Percentage employed in power supply industries

6. Con: Percentage employed in construction

7. SI: Percentage employed in service industries

8. Fin: Percentage employed in finance

9. SPS: Percentage employed in social and personal services

10. TC: Percentage employed in transport and communications.

Perform a principal component analysis using the covariance matrix:

a. How many principal components are required to explain 90% of the total variation for this data?

b. For the number of components in part a, give the formula for each component and a brief interpretation, without rotation of the components. How easy are they to separate in terms of meaning? Then try rotating the components (your function for computing PCA may be doing this already; if so, make sure that you know the difference and can get both out of your software). Give the formula for each rotated component and a brief interpretation. Has rotating improved the ability to interpret the components?

c. What countries have the highest and lowest values for each principal component (only include the number of components specified in part a)? For each of those countries, give the principal component scores (again, only for the number of components specified in part a).

d. Analyze the significance of the entries in the correlation matrix for fields that are highly correlated or completely uncorrelated with the other fields (use a 90% confidence level, and consider a field highly correlated if it is correlated with over 75% of the other fields). If there are such fields, try removing them from the analysis. Does this help your interpretation of the analysis in b)?
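A hedged sketch of the corresponding R workflow, assuming ‘Employment.txt’ is a whitespace-delimited table with a header row and the Country column first (check the file's actual layout before running):

```r
emp <- read.table("Employment.txt", header = TRUE)
X <- emp[, -1]                         # drop Country, keep the nine percentage variables
rownames(X) <- emp$Country

# a. PCA on the covariance matrix (no scaling); the cumulative proportion of
#    variance shows how many components are needed to reach 90%.
pca <- prcomp(X, scale. = FALSE)
summary(pca)

# b. Unrotated loadings (the "formula" for each component), then a varimax
#    rotation of the first k loadings, where k is the answer to part a.
pca$rotation
k <- 2                                 # placeholder; replace with the value from part a
varimax(pca$rotation[, 1:k])

# c. Countries with the highest and lowest score on each retained component.
scores <- pca$x[, 1:k, drop = FALSE]
apply(scores, 2, function(s) rownames(X)[c(which.max(s), which.min(s))])
scores

# d. Pairwise correlations and a significance test at the 90% confidence level.
cor(X)
cor.test(X$Agr, X$SPS, conf.level = 0.90)   # repeat for other pairs of interest
```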

7) (Principal Component Analysis, 20 points) Begin with the “census2.csv” datafile, which contains census data on various tracts in a district. The fields in the data are:

1. Total Population (thousands)

2. Professional degree (percent)

3. Employed age over 16 (percent)

4. Government employed (percent)

5. Median home value (dollars)

a) Conduct a principal component analysis using the covariance matrix (the default for prcomp and many routines in other software), and interpret the results. How much of the variance is accounted for in the first component, and why is this?

b) Try dividing the MedianHomeValue field by 100,000 so that the median home value in the dataset is measured in $100,000’s rather than in dollars. How does this change the analysis?

c) Compute the PCA with the correlation matrix instead. How does this change the result, and how does your answer compare (if you did it) with your answer in b)?

d) Analyze the correlation matrix for this dataset for significance, and also look for variables that are extremely correlated or uncorrelated. Discuss the effect of this on the analysis.

e) (Extra Credit for Undergraduate Students) Discuss what using the correlation matrix does and why it may or may not be appropriate in this case.
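One possible R sketch, assuming census2.csv has a header row; MedianHomeValue is the column name referenced in part b, while TotalPopulation is only a guess at how that header is spelled (adjust both to match the file):

```r
census <- read.csv("census2.csv")

# a) PCA on the covariance matrix (prcomp's default: center = TRUE, scale. = FALSE).
pca.cov <- prcomp(census)
summary(pca.cov)            # the dollar-scaled home value should dominate PC1
pca.cov$rotation

# b) Rescale median home value to $100,000's and repeat.
census.b <- census
census.b$MedianHomeValue <- census.b$MedianHomeValue / 100000
summary(prcomp(census.b))

# c) PCA on the correlation matrix (standardize every variable).
pca.cor <- prcomp(census, scale. = TRUE)
summary(pca.cor)
pca.cor$rotation

# d) Pairwise correlations and a significance test for one pair of interest.
cor(census)
cor.test(census$MedianHomeValue, census$TotalPopulation, conf.level = 0.90)
```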

8) (20 points, Principal Component Analysis, Extra Credit for Undergraduate Students) Download the “trackRecord.txt” dataset and perform a principal component analysis on the data. The data give track records for various countries in a series of events (100m, 200m, 400m, 800m, 1500m, 5000m, 10000m, Marathon). Note that the first three events are measured in seconds and the last four in minutes.

Choose your PCA method carefully and give a reason for your choice. Your method should account for the differences in scales of the fields. Try different ways of formulating the analysis until you get a small set of components that are easy to interpret.

Finally, run a common factor analysis on the same data. What difference, if any, do you find? Does the factor analysis change your ability to interpret the results practically?
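A hedged sketch for this part, assuming trackRecord.txt is a whitespace-delimited table with a header row whose first column identifies the country (the factor count below is only an illustration):

```r
track <- read.table("trackRecord.txt", header = TRUE)
X <- track[, -1]                       # event times; first column assumed to name the country

# The events are on very different scales (seconds vs. minutes), so working
# from the correlation matrix (scale. = TRUE) is one defensible choice.
pca <- prcomp(X, scale. = TRUE)
summary(pca)
pca$rotation

# Common factor analysis on the same data; factanal uses the correlation
# matrix and applies a varimax rotation by default.
fa <- factanal(X, factors = 2)
fa$loadings
```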

9) (Reflection) Post a comment on the lectures 3 & 4 forum regarding some topic covered during these lectures.

10) (Paper review) An academic paper from a conference or journal will be posted to the Homework 2 content section of D2L. Review the paper and evaluate their usage of Principal Component Analysis. In particular, address:

1. How suitable is their data for PCA?

2. How are they applying PCA? Are they trying to extract interpretable underlying variables, or is their goal more along the lines of dimensionality reduction?

3. What kind of factor rotation do they use if any?

4. How many components do they concentrate on in their analysis?

5. Do they evaluate the stability of the components, and if so, how?

6. What conclusions does PCA allow them to draw?

