CSC334/424
Assignment #2
Deliverables: Turn in your answers in a single PDF file. Copy any R output relevant to your answer
into your document, explain your answer thoroughly, and include a copy of the full analysis in your
report along with your conclusions.
1) (Due by Thursday April 25) We will be trying to finalize the groups this week, so the first
assignment here is to post one of the following to the final project forum.
1. An introduction with what kind of data you are interested in looking at. If you already
have a dataset, give a short description of the dataset, along with a description of its
scope (# metric variables, #categorical variables, #samples, multiple related tables?)
2. A response to one or more other posts expressing interest in their project idea
3. A post with a fully formed group (some of you have already formed a group). In this
case, you should also give a description of your dataset along the lines of item 1 above.
In addition, as you are forming your groups, remember the following requirements for datasets
and groups
a. Your group should have 3-4 people in it.
b. Your group should have at least one in-class student and one on-line student. Having at
least one in-class student in each group makes it easier for me to check in with the group,
and it also helps get remote students involved with those of you in Chicago. I will consider
making exceptions here, but your group will need to contact me to discuss it.
c. Your dataset should be a real and rich dataset with at least 15 to 20 variables mixed
between categorical and metric, but should definitely have a good set of numeric
variables to work with. It should have at least (10 * #var, but better yet 15 to 20 * #var)
samples (we will see that some techniques like PCA require this for
significance/stability). So the more variables your dataset has the larger the sample size
should be. See me if you have any doubts about this.
2) (Due Monday April 29th) Do one of the following
1. Finalize your choice of a group that is forming online. Your group should post its final
composition to the “Group Finalization” forum. Create a thread and have each member
of your group post to that thread. (This helps me with tracking who has not found a
group). When each member posts here, they should include whether they are an
online or in-class student.
2. Post your name, a list of three areas of data interest, and whether you are in-class or
online, to the alternate group formation site (I will form groups out of the remaining
people, either creating new groups, or filling empty slots in existing groups).
We will finalize group formation by next Friday.
3) (10 points) For each of the following datasets, identify the variables in the dataset as correlated
or uncorrelated. If they are correlated, draw the principal components and, from their lengths,
estimate the size of each eigenvalue (i.e., the variance in the direction of the principal
component). Note: you do not need to do this precisely, but you should be able to get a rough
estimate from the graph.
[Four scatterplots, labeled a) through d), appear here in the original assignment handout.]
4) (10 points) Answer each of the following by hand for the following matrices/vectors, and then
verify your answers with R code:
    M = | 1    1   |      N = | .40   .88   .28 |      v = | 1 |
        | 1   2.5  |          | .88  1.10   .98 |          | 2 |
                              | .28   .98  2.26 |          | 3 |
a) Compute the eigenvalues and eigenvectors of M
b) Verify that v is an eigenvector of N.
c) For the eigenvector in b), what is the corresponding eigenvalue? (Note: you do not need
to solve for it. Hint: what does it mean that v is an eigenvector? Also note that the answer
will be approximate; there may be a difference in the second or third decimal place.)
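As a sketch of the R workflow for this question (using a small illustrative matrix rather than M and N, so the by-hand work stays yours):

```r
# Illustrative symmetric matrix (NOT the assignment's M or N)
A <- matrix(c(2, 1,
              1, 2), nrow = 2, byrow = TRUE)

# eigen() returns eigenvalues (in decreasing order) and unit-length eigenvectors
e <- eigen(A)
e$values        # 3 and 1 for this matrix
e$vectors       # columns are the eigenvectors

# To check whether a vector u is (approximately) an eigenvector of A,
# compare A %*% u to u componentwise: the ratios should all be (nearly)
# equal, and that common ratio is the eigenvalue.
u <- c(1, 1)
A %*% u                 # (3, 3): each component is 3 * u, so lambda = 3
(A %*% u) / u           # componentwise ratios: 3, 3
```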
5) (15 points, Regularized Regression) The data in mapleTrain.csv and mapleTest.csv are the same
Maple dataset from the first homework, divided into a training and a test set. Perform the
following analyses to build models on the Training set and compare their predictions on the test
set with the root mean squared error (rmse) that we used in class.
1. Compute as you did last week a regression using the two predictors of Latitude and
JulyTemp.
2. Compute a cross-validated Ridge regression. Due to the small sample size, you may
have to set the number of folds for the cross-validation “nfolds” to something less than
10, say 8.
3. Compute a cross-validated Lasso regression.
4. Compute an Elastic net regression with an alpha of .5 (i.e. an equal mix of ridge and
lasso).
5. (5 pts, Extra Credit for both classes) With a for loop, perform a search on alpha to find
the lowest rmse.
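A minimal sketch of this workflow, assuming the glmnet package is installed. Simulated data stands in for the maple files here; with the real data you would instead read mapleTrain.csv and mapleTest.csv and use the actual response column in place of y.

```r
# Sketch only: simulated stand-in for mapleTrain.csv / mapleTest.csv
library(glmnet)
set.seed(1)
n  <- 40
x  <- cbind(Latitude = runif(n, 35, 48), JulyTemp = runif(n, 15, 30))
y  <- 2 * x[, "Latitude"] - 0.5 * x[, "JulyTemp"] + rnorm(n)
tr <- 1:30; te <- 31:40          # stand-in train/test split

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Part 1: ordinary least squares baseline
ols <- lm(y ~ ., data = data.frame(x, y)[tr, ])
rmse(y[te], predict(ols, newdata = data.frame(x)[te, ]))

# Parts 2-4: ridge (alpha = 0), lasso (alpha = 1), elastic net (alpha = .5);
# nfolds lowered for the small sample size, as the assignment suggests
for (a in c(0, 1, 0.5)) {
  fit  <- cv.glmnet(x[tr, ], y[tr], alpha = a, nfolds = 8)
  pred <- predict(fit, newx = x[te, ], s = "lambda.min")
  cat("alpha =", a, " test rmse =", rmse(y[te], pred), "\n")
}

# Part 5 (extra credit): search over alpha with a for loop
alphas <- seq(0, 1, by = 0.1)
errs   <- numeric(length(alphas))
for (i in seq_along(alphas)) {
  fit     <- cv.glmnet(x[tr, ], y[tr], alpha = alphas[i], nfolds = 8)
  errs[i] <- rmse(y[te], predict(fit, newx = x[te, ], s = "lambda.min"))
}
alphas[which.min(errs)]          # alpha with the lowest test rmse
```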
6) (Principal Component Analysis - 20 points): The data given in the file ‘Employment.txt’ is the
percentage employed in different industries in European countries during 1979. Techniques such
as Principal Component Analysis (PCA) can be used to examine which countries have similar
employment patterns. There are 26 countries in the file and 10 variables as follows:
Variable Names:
1. Country: Name of country
2. Agr: Percentage employed in agriculture
3. Min: Percentage employed in mining
4. Man: Percentage employed in manufacturing
5. PS: Percentage employed in power supply industries
6. Con: Percentage employed in construction
7. SI: Percentage employed in service industries
8. Fin: Percentage employed in finance
9. SPS: Percentage employed in social and personal services
10. TC: Percentage employed in transport and communications.
Perform a principal component analysis using the covariance matrix:
a. How many principal components are required to explain 90% of the total variation for this
data?
b. For the number of components in part a, give the formula for each component and a brief
interpretation, without rotation of the components. How easy are they to separate in terms
of meaning? Then try rotating the components (your function for computing PCA may be
doing this already, if so, make sure that you know the difference and can get both out of your
software). Give the formula for each component and a brief interpretation. Has rotating
improved the ability to interpret the components?
c. What countries have the highest and lowest values for each principal component (only include
the number of components specified in part a). For each of those countries, give the principal
component scores (again only for the number of components specified in part a).
d. Analyze the significance of the entries in the correlation matrix, looking for fields that are
highly correlated or almost completely uncorrelated with the other fields (use a 90%
confidence level, and consider a field highly correlated if it is correlated with over 75% of
the other fields). If there are such fields, try removing them from the analysis. Does this
help your interpretation of the analysis in b)?
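A sketch of the PCA workflow, using the built-in USArrests data as a stand-in; with the assignment data you would first read Employment.txt, drop the Country column, and then proceed the same way.

```r
# Stand-in data: with the real file, read Employment.txt and drop Country
X <- USArrests

# Part a: PCA on the covariance matrix (prcomp's default, scale. = FALSE)
pca <- prcomp(X)
summary(pca)                  # cumulative proportion of variance

# Part b: loadings (the "formula" for each component), then a varimax rotation
k <- 2                        # replace with the number found in part a
pca$rotation[, 1:k]           # unrotated loadings
varimax(pca$rotation[, 1:k])  # rotated loadings, for comparison

# Part c: scores, and the rows with the highest/lowest value on each component
sc <- pca$x[, 1:k]
rownames(X)[apply(sc, 2, which.max)]
rownames(X)[apply(sc, 2, which.min)]
```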
7) (Principal Component Analysis, 20 points) Begin with the “census2.csv” datafile, which contains
census data on various tracts in a district. The fields in the data are
1. Total Population (thousands)
2. Professional degree (percent)
3. Employed age over 16 (percent)
4. Government employed (percent)
5. Median home value (dollars)
a) Conduct a principal component analysis using the covariance matrix (the default for prcomp
and many routines in other software), and interpret the results. How much of the variance
is accounted for in the first component and why is this?
b) Try dividing the MedianHomeValue field by 100,000 so that the median home value in the
dataset is measured in $100,000’s rather than in dollars. How does this change the
analysis?
c) Compute the PCA with the correlation matrix instead. How does this change the result and
how does your answer compare (if you did it) with your answer in b)?
d) Analyze the correlation matrix for this dataset for significance, and also look for variables
that are extremely correlated or uncorrelated. Discuss the effect of this on the analysis.
e) (Extra Credit for Undergraduate Students) Discuss what using the correlation matrix does
and why it may or may not be appropriate in this case.
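The covariance-versus-correlation comparison can be sketched the same way, again with a built-in dataset standing in for census2.csv:

```r
# Stand-in data: with the real file, read census2.csv instead
X <- USArrests

# Covariance-based PCA: variables with large raw variance dominate PC1
cov_pca <- prcomp(X)                  # default scale. = FALSE
summary(cov_pca)

# Correlation-based PCA: every variable standardized to variance 1 first
cor_pca <- prcomp(X, scale. = TRUE)
summary(cor_pca)

# Part d: the correlation matrix, and a per-pair significance test;
# cor.test reports a p-value and a confidence interval for each pair
cor(X)
cor.test(X$Murder, X$Assault, conf.level = 0.90)
```

In this stand-in, Assault has by far the largest raw variance, so it dominates the first covariance-based component, which is the same effect the dollar-scaled MedianHomeValue field produces in the census data.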
8) (20 points, Principal Component Analysis, Extra Credit for Undergraduate Students) Download
the “trackRecord.txt” dataset and perform a principal component analysis on the data. The data
give track records for various countries in a series of events (100m, 200m, 400m, 800m, 1500m,
5000m, 10000m, Marathon). Note that the first three are measured in seconds and the
remaining five in minutes.
Choose your PCA method carefully and give a reason for your choice. Your method should
account for the differences in scales of the fields. Try different ways of formulating the analysis
until you get a small set of components that are easy to interpret.
Finally, run a common factor analysis on the same data. What difference, if any, do you find?
Does the factor analysis change your ability to interpret the results practically?
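The factor-analysis step can be sketched with `factanal` on stand-in data; on the track data you would first decide whether to standardize (note that `factanal` works from the correlation matrix by default).

```r
# Stand-in data; with the real file, read trackRecord.txt instead
X <- scale(USArrests)            # standardized stand-in data

# One common factor here; with 4 variables, factanal's degrees-of-freedom
# check allows at most 1 factor (the track data supports more)
fa <- factanal(X, factors = 1, scores = "regression")
fa$loadings                      # compare with the PCA loadings
head(fa$scores)                  # factor scores per row
```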
9) (Reflection) Post a comment on the lectures 3 & 4 forum regarding some topic covered during
these lectures.
10) (Paper review) An academic paper from a conference or Journal will be posted to the
Homework 2 content section of D2L. Review the paper and evaluate their usage of Principal
Component Analysis. In particular, address
1. How suitable is their data for PCA?
2. How are they applying PCA? Are they trying to extract interpretable underlying
variables, or is their goal more along the lines of dimensionality reduction?
3. What kind of factor rotation do they use if any?
4. How many components do they concentrate on in their analysis?
5. Do they evaluate, and how do they evaluate the stability of the components?
6. What conclusions does PCA allow them to draw?