Fall 2018 IS542 Final
Due Tuesday December 18, 5:00PM US Central Time
Discuss two or more of the following questions, in your own words. You may choose to address any two,
three, four, or even all questions but should target 3-4 pages of text in total (not counting figures, tables,
and references). Upload your answers to the final section of the class Moodle page as a single narrative
document in PDF format. You may, and are encouraged to, illustrate your answers using R, but that's no
substitute for lucid natural language explanations. To preserve the natural flow of the narrative, figures
and tables should be embedded into the document near their first mention. Any supplementary files like
code or data should be referenced in the text and separately uploaded. You may use books, articles, notes,
search engines, or computers, but may not solicit or receive direct assistance from other human beings.
Cite sources if you use them. For the first three questions you may want to illustrate technical detail using
R, discuss practical aspects that are important for applications, and theoretical aspects of the subject.
Question 1. Construct a dataset with at least 8 observations and 3 variables (y, x1, and x2) such that least
squares linear regression of y versus x1 produces y = -2x1 + e1 and regressing y versus x1 and x2
produces y = 2x1 - x2 + e2. How might you interpret the relationship between y and x1? Show your work
in R.
Question 2. Write a short essay, in your own words, explaining the four assumptions of linear regression
and show how to test them on a dataset of your choice. Show your work in R.
Question 3. Write a short essay, in your own words, on the subject of Bayes' theorem, and illustrate its use
in an application of your making.
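For reference, the standard statement of the theorem is sketched below. The symbols H (a hypothesis) and E (observed evidence) are notation chosen here for illustration and do not come from the course materials.

```latex
% Bayes' theorem: posterior = likelihood * prior / evidence
P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)},
\qquad
P(E) = P(E \mid H)\, P(H) + P(E \mid \neg H)\, P(\neg H)
```

The second expression expands the denominator by the law of total probability for the two-hypothesis case (H versus not-H).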
Question 4. R challenge. During the last class session we worked with the circle.arff dataset, assessing
the cross-validated performance of a wide variety of classification algorithms such as decision trees,
random forest, rules, support vector machine, Naïve Bayes, Bayes Net, logistic regression, neural net, k-nearest
neighbor, and boosting. Replicate some of these experiments using R.
http://abel.lis.illinois.edu/data/circle.arff
Question 5. R challenge: The data directory contains a file with author names and associated Ethnea and
Genni predictions. Use logistic regression to identify character n-grams of first and/or last names that may
help predict the Ethnea categories. It might be helpful to install and use an R package such as tm that is
able to extract character n-grams. Classification performance can be assessed using precision and recall
for each Ethnea category, and classes that are the most similar can be identified using the
confusion matrix.
Full dataset:
http://abel.ischool.illinois.edu/data/names_ethnea_genni_country.csv
Of which a smaller, random sample is given here:
http://abel.ischool.illinois.edu/data/names_ethnea_genni_country_sample.csv
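The per-category precision and recall mentioned in Question 5 can be read off a confusion matrix. The following is a minimal sketch in base R, assuming you already have a vector of true labels and a vector of predicted labels. The class names "A", "B", "C" and the toy labels are invented for illustration and are not taken from the Ethnea data.

```r
# Toy labels, invented for illustration only (not drawn from the Ethnea data).
lv        <- c("A", "B", "C")
actual    <- factor(c("A", "A", "A", "B", "B", "C", "C", "C"), levels = lv)
predicted <- factor(c("A", "A", "B", "B", "B", "C", "A", "C"), levels = lv)

# Confusion matrix: rows are actual classes, columns are predicted classes.
cm <- table(actual = actual, predicted = predicted)

true_pos  <- diag(cm)                # correctly classified cases per class
precision <- true_pos / colSums(cm)  # share of each predicted class that is correct
recall    <- true_pos / rowSums(cm)  # share of each actual class that is recovered

print(cm)
print(round(cbind(precision, recall), 2))
```

Large off-diagonal cells of `cm` point to the pairs of classes that are confused most often. Note that precision is undefined (NaN) for any class that is never predicted.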
References:
Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale
bibliographic database. International Symposium on Science of Science, March 22-23, 2016, Library of
Congress, Washington DC, USA. http://hdl.handle.net/2142/88927