Statistical and Predictive Modeling for Analytics I
Final Project (total 30 points)
The final project is worth 30% of the grade.
The final project uses the hypothesis testing framework and least squares regression to answer a question in economics. Specifically, we would like to answer the question “By how much will another year of schooling raise one’s income?” We will use data collected from interviewing twins about their education, income and background. The records contain information from genetically identical twins, which provides an excellent control for confounding variables. Your task is to use the statistical and predictive modeling techniques you have learned in class to test hypotheses about years of education and income. You will perform exploratory analysis, state hypotheses, and test those hypotheses using simple tests as well as simple and multiple linear regression. Based on your analysis, you will state your conclusions about the research question and estimate any impacts (such as the effect of a unit increase or decrease in years of education).
You will create a PowerPoint deck to report your findings and state your conclusions based on your results.
Data set and related information:
The dataset is available in the UCLA Statistics Course Datasets:
The dataset, labels and summary statistics are uploaded to the project content folder.
Read through the dataset information, variables information and relevant papers.
Note that you will need a method for handling missing data. Please refer to the “Some tips you will find useful” section for more information. Also refer to the Student’s Guide to R for methods of handling missing data.
Note that the dataset twins.dat contains a total of 183 records and includes data for both twins. That is, each record contains the education level, hourly wage and demographics for twin 1 and twin 2.
Here is a link to the main paper that uses this dataset and describes the approaches used:
Read the paper to understand some methods used.
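The note on missing data above can be handled in several ways; one common option is listwise deletion, which drops any record with a missing value in a variable you analyze. The following is a minimal sketch of that idea, written in Python for illustration only (the course materials point to R, where `na.omit()` or `complete.cases()` play the same role). The missing-value codes and the column names `educ_1` and `wage_1` are assumptions, not taken from twins.dat; check the dataset documentation for the actual coding.

```python
def parse_value(token, missing_codes=(".", "NA", "")):
    """Convert a raw text field to float, or None if it is a missing-value code."""
    token = token.strip()
    return None if token in missing_codes else float(token)


def complete_cases(rows, columns):
    """Listwise deletion: keep only rows with no missing value in `columns`."""
    return [row for row in rows if all(row[c] is not None for c in columns)]


# Hypothetical raw records (made-up values, NOT from twins.dat).
raw = [
    {"educ_1": "16", "wage_1": "11.9"},
    {"educ_1": "12", "wage_1": "."},  # wage is missing in this record
    {"educ_1": "14", "wage_1": "8.2"},
]

rows = [{key: parse_value(val) for key, val in rec.items()} for rec in raw]
kept = complete_cases(rows, ["educ_1", "wage_1"])
print(len(rows), "records read;", len(kept), "complete")
```

Whichever method you choose, record how many records were dropped or altered, since Slides 5-6 ask you to describe the resulting dataset.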
The following is a checklist of the contents for each slide.
Slide 1 [3 points]
Name of presenter
Description of the research question
A high-level description of how you will use statistical and predictive modeling (what you have learned in class) to answer the research question.
Slides 2-4 [3 points]
Create some basic plots and graphs (histograms, boxplots, scatterplots) of the data
Compute summary statistics for the variables that you think are important
Create scatterplots showing the bivariate relationships between pairs of variables
Slides 5-6 [4 points]
Describe any abnormalities in the data (such as missing data)
Explain how you addressed these abnormalities and the resulting dataset
Slide 7 [4 points]
State the hypotheses related to the research question
Slides 8-9 [4 points]
Report the results of the t-tests used to test your hypotheses, stating whether each null hypothesis is rejected
State what assumptions need to be satisfied for the t-test and whether they are satisfied
Slides 10-11 [4 points]
Perform a simple linear regression and report the results
Interpret the coefficients
Report on any hypothesis tests
State assumptions and whether they are satisfied
Slides 12-13 [4 points]
Perform a multiple linear regression and report the results
Interpret the coefficients
State assumptions and whether they are satisfied
Slides 14-15 [4 points]
State your conclusions about the research question based on any evidence from your analysis
Some tips you will find useful
1. You might find the following page useful as a starting point for handling missing data:
2. Converting a factor to a numeric variable: this Stack Overflow page has some tips:
3. You will do two t-tests in support of the analysis. Treating the data on the twins as paired data, state appropriate hypotheses and use t-tests to analyze the difference in hourly wages and the difference in years of education.
4. You will fit a simple linear regression of the hourly wage of twin 2 against the self-reported education of twin 2.
5. You will fit a multiple linear regression of log wages against own education, age, age squared, male and white.
6. You can fit any other regressions that you deem necessary to answer the research question.
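The paired t-test in tip 3 and the simple regression in tip 4 reduce to a few formulas. The sketch below computes them by hand so the quantities are visible: the paired t statistic is the mean of the within-pair differences divided by its standard error, and the least squares slope is the sum of cross-deviations divided by the sum of squared deviations of the predictor. It is written in Python for illustration only (in R, `t.test(x, y, paired = TRUE)` and `lm(y ~ x)` do this work and also report p-values). The function names and the toy numbers are hypothetical and are not taken from twins.dat.

```python
import math
import statistics


def paired_t(x, y):
    """Return (mean difference, t statistic, degrees of freedom) for paired samples."""
    if len(x) != len(y):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)  # standard error of the mean difference
    return mean_d, mean_d / se, n - 1


def simple_ols(x, y):
    """Return (intercept, slope) of the least squares line y = a + b * x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


# Made-up illustrative values, NOT from twins.dat.
educ_twin1 = [12.0, 16.0, 14.0, 18.0, 12.0]
educ_twin2 = [12.0, 14.0, 15.0, 16.0, 13.0]
wage_twin2 = [9.5, 13.0, 14.5, 17.0, 11.0]

mean_d, t_stat, df = paired_t(educ_twin1, educ_twin2)
intercept, slope = simple_ols(educ_twin2, wage_twin2)
print(f"mean difference = {mean_d:.2f}, t = {t_stat:.3f}, df = {df}")
print(f"wage = {intercept:.2f} + {slope:.2f} * education")
```

The sketch stops at the test statistic; the p-value comes from comparing it with a t distribution on n - 1 degrees of freedom.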
Final Term Project Rubric
Performance levels for each slide group are listed from highest to lowest score; the lowest level in each group is “Incorrect or Unacceptable”.
Slide 1
Name along with a clear description of the research question is given. A clear description of how statistical and predictive modeling can be used to answer the research question. (3)
Name along with a clear description of the research question is given. Mostly clear description of how statistical and predictive modeling can be used to answer the research question. (2)
Name along with a clear description of the research question is given. An incomplete description of how statistical and predictive modeling can be used to answer the research question. (1)
Description of research problem is incorrect or missing. Description of how statistical and predictive modeling can be used to answer the research question is missing or incorrect. (0)
Slides 2-4
Histograms, boxplots and scatterplots are correct. Statistics computed are correct and meaningful. Scatterplots showing bivariate scatter are correct. (4)
Histograms, boxplots and scatterplots are correct. Statistics computed are mostly correct and meaningful. Scatterplots showing bivariate scatter are mostly correct. (3)
Histograms, boxplots and scatterplots are mostly correct. Statistics computed are mostly correct and meaningful. Scatterplots showing bivariate scatter may be incorrect or incomplete. (2)
Some plots are correct and some statistics are correct. Others are mostly wrong. (0-1)
Slides 5-6
Any abnormalities are clearly identified. Clear explanation of how the abnormalities were addressed is presented along with a description of the final resulting dataset. (4)
Any abnormalities are clearly identified. Mostly clear explanation of how the abnormalities were addressed is presented along with a description of the final resulting dataset. (3)
Any abnormalities are clearly identified. Somewhat clear or incomplete explanation of how the abnormalities were addressed is presented along with a description of the final resulting dataset. (2)
Abnormalities identified are incorrect and explanation is missing or incorrect. (0-1)
Slide 7
Hypotheses are clearly stated and correct. (4)
Hypotheses are clearly stated and mostly correct. (3)
Hypotheses are incomplete. (2)
Hypotheses are missing or incorrect. (0-1)
Slides 8-9
Results of the t-test are reported correctly. Assumptions that need to be satisfied are clearly stated along with whether they were satisfied. (4)
Results of the t-test are reported correctly. Assumptions stated are correct and the explanation of whether they were satisfied is mostly correct. (3)
Results of the t-test are reported correctly. Assumptions are incomplete and the explanation is also incomplete. (2)
Results of the t-test are incorrect. (0-1)
Slides 10-11
Simple linear regression is performed correctly and reported correctly. Coefficients are interpreted correctly, hypothesis tests are reported accurately, and assumptions along with whether they were satisfied are stated. (4)
Simple linear regression is performed correctly and reported correctly. Coefficients are interpreted correctly and hypothesis tests are reported accurately. Assumptions are incomplete. (3)
Simple linear regression is performed correctly and reported correctly. There are some issues with coefficient interpretation, hypothesis tests or assumptions. (2)
Simple linear regression is incorrect and subsequently all other answers are also incorrect. (0-1)
Slides 12-13
Multiple linear regression is performed correctly and reported correctly. Coefficients are interpreted correctly, and assumptions along with whether they were satisfied are stated. (4)
Multiple linear regression is performed correctly and reported correctly. Coefficients are interpreted correctly. Assumptions are incomplete. (3)
Multiple linear regression is performed correctly and reported correctly. There are some issues with coefficient interpretation or assumptions. (2)
Multiple linear regression is incorrect and subsequently all other answers are also incorrect. (0-1)
Slides 14-15
Conclusions about the research question are clearly stated and correct. Evidence for the conclusions is presented clearly. (4)
Conclusions about the research question are clearly stated and correct. Evidence for the conclusions is mostly presented clearly. (3)
Conclusions about the research question are clearly stated and correct. Evidence for the conclusions is incomplete. (2)
Conclusions are incorrect or poorly stated. (0-1)