Guidelines for the R Project
Overview
In this project, you will use R to formulate and answer a series of specific questions about a data set
of your choice. You are expected to:
- Form a group of 5 teammates.
- Only one group may have 4 members, with special permission from the instructor.
- Identify a dataset of interest
- Perform exploratory analysis with R to understand the data
- Investigate hypotheses (i.e., potential questions you want to answer by analyzing this dataset),
and develop preliminary insights
- Prepare a report in Word: include a set of at least 6 visualizations that illustrate your findings,
and interpret these visualizations
- Prepare a presentation in PowerPoint to share your findings with the class
Final Deliverables and Important Dates
1. Proposal
- A 1-page proposal consisting of
o Title
o Team formation
o Dataset of your choice
o Background information: what is it about?
o What attributes/fields are available, and how many records?
o The source of the dataset (e.g., web link)
o Sample records (e.g., the first 10)
o An initial set of at least 3 questions you would like to investigate
- You also need to submit the downloaded raw dataset with your proposal
- due on Nov. 15, submit to Moodle.
2. Presentation
- Nov. 29 (the last class)
- see details below
3. R project solution
- A self-contained project file with source code and raw data
- due on Dec. 3
- submit to Moodle by the team leader
4. Final report
- See details below
- due on Dec. 3
- submit to Moodle by the team leader
5. Peer evaluation
- See details below
- due on Dec. 3
- submit individually to Moodle
Details
Data Selection and Preparation
- First, choose a topic of interest to your team and find a dataset that can provide insights into
that topic. See recommended sources at the end of this guideline.
- Please check with the instructor to ensure the dataset is appropriate for this assignment, and
write a 1-page proposal
- Be advised that data collection and preparation (also known as data wrangling) can be a very
time-consuming process. Be sure you have sufficient time to conduct exploratory analysis,
after preparing the data.
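The dataset facts requested in the proposal (available fields, number of records, and the first 10 sample records) can be collected with a few base R calls. The following is only a minimal sketch: the built-in `mtcars` data frame and the file name `raw_data.csv` are stand-ins chosen for illustration, not part of the assignment, and your own raw file may need different `read.csv` arguments.

```r
# Assumption: `mtcars` (built into R) stands in for a downloaded raw dataset.
# It is written to a temporary CSV so that the sketch runs on any computer.
raw_path <- file.path(tempdir(), "raw_data.csv")  # hypothetical file name
write.csv(mtcars, raw_path, row.names = FALSE)

# Read the raw file exactly as it was downloaded.
dataset <- read.csv(raw_path, stringsAsFactors = FALSE)

n_records   <- nrow(dataset)            # how many records?
field_names <- names(dataset)           # which attributes/fields are available?
field_types <- sapply(dataset, class)   # storage type of each field
sample_rows <- head(dataset, 10)        # sample records (the first 10)

cat("Records:", n_records, "\n")
cat("Fields: ", paste(field_names, collapse = ", "), "\n")
print(field_types)
print(sample_rows)
```

The printed output can be pasted into the proposal; the background information and the source link still have to be written by hand.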
Exploratory and Visual Analysis
You are expected to perform an exploratory analysis of your dataset using R. You should consider
two different phases of exploration.
- In the first phase, you should seek to gain an overview of the shape & structure of your dataset.
What variables does the dataset contain? How are they distributed? Are there any notable data
quality issues? Are there any surprising relationships among the variables?
- In the second phase, you should investigate your initial questions, as well as any new questions
that arise during your exploration, if any. For each question, start by creating a visualization
that might provide a useful answer. Then refine the visualization (for example, by adding
additional variables, changing sorting or axis scales, filtering or subsetting data, etc.) to
develop better perspectives and explore unexpected observations. You should repeat this
process for each of your questions, but feel free to revise your questions or branch off to
explore new questions if the data allow.
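The two phases above might look roughly like the sketch below. It uses only base R, and the built-in `airquality` data set is an assumption standing in for your own data; the question "Is ozone higher on hotter days?" is likewise a hypothetical example, not a required one.

```r
# Assumption: the built-in `airquality` data set stands in for your own data.
aq <- airquality

# Phase 1: overview of shape and structure.
str(aq)                                     # variables and their types
print(summary(aq))                          # distribution of each variable
missing_per_column <- colSums(is.na(aq))    # a simple data quality check
print(missing_per_column)
print(round(cor(aq, use = "pairwise.complete.obs"), 2))  # pairwise relationships

# Phase 2: hypothetical question "Is ozone higher on hotter days?"
# First attempt: a plain scatter plot.
# (When run with Rscript, plots are written to Rplots.pdf in the working directory.)
plot(aq$Temp, aq$Ozone,
     xlab = "Temperature (degrees F)", ylab = "Ozone (ppb)",
     main = "Ozone vs. temperature")

# Refinement: filter out missing ozone readings, add month as an extra
# variable (color), and switch the y axis to a log scale.
complete     <- aq[!is.na(aq$Ozone), ]
months       <- sort(unique(complete$Month))
month_colors <- match(complete$Month, months)   # 1, 2, ... per month
plot(complete$Temp, complete$Ozone, log = "y",
     col = month_colors, pch = 19,
     xlab = "Temperature (degrees F)", ylab = "Ozone (ppb, log scale)",
     main = "Ozone vs. temperature by month")
legend("topleft", legend = month.name[months],
       col = seq_along(months), pch = 19, title = "Month")
```

Each refinement step changes one thing at a time (filtering, an added variable, an axis scale), which makes it easier to see which change revealed a pattern.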
Group Presentation
- Design your presentation slides
- Presentation: a 5-minute storytelling of your work, followed by 2 minutes for Q&A
- Introduce your data and background information, hypotheses/questions, results, and
discuss limitations/future directions.
- Try to make it interesting and rich in information (if time allows).
- Do NOT highlight the technical details of your work (such as code, functions, special
tricks, etc.) during the presentation. Focus on storytelling.
- Due to the short time available, choose 1 or 2 representatives to present. However, all
members must attend and prepare for Q&A.
Coding
- This is an R project: you are expected to use R to process data and present results throughout
the entire project (rather than Excel, Power BI, etc.)
- Create a self-contained R project folder (refer to the structure requirement in the first R
assignment)
- Add appropriate comments to your code
- Working code – your code should run without any error (tip: try it on different computers)
- Results should be consistent with those in your report and presentation
- Zip the whole project directory into a compressed package, and submit to Moodle, including
- Your raw data
- Your code
- Anything else you use
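One common way to keep a project self-contained, so that it runs on different computers, is to build every file path relative to the project folder. The sketch below is only an illustration: the folder names `data`, `scripts`, and `output` are hypothetical (the required structure is the one defined in the first R assignment), a temporary folder stands in for the project root, and the built-in `iris` data stands in for the raw data.

```r
# Hypothetical layout (follow the structure from the first R assignment instead):
#   my_project/
#     data/     raw data, never edited by hand
#     scripts/  R source files
#     output/   generated figures and tables

# Assumption: a temporary folder plays the role of the project root here.
# In a real project this would be the folder that is zipped and submitted.
project_root <- file.path(tempdir(), "my_project")
for (sub in c("data", "scripts", "output")) {
  dir.create(file.path(project_root, sub), recursive = TRUE, showWarnings = FALSE)
}

# Build every path from the project root rather than hard-coding
# machine-specific absolute paths.
raw_file <- file.path(project_root, "data", "raw_data.csv")
out_file <- file.path(project_root, "output", "species_means.csv")

write.csv(iris, raw_file, row.names = FALSE)   # stand-in for the raw data

# Fail early with a clear message if the raw data is missing.
if (!file.exists(raw_file)) stop("Raw data not found: ", raw_file)
raw <- read.csv(raw_file)

# Derived results are written to the output folder, never over the raw data.
species_means <- aggregate(Sepal.Length ~ Species, data = raw, FUN = mean)
write.csv(species_means, out_file, row.names = FALSE)
print(species_means)
```

Because nothing outside the project folder is referenced, unzipping the package on another computer and rerunning the scripts should reproduce the same outputs.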
Final Report
Your final submission will be a written report. Focus on the answers to your initial questions. If
applicable, describe surprises as well as challenges encountered along the way, e.g. data quality
issues. Each visualization should be accompanied by a title and a short caption (<2
sentences). Provide sufficient detail in each caption so that anyone can read through your
report and understand your findings. Feel free to annotate your images to draw attention to specific
features of the data.
- Recommended report outline (revise or enhance if needed)
o Title page (report title and team members)
o Abstract (no more than 150 words)
o Data description – introduce the dataset and related background information, and
indicate the source of the data
o Research questions – introduce the questions you want to answer and your
motivation
o Results – analytical results and visualizations
o Summary – briefly summarize and discuss your findings
o Future work – describe how your solution could be extended or improved
o References – the literature you have cited
Do NOT put code into this report. The code should be submitted separately.
General Grading Criteria
- Poses clear questions applicable to the chosen dataset.
- Appropriate data wrangling (preprocessing) and exploratory data analysis (EDA)
- Breadth and depth of analysis
- Expressive & effective visualizations appropriate to analysis questions.
- Clearly written, understandable captions that communicate primary insights.
- Originality. Submissions will be checked with Turnitin for originality. Remember to cite all
references properly.
Detailed Grading Components (100 points in total)
o Part 1: proposal (10 points)
o Part 2: report (30 points)
- In general, the report will be graded on its content (correctness and accuracy), breadth and
depth of discussion, report structure, originality, and writing quality.
o Part 3: presentation content (20 points, delivered by 1 or 2 representatives)
- Slide design
- Correct and accurate information, logical arguments
- Content richness (relevant and rich information, well-defined terms)
- Presentation delivery (preparation, expression clarity)
- Ability to answer questions
- Time management
o Part 4: coding (30 points)
- Working code
- Code readability, necessary comments
- Output consistent with report
- Originality
- A well-structured self-contained project
o Part 5: peer evaluation (10 points, individual-based evaluation)
- The evaluation in this part is based on the average contribution percentage (CP) through
intra-group peer evaluation. Each student is expected to submit his/her evaluation separately
to Moodle.
- Your CP = average(intra-group evaluation of your contribution)
- For a group of 5, for example, the equal-contribution percentage (ECP) is 100% ÷ 5 = 20%
- You earn the full 10 points if your CP equals the ECP. You may earn as many as 15 points in
this part if your CP is significantly higher than the ECP, and as few as 0 points if your CP is
significantly lower than the ECP.
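The CP and ECP arithmetic can be worked through with made-up numbers. In the sketch below, the five members and all ratings are hypothetical, and it assumes each evaluator distributes 100% across the whole group, self-rating included; the guideline does not state whether self-ratings count, nor exactly how the gap between CP and ECP maps to points, so the sketch stops at computing the two percentages.

```r
# Hypothetical ratings (in percent). Each row is one evaluator, each column
# is the member being evaluated. Assumption: every evaluator distributes
# exactly 100% across the group, including themselves.
members <- c("A", "B", "C", "D", "E")
ratings <- matrix(
  c(20, 20, 20, 20, 20,
    25, 20, 20, 20, 15,
    20, 25, 20, 20, 15,
    25, 20, 20, 20, 15,
    20, 20, 25, 20, 15),
  nrow = 5, byrow = TRUE,
  dimnames = list(evaluator = members, member = members)
)
stopifnot(all(rowSums(ratings) == 100))

cp  <- colMeans(ratings)        # CP = average(intra-group evaluation)
ecp <- 100 / length(members)    # ECP = 100% / 5 = 20%

print(cp)         # A = 22, B = 21, C = 21, D = 20, E = 16
print(cp - ecp)   # positive: above equal contribution; negative: below
```

With these numbers, member D sits exactly at the ECP, members A, B, and C are slightly above it, and member E is 4 percentage points below it.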
Data Sources
- Open databases
o Kaggle datasets
o Awesome Public Datasets: topic-centric list of high-quality open datasets in public domains
o Macau government open database: Macau regional statistics
o Chinese government open databases: provided by the National Bureau of Statistics of China
o Databases in business-related subjects: commercial databases available in the UM library,
accessible only within UM
- Non-public datasets
o You may also choose datasets that are not open to the public. In such a case, please indicate
the source of the data.
- Notes and hints
o We recommend choosing a business-related dataset; interesting datasets in other
domains are also good choices.
o We do not recommend choosing datasets in a highly specialized domain (e.g., biology,
physics, etc.), unless you are very familiar with that domain.
o Choose a dataset that comes with sufficient description and/or background information.
It is unwise to choose a dataset with little accompanying description, because you would
have to guess the meaning of its attributes and values.