COMPSCI 2034b / DIGIHUM 2144b

Data Analytics: Principles and Tools

Assignment #3

Decision Trees & Visualizations

Posted: March 18th 2019

Due: April 5th 2019 11:55PM

Total: 100 Points (10% of Final Grade)

CS2034 - Data Analytics: Principles and Tools Assignment #3

Learning Outcomes

By completing this assignment, you will gain and demonstrate skills relating to:

Creating Decision Trees

Applying Information Theory Concepts

Calculating Entropy and Information Gain

Creating Visualizations

Processing Data to be Visualized

Instructions

This assignment is divided into two distinct activities, one dealing with decision trees and one with visualizations. In both activities, it is left to you to decide the best way to process the data and do the required calculations using the techniques that have been covered in class and the labs. Precise step-by-step instructions are intentionally not given so you can demonstrate the skills you have acquired in this course.

For both activities you should use Excel (and optionally VBA) as your primary tool for processing data and making any calculations. You must provide full details on the steps you took to process the data and make any calculations clear to the reader of the Excel document. This can be done by including notes in the Excel sheet (e.g. cells with text in them explaining the calculations being done in other cells), documenting/commenting any VBA code (if used) or by including a PDF with text explaining your work. For each activity, you are expected to include an Excel sheet with your processed data and calculations as part of your final deliverables.

You should check that your Excel documents and any VBA code (if used) work correctly and are compatible with the GenLab computers and Excel 2016 for Windows.

You will be assessed on the following:

Using the correct file from OWL (activity 2).

Showing your work, calculations and steps taken to process the data.

Your Excel formulas and operations.

Your VBA code (if used).

Completion of each task correctly.

Using appropriate visualizations (activity 2).

Producing the final deliverables as described.

Assignment submission via OWL before deadline.

Activity 1: Decision Trees

Below is a table of observations of 18 objects (numbered O1 to O18).

Table 1: Object Observation

Object  Colour  Roundness  Size    Texture  Class
O1      Yellow  Round      Small   Rough    Duck
O2      Blue    Round      Large   Rough    Not a Duck
O3      Yellow  Round      Small   Smooth   Duck
O4      Red     Round      Medium  Rough    Duck
O5      Blue    Square     Small   Smooth   Not a Duck
O6      Blue    Square     Large   Rough    Not a Duck
O7      Red     Round      Small   Rough    Duck
O8      Blue    Square     Medium  Rough    Not a Duck
O9      Red     Square     Small   Smooth   Not a Duck
O10     Yellow  Square     Large   Rough    Duck
O11     Yellow  Square     Medium  Rough    Not a Duck
O12     Yellow  Round      Large   Rough    Not a Duck
O13     Red     Square     Large   Smooth   Duck
O14     Yellow  Square     Medium  Smooth   Duck
O15     Red     Square     Medium  Rough    Not a Duck
O16     Yellow  Round      Large   Smooth   Duck
O17     Blue    Round      Large   Smooth   Not a Duck
O18     Blue    Round      Medium  Smooth   Not a Duck

Colour, Roundness, Size and Texture are attributes of the objects (features) and Class denotes if the object is a rubber duck (Duck) or some other object (Not a Duck). Assume that only the values shown in this table are possible for each attribute (i.e. “Green” is not a valid Colour and “Medium” is not a valid value for Roundness).

Task 1.1

Using Table 1 as your training data, create a full decision tree to classify an object as Duck or Not a Duck based on the attributes Colour, Roundness, Size, and Texture. Use the method based on Information Theory described in the week 10 lecture and Lab 10.

You are required to show your work and calculations for each step of the process, including the Entropy and Information Gain values needed to find each node in the tree (even if you could “eyeball it” accurately).
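To sanity-check the values you compute in Excel, the two quantities can be sketched in a few lines of Python. This is only an illustration and not the Lab 10 code: it assumes the standard Shannon entropy in bits and the usual definition of Information Gain (entropy of the class before a split minus the weighted entropy of the subsets after it), which may be presented differently in the lecture. The function names, the `Shape` attribute and the toy rows are made up for the example and are not Table 1.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target="Class"):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Toy rows for illustration only (hypothetical data, not Table 1).
rows = [
    {"Shape": "Round", "Class": "Duck"},
    {"Shape": "Round", "Class": "Duck"},
    {"Shape": "Square", "Class": "Not a Duck"},
    {"Shape": "Square", "Class": "Duck"},
]

print(round(entropy([row["Class"] for row in rows]), 4))
print(round(information_gain(rows, "Shape"), 4))
```

A subset that contains only one class has entropy 0, and a subset split evenly between two classes has entropy 1, which gives two quick checks for any spreadsheet formula.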

Do all of your calculations in Excel. You may use VBA, including the code for the entropy function from Lab 10 (you would have to modify it to work with this data) but this is not required (you can do all calculations with just Excel formulas).

You are required to make your calculations clear and understandable to any reader of the Excel document. You should include notes as text in cells to explain any complicated calculations and make it clear what you are calculating. Use multiple sheets such that the calculations for each node in the tree are on a different sheet in your Excel workbook and make it clear what node the sheet is for. If you use VBA code it should be documented/commented.

You may include a PDF with notes (see the Deliverables section for details) to the TA about how you did your calculations and processed the data but you should still have notes in the Excel sheet.

You are allowed to do some manual processing of the data and hard coding. For example, you can manually copy the table of observations and delete rows to create a subset of the data (you are not required to automate this). However, the more you automate the easier/faster it will be to calculate the next node in the tree.

Note that the same attribute can appear multiple times in a decision tree as long as it appears only once on any given path from the root node to a leaf node.
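The once-per-path rule falls out naturally when the tree is built recursively. The sketch below is a hedged illustration of an ID3-style procedure, assuming Shannon entropy and Information Gain as the splitting criterion; it is not the course's reference method, and the names `build_tree`, `gain`, `Shape` and the toy rows are hypothetical. When two attributes tie, `max` simply keeps the first one in the list, which is an arbitrary choice.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain(rows, attribute, target):
    """Information gain obtained by splitting rows on attribute."""
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([row[target] for row in rows]) - remainder


def build_tree(rows, attributes, target="Class"):
    """Return a class label (leaf) or {"attribute": name, "branches": {value: subtree}}."""
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1 or not attributes:
        # Pure subset, or nothing left to split on: fall back to the majority class.
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(rows, a, target))
    # Each recursive call gets its own list of remaining attributes, so an
    # attribute is used at most once per path but may recur in sibling branches.
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        branches[value] = build_tree(subset, remaining, target)
    return {"attribute": best, "branches": branches}


# Toy rows for illustration only (hypothetical data, not Table 1).
rows = [
    {"Colour": "Blue", "Shape": "Round", "Class": "Not a Duck"},
    {"Colour": "Blue", "Shape": "Square", "Class": "Not a Duck"},
    {"Colour": "Yellow", "Shape": "Round", "Class": "Duck"},
    {"Colour": "Yellow", "Shape": "Square", "Class": "Not a Duck"},
    {"Colour": "Yellow", "Shape": "Round", "Class": "Duck"},
    {"Colour": "Blue", "Shape": "Round", "Class": "Not a Duck"},
]

tree = build_tree(rows, ["Colour", "Shape"])
print(tree)
```

Note that this sketch only creates branches for values present in the current subset; how to handle a valid attribute value with no matching rows is a decision the sketch does not make for you.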

Task 1.2

After you have completed Task 1.1 and believe your calculations to be correct, create a diagram of your decision tree that clearly labels all attributes (nodes), classes (leaf nodes) and values (branches/edges).

You may use any software you are comfortable with to create this diagram so long as everything is labelled clearly. See the Deliverables section for details on format and file name.

Below is an example decision tree diagram from the week 10 slides (it uses different data, so your tree will look different and have different attributes/values).

Task 1.3

Use your decision tree to classify the following new objects (N1 to N6) based on their attributes.

Table 2: New Object Observation

New Object  Colour  Roundness  Size    Texture
N1          Yellow  Square     Small   Rough
N2          Red     Square     Medium  Smooth
N3          Blue    Round      Small   Smooth
N4          Yellow  Square     Large   Smooth
N5          Yellow  Round      Large   Rough
N6          Red     Round      Large   Rough

Give your answers in a PDF file (see the Deliverables section for details) and include a brief explanation (two to three sentences) of how you classify new observations using a decision tree.
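The mechanics of classification can be sketched as a walk from the root to a leaf: at each node, look up the observation's value for that node's attribute and follow the matching branch until a class label is reached. The Python below is a minimal sketch under assumed conventions (a nested dictionary with hypothetical `attribute` and `branches` keys); the tree shown is invented for illustration and is not the tree for Table 1.

```python
def classify(tree, observation):
    """Follow the branch matching each attribute value until a leaf (class label) is reached."""
    node = tree
    while isinstance(node, dict):
        value = observation[node["attribute"]]
        node = node["branches"][value]
    return node


# Hypothetical tree for illustration only; it is not the tree for Table 1.
toy_tree = {
    "attribute": "Texture",
    "branches": {
        "Smooth": "Duck",
        "Rough": {
            "attribute": "Size",
            "branches": {
                "Small": "Duck",
                "Medium": "Not a Duck",
                "Large": "Not a Duck",
            },
        },
    },
}

print(classify(toy_tree, {"Texture": "Rough", "Size": "Small"}))
```

Attributes that do not lie on the path taken are never consulted, which is why an observation can be classified without looking at every column.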

Activity 1 Deliverables

For this activity you must submit:

An Excel workbook, named userid act1.xlsx or userid act1.xlsm (if you used VBA) where userid is your UWO user id, that contains all of your calculations, data processing and VBA code (if used) for Task 1.1.

A PDF named userid act1.pdf, where userid is your UWO user id, that contains any notes for Task 1.1, your diagram for Task 1.2 and your answers to Task 1.3. The diagram for Task 1.2 must be legible, not overly pixelated or cut off/cropped. This PDF should be easy for the TA to read, and it should be clear which answers belong to which task.

You must submit these deliverables via OWL with the deliverables from Activity 2.

Activity 2: Visualization

Download the file tweetdata.xlsx from OWL. This file contains the processed tweet data from Assignment #2 with two enhancements. The location column has been cleaned up and split into City, Province and Country columns. The string “NULL” is used in cases where the City, Province or Country could not be determined. The sentiment values have also been updated using a sentimentCalc function that considers far more positive and negative keywords.

Base your visualizations and work in the following tasks on this updated tweetdata.xlsx file and not your own work from Assignment #2.

For this activity, you may create your visualizations using any tool you are comfortable with and have access to. The following tools are recommended and you may use more than one:

The RAW site (used in Lab 9)

Excel (Charts, Power View, Power Map, 3D Map, etc.)

HeatMapper.ca

Any other visualization tool mentioned in the week 9 slides.

Task 2.1: Country Visualizations

Process the Data

Using techniques we have covered in lectures, labs and assignments, create a new sheet in the tweetdata.xlsx workbook titled “Country” that contains a list of all of the countries in the data (containing each country only once). You are allowed to use VBA (but you are not required to) and do some manual steps (e.g. copying and pasting, using Excel’s sort feature, etc.).

For each country in the list, calculate the average sentiment, number of tweets in the data set, number and percentage of positive, negative and neutral tweets and any other value you need to create the visualizations in the next steps.
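As a cross-check for the values in your “Country” sheet, the per-country aggregation can be sketched in Python as follows. This is an illustration only and does not replace the Excel work. It assumes that a sentiment value above zero counts as positive, below zero as negative, and exactly zero as neutral; confirm that rule against the course material. The field names `country` and `sentiment` and the sample tweets are hypothetical and do not come from tweetdata.xlsx.

```python
from collections import defaultdict


def summarize_by_country(tweets):
    """Group tweets by country; compute count, average sentiment and class percentages."""
    groups = defaultdict(list)
    for tweet in tweets:
        groups[tweet["country"]].append(tweet["sentiment"])

    summary = {}
    for country, scores in groups.items():
        total = len(scores)
        # Assumed rule: > 0 is positive, < 0 is negative, 0 is neutral.
        positive = sum(1 for s in scores if s > 0)
        negative = sum(1 for s in scores if s < 0)
        neutral = total - positive - negative
        summary[country] = {
            "tweets": total,
            "avg_sentiment": sum(scores) / total,
            "pct_positive": 100 * positive / total,
            "pct_negative": 100 * negative / total,
            "pct_neutral": 100 * neutral / total,
        }
    return summary


# Hypothetical sample rows, not taken from tweetdata.xlsx.
tweets = [
    {"country": "Canada", "sentiment": 2},
    {"country": "Canada", "sentiment": -1},
    {"country": "Canada", "sentiment": 0},
    {"country": "Canada", "sentiment": 3},
    {"country": "France", "sentiment": -2},
]

summary = summarize_by_country(tweets)
print(summary["Canada"])
```

Because the neutral count is derived as the total minus the positive and negative counts, the three percentages for a country always account for every tweet from that country.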

You must include notes in your Excel workbook detailing how you processed your data (e.g. you need to describe how you created the list of countries). You may also include notes in a PDF file (see the Deliverables section for this activity for details).

Example of what your “Country” sheet might look like (not all countries shown and data intentionally blurred out):

Create the Visualizations

Create the following visualizations using the Country data:

1. A visualization that best shows the rank of the top 10 countries by total tweets.

2. A visualization that best shows the rank of the top 10 countries by average sentiment.

3. A visualization that best shows the percentage of positive, negative and neutral tweets for Canada (out of the total number of tweets for Canada).

4. A visualization that best shows the total number of tweets for each country geospatially (e.g. on a map).

5. A visualization that best shows the percentage of negative tweets for each country geospatially.

6. A visualization that best shows the percentage of positive tweets for each country geospatially.

The percentages for 3, 5 and 6 should be based on the number of tweets for that country and not the total number of tweets (i.e. the positive percentage, negative percentage and neutral percentage should add up to exactly 100% for each country).

Note that you may have to do more processing and clean up the data more depending on the tools you use to create the visualization. For example, you may need to edit the country names slightly to get them all to work with mapping tools.

For each visualization, include a title and appropriate labels. If a legend is required or appropriate for the visualization type you pick, include that as well.
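For the two top-10 rankings, the underlying step is a descending sort followed by taking the first ten entries, which Excel's sort feature can also do. A hedged Python sketch of that step is shown here; the function name `top_n`, the dictionary layout and the numbers are hypothetical and are not taken from tweetdata.xlsx.

```python
def top_n(summary, key, n=10):
    """Return the n (country, value) pairs with the largest value for the given key."""
    ranked = sorted(summary.items(), key=lambda item: item[1][key], reverse=True)
    return [(country, stats[key]) for country, stats in ranked[:n]]


# Hypothetical per-country values for illustration only.
summary = {
    "Canada": {"tweets": 4, "avg_sentiment": 1.0},
    "France": {"tweets": 1, "avg_sentiment": -2.0},
    "Brazil": {"tweets": 7, "avg_sentiment": 0.5},
}

print(top_n(summary, "tweets", n=2))
print(top_n(summary, "avg_sentiment", n=2))
```

Countries with equal values keep their original relative order here, so decide separately how ties around tenth place should be handled.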

Task 2.2: Hierarchical Visualizations

Process the Data

Using techniques we have covered in lectures, labs and assignments, create a new sheet in the tweetdata.xlsx workbook titled “Hierarchy” that contains a list of all of the unique city, province and country combinations in the data. That is, each combination of a city, province and country found in the data should be listed exactly once. Any row with a “NULL” value for city, province or country should be ignored. You are allowed to use VBA (but you are not required to) and do some manual steps (e.g. copying and pasting, using Excel’s sort feature, etc.).

For each combination in the list, calculate the average sentiment, number of tweets in the data set, number and percentage of positive, negative and neutral tweets and any other value you need to create the visualizations in the next steps.
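The list of unique combinations, with rows containing “NULL” skipped, can be sketched in Python as a check on your “Hierarchy” sheet. This is an illustration under assumed field names (`city`, `province`, `country`); the sample rows are hypothetical and do not come from tweetdata.xlsx.

```python
from collections import Counter


def unique_combinations(rows):
    """Count tweets per (city, province, country), skipping rows with any "NULL" part."""
    counts = Counter()
    for row in rows:
        combo = (row["city"], row["province"], row["country"])
        if "NULL" in combo:
            # Ignore rows where any level of the hierarchy is unknown.
            continue
        counts[combo] += 1
    return counts


# Hypothetical sample rows for illustration only.
rows = [
    {"city": "London", "province": "Ontario", "country": "Canada"},
    {"city": "Toronto", "province": "Ontario", "country": "Canada"},
    {"city": "London", "province": "Ontario", "country": "Canada"},
    {"city": "NULL", "province": "Ontario", "country": "Canada"},
]

for combo, count in unique_combinations(rows).items():
    print(combo, count)
```

Keying on the full three-part combination rather than the city alone keeps cities that share a name in different provinces or countries apart.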

You must include notes in your Excel workbook detailing how you processed your data (e.g. you need to describe how you created the list of combinations). You may also include notes in a PDF file (see the Deliverables section for this activity for details).

Example of what your “Hierarchy” sheet might look like (not all combinations shown and data intentionally blurred out):

Create the Visualizations

Create the following visualizations using the Hierarchy data:

1. A visualization that shows the hierarchical relationship between cities, provinces and countries. No data values (e.g. average sentiment, number of tweets, etc.) should be shown or used (i.e. the hierarchy should not be weighted).

2. A visualization that shows the hierarchical relationship between cities, provinces and countries weighted by the total number of tweets.

3. A visualization that shows the hierarchical relationship between cities, provinces and countries weighted by the total number of positive tweets. This visualization must use a different visualization type than the one you used for the last visualization.

4. A visualization that shows the flow of negative tweets from cities to provinces to countries.

Note that you may have to do more processing and/or clean up the data more depending on the tools you use to create the visualization.
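As one example of such extra processing, some flow-diagram tools expect a list of source, target and value rows rather than one row per city; whether your chosen tool does is something to check in its own documentation. The Python sketch below shows one way to derive such links from Hierarchy-style rows. The field names, the `negative` column and the numbers are hypothetical.

```python
from collections import Counter


def flow_links(rows):
    """Aggregate negative-tweet counts into city->province and province->country links."""
    links = Counter()
    for row in rows:
        links[(row["city"], row["province"])] += row["negative"]
        links[(row["province"], row["country"])] += row["negative"]
    return [(source, target, value) for (source, target), value in links.items()]


# Hypothetical Hierarchy rows for illustration only.
rows = [
    {"city": "London", "province": "Ontario", "country": "Canada", "negative": 3},
    {"city": "Toronto", "province": "Ontario", "country": "Canada", "negative": 5},
    {"city": "Montreal", "province": "Quebec", "country": "Canada", "negative": 2},
]

for link in flow_links(rows):
    print(link)
```

The province-to-country value is the sum over that province's cities, so the flow into each province equals the flow out of it. Places that share a name across levels would be merged by this sketch and may need distinct labels.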

For each visualization, include a title and appropriate labels. If a legend is required or appropriate for the visualization type you pick, include that as well.

Hint: The RAW site might be a useful tool for creating some of these visualizations and deciding which visualization type to use.

Task 2.3: Infographic

Note: You will not be graded on your graphic skills per se but on how well you communicate the results and take advantage of the Gestalt Principles.

Using the data in tweetdata.xlsx and the data you have processed, create a unique visualization (distinct from the visualizations from the previous tasks) that shows aspects of the data you find interesting. You may use parts of the data we have not yet dealt with like followers, friends, verified, etc. Show any work you do for processing the data for this visualization in a new sheet named “MyVis”.

Using at least 3 of the visual representations you have created in Task 2.1 or 2.2 and your unique visualization, create an infographic using Paint, Adobe Photoshop (available in some GenLabs) or other software available to you. The following web-based tools may also be of use:

Piktochart

Venngage

Canva

easelly

For tips on how to create infographics, start with the article “19 Warning Signs Your Infographic Stinks” and search the web for good examples.

Your infographic should:

Explain the data set, and the images you included from Task 2.1 or 2.2.

Explain your unique visualization.

Have at least 3 facts about the data.

Have a title and at least 2 subsections.

Take advantage of at least some of the Gestalt Principles to help communicate your analysis.

Activity 2 Deliverables

For this activity you must submit:

The tweetdata.xlsx file renamed to userid act2.xlsx or userid act2.xlsm (if you used VBA) where userid is your UWO user id. This file should contain all of your calculations, data processing and VBA code (if used) for Tasks 2.1 to 2.3.

A PDF named userid act2.pdf, where userid is your UWO user id, that contains any notes for Tasks 2.1 to 2.3, your visualizations for each task and your infographic for Task 2.3. The visualizations must be legible, not overly pixelated, and not cut off or cropped in any way that hides data. This PDF should be easy for the TA to read, and it should be clear which answers belong to which task and visualization.

A short paragraph explaining which Gestalt Principles you used in your visualizations and/or infographic. This should be included at the end of the above-mentioned PDF file named userid act2.pdf.

You must submit these deliverables via OWL with the deliverables from Activity 1.
