Date: 2021-04-23 10:22

ELEC0033 - 2020/2021

5 Data Analytics Task - Climate Data Analysis using Python

5.1 General Overview

The assignment comprises individual code writing, data analysis and inference. You are allowed to discuss ideas with peers, but your code, experiments and report must be based solely on your own work.

The assignment leverages elements covered in class (data analytics lecture). You will be working with a couple of meteorological datasets. You will be required to crunch data, clean the datasets and infer hidden patterns. Specifically, there are three tasks for you to solve.

The goals of the assignment are the following:

● To further develop your programming skills

● To further develop your skills and your understanding of the principles of data analytics and machine learning

● To acquire experience in dealing with real-world data

5.2 Assignment description

1. Dataset description

You will find two pickle files, named weather-denmark-resampled.pkl and df_perth.pkl.

For TASKS 1 and 2, which cover the main aspects of preliminary data analysis, missing data and outlier detection, you must use the first dataset.

For TASK 3, which covers correlation and pattern inference, you will use the second, smaller dataset to find correlations and infer patterns.

2. Tasks to be solved

Read the three task descriptions carefully and address them using the pre-compiled Jupyter notebook named Coursework_weather_data.ipynb.

TASK 1 - PRELIMINARY ANALYSIS

In this first task, you will explore the dataset. Follow the instructions below:

a. Import the weather-denmark-resampled.pkl dataset provided in the folder and explore the dataset by answering the following questions.

   i. How many cities are there in the dataset?

   ii. How many observations and features are there in this dataset?

   iii. What are the names of the different features?

b. Now that you are familiar with the dataset, check whether it contains any missing values. If so, remove them using the pandas built-in function.

c. Extract the general statistical properties, summarising the minimum, maximum, median, mean and standard deviation values for all the features in the dataset. Spot any anomalies in these properties and clearly explain why you classify them as anomalies.
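Steps a to c can be sketched along the following lines. This is only an illustration under stated assumptions, not the expected solution: the coursework pickle file is not reproduced here, so the sketch builds a small synthetic DataFrame. Its two-level column layout (city, then feature) is an assumption based on the hint in Task 2, and the city names, feature names and values are placeholders.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the coursework data. Assumed layout: two-level
# columns (city, feature). City and feature names are placeholders.
rng = np.random.default_rng(0)
index = pd.date_range("2006-05-01", periods=48, freq="h")
columns = pd.MultiIndex.from_product(
    [["CityA", "CityB", "CityC"], ["Temp", "Pressure"]]
)
df = pd.DataFrame(
    rng.normal(loc=10.0, scale=2.0, size=(len(index), len(columns))),
    index=index,
    columns=columns,
)
df.iloc[5, 0] = np.nan   # plant one missing value
df.iloc[20, 3] = np.nan  # and another one in a different column

# With the real file one would load it instead (not run here):
# df = pd.read_pickle("weather-denmark-resampled.pkl")

# (a) Cities, features and overall size.
cities = df.columns.get_level_values(0).unique().tolist()
features = df.columns.get_level_values(1).unique().tolist()
n_observations, n_columns = df.shape

# (b) Count missing values per column, then drop every incomplete row.
missing_per_column = df.isna().sum()
df_clean = df.dropna()

# (c) One row per statistic, one column per (city, feature) pair.
summary = df_clean.agg(["min", "max", "median", "mean", "std"])

print(cities, features, n_observations, n_columns)
print(missing_per_column)
print(summary)
```

Note that dropna() with its default arguments discards a whole row as soon as one column is missing, so comparing the summary statistics before and after cleaning is one way to check how much the dataset has changed.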

TASK 2 – OUTLIERS

The second task focuses on spotting and handling outliers. Follow the instructions below:

d. Store the temperature measurements in May 2006 for the city of Odense. Then produce a simple plot of the temperature versus time.

HINT: In this dataset, the cities are vertically stacked. Therefore, we have a multi-column dataset, which basically works as a nested dictionary.

e. Find the outliers in this set of measurements (if any) and replace them using linear interpolation.
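Steps d and e might look roughly like the sketch below. Several things in it are assumptions rather than part of the assignment: the data are synthetic, the feature name "Temp" and the second city "OtherCity" are placeholders, and the outlier rule (distance from a centred rolling median, with a hand-picked threshold) is just one possible choice among many. The plotting helper is defined but not called, so the sketch runs without a display.

```python
import numpy as np
import pandas as pd

# Synthetic hourly data with an assumed (city, feature) column layout.
index = pd.date_range("2006-04-25", "2006-06-05", freq="h")
hours = np.arange(len(index))
daily_cycle = 12 + 5 * np.sin(2 * np.pi * hours / 24)
df = pd.DataFrame(
    {
        ("Odense", "Temp"): daily_cycle,
        ("Odense", "Pressure"): np.full(len(index), 1010.0),
        ("OtherCity", "Temp"): daily_cycle - 1.0,
    },
    index=index,
)
# Plant two artificial spikes so that there is something to detect.
df.loc[pd.Timestamp("2006-05-10 14:00"), ("Odense", "Temp")] = 50.0
df.loc[pd.Timestamp("2006-05-20 03:00"), ("Odense", "Temp")] = -30.0

# (d) Nested-dictionary style access, then partial-string date selection.
odense_may = df["Odense"]["Temp"].loc["2006-05"]


def plot_temperature(series):
    """Draw temperature versus time (requires matplotlib)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(series.index, series.values)
    ax.set_xlabel("Time")
    ax.set_ylabel("Temperature")
    fig.autofmt_xdate()
    return fig


# (e) Flag points far from a centred rolling median (assumed rule),
# blank them out, and fill the gaps by linear interpolation.
rolling_median = odense_may.rolling(window=5, center=True, min_periods=1).median()
threshold = 5.0  # hypothetical value, to be tuned on the real data
is_outlier = (odense_may - rolling_median).abs() > threshold
cleaned = odense_may.mask(is_outlier).interpolate(method="linear")

print("outliers found:", int(is_outlier.sum()))
```

Calling plot_temperature(odense_may) and plot_temperature(cleaned) would then show the series before and after the replacement.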

TASK 3 – CORRELATION AND INFERENCE

In this last task, you will seek correlations between features of the data and infer hidden patterns. For this task, you will be working with a smaller dataset. Follow the instructions below:

3.1 – CORRELATION

f. We now take a new dataset (df_perth.pkl), which collects climate data for a city in Australia. Here we have just one year of measurements, but more features.

g. Find any significant correlations between features.

HINT: you might find it useful to look for trends and recurrent patterns within the data.

h. We now focus on the correlation between precipitation and cloud cover. We want to infer the probability of having moderate to heavy rain (> 1 mm/h) as a function of the cloud cover index.

HINT: you might find it useful to create a new column where you have 0 if precipitation < 1 mm/h and 1 otherwise.
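A possible starting point for steps g and h is sketched below. The data are synthetic and every column name is a placeholder, since the contents of df_perth.pkl are not listed here. The 0 to 8 range for the cloud cover index and the 0.7 cut-off for a "strong" correlation are likewise assumptions made only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_perth.pkl; names and ranges are placeholders.
rng = np.random.default_rng(42)
n = 2000
temperature = rng.normal(20, 5, n)
cloud_cover = rng.integers(0, 9, n)  # assumed integer index from 0 to 8
df = pd.DataFrame(
    {
        "temperature": temperature,
        # Built to be strongly (negatively) related to temperature.
        "humidity": 80 - 1.5 * temperature + rng.normal(0, 2, n),
        "cloud_cover": cloud_cover,
        # Heavier rain is made more likely under heavier cloud.
        "precipitation": rng.exponential(0.2 + 0.3 * cloud_cover),
    }
)

# (g) Pearson correlation matrix; keep each pair once (upper triangle)
# and list the pairs whose absolute correlation exceeds the cut-off.
corr = df.corr()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna()
strong = pairs[pairs.abs() > 0.7]  # hypothetical cut-off

# (h) Binary rain indicator as in the hint: 0 below 1 mm/h, 1 otherwise.
df["rain"] = (df["precipitation"] >= 1.0).astype(int)

# The mean of a 0/1 column within each group is the empirical probability.
rain_probability = df.groupby("cloud_cover")["rain"].mean()

print(strong)
print(rain_probability)
```

For the trends and recurrent patterns mentioned in the hint, grouping or resampling by month or by hour of day before computing correlations may also be worth trying.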

3.2 – INFERENCE

i. Let’s now assume that we want to predict the photovoltaic production (PV production) using multiple linear regression. Explain which features are statistically significant in modelling the target variable.

j. Create a multivariate model using the predictors chosen in the previous question.
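One way to approach steps i and j is an ordinary least squares fit followed by a look at the t-statistic of each coefficient. The sketch below does this with NumPy only. It rests on assumptions: the data are synthetic, the predictor names (irradiance, cloud_cover, noise_feature) are hypothetical, and the rule of thumb that |t| above roughly 2 indicates significance at about the 5% level only holds for reasonably large samples.

```python
import numpy as np

# Synthetic data; predictor names are hypothetical placeholders.
rng = np.random.default_rng(1)
n = 300
irradiance = rng.uniform(0, 1000, n)
cloud_cover = rng.uniform(0, 8, n)
noise_feature = rng.normal(0, 1, n)  # unrelated to the target by design
pv = 0.5 * irradiance - 10.0 * cloud_cover + rng.normal(0, 20, n)

# Design matrix with an intercept column.
names = ["intercept", "irradiance", "cloud_cover", "noise_feature"]
X = np.column_stack([np.ones(n), irradiance, cloud_cover, noise_feature])

# Ordinary least squares fit.
beta, *_ = np.linalg.lstsq(X, pv, rcond=None)

# Standard errors from the residual variance and (X^T X)^-1.
residuals = pv - X @ beta
dof = n - X.shape[1]
sigma2 = residuals @ residuals / dof
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_stats = beta / std_err

# Share of the target variance explained by the model.
r_squared = 1 - (residuals @ residuals) / np.sum((pv - pv.mean()) ** 2)

for name, b, t in zip(names, beta, t_stats):
    print(f"{name:>14s}  coef={b:9.3f}  t={t:8.2f}")
print("R^2 =", round(float(r_squared), 3))
```

Predictors with a small |t| are candidates for removal before the final model in step j is built. A statistics package such as statsmodels reports the corresponding p-values directly, if it is available.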

5.3 Deliverable

Report

The report should be written in the form of an academic paper using the ICML format [1]. The report should be at most 10 pages long, excluding references and appendices. The report must include the following sections:

● Abstract. This section should be a short paragraph (4-5 sentences) that provides a brief overview of the methodology and results presented in the report.

● Preliminary Analysis. This section describes the study carried out during Task 1 and should be organized in the following subsections:

  ○ Data Understanding. This subsection should detail the data that was used for this study, clearly describing the content, size and format of the data, how many cities are described in the dataset, how many observations there are, and how many (and which) features are considered. Further information can be provided.

  ○ Data Cleaning. This subsection should describe the missing data processing. It is important to describe the methodology that you used to search for missing data and how you addressed them in the best way (for example, how you ensured that the dataset preserves the same statistics/properties). Clearly motivate your answers.

  ○ Data Statistics. This subsection should describe the general statistical properties of the dataset with numerical or graphical visualization. Provide reflections on anomalies (with clear motivation/supporting evidence for anomalies).

● Outliers. This section should describe all the steps that were applied to the data to find and treat outliers during pre-processing. A justification for each step should also be provided. In case no or very little pre-processing was done, this section should clearly justify why.

● Data Inference. This section should describe the exploratory and inference process. The following subsections should be provided:

  ○ Data Correlation. This subsection should describe the different feature correlations that you have investigated in the current dataset. Even if you discover few patterns, it is important that you clearly explain and justify the methodologies that you adopted. Clearly show results that can support your statements.

  ○ Data Inference. This subsection should describe the final step of data inference. Again, clearly motivate your solutions, approaches and conclusions/results.

● Conclusion. This last section summarises the findings, highlights any challenges or limitations that were encountered during the study and provides directions for potential improvements.

Please make sure you complement your discussion in each section with relevant equations, diagrams, or figures as you see fit. Most importantly, be sure that all your answers and solutions are well motivated.

[1] https://icml.cc/Conferences/2020/StyleAuthorInstructions

Marking Criteria

See the following page for the marking criteria

| Criteria | Description | Mark Weight |
|---|---|---|
| Abstract/Conclusions | The purpose of the executive summary is to outline the data analytics project, inputs, envisioned outputs and key findings. | 5% |
| Task 1 – Preliminary Analysis | Dataset Understanding. Provide a clear description of the dataset answering the following questions: i) How many cities are there in the dataset? ii) How many observations and features are there in this dataset? iii) What are the names of the different features? | 10% |
| | Data Cleaning – Missing data. Provide a clear description of the results from your missing data analysis and key outcomes. | 15% |
| | Data Statistics. Describe the general statistical properties of the dataset with numerical or graphical visualization. Provide reflections on anomalies (with clear motivation/supporting evidence for anomalies). | 10% |
| Task 2 – Outliers | Show the visualization of the temperature measurements, together with some comments on the behaviour depicted in the plots. Provide summaries of the outliers, in terms of the number of outliers detected as well as the techniques adopted to replace them (motivate your answers). | 20% |
| Task 3 – Inference | Data Correlation. Comment on the significant correlations you found between features and assess rain probability as a function of the cloud cover index. Support the text with visualization of results and key insights on the considered approach. | 15% |
| | Data Inference. Good understanding of data inference. Comment on the multivariate model using the predictors chosen in the previous question. | 20% |
| Report Style | The report needs a clean and clear structure and layout. The quality of images, tables, citations and references will also be taken into account. | 5% |

