

Date: 2022-03-09 09:17

COMP534 Lab session


This is an example you can use to work through loading, fixing, plotting, and predicting

on some data. The instructions may not be exact, and the snippets of code may have

small errors – be prepared to search for what seems to be missing.

I have suggested using PyCharm as a python development environment, but feel free to use

anything else you are more familiar with. I also suggest conda as a package and environment

manager, but it is not the only option.

Getting started with data

Install PyCharm and Miniconda (or Anaconda)

Create a new python project - call it COMP534_1

For Project interpreter select : new conda environment

And select to Make available to all other projects

Conda is a package management system for python (Anaconda and Miniconda are interchangeable)

It allows you to easily install and manage many different packages

It also provides a method of managing environments. An 'environment' is a separate set of a Python

version and libraries, which is useful because you sometimes need to switch between different sets of

libraries, or even different Python versions.


You first need to install some packages – so open a terminal window in Pycharm

On lab PCs the conda setup is complicated by file permissions, so make the following changes:

Pycharm can be installed from the ‘Install University Applications’ on the desktop

For Project interpreter select : new virtual environment

Instead of using conda to install packages, use pip, i.e.:

pip install scikit-learn

pip install seaborn

Note that the terminal prompt shows (COMP534_1) in brackets.

This tells you that the virtual environment COMP534_1 is currently selected.

Some of the common conda Virtual environment commands are:-

conda create -n name

conda activate ...

conda activate base

And for managing packages…

conda list

conda install ...

For now, you should just need…

conda install scikit-learn

conda install seaborn

Note that things like matplotlib and pandas are installed automatically as dependencies.
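A quick way to confirm the install worked is to import each package and print its version. This is only a sketch; the exact version numbers you see depend on your environment.

```python
# Quick check that the packages installed above can be imported.
import matplotlib
import pandas
import seaborn
import sklearn

for module in (sklearn, seaborn, matplotlib, pandas):
    # Each package exposes its version as a string.
    print(module.__name__, module.__version__)
```

If any import fails, the package is missing from the currently selected environment.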

Iris dataset

Add a new python file (e.g. first.py) to your project (right-click on the project name in the Project

window), and add some code

from sklearn import datasets

iris = datasets.load_iris()


Right-click on first.py, and click Run 'first'

With Python, you can happily work from the Python console, going one step at a time. But if you

might need to rerun your analysis, maybe with different parameters, it becomes easier to store a

program and re-run it whenever you have made changes.

(With PyCharm, you can click the 'run' button, or press CTRL-F5 to re-run the last Python file.)

We will convert this to a pandas dataframe- we don't need to, but that keeps it more consistent

from sklearn import datasets

import pandas as pd

data = datasets.load_iris()

df = pd.DataFrame(data=data.data, columns=data.feature_names)


As we are running from inside a program - we will want to 'print' things to see them. If you are

running from a python console then you will see the results of each command anyway - so you only

need df.describe()
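As a minimal sketch of the point above, the script version needs an explicit print. The variable name summary is just an illustrative choice.

```python
from sklearn import datasets
import pandas as pd

data = datasets.load_iris()
df = pd.DataFrame(data=data.data, columns=data.feature_names)

# Inside a script the summary must be printed explicitly;
# in a console, typing df.describe() alone would display it.
summary = df.describe()
print(summary)
```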

now we will add a plot - we will include matplotlib as well as seaborn, as we will need some of the

lower level commands

import matplotlib.pyplot as plt

from sklearn import datasets

import pandas as pd

import seaborn as sns

data = datasets.load_iris()

df = pd.DataFrame(data=data.data, columns=data.feature_names)



We often call an example dataframe df, it's just a convention which is convenient when looking at

other people's code

You can refer to columns by their name - which you can get from df.columns

hence df[df.columns[0]] is the first column. But you should be careful of doing this, in case

the order changes later on.
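The difference between the two ways of selecting a column can be sketched as follows. The variable names first_by_position and first_by_name are illustrative only.

```python
from sklearn import datasets
import pandas as pd

data = datasets.load_iris()
df = pd.DataFrame(data=data.data, columns=data.feature_names)

print(list(df.columns))

# Position-based: depends on the current column order.
first_by_position = df[df.columns[0]]

# Name-based: keeps working if the columns are reordered.
first_by_name = df["sepal length (cm)"]

# Both selections currently refer to the same column.
print(first_by_position.equals(first_by_name))
```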

See the pandas documentation on indexing and selecting data for more information about accessing and selecting rows and columns.

Seaborn has lots of different plots you can use, and there is plenty of information in the seaborn documentation.



You can create a new dataframe with only certain columns

df2 = df[['sepal length (cm)' ,'petal length (cm)' ]]


In the histogram plot, we can see that the sepals are usually longer than the petals, but what else can

we find out from just this data?

sns.scatterplot(x=df['petal length (cm)'], y=df['sepal length (cm)'])


sns.scatterplot(data=df,x='petal length (cm)', y='sepal length (cm)')

Notice anything strange?

Petal length, not surprisingly, is roughly related to sepal length, but there are at least two distinct

clusters of values, seemingly with different relationships.

Use print(df.columns) to see which columns we have included so far,

but this dataset contains something else - a 'target'. In this case, it is a classification for each iris as

one of 3 species.

You can see it here:-


So we can copy that to the dataframe, and we now have an extra piece of information we can see....

df['target'] = data['target']

and plot with…

sns.scatterplot(data=df, x='petal length (cm)', y='sepal length (cm)', hue='target')

And now we can see why we have this separate cluster on the left - they are a different species of

Iris to the others.

You can see what they are called with
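The snippet is missing here. One way to do it, assuming the names come from the dataset's target_names attribute (the species column is a hypothetical addition for readability):

```python
from sklearn import datasets
import pandas as pd

data = datasets.load_iris()
df = pd.DataFrame(data=data.data, columns=data.feature_names)
df["target"] = data.target

# target_names[i] is the species name for target value i.
print(data.target_names)

# Hypothetical extra column holding the readable name for each row.
df["species"] = data.target_names[data.target]
print(df["species"].value_counts())
```

Passing hue='species' instead of hue='target' to the scatterplot would then label the legend with names rather than numbers.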


What other plots can you generate for this data? Can you think of anything that may actually be useful?



And this is why we have a 'target' - we are going to see how well we can predict the species, just

based on the size of the petals and sepals. The 'target' column will come up frequently, sometimes

with different names, but this is typically the thing that we are interested in predicting. As a

reminder, for supervised learning, we have some 'training' data where we know the value of the

target (or at least have a reasonable guess), and we want to learn how to predict this value for new

data, where we don't know the real value.

In this case, it is the species of Iris, but it may be a huge variety of things in the real world - likelihood

of disease, the value of a hand-written number, the cost of a footballer, the ratio of peptide

ionisation, etc. etc.

One of the names for the 'target' is simply y. We call the rest of the data X, and the target y.

You will often see this in example code.

To run supervised learning, we also want to see how well it performs, so we split the data we have

into two parts: 'train' and 'test'. The classifier uses the 'train' dataset to learn how to predict the result.

Then we can give it the 'test' set to see if it really works!

from sklearn.model_selection import train_test_split

from sklearn.neighbors import KNeighborsClassifier

# We are just using the default parameters; you can set your own split sizes etc.

X_train, X_test, y_train, y_test = train_test_split(df, data.target)


model = KNeighborsClassifier()

model.fit(X_train, y_train)

predictions = model.predict(X_test)
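A self-contained sketch of the same steps, with the split set explicitly. The test_size and random_state values, and the results table, are my own assumptions rather than part of the handout.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd

data = datasets.load_iris()
X = pd.DataFrame(data=data.data, columns=data.feature_names)
y = data.target

# Assumed settings: hold out 25% of rows; fix the seed so reruns match.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = KNeighborsClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Side-by-side view of predicted and actual species codes.
results = pd.DataFrame({"predicted": predictions, "actual": y_test})
results["correct"] = results["predicted"] == results["actual"]
print(results)
print(results["correct"].sum(), "correct out of", len(results))
```

Removing random_state gives a different random split, and so slightly different results, on every run.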


And you can see which predictions it made correctly...

(Make sure that you don't accidentally include the 'target' in the training data, or it will be very

good at predicting only when it has been given the answer!)

Remove the df['target'] = data['target'] line, and try again.

You may now get one or two wrong; this is normal, as 100% accuracy only happens with quite simple data.


You can plot the predictions against the actual values
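The plotting snippet is missing at this point in the handout. One possible way to do it is sketched below, assuming a heatmap of counts is an acceptable stand-in for whatever plot was originally shown. The random_state value is also an assumption.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

data = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0
)
predictions = KNeighborsClassifier().fit(X_train, y_train).predict(X_test)

# Rows are actual species, columns are predicted species;
# any count off the diagonal is a wrong prediction.
counts = pd.crosstab(
    pd.Series(data.target_names[y_test], name="actual"),
    pd.Series(data.target_names[predictions], name="predicted"),
)
sns.heatmap(counts, annot=True, cbar=False)
plt.show()
```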



It's not very interesting for this data, but it will show you where predictions are going wrong...

Try using the different classifiers and see how easy it is to change, as all of the inputs are usually the same.


(Try at least SVM, naive_bayes, DecisionTree)

Try the DecisionTreeClassifier with (max_depth = 1) and (max_depth = 10). What is the difference?

Each classifier may also have different parameters, which you can look into. But, for this data, as it is

so simple they are generally unlikely to make much difference.
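Because the classifiers share the same fit / predict interface, swapping them can be sketched as a loop. The choice of SVC and GaussianNB as the SVM and naive Bayes variants, and the fixed random_state, are assumptions.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

data = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0
)

models = {
    "SVC": SVC(),
    "GaussianNB": GaussianNB(),
    "tree depth 1": DecisionTreeClassifier(max_depth=1, random_state=0),
    "tree depth 10": DecisionTreeClassifier(max_depth=10, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns the fraction of test rows predicted correctly.
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: {scores[name]:.2f}")
```

A tree of depth 1 makes a single split, so it can only ever predict two of the three species.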

Titanic Data

So, let's get a more complex dataset....

For the sake of simplicity, data is often converted into .csv files. These are very simple text files,

each line consists of one data record, and the values are just text, separated by commas. The first

line is usually a 'header', which gives you the column names.

You can open these in Excel, in Notepad, or even just view them from a command / terminal window,

e.g. with the commands type file.csv in Windows or cat file.csv in Linux
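To make the file layout concrete, here is a small sketch that reads CSV text with pandas. The three records are made up, and io.StringIO stands in for a real file on disk.

```python
import io
import pandas as pd

# Hypothetical three-row file: a header line, then one record per line.
csv_text = """name,age,survived
Alice,29,1
Bob,,0
Carol,41,1
"""

# read_csv accepts a path or a file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# The empty age field is read as a missing value (NaN).
print(df.dtypes)
```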

Datasets can be in more complex database-style formats, such as JSON, XML, or even stored in a database.


This one is a dataset many people use as a form of competition - it gives passenger details from

those on board the Titanic when it sank. We want to see if it is possible to predict who survived,

based on their details.



(There is a separate test and train dataset, but we can just split the training dataset as we have done

before. Once you are finished with it, you could also get the test dataset and work out how to

incorporate that.)

df = pd.read_csv('train.csv')



Note that we can't describe columns which contain non-numerical data. For now we can just remove

them, but we will look at dealing with them better later on.

df = df.drop(['name','sex','ticket','cabin','embarked'], axis=1)

We will also drop rows which are incomplete - they have a NA value somewhere. This isn't always

(or often) the best way to handle missing data

df = df.dropna()
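Before dropping rows it can help to see how much data is missing. A minimal sketch on a made-up four-row table (the values are hypothetical, not taken from the real file):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few rows of the passenger data.
df = pd.DataFrame(
    {
        "pclass": [1, 3, 2, 3],
        "age": [29.0, np.nan, 41.0, 22.0],
        "fare": [211.3, 7.9, np.nan, 8.05],
        "survived": [1, 0, 1, 0],
    }
)

# Count missing values per column before deciding how to handle them.
print(df.isna().sum())

rows_before = len(df)
# dropna() removes every row that has at least one missing value.
df = df.dropna()
print(f"kept {len(df)} of {rows_before} rows")
```

Here half of the rows are lost, which shows why dropping incomplete rows is not always the best choice.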

and just have a look at the data, to check we can plot it – it won’t really make much sense as the

columns have very different values.


What useful plots could you make instead?

So let's take the code from our last attempt at a decision tree classifier

And put the 'target' into a separate series variable

target = df['survived']

# Don't forget to remove it from the training data!

df = df.drop(['survived'], axis=1)

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=1)

X_train, X_test, y_train, y_test = train_test_split(df, target)

model.fit(X_train, y_train)

predictions = model.predict(X_test)


If you want to more easily see how good your prediction is ...

print(len(predictions), sum(predictions==y_test))

tells you how many predictions you made, and how many are correct - it should be possible to get

over 90% accuracy (but not without some more work).

This is a very basic statistic, there are many others that you can use…
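A few of the other statistics available in sklearn.metrics can be sketched as follows. The label lists here are made up so that the example runs on its own; in the lab you would pass your real y_test and predictions.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels standing in for y_test and predictions.
y_test = [0, 0, 1, 1, 1, 0, 1, 0]
predictions = [0, 1, 1, 1, 0, 0, 1, 0]

# Fraction of predictions that match the true labels.
accuracy = accuracy_score(y_test, predictions)
print(accuracy)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_test, predictions)
print(matrix)

# Per-class precision, recall and F1 in a text table.
print(classification_report(y_test, predictions))
```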

You may see that some of the predictors perform better than others. As the test/train split is

random, you will also get slightly different answers every time.

In order to improve performance, we are going to use some of the text data that we removed


Where data is strings, we are just going to treat them as categories, i.e., the order doesn't mean anything.


So, we will just use the LabelEncoder in scikit-learn

Insert the following code, and no longer drop the column marked 'sex'

from sklearn.preprocessing import LabelEncoder

df['sex'] = LabelEncoder().fit_transform(df['sex'])

This will just set the sex to 0 or 1. LabelEncoder numbers the categories in sorted order, so female becomes 0 and male becomes 1.
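To check which number was assigned to which category, you can inspect the encoder's classes_ attribute. A small sketch on made-up values:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical values standing in for the 'sex' column.
df = pd.DataFrame({"sex": ["male", "female", "female", "male"]})

encoder = LabelEncoder()
df["sex"] = encoder.fit_transform(df["sex"])

# classes_[i] is the original string that was mapped to the integer i.
print(list(encoder.classes_))
print(df["sex"].tolist())
```

Keeping the encoder in a variable, rather than creating it inline, is what makes this lookup possible.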

Will this change the prediction?

Do you think you could encode the other values as numbers in a sensible way?


Everything we have looked at so far is Classification, i.e., there are a set number of possible outcomes


(3 species of flowers, or survived / not survived the Titanic).

But we often want to work out more detail - e.g., what is the probability of..., what is the value of....,

and for that we use 'regression'

We can look at a California house price dataset, from


And will re-use some of the things that we already learned how to do…


df['ocean_proximity'] = LabelEncoder().fit_transform(df['ocean_proximity'])



#again, we remove the incomplete rows with NA

df = df.dropna()

#set the target value

target = df['median_house_value']

df = df.drop(['median_house_value'], axis=1)

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()

X_train, X_test, y_train, y_test = train_test_split(df,target)

model.fit(X_train, y_train)

predictions = model.predict(X_test)



What are some of the ways that you can view and evaluate the performance?
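As one possible starting point, regression predictions can be scored with error metrics rather than a count of correct answers. This sketch uses synthetic data from make_regression (an assumption, so that it runs without the housing file), and the choice of metrics is mine.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption) standing in for the housing table.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Average absolute error, in the same units as the target.
mae = mean_absolute_error(y_test, predictions)
# Fraction of variance explained; 1.0 would be a perfect fit.
r2 = r2_score(y_test, predictions)
print(f"MAE: {mae:.1f}, R^2: {r2:.2f}")
```

A scatter plot of predictions against y_test is another common view: points close to the diagonal are good predictions.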


