
INT304: Pattern Recognition In Computer Vision S2 2021-2022

Assignment/Lab 1: Feature Selection

Lecturer: Xi Yang    Unit: INT, Dept. of Intelligent Science

Disclaimer:

1. Lab report deadlines are strict. The University late submission policy will be applied.
2. Collusion and plagiarism are absolutely forbidden (University policy will be applied).
3. The report is due 21 days from the date of running this lab (March 29th, 2022).

1.1 Objectives

• Master the concepts and knowledge of feature selection.
• Be familiar with text processing.

1.2 Introduction

Text categorization is the task of classifying a set of documents into categories from a set of predefined labels. Texts cannot be handled directly by our model, so the indexing procedure is the first step: it maps a text d_j into a numeric representation used during training and validation. The standard tf-idf function is used to represent the text, with each unique word from the English vocabulary serving as one dimension of the dataset. The high dimensionality of the word space may be problematic for our classification methods, so we will select a subset of words with feature selection methods to reduce the dimensionality of the feature space.

1.3 Tasks

1.3.1 Data Preprocessing

• ( 10 marks ) Download the text dataset and read the documents: http://qwone.com/~jason/20Newsgroups/. The training examples of the version 20news-bydate.tar.gz are used for our experiments, where duplicates have been removed.

• ( 10 marks ) Remove the stopwords (Stopword.txt), which are frequent words that carry no information. Convert all words to their lower-case form and delete all non-alphabetic characters from the text. Hint: use Python set, regex, and dict. A combined sketch of these preprocessing steps follows the stemming code below.

• ( 10 marks ) Perform word stemming, which removes word suffixes.
– Install the library: nltk (Python).
– Usage: see the following code for how to use the Porter stemmer (https://www.nltk.org/howto/stem.html).


from nltk.stem.porter import *

stemmer = PorterStemmer()
plurals = ['caresses', 'flies', 'dies', 'mules', 'denied',
           'died', 'agreed', 'owned', 'humbled', 'sized',
           'meeting', 'stating', 'siezing', 'itemization',
           'sensational', 'traditional', 'reference', 'colonizer',
           'plotted']
singles = [stemmer.stem(plural) for plural in plurals]
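
Putting the three preprocessing steps together, the following is a minimal sketch. It assumes the training archive has been extracted into a local 20news-bydate-train directory (one subdirectory per category) and that Stopword.txt lists one stopword per line; the paths and variable names are illustrative, not prescribed by the handout.

import os
import re
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Load the stopword list into a set for O(1) membership tests.
with open('Stopword.txt') as f:
    stopwords = set(line.strip().lower() for line in f)

def preprocess(text):
    """Lower-case, strip non-alphabetic characters, drop stopwords, and stem."""
    text = text.lower()
    text = re.sub(r'[^a-z]+', ' ', text)  # keep letters only
    tokens = [w for w in text.split() if w not in stopwords]
    return [stemmer.stem(w) for w in tokens]

# Read every document; each subdirectory of the training set is one category.
docs, labels = [], []
root = '20news-bydate-train'  # assumed extraction path
for category in sorted(os.listdir(root)):
    cat_dir = os.path.join(root, category)
    for name in os.listdir(cat_dir):
        with open(os.path.join(cat_dir, name), encoding='latin-1') as f:
            docs.append(preprocess(f.read()))
        labels.append(category)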

1.3.2 Indexing

The documents are represented with the vector space model, in which each document is represented as a vector of words. A collection of documents is then represented by a document-by-word matrix A:

A = (a_ik) (1.1)

where a_ik is the weight of word k in document i.
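
As an illustration, the sketch below builds this document-by-word count matrix with plain Python lists and dicts; it assumes docs is the list of token lists produced by the preprocessing sketch above, and the raw counts are later replaced by the tf-idf weights of Section 1.3.4.

from collections import Counter

# Vocabulary and word -> column mapping (assumes `docs` from the sketch above).
vocab = sorted(set(w for doc in docs for w in doc))
word_index = {w: k for k, w in enumerate(vocab)}

# A[i][k] holds the raw count of word k in document i; the tf-idf step of
# Section 1.3.4 later replaces these counts with the weights a_ik.
A = [[0] * len(vocab) for _ in docs]
for i, doc in enumerate(docs):
    for w, count in Counter(doc).items():
        A[i][word_index[w]] = count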

1.3.3 Feature Selection

Feature selection tries to remove non-informative words from the documents in order to improve categorization effectiveness and reduce computational complexity.

• ( 10 marks ) Remove the low-frequency words
The document frequency of a word is the number of documents in which the word occurs. Compute the document frequency of each word in the training dataset and remove those words whose document frequency is below a predefined threshold (use df < 5). A sketch covering both feature-selection steps appears at the end of this section.

• ( 40 marks ) Choose features with information gain
Information gain measures the number of bits of information obtained by knowing the presence or absence of a word in a document.

( 10 marks ) Let c_1, c_2, ..., c_K denote the set of possible categories. The information gain of a word w is defined to be

IG(w) = H(C) − H(C|w)
      = −∑_{j=1}^{K} P(c_j) log P(c_j) + P(w) ∑_{j=1}^{K} P(c_j|w) log P(c_j|w) + P(w̄) ∑_{j=1}^{K} P(c_j|w̄) log P(c_j|w̄)

– ( 5 marks ) P(c_j): the fraction of documents in the total collection that belong to class c_j
– ( 5 marks ) P(w): the fraction of documents in which the word w occurs
– ( 10 marks ) P(c_j|w): the fraction of documents from class c_j that have at least one occurrence of word w
– ( 10 marks ) P(c_j|w̄): the fraction of documents from class c_j that do not contain the word w

In the end, we choose the 1000 words with the maximum IG values by sorting all words.
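
A sketch of both feature-selection steps follows, assuming docs (token lists) and labels (category names) from the earlier sketches. Note that it reads P(c_j|w) as the fraction of documents containing w that belong to class c_j, which makes the expression a proper information gain; if your marking expects the handout's literal per-class wording, adjust info_gain accordingly.

import math
from collections import Counter

N = len(docs)
doc_sets = [set(doc) for doc in docs]  # unique words per document

# --- Remove low-frequency words: keep words with document frequency >= 5 ---
df = Counter(w for ds in doc_sets for w in ds)
vocab = [w for w, c in df.items() if c >= 5]

# --- Information gain ---
categories = sorted(set(labels))
n_c = Counter(labels)  # documents per class
# n_cw[c][w]: number of documents of class c that contain word w
n_cw = {c: Counter() for c in categories}
for ds, c in zip(doc_sets, labels):
    n_cw[c].update(ds)

def xlogx(p):
    return p * math.log(p) if p > 0 else 0.0

# H(C): entropy of the class distribution (constant over all words).
H_C = -sum(xlogx(n_c[c] / N) for c in categories)

def info_gain(w):
    p_w = df[w] / N  # fraction of documents containing w
    pos = sum(xlogx(n_cw[c][w] / df[w]) for c in categories)
    neg = sum(xlogx((n_c[c] - n_cw[c][w]) / (N - df[w]))
              for c in categories) if df[w] < N else 0.0
    return H_C + p_w * pos + (1 - p_w) * neg

# Keep the 1000 words with the largest information gain.
selected = sorted(vocab, key=info_gain, reverse=True)[:1000]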


1.3.4 ( 20 marks ) TFIDF Representation

The TFIDF representation assigns a weight to word k in document i in proportion to the number of occurrences of the word in the document, and in inverse proportion to the number of documents in the collection in which the word occurs at least once:

a_ik = f_ik · log(N/n_k) (1.2)

• f_ik: the frequency of word k in document i
• N: the number of documents in the training dataset
• n_k: the number of documents in the training dataset in which word k occurs at least once, called the document frequency

Taking into account the length of different documents, we normalize the representation of each document as

A_ik = a_ik / √( ∑_{j=1}^{1000} a_ij² ) (1.3)

The training set can then be represented as a matrix A of size N × 1000. Once the features are chosen, the test set can be converted into another matrix B of size M × 1000, where M is the size of the test dataset.
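
A minimal sketch of equations (1.2) and (1.3), reusing docs, df, N, and the 1000-word selected list from the sketches above; the function name tfidf_row is illustrative.

import math
from collections import Counter

col = {w: k for k, w in enumerate(selected)}  # word -> column index

def tfidf_row(doc):
    """Weight a_ik = f_ik * log(N / n_k), then L2-normalize the row (eq. 1.3)."""
    counts = Counter(doc)
    row = [0.0] * len(selected)
    for w, f in counts.items():
        if w in col:
            row[col[w]] = f * math.log(N / df[w])
    norm = math.sqrt(sum(a * a for a in row))
    return [a / norm for a in row] if norm > 0 else row

A = [tfidf_row(doc) for doc in docs]  # training matrix, N x 1000
# Test documents are mapped with the same vocabulary, df, and N to give B.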

1.4 Lab Report

• Write a short report containing a concise description of your results and observations.
• Insert a clipped screenshot of the running output into your report for each step.
• Submit the report and the source code electronically via LearningMall.
• Python is strongly suggested.
• Writing the report in the LaTeX typesetting language is strongly suggested.
• The report in PDF format and the Python source code of your implementation should be zipped into a single file, named as follows:
StudentID_LastName_FirstName_LabNumber.zip (e.g. 123456789_Einstein_Albert_1.zip)

1.5 Hints

Please refer to this paper for more details: K. Aas and L. Eikvil, Text Categorisation: A Survey, 1999.

• LaTeX IDE: TeXstudio
• Python IDE: PyCharm
• Use the Python set, dict, and list collections flexibly.

