Homework 2

EE425X - Machine Learning: A signal processing perspective

Logistic Regression and Gaussian Discriminant Analysis

In this homework we are going to apply Logistic Regression (LR) and Gaussian Discriminant Analysis

(GDA) for solving a two-class classification problem. The goal will be to implement both correctly and

figure out which one is better.

To do this, you will first “learn” the parameters for each case using the training data (as discussed in

class and available in the handouts). Then, you will apply it to test data and evaluate the performance as

explained below. The only change from the handout is that, for GDA, you need to assume that the

covariance matrix Σ is diagonal.

1 Synthetic Data Generation

Generate your own training data first. To do this, we use the GDA model because that is the only one which

provides a generative model.

Generating Training data: Since we want to implement a two-class classification problem, let the class labels y^(i) take two possible values, 0 or 1 (for i = 1, ..., m, i.e., we have m training samples). These are generated independently according to a Bernoulli model with probability φ. Next, conditioned on y^(i), the features x^(i) ∈ R^(n×1) are generated independently from a Gaussian distribution with mean μ_{y^(i)} and covariance matrix Σ. In other words, while generating x^(i), use the same covariance matrix Σ for both classes, but pick two different μ's: μ_0 as the n-dimensional mean vector for data from class 0 and μ_1 as the n-dimensional mean vector for data from class 1. Do this for all i = 1, 2, ..., m.

Generating Test data: Do the same as above, but now instead generate mtest = m/5 samples.
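As a concrete illustration, the generation procedure above can be sketched in NumPy as follows. The particular values of φ, μ_0, μ_1, and the diagonal of Σ below are placeholder choices for testing, not values prescribed by the assignment:

```python
import numpy as np

def generate_gda_data(m, n, phi, mu0, mu1, sigma_diag, rng):
    """Draw m samples from the GDA model with a shared diagonal covariance."""
    y = (rng.random(m) < phi).astype(int)        # Bernoulli(phi) labels
    means = np.where(y[:, None] == 1, mu1, mu0)  # mean mu_{y^(i)} per sample
    X = means + rng.standard_normal((m, n)) * np.sqrt(sigma_diag)
    return X, y

rng = np.random.default_rng(0)
n, m = 100, 20
mu0, mu1 = np.zeros(n), np.ones(n)  # example class means; pick your own
sigma_diag = np.ones(n)             # diagonal entries of Sigma
X_train, y_train = generate_gda_data(m, n, 0.5, mu0, mu1, sigma_diag, rng)
X_test, y_test = generate_gda_data(m // 5, n, 0.5, mu0, mu1, sigma_diag, rng)
```

Sampling each coordinate independently with variance `sigma_diag[j]` is valid only because Σ is assumed diagonal; a general Σ would require a full multivariate normal draw.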

2 Learning parameters using training data; and then testing the method on test data

Write code to estimate the parameters for Logistic Regression and for GDA. For how to do it, please

refer to the class handouts. GDA was covered recently in the Generative Learning Algorithms handout.

LR is covered in the first handout (Supervised Learning).

For LR, you need to write Gradient Descent code to estimate θ.
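One way the batch gradient descent update for logistic regression can be coded in NumPy is sketched below; the step size and iteration count are illustrative choices, and you should tune them for your data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_gd(X, y, lr=0.1, n_iters=1000):
    """Estimate theta by batch gradient descent on the logistic loss.

    X: (m, n) feature matrix; a column of ones is prepended for the intercept.
    Returns theta of length n + 1.
    """
    m = X.shape[0]
    Xb = np.hstack([np.ones((m, 1)), X])  # intercept column
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        # Gradient of the average negative log-likelihood
        grad = Xb.T @ (sigmoid(Xb @ theta) - y) / m
        theta -= lr * grad
    return theta
```

Since n = 100 and m = 20 here (more features than samples), the training data is linearly separable and θ will keep growing; stopping after a fixed number of iterations, as above, is enough for this homework.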

For GDA, proceed as follows. The ONLY CHANGE from the handout is that we assume that Σ is DIAGONAL and thus use the following formulas:

φ = (1/m) Σ_{i=1}^{m} 1(y^(i) = 1),
μ_0 = [Σ_{i=1}^{m} 1(y^(i) = 0) x^(i)] / [Σ_{i=1}^{m} 1(y^(i) = 0)],
μ_1 = [Σ_{i=1}^{m} 1(y^(i) = 1) x^(i)] / [Σ_{i=1}^{m} 1(y^(i) = 1)],
Σ = (1/m) Σ_{i=1}^{m} (x^(i) − μ_{y^(i)})(x^(i) − μ_{y^(i)})^T,

while setting all non-diagonal entries of Σ to be zero. Here, 1(w = c) is the indicator function that evaluates to 1 when w = c and 0 otherwise.
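These estimators translate directly into NumPy; a minimal sketch, keeping only the diagonal of Σ:

```python
import numpy as np

def fit_gda_diagonal(X, y):
    """Estimate GDA parameters (phi, mu0, mu1, diag of Sigma) from data."""
    phi = y.mean()                      # fraction of class-1 samples
    mu0 = X[y == 0].mean(axis=0)        # mean of class-0 features
    mu1 = X[y == 1].mean(axis=0)        # mean of class-1 features
    # Center each sample by its own class mean, then average squared
    # deviations per coordinate: this is exactly the diagonal of Sigma.
    centered = X - np.where(y[:, None] == 1, mu1, mu0)
    sigma_diag = (centered ** 2).mean(axis=0)
    return phi, mu0, mu1, sigma_diag
```

Storing only the length-n diagonal (rather than the full n×n matrix with off-diagonals zeroed) is equivalent and cheaper.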

Write code that uses the estimated parameters for each method, and then classifies the test data as explained in the handout and in class. For GDA, we use Bayes rule for classification. For each input query x, compute the output ŷ(x) as

ŷ(x) = arg max_{y ∈ {0,1}} p(x | y) p(y).

Evaluate accuracy: let us denote the test data as Dtest. Report the accuracy of each method as

Accuracy = (1/|Dtest|) Σ_{(x,y) ∈ Dtest} 1(ŷ(x) = y),

where ŷ(x) is the output of the classifier for input x. Also, |Dtest| = mtest is the number of test samples.
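The Bayes-rule classifier and the accuracy measure can be sketched as follows. Because Σ is shared by both classes, its log-determinant term is identical on both sides of the comparison and can be dropped:

```python
import numpy as np

def gda_predict(X, phi, mu0, mu1, sigma_diag):
    """Classify each row of X as the class with the larger log-posterior."""
    def log_lik(mu):
        # Gaussian log-density with diagonal covariance; the shared
        # log-determinant and (2*pi) constants cancel between classes.
        return -0.5 * (((X - mu) ** 2) / sigma_diag).sum(axis=1)
    score0 = log_lik(mu0) + np.log(1.0 - phi)
    score1 = log_lik(mu1) + np.log(phi)
    return (score1 > score0).astype(int)

def accuracy(y_hat, y):
    """Fraction of samples classified correctly."""
    return float((y_hat == y).mean())
```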

Use n = 100 and m = 20. This means that for estimating each entry of μ or Σ you have 20 samples. Generally speaking, we need on the order of n² samples to estimate all entries of Σ. However, since in this homework we assume that Σ is a diagonal matrix, on the order of n samples suffices.

3 Real Data

Next use the MNIST dataset to evaluate both approaches on real data. MNIST is a good database for people

who want to try learning techniques and pattern recognition methods on real-world data while spending

minimal effort on preprocessing and formatting. The MNIST database of handwritten digits has a training

set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from

NIST. The digits have been size-normalized and centered in a fixed-size image. The entire dataset can be

downloaded from here but in this problem we only use samples corresponding to two digits 0 and 9.

Use the code written in the previous part to classify two digits 0 and 9 in MNIST by using Logistic

Regression and Gaussian Discriminant Analysis. Since you already wrote the code in part 2, you should not have to rewrite anything except what you provide as training and test data. This is what we want

to learn in this course: use simulated (synthetic) data to write and test code; make sure everything works

as expected, then use the same code on real data.
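Once MNIST is loaded as arrays (for example via scikit-learn's `fetch_openml("mnist_784")`, one of several ways to obtain it), the only new step is filtering down to the two digits. The helper below is one possible way to do this; the function name and the 0/9 defaults are just illustrative:

```python
import numpy as np

def select_digits(X, y, d0=0, d1=9):
    """Keep only samples of digits d0 and d1; scale pixels from [0, 255]
    to [0, 1] and relabel d0 as class 0 and d1 as class 1."""
    mask = (y == d0) | (y == d1)
    return X[mask] / 255.0, (y[mask] == d1).astype(int)
```

One practical caveat: some MNIST pixels (e.g. the image corners) are constant across all samples, so their estimated variances are zero and the diagonal-Σ likelihood divides by zero; flooring the variances at a small value such as 1e-6 is a simple fix.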

Please report the final classification accuracy and discuss how the obtained accuracy for the real data differs from that for the synthetic data.

4 What to turn in?

Submit a short report that discusses all of the above questions. Also submit your code with clear documentation. Grading will be based on the quality of the report and the correctness of the implemented code.

