
Assignment 2

CMSC 678 — Introduction to Machine Learning

Due Wednesday March 9th, 11:59 AM

Item Summary

  Assigned   Monday February 21
  Due        Wednesday March 9th
  Topic      Loss Functions and Multiclass Classification (I)
  Points     95 (+15 Extra Credit)

In this assignment you will become comfortable with the math of loss-based optimization; and implement, experiment with, and compare multiple types of multiclass classifiers.

You are to complete this assignment on your own: that is, the code and writeup you submit must be entirely your own. However, you may discuss the assignment at a high level with other students or on the discussion board. Note at the top of your assignment who you discussed this with or what resources you used (beyond course staff, any course materials, or public Piazza discussions).

The following table gives the overall point breakdown for this assignment.

              Theory        App.: Classification
  Question    1      2      3      4      5
  Points      20     35     5      35     15 (EC)

This handout may be lengthy, but think of it as both a tutorial and an assignment. I provide a lot of explanation and re-summarizing of course concepts.

What To Turn In: Turn in a PDF writeup that answers the questions; turn in all requested code and models necessary to replicate your results. Turn in models by serializing the learned models and parameters; your programming language likely natively supports this (e.g., Python's pickle, or Pytorch's own format). Be sure to include specific instructions on how to build/compile (if necessary) and run your code. Answer the following questions in long-form. Provide any necessary analyses and discussion of your results.
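For example, a minimal sketch of the serialization step in Python's pickle (the file name and parameter layout here are illustrative, not prescribed):

    import pickle
    import numpy as np

    # Hypothetical learned parameters: one weight vector and bias per class.
    model = {"weights": np.zeros((10, 784)), "biases": np.zeros(10)}

    # Serialize to disk so the reported results can be replicated.
    with open("perceptron_model.pkl", "wb") as fh:
        pickle.dump(model, fh)

    # Deserialize when re-running the evaluation.
    with open("perceptron_model.pkl", "rb") as fh:
        restored = pickle.load(fh)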

How To Submit: Submit the assignment on the submission site, https://www.csee.umbc.edu/courses/graduate/678/spring22/submit. Be sure to select "Assignment 2."


Theory

1. (20 points) For 1 ≤ i ≤ N, let x_i ∈ R^2 be a two-dimensional input, and y_i ∈ {−1, +1} the binary label for x_i.

(a) Describe, in both words and with a concrete example, when the perceptron algorithm would be guaranteed to converge (find an optimal solution to the training set). For the concrete example, give N ≥ 5 data points {(x_i, y_i)} for which the perceptron algorithm would converge.

(b) Describe, in words and with a concrete example, when the perceptron algorithm would not be guaranteed to converge. For the concrete example, give N ≥ 5 data points {(x_i, y_i)} for which the perceptron algorithm would not converge.

(c) Let the dataset D be

      i    x_i           y_i
      1    (-1, 1)       1
      2    (-1, -1)      -1
      3    (0.5, 0.5)    1
      4    (1, -1)       1
      5    (0.5, -1)     -1

Run, by hand, the perceptron algorithm on these five data points. Record your computations for each time step (data point examined, activation score for that data point and current weight vector, the predicted value, and the new weights) in a table with columns

      t    w^(t)    x_{(t % N)+1}    activation    prediction    w^(t+1)

Assume each w^(t) to have the correct dimensionality. The strange indexing x_{(t % N)+1} simply says to iterate through the data in the order provided above. Note the second column of each row should equal the last column of the previous row.
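If you want to sanity-check your hand computations afterward, a minimal sketch of the binary perceptron loop is below; the zero initialization of w^(0) and the sign convention at an activation of exactly 0 are assumptions, so match them to whatever convention you use by hand:

    import numpy as np

    # Dataset D from part (c).
    X = np.array([[-1, 1], [-1, -1], [0.5, 0.5], [1, -1], [0.5, -1]])
    y = np.array([1, -1, 1, 1, -1])

    w = np.zeros(2)                      # w^(0): zero init is an assumption
    for t in range(20):                  # a few in-order passes over the data
        i = t % len(X)                   # the (t % N)+1 indexing, 0-based here
        a = w @ X[i]                     # activation under current weights
        pred = 1 if a >= 0 else -1       # tie-break at a == 0 is an assumption
        w_old = w.copy()
        if pred != y[i]:                 # mistake-driven update
            w = w + y[i] * X[i]
        print(t, w_old, X[i], a, pred, w)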

2. (35 points) First, let f(x) be a K-dimensional feature vector, i.e., f(x) ∈ R^K. The particular meaning of x doesn't matter for this question: it could be an image, or a text document, or a collection of robotic sensor readings. In class, we discussed how to turn a (binary) linear model into a probabilistic (binary) classifier: given vector weights w ∈ R^K, we pass our linear scoring model w^T f(x) through the sigmoid function σ(z) = 1 / (1 + exp(−z)) and use the following decision function:

      output "class 1" if σ(w^T f(x)) ≥ 0.5
      output "class 2" if σ(w^T f(x)) < 0.5.        [Eq-1]

This question considers the multi-class extension of that model.

Second, assume we have L classes, with corresponding weights θ_l (1 ≤ l ≤ L) per class. We can group these weights into a matrix θ ∈ R^{L×K}. Notice that each θ_l is a K-dimensional vector. Consider a probabilistic linear classification model p(ŷ | x), the softmax over the per-class scores:

      p(ŷ = l | x) = exp(θ_l^T f(x)) / Σ_{l'=1}^{L} exp(θ_{l'}^T f(x)).        [Eq-2]
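As a concrete numerical illustration of [Eq-2], a minimal numpy sketch (the sizes are hypothetical, and the max-subtraction is a standard numerical-stability trick rather than part of the model):

    import numpy as np

    rng = np.random.default_rng(0)
    L, K = 4, 3                       # hypothetical sizes
    theta = rng.normal(size=(L, K))   # one weight vector theta_l per class
    f_x = rng.normal(size=K)          # the feature vector f(x)

    scores = theta @ f_x              # theta_l^T f(x) for every class l
    scores -= scores.max()            # stability shift; probabilities unchanged
    p = np.exp(scores) / np.exp(scores).sum()   # [Eq-2]: p(yhat = l | x)
    print(p, p.sum())                 # a proper distribution over the L classes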

(a) Argue that a binary instantiation (i.e., L = 2) of model [Eq-2] provides the same decision function as [Eq-1]. In arguing this, you do not need to prove it in a strict mathematical sense, but you need to make it clear that you understand it. (Imagine how you would explain it to a classmate.)

We showed in class that optimizing the MAP estimate is the "right" thing to do when we want to minimize 0-1 classification loss. Remember that log is monotonic: if f(a) < f(b), then log f(a) < log f(b). As such, this is also called maximizing the conditional log-likelihood:

      ŷ = arg max_y  log p(y | x),        [Eq-3]

where log p(y | x) is the conditional log-likelihood and arg max_y returns the y achieving the maximum value (across possible y arguments).

It is sometimes beneficial to interpret a correct label y in a one-hot format ỹ. This one-hot vector ỹ is a vector the size of Y: each coordinate ỹ[j] corresponds to one of the j items of Y. In a one-hot format, all entries are 0 except for the coordinate corresponding to y. As an example, consider the case of rolling a 1, 2, 3, 4, 5, or 6 (represented by y) from a weighted six-sided die (where this die outcome represents our label), given some input x: to represent a correct roll of y = 4, a reasonable one-hot equivalent is ỹ = (0, 0, 0, 1, 0, 0). We can use this one-hot format to define the cross-entropy loss, which is the negative dot product between ỹ and the vector of log conditional probability values log p(y | x_i):

      ℓ_CE(ỹ, x_i) = − Σ_{j=1}^{|Y|} ỹ[j] log p(j | x_i).        [Eq-4]
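As a quick numeric check of [Eq-4] using the die example (the probability values below are made up):

    import numpy as np

    y_onehot = np.array([0, 0, 0, 1, 0, 0])           # correct roll y = 4
    p = np.array([0.1, 0.1, 0.2, 0.4, 0.1, 0.1])      # hypothetical p(. | x)

    # Cross-entropy: negative dot product of the one-hot label with the
    # log conditional probabilities; here it reduces to -log p[3].
    loss = -(y_onehot * np.log(p)).sum()
    print(loss, -np.log(0.4))                         # both ~0.916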

(b) Show that, for the probabilistic classifier [Eq-2], maximizing the conditional log-likelihood ([Eq-3]) is the same as minimizing cross-entropy loss ([Eq-4]).^1

(c) Given N data points {(x_1, y_1), . . . , (x_N, y_N)}, formulate the empirical risk objective for [Eq-2] using cross-entropy loss. You can think of this as filling out the template

      opt_{var}  func,

where you must specify
• opt: the optimization goal (e.g., max, min, or something else)
• var: the variables or values being optimized
• func: the function (and its arguments) being optimized.

(d) Derive the gradient.

^1 You may be wondering what the predicted items in cross-entropy loss are. The cross-entropy loss measures the correctness of probabilities corresponding to particular classes against hard, "gold standard" (correct) judgments. Therefore, the predicted items are actually probabilities, from which you can easily get a single discrete prediction.

Classification: Implementation, Experimentation, and Discussion

The next three questions lead you through building, experimenting with, reporting on the performance of, and analyzing a multiclass perceptron classifier. The core deliverables for these questions are:

1. any implementations, scripts, and serialized model files needed to compile and run your code; and

2. a written report discussing
   • your implementations, including any design decisions or difficulties you experienced [from questions 3-4];
   • evidence and analysis of internal development/experimentation on the perceptron model on the development set [from question 4]; and
   • (optionally, for extra credit) results and analysis of a full comparison on the test set [from question 5].

The data we'll be using is the MNIST digit dataset: in total, there are 70,000 "images" of handwritten digits (each between 0-9). The task is to predict the digit y ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} from a 28×28 grayscale input image x. 60,000 of these are allocated for training and 10,000 are allocated for testing. Do not use the 10,000-image testing portion until question 5.

Get the data from https://www.csee.umbc.edu/courses/graduate/678/spring22/materials/mnist_rowmajor.jsonl.gz. This is a gzipped file, where each line is a JSON object. Each line has three keys: "image" (a row-major representation of each image), "label" (an int, 0-9), and "split" (train or test). Each image x is represented as a 784 (= 28 × 28) one-dimensional array (row-major refers to how each image x should be interpreted in memory).

First, get acquainted with the data: make sure that you can read it properly and you are sufficiently familiar with the storage format so you can easily use these files in later questions.^2 Second, you are to split the 60,000-image training set into an internal training set (int-train) and an internal development set (int-dev). How you split (e.g., randomly, stratified sampling, taking the first M), and the resulting sizes of int-train and int-dev, are up to you.
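A minimal sketch of reading the file and making an internal split (the fixed seed and the 90/10 proportions are illustrative choices, not requirements):

    import gzip
    import json
    import random

    import numpy as np

    train, test = [], []
    with gzip.open("mnist_rowmajor.jsonl.gz", "rt") as fh:
        for line in fh:
            rec = json.loads(line)   # keys: "image", "label", "split"
            pair = (np.array(rec["image"]), rec["label"])
            (train if rec["split"] == "train" else test).append(pair)

    random.seed(0)                   # fixed seed so the split is replicable
    random.shuffle(train)
    cut = int(0.9 * len(train))      # illustrative 90/10 split
    int_train, int_dev = train[:cut], train[cut:]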

3. (5 points) Implement a "most frequent" baseline classifier. This baseline identifies the label that was most common in training (int-train) and always produces that label during evaluation (regardless of the input).

(a) In your report, document the implementation (this should be fairly straightforward).

(b) In your report, report how well this baseline does on int-dev.
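One possible shape for this baseline, continuing from the loading sketch above (the class interface is an assumption, not a requirement):

    from collections import Counter

    class MostFrequentBaseline:
        def fit(self, labels):
            # Remember the single most common training label.
            self.label = Counter(labels).most_common(1)[0][0]
            return self

        def predict(self, x):
            # Ignore the input entirely.
            return self.label

    baseline = MostFrequentBaseline().fit([y for _, y in int_train])
    accuracy = sum(baseline.predict(x) == y for x, y in int_dev) / len(int_dev)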

^2 Notice that each component of x is a float (0 ≤ x_i ≤ 1). If you want to visualize the images, you can apply the sign function (with τ = 0) and print its output in a 28×28 grid:

      sign_τ(x_i) = 1 if x_i > τ;  0 if x_i ≤ τ.        [Eq-6]

Note that [Eq-6] provides a binary representation of the input x.

You do not have to do the following, but you may find it helps you get a handle on the data: visualize the representations (raw/original, binary, or with arbitrary τ) by creating a heatmap/tile plot. This way you can "look" at your data and verify that the labels line up with their respective images (and make sense!).
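If you do visualize, a short sketch along these lines (matplotlib assumed; int_train comes from the loading sketch above) is enough:

    import matplotlib.pyplot as plt

    x, y = int_train[0]                    # one (image, label) pair
    tau = 0.0
    binary = (x > tau).astype(int)         # [Eq-6] applied componentwise
    plt.imshow(binary.reshape(28, 28), cmap="gray")  # row-major -> 28x28 grid
    plt.title(f"label = {y}")
    plt.show()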

4. (35 points) Implement a multiclass perceptron. Train it on int-train and evaluate it on int-dev. For this question, you may use computation accelerators (e.g., blas, numpy, MATLAB) but you may not use any existing perceptron implementation.

(a) In your report, discuss your implementation and convergence criteria (this should also be fairly brief).

(b) In your report, discuss the concrete, quantifiable ways you validated that your implementation was correct. One such way could be to run it on the toy data in Question 1, though there are other ways too.

(c) In your report, experiment with at least three different configurations, where you train each configuration on int-train and then evaluate on int-dev. Document this internal development progress through quantifiable and readable means, i.e., graphs and/or tables showing how different model configurations perform on int-dev. A configuration could include learning biases or not, the convergence criteria, or the input featurization, among others.

Implementation Hint: Training a multiclass perceptron follows the training for a binary perceptron as discussed in class, with the following four changes. First, rather than there being a single weight vector w, there is now a weight vector w_j per class j. Second, a single prediction is obtained from the maximum of a vector of predictions ~y, using each of the class-specific weight vectors: that is, ~y[j] = w_j^T x and the predicted value ŷ = arg max_j ~y[j]. Third, if a prediction is incorrect, the correct class's weights w_{y*} are increased by x, and the mispredicted class's weights w_ŷ are decreased by x. Fourth, a per-class bias replaces the single bias term in the binary perceptron algorithm.
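A minimal sketch of one training epoch under this hint (the zero initialization, the epoch structure, and using the mistake count as a convergence signal are assumptions you should revisit in your own configurations):

    import numpy as np

    n_classes, n_feats = 10, 784
    W = np.zeros((n_classes, n_feats))    # one weight vector w_j per class
    b = np.zeros(n_classes)               # per-class biases

    def train_epoch(data):
        mistakes = 0
        for x, y in data:                 # data: iterable of (np.array, int)
            scores = W @ x + b            # ~y[j] = w_j^T x + b_j
            y_hat = int(np.argmax(scores))
            if y_hat != y:                # mistake-driven updates
                W[y] += x; b[y] += 1          # boost the correct class
                W[y_hat] -= x; b[y_hat] -= 1  # penalize the predicted class
                mistakes += 1
        return mistakes                   # e.g., stop when this stops improving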

5. (Extra Credit: 15 points) Using the best perceptron configuration found in the previous question, train a new perceptron model on the entire, original train set; also "re-train" a most-frequent-classifier baseline. At the end of this training, you should have two models trained on the entire 60,000-image training set. Evaluate these models on the original 10,000-image test set. Do not experiment with different configuration values here (that's what your int-dev set was for!). Include these evaluation results in your report and discuss them: were they surprising? How similar were the test results to the best internal development results?

