
Homework #2

CSE 446/546: Machine Learning

Please review all homework guidance posted on the website before submitting to Gradescope. Reminders:

• Please provide succinct answers along with succinct reasoning for all your answers. Points may be deducted if long answers demonstrate a lack of clarity. Similarly, when discussing the experimental results, concisely create tables and/or figures when appropriate to organize the experimental results. In other words, all your explanations, tables, and figures for any particular part of a question must be grouped together.

• When submitting to Gradescope, please link each question from the homework in Gradescope to the location of its answer in your homework PDF. Failure to do so may result in point deductions. For instructions, see https://www.gradescope.com/get_started#student-submission.

• Please recall that B problems, indicated in boxed text, are only graded for 546 students, and that they will be weighted at most 0.2 of your final GPA (see website for details). In Gradescope there is a place to submit solutions to A and B problems separately. You are welcome to create just a single PDF that contains answers to both and submit the same PDF twice, but associate the answers with the individual questions in Gradescope.

• If you collaborate on this homework with others, you must indicate who you worked with on your homework. Failure to do so may result in accusations of plagiarism.

Conceptual Questions [10 points]

A0. The answers to these questions should be answerable without referring to external materials. Briefly justify your answers with a few words.

a. [2 points] Suppose that your estimated model for predicting house prices has a large positive weight on the feature "number of bathrooms". Does it imply that if we remove this feature and refit the model, the new predictions will be strictly worse than before? Why?

b. [2 points] Compared to an L2-norm penalty, explain why an L1-norm penalty is more likely to result in a larger number of 0s in the weight vector (or explain why it is not).

c. [2 points] In at most one sentence each, state one possible upside and one possible downside of using the following regularizer:

d. [1 point] True or False: If the step size for gradient descent is too large, it may not converge.

e. [2 points] In your own words, describe why SGD works.

f. [2 points] In at most one sentence each, state one possible advantage of SGD (stochastic gradient descent) over GD (gradient descent) and one possible disadvantage of SGD relative to GD.


Convexity and Norms

A1. A norm $\|\cdot\|$ over $\mathbb{R}^n$ is defined by the properties: (i) non-negativity: $\|x\| \ge 0$ for all $x \in \mathbb{R}^n$ with equality if and only if $x = 0$, (ii) absolute scalability: $\|a x\| = |a|\,\|x\|$ for all $a \in \mathbb{R}$ and $x \in \mathbb{R}^n$, (iii) triangle inequality: $\|x + y\| \le \|x\| + \|y\|$ for all $x, y \in \mathbb{R}^n$.

a. [3 points] Show that $f(x) = \sum_{i=1}^n |x_i|$ is a norm. (Hint: begin by showing that $|a + b| \le |a| + |b|$ for all $a, b \in \mathbb{R}$.)

b. [2 points] Show that $g(x) = \sum_{i=1}^n \dots$ is not a norm. (Hint: it suffices to find two points in $n = 2$ dimensions such that the triangle inequality does not hold.)

Context: norms are often used in regularization to encourage specific behaviors of solutions. If we define $\|x\|_p = \left(\sum_{i=1}^n |x_i|^p\right)^{1/p}$, then one can show that $\|x\|_p$ is a norm for all $p \ge 1$. The important cases of $p = 2$ and $p = 1$ correspond to the penalty for ridge regression and the lasso, respectively.

B1. [6 points] For any $x \in \mathbb{R}^n$, define the following norms: $\|x\|_1 := \sum_{i=1}^n |x_i|$, $\|x\|_2 := \left(\sum_{i=1}^n |x_i|^2\right)^{1/2}$, and $\|x\|_\infty := \lim_{p\to\infty} \|x\|_p = \max_{i=1,\dots,n} |x_i|$. Show that $\|x\|_\infty \le \|x\|_2 \le \|x\|_1$.

A2. [3 points] A set $A \subseteq \mathbb{R}^n$ is convex if $\lambda x + (1 - \lambda) y \in A$ for all $x, y \in A$ and $\lambda \in [0, 1]$.

For each of the grey-shaded sets above (I–III), state whether each one is convex, or state why it is not convex using any of the points a, b, c, d in your answer.

A3. [4 points] We say a function $f : \mathbb{R}^d \to \mathbb{R}$ is convex on a set $A$ if $f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$ for all $x, y \in A$ and $\lambda \in [0, 1]$.

For each of the grey-colored functions below (I–III), state whether each one is convex on the given interval or state why not with a counterexample using any of the points a, b, c, d in your answer.

a. Function in panel I on [a, c]

b. Function in panel II on [a, c]

c. Function in panel III on [a, d]

d. Function in panel III on [c, d]


B2. Use just the definitions above and let $\|\cdot\|$ be a norm.

a. [3 points] Show that $f(x) = \|x\|$ is a convex function.

b. [3 points] Show that $\{x \in \mathbb{R}^n : \|x\| \le 1\}$ is a convex set.

(This is the function $g$ considered in 1b above specialized to $n = 2$.) We know $g$ is not a norm. Is the defined set convex? Why not?

Context: It is a fact that a function $f$ defined over a set $A \subseteq \mathbb{R}^n$ is convex if and only if the set $\{(x, z) \in \mathbb{R}^{n+1} : z \ge f(x),\, x \in A\}$ is convex. Draw a picture of this for yourself to be sure you understand it.

B3. For $i = 1, \dots, n$ let $\ell_i(w)$ be convex functions over $w \in \mathbb{R}^d$ (e.g., $\ell_i(w) = (y_i - w^T x_i)^2$), let $\|\cdot\|$ be any norm, and let $\lambda > 0$.

a. [3 points] Show that

$\sum_{i=1}^n \ell_i(w) + \lambda \|w\|$

is convex over $w \in \mathbb{R}^d$. (Hint: show that if $f, g$ are convex functions, then $f(x) + g(x)$ is also convex.)

b. [1 point] Explain in one sentence why we prefer to use loss functions and regularized loss functions that are convex.

Lasso on a real dataset

Given $\lambda > 0$ and data $(x_1, y_1), \dots, (x_n, y_n)$, the Lasso is the problem of solving

$\min_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} \;\sum_{i=1}^n \left(x_i^T w + b - y_i\right)^2 + \lambda \|w\|_1$

where $\lambda$ is a regularization tuning parameter. For the programming part of this homework, you are required to implement the coordinate descent method of Algorithm 1 to solve the Lasso problem.

You may use common computing packages (such as NumPy or SciPy), but do not use an existing Lasso solver (e.g., from scikit-learn).

Algorithm 1: Coordinate Descent Algorithm for Lasso

Before you get started, here are some hints that you may find helpful (a rough NumPy sketch of coordinate descent is given after this list):

• For-loops can be slow whereas vector/matrix computation in NumPy is very optimized; exploit this as much as possible.

• The pseudocode provided has many opportunities to speed up computation by precomputing quantities like $a_k$ before the for loop. These small changes can speed things up considerably.

• As a sanity check, ensure the objective value is nonincreasing with each step.

• It is up to you to decide on a suitable stopping condition. A common criterion is to stop when no element of $w$ changes by more than some small $\delta$ during an iteration. If you need your algorithm to run faster, an easy place to start is to loosen this condition.

• You will need to solve the Lasso on the same dataset for many values of $\lambda$. This is called a regularization path. One way to do this efficiently is to start at a large $\lambda$, and then for each consecutive solution, initialize the algorithm with the previous solution, decreasing $\lambda$ by a constant ratio (e.g., by a factor of 2) until finished.

• The smallest value of $\lambda$ for which the solution $\widehat{w}$ is entirely zero is

$\lambda_{\max} = \max_{k=1,\dots,d} 2 \left| \sum_{i=1}^n x_{i,k} \left( y_i - \frac{1}{n} \sum_{j=1}^n y_j \right) \right|$

This is helpful for choosing the first $\lambda$ in a regularization path.
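To make the hints concrete, here is a minimal NumPy sketch of one way to implement coordinate descent for the Lasso objective above (with offset $b$). It is an illustration under the stated objective, not a reference solution; the function name lasso_coordinate_descent and the default tolerances are our own choices.

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, w_init=None, tol=1e-4, max_iter=1000):
    """Sketch of coordinate descent for sum_i (x_i^T w + b - y_i)^2 + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d) if w_init is None else w_init.copy()
    a = 2.0 * np.sum(X ** 2, axis=0)                # precompute a_k for every coordinate
    for _ in range(max_iter):
        w_old = w.copy()
        b = np.mean(y - X @ w)                      # offset update has a closed form
        for k in range(d):
            resid = y - b - X @ w + X[:, k] * w[k]  # residual with coordinate k removed
            c = 2.0 * X[:, k] @ resid
            # soft-thresholding update for w_k
            if c < -lam:
                w[k] = (c + lam) / a[k]
            elif c > lam:
                w[k] = (c - lam) / a[k]
            else:
                w[k] = 0.0
        if np.max(np.abs(w - w_old)) < tol:         # stop when no coordinate moves much
            break
    return w, b
```

Warm-starting w_init with the previous solution along a regularization path, as suggested above, typically reduces the number of sweeps needed for each new $\lambda$.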

A4. We will first try out your solver with some synthetic data. A benefit of the Lasso is that if we believe many features are irrelevant for predicting $y$, the Lasso can be used to enforce a sparse solution, effectively differentiating between the relevant and irrelevant features. Suppose that $x \in \mathbb{R}^d$, $y \in \mathbb{R}$, $k < d$, and pairs of data $(x_i, y_i)$ for $i = 1, \dots, n$ are generated independently according to the model $y_i = w^T x_i + \epsilon_i$ where

$w_j = \begin{cases} j/k & \text{if } j \in \{1, \dots, k\} \\ 0 & \text{otherwise} \end{cases}$

and $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ is some Gaussian noise (in the model above $b = 0$). Note that since $k < d$ and $w_j = 0$ for $j > k$, the features $k+1$ through $d$ are unnecessary (and potentially even harmful) for predicting $y$.

With this model in mind, let $n = 500$, $d = 1000$, $k = 100$, and $\sigma = 1$. Generate some data by choosing $x_i \in \mathbb{R}^d$, where each component is drawn from a $\mathcal{N}(0, 1)$ distribution, and $y_i$ generated as specified above.
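As a concrete starting point, a data-generation sketch for this synthetic setup might look like the following; the seed and variable names are arbitrary choices, not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)                    # arbitrary seed
n, d, k, sigma = 500, 1000, 100, 1.0
w_true = np.zeros(d)
w_true[:k] = np.arange(1, k + 1) / k              # w_j = j/k for j = 1, ..., k
X = rng.standard_normal((n, d))                   # each component drawn from N(0, 1)
y = X @ w_true + sigma * rng.standard_normal(n)   # y_i = w^T x_i + eps_i, eps_i ~ N(0, sigma^2)
```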

a. [10 points] With your synthetic data, solve multiple Lasso problems on a regularization path, starting at $\lambda_{\max}$ where 0 features are selected and decreasing $\lambda$ by a constant ratio (e.g., 1.5) until nearly all the features are chosen. In plot 1, plot the number of non-zeros as a function of $\lambda$ on the x-axis (Tip: use plt.xscale('log')).

b. [10 points] For each value of $\lambda$ tried, record values for the false discovery rate (FDR) (number of incorrect nonzeros in $\widehat{w}$ / total number of nonzeros in $\widehat{w}$, where for each $j$, $\widehat{w}_j$ is an incorrect nonzero if and only if $\widehat{w}_j \neq 0$ while $w_j = 0$) and the true positive rate (TPR) (number of correct nonzeros in $\widehat{w}$ / $k$). In plot 2, plot these values with the x-axis as FDR and the y-axis as TPR. Note that in an ideal situation we would have an (FDR, TPR) pair in the upper left corner, but that we can always trivially achieve $(0, 0)$ and $(\frac{d-k}{d}, 1)$. (A small bookkeeping sketch for these two quantities follows part (c).)

c. [5 points] Comment on the effect of $\lambda$ in these two plots.
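For part (b), a small bookkeeping helper along these lines may be useful; fdr_tpr is a hypothetical name, and w_true refers to the generating weights from the sketch above.

```python
import numpy as np

def fdr_tpr(w_hat, w_true, k):
    """FDR and TPR of one Lasso solution against the true support."""
    nonzero = w_hat != 0
    total = np.count_nonzero(nonzero)
    incorrect = np.count_nonzero(nonzero & (w_true == 0))  # nonzero where the true weight is zero
    correct = np.count_nonzero(nonzero & (w_true != 0))    # nonzero on the true support
    fdr = incorrect / total if total > 0 else 0.0
    tpr = correct / k
    return fdr, tpr
```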

A5. We'll now put the Lasso to work on some real data. Download the training data set "crime-train.txt" and the test data set "crime-test.txt" from the website under Homework 1. Store your data in your working directory and read in the files with:

import pandas as pd
df_train = pd.read_table("crime-train.txt")
df_test = pd.read_table("crime-test.txt")


This stores the data as Pandas DataFrame objects. DataFrames are similar to Numpy arrays but more flexible; unlike Numpy arrays, they store row and column indices along with the values of the data. Each column of a DataFrame can also, in principle, store data of a different type. For this assignment, however, all data are floats. Here are a few commands that will get you working with Pandas for this assignment:

df.head()                 # Print the first few lines of DataFrame df.
df.index                  # Get the row indices for df.
df.columns                # Get the column indices.
df["foo"]                 # Return the column named "foo".
df.drop("foo", axis=1)    # Return all columns except "foo".
df.values                 # Return the values as a Numpy array.
df["foo"].values          # Grab column foo and convert to Numpy array.
df.iloc[:3, :3]           # Use numerical indices (like Numpy) to get 3 rows and cols.

The data consist of local crime statistics for 1,994 US communities. The response $y$ is the rate of violent crimes reported per capita in a community. The name of the response variable is ViolentCrimesPerPop, and it is held in the first column of df_train and df_test. There are 95 features. These include a wide range of variables: some are the consequence of complex political processes, such as the size of the police force, while others are demographic characteristics of the community, including self-reported statistics about race, age, education, and employment drawn from Census reports.

The goals of this problem are twofold: first, to encourage you to think deeply about models you might train and how they might be misused; and second, to see how Lasso encourages sparsity of linear models in settings where the feature set is very large relative to the number of training examples. We emphasize that training a model on this dataset can suggest a degree of correlation between a community's demographics and the rate at which a community experiences and reports violent crime. We strongly encourage students to consider why these correlations may or may not hold more generally, whether correlations might result from a common cause, and what issues can result in misinterpreting what a model can explain.

We have split the dataset into a training and test set with 1,595 and 399 entries, respectively (the features have been standardized to have mean 0 and variance 1). We'd like to use this training set to fit a model to predict the crime rate in new communities and evaluate model performance on the test set. As there are a considerable number of input variables and fairly few training datapoints, overfitting is a serious issue. In order to avoid this, use the coordinate descent LASSO algorithm you just implemented in the previous problem.

a. [4 points] Begin by reading the documentation for the original version of this dataset: http://archive.ics.uci.edu/ml/datasets/communities+and+crime. Report 3 features included in this dataset for which historical policy choices in the US would lead to variability in these features. As an example, the number of police in a community is often the consequence of decisions made by governing bodies, elections, and the amount of tax revenue available to decisionmakers.

b. [4 points] Before you train a model for this prediction task, describe 3 features in the dataset which might, if found to have nonzero weight in the model, be interpreted as reasons for higher levels of violent crime, but which might actually be a result rather than (or in addition to being) the cause of this violence.

Now, we will run the LASSO solver with $\lambda = \lambda_{\max}$ defined above. For the initial weights, just use 0. Then, cut $\lambda$ down by a factor of 2 and run again, but this time pass in the values of $\widehat{w}$ from your $\lambda = \lambda_{\max}$ solution as your initial weights. This is faster than initializing with 0 weights each time. Continue the process of cutting $\lambda$ by a factor of 2 until the smallest value of $\lambda$ is less than 0.01. For all plots use a log-scale for the $\lambda$ dimension (Tip: use plt.xscale('log')).
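One possible driver loop for this regularization path is sketched below. It reuses the hypothetical lasso_coordinate_descent function from the earlier sketch and the $\lambda_{\max}$ expression from the hints; treat it as an outline, not provided code.

```python
import numpy as np
import pandas as pd

df_train = pd.read_table("crime-train.txt")
y = df_train["ViolentCrimesPerPop"].values
X = df_train.drop("ViolentCrimesPerPop", axis=1).values

lam = np.max(2 * np.abs(X.T @ (y - y.mean())))   # lambda_max: smallest lambda with w_hat = 0
w = np.zeros(X.shape[1])
lams, path = [], []
while lam >= 0.01:
    w, b = lasso_coordinate_descent(X, y, lam, w_init=w)  # warm start from previous solution
    lams.append(lam)
    path.append(w.copy())
    lam /= 2                                              # cut lambda by a factor of 2
```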

c. [4 points] Plot the number of nonzeros of each solution versus $\lambda$.

d. [4 points] Plot the regularization paths (in one plot) for the coefficients for input variables agePct12t29, pctWSocSec, pctUrban, agePct65up, and householdsize.


e. [4 points] On one plot, plot the squared error on the training and test data versus $\lambda$.

f. [4 points] Sometimes a larger value of $\lambda$ performs nearly as well as a smaller value, but a larger value will select fewer variables and perhaps be more interpretable. Inspect the weights (on features) for $\lambda = 30$. Which feature variable had the largest (most positive) Lasso coefficient? What about the most negative? Discuss briefly.

g. [4 points] Suppose there was a large negative weight on agePct65up and, upon seeing this result, a politician suggests policies that encourage people over the age of 65 to move to high-crime areas in an effort to reduce crime. What is the (statistical) flaw in this line of reasoning? (Hint: fire trucks are often seen around burning buildings; do fire trucks cause fires?)

Logistic Regression

Binary Logistic Regression

A6. Let us again consider the MNIST dataset, but now just binary classification: specifically, recognizing if a digit is a 2 or a 7. Here, let $Y = 1$ for all the 7s in the dataset, and use $Y = -1$ for the 2s. We will use regularized logistic regression. Given a binary classification dataset $\{(x_i, y_i)\}_{i=1}^n$ for $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$, we showed in class that the regularized negative log likelihood objective function can be written as

$J(w, b) = \frac{1}{n} \sum_{i=1}^n \log\left(1 + \exp\left(-y_i (b + x_i^T w)\right)\right) + \lambda \|w\|_2^2$

Note that the offset term $b$ is not regularized. For all experiments, use $\lambda = 10^{-1}$. Let $\mu_i(w, b) = \frac{1}{1 + \exp(-y_i (b + x_i^T w))}$.

a. [8 points] Derive the gradients $\nabla_w J(w, b)$ and $\nabla_b J(w, b)$ and give your answers in terms of $\mu_i(w, b)$ (your answers should not contain exponentials).

b. [8 points] Implement gradient descent with an initial iterate of all zeros. Try several values of step sizes to find one that appears to make convergence on the training set as fast as possible. Run until you feel you are near to convergence.

(i) For both the training set and the test set, plot $J(w, b)$ as a function of the iteration number (and show both curves on the same plot).

(ii) For both the training set and the test set, classify the points according to the rule $\mathrm{sign}(b + x_i^T w)$ and plot the misclassification error as a function of the iteration number (and show both curves on the same plot).

Note that you are only optimizing on the training set. The $J(w, b)$ and misclassification error plots should be on separate plots. (A small evaluation sketch for these two quantities follows part (d).)

c. [7 points] Repeat (b) using stochastic gradient descent with a batch size of 1. Note, the expected gradient with respect to the random selection should be equal to the gradient found in part (a). Take careful note of how to scale the regularizer.

d. [7 points] Repeat (b) using stochastic gradient descent with a batch size of 100. That is, instead of approximating the gradient with a single example, use 100. Note, the expected gradient with respect to the random selection should be equal to the gradient found in part (a).
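For tracking the two quantities requested in part (b), a small evaluation helper along the following lines may help. objective_and_error is a hypothetical name; it only evaluates a given iterate, so the gradient derivation from part (a) and the descent loop are still up to you.

```python
import numpy as np

def objective_and_error(w, b, X, y, lam=1e-1):
    """Evaluate J(w, b) and the misclassification rate under sign(b + x^T w).

    X is n x d, y takes values in {-1, +1}, and lam matches lambda = 10^-1."""
    margins = y * (b + X @ w)
    # logaddexp(0, -m) = log(1 + exp(-m)), computed in a numerically stable way
    J = np.mean(np.logaddexp(0.0, -margins)) + lam * np.dot(w, w)
    error = np.mean(np.sign(b + X @ w) != y)
    return J, error
```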

B4. Multinomial Logistic Regression [25 points]

We've talked a lot about binary classification, but what if we have $k > 2$ classes, like the 10 digits of MNIST? Concretely, suppose you have a dataset $\{(x_i, y_i)\}_{i=1}^n$ where $x_i \in \mathbb{R}^d$ and $y_i \in \{1, \dots, k\}$. Like in our least squares classifier of homework 1 for MNIST, we will assign a separate weight vector $w^{(\ell)}$ for each class $\ell = 1, \dots, k$; let $W = [w^{(1)}, \dots, w^{(k)}] \in \mathbb{R}^{d \times k}$. We can generalize the binary classification probabilistic model to multiple classes as follows: let

$P_W(y_i = \ell \mid W, x_i) = \frac{\exp(w^{(\ell)} \cdot x_i)}{\sum_{j=1}^k \exp(w^{(j)} \cdot x_i)}$

The negative log-likelihood function is equal to

$L(W) = -\sum_{i=1}^n \sum_{\ell=1}^k \mathbf{1}\{y_i = \ell\} \log\left( \frac{\exp(w^{(\ell)} \cdot x_i)}{\sum_{j=1}^k \exp(w^{(j)} \cdot x_i)} \right)$

Define the softmax(·) operator to be the function that takes in a vector $\theta \in \mathbb{R}^d$ and outputs a vector in $\mathbb{R}^d$ whose $i$th component is equal to $\frac{\exp(\theta_i)}{\sum_{j=1}^d \exp(\theta_j)}$. Clearly, this vector is nonnegative and sums to one. If for any $i$ we have $\theta_i \gg \max_{j \ne i} \theta_j$, then softmax($\theta$) approximates $e_i$, a vector of all zeros with a one in the $i$th component.

For each $y_i$, let $y_i$ denote the one-hot encoding of $y_i$ (i.e., $y_i \in \{0, 1\}^k$ is a vector of all zeros aside from a 1 in the $y_i$th index).

a. [5 points] If $\widehat{y}_i^{(W)} = \mathrm{softmax}(W^T x_i)$, show that $\nabla_W L(W) = -\sum_{i=1}^n x_i (y_i - \widehat{y}_i^{(W)})^T$.

b. [5 points] Recall the Ridge Regression on MNIST problem in Homework 1 and define $J(W) = \frac{1}{2} \sum_{i=1}^n \|y_i - W^T x_i\|_2^2$. If $\widetilde{y}_i^{(W)} = W^T x_i$, show that $\nabla_W J(W) = -\sum_{i=1}^n x_i (y_i - \widetilde{y}_i^{(W)})^T$. Comparing the least squares linear regression gradient step of this part to the gradient step of minimizing the negative log likelihood of the logistic model of part a may shed light on why we call this classification problem logistic regression.

c. [15 points] Using the original representations of the MNIST flattened images $x_i \in \mathbb{R}^d$ ($d = 28 \times 28 = 784$) and all $k = 10$ classes, implement gradient descent (or stochastic gradient descent) for both $J(W)$ and $L(W)$ and run until convergence on the training set of MNIST. For each of the two solutions, report the classification accuracy of each on the training and test sets using the most natural $\arg\max_j e_j^T W^T x_i$ classification rule.

We highly encourage you to use PyTorch for this problem! The base object in PyTorch is the torch.tensor, which is essentially a numpy.ndarray but with some powerful new features. Namely, tensors have accelerator support (GPU, TPU) and high-performance autodifferentiation. Don't worry too much about the details of PyTorch! We will discuss PyTorch and the torch.autograd package in greater detail once we get to neural networks! At a high level, though, torch.autograd allows us to automatically calculate the gradients of our model parameters with minimal additional cost. Yep, that's right! Your days of writing out gradients by hand are numbered! :D

We include some starter pseudocode for the negative log-likelihood + softmax portion of this question. You are expected to find good hyperparameters. You can install the library at https://pytorch.org/ and access the relevant beginner tutorial here.

import torch

W = torch.zeros(784, 10, requires_grad=True)
for epoch in range(epochs):
    y_hat = torch.matmul(X_train, W)
    # cross entropy combines softmax calculation with NLLLoss
    loss = torch.nn.functional.cross_entropy(y_hat, y_train)
    # computes derivatives of the loss with respect to W
    loss.backward()
    # gradient descent update
    W.data = W.data - step_size * W.grad
    # .backward() accumulates gradients into W.grad instead
    # of overwriting, so we need to zero out the gradients
    W.grad.zero_()
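An analogous starter sketch for the least-squares objective $J(W)$ from part (b) could look like the following, where y_train_onehot is assumed to be an n × 10 one-hot float tensor (our assumption; build it however you like) and epochs, step_size, and X_train are placeholders just as in the snippet above.

```python
import torch

W = torch.zeros(784, 10, requires_grad=True)
for epoch in range(epochs):
    y_hat = torch.matmul(X_train, W)
    # squared-error loss against one-hot targets; the 0.5 matches the 1/2 in J(W)
    loss = 0.5 * torch.sum((y_train_onehot - y_hat) ** 2)
    loss.backward()
    W.data = W.data - step_size * W.grad
    W.grad.zero_()
```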

Confidence Interval of Least Squares Estimation

Bounding the Estimate [30 points]

B5. Let us consider the setting where we have $n$ inputs $X_1, \dots, X_n \in \mathbb{R}^d$ and $n$ observations $Y_i = \langle X_i, \beta^* \rangle + \epsilon_i$, for $i = 1, \dots, n$. Here, $\beta^*$ is a ground truth vector in $\mathbb{R}^d$ that we are trying to estimate, and the noise $\epsilon_i \sim \mathcal{N}(0, 1)$. To estimate, we use the least squares estimator $\widehat{\beta} = \arg\min_\beta \|X\beta - Y\|_2^2$. Moreover, we will use $n = 20000$ and $d = 10000$ in this problem.

a. [3 points] Show that βbj ∼ N (β

j

,(XT X)

−1

j,j ) for each j = 1, ..., d. (Hint: see notes on confidence

intervals from lecture.)

b. [4 points] Fix $\delta \in (0, 1)$ and suppose $\beta^* = 0$. Applying the proposition from the notes, conclude that for each $j \in [d]$, with probability at least $1 - \delta$, $|\widehat{\beta}_j| \le \sqrt{2 (X^T X)^{-1}_{j,j} \log(2/\delta)}$. Can we conclude that with probability at least $1 - \delta$, $|\widehat{\beta}_j| \le \sqrt{2 (X^T X)^{-1}_{j,j} \log(2/\delta)}$ for all $j \in [d]$ simultaneously? Why or why not?

c. [5 points] Let's explore this question empirically. Assume data is generated as $x_i = \sqrt{(i \bmod d) + 1}\; e_{(i \bmod d) + 1}$ for all $i$, where $e_i$ is the $i$th canonical vector and $i \bmod d$ is the remainder of $i$ when divided by $d$. Generate each $y_i$ according to the model above. Compute $\widehat{\beta}$ and plot each $\widehat{\beta}_j$ as a scatter plot with the x-axis as $j \in \{1, \dots, d\}$. Plot $\pm\sqrt{2 (X^T X)^{-1}_{j,j} \log(2/\delta)}$ as the upper and lower confidence intervals with $1 - \delta = 0.95$. How many $\widehat{\beta}_j$'s are outside the confidence interval? Hint: Due to the special structure of how we generated $x_i$, we can compute $(X^T X)^{-1}$ analytically without computing an inverse explicitly. (A small data-generation sketch follows part (f).)

d. [5 points] For events $E_1, \dots, E_m$ that are possibly dependent, the "union bound" says that $P(\cup_i E_i) \le \sum_i P(E_i)$. Use the union bound to come up with a threshold $\gamma > 0$ such that the probability that any of the coefficients $|\widehat{\beta}_j - \beta^*_j|$ exceeds $\gamma$ is less than $\delta$.

e. [7 points] Fix $\mu > 0$. Assume each $x_i$ is generated as above, and assume $\beta^*_i = \mu$ for $i \le d/2$ and $\beta^*_i = 0$ for $i > d/2$. Using the threshold $\gamma$ derived in part (d), find a sufficient condition for how big $n$ needs to be to ensure that $\{j \in \{1, \dots, d\} : |\widehat{\beta}_j| > \gamma\} = \{j \in \{1, \dots, d\} : \beta^*_j \ne 0\}$ with probability at least $1 - \delta$.

f. [6 points] Is the union bound tight? Let $Z_1, \dots, Z_d$ be i.i.d. random variables that follow $\mathcal{N}(0, 1)$. Show that there exist absolute constants $c, C$ such that $c\sqrt{\log(d)} \le \mathbb{E}[\max_{i=1,\dots,d} Z_i] \le C\sqrt{\log(d)}$. Hint: Use the bounds on $\int_{x=t}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\, dx$ given in the notes. For the lower bound use $\mathbb{E}[Y] \ge t P(Y \ge t)$. For the upper bound use the fact that $\mathbb{E}[Y] \le \mathbb{E}[|Y|] = \int_{t=0}^{\infty} P(|Y| \ge t)\, dt$.
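For part (c), a minimal data-generation sketch is shown below. It stores each $x_i$ sparsely as a single (index, scale) pair rather than forming the full $n \times d$ matrix; the seed, the variable names, and the choice $\beta^* = 0$ (as in part (b)) are assumptions for illustration.

```python
import numpy as np

n, d = 20000, 10000
rng = np.random.default_rng(0)                       # arbitrary seed
beta_star = np.zeros(d)                              # beta* = 0, as in part (b)
idx = np.arange(n) % d                               # x_i is supported on coordinate (i mod d) + 1
scale = np.sqrt(idx + 1.0)                           # with value sqrt((i mod d) + 1)
y = scale * beta_star[idx] + rng.standard_normal(n)  # y_i = <x_i, beta*> + eps_i, eps_i ~ N(0, 1)
```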


