
CSE 150A. Assignment 3 Summer 2020, Session 1

Out: Thu Jul 16

Due: Thu Jul 23

3.1 Maximum likelihood estimation

(a) Complete data

Consider a complete data set of i.i.d. examples {(a_t, b_t, c_t, d_t)}_{t=1}^T drawn from the joint distribution of the above belief network. Compute the maximum likelihood estimates of the conditional probability tables (CPTs) shown below for this data set. Express your answers in terms of equality-testing functions, such as:

    I(a, a_t) = 1 if a = a_t,
                0 if a ≠ a_t.

For example, in terms of this function, the maximum likelihood estimate for the CPT at node A is given by

    P(A = a) = (1/T) Σ_{t=1}^T I(a, a_t).

Complete the numerators and denominators in the below expressions.

    P(B = b | A = a) =

    P(C = c | A = a, B = b) =

    P(D = d | A = a, C = c) =
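For intuition, here is a minimal sketch of how such counting-based estimates could be computed from a complete data set. The arrays a and b below are hypothetical placeholder data, not the assignment's data, and the helper name is made up for illustration:

    import numpy as np

    # Hypothetical complete data set: one entry per example t = 1, ..., T.
    a = np.array([0, 1, 1, 0, 1])        # observed values a_t of node A
    b = np.array([1, 1, 0, 0, 1])        # observed values b_t of node B

    def mle_conditional(child_vals, child, parent_vals, parents):
        """Counting estimate: sum_t I(child, child_t) I(parents, parents_t) / sum_t I(parents, parents_t)."""
        match = np.all(parent_vals == np.asarray(parents), axis=1)
        return np.sum((child_vals == child) & match) / np.sum(match)

    # Maximum likelihood estimate of P(B = 1 | A = 1) from the counts above.
    print(mle_conditional(b, 1, a.reshape(-1, 1), [1]))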

[Figure: belief network over nodes A, B, C, and D, with CPTs P(A), P(B|A), P(C|A,B), and P(D|A,C).]

(b) Posterior probability

Consider the belief network shown above, with observed nodes B and D and hidden nodes A and C. Compute the posterior probability P(a, c|b, d) in terms of the CPTs of the belief network, that is, in terms of P(a), P(b|a), P(c|a, b), and P(d|a, c).

(c) Posterior probability

Compute the posterior probabilities P(a|b, d) and P(c|b, d) in terms of your answer from part (b). In other words, in this problem, you may assume that P(a, c|b, d) is given, since you showed how to compute that in part (b).

(d) Log-likelihood

Consider a partially complete data set of i.i.d. examples {(b_t, d_t)}_{t=1}^T drawn from the joint distribution of the above belief network. The log-likelihood of the data set is given by:

    L = Σ_t log P(B = b_t, D = d_t).

Compute this log-likelihood in terms of the CPTs of the belief network. You may re-use work from earlier parts of the problem.

[Figure: the same belief network over nodes A, B, C, and D, repeated for reference.]

(e) EM algorithm

The posterior probabilities from parts (b) and (c) can be used by an EM algorithm to estimate CPTs that maximize the log-likelihood from part (d). Complete the numerator and denominator in the below expressions for the EM update rules. Simplify your answers as much as possible, expressing them in terms of the posterior probabilities P(a, c|b_t, d_t), P(a|b_t, d_t), and P(c|b_t, d_t), as well as the functions I(b, b_t) and I(d, d_t).

    P(A = a) ←

    P(B = b | A = a) ←

    P(C = c | A = a, B = b) ←

    P(D = d | A = a, C = c) ←


3.2 EM algorithm for noisy-OR

In this problem, we will see how the EM algorithm can be used to learn the parameters of a noisy-OR model. Part (a) is related to the derivation of the EM update rules for the noisy-OR model. Parts (b) through (d) will ask you to write code to learn the parameters of a noisy-OR model based on a particular data set.

(a) Equivalence of models

The following diagrams depict two variations of the noisy-OR model:

[Figure: in Model A, the inputs X1, X2, ..., Xn are the parents of Y. In Model B, each Xi is the parent of a hidden node Zi, and Z1, Z2, ..., Zn are the parents of Y.]

For this part of the problem, we will use the notation P_A(···) and P_B(···) to distinguish between probabilities computed using Model A and Model B, respectively.

Model A is a standard noisy-OR model in which the CPT for Y is

    P_A(Y = 1 | X = x) = 1 − Π_{i=1}^n (1 − p_i)^{x_i}.

Model B is a variation of the noisy-OR model in which we've inserted a layer of hidden variables Z1, ..., Zn, with the following CPTs:

    P_B(Z_i = 1 | X_i = x_i) = p_i if x_i = 1,
                               0   if x_i = 0.

    P_B(Y = 1 | Z = z) = 1 if z_i = 1 for any i,
                         0 if z_i = 0 for all i.

In contrast to Model A, the Y node in Model B is a deterministic OR, since it will be 1 if and only if any of the Z_i's are 1. We can view the Z_i nodes as being a sort of "noisy copy" of the corresponding X_i nodes: if X_i = 0, then Z_i is guaranteed to be 0 as well, but if X_i = 1, then Z_i has probability p_i of being 1.

Note also that both models are defined in terms of parameters p_i for i ∈ {1, ..., n}.


To prove that the Y nodes in both models are equivalent, it is enough to show that

    P_A(Y = 0 | X = x) = P_B(Y = 0 | X = x).

The following equations are an incomplete proof of this fact, which differs slightly from the one seen in lecture. To complete this proof, provide a brief justification for each of the labeled steps.

(Because of this result, we can simply use the notation P(···) for probabilities in the remaining parts of this problem, since we have shown that the two models P_A(···) and P_B(···) agree.)
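As an optional sanity check, the equivalence can also be verified numerically by summing over all configurations of the hidden layer in Model B and comparing against the closed-form expression of Model A. The sketch below uses arbitrary, made-up values of p and x:

    import itertools
    import numpy as np

    p = np.array([0.3, 0.6, 0.8])          # arbitrary noisy-OR parameters p_i
    x = np.array([1, 0, 1])                # arbitrary input vector

    # Model A: closed-form noisy-OR CPT.
    pA_y1 = 1.0 - np.prod((1.0 - p) ** x)

    # Model B: sum over all 2^n configurations of the hidden layer Z.
    pB_y1 = 0.0
    for z in itertools.product([0, 1], repeat=len(x)):
        # P(Z_i = z_i | X_i = x_i): p_i or 1 - p_i when x_i = 1; Z_i forced to 0 when x_i = 0.
        prob_z = 1.0
        for zi, xi, pi in zip(z, x, p):
            prob_z *= (pi if zi == 1 else 1.0 - pi) if xi == 1 else (0.0 if zi == 1 else 1.0)
        y = int(any(z))                    # Y is a deterministic OR of the Z_i
        pB_y1 += prob_z * y

    print(pA_y1, pB_y1)                    # the two values should agree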

(b) EM Implementation: Per-iteration statistics

Suppose we are given a data set of T samples {(x_t, y_t)}_{t=1}^T from Model A.

If we try to apply standard maximum-likelihood techniques to estimate the values of the parameters p_i based on Model A, we will find that there is no closed-form solution.

However, since part (a) showed that Model A and Model B are equivalent, we can instead view the data as an incomplete data set corresponding to Model B. This allows us to estimate the values of p_i using the EM algorithm.

Noting that p_i corresponds to P(Z_i = 1 | X_i = 1), which is one of the CPT entries for Model B, we can write the EM update rule as a ratio of expected counts:

    p_i ← count(Z_i = 1, X_i = 1) / count(X_i = 1)
        = [Σ_t P(Z_i = 1, X_i = 1 | X = x_t, Y = y_t)] / [Σ_t P(X_i = 1 | X = x_t, Y = y_t)],

where the counts above are expected counts computed under the current parameters. Then, as shown in class, the EM update rule for noisy-OR can be simplified to the following:

    p_i ← [Σ_t y_t x_{it} p_i / P(Y = 1 | X = x_t)] / [Σ_t x_{it}],

where x_{it} denotes the i-th component of x_t, so that Σ_t x_{it} is the number of observations where X_i = 1.


Next, you will use the EM algorithm for estimating the noisy-OR parameters p_i. Download the data files hw3_noisyOr_x.txt and hw3_noisyOr_y.txt for this homework assignment. The data set has T = 267 examples over n = 23 inputs.¹

For a data set {(x_t, y_t)}_{t=1}^T, the normalized log (conditional) likelihood is given by:

    L = (1/T) Σ_{t=1}^T log P(Y = y_t | X = x_t).

In your code, you should initialize all p_i = 0.05 and perform 512 iterations of the EM algorithm using the update rule shown above. A common, easy-to-miss reason for getting different results on this assignment is failing to use the initialization specified here. At each iteration, compute the log conditional likelihood shown above. (If you have implemented the EM algorithm correctly, this log conditional likelihood will always increase from one iteration to the next.)

Also compute the number of mistakes made by the model at each iteration; a mistake occurs either

  • when y_t = 0 and P(Y = 1 | X = x_t) ≥ 0.5 (indicating a false positive), or
  • when y_t = 1 and P(Y = 1 | X = x_t) < 0.5 (indicating a false negative).

In other words, use the current parameter values of the iteration to try to predict Y given the values for X, and count how many of those predictions are wrong. The number of mistakes should generally decrease as the model is trained, though it is not guaranteed to do so at each iteration.
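A minimal implementation sketch is shown below. It assumes the two data files are whitespace-separated tables of 0/1 values that NumPy's loadtxt can read, which is an assumption about the file format rather than something stated in the assignment:

    import numpy as np

    # Assumed file names and whitespace-separated 0/1 format for the provided data files.
    X = np.loadtxt("hw3_noisyOr_x.txt")      # shape (T, n): inputs x_t
    y = np.loadtxt("hw3_noisyOr_y.txt")      # shape (T,): labels y_t
    T, n = X.shape

    p = np.full(n, 0.05)                     # the initialization specified above

    def statistics(p):
        """Number of mistakes and normalized log conditional likelihood under p."""
        prob_y1 = 1.0 - np.prod((1.0 - p) ** X, axis=1)          # P(Y = 1 | x_t)
        log_like = np.mean(np.where(y == 1, np.log(prob_y1), np.log(1.0 - prob_y1)))
        mistakes = int(np.sum((y == 0) & (prob_y1 >= 0.5)) + np.sum((y == 1) & (prob_y1 < 0.5)))
        return mistakes, log_like

    print(0, *statistics(p))                 # iteration 0: before any EM update
    for iteration in range(1, 513):
        prob_y1 = np.clip(1.0 - np.prod((1.0 - p) ** X, axis=1), 1e-12, None)
        # Simplified noisy-OR EM update; every p_i is computed from the old parameters.
        numerator = np.sum(y[:, None] * X * p[None, :] / prob_y1[:, None], axis=0)
        p = numerator / np.sum(X, axis=0)
        print(iteration, *statistics(p))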

For this part, turn in a completed version of the following table:

    iteration    number of mistakes M    log conditional likelihood L
    0            175                     -0.9581

You should use the already completed entries of this table to check your work. As always, you may program in the language of your choice.

(c) EM Implementation: Estimated values for p_i

Produce a table or bar plot of your final estimated values for p_i.

(d) EM Implementation: Source code

Include your source code for this problem as part of your Gradescope submission.

¹ For those interested, more information about this data set is available here:
http://archive.ics.uci.edu/ml/machine-learning-databases/spect/SPECT.names


Debugging hints: If you are having trouble getting your results to match starting at iteration 1, try printing out the values of p_1 and p_23 after the first few iterations. After iteration 1, you should find that p_1 ≈ 0.11655 and p_23 ≈ 0.15310. If you find that your value of p_1 matches but p_23 does not, this is likely a sign that you are using the updated values for p_i too early. To solve this, you should save the new values for p_i in temporary variables until p_1, p_2, ..., p_23 have all been computed; then, overwrite the old values of p_i all at once.
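To make this hint concrete, here is the same update written as an explicit loop over i, reusing the arrays X, y, and p from the sketch above; every new value is computed from the old parameters before any of them is overwritten:

    import numpy as np

    prob_y1 = 1.0 - np.prod((1.0 - p) ** X, axis=1)     # computed once, from the old p
    new_p = np.empty_like(p)
    for i in range(len(p)):
        new_p[i] = np.sum(y * X[:, i] * p[i] / prob_y1) / np.sum(X[:, i])
    p = new_p    # overwrite only after every p_i has been computed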


3.3 EM algorithm for binary matrix completion

In this problem you will use the EM algorithm to build a simple movie recommendation system. Download the files hw3_movieTitles_su20.txt and hw3_movieRatings_su20.txt. The first file provides the titles of 60 movies. The second file contains a 150 × 60 matrix of zeros, ones, and missing elements denoted by question marks. The (i, j)-th element in this matrix contains the i-th user's rating of the j-th movie (corresponding to the j-th line of the first file), according to the following key:

    1    recommended,
    0    not recommended,
    ?    not seen.

Note: In past versions of this course, this data came from a survey given to you and your peers. In the interest of time for the Summer session, I have generated synthetic data to only loosely match the distribution seen from past students, so as to give them some privacy.

(a) Sanity check

Compute the mean popularity rating of each movie, given by the simple ratio

    (number of users who recommended the movie) / (number of users who saw the movie),

and sort the movies by this ratio. Print out the movie titles from most popular (Doctor Strange) to least popular (The Last Airbender).
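A short sketch of this sanity check follows; it assumes the ratings file contains whitespace-separated '1', '0', and '?' tokens, one user per line, which is an assumption about the exact file format:

    # Sketch of part (a), assuming whitespace-separated '1' / '0' / '?' tokens per line.
    with open("hw3_movieTitles_su20.txt") as f:
        titles = [line.strip() for line in f]

    with open("hw3_movieRatings_su20.txt") as f:
        ratings = [line.split() for line in f]        # 150 rows of 60 tokens each

    popularity = []
    for j, title in enumerate(titles):
        column = [row[j] for row in ratings]
        seen = sum(tok != "?" for tok in column)              # users who saw the movie
        recommended = sum(tok == "1" for tok in column)       # users who recommended it
        popularity.append((recommended / seen, title))

    for ratio, title in sorted(popularity, reverse=True):     # most to least popular
        print(f"{ratio:.3f}  {title}")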

(b) Likelihood

Now you will learn a naive Bayes model of these movie ratings, represented by the belief network shown below, with hidden variable Y ∈ {1, 2, ..., k} and partially observed binary variables R1, R2, ..., R60 (corresponding to movie ratings).

[Figure: naive Bayes belief network with hidden node Y as the parent of R1, R2, R3, ..., R60.]

This model assumes that there are k different types of movie-goers, and that the i-th type of movie-goer, who represents a fraction P(Y = i) of the overall population, likes the j-th movie with conditional probability P(R_j = 1 | Y = i).


This model is an interesting extension of the hidden variable models studied in lecture: for this problem, each user's sampled set of ratings contains its own mix of observed and unobserved variables. Some users have watched many movies, some users have watched only a few. A hidden variable for one user might be observed for another user. This is what we mean by "partially observed". Note that the type of movie-goer (variable Y) is always hidden for all users.

Let Ω_t denote the set of movies seen (and hence rated) by the t-th user. Show that the likelihood of the t-th user's ratings is given by:

    P({R_j = r_{jt}}_{j ∈ Ω_t}) = Σ_{i=1}^k P(Y = i) Π_{j ∈ Ω_t} P(R_j = r_{jt} | Y = i),

where r_{jt} denotes the t-th user's rating of the j-th movie.

(c) E-step

The E-step of this model is to compute, for each user, the posterior probability that they correspond to a particular type of movie-goer.
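Continuing the sketch above (same assumed prior and cond arrays), the E-step posterior can be obtained by normalizing the per-type joint terms:

    def posterior_over_types(ratings_row, prior, cond):
        """E-step for one user: P(Y = i | user's ratings), for i = 1, ..., k."""
        per_type = prior.copy()
        for j, tok in enumerate(ratings_row):
            if tok == "1":
                per_type = per_type * cond[:, j]
            elif tok == "0":
                per_type = per_type * (1.0 - cond[:, j])
        return per_type / per_type.sum()       # normalize the joint terms by the likelihood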

(d) M-step

The M-step of the model is to re-estimate the probabilities P(Y = i) and P(R_j = 1 | Y = i) that define the CPTs of the belief network. As shorthand, let

    ρ_{it} = P(Y = i | user t's ratings)

denote the probabilities computed in the E-step of the algorithm. Also, let T = 150 denote the number of users. Show that the EM updates are given by

    P(Y = i) ← (1/T) Σ_{t=1}^T ρ_{it},

    P(R_j = 1 | Y = i) ← [Σ_{t: j ∈ Ω_t} ρ_{it} I(r_{jt}, 1) + Σ_{t: j ∉ Ω_t} ρ_{it} P(R_j = 1 | Y = i)] / Σ_{t=1}^T ρ_{it},

where Σ_{t: j ∈ Ω_t} denotes a sum over all users t that rated movie j. Similarly, Σ_{t: j ∉ Ω_t} denotes a sum over all users t that did not rate movie j.
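A sketch of one possible M-step follows, again using the assumed prior/cond array layout and a matrix rho with rho[t, i] equal to the E-step posterior for user t and type i:

    import numpy as np

    def m_step(ratings, rho, cond):
        """One M-step for the naive Bayes movie model.

        ratings : list of 150 rows of '1' / '0' / '?' tokens
        rho     : array (T, k), rho[t, i] = P(Y = i | user t's ratings) from the E-step
        cond    : array (k, 60), current values of P(R_j = 1 | Y = i)
        Returns the updated (prior, cond).
        """
        T, k = rho.shape
        new_prior = rho.sum(axis=0) / T                  # P(Y = i) <- average posterior
        new_cond = np.empty_like(cond)
        for j in range(cond.shape[1]):
            seen = np.array([row[j] != "?" for row in ratings])
            liked = np.array([row[j] == "1" for row in ratings])
            # Rated movies contribute their observed rating; unrated movies contribute
            # the current model probability P(R_j = 1 | Y = i), as in the update above.
            numer = rho[seen & liked].sum(axis=0) + rho[~seen].sum(axis=0) * cond[:, j]
            new_cond[:, j] = numer / rho.sum(axis=0)
        return new_prior, new_cond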

(e) Implementation

Download the files hw3_probType_init_su20.txt and hw3_probRatingGivenType_init_su20.txt, and use them to initialize the probabilities P(Y = i) and P(R_j = 1 | Y = i) for a model with k = 4 types of movie-goers. Run 128 iterations of the EM algorithm, computing the (normalized) log-likelihood at each iteration. Your log-likelihood should increase (i.e., become less negative) at each iteration. Fill in the following table, using the already provided entries to check your work:

    iteration    log-likelihood L

Note: Choosing k = 4 is somewhat arbitrary, but it is based on assuming that there are a small number of fundamentally different movie-goer types. Moreover, the initial values for P(R_j = 1 | Y = i) in hw3_probRatingGivenType_init_su20.txt have been chosen randomly. We provide these to you so that this HW unfolds in a certain manner and you'll get the same results we got. If you are interested in exploring further after completing this assignment, feel free to try out other initializations and other values of k.
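A minimal driver loop, reusing the helpers sketched in parts (b) through (d), might look like the following; treating the two initialization files as plain whitespace-separated tables of probabilities is an assumption about their layout:

    import numpy as np

    prior = np.loadtxt("hw3_probType_init_su20.txt")             # P(Y = i), length k = 4
    cond = np.loadtxt("hw3_probRatingGivenType_init_su20.txt")   # P(R_j = 1 | Y = i)
    if cond.shape[0] != prior.size:
        cond = cond.T                                            # ensure cond has shape (k, 60)

    with open("hw3_movieRatings_su20.txt") as f:
        ratings = [line.split() for line in f]

    print(0, np.mean([np.log(user_likelihood(row, prior, cond)) for row in ratings]))
    for iteration in range(1, 129):
        # E-step: posterior over movie-goer types for every user.
        rho = np.array([posterior_over_types(row, prior, cond) for row in ratings])
        # M-step: re-estimate the CPTs.
        prior, cond = m_step(ratings, rho, cond)
        # Normalized log-likelihood under the updated parameters.
        L = np.mean([np.log(user_likelihood(row, prior, cond)) for row in ratings])
        print(iteration, L)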

(f) User categorization

Using the formula from part (c) along with the parameters you estimated in part (e), evaluate the probability distribution over movie-goer types (i.e., over Y) given the following ratings:

    "Tron"                                               recommended
    "Frozen"                                             recommended
    "Star Trek Beyond"                                   NOT recommended
    "Jumanji: Welcome to the Jungle"                     NOT recommended
    "It"                                                 NOT recommended
    "Batman v Superman: Dawn of Justice"                 NOT recommended
    "The Dark Knight"                                    NOT recommended
    "The Lord of the Rings: The Fellowship of the Ring"  recommended
    "Black Panther"                                      recommended
    "The Avengers"                                       recommended
    "Avengers: Infinity War"                             recommended
    all others                                           not seen

Include either a table or a plot of P(Y = i | those ratings) for i ∈ {1, ..., k}.

What value of i ∈ {1, ..., k} maximizes the value of P(Y = i | those ratings)? (In other words, which of the k types of movie-goers does this user most closely match?)
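One way to set up this query, reusing titles, prior, cond, and posterior_over_types from the earlier sketches; the dictionary below simply transcribes the ratings listed above, and the title strings are assumed to match hw3_movieTitles_su20.txt exactly:

    import numpy as np

    # Build a single "user" row of '1' / '0' / '?' tokens from the ratings listed above.
    given = {
        "Tron": "1", "Frozen": "1", "Star Trek Beyond": "0",
        "Jumanji: Welcome to the Jungle": "0", "It": "0",
        "Batman v Superman: Dawn of Justice": "0", "The Dark Knight": "0",
        "The Lord of the Rings: The Fellowship of the Ring": "1",
        "Black Panther": "1", "The Avengers": "1", "Avengers: Infinity War": "1",
    }
    new_user = [given.get(title, "?") for title in titles]      # all other movies: not seen

    posterior = posterior_over_types(new_user, prior, cond)     # P(Y = i | those ratings)
    for i, prob in enumerate(posterior, start=1):
        print(f"type {i}: {prob:.4f}")
    print("most likely type:", int(np.argmax(posterior)) + 1)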

(g) User movie recommendations

Next, compute this user’s expected ratings on the movies they haven’t yet seen. You can do this by

applying the following formula for each movie ` that they haven’t seen:

P (R` = 1 | their ratings) = X

k

i=1

P (Y =i | their ratings) P(R` = 1|Y =i)


What are the top ten recommended movies? (Print out the first ten movies in the list of these (unseen) movies sorted by their expected ratings.) Does this list seem different from the list in part (a)? Hopefully it does (although our data set is obviously far smaller and more incomplete than the data sets at actual movie streaming companies, not to mention that this particular data set was randomly generated).
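A short sketch of this recommendation step, reusing posterior, cond, titles, and new_user from the previous sketch:

    # Expected rating for each unseen movie: sum_i P(Y = i | ratings) * P(R_l = 1 | Y = i).
    expected = posterior @ cond                        # shape (60,)
    unseen = [(expected[j], titles[j]) for j in range(len(titles)) if new_user[j] == "?"]
    for score, title in sorted(unseen, reverse=True)[:10]:      # top ten recommendations
        print(f"{score:.3f}  {title}")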

Note: If you are interested in exploring further after completing this assignment, feel free to try entering your own ratings and re-doing parts (f) and (g).

(h) Source code

Include your source code for this problem as part of your Gradescope submission. As usual, you may program in the language of your choice.


