
520.666/600.666 Information Extraction from Speech and Text

Programming Assignment #1

Due March 7, 2024.

You will model letters of English text using hidden Markov models. Some ordinary text has been selected and, to keep matters simple, all numerals and punctuation have been purged, upper-case letters have been lower-cased, and inter-word spacing, new lines, and paragraph breaks have all been normalized to single spaces. The alphabet of the resulting text is therefore the 26 lower-case English letters and the white-space, and is formally denoted as Y = {a, b, c, . . . , z, #}, with # denoting the white-space character.

The text is 35,000 characters long, and has been divided into a 30,000-character training set, named A, and a 5,000-character test set, named B.
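
If the data were distributed as raw text, a preprocessing step along the following lines would produce the normalized character stream just described. This is a minimal sketch under stated assumptions: the file name text.txt is a placeholder, and the course data may already arrive cleaned and split.

```python
import re

ALPHABET = "abcdefghijklmnopqrstuvwxyz#"

def normalize(raw: str) -> str:
    """Lower-case, purge numerals/punctuation, collapse whitespace to '#'."""
    raw = raw.lower()
    raw = re.sub(r"[^a-z\s]", "", raw)       # drop numerals and punctuation
    return re.sub(r"\s+", "#", raw.strip())  # space/newline/paragraph -> '#'

with open("text.txt") as f:                  # placeholder file name
    text = normalize(f.read())

train, test = text[:30000], text[30000:35000]  # training set A, test set B
```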

1. Model this text with a fully connected 2-state HMM, with states 1 and 2. Let t1 denote the transition 1 → 2, and t2 denote the self-loop 1 → 1. Similarly, let t3 denote the transition 2 → 1, and t4 denote the self-loop 2 → 2, so that the transition probability matrix may be written as

    p = [ p(1|1)  p(2|1) ] = [ p(t2)  p(t1) ]
        [ p(1|2)  p(2|2) ]   [ p(t3)  p(t4) ]

Let the initial state of the Markov chain be either 1 or 2 with equal probability. Let the emission probabilities be associated with the states, i.e. let

q(y|t2) ≡ q(y|t3) ≡ q(y|1) and q(y|t1) ≡ q(y|t4) ≡ q(y|2).

Use the Baum-Welch algorithm and the training text A to estimate the transition probabilities p(tj), j = 1, 2, 3, 4, and the emission probabilities q(y|s), y ∈ Y and s ∈ {1, 2}; a sketch of one possible re-estimation loop follows after part (d) below.

(a) Initialize the transition probabilities to be slightly different from uniform, as

p(t1) = 0.51 = p(t3) and p(t2) = 0.49 = p(t4).

Initialize the emission probabilities to also be slightly different from uniform, as

q(a|1) = q(b|1) = . . . = q(m|1) = 0.0370 = q(n|2) = q(o|2) = . . . = q(z|2),
q(a|2) = q(b|2) = . . . = q(m|2) = 0.0371 = q(n|1) = q(o|1) = . . . = q(z|1),
q(#|1) = 0.0367 = q(#|2).

What would happen if all probabilities were set to be uniform, i.e. p(tj) = 1/2 and q(y|s) = 1/27?

(b) Plot the average log-probability of the training and test data after k iterations,

(1/|A|) log Pk(A) and (1/|B|) log Pk(B),

as a function of the number of iterations, for k = 1, 2, . . . , 600.

(c) Plot the emission probabilities of a few particular letters for each state, e.g.

qk(a|1) versus qk(a|2) and qk(n|1) versus qk(n|2),

as a function of the number of iterations, for k = 1, 2, . . . , 600.

(d) Study the emission probability distributions q600(·|1) and q600(·|2) to see where they differ the most, as well as how the transition probabilities differ from their initial values. Try to explain what the machine has learned about English text.
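
For concreteness, here is one way the re-estimation loop referred to above might be implemented. This is a sketch, not the course's reference solution: it assumes the state-emission formulation given above, keeps the initial distribution fixed at (0.5, 0.5), and applies per-step scaling (anticipating the Caution near the end of this handout). The strings `train` and `test` are A and B from the preprocessing sketch.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz#"
IDX = {c: i for i, c in enumerate(ALPHABET)}

def forward(x, P, Q, pi):
    """Scaled forward pass: returns normalized alphas and scale factors.
    np.log(scale).sum() recovers log P(x) without underflow."""
    T, S = len(x), P.shape[0]
    alpha, scale = np.empty((T, S)), np.empty(T)
    alpha[0] = pi * Q[:, x[0]]
    scale[0] = alpha[0].sum(); alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ P) * Q[:, x[t]]
        scale[t] = alpha[t].sum(); alpha[t] /= scale[t]
    return alpha, scale

def backward(x, P, Q, scale):
    """Backward pass, scaled consistently with forward()."""
    T, S = len(x), P.shape[0]
    beta = np.empty((T, S)); beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (P @ (Q[:, x[t + 1]] * beta[t + 1])) / scale[t + 1]
    return beta

def baum_welch_step(x, P, Q, pi):
    """One EM iteration: returns re-estimated (P, Q) and the average
    log-probability (1/T) log P(x) under the incoming parameters."""
    T = len(x)
    alpha, scale = forward(x, P, Q, pi)
    beta = backward(x, P, Q, scale)
    gamma = alpha * beta                   # state posteriors; rows sum to 1
    xi = np.zeros_like(P)                  # expected transition counts
    for t in range(T - 1):
        num = alpha[t][:, None] * P * (Q[:, x[t + 1]] * beta[t + 1])[None, :]
        xi += num / num.sum()
    P_new = xi / xi.sum(axis=1, keepdims=True)
    Q_new = np.zeros_like(Q)               # expected emission counts
    for y in range(Q.shape[1]):
        Q_new[:, y] = gamma[x == y].sum(axis=0)
    Q_new /= Q_new.sum(axis=1, keepdims=True)
    return P_new, Q_new, np.log(scale).sum() / T

# Exercise 1(a) initialization (values from the handout; states are rows)
pi = np.array([0.5, 0.5])
P = np.array([[0.49, 0.51],   # p(t2), p(t1)
              [0.51, 0.49]])  # p(t3), p(t4)
Q = np.empty((2, 27))
Q[0, :13], Q[0, 13:26], Q[0, 26] = 0.0370, 0.0371, 0.0367  # q(.|1)
Q[1, :13], Q[1, 13:26], Q[1, 26] = 0.0371, 0.0370, 0.0367  # q(.|2)

x_A = np.array([IDX[c] for c in train])   # training set A
x_B = np.array([IDX[c] for c in test])    # test set B
curve_A, curve_B = [], []
for k in range(600):
    # both curves are evaluated under the parameters entering iteration k
    _, sc = forward(x_B, P, Q, pi)
    curve_B.append(np.log(sc).sum() / len(x_B))
    P, Q, lp = baum_welch_step(x_A, P, Q, pi)
    curve_A.append(lp)
```

Here curve_A and curve_B are the quantities requested in part (b); tracking, say, Q[:, IDX['a']] and Q[:, IDX['n']] across iterations yields the plots for part (c). Vectorizing the loop over t will speed the E-step up considerably.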

2. Increasing Model Complexity: Repeat Exercises 1(a) through 1(d) with a fully connected 4-state HMM. Modify the initialization in 1(a) to account for 4 states; one possible scheme is sketched below.
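
The handout does not prescribe a specific 4-state initialization, so the scheme below is an assumption: uniform rows nudged by a small random, symmetry-breaking perturbation, renormalized so every row sums to one.

```python
import numpy as np

S, V = 4, 27                    # 4 states, 27 output symbols
rng = np.random.default_rng(0)  # fixed seed for reproducibility

# Uniform rows, perturbed multiplicatively by at most 1% and renormalized.
P4 = np.full((S, S), 1 / S) * (1 + 0.02 * (rng.random((S, S)) - 0.5))
P4 /= P4.sum(axis=1, keepdims=True)
Q4 = np.full((S, V), 1 / V) * (1 + 0.02 * (rng.random((S, V)) - 0.5))
Q4 /= Q4.sum(axis=1, keepdims=True)
pi4 = np.full(S, 1 / S)         # equiprobable initial state
```

The forward/backward/baum_welch_step functions from the previous sketch apply unchanged, since they are written for an arbitrary number of states.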

3. Alternate Initialization of Output Probabilities: HMM estimation is sometimes sensitive to the initialization of the model parameters. You will now investigate an alternative to the initialization of Exercise 1(a).

(a) Compute the relative frequency q(y) of the letters in Y from the entire text A.

(b) Generate a vector of random numbers r(y), compute the average r̄ = (1/|Y|) Σ_{y∈Y} r(y), and use it to create a zero-mean perturbation vector δ(y) = r(y) − r̄.

(c) Choose a small λ > 0, though not too small, such that both

q(y|1) = q(y) − λδ(y) > 0 and q(y|2) = q(y) + λδ(y) > 0 ∀ y ∈ Y.

Note: q(·|1) and q(·|2) are bona fide probability assignments on Y. (Why?)

Use the two q(y|s) thus generated, along with the p(tj) from Exercise 1(a), to initialize the Baum-Welch iteration; a sketch of this construction follows below. Compare the resulting plots of average log-probability versus k with those of 1(b), as well as the final values of the average log-probabilities.
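
A sketch of steps (a) through (c), assuming `train` holds the training text A as in the earlier sketches; λ = 0.01 is an arbitrary starting value, to be adjusted until the positivity condition holds.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz#"
rng = np.random.default_rng(0)

# (a) relative frequencies q(y) of the letters over the entire text A
counts = np.array([train.count(c) for c in ALPHABET], dtype=float)
q = counts / counts.sum()

# (b) random vector r(y) and zero-mean perturbation delta(y) = r(y) - r_bar
r = rng.random(len(ALPHABET))
delta = r - r.mean()

# (c) small lambda > 0; each row below still sums to 1 because delta sums
# to zero, and lambda is kept small enough that every entry stays positive
lam = 0.01
Q = np.vstack([q - lam * delta,   # q(y|1)
               q + lam * delta])  # q(y|2)
assert (Q > 0).all(), "shrink lam until both rows are strictly positive"
```

These two rows, together with the p(tj) from Exercise 1(a), drop directly into the training loop of the first sketch.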

Caution: Make sure you mitigate numerical underflow problems when computing the forward and backward probabilities. Use the normalization described in §2.8 if needed.

Submission: Turn in all your plots and discussion, and your source code, via Gradescope; make sure your code is well documented. Points may be deducted for incomprehensible code. Your code may be rerun on different training and test data, or with a different initialization, to check its correctness; make sure it runs on a Linux machine with standard compilers/libraries/environments.
