
Final Exam

Instructor: Amrit Singh Bedi

Instructions

This exam is worth a total of 100 points. Please answer all questions clearly and concisely. Show all your work and justify your answers.

• For Questions 1 and 2, submit the PDF version of your solution via Webcourses. You may either write it in LaTeX or do it on paper and submit a scanned version. If you work on paper, you are responsible for ensuring the scan is readable and properly done; there will be zero marks if it is not clearly written or scanned.

• The total time to complete the exam is 24 hours, and it is due at 4:00 pm EST, Friday (April 25th, 2025). This is a take-home exam. Please do not use AI such as ChatGPT to complete the exam. You will receive zero marks if AI use is found (believe me, we would know if you use it).

Question 1 (50 marks)

Context: In supervised learning, understanding the bias-variance tradeoff is crucial for developing models that generalize well to unseen data.

Problem 1 (10 marks)

Define the terms bias, variance, and irreducible error in the context of supervised learning. Explain how each contributes to the total expected error of a model.

Problem 2 (20 marks)

Derive the bias-variance decomposition of the expected squared error for a regression problem. That is, show that

    E_{D,ε}[(y − f̂(x))²] = Bias[f̂(x)]² + Var[f̂(x)] + σ²,

where f̂(x) is the prediction of the model trained on dataset D, y = f(x) + ε, and σ² is the variance of the noise ε.

Hint: You can start by taking y = f(x) + ε, where E[ε] = 0 and Var[ε] = σ². Let f̂(x) be a learned function from the training set D. Then proceed with the derivation.
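
Although the question asks for an algebraic derivation, a quick Monte Carlo check can confirm the identity numerically. The sketch below is an editorial illustration, not part of the exam: it assumes a toy ground truth f(x) = sin(x), Gaussian noise, and a deliberately biased linear fit, then compares the left- and right-hand sides of the decomposition at a fixed test point.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy setup (illustrative assumptions): f(x) = sin(x), noise std sigma.
    f = np.sin
    sigma = 0.3
    x0 = 1.0                       # fixed test point x
    n_train, n_trials = 30, 5000   # size of D and number of resampled datasets

    preds = np.empty(n_trials)
    for t in range(n_trials):
        # Draw a fresh training set D and fit a (biased) degree-1 polynomial.
        X = rng.uniform(0, 2 * np.pi, n_train)
        y = f(X) + rng.normal(0, sigma, n_train)
        preds[t] = np.polyval(np.polyfit(X, y, deg=1), x0)  # f_hat(x0) under D

    # Monte Carlo estimates of each term at x0.
    bias_sq = (preds.mean() - f(x0)) ** 2
    var = preds.var()
    lhs = ((f(x0) + rng.normal(0, sigma, n_trials) - preds) ** 2).mean()

    print(f"E[(y - f_hat(x))^2]    ~ {lhs:.4f}")
    print(f"Bias^2 + Var + sigma^2 ~ {bias_sq + var + sigma**2:.4f}")

The two printed values agree up to Monte Carlo error, which is exactly what the decomposition asserts.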

Problem 3 (10 marks)

Consider two models trained on the same dataset:

• Model A: A simple linear regression model.
• Model B: A 10th-degree polynomial regression model.

Discuss, in terms of bias and variance, the expected performance of each model on training data and unseen test data. Which model is more likely to overfit, and why?
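
The contrast can also be seen empirically. The sketch below is an editorial illustration under assumed data (y = sin(x) plus Gaussian noise, with a deliberately small training set, none of which comes from the exam itself); it fits both models and reports train and test mean squared error.

    import numpy as np

    rng = np.random.default_rng(1)

    def make_data(n):
        # Assumed toy data for illustration: y = sin(x) + Gaussian noise.
        x = rng.uniform(0, 2 * np.pi, n)
        return x, np.sin(x) + rng.normal(0, 0.3, n)

    x_tr, y_tr = make_data(15)     # small training set
    x_te, y_te = make_data(500)    # large held-out test set

    for deg, name in [(1, "Model A (linear)"), (10, "Model B (degree 10)")]:
        coef = np.polyfit(x_tr, y_tr, deg=deg)
        mse_tr = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)
        mse_te = np.mean((np.polyval(coef, x_te) - y_te) ** 2)
        print(f"{name}: train MSE {mse_tr:.3f}, test MSE {mse_te:.3f}")

Typically the flexible model achieves a much lower training error than it does on the held-out test set, while the gap for the simple model is far smaller.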

Problem 4 (10 marks)

Explain how increasing the size of the training dataset affects the bias and variance of a model. Provide reasoning for your explanation.
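
One way to build intuition is to watch how the spread of a fitted prediction at a single test point changes as the training set grows. The sketch below is an editorial aid under the same assumed toy data as above (sin ground truth, Gaussian noise, a fixed degree-3 model class; all assumptions, not exam content).

    import numpy as np

    rng = np.random.default_rng(2)
    sigma, x0 = 0.3, 1.0

    # For each training-set size n, refit a degree-3 polynomial on many
    # resampled datasets and measure the variance of f_hat(x0) across fits.
    for n in [20, 80, 320, 1280]:
        preds = []
        for _ in range(2000):
            x = rng.uniform(0, 2 * np.pi, n)
            y = np.sin(x) + rng.normal(0, sigma, n)
            preds.append(np.polyval(np.polyfit(x, y, deg=3), x0))
        print(f"n={n:5d}: Var[f_hat(x0)] ~ {np.var(preds):.5f}")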

Question 2: Using Transformer Attention (50 marks)

Context. Consider a simplified Transformer with a vocabulary of six tokens:

• I (ID 0): embedding [1.0, 0.0]
• like (ID 1): embedding [0.0, 1.0]
• to (ID 2): embedding [1.0, 1.0]
• eat (ID 3): embedding [0.5, 0.5]
• apples (ID 4): embedding [0.6, 0.4]
• bananas (ID 5): embedding [0.4, 0.6]

All three projection matrices are the 2 × 2 identity:

    W_Q = W_K = W_V = I₂.

When predicting the next token, the model uses masked self-attention: the query comes from the last position, while keys and values come from all previous tokens. (Note: show step-by-step calculations for all questions below.)

(a) (10 marks) For the input sequence [I, like, to] (IDs [0, 1, 2]), compute the query, key, and value vectors for each token.
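
For readers checking their step-by-step arithmetic, here is a minimal NumPy sketch of part (a); it is an editorial aid, not part of the exam. Because all three projections are the identity, each query, key, and value vector equals the corresponding token embedding.

    import numpy as np

    # Embeddings of the input sequence [I, like, to] (IDs 0, 1, 2).
    X = np.array([[1.0, 0.0],   # I
                  [0.0, 1.0],   # like
                  [1.0, 1.0]])  # to

    W_Q = W_K = W_V = np.eye(2)           # all projections are the 2x2 identity
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V   # hence Q = K = V = X
    print(Q, K, V, sep="\n\n")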

(b) (15 marks) Let q be the query of the last token and K, V the keys and values of all three tokens.

• Compute the row vector of raw attention scores qK⊤, where q is the query of the last token and K is the 3 × 2 matrix of keys.
• Scale by √d_k (with d_k = 2) and apply softmax to obtain the attention weights.
• Compute the context vector as the weighted sum of the values.
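
The three bullets above map onto a few lines of NumPy. The sketch below is again an editorial aid using the standard softmax definition, with the keys and values carried over from part (a).

    import numpy as np

    # Keys/values of [I, like, to]; with identity projections these equal
    # the embeddings. q is the query of the last token, "to".
    K = V = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    q = np.array([1.0, 1.0])

    scores = q @ K.T                  # raw attention scores qK^T
    scaled = scores / np.sqrt(2)      # scale by sqrt(d_k) with d_k = 2
    weights = np.exp(scaled) / np.exp(scaled).sum()   # softmax
    c = weights @ V                   # context vector: weighted sum of values
    print(scores, weights, c, sep="\n")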

(c) (15 marks) Given the context vector c ∈ ℝ² from part (b), compute the unnormalized score for each vocabulary embedding via c · embed(w), i.e. the dot product.

• Apply softmax over these six scores to get a probability distribution.
• Which token has the highest probability? [Note: Because the six embeddings are synthetic and not trained on real text, the token that receives the highest probability may look ungrammatical in normal English; this is an artifact of the toy setup.]
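
The same editorial sketch extends to part (c): score every vocabulary embedding against the context vector and normalize with a softmax. The context vector is recomputed from part (b) so the snippet stands alone.

    import numpy as np

    # Embedding table (rows = IDs 0..5: I, like, to, eat, apples, bananas).
    E = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
                  [0.5, 0.5], [0.6, 0.4], [0.4, 0.6]])

    # Recompute the context vector from part (b).
    K = V = E[[0, 1, 2]]
    w = np.exp(E[2] @ K.T / np.sqrt(2))
    c = (w / w.sum()) @ V

    logits = E @ c                                  # c . embed(w) per token
    probs = np.exp(logits) / np.exp(logits).sum()   # softmax over six scores
    print(np.round(probs, 4), int(probs.argmax()))  # most probable token ID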

(d) (10 marks) Explain why the model selects the token you found in (c). In your answer, discuss:

• How the attention weights led to that choice.
• Why keys/values may include the current token but never future tokens.
