COMP 551 - Midterm
MCGILL UNIVERSITY
WINTER SEMESTER, 2022
Campus: Montreal - Downtown
COMPUTER SCIENCE
Applied Machine Learning Course
NOTE: You need to submit your write-up by 07:00 am on March 22nd in MyCourses under the assignment
tab. Please note that your submitted answers must be typed and machine readable and may be
checked with text-matching software. You can assume the exam is open book, and it is acceptable to
consult relevant materials, but you may not copy and paste the exact text or images from the slides or other
sources. All answers need to be in your own wording and reflect your own understanding.
By writing my name below, I confirm that all my exam work will be done entirely by myself, with no help
from others. I will not provide any information about the exam’s contents and/or my solutions to other
people until after March 30th 2022.
McGill ID No: Student name:
SECTION A: ML Fundamentals
This part checks your understanding of the basic concepts and algorithms in Machine Learning. Please use text,
formulas, images (created by yourself), or any combination of them to answer the questions.
1. Explain each of the following models in a paragraph (use about 200 words or less). Discuss
when they work (e.g. what kind of data this method is useful for, model complexity, if they
can do classification, regression or both, etc.), how they work (explain the parameters and
discuss what they are learning and how), why they work (discuss their inductive bias) and
what needs to be considered specific to them to make them work (do you need to do data
normalization, regularization, etc.).
(a) Nearest Neighbours
(b) Decision Trees
(c) Naive Bayes
(d) Linear Regression
(e) Logistic Regression
(f) Softmax Regression
(g) Multilayer Perceptron (MLP)
(h) Convolutional Neural Networks (CNN)
2. Compare the following models with one another. Use about 50 words or fewer per
comparison. Focus on their key differences and/or similarities in terms of the data/tasks they
can be applied to, model complexity and efficiency, loss function, etc.
(a) Linear Regression vs. Logistic Regression
(b) Logistic Regression vs. Softmax Regression
(c) Logistic Regression vs. Naive Bayes Classifier
(d) Logistic Regression vs. Multilayer Perceptron (MLP)
(e) Multilayer Perceptron (MLP) vs. Convolutional Neural Networks (CNN)
3. Explain each of the following concepts discussed in the course in a few lines (use about
50 words or fewer).
(a) Over-fitting and Under-fitting
(b) Bias and Variance trade-off
(c) Regularization
(d) Generalization
(e) Hyper-parameter
4. Explain the gradient descent approach in a short paragraph (use fewer than 100 words) and
discuss what the Adam (Adaptive Moment Estimation) algorithm does to make it work
better.
5. Explain Maximum Likelihood Estimation (MLE) in the context of fitting a model to the
given data and how it differs from the Bayesian approach. Discuss the MAP estimate and
how it relates the two, and explain whether it results in lower or higher variance (use about
200 words or fewer).
SECTION B: ML Practitioner’s Knowledge
This part checks your depth of understanding of the different concepts with more specific questions.
6. When using the gradient descent algorithm, are we guaranteed to find a local minimum? Please
explain.
7. Can we use gradient descent to solve a linear regression problem? If so, could it result
in multiple locally optimal solutions?
8. How will the bias and variance of a trained model change with each of the following? e.g.
answer less bias but more variance; or less bias but variance stays the same, etc.
(a) increasing the number of data points the model learns from
(b) increasing k in a k-nearest neighbour model
(c) pruning a decision tree
(d) increasing the regularization parameter (λ) in Ridge regression
(e) adding dropout to an MLP
(f) reducing the batch size when training an MLP with stochastic gradient descent
9. Is regularization more important for a model that has higher or lower expressive power?
Please discuss.
10. With back-propagation, can we learn the globally optimal solution for a fully connected
feed-forward network with one hidden layer (a 2-layer MLP)? Please explain.
SECTION C: ML Innerworkings Knowledge
This part checks your depth of understanding of the different algorithms with more specific questions.
11. What is the prediction for x = [1 0 1], when using a Gaussian Naive Bayes model that is
trained using maximum likelihood on the following data:
12. What is the prediction for input x = [1 0 1], when using a Softmax regression model with
the following weights:
$$w = \begin{bmatrix} 1 & 2 & 3 \\ 1 & 0 & 0 \\ 0 & 2 & 1 \end{bmatrix}$$
13. What is the maximum likelihood estimate for the parameter w when using the following cost
function and training data? please write the derivations.
$$J(w) = \frac{1}{2} \sum_n \left(y^{(n)} - w x^{(n)}\right)^2$$
D = {(1, 1), (1, 3), (2, 1), (2, 5), (2, 6), (3, 1), (3, 8), (3, 4), (5, 10), (6, 10)}
14. In class, we discussed that using a linear regression model ($\hat{y} = w^T x + b$) for binary
classification, where you set the targets to {0, 1}, is not a good idea, since the L2 loss
($L_2(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2$) used by the linear regression model may penalize confident correct
predictions. Can we fix this by using a modified hinge loss defined as:
$$L(y, \hat{y}) = \begin{cases} \max(0, y) & \hat{y} = 0 \\ 1 - \min(1, y) & \hat{y} = 1 \end{cases}$$
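The concern stated in question 14, that the L2 loss can penalize confident correct predictions, can be illustrated numerically. The following is a minimal sketch, not part of the exam: the function names and the 0.5 decision threshold are assumptions made only for this illustration.

```python
def l2_loss(y, y_hat):
    """Squared-error loss used by linear regression: 0.5 * (y - y_hat)^2."""
    return 0.5 * (y - y_hat) ** 2


def predicted_label(y_hat, threshold=0.5):
    """Assumed decision rule: predict class 1 when the output exceeds 0.5."""
    return int(y_hat > threshold)


if __name__ == "__main__":
    y = 1  # true label
    for y_hat in (0.4, 1.0, 3.0):
        # Print the raw output, the thresholded label, and the L2 loss.
        print(y_hat, predicted_label(y_hat), l2_loss(y, y_hat))
```

With target y = 1, the output 3.0 is classified correctly under the assumed threshold but incurs a loss of 2.0, which is larger than the loss of about 0.18 for the misclassified output 0.4.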
15. If we change the logistic regression model to use $\hat{\sigma}(z) = \frac{e^{-z}}{1 + e^{-z}}$ instead of the original sigmoid
function, $\sigma(z) = \frac{1}{1 + e^{-z}}$, and train it using the same binary cross-entropy loss, what
would happen to the parameters and predictions of the model? How will they change? Please
explain and justify your answer.
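16. What happens when you increase the momentum in the Adam algorithm (i.e. increasing β1)?
What about β2? Recall that in Adam we have:
$$M^{\{t\}} \leftarrow \beta_1 M^{\{t-1\}} + (1 - \beta_1) \nabla J(w^{\{t-1\}})$$
$$S^{\{t\}} \leftarrow \beta_2 S^{\{t-1\}} + (1 - \beta_2) \nabla J(w^{\{t-1\}})^2$$
$$w^{\{t\}} \leftarrow w^{\{t-1\}} - \frac{\alpha}{\sqrt{\hat{S}^{\{t\}}} + \epsilon} \hat{M}^{\{t\}}$$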
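The Adam update recalled in question 16 can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function names, the default hyper-parameter values, and the toy quadratic objective are assumptions, and the hatted quantities are taken to be the usual bias-corrected moments (dividing by 1 − β^t), which the exam does not spell out.

```python
import numpy as np


def adam_step(w, grad, m, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients (M)
    s = beta2 * s + (1 - beta2) * grad ** 2   # running mean of squared gradients (S)
    m_hat = m / (1 - beta1 ** t)              # assumed bias correction for M
    s_hat = s / (1 - beta2 ** t)              # assumed bias correction for S
    w = w - alpha / (np.sqrt(s_hat) + eps) * m_hat
    return w, m, s


def minimize_quadratic(steps=2000, alpha=0.01):
    """Toy example: minimize J(w) = 0.5 * ||w||^2, whose gradient is w."""
    w = np.array([1.0, -2.0])
    m = np.zeros_like(w)
    s = np.zeros_like(w)
    for t in range(1, steps + 1):
        w, m, s = adam_step(w, w, m, s, t, alpha=alpha)
    return w


if __name__ == "__main__":
    print(minimize_quadratic())
```

Changing `beta1` or `beta2` in this sketch and re-running the toy problem is one way to observe their effect on the trajectory empirically.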
17. Consider a two-layer multilayer perceptron model given by $u = \sigma(W\,\mathrm{ReLU}(Vx))$, where
$\sigma(x) = (1 + e^{-x})^{-1}$ and $\mathrm{ReLU}(x) = \max(0, x)$, with the loss function set to $L = |\hat{y} - y|$.
Consider learning the parameters of this model with stochastic gradient descent (SGD) with a
learning rate of 0.5, and assume $W^{\{t\}} = [-1 \;\; 1]$ and $V^{\{t\}} = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}$. Now consider a
training example with $x = [2 \;\; -2]$ and $y = 0$; calculate $V^{\{t+1\}}$ and $W^{\{t+1\}}$.
18. Consider the following input and convolution filter. When using zero padding of 1
and a stride of 2, what is the output y[1, 1], assuming indexing starts at zero?
SECTION D: ML Application Scenarios
This part checks your understanding of how to apply the techniques we discussed in the course. Please use text,
formulas, images (created by yourself), or any combination of them to answer the questions.
19. Suppose you are applying a linear regression model to estimate corn yield. The data contains
many irrelevant features or measurements, as it is not known what actually impacts crop
productivity. Will you use Lasso or Ridge regression? Please explain and justify your answer.
20. Suppose you have a spam detection model (spam = 1, not spam = 0) to filter the messages
sent to you. You can control a parameter in the model to increase or decrease the recall.
If you set the parameter to have higher recall, do you expect to see more or less spam in your
inbox? Do you expect to see more or fewer actual emails being forwarded to your spam folder?
Please explain.
21. Assume you are working with a company that wants to design a fake news detection classifier
for news articles. The company is providing you 1000 example articles they have manually
labeled as fake or not fake and is asking you to develop a model for them using deep learning
(e.g. MLP). Assume that they have a good feature extractor that converts a given text to a
vector which serves as the input for your model. Please discuss and justify your answers to
the following questions:
(a) How will you split this dataset to design/train/test your model?
(b) What would you do if, after training the first model you try, your training loss is too
high? Could asking for more data to be labeled help?
(c) What would you do if your training loss is low but your loss on the validation set varies too
much between different runs?
(d) How confident would you be in your trained model's ability to detect fake news when
deployed, assuming you are able to achieve low validation/test loss? Please discuss.
22. Assume you are working in a hospital to make a system that helps doctors with cancer
diagnosis. You have been given example data of 1000 patients with their different test
measurements and diagnosis outcome (no cancer, brain cancer, lung cancer, etc.). Please
discuss and justify your answers to the following questions:
(a) What classifier will you use for this task? Explain and justify your choice.
(b) Some tests are more expensive to run (e.g. performing an MRI is more costly
for the hospital than a blood test), which means some features are more costly
to obtain for a new patient. How would you modify your model so that it can work with
fewer features and ask for more only when needed?
ANSWER SHEET
By writing my name below, I confirm that all my exam work will be done entirely by myself,
with no help from others. I will not provide any information about the exam’s contents
and/or my solutions to other people until after March 30th 2022.
McGill ID No: Student name:
This exam is marked out of 180 points and contributes to 30% of your final grade. The grade
breakdown is provided below. For multi-part questions, points are equally distributed.