COMP9313:
Big Data Management
Sample Exam Questions
Question 1 HDFS
• Explain the difference between NameNode
and DataNode.
• Given a file of 500MB, let the block size be
150MB and the replication factor be 3. How much
space do we need to store this file in HDFS?
Why?
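The storage arithmetic in Question 1 can be checked with a short sketch. It assumes that the final, partially filled HDFS block occupies only the bytes it actually holds, so the raw space is the file size times the replication factor. The function name `hdfs_storage` is hypothetical and used for illustration only.

```python
def hdfs_storage(file_mb, block_mb, replication):
    """Return (number of blocks, raw storage in MB) for a single file.

    Assumption: the last block is not padded to the full block size,
    so raw storage is simply file size times replication factor.
    """
    full_blocks, remainder = divmod(file_mb, block_mb)
    num_blocks = full_blocks + (1 if remainder else 0)
    raw_mb = file_mb * replication
    return num_blocks, raw_mb


blocks, raw_mb = hdfs_storage(500, 150, 3)
print(blocks, raw_mb)  # 4 1500
```

Under this assumption the file splits into blocks of 150MB, 150MB, 150MB, and 50MB, and each block is stored three times.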
Question 2 Spark
• Given a large text file, your task is to find out the top-k most
frequent co-occurring term pairs. The co-occurrence of (w, u)
is defined as: u and w appear in the same line (this also
means that (w, u) and (u, w) are treated equally). Your Spark
program should generate a list of k key-value pairs ranked in
descending order according to the frequencies, where the
keys are the pairs of terms and the values are the co-occurrence
frequencies (Hint: you need to define a function which takes
an array of terms as input and generates all possible pairs).
textFile = sc.textFile(inputFile)
words = textFile.map(lambda x: x.lower().split())
# fill in your code here, and store the result in a pair RDD avgLen
avgLen.collect()
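The hint about a pair-generating function can be sketched in plain Python, without Spark, to make the counting logic concrete. The names `generate_pairs` and `top_k_pairs` are hypothetical. The sketch assumes that a pair is counted at most once per line and that a term is not paired with itself; the question does not settle either point.

```python
from collections import Counter
from itertools import combinations


def generate_pairs(terms):
    """Return every unordered pair of distinct terms from one line.

    Sorting the unique terms makes (w, u) and (u, w) map to the same key.
    Assumption: a pair is counted once per line, however often it repeats.
    """
    unique = sorted(set(terms))
    return list(combinations(unique, 2))


def top_k_pairs(lines, k):
    """Count co-occurring pairs over all lines and return the k most frequent."""
    counts = Counter()
    for line in lines:
        counts.update(generate_pairs(line.lower().split()))
    # Descending frequency; ties are broken by the pair itself.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


sample = ["a b c", "b a", "c b a d"]
print(top_k_pairs(sample, 2))  # [(('a', 'b'), 3), (('a', 'c'), 2)]
```

In a Spark program the same helper could plausibly be passed to `flatMap` on the `words` RDD, with each pair mapped to a count of 1 and summed with `reduceByKey` before ranking. That wiring is not tested here.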
Question 3 Finding Similar Items
Suppose we wish to find similar sets, and we
apply locality-sensitive hashing with k=5 and
l=2.
If two sets had Jaccard similarity 0.6, what is
the probability that they will be identified in
the locality-sensitive hashing as candidates
(i.e. they hash at least once to the same superhash)?
You may assume that there are no
coincidences, where two unequal values hash
to the same hash value.
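A numeric check for Question 3 is sketched below. It assumes the usual reading of the parameters: each superhash combines k minhash values, l independent superhashes are built, and two sets with Jaccard similarity s agree on a single minhash with probability s. The function name `candidate_probability` is hypothetical.

```python
def candidate_probability(s, k, l):
    """Probability that two sets agree on at least one of l superhashes.

    Assumption: one superhash matches with probability s**k, and the
    l superhashes are independent.
    """
    return 1.0 - (1.0 - s ** k) ** l


p = candidate_probability(0.6, 5, 2)
print(round(p, 4))  # 0.1495
```

With s = 0.6, a single superhash matches with probability 0.6^5 = 0.07776, so the pair is missed by both superhashes with probability (1 - 0.07776)^2.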
Question 4 Mining Data Streams
Suppose we are maintaining a count of 1s using
the DGIM method. We represent a bucket by (i, t),
where i is the number of 1s in the bucket and t is
the bucket timestamp (time of the most recent 1).
Consider that the current time is 200, the window size
is 60, and the current list of buckets is: (16, 148),
(8, 162), (8, 177), (4, 183), (2, 192), (1, 197),
(1, 200). Over the next ten clock ticks, 201 through 210,
the stream has the bits 0101010101. What will the sequence
of buckets be at the end of these ten inputs?
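The bucket updates in Question 4 can be traced with a small simulation. This is a sketch under stated assumptions: a bucket is dropped once its timestamp is at most the current time minus the window size, and whenever three buckets of the same size exist, the two oldest are merged into one bucket that keeps the more recent of their two timestamps. The function name `dgim_update` is hypothetical.

```python
def dgim_update(buckets, bit, now, window):
    """Apply one stream bit to a DGIM bucket list.

    buckets is a list of (size, timestamp) tuples ordered oldest first.
    """
    # Drop buckets whose most recent 1 has left the window.
    buckets = [(s, t) for (s, t) in buckets if t > now - window]
    if bit == 1:
        buckets.append((1, now))
        size = 1
        while True:
            idx = [i for i, (s, _) in enumerate(buckets) if s == size]
            if len(idx) < 3:
                break
            i = idx[0]  # the two oldest buckets of this size are adjacent
            buckets[i:i + 2] = [(2 * size, buckets[i + 1][1])]
            size *= 2  # a merge may cascade to the next size
    return buckets


state = [(16, 148), (8, 162), (8, 177), (4, 183), (2, 192), (1, 197), (1, 200)]
for now, ch in zip(range(201, 211), "0101010101"):
    state = dgim_update(state, int(ch), now, window=60)
print(state)
# [(8, 162), (8, 177), (4, 183), (4, 200), (2, 204), (2, 208), (1, 210)]
```

By time 210 the bucket (16, 148) lies outside the 60-tick window under either boundary convention, so the final list does not depend on whether the drop test uses "less than" or "at most".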
Question 5 Recommender Systems
Consider three users u1, u2, and u3, and four movies m1, m2, m3, and m4. The users rated the
movies using a 4-point scale: -1: bad, 1: fair, 2: good, and 3: great. A rating of 0 means that the
user did not rate the movie. The three users’ ratings for the four movies are:
u1 = (3, 0, 0, -1), u2 = (2, -1, 0, 3), u3 = (3, 0, 3, 1)
• Which user's taste is more similar to that of u1 based on
cosine similarity, u2 or u3? Show the detailed calculation
process.
• User u1 has not yet watched movies m2 and m3. Which movie(s) are you going to recommend to
user u1, based on the user-based collaborative filtering approach? Justify your answer.
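The two parts of Question 5 can be checked numerically. The sketch below computes cosine similarity over the raw rating vectors, treating 0 as an ordinary coordinate, and then predicts a rating as a similarity-weighted average over the users who rated the item. That prediction rule is one common choice and is an assumption here, as are the names `cosine` and `predict`.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def predict(target, others, item):
    """Similarity-weighted average rating of `item` over users who rated it.

    Assumption: a rating of 0 means "not rated" and is skipped.
    Returns None if no other user rated the item.
    """
    num = den = 0.0
    for other in others:
        if other[item] != 0:
            sim = cosine(target, other)
            num += sim * other[item]
            den += abs(sim)
    return num / den if den else None


u1 = (3, 0, 0, -1)
u2 = (2, -1, 0, 3)
u3 = (3, 0, 3, 1)

sim_12 = cosine(u1, u2)  # 3 / sqrt(10 * 14)
sim_13 = cosine(u1, u3)  # 8 / sqrt(10 * 19)
print(round(sim_12, 4), round(sim_13, 4))

pred_m2 = predict(u1, [u2, u3], 1)  # only u2 rated m2
pred_m3 = predict(u1, [u2, u3], 2)  # only u3 rated m3
print(pred_m2, pred_m3)
```

Under these assumptions u3 comes out closer to u1 than u2 does, and m3 receives the higher predicted rating because its only rater, u3, gave it a 3, while the only rater of m2 gave it a -1.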