COMP9313 assignment help, Data Management coursework help, Python, C++, and Java programming help - Python programming help


Date: 2020-08-16 10:57

COMP9313: Big Data Management

Sample Exam Questions

Question 1 HDFS

• Explain the difference between NameNode and DataNode.

• Given a file of 500MB, let block size be 150MB, and replication factor=3. How much space do we need to store this file in HDFS? Why?
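The storage arithmetic behind the second HDFS question can be sketched as follows. This is a minimal illustration, not a model answer: the function name `hdfs_storage` is hypothetical, and it assumes that the last block only occupies as much disk space as the data it holds (HDFS does not pad a partial block) and that NameNode metadata overhead is ignored.

```python
import math

def hdfs_storage(file_mb, block_mb, replication):
    """Return (logical blocks, stored block replicas, total MB on disk).

    Assumption: a partially filled final block uses only the space of its
    actual data, so the total is file size times replication factor.
    """
    num_blocks = math.ceil(file_mb / block_mb)   # blocks the file is split into
    replicas = num_blocks * replication          # block copies across DataNodes
    total_mb = file_mb * replication             # raw disk space consumed
    return num_blocks, replicas, total_mb

blocks, replicas, total = hdfs_storage(500, 150, 3)
print(blocks, replicas, total)
```

With the numbers from the question, the file is split into blocks of 150MB, 150MB, 150MB, and 50MB, and each of them is stored three times.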

Question 2 Spark

• Given a large text file, your task is to find out the top-k most frequent co-occurring term pairs. The co-occurrence of (w, u) is defined as: u and w appear in the same line (this also means that (w, u) and (u, w) are treated equally). Your Spark program should generate a list of k key-value pairs ranked in descending order according to the frequencies, where the keys are the pair of terms and the values are the co-occurring frequencies (Hint: you need to define a function which takes an array of terms as input and generate all possible pairs).

textFile = sc.textFile(inputFile)
words = textFile.map(lambda x: x.lower().split())
# fill your code here, and store the result in a pair RDD avgLen
avgLen.collect()
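The pair-generating function mentioned in the hint can be sketched in plain Python, without a Spark cluster. The names `term_pairs` and `top_k_pairs` are hypothetical, and the sketch assumes that a pair is counted at most once per line and that a term does not pair with itself; the question does not settle either point.

```python
from collections import Counter
from itertools import combinations

def term_pairs(terms):
    """Return every unordered pair of distinct terms in one line.

    Sorting the distinct terms first guarantees that (w, u) and (u, w)
    map to the same key.
    """
    unique = sorted(set(terms))
    return list(combinations(unique, 2))

def top_k_pairs(lines, k):
    """Count co-occurring pairs over all lines and return the k most frequent."""
    counts = Counter()
    for line in lines:
        counts.update(term_pairs(line.lower().split()))
    # Descending frequency; ties are broken by the pair itself for determinism.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

print(top_k_pairs(["a b c", "b a", "c a"], 2))
```

In PySpark, one possible way to use the same function is `words.flatMap(term_pairs).map(lambda p: (p, 1)).reduceByKey(lambda a, b: a + b)`, followed by `takeOrdered(k, key=lambda kv: -kv[1])` to obtain the k most frequent pairs.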

Question 3 Finding Similar Items

Suppose we wish to find similar sets, and we apply locality-sensitive hashing with k=5 and l=2. If two sets had Jaccard similarity 0.6, what is the probability that they will be identified in the locality-sensitive hashing as candidates (i.e. they hash at least once to the same superhash)? You may assume that there are no coincidences, where two unequal values hash to the same hash value.
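The candidate probability can be sketched numerically. This assumes the common reading in which k is the number of min-hash values concatenated into one superhash and l is the number of superhashes, so that two sets with similarity s agree on one superhash with probability s^k and on at least one of them with probability 1 - (1 - s^k)^l. The function name `candidate_probability` is hypothetical.

```python
def candidate_probability(s, k, l):
    """Probability that two sets with Jaccard similarity s share a superhash.

    Assumption: each of the l superhashes concatenates k independent
    min-hash values, and there are no accidental hash collisions.
    """
    same_superhash = s ** k                  # all k min-hashes agree
    return 1.0 - (1.0 - same_superhash) ** l # at least one of l superhashes agrees

print(round(candidate_probability(0.6, 5, 2), 4))
```

If the course defines k and l the other way around, swap the two arguments.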

Question 4 Mining Data Streams

Suppose we are maintaining a count of 1s using the DGIM method. We represent a bucket by (i, t), where i is the number of 1s in the bucket and t is the bucket timestamp (time of the most recent 1). Consider that the current time is 200, window size is 60, and the current list of buckets is: (16, 148) (8, 162) (8, 177) (4, 183) (2, 192) (1, 197) (1, 200). At the next ten clocks, 201 through 210, the stream has 0101010101. What will the sequence of buckets be at the end of these ten inputs?
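The bucket maintenance can be traced with a small simulation. This is a sketch under stated assumptions: at most two buckets of each size are allowed, a third one triggers a merge of the two oldest (the merged bucket keeps the more recent timestamp), and a bucket is dropped once its timestamp is less than or equal to the current time minus the window size. The function name `dgim_update` is hypothetical.

```python
def dgim_update(buckets, bit, now, window):
    """Apply one stream bit to a DGIM bucket list.

    buckets: list of (size, timestamp) tuples, oldest first.
    Returns the updated list.
    """
    # Drop buckets whose most recent 1 has left the window (assumed rule: t <= now - window).
    buckets = [(s, t) for (s, t) in buckets if t > now - window]
    if bit != 1:
        return buckets
    buckets.append((1, now))  # a new 1 always starts as a bucket of size 1
    size = 1
    while True:
        idx = [i for i, (s, _) in enumerate(buckets) if s == size]
        if len(idx) < 3:
            break
        first, second = idx[0], idx[1]
        # Merge the two oldest buckets of this size; keep the newer timestamp.
        buckets[first:second + 1] = [(2 * size, buckets[second][1])]
        size *= 2  # the merge may create a third bucket of the next size
    return buckets

state = [(16, 148), (8, 162), (8, 177), (4, 183), (2, 192), (1, 197), (1, 200)]
for offset, ch in enumerate("0101010101", start=201):
    state = dgim_update(state, int(ch), offset, 60)
print(state)
```

Under these assumptions the run ends with (8, 162) (8, 177) (4, 183) (4, 200) (2, 204) (2, 208) (1, 210); the bucket (16, 148) falls out of the window at time 208. A different expiry convention could keep it slightly longer.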

Question 5 Recommender Systems

Consider three users u1, u2, and u3, and four movies m1, m2, m3, and m4. The users rated the movies using a 4-point scale: -1: bad, 1: fair, 2: good, and 3: great. A rating of 0 means that the user did not rate the movie. The three users’ ratings for the four movies are: u1 = (3, 0, 0, -1), u2 = (2, -1, 0, 3), u3 = (3, 0, 3, 1)

• Which user's taste is more similar to that of u1 based on cosine similarity, u2 or u3? Show the detailed calculation process.

• User u1 has not yet watched movies m2 and m3. Which movie(s) are you going to recommend to user u1, based on the user-based collaborative filtering approach? Justify your answer.
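Both parts of the question can be checked with a short sketch. It assumes that unrated movies enter the cosine computation as 0, and it predicts a missing rating as the similarity-weighted average over the users who did rate the movie, which is only one common variant of user-based collaborative filtering. The function names `cosine` and `predict` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two rating vectors (unrated entries count as 0)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def predict(target, others, item):
    """Similarity-weighted average rating of `item` among users who rated it."""
    num = den = 0.0
    for ratings in others:
        if ratings[item] != 0:           # skip users who did not rate the item
            w = cosine(target, ratings)
            num += w * ratings[item]
            den += abs(w)
    return num / den if den else None

u1, u2, u3 = (3, 0, 0, -1), (2, -1, 0, 3), (3, 0, 3, 1)
print(round(cosine(u1, u2), 4), round(cosine(u1, u3), 4))
print(predict(u1, [u2, u3], 1), predict(u1, [u2, u3], 2))  # m2, m3
```

By hand, cos(u1, u2) = 3 / sqrt(10 * 14) ≈ 0.2535 and cos(u1, u3) = 8 / sqrt(10 * 19) ≈ 0.5804. Under the stated assumptions, the predicted rating for m3 is higher than for m2, because only u3 rated m3 (as great) and only u2 rated m2 (as bad).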
