COMPSCI 753

Algorithms for Massive Data

Assignment 4 / Semester 2, 2020

Recommender Systems

Kaiqi Zhao

Submission:

Please submit a single pdf file on CANVAS by 23:59 NZST, Sunday 25 October. The answer

file must contain your student ID and name.

Penalty Dates:

The assignment will not be accepted after the last penalty date unless there are special

circumstances (e.g., sickness with certificate). Penalties will be calculated as follows as a

percentage of the mark for the assignment.

? 23:59 NZST, Sunday 25 October – No penalty

? 23:59 NZST, Monday 26 October – 25% penalty

? 23:59 NZST, Tuesday 27 October – 50% penalty

1 Assignment problem (50 pts)

Recommender systems are widely used in many industrial companies, especially locationbased

service such as Yelp 1

. Yelp provides a very big dataset for research purpose (See

the Kaggle page https://www.kaggle.com/yelp-dataset/yelp-dataset). In this assignment,

we will explore this dataset using the learned recommendation algorithms. To make

the task feasible on most of the laptops and PCs, we have extracted a managable subset

of the datasets, which contains the reviews on businesses in Las Vegas. The dataset could

be found in the assignment page. The data format for users and businesses are exactly the

same as in the description on the Kaggle page. The review file only preserves the review

id, business id, user id, stars (ratings) and date to keep the dataset small.

Preprocessing: Split the review data into training and test datasets. Use the 20% latest

reviews (with largest “date” attribute) for testing and the remaining 80% for training.

The collaborative filtering algorithm does not need to train, but the similarity computation

can only conduct using ratings in the training set.

This assignment has the following tasks:

1. Item-based collaborative filtering (25 pts): In this task, we will build an itembased

CF algorithm on Yelp.

(a) Our first step is to pre-compute the similarity between each pair of businesses

(items) on the training data, such that we don’t have to compute the similarities

on-the-fly. Here, we use cosine similarity for simplicity. Report (1) the time

for computing all pairwise similarities, and (2) report the similarities

between the following pairs of businesses. (10 pts)

Pairs Business 1 Business 2

1 rjZ0L-P1SRLJfiK5epnjYg cL2mykvopw-kodSyc-z5jA

2 6H8xfhoZ2IGa3eNiY5FqLA XZbuPXdyA0ZtTu3AzqtQhg

3 rfwJFFzW6xW2qYfJh14OTA G58YATMKnn-M-RUDWg3lxw

4 0QSnurP5Ibor2zepJmEIlw 6-lmL3sC-axuh8y1SPSiqg

(b) Try to compute the above cosine similarity using matrix multiplication. Hints:

Let R be the m × n rating matrix, where m and n are the number of users

and number of businesses, respectively. RT R gives the pairwise dot product of

business vectors. Also notice that the diagonal of RT R gives the self dot product

of each business, which is the square of the L2-norm of each business vector used

1https://www.yelp.com/

1

in the denominator of the cosine similarity! Report the time for computing

the whole pairwise similarity matrix. (5 pts)

(c) Implement the item-based collaborative filtering algorithm based on the above

similarity matrix. Report the RMSE. (5 pts)

(d) Extend the above CF method by incorporating the bias from all the transactions

(bg), bias from each user u (bu = [avg rating of user u]?bg), and the bias from each

business i (bi = [avg rating for business i] ? bg), e.g., the recommendation score

of user u to business i is ?rui = bui +

P

j∈Pu

wij (ruj?buj )

P

j∈Pu

wij

, where bui = bg +bu +bi

. We

choose the top-20 similar businesses rated by the user u as Pu in this assignment.

Report the RMSE. (5 pts)

2. Latent factor model (25 pts): Hold-out 1/8 of the latest reviews in the training

data as validation set to tune the number of latent factors. This results in a data split

of 70% training, 10% validation, and 20% testing.

(a) Implement a latent factor model with number of latent factors k. Use stochastic

gradient descent with fixed learning rate η = 0.01 and regularization hyperparameter

λ1 = λ2 = 0.3. Run SGD for 20 epochs and report RMSE for

k = {8, 16, 32, 64} on both training (i.e., the minimized loss) and validation

sets. (10 pts)

(b) Extend the latent factor model with bias. Use stochastic gradient descent with

fixed learning rate η = 0.01 and regularization hyperparameter λ1 = λ2 = λ3 =

λ4 = 0.3. Run SGD for 20 epochs and report RMSE for k = {8, 16, 32, 64}

on both training and validation sets. (10 pts)

(c) Select the best k for each of the above two versions of latent factor models,

and apply on the test dataset. Report RMSE of the two models on test

dataset. (5 pts)

Note: You could use any python package to validate your implementation, to compute

the pairwise cosine similarity, or to do the stochastic gradient descent. However,

directly calling any package of CF and latent factor models will get zero mark.

2 What to submit?

An answer.pdf file reports the requested values and explanation of each task.

A source code file contains detailed comments.

2

版权所有：留学生编程辅导网 2018 All Rights Reserved 联系方式：QQ:99515681 电子信箱：99515681@qq.com

免责声明：本站部分内容从网络整理而来，只供参考！如有版权问题可联系本站删除。