联系方式

  • QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-23:00
  • 微信:codinghelp

您当前位置:首页 >> Python编程Python编程

日期:2019-09-27 10:05

Computer Science 320 S2 (2019)

Assignment 4

Due date Sep 28, 2019 23:59pm

Answer all of the following questions. There are 10 points, which contribute 5% of your total course

marks. Submit a properly typesetted pdf file (LATEX preferred) of your answers to Canvas before the

deadline. There is no automarker for the Python program. To aid the markers, scanned handwritten

solutions or late submissions are NOT accepted.

Email spam filtering

You are implementing a new email spam filtering for the University of Auckland. If the email comes from

one of these trusted addresses, it is not a spam. Otherwise, it will be a spam. Since the server memory

is very limited, you can not keep the list of trusted addresses on its memory (if an email address requires

25 bytes on average, it would take 25 GB to store A in the memory).

After searching on Google, you have found a very memory-efficient solution. That is, instead of

keeping A on memory, you will construct a bit array B of size n representing for the set A. You choose

n = 8000, 000, 000 bits so that you need only 1 GB of memory. The construction is as follows.

? B is initialized by 0s.

? You choose a hash function h : a 7→ [0, n). In other words, the input of h will be an string a (e.g.

email address) and the hash value will be an integer ranging between 0 and n.

? For each trusted email address a ∈ A, you hash a into one of n buckets, and set that bit to 1. In

order words, you simply set B[h(a)] = 1.

The filtering mechanism works as follows.

? When you receive a new email from the address a0, you compute the hash value h(a0).? If B[h(a0)] = 1, you consider that this email is not a spam and let it go through.? If B[h(a0)] = 0, you consider that this email is a spam and discard it.

Theoretical questions for measuring the performance of the filtering (5 pts):

1. Illustrate that if a new email from the address a ∈ A, it always gets through (1 pts).

2. Given any position 0 ≤ i < n, what is the probability that B[i] = 1 (2 pts).

3. Given a spam email from the address a

0 ∈/ A, what is the probability that it gets through (2 pts).

Practical implementation for measuring the performance of the filtering (5 pts):

Write a Python script to implement this email spam filtering technique given the scaled setting. Your

data structure B has size n = 8000, 000 bits. For simplicity, assume that your trusted email addresses

and spam email are presented as integers. In particular, trusted email addresses A = {1, 2, ..., 1000000}

and spam email addresses are any integer x > 1000, 000. You are free to choose your hash function to

hash an integer into the range [0, n).

1. Verify that any email from the address 1 ≤ a ≤ 1000, 000 it always gets through (1 pts).

2. Compute the probability that a spam email going through your filter given this setting. (1 pts).

3. Generate 1000 random integers x > 1000, 000 as spam email addresses and compute the number of

spam emails going through your filter. Verify this value with your theoretical value from the step

2 (3 pts).


版权所有:留学生编程辅导网 2020 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。 站长地图

python代写
微信客服:codinghelp