
554.488/688 Computing for Applied Mathematics

Spring 2023 - Final Project Assignment

The aim of this assignment is to give you a chance to exercise your skills at prediction using Python. You have been sent an email with a link to data collected on a random sample from some population of Wikipedia pages, to be used to develop prediction models for three different web page attributes. Each student is provided with their own data, drawn from a Wikipedia page population unique to that student, and this comes in the form of two files:

• A training set, which is a pickled pandas data frame with 200,000 rows and 44 columns. Each row corresponds to a distinct Wikipedia page/url drawn at random from a certain population of Wikipedia pages. The columns are

  – URLID in column 0, which gives a unique identifier for each url. You will not be able to determine the url from the URLID or the rest of the data. (It would be a waste of time to try, so the only information you have about this url is provided in the dataset itself.)

  – 40 feature/predictor variable columns in columns 1, ..., 40, each associated with a particular word (the word is in the header). For each url/Wikipedia page, the word column gives the number of times that word appears in the associated page.

  – Three response variables in columns 41, 42 and 43:

    * length = the length of the page, defined as the total number of characters in the page

    * date = the last date when the page was edited

    * word_present = a binary variable indicating whether at least one of 5 possible words appears in the page, using a word list of 5 words specific to each student and not among the 40 feature words. (What this list of 5 words is will not be revealed to you, and it would be a waste of time trying to figure out what it is.)

• A test set, which is also a pickled pandas data frame with 50,000 rows but with only 41 columns, since the response variables (length, date, word_present) are not available to you. The rows of the test dataset also correspond to distinct urls/pages drawn from the same Wikipedia url/page population as the training dataset (with no pages in common with the training set pages). The response variables have been removed, so the columns that are available are

  – URLID in column 0

  – the same 40 feature/predictor variable columns corresponding to word counts for the same 40 words as in the training set
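Both files are pickled pandas data frames, so they can be loaded directly with pandas. A minimal sketch, assuming the files are named train.pkl and test.pkl (substitute the names of your personal files):

    import pandas as pd

    # File names are placeholders - use the names of your personal files.
    train = pd.read_pickle("train.pkl")   # 200,000 rows x 44 columns
    test = pd.read_pickle("test.pkl")     # 50,000 rows x 41 columns

    word_cols = train.columns[1:41]       # the 40 word-count feature columns

    # The feature columns contain randomly assigned missing values,
    # so expect NaNs when inspecting them.
    print(train[word_cols].isna().mean())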

Your goal is to use the training data to

• predict the length variable for pages in the test dataset,

• predict the mean absolute error you expect to achieve in your predictions of length in the test dataset,

• predict word_present for pages in the test dataset, attempting to make the false positive rate (the proportion of pages for which word_present is 0 but predicted to be 1) as close as you can to .05, and to make the true positive rate (the proportion of pages for which word_present is 1 and predicted to be 1) as high as you possibly can (see the threshold-calibration sketch after this list),

• predict your true positive rate for word_present in the test dataset,

• predict edited_2023 for pages in the test dataset (a binary variable indicating whether the page was last edited in 2023, which you can construct from the date column in the training set), again attempting to make the false positive rate (the proportion of pages for which edited_2023 is 0 but predicted to be 1) as close as you can to .05, and the true positive rate (the proportion of pages for which edited_2023 is 1 and predicted to be 1) as high as you possibly can,

• predict your true positive rate for edited_2023 in the test dataset.
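One way to aim for a false positive rate of .05 while keeping the true positive rate high is to fit a probabilistic classifier and calibrate its decision threshold on held-out data. The sketch below is one possible approach, not a required one; logistic regression and the names X and y (an imputed feature matrix and one of the binary responses) are assumptions:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # X: imputed feature matrix; y: binary response (e.g. word_present).
    X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
    y_val = np.asarray(y_val)

    clf = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
    scores = clf.predict_proba(X_val)[:, 1]

    # Threshold at the 95th percentile of the scores of true negatives,
    # so that about 5% of negatives are predicted positive (FPR ~ .05).
    threshold = np.quantile(scores[y_val == 0], 0.95)
    y_pred = scores > threshold

    fpr = (y_pred & (y_val == 0)).mean() / (y_val == 0).mean()
    tpr = (y_pred & (y_val == 1)).mean() / (y_val == 1).mean()
    print(f"validation FPR ~ {fpr:.3f}, TPR ~ {tpr:.3f}")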

Since I have the response variable values (length, word_present, date) for the pages in your test dataset, I can determine the performance of your predictions. Since you do not have those variables, you will need to set aside some data in your training set or use cross-validation to estimate the performance of your prediction models.
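For example, the mean absolute error of a length model can be estimated from the training data alone with cross-validation. A minimal sketch, assuming X is the imputed feature matrix from above and a plain linear model is used:

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    # 5-fold cross-validated mean absolute error for a baseline model;
    # scikit-learn returns negated errors, hence the leading minus sign.
    cv_mae = -cross_val_score(
        LinearRegression(), X, train["length"],
        scoring="neg_mean_absolute_error", cv=5,
    ).mean()
    print(f"estimated MAE for length: {cv_mae:.0f} characters")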

There are 3 different parts of this assignment, each requiring a submission:

Part 1 (30 points) - a Jupyter notebook containing

– a description (in words, no code) of the steps you followed to arrive at your predictions and your estimates of prediction quality, including a description of any separation of your training data into training and testing data, the method you used for imputation (one option is sketched below), and the methods you tried for making predictions (e.g. regression, logistic regression, ...), followed by

– the code you used in your calculations
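For the imputation step mentioned above, a simple option is scikit-learn's SimpleImputer; median imputation is just one reasonable choice among several, not a required method:

    from sklearn.impute import SimpleImputer

    # Replace each missing word count by that column's median,
    # fitting on the training features only and reusing the fit on test.
    imputer = SimpleImputer(strategy="median")
    X = imputer.fit_transform(train[word_cols])
    X_test = imputer.transform(test[word_cols])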

Part 2 (60 points) - a csv file with your predictions. (A notebook is provided to you for checking that your csv file is properly formatted; a sketch of writing such a file follows this list.) This file should consist of exactly 4 columns with

– a header row with URLID, length, word_present, edited_2023

– 50,000 additional rows

– every URLID in your test dataset appearing in the URLID column - not altered in any way!

– no missing values

– data type for the length column should be integer or float

– data type for the word_present column should be either integer (0 or 1), float (0. or 1.) or Boolean (False/True)


– data type for the edited_2023 column should be either integer (0 or 1), float (0. or 1.) or Boolean (False/True)
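A minimal sketch of assembling and sanity-checking such a file; pred_length, pred_word and pred_edited are hypothetical arrays holding your 50,000 predictions in test-row order:

    import pandas as pd

    # Column names must match the specification exactly.
    out = pd.DataFrame({
        "URLID": test["URLID"].values,
        "length": pred_length,
        "word_present": pred_word.astype(int),
        "edited_2023": pred_edited.astype(int),
    })

    assert out.shape == (50_000, 4)
    assert not out.isna().any().any()                 # no missing values
    assert set(out["URLID"]) == set(test["URLID"])    # URLIDs unaltered

    out.to_csv("predictions.csv", index=False)

The formatting notebook provided with the assignment remains the authoritative check.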

Part 3 (30 points) - providing estimates of the following in a form:

– what do you predict the mean absolute error of your length predictions to be?

– what do you predict the true positive rate for your word_present predictions to be?

– what do you predict the true positive rate for your edited_2023 predictions to be?

Your score in this assignment will be based on

Part 1 (30 points)

– evidence of how much effort you put into the assignment (how many different methods did you try?)

– how well did you document what you did?

– was your method for predicting the quality of your performance prone to over-fitting?

Part 2 (60 points)

– how good are your predictions of length, word_present and edited_2023 - I will do predictions using your training data and I will compare

  * your length mean absolute error to what I obtained in my predictions

  * your true positive rate to what I obtained for the binary variables (assuming you managed to appropriately control the false positive rate)

– how well did you meet specifications - did you get your false positive rate in predictions of the binary variables close to .05 (again, compared to how well I was able to do this)

Part 3 (30 points)

– how good is your prediction of the length mean absolute error

– how good is your prediction of the true positive rate for the word_present variable

– how good is your prediction of the true positive rate for the edited_2023 variable

How the datasets were produced

This is information that will not be of much help to you in completing the assignment, except maybe to convince you that there would be no point in using one of the other students' data in completing this assignment.

• I web crawled in Wikipedia to arrive at a random sample of around 2,000,000 pages.

• I made a list of 100 random words and extracted the length, the word counts, and the last date edited for each page.

To create one of the student personal datasets, I repeated the following steps for each student (a Python sketch of this procedure follows):

Repeat
    Chose 10 random words w0, w1, ..., w9 out of the 100 words in the list above
    Determined the subsample of pages having w0 and w1 but not w2, w3 or w4
    Used the words w5, w6, w7, w8 and w9 to create the word_present variable
Until the subsample has at least 250,000 pages
Randomly sampled 40 of the 90 unsampled words without replacement
Randomly sampled without replacement 250,000 pages out of the subsample
Retained only the 250,000 pages and
    word counts for the 40 words
    length
    word_present
    last date edited
Randomly assigned missing values in the feature (word count) data
Randomly separated the 250,000 pages into
    200,000 training pages
    50,000 test pages
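Read as code, the per-student sampling loop is roughly as follows. This is a sketch for illustration only; counts, a data frame of word counts for the roughly 2,000,000 crawled pages and the 100 words, is a hypothetical name:

    import numpy as np

    rng = np.random.default_rng()

    # Repeat until the subsample is large enough.
    while True:
        w = rng.choice(counts.columns.to_numpy(), size=10, replace=False)
        mask = (
            (counts[w[0]] > 0) & (counts[w[1]] > 0)
            & (counts[w[2]] == 0) & (counts[w[3]] == 0) & (counts[w[4]] == 0)
        )
        if mask.sum() >= 250_000:
            break

    sub = counts.loc[mask]
    word_present = (sub[w[5:]] > 0).any(axis=1)   # the 5 secret words
    remaining = [c for c in counts.columns if c not in set(w)]
    feature_words = rng.choice(remaining, size=40, replace=False)
    kept_pages = rng.choice(sub.index.to_numpy(), size=250_000, replace=False)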

