2021/1/24 HW3 - Jupyter Notebook

localhost:8892/notebooks/Downloads/HW3.ipynb 1/8

HW3

PageRank

What is the most important website on the internet? Who is the "key player" on a sports team? Which countries

are the most central players in the world economy? There is no one correct answer to any of these questions,

but there is a most profitable one. PageRank (https://en.wikipedia.org/wiki/PageRank) is an algorithm for

ranking individual elements of complex systems, invited by Sergey Brin and Larry Page. It was the first and

most famous algorithm used by the Google Search engine, and it is fair to say that the internet as we know it

today would not exist without PageRank.

In this assignment, we will implement PageRank. There are many good ways to implement this algorithm, but in

this assignment we will use our newfound skills with object-oriented programming and iterators.

How it works

For the purposes of this example, let's assume that we are talking about webpages. PageRank works by

allowing a "random surfer" to move around webpages by following links. Each time the surfer lands on a page, it

then looks for all the links on that page. It then picks one at random and follows it, thereby arriving at the next

page, where the process repeats. Eventually, the surfer will visit all the pages one or more times. Pages that the

surfer visits more frequently have higher PageRank scores. Because the surfer moves between linked pages,

PageRank expresses an intuitive idea: important pages are linked to other important pages. This diagram

(https://en.wikipedia.org/wiki/PageRank#/media/File:PageRanks-Example.jpg) from Wikipedia gives a nice

illustration. Note that more important webpages (higher PageRank) tend to be connected to other important

webpages.

A schematic for PageRank.

Data

You can complete this assignment using data from one of two sources.

Option 1: Hamilton

This data set comes from the hit Broadway musical "Hamilton."

The logo of the musical Hamilton, showing a

silhouette dressed in period custom standing on

top of a five-pointed star.

The Hamilton data set

2021/1/24 HW3 - Jupyter Notebook

localhost:8892/notebooks/Downloads/HW3.ipynb 2/8

The good folks at The Hamilton Project (http://hamilton.newtfire.org/) analyzed the script for us, obtaining data

on who talks about whom in each of the show's songs. When character A mentions character B, we'll think of

this as a link from A to B, suggesting that B might be important.

If you use this data set, listening to the soundtrack while working is strongly recommended.

Option 2: Global Airline Network

A map of the world, with many curved green

lines connecting different points on the map.

The points are airports, and the curved green

lines are flight routes.

The global airline network

Back in the Before Times, lots of people flew on airplanes. This data set includes a "link" from Airport A to

Airport B whenever there is a flight from B to A. This data set was collected by the OpenFlights Project

(https://openflights.org/data.html).

(A). Define Functions

In this part, all you have to do is hit shift + enter on the code block supplied below. This block defines two

functions. The first one retrieves the data from the internet and saves it to your local computer, while the second

reads in the data, producing a list of tuples. It's not important for you to be familiar with the code in these

functions right now -- we'll discuss them early in Week 4.

In [1]:

import urllib

import csv

def retrieve_data(url):

"""

Retrieve a file from the specified url and save it in a local file

called data.csv. The intended values of url are:

1. https://philchodrow.github.io/PIC16A/homework/HW3-hamilton-data.csv

2. https://philchodrow.github.io/PIC16A/homework/HW3-flights-data.csv

"""

# grab the data and parse it

filedata = urllib.request.urlopen(url)

to_write = filedata.read()

# write to file

with open("data.csv", "wb") as f:

f.write(to_write)

def read_data(path):

"""

read downloaded data from a .csv file, and return a list of tuples.

each tuple represents a link between states.

"""

with open(path, "r") as f:

reader = csv.reader(f)

return [(row[0], row[1]) for row in list(reader)]

2021/1/24 HW3 - Jupyter Notebook

localhost:8892/notebooks/Downloads/HW3.ipynb 3/8

(B). Grab the Data

The data live at the following URLs:

Hamilton: https://philchodrow.github.io/PIC16A/homework/HW3-hamilton-data.csv

Airline: https://philchodrow.github.io/PIC16A/homework/HW3-flights-data.csv

In each data set, each row corresponds to a "link" between objects. In Hamilton, the pairs have format

mentioner, mentioned while in the airline network the rows have format origin, destination.

Pick one of these data sets, and set the variable url appropriately by uncommenting one of the two lines

below. Then, call retrieve_data() and read_data() . The path argument for read_data() should be set to

"data.csv" . Create a variable called data to hold the return value of read_data() .

Your solution

In [ ]:

(C). Examine the structure of the data

This would also be a good time to inspect the data to make sure you understand how it is structured. Write a

function describe(n) that describes the meaning of the n th row of the data set you chose. In the Hamilton

data set, your function should do the following:

describe(5)

# output

"Element 5 of the Hamilton data set is ('burr', 'betsy'). This means that Burr mentions Be

tsy in a song."

In context of the airline flights data, your function should instead do this:

describe(5)

# output

"Element 5 of the flights data set is ('SIN', 'BKK'). This means that there is a flight fr

om BKK to SIN."

Please attend to capitalization and formatting. While the standard string concatenation operator + is

completely fine for this task, the fancy str.format() function may make your code somewhat simpler. This

page (https://realpython.com/python-formatted-output/) has some useful examples in case you'd like to try this.

Your Solution

# uncomment the second line if you'd prefer to

# work with the flights data.

url = "https://philchodrow.github.io/PIC16A/homework/HW3-hamilton-data.csv"

# url = "https://philchodrow.github.io/PIC16A/homework/HW3-flights-data.csv"

# Call your functions below

2021/1/24 HW3 - Jupyter Notebook

localhost:8892/notebooks/Downloads/HW3.ipynb 4/8

In [ ]:

(D). Data to Dictionary

Write a function called data_to_dictionary that converts the data into a dictionary such that:

1. There is a single key for each character (in Hamilton) or airport (in flights).

2. The value corresponding to each key is a list of the characters/airports to which that key links. The list

should contain repeats if there are multiple links.

Here's an example of the desired behavior on a fake data set.

data = [("a", "b"),

("a", "b"),

("a", "c"),

("b", "c"),

("b", "a")]

data_to_dictionary(data)

# output

{"a" : ["b", "b", "c"], "b" : ["a", "c"]}

Your Solution

In [ ]:

(E). Define a PR_DiGraph class

A directed graph, or DiGraph, is just a set of arrows ("edges") between objects ("nodes"). It is a natural way to

represent data that represents one-way relationships, such as links from one webpage to another or mentions

of one character by another. We already saw a directed graph above when we introduced the idea of

PageRank. Here's a paired-down example.

Six circles, representing nodes, labeled A

through F. There are directed arrows between

certain pairs of nodes.

Example of a directed graph.

Implement a PR_DiGraph class with a custom __init__() method and a linked_by() method. The

__init__() method should accept two arguments: data and iteration_limit . The __init__() method

should then construct an instance variable self.link_dict which is simply the output of data_to_dictionary

applied to the argument data . __init__() should also construct an instance variable

self.iteration_limit , which simply takes the value of iteration_limit supplied to __init__() . Don't

worry about that one for now.

Then, define a method self.linked_by(x) which, when called, returns the value self.link_dict[x] .

2021/1/24 HW3 - Jupyter Notebook

localhost:8892/notebooks/Downloads/HW3.ipynb 5/8

Finally, add an __iter__ method, which returns an object of class PR_Iterator . We will define this class in

the next part.

Example session (using Hamilton):

D = PR_DiGraph(data, iteration_limit = 10000)

D.linked_by('peggy')

# output

['peggy', 'schuylerSis']

Your Solution

In [ ]:

(F). Implement PR_Iterator

Define a PR_Iterator class with a custom __next__() method.

The __init__ method of this class should create instance variables to store the PR_DiGraph object from

which it was constructed; a counter i , starting at 0, to log the number of steps taken, and a current_state

variable whose value is one of the keys of the link_dict of the Pr_DiGraph . You can choose its initial value

arbitrarily; in my solution code I chose self.current_state = "hamilton" .

We are going to use iteration to implement the PageRank algorithm. This means we are going to imagine a

surfer who is following the links in our data set. Implement the following two methods:

1. follow_link() .

A. Pick a random new character mentioned by the current character, or new airport served by the current

airport. Let's call this next_state .

B. If next_state != current_state , set current_state to next_state .

C. Otherwise (if next_state == current_state ), teleport (see below).

D. You might run into KeyError s, in which case you should again teleport (use a try-except block).

2. teleport() .

A. Set the current state to a new state (key of the link dict) completely at random.

Hint: use random.choice from the random module to choose elements of lists.

Finally, implement __next__() . __next__() should do follow_link() with 85% probability, and do

teleport() with 15% probability. You should also define a custom StopIteration condition to ensure that

only as many steps are taken as the iteration_limit supplied to the PR_DiGraph initializer.

1. To do something with 85% probability, use the following:

if random.random() < 0.85:

# do the thing

else:

# do the other thing

Example Usage

2021/1/24 HW3 - Jupyter Notebook

localhost:8892/notebooks/Downloads/HW3.ipynb 6/8

After you define your class, run the following code and show that it works. Note: your precise sequence may be

different from mine.

D = PR_DiGraph(data, iteration_limit = 5)

for char in D:

print(char)

following link : current state = burr

following link : current state = washington

following link : current state = burr

following link : current state = hamilton

teleporting : current state = washington

I have added printed messages here for you to more clearly see what should be happening, but it is not

necessary for you to do this. It is sufficient for your output to look like:

D = PR_DiGraph(data, iteration_limit = 5)

for char in D:

print(char)

burr

washington

burr

hamilton

washington

Your Solution

In [ ]:

In [ ]:

(G). Compute PageRank

Finally, we are ready to compute the PageRank in our data set. Initialize a PR_DiGraph with a large iteration

limit (say, 1,000,000). Use a for -loop to allow your surfer to randomly move through the data set. The number

of times that the surfer visits state x is the PageRank score of x .

Create a dict which logs how many times a given state appears when iterating through the PR_Digraph . So,

this dictionary holds the PageRank score of each state.

Your Solution

import random

# run the below

D = PR_DiGraph(data, iteration_limit = 5)

for char in D:

print(char)

2021/1/24 HW3 - Jupyter Notebook

localhost:8892/notebooks/Downloads/HW3.ipynb 7/8

In [ ]:

(H). Display Your Result

Use your favorite approach to show the results in sorted format, descending by PageRank score. The entries at

the top should be the entries with highest PageRank. What are the most important elements in the data set?

You may show either the complete list or just the top 10.

Check your code by comparing your top 10 to mine. Because we are using a randomized algorithm, your

results will not agree exactly with mine, but they should be relatively close. If your top 10 list is very different,

then you might want to revisit your previous solutions.

For Hamilton, my top 10 were:

[('hamilton', 166062),

('burr', 99180),

('washington', 92246),

('jefferson', 72450),

('eliza', 51485),

('angelica', 48042),

('madison', 37421),

('lafayette', 34297),

('lee', 33678),

('jAdams', 31121)]

For the flights data, my top 10 were:

[('LHR', 18043), # London Heathrow

('ATL', 16370), # Atlanta

('JFK', 14795), # New York JFK

('FRA', 14156), # Frankfurt

('CDG', 14073), # Charles de Gaulle (Paris)

('LAX', 13199), # Los Angeles

('ORD', 12915), # Chicago O'Hare

('PEK', 12525), # Beijing

('AMS', 12410), # Amsterdam Schiphol

('PVG', 11517)] # Shanghai

Your solution

In [ ]:

(I). Submit!

Check that your code is appropriately documented (comments and docstrings), and turn it in.

2021/1/24 HW3 - Jupyter Notebook

localhost:8892/notebooks/Downloads/HW3.ipynb 8/8

版权所有：留学生编程辅导网 2018 All Rights Reserved 联系方式：QQ:99515681 电子信箱：99515681@qq.com

免责声明：本站部分内容从网络整理而来，只供参考！如有版权问题可联系本站删除。