
Date: 2019-11-05 10:11

CIS 545 - Big Data Analytics - Fall 2019

Have you ever wondered about (1) what it takes to be a data scientist or "data person", and (2) how so

work?

This homework is focused on (1) working with hierarchical data stored in dataframes, (2) traversing re

data, (3) understanding a bit about performance.

We will focus on questions about data scientists from "our" crawl of the LinkedIn dataset, which was a

extended notebook.

!pip install pymongo[tls,srv]

!pip install swifter

!pip install lxml

import pandas as pd

import numpy as np

import json

import sqlite3

from lxml import etree

import urllib

import zipfile

import time

import swifter

from pymongo import MongoClient

from pymongo.errors import DuplicateKeyError, OperationFailure

We need to pull the zip file with LinkedIn data from Amazon S3 (where it is shared) to your local machi

machine. Only when the data is local can we efficiently parse it (and we'll read directly out of a zip file).

The zip file contains three files with the same schema. You can start with the tiny instance to test yo

brave and have a lot of time feel free to use the full file.

Step 0: Acquire and load data

Due October 11, 2019 at 10pm

Homework 2: Querying Linked (LinkedIn) Data

We will grade your homework using small. Hidden test 0.0 will override your file selection, so as lon

in a cell that comes after that one, you will be fine.

linkedin.json (3M records)

linkedin_small.json (100K records)

linkedin_tiny.json (10K records)

The cell below will download a 3GB file to your Google Cloud. It may take a while. You do not need to m

#url = 'https://upenn-bigdataanalytics.s3.amazonaws.com/linkedin.zip'

#filehandle, _ = urllib.request.urlretrieve(url,filename='local.zip')

filehandle = 'local.zip'

# What's the zip file actually called locally?

filehandle

The cell below creates pointers to the three versions of our dataset. To switch between them, simply c

the cell below.

def fetch_file(fname):

   zip_file_object = zipfile.ZipFile(filehandle, 'r')

   for file in zip_file_object.namelist():

       file = zip_file_object.open(file)

       if file.name == fname: return file

   return None

   

linkedin_tiny = fetch_file('linkedin_tiny.json')

linkedin_small = fetch_file('linkedin_small.json')

linkedin_huge = fetch_file('linkedin.json')
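As an aside, `zipfile.ZipFile.open` also accepts a member name directly, so a lookup does not have to open every entry in the archive. The sketch below builds a tiny in-memory archive with one invented record and uses a hypothetical helper, `open_member`, that returns `None` for a missing name, mirroring the behaviour of `fetch_file`. Reading one JSON record per line is an assumption made only for this toy file.

```python
import io
import json
import zipfile

def open_member(archive, fname):
    """Return a readable handle for `fname` inside `archive`, or None if absent."""
    zf = zipfile.ZipFile(archive, 'r')
    if fname not in zf.namelist():
        return None
    return zf.open(fname)

# Build a toy archive in memory; the record below is invented for illustration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    zf.writestr('linkedin_tiny.json', json.dumps({'_id': 'p1'}) + '\n')

handle = open_member(buffer, 'linkedin_tiny.json')
first_record = json.loads(handle.readline())   # the handle yields bytes
missing = open_member(buffer, 'no_such_file.json')
```
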

# CIS 545 Hidden Test 0.0 - please do not modify or delete this cell!

# Set the input file to process

file = linkedin_small

In the cell below, adapt the data loading code from the in-class notebook. You will need the function th

the function that converts relations to dataframes. Read in a maximum of 20000 people. Put the code

relations, removes the interval field, and stores the field

information with a try statement, just in case.

command to move on. At the end of the next cell, you should have nine dataframes with the following

Step 0.1: Store data in dataframes

11/3/2019 Homework_2.ipynb - Colaboratory

https://colab.research.google.com/drive/1K0hp-Y5R7FHa3AwfAj2tCw3ueXOlJhay#scrollTo=syxh_fwyTAVU 3/12

1. people_df

2. names_df

3. education_df

4. groups_df

5. skills_df

6. experience_df

7. honors_df

8. also_view_df

9. events_df
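The in-class loader is not reproduced here. As a rough sketch of the general idea only, the block below splits nested records into a parent relation and a child relation linked by a person ID. The record layout (`_id`, `locality`, `skills`) and the helper name `to_relations` are assumptions for illustration, not the actual LinkedIn schema or the course's code.

```python
def to_relations(records):
    """Split nested records into flat row lists (hypothetical field names)."""
    people, skills = [], []
    for rec in records:
        # Scalar attributes stay with the person row.
        people.append({'_id': rec['_id'], 'locality': rec.get('locality')})
        # Each list element becomes its own row, keyed by the owning person.
        for value in rec.get('skills', []):
            skills.append({'person': rec['_id'], 'value': value})
    return people, skills

# Invented toy records.
records = [
    {'_id': 'p1', 'locality': 'Philadelphia', 'skills': ['SQL', 'Python']},
    {'_id': 'p2', 'skills': ['Statistics']},
]
people_rows, skill_rows = to_relations(records)
```

Each row list can then be handed to `pd.DataFrame(...)` to obtain one dataframe per relation.
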

# TODO: Adapt the data loading code from class.

# YOUR CODE HERE

raise NotImplementedError()

# CIS 545 Sanity Check 0.1 - please do not modify or delete this cell!

display(experience_df)

# CIS 545 Hidden Test 0.1.1 - please do not modify or delete this cell!

# CIS 545 Hidden Test 0.1.2 - please do not modify or delete this cell!

# CIS 545 Hidden Test 0.1.3 - please do not modify or delete this cell!

Next save the data to SQLite... Again, using the same approach as in the sample notebook.

Step 0.2: Convert to SQL

conn = sqlite3.connect('linkedin.db')

# YOUR CODE HERE

raise NotImplementedError()
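One common way to write a dataframe to SQLite is `DataFrame.to_sql`; whether the sample notebook does exactly this is not shown here. The sketch below uses an in-memory database and an invented toy table, and reads the rows back with `pd.read_sql_query` to confirm the round trip.

```python
import sqlite3
import pandas as pd

# In-memory database so the sketch leaves no file behind.
toy_conn = sqlite3.connect(':memory:')

toy_skills_df = pd.DataFrame({
    'person': ['p1', 'p1', 'p2'],
    'value': ['SQL', 'Python', 'SQL'],
})

# if_exists='replace' makes the cell safe to re-run; index=False skips the row index.
toy_skills_df.to_sql('skills', toy_conn, if_exists='replace', index=False)

round_trip_df = pd.read_sql_query('SELECT person, value FROM skills', toy_conn)
```
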

# CIS 545 Sanity Check 0.2.1 - please do not modify or delete this cell!

people_df.describe()

# CIS 545 Sanity Check 0.2.2 - please do not modify or delete this cell!

skills_df.describe()


# CIS 545 Sanity Check 0.2.3 - please do not modify or delete this cell!

experience_df.describe()

In this homework, we will use LinkedIn to analyze what it means to be a data scientist (as of a few yea

Step 1: What is a data scientist?

Our first question is: for anyone whose job revolves around data (database administrators, data curator

are the most common skills?

Step 1.1: What are common skills for data scientists?

Complete the collect_skills function below. This and the other functions in this homework allow u

queries even if your data do not match ours. The function should:

1. Using experience_df, find all people with a position containing "data" in the title. Remember upper versus lo

2. Using skills_df, find all people with "data science" as a skill. Again, remember to account for case.

3. For all of the unique people found in steps 1 and 2, find the rest of their skills.

4. Return a dataframe of the top 15 skills, by frequency (see pandas.DataFrame.sort_values). The columns shou

scientists (the count of the number of data scientists with this skill).
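The four steps can be illustrated on toy data. This is a minimal sketch, not the reference solution: the input column names (`person`, `title`, `skill`) are assumptions, the data is invented, and whether the matching skill itself should be counted among "the rest" is left open here.

```python
import pandas as pd

toy_experience_df = pd.DataFrame({
    'person': ['p1', 'p2', 'p3'],
    'title': ['Data Analyst', 'Chef', 'DATABASE Administrator'],
})
toy_skills_df = pd.DataFrame({
    'person': ['p1', 'p1', 'p2', 'p2', 'p3', 'p4'],
    'skill': ['SQL', 'Python', 'data science', 'SQL', 'SQL', 'Cooking'],
})

# Step 1: titles containing "data", ignoring case (missing titles count as no match).
by_title = toy_experience_df.loc[
    toy_experience_df['title'].str.contains('data', case=False, na=False), 'person']

# Step 2: people listing "data science" as a skill, ignoring case.
by_skill = toy_skills_df.loc[
    toy_skills_df['skill'].str.lower() == 'data science', 'person']

# Step 3: unique people from either step, then all of their skills.
data_people = pd.concat([by_title, by_skill]).drop_duplicates()
their_skills = toy_skills_df[toy_skills_df['person'].isin(data_people)]

# Step 4: count distinct people per skill and keep the most frequent ones.
top_toy_df = (their_skills.groupby('skill')['person'].nunique()
              .reset_index(name='scientists')
              .sort_values('scientists', ascending=False)
              .head(2))
```

On the real data, `head(15)` would take the place of `head(2)`.
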

Step 1.1.1: Collect skills (Pandas)

# TODO: Find the top 15 skills for data scientists (Pandas)

def collect_skills(experience_df, people_df, skills_df):

   # YOUR CODE HERE

   raise NotImplementedError()

# CIS 545 Sanity Check 1.1.1 - please do not modify or delete this cell!

top_skills_df = collect_skills(experience_df, people_df, skills_df)

display(top_skills_df)

if "skill" not in top_skills_df:

   raise AssertionError("skill column not defined")

if "scientists" not in top_skills_df:

   raise AssertionError("scientists column not defined")

if len(top_skills_df) != 15:

   raise AssertionError("dataframe does not have top 15")  

# CIS 545 Hidden Test 1.1.1.1 - please do not modify or delete this cell!

# CIS 545 Hidden Test 1.1.1.2 - please do not modify or delete this cell!

# CIS 545 Hidden Test 1.1.1.3 - please do not modify or delete this cell!

Compute the same table as in 1.1.1 using SQL. Store it as top_skills_sql but otherwise matching t

to also save the data to SQLite in a table called top_skills, as we will be testing to see if this table
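The aggregation part of such a query can be sketched with the standard `sqlite3` module alone. The table layout, the toy rows, and the table name `top_skills_toy` are assumptions for illustration; the filtering of data people from the previous step is omitted.

```python
import sqlite3

conn_toy = sqlite3.connect(':memory:')
conn_toy.execute('CREATE TABLE skills (person TEXT, skill TEXT)')
conn_toy.executemany(
    'INSERT INTO skills VALUES (?, ?)',
    [('p1', 'SQL'), ('p1', 'Python'), ('p2', 'SQL'), ('p3', 'sql')],
)

# Count distinct people per skill; ties are broken by skill name so the
# result order is deterministic.
query = """
    SELECT skill, COUNT(DISTINCT person) AS scientists
    FROM skills
    GROUP BY skill
    ORDER BY scientists DESC, skill ASC
    LIMIT 2
"""
rows = conn_toy.execute(query).fetchall()

# Persist the result as its own table, here named top_skills_toy.
conn_toy.execute('CREATE TABLE top_skills_toy AS ' + query)
saved = conn_toy.execute('SELECT COUNT(*) FROM top_skills_toy').fetchone()[0]
```

Note that `'SQL'` and `'sql'` are counted as different skills here, because the default comparison in SQLite is case-sensitive. Passing the same query string to `pd.read_sql_query` would return the result as a dataframe instead of a list of tuples.
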

Step 1.1.2: Top skills (SQL)

# TODO: Find the top 15 skills for data scientists (SQL)

# YOUR CODE HERE

raise NotImplementedError()

display(top_skills_sql)

# CIS 545 Sanity Check 1.1.2 - please do not modify or delete this cell!

if "skill" not in top_skills_sql:

   raise AssertionError("skill column not defined")

if "scientists" not in top_skills_sql:

   raise AssertionError("scientists column not defined")

if len(top_skills_df) < 1:

   raise AssertionError("dataframe has no results")  

if len(top_skills_sql.merge(top_skills_df)) != len(top_skills_sql):

   raise AssertionError("Pandas and SQL versions are not of the same length")

# CIS 545 Hidden Test 1.1.2 - please do not modify or delete this cell!

Complete the collect_titles function below that aggregates the most recent titles of people with d

use the given dataframes as input and return a two column dataframe: one column called title and

consider people who have at least min_skills of the top skills for a data scientist. You should also o

min_count times.

For extra practice, you can also do this in SQL, although we are not grading that.

Step 1.2: What are common titles for those with data science skills?

# TODO: Find the common titles (Pandas)

def collect_titles(top_skills_df, skills_df, people_df, experience_df, min_skills, min

   # YOUR CODE HERE

   raise NotImplementedError()

# CIS 545 Sanity Check 1.2 - please do not modify or delete this cell!

ds_titles_df = collect_titles(top_skills_df, skills_df, people_df, experience_df, 6, 2

display(ds_titles_df)

if "title" not in ds_titles_df:

   raise AssertionError("title column not defined")

if "count" not in ds_titles_df:

   raise AssertionError("count column not defined")

if len(ds_titles_df) < 1:

   raise AssertionError("dataframe has no results")

# CIS 545 Hidden Test 1.2.1 - please do not modify or delete this cell!

# CIS 545 Hidden Test 1.2.2 - please do not modify or delete this cell!

# CIS 545 Hidden Test 1.2.3 - please do not modify or delete this cell!

Now let's find the list of companies that have employed people with the above titles, ranked by numbe

Step 1.3: Who employs "data people" based on title?

Complete the collect_employers function below that aggregates the employers with positions corr

people with data science skills. This function should use the given dataframes as input and return a tw

org and the other called people. Show the names of companies (in field org) with at least min_cou

(include that count in the people column). Order the dataframe by the count of data people in the com
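A group-count-filter-sort pipeline of this kind might look as follows on toy data. This is a sketch under assumptions: the helper name `employers_sketch` is hypothetical, the rows are invented, and counting distinct people (rather than positions) per employer is one possible reading of the task.

```python
import pandas as pd

toy_experience_df = pd.DataFrame({
    'person': ['p1', 'p2', 'p3', 'p4', 'p5', 'p6', 'p7'],
    'org': ['IBM', 'IBM', 'Acme', 'Initech', 'Initech', 'Initech', 'IBM'],
    'title': ['Data Scientist', 'Data Scientist', 'Data Scientist',
              'Data Scientist', 'Analyst', 'Analyst', 'Chef'],
})
toy_titles_df = pd.DataFrame({'title': ['Data Scientist', 'Analyst'],
                              'count': [4, 2]})

def employers_sketch(experience_df, titles_df, min_count):
    # Keep only positions whose title appears in the titles table.
    matched = experience_df.merge(titles_df[['title']], on='title')
    # Count distinct people per employer.
    counts = (matched.groupby('org')['person'].nunique()
              .reset_index(name='people'))
    # Apply the threshold, then list the largest employers first.
    counts = counts[counts['people'] >= min_count]
    return counts.sort_values('people', ascending=False).reset_index(drop=True)

toy_employers_df = employers_sketch(toy_experience_df, toy_titles_df, 2)
```
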

Step 1.3.1: Data employers

# TODO: Find the data employers

def collect_employers(experience_df, ds_titles_df, min_count):

   # YOUR CODE HERE

   raise NotImplementedError()

# CIS 545 Sanity Check 1.3.1 - please do not modify or delete this cell!

employers_df = collect_employers(experience_df, ds_titles_df, 5)

display(employers_df)

if "IBM" not in employers_df['org'].tolist():

   raise AssertionError("Missing IBM")

   

if employers_df['people'].min() < 4:

   raise AssertionError("Not filtering properly")

# CIS 545 Hidden Test 1.3.1.1 - please do not modify or delete this cell!

# CIS 545 Hidden Test 1.3.1.2 - please do not modify or delete this cell!

Complete the collect_employees function below that aggregates the employees of employers with

recent titles of people with data science skills. In other words, who are the employees of the data emp

their titles? This function should use the given dataframes as input and return the org , family_name

person.

Step 1.3.2: Their employees

# TODO: Find the employees of the data employers

# YOUR CODE HERE

raise NotImplementedError()

# CIS 545 Sanity Check 1.3.2 - please do not modify or delete this cell!

title_people_df = collect_employees(people_df, experience_df, employers_df, names_df,

display(title_people_df)

if len(title_people_df.columns) != 4:

   raise AssertionError('Wrong number of columns. Check schema again')

# CIS 545 Hidden Test 1.3.2.1 - please do not modify or delete this cell!

# CIS 545 Hidden Test 1.3.2.2 - please do not modify or delete this cell!

# CIS 545 Hidden Test 1.3.2.3 - please do not modify or delete this cell!

Step 1.4: Find peers


In many common social graph settings, we can make recommendations to people based on their simi

define similarity in terms of the number of identical skills.

Suppose A and B have similar skills: A -> X1 and B -> X1, A -> X2 and B -> X2, etc. up to A -> Xk and B ->

Then given that A and B have similar skills, we might recommend A's employer to B, and vice versa.

Let's consider only the first 100 people in people_df.

Find, out of this set, the pairs of people with the most shared/common skills, and return the closest 20

this to make a recommendation for a potential employer and position to each person.

Step 1.4.0: Making the problem tractable in Pandas

Complete the collect_peers function below that finds the top num pairs of peers. In other words, co

person, counting the total set of skills in common. This function should use the given dataframes and

dataframe: person_1, person_2, and common_skills. The first two columns should be person IDs a

of skills that this pair of people shares.

Hint: Doing this requires a Cartesian product, i.e., every ID paired with every other ID. Think about how t

then add a field to this dataframe that will let us combine every record with every record.
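The constant-field trick from the hint can be sketched on toy data as follows. The ID column name `_id`, the skill column name `skill`, and the choice to keep each unordered pair only once are assumptions made for this illustration.

```python
import pandas as pd

toy_people_df = pd.DataFrame({'_id': ['p1', 'p2', 'p3']})
toy_skills_df = pd.DataFrame({
    'person': ['p1', 'p1', 'p2', 'p2', 'p3'],
    'skill': ['SQL', 'Python', 'SQL', 'Python', 'SQL'],
})

# A constant column gives every row the same join key, so merging the
# frame with itself on that key pairs every record with every record.
keyed = toy_people_df.assign(key=1)
pairs = keyed.merge(keyed, on='key', suffixes=('_1', '_2')).drop(columns='key')
pairs = pairs.rename(columns={'_id_1': 'person_1', '_id_2': 'person_2'})

# Keep each unordered pair once and drop self-pairs.
pairs = pairs[pairs['person_1'] < pairs['person_2']]

# Attach both people's skills and count the skills that coincide.
left = toy_skills_df.rename(columns={'person': 'person_1'})
right = toy_skills_df.rename(columns={'person': 'person_2'})
shared = pairs.merge(left, on='person_1').merge(right, on=['person_2', 'skill'])
toy_peers_df = (shared.groupby(['person_1', 'person_2']).size()
                .reset_index(name='common_skills')
                .sort_values('common_skills', ascending=False)
                .head(2))
```

The self-merge produces a number of rows equal to the square of the input size, which is why restricting the input to a small subset keeps the problem tractable.
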

Step 1.4.1: Compute the top pairs of peers

# TODO: Finish the collect_peers function

people_df_subset = people_df.head(100)

def collect_peers(people_df_subset, skills_df, num):

   # YOUR CODE HERE

   raise NotImplementedError()

# CIS 545 Sanity Check 1.4.1 - please do not modify or delete this cell!

recs_df = collect_peers(people_df_subset, skills_df, 20)

display(recs_df)

if "person_1" not in recs_df:

   raise AssertionError("person_1 column not defined")

if "person_2" not in recs_df:

   raise AssertionError("person_2 column not defined")

if "common_skills" not in recs_df:

   raise AssertionError("common_skills column not defined")

if(len(recs_df) != 20):

   raise AssertionError('Wrong number of rows in recs_df')

# CIS 545 Hidden Test 1.4.1.1 - please do not modify or delete this cell!

# CIS 545 Hidden Test 1.4.1.2 - please do not modify or delete this cell!

Complete the last_job function below that takes experience_df as input and returns the person ,

person's last (most recent) employment experience (three column dataframe).
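Picking one "most recent" row per person is often done by sorting and then dropping duplicates. In the sketch below the `start` column is a hypothetical stand-in for whatever field encodes recency in the real data, and the output columns `person`, `title`, and `org` are an assumption based on the surrounding text.

```python
import pandas as pd

toy_experience_df = pd.DataFrame({
    'person': ['p1', 'p1', 'p2'],
    'title': ['Intern', 'Data Scientist', 'Analyst'],
    'org': ['Acme', 'IBM', 'Initech'],
    'start': ['2010', '2014', '2012'],   # hypothetical recency column
})

# Sort so the most recent position per person comes first, then keep
# only that first row for each person.
toy_last_job_df = (toy_experience_df
                   .sort_values(['person', 'start'], ascending=[True, False])
                   .drop_duplicates(subset='person', keep='first')
                   [['person', 'title', 'org']]
                   .reset_index(drop=True))
```
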

Step 1.4.2: Get the last jobs

# TODO: Complete the last_job function

def last_job(experience_df):

   # YOUR CODE HERE

   raise NotImplementedError()

# CIS 545 Sanity Check 1.4.2 - please do not modify or delete this cell!

last_job_df = last_job(experience_df)

display(last_job_df)

if(len(last_job_df.columns) != 3):

   raise AssertionError('Wrong number of columns in last_job_df')

# CIS 545 Hidden Test 1.4.2.1 - please do not modify or delete this cell!

# CIS 545 Hidden Test 1.4.2.2 - please do not modify or delete this cell!

# CIS 545 Hidden Test 1.4.2.3 - please do not modify or delete this cell!

Complete the recommend_jobs function below that takes recs_df, names_df, and last_job_df as

person_2's most recent title and org.

Step 1.4.3: Recommend jobs

# TODO: Complete the recommend_jobs function

def recommend_jobs(recs_df, names_df, last_job_df):

   # YOUR CODE HERE

   raise NotImplementedError()

# CIS 545 Sanity Check 1.4.3 - please do not modify or delete this cell!

recommended_df = recommend_jobs(recs_df, names_df, last_job_df)

display(recommended_df)

if "family_name" not in recommended_df:

   raise AssertionError("family_name column not defined")

if "given_name" not in recommended_df:

   raise AssertionError("given_name column not defined")

if "person_2" not in recommended_df:

   raise AssertionError("person_2 column not defined")

if "org" not in recommended_df:

   raise AssertionError("org column not defined")

if "title" not in recommended_df:

   raise AssertionError("title column not defined")

# CIS 545 Hidden Test 1.4.3 - please do not modify or delete this cell!

This last section relates to our discussions in lecture about computation efficiency with big data.

Step 2: Compare Evaluation Orders

Let's look at some computation and optimization tasks. We'll start with the code from our lecture note

dataframes.

Step 2.0: Load custom functions

# Join using nested loops

def merge(S,T,l_on,r_on):

   ret = pd.DataFrame()

   count = 0

   S_ = S.reset_index().drop(columns=['index'])

   T_ = T.reset_index().drop(columns=['index'])

   for s_index in range(0, len(S)):

       for t_index in range(0, len(T)):

           count = count + 1

           if S_.loc[s_index, l_on] == T_.loc[t_index, r_on]:

               ret = ret.append(S_.loc[s_index].append(T_.loc[t_index].drop(labels=r_

   print('Merge compared %d tuples'%count)

   return ret

 

# Join using a *map*, which is a kind of in-memory index

# from keys to (single) values

def merge_map(S,T,l_on,r_on):

   ret = pd.DataFrame()

   T_map = {}

   count = 0

   # Take each value in the r_on field, and

   # make a map entry for it

   T_ = T.reset_index().drop(columns=['index'])

   for t_index in range(0, len(T)):

       # Make sure we aren't overwriting an entry!

       assert (T_.loc[t_index,r_on] not in T_map)

       T_map[T_.loc[t_index,r_on]] = T_.loc[t_index]

       count = count + 1

   # Now find matches

   S_ = S.reset_index().drop(columns=['index'])

   for s_index in range(0, len(S)):

       count = count + 1

       if S_.loc[s_index, l_on] in T_map:

               ret = ret.append(S_.loc[s_index].append(T_map[S_.loc[s_index, l_on]].d

   print('Merge compared %d tuples'%count)

   return ret
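To see why the two strategies differ in cost, the same counting convention can be reproduced on plain lists of dicts, without pandas. This is an illustrative sketch with invented data and hypothetical function names: the nested-loop version examines every pair of rows, while the map version touches each row of `T` once to build the index and each row of `S` once to probe it.

```python
def nested_loop_join(S, T, key):
    """Compare every S row with every T row."""
    out, count = [], 0
    for s in S:
        for t in T:
            count += 1
            if s[key] == t[key]:
                out.append({**t, **s})
    return out, count

def map_join(S, T, key):
    """Index T by key once, then probe the index once per S row."""
    out, count, index = [], 0, {}
    for t in T:
        assert t[key] not in index  # keys in T must be unique
        index[t[key]] = t
        count += 1
    for s in S:
        count += 1
        if s[key] in index:
            out.append({**index[s[key]], **s})
    return out, count

# 20 rows on the left, 100 rows (even ids only) on the right.
S = [{'id': i, 's_val': i * 10} for i in range(20)]
T = [{'id': i, 't_val': i * 100} for i in range(0, 200, 2)]

nested_rows, nested_count = nested_loop_join(S, T, 'id')
map_rows, map_count = map_join(S, T, 'id')
```

With these sizes the nested-loop join performs 20 × 100 = 2000 comparisons and the map join 100 + 20 = 120 steps, while both return the same 10 matching rows. The map version only applies when the join key is unique on the indexed side, which is what its assertion checks.
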

Reimplement recommend_jobs using the above merge or merge_map functions instead of Pandas' m

You should start with the dataframes recs_df, names_df, and last_job_df from above. Store your

Step 2.1: Find an optimal order of evaluation.

# TODO: Reimplement recommend jobs using our custom merge and merge_map functions

def recommend_jobs_new(recs_df, names_df, last_job_df):

   # YOUR CODE HERE

   raise NotImplementedError()

# CIS 545 Sanity Check 2.1 - please do not modify or delete this cell!

%%time

recs_new_df = recommend_jobs_new(recs_df, names_df, last_job_df)

if(len(recs_new_df.columns) != 5):

   raise AssertionError('Wrong number of columns in recs_new_df')

1. When you are done, select “Edit” at the top of the window, under the filename, not the one that may appear ab

do this just before turning in your homework because it reduces the size of your file.

Step 3: Submitting Your Homework


2. In the same menu under the filename, select “File” and then “Download .ipynb”. It is very importa

of this downloaded notebook. Make sure that something like “(1)” did not get added to the filena

the .py version. Our autograder can only handle .ipynb files with the correct file name.

3. Compress the ipynb file into a Zip file hw2.zip.

4. Go to the submission site, and click on the Google icon. Log in using your Google@SEAS (if at al

student) GMail account.

5. Click on the Courses icon at the top, then select CIS 545 and Save. Select cis545-2019c-hw2 an

6. You should see a message on the submission site notifying you about whether your submission

necessary, but may have to withdraw your previous submission in OpenSubmit in order to do so.

If you have not already, please go to Settings and set your Student ID to your PennID (all numbers).

