Date: 2022-11-25 09:44

CSCI 141 Project 4: Analyzing Immunization Data


Due December 2nd, 2022 by 2200 hrs (10:00 PM EST)


UNICEF maintains a database which houses data sets related to health, development, and other information

related to maternal and child health. For this project, we will use immunization data maintained by

UNICEF, which contains information on yearly vaccinations administered to children around the world.

The vaccine data you have been given is sourced from: https://data.unicef.org/topic/child-health/immunization/.


This data considers vaccines for the following infectious diseases and agents (abbreviation for vaccine

shown in all caps): tuberculosis, BCG; diphtheria, pertussis, and tetanus, DTP1 and DTP3; meningococcal

disease, MCV1 and MCV2; hepatitis B, HEPBB and HEPB3; Haemophilus influenzae, HIB1; polio, IPV3

and POL3; pneumococcal disease, PCV3; rubella, RCV1; rotavirus, ROTAC; and Yellow Fever Virus,

YFV. Data is categorized as the percentage of children vaccinated, and is provided both globally and

regionally (e.g. East Asia and Pacific, Middle East and North Africa, etc.).


You will create functions to process the data and will write a main program that performs data QC and

makes use of your functions. You have been given three files:


• vaccine_data.csv is a comma-delimited text file which contains all of the data.
• Project_4.py is a skeleton file where you will write your functions. Import lines have been provided for you, but you must write the def lines according to the specifications below.
• Project_4_Main.py is a skeleton file which will contain your main program.


In order to complete this assignment you must have functional versions of the following packages installed:

pandas, numpy.


BE AWARE: Your project submission (Project_4.py and Project_4_Main.py) will be graded on

style in terms of using pandas methods where appropriate and writing compact code as needed and

specified in the instructions. In order to receive full credit, you must use pandas objects and the

pandas/numpy libraries to edit data when possible. This doesn’t mean you can’t use multiple lines

or include conditionals, but rather that if something can be done with a pandas function or method,

you shouldn’t write loops to iterate over data frames and series even if you can get the expected

output. Implementations which write code to take the place of pandas functions/methods and/or

that use other imported objects may not receive credit. Manipulations to data performed without

corresponding code (e.g. opening data in Excel and editing it) will receive no credit.


Part One: Data QC


The first part of this project requires reading in the file vaccine_data.csv and reformatting some of the data.

It will be helpful for you to look at the data frame after each step. Ask if you don’t know what this means.


You have been provided with code in the main program to correct. You must edit these lines in place. You

may not add any new lines of code or alter these lines dramatically - this means you must use the

pandas functionality and should not add other functions, loops, or list comprehensions. All of the

changes you need to make are relatively minor.


The code you have been given to debug should do the following:


(1) Read the data in from the file to a pandas data frame called vaccine; note that there are no column

names in the raw data


(2) Name the columns: 'Region', 'Vaccine', 'Year', and 'Percentage' (the quotes indicate that these are strings;

there should not be quotes in the actual text of your column names)


(3) Update region names to remove spaces and ampersands, for example, 'Eastern & Southern Africa' should

be changed to 'Eastern_and_Southern_Africa'


(4) Change the type of the Year column to a string


(5) Create a new column named Description that contains the full name of the pathogen or disease that the

vaccine is administered for; you MUST use the dictionary provided to accomplish this task. This column will

end up as the last column in the data frame – that is fine, you do not and should not move it.


(6) Drop any rows with missing data (ANY missing data)
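
As a rough illustration of the six steps above, here is a minimal sketch on a tiny in-memory stand-in for the file. The sample rows and the descriptions dictionary are made up for this example; the real dictionary and the lines you must correct are in the skeleton file and may be written differently.

```python
import io
import pandas as pd

# Hypothetical stand-in for vaccine_data.csv: no header row, four fields per line.
raw = io.StringIO(
    "Eastern & Southern Africa,BCG,2019,81\n"
    "East Asia & Pacific,DTP1,1980,\n"
    "South Asia,BCG,2019,92\n"
)

# Hypothetical abbreviation-to-description dictionary (the real one is provided).
descriptions = {"BCG": "tuberculosis", "DTP1": "diphtheria, pertussis, and tetanus"}

# (1) header=None tells pandas that the first line is data, not column names.
vaccine = pd.read_csv(raw, header=None)

# (2) Assign the four column names.
vaccine.columns = ["Region", "Vaccine", "Year", "Percentage"]

# (3) Replace ampersands with "and", then spaces with underscores.
vaccine["Region"] = (
    vaccine["Region"]
    .str.replace("&", "and", regex=False)
    .str.replace(" ", "_", regex=False)
)

# (4) Convert the Year column to strings.
vaccine["Year"] = vaccine["Year"].astype(str)

# (5) Look up each abbreviation in the dictionary; the new column lands last.
vaccine["Description"] = vaccine["Vaccine"].map(descriptions)

# (6) Drop every row that has at least one missing value.
vaccine = vaccine.dropna()

print(vaccine)
```

The second sample row has an empty Percentage field, so it is the one removed in step (6).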


If you are doing this in a notebook, we strongly suggest putting each line of code in its own cell. That way

you can look after each step to see if things worked the way they should have. If you put the code all in one

block, it can be very hard to figure out where the errors are originating.


Part Two: Function


For this section of the project you will create one function to use with your processed data frame or other

data frames in a similar format.


make_subset(df, region = None, vaccine = None, year = None, additive = True)


This function returns a data frame that is a subset (or a copy) of the data frame passed in by the user as the

required argument df. This data frame has at least three columns representing the region, vaccine, and

year. The values in the Year column are strings.


The optional arguments region, vaccine, and year, which will be lists of one or more strings if passed in,

allow the user to specify which subset of the data they are looking for. These arguments all have a default

value of None. The optional argument additive is a Boolean.


When additive is True, for the optional arguments region, vaccine and year, the user may specify values

for all three arguments, for only two of the arguments, or for a single argument. If the user specifies

none of these three arguments, you should return a COPY of the original data frame. Do not return

the original data frame. If you don’t understand the difference, please ask us to clarify.
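
To make the difference between the original data frame and a copy concrete, here is a small sketch. The one-row data frame and the variable names alias and duplicate are made up for this illustration.

```python
import pandas as pd

# Hypothetical one-row data frame, used only for illustration.
df = pd.DataFrame({"Region": ["South_Asia"], "Vaccine": ["BCG"], "Year": ["2019"]})

alias = df             # a second name for the SAME object
duplicate = df.copy()  # an independent data frame with the same contents

# Editing the copy leaves the original untouched.
duplicate.loc[0, "Year"] = "1980"

print(alias is df)        # True: alias and df are one object
print(duplicate is df)    # False: duplicate is a separate object
print(df.loc[0, "Year"])  # the original still holds "2019"
```

A function that returns its argument unchanged hands back the same object, so any later edit by the caller would also change the data frame that was passed in.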


When additive is False, the user must specify values for all three arguments region, vaccine, and year.


Here are a few examples, so that you can clearly see what is happening. These examples make use of a small

set of data. The data frame passed in for df in all of these examples consists of the following data:


Notice that the columns are not sorted in any particular way. You should not assume the data is sorted when

writing your subsetting code.


Please note that these examples are not exhaustive, i.e., they don’t show every possible case. The returned

data frames are shown as they appear in a Jupyter notebook. When you run your code from the

command line, if you explicitly print the results, they will not be formatted neatly like the examples shown

here.


Example 1: df is the data frame from the introduction, additive is False, vaccine is ['PCV3',

'HEPB3']; region is ['West_and_Central_Africa'], year is ['1981', '1987']; function returns a data frame with

the following rows:


When additive is False, we treat region, vaccine and year as OR requirements. Rows which meet any of the

criteria will be part of the output data frame, i.e., the output data frame will include any rows where the

vaccine is PCV3 or HEPB3 OR where the region is West_and_Central_Africa OR where the year is 1981 or

1987.
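
One way to picture the OR behavior is with Boolean masks built by the isin method. The sketch below uses made-up rows, not the example data from this handout, and the mask names are chosen for this illustration only.

```python
import pandas as pd

# Hypothetical rows, used only for illustration.
df = pd.DataFrame({
    "Region": ["West_and_Central_Africa", "South_Asia", "South_Asia", "South_Asia"],
    "Vaccine": ["BCG", "PCV3", "BCG", "BCG"],
    "Year": ["2000", "2000", "1981", "2000"],
})

# Each mask is a Boolean Series with one True/False value per row.
in_region = df["Region"].isin(["West_and_Central_Africa"])
in_vaccine = df["Vaccine"].isin(["PCV3", "HEPB3"])
in_year = df["Year"].isin(["1981", "1987"])

# The | operator keeps a row when at least one mask is True.
any_match = df[in_region | in_vaccine | in_year]
print(any_match)
```

Here the first row matches on region, the second on vaccine, and the third on year; the last row matches none of the criteria and is left out.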


Example 2: df is the data frame from the introduction, additive is True, vaccine is ['PCV3',

'HEPB3']; region is ['West_and_Central_Africa'], year is ['1981', '1987']; function returns an empty data

frame:


When additive is True, we treat region, vaccine and year as AND conditions, i.e., the output data frame

will only include rows where all three - the region, the vaccine, and the year - meet the conditions. In

this example, the user passes in a combination of arguments for which there are no rows that meet all of the

criteria; the function returns an empty data frame.
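
The AND behavior, including the empty result, can be pictured the same way with the & operator. This is a sketch on made-up rows rather than the example data from this handout.

```python
import pandas as pd

# Hypothetical rows, used only for illustration.
df = pd.DataFrame({
    "Region": ["West_and_Central_Africa", "South_Asia", "South_Asia", "South_Asia"],
    "Vaccine": ["BCG", "PCV3", "BCG", "BCG"],
    "Year": ["2000", "2000", "1981", "2000"],
})

in_region = df["Region"].isin(["West_and_Central_Africa"])
in_vaccine = df["Vaccine"].isin(["PCV3", "HEPB3"])
in_year = df["Year"].isin(["1981", "1987"])

# The & operator keeps a row only when every mask is True.
all_match = df[in_region & in_vaccine & in_year]
print(all_match.empty)  # no row satisfies all three conditions

# With fewer conditions, more rows can qualify.
bcg_2000 = df[df["Vaccine"].isin(["BCG"]) & df["Year"].isin(["2000"])]
print(bcg_2000)
```

No special case is needed for the empty result: indexing with a mask that is False everywhere already yields a data frame with the same columns and no rows.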


Example 3: df is the data frame from the introduction, additive is True, vaccine is ['PCV3', 'RCV1',

'HEPB3']; function returns a data frame with the following rows:


Example 4: df is the data frame from the introduction, additive is True, region is

['West_and_Central_Africa'], year is ['1981', '1987']; function returns a data frame with the following rows:


NOTE: if you have written your subsetting properly, you do not need to check separately for the case where

the inputs don’t match any of the rows – this should happen automatically without you writing any additional

code.


Key points:


• You can assume that the user will pass in inputs of the correct types and formats. df will always be a data frame; region (if passed in), vaccine (if passed in), and/or year (if passed in) will always be lists of one or more strings; additive will always be a Boolean. You can also assume that the user will pass in values for all three arguments - region, vaccine, and year - when additive is False.
• If you are reading in from a file anywhere in this function, you are doing it wrong.
• You can hardcode the column names, like Region, for example.
• The basis of this function is subsetting a data frame. We have discussed this. If you are looking up pandas methods to append data frames to each other, pivot the data frame, or complicated code we didn’t talk about, you are probably doing it wrong.
• Remember that if the user calls the function with no arguments for vaccine, year, or region, your function should return a COPY of the data frame.
• Don’t overcomplicate this. Our reference implementation is 7 to 13 lines long (not including the def line). We expect your code to be simplified and efficient. If you create separate cases for each possible combination of arguments, your code is overly complicated.
• You will be graded to some degree on programming style – specifically: parsimony in terms of not repeating the same EXACT line(s) of code several times and using pandas methods and structures whenever possible.
• Curious about how to make a copy of a data frame? Use the copy method.


Part Three: Putting it all together


For this part of the assignment, you will add lines to the main program to use the function you wrote in

Part 2 on the re-formatted data frame you created in Part 1. The lines of code you write for Part 3 will follow

(i.e., go under) the lines of code from Part 1. ASK IF YOU DO NOT UNDERSTAND THIS – YOUR

CODE WILL NOT WORK PROPERLY OTHERWISE.


You must use your function to accomplish the following tasks. You should not repeat code from the

function. Use good style – don’t pass in default arguments, and don’t provide unnecessary arguments.

Some tasks will require additional code. This is indicated in the instructions. Read carefully!


• When additive is True, create a data frame called BCG_2019 that contains the rows from the vaccine data frame that correspond to BCG vaccinations for the year 2019. This will include all available regions.
• From the data frame you made above, create and print a pandas Series called BCG2019_Series that has the data in the Region column as the index and the data from the Percentage column as the values. The easiest way to do this is to create a new data frame with Region as the index and then select the Percentage column.
• When additive is False, create a data frame called DTP1_Years that contains the rows from the vaccine data frame that correspond to DTP1 vaccinations, in the East Asia and Pacific region or for the year 1980.
• From the data frame you made above, create and print a pandas Series called DTP1_series that has the data in the Year column as the index and the data from the Percentage column as the values. The easiest way to do this is to create a new data frame with Year as the index and then select the Percentage column.
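
The "set an index, then select a column" approach mentioned above can be sketched as follows. The rows, the percentages, and the names subset and by_region are invented for this illustration and are not taken from the real data.

```python
import pandas as pd

# Hypothetical subset with made-up percentages, used only for illustration.
subset = pd.DataFrame({
    "Region": ["South_Asia", "West_and_Central_Africa"],
    "Vaccine": ["BCG", "BCG"],
    "Year": ["2019", "2019"],
    "Percentage": [92.0, 75.0],
})

# set_index returns a new data frame indexed by Region;
# selecting one column of it gives a Series that keeps that index.
by_region = subset.set_index("Region")["Percentage"]
print(by_region)
```

The same pattern applies with Year as the index.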


SUBMISSION EXPECTATIONS


Project_4.py: Your function code goes in this file. You have been given a skeleton file that contains the

appropriate import lines. Changing default arguments, the order of arguments, the number of arguments

etc. is not permitted.


Project_4_Main.py: Your corrected data QC code for the .csv file from Part 1 and the additional main program

code from Part 3 go in this file.


Project_4.pdf: A PDF document containing your reflections on the project. You must also cite any

sources you use. Please be aware that you can consult sources, but all code written must be your own.

Programs copied in part or wholesale from the web or other sources or individuals will result in reporting

of an Honor Code violation.

If your code contains structures NOT mentioned in class or readings, please include the following in your

write-up:

If it's a method:

(1) What does this method do? What should be its input and output?

(2) Why do you use this new method instead of the way we learned in class or readings or labs? What is

the advantage?

If it's not a method, but a concept, e.g., recursive data structures,

(1) Apply it to one of the examples in lecture notes.

(2) For recursive data structures, please specify base case(s) and recursive case(s).

(3) Why do you use it instead of the way we learned in class or readings or labs? What is the advantage?


You can expect significant grade penalties if you get worse time and/or space complexity by introducing

anything that is NOT mentioned in class or readings or if the program can be written in a more concise

way using anything we learned.


POINT VALUES AND GRADING RUBRIC


Part 1: Data QC (30 pts)

Part 2: make_subset function (27 pts)

Part 3: Main program (30 pts)

Writeup (2.5 pts)

Autograder (10.5 pts)

