Final Project

Stat 428

I. Simulation Problem (50 points)

In the lecture, we discussed nearest neighbor tests and the energy distance test for the two-sample testing problem. We consider two more tests: the two-sample Hotelling’s T-square test and the graph-based two-sample test. Suppose we observe data X1, . . . , Xn and Y1, . . . , Ym, where Xi, Yj ∈ R^d are multivariate random vectors. Here, X1, . . . , Xn are drawn from distribution F and Y1, . . . , Ym are drawn from distribution G. The hypothesis of interest in the two-sample testing problem is

H0 : F = G and H1 : F ≠ G.

The graph-based two-sample test is defined as follows. We pool all the data together:

Z1, . . . , Zn+m = X1, . . . , Xn, Y1, . . . , Ym.

Based on these n + m observations, we construct a graph G = (V, E) whose vertex set is V = {1, . . . , n + m}, with an edge between i and j if ‖Zi − Zj‖ ≤ Q, where Q is a positive number. Let E be the collection of edges. The graph-based two-sample test statistic is defined as

\[ T = \frac{1}{|E|} \sum_{e \in E} I_e, \]

where |E| is the number of edges in the edge set E. Here, Ie = 1 if the two vertices connected by e have the same label and Ie = 0 otherwise.
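As a rough illustration of this construction, the R sketch below computes the fraction of edges that join same-label vertices and calibrates it by permuting the labels. It assumes the statistic is that fraction, Euclidean distance, and a one-sided permutation p-value; the function names graph_stat and graph_test, the value of Q, and the simulated data are illustrative choices, not part of the assignment.

```r
# Fraction of edges whose two endpoints carry the same label (assumed form)
graph_stat <- function(z, labels, Q) {
  d <- as.matrix(dist(z))              # pairwise Euclidean distances
  edge <- d <= Q & upper.tri(d)        # edge i-j (i < j) when distance <= Q
  if (!any(edge)) return(NA_real_)     # no edges: statistic undefined
  same <- outer(labels, labels, "==")  # TRUE when both endpoints share a label
  sum(same & edge) / sum(edge)
}

# Permutation test: the graph stays fixed, only the labels are shuffled
graph_test <- function(x, y, Q, B = 199) {
  z <- rbind(x, y)
  labels <- rep(c(1, 2), c(nrow(x), nrow(y)))
  obs <- graph_stat(z, labels, Q)
  perm <- replicate(B, graph_stat(z, sample(labels), Q))
  # Large values suggest that nearby points tend to share a label
  p <- (1 + sum(perm >= obs)) / (B + 1)
  list(statistic = obs, p.value = p)
}

set.seed(428)
x <- matrix(rnorm(40 * 2), ncol = 2)            # sample from F
y <- matrix(rnorm(40 * 2, mean = 3), ncol = 2)  # sample from G (shifted)
res <- graph_test(x, y, Q = 1)
res
```

If Q is too small the graph may have no edges, in which case this sketch returns NA, so the choice of Q interacts with the scale of the data.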

Question 1 Report

A pharmaceutical company would like to test whether the effects of two treatments are similar. The manager wants to choose one two-sample testing method from among nearest neighbor tests, the energy distance test, Hotelling’s T-square test and the graph-based two-sample test, and asks for your advice on the choice. First, could you help the manager implement these four methods from scratch: nearest neighbor tests, the energy distance test, Hotelling’s T-square test and the graph-based two-sample test? Second, could you prepare a report with some suggestions for the manager? In this report, you need to address at least four of the following points:

• Several parts of these tests can be customized, e.g., the threshold Q in the graph-based test, the number of neighbors in the nearest neighbor test, and the specific form of distance in the energy distance test and the graph-based test. Could you provide some suggestions on the choice of these customizable parts? You need to show some numerical experiments as evidence.

• Are these tests sensitive to the dimension d of the data?

• Are these tests sensitive to the specific distributions F and G?

• Which test has larger power, and under what conditions?

• Clearly, the power of a test depends on the sample sizes n and m and on how different F and G are under the alternative hypothesis. Could you prepare a plot showing the effect of sample size on power? Could you prepare another plot showing the effect of the difference between F and G on power?

• Are these methods able to control the Type I error?
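One possible way to organize such numerical experiments is a small Monte Carlo harness that estimates how often a test rejects. The sketch below uses Hotelling’s T-square test with a pooled covariance and its F reference distribution as a stand-in; any function that returns a p-value could take its place. The normal data with a common mean shift, the helper names hotelling_p and reject_rate, and the settings n = m = 30 and d = 3 are assumptions made for illustration.

```r
# Two-sample Hotelling T-square p-value (pooled covariance, F reference)
hotelling_p <- function(x, y) {
  n <- nrow(x); m <- nrow(y); d <- ncol(x)
  delta <- colMeans(x) - colMeans(y)
  S <- ((n - 1) * cov(x) + (m - 1) * cov(y)) / (n + m - 2)
  t2 <- (n * m / (n + m)) * drop(t(delta) %*% solve(S, delta))
  f <- (n + m - d - 1) / (d * (n + m - 2)) * t2
  pf(f, d, n + m - d - 1, lower.tail = FALSE)
}

# Proportion of simulated data sets on which the test rejects at level alpha
reject_rate <- function(n, m, d, shift, reps = 200, alpha = 0.05) {
  mean(replicate(reps, {
    x <- matrix(rnorm(n * d), ncol = d)
    y <- matrix(rnorm(m * d, mean = shift), ncol = d)
    hotelling_p(x, y) < alpha
  }))
}

set.seed(428)
type1 <- reject_rate(30, 30, d = 3, shift = 0)   # F = G: Type I error
shifts <- c(0.25, 0.5, 1)
power <- sapply(shifts, function(s) reject_rate(30, 30, d = 3, shift = s))
data.frame(shift = shifts, power = power)
```

With shift = 0 the rejection rate estimates the Type I error; varying shift, n, m or d and plotting the resulting rates gives the power curves asked for above.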

You need to submit both the Rmd and PDF files of your report.

Question 2 Presentation and Slides

Based on your report, could you prepare a 3-5 minute presentation summarizing your findings and suggestions? Assume your audience is the manager of this pharmaceutical company, who has only a very limited statistics background. For this question, you need to submit a video (I need to see you in this video) and your slides (both Rmd and PDF).

Question 3 R package (Bonus question: extra 10 points for the final project)

Could you prepare an R package that includes all four of your two-sample testing methods, together with a manual that explains how these methods can be used? To finish this question, you need to submit a compressed R package.

II. Real Data Problem (50 points)

The data for this project describe payments for child support made to a government agency. A “case” refers to a legal judgment that an absent parent (abbreviated in variable names as “AP”) must make child support payments. The data are distributed in four CSV files, which can be downloaded from Compass2g. The data are distributed “as is” as obtained from the agency (albeit anonymized). Most categorical variables are self-explanatory.

The file cases.csv has six columns, with one row for each case:

• CASE_NUM Unique case identifier
• CASE_STATUS ACV (active), IN_ (inactive), IC_ (closed), IO_ (legal), IS_ (suspend)
• CASE_SUBTYPE AO (arrears), EF (foster), MA (medical), NO (arrears), RA (regular), RN (regular)
• CASE_TYPE AF (AFDC), NA (non-afdc), NI (other)
• AP_ID Identifying number for absent parent
• LAST_PYMNT_DT Recorded date of last payment

The file parents.csv has 10 columns, with one row for each parent:

• AP_ID Unique identifier for parent
• AP_ADDR_ZIP Coded na for missing, 0 for “known unknown”, 1 for city, 2 south state, 3 north state, 4 other
• AP_DECEASED_IND AP is deceased
• AP_CUR_INCAR_IND AP is incarcerated
• AP_APPROX_AGE
• MARITAL_STS_CD Self-explanatory
• SEX_CD
• RACE_CD Categorical
• PRIM_LANG_CD Language code

• CITIZENSHIP_CD Citizenship code

The file children.csv has 9 columns:

• CASE_NUM Case number
• ID Unique identifier for child
• SEX_CD
• RACE_CD
• MARITAL_STS_CD Marital status code
• PRIM_LANG_CD Primary language
• CITIZENSHIP_CD
• DATE_OF_BIRTH_DT
• DRUG_OFFNDR_IND Past drug offence

The file payments.csv has only six columns, but more than 1.5 million records:

• CASE_NUM Case number for the payment
• PYMNT_AMT Dollar amount of payment
• COLLECTION_DT Date of payment
• PYMNT_SRC A (regular), C (worker comp), F (tax offset), I (interstate), S (st tax), W (garnish)
• PYMNT_TYPE A (cash), B (bank), C (check), D (credit card), E (elec trans), M (money order)
• AP_ID Absent parent ID

Question 1 File linkage integrity

(a) Read the four CSV files into R, building four data frames named “Cases”, “Parents”, “Children” and “Payments”. Show the dimensions of these data frames. (You may find it useful to save these data frames as Rdata objects in a file using the save command. You can then recover them with the load command more quickly than by rereading the CSV files.)

(b) What is the distribution of the number of children attached to a case? Show an appropriate plot of the distribution, and mark the location of the average number in the plot.

(c) The file children.csv may have more than one record for each child. What is the largest number of cases associated with a single child? Explain why you believe that this is indeed the same child.

(d) Does every absent parent (AP_ID) identified in the payments data have an identifying record in the parents data file?
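The reading step in (a) and the linkage check in (d) might be organized along the following lines. This is only a sketch: the helper names read_support_data and unlinked_ids, the folder name "data" and the file name "support.RData" are hypothetical. Because the text lists NA as a valid CASE_TYPE code, it may be worth checking how read.csv treats that string (its na.strings argument defaults to "NA").

```r
# Read the four files from `dir`; extra arguments are passed on to read.csv()
read_support_data <- function(dir = ".", ...) {
  files <- c(Cases = "cases.csv", Parents = "parents.csv",
             Children = "children.csv", Payments = "payments.csv")
  lapply(files, function(f) {
    read.csv(file.path(dir, f), stringsAsFactors = FALSE, ...)
  })
}

# AP_IDs that appear in the payments data but have no record in parents
unlinked_ids <- function(Payments, Parents) {
  setdiff(unique(Payments$AP_ID), Parents$AP_ID)
}

# Typical use (commented out because it needs the real files):
# dat <- read_support_data("data")      # "data" is a hypothetical folder
# list2env(dat, envir = globalenv())    # creates Cases, Parents, Children, Payments
# sapply(dat, dim)                      # dimensions of the four data frames
# save(Cases, Parents, Children, Payments, file = "support.RData")
# load("support.RData")                 # faster than rereading the CSV files
# length(unlinked_ids(Payments, Parents)) == 0
```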

Question 2 Recoding categories

Some categorical variables in these data frames have sparse (seldom observed) categories. For example, the variable PYMNT_SRC in Payments has category ‘M’ with 2 cases and category ‘R’ with 7. These are too few for modeling in regression.

Write a function named “pool_categories” that recodes a categorical variable into a “simpler” factor with fewer categories by pooling categories with counts below a threshold into a category labeled ‘Other’ (a factor level which your function should check does not already exist!). You might find the R operator %in% useful for this exercise.
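One way such a function could look is sketched below. The argument names threshold and other, the default threshold of 10, and the choice to stop with an error when the pooled label already exists are assumptions; the toy vector mimics the counts quoted above for ‘M’ and ‘R’, while the counts for the other categories are made up.

```r
pool_categories <- function(x, threshold = 10, other = "Other") {
  # Refuse to proceed if the pooled label is already one of the categories
  existing <- if (is.factor(x)) levels(x) else unique(as.character(x))
  if (other %in% existing) {
    stop("the label '", other, "' is already a category of x")
  }
  x <- as.character(x)
  counts <- table(x)
  rare <- names(counts)[counts < threshold]   # categories to pool
  x[x %in% rare] <- other
  factor(x)
}

src <- c(rep("A", 50), rep("W", 30), rep("M", 2), rep("R", 7))
pooled <- pool_categories(src, threshold = 10)
table(pooled)
```

Missing values are left as NA by this sketch, since %in% never matches them against the rare categories.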

Question 3 Payment counts and amounts

You must use ggplot2 for generating the plots asked for in this question.

(a) Make a variable Payments$DATE which is a valid R date by converting the COLLECTION_DT variable. Use this variable to find (i) the range of dates of all payments and (ii) the percentage of the total number of payments made before May 1, 2015.

(b) Show a sequence plot of the total number of payments made on each day from May 1, 2015 through the end of the data.

(c) What explains the bimodal shape of the marginal distribution of the number of payments over this time period? Explain with some evidence how you reached your opinion.

(d) Describe the distribution of the payment amounts. Do you have an explanation for its shape? (You might find it useful to work with a sample for plotting. R takes a while to draw 1.5 million points.)
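The date handling in (a) and the daily counts needed for (b) could be sketched as follows. The toy data and, in particular, the "%m/%d/%Y" format string are assumptions; the actual layout of COLLECTION_DT has to be checked in the file.

```r
# Toy data; the date format is an assumption about how COLLECTION_DT is stored
Payments <- data.frame(
  COLLECTION_DT = c("04/15/2015", "05/01/2015", "05/01/2015", "01/02/2017"),
  stringsAsFactors = FALSE
)
Payments$DATE <- as.Date(Payments$COLLECTION_DT, format = "%m/%d/%Y")

date_range <- range(Payments$DATE, na.rm = TRUE)
cutoff <- as.Date("2015-05-01")
pct_before <- 100 * mean(Payments$DATE < cutoff, na.rm = TRUE)

# Number of payments per day from the cutoff onward
recent <- Payments[Payments$DATE >= cutoff, ]
daily <- as.data.frame(table(DATE = recent$DATE), responseName = "n")
daily$DATE <- as.Date(as.character(daily$DATE))
daily
```

A data frame like daily can then be handed to ggplot2, for example with ggplot(daily, aes(DATE, n)) + geom_line(). Values that fail to parse become NA under as.Date, so counting the NAs is a quick check on the assumed format.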

Question 4 Most common parent

(a) Identify the parent with the most cases.

(b) Identify all of the different children associated with the cases of the parent identified in (a).

(c) What is the average age of these children, in years? Use their age as of Jan 1, 2017. (Fractions of a year are fine.)

(d) Show a plot of the payment history for this parent.
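For the age calculation in (c), a minimal sketch is given below. It assumes DATE_OF_BIRTH_DT has already been converted to class Date and approximates a year as 365.25 days; the helper name age_on and the two birth dates are made up for illustration.

```r
# Age in (fractional) years on a reference date, using 365.25 days per year
age_on <- function(dob, ref = as.Date("2017-01-01")) {
  as.numeric(difftime(ref, dob, units = "days")) / 365.25
}

dob <- as.Date(c("2010-01-01", "2015-07-01"))   # made-up birth dates
ages <- age_on(dob)
mean(ages)
```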

Question 5 Payments for cases

The unit of analysis for this question is the payment behavior of an absent parent. Hence, if the parent is involved in several cases, you will need to accumulate the relevant information. You may find it useful for this and the next question to build a data frame for parents that collects the relevant information for each parent. You may find dplyr useful here and elsewhere, but you don’t have to use it.
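A per-parent data frame of this kind might be assembled as in the base R sketch below (dplyr’s group_by and summarise would be an alternative). The toy data frames only mimic the column names given earlier, and the summary column names total_paid, n_payments and n_children, as well as the name ParentSummary, are hypothetical.

```r
# Toy stand-ins for the real data frames (values are made up)
Payments <- data.frame(AP_ID = c(1, 1, 2, 3, 3, 3),
                       PYMNT_AMT = c(100, 50, 200, 25, 25, 75))
Cases <- data.frame(CASE_NUM = c(10, 20, 30, 31), AP_ID = c(1, 2, 3, 3))
Children <- data.frame(CASE_NUM = c(10, 10, 20, 30, 31), ID = 101:105)

# Total amount and number of payments per parent
total <- aggregate(PYMNT_AMT ~ AP_ID, data = Payments, FUN = sum)
names(total)[2] <- "total_paid"
count <- aggregate(PYMNT_AMT ~ AP_ID, data = Payments, FUN = length)
names(count)[2] <- "n_payments"

# Link children to parents through the case table, counting each child once
kids <- merge(Children, Cases[, c("CASE_NUM", "AP_ID")], by = "CASE_NUM")
kids <- unique(kids[, c("AP_ID", "ID")])
n_kids <- aggregate(ID ~ AP_ID, data = kids, FUN = length)
names(n_kids)[2] <- "n_children"

# One row per parent; all = TRUE keeps parents missing from some table
ParentSummary <- Reduce(function(a, b) merge(a, b, by = "AP_ID", all = TRUE),
                        list(total, count, n_kids))
ParentSummary
```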

(a) It has been conjectured that parents deemed responsible for more children are more likely to make either a larger number of payments or a larger total payment amount over this period. Is that true?

(b) It has been conjectured that parents responsible for younger children are more likely to make more payments. Is the average age of the children of an absent parent associated with the total amount of payments made by the absent parent? (Define a child’s age as the age on Jan 1, 2017.)

(c) Does the location of the parent (AP_ADDR_ZIP) anticipate the total amount of payments made by the absent parent?

(d) Does the combination of attributes of the parent with the number and average age of the children involved predict the total amount of payments made by a parent? Explain your results briefly. (Note: It makes no sense to remove cases with missing values of a categorical variable. Missingness just defines another category of the variable.)

Question 6 Consistency

Again, the unit of analysis for this question is an absent parent. An important aspect of payments is the consistency of the payments over time. A steady income stream is, for many, preferable to a highly volatile, unpredictable payment schedule, even if the latter has a higher average.

(a) Among all parents who made payments, is there any association between the SD of total daily payments and the average of total daily payments?

(b) The coefficient of variation (CV) is the ratio of the SD of daily payments to the mean. Show time sequence plots of the payments of 3 parents, with low, medium and high CV. That is, find three representative parents who make payments. One of these three should have a high CV, another a medium CV, and a third a low CV.

(c) Is the CV of payments associated with the total amount of payments over this time period?

(d) Do any attributes of the parent as revealed in these data anticipate that the parent will make consistent payments, that is, have small CV?
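The SD, mean and CV of total daily payments per parent could be computed along these lines. The sketch assumes that "daily payments" means the totals on days with at least one payment (days without payments are not counted as zeros) and that a DATE column of class Date already exists; the toy values are made up.

```r
Payments <- data.frame(
  AP_ID = c(1, 1, 1, 1, 2, 2, 2),
  DATE = as.Date(c("2016-01-01", "2016-01-01", "2016-02-01", "2016-03-01",
                   "2016-01-05", "2016-02-05", "2016-03-05")),
  PYMNT_AMT = c(50, 50, 100, 100, 10, 100, 400)
)

# Total paid by each parent on each day with a payment
daily <- aggregate(PYMNT_AMT ~ AP_ID + DATE, data = Payments, FUN = sum)

# Per-parent mean, SD and coefficient of variation of the daily totals
avg <- tapply(daily$PYMNT_AMT, daily$AP_ID, mean)
sdev <- tapply(daily$PYMNT_AMT, daily$AP_ID, sd)
cv <- sdev / avg
cv
```

In this toy example parent 1 pays the same total on every payment day, so the CV is 0, while parent 2 is far more erratic. Parents with a single payment day get an SD, and hence a CV, of NA and would need separate handling.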
