Contact information

  • QQ: 99515681
  • Email: 99515681@qq.com
  • Working hours: 8:00-23:00
  • WeChat: codinghelp


Date: 2020-02-25 09:25

Overview of the data

The data is from the 1991 Survey of Income and Program Participation (SIPP). You are provided with 7933 observations.

The sample contains household data in which the reference person is aged 25-64 years, at least one person is employed, and no one is self-employed. The observation units correspond to the household reference persons.

The data set contains a number of feature variables that you can choose from to predict total wealth. The outcome variable (total wealth) and the feature variables are described in the next slide.

Dataframe with the following variables

Variable to predict (outcome variable):

  • tw: total wealth (in US $).

  • Total wealth equals net financial assets, including Individual Retirement Account (IRA) and 401(k) assets, plus housing equity plus the value of business, property, and motor vehicles.

Variables related to retirement (features):

  • ira: individual retirement account (IRA) (in US $).

  • e401: 1 if eligible for 401(k), 0 otherwise.

Financial variables (features):

  • nifa: non-401(k) financial assets (in US $).

  • inc: income (in US $).

Variables related to home ownership (features):

  • hmort: home mortgage (in US $).

  • hval: home value (in US $).

  • hequity: home value minus home mortgage.

Other covariates (features):

  • educ: education (in years).

  • male: 1 if male, 0 otherwise.

  • twoearn: 1 if two earners in the household, 0 otherwise.

  • nohs, hs, smcol, col: dummies for education: no high school, high school, some college, college.

  • age: age.

  • fsize: family size.

  • marr: 1 if married, 0 otherwise.
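Two of the definitions above imply exact linear relationships among the features, which matters when choosing predictors for OLS. The sketch below checks them on a few made-up rows (the numbers are illustrative only and are not taken from the SIPP data). It assumes the four education dummies are mutually exclusive and exhaustive, which the slide does not state explicitly.

```python
# Made-up rows for illustration only; NOT taken from the SIPP data.
rows = [
    {"hval": 120000, "hmort": 80000, "hequity": 40000,
     "nohs": 0, "hs": 1, "smcol": 0, "col": 0},
    {"hval": 0, "hmort": 0, "hequity": 0,
     "nohs": 0, "hs": 0, "smcol": 0, "col": 1},
    {"hval": 69000, "hmort": 60150, "hequity": 8850,
     "nohs": 1, "hs": 0, "smcol": 0, "col": 0},
]


def check_row(row):
    """Return (equity_ok, one_level) for a single household record."""
    # hequity is defined as home value minus home mortgage.
    equity_ok = row["hequity"] == row["hval"] - row["hmort"]
    # Assumption: exactly one education dummy is switched on per household.
    one_level = row["nohs"] + row["hs"] + row["smcol"] + row["col"] == 1
    return equity_ok, one_level


checks = [check_row(r) for r in rows]
print(checks)
```

If both checks hold in the real data, then hequity is an exact linear combination of hval and hmort, and the four dummies sum to one. An OLS model with an intercept cannot include all of these at once, so one variable from each group would need to be dropped.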

What are 401(k) and IRA?

  • Both 401(k) and IRA are tax-deferred savings options that aim to increase individual saving for retirement.

  • The 401(k) plan:

    • A company-sponsored retirement account to which employees can contribute.

    • Employers can match a certain % of an employee's contribution.

    • 401(k) plans are offered by employers -- only employees in companies offering such plans can participate.

    • The feature variable e401 contains information on the eligibility.

  • IRA accounts:

    • Everyone can participate -- you can go to a bank to open an IRA account.

    • The feature variable ira contains the value of the IRA account (in US $).

Collection of methods

We have already seen:

  • OLS

  • Ridge regression

  • Stepwise selection methods

  • Lasso

Note:

1. In the project, you should select different methods from the list above and compare their prediction performance and interpretability.

2. For Ridge, Stepwise selection, and Lasso, don't forget to use Cross-Validation.

3. In addition to prediction performance, you might want to think about whether the set of predictors used to predict total wealth makes intuitive sense.

Compare the prediction performances of different methods -- an example (this is just ONE EXAMPLE)

  • Say you have applied the Ridge regression and the Lasso.

  • For the Ridge regression, you use the K-fold CV (Slide 12) to choose the best $\lambda$, say $\lambda^*_{\text{ridge}}$. Given $\lambda^*_{\text{ridge}}$, estimate the model with the ENTIRE data.

    • Note that you have computed $\mathrm{avgCVerr}(\lambda^*_{\text{ridge}})$ in Step 6 of Slide 12.

  • For the Lasso, you also use the K-fold CV (Slide 12) to choose the best $\lambda$, say $\lambda^*_{\text{lasso}}$. Given $\lambda^*_{\text{lasso}}$, estimate the model with the ENTIRE data.

    • Note that you have computed $\mathrm{avgCVerr}(\lambda^*_{\text{lasso}})$ in Step 6 of Slide 12.

  • The best $\lambda$ for Ridge does not have to be the same as the best $\lambda$ for Lasso; that is, $\lambda^*_{\text{ridge}}$ does not necessarily equal $\lambda^*_{\text{lasso}}$.

  • Which do you choose to build the prediction/fitted model? Ridge estimates or Lasso estimates?

    • You compare $\mathrm{avgCVerr}(\lambda^*_{\text{ridge}})$ with $\mathrm{avgCVerr}(\lambda^*_{\text{lasso}})$.

    • If $\mathrm{avgCVerr}(\lambda^*_{\text{ridge}}) > \mathrm{avgCVerr}(\lambda^*_{\text{lasso}})$, choose Lasso to build the prediction/fitted model; otherwise, choose Ridge.

    • If $\mathrm{avgCVerr}(\lambda^*_{\text{ridge}})$ and $\mathrm{avgCVerr}(\lambda^*_{\text{lasso}})$ are similar, choose the one whose fitted model you find easier to understand (e.g., one with fewer predictors, where the predictors are intuitive).
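The comparison rule in this example can be written as a small helper. This is only a sketch: the function name `choose_model`, the use of the predictor count as a stand-in for "easier to understand", and the threshold `rel_tol` for deciding when two errors count as "similar" are all assumptions made for illustration, not part of the slides.

```python
def choose_model(err_ridge, err_lasso, n_pred_ridge, n_pred_lasso, rel_tol=0.01):
    """Pick "ridge" or "lasso" from their average CV errors.

    err_ridge, err_lasso: average CV error at each method's best lambda.
    n_pred_ridge, n_pred_lasso: number of predictors with nonzero coefficients.
    rel_tol: hypothetical threshold; errors within this relative gap are
             treated as "similar".
    """
    smaller = min(err_ridge, err_lasso)
    if abs(err_ridge - err_lasso) <= rel_tol * smaller:
        # Errors are similar: prefer the model with fewer predictors
        # (a crude proxy for interpretability).
        return "lasso" if n_pred_lasso < n_pred_ridge else "ridge"
    # Otherwise pick the method with the smaller average CV error.
    return "lasso" if err_ridge > err_lasso else "ridge"


print(choose_model(10.0, 8.0, n_pred_ridge=12, n_pred_lasso=5))
print(choose_model(8.0, 10.0, n_pred_ridge=12, n_pred_lasso=5))
print(choose_model(10.0, 10.05, n_pred_ridge=12, n_pred_lasso=5))
```

In practice the "similar" case is a judgment call; whether the selected predictors make intuitive sense cannot be reduced to a count, so the tie-break here is only a starting point.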

K-fold cross validation

1. Partition the data $D$ into $K$ separate sets of equal size: $D = (D_1, D_2, \dots, D_K)$; e.g., $K = 5$ or $10$.

2. For a given $\lambda$ and each $k = 1, 2, \dots, K$, estimate the model with the training data excluding $D_k$.

  • Denote the obtained model by $\hat{f}_{-k,\lambda}(\cdot)$.

3. Predict the outcomes for $D_k$ with the model from Step 2 and the input data in $D_k$.

  • The predicted outcomes are $\hat{f}_{-k,\lambda}(x)$ where $x \in D_k$.

4. Compute the sample mean squared (prediction) error for $D_k$, known as the CV prediction error:

  • $\mathrm{CVerr}_k(\lambda) = n_k^{-1} \sum_{(x, y) \in D_k} \left( y - \hat{f}_{-k,\lambda}(x) \right)^2$, where $n_k$ is the number of observations in $D_k$.

5. Compute the average of $\mathrm{CVerr}_k(\lambda)$ over all $K$ sets for each $\lambda$:

  • $\mathrm{avgCVerr}(\lambda) = K^{-1} \sum_{k=1}^{K} \mathrm{CVerr}_k(\lambda)$

6. Select $\lambda = \lambda^*$ that gives the smallest $\mathrm{avgCVerr}(\lambda)$.
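The six steps can be sketched in a few lines of NumPy. The example below uses a closed-form ridge estimator as the model being tuned; that choice, the synthetic data, the grid of candidate values, and the function names (`ridge_fit`, `ridge_predict`, `kfold_avg_cv_err`) are assumptions made for illustration. The project itself would use the SIPP variables and whichever method is being tuned.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge fit; centering keeps the intercept unpenalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return y_mean - x_mean @ beta, beta  # (intercept, coefficients)


def ridge_predict(model, X):
    intercept, beta = model
    return intercept + X @ beta


def kfold_avg_cv_err(X, y, lambdas, K=5, seed=0):
    """Return the average CV error for each candidate lambda (Steps 1-5)."""
    n = len(y)
    rng = np.random.default_rng(seed)
    # Step 1: shuffle the row indices and split them into K folds.
    folds = np.array_split(rng.permutation(n), K)
    avg_errs = []
    for lam in lambdas:
        fold_errs = []
        for test_idx in folds:
            train = np.ones(n, dtype=bool)
            train[test_idx] = False
            model = ridge_fit(X[train], y[train], lam)            # Step 2
            pred = ridge_predict(model, X[test_idx])              # Step 3
            fold_errs.append(np.mean((y[test_idx] - pred) ** 2))  # Step 4
        avg_errs.append(np.mean(fold_errs))                       # Step 5
    return np.array(avg_errs)


# Synthetic data for illustration only (not the SIPP data).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = 3.0 + X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(size=200)

lambdas = np.array([0.01, 0.1, 1.0, 10.0, 100.0])  # arbitrary candidate grid
avg_cv_err = kfold_avg_cv_err(X, y, lambdas, K=5)
best_lam = lambdas[np.argmin(avg_cv_err)]          # Step 6
final_model = ridge_fit(X, y, best_lam)            # refit on the ENTIRE data
print(best_lam, avg_cv_err)
```

The last line of the fitting code refits at the selected value on the entire data, matching the comparison example earlier. `np.array_split` yields folds of exactly equal size only when the number of observations is divisible by $K$; otherwise the fold sizes differ by at most one.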

