
Date: 2019-02-20 09:47

Preliminary Information

In your report do not just replicate the process followed during the workshops! The objective of the

workshops is to introduce you to the different techniques discussed during the lectures, and not to give

you a roadmap on how to answer the coursework.

Assessment: In the coursework you will be assessed based on:

1. Your ability to correctly use the tools that we covered in the course

2. Your ability to draw the correct conclusions from these tools

3. Your ability to address the questions posed in the coursework based on an intelligent interpretation

of the evidence provided in the previous two steps. (Consult the CRISP-DM process described in

Chapter 1 of Guide to Intelligent Data Analysis)

You will not be assessed on your capability to use R or any other software. For this reason don’t

include screenshots from any software or any other information about commands you used, or options

you set, or how to draw a figure etc. You will be simply wasting valuable space.

You are free to use any software you like to do the coursework. However, you can’t use as an excuse

the fact that you couldn’t do a particular task because the software you chose doesn’t offer a particular

capability which we covered in the workshops.

Page limits: Your report must be submitted as a PDF file that does not exceed 12 pages, with at

least 11 point typeface. This limit is strict and it includes appendices (which I strongly recommend

that you don’t use). If your report exceeds the page limit I will simply stop reading at the end of page

12 and not take into account anything from the remaining pages in the assessment.

Plagiarism: This is an individual piece of assessment, and you should ensure that your report reflects

your own work exclusively.

All reports go through automated software to detect plagiarism from a variety of sources (including

past and current students’ reports as well as online resources, conference and journal publications etc.).

The consequences of plagiarism are very serious.

Problem Description/ Project Objectives

A bank wants you to develop a credit scoring model to classify applications for unsecured loans. You have

been provided with a sample of observations which contain information about past bank customers. (The

dataset provided to each student is unique.) The description of the variables in this dataset is provided in

the next section.

The bank is primarily interested in understanding the main factors that influence repayment

behaviour, so that it can exploit this knowledge to improve future decisions. The bank faces a trade-off

between accepting applicants for loans, so as to retain its share in the market and increase its profit through

interest payments, and on the other hand incurring losses due to giving loans to customers that default on

their debt. The bank managers are interested in the following questions:

What is the best way for the bank to use a statistical model to achieve the following goals:

– Accept the maximum number of good customers if at least 85% of bad customers are correctly

identified

– Accept at least 70% of good customers while rejecting as many bad customers as possible.

If the previous two goals were not specified, which statistical model would you recommend, and why?

Compare this model to the ones recommended in the previous question, and discuss similarities and

differences.

How many and which are the most important variables that determine the repayment behaviour of

mortgage customers? (Do these differ depending on the objective, and/or the classification method

used?)
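
The two goals above can be read as constrained choices of a classification cut-off. The following is a minimal sketch only, assuming a model that outputs a score interpreted as the estimated probability of BAD=1 and that applicants scoring at or above the cut-off are rejected. The function names, the goal labels and the toy data are hypothetical and are not part of the coursework.

```python
def rates_at(threshold, scores, labels):
    """Reject applicants with score >= threshold.

    Returns (fraction of bad customers identified,
             fraction of good customers accepted).
    """
    bad = [s for s, y in zip(scores, labels) if y == 1]
    good = [s for s, y in zip(scores, labels) if y == 0]
    bad_caught = sum(s >= threshold for s in bad) / len(bad)
    good_accepted = sum(s < threshold for s in good) / len(good)
    return bad_caught, good_accepted


def pick_threshold(scores, labels, goal):
    """Return (threshold, objective value) for one of the two goals.

    goal == "catch_85_bad":   identify >= 85% of bad, maximise good accepted.
    goal == "accept_70_good": accept >= 70% of good, maximise bad rejected.
    Returns None if no cut-off satisfies the constraint.
    """
    # Every observed score is a candidate; infinity means "reject nobody".
    candidates = sorted(set(scores)) + [float("inf")]
    best = None
    for t in candidates:
        bad_caught, good_accepted = rates_at(t, scores, labels)
        if goal == "catch_85_bad":
            feasible, objective = bad_caught >= 0.85, good_accepted
        else:
            feasible, objective = good_accepted >= 0.70, bad_caught
        if feasible and (best is None or objective > best[1]):
            best = (t, objective)
    return best


# Toy data (invented for illustration): scores and true BAD labels.
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]
labels = [0, 0, 0, 0, 1, 0, 0, 1, 1, 1]

goal_1 = pick_threshold(scores, labels, "catch_85_bad")
goal_2 = pick_threshold(scores, labels, "accept_70_good")
```

On this toy data the first goal selects a cut-off of 0.35 (all four bad customers identified, four of six good customers accepted), while the second selects 0.60 (five of six good customers accepted, three of four bad customers rejected). The two goals generally lead to different cut-offs, and possibly to different models.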

Data Description

You are provided with a sample of observations which contain information about past bank customers. The

dataset provided to each student is unique. The main variables in this dataset are described in Table 1.

The class variable (i.e. the variable we want to predict) is called BAD. There are 9 more variables in the dataset

you were provided with in addition to those described in the table. Each of these variables is encoded as M

and the name of one of the main variables: for example, M MORTDUE, or M DEBTINC. All the M variables are

binary (i.e. take values in {0, 1}). They were created because the original dataset contained a large number

of missing values. For each variable that had missing values in the original data (e.g. MORTDUE) the missing

values were replaced, and a binary variable (M MORTDUE) was created that indicates whether the value of

the variable was missing in the original dataset (M MORTDUE=1) or not (M MORTDUE=0). In other words, the

value of a variable like DEBTINC is the actual, observed, value when M DEBTINC=0. When M DEBTINC=1 the

value of DEBTINC has been predicted (and therefore does not correspond to the actual value of this variable

for that customer). You don’t know which method was used to replace these missing values.

Name      Type        Description
BAD       Binary      1=applicant defaulted on loan or seriously delinquent, 0=applicant paid loan
LOAN      Continuous  Amount of the loan request
MORTDUE   Continuous  Amount due on existing mortgage
VALUE     Continuous  Value of current property
REASON    Nominal     Not Provided; DebtCon=debt consolidation; HomeImp=home improvement
JOB       Nominal     Occupational categories
YOJ       Continuous  Years at present job
DEROG     Continuous  Number of major derogatory reports
DEBTINC   Continuous  Debt-to-income ratio
CLAGE     Continuous  Age of oldest credit line in months
NINQ      Continuous  Number of recent credit inquiries
CLNO      Continuous  Number of credit lines
DELINQ    Continuous  Number of delinquent credit lines

Table 1: Description of main variables in training dataset

Tasks

Exploratory Data Analysis (40 marks).

In particular, consider each variable and answer the following questions:

– Does this variable appear to be important for the task at hand? (After discussing each variable

separately provide a ranking of the importance of all explanatory variables.) Support your claims

with appropriate visualisations that document whether and how important each variable is.

– Are different variables related, and which variables convey information similar to that provided

in other variable(s)?

– Do you find evidence of “outliers” or other issues with data quality (e.g. incorrect observations)?

– For which variables is the fact that specific values were missing in the original dataset informative,

and what are the implications of this?
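
One simple way to approach the last question is to compare the proportion of BAD=1 customers between observations whose value was originally missing and those whose value was observed. The sketch below assumes the columns are available as plain Python lists; the function name and the toy data are hypothetical, and a real analysis would also need to consider sample sizes before drawing conclusions.

```python
def default_rate_by_flag(bad, flag):
    """Proportion of BAD=1 within each value (0 and 1) of a missingness flag.

    Returns a dict {0: rate, 1: rate}; a rate is None for an empty group.
    """
    rates = {}
    for group in (0, 1):
        outcomes = [b for b, f in zip(bad, flag) if f == group]
        rates[group] = sum(outcomes) / len(outcomes) if outcomes else None
    return rates


# Toy data (invented): class variable and one missingness indicator.
BAD = [0, 1, 0, 0, 1, 1, 0, 1]
M_DEBTINC = [0, 1, 0, 0, 1, 0, 0, 1]

rates = default_rate_by_flag(BAD, M_DEBTINC)
```

A large difference between the two rates would suggest that the fact of missingness itself carries information about repayment behaviour.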

Statistical Modelling (60 marks)

– What is the appropriate performance measure for this application and why? Relate this to the

project objectives.

– For the two types of classifiers, logistic regression and decision trees, discuss the different settings you

used and why you considered these important. (Consider the choice of variable selection method

as part of this question also.)

– For each classification method develop one or a few candidate models that you think are promising

before providing a final recommendation of the most appropriate model (for each question in the

project objectives section). You do not need to discuss every model you tried in detail, but

you must include the results for the important steps in the process that led you to the final

recommendations. I am particularly interested in understanding the steps you followed and the

justification for these. (Refer to the CRISP data mining process discussed during the lectures and

in Chapter 1 of the Guide to Intelligent Data Analysis).

– Comment on the generalisation performance of the model(s) you recommend for each type of

classifier.
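
Generalisation performance is usually judged on observations that were not used to fit the model. As one possible illustration (not a prescribed method), the sketch below splits a sample into training and hold-out indices while preserving the proportion of each class. The function name, the 30% hold-out fraction and the seed are assumptions made for the example.

```python
import random


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Split indices into train/test, keeping class proportions similar.

    Each class is shuffled separately and the same fraction of it is
    assigned to the hold-out set. Returns (train indices, test indices).
    """
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return sorted(train_idx), sorted(test_idx)


# Toy class variable (invented): 80 good and 20 bad customers.
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)
```

A model would then be fitted on the training indices only, and the chosen performance measure compared between the training and hold-out sets; a large gap between the two points to over-fitting.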

The coursework requires you to write a report explaining your findings. This means that you need to

explain each figure, table or number you include in the report. In other words, including a relevant figure

without explaining the conclusions drawn from it will get you no marks.

You do not need to write an executive summary, or include a cover page, and a page of contents.

You do need to include at the end of your coursework a Conclusions section which will summarise your

findings and will clearly answer the questions posed in the project objectives section. In this section I

would also recommend discussing the relative advantages and limitations of the two types of classifiers

for the problem at hand.

Report Assessment

Your coursework will not be evaluated by the quality of the final model alone, or by whether you got a

particular answer right. You will be primarily assessed by whether you are able to correctly justify the steps

you took to complete the assignment. In other words, your report needs to document that you are able to

intelligently analyse the provided data, that you draw correct conclusions from what you observe, and that

these conclusions lead you either to the next logical step of the data mining process, or to the revision of

decisions made in previous steps of the analysis. (Refer to the flowchart of data mining stages we covered in

the first lectures and in particular to the feedback loops)

Therefore, don’t simply present the conclusions/ results of your analysis and expect to get a high mark.

Reports that don’t document the steps followed and the reasons why these were chosen will receive minimal

marks, even if the final answer is sensible. Explain your reasoning clearly and in good English. Don’t

provide a list of bullet points, or unstructured sentences etc. Similarly, don’t include figures or any

other output from R that you don’t comment on or explain in the text. I will not assume that you know

how to interpret these correctly.

