Date: 2023-03-28 11:21

QBUS2820 Predictive Analytics


Individual Assignment 1


Key information


1. Required submissions (through Canvas/Assignments/Individual Assignment 1)

a. ONE written report (Word or PDF format)

b. ONE Jupyter Notebook .ipynb

Please upload both files to Canvas in the SAME submission, as separate files (NO

zip file).

2. Due date/time and closing date/time: See Canvas. The late penalty for the

assignment is 5% of the assigned mark per day, starting after 23:59 on the due

date.

3. Weight: 30% of the total mark of the unit.

4. Length: The main text of your report should have a maximum of 10 pages with the

usual font size 11-12. You should write a complete report including sections such as

business context, problem formulation, data processing, Exploratory Data Analysis

(EDA), methodology, analysis, conclusions and limitations, etc.

5. If you wish to include additional material, you can do so by creating an appendix.

There is no page limit for the appendix. Keep in mind that making good use of your

audience’s time is an essential business skill. Every sentence, table and figure has

to count. Extraneous and/or wrong material will reduce your mark no matter the

quality of the assignment.

6. Anonymous marking: In line with the University's anonymous marking policy, please

only include your student ID in the submitted report, and do NOT include your

name. The file names of your report and code file should use the following format.

Replace "SID" with your Student ID. Example: SID_Qbus2820_Assignment1.

7. Presentation/clarity is part of the assignment. Markers will allocate 10% of the marks for

clarity of writing and presentation. Numbers with decimals should be reported to

four decimal places.


Key rules:

Carefully read the requirements for each part of the assignment.

Please follow any further instructions announced on Canvas.

You must use Python for the assignment. Use "random_state=1" when needed, e.g.

when using the “train_test_split” function. For all other parameters that are not

specified in the questions, use the default values of the corresponding Python

functions.
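
As an illustration of the seeding rule above, here is a minimal sketch of a reproducible split. It assumes scikit-learn and NumPy are installed; the arrays are synthetic stand-ins, not the assignment data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): 100 rows, 3 predictors, ratings 1-4.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(1, 5, size=100)

# random_state=1 as required; every other argument is left at its default,
# so the test set holds 25% of the rows.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Repeating the call with the same seed reproduces the same partition.
X_train_again, _, _, _ = train_test_split(X, y, random_state=1)
same_split = np.array_equal(X_train, X_train_again)
print(X_train.shape, X_test.shape, same_split)
```

Fixing the seed this way means the notebook produces the same partition, and therefore the same reported numbers, on every run.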

Reproducibility is fundamental in data analysis, so you are required to submit a

Jupyter Notebook that generates your results. Not submitting your code will lead to a

loss of 50% of the assignment marks.


The notebook must run without errors and produce results consistent with the report

when accessed through Kernel -> Restart & Run All from the Jupyter menu, assuming

that the train and test datasets are in the same folder as the notebook. Failure to do so

can result in a loss of up to 50% of the assignment marks.

Failure to read information and follow instructions may lead to a loss of marks.

Furthermore, note that it is your responsibility to be informed of the University of

Sydney and Business School rules and guidelines, and follow them.


The Task


You will work on a Credit Risk Rating dataset. This is a dataset about credit ratings given to

a list of publicly traded firms in the US, gathered from 2014 to 2015. The dataset consists of

multiple financial variables of the firms, and the rating given to each firm by the

rating agency Standard and Poor’s.

The assignment consists of applying models and model selection methodologies to arrive at

models that predict the rating from some of the other variables measured.

The credit ratings are often given in an ordered scale, from AAA to D, but in our dataset, the

ratings have been grouped and transformed to numbers, integers from 1 (the group of best

ratings) to 4 (the group of worst ratings).


The dataset (`credit_data.csv`) comes from a research paper that explores performance of

‘Artificial Intelligence’ methods for predicting credit ratings:

https://doi.org/10.1016/j.eswa.2020.113925

You might read the introduction of the paper for motivation and context. In addition to the

original variables which should be self-explanatory, the following variables have been

added:

`Rating`: The credit risk rating transformed to numbers

`ID`: Unique ID identifying the firm

`Year`: Year the report was made (all reports were made in Q4 of that year).


1. Problem description


A primary goal is finding a model that is accurate in predicting the rating of the firms. The

accuracy of the predictions is initially measured in Mean Absolute Error (MAE).
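
For reference, the MAE is the average of the absolute differences between the true and predicted values. A minimal sketch in plain Python follows; the ratings in it are hypothetical and only illustrate the arithmetic.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical ratings, for illustration only.
true_ratings = [1, 2, 3, 4]
predicted_ratings = [1.5, 2.0, 2.0, 4.0]

mae = mean_absolute_error(true_ratings, predicted_ratings)
print(f"MAE = {mae:.4f}")  # prints "MAE = 0.3750"
```

Here the absolute errors are 0.5, 0, 1 and 0, so the MAE is 1.5 / 4 = 0.375.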

A secondary goal is to understand the main factors that drive the

ratings according to the model. This would require that at least one of the models uses a

few variables or that you can create a coherent explanation out of one of the models if all

use many variables (you do not need to be a finance expert for this, though if you want to .


Select three models, one from each model family, to predict the target variable `Rating`.

These model families are:

a linear regression model,

a kNN regression model,

a third model. This model can be any model of your choice that is neither linear

regression nor kNN (it may even be a model not covered in the QBUS2820 unit). This

is to encourage you to self-explore and self-study, since the ability to self-study is

critical in the field of machine learning, which is evolving rapidly.


All the models need to be fine-tuned with hyperparameter search (when appropriate) and

potentially variable selection. The methodology should prioritize predictive accuracy

first and explanation second. When the three models have been tuned, you will

compute an accurate estimate of the prediction error of these models and make a final

decision among the three. In the conclusions, you also have to explain the driving factors of

the ratings. If the chosen model is not explainable, use another (or several) and

carefully justify the tradeoffs (accuracy sacrificed vs explanations).
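
The tune-then-estimate workflow described above could be sketched as follows for the kNN family. This is one possible approach under stated assumptions, not a prescribed method: it assumes scikit-learn, uses synthetic stand-in data, and the grid of `n_neighbors` values is an arbitrary illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the assignment dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.clip(np.round(2.5 + X[:, 0] + 0.3 * rng.normal(size=200)), 1, 4)

# Hold out a test set that plays no part in tuning.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Scaling sits inside the pipeline so it is refitted within each CV fold.
pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())
param_grid = {"kneighborsregressor__n_neighbors": [1, 3, 5, 10, 20]}

# Cross-validated search on the training set only, scored by (negative) MAE.
search = GridSearchCV(pipeline, param_grid, scoring="neg_mean_absolute_error")
search.fit(X_train, y_train)

cv_mae = -search.best_score_
test_mae = mean_absolute_error(y_test, search.predict(X_test))
print(search.best_params_)
print(f"CV MAE = {cv_mae:.4f}, test MAE = {test_mae:.4f}")
```

Keeping the test set out of the search is what makes the final figure a fair estimate of prediction error; the cross-validated score of the winning setting tends to be optimistic because it was used for selection.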


The model selection part of the assignment, including:

intro/business context/problem formulation

exploratory data analysis

The three models

The conclusions section

represents the main body of the report and makes up 80% of the grade of the assignment.


In addition to the model selection above, complete the following short exercises. Create a section for

each of the questions and remember to explain and discuss the methodology in the report

as well as in the main body.


(5%) Find the best predictive model that uses a single predictor (only one variable).

You can use any of the model classes.

(10%) Think a bit more carefully about the implications of the error function used in

the main part of the assignment, the Mean Absolute Error, and the interpretation of

the response variable ‘Ratings’. Describe a more ‘appropriate’ objective function

that considers the differences between predictions and true values, and the

implications of these differences. Program this function in Python. Sketch it using a

table or plot. Re-evaluate your candidate models according to this new function and

comment on the difference in results (if any). You do not need to re-train your

models for this new error function (in practice we would try to).

(5%) Notice the `ID` and `Year` variables. What is the main problem that these

represent, with respect to the basic assumptions we make in the predictive analytics

setting (the main violation of the assumptions required to do predictive analytics)?

How could you transform the dataset to solve or mitigate the effects of this

problem?
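
For the error-function exercise above, the mechanics of programming a function, sketching it as a table, and re-evaluating predictions could look like the sketch below. The `placeholder_cost` function is purely hypothetical (a squared difference) and is not a suggested answer; it stands in for whatever function you design and justify. The sketch also assumes predictions have already been mapped to one of the four rating groups.

```python
RATINGS = [1, 2, 3, 4]


def placeholder_cost(true_rating, predicted_rating):
    """Hypothetical cost: squared difference. Swap in your own function."""
    return (true_rating - predicted_rating) ** 2


def cost_table(cost, ratings=RATINGS):
    """Return {true: {predicted: cost}} for every pair of rating groups."""
    return {t: {p: cost(t, p) for p in ratings} for t in ratings}


def mean_cost(cost, y_true, y_pred):
    """Average cost over paired true and predicted ratings."""
    return sum(cost(t, p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Sketch the function as a table: rows are true ratings, columns predictions.
table = cost_table(placeholder_cost)
print("true\\pred " + " ".join(f"{p:>4}" for p in RATINGS))
for t in RATINGS:
    print(f"{t:>9} " + " ".join(f"{table[t][p]:>4}" for p in RATINGS))

# Re-evaluate a set of (hypothetical) predictions under the new function.
score = mean_cost(placeholder_cost, [1, 2, 3, 4], [1, 3, 3, 2])
print(f"mean cost = {score:.4f}")  # prints "mean cost = 1.2500"
```

Because `mean_cost` takes the cost function as an argument, the same predictions from each candidate model can be scored under MAE and under the new function without re-training.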


The grading of the assignment will be based on the methodology and justifications,

removing points for methodological errors, incomplete sections, etc. There is no ‘minimum’

predictive accuracy to be reached, but you need to apply a good methodology.


2. Written report


The purpose of the report is to describe, explain, and justify your solution. Be concise and

objective. Find ways to say more with less. When in doubt, put it in the appendix. Below are

some guidelines on how to work on the Task.

Preparation. You have read and understood the assignment requirements and are aware that

this is part of the assessment. You understand that machine learning is grounded in

rigorous logic and theory that should inform your practical analysis. You understand that

there is no single right solution and that trying different approaches and discovering

empirically what works best for a particular problem is natural and desirable in this type of

analysis.


Business context and problem formulation. The report includes a discussion of the context

for the analysis, the problem and questions/hypotheses to be addressed, and how you plan

to measure the success of your proposed solutions.


Data processing. You make sure that the dataset is free of errors and correctly processed

for your analysis. You handle missing values and other issues appropriately. You describe

the data processing steps in a clear and concise way.


Exploratory data analysis (EDA). Your report describes your EDA process, presenting only

selected results. You studied key variables individually. You note any features of the data

that are relevant for model building (some variables might be ‘invalid’ for predictive

purposes). You note the presence of outliers and any other anomalies that can affect the

analysis. You explain the relevance of the EDA results to your subsequent modelling. Your

EDA section in the report is concise, leaving additional figures and tables to the appendix if

needed. Outliers should be clear (e.g. negative values for counting variables). EDA is not the

place to do variable selection, and outliers of an unclear nature (e.g. very large values)

should be either not removed or further analyzed using the predictive model performance.

The dataset has many variables and you are not expected to report on all of them

individually, just report your methodology and main findings.


Variable selection. You describe and explain your process for variable selection. Your

choices are justified by data analysis and/or trial and error. Other than potentially invalid

variables from the dataset, the decision should be driven by the performance of the models,

not based on opinions (you are free to comment on the disagreements between your

background knowledge and the models).


Methodology and modelling. You clearly describe and justify the models, methods, and

algorithms in your analysis. The choice of methods is logically related to the assignment

requirements, the substantive problem, underlying theoretical knowledge, and data

analysis. This may involve systematic trial and error, but the report should focus on your

final solutions. Your methodology pays attention to statistical variability. You report all

crucial assumptions and check them as relevant via formal and informal diagnostics. You

clearly recognize when an assumption is not satisfied or questionable. Some problems may

be unfixable given the available data and methods. In this case you can identify what

additional information or methodology could allow you to fix these problems.


Analysis and conclusions. Your analysis is rich. You correctly interpret the results and

discuss how they address the substantive question. The reasoning from methodology and

results to your conclusions is logical and convincing. You are not misled by overfitting. Your

analysis pays attention to statistical variability. You make no claims for which you have no

evidence. You do not make statements that imply causation when discussing associations.

You explicitly acknowledge when limitations of the data or methods lead to uncertainty

about your answer to the substantive question.


Writing. Your writing is concise, clear, precise, and free of grammatical and spelling errors.

You use appropriate technical terminology. Your paragraphs and sentences follow a clear

logic and are well connected. There is a clear distinction between the essential parts of the

report and less important material (use the appendix). Your text refers to meaningful names

for variables and subjects. If you use an abbreviation or label, you first have to define it.


Report. Your report is well organized and professionally presented and formatted, as if it

had been prepared for a client later in your career. There are clear divisions between

sections and paragraphs.


Tables. Your tables are appropriately formatted and have a clear layout. The tables have

informative row and column labels. As far as possible, the tables can be

understood on their own (in the real world, a significant part of your audience will skim-read

by going straight to the tables). The tables do not contain information which is irrelevant to

the discussion in your report. Your table is not an image. The tables are placed near the

relevant discussion in your report. There is no text around your tables.


Figures. Your figures are easy to understand and have informative titles, captions, labels,

and legends. The figures are well formatted and laid out. The figures are placed near the

relevant discussion in your report and are referenced from the text of the report. Your

figures have appropriate definition and were directly saved from Python into an image file

format. There is no text around your figures.
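
As a sketch of saving a figure directly from Python, the example below assumes matplotlib is installed. The counts, the output file name and the `dpi=300` setting are illustrative assumptions, not requirements of the assignment.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Hypothetical counts per rating group, for illustration only.
ratings = [1, 2, 3, 4]
counts = [12, 30, 25, 8]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(ratings, counts)
ax.set_xticks(ratings)
ax.set_xlabel("Rating group")
ax.set_ylabel("Number of firms")
ax.set_title("Firms per rating group (illustrative data)")

# Save straight to an image file; dpi=300 is an assumed print-quality setting.
output_path = os.path.join(tempfile.gettempdir(), "rating_counts.png")
fig.savefig(output_path, dpi=300, bbox_inches="tight")
plt.close(fig)
print(output_path)
```

Saving with `savefig` rather than taking a screenshot keeps the resolution under your control and makes the figure reproducible from the notebook.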


Numbers. All numerical results are reported to four decimal places.
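
A minimal sketch of four-decimal formatting with Python's format specification follows; the metric values are hypothetical.

```python
# Hypothetical results, for illustration only.
values = {"MAE": 0.3751234, "RMSE": 0.5}

for name, value in values.items():
    print(f"{name}: {value:.4f}")
# prints "MAE: 0.3751" and "RMSE: 0.5000"

# The same format applied once, e.g. before building a report table.
formatted = {name: f"{value:.4f}" for name, value in values.items()}
```

The `.4f` specifier pads with trailing zeros, so every number in a table ends up with the same width after the decimal point.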


Referencing. You add citations for your sources. The references follow a recognizable style

(e.g. the Harvard Referencing System, MLA, APA or Vancouver).


Python code. The code is presented in a neat and compact way. The code uses meaningful

variable names and can be easily followed by someone with training in Python and statistics.

Someone should be able to run your code and reproduce all the results that appear in your

report. Your code has comments that clearly indicate which parts correspond to which

sections of your report. You explicitly acknowledge when you borrow pieces of code from

sources other than the lecture and tutorial materials.

