Date: 2022-04-04 11:44

Introduction to Data Analytics

Coursework

Spring 2022, Lecturers: Edwin Simpson (unit director), Ian Nabney.

Deadline: 13.00 on Wednesday 11th May

Overview

This coursework will take you through the data analytics process for an example scenario, from

processing text data to visualising information. As well as implementing data analytics methods and

obtaining results, you should aim to demonstrate your understanding of the methods you use and

critically evaluate these methods. Your work should also incorporate ideas from the lecture videos

and lectorials.

We recommend that you first get a basic implementation for all parts of the required assignment,

then start writing your report with some results for all tasks. You can then gradually improve your

implementation and results.

Total time required: 40 hours.

Support

The lecturers and teaching assistants are available to answer clarification questions if you are unsure

what to do for any part of the coursework. You can ask questions during our Monday labs, post

them to the QA channel on Teams, or post them anonymously to the Blackboard discussion forum.

Alternatively, email Edwin (edwin.simpson@bristol.ac.uk) with questions about tasks 1 and 2,

and Ian with questions about the other tasks.

Task 1: Sentiment Classification (max. 22%)

Financial news provides important information for investors, such as positive or negative sentiment

towards a company. Your task is to design, implement and evaluate a sentiment classifier for

financial news. For this task, we will be working with the Financial Phrasebank dataset, which

contains sentences from English news articles discussing companies listed on the Helsinki stock

exchange. Each sentence has one of three labels: positive (2), negative (0), or neutral (1). The

dataset can be accessed through the HuggingFace datasets library. Please see the Jupyter notebook

data_loader_demo.ipynb (available on Blackboard) for example code for loading and splitting it into

training and test sets.
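
As a rough illustration of the splitting step, the sketch below performs a stratified train/test split using only the Python standard library. It is not the code from data_loader_demo.ipynb, which may work differently. The function name `stratified_split`, the 20% test fraction, and the toy sentences are assumptions made for this example; only the label ids come from the text above.

```python
import random
from collections import defaultdict

# Label ids as stated in the coursework text.
LABEL_NAMES = {0: "negative", 1: "neutral", 2: "positive"}


def stratified_split(examples, test_fraction=0.2, seed=0):
    """Split (sentence, label) pairs so each label keeps a similar share in both sets."""
    by_label = defaultdict(list)
    for sentence, label in examples:
        by_label[label].append((sentence, label))

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_test = max(1, round(len(items) * test_fraction))
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


# Toy sentences invented for illustration; they are not from the dataset.
toy = [(f"sentence {i}", i % 3) for i in range(30)]
train_set, test_set = stratified_split(toy)
print(len(train_set), len(test_set))
```

Splitting per label keeps the class proportions of the test set close to those of the training set, which matters when one label (for example neutral) is much more frequent than the others.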

The data is described in this paper:

Malo, P., Sinha, A., Korhonen, P., Wallenius, J., & Takala, P. (2014). Good debt or bad debt: Detecting

semantic orientations in economic texts. Journal of the Association for Information Science and

Technology, 65(4), 782-796.


1.1. Implement and train a method for automatically labelling texts in the Financial Phrasebank with

their sentiment labels. Refer to the labs, lecture materials and textbook to identify a suitable

method. Include the following in your report:

• Briefly explain how your chosen sentiment analysis method works and its main strengths
  and limitations;

• Describe the features you have chosen and why you chose them, and hypothesise how they
  will affect your results;

• Explain the preprocessing steps your method requires.

(7 marks)

1.2. Implement, train, and test your method. Briefly document this process in the report. (6 marks)
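
To make the implement/train/test cycle concrete, here is a minimal sketch of one possible method: a multinomial Naive Bayes classifier over bag-of-words counts, written with the standard library only. This is an illustrative choice, not the method the coursework requires, and the labs may use different libraries. The class name `NaiveBayes`, the tokenisation rule, and the toy sentences are all assumptions.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep words, numbers and percent signs."""
    return re.findall(r"[a-z]+|\d+(?:\.\d+)?|%", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for token in tokenize(text):
                if token in self.vocab:  # skip words never seen in training
                    score += math.log((counts[token] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Toy training data invented for illustration (2 = positive, 0 = negative, 1 = neutral).
train_texts = [
    "profit rose sharply", "sales increased",
    "profit fell", "loss widened",
    "the company is based in Helsinki", "the meeting is in May",
]
train_labels = [2, 2, 0, 0, 1, 1]
model = NaiveBayes().fit(train_texts, train_labels)

test_texts = ["profit rose", "loss widened", "the company is in Helsinki"]
predictions = [model.predict(t) for t in test_texts]
print(predictions)
```

Working in log space avoids numerical underflow on longer sentences, and the smoothing keeps a single unseen word-class pair from driving a class probability to zero.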

1.3. Evaluate, interpret and discuss your results, making sure to include the following points:

• Define your performance metrics and state their limitations;

• Show your results using suitable plots, tables and/or a confusion matrix;

• How could you improve the method or experimental process? Consider the errors that your
  method makes.

(9 marks)
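
As an illustration of the metrics mentioned above, the following sketch builds a confusion matrix for the three sentiment labels and derives accuracy and macro-averaged F1 from it. The gold labels and predictions are invented, and the function names are hypothetical. Library implementations of these metrics exist and may be preferable in practice.

```python
def confusion_matrix(gold, pred, labels=(0, 1, 2)):
    """Return a matrix whose rows are gold labels and columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for g, p in zip(gold, pred):
        matrix[index[g]][index[p]] += 1
    return matrix


def per_class_f1(matrix):
    """Compute F1 for each class from a square confusion matrix."""
    scores = []
    for i in range(len(matrix)):
        tp = matrix[i][i]
        fp = sum(row[i] for row in matrix) - tp  # predicted i, gold was another class
        fn = sum(matrix[i]) - tp                 # gold i, predicted another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return scores


# Invented labels: the classifier over-predicts the majority class (1 = neutral).
gold = [0, 0, 1, 1, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 1, 1, 2, 1]

cm = confusion_matrix(gold, pred)
accuracy = sum(cm[i][i] for i in range(len(cm))) / len(gold)
f1s = per_class_f1(cm)
macro_f1 = sum(f1s) / len(f1s)
print(cm, accuracy, round(macro_f1, 3))
```

In this toy case accuracy is 0.75 while the negative and positive classes each have only half of their examples recovered, which is the kind of limitation of a single aggregate metric that the confusion matrix makes visible.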

High performance figures are less important for getting high marks than motivating your method

well and implementing and evaluating it correctly.

Suggested length of report for task 1: 2 pages.

Task 2: Named Entity Recognition (max. 28%)

Our clients would like to extract information automatically from financial documents about

organisations, places, and people. This task is therefore to design and implement named entity

recognition using the SEC-Filings dataset, containing U.S. financial agreements. The dataset is

labelled with the entity tags location (LOC), person (PER), organisation (ORG) and miscellaneous

(MISC). Code to load the dataset is provided in the Jupyter notebook data_loader_demo.ipynb

(available on Blackboard).

The data is presented in this paper:

Alvarado, J. C. S., Verspoor, K., & Baldwin, T. (2015). Domain adaption of named entity recognition to

support credit risk assessment. In Proceedings of the Australasian Language Technology Association

Workshop 2015 (pp. 84-90).

2.1. Design a method for tagging named entities in the SEC-Filings dataset. Refer to the labs, lecture

materials and textbook to identify a suitable method. Include the following in your report:

• Briefly explain how your chosen named entity recognition method works and its main
  strengths and limitations;

• Describe the features you have chosen and why you chose them, and hypothesise how they
  will affect your results;

• Explain the tagging scheme for labelling entities in this dataset.

(7 marks)
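
To illustrate what a tagging scheme encodes, the sketch below converts a per-token tag sequence into entity spans. It assumes BIO-style tags such as `B-ORG`, `I-ORG` and `O`; the exact scheme used in the SEC-Filings data should be checked against the dataset itself. The function name `tags_to_spans` and the example tags are hypothetical.

```python
def tags_to_spans(tags):
    """Convert tags like ["B-ORG", "I-ORG", "O"] into (start, end, type) spans.

    `end` is exclusive. An I- tag that does not continue an entity of the same
    type is treated as the start of a new entity, so sequences that begin
    entities with I- instead of B- are also handled.
    """
    spans = []
    start, current = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        prefix, _, etype = tag.partition("-")
        continues = prefix == "I" and etype == current
        if current is not None and not continues:
            spans.append((start, i, current))
            start, current = None, None
        if prefix in ("B", "I") and not continues:
            start, current = i, etype
    return spans


# Invented tag sequence for a seven-token sentence.
example_tags = ["B-ORG", "I-ORG", "O", "B-PER", "O", "I-LOC", "I-LOC"]
print(tags_to_spans(example_tags))
```

Working with spans rather than individual tags makes it easier to count whole entities, which is what the evaluation and the later entity-level analysis operate on.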


2.2. Implement, train, and test your method. Briefly document this process in the report. (6 marks)

2.3. Evaluate, interpret, and discuss your results, making sure to include the following points:

• Explain your choice of performance metrics and their limitations;

• Show your results using suitable plots and/or tables;

• How could you improve the method or experimental process? Consider the errors your
  method makes.

(8 marks)
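
One common choice for NER evaluation is exact-match precision, recall and F1 over entity spans rather than over tokens. The sketch below shows that computation on invented spans. It is one option among several, and the tuple layout `(sentence_id, start, end, type)` is an assumption made for this example.

```python
def span_prf(gold_spans, pred_spans):
    """Exact-match precision, recall and F1 over (sentence_id, start, end, type) spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)  # a span counts only if boundaries and type both match
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented spans: one correct, one boundary error, one type error.
gold_spans = {(0, 0, 2, "ORG"), (0, 5, 6, "LOC"), (1, 3, 4, "PER")}
pred_spans = {(0, 0, 2, "ORG"), (0, 5, 7, "LOC"), (1, 3, 4, "ORG")}

precision, recall, f1 = span_prf(gold_spans, pred_spans)
print(precision, recall, f1)
```

Exact matching gives no credit for partially correct spans, so a tagger that finds the right entity with a slightly wrong boundary is scored the same as one that misses it entirely. That is a limitation worth discussing alongside the numbers.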

2.4. Apply your trained NER tagger to the Financial Phrasebank dataset.

• Compute a sentiment score for each entity that you detect. Briefly explain your method.
  One way you could compute a score for an organisation is to count the number of positive
  texts it occurs in and subtract the number of negative texts it occurs in;

• Show your results, for example by listing the five most positive and five most negative
  organisations, along with their scores.

(7 marks)
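
The counting approach suggested above (positive texts minus negative texts) can be sketched as follows. The input format, the function name `entity_sentiment_scores`, and the organisation names are hypothetical. In practice the entity strings would come from your tagger's output and the labels from the Financial Phrasebank.

```python
from collections import Counter


def entity_sentiment_scores(documents):
    """Score each entity as (#positive texts it occurs in) - (#negative texts).

    `documents` is an iterable of (entities, label) pairs, where `entities` is
    the list of entity strings detected in one text and `label` is 0, 1 or 2.
    """
    scores = Counter()
    for entities, label in documents:
        delta = {2: 1, 0: -1}.get(label, 0)  # neutral texts contribute 0
        for entity in set(entities):  # count each entity once per text
            scores[entity] += delta
    return scores


# Invented tagger output; the organisation names are placeholders.
documents = [
    (["AlphaCorp", "AlphaCorp"], 2),
    (["AlphaCorp", "BetaBank"], 2),
    (["BetaBank"], 0),
    (["GammaOil"], 0),
    (["BetaBank"], 1),
]

scores = entity_sentiment_scores(documents)
most_positive = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5]
most_negative = sorted(scores.items(), key=lambda kv: kv[1])[:5]
print(most_positive)
print(most_negative)
```

A raw difference favours frequently mentioned organisations, so it may be worth considering whether to normalise by the number of texts an entity occurs in, and how to handle the same organisation appearing under different surface forms.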

Suggested length of report for task 2: 2.5 pages.

Task 3: Information Visualisation Analysis (8%)

Analyse the approach you have used to present your results in tasks 1.3 and 2.3 as defined above.

3.1. Justify the design chosen in terms of key information visualisation principles. (5 marks)

3.2. Define and explain the visual queries that the user carries out when viewing your presentation

of results. (3 marks)

Suggested length of report for task 3: less than 1 page.

Task 4: Information Visualisation (42%)

4.1. Use Tableau to create plots that enable the user to explore the Bookshop dataset that was used

in lab 3. You should enable the user to answer these questions:

• Is there a link between the number of hours per day that an author writes and their total
  output (in terms of the total number of pages in their books)?

• Is there a link between ratings and sales at the book level?

• Show where authors live on a world map, how many work in each country (using a visual
  representation), and in a tooltip provide information on the average price for books written
  in that country.

In about two pages, write a short description of the visualisation techniques you used and a

justification for your choices. You should refer to the principles of information visualisation,

relevant aspects of human perception and cognition, and the scientific literature where appropriate.

(32 marks: 22 marks for the visualisation; 10 marks for the description and justification).

4.2. Using appropriate levels and types of validation (as in Chapter 4 of Munzner and the lectures

from week 2), assess the quality of your visualisation by making appropriate measurements and

observations of the other students in your group (the groups will be defined separately) in an

analytic task using your visualisation. The lab class on 25th April will be dedicated to this activity, so

you will need a complete visualisation by then. Your report on this should be no more than one

page. (10 marks)

Implementation

Text Analytics: The lab notebooks provide useful example code and we recommend using Python 3

with the libraries used in the labs. You may use other libraries if preferred and you can write your

code in either Jupyter notebooks or standard Python files.

Information Visualisation: We recommend using Tableau and applying what you have learned in the

labs and lectorials.

Report Formatting

• Maximum of 10 pages

• References do not count toward the page limit

• We recommend using the template from COLING 2020 if writing the report in Latex¹, or
  following the same formatting style if using Word or another application.

• No less than 11pt font

• Single line spacing

• A4 page format

• Aim for quality rather than quantity: you do not have to use the maximum number of pages
  and will receive higher marks if you write concisely and clearly.

• The text in your figures must be big enough to read without zooming in.

Citations and References

Make sure to cite a relevant source when you introduce a method or discuss results from previous

work. The preferred style is given in the COLING 2020 style guide above. The details of the cited

papers must be given at the end in the references section (no page limits on the references list).

Please only include papers that you discuss in the main body of the report.

Google Scholar and similar tools are useful for finding relevant papers. The ‘cite’ link provides BibTeX

code for use with Latex, as well as references that you can copy, but beware: these often contain errors.

Submission

Deadline: 13.00 (GMT+1) on 11th May.

Submit on Blackboard under the “assessment, submission and feedback” link.

Please upload the following three files:

1. Your report as a PDF with filename .pdf, where is

your student number (not your username).

2. Your code inside a single zip file with filename .zip. Please remove

datasets and other large files to minimise the upload size – we only need the code itself.


3. A packaged Tableau workbook (use this link to find out more) with filename

.twbx containing your solution to Task 4. This enables us to run the

workbook in Tableau reliably.

1 Latex is the most common tool for writing published papers in AI/ML/NLP research. It separates writing the

content from formatting. A good way to get started with Latex is to use

We will briefly review your Python code by eye – we do not need to run it. Your marks will be based

on the contents of your report, with the code used to check how you carried out the experiments

described in your report. We will not give marks for the coding style, comments, or organisation of

the code.

Please do not include your name in the report text itself: to ensure fairness, we mark the reports

anonymously.

Please check that your submission follows these guidelines before uploading, otherwise you may

lose marks.

Assessment Criteria

Your coursework will be evaluated based on your submitted report containing the presentation of

methods, results and discussions for each task. To gain high marks your report will need to

demonstrate a thorough understanding of the tasks and the methods used, backed up by a clear

explanation (including figures) of your results and error analysis. The exact structure of the report

and what is included in it is your decision and you should aim to write it in a professional and

objective manner. Marks will be awarded for appropriately including concepts and techniques from

the lectures.

Avoiding Academic Offences

Please re-read the university’s plagiarism rules to make sure you do not unknowingly break any

rules. Do not copy text directly from your sources – always rewrite in your own words and provide a

citation.

Academic offences include submission of work that is not your own, falsification of data/evidence or

the use of materials without appropriate referencing. Note that sharing your report with others is

also not allowed. These offences are all taken very seriously by the University.

Suspected offences will be dealt with in accordance with the University’s policies and procedures. If

an academic offence is suspected in your work, you will be asked to attend an interview with senior

members of the school, where you will be given the opportunity to defend your work. The

plagiarism panel can apply a range of penalties, depending on the severity of the offence. These

include a requirement to resubmit work, capping of grades and the award of no mark for an element

of assessment.

Extenuating Circumstances

If the completion of your assignment has been significantly disrupted by serious health conditions,

personal problems, periods of quarantine, or other similar issues, you can apply for consideration of

extenuating circumstances in accordance with the normal university policy and processes. Students

should apply for consideration of extenuating circumstances as soon as possible when the problem

occurs. Please see the details here.

