
COMP4650/6490 Document Analysis – Semester 2 / 2022

Assignment 3

Due 17:00 on Wednesday 19 October 2022 AEDT (UTC +11)

Last updated September 27, 2022

Overview

For this assignment, you will implement several common NLP algorithms in Python: probabilistic context-free grammar learning, dependency parsing, and relation extraction.

Throughout this assignment you will make changes to the provided code to improve or complete existing models.

Submission

• The answers to this assignment (including your code files) have to be submitted online in Wattle.

• You will produce an answers file with your responses to all questions. Your answers file must be a PDF file named u1234567.pdf, where u1234567 should be replaced with your Uni ID.

• Questions 1 and 3 also require you to submit an output file in a format specified by the question.

• You should submit a ZIP file containing all the code files, your answers PDF file, and these requested output files BUT NO OTHER DATA.

Marking

This assignment will be marked out of 15, and it will contribute 15% of your final course mark.

Your answers to coding questions (or coding parts of each question) will be marked based on the quality of your code (is it efficient, is it readable, is it extendable, is it correct). The output files requested will be used as part of the marking for the coding questions.

Your answers to discussion questions (or discussion parts of each question) will be marked based on how convincing your explanations are (are they sufficiently detailed, are they well-reasoned, are they backed by appropriate evidence, are they clear).

This is an individual assignment. Group work is not permitted. Assignments will be checked for similarities.

Question 1: Probabilistic Context-Free Grammars (3 marks)

For this question you will be modifying the starter code in syntax_parser.py. You have been provided with a training dataset containing 73,638 English sentences, which have been parsed into constituency trees (in parsed_sents_list.json).

For example, given the sentence “A large elephant scratching”, the parse tree is:

[parse tree figure omitted from this text version]

For this structure the parsed sentence in parsed_sents_list.json is

["S", ["NP", ["DT", "a"], ["JJ", "large"], ["NN", "elephant"]], ["VP", ["VBG", "scratching"]]].

Note that parsed_sents_list.json contains a list of parses, each for a different sentence.

Your task is to estimate a probabilistic context-free grammar (PCFG) from this dataset. You should estimate the conditional probabilities of the rules in this grammar (including all rules that express facts about the lexicon) using the maximum-likelihood approach given in lectures. Some examples of rules in this grammar are:

S → NP
VBN → parked
ADVP → IN NP
NP → NP , NP

Note that some terminals and non-terminals are punctuation characters. There are many more rules than those listed here, and you will first have to identify all the rules. All terminals are lower case or punctuation.

When your syntax_parser.py code is run, it should write a list of all the transition rules and their conditional probabilities to a file called “q1.json”. There is a class called RuleWriter in syntax_parser.py which will write the rules in the correct format for the automated judging system (do not modify this class). The expected format is (an example of the format only, not actual answers):

[["NP", ["PP", "NP"], 0.8731919455731331], ["NNS", ["locations"], 0.062349802638746935], ... <many more rules> ]

Hint: if you create a test input file containing only

[["A", ["B", ["C", "blue"]], ["B", "cat"]]]

your program should output (note that the order in which the rules are listed does not matter):

[["A", ["B", "B"], 1.0], ["B", ["C"], 0.5], ["B", ["cat"], 0.5], ["C", ["blue"], 1.0]]
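
The counting itself can be a short recursion over the nested-list trees. Below is a minimal sketch of the maximum-likelihood estimate (rule count divided by left-hand-side count); the traversal and data structures are one possible implementation, not a prescribed one, and the final write step should go through the provided RuleWriter.

import json
from collections import Counter

def count_rules(node, rule_counts, lhs_counts):
    """Recursively count LHS -> RHS occurrences in one nested-list parse tree."""
    lhs = node[0]
    # RHS: the label of each list-valued child, or the terminal string itself.
    rhs = tuple(c[0] if isinstance(c, list) else c for c in node[1:])
    rule_counts[(lhs, rhs)] += 1
    lhs_counts[lhs] += 1
    for child in node[1:]:
        if isinstance(child, list):
            count_rules(child, rule_counts, lhs_counts)

with open("parsed_sents_list.json") as f:
    trees = json.load(f)

rule_counts, lhs_counts = Counter(), Counter()
for tree in trees:
    count_rules(tree, rule_counts, lhs_counts)

# Maximum-likelihood estimate: P(LHS -> RHS) = count(LHS -> RHS) / count(LHS).
rules = [[lhs, list(rhs), rule_counts[(lhs, rhs)] / lhs_counts[lhs]]
         for (lhs, rhs) in rule_counts]
# ... pass `rules` to the provided RuleWriter to produce q1.json

Run on the hint input above, this counting yields exactly the four rules shown, with probabilities 1.0, 0.5, 0.5, and 1.0.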

What to submit

Make sure to submit “q1.json” for marking. Also make sure to submit your syntax_parser.py.

Question 2: Dependency Parsing (6 marks)

Complete both Part A and Part B of this question.

Part A

The file dependency_parser.py contains a partial implementation of Nivre’s arc-eager dependency parsing algorithm. Your first task is to implement the missing left_arc method, reduce method, right_arc method, and parse function. Your implementation should follow the transitions as given in the lecture slides (some additional instructions are also provided as comments in dependency_parser.py). Make sure that your class methods and parse function return True if the operation(s) was/were successfully applied, and False otherwise. The parse function should return True if it generates a dependency forest. Your implementation must make use of the DepContext class, which stores the current state of the parser.

Do not modify the DepContext class or the names and parameters of the methods and functions you have been asked to implement, because Part A of this question will be automatically marked. The auto-marker will inspect the internal state of DepContext to confirm that operations have been applied correctly. You should not use any additional imports for this question (except from the Python standard library). The judging system will use Python 3.9.13 in a sandboxed Linux environment.

Hint: As mentioned in lectures, we need to make sure that words are uniquely represented (in case there are multiple copies of the same word); to do this we use the position of the word in the sequence, with the [ROOT] token prepended to the sequence. This is handled by DepContext.
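
For orientation, here is a minimal sketch of the four arc-eager transitions and their preconditions. The MiniState class and free functions are hypothetical, purely for illustration; your submission must operate on the provided DepContext instead, whose interface is defined in dependency_parser.py.

class MiniState:
    """Hypothetical stand-in for DepContext: stack, buffer, and chosen arcs."""
    def __init__(self, n_words):
        self.stack = [0]                           # position 0 is the prepended [ROOT]
        self.buffer = list(range(1, n_words + 1))  # word positions left to process
        self.heads = {}                            # dependent position -> head position

def left_arc(state):
    # Precondition: stack top is not [ROOT] and does not already have a head.
    if not state.buffer or state.stack[-1] == 0 or state.stack[-1] in state.heads:
        return False
    state.heads[state.stack.pop()] = state.buffer[0]  # arc: buffer front -> stack top
    return True

def right_arc(state):
    if not state.stack or not state.buffer:
        return False
    dependent = state.buffer.pop(0)
    state.heads[dependent] = state.stack[-1]          # arc: stack top -> buffer front
    state.stack.append(dependent)                     # the dependent moves onto the stack
    return True

def reduce(state):
    # Precondition: stack top already has a head.
    if not state.stack or state.stack[-1] not in state.heads:
        return False
    state.stack.pop()
    return True

def shift(state):
    if not state.buffer:
        return False
    state.stack.append(state.buffer.pop(0))
    return True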

What to submit

In your answers PDF file include the output of dependency_parser.py (it runs a small number of test cases and prints the results to the terminal). Make sure to submit your dependency_parser.py.

Part B

Your second task is to implement the standard oracle for arc-eager dependency parsing in oracle_parser.py (specifically, you should implement the parse_gold function). Make use of the parsing class you developed in Part A of this question. Do not modify your parsing class from Part A; all code for this second part should be put in oracle_parser.py. The parsing class from Part A has been imported. Additional imports are not permitted (except from the Python standard library). You may add helper functions in oracle_parser.py.

The oracle you need to implement determines which action to perform in each state in order to reconstruct the ground-truth dependency parse tree. It returns the list of actions and the states in which they occur, though in this case you should return only features of the state (using the provided extract_features function; do not modify the feature extraction code). The details of this algorithm were not covered in lectures; instead you should implement Algorithm 1 from this paper: https://aclanthology.org/C12-1059.pdf (while the paper covers several oracle methods, including a dynamic oracle, you should only implement Algorithm 1 on page 963). You should implement unlabelled dependency parsing, so you will need to modify this algorithm to work without labels. To implement Algorithm 1 you will need to make use of the parsing class you developed in Part A of this question.
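
Unlabelled, the decision rule of Algorithm 1 reduces to the sketch below. Here gold_heads is assumed to map each word position to its gold head position, and the function and variable names are illustrative only; in your solution the chosen action must be applied through your Part A parsing class, with the state captured via extract_features.

def oracle_action(stack, buffer, gold_heads, assigned_heads):
    """Choose the next arc-eager action that follows the gold tree (sketch)."""
    s, b = stack[-1], buffer[0]
    if s != 0 and gold_heads.get(s) == b:
        return "LEFT_ARC"    # gold arc: buffer front is the head of the stack top
    if gold_heads.get(b) == s:
        return "RIGHT_ARC"   # gold arc: stack top is the head of the buffer front
    # Reduce when the stack top is finished: it has its head, and the buffer
    # front is linked (as head or dependent) to some position left of the stack top.
    if s in assigned_heads and any(gold_heads.get(k) == b or gold_heads.get(b) == k
                                   for k in range(s)):
        return "REDUCE"
    return "SHIFT"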

Once you have implemented the parse_gold function, run oracle_parser.py and it will determine the actions that could generate the ground-truth dependency parse trees in gold_parse.json under the rule set in rule_set.json. It will print the first 3 cases where the oracle could reproduce the ground truth, the first 3 cases where it could not reproduce the ground truth, and, after a short wait, the total numbers of failure and success cases.

What to submit

In your answers PDF file include the output of oracle_parser.py and explain why it failed in the three printed cases. Make sure to submit your oracle_parser.py.

Question 3: Relation Extraction (6 marks)

Your task is to design and implement a relation extraction model along with its training and testing pipeline. Specifically, your model should extract the relation ‘nationality’ (denoted /people/person/nationality in our dataset), which holds between a person and a geopolitical entity (GPE). For example,

/people/person/nationality(Yann LeCun, France),
/people/person/nationality(Julie Bishop, Australia).

The training data is provided in sents_parsed_train.json; the test data is in sents_parsed_test.json. A file containing a list of country names (country_names.txt) is also provided for you to use. The test data does not have the ground-truth set of relations; your program must extract them. Some starter code has been provided in rel_extract.py which shows you how to read in the data, access its various fields, and write a valid output file.

It is very important that your relation extraction algorithm works for entities not seen during training. To clarify, this means that your model should be designed so that it can identify relations such as /people/person/nationality(Julie Bishop, Australia) even if the “Julie Bishop” and “Australia” entities are not in the training data. None of the (PERSON, GPE) pairs that constitute a nationality relation occur in both the training dataset and the test dataset.

The focus of this question is to develop a good training/validation pipeline and to explore different lexical, syntactic, or semantic features. You should use the “Feature-based supervised relation classifiers” technique introduced in the Speech and Language Processing textbook (3rd ed., Jan 12, 2022 draft), Section 17.2.2 and Figure 17.6. You do not have to implement all the suggested features, but an attempt to implement a variety of them is expected. You will likely need to do substantial pre-processing of the input data as well as some post-processing of the output list to get a good result. You should use LogisticRegression from scikit-learn as the classifier. Neural networks and vector features generated by neural networks are not to be used for this question. You may use spaCy (https://spacy.io/) to extract additional features from your text; however, several features are already provided for you (see Description of the data below). It is possible to get full marks using only the provided features; however, these features will likely need some processing before they can be used appropriately as part of a classification pipeline. You are permitted to use the Python standard library, numpy, pandas, scikit-learn, nltk, and spaCy. You should not use other libraries, datasets, or ontological resources.
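
As a starting point, a minimal version of such a pipeline might look like the sketch below. The field names come from the data description later in this document; the two features shown are illustrative stand-ins only, and a real solution should implement a variety of the textbook’s features plus candidate-pair generation for test-time extraction.

import json
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

TARGET = "/people/person/nationality"

def pair_features(example):
    """Illustrative features for the annotated entity pair in one example."""
    rel = example["relation"]
    i, j = sorted((rel["a_start"], rel["b_start"]))
    # Bag of tokens (roughly) between the two entity start positions.
    feats = {"w_between_" + w: 1.0 for w in example["tokens"][i + 1:j]}
    feats["token_distance"] = float(j - i)
    return feats

with open("sents_parsed_train.json") as f:
    data = json.load(f)

# Examples whose relation is not the target serve as negatives (see the data notes below).
X_raw = [pair_features(ex) for ex in data]
y = [1 if ex["relation"]["relation"] == TARGET else 0 for ex in data]

X_tr, X_va, y_tr, y_va = train_test_split(X_raw, y, test_size=0.2, random_state=0)
vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(X_tr), y_tr)
print("validation accuracy:", clf.score(vec.transform(X_va), y_va))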

The final goal is to extract all relations of the type nationality that exist in the test data, not simply to identify whether a relation type is present in each sentence. Your output should be a list of all the relations in the test data, in CSV format (you should name it “q3.csv”); the starter code provides a function to write this file in the correct format for the automated judging system. The order of lines in this CSV file does not matter, and duplicate lines will be removed before scoring. An example judging script eval_output.py is provided for you. Since you have not been given the ground truth for the test data, you can only run this script using the training data. To do a sanity check before submitting:

(1) first run your rel_extract.py using the training data, and

(2) evaluate your model performance on the training set:

python eval_output.py --trainset

(3) to make sure “q3.csv” is the output from the test data, run rel_extract.py again using the test data (this overwrites “q3.csv” with the relations extracted from the test set).

Note that part of the mark for this question will be based on the F1 score of your “q3.csv” file on the held-out ground truth for the test data (evaluated by the examiner using the automated judging system). The rest will be based on an examination of your code and your answers to the questions below.


What to submit

Answer the following questions in your answers PDF file (please answer in plain English; dot points are preferred; do not include code snippets or screenshots of code, as these will not be marked):

(a) What pre-processing did you do?

(b) How did you split the data into training and validation sets?

(c) How did you choose hyper-parameters?

(d) What features did you choose and why did you choose these features?

(e) What post-processing did you do?

Submit your code. Submit your output file named “q3.csv” (make sure this is the output from the test data). Submit your answers to the questions.

Description of the data

The data you have been given is described in detail below. Briefly, it is a set of sentences, each of which contains a relation between two entities. You may use the relations that are not the target relation /people/person/nationality to generate negative examples on which to train your model.

sents_parsed_train.json and sents_parsed_test.json each contain a list of dictionaries. Each dictionary corresponds to a different example. Each example dictionary has the following fields:

• tokens: a tokenised sentence that contains a relation

• entities: a list of the named entities in the sentence

• relation: a description of the relation in the sentence

Each entity is a dictionary with the following fields:

• start: the token index where the entity starts

• end: the token index where the entity ends

• label: the entity type (PERSON = a person, GPE = a geopolitical entity; there are several others)

Each relation is a dictionary with the following fields (note that the relation field has been removed from sents_parsed_test.json):

• a: the first entity that is part of the relationship (tokenised into a list)

• b: the second entity that is part of the relationship (tokenised into a list)

• a_start: the token index where the first entity starts

• b_start: the token index where the second entity starts

• relation: the type of relation; note we are interested in the /people/person/nationality relation for this assignment.

There are also some additional features that may or may not be helpful (each of the features below is a list whose indices match those of the tokenised sentence):

• lemma: the tokenised text that has been lemmatised.

• pos: part-of-speech tags for each word

• dep: the dependency relationships

• dep_head: the head of each dependency relationship (allowing you to reconstruct the full dependency tree)

• shape: shape features for each word. This preserves things such as word length and capitalisation while removing much of the content of the word.

• isalpha: is each word only alphabetic.

• isstop: is each word in the spaCy stop word dictionary (and thus likely to be common)

All features were extracted using the spaCy 2.3.5 en_core_web_md model. The data you have been provided with was constructed automatically by matching ground-truth relations to text: for the training data, any sentence that contained both of the named entities participating in a ground-truth relation was marked as containing that relation. No manual checks were done to ensure accuracy. You might like to consider how this will affect the data quality.
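
To make the field layout concrete, the sketch below enumerates (PERSON, GPE) candidate pairs from the fields described above, e.g. as classifier inputs at test time. The field names are taken from this description; the entity end index is assumed here to be exclusive (the spaCy span convention), and everything else is illustrative.

import json

with open("sents_parsed_test.json") as f:
    examples = json.load(f)

for example in examples[:3]:
    people = [e for e in example["entities"] if e["label"] == "PERSON"]
    gpes = [e for e in example["entities"] if e["label"] == "GPE"]
    for p in people:
        for g in gpes:
            # Recover the surface strings, assuming `end` is an exclusive index.
            person_text = " ".join(example["tokens"][p["start"]:p["end"]])
            gpe_text = " ".join(example["tokens"][g["start"]:g["end"]])
            print(person_text, "<->", gpe_text)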

