

COMP4650/6490 Document Analysis – Semester 2 / 2022

Assignment 1

Due 17:00 on Wednesday 17 August 2022 AEST (UTC +10)

Last updated July 25, 2022

Overview

In this assignment, your task is to index a document collection into an inverted index, and then measure search performance based on predefined queries. A document collection containing more than 30,000 government site descriptions is provided for this assignment, along with a set of queries (gov/topics/gov.topics) and the expected returned documents (gov/qrels/gov.qrels). The provided code implements most of an information retrieval system. It is designed to be simple to understand and modify; it is neither efficient nor scalable. When developing a real-world IR system you would be better off using high-performance software such as Apache Lucene¹.

Throughout this assignment:

1. You will develop a better understanding of indexing, including the tokenizer, parser, and normaliser components, and how to improve the search performance given a predefined evaluation metric;

2. You will develop a better understanding of search algorithms, and how to obtain better search results; and

3. You will find the best way to combine an indexer and a search algorithm to maximise your performance.

Submission

The answers to this assignment (including your modified code files) have to be submitted online in Wattle; see the link Assignment 1 Submission in Week 4 (15 to 19 August).

You can edit your answers many times and they will be saved by Wattle. Note that Wattle does not allow us to access any earlier edited versions of your answers, so check very carefully what you submit as the final version.

You can only submit your assignment once. Make sure you submit the final version of your assignment answers before the submission deadline.

Marking

This assignment will be marked out of 15, and it will contribute 15% of your final course mark.

Your answers to coding questions will be marked based on the quality of your code (is it efficient, is it readable, is it extendable, is it correct) and the solution in general (is it appropriate, is it reliable, does it demonstrate a suitable level of understanding).

Your answers to discussion questions will be marked based on how convincing your explanations are (are they sufficiently detailed, are they well-reasoned, are they backed by appropriate evidence, are they clear, do they use appropriate aids such as tables and plots where necessary).

¹ https://lucene.apache.org/


Question 1: Implement the Indexer and Evaluate (3 marks)

Your first task is to implement index construction. You should complete the function stub index_from_tokens in indexer.py. This function takes as input a sorted list of tuples of the form (token_string, doc_id), indicating that the token token_string was in the document represented by doc_id. For each occurrence of a token in a document there is a tuple in this list – this means duplicate tuples indicate term count. The input list is sorted in ascending order by token_string then doc_id.

Once complete, the index_from_tokens function should output two dictionaries. The first dictionary, index, is a mapping from token strings to a sorted list of (doc_id, term_frequency) tuples. These lists should have their elements sorted in ascending order by doc_id, and contain only unique doc_ids – the duplicates are used to compute term frequency. The second dictionary, doc_freq, is a mapping from token_string to document frequency.

Once you have implemented index_from_tokens, you should run indexer.py to store the index, then run query.py to run a set of test queries, and finally run evaluate.py to evaluate the query results against the ground truth. Record your evaluation results in your answers and make sure you submit your indexer.py.

Example Input for index_from_tokens:

[("cat", 1), ("cat", 1), ("cat", 2), ("door", 1), ("water", 3)]

Example Output for index_from_tokens:

index: {"cat": [(1, 2), (2, 1)], "door": [(1, 1)], "water": [(3, 1)]}

doc_freq: {"cat": 2, "door": 1, "water": 1}
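As one possible sketch of the required behaviour (not the reference solution – the real function stub and its exact signature live in indexer.py), the sorted input can be collapsed into postings and document frequencies in a single pass:

```python
def index_from_tokens(sorted_tokens):
    """Build an inverted index from a list of (token, doc_id) tuples
    sorted ascending by token then doc_id."""
    index = {}     # token -> sorted list of (doc_id, term_frequency)
    doc_freq = {}  # token -> number of distinct documents containing token
    for token, doc_id in sorted_tokens:
        postings = index.setdefault(token, [])
        if postings and postings[-1][0] == doc_id:
            # Duplicate (token, doc_id): bump this document's term count.
            postings[-1] = (doc_id, postings[-1][1] + 1)
        else:
            # First occurrence of this token in this document.
            postings.append((doc_id, 1))
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return index, doc_freq
```

Because the input is already sorted, duplicates are always adjacent, so comparing against the last posting is enough to keep each list unique and ordered.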

Question 2: Implement TF-IDF Cosine Similarity (3 marks)

Currently query_tfidf.py uses cosine similarity applied to term frequency (it is currently a copy of query.py, but you will change that). Your task is to implement cosine similarity applied to TF-IDF. In your solution, both the query and the document vectors should be TF-IDF vectors. You will need to modify the run_query function and the get_doc_to_norm function, both of which are in query_tfidf.py.

The TF-IDF variant you should implement is:

    TF-IDF = nt * ln(N / (1 + nd))

where nt is the term frequency, nd is the document frequency, and N is the total number of documents in the collection. This is almost the standard TF-IDF variant, except that 1 is added to the document frequency to avoid division-by-zero errors.
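The formula can be sketched in code as follows. The helper names tfidf and doc_norm here are hypothetical illustrations; the actual changes belong in run_query and get_doc_to_norm in query_tfidf.py:

```python
import math

def tfidf(term_freq, doc_freq, num_docs):
    # The assignment's variant: nt * ln(N / (1 + nd)).
    return term_freq * math.log(num_docs / (1 + doc_freq))

def doc_norm(term_freqs, doc_freq, num_docs):
    # Euclidean norm of one document's TF-IDF vector.
    # term_freqs maps token -> raw count in this document;
    # doc_freq maps token -> its document frequency nd.
    return math.sqrt(sum(
        tfidf(tf, doc_freq[t], num_docs) ** 2
        for t, tf in term_freqs.items()
    ))
```

The same weighting is applied to the query vector, and cosine similarity is then the dot product of the two TF-IDF vectors divided by the product of their norms.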

Once you have implemented TF-IDF cosine similarity, run the query_tfidf.py file, then run evaluate.py, and record the results in your answers. Make sure you submit your query_tfidf.py.

Question 3: Explore Linguistic Processing Techniques (3 marks)

For this question you will explore ways to improve the process_tokens function in string_processing.py. The current function removes stopwords. You should modify the function and explore the results. To modify the function, you should make changes to the functions process_token_1, process_token_2, and process_token_3, and then uncomment the one you want to test within the main process_tokens function. You should pick at least three different modifications and evaluate them (you can add new process_tokens functions if you want to evaluate more than three modifications). See the lectures for some possible modifications. You might find the Python nltk library useful. The modifications you make do not need to require significant coding; the focus of this question is choosing reasonable modifications and explaining the results.
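By way of illustration, three candidate modifications might look like the sketch below. These are illustrative stand-ins, not required answers: the suffix stripper is only a crude placeholder for a real stemmer (e.g. nltk's PorterStemmer), and returning None is assumed here to mean the token gets discarded by the surrounding process_tokens function.

```python
def process_token_1(token):
    # Modification 1: case folding, so "Welcome" and "welcome"
    # share one postings list.
    return token.lower()

def process_token_2(token):
    # Modification 2: crude suffix stripping to conflate inflected
    # forms (a stand-in for a proper stemmer such as PorterStemmer).
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def process_token_3(token):
    # Modification 3: drop very short tokens, which are rarely
    # informative for retrieval.
    return token if len(token) > 2 else None
```

Each variant trades precision against recall differently, which is exactly the kind of effect the discussion should explain with concrete success and failure examples.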


Note: for this question you should use the unmodified query.py file. Do not use your TF-IDF implementation.

For each of the modifications you make, you should describe in your answers:

  • What modifications you made.
  • Why you made them (in other words, why you thought they might work).
  • What the new performance is.
  • Why you think the modification did/did not work, making sure to give (and explain) examples of possible failure or success cases.

Finally, you should compare all the modifications using one appropriate metric and decide which modification (or combination of modifications) performed best. Your comparison should make use of a table or chart as well as some discussion. Make sure to report all of this and your justification in your answer.

Make sure to submit your string_processing.py showing each of the changes you made.

Question 4: Implement Boolean Queries (3 marks)

Your task is to implement the ability to run boolean queries on the existing index. The starting code is provided in query_boolean.py. Specifically, you will implement a simplified boolean query grammar. You may assume that input queries consist of only AND and OR operators separated by single tokens. For example, cat AND dog is a valid query, while cat mask AND dog is not a valid query since cat mask is not a single token. You are not required to implement NOT. The order of operations will be left to right with no precedence for either of the operators; for example, the query cat AND dog OR fish AND fox should be evaluated in the following order: ((cat AND dog) OR fish) AND fox. The brackets are provided as an example; you can assume that the queries provided to your system will not contain brackets. To score full marks on this question, your solution should implement O(n+m) sorted list intersection and sorted list union algorithms, where n and m refer to the lengths of the two lists – O(n+m) sorted list intersection was covered in the lectures, while union is very similar. Solutions using data structures such as sets or dictionaries to implement the intersection and union operations will not score full marks.
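A minimal sketch of the required merge-style algorithms, assuming plain ascending lists of doc_ids (the postings lists in your index also carry term frequencies, so the real code would need to adapt accordingly):

```python
def intersect(a, b):
    # O(n+m) intersection of two ascending doc_id lists.
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    # O(n+m) union of two ascending doc_id lists (no duplicates emitted).
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])  # append whatever remains of either list
    out.extend(b[j:])
    return out
```

Evaluating a query left to right then amounts to folding intersect or union over the terms' postings lists, one operator at a time.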

Once you have completed your boolean query system, please run it on the following queries and list the relative paths (e.g. ./gov/documents/92/G00-92-0775281) of the retrieved documents in your answers. (HINT: none of the queries below gives more than 10 results, and Query 0 has been done for you so that you can check your system.) Please double check that you are not using an index built with a modified version of process_tokens in string_processing.py.

Query 0: Welcoming

Answer:


Query 1: Australasia OR logistic

Query 2: heart AND warm

Query 3: global AND space AND wildlife

Query 4: engine OR origin AND record AND wireless

Query 5: placement AND sensor OR max AND speed

Make sure you submit your code for this question (i.e. query_boolean.py) as well as your answers.

Question 5: Evaluating IR Systems (3 marks)

Answer the following two questions:

(a) Explain how you can evaluate your boolean query system.

(b) Now consider the case where the output of your boolean query system is going to be re-ranked and only the top 10 results displayed. Explain how you could evaluate this combined boolean query and re-ranking system.

Your answers to both (a) and (b) should at a minimum consider:

  • What data you would need (e.g. queries and ground truth);
  • The challenges that you would have to face getting this data;
  • What metrics are appropriate, and why these metrics are appropriate.

