COMP90042 Project 2018: Question Answering
May 3, 2018
1 Project Outline
Copyright the University of Melbourne, 2018
Due date: 11pm, Sunday 27th May, 2018
Your challenge is to build a Question Answering (QA) system, working either as an individual or in a small team of up to three members. More specifically, given a question and a document, the goal is to find the answer to that question in the corresponding document. Your task is to develop an automatic method, or several methods, for solving this problem.
You will need to write a report outlining the system(s) you developed, the reasons behind the choices you have made, and the performance results for your techniques.
We hope that you will enjoy the project. To make it more engaging, we will run this task as a Kaggle in-class competition. Kaggle is one of the most popular online platforms for running data science competitions. You will be competing with other teams in the class. The following sections give more details on the data format, the use of Kaggle, and the marking scheme. Your assessment will be based on your team’s performance in the competition, your code, and your report.
Submission materials: Please submit the following:
Report (.pdf)
Python code (.py or .ipynb)
Each group should choose one member to submit these files via the LMS as a zip archive (submissions using obscure archive formats such as 7z, rar, etc. will not be marked, and will be given a score of 0).
If multiple code files are included, please make clear what each of the files does. Do not submit any data files; your code should assume the data is in the same directory. We should be able to run your code if needed; however, note that code is very much secondary – the primary focus of marking will be your report and your system performance on Kaggle.
Every member of the group will also be required to perform peer assessment of their colleagues in the team (individual teams are exempted from this), which will be factored into the marking to adjust for individual effort. The peer assessment process will be done via an LMS quiz, which will be announced close to the project deadline.
You must also submit at least one entry to the Kaggle In Class competition.
Late submissions: -20% per day (note that late Kaggle submissions cannot be accepted, i.e., the competition
component needs to be completed strictly by the deadline)
Marks: 30% of mark for class
Materials: See the main class LMS page for information on the basic setup required for this class, including
an iPython notebook viewer and the Python packages NLTK, Numpy, Scipy, Matplotlib, Scikit-Learn, and
Gensim. For this project, you are encouraged to use the NLP tools accessible from NLTK, such as the Stanford parser, NER tagger, etc., or you may elect to use the Spacy toolkit. You are free to use the corpora in NLTK. You may also use Python-based deep learning libraries: Theano, TensorFlow, Keras or PyTorch. You are being provided with a training, a development and a test set. See the main instructions for information on their format and usage. No other package, tool, or corpus is allowed except by special permission of the instructors: if there’s something else you want to use, e-mail us.
Evaluation: You will be evaluated based on several criteria: the correctness of your approach, the originality and
appropriateness of your method, the performance of your best system, and the clarity and comprehensiveness of your final report.
Group assessment: In the majority of cases, everyone in the same group will receive the same mark; however, we may make adjustments based on the peer assessment.
To keep things fair, the assessment will be tailored to the size of the group. For teams of two, we will expect
a longer report. For teams of three, besides a longer report, teams should also submit two QA systems
instead of one. These systems need to be sufficiently different in method, as outlined below. As well as
this, team submissions on Kaggle will be evaluated against all other teams of smaller or equal size.
Updates: Any major changes to the project will be announced via LMS. Minor changes and clarifications will be announced in the forum on LMS; we recommend you check the forum regularly.
Academic Misconduct: Though this is a group project and therefore exchange of code within groups is allowed,
reuse of code across groups, copying large chunks of code from online sources, or other instances of clear
influence will be considered cheating. Do remember to cite your sources properly, for research ideas, algorithmic solutions, and code snippets alike. We will be checking submissions for originality and will invoke
the University’s Academic Misconduct policy where inappropriate levels of collusion or plagiarism are
deemed to have taken place.
2 Use of Kaggle in Class
Please do the following within the first week after receiving this assignment:
Set up an account on Kaggle using your unimelb student email, setting your Kaggle username to your unimelb login,¹ and register for the competition here https://www.kaggle.com/t/959a494a05a84083b4eab38ce126b65d. Silly login names might be great fun, but please resist the temptation, as this makes marking difficult. Submissions from non-unimelb accounts will not be marked.
Form your team of student peers; if you do not form one by Wednesday 9th May, you will be treated as a single-person team.
For teams:
Connect with your team mates on Kaggle and form a Kaggle team, which you can name according to
your whim.
Submit your Kaggle team name and a list of member names and emails to https://goo.gl/forms/Ge6Chc0RyvbbeEls2. Only fill in the form once per team.
For individuals: Submit your Kaggle name and member name to https://goo.gl/forms/Ge6Chc0RyvbbeEls2
Those in teams should only make submissions using the team name; individual submissions will attract penalties. Those working as an individual should make their submissions as an individual. Note that you are limited to 5 submissions per day.
You should submit your Kaggle results as a comma-separated values (CSV) file. Your predictions should be output, one entry per line, in a CSV file where each line contains two fields: the question id and the answer. The ids are increasing integers denoting the question in the test dataset, while the answer strings can be words or sequences of words. Here is an example:
id,answer
0,the queen of spain
1,eating fish and chips on the pier
2,bananarama
...
¹ If your login is already taken, please choose a string that makes it trivial for us to work out who you are, e.g., by adding a suffix to your unimelb login name (e.g., bdylan → bdylan843).
See also the sample submission on Kaggle (with garbage answers). Take care with double quotes, commas and other special characters, which will need to be escaped (using the Python csv library makes this easy). Be sure to include the header line as given above.
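For reference, here is a minimal sketch of writing such a file with the Python csv library; the answers list and the output filename are placeholders for whatever your own system produces.

import csv

# Hypothetical predictions: answers[i] is the answer string for test question id i.
answers = ["the queen of spain", "eating fish and chips on the pier", "bananarama"]

with open("submission.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "answer"])      # the required header line
    for qid, answer in enumerate(answers):
        writer.writerow([qid, answer])     # csv.writer handles quoting and escaping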
The real answers for the test data are hidden from you, but were made available to Kaggle. Each time a submission is made, half of the predictions will be used to compute your public F-score and determine your rank on the public leaderboard. This information will become available from the competition page almost immediately. At the same time, the other half of the predictions is used to compute a private F-score and rank on the private leaderboard, and this information will be hidden from you. At the end of the competition, only private scores and private ranks will be used for assessment. This type of scoring is common practice and was introduced to discourage overfitting to the public leaderboard. A good model should generalise and work well on new data, which in this case is represented by the portion of data with the hidden F-score.
The evaluation score used in this competition is the average F-score over the token strings, which is based on the precision and recall between the bag of predicted tokens and the tokens of the text span in the gold standard. For a prediction to be judged completely correct, it must be string-identical to the gold answer, meaning that correct tokenisation, including punctuation, determiners, etc., is highly important. Before the end of the competition each team will need to choose their best submissions for scoring (these could be your basic QA system or, more likely, one of your extensions); they do not have to be the latest submissions. Kaggle will compute a private F-score for the chosen submissions only, and the best of these will be used to compute the private leaderboard ranking, which will be used for marking.
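To sanity-check your output on the development set, the token-overlap F-score described above can be approximated along the following lines; this is only a sketch that assumes simple whitespace tokenisation, not the exact scorer Kaggle runs.

from collections import Counter

def token_f1(predicted, gold):
    # Bag-of-tokens F-score between a predicted and a gold answer string,
    # assuming whitespace tokenisation (the official scorer may tokenise differently).
    pred_tokens = predicted.split()
    gold_tokens = gold.split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# e.g. token_f1("the queen of spain", "queen of spain") is roughly 0.86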
3 Datasets
You are provided with several data files for use in the project:
documents.json a collection of wikipedia documents
training.json a set of training questions and answers
devel.json a set of development questions and answers
testing.json a set of test questions, with no answers given
Each of the datafiles is a JSON list. The first comprises a list of documents, where each document has an identifier field, “docid”, and a “text” field containing a list of paragraph strings that make up the document. The other datafiles contain lists of questions, with a “question” field containing the question text and a “docid” field identifying the relevant document, and, in the case of testing.json, a question “id” which is used in evaluating your outputs on Kaggle. For the training and development data, each question additionally includes a gold standard answer in the “text” field, as well as an “answer_paragraph” field, which identifies the paragraph in the document where the answer can be found. There is an exact string match of the answer in the answer sentence; however, please be aware that the tokenisation of the “text” answers will often differ from the untokenised text in the document. For the test data, these answer fields are missing, but the “id” number is included and should be used when creating your Kaggle submissions. You should use the Python json library to load these files.
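As an illustration, the files can be loaded roughly as follows; the field names are those described above, and the assumption that “answer_paragraph” is an index into the document’s paragraph list should be checked against the data.

import json

with open("documents.json") as f:
    documents = json.load(f)   # list of {"docid": ..., "text": [paragraph, ...]}
with open("training.json") as f:
    training = json.load(f)    # list of {"question", "docid", "text", "answer_paragraph"}

# Index the documents by their identifier for quick lookup.
doc_by_id = {doc["docid"]: doc for doc in documents}

example = training[0]
# The paragraph that should contain the answer (assuming "answer_paragraph" is an index).
gold_paragraph = doc_by_id[example["docid"]]["text"][example["answer_paragraph"]]
print(example["question"], "->", example["text"])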
Each of these datasets has a different purpose. The training data should be used for building your QA models, e.g., for use in supervised learning. You are encouraged to inspect this data closely to fully understand the task, the types of questions, and the kinds of answers you are expected to predict. The training set is large, and you should not feel you have to use all of the data if it is not feasible to do so. The development set is formatted like the training set, where we have reserved some documents for validation. This will help you make major implementation decisions, and should also be used for detailed analysis of your system – both for measuring performance, and for error analysis – in the report.
You will use the test set, as discussed above, to participate in the Kaggle competition. You should not at any
time manually inspect the test dataset; any sign that you have done so will result in loss of marks.
4 QA System
You are asked to develop one or more QA systems, depending on the size of your group. How you do this is up
to you, and you should start by reviewing the lecture on QA and the suggested reading carefully. In particular,
you should consider approaches based on
retrieval techniques for finding the best matching paragraph, sentence or smaller span based on the content terms in the question and the text; this might serve as the first stage in proposing candidate spans, for use in later stages of processing (a small sketch of this idea follows the list below)
language processing techniques, such as parsing and named entity tagging, based on the idea that the answer is often a constituent or an entity; you might also want to use lexical resources like WordNet, distributional word vectors or topic models
some form of machine learned method, such as learning the expected answer type (e.g., a person entity, a number, a place name) from the way the question is phrased (e.g., what “wh” word is used, if any); a second sketch after this list illustrates the kind of features this might use
and combinations thereof, as well as your own ideas for solving the problem.
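As a concrete starting point for the first (retrieval) approach above, the sketch below scores paragraphs against the question using TF-IDF weighted cosine similarity with Scikit-Learn from the allowed package list; treat it as a baseline idea under those assumptions, not a required design.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def best_paragraph(question, paragraphs):
    # Return the index of the paragraph most similar to the question,
    # comparing TF-IDF vectors built over the paragraphs and the question.
    vectoriser = TfidfVectorizer(stop_words="english")
    matrix = vectoriser.fit_transform(list(paragraphs) + [question])
    similarities = cosine_similarity(matrix[-1], matrix[:-1])
    return int(similarities.argmax())

The selected paragraph (or its sentences, scored the same way) can then be handed to later stages that extract a shorter answer span.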
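For the machine-learned answer-type idea, a very small feature extractor over the question wording might look as follows; the feature names and the suggested classifier are illustrative assumptions only.

WH_WORDS = {"who", "whom", "whose", "what", "which", "when", "where", "why", "how"}

def question_features(question):
    # Features for predicting the expected answer type (person, number, place, ...):
    # the leading "wh" word, if any, and the word immediately following it.
    tokens = question.lower().rstrip("?").split()
    for i, token in enumerate(tokens):
        if token in WH_WORDS:
            next_word = tokens[i + 1] if i + 1 < len(tokens) else "none"
            return {"wh_word": token, "next_word": next_word}
    return {"wh_word": "none", "next_word": "none"}

Such features could be fed to, for example, Scikit-Learn’s DictVectorizer and LogisticRegression, with answer-type labels derived from the training answers (for instance via NER tags over the gold answer strings).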
Teams of 3 members will be required to develop diverse approaches, such that the techniques are not just minor riffs on the same idea. Accordingly, such teams should produce at least one predominantly retrieval-based method, and one based on richer NLP annotations. Significant parts of your preprocessing and evaluation code will be shared between your methods; however, the core of the techniques should differ.
If you are at all uncertain about what design choices to make, you should evaluate your methods using the dev data, and use the results to justify your choices in your report. You will need to perform error analysis on the development data, where you attempt to understand where your approach(es) work well, and where they fail. As part of this, you may want to develop method-specific evaluation, such as evaluating retrieval accuracy, as sketched below.
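For example, a retrieval-specific diagnostic on the development data might be computed along these lines, where rank_paragraph stands for whatever paragraph-selection function your system uses and doc_by_id maps “docid” values to documents (both are placeholders).

def retrieval_accuracy(dev_questions, doc_by_id, rank_paragraph):
    # Fraction of dev questions for which the retrieval stage picks the gold
    # "answer_paragraph" -- a diagnostic for one component, not the Kaggle metric.
    hits = sum(
        rank_paragraph(q["question"], doc_by_id[q["docid"]]["text"]) == q["answer_paragraph"]
        for q in dev_questions
    )
    return hits / len(dev_questions)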
Your approaches should run in a modest time frame, with the end to end process of training and evaluation
not taking more than 8 hours of wall clock time on a commodity desktop machine (which may have a single
GPU card). You are welcome to use cloud computing for running your code, however techniques with excessive
computational demands will be penalised.
5 Evaluation
Your submissions will be evaluated on the following grounds:
Component        Marks   Criteria
Report writing     6     clarity of writing and document structure; exposition of methodology; display of experimental results
Report content    12     exposition of technique; motivation for method(s) and justification of design decisions; correctness of technique; ambition of technique*; quality of error analysis*; interpretation of results and experimental conclusions*
Competition       12     Kaggle ranking in cohort; absolute score of system
*: For these criteria, the expectations applied to team submissions will be judged more stringently than for individuals. For instance, the greater manpower available to teams means that we expect more advanced and elaborate techniques, judged under “ambition”. This also applies to the level of detail in the error analysis, and the degree of comparative analysis in the results and conclusions, which will need to have greater depth to achieve a high grade.
Once you are satisfied that your system is working as intended, you should use the training and development
data to do a thorough error analysis, looking for patterns in the kinds of errors your basic system is making. You
should consider the various steps in processing, and identify where the most serious problems are occurring.
If there are any relatively simple fixes that might have a sizeable impact on performance, you should feel free
to note and apply them, but your main goal is to identify opportunities for major enhancements. You should
include a summary of this error analysis in your report.
Each team will submit a report with the description, analysis, and comparative assessment (where applicable)
of methods used. There is no fixed template for the report, but it should start with a very brief introduction of
the problem. You should mention any choices you made in implementing your QA system along with empirical
justification for those choices. Use your error analysis of the basic system to motivate your enhancements, and
describe them in enough detail that we could replicate them without looking at your code. Using the dev dataset,
you should evaluate whether your enhancements increased performance as compared to the basic system, and
also report your relative performance on the Kaggle leaderboard. Finally, discuss what steps you might take
next if you were to continue development of your system (since you don’t actually have to do it, feel free to be
ambitious!).
For the evaluation, you should generally avoid reporting numbers in the text: include at least one table, and
at least one chart. Using the dev set, you should report your results, based on the average F-score. In addition,
you are encouraged to report results with other metrics, where needed to best support your error analysis.
Your description of your method should be clear and concise. You should write it at a level that a master's student could read and understand without difficulty. If you use any existing algorithms, you do not have to
rewrite the complete description, but must provide a summary that shows your understanding and references
to the relevant literature. In the report, we will be very interested in seeing evidence of your thought processes
and reasoning for choosing one approach over another.
The report should be submitted as a PDF, and be no more than:
Team Size   Content Pages   Reference Pages
    1             2               1
    2             3               1
    3             4               1
where page counts refer to single-sides of A4, using a font size of 11pt and margins of at least 1cm (much like
this document). “Content pages” should contain your brief introduction, method description, plots, tables, discussion of results, and conclusions. “Reference Pages” may contain only references. You do not need to include a
cover page, but please ensure that all usernames are clearly stated on the first page, as well as your Kaggle team
name. If a report is longer than the stated limits above, we will only read and assess the report up to the page
limit and ignore further pages.
Competition component
This component of your assessment is based on the results achieved by your method(s) in your final submission
to the Kaggle competition. Before the end of the competition you should choose two of your systems to put
forward for the final evaluation on the private data.
You will be evaluated in two ways:
by ranking where each competition entry is ranked in terms of the private evaluation score, such that the team
with the best score gets a rank of 1 and the worst scoring team a rank of N. Any ties will be assigned the
lower (better) rank. These ranks will then be mapped linearly to produce a mark between 0 and 6, i.e., rank
1 gets a mark of 6, and rank N gets a mark of 0. We will round fractional marks to the nearest half mark.
Note that in determining rankings we will compare your approach only against teams of equal or smaller
size. This way individuals will not be penalised for underperforming against much larger teams. Note
however that larger teams will have a tougher time, as they effectively compete against a larger cohort.
by absolute score where the mark is set based on your F1 score on the private test data, according to the following levels:
private test F1   <1%   1-5%   5-9%   9-12%   12-15%   15-18%   >18%
marks              0     1      2      3       4        5        6
Although you will not see the private test results until the project deadline, your public test results should give you a reasonable idea of your performance. The benchmark system labelled “heuristic” in the leaderboard is a method we implemented, and falls roughly in the middle of the above range. You should be able to beat it with some effort.
Together these two marks will be combined to give an overall evaluation score out of 12.