COMS 4705 – Natural Language Processing – Fall 2015
Assignment 4
Machine Translation
Due: April 25, 2016 at 11:59:59 PM

Introduction
In this assignment we will:
1. Go through the basics of Machine Translation models.
2. Implement IBM models 1 and 2.
3. Create an implementation of the Berkeley Aligner.
4. Analyze data output from all models.
This document is structured in the following sequence:

Getting Started – Basic introduction to Machine Translation and IBM Models 1 & 2

Environment Setup – Set up everything to start working on the assignment

Tutorial – Overview of NLTK commands to be used in the assignment

Assignment Part A: IBM Models 1 & 2 – Implement and examine IBM Models 1 & 2

Assignment Part B: Berkeley Aligner – Implement and analyze the Berkeley Aligner; optional extra credit

Submission – Files to be submitted

Getting Started

The goal of this assignment is to understand and implement various models for machine translation.
We will be using the German-English corpus from Comtrans. The corpus contains roughly 100,000
pairs of equivalent sentences--one in German and one in English. Our objective will be to
determine the alignments between these sentences using various Machine Translation models and
compare the results.

Assume we are given a parallel corpus of training sentences in two languages. For example, the
first pair in the corpus is:

e = Resumption of the session
f = Wiederaufnahme der Sitzungsperiode

For each pair of sentences, we want to determine the alignment between them. Since we do not
have any alignment information in our training set, we will need to determine a set of parameters
from the corpus, which will enable us to predict the alignments.

For IBM Model 1, we need to determine one parameter set: t(f | e). The t parameter represents the
probability that word f is the translation of word e. The q parameters are not estimated in Model 1;
they are fixed at the uniform distribution, meaning that q(j | i, l, m) = 1 / (l + 1). This results in a
simple model that produces alignments based only on the most probable translation of each word.

For IBM Model 2, we need to determine two parameter sets: q(j | i, l, m) and t(f | e). The q parameter
represents the alignment probability that a particular word position i will be aligned to a word position
j, given the sentence lengths l and m. This results in a model that examines both the word translations
and the position distortion of a word in the target sentence.
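
To make the difference concrete, here is a minimal sketch of the two decision rules once the
parameters have been estimated. The dictionaries t and q are hypothetical containers for the
estimated parameters (with position 0 of the source sentence standing for the NULL word); this
is illustrative Python, not NLTK's internal code.

def align_model1(words, mots, t):
    # IBM Model 1: align each target word to the source position (0 = NULL)
    # with the highest translation probability t(f | e).
    src = [None] + mots
    return [max(range(len(src)), key=lambda j: t[(w, src[j])])
            for w in words]

def align_model2(words, mots, t, q):
    # IBM Model 2: weight the translation probability by the distortion
    # probability q(j | i, l, m) before taking the argmax.
    src, l, m = [None] + mots, len(mots), len(words)
    return [max(range(len(src)),
                key=lambda j: t[(w, src[j])] * q[(j, i, l, m)])
            for i, w in enumerate(words, start=1)]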
A great explanation of the IBM Models by Michael Collins can be found here:

http://www.cs.columbia.edu/~mcollins/courses/nlp2011/notes/ibm12.pdf

In addition, here are a few more links explaining the IBM Models:
http://mt-class.org/jhu/slides/lecture-ibm-model1.pdf
http://web.stanford.edu/class/cs224n/handouts/cs224n-lecture3-MT-6up.pdf
https://www.cs.jhu.edu/~jason/465/PowerPoint/lect32a-mt-word-based-models.pdf

In order to compute these parameters, we use the EM algorithm. The EM algorithm takes an initial set
of parameters and attempts to find the maximum likelihood estimates of these parameters under the
model. There are two steps: the Expectation step, which computes expected alignment counts under
the current parameters, and the Maximization step, which re-estimates the parameters from those
counts to increase the likelihood.
A very detailed explanation with some intuition behind it can be found here:

http://www.isi.edu/natural-language/mt/wkbk.rtf
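
As a concrete illustration of one EM loop, here is a minimal sketch of IBM Model 1 training,
following the Collins notes linked above. The (target words, source words) pair layout and the
use of None as the NULL word are assumptions made for this sketch; it is not the NLTK
implementation.

from collections import defaultdict

def train_ibm1(pairs, num_iters):
    # pairs: list of (target_words, source_words) sentence pairs.
    vocab = {w for ws, ms in pairs for w in ws}
    t = defaultdict(lambda: 1.0 / len(vocab))    # t(f | e), uniform start
    for _ in range(num_iters):
        counts = defaultdict(float)              # expected count(f, e)
        totals = defaultdict(float)              # expected count(e)
        for ws, ms in pairs:
            src = [None] + ms                    # position 0 is the NULL word
            for w in ws:
                z = sum(t[(w, s)] for s in src)  # E-step: normalizer for w
                for s in src:
                    delta = t[(w, s)] / z        # posterior of aligning w to s
                    counts[(w, s)] += delta
                    totals[s] += delta
        for (w, s) in counts:                    # M-step: re-estimate t(f | e)
            t[(w, s)] = counts[(w, s)] / totals[s]
    return t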

In this assignment, you will implement a simplified version of the BerkeleyAligner model. This model
uses the concept of alignment by agreement. Essentially, you will be training two IBM model 2s
simultaneously: one from the source language to the target language and one from the target language
to the source language. The idea is that intersecting these two models will eliminate some of the errors
that unidirectional translation models would produce. In order to accomplish this, the EM algorithm
initializes two models and maximizes the combined probability of both models. More information can
be found here:

http://cs.stanford.edu/~pliang/papers/alignment-naacl2006.pdf

Environment Setup

Step 1: Add NLTK location to your default path
NOTE: This should have been done in proj1, but may need to be done again if you have edited
your .bashrc
Open the file ~/.bashrc and add the following line to the end of the file:
export PYTHONPATH=$PYTHONPATH:/home/cs4701/python/lib/python2.7/site-packages
Then run:

source ~/.bashrc

Step 2: Create a link to the NLTK files
Move to your home directory and run:
ln -s ~coms4705/nltk_data nltk_data
To test that you have successfully linked to the nltk directory, open the python interpreter and
run:
from nltk.corpus import comtrans
This should not produce any errors. If you get the error:
Resource u'corpora/comtrans' not found
it is likely left over from the linking step in Assignment 1. Remove the old symbolic link:
rm ~/nltk_data
Run the linking command again and test. You should get no errors.

Step 3: Copy the homework files to your hidden directory under Homework4 folder
Use the following command:
cp -r ~coms4705/Homework4 ~/hidden//Homework4/

Tutorial
In this assignment, you will be using many of the alignment tools available within NLTK.

One important object is AlignedSent, which is given in the ComTrans corpus object:

AlignedSent(['Resumption', 'of', 'the', 'session'], ['Reprise', 'de', 'la', 'session'],
Alignment([(0, 0), (1, 1), (2, 2), (3, 3)]))

The original sentence can be obtained using aligned_sent.words. The translated sentence can be
obtained with aligned_sent.mots. The alignment array can be obtained with
aligned_sent.alignment. Each tuple (i, j) in the alignment indicates that words[i] is aligned to
mots[j].

An IBMModel1 object can be created with:

ibm = IBMModel1(aligned_sents, num_iters)

"num_iters" is the number of iterations you want the EM algorithm to make. The model
automatically trains on initialization. You can then call the align method to predict the alignment
for an AlignedSent object.

Each AlignedSent object in the corpus contains the correct alignments. In order to evaluate the
results of the model, use the AlignedSent.alignment_error_rate() function.
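
Putting these pieces together, a short session might look like the following. This is a hedged
sketch assuming the nltk.align API described above (in newer NLTK releases these classes live
under nltk.translate and the signatures differ slightly).

from nltk.corpus import comtrans
from nltk.align import IBMModel1

aligned_sents = comtrans.aligned_sents()[:100]  # small slice for a quick test
ibm = IBMModel1(aligned_sents, 10)              # trains on construction, 10 EM iterations

test_sent = aligned_sents[0]
predicted = ibm.align(test_sent)                # AlignedSent with the predicted alignment
print(predicted.words)                          # target sentence
print(predicted.mots)                           # source sentence
print(predicted.alignment)                      # predicted (i, j) pairs, e.g. 0-0 1-1 ...

# Compare against the gold alignment stored in the corpus (lower AER is better).
print(predicted.alignment_error_rate(test_sent.alignment))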

Provided Files and Report
The assignment can be found in:
/home/coms4705/Homework4
It contains the following files:
A.py – contains skeleton code for Part A
B.py – contains skeleton code for Part B
EC.py – contains skeleton code for Part B extra credit
main.py – runs the assignment [do not modify this file]

You must use the skeleton code we provide you. Do not rename these files. Inside, you will find
our solutions with most of the code removed. In the missing code’s place, you will find comments
to guide you to a correct solution.
We have provided code that creates all output files (main.py). Do not modify this code in any way,
and pay close attention to the instructions in the comments. If you modify this code it may be very
difficult to evaluate your work and this will be reflected in your grade. You may add helper functions
if you would like.

Report
You are required to create a brief report about your work. The top of the report should include
your UNI and the time you expect each of your programs to complete. Throughout the assignment,
you will be asked to include specific output or comment on specific aspects of your work.

Part A – IBM Models 1 & 2
In this part, we will be using NLTK to examine IBM models 1 & 2. You will need to implement the
methods compute_average_aer(), save_model_output(), create_ibm1(), and create_ibm2(). When
creating the text files, do not worry about encoding.
1) Initialize an instance of IBM Model 1, using 10 iterations of EM. Train it on the corpus.
Then, save the model's predicted alignments for the first 20 sentence pairs in the corpus
to "ibm1.txt" (we are training and testing on the same data; a sketch of a writer for this
format appears after this list). Use the following format for each sentence pair:
Target sentence [as given by AlignedSent.words]
Source sentence [as given by AlignedSent.mots]
Alignments [as given by AlignedSent.alignment]
(blank line)
Ex:
[u'Frau', u'Pr\xe4sidentin', u',', u'zur', u'Gesch\xe4ftsordnung', u'.']
[u'Madam', u'President', u',', u'on', u'a', u'point', u'of', u'order', u'.']
0-0 1-0 2-2 3-7 4-7
(blank line)
2) Initialize an instance of IBM Model 2 using 10 iterations of EM. Train it on the corpus. Then,
save the model's predicted alignments for the first 20 sentence pairs in the corpus to "ibm2.txt".
Use the same sentence pair format as above.
3) For each of the first 50 sentence pairs in the corpus, compute the alignment error rate
(AER). Compute the average AER over these 50 pairs for each model. You can use the AER
function that exists in NLTK, but you must implement your own averaging scheme (see the
sketch after this list). Compare the results between the two models. Specifically, highlight a
sentence pair from these 50 where one model outperformed the other. Comment on why one
model computed a more accurate alignment on this pair.
4) Experiment with the number of iterations for the EM algorithm. Determine a reasonable
number of iterations (in terms of processing time) that minimizes the AER. Discuss how the
number of iterations is related to the AER. This number may not be the same for each of the
IBM models. Note: even though you are experimenting with varying numbers of iterations
here, the ibm1.txt and ibm2.txt files you submit should be generated with 10 iterations.
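
For reference, here is a hedged sketch of steps 1 and 3 above. The function names
save_model_output and compute_average_aer match the skeleton in A.py, but their exact
signatures here are assumptions; where they disagree, follow the comments in the skeleton.

def save_model_output(aligned_sents, model, filename):
    # Step 1: write the first 20 sentence pairs in the four-line format above.
    with open(filename, 'w') as out:
        for sent in aligned_sents[:20]:
            predicted = model.align(sent)
            out.write(str(predicted.words) + '\n')
            out.write(str(predicted.mots) + '\n')
            out.write(str(predicted.alignment) + '\n')
            out.write('\n')

def compute_average_aer(aligned_sents, model, n=50):
    # Step 3: average the per-sentence AER over the first n pairs.
    total = 0.0
    for sent in aligned_sents[:n]:
        predicted = model.align(sent)
        total += predicted.alignment_error_rate(sent.alignment)
    return total / n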

Part B – Berkeley Aligner
In this part we will improve upon the performance of the IBM models using a simplified version of
the BerkeleyAligner Model. In this model, we will train two separate models simultaneously such
that we maximize the agreement between them. Here, we will be digging a bit more into the
machine translation algorithms, and implementing the EM algorithm.
You will need to complete the implementation of the class BerkeleyAligner, which will support the
same function calls as the NLTK implementations of the IBM models.
For the EM algorithm, initialize the translation parameters to the uniform distribution over the
vocabulary of words in the target sentences (i.e., 1 divided by the vocabulary size). Words should
be treated as case-sensitive. Initialize the alignment parameters to the uniform distribution over
the length of the source sentence plus one.
We will be using a simplified quantification of agreement between the two models (in comparison
to that proposed in the paper). When you compute the expected counts, use the average expected
count with respect to the two models’ parameters.
Both models' parameters can be stored in the same dictionary; however, the translation and
distortion parameters should be stored in separate dictionaries. Null alignments should be handled
in the same fashion as in IBM Models 1 & 2. When creating the text file, do not worry about
encoding. Note: the number of iterations should be left at 10 when the code is submitted, but feel
free to alter it while testing, if needed, to find convergence.
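
To make the averaged expected counts concrete, here is a hedged sketch of one E-step. The
parameter dictionaries t_fwd and q_fwd (target given source) and t_rev and q_rev (the reverse
direction) are hypothetical names with assumed key layouts, and the treatment of the NULL word
in the reverse direction is glossed over; your data structures may differ.

from collections import defaultdict

def averaged_expected_counts(pairs, t_fwd, q_fwd, t_rev, q_rev):
    # One E-step of the simplified agreement scheme: for each target word,
    # compute its alignment posterior under both directional models,
    # normalize each, and accumulate the average as the expected count.
    counts = defaultdict(float)
    for words, mots in pairs:                   # words = target, mots = source
        src = [None] + mots                     # NULL word at position 0
        l, m = len(mots), len(words)
        for i, w in enumerate(words, start=1):
            fwd = [t_fwd[(w, s)] * q_fwd[(j, i, l, m)]
                   for j, s in enumerate(src)]
            rev = [t_rev[(s, w)] * q_rev[(i, j, m, l)]
                   for j, s in enumerate(src)]
            z_f = sum(fwd) or 1.0               # guard against all-zero rows
            z_r = sum(rev) or 1.0
            for j, s in enumerate(src):
                counts[(w, s)] += 0.5 * (fwd[j] / z_f + rev[j] / z_r)
    return counts

In the M-step, these averaged counts are normalized exactly as in IBM Model 2 to produce the
updated translation and distortion tables for both directions.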
1) Implement the train() function. This function takes in a training set and the number of
iterations for the EM algorithm. You will have to implement the EM algorithm for this new
model. Return the parameters in the following format:
(translation, distortion)
2) Implement the align() function. This function uses the trained model’s parameters to
determine the alignments for a single sentence pair.
3) Train the model and determine the alignments for the first 20 sentences. Save the results
to “ba.txt”. Use the same format as in part A.
4) Compute the average AER for the first 50 sentence pairs. Compare the performance of the
BerkeleyAligner model to the IBM models.
5) Give an example of a sentence pair that the Berkeley Aligner performs better on than the
IBM models and explain why you think this is the case.
6) (Extra Credit) Think of a way to improve upon the Berkeley Aligner model. Specifically,
examine the way we quantify agreement between the two models. In our implementation,
we computed agreement as the average expected count under the two models. Implement
an improved Berkeley Aligner model that computes agreement in a better way. There is
skeleton code in EC.py (the same as for B.py). Compute the average AER for the first 50
sentence pairs and compare to the other models. Again, this part is optional, but if your
implementation is interesting and shows improved performance, you will be eligible for
bonus points.

Submission
A.py – methods implemented
ibm1.txt – contains the first 20 sentence pairs and their alignments for IBM Model 1
ibm2.txt – contains the first 20 sentence pairs and their alignments for IBM Model 2
B.py – methods implemented
ba.txt – contains the first 20 sentence pairs and their alignments for Berkeley Aligner
EC.py – improved Berkeley Aligner code [optional]
README.txt – writeup of all written portions as well as general comments about your code
and how it works or any issues that occur while running it

Make sure running main.py does not fail and that all output is printed correctly. The output of
main.py will make up a large portion of your grade.
Make sure all files are within the folder:
~/hidden//Homework4/
As a final step, run the permissions script to set correct permissions for your homework files:
~coms4705/hw4_set_permissions.sh 

Additional Resources
Here are some additional problems related to translation:

http://www.nacloweb.org/resources/problems/2012/N2012-C.pdf
http://www.nacloweb.org/resources/problems/2010/C.pdf


