Hw6 Instructions

User Manual:

Open the PDF directly: View PDF PDF.
Page Count: 2

LING572 Hw6: Beam search
Due: 11pm on Feb 20, 2019
The example files are under /dropbox/18-19/572/hw6/examples/.
Q1 (75 points): Write a script, beamsearch maxent.sh, that implements the beam search for
POS tagging.
The format is: beamsearch maxent.sh test data boundary file model file sys output beam size
topN topK
test data has the following format (e.g., ex/test.txt): “instanceName goldClass f1 v1 f2 v2
...”, where an instance corresponds to a word and goldClass is the word’s POS tag according to
the gold standard. Note this format is slightly different from the format used in the previous
assignments, which is “goldClass f1:v1 f2:v2 ...”.
boundary file: the format of boundary file is one number per line, which is the length of a
sentence (e.g., ex/boundary.txt); for instance, if the first line is 46, it means the first sentence
in test data has 46 words.
model file is a MaxEnt model in text format (e.g., m1.txt).
sys output (e.g., ex/sys) has the following format: “instanceName goldClass sysClass prob”,
where instanceName and goldClass are copied from the test data, sysClass is the tag yfor
the word xaccording to the best tag sequence found by the beam search, and prob is P(y|x).
Note prob is NOT the probability of the whole tag sequence given the word sentence. It is the
probability of the tag ygiven the word x.
topN: When expanding a node in the beam search tree, choose only the topN POS tags for the
given word based on P(y|x).
beam size is the max gap between the lg-prob of the best path and the lg-prob of kept path:
that is, a kept path should satisfy lg(prob) + beam size lg(max prob), where max prob is the
prob of the best path for the current position. lg is base-10 log.
topK is the max number of paths kept alive at each position after pruning.
Note:
Apath in the beam search is the path from the root to a node in the beam search tree. And for
more info about how beam search works and the meaning of beam size, topN and topK, see the
hw6 slides.
Remember that the feature vectors in the test data do not include features ti1=tagi1(e.g.,
prevT=NN) and ti2ti1=tagi2+tagi1(e.g., prevTwoTags=JJ+NN), because the tags
of the previous words are not available for the test data before the decoding starts. You need to
add those features to the feature vectors before calling the model to classify the current instance
based on the current path.
1
For instance, suppose the current instance is “instanceName goldTag f1 v1 f2 v2 ...”, and in
the current path the system tags the previous word as NN and the word before the previous
word as JJ. You need to add “prevT=NN 1” and “prevTwoTags=JJ+NN 1” to the feature
vector in order to determine the top tags of the current instance according to the current
path.
When you add these two types of features, only add the ones that appear in the model
file. If a feature (e.g., prevTwoTags=NN+RB) does not appear in the model file, that
means that the tag bigram does not appear in the training data. In that case, do not
add the feature to the feature vector, as the model does not contain the weights for the
corresponding feature functions. Another way to look at this is that if a (feature, class)
pair does not appear in the model file, it means the weight of the feature function is zero.
For your convenience, the list of these two types of features in the m1.txt is stored in
feats to add. Your code should NOT read in a file like feats to add because this info
should come from the model file. This file is there just to show you what these features
look like.
To summarize, you need to add prevT=xx and prevTwoTags=yy+xx features on the fly.
If such a feature does not appear in the model file, simply ignore the feature (i.e., assuming
its weight is 0).
Run beamsearch maxent.sh with sec19 21.txt as the test data, m1.txt as model file, sec19 21.boundary
as the boundary file.
Before running your code on the whole test set, you should test your code on smaller data sets.
For instance, you can use ex/test.txt as the test file, ex/boundary.txt as boundary file, m1.txt as
the model file. After that, you can run your code on the real data set with the (0, 1, 1) setting,
and record the time it takes. The running time for other settings could be much longer.
Fill out Table 1.
Submit the sys output file for the third row in Table 1 (i.e., the row when beam size=2, topN=5,
and topK=10).
beam size topN topK Test accuracy Running time
0 1 1
1 3 5
2 5 10
3 10 100
Table 1: Beam search results
Submission: Submit the following to Canvas:
Your note file readme.(txt |pdf) that includes Table 1 and any notes that you want the TA to
read.
hw.tar.gz that includes all the files specified in dropbox/18-19/572/hw6/submit-file-list, plus any
source code (and binary code) used by the shell scripts.
Make sure that you run check hw6.sh before submitting your hw.tar.gz.
2

Navigation menu