Hw6 Instructions

LING572 Hw6: Beam search

Due: 11pm on Feb 20, 2019

The example files are under /dropbox/18-19/572/hw6/examples/.

Q1 (75 points): Write a script, beamsearch_maxent.sh, that implements beam search for POS tagging.

• The format is: beamsearch_maxent.sh test_data boundary_file model_file sys_output beam_size topN topK

• test_data has the following format (e.g., ex/test.txt): “instanceName goldClass f1 v1 f2 v2 ...”, where an instance corresponds to a word and goldClass is the word’s POS tag according to the gold standard. Note that this format differs slightly from the format used in the previous assignments, which is “goldClass f1:v1 f2:v2 ...”.

• boundary_file: one number per line, each giving the length of a sentence (e.g., ex/boundary.txt); for instance, if the first line is 46, the first sentence in test_data has 46 words.

• model_file is a MaxEnt model in text format (e.g., m1.txt).

• sys_output (e.g., ex/sys) has the following format: “instanceName goldClass sysClass prob”, where instanceName and goldClass are copied from the test data, sysClass is the tag y for the word x according to the best tag sequence found by the beam search, and prob is P(y|x). Note that prob is NOT the probability of the whole tag sequence given the sentence; it is the probability of the tag y given the word x.

• topN: When expanding a node in the beam search tree, choose only the topN POS tags for the given word based on P(y|x).

• beam_size is the max gap between the lg-prob of the best path and the lg-prob of a kept path: that is, a kept path should satisfy lg(prob) + beam_size ≥ lg(max_prob), where max_prob is the prob of the best path for the current position. lg is the base-10 log.

• topK is the max number of paths kept alive at each position after pruning.
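
The three pruning parameters above can be sketched as follows. This is only an illustration under stated assumptions, not the required implementation: the function names top_n_tags and prune, and the representation of a path as a (lg_prob, tags) pair, are hypothetical choices made for this example.

```python
import math


def top_n_tags(tag_probs, top_n):
    """Return the top_n (tag, prob) pairs, highest P(y|x) first.

    tag_probs is a hypothetical dict mapping each tag y to P(y|x).
    """
    ranked = sorted(tag_probs.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]


def prune(paths, beam_size, top_k):
    """Keep at most top_k paths whose lg-prob is within beam_size of the best.

    paths is a list of (lg_prob, tags) pairs, where lg_prob is the
    base-10 log probability of the path up to the current position.
    """
    if not paths:
        return []
    ranked = sorted(paths, key=lambda p: p[0], reverse=True)
    max_lg = ranked[0][0]
    # A kept path must satisfy lg(prob) + beam_size >= lg(max_prob).
    kept = [p for p in ranked if p[0] + beam_size >= max_lg]
    return kept[:top_k]


if __name__ == "__main__":
    candidates = [
        (math.log10(0.5), ["NN"]),
        (math.log10(0.04), ["VB"]),
        (math.log10(0.2), ["JJ"]),
    ]
    for lg_prob, tags in prune(candidates, beam_size=1, top_k=5):
        print(tags, round(lg_prob, 3))
```

Because both filters keep a prefix of the same ranked list, applying the beam_size test before or after the topK cut gives the same set of surviving paths.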

Note:

• A path in the beam search is the path from the root to a node in the beam search tree. For more information about how beam search works and the meaning of beam_size, topN, and topK, see the hw6 slides.

• Remember that the feature vectors in the test data do not include the features t_{i-1}=tag_{i-1} (e.g., prevT=NN) and t_{i-2}t_{i-1}=tag_{i-2}+tag_{i-1} (e.g., prevTwoTags=JJ+NN), because the tags of the previous words are not available for the test data before the decoding starts. You need to add those features to the feature vectors before calling the model to classify the current instance based on the current path.

– For instance, suppose the current instance is “instanceName goldTag f1 v1 f2 v2 ...”, and in the current path the system tags the previous word as NN and the word before the previous word as JJ. You need to add “prevT=NN 1” and “prevTwoTags=JJ+NN 1” to the feature vector in order to determine the top tags of the current instance according to the current path.

– When you add these two types of features, only add the ones that appear in the model file. If a feature (e.g., prevTwoTags=NN+RB) does not appear in the model file, the tag bigram does not appear in the training data. In that case, do not add the feature to the feature vector, as the model does not contain weights for the corresponding feature functions. Another way to look at this: if a (feature, class) pair does not appear in the model file, the weight of that feature function is zero.

– For your convenience, the list of these two types of features in m1.txt is stored in feats_to_add. Your code should NOT read in a file like feats_to_add, because this information should come from the model file. The file is there just to show you what these features look like.

– To summarize, you need to add prevT=xx and prevTwoTags=yy+xx features on the fly. If such a feature does not appear in the model file, simply ignore the feature (i.e., assume its weight is 0).
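
The on-the-fly feature addition described in the notes above might look like the sketch below. It makes several assumptions: the function name add_tag_features is hypothetical, the feature vector is held as a dict, and model_feats is a set of feature names that your own code has already collected from the model file. How the previous tags are defined at the start of a sentence is not covered here; the caller supplies them.

```python
def add_tag_features(feats, prev_tag, prev_prev_tag, model_feats):
    """Return a copy of feats with prevT / prevTwoTags added if the model has them.

    feats          dict of feature name -> value for the current word
    prev_tag       tag of the previous word on the current path
    prev_prev_tag  tag of the word before the previous word on the current path
    model_feats    set of feature names seen in the model file (assumed to be
                   built elsewhere, from the model file itself)
    """
    augmented = dict(feats)  # do not modify the shared test-data vector
    candidates = [
        "prevT=" + prev_tag,
        "prevTwoTags=" + prev_prev_tag + "+" + prev_tag,
    ]
    for name in candidates:
        # A feature missing from the model has weight 0, so it is skipped.
        if name in model_feats:
            augmented[name] = 1
    return augmented


if __name__ == "__main__":
    known = {"prevT=NN", "prevTwoTags=JJ+NN"}
    print(add_tag_features({"f1": 1}, "NN", "JJ", known))
    print(add_tag_features({"f1": 1}, "RB", "NN", known))
```

Copying the vector matters because the same word is classified once per surviving path, each with a different tag history.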

Run beamsearch_maxent.sh with sec19_21.txt as the test data, m1.txt as the model file, and sec19_21.boundary as the boundary file.

• Before running your code on the whole test set, you should test it on smaller data sets. For instance, you can use ex/test.txt as the test file, ex/boundary.txt as the boundary file, and m1.txt as the model file. After that, you can run your code on the real data set with the (beam_size=0, topN=1, topK=1) setting and record the time it takes. The running time for other settings could be much longer.

• Fill out Table 1.

• Submit the sys_output file for the third row in Table 1 (i.e., the row where beam_size=2, topN=5, and topK=10).
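
When testing on the small example files, one useful sanity check is that the boundary file and the test data agree. The sketch below parses instances in the “instanceName goldClass f1 v1 f2 v2 ...” format and groups them into sentences. The function names are hypothetical, and treating feature values as floats is an assumption.

```python
def parse_instance(line):
    """Parse 'instanceName goldClass f1 v1 f2 v2 ...' into (name, gold, feats)."""
    parts = line.split()
    name, gold = parts[0], parts[1]
    pairs = parts[2:]
    if len(pairs) % 2 != 0:
        raise ValueError("odd number of feature/value tokens for " + name)
    # Assumption: feature values are numeric.
    feats = {pairs[i]: float(pairs[i + 1]) for i in range(0, len(pairs), 2)}
    return name, gold, feats


def split_sentences(instances, lengths):
    """Group a flat list of instances into sentences using boundary lengths."""
    if sum(lengths) != len(instances):
        raise ValueError("boundary lengths do not sum to the number of instances")
    sentences = []
    start = 0
    for n in lengths:
        sentences.append(instances[start:start + n])
        start += n
    return sentences
```

In a full script, the lengths would come from reading the boundary file one integer per line, and the instances from reading the test data one line per word.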

beam_size   topN   topK   Test accuracy   Running time
0           1      1
1           3      5
2           5      10
3           10     100

Table 1: Beam search results
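
The “Test accuracy” column can be computed from a sys_output file by comparing goldClass with sysClass on each line. The following is a minimal sketch, assuming every non-blank line follows the “instanceName goldClass sysClass prob” format; the function name accuracy is hypothetical.

```python
def accuracy(sys_lines):
    """Return the fraction of lines whose sysClass equals goldClass."""
    total = 0
    correct = 0
    for line in sys_lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        gold, sys_tag = parts[1], parts[2]
        total += 1
        if gold == sys_tag:
            correct += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    example = [
        "w1 NN NN 0.9",
        "w2 VB NN 0.6",
        "w3 DT DT 0.99",
    ]
    print(accuracy(example))
```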

Submission: Submit the following to Canvas:

• Your note file readme.(txt|pdf) that includes Table 1 and any notes that you want the TA to read.

• hw.tar.gz that includes all the files specified in dropbox/18-19/572/hw6/submit-file-list, plus any source code (and binary code) used by the shell scripts.

• Make sure that you run check_hw6.sh before submitting your hw.tar.gz.
