
Machine Translation
User Manual and Code Guide
Philipp Koehn
University of Edinburgh
This document serves as user manual and code guide for the Moses machine translation decoder. The decoder was mainly developed by Hieu Hoang and Philipp Koehn at the University of Edinburgh, extended during a Johns Hopkins University Summer Workshop, and further developed under EuroMatrix and GALE project funding. The decoder (which is part of a complete statistical machine translation toolkit) is the de facto benchmark for research in the field.
This document serves two purposes: a user manual for the functions of the Moses decoder and a code guide for developers. In large parts, this manual is identical to the documentation available at the official Moses decoder web site. This document does not describe in depth the underlying methods, which are described in the textbook Statistical Machine Translation (Philipp Koehn, Cambridge University Press, 2009).
February 3, 2018
The Moses decoder was supported by the European Framework 6 projects EuroMatrix, TC-Star, the European Framework 7 projects EuroMatrixPlus, Let’s MT, META-NET and MosesCore and the DARPA GALE project, as well as several universities such as the University of Edinburgh, the University of Maryland, ITC-irst, Massachusetts Institute of Technology, and others. Contributors are too many to mention, but it is important to stress the substantial contributions from Hieu Hoang, Chris Dyer, Josh Schroeder, Marcello Federico, Richard Zens, and Wade Shen. Moses is an open source project under the guidance of Philipp Koehn.
1 Introduction
1.1 Welcome to Moses!
1.2 Overview
1.2.1 Technology
1.2.2 Components
1.2.3 Development
1.2.4 Moses in Use
1.2.5 History
1.3 Get Involved
1.3.1 Mailing List
1.3.2 Suggestions
1.3.3 Development
1.3.4 Use
1.3.5 Contribute
1.3.6 Projects
2 Installation
2.1 Getting Started with Moses
2.1.1 Easy Setup on Ubuntu (on other Linux systems, you'll need to install packages that provide gcc, make, git, automake, libtool)
2.1.2 Compiling Moses directly with bjam
2.1.3 Other software to install
2.1.4 Platforms
2.1.5 OSX Installation
2.1.6 Linux Installation
2.1.7 Windows Installation
2.1.8 Run Moses for the first time
2.1.9 Chart Decoder
2.1.10 Next Steps
2.1.11 bjam options
2.2 Building with Eclipse
2.3 Baseline System
2.3.1 Overview
2.3.2 Installation
2.3.3 Corpus Preparation
2.3.4 Language Model Training
2.3.5 Training the Translation System
2.3.6 Tuning
2.3.7 Testing
2.3.8 Experiment Management System (EMS)
2.4 Releases
2.4.1 Release 4.0 (5th Oct, 2017)
2.4.2 Release 3.0 (3rd Feb, 2015)
2.4.3 Release 2.1.1 (3rd March, 2014)
2.4.4 Release 2.1 (21st Jan, 2014)
2.4.5 Release 1.0 (28th Jan, 2013)
2.4.6 Release 0.91 (12th October, 2012)
2.4.7 Status 11th July, 2012
2.4.8 Status 13th August, 2010
2.4.9 Status 9th August, 2010
2.4.10 Status 26th April, 2010
2.4.11 Status 1st April, 2010
2.4.12 Status 26th March, 2010
2.5 Work in Progress
3 Tutorials
3.1 Phrase-based Tutorial
3.1.1 A Simple Translation Model
3.1.2 Running the Decoder
3.1.3 Trace
3.1.4 Verbose
3.1.5 Tuning for Quality
3.1.6 Tuning for Speed
3.1.7 Limit on Distortion (Reordering)
3.2 Tutorial for Using Factored Models
3.2.1 Train an unfactored model
3.2.2 Train a model with POS tags
3.2.3 Train a model with generation and translation steps
3.2.4 Train a morphological analysis and generation model
3.2.5 Train a model with multiple decoding paths
3.3 Syntax Tutorial
3.3.1 Tree-Based Models
3.3.2 Decoding
3.3.3 Decoder Parameters
3.3.4 Training
3.3.5 Using Meta-symbols in Non-terminal Symbols (e.g., CCG)
3.3.6 Different Kinds of Syntax Models
3.3.7 Format of text rule table
3.4 Optimizing Moses
3.4.1 Multi-threaded Moses
3.4.2 How much memory do I need during decoding?
3.4.3 How little memory can I get away with during decoding?
3.4.4 Faster Training
3.4.5 Training Summary
3.4.6 Language Model
3.4.7 Suffix array
3.4.8 Cube Pruning
3.4.9 Minimizing memory during training
3.4.10 Minimizing memory during decoding
3.4.11 Phrase-table types
3.5 Experiment Management System
3.5.1 Introduction
3.5.2 Requirements
3.5.3 Quick Start
3.5.4 More Examples
3.5.5 Try a Few More Things
3.5.6 A Short Manual
3.5.7 Analysis
4 User Guide
4.1 Support Tools
4.1.1 Overview
4.1.2 Converting Pharaoh configuration files to Moses configuration files
4.1.3 Moses decoder in parallel
4.1.4 Filtering phrase tables for Moses
4.1.5 Reducing and Extending the Number of Factors
4.1.6 Scoring translations with BLEU
4.1.7 Missing and Extra N-Grams
4.1.8 Making a Full Local Clone of Moses Model + ini File
4.1.9 Absolutizing Paths in moses.ini
4.1.10 Printing Statistics about Model Components
4.1.11 Recaser
4.1.12 Truecaser
4.1.13 Searchgraph to DOT
4.1.14 Threshold Pruning of Phrase Table
4.2 External Tools
4.2.1 Word Alignment Tools
4.2.2 Evaluation Metrics
4.2.3 Part-of-Speech Taggers
4.2.4 Syntactic Parsers
4.2.5 Other Open Source Machine Translation Systems
4.2.6 Other Translation Tools
4.3 User Documentation
4.4 Advanced Models
4.4.1 Lexicalized Reordering Models
4.4.2 Operation Sequence Model (OSM)
4.4.3 Class-based Models
4.4.4 Multiple Translation Tables and Back-off Models
4.4.5 Global Lexicon Model
4.4.6 Desegmentation Model
4.4.7 Advanced Language Models
4.5 Efficient Phrase and Rule Storage
4.5.1 Binary Phrase Tables with On-demand Loading
4.5.2 Compact Phrase Table
4.5.3 Compact Lexical Reordering Table
4.5.4 Pruning the Translation Table
4.5.5 Pruning the Phrase Table based on Relative Entropy
4.5.6 Pruning Rules based on Low Scores
4.6 Search
4.6.1 Contents
4.6.2 Generating n-Best Lists
4.6.3 Minimum Bayes Risk Decoding
4.6.4 Lattice MBR and Consensus Decoding
4.6.5 Output Search Graph
4.6.6 Early Discarding of Hypotheses
4.6.7 Maintaining stack diversity
4.6.8 Cube Pruning
4.7 OOVs
4.7.1 Contents
4.7.2 Handling Unknown Words
4.7.3 Unsupervised Transliteration Model
4.8 Hybrid Translation
4.8.1 Contents
4.8.2 XML Markup
4.8.3 Specifying Reordering Constraints
4.8.4 Fuzzy Match Rule Table for Hierarchical Models
4.8.5 Placeholder
4.9 Moses as a Service
4.9.1 Contents
4.9.2 Moses Server
4.9.3 Open Machine Translation Core (OMTC) - A proposed machine translation ...
4.10 Incremental Training
4.10.1 Contents
4.10.2 Introduction
4.10.3 Initial Training
4.10.4 Virtual Phrase Tables Based on Sampling Word-aligned Bitexts
4.10.5 Updates
4.10.6 Phrase Table Features for PhraseDictionaryBitextSampling
4.10.7 Suffix Arrays for Hierarchical Models
4.11 Domain Adaptation
4.11.1 Contents
4.11.2 Translation Model Combination
4.11.3 OSM Model Combination (Interpolated OSM)
4.11.4 Online Translation Model Combination (Multimodel phrase table type)
4.11.5 Alternate Weight Settings
4.11.6 Modified Moore-Lewis Filtering
4.12 Constrained Decoding
4.12.1 Contents
4.12.2 Constrained Decoding
4.13 Cache-based Models
4.13.1 Contents
4.13.2 Dynamic Cache-Based Phrase Table
4.13.3 Dynamic Cache-Based Language Model
4.14 Pipeline Creation Language (PCL)
4.15 Obsolete Features
4.15.1 Binary Phrase table
4.15.2 Word-to-word alignment
4.15.3 Binary Reordering Tables with On-demand Loading
4.15.4 Continue Partial Translation
4.15.5 Distributed Language Model
4.15.6 Using Multiple Translation Systems in the Same Server
4.16 Sparse Features
4.16.1 Word Translation Features
4.16.2 Phrase Length Features
4.16.3 Domain Features
4.16.4 Count Bin Features
4.16.5 Bigram Features
4.16.6 Soft Matching Features
4.17 Translating Web pages with Moses
4.17.1 Introduction
4.17.2 Detailed setup instructions
5 Training Manual
5.1 Training
5.1.1 Training process
5.1.2 Running the training script
5.2 Preparing Training Data
5.2.1 Training data for factored models
5.2.2 Cleaning the corpus
5.3 Factored Training
5.3.1 Translation factors
5.3.2 Reordering factors
5.3.3 Generation factors
5.3.4 Decoding steps
5.4 Training Step 1: Prepare Data
5.5 Training Step 2: Run GIZA++
5.5.1 Training on really large corpora
5.5.2 Training in parallel
5.6 Training Step 3: Align Words
5.7 Training Step 4: Get Lexical Translation Table
5.8 Training Step 5: Extract Phrases
5.9 Training Step 6: Score Phrases
5.10 Training Step 7: Build reordering model
5.11 Training Step 8: Build generation model
5.12 Training Step 9: Create Configuration File
5.13 Building a Language Model
5.13.1 Language Models in Moses
5.13.2 Enabling the LM OOV Feature
5.13.3 Building a LM with the SRILM Toolkit
5.13.4 On the IRSTLM Toolkit
5.13.5 RandLM
5.13.6 KenLM
5.13.7 OxLM
5.13.8 NPLM
5.13.9 Bilingual Neural LM
5.13.10 Bilingual N-gram LM (OSM)
5.13.11 Dependency Language Model (RDLM)
5.14 Tuning
5.14.1 Overview
5.14.2 Batch tuning algorithms
5.14.3 Online tuning algorithms
5.14.4 Metrics
5.14.5 Tuning in Practice
6 Background
6.1 Background
6.1.1 Model
6.1.2 Word Alignment
6.1.3 Methods for Learning Phrase Translations
6.1.4 Och and Ney
6.2 Decoder
6.2.1 Translation Options
6.2.2 Core Algorithm
6.2.3 Recombining Hypotheses
6.2.4 Beam Search
6.2.5 Future Cost Estimation
6.2.6 N-Best Lists Generation
6.3 Factored Translation Models
6.3.1 Motivating Example: Morphology
6.3.2 Decomposition of Factored Translation
6.3.3 Statistical Model
6.4 Confusion Networks Decoding
6.4.1 Confusion Networks
6.4.2 Representation of Confusion Network
6.5 Word Lattices
6.5.1 How to represent lattice inputs
6.5.2 Configuring moses to translate lattices
6.5.3 Verifying PLF files with checkplf
6.5.4 Citation
6.6 Publications
7 Code Guide
7.1 Code Guide
7.1.1 Github, branching, and merging
7.1.2 The code
7.1.3 Quick Start
7.1.4 Detailed Guides
7.2 Coding Style
7.2.1 Formatting
7.2.2 Comments
7.2.3 Data types and methods
7.2.4 Source Control Etiquette
7.3 Factors, Words, Phrases
7.3.1 Factors
7.3.2 Words
7.3.3 Factor Types
7.3.4 Phrases
7.4 Tree-Based Model Decoding
7.4.1 Looping over the Spans
7.4.2 Looking up Applicable Rules
7.4.3 Applying the Rules: Cube Pruning
7.4.4 Hypotheses and Pruning
7.5 Multi-Threading
7.5.1 Tasks
7.5.2 Thread Pool
7.5.3 Output Collector
7.5.4 Not Deleting Threads after Execution
7.5.5 Limit the Size of the Thread Queue
7.5.6 Example
7.6 Adding Feature Functions
7.6.1 Video
7.6.2 Other resources
7.6.3 Feature Function
7.6.4 Stateless Feature Function
7.6.5 Stateful Feature Function
7.6.6 Place-holder features
7.6.7 moses.ini
7.6.8 Examples
7.7 Adding Sparse Feature Functions
7.7.1 Implementation
7.7.2 Weights
7.8 Regression Testing
7.8.1 Goals
7.8.2 Test suite
7.8.3 Running the test suite
7.8.4 Running an individual test
7.8.5 How it works
7.8.6 Writing regression tests
8 Reference
8.1 Frequently Asked Questions
8.1.1 My system is taking a really long time to translate a sentence. What can ...
8.1.2 The system runs out of memory during decoding.
8.1.3 I would like to point out a bug / contribute code.
8.1.4 How can I get an updated version of Moses?
8.1.5 What changed in the latest release of Moses?
8.1.6 I am an undergrad/masters student looking for a project in SMT. What ...
8.1.7 What do the 5 numbers in the phrase table mean?
8.1.8 What OS does Moses run on?
8.1.9 Can I use Moses on Windows?
8.1.10 Do I need a computer cluster to run experiments?
8.1.11 I have compiled Moses, but it segfaults when running.
8.1.12 How do I add a new feature function to the decoder?
8.1.13 Compiling with SRILM or IRSTLM produces errors.
8.1.14 I am trying to use Moses to create a web page to do translation.
8.1.15 How can I create a system that translates both ways, i.e. X-to-Y as well as ...
8.1.16 PhraseScore dies with signal 11 - why?
8.1.17 Does Moses do Hierarchical decoding, like Hiero etc?
8.1.18 Can I use Moses in proprietary software?
8.1.19 GIZA++ crashes with error "parameter 'coocurrencefile' does not exist."
8.1.20 Running gives me lots of errors about *GREP ...
8.1.21 Running training I got the following error "*** buffer overflow detected ***: ../giza-pp/GIZA++-v2/GIZA++ terminated"
8.1.22 I retrained my model and got different BLEU scores. Why?
8.1.23 I specified ranges for mert weights, but it returned weights which are ...
8.1.24 Who do I ask if my question has not been answered by this FAQ?
8.2 Reference: All Decoder Parameters
8.3 Reference: All Training Parameters
8.3.1 Basic Options
8.3.2 Factored Translation Model Settings
8.3.3 Lexicalized Reordering Model
8.3.4 Partial Training
8.3.5 File Locations
8.3.6 Alignment Heuristic
8.3.7 Maximum Phrase Length
8.3.8 GIZA++ Options
8.3.9 Dealing with large training corpora
8.4 Glossary
1.1 Welcome to Moses!
Moses is a statistical machine translation system that allows you to automatically train translation models for any language pair. All you need is a collection of translated texts (a parallel corpus). Once you have a trained model, an efficient search algorithm quickly finds the highest-probability translation among an exponential number of choices.
1.2 Overview
1.2.1 Technology
Moses is an implementation of the statistical (or data-driven) approach to machine translation
(MT). This is the dominant approach in the field at the moment, and is employed by the online translation systems deployed by the likes of Google and Microsoft. In statistical machine
translation (SMT), translation systems are trained on large quantities of parallel data (from
which the systems learn how to translate small segments), as well as even larger quantities of
monolingual data (from which the systems learn what the target language should look like).
Parallel data is a collection of sentences in two different languages, which is sentence-aligned,
in that each sentence in one language is matched with its corresponding translated sentence in
the other language. It is also known as a bitext.
The training process in Moses takes in the parallel data and uses co-occurrences of words and segments (known as phrases) to infer translation correspondences between the two languages of interest. In phrase-based machine translation, these correspondences are simply between continuous sequences of words, whereas in hierarchical phrase-based machine translation or syntax-based translation, more structure is added to the correspondences. For instance, a hierarchical MT system could learn that the German hat X gegessen corresponds to the English ate X, where the Xs are replaced by any German-English word pair. The extra structure used in these types of systems may or may not be derived from a linguistic analysis of the parallel data. Moses also implements an extension of phrase-based machine translation known as factored translation, which enables extra linguistic information to be added to phrase-based systems.
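The hierarchical rule above (hat X gegessen corresponding to ate X) can be sketched with a toy representation. This is an illustration only, not the actual Moses rule-table format, and `apply_rule` is a made-up helper:

```python
# Toy sketch of applying a hierarchical rule: the non-terminal slot X is
# represented by None and filled in by the translation of the gap.
# (Illustrative representation only, not the Moses rule-table format.)

def apply_rule(source_rule, target_rule, gap_translation):
    """Substitute the translated gap into the target side's None slot."""
    return [gap_translation if tok is None else tok for tok in target_rule]

# German "hat X gegessen" -> English "ate X"
source_rule = ["hat", None, "gegessen"]
target_rule = ["ate", None]

# Translate the gap X with an invented phrase pair "einen Apfel" -> "an apple"
result = apply_rule(source_rule, target_rule, "an apple")
print(" ".join(result))  # -> ate an apple
```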
For more information about the Moses translation models, please refer to the tutorials on
phrase-based MT (Section 3.1), syntactic MT (Section 3.3) or factored MT (Section 3.2).
Whichever type of machine translation model you use, the key to creating a good system is lots of good quality data. There are many free sources of parallel data which you can use to train sample systems, but (in general) the closer the data you use is to the type of data you want to translate, the better the results will be. This is one of the advantages of using an open-source tool like Moses: if you have your own data, then you can tailor the system to your needs and potentially get better performance than a general-purpose translation system. Moses needs sentence-aligned data for its training process, but if data is aligned at the document level, it can often be converted to sentence-aligned data using a tool like hunalign.
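Sentence-aligned data can be sketched as follows. Real training data lives in two plain-text files with one sentence per line; the sentences below are invented for illustration:

```python
# Sketch of sentence-aligned parallel data: line i of the source-language
# file corresponds to line i of the target-language file.
# (The sentences are illustrative, not from a real corpus.)
german  = ["das ist ein kleines haus", "ich habe einen apfel gegessen"]
english = ["this is a small house", "i ate an apple"]

# Pairing the lines yields the bitext the training pipeline works on.
bitext = list(zip(german, english))
for src, tgt in bitext:
    print(src, "|||", tgt)
```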
1.2.2 Components
The two main components in Moses are the training pipeline and the decoder. There are also
a variety of contributed tools and utilities. The training pipeline is really a collection of tools
(mainly written in Perl, with some in C++) which take the raw data (parallel and monolingual)
and turn it into a machine translation model. The decoder is a single C++ application which,
given a trained machine translation model and a source sentence, will translate the source
sentence into the target language.
The Training Pipeline
There are various stages involved in producing a translation system from training data, which
are described in more detail in the training documentation (Section 5.1) and in the baseline
system guide (Section 2.3). These are implemented as a pipeline, which can be controlled by
the Moses experiment management system (Section 3.5), and Moses in general makes it easy
to insert different types of external tools into the training pipeline.
The data typically needs to be prepared before it is used in training: tokenising the text and converting tokens to a standard case. Heuristics are used to remove sentence pairs which look to be misaligned, and long sentences are removed. The parallel sentences are then word-aligned, typically using GIZA++, which implements a set of statistical models developed at IBM in the 80s. These word alignments are used to extract phrase-phrase translations, or hierarchical rules as required, and corpus-wide statistics on these rules are used to estimate probabilities.
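The corpus-wide probability estimation mentioned here can be sketched as relative-frequency counting over extracted phrase pairs. The pairs below are invented, and real training additionally computes the reverse direction and lexical weights:

```python
from collections import Counter

# Sketch of relative-frequency estimation over extracted phrase pairs:
# p(e|f) = count(f, e) / count(f).  (Illustrative phrase pairs only.)
extracted = [
    ("kleines haus", "small house"),
    ("kleines haus", "small house"),
    ("kleines haus", "little house"),
]

pair_counts = Counter(extracted)
source_counts = Counter(f for f, _ in extracted)

def phrase_prob(f, e):
    """Probability of target phrase e given source phrase f."""
    return pair_counts[(f, e)] / source_counts[f]

print(phrase_prob("kleines haus", "small house"))  # 2/3
```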
An important part of the translation system is the language model, a statistical model built
using monolingual data in the target language and used by the decoder to try to ensure the
fluency of the output. Moses relies on external tools (Section 5.13) for language model building.
The final step in the creation of the machine translation system is tuning (Section 5.14), where
the different statistical models are weighted against each other to produce the best possible
translations. Moses contains implementations of the most popular tuning algorithms.
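As a rough sketch of what tuning optimizes, consider a weighted combination of feature scores; the decoder ranks candidate translations by this weighted sum, and tuning adjusts the weights. The feature names and numbers below are invented for illustration:

```python
# Sketch of the weighted model combination that tuning optimizes
# (feature names and values are made up for illustration).
weights = {"tm": 0.3, "lm": 0.5, "word_penalty": -0.2}

def model_score(features):
    # Candidates are ranked by a weighted sum of (log) feature scores.
    return sum(weights[name] * value for name, value in features.items())

candidate_a = {"tm": -1.0, "lm": -2.0, "word_penalty": 5.0}
candidate_b = {"tm": -0.5, "lm": -4.0, "word_penalty": 6.0}
best = max((candidate_a, candidate_b), key=model_score)
```

Tuning algorithms such as MERT search for the weight vector that makes the highest-scoring candidates match the reference translations as closely as possible.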
The Decoder
The job of the Moses decoder is to find the highest scoring sentence in the target language
(according to the translation model) corresponding to a given source sentence. It is also pos-
sible for the decoder to output a ranked list of the translation candidates, and also to supply
various types of information about how it came to its decision (for instance the phrase-phrase
correspondences that it used).
The decoder is written in a modular fashion and allows the user to vary the decoding process
in various ways, such as:
Input: This can be a plain sentence, or it can be annotated with xml-like elements to guide
the translation process, or it can be a more complex structure like a lattice or confusion
network (say, from the output of speech recognition).
Translation model: This can use phrase-phrase rules, or hierarchical (perhaps syntactic)
rules. It can be compiled into a binarised form for faster loading. It can be supplemented
with features to add extra information to the translation process, for instance features
which indicate the sources of the phrase pairs in order to weight their reliability.
Decoding algorithm: Decoding is a huge search problem, generally too big for exact
search, and Moses implements several different strategies for this search, such as stack-
based, cube-pruning, chart parsing etc.
Language model: Moses supports several different language model toolkits (SRILM,
KenLM, IRSTLM, RandLM) each of which has its own strengths and weaknesses, and
adding a new LM toolkit is straightforward.
The Moses decoder also supports multi-threaded decoding (since translation is embarrassingly
parallelisable4), and also has scripts to enable multi-process decoding if you have access to a
computing cluster.
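To make the stack-based search mentioned above concrete, here is a toy monotone phrase-based decoder in Python with beam pruning. The phrase table and scores are invented and bear no relation to a real Moses model; Moses itself is far more general (reordering, language model state, recombination):

```python
# Minimal monotone stack decoding sketch with beam pruning (toy data).
phrase_table = {
    ("das",): [("this", -0.1)],
    ("ist",): [("is", -0.1)],
    ("ein",): [("a", -0.2), ("one", -1.5)],
    ("kleines",): [("small", -0.3), ("little", -0.5)],
    ("haus",): [("house", -0.1)],
}

def decode(src, beam=2, max_len=3):
    # Stacks indexed by number of source words covered so far.
    stacks = [[] for _ in range(len(src) + 1)]
    stacks[0].append((0.0, ""))               # (score, partial translation)
    for i in range(len(src)):
        stacks[i].sort(reverse=True)
        for score, out in stacks[i][:beam]:   # keep only the best few
            for j in range(i + 1, min(i + max_len, len(src)) + 1):
                for tgt, s in phrase_table.get(tuple(src[i:j]), []):
                    stacks[j].append((score + s, (out + " " + tgt).strip()))
    return max(stacks[len(src)])[1]

print(decode("das ist ein kleines haus".split()))
```

Grouping hypotheses by the number of source words covered, and pruning each stack, is the essence of the stack-based strategy; cube pruning and chart parsing organize the same search differently.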
Contributed Tools
There are many contributed tools in Moses which supply additional functionality over and
above the standard training and decoding pipelines. These include:
Moses server: which provides an xml-rpc interface to the decoder
Web translation: A set of scripts to enable Moses to be used to translate web pages
Analysis tools: Scripts to enable the analysis and visualisation of Moses output, in com-
parison with a reference.
There are also tools to evaluate translations, alternative phrase scoring methods, an implemen-
tation of a technique for weighting phrase tables, a tool to reduce the size of the phrase table,
and other contributed tools.
1.2.3 Development
Moses is an open-source project, licensed under the LGPL5, which incorporates contributions
from many sources. There is no formal management structure in Moses, so if you want to
contribute then just mail support6 and take it from there. There is a list (Section 1.3) of possible
projects on this website, but any new MT techniques are fair game for inclusion into Moses.
In general, the Moses administrators are fairly open about giving out push access to the git
repository, preferring the approach of removing/fixing bad commits, rather than vetting com-
mits as they come in. This means that trunk occasionally breaks, but given the active Moses
user community, it doesn’t stay broken for long. The nightly builds and tests of trunk are re-
ported on the cruise control7web page, but if you want a more stable version then look for one
of the releases (Section 2.4).
1.2.4 Moses in Use
The liberal licensing policy in Moses, together with its wide coverage of current SMT technol-
ogy and complete tool chain, make it probably the most widely used open-source SMT system.
It is used in teaching, research, and, increasingly, in commercial settings.
Commercial use of Moses is promoted and tracked by TAUS8. The most common current use
for SMT in commercial settings is post-editing where machine translation is used as a first-
pass, with the results then being edited by human translators. This can often reduce the time
(and hence total cost) of translation. There is also work on using SMT in computer-aided
translation, which is the research topic of two current EU projects, Casmacat9 and MateCat10.
1.2.5 History
2005 Hieu Hoang (then student of Philipp Koehn) starts Moses as successor to Pharaoh
2006 Moses is the subject of the JHU workshop, first check-in to public repository
2006 Start of Euromatrix, EU project which helps fund Moses development
2007 First machine translation marathon held in Edinburgh
2009 Moses receives support from EuromatrixPlus, also EU-funded
2010 Moses now supports hierarchical and syntax-based models, using chart decoding
2011 Moses moves from sourceforge to github, after over 4000 sourceforge check-ins
2012 EU-funded MosesCore launched to support continued development of Moses
Subsection last modified on August 13, 2013, at 10:38 AM
1.3 Get Involved
1.3.1 Mailing List
The main forum for communication on Moses is the Moses support mailing list11.
1.3.2 Suggestions
We’d like to hear what you want from Moses. We can’t promise to implement the suggestions,
but they can be used as input into research and student projects, as well as Marathon12 projects.
If you have a suggestion/wish for a new feature or improvement, then either report it
via the issue tracker13, contact the mailing list, or drop Barry or Hieu a line (addresses on
the mailing list page).
1.3.3 Development
Moses is an open source project that is at home in the academic research community. There are
several venues where this community gathers, such as:
The main conferences in the field: ACL, EMNLP, MT Summit, etc.
The annual ACL Workshop on Statistical Machine Translation14
The annual Machine Translation Marathon15
Moses is being developed as a reference implementation of state-of-the-art methods in statisti-
cal machine translation. Extending this implementation may be the subject of undergraduate or
graduate theses, or class projects. Typically, developers extend functionality that they required
for their projects, or to explore novel methods. Let us know if you made an improvement, no
matter how minor. Also let us know if you found or fixed a bug.
1.3.4 Use
We are aware of many commercial deployments of Moses, for instance as described by TAUS16.
Please let us know if you use Moses commercially. Do not hesitate to contact the core
developers of Moses. They are willing to answer questions and may even be available for
consulting.
1.3.5 Contribute
There are many ways you can contribute to Moses.
To get started, build systems with your data and get familiar with how Moses works.
Test out alternative settings for building a system. The shared tasks organized around
the ACL Workshop on Statistical Machine Translation17 are a good forum to publish such
results on standard data conditions.
Read the code. While you're at it, feel free to add comments or contribute to the Code Guide
(Section 7.1) to make it easier for others to understand the code.
If you come across inefficient implementations (e.g., bad algorithms or code in Perl that
should be ported to C++), program more efficient implementations.
If you have new ideas for features, tools, and functionality, add them.
Help out with some of the projects listed below.
1.3.6 Projects
If you are looking for projects to improve Moses, please consider the following list:
Front-end Projects
OpenOffice/Microsoft Word, Excel or Access plugins: (Hieu Hoang) Create wrappers for
the Moses decoder to translate within user apps. Skills required - Windows, VBA, Moses.
Firefox, Chrome, Internet Explorer plugins: (Hieu Hoang) Create a plugin that calls the
Moses server to translate webpages. Skills required - Web design, Javascript, Moses.
Moses on the OLPC: (Hieu Hoang) Create a front-end for the decoder, and possibly the
training pipeline, so that it can be run on the OLPC. Some preliminary work has been
done here18.
Rule-based numbers, currency, date translation: (Hieu Hoang) SMT is bad at translating
numbers and dates. Write some simple rules to identify and translate these for the lan-
guage pairs of your choice. Integrate it into Moses and combine it with the placeholder
feature19. Skills required - C++, Moses. (GSOC)
Named entity translation: (Hieu Hoang) Text with lots of names, trademarks, etc. is
difficult for SMT to translate. Integrate named entity recognition into Moses. Translate
them using the transliteration phrase-table, placeholder feature, or a secondary phrase-
table. Skills required - C++, Moses. (GSOC)
Interactive visualization for SCFG decoding: (Hieu Hoang) Create a front-end to the
hiero/syntax decoder that enables the user to re-translate a part of the sentence, change
parameters in the decoder, add or delete translation rules etc. Skills required - C++, GUI,
Moses. (GSOC)
Integrating the decoder with OCR/speech recognition input and speech synthesis out-
put (Hieu Hoang)
Training & Tuning
Incremental updating of translation and language model: When you add new sentences
to the training data, you don’t want to re-run the whole training pipeline (do you?). Abby
Levenberg has implemented incremental training20 for Moses but what it lacks is a nice
How-To guide.
Compression for lmplz: (Kenneth Heafield) lmplz trains language models on disk. The
temporary data on disk is not compressed, but it could be, especially with a fast com-
pression algorithm like zippy. This will enable us to build much larger models. Skills
required: C++. No SMT knowledge required. (GSOC)
Faster tuning by reuse: In tuning, you constantly re-decode the same set of sentences
and this can be very time-consuming. What if you could reuse part of the calculation
each time? This has been previously proposed as a marathon project21.
Use binary files to speed up phrase scoring: Phrase-extraction and scoring involves a lot
of processing of text files which is inefficient in both time and disk usage. Using binary
files and vocabulary ids has the potential to make training more efficient, although more
implementation work would be needed.
Lattice training: At the moment lattices can be used for decoding (Section 6.5), and also
for MERT22 but they can’t be used in training. It would be pretty cool if they could be
used for training, but this is far from trivial.
Training via forced decoding: (Matthias Huck) Implement leave-one-out phrase model
training in Moses. Skills required - C++, SMT.
Faster training for the global lexicon model: Moses implements the global lexicon model
proposed by Mauser et al. (2009)23, but training features for each target word using a
maximum entropy trainer is very slow (years of CPU time). More efficient training or
accommodation of training of only frequent words would be useful.
Letter-based TER: Implement an efficient version of letter-based TER as metric for tuning
and evaluation, geared towards morphologically complex languages.
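The core of such a letter-based metric is character-level edit distance, sketched below. This is a simplification: full TER additionally allows block shifts and normalizes the edit count by the reference length, both omitted here:

```python
# Character-level edit distance (Levenshtein), the core of a
# letter-based metric; full TER also allows block shifts.
def char_edits(hyp, ref):
    # Classic dynamic-programming edit distance over characters.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (h != r)))  # substitution
        prev = cur
    return prev[len(ref)]

print(char_edits("haus", "house"))  # 2 edits: substitute a->o, insert e
```

For morphologically complex languages, counting edits at the character level gives partial credit for near-miss word forms that word-level metrics score as complete errors.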
New Feature Functions: Many new feature functions could be implemented and tested.
For some ideas, see Green et al. (2014)24
Character count feature: The word count feature is very valuable, but may be geared
towards producing superfluous function words. To encourage the production of longer
words, a character count feature could be useful. Maybe a unigram language model
fulfills the same purpose.
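The two count features discussed above can be sketched as follows (purely illustrative; this is not how Moses defines its feature functions internally):

```python
# Toy computation of word-count and character-count features.
def word_count(hyp):
    return len(hyp.split())

def char_count(hyp):
    # Counting letters rather than tokens favours longer words over
    # strings of extra short function words.
    return sum(len(tok) for tok in hyp.split())

short_words = "it is of the opinion"
long_words = "it believes"
print(word_count(short_words), char_count(short_words))  # 5 16
print(word_count(long_words), char_count(long_words))    # 2 10
```

With a suitable weight, the character-count feature can prefer the second hypothesis even though the word-count feature alone favours the first.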
Training with comparable corpora, related language, monolingual data: (Hieu Hoang)
High quality parallel corpora is difficult to obtain. There is a large amount of work on us-
ing comparable corpora, monolingual data, and parallel data in closely related languages
to create translation models. This project will re-implement and extend some of the prior work.
Chart-based Translation
Decoding algorithms for syntax-based models: Moses generally supports a large set
of grammar types. For some of these (for instance ones with source syntax, or a very
large set of non-terminals), the implemented CYK+ decoding algorithm is not optimal.
Implementing search algorithms for dedicated models, or just to explore alternatives,
would be of great interest.
Source cardinality synchronous cube pruning for the chart-based decoder: (Matthias
Huck) Pooling hypotheses by the number of covered source words. Skills required - C++.
Cube pruning for factored models: Complex factored models with multiple translation
and generation steps push the limits of the current factored model implementation which
exhaustively computes all translation options up front. Using ideas from cube pruning
(sorting the most likely rules and partial translation options) may be the basis for more
efficient factored model decoding.
Missing features for chart decoder: A number of features are missing for the chart de-
coder, such as: MBR decoding (should be simple) and lattice decoding. In general, re-
porting and analysis within experiment.perl could be improved.
More efficient rule table for chart decoder: (Marcin) The in-memory rule table for the
hierarchical decoder loads very slowly and uses a lot of RAM. An optimized implemen-
tation that is vastly more efficient on both fronts should be feasible. Skills required - C++,
NLP, Moses. (GSOC)
More features for incremental search: Kenneth Heafield presented a faster search algorithm
for chart decoding, Grouping Language Model Boundary Words to Speed K-Best Extraction
from Hypergraphs (NAACL 2013)25. This is implemented as a separate search algorithm
in Moses (called ’incremental search’), but it lacks many features of the default
search algorithm (such as sparse feature support, or support for multiple stateful fea-
tures). Implementing these features for the incremental search would be of great interest.
Scope-0 grammar and phrase-table: (Hieu Hoang). The most popular decoding algorithm
for syntax MT is the CYK+ algorithm. This is a parsing algorithm which is able to decode
with an unnormalized, unpruned grammar. However, the disadvantage of using such a
general algorithm is its speed; Hopkins and Langmead (2010) showed that a sentence of
length n can be parsed using a scope-k grammar in O(n^k) chart updates. For an unpruned
grammar with 2 non-terminals (the usual SMT setup), the scope is 3.
This project proposes to quantify the advantages and disadvantages of scope-0 grammar. A
scope-0 grammar lacks application ambiguity, therefore, decoding can be fast and memory
efficient. However, this must be offset against potential translation quality degradation due to
the lack of coverage.
It may be that the advantages of a scope-0 grammar can only be realized through specifically
developed algorithms, such as parsing algorithms or data structures. The phrase-table lookup
for a scope-0 grammar can be significantly simplified, made faster, and applied to much
larger span widths.
This project will also aim to explore this potentially rich research area.
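Following the definition of Hopkins and Langmead (2010), the scope of a rule can be computed by counting nonterminals at the rule boundary plus adjacent nonterminal pairs. A small Python sketch ("X" marks a nonterminal in the right-hand side):

```python
# Compute the scope of an SCFG rule right-hand side: nonterminals at
# the boundary plus adjacent nonterminal pairs ("X" = nonterminal).
def scope(rhs):
    s = 0
    if rhs and rhs[0] == "X":
        s += 1                  # nonterminal at the left boundary
    if rhs and rhs[-1] == "X":
        s += 1                  # nonterminal at the right boundary
    s += sum(1 for a, b in zip(rhs, rhs[1:]) if a == b == "X")
    return s

print(scope(["X", "X"]))          # glue-style rule: scope 3
print(scope(["X", "sagt", "X"]))  # lexical anchor in the middle: scope 2
print(scope(["das", "X", "haus"]))  # fully lexically anchored: scope 0
```

A scope-0 grammar thus requires every nonterminal to be surrounded by terminals, which is what removes application ambiguity during parsing.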
Phrase-based Translation
A better phrase table: The current binarised phrase table suffers from (i) far too many
layers of indirection in the code, making it hard to follow and inefficient; (ii) a cache-locking
mechanism which creates excessive contention; and (iii) a lack of extensibility, meaning
that (e.g.) word alignments were added on by extensively duplicating code and additional
phrase properties are not available. A new phrase table could make Moses faster
and more extensible.
Multi-threaded decoding: Moses uses a simple "thread per sentence" model for multi-
threaded decoding. However this means that if you have a single sentence to decode,
then multi-threading will not get you the translation any faster. Is it possible to have a
finer-grained threading model that can use multiple threads on a single sentence? This
would call for a new approach to decoding.
Better reordering: (Matthias Huck, Hieu Hoang) E.g. with soft constraints on reordering:
Moses currently allows you to specify hard constraints26 on reordering, but it might be
useful to have "soft" versions of these constraints. This would mean that the translation
would incur a trainable penalty for violating the constraints, implemented by adding a
feature function. Skills required - C++, SMT.
More ideas related to reordering:
Merging the phrase table and lexicalized reordering table: (Matthias Huck, Hieu Hoang)
They contain the same source and target phrases, but different probabilities, and how
those probabilities are applied. Merging the 2 models would halve the number of lookups.
Skills required - C++, Moses. (GSOC)
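The intended saving can be illustrated with a dictionary merge in Python (the keys and probabilities below are made up; real Moses tables are binarised on-disk structures):

```python
# Sketch of merging phrase and lexicalized reordering tables so one
# lookup returns all scores (invented data).
phrase_table = {("das haus", "the house"): {"tm": -0.4}}
reordering_table = {("das haus", "the house"): {"mono": -0.1, "swap": -2.3}}

merged = {}
for key in phrase_table.keys() | reordering_table.keys():
    merged[key] = {**phrase_table.get(key, {}),
                   **reordering_table.get(key, {})}

entry = merged[("das haus", "the house")]  # one lookup, all scores
```

Since the two tables are keyed by the same source/target phrase pairs, merging them halves the number of lookups the decoder performs per translation option.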
Using artificial neural networks as memory to store the phrase table: (Hieu Hoang) ANNs
can be used as associative memory to store information in a lossy fashion. It would be
interesting to investigate how useful they are at storing the phrase table. Further research
can focus on how they can be used to store morphologically similar translations.
Entropy-based pruning: (Matthias Huck) A more consistent method for pre-pruning phrase
tables. Skills required - C++, NLP.
Faster phrase-based decoding by refining feature state: Implement Heafield’s Faster
Phrase-Based Decoding by Refining Feature State (ACL 2014)27.
Multi-pass decoding: (Hieu Hoang) Some features may be too expensive to use during
decoding - maybe due to their computational cost, or due to their wider use of context
which leads to more state splitting. Think of a recurrent neural network language model
that both uses too much context (the entire output string) and is costly to compute. We
would like to use these features in a reranking phase, but dumping out the search graph,
and then re-decoding it outside of Moses, creates a lot of additional overhead. So, it would
be nicer to integrate second pass decoding within the decoder. This idea is related to
coarse to fine decoding. Technically, we would like to be able to specify any feature
function as a first pass or second pass feature function. There are some major issues that
have to be tackled with multi-pass decoding:
1. A losing hypothesis which has been recombined with the winning hypothesis may now
be the new winning hypothesis. The output search graph has to be reordered to reflect this.
2. The feature functions in the 2nd pass produce state information. Recombined hypotheses
may no longer be recombined and have to be split.
3. It would be useful for feature function scores to be evaluated asynchronously. That
is, a function to calculate the score is called, but the score is computed later. Skills
required - C++, NLP, Moses. (GSOC)
General Framework & Tools
Out-of-vocabulary (OOV) word handling: Currently there are two choices for OOVs -
pass them through or drop them. Often neither is appropriate, and Moses lacks both good
hooks to add new OOV strategies and the alternative strategies themselves. A new phrase
table class should be created which processes OOVs. To create a new phrase-table type,
make a copy of moses/TranslationModel/SkeletonPT.*, rename the class, and follow the
example in the file to implement your own code. Skills required - C++, Moses. (GSOC)
Tokenization for your language: Tokenization is the only part of the basic SMT process
that is language-specific. You can help make translation for your language better. Make
a copy of the file scripts/share/nonbreaking_prefixes/nonbreaking_prefix.en and
fill it with the non-breaking prefixes of your language. Skills required - SMT, Moses, lots
of human languages. (GSOC)
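How a non-breaking prefix list is consulted can be sketched as follows. This is a simplification: the real Moses splitter also supports #NUMERIC_ONLY# entries and inspects the following token before deciding:

```python
# Illustrative use of a non-breaking prefix list in sentence splitting.
nonbreaking = {"Dr", "Mr", "vs"}

def ends_sentence(token):
    # A trailing period ends a sentence unless the word before it is a
    # known non-breaking prefix (e.g. an abbreviation).
    return token.endswith(".") and token.rstrip(".") not in nonbreaking

print(ends_sentence("house."))  # True
print(ends_sentence("Dr."))     # False
```

Collecting the abbreviations of your language into such a list is most of the work of adapting tokenization to it.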
Python interface: A Python interface to the decoder could enable easy experimentation
and incorporation into other tools. cdec has one28 and Moses has a python interface to
the on-disk phrase tables (implemented by Wilker Aziz) but it would be useful to be able
to call the decoder from python.
Analysis of results: (Philipp Koehn) Assessing the impact of variations in the design of
a machine translation system by observing the fluctuations of the BLEU score may not
be sufficiently enlightening. Having more analysis of the types of errors a system makes
should be very useful.
Engineering Improvements
Integration of sigfilter: The filtering algorithm of Johnson et al29 is available30 in Moses,
but it is not well integrated, has awkward external dependencies, and so is seldom used.
At the moment the code is in the contrib directory. A useful project would be to refactor
this code to use the Moses libraries for suffix arrays, and to integrate it with the Moses
experiment management system (EMS). The goal would be to enable the filtering to be
turned on with a simple switch in the EMS config file.
Boostification: Moses has allowed boost31 since Autumn 2011, but there are still many
areas of the code that could be improved by usage of the boost libraries, for instance using
shared pointers in collections.
Cruise control: Moses has cruise control32 running on a server at the University of Ed-
inburgh, however this only tests one platform (Ubuntu 12.04). If you have a different
platform, and care about keeping Moses stable on that platform, then you could set up a
cruise control instance too. The code is all in the standard Moses distribution.
Maintenance: The documentation always needs maintenance as new features are intro-
duced and old ones are updated. Such a large body of documentation inevitably contains
mistakes and inconsistencies, so any help in fixing these would be most welcome. If you
want to work on the documentation, just introduce yourself on the mailing list.
Help messages: Moses has a lot of executables, and often the help messages are quite
cryptic or missing. A help message in the code is more likely to be maintained than
separate documentation, and easier to locate when you’re trying to find the right options.
Fixing the help messages would be a useful contribution to making Moses easier to use.
Subsection last modified on June 16, 2015, at 02:05 PM
2.1 Getting Started with Moses
This section will show you how to install and build Moses, and how to use Moses to translate
with some simple models. If you experience problems, then please check the support1 page. If
you do not want to build Moses from source, then there are packages2 available for Windows
and popular Linux distributions.
2.1.1 Easy Setup on Ubuntu (on other Linux systems, you’ll need to install packages
that provide gcc, make, git, automake, and libtool)
1. Install required Ubuntu packages to build Moses and its dependencies:
sudo apt-get install build-essential git-core pkg-config automake libtool wget
zlib1g-dev python-dev libbz2-dev
For the regression tests, you’ll also need
sudo apt-get install libsoap-lite-perl
See below for additional packages that you’ll need to actually run Moses (especially when
you are using EMS).
2. Clone Moses from the repository and cd into the directory for building Moses
git clone
cd mosesdecoder
3. Run the following to install a recent version of Boost (the default version on your system
might be too old), as well as cmph (for CompactPT), irstlm (language model from FBK,
required to pass the regression tests), and xmlrpc-c (for moses server). By default, these
will be installed in ./opt in your working directory:
make -f contrib/Makefiles/install-dependencies.gmake
4. To compile moses, run
./ [additional options]
24 2. Installation
Popular additional bjam options (called from within ./ and ./
--prefix=/destination/path --install-scripts
... to install Moses somewhere else on your system
--with-mm enables suffix array-based phrase tables3
Note that you’ll still need a word aligner; this is not built automatically
Running regression tests (Advanced; for Moses developers; normal users won’t need this)
To compile and run the regression tests all in one go, run
./ [additional options]
Regression testing is only of interest for people who are actively making changes in the Moses
codebase. If you are just using Moses to run MT experiments, there’s no point in running
regression tests, unless you want to check that your current version of Moses is working as
expected. However, you can also check your version against the daily regression tests here4.
If you run your own regression tests, sometimes Moses will fail them even when everything
is working correctly, because different compilers produce slightly different executables that
might produce slightly different output because they make different kinds of rounding errors.
Manually installing Boost
Boost 1.48 has a serious bug which breaks Moses compilation. Unfortunately, some Linux
distributions (e.g. Ubuntu 12.04) ship broken versions of the Boost library. In these cases, you
must download and compile Boost yourself.
These are the exact commands I (Hieu) use to compile boost:
tar zxvf boost_1_64_0.tar.gz
cd boost_1_64_0/
./b2 -j4 --prefix=$PWD --libdir=$PWD/lib64 --layout=system link=static install || echo FAILURE
This creates the library files in the directory lib64, NOT in the system directory, so you don’t
need to be system admin/root to run this. However, you will need to tell Moses where to find
boost, which is explained below.
Once boost is installed, you can then compile Moses. However, you must tell Moses where
boost is with the --with-boost flag. This is the exact command I use to compile Moses:
./bjam --with-boost=~/workspace/temp/boost_1_64_0 -j4
2.1.2 Compiling Moses directly with bjam
You may need to do this if
 doesn’t work for you, for example,
i. you’re using OSX
ii. you don’t have all the prerequisites installed on your system so you want to compile Moses with a reduced number of features
2. You want more control over exactly what options and features you want
To compile with the bare minimum of features:
./bjam -j4
If you have compiled boost manually, then tell bjam where it is:
./bjam --with-boost=~/workspace/temp/boost_1_64_0 -j8
If you have compiled the cmph library manually:
./bjam --with-cmph=/Users/hieu/workspace/cmph-2.0
If you have compiled the xmlrpc-c library manually:
./bjam --with-xmlrpc-c=/Users/hieu/workspace/xmlrpc-c/xmlrpc-c-1.33.17
If you have compiled the IRSTLM library manually:
./bjam --with-irstlm=/Users/hieu/workspace/irstlm/irstlm-5.80.08/trunk
This is the exact command I (Hieu) used on Linux:
./bjam --with-boost=/home/s0565741/workspace/boost/boost_1_57_0 --with-cmph=/home/s0565741/workspace/cmph-2.0 --with-irstlm=/home/s0565741/workspace/irstlm-code --with-xmlrpc-c=/home/s0565741/workspace/xmlrpc-c/xmlrpc-c-1.33.17 -j12
Compiling on OSX
Recent versions of OSX use the clang C/C++ compiler rather than gcc. When compiling with
bjam, you must add the following:
./bjam toolset=clang
This is the exact command I (Hieu) use on OSX Yosemite:
./bjam --with-boost=/Users/hieu/workspace/boost/boost_1_59_0.clang/ --with-cmph=/Users/hieu/workspace/cmph-2.0 --with-xmlrpc-c=/Users/hieu/workspace/xmlrpc-c/xmlrpc-c-1.33.17 --with-irstlm=/Users/hieu/workspace/irstlm/irstlm-5.80.08/trunk --with-mm --with-probing-pt -j5 toolset=clang -q -d2
You also need to add this argument when manually compiling boost. This is the exact com-
mand I use:
./b2 -j8 --prefix=$PWD --libdir=$PWD/lib64 --layout=system link=static toolset=clang install || echo FAILURE
2.1.3 Other software to install
Word Alignment
Moses requires a word alignment tool, such as giza++5, mgiza6, or Fast Align7.
I (Hieu) use MGIZA because it is multi-threaded and gives generally good results; however,
I’ve also heard good things about Fast Align. You can find instructions to compile them here8.
Language Model Creation
Moses includes the KenLM language model creation program, lmplz9.
You can also create language models with IRSTLM10 and SRILM11. Please read this12 if you
want to compile IRSTLM. Language model toolkits perform two main tasks: training and
querying. You can train a language model with any of them, produce an ARPA file, and query
with a different one. To train a model, just call the relevant script.
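What LM training estimates can be illustrated with a toy maximum-likelihood bigram model in Python. Real toolkits such as lmplz additionally apply smoothing (e.g. modified Kneser-Ney) and write the result as an ARPA file; the corpus below is invented:

```python
# Toy maximum-likelihood bigram language model (no smoothing).
from collections import Counter

corpus = [["<s>", "this", "is", "a", "house", "</s>"],
          ["<s>", "this", "is", "small", "</s>"]]

bigrams = Counter(b for sent in corpus for b in zip(sent, sent[1:]))
histories = Counter(w for sent in corpus for w in sent[:-1])

def p(word, prev):
    # p(word | prev) by relative frequency over the training corpus.
    return bigrams[(prev, word)] / histories[prev]

print(p("is", "this"))  # 1.0: "this" is always followed by "is"
```

The decoder then scores each candidate translation by the product (or log-sum) of such conditional probabilities, which is what makes fluent word sequences win.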
If you want to use SRILM or IRSTLM to query the language model, then they need to be linked
with Moses. For IRSTLM, you first need to compile IRSTLM then use the --with-irstlm switch
to compile Moses with IRSTLM. This is the exact command I used:
./bjam --with-irstlm=/home/s0565741/workspace/temp/irstlm-5.80.03 -j4
Personally, I only use IRSTLM as a query tool in this way if the LM n-gram order is over 7. In
most situations, I use KenLM because KenLM is multi-threaded and faster.
2.1.4 Platforms
The primary development platform for Moses is Linux, and this is the recommended platform
since you will find it easier to get support for it. However Moses does work on other platforms:
2.1.5 OSX Installation
Mac OSX is widely used by Moses developers and everything should run fine. Installation is
the same as for Linux.
Out of the box, Mac OSX doesn’t have many programs that are critical to Moses, or has
different versions of standard GNU programs. For example, split, sort, and zcat are
incompatible BSD versions rather than GNU versions.
Therefore, Moses has been tested on Mac OSX with MacPorts. Make sure you have this
installed on your machine. Success has also been reported with brew installation. Do note,
however, that you will need to install xmlrpc-c independently, and then compile with bjam
using the --with-xmlrpc-c=/usr/local flag (where /usr/local/ is the default location of the
xmlrpc-c installation).
2.1.6 Linux Installation
Install the following packages using the command
apt-get install [package name]
Install the following packages using the command
sudo apt-get install [package name]
libgoogle-perftools-dev (for tcmalloc)
Fedora / Redhat / CentOS / Scientific Linux
Install the following packages using the command
yum install [package name]
In addition, you have to install some perl packages:
cpan XML::Twig
cpan Sort::Naturally
2.1.7 Windows Installation
Moses can run on Windows 10 with the Ubuntu 16.04 subsystem. Installation is exactly the same
as for Ubuntu. (Are you running it on Windows? If so, please give us feedback on how it works.)
Install the following packages via Cygwin:
Also, the nist-bleu script needs a perl module called XML::Twig13. Install the following perl modules:
cpan XML::Twig
cpan Sort::Naturally
2.1.8 Run Moses for the first time
Download the sample models and extract them into your working directory:
cd ~/mosesdecoder
tar xzf sample-models.tgz
cd sample-models
Run the decoder
cd ~/mosesdecoder/sample-models
~/mosesdecoder/bin/moses -f phrase-model/moses.ini < phrase-model/in > out
If everything worked out right, this should translate the sentence "das ist ein kleines haus" (in
the file in) as "this is a small house" (in the file out).
Note that the configuration file moses.ini in each directory is set to use the KenLM language
model toolkit by default. If you prefer to use IRSTLM14, then edit the language model en-
try in moses.ini, replacing KENLM with IRSTLM. You will also have to compile with ./bjam
--with-irstlm, adding the full path of your IRSTLM installation.
Moses also supports SRILM and RandLM language models. See here15 for more details.
2.1.9 Chart Decoder
The chart decoder is part of the same executable as of version 3.0.
You can run the chart demos from the sample-models directory as follows:
~/mosesdecoder/bin/moses -f string-to-tree/moses.ini < string-to-tree/in > out.stt
~/mosesdecoder/bin/moses -f tree-to-tree/moses.ini < tree-to-tree/in.xml > out.ttt
The expected result of the string-to-tree demo is
this is a small house
2.1.10 Next Steps
Why not try to build a Baseline (Section 2.3) translation system with freely available data?
2.1.11 bjam options
This is a list of options to bjam. On a system with Boost installed in a standard path, none
should be required, but you may want additional functionality or control.
Optional packages
Language models In addition to KenLM and ORLM (which are always compiled):
--with-irstlm=/path/to/irstlm Path to IRSTLM installation
--with-randlm=/path/to/randlm Path to RandLM installation
--with-nplm=/path/to/nplm Path to NPLM installation
--with-srilm=/path/to/srilm Path to SRILM installation.
If your SRILM install is non-standard, use these options:
--with-srilm-dynamic Link against
--with-srilm-arch=arch Override the arch setting given by /path/to/srilm/sbin/machine-type
Other packages
--with-boost=/path/to/boost If Boost is in a non-standard location, specify it here. This direc-
tory is expected to contain include and lib or lib64.
--with-xmlrpc-c=/path/to/xmlrpc-c Specify a non-standard libxmlrpc-c installation path. Used
by Moses server.
--with-cmph=/path/to/cmph Path where CMPH is installed. Used by the compact phrase table
and compact lexical reordering table.
--without-tcmalloc Disable thread-caching malloc.
--with-regtest=/path/to/moses-regression-tests Run the regression tests using data from this
directory. Tests can be downloaded from
--prefix=/path/to/prefix sets the install prefix [default is source root].
--bindir=/path/to/prefix/bin sets the bin directory [default is PREFIX/bin]
--libdir=/path/to/prefix/lib sets the lib directory [default is PREFIX/lib]
--includedir=/path/to/prefix/include installs headers. Does not install if missing. No argu-
ment defaults to PREFIX/include .
--install-scripts=/path/to/scripts copies scripts into a directory. Does not install if missing. No
argument defaults to PREFIX/scripts .
--git appends the git revision to the prefix directory.
Build Options
By default, the build is multi-threaded, optimized, and statically linked.
threading=single|multi controls threading (default multi)
variant=release|debug|profile builds optimized (default), for debug, or for profiling
link=static|shared controls preferred linking (default static)
--static forces static linking (the default will fall back to shared)
debug-symbols=on|off include (default) or exclude debugging information also known as -g
--notrace compiles without TRACE macros
--enable-boost-pool uses Boost pools for the memory SCFG table
--enable-mpi switch on mpi (used for MIRA - one of the tuning algorithms)
--without-libsegfault does not link with libSegFault
--max-kenlm-order maximum ngram order that kenlm can process (default 6)
--max-factors maximum number of factors (default 4)
--unlabelled-source ignore source nonterminals (if you only use hierarchical or string-to-tree
models without source syntax)
Controlling the Build
-q quit on the first error
-a to build from scratch
-j$NCPUS to compile in parallel
--clean to clean
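Putting a few of these options together, a typical invocation might look like the following (the IRSTLM and CMPH paths are illustrative; adjust them to where those packages live on your machine):

```
./bjam --with-irstlm=$HOME/irstlm --with-cmph=$HOME/cmph -j8
```

This builds the default optimized, multi-threaded, statically linked decoder with IRSTLM and compact phrase-table support, compiling on 8 cores.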
2.2 Building with Eclipse
There is a video, "How to compile Moses with Eclipse", showing you how to set up Moses with Eclipse.
Moses comes with Eclipse project files for some of the C++ executables. Currently, there are
project files for
* moses (decoder)
* moses-cmd (decoder)
* extract
* extract-rules
* extract-ghkm
* server
* ...
The Eclipse build is used primarily for development and debugging. It is not optimized and
doesn’t have many of the options available in the bjam build.
The advantage of using Eclipse is that it offers code-completion and a GUI debugging environment.
NB. The recent update of Mac OSX replaces g++ with clang. Eclipse doesn’t yet fully function
with clang. Therefore, you should not use the Eclipse build with any OSX version higher than
10.8 (Mountain Lion).
Follow these instructions to build with Eclipse:
* Use the version of Eclipse for C++. Works (at least) with Eclipse Kepler and Luna.
* Get the Moses source code
git clone
cd mosesdecoder
* Create a softlink to Boost (and optionally to XMLRPC-C lib if you want to compile the moses server) in the Moses root directory
e.g. ln -s ~/workspace/boost_x_xx_x boost
* Create a new Eclipse workspace. The workspace MUST be in
Eclipse should now be running.
* Import all the Moses Eclipse projects into the workspace.
File >> Import >> Existing Projects into Workspace >> Select root directory: contrib/other-builds/ >> Finish
* Compile all projects.
Project >> Build All
Subsection last modified on January 15, 2018, at 10:58 PM
2.3 Baseline System
2.3.1 Overview
This guide assumes that you have successfully installed Moses (Section 2.1), and would like
to see how to use parallel data to build a real phrase-based translation system. The process
requires some familiarity with UNIX and, ideally, access to a Linux server. It can be run on a
laptop, but could take about a day and requires at least 2G of RAM, and about 10G of free disk
space (these requirements are just educated guesses, so if you have a different experience then
please mail support).
If you want to save the effort of typing in all the commands on this page (and see how the pros
manage their experiments), then skip straight to the experiment management system (Section
2.3.8) instructions below. But I’d recommend that you follow through the process manually, at
least once, just to see how it all works.
2.3.2 Installation
The minimum software requirements are:
Moses (obviously!)
GIZA++, for word-aligning your parallel corpus
IRSTLM, SRILM, or KenLM for language model estimation.
KenLM is included in Moses and the default in the Moses tool-chain. IRSTLM and KenLM are
LGPL licensed (like Moses) and therefore available for commercial use.
For the purposes of this guide, I will assume that you’re going to install all the tools and data in
your home directory (i.e. ~/), and that you’ve already downloaded and compiled Moses into
~/mosesdecoder. And you’re going to run Moses from there.
Installing GIZA++
GIZA++ is hosted at Google Code, and a mirror of the original documentation can be found
here. I recommend that you download the latest version with git:
git clone
cd giza-pp
make
This should create the binaries ~/giza-pp/GIZA++-v2/GIZA++,~/giza-pp/GIZA++-v2/snt2cooc.out
and ~/giza-pp/mkcls-v2/mkcls. These need to be copied to somewhere that Moses can find
them as follows
cd ~/mosesdecoder
mkdir tools
cp ~/giza-pp/GIZA++-v2/GIZA++ ~/giza-pp/GIZA++-v2/snt2cooc.out \
~/giza-pp/mkcls-v2/mkcls tools
When you come to run the training, you need to tell the training script where GIZA++ was
installed using the -external-bin-dir argument.
train-model.perl -external-bin-dir $HOME/mosesdecoder/tools
UPDATE - GIZA++ only compiles with gcc. If you’re using OSX Mavericks, you’ll have to
install gcc yourself. I (Hieu) recommend using MGIZA instead.
2.3.3 Corpus Preparation
To train a translation system we need parallel data (text translated into two different languages)
which is aligned at the sentence level. Luckily there’s plenty of this data freely available, and
for this system I’m going to use a small (only 130,000 sentences!) data set released for the 2013
Workshop in Machine Translation. To get the data we want, we have to download the tarball
and unpack it (into a corpus directory in our home directory) as follows
mkdir corpus
cd corpus
tar zxvf training-parallel-nc-v8.tgz
If you look in the ~/corpus/training directory you’ll see that there’s data from news-commentary
(news analysis from project syndicate) in various languages. We’re going to build a French-
English (fr-en) translation system using the news commentary data set, but feel free to use one
of the other language pairs if you prefer.
To prepare the data for training the translation system, we have to perform the following steps:
tokenisation: This means that spaces have to be inserted between (e.g.) words and punctuation.
truecasing: The initial words in each sentence are converted to their most probable casing.
This helps reduce data sparsity.
cleaning: Long sentences and empty sentences are removed as they can cause problems
with the training pipeline, and obviously mis-aligned sentences are removed.
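To see what tokenisation does on a toy example, here is a crude approximation with sed (illustrative only; the real tokenizer.perl also handles abbreviations, hyphens and Unicode):

```shell
# Crudely separate basic punctuation from the preceding word
# (a stand-in for tokenizer.perl; real tokenisation is more careful).
echo 'this is a small house.' | sed 's/\([.,!?]\)/ \1/g'
# prints: this is a small house .
```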
The tokenisation can be run as follows:
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
< ~/corpus/training/ \
> ~/corpus/
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr \
< ~/corpus/training/ \
> ~/corpus/
The truecaser first requires training, in order to extract some statistics about the text:
~/mosesdecoder/scripts/recaser/train-truecaser.perl \
--model ~/corpus/truecase-model.en --corpus \
~/mosesdecoder/scripts/recaser/train-truecaser.perl \
--model ~/corpus/ --corpus \
Truecasing uses another script from the Moses distribution:
~/mosesdecoder/scripts/recaser/truecase.perl \
--model ~/corpus/truecase-model.en \
< ~/corpus/ \
> ~/corpus/
~/mosesdecoder/scripts/recaser/truecase.perl \
--model ~/corpus/ \
< ~/corpus/ \
> ~/corpus/
Finally we clean, limiting sentence length to 80:
~/mosesdecoder/scripts/training/clean-corpus-n.perl \
~/corpus/ fr en \
~/corpus/ 1 80
Notice that the last command processes both sides at once.
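What cleaning does can be illustrated with a toy stand-in (illustrative only; clean-corpus-n.perl additionally enforces a length ratio between the two sides):

```shell
# Toy parallel corpus: the second pair has an empty English side.
printf 'bonjour le monde\nau revoir\n' > toy.fr
printf 'hello world\n\n' > toy.en
# Keep only pairs where both sides have between 1 and 80 tokens.
paste toy.fr toy.en | awk -F'\t' \
  '{ns=split($1,s," "); nt=split($2,t," ");
    if (ns>=1 && ns<=80 && nt>=1 && nt<=80) print}' > toy.clean
cat toy.clean
```

Only the complete pair survives; the pair with the empty English side is dropped.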
2.3.4 Language Model Training
The language model (LM) is used to ensure fluent output, so it is built with the target language
(i.e. English in this case). The KenLM documentation gives a full explanation of the command-
line options, but the following will build an appropriate 3-gram language model.
mkdir ~/lm
cd ~/lm
~/mosesdecoder/bin/lmplz -o 3 <~/corpus/ >
Then you should binarise (for faster loading) the *.arpa.en file using KenLM:
~/mosesdecoder/bin/build_binary \ \
(Note that you can also use IRSTLM, which also has a binary format that Moses supports. See
the IRSTLM documentation for more information. For simplicity we only describe one
approach here.)
You can check the language model by querying it, e.g.
$ echo "is this an English sentence ?" \
| ~/mosesdecoder/bin/query
Loading statistics:
Name:query VmPeak:46788 kB VmRSS:30828 kB RSSMax:0 kB \
user:0 sys:0 CPU:0 real:0.012207
is=35 2 -2.6704 this=287 3 -0.889896 an=295 3 -2.25226 \
English=7286 1 -5.27842 sentence=4470 2 -2.69906 \
?=65 1 -3.32728 </s>=21 2 -0.0308115 Total: -17.1481 OOV: 0
After queries:
Name:query VmPeak:46796 kB VmRSS:30828 kB RSSMax:0 kB \
user:0 sys:0 CPU:0 real:0.0129395
Total time including destruction:
Name:query VmPeak:46796 kB VmRSS:1532 kB RSSMax:0 kB \
user:0 sys:0 CPU:0 real:0.0166016
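Each token in the query output is reported as word=vocabulary-id ngram-length log10-probability, and the per-token scores add up to the reported Total, which is easy to verify:

```shell
# Sum the per-token log10 probabilities from the query output above;
# rounded to four decimals this reproduces the reported Total of -17.1481.
echo '-2.6704 -0.889896 -2.25226 -5.27842 -2.69906 -3.32728 -0.0308115' |
  awk '{t=0; for (i=1; i<=NF; i++) t+=$i; printf "%.4f\n", t}'
```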
2.3.5 Training the Translation System
Finally we come to the main event - training the translation model. To do this, we run word-
alignment (using GIZA++), phrase extraction and scoring, and create the lexicalised reordering
tables and the Moses configuration file, all with a single command. I recommend that you
create an appropriate directory as follows, and then run the training command, catching logs:
mkdir ~/working
cd ~/working
nohup nice ~/mosesdecoder/scripts/training/train-model.perl -root-dir train \
-corpus ~/corpus/ \
-f fr -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
-lm 0:3:$HOME/lm/ \
-external-bin-dir ~/mosesdecoder/tools >& training.out &
If you have a multi-core machine it’s worth using the -cores argument to encourage as much
parallelisation as possible.
This took about 1.5 hours using 2 cores on a powerful laptop (Intel i7-2640M, 8GB RAM, SSD).
Once it’s finished there should be a moses.ini file in the directory ~/working/train/model.
You can use the model specified by this ini file to decode (i.e. translate), but there are a couple
of problems with it. The first is that it’s very slow to load, but we can fix that by binarising the
phrase table and reordering table, i.e. compiling them into a format that can be loaded quickly.
The second problem is that the weights used by Moses to weight the different models against
each other are not optimised - if you look at the moses.ini file you’ll see that they’re set to
default values like 0.2, 0.3 etc. To find better weights we need to tune the translation system,
which leads us on to the next step...
2.3.6 Tuning
This is the slowest part of the process, so you might want to line up something to read whilst it’s
progressing. Tuning requires a small amount of parallel data, separate from the training data,
so again we’ll download some data kindly provided by WMT. Run the following commands
(from your home directory again) to download the data and put it in a sensible place.
cd ~/corpus
tar zxvf dev.tgz
We’re going to use news-test2008 for tuning, so we have to tokenise and truecase it first (don’t
forget to use the correct language if you’re not building a fr->en system)
cd ~/corpus
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
< dev/news-test2008.en > news-test2008.tok.en
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr \
< dev/ >
~/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.en \
< news-test2008.tok.en > news-test2008.true.en
~/mosesdecoder/scripts/recaser/truecase.perl --model \
< >
Now go back to the directory we used for training, and launch the tuning process:
cd ~/working
nohup nice ~/mosesdecoder/scripts/training/ \
~/corpus/ ~/corpus/news-test2008.true.en \
~/mosesdecoder/bin/moses train/model/moses.ini --mertdir ~/mosesdecoder/bin/ \
&> mert.out &
If you have several cores at your disposal, then it’ll be a lot faster to run Moses multi-threaded.
Add --decoder-flags="-threads 4" to the last line above in order to run the decoder with 4
threads. With this setting, tuning took about 4 hours for me.
The end result of tuning is an ini file with trained weights, which should be in
~/working/mert-work/moses.ini if you’ve used the same directory structure as me.
2.3.7 Testing
You can now run Moses with
~/mosesdecoder/bin/moses -f ~/working/mert-work/moses.ini
and type in your favourite French sentence to see the results. You’ll notice, though, that the
decoder takes at least a couple of minutes to start-up. In order to make it start quickly, we
can binarise the phrase-table and lexicalised reordering models. To do this, create a suitable
directory and binarise the models as follows:
mkdir ~/working/binarised-model
cd ~/working
~/mosesdecoder/bin/processPhraseTableMin \
-in train/model/phrase-table.gz -nscores 4 \
-out binarised-model/phrase-table
~/mosesdecoder/bin/processLexicalTableMin \
-in train/model/reordering-table.wbe-msd-bidirectional-fe.gz \
-out binarised-model/reordering-table
Note: If you get the error ...~/mosesdecoder/bin/processPhraseTableMin: No such file
or directory, please make sure to compile Moses with CMPH.
Then make a copy of the ~/working/mert-work/moses.ini in the binarised-model directory
and change the phrase and reordering tables to point to the binarised versions, as follows:
1. Change PhraseDictionaryMemory to PhraseDictionaryCompact
2. Set the path of the PhraseDictionary feature to point to
3. Set the path of the LexicalReordering feature to point to
Loading and running a translation is pretty fast (for this I supplied the French sentence "faire
revenir les militants sur le terrain et convaincre que le vote est utile .") :
Defined parameters (per moses.ini or switch):
config: binarised-model/moses.ini
distortion-limit: 6
feature: UnknownWordPenalty WordPenalty PhraseDictionaryCompact \
name=TranslationModel0 table-limit=20 num-features=5 \
path=/home/bhaddow/working/binarised-model/phrase-table \
input-factor=0 output-factor=0
LexicalReordering name=LexicalReordering0 \
num-features=6 type=wbe-msd-bidirectional-fe-allff \
input-factor=0 output-factor=0 \
Distortion KENLM lazyken=0 name=LM0 \
factor=0 path=/home/bhaddow/lm/ order=3
input-factors: 0
mapping: 0 T 0
weight: LexicalReordering0= 0.119327 0.0221822 0.0359108 \
0.107369 0.0448086 0.100852 Distortion0= 0.0682159 \
LM0= 0.0794234 WordPenalty0= -0.0314219 TranslationModel0= 0.0477904 \
0.0621766 0.0931993 0.0394201 0.147903
FeatureFunction: UnknownWordPenalty0 start: 0 end: 0
FeatureFunction: WordPenalty0 start: 1 end: 1
line=PhraseDictionaryCompact name=TranslationModel0 table-limit=20 \
num-features=5 path=/home/bhaddow/working/binarised-model/phrase-table \
input-factor=0 output-factor=0
FeatureFunction: TranslationModel0 start: 2 end: 6
line=LexicalReordering name=LexicalReordering0 num-features=6 \
type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 \
FeatureFunction: LexicalReordering0 start: 7 end: 12
Initializing LexicalReordering..
FeatureFunction: Distortion0 start: 13 end: 13
line=KENLM lazyken=0 name=LM0 factor=0 \
path=/home/bhaddow/lm/ order=3
FeatureFunction: LM0 start: 14 end: 14
binary file loaded, default OFF_T: -1
Created input-output object : [0.000] seconds
Translating line 0 in thread id 140592965015296
Translating: faire revenir les militants sur le terrain et \
convaincre que le vote est utile .
reading bin ttable
size of OFF_T 8
binary phrasefile loaded, default OFF_T: -1
binary file loaded, default OFF_T: -1
Line 0: Collecting options took 0.000 seconds
Line 0: Search took 1.000 seconds
bring activists on the ground and convince that the vote is useful .
BEST TRANSLATION: bring activists on the ground and convince that \
the vote is useful . [111111111111111] [total=-8.127] \
core=(0.000,-13.000,-10.222,-21.472,-4.648,-14.567,6.999,-2.895,0.000, \
Line 0: Translation took 1.000 seconds total
Name:moses VmPeak:214408 kB VmRSS:74748 kB \
RSSMax:0 kB user:0.000 sys:0.000 CPU:0.000 real:1.031
The translation ("bring activists on the ground and convince that the vote is useful .") is quite
rough, but understandable - bear in mind this is a very small data set for general domain
translation. Also note that your results may differ slightly due to non-determinism in the
tuning process.
At this stage, you’re probably wondering how good the translation system is. To measure this,
we use another parallel data set (the test set) distinct from the ones we’ve used so far. Let’s pick
newstest2011, and so first we have to tokenise and truecase it as before
cd ~/corpus
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
< dev/newstest2011.en > newstest2011.tok.en
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr \
< dev/ >
~/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.en \
< newstest2011.tok.en > newstest2011.true.en
~/mosesdecoder/scripts/recaser/truecase.perl --model \
< >
The model that we’ve trained can then be filtered for this test set, meaning that we only retain
the entries needed to translate the test set. This will make the translation a lot faster.
cd ~/working
~/mosesdecoder/scripts/training/ \
filtered-newstest2011 mert-work/moses.ini ~/corpus/ \
-Binarizer ~/mosesdecoder/bin/processPhraseTableMin
You can test the decoder by first translating the test set (takes a wee while) then running the
BLEU script on it:
nohup nice ~/mosesdecoder/bin/moses \
-f ~/working/filtered-newstest2011/moses.ini \
< ~/corpus/ \
> ~/working/newstest2011.translated.en \
2> ~/working/newstest2011.out
~/mosesdecoder/scripts/generic/multi-bleu.perl \
-lc ~/corpus/newstest2011.true.en \
< ~/working/newstest2011.translated.en
This gives me a BLEU score of 23.5 (in comparison, the best result at WMT11 was 30.5, although
it should be cautioned that this uses NIST BLEU, which does its own tokenisation, so there will
be 1-2 points difference in the score anyway).
2.3.8 Experiment Management System (EMS)
If you’ve been through the effort of typing in all the commands, then by now you’re probably
wondering if there’s an easier way. If you’ve skipped straight down here without bothering
about the manual route then, well, you may have missed out on a useful Moses "rite of passage".
The easier way is, of course, to use the EMS (Section 3.5). To use EMS, you’ll have to install a
few dependencies, as detailed on the EMS page, and then you’ll need this config file. Make
a directory ~/working/experiments and place the config file in there. If you open it up, you’ll
see the home-dir variable defined at the top - then make the obvious change. If you set the
home directory, download the train, tune and test data and place it in the locations described
above, then this config file should work.
To run EMS from the experiments directory, you can use the command:
nohup nice ~/mosesdecoder/scripts/ems/experiment.perl -config config -exec &> log &
then sit back and wait for the BLEU score to appear in evaluation/report.1
Subsection last modified on January 12, 2017, at 02:13 PM
2.4 Releases
2.4.1 Release 4.0 (5th Oct, 2017)
This is the current stable release.
Get the code on github
Download Binaries
Pre-made models
Virtual Machines files
Release notes
2.4.2 Release 3.0 (3rd Feb, 2015)
Get the code on github
Download Binaries
Pre-made models
Virtual Machines files
Release notes
2.4.3 Release 2.1.1 (3rd March, 2014)
This is a minor patch for a bug that prevented Moses from linking with tcmalloc when it is
available on the compilation machine. Using tcmalloc can substantially speed up decoding, at
the cost of more memory usage.
Get the code on github
2.4.4 Release 2.1 (21st Jan, 2014)
Get the code on github
Download Binaries
The broad aim of this release is to tackle more complicated issues to enable better expandability
and reliability.
Specifically, the decoder has been refactored to create a more modular framework to enable
easier incorporation of new feature functions into Moses. This has necessitated major changes
in many other parts of the toolkit, including training and tuning.
As well as the refactored code, this release also incorporates a host of new features donated
by other developers. Transliteration modules, better error handling, small and fast language
models, and placeholders are just some of the new features that spring to mind.
We have also continued to expand the testing regime to maintain the reliability of the toolkit,
while enabling more developers to contribute to the project.
We distribute Moses as:
1. source code,
2. binaries for Windows (32 and 64 bit), Mac OSX (Mavericks), and various flavours of Linux (32 and 64 bit),
3. pre-installed in a Linux virtual machine, using the open source VirtualBox application,
4. an Amazon cloud server image.
2.4.5 Release 1.0 (28th Jan, 2013)
Get the code on github
The Moses community has grown tremendously over the last few years. From the beginning
as a purely research-driven project, we are now a diverse community of academic and business
users, ranging in experience from hardened developers to new users.
Therefore, the first priority of this release has been to concentrate on resolving long-standing,
but straightforward, issues to make the toolkit easier to use and more efficient. The provision
of a full-time development team devoted to the maintenance and enhancement of the Moses
toolkit has allowed us to tackle many useful engineering problems.
A second priority was to put in place a multi-tiered testing regime to enable more developers
to contribute to the project, more quickly, while ensuring the reliability of the toolkit. However,
we have not stopped adding new features to the toolkit; the next section lists a number of major
features added in the last 9 months.
New Features
The following is a list of the major new features in the Moses toolkit since May 2012, in roughly
chronological order.
Parallel Training by Hieu Hoang and Rohit Gupta. The training process has been improved
and can take advantage of multi-core machines. Parallelization was achieved by partitioning
the input data, then running the translation rule extraction processes in parallel before merging
the data. The following is the timing for the extract process on different numbers of cores:
Cores              One    Two    Three  Four
Time taken (mins)  48:55  33:13  27:52  25:35
The training processes have also been redesigned to decrease disk access, and to use less disk
space. This is important for parallel processing as disk IO often becomes the limiting factor
with a large number of simultaneous disk accesses. It is also important when training syntacti-
cally inspired models or using large amounts of training data, which can result in very large
translation models.
IRST LM training integration by Hieu Hoang and Philipp Koehn The IRST toolkit for
training language models has been integrated into the Experiment Management System. The
SRILM software previously carried out this functionality. Substituting IRST for SRI means that
the entire training pipeline can be run using only free, open-source software. Not only is the
IRST toolkit unencumbered by a proprietary license, it is also parallelizable and capable of
training with a larger amount of data than was otherwise possible with SRI.
Distributed Language Model by Oliver Wilson. Language models can be distributed across
many machines, allowing more data to be used at the cost of a performance overhead. This is
still experimental code.
Incremental Search Algorithm by Kenneth Heafield. A replacement for the cube pruning
algorithm in CKY++ decoding, used in hierarchical and syntax models. It offers a better tradeoff
between decoding speed and translation quality.
Compressed Phrase-Table and Reordering-Tables by Marcin Junczys-Dowmunt. A phrase-
table and lexicalized reordering-table implementation which is both small and fast.
Sparse features by Eva Hasler, Barry Haddow, Philipp Koehn A framework to allow a large
number of sparse features in the decoder. A number of sparse feature functions described in
the literature have been reproduced in Moses. Currently, the available sparse feature functions are:
1. TargetBigramFeature
2. TargetNgramFeature
3. SourceWordDeletionFeature
4. SparsePhraseDictionaryFeature
5. GlobalLexicalModelUnlimited
6. PhraseBoundaryState
7. PhraseLengthFeature
8. PhrasePairFeature
9. TargetWordInsertionFeature
Suffix array for hierarchical models by Hieu Hoang The training of syntactically-inspired
hierarchical models requires a large amount of time and resources. An alternative to training a
full translation model is to extract only the required translation rules for each input sentence.
We have integrated Adam Lopez’s suffix array implementation into Moses. This is a well-
known and mature implementation, which is hosted and maintained by the cdec community.
Multi-threaded tokenizer by Pidong Wang
Batched MIRA by Colin Cherry. A replacement for MERT, especially suited for tuning a
large number of sparse features. (Cherry and Foster, NAACL 2012).
LR score by Lexi Birch and others. The BLEU score commonly used in MT is insensitive
to reordering errors. We have integrated into the Moses toolkit another metric, LR score,
described in (Birch and Osborne, 2011), which better accounts for reordering.
Convergence of Translation Memory and Statistical Machine Translation by Philipp Koehn
and Hieu Hoang An alternative extract algorithm (Koehn and Senellart, AMTA 2010), which
is inspired by the use of translation memories, has been integrated into the Moses toolkit.
Word Alignment Information is turned on by default by Hieu Hoang and Barry Haddow
The word alignment produced by GIZA++/mgiza is carried by the phrase-table and made
available to the decoder. This information is required by some feature functions. The use of
these word alignments is now optimized for memory and speed, and enabled by default.
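For illustration, an entry in the standard Moses text phrase-table carrying alignment points looks like the following (the phrases, scores and counts here are invented; the fourth field lists source-target word alignments as index pairs):

```
das ist ||| this is ||| 0.8 0.6 0.9 0.7 ||| 0-0 1-1 ||| 12 10 9
```

Here 0-0 means the first source word aligns to the first target word, and 1-1 the second to the second.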
Modified Moore-Lewis filtering by Barry Haddow and Philipp Koehn Reimplementation
of the domain adaptation of parallel corpora described by Axelrod et al. (EMNLP 2011).
Lots and lots of cleanups and bug fixes By Ales Tamchyna, Wilker Aziz, Mark Fishel,
Tetsuo Kiso, Rico Sennrich, Lane Schwartz, Hiroshi Umemoto, Phil Williams, Tom Hoar,
Arianna Bisazza, Jacob Dlougach, Jonathon Clark, Nadi Tomeh, Karel Bilek, Christian Buck,
Oliver Wilson, Alex Fraser, Christophe Servan, Matous Machecek, Christian Federmann, and
Graham Neubig.
Building and Installing
The structure and installation of the Moses toolkit has been simplified to make compilation and
installation easier. The training and decoding process can be run from the directory in which
the toolkit was downloaded, without the need for a separate installation step.
This allows binary, ready-to-run versions of Moses to be distributed which can be downloaded
and executed immediately. Previously, the installation needed to be configured specifically for
the user’s machine.
A new build system has been implemented to build the Moses toolkit. This uses the Boost
library’s build framework. The new system offers several advantages over the previous build
system.
Firstly, the source code for the new build system is included in the Moses repository, which
is then bootstrapped the first time Moses is compiled. It does not rely on the cmake, automake,
make, and libtool applications. These have issues with cross-platform compatibility and
running on older operating systems.
Secondly, the new build system integrates the running of the unit tests and regression tests
with compilation.
Thirdly, the new system is significantly more powerful, allowing us to support a number of
new build features such as static and debug compilation, linking to external libraries such as
MPI and tcmalloc, and other non-standard builds.
The MosesCore team has implemented several layers of testing to ensure the reliability of the
toolkit. We describe each below.
Unit Testing Unit testing tests each function or class method in isolation. Moses uses the
unit testing framework available from the Boost library to implement unit testing.
The source code for the unit tests is integrated into the Moses source. The tests are executed
every time the Moses source is compiled.
The unit testing framework has recently been implemented. There are currently 20 unit tests
for various features in mert, mira, phrase extraction, and decoding.
Regression Testing The regression tests ensure that changes to source code do not have
unknown consequences to existing functionality. The regression tests are typically applied to a
larger body of work than unit tests. They are designed to test specific functionality rather than
a specific function. Therefore, regression tests are applied to the actual Moses programs, rather
than tested in isolation.
The regression test framework forms the core of testing within the Moses toolkit. However, it
was created many years ago at the beginning of the Moses project and was only designed to
test the decoder. During the past 6 months, the scope of the regression test framework has been
expanded to test any part of the Moses toolkit, in addition to testing the decoder. The tests are
grouped into the following types:
1. Phrase-based decoder
2. Hierarchical/Syntax decoder
3. Mert
4. Rule Extract
5. Phrase-table scoring
6. Miscellaneous, including domain adaptation features, binarizing phrase tables, parallel
rule extract, and so forth.
The number of tests has increased from 46 in May 2012 to 73 currently.
We have also overhauled the regression tests to make it easier to add new tests. Previously, the
data for the regression tests could only be updated by developers who had access to the web
server at Edinburgh University. This has now been changed so that the data now resides in a
versioned repository on github.com.
This can be accessed and changed by any Moses developer, and is subject to the same checks
and controls as the rest of the Moses source code.
Every Moses developer is obliged to ensure the regression tests are successfully executed before
they commit their changes to the master repository.
Cruise Control This is a daily task run on a server at the University of Edinburgh which
compiles the Moses source code and executes the unit tests and regression tests. Additionally,
it also runs a small training pipeline to completion. The results of this testing are publicly
available online.
This provides an independent check that all unit tests and regression tests passed, and that the
entirety of the SMT pipeline is working. Therefore, it tests not only the Moses toolkit, but also
external tools such as GIZA++ that are essential to Moses and the wider SMT community.
All failures are investigated by the MosesCore team and any remedial action is taken. This is
done to enforce the testing regime and maintain reliability.
The cruise control is a subproject of Moses initiated by Ales Tamchyna with contribution by
Barry Haddow.
Operating-System Compatibility
The Moses toolkit has always strived to be compatible with multiple platforms, particularly
the most popular operating systems used by researchers and commercial users.
Before each release, we make sure that Moses compiles and the unit tests and regression tests
run successfully on various operating systems.
Moses, GIZA++, mgiza, and IRSTLM were compiled for:
1. Linux 32-bit
2. Linux 64-bit
3. Cygwin
4. Mac OSX 10.7 64-bit
Effort was made to make the executables runnable on as many platforms as possible. Therefore,
they were statically linked when possible. Moses was then tested on the following platforms:
1. Windows 7 (32-bit) with Cygwin 6.1
2. Mac OSX 10.7 with MacPorts
3. Ubuntu 12.10, 32 and 64-bit
4. Debian 6.0, 32 and 64-bit
5. Fedora 17, 32 and 64-bit
6. openSUSE 12.2, 32 and 64-bit
All the binary executables are made available for download47 for users who do not wish to
compile their own version.
GIZA++, mgiza, and IRSTLM are also available for download as binaries to enable users to run
the entire SMT pipeline without having to download and compile their own software.
1. IRSTLM was not linked statically. The 64-bit version fails to execute on Debian 6.0. All
other platforms can run the downloaded executables without problem.
2. Mac OSX does not support static linking. Therefore, it is not known whether the executables
will work on platforms other than the one on which they were tested.
3. mgiza compilation failed on Mac OSX with gcc v4.2. It could only be successfully com-
piled with gcc v4.5, available via MacPorts.
End-to-End Testing Before each Moses release, a number of full scale experiments are run.
This is the final test to ensure that the Moses pipeline can run from beginning to end, uninter-
rupted, with "real-world" datasets. The translation quality, as measured by BLEU, is also noted,
to ensure that there is no decrease in performance due to any interaction between components
in the pipeline.
This testing takes approximately 2 weeks to run. The following datasets and experiments are
currently used for end-to-end testing:
Europarl es-en: phrase-based, hierarchical
Europarl en-es: phrase-based, hierarchical
Europarl cs-en: phrase-based, hierarchical
Europarl en-cs: phrase-based, hierarchical
Europarl de-en: phrase-based, hierarchical, factored German POS, factored German+English
Europarl en-de: phrase-based, hierarchical, factored German POS, factored German+English
Europarl fr-en: phrase-based, hierarchical, recased (as opposed to truecased), factored
English POS
Europarl en-fr: phrase-based, hierarchical, recased (as opposed to truecased), factored
English POS
Pre-Made Models The end-to-end tests produce a large number of tuned models. The models,
as well as all configuration and data files, are made available for download48. This is useful
as a template for users setting up their own experimental environment, or for those who just
want the models without running the experiments.
2.4.6 Release 0.91 (12th October, 2012)
The code is available in a branch on github49.
This version was tested on 8 Europarl language pairs, with phrase-based, hierarchical, and
phrase-based factored models. All ran through without major intervention. Known issues:
1. Hierarchical models crash on evaluation when threaded. Strangely, they run OK during
2. EMS bugs when specifying multiple language models
3. Complex factored models not tested
4. Hierarchical models with factors do not work
2.4.7 Status 11th July, 2012
A roundup of the new features that have been implemented in the past year:
1. Lexi Birch’s LR score integrated into tuning. Finished coding: YES. Tested: NO. Docu-
mented: NO. Developer: Hieu, Lexi. First/Main user: Yvette Graham.
2. Asynchronous, batched LM requests for phrase-based models. Finished coding: YES.
Tested: UNKNOWN. Documented: YES. Developer: Oliver Wilson, Miles Osborne. First/Main
user: Miles Osborne.
3. Multithreaded tokenizer. Finished coding: YES. Tested: YES. Documented: NO. Devel-
oper: Pidong Wang.
4. KB Mira. Finished coding: YES. Tested: YES. Documented: YES. Developer: Colin
5. Training & decoding more resilient to non-printing characters and Moses’ reserved char-
acters. Escaping the reserved characters and throwing away lines with non-printing
chars. Finished coding: YES. Tested: YES. Documented: NO. Developer: Philipp Koehn
and Tom Hoar.
6. Simpler installation. Finished coding: YES. Tested: YES. Documented: YES. Developer:
Hieu Hoang. First/Main user: Hieu Hoang.
7. Factors work with chart decoding. Finished coding: YES. Tested: NO. Documented: NO.
Developer: Hieu Hoang. First/Main user: Fabienne Braune.
8. Less IO and disk space needed during training. Everything written directly to gz files.
Finished coding: YES. Tested: YES. Documented: NO. Developer: Hieu. First/Main user:
9. Parallel training. Finished coding: YES. Tested: YES. Documented: YES. Developer:
Hieu. First/Main user: Hieu
10. Adam Lopez’s suffix array integrated into Moses’s training & decoding. Finished coding:
YES. Tested: NO. Documented: YES. Developer: Hieu.
11. Major MERT code cleanup. Finished coding: YES. Tested: NO. Documented: NO. Devel-
oper: Tetsuo Kiso.
12. Wrapper for Berkeley parser (german). Finished coding: YES. Tested: UNKNOWN. Doc-
umented: UNKNOWN. Developer: Philipp Koehn.
13. Option to use p(RHS_t|RHS_s,LHS) or p(LHS,RHS_t|RHS_s), as a grammar rule’s di-
rect translation score. Finished coding: YES. Tested: UNKNOWN. Documented: UN-
KNOWN. Developer: Philip Williams. First/Main user: Philip Williams.
14. Optional PCFG scoring feature for target syntax models. Finished coding: YES. Tested:
UNKNOWN. Documented: UNKNOWN. Developer: Philip Williams. First/Main user:
Philip Williams.
15. Add -snt2cooc option to use mgiza’s reduced memory snt2cooc program. Finished cod-
ing: YES. Tested: YES. Documented: YES. Developer: Hieu Hoang.
16. queryOnDiskPt program. Finished coding: YES. Tested: YES. Documented: NO. Devel-
oper: Hieu Hoang. First/Main user: Daniel Schaut.
17. Output phrase segmentation to n-best when -report-segmentation is used. Finished
coding: YES. Tested: UNKNOWN. Developer: UNKNOWN. First/Main user: Jonathon
18. CDER and WER metric in tuning. Finished coding: UNKNOWN. Tested: UNKNOWN.
Documented: UNKNOWN. Developer: Matous Machacek.
19. Lossy Distributed Hash Table Language Model. Finished coding: UNKNOWN. Tested:
UNKNOWN. Documented: UNKNOWN. Developer: Oliver Wilson.
20. Interpolated scorer for MERT. Finished coding: YES. Tested: UNKNOWN. Documented:
UNKNOWN. Developer: Matous Machacek.
21. IRST LM training integrated into Moses. Finished coding: YES. Tested: YES. Docu-
mented: YES. Developer: Hieu Hoang.
22. GlobalLexiconModel. Finished coding: UNKNOWN. Tested: UNKNOWN. Documented:
UNKNOWN. Developer: Jiri Marsik, Christian Buck and Philipp Koehn.
23. TM Combine (translation model combination). Finished coding: YES. Tested: YES. Doc-
umented: YES. Developer: Rico Sennrich.
24. Alternative to CKY+ for scope-3 grammar. Reimplementation of Hopkins and Langmead
(2010). Finished coding: YES. Tested: UNKNOWN. Documented: UNKNOWN. Devel-
oper: Philip Williams.
25. Sample Java client for Moses server. Finished coding: YES. Tested: NO. Documented:
NO. Developer: Marwen Azouzi. First/Main user: Mailing list users.
26. Support for mgiza, without having to install GIZA++ as well. Finished coding: YES.
Tested: YES. Documented: NO. Developer: Marwen Azouzi.
27. Interpolated language models. Finished coding: YES. Tested: YES. Documented: YES.
Developer: Philipp Koehn.
28. Duplicate removal in MERT. Finished coding: YES. Tested: YES. Documented: NO. De-
veloper: Thomas Schoenemann.
29. Use bjam instead of automake to compile. Finished coding: YES. Tested: YES. Docu-
mented: YES. Developer: Ken Heafield.
30. Recaser train script updated to support IRSTLM as well. Finished coding: YES. Tested:
YES. Documented: YES. Developer: Jehan.
31. extract-ghkm. Finished coding: UNKNOWN. Tested: UNKNOWN. Documented: UN-
KNOWN. Developer: Philip Williams.
32. PRO tuning algorithm. Finished coding: YES. Tested: YES. Documented: YES. Developer:
Philipp Koehn and Barry Haddow.
33. Cruise control. Finished coding: YES. Tested: YES. Documented: YES. Developer: Ales
34. Faster SCFG rule table format. Finished coding: YES. Tested: UNKNOWN. Documented:
NO. Developer: Philip Williams.
35. LM OOV feature. Finished coding: YES. Tested: UNKNOWN. Documented: NO. Devel-
oper: Barry Haddow and Ken Heafield.
36. TER Scorer in MERT. Finished coding: UNKNOWN. Tested: UNKNOWN. Documented:
NO. Developer: Matous Machacek & Christophe Servan.
37. Multi-threading for decoder & MERT. Finished coding: YES. Tested: YES. Documented:
YES. Developer: Barry Haddow et al.
38. Expose n-gram length as part of LM state calculation. Finished coding: YES. Tested: UN-
KNOWN. Documented: NO. Developer: Ken Heafield and Marc Legendre.
39. Changes to chart decoder cube pruning: create one cube per dotted rule instead of one
per translation. Finished coding: YES. Tested: YES. Documented: NO. Developer: Philip
40. Syntactic LM. Finished coding: YES. Tested: YES. Documented: YES. Developer: Lane
41. Czech detokenization. Finished coding: YES. Tested: UNKNOWN. Documented: UN-
KNOWN. Developer: Ondrej Bojar.
2.4.8 Status 13th August, 2010
Changes since the last status report:
1. change or delete character Ø to 0 in extract-rules.cpp (Raphael and Hieu Hoang)
2.4.9 Status 9th August, 2010
Changes since the last status report:
1. Add option of retaining alignment information in the phrase-based phrase table. Decoder
loads this information if present. (Hieu Hoang & Raphael Payen)
2. When extracting rules, if the source or target syntax contains an unsupported escape
sequence (anything other than "&lt;", "&gt;", "&amp;", "&apos;", and "&quot;") then write a warning
message and skip the sentence pair (instead of asserting).
3. Calculate the p-value and confidence intervals not only using BLEU, but also the NIST
score. (Mark Fishel)
4. Dynamic Suffix Arrays (Abby Levenberg)
5. Merge multi-threaded Moses into Moses (Barry Haddow)
6. Continue partial translation (Ondrej Bojar and Ondrej Odchazel)
7. Bug fixes, minor bits & bobs. (Philipp Koehn, Christian Hardmeier, Hieu Hoang, Barry
Haddow, Philip Williams, Ondrej Bojar, Abbey, Mark Fishel, Lane Schwartz, Nicola
Bertoldi, Raphael, ...)
2.4.10 Status 26th April, 2010
Changes since the last status report:
1. Synchronous CFG based decoding, à la Hiero (Chiang 2005), plus versions with syntax,
and all the scripts to go with it. (Thanks to Philip Williams and Hieu Hoang)
2. Cache clearing in IRST LM (Nicola Bertoldi)
3. Factored Language Model. (Ondrej Bojar)
4. Fixes to lattice (Christian Hardmeier, Arianna Bisazza, Suzy Howlett)
5. zmert (Ondrej Bojar)
6. Suffix arrays (Abby Levenberg)
7. Lattice MBR and consensus decoding (Barry Haddow and Abhishek Arun)
8. Simple program that illustrates how to access a phrase table on disk from an external
program (Felipe Sánchez-Martínez)
9. Odds and sods by Raphael Payen and Sara Stymne.
2.4.11 Status 1st April, 2010
Changes since the last status report:
1. Fix for Visual Studio, and potentially other compilers (thanks to Barry, Christian, Hieu)
2. Memory leak in unique n-best fixed (thanks to Barry)
3. Makefile fix for Moses server (thanks to Barry)
2.4.12 Status 26th March, 2010
Changes since the last status report:
1. Minor bug fixes & tweaks, especially to the decoder, MERT scripts (thanks to too many
people to mention)
2. Fixes to make the decoder compile with most versions of gcc, Visual Studio and other
compilers (thanks to Tom Hoar, Jean-Baptiste Fouet).
3. Multi-threaded decoder (thanks to Barry Haddow)
4. Update for IRSTLM (thanks to Nicola Bertoldi and Marcello Federico)
5. Run mert on a subset of features (thanks to Nicola Bertoldi)
6. Training using different alignment models (thanks to Mark Fishel)
7. "A handy script to get many translations from Google" (thanks to Ondrej Bojar)
8. Lattice MBR (thanks to Abhishek Arun and Barry Haddow)
9. Option to compile Moses as a dynamic library (thanks to Jean-Baptiste Fouet).
10. Hierarchical re-ordering model (thanks to Christian Hardmeier, Sara Stymne, Nadi, Mar-
cello, Ankit Srivastava, Gabriele Antonio Musillo, Philip Williams, Barry Haddow).
11. Global Lexical re-ordering model (thanks to Philipp Koehn)
12. Experiment.perl scripts for automating the whole MT pipeline (thanks to Philipp Koehn)
2.5 Work in Progress
Refer to the website.
3.1 Phrase-based Tutorial
This tutorial describes the workings of the phrase-based decoder in Moses, using a simple
model downloadable from the Moses website.
3.1.1 A Simple Translation Model
Let us begin with a look at the toy phrase-based translation model that is available for
download from the Moses website. Unpack the tar ball and enter the directory
sample-models/phrase-model.
The model consists of two files:
phrase-table the phrase translation table, and
moses.ini the configuration file for the decoder.
Let us look at the first line of the phrase translation table (file phrase-table):
der ||| the ||| 0.3 ||| |||
This entry means that the probability of translating the English word the from the German der
is 0.3. Or in mathematical notation: p(the|der) = 0.3. Note that these translation probabilities are
in the inverse order due to the noisy channel model.
The translation tables are the main knowledge source for the machine translation decoder. The
decoder consults these tables to figure out how to translate input in one language into output
in another language.
Being a phrase translation model, the translation tables contain not only single word entries
but also multi-word entries. These are called phrases, but this concept means nothing more than
an arbitrary sequence of words, with no sophisticated linguistic motivation.
Here is an example of a phrase translation entry in phrase-table:
das ist ||| this is ||| 0.8 ||| |||
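The `|||`-delimited format above is easy to process with scripts. Here is a minimal sketch of a parser for it (not part of Moses; the function name is hypothetical, and the field layout is assumed from the two example entries, while real phrase tables may carry extra fields such as word alignments and counts):

```python
def parse_phrase_table_line(line):
    """Split one '|||'-delimited phrase-table line into its fields."""
    fields = [f.strip() for f in line.split("|||")]
    source, target = fields[0], fields[1]
    scores = [float(s) for s in fields[2].split()]
    return source, target, scores

entry = parse_phrase_table_line("das ist ||| this is ||| 0.8 ||| |||")
print(entry)  # ('das ist', 'this is', [0.8])
```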
3.1.2 Running the Decoder
Without further ado, let us run the decoder (it needs to be run from the sample-models
directory):
% echo ’das ist ein kleines haus’ | moses -f phrase-model/moses.ini > out
Defined parameters (per moses.ini or switch):
config: phrase-model/moses.ini
input-factors: 0
lmodel-file: 8 0 3 lm/europarl.srilm.gz
mapping: T 0
n-best-list: nbest.txt 100
ttable-file: 0 0 0 1 phrase-model/phrase-table
ttable-limit: 10
weight-d: 1
weight-l: 1
weight-t: 1
weight-w: 0
Loading lexical distortion models...have 0 models
Start loading LanguageModel lm/europarl.srilm.gz : [0.000] seconds
Loading the LM will be faster if you build a binary file.
Reading lm/europarl.srilm.gz
The ARPA file is missing <unk>. Substituting log10 probability -100.000.
Finished loading LanguageModels : [2.000] seconds
Start loading PhraseTable phrase-model/phrase-table : [2.000] seconds
filePath: phrase-model/phrase-table
Finished loading phrase tables : [2.000] seconds
Start loading phrase table from phrase-model/phrase-table : [2.000] seconds
Reading phrase-model/phrase-table
Finished loading phrase tables : [2.000] seconds
Created input-output object : [2.000] seconds
Translating line 0 in thread id 0
Translating: das ist ein kleines haus
Collecting options took 0.000 seconds
Search took 0.000 seconds
BEST TRANSLATION: this is a small house [11111] [total=-28.923] <<0.000, -5.000, 0.000, -27.091, -1.833>>
Translation took 0.000 seconds
Finished translating
% cat out
this is a small house
Here, the toy model managed to translate the German input sentence das ist ein kleines
haus into the English this is a small house, which is a correct translation.
The decoder is controlled by the configuration file moses.ini. The file used in the example
above is displayed below.
# input factors
[input-factors]
0

# mapping steps, either (T) translation or (G) generation
[mapping]
T 0

[feature]
KENLM name=LM factor=0 order=3 num-features=1 path=lm/europarl.srilm.gz
Distortion
WordPenalty
UnknownWordPenalty
PhraseDictionaryMemory input-factor=0 output-factor=0 path=phrase-model/phrase-table num-features=1 table-limit=10

[weight]
WordPenalty0= 0
LM= 1
Distortion0= 1
PhraseDictionaryMemory0= 1
We will take a look at all the parameters that are specified here (and then some) later. At this
point, let us just note that the translation model files and the language model file are specified
here. In this example, the file names are relative paths, but usually having full paths is better,
so that the decoder does not have to be run from a specific directory.
We just ran the decoder on a single sentence provided on the command line. Usually we want
to translate more than one sentence. In this case, the input sentences are stored in a file, one
sentence per line. This file is piped into the decoder and the output is piped into some output
file for further processing:
% moses -f phrase-model/moses.ini < phrase-model/in > out
3.1.3 Trace
How the decoder works is described in detail in the background (Section 6.1) section. But let
us first develop an intuition by looking under the hood. There are two switches that force the
decoder to reveal more about its inner workings: -report-segmentation and -verbose.
The trace option reveals which phrase translations were used in the best translation found by
the decoder. Running the decoder with the segmentation trace switch (short -t) on the same input
echo ’das ist ein kleines haus’ | moses -f phrase-model/moses.ini -t >out
gives us the extended output
% cat out
this is |0-1| a |2-2| small |3-3| house |4-4|
Each generated English phrase is now annotated with additional information:
this is was generated from the German words 0-1, das ist,
a was generated from the German word 2-2, ein,
small was generated from the German word 3-3, kleines, and
house was generated from the German word 4-4, haus.
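The span annotations follow a regular pattern, so they can be recovered mechanically. A small sketch (not part of Moses; the helper name is hypothetical) that extracts the phrase/span pairs from a traced output line:

```python
import re

def parse_trace(line):
    """Extract (phrase, first, last) triples from '-t' output like
    'this is |0-1| a |2-2| small |3-3| house |4-4|'."""
    return [(phrase.strip(), int(first), int(last))
            for phrase, first, last in re.findall(r"([^|]+)\|(\d+)-(\d+)\|", line)]

print(parse_trace("this is |0-1| a |2-2| small |3-3| house |4-4|"))
# [('this is', 0, 1), ('a', 2, 2), ('small', 3, 3), ('house', 4, 4)]
```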
Note that the German sentence does not have to be translated in sequence. Here is an example
where the English output is reordered:
echo ’ein haus ist das’ | moses -f phrase-model/moses.ini -t -weight-overwrite "Distortion0= 0"
The output of this command is:
this |3-3| is |2-2| a |0-0| house |1-1|
3.1.4 Verbose
Now for the next switch, -verbose (short -v), which displays additional run-time information.
The verbosity of the decoder output comes in three levels. The default is 1. Moving on to -v 2
gives additional statistics for each translated sentence:
% echo ’das ist ein kleines haus’ | moses -f phrase-model/moses.ini -v 2
TRANSLATING(1): das ist ein kleines haus
Total translation options: 12
Total translation options pruned: 0
This is a short summary of how many translation options were used for the translation of this sentence.
Stack sizes: 1, 10, 2, 0, 0, 0
Stack sizes: 1, 10, 27, 6, 0, 0
Stack sizes: 1, 10, 27, 47, 6, 0
Stack sizes: 1, 10, 27, 47, 24, 1
Stack sizes: 1, 10, 27, 47, 24, 3
Stack sizes: 1, 10, 27, 47, 24, 3
The stack sizes after each iteration of the stack decoder. An iteration is the processing of all
hypotheses on one stack: after the first iteration (processing the initial empty hypothesis), 10
hypotheses that cover one German word are placed on stack 1, and 2 hypotheses that cover two
foreign words are placed on stack 2. Note how this relates to the 12 translation options.
total hypotheses generated = 453
number recombined = 69
number pruned = 0
number discarded early = 272
During the beam search a large number of hypotheses are generated (453). Many are discarded
early because they are deemed to be too bad (272) or pruned at some later stage (0), and some
are recombined (69). The remainder survive on the stacks.
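The stack organisation just described can be sketched as a toy decoder. This is illustrative only and not Moses code: the function and option scores are made up, options are enumerated exhaustively, a simple distortion penalty is applied, and recombination and pruning are omitted.

```python
def stack_decode(n, options, w_d=1.0):
    """Toy stack decoder: hypotheses are (covered, last, score, output);
    stacks are indexed by the number of source words covered."""
    stacks = [[] for _ in range(n + 1)]
    stacks[0].append((frozenset(), -1, 0.0, ""))  # empty hypothesis
    for size in range(n):                         # process each stack in turn
        for covered, last, score, out in stacks[size]:
            for (i, j), (phrase, s) in options.items():
                if set(range(i, j + 1)) & covered:
                    continue                      # words already translated
                d = -abs(last + 1 - i)            # distortion penalty
                hyp = (covered | set(range(i, j + 1)), j,
                       score + s + w_d * d, (out + " " + phrase).strip())
                stacks[len(hyp[0])].append(hyp)
    return max(stacks[n], key=lambda h: h[2])     # best final hypothesis

# illustrative translation options for 'das ist ein kleines haus'
options = {(0, 1): ("this is", -10.3), (0, 0): ("the", -5.8),
           (1, 1): ("is", -4.9), (2, 2): ("a", -5.5),
           (3, 3): ("small", -9.7), (4, 4): ("house", -9.3)}
print(stack_decode(5, options)[3])  # this is a small house
```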
total source words = 5
words deleted = 0 ()
words inserted = 0 ()
Some additional information on word deletion and insertion, two advanced options that are
not activated by default.
BEST TRANSLATION: this is a small house [11111] [total=-28.923] <<0.000, -5.000, 0.000, -27.091, -1.833>>
Sentence Decoding Time: : [4.000] seconds
And finally, the translated sentence, its coverage vector (all 5 bits for the 5 German input words
are set), its overall log-probability score, and the breakdown of the score into language model,
reordering model, word penalty and translation model components.
Also, the sentence decoding time is given.
The most verbose output -v 3 provides even more information. In fact, it is so much that we
could not possibly fit it into this tutorial. Run the following command and enjoy:
% echo ’das ist ein kleines haus’ | moses -f phrase-model/moses.ini -v 3
Let us look together at some highlights. The overall translation score is made up from several
components. The decoder reports these components, in our case:
The score component vector looks like this:
0 distortion score
1 word penalty
2 unknown word penalty
3 3-gram LM score, factor-type=0, file=lm/europarl.srilm.gz
4 Translation score, file=phrase-table
Before decoding, the phrase translation table is consulted for possible phrase translations. For
some phrases we find entries, for others we find nothing. Here is an excerpt:
[das ; 0-0]
the , pC=-0.916, c=-5.789
this , pC=-2.303, c=-8.002
it , pC=-2.303, c=-8.076
[das ist ; 0-1]
it is , pC=-1.609, c=-10.207
this is , pC=-0.223, c=-10.291
[ist ; 1-1]
is , pC=0.000, c=-4.922
’s , pC=0.000, c=-6.116
The pair of numbers next to a phrase is the coverage, pC denotes the log of the phrase translation
probability, and after c the future cost estimate for the phrase is given.
Future cost is an estimate of how hard it is to translate different parts of the sentence. After
looking up phrase translation probabilities, future costs are computed for all contiguous spans
over the sentence:
future cost from 0 to 0 is -5.789
future cost from 0 to 1 is -10.207
future cost from 0 to 2 is -15.722
future cost from 0 to 3 is -25.443
future cost from 0 to 4 is -34.709
future cost from 1 to 1 is -4.922
future cost from 1 to 2 is -10.437
future cost from 1 to 3 is -20.158
future cost from 1 to 4 is -29.425
future cost from 2 to 2 is -5.515
future cost from 2 to 3 is -15.236
future cost from 2 to 4 is -24.502
future cost from 3 to 3 is -9.721
future cost from 3 to 4 is -18.987
future cost from 4 to 4 is -9.266
Some parts of the sentence are easier to translate than others. For instance the estimate for
translating the first two words (0-1: das ist) is deemed to be cheaper (-10.207) than the last
two (3-4: kleines haus, -18.987). Again, the negative numbers are log-probabilities.
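The future cost table can be computed with a simple dynamic program: the estimate for a span is the best of translating it with a single phrase, or combining the estimates of two smaller spans. The sketch below is illustrative rather than Moses code (the function name is hypothetical, and the inputs are the rounded per-phrase c estimates from the excerpt above, so the results match the table up to rounding):

```python
def future_costs(n, phrase_cost):
    """phrase_cost maps (i, j) source spans to the best single-phrase
    estimate (a log-probability); returns the table of span estimates."""
    NEG = float("-inf")
    fc = [[NEG] * n for _ in range(n)]
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            best = phrase_cost.get((i, j), NEG)   # direct single-phrase option
            for k in range(i, j):                 # or combine two smaller spans
                best = max(best, fc[i][k] + fc[k + 1][j])
            fc[i][j] = best
    return fc

# best per-phrase estimates (c values) from the excerpt above
costs = {(0, 0): -5.789, (1, 1): -4.922, (2, 2): -5.515,
         (3, 3): -9.721, (4, 4): -9.266, (0, 1): -10.207}
fc = future_costs(5, costs)
print(round(fc[0][4], 3))  # -34.709, matching the table above
```

Note how the two-word phrase das ist (-10.207) beats combining das (-5.789) and ist (-4.922), which is exactly why the span 0-1 is estimated as -10.207 rather than -10.711.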
After all this preparation, we start to create partial translations by translating a phrase at a time.
The first hypothesis is generated by translating the first German word as the:
creating hypothesis 1 from 0 ( <s> )
base score 0.000
covering 0-0: das
translated as: the
score -2.951 + future cost -29.425 = -32.375
unweighted feature scores: <<0.000, -1.000, 0.000, -2.034, -0.916>>
added hyp to stack, best on stack, now size 1
Here, starting with the empty initial hypothesis 0, a new hypothesis (id=1) is created. Starting
from zero cost (base score), translating the phrase das into the carries translation cost (-0.916),
distortion or reordering cost (0), language model cost (-2.034), and word penalty (-1). Recall
that the score component information is printed out earlier, so we are able to interpret the
feature scores.
Overall, a weighted log-probability cost of -2.951 is accumulated. Together with the future cost
estimate for the remaining part of the sentence (-29.425), this hypothesis is assigned a score of
-32.375.
And so it continues, for a total of 453 created hypotheses. At the end, the best scoring final
hypothesis is found and the hypothesis graph is traversed backwards to retrieve the best translation:
Best path: 417 <= 285 <= 163 <= 5 <= 0
Confused enough yet? Before we get caught up too much in the intricate details of the inner
workings of the decoder, let us return to actually using it. Much of what has just been said will
become clearer after reading the background (Section 6.1) information.
3.1.5 Tuning for Quality
The key to good translation performance is having a good phrase translation table. But some
tuning can be done with the decoder. The most important is the tuning of the model parameters.
The probability cost that is assigned to a translation is a product of the probability costs of four
models:
phrase translation table,
language model,
reordering model, and
word penalty.
Each of these models contributes information about one aspect of the characteristics of a good translation:
The phrase translation table ensures that the English phrases and the German phrases
are good translations of each other.
The language model ensures that the output is fluent English.
The distortion model allows for reordering of the input sentence, but at a cost: The more
reordering, the more expensive is the translation.
The word penalty ensures that the translations do not get too long or too short.
Each of the components can be given a weight that sets its importance. Mathematically, the
cost of translation is:
p(e|f) = φ(f|e)^weight-t × LM(e)^weight-l × D(e,f)^weight-d × W(e)^weight-w    (3.1)
The probability p(e|f) of the English translation e given the foreign input f is broken up into
four models: phrase translation φ(f|e), language model LM(e), distortion model D(e,f), and
word penalty W(e) = exp(length(e)). Each of the four models is weighted by a weight.
The weighting is provided to the decoder with the four parameters weight-t, weight-l, weight-d,
and weight-w. The default setting for these weights is 1, 1, 1, and 0. These are also the values
in the configuration file moses.ini.
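In log space, equation (3.1) becomes a weighted sum of feature log-scores, which is how the decoder actually accumulates costs. A sketch with illustrative values taken from the hypothesis trace shown earlier (the component order follows the score component vector printed by -v 3, and the weight of 1 for the unknown word penalty is an assumption, since only the other four weights are listed above):

```python
def weighted_cost(feature_scores, weights):
    """Weighted model combination in log space: a dot product of
    feature log-scores and their weights."""
    return sum(w * s for w, s in zip(weights, feature_scores))

# component order from the trace: distortion, word penalty,
# unknown word penalty, LM, translation
scores  = [0.0, -1.0, 0.0, -2.034, -0.916]
weights = [1.0,  0.0, 1.0,  1.0,    1.0]   # defaults: weight-d/-l/-t = 1, weight-w = 0
print(round(weighted_cost(scores, weights), 3))  # -2.95, as in the trace
```

Note that with the default word penalty weight of 0, the -1 word penalty score does not contribute, so the hypothesis cost comes almost entirely from the language model and translation scores.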
Setting these weights to the right values can improve translation quality. We already sneaked
in one example above. When translating the German sentence ein haus ist das, we set the
distortion weight to 0 to get the right translation:
% echo ’ein haus ist das’ | moses -f phrase-model/moses.ini -d 0
this is a house
With the default weights, the translation comes out wrong:
% echo ’ein haus ist das’ | moses -f phrase-model/moses.ini
a house is the
What the right weight setting is depends on the corpus and the language pair. Usually, a held-out
development set is used to optimize the parameter settings. The simplest method is
to try out a large number of possible settings and pick what works best. Good values
for the weights for the phrase translation table (weight-t, short tm), language model (weight-l,
short lm), and reordering model (weight-d, short d) are 0.1-1; good values for the word penalty
(weight-w, short w) are -3 to 3. Negative values for the word penalty favor longer output, positive
values favor shorter output.
3.1.6 Tuning for Speed
Let us now look at some additional parameters that help to speed up the decoder. Unfortu-
nately higher speed usually comes at cost of translation quality. The speed-ups are achieved by
limiting the search space of the decoder. By cutting out part of the search space, we may not be
able to find the best translation anymore.
Translation Table Size
One strategy to limit the search space is to reduce the number of translation options used for
each input phrase, i.e. the number of phrase translation table entries that are retrieved. While
in the toy example the translation tables are very small, they can have thousands of entries
per phrase in a realistic scenario. If the phrase translation table is learned from real data, it
contains a lot of noise. So, we are really interested only in the most probable ones and would
like to eliminate the others.
There are two ways to limit the translation table size: by a fixed limit on how many translation
options are retrieved for each input phrase, and by a probability threshold that specifies that
the phrase translation probability has to be above some value.
Compare the statistics and the translation output for our toy model, when no translation table
limit is used
% echo ’das ist ein kleines haus’ | moses -f phrase-model/moses.ini -ttable-limit 0 -v 2
Total translation options: 12
total hypotheses generated = 453
number recombined = 69
number pruned = 0
number discarded early = 272
BEST TRANSLATION: this is a small house [11111] [total=-28.923]
with the statistics and translation output, when a limit of 1 is used
% echo ’das ist ein kleines haus’ | moses -f phrase-model/moses.ini -ttable-limit 1 -v 2
Total translation options: 6
total hypotheses generated = 127
number recombined = 8
number pruned = 0
number discarded early = 61
BEST TRANSLATION: it is a small house [11111] [total=-30.327]
Reducing the number of translation options to only one per phrase had a number of effects:
(1) Overall, only 6 translation options instead of 12 were collected. (2) The
number of generated hypotheses fell from 453 to 127, and no hypotheses were pruned out. (3)
The translation changed, and the output now has a lower log-probability: -30.327 vs. -28.923.
Hypothesis Stack Size (Beam)
A different way to reduce the search is to reduce the size of the hypothesis stacks. For each
number of foreign words translated, the decoder keeps a stack of the best (partial) translations.
By reducing this stack size, the search will be quicker, since fewer hypotheses are kept at each
stage, and therefore fewer hypotheses are generated. This is explained in more detail on the
Background (Section 6.1) page.
From a user perspective, search speed is linear in the maximum stack size. Compare the
following system runs with stack sizes 1000, 100 (the default), 10, and 1:
% echo ’das ist ein kleines haus’ | moses -f phrase-model/moses.ini -v 2 -s 1000
total hypotheses generated = 453
number recombined = 69
number pruned = 0
number discarded early = 272
BEST TRANSLATION: this is a small house [11111] [total=-28.923]
% echo ’das ist ein kleines haus’ | moses -f phrase-model/moses.ini -v 2 -s 100
total hypotheses generated = 453
number recombined = 69
number pruned = 0
number discarded early = 272
BEST TRANSLATION: this is a small house [11111] [total=-28.923]
% echo ’das ist ein kleines haus’ | moses -f phrase-model/moses.ini -v 2 -s 10
total hypotheses generated = 208
number recombined = 23
number pruned = 42
number discarded early = 103
BEST TRANSLATION: this is a small house [11111] [total=-28.923]
% echo ’das ist ein kleines haus’ | moses -f phrase-model/moses.ini -v 2 -s 1
total hypotheses generated = 29
number recombined = 0
number pruned = 4
number discarded early = 19
BEST TRANSLATION: this is a little house [11111] [total=-30.991]
Note that the number of hypotheses generated gets smaller with the stack size: 453,
453, 208, and 29.
As we have previously described with translation table pruning, we may also want to use the
relative scores of hypotheses for pruning instead of a fixed limit. The two strategies are also
called histogram pruning and threshold pruning.
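The two strategies can be sketched as follows. This is illustrative code, not the decoder's implementation; the function names are hypothetical, scores are log-probabilities (higher is better), and the beam is a probability ratio compared in log space:

```python
import math

def histogram_prune(stack, stack_size):
    """Keep only the stack_size best-scoring hypotheses."""
    return sorted(stack, key=lambda h: h[0], reverse=True)[:stack_size]

def threshold_prune(stack, beam):
    """Keep hypotheses whose probability is within a factor `beam`
    of the best hypothesis on the stack."""
    best = max(score for score, _ in stack)
    cutoff = best + math.log(beam)
    return [(score, hyp) for score, hyp in stack if score >= cutoff]

stack = [(-28.923, "this is ..."), (-30.991, "it is ..."), (-35.0, "a house ...")]
print(histogram_prune(stack, 2))
print(threshold_prune(stack, 0.1))  # keep hypotheses within factor 0.1 of the best
```

Histogram pruning bounds the stack at a fixed size regardless of scores, while threshold pruning keeps a variable number of hypotheses depending on how closely they trail the best one.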
Here are some experiments that show the effects of different stack size and beam size limits.
% echo ’das ist ein kleines haus’ | moses -f phrase-model/moses.ini -v 2 -s 100 -b 0
total hypotheses generated = 1073
number recombined = 720
number pruned = 73
number discarded early = 0
% echo ’das ist ein kleines haus’ | moses -f phrase-model/moses.ini -v 2 -s 1000 -b 0
total hypotheses generated = 1352
number recombined = 985
number pruned = 0
number discarded early = 0
% echo ’das ist ein kleines haus’ | moses -f phrase-model/moses.ini -v 2 -s 1000 -b 0.1
total hypotheses generated = 45
number recombined = 3
number pruned = 0
number discarded early = 32
In the second example no pruning takes place, which means an exhaustive search is performed.
With small stack sizes or small thresholds we risk search errors, meaning the generation of
translations that score worse than the best translation according to the model.
In this toy example, a worse translation is only generated with a stack size of 1. Again, by
worse translation, we mean worse scoring according to our model (-30.991 vs. -28.923). Whether
it is actually a worse translation in terms of translation quality is another question. However,
the task of the decoder is to find the best-scoring translation. If worse-scoring translations are
of better quality, then this is a problem of the model, and should be resolved by better modeling.
3.1.7 Limit on Distortion (Reordering)
The basic reordering model implemented in the decoder is fairly weak. Reordering cost is
measured by the number of words skipped when foreign phrases are picked out of order.
Total reordering cost is computed as D(e,f) = -Σ_i d_i, where the distortion d_i for each phrase i is defined as
d_i = abs( last word position of previously translated phrase + 1 - first word position of newly
translated phrase ).
This is illustrated by the following graph:
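The computation can also be sketched in code (illustrative Python, not Moses code; spans are hypothetical (first, last) input word positions listed in translation order):

```python
# Sketch of the distortion cost D(e,f): each translated phrase is given by
# its input span (first, last), in the order in which phrases are translated.

def distortion_cost(spans):
    total = 0
    prev_last = -1  # position before the sentence start
    for first, last in spans:
        d = abs(prev_last + 1 - first)  # number of input words skipped
        total -= d
        prev_last = last
    return total

# Monotone translation incurs no cost.
print(distortion_cost([(0, 1), (2, 2), (3, 4)]))  # 0
# Translating the third phrase before the second incurs a penalty.
print(distortion_cost([(0, 1), (3, 4), (2, 2)]))  # -4
```
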
This reordering model is suitable for local reorderings: they are discouraged, but may occur
with sufficient support from the language model. Large-scale reorderings, however, are often arbitrary
and affect translation performance negatively.
By limiting reordering, we can not only speed up the decoder but often also improve translation
performance. Reordering can be limited to a maximum number of words skipped (maximum d)
with the switch -distortion-limit, or -dl for short.
Setting this parameter to 0 means monotone translation (no reordering). If you want to allow
unlimited reordering, use the value -1.
Subsection last modified on June 21, 2014, at 08:16 PM
3.2 Tutorial for Using Factored Models
Note: There may be some discrepancies between this description and the actual workings of
the training script.
- Train an unfactored model (Section 3.2.1)
- Train a model with POS tags (Section 3.2.2)
- Train a model with generation and translation steps (Section 3.2.3)
- Train a morphological analysis and generation model (Section 3.2.4)
- Train a model with multiple decoding paths (Section 3.2.5)
To work through this tutorial, you first need to have the data in place. The instructions also
assume that you have the training script and the decoder in your executable path.
You can obtain the data as follows:
tar xzf factored-corpus.tgz
For more information on the training script, check the documentation, which is linked to on
the right navigation column under "Training".
3.2.1 Train an unfactored model
The corpus package contains language models and parallel corpora with POS and lemma factors.
Before playing with factored models, let us start with training a traditional phrase-based model:
% train-model.perl \
--root-dir unfactored \
--corpus factored-corpus/proj-syndicate \
--f de --e en \
--lm 0:3:factored-corpus/surface.lm \
--external-bin-dir .../tools \
--input-factor-max 4
This creates a phrase-based model in the directory unfactored/model in about 20 minutes (on
a 2.8 GHz machine). For a quicker training run that only takes a few minutes (with much worse
results), use just the first 1000 sentence pairs of the corpus, contained in factored-corpus/proj-syndicate.1000.
% train-model.perl \
--root-dir unfactored \
--corpus factored-corpus/proj-syndicate.1000 \
--f de --e en \
--lm 0:3:factored-corpus/surface.lm \
--external-bin-dir .../tools \
--input-factor-max 4
This creates a typical phrase-based model, as specified in the created configuration file moses.ini.
Here the part of the file that points to the phrase table:
PhraseDictionaryMemory ... path=/.../phrase-table.gz ...
You can take a look at the generated phrase table, which starts as usual with rubbish but then
occasionally contains some nice entries. The scores ensure that during decoding the good
entries are preferred.
! ||| ! ||| 1 1 1 1 2.718
" ( ||| " ( ||| 1 0.856401 1 0.779352 2.718
" ) , ein neuer film ||| " a new film ||| 1 0.0038467 1 0.128157 2.718
" ) , ein neuer film über ||| " a new film about ||| 1 0.000831718 1 0.0170876 2.71
frage ||| issue ||| 0.25 0.285714 0.25 0.166667 2.718
frage ||| question ||| 0.75 0.555556 0.75 0.416667 2.718
3.2.2 Train a model with POS tags
Take a look at the training data. Each word is not only represented by its surface form (as you
would expect in raw text), but also with additional factors.
% tail -n 1 factored-corpus/proj-syndicate.??
==> factored-corpus/ <==
korruption|korruption|nn| floriert|florieren|vvfin|vvfin .|.|per|per
==> factored-corpus/proj-syndicate.en <==
corruption|corruption|nn flourishes|flourish|nns .|.|.
The German factors are
- surface form,
- part of speech, and
- part of speech with additional morphological information.
The English factors are
- surface form,
- lemma, and
- part of speech.
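The factored representation is plain text and easy to inspect programmatically. A minimal parsing sketch (illustrative Python, not part of Moses; the sample line is made up following the English factor order above):

```python
# Split a factored sentence into per-word factor lists. Factor indices
# follow the English side of the tutorial: 0 surface, 1 lemma, 2 POS.

def parse_factored(sentence, sep="|"):
    return [token.split(sep) for token in sentence.split()]

line = "corruption|corruption|nn flourishes|flourish|vbz .|.|."
for surface, lemma, pos in parse_factored(line):
    print(surface, lemma, pos)
```
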
Let us start simple and build a translation model that adds only the target part-of-speech factor
on the output side:
% train-model.perl \
--root-dir pos \
--corpus factored-corpus/proj-syndicate.1000 \
--f de --e en \
--lm 0:3:factored-corpus/surface.lm \
--lm 2:3:factored-corpus/pos.lm \
--translation-factors 0-0,2 \
--external-bin-dir .../tools
Here, we specify with --translation-factors 0-0,2 that the input factor for the translation
table is the (0) surface form, and the output factor is (0) surface form and (2) part of speech.
PhraseDictionaryMemory ... input-factor=0 output-factor=0,2
The resulting phrase table looks very similar, but now also contains part-of-speech tags on the
English side:
! ||| !|. ||| 1 1 1 1 2.718
" ( ||| "|" (|( ||| 1 0.856401 1 0.779352 2.718
" ) , ein neuer film ||| "|" a|dt new|jj film|nn ||| 1 0.00403191 1 0.128157 2.718
" ) , ein neuer film über ||| "|" a|dt new|jj film|nn about|in ||| 1 0.000871765 1 0.0170876 2.718
frage ||| issue|nn ||| 0.25 0.285714 0.25 0.166667 2.718
frage ||| question|nn ||| 0.75 0.625 0.75 0.416667 2.718
We also specified two language models. Besides the regular language model based on surface
forms, we have a second language model that is trained on POS tags. In the configuration file
this is indicated by two lines in the LM section:
KENLM name=LM0 ...
KENLM name=LM1 ...
Also, two language model weights are specified:
LM0= 0.5
LM1= 0.5
The part-of-speech language model includes preferences such as that determiner-adjective is
likely followed by a noun, and less likely by a determiner:
-0.192859 dt jj nn
-2.952967 dt jj dt
This model can be used just like normal phrase based models:
% echo ’putin beschreibt menschen .’ > in
% moses -f pos/model/moses.ini < in
BEST TRANSLATION: putin|nnp describes|vbz people|nns .|. [1111] [total=-6.049]
<<0.000, -4.000, 0.000, -29.403, -11.731, -0.589, -1.303, -0.379, -0.556, 4.000>>
During the decoding process, not only words (putin) but also part-of-speech tags are generated.
Let's take a look at what happens if we input a German sentence that starts with the object:
% echo ’menschen beschreibt putin .’ > in
% moses -f pos/model/moses.ini < in
BEST TRANSLATION: people|nns describes|vbz putin|nnp .|. [1111] [total=-8.030]
<<0.000, -4.000, 0.000, -31.289, -17.770, -0.589, -1.303, -0.379, -0.556, 4.000>>
Now, this is not a very good translation. The model's aversion to reordering trumps its
ability to come up with a good translation. If we downweight the reordering model, we get a
better translation:
% moses -f pos/model/moses.ini < in -d 0.2
BEST TRANSLATION: putin|nnp describes|vbz people|nns .|. [1111] [total=-7.649]
<<-8.000, -4.000, 0.000, -29.403, -11.731, -0.589, -1.303, -0.379, -0.556, 4.000>>
Note that this better translation is mostly driven by the part-of-speech language model, which
prefers the sequence nnp vbz nns . (-11.731) over the sequence nns vbz nnp . (-17.770).
The surface form language model only shows a slight preference (-29.403 vs. -31.289). This
is because these words have not been seen next to each other before, so the language model
has very little to work with. The part-of-speech language model is aware of the grammatical
number of the nouns involved and prefers a singular noun before a singular verb (nnp vbz)
over a plural noun before a singular verb (nns vbz).
To drive this point home, the unfactored model is not able to find the right translation, even
with downweighted reordering model:
% moses -f unfactored/model/moses.ini < in -d 0.2
people describes putin . [1111] [total=-11.410]
<<0.000, -4.000, 0.000, -31.289, -0.589, -1.303, -0.379, -0.556, 4.000>>
3.2.3 Train a model with generation and translation steps
Let us now train a slightly different factored model with the same factors. Instead of mapping
from the German input surface form directly to the English output surface form and part of
speech, we now break this up into two mapping steps, one translation step that maps surface
forms to surface forms, and a second step that generates the part of speech from the surface
form on the output side:
% train-model.perl \
--root-dir pos-decomposed \
--corpus factored-corpus/proj-syndicate.1000 \
--f de --e en \
--lm 0:3:factored-corpus/surface.lm \
--lm 2:3:factored-corpus/pos.lm \
--translation-factors 0-0 \
--generation-factors 0-2 \
--decoding-steps t0,g0 \
--external-bin-dir .../tools
Now, the translation step is specified only between surface forms (--translation-factors
0-0) and a generation step is specified (--generation-factors 0-2), mapping (0) surface form
to (2) part of speech. We also need to specify the order in which the mapping steps are applied
(--decoding-steps t0,g0).
Besides the phrase table that has the same format as the unfactored phrase table, we now also
have a generation table. It is referenced in the configuration file:
Generation ... input-factor=0 output-factor=2
GenerationModel0= 0.3 0
Let us take a look at the generation table:
% more pos-decomposed/model/generation.0-2
nigerian nnp 1.0000000 0.0008163
proven vbn 1.0000000 0.0021142
issue nn 1.0000000 0.0021591
control vb 0.1666667 0.0014451
control nn 0.8333333 0.0017992
The beginning is not very interesting. Like most words, nigerian, proven, and issue occur
with only one part of speech, e.g., p(nnp|nigerian) = 1.0000000. Some words, however, such as
control, occur with multiple parts of speech, such as base form verb (vb) and singular noun (nn).
The table also contains the reverse probability p(nigerian|nnp) = 0.0008163. In our
example, this may not be a very useful feature. It basically hurts open class words, especially
unusual ones. If we do not want this feature, we can also train the generation model as
single-featured with the switch --generation-type single.
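The numbers in the generation table are consistent with simple relative-frequency estimation. Here is a sketch under that assumption (illustrative Python with made-up counts, not the training script's code):

```python
from collections import Counter

# Hypothetical (word, POS) observations from a factored corpus.
pairs = [("control", "vb")] + [("control", "nn")] * 5 + [("issue", "nn")] * 3

word_count = Counter(w for w, _ in pairs)
pos_count = Counter(p for _, p in pairs)
pair_count = Counter(pairs)

def p_pos_given_word(pos, word):
    """Generation probability, e.g. p(nn|control)."""
    return pair_count[(word, pos)] / word_count[word]

def p_word_given_pos(word, pos):
    """Reverse probability, e.g. p(control|nn)."""
    return pair_count[(word, pos)] / pos_count[pos]

print(p_pos_given_word("nn", "control"))  # 5/6
```
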
3.2.4 Train a morphological analysis and generation model
Translating surface forms seems to be a somewhat questionable pursuit. It does not seem
to make much sense to treat different word forms of the same lemma, such as mensch and
menschen, differently. In the worst case, we will have seen only one of the word forms, so we
are not able to translate the other. This is in fact what happens in this example:
% echo ’ein mensch beschreibt putin .’ > in
% moses.1430.srilm -f unfactored/model/moses.ini < in
a mensch|UNK|UNK|UNK describes putin . [11111] [total=-158.818]
<<0.000, -5.000, -100.000, -127.565, -1.350, -1.871, -0.301, -0.652, 4.000>>
Factored translation models allow us to create models that do morphological analysis and
decomposition during the translation process. Let us now train such a model:
% train-model.perl \
--root-dir morphgen \
--corpus factored-corpus/proj-syndicate.1000 \
--f de --e en \
--lm 0:3:factored-corpus/surface.lm \
--lm 2:3:factored-corpus/pos.lm \
--translation-factors 1-1+3-2 \
--generation-factors 1-2+1,2-0 \
--decoding-steps t0,g0,t1,g1 \
--external-bin-dir .../tools
We have a total of four mapping steps:
- a translation step that maps lemmas (1-1),
- a generation step that sets possible part-of-speech tags for a lemma (1-2),
- a translation step that maps morphological information to part-of-speech tags (3-2), and
- a generation step that maps part-of-speech tag and lemma to a surface form (1,2-0).
This enables us now to translate the sentence above:
% echo ’ein|ein|art|art.indef.z mensch|mensch|nn| \
beschreibte|beschreiben|vvfin|vvfin putin|putin|nn| \
.|.|per|per’ > in
% moses -f morphgen/model/moses.ini < in
BEST TRANSLATION: a|a|dt individual|individual|nn describes|describe|vbz \
putin|putin|nnp .|.|. [11111] [total=-17.269]
<<0.000, -5.000, 0.000, -38.631, -13.357, -2.773, -21.024, 0.000, -1.386, \
-1.796, -4.341, -3.189, -4.630, 4.999, -13.478, -14.079, -4.911, -5.774, 4.999>>
Note that this is only possible because we have seen an appropriate word form in the output
language. The word individual occurs as a singular noun in the parallel corpus, as a translation
of einzelnen. To overcome this limitation, we may train generation models on large monolingual
corpora, where we expect to see all possible word forms.
3.2.5 Train a model with multiple decoding paths
Decomposing translation into a process of morphological analysis and generation will make
our translation model more robust. However, if we have seen a phrase of surface forms before,
it may be better to take advantage of such rich evidence.
The above model translates sentences poorly, as it does not use the source surface form at all,
relying instead on translating the properties of the surface forms.
In practice, we fare better when we allow both ways to translate in parallel. Such a model
is trained by the introduction of decoding paths. In our example, one decoding path is the
morphological analysis and generation as above, the other path the direct mapping of surface
forms to surface forms (and part-of-speech tags, since we are using a part-of-speech tag lan-
guage model):
% train-model.perl \
--corpus factored-corpus/proj-syndicate.1000 \
--root-dir morphgen-backoff \
--f de --e en \
--lm 0:3:factored-corpus/surface.lm \
--lm 2:3:factored-corpus/pos.lm \
--translation-factors 1-1+3-2+0-0,2 \
--generation-factors 1-2+1,2-0 \
--decoding-steps t0,g0,t1,g1:t2 \
--external-bin-dir .../tools
This command is almost identical to the previous training run, except for the additional trans-
lation table 0-0,2 and its inclusion as a different decoding path :t2.
A strategy for translating surface forms that have not been seen in the training corpus is to
translate their lemma instead. This is especially useful for translation from morphologically rich
languages to simpler languages, such as German-to-English translation.
% train-model.perl \
--corpus factored-corpus/proj-syndicate.1000 \
--root-dir lemma-backoff \
--f de --e en \
--lm 0:3:factored-corpus/surface.lm \
--lm 2:3:factored-corpus/pos.lm \
--translation-factors 0-0,2+1-0,2 \
--decoding-steps t0:t1 \
--external-bin-dir .../tools
Subsection last modified on May 29, 2016, at 10:02 PM
3.3 Syntax Tutorial
24 And the people murmured against Moses, saying, What shall we drink?
25 And he cried unto the Lord; and the Lord showed him a tree, which when he had cast into the waters,
the waters were made sweet.
Exodus 15, 24-25
Moses supports models that have become known as hierarchical phrase-based models and syntax-
based models. These models use a grammar consisting of SCFG (Synchronous Context-Free
Grammar) rules. In the following, we refer to these models as tree-based models.
3.3.1 Tree-Based Models
Traditional phrase-based models have as atomic translation step the mapping of an input
phrase to an output phrase. Tree-based models operate on so-called grammar rules, which
include variables in the mapping rules:
ne X1 pas -> not X1 (French-English)
ate X1 -> habe X1 gegessen (English-German)
X1 of the X2 -> le X2 X1 (English-French)
The variables in these grammar rules are called non-terminals, since their occurrence indicates
that the process has not yet terminated to produce the final words (the terminals). Besides a
generic non-terminal X, linguistically motivated non-terminals such as NP (noun phrase) or VP
(verb phrase) may be used as well in a grammar (or translation rule set).
We call these models tree-based, because during the translation a data structure is created that
is called a tree. To fully make this point, consider the following input and translation rules:
Input: Das Tor geht schnell auf
Rules: Das Tor -> The door
schnell -> quickly
geht X1 auf -> opens X1
X1 X2 -> X1 X2
When applying these rules in the given order, we produce the translation The door opens quickly
in the following fashion:
First the simple phrase mappings (1) Das Tor to The door and (2) schnell to quickly are
carried out. This allows for the application of the more complex rule (3) geht X1 auf to opens
X1. Note that at this point, the non-terminal X1, which covers the input span over schnell, is
replaced by a known translation quickly. Finally, the glue rule (4) X1 X2 to X1 X2 combines the
two fragments into a complete sentence.
Here is how the spans over the input words are getting filled in:
|4 ---- The door opens quickly ---- |
| |3 --- opens quickly --- |
|1 The door | |2 quickly | |
| Das | Tor | geht | schnell | auf |
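The same derivation can be mimicked with plain string substitution (a toy Python sketch of how translated fragments fill the non-terminal slots of larger rules, not how the chart decoder actually works):

```python
# Each rule application replaces matched input words; X1 and X2 stand for
# spans that have already been translated.

das_tor = "The door"                           # rule 1: Das Tor -> The door
schnell = "quickly"                            # rule 2: schnell -> quickly
geht_auf = "opens {X1}".format(X1=schnell)     # rule 3: geht X1 auf -> opens X1
sentence = "{X1} {X2}".format(X1=das_tor, X2=geht_auf)  # rule 4: glue rule
print(sentence)  # The door opens quickly
```
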
Formally, such context-free grammars are more constrained than the formalism for phrase-based
models. In practice, however, phrase-based models use a reordering limit, which leads to linear
decoding time. For tree-based models, decoding is not linear with respect to sentence length,
unless reordering limits are used.
Current research in tree-based models is driven by the expectation of building translation models
that more closely model the underlying linguistic structure of language, and its essential element:
recursion. This is an active field of research.
A Word on Terminology
You may have read in the literature about hierarchical phrase-based, string-to-tree, tree-to-string,
tree-to-tree, target-syntactified, syntax-augmented, syntax-directed, syntax-based, grammar-based,
etc., models in statistical machine translation. What do the tree-based models support?
All of the above.
The avalanche of terminology stems partly from the need of researchers to carve out their own
niche, partly from the fact that work in this area has not yet fully settled on an agreed framework,
but also from a fundamental difference. As we already pointed out, the motivation for tree-based
models comes from linguistic theories and their syntax trees. So, when we build a data structure
called a tree (as computer scientists call it), do we mean that we build a linguistic syntax tree (as
linguists call it)?
Not always, and hence the confusion. In all our examples above we used a single non-terminal
X, so not many will claim that the result is a proper linguistic syntax with its noun phrases NP,
verb phrases VP, and so on. To distinguish models that use proper linguistic syntax on the input
side, on the output side, on both, or on neither, all this terminology has been invented.
Let’s decipher common terms found in the literature:
- hierarchical phrase-based: no linguistic syntax,
- string-to-tree: linguistic syntax only in the output language,
- tree-to-string: linguistic syntax only in the input language,
- tree-to-tree: linguistic syntax in both languages,
- target-syntactified: linguistic syntax only in the output language,
- syntax-augmented: linguistic syntax only in the output language,
- syntax-directed: linguistic syntax only in the input language,
- syntax-based: unclear; we use it for models that have any linguistic syntax, and
- grammar-based: wait, what?
In this tutorial, we refer to un-annotated trees as trees, and to trees with syntactic annotation
as syntax. So a so-called string-to-tree model is here called a target-syntax model.
Chart Decoding
Phrase-based decoding generates a sentence from left to right, by adding phrases to the end of
a partial translation. Tree-based decoding builds a chart, which consists of partial translations
for all possible spans over the input sentence.
Currently Moses implements a CKY+ algorithm for an arbitrary number of non-terminals per rule
and an arbitrary number of types of non-terminals in the grammar.
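To illustrate the idea of chart decoding, here is a toy CKY recognizer (illustrative Python with a hypothetical monolingual grammar; the real decoder works with synchronous rules, scores, and cube pruning):

```python
# Toy CKY: fill a chart cell for every span of the input with the set of
# non-terminals that can cover that span.

def cky(words, lexical, binary):
    n = len(words)
    chart = {}  # (start, end) -> set of non-terminals covering that span
    for i, w in enumerate(words):
        chart[(i, i + 1)] = {nt for nt, word in lexical if word == w}
    for length in range(2, n + 1):
        for start in range(0, n - length + 1):
            end = start + length
            cell = set()
            for mid in range(start + 1, end):  # all ways to split the span
                for parent, left, right in binary:
                    if left in chart[(start, mid)] and right in chart[(mid, end)]:
                        cell.add(parent)
            chart[(start, end)] = cell
    return chart

lexical = [("NP", "this"), ("V", "is"), ("DT", "a"), ("NN", "house")]
binary = [("NP", "DT", "NN"), ("VP", "V", "NP"), ("S", "NP", "VP")]
chart = cky("this is a house".split(), lexical, binary)
print(chart[(0, 4)])  # the full span is recognized as a sentence: {'S'}
```
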
3.3.2 Decoding
We assume that you have already installed the chart decoder, as described in the Get Started
section. You can find an example model for the decoder on the Moses web site. Unpack the tar ball
and enter the directory sample-models:
% wget
% tar xzf sample-models.tgz
% cd sample-models/string-to-tree
The decoder is called just as for phrase models:
% echo ’das ist ein haus’ | moses_chart -f moses.ini > out
% cat out
this is a house
What happened here?
Using the option -T we can get some insight into how the translation was assembled:
41 X TOP -> <s> S </s> (1,1) [0..5] -3.593 <<0.000, -2.606, -9.711, 2.526>> 20
20 X S -> NP V NP (0,0) (1,1) (2,2) [1..4] -1.988 <<0.000, -1.737, -6.501, 2.526>> 3 5 11
3 X NP -> this [1..1] 0.486 <<0.000, -0.434, -1.330, 2.303>>
5 X V -> is [2..2] -1.267 <<0.000, -0.434, -2.533, 0.000>>
11 X NP -> DT NN (0,0) (1,1) [3..4] -2.698 <<0.000, -0.869, -5.396, 0.000>> 7 9
7 X DT -> a [3..3] -1.012 <<0.000, -0.434, -2.024, 0.000>>
9 X NN -> house [4..4] -2.887 <<0.000, -0.434, -5.774, 0.000>>
Each line represents a hypothesis that is part of the derivation of the best translation. The pieces
of information in each line (with the first line as example) are:
- the hypothesis number, a sequential identifier (41),
- the input non-terminal (X),
- the output non-terminal (S),
- the rule used to generate this hypothesis (TOP -> <s> S </s>),
- alignment information between input and output non-terminals in the rule ((1,1)),
- the span covered by the hypothesis, as defined by input word positions ([0..5]),
- the score of the hypothesis (-3.593),
- its component scores (<<...>>):
  - unknown word penalty (0.000),
  - word penalty (-2.606),
  - language model score (-9.711),
  - rule application probability (2.526), and
- prior hypotheses, i.e. the children nodes in the tree, that this hypothesis is built on (20).
As you can see, the model used here is a target-syntax model. It uses linguistic syntactic
annotation on the target side, but on the input side everything is labeled X.
Rule Table
If we look at the string-to-tree directory, we find two files: the configuration file moses.ini
which points to the language model (in lm/europarl.srilm.gz), and the rule table file rule-table.
The configuration file moses.ini has a fairly familiar format. It is mostly identical to the
configuration file for phrase-based models. We will describe the new parameters of the chart
decoder in detail further below.
The rule table rule-table is an extension of the Pharaoh/Moses phrase-table, so it will be
familiar to anybody who has used it before. Here are some lines as example:
gibt [X] ||| gives [ADJ] ||| 1.0 ||| ||| 3 5
es gibt [X] ||| there is [ADJ] ||| 1.0 ||| ||| 2 3
[X][DT] [X][NN] [X] ||| [X][DT] [X][NN] [NP] ||| 1.0 ||| 0-0 1-1 ||| 2 4
[X][DT] [X][ADJ] [X][NN] [X] ||| [X][DT] [X][ADJ] [X][NN] [NP] ||| 1.0 ||| 0-0 1-1 2-2 ||| 5 6
[X][V] [X][NP] [X] ||| [X][V] [X][NP] [VP] ||| 1.0 ||| 0-0 1-1 ||| 4 3
Each line in the rule table describes one translation rule. It consists of five components
separated by triple bars:
1. the source string and source left-hand-side,
2. the target string and target left-hand-side,
3. score(s): here only one, but typically multiple scores are used,
4. the alignment between non-terminals (using word positions starting with 0, as source-
target), and
5. frequency counts of the source and target phrase (for debugging purposes; not used during
decoding).
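A quick sketch of how such a line splits into its five fields (illustrative Python, not a Moses tool; note that trailing fields may be empty, as in some glue rules):

```python
# Split a rule-table line on the " ||| " separator into its five components.

def parse_rule(line):
    fields = [f.strip() for f in line.split("|||")]
    source, target, scores, alignment, counts = (fields + [""] * 5)[:5]
    return {
        "source": source,
        "target": target,
        "scores": [float(s) for s in scores.split()] if scores else [],
        "alignment": alignment.split(),
        "counts": counts.split(),
    }

rule = parse_rule(
    "[X][DT] [X][NN] [X] ||| [X][DT] [X][NN] [NP] ||| 1.0 ||| 0-0 1-1 ||| 2 4")
print(rule["scores"], rule["alignment"])  # [1.0] ['0-0', '1-1']
```
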
The format is slightly different from the Hiero format. For example, the Hiero rule
[X] ||| [X,1] trace ’ ||| [X,1] &#52628;&#51201; ’ \
||| 0.727273 0.444625 1 0.172348 2.718
is formatted as
[X][X] trace ’ [X] ||| [X][X] &#52628;&#51201; ’ [X] \
||| 0.727273 0.444625 1 0.172348 2.718 ||| 0-0 ||| 2 3
A syntax rule in a string-to-tree grammar:
[NP] ||| all [NN,1] ||| &#47784;&#46304; [NN,1] \
||| 0.869565 0.627907 0.645161 0.243243 2.718
is formatted as
all [X][NN] [X] ||| &#47784;&#46304; [X][NN] [NP] \
||| 0.869565 0.627907 0.645161 0.243243 2.718 ||| 1-1 ||| 23 31
The format can also represent a tree-to-string rule, which has no Hiero equivalent:
all [NN][X] [NP] ||| &#47784;&#46304; [NN][X] [X] \
||| 0.869565 0.627907 0.645161 0.243243 2.718 ||| 1-1 ||| 23 31
Usually, you will also need these ’glue’ rules:
<s> [X][S] </s> [X] ||| <s> [X][S] </s> [TOP] ||| 1.0 ||| 1-1
<s> [X][NP] </s> [X] ||| <s> [X][NP] </s> [TOP] ||| 1.0 ||| 1-1
<s> [X] ||| <s> [S] ||| 1 |||
[X][S] </s> [X] ||| [X][S] </s> [S] ||| 1 ||| 0-0
[X][S] [X][X] [X] ||| [X][S] [X][X] [S] ||| 2.718 ||| 0-0 1-1
Finally, this rather technical rule applies only to spans that cover everything except the sentence
boundary markers <s> and </s>. It completes a translation of a sentence span (S).
More Examples
The second rule in the table, which we just glanced at, allows something quite interesting: the
translation of a non-contiguous phrase: macht X auf.
Let us try this with the decoder on an example sentence:
% echo ’er macht das tor auf’ | moses_chart -f moses.ini -T trace-file ; cat trace-file
14 X TOP -> <s> S </s> (1,1) [0..6] -7.833 <<0.000, -2.606, -17.163, 1.496>> 13
13 X S -> NP VP (0,0) (1,1) [1..5] -6.367 <<0.000, -1.737, -14.229, 1.496>> 2 11
2 X NP -> he [1..1] -1.064 <<0.000, -0.434, -2.484, 0.357>>
11 X VP -> opens NP (1,1) [2..5] -5.627 <<0.000, -1.303, -12.394, 1.139>> 10
10 X NP -> DT NN (0,0) (1,1) [3..4] -3.154 <<0.000, -0.869, -7.224, 0.916>> 6 7
6 X DT -> the [3..3] 0.016 <<0.000, -0.434, -0.884, 0.916>>
7 X NN -> gate [4..4] -3.588 <<0.000, -0.434, -7.176, 0.000>>
he opens the gate
You can see the application of the rule in the creation of hypothesis 11. It generates opens
NP to cover the input span [2..5] by using hypothesis 10, which covers the span [3..4].
Note that this rule allows us to do something that is not possible with a simple phrase-based
model. Phrase-based models in Moses require that all phrases are contiguous; they cannot
have gaps.
The final example illustrates how reordering works in a tree-based model:
% echo ’ein haus ist das’ | moses_chart -f moses.ini -T trace-file ; cat trace-file
41 X TOP -> <s> S </s> (1,1) [0..5] -2.900 <<0.000, -2.606, -9.711, 3.912>> 18
18 X S -> NP V NP (0,2) (1,1) (2,0) [1..4] -1.295 <<0.000, -1.737, -6.501, 3.912>> 11 5 8
11 X NP -> DT NN (0,0) (1,1) [1..2] -2.698 <<0.000, -0.869, -5.396, 0.000>> 2 4
2 X DT -> a [1..1] -1.012 <<0.000, -0.434, -2.024, 0.000>>
4 X NN -> house [2..2] -2.887 <<0.000, -0.434, -5.774, 0.000>>
5 X V -> is [3..3] -1.267 <<0.000, -0.434, -2.533, 0.000>>
8 X NP -> this [4..4] 0.486 <<0.000, -0.434, -1.330, 2.303>>
this is a house
The reordering in the sentence happens when hypothesis 18 is generated. The non-lexical rule
S -> NP V NP takes the underlying children nodes in inverse order ((0,2) (1,1) (2,0)).
Not every arbitrary reordering is allowed, as can be the case in phrase models. Reordering
has to be motivated by a translation rule. If the model uses real syntax, there has to be a
syntactic justification for the reordering.
3.3.3 Decoder Parameters
The most important consideration in decoding is the speed/quality trade-off. If you want to
win competitions, you want the best quality possible, even if it takes a week to translate 2000
sentences. If you want to provide an online service, you know that users get impatient when
they have to wait more than a second.
Beam Settings
The chart decoder has an implementation of CKY decoding using cube pruning. The latter
means that only a fixed number of hypotheses are generated for each span. This number can be
changed with the option cube-pruning-pop-limit (or cbp for short). The default is 1000; higher
numbers slow down the decoder, but may result in better quality.
Another setting that directly affects speed is the number of rules that are considered for each
input left hand side. It can be set with ttable-limit.
Limiting Reordering
The number of spans that are filled during chart decoding is quadratic with respect to sentence
length. But it gets worse. The number of spans that are combined into a span grows linearly with
sentence length for binary rules, quadratically for trinary rules, and so on. In short, long sentences
become a problem. A drastic solution is to limit the size of internal spans to a maximum number.
This sounds a bit extreme, but does make some sense for non-syntactic models. Reordering is
limited in phrase-based models, and non-syntactic tree-based models (better known as
hierarchical phrase-based models) should limit reordering for the same reason: they are just not
very good at long-distance reordering anyway.
The limit on span sizes can be set with max-chart-span. In fact, its default is 10, which is not a
useful setting for syntax models.
Handling Unknown Words
In a target-syntax model, unknown words that are just copied verbatim into the output need to
get a non-terminal label. In practice, unknown words tend to be open class words, most likely
names, nouns, or numbers. With the option unknown-lhs you can specify a file that contains
pairs of non-terminal labels and their probabilities, one pair per line.
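Such a file might look like the following (the labels and probabilities here are made up for illustration; the format is one label-probability pair per line, as described above):

```
NN 0.5
NNP 0.3
CD 0.2
```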
Optionally, we can also model the choice of non-terminal for unknown words through sparse
features, and optimize their cost through MIRA or PRO. This is implemented by relaxing the
label matching constraint during decoding to allow soft matches, and allowing unknown words
to expand to any non-terminal. To activate this feature:
use-unknown-word-soft-matches = true (in EMS config)
-unknown-word-label FILE1 -unknown-word-soft-matches FILE2 (in train-model.perl)
Technical Settings
The parameter non-terminals is used to specify privileged non-terminals. These are used for
unknown words (unless there is an unknown word label file) and to define the non-terminal
label on the input side when this is not specified.
Typically, we want to consider all possible rules that apply. However, with a large maximum
phrase length, too many rule tables, and no rule table limit, this may explode. The number of
rules considered can be limited with rule-limit. The default is 5000.
3.3.4 Training
In short, training uses the same training script as phrase-based models. When running
train-model.perl, you will have to specify additional parameters, e.g. -hierarchical and
-glue-grammar. You will typically also reduce the number of lexical items in the grammar with
-max-phrase-length 5.
That’s it.
Training Parameters
There are a number of additional decisions about the types of rules you may want to include in
your model. This is typically a size/quality trade-off: allowing more rule types increases the
size of the rule table, but leads to better results. Bigger rule tables have a negative impact on the
memory use and speed of the decoder.
There are two parts to creating a rule table: the extraction of rules and the scoring of rules. The
first can be modified with the parameter --extract-options="..." of train-model.perl,
the second with --score-options="...".
Here are the extract options:
--OnlyDirect: Only creates a model with direct conditional probabilities p(f|e) instead of
the default direct and indirect (p(f|e) and p(e|f)).
--MaxSpan SIZE: maximum span size of the rule. Default is 15.
--MaxSymbolsSource SIZE and --MaxSymbolsTarget SIZE: While a rule may be extracted
from a large span, much of it may be knocked out by sub-phrases that are substituted by
non-terminals, so fewer actual symbols (non-terminals and words) remain. The default
maximum number of symbols is 5 for the source side, and practically unlimited (999) for
the target side.
--MinWords SIZE: minimum number of words in a rule. Default is 1, meaning that each
rule has to have at least one word in it. If you want to allow non-lexical rules set this to
zero. You will not want to do this for hierarchical models.
--AllowOnlyUnalignedWords: This is related to the above. A rule may have words in it,
but these may be unaligned words that are not connected. By default, at least one aligned
word is required. Using this option, this requirement is dropped.
--MaxNonTerm SIZE: the number of non-terminals on the right hand side of the rule. This
has an effect on the arity of rules, in terms of non-terminals. Default is to generate only
binary rules, so the setting is 2.
--MinHoleSource SIZE and --MinHoleTarget SIZE: When sub-phrases are replaced by
non-terminals, we may require a minimum size for these sub-phrases. The default is 2 on
the source side and 1 (no limit) on the target side.
--DisallowNonTermConsecTarget and --NonTermConsecSource: We may want to restrict
whether neighboring non-terminals are allowed in rules. In hierarchical models, allowing
neighboring non-terminals on the source side has a bad effect on decoding. The default is
to disallow them on the source side and allow them on the target side. These switches
override the default behavior.
--NoFractionalCounting: For any given source span, any number of rules can be generated.
By default, fractional counts are assigned, so the probabilities of these rules add up to
one. This option assigns a count of one to each rule instead.
--NoNonTermFirstWord: Disallows rules that start with a non-terminal.
Once rules are collected, the file of rules and their counts has to be converted into a
probabilistic model. This is called rule scoring, and there are some additional options:
--OnlyDirect: only estimates direct conditional probabilities. Note that this option needs
to be specified for both rule extraction and rule scoring.
--NoLex: only includes rule-level conditional probabilities, not lexical scores.
--GoodTuring: Uses Good-Turing discounting to reduce the actual counts. This is a good
thing, use it.
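Both sets of options are passed to train-model.perl as quoted strings. A sketch (the option values shown are illustrative, not recommendations):

```
.../train-model.perl -hierarchical -glue-grammar \
   --extract-options="--MaxSpan 15 --MaxNonTerm 2" \
   --score-options="--GoodTuring"
```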
Training Syntax Models
Training hierarchical phrase models, i.e., tree-based models without syntactic annotation, is
straightforward. Adding syntactic labels to rules, either on the source side or the target
side, is not much more complex. The main hurdle is to get the annotation. This requires a
syntactic parser.
Syntactic annotation is provided by annotating all the training data (input or output side, or
both) with syntactic labels, using XML markup. Here is an example:
<tree label="NP"> <tree label="DET"> the </tree> \
<tree label="NN"> cat </tree> </tree>
So, constituents are surrounded by an opening and a closing <tree> tag, and the label is
provided with the parameter label. The XML markup also allows for the placement of the
tags in other positions, as long as a span parameter is provided:
<tree label="NP" span="0-1"/> <tree label="DET" span="0-0"/> \
<tree label="NN" span="1-1"/> the cat
After annotating the training data with syntactic information, you can simply run train-model.perl
as before, except that the switches --source-syntax or --target-syntax (or both) have to be
set. You may also change some of the extraction settings, for instance --MaxSpan 999.
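As a sketch, a string-to-tree training run with a relaxed span limit might look like this (assuming the target side of the corpus has been annotated as described above; other arguments elided):

```
.../train-model.perl -hierarchical -glue-grammar --target-syntax \
   --extract-options="--MaxSpan 999"
```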
Annotation Wrappers
To obtain the syntactic annotation, you will likely use a third-party parser, which has its own
idiosyncratic input and output format. You will need to write a wrapper script that converts it
into the Moses format for syntax trees.
We provide wrappers (in scripts/training/wrapper) for the following parsers.
Bitpar is available from the web site of the University of Munich; a wrapper script is provided.
The Collins parser is available from MIT; the wrapper is parse-en-collins.perl.
If you wrote your own wrapper for a publicly available parser, please share it with us!
Relaxing Parses
The use of syntactic annotation puts severe constraints on the number of rules that can be
extracted, since each non-terminal has to correspond to an actual non-terminal in the syntax
tree.
Recent research has proposed a number of relaxations of this constraint. The program relax-parse
(in training/phrase-extract) implements two kinds of parse relaxations: binarization and a
method proposed under the label of syntax-augmented machine translation (SAMT) by Zollmann
and Venugopal.
Readers familiar with the concept of binarizing grammars in parsing, be warned: we are talking
here about modifying parse trees, which changes the power of the extracted grammar, not about
binarization as an optimization step during decoding.
The idea is the following: If the training data contains a subtree such as
then it is not possible to extract translation rules for Ariel Sharon without additional syntactic
context. Recall that each rule has to match a syntactic constituent.
The idea of relaxing the parse trees is to add additional internal nodes that makes the extrac-
tion of additional rules possible. For instance left-binarization adds two additional nodes and
converts the subtree into:
The additional node with the label ^NP allows for the straightforward extraction of a translation
rule (unless, of course, the word alignment does not provide a consistent alignment).
The program relax-parse allows the following tree transformations:
--LeftBinarize and --RightBinarize: Adds internal nodes as in the example above.
Right-binarization creates a right-branching tree.
--SAMT 1: Combines pairs of neighboring child nodes into joint tags, such as DET+ADJ. Also,
nodes for everything except the first child (NP\DET) and everything except the last child
(NP/NN) are added.
--SAMT 2: Combines any pairs of neighboring nodes, not only child nodes, e.g., VP+DET.
--SAMT 3: not implemented.
--SAMT 4: As above, but in addition each previously unlabeled node is labeled as FAIL,
so no syntactic constraint on the grammar remains.
Note that you can also use both --LeftBinarize and --RightBinarize. In this case, as with
all the SAMT relaxations, the resulting annotation is no longer a tree, since there is no
single set of rule applications that generates the structure (now called a forest).
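A sketch of how relax-parse might be invoked on the annotated corpus before rule extraction (the file names are hypothetical, and the stdin/stdout calling convention is an assumption):

```
relax-parse --LeftBinarize < corpus.parsed.de > corpus.relaxed.de
```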
Here is an example of what parse relaxation does to the number of rules extracted (English-German
News Commentary, using Bitpar for German, no English syntax):
Relaxation Setting Number of Rules
no syntax 59,079,493
basic syntax 2,291,400
left-binarized 2,914,348
right-binarized 2,979,830
SAMT 1 8,669,942
SAMT 2 35,164,756
SAMT 4 131,889,855
On-Disk Rule Table
The rule table may become too big to fit into the RAM of the machine. Instead of loading
the rules into memory, it is also possible to leave the rule table on disk, and retrieve rules on
demand. This is described in the section on the on-disk phrase table.
3.3.5 Using Meta-symbols in Non-terminal Symbols (e.g., CCG)
Often a syntactic formalism will use symbols that are part of the meta-symbols that denote non-
terminal boundaries in the SCFG rule table, and glue grammar. For example, in Combinatory
Categorial Grammar (CCG, Steedman, 2000), it is customary to denote grammatical features
by placing them after the non-terminal symbol inside square brackets, as in S[dcl] (declarative
sentence) vs. S[q] (interrogative sentence).
Although such annotations may be useful to discriminate good translations from bad, includ-
ing square brackets in the non-terminal symbols themselves can confuse Moses. Some users
have reported that category symbols were mangled (by splitting them at the square brackets)
after converting to an on-disk representation (and potentially in other scenarios -- this is cur-
rently an open issue). A way to side-step this issue is to escape square brackets with a symbol
that is not part of the meta-language of the grammar files, e.g. using the underscore symbol:
S[dcl] => S_dcl_
S[q] => S_q_
before extracting a grammar. This should be done in all data or tables that mention such
syntactic categories. If the rule table is automatically extracted, it suffices to escape the
categories in the <tree label="..."> markup that is supplied to the training script. If you roll
your own rule tables (or use an unknown-lhs file), you should make sure they are properly escaped.
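As a hedged example, this escaping can be done with a one-line sed substitution over the data; the pattern below is illustrative, not part of the official Moses tooling:

```shell
# Rewrite CCG-style bracketed features to underscores, e.g. S[dcl] -> S_dcl_.
echo 'S[dcl] S[q]' | sed -e 's/\[\([^]]*\)\]/_\1_/g'
# prints: S_dcl_ S_q_
```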
3.3.6 Different Kinds of Syntax Models
Most current SCFG-based machine translation decoders are designed to use a hierarchical
phrase-based grammar (Chiang, 2005) or a syntactic grammar. Joshua, cdec, and Jane are some
of the open-source systems that have such decoders.
The hierarchical phrase-based grammar is well described elsewhere, so we will not go into
details here. Briefly, the non-terminals are not labelled with linguistically-motivated labels;
by convention, non-terminals are simply labelled as X, e.g.
X --> der X1 ||| the X1
Usually, a set of glue rules is needed to ensure that the decoder always outputs an answer. By
convention, the non-terminals for glue rules are labelled as S, e.g.
S --> <s> ||| <s>
S --> X1 </s> ||| X1 </s>
S --> X1 X2 ||| X1 X2
In a syntactic model, non-terminals are labelled with linguistically-motivated labels such as
’NOUN’, ’VERB’ etc. For example:
DET --> der ||| the
ADJ --> kleines ||| small
These labels are typically obtained by parsing the target side of the training corpus. (However,
it is also possible to use parses of the source side which have been projected onto the target
side (Ambati and Chen, 2007).)
The input to the decoder when using this model is a conventional string, as in phrase-based and
hierarchical phrase-based models, and the output is a string. However, the CFG-tree derivation
of the output (target) can also be obtained (in Moses by using the -T argument); the
non-terminals in this tree will be labelled with the linguistically-motivated labels.
For these reasons, these syntactic models are called 'target' syntax models, or 'string-to-tree'
models, by many in the Moses community and elsewhere. (Some papers by people at ISI invert
this naming convention due to their adherence to the noisy-channel framework.)
The implementation of string-to-tree models is fairly standard and similar across different
open-source decoders such as Moses, Joshua, cdec and Jane.
There is a 'string-to-tree' model among the downloadable sample models.
The input to the model is the string:
das ist ein kleines haus
The output string is
this is a small house
The target tree it produces is
(TOP <s> (S (NP this) (VP (V is) (NP (DT a) (ADJ small) (NN house)))) </s>)
RECAP - The input is a string, the output is a tree with linguistically-motivated labels.
Unlike the string-to-tree model, the tree-to-string model is not as standardized across different
decoders. This section describes the Moses implementation.
Input tree representation The input to the decoder is a parse tree, not a string. For Moses,
the parse tree should be formatted using XML. The decoder converts the parse tree into an
annotated string (a chart). Each span in the chart is labelled with the non-terminal from the
parse tree. For example, the input
<tree label="NP"> <tree label="DET"> the </tree> <tree label="NN"> cat </tree> </tree>
is converted to an annotated string
the cat
-DET- -NN--
To support easier glue rules, the non-terminal ’X’ is also added for every span in the annotated
string. Therefore, the input above is actually converted to:
the cat
-DET- -NN--
--X-- --X--
Translation rules During decoding, the non-terminal of the rule that spans a substring in the
sentence must match the label on the annotated string. For example, the following rules can be
applied to the above sentence.
NP --> the katze ||| die katze
NP --> the NN1 ||| der NN1
NP --> DET1 cat ||| DET1 katze
NP --> DET1 NN2 ||| DET1 NN2
However, the following rules cannot be applied, as one or more of their non-terminals do not match:
VB --> the katze ||| die katze
NP --> the ADJ1 ||| der ADJ1
NP --> ADJ1 cat ||| ADJ1 katze
ADV --> ADJ1 NN2 ||| ADJ1 NN2
Therefore, non-terminals in the translation rules of a tree-to-string model act as constraints on
which rules can be applied. This constraint is in addition to the usual role of non-terminals.
A feature which is currently unique to the Moses decoder is the ability to separate out these
two roles. Each non-terminal in all translation rules is represented by two labels:
1. The source non-terminal which constrains rules to the input parse tree
2. The target non-terminal which has the normal parsing role.
When we need to differentiate source and target non-terminals, the translation rules are instead
written like this:
NP --> the NN1 ||| X --> der X1
This rule indicates that the non-terminal should span a NN constituent in the input text, and
that the whole rule should span an NP constituent. The target non-terminals in this rule are
both X; therefore, this rule would be considered part of a tree-to-string grammar.
(Using this notation is probably wrong as the source sentence is not properly parsed - see next
section. It may be better to express the Moses tree-to-string grammar as a hierarchical grammar,
with added constraints. For example:
X --> the X1 ||| der X1 ||| LHS = NP, X_1 = NN
However, this may be even more confusing so we will stick with our convention for now.)
RECAP - Grammar rules in Moses have two labels for each non-terminal: one to constrain the
non-terminal to the input parse tree, while the other is used in parsing.
1. The Moses decoder always checks the source non-terminal, even when it is decoding with a
string-to-string or string-to-tree grammar. For example, when checking whether the following
rule can be applied
X --> der X1 ||| the X1
the decoder will check whether the RHS non-terminal, and the whole rule, spans an input parse
constituent X. Therefore, even when decoding with a string-to-string or string-to-tree grammar,
it is necessary to add the X non-terminal to every input span. For example, the input string the
cat must be annotated as follows
the cat
--X-- --X--
to allow the string to be decoded with a string-to-string or string-to-tree grammar.
2. There is no difference between a linguistically derived non-terminal label, such as NP, VP
etc., and the non-linguistically motivated X label. They can both be used in one grammar, or
even in a single translation rule. This 'mixed-syntax' model was explored in (Hoang and Koehn,
2010) and in Hieu Hoang's thesis.
3. The source non-terminals in translation rules are used just to constrain against the input
parse tree, not for parsing. For example, if the input parse tree is
(VP (NP (PRO he)) (VB goes))
and tree-to-string rules are:
PRO --> he ||| X --> il
VB --> goes ||| X --> va
VP --> NP1 VB2 ||| X --> X1 X2
These rules will create a valid translation. However, the span over the word 'he' will be
labelled as PRO by the first rule, and as NP by the third rule. This is illustrated in more
detail in Section 4.2.11 of Hieu's thesis.
4. To avoid the above and ensure that source spans are always consistently labelled, simply
project the non-terminal label to both source and target. For example, change the rule
VP --> NP1 VB2 ||| X --> X1 X2
to
VP --> NP1 VB2 ||| VP --> NP1 VB2
3.3.7 Format of text rule table
The format of the Moses rule table is different from that used by Hiero, Joshua and cdec, and
has often been a source of confusion. We shall attempt to explain the reasons in this section.
The format is derived from the Pharaoh/Moses phrase-based format. In this format, a translation rule
a b c --> d e f , with word alignments a1, a2 ..., and probabilities p1, p2, ...
is formatted as
a b c ||| d e f ||| p1 p2 ... ||| a1 a2 ...
For a hierarchical phrase-based rule
X --> a X1 b c X2 ||| d e f X2 X1
the Hiero/Joshua/cdec format is
X ||| a [X,1] b c [X,2] ||| d e f [X,2] [X,1] ||| p1 p2 ...
The Moses format is
a [X][X] b c [X][X] [X] ||| d e f [X][X] [X][X] [X] ||| p1 p2 ... ||| 1-4 4-3
For a string-to-tree rule,
VP --> a X1 b c X2 ||| d e f NP2 ADJ1
the Moses format is
a [X][ADJ] b c [X][NP] [X] ||| d e f [X][NP] [X][ADJ] [VP] ||| p1 p2 ... ||| 1-4 4-3
For a tree-to-string rule,
VP --> a ADJ1 b c NP2 ||| X --> d e f X2 X1
The Moses format is
a [ADJ][X] b c [NP][X] [VP] ||| d e f [NP][X] [ADJ][X] [X] ||| p1 p2 ... ||| 1-4 4-3
The main reasons for the difference between the Hiero/Joshua/cdec and Moses formats are as
follows:
the community that this allows much larger models to be used during decoding, even on
memory-limited servers. To make the conversion efficient, the text rule table must have
the following properties:
(a) For every rule, the sequence of terminals and non-terminals in the first column (the
’source’ column) should match the lookup sequence that the decoder will perform.
(b) The file can be sorted so that the first column is in alphabetical order. The decoder
needs to look up the target non-terminals on the right-hand side of each rule, so the
first column consists of source terminals and non-terminals, and target non-terminals
from the right-hand side.
2. The phrase probability calculations should be performed efficiently. To calculate p(t|s) =
count(t,s) / count(s), the extract file must be sorted in contiguous order so that each count
can be computed and used to calculate the probability, then discarded immediately to
save memory. Similarly for p(s|t) = count(t,s) / count(t).
The Hiero/Joshua/cdec file format is sufficient for hierarchical models, but not for the various
syntax models supported by Moses.
Subsection last modified on August 26, 2015, at 02:38 PM
3.4 Optimizing Moses
3.4.1 Multi-threaded Moses
Moses supports multi-threaded operation, enabling faster decoding on multi-core machines.
The current limitations of multi-threaded Moses are:
1. irstlm is not supported, since it uses a non-threadsafe cache
2. lattice input may not work - this has not been tested
3. increasing the verbosity of Moses will probably cause multi-threaded Moses to crash
4. Decoding speed will flatten out after about 16 threads. For more scalable speed with
many threads, use Moses2
Multi-threaded Moses is now built by default. If you omit the -threads argument, then Moses
will use a single worker thread, and a thread to read the input stream. Using the argument
-threads n specifies a pool of n threads, and -threads all will use all the cores on the machine.
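For example, decoding with a pool of four worker threads (the file names are hypothetical):

```
moses -f moses.ini -threads 4 < input.txt > output.txt
```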
3.4.2 How much memory do I need during decoding?
The single-most important thing you need to run Moses fast is MEMORY. Lots of MEMORY.
(For example, the Edinburgh group have servers with 144GB of RAM). The rest of this section
is just details of how to make the training and decoding run fast.
Calculate the total file size of the binary phrase tables, binary language models and binary
reordering models.
For example,
% ll -h phrase-table.0-0.1.1.binphr.*
-rw-r--r-- 1 s0565741 users 157K 2012-06-13 12:41 phrase-table.0-0.1.1.binphr.idx
-rw-r--r-- 1 s0565741 users 5.4M 2012-06-13 12:41 phrase-table.0-0.1.1.binphr.srctree
-rw-r--r-- 1 s0565741 users 282K 2012-06-13 12:41 phrase-table.0-0.1.1.binphr.srcvoc
-rw-r--r-- 1 s0565741 users 1.1G 2012-06-13 12:41 phrase-table.0-0.1.1.binphr.tgtdata
-rw-r--r-- 1 s0565741 users 1.7M 2012-06-13 12:41 phrase-table.0-0.1.1.binphr.tgtvoc
% ll -h reordering-table.1.wbe-msd-bidirectional-fe.binlexr.*
-rw-r--r-- 1 s0565741 users 157K 2012-06-13 13:36 reordering-table.1.wbe-msd-bidirectional-fe.binlexr.idx
-rw-r--r-- 1 s0565741 users 1.1G 2012-06-13 13:36 reordering-table.1.wbe-msd-bidirectional-fe.binlexr.srctree
-rw-r--r-- 1 s0565741 users 1.1G 2012-06-13 13:36 reordering-table.1.wbe-msd-bidirectional-fe.binlexr.tgtdata
-rw-r--r-- 1 s0565741 users 282K 2012-06-13 13:36 reordering-table.1.wbe-msd-bidirectional-fe.binlexr.voc0
-rw-r--r-- 1 s0565741 users 1.7M 2012-06-13 13:36 reordering-table.1.wbe-msd-bidirectional-fe.binlexr.voc1
% ll -h interpolated-binlm.1
-rw-r--r-- 1 s0565741 users 28G 2012-06-15 11:07 interpolated-binlm.1
The total size of these files is approx. 31GB. Therefore, a translation system using these models
requires 31GB (+ roughly 500MB) of memory to run fast.
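The arithmetic above is just a sum over the binary model files. A small runnable sketch using dummy files of known size (point du at your real phrase-table.*, reordering-table.* and *.binlm files instead):

```shell
# Create stand-in "model" files (2 MB + 3 MB); replace with real binarised models.
mkdir -p models
dd if=/dev/zero of=models/phrase-table.binphr.tgtdata bs=1M count=2 2>/dev/null
dd if=/dev/zero of=models/interpolated-binlm.1 bs=1M count=3 2>/dev/null
# The last line is the grand total: roughly the RAM needed to run fast.
du -ch models/* | tail -n 1
```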
I’ve got this much memory but it’s still slow. Why?
Run this:
cat phrase-table.0-0.1.1.binphr.* > /dev/null
cat reordering-table.1.wbe-msd-bidirectional-fe.binlexr.* > /dev/null
cat interpolated-binlm.1 > /dev/null
This forces the operating system to cache the binary models in memory, minimizing page
faults while the decoder is running. Other memory-intensive processes should not be running
on the computer, otherwise the file-system cache may be reduced.
Use huge pages
Moses does a lot of random lookups. If you're running Linux, check that transparent huge
pages are enabled. If
cat /sys/kernel/mm/transparent_hugepage/enabled
responds with
[always] madvise never
then transparent huge pages are enabled.
On some RedHat/CentOS systems, the file is /sys/kernel/mm/redhat_transparent_hugepage/enabled
and madvise will not appear. If neither file exists, upgrade the kernel to at least 2.6.38 and
compile with CONFIG_SPARSEMEM_VMEMMAP. If the file exists, but the square brackets are not around
"always", then run
echo always > /sys/kernel/mm/transparent_hugepage/enabled
as root (NB: to use sudo, quote the > character). This setting will not be preserved across
reboots, so consider adding it to an init script.
Use the compact phrase and reordering table representations to reduce memory usage by a
factor of 10
See the manual sections on binarized and compact phrase tables for a description of how to
compact your phrase tables. All the things said above for the standard binary phrase table are
also true for the compact versions. The principle is the same: the total size of the binary files
determines your memory usage, but since the combined size of the compact phrase table and the
compact reordering model may be up to 10 to 12 times smaller than with the original binary
implementations, you will save exactly this much memory. You can also use the --minphr-memory and
--minlexr-memory options to load the tables into memory at Moses start-up instead of doing
the above-mentioned caching trick. This may take some time during warm-up, but may save
a lot of time in the long term. If you are concerned about performance, see Junczys-Dowmunt
(2012) for a comparison. There is virtually no overhead due to on-the-fly decompression on
large-memory systems, and considerable speed-up on systems with limited memory.
3.4.3 How little memory can I get away with during decoding?
The decoder can run on very little memory, about 200-300MB for phrase-based and 400-500MB
for hierarchical decoding (according to Hieu). The decoder can run on an iPhone! And laptops.
However, it will be VERY slow, unless you have very small models or the models are on fast
disks such as flash disks.
3.4.4 Faster Training
Parallel training
When word aligning, using MGIZA with multiple threads significantly speeds up word alignment.
MGIZA To use MGIZA with multiple threads in the Moses training script, add these arguments:
.../train-model.perl -mgiza -mgiza-cpus 8 ....
To enable it in the EMS, add this to the [TRAINING] section:
training-options = "-mgiza -mgiza-cpus 8"
snt2cooc When running GIZA++ or MGIZA, the first stage involves running a program called
snt2cooc. This requires approximately 6GB+ for typical Europarl-size corpora (1.8 million
sentences). For users without this amount of memory on their computers, an alternative version
is included with Moses.
To use this script, you must copy 2 files to the same place where snt2cooc is run:
Add this argument when running the Moses training script:
.../train-model.perl -snt2cooc
Parallel Extraction
Once word alignment is completed, the phrase table is created from the aligned parallel corpus.
There are 2 main ways to speed up this part of the training process.
Firstly, the training corpus and alignment can be split and phrase pairs from each part can be
extracted simultaneously. This can be done by simply using the argument -cores, e.g.,
.../train-model.perl -cores 4
Secondly, the Unix sort command is often executed during training. It is essential to optimize
this command to make use of the available disk and CPU. For example, recent versions of sort
can take the following arguments
sort -S 10G --batch-size 253 --compress-program gzip --parallel 5
The Moses training script names these arguments as follows:
.../train-model.perl -sort-buffer-size 10G -sort-batch-size 253 \
-sort-compress gzip -sort-parallel 5
You should set these arguments. However, DO NOT just blindly copy the above settings; they
must be tuned to the particular computer you are running on. The most important issues are:
1. you must make sure the version of sort on your machine supports the arguments you
specify, otherwise the script will crash. The --parallel, --compress-program, and --batch-size
arguments have only recently been added to the sort command.
2. make sure you have enough memory when setting -sort-buffer-size. In particular,
you should take into account other programs running on the computer. Also, two or three
sort programs will run simultaneously (one to sort the extract file, one to sort
extract.inv, one to sort extract.o). If there is not enough memory because you've set
-sort-buffer-size too high, your entire computer will likely crash.
3. the maximum number for the --batch-size argument is OS-dependent. For example, it
is 1024 on Linux, 253 on old Mac OSX, 2557 on new OSX.
4. on Mac OSX, using --compress-program can occasionally result in the following timeout
gsort: couldn’t create process for gzip -d: Operation timed out
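Before setting these arguments, it is worth probing whether the local sort actually supports them. A small sketch (the probe mirrors the flags train-model.perl will pass; the messages are illustrative):

```shell
# If this pipeline succeeds, --parallel, --batch-size and --compress-program
# are understood by this sort; otherwise fall back to the basic options.
if printf '2\n1\n' | sort --parallel=2 --batch-size=32 --compress-program=gzip >/dev/null 2>&1; then
  echo "sort: extended flags supported"
else
  echo "sort: extended flags NOT supported"
fi
```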
3.4.5 Training Summary
In summary, to maximize speed on a large server with many cores and up-to-date software,
add this to your training script:
.../train-model.perl -mgiza -mgiza-cpus 8 -cores 10 \
-parallel -sort-buffer-size 10G -sort-batch-size 253 \
-sort-compress gzip -sort-parallel 10
To run on a laptop with limited memory
.../train-model.perl -mgiza -mgiza-cpus 2 -snt2cooc \
-parallel -sort-batch-size 253 -sort-compress gzip
In the EMS, for large servers, this can be done by adding:
script = $moses-script-dir/training/train-model.perl
training-options = "-mgiza -mgiza-cpus 8 -cores 10 \
-parallel -sort-buffer-size 10G -sort-batch-size 253 \
-sort-compress gzip -sort-parallel 10"
parallel = yes
For servers with older OSes, and therefore older sort commands:
script = $moses-script-dir/training/train-model.perl
training-options = "-mgiza -mgiza-cpus 8 -cores 10 -parallel"
parallel = yes
For laptops with limited memory:
script = $moses-script-dir/training/train-model.perl
training-options = "-mgiza -mgiza-cpus 2 -snt2cooc \
-parallel -sort-batch-size 253 -sort-compress gzip"
parallel = yes
3.4.6 Language Model
Convert your language model to binary format. This reduces loading time and provides more
Building a KenLM binary file
See the KenLM web site for the time-memory tradeoff presented by the KenLM data structures.
Use bin/build_binary (found in the same directory as moses and moses_chart) to convert ARPA
files to the binary format. You can preview memory consumption with:
bin/build_binary file.arpa
This preview includes only the language model's memory usage, which is in addition to the
phrase table etc. For speed, use the default probing data structure.
bin/build_binary file.arpa file.binlm
To save memory, change to the trie data structure:
bin/build_binary trie file.arpa file.binlm
To further losslessly compress the trie ("chop" in the benchmarks), use -a 64, which will
compress pointers to a depth of up to 64 bits:
bin/build_binary -a 64 trie file.arpa file.binlm
Note that you can also make this parameter smaller, which will go faster but use more memory.
Quantization will make the trie smaller at the expense of accuracy. You can choose any number
of bits from 2 to 25, for example 10:
bin/build_binary -a 64 -q 10 trie file.arpa file.binlm
Note that quantization can be used independently of -a.
Loading on-demand
By default, language models are fully loaded into memory at start-up. If you are short on
memory, you can use on-demand language model loading. The language model must be converted
to binary format in advance and should be placed on LOCAL DISK, preferably SSD. For
KenLM, you should use the trie data structure, not the probing data structure.
If the LM was binarized using IRSTLM, append .mm to the file name and change the ini file to
reflect this, e.g. change
IRSTLM .... path=file.lm
to
IRSTLM .... path=file.lm.mm
If the LM was binarized using KenLM, add the argument lazyken=true, e.g. change
KENLM ....
to
KENLM .... lazyken=true
3.4.7 Suffix array
Suffix arrays store the entire parallel corpus and word alignment information in memory,
instead of the phrase table. The parallel corpus and alignment file are often much smaller than
the phrase table. For example, for the Europarl German-English (gzipped files):
de = 94MB
en = 84MB
alignment = 57MB
phrase-based = 2.0GB
hierarchical = 16.0GB
Therefore, it is more memory-efficient to store the corpus in memory rather than the entire
phrase table. It is usually structured as a suffix array to enable fast extraction of translations.
Translations are extracted as needed, usually per input test set, or per input sentence.
Moses supports two different implementations of suffix arrays: one for phrase-based models
and one for hierarchical models.
3.4.8 Cube Pruning
Cube pruning limits the number of hypotheses created for each stack (or chart cell in chart
decoding). It is essential for chart decoding (otherwise decoding will take a VERY long time)
and an option in phrase-based decoding.
In the phrase-based decoder, add:
There is a speed-quality tradeoff: a lower pop limit means less work for the decoder, so faster
decoding but less accurate translation.
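As a sketch, cube pruning in the phrase-based decoder is typically switched on by selecting the cube-pruning search algorithm and setting a pop limit; the flag names below are from the Moses decoder, and the value 1000 is illustrative:

```
moses -f moses.ini -search-algorithm 1 -cube-pruning-pop-limit 1000 < in.txt > out.txt
```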
3.4.9 Minimizing memory during training
TODO: MGIZA with reduced memory, snt2cooc
3.4.10 Minimizing memory during decoding
The biggest consumers of memory during decoding are typically the models. Here are some
links on how to reduce the size of each.
Language model:
* use KenLM with the trie data structure
* use on-demand loading
Translation model:
* use phrase table pruning
* use a compact phrase table
* filter the translation model given the text you want to translate
Reordering model:
* similar techniques as for translation models are possible: pruning, compact tables, and filtering.
Compile-time options
These options can be added to the bjam command line, trading generality for performance.
You should do a full rebuild with -a when changing the values of most of these options.
Don’t use factors? Add
Tailor KenLM’s maximum order to only what you need. If your highest-order language model
has order 5, add
Turn debug symbols off for speed and a little more memory.
But don’t expect support from the mailing list until you rerun with debug symbols on!
Don’t care about debug messages?
Download tcmalloc and see BUILD-INSTRUCTIONS.txt in Moses for installation instructions.
bjam will automatically detect tcmalloc's presence and link against it for multi-threaded builds.
Install Boost and zlib static libraries. Then link statically:
This may mean you have to install Boost and zlib yourself.
Running single-threaded? Add threading=single.
Using hierarchical or string-to-tree models, but none with source syntax?
3.4.11 Phrase-table types
Moses has multiple phrase table implementations. The one that suits you best depends on the
model you’re using (phrase-based or hierarchical/syntax), and how much memory your server
Here is a complete list of the types:
Memory - this reads the phrase table into memory. For phrase-based models and chart decoding.
Note that this is much faster than the Binary and OnDisk phrase table formats, but it uses a
lot of RAM.
Binary - the phrase table is converted into a ’database’. Only the translations that are required
are loaded into memory, therefore requiring less memory, but potentially slower to run. For
phrase-based models.
OnDisk - reimplementation of Binary for chart decoding.
SuffixArray - stores the parallel training data and word alignment in memory, instead of the
phrase table. Extraction is done on the fly. It also has a feature where you can add parallel
data while the decoder is running (’Dynamic Suffix Array’). For phrase-based models. See
Levenberg et al. (2010).
ALSuffixArray - suffix array for hierarchical models. See Lopez (2008).
FuzzyMatch - implementation of Koehn and Senellart (2010).
Hiero - like SCFG, but translation rules are in standard Hiero-style format
Compact - for phrase-based models. See Junczys-Dowmunt (2012).
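The phrase-table type is selected in moses.ini by the feature-function name. The following is a hedged sketch, not a definitive configuration: the paths and the num-features value are illustrative, and while PhraseDictionaryMemory and PhraseDictionaryCompact are the feature names used in recent Moses versions, you should confirm the exact spelling for your version.

```
[feature]
# In-memory table: fastest, but highest RAM use
PhraseDictionaryMemory name=TranslationModel0 num-features=4 path=/path/to/phrase-table.gz input-factor=0 output-factor=0

# Alternative: compact phrase table for phrase-based models
# PhraseDictionaryCompact name=TranslationModel0 num-features=4 path=/path/to/phrase-table.minphr input-factor=0 output-factor=0
```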
3.5 Experiment Management System
3.5.1 Introduction
The Experiment Management System (EMS), or Experiment.perl, for lack of a better name,
makes it much easier to perform experiments with Moses.
There are many steps in running an experiment: the preparation of training data, building
language and translation models, tuning, testing, scoring and analysis of the results. For most
of these steps, a different tool needs to be invoked, so this easily becomes very messy.
Here is a typical example:
This graph was automatically generated by Experiment.perl. All that needed to be done was
to specify one single configuration file that points to data files and settings for the experiment.
In the graph, each step is a small box. For each step, Experiment.perl builds a script file that
gets either submitted to the cluster or run on the same machine. Note that some steps are quite
involved, for instance tuning: On a cluster, the tuning script runs on the head node and submits
jobs to the queue itself.
Experiment.perl makes it easy to run multiple experimental runs with different settings or data
resources. It automatically detects which steps do not have to be executed again but instead
which results from an earlier run can be re-used.
Experiment.perl also offers a web interface to the experimental runs for easy access and com-
parison of experimental results.
The web interface also offers some basic analysis of results, such as comparing the n-gram
matches between two different experimental runs:
3.5.2 Requirements
In order to run properly, EMS will require:
The GraphViz toolkit,
The ImageMagick toolkit, and
The GhostView tool.
3.5.3 Quick Start
Experiment.perl is extremely simple to use:
Find experiment.perl in scripts/ems
Get a sample configuration file (for instance scripts/ems/example/config.toy).
Set up a working directory for your experiments for this task (mkdir does it).
Edit the following path settings in config.toy
Run experiment.perl -config config.toy from your experiment working directory.
Marvel at the graphical plan of action.
Run experiment.perl -config config.toy -exec.
Check the results of your experiment (in evaluation/report.1)
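The quick-start steps above, as a shell transcript. This is a sketch: it assumes Moses is checked out under ~/mosesdecoder and that you edit the path settings in config.toy by hand in between; adjust the paths to your installation.

```shell
mkdir ~/experiment && cd ~/experiment
cp ~/mosesdecoder/scripts/ems/example/config.toy .
# ... edit the path settings in config.toy ...
~/mosesdecoder/scripts/ems/experiment.perl -config config.toy        # shows the plan only
~/mosesdecoder/scripts/ems/experiment.perl -config config.toy -exec  # actually runs it
# afterwards, inspect evaluation/report.1
```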
Let us take a closer look at what just happened.
The configuration file config.toy consists of several sections. For instance there is a section for
each language model corpus to be used. In our toy example, this section contains the following:
### raw corpus (untokenized)
raw-corpus = $toy-data/nc-5k.$output-extension
The setting raw-corpus specifies the location of the corpus. The definition uses the variables
$toy-data and $output-extension, which are also settings defined elsewhere in the configu-
ration file. These variables are resolved, leading to the file path ems/examples/data/nc-5k.en
in your Moses scripts directory.
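For illustration, the settings involved in this resolution might be laid out as follows in config.toy. The value of toy-data shown here is an assumption; check the actual example file for the exact path.

```
toy-data = $moses-script-dir/ems/examples/data
output-extension = en

[LM:toy]
raw-corpus = $toy-data/nc-5k.$output-extension
# resolves to: .../ems/examples/data/nc-5k.en
```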
The authoritative definition of the steps and their interaction is in the file experiment.meta (in
the same directory as experiment.perl: scripts/ems).
The logic of experiment.meta is that it wants to create a report at the end. To generate the
report it needs evaluation scores; to get these it needs decoding output; to get that it needs
to run the decoder; to be able to run the decoder it needs a trained model; and to train a model it
needs data. This process of defining the agenda of steps to be executed is very similar to the
Make utility in Unix.
We can find the following step definitions for the language model module in experiment.meta:
get-corpus
in: get-corpus-script
out: raw-corpus
default-name: lm/txt
template: IN > OUT

tokenize
in: raw-corpus
out: tokenized-corpus
default-name: lm/tok
pass-unless: output-tokenizer
template: $output-tokenizer < IN > OUT
parallelizable: yes
The tokenization step tokenize requires raw-corpus as input. In our case, we specified the
setting in the configuration file. We could have also specified an already tokenized corpus with
tokenized-corpus. This would allow us to skip the tokenization step. Or, to give another
example, we could have not specified raw-corpus but rather specified a script that generates the
corpus with the setting get-corpus-script. This would have triggered the creation of the step get-corpus.
The steps are linked with the definition of their input in and output out. Each step also has a
default name for the output (default-name) and other settings.
The tokenization step has as default name lm/tok. Let us look at the directory lm to see which
files it contains:
% ls -tr lm/*
We find the output of the tokenization step in the file lm/toy.tok.1. The toy was added from
the name definition of the language model (see [LM:toy] in config.toy). The 1 was added
because this is the first experimental run.
The directory steps contains the script that executes each step, its STDERR and STDOUT out-
put, and meta-information. For instance:
% ls steps/1/LM_toy_tokenize.1* | cat
The file steps/1/LM_toy_tokenize.1 is the script that is run to execute the step. The file with
the extension DONE is created when the step is finished - this communicates to the scheduler that
subsequent steps can be executed. The file with the extension INFO contains meta-information
- essentially the settings and dependencies of the step. This file is checked to detect if a step can
be re-used in new experimental runs.
If the step crashed, we expect some indication of a fault in STDERR (for instance the
words core dumped or killed). This file is checked to see if the step was executed successfully,
so subsequent steps can be scheduled or the step can be re-used in new experiments. Since the
STDERR file may be very large (some steps create megabytes of such output), a digested version
is created in STDERR.digest. If the step was successful, it is empty. Otherwise it contains the
error pattern that triggered the failure detection.
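The failure check described here can be sketched in plain shell: scan a step's STDERR for known error patterns and keep only the matching lines as the digest. The file names and the exact pattern list are illustrative stand-ins, not EMS's actual code.

```shell
# Simulate a step's STDERR output:
printf 'loading phrase table\nsegmentation fault (core dumped)\n' > STDERR.demo

# Keep only lines matching failure patterns, as the digest:
grep -E 'core dumped|killed|error|died|not found' STDERR.demo > STDERR.digest.demo || true

# An empty digest would mean the step looked successful; here it is not:
cat STDERR.digest.demo
```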
Let us now take a closer look at re-use. If we run the experiment again but change some of the
settings, say, the order of the language model, then there is no need to re-run the tokenization.
Here is the definition of the language model training step in experiment.meta:
in: split-corpus
out: lm
default-name: lm/lm
ignore-if: rlm-training
rerun-on-change: lm-training order settings
template: $lm-training -order $order $settings -text IN -lm OUT
error: cannot execute binary file
The mention of order in the list behind rerun-on-change informs experiment.perl that this
step does need to be re-run if the order of the language model changes. Since none of the
settings in the chain of steps leading up to the training has been changed, the step can be re-used.
Try changing the language model order (order = 5 in config.toy), run experiment.perl again
(experiment.perl -config config.toy) in the working directory, and you will see the new
language model in the directory lm:
% ls -tr lm/*
3.5.4 More Examples
The example directory contains some additional examples.
These require the training and tuning data released for the Shared Translation Task for WMT
2010. Create a working directory, and change into it. Then execute the following steps:
mkdir data
cd data
tar xzf training-parallel.tgz
tar xzf dev.tgz
cd ..
The examples using these corpora are
config.basic - a basic phrase based model,
config.factored - a factored phrase based model,
config.hierarchical - a hierarchical phrase based model, and
config.syntax - a target syntax model.
In all these example configuration files, most corpora are commented out. This is done by
adding the word IGNORE at the end of a corpus definition (also for the language models). This
allows you to run a basic experiment with just the News Commentary corpus, which finishes
relatively quickly. Remove the IGNORE to include more training data. You may run into memory
and disk space problems when using some of the larger corpora (especially the news language
model), depending on your computing infrastructure.
If you decide to use multiple corpora for the language model, you may also want to try out
interpolating the individual language models (instead of using them as separate feature func-
tions). For this, you need to comment out the IGNORE next to the [INTERPOLATED-LM] section.
You may also specify different language pairs by changing the input-extension, output-extension,
and pair-extension settings.
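For example, a French-English setup might use the following settings. The values are illustrative; the extensions must match the naming of your corpus files.

```
input-extension = fr
output-extension = en
pair-extension = fr-en
```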
Finally, you can run all the experiments with the different given configuration files and the
data variations in the same working directory. The experimental management system figures
out automatically which processing steps do not need to be repeated because they can be re-used
from prior experimental runs.
Phrase Model
Phrase models are, compared to the following examples, the simplest models to be trained with
Moses and the fastest models to run. You may prefer these models over the more sophisticated
models whose added complexity may not justify the small (if any) gains.
The example config.basic is similar to the toy example, except for larger training and test
corpora. Also, the tuning stage is not skipped. Thus, even with most of the corpora commented
out, the entire experimental run will likely take a day, with most time taken up by word align-
ment (TRAINING_run-giza and TRAINING_run-giza-inverse) and tuning (TUNING_tune).
Factored Phrase Model
Factored models allow for additional annotation at the word level which may be exploited
in various models. The example in config.factored uses part-of-speech tags on the English
target side.
Annotation with part-of-speech tags is done with MXPOST, which needs to be installed first.
Please read the installation instructions. After this, you can run experiment.perl with the
configuration file config.factored.
If you compare the factored example config.factored with the phrase-based example config.basic,
you will notice the definition of the factors used:
### factored training: specify here which factors used
# if none specified, single factor training is assumed
# (one translation step, surface to surface)
input-factors = word
output-factors = word pos
alignment-factors = "word -> word"
translation-factors = "word -> word+pos"
reordering-factors = "word -> word"
#generation-factors =
decoding-steps = "t0"
the factor definition:
# also used for output factors
temp-dir = $working-dir/training/factor
### script that generates this factor
mxpost = /home/pkoehn/bin/mxpost
factor-script = "$moses-script-dir/training/wrappers/make-factor-en-pos.mxpost.perl -mxpost $mxpost"
and the specification of a 7-gram language model over part-of-speech tags:
factors = "pos"
order = 7
settings = "-interpolate -unk"
raw-corpus = $wmt10-data/training/news-commentary10.$pair-extension.$output-extension
This factored model using all the available corpora is identical to the Edinburgh submission
to the WMT 2010 shared task for English-Spanish, Spanish-English, and English-German lan-
guage pairs (the French language pairs also used the 10^9 corpus, the Czech language pairs did
not use the POS language model, and German-English used additional pre-processing steps).
Hierarchical model
Hierarchical phrase models allow for rules with gaps. Since these are represented by non-
terminals and such rules are best processed with a search algorithm that is similar to syntactic
chart parsing, such models fall into the class of tree-based or grammar-based models. For more
information, please check the Syntax Tutorial (Section 3.3).
From the view of setting up hierarchical models with experiment.perl, very little has to be
changed in comparison to the configuration file for phrase-based models:
% diff config.basic config.hierarchical
< decoder = $moses-src-dir/bin/moses
> decoder = $moses-src-dir/bin/moses_chart
< ttable-binarizer = $moses-src-dir/bin/processPhraseTable
> #ttable-binarizer = $moses-src-dir/bin/processPhraseTable
< #ttable-binarizer = "$moses-src-dir/bin/CreateOnDiskPt 1 1 5 100 2"
> ttable-binarizer = "$moses-src-dir/bin/CreateOnDiskPt 1 1 5 100 2"
< lexicalized-reordering = msd-bidirectional-fe
> #lexicalized-reordering = msd-bidirectional-fe
< #hierarchical-rule-set = true
> hierarchical-rule-set = true
< decoder-settings = "-search-algorithm 1 -cube-pruning-pop-limit 5000 -s 5000"
> #decoder-settings = ""
The changes are: a different decoder binary (by default compiled into bin/moses_chart) and
a different ttable-binarizer are used. The decoder settings for phrasal cube pruning do not apply. Also,
hierarchical models do not allow for lexicalized reordering (their rules fulfill the same purpose),
and the setting for hierarchical rule sets has to be turned on. The use of hierarchical rules is
indicated with the setting hierarchical-rule-set.
Target syntax model
Syntax models imply the use of linguistic annotation for the non-terminals of hierarchical mod-
els. This requires running a syntactic parser.
In our example config.syntax, syntax is used only on the English target side. The syntactic
constituents are labeled with the Collins parser, which needs to be installed first. Please read the
installation instructions.
Compared to the hierarchical model, very little has to be changed in the configuration file:
% diff config.hierarchical config.syntax
> # syntactic parsers
> collins = /home/pkoehn/bin/COLLINS-PARSER
> output-parser = "$moses-script-dir/training/wrappers/parse-en-collins.perl"
< #extract-settings = ""
> extract-settings = "--MinHoleSource 1 --NonTermConsecSource"
The parser needs to be specified, and the extraction settings may be adjusted. And you are
ready to go.
3.5.5 Try a Few More Things
Stemmed Word Alignment
The factored translation model training makes it very easy to set up word alignment not based
on the surface form of words, but any other property of a word. One relatively popular method
is to use stemmed words for word alignment.
There are two reasons for this: For one, for morphologically rich languages, stemming over-
comes data sparsity problems. Secondly, GIZA++ may have difficulties with very large vocab-
ulary sizes, and stemming reduces the number of unique words.
To set up stemmed word alignment in experiment.perl, you need to define a stem as a factor:
factor-script = "$moses-script-dir/training/wrappers/make-factor-stem.perl 4"
factor-script = "$moses-script-dir/training/wrappers/make-factor-stem.perl 4"
and indicate the use of this factor in the TRAINING section:
input-factors = word stem4
output-factors = word stem4
alignment-factors = "stem4 -> stem4"
translation-factors = "word -> word"
reordering-factors = "word -> word"
#generation-factors =
decoding-steps = "t0"
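What the stem4 factor looks like can be illustrated with a stand-in one-liner that truncates every word to its first four characters. This mimics the effect of make-factor-stem.perl 4; it is not the actual wrapper script.

```shell
# Reduce every word to its first four characters:
echo "new word generation" | awk '{for(i=1;i<=NF;i++) $i=substr($i,1,4); print}'
# prints: new word gene
```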
Using Multi-Threaded GIZA++
GIZA++ is one of the slowest steps in the training pipeline. Qin Gao implemented a multi-
threaded version of GIZA++, called MGIZA, which speeds up word alignment on multi-core
machines.
To use MGIZA, you will first need to install it.
To use it, you simply need to add some training options in the section TRAINING:
### general options
training-options = "-mgiza -mgiza-cpus 8"
Using Berkeley Aligner
The Berkeley Aligner is an alternative to GIZA++ for word alignment. You may (or may not) get
better results using this tool.
To use the Berkeley Aligner, you will first need to install it.
The example configuration file already has a section for the parameters for the tool. You need
to un-comment them and adjust berkeley-jar to your installation. You should comment out
alignment-symmetrization-method, since this is a GIZA++ setting.
### symmetrization method to obtain word alignments from giza output
# (commonly used: grow-diag-final-and)
#alignment-symmetrization-method = grow-diag-final-and
### use of berkeley aligner for word alignment
use-berkeley = true
alignment-symmetrization-method = berkeley
berkeley-train = $moses-script-dir/ems/support/
berkeley-process = $moses-script-dir/ems/support/
berkeley-jar = /your/path/to/berkeleyaligner-2.1/berkeleyaligner.jar
berkeley-java-options = "-server -mx30000m -ea"
berkeley-training-options = "-Main.iters 5 5 -EMWordAligner.numThreads 8"
berkeley-process-options = "-EMWordAligner.numThreads 8"
berkeley-posterior = 0.5
The Berkeley Aligner proceeds in two steps: a training step to learn the alignment model from
the data and a processing step to find the best alignment for the training data. The processing step has the
parameter berkeley-posterior to adjust a bias towards more or less alignment points. You
can try different runs with different values for this parameter. Experiment.perl will not re-run
the training step, just the processing step.
Using Dyer’s Fast Align
Another alternative to GIZA++ is fast_align from Dyer et al. It runs much faster, and may
even give better results, especially for language pairs without much large-scale reordering.
To use fast_align, you will first need to install it.
The example configuration file already has an example setting for the tool, using the recommended
defaults. Just remove the comment marker @#@ before the setting:
### use of Chris Dyer’s fast align for word alignment
fast-align-settings = "-d -o -v"
Experiment.perl assumes that you copied the binary into the usual external bin dir (setting
external-bin-dir) where GIZA++ and other external binaries are located.
IRST Language Model
The provided examples use the SRI language model during decoding. If you want to use
IRSTLM instead, an additional processing step is required: the language model has to be
converted into a binary format.
This part of the LM section defines the use of IRSTLM:
### script to use for binary table format for irstlm
# (default: no binarization)
#lm-binarizer = $moses-src-dir/irstlm/bin/compile-lm
### script to create quantized language model format (irstlm)
# (default: no quantization)
#lm-quantizer = $moses-src-dir/irstlm/bin/quantize-lm
If you un-comment lm-binarizer, IRSTLM will be used. If you additionally un-comment
lm-quantizer, the language model will be compressed into a more compact representation.
Note that the values above assume that you installed the IRSTLM toolkit in the directory used in these settings.
Randomized Language Model
Randomized language models allow a much more compact (but lossy) representation. Being
able to use much larger corpora for the language model may outweigh the small chance
of making mistakes.
First of all, you need to install the RandLM toolkit.
There are two different ways to train a randomized language model. One is to train it from
scratch. The other is to convert an SRI language model into a randomized representation.
Training from scratch: Find the following section in the example configuration files and un-
comment the rlm-training setting. Note that the section below assumes that you installed the
randomized language model toolkit in the directory $moses-src-dir/randlm.
### tool to be used for training randomized language model from scratch
# (more commonly, a SRILM is trained)
rlm-training = "$moses-src-dir/randlm/bin/buildlm -falsepos 8 -values 8"
Converting SRI language model: Find the following section in the example configuration files
and un-comment the lm-randomizer setting.
### script to use for converting into randomized table format
# (default: no randomization)
lm-randomizer = "$moses-src-dir/randlm/bin/buildlm -falsepos 8 -values 8"
You may want to try other values for falsepos and values. Please see the language model
section on RandLM for some more information about these parameters.
You can also randomize an interpolated language model by specifying the lm-randomizer in the
section [INTERPOLATED-LM].
Compound Splitting
Compounding languages, such as German, allow the creation of long words such as Neuwortgenerierung
(new word generation). Such words result in a lot of unknown words in any text, so splitting
up these compounds is a common method when translating from such languages.
Moses offers a support tool that splits up words if the geometric average of the frequencies of
its parts is higher than the frequency of the word itself. The method requires a model (the frequency
statistics of words in a corpus), so there is a training step and an application step.
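The splitting criterion can be checked with a toy computation. The frequencies below are invented for illustration; the point is only the comparison between the geometric average of the part frequencies and the frequency of the full word.

```shell
# Split the compound if the geometric average of the part frequencies
# exceeds the frequency of the full word.
awk 'BEGIN {
  word_freq = 2                 # invented freq of the full compound
  f1 = 100; f2 = 40; f3 = 60    # invented freqs of its three parts
  geo = (f1 * f2 * f3) ^ (1/3)  # geometric average of the parts
  if (geo > word_freq) print "split"; else print "keep"
}'
# prints: split
```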
Such word splitting can be added to experiment.perl simply by specifying the splitter script in
the GENERAL section:
input-splitter = $moses-script-dir/generic/compound-splitter.perl
Splitting words on the output side is currently not supported.
3.5.6 A Short Manual
The basic lay of the land is: experiment.perl breaks up the training, tuning, and evaluating
of a statistical machine translation system into a number of steps, which are then scheduled
to run in parallel or sequence depending on their inter-dependencies and available resources.
The possible steps are defined in the file experiment.meta. An experiment is defined by a
configuration file.
The main modules of running an experiment are:
CORPUS: preparing a parallel corpus,
INPUT-FACTOR and OUTPUT-FACTOR: commands to create factors,
TRAINING: training a translation model,
LM: training a language model,
INTERPOLATED-LM: interpolating language models,
SPLITTER: training a word splitting model,
RECASING: training a recaser,
TRUECASING: training a truecaser,
TUNING: running minimum error rate training to set component weights,
TESTING: translating and scoring a test set, and
REPORTING: compiling all scores in one file.
The actual steps, their dependencies and other salient information are to be found in the file
experiment.meta. Think of experiment.meta as a "template" file.
Here the parts of the step description for CORPUS:get-corpus and CORPUS:tokenize:
in: get-corpus-script
out: raw-stem
in: raw-stem
out: tokenized-stem
Each step takes some input (in) and provides some output (out). This also establishes the de-
pendencies between the steps. The step tokenize requires the input raw-stem. This is provided
by the step get-corpus.
experiment.meta provides a generic template for steps and their interaction. For an actual
experiment, a configuration file determines which steps need to be run. This configuration file
is the one that is specified when invoking experiment.perl. It may contain, for instance, the following:
### raw corpus files (untokenized, but sentence aligned)
raw-stem = $europarl-v3/training/
Here, the parallel corpus to be used is named europarl and it is provided in raw text for-
mat in the location $europarl-v3/training/ (the variable $europarl-v3
is defined elsewhere in the config file). The effect of this specification in the config file is that
the step get-corpus does not need to be run, since its output is given as a file. More on the
configuration file below in the next section.
Several types of information are specified in experiment.meta:
in and out: Establish dependencies between steps; input may also be provided by files
specified in the configuration.
default-name: Name of the file in which the output of the step will be stored.
template: Template for the command that is placed in the execution script for the step.
template-if: Potential command for the execution script. Only used if the first parameter exists.
error: experiment.perl detects if a step failed by scanning STDERR for key words such
as killed, error, died, not found, and so on. Additional key words and phrases are provided
with this parameter.
not-error: Declares default error key words as not indicating failures.
pass-unless: Only if the given parameter is defined, this step is executed, otherwise the
step is passed (illustrated by a yellow box in the graph).
ignore-unless: If the given parameter is defined, this step is not executed. This overrides
requirements of downstream steps.
rerun-on-change: If similar experiments are run, the output of steps may be re-used if
input and parameter settings are the same. This specifies a number of parameters whose
change disallows re-use in a different run.
parallelizable: When running on the cluster, this step may be parallelized (only if
generic-parallelizer is set in the config file; the script can be found in $moses-script-dir/scripts/ems/support).
qsub-script: If running on a cluster, this step is run on the head node, and not submitted
to the queue (because it submits jobs itself).
Here is now the full definition of the step CORPUS:tokenize:
in: raw-stem
out: tokenized-stem
default-name: corpus/tok
pass-unless: input-tokenizer output-tokenizer
template-if: input-tokenizer IN.$input-extension OUT.$input-extension
template-if: output-tokenizer IN.$output-extension OUT.$output-extension
parallelizable: yes
The step takes raw-stem and produces tokenized-stem. It is parallelizable with the generic
parallelizer.
The output is stored in the file corpus/tok. Note that the actual file name also contains the corpus
name and the run number. Also, in this case, the parallel corpus is stored in two files, so the file
name may be something like corpus/ and corpus/europarl.tok.1.en.
The step is only executed if either input-tokenizer or output-tokenizer is specified. The
templates indicate what the command lines in the execution script for the step look like.
Multiple Corpora, One Translation Model
We may use multiple parallel corpora for training a translation model or multiple monolingual
corpora for training a language model. Each of these has its own instance of the CORPUS
and LM modules. There may also be multiple test sets in TESTING. However, there is only one
translation model and hence only one instance of the TRAINING module.
The definitions in experiment.meta reflect the different nature of these modules. For instance
CORPUS is flagged as multiple, while TRAINING is flagged as single.
When defining settings for the different modules, the singular module TRAINING has only one
section, while a multiple module such as LM has one general section and a specific section for each
language model corpus. In the specific section, the corpus is named, e.g. LM:europarl.
As you may imagine, the tracking of dependencies between steps of different types of modules
and the consolidation of corpus-specific instances of modules is a bit complex. But most of that
is hidden from the user of the Experiment Management System.
When looking up the parameter settings for a step, first the set-specific section (LM:europarl)
is consulted. If there is no definition, then the module definition (LM) and finally the general
definition (in section GENERAL) is consulted. In other words, local settings override global settings.
Defining Settings
The configuration file for experimental runs is a collection of parameter settings, one per line,
with empty lines and comment lines for better readability, organized in sections for each of the
modules.
The syntax of a setting definition is setting = value (note: spaces around the equal sign). If the
value contains spaces, it must be placed in quotes (setting = "the value"), except when
a vector of values is implied (only used when defining a list of factors: output-factor = word pos).
Comments are indicated by a hash (#).
The start of sections is indicated by the section name in square brackets ([TRAINING] or [CORPUS:europarl]).
If the word IGNORE is appended to a section definition, then the entire section is ignored.
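A small fragment illustrating these syntax rules. The section and setting names are taken from the examples in this manual; the values are illustrative only.

```
# comment lines start with a hash
[TRAINING]
# a value with spaces must be quoted
decoder-settings = "-search-algorithm 1"
# a vector of values (list of factors) is written without quotes
output-factor = word pos

# appending IGNORE switches the entire section off
[CORPUS:news-commentary IGNORE]
raw-stem = $wmt10-data/training/news-commentary
```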
Settings can be used as variables to define other settings:
working-dir = /home/pkoehn/experiment
wmt10-data = $working-dir/data
Variable names may be placed in curly brackets for clearer separation:
wmt10-data = ${working-dir}/data
Such variable references may also reach other modules:
tokenized = $LM:europarl:tokenized-corpus
Finally, reference can be made to settings that are not defined in the configuration file, but are
the product of the defined sequence of steps.
Say, in the above example, tokenized-corpus is not defined in the section LM:europarl, but
instead raw-corpus is. Then the tokenized corpus is produced by the normal processing pipeline.
Such an intermediate file can be used elsewhere:
tokenized = [LM:europarl:tokenized-corpus]
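Side by side, the two reference forms differ in what they require (module and setting names as in the examples above):

```
# $-reference: tokenized-corpus must itself be defined in [LM:europarl]
tokenized = $LM:europarl:tokenized-corpus

# [...]-reference: also works when only raw-corpus is defined there;
# it refers to the tokenized file that the pipeline will produce
tokenized = [LM:europarl:tokenized-corpus]
```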
Some error checking is done on the validity of the values. All values that seem to be file paths
trigger an existence check for such files. A file with the prefix of the value must exist.
There are a lot of settings reflecting the many steps, and explaining them all would require explaining
the entire training, tuning, and testing pipeline. Please find the documentation for
each step elsewhere in this manual. Every effort has been made to include verbose descriptions in the
example configuration files, which should be taken as a starting point.
Working with Experiment.Perl
You have to define an experiment in a configuration file and the Experiment Management
System figures out which steps need to be run and schedules them either as jobs on a cluster or
runs them serially on a single machine.
Other options:
-no-graph: Suppresses the display of the graph.
-continue RUN: Continues the experiment RUN, which crashed earlier. Make sure that
the crashed step and its output are deleted (see more below).
-delete-crashed RUN: Delete all step files and their output files for steps that have crashed
in a particular RUN.
-delete-run RUN: Delete all step files and their output files for steps for a given RUN,
unless these steps are used by other runs.
-delete-version RUN: Same as above.
-max-active: Specifies the number of steps that can be run in parallel when running on
a single machine (default: 2, not used when run on cluster).
-sleep: Sets the number of seconds to be waited in the scheduler before the completion
of tasks is checked (default: 2).
-ignore-time: Changes the re-use behavior. By default files cannot be re-used when
their time stamp changed (typically because a tool such as the tokenizer was changed, thus
requiring re-running all tokenization steps in new experiments). With this switch, files
with changed time stamps can be re-used.
-meta: Allows the specification of a custom experiment.meta file, instead of using the
one in the same directory as the experiment.perl script.
-final-step STEP: Do not run a complete experiment, but finish at the specified STEP.
-final-out OUT: Do not run a complete experiment, but finish when the specified output
file OUT is created. These are the output file specifiers as used in experiment.meta.
-cluster: Indicates that the current machine is a cluster head node. Step files are sub-
mitted as jobs to the cluster.
-multicore: Indicates that the current machine is a multi-core machine. This allows for
additional parallelization with the generic parallelizer setting.
The script may automatically detect if it is run on a compute cluster or a multi-core machine, if
this is specified in the file experiment.machines, for instance:
cluster: townhill seville
multicore-8: tyr thor
multicore-16: loki
defines the machines townhill and seville as GridEngine cluster machines, tyr and thor as
8-core machines, and loki as a 16-core machine.
Typically, experiments are started with the command:
experiment.perl -config my-config -exec
Since experiments run for a long time, you may want to run this in the background and also
set a nicer priority:
3.5. Experiment Management System 125
nice nohup experiment.perl -config my-config -exec >& OUT.[RUN] &
This also keeps a report (STDERR and STDOUT) on the execution in a file named, say, OUT.1,
with the number corresponding to the run number.
The meta-information for the run is stored in the directory steps. Each run has a sub directory
with its number (steps/1, steps/2, etc.). The sub directory steps/0 contains the step specifications
generated when experiment.perl is called without the -exec switch.
The sub directories for each run contain the step definitions, as well as their meta-information
and output. The sub directories also contain a copy of the configuration file (e.g. steps/1/config.1),
the agenda graph (e.g. steps/1/graph.1.{dot,ps,png}), a file containing all expanded pa-
rameter settings (e.g. steps/1/parameter.1), and an empty file that is touched every minute
as long as the experiment is still running (e.g. steps/1/running.1).
Continuing Crashed Experiments
Steps may crash. No, steps will crash, be it because of faulty settings, faulty tools, problems
with the computing resources, willful interruption of an experiment, or an act of God.
The first step in continuing a crashed experiment is to detect the crashed step. It is shown
either as a red node in the displayed graph or reported on the command line in the last
lines before the crash (though this may not be obvious if parallel steps kept running
after it). Moreover, the automatic error detection is not perfect, and a step may have failed
upstream without detection, causing failure further down the road.
You should have an understanding of what each step does. Then, by looking at its STDERR
and STDOUT files, and the output files it should have produced, you can track down what went
wrong. Fix the problem, and delete all files associated with the failed step (e.g., rm steps/13/TUNING_tune.13*,
rm -r tuning/tmp.1). To find out what has been produced by the crashed step, you may need to
consult experiment.meta to see where the output of this step is placed.
You can automatically delete all crashed steps and their output files with
experiment.perl -delete-crashed 13 -exec
After removing the failed step and ensuring that the cause of the crash has been addressed,
you can continue a crashed experimental run (e.g., run number 13) with:
experiment.perl -continue 13 -exec
126 3. Tutorials
You may want to check what will be run by excluding the -exec switch at first. The graph
indicates which steps will be re-used from the original crashed run.
If the mistake was a parameter setting, you can change that setting in the stored configuration
file (e.g., steps/1/config.1). Take care, however, to delete all steps (and their subsequent
steps) that would have been run differently with that setting.
If an experimental run crashed early, or you do not want to repeat it, it may be easier to delete
the entire step directory (rm -r steps/13). Only do this with the latest experimental run (e.g.,
not when there is already a run 14), otherwise it may mess up the re-use of results.
You may also delete all output associated with a run with the command rm -r */*.13*. How-
ever this requires some care, so you may want to check first what you are deleting (ls */*.13).
Running a Partial Experiment
By default, experiment.perl will run a full experiment: model building, tuning and testing. You
may only want to run parts of the pipeline, for instance building a model, but not tuning and
testing. You can do this by specifying either a final step or a final outcome.
If you want to terminate at a specific step
experiment.perl -config my-config -final-step step-name -exec
where step-name is for instance TRAINING:create-config,LM:my-corpus:train, or TUNING:tune.
If you want to terminate once a particular output file is generated:
experiment.perl -config my-config -final-out out -exec
Examples for out are TRAINING:config,LM:my-corpus:lm, or TUNING:weight-config. In fact,
these three examples are identical to the three examples above, it is just another way to specify
the final point of the pipeline.
Technically, this works by not using REPORTING:report as the end point of the pipeline, but the
specified step.
Removing a Run
If you want to remove all the step files and output files associated with a particular run, you
can do this with, for instance:
experiment.perl -delete-run 13 -exec
If you run this without -exec you will see a list of files that would be deleted (but no files are
actually deleted).
Steps that are used in other runs, and the output files that they produced, are kept. Also, the
step directory (e.g., steps/13) is not removed. You may remove it by hand if there are no
step files left.
Running on a Cluster
Experiment.perl works with Sun GridEngine clusters. The script needs to be run on the head
node and jobs are scheduled on the nodes.
There are two ways to tell experiment.perl that the current machine is a cluster computer: one
is to use the switch -cluster, the other is to add the machine name to experiment.machines.
The configuration file has a section that allows for cluster-specific settings. The
setting jobs specifies into how many jobs the decoding is split during tuning and
testing. For more details on this, see Section 4.1.3 on running the decoder in parallel.
All other settings specify switches that are passed along with each submission of a job via qsub:
qsub-memory: number of memory slots (-pe memory NUMBER),
qsub-hours: number of hours reserved for each job (-l h_rt=NUMBER:0:0),
qsub-project: name of the project for user accounting (-P PROJECT), and
qsub-settings: any other setting that is passed along verbatim.
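Put together, these settings end up as switches on the qsub command line; an illustrative sketch with made-up values (job.sh is a placeholder, not an EMS file):

```shell
# Map the qsub-* settings above onto the corresponding qsub switches.
qsub_memory=4; qsub_hours=12; qsub_project=myproject
echo "qsub -pe memory $qsub_memory -l h_rt=$qsub_hours:0:0 -P $qsub_project job.sh"
```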
Note that the general settings can be overridden in each module definition; you may want to
have different settings for different steps.
If the setting generic-parallelizer is set (most often it is set to the EMS support script
$moses-script-dir/ems/support/generic-parallelizer.perl), then a number of additional
steps are parallelized. For instance, tokenization is performed by breaking up the corpus into
as many parts as specified with jobs, submitting jobs to process the parts in parallel to the
cluster, and piecing their output together upon completion.
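The split/process/merge strategy can be sketched with standard shell tools (a toy stand-in, not the EMS script; tr plays the role of the tokenizer):

```shell
# Toy sketch of the generic parallelizer's strategy: split the corpus,
# process the parts in parallel, then merge the outputs in order.
printf 'One\nTwo\nThree\nFour\nFive\nSix\n' > corpus.txt
split -l 2 corpus.txt part.                       # -> part.aa part.ab part.ac
for p in part.aa part.ab part.ac; do
  tr '[:upper:]' '[:lower:]' < "$p" > "$p.out" &  # one background job per part
done
wait                                              # wait for all jobs to finish
cat part.aa.out part.ab.out part.ac.out > corpus.processed
```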
Be aware that there are many different ways to configure a GridEngine cluster. Not all the
options described here may be available, and it may not work out of the box, due to your specific
installation.
Running on a Multi-core Machine
Using a multi-core machine means first of all that more steps can be scheduled in parallel.
There is also a generic parallelizer (generic-multicore-parallelizer.perl) that plays the
same role as the generic parallelizer for clusters.
However, decoding is not broken up into several pieces. It is more sensible to use multi-threading
in the decoder.
Web Interface
The introduction included some screen shots of the web interface to the Experiment Management
System. You will need a running web server on a machine (LAMPP on Linux or
MAMP on Mac does the trick) that has access to the file system where your working directory
is stored.
Copy or link the web directory (in scripts/ems) to a location served by the web server. Make
sure the web server user has write permissions on the web interface directory.
To add your experiments to this interface, add a line to the file /your/web/interface/dir/setup.
The format of the file is explained in the file.
3.5.7 Analysis
You can include additional analysis for an experimental run into the web interface by specifying
the setting analysis in its configuration file.
analysis = $moses-script-dir/ems/support/analysis.perl
This currently reports n-gram precision and recall statistics and color-coded n-gram correctness
markup for the output sentences.
The output is color-highlighted according to n-gram matches with the reference translation.
The following colors are used:
grey: word not in reference,
light blue: word part of 1-gram match,
blue: word part of 2-gram match,
dark blue: word part of 3-gram match, and
very dark blue: word part of 4-gram match.
Coverage Analysis
The setting analyze-coverage includes a coverage analysis: which words and phrases in the
input occur in the training data or the translation table? This is reported with color coding and
in a yellow report box when moving the mouse over a word or phrase. Also, summary
statistics for how many words occur how often are given, and a report on unknown or rare
words is generated.
Bilingual Concordancer
To more closely inspect where input words and phrases occur in the training corpus, the
analysis tool includes a bilingual concordancer. You turn it on by adding this line to the training
section of your configuration file:
biconcor = $moses-bin-dir/biconcor
During training, a suffix array of the corpus is built in the model directory. The analysis web
interface accesses these binary files to quickly scan for occurrences of source words and phrases
in the training corpus. For this to work, you need to include the biconcor binary in the web
root directory.
When you click on a word or phrase, the web page is augmented with a section that shows
all occurrences of that phrase in the corpus (or, for frequent phrases, a sample of them), and
how it was translated. Source occurrences (with context) are shown on the left half, the
aligned target on the right. In the main part, occurrences are grouped by the different
translations, which are also shown in bold in context. Unaligned boundary words are shown
in blue. The extraction heuristic extracts additional rules for these cases, but these are not
listed here for clarity.
At the end, source occurrences for which no rules could be extracted are shown. This may
happen because the source words are not aligned to any target words. In this case, the tool
shows alignments of the previous word (purple) and following word (olive), as well as some
neighboring unaligned words (again, in blue). Another reason for failure to extract rules are
misalignments, when the source phrase maps to a target span which contains words that also
align to outside source words (a violation of the coherence constraint). These misaligned words
(in source and target) are shown in red.
Note by Dingyuan Wang: the biconcor binary should be copied to the web interface directory.
Precision by coverage
To investigate further whether the correctness of the translation of input words depends on
their frequency in the corpus (and what the distribution of word frequency is), a report for
precision by coverage can be turned on with the following settings:
report-precision-by-coverage = yes
precision-by-coverage-factor = pos
precision-by-coverage-base = $working-dir/evaluation/test.analysis.5
Only the first setting report-precision-by-coverage is needed for the report. The second
setting precision-by-coverage-factor provides an additional breakdown for a specific input
factor (in the example, the part-of-speech factor named pos). More on the precision-by-coverage-base
setting below.
When clicking on "precision of input by coverage" on the main page, a precision by coverage
graph is shown:
The log-coverage class is on the x-axis (-1 meaning unknown, 0 singletons, 1 words that occur
twice, 2 words that occur 3-4 times, 3 words that occur 5-8 times, and so on). The scale of boxes
for each class is determined by the ratio of words in the class in the test set. The precision of
translations of words in a class is shown on the y-axis.
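The mapping from word count to log-coverage class described above can be sketched as follows (an illustrative helper, not part of EMS):

```shell
# Sketch of the log-coverage classification: class -1 for unknown words,
# otherwise ceil(log2(count)): 0 = singleton, 1 = twice, 2 = 3-4 times,
# 3 = 5-8 times, and so on.
coverage_class () {
  awk -v n="$1" 'BEGIN {
    if (n == 0) { print "-1"; exit }
    c = 0
    while (2^c < n) c++        # smallest c with 2^c >= n
    print c
  }'
}
coverage_class 1   # prints 0
coverage_class 6   # prints 3
```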
The precision of the translation of input words cannot be determined in a clear-cut way. Our
determination relies on the phrase alignment of the decoder, word alignment within phrases,
and accounting for multiple occurrences of translated words in output and reference
translations. Note that the precision metric does not penalize for dropping words, so this
is shown in a second graph (in blue), below the precision graph.
If you click on the graph, you will see the graph in tabular form. Following additional links
allows you to see breakdowns for the actual words, and even find the sentences in which they
occur.
Finally, the precision-by-coverage-base setting. For comparison purposes, it may be useful
to base the coverage statistics on the corpus of a previous run. For instance, if you add training
data, does the translation quality of the words increase? Well, a word that occurred 3 times in
the small corpus may now occur 10 times in the big corpus, hence the word is placed in a
different class. To maintain the original classification of words into the log-coverage classes,
you may use this setting to point to an earlier run.
Subsection last modified on August 06, 2017, at 04:16 PM
User Guide
4.1 Support Tools
4.1.1 Overview
Scripts are in the scripts subdirectory in the source release in the Git repository.
The following basic tools are described elsewhere:
Moses decoder (Section 3.1)
Training script train-model.perl (Section 5.3)
Corpus preparation clean-corpus-n.perl (Section 5.2)
Minimum error rate training (tuning) (Section 5.14)
4.1.2 Converting Pharaoh configuration files to Moses configuration files
Moses is a successor to the Pharaoh decoder, so models that work with Pharaoh can also be
used with Moses. The following script makes the necessary changes to the
configuration file:
exodus.perl < pharaoh.ini > moses.ini
4.1.3 Moses decoder in parallel
Since decoding large amounts of text takes a long time, you may want to split up the text into
blocks of a few hundred sentences (or less) and distribute the task across a Sun GridEngine
cluster. This is supported by the script, which is run as follows: -decoder decoder -config cfgfile -i input -jobs N [options]
134 4. User Guide
Use absolute paths for your parameters (decoder, configuration file, models, etc.).
decoder is the file location of the binary of Moses used for decoding
cfgfile is the configuration file of the decoder
input is the file to translate
N is the number of processors you require
options are used to overwrite parameters provided in cfgfile
Among them, the following two parameters for n-best list generation can be overwritten
(NOTE: they differ from the standard Moses parameters):
-n-best-file output file for nbest list
-n-best-size size of nbest list
4.1.4 Filtering phrase tables for Moses
Phrase tables easily get too big, but for the translation of a specific set of text only a fraction of
the table is needed. So, you may want to filter the translation table, and this is possible with
the script: filter-dir config input-file
This creates a filtered translation table with new configuration file in the directory filter-dir
from the model specified with the configuration file config (typically named moses.ini), given
the (tokenized) input from the file input-file.
In the advanced features section, you will find the additional option of binarizing the translation
and reordering tables, which allows these models to be kept on disk and queried by the decoder. If
you want to both filter and binarize these tables, you can use the script: filter-dir config input-file -Binarizer binarizer
The additional binarizer option points to the appropriate version of processPhraseTable.
4.1.5 Reducing and Extending the Number of Factors
Instead of running the factor-reduction and factor-combination scripts separately, this one
does both at the same time, and is better suited for our directory structure and factor naming
conventions:
czeng05.cs \
0,2 pos lcstem4 \
> czeng05_restricted_to_0,2_and_with_pos_and_lcstem4_added
4.1. Support Tools 135
4.1.6 Scoring translations with BLEU
A simple BLEU scoring tool is the script multi-bleu.perl:
multi-bleu.perl reference < mt-output
Reference file and system output have to be sentence-aligned (line X in the reference file
corresponds to line X in the system output). If multiple reference translations exist, they have to
be stored in separate files named reference0, reference1, reference2, etc. All the texts
need to be tokenized.
A popular script to score translations with BLEU is the NIST mteval script. It requires that the
text is wrapped in an SGML format. This format is used, for instance, by the NIST evaluations and
the WMT Shared Task evaluations. See the latter for more details on using this script.
4.1.7 Missing and Extra N-Grams
Missing n-grams are those that all reference translations contain but the MT system did not
produce. Extra n-grams are those that the MT system produced but none of the references
approved. The tool is invoked as follows: hypothesis reference1 reference2 ...
4.1.8 Making a Full Local Clone of Moses Model + ini File
Assume you have a moses.ini file already and want to run an experiment with it. Some
months from now, you might still want to know what exactly the model (incl. all the
tables) looked like, but people tend to move files around or just delete them.
To solve this problem, create a blank directory, go in there, and run: ../path/to/moses.ini

This will make a copy of the moses.ini file and local symlinks (and if possible also hardlinks,
in case someone deletes the original file) to all the tables and language models needed.
It will now be safe to run moses locally in the fresh directory.
4.1.9 Absolutizing Paths in moses.ini
Run: ../path/to/moses.ini > moses.abs.ini

to build an ini file where all paths to model parts are absolute. (It also checks that the
referenced files exist.)
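The core of the transformation is simply prefixing relative paths with the ini file's directory; a minimal sketch with made-up paths (the real script also handles the existence checks):

```shell
# Resolve a relative model path against the directory of the ini file.
ini_dir=/work/model
rel_path=../lm/europarl.lm
case "$rel_path" in
  /*) abs_path=$rel_path ;;            # already absolute: keep as-is
  *)  abs_path=$ini_dir/$rel_path ;;   # relative: prefix the ini's directory
esac
echo "$abs_path"
```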
4.1.10 Printing Statistics about Model Components
The script: moses.ini

prints basic statistics about all components mentioned in the moses.ini. This can be useful to
set the order of mapping steps to avoid an explosion of translation options, or just to check that
the model components are as big/detailed as we expect.
The sample output below lists information about a model with two translation steps and one
generation step. The three language models (over three factors) and their n-gram counts
(after discounting) are listed, too.
Translation 0 -> 1 (/fullpathto/phrase-table.0-1.gz):
743193 phrases total
1.20 phrases per source phrase
Translation 1 -> 2 (/fullpathto/phrase-table.1-2.gz):
558046 phrases total
2.75 phrases per source phrase
Generation 1,2 -> 0 (/fullpathto/generation.1,2-0.gz):
1.04 outputs per source token
Language model over 0 (/fullpathto/lm.1.lm):
1 2 3
49469 245583 27497
Language model over 1 (/fullpathto/lm.2.lm):
1 2 3
25459 199852 32605
Language model over 2 (/fullpathto/lm.3.lm):
709 20946 39885 45753 27964 12962 7524
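For a plain-text phrase table, the "phrases per source phrase" figure amounts to total entries divided by distinct source sides; a sketch on a made-up toy table, not the actual script:

```shell
# Build a tiny phrase table in the text format "src ||| tgt ||| scores".
cat > pt-sample.txt <<'EOF'
the ||| der ||| 0.3
the ||| die ||| 0.2
house ||| haus ||| 0.9
EOF
# Count all entries (n) and distinct source phrases (d); report n/d.
awk -F' \\|\\|\\| ' '{ n++; if (!($1 in seen)) { seen[$1] = 1; d++ } }
  END { printf "%.2f phrases per source phrase\n", n / d }' pt-sample.txt
```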
4.1.11 Recaser
Often, we train machine translation systems on lowercased data. If we want to present the
output to a user, we need to re-case (or re-capitalize) the output. Moses provides a simple
tool to recase data, which essentially runs Moses without reordering, using a word-to-word
translation model and a cased language model.
The recaser requires a model (i.e., the word mapping model and language model mentioned
above), which is trained with the command:
train-recaser.perl --dir MODEL --corpus CASED [--train-script TRAIN]
The script expects a cased (but tokenized) training corpus in the file CASED, and creates a
recasing model in the directory MODEL. KenLM's lmplz is used to train language models by
default; pass --lm to change the toolkit.
To recase output from the Moses decoder, you run the command
recase.perl --in IN --model MODEL/moses.ini --moses MOSES [--lang LANGUAGE] [--headline SGML] > OUT
The input is in file IN, the output in file OUT. You also need to specify a recasing model MODEL.
Since headlines are capitalized differently from regular text, you may want to provide an SGML file
that contains information about headlines. This file uses the NIST format, and may be identical
to source test sets provided by NIST or other evaluation campaigns. A language LANGUAGE
may also be specified, but only English (en) is currently supported.
By default, EMS trains a truecaser (see below). To use a recaser, you have to make the following
changes:
Comment out output-truecaser and detruecaser and add instead output-lowercaser
and EVALUATION:recaser.
Add IGNORE to the [TRUECASING] section, and remove it from the [RECASING] section.
Specify in the [RECASING] section which training corpus should be used for the recaser.
This is typically the target side of the parallel corpus or a large language model corpus.
You can directly link to a corpus already specified in the config file, e.g., tokenized =
4.1.12 Truecaser
Instead of lowercasing all training and test data, we may also want to keep words in their
natural case, and only change the words at the beginning of each sentence to their most frequent
form. This is what we mean by truecasing. Again, this requires first training a truecasing
model, which is a list of words and the frequencies of their different forms.
train-truecaser.perl --model MODEL --corpus CASED
The model is trained from the cased (but tokenized) training corpus CASED and stored in the
file MODEL.
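The idea behind the model can be illustrated with standard tools: count the casing variants of each token in cased text and keep the most frequent one (train-truecaser.perl is more careful, e.g. about sentence-initial tokens):

```shell
# Toy illustration of the truecasing model: tally how often each surface
# form of each word occurs; the most frequent casing of a word wins
# (here "European" appears twice, "the" beats "The", etc.).
printf 'the European Union\nthe new union\nThe European market\n' \
  | tr ' ' '\n' | sort | uniq -c | sort -rn
```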
Input to the decoder has to be truecased with the command
truecase.perl --model MODEL < IN > OUT
Output from the decoder has to be restored into regular case. This simply uppercases words at
the beginning of sentences:
detruecase.perl < in > out [--headline SGML]
An SGML file with headline information may be provided, as done with the recaser.
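What detruecasing does at its core can be sketched in one line (detruecase.perl additionally handles headlines and more complex sentence splitting):

```shell
# Uppercase the first letter of each sentence (one sentence per line here).
printf 'the cat sat .\nit was happy .\n' \
  | awk '{ print toupper(substr($0, 1, 1)) substr($0, 2) }'
```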
4.1.13 Searchgraph to DOT
This small tool converts a Moses search graph (-output-search-graph FILE option) to dot
format. The dot format can be rendered using the graphviz tool dot.
moses ... --output-search-graph temp.graph -s 3
# we suggest to use a very limited stack size, -s 3
sg2dot.perl [--organize-to-stacks] < temp.graph >
dot -Tps >
Using --organize-to-stacks makes nodes in the same stack appear in the same column (this
slows down the rendering; off by default).
Caution: the input must contain the searchgraph of one sentence only.
4.1.14 Threshold Pruning of Phrase Table
The phrase table trained by Moses contains by default all phrase pairs encountered in the
parallel training corpus. This often includes 100,000 different translations for the word "the" or the
comma ",". These may clog up various processing steps down the road, so it is helpful to prune
the phrase table down to the reasonable choices.
Threshold pruning is currently implemented at two different stages: You may filter the entire
phrase table file, or use threshold pruning as an additional filtering criterion when filtering the
phrase table for a given test set. In either case, phrase pairs are thrown out when their phrase
translation probability p(e|f) falls below a specified threshold. A safe number for this threshold
may be 0.0001, in the sense that it hardly changes any phrase translation while ridding the table
of a lot of junk.
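On the text phrase-table format "src ||| tgt ||| scores", the pruning amounts to a score comparison per line; a toy sketch, assuming p(e|f) is the third score (the position depends on your configuration):

```shell
# Two made-up phrase-table entries; the second has p(e|f) below 0.0001.
cat > pt-prune-sample.txt <<'EOF'
the ||| der ||| 0.3 0.2 0.4 0.3
the ||| xyz ||| 0.1 0.1 0.00001 0.1
EOF
# Keep only entries whose assumed p(e|f) (third score) meets the threshold.
awk -F' \\|\\|\\| ' '{ split($3, scores, " ")
  if (scores[3] + 0 >= 0.0001) print }' pt-prune-sample.txt
```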
Pruning the full phrase table file
The script scripts/training/threshold-filter.perl operates on any phrase table file:
threshold-filter.perl 0.0001 < PHRASE_TABLE > PHRASE_TABLE.reduced
If the phrase table is zipped, then:
zcat PHRASE_TABLE.gz | \
threshold-filter.perl 0.0001 | \
gzip - > PHRASE_TABLE.reduced.gz
While this often does not remove much of the phrase table (which to a large part contains
singleton phrase pairs with p(e|f)=1), it may nevertheless be helpful to also reduce the reordering
model. This can be done with a second script:
remove-orphan-phrase-pairs-from-reordering-table.perl PHRASE_TABLE \
< REORDERING_TABLE > REORDERING_TABLE.pruned
Again, this also works for zipped files:
zcat REORDERING_TABLE.gz | \
remove-orphan-phrase-pairs-from-reordering-table.perl PHRASE_TABLE | \
gzip - > REORDERING_TABLE.pruned.gz
Pruning during test/tuning set filtering
In the typical experimental setup, the phrase table is filtered for a tuning or test set using the script. During this process, we can also remove low-probability
phrase pairs. This can be done simply by adding the switch -MinScore, which takes a
specification of the following form: [...] -MinScore FIELD1:THRESHOLD1[,FIELD2:THRESHOLD2,...]
where FIELDn is the position of the score (typically 2 for the direct phrase probability p(e|f), or
0 for the indirect phrase probability p(f|e)) and THRESHOLDn the minimum probability allowed.
Subsection last modified on February 06, 2016, at 10:11 PM
4.2 External Tools
A very active community is engaged in statistical machine translation research, which has
produced a number of tools that may be useful for training a Moses system. Also, the more
linguistically motivated models (factored model, syntax model) require tools for the linguistic
annotation of corpora.
In this section, we list some useful tools. If you know (or are the developer of) anything we
missed here, please contact us and we can add it to the list. For more comprehensive listings of
MT tools, refer to the following pages:
List of Free/Open-source MT Tools, maintained by Mikel Forcada.
TAUS Tracker, a comprehensive list of Translation and Language Technology tools
maintained by TAUS.
4.2.1 Word Alignment Tools
Berkeley Word Aligner
The BerkeleyAligner (available at Sourceforge) is a word alignment software package that
implements recent innovations in unsupervised word alignment. It is implemented in Java
and distributed in compiled format.
4.2. External Tools 141
mkdir /my/installation/dir
cd /my/installation/dir
tar xzf berkeleyaligner_unsupervised-2.1.tar.gz
cd berkeleyaligner
chmod +x align
./align example.conf
Multi-threaded GIZA++
MGIZA was developed by Qin Gao. It is an implementation of the popular GIZA++ word
alignment toolkit that runs multi-threaded on multi-core machines. Check the web site for more
recent versions.
git clone
cd mgiza/mgizapp
cmake .
make install
Compiling MGIZA requires the Boost library. If your Boost library is in a non-system directory,
use the script
to compile MGIZA.
The MGIZA binary and the script need to be copied into your binary
directory, where Moses will look for word alignment tools. This is the exact command I use to copy
MGIZA to its final destination:
export BINDIR=~/workspace/bin/training-tools
cp bin/* $BINDIR/mgizapp
cp scripts/ $BINDIR
MGIZA works with the training script train-model.perl. You indicate its use (as opposed to
regular GIZA++) with the switch -mgiza. The switch -mgiza-cpus NUMBER allows you to specify
the number of CPUs.
Dyer et al.’s Fast Align
fast_align is a fast unsupervised word aligner that nevertheless gives results comparable
to GIZA++. Its details are described in a NAACL 2013 paper.
mkdir /my/installation/dir
cd /my/installation/dir
git clone
cd fast_align
Anymalign
Anymalign is a multilingual sub-sentential aligner. It can extract lexical equivalences from
sentence-aligned parallel corpora. Its main advantage over other similar tools is that it can
align any number of languages simultaneously. The details are described in Lardilleux and
Lepage (2009). To understand the algorithm, a pure Python implementation can be found in, but it is advisable to use the main implementation for realistic usage.
mkdir /your/installation/dir
cd /your/installation/dir
4.2.2 Evaluation Metrics
Translation Error Rate (TER)
Translation Error Rate is an error metric for machine translation that measures the number of
edits required to change a system output into one of the references. It is implemented in Java.
mkdir /my/installation/dir
cd /my/installation/dir
tar xzf tercom-0.7.25.tgz
METEOR
METEOR is a metric that includes stemmed and synonym matches when measuring the
similarity between system output and human reference translations.
mkdir /my/installation/dir
cd /my/installation/dir
RIBES
RIBES is a word rank-based metric that compares the ratio of contiguous and
discontiguous word pairs between the system output and human translations.
# First download from
# (need to accept to agree to the free license, so no direct URL)
tar -xvzf RIBES-1.03.1.tar.gz
cd RIBES-1.03.1/
python --help
4.2.3 Part-of-Speech Taggers
MXPOST (English)
MXPOST was developed by Adwait Ratnaparkhi as part of his PhD thesis. It is a Java
implementation of a maximum entropy model and is distributed as compiled code. It can be
trained for any language for which annotated POS data exists.
mkdir /your/installation/dir
cd /your/installation/dir
tar xzf jmx.tar.gz
echo ’#!/usr/bin/env bash’ > mxpost
echo ’export CLASSPATH=/your/installation/dir/mxpost.jar’ >> mxpost
echo ’java -mx30m tagger.TestTagger /your/installation/dir/tagger.project’ >> mxpost
chmod +x mxpost
echo ’This is a test .’ | ./mxpost
The script scripts/training/wrappers/make-factor-en-pos.mxpost.perl is a wrapper script
to create factors for a factored translation model. You have to adapt the definition of $MXPOST
to point to your installation directory.
TreeTagger (English, French, Spanish, German, Italian, Dutch, Bulgarian, Greek)
TreeTagger is a tool for annotating text with part-of-speech and lemma information.
Installation (Linux, check web site for other platforms):
mkdir /my/installation/dir
cd /my/installation/dir
The wrapper script scripts/training/wrappers/make-pos.tree-tagger.perl creates part-of-speech
factors using TreeTagger in the format expected by Moses. The command has the
required parameters -tree-tagger DIR to specify the location of your installation and -l
LANGUAGE to specify the two-letter code for the language (de, fr, ...). Optional parameters are
-basic to output only basic part-of-speech tags (VER instead of VER:simp -- not available for all
languages), and --stem to output stems instead of part-of-speech tags.
TreeTagger can also shallow parse the sentence, labelling it with chunk tags. See their website
for details.
FreeLing
FreeLing is a set of tokenizers, morphological analyzers, syntactic parsers, and other
language tools for Asturian, Catalan, English, Galician, Italian, Portuguese, Russian, Spanish,
and others.
Collins (English)
Michael Collins developed the first statistical parser as part of his PhD thesis. It is
implemented in C.
mkdir /your/installation/dir
cd /your/installation/dir
tar xzf PARSER.tar.gz
The Collins parser also requires the installation of MXPOST (Section 4.2.3). A wrapper file to
generate parse trees in the format required to train syntax models with Moses is provided in
Helmut Schmid developed BitPar, a parser for highly ambiguous probabilistic context-free
grammars (such as treebank grammars). BitPar uses bit-vector operations to speed up the
basic parsing operations by parallelization. It is implemented in C and distributed as compiled
code.
mkdir /your/installation/dir
cd /your/installation/dir
tar xzf BitPar.tar.gz
cd BitPar/src
cd ../..
You will also need the parsing model for German which was trained on the Tiger treebank:
tar xzf GermanParser.tar.gz
cd GermanParser/src
cd ../..
There is also an English parsing model.
LoPar (German)
LoPar is an implementation of a parser for head-lexicalized probabilistic context-free
grammars, which can also be used for morphological analysis. The program is distributed
without source code.
mkdir /my/installation/dir
cd /my/installation/dir
tar xzf lopar-3.0.linux.tar.gz
cd LoPar-3.0
Berkeley Parser
The Berkeley Parser is a phrase structure grammar parser implemented in Java and distributed
open source. Models are provided for English, Bulgarian, Arabic, Chinese, French, and German.
4.2.5 Other Open Source Machine Translation Systems
Joshua
Joshua is a machine translation decoder for hierarchical models. Joshua development is
centered at the Center for Language and Speech Processing at Johns Hopkins University in
Baltimore, Maryland. It is implemented in Java.
Cdec25 is a decoder, aligner, and learning framework for statistical machine translation and
other structured prediction models, written by Chris Dyer at the University of Maryland
Department of Linguistics. It is written in C++.
Apertium26 is an open source rule-based machine translation (RBMT) system, maintained prin-
cipally by the University of Alicante and Prompsit Engineering.
Docent27 is a decoder for phrase-based SMT that treats complete documents, rather than sin-
gle sentences, as translation units and permits the inclusion of features with cross-sentence
dependencies. It is developed by Christian Hardmeier and implemented in C++.
Phrasal is a phrase-based SMT toolkit written in Java.
4.2.6 Other Translation Tools
COSTA MT Evaluation Tool
COSTA MT Evaluation Tool28 is an open-source Java program that can be used to manually
evaluate the quality of MT output. It is simple to use, designed to allow potential MT
users and developers to analyse their engines in a friendly environment. It enables the
ranking of the quality of MT output segment-by-segment for a particular language pair.
Appraise
Appraise29 is an open-source tool for manual evaluation of Machine Translation output. Ap-
praise allows the collection of human judgments on translation output, implementing annotation
tasks such as translation quality checking, ranking of translations, error classification, and man-
ual post-editing. It is used in the ACL WMT evaluation campaign30.
Indic NLP Library
Python-based libraries for common text processing and Natural Language Processing in Indian
languages. Indian languages share a lot of similarity in terms of script, phonology, language
syntax, etc., and this library is an attempt to provide a general solution to commonly
required toolsets for Indian language text.
The library provides the following functionalities:
Text Normalization
Morphological Analysis
Subsection last modified on July 05, 2017, at 08:45 AM
4.3 User Documentation
The basic features of the decoder are explained in the Tutorial (Section 3.1) and Training sec-
tions. But to get good results from Moses you probably need to use some of the features de-
scribed on this page.
Advanced Models (Section 4.4) A basic SMT system contains a language model and a trans-
lation model, however there are several ways to extend this (and potentially improve
translation) by adding extra models. These may improve the modelling of reordering, for
example, or capture similarities between related words.
Efficient Phrase and Rule Storage (Section 4.5) To build a state-of-the-art translation system,
Moses often requires huge phrase-pair or rule tables. The efficient storage and access of
these tables requires specialised data structures, and this page describes several different options.
Search (Section 4.6) Given an MT model and a source sentence, the problem of finding the
best translation is an intractable search problem. Moses implements several methods for
taming this intractability.
Unknown Words (Section 4.7) No matter how big your training data is, there will always be
OOVs (out-of-vocabulary words) in the text you wish to translate. One approach may be
to transliterate, if your source and target languages have different character sets.
Hybrid Translation (Section 4.8) Sometimes you need rules! If you want to add explicit knowl-
edge to Moses models, for example for translating terminology or numbers, dates etc.,
Moses has a few ways of making this possible.
Moses as a Service (Section 4.9) Moses includes a basic server which can deliver translations
over XML-RPC.
Incremental Training (Section 4.10) The traditional Moses pipeline is a sequence of batch pro-
cesses, but what if you want to add extra training data to a running system? Storing the
phrase table in a suffix array makes this possible.
Domain Adaptation (Section 4.11) When the training data differs in a systematic way from
the test data, you have a domain problem. Several techniques have been proposed in the
literature, and Moses includes implementations of many of them.
Constrained Decoding (Section 4.12) In some applications, you know the translation, but you
need to know how the model derived it.
Cache-based Models (Section 4.13) These can be a useful way for the document context to
influence the translation.
Sparse features (Section 4.16) Feature functions that produce many features, for instance lex-
icalized features.
Support Tools (Section 4.1) Various tools to manipulate models and configuration files.
External Tools (Section 4.2) Linguistic tools, word aligners, evaluation metrics and frameworks,
other open source machine translation systems.
Web Translation (Section 4.17) Web service software to translate web pages and text on demand.
Pipeline Creation Language (Section 4.14) A generic mechanism for managing pipelines of
software components, such as Moses training.
Obsolete Features (Section 4.15) Things that have been removed, but documentation is pre-
served for posterity.
Subsection last modified on December 23, 2015, at 05:43 PM
4.4 Advanced Models
4.4.1 Lexicalized Reordering Models
The default reordering model for phrase-based statistical machine translation is conditioned
only on movement distance and nothing else. However, some phrases are reordered more
frequently than others. A French adjective like extérieur typically gets switched with the pre-
ceding noun, when translated into English.
Hence, we want to consider a lexicalized reordering model that conditions reordering on the
actual phrases. One concern, of course, is the problem of sparse data. A particular phrase pair
may occur only a few times in the training data, making it hard to estimate reliable probability
distributions from these statistics.
Therefore, in the lexicalized reordering model we present here, we only consider three reorder-
ing types: (m) monotone order, (s) switch with previous phrase, or (d) discontinuous. See
below for an illustration of these three different types of orientation of a phrase.
To put it more formally, we want to introduce a reordering model p_o that predicts an orientation
type {m, s, d} given the phrase pair currently used in translation:

p_o(orientation|f,e), orientation ∈ {m, s, d}
How can we learn such a probability distribution from the data? Again, we go back to the word
alignment that was the basis for our phrase table. When we extract each phrase pair, we can
also extract its orientation type in that specific occurrence.
Looking at the word alignment matrix, we note for each extracted phrase pair its corresponding
orientation type. The orientation type can be detected, if we check for a word alignment point
to the top left or to the top right of the extracted phrase pair. An alignment point to the top
left signifies that the preceding English word is aligned to the preceding foreign word. An
alignment point to the top right indicates that the preceding English word is aligned to the
following foreign word. See below for an illustration.
The orientation type is defined as follows:
monotone: if a word alignment point to the top left exists, we have evidence for mono-
tone orientation.
swap: if a word alignment point to the top right exists, we have evidence for a swap with
the previous phrase.
discontinuous: if neither a word alignment point to the top left nor to the top right exists, we
have neither monotone order nor a swap, and hence evidence for discontinuous orientation.
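This detection rule can be sketched as a small function over word-alignment points. This is an illustrative reimplementation, not the extraction code Moses actually uses, and the coordinate convention (source index first) is an assumption of the sketch:

```python
def orientation(alignment, f_start, f_end, e_start):
    """Classify an extracted phrase pair as (m)onotone, (s)wap, or
    (d)iscontinuous from the word-alignment points.

    alignment: set of (f, e) alignment points (source index, target index).
    f_start, f_end: source-side span of the phrase pair (inclusive).
    e_start: first target-side position of the phrase pair.
    """
    if (f_start - 1, e_start - 1) in alignment:  # point to the top left
        return "m"
    if (f_end + 1, e_start - 1) in alignment:    # point to the top right
        return "s"
    return "d"

# Monotone case: the phrase starting at (f=1, e=1) directly follows (0, 0).
print(orientation({(0, 0), (1, 1)}, 1, 1, 1))  # m
# Swap case: e=2 is produced from f=1 while the preceding e=1 came from f=2.
print(orientation({(0, 0), (2, 1), (1, 2)}, 1, 1, 2))  # s
```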
We count how often each extracted phrase pair is found with each of the three orientation types.
The probability distribution p_o is then estimated based on these counts using the maximum
likelihood principle:

p_o(orientation|f,e) = count(orientation,e,f) / Σ_o count(o,e,f)
Given the sparse statistics of the orientation types, we may want to smooth the counts with the
unconditioned maximum-likelihood probability distribution with some factor σ:

p_o(orientation) = Σ_f Σ_e count(orientation,e,f) / Σ_o Σ_f Σ_e count(o,e,f)

p_o(orientation|f,e) = (σ p_o(orientation) + count(orientation,e,f)) / (σ + Σ_o count(o,e,f))
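As a worked illustration of the estimation formulas above, the following sketch computes the smoothed distribution from toy counts. The phrase pair, the counts, and the value of σ are invented for the example; this is not Moses' training code:

```python
from collections import Counter

# Toy counts of (orientation, e, f) triples collected during phrase extraction.
counts = Counter({
    ("m", "the house", "das Haus"): 8,
    ("s", "the house", "das Haus"): 1,
    ("d", "the house", "das Haus"): 1,
    ("m", "house", "Haus"): 3,
})
ORIENTATIONS = ("m", "s", "d")

def p_unconditioned(o):
    """p_o(orientation): relative frequency over all phrase pairs."""
    return sum(c for (oo, _, _), c in counts.items() if oo == o) / sum(counts.values())

def p_smoothed(o, e, f, sigma=0.5):
    """p_o(orientation|f,e), smoothed towards the unconditioned distribution."""
    pair_total = sum(counts[(oo, e, f)] for oo in ORIENTATIONS)
    return (sigma * p_unconditioned(o) + counts[(o, e, f)]) / (sigma + pair_total)

print(round(p_smoothed("m", "the house", "das Haus"), 3))  # 0.802
```

Note that the smoothed values still sum to one over the three orientations for any phrase pair, since the σ-weighted prior itself sums to one.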
There are a number of variations of this lexicalized reordering model based on orientation type:
bidirectional: Certain phrases may not only flag if they themselves are moved out of
order, but also if subsequent phrases are reordered. A lexicalized reordering model for
this decision could be learned in addition, using the same method.
f and e: Owing to sparse data concerns, we may want to condition the probability distribu-
tion only on the foreign phrase (f) or the English phrase (e).
monotonicity: To further reduce the complexity of the model, we might merge the orien-
tation types swap and discontinuous, leaving a binary decision about the phrase order.
These variations have been shown to be occasionally beneficial for certain training corpus sizes and
language pairs. Moses allows the arbitrary combination of these decisions to define the reorder-
ing model type (e.g. bidirectional-monotonicity-f). See more on training these models in
the training section of this manual.
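For reference, a trained lexicalized reordering model of a given type is declared in the moses.ini file roughly as follows; the path and weights are placeholders, and the exact lines are normally produced by the training scripts (compare the default-scores example later in this section):

```ini
[feature]
LexicalReordering name=LexicalReordering0 num-features=6 type=wbe-msd-bidirectional-fe path=/my-dir/reordering-table

[weight]
LexicalReordering0= 0.3 0.3 0.3 0.3 0.3 0.3
```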
Enhanced orientation detection
As explained above, statistics about the orientation of each phrase can be collected by looking
at the word alignment matrix, in particular by checking the presence of a word at the top left
and right corners. This simple approach is capable of detecting a swap with a previous phrase
that contains a word exactly aligned on the top right corner, see case (a) in the figure below.
However, this approach cannot detect a swap with a phrase that does not contain a word with
such an alignment, like the case (b). A variation to the way phrase orientation statistics are
collected is the so-called phrase-based orientation model by Tillmann (2004)31, which uses
phrases both at training and decoding time. With the phrase-based orientation model, the case
(b) is properly detected and counted during training as a swap. A further improvement of
this method is the hierarchical orientation model by Galley and Manning (2008)32, which is
able to detect swaps or monotone arrangements between blocks even larger than the length
limit imposed on phrases during training, and larger than the phrases actually used during
decoding. For instance, it can detect at decoding time the swap of blocks in the case (c) shown in the figure.
(Figure from Galley and Manning, 2008)
Empirically, the enhanced orientation methods should be used with language pairs involving
significant word re-ordering.
4.4.2 Operation Sequence Model (OSM)
The Operation Sequence Model as described in Durrani et al. (2011)33 and Durrani et al.
(2013)34 has been integrated into Moses.
What is OSM?
OSM is an N-gram-based translation and reordering model that represents an aligned bilingual
corpus as a sequence of operations and learns a Markov model over the resultant sequences.
Possible operations are (i) generation of a sequence of source and target words (ii) insertion
of gaps as explicit target positions for reordering operations, and (iii) forward and backward
jump operations which do the actual reordering. The probability of a sequence of operations
is defined according to an N-gram model, i.e., the probability of an operation depends on the
n-1 preceding operations. Let O = o_1, ..., o_N be a sequence of operations as hypothesized by
the translator to generate a word-aligned bilingual sentence pair <F;E;A>; the model is then
defined as:

p_osm(F,E,A) = p(o_1, ..., o_N) = Π_i p(o_i | o_{i-n+1}, ..., o_{i-1})
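The chain-rule product above can be sketched as follows, under the assumption of a generic n-gram scorer passed in as a function; in the real system the model is estimated with an n-gram LM toolkit such as SRILM, and the uniform "model" below is invented purely for illustration:

```python
import math

def osm_logprob(operations, ngram_logprob, order=5):
    """log p(o_1 .. o_N) = sum_i log p(o_i | o_{i-n+1} .. o_{i-1}),
    i.e. each operation is conditioned on up to n-1 preceding operations."""
    total = 0.0
    for i, op in enumerate(operations):
        history = tuple(operations[max(0, i - order + 1):i])
        total += ngram_logprob(op, history)
    return total

# Hypothetical uniform "model" over three operation types, for illustration.
uniform = lambda op, history: math.log(1 / 3)
ops = ["Generate Identical", "Insert Gap", "Jump Back (1)"]
print(round(osm_logprob(ops, uniform), 4))  # 3 * log(1/3) = -3.2958
```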
The OSM model addresses several drawbacks of the phrase-based translation and lexicalized
reordering models: i) it considers source and target contextual information across phrasal
boundaries and does not make an independence assumption, ii) it is based on minimal translation
units and therefore does not have the problem of spurious phrasal segmentation, iii) it considers
much richer conditioning than the lexicalized reordering model, which only learns the orientation
of a phrase w.r.t. the previous phrase (or block of phrases), ignoring how previous words were
translated and reordered. The OSM model conditions translation and reordering decisions on
'n' previous translation and reordering decisions, which can span across phrasal boundaries.
A list of operations is given below:
Generate (X,Y): X and Y are source and target cepts in an MTU (minimal translation unit). This
operation causes the words in Y and the first word in X to be added, respectively, to the target
and source strings generated so far. Subsequent words in X are added to a queue to
be generated later.
Continue Source Cept: The source words added to the queue by the Generate (X,Y) opera-
tion are generated by the Continue Source Cept operation. Each Continue Source Cept operation
removes one source word from the queue and copies it to the source string.
Generate Source Only (X): The words in X are added at the current position in the source
string. This operation is used to generate a source word with no corresponding target word.
Generate Target Only (Y): The words in Y are added at the current position in the target string.
This operation is used to generate a target word with no corresponding source word.
Generate Identical: The same word is added at the current position in both the source and
target strings. The Generate Identical operation is used during decoding for the translation of
unknown words.
Insert Gap: This operation inserts a gap which acts as a placeholder for the skipped words.
There can be more than one open gap at a time.
Jump Back (W): This operation lets the translator jump back to an open gap. It takes a param-
eter W specifying which gap to jump to: W=1 for the gap closest to the right-most source word
covered, W=2 for the second closest, and so on.
Jump Forward: This operation makes the translator jump to the right-most source word so far
covered. It is performed when the next source word to be generated is to the right of the
right-most source word generated and does not immediately follow it.
The example shown in the figure is deterministically converted to the following operation sequence:
Generate Identical -- Generate (hat investiert, invested) -- Insert Gap -- Continue Source Cept -- Jump
Back (1) -- Generate (Millionen, million) -- Generate Source Only (von) -- Generate (Dollars, dollars) --
Generate (in, in) -- Generate (die, the) -- Generate (Untersuchungen, research)
To enable the OSM model in the phrase-based decoder, just put the following in the EMS config file:
operation-sequence-model = "yes"
operation-sequence-model-order = 5
operation-sequence-model-settings = ""
Factored Model
Due to data sparsity the lexically driven OSM model may often fall back to very small context
sizes. This problem is addressed in Durrani et al. (2014b)35 by learning operation sequences
over generalized representations such as POS/Morph tags/word classes (See Section: Class-
based Models). If the data has been augmented with additional factors, then use
operation-sequence-model-settings = "--factor 0-0+1-1"
"0-0" will learn OSM model over lexical forms and "1-1" will learn OSM model over second
factor (POS/Morph/Cluster-id etc.). Note that using
operation-sequence-model-settings = ""
for factor-augmented training data is an error. Use
operation-sequence-model-settings = "--factor 0-0"
if you only intend to train OSM model over surface form in such a scenario.
In case you are not using EMS and want to train OSM model manually, you will need to do
two things:
1) Run the following command
/path-to-moses/scripts/OSM/OSM-Train.perl --corpus-f corpus.fr --corpus-e corpus.en --alignment aligned.grow-diag-final-and --order 5 --out-dir /path-to-experiment/model/OSM --moses-src-dir /path-to-moses/ --srilm-dir /path-to-srilm/bin/i686-m64 --factor 0-0 --input-extension fr --output-extension en
2) Edit model/moses.ini to add
OpSequenceModel name=OpSequenceModel0 num-features=5 path=/path-to-experiment/model/OSM/operationLM.bin
... [weight]
OpSequenceModel0= 0.08 -0.02 0.02 -0.001 0.03
Interpolated OSM Model
An OSM model trained from the plain concatenation of in-domain data with large and diverse
multi-domain data is sub-optimal. When other domains are sufficiently larger and/or different
from the in-domain data, the probability distribution can skew away from the target domain,
resulting in poor performance. The LM-like nature of the model provides motivation to ap-
ply methods such as perplexity optimization for model weighting. The idea is to train an OSM
model on each domain separately and interpolate them by minimizing perplexity
on a held-out tuning set. To know more, read Durrani et al. (2015)36.
Provide tuning files as additional parameter in the settings. For example:
interpolated-operation-sequence-model = "yes"
operation-sequence-model-order = 5
operation-sequence-model-settings = "--factor 0-0 --tune /path-to-tune-folder/tune_file --srilm-dir /path-to-srilm/bin/i686-m64"
This method requires word-alignment for the source and reference tuning files to generate
operation sequences. This can be done using forced decoding of the tuning set or by aligning tuning
sets along with the training. The tune folder should contain the corresponding files (for example
tune.fr and tune.en). The interpolation script does not work with lmplz and requires an SRILM installation.
4.4.3 Class-based Models
Automatically clustering the training data into word classes in order to obtain smoother distri-
butions and better generalizations is a widely known and applied technique in natural
language processing. Using class-based models has been shown to be useful when translating into
morphologically rich languages. We use the mkcls utility in GIZA++ to cluster source and target
vocabularies into classes. This is generally run during the alignment process, where the data is
divided into 50 classes to estimate IBM Model-4. Durrani et al. (2014b)37 found using different
numbers of clusters to be useful for different language pairs. To map the data into a higher
number of clusters (say 1000), use:
/path-to-GIZA/statmt/bin/mkcls -c1000 -n2 -p/path-to-corpus/ -V/path-to-experiment/training/prepared.stepID/fr.vcb.classes opt
To annotate the data with cluster-ids add the following to the EMS-config file:
temp-dir = $working-dir/training/factor
### script that generates this factor
factor-script = "/path-to-moses/scripts/training/wrappers/make-factor-brown-cluster-mkcls.perl 0 $working-dir/training/prepared.stepID/$input-extension.vcb.classes"
### script that generates this factor
factor-script = "/path-to-moses/scripts/training/wrappers/make-factor-brown-cluster-mkcls.perl 0 $working-dir/training/prepared.stepID/$output-extension.vcb.classes"
Adding the above will augment the training data with cluster-ids. These can be enabled in
different models. For example, to train a joint source-target phrase-translation model, add the
following to the EMS config file:
input-factors = word mkcls
output-factors = word mkcls
alignment-factors = "word -> word"
translation-factors = "word+mkcls -> word+mkcls"
reordering-factors = "word -> word"
decoding-steps = "t0"
To train a target sequence model over cluster-ids, add the following to the EMS config file:
raw-corpus = /path-to-raw-monolingual-data/rawData.en
factors = mkcls
settings = "-unk"
To train an operation sequence model over cluster-ids, use the following in the EMS config file:
operation-sequence-model-settings = "--factor 1-1"
If you want to train both lexically driven and class-based OSM models, then use:
operation-sequence-model-settings = "--factor 0-0+1-1"
4.4.4 Multiple Translation Tables and Back-off Models
Moses allows the use of multiple translation tables, but there are three different ways in which
they are used:
both translation tables are used for scoring: This means that every translation option is
collected from each table and scored by each table. This implies that each translation
option has to be contained in each table: if it is missing in one of the tables, it cannot be used.
either translation table is used for scoring: Translation options are collected from one
table, and additional options are collected from the other tables. If the same translation
option (in terms of identical input phrase and output phrase) is found in multiple tables,
separate translation options are created for each occurrence, but with different scores.
the union of all translation options from all translation tables is considered. Each option
is scored by each table. This uses a different mechanism than the above two methods and
is discussed in the PhraseDictionaryGroup section below.
In any case, each translation table has its own set of weights.
First, you need to specify the translation tables in the section [feature] of the moses.ini con-
figuration file, for instance:
PhraseDictionaryMemory path=/my-dir/table1 ...
PhraseDictionaryMemory path=/my-dir/table2 ...
Secondly, you need to set weights for each phrase-table in the section [weight].
Thirdly, you need to specify how the tables are used in the section [mapping]. As mentioned
above, there are two choices:
scoring with both tables:

[mapping]
0 T 0
0 T 1

scoring with either table:

[mapping]
0 T 0
1 T 1
Note: what we are really doing here is using Moses' capability to use different decoding paths.
The number before "T" defines a decoding path, so in the second example two different
decoding paths are specified. Decoding paths may also contain additional mapping steps, such as
generation steps and translation steps using different factors.
Also note that there is no way to have the option "use both tables if the phrase pair is in both
tables, otherwise use only the table where you can find it". Keep in mind that scoring a phrase
pair involves a cost and lowers the chances that the phrase pair is used. To effectively use this
option, you may create a third table that consists of the intersection of the two phrase tables,
and remove shared phrase pairs from each table.
PhraseDictionaryGroup: You may want to combine translation tables such that you can use
any option in either table, but all options are scored by all tables. This gives the flexibility of
the either option with the reliable scoring of the both option. This is accomplished with the
PhraseDictionaryGroup interface that combines any number of translation tables on a single
decoding path.
In the [feature] section, add all translation tables as normal, but specify the tuneable=false
option. Then add the PhraseDictionaryGroup entry, specifying your translation tables as
members and the total number of features (sum of member feature numbers). It is recom-
mended to activate default-average-others=true. When an option is found in some mem-
ber tables but not others, its feature scores default to 0 (log(1)), a usually unreasonably high
score. Turning on the averaging option tells Moses to fill in the missing scores by averaging
the scores from tables that have seen the phrase (similar to the "fill-up" approach, but allowing
any table to be filled in by all other tables while maintaining a full feature set for each). See the
notes below for other options.
In the [weight] section, specify all 0s for member tables except for the index of φ(e|f) (2 by
default). This is only used for sorting options to apply the table-limit as the member tables will
not contribute scores directly. The weights for the PhraseDictionaryGroup entry are the actual
weights for the member tables in order. For instance, with 2 member tables of 4 features each,
features 0-3 are the first table’s 0-3 and 4-7 are the second table’s 0-3.
Finally, only add a mapping for the index of the PhraseDictionaryGroup (number of member
tables plus one).
PhraseDictionaryMemory name=PhraseDictionaryMemory0 num-features=4 tuneable=false path=/my-dir/table1 ...
PhraseDictionaryMemory name=PhraseDictionaryMemory1 num-features=4 tuneable=false path=/my-dir/table2 ...
PhraseDictionaryGroup name=PhraseDictionaryGroup0 members=PhraseDictionaryMemory0,PhraseDictionaryMemory1 num-features=8 default-average-others=true
PhraseDictionaryMemory0= 0 0 1 0
PhraseDictionaryMemory1= 0 0 1 0
PhraseDictionaryGroup0= 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2
You may want to add indicator features to tell Moses what translation table each option
originates from. Activating phrase-counts=true adds an indicator feature for each table
to each option that returns 1 if the table contains the option and 0 otherwise. Similarly,
activating word-counts=true adds a word count for each table. For instance, an option
with target phrase length 3 would receive a 3 for each table that contains it and 0 for each
that does not. Each of these options adds one feature per table, so set num-features and
weights accordingly. (Adding both to the above example would yield num-features=12:
4 per model, 2 phrase counts, and 2 word counts)
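The score-combination behavior described above can be sketched in a few lines; this is an illustrative Python model of default-average-others, not the actual C++ implementation, and the toy scores are invented:

```python
def group_scores(member_scores, average_missing=True):
    """Combine per-table feature vectors for one translation option.

    member_scores holds one entry per member table: a list of feature
    scores, or None if that table does not contain the option.  With
    average_missing (default-average-others=true), a missing table is
    filled with the average of the tables that do contain the option;
    otherwise it gets zeros, i.e. log(1).
    """
    seen = [s for s in member_scores if s is not None]
    n = len(seen[0])
    avg = [sum(s[i] for s in seen) / len(seen) for i in range(n)]
    combined = []
    for s in member_scores:
        combined += list(s) if s is not None else (avg if average_missing else [0.0] * n)
    return combined

# Option found in table 0 (toy log-scores) but absent from table 1.
print(group_scores([[-1.0, -2.0, -0.5, -1.5], None]))
# [-1.0, -2.0, -0.5, -1.5, -1.0, -2.0, -0.5, -1.5]
```

The concatenated vector matches the weight layout described above: features 0-3 come from the first member table and 4-7 from the second.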
Backoff Models: You may prefer to use the first table, and the second table only if
there are no translations to be found in the first table. In other words, the second table is only
a back-off table for unknown words and phrases in the first table. This can be specified by
the option decoding-graph-backoff. The option also allows specifying whether the back-off table
should only be used for single words (unigrams), unigrams and bigrams, everything up to trigrams,
up to 4-grams, etc.
For example, if you have two translation tables, and you want to use the second one only for
unknown words, you would specify:

[decoding-graph-backoff]
0
1

The 0 indicates that the first table is used for anything (which it always should be), and the 1
indicates that the second table is used for unknown n-grams up to size 1. Replacing it with a 2
would indicate its use for unknown unigrams and bigrams (unknown in the sense that the first
table has no translations for them).
Also note that this option works with more complicated mappings than just a single trans-
lation table. For instance, the following specifies the use of a simple translation table first, and
as a back-off a more complex factored decomposition involving two translation tables and two
generation tables:
Caveat: Multiple Translation Tables and Lexicalized Reordering You may specify any num-
ber of lexicalized reordering models. Each of them will score any translation option, no matter
where it comes from. If a lexicalized reordering table does not have an entry for a translation
option, it will not assign any score to it. In other words, such a translation option is given the
probability 1 no matter how it is reordered. This may not be the way you want to handle it.
For instance, if you have an in-domain translation table and an out-of-domain translation table,
you can also provide an in-domain reordering table and an out-of-domain reordering table. If a
phrase pair occurs in both translation tables, it will be scored by both reordering tables. How-
ever, if a phrase pair occurs in only one of the phrase tables (and hence reordering tables), it
will be scored by only one of them and get a free ride with the other. This will have the undesir-
able effect of discouraging phrase pairs that occur in both tables.
To avoid this, you can add default scores to the reordering table:
LexicalReordering name=LexicalReordering0 num-features=6 type=wbe-msd-bidirectional-fe-allff [...] default-scores=0.5,0.3,0.2,0.5,0.3,0.2
LexicalReordering name=LexicalReordering1 num-features=6 type=wbe-msd-bidirectional-fe-allff [...] default-scores=0.5,0.3,0.2,0.5,0.3,0.2
4.4.5 Global Lexicon Model
The global lexicon model predicts the bag of output words from the bag of input words. It
does not use an explicit alignment between input and output words, so word choice is also
influenced by the input context. For details, please check Mauser et al., (2009)38.
The model is trained with the script
scripts/training/train-global-lexicon-model.perl --corpus-stem FILESTEM --lex-dir DIR --f EXT --e EXT
which requires the tokenized parallel corpus, and the lexicon files required for GIZA++.
You will need the MegaM39 maximum entropy classifier from Hal Daumé III for training.
Warning: A separate maximum entropy classifier is trained for each target word, which is
very time consuming. The training code is in a very experimental state and is very inefficient. For
instance, training a model on Europarl German-English with 86,700 distinct English words took
about 10,000 CPU hours.
The model is stored in a text file.
File format:
county initiativen 0.34478
county land 0.92405
county schaffen 0.23749
county stehen 0.39572
county weiteren 0.04581
county europa -0.47688
Specification in moses.ini:
GlobalLexicalModel input-factor=0 output-factor=0 path=.../global-lexicon.gz
GlobalLexicalModel0= 0.1
4.4.6 Desegmentation Model
The in-decoder desegmentation model is described in Salameh et al. (2016)40.
The desegmentation model extends the multi-stack phrase-based decoding paradigm to en-
able the extraction of word-level features inside morpheme-segmented models. It assumes that
the target side of the parallel corpus has been segmented into morphemes where a plus "+" at
the end of a token is a prefix, and at the beginning is a suffix. This allows us to define a com-
plete word as a maximal morpheme sequence consisting of 0 or more prefixes, followed by at
most one stem, and then 0 or more suffixes. The word-level features extracted by this model are
an unsegmented Language Model (word-level LM) score, a contiguity feature, and a WordPenalty
that counts the number of words rather than the default one that counts morphemes.
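The segmentation convention above implies a simple join procedure, sketched here for the concatenative case; the morpheme tokens are invented, Buckwalter-style examples, and Moses' rule-based Arabic scheme (deseg-scheme=r) is more elaborate than this:

```python
def desegment(tokens):
    """Join morpheme tokens back into words: a trailing '+' marks a prefix
    (attaches to the following token), a leading '+' marks a suffix
    (attaches to the preceding token)."""
    words = [""]
    glue_next = False
    for tok in tokens:
        is_suffix = tok.startswith("+")
        is_prefix = tok.endswith("+")
        core = tok.strip("+")
        if is_suffix or glue_next:
            words[-1] += core   # attach to the word being built
        else:
            words.append(core)  # start a new word
        glue_next = is_prefix
    return [w for w in words if w]

# Invented example: prefix + stem + suffix collapse into one word.
print(desegment(["Al+", "ktAb", "+hm"]))  # ['AlktAbhm']
```

With this convention a complete word is, as stated above, zero or more prefixes, at most one stem, and zero or more suffixes.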
The word level features extracted from the hypotheses in the example are:
Unsegmented LM score for (lnŽr AfkArh)
WordPenalty = 2
Contiguity feature: (2 0 0), indicating that the desegmented tokens are aligned to contiguous source words.
The feature is activated by adding the following line to the Moses config file:
DesegModel name=LM1 path=/path/to/unsegmented/lm.blm deseg-path=/path/to/desegmentation/table optimistic=(default=y) deseg-scheme=(default=r)
optimistic=(y or n), where n means the delayed option (explained in the paper).
The optimistic option assumes that the morphemes form a complete word at the end of each
hypothesis, while the delayed option desegments the morphemes when it guarantees that they
form a complete word.
The desegmentation table maps segmented forms to unsegmented words, together with the frequency
of the pair. You can download the desegmentation table used for English-Arabic translation.
At this point, the frequency (the count of occurrences of the unsegmented-segmented pair in a
corpus) is not used, but it will later be used to handle multiple desegmentation options.
deseg-scheme=(r or s), where r is rule-based desegmentation (ONLY for Arabic) and s is
simple desegmentation that concatenates the tokens based on segmentation boundaries.
4.4.7 Advanced Language Models
Moses supports various neural, bilingual, and syntactic language models (Section 5.13).
Subsection last modified on August 07, 2016, at 11:38 PM
4.5 Efficient Phrase and Rule Storage
4.5.1 Binary Phrase Tables with On-demand Loading
For larger tasks the phrase tables usually become huge, typically too large to fit into memory.
Therefore, Moses supports a binary phrase table with on-demand loading, i.e. only the part of
the phrase table that is required to translate a sentence is loaded into memory.
There are currently 3 binary formats to do this:
OnDisk phrase-table. Works with SCFG models and phrase-based models.
Binary phrase-table (deprecated). Works with phrase-based models only.
Compact ph