MorphoDiTa: Morphological Dictionary and Tagger

Version 1.9.2

Contents

1 Introduction
2 Online
2.1 Online Demo
2.2 Web Service
3 Release
3.1 Download
3.1.1 Language Models
3.2 License
3.3 Platforms and Requirements
4 MorphoDiTa Installation
4.1 Requirements
4.2 Compilation
4.2.1 Platforms
4.2.2 Further Details
4.3 Other language bindings
4.3.1 C#
4.3.2 Java
4.3.3 Perl
4.3.4 Python
5 MorphoDiTa User’s Manual
5.1 Czech MorphoDiTa Models
5.1.1 Download
5.1.2 Acknowledgements
5.1.3 Czech Morphological System
5.1.4 Main Czech Model
5.1.5 Part of Speech Only Variant
5.1.6 No Diacritical Marks Variant
5.1.7 Morphological Derivation Model using DeriNet
5.1.8 Models with Raw Lemmas
5.1.9 Czech Model History
5.2 English MorphoDiTa Models
5.2.1 Download
5.2.2 Acknowledgements
5.2.3 English Morphological System
5.2.4 English Model
5.2.5 No Negations Variant
5.2.6 English Model Changes
5.3 Running the Tagger
5.3.1 Input Formats
5.3.2 Tag Set Conversion
5.3.3 Morphological Derivation
5.3.4 Morphological Guesser
5.3.5 Output Formats
5.4 Running the Morphology
5.4.1 Morphological Analysis
5.4.2 Morphological Generation
5.4.3 Interactive Morphological Analysis and Generation
5.5 Running the Tokenizer
5.5.1 Output Formats
5.6 Running REST Server
5.7 Custom Morphological and Tagging Models
5.7.1 Custom Morphological Models
5.7.2 Custom Tagging Models
6 MorphoDiTa API Tutorial
6.1 Tagger API
6.2 Morphological Dictionary API
6.2.1 Dictionary Construction
6.2.2 Morphological Analysis
6.2.3 Generation
6.3 Questions and Answers
7 MorphoDiTa API Reference
7.1 MorphoDiTa Versioning
7.2 Lemma Structure
7.3 Struct string_piece
7.4 Struct tagged_form
7.5 Struct tagged_lemma
7.6 Struct tagged_lemma_forms
7.7 Struct token_range
7.8 Struct derivated_lemma
7.9 Class version
7.9.1 version::current
7.10 Class tokenizer
7.10.1 tokenizer::set_text
7.10.2 tokenizer::next_sentence
7.10.3 tokenizer::new_vertical_tokenizer
7.10.4 tokenizer::new_czech_tokenizer
7.10.5 tokenizer::new_english_tokenizer
7.10.6 tokenizer::new_generic_tokenizer
7.11 Class derivator
7.11.1 derivator::parent
7.11.2 derivator::children
7.12 Class derivation_formatter
7.12.1 derivation_formatter::format_derivation
7.12.2 derivation_formatter::new_none_derivation_formatter
7.12.3 derivation_formatter::new_root_derivation_formatter
7.12.4 derivation_formatter::new_path_derivation_formatter
7.12.5 derivation_formatter::new_tree_derivation_formatter
7.12.6 derivation_formatter::new_derivation_formatter
7.13 Class morpho
7.13.1 morpho::load(const char*)
7.13.2 morpho::load(istream&)
7.13.3 morpho::guesser_mode
7.13.4 morpho::analyze()
7.13.5 morpho::generate()
7.13.6 morpho::raw_lemma_len
7.13.7 morpho::lemma_id_len
7.13.8 morpho::raw_form_len
7.13.9 morpho::new_tokenizer
7.13.10 morpho::get_derivator
7.14 Class tagger
7.14.1 tagger::load(const char*)
7.14.2 tagger::load(istream&)
7.14.3 tagger::get_morpho()
7.14.4 tagger::tag()
7.14.5 tagger::tag_analyzed()
7.14.6 tagger::new_tokenizer
7.15 Class tagset_converter
7.15.1 tagset_converter::convert()
7.15.2 tagset_converter::convert_analyzed()
7.15.3 tagset_converter::convert_generated()
7.15.4 tagset_converter::new_identity_converter()
7.15.5 tagset_converter::new_pdt_to_conll2009_converter()
7.15.6 tagset_converter::new_strip_lemma_comment_converter()
7.15.7 tagset_converter::new_strip_lemma_id_converter()
7.16 C++ Bindings API
7.16.1 Helper Structures
7.16.2 Main Classes
7.17 C# Bindings
7.18 Java Bindings
7.19 Perl Bindings
7.20 Python Bindings
8 Contact
9 Acknowledgements
9.1 Publications
9.2 Bibtex for Referencing
9.3 Persistent Identifier

1 Introduction

MorphoDiTa: Morphological Dictionary and Tagger is an open-source tool for morphological analysis of natural language texts. It performs morphological analysis, morphological generation, tagging and tokenization and is distributed as a standalone tool or a library, along with trained linguistic models. For Czech, MorphoDiTa achieves state-of-the-art results with a throughput of around 10-200K words per second. MorphoDiTa is free software under the Mozilla Public License 2.0. The linguistic models are free for non-commercial use and distributed under the CC BY-NC-SA license, although for some models the original data used to create the model may impose additional licensing conditions. MorphoDiTa is versioned using Semantic Versioning.

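MorphoDiTa is developed at the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague, Czech Republic.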

2 Online

2.1 Online Demo

An online demo is available as one of the LINDAT/CLARIN services.

2.2 Web Service

A web service is also available as one of the LINDAT/CLARIN services.

3 Release

3.1 Download

MorphoDiTa releases are available on GitHub, either as a pre-compiled binary package or as source code only. The binary package contains Linux, Windows and OS X binaries, Java bindings binary, C# bindings binary, and source code of MorphoDiTa and all language bindings. While the binary packages do not contain compiled Python or Perl bindings, packages for those languages are available in standard package repositories, i.e. on PyPI and CPAN.

• Latest release

• All releases, Changelog

3.1.1 Language Models

To use MorphoDiTa, a language model is needed. The language models are available from LINDAT/CLARIN

infrastructure and described further in the MorphoDiTa User’s Manual. Currently the following language models

are available:

• Czech: czech-morfflex-pdt-160222 (documentation), czech-morfflex-pdt-131112 (documentation)

• English: english-morphium-wsj-140407 (documentation)

3.2 License

MorphoDiTa is an open-source project and is freely available for non-commercial purposes. The library is

distributed under Mozilla Public License 2.0 and the associated models and data under CC BY-NC-SA, although

for some models the original data used to create the model may impose additional licensing conditions.

If you use this tool for scientific work, please give credit to us by referencing the MorphoDiTa website and Straková et al. 2014.

3.3 Platforms and Requirements

MorphoDiTa is available as a standalone tool and as a library for Linux, Windows and OS X. It does not require any additional libraries. Like any supervised machine learning tool, it needs trained linguistic models to perform morphological analysis. The models for the Czech language are available with the tool.

4 MorphoDiTa Installation

MorphoDiTa releases are available on GitHub, either as a pre-compiled binary package, or source code only.

The binary package contains Linux, Windows and OS X binaries, Java bindings binary, C# bindings binary,

and source code of MorphoDiTa and all language bindings. While the binary packages do not contain compiled

Python or Perl bindings, packages for those languages are available in standard package repositories, i.e. on

PyPI and CPAN.

To use MorphoDiTa, a language model is needed; the available language models are listed in Language Models.

If you want to compile MorphoDiTa manually, the sources are available on GitHub, both in the pre-compiled binary package releases and in the repository itself.

4.1 Requirements

• G++ 4.7 or newer, clang 3.2 or newer, Visual C++ 2015 or newer

• make

• SWIG or newer for language bindings other than C++

4.2 Compilation

To compile MorphoDiTa, run make in the src directory.

Make targets and options:

• exe: compile the binaries (default)

• server: compile the REST server

• tools: compile various tools

• lib: compile MorphoDiTa library (decoding only)

• BITS=32 or BITS=64: compile for specified 32-bit or 64-bit architecture instead of the default one

• MODE=release: create release build which statically links the C++ runtime and uses LTO

• MODE=debug: create debug build

• MODE=profile: create profile build

4.2.1 Platforms

Platform can be selected using one of the following options:

• PLATFORM=linux, PLATFORM=linux-gcc: gcc compiler on Linux operating system, default on Linux

• PLATFORM=linux-clang: clang compiler on Linux, must be selected manually

• PLATFORM=osx, PLATFORM=osx-clang: clang compiler on OS X, default on OS X; BITS=32+64 enables multiarch build

• PLATFORM=win, PLATFORM=win-gcc: gcc compiler on Windows (TDM-GCC is well tested), default on Windows

• PLATFORM=win-vs: Visual C++ 2015 compiler on Windows, must be selected manually; note that the cl.exe compiler must be already present in PATH and corresponding BITS=32 or BITS=64 must be specified

Either a POSIX shell or Windows CMD can be used as the shell; it is detected automatically.
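As a rough illustration of how the targets and variables above combine, the following Python sketch assembles (but does not run) a make command line. The helper name build_make_command and the use of `make -C src` instead of changing into the src directory are illustrative assumptions; only the targets and the BITS, MODE and PLATFORM values come from this section.

```python
# Hypothetical helper, not part of MorphoDiTa: it only builds an argv list
# from the make targets and variables documented in this section.
TARGETS = {"exe", "server", "tools", "lib"}
MODES = {"release", "debug", "profile"}
PLATFORMS = {
    "linux", "linux-gcc", "linux-clang",
    "osx", "osx-clang",
    "win", "win-gcc", "win-vs",
}


def build_make_command(target="exe", platform=None, bits=None, mode=None):
    """Return the argv list for compiling MorphoDiTa in the src directory."""
    if target not in TARGETS:
        raise ValueError(f"unknown target: {target}")
    if platform is not None and platform not in PLATFORMS:
        raise ValueError(f"unknown platform: {platform}")
    if mode is not None and mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    if bits == "32+64":
        # The multiarch build is documented only for the OS X platforms;
        # this sketch requires the platform to be named explicitly.
        if platform not in ("osx", "osx-clang"):
            raise ValueError("BITS=32+64 is documented only for OS X")
    elif bits is not None and bits not in ("32", "64"):
        raise ValueError(f"unknown BITS value: {bits}")
    if platform == "win-vs" and bits not in ("32", "64"):
        # Visual C++ builds need BITS=32 or BITS=64 given explicitly.
        raise ValueError("PLATFORM=win-vs requires BITS=32 or BITS=64")

    command = ["make", "-C", "src"]
    if platform is not None:
        command.append(f"PLATFORM={platform}")
    if bits is not None:
        command.append(f"BITS={bits}")
    if mode is not None:
        command.append(f"MODE={mode}")
    command.append(target)
    return command
```

For example, `" ".join(build_make_command("lib", mode="release"))` yields `make -C src MODE=release lib`.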

4.2.2 Further Details

MorphoDiTa uses the C++ BuilTem system; please refer to its manual for all supported options.

4.3 Other language bindings

4.3.1 C#

Binary C# bindings are available in MorphoDiTa binary packages.

To compile C# bindings manually, run make in the bindings/csharp directory, optionally with the options described in MorphoDiTa Installation.

4.3.2 Java

Binary Java bindings are available in MorphoDiTa binary packages.

To compile Java bindings manually, run make in the bindings/java directory, optionally with the options described in MorphoDiTa Installation. Java 6 and newer are supported.

The Java installation specified in the environment variable JAVA_HOME is used. If the environment variable does not exist, JAVA_HOME can be specified using

make JAVA_HOME=path_to_Java_installation

4.3.3 Perl

The Perl bindings are available as Ufal-MorphoDiTa package on CPAN.

To compile Perl bindings manually, run make in the bindings/perl directory, optionally with the options described in MorphoDiTa Installation. Perl 5.10 and later are supported.

The path to the include headers of the required Perl version must be specified in the PERL_INCLUDE variable using

make PERL_INCLUDE=path_to_Perl_includes

4.3.4 Python

The Python bindings are available as ufal.morphodita package on PyPI.

To compile Python bindings manually, run make in the bindings/python directory, optionally with the options described in MorphoDiTa Installation. Both Python 2.6+ and Python 3+ are supported.

The path to the include headers of the required Python version must be specified in the PYTHON_INCLUDE variable using

make PYTHON_INCLUDE=path_to_Python_includes

5 MorphoDiTa User’s Manual

In a natural language text, the task of morphological analysis is to assign to each token (word) in a sentence its lemma (canonical form) and a part-of-speech tag (POS tag). This is usually achieved in two steps: a morphological dictionary looks up all possible lemmas and POS tags for each word, and subsequently a morphological tagger picks the best lemma-POS tag candidate for each word. The second step is called disambiguation.

MorphoDiTa also performs these two steps of morphological analysis: it first outputs all possible pairs of lemma and POS tag for each token. Subsequently, the optimal combination of lemmas and POS tags is selected for the words in a sentence using an algorithm described in Spoustová et al. 2009.
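The two steps can be pictured with a toy Python sketch. Everything in it is an illustrative assumption: the dictionary entries and the tag-pair scores are invented, and the exhaustive search merely stands in for the trained tagger of Spoustová et al. 2009, which it does not implement.

```python
from itertools import product

# Step 1 stand-in: a toy "morphological dictionary" mapping a word form to
# all of its (lemma, POS tag) candidates. Hypothetical data, not a model.
TOY_DICTIONARY = {
    "the": [("the", "DT")],
    "can": [("can", "MD"), ("can", "NN"), ("can", "VB")],
    "rusts": [("rust", "VBZ"), ("rust", "NNS")],
}

# Step 2 stand-in: invented scores for adjacent tag pairs; pairs that are
# not listed score 0.
TOY_TAG_PAIR_SCORES = {
    ("DT", "NN"): 2.0,
    ("NN", "VBZ"): 2.0,
    ("DT", "MD"): -1.0,
    ("MD", "VBZ"): -1.0,
}


def analyze(word):
    """Return every (lemma, tag) candidate for a word form."""
    return TOY_DICTIONARY.get(word.lower(), [(word, "UNK")])


def disambiguate(words):
    """Pick the candidate sequence with the highest total pair score."""
    candidates = [analyze(word) for word in words]

    def score(path):
        return sum(
            TOY_TAG_PAIR_SCORES.get((left[1], right[1]), 0.0)
            for left, right in zip(path, path[1:])
        )

    # Exhaustive search over all combinations; fine for a toy sentence only.
    return list(max(product(*candidates), key=score))
```

Calling `disambiguate(["The", "can", "rusts"])` selects the noun reading of "can" and the verb reading of "rusts", because that combination has the highest toy score.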

Like any supervised machine learning tool, MorphoDiTa needs a trained linguistic model. This section describes the available language models and the command-line tools and interfaces. The C++ library is described elsewhere, either in MorphoDiTa API Tutorial or in MorphoDiTa API Reference.

5.1 Czech MorphoDiTa Models

Czech models are distributed under the CC BY-NC-SA licence. The Czech morphology uses the MorfFlex CZ

160310 Czech morphological dictionary and the Czech tagger is trained on PDT 3.0. Czech models work in

MorphoDiTa version 1.0 or later.

Apart from the MorfFlex CZ dictionary, a prefix guesser and a statistical guesser are implemented and can optionally be used when performing morphological analysis. Version 160310 also contains models with DeriNet as a morphological derivator (requiring MorphoDiTa 1.9 or later).

Czech models are versioned according to the version of the MorfFlex CZ morphological dictionary used. The version format is YYMMDD, where YY, MM and DD are two-digit representations of year, month and day, respectively. The latest version is 160310.
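The YYMMDD version format can be read as a date, which makes versions easy to compare. The sketch below uses only the Python standard library; the helper names parse_model_version and model_version_from_name are hypothetical and not part of MorphoDiTa.

```python
import re
from datetime import date, datetime


def parse_model_version(version):
    """Convert a YYMMDD model version such as '160310' into a date."""
    if len(version) != 6 or not version.isdigit():
        raise ValueError(f"expected six digits (YYMMDD), got {version!r}")
    return datetime.strptime(version, "%y%m%d").date()


def model_version_from_name(name):
    """Find the six-digit version inside a model name and parse it."""
    match = re.search(r"(?<!\d)\d{6}(?!\d)", name)
    if match is None:
        raise ValueError(f"no YYMMDD version found in {name!r}")
    return parse_model_version(match.group(0))


# Pick the newer of the two Czech model versions named in this manual.
latest = max(["131112", "160310"], key=parse_model_version)
```

With the two Czech versions mentioned in this manual, `latest` is `"160310"`.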

Compared to Featurama (http://sourceforge.net/projects/featurama/), a state-of-the-art Czech tagger implementation, the models are 5 times faster and 10 times smaller.

5.1.1 Download

The latest version 160310 of the Czech MorphoDiTa models can be downloaded from LINDAT/CLARIN repository.

Previous Versions

• Version 131112 of the Czech MorphoDiTa models can be downloaded from LINDAT/CLARIN repository.

5.1.2 Acknowledgements

This work has been using language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education of the Czech Republic (project LM2010013).

The Czech morphological system was devised by Jan Hajič.

The MorfFlex CZ dictionary was created by Jan Hajič and Jaroslava Hlaváčová.

The morphological guesser research was supported by the projects 1ET101120503 and 1ET101120413 of Academy of Sciences of the Czech Republic and 100008/2008 of Charles University Grant Agency. The research was performed by Jan Hajič, Jaroslava Hlaváčová and David Kolovratník.

The tagger algorithm and feature set research was supported by the projects MSM0021620838 and LC536 of Ministry of Education, Youth and Sports of the Czech Republic, GA405/09/0278 of the Grant Agency of the Czech Republic and 1ET101120503 of Academy of Sciences of the Czech Republic. The research was performed by Drahomíra "johanka" Spoustová, Jan Hajič, Jan Raab and Miroslav Spousta.

The tagger is trained on morphological layer of Prague Dependency Treebank PDT 3.0, which was supported by the projects LM2010013, LC536, LN00A063 and MSM0021620838 of Ministry of Education, Youth and Sports of the Czech Republic, and developed by Martin Buben, Jan Hajič, Jiří Hana, Hana Hanová, Barbora Hladká, Emil Jeřábek, Lenka Kebortová, Kristýna Kupková, Pavel Květoň, Jiří Mírovský, Andrea Pfimpfrová, Jan Štěpánek and Daniel Zeman.

The morphological derivator is based on DeriNet, which was supported by the Grant No. 16-18177S of the Grant Agency of the Czech Republic and uses language resources developed, stored, and distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2015071).

Publications

• (Hajič 2004) Jan Hajič. Disambiguation of Rich Inflection: Computational Morphology of Czech. Karolinum Press (2004).

• Hlaváčová Jaroslava, Kolovratník David. Morfologie češtiny znovu a lépe. In Informačné Technológie - Aplikácie a Teória. Zborník príspevkov, ITAT 2008. Seňa, Slovakia: PONT s.r.o., 2008, pp. 43-47.

• (Spoustová et al. 2009) Drahomíra "johanka" Spoustová, Jan Hajič, Jan Raab, Miroslav Spousta. 2009. Semi-Supervised Training for the Averaged Perceptron POS Tagger. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 763-771, Athens, Greece, March. Association for Computational Linguistics.

• (Straková et al. 2014) Straková Jana, Straka Milan and Hajič Jan. Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 13-18, Baltimore, Maryland, June 2014. Association for Computational Linguistics.

• (Žabokrtský et al. 2016) Zdeněk Žabokrtský, Magda Ševčíková, Milan Straka, Jonáš Vidra and Adéla Limburská. Merging Data Resources for Inflectional and Derivational Morphology in Czech. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, May 2016.

5.1.3 Czech Morphological System

For Czech, MorphoDiTa uses the Czech morphological system by Jan Hajič (Hajič 2004). In this system, which we call the PDT tag set, the tags are positional with 15 positions corresponding to part of speech, detailed part of speech, gender, number, case, etc. (e.g. NNFS1-----A----). Different meanings of the same lemma are distinguished and additional comments can be provided for every lemma meaning. The lemma itself without the comments and meaning specification is called a raw lemma. The following examples illustrate this:

• Japonsko_;G (raw lemma: Japonsko)

• se_^(zvr._zájmeno/částice) (raw lemma: se)

• tvořit_:T (raw lemma: tvořit)

For a more detailed reference about the Czech morphology, please see Lemma and Tag Structure in PDT 2.0.
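The relation between a full lemma and its raw lemma can be sketched in Python. This is a minimal sketch under a stated assumption: it follows the PDT notation in which comments are attached to the raw lemma with an underscore or a backquote (for example Japonsko_;G) and meanings are numbered with a hyphen followed by a digit. The function name raw_lemma is hypothetical, and the library's own implementation may differ.

```python
def raw_lemma(lemma):
    """Return the raw lemma: the part before a meaning number or comment.

    Assumed convention (PDT-style lemmas): the raw lemma ends at the first
    '_' or '`' (start of a comment) or at a '-' followed by a digit
    (meaning number), provided the marker is not the first character.
    """
    for index in range(1, len(lemma)):
        char = lemma[index]
        if char in "_`":
            return lemma[:index]
        if char == "-" and index + 1 < len(lemma) and lemma[index + 1].isdigit():
            return lemma[:index]
    return lemma


# The three lemmas listed above, written in the assumed underscore notation.
examples = ["Japonsko_;G", "se_^(zvr._zájmeno/částice)", "tvořit_:T"]
raw_lemmas = [raw_lemma(lemma) for lemma in examples]
```

Starting the scan at the second character keeps one-character lemmas such as a lone hyphen or underscore intact.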

5.1.4 Main Czech Model

The main Czech model contains the following files:

czech-morfflex-160310.dict

Morphological dictionary based on Jan Hajič's (Hajič 2004) system with the PDT tag set, created from the MorfFlex CZ 160310 morphological dictionary.

czech-morfflex-pdt-160310.tagger

Tagger trained on the training portion of PDT 3.0 using the neopren feature set. It contains the czech-morfflex-160310.dict morphological dictionary and reaches 95.57% tag accuracy, 97.75% lemma accuracy and 94.93% overall accuracy on PDT 3.0 etest data (whose morphological tags and lemmas were remapped using the czech-morfflex-160310.dict dictionary). Model speed: ~10k words/s, model size: 17MB.

5.1.5 Part of Speech Only Variant

The PDT tag set used by the main Czech model is very fine-grained. In many situations, only the part of speech tags would be sufficient. Therefore, we provide a variant of the model, denoted as pos_only, where only the first two characters of the fifteen-letter tags are used, representing the part of speech and detailed part of speech, respectively. There are 67 such two-letter tags.
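The reduction from a full positional tag to its two-letter form can be sketched as follows. The function names are hypothetical, and only the first five positions are named because those are the ones this manual lists in Czech Morphological System; the remaining ten are left unnamed rather than guessed.

```python
# Names of the first five tag positions as listed in this manual.
LEADING_POSITIONS = ("pos", "detailed_pos", "gender", "number", "case")


def split_pdt_tag(tag):
    """Map the first five positions of a 15-character PDT tag to names."""
    if len(tag) != 15:
        raise ValueError(f"expected 15 positions, got {len(tag)}")
    # zip stops after the five named positions.
    return dict(zip(LEADING_POSITIONS, tag))


def pos_only_tag(tag):
    """Keep only part of speech and detailed part of speech."""
    if len(tag) != 15:
        raise ValueError(f"expected 15 positions, got {len(tag)}")
    return tag[:2]


fields = split_pdt_tag("NNFS1-----A----")
short_tag = pos_only_tag("NNFS1-----A----")
```

For the example tag NNFS1-----A----, `short_tag` is `"NN"` and `fields["case"]` is `"1"`.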

czech-morfflex-160310-pos_only.dict

Morphological dictionary based on Jan Hajič's (Hajič 2004) system, created from the MorfFlex CZ 160310 morphological dictionary. Only the first two tag characters of the PDT tag set are used.

czech-morfflex-pdt-160310-pos_only.tagger

Very fast tagger trained on the training portion of PDT 3.0 using the neopren feature set. It contains the czech-morfflex-160310-pos_only.dict morphological dictionary and reaches 99.04% tag accuracy, 97.62% lemma accuracy and 97.56% overall accuracy on PDT 3.0 etest data (whose morphological tags and lemmas were remapped using the czech-morfflex-160310-pos_only.dict dictionary). Model speed: ~200k words/s, model size: 4MB.

5.1.6 No Diacritical Marks Variant

Sometimes the text to be analyzed does not contain diacritical marks. We therefore provide variants of the

morphological dictionary and tagger for this purpose – morphological analysis, morphological generation and

tagging employ forms without diacritical marks. Note that the lemmas do have diacritical marks.
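To see what a form without diacritical marks looks like, or to prepare test input for these variants, diacritics can be removed with the Python standard library. This is a sketch of one common technique (Unicode decomposition followed by dropping combining marks); how the no_dia models themselves were derived is not described here, so the exact correspondence is an assumption.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks so that 'tvořit' becomes 'tvorit'."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        char for char in decomposed if not unicodedata.combining(char)
    )
    # Recompose whatever remains into the usual NFC form.
    return unicodedata.normalize("NFC", stripped)


plain = strip_diacritics("Příliš žluťoučký kůň")
```

Here `plain` is `"Prilis zlutoucky kun"`.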

We provide the no_dia variants for all four models described above:

czech-morfflex-160310-no_dia.dict

No diacritical marks variant of czech-morfflex-160310.dict.

czech-morfflex-pdt-160310-no_dia.tagger

No diacritical marks variant of czech-morfflex-pdt-160310.tagger. It reaches 94.74% tag accuracy, 97.05% lemma accuracy and 93.83% overall accuracy on PDT 3.0 etest data (whose morphological tags and lemmas were remapped using the czech-morfflex-160310-no_dia.dict dictionary) with diacritical marks removed. Model speed: ~5k words/s, model size: 21MB.

czech-morfflex-160310-no_dia-pos_only.dict

No diacritical marks variant of czech-morfflex-160310-pos_only.dict.

czech-morfflex-pdt-160310-no_dia-pos_only.tagger

No diacritical marks variant of czech-morfflex-pdt-160310-pos_only.tagger. It reaches 98.59% tag accuracy, 97.04% lemma accuracy and 96.96% overall accuracy on PDT 3.0 etest data (whose morphological tags and lemmas were remapped using the czech-morfflex-160310-no_dia-pos_only.dict dictionary) with diacritical marks removed. Model speed: ~130k words/s, model size: 9MB.

5.1.7 Morphological Derivation Model using DeriNet

All version 160310 models are also available with a morphological derivator using DeriNet version 1.1. The models which include DeriNet require MorphoDiTa 1.9 or later.

In the future, only the models with DeriNet will be released.

5.1.8 Models with Raw Lemmas

The Czech morphological system distinguishes different meanings of the same lemma by numbering the lemmas with multiple meanings and supplying additional comments for every lemma meaning, as described and demonstrated in Czech Morphological System. Sometimes this may be undesirable, for example when comparing to systems which do not use the MorfFlex CZ 160310 morphological dictionary.

To obtain lemmas without any additional information (raw lemmas in terms of the MorphoDiTa API), use the strip_lemma_id tag set converter. Previously, specific dictionary and tagger model variants were provided for this purpose; they are no longer needed.

5.1.9 Czech Model History

czech-morfflex-160310 and czech-morfflex-pdt-160310 (require MorphoDiTa 1.0 or later)

Trained on PDT 3.0 using MorfFlex CZ 160310, variants: Part of Speech Only, No Diacritical Marks.

Download from LINDAT/CLARIN repository.

czech-morfflex-131112 and czech-morfflex-pdt-131112 (require MorphoDiTa 1.0 or later)

Trained on PDT 2.5 using MorfFlex CZ 131112, variants: Part of Speech Only, Raw Lemmas. Download from LINDAT/CLARIN repository.

5.2 English MorphoDiTa Models

English models are created using the following data:

• SCOWL (Spell Checker Oriented Word Lists): This word list is used in morphological generation to create all possible word forms of a given word.

Licensing (from the SCOWL copyright notice by Kevin Atkinson): Permission to use, copy, modify, distribute and sell these word lists, the associated scripts, the output created from the scripts, and its documentation for any purpose is hereby granted without fee, provided that the above copyright notice appears in all copies and that both that copyright notice and this permission notice appear in supporting documentation. Kevin Atkinson makes no representations about the suitability of this array for any purpose. It is provided "as is" without express or implied warranty.

• Wall Street Journal, part of the Penn Treebank 3: Morphologically annotated texts which are commonly used to train English POS taggers.

Licensing: Available as LDC99T42 in LDC catalog under LDC User Agreement.

The resulting models are distributed under the CC BY-NC-SA licence. English models work in MorphoDiTa

version 1.1 or later.

English models are versioned according to the release date. The version format is YYMMDD, where YY, MM and DD are two-digit representations of year, month and day, respectively. The latest version is 140407.

5.2.1 Download

The latest version 140407 of the English MorphoDiTa models can be downloaded from LINDAT/CLARIN

repository.

5.2.2 Acknowledgements

This work has been using language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education of the Czech Republic (project LM2010013).

The morphological POS analyzer development was supported by grant of the Ministry of Education, Youth and Sports of the Czech Republic No. LC536 "Center for Computational Linguistics". The morphological POS analyzer research was performed by Johanka Spoustová (Spoustová 2008; the Treex::Tool::EnglishMorpho::Analysis Perl module). The lemmatizer was implemented by Martin Popel (Popel 2009; the Treex::Tool::EnglishMorpho::Lemmatizer Perl module). The lemmatizer is based on morpha, which was released under LGPL licence as a part of RASP system.

The tagger algorithm and feature set research was supported by the projects MSM0021620838 and LC536 of Ministry of Education, Youth and Sports of the Czech Republic, GA405/09/0278 of the Grant Agency of the Czech Republic and 1ET101120503 of Academy of Sciences of the Czech Republic. The research was performed by Drahomíra "johanka" Spoustová, Jan Hajič, Jan Raab and Miroslav Spousta.

Publications

• (Popel 2009) Martin Popel. Ways to Improve the Quality of English-Czech Machine Translation. Master Thesis at Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague (2009).

• (Spoustová 2008) Drahomíra "johanka" Spoustová. Morphium – morphological analyser for Penn treebank POS tagset. Perl Software developed at Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague (2008).

• (Spoustová et al. 2009) Drahomíra "johanka" Spoustová, Jan Hajič, Jan Raab, Miroslav Spousta. 2009. Semi-Supervised Training for the Averaged Perceptron POS Tagger. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 763-771, Athens, Greece, March. Association for Computational Linguistics.

• (Straková et al. 2014) Straková Jana, Straka Milan and Hajič Jan. Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 13-18, Baltimore, Maryland, June 2014. Association for Computational Linguistics.

5.2.3 English Morphological System

The English morphology uses standard Penn Treebank POS tags. Nevertheless, the lemma structure is unique:

• The lemmatizer recognizes negative prefixes and removes them from the lemma. In terms of the MorphoDiTa API, the raw lemma is the lemma without the negative prefix.

• The negative prefix is also stored to allow morphological generation of a word form with the same negative prefix. In terms of the MorphoDiTa API, the lemma id is the raw lemma plus the negative prefix.

The negative prefix is separated from the (always nonempty) lemma using a ^ character (able^un). During morphological generation, the negative prefix is honored. Furthermore, when the lemma ends with ^ (i.e., the negative prefix is empty, as in able^), forms with negative prefixes are generated. It is also possible to generate all forms without any negative prefix by appending + after the lemma (for example able+).
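The lemma structure described above can be taken apart with a few lines of Python. This is only a sketch of the notation: the function names are hypothetical, the trailing + used for generation requests is not handled, and rebuilding the word by putting the prefix in front of the raw lemma is an illustrative assumption rather than the library's behaviour.

```python
def split_english_lemma(lemma_id):
    """Split a lemma id such as 'able^un' into (raw_lemma, negative_prefix).

    negative_prefix is None when the lemma id contains no '^' separator and
    an empty string when the separator is present but the prefix is empty.
    """
    # The lemma is always nonempty, so look for '^' from the second character.
    separator = lemma_id.find("^", 1)
    if separator == -1:
        return lemma_id, None
    return lemma_id[:separator], lemma_id[separator + 1:]


def with_negative_prefix(lemma_id):
    """Illustration only: put the stored prefix back in front of the lemma."""
    raw, prefix = split_english_lemma(lemma_id)
    return (prefix or "") + raw


parts = split_english_lemma("able^un")
```

Here `parts` is `("able", "un")`, and `with_negative_prefix("able^un")` gives `"unable"`.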

5.2.4 English Model

The English model contains the following files:

english-morphium-<version>.dict

Morphological dictionary. The SCOWL word list has been automatically analyzed and lemmatized and is used as the dictionary. The guesser performing the analysis and lemmatization is available.

english-morphium-wsj-<version>.tagger

Tagger trained on the training portion of Wall Street Journal (Sections 0-18) and tuned on the development portion (Sections 19-21). Contains the english-morphium-<version>.dict morphological dictionary.

The latest version english-morphium-wsj-140407.tagger reaches 97.27% tag accuracy on the Wall Street Journal test portion (Sections 22-24). Model speed: ~60k words/s, model size: 6MB.

5.2.5 No Negations Variant

Stripping of negative prefixes (or handling the lemmas with negative prefixes stripped) may not be desirable. Therefore, a variant of the English model denoted by no_negation is provided, which does not strip negative prefixes from lemmas.

english-morphium-<version>-no_negation.dict

Morphological dictionary which does not strip negative lemma prefixes. The SCOWL word list has been automatically analyzed and lemmatized and is used as the dictionary. The guesser performing the analysis and lemmatization is available.

english-morphium-wsj-<version>-no_negation.tagger

Tagger which does not strip negative lemma prefixes, trained on the training portion of Wall Street Journal (Sections 0-18) and tuned on the development portion (Sections 19-21). Contains the english-morphium-<version>-no_negation.dict morphological dictionary.

The latest version english-morphium-wsj-140407-no_negation.tagger reaches 97.25% tag accuracy on the Wall Street Journal test portion (Sections 22-24). Model speed: ~60k words/s, model size: 6MB.

5.2.6 English Model Changes

english-morphium-140407 and english-morphium-wsj-140407 (require MorphoDiTa 1.1 or later)

Also recognizes "non-" as a negative prefix. Formerly, only "non" was recognized.

english-morphium-140304 and english-morphium-wsj-140304 (require MorphoDiTa 1.0 or later)

Initial release.

5.3 Running the Tagger

Probably the most common use of MorphoDiTa is running a tagger to tag your data using

run_tagger tagger_model

The input is assumed to be in UTF-8 encoding and can be either already tokenized and segmented, or plain text which is tokenized and segmented automatically.

Any number of files can be specified after the tagger model. If an argument input_file:output_file is used, the given input_file is processed and the result is saved to output_file. If only input_file is used, the result is written to standard output. If no argument is given, input is read from standard input and the result is written to standard output.

The full command syntax of run_tagger is

run_tagger [options] tagger_file [file[:output_file]]...

Options: --input=untokenized|vertical

--convert_tagset=pdt_to_conll2009|strip_lemma_comment|strip_lemma_id

--derivation=none|root|path|tree

--guesser=0|1 (should morphological guesser be used)

--output=vertical|xml

5.3.1 Input Formats

The input format is specified using the --input option. Currently supported input formats are:

• untokenized (default): the input is tokenized and segmented using a tokenizer defined by the model,

• vertical: the input is in vertical format, every line is considered a word, with an empty line denoting the end of a sentence.

5.3.2 Tag Set Conversion

Some tag sets can be converted to different ones. Currently supported tag set conversions are:

• pdt_to_conll2009: convert Czech PDT tag set to CoNLL 2009 tag set,

• strip_lemma_comment: strip lemma comment (see Lemma Structure in API Reference),

• strip_lemma_id: strip lemma id (see Lemma Structure in API Reference).

5.3.3 Morphological Derivation

If the morphological model includes a morphological derivator, a morphological derivation operation may be performed on lemmas:

• none (default): no morphological derivation is performed

• root: the lemma is replaced by its root in the morphological derivation tree

• path: the lemma is replaced by a space separated path to its root in the morphological derivation tree (the original lemma is first, followed by its parent, with the root being the last one)

• tree: the whole morphological derivation tree is appended after the lemma, encoded in the following way: the root node is first, then the subtrees of the root's children are encoded recursively (each after one space), followed by a final space (which denotes that the children are complete)
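The tree encoding above is compact, so a short C++ sketch of one reading of it may help. Everything here is illustrative: derivation_node, encode_subtree and encode_tree are hypothetical names, not MorphoDiTa API, and the sketch only produces the encoded tree itself, not the lemma it would be appended to. It assumes a C++17 compiler, because the node stores a vector of its own type.

```cpp
#include <string>
#include <vector>

// Hypothetical in-memory derivation tree node, used only for this sketch.
struct derivation_node {
  std::string lemma;
  std::vector<derivation_node> children;
};

// Appends the encoding of `node`: its lemma, then each child's subtree after
// one space, then a final space marking that the children are complete.
void encode_subtree(const derivation_node& node, std::string& out) {
  out += node.lemma;
  for (const derivation_node& child : node.children) {
    out += ' ';
    encode_subtree(child, out);
  }
  out += ' ';
}

// Encodes a whole tree starting from its root.
std::string encode_tree(const derivation_node& root) {
  std::string out;
  encode_subtree(root, out);
  return out;
}
```

Under this reading a leaf is encoded as its lemma plus one space, so two consecutive spaces in the output signal that a node's list of children has ended.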

5.3.4 Morphological Guesser

By default, every tagger model uses the morphological guesser settings employed during model training. However, the use of the morphological guesser can be overridden with the --guesser option.

5.3.5 Output Formats

The output format is specified using the --output option. Currently supported output formats are:

• xml (default): Simple XML format without a root element, using the <sentence> element to mark sentences and the <token lemma="..." tag="...">...</token> element to encode a token and its assigned lemma and tag.

Example output for input Děti pojedou k babičce. Už se těší. (line breaks added):

<sentence><token lemma="dítě" tag="NNFP1-----A----">Děti</token>
<token lemma="jet-1_^(pohybovat_se,_ne_však_chůzí)" tag="VB-P---3F-AA---">pojedou</token>
<token lemma="k-1" tag="RR--3----------">k</token>
<token lemma="babička" tag="NNFS3-----A----">babičce</token><token lemma="." tag="Z:-------------">.</token></sentence>
<sentence><token lemma="už-1" tag="Db-------------">Už</token>
<token lemma="se_^(zvr._zájmeno/částice)" tag="P7-X4----------">se</token>
<token lemma="těšit_:T" tag="VB-S---3P-AA---">těší</token><token lemma="." tag="Z:-------------">.</token></sentence>

• vertical: Every output line is a tab separated form-lemma-tag triple, with an empty line denoting the end of a sentence.

Example output for input Děti pojedou k babičce. Už se těší.:

Děti	dítě	NNFP1-----A----
pojedou	jet-1_^(pohybovat_se,_ne_však_chůzí)	VB-P---3F-AA---
k	k-1	RR--3----------
babičce	babička	NNFS3-----A----
.	.	Z:-------------

Už	už-1	Db-------------
se	se_^(zvr._zájmeno/částice)	P7-X4----------
těší	těšit_:T	VB-S---3P-AA---
.	.	Z:-------------

5.4 Running the Morphology

There are multiple commands performing morphological tasks. The run_morpho_analyze executable performs morphological analysis and the run_morpho_generate executable performs morphological generation. The output of these commands is suitable for automatic processing.

The run_morpho_cli executable performs both morphological analysis and generation, but is designed to be used interactively and produces more human-readable output.

5.4.1 Morphological Analysis

The morphological analysis can be performed by running

run_morpho_analyze morphology_model use_guesser

The input is assumed to be in UTF-8 encoding and can be either already tokenized and segmented, or plain text which is tokenized and segmented automatically. The input files are specified in the same way as with the run_tagger command.

Some morphological models contain both a manually created dictionary and a guesser. Therefore, a numeric use_guesser argument is required. If it is non-zero, the guesser is used; otherwise it is not.

Because tagger models contain an embedded morphological model, a tagger model can be used instead of a morphological one if the --from_tagger option is specified.

The full command syntax of run_morpho_analyze is

run_morpho_analyze [options] morphology_model use_guesser [file[:output_file]]...

Options: --input=untokenized|vertical

--convert_tagset=pdt_to_conll2009|strip_lemma_comment|strip_lemma_id

--derivation=none|root|path|tree

--output=vertical|xml

--from_tagger

Input Formats

The input format is specified using the --input option. Currently supported input formats are:

• untokenized (default): the input is tokenized and segmented using a tokenizer defined by the model,

• vertical: the input is in vertical format, every line is considered a word, with an empty line denoting the end of a sentence.

Note that the input data is also segmented, even if it is not strictly necessary. Therefore, the input is processed by whole paragraphs (ending with an empty line).

Tag Set Conversion

Some tag sets can be converted to different ones. Currently supported tag set conversions are:

• pdt_to_conll2009: convert Czech PDT tag set to CoNLL 2009 tag set,

• strip_lemma_comment: strip lemma comment (see Lemma Structure in API Reference),

• strip_lemma_id: strip lemma id (see Lemma Structure in API Reference).

Morphological Derivation

If the morphological model includes a morphological derivator, a morphological derivation operation may be performed on lemmas:

• none (default): no morphological derivation is performed

• root: the lemma is replaced by its root in the morphological derivation tree

• path: the lemma is replaced by a space separated path to its root in the morphological derivation tree (the original lemma is first, followed by its parent, with the root being the last one)

• tree: the whole morphological derivation tree is appended after the lemma, encoded in the following way: the root node is first, then the subtrees of the root's children are encoded recursively (each after one space), followed by a final space (which denotes that the children are complete)

Output Formats

The output format is specified using the --output option. Currently supported output formats are:

• xml (default): Simple XML format without a root element, using the <token><analysis lemma="..." tag="..."/><analysis...>...</token> element to encode morphological analysis.

Example output for input Děti pojedou k babičce. Už se těší. (line breaks added):

<sentence><token><analysis lemma="dítě" tag="NNFP1-----A----"/><analysis lemma="dítě"
tag="NNFP4-----A----"/><analysis lemma="dítě" tag="NNFP5-----A----"/>Děti</token>
<token><analysis lemma="jet-1_^(pohybovat_se,_ne_však_chůzí)"
tag="VB-P---3F-AA---"/>pojedou</token>
<token><analysis lemma="k-1" tag="RR--3----------"/><analysis
lemma="k-3_^(označení_pomocí_písmene)" tag="NNNXX-----A----"/><analysis
lemma="k-4`kůň_:B_^(jednotka_výkonu)" tag="NNMXX-----A---8"/><analysis
lemma="k-8_:B_^(ost._zkratka)" tag="XX------------8"/><analysis
lemma="komanditní_:B_^(jen_komanditní_společnost)"
tag="AAXXX----1A---8"/><analysis lemma="koncernový_:B"
tag="AAXXX----1A---8"/><analysis lemma="kuo-1_:B_,t_^(stará_jednotka_výkonu)"
tag="NNNXX-----A---8"/>k</token>
<token><analysis lemma="babička" tag="NNFS3-----A----"/><analysis lemma="babička"
tag="NNFS6-----A----"/>babičce</token>
<token><analysis lemma="." tag="Z:-------------"/>.</token></sentence>
<sentence><token><analysis lemma="už-1" tag="Db-------------"/><analysis lemma="už-2"
tag="TT-------------"/>Už</token>
<token><analysis lemma="se_^(zvr._zájmeno/částice)" tag="P7-X4----------"/><analysis
lemma="s-1" tag="RV--2----------"/><analysis lemma="s-1"
tag="RV--7----------"/>se</token>
<token><analysis lemma="těšit_:T" tag="VB-P---3P-AA---"/><analysis lemma="těšit_:T"
tag="VB-S---3P-AA---"/>těší</token>
<token><analysis lemma="." tag="Z:-------------"/>.</token></sentence>

• vertical: Every output line contains a word followed by the tab separated lemma-tag pairs assigned to that word, with an empty line denoting the end of a sentence.

Example output for input Děti pojedou k babičce. Už se těší.:

Děti	dítě	NNFP1-----A----	dítě	NNFP4-----A----	dítě	NNFP5-----A----
pojedou	jet-1_^(pohybovat_se,_ne_však_chůzí)	VB-P---3F-AA---
k	k-1	RR--3----------	k-3_^(označení_pomocí_písmene)	NNNXX-----A----	k-4`kůň_:B_^(jednotka_výkonu)	NNMXX-----A---8	k-8_:B_^(ost._zkratka)	XX------------8	komanditní_:B_^(jen_komanditní_společnost)	AAXXX----1A---8	koncernový_:B	AAXXX----1A---8	kuo-1_:B_,t_^(stará_jednotka_výkonu)	NNNXX-----A---8
babičce	babička	NNFS3-----A----	babička	NNFS6-----A----
.	.	Z:-------------

Už	už-1	Db-------------	už-2	TT-------------
se	se_^(zvr._zájmeno/částice)	P7-X4----------	s-1	RV--2----------	s-1	RV--7----------
těší	těšit_:T	VB-P---3P-AA---	těšit_:T	VB-S---3P-AA---
.	.	Z:-------------
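A line of vertical analysis output can be split into the word and its lemma-tag pairs as sketched below. The names split_tabs, analyzed_word and parse_analysis_line are hypothetical helpers for this illustration and are not provided by MorphoDiTa; empty lines (sentence boundaries) are rejected here so that the caller can handle them separately.

```cpp
#include <string>
#include <utility>
#include <vector>

// Splits a line on tab characters, keeping empty fields.
std::vector<std::string> split_tabs(const std::string& line) {
  std::vector<std::string> fields;
  std::size_t start = 0;
  while (true) {
    std::size_t tab = line.find('\t', start);
    fields.push_back(
        line.substr(start, tab == std::string::npos ? tab : tab - start));
    if (tab == std::string::npos) break;
    start = tab + 1;
  }
  return fields;
}

// A word with all analyses listed on its output line (hypothetical type).
struct analyzed_word {
  std::string form;
  std::vector<std::pair<std::string, std::string>> analyses;  // (lemma, tag)
};

// Parses "form<TAB>lemma<TAB>tag<TAB>lemma<TAB>tag...". Returns false for an
// empty line or when the fields do not form one word plus complete pairs.
bool parse_analysis_line(const std::string& line, analyzed_word& word) {
  if (line.empty()) return false;
  std::vector<std::string> fields = split_tabs(line);
  if (fields.size() % 2 == 0) return false;
  word.form = fields[0];
  word.analyses.clear();
  for (std::size_t i = 1; i + 1 < fields.size(); i += 2)
    word.analyses.emplace_back(fields[i], fields[i + 1]);
  return true;
}
```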

5.4.2 Morphological Generation

The morphological generation can be performed by running

run_morpho_generate morphology_model use_guesser

The input is assumed to be in UTF-8 encoding. The input files are specified in the same way as with the run_tagger command.

Input for morphological generation has to be in vertical format, each line containing a lemma, optionally followed by a tab and a tag wildcard. The output has the same number of lines as the input; line l contains the tab separated form-lemma-tag triplets which can be generated from the lemma on input line l. If a tag wildcard was provided, only triplets with matching tags are returned.

Some morphological models contain both a manually created dictionary and a guesser. Therefore, a numeric use_guesser argument is required. If it is non-zero, the guesser is used; otherwise it is not.

Because tagger models contain an embedded morphological model, a tagger model can be used instead of a morphological one if the --from_tagger option is specified.

The full command syntax of run_morpho_generate is

run_morpho_generate [options] morphology_model use_guesser [input_file[:output_file]]...

Options: --convert_tagset=pdt_to_conll2009|strip_lemma_comment|strip_lemma_id

--from_tagger

Example input data:

dítě
jet	?[fN]??[-1]
k-1
babička	NNFS3-----A----

Example output (one line per input line):

dítě	dítě	NNNS1-----A----	dítě	dítě	NNNS4-----A----	dítě	dítě	NNNS5-----A----	dítěte	dítě	NNNS2-----A----	dítěti	dítě	NNNS3-----A----	dítěti	dítě	NNNS6-----A----	dítětem	dítě	NNNS7-----A----	děti	dítě	NNFP1-----A----	děti	dítě	NNFP4-----A----	děti	dítě	NNFP5-----A----	dětma	dítě	NNFP7-----A---6	dětmi	dítě	NNFP7-----A----	dětem	dítě	NNFP3-----A----	dětí	dítě	NNFP2-----A----	dětech	dítě	NNFP6-----A----	dětima	dítě_,h	NNFP7-----A---6
ject	jet	Vf--------A---6	jet	jet-1_^(pohybovat_se,_ne_však_chůzí)	Vf--------A----	jeti	jet-1_^(pohybovat_se,_ne_však_chůzí)	Vf--------A---2	nejet	jet-1_^(pohybovat_se,_ne_však_chůzí)	Vf--------N----	nejeti	jet-1_^(pohybovat_se,_ne_však_chůzí)	Vf--------N---2	jet	jet-2_,h_^(letadlo_s_tryskovým_pohonem)	NNIS1-----A----	jety	jet-2_,h_^(letadlo_s_tryskovým_pohonem)	NNIP1-----A----
k	k-1	RR--3----------	ke	k-1	RV--3----------	ku	k-1	RV--3---------1
babičce	babička	NNFS3-----A----

Tag Set Conversion

Some tag sets can be converted to different ones. Currently supported tag set conversions are:

• pdt_to_conll2009: convert Czech PDT tag set to CoNLL 2009 tag set,

• strip_lemma_comment: strip lemma comment (see Lemma Structure in API Reference),

• strip_lemma_id: strip lemma id (see Lemma Structure in API Reference).

Note that the tag set conversion is applied only to the output, not to the input lemmas and wildcards.

Tag Wildcards

When only forms with a specific tag should be generated for a given lemma, a tag wildcard can be specified. The tag wildcard is a simple pattern used to filter the results of morphological generation.

Most characters of a tag wildcard match the corresponding characters of a tag, with the following exceptions:

• ? matches any character of a tag.

• [chars] matches any of the characters listed. The dash - has no special meaning, and if ] is the first character in chars, it is considered one of the characters and does not end the group.

• [^chars] matches any of the characters not listed.
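The matching rules above can be sketched in C++ as follows. This is an illustration, not MorphoDiTa's implementation: the function name tag_matches is hypothetical, and comparing only up to the shorter of the wildcard and the tag is an assumption inferred from the generation example, where the five-position wildcard ?[fN]??[-1] selects fifteen-position tags.

```cpp
#include <string>

// Returns true when `tag` matches `wildcard`, comparing position by position
// until either string ends (assumption: the wildcard acts as a prefix filter).
bool tag_matches(const std::string& wildcard, const std::string& tag) {
  std::size_t w = 0, t = 0;
  while (w < wildcard.size() && t < tag.size()) {
    char c = wildcard[w];
    if (c == '?') {  // matches any single tag character
      ++w;
      ++t;
      continue;
    }
    if (c == '[') {
      std::size_t i = w + 1;
      bool negate = i < wildcard.size() && wildcard[i] == '^';
      if (negate) ++i;
      bool found = false;
      bool first = true;
      // A ']' in first position is an ordinary member and does not close the
      // group; '-' is never treated as a range operator.
      while (i < wildcard.size() && (first || wildcard[i] != ']')) {
        if (wildcard[i] == tag[t]) found = true;
        first = false;
        ++i;
      }
      if (found == negate) return false;
      w = i < wildcard.size() ? i + 1 : i;  // step over the closing ']'
      ++t;
      continue;
    }
    if (c != tag[t]) return false;  // ordinary character must match exactly
    ++w;
    ++t;
  }
  return true;
}
```

With this sketch, ?[fN]??[-1] accepts both Vf--------A---- and NNIS1-----A---- from the example output, and rejects a tag whose fifth position is neither - nor 1.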

5.4.3 Interactive Morphological Analysis and Generation

Interactive morphological analysis and generation with more human-readable output can be run using:

run_morpho_cli morphology_model

The input is read from standard input, one command per line. If there is no tab on a line, analysis is performed on the given word. If there is a tab on a line, generation is performed on the first word, using the second word as a tag wildcard. If the second word is empty (i.e., the input is for example on followed by a tab), all forms are generated.

Because tagger models contain an embedded morphological model, a tagger model can be used instead of a morphological one if the --from_tagger option is specified.

The full command syntax of run_morpho_cli is

run_morpho_cli [options] morphology_model

Options: --from_tagger

5.5 Running the Tokenizer

Using the run_tokenizer executable, it is possible to perform only tokenization and segmentation.

The input is UTF-8 encoded plain text, and the input files are specified in the same way as with the run_tagger command.

The tokenizer can be specified by using a morphology model (--morphology option), a tagger model (--tagger option), or a tokenizer identifier (--tokenizer option). Currently supported tokenizer identifiers are:

• czech

• english

• generic

The full command syntax of run_tokenizer is

run_tokenizer [options] [file[:output_file]]...

Options: --tokenizer=czech|english|generic

--morphology=morphology_model_file

--tagger=tagger_model_file

--output=vertical|xml

5.5.1 Output Formats

The output format is specified using the --output option. Currently supported output formats are:

• xml (default): Simple XML format without a root element, using the <sentence> element to mark sentences and the <token> element to mark tokens.

Example output for input Děti pojedou k babičce. Už se těší. (line breaks added):

<sentence><token>Děti</token> <token>pojedou</token> <token>k</token>
<token>babičce</token><token>.</token></sentence> <sentence><token>Už</token>
<token>se</token> <token>těší</token><token>.</token></sentence>

• vertical: Each token is on a separate line, and every sentence is ended by a blank line.

Example output for input Děti pojedou k babičce. Už se těší.:

Děti
pojedou
k
babičce
.

Už
se
těší
.

5.6 Running REST Server

MorphoDiTa also provides a REST server binary, morphodita_server. The binary uses MicroRestD as its REST server implementation and provides the MorphoDiTa REST API.

The full command syntax of morphodita_server is

morphodita_server [options] port (model_name weblicht_id model_file acknowledgements)*

Options: --daemon

The morphodita_server can run either in the foreground or in the background (when --daemon is used). The specified model files are loaded during start and kept in memory all the time. This behaviour might change in the future to load the models on demand.

5.7 Custom Morphological and Tagging Models

It is possible to create custom morphological and tagging models.

5.7.1 Custom Morphological Models

Custom morphological models can be created using the encode_dictionary binary.

The encode_dictionary binary reads from standard input and prints a MorphoDiTa morphological model on standard output. The input of encode_dictionary is a textual representation of a morphological dictionary. It should be UTF-8 encoded and every line should be a tab separated triplet lemma \t tag \t form. All forms of one lemma must appear in a continuous region and no line should appear more than once (sort -u can be used to achieve this).

Run encode_dictionary with the following options:

encode_dictionary generic max_suffix_len unknown_tag number_tag punctuation_tag symbol_tag

• generic: This parameter defines the tokenizer and other language specific behaviour. Values other than generic take different options and are not documented.

• max_suffix_len: Maximum length of suffixes in automatically inferred inflection classes. If unsure, use 8 (we use 8 for Czech and 4 for English). Smaller values produce larger and slightly faster models.

• unknown_tag: Assigned to a form during analysis if no matching tag can be found.

• number_tag: Assigned to a form during analysis if the form was not found in the dictionary and it looks like a number. Can be the same as unknown_tag.

• punctuation_tag: Assigned to a form during analysis if the form was not found in the dictionary and it consists of Unicode characters in the Punctuation category. Can be the same as unknown_tag.

• symbol_tag: Assigned to a form during analysis if the form was not found in the dictionary and it consists of Unicode characters in the Symbol category. Can be the same as unknown_tag.

Example input data:

dog NN dog

dog NNS dogs

go VB go

go VBP go

go VBZ goes

go VBG going

go VBD went

Example command line:

encode_dictionary generic 8 UNK NUM PUNC SYM <input_data >output_model

Using External Morphology

Sometimes it is useful to train a MorphoDiTa tagger using external morphological analysis, without having a MorphoDiTa morphological dictionary.

That is possible using a so-called external morphology model. An external morphology model can be created easily using

encode_dictionary external unknown_tag >output_model

No standard input is read in this case. The unknown_tag parameter is used when no tag is assigned to a word form during analysis. The resulting model is printed on standard output.

The external morphology model does not contain any morphological dictionary. Instead, it expects the user to perform morphological analysis and generation on their own. Therefore, the input form to analysis is expected to be followed by space separated lemma-tag pairs, which are returned by the analysis. Similarly, the input lemma to generation is expected to be followed by space separated form-tag pairs, which are again returned by the generation (possibly filtered by a tag wildcard). (To extract the length of the form or lemma itself even when followed by external analyses, the API calls raw_form_len or raw_lemma_len and lemma_id_len can be used.)

Note that the tokenizer returned by the external morphology model is the same as the tokenizer of the generic model, and splits input on spaces. Therefore, it can be used to tokenize input; the tokens can then be passed to the external morphology, and the results, after proper formatting, can be used as input to MorphoDiTa in vertical input format.

Example input form for analysis using external morphology model:

wishes wish NNS wish VBZ

Example input lemma for generation using external morphology model:

go go VB go VBP goes VBZ going VBG went VBD
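Producing such input from an external analyzer amounts to joining the form with its lemma-tag pairs. The C++ sketch below shows this for the analysis case; external_analysis_input is a hypothetical helper name, and it assumes that forms, lemmas and tags contain no spaces, since spaces are the separator.

```cpp
#include <string>
#include <utility>
#include <vector>

// Builds "form lemma tag lemma tag ..." as expected by an external
// morphology model during analysis.
std::string external_analysis_input(
    const std::string& form,
    const std::vector<std::pair<std::string, std::string>>& lemma_tags) {
  std::string line = form;
  for (const auto& lemma_tag : lemma_tags) {
    line += ' ';
    line += lemma_tag.first;   // lemma
    line += ' ';
    line += lemma_tag.second;  // tag
  }
  return line;
}
```

The same joining applies to generation input, with the lemma in first position followed by form-tag pairs.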

5.7.2 Custom Tagging Models

Custom tagging models can be trained using the train_tagger binary, which has the following options:

train_tagger generic_234 morphology use_guesser features iterations prune_features [heldout_data [early_stopping]] <input_data >tagger_model

• generic_234: This parameter defines the tagger (elementary features and algorithm) and the order of Viterbi decoding. Use either generic2, generic3 or generic4. If unsure, use generic3 (the best released Czech and English models use generic3). generic2 produces faster but less accurate models; generic4 produces larger and only marginally better models.

• morphology: File with the morphological dictionary to use.

• use_guesser: Use 0/1 to specify whether the morphological guesser should be used. Unless you have a good reason not to, use 1.

• features: File with feature sequences for the tagger. The file format and available elementary features are described in the Feature File Format section below.

• iterations: Number of training iterations. For English, values 5-10 are used; for Czech, values 10-15 are used. Can be affected by early stopping.

• prune_features: Use 0/1 to disable/enable pruning of feature sequences not found in the training data. Use 1 for smaller and marginally less accurate models, and 0 for larger and marginally better models. If unsure, use 1 (the best released Czech and English models use 1).

• heldout_data: Optional file with heldout data in the same format as the input data. If supplied, accuracy is measured on the heldout data after every training iteration.

• early_stopping: Optionally use 0/1 to disable/enable early stopping. If early stopping is enabled, the resulting model is not the one after the last training iteration, but the one with the best heldout data accuracy.

Example command line (use morphology from morpho.dict, features from features.ft and no heldout data):

train_tagger generic3 morpho.dict 1 features.ft 10 1 <input.data >tagger.model

Example command line (use morphology from morpho.dict, features from features.ft and use heldout data with early stopping):

train_tagger generic3 morpho.dict 1 features.ft 15 1 heldout.data 1 <input.data >tagger.model

See the next sections for examples of input data and feature files.

Input Data Format

The input data (and the heldout data) represent a sequence of sentences. Different sentences do not interact in any way. Words of one sentence are stored on consecutive lines, each line containing a tab separated triplet form \t lemma \t tag in UTF-8 encoding. The end of a sentence is denoted by an empty line.

Example:

Děti	dítě	NNFP1-----A----
pojedou	jet-1_^(pohybovat_se,_ne_však_chůzí)	VB-P---3F-AA---
k	k-1	RR--3----------
babičce	babička	NNFS3-----A----
.	.	Z:-------------

Už	už-1	Db-------------
se	se_^(zvr._zájmeno/částice)	P7-X4----------
těší	těšit_:T	VB-S---3P-AA---
.	.	Z:-------------

Feature File Format

The features used in the tagger have a major influence on tagging performance. The feature file contains several feature sequences, each sequence consisting of several elementary features. The elementary features are computed by MorphoDiTa, and different tagger models can have a different set of elementary features. Here we describe the elementary features of the generic tagger:

• Form: word form

• Prefix1 .. Prefix9: word form prefix of length 1..9 (measured in Unicode characters)

• Suffix1 .. Suffix9: word form suffix of length 1..9 (measured in Unicode characters)

• Num: whether the word form contains at least one number (Unicode category Number)

• Cap: whether the word form contains at least one uppercase or titlecase letter

• Dash: whether the word form contains at least one dash (Unicode category 'Punctuation, Dash')

• Tag: word form PoS tag

• Tag1 .. Tag5: letter 1..5 of word form PoS tag

• Lemma: word form lemma

• FollowingVerbTag: PoS tag of the nearest following verb, i.e., the nearest following word form with at least one of its PoS tags starting with V

• FollowingVerbLemma: lemma of the nearest following verb, i.e., the nearest following word form with at least one of its PoS tags starting with V

• PreviousVerbTag: PoS tag of the nearest previous verb, i.e., the nearest previous word whose PoS tag (assigned by the tagger) starts with V

• PreviousVerbLemma: lemma of the nearest previous verb, i.e., the nearest previous word whose PoS tag (assigned by the tagger) starts with V

The feature file defines feature sequences which can be applied to a word form. A feature sequence consists of elementary features assigned to the given form or its neighbours.

Every line in the feature file defines one feature sequence. A feature sequence is a comma joined list of space separated pairs, each consisting of an elementary feature and the offset to which the elementary feature applies (e.g., Form 0 or Tag 0,Lemma -1). The file format is strict and does not allow any additional spaces or commas.

Note that the offsets of some elementary features are limited by the order of Viterbi decoding used. Notably, if Viterbi decoding of order N is used, Tag and Lemma can be used only inside the decoded window, i.e., only with offsets -N+1 .. 0.
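The strict line format and the Viterbi window constraint can be checked mechanically. The C++ sketch below shows one way to do so; feature_ref, parse_feature_sequence and fits_viterbi_order are hypothetical helpers and are not part of MorphoDiTa. Applying the window check only to the features named exactly Tag and Lemma follows the wording above and is otherwise an assumption.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// One "feature offset" pair from a feature sequence (hypothetical type).
struct feature_ref {
  std::string name;  // elementary feature, e.g. "Tag"
  int offset;        // position relative to the current form
};

// Parses one line such as "Tag 0,Lemma -1". Returns false on any deviation
// from "name offset[,name offset]...", including extra spaces or commas.
bool parse_feature_sequence(const std::string& line,
                            std::vector<feature_ref>& sequence) {
  sequence.clear();
  std::size_t start = 0;
  while (true) {
    std::size_t comma = line.find(',', start);
    std::string item = line.substr(
        start, comma == std::string::npos ? comma : comma - start);
    std::size_t space = item.find(' ');
    if (space == std::string::npos || space == 0 || space + 1 == item.size())
      return false;
    const std::string number = item.substr(space + 1);
    std::size_t digits = number[0] == '-' ? 1 : 0;
    if (digits == number.size()) return false;  // a lone "-" is not an offset
    for (std::size_t i = digits; i < number.size(); ++i)
      if (number[i] < '0' || number[i] > '9') return false;
    sequence.push_back({item.substr(0, space), std::atoi(number.c_str())});
    if (comma == std::string::npos) break;
    start = comma + 1;
  }
  return true;
}

// Checks that Tag and Lemma are used only with offsets in -N+1 .. 0 for
// Viterbi decoding of order N.
bool fits_viterbi_order(const std::vector<feature_ref>& sequence, int order) {
  for (const feature_ref& f : sequence)
    if ((f.name == "Tag" || f.name == "Lemma") &&
        (f.offset > 0 || f.offset < -order + 1))
      return false;
  return true;
}
```

For example, the sequence Tag 0,Tag -1,Tag -2 passes the check for order 3 (as used with generic3) but not for order 2.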

For inspiration, we present the feature files used for the released Czech and English MorphoDiTa models. Both feature files are slight modifications of the feature files described in Spoustová et al. 2009: Drahomíra "johanka" Spoustová, Jan Hajič, Jan Raab, Miroslav Spousta. 2009. Semi-Supervised Training for the Averaged Perceptron POS Tagger. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 763-771, Athens, Greece, March. Association for Computational Linguistics.

Feature ﬁle for English:

Tag 0,Form 0

Tag 0,Prefix1 0

Tag 0,Prefix2 0

Tag 0,Prefix3 0

Tag 0,Prefix4 0

Tag 0,Prefix5 0

Tag 0,Prefix6 0

Tag 0,Prefix7 0

Tag 0,Prefix8 0

Tag 0,Prefix9 0

Tag 0,Suffix1 0

Tag 0,Suffix2 0

Tag 0,Suffix3 0

Tag 0,Suffix4 0

Tag 0,Suffix5 0

Tag 0,Suffix6 0

Tag 0,Suffix7 0

Tag 0,Suffix8 0

Tag 0,Suffix9 0

Tag 0,Num 0

Tag 0,Cap 0

Tag 0,Dash 0

Tag 0,Tag -1

Tag 0,Tag -1,Tag -2

Tag 0,Form -1

Tag 0,Form -2

Tag 0,Form -1,Form -2

Tag 0,Form 1

Tag 0,Form 1,Form 2

Tag 0,Tag1 -1

Tag 0,Lemma -1

Lemma 0,Tag -1

Feature ﬁle for Czech (note that some feature sequences predict only part of PoS tags trying to overcome data

sparseness; Tag2 is extended PoS, Tag3 is gender, Tag5 is case):

Tag 0

Tag 0,Tag -1

Tag 0,Tag -1,Tag -2

Tag 0,Tag -2

Tag 0,Form 0

Tag 0,Form 0,Form -1

Tag 0,Form -1

Tag 0,Form -2

Tag 0,PreviousVerbTag 0

Tag 0,PreviousVerbLemma 0

Tag 0,FollowingVerbTag 0

Tag 0,FollowingVerbLemma 0

Tag 0,Lemma -1

Lemma 0,Tag -1

Tag 0,Form 1

Tag2 0,Tag5 0

Tag2 0,Tag5 0,Tag2 -1,Tag5 -1

Tag2 0,Tag5 0,Tag2 -1,Tag5 -1,Tag2 -2,Tag5 -2

Tag5 0

Tag5 0,Tag -1

Tag5 0,Tag -1,Tag -2

Tag5 0,Tag -2

Tag5 0,Form 0

Tag5 0,Form 0,Form -1

Tag5 0,Form -1

Tag5 0,Form -2

Tag5 0,PreviousVerbTag 0

Tag5 0,PreviousVerbLemma 0

Tag5 0,FollowingVerbTag 0

Tag5 0,FollowingVerbLemma 0

Tag5 0,Lemma -1

Tag5 0,Form 1

Tag3 0

Tag3 0,Tag -1

Tag3 0,Tag -1,Tag -2

Tag3 0,Tag -2

Tag3 0,Form 0

Tag3 0,Form 0,Form -1

Tag3 0,Form -1

Tag3 0,Form -2

Tag3 0,PreviousVerbTag 0

Tag3 0,PreviousVerbLemma 0

Tag3 0,FollowingVerbTag 0

Tag3 0,FollowingVerbLemma 0

Tag3 0,Lemma -1

Tag3 0,Form 1

Tag 0,Prefix1 0

Tag 0,Prefix2 0

Tag 0,Prefix3 0

Tag 0,Prefix4 0

Tag 0,Suffix1 0

Tag 0,Suffix2 0

Tag 0,Suffix3 0

Tag 0,Suffix4 0

Tag 0,Num 0

Tag 0,Cap 0

Tag 0,Dash 0

Tag5 0,Suffix1 0

Tag5 0,Suffix2 0

Tag5 0,Suffix3 0

Tag5 0,Suffix4 0

Feature ﬁle for Czech, Part of Speech only variant:

Tag 0

Tag 0,Tag -1

Tag 0,Tag -1,Tag -2

Tag 0,Tag -2

Tag 0,Form 0

Tag 0,Form 0,Form -1

Tag 0,Form -1

Tag 0,Form -2

Tag 0,PreviousVerbTag 0

Tag 0,PreviousVerbLemma 0

Tag 0,FollowingVerbTag 0

Tag 0,FollowingVerbLemma 0

Tag 0,Lemma -1

Lemma 0,Tag -1

Tag 0,Form 1

Tag 0,Prefix1 0

Tag 0,Prefix2 0

Tag 0,Prefix3 0

Tag 0,Prefix4 0

Tag 0,Suffix1 0

Tag 0,Suffix2 0

Tag 0,Suffix3 0

Tag 0,Suffix4 0

Tag 0,Num 0

Tag 0,Cap 0

Tag 0,Dash 0

Measuring Tagger Accuracy

Measuring custom tagger accuracy can be performed by running:

tagger_accuracy tagger_model <test_data

This binary reads input in the same format as train_tagger, i.e., tab separated form-lemma-tag triplets, and evaluates the accuracy of the tagger model on the given testing data.

6 MorphoDiTa API Tutorial

The MorphoDiTa API is defined in the header morphodita.h and resides in the ufal::morphodita namespace. The easiest way to use MorphoDiTa is therefore:

#include "morphodita.h"

using namespace ufal::morphodita;

6.1 Tagger API

The main access to the MorphoDiTa tagger is through the class tagger. An example of this class usage can be found in the program file run_tagger.cpp. A typical tagger usage may look like this:

#include <cstdio>
#include <string>
#include <vector>

#include "tagger/tagger.h"

using namespace std;
using namespace ufal::morphodita;

//...

// load model to memory and construct tagger
tagger* my_tagger = tagger::load("path_to_model");
if (!my_tagger) ...

// create sample input
vector<string> words;
words.push_back("malý");
words.push_back("pes");

vector<string_piece> forms;
for (auto& word : words)
  forms.emplace_back(word);

// initialize output and tag
vector<tagged_lemma> tags;
my_tagger->tag(forms, tags);

// access the output
for (auto& tag : tags)
  printf("%s\t%s\n", tag.lemma.c_str(), tag.tag.c_str());

delete my_tagger;

The tagger is constructed by an overloaded factory method with one argument. The factory method accepts either an input stream (istream&) with the model or a C string (const char*) with the file name of the model. It loads the linguistic model into memory and returns a tagger pointer ready for tagging, or NULL if unsuccessful. If an input stream is used, it is left positioned right after the end of the model.

The main tagging method is tagger::tag:

void tag(const std::vector<string_piece>& forms, std::vector<tagged_lemma>& tags) const;

The input is a std::vector of string_piece, which is a structure referencing a string using const char* str and size_t len.

The tagger::tag method returns the tagged output in its second argument, std::vector<tagged_lemma>. The caller must provide a result vector, and the tagger assigns the output to this vector. The indexes in the output vector correspond to the indexes in the input vector. tagged_lemma has two public members, std::string lemma and std::string tag, corresponding to the predicted lemma and tag, respectively.

6.2 Morphological Dictionary API

The main access to the MorphoDiTa morphological dictionary is through the class morpho. An example of this interface usage can be found in the program file run_morpho.cpp.

6.2.1 Dictionary Construction

Similarly to the tagger, the MorphoDiTa morphological dictionary is constructed by an overloaded factory method which accepts either an input stream (istream&) or a C string (const char*) with the file name of the dictionary. The factory method returns a pointer to the morphological dictionary, or NULL if unsuccessful.

#include "morpho/morpho.h"

using namespace ufal::morphodita;

//...

// load dictionary to memory
morpho* my_morpho = morpho::load("path_to_dictionary");

//...

delete my_morpho;

Another way of obtaining a pointer to a morphological dictionary is through an instance of the tagger class. Every tagger has a morphological dictionary, which is available through the method

virtual const morpho* get_morpho() const = 0;

Please note that you should not delete this pointer, as it is owned by the tagger class instance.

6.2.2 Morphological Analysis

The MorphoDiTa morphological dictionary offers two functionalities: it either analyzes a given word, that is, it outputs all possible lemma-tag pair candidates for the given form; or, for a given lemma-tag pair, it generates a form or a whole list of possible forms.

In the first case, one performs morphological analysis of a given word by calling the method morpho::analyze:

int analyze(string_piece form, guesser_mode guesser, std::vector<tagged_lemma>& lemmas) const;

An example (assuming that morphological dictionary is already constructed, see previous example):

vector<tagged_lemma> lemmas; // output, filled by analyze
my_morpho->analyze("pes", morpho::GUESSER, lemmas);
for (auto& lemma : lemmas)
  printf("%s %s\n", lemma.lemma.c_str(), lemma.tag.c_str());

The input is a form to analyze, then a guesser mode (whether to use some kind of guesser or strictly the dictionary only, see the question Guesser Mode in Questions and Answers), and an output std::vector<tagged_lemma>. The caller must provide the output vector and the method morpho::analyze assigns the output to it.

6.2.3 Generation

MorphoDiTa performs morphological generation from a given lemma:

int generate(string_piece lemma, const char* tag_wildcard, guesser_mode guesser,

std::vector<tagged_lemma_forms>& forms) const;

Tag Wildcard

Optionally, a tag wildcard can be specified (or be NULL) and if so, results are filtered using this wildcard. The method can therefore be used in several ways. One may wish to generate all possible forms and their tags from a given lemma; then tag_wildcard is set to NULL and the method generates all possible combinations. One may also need to generate a specific form and tag from a given lemma; then tag_wildcard is set to this tag value.

Going further, in the Czech positional morphology tagging system (Hajič 2004), one may wish to generate something like "all forms in the fourth case"; then tag_wildcard should be set to ????4. Please see the section "Czech Morphology" in the User's Manual for more details about the Czech positional tagging system. This example applies to the morphological annotation of PDT; however, tag wildcards can be used in any morphological tagging system.

Most characters of a tag wildcard match corresponding characters of a tag, with the following exceptions:

• ? matches any character of a tag.
• [chars] matches any of the characters listed. The dash - has no special meaning, and if ] is the first character in chars, it is considered one of the characters and does not end the group.
• [^chars] matches any of the characters not listed.
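The matching rules above can be sketched as a small stand-alone function. This is only an illustration and not MorphoDiTa's implementation: tag_matches is a hypothetical name, and two behaviours are assumptions of the sketch. First, a tag longer than the wildcard still matches (suggested by the ????4 example, where the remaining tag positions are unconstrained). Second, a tag shorter than the wildcard does not match.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the MorphoDiTa API: returns true if
// `tag` matches `wildcard` under the rules listed above.
bool tag_matches(const std::string& wildcard, const std::string& tag) {
  std::size_t w = 0, t = 0;
  while (w < wildcard.size()) {
    if (t >= tag.size()) return false;  // assumption: short tags do not match
    const char c = tag[t++];
    if (wildcard[w] == '?') {
      ++w;  // ? matches any character
    } else if (wildcard[w] == '[') {
      ++w;
      const bool negate = w < wildcard.size() && wildcard[w] == '^';
      if (negate) ++w;
      bool listed = false;
      bool first = true;  // a ] in first position is an ordinary character
      while (w < wildcard.size() && (first || wildcard[w] != ']')) {
        if (wildcard[w] == c) listed = true;
        ++w;
        first = false;
      }
      if (w < wildcard.size()) ++w;  // skip the closing ]
      if (listed == negate) return false;
    } else if (wildcard[w++] != c) {
      return false;  // ordinary characters must match exactly
    }
  }
  return true;  // assumption: trailing tag characters are unconstrained
}
```

For instance, tag_matches("????4", tag) accepts every tag whose fifth character is 4, whatever the remaining positions contain.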

Unknown Lemmas

When the lemma is unknown, MorphoDiTa's generation behavior is defined by the guesser mode (see also the question Guesser Mode in Questions and Answers). If at least one lemma is found in the dictionary, NO_GUESSER is returned. If guesser == GUESSER and the lemma is found by the guesser, GUESSER is returned. Otherwise, forms are cleared and -1 is returned.

6.3 Questions and Answers

What is a Guesser Mode?

Morphological analysis may try to guess the lemma and tag of an unknown word. This option is turned on by morpho::GUESSER and off by morpho::NO_GUESSER.

Why string_piece and not const char* or std::string?

We aim to make the MorphoDiTa interface as efficient as possible. Because the input strings may be substrings of a larger text or come from memory regions not managed by C++, we want to avoid the cost of \0 padding or string conversion. Nevertheless, both const char* and std::string can be used instead of a string_piece thanks to the existing implicit conversion rules.

7 MorphoDiTa API Reference

The MorphoDiTa API is defined in the header morphodita.h and resides in the ufal::morphodita namespace.

The strings used in the MorphoDiTa API are always UTF-8 encoded (except for file paths, whose encoding is system dependent).

7.1 MorphoDiTa Versioning

MorphoDiTa is versioned using Semantic Versioning. A version therefore consists of three numbers major.minor.patch, optionally followed by a hyphen and pre-release version info, with the following semantics:

• Stable versions have no pre-release version info; development versions have non-empty pre-release version info.
• Two versions with the same major.minor have the same API with the same behaviour, apart from bugs. Therefore, if only patch is increased, the new version is only a bug-fix release.
• If two versions v and u have the same major, but minor(v) is greater than minor(u), version v contains only additions to the API. In other words, the whole API of u is present in v with the same behaviour (once again apart from bugs). It is therefore safe to upgrade to a newer MorphoDiTa version with the same major.
• If two versions differ in major, their API may differ in any way.
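The API rules above can be turned into a small compatibility check. The following is a minimal sketch: ver is a local stand-in for the version class (its fields are renamed because major and minor are macros on some platforms), the function names are hypothetical, and pre-release info is ignored.

```cpp
#include <cassert>
#include <string>

// Local stand-in for the version class; not the MorphoDiTa type itself.
struct ver {
  unsigned major_no;
  unsigned minor_no;
  unsigned patch_no;
  std::string prerelease;
};

// True if u and v expose the same API, i.e. they differ only by bug fixes.
bool same_api(const ver& u, const ver& v) {
  return u.major_no == v.major_no && u.minor_no == v.minor_no;
}

// True if code written against version u should keep working with version v:
// the major numbers agree and v is not older than u in its minor number.
bool safe_upgrade(const ver& u, const ver& v) {
  if (u.major_no != v.major_no) return false;  // APIs may differ in any way
  return v.minor_no >= u.minor_no;             // v only adds to the API of u
}
```

A program built against 1.8.0 would therefore pass safe_upgrade for 1.9.2, but not for 2.0.0 and not in the reverse direction.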

Models created by MorphoDiTa have the same behaviour in all MorphoDiTa versions with the same major, apart from obvious bugfixes. On the other hand, models created from the same data by different major.minor MorphoDiTa versions may behave differently.

7.2 Lemma Structure

The lemmas used by MorphoDiTa consist of three parts:

1. raw lemma: the text form of the lemma. It may not uniquely distinguish lemma meanings, lemma use cases etc.
2. lemma id: together with the raw lemma, it provides a unique identifier of the lemma, possibly including lemma meanings or use cases.
3. lemma comments: additional comments for the given lemma.

These parts are stored in one string and the boundaries between them can be determined by the morpho::raw_lemma_len and morpho::lemma_id_len methods. The analyzer and tagger always return a lemma in this structured form. When performing morphological generation, either the raw lemma or both the raw lemma and lemma id can be specified; any lemma comments are ignored.
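To illustrate how the two boundary lengths divide a lemma string, here is a minimal sketch. The lengths are passed in as plain numbers so that the example does not depend on a loaded dictionary; in real code they would come from morpho::raw_lemma_len and morpho::lemma_id_len. The names lemma_parts and split_lemma are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// The three parts of a structured lemma.
struct lemma_parts {
  std::string raw_lemma;
  std::string lemma_id;
  std::string comments;
};

// Split `lemma` at the two boundaries. `raw_len` is the length of the raw
// lemma; `id_len` is the length of the raw lemma plus the lemma id.
lemma_parts split_lemma(const std::string& lemma, std::size_t raw_len, std::size_t id_len) {
  assert(raw_len <= id_len && id_len <= lemma.size());
  lemma_parts parts;
  parts.raw_lemma = lemma.substr(0, raw_len);
  parts.lemma_id = lemma.substr(raw_len, id_len - raw_len);
  parts.comments = lemma.substr(id_len);
  return parts;
}
```

With a made-up lemma string "jen-1_^(comment)" and the lengths 3 and 5, the parts are "jen", "-1" and "_^(comment)"; the first five characters "jen-1" form the unique lemma identifier.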

7.3 Struct string_piece

struct string_piece {
  const char* str;
  size_t len;

  string_piece();
  string_piece(const char* str);
  string_piece(const char* str, size_t len);
  string_piece(const std::string& str);
};

The string_piece is used for efficient string passing. The string referenced by a string_piece is not owned by it, so users have to make sure the referenced string exists as long as the string_piece does.
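The following sketch shows why a pointer-plus-length view is convenient for substrings. It uses a local stand-in struct named piece with the same two fields, so that it compiles without MorphoDiTa; split_on_spaces is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Local stand-in with the same two fields as string_piece.
struct piece {
  const char* str;
  std::size_t len;
};

// Split `text` on spaces into non-owning pieces. No characters are copied
// and no terminating \0 is written; every piece points into `text`.
std::vector<piece> split_on_spaces(const std::string& text) {
  std::vector<piece> out;
  std::size_t start = 0;
  while (start <= text.size()) {
    std::size_t end = text.find(' ', start);
    if (end == std::string::npos) end = text.size();
    if (end > start) out.push_back(piece{text.data() + start, end - start});
    start = end + 1;
  }
  return out;
}
```

The pieces stay valid only while the original text exists and is not modified, which is the ownership rule stated above.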

7.4 Struct tagged_form

struct tagged_form {
  std::string form;
  std::string tag;
};

The tagged_form is a pair of strings used when obtaining a form and tag pair.

7.5 Struct tagged_lemma

struct tagged_lemma {
  std::string lemma;
  std::string tag;
};

The tagged_lemma is a pair of strings used when obtaining a lemma and tag pair.

7.6 Struct tagged_lemma_forms

struct tagged_lemma_forms {
  std::string lemma;
  std::vector<tagged_form> forms;
};

The tagged_lemma_forms represents a lemma and a list of tagged forms.

7.7 Struct token_range

struct token_range {
  size_t start;
  size_t length;
};

The token_range represents the range of a token as returned by a tokenizer. The start and length fields specify the token position in Unicode characters, not in bytes of UTF-8 encoding.
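Because the range is counted in Unicode characters, it cannot be used directly as a byte index into a UTF-8 string. A minimal sketch of the conversion follows, assuming valid UTF-8 input; char_range is a local stand-in for token_range, and byte_offset and substring are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Local stand-in with the same fields as token_range.
struct char_range {
  std::size_t start;
  std::size_t length;
};

// Byte offset of the character with index `chars` in valid UTF-8 `text`.
// Continuation bytes have the bit pattern 10xxxxxx and belong to the
// character started by the preceding lead byte.
std::size_t byte_offset(const std::string& text, std::size_t chars) {
  std::size_t i = 0;
  while (i < text.size() && chars > 0) {
    ++i;
    while (i < text.size() && (static_cast<unsigned char>(text[i]) & 0xC0) == 0x80) ++i;
    --chars;
  }
  return i;
}

// Extract the bytes covered by a range given in Unicode characters.
std::string substring(const std::string& text, const char_range& range) {
  const std::size_t begin = byte_offset(text, range.start);
  const std::size_t end = byte_offset(text, range.start + range.length);
  return text.substr(begin, end - begin);
}
```

In a text whose first four characters occupy five bytes, a token starting at character 5 starts at byte 6.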

7.8 Struct derivated_lemma

struct derivated_lemma {
  std::string lemma;
};

The derivated_lemma structure stores information about a derivation. This information currently consists of the lemma only, but a type of the derivation may be added later.

7.9 Class version

class version {

public:

unsigned major;

unsigned minor;

unsigned patch;

std::string prerelease;

static version current();

};

The version class represents MorphoDiTa version. See MorphoDiTa Versioning for more information.

7.9.1 version::current

static version current();

Returns current MorphoDiTa version.

7.10 Class tokenizer

class tokenizer {
 public:
  virtual ~tokenizer() {}

  virtual void set_text(string_piece text, bool make_copy = false) = 0;
  virtual bool next_sentence(std::vector<string_piece>* forms, std::vector<token_range>* tokens) = 0;

  static tokenizer* new_vertical_tokenizer();
  static tokenizer* new_czech_tokenizer();
  static tokenizer* new_english_tokenizer();
  static tokenizer* new_generic_tokenizer();
};

The tokenizer class performs segmentation and tokenization of a given text. The class is not thread-safe.

Tokenizer instances can be obtained either directly using the static methods or through instances of morpho and tagger.

7.10.1 tokenizer::set_text

virtual void set_text(string_piece text, bool make_copy = false) = 0;

Set the text which is to be tokenized.

If make_copy is false, only a reference to the given text is stored and the user has to make sure it exists until the tokenizer is released or set_text is called again. If make_copy is true, a copy of the given text is made and retained until the tokenizer is released or set_text is called again.

7.10.2 tokenizer::next_sentence

virtual bool next_sentence(std::vector<string_piece>* forms, std::vector<token_range>* tokens) = 0;

Locate and return the next sentence of the given text. Returns true when successful and false when there are no more sentences in the given text. The arguments are filled with the found tokens if not NULL. The forms contain token ranges in bytes of UTF-8 encoding, the tokens contain token ranges in Unicode characters.

7.10.3 tokenizer::new_vertical_tokenizer

static tokenizer* new_vertical_tokenizer();

Returns a new instance of a vertical tokenizer, which considers every line to be one token, with an empty line denoting the end of a sentence. The user should delete the instance after use.
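The vertical format described above can be illustrated by a small stand-alone reader. This is a sketch of the format only, not the MorphoDiTa tokenizer; read_vertical is a hypothetical name, and skipping repeated empty lines is an assumption of the sketch.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical reader for vertical text: every non-empty line is one token
// and an empty line ends the current sentence.
std::vector<std::vector<std::string>> read_vertical(const std::string& text) {
  std::vector<std::vector<std::string>> sentences;
  std::vector<std::string> current;
  std::istringstream in(text);
  std::string line;
  while (std::getline(in, line)) {
    if (line.empty()) {
      if (!current.empty()) sentences.push_back(current);
      current.clear();
    } else {
      current.push_back(line);
    }
  }
  if (!current.empty()) sentences.push_back(current);  // last sentence without a closing empty line
  return sentences;
}
```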

7.10.4 tokenizer::new_czech_tokenizer

static tokenizer* new_czech_tokenizer();

Returns a new instance of a Czech tokenizer. The user should delete it after use.

If two MorphoDiTa versions have the same major.minor, this tokenizer should behave identically (apart from obvious bugfixes). Nevertheless, the behaviour of this tokenizer might change in a different major.minor version. If you need a tokenizer whose behaviour does not change, use the tokenizer embedded in a morphological dictionary.

7.10.5 tokenizer::new_english_tokenizer

static tokenizer* new_english_tokenizer();

Returns a new instance of an English tokenizer. The user should delete it after use.

If two MorphoDiTa versions have the same major.minor, this tokenizer should behave identically (apart from obvious bugfixes). Nevertheless, the behaviour of this tokenizer might change in a different major.minor version. If you need a tokenizer whose behaviour does not change, use the tokenizer embedded in a morphological dictionary.

7.10.6 tokenizer::new_generic_tokenizer

static tokenizer* new_generic_tokenizer();

Returns a new instance of a generic tokenizer. The user should delete it after use.

If two MorphoDiTa versions have the same major.minor, this tokenizer should behave identically (apart from obvious bugfixes). Nevertheless, the behaviour of this tokenizer might change in a different major.minor version. If you need a tokenizer whose behaviour does not change, use the tokenizer embedded in a morphological dictionary.

7.11 Class derivator

class derivator {
 public:
  virtual ~derivator();

  virtual bool parent(string_piece lemma, derivated_lemma& parent) const = 0;
  virtual bool children(string_piece lemma, std::vector<derivated_lemma>& children) const = 0;
};

The derivator class performs morphological derivation on given lemmas. The derivations are computed using lemma ids, see Lemma Structure.

The derivator instances can be obtained through instances of morpho (and transitively through tagger).

7.11.1 derivator::parent

virtual bool parent(string_piece lemma, derivated_lemma& parent) const = 0;

Return the parent of a given lemma in the morphological derivation tree. The lemma is assumed to be lemma

id (see Lemma Structure), so if it contains any lemma comments, they are ignored.

The returned lemma is a full lemma (lemma id plus appropriate lemma comments).

If no parent exists, the function empties the parent lemma and returns false.

7.11.2 derivator::children

virtual bool children(string_piece lemma, std::vector<derivated_lemma>& children) const = 0;

Return children of a given lemma in the morphological derivation tree. The lemma is assumed to be lemma id

(see Lemma Structure), so if it contains any lemma comments, they are ignored.

The returned lemmas are full lemmas (lemma ids plus appropriate lemma comments).

If no children exist, the function empties the children vector and returns false.

7.12 Class derivation_formatter

class derivation_formatter {
 public:
  virtual ~derivation_formatter() {}

  virtual void format_derivation(std::string& lemma) const = 0;

  static derivation_formatter* new_none_derivation_formatter();
  static derivation_formatter* new_root_derivation_formatter(const derivator* derinet);
  static derivation_formatter* new_path_derivation_formatter(const derivator* derinet);
  static derivation_formatter* new_tree_derivation_formatter(const derivator* derinet);
  static derivation_formatter* new_derivation_formatter(string_piece name, const derivator* derinet);
};

The derivation_formatter class performs the required morphological derivation and formats the results using a single string field (i.e., directly in the lemma).

7.12.1 derivation_formatter::format_derivation

virtual void format_derivation(std::string& lemma) const = 0;

Perform the required morphological derivation and format the result back directly in the lemma.

7.12.2 derivation_formatter::new_none_derivation_formatter

static derivation_formatter* new_none_derivation_formatter();

Return a new derivation_formatter instance which does nothing (i.e., it performs no derivation).

7.12.3 derivation_formatter::new_root_derivation_formatter

static derivation_formatter* new_root_derivation_formatter(const derivator* derinet);

Return a new derivation_formatter instance which replaces a lemma by the corresponding root in the derivation tree.

7.12.4 derivation_formatter::new_path_derivation_formatter

static derivation_formatter* new_path_derivation_formatter(const derivator* derinet);

Return a new derivation_formatter instance which replaces a lemma by a space separated path to the root in the morphological derivation tree (the original lemma is first, followed by its parent, with the root being the last one).
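The path format can be illustrated with a stand-alone sketch. A plain map from lemma to parent stands in for derivator::parent here; parent_map and format_path are hypothetical names, and the lemmas in the usage note are placeholders. The sketch assumes the parent relation forms a tree, so the walk always terminates.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical parent table standing in for derivator::parent. A lemma
// that is absent from the map is treated as a root.
typedef std::map<std::string, std::string> parent_map;

// Build the path "lemma parent ... root", separated by single spaces.
std::string format_path(const std::string& lemma, const parent_map& parents) {
  std::string path = lemma;
  std::string current = lemma;
  parent_map::const_iterator it = parents.find(current);
  while (it != parents.end()) {
    current = it->second;
    path += ' ';
    path += current;
    it = parents.find(current);
  }
  return path;
}
```

With the parent links c to b and b to a, format_path("c", parents) gives "c b a"; the last element of the path is what a root formatter would return on its own.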

7.12.5 derivation_formatter::new_tree_derivation_formatter

static derivation_formatter* new_tree_derivation_formatter(const derivator* derinet);

Return a new derivation_formatter instance which appends to the lemma the whole morphological derivation tree which contains it.

The tree is encoded in the following way: the root node is first, then the subtrees of the root children are encoded recursively (each after one space), followed by a final space (which denotes that the children are complete).
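One possible reading of this encoding is sketched below. A plain map from lemma to children stands in for derivator::children; children_map and encode_tree are hypothetical names. The exact placement of spaces follows our interpretation of the description above and has not been checked against the library output.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical children table standing in for derivator::children.
typedef std::map<std::string, std::vector<std::string> > children_map;

// Encode the subtree rooted at `node`: the node itself, then every child
// subtree after one space, then a final space closing the list of children.
std::string encode_tree(const std::string& node, const children_map& children) {
  std::string out = node;
  children_map::const_iterator it = children.find(node);
  if (it != children.end()) {
    for (std::size_t i = 0; i < it->second.size(); ++i) {
      out += ' ';
      out += encode_tree(it->second[i], children);
    }
  }
  out += ' ';  // the children of `node` are complete
  return out;
}
```

Under this reading, a leaf is encoded as its lemma followed by one space, so runs of consecutive spaces mark where several nested child lists end.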

7.12.6 derivation_formatter::new_derivation_formatter

static derivation_formatter* new_derivation_formatter(string_piece name, const derivator* derinet);

Return one of the available derivation_formatter instances according to the name parameter:

• none: return a new_none_derivation_formatter instance
• root: return a new_root_derivation_formatter instance
• path: return a new_path_derivation_formatter instance
• tree: return a new_tree_derivation_formatter instance

7.13 Class morpho

class morpho {
 public:
  virtual ~morpho() {}

  static morpho* load(const char* fname);
  static morpho* load(istream& is);

  enum guesser_mode { NO_GUESSER = 0, GUESSER = 1 };

  virtual int analyze(string_piece form, guesser_mode guesser, std::vector<tagged_lemma>& lemmas) const = 0;
  virtual int generate(string_piece lemma, const char* tag_wildcard, guesser_mode guesser, std::vector<tagged_lemma_forms>& forms) const = 0;
  virtual int raw_lemma_len(string_piece lemma) const = 0;
  virtual int lemma_id_len(string_piece lemma) const = 0;
  virtual int raw_form_len(string_piece form) const = 0;
  virtual tokenizer* new_tokenizer() const = 0;

  virtual const derivator* get_derivator() const;
};

A morpho instance represents a morphological dictionary. Such a dictionary allows morphological analysis and morphological generation, provides information about lemma structure and provides a suitable tokenizer. All methods are thread-safe.

7.13.1 morpho::load(const char*)

static morpho* load(const char* fname);

Factory method constructor. Accepts a C string with the file name of the model. Returns a pointer to an instance of morpho, which the user should delete after use, or NULL if unsuccessful.

7.13.2 morpho::load(istream&)

static morpho* load(istream& is);

Factory method constructor. Accepts an input stream with the model. Returns a pointer to an instance of

morpho which the user should delete after use.

7.13.3 morpho::guesser_mode

enum guesser_mode { NO_GUESSER = 0, GUESSER = 1 };

Guesser mode defines the behavior in case of unknown words. When set to GUESSER, morpho tries to guess unknown words. When set to NO_GUESSER, morpho does not guess unknown words.

7.13.4 morpho::analyze()

virtual int analyze(string_piece form, guesser_mode guesser, std::vector<tagged_lemma>& lemmas) const = 0;

Perform morphological analysis of a form. The guesser parameter specifies whether a guesser can be used if the form is not found in the dictionary. Output is assigned to the lemmas vector.

If the form is found in the dictionary, analyses are assigned to lemmas and NO_GUESSER is returned. If guesser == GUESSER and the form analyses are found using a guesser, they are assigned to lemmas and GUESSER is returned. Otherwise -1 is returned and lemmas is filled with one analysis containing the given form as lemma and a tag for unknown word.
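The three possible return values can be handled as in the following sketch. The enum is a local copy of the guesser_mode values shown above, so that the example compiles without MorphoDiTa, and analysis_source is a hypothetical helper.

```cpp
#include <cassert>
#include <string>

// Local copy of the guesser_mode values; not the MorphoDiTa enum itself.
enum guesser_mode { NO_GUESSER = 0, GUESSER = 1 };

// Describe where the analyses came from, given the value returned by analyze.
std::string analysis_source(int result) {
  if (result == NO_GUESSER) return "dictionary";
  if (result == GUESSER) return "guesser";
  return "unknown";  // -1: lemmas holds a single fallback analysis
}
```

A caller that wants to treat guessed analyses differently from dictionary ones can therefore branch on the return value instead of inspecting the lemmas vector.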

7.13.5 morpho::generate()

virtual int generate(string_piece lemma, const char* tag_wildcard, guesser_mode guesser, std::vector<tagged_lemma_forms>& forms) const = 0;

Perform morphological generation of a lemma. Optionally a tag wildcard can be specified (or be NULL) and if so, results are filtered using this wildcard. The guesser parameter specifies whether a guesser can be used if the lemma is not found in the dictionary. Output is assigned to the forms vector.

Tag wildcard can be either NULL or a wildcard applied to the results. A ? in the wildcard matches any character, [bytes] matches any of the bytes and [^bytes] matches any byte different from the specified ones. A - has no special meaning inside the bytes and if ] is first in bytes, it does not end the bytes group.

If the given lemma is only a raw lemma, all lemma ids with this raw lemma are returned. Otherwise only matching lemma ids are returned, ignoring any lemma comments. For every found lemma, matching forms are filtered using the tag wildcard. If at least one lemma is found in the dictionary, NO_GUESSER is returned. If guesser == GUESSER and the lemma is found by the guesser, GUESSER is returned. Otherwise, forms are cleared and -1 is returned.

7.13.6 morpho::raw_lemma_len

virtual int raw_lemma_len(string_piece lemma) const = 0;

When given a lemma returned by the dictionary, returns the length of a raw lemma (see Lemma Structure).

7.13.7 morpho::lemma_id_len

virtual int lemma_id_len(string_piece lemma) const = 0;

When given a lemma returned by the dictionary, returns the length of a raw lemma plus a lemma id (see Lemma Structure). Therefore, the substring of the original lemma of this length is a unique lemma identifier. The rest of the original lemma are lemma comments which do not identify the lemma.

7.13.8 morpho::raw_form_len

virtual int raw_form_len(string_piece form) const = 0;

When given a form, returns the length of a raw form. This is used only in the external morphology model, where the form also contains morphological analyses, and this call can return the length of the form without the analyses.

7.13.9 morpho::new_tokenizer

virtual tokenizer* new_tokenizer() const = 0;

Returns a new instance of a suitable tokenizer or NULL if no such tokenizer exists. The user should delete it after use.

Note that the tokenizer might use the morpho instance, so the tokenizer must not be used after the morpho instance is destructed.

7.13.10 morpho::get_derivator

virtual const derivator* get_derivator() const;

Returns a derivator for the morphology, or NULL if not available.

The derivator is owned by the morphology, so the returned instance should not be freed and it cannot be used after the morpho instance is destructed.

7.14 Class tagger

class tagger {
 public:
  virtual ~tagger() {}

  static tagger* load(const char* fname);
  static tagger* load(istream& is);

  virtual const morpho* get_morpho() const = 0;
  virtual void tag(const std::vector<string_piece>& forms, std::vector<tagged_lemma>& tags, morpho::guesser_mode guesser = -1) const = 0;
  virtual void tag_analyzed(const std::vector<string_piece>& forms, std::vector<std::vector<tagged_lemma> >& analyses, std::vector<int>& tags) const = 0;
  virtual tokenizer* new_tokenizer() const = 0;
};

A tagger instance represents a tagger, which performs disambiguation of morphological analyses. All methods are thread-safe.

7.14.1 tagger::load(const char*)

static tagger* load(const char* fname);

Factory method constructor. Accepts a C string with the file name of the model. Returns a pointer to an instance of tagger, which the user should delete after use, or NULL if unsuccessful.

7.14.2 tagger::load(istream&)

static tagger* load(istream& is);

Factory method constructor. Accepts an input stream with the model. Returns a pointer to an instance of

tagger which the user should delete after use.

7.14.3 tagger::get_morpho()

virtual const morpho* get_morpho() const = 0;

Returns a pointer to an instance of morpho associated with the tagger. Do not delete the pointer, it is owned

by the tagger instance and deleted in the tagger destructor.

7.14.4 tagger::tag()

virtual void tag(const std::vector<string_piece>& forms, std::vector<tagged_lemma>& tags, morpho::guesser_mode guesser = -1) const = 0;

Perform morphological analysis and subsequent disambiguation. Accepts a std::vector of string_piece and fills the output vector of tagged_lemma.

The guesser parameter defines whether the morphological guesser should be used. If a negative value is specified (which is the default), the guesser setting employed when the tagger model was trained is used.

7.14.5 tagger::tag_analyzed()

virtual void tag_analyzed(const std::vector<string_piece>& forms, std::vector<std::vector<tagged_lemma> >& analyses, std::vector<int>& tags) const = 0;

Perform morphological disambiguation using given morphological analyses. The indices of the chosen analyses are stored in the output vector tags.

None of the analyses can be empty. If one is, no operation is performed and tags is empty. On the other hand, the analyses vector can be larger than forms; additional entries are ignored in that case.

Note that the tagger was trained with a specific morphology. The more your morphological analyses differ from the original ones, the worse the results will be. One of the usages of tag_analyzed is to consider only a subset of morphological analyses.
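Since tag_analyzed reports indices rather than lemma and tag pairs, the caller looks up the chosen analyses afterwards. A minimal sketch of this lookup follows; lemma_tag is a local stand-in for tagged_lemma and select_analyses is a hypothetical helper, so the example runs without a tagger model.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Local stand-in with the same fields as tagged_lemma.
struct lemma_tag {
  std::string lemma;
  std::string tag;
};

// Pick the chosen analysis of every form, given indices in the shape that
// tag_analyzed stores in its tags argument.
std::vector<lemma_tag> select_analyses(const std::vector<std::vector<lemma_tag> >& analyses,
                                       const std::vector<int>& tags) {
  std::vector<lemma_tag> chosen;
  for (std::size_t i = 0; i < tags.size(); ++i) {
    assert(i < analyses.size());
    assert(tags[i] >= 0 && static_cast<std::size_t>(tags[i]) < analyses[i].size());
    chosen.push_back(analyses[i][tags[i]]);
  }
  return chosen;
}
```

If tags comes back empty, which the text above describes for input with an empty analysis, the helper returns an empty result as well.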

7.14.6 tagger::new_tokenizer

virtual tokenizer* new_tokenizer() const = 0;

Returns a new instance of a suitable tokenizer or NULL if no such tokenizer exists. The user should delete it after use. The call is equal to get_morpho()->new_tokenizer().

7.15 Class tagset_converter

class tagset_converter {
 public:
  virtual ~tagset_converter() {}

  virtual void convert(tagged_lemma& tagged_lemma) const = 0;
  virtual void convert_analyzed(std::vector<tagged_lemma>& tagged_lemmas) const = 0;
  virtual void convert_generated(std::vector<tagged_lemma_forms>& forms) const = 0;

  static tagset_converter* new_identity_converter();
  static tagset_converter* new_pdt_to_conll2009_converter();
  static tagset_converter* new_strip_lemma_comment_converter(const morpho& dictionary);
  static tagset_converter* new_strip_lemma_id_converter(const morpho& dictionary);
};

7.15.1 tagset_converter::convert()

virtual void convert(tagged_lemma& tagged_lemma) const = 0;

Convert the given tagged lemma.

7.15.2 tagset_converter::convert_analyzed()

virtual void convert_analyzed(std::vector<tagged_lemma>& tagged_lemmas) const = 0;

Convert the given results of morpho::analyze. Apart from calling convert, any repeated entries are removed.
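Repeated entries can arise because a conversion may map previously distinct entries to the same lemma and tag, for example when lemma comments are stripped. The removal step can be sketched as follows; lemma_tag is a local stand-in for tagged_lemma, remove_repeated is a hypothetical helper, and keeping the first occurrence in its original position is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Local stand-in with the same fields as tagged_lemma.
struct lemma_tag {
  std::string lemma;
  std::string tag;
};

// Remove repeated (lemma, tag) pairs, keeping the first occurrence of each.
void remove_repeated(std::vector<lemma_tag>& entries) {
  std::vector<lemma_tag> unique;
  for (std::size_t i = 0; i < entries.size(); ++i) {
    bool seen = false;
    for (std::size_t j = 0; j < unique.size() && !seen; ++j)
      seen = unique[j].lemma == entries[i].lemma && unique[j].tag == entries[i].tag;
    if (!seen) unique.push_back(entries[i]);
  }
  entries.swap(unique);
}
```

The quadratic scan is adequate here because the number of analyses of a single form is small.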

7.15.3 tagset_converter::convert_generated()

virtual void convert_generated(std::vector<tagged_lemma_forms>& forms) const = 0;

Convert the given results of morpho::generate. Apart from calling convert, any repeated entries are removed.

7.15.4 tagset_converter::new_identity_converter()

static tagset_converter* new_identity_converter();

Returns a new instance of an identity converter. All convert methods of an identity converter do nothing. The user should delete the instance after use.

7.15.5 tagset_converter::new_pdt_to_conll2009_converter()

static tagset_converter* new_pdt_to_conll2009_converter();

Returns a new instance of a Czech PDT tag set to CoNLL2009 tag set converter. The user should delete the instance after use.

The CoNLL2009 tag set uses two columns for tags: one is a POS and the other holds additional FEATs. Because we have only one tag field, we merge these fields together by using Pos=?|FEAT, i.e., the POS is stored as the first FEAT.

7.15.6 tagset_converter::new_strip_lemma_comment_converter()

static tagset_converter* new_strip_lemma_comment_converter(const morpho& dictionary);

Returns a new instance of a tag set converter stripping lemma comments using the given morpho instance, which must remain valid during the existence of the tag set converter. The user should delete the tag set converter instance after use.

7.15.7 tagset_converter::new_strip_lemma_id_converter()

static tagset_converter* new_strip_lemma_id_converter(const morpho& dictionary);

Returns a new instance of a tag set converter stripping lemma ids using the given morpho instance, which must remain valid during the existence of the tag set converter. The user should delete the tag set converter instance after use.

7.16 C++ Bindings API

Bindings for languages other than C++ are created using SWIG from the C++ bindings API, which is a slightly modified version of the native C++ API. The main changes are the replacement of the string_piece type by native strings and the removal of methods using istream. Here is the C++ bindings API declaration:

7.16.1 Helper Structures

typedef vector<int> Indices;

typedef vector<string> Forms;

struct TaggedForm {

string form;

string tag;

};

typedef vector<TaggedForm> TaggedForms;

struct TaggedLemma {

string lemma;

string tag;

};

typedef vector<TaggedLemma> TaggedLemmas;

typedef vector<TaggedLemmas> Analyses;

struct TaggedLemmaForms {

string lemma;

TaggedForms forms;

};

typedef vector<TaggedLemmaForms> TaggedLemmasForms;

struct TokenRange {

size_t start;

size_t length;

};

typedef vector<TokenRange> TokenRanges;

struct DerivatedLemma {

std::string lemma;

};

typedef vector<DerivatedLemma> DerivatedLemmas;

7.16.2 Main Classes

class Version {

public:

unsigned major;

unsigned minor;

unsigned patch;

string prerelease;

static Version current();

};

class Tokenizer {

public:

virtual void setText(const char* text);

virtual bool nextSentence(Forms* forms, TokenRanges* tokens);

static Tokenizer* newVerticalTokenizer();

static Tokenizer* newCzechTokenizer();

static Tokenizer* newEnglishTokenizer();

static Tokenizer* newGenericTokenizer();

};

class Derivator {

public:

virtual bool parent(const char* lemma, DerivatedLemma& parent) const;

virtual bool children(const char* lemma, DerivatedLemmas& children) const;

};

class DerivationFormatter {

public:

virtual string formatDerivation(const char* lemma) const;

static DerivationFormatter* newNoneDerivationFormatter();

static DerivationFormatter* newRootDerivationFormatter(const Derivator* derivator);

static DerivationFormatter* newPathDerivationFormatter(const Derivator* derivator);

static DerivationFormatter* newTreeDerivationFormatter(const Derivator* derivator);

static DerivationFormatter* newDerivationFormatter(const char* name, const Derivator*

derivator);

};

class Morpho {

public:

static Morpho* load(const char* fname);

enum { NO_GUESSER = 0, GUESSER = 1 };

virtual int analyze(const char* form, int guesser, TaggedLemmas& lemmas) const;

virtual int generate(const char* lemma, const char* tag_wildcard, int guesser,

TaggedLemmasForms& forms) const;

virtual string rawLemma(const char* lemma) const;

virtual string lemmaId(const char* lemma) const;

virtual string rawForm(const char* form) const;

virtual Tokenizer* newTokenizer() const;

virtual Derivator* getDerivator() const;

};

class Tagger {

public:

static Tagger* load(const char* fname);

virtual const Morpho* getMorpho() const;

virtual void tag(const Forms& forms, TaggedLemmas& tags, int guesser = -1) const;

virtual void tagAnalyzed(const Forms& forms, const Analyses& analyses, Indices& tags)

const;

Tokenizer* newTokenizer() const;

};

class TagsetConverter {

public:

static TagsetConverter* newIdentityConverter();

static TagsetConverter* newPdtToConll2009Converter();

static TagsetConverter* newStripLemmaCommentConverter(const Morpho& morpho);

static TagsetConverter* newStripLemmaIdConverter(const Morpho& morpho);

virtual void convert(TaggedLemma& lemma) const;

virtual void convertAnalyzed(TaggedLemmas& lemmas) const;

virtual void convertGenerated(TaggedLemmasForms& forms) const;

};

7.17 C# Bindings

MorphoDiTa library bindings are available in the Ufal.MorphoDiTa namespace.

The bindings are a straightforward conversion of the C++ bindings API. The bindings require the native C++ library libmorphodita_csharp (called morphodita_csharp on Windows).

7.18 Java Bindings

MorphoDiTa library bindings are available in the cz.cuni.mff.ufal.morphodita package.

The bindings are a straightforward conversion of the C++ bindings API. Vectors do not have a native Java interface, see the cz.cuni.mff.ufal.morphodita.Forms class for reference. Also, class members are accessible and modifiable using getField and setField wrappers.

The bindings require the native C++ library libmorphodita_java (called morphodita_java on Windows). If the library is found in the current directory, it is used, otherwise the standard library search process is used.

7.19 Perl Bindings

MorphoDiTa library bindings are available in the Ufal::MorphoDiTa package. The classes can be imported into the current namespace using the :all export tag.

The bindings are a straightforward conversion of the C++ bindings API. Vectors do not have a native Perl interface, see Ufal::MorphoDiTa::Forms for reference. Static methods and enumerations are available only through the module, not through an object instance.

7.20 Python Bindings

MorphoDiTa library bindings are available in the ufal.morphodita module.

The bindings are a straightforward conversion of the C++ bindings API. In Python 2, strings can be both unicode and UTF-8 encoded str, and the library always produces unicode. In Python 3, strings must be only str.

8 Contact

Authors:

• Milan Straka, straka@ufal.mff.cuni.cz
• Jana Straková, strakova@ufal.mff.cuni.cz

MorphoDiTa website.

MorphoDiTa LINDAT/CLARIN entry.

9 Acknowledgements

This work has been using language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education of the Czech Republic (project LM2010013).

Acknowledgements for individual language models are listed in the MorphoDiTa User's Manual.

9.1 Publications

• (Straková et al. 2014) Straková Jana, Straka Milan and Hajič Jan. Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 13-18, Baltimore, Maryland, June 2014. Association for Computational Linguistics.
• (Spoustová et al. 2009) Drahomíra "johanka" Spoustová, Jan Hajič, Jan Raab, Miroslav Spousta. 2009. Semi-Supervised Training for the Averaged Perceptron POS Tagger. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 763-771, Athens, Greece, March. Association for Computational Linguistics.

9.2 Bibtex for Referencing

@InProceedings{strakova14,

author = {Strakov\'{a}, Jana and Straka, Milan and Haji\v{c}, Jan},

title = {Open-{S}ource {T}ools for {M}orphology, {L}emmatization, {POS} {T}agging and

{N}amed {E}ntity {R}ecognition},

booktitle = {Proceedings of 52nd Annual Meeting of the Association for Computational

Linguistics: System Demonstrations},

month = {June},

year = {2014},

address = {Baltimore, Maryland},

publisher = {Association for Computational Linguistics},

pages = {13--18},

url = {http://www.aclweb.org/anthology/P/P14/P14-5003.pdf}

}

9.3 Persistent Identifier

If you prefer to reference MorphoDiTa by a persistent identifier (PID), you can use

http://hdl.handle.net/11858/00-097C-0000-0023-43CD-0.
