Development of a Japanese-English Software Manual Parallel Corpus
Tatsuya Ishisaka
Kazuhide Yamamoto
Nagaoka University of Technology
1603-1 Kamitomiokamachi, Nagaoka,
Niigata 940-2188, Japan
{ishisaka,ykaz}@nlp.nagaokaut.ac.jp
Masao Utiyama††
Eiichiro Sumita††
†† MASTAR Project
National Institute of Information
and Communications Technology
3-5, Hikaridai, Seika, Soraku,
Kyoto 619-0289, Japan
{mutiyama,eiichiro.sumita}
@nict.go.jp
Abstract
To address the shortage of Japanese-English
parallel corpora, we developed a parallel cor-
pus by collecting open source software man-
uals from the Web. The constructed cor-
pus contains approximately 500 thousand sen-
tence pairs that were aligned automatically by
an existing method. We also conducted statis-
tical machine translation (SMT) experiments
with the corpus and confirmed that the corpus
is useful for SMT.
1 Introduction
Multilingual parallel corpora are required to sup-
port many tasks in natural language processing. For
example, statistical machine translation (SMT) re-
quires a parallel corpus for training, and cross-lingual processing tasks such as information retrieval and information extraction also use parallel corpora.
There is no doubt about the importance of parallel corpora for any language pair.
In particular, Japanese-English parallel corpora are very scarce. Although some parallel corpora
(Utiyama and Isahara, 2007) are available, the do-
mains and sizes of these corpora are limited.
In general, European countries use multiple lan-
guages officially. Exploiting this multilingual environment, Koehn (2005) built a corpus by collecting parallel texts in eleven languages from the
proceedings of the European Parliament, which are
published on the Web.
However, some countries, such as Japan, do not have such a language situation, which makes it difficult to create parallel corpora. Hence, more effort is needed to collect them effectively.
Available Japanese-English parallel corpora are scarce. However, there are many translated texts on the Web. In particular, open source manuals are translated from English into Japanese by volunteer translators.
We collected such English and Japanese texts.
Then, the sentences in collected texts were automat-
ically aligned, resulting in a parallel corpus made
from open source software manuals. Manuals of open source software have been used for making a parallel corpus named OPUS (Tiedemann and Nygaard, 2004), which was made from OpenOffice.org documentation1, KDE manuals including KDE messages2, and PHP manuals3. However, the Japanese-English part of OPUS is not large. In contrast, we
collected about 500 thousand sentence pairs. In ad-
dition, our work involved extensive human efforts to
ensure the quality of our parallel corpus.
Original and translated texts often prohibit copying, distribution, display, and the making of derivative works. Our target texts are open source software manuals. Such manuals are often published under open licenses that allow us to modify and distribute them.
The translation quality of open source software manuals is considered to be relatively high, because they are translated by many translators who belong to the projects, and drafts of the translations are corrected by other project members. Therefore, we can trust the quality of software manuals.
1http://www.openoffice.org
2http://i18n.kde.org
3http://www.php.net/download-docs.php
In the following, we present how we collected, cleaned, and aligned software manuals. We also report the performance of SMT experiments using the corpus.
2 Target license
We will publish a parallel corpus constructed from
open source software manuals. This action is considered a redistribution with modifications. We
therefore target licenses that allow redistribution and
modifications.
Here are four example licenses that allow redistribution and modification.

MIT License4
The MIT License is very permissive. It is only necessary to include the copyright notice and the permission notice.
FreeBSD Documentation License5
The FreeBSD Documentation License is similar
to the MIT License. It is as follows:
Redistribution and use in source (SGML DocBook) and 'compiled' forms (SGML, HTML, PDF, PostScript, RTF and so forth) with or without modification, are permitted provided that the following conditions are met:

(1) Redistributions of source code (SGML DocBook) must retain the above copyright notice, this list of conditions and the following disclaimer as the first lines of this file unmodified.

(2) Redistributions in compiled form (transformed to other DTDs, converted to PDF, PostScript, RTF and other formats) must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Creative Commons licenses6
Creative Commons licenses include several license types. We introduce the "Attribution-Share Alike 3.0 Unported" model, under which we can copy, distribute, transmit, and adapt the work under the following conditions:
4http://www.opensource.org/licenses/mit-license.php
5http://www.freebsd.org/copyright/freebsd-doc-
license.html
6http://creativecommons.org/licenses/by-nc-sa/3.0/deed.en
Attribution You must attribute the work in
the manner specified by the author or li-
censor (but not in any way that suggests
that they endorse you or your use of the
work).
Share Alike If you alter, transform, or build
upon this work, you may distribute the
resulting work only under the same, sim-
ilar or a compatible license.
Common Development and Distribution License7
The Common Development and Distribution License (CDDL) is a long and detailed license. Thus, we describe the CDDL briefly using the FAQ of NetBeans8. We can modify NetBeans source code and redistribute it for free (or for sale) as long as we follow the terms of the CDDL, including the following provisions:

We must bundle the CDDL with any source code version we distribute.

We must make our changes to the NetBeans source code (but not new source files we create) available to the NetBeans community under the CDDL (so the community can benefit from our changes or improvements).

We cannot modify the rights granted under the CDDL License.

We can add external files to NetBeans, compile these, and redistribute them for free or for sale, and we do not need to make such external files or changes to them available in source code or binary form to the NetBeans project. If our value-add is worth the price, we can sell it.
3 Characteristics of manuals
Software manuals are difficult to handle.
Japanese open source software manuals usually
contain both Japanese and English texts. Conse-
quently, some parts of manuals are not needed for
making parallel corpora. For example, when manu-
als explain commands, the commands are not trans-
lated into Japanese. Further, open source software
manuals often include program source codes, which
are not translated.
7http://opensource.org/licenses/cddl1.php
8http://wiki.netbeans.org/FaqCDDLinANutshell
Software is periodically updated, primarily by
adding new features and functionality. Thus, new
sentences are typically added to the corresponding
manuals. As a result, the latest original document
version may be newer than the translated document
version. Therefore, we have to collect original and
translated documents with matching versions. This
requires human effort.
In many cases, translated open source software
manuals are HTML files with HTML tags that con-
form to the original document’s format, but this need
not be so. Therefore, we have to modify translated
documents to match the original format. In addi-
tion, the formats of open source manuals differ from
project to project. Consequently, we have to write a
script to extract text portions for each project.
4 Constructing the corpus
Constructing the Japanese-English corpus takes
three steps:
(1) searching for open source software manuals on
the Web
(2) cleaning up documents, and
(3) aligning sentences.
We describe these three steps in the following sub-
sections.
4.1 Searching for open source software
manuals
We used Web search engines manually to search
for open source software manuals. We searched
for Japanese Web pages containing phrases such
as 翻訳プロジェクト (translation project). Then, we manually checked whether those pages contained software manuals. If they did, we downloaded the manuals. We also searched for the corresponding manuals in English. When downloading documents, we
checked and matched the versions of both English
and Japanese documents.
Table 1 shows the list of collected manuals and their URLs. In the table, JF represents the "Linux Japanese FAQ Project", which translates documents related to Linux. JM means the "JM project", which translates Linux manual pages. RFC represents "Request for Comments". Note that RFCs are not manuals, but their contents are similar to manuals and their use and importance are widespread. The others are open source software manuals.
4.2 Cleaning up documents
Software manuals contain HTML tags. We normal-
ize documents by deleting HTML tags with a Perl
script using pattern matching. This script is tailored
to each software manual.
Sentences in software manuals are often broken
by newlines. It is difficult to judge whether a new-
line character represents a sentence end or not. For
example, headings are usually separated by newlines without periods. We delete newline characters in a paragraph, defined as a text region separated by empty lines, if that paragraph contains
punctuation. Otherwise, newlines are not deleted be-
cause they are regarded as sentence ends.
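The cleanup steps above can be sketched as follows. This is a minimal illustration under the paper's definitions; real manuals need the project-specific rules mentioned above, and the tag-stripping regex is deliberately naive:

```python
import re

def clean_document(html: str) -> str:
    """Normalize one manual page: drop HTML tags, then join line-broken
    sentences inside paragraphs (a paragraph being a block separated by
    empty lines, per the paper's definition)."""
    # Strip tags with simple pattern matching; as the paper notes, each
    # project's manuals need a tailored script in practice.
    text = re.sub(r"<[^>]+>", "", html)

    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for para in paragraphs:
        # Join newlines only when the paragraph contains punctuation;
        # otherwise keep newlines, treating them as sentence (heading) ends.
        if re.search(r"[.。！？!?]", para):
            cleaned.append(" ".join(
                line.strip() for line in para.splitlines() if line.strip()))
        else:
            cleaned.append(para.strip())
    return "\n\n".join(cleaned)
```

For example, `clean_document("<p>Hello\nworld.</p>")` joins the broken sentence into one line, while a punctuation-free block of headings is left with its newlines intact.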
4.3 Aligning sentences
We use Utiyama and Isahara’s alignment method,
because their method has been successfully used
in aligning noisy Japanese-English parallel texts
(Utiyama and Isahara, 2007). Below is a concise
description of their algorithm.
We begin by obtaining the maximum similarity sentence alignments. Let J and E be a Japanese text file and an English text file, respectively. We calculate the maximum similarity sentence alignments (J1, E1), (J2, E2), ..., (Jm, Em), using a dynamic programming matching method (Gale and Church, 1993), where (Ji, Ei) is a Japanese and English sentence alignment pair in J and E. We allow 1-to-n, n-to-1 (0 ≤ n ≤ 5), or 2-to-2 alignments when aligning sentences. The similarity between Ji and Ei is calculated based on word overlap (i.e., number of word pairs from Ji and Ei that are translations of each other based on a bilingual dictionary with 450,000+ entries). The similarity between a Japanese document, J, and an English document, E, (noted AVSIM(J, E)) is calculated using:

AVSIM(J, E) = (1/m) Σ_{i=1}^{m} SIM(Ji, Ei)    (1)
A high AVSIM(J, E) value occurs when the sentence alignments in J and E take on high similarity values. We also calculate the ratio of the number of sentences between J and E (noted R(J, E)) using:

R(J, E) = min(|J|/|E|, |E|/|J|)    (2)

where |J| is the number of sentences in J, and |E| is the number of sentences in E. A high R(J, E) value occurs when |J| ≈ |E|. Consequently, R(J, E) can be used to measure the proportion of potentially corresponding sentences.

Using AVSIM(J, E) and R(J, E), we define the similarity between J and E (noted AR(J, E)) as

AR(J, E) = AVSIM(J, E) × R(J, E)    (3)

Finally, we define the score of the alignment of Ji and Ei as

Score(Ji, Ei) = SIM(Ji, Ei) × AR(J, E)    (4)

A high Score(Ji, Ei) value occurs in the following case: (1) the sentences Ji and Ei are similar, (2) the documents J and E are similar, and (3) the numbers of sentences |J| and |E| are similar. Score(Ji, Ei) combines both sentence and document similarities to discriminate between correct and incorrect alignments.

              Japanese                                 English
FreeBSD       http://www.freebsd.org/ja/               http://www.freebsd.org/
Gentoo Linux  http://www.gentoo.org/doc/ja/index.xml   http://www.gentoo.org/doc/en/index.xml
JF            http://www.linux.or.jp/JF/               http://www.kernel.org/pub/linux/kernel/
                                                       http://tldp.org/
                                                       http://www.sfr-fresh.com/
JM            http://www.linux.or.jp/JM/               http://www.kernel.org/
                                                       http://ftp.gnu.org/gnu/
Net Beans     http://ja.netbeans.org/index.html        http://www.netbeans.org/index.html
PEAR          http://pear.php.net/index.php            http://pear.php.net/index.php
PHP           http://www.php.net/download-docs.php     http://www.php.net/download-docs.php
PostgreSQL    http://www.postgresql.jp/                http://www.postgresql.org/
Python        http://www.python.jp/doc/                http://docs.python.org/download.html
RFC           collected from many sites                http://www.rfc-editor.org/
XFree86       http://xjman.dsl.gr.jp/download.html     http://www.xfree86.org/

Table 1: List of collected manuals
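The document-level and sentence-level scores above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `is_translation` stands in for a lookup in the 450,000+-entry bilingual dictionary, and the length normalization inside `sim` is an assumption, since the paper does not give SIM's exact form.

```python
def sim(ja_words, en_words, is_translation):
    """Word-overlap similarity between one Japanese and one English
    sentence: count word pairs that are translations of each other,
    normalized here by the combined sentence length (an assumption)."""
    matched = sum(1 for j in ja_words for e in en_words if is_translation(j, e))
    return matched / max(len(ja_words) + len(en_words), 1)

def avsim(alignments, is_translation):
    """Equation (1): AVSIM(J, E) is the mean SIM over the
    maximum-similarity sentence alignments (J1, E1), ..., (Jm, Em)."""
    sims = [sim(j, e, is_translation) for j, e in alignments]
    return sum(sims) / len(sims)

def r(n_ja, n_en):
    """Equation (2): R(J, E) = min(|J|/|E|, |E|/|J|)."""
    return min(n_ja / n_en, n_en / n_ja)

def score(ja_words, en_words, alignments, n_ja, n_en, is_translation):
    """Equation (4): Score(Ji, Ei) = SIM(Ji, Ei) x AR(J, E),
    where AR(J, E) = AVSIM(J, E) x R(J, E) is Equation (3)."""
    ar = avsim(alignments, is_translation) * r(n_ja, n_en)
    return sim(ja_words, en_words, is_translation) * ar
```

Because AR(J, E) multiplies every sentence score in a document pair, a noisy document pair (low AVSIM or very unequal sentence counts) drags down all of its alignments, which is exactly the intended discrimination.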
4.4 Results of sentence alignment
We examined the results of the sentence alignment
and concluded that 1-to-1, 1-to-2, or 2-to-1 sentence
alignments are clean. Thus, we extracted only these
sentence alignments to make our parallel corpus.
Although we included all 1-to-1, 1-to-2, or 2-to-1
alignments, it is possible to extract only highly pre-
cise sentence alignments if we use the score defined
in Equation 4, as verified in (Utiyama and Isahara,
2007).
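The extraction step can be sketched as a simple filter; the data layout here (tuples of sentence lists plus an Equation-4 score) is hypothetical.

```python
def extract_clean_pairs(alignments, score_threshold=None):
    """Keep only 1-to-1, 1-to-2, and 2-to-1 alignments, optionally
    filtered further by the Equation-4 score, as verified in
    Utiyama and Isahara (2007). Each alignment is a tuple
    (ja_sentences, en_sentences, score)."""
    keep = []
    for ja, en, s in alignments:
        # Discard larger alignment patterns (e.g. 2-to-2, 1-to-5).
        if (len(ja), len(en)) not in {(1, 1), (1, 2), (2, 1)}:
            continue
        # Optionally demand high-precision alignments only.
        if score_threshold is not None and s < score_threshold:
            continue
        keep.append((ja, en, s))
    return keep
```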
Table 2 shows the number of aligned sentences. Overall, there is a total of just under 500 thousand sentence pairs. Among these, over 90% of the sentence alignments are 1-to-1.
Table 3 shows examples of aligned sentences. As
the examples show, some Japanese sentences in-
clude English words, such as “PUT” or “root”. We
also see that both long and short sentences are in-
cluded.
We found that over 80% of sentence alignments
were precisely aligned. We think that further im-
provements are possible, since we have failed to
clean up some noisy sentences. Our simple pattern-matching rules did not work well for removing some sentences, such as notes by translators. We expect that the alignment accuracy would improve if we removed such noisy sentences.
5 MT Experiments
MT experiments were conducted to verify the use-
fulness of our constructed corpus for SMT. We used
the Moses system (Koehn et al., 2007). We used GIZA++ (Och and Ney, 2003) for word alignment and SRILM (Stolcke, 2002) for language modeling.
In our experiments, we used 5-gram language models. Minimum error rate training (MERT) was performed to tune the decoder's parameters on the basis of the bilingual evaluation understudy (BLEU) score (Papineni et al., 2002). The evaluation was done using a single reference. Tuning was performed using the standard technique developed by Och (Och, 2003). The test and development data were extracted from the aligned JF sentences. The test and development sets each consist of 500 sentences.

              sentences   English tokens (avg. per sentence)   Japanese tokens (avg. per sentence)
FreeBSD       10528       156749 (14.9)                        245780 (23.34)
Gentoo Linux  11117       1488461 (13.39)                      224324 (20.17)
JF            122072      1867792 (15.30)                      2854297 (23.38)
JM            41573       483098 (11.62)                       731045 (17.58)
Net Beans     32774       450849 (13.76)                       682229 (20.82)
PEAR          23333       294233 (12.61)                       446863 (19.15)
PHP           67023       639857 (9.55)                        977281 (14.58)
PostgreSQL    22843       396570 (17.36)                       627994 (27.49)
Python        26215       297830 (11.36)                       499860 (19.07)
RFC           128827      2229786 (17.31)                      3201737 (24.85)
XFree86       12155       171725 (14.27)                       277254 (22.81)
total         498460      8476950 (13.77)                      10768664 (21.20)

Table 2: Number of aligned sentences

Japanese: 現在設定されている PUT ファイルへのパスを含む文字列を返します
English:  Returns a string containing the path to the currently set put file.

Japanese: root ルトパスです
English:  That will be a default path for normal and root users respectively.

Japanese: クライアント機は Grub で、フロッピーディスクからブートします
English:  The client machine boots from a Grub floppy disk.

Japanese: メッセージの HTTP ヘッダを含む連想配列を返します
English:  Returns an associative array containing the message's HTTP headers.

Japanese: 画像のマットチャネルを設定します
English:  Sets the image matte channel.

Japanese: さまざまなハッシュアルゴリズムを使用して任意の長さのメッセージに対する直接的あるいは段階的な処理を可能とします
English:  Allows direct or incremental processing of arbitrary length messages using a variety of hashing algorithms.

Japanese: 塗りつぶしや描画を行わずに現在のパスオブジェクトを終了します
English:  Ends current path object without performing filling and painting operations.

Japanese: これーザ所有BIOS 設定、カーネル構成、およびいくつかの簡素化を含んでいます
English:  This includes BIOS settings, kernel configuration and some simplifications in user land.

Japanese: このシグナルはリモートからセッションのチェックポイントを行うときにも利用できる。
English:  This signal can be used to perform a remote checkpoint of a session.

Japanese: Xlib はテキストの描画やテキストのディメンジョンの計算で必要な時だけフォントをロードし、フォントデータをキャッシュすることを選択できる。
English:  Xlib may choose to cache font data, loading it only as needed to draw text or compute text dimensions.

Table 3: Examples of parallel sentences
In the following experiments, we simulated a situation where an SMT system is applied to help volunteer translators translate English JF documents into Japanese. We want to use all parallel sentences efficiently to help translators; this is a problem of domain adaptation. All of the parallel sentences were translated from English to Japanese. Therefore, we conducted our MT experiments in the English-to-Japanese direction.
In the first experiment, we used all parallel sen-
tences (excluding development and test sentences)
as our training data, which contained approximately
500 thousand parallel sentences, as shown in Ta-
ble 2. The BLEU score obtained was 37.38.
In the second experiment, we used only JF par-
allel sentences (approximately 100 thousand sen-
tences). The BLEU score obtained was 40.02.
In the third experiment, we linearly interpolated the language models of the first and second experiments. We varied the weight of JF's language model over 0.1, 0.3, 0.5, 0.7, and 0.9. The BLEU scores were 38.40, 39.30, 38.92, 40.07, and 42.53, respectively. The BLEU score was highest when the weight was 0.9. The translation model used was that of the first experiment.
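Linear interpolation of the two language models can be illustrated with a toy unigram sketch. Real systems such as SRILM interpolate full n-gram models; `unigram_lm` and the token lists below are illustrative only.

```python
from collections import Counter

def unigram_lm(corpus_tokens):
    """Toy unigram LM: relative frequencies over a token list."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return lambda w: counts[w] / total if total else 0.0

def interpolate(p_general, p_in_domain, weight=0.9):
    """Linear interpolation of two LMs:
    p(w) = weight * p_in_domain(w) + (1 - weight) * p_general(w).
    The paper's best weight for the in-domain (JF) model was 0.9."""
    return lambda w: weight * p_in_domain(w) + (1 - weight) * p_general(w)
```

For example, with an in-domain model trained on `["a", "a"]` and a general model on `["a", "b"]`, a weight of 0.5 gives the word "a" the interpolated probability 0.5 × 1.0 + 0.5 × 0.5 = 0.75.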
In the fourth experiment, we log-linearly interpo-
lated translation models of the first and second ex-
periments. The weights were set with MERT. The
BLEU score was 41.26. The language model used
was that in the first experiment.
In the final experiment, we used the language
model with a weight of 0.9 in the third experiment
and the translation model in the fourth experiment.
The BLEU score was 44.36, which was the highest
in these experiments.
6 Discussion
The BLEU scores obtained were relatively high for
a Japanese-English corpus. For example, Utiyama
and Isahara (2007) reported a maximum BLEU
score of 25.89 for patent document SMT.
However, we have to be careful about our ex-
perimental results. First, our test sentences were
extracted from those having the highest alignment
scores, which might not be representative samples.
This was because sentences with low alignment
scores could be wrong alignments, which were not
suitable for measuring SMT performance. We plan
to sample test sentences from the whole corpus and
clean them for the purpose of evaluation in our fu-
ture work. Second, our Japanese word segmenter
segmented ASCII words into characters. (Japanese
words were segmented properly.) For example,
“word” was segmented into “w o r d” in the cor-
pus used in the above experiments. Consequently,
the BLEU scores obtained were optimistic, though
ASCII words occurred much less frequently than Japanese words.
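A post-hoc repair for this segmentation issue can be sketched as follows; this heuristic is our illustration, not the tooling the authors used:

```python
def rejoin_ascii(tokens):
    """Merge runs of single ASCII characters that a Japanese segmenter
    split apart, e.g. ['w', 'o', 'r', 'd'] -> ['word']. Heuristic: a
    genuine one-character ASCII token adjacent to another would also be
    merged, so this is a rough fix only."""
    out, run = [], []
    for t in tokens:
        if len(t) == 1 and t.isascii() and t.isalnum():
            run.append(t)
        else:
            if run:
                out.append("".join(run))
                run = []
            out.append(t)
    if run:
        out.append("".join(run))
    return out
```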
Because the BLEU scores were rather optimistic,
we manually examined test sentences. We found
that short sentences were generally translated well, whereas longer sentences were not.
Overall, we concluded that our parallel corpus is
useful for English-Japanese SMT developments. We
also hope that this corpus will be useful for support-
ing human translations of these manuals.
7 Conclusion
We have reported a project on developing a
Japanese-English parallel corpus made from soft-
ware manuals. It has approximately 500 thousand
sentence pairs. The corpus will be available at
http://www2.nict.go.jp/x/x161/members/mutiyama/
manual/index.html
References
Masao Utiyama and Hitoshi Isahara. 2007. A Japanese-
English Patent Parallel Corpus. In MT summit XI,
pages 475–482.
Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of the Machine Translation Summit X, pages 79–86.
Jörg Tiedemann and Lars Nygaard. 2004. The OPUS corpus - parallel and free. In LREC, pages 93–96.
William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics, 19(1):75–102.
Philipp Koehn, et al. 2007. Moses: Open source toolkit for statistical machine translation. In ACL Demo and Poster Sessions, pages 79–86.
Andreas Stolcke. 2002. SRILM - an extensible language modeling toolkit. In ICSLP, pages 901–904.
Kishore Papineni, et al. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL, pages 311–318.
Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In ACL, pages 160–167.
Franz Josef Och and Hermann Ney. 2003. A system-
atic comparison of various statistical alignment mod-
els. Computational Linguistics, 29(1):19–51.
