Development of a Japanese-English Software Manual Parallel Corpus
Tatsuya Ishisaka
Kazuhide Yamamoto
Nagaoka University of Technology
1603-1 Kamitomiokamachi, Nagaoka,
Niigata 940-2188, Japan
{ishisaka,ykaz}@nlp.nagaokaut.ac.jp
Masao Utiyama††
Eiichiro Sumita††
†† MASTAR Project
National Institute of Information
and Communications Technology
3-5, Hikaridai, Seika, Soraku,
Kyoto 619-0289, Japan
{mutiyama,eiichiro.sumita}
@nict.go.jp
Abstract
To address the shortage of Japanese-English
parallel corpora, we developed a parallel cor-
pus by collecting open source software man-
uals from the Web. The constructed cor-
pus contains approximately 500 thousand sen-
tence pairs that were aligned automatically by
an existing method. We also conducted statis-
tical machine translation (SMT) experiments
with the corpus and confirmed that the corpus
is useful for SMT.
1 Introduction
Multilingual parallel corpora are required to sup-
port many tasks in natural language processing. For
example, statistical machine translation (SMT) re-
quires a parallel corpus for training, and cross-lingual processing tasks such as information retrieval and information extraction also use parallel corpora.
There is no doubt about the importance of parallel corpora for any language pair.
In particular, Japanese-English parallel corpora are very scarce. Although some parallel corpora
(Utiyama and Isahara, 2007) are available, the do-
mains and sizes of these corpora are limited.
In general, European countries use multiple lan-
guages officially. Exploiting this multilingual environment, Koehn (2005) built a corpus by collecting parallel texts in eleven languages from the
proceedings of the European Parliament, which are
published on the Web.
However, some countries, such as Japan, do not have such a language situation, which makes it difficult to create parallel corpora. Hence, more effort is needed to collect them effectively.
Available Japanese-English parallel corpora are scarce. However, there are many translated texts on the Web. In particular, open source manuals are translated from English into Japanese by volunteer translators.
We collected such English and Japanese texts.
Then, the sentences in collected texts were automat-
ically aligned, resulting in a parallel corpus made
from open source software manuals. Manuals of open source software have been used for making a parallel corpus named OPUS (Tiedemann and Nygaard, 2004), which was made from OpenOffice.org documentation1, KDE manuals including KDE messages2, and PHP manuals3. However, the Japanese-English part of OPUS is not large. In contrast, we
collected about 500 thousand sentence pairs. In ad-
dition, our work involved extensive human efforts to
ensure the quality of our parallel corpus.
Original and translated texts often prohibit copying, distribution, display, and the making of derivative works. Our target texts are open source software manuals. Such manuals are often published under open licenses that allow us to modify and distribute them.
The translation quality of open source software manuals is considered to be relatively high, because they are translated by many translators who belong to the projects, and drafts of the translations are corrected by other project members. Therefore, we can trust the quality of software manuals.
1http://www.openoffice.org
2http://i18n.kde.org
3http://www.php.net/download-docs.php
In the following, we present how we collected, cleaned, and aligned software manuals. We also report the performance of SMT experiments using the corpus.
2 Target license
We will publish a parallel corpus constructed from
open source software manuals. This action is considered a redistribution with modifications. We
therefore target licenses that allow redistribution and
modifications.
Here are four example licenses that allow redistribution and modification.

MIT License4
The MIT License is very permissive. It is only necessary to include the copyright notice and the permission notice.
FreeBSD Documentation License5
The FreeBSD Documentation License is similar
to the MIT License. It is as follows:
Redistribution and use in source (SGML DocBook) and 'compiled' forms (SGML, HTML, PDF, PostScript, RTF and so forth) with or without modification, are permitted provided that the following conditions are met:

(1) Redistributions of source code (SGML DocBook) must retain the above copyright notice, this list of conditions and the following disclaimer as the first lines of this file unmodified.

(2) Redistributions in compiled form (transformed to other DTDs, converted to PDF, PostScript, RTF and other formats) must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Creative Commons licenses6
Creative Commons licenses include several license types. We introduce the "Attribution-Share Alike 3.0 Unported" model, under which we can copy, distribute, transmit, and adapt the work under the following conditions:
4http://www.opensource.org/licenses/mit-license.php
5http://www.freebsd.org/copyright/freebsd-doc-
license.html
6http://creativecommons.org/licenses/by-nc-sa/3.0/deed.en
Attribution You must attribute the work in
the manner specified by the author or li-
censor (but not in any way that suggests
that they endorse you or your use of the
work).
Share Alike If you alter, transform, or build
upon this work, you may distribute the
resulting work only under the same, sim-
ilar or a compatible license.
Common Development and Distribution License7
The Common Development and Distribution License (CDDL) is a long and detailed license. Thus, we describe the CDDL briefly using the FAQ of NetBeans8. We can modify NetBeans source code and redistribute it for free (or for sale) as long as we follow the terms of the CDDL, including the following provisions:

We must bundle the CDDL with any source code version we distribute.

We must make our changes to the NetBeans source code (but not new source files we create) available to the NetBeans community under the CDDL (so the community can benefit from our changes or improvements).

We cannot modify the rights granted under the CDDL License.

We can add external files to NetBeans, compile these, and redistribute them for free or for sale, and we do not need to make such external files or changes to them available in source code or binary form to the NetBeans project. If our value-add is worth the price, we can sell it.
3 Characteristics of manuals
Software manuals are difficult to handle.
Japanese open source software manuals usually
contain both Japanese and English texts. Conse-
quently, some parts of manuals are not needed for
making parallel corpora. For example, when manu-
als explain commands, the commands are not trans-
lated into Japanese. Further, open source software
manuals often include program source codes, which
are not translated.
7http://opensource.org/licenses/cddl1.php
8http://wiki.netbeans.org/FaqCDDLinANutshell
Software is periodically updated, primarily by
adding new features and functionality. Thus, new
sentences are typically added to the corresponding
manuals. As a result, the latest original document
version may be newer than the translated document
version. Therefore, we have to collect original and
translated documents with matching versions. This
requires human effort.
In many cases, translated open source software
manuals are HTML files with HTML tags that con-
form to the original document’s format, but this need
not be so. Therefore, we have to modify translated
documents to match the original format. In addi-
tion, the formats of open source manuals differ from
project to project. Consequently, we have to write a
script to extract text portions for each project.
4 Constructing the corpus
Constructing the Japanese-English corpus takes
three steps:
(1) searching for open source software manuals on
the Web
(2) cleaning up documents, and
(3) aligning sentences.
We describe these three steps in the following sub-
sections.
4.1 Searching for open source software
manuals
We used Web search engines manually to search
for open source software manuals. We searched
for Japanese Web pages containing phrases such
as 翻訳プロジェクト (translation project). Then, we manually checked whether those pages contained software manuals. If they did, we downloaded the manuals. We also searched for the corresponding manuals in English. When downloading documents, we
checked and matched the versions of both English
and Japanese documents.
Table 1 shows the list of collected manuals and their URLs. In the table, JF represents the "Linux Japanese FAQ Project", which translates documents related to Linux. JM means the "JM project", which translates Linux manual pages. RFC represents "Request for Comments". Note that RFCs are not manuals, but their contents are similar to manuals and their use and importance are widespread. The others are open source software manuals.
4.2 Cleaning up documents
Software manuals contain HTML tags. We normal-
ize documents by deleting HTML tags with a Perl
script using pattern matching. This script is tailored
to each software manual.
Sentences in software manuals are often broken
by newlines. It is difficult to judge whether a new-
line character represents a sentence end or not. For
example, headings are usually separated by newlines without periods. We delete newline characters in a paragraph, defined as a text region separated by empty lines, if that paragraph contains
punctuation. Otherwise, newlines are not deleted be-
cause they are regarded as sentence ends.
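The cleanup steps above can be sketched as follows. This is a minimal illustration under the paper's definitions; real manuals need the project-specific rules mentioned above, and the tag-stripping regex is deliberately naive:

```python
import re

def clean_document(html: str) -> str:
    """Normalize one manual page: drop HTML tags, then join line-broken
    sentences inside paragraphs (a paragraph being a block separated by
    empty lines, per the paper's definition)."""
    # Strip tags with simple pattern matching; as the paper notes, each
    # project's manuals need a tailored script in practice.
    text = re.sub(r"<[^>]+>", "", html)

    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for para in paragraphs:
        # Join newlines only when the paragraph contains punctuation;
        # otherwise keep newlines, treating them as sentence (heading) ends.
        if re.search(r"[.。！？!?]", para):
            cleaned.append(" ".join(
                line.strip() for line in para.splitlines() if line.strip()))
        else:
            cleaned.append(para.strip())
    return "\n\n".join(cleaned)
```

For example, `clean_document("<p>Hello\nworld.</p>")` joins the broken sentence into one line, while a punctuation-free block of headings is left with its newlines intact.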
4.3 Aligning sentences
We use Utiyama and Isahara’s alignment method,
because their method has been successfully used
in aligning noisy Japanese-English parallel texts
(Utiyama and Isahara, 2007). Below is a concise
description of their algorithm.
We begin by obtaining the maximum similarity sentence alignments. Let J and E be a Japanese text file and an English text file, respectively. We calculate the maximum similarity sentence alignments (J1, E1), (J2, E2), ..., (Jm, Em), using a dynamic programming matching method (Gale and Church, 1993), where (Ji, Ei) is a Japanese and English sentence alignment pair in J and E. We allow 1-to-n, n-to-1 (0 ≤ n ≤ 5), or 2-to-2 alignments when aligning sentences. The similarity between Ji and Ei is calculated based on word overlap (i.e., number of word pairs from Ji and Ei that are translations of each other based on a bilingual dictionary with 450,000+ entries). The similarity between a Japanese document, J, and an English document, E, (noted AVSIM(J, E)) is calculated using:

AVSIM(J, E) = (1/m) Σ_{i=1}^{m} SIM(Ji, Ei)    (1)
A high AVSIM(J, E) value occurs when the sentence alignments in J and E take on high similarity values. We also calculate the ratio of the number of sentences between J and E (noted R(J, E)) using:

R(J, E) = min(|J|/|E|, |E|/|J|)    (2)

where |J| is the number of sentences in J, and |E| is the number of sentences in E. A high R(J, E) value occurs when |J| ≈ |E|. Consequently, R(J, E) can be used to measure the proportion of potentially corresponding sentences.

Using AVSIM(J, E) and R(J, E), we define the similarity between J and E (noted AR(J, E)) as

AR(J, E) = AVSIM(J, E) × R(J, E)    (3)

Finally, we define the score of the alignment of Ji and Ei as

Score(Ji, Ei) = SIM(Ji, Ei) × AR(J, E)    (4)

A high Score(Ji, Ei) value occurs in the following case: (1) the sentences Ji and Ei are similar, (2) the documents J and E are similar, and (3) the numbers of sentences |J| and |E| are similar. Score(Ji, Ei) combines both sentence and document similarities to discriminate between correct and incorrect alignments.

              Japanese                                 English
FreeBSD       http://www.freebsd.org/ja/               http://www.freebsd.org/
Gentoo Linux  http://www.gentoo.org/doc/ja/index.xml   http://www.gentoo.org/doc/en/index.xml
JF            http://www.linux.or.jp/JF/               http://www.kernel.org/pub/linux/kernel/
                                                       http://tldp.org/
                                                       http://www.sfr-fresh.com/
JM            http://www.linux.or.jp/JM/               http://www.kernel.org/
                                                       http://ftp.gnu.org/gnu/
Net Beans     http://ja.netbeans.org/index.html        http://www.netbeans.org/index.html
PEAR          http://pear.php.net/index.php            http://pear.php.net/index.php
PHP           http://www.php.net/download-docs.php     http://www.php.net/download-docs.php
PostgreSQL    http://www.postgresql.jp/                http://www.postgresql.org/
Python        http://www.python.jp/doc/                http://docs.python.org/download.html
RFC           collected from many sites                http://www.rfc-editor.org/
XFree86       http://xjman.dsl.gr.jp/download.html     http://www.xfree86.org/

Table 1: List of collected manuals
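The document-level and sentence-level scores above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `is_translation` stands in for a lookup in the 450,000+-entry bilingual dictionary, and the length normalization inside `sim` is an assumption, since the paper does not give SIM's exact form.

```python
def sim(ja_words, en_words, is_translation):
    """Word-overlap similarity between one Japanese and one English
    sentence: count word pairs that are translations of each other,
    normalized here by the combined sentence length (an assumption)."""
    matched = sum(1 for j in ja_words for e in en_words if is_translation(j, e))
    return matched / max(len(ja_words) + len(en_words), 1)

def avsim(alignments, is_translation):
    """Equation (1): AVSIM(J, E) is the mean SIM over the
    maximum-similarity sentence alignments (J1, E1), ..., (Jm, Em)."""
    sims = [sim(j, e, is_translation) for j, e in alignments]
    return sum(sims) / len(sims)

def r(n_ja, n_en):
    """Equation (2): R(J, E) = min(|J|/|E|, |E|/|J|)."""
    return min(n_ja / n_en, n_en / n_ja)

def score(ja_words, en_words, alignments, n_ja, n_en, is_translation):
    """Equation (4): Score(Ji, Ei) = SIM(Ji, Ei) x AR(J, E),
    where AR(J, E) = AVSIM(J, E) x R(J, E) is Equation (3)."""
    ar = avsim(alignments, is_translation) * r(n_ja, n_en)
    return sim(ja_words, en_words, is_translation) * ar
```

Because AR(J, E) multiplies every sentence score in a document pair, a noisy document pair (low AVSIM or very unequal sentence counts) drags down all of its alignments, which is exactly the intended discrimination.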
4.4 Results of sentence alignment
We examined the results of the sentence alignment
and concluded that 1-to-1, 1-to-2, or 2-to-1 sentence
alignments are clean. Thus, we extracted only these
sentence alignments to make our parallel corpus.
Although we included all 1-to-1, 1-to-2, or 2-to-1
alignments, it is possible to extract only highly pre-
cise sentence alignments if we use the score defined
in Equation 4, as verified in (Utiyama and Isahara,
2007).
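The extraction step can be sketched as a simple filter; the data layout here (tuples of sentence lists plus an Equation-4 score) is hypothetical.

```python
def extract_clean_pairs(alignments, score_threshold=None):
    """Keep only 1-to-1, 1-to-2, and 2-to-1 alignments, optionally
    filtered further by the Equation-4 score, as verified in
    Utiyama and Isahara (2007). Each alignment is a tuple
    (ja_sentences, en_sentences, score)."""
    keep = []
    for ja, en, s in alignments:
        # Discard larger alignment patterns (e.g. 2-to-2, 1-to-5).
        if (len(ja), len(en)) not in {(1, 1), (1, 2), (2, 1)}:
            continue
        # Optionally demand high-precision alignments only.
        if score_threshold is not None and s < score_threshold:
            continue
        keep.append((ja, en, s))
    return keep
```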
Table 2 shows the number of aligned sentences. Overall, there is a total of just under 500 thousand sentence pairs. Among these, over 90% of the sentence alignments are 1-to-1.
Table 3 shows examples of aligned sentences. As
the examples show, some Japanese sentences in-
clude English words, such as “PUT” or “root”. We
also see that both long and short sentences are in-
cluded.
We found that over 80% of sentence alignments
were precisely aligned. We think that further im-
provements are possible, since we have failed to
clean up some noisy sentences. Our simple pattern-matching rules did not work well for removing some sentences, such as notes by translators. We expect that the alignment accuracy would improve if we removed such noisy sentences.
5 MT Experiments
MT experiments were conducted to verify the use-
fulness of our constructed corpus for SMT. We used
the Moses system (Koehn et al., 2007). We used GIZA++ (Och and Ney, 2003) for word alignment and SRILM (Stolcke, 2002) for language modeling.
In our experiments, we used 5-gram language models. Minimum error rate training (MERT) was performed to tune the decoder's parameters on the basis of the bilingual evaluation understudy (BLEU) score (Papineni et al., 2002). The evaluation was done using a single reference. Tuning was performed using the standard technique developed by Och (Och, 2003). The test and development data were extracted from the aligned JF sentences. The test and development sets each consist of 500 sentences.

              sentences   English tokens (avg. per sentence)   Japanese tokens (avg. per sentence)
FreeBSD       10528       156749 (14.9)                        245780 (23.34)
Gentoo Linux  11117       1488461 (13.39)                      224324 (20.17)
JF            122072      1867792 (15.30)                      2854297 (23.38)
JM            41573       483098 (11.62)                       731045 (17.58)
Net Beans     32774       450849 (13.76)                       682229 (20.82)
PEAR          23333       294233 (12.61)                       446863 (19.15)
PHP           67023       639857 (9.55)                        977281 (14.58)
PostgreSQL    22843       396570 (17.36)                       627994 (27.49)
Python        26215       297830 (11.36)                       499860 (19.07)
RFC           128827      2229786 (17.31)                      3201737 (24.85)
XFree86       12155       171725 (14.27)                       277254 (22.81)
total         498460      8476950 (13.77)                      10768664 (21.20)

Table 2: Number of aligned sentences

Japanese: 現在設定されている PUT ファイルへのパスを含む文字列を返します
English:  Returns a string containing the path to the currently set put file.

Japanese: root ルトパスです
English:  That will be a default path for normal and root users respectively.

Japanese: クライアント機は Grub で、フロッピーディスクからブートします
English:  The client machine boots from a Grub floppy disk.

Japanese: メッセージの HTTP ヘッダを含む連想配列を返します
English:  Returns an associative array containing the message's HTTP headers.

Japanese: 画像のマットチャネルを設定します
English:  Sets the image matte channel.

Japanese: さまざまなハッシュアルゴリズムを使用して任意の長さのメッセージに対する直接的あるいは段階的な処理を可能とします
English:  Allows direct or incremental processing of arbitrary length messages using a variety of hashing algorithms.

Japanese: 塗りつぶしや描画を行わずに現在のパスオブジェクトを終了します
English:  Ends current path object without performing filling and painting operations.

Japanese: これーザ所有BIOS 設定、カーネル構成、およびいくつかの簡素化を含んでいます
English:  This includes BIOS settings, kernel configuration and some simplifications in user land.

Japanese: このシグナルはリモートからセッションのチェックポイントを行うときにも利用できる。
English:  This signal can be used to perform a remote checkpoint of a session.

Japanese: Xlib はテキストの描画やテキストのディメンジョンの計算で必要な時だけフォントをロードし、フォントデータをキャッシュすることを選択できる。
English:  Xlib may choose to cache font data, loading it only as needed to draw text or compute text dimensions.

Table 3: Examples of parallel sentences
In the following experiments, we simulated a situation where an SMT system is applied to help volunteer translators translate English JF documents into Japanese. We want to use all parallel sentences efficiently to help translators; this is a problem of domain adaptation. All of the parallel sentences were translated from English to Japanese. Therefore, we conducted our MT experiments in the English-to-Japanese direction.
In the first experiment, we used all parallel sen-
tences (excluding development and test sentences)
as our training data, which contained approximately
500 thousand parallel sentences, as shown in Ta-
ble 2. The BLEU score obtained was 37.38.
In the second experiment, we used only JF par-
allel sentences (approximately 100 thousand sen-
tences). The BLEU score obtained was 40.02.
In the third experiment, we linearly interpolated the language models of the first and second experiments. We varied the weight of JF's language model over 0.1, 0.3, 0.5, 0.7, and 0.9. The BLEU scores were 38.40, 39.30, 38.92, 40.07, and 42.53, respectively. The BLEU score was highest when the weight was 0.9. The translation model used was that of the first experiment.
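Linear interpolation of the two language models can be illustrated with a toy unigram sketch. Real systems such as SRILM interpolate full n-gram models; `unigram_lm` and the token lists below are illustrative only.

```python
from collections import Counter

def unigram_lm(corpus_tokens):
    """Toy unigram LM: relative frequencies over a token list."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return lambda w: counts[w] / total if total else 0.0

def interpolate(p_general, p_in_domain, weight=0.9):
    """Linear interpolation of two LMs:
    p(w) = weight * p_in_domain(w) + (1 - weight) * p_general(w).
    The paper's best weight for the in-domain (JF) model was 0.9."""
    return lambda w: weight * p_in_domain(w) + (1 - weight) * p_general(w)
```

For example, with an in-domain model trained on `["a", "a"]` and a general model on `["a", "b"]`, a weight of 0.5 gives the word "a" the interpolated probability 0.5 × 1.0 + 0.5 × 0.5 = 0.75.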
In the fourth experiment, we log-linearly interpo-
lated translation models of the first and second ex-
periments. The weights were set with MERT. The
BLEU score was 41.26. The language model used
was that in the first experiment.
In the final experiment, we used the language
model with a weight of 0.9 in the third experiment
and the translation model in the fourth experiment.
The BLEU score was 44.36, which was the highest
in these experiments.
6 Discussion
The BLEU scores obtained were relatively high for
a Japanese-English corpus. For example, Utiyama
and Isahara (2007) reported a maximum BLEU
score of 25.89 for patent document SMT.
However, we have to be careful about our ex-
perimental results. First, our test sentences were
extracted from those having the highest alignment
scores, which might not be representative samples.
This was because sentences with low alignment
scores could be wrong alignments, which were not
suitable for measuring SMT performance. We plan
to sample test sentences from the whole corpus and
clean them for the purpose of evaluation in our fu-
ture work. Second, our Japanese word segmenter
segmented ASCII words into characters. (Japanese
words were segmented properly.) For example,
“word” was segmented into “w o r d” in the cor-
pus used in the above experiments. Consequently,
the BLEU scores obtained were optimistic, though
ASCII words occurred much less frequently than Japanese words.
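A post-hoc repair for this segmentation issue can be sketched as follows; this heuristic is our illustration, not the tooling the authors used:

```python
def rejoin_ascii(tokens):
    """Merge runs of single ASCII characters that a Japanese segmenter
    split apart, e.g. ['w', 'o', 'r', 'd'] -> ['word']. Heuristic: a
    genuine one-character ASCII token adjacent to another would also be
    merged, so this is a rough fix only."""
    out, run = [], []
    for t in tokens:
        if len(t) == 1 and t.isascii() and t.isalnum():
            run.append(t)
        else:
            if run:
                out.append("".join(run))
                run = []
            out.append(t)
    if run:
        out.append("".join(run))
    return out
```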
Because the BLEU scores were rather optimistic,
we manually examined test sentences. We found
that short sentences were generally translated well, whereas longer sentences were not.
Overall, we concluded that our parallel corpus is
useful for English-Japanese SMT developments. We
also hope that this corpus will be useful for support-
ing human translations of these manuals.
7 Conclusion
We have reported a project on developing a
Japanese-English parallel corpus made from soft-
ware manuals. It has approximately 500 thousand
sentence pairs. The corpus will be available at
http://www2.nict.go.jp/x/x161/members/mutiyama/
manual/index.html
References
Masao Utiyama and Hitoshi Isahara. 2007. A Japanese-
English Patent Parallel Corpus. In MT summit XI,
pages 475–482.
Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of the Machine Translation Summit X, pages 79–86.
Jörg Tiedemann and Lars Nygaard. 2004. The OPUS corpus - parallel and free. In LREC, pages 93–96.
William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics, 19(1):75–102.
Philipp Koehn, et al. 2007. Moses: Open source toolkit for statistical machine translation. In ACL Demo and Poster Sessions, pages 79–86.
Andreas Stolcke. 2002. SRILM - an extensible language modeling toolkit. In ICSLP, pages 901–904.
Kishore Papineni, et al. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL, pages 311–318.
Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In ACL, pages 160–167.
Franz Josef Och and Hermann Ney. 2003. A system-
atic comparison of various statistical alignment mod-
els. Computational Linguistics, 29(1):19–51.
