R News
The Newsletter of the R Project
Volume 5/1, May 2005

Editorial
by Douglas Bates
This edition of R News is the work of a new editorial
team and I would particularly like to recognize my
associate editors, Paul Murrell and Torsten Hothorn,
who stepped up and did a superb job of handling
most of the submissions whilst I was otherwise occupied.
This issue accompanies the release of R 2.1.0 and
we begin with three articles describing new facilities in this release. For many users the most notable
change with this release is the addition of internationalization and localization capabilities, described
in our feature article by Brian Ripley, who did most
of the work in adding these facilities to R. The next
article, also written by Brian Ripley, describes the
new package management structure in R 2.1.0, then
Paul Murrell updates us on changes in the standard
package grid.
Following these articles are three articles on contributed packages, beginning with Alessandra Brazzale’s description of the hoa bundle for statistical inference using higher-order asymptotics. By contributing the hoa bundle to CRAN, Alessandra has extended to five years the association of R with

winners of the Chambers Award for outstanding work
on statistical software by a student. She received the
2001 Chambers Award for her work on this software.
The 2002 Chambers Award winner, Simon Urbanek,
is now a member of the R Development Core Team.
The work of the 2003 winner, Daniel Adler for RGL,
the 2004 winner, Deepayan Sarkar for Lattice, and
the recently-announced 2005 winner, Markus Helbig
for JGR, are all intimately connected with R.
The next section of this newsletter is three articles
on tools for working with R. We follow these with an
article on experiences using R within a large pharmaceutical company. This article is part of a continuing series on the use of R in various settings. We
welcome contributions to this series. Finally, an article on magic squares and John Fox’s guest appearance in the Programmer’s Niche column, where he
discusses a short but elegant function for producing
textual representations of numbers, put the focus on
programming in R, which is where the focus should
be, shouldn’t it?
Douglas Bates
University of Wisconsin – Madison, U.S.A.
bates@R-project.org

Contents of this issue:

Editorial                                            1
Internationalization Features of R 2.1.0             2
Packages and their Management in R 2.1.0             8
Recent Changes in grid Graphics                     12
hoa: An R package bundle for higher order
  likelihood inference                              20
Fitting linear mixed models in R                    27
Using R for Statistical Seismology                  31
Literate programming for creating and
  maintaining packages                              35
CRAN Task Views                                     39
Using Control Structures with Sweave                40
The Value of R for Preclinical Statisticians        44
Recreational mathematics with R: introducing
  the “magic” package                               48
Programmer’s Niche                                  51
Book Review of Julian J. Faraway:
  Linear Models with R                              56
R Foundation News                                   57
Changes in R                                        57
Changes on CRAN                                     67
Events                                              71


Internationalization Features of R 2.1.0
by Brian D. Ripley
R 2.1.0 introduces a range of features to make it possible or easier to use R in languages other than English or American English: this process is known as
internationalization, often abbreviated to i18n (since
18 letters are omitted). The process of adapting to
a particular language, currency and so on is known
as localization (abbreviated to l10n), and R 2.1.0 ships
with several localizations and the ability to add others.
What is involved in supporting non-English languages? The basic elements are

1. The ability to represent words in the language. This needs support for the encoding used for the language (which might be OS-specific) as well as support for displaying the characters, both at the console and on graphical devices.

2. Manipulation of text in the language, by for example grep(). This may sound straightforward, but earlier versions of R worked with bytes and not characters: at least two bytes are needed to represent Chinese characters.

3. Translation of key components from English to the user’s own language. Currently R supports the translation of diagnostic messages and the menus and dialogs of the Windows and MacOS X GUIs.

There are other aspects to internationalization, for example the support of international standard paper sizes rather than just those used in North America, different ways to represent dates and times,1 monetary amounts and the representation of numbers.2

R has ‘for ever’ had some support for Western European languages, since several early members of the core team had native languages other than English. We have been discussing more comprehensive internationalization for a couple of years, and a group of users in Japan have been producing a modified version of R with support for Japanese. During those couple of years OS support for internationalization has become much more widespread and reliable, and the increasing prevalence of UTF-8 locales3 as standard in Linux distributions made greater internationalization support more pressing.

How successfully your R will support non-English languages depends on the level of operating system support, although as always the core team tries hard to make facilities available across all platforms. Windows NT took internationalization seriously from the mid 1990s, but took a different route (16-bit characters) from most later internationalization efforts. (We still have R users running the obsolete Windows 95/98/ME OSes, which had language-specific versions.) MacOS X is a hybrid of components which approach internationalization in different ways but the R maintainers for that platform have managed to smooth over many of its quirks. Recent Linux and other systems using glibc have extensive support. Commercial Unixes have varying amounts of support, although Solaris was a pioneer and often the model for Open Source implementations.

Figure 1: Part of the Windows console in an Italian
locale.

Figure 2: Part of the Windows console in Japanese.
Note the use of Japanese variable names, that the last
line is an error message in Japanese, as well as the
difference in width between English and Japanese
characters.
Hopefully, if your OS has enough support to allow you to work comfortably in your preferred language, it will have enough support to allow you to
use R in that language. Figures 1 and 2 show examples of internationalization for the Windows GUI:
note how the menus are translated, variable names
are accepted in the preferred language, and error

1 e.g. the use of 12-hour or 24-hour clock.
2 for example, the use of , for the decimal point, and the use of , or . or nothing as the thousands separator.
3 see below.


messages are translated too. The Japanese screenshot shows another difference: the Japanese characters are displayed at twice the width of the English
ones, and cursor movement has to take this into account.
The rest of this article expands on the concepts
and then some of the issues they raise for R users and
R programmers. Quite a lot of background is needed
to appreciate what users of very different languages
need from the internationalization of R.

Locales
A locale is a specification of the user-specific environment within which an operating system is running
an application program such as R. What exactly this
covers is OS-specific, but key components are
• The set of ‘legal’ characters the user might input: this includes which characters are alphabetic and which are numeric. This is the C locale category LC_CTYPE.
• How characters are sorted: the C locale category LC_COLLATE. Even where languages share
a character set they may sort words differently:
it comes as a surprise to English-speaking visitors to Denmark to find ‘Aa’ sorted at the end
of the alphabet.
• How to represent dates and times (C locale category LC_TIME). The meaning of times also depends on the timezone which is sometimes regarded as part of the locale (and sometimes
not).
• Monetary formatting (C locale category LC_MONETARY).

• Number formatting (C locale category LC_NUMERIC).

• The preferred language in which to communicate with the user (for some systems, C locale category LC_MESSAGES).
• How characters in that language are encoded.
How these are specified is (once again) OS-specific. The first four categories can be set by
Sys.setlocale(): the initial settings are taken
from the OS and can be queried by calling
Sys.getlocale() (see Figure 2). On Unix-like OSes
the settings are taken from environment variables
with the names given, defaulting to the value of
LC_ALL and then of LANG. If none of these are set, the
value is likely to be C which is usually implemented
as the settings appropriate in the USA. Other aspects
of the locale are reported by Sys.localeconv().
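As a minimal sketch (the values returned, and the locale name passed in the last call, are system-dependent and purely illustrative):

## Query the whole locale and a single category:
Sys.getlocale()
Sys.getlocale("LC_COLLATE")
## Other aspects (decimal point, currency symbol, ...) are reported by:
Sys.localeconv()
## Change one category for the current session (locale name is illustrative):
Sys.setlocale("LC_COLLATE", "de_DE")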

Other OSes have other ways of specifying the locale: for example both Windows and MacOS X have
listboxes in their user settings. This is becoming
more common: when I select a ‘Language’ at the login screen for a session in Fedora Core 3 Linux I am
actually selecting a locale, not just my preferred language.
R does not fully support LC_NUMERIC, and is unlikely ever to. Because the comma is used as the separator in argument lists, it would be impossible to
parse expressions if it were also allowed as the decimal point. We do allow the decimal point to be specified in scan(), read.table() and write.table().
Sys.setlocale() does allow LC_NUMERIC to be set,
with a warning and with inconsistent behaviour.
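So a comma decimal point is handled at the input/output level rather than via LC_NUMERIC; a minimal sketch (the file names are hypothetical):

## Read a file using ',' as the decimal point and ';' as the field
## separator, then write it back out in the same style:
x <- read.table("werte.csv", header = TRUE, sep = ";", dec = ",")
write.table(x, "werte-out.csv", sep = ";", dec = ",", row.names = FALSE)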

For the R user
The most important piece of advice is to specify your
locale when asking questions, since R will behave
differently in different locales. We have already seen
experienced R users insisting on R-help that R has
been broken when it seems they had changed locales.
To be sure, quote the output of Sys.getlocale() and
localeToCharset().

For the package writer
Try not to write language-specific code. A package which was submitted to CRAN with variables
named años worked for the author and the CRAN
testers, but not on my Solaris systems. Use only
ASCII characters4 for your variable names, and as far
as possible for your documentation.

Languages and character sets
What precisely do we mean by ‘the preferred language’? Once again, there are many aspects to consider.
• The language, such as ‘Chinese’ or ‘Spanish’.
There is an international standard, ISO 639, for two-letter abbreviations for languages, as well as some three-letter ones.5 These are pretty standard, except that the code for Hebrew has been changed.
• Is this Spanish as spoken in Spain or in Latin
America?
• ISO 639 specifies no as ‘Norwegian’, but Norway has two official languages, Bokmål and
Nynorsk. Which is this?6

4 Digits and upper- and lower-case A–Z, without accents.
5 http://www.w3.org/WAI/ER/IG/ert/iso639.htm or http://www.loc.gov/standards/iso639-2/englangn.html
6 Bokmål, usually, with nn for Nynorsk and nb for Bokmål specifically.


• The same language can be written in different
ways. The most prominent example is Chinese,
which has Simplified and Traditional orthography, and most readers can understand only one
of them.
• Spelling. The Unix spell program has the following comment in the BUGS section of its man
page:
‘British spelling was done by an
American.’
but that of itself is an example of cultural imperialism. The nations living on the British Isles
do not consider themselves to speak ‘British
English’ and do not agree on spelling: for example in general Chambers’ dictionary (published in Scotland) prefers ‘-ise’ and the Oxford
English Dictionary prefers ‘-ize’.
To attempt to make the description more precise, the two-letter language code is often supplemented by a two-letter ‘country or territory’ code from ISO 3166.7
So, for example
pt_BR is Portuguese as written in Brazil.
zh_CN is (simplified) Chinese as written in most
of mainland China, and zh_TW is (traditional)
Chinese as written in Taiwan (which is written in the same way as zh_HK used in Hong
Kong).
en_US is American.
en_GB is English as written somewhere (unspecified) in the UK.
We need to specify the language for at least three
purposes:
1. To delimit the set of characters which can be
used, and which of those are ‘legal’ in object
names.
2. To know the sorting order.


Encodings
Computers work with bits and bytes, and encoding
is the act of representing characters in the language
as bytes, and vice versa. The earliest encodings (e.g.
for Telex) just represented capital letters and numbers but this expanded to ASCII, which has 92 printable characters (upper- and lower-case letters, digits
and punctuation) plus control codes, represented as
bytes with the topmost bit as 0. This is also the ISO
646 international standard and those characters are
included in most8 other encodings.
Bytes allow 256 rather than 128 different characters, and ISO 646 has been extended into the bytes
with topmost bit as 1, in many different ways. There
is a series of standards ISO 8859-1 to ISO 8859-15
for such extensions, as well as many vendor-specific
ones (such as the ‘WinAnsi’ and other code pages).
Most languages have less than the 186 or more characters9 that an 8-bit encoding provides. The problem
is that given a sequence of bytes there is no way to
know that it is in an 8-bit encoding let alone which
one.
The CJK10 languages have tens of thousands of
characters. There have been many ways to represent such character sets, for example using shift sequences (like the shift, control, alt and altgr keys
on your keyboard) to shift between ‘pages’ of characters. Windows uses a two-byte encoding for the
ideographs, with the ‘lead byte’ with topmost bit 1
followed by a ‘trail byte’. So the character sets used
for CJK languages on Windows occupy one or two
bytes.
A problem with such schemes is their lack of extensibility. What should we do when a new character
comes along, notably the Euro? Most commonly, replace something which is thought to be uncommon
(the generic currency symbol in ISO 8859-1 was replaced by the Euro in ISO 8859-15, amongst other
changes).

3. To decide what language to respond in.
In addition, we might need to know the direction of
the language, but currently R only supports left-to-right processing of character strings.
The first two are part of the locale specification.
Specifying the language to be used for messages is
a little more complicated: a Nynorsk speaker might
prefer Nynorsk then Bokmål then generic Norwegian then English, which can be expressed by setting LANGUAGE=nn:nb:no (since generic ‘English’ is
the default).
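For example, such a preference can be set in the shell before starting R, or from within R; a sketch (Sys.setenv() is the function in current versions of R; R 2.1.0 itself used Sys.putenv()):

## Prefer Nynorsk, then Bokmål, then generic Norwegian:
Sys.setenv(LANGUAGE = "nn:nb:no")
## or, in a Unix-alike shell before starting R:
##   export LANGUAGE=nn:nb:no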

Unicode
Being able to represent one language may not be
enough: if for example we want to represent personal names we would like to be able to do so equally
for American, Chinese, Greek, Arabic and Maori11
speakers. So the idea emerged of a universal encoding, to represent all the characters we might like to

7 http://www.iso.org/iso/en/prods-services/iso3166ma/index.html.
8 In a common Japanese encoding, backslash is replaced by Yen. As this is the encoding used for Windows Japanese fonts, file paths look rather peculiar in a Japanese version of Windows.
9 the rest are normally allocated to ‘space’ and control bytes
10 Chinese, Japanese, Korean: known to Windows as ‘East Asian’. The CJK ideographs were also used for Vietnamese until the early
twentieth century.
11 language code mi. Maori is usually encoded in ISO 8859-13, along with Lithuanian and Latvian!


print—all human languages, mathematical and musical notation and so on.
This is the purpose of Unicode,12 which allows up to 2^32 different characters, although it is now agreed that only 2^21 will ever be prescribed. The human languages are represented in the ‘basic multilingual plane’ (BMP), the first 2^16 characters, and so most
characters you will encounter have a 4-hex-digit representation. Thus U+03B1 is alpha, U+05D0 is aleph
(the first letter of the Hebrew alphabet) and U+30C2
is the Katakana letter ‘di’.
If we all used Unicode, would everyone be
happy? Unfortunately not, as sometimes the same
character can be written in various ways. The
Unicode standard allows ‘only’ 20992 slots for CJK
ideographs, whereas the Taiwanese national standard defines 48711. The result is that different fonts
have different variants for e.g. U+8FCE and a variant
may be recognizable to only some of the users.
Nevertheless, Unicode is the best we have, and a
decade ago Windows NT adopted the BMP of Unicode as its native encoding, using 16-bit characters.
However, Unix-alike OSes expect nul (the zero byte)
to terminate a string such as a file path, and people looked for ways to represent Unicode characters
as sequences of a variable number of bytes. By far
the most successful such scheme is UTF-8, which has
over the last couple of years become the de facto standard for Linux distributions. So when I select
    English (UK)    British English

as my language at the login screen for a session in Fedora Core 3 Linux, I am also selecting UTF-8 as my
encoding.
In UTF-8, the 7-bit ASCII characters are represented as a single byte. All other characters are represented as two or more bytes all with topmost bit 1.
This introduces a number of implementation issues:
• Each character is represented by a variable
number of bytes, from one up to six (although
it is likely at most four will ever be used).
• Some bytes can never occur in UTF-8.
• Many byte sequences are not valid in UTF-8.
The major implementation issue in internationalizing R has been to support such multi-byte character
sets (MBCSs). There are other examples, including
EUC-JP used for Japanese on Unix-alikes and the
one- or two-byte encodings used for CJK on older
versions of Windows.
As UTF-8 locales can support multiple languages,
which are allowed for the ‘legal’ characters in R object names? This may depend on the OS, but in all
the examples we know of all characters which would
be alphanumeric in at least one human language are

allowed. So for example ñ and ü are allowed in locale en_US.utf8 on Linux, just as they were in the
Latin-1 locale en_US (whereas on Windows the different Latin-1 locales have different sets of ‘legal’ characters, indeed different in different versions of Windows for some languages). We have decided that
only the ASCII numbers will be regarded as numeric
in R.
Sorting orders in Unicode are handled by an OS
service: this is a very intricate subject so do not expect too much consistency between different implementations (which includes the same language in
different encodings).

Fonts
Being able to represent alpha, aleph and the
Katakana letter ‘di’ is one thing: being able to display them is another, and we want to be able to display them both at the console and in graphical output. Effectively all fonts available cover quite small
subsets of Unicode, although they may be usable in
combinations to extend the range.
R consoles normally work in monospaced fonts.
That means that all printable ASCII characters have
the same width. Once we move to Unicode, characters are normally categorized into three widths,
one (ASCII, Greek, Arabic, . . . ), two (most CJK
ideographs) and zero (the ‘combining characters’ of
languages such as Thai). Unfortunately there are two
conventions for CJK characters, with some fonts having e.g. Katakana single-width and some (including
that used in Figure 2) double-width.
Which characters are available in a font can seem
capricious: for example the one I originally used to
test translations had directional single quotes but not
directional double quotes.
If displaying another language is not working in
a UTF-8 locale the most likely explanation is that the
font used is incomplete, and you may need to install
extra OS components to overcome this.
Font support for R graphics devices is equally difficult and has currently only been achieved on Windows and on comprehensively equipped X servers.
In particular, postscript() and pdf() are restricted
to Western and Eastern European languages as the
commonly-available fonts are (and some are restricted to ISO 8859-1).

For the R user
Now that R is able to accept input in different encodings, we need a way to specify the encoding. Connections allow you to specify the encoding via the
encoding argument, and the encoding for stdin can
be set when reading from a file by the command-line argument --encoding.

12 There is a more formal definition in the ISO 10646 standard, but for our purposes Unicode is a more precisely constrained version of
the same encoding. There are various versions of Unicode, currently 4.0.1—see http://www.unicode.org.


Also, source() has an
encoding argument.
It is important to make sure the encoding is correct: in particular a UTF-8 file will be valid input in
most locales, but will only be interpreted properly
in a UTF-8 locale or if the encoding is specified. So
to read a data file produced on Windows containing
the names of Swiss students in a UTF-8 locale you
will need one of
A <- read.table(file("students",
                     encoding="latin1"))
A <- read.table(file("students",
                     encoding="UCS-2LE"))
the second if this is a ‘Unicode’ Windows file.13
In a UTF-8 locale, characters can be entered as
e.g. \u8fce or \u{8fce} and non-printable characters will be displayed by print() in the first of these
forms. Further, \U can be used with up to 8 hex digits
for characters not in the BMP.
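A brief sketch of these escapes in a UTF-8 locale (the characters are arbitrary examples; how they display depends on the fonts installed):

x <- "\u8fce"        # U+8FCE entered with \u
y <- "\u{8fce}"      # the equivalent braced form
identical(x, y)      # TRUE
z <- "\U0001D49C"    # \U with more hex digits, for a character outside the BMP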

For the R programmer
There is a common assumption that one byte = one
character = one display column. As from R 2.1.0
a character can use more than one byte and extend
over more than one column when printed. The function nchar() now has a type argument allowing
the number of bytes, characters or columns to be
found—note that this defaults to bytes for backwards
compatibility.
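A small example of the distinction, assuming a UTF-8 locale (the counts in the comments are what would be expected there):

x <- "\u8fce"               # a single CJK ideograph
nchar(x, type = "bytes")    # 3: its UTF-8 encoding occupies three bytes
nchar(x, type = "chars")    # 1: it is one character
nchar(x, type = "width")    # 2: it is displayed in two columns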

Switching between encodings
R is able to support input in different encodings by
using on-the-fly translation between encodings. This
should be an OS service, and for modern glibc-based systems it is. Even with such systems it can
be frustratingly difficult to find out what encodings they support, and although most systems accept
aliases such as ISO_8859-1, ISO8859-1, ISO_8859_1,
8859_1 and LATIN-1, it is quite possible to find that
with three systems there is no name that they all accept. One solution is to install GNU libiconv: we
have done so for the Windows port of R, and Apple
have done so for MacOS X.
We have provided R-level support for changing
encoding via the function iconv(), and iconvlist()
will (if possible) list the supported encodings.
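A minimal sketch (the string and encoding names are only illustrative, and the set of names accepted differs between systems):

x <- "se\xf1or"                          # "señor" stored as Latin-1 bytes
iconv(x, from = "latin1", to = "UTF-8")
head(iconvlist(), 10)                    # a few of the supported encoding names, if available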

Quote symbols
ASCII has a quotation mark ", apostrophe ’ and
grave accent ‘, although some fonts14 represent
the latter two as right and left quotes respectively.


Unicode provides a wide variety of quote symbols, including left and right single (U+2018, U+2019)
and double (U+201C, U+201D) quotation marks. R
makes use of these for sQuote() and dQuote() in
UTF-8 locales.
Other languages use other conventions for quotations, for example low and mirrored versions of the
right quotation mark (U+201A, U+201B), as well as
guillemets (U+00AB, U+00BB; something like « »).
Looking at translations of error messages in other
GNU projects shows little consistency between native writers of the same language, so we have made
no attempt to define language-specific versions of
sQuote() and dQuote().

Translations
R comes with a lot of documentation in English.
There is a Spanish translation of An Introduction to
R available from CRAN, and an Italian introduction
based on it as well as links to Japanese translations
of An Introduction to R, other manuals and some help
pages.
R 2.1.0 facilitates the translation of messages from
‘English’ into other languages: it will ship with
a complete translation into Japanese and of many
of the messages into Brazilian Portuguese, Chinese,
German and Italian. Updates of these translations
and others can be added at a later date by translation
packages.
Which these ‘messages’ are was determined by
those (mainly me) who marked up the code for possible translation. We aimed to make all but the most
esoteric warning and error messages available for
translation, together with informational messages
such as the welcome message and the menus and dialogs of the Windows and MacOS X consoles. There
are about 4200 unique messages in command-line R
plus a few hundred in the GUIs.
One beneficial side-effect even for English readers has been much improvement in the consistency
of the messages, as well as some re-writing where
translators found the messages unclear.

For the R user
All the end user needs to do is to ensure that the desired translations are installed and that the language
is set correctly. To test the latter, try it and see: very
likely the internal code will be able to figure out the
correct language from the locale name, and the welcome message (and menus if appropriate) will appear in the correct language. If not, try setting the
environment variable LANGUAGE and if that does not

13 This will probably not work unless the first two bytes are removed from the file. Windows is inclined to write a ‘Unicode’ file starting
with a Byte Order Mark 0xFEFF, whereas most Unix-alikes do not recognize a BOM. It would be technically easy to do so but apparently
politically impossible.
14 most likely including the one used to display this document.


help, read the appropriate section of the R Installation
and Administration manual.
Unix-alike versions of R can be built without support for translation, and it is possible that some OSes
will provide too impoverished an environment to determine the language from the locale or even for the
translation support to be compiled in.
There is a pseudo-language en@quot for which
translations are supplied. This is intended for use
in UTF-8 locales, and makes use of the Unicode directional single and double quotation marks. This
should be selected via the LANGUAGE environment
variable.

For package writers
Any package can have messages marked for translation: see Writing R Extensions. Traditionally messages have been paste-d together, and such messages can be awkward or impossible to translate into
languages with other word orders. We have added
support for specifying messages via C-like format
strings, using the R function gettextf. A typical usage is
stop(gettextf(
     "autoloader did not find '%s' in '%s'",
     name, package),
     domain = NA)
I chose this example as the Chinese translation reverses15 the order of the two variables. Using
gettextf() marks the format (the first argument)
for translation and then passes the arguments to
sprintf() for C-like formatting. The argument
domain = NA is passed to stop() as the message returned by gettextf() will already be translated (if
possible) and so needs no further translation. As the
quotation marks are included in the format, translators can use other conventions (and the pseudolanguage en@quot will).
Plurals are another source of difficulties for translators. Some languages have no plural forms, others
have ‘singular’ and ‘plural’ as in English, and others have up to four forms. Even for those languages
with just ‘singular’ and ‘plural’ there is the question
of whether zero is singular or plural, which differs
by language. There is quite general support for plurals in the R and C functions ngettext and a small
number16 of examples in the R sources.
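A sketch of how an English singular/plural pair is selected with ngettext() (the helper and message text are invented for illustration):

removed <- function(n)
    sprintf(ngettext(n, "%d file was removed", "%d files were removed"), n)
removed(1)    # "1 file was removed"
removed(3)    # "3 files were removed"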
See Writing R Extensions for more details.


For would-be translators
Additional translations and help completing the current ones would be most welcome. Experience has
shown that it is helpful to have a small translation
team to cross-check each other’s work, discuss idiomatic usage and to share the load. Translation will
be an ongoing process as R continues to grow.
Please see the documents linked from
http://developer.r-project.org.

Future directions
At this point we need to gain more experience
with internationalization infrastructure. We know it
works with the main R platforms (recent Linux, Windows, MacOS X) and Solaris and for a small set of
quite diverse languages.
We currently have no way to allow Windows
users to use many languages in one session, as Windows’ implementation of Unicode is not UTF-8 and
the standard C interface on Windows does not allow
UTF-8 as an encoding. We plan on making a ‘Unicode’ (in the Windows sense) implementation of R
2.2.0 that uses UTF-8 internally and 16-bit characters
to interface with Windows. It is likely that such a version will only be available for NT-based versions of
Windows.
It is hoped to support a wider range of encodings
on the postscript() and pdf() devices.
The use of encodings in documentation remains
problematic, including in this article. Texinfo is
used for the R manuals but does not currently support even ISO 8859-1 correctly. We have started with
some modest support for encodings in .Rd files, and
may be able to do more.

Acknowledgements
The work of the Japanese group (especially Ei-ji
Nakama and Masafumi Okada) supplied valuable
early experience in internationalization.
The translations have been made available by
translation teams and in particular some enthusiastic
individuals. The infrastructure for translation is provided by GNU gettext, suitably bent to our needs
(for R is a large and complex project, and in particular extensible).
Brian D. Ripley
University of Oxford, UK
ripley@stats.ox.ac.uk

15 using '%2$s' and '%1$s' in the translation to refer to the second and first argument respectively.
16 currently 22, about 0.5% of the messages.


Packages and their Management in R 2.1.0
by Brian D. Ripley
R 2.1.0 introduces changes that make packages and
their management considerably more flexible and
less tied to CRAN.

What is a package?
The idea of an R package was based quite closely on
that of an S library section: a collection of R functions with associated documentation and perhaps
compiled code that is loaded from a library1 by the
library() command. The Help Desk article (Ligges,
2003) provides more background.
Such packages still exist, but as from R 2.1.0
there are other types of R extensions, distinguished
by the Type: field in the ‘DESCRIPTION’ file: the
classic package has Type: Package or no Type:
field. The Type: field tells R CMD INSTALL and
install.packages() what to do with the extension,
which is still distributed as a tarball or zip file containing a directory of the same name as the package.
This allows new types of extensions to be added
in the future, but R 2.1.0 knows what to do with two
other types.

Translation packages
with Type: Translation. These are used to add or
update translations of the diagnostic messages produced by R and the standard packages. The convention is that languages are known by their ISO
639 two-letter (or less commonly three-letter) codes,
possibly with a ‘country or territory’ modifier. So
package Translation-sl should contain translations
into Slovenian, and Translation-zhTW translations
into traditional Chinese as used in Taiwan. Writing
R Extensions describes how to prepare such a package: it contains compiled translations and so all R
CMD INSTALL and install.packages() need to do is
to create the appropriate subdirectories and copy the
compiled translations into place.

Frontend packages
with Type: Frontend. These are only supported as
source packages on Unix, and are used to install alternative front-ends such as the GNOME console,
previously part of the R distribution and now in
CRAN package gnomeGUI. For this type of package
the installation runs configure and then make and so
allows maximal flexibility for the package writer.
Source vs binary
Initially R packages were distributed in source
form as tarballs.
Then the Windows binary
packages came along (distributed as Zip files)
and later Classic MacOS and MacOS X packages.
As from R 2.1.0 the package management functions all have a type argument defaulting to the
value of getOption("pkgType"), with known values
"source", "win.binary" and "mac.binary". These
distributions have different file extensions: .tar.gz,
.zip and .tgz respectively.
This allows packages of each type to be manipulated on any platform: for example, whereas one
cannot install Windows binary packages on Linux,
one can download them on Linux or Solaris, or find
out if a MacOS X binary distribution is available.
Note that install.packages on MacOS X can
now install either "source" or "mac.binary" package files, the latter being the default. Similarly,
install.packages on Windows can now install either "source" or "win.binary" package files. The
only difference between the platforms is the default
for getOption("pkgType").
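For instance, a repository can be queried for a binary type other than the native one; a sketch (the mirror URL and package name are only examples):

## From any platform, list the Windows binary packages a mirror offers:
win <- available.packages(contriburl =
           contrib.url("http://cran.r-project.org", type = "win.binary"))
"abind" %in% rownames(win)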

Repositories
The package management functions such
as install.packages,
update.packages and
packageStatus need to know where to look for packages. Function install.packages was originally
written to access a CRAN mirror, but other repositories of packages have been created, notably that of
Bioconductor. Function packageStatus was the first
design to allow multiple repositories.
We encourage people with collections of packages to set up a repository. Apart from the CRAN
mirrors and Bioconductor there is an Omegahat
repository, and I have a repository of hard-to-compile Windows packages to supplement those
made automatically. We anticipate university departments setting up repositories of support software for their courses. A repository is just a tree of
files based at a URL: details of the format are in Writing R Extensions. It can include one, some or all of
"source", "win.binary" and "mac.binary" types in
separate subtrees.

Specifying repositories
Specifying repositories has become quite complicated: the (sometimes conflicting) aims were to be
intuitive to new users, very flexible and as far as possible backwards compatible.

1 a directory containing subdirectories of installed packages


As before, the place(s) to look for files are specified by the contriburl argument. This is a character vector of URLs, and can include file:// URLs
if packages are stored locally, for example on a distribution CD-ROM. Different URLs in the vector can
be of different types, so one can mix a distribution
CD-ROM with a CRAN mirror.
However, as before, most people will not specify contriburl directly but rather the repos argument that replaces the CRAN argument.2 The function
contrib.url is called to convert the base URLs in
repos into URLs pointing to the correct subtrees of
the repositories. For example
> contrib.url("http://base.url", type = "source")
[1] "http://base.url/src/contrib"
> contrib.url("http://base.url", type = "win.binary")
[1] "http://base.url/bin/windows/contrib/2.1"
> contrib.url("http://base.url", type = "mac.binary")
[1] "http://base.url/bin/macosx/2.1"

This is of course hidden from the average user.

Figure 1: The listboxes from setRepositories() on Windows (left) and Unix (right).

The repos argument of the package management functions defaults to getOption("repos"). This has a platform-specific default, and can be set by calling setRepositories. On most systems this uses a listbox widget, as shown in figure 1. Where no GUI is available, a text list is used such as

> setRepositories()
--- Please select repositories for use in this session ---
1: + CRAN
2: + CRAN (extras)
3:   Bioconductor
4:   Omegahat
Enter one or more numbers separated by spaces
1: 1 4

The list of repositories and which are selected by default can be customized for a user or for a site: see ?setRepositories. We envisaged people distributing a CD-ROM containing R and a repository, with setRepositories() customized to include the local repository by default.

Suppose a package exists on more than one repository?

Since multiple repositories are allowed, how is this resolved? The rule is to select the latest version of the package, from the first repository on the list that is offering it. So if a user had

> getOption("repos")
[1] "file://d:/R/local"
[2] "http://cran.us.r-project.org"

packages would be fetched from the local repository if they were current there, and from a US CRAN mirror if there is an update available there.

Figure 2: Selecting a CRAN mirror on Windows.

No default CRAN

Windows users are already used to selecting a CRAN mirror. Now ‘factory-fresh’ R does not come with a default CRAN mirror. On Unix the default is in fact

> getOption("repos")
    CRAN
"@CRAN@"

2 This still exists but has been deprecated and will be removed in R 2.2.0.


Whenever @CRAN@ is encountered in a repository
specification, the function chooseCRANmirror is
called to ask for a mirror (and can also be called directly and from a menu item on the Windows GUI).
The Windows version is shown in Figure 2: there are
MacOS X and Tcl/Tk versions, as well as a text-based
version using menu.3
Experienced users can avoid being asked for a
mirror by setting options("repos") in their ‘.Rprofile’ files: examples might be
options(repos=
  c(CRAN="http://cran.xx.r-project.org"))

on Unix-alikes and

options(repos=
  c(CRAN="http://cran.xx.r-project.org",
    CRANextra="http://www.stats.ox.ac.uk/pub/RWin"))
on Windows.
Although not essential, it is helpful to have a
named vector of repositories as in this example: for
example setRepositories looks at the names when
setting its initial selection.

Installing packages
Users of the Windows GUI will be used to selecting
packages to be installed from a scrollable list box.
This now works on all platforms from the command
line if install.packages is called with no pkgs (first)
argument. Packages from all the selected repositories are listed in alphabetical order, and the packages
from bundles are listed4 individually in the form
MASS (VR).
The dependencies argument introduced in
R 2.0.0 can be used to install a package and its dependencies (and their dependencies . . . ), and R will
arrange to install the packages in a suitable order.
The dependencies argument can be used to select
only those dependencies that are essential, or to include those which are suggested (and might only be
needed to run some of the examples).
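A sketch (the package name is merely illustrative):

## Install a package together with the packages it depends on;
## dependencies = TRUE also pulls in suggested packages.
install.packages("hoa", dependencies = TRUE)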

Installing all packages
System administrators often want to install ‘all’ packages, or at least as many as can be installed on their
system: a standard Linux system lacks the additional
software needed to install about 20 and a handful can
only be installed on Windows or on Linux.
The new function new.packages() is a helpful
way to find out which packages are not installed of
those offered by the repositories selected. If called as
new.packages(ask = "graphics") a list box is used
to allow the user to select packages to be installed.

Figure 3: Selecting amongst uninstalled packages on
Linux.

Updating packages
The only real change in update.packages is the ask
argument. This defaults to ask = TRUE where, as before, the user is asked about each package (but with a
little more information and the option to quit). Other
options are ask = FALSE to install all updates, and
ask = "graphics" to use a listbox (similar to figure 3) to de-select packages (as by default all the updates are selected).
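For example (assuming the repositories to use have already been selected):

update.packages(ask = FALSE)        # install every available update without prompting
update.packages(ask = "graphics")   # de-select unwanted updates from a list box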

Looking at repositories
The function CRAN.packages was (as its name implies) designed to look at a single CRAN mirror. It
has been superseded by available.packages which
can look at one or more repositories and by default
looks at those specified by getOption("repos")
for package type (source/binary) specified by
getOption("pkgType").
This returns a matrix
with rather more columns than before, including
"Repository", the dependencies and the contents of
bundles.
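A brief sketch of examining the result (the package picked out is only an example):

ap <- available.packages()   # uses getOption("repos") and getOption("pkgType")
colnames(ap)                 # includes "Version", "Depends", "Repository", ...
ap["abind", "Repository"]    # which repository is offering this package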
Function packageStatus is still available but has
been re-implemented on top of functions such as
available.packages. Its summary method allows
a quick comparison of the installed packages with
those offered by the selected repositories. However,
its print method is concise: see figure 4.

Bibliography
U. Ligges. R help desk: Package management. R News, 3(3):37–39, December 2003. URL http://CRAN.R-project.org/doc/Rnews/.
Brian D. Ripley
University of Oxford, UK
ripley@stats.ox.ac.uk

3 menu has been much enhanced for R 2.1.0.
4 provided the repository has been set up to supply the information.


> (ps <- packageStatus())
--- Please select a CRAN mirror for use in this session ---
Number of installed packages:
                    ok upgrade unavailable
c:/R/library        34       1           0
c:/R/rw2010/library 12       0          13

Number of available packages (each package/bundle counted only once):
                                                           installed not installed
http://www.sourcekeg.co.uk/cran/bin/windows/contrib/2.1           31           421
http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/2.1         4             8

> summary(ps)
Installed packages:
-------------------
*** Library c:/R/library
$ok
 [1] "abind"    "acepack"  "akima"    "ash"      "car"
...
$upgrade
[1] "XML"

$unavailable
NULL

*** Library c:/R/rw2010/library
$ok
 [1] "base"      "datasets"  "graphics"  "grDevices" "grid"      "methods"   "splines"   "stats"     "stats4"
[10] "tcltk"     "tools"     "utils"

$upgrade
NULL

$unavailable
[1] "boot"       "cluster"    "foreign"    "KernSmooth" "lattice"
...

Available packages:
-------------------
(each package appears only once)
*** Repository http://www.sourcekeg.co.uk/cran/bin/windows/contrib/2.1
$installed
 [1] "abind"    "acepack"  "akima"    "ash"      "car"
...
$"not installed"
  [1] "accuracy" "adapt"    "ade4"
...
[421] "zicounts"

*** Repository http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/2.1
$installed
[1] "GLMMGibbs" "xgobi"     "XML"       "yags"

$"not installed"
[1] "gsl"         "hdf5"        "ncdf"        "RDCOMClient" "rgdal"       "RNetCDF"     "survnnet"    "udunits"

> upgrade(ps)
XML :
 0.97-0 at c:/R/library
 0.97-3 at http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/2.1
Update (y/N)?  y

Figure 4: packageStatus on a Windows machine. The voluminous summary has been edited to save space.


Recent Changes in grid Graphics
by Paul Murrell

Introduction
The grid graphics package provides an alternative
graphics system to the “traditional” S graphics system that is provided by the graphics package.
The majority of high-level plotting functions (functions that produce whole plots) that are currently
available in the base packages and in add-on packages are built on the graphics package, but the lattice
package (Sarkar, 2002), which provides high-level
Trellis plots, is built on grid.
The grid graphics package can be useful for customising lattice plots, creating complex arrangements of several lattice plots, or for producing
graphical scenes from scratch.
The basic features of grid were described in an R
News article in June 2002 (Murrell, 2002). This article provides an update on the main features that have
been added to grid since then. This article assumes
a familiarity with the basic grid ideas of units and
viewports.

Changes to grid
The most important organisational change is that
grid is now a “base” package, so it is installed as part
of the standard R installation. In other words, package writers can assume that grid is available.
The two main areas where the largest changes have
been made to grid are viewports and graphical objects. The changes to viewports are described in the
next section and summarised in Table 1 at the end of
the article. A separate section describes the changes
to graphical objects with a summary in Table 2 at the
end of the article.

Changes to viewports
grid provides great flexibility for creating regions on
the page to draw in (viewports). This is good for being able to locate and size graphical output on the
page in very sophisticated ways (e.g., lattice plots),
but it is bad because it creates a complex environment when it comes to adding further output (e.g.,
annotating a lattice plot).

The first change to viewports is that they are now
much more persistent; it is possible to have any number of viewports defined at the same time.
There are also several new functions that provide a
consistent and straightforward mechanism for navigating to a particular viewport when several viewports are currently defined.
A viewport is just a rectangular region that provides
a geometric and graphical context for drawing. The
viewport provides several coordinate systems for locating and sizing output, and it can have graphical
parameters associated with it to affect the appearance of any output produced within the viewport.
The following code describes a viewport that occupies the right half of the graphics device and within
which, unless otherwise specified, all output will be
drawn using thick green lines. A new feature is that a
viewport can be given a name. In this case the viewport is called "rightvp". The name will be important
later when we want to navigate back to this viewport.
> vp1 <- viewport(x=0.5, width=0.5,
                  just="left",
                  gp=gpar(col="green", lwd=3),
                  name="rightvp")
The above code only creates a description of a viewport; a corresponding region is created on a graphics device by pushing the viewport on the device, as
shown below.1
> pushViewport(vp1)
Now, all drawing occurs within the context defined
by this viewport until we change to another viewport. For example, the following code draws some
basic shapes, all of which appear in the right half of
the device, with thick green lines (see the right half
of Figure 1). Another new feature is that a name
can be associated with a piece of graphical output.
In this case, the rectangle is called "rect1", the line
is called "line1", and the set of circles are called
"circles1". These names will be used later in the
section on “Changes to graphical objects”.
> grid.rect(name="rect1")
> r <- seq(0, 2*pi, length=5)[-5]
> x <- 0.5 + 0.4*cos(r + pi/4)
> y <- 0.5 + 0.4*sin(r + pi/4)
> grid.circle(x, y, r=0.05,
              name="circles1")
> grid.lines(x[c(2, 1, 3, 4)],
             y[c(2, 1, 3, 4)],
             name="line1")

There are two ways to change the viewport. You can pop the viewport, in which case the region is permanently removed from the device, or you can navigate up from the viewport and leave the region on the device. The following code demonstrates the second option, using the upViewport() function to revert to using the entire graphics device for output, but leaving a viewport called "rightvp" still defined on the device.

> upViewport()

Next, a second viewport is defined to occupy the left half of the device, this viewport is pushed, and some output is drawn within it. This viewport is called "leftvp" and the graphical output is associated with the names "rect2", "lines2", and "circles2". The output from the code examples so far is shown in Figure 1.

> vp2 <- viewport(x=0, width=0.5,
                  just="left",
                  gp=gpar(col="blue", lwd=3),
                  name="leftvp")
> pushViewport(vp2)
> grid.rect(name="rect2")
> grid.circle(x, y, r=0.05,
              name="circles2")
> grid.lines(x[c(3, 2, 4, 1)],
             y[c(3, 2, 4, 1)],
             name="line2")

Figure 1: Some simple shapes drawn in some simple viewports.

There are now three viewports defined on the device; the function current.vpTree() shows all viewports that are currently defined.2

> current.vpTree()

viewport[ROOT]->(
  viewport[leftvp],
  viewport[rightvp])

There is always a ROOT viewport representing the entire device and the two viewports we have created are both direct children of the ROOT viewport. We now have a tree structure of viewports that we can navigate around. As before, we can navigate up from the viewport we just pushed and, in addition, we can navigate down to the previous viewport. The function downViewport() performs the navigation to a viewport lower down the viewport tree. It requires the name of the viewport to navigate to. In this case, we navigate down to the viewport called "rightvp".

> upViewport()
> downViewport("rightvp")

It is now possible to add further output within the context of the viewport called "rightvp". The following code draws some more circles, this time explicitly grey (but still with thick lines; see Figure 2). The name "circles3" is associated with these extra circles.

> x2 <- 0.5 + 0.4*cos(c(r, r+pi/8, r-pi/8))
> y2 <- 0.5 + 0.4*sin(c(r, r+pi/8, r-pi/8))
> grid.circle(x2, y2, r=0.05,
              gp=gpar(col="grey"),
              name="circles3")
> upViewport()

Figure 2: Navigating between viewports and annotating.

What this simple demonstration shows is that it is possible to define a number of viewports during drawing and leave them in place for others to add further output. A more sophisticated example is now presented using a lattice plot.3

1 The function used to be called push.viewport().
2 The output of current.vpTree() has been reformatted to make the structure more obvious; the natural output is all just on a single line.
The following code creates a simple lattice plot. The "trellis" object created by the histogram() function is stored in a variable, hist1, so that we can use it again later; printing hist1 produces the plot shown in Figure 3.4

> hist1 <- histogram(
      par.settings=list(fontsize=list(text=8)),
      rnorm(500), type = "density",
      panel=function(x, ...) {
          panel.histogram(x, ...)
          panel.densityplot(x,
                            col="brown",
                            plot.points=FALSE)
      })
> trellis.par.set(canonical.theme("pdf"))
> print(hist1)

The lattice package uses grid to produce output and
it defines lots of viewports to draw in. In this case,
there are six viewports created, as shown below.

> current.vpTree()

viewport[ROOT]->(
  viewport[plot1.toplevel.vp]->(
    viewport[plot1.],
    viewport[plot1.panel.1.1.off.vp],
    viewport[plot1.panel.1.1.vp],
    viewport[plot1.strip.1.1.off.vp],
    viewport[plot1.xlab.vp],
    viewport[plot1.ylab.vp]))

Figure 3: A simple lattice plot.
What we are going to do is add output to the lattice
plot by navigating to a couple of the viewports that
lattice set up and draw a border around the viewport
to show where it is. We will use the frame() function
defined below to draw a thick grey rectangle around
the viewport, then a filled grey rectangle at the top-left corner, and the name of the viewport in white
within that.
> frame <- function(name) {
      grid.rect(gp=gpar(lwd=3, col="grey"))
      grid.rect(x=0, y=1,
                height=unit(1, "lines"),
                width=1.2*stringWidth(name),
                just=c("left", "top"),
                gp=gpar(col=NA, fill="grey"))
      grid.text(name,
                x=unit(2, "mm"),
                y=unit(1, "npc") - unit(0.5, "lines"),
                just="left",
                gp=gpar(col="white"))
  }
The viewports used to draw the lattice plot consist of a single main viewport called (in this case)
"plot1.toplevel.vp", and a number of other viewports within that for drawing the various components of the plot. This means that the viewport
tree has three levels (including the top-most ROOT
viewport). With a multipanel lattice plot, there can
be many more viewports created, but the naming
scheme for the viewports uses a simple pattern and

3 This low-level grid mechanism is general and available for any graphics output produced using grid including lattice output. An
additional higher-level interface has also been provided specifically for lattice plots (e.g., the trellis.focus() function).
4 The call to trellis.par.set() sets lattice graphical parameters so that the colours used in drawing the histogram are those used on
a PDF device. This explicit control of the lattice “theme” makes it easier to exactly reproduce the histogram output in a later example.


there are lattice functions provided to generate the
appropriate names (e.g., trellis.vpname()).
When there are more than two levels, there are two ways to specify a particular low-level viewport. By default, downViewport() will search within the children of a viewport, so a single viewport name will work as before. For example, the following code annotates the lattice plot by navigating to the viewport "plot1.panel.1.1.off.vp" and drawing a frame to show where it is. This example also demonstrates the fact that downViewport() returns a value indicating how many viewports were descended. This "depth" can be passed to upViewport() to ensure that the correct number of viewports are ascended after the annotation.

> depth <- downViewport("plot1.panel.1.1.off.vp")
> frame("plot1.panel.1.1.off.vp")
> upViewport(depth)

Just using the final destination viewport name can be ambiguous if more than one viewport in the viewport tree has the same name, so it is also possible to specify a viewport path. A viewport path is a list of viewport names that must be matched in order from parent to child. For example, this next code uses an explicit viewport path to frame the viewport "plot1.xlab.vp" that is a child of the viewport "plot1.toplevel.vp".

> depth <- downViewport(vpPath("plot1.toplevel.vp",
                               "plot1.xlab.vp"))
> frame("plot1.xlab.vp")
> upViewport(depth)

It is also possible to use a viewport name or viewport path in the vp argument to drawing functions, in which case the output will occur in the named viewport. In this case, the viewport path is strict, which means that the full path must be matched starting from the context in which the grob was drawn. The following code adds a dashed white line to the borders of the frames. This example also demonstrates that viewport paths can be specified as explicit strings, with a "::" path separator.

> grid.rect(
      gp=gpar(col="white", lty="dashed"),
      vp="plot1.toplevel.vp::plot1.panel.1.1.off.vp")
> grid.rect(
      gp=gpar(col="white", lty="dashed"),
      vp="plot1.toplevel.vp::plot1.xlab.vp")

The final output after all of these annotations is shown in Figure 4.

Figure 4: Annotating the simple lattice plot.

Changes to graphical objects

All grid drawing functions create graphical objects representing the graphical output. This is good because it makes it possible to manipulate these objects to modify the current scene, but there needs to be a coherent mechanism for working with these objects. There are several new functions, and some existing functions have been made more flexible, to provide a consistent and straightforward mechanism for accessing, modifying, and removing graphical objects.

All grid functions that produce graphical output also create a graphical object, or grob, that represents the output, and there are functions to access, modify, and remove these graphical objects. This can be used as an alternative to modifying and rerunning the R code to modify a plot. In some situations, it will be the only way to modify a low-level feature of a plot.

Consider the output from the simple viewport example (shown in Figure 2). There are seven pieces of output, each with a corresponding grob: two rectangles, "rect1" and "rect2"; two lines, "line1" and "line2"; and three sets of circles, "circles1", "circles2", and "circles3". A particular piece of output can be modified by modifying the corresponding grob; a particular grob is identified by its name.

The names of all (top-level) grobs in the current scene can be obtained using the getNames() function.5
5 Only introduced in R version 2.1.0; a less efficient equivalent for version 2.0.0 is to use grid.get().


> getNames()

[1] "rect1"    "circles1" "line1"    "rect2"
[5] "circles2" "line2"    "circles3"
The following code uses the grid.get() function to
obtain a copy of all of the grobs. The first argument
specifies the name(s) of the grob(s) that should be
selected; the grep argument indicates whether the
name should be interpreted as a regular expression;
the global argument indicates whether to select just
the first match or all possible matches.
> grid.get(".*", grep=TRUE, global=TRUE)
(rect[rect1], circle[circles1], lines[line1],
rect[rect2], circle[circles2], lines[line2],
circle[circles3])
The value returned is a gList; a list of one or more
grobs. This is a useful way to obtain copies of one or
two objects representing some portion of a scene, but
it does not return any information about the context
in which the grobs were drawn, so, for example, just
drawing the gList is unlikely to reproduce the original output (for that, see the grid.grab() function
below).
The following code makes use of the grid.edit()
function to change the colour of the grey circles to
black (see Figure 5). In general, most arguments provided in the creation of the output are available for
editing (see the documentation for individual functions). It should be possible to modify the gp argument for all grobs.
> grid.edit("circles3", gp=gpar(col="black"))

Figure 5: Editing grid output.

The following code uses grid.remove() to delete all of the grobs whose names end with a "2" – all of the blue output (see Figure 6).

> grid.remove(".+2$", grep=TRUE, global=TRUE)

Figure 6: Deleting grid output.
These manipulations are possible with any grid output. For example, in addition to lots of viewports, a
lattice plot creates a large number of grobs and all of
these can be accessed, modified, and deleted using
the grid functions (an example is given at the end of
the article).

Hierarchical graphical objects
Basic grobs simply consist of a description of something to draw. As shown above, it is possible to get
a copy of a grob, modify the description in a grob,
and/or remove a grob.
It is also possible to create and work with more complicated graphical objects that have a tree structure,
where one graphical object is the parent of one or
more others. Such a graphical object is called a
gTree. The functions described so far all work with
gTrees as well, but there are some additional functions just for creating and working with gTrees.
In simple usage, a gTree just groups several grobs
together. The following code uses the grid.grab()
function to create a gTree from the output in Figure
6.
> nzgrab <- grid.grab(name="nz")
If we draw this gTree on a new page, the output is
exactly the same as Figure 6, but there is now only
one graphical object in the scene: the gTree called
"nz".
> grid.newpage()
> grid.draw(nzgrab)
> getNames()
[1] "nz"
This gTree has four children; the four original grobs that were "grabbed". The function childNames() prints out the names of all of the child grobs of a gTree.

> childNames(grid.get("nz"))
[1] "rect1"    "circles1" "line1"
[4] "circles3"
A gTree contains viewports used to draw its child
grobs as well as the grobs themselves, so the original viewports are also available.
> current.vpTree()
viewport[ROOT]->(
viewport[leftvp],
viewport[rightvp])
The grid.grab() function works with any output produced using grid, including lattice plots, and there is a similar function grid.grabExpr(), which will capture the output from an expression (grid.grabExpr() is only available in R version 2.1.0; the grid.grab() function could be used instead by explicitly opening a new device, drawing the histogram, and then capturing it). The following code creates a gTree by capturing an expression to draw the histogram in Figure 3.
> histgrab <- grid.grabExpr(
{ trellis.par.set(canonical.theme("pdf"));
print(hist1) },
name="hist", vp="leftvp")
The grid.add() function can be used to add further
child grobs to a gTree. The following code adds the
histogram gTree as a child of the gTree called "nz".
An important detail is that the histogram gTree was
created with vp="leftvp"; this means that the histogram gets drawn in the viewport called "leftvp"
(i.e., in the left half of the scene). The output now
looks like Figure 7.
> grid.add("nz", histgrab)
There is still only one graphical object in the scene,
the gTree called "nz", but this gTree now has five
children: two lots of circles, one line, one rectangle,
and a lattice histogram (as a gTree).
> childNames(grid.get("nz"))
[1] "rect1"
"circles1" "line1"
[4] "circles3" "hist"

The functions for accessing, modifying, and removing graphical objects all work with hierarchical graphical objects like this. For example, it is possible to remove a specific child from a gTree using grid.remove().

When dealing with a scene that includes gTrees, a simple grob name can be ambiguous because, by default, grid.get() and the like will search within the children of a gTree to find a match.

Just as you can provide a viewport path in downViewport() to unambiguously specify a particular viewport, it is possible to provide a grob path in grid.get(), grid.edit(), or grid.remove() to unambiguously specify a particular grob.
A grob path is a list of grob names that must be
matched in order from parent to child. The following
code demonstrates the use of the gPath() function to
create a grob path. In this example, we are going to
modify one of the grobs that were created by lattice
when drawing the histogram. In particular, we are
going to modify the grob called "plot1.xlab" which
represents the x-axis label of the histogram.
In order to specify the x-axis label unambiguously,
we construct a grob path that identifies the grob
"plot1.xlab", which is a child of the grob called
"hist", which itself is a child of the grob called "nz".
That path is used to modify the grob which represents the x-axis label on the histogram. The x-axis label is moved to the far right end of its viewport and
is drawn using a bold-italic font face (see Figure 8).
> grid.edit(gPath("nz", "hist", "plot1.xlab"),
x=unit(1, "npc"), just="right",
gp=gpar(fontface="bold.italic"))
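In the same way, a grob path can identify a child to delete. The following line is a minimal sketch (not from the original article) that would remove the child "circles1" from the gTree "nz" created above:

> grid.remove(gPath("nz", "circles1"))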

Summary
When a scene is created using the grid graphics system, a tree of viewport objects is created to represent
the drawing regions used in the construction of the
scene and a list of graphical objects is created to represent the actual graphical output.
Functions are provided to view the viewport tree and
navigate within it in order to add further output to a
scene.
Other functions are provided to view the graphical
objects in a scene, to modify those objects, and/or
remove graphical objects from the scene. If graphical objects are modified or removed, the scene is redrawn to reflect the changes. Graphical objects can
also be grouped together in a tree structure and dealt with as a single unit.

Figure 7: Grabbing grid output.

Figure 8: Editing a gTree.

Table 1: Summary of the new and changed functions in R 2.0.0 relating to viewports.

  Function              Description
  pushViewport()        Create a region for drawing on the current graphics device.
  popViewport()         Remove the current drawing region (does not delete any output).
  upViewport()          Navigate to parent drawing region (but do not remove current region).
  downViewport()        Navigate down to named drawing region.
  current.viewport()    Return the current viewport.
  current.vpTree()      Return the current tree of viewports that have been created on the current device.
  vpPath()              Create a viewport path; a concatenated list of viewport names.
  viewport(..., name=)  A viewport can have a name associated with it.

Table 2: Summary of the new and changed functions since R 2.0.0 relating to graphical objects (some functions only available since R 2.1.0).

  Function               Description
  grid.get()             Return a single grob or a gList of grobs.
  grid.edit()            Modify one or more grobs and (optionally) redraw.
  grid.remove()          Remove one or more grobs and (optionally) redraw.
  grid.add()             Add a grob to a gTree.
  grid.grab()            Create a gTree from the current scene.
  grid.grabExpr()        Create a gTree from an R expression (only available since R 2.1.0).
  gPath()                Create a grob path; a concatenated list of grob names.
  getNames()             List the names of the top-level grobs in the current scene (only available since R 2.1.0).
  childNames()           List the names of the children of a gTree.
  grid.rect(..., name=)  Grid drawing functions can associate a name with their output.

Paul Murrell
University of Auckland, New Zealand
pmurrell@auckland.ac.nz

hoa: An R package bundle for higher
order likelihood inference
by Alessandra R. Brazzale

Introduction

The likelihood function represents the basic ingredient of many commonly used statistical methods
for estimation, testing and the calculation of confidence intervals. In practice, much application of likelihood inference relies on first order asymptotic results such as the central limit theorem. The approximations can, however, be rather poor if the sample
size is small or, generally, when the average information available per parameter is limited. Thanks
to the great progress made over the past twenty-five years or so in the theory of likelihood inference, very accurate approximations to the distribution of statistics such as the likelihood ratio have been developed. These not only provide modifications to well-established approaches, which result in more accurate inferences, but also give insight into when to rely
upon first order methods. We refer to these developments as higher order asymptotics.
One intriguing feature of the theory of higher order likelihood asymptotics is that relatively simple
and familiar quantities play an essential role. The
higher order approximations discussed in this paper
are for the significance function, which we will use
to set confidence limits or to calculate the p-value
associated with a particular hypothesis of interest.
We start with a concise overview of the approximations used in the remainder of the paper. Our
first example is an elementary one-parameter model
where one can perform the calculations easily, chosen to illustrate the potential accuracy of the procedures. Two more elaborate examples, an application
of binary logistic regression and a nonlinear growth
curve model, follow. All examples are carried out using the R code of the hoa package bundle.
Basic ideas
Assume we observed n realizations y1, . . . , yn of independently distributed random variables Y1, . . . , Yn whose density function f(yi; θ) depends on an unknown parameter θ. Let ℓ(θ) = ∑_{i=1}^{n} log f(yi; θ) denote the corresponding log likelihood and θ̂ = argmax_θ ℓ(θ) the maximum likelihood estimator. In almost all applications the parameter θ is not scalar but a vector of length d. Furthermore, we may re-express it as θ = (ψ, λ), where ψ is the d0-dimensional parameter of interest, about which we wish to make inference, and λ is a so-called nuisance parameter, which is only included to make the model more realistic.

Confidence intervals and p-values can be computed using the significance function p(ψ; ψ̂) = Pr(Ψ̂ ≤ ψ̂; ψ), which records the probability left of the observed "data point" ψ̂ for varying values of the unknown parameter ψ (Fraser, 1991). Exact elimination of λ, however, is possible only in a few special cases (Severini, 2000, Sections 8.2 and 8.3). A commonly used approach is to base inference about ψ on the profile log likelihood ℓp(ψ) = ℓ(ψ, λ̂ψ), which we obtain from the log likelihood function by replacing the nuisance parameter with its constrained estimate λ̂ψ, obtained by maximising ℓ(θ) = ℓ(ψ, λ) with respect to λ for fixed ψ. Let jp(ψ) = −∂²ℓp(ψ)/∂ψ∂ψ⊤ denote the observed information from the profile log likelihood. Likelihood inference for scalar ψ is typically based on the

• Wald statistic, w(ψ) = jp(ψ̂)^{1/2} (ψ̂ − ψ);

• likelihood root, r(ψ) = sign(ψ̂ − ψ) [2{ℓp(ψ̂) − ℓp(ψ)}]^{1/2}; or

• score statistic, s(ψ) = jp(ψ̂)^{−1/2} dℓp(ψ)/dψ.

Under suitable regularity conditions on f(y; θ), all of these have an asymptotic standard normal distribution
up to the first order. Using any of the above statistics we can approximate the significance function by Φ{w(ψ)}, Φ{r(ψ)} or Φ{s(ψ)}. When d0 > 1, we may use the quadratic forms of the Wald, likelihood root and score statistics, whose finite sample distribution is χ²_{d0} with d0 degrees of freedom up to the second order. We refer the reader to Chapters 3 and 4 of Severini (2000) for a review of first order likelihood inference.

Although it is common to treat ℓp(ψ) as if it were an ordinary log likelihood, first order approximations can give poor results, particularly if the dimension of λ is high and the sample size small. An important variant of the likelihood root is the modified likelihood root
  r∗(ψ) = r(ψ) + (1/r(ψ)) log{q(ψ)/r(ψ)},    (1)
where q(ψ) is a suitable correction term. Expression (1) is a higher order pivot whose finite sample
distribution is standard normal up to the third order. As it was the case for its first order counterpart r, the significance function is approximated by
Φ{r∗ (ψ)}, and there is a version of r∗ for multidimensional ψ (Skovgaard, 2001, Section 3.1). More
details about the computation of the q(ψ) correction
term are given in the Appendix.
It is sometimes useful to decompose the modified
likelihood root as
r∗ (ψ) = r(ψ) + rinf (ψ) + rnp (ψ),
where rinf is the information adjustment and rnp is the
nuisance parameter adjustment. The first term accounts
for non-normality of r, while the second compensates
r for the presence of the nuisance parameter λ. Pierce
and Peters (1992, Section 3) discuss the behaviour of
these two terms in the multiparameter exponential
family context. They find that while rnp is often appreciable, the information adjustment rinf has typically a minor effect, provided the ψ-specific information jp (ψ̂) is not too small relative to the dimension
of λ.

A simple example
Suppose that a sample y1, . . . , yn is available from the Cauchy density

  f(y; θ) = 1 / [π{1 + (y − θ)²}].    (2)

The maximum likelihood estimate θ̂ of the unknown location parameter θ is the value which maximises the log likelihood function

  ℓ(θ; y) = −∑_{i=1}^{n} log{1 + (yi − θ)²}.

For n = 1, we obtain the exact distribution of θ̂ = y from (2) as F(θ̂; θ) = F(y; θ) = 1/2 + π^{−1} arctan(y − θ).

Figure 1: Significance functions for the location parameter of a Cauchy distribution when y = 1.32:
exact (bold), Wald pivot (dotted), r (dashed) and r∗
(solid). The vertical dashed line corresponds to the
null hypothesis θ = 0.
Assume that y = 1.32 was observed. In Figure 1
we compare the exact significance function p(θ; y) =
Pr(Y ≤ y; θ ) (bold line) to the two first order approximations obtained from the Wald statistic
  w(θ) = √2 (y − θ),

(dotted line), and from the likelihood root

  r(θ) = sign(θ̂ − θ) [2 log{1 + (y − θ)²}]^{1/2},

(dashed line). We also show the third order approximation Φ{r∗(θ)} (solid line). Since this is a location model and there is no nuisance parameter, the statistic q(θ) in (1) is the score statistic

  s(θ) = √2 (y − θ)/{1 + (y − θ)²}
(see formula (6) in the Appendix). The vertical
dashed line corresponds to the null hypothesis that
θ = 0. The exact p-value is 0.413, while the first
and third order approximations yield 0.0619 (Wald),
0.155 (r) and 0.367 (r∗ ). The r∗ statistic is strikingly
accurate, while the first order approximations are
very poor. This is surprising if we consider that the
score function is not monotonic in y and that only
one observation is available. Figure 2 summarises
the R code used to generate Figure 1 and obtain the
above p-values.
Suppose now that we observed a sample of size
n = 15 from the Student t distribution with 3 degrees of freedom. It is no longer possible to derive
the exact distribution of the maximum likelihood estimator θ̂, but we may use the code provided in the
marg package of the hoa package bundle to compute
the p-values for testing the significance of the location parameter.
## likelihood pivots
> wald.stat <- function(theta, y) {
+     sqrt(2) * (y - theta) }
> lik.root <- function(theta, y) {
+     sign(y - theta) * sqrt( 2 * log(1 + (y - theta)^2) ) }
> score.stat <- function(theta, y) {
+     ( sqrt(2) * (y - theta) )/( 1 + (y - theta)^2 ) }
> rstar <- function(theta, y) {
+     lik.root(theta, y) + 1/lik.root(theta, y) *
+         log( score.stat(theta, y)/lik.root(theta, y) ) }
## significance functions : Figure 1
> theta.seq <- seq(-4, 4, length = 100)
> par( las = 1, mai = c(0.9, 0.9, 0.2, 0.2) )
> plot( theta.seq, pcauchy( q = 1.32 - theta.seq ), type = "l", lwd = 2, ylim = c(0,1),
+       xlab = expression(theta), ylab = "significance function", cex.lab = 1.5, cex.axis = 1.5 )
> lines( theta.seq, pnorm( wald.stat(theta.seq, 1.32) ), lty = "dotted" )
> lines( theta.seq, pnorm( lik.root(theta.seq, 1.32) ), lty = "dashed" )
> lines( theta.seq, pnorm( rstar(theta.seq, 1.32) ), lty = "solid" )
> abline( v = 0, lty = "longdash" )
## p-values
> 2 * ( min( tp <- pt(1.32, df = 1), 1 - tp ) )
> 2 * ( min( tp <- pnorm( wald.stat(0, 1.32) ), 1 - tp ) )
> 2 * ( min( tp <- pnorm( lik.root(0, 1.32) ), 1 - tp ) )
> 2 * ( min( tp <- pnorm( rstar(0, 1.32) ), 1 - tp ) )

Figure 2: R code for drawing Figure 1 and calculating the p-values for testing the significance of the location
parameter θ of a Cauchy distribution when y = 1.32.
## simulated data
> library(marg)
> set.seed(321)
> y <- rt(n = 15, df = 3)
> y.rsm <- rsm(y ~ 1, family = student(3))
> y.cond <- cond(y.rsm, offset = 1)
> summary(y.cond, test = 0)
The previous set of instructions yields the p-values
0.282 (Wald), 0.306 (r) and 0.354 (r∗ ). The difference
between first order and higher order approximations
is slightly smaller than in the previous case. For this
particular model a sample size of n = 15 still does
not provide enough information on the scalar parameter θ to wipe out completely the effect of higher order corrections.

Higher order asymptotics in R
hoa is an R package bundle which implements higher
order inference for three widely used model classes:
logistic regression, linear non-normal models and nonlinear regression with possibly non-homogeneous variance. The corresponding code is organised in three packages, namely cond, marg and nlreg.
We already saw a (very elementary) application of
the marg code. The two examples which follow will
give a glimpse of the use of the routines in cond
and nlreg. Attention is restricted to the calculation
of p-values and confidence intervals, although several routines for accurate point estimation and model
checking are also available. The hoa bundle includes
a fourth package, called sampling, which we will not
discuss here. It implements a Metropolis-Hastings
sampler which can be used to simulate from the
conditional distribution of the higher order statistics
considered in marg.
The hoa package bundle will soon be available on
CRAN. More examples of applications, and generally of the use of likelihood asymptotics, are given in
Brazzale et al. (to appear).

Example 1: Binary data
Collet (1998) gives a set of binary data on the presence of a sore throat in a sample of 35 patients undergoing surgery during which either of two devices
was used to secure the airway. In addition to the
variable of interest, device type (1=tracheal tube or
0=laryngeal mask), there are four further explanatory variables: the age of the patient in years, an indicator variable for sex (1=male, 0=female), an indicator variable for lubricant use (1=yes, 0=no) and
the duration of the surgery in minutes. The observations form the data frame airway which is part of
the hoa bundle.
A natural starting point for the analysis is a logistic regression model with success probability of the
form

  Pr(Y = 1; β) = exp(x⊤β) / {1 + exp(x⊤β)},

where x represents the explanatory variables associated with the binary response Y (1=sore throat and 0=no sore throat). The following set of instructions fits this model to the data with all five explanatory variables included.
## binomial model fit
> airway.glm <- glm( formula(airway),
+     family = binomial, data = airway )
> summary( airway.glm )
[omitted]
Coefficients:
            Estimate Std. Error z value Pr(>|z
(Intercept) -2.75035    2.08914  -1.316   0.18
age          0.02245    0.03763   0.597   0.55
sex1         0.32076    1.01901   0.315   0.75
lubricant1   0.08448    0.97365   0.087   0.93
duration     0.07183    0.02956   2.430   0.01
type1       -1.62968    0.94737  -1.720   0.08
[omitted]
The coefficient of device type is only marginally significant.
As in the previous example we may wonder
whether the sample size is large enough to allow us
to rely upon first order inference. For the airway data
we have n = 35 and p = 5, so we might expect
higher order corrections to the usual approximations
to have little effect. We can check this using the routines in the cond package.
## higher order inference
> library(cond)
> airway.cond <- cond( airway.glm,
+                      offset = type1 )
> summary( airway.cond )  # produces 95% confidence intervals
As our model is a canonical exponential family,
the correction term q(ψ) in (1) involves the Wald
statistic w(ψ) plus parts of the observed information matrix (see formula (5) in the Appendix). The
95% confidence intervals obtained from the Wald
pivot and from the likelihood root are respectively
(−3.486, 0.227) and (−3.682, 0.154). The third order statistic r∗ yields a 95% confidence interval of
(−3.130, 0.256). First and third order results are
rather different, especially with respect to the lower
bound.
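As a quick check, the first order (Wald) interval can be reproduced directly from the airway.glm fit shown above; this is a minimal sketch added for illustration, and it matches the (−3.486, 0.227) interval quoted here:

> est <- coef(summary(airway.glm))["type1", "Estimate"]
> se  <- coef(summary(airway.glm))["type1", "Std. Error"]
> est + c(-1, 1) * qnorm(0.975) * se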
Figure 3: airway data analysis: profile plots of the
pivots w(ψ) (dashed line), r(ψ) (solid line) and r∗ (ψ)
(bold line), where ψ is the coefficient of the covariate
device type.
Figure 3 plots the profiles of the first and third order
pivots w(ψ) (dashed line), r(ψ) (solid line) and r∗ (ψ)
(bold line). The correction term q(ψ) is particularly
significant for values of ψ belonging to the lower half
of the confidence interval. The nuisance parameter
correction is rnp = 0.51, while rinf = 0.059 is about
ten times smaller.

Example 2: Nonlinear regression
A simple illustration of nonlinear regression is Example 7.7 of Davison and Hinkley (1997), which refers
to the calcium data of package boot. This data set
records the calcium uptake (in nmoles/mg) of cells y
as a function of time x (in minutes), after being suspended in a solution of radioactive calcium. There
are 27 observations in all. The model is
  yi = β0 {1 − exp(−β1 xi)} + σi εi,    (3)
where β0 and β1 are unknown regression coefficients
and the error term εi ∼ N (0, 1) is standard normal.
We complete the definition of model (3) by allowing
the response variance σi2 = σ 2 (1 + xi )γ to depend
nonlinearly on the time covariate through the two
variance parameters γ and σ 2 .
Figure 4 gives the R code for the analysis. The
variables cal and time represent respectively the calcium uptake and suspension time. Model (3) is fitted
by maximum likelihood using the nlreg routine of
package nlreg, which yields β̂0 = 4.317 (s.e. 0.323),
β̂1 = 0.207 (s.e. 0.036), γ̂ = 0.536 (s.e. 0.320), and
log σ̂ 2 = −2.343 (s.e. 0.634). Note that the baseline variance σ 2 is fitted on the logarithmic scale.
This does not affect inference based on the r and r∗
statistics, which are parametrisation invariant, and
ensures positive values for σ² when the Wald statistic is used.

> library(boot)
> library(nlreg)
## maximum likelihood fit
> calcium.nl <- nlreg( cal ~ b0 * (1 - exp(-b1 * time)), weights = ~ (1 + time)^g,
+                      data = calcium, start = c(b0 = 4, b1 = 0.1, g = 0) )
> summary( calcium.nl )    # yields estimates and standard errors
## pivot profiling for \gamma
> calcium.prof <- profile( calcium.nl, offset = g )
> summary( calcium.prof )
Two-sided confidence intervals for g
                  lower upper
r* - Fr (0.95) -0.14340 1.191
r* - Sk (0.95) -0.14320 1.190
r (0.95)       -0.12441 1.154
Wald (0.95)    -0.08991 1.163

  Estimate Std. Error
g   0.5364     0.3196

[omitted]
## inference on proportion of maximum
> calcium.nl <- nlreg( cal ~ b0 * (1 - exp(- log(1 + exp(psi)) * time / 15)),
+                      data = calcium, start = c(b0 = 4.3, psi = 2) )
> calcium.prof <- profile( calcium.nl, offset = psi )
> calcium.sum <- summary( calcium.prof )
> exp(calcium.sum$CI) / (1 + exp(calcium.sum$CI))   # 95% confidence intervals for \pi
## profile and contour plots : Figure 5
> calcium.prof <- profile( calcium.nl )
> par( las = 1, mai = c(0.5, 0.5, 0.2, 0.2) )
> contour( calcium.prof, alpha = 0.05, cl1 = "black", cl2 = "black", lwd2 = 2 )
Figure 4: calcium uptake data analysis: R code for model fitting and pivot profiling for different parameters
of interest (variance parameter γ and “proportion of maximum” π).
Figure 4 also shows how to use the profile method of the nlreg package to set various first and higher order 95% confidence intervals for the variance parameter γ (technical details are omitted from the output). A difficulty we did not have to face in the previous two examples is that it is no longer possible to calculate the correction term in (1) exactly.
The profile function implements two slightly different versions of the higher order pivot r∗ which we
obtain by using the two approximations of q(ψ) discussed in the Appendix. The four statistics agree in
letting us question the heterogeneity of the response
variance.
Davison and Hinkley (1997, p. 356) consider not
only inference on the nonlinear mean function, but
also on other aspects of the model such as the
“proportion of maximum”, π = 1 − exp(−β1 x).
For x = 15 minutes they give the estimate π̂ =
0.956 and the associated 95% bootstrap confidence
interval (0.83, 0.98). We may obtain the corresponding first and higher order likelihood analogues
by reparametrizing the mean response curve into

(π, β0 ) and re-running the whole analysis. This time
we assume that the response variance is homogeneous. Because of the constraint that π must lie
in the interval (0, 1), we actually fit the model for
ψ = log{π /(1 − π )} and back-transform to the original scale by π = exp(ψ)/{1 + exp(ψ)}. This yields
the intervals (0.87, 0.99) and (0.88, 0.99) for respectively the Wald and likelihood root statistics and
(0.87, 0.99) for both versions of r∗ , which is in agreement with the bootstrap simulation.
The profile method of the nlreg package also provides all the elements needed to display a fitted nonlinear model graphically, as done for instance
in Figure 5 with respect to the second model fit. The
contour method of the nlreg package represents,
in fact, an enhanced version of the original algorithm by Bates and Watts (1988, Chapter 6), to which
we refer the reader for the interpretation of these
plots. The dashed, solid and bold lines represent respectively the Wald pivot, the likelihood root r and
Skovgaard’s (1996) approximation to the r∗ statistic
Figure 5: calcium uptake data analysis: profile plots and profile pair sketches for the parameters β0 , ψ and
log σ 2 using the Wald statistic (dashed), the likelihood root r (solid) and Skovgaard’s (1996) approximation to
r∗ (bold).
(see the Appendix). The bivariate contour plots in
the lower triangle are plotted on the original scale,
whereas the ones in the upper triangle are on the
r scale. Figure 5 highlights different aspects of the
model fit. First, the maximum likelihood estimate
of log σ 2 is biased downwards, which we can tell
from the fact that the corresponding r∗ profile is shifted
to the right of r. Otherwise, there does not seem to
be a huge difference between first and higher order
methods as the corresponding profiles and contours
are not too different. The finite sample estimates of
β0 and ψ are strongly correlated, while they are almost independent of log σ̂ 2 . The contours of r(ψ)
and r∗ (ψ) are close to elliptic which indicates that
the log likelihood function is not too far from being
quadratic. A further indication for a small curvature
effect due to parametrisation is that the contours on
the original and on the r scale look similar.

Acknowledgments
I am indebted to Ruggero Bellio, Anthony Davison
and Nancy Reid for the many helpful discussions on
and around higher order likelihood asymptotics. I
would like to thank two anonymous referees whose suggestions and comments helped improve a previous version of this paper.

Appendix: q(ψ) correction term
In this appendix we give the general expression of
the correction term q(ψ) in (1) and the explicit formulae for two special model classes, that is, linear exponential families and regression-scale models. We
will furthermore discuss two ways of approximating
q(ψ) in case we cannot calculate it explicitly.

Basic expression
Let ℓ(θ) = ℓ(ψ, λ) be the log likelihood function, θ̂ = (ψ̂, λ̂) the maximum likelihood estimator of the d-dimensional parameter θ = (ψ, λ), and j(θ) = −∂²ℓ(θ)/∂θ∂θ⊤ the d × d observed information matrix. Denote by λ̂ψ the constrained maximum likelihood estimator of the nuisance parameter λ given the value of the scalar parameter of interest ψ. Write jλλ(θ) for the corner of j(θ) = j(ψ, λ) which corresponds to λ, and θ̂ψ = (ψ, λ̂ψ).
The basic expression for q(ψ) is

  q(ψ) = | ℓ;θ̂(θ̂) − ℓ;θ̂(θ̂ψ)   ℓλ⊤;θ̂(θ̂ψ) | / { |jλλ(θ̂ψ)| |j(θ̂)| }^{1/2},    (4)

where | · | indicates determinant (Severini, 2000, Section 7.4.1). The d × d matrix appearing in the numerator of q(ψ) consists of a column vector formed using so-called sample space derivatives

  ℓ;θ̂(θ) = ∂ℓ(θ; θ̂ | a)/∂θ̂,

and a d × (d − 1) matrix of mixed derivatives

  ℓλ⊤;θ̂ = ∂²ℓ(ψ, λ; θ̂ | a)/(∂λ⊤ ∂θ̂).

The former are defined as the derivatives of the log likelihood function ℓ(ψ, λ; θ̂ | a) with respect to the maximum likelihood estimator θ̂; mixed derivatives furthermore involve differentiation with respect to the whole parameter θ or parts of it (Severini, 2000, Section 6.2.1). Note that to do so, the data vector has to be re-expressed as y = (θ̂, a), where a represents the observed value of an ancillary statistic upon which we condition.

Approximations
Exact computation of the sample space derivatives
involved in expression (4) requires that we are able to
write the data vector y as a function of the maximum
likelihood estimator θ̂ and of an ancillary statistic a.
This is, with few exceptions, only feasible for linear
exponential families and transformation models, in
which cases the q(ψ) term involves familiar likelihood quantities. If the reference model is a full rank
exponential family with ψ and λ taken as canonical
parameters, the correction term

  q(ψ) = w(ψ) { |jλλ(θ̂)| / |jλλ(θ̂ψ)| }^{1/2}    (5)

depends upon the Wald statistic. In the case of a regression-scale model, that is, of a linear regression model with not necessarily normal errors,

  q(ψ) = s(ψ) { |jλλ(θ̂ψ)| / |jλλ(θ̂)| }^{1/2}    (6)

involves the score statistic. Here, ψ is linear in (β, σ )
and the nuisance parameter λ is taken linear in β and
ξ = log σ, where β and σ represent respectively the
regression coefficients and the scale parameter.
In general, the calculation of the sample space derivatives ℓ;θ̂(θ) and mixed derivatives ℓλ⊤;θ̂(θ) may be difficult or impossible. To deal with this difficulty, several approximations were proposed. For a comprehensive review we refer the reader to Section 6.7 of Severini (2000). Here we will mention two of them. A first approximation, due to Fraser et al. (1999), is based upon the idea that in order to
differentiate the likelihood function ℓ(θ; θ̂ | a) on the
surface in the n-dimensional sample space defined
by conditioning on a we need not know exactly the
transformation from y to (θ̂, a), but only the d vectors which are tangent to this surface (Severini, 2000,
Section 6.7.2). Skovgaard (1996) on the other hand
suggests approximating the sample space and mixed
derivatives by suitable covariances of the log likelihood and of the score vector (Severini, 2000, Section 6.7.3). While the first approximation maintains
the third order accuracy of r∗ , we lose one degree
when following Skovgaard’s (1996) approach. See
Sections 7.5.3 and 7.5.4 of Severini (2000) for the details.

The hoa package

The expressions of q(ψ) implemented in the hoa
package bundle are: i) (5) and (6) for respectively the
cond and marg packages (logistic and linear non-normal regression), and ii) the two approximations discussed above for the nlreg package (nonlinear heteroscedastic regression). The formulae are given in
Brazzale et al. (to appear). The nlreg package also
implements Skovgaard’s (2001, Section 3.1) multiparameter extension of the modified likelihood root.
The implementation of the cond and nlreg packages
is discussed in Brazzale (1999) and Bellio and Brazzale (2003).

Bibliography
D. M. Bates and D. G. Watts. Nonlinear Regression Analysis and Its Applications. Wiley, New York, 1988.

R. Bellio and A. R. Brazzale. Higher-order asymptotics unleashed: Software design for nonlinear heteroscedastic models. Journal of Computational and Graphical Statistics, 12:682–697, 2003.

A. R. Brazzale. Approximate conditional inference in logistic and loglinear models. Journal of Computational and Graphical Statistics, 8:653–661, 1999.

A. R. Brazzale, A. C. Davison, and N. Reid. Applied Asymptotics. Cambridge University Press, Cambridge, to appear.

D. Collet. Binary data. In P. Armitage and T. Colton, editors, Encyclopedia of Biostatistics. John Wiley & Sons, Chichester, 1998.

A. C. Davison and D. V. Hinkley. Bootstrap Methods and Their Application. Cambridge University Press, Cambridge, 1997.

D. A. S. Fraser. Statistical inference: Likelihood to significance. Journal of the American Statistical Association, 86:258–265, 1991.

D. A. S. Fraser, N. Reid, and J. Wu. A simple general formula for tail probabilities for frequentist and Bayesian inference. Biometrika, 86:249–264, 1999.

D. A. Pierce and D. Peters. Practical use of higher-order asymptotics for multiparameter exponential families. Journal of the Royal Statistical Society Series B, 54:701–737, 1992.

T. A. Severini. Likelihood Methods in Statistics. Oxford University Press, Oxford, 2000.

I. M. Skovgaard. An explicit large-deviation approximation to one-parameter tests. Bernoulli, 2:145–165, 1996.

I. M. Skovgaard. Likelihood asymptotics. Scandinavian Journal of Statistics, 28:3–32, 2001.
Alessandra R. Brazzale
Institute of Biomedical Engineering, Italian National Research Council
alessandra.brazzale@isib.cnr.it

Fitting linear mixed models in R
Using the lme4 package
by Douglas Bates
The lme function, which fits linear mixed models of
the form described in Pinheiro and Bates (2000), has
been available in the required R package nlme for
several years. Recently my colleagues and I have
been developing another R package, called lme4, and
its lmer function which provides more flexible fitting
of linear mixed models and also provides extensions
to generalized linear mixed models.
The good news for lme users is that the lmer function fits a greater range of models, is more reliable,
and is faster than the lme function. The bad news
is that the model specification has been changed
slightly. The purpose of this article is to introduce
lmer, to describe how it can be used to fit linear
mixed models and to highlight some of the differences between lmer and lme.

Linear mixed models
Before describing how to fit linear mixed models I
will describe what they are. In building a statistical model for experimental or observational data we
often want to characterize the dependence of a response, such as a patient’s heart rate, on one or more
covariates, such as the patient identifier, whether the
patient is in the treatment group or the control group,
and the time under treatment. We also want to characterize the “unexplained” variation in the response.
Empirical models (i.e., models that are derived from
the data itself, not from external assumptions on the
mechanism generating the data) are generally chosen to be linear in the parameters as these are much
simpler to use than are nonlinear models.
Some of the available covariates may be repeatable in the sense that (conceptually, at least) we can
obtain new observations with the same values of the
covariate as in the current study. For example, we
can recruit a new patient to the study and assign
this patient to the treatment group or to the control
group. We can then observe this patient’s heart rate
at one minute, five minutes, etc. after treatment.
Hence we would regard both the treatment factor
and the time after treatment covariate as repeatable.
A factor is repeatable if the set of possible levels
of the factor is fixed and each of these levels is itself
repeatable. In most studies we would not regard the
patient identifier factor (or, more generally, the “subject” factor or any other factor representing an experimental unit) as being repeatable. Instead we regard
the subjects in the study as a random sample from
the population of interest.
Our goals in modeling repeatable covariates and
non-repeatable covariates are different. With a
repeatable covariate we want to characterize the
change in the response between different levels and
for this we use fixed-effects terms that represent, say,
the typical rate of change of the response with respect
to time under treatment or the difference between
a typical response in the treatment and the control
groups. For a non-repeatable covariate we want to
characterize the variation induced in the response by
the different levels of the covariate and for this we
use random-effects terms.
A statistical model that incorporates both fixed-effects terms and random-effects terms is called a
mixed-effects model or, more simply, a mixed model.

Single grouping factor
As indicated above, a random effect is associated
with a grouping factor, which would be the patient
identifier in our example, and possibly with other covariates. We specify a random-effects term in lmer by
a linear model term and a grouping factor separated
by ‘|’, which we would read as “given” or “conditional on”. That is, a random effect is a linear model
term conditional on the level of the grouping factor.
Because the precedence of ‘|’ as an operator
is lower than most other operators used in linear
model formulae, the entire random-effects expression should be enclosed in parentheses.
Many models for longitudinal data (repeated
measurements over time on each of several subjects)
incorporate random effects associated with a single
grouping factor. Consider the HR (heart rate) data
from the SASmixed package
> data("HR", package = "SASmixed")
> names(HR)
[1] "Patient" "Drug"
[4] "HR"
"Time"

"baseHR"

Initially we fit a linear mixed model with fixed-effects terms for the base heart rate, the time since
administration of the medication, the type of drug
and the time/drug interaction. The random effect associated with the patient is a simple additive shift.
> (fm1 <- lmer(HR ~ baseHR + Time *
+     Drug + (1 | Patient), HR))
Linear mixed-effects model fit by REML
Formula: HR ~ baseHR+Time*Drug+(1|Patient)
   Data: HR
  AIC   BIC logLik MLdeviance REMLdeviance
790.7 815.8 -386.3      791.9        772.7
Random effects:
 Groups   Name        Variance Std.Dev.
 Patient  (Intercept) 44.5     6.67
 Residual             29.8     5.46
# of obs: 120, groups: Patient, 24
Fixed effects:
            Estimate Std. Error  DF t value Pr(>|t|)
(Intercept)   33.962      9.931 113    3.42  0.00087
baseHR         0.588      0.118 113    4.97  2.5e-06
Time         -10.698      2.421 113   -4.42  2.3e-05
Drugb          3.380      3.784 113    0.89  0.37358
Drugp         -3.778      3.802 113   -0.99  0.32244
Time:Drugb     3.512      3.424 113    1.03  0.30717
Time:Drugp     7.501      3.424 113    2.19  0.03050

The first few lines of the output state that the
model has been fit using the REML (restricted or
residual maximum likelihood) criterion and indicate
the formula and the data set used. The next line provides several measures of the quality of the fit including Akaike’s Information Criterion, Schwartz’s
Bayesian Information Criterion, the log-likelihood
(actually the log-restricted likelihood because REML
is being used) and the ML and REML versions of the
deviance.
The estimates of the variance components are
given next. Note that the column headed Std.Dev. is
not the standard error of the estimate of the variance
component. It is simply the square root of the estimated variance and is included because many people (and I am one) find it easier to interpret an estimated standard deviation instead of an estimated
variance.
Finally the estimates for the fixed-effects coefficients are summarized. If we wish to consider
the significance of terms rather than individual coefficients we can use a single-argument call to the
anova generic which provides the sequential sums of
squares and the corresponding F tests.
> anova(fm1)
Analysis of Variance Table
          Df  Sum Sq Mean Sq Denom F value  Pr(>F)
baseHR     1     746     746   113   25.05 2.1e-06
Time       1     753     753   113   25.28 1.9e-06
Drug       2      87      43   113    1.46   0.237
Time:Drug  2     143      72   113    2.40   0.095
At present the denominator degrees of freedom
shown in the coefficient table and in the analysis of
variance table are upper bounds. In a sense there
is no “correct” denominator degrees of freedom because the F test is always an approximation for these
models. But, even taking this into account, it is still
the case that the denominator degrees of freedom for
the Drug term should be lower than shown. The reason that these degrees of freedom are not more accurately approximated at present is because it is difficult to decide exactly how this should be done for
the general models described below.
Before changing the fixed effects terms in the
model we may wish to examine models with more
general specifications of the random effects, such as
both a random intercept and a random slope (with
respect to time) for each patient.
> VarCorr(fm2 <- lmer(HR ~ baseHR +
+     Time * Drug + (Time | Patient),
+     HR))
 Groups   Name        Variance Std.Dev. Corr
 Patient  (Intercept) 60.6     7.79
          Time        37.8     6.15     -0.563
 Residual             24.4     4.94
> head(model.matrix(~Time, HR), n = 3)
  (Intercept)    Time
1           1 0.01667
2           1 0.08333
3           1 0.25000

> head(ranef(fm2)$Patient, n = 3)
    (Intercept)     Time
201       0.871  4.04733
202      -9.341  6.79574
203       5.005 -0.07822
> anova(fm1, fm2)
Data: HR
Models:
fm1: HR ~ baseHR+Time*Drug+(1|Patient)
fm2: HR ~ baseHR+Time*Drug+(Time|Patient)
    Df AIC BIC logLik Chisq Chi Df Pr(>Chisq)
fm1  9 810 835   -396
fm2 11 810 841   -394  3.77      2       0.15
To save space I summarized this fitted model using VarCorr which describes only the estimates of
the variance components. Notice that the expression
Time generates a model matrix (model.matrix) with
two columns, (Intercept) and Time, so there are
two random effects associated with each patient.
The distribution of the random effects is a bivariate
normal distribution with mean zero and a positive-definite 2 × 2 variance-covariance matrix. The estimates of the two variances and the estimated correlation are given in the output.
The ranef generic function returns the BLUPs
(Best Linear Unbiased Predictors) of the random effects and anova provides a likelihood ratio test when
given multiple model fits to the same data set. As
described in Pinheiro and Bates (2000), the p-value
calculated for this test will be conservative (i.e. it is
an upper bound on the true p-value) because the parameter space is bounded and in the null hypothesis
one of the parameters is at the boundary.

Multiple random-effects expressions
As shown above, the estimated variance-covariance
matrix from a random-effects expression, such as
(Time|Patient), for which the model matrix has
multiple columns is a general, positive-definite, symmetric matrix. (“General” in the sense that there are
no constraints imposed on this matrix other than its
being positive-definite.)
Occasionally we wish to impose further constraints such as independence of random effects associated with the same grouping factor. For example we can fit a model with an intercept and slope
for each patient but assuming independence of these
random effects with
> VarCorr(fm3 <- lmer(HR ~ baseHR +
+     Time * Drug + (1 | Patient) +
+     (Time - 1 | Patient), HR))
 Groups   Name        Variance Std.Dev.
 Patient  (Intercept) 47.9     6.92
 Patient  Time        25.0     5.00
 Residual             25.6     5.06
> anova(fm1, fm3, fm2)
Data: HR
Models:
fm1: HR~baseHR+Time*Drug+(1|Patient)
fm3: HR~baseHR+Time*Drug+(1|Patient)
     +(Time-1|Patient)
fm2: HR~baseHR+Time*Drug+(Time|Patient)
    Df AIC BIC logLik Chisq Chi Df Pr(>Chisq)
fm1  9 810 835   -396
fm3 10 811 839   -396  0.84      1      0.358
fm2 11 810 841   -394  2.93      1      0.087
The fitted model fm3 has a random intercept and a
random slope for each patient, as does fm2, but in fm3
these random effects are assumed to be independent
within patient. Hence fm3 uses one fewer degree of
freedom than does fm2.
In general the random effects from each expression are modeled as independent. If the model matrix from the ith random-effects expression has qi
columns and the grouping factor has ki levels (this
number is well-defined because unused levels are
dropped during construction of the model frame) the
corresponding random effects vector has length ki qi
and could be arranged as a ki × qi matrix as shown in
the ranef output above. The rows share a common
qi × qi variance-covariance matrix. Different rows are
independent.
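For the fm2 fit above, for instance, the grouping factor Patient has ki = 24 levels and the random-effects model matrix has qi = 2 columns, so the BLUPs form a 24 × 2 array. A minimal check (assuming the fm2 fit from above):

> dim(ranef(fm2)$Patient)   # 24 rows (patients), 2 columns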

Nested and non-nested grouping factors
We have seen an example of multiple random-effects
expressions for the same grouping factor. It is more
common to create random-effects expressions associated with different grouping factors. For example
in a multicenter trial we may have observations over
time on each of several patients at each of several institutions and it could be appropriate to use random
effects for both patient and institution.
If each patient is observed at only one institution
we say that the levels of patient are nested within the
levels of institution; otherwise the factors are non-nested. Notice that it only takes one patient migrating between institutions to cause the grouping factors to lose the nesting property. This may not be a
primary concern in the analysis of multicenter trials
but it is an important issue in the analysis of student
test scores over time where it is quite common to
have some portion of the students observed at multiple schools. In these cases analysis of the data as
if students were nested within schools is at best an
approximation and at worst quite inappropriate.
A major distinction between lme and lmer is that
lme is optimized for nested grouping factors whereas
lmer handles nested and non-nested grouping factors equally easily. In lmer one simply provides multiple random-effects expressions and the determination of nested or non-nested is done in the code.
Consider the star data (Student Teacher
Achievement Ratio) from the mlmRev package. These
data are from a large study - 26,796 observations
on a total of 11,598 students in 80 schools. We inferred teacher ids from characteristics of the teacher
and classroom, resulting in 1387 distinct teacher ids.
(This teacher count is not entirely accurate but it is a
reasonable approximation.)
We fit an initial model to the math test scores.
> VarCorr(fm4 <- lmer(math ~ gr +
+     sx + eth + cltype + (yrs |
+     id) + (yrs | sch), star))
 Groups   Name        Variance Std.Dev. Corr
 id       (Intercept) 1079.2   32.85
          yrs           22.9    4.78    -0.315
 sch      (Intercept)  292.8   17.11
          yrs           56.8    7.53    -0.777
 Residual              584.1   24.17

> anova(fm4)
Analysis of Variance Table
       Df  Sum Sq Mean Sq Denom F value  Pr(>F)
gr      3 1622511  540837 24566   925.9 < 2e-16
sx      1    9641    9641 24566    16.5 4.9e-05
eth     5  192909   38582 24566    66.0 < 2e-16
cltype  2   63848   31924 24566    54.6 < 2e-16

It happens in this case that the grouping factors
id and sch are not nested but if they were nested
there would be no change in the model specification.
It is likely that the lmer function would converge
more rapidly if the grouping factors were nested
but even with the non-nested grouping factors in this
large study convergence is reasonably fast.

Specifying levels
Because nested and non-nested grouping factors are
expressed in exactly the same way, it is not possible
to use implicit nesting when specifying the levels of
the grouping factors. Implicit nesting occurs when
the levels of the “inner” grouping factor are reused
for different levels of the “outer” grouping factor.
For example, if the patients are numbered starting at
patient 1 for each institution then patient 1 at institution 1 is distinguished from patient 1 at institution
2 only if it is assumed that patient is nested within
institution.
For lmer each distinct experimental unit must
correspond to a distinct level in the corresponding
grouping factor. It is easy to create a new grouping factor with this property from implicitly nested
factors using the interaction operator ‘:’. The Pixel
data set in the MEMSS package has one grouping factor Dog and another factor Side. If we wish to fit a
model with random effects for “side within dog” we
must first create a dog/side factor as
> Pixel$DS <- with(Pixel,Dog:Side)[drop=TRUE]
In this case the subset expression [drop=TRUE] is
not needed but it is a good practice to use it whenever creating such a factor. The expression results
in any combination of levels that does not occur in
the data being dropped from the list of levels in the
newly created factor.
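A model using the new factor might then be specified as follows; this is only a sketch, with the fixed- and random-effects terms chosen purely for illustration:

> (fm <- lmer(pixel ~ day + (day | Dog) + (1 | DS), Pixel))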

Summary
The lmer function in the lme4 package fits linear
mixed models. There are vast changes in the internals of lmer relative to the earlier lme function and
we hope they will ensure that lmer is faster, more reliable, and easier to use than was lme. The biggest
impact on the user is the change in model specification, which was made so as to clarify the model being
fit. The lmer function uses a single model formula including random-effects expressions that specify both
a linear model term and a grouping factor and can be
used to fit models based on multiple nested or nonnested grouping factors. It can also be used to fit
generalized linear mixed models but that is a topic
for another day.

Bibliography
J. C. Pinheiro and D. M. Bates. Mixed-Effects Models
in S and S-PLUS. Springer, 2000.
Douglas Bates
University of Wisconsin - Madison, U.S.A.
bates@wisc.edu

Using R for Statistical Seismology
by Ray Brownrigg & David Harte
Statistical Seismology is a relatively new term used
to describe the application of statistical methodology to earthquake data. The purpose is to raise
new questions about the physical earthquake process, and to describe stochastic components not explained by physical models. Such stochastic quantification allows one to test the validity of various physical hypotheses, and also makes probability forecasts
more feasible.
We describe here a suite of R packages, known
as the "Statistical Seismology Library" (SSLib). This
paper firstly introduces SSLib, providing a little history, and a description of its structure. Then the
various components of SSLib are described in more
detail and some examples are given. While the
packages were developed primarily to analyse earthquake data, two of the packages (Fractal and PtProcess) have more general applicability.
For a general text on seismology, see Lay and Wallace (1995). Further, a large collection of seismological software can be found at http://orfeus.knmi.nl/other.services/software.links.shtml.

Introduction to SSLib
The Statistical Seismology Library (SSLib) is a collection of earthquake hypocentral catalogues (3D location of the rupture initiation point, time and magnitude) and R functions to manipulate, describe and
model event data contained in the catalogues. At this
stage, analyses include graphical data displays, fitting of point process models, estimation of fractal dimensions, and routines to apply the M8 Algorithm.
While we have named it the "Statistical Seismology
Library", the type of analyses that are performed really only reflect the research interests of the authors.
Part of the rationale was to require our students and
collaborators to formally document their programs
so that others could determine what they were supposed to do, and to be able to use them after they
have possibly moved on. Another aim was to make
some of our statistical methods and models more directly available to our seismologist and geophysicist
colleagues.
The library is divided into a number of R packages. Details of these and source code can all be
found on the Statistical Seismology Library web
page (http://homepages.paradise.net.nz/david.harte/SSLib/). Package names with a specifically
seismological character are prefixed by "ss" (e.g. the
New Zealand catalogue is named ssNZ), whereas
those with a more general statistical interest are not
(e.g. PtProcess and Fractal). A reference manual
(standard R format) for each package can also be
found on the web page, along with a Users Guide
that contains examples and shows how the different
packages relate to each other (Harte, 2005e).
SSLib was started in 1996 as an S-PLUS library
(Harte, 1998). After being ported to the R language
in 1999, development of SSLib switched to using the
R implementation. At this time, SSLib was only available on the Unix and Linux platforms, but in 2004 a
Windows version was released.

Earthquake Catalogues
Generally, users will want to use their own earthquake catalogues. However SSLib does contain
various earthquake catalogues including the New
Zealand catalogue and the PDE catalogue (Preliminary Determinations of Epicentres) which is
a worldwide catalogue collated by the US Geological Survey. Further details of the catalogues
available in SSLib can be found on the SSLib
web page (see http://homepages.paradise.net.nz/david.harte/SSLib/).
The earthquake catalogues are based on raw data
available from the World Wide Web. These data are
generally collected by national seismological observatories. The raw data appears in many different formats, but the SSLib input routines coerce these different formats into a single common structure. This
allows for both a uniformity in analysis, and the ability to make comparisons between the different catalogues. Nevertheless, the data structure within the
catalogue packages allows for the inclusion of any
number of extra data fields; see Harte (2005e) for further details.

Catalogue Manipulation Utilities
The ssBase package (Harte, 2005b) provides catalogue preparation and data manipulation functions.
These include functions for date/time formatting,
data summaries, printing and data subsetting. This
package also contains other functions of a general nature used by the other packages.
A catalogue can be subsetted on any combination
of location, time range, depth range or magnitude
range, with the location being rectangular (in the latitude/longitude sense), circular (actually cylindrical
in 3 dimensions), spherical, or based on an arbitrary
polygon on the surface of the earth. Many of the
analysis functions will work directly with a subset
’object’.
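For example, a rectangular subset can be requested in the same way as in the code used for Figure 1 later in this article; the following sketch is an addition for illustration and its bounds are purely illustrative:

library(ssNZ)
library(ssEDA)
## events of magnitude 5 or more since 1990 in a
## latitude/longitude rectangle
nz.sub <- subset.rect(NZ, minday=julian(1, 1, 1990),
                      minmag=5, minlat=-42, maxlat=-35,
                      minlong=172, maxlong=179)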
Exploratory Data Analyses
The ssEDA package (Harte, 2005c) consists of functions for exploratory data analysis. In particular
these can provide histograms of event depth or event
frequency (monthly or yearly), line plots of magnitude over time, maps of the locations of the epicentres, and frequency-magnitude plots to determine a
b-value (see Figure 2) or to check the completeness of
a catalogue. Further, the epicentre maps can identify
different magnitudes and depths. Interactive graphics can be used to identify individual events on epicentral plots, and dynamic graphics can be used to
show 3-dimensional views with rotation and linked
plots.

Figure 1 shows a map of the epicentres of a subset of events from the New Zealand catalogue. The colour of the point depicts the depth of the event, with the most shallow events at the red end of the spectrum and the deep events at the blue end of the spectrum. The R commands required to display this plot follow.

library(ssNZ)
library(ssEDA)
b <- subset.rect(NZ, minday=julian(1, 1, 1965),
                 minmag=4.5, minlat=-42,
                 maxlat=-35, minlong=172,
                 maxlong=179, mindepth=40)
epicentres(b, depth=c(40, 60, 80, 100, 150,
                      200, Inf), mapname="nz",
           criteria=FALSE, cex=0.8,
           usr=c(172, 179, -42, -35))
title(expression(paste("NZ Events Since 1965 With ",
                       M >= 4.5)))

Figure 1: Epicentral plot of events from the NZ catalogue since 1965 with magnitude ≥ 4.5 (1777 events). The Pacific tectonic plate to the east is subducting the Australian plate to the west.

Figure 2 is a frequency-magnitude plot for the same subset as used in Figure 1 and displays the Gutenberg-Richter law (Lay and Wallace, 1995). This law says that the logarithm of the cumulative number of events with magnitude greater than m is linear as a function of m. The R commands required to present this graph (given the call to subset.rect from the previous figure) follow.

freq.magnitude(b)
title("Frequency Magnitude Power-Law Relationship")

Figure 2: Plot of events from the NZ catalogue since 1965 with magnitude ≥ 4.5 showing the Gutenberg-Richter law. The absolute value of the slope is referred to as the b-value and is typically about one.
(Figures 1 and 2, the epicentral map and the corresponding frequency-magnitude plot, appear here; see the captions below.)

Figure 1: Epicentral plot of events from the NZ catalogue since 1965 with magnitude ≥ 4.5 (1777 events).
The Pacific tectonic plate to the east is subducting the
Australian plate to the west.
Figure 1 shows a map of the epicentres of a subset of events from the New Zealand catalogue. The
colour of the point depicts the depth of the event,
with the most shallow events at the red end of the
spectrum and the deep events at the blue end of the
spectrum. The R commands required to display this
plot follow.
library(ssNZ)
library(ssEDA)
b <- subset.rect(NZ, minday=julian(1, 1, 1965),
                 minmag=4.5, minlat=-42,
                 maxlat=-35, minlong=172,
                 maxlong=179, mindepth=40)
epicentres(b, depth=c(40, 60, 80, 100, 150,
                      200, Inf), mapname="nz",
           criteria=FALSE, cex=0.8,
           usr=c(172, 179, -42, -35))
title(expression(paste("NZ Events Since 1965 With ",
                       M >= 4.5)))

Figure 2: Plot of events from the NZ catalogue since
1965 with magnitude ≥ 4.5 showing the GutenbergRichter law. The absolute value of the slope is referred to as the b-value and is typically about one.
The interactive graphics is provided by the
epicentres.identify function through use of the
identify R function. The dynamic graphics is provided through the threeD function, which links to
an external viewer provided by the xgobi software
(Swayne et al., 1998) via the xgobi R function. Unfortunately xgobi is not easily usable on a Windows
system, so work is in progress to use ggobi (GGobi)
instead.

Point Process Modelling
The PtProcess package (Harte, 2004), which can be
used independently of the other packages, provides
a framework for point process modelling. This includes parameter estimation, model evaluation and
simulation. This package is likely to be of most interest to a general statistical audience.
Our PtProcess package has a slightly different
emphasis to the point process packages spatstat
(available on CRAN) and ptproc (http://sandybox.
typepad.com/software/ptproc/index.html). Our
point process models are for events strictly ordered
by time, and are conditional on the history of the process up to time t. This could be thought of as the
ground process (Daley and Vere-Jones, 2003). Other
event characteristics could be added as marks, e.g.
earthquake event magnitude. The spatstat package
(Baddeley and Turner, 2005) has an emphasis on the
spatial location of items, presumably at the same
point in time, e.g. tree locations or animals in a forest, etc. Emphasis is then on modelling the spatial
intensity. Here marks can be added, e.g. the particular species of tree. The ptproc package (Peng, 2003) is
derived from our package, and has extended the procedure into multidimensional processes. However,
these additional variables are not treated as marks,
and hence it takes a different direction from the one we intend.
While the conditional intensity functions provided within the package have a distinctly seismological flavour, the general methodology and structure of the package is probably applicable to a reasonably large class of point process models. The
models fitted are essentially marked point processes
(Daley and Vere-Jones, 2003), where the mark distribution has been explicitly built into the conditional intensity function. Our next task is to separate the mark distribution and ground intensity function, and further generalise the framework so that
any number of mark distributions can be attached to
a given ground intensity function.
Currently the conditional intensity function is the
most basic "building block". The conditional intensity function, λ (t|Ht ), can be thought of as an instantaneous value of the Poisson rate parameter at time t
and is conditional on the history of the process up to
but not including t. We have given each conditional
intensity function a suffix ".cif". There are a number of "generic"-like functions which perform some
operation given an intensity function, for example,
simulate a process, perform a goodness of fit evaluation, etc.
As an example, consider the ETAS (Epidemic
Type Aftershock Sequence) model. This assumes
that earthquake events behave in a similar way to
an epidemic, where each event reproduces a number
of aftershocks. The larger the event, the more aftershocks that will be reproduced. If various criticality
conditions are satisfied, the aftershock sequence will
eventually die out. See Harte (2004) and Utsu and
Ogata (1997) for further technical details about the
ETAS model.
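
In one common parameterisation (the references cited above give details and variants; this is not a quotation of PtProcess's exact definition), the ETAS conditional intensity is

\[
\lambda(t \mid H_t) = \mu + A \sum_{i : t_i < t}
  e^{\alpha (M_i - M_0)} \left( 1 + \frac{t - t_i}{c} \right)^{-p} ,
\]

where the (t_i, M_i) are the times and magnitudes of earlier events, M_0 is a threshold magnitude, and μ, A, α, c and p are the five model parameters; how these map onto the elements of the vector p used below is left to the package documentation.
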
The package contains an example dataset provided by Yosihiko Ogata. These data were simulated over the time interval (0, 800). Using these data and approximate maximum likelihood solutions for the parameters contained in p (the ETAS model contains 5 parameters), the conditional intensity function can be plotted as follows (see Figure 3).

library(PtProcess)
data(Ogata)
p <- c(0.02, 70.77, 0.47, 0.002, 1.25)
ti <- seq(0, 800, 0.5)
plot(ti, log(etas.cif(Ogata, ti, p)),
     ylab=expression(paste("log ", lambda(t))),
     xlab="Time", type="l", xlim=c(0, 800),
     main="Conditional Intensity Function")

Figure 3: The spikes occur at event times and their
heights are proportional to the event magnitudes,
with an "Omori" decay (Lay and Wallace, 1995) in the
rate after each event.
Further, the log-likelihood can be calculated as
follows.
pp.LL(Ogata, etas.cif, p, c(0, 800))

Using the vector p as initial values, maximum
likelihood parameter estimates can be calculated as
follows.
posterior <- make.posterior(Ogata, etas.cif,
                            c(0, 800))
neg.posterior <- function(params)
    (-posterior(params))
p <- c(0.02, 70.77, 0.47, 0.002, 1.25)
z <- nlm(neg.posterior, p, hessian=TRUE,
         iterlim=1000, typsize=p)

The function make.posterior creates a loglikelihood function to be evaluated on the time interval (0, 800), and also gives one the option of enforcing constraints on the parameters (none here). We
ISSN 1609-3631

Vol. 5/1, May 2005

then create a negative posterior function because the
function nlm is a minimiser. The maximum likelihood parameter estimates are contained within the
object z.
The creation of the posterior function was partly
done so that the function to be "optimised" within SPLUS had only one argument. This is not necessary
in R. Further, the use of priors has not been as useful
as was initially thought. Consequently, it is probably
best to revise the package so that the optimisation
works more directly on the pp.LL function.
One way to test for the goodness of fit is to calculate the transformed residual process. This effectively creates a new time variable which magnifies or
contracts the original process, assuming that the fitted model is correct, in such a manner that the resultant process is a homogeneous Poisson process with
rate parameter one. A plot of the transformed event
times versus the event number should roughly follow the line x = y. Large deviations from this line
indicate a poorly fitting model. This can be achieved
with the following code.
tau <- pp.resid(Ogata, z$estimate, etas.cif)
n <- nrow(Ogata)
plot(1:n, tau, type="l", xlab="Event Number",
     ylab="Transformed Time",
     xlim=c(0, n), ylim=c(0, n))
abline(a=0, b=1, lty=2, col="red")
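
For reference, the transformation computed by pp.resid above is the usual point-process time rescaling, stated here in its textbook form (Daley and Vere-Jones, 2003):

\[
\tau_i = \int_0^{t_i} \lambda(s \mid H_s)\, ds ,
\]

so that, if the fitted model is correct, the τ_i form a homogeneous Poisson process with rate one, which is exactly the straight-line behaviour checked in the plot.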

Using the maximum likelihood parameter estimates, one can simulate the process over the subsequent interval (800, 1200), say. This is achieved as
follows.
x <- pp.sim(Ogata, z$estimate, etas.cif,
            TT=c(800, 1200))

The object x contains the original Ogata data,
with the new simulated events appended. One may
be interested in forecasting the time taken for an
event with magnitude ≥ 6, say, to occur after time
800. One would then perform many such simulations, and determine the empirically simulated distribution for the given event of interest.
As can be seen, the conditional intensity function
is the essential ingredient in each of the above analyses, and hence an entity like this is an essential component in a more object oriented setup for these models. It is a relatively simple step to set this up in a
more object oriented manner, however, we have held
off with this until we have disentangled the conditional intensity into its ground process and a general
number of mark distributions.

Other Analyses
The ssM8 package (Harte, 2005d) implements the
Keilis-Borok & Kossobokov M8 algorithm (Keilis-Borok and Kossobokov, 1990). It is an empirically
based algorithm that attempts to deduce "times of
increased probability". We have been involved in
projects that have attempted to test the efficacy of
this algorithm.
The Fractal package (Harte, 2005a) has been used
to calculate various fractal dimensions based on
earthquake hypocentral locations, for example, see
Harte (2001).

Problems and Future Development
We have already mentioned a number of extensions
to the Point Process package: separating the conditional intensity into a ground intensity and a general number of mark distributions, writing more object oriented code, and determining if there is still
a need for the make.posterior function. To implement more object oriented code, the naming conventions would clearly need to be changed.
There is a difference between the functions contained
in the chron package (Ripley and Hornik, 2001;
Grothendieck and Petzoldt, 2004) and our "datetimes" format in ssBase (Harte, 2005b). We would
prefer to use the chron functions, however, we
would like the format.times function within chron
to have greater flexibility, including fractional numbers of seconds. For example, we would like the
ability to specify a times format as hh:mm:ss.s, or
hh:mm:ss.ss. Further, many historical earthquake
events do not have all date-time components available, and our "datetimes" format has a mechanism to
deal with this. Note that the POSIX functions are inappropriate, as the event times are generally stored
as UTC times.
A feature that we have included in all of our packages is a "changes" manual page. This documents all
recent changes made to the package with the date.
We have found this particularly useful when old
analyses have been repeated and different answers
are produced!
As noted earlier, the type of analyses included in
SSLib largely reflects our own research interests. This
also means that it is continually being changed and
augmented.

Bibliography
A. Baddeley and R. Turner. Spatstat: an R package
for analyzing spatial point patterns. Journal of Statistical Software, 12(6):1–42, 2005. ISSN 1548-7660.
URL http://www.jstatsoft.org. 33
D. Daley and D. Vere-Jones. An Introduction to the
Theory of Point Processes, volume I: Elementary
Theory and Methods. Springer-Verlag, New York,
second edition, 2003. 33
GGobi. GGobi data visualization system. http://www.ggobi.org/, 2003. 32

G. Grothendieck and T. Petzoldt. R help desk: Date
and time classes in R. R News, 4(1):29–32, 2004. 34
D. Harte. Documentation for the Statistical Seismology Library. Technical Report 98-10, School
of Mathematical and Computing Sciences, Victoria University of Wellington, Wellington, New
Zealand, 1998. 31
D. Harte. Multifractals: Theory and Applications. Chapman and Hall/CRC, Boca Raton, 2001. 34
D. Harte. Package PtProcess: Time Dependent Point
Process Modelling.
Statistics Research Associates, Wellington, New Zealand, 2004. URL
http://homepages.paradise.net.nz/david.
harte/SSLib/Manuals/pp.pdf. 32, 33
D. Harte. Package Fractal: Fractal Analysis. Statistics
Research Associates, Wellington, New Zealand,
2005a. URL http://homepages.paradise.net.
nz/david.harte/SSLib/Manuals/fractal.pdf.
34
D. Harte.
Package ssBase:
Base Functions
for SSLib.
Statistics Research Associates,
Wellington, New Zealand, 2005b.
URL
http://homepages.paradise.net.nz/david.
harte/SSLib/Manuals/base.pdf. 31, 34
D. Harte. Package ssEDA: Exploratory Data Analysis
for Earthquake Data. Statistics Research Associates, Wellington, New Zealand, 2005c. URL
http://homepages.paradise.net.nz/david.
harte/SSLib/Manuals/eda.pdf. 32
D. Harte.
Package ssM8: M8 Earthquake Forecasting Algorithm.
Statistics Research Associates, Wellington, New Zealand, 2005d. URL
http://homepages.paradise.net.nz/david.
harte/SSLib/Manuals/m8.pdf. 34

D. Harte. Users Guide for the Statistical Seismology Library.
Statistics Research Associates,
Wellington, New Zealand, 2005e.
URL
http://homepages.paradise.net.nz/david.
harte/SSLib/Manuals/guide.pdf. 31
V. Keilis-Borok and V. Kossobokov. Premonitory activation of earthquake flow: algorithm M8. Phys.
Earth & Planet. Int., 61:73–83, 1990. 34
T. Lay and T. Wallace. Modern Global Seismology. Academic Press, San Diego, 1995. 31, 32, 33
R. Peng. Multi-dimensional point process models in
R. Journal of Statistical Software, 8(16):1–24, 2003.
ISSN 1548-7660. URL http://www.jstatsoft.
org. 33
B. Ripley and K. Hornik. Date-time classes. R News,
1(2):8–11, 2001. 34
D. F. Swayne, D. Cook, and A. Buja.
XGobi:
Interactive dynamic data visualization in the X
window system. Journal of Computational and
Graphical Statistics, 7(1):113–130, 1998.
ISSN
1061-8600. URL http://www.research.att.com/
areas/stat/xgobi/. 32
T. Utsu and Y. Ogata. Statistical analysis of seismicity. In J. Healy, V. Keilis-Borok, and W. Lee, editors, Algorithms for Earthquake Statistics and Prediction, pages 13–94. IASPEI, Menlo Park CA, 1997.
33
Ray Brownrigg
Victoria University of Wellington
ray@mcs.vuw.ac.nz
David Harte
Statistics Research Associates
david@statsresearch.co.nz

Literate programming for creating and
maintaining packages
Jonathan Rougier

Outline
I describe a strategy I have found useful for developing large packages with lots of not-obvious mathematics that needs careful documentation. The basic
idea is to combine the ‘noweb’ literate programming
tool with the Unix ‘make’ utility. The source of the
package itself has the usual structure, but with the
addition of a noweb directory alongside the R and man
directories. For further reference below, the general
directory tree structure I adopt is
pkgName
|-- man, ...
|-- inst
|   `-- doc
|-- R
`-- noweb
    `-- doc

where ‘. . . ’ denotes other optional directories such as
src and data.
The files in the noweb directory are the entry-point
for all of the documented code in the package, including *.R files, *.Rd files, low-level source code,
datasets, and so on. The noweb tool, under the control of the file ‘Makefile’ and the Unix make utility, is
used to strip the files in the noweb directory into the
appropriate places in the other directories. Other targets in ‘Makefile’ also handle things like building the
package, installing it to a specific location, cleaning
up and creating various types of documentation.
A complete example R package is available
at http://maths.dur.ac.uk/stats/people/jcr/
myPkg.tar.gz containing the code below, and a bit
more to make ‘myPkg’ into a proper R package.

Literate programming and noweb
The basic idea of literate programming is that when
writing complicated programs, it makes a lot of sense
to keep the code and the documentation of the code
together, in one file. This requires a mechanism for
treating the file one way to strip out and assemble the
code, and another way to create the documentation.
With such a mechanism in place, the functionality
of the code can be made much more transparent—
and the documentation much more informative—
because the code itself can be written in a modular or
‘bottom-up’ fashion, and then assembled in the correct order automatically.
I use the noweb literate programming tool. A
noweb file looks like a LATEX file with additional environments delimited by <<chunk name>>= and @.
These define code chunks that appear specially-formatted in the documentation, and
which get stripped out and assembled into the actual code. Here is an example of a minimal noweb
file, called, say, ‘myDoc1.nw’.
\documentclass[a4paper]{article}
\usepackage{noweb}\noweboptions{smallcode}
\pagestyle{noweb}
\begin{document}
Here is some \LaTeX\ documentation. Followed
by the top-level source code.
<<R>>=
# The source of this file is noweb/myDoc1.nw
<<foo function>>
<<bar function>>
@
Now I can describe the [[foo]] function in more
detail, and then give the code.
<<foo function>>=
"foo" <- function(x) 2 * x^2 - 1
@
And here is the [[bar]] function.
<<bar function>>=
"bar" <- function(y) sqrt((1 + y) / 2)
@

% stuff for man page not shown
\end{document}

The [[name]] syntax is used by noweb to highlight
variable names. There are many additional features
of noweb that are useful to programmers. More information can be found at the noweb homepage, http:
//www.eecs.harvard.edu/~nr/noweb/.

Tangle
noweb strips out the code chunks and assembles them
using the ‘notangle’ command. One useful feature is
that it can strip out chunks with specific identifiers
using the -R flag. Thus the command notangle -RR
myDoc1.nw will strip all the code chunks identified
directly or indirectly by <<R>>=, assembling them in
the correct order. The result of this command is the
file
# The source of this file is noweb/myDoc1.nw
"foo" <- function(x) 2 * x^2 - 1
"bar" <- function(y) sqrt((1 + y) / 2)

which is written to standard output. By redirecting
the standard output we can arrange for this file to go
in the appropriate place in the package directory tree.
So the full command in this case would be notangle
-RR myDoc1.nw > ../R/myDoc1.R. So as long as we
are consistent in always labelling the ‘top’ chunk of
the R code with an <<R>>= identifier, we can automate the process of getting the R code out.
If we want, we can do the same thing using
<<man>>= for the content of man pages, <<src>>= for
low-level code, <<data>>= for datasets, and so on.
Each of these can be stripped out and saved as a file
into the appropriate place in the package directory
tree.

Weave
To build the documentation, noweb uses the ‘noweave’
command, where it is useful to set the -delay option.
Again, this is written to the standard output, and can
be redirected to place the resulting LATEX file at the
appropriate place in the package tree. I put the LATEX
versions of the *.nw files in the directory noweb/doc.
Thus the appropriate command is noweave -delay
myDoc1.nw > doc/myDoc1.tex. This file contains
quite a lot of LATEX commands inserted by noweb to
help with cross-referencing and indexing, and so is
not particularly easy to read. But after using latex,
the result is a dvi file which integrates the LATEX documentation and the actual code, in a very readable
format.

April 13, 2005    myDoc1.nw    1

Here is some LATEX documentation. Followed by the top-level source code.

⟨R⟩≡
  # The source of this file is noweb/myDoc1.nw
  ⟨foo function⟩
  ⟨bar function⟩

Now I can describe the foo function in more detail, and then give the code.

⟨foo function⟩≡
  "foo" <- function(x) 2 * x^2 - 1

And here is the bar function.

⟨bar function⟩≡
  "bar" <- function(y) sqrt((1 + y) / 2)

Similarity to ‘Sweave’
Many users of R will be familiar with Friedrich
Leisch’s ‘Sweave’ function, available in the tools
package. Sweave provides a means of embedding R
code into a report, such as a LATEX report, and then
automatically replacing this code with its output. It
uses the notangle functionality of a literate programming tool like noweb, because the R code must be
stripped out, prior to evaluating it. The difference
is that when Sweave builds the resulting document it
does more than noweave would do, because it must
actually evaluate R on each code chunk and put the
output back into the document. In the literate programming described in this article, all that notangle
does is wrap the code chunk in a LATEX command
to format it differently. Sweave can duplicate the
simpler behaviour of notangle with the arguments
echo=TRUE,eval=FALSE in each code chunk: see the
help file for the function ‘RWeaveLatex’ for more details.
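
For example, a chunk written as follows (a minimal sketch; the label foo is arbitrary) is shown in the woven document but never evaluated, which reproduces the notangle-style behaviour described above.

<<foo, echo=TRUE, eval=FALSE>>=
"foo" <- function(x) 2 * x^2 - 1
@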

Using make
Our starting point is a collection of *.nw files in the
noweb subdirectory. We could, by hand, issue a series
of notangle and noweave commands, each time we
update one or more of these files. Happily, the Unix
tool make can be used instead. make requires a file
of specifiers, often called ‘Makefile’, which describe
the kinds of things that can be made, and the commands necessary to make them. Obvious candidates
for things we would like to make are the *.R and
*.Rd files that go into the R and man subdirectories,
and the documentation that goes into the inst/doc
subdirectory.
The make utility is very powerful, and complicated. I am not an expert at writing a ‘Makefile’,
but the following approach seems to work well. My
Makefile lives in the noweb directory, and all of the
path specifiers take this as their starting point.

Simple make targets
Suppose that we have compiled a list of files that ought to be in the R subdirectory (say, ‘RFILES’), and we want to use the command ‘make R’ to update these files according to changes that have occurred in the noweb/*.nw files; the argument R is known as a ‘target’ of the make command. We need the following lines in Makefile:
# make R files in ../R
R: $(RFILES)

$(RFILES): %.R : %.nw
        @echo 'Building R file for $<'
        @notangle -RR $< > ../R/$@
        @if test `egrep '^<<man>>=' $<` ; then \
          notangle -Rman $< > ../man/$*.Rd ; fi

After the comment, prefixed by #, the first line states
that the command ‘make R’ depends on the files in
the list ‘RFILES’, which we have already compiled.
The next set of lines is executed for each component
of ‘RFILES’ in turn, conditionally on the *.R file being
older than the corresponding *.nw file. Where the *.R
file is out-of-date in this way, a message is printed,
notangle is called, and the *.R file is rebuilt and
placed in the R subdirectory; if there is a <<man>>=
entry, the *.Rd file is also rebuilt and placed in the
man subdirectory.
A ‘Makefile’ has its own syntax. Within a target
specification, the tags ‘$<’, ‘$@’ and ‘$*’ indicate the
origin file, the result file, and the file stem. Under
each initial line the collection of subsequent commands is indented by tabbing, and the backslash at
the end of the line indicates that the next line is a continuation. The @ at the start of each command line
suppresses the printing of the command to the standard output.
The attractive feature of make is that it only rebuilds files that are out-of-date, and it does this
all automatically using information in the file datestamps. This saves a lot of time with a large package in which there are many *.nw files in the noweb
directory. In my experience, however, one feels a
certain loss of control with so much automation.
The useful command ‘make --dry-run R’ will print
out the commands that will be performed when the
command ‘make R’ is given, but not actually do
them, which is a useful sanity check. The command ‘touch myDoc1.nw’ will change the date-stamp
on ‘myDoc1.nw’, so that all derived files such as
‘myDoc1.R’ will automatically be out-of-date, which
will force a rebuild even if ‘myDoc1.nw’ has not been
modified: this can be another useful sanity-check.
Suppose instead we want to rebuild the LATEX
documentation in the noweb/doc subdirectory, using
the command ‘make latex’. We need the following
lines in ‘Makefile’:
# make latex files in doc

latex: $(TEXFILES)

$(TEXFILES): %.tex : %.nw
        @echo 'Building tex file for $<'
        @noweave -delay $< > doc/$@
        @cd doc; \
          latex $@ > /dev/null

where ‘TEXFILES’ is a list of files that ought to be in
the noweb/doc subdirectory. As before, each *.tex
file is only rebuilt if it is out-of-date relative to the
corresponding *.nw file. The resulting *.tex file is
put into noweb/doc, and then latex is called to create
the corresponding *.dvi file (rather lazily, the output
from latex is diverted to /dev/null and lost).
Slightly more complicated, suppose we want to
put pdf versions of the *.tex files into the subdirectory inst/doc, using the command ‘make pdf’. We
need the following lines in ‘Makefile’:
# make pdf files in ../inst/doc
pdf: $(TEXFILES) $(PDFFILES)

$(PDFFILES): %.pdf : %.tex
        @echo 'Building pdf file for $<'
        @cd doc; \
          pdflatex $< > /dev/null;
        @mv doc/$@ ../inst/doc/

where ‘PDFFILES’ is a list of files that ought to be
in the inst/doc subdirectory. The new feature here
is that the command ‘make pdf’ depends on both
‘TEXFILES’ and ‘PDFFILES’. This means that each
component of ‘TEXFILES’ is updated, if necessary,
before the *.pdf file is rebuilt from the *.tex file.
To rebuild the *.pdf file, the pdflatex command is
called on the *.tex file in noweb/doc, and then the
result is moved to inst/doc.

Additional lines in the ‘Makefile’
The main role of additional lines in ‘Makefile’ is to
supply the lists ‘RFILES’, ‘TEXFILES’ and so on. The
following commands achieve this, and a little bit else
besides, and go at the start of ‘Makefile’:
## minimal Makefile
# look in other directories for files
vpath %.R ../R
vpath %.tex doc
vpath %.pdf ../inst/doc
# get lists of files
NWFILES = $(wildcard *.nw)
RFILES = $(NWFILES:.nw=.R)
TEXFILES = $(NWFILES:.nw=.tex)
PDFFILES = $(NWFILES:.nw=.pdf)

# here are the various targets for make
.PHONY: R latex pdf install
# Now insert the lines from above ...

The three ‘vpath’ lines are necessary to help make to look in other directories than noweb for certain types of file. Thus the *.R files are to be found in ../R, the *.tex files in doc, and so on (all relative to the noweb directory). The next four lines compile the lists of various types of file: ‘NWFILES’ are found in the noweb directory, ‘RFILES’ are ‘NWFILES’ files with the ‘nw’ suffix replaced with a ‘R’ suffix, and so on. For the final line, the .PHONY command is a bit of defensive programming to identify the targets of the make command.

Example
The file http://maths.dur.ac.uk/stats/people/
jcr/myPkg.tar.gz contains a small example of an R
package created using noweb and make, where there
is only one *.nw file in the noweb directory, namely
myDoc1.nw. The following commands illustrate the
steps outlined above (‘debreu’ is my computer):
debreu% make --dry-run R
make: Nothing to be done for `R'.
debreu% touch myDoc1.nw
debreu% make --dry-run R
echo 'Building R file for myDoc1.nw'
notangle -RR myDoc1.nw > ../R/myDoc1.R
if test `egrep '^<<man>>=' myDoc1.nw` ; then \
  notangle -Rman myDoc1.nw > ../man/myDoc1.Rd ; fi
debreu% make R
Building R file for myDoc1.nw
debreu% make --dry-run pdf
echo 'Building tex file for myDoc1.nw'
noweave -delay myDoc1.nw > doc/myDoc1.tex
cd doc; \
latex myDoc1.tex > /dev/null
echo 'Building pdf file for myDoc1.tex'
cd doc; \
pdflatex myDoc1.tex > /dev/null;
mv doc/myDoc1.pdf ../inst/doc/
debreu% make pdf
Building tex file for myDoc1.nw
Building pdf file for myDoc1.tex

Initially the source file myDoc1.nw is up-to-date. I
touch it to make it more recent than the derived
files myDoc1.{R,Rd,tex,pdf}. The dry run shows
the commands that will be executed when I type
make R: in this case the file myDoc1.R is built and
moved to ../R. After issuing the command make R
the only output to screen is the comment ‘Building R
file for myDoc1.nw’. The dry run for make pdf shows
a more complicated set of operations, because before
myDoc1.pdf can be built, myDoc1.tex has to be built.

More complicated targets
The make command can also be used to perform more
complicated operations. Two that I have found useful are building the package, and installing the package onto my $R_LIBS path. In both cases, these operations must first make sure that the files are all up-todate. For ‘make install’, for example, we might have:
PKGNAME = myPkg
RLIBRARY = /tmp/
R = R

# install
install: $(RFILES) $(PDFFILES)
        @echo 'Installing $(PKGNAME) at $(RLIBRARY)'
        @cd ../..; \
          $(R) CMD INSTALL -l $(RLIBRARY) $(PKGNAME)

where the variables ‘PKGNAME’, ‘RLIBRARY’ and ‘R’
are specified separately, so that it is easy to change
them for different packages, different locations on
$R_LIBS or different versions of R. These are best put
at the top of the ‘Makefile’. The target install should
be added to the .PHONY line.
Issuing the make install command when everything is up-to-date gives:
debreu% make --dry-run install
echo 'Installing myPkg at /tmp/'
cd ../..; R CMD INSTALL -l /tmp/ myPkg
debreu% make install
Installing myPkg at /tmp/
* Installing *source* package 'myPkg' ...
** R
** inst
** help
 >>> Building/Updating help pgs for package 'myPkg'
     Formats: text html latex example
* DONE (myPkg)

If some of the files were not up-to-date then they
would have been rebuilt from the original *.nw files
first.

Conclusion
The great thing about literate programming is that
there is no arguing with the documentation, since the
documentation actually includes the code itself, presented in a format that is easy to check. For those
of us who use LATEX, noweb is a very simple literate
programming tool. I have found using noweb to be a
good discipline when developing code that is moderately complicated. I have also found that it saves
a lot of time when providing code for other people,
because the programmer’s notes and the documentation become one and the same thing: I shudder to
think how I used to write the code first, and then document it afterwards.
For large projects, for which the development of
an R package is appropriate, it is often possible to
break the tasks down into chunks, and assign each
chunk to a separate file. At this point the Unix make
utility is useful both for automating the processing
of the individual files, and also for speeding up this
process by not bothering with up-to-date files. The
‘Makefile’ that controls this process is almost completely generic, so that the one described above can
be used in any package which conforms to the outline given in the introduction, and which uses the
tags ‘R’, ‘man’ and so on to identify the type of code
chunk in each *.nw file.
Jonathan Rougier
Department of Mathematical Sciences, University of
Durham, UK
J.C.Rougier@durham.ac.uk

CRAN Task Views
by Achim Zeileis
With the fast-growing list of packages on CRAN (currently about 500), the following two problems have become more apparent over the last few years:
1. When a new user comes to CRAN and is looking for packages that are useful for a certain
task (e.g., econometrics, say), which of all the
packages should he/she look at as they might
contain relevant functionality?
2. If it is clear that a collection of packages is useful for a certain task, it would be nice if the full
collection could be installed easily in one go.
The package ctv tries to address both problems by
providing infrastructure for maintained task views
on CRAN-style repositories. The idea is the following: a (group of) maintainer(s) should provide: (a)
a list of packages that are relevant for a specific task
(which can be used for automatic installation) along
with (b) meta-information (from which HTML pages
can be generated) giving an overview of what each
package is doing. Both aspects of the task views are
equally important as is the fact that the views are
maintained. This should provide some quality control and also provide the meta-information in the jargon used in the community that the task view addresses.
Using CRAN task views is very simple: the
HTML overviews are available at http://CRAN.
R-project.org/src/contrib/Views/ and the task
view installation tools are very similar to the package
installation tools. The list of views can be queried by
CRAN.views() that returns a list of "ctv" objects:
R> library(ctv)
R> x <- CRAN.views()
R> x

CRAN Task Views
--------------
Name: Econometrics
Topic: Computational Econometrics
Maintainer: Achim Zeileis
Repository: http://cran.r-project.org
--------------
Name: Finance
Topic: Empirical Finance
Maintainer: Dirk Eddelbuettel
Repository: http://cran.r-project.org
--------------
Name: MachineLearning
Topic: Machine Learning & Statistical Learning
Maintainer: Torsten Hothorn
Repository: http://cran.r-project.org
--------------
Name: gR
Topic: gRaphical models in R
Maintainer: Claus Dethlefsen
Repository: http://cran.r-project.org

R> x[[1]]
CRAN Task View
--------------
Name:       Econometrics
Topic:      Computational Econometrics
Maintainer: Achim Zeileis
Repository: http://cran.r-project.org
Packages:   bayesm, betareg, car*, Design,
            dse, dynlm, Ecdat, fCalendar,
            Hmisc, ineq, its, lmtest*, Matrix,
            micEcon, MNP, nlme, quantreg,
            sandwich*, segmented, sem,
            SparseM, strucchange, systemfit,
            tseries*, urca*, uroot, VR,
            zicounts, zoo*
(* = core package)

Note that currently each CRAN task view is associated with a single CRAN-style repository (i.e., a repository which has in particular a src/contrib structure). Future versions of ctv should relax this and make it possible to include packages from various repositories in a view, but this is not implemented yet.
A particular view can be installed subsequently
by either passing its name or the corresponding
"ctv" object to install.views():
R> install.views("Econometrics",
+
lib = "/path/to/foo")
R> install.views(x[[1]],
+
lib = "/path/to/foo")
An overview of these client-side tools is given on
the manual page of these functions.
Writing a CRAN task view is also very easy: all information can be provided in a single XML-based format called .ctv. The .ctv file specifies the name,
topic and maintainer of the view, has an information
section (essentially in almost plain HTML), a list of
the associated packages and further links. For examples see the currently available views in ctv and also
the vignette contained in the package. All it takes for
a maintainer to write a new task view is to write this
.ctv file, the rest is generated automatically when
the view is submitted to us. Currently, there are task
views available for econometrics, finance, machine
learning and graphical models in R—furthermore,
task views for spatial statistics and statistics in the social sciences are under development. But to make
these tools more useful, task views for other topics
are needed: suggestions for new task views are more
than welcome and should be e-mailed to me. Of
course, other general comments about the package
ctv are also appreciated.
Achim Zeileis
Wirtschaftsuniversität Wien, Austria
Achim.Zeileis@R-project.org

Using Control Structures with Sweave
Damian Betebenner
Sweave is a tool loaded by default with the utils
package that permits the integration of R/S with
LATEX. In one of its more prominent applications,
Sweave enables literate statistical practice—where
R/S source code is interwoven with corresponding LATEX formatted documentation (R Development
Core Team, 2005; Leisch, 2002a,b). A particularly elegant implementation of this is the vignette() function (Leisch, 2003). Another, more pedestrian, use
of Sweave, which is the focus of this article, is the
batch processing of reports whose contents, including figures and variable values, are dynamic in nature. Dynamic forms are common on the web where
a base .html template is populated with user specific
data drawn, most often, from a database. The incorporation of repetitive and conditional control structures into the processed files allows for almost limitless possibilities to weave together output from R/S
within the confines of a LATEX document and produce
professional quality dynamic output.

Motivation for the current project came as a result
of pilot data analyses for a state department of education. Dissemination of the analyses to schools in
the state required the production of approximately
3,000 school reports documenting the results using
school specific numbers/text, tables, and figures.
The R/Sweave/LATEX combination was deployed in
various ways to produce prototype reports. This article provides details of the process so that others with
similar document production needs can employ this
combination of tools.

Outline of Process
The production process is quite basic in its implementation: Sweave a file using an appropriate
dataframe to create a .tex that includes appropriate output. Most users of Sweave would have little
problem creating such a .tex file based upon a given
dataframe. Adding the capacity to deal with numerous different dataframes is possible using basic control structures available in R. In general, there are two
files necessary for the mass production of reports:
template.Rnw A template file that provides the template into which all the relevant R output is
placed.
master.R A master file that loops over all the recipients of the template file, creates the subset of data to be used within the template file,
Sweaves the template and saves the output to
a recipient specific .tex file.
In the following, the essential parts of each of these
files is discussed in detail:

master.R
The master.R file implements a control structure outside of the Sweave process. This allows one to create
a .tex file for each recipient. The key to the creation
of these files is the creation of a distinct outfile for
each unique recipient in the list. A basic master.R
file which performs this looks like:
load(the.data.frame)
recip.list <- unique(the.data.frame$recip)
for (name in recip.list){
    subset.data <- subset(the.data.frame,
                          the.data.frame$recip==name)
    outfile <- paste(name, "tex", sep=".")
    Sweave("template.Rnw", output=outfile)
}
The process starts by sourcing master.R. This
loads a master data frame that contains the data for
all the potential report recipients. A list of recipients is then derived from that dataframe. A looping
structure is implemented that subsets the appropriate data file from the master data frame. Crucially,
this subsetted data frame contains the recipient specific data that is used to construct the individualized
report. Finally, a recipient specific .tex file is created
by by Sweaving the template, template.Rnw.
Depending upon the number of recipients, it is
entirely possible for thousands of .tex files to be
generated. To avoid the files accumulating in the
same directory, one can easily modify the code to
place the output in other directories. For example,
if one wished to create a distinct directory for each
recipient, the following additions would suffice:
for (name in recip.list){
    subset.data <- subset(the.data.frame,
                          the.data.frame$recip==name)
    outfile <- paste(name, "tex", sep=".")
    dir.create(name)
    setwd(name)
    Sweave(file.path("..", "template.Rnw"),
           output=outfile)
    setwd("..")
}
This basic control structure can be tweaked in many
ways. For example, a call can be made to a relational
database that pulls the relevant data into R for processing. This is particularly attractive for those applications where data storage is cumbersome and better
administered with a relational database.
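
As a sketch of that variation (assuming an ODBC data source named reports and a table results with a recip column; RODBC is only one of several possible interfaces), the subsetting step in master.R could become:

library(RODBC)
ch <- odbcConnect("reports")
## pull only the rows for the current recipient
subset.data <- sqlQuery(ch,
    paste("select * from results where recip = '",
          name, "'", sep=""))
odbcClose(ch)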

template.Rnw
The template, template.Rnw, is a standard Sweave
file containing LATEX markup together with documentation and code chunks. The file, as its name suggests, is a template into which recipient specific numbers/text, tables, and figures can be included. One of
the greatest challenges to producing a template is ensuring that it will produce the expected output given
the range of possible values it is expected to accept.
One of the easiest ways of providing recipient
specific output with the template.Rnw file is through
the liberal use of the \Sexpr function. For the aforementioned school reports, the name of the school
was included throughout the text, tables, and figures
to provide a customized feel. In particular, the fancyhdr LATEX package was used and the school name was
included in the left header. Because school names,
and string expressions in general, vary widely in
their length, care must be exercised so that any place
that the \Sexpr is utilized, the string (or number) resulting from \Sexpr will function in the LATEX document for the range of values of the field that occur in
the data set.
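
A minimal sketch of that header (assuming the template defines an R string school.name from the subsetted data frame) is:

\usepackage{fancyhdr}
\pagestyle{fancy}
\lhead{\Sexpr{school.name}}
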
The creation of recipient specific tables is a
frequently encountered (particularly with .html)
means of dynamic report generation. There are
a number of ways of including tables into the
template.Rnw document. Perhaps the simplest way
is through the use of the xtable function within
David Dahl’s xtable package which produces LATEX
output for a variety of R objects. A basic code chunk
yielding .tex markup producing a table is given by
<<results=tex>>=
xtable(object, caption="table:caption",
       label="table:label")
@
Though suitable for a variety of tasks, the function xtable is limited in its capacity to produce
complex tables (e.g., tables with multicolumn headers). Another frequently encountered difficulty occurs when tables being produced are long and possibly extend over several pages. Such long tables are
not a problem with |.html| because the pages can
be of nearly infinite length. However, printed pages
must have the table broken into pieces according to
the space available on the page. The LATEX code produced by xtable will not yield acceptable results. To
circumvent this, one can use the latex.table function provided by Roger Koenker’s quantreg package together with the option for using the longtable
LATEX package. The latex.table function accepts
matrix objects and outputs results to a specified file.
As such, its incorporation into the Sweave document
requires a slightly different two part technique than
that used with xtable:
1. latex.table is called within a code chunk and
produces the required LATEX markup and saves
it to an output file.
<<>>=
latex.table(matrix,
            file=paste(name, ":table.tex",
                       sep=""))
@
2. The LATEX markup for the table is inserted into
the document (n.b., outside of a code chunk) at
the appropriate place:
\begin{center}
\input{\Sexpr{name}:table.tex}
\end{center}
As with xtable, there are still limitations to
the amount of tabular complexity available with
latex.table.
Multicolumn headings, multirow
headings, colored columns, etc. are all difficult if
not impossible to produce. Depending upon ones
willingness to program, however, all effects can ultimately be produced. Though tedious, frequent use
of Sexpr commands outside of code chunks or cat
commands within code chunks allows all LATEX commands to be incorporated into the final .tex document upon Sweave compilation.

In many instances, particularly with tables and
figures, it’s necessary to produce multiple tables or
figures within a single Sweave document based upon
particular attributes of the recipient. For example,
with the school project each school’s report contained information pertinent to each grade in the
school. Thus, a table was produced for each grade
within the school. To implement this, a loop is
placed within template.Rnw. This procedure, given
in FAQ A.8 from Friedrich Leisch’s Sweave User
Manual (Leisch, 2005), is useful in many situations.
A major difficulty with control structures within
the template is the variable nature of the output. In
the example with school reports, some rural schools
have grades 1 through 12. Thus, if a table or figure
is produced for each grade, one must take account
of the fact that so many tables or figures will result
in the .tex document. The simplest way to accommodate different numbers of loop iterations within
the template is to create a \newpage each time the
loop completes. Thus, for example, each figure for
each grade receives a page to itself. This doesn’t always look acceptable for especially parsimonious tables or figures. However, combining multiple tables
on a single page can often cause overfull hbox and
vbox situations to occur. A workaround to this difficulty is to use the LATEX \scalebox and \resizebox
commands. These commands allow great flexibility
in shrinking output to fit a given page. In general,
a good deal of planning and experimentation is necessary to produce a template that works well for all
situations.
The extent to which one wishes to make the reports specific to a particular recipient can be extended far beyond expressing the recipient specific
information in tables and figures. For reports where
the recipient is an individual, text within the report can be made gender specific through the use
of if else statements based upon values in the
subsetted data frame that the template is using to
construct the recipient specific report. Supposing
the gender code in the subsetted data frame is 1
for males and 2 for females, one could incorporate
he/she (or him/her) pronouns into the text using a
simple if else statement such as
\Sexpr{if (data.frame$gender==1) "he" else "she"}
The amount of customization is limited only by the
amount of work put into constructing the template.
One is not limited to the basic LATEX classes
(e.g., article.cls or report.cls) when producing the template. The beamer.cls (Tantau, 2004) is
Sweave compatible and provides extensive functionality to produce rich, full color .pdf output with
extensive hyperlink capabilities. Figure 1 presents
a two page example demonstrating some possibilities of the R/LATEX/Sweave combination for dynamic
.pdf report generation using the beamer.cls. Other LATEX classes, packages and font families are available that can be combined to yield unlimited possibilities.

Figure 1: A basic two-page example demonstrating dynamic .pdf using the R/LATEX/Sweave combination.

Final Compilation
Upon Sweave completion, master.R yields numerous .tex files that need to be compiled. For the school project, approximately 3,000 .tex files were created. In addition, there were, in most cases, three .pdf figures created for each school. All of these files can be compiled from within R using the texi2dvi command available in the tools package. The following loop, which can be placed into master.R, accomplishes this task:

for (name in recip.list){
    outfile <- paste(name, "tex", sep=".")
    texi2dvi(outfile, pdf=TRUE, clean=TRUE)
}

texi2dvi will run LATEX and BIBTEX the appropriate
number of times to resolve any reference and citation
dependencies. The result is compiled versions of all
the .tex files generated from the earlier loop containing the Sweave process. It is also possible to perform
this task outside of R using a script or batch file to
perform the same operations.

Conclusions
For batch production of dynamic .pdf or .ps documentation, the R/Sweave/LATEX combination is
powerful and nearly limitless in its capabilities. A
major benefit to producing reports using this combination of tools is how closely Sweave integrates the
processing power of R with the typesetting capability of LATEX. The key to producing dynamic reports
for a large number of recipients is the use of iterative control structures in R. This article provides the
author’s “homebrew” code. Other, more elegant, solutions are likely possible.

A next step after report generation is report distribution. In theory, given the appropriate server configuration, it should be possible to electronically distribute the reports to the appropriate recipient based upon, for example, an email address contained in the database. I would appreciate learning how others have addressed this and similar problems.

Acknowledgments

I would like to thank Bill Oliver for introducing me to R and for his gracious help. In addition, I would like to thank one of the reviewers for providing particularly insightful improvements to my code. Finally, I would like to express gratitude to the developers/contributors of R, Sweave, and LATEX; without their efforts none of this would be possible.

Bibliography

F. Leisch. Sweave: Dynamic generation of statistical reports using literate data analysis. In W. Härdle and B. Rönz, editors, Compstat 2002—Proceedings in Computational Statistics, pages 575–580, Heidelberg, Germany, 2002a. Physika Verlag. ISBN 3-7908-1517-9. URL http://www.ci.tuwien.ac.at/~leisch/Sweave. 40

F. Leisch. Sweave, Part I: Mixing R and LATEX. R News, 2(3):28–31, December 2002b. URL http://cran.r-project.org/doc/Rnews/Rnews_2002-3.pdf. 40

F. Leisch. Sweave, Part II: Package Vignettes. R News, 3(2):21–24, October 2003. URL http://cran.r-project.org/doc/Rnews/Rnews_2003-2.pdf. 40

F. Leisch. Sweave User Manual, 2005. 42

R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2005. ISBN 3-900051-07-0. URL http://www.R-project.org. 40

T. Tantau. User’s Guide to the Beamer Class. Sourceforge.net, 2004. URL http://latex-beamer.sourceforge.net. 42

Damian Betebenner
Lynch School of Education
Educational Research, Measurement and Evaluation
Boston College
Damian.Betebenner@bc.edu

The Value of R for Preclinical Statisticians
Bill Pikounis and Andy Liaw
Merck & Co., Inc.
Participation on the R-help mailing list has shown R
to be widely used across organizations of all types
the world over. This article discusses one corner
where R has become an indispensable tool for effective statistical practice: a preclinical pharmaceutical
environment at the authors’ place of employment.
Preclinical is defined here as a portion of the
research and development process for prescription
medicines. In this article that portion is defined
to start with fundamental discovery, and to end up
where a potential product is first put into human
subjects in clinical trials. There is a wide variety of
knowledge sought across this spectrum, including:
biological targets believed to be important influences
on a given disease; chemical entities that affect such
targets; effectiveness and safety in animals; formulation of product itself. This brief listing is only intended to provide a broad overview. In reality, there
are many more areas that could be included based on
desired granularity. A complementary term of “nonclinical,” which is also commonly used, is perhaps
more appropriate here; however, we will stay with
the phrase preclinical for brevity.
The hallmark diversity of preclinical research
presents unlimited opportunities for statisticians.
Data are collected everywhere in the process, and in
large quantities. We have the liberty to choose methods of processing and analysis. This is a different
environment than clinical biostatistics, where SAS is
the dominant tool for various reasons. Below we
present examples and discussion on how and why
we use R and find it invaluable.

Daily operations
Our corporate-issued laptops run Windows XP and
we have a small cluster of dual-processor Opterons
that run 64-bit Linux for intensive jobs. The client-server configuration of VNC (TightVNC, 2004) provides a very stable way to bridge the two environments onto one monitor. An X-window session appears as just another window on the Windows desktop. Samba (Samba Team, 1992-2005) sets up file
sharing so that a /home directory share under Linux
appears as a network drive in Windows. The autocutsel (Witrant, 2004) utility eases cut-and-paste operations between the two operating systems.
If we work on data that can be comfortably explored on our laptops, we stay there. Generating reports for our research collaborators is helped by the right-click cutting and default pasting of a Windows metafile from the windows() graphics device into a Microsoft Office document. The internal Data Editor (see ?edit) is very useful for browsing a data frame. Once a workspace has been saved in a Windows folder, launching it from Explorer automatically sets the containing folder as the working directory. The overall R GUI console is thankfully unobtrusive, and we much appreciate its separation from graphics device windows in SDI mode. We find it much more pleasant to switch amongst desktop windows where each stands equally on its own, rather than having to search inside a set of windows within the same application.
If an analysis task becomes too large (in memory or time) for the laptops, it is a simple matter
to transfer the R workspace image to a Linux box
and pick up right where we left off. Through the
aforementioned network drive mapping of a Linux
share, source code editing can be done in the same
Windows editor, and file operations of saving, copying, opening, etc. are transparently done through the
standard Windows GUI.
All these task-oriented details add up to increased productivity. More important, however, is the content of the environment itself. We point out here that some of what we say in the rest of the article is not necessarily unique to R, and such benefits can be found in the other implementation of the S language, namely S-PLUS. In his "Exegeses on Linear Models" talk (Venables, 2000), one exegesis from Bill Venables that we particularly admire, and paraphrase here, is that "the subject should dictate what the program should do and not the other way around." Our work environment streams a constant wealth of different data structures and types of scientific questions to address, and we can exercise our freedom to choose our strategies of analysis. Comprehensive data analyses are only limited
by our own human capabilities, not by R. R enables
us to move seamlessly from data management to exploration to formal evaluations to communication of results. No excuse remains to prevent the production of good data graphs for the purposes of valuable study or presentation. The implementation of
modern methods allows us to use resampling, resistance/robustness, and dimension reduction methods on a routine basis, to name a few. We work
with cutting edge science, which we firmly believe
deserves cutting edge data analysis. R promotes this
for us, with its virtues of flexibility, stability and efficiency so clearly practiced by its R Core caretakers.
Professor Brian D. Ripley, in "How Computing has Changed Statistics" (Ripley, 2004), aptly states of R that "the only barrier to understanding how it works, precisely, is skill." We interpret the meaning of "how it works" on multiple levels, and the level most important for us is that a true, continuous investment in learning its many functions, structures, and features pays off time and again. It helps us bring value to scientific research, gain trust from our collaborators, and stimulate the intellect.

Illustrations
Preclinical statisticians do not get nearly enough opportunities to provide input to design of experiments. When we do get such an opportunity, we
are especially eager to make recommendations as
quickly as possible. A recent query involved a
crossover design with binary responses. An important part of the design plan was to estimate the number of subjects needed to see a meaningful difference
between two conditions, where incidence rates were
low. Sample size determination tends to be the predominant aspect of such study design requests.
Estimating sample size is a process of several approximations, but with some effort such approximations do not substantially diminish its value. In
this example, tracking down software that could
be readily used, or finding a literature reference
where a recipe could be programmed, was not timely
enough. So a simulation-based approach was taken.
calcPower <- function(nsubj, pyn, pny, pnnyy,
                      numsim = 1000) {
  ## A simple approach to calculate the power of
  ## McNemar's matched-pair test for these inputs:
  ##   nsubj  = the number of subjects/pairs of
  ##            measurements (msmts)
  ##            (note that msmt1 comes from condition 1
  ##             and msmt2 comes from condition 2)
  ##   pyn    = p(msmt1 = yes & msmt2 = no)
  ##   pny    = p(msmt1 = no  & msmt2 = yes)
  ##   pnnyy  = p(msmt1 = msmt2), the concordant pairs
  ##   numsim = number of simulations
  ##            (at least 1000 recommended)
  outcomes <- rmultinom(n = numsim, size = nsubj,
                        prob = c(pnn = pnnyy, pyy = 0,
                                 pny = pny, pyn = pyn))
  ## (reconstructed step) McNemar's chi-squared statistic,
  ## computed from the discordant counts in rows 3 and 4
  disc <- outcomes[3, ] + outcomes[4, ]
  tscompares <- ifelse(disc == 0, 0,
                       (outcomes[4, ] - outcomes[3, ])^2 / disc)
  mean(tscompares > qchisq(0.95, 1), na.rm = FALSE)
}
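As a rough illustration, a single configuration could be checked like this (the inputs below are hypothetical and chosen only for illustration, not values from the study described here):

calcPower(nsubj = 60, pyn = 0.05, pny = 0.01,
          pnnyy = 0.94, numsim = 1000)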

This simple-minded function evaluates how McNemar’s test would behave for a given configuration
of underlying parameters, thereby providing an estimate of its power to detect that difference. Recall that
small occurrence rates were expected. A grid of potential configurations can be evaluated:
## (reconstruction: the grid limits and sample size below are
##  illustrative, not those used in the original analysis)
powerGrid <- expand.grid(pyn = seq(0.02, 0.20, by = 0.02),
                         pny = seq(0.01, 0.10, by = 0.01))
powerGrid <- subset(powerGrid, pyn > pny)
powerGrid <- powerGrid[order(powerGrid$pyn,
                             powerGrid$pny), ]
powerGrid$power <- mapply(calcPower,
                          pyn = powerGrid$pyn,
                          pny = powerGrid$pny,
                          pnnyy = 1 - powerGrid$pyn - powerGrid$pny,
                          MoreArgs = list(nsubj = 50))

Recreational mathematics with R: introducing the "magic" package
by Robin K. S. Hankin

> magic(3)
     [,1] [,2] [,3]
[1,]    2    7    6
[2,]    9    5    1
[3,]    4    3    8

This magic square has been known since antiquity (legend has it that the square was revealed to
humanity inscribed upon the shell of a divine turtle).
More generally, if consecutive numbers of a magic
square are joined by lines, a pleasing image is often obtained (figure 1, for example, shows a magic
square of order 7; when viewed in this way, the algorithm for creating such a square should be immediately obvious).
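Readers without the figure to hand can sketch such a picture themselves; the following is a minimal stand-in for magicplot(), using only base graphics and assuming the magic package is installed:

library(magic)
m <- magic.2np1(3)                  # the order-7 magic square of figure 1
xy <- which(m > 0, arr.ind = TRUE)  # the (row, column) position of every cell
xy <- xy[order(m[xy]), ]            # sort the positions by cell value 1, 2, ..., 49
plot(xy[, 2], -xy[, 1], type = "b", axes = FALSE,
     xlab = "", ylab = "", asp = 1) # join consecutive numbers with lines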

Figure 1: Magic square of order 7 in graphical form (obtained by magicplot(magic.2np1(3)))

• A semimagic square is one all of whose row sums
equal all its columnwise sums (i.e. the magic
constant).
• A magic square is a semimagic square with the sum of both unbroken diagonals equal to the magic constant.


1 Most workers require the entries to start at 1, which is the convention here; but there are several instances where starting at 0 is far
more convenient. In any case, if x is magic, then x+n is magic for any integer n.


Function magic() takes an integer argument n
and returns a normal magic square of size n × n.
There are eight equivalent forms for lo zhu or indeed any magic square, achieved by rotating and reflecting the matrix (Benson and Jacoby, 1976); such
equivalence is tested by eq() or %eq%. Of these eight
forms, a magic square a is said to be in Frénicle’s
standard form if a[1,1] ≤ b[1,1] whenever a %eq% b, and a[1,2] < a[2,1]. A paper-and-pencil prescription for putting a square into this form would be "rotate until the smallest corner element appears at the top left; then, if a[1,2] > a[2,1], take the transpose". I shall show later that expressing such an algorithm in R leads to new insight when considering magic hypercubes.
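For example (assuming the magic package is installed, and assuming that is.standard() and as.standard() are the package's testing and standardizing functions; they are not named explicitly in the text above):

library(magic)
a <- magic(3)
b <- t(a)            # the transpose is one of the eight equivalent forms
a %eq% b             # TRUE: equivalent under rotation and reflection
is.standard(a)       # is a already in Frenicle's standard form?
as.standard(b)       # rotate/transpose b into standard form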
A wide variety of algorithms exists for calculating
magic squares. For a given order n, these algorithms
generally depend on n modulo 4.
A typical paper algorithm for magic squares of
order n = 4m would go as follows.

Algorithm 1: in a square of order 4m,
shade the long major diagonal. Then
shade all major diagonals distant by a
multiple of 4 cells from the long diagonal.
Do the same with the minor diagonals.
Then, starting with “1” at the top left corner and proceeding from left to right and
top to bottom, count from 1 to n², filling
in the shaded squares with the appropriate number and omitting the unshaded
ones [figure 2]. Fill in the remaining (unshaded) squares in the same way, starting at the lower right corner, moving leftwards and upwards [figure 3].

Such paper algorithms are common in the literature but translating this one into code that uses R’s
vectorized tools effectively can lead to new insight.
The magicness of such squares may be proved by
considering the increasing and decreasing sequences
separately.

  1   .   .   4   5   .   .   8
  .  10  11   .   .  14  15   .
  .  18  19   .   .  22  23   .
 25   .   .  28  29   .   .  32
 33   .   .  36  37   .   .  40
  .  42  43   .   .  46  47   .
  .  50  51   .   .  54  55   .
 57   .   .  60  61   .   .  64

Figure 2: Half-completed magic square of order 8

  1  63  62   4   5  59  58   8
 56  10  11  53  52  14  15  49
 48  18  19  45  44  22  23  41
 25  39  38  28  29  35  34  32
 33  31  30  36  37  27  26  40
 24  42  43  21  20  46  47  17
 16  50  51  13  12  54  55   9
 57   7   6  60  61   3   2  64

Figure 3: Magic square of order 8
The interesting part of the above paper algorithm lies in determining the pattern of shaded
and unshaded squares².
As the reader may
care to verify, parsing the algorithm into R idiom is not straightforward.
An alternative,
readily computed in R, would be to recognize that the repeating 4 × 4 cell a[2:5,2:5]
is kronecker(diag(2),matrix(1,2,2)) -> b say,
replicate it with kronecker(matrix(1,3,3),b) ->
g; then trim off the border by selecting only the
middle elements, in this case g[2:9,2:9]. Function
magic.4n() implements the algorithm for general m.

2 If a <- matrix(1:(n*n),n,n), with jj a Boolean vector of length n² with TRUE corresponding to shaded squares, then it is clear that a[jj] <- rev(a[jj]) will return the above magic square.
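A short sketch of this construction, written out for the 8 × 8 case (magic.4n() in the package handles general m; the checks at the end confirm the result is magic):

b <- kronecker(diag(2), matrix(1, 2, 2))   # the repeating 4 x 4 cell
g <- kronecker(matrix(1, 3, 3), b)         # tile it 3 x 3 times
g <- g[2:9, 2:9]                           # trim the border: 1 = shaded, 0 = unshaded
a <- matrix(1:64, 8, 8, byrow = TRUE)      # count 1..64 left to right, top to bottom
a[g == 0] <- 65 - a[g == 0]                # refill the unshaded cells from the other end
rowSums(a); colSums(a)                     # all 260, the magic constant
sum(diag(a)); sum(diag(a[, 8:1]))          # both long diagonals are 260 as well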


Magic hypercubes
One of the great strengths of R is its ability to handle arbitrary dimensioned arrays in an efficient and
elegant manner.
Generalizing magic squares to magic hypercubes (Hendricks, 1973) is thus natural when working in R. The following definitions represent a general consensus, but are far from universal:
• A semimagic hypercube has all "rook's move" sums equal to the magic constant (that is, each ∑_{i_r=1}^{n} a[i_1, i_2, . . ., i_{r−1}, i_r, i_{r+1}, . . ., i_d] with 1 ≤ r ≤ d is equal to the magic constant for all values of the other i's).
• A magic hypercube is a semimagic hypercube with the additional requirement that all 2^(d−1) long (i.e. extreme point-to-extreme point) diagonals sum correctly.
• A perfect magic hypercube is a magic hypercube with all nonbroken diagonals summing correctly.
• A pandiagonal hypercube is a perfect magic hypercube with all broken diagonals summing correctly.
(A magic hypercube is understood to be of dimension rep(n,d) and normal.) Functions is.semimagichypercube(), is.magichypercube() and is.perfect(a) test for the first three properties; the fourth is not yet implemented. Function is.diagonally.correct() tests for correct summation of the 2^d (sic) long diagonals.

Magic hypercubes of order 4n
Consider algorithm 1 generalized to a d-dimensional
hypercube.
The appropriate generalization of
the repeating cell of the 8 × 8 magic square discussed above is not immediately obvious when
considering figure 2, but the R formalism (viz
kronecker(diag(2),matrix(1,2,2))) makes it
clear that the appropriate generalization is to replace
matrix(1,2,2) with array(1,rep(2,d)).
The appropriate generalization for diag(2) (call
it g) is not so straightforward, but one might be
guided by the following requirements:
• The dimension of g must match the first argument to kronecker(), viz rep(2,d)
• The number of 0s must be equal to the number
of 1s: sum(g==1)==sum(g==0)
• The observation that diag(2) is equal to its
transpose would generalize to requiring that
aperm(g,K) be identical to g for any permutation K.
These lead to specifying that g[i1,...,id] should be zero if (i_1, . . ., i_d) contains an odd number of 2s and one otherwise. One appropriate R idiom would be to define a function dimension(a,p) to be an integer matrix with the same dimensions as a, with element (n_1, n_2, ..., n_d) being n_p; then if jj = ∑_{i=1}^{d} dimension(a,i), we can specify g=jj*0 and then g[jj%%2==1] <- 1.
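A sketch of this recipe in R, using base R's slice.index() in place of the dimension() helper described above, and filling g directly from the stated parity rule rather than from jj%%2:

d <- 3
g <- array(0, dim = rep(2, d))
## dimension(a, p) above is essentially slice.index(a, p)
jj <- Reduce(`+`, lapply(seq_len(d), function(p) slice.index(g, p)))
## an index (i1, ..., id) contains jj - d twos, so:
g[(jj - d) %% 2 == 0] <- 1   # one where the number of 2s is even, zero otherwise
## for d = 2 this reduces to diag(2)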
Another application of kronecker() gives a hypercube that is of extent 4m + 2 in each of its d dimensions, and this may be trimmed off as above to give
an array of dimensions rep(4m,d) using do.call()
and [<-. The numbers may be filled in exactly as for
the 2-dimensional case.
The resulting hypercube is magic, in the sense
defined above, although it is not perfect; function
magichypercube.4n() implements the algorithm.
The ability to generate magic hypercubes of arbitrary
dimension greater than one is apparently novel.
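For instance (assuming the magic package is installed, and assuming magichypercube.4n() takes the order multiplier as its first argument and the dimension as d):

library(magic)
a <- magichypercube.4n(1, d = 3)   # a 4 x 4 x 4 hypercube (order 4m with m = 1)
is.semimagichypercube(a)           # TRUE: all rook's-move sums agree
is.magichypercube(a)               # TRUE: the long diagonals sum correctly too
is.perfect(a)                      # FALSE: magic but not perfect, as stated above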
Standard form for hypercubes
Consider again the paper definition for Frénicle’s
standard form of a magic square a: it is rotated so
that the smallest number appears at the top left; then
if a[1,2] > a[2,1], it is transposed.

Programmer's Niche
by John Fox

> makeDigits <- function(x)
+   strsplit(as.character(x), "")[[1]]
Here are some examples of the use of this function:
> makeDigits(123456)
[1] "1" "2" "3" "4" "5" "6"
> makeDigits(-123456)
[1] "-" "1" "2" "3" "4" "5" "6"
> makeDigits(1000000000)
[1] "1" "e" "+" "0" "9"
Notice the problems revealed by the second and
third examples: It’s necessary to make provision for
negative numbers, and R wants to render certain
numbers in scientific notation.¹ By setting the scipen
(“scientific notation penalty”) option to a large number, we can avoid the second problem:
> options(scipen=100)
> makeDigits(1000000000)
[1] "1" "0" "0" "0" "0" "0" "0" "0" "0" "0"
It also seemed useful to have a function that converts a vector of numerals in character form back into
a number:
> makeNumber <- function(x)
+ as.numeric(paste(x, collapse=""))
> makeNumber(c("1", "2", "3", "4", "5"))
[1] 12345
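As a quick check (assuming scipen has been raised as above, so that all digits are printed), the two helpers invert one another:

> makeNumber(makeDigits(987654))
[1] 987654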
Finally, by way of preparation, I constructed several vectors of number words:
> ones <- c("zero", "one", "two", "three",
+   "four", "five", "six", "seven",
+   "eight", "nine")
> teens <- c("ten", "eleven", "twelve",
+   "thirteen", "fourteen", "fifteen",
+   "sixteen", "seventeen", "eighteen",
+   "nineteen")
> names(ones) <- names(teens) <- 0:9
> tens <- c("twenty", "thirty", "forty",
+   "fifty", "sixty", "seventy", "eighty",
+   "ninety")
> names(tens) <- 2:9
> suffixes <- c("thousand,", "million,",
+   "billion,", "trillion,")
Because the names of the elements of the first
three vectors are numerals, they can conveniently be
indexed; for example:
> ones["5"]
5
"five"
> teens["3"]
3
"thirteen"
> tens["7"]
7
"seventy"
The vector of suffixes includes a comma after
each word.
Figure 1 shows a function for converting a single
integer to words; I’ve added line numbers to make it
easier to describe how the function works:
And here are some examples of its use, wrapping
long lines of output to fit on the page:
> number2words(123456789)
[1] "one hundred twenty-three million,
four hundred fifty-six thousand,
seven hundred eighty-nine"
> number2words(-123456789)
[1] "minus one hundred twenty-three million,
four hundred fifty-six thousand,
seven hundred eighty-nine"
> number2words(-123456000)
[1] "minus one hundred twenty-three million,
four hundred fifty-six thousand"
I believe that the first five lines of the function are
essentially self-explanatory. The rest of the function
probably requires some explanation, however:
[6] If the number is composed of a single digit,
then we can find the answer by simply indexing into the vector ones; the function
as.vector is used to remove the name of
(i.e., the numeral labelling) the selected element.
[7-9] If the number is composed of two digits
and is less than or equal to 19, then we can
get the answer by indexing into teens with
the last digit (i.e., the second element of the
digits vector). If the number is 20 or larger,
then we need to attach the tens digit to the
ones digit, with a hyphen in between.

1 I don’t want to mislead the reader: I discovered these and other problems the hard way, when they surfaced as bugs. The account
here is a reconstruction that avoids my missteps. I can honestly say, however, that it took me much longer to write this column explaining
how the program works than to write the original program. Moreover, in the process of writing up the program, I saw several ways to
improve it, especially in clarity — a useful lesson.


[ 1] number2words <- function(x){
[ 2]   negative <- x < 0
[ 3]   x <- abs(x)
[ 4]   digits <- makeDigits(x)
[ 5]   nDigits <- length(digits)
[ 6]   result <- if (nDigits == 1) as.vector(ones[digits])
[ 7]     else if (nDigits == 2)
[ 8]       if (x <= 19) as.vector(teens[digits[2]])
[ 9]       else trim(paste(tens[digits[1]], "-", ones[digits[2]], sep=""))
[10]     else if (nDigits == 3) {
[11]       tail <- makeNumber(digits[2:3])
[12]       if (tail == 0) paste(ones[digits[1]], "hundred")
[13]       else trim(paste(ones[digits[1]], "hundred", number2words(tail)))
[14]       }
[15]     else {
[16]       nSuffix <- ((nDigits + 2) %/% 3) - 1
[17]       if (nSuffix > length(suffixes) || nDigits > 15)
[18]         stop(paste(x, "is too large!"))
[19]       pick <- 1:(nDigits - 3*nSuffix)
[20]       trim(paste(number2words(makeNumber(digits[pick])),
[21]         suffixes[nSuffix], number2words(makeNumber(digits[-pick]))))
[22]       }
[23]   if (negative) paste("minus", result) else result
[24] }

Figure 1: A function to convert a single integer into words.
If, however, the ones digit is 0, ones["0"] is "zero", and thus we have an embarrassing result such as "twenty-zero". More generally, the program can produce spurious hyphens, commas, spaces, and the strings ", zero" and "-zero" in inappropriate places. My solution was to write a function to trim these off:

trim <- function(text){
  gsub("(^ *)|(( *|-|, zero|-zero)$)",
       "", text)
}
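Two quick checks of trim() at work (hypothetical strings, not output taken from the article):

> trim("twenty-zero")
[1] "twenty"
> trim("one thousand, zero")
[1] "one thousand"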
The trim function makes use of R’s ability
to process “regular expressions.” See Lumley (2003) for a discussion of the use of regular expressions in R.
[10-14] If the number consists of three digits,
then the first digit is used for hundreds, and
the remaining two digits can be processed as
an ordinary two-digit number; this is done
by a recursive call to number2words² — unless the last two digits are 0, in which case,
we don’t need to convert them into words.
The hundreds digit is then pasted onto the
representation of the last two digits, and the
result is trimmed. Notice that makeNumber

is used to put the last two digits back into a
number (called tail).
[15-22] Finally, if the number contains more than
three digits, we’re into the realm of thousands, millions, etc. The computation on line
[16] determines with which power of 1000
we’re dealing. Then, if the number is not too
large, the appropriate digits are stripped off
from the left of the number and attached to
the proper suffix; the remaining digits to the
right are recomposed into a number and processed with a recursive call, to be attached at
the right.
[23] If the original number was negative, the
word "minus" is pasted onto the front before
the result is returned.
The final function, called numbers2words (shown
in Figure 2), adds some bells and whistles: The various vectors of names are defined locally in the function; the utility functions makeDigits, makeNumber, and trim are similarly defined as local functions; and the function number2words, renamed helper, is also made local. Using a helper function rather than a recursive call permits efficient vectorization, via sapply, at the end of numbers2words.

2 It’s traditional in S to use Recall for a recursive function call, but I’m not fond of this convention, and I don’t see an argument for it
here: It’s unlikely that number2words will be renamed, and in any event, it will become a local function in the final version of the program
(see below).


Were numbers2words to call itself recursively, the local definitions of objects (such as the vector ones and the
function trim) would be needlessly recomputed at
each call, rather than only once. Because of R’s lexical scoping, objects defined in the environment of
numbers2words are visible to helper. For more on
recursion in R, see Venables (2001).
numbers2words includes a couple of additional
features. First, according to the Oxford English Dictionary, the definition of “billion” differs in the U.S.
and (traditionally) in Britain: “1. orig. and still
commonly in Great Britain: A million millions. (=
U.S. trillion.) ... 2. In U.S., and increasingly in
Britain: A thousand millions.” Thus, if the argument
billion is set to "UK", a different vector of suffixes
is used. Moreover, provision is made to avoid awkward translations that repeat the word “million,”
such as “five thousand million, one hundred million,
... ,” which is instead, and more properly rendered as
“five thousand, one hundred million, ... .”
Second, Bill Venables tells me that outside of the
U.S., it is common to write or speak a number such as 101 as "one hundred and one" rather than as "one
hundred one.” (Both of these phrases seem correct
to me, but as I said, I’m hopelessly confused about
international variations in English.) I have therefore included another argument, called and, which
is pasted into the number at the appropriate point.
By default, this argument is set to "" when billion
is "US" and to "and" when billion is "UK".
Some examples, again wrapping long lines of
output:
> numbers2words(c(1234567890123, -0123, 1000))
[1] "one trillion,
two hundred thirty-four billion,
five hundred sixty-seven million,
eight hundred ninety thousand,
one hundred twenty-three"
[2] "minus one hundred twenty-three"
[3] "one thousand"
> numbers2words(c(1234567890123, -0123, 1000),
+   billion="UK")
[1] "one billion,
two hundred and thirty-four thousand,
five hundred and sixty-seven million,
eight hundred and ninety thousand,
one hundred and twenty-three"
[2] "minus one hundred and twenty-three"
[3] "one thousand"
> numbers2words(c(1234567890123, -0123, 1000),
+   and="and")
[1] "one trillion,
two hundred and thirty-four billion,
five hundred and sixty-seven million,
eight hundred and ninety thousand,
one hundred and twenty-three"
[2] "minus one hundred and twenty-three"
[3] "one thousand"
Finally, a challenge to the reader: At present,
numbers2words rounds its input to whole numbers.
Modify the program so that it takes a digits argument (with default 0), giving the number of places to
the right of the decimal point to which numbers are
to be rounded, and then make provision for translating such numbers (e.g., 1234567.890) into words.
John Fox
Sociology, McMaster University
jfox@mcmaster.ca

Bibliography
T. Lumley. Programmer's niche: Little bits of string. R News, 3(3):40–41, December 2003. URL http://CRAN.R-project.org/doc/Rnews/.

B. Venables. Programmer's niche. R News, 1(1):27–30, January 2001. URL http://CRAN.R-project.org/doc/Rnews/.


numbers2words <- function(x, billion=c("US", "UK"),
                          and=if (billion == "US") "" else "and"){
  billion <- match.arg(billion)
  trim <- function(text){
    gsub("(^ *)|(( *|-|, zero|-zero)$)", "", text)
  }
  makeNumber <- function(x) as.numeric(paste(x, collapse=""))
  makeDigits <- function(x) strsplit(as.character(x), "")[[1]]
  helper <- function(x){
    negative <- x < 0
    x <- abs(x)
    digits <- makeDigits(x)
    nDigits <- length(digits)
    result <- if (nDigits == 1) as.vector(ones[digits])
      else if (nDigits == 2)
        if (x <= 19) as.vector(teens[digits[2]])
        else trim(paste(tens[digits[1]], "-", ones[digits[2]], sep=""))
      else if (nDigits == 3) {
        tail <- makeNumber(digits[2:3])
        if (tail == 0) paste(ones[digits[1]], "hundred")
        else trim(paste(ones[digits[1]], trim(paste("hundred", and)),
                        helper(tail)))
        }
      else {
        nSuffix <- ((nDigits + 2) %/% 3) - 1
        if (nSuffix > length(suffixes) || nDigits > 15)
          stop(paste(x, "is too large!"))
        pick <- 1:(nDigits - 3*nSuffix)
        trim(paste(helper(makeNumber(digits[pick])),
                   suffixes[nSuffix], helper(makeNumber(digits[-pick]))))
        }
    if (billion == "UK"){
      words <- strsplit(result, " ")[[1]]
      if (length(grep("million,", words)) > 1)
        result <- sub(" million, ", ", ", result)
    }
    if (negative) paste("minus", result) else result
  }
  opts <- options(scipen=100)
  on.exit(options(opts))
  ones <- c("zero", "one", "two", "three", "four", "five", "six", "seven",
            "eight", "nine")
  teens <- c("ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
             "sixteen", "seventeen", "eighteen", "nineteen")
  names(ones) <- names(teens) <- 0:9
  tens <- c("twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty",
            "ninety")
  names(tens) <- 2:9
  suffixes <- if (billion == "US")
      c("thousand,", "million,", "billion,", "trillion,")
    else
      c("thousand,", "million,", "thousand million,", "billion,")
  x <- round(x)
  if (length(x) > 1) sapply(x, helper) else helper(x)
}

Figure 2: A function to convert a vector of integers into a vector of strings containing word-equivalents of the integers.


Book Review of
Julian J. Faraway: Linear Models with R
Chapman & Hall/CRC, Boca Raton FL, USA, 2005
229 pages, ISBN 1-58488-425-8
http://www.stat.lsa.umich.edu/~faraway/LMR
This book is well suited to serve as the practical component of an advanced undergraduate linear regression course. Regarding the potential readership, the
Author writes in the preface that the book is “not introductory” and that it “presumes some knowledge
of basic statistical theory and practice. Readers are
expected to know the essentials of statistical inference such as estimation, hypothesis testing and confidence intervals. A basic knowledge of data analysis is presumed. Some linear algebra and calculus
are also required.” Thus this book is most suitable
for undergraduate statistics majors at least half way
through their degrees. The book needs to be accompanied by a theoretical book, such as Seber and Lee
(2003). A somewhat similar competitor to the book
is Fox (2002).
With a large number (16) of chapters in this smallish book, most chapters are short and mention ideas only
briefly. Of course, some chapters are on core practical
topics such as residual diagnostics, transformations,
weighted and generalized least squares, ANOVA
(factorial designs, block designs), ANCOVA, variable selection techniques, and inference. A good feature of the book is that more ‘optional’ topics are
covered, including missing values, regression splines,
robust regression (M-estimation and least trimmed
squares), permutation tests, shrinkage methods such
as partial least squares, measurement error models including SIMEX, Latin squares, and iteratively
reweighted least squares. Most instructors are unlikely to be familiar with all these topics, and the
book does a good job in giving a very gentle introduction to these topics. The reader is usually referred
to references at the back for further reading.
The book has two small appendices. The first describes R installation, functions and data, while the
second is a quick introduction to R. The examples in
the book are based on R 1.9.0. In my copy there were a few font problems, e.g., on p.189 the interaction term (αβ)ij is typeset in the wrong font.
The tone of the book is that of an informal tutorial
with some necessary theory interspersed throughout. Explanations are good overall, and the mathematical notation is quite standard. The Author's
website serves as the repository for the book's package, called faraway, and includes errata, R commands and data sets (the package is also on CRAN).
The Author writes “Data analysis cannot be learned
without actually doing it” and this is facilitated
by practical exercises (which are usually short and sharp) at the end of each chapter. Solutions appear to be unavailable.
The book centers on lm() and assumes familiarity with the S language; e.g., I found that it referred to "." in a formula without reminding or telling the reader
that it means all variables in the data frame other
than the response. The book is rather unsuitable as a
reference book; for example, the summation constraint ∑_{j=1}^{J} α_j = 0 is a popular parameterization for a factor with J levels. The function contr.sum is alluded to
but not mentioned directly in the book, nor is it explained how to switch between different parameterizations. Ideally,
a handy summary of topics such as the Wilkinson-Rogers operators (e.g., *, /, :) and all the different
contrast options available should be put into an appendix for fast and easy reference. This would make
Chambers and Hastie (1991) less necessary for students to buy or refer to. The book makes good and
frequent use of basic graphics with a liberal sprinkling of plots everywhere, and many examples.
Altogether, there are 27 data sets used in the book;
the large majority of these are small. Even though the figures are printed in black and white, the reader is not encouraged to use colors in the plots, something very useful on a computer screen. It would also be good to remind the reader how to create PDF files of figures (especially under Linux), which is useful when writing reports.
The book has many good points but I picked up
a few bad points. For example, extractor functions are not always used where possible, e.g.,
coef(fit) should be used instead of fit$coef. Also,
numerous examples use cut-and-paste from previous output rather than extracting quantities from objects; for example, on p.20 the value 60.975 is used rather than mdls$sigma. I would have liked the paragraph on
additive models expanded to make the reader aware
of how to check the linearity assumptions using lm()
and regression splines.
I thought there were a number of omissions.
The prediction problem should be mentioned (in
S-PLUS data-dependent terms such as scale(x),
poly(x, 2) and bs(x) give problems, whereas in
R, data-dependent functions called inside another function are problematic, e.g., poly(scale(x), 2)).
Other omissions include not mentioning computational details such as the QR-algorithm and not using more advanced graphics such as those found in
the lattice package.
Although I agree that the book is "not introductory", neither is it advanced. For example, a broken-stick regression model is fitted in Section 7.2.1. Improvements to this example include using coef() instead of $coef, and using
the predict generic function, i.e., predict(gb,
data.frame(pop15=seq(20,48,by=1))). The two
(dotted) lines described in the text actually do not
show up on the plot. Furthermore, although it is instructional to create two basis functions lhs() and
rhs(), the reader is not informed that bs(pop15,
degree=1, df=1, knot=35) would be equivalent.
In conclusion, the book is quite suitable as the practical component of an advanced undergraduate course in linear models for reasonably prepared students. A sequel on generalized linear models and extensions would be a natural next step!


Bibliography
J. Fox. An R and S-Plus Companion to Applied Regression. Thousand Oaks, CA: Sage Publications, 2002.

G. A. F. Seber and A. J. Lee. Linear Regression Analysis, 2nd Edition. New York: Wiley, 2003.

J. M. Chambers and T. J. Hastie. Statistical Models in S. Pacific Grove, Calif.: Wadsworth & Brooks/Cole, 1991.
Thomas Yee
University of Auckland, New Zealand
yee@stat.auckland.ac.nz

R Foundation News
by Bettina Grün

Donations and new members
Donations
Adelchi Azzalini (Italy)
BC Cancer Agency, Vancouver (Canada)
David W. Crawford (USA)
Peter L. Flom (USA)
Google Inc., Mountain View, California (USA)
Faculty of Economics, University of Groningen
(Netherlands)
Shigeru Mase (Japan)
Merck and Co., Inc. (USA)
Network Theory Ltd, Bristol (United Kingdom)
Stanford University, California (USA)
Douglas Wilson (Canada)
Ivo Welch (USA)

New supporting institutions
Baxter Healthcare Corp., California, USA
Department of Mathematics and Statistics, Utah
State University, USA
Department of Statistics, Iowa State University, USA
Dipartimento di Statistica, Università Ca’ Foscari di
Venezia, Italy
School of Economics and Finance, Victoria University of Wellington, New Zealand

New benefactors

Burns Statistics Ltd., London, U.K.
Loyalty Matrix Inc., California, USA
Statisticon AB, Uppsala, Sweden
Merck and Co., Inc., USA

New supporting members

Claudio Agostinelli (Italy)
Roger J. Bos (USA)
Dianne Cook (USA)
Gregor Gorjanc (Slovenia)
Ivan Kojadinovic (France)
Iago Mosqueira (Spain)
Jonas Ranstam (Sweden)
Christian Schulz (Germany)
Mario Walter (Germany)

Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at

Changes in R
by the R Core Team

User-visible changes
• box plots by boxplot() or bxp() now have the
median line three times the normal line width
in order to distinguish it from the quartile ones.

• Unix-alike versions of R can now be used in
UTF-8 locales on suitably equipped OSes. See
the internationalization section below.

• The meaning of ’encoding’ for a connection has
changed: See the internationalization section
below.
• There has been some rationalization of the format of warning/error messages, to make them
easier to translate. Generally names of functions and arguments are single-quoted, and
classes double-quoted.
• Reading text files with embedded "\" (as in
Windows file names) may now need to use
scan(* , allowEscapes = FALSE), see also
below.

New features
• %% now warns if its accuracy is likely to be
affected by lack of precision (as in 1e18 %%
11, the unrealistic expectation of PR#7409), and
tries harder to return a value in range when it
is.
• abbreviate() now warns if used with nonASCII chars, as the algorithm is designed for
English words.
• The default methods for add1() and drop1()
check for changes in the number of cases in use.
The "lm" and "glm" methods for add1() quoted
the <none> model on the original fitted values
when using (with a warning) a smaller set of
cases for the expanded models.
• Added alarm() function to generate a bell or
beep or visual alert.
• all/any() now attempt to coerce their arguments to logical, as documented in the Blue
Book. This means e.g. any(list()) works.
• New functions for multivariate linear
models:
anova.mlm(), SSD(), estVar(),
mauchley.test() (for sphericity).
vcov() now does something more sensible for
"mlm" class objects.
• as.data.frame.table() has a new argument
’responseName’ (contributed by Bill Venables).
• as.dist() and cophenetic() are now generic,
and the latter has a new method for objects of
class "dendrogram".
• as.ts() is now generic.
• binomial() has a new "cauchit" link (suggested by Roger Koenker).
• chisq.test() has a new argument ’rescale.p’.
It is now possible to simulate (slowly) the P
value also in the 1D case (contributed by Rolf
Turner).

• choose(n,k) and lchoose(.) now also work
for arbitrary (real) n in accordance with the
general binomial theorem. choose(*,k) is
more accurate (and faster) for small k.
• Added colorRamp() and colorRampPalette()
functions for color interpolation.
• colSums()/rowSums() now allow arrays with a
zero-length extent (requested by PR#7775).
• confint() has stub methods for classes "glm"
and "nls" that invoke those in package MASS.
This avoids using the "lm" method for "glm"
objects if MASS is not attached.
confint() has a default method using asymptotic normality.
• contr.SAS() has been moved from the ’nlme’
package to the ’stats’ package.
• New function convertColor() maps between color spaces. colorRamp() uses it.
• The cov() function in the non-Pearson cases
now ranks data after removal of missing values, not before. The pairwise-complete method
should now be consistent with cor.test. (Code
contributed by Shigenobu Aoki.)
• Added delayedAssign() function to replace
delay(), which is now deprecated.
• dir.create() has a new argument ’recursive’
serving the same purpose as Unix’s mkdir -p.
• do.call() now takes either a function or a
character string as its first argument. The supplied arguments can optionally be quoted.
• duplicated() and unique() now accept "list"
objects, but are fast only for simple list objects.
• ecdf() now has jumps of the correct size (a
multiple of 1/n) if there are ties. (Wished by
PR#7292).
• eff.aovlist() assumed orthogonal contrasts
for any term with more than one degree of freedom: this is now documented and checked for.
Where each term only occurs in only one stratum the efficiencies are all one: this is detected
and orthogonal contrasts are not required.
• New function encodeString() to encode character strings in the same way that printing
does.

• file("clipboard") now work for reading the
primary selection on Unix-alikes with an active
X11 display. (It has long worked for reading
and writing under Windows.) The secondary
selection can also be read: see ?file.
file() now allows mode "w+b" as well as
"w+".
• file.append() has been tuned, including for
the case of appending many files to a single file.
• Functions flush.console() and select.list()
are now available on all platforms. There is a
Tcl/Tk-based version of select.list() called
tk_select.list() in package tcltk.
• gc() now reports maximum as well as current
memory use.
• A new function getGraphicsEvent() has been
added which will allow mouse or keyboard input from a graphics device. (NB: currently only
the Windows screen device supports this function. This should improve before the 2.1.0 release.)
• New functions gray.colors()/grey.colors()
for gray color palettes.
• grep(), gsub(), sub() and regexpr() now always attempt to coerce their ’pattern’, ’x’, ’replacement’ and ’text’ arguments to character.
Previously this was undocumented but done
by [g]sub() and regexpr() for some values of
their other arguments. (Wish of PR#7742.)
• gsub/sub() have a new ’fixed’ method.
• New function hcl() for creating colors for a
given hue, chroma and luminance (i.e. perceptual hsv).
• isTRUE() convenience function to be used for
programming.
• kmeans() now returns an object of class
"kmeans" which has a print() method.
Two alternative algorithms have been implemented.
If the number of centres is supplied, it has a
new option of multiple random starts.


• mahalanobis() now has a ’...’ argument which
is passed to solve() for computing the inverse
of the covariance matrix, this replaces the former ’tol.inv’ argument.
• menu() uses a multi-column layout if possible
for more than 10 choices.
menu(graphics = TRUE) is implemented on most platforms via select.list() or tk_select.list().
• New function message() in ’base’ for generating "simple" diagnostic messages, replacing
such a function in the ’methods’ package.
• na.contiguous() is now (S3) generic with first
argument renamed to ’object’.
• New function normalizePath() to find canonical paths (and on Windows, canonical names
of components).
• The default in options("expressions") has
been increased to 5000, and the maximal settable value to 500000.
• p.adjust() has a new method "BY".
• pbeta() now uses a different algorithm for
large values of at least one of the shape parameters, which is much faster and is accurate
and reliable for very large values. (This affects
pbinom(), pf(), qbeta() and other functions
using pbeta at C level.)
• pch="." now by default produces a rectangle
at least 0.01" per side on high-resolution devices. (It used to be one-pixel square even
on high-resolution screens and Windows printers, but 1/72" on postscript() and pdf() devices.) Additionally, the size is now scalable by
’cex’; see ?points and note that the details are
subject to change.
• pdf() now responds to the ’paper’ and ’pagecentre’ arguments. The default value of ’paper’
is "special" for backward-compatibility (this is
different from the default for postscript()).
• plot.data.frame() tries harder to produce
sensible plots for non-numeric data frames
with one or two columns.

• The limits on the grid size in layout() are now
documented, and have been raised somewhat
by using more efficient internal structures.

• The predict() methods for "prcomp" and
"princomp" now match the columns of ’newdata’ to the original fit using column names if
these are available.

• legend() now accepts positioning by keyword,
e.g. "topleft", and can put a title within the
legend. (Suggested by Elizabeth Purdom in
PR#7400.)

• New function recordGraphics() to encapsulate calculations and graphics output together
on graphics engine display list. To be used with
care.


• New function RSiteSearch() to query Rrelated resources on-line (contributed by
Jonathan Baron and Andy Liaw).

• File-name matching is no longer caseinsensitive with unz() connections, even on
Windows.

• scan() arranges to share storage of duplicated
character strings read in: this can dramatically reduce the memory requirements for large
character vectors which will subsequently be
turned into factors with relatively few levels.
For a million items this halved the time and reduced storage by a factor of 20.

• New argument ’immediate.’ to warning() to
send an immediate warning.

scan() has a new argument ’allowEscapes’
(default TRUE) that controls when C-style escapes in the input are interpreted. Previously
only \n and \r were interpreted, and then only
within quoted strings when no separator was
supplied.
scan() used on an open connection now
pushes back on the connection its private
‘ungetc’ and so is safer to use to read partial
lines.
• scatter.smooth() and loess.smooth() now
handle missing values in their inputs.
• seq.Date() and seq.POSIXt() now allow ’to’
to be before ’from’ if ’by’ is negative.
• sprintf() has been enhanced to allow the
POSIX/XSI specifiers like "%2$6d", and also accepts "%x" and "%X".
sprintf() does limited coercion of its arguments.
sprintf() accepts vector arguments and operates on them in parallel (after re-cycling if
needed).
• New function strtrim() to trim character vectors to a display width, allowing for doublewidth characters in multi-byte character sets.
• subset() now has a method for matrices, similar to that for data frames.
• Faster algorithm in summaryRprof().
• sunflowerplot() has new arguments ’col’ and
’bg’.
• sys.function() now has argument ’which’ (as
has long been presaged on its help page).
• Sys.setlocale("LC_ALL", ) now only sets
the locale categories which R uses, and
Sys.setlocale("LC_NUMERIC", ) now gives a
warning (as it can cause R to malfunction).
• unclass() is no longer allowed for environments and external pointers (since these cannot
be copied and so unclass() was destructive of
its argument). You can still change the "class"
attribute.

• New convenience wrappers write.csv() and
write.csv2().
• There is a new version for write.table()
which is implemented in C. For simple matrices and data frames this is several times faster
than before, and uses negligible memory compared to the object size.
The old version (which no longer coerces a matrix to a data frame and then back to a matrix)
is available for now as write.table0().
• The functions xinch(), yinch(), and xyinch()
have been moved from package ’grDevices’
into package ’graphics’.
• Plotmath now allows underline in expressions.
(PR#7286, contributed by Uwe Ligges.)
• BATCH on Unix no longer sets --gui="none"
as the X11 module is only loaded if needed.
• The X11 module (and the hence X11(), jpeg()
and png() devices and the X-based dataentry
editor) is now in principle available under all
Unix GUIs except --gui="none", and this is reflected in capabilities().
capabilities("X11") determines if an X
server can be accessed, and so is more likely to
be accurate.
• Printing of arrays now honours the ’right’ argument if there are more than two dimensions.
• Tabular printing of numbers now has headers
right-justified, as they were prior to version
1.7.0 (spotted by Rob Baer).
• Lazy-loading databases are now cached in
memory at first use: this enables R to run much
faster from slow file systems such as USB flash
drives. There is a small (less than 2Mb) increase
in default memory usage.
• The implicit class structure for numeric vectors
has been changed, so that integer/real vectors
try first methods for class "integer"/"double"
and then those for class "numeric".
The implicit classes for matrices and arrays
have been changed to be "matrix"/"array" followed by the class(es) of the underlying vector.

• splines::splineDesign() now allows the
evaluation of a B-spline basis everywhere instead of just inside the "inner" knots, by setting
the new argument ‘outer.ok = TRUE’.
• Hashing has been tweaked to use half as much
memory as before.
• Readline is not used for tilde expansion when
R is run with --no-readline, nor from embedded applications. Then "~name" is no longer expanded, but "~" still is.
• The regular expression code is now based on
that in glibc 2.3.3. It has stricter conformance
to POSIX, so metachars such as + * may need
to be escaped where before they did not (but
could have been).
• New encoding ’TeXtext.enc’ improves the way
postscript() works with Computer Modern
fonts.
• Replacement in a non-existent column of a data
frame tries harder to create a column of the correct length and so avoid a corrupt data frame.
• For Windows and readline-based history, the
saved file size is re-read from R_HISTSIZE immediately before saving.
• Collected warnings during start-up are now
printed before the initial prompt rather than after the first command.
• Changes to package ’grid’:
– preDrawDetails(), drawDetails(), and
postDrawDetails() methods are now
recorded on the graphics engine display
list. This means that calculations within
these methods are now run when a device
is resized or when output is copied from
one device to another.
– Fixed bug in grid.text() when ’rot’ argument has length 0. (privately reported
by Emmanuel Paradis)
– New getNames() function to return just
the names of all top-level grobs on the display list.
– Recording on the grid display list is
turned off within preDrawDetails(),
drawDetails(), and postDrawDetails()
methods.
– Grid should recover better from errors or
user-interrupts during drawing (i.e., not
leave you in a strange viewport or with
strange graphical parameter settings).
– New function grid.refresh() to redraw
the grid display list.

– New function grid.record() to capture
calculations with grid graphics output.
– grobWidth and grobHeight ("grobwidth"
and "grobheight" units) for primitives
(text, rects, etc, ...) are now calculated
based on a bounding box for the relevant
grob.
NOTE: this has changed the calculation of
the size of a scalar rect (or circle or lines).
– New arguments ’warn’ and ’wrap’ for
function grid.grab()
– New function grid.grabExpr() which
captures the output from an expression
(i.e., not from the current scene) without
doing any drawing (i.e., no impact on the
current scene).
– upViewport() now (invisibly) returns the
path that it goes up (suggested by Ross
Ihaka).
– The ’gamma’ gpar has been deprecated
(this is a device property not a property
of graphical objects; suggested by Ross
Ihaka).
– New ’lex’ gpar; a line width multiplier.
– grid.text() now handles any language
object as mathematical annotation (instead of just expressions).
– plotViewport() has default value for
’margins’ argument (that match the default value for par(mar)).
– The ’extension’ argument to dataViewport()
can now be vector, in which case the first
value is used to extend the xscale and the
second value is used to extend the y scale.
(suggested by Ross Ihaka).
– All ’just’ arguments (for viewports, layouts, rectangles, text) can now be numeric
values (typically between 0 [left] and 1
[right]) as well as character values ("left",
"right", ...).
For rectangles and text, there are additional ’hjust’ and ’vjust’ arguments which
allow numeric vectors of justification in
each direction (e.g., so that several pieces
of text can have different justifications).
(suggested by Ross Ihaka)
– New ’edits’ argument for grid.xaxis()
and grid.yaxis() to allow specification
of on-the-fly edits to axis children.
– applyEdit(x, edit) returns x if target of
edit (i.e., child specified by a gPath) cannot be found.
– Fix for calculation of length of
max/min/sum unit. Length is now (correctly) reported as 1 (was reported as
length of first arg).

– Viewport names can now be any string
(they used to have to be a valid R symbol).
– The ’label’ argument for grid.xaxis()
and grid.yaxis() can now also be a language object or string vector, in which
case it specifies custom labels for the tick
marks.

Internationalization
• Unix-alike versions of R can now be used in
UTF-8 and other multi-byte locales on suitably equipped OSes if configured with option
--enable-mbcs (which is the default). [The
changes to font handling in the X11 module
are based on the Japanization patches of Eiji
Nakama.]
Windows versions of R can be used in ‘East
Asian’ locales on suitable versions of Windows.
See the ’Internationalization’ chapter in the ’Installation and Administration’ manual.
• New command-line flag --encoding to specify
the encoding to be assumed for stdin (but not
for a console).
• New function iconv() to convert character vectors between encodings, on those
OSes which support this.
See the new
capabilities("iconv").
• The meaning of ’encoding’ for a connection has
changed: it now allows any charset encoding
supported by iconv on the platform, and can
re-encode output as well as input.
As the new specification is a character string
and the old was numeric, this should not cause
incorrect operation.
• New
function
localeToCharset()
to
find/guess encoding(s) from the locale name.
• nchar() returns the true number of bytes
stored (including any embedded nuls), this being 2 for missing values. It has an optional argument ’type’ with possible non-default values
"chars" and "width" to give the number of characters or the display width in columns.
• Characters can be entered in hexadecimal as
e.g. \x9c, and in UTF-8 and other multibyte
locales as \uxxxx, \u{xxxx}, \Uxxxxxxxx or
\U{xxxxxxxx}. Non-printable Unicode characters are displayed C-style as \uxxxx or
\Uxxxxxxxx.
• LC_MONETARY is set to the locale, which affects the result of Sys.localeconv(), but nothing else in R itself. (It could affect add-on packages.)

• source() now has an ’encoding’ argument
which can be used to make it try out various possible encodings. This is made use of
by example() which will convert (non-UTF-8)
Latin-1 example files in a UTF-8 locale.
• read/writeChar() work in units of characters,
not bytes.
• .C() now accepts an ENCODING= argument
where re-encoding is supported by the OS. See
‘Writing R Extensions’.
• delimMatch (tools) now reports match positions and lengths in units of characters, not
bytes. The delimiters can be strings, not just
single ASCII characters.
• .Rd files can indicate via a \encoding{} argument the encoding that should be assumed for
non-ASCII characters they contain.
• Phrases in .Rd files can be marked by \enc{}{}
to show a transliteration to ASCII for use in e.g.
text help.
• The use of ’pch’ in points() now allows for
multi-byte character sets: in such a locale a
glyph can either be specified as a multi-byte
single character or as a number, the Unicode
point.
• New function l10n_info() reports on aspects
of the locale/charset currently in use.
• scan() is now aware of double-byte locales
such as Shift-JIS in which ASCII characters can
occur as the second (’trail’) byte.
• Functions sQuote() and dQuote() use the Unicode directional quotes if in a UTF-8 locale.
• The infrastructure is now in place for C-level
error and warning messages to be translated
and used on systems with Native Language
Support. This has been used for the startup
message in English and to translate Americanisms such as ’color’ into English: translations
to several other languages are under way, and
some are included in this release.
See ’Writing R Extensions’ for how to make use
of this in a package: all the standard packages
have been set up to do translation, and the ’language’ ’en@quot’ is implemented to allow Unicode directional quotes in a UTF-8 locale.
• R-level stop(), warning() and message()
messages can be translated, as can other messages via the new function gettext(). Tools
xgettext() and xgettext2pot() are provided
in package tools to help manage error messages.
gettextf() is a new wrapper to call sprintf()
using gettext() on the format string.

• Function ngettext() allows the management
of singular and plural forms of messages.

Utilities
• New functions mirror2html() and checkCRAN().

• R CMD check has a new option '--use-valgrind'.

• R CMD check now checks that Fortran and C++
files have LF line endings, as well as C files.
It also checks Makevars[.in] files for portable
compilation flags.
• R CMD check will now work on a source tarball
and prints out information about the version of
R and the package.
• tools:::.install_package_code_files()
(used to collate R files when installing packages) ensures files are separated by a line feed.
• vignette() now returns an object of class "vignette" whose print() method opens the corresponding PDF file. The edit() method can
be used to open the code of the vignette in an
editor.
• R CMD INSTALL on Unix has a new option
’--build’ matching that on Windows, to package as tarball the installed package.
• R CMD INSTALL on Unix can now install binary bundles.
• R CMD build now changes src files to LF line
endings if necessary.
• R CMD build now behaves consistently between source and binary builds: in each case
it prepares a source directory and then either
packages that directory as a tarball or calls R
CMD INSTALL -build on the prepared sources.
This means that R CMD build --binary now
respects .Rbuildignore and will rebuild vignettes (unless the option --no-vignettes is
used). For the latter, it now installs the current
sources into a temporary library and uses that
version of the package/bundle to rebuild the
vignettes.
• R CMD build now reports empty directories in
the source tree.
• New function write_PACKAGES() in package
’tools’ to help with preparing local package
repositories. (Based on a contribution by Uwe
Ligges.) How to prepare such repositories is
documented in the ’R Installation and Administration’ manual.

• package.skeleton() adds a bit more to DESCRIPTION.

• Sweave changes:

– \usepackage[nogin]{Sweave} in the
header of an Sweave file suppresses autosetting of graphical parameters such as
the width of the graphics output.
– The new \SweaveInput{} command
works similar to LaTeX’s \input{} command.
– Option value strip.white=all strips all
blank lines from the output of a code
chunk.
– Code chunks with eval=false are commented out by Stangle() and hence no
longer tested by R CMD check.

Documentation
• File doc/html/faq.html no longer exists, and
doc/manual/R-FAQ.html (which has active
links to other manuals) is used instead. (If
makeinfo >= 4.7 is not available, the version on
CRAN is linked to.)
• Manual ’Writing R Extensions’ has further details on writing new front-ends for R using the
new public header files.
• There are no longer any restrictions on characters in the \name{} field of a .Rd file: in particular _ is supported.

C-level facilities
• There are new public C/C++ header files Rinterface.h and R_ext/RStartup.h for use with external GUIs.
• Added an onExit() function to graphics devices, to be executed upon user break if nonNULL.
• ISNAN now works even in C++ code that undefines the ’isnan’ macro.
• R_alloc's limit on 64-bit systems has been raised from just under 2^31 bytes (2Gb) to just under 2^34 bytes (16Gb), and is now checked.
• New math utility functions log1pmx(x),
lgamma1p(x),
logspace_add(logx, logy),
and logspace_sub(logx, logy).

Deprecated & defunct
• The aqua module for MacOS X has been removed: --with-aqua now refers to the unbundled Cocoa GUI.
• Capabilities "bzip2", "GNOME", "libz" and "PCRE" are defunct.
• The undocumented use of UseMethod() with
no argument was deprecated in 2.0.1 and is
now regarded as an error.
• Capability "IEEE754" is deprecated.
• The ’CRAN’ argument to update.packages(),
old.packages(), new.packages(),
download.packages() and install.packages()
is deprecated in favour of ’repos’, which replaces it as a positional argument (so this is
only relevant for calls with named args).
• The S3 methods for getting and setting names
of "dist" objects have been removed (as they
provided names with a different length from
the "dist" object itself).
• Option "repositories" is no longer used and so
not set.
• loadURL() is deprecated in favour of load(url()).

• delay() is deprecated. Use delayedAssign() instead.

Installation changes
• New configure option --enable-utf8 to enable support for UTF-8 locales, on by default.
• R_XTRA_[CF]FLAGS are now used during
the configuration tests, and [CF]PICFLAGS if
--enable-R-shlib was specified. This ensures
that features such as inlining are only used if
the compilation flags specified support them.
(PR#7257)
• Files FAQ, RESOURCES, doc/html/resources.html
are no longer in the SVN sources but are made
by ’make dist’.
• The GNOME GUI is unbundled, now provided
as a package on CRAN.
• Configuring without having the recommended packages is now an error unless
--with-recommended-packages=no (or equivalent) is used.
• Configuring without having the X11 headers and libraries is now an error unless
--with-x=no (or equivalent) is used.

• Configure tries harder to find a minimal set
of FLIBS. Under some circumstances this may
remove from R_LD_LIBRARY_PATH path elements that ought to have been specified in LDFLAGS (but were not).
• The C code for most of the graphics device
drivers and their afm files are now in package
grDevices.
• R is now linked against ncurses/termlib/termcap
only if readline is specified (now the default)
and that requires it.
• Makeinfo 4.7 or later is now required for building the HTML and Info versions of the manuals.

Package installation changes
• There are new types of packages, identified by
the Type field in the DESCRIPTION file. For
example the GNOME console is now a separate package (on CRAN), and translations can
be distributed as packages.
• There is now support for installing both source and binary packages on MacOS X and Windows from within R. Most of the R functions now have a 'type' argument defaulting to
getOption("pkgType") and with possible values "source", "win.binary" and "mac.binary".
The default is "source" except under Windows
and the CRAN GUI build for MacOS X.
• install.packages() and friends now accept a
vector of URLs for ’repos’ or ’contriburl’ and
get the newest available version of a package
from the first repository on the list in which it is
found. The argument ’CRAN’ is still accepted,
but deprecated.
install.packages() on Unix can now install
from local .tar.gz files via repos = NULL (as has
long been done on Windows).
install.packages() no longer asks if downloaded packages should be deleted: they will
be deleted at the end of the session anyway
(and can be deleted by the user at any time).
If the repository provides the information,
install.packages() will now accept the name
of a package in a bundle.
If ’pkgs’ is omitted install.packages() will
use a listbox to display the available packages,
on suitable systems.
’dependencies’ can be a character vector to allow only some levels of dependencies (e.g. not
"Suggests") to be requested.

• There is a new possible value update.packages(ask="graphics") that uses a widget to (de)select packages, on suitable systems.

• The option used is now getOption("repos"), not getOption("CRAN"), and it is initially set to a dummy value. Its value can be a character vector (preferably named) giving one or several repositories.
A new function chooseCRANmirror() will select a CRAN mirror. This is called automatically if contrib.url() encounters the initial dummy value of getOption("repos").
A new function setRepositories() can be used to create getOption("repos") from a (platform-specific) list of known repositories. (A short usage sketch follows at the end of this list.)
• New function new.packages() to report uninstalled packages available at the requested
repositories. This also reports incomplete bundles. It will optionally install new packages.
• New function available.packages(), similar
to CRAN.packages() but for use with multiple
repositories. Both now only report packages
whose R version requirements are met.
• update.packages() and old.packages() have
a new option ’checkBuilt’ to allow packages installed under earlier versions of R to be updated.
• remove.packages() can now remove bundles.
• The Contains: field of the DESCRIPTION file
of package bundles is now installed, so later
checks can find out if the bundle is complete.
• packageStatus() is now built on top of *.packages, and gains a ’method’ argument. It defaults to the same repositories as the other
tools, those specified by getOption("repos").
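Taken together, a minimal sketch of the new package-management workflow might look as follows. The mirror URL and the package name somePackage are placeholders; the calls merely illustrate the arguments described above.

  chooseCRANmirror()                  # interactively pick a CRAN mirror
  setRepositories()                   # or build getOption("repos") from the known list
  getOption("repos")

  new.packages()                      # available but not yet installed
  available.packages()                # listings from all configured repositories

  install.packages("somePackage",
                   dependencies = c("Depends", "Imports"),
                   type = getOption("pkgType"))
  install.packages("somePackage_1.0.tar.gz", repos = NULL)   # local source file on Unix

  update.packages(checkBuilt = TRUE, ask = "graphics")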

Bug fixes
• Configuring for Tcl/Tk makes use of ${TK_LIB_SPEC} ${TK_LIBS} not ${TK_LIB_SPEC} ${TK_XLIBSW}, which is correct for recent versions of Tk, but conceivably not for old tkConfig.sh files.

• S4 methods for primitive functions must be exported from namespaces; this is now done automatically. Note that is.primitive() is now in 'base', not 'methods'.

• Package grid:
– Fixed bug in grid.text() when "rot" argument has length 0. (reported by Emmanuel Paradis)
• .install_package_vignette_index()
created an index even in an empty ’doc’ directory.
• The print() method for factors now escapes
characters in the levels in the same way as they
are printed.
• str() removed any class from environment objects.
str() no longer interprets control characters in character strings and factor levels; also no longer truncates factor levels unless they are longer than 'nchar.max'. Truncation of such long strings is now indicated "outside" the string.
str() was misleading for the case of a single slot.
str() now also properly displays S4 class definitions (such as returned by getClass()).
• print.factor(quote=TRUE) was not quoting
levels, causing ambiguity when the levels contained spaces or quotes.
• R CMD check was confused by a trailing / on
a package name.
• write.table() was writing incorrect column names if the data frame contained any matrix-like columns.
• write.table() was not quoting row names for
a 0-column x.
• t(x)’s default method now also preserves
names(dimnames(x)) for 1D arrays ’x’.
• r <- a %*% b no longer gives names(dimnames(r))
== c("", "") unless one of a or b has named
dimnames.

• detach() was not recomputing the S4 methods
for primitives correctly.

• Some .Internal functions that were supposed
to return invisibly did not. This was behind
PR#7397 and PR#7466.

• Methods package now has class "expression"
partly fixed in basic classes, so S4 classes can
extend these (but "expression" is pretty broken
as a vector class in R).

• eval(expr, NULL, encl) now looks up variables in encl, as eval(expr, list(), encl) always did

• Collected warnings had messages with unneeded trailing space.

• Coercing as.data.frame(NULL) to a pairlist
caused an error.


• p.adjust(p, ..) now correctly works when
‘p’ contains NAs (or when it is of length 0 or
length 2 for method = "hommel").
• ’methods’ initialization was calling a function
intended for .Call() with .C().
• optim() needed a check that the objective function returns a value of length 1 (spotted by Ben
Bolker).
• X11() was only scaling its fonts to pointsize if
the dpi was within 0.5 of 100dpi.
• X11() font selection was looking for any symbol font, and sometimes got e.g. bold italic if
the server has such a font.
• dpois(*, lambda=Inf) now returns 0 (or -Inf
for log).
• Using pch="" gave a square (pch=0)! Now it is
regarded as the same as NA, which was also
undocumented but omits the point.
• Base graphics now notices (ab)lines which have
a zero coordinate on log scale, and omits them.
(PR#7559)
• stop() and warning() now accept NULL as
they are documented to do (although this
seems of little use and is equivalent to "").
• weighted.mean() now checks the length of the
weight vector w.
• getAnywhere() was confused by names with
leading or trailing dots (spotted by Robert
McGehee)


• The "aovlist" method for se.contrast() failed
in some very simple cases that were effectively
not multistratum designs, e.g. only one treatment occurring in only one stratum.
• pgamma() uses completely re-written algorithms, and should work for all (even very
extreme) arguments; this is based on Morten
Welinder’s contribution related to PR#7307.
• dpois(10, 2e-308, log=TRUE) and similar
cases gave -Inf.
• x <- 2^(0:1000); plot(x, x^.9, type="l",
log="xy") and x <- 2^-(1070:170); plot(x,
x^.9, type="l", log="xy") now both work
• summary.lm() asked for a report on a reasonable occurrence, but the check failed to take account of NAs.
• lm() was miscalculating ’df.residual’ for empty
models with a matrix response.
• summary.lm() now behaves more sensibly for
empty models.
• plot.window() was using the wrong sign
when adjusting xlim/ylim for positive ’asp’
and a reversed axis.
• If malloc() fails when allocating a large object the allocator now does a gc and tries the
malloc() again.
• packageSlot() and getGroupMembers() are now exported from the 'methods' package, as they should be according to the documentation and the Green Book.

• eval() was not handling values from return()
correctly.

• rhyper() was giving numbers slightly too
small, due to a bug in the original algorithm.
(PR#7314)

• par(omd) is now of the form c(x1, x2, y1, y2) to match the documentation and for S-PLUS compatibility. [Previously, par(omd) was of the form c(bottom, left, top, right) like par(oma) and par(omi).]

• gsub() was sometimes incorrectly matching ^ inside a string, e.g. gsub("^12", "x", "1212") was "xx".

• [g]sub(perl = TRUE) was giving random results for a 0-length initial match. (PR#7742)

• Contrasts needed to be coerced to numeric (e.g.
from integer) inside model.matrix. (PR#7695)

• [g]sub was ignoring most 0-length matches, including all initial ones. Note that substitutions such as gsub("[[:space:]]*", " ", ...) now work as they do in 'sed' (whereas the effect was previously the same as gsub("[[:space:]]+", " ", ...)). (In part PR#7742.) A brief illustration appears at the end of this section.

• socketSelect() did not check for buffered input.

• Promises are now evaluated when extracted
from an environment using '$' or '[[ ]]'.

• Reads on a non-blocking socket with no available data were not handled properly and could
result in a segfault.

• reshape(direction="wide") had some sorting problems when guessing time points
(PR#7669)

• formatC() did not check its ’flag’ argument, and could segfault if it was incorrect.
(PR#7686)


• par() set ’xaxp’ before ’xlog’ and ’yaxp’ before
’ylog’, causing PR#831.

• acf() could cause a segfault with some
datasets. (PR#7771)

• The logic in tclRequire() to check the availability of a Tcl package turned out to be fallible. It now uses a try()-and-see mechanism
instead.

• tan(1+LARGEi) now gives 0+1i rather than
0+NaNi (PR#7781)

• Opening a unz() connection on a non-existent
file left a file handle in use.

• summary(data.frame(mat = I(matrix(1:8,
4)))) does not go into infinite recursion anymore.

• "dist" objects of length 0 failed to print.

• writeBin() performed byte-swapping incorrectly on complex vectors, also swapping real
and imaginary parts. (PR#7778)

• INSTALL and the libR try harder to find a temporary directory (since there might be one left
over with the same PID).

• read.table() sometimes discarded lines containing only white space as blank, even if sep=",".

Changes on CRAN
by Kurt Hornik

New contributed packages
AMORE A MORE flexible neural network package.
This package was born to release the TAO robust neural network algorithm to the R users.
It has grown and can be of interest for the users
wanting to implement their own training algorithms as well as for those others whose needs
lie only in the “user space”. By Manuel Castejón Limas, Joaquín B. Ordieres Meré, Eliseo P.
Vergara González, Francisco Javier Martínez de
Pisón Ascacibar, Alpha V. Pernía Espinoza, and
Fernando Alba Elías.
BHH2 Functions and data sets reproducing some
examples in “Statistics for Experimenters II” by
G. E. P. Box, J. S. Hunter, and W. C. Hunter,
2005, John Wiley and Sons. By Ernesto Barrios.
Bolstad Functions and data sets for the book “Introduction to Bayesian Statistics” by W. M. Bolstad, 2004, John Wiley and Sons. By James Curran.
Ecdat Data sets from econometrics textbooks. By
Yves Croissant.
GDD Platform and X11 independent device for creating bitmaps (png, gif and jpeg). By Simon Urbanek.
GeneNT The package implements a two-stage algorithm to screen co-expressed gene pairs with
controlled FDR and MAS. The package also constructs relevance networks and clusters co-expressed genes (both similarly co-expressed
and transitively co-expressed). By Dongxiao
Zhu.
Geneland Detection of spatial structure from genetic data. By Gilles Guillot.
HTMLapplets Functions inserting dynamic scatterplots and grids in documents generated by
R2HTML. By Gregoire Thomas.
IDPmisc The IDPmisc package contains different
high-level graphics functions for displaying
large datasets, brewing color ramps, drawing nice arrows, creating figures with differently colored margins and plot regions, and
other useful goodies. By Andreas Ruckstuhl,
Thomas Unternährer, and Rene Locher.
LDheatmap Produces a graphical display, as a heat
map, of measures of pairwise linkage disequilibria between SNPs. Users may optionally include the physical locations or genetic map distances of each SNP on the plot. By Ji-Hyung
Shin, Sigal Blay, Jinko Graham, and Brad McNeney.
LogicReg Routines for Logic Regression. By Charles
Kooperberg and Ingo Ruczinski.
MEMSS Data sets and sample analyses from
“Mixed-effects Models in S and S-PLUS” by J.
Pinheiro and D. Bates, 2000, Springer. By Douglas Bates.
MatchIt Select matched samples of the original
treated and control groups with similar covariate distributions—can be used to match exactly
on covariates, to match on propensity scores,
or perform a variety of other matching procedures. By Daniel Ho, Kosuke Imai, Gary King,
and Elizabeth Stuart.

Matching Provides functions for multivariate and
propensity score matching and for finding optimal balance based on a genetic search algorithm. A variety of univariate and multivariate
tests to determine if balance has been obtained
are also provided. By Jasjeet Singh Sekhon.
NORMT3 Evaluates the probability density function of the sum of the Gaussian and Student’s t
density on 3 degrees of freedom. Evaluates the
p.d.f. of the sphered Student’s t density function. Also evaluates the erf and erfc functions
on complex-valued arguments. By Guy Nason.
PK Estimation of pharmacokinetic parameters. By
Martin Wolfsegger.
ProbForecastGOP Probabilistic weather field forecasts using the Geostatistical Output Perturbation method introduced by Gel, Raftery
and Gneiting (2004). By Yulia Gel, Adrian
E. Raftery, Tilmann Gneiting, and Veronica J.
Berrocal.
R.matlab Provides methods to read and write MAT
files. It also makes it possible to communicate (evaluate code, send and retrieve objects
etc.) with Matlab v6 or higher running locally
or on a remote host. The auxiliary Java class
provides static methods to read and write Java
data types. By Henrik Bengtsson.
R.oo Methods and classes for object-oriented programming in R with or without references. By
Henrik Bengtsson.
RGrace Mouse/menu driven interactive plotting
application. By M. Kondrin.
RII Estimation of the relative index of inequality
for interval-censored data using natural cubic
splines. By Jamie Sergeant.
ROCR ROC graphs, sensitivity/specificity curves,
lift charts, and precision/recall plots are popular examples of trade-off visualizations for specific pairs of performance measures. ROCR is
a flexible tool for creating cutoff-parametrized
2D performance curves by freely combining
two from over 25 performance measures (new
performance measures can be added using a
standard interface). By Tobias Sing, Oliver
Sander, Niko Beerenwinkel, and Thomas
Lengauer.
ResistorArray Electrical properties of resistor networks. By Robin K. S. Hankin.
Rlab Functions and data sets for the NCSU ST370
class. By Dennis D. Boos, Atina Dunlap Brooks,
and Douglas Nychka.

SciViews A bundle of packages to implement a full
reusable GUI API for R. Contains svGUI with
the main GUI features, svDialogs for the dialog
boxes, svIO for data import/export, svMisc
with miscellaneous supporting functions, and
svViews providing views and report features
(views are HTML presentations of the content
of R objects, combining text, tables and graphs
in the same document). By Philippe Grosjean
& Eric Lecoutre.
SemiPar Functions for semiparametric regression
analysis, to complement the book “Semiparametric Regression” by D. Ruppert, M. P. Wand,
and R. J. Carroll, 2003, Cambridge University
Press. By Matt Wand.
SeqKnn Estimate missing values sequentially from
the gene that had least missing rate in microarray data. By Ki-Yeol Kim and Gwan-Su Yi.
UsingR Data sets to accompany the textbook “Using R for Introductory Statistics” by J. Verzani,
2005, Chapman & Hall/CRC. By John Verzani.
Zelig Everyone’s statistical software: an easy-to-use
program that can estimate, and help interpret
the results of, an enormous range of statistical models. By Kosuke Imai, Gary King, and
Olivia Lau.
adlift Adaptive Wavelet transforms for signal denoising. By Matt Nunes and Marina Popa.
alr3 Methods and data to accompany the textbook
“Applied Linear Regression” by S. Weisberg,
2005, Wiley. By Sanford Weisberg.
arules Provides the basic infrastructure for mining
and analyzing association rules and an interface to the C implementations of Apriori and
Eclat by Christian Borgelt. By Bettina Gruen,
Michael Hahsler and Kurt Hornik.
bayesm Covers many important models used in
marketing and micro-econometrics applications.
The package includes: Bayes Regression (univariate or multivariate dependent variable), Multinomial Logit (MNL) and
Multinomial Probit (MNP), Multivariate Probit, Multivariate Mixtures of Normals, Hierarchical Linear Models with normal prior
and covariates, Hierarchical Multinomial Logits with mixture of normals prior and covariates, Bayesian analysis of choice-based conjoint
data, Bayesian treatment of linear instrumental
variables models, and Analysis of Multivariate
Ordinal survey data with scale usage heterogeneity. By Peter Rossi and Rob McCulloch.
bitops Functions for Bitwise operations on integer
vectors. S original by Steve Dutky, initial R port
and extensions by Martin Maechler. Revised
and modified by Steve Dutky.
boost Contains a collection of boosting methods,
these are ‘BagBoost’, ‘LogitBoost’, ‘AdaBoost’
and ‘L2Boost’, along with feature preselection
by the Wilcoxon test statistic. Moreover, methods for the simulation of data according to correlation and mean structures of existing real
datasets are included. By Marcel Dettling.
changeLOS Change in length of hospital stay (LOS).
By Matthias Wangler and Jan Beyersmann.
clac Clust Along Chromosomes, a method to call
gains/losses in CGH array data.
By Pei
Wang, with contributions from Balasubramanian Narasimhan.
climatol Functions to fill missing data in climatological (monthly) series and to test their homogeneity, plus functions to draw wind-rose and
Walter&Lieth diagrams. By José A. Guijarro.
clue CLUster Ensembles. By Kurt Hornik, with contributions from Walter Boehm.
coin Conditional inference procedures for the general independence problem including two-sample, K-sample, correlation, censored, ordered and multivariate problems. By Torsten
Hothorn and Kurt Hornik, with contributions
by Mark van de Wiel and Achim Zeileis.
colorspace Carries out mapping between assorted
color spaces. By Ross Ihaka.
concor Concordance, providing “SVD by blocks”.
By R. Lafosse.
ctv Server-side and client-side tools for task views to
CRAN-style repositories. By Achim Zeileis and
Kurt Hornik.
cyclones Functions for locating local minima/maxima. By Rasmus E. Benestad.

eco Fits parametric and nonparametric Bayesian
models for ecological inference in 2 by 2 tables. The models are fit using the Markov chain
Monte Carlo algorithms that are described in
Imai and Lu (2004). By Ying Lu and Kosuke
Imai.
edci Detection of edgepoints in images based on the
difference of two asymmetric M kernel estimators. Linear and circular regression clustering
based on redescending M estimators. Detection of linear edges in images. By Tim Garlipp.
elasticnet Elastic net regularization and variable selection. Also implements the sparse PCA algorithm based on the elastic net/lasso. By Hui
Zou and Trevor Hastie.

ensembleBMA Uses Bayesian Model Averaging to
create probabilistic forecasts of ensembles using a mixture of normal distributions. By
Adrian E. Raftery, J. McLean Sloughter, and
Michael Polakowski.
epitools Basic tools for applied epidemiology. By
Tomas Aragon.
epsi Smoothing methods for images which are
based on a redescending M kernel estimator
which preserves edges and corners. By Tim
Garlipp.
far Modelizations and previsions functions for
Functional AutoRegressive processes using
nonparametric methods: functional kernel, estimation of the covariance operator in a subspace, . . . . By Damon Julien and Guillas Serge.
frailtypack Fit a shared gamma frailty model and
Cox proportional hazards model using a Penalized Likelihood on the hazard function. Left
truncated, censored data and strata (max=2)
are allowed. Original Fortran routines by Virginie Rondeau. Modified Fortran routines, R
code and packaging by Juan R Gonzalez.
gcmrec Parameters estimation of the general semiparametric model for recurrent event data proposed by Peña and Hollander. By Juan R. Gonzalez, Elizabeth H. Slate, and Edsel A Peña.
genalg R based genetic algorithm for binary and
floating point chromosomes. By Egon Willighagen.
gmp Multiple precision Arithmetic (prime numbers,
. . . ), “arithmetic without limitations” using the
C library gmp. By Antoine Lucas, Immanuel
Scholz, Rainer Boehme, and Sylvain Jasson.
gsl An R wrapper for the special functions and quasi
random number generators of the Gnu Scientific Library (http://www.gnu.org/software/gsl/). By Robin K. S. Hankin; qrng functions
by Duncan Murdoch.
hmm.discnp Fits hidden Markov models with discrete non-parametric observation distributions
to data sets. Simulates data from such models.
By Rolf Turner and Limin Liu.
hopach The Hierarchical Ordered Partitioning and
Collapsing Hybrid (HOPACH) clustering algorithm. By Katherine S. Pollard, with Mark J.
van der Laan.
intcox Implementation of the Iterated Convex Minorant Algorithm for the Cox proportional hazard model for interval censored event data. By
Volkmar Henschel, Christiane Heiss, and Ulrich Mansmann.

irr Coefficients of Interrater Reliability and Agreement for quantitative, ordinal and nominal
data: ICC, Finn-Coefficient, Robinson’s A,
Kendall’s W, Cohen’s Kappa, . . . . By Matthias
Gamer.
kknn Weighted k-Nearest Neighbors Classification
and Regression. By Klaus Schliep & Klaus
Hechenbichler.
ks Bandwidth matrices for kernel density estimators
and kernel discriminant analysis for bivariate
data. By Tarn Duong.


mvoutlier Outlier detection using robust estimations of location and covariance structure. By
Moritz Gschwandtner and Peter Filzmoser.
nice Get or set UNIX priority (niceness) of running
R process. By Charles J. Geyer.
ouch Fit and compare Ornstein-Uhlenbeck models
for evolution along a phylogenetic tree. By
Aaron A. King.
outliers A collection of some tests commonly used
for identifying outliers. By Lukasz Komsta.

labdsv A variety of ordination and vegetation analyses useful in analysis of datasets in community ecology. Includes many of the common
ordination methods, with graphical routines to
facilitate their interpretation, as well as several
novel analyses. By David W. Roberts.

perturb Evaluates collinearity by adding random
noise to selected variables. By John Hendrickx.

latticeExtra Generic function and standard methods
for Trellis-based displays. By Deepayan Sarkar.

pls Multivariate regression by partial least squares
regression (PLSR) and principal components
regression (PCR). This package supersedes the
pls.pcr package. By Ron Wehrens and Bjørn-Helge Mevik.

ltm Analysis of multivariate Bernoulli data using latent trait models (including the Rasch model)
under the Item Response Theory approach. By
Dimitris Rizopoulos.
maanova Analysis of N-dye Micro Array experiment using mixed model effect. Containing
analysis of variance, permutation and bootstrap, cluster and consensus tree. By Hao Wu,
with ideas from Gary Churchill, Katie Kerr and
Xiangqin Cui.
matlab Emulate MATLAB code using R. By P. Roebuck.
mcmc Functions for Markov chain Monte Carlo
(MCMC). By Charles J. Geyer.
meta Fixed and random effects meta-analysis. Functions for tests of bias, forest and funnel plot. By
Guido Schwarzer.
micEcon Tools for microeconomic analysis and microeconomic modelling. By Arne Henningsen.
minpack.lm Provides R interface for two functions from MINPACK library, solving nonlinear least squares problem by modification of
the Levenberg-Marquardt algorithm. By Timur
V. Elzhov.
mlica An R code implementation of the maximum
likelihood (fixed point) algorithm of Hyvaerinen, Karhunen and Oja for independent component analysis. By Andrew Teschendorff.
mlmRev Data and examples from a multilevel modelling software review as well as other well-known data sets from the multilevel modelling
literature. By Douglas Bates.

phpSerialize Serializes R objects for import by PHP
into an associative array. Can be used to build
interactive web pages with R. By Dieter Menne.

plsgenomics Provides routines for PLS-based genomic analyses. By Anne-Laure Boulesteix and
Korbinian Strimmer.
plugdensity Kernel density estimation with global
bandwidth selection via “plug-in”. By Eva
Herrmann (C original); R interface et cetera by
Martin Maechler.
polycor Computes polychoric and polyserial correlations by quick “two-step” methods or ML,
optionally with standard errors; tetrachoric
and biserial correlations are special cases. By
John Fox.
ppc Sample classification of protein mass spectra by
peak probability contrasts. By Balasubramanian Narasimhan, R. Tibshirani, and T. Hastie.
proto An object oriented system using prototype
or object-based (rather than class-based) object
oriented ideas. By Louis Kates and Thomas
Petzoldt.
pvclust Assessing the uncertainty in hierarchical cluster analysis. By Ryota Suzuki and
Hidetoshi Shimodaira.
qtlDesign Tools for the design of QTL experiments.
By Saunak Sen, Jaya Satagopan, and Gary
Churchill.
qvalue Q-value estimation for false discovery rate
control. By Alan Dabney and John D. Storey,
with assistance from Gregory R. Warnes.

relsurv Various functions for regression in relative
survival. By Maja Pohar.

tweedie Maximum likelihood computations for
Tweedie families. By Peter Dunn.

resper Two accept-and-reject algorithms to sample
from permutations. By Johannes Hüsing.

uroot Unit root tests (KPSS, ADF, CH and HEGY)
and graphics for seasonal time series. By Javier
López-de-Lacalle & Ignacio Díaz-Emparanza.

rstream Unified object oriented interface for multiple independent streams of random numbers
from different sources. By Josef Leydold.
rwt Performing digital signal processing. By P. Roebuck, based on MATLAB extension by Rice
University’s DSP group.

vabayelMix Performs inference of a gaussian mixture model within a bayesian framework using an optimal separable approximation to the
posterior density. The optimal posterior approximation is obtained using a variational approach. By Andrew Teschendorff.

seqinr Exploratory data analysis and data visualization for biological sequence (DNA and protein)
data. By Delphine Charif and Jean Lobry.

verification Contains utilities for verification of discrete and probabilistic forecasts. By the NCAR
Research Application Program.

spectrino Spectra organizer, visualization and data
extraction from within R. By Teodor Krastev.

zicounts Fits classical and zero-inflated count data
regression model as well as censored count
data regression. By S M Mwalili.

stepwise A stepwise approach to identifying recombination breakpoints in a sequence alignment.
By Jinko Graham, Brad McNeney, and Francoise Seillier-Moiseiwitsch, R interface by Sigal
Blay.
survBayes Fits a proportional hazards model to time
to event data by a Bayesian approach. Right
and interval censored data and a log-normal
frailty term can be fitted. By Volkmar Henschel,
Christiane Heiss, Ulrich Mansmann.
tdist Computes the distribution of a linear combination of independent Student’s t-variables (with
small degrees of freedom, dff ≤ 100) and/or
standard Normal Z random variables. By Viktor Witkovsky and Alexander Savin.

Other changes
• Packages CoCoAn, gpls, and multiv were
moved from the main CRAN section to the
Archive.
• Package Dopt was moved from the Devel section of CRAN to the Archive.
• Package multidim had to be removed from
CRAN.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Events
Chambers Workshop
A workshop entitled “40 Years of Statistical Computing and Beyond” was held at Bell Laboratories, Murray Hill, NJ, U.S.A. on April 29, 2005 to mark the occasion of John Chambers’ retirement from the Labs
but not from active research. He is continuing his
record of innovation by becoming the first Emeritus
Member of Technical Staff in the history of Bell Labs.
R is an implementation of the S language and
John is the originator and principal designer of S.
Without John’s work on S there never would have
been an R. His patient responding to questions during early development of R was crucial to its success
and, since 2000, he has been a member of the R Development Core Team.
The history and development of the S language
and of its implementation in R were featured in
many of the presentations at the workshop, especially in those by Rick Becker and Allan Wilks. Naturally, the highlight of the day was John’s presentation in which he first looked back on his 40 years
of involvement in statistical computing and the development of languages for programming with data
and then gave some tantalizing glimpses into the future. The agenda for the workshop can be seen at
http://stat.bell-labs.com.

DSC 2005
DSC 2005 – Directions in Statistical Computing – will
be held from the evening of August 12 through August 15, in Seattle, Washington. This conference follows on from the successful DSC 1999, 2001, and 2003
conferences at the Vienna University of Technology.

The workshop will focus on, but is not limited to,
open source statistical computing.
We are inviting abstracts for contributed presentations on issues related to the development of statistical computing and graphics environments. Abstracts should be sent to dsc2005@u.washington.edu by May 15; questions should go to tlumley@u.washington.edu. More information is available at http://depts.washington.edu/dsc2005/, and online registration will soon open at that website.

Editor-in-Chief:
Douglas Bates
Department of Statistics
University of Wisconsin
1300 University Ave
Madison, WI 53706
USA
Editorial Board:
Paul Murrell and Torsten Hothorn.
Editor Programmer’s Niche:
Bill Venables
Editor Help Desk:
Uwe Ligges
Email of editors and editorial board:
firstname.lastname@R-project.org


Bioconductor User Conference
The 2005 Bioconductor User Conference will be
hosted at the Fred Hutchinson Cancer Research Center in Seattle on August 16 and 17, following the
DSC. The two day meeting will consist of morning lectures and afternoon laboratories. Scheduled
speakers include: Michael Boutros, Eric Schadt, Sridhar Ramaswamy, Rafael Irizarry, and Robert Gentleman. The conference web site can be found at http://www.bioconductor.org/meeting05. For more details contact biocworkshop@fhcrc.org.

R News is a publication of the R Foundation for Statistical Computing; communications regarding this
publication should be addressed to the editors. All
articles are copyrighted by the respective authors.
Please send submissions to regular columns to the
respective column editor, all other submissions to
the editor-in-chief or another member of the editorial board (more detailed submission instructions
can be found on the R homepage).
R Project Homepage:
http://www.R-project.org/
This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/

ISSN 1609-3631


