R News 2007/1 PDF Rnews 2007 1

PDF Rnews_2007-1 CRAN: R News

User Manual: PDF CRAN - Contents of R News

Open the PDF directly: View PDF PDF.
Page Count: 58

DownloadR News 2007/1 PDF Rnews 2007-1
Open PDF In BrowserView PDF
The Newsletter of the R Project

News

Volume 7/1, April 2007

Editorial
by Torsten Hothorn
Welcome to the first issue of R News for 2007, which
follows the release of R version 2.5.0. This major revision, in addition to many other features, brings better support of JAVA and Objective C to our desks.
Moreover, there is a new recommended package,
codetools, which includes functions that automagically check R code for possible problems.
Just before the release of R 2.5.0 the fifth developer conference on “Directions in Statistical Computing” was held in Auckland, NZ, the birthplace of
R. Hadley Wickham reports on the highlights of this
meeting. The R user community is not only active in
conferences. Volume 7, like the preceding volumes
of R News since 2001, wouldn’t be what it is without
the outstanding support of our referees. The editorial
board would like to say “Thank you!” to all who contributed criticism and encouragement during the last
year—the complete list of referees in 2006 is given at
the end of this issue.
The scientific part of Volume 7 starts with an
article by Paul Murrell, our former editor-in-chief,
on handling binary files with tools provided by the
hexView package. Andrew Robinson teaches how
R users can make use of standard Unix tools, for
example mail for auto-generating large amounts of

Contents of this issue:
Editorial . . . . . . . . . . . . . . . . . . . . . .
Viewing Binary Files with the hexView Package
FlexMix: An R Package for Finite Mixture
Modelling . . . . . . . . . . . . . . . . . . . .
Using R to Perform the AMMI Analysis on
Agriculture Variety Trials . . . . . . . . . . .
Inferences for Ratios of Normal Means . . . . .
Working with Unknown Values . . . . . . . . .
A New Package for Fitting Random Effect
Models . . . . . . . . . . . . . . . . . . . . . .

1
2
8
14
20
24
26

email (not spam!). Many of us are regularly confronted with data lacking a unique definition of missing values–the gdata package can help in this situation, as Gregor Gorjanc explains.
Bettina Grün and Fritz Leisch give an introduction to the flexmix package for finite mixture modeling, analyzing a dataset on 21 different whiskey
brands. The analysis of field agricultural experiments by means of additive main effect multiplicative interactions is discussed by Andrea Onofri and
Egidio Ciriciofolo. Tests and confidence intervals for
ratios of means, such as ratios of regression coefficients, implemented in package mratio are described
by Gemechis Dilba and colleagues. The npmlreg
package for fitting random effect models is introduced by Jochen Einbeck and his co-workers. Mathieu Ribatet models peaks over a threshold by POT,
and financial instruments like stocks or options are
(back-)tested by Kyle Campbell and colleagues.
Finally, I would like to remind everyone that the
next “useR!” conference is taking place in Ames,
Iowa, August 8–10. I hope to see you there!
Torsten Hothorn
Ludwig–Maximilians–Universität München
Germany
Torsten.Hothorn@R-project.org
Augmenting R with Unix Tools . . . . . .
POT: Modelling Peaks Over a Threshold .
Backtests . . . . . . . . . . . . . . . . . . .
Review of John Verzani’s Book
Using R for Introductory Statistics . . .
DSC 2007 . . . . . . . . . . . . . . . . . . .
New Journal: Annals of Applied Statistics
Forthcoming Events: useR! 2007 . . . . . .
Changes in R 2.5.0 . . . . . . . . . . . . . .
Changes on CRAN . . . . . . . . . . . . .
R Foundation News . . . . . . . . . . . . .
R News Referees 2006 . . . . . . . . . . . .

. . .
. . .
. . .

30
34
36

.
.
.
.
.
.
.
.

41
42
43
43
43
51
56
56

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

Vol. 7/1, April 2007

2

Viewing Binary Files with the hexView
Package
by Paul Murrell

> viewRaw(hexViewFile("rawTest.txt"))

I really like plain text files.
I like them because I can see exactly what is in
them. I can even easily modify the file if I’m feeling
dangerous. This makes me feel like I understand the
file.
I am not so fond of binary files. I always have to
use specific software to access the contents and that
software only shows me an interpretation of the basic content of the file. The raw content is hidden from
me.
Sometimes I want to know more about a real binary file, for example when I need to read data in a
binary format that no existing R function will read.
When things go wrong, like when an R workspace
file becomes “corrupt”, I may have a strong need to
know more.
Hex editors are wonderful tools that provide a
view of the raw contents of a binary (or text) file,
whether just to aid in understanding the file or to inspect or recover a file. The hexView package is an
attempt to bring this sort of facility to R.

0
8

Viewing raw text files

One noteworthy feature of this simple file is the
last byte, which has the hexadecimal value 0a (or
00001010 in binary; the decimal value 10) and no
printable ASCII interpretation. This is the ASCII
code for the newline or line feed (LF) special character that indicates the end of a line in text files. This
is a simple demonstration that even plain text files
have details that are hidden from the user by standard viewing software; viewers will show text on
separate lines, but do not usually show the “character” representing the start of a new line.
The next example provides a more dramatic
demonstration of hidden details in text files. The file
we will look at contains the same text as the previous
example, but was created on a Windows XP system
with Notepad using “Save As...” and selecting “Unicode” as the “Encoding”. The readLines() function
just needs the file to be opened with the appropriate
encoding, then it produces the same result as before.

The viewRaw() function reads and displays the raw
content of a file. The content is displayed in three
columns: the left column provides a byte offset
within the file, the middle column shows the raw
bytes, and the right column displays each byte as an
ASCII character. If the byte does not correspond to
a printable ASCII character then a full stop is displayed.
As a simple example, we will look at a plain text
file, "rawTest.txt", that contains a single line of
text. This file was created using the following code
(on a Linux system).
> writeLines("test pattern", "rawTest.txt")
A number of small example files are included as
part of the hexView package and the hexViewFile()
function is provided to make it convenient to refer
to these files. The readLines() function from the
base package reads in the lines of a plain text file as
a vector of strings, so the plain text content of the file
"rawTest.txt" can be retrieved as follows.

:
:

74 65 73 74 20 70 61 74
74 65 72 6e 0a

|
|

test pat
tern.

As this example shows, by default, the raw bytes
are printed in hexadecimal format. The first byte
in this file is 74, which is 7 ∗ 16 + 4 = 116 in decimal notation—the ASCII code for the character t.
This byte pattern can be seen several times in the file,
wherever there is a t character.
The machine argument to the viewRaw() function
controls how the raw bytes are displayed. It defaults
to "hex" for hexadecimal output, but also accepts the
value "binary", which means that the raw bytes are
printed in binary format, as shown below.
> viewRaw(hexViewFile("rawTest.txt"),
machine="binary")
0
3
6
9
12

:
:
:
:
:

01110100
01110100
01100001
01100101
00001010

01100101
00100000
01110100
01110010

01110011
01110000
01110100
01101110

|
|
|
|
|

tes
t p
att
ern
.

> readLines(
file(hexViewFile("rawTest.unicode"),
encoding="UCS-2LE"))

> readLines(hexViewFile("rawTest.txt"))

[1] "test pattern"

[1] "test pattern"

However, the raw content of the file is now very different.

The following code uses the viewRaw() function
from hexView to display the raw contents of this file.
R News

> viewRaw(hexViewFile("rawTest.unicode"))
ISSN 1609-3631

Vol. 7/1, April 2007

0
8
16
24

:
:
:
:

ff
74
74
6e

fe
00
00
00

74
20
74
0d

00
00
00
00

3

65
70
65
0a

00 73 00
00 61 00
00 72 00
00

|
|
|
|

..t.e.s.
t. .p.a.
t.t.e.r.
n.....

It is fairly straightforward to identify some parts of
this file. The ASCII codes from the previous example
are there again, but there is an extra 00 byte after each
one. This reflects the fact that, on Windows, Unicode
text is stored using two bytes per character1 .
Instead of the 13 bytes in the original file, we
might expect 26 bytes in this file, but there are actually 30 bytes. Where did the extra bytes come from?
The first two bytes at the start of the file are a
byte order mark (BOM). With two bytes to store for
each character, there are two possible orderings of
the bytes; for example, the two bytes for the character t could be stored as 74 00 (called little endian) or
as 00 74 (big endian). The BOM tells software which
order has been used. Another difference occurs at the
end of the file. The newline character is there again
(with an extra 00), but just before it there is a 0d character (with an extra 00). This is the carriage return
(CR) character. On Windows, a new line in a text file
is signalled by the combination CR+LF, but on UNIX
a new line is just indicated by a single LF.
As this example makes clear, software sometimes
does a lot of work behind the scenes in order to display even “plain text”.

Viewing raw binary files
An example of a binary file is the native binary format used by R to store information via the save()
function. The following code was used to create the
file "rawTest.bin".
> save(rnorm(50), file="rawTest.bin")
We can view this file with the following code; the
nbytes argument is used to show the raw data for
only the first 80 bytes.
> viewRaw(hexViewFile("rawTest.bin"),
nbytes=80)
0
8
16
24
32
40
48
56
64
72

:
:
:
:
:
:
:
:
:
:

1f
00
44
00
03
00
00
00
c6
bb

8b
03
58
02
00
01
01
00
fe
4e

08
01
32
00
00
00
7a
32
0d
18

00
c0
0a
02
00
00
00
3f
3f
bf

00
01
58
05
04
10
00
e7
e1
c4

00
3f
0a
00
02
09
00
60
3b
9e

00
fe
00
00
00
00
0e
e6
c5
0f

00
52
00
02
00
00
00
49
2f
1a

|
|
|
|
|
|
|
|
|
|

........
.....?.R
DX2.X...
........
........
........
..z.....
..2?.`.I
...?.;./
.N......

This is a good example of a binary file that is intriguing to view, but there is little hope of retrieving any useful information because the data has been
compressed (encoded). In other cases, things are a
not so hopeless, and it is not only possible to view the
raw bytes, but also to see useful patterns and structures.
The next example looks at a binary file with a
much simpler structure. The file "rawTest.int" only
contains (uncompressed) integer values and was created by the following code.
> writeBin(1:50, "rawTest.int", size=4)
This file only contains the integers from 1 to 50,
with four bytes used for each integer. The raw contents are shown below; this time the nbytes argument has been used to show only the raw data for
the first 10 integers (the first 40 bytes).
> viewRaw(hexViewFile("rawTest.int"),
nbytes=40)
0
8
16
24
32

:
:
:
:
:

01
03
05
07
09

00
00
00
00
00

00
00
00
00
00

00
00
00
00
00

02
04
06
08
0a

00
00
00
00
00

00
00
00
00
00

00
00
00
00
00

|
|
|
|
|

........
........
........
........
........

None of the bytes correspond to printable ASCII
characters in this case, so the right column of output is not terribly interesting. The viewRaw() function has two arguments, human and size, which control the way that the raw bytes are interpreted and
displayed. In this case, rather than interpreting each
byte as an ASCII character, it makes sense to interpret
each block of four bytes as an integer. This is done in
the following code using human="int" and size=4.
> viewRaw(hexViewFile("rawTest.int"),
nbytes=40, human="int", size=4)
0
8
16
24
32

:
:
:
:
:

01
03
05
07
09

00
00
00
00
00

00
00
00
00
00

00
00
00
00
00

02
04
06
08
0a

00
00
00
00
00

00
00
00
00
00

00
00
00
00
00

|
|
|
|
|

1 2
3 4
5 6
7 8
9 10

With this simple binary format, we can see how
the individual integers are being stored. The integer
1 is stored as the four bytes 01 00 00 00, the integer
2 as 02 00 00 00, and so on. This clearly demonstrates the idea of little endian byte order; the leastsignificant byte, the value 1, is stored first. In big endian byte order, the integer 1 would be 00 00 00 01
(as we shall see later).
The other option for interpreting bytes is "real"
which means that each block of size bytes is interpreted as a floating-point value. A simple example

1 Over-simplification alert! Windows used to use the UCS-2 encoding, which has two bytes per character, but now it uses UTF-16, which
has two or four bytes per character. There are only two bytes per character in this case because these are common english characters.

R News

ISSN 1609-3631

Vol. 7/1, April 2007

4

is provided by the file "rawTest.real", which was
generated by the following code. I have deliberately
used big endian byte order because it will make it
easier to see the structure in the resulting bytes.
> writeBin(1:50/50, "rawTest.real", size=8,
endian="big")
Here is an example of reading this file and interpreting each block of 8 bytes as a floating-point number. This also demonstrates the use of the width argument to explicitly control how many bytes are displayed per line of output.
> viewRaw(hexViewFile("rawTest.real"),
nbytes=40, human="real", width=8,
endian="big")
0
8
16
24
32

:
:
:
:
:

3f
3f
3f
3f
3f

94
a4
ae
b4
b9

7a
7a
b8
7a
99

e1
e1
51
e1
99

47
47
eb
47
99

ae
ae
85
ae
99

14
14
1e
14
99

7b
7b
b8
7b
9a

|
|
|
|
|

0.02
0.04
0.06
0.08
0.10

Again, we are able to see how individual floatingpoint values are stored. The following code takes this
a little further and allows us to inspect the bit representation of the floating point numbers. The output
is shown in Figure 1.
> viewRaw(hexViewFile("rawTest.real"),
nbytes=40, human="real",
machine="binary", width=8,
endian="big")
The bit representation adheres to the IEEE Standard for Binary Floating-Point Arithmetic (IEEE,
1985; Wikipedia, 2006). Each value is stored in the
form sign × mantissa × 2exponent . The first (left-most)
bit indicates the sign of the number, the next 11 bits
describe the exponent and the remaining 52 bits describe the mantissa. The mantissa is a binary fraction, with bit i corresponding to 2−i .
For the first value in "rawTest.real", the first bit
has value 0 indicating a positive number, the exponent bits are 0111111 1001 = 1017, from which we
subtract 1023 to get −6, and the mantissa is an implicit 1 plus 0 × 2−1 + 1 × 2−2 + 0 × 2−3 + 0 × 2−4 +
0 × 2−5 + 1 × 2−6 ... = 1.28.2 So we have the value
1.28 × 2−6 = 0.02.

Viewing a Binary File in Blocks
As the examples so far have hopefully demonstrated,
being able to see the raw contents of a file can be a
very good way to teach concepts such as endianness,
character encodings, and floating-point representations of real numbers. Plus, it is just good fun to poke
around in a file and see what is going on.

In this section, we will look at some more advanced functions from the hexView package, which
will allow us to take a more detailed look at more
complex binary formats and will allow us to perform
some more practical tasks.
We will start by looking again at R’s native binary format. The file "rawTest.XDRint" contains
the integers 1 to 50 saved as a binary R object
and was produced using the following code. The
compress=FALSE is important to allow us to see the
structure of the file.
> save(1:50, file="rawTest.XDRint",
compress=FALSE)
We can view (the first 80 bytes of) the raw file using viewRaw() as before and this does show us some
interesting features. For example, we can see the text
RDX2 at the start of the file (it is common for files to
have identifying markers at the start of the file). If
we look a little harder, we can also see the first few
integers (1 to 9); the data is stored in an XDR format
(Wikipedia, 2006a), which uses big endian byte order, so the integers are in consecutive blocks of four
bytes that look like this: 00 00 00 01, then 00 00 00
02, and so on.
> viewRaw(hexViewFile("rawTest.XDRint"),
width=8, nbytes=80)
0
8
16
24
32
40
48
56
64
72

:
:
:
:
:
:
:
:
:
:

52
00
02
00
00
00
00
00
00
00

44
00
03
00
00
00
00
00
00
00

58
02
00
01
01
00
00
00
00
00

32
00
00
00
78
32
02
04
06
08

0a
02
00
00
00
00
00
00
00
00

58
04
04
10
00
00
00
00
00
00

0a
00
02
09
00
00
00
00
00
00

00
00
00
00
0d
01
03
05
07
09

|
|
|
|
|
|
|
|
|
|

RDX2.X..
........
........
........
...x....
...2....
........
........
........
........

It is clear that there is some text in the file and that
there are some integers in the file, so neither viewing
the whole file as characters nor viewing the whole
file as integers is satisfactory. What we need to be
able to do is view the text sections as characters and
the integer sections as integers. This is what the functions memFormat(), memBlock(), and friends are for.
The memBlock() function creates a description of
a block of memory, specifying how many bytes are
in the block; the block is interpreted as ASCII characters. The atomicBlock() function creates a description of a memory block that contains a single
value of a specified type (e.g., a four-byte integer),
and the vectorBlock() function creates a description of a memory block consisting of 1 or more memory blocks.
A number of standard memory blocks are predefined: integer4 (a four-byte integer) and integer1,

2 At least, as close as it is possible to get to 1.28 with a finite number of bits. Another useful thing about viewing raw values is that it
makes explicit the fact that most decimal values do not have an exact floating-point representation.

R News

ISSN 1609-3631

Vol. 7/1, April 2007

0
8
16
24
32

:
:
:
:
:

00111111
00111111
00111111
00111111
00111111

5

10010100
10100100
10101110
10110100
10111001

01111010
01111010
10111000
01111010
10011001

11100001
11100001
01010001
11100001
10011001

01000111
01000111
11101011
01000111
10011001

10101110
10101110
10000101
10101110
10011001

00010100
00010100
00011110
00010100
10011001

01111011
01111011
10111000
01111011
10011010

|
|
|
|
|

0.02
0.04
0.06
0.08
0.10

Figure 1: The floating point representation of the numbers 0.02 to 0.10 following IEEE 754 in big endian byte
order.
integer2, and integer8; real8 (an eight-byte
floating-point number, or double) and real4; and
ASCIIchar (a single-byte character). There is also
a special ASCIIline memory block for a series of
single-byte characters terminated by a newline.
The memFormat() function collects a number of memory block descriptions together and
viewFormat() reads the memory blocks and displays
them.
As an example, the following code reads in the
"RDX2" header line of the file "rawTest.XDRint",
treats the next 39 bytes as just raw binary, ignoring
any structure, then reads the first nine integers (as integers). A new memory block description is needed
for the integers because the XDR format is big endian
(the predefined integer4 is little endian). The names
of the memory blocks within the format are used to
separate the blocks of output.
> XDRint <- atomicBlock("int", endian="big")
> viewFormat(hexViewFile("rawTest.XDRint"),
memFormat(saveFormat=ASCIIline,
rawBlock=memBlock(39),
integers=vectorBlock(XDRint,
9)))
========saveFormat
0 : 52 44 58 32
========rawBlock
5 : 58 0a 00 00
12 : 02 04 00 00
19 : 00 00 04 02
26 : 01 00 00 10
33 : 00 01 78 00
40 : 00 00 00 32
========integers
44 : 00 00 00 01
52 : 00 00 00 03
60 : 00 00 00 05
68 : 00 00 00 07
76 : 00 00 00 09

0a
00
02
00
09
00

02
03
00
00
00

00
00
00
00
0d

00
00
00
00

00
00
00
00

00
00
00
00

02
04
06
08

|

RDX2.

|
|
|
|
|
|

X......
.......
.......
.......
..x....
...2

|
|
|
|
|

1
3
5
7
9

2
4
6
8

The raw 39 bytes can be further broken down—
see the description of R’s native binary format on
pages 11 and 12 of the “R Internals” manual (R Development Core Team, 2006) that is distributed with
R—but that is beyond the scope of this article.
3 There

R News

Extracting Blocks from a Binary
File
As well as viewing different blocks of a binary
file, we may want to extract the values from each
block. For this purpose, the readFormat() function
is provided to read a binary format, as produced
by memFormat(), and generate a "rawFormat" object
(but not explicitly print it3 ). A "rawFormat" object is
a list with a component "blocks" that is itself a list
of "rawBlock" objects, one for each memory block
defined in the memory format. A "rawBlock" object
contains the raw bytes read from a file.
The blockValue() function extracts the interpreted value from a "rawBlock" object.
The
blockString() function is provided specifically
for extracting a null-terminated string from a
"rawBlock" object.
The following code reads in the file
"rawTest.XDRint" and just extracts the 50 integer
values.
> XDRfile  blockValue(XDRfile$blocks$integers)
[1] 1 2 3 4 5 6 7 8 9
[14] 14 15 16 17 18 19 20 21 22
[27] 27 28 29 30 31 32 33 34 35
[40] 40 41 42 43 44 45 46 47 48

10
23
36
49

11 12 13
24 25 26
37 38 39
50

A Caution
On a typical 32-bit platform, R uses 4 bytes for representing integer values in memory and 8 bytes for
floating-point values. This means that there may be
limits on what sort of values can be interpreted correctly by hexView.
For example, if a file contains 8-byte integers, it is
possible to view each set of 8 bytes as an integer, but
on my system R can only represent an integer using
4 bytes, so 4 of the bytes are (silently) dropped. The
following code demonstrates this effect by reading

is a readRaw() function too.

ISSN 1609-3631

Vol. 7/1, April 2007

6

the file "testRaw.int" and interpreting its contents
as 8-byte integers.
> viewRaw(hexViewFile("rawTest.int"),
nbytes=40, human="int", size=8)
0
8
16
24
32

:
:
:
:
:

01
03
05
07
09

00
00
00
00
00

00
00
00
00
00

00
00
00
00
00

02
04
06
08
0a

00
00
00
00
00

00
00
00
00
00

00
00
00
00
00

|
|
|
|
|

1
3
5
7
9

An extended example:
Reading EViews Files
On November 18 2006, Dietrich Trenkler sent a message to the R-help mailing list asking for a function to read files in the native binary format used
by Eviews, an econometrics software package (http:
//www.eviews.com/). No such function exists, but
John C Frain helpfully pointed out that an unofficial description of the basic structure of Eviews files
had been made available by Allin Cottrell (creator of
gretl, the Gnu Regression, Econometrics and Timeseries Library). The details of Allin Cottrell’s reverseengineering efforts are available on the web (http:
//www.ecn.wfu.edu/~cottrell/eviews_format/).
In this section, we will use the hexView package
to explore an Eviews file and produce a new function
for reading files in this format. The example data file
we will use is from Ramu Ramanathan’s Introductory Econometrics text (Ramanathan, 2002). The data
consists of four variables measured on single family
homes in University City, San Diego, in 1990:
price: sale price in thousands of dollars.

5
6
7
8
9
10
11
12
13
14

239.0
293.0
285.0
365.0
295.0
290.0
385.0
505.0
425.0
415.0

1600
1750
1800
1870
1935
1948
2254
2600
2800
3000

3
4
4
4
4
4
4
3
4
4

2.00
2.00
2.75
2.00
2.50
2.00
3.00
2.50
3.00
3.00

An Eviews file begins with a header, starting with
the text “New MicroTSP Workfile” and including important information about the size of the header and
the number of variables and the number of observations in the file. The following code defines an appropriate "memFormat" object for this header information.
> EViewsHeader  read.table(hexViewFile("data4-1.txt"),
col.names=c("price", "sqft",
"bedrms", "baths"))

1
2
3
4

price
199.9
228.0
235.0
285.0
4 The

sqft bedrms baths
1065
3 1.75
1254
3 2.00
1300
3 2.00
1577
4 2.50

> data4.1.header  data4.1.header
=========firstline
0 : 4e 65 77 20
6 : 63 72 6f 54
12 : 20 57 6f 72
18 : 69 6c 65 00
24 : d8 5e 0e 01
30 : 00 00 00 00
36 : 15 00 00 00
42 : ff ff ff ff
48 : 00 00 00 00
54 : 00 00 06 00
60 : 0f 00 00 00
66 : 00 00 01 00
72 : 66 03 00 00
78 : 00 00
=========headersize
80 : 90 00 00 00
=========unknown
88 : 01 00 00 00

4d
53
6b
00
00
08
00
21
00
00
06
01
00

69
50
66
00
00
00
00
00
00
00
00
00
00

|
|
|
|
|
|
|
|
|
|
|
|
|
|

New Mi
croTSP
Workf
ile...
.^....
......
......
....!.
......
......
......
......
f.....
..

00 00 00 00

|

144

01 00

|

......

original source of the files was: http://ricardo.ecn.wfu.edu/pub/gretl_cdrom/data/

R News

ISSN 1609-3631

Vol. 7/1, April 2007

7

94 : 00 00 01 00 00 00
100 : 00 00 00 00 00 00
106 : 00 00 00 00 00 00
112 : 00 00
=========numvblesplusone
114 : 07 00 00 00
=========date
118 : d5 b7 0d 3a
=========unkown
122 : 06 00
=========datafreq
124 : 01 00
=========startperiod
126 : 00 00
=========startobs
128 : 01 00 00 00
=========unkown
132 : 00 5d 67 0e 01 59
138 : 8b 41
=========numobs
140 : 0e 00 00 00

|
|
|
|

......
......
......
..

|

7

|

...:

|

..

|

1

|

0

|

1

|
|

.]g..Y
.A

|

14

We can extract some pieces of information from
this header and use them to look at later parts of the
file.
> headerSize  numObs  EViewsVbleInfo  data4.1.vinfo  data4.1.vinfo
=========unknown
170 : 00 00 00 00 0b 00
=========recsize
176 : 86 00 00 00

R News

|

......

|

134

=========memsize
180 : 70 00 00 00
=========ptrtodata
184 : f6 03 00 00 00
=========vblename
192 : 42 41 54 48 53
202 : 00 00 00 00 00
212 : 00 00 00 00 00
222 : 00 00
=========ptrtohistory
224 : 00 00 00 00 d5
=========vbletype
232 : 2c 00
=========unknown
234 : 60 02 10 00 01

|

112

00 00 00

|

1014

00 00 00 00 00
00 00 00 00 00
00 00 00 00 00

|
|
|
|

BATHS.....
..........
..........
..

b7 0d 3a

|

0

|

44

|

`

00

.....

> firstVbleLoc  EViewsVbleData <- function(numObs) {
memFormat(numobs=integer4,
startobs=integer4,
unknown=memBlock(8),
endobs=integer4,
unknown=memBlock(2),
values=vectorBlock(real8,
numObs))
}
> viewFormat(hexViewFile("data4-1.wf1"),
EViewsVbleData(numObs),
offset=firstVbleLoc)
=========numobs
1014 : 0e 00 00
=========startobs
1018 : 01 00 00
=========unknown
1022 : 00 00 00
1028 : 00 00
=========endobs
1030 : 0e 00 00
=========unknown
1034 : 00 00
=========values
1036 : 00 00 00
1044 : 00 00 00
1052 : 00 00 00
1060 : 00 00 00
1068 : 00 00 00
1076 : 00 00 00
1084 : 00 00 00
1092 : 00 00 00
1100 : 00 00 00
1108 : 00 00 00
1116 : 00 00 00
1124 : 00 00 00
1132 : 00 00 00
1140 : 00 00 00

00

|

14

00

|

1

00 00 00

|
|

......
..

00

|

14

|

..

|
|
|
|
|
|
|
|
|
|
|
|
|
|

1.75
2.00
2.00
2.50
2.00
2.00
2.75
2.00
2.50
2.00
3.00
2.50
3.00
3.00

00
00
00
00
00
00
00
00
00
00
00
00
00
00

00
00
00
00
00
00
00
00
00
00
00
00
00
00

00
00
00
00
00
00
00
00
00
00
00
00
00
00

fc
00
00
04
00
00
06
00
04
00
08
04
08
08

3f
40
40
40
40
40
40
40
40
40
40
40
40
40

This manual process of exploring the file structure can easily be automated within a function. The
hexView package includes such a function under the
ISSN 1609-3631

Vol. 7/1, April 2007

name readEViews(). With this function, we can read
in the data set from the Eviews file as follows.
> readEViews(hexViewFile("data4-1.wf1"))
Skipping boilerplate variable
Skipping boilerplate variable
BATHS BEDRMS PRICE SQFT
1
1.75
3 199.9 1065
2
2.00
3 228.0 1254
3
2.00
3 235.0 1300
4
2.50
4 285.0 1577
5
2.00
3 239.0 1600
6
2.00
4 293.0 1750
7
2.75
4 285.0 1800
8
2.00
4 365.0 1870
9
2.50
4 295.0 1935
10 2.00
4 290.0 1948
11 3.00
4 385.0 2254
12 2.50
3 505.0 2600
13 3.00
4 425.0 2800
14 3.00
4 415.0 3000
This solution is not the most efficient way to read
Eviews files, but the hexView package does make it
easy to gradually build up a solution, it makes it easy
to view the results, and it does provide a way to solve
the problem without having to resort to C code.

Summary
The hexView package provides functions for viewing the raw byte contents of files. This is useful for
exploring a file structure and for demonstrating how
information is stored on a computer. More advanced
functions make it possible to read quite complex binary formats using only R code.

Acknowledgements
At the heart of the hexView package is the
readBin() function and the core facilities for work-

8

ing with "raw" binary objects in R code (e.g.,
rawToChar()); thanks to the R-core member(s) who
were responsible for developing those features.
I would also like to thank the anonymous reviewer for useful comments on early drafts of this
article.

Bibliography
IEEE Standard 754 for Binary Floating-Point Arithmetic.
IEEE computer society, 1985. 4
R Development Core Team. R Internals. R Foundation for Statistical Computing, Vienna, Austria,
2006. URL http://www.R-project.org. ISBN 3900051-14-3. 5
R. Ramanathan. INTRODUCTORY ECONOMETRICS WITH APPLICATIONS. Harcourt College, 5
edition, 2002. ISBN 0-03-034342-9. 6
Wikipedia.
External data representation —
wikipedia, the free encyclopedia, 2006a. URL
http://en.wikipedia.org/w/index.php?
title=External_Data_Representation&oldid=
91734878. [Online; accessed 3-December-2006]. 4
Wikipedia.
IEEE floating-point standard —
wikipedia, the free encyclopedia, 2006b. URL
http://en.wikipedia.org/w/index.php?
title=IEEE_floating-point_standard&oldid=
89734307. [Online; accessed 3-December-2006]. 4

Paul Murrell
Department of Statistics
The University of Auckland
New Zealand
paul@stat.auckland.ac.nz

FlexMix: An R Package for Finite Mixture
Modelling
by Bettina Grün and Friedrich Leisch

Introduction
Finite mixture models are a popular method for
modelling unobserved heterogeneity or for approximating general distribution functions. They are applied in a lot of different areas such as astronomy, biology, medicine or marketing. An overview on these
R News

models with many examples for applications is given
in the recent monographs McLachlan and Peel (2000)
and Frühwirth-Schnatter (2006).
Due to this popularity there exist many (standalone) software packages for finite mixture modelling (see McLachlan and Peel, 2000; Wedel and Kamakura, 2001). Furthermore, there are several different R packages for fitting finite mixture models
available on CRAN. Packages which use the EM algoISSN 1609-3631

Vol. 7/1, April 2007

9

rithm for model estimation are flexmix, fpc, mclust,
mixreg, mixtools, and mmlcr. Packages with other
model estimation methods are bayesmix, depmix,
moc, vabayelMix and wle. A short description
of these packages can be found in the CRAN task
view on clustering (http://cran.at.r-project.
org/src/contrib/Views/Cluster.html).

Finite mixture models
A finite mixture model is given by a convex combination of K different components, i.e. the weights of the
components are non-negative and sum to one. For
each component it is assumed that it follows a parametric distribution or is given by a more complex
model, such as a generalized linear model (GLM).
In the following we consider finite mixture densities h(·|·) with K components, dependent variables
y and (optional) independent variables x:
K

h( y| x, w, Θ) =

∑ πk (w, α ) f ( y|x, ϑk )

k=1

where ∀w, α:
K

πk (w, α ) ≥ 0 ∀k

∧

∑ πk (w, α ) = 1

k=1

and
ϑk 6= ϑl

∀k 6= l.

We assume that the component distributions f (·|·)
are from the same distributional family with component specific parameters ϑk . The component weights
or prior class probabilities πk optionally depend on
the concomitant variables w and the parameters α
and are modelled through multinomial logit models
as suggested for example in Dayton and Macready
(1988). A similar model class is also described in
McLachlan and Peel (2000, p. 145). The model can
be estimated using the EM algorithm (see Dempster
et al., 1977; McLachlan and Peel, 2000) for ML estimation or using MCMC methods for Bayesian analysis
(see for example Frühwirth-Schnatter, 2006).
A possible extension of this model class is to
either have mixtures with components where the
parameters of one component are fixed a-priori
(e.g. zero-inflated models; Grün and Leisch, 2007b)
or to even allow different component specific models (e.g. for modelling noise in the data; Dasgupta
and Raftery, 1998).

The EM algorithm provides a common basis for
estimation of a general class of finite mixture models and the package flexmix tries to enable the user
to exploit this commonness. flexmix provides the Estep and takes care of all data handling while the user
is supposed to supply the M-step via model drivers
for the component-specific model and the concomitant variable model. For the M-step available functions for weighted maximum likelihood estimation
can be used as for example glm() for fitting GLMs or
multinom() in MASS for multinomial logit models.
Currently model drivers are available for
model-based clustering of multivariate Gaussian distributions with diagonal or unrestricted
variance-covariance matrices (FLXMCmvnorm()) and
multivariate Bernoulli and Poisson distributions
(FLXMCmvbinary() and FLXMCmvpois()) where the
dimensions are mutually independent. flexmix does
not provide functionality for estimating mixtures
of Gaussian distributions with special variancecovariance structures, as this functionality has already been implemented in the R package mclust
(Fraley and Raftery, 2006).
For mixtures of regressions the Gaussian, binomial, Poisson and gamma distribution can be specified (FLXMRglm()). If some parameters are restricted
to be equal over the components the model driver
FLXMRglmfix() can be used. Zero-inflated Poisson
and binomial regression models can be fitted using FLXMRziglm(). For an example of zero-inflated
models see example("FLXMRziglm"). For the concomitant variable models either constant component
weights (default) can be used or multinomial logit
models (FLXPmultinom()) can be fitted.
Estimation problems can occur if the components
become too small during the EM algorithm. In order to avoid these problems a minimum size can be
specified for each component. This is especially important for finite mixtures of multivariate Gaussian
distributions where full variance-covariance matrices are estimated for each component.
Further details on the implementation and the design principles as well as exemplary applications of
the package can be found in the accompanying vignettes "flexmix-intro" which is an updated version of Leisch (2004) and "regression-examples"
and in Grün and Leisch (2007a). Note that this article
uses the new version 2.0 of the package, where the
names of some driver functions have changed compared with older versions of flexmix.

Design principles of FlexMix

Exemplary applications

The main reason for the implementation of the package was to allow easy extensibility and to have the
possibility for rapid prototyping in order to be able
to try out new mixture models. The package was implemented using S4 classes and methods.

In the following we present two examples for using
the package. The first example demonstrates modelbased clustering, i.e., mixtures without independent
variables, and the second example gives an application for fitting mixtures of generalized linear regres-

R News

ISSN 1609-3631

Vol. 7/1, April 2007

10

sion models.

Model-based clustering
The following dataset is taken from Edwards and Allenby (2003) who refer to the Simmons Study of Media and Markets. It contains all households which
used any whiskey brand during the last year and
provides a binary incidence matrix on their brand
use for 21 whiskey brands during this year. This
means only the information on the different brands
used in a household is available.
We first load the package and the dataset. The
whiskey dataset contains observations from 2218
households. The relative frequency of usage for each
brand is given in Figure 1. Additional information
is available for the brands indicating the type of
whiskey: blend or single malt.
R> library("flexmix")
R> data("whiskey")
R> set.seed(1802)

respect to the log-likelihood for each of the different numbers of components is returned in an object
of class "stepFlexmix". The control argument can
be used to control the fitting with the EM algorithm.
With minprior the minimum relative size of the components is specified, components falling below this
threshold are removed during the EM algorithm.
The dataset contains only the unique binary patterns observed with the corresponding frequency.
We use these frequencies for the weights argument
instead of transforming the dataset to have one row
for each observation. The use of a weights argument
allows to use only the number of unique observations for fitting, which can substantially reduce the
size of the model matrix and hence speed up the estimation process. For this dataset this means that the
model matrix has 484 instead of 2218 rows.
Model selection can be made using information
criteria, as for example the BIC (see Fraley and
Raftery, 1998). For this example the BIC suggests a
mixture with 5 components:
R> BIC(wh_mix)

Singleton
Knockando
White Horse
Scoresby Rare
Ushers
Macallan
Grant's
Passport
Black & White
Clan MacGregor
Ballantine
Pinch (Haig)
Other brands
Cutty Sark
Glenfiddich
Glenlivet
J&B
Dewar's White Label
Johnnie Walker Black Label
Johnnie Walker Red Label
Chivas Regal

1
2
3
4
27705.1 26327.6 25987.7 25683.2
5
6
7
25647.0 25670.3 25718.6
R> wh_best <- getModel(wh_mix, "BIC")
R> wh_best

0.0

0.1

0.2

Probability

0.3

Blend
Single Malt

Figure 1: Relative frequency of the whiskey brands.
We fit a mixture of binomial distributions to the
dataset where the variables in each component specific models are assumed to be independent. The
EM algorithm is repeated nrep = 3 times using random initialization, i.e. each observation is assigned
to one component with an a-posteriori probability of
0.9 and 0.1 otherwise and the component is selected
with equal probability.

Call:
stepFlexmix(Incidence ~ 1,
weights = ~Freq, data = whiskey,
model = FLXMCmvbinary(truncated = TRUE),
control = list(minprior = 0.005),
k = 5, nrep = 3)
Cluster sizes:
1
2
3
4
5
283 791 953 25 166
convergence after 180 iterations
The estimated parameters can be inspected using
accessor functions such as prior() or parameters().

R> wh_mix <- stepFlexmix(Incidence ~ 1,
+
weights = ~ Freq, data = whiskey,
+
model = FLXMCmvbinary(truncated = TRUE),
+
control = list(minprior = 0.005),
+
k = 1:7, nrep = 3)

R> prior(wh_best)

Model-based clustering uses no explanatory variables, hence the right hand side of the formula
Incidence ~ 1 is constant.
The model driver
is FLXMCmvbinary() with argument truncated =
TRUE, as the number of non-users is not available and
a truncated likelihood is maximized in each M-step
again using the EM-algorithm. We vary the number
of components for k = 1:7. The best solution with

R> parameters(wh_best, component=4:5)[1:2,]

R News

[1] 0.1421343 0.3303822
[3] 0.4311072 0.0112559
[5] 0.0851203

Comp.4
center.Singleton 0.643431
center.Knockando 0.601124
Comp.5
center.Singleton 2.75013e-02
center.Knockando 1.13519e-32
ISSN 1609-3631

Vol. 7/1, April 2007

11

0.0 0.2 0.4 0.6 0.8 1.0

Comp. 1

Comp. 2

0.0 0.2 0.4 0.6 0.8 1.0

Comp. 3

Comp. 4

Comp. 5

Singleton
Knockando
White Horse
Scoresby Rare
Ushers
Macallan
Grant's
Passport
Black & White
Clan MacGregor
Ballantine
Pinch (Haig)
Other brands
Cutty Sark
Glenfiddich
Glenlivet
J&B
Dewar's White Label
Johnnie Walker Black Label
Johnnie Walker Red Label
Chivas Regal
0.0 0.2 0.4 0.6 0.8 1.0

0.0 0.2 0.4 0.6 0.8 1.0

0.0 0.2 0.4 0.6 0.8 1.0

Blend
Single Malt

Probability

Figure 2: Estimated probability of usage for the whiskey brands for each component.

●
●
●

●
●

●

0

●● ● ● ● ●
●●

−2

0

●

4

lgRD

Figure 3: Patent dataset.
The model which is chosen as the best in Wang
et al. (1998) is a finite mixture of three Poisson regression models with Patents as dependent variable, the
logarithmized R&D spending lgRD as independent
variable and the R&D spending per sales RDS as concomitant variable. This model can be fitted in R with
the component-specific model driver FLXMRglm()
R News

150

● ● ●
●

−2

●

● ●● ● ● ●
●
●● ● ●
● ●● ●● ●
●
●● ● ● ● ●

0

2

●●

● ● ●●
●

4

lgRD

●●●

● ● ●●
●

2

●●

●

●
●
●
● ●
●●●●●● ● ● ●
●●
●● ● ●
● ●● ●
● ● ●●
●
● ●
●●●
● ●●● ●

●●●
●
●

0

100

●

50

Patents

●●

●
●

●

150

●

The observed values together with the fitted values for each component are given in Figure 4. The
coloring and characters used for plotting the observations are according to the component assignment using the maximum a-posteriori probabilities, which are obtained using cluster(pat_mix).

100

The patent data given in Wang et al. (1998) includes
70 observations on patent applications, R&D spending and sales in millions of dollar from pharmaceutical and biomedical companies in 1976 taken from the
National Bureau of Economic Research R&D Masterfile. The data is given in Figure 3.

R> data("patent")
R> pat_mix <- flexmix(Patents ~ lgRD,
+
k = 3, data = patent,
+
model = FLXMRglm(family = "poisson"),
+
concomitant = FLXPmultinom(~RDS))

50

Mixtures of regressions

which allows fitting of finite mixtures of GLMs. As
concomitant variable model driver FLXPmultinom()
is used for a multinomial logit model where the posterior probabilities are the dependent variables.

Patents

The fitted parameters of the mixture for each
component are given in Figure 2. It can be seen
that component 4 (1.1% of the households) contains
the households which bought the greatest number
of different brands and all brands to a similar extent. Households from component 5 (8.5%) also buy
a wide range of whiskey brands, but tend to avoid
single malts. Component 3 (43.1%) has a similar usage pattern as component 5 but buys less brands in
general. Component 1 (14.2%) seems to favour single malt whiskeys and component 2 (33%) is especially fond of other brands and tends to avoid Johnnie Walker Black Label.

Figure 4: Patent data with fitted values for each component.
In Figure 5 a rootogram of the posterior probabilities of the observations is given. This is the default plot of the "flexmix" objects returned by the
fitting function. It can be used for arbitrary mixture models and indicates how well the observations
are clustered by the mixture. For ease of interpretation the observations with a-posteriori probability
less than eps=10−4 are omitted as otherwise the peak
at zero would dominate the plot. The observations
where the a-posteriori probability is largest for the
third component are colored differently. The plot is
generated using the following command.
ISSN 1609-3631

Vol. 7/1, April 2007

12

R> plot(pat_mix, mark = 3)
The posteriors of all three components have modes
at 0 and 1, indicating well-separated clusters (Leisch,
2004). Note that the object returned by the plot function is of class "trellis", and that the plot itself
is produced by the corresponding show() method
(Sarkar, 2002).
Rootogram of posterior probabilities > 1e−04
0.0 0.2 0.4 0.6 0.8 1.0

Comp. 1

Comp. 2

Comp. 3

5
4
3

vary between all components even though the coefficients for lgRD are similar for the first and third
component. A smaller model where these coefficients are restricted to be equal can be fitted using
the model driver FLXMRglmfix(). The EM algorithm
can be initialized in the original solution using the
estimated posterior probabilities for the cluster argument. As in this case the first and third component
are restricted to have the same coefficient for lgRD,
the posteriors of the fitted mixture are used for initialization after reordering the components to have
these two components next to each other. The modified model is compared to the original model using
the BIC.

2
1
0
0.0 0.2 0.4 0.6 0.8 1.0

0.0 0.2 0.4 0.6 0.8 1.0

Figure 5: Rootogram of the posterior probabilities.
Further details of the fitted mixture can be obtained with refit() which returns the fitted values
together with approximate standard deviations and
significance tests, see Figure 6. The standard deviations are only approximative because they are determined separately for each component and it is not
taken into account that the components have been
estimated simultaneously. In the future functionality
to determine the standard deviations using either the
full Hesse matrix or the parametric bootstrap shall be
provided.
The estimated coefficients are given in Figure 7.
The black lines indicate the (approximative) 95%
confidence intervals. This is the default plot for the
objects returned by refit() and is obtained with the
following command.
R> plot(refit(pat_mix), bycluster = FALSE)
The argument bycluster indicates if the clusters/components or the different variables are
used as conditioning variables for the panels.
−3

(Intercept)

−2

−1

0

1

2

lgRD

Comp. 1

R>
+
+
R>
R>
+
+
+
R>

Model_2 <- FLXMRglmfix(family = "poisson",
nested = list(k = c(1,2),
formula = ~lgRD))
Post_1 <- posterior(pat_mix)[,c(2,1,3)]
pat_mix2 <- flexmix(Patents ~ 1,
concomitant = FLXPmultinom(~RDS),
data = patent, cluster = Post_1,
model = Model_2)
c(M_1 = BIC(pat_mix), M_2 = BIC(pat_mix2))

M_1
M_2
437.836 445.243
In this example, the original model is preferred
by the BIC.

Summary
flexmix provides infrastructure for fitting finite mixture models with the EM algorithm and tools for
model selection and model diagnostics. We have
shown the application of the package for modelbased clustering as well as for fitting finite mixtures
of regressions.
In the future we want to implement new model
drivers, e.g., for generalized additive models with
smooth terms, as well as to extend the tools for
model selection, diagnostics and model validation.
Additional functionality will be added which allows
to fit mixture models with different component specific models. The implementation of zero-inflated
models has been a first step in this direction.

Comp. 2

Acknowledgments

Comp. 3

This research was supported by the Austrian Science
Foundation (FWF) under grant P17382.
−3

−2

−1

0

1

2

Bibliography
Figure 7: Estimated coefficients of the component
specific models with corresponding 95% confidence
intervals.
The plot indicates that the estimated coefficients
R News

A. Dasgupta and A. E. Raftery. Detecting features
in spatial point processes with clutter via modelbased clustering. Journal of the American Statistical
Association, 93(441):294–302, March 1998. 9
ISSN 1609-3631

Vol. 7/1, April 2007

13

R> refit(pat_mix)
Call:
refit(pat_mix)
Number of components: 3
$Comp.1
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.50819
0.12366
4.11 3.96e-05 ***
lgRD
0.87976
0.03328
26.43 < 2e-16 ***
--Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1

' '

1

' '

1

' '

1

$Comp.2
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.6368
0.4043 -6.522 6.95e-11 ***
lgRD
1.5866
0.0899 17.648 < 2e-16 ***
--Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1
$Comp.3
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.96217
0.13968
14.05
<2e-16 ***
lgRD
0.67190
0.03572
18.81
<2e-16 ***
--Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1

Figure 6: Details of the fitted mixture model for the patent data.
C. M. Dayton and G. B. Macready. Concomitantvariable latent-class models. Journal of the American
Statistical Association, 83(401):173–178, March 1988.
9
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EMalgorithm. Journal of the Royal Statistical Society B,
39:1–38, 1977. 9
Y. D. Edwards and G. M. Allenby. Multivariate analysis of multiple response data. Journal of Marketing
Research, 40:321–334, August 2003. 10
C. Fraley and A. E. Raftery. How many clusters?
which clustering method? answers via modelbased cluster analysis. The Computer Journal, 41(8):
578–588, 1998. 10

B. Grün and F. Leisch. Flexmix 2.0: Finite mixtures
with concomitant variables and varying and fixed
effects. Submitted for publication, 2007b. 9
F. Leisch. FlexMix: A general framework for finite mixture models and latent class regression in
R. Journal of Statistical Software, 11(8), 2004. URL
http://www.jstatsoft.org/v11/i08/. 9, 12
G. J. McLachlan and D. Peel. Finite Mixture Models.
Wiley, 2000. 8, 9
D. Sarkar. Lattice. R News, 2(2):19–23, June 2002. URL
http://CRAN.R-project.org/doc/Rnews/. 12
P. Wang, I. M. Cockburn, and M. L. Puterman. Analysis of patent data — A mixed-Poisson-regressionmodel approach. Journal of Business & Economic
Statistics, 16(1):27–41, 1998. 11

C. Fraley and A. E. Raftery. MCLUST version 3 for R:
Normal mixture modeling and model-based clustering. Technical Report 504, University of Washington, Department of Statistics, September 2006.
9

M. Wedel and W. A. Kamakura. Market Segmentation — Conceptual and Methodological Foundations. Kluwer Academic Publishers, second edition, 2001. ISBN 0-7923-8635-3. 8

S. Frühwirth-Schnatter. Finite Mixture and Markov
Switching Models. Springer Series in Statistics.
Springer, New York, 2006. ISBN 0-387-32909-9. 8,
9

Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at

B. Grün and F. Leisch. Fitting finite mixtures of
generalized linear regressions in R. Computational
Statistics & Data Analysis, 2007a. Accepted for publication. 9
R News

Friedrich Leisch
Ludwig-Maximilians-Universität München, Germany
Friedrich.Leisch@R-project.org
ISSN 1609-3631

Vol. 7/1, April 2007

14

Using R to Perform the AMMI Analysis
on Agriculture Variety Trials
by Andrea Onofri & Egidio Ciriciofolo

Introduction
Field agricultural experiments are generally planned
to evaluate the actual effect produced by manmade chemical substances or human activities on
crop yield and quality, environmental health, farmers’ income and so on. Field experiments include
the testing of new and traditional varieties (genotypes), fertilizers (types and doses), pesticides (types
and doses) and cultural practices. With respect to
greenhouse or laboratory experiments, field trials
are much more strongly subjected to environmental
variability that forces researchers into repeating experiments across seasons and/or locations. A significant ’treatment x environment’ interaction may
introduce some difficulties in exploring the dataset,
summarizing results and determining which treatment (genotype, herbicide, pesticide...) was the best.
In such conditions, the AMMI (Additive Main effect Multiplicative Interaction) analysis has been proposed as an aid to visualize the dataset and explore
graphically its pattern and structure (Gollob, 1968;
Zobel et al., 1988); this technique has received a particular attention from plant breeders (see for example
Abamu and Alluri, 1998; Annichiarico et al., 1995;
Annichiarico, 1997; Ariyo, 1998) and recently it has
been stated as superior to other similar techniques,
such as the GGE (Gauch, 2006). Unfortunately, such
technique is not yet very well exploited by agricultural scientists, who often prefer a more traditional approach to data analysis, based on ANOVA
and multiple comparison testing. Without disregarding the importance of such an approach, one
cannot deny that sometimes this does not help unveil the underlying structure of experimental data,
which may be more important than hypothesis testing, especially at the beginning of data analyses (Exploratory Data Analysis; NIST/SEMATECH, 2004)
or at the very end, when graphs have to be drawn
for publication purposes.
To make more widespread the acceptance and the
use of such powerful tool within agronomists, it is
necessary to increase the availability of both practical information on how to perform and interpret an
AMMI analysis and simple software tools that give
an easily understandable output, aimed at people
with no specific and deep statistical training, such as
students and field technicians.
The aim of this paper was to show how R can be
easily used to perform an AMMI analysis and produce ’biplots’, as well as to show how these tools can
R News

be very useful within variety trials in agriculture.

Some basic statistical aspects
The AMMI analysis combines the ANalysis OF VAriance (ANOVA) and the Singular Value Decomposition (SVD) and it has been explained in detail by Gollob (1968). If we specifically refer to a variety trial,
aimed at comparing the yield of several genotypes
in several environments (years and/or locations), the
ANOVA partitions the total sum of squares into two
main effects (genotypes and environments) plus the
interaction effect (genotypes x environments). This
latter effect may be obtained by taking the observed
averages for each ’genotype x environment’ combination and doubly-centering them (i.e., subtracting
to each data the appropriate genotype and environment means and adding back the grand mean). The
interaction effect is arranged on a two-way matrix γ
(one row for each genotype and one column for each
environment) and submitted to SVD, as follows:
r

γ=

∑ λi · gik · ei j

(1)

i =1

where r is the rank of γ, λi is the singular value
for principal component i, gik is the eigenvector score
for genotype k and Principal Component (PC) i (left
singular vector), while ei j is the eigenvector score for
environment j and PC i (right singular vector). If PC
scores are multiplied by the square root of the singular value, equation 1 is transformed into:
r

γ=

∑

i =1



λi0.5 · gik




λi0.5 · ei j =

r

∑ Gik · Ei j

(2)

i =1

In this way the additive interaction in the
ANOVA model is obtained by multiplication of
genotype PC scores by environment PC scores, appropriately scaled. If a reduced number of PCs is
used (r = 1 or 2, typically) a dimensionality reduction
is achieved with just a small loss in the descriptive
ability of the model. This makes it possible to plot
the interaction effect, via the PC scores for genotypes
and environments. Such graphs are called biplots, as
they contain two kinds of data; typically, a AMMI1
and a AMMI2 biplots are used: the AMMI1 biplot
has main effects (average yields for genotypes and
environments) on the x-axis and PC1 scores on the yaxis, while the AMMI2 biplot has PC1 scores on the
x-axis and PC2 scores on the y-axis.
ISSN 1609-3631

Vol. 7/1, April 2007

15

Table 1: Field averages (three replicates) for six genotypes compared in seven years.
Genotype
1996 1997 1998 1999 2000 2001 2002 Average
COLOSSEO
6.35 6.46 6.70 6.98 6.44 7.07 4.90
6.41
CRESO
5.60 6.09 6.13 7.13 6.08 6.45 4.33
5.97
DUILIO
5.64 8.06 7.15 7.99 5.18 7.88 4.24
6.59
GRAZIA
6.26 6.74 6.35 6.84 4.75 7.30 4.34
6.08
IRIDE
6.04 7.72 6.39 7.99 6.05 7.71 4.96
6.70
SANCARLO 5.70 6.77 6.81 7.41 5.66 6.67 4.50
6.22
SIMETO
5.08 7.19 6.44 7.07 4.82 7.55 3.34
5.93
SOLEX
6.14 6.39 6.44 6.87 5.45 7.52 4.79
6.23
Average
5.85 6.93 6.55 7.29 5.55 7.27 4.42
6.27

The dataset
To show how the AMMI analysis can be easily performed with R, we will use a dataset obtained from
a seven-years field trial on durum wheat, carried out
from 1996 to 2003 in central Italy, on a randomised
block design with three replicates. For the present
analysis, eight genotypes were chosen, as they were
constantly present throughout the years (Colosseo,
Creso, Duilio, Grazia, Iride, Sancarlo, Simeto, Solex).
Yield data referred to the standard humidity content
of 13% (Tab. 1) have been previously published in
Belocchi et al. (2003), Ciriciofolo et al. (2002); Ciriciofolo et al. (2001); Desiderio et al. (2000), Desiderio
et al. (1999), Desiderio et al. (1998), Desiderio et al.
(1997). The interaction matrix (which is submitted to
SVD) is given in table 2.

The AMMI with R
To perform the AMMI analysis, an R function was
defined, as shown on page 17.
The AMMI() function requires as inputs a vector
of genotype codes (factor), a vector of environment
codes (factor), a vector of block codes (factor) and a
vector of yields (numerical). PC is the number of PCs
to be considered (set to 2 by default) and biplot is the
type of biplot to be drown (1 for AMM1 and 2 for
AMMI2). It should be noted that the script is very elementary and that it does not use any difficult function or programming construct. It was simply coded

Table 2:
Genotype
COLOSSEO
CRESO
DUILIO
GRAZIA
IRIDE
SANCARLO
SIMETO
SOLEX

R News

by translating the algebraic procedure proposed by
Gollob (1968) into R statements, which is a very easy
task, even without a specific programming training.
Wherever possible, built-in R functions were used, to
simplify the coding process and to facilitate the adaptation of the script to other kinds of AMMI models.
The first part uses the function tapply() to calculate some descriptive statistics, such as genotype
means, environment means and ’genotype x environment’ means, which are all included in the final
output.
The second part uses the function aov() to perform the ANOVA by using a randomised block design repeated in different environments with a different randomisation in each environment (LeClerg
et al., 1962). The interaction matrix γ is calculated
by using the function model.tables() applied to the
output of the function aov(); the way the R script
is coded, the interaction matrix is actually the transpose of the matrix shown in table 2, but this does not
change much in terms of the results. The interaction
matrix is then submitted to SVD, by using the builtin R function svd().
The significant PCs are assessed by a series of F
tests as shown by Zobel et al. (1988) and PC scores,
genotype means and environment means are used to
produce biplots, by way of the functions plot() and
points().

Interaction effects for the dataset in table 1.
1996 1997 1998 1999 2000 2001 2002
0.35 -0.62 0.00 -0.45 0.74 -0.34 0.33
0.04 -0.54 -0.13 0.14
0.82 -0.52 0.20
-0.54 0.80
0.28
0.38 -0.70 0.29 -0.51
0.60 -0.01 -0.02 -0.27 -0.62 0.21
0.10
-0.24 0.37 -0.59 0.28
0.07
0.01
0.11
-0.10 -0.11 0.31
0.17
0.16 -0.55 0.12
-0.43 0.60
0.23
0.13 -0.40 0.62 -0.75
0.33 -0.50 -0.07 -0.38 -0.07 0.29
0.40

ISSN 1609-3631

Vol. 7/1, April 2007

16

$Additive_ANOVA
Df Sum Sq Mean Sq F value
Pr(>F)
Environments
6 159.279 26.547 178.3996 < 2.2e-16 ***
Genotypes
7 11.544
1.649 11.0824 2.978e-10 ***
Blocks(Environments)
14
3.922
0.280
1.8826
0.03738 *
Environments x Genotypes 42 27.713
0.660
4.4342 6.779e-10 ***
Residuals
98 14.583
0.149
--Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
$Mult_Interaction
Effect
SS
1
PC1 18.398624
2
PC2 5.475627
3
PC3 1.961049
4 Residuals 1.877427

DF
12
10
8
12

MS
F
Prob.
1.5332187 10.303612 7.958058e-13
0.5475627 3.679758 3.339881e-04
0.2451311 1.647342 1.212529e-01
0.1564522 1.051398 4.094193e-01

$Environment_scores
PC1
PC2
1996 0.4685599 -0.62599974
1997 -0.8859669 0.21085535
1998 -0.1572887 -0.00567589
1999 -0.3139136 0.51881710
2000 0.8229290 0.59868592
2001 -0.5456613 -0.49726356
2002 0.6113417 -0.19941917

PC3
0.01665148
-0.19553672
0.80162642
-0.13286326
-0.03330554
-0.18138908
-0.27518331

$Genotype_scores
COLOSSEO
CRESO
DUILIO
GRAZIA
IRIDE
SANCARLO
SIMETO
SOLEX

PC1
PC2
PC3
0.74335025 -0.02451524 0.1651197989
0.63115567 0.47768803 -0.0001969871
-0.87632103 0.17923645 0.1445152042
-0.07625519 -0.74659598 -0.0108977060
-0.12683903 0.28634343 -0.7627600696
0.18186612 0.35076556 0.3753706117
-0.78109997 0.04751457 0.1740113396
0.30414317 -0.57043681 -0.0851621918

Figure 1: Results from ANOVA and AMMI analyses.

Results
Results (Fig. 1) show a highly significant ’genotypes
x environments’ interaction (GE) on the ANOVA,
that does not permit to define an overall ranking of
varieties across environments.
The SVD decomposition of the interaction matrix
was performed by extracting three PCs, though only
the first two are significant. It is possible to observe
that the first PC accounts for 66% of the interaction
sum of squares, while the second one accounts for an
additional 20%.
The AMMI1 biplot shows contemporarily main
effects (genotypes and environments average yields)
and interaction, as PC1 scores (Fig. 2). This graph is
relevant as it accounts for 87% of total data variability.
R News

Figure 2: AMMI1 biplot.
ISSN 1609-3631

Vol. 7/1, April 2007

17

AMMI <- function(variety, envir, block, yield, PC = 2, biplot = 1) {
## 1 - Descriptive statistics
overall.mean <- mean(yield)
envir.mean <- tapply(yield, envir, mean)
var.mean <- tapply(yield, variety, mean)
int.mean <- tapply(yield, list(variety,envir), mean)
envir.num <- length(envir.mean)
var.num <- length(var.mean)
## 2 - ANOVA (additive model)
variety <- factor(variety)
envir <- factor(envir)
block <- factor(block)
add.anova <- aov(yield ~ envir + block %in% envir + variety + envir:variety)
modelTables <- model.tables(add.anova, type = "effects", cterms = "envir:variety")
int.eff <- modelTables$tables$"envir:variety"
add.anova.residual.SS <- deviance(add.anova)
add.anova.residual.DF <- add.anova$df.residual
add.anova.residual.MS <- add.anova.residual.SS/add.anova.residual.DF
anova.table <- summary(add.anova)
row.names(anova.table[[1]]) <- c("Environments", "Genotypes", "Blocks(Environments)",
"Environments x Genotypes", "Residuals")
## 3 - SVD
dec <- svd(int.eff, nu = PC, nv = PC)
if (PC > 1) {
D <- diag(dec$d[1:PC])
} else {
D <- dec$d[1:PC]
}
E <- dec$u %*% sqrt(D)
G <- dec$v %*% sqrt(D)
Ecolnumb <- c(1:PC)
Ecolnames <- paste("PC", Ecolnumb, sep = "")
dimnames(E) <- list(levels(envir), Ecolnames)
dimnames(G) <- list(levels(variety), Ecolnames)
## 4 - Significance of PCs
numblock <- length(levels(block))
int.SS <- (t(as.vector(int.eff)) %*% as.vector(int.eff))*numblock
PC.SS <- (dec$d[1:PC]^2)*numblock
PC.DF <- var.num + envir.num - 1 - 2*Ecolnumb
residual.SS <- int.SS - sum(PC.SS)
residual.DF <- ((var.num - 1)*(envir.num - 1)) - sum(PC.DF)
PC.SS[PC + 1] <- residual.SS
PC.DF[PC + 1] <- residual.DF
MS <- PC.SS/PC.DF
F <- MS/add.anova.residual.MS
probab <- pf(F, PC.DF, add.anova.residual.DF, lower.tail = FALSE)
percSS <- PC.SS/int.SS
rowlab <- c(Ecolnames, "Residuals")
mult.anova <- data.frame(Effect = rowlab, SS = PC.SS, DF = PC.DF, MS = MS, F = F, Prob. = probab)
## 5 - Biplots
if (biplot == 1) {
plot(1, type = 'n', xlim = range(c(envir.mean, var.mean)), ylim = range(c(E[,1], G[,1])), xlab = "Yield",
ylab = "PC 1")
points(envir.mean, E[,1], col = "red", lwd = 5)
plot(1, type = 'n', xlim = range(c(envir.mean, var.mean)), ylim = range(c(E[,1], G[,1])), xlab = "Yield",
ylab = "PC 1")
points(envir.mean, E[,1], "n", col = "red", lwd = 5)
text(envir.mean, E[,1], labels = row.names(envir.mean), adj = c(0.5, 0.5), col = "red")
points(var.mean, G[,1], "n", col = "blue", lwd = 5)
text(var.mean, G[,1], labels = row.names(var.mean), adj = c(0.5, 0.5), col = "blue")
abline(h = 0, v = overall.mean, lty = 5)
} else {
plot(1, type = 'n', xlim = range(c(E[,1], G[,1])), ylim = range(c(E[,2], G[,2])), xlab = "PC 1",
ylab = "PC 2")
points(E[,1], E[,2], "n",col = "red", lwd = 5)
text(E[,1], E[,2], labels = row.names(E),adj = c(0.5,0.5),col = "red")
points(G[,1],G[,2], "n", col = "blue", lwd = 5)
text(G[,1], G[,2], labels = row.names(G),adj = c(0.5, 0.5), col = "blue")
}
## 6 - Other results
list(Genotype_means = var.mean, Environment_means = envir.mean, Interaction_means = int.mean,
Additive_ANOVA = anova.table, Mult_Interaction = mult.anova, Environment_scores = E,
Genotype_scores = G)
}

R News

ISSN 1609-3631

Vol. 7/1, April 2007

To read this biplot, it is necessary to remember
that genotypes and environments on the right side
of the graph shows yield levels above the average.
Besides, genotypes and environments laying close
to the x-axis (PC 1 score close to 0) did not interact
with each other, while data with positive/negative
score on y-axis interacted positively with environments characterised by a score of same sign.
Indeed, environmental variability was much
higher than genotype variability (Fig. 2, see also the
ANOVA in Fig. 1). Iride showed the highest average yield and did not interact much with the environment (PC1 score close to 0). Duilo ranked overall second, but showed a high interaction with the
environment, i.e., its yield was above the average
in 1997 (first ranking), 2001 (first ranking) and 1999
(second ranking), while it was below the average in
1996, 2000 and 2002. Colosseo gave also a good average yield, but its performances were very positive
in 1996, 2000 and 2002, while they were below the
average in 1997, 2000 and 2002.

Figure 3: AMMI2 biplot.
The AMMI2 biplot (Fig. 3) is more informative on
the GE interaction as it accounts for 86% of the sum
of squares of this latter effect. Remember that genotypes and environments in the center of the graph
did not show a relevant interaction, while genotypes
and environment lying close on the external parts of
the graph interacted positively. Duilio and Simeto
were particularly brilliant in 1997 (compared to their
average performances; notice in tab. 1 that Simeto
was the third in this year, which is very good compared to its seventh position on the ranking based
on average yield). Solex and Grazia were brilliant
in 1997 (they were third and second respectively, in
spite of the eighth and fifth ranking based on average yield). Likewise, Creso and Colosseo were the
best in 2000 and 2002, while Iride and Sancarlo interacted positively with 1999.
R News

18

Discussion and conclusions
The above example should be discussed with reference to two main aspects: the AMMI analysis and
the use of R. Concerning the first aspect, the example
confirms the graphical power of the biplots; indeed,
all the above comments are just an excerpt of what
can be easily grasped at first sight from the AMMI1
and AMMI2 biplots. It is worth to notice that obtaining such information from table 1 is not as immediate
and quick. Of course, the AMMI analysis should be
followed by other procedures to explore the relationship between the behaviour of each variety and the
environmental conditions of each year.
It is also necessary to mention that in the present
example the analyses were aimed only at graphically
exploring the underlying structure of the dataset. In
other cases, whenever hypothesis testing is more important, the F procedure employed on the script may
be too liberal and other techniques may be better
suited to evaluate the significance of PCs (Cornelius,
1993).
Concerning R, the above example confirms that
this environment can be easily used to perform statistical analyses in the agricultural field. Thanks to
its ability to deal with linear models and to the facilities for matrix manipulation, it is very easy to accomplish also rather complex statistical tasks, such
as the AMMI analysis. Indeed, calculations can be
performed having in mind the usual algebraic notation, as one can find in statistical literature, without
a deep knowledge of programming constructs. Indeed, this script has been coded in an elementary
fashion, following the calculation pattern proposed
by Gollob (1968) and including some built-in R functions when possible.
Of course, it is necessary to mention that this
elementary coding style may be useful for simple
scripts, but should not be regarded as optimal, especially for more advanced applications. Indeed, in
such cases an object-oriented approach is much more
advisable to exploit the statistical power of R. In any
case, elementary scripts such this one may be always
used as the starting point to perform other types of
statistical analyses. In particular, with slight modifications, this script (available on: www.unipg.it/
~onofri/software.htm) could be used to draw the
GGE biplot, that has received a certain attention in
the last years (Yan and Tinker, 2005).
However, when re-using this script, one should
bear in mind some limitations. Indeed, it is important to notice that this script has been aimed at a specific experimental design (completely randomised
block experiment repeated across years or environments), as commonly found in field variety trials.
Other designs will require some adaptations into
the code and unbalanced designs (especially those
with missing combinations of genotypes and environments) should not be analysed with this script.
ISSN 1609-3631

Vol. 7/1, April 2007

Furthermore, the ’environment’ effect has been considered as ’fixed’ and changes to the code should be
made in case it should be considered as ’random’.
In spite of the the above limitations, it is clear that
also users with a limited background in computer
programming (which is often the case in agriculture)
can benefit from the use of R: an elementary knowledge of R statements and functions is already enough
to perform also the ’less traditional’ statistical analysis, with a very slight effort.

Bibliography
F. J. Abamu and K. Alluri. Ammi analysis of rainfed
lowland rice (oryza sativa) trials in nigeria. Plant
Breeding, 117(4):395–397, 1998. 14
P. Annichiarico. Addittive main effects and multiplicative interaction (ammi) analysis of genotypelocation interaction in variety trials repeated over
years. Theorethycal applied genetics, 94:1072–1077,
1997. 14
P. Annichiarico, F. Bozzo, G. Parente, F. Gusmeroli,
V. Mair, O. Marguerettaz, and D. Orlandi. Analysis
of adaptation of grass/legume mixtures to italian
alpine and subalpine zones through an addittive
main effects and multiplicative interaction model.
Grass and Forage Science, 50:405–413, 1995. 14
O. J. Ariyo. Use of additive main effects and multiplicative interaction model to analyse multilocation soybean varietal trials. J.Genet. Breed, (53):129–
134, 1998. 14
A. Belocchi, M. Fornara, E. Ciriciofolo, E. Biancolatte, S. D. Puglia, G. Olimpieri, V. Vecchiarelli,
E. Gosparini, C. Piccioni, and E. Desiderio. Scelta
delle varietà di frumento duro: risultati delle
prove varietali 2002-2003 in Lazio ed Umbria.
L’Informatore Agrario, supplemento 1, 2003. 15
E. Ciriciofolo, A. Belocchi, E. Biancolatte, S. D.
Puglia, M. Fornara, G. Olimpieri, C. Piccioni, and
E. Desiderio. Scelta delle varietà di frumento duro:
risultati delle prove varietali 2000-2001 in Lazio ed
Umbria. L’Informatore Agrario, supplemento 1, 2001.
15
E. Ciriciofolo, A. Belocchi, E. Biancolatte, S. D.
Puglia, M. Fornara, G. Olimpieri, C. Piccioni, and
E. Desiderio. Scelta delle varietà di frumento duro:
risultati delle prove varietali 2001-2002 in Lazio ed
Umbria. L’Informatore Agrario, supplemento 1, 2002.
15
P. L. Cornelius. Statistical tests and retention of terms
in the Addittive Main Effects and Multiplicative
interaction model for cultivar trials. Crop Science,
33:1186–1193, 1993. 18

R News

19

E. Desiderio, A. Belocchi, E. Biancolatte, E. Ciriciofolo, S. D. Puglia, G. Olimpieri, C. Piccioni, and
M. Fornara. Scelta delle varietà di frumento duro:
risultati delle prove varietali 1996-1997 in Lazio ed
Umbria. L’Informatore Agrario, supplemento 1, 1997.
15
E. Desiderio, E. Ciriciofolo, A. Belocchi, E. Biancolatte, S. D. Puglia, G. Olimpieri, C. Piccioni,
M. Fornara, and M. Magini. Scelta delle varietà di
frumento duro: risultati delle prove varietali 19971998 in Lazio ed Umbria. L’Informatore Agrario,
supplemento 1, 1998. 15
E. Desiderio, E. Ciriciofolo, A. Belocchi, E. Biancolatte, S. D. Puglia, G. Olimpieri, C. Piccioni,
M. Fornara, and M. Magini. Scelta delle varietà di
frumento duro: risultati delle prove varietali 19981999 in Lazio ed Umbria. L’Informatore Agrario,
supplemento 1, 1999. 15
E. Desiderio, E. Biancolatte, E. Ciriciofolo, S. D.
Puglia, M. Fornara, G. Olimpieri, C. Piccioni, and
L. Frongia. Scelta delle varietà di frumento duro:
risultati delle prove varietali 1999-2000 in Lazio ed
Umbria. L’Informatore Agrario, supplemento 1, 35:
52–55, 2000. 15
H. F. Gollob. A statistical model which combines
features of factor analytic and analysis of variance
techniques. Psychometrika, 33(1):73–114, 1968. 14,
15, 18
H. G. Gauch. Statistical analysis of yield trials by
AMMI and GGE. Crop Science, 46(4):1488–1500,
2006. 14
E. L. LeClerg, W. H. Leonard, and A. G. Clark. Field
Plot Technique. Burgess Publishing Company, Minneapolis, Minnesota, 1962. 15
NIST/SEMATECH. E-Handbook of Statistical Methods. 1: Exploratory Data Analysis. http://www.itl.
nist.gov/div898/handbook/, 2004. 14
W. Yan and N. A. Tinker. An integrated biplot analysis system for displaying, interpreting, and exploring genotype × environment interaction. Crop Science, 45(3):1004–1016, 2005. 18
R. W. Zobel, M. J. Wright, and H. G. Gauch. Statistical analysis of a yield trial. Agronomy Journal, 80:
388–393, 1988. 14, 15

Andrea Onofri & Egidio Ciriciofolo
Department of Agricultural and Environmental Science
University of Perugia
onofri@unipg.it
egicir@unipg.it

ISSN 1609-3631

Vol. 7/1, April 2007

20

Inferences for Ratios of Normal Means
by Gemechis Dilba, Frank Schaarschmidt, Ludwig A.
Hothorn

Introduction

the analysis of the data could be to test whether the
active control (cyclophosphamide at dose 25mg/kg)
results in a significantly higher number of mutations
than the vehicle control. The data appear to be heteroscedastic, and therefore we use the unequal variances option (the default) to compare the two treatments.

Inferences concerning ratios of means of normally
distributed random variables or ratios of regression
coefficients arise in a variety of problems in biomedical research. For example, in tests for non-inferiority
of one or more experimental treatments against a
positive control, it is often easier to define and also
to interpret the non-inferiority margin as percentage changes (or fraction retained compared to the
mean of the control group). In bioassay problems,
one is also interested in ratios of regression coefficients, for instance in parallel line or slope ratio
assays. Our aim here is to introduce an R extension package called mratios which can perform inferences about one or more such ratio parameters in the
general linear model. For two-sample problems, the
package is capable of constructing Fieller confidence
intervals and performing the related tests when the
group variances are assumed homogeneous or heterogeneous. In simultaneous inferences for multiple
ratios, the package can (i) perform multiple tests, (ii)
construct simultaneous confidence intervals using a
variety of techniques, and (iii) calculate the sample
sizes required for many-to-one comparisons in simultaneous tests for non-inferiority (or superiority)
based on relative margins. We demonstrate the functionality of the package by using several data examples.

Note that when testing a ratio of means against
1, the p-value computed by the t.test.ratio function is exactly the same as that computed by t.test
when testing the difference of means against 0.

Two-sample Problem

Simultaneous Inferences

The two-sample problem is one of the standard
methods routinely used in practice. Here the interest
is in comparing the means of two independent normally distributed random variables in terms of the
ratio of their means. This can be accomplished by
using the t.test.ratio function. If the variances are
homogeneous, this function performs a ratio formatted t-test (also known as Sasabuchi test) and computes Fieller’s confidence interval. If variance homogeneity is not tenable (the default), the test proposed
by Tamhane and Logan (2004) is performed using
Satterthwaite adjusted degrees of freedom. For confidence interval estimation under variance heterogeneity, Satterthwaite degrees of freedom depends
on the unknown ratio. To circumvent this problem,
we plug in the maximum likelihood estimate of the
ratio (i.e., ratio of sample means) in the approximate
expression for the number of degrees of freedom.
Example 1. Consider the mutagenicity assay data
described in the mratios package. A first step in

In this section we consider inferential problems involving one or more ratio parameters. The basic distribution underlying the analyses is the multivariate
t-distribution. Under the assumption of normality
and homogeneous variance for the error terms, the
joint distribution of the test statistics associated with
the various contrasts of interest follows a multivariate t-distribution. For the computation of the related
multivariate t probabilities and equi-coordinate critical points, we refer to Hothorn et al. (2001).

R News

>
>
>
+
>
+

library("mratios")
data("Mutagenicity")
muta2 <- subset(Mutagenicity, Treatment ==
"Vehicle" | Treatment == "Cyclo25")
t.test.ratio(MN ~ Treatment, data = muta2,
alternative = "greater")
Ratio t-test for unequal variances

data: Cyclo25 and Vehicle
t = 5.0071, df = 3.07, p-value = 0.0073
alternative hypothesis: true ratio of means
is greater than 1
95 percent confidence interval:
5.110079
Inf
sample estimates:
mean Cyclo25
mean Vehicle
25.000000
2.571429
Cyclo25/Vehicle
9.722222

Multiple Tests
Assume a normal one-way ANOVA model with homogeneous variances. The interest is in simultaneous tests for several ratios of linear combinations of
the treatment means. Such tests for ratio hypotheses (ratios of normal means) appear, for example,
in tests for non-inferiority (or superiority) of several experimental treatments compared to a control
ISSN 1609-3631

Vol. 7/1, April 2007

(placebo). These are so called many-to-one comparisons. In the R-function simtest.ratio, most of the
routinely used multiple comparison procedures [e.g.,
many-to-one (Dunnett type), all pairs (Tukey type),
sequence (successive comparisons of ordered treatment effects)] are implemented in the context of ratio
hypotheses. In general, the function also allows for
any user-defined contrast matrices.
Let γ j = c0j µ /d0j µ, j = 1, . . . , r denote the ratios of interest, where µ = (µ1 , . . . , µk )0 is a vector
of the treatment means, c j and d j are known vectors
of real constants each of dimension k × 1, and r is the
number of ratios. To specify the ratios, we define two
contrast matrices, namely, numerator and denominator contrast matrices. The numerator contrast matrix
is a matrix whose row vectors are c01 , . . . , cr0 , and the
denominator contrast matrix is a matrix whose row
vectors are d01 , . . . , dr0 . Therefore, the dimensions of
both the numerator and denominator contrast matrices are each r × k. Further, let (ψ1 , . . . , ψr )0 denote the set of margins against which we test the r
ratios. Then, for example, for one-sided upper-tailed
alternative hypotheses, the hypotheses of interest are
H0 j : γ j ≤ ψ j versus H1 j : γ j > ψ j , j = 1, . . . , r.
Given a data frame containing the observations, the contrast matrices, the vector of margins,
and the family-wise type I error rate, the function simtest.ratio calculates the point estimates of
the ratios, the test statistics, the raw p-values and
the multiplicity adjusted p-values. The adjusted pvalues are computed by adapting the results of Westfall et al. (1999) for ratio hypotheses and general contrasts.
In general, note that the function simtest.ratio
allows for varying margins for the set of comparisons. This can be quite appealing, for example, in
test problems involving a mixture of non-inferiority
and superiority hypotheses.
Example 2. Bauer et al. (1998) analyzed data
from a multi-dose experiment including a positive
control and placebo. In the experiment, patients
with chronic stable angina pectoris were randomized to five treatment arms (placebo, three doses of
a new compound, and an active control). The response variable is the difference in the duration of
an exercise test before and after treatment. Now,
due to the unavailability of the original data values, we randomly generated independent samples
(from a normal distribution) that satisfy the summary statistics given in Table II of Bauer et al.
(1998). This data set is available in the mratios package. The interest is in simultaneous tests for noninferiority of the three doses versus the active control by including the placebo. Following Pigeot et
al. (2003), the hypotheses can succinctly be formulated as H0i : (µ j − µ2 )/(µ1 − µ2 ) ≤ 0.9 versus
H1i : (µ j − µ2 )/(µ1 − µ2 ) > 0.9, j = 3, 4, 5, where
µi , i = 1, ..., 5 denote the means for the active control, placebo, dose 50, dose 100 and dose 150, consecR News

21

utively. In this example, the non-inferiority margins
are all set to 0.9.
>
>
+
+
>
+
+
>
+
+
+
>

data("AP")
NC <- rbind(N1 = c(0, -1, 1, 0, 0),
N2 = c(0, -1, 0, 1, 0),
N3 = c(0, -1, 0, 0, 1))
DC <- rbind(D1 = c(1, -1, 0, 0, 0),
D2 = c(1, -1, 0, 0, 0),
D3 = c(1, -1, 0, 0, 0))
ap.test <- simtest.ratio(pre_post ~
treatment, data = AP, Num.Contrast = NC,
Den.Contrast = DC, Margin.vec = 0.9,
alternative = "greater")
ap.test

Alternative hypotheses: Ratios greater
than margins
margin estimate statistic
0.9
5.306
2.9812
0.9
4.878
2.7152
0.9
1.969
0.7236
p.value.raw p.value.adj
N1/D1
0.001554
0.004429
N2/D2
0.003505
0.009799
N3/D3
0.234952
0.451045
N1/D1
N2/D2
N3/D3

By using the command summary(ap.test), one
can get further information — for example, the correlation matrix under the null hypotheses and the critical point (equi-coordinate percentage point of the
multivariate t-distribution).

Simultaneous Confidence Intervals
Unlike in multiple testing, in simultaneous estimation of the ratios γ j = c0j µ /d0j µ, j = 1, . . . , r, the joint
distribution of the associated t-statistics follows a
multivariate t-distribution with a correlation matrix
that depends on the unknown ratios. This means that
the critical points that are required for confidence interval construction depend on these unknown parameters. There are various methods of dealing with
this problem. They are (i) using the unadjusted intervals (Fieller confidence intervals without multiplicity adjustments); (ii) Bonferroni (Fieller intervals
with simple Bonferroni adjustments); (iii) a method
called MtI which consists of replacing the unknown
correlation matrix of the multivariate t-distribution
by an identity matrix of the same dimension according to Sidak and Slepian inequalities (Hochberg and
Tamhane, 1987) for two- and one-sided confidence
intervals, respectively; and (iv) plug-in (plugging the
maximum likelihood estimates of the ratios into the
unknown correlation matrix). The latter method is
known to have good simultaneous coverage probabilities and hence it is set as a default method in the
R functions to be introduced. For details regarding
these methodologies, we refer to Dilba et al. (2006a).
ISSN 1609-3631

Vol. 7/1, April 2007

22

The sci.ratio function is used to construct simultaneous CIs for ratios of linear combinations of
treatment means in a one-way ANOVA model. Several standard contrast types (e.g., Dunnett, Tukey, sequence, and many others) are implemented in this
function. The default contrast is many-to-one comparisons (Dunnett type) with the mean of the first
level of the factor (in alpha-numeric order) taken as
the denominator of the ratios. In addition, this function has an option for user-defined contrast matrices.
Example 3. Recall the data from the multi-dose
experiment in Example 2 above. Now, suppose that
the interest is to calculate simultaneous lower 95%
confidence limits for the ratios of the three doses and
the active control to the placebo. Noting that placebo
is the second level in the alpha-numeric order of the
treatments, we use the following R code to calculate
the limits.
> ap.sci <- sci.ratio(pre_post ~
+
treatment, data = AP, type = "Dunnett",
+
base = 2, alternative = "greater",
+
method = "MtI")
The graph of the confidence intervals can be obtained
by applying the plot function to the object in which
the confidence interval estimates are stored, see Figure 1.
> plot(ap.sci)
Lower 95 % simultaneous CI for ratios (method: Slepian)

|

D50/D0

●

|

D150/D0

>
>
>
+
>
+
+
+
>
+
+
+
>
+
>

data(SRAssay)
Y <- SRAssay[, "Response"]
X <- model.matrix(Response ~ Treatment:Dose,
data = SRAssay)
NC <- matrix(c(0, 0, 1, 0, 0,
0, 0, 0, 1, 0,
0, 0, 0, 0, 1),
nrow = 3, byrow = TRUE)
DC <- matrix(c(0, 1, 0, 0, 0,
0, 1, 0, 0, 0,
0, 1, 0, 0, 0),
nrow = 3, byrow = TRUE)
s.ratio <- sci.ratio.gen(Y, X,
Num.Contrast = NC, Den.Contrast = DC)
s.ratio

Two-sided 95 % simultaneous confidence
intervals for ratios:

C1
C2
C3

estimate lower upper
1.1217 1.0526 1.1964
0.7193 0.6603 0.7805
0.7537 0.6942 0.8157

Using the command summary(s.ratio), one can
also get further details regarding the fitted regression model, the contrast matrices and an estimate of
the correlation matrix (when the plug-in method is
used). The estimate of the correlation matrix used
for critical point calculation can also be obtained as

●

|

D100/D0

α + βi Xi j + i j , i = 0, 1, 2, 3; j = 1, . . . , ni , where
the Xi j s are the dose levels and i = 0 refers to the
control group. The vector of regression coefficients
is (α, β0 , β1 , β2 , β3 )0 . Now using the data in Table 5
of Jensen (1989), the interest is to construct simultaneous CIs for βi /β0 , i = 1, 2, 3. The function
sci.ratio.gen needs the response vector Y and the
design matrix X as an input.

●

> s.ratio[["CorrMat.est"]]
|

AC/D0
0.5

●

1.0

1.5

2.0

Figure 1: Graphical visualization of the ap.sci object.
The sci.ratio.gen function is a more general
function that can construct simultaneous confidence
intervals for ratios of linear combinations of coefficients in the general linear model. For this function,
it is necessary to specify the vector of responses, the
design matrix, and the numerator and denominator
contrast matrices.
Example 4. Consider the problem of simultaneously estimating relative potencies in a multiple
slope ratio assay. Jensen (1989) describes an experiment in which three preparations are compared to
a control. The response variable (Y) is pantothenic
acid content of plant tissues. The model is Yi j =
R News

[,1]
[,2]
[,3]
[1,] 1.0000000 0.4083451 0.4260802
[2,] 0.4083451 1.0000000 0.3767098
[3,] 0.4260802 0.3767098 1.0000000
Note that by choosing the option for method
as ’Sidak’, one gets the results reported by Jensen
(1989).
Before closing this section, we give two important
remarks.
i) According to the Slepian inequality (Hochberg
and Tamhane, 1987), it is appropriate to use the MtI
method for estimating one-sided simultaneous confidence limits only when all the elements of the correlation matrix are non-negative. Therefore, if some of
the (estimated) correlations are negative, sci.ratio
and sci.ratio.gen functions report a warning message about the inappropriateness of the MtI method.
ii) In simultaneous CI estimation (using either
sci.ratio or simtest.ratio.gen), one may encounter the case where some of the contrasts in the
ISSN 1609-3631

Vol. 7/1, April 2007

denominators of the ratios are not significantly different from zero. In this situation, NSD (standing for
“non-significant denominator”) will be printed. For
instance, in Example 2 above, since there is no significant difference between the placebo and the active
control, one gets NSD in constructing the related simultaneous CIs for the three ratios.

Sample Size Calculation
Consider the design of a special problem in simultaneous comparison of m ≥ 2 treatments with a control
for non-inferiority (or superiority), where the margins are expressed as a percentage of the mean of
the control group. For sample size calculation, we
implement a method based on normal approximation to the exact method which involves inversion of
a univariate (multivariate) non-central t-distribution
(see Dilba et al. (2006b) for details on the exact
method). Given the number of comparisons (m), the
non-inferiority (superiority) margin (rho), the power
(Power), the coefficient of variation of the control
group (CV0), the percentage (of the mean of the control group) to be detected (rho.star), the familywise type-I error rate (alpha), and the kind of power
to be controlled (by default minimal power), the
function n.ratio calculates the sample size required
in a balanced design.
Example 5. Suppose that we have a response
variable where large response values indicate better
treatment benefits. The following R code calculates
the sample size required per treatment in designing
a non-inferiority trial with four treatment arms (including the control).
> n.ratio(m = 3, rho = 0.7, Power = 0.8,
+
CV0 = 0.5, rho.star = 0.95,
+
alpha = 0.05, Min.power = TRUE)
Number of observations per treatment = 52
Total number of observations = 208
If the aim is to control the complete power, we set
Min.power to FALSE.
For the two-sample design (m = 1), the sample
sizes required in the non-inferiority trials discussed
by Laster and Johnson (2003) can be calculated as a
special case.

Remarks
We conclude by giving some general remarks regarding the four basic functions in the mratios package.
• In two-sample ratio problems with homogeneous variances, t.test.ratio is a special case
of simtest.ratio and sci.ratio.
• The simtest.ratio function with all the elements of the vector of margins equal to 1
gives the same result as the analysis based on
the difference of treatment means. Thus, the
R News

23

difference-based test is a special case of the
ratio-based test with the thresholds set to 1.
• The sci.ratio function is a special case of
sci.ratio.gen for the one-way layout.

Bibliography
P. Bauer, J. Röhmel, W. Maurer and L.A. Hothorn.
Testing strategies in multi-dose experiments including active control. Statistics in Medicine, 17:
2133-2146, 1998. 21
G. Dilba, F. Bretz and V. Guiard. Simultaneous confidence sets and confidence intervals for multiple
ratios. Journal of Statistical Planning and Inference,
136:2640–2658, 2006a. 21
G. Dilba, F. Bretz, L.A. Hothorn and V. Guiard.
Power and sample size computations in simultaneous tests for non-inferiority based on relative margins. Statistics in Medicine, 25:1131–1147,
2006b. 23
J. Hochberg and A. Tamhane. Multiple comparison
procedures. Wiley, New York, 1987, p366. 21, 22
T. Hothorn, F. Bretz and A. Genz. On Multivariate t
and Gauss Probabilities. R News, 1(2):27–29, 2001.
20
G. R. Jensen. Joint confidence sets in multiple dilution assays. Biometrical Journal, 31:841–853, 1989.
22
L. L. Laster and M. F. Johnson. Non-inferiority trials: the ‘at least as good as’ criterion. Statistics in
Medicine, 22:187–200, 2003.
I. Pigeot, J. Schäfer, J. Röhmel and D. Hauschke.
Assessing non-inferiority of a new treatment in a
three-arm clinical trial including a placebo. Statistics in Medicine, 22:883–899, 2003. 21
A. C. Tamhane and B. R. Logan. Finding the maximum safe dose level for heteroscedastic data. Journal of Biopharmaceutical Statistics, 14:843–856, 2004.
20
P. H. Westfall, R. D. Tobias, D. Rom, R. D. Wolfinger
and Y. Hochberg. Multiple comparisons and multiple tests using the SAS system. Cary, NC, SAS Institute Inc,1999, 65–81.
21
G. Dilba, F. Schaarschmidt, L. A. Hothorn
Institute of Biostatistics
Faculty of Natural Sciences
Leibniz University of Hannover, Germany
dilba@biostat.uni-hannover.de
schaarschmidt@biostat.uni-hannover.de
hothorn@biostat.uni-hannover.de
ISSN 1609-3631

Vol. 7/1, April 2007

24

Working with Unknown Values
The gdata package
by Gregor Gorjanc

Introduction
Unknown or missing values can be represented in
various ways. For example SAS uses . (dot), while
R uses NA, which we can read as Not Available.
When we import data into R, say via read.table
or its derivatives, conversion of blank fields to NA
(according to read.table help) is done for logical,
integer, numeric and complex classes. Additionally, the na.strings argument can be used to specify values that should also be converted to NA. Inversely, there is an argument na in write.table and
its derivatives to define value that will replace NA
in exported data. There are also other ways to import/export data into R as described in the R Data
Import/Export manual (R Development Core Team,
2006). However, all approaches lack the possibility
to define unknown value(s) for some particular column. It is possible that an unknown value in one
column is a valid value in another column. For example, I have seen many datasets where values such
as 0, -9, 999 and specific dates are used as column
specific unknown values.
This note describes a set of functions in package
gdata1 (Warnes , 2006): isUnknown, unknownToNA and
NAToUnknown, which can help with testing for unknown values and conversions between unknown
values and NA. All three functions are generic (S3)
and were tested (at the time of writing) to work
with: integer, numeric, character, factor, Date,
POSIXct, POSIXlt, list, data.frame and matrix
classes.

Description with examples
The following examples show simple usage of these
functions on numeric and factor classes, where
value 0 (beside NA) should be treated as an unknown
value:
> library("gdata")
> xNum <- c(0, 6, 0, 7, 8, 9, NA)
> isUnknown(x=xNum)
[1] FALSE FALSE FALSE FALSE FALSE FALSE
TRUE

The default unknown value in isUnknown is NA,
which means that output is the same as is.na — at
least for atomic classes. However, we can pass the
argument unknown to define which values should be
treated as unknown:
1

> isUnknown(x=xNum, unknown=0)
[1] TRUE FALSE TRUE FALSE FALSE FALSE
FALSE

This skipped NA, but we can get the expected answer after appropriately adding NA into the argument unknown:
> isUnknown(x=xNum, unknown=c(0, NA))
[1] TRUE FALSE TRUE FALSE FALSE FALSE
TRUE

Now, we can change all unknown values to NA
with unknownToNA. There is clearly no need to add
NA here. This step is very handy after importing
data from an external source, where many different unknown values might be used. Argument
warning=TRUE can be used, if there is a need to be
warned about “original” NAs:
> xNum2 <- unknownToNA(x=xNum, unknown=0)
[1] NA 6 NA 7 8 9 NA

Prior to export from R, we might want to change
unknown values (NA in R) to some other value. Function NAToUnknown can be used for this:
> NAToUnknown(x=xNum2, unknown=999)
[1] 999
6 999
7
8
9 999

Converting NA to a value that already exists in x
issues an error, but force=TRUE can be used to overcome this if needed. But be warned that there is no
way back from this step:
> NAToUnknown(x=xNum2, unknown=7,
force=TRUE)
[1] 7 6 7 7 8 9 7

Examples below show all peculiarities with class
factor. unknownToNA removes unknown value from
levels and inversely NAToUnknown adds it with a
warning. Additionally, "NA" is properly distinguished from NA. It can also be seen that the
argument unknown in functions isUnknown and
unknownToNA need not match the class of x (otherwise factor should be used) as the test is internally
done with %in%, which nicely resolves coercing issues.
> xFac <- factor(c(0, "BA", "RA", "BA",
NA, "NA"))
[1] 0
BA
RA
BA
 NA
Levels: 0 BA NA RA
> isUnknown(x=xFac)
[1] FALSE FALSE FALSE FALSE TRUE FALSE
> isUnknown(x=xFac, unknown=0)
[1] TRUE FALSE FALSE FALSE FALSE FALSE
> isUnknown(x=xFac, unknown=c(0, NA))
[1] TRUE FALSE FALSE FALSE TRUE FALSE

package version 2.3.1

R News

ISSN 1609-3631

Vol. 7/1, April 2007

25

> isUnknown(x=xFac, unknown=c(0, "NA"))
[1] TRUE FALSE FALSE FALSE FALSE TRUE
> isUnknown(x=xFac, unknown=c(0, "NA", NA))
[1] TRUE FALSE FALSE FALSE TRUE TRUE

> isUnknown(x=xList, unknown=c(0, NA))
$a
[1] TRUE FALSE TRUE FALSE FALSE FALSE
FALSE

> xFac <- unknownToNA(x=xFac, unknown=0)
[1]  BA
RA
BA
 NA
Levels: BA NA RA
> xFac <- NAToUnknown(x=xFac, unknown=0)
[1] 0 BA RA BA 0 NA
Levels: 0 BA NA RA
Warning message:
new level is introduced: 0

$b
[1] FALSE FALSE FALSE FALSE FALSE FALSE

These two examples with classes numeric and
factor are fairly simple and we could get the same
results with one or two lines of R code. The real benefit of the set of functions presented here is in list
and data.frame methods, where data.frame methods are merely wrappers for list methods.
We need additional flexibility for list/data.frame
methods, due to possibly having multiple unknown
values that can be different among list components
or data.frame columns. For these two methods, the
argument unknown can be either a vector or list,
both possibly named. Of course, greater flexibility (defining multiple unknown values per component/column) can be achieved with a list.
When a vector/list object passed to the
argument unknown is not named, the first
value/component of a vector/list matches the first
component/column of a list/data.frame. This
can be quite error prone, especially with vectors.
Therefore, I encourage the use of a list. In case
vector/list passed to argument unknown is named,
names are matched to names of list or data.frame.
If lengths of unknown and list or data.frame do not
match, recycling occurs.
The example below illustrates the application of
the described functions to a list which is composed
of previously defined and modified numeric (xNum)
and factor (xFac) classes. First, function isUnknown
is used with 0 as an unknown value. Note that we
get FALSE for NAs as has been the case in the first example.
> xList <- list(a=xNum, b=xFac)
$a
[1] 0 6 0 7 8 9 NA
$b
[1] 0 BA RA BA 0 NA
Levels: 0 BA NA RA
> isUnknown(x=xList, unknown=0)
$a
[1] TRUE FALSE TRUE FALSE FALSE FALSE
FALSE
$b
[1] TRUE FALSE FALSE FALSE

TRUE FALSE

We need to add NA as an unknown value. However, we do not get the expected result this way!
R News

This is due to matching of values in the argument
unknown and components in a list; i.e., 0 is used for
component a and NA for component b. Therefore, it
is less error prone and more flexible to pass a list
(preferably a named list) to the argument unknown,
as shown below.
> xList1 <- unknownToNA(x=xList,
+
unknown=list(b=c(0, "NA"), a=0))
$a
[1] NA 6 NA 7 8 9 NA
$b
[1]  BA
RA
Levels: BA RA

BA

 

Changing NAs to some other value (only one per
component/column) can be accomplished as follows:
> NAToUnknown(x=xList1,
+
unknown=list(b="no", a=0))
$a
[1] 0 6 0 7 8 9 0
$b
[1] no BA RA BA no no
Levels: BA no RA
Warning message:
new level is introduced: no

A named component .default of a list passed
to argument unknown has a special meaning as it will
match a component/column with that name and any
other not defined in unknown. As such it is very useful if the number of components/columns with the
same unknown value(s) is large. Consider a wide
data.frame named df. Now .default can be used
to define unknown value for several columns:
> df <- unknownToNA(x=df,
+
unknown=(.default=0,
+
col1=999,
+
col2="unknown"))

If there is a need to work only on some components/columns you can of course “skip” columns
with standard R mechanisms, i.e., by subsetting list
or data.frame objects:
> cols <- c("col1", "col2")
> df[, cols] <- unknownToNA(x=df[, cols],
+
unknown=(col1=999,
+
col2="unknown"))

ISSN 1609-3631

Vol. 7/1, April 2007

Summary
Functions isUnknown, unknownToNA and NAToUnknown
provide a useful interface to work with various representations of unknown/missing values. Their use
is meant primarily for shaping the data after importing to or before exporting from R. I welcome any
comments or suggestions.

Bibliography
R Development Core Team. R Data Import/Export,
2006.
URL http://cran.r-project.org/

26

manuals.html. ISBN 3-900051-10-0. 24
G. R. Warnes.
gdata:
Various R programming tools for data manipulation, 2006.
URL
http://cran.r-project.org/src/contrib/
Descriptions/gdata.html. R package version
2.3.1. Includes R source code and/or documentation contributed by Ben Bolker, Gregor Gorjanc
and Thomas Lumley. 24
Gregor Gorjanc
University of Ljubljana, Slovenia
gregor.gorjanc@bfro.uni-lj.si

A New Package for Fitting Random Effect
Models
The npmlreg package
by Jochen Einbeck, John Hinde, and Ross Darnell

Introduction
Random effects have become a standard concept in
statistical modelling over the last decades. They
enter a wide range of applications by providing a
simple tool to account for such problems as model
misspecification, unobserved (latent) variables, unobserved heterogeneity, and the like. One of the most
important model classes for the use of random effects
is the generalized linear model. Aitkin (1999) noted
that “the literature on random effects in generalized
linear models is now extensive,” and this is certainly
even more true today.
However, most of the literature and the implemented software on generalized linear mixed models concentrates on a normal random effect distribution. An approach that avoids specifying this
distribution parametrically was provided by Aitkin
(1996a), using the idea of ’Nonparametric Maximum
Likelihood’ (NPML) estimation (Laird, 1978). The
random effect distribution can be considered as an
unknown mixing distribution and the NPML estimate of this is a finite discrete distribution. This can
be determined by fitting finite mixture distributions
with varying numbers of support points, where each
model is conveniently fitted using a straightforward
EM algorithm.
This approach is implemented in GLIM4 (Aitkin
and Francis, 1995). Despite being a quite powerful tool, the current GLIM-based software is

computationally limited and the GLIM system is
no longer widely used. Though the alternatives
C.A.MAN (Böhning et al., 1992) and the Stata program gllamm (Skrondal and Rabe-Hesketh, 2004)
cover parts of GLIMs capacities (in the latter case
based on Newton-Raphson instead of EM), no R implementation of NPML estimation existed. The package npmlreg (Einbeck et al., 2006), which we wish to
introduce to the R community in this article, is designed to fill this gap.

NPML estimation
Assume there is given a set of explanatory vectors x1 , . . . , xn and a set of observations y1 , . . . , yn
sampled from an exponential family distribution1
f ( yi |β, φi ) with dispersion2 parameter φi . In a generalized linear model, predictors and response are
assumed to be related through a link function h,
µi ≡ E( yi |β, φi ) = h(ηi ) ≡ h( xi0 β),
and the variance Var( yi |β, φi ) = φi v(µi ) depends on
a function v(µi ) which is entirely determined by the
choice of the particular exponential family. However,
often the actual variance in the data is larger than the
variance according to this strict mean-variance relationship. This effect is commonly called overdispersion, reasons for which might be, e.g., correlation in
the data or important explanatory variables not included in the model. In order to account for additional unexplained variability of the individual observations, a random effect zi with density g( z) is in-

1 In

the present implementation, Gaussian, Poisson, Binomial, and Gamma distributed responses are supported
i ≡ 1. For Gaussian and Gamma models, the dispersion may be specified as constant, i.e., φi ≡ φ,
or as depending on the observation i. The theory in this section is provided for the most general case, i.e., variable dispersion.
2 For binomial and Poisson models, φ

R News

ISSN 1609-3631

Vol. 7/1, April 2007

27

cluded into the linear predictor
ηi = xi0 β + zi .
The likelihood can be approximated by a finite mixture
(
)
Z
n

L=

∏

K

n

f ( yi | zi , β, φi ) g( zi ) dzi ≈

i =1

∏ ∑

i =1

f ik πk

,

∂`
= 0,
∂β

∂`
= 0,
∂φk

turn out to be weighted versions of the singledistribution score equations, with weights wik =
πk f ik /∑l πl f il .
The weights wik can be interpreted as posterior
probabilities that the observation yi comes from component k. The score equation for the mixture proportions,
∂` − λ (∑ πk − 1)
= 0,
∂πk
gives the ML estimate π̂k = n1 ∑i wik , which can be
nicely interpreted as the average posterior probability for component k. The parameters β, zk and πk can
now be simultaneously estimated by the EM algorithm, whereby the missing information is the component membership of the observations:
E-Step Adjust weights wik = P(obs. i comes from
comp. k)
M-Step Update parameter estimates fitting a
weighted GLM with weights wik .
As starting values for the EM algorithm one uses
Gauss-Hermite integration points and masses. The
location of these starting points can be scaled inwards or outwards by means of a tuning parameter
tol, which is by default set to 0.5.
This procedure, and its straightforward extension
to random coefficient models, is implemented in the
function alldist, while variance component models
can be fitted with allvc; see Aitkin et al. (2005), pp
474ff and 485ff for details.

The function alldist
The main functions of this package are alldist and
allvc, the names of which were adapted from the
homonymous macros in GLIM4. The functions can
be used in a similar manner to the R function glm.
As an example for alldist, we consider data
from a study on lung cancer mortality presented in
3 The

log

k=1

where f ik = f ( yi | zk , β, φk ), zk are the mass points
and πk their masses. The score equations, obtained
by setting the partial derivatives of the log-likelihood
` = log L equal to zero,
∂`
= 0,
∂zk

Tsutakawa (1985). The data were recorded in the 84
largest Missouri cities from 1972-1981 and give the
number of lung cancer deaths of males aged 45-54 as
well as the city sizes3 . The data were analyzed by
Tsutakawa (1985) and Aitkin (1996b), both authors
considering logit models of type
pi
= zi ,
1 − pi

where zi is a random effect associated with the i −th
city and pi is its associated mortality rate. While Tsutakawa fitted a Poisson model with a normal random effect, Aitkin opted for a binomial model with
an unspecified random effect distribution. We follow
Aitkin and will leave the random effect unspecified,
but as lung cancer death is a rather rare event (the
crude rate does not exceed 0.03 in any of the cities), it
seems natural to work with Poisson models. Hence,
one can write log( pi ) = zi , or equivalently, in terms
of the means µi = ni pi ,
log(µi ) = log(ni ) + zi ,
where the logarithm of the city sizes ni appears as
an offset. The two-point solution is then obtained
via the function alldist, using the same notation
as for a usual glm fit, except that the random term
and the number of mass points k=2 have also to be
specified. The resulting object (which we name, say,
missouri.np2) is of class ‘glmmNPML’ and its printed
output is given by
> print(missouri.np2)
Call: alldist(formula = Deaths ~1, random =
~1, family = poisson(link = "log"), data =
missouri, k = 2, offset = log(Size))
Coefficients:
MASS1
MASS2
-4.844 -4.232
Mixture proportions:
MASS1
MASS2
0.8461624 0.1538376
-2 log L:
355.3
One minor difference to a glm output is that the
disparity (−2 log L) is displayed instead of the deviance, but the latter one is immediately obtained via
> missouri.np2$dev
[1] 92.49207,
which is slightly better than the deviance 93.10 reported in Aitkin (1996b) for the corresponding twomass point binomial logit model fitted with GLIM4.
As also observed by Aitkin, the disparity does not

data set missouri is part of this R package

R News

ISSN 1609-3631

Vol. 7/1, April 2007

28

Methods for most generic functions that
are applied to fitted model objects, such
as
update(), coefficients(), residuals(),
fitted(), summary(), predict() and plot(),
have been defined for the glmmNPML class. In some
cases (the first four generic functions listed above)
the default method is used; in other cases (the last
three generics) explicit methods are provided. The
plot() function offers four different plots: (i) disparity vs. EM iterations; (ii) EM trajectories; (iii)
Empirical Bayes predictions vs. true response; and
(iv) posterior probabilities against the residuals of
the fixed part of the GLM fit in the last EM iteration. Plots (i) and (ii) are generated by default when
running alldist and are depicted in Fig. 1 for the
model fitted above.

One observes that the EM trajectories converge
essentially to the fixed part residuals of cities no. 4
and 84 (in Tsutakawa’s list), which have populations
of 54155 and 22514, respectively, being much larger
than the majority of the other cities with only several
hundreds of inhabitants (For the very interested, the
numerical values associated with the disparity plot
and the EM trajectories, as well as the residuals plotted vertically in the latter plot, are available in component $Misc).

This was a simple example without any covariates. In general an arbitrary number of fixed effects
can be specified, and the random component can be
an intercept (∼1) or a single variable giving a model
with a random slope and random intercept.
R News

400
380
360
0

5

10

15

−4.5
−5.0
−5.5

mass points

−4.0

−3.5

EM iterations

−6.0

Empirical Bayes predictions (generally, h( xi0 β̂ +
ẑi ) for the number of deaths per city can be obtained
by the use of fitted(missouri.np2), or equivalently, missouri.np2$fitted.values, or equivalently, predict(missouri.np2, type="response").
Dividing this quantity by missouri$Size, one obtains estimated or ‘shrunken’ rates which can be
compared to the 6th column in Table 4, Aitkin
(1996b). The shrunken rates are less variable than
the crude rates and hence are useful for small area
estimation problems. The posterior probabilities wik
can be obtained from the fitted model (in analogy to
the 8th and 9th column of Table 4, Aitkin, 1996b), via
the component $post.prob. Further, ’posterior intercepts’ for the construction of ’league tables’ are
stored in component $post.int — see Sofroniou
et al. (2006) for an example of their application.

−2logL

420

440

fall significantly when increasing the number of
mass points further.

0

5

10

15

EM iterations

Figure 1: Top: Disparity (−2 log L) trend; bottom:
EM trajectories. The vertical dots in the latter plot are
the residuals of the fixed part of the fitted random
effect model (Note that these residuals are not centered around 0, in contrast to the residuals of a simple
fixed effect model. The plotted residuals, generally
h−1 ( yi ) − xi0 β̂, represent the random effect distribution and are on the same scale as the mass points).

The function allvc
The function alldist is designed to fit simple
overdispersion models (i.e., one has a random effect
on the individual observations). However, often one
wishes to introduces shared random effects, e.g. for
students from the same class or school, for the same
individual observed repeatedly over time (longitudinal data), or in small area estimation problems. This
leads to variance component models, which can be
fitted in npmlreg using the function allvc. As an
example, let us consider the Oxford boys data from
Brush and Harrison (1990), which were analyzed
with NPML using GLIM4 by Aitkin et al. (2005).
The data set is part of the R package nlme (Pinheiro et al., 2005) and contains the heights of 26
boys, measured in Oxford on nine occasions over
two years. The boys, indexed by the factor Subject,
represent the upper level (primary sampling units,
PSU), and the particular measurements at different
time points correspond to the lower-level units (secondary sampling units, SSU).
As suggested by (Aitkin et al., 2005, p. 495), we
fit a Gaussian model with unspecified random effect
distribution and K = 7 mass points,
(Oxboys.np7 <- allvc(height ~ age, random =
~1|Subject, data = Oxboys, k=7))$disparity
[1] 1017.269
ISSN 1609-3631

Vol. 7/1, April 2007

29

which confirms the GLIM4 result. Aitkin et al. state
that for all models using K > 7 the fitted mass
points ‘are duplicated with the same total mass as
in lower-dimension models’, and hence consider this
7- mass point model as ‘sufficiently complex’. However, fitting for comparison a simple linear mixed
model with function lmer in package lme4 (Bates
and Sarkar, 2006)
(fm1 <- lmer(height ~ age + (1|Subject),
data = Oxboys, method = "ML"))
gives the MLdeviance, i.e., disparity, of 940.569. As
this model is based on a normal random effect distribution, NPML should be superior to this, or at least
competitive with it. Therefore, we went on fitting
models with K = 8 and K = 9 mass points, yielding
(Oxboys.np8 <- allvc(height ~ age, random =
~1|Subject, data = Oxboys, k=8))$disparity
[1] 931.3752
(Oxboys.np9 <- allvc(height ~ age, random =
~1|Subject, data = Oxboys, k=9,
tol=0.3))$disparity
[1] 916.0921
For both models, all mass points are indeed distinct.
For instance, for the 9-mass point model one has

age
MASS1
MASS2
MASS3
MASS4
MASS5
MASS6
MASS7
MASS8
MASS9

Estimate
6.523719
130.200174
138.416628
143.382397
147.350113
150.954327
153.869256
156.178153
159.521550
164.883640

Std. Error
0.05094195
0.16795344
0.09697325
0.09697225
0.07511877
0.06857400
0.11915344
0.09676418
0.16795354
0.11876372

This increased performance compared to the
GLIM code is due to the installation of a ‘damping’
mechanism in the first cycles of the EM algorithm;
see Einbeck and Hinde (2006) for details.
Further increase of K does not yield major drops
in disparity, so we continue to work with nine mass
points, and extend the model by allowing the linear trend to vary across boys. The function call then
takes the form

(Oxboys.np9s <- allvc(height ~ age, random =
~age|Subject, data = Oxboys, k = 9,
tol=0.3))
The difference in disparities is
> Oxboys.np9$disparity - Oxboys.np9s$disparity
[1] 102.1071
R News

on
> Oxboys.np9$df.res - Oxboys.np9s$df.res
[1] 8
degrees of freedom, showing clear heterogeneity in
the slope.

Summary
We have introduced the R package npmlreg, which
is in some parts a simple translation from the corresponding GLIM4 code, but, in other parts, contains
substantial extensions and methodological improvements. In particular, we mention the possibility to
fit Gamma models and to work with dispersion parameters varying smoothly over components, and, as
already noted, the installation of a damping procedure. We note finally that for the sake of comparability all implemented features are also available for
Gaussian Quadrature instead of NPML (leaving the
zk and πk fixed and equal to Gauss-Hermite integration points and masses). The R package is available
on CRAN. Future developments will include 3-level
models and multicategory responses.

Acknowledgements
The work on this R package was supported by
Science Foundation Ireland Basic Research Grant
04/BR/M0051. We are grateful to N. Sofroniou for
sharing his GLIM code for the Missouri data and two
anonymous referees for pointing at the comparison
with package lme4.

Bibliography
M. Aitkin. A general maximum likelihood analysis of overdispersion in generalized linear models. Statistics and Computing, 6:251–262, 1996a.

M. Aitkin. Empirical Bayes shrinkage using posterior random effect means from nonparametric maximum likelihood estimation in general random effect models. In IWSM 1996: Proceedings of the 11th International Workshop on Statistical Modelling, Orvieto, pages 87–94, 1996b.

M. Aitkin. A general maximum likelihood analysis of variance components in generalized linear models. Biometrics, 55:218–234, 1999.

M. Aitkin and B. Francis. Fitting overdispersed generalized linear models by nonparametric maximum likelihood. The GLIM Newsletter, 25:37–45, 1995.

M. Aitkin, B. Francis, and J. Hinde. Statistical Modelling in GLIM 4. Oxford Statistical Science Series, Oxford, UK, 2005.

D. Bates and D. Sarkar. lme4: Linear mixed-effects models using S4 classes, 2006. R package version 0.995-2.

D. Böhning, P. Schlattmann, and B. Lindsey. Computer-assisted analysis of mixtures (C.A.MAN): statistical algorithms. Biometrics, 48:283–303, 1992.

G. Brush and G. A. Harrison. Components of growth variation in human stature. Mathematical Medicine and Biology, 7:77–92, 1990.

J. Einbeck and J. Hinde. A note on NPML estimation for exponential family regression models with unspecified dispersion parameter. Austrian Journal of Statistics, 35:233–243, 2006.

J. Einbeck, R. Darnell, and J. Hinde. npmlreg: Nonparametric maximum likelihood estimation for random effect models, 2006. R package version 0.40.

N. M. Laird. Nonparametric maximum likelihood estimation of a mixing distribution. Journal of the American Statistical Association, 73:805–811, 1978.

J. Pinheiro, D. Bates, S. DebRoy, and D. Sarkar. nlme: Linear and nonlinear mixed effects models, 2005. R package version 3.1-66.

A. Skrondal and S. Rabe-Hesketh. Generalized Latent Variable Modelling. Chapman & Hall/CRC, Boca Raton, 2004.

N. Sofroniou, J. Einbeck, and J. Hinde. Analyzing Irish suicide rates with mixture models. In IWSM 2006: Proceedings of the 21st International Workshop on Statistical Modelling, Galway, pages 474–481, 2006.

R. Tsutakawa. Estimation of cancer mortality rates: a Bayesian analysis of small frequencies. Biometrics, 41:69–79, 1985.

Jochen Einbeck
Durham University, UK
jochen.einbeck@durham.ac.uk
John Hinde
National University of Ireland, Galway, Rep. of Ireland
john.hinde@nuigalway.ie
Ross Darnell
The University of Queensland, Australia
r.darnell@uq.edu.au

Augmenting R with Unix Tools
Andrew Robinson

This article describes a small collection of Unix command-line tools that can be used to augment R. I introduce:

make, which after judicious preparation will allow one-word execution of large-scale simulations, all the way from compilation of source code to construction of a PDF report;

screen, which will permit remote monitoring of program execution progress with automatic protection against disconnection; and

mail, which will allow automatic alerting under error conditions or script completion.

These programs will be part of the default installations of many flavours of Unix-like operating systems. Although these tools have many different applications, each is very useful for running R remotely on a Unix-like operating system, which is the focus of their usage in this article. Regardless of what kind of computer is on your own desk, you may find these tools useful; they do not require you to be running a BSD or GNU/Linux on your own machine.

R and screen

It is sometimes necessary to run R code that executes for long periods of time upon remote machines. This requirement may be because the local computing resources are too slow, too unstable, or have insufficient memory.

For example, I have recently completed some simulations that took more than a month on a reasonably fast cluster of Linux computers. I developed, trialed, and profiled my code on my local, not-particularly-fast, computer. I then copied the scripts and data to a more powerful machine provided by my employer and ran the simulations in R remotely on that machine. To get easy access to the simulations whilst they ran, I used screen.

screen is a so-called terminal multiplexor, which allows us to create, shuffle, share, and suspend command-line sessions within one window. It provides protection against disconnections and the flexibility to retrieve command-line sessions remotely. screen is particularly useful for R sessions that are running on a remote machine.

We might use the following steps to invoke R within screen:

1. log in remotely via secure shell,

2. start screen,

3. start R,

4. source our script with echo=TRUE,
5. detach the screen session, using Ctrl-a d, and
6. log out.
The R session continues working in the background,
contained within the screen session. If we want to
revisit the session to check its progress, then we
1. log in remotely via secure shell,
2. start screen -r, which recalls the unattached
session,
3. examine the saved buffer; scrolling around,
copying and pasting as necessary,
4. detach the screen session, using Ctrl-a d, and
5. log out.
This approach works best if examining the buffer
is informative, which requires that the R script
be written to provide readable output or flag its
progress every so often. I find that the modest decrease in speed of looping is more than compensated
by a cheerful periodic reminder, say every 1000th
iteration, that everything is still working and that
there are only n iterations left to run. I have also
been experimenting with various algorithms for R to
use the elapsed time, the elapsed number of simulations, and the number of simulations remaining, to
estimate the amount of time left until the run is complete.
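A minimal sketch of such a progress reminder (an illustration, not the code used for the simulations described above) might look like this:

n.sim <- 10000
start <- Sys.time()
for (i in seq_len(n.sim)) {
  ## ... one simulation step goes here ...
  if (i %% 1000 == 0) {
    elapsed <- as.numeric(difftime(Sys.time(), start, units = "mins"))
    cat(i, "of", n.sim, "done;",
        round(elapsed / i * (n.sim - i), 1), "minutes left (roughly)\n")
  }
}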
screen offers other advantages. You can manage
several screens in one window, so editing a script remotely using, say, emacs, and then sourcing the script
in the remote R session in another screen, is quick
and easy. If you lose your connection, the session is
kept alive in the background, and can be re-attached,
using screen -r as above. If you have forgotten
to detach your session you can do so forcibly from
another login session, using screen -dr. You can
change the default history/scrollback buffer length,
and navigate the buffer using intuitive keystrokes.
Finally, you can share a screen session with other
people who are logged in to the same machine. That
is, each user can type and see what the other is typing, so a primitive form of online collaboration is
possible.
So, running an R session in screen provides a
simple and robust way to allow repeated access to
a simulation or a process.
More information about screen can be found
from http://www.gnu.org/software/screen/, or
man screen.
R and mail

It is very useful to be able to monitor an R session that is running remotely. However, it would also be useful if the session could alert us when the script has completed, or when it has stopped for some reason, including some pre-determined error conditions. We can use mail for this purpose, if the computer is running an appropriate mail server, such as sendmail.

mail is an old email program, small enough to be included by default on most Unix-like systems, and featureless enough to be almost universally ignored by users in favour of other programs, such as mutt, or those with shiny graphical user interfaces. However, mail does allow us to send email from the command line. And, R can do pretty much anything that can be done from the command line (in theory, it can do everything, I suppose; I haven't tried). Of course, a small amount of fiddling is necessary to make it work. A simple function will help (caveat: this function is written to work on bash version 3.1.17; your mileage may vary). The function will fail if any of the arguments contain single or double quotes. So, craft your message carefully.

mail <- function(address, subject, message) {
  system(paste("echo '", message,
               "' | mail -s '", subject,
               "' ", address, sep = ""))
}
We can now send mail from R via, for example,
mail("andrewr", "Test", "Hello world!")
You can place calls to this function in strategic locations within your script. I call it at the end of the
script to let me know that the simulations are finished, and I can collect the results at my leisure. In
order to have R let us know when it has stopped, we
can use a function like this:
alert <- function() {
  mail("andrewr", "Stopped", "Problem")
  browser()
}
then in our script we call
options(error = alert)
Now in case of an error, R will send an email and
drop into the browser, to allow detailed follow-up. If
you have more than one machine running, then calling hostname via system, and pasting that into the
subject line, can be helpful.
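For instance, a variant of alert() along these lines (a sketch building on the functions above) would include the machine name in the subject:

## hypothetical variant: add the hostname to the subject line
alert <- function() {
  host <- system("hostname", intern = TRUE)
  mail("andrewr", paste("Stopped on", host), "Problem")
  browser()
}
options(error = alert)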
Alternatively, you can use mutt with exactly the
same command line structure as mail, if mutt is installed on your system. An advantage of mutt is
that it uses the MIME protocol for binary attachments.
That would enable you to, say, attach to your email
the PDF that your script has just created with Sweave
and pdflatex, or the CSV file that your script creates,


or even the relevant objects saved from the session,
neatly packaged and compressed in a *.RData object.
Different flavours of Unix mail are available. You
can find out more about yours by man mail.

R and make
I have recently been working on bootstrap tests of
a model-fitting program that uses maximum likelihood to fit models with multivariate normal and t
distributions. The program is distributed in FORTRAN, and it has to be compiled every time that it is
run with a new model. I wanted to use a studentized
bootstrap for interval estimates, gather up all the output, and construct some useful tables and graphics.
So, I need to juggle FORTRAN files, FORTRAN
executables, R source files, Sweave files, PDFs, and
so on. R does a great job preparing the input files,
calling the executable (using system()), and scooping up the output files. However, to keep everything
in sync, I use make.
Although it is commonly associated with building executable programs, make can control the conversion of pretty much any file type into pretty much
any other kind of file type. For example, make can be
told how to convert a FORTRAN source file to an executable, and it will do so if and when the source file
changes. It can be told how and when to run an R
source file, and then how and when to call Sweave
upon an existing file, and then how and when to create a PDF from the resulting LaTeX file.
The other advantage that make offers is splitting
large projects into chunks. I use Sweave to bind my
documentation and analysis tightly together. However, maintaining the link between documentation
and analysis can be time-consuming. For example,
when documenting a large-scale simulation, I would
rather not run the simulation every time I correct a
spelling mistake.
One option is to tweak the number of runs. make
provides a flexible infrastructure for testing code, as
we can pass parameters, such as the number of runs
to perform, to R. For example, with an appropriate
Makefile, typing make test at the operating system
prompt will run the entire project with only 20 simulations, whereas typing make will run the project with
2000 simulations.
Another option is to split the project into two
parts: the simulation, and the analysis, and run only
the second, unless important elements change in the
first. Again, this sort of arrangement can be constructed quite easily using make.
For example, I have an R source file called sim.r
that controls the simulations and produces a .RData
object called output.RData. The content is:

randoms <- runif(reps)
save.image("output.RData")

I also have a Sweave file called report.rnw which
provides summary statistics (and graphics, not included here) for the runs. The content is:
\documentclass{article}
\begin{document}
<<>>=
load("output.RData")
@
We performed \Sexpr{reps} simulations.
\end{document}
The latter file often requires tweaking, depending
on the output. So, I want to separate the simulations
and the report writing, and only run the simulations
if I absolutely have to. Figure 1 is an example Makefile that takes advantage of both options noted above.
All the files referenced here are assumed to be held in
the same directory as the Makefile, but of course they
could be contained in subdirectories, which is generally a neater strategy.
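In this setup, reps is defined by make before sim.r is sourced; a small guard in sim.r (a sketch, assuming we also want to run the script interactively) provides a default when reps has not been set:

## hypothetical sketch of a parameterised sim.r
if (!exists("reps")) reps <- 20   # small default for interactive use; make overrides this
randoms <- runif(reps)
save.image("output.RData")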
I do not actually need make to do these tasks, but
it does simplify the operation a great deal. One advantage is that make provides conditional execution
without me needing to fiddle around to see what has
and has not changed. If I now edit report.rnw, and
type make, it won’t rerun the simulation, it will just
recompile report.rnw. On the other hand, if I make
a change to sim.r, or even run sim.r again (thus
updating the .RData object), make will rerun everything. That doesn’t seem very significant here, but it
can be when you’re working on a report with numerous different branches of analysis.
The other advantage (which my Makefile doesn’t
currently use) is that it provides wildcard matching. So, if I have a bunch of Sweave files (one for
each chapter, say) and I change only one of them,
then make will identify which one has changed and
which ones depend on that change, and recompile
only those that are needed. But I don’t have to produce a separate line in the Makefile for each file.
Also, the Makefile provides implicit documentation for the flow of operations, so if I need to pass the
project on to someone else, all they need to do is call
make to get going.
More information on make can be found from
http://www.gnu.org/software/make/.
Andrew Robinson
Department of Mathematics and Statistics
University of Melbourne
Australia
A.Robinson@ms.unimelb.edu.au


## Hashes indicate comments. Use these for documentation.
## Makefiles are most easily read from the bottom up, the first time!

# 'make' is automatically interpreted as 'make report.pdf'.
# Update report.pdf whenever report.tex changes, by running this code.
report.pdf : report.tex
	( \
	\pdflatex report.tex; \
	while \grep -q "Rerun to get cross-references right." report.log;\
	do \
	\pdflatex report.tex; \
	done \
	)

# Update report.tex whenever report.rnw *or* output.RData changes,
# by running this code.
report.tex : report.rnw output.RData
	echo "Sweave(\"report.rnw\")" | R --no-save --no-restore

# Update output.RData whenever sim.r changes, by running this code.
output.RData : sim.r
	echo "reps <- 2000; source(\"sim.r\")" | R --no-save --no-restore

########################################################################
## The following section tells make how to respond to specific keywords.

.PHONY: test full neat clean

# 'make test' cleans up, runs a small number of simulations,
# and then constructs a report
test :
	make clean;
	echo "reps <- 10; source(\"sim.r\")" | R --no-save --no-restore;
	make

# 'make full' cleans up and runs the whole project from scratch
full :
	make clean;
	make

# 'make neat' cleans up temporary files - useful for archiving
neat :
	rm -fr *.txt *.core fort.3 *~ *.aux *.log *.ps *.out

# 'make clean' cleans up all created files.
clean :
	rm -fr *.core fort.3 *~ *.exe *.tex *.txt *.lof *.lot *.tex \
	*.RData *.ps *.pdf *.aux *.log *.out *.toc *.eps

Figure 1: The contents of a Makefile. Note that all the indenting is done by means of tab characters, not spaces.
This Makefile has been tested for both GNU make and BSD make.


POT: Modelling Peaks Over a Threshold
by Mathieu Ribatet

The Generalised Pareto Distribution (GPD) is the limiting distribution of normalised excesses over a threshold, as the threshold approaches the endpoint of the variable (Pickands, 1975). The POT package contains useful tools to perform statistical analysis for peaks over a threshold using the GPD approximation.

There are many packages devoted to extreme value theory (evd, ismev, evir, ...); however, the POT package is specialised in peaks over threshold analysis. Moreover, it is currently the only one which proposes many estimators for the GPD. A user's guide (as a package vignette) and two demos are also included in the package.

Asymptotic Approximation

Let X_1, . . . , X_n be a series of i.i.d. random variables with common distribution function F, and let M_n = max{X_1, . . . , X_n}. Suppose there exist constants a_n > 0 and b_n such that

    IP[(M_n − b_n)/a_n ≤ z] −→ G(z),    n → +∞,

for z ∈ R, where G is a non-degenerate distribution function. Then, for i = 1, . . . , n, we have

    IP[X_i ≤ z | X_i > u] −→ H(z),    u → u_end,    (1)

with

    H(y) = 1 − [1 + ξ(y − µ)/σ]_+^(−1/ξ),

where (µ, σ, ξ) are the location, scale and shape parameters respectively, σ > 0, z_+ = max(z, 0), and u_end is the right end-point of the variable X_i.

It is usual to fit the GPD to excesses over a (high enough) threshold. Thus we suppose that the asymptotic result given by equation (1) is (approximately) true for the threshold of interest.

Application: Ardières River at Beaujeu

The ardieres data frame, containing flood discharges (in m^3 s^-1) over a period of 33 years of the Ardières river at Beaujeu (France), is included in the package. There are NA values in year 1994, as a flood event damaged the record instrumentation. We use this dataset as an example for a typical univariate analysis. First, we have to "extract" independent events from the time series and select a suitable threshold such that the asymptotic approximation in equation (1) is good enough.

library("POT")
data("ardieres")
tmp <- clust(ardieres, 0.85, tim.cond = 7/365,
             clust.max = TRUE)
par(mfrow = c(2, 2))
mrlplot(tmp[, "obs"], xlim = c(0.85, 17))
diplot(tmp, u.range = c(0.85, 17))
tcplot(tmp[, "obs"], u.range = c(0.85, 17))

Figure 1: Tools for the threshold selection (mean residual life plot, dispersion index plot, and threshold choice plots for the modified scale and shape parameters).

The threshold selection stage is a compromise between bias and variance. On the one hand, if too high a threshold is selected, the bias decreases as the asymptotic approximation in equation (1) is good enough, while the variance increases as there is not enough data above this threshold. On the other hand, by taking a lower threshold, the variance decreases as the number of observations is larger, and the bias increases as the asymptotic approximation becomes poorer.

According to Fig. 1, a threshold around five m^3 s^-1 should be a "good compromise". Indeed, the mean residual life plot is "linear" on the range (5, 7); for thresholds greater than 5, the dispersion index estimates are "near" the theoretical value 1; and both modified scale and shape estimates seem to be constant on the range (5, 9). Thus, we select only independent values above this threshold by invoking:

events <- clust(ardieres, u = 5,
                tim.cond = 7/365, clust.max = TRUE)
We can fit the GPD to those excesses according to several estimators by setting the method option. There are currently 7 estimators: Moments "moments", Unbiased and Biased Probability Weighted Moments "pwmu" and "pwmb", Minimum Density Power Divergence "mdpd", Medians "med", Pickands "pickands"

and Maximum Likelihood (the default) "mle" estimators. References for these estimators can be found
in (Pickands, 1975; Hosking and Wallis, 1987; Coles,
2001; Peng and Welsh, 2001) and (Juárez and Schucany, 2004). For example, if we want to fit the GPD
using the unbiased probability weighted moment estimator:

obs <- events[,"obs"]
pwmu <- fitgpd(obs, thresh = 5, "pwmu")

Here are the scale and shape parameter estimates of the GPD for the 7 estimators implemented:

         scale     shape
mle   2.735991 0.2779359
mom   2.840792 0.2465661
pwmu  2.668368 0.2922964
pwmb  2.704665 0.2826697
mdpd  2.709254 0.2915759
med   2.135882 0.8939585
pick  2.328240 0.6648158
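A comparison like this can be assembled by looping over the estimators (a sketch; it only reuses the fitgpd() call shown above):

## refit the GPD with each estimator and print the fitted objects for comparison
ests <- c("mle", "moments", "pwmu", "pwmb", "mdpd", "med", "pickands")
fits <- lapply(ests, function(e) fitgpd(obs, thresh = 5, e))
names(fits) <- ests
fits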

By invoking:

par(mfrow = c(2, 2))
plot(pwmu, npy = 2.63)

we obtain Fig. 2, which depicts graphical tools for model diagnostics.

Figure 2: Graphic tools for model diagnostic (probability plot, QQ-plot, density plot, and return level plot).

Profile likelihood confidence intervals can also be computed, even for return levels (see Fig. 3), with:

gpd.pfrl(mle, 0.995, range = c(20, 120),
         conf = 0.95)

Figure 3: 95% profile confidence interval for the return level associated with non-exceedance probability 0.995.

Miscellaneous Features

The POT package can also:

• simulate and compute density, quantile and distribution functions for the GPD (see the short sketch after this list);

• fit the GPD with a varying threshold using MLE;

• fit the GPD with some parameters held fixed, using MLE;

• perform analysis of variance for two nested models;

• estimate the extremal index using two estimators;

• display an L-moment plot (Hosking and Wallis, 1997);

• compute sample L-moments;

• convert non-exceedance probabilities to return periods and vice versa;

• compute "averaged" time series using a moving average window.
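For the first item above, a short sketch (not from the original text, and assuming the usual loc, scale and shape arguments of the GPD functions in POT):

x <- rgpd(500, loc = 5, scale = 2.7, shape = 0.28)   # simulate GPD data
dgpd(6, loc = 5, scale = 2.7, shape = 0.28)          # density at 6
pgpd(10, loc = 5, scale = 2.7, shape = 0.28)         # P(X <= 10)
qgpd(0.995, loc = 5, scale = 2.7, shape = 0.28)      # 99.5% quantile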


Currently, most of the package developments are
devoted to bivariate peaks over threshold. For this
purpose, the POT package can also:
• fit a bivariate GPD using 6 parametric dependence functions;
• fit a first order Markov chain with a fixed extreme value dependence structure to all threshold exceedances;

• simulate first order Markov chains with a fixed
extreme value dependence structure;
• plot the Pickands’ dependence and the spectral
density functions.

Bibliography

S. Coles. An Introduction to Statistical Modelling of Extreme Values. Springer Series in Statistics. London, 2001.

J. Hosking and J. Wallis. Parameter and quantile estimation for the generalised Pareto distribution. Technometrics, 29(3):339–349, 1987.

J. Hosking and J. Wallis. Regional Frequency Analysis. Cambridge University Press, 1997.

S. Juárez and W. Schucany. Robust and efficient estimation for the generalised Pareto distribution. Extremes, 7(3):237–251, 2004. ISSN 1386-1999.

L. Peng and A. Welsh. Robust estimation of the generalised Pareto distribution. Extremes, 4(1):53–65, 2001.

J. Pickands. Statistical inference using extreme order statistics. Annals of Statistics, 3:119–131, 1975.
Mathieu Ribatet
Cemagref Unité de Recherche HH Lyon, France
ribatet@lyon.cemagref.fr

Backtests
by Kyle Campbell, Jeff Enos, Daniel Gerlanc and David
Kane

Introduction

The backtest package provides facilities for exploring portfolio-based conjectures about financial instruments (stocks, bonds, swaps, options, et cetera). For example, consider a claim that stocks for which analysts are raising their earnings estimates perform better than stocks for which analysts are lowering estimates. We want to examine whether, on average, stocks with raised estimates have higher future returns than stocks with lowered estimates, and whether this is true over various time horizons and across different categories of stocks. Colloquially, "backtest" is the term used in finance for such tests.

Background

To demonstrate the capabilities of the backtest package we will consider a series of examples based on a single real-world data set. StarMine (see www.starmine.com for details) is a San Francisco research company which creates quantitative equity models for stock selection. According to the company:

StarMine Indicator is a 1-100 percentile ranking of stocks that is predictive of future analyst revisions. StarMine Indicator improves upon basic earnings revisions models by:

• Explicitly considering management guidance.

• Incorporating SmartEstimates, StarMine's superior estimates constructed by putting more weight on the most accurate analysts.

• Using a longer-term (forward 12-month) forecast horizon (in addition to the current quarter).
StarMine Indicator is positively correlated to future stock price movements. Top-decile stocks have annually outperformed bottom-decile stocks by 27 percentage points over the past ten years across all global regions.

These ranks and other attributes of stocks are in
the starmine data frame, available as part of the
backtest package.
> data("starmine")
> names(starmine)
 [1] "date"       "id"         "name"
 [4] "country"    "sector"     "cap.usd"
 [7] "size"       "smi"        "fwd.ret.1m"
[10] "fwd.ret.6m"

starmine contains selected attributes such as sector, market capitalisation, country, and various measures of return for a universe of approximately 6,000
securities. The data is on a monthly frequency from
January, 1995 through November, 1995. The number
of observations varies over time from a low of 4,528
in February to a high of 5,194 in November.
date count
1995-01-31 4593
1995-02-28 4528
1995-03-31 4569
1995-04-30 4708
1995-05-31 4724
1995-06-30 4748
1995-07-31 4878
1995-08-31 5092
1995-09-30 5185
1995-10-31 5109
1995-11-30 5194


The smi column contains the StarMine Indicator
score for each security and date if available. Here is
a sample of rows and columns from the data frame:
      date          name fwd.ret.1m fwd.ret.6m smi
1995-01-31   Lojack Corp        0.09        0.8  96
1995-02-28  Raymond Corp        0.05        0.1  85
1995-02-28   Lojack Corp        0.08        0.7  90
1995-03-31   Lojack Corp        0.15        1.0  49
1995-08-31 Supercuts Inc       -0.11       -0.5  57
1995-10-31   Lojack Corp       -0.40       -0.2  22
1995-11-30   Lojack Corp        0.20        0.4  51

Most securities (like LoJack above) have multiple
entries in the data frame, each for a different date.
The row for Supercuts indicates that, as of the close of
business on August 31, 1995, its smi was 57. During
the month of September, its return (i.e., fwd.ret.1m)
was -11%.

A simple backtest
Backtests are run by calling the function backtest to
produce an object of class backtest.
> bt <- backtest(starmine, in.var = "smi",
+   ret.var = "fwd.ret.1m")
starmine is a data frame containing all the information necessary to conduct the backtest. in.var
and ret.var identify the columns containing the
input and return variables, respectively. backtest
splits observations into 5 (the default) quantiles, or
“buckets,” based on the value of in.var. Lower
(higher) buckets contain smaller (larger) values of
in.var. Each quantile contains an approximately
equal number of observations. This backtest creates
quantiles according to values in the smi column of
starmine.
 [1,21] (21,40] (40,59] (59,82] (82,100]
   6765    6885    6642    6600     6496

backtest calculates the average return within
each bucket. From these averages we calculate the
spread, or the difference between the average return
of the highest and lowest buckets.
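The bucketing idea can be mimicked with base R alone (a rough sketch of the idea, not the backtest implementation; the means should be close to, though not necessarily identical with, the summary below):

## pool all observations, cut smi into quintiles and compare mean
## one-month forward returns; the spread is high minus low
q <- cut(starmine$smi,
         quantile(starmine$smi, 0:5 / 5, na.rm = TRUE),
         include.lowest = TRUE)
round(tapply(starmine$fwd.ret.1m, q, mean, na.rm = TRUE), 3)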
Calling summary on the resulting object of class
backtest reports the in.var, ret.var, and by.var
used. We will use a by.var in later backtests.
> summary(bt)

Backtest conducted with:

1 in.var: smi;
1 ret.var: fwd.ret.1m;
and no by.var.

         low     2     3    4  high spread
pooled 0.011 0.013 0.016 0.02 0.032  0.021

This backtest is an example of a pooled backtest. In such a backtest, we assume that all observations are exchangeable. This means that a quantile may contain observations for any stock and from any date. Quantiles may contain multiple observations for the same stock.

The backtest summary shows that the average return for the highest bucket was 3.2%. This value is the mean one month forward return of stocks with smi values in the highest quantile. As the observations are exchangeable, we use every observation in the starmine data frame with a non-missing smi value. This means that the returns for LoJack from both 1995-01-31 and 1995-02-28 would contribute to the 3.2% mean of the high bucket.

The backtest suggests that StarMine's model predicted performance reasonably well. On average, stocks in the highest quantile returned 3.2% while stocks in the lowest quantile returned 1.1%. The spread of 2.1% suggests that stocks with high ratings perform better than stocks with low ratings.

Natural backtests

A natural backtest requires that the frequency of returns and observations be the same.

A natural backtest approximates the following implementation methodology: in the first period form an equal weighted portfolio with long positions in the stocks in the highest quantile and short positions in the stocks in the lowest quantile. Each stock has an equal weight in the portfolio; if there are 5 stocks on the long side, each stock has a weight of 20%. Subsequently rebalance the portfolio every time the in.var values change. If the observations have a monthly frequency, the in.var values change monthly and the portfolio must be rebalanced accordingly. When the in.var values change, rebalancing has the effect of exiting positions that have left the top and bottom quantiles and entering positions that have entered the top and bottom quantiles. If the data contains monthly observations, we will form 12 portfolios per year.

To create a simple natural backtest, we again call backtest using fwd.ret.1m. This is the only return value in starmine for which we can construct a natural backtest of smi.

> bt <- backtest(starmine, id.var = "id",
+   date.var = "date", in.var = "smi",
+   ret.var = "fwd.ret.1m", natural = TRUE)

Natural backtests require a date.var and id.var, the names of the columns in the data frame containing the dates of the observations and unique security identifiers, respectively. Calling summary displays the results of the backtest:

> summary(bt)


Backtest conducted with:
1 in.var: smi;
1 ret.var: fwd.ret.1m;
and no by.var.

              low      2      3       4   high spread
1995-01-31  0.003  0.011  0.003 -0.0001  0.019  0.016
1995-02-28 -0.008 -0.003  0.003  0.0072  0.013  0.021
1995-03-31  0.029  0.017  0.013  0.0225  0.037  0.008
1995-04-30 -0.002 -0.003  0.002 -0.0054  0.005  0.007
1995-05-31  0.010  0.013  0.019  0.0228  0.044  0.034
1995-06-30  0.072  0.059  0.057  0.0708  0.101  0.030
1995-07-31  0.033  0.030  0.034  0.0323  0.052  0.018
1995-08-31 -0.004  0.006  0.017  0.0119  0.024  0.028
1995-09-30 -0.055 -0.030 -0.031 -0.0219 -0.014  0.041
1995-10-31  0.030  0.032  0.040  0.0430  0.038  0.008
1995-11-30  0.013  0.016  0.021  0.0294  0.037  0.024
MEAN        0.011  0.014  0.016  0.0193  0.032  0.021

average turnover: 0.5
mean spread: 0.02
sd spread: 0.01
raw sharpe ratio: 2

Focus on the mean return of the highest quantile for 1995-02-28 of 1.3%. backtest calculated this
value by first computing the 5 quantiles of the input variable smi over all observations in starmine.
Among the observations that fall into the highest
quantile, those with date 1995-02-28 contribute to the
mean return of 1.3%. It is important to note that
the input variable quantiles are computed over the
whole dataset, as opposed to within each category
that may be defined by a date.var or by.var.
The bottom row of the table contains the mean
quantile return over all dates. On account of the
way we calculate quantile means, a single stock will
have more effect on the quantile mean if during that
month there are fewer stocks in the quantile. Suppose that during January there are only 2 stocks in
the low quantile. The return of a single stock in January will account for 1/2 of the quantile mean. This
is different than a pooled backtest where every observation within a quantile has the same weight. In
a natural backtest, the weight of a single observation
depends on the number of observations for that period.
Calling summary yields information beyond that
offered by the summary method of a pooled backtest. The first piece of extra information is average
turnover. Turnover is the percentage of the portfolio
we would have to change each month if we implemented the backtest as a trading strategy. For example, covering all the shorts and shorting new stocks
would yield a turnover of 50% because we changed
half the portfolio. We trade stocks when they enter
or exit the extreme quantiles due to in.var changes.
On average, we would turn over 50% of this portfolio
each month.
The second piece of extra information is mean
spread. The spread was positive each month, so on
average the stocks with the highest smi values outperformed the stocks with the lowest smi values.

On average, stocks in the highest quantile outperformed stocks in the lowest quantile by 2%. The
third piece of extra information, the standard deviation of spread, is 1%. The spread varied from month
to month, ranging from a low of close to 0% to a high
of over 4%.
We define the fourth piece of extra information,
raw (non-annualized) Sharpe ratio, as return/risk. We set
return equal to mean spread return and use the standard deviation of spread return as a measure of risk.
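As a quick check of these summary numbers, using the monthly spreads from the table above:

spreads <- c(0.016, 0.021, 0.008, 0.007, 0.034, 0.030,
             0.018, 0.028, 0.041, 0.008, 0.024)
mean(spreads)                 # about 0.02
sd(spreads)                   # about 0.01
mean(spreads) / sd(spreads)   # raw Sharpe ratio, about 2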

More than one in.var
backtest allows for more than one in.var to be
tested simultaneously. Besides using smi, we will
test market capitalisation in dollars, cap.usd. This
is largely a nonsense variable since we do not expect
large cap stocks to outperform small cap stocks — if
anything, the reverse is true historically.
> bt <- backtest(starmine, id.var = "id",
+   date.var = "date", in.var = c("smi",
+   "cap.usd"), ret.var = "fwd.ret.1m",
+   natural = TRUE)
Because more than one in.var was specified,
only the spread returns for each in.var are displayed, along with the summary statistics for each
variable.
> summary(bt)
Backtest conducted with:
2 in.vars: smi, cap.usd;
1 ret.var: fwd.ret.1m;
and no by.var.

              smi cap.usd
1995-01-31  0.016 -0.0138
1995-02-28  0.021  0.0017
1995-03-31  0.008 -0.0023
1995-04-30  0.007 -0.0052
1995-05-31  0.034 -0.0568
1995-06-30  0.030 -0.0143
1995-07-31  0.018 -0.0008
1995-08-31  0.028  0.0051
1995-09-30  0.041  0.0321
1995-10-31  0.008  0.0127
1995-11-30  0.024  0.0029

summary stats for in.var = smi:
average turnover: 0.5
mean spread: 0.02
sd spread: 0.01
raw sharpe ratio: 2
summary stats for in.var = cap.usd:
average turnover: 0.1
mean spread: -0.004
sd spread: 0.02
raw sharpe ratio: -0.2


Viewing the results for the two input variables side-by-side allows us to compare their performance easily. As we expected, cap.usd as an input variable did not perform as well as smi over our backtest period. While smi had a positive return during each month, cap.usd had a negative return in 6 months and a negative mean spread. In addition, the spread returns for cap.usd were twice as volatile as those of smi.

There are several plotting facilities available in backtest that can help illustrate the difference in performance between these two signals. These plots can be made from a natural backtest with any number of input variables. Below is a bar chart of the monthly returns of the two signals together:

> plot(bt, type = "return")

Figure 1: Monthly return spreads.

Returns for smi were consistently positive. Returns for cap.usd were of low quality, but improved later in the period. cap.usd had a particularly poor return in June. We can also plot cumulative returns for each input variable as shown in Figure 2.

> plot(bt.save, type = "cumreturn.split")

Figure 2: Cumulative spread and quantile returns.

The top region in this plot shows the cumulative return of each signal on the same return scale, and displays the total return and worst drawdown of the entire backtest period. The bottom region shows the cumulative return of the individual quantiles over time. We can see that smi's top quantile performed best and its lowest quantile performed worst. In contrast, cap.usd's lowest quantile was its best performing.

Though it is clear from the summary above that smi generated about 5 times as much turnover as cap.usd, a plot is available to show the month-by-month turnover of each signal; see Figure 3. This chart shows that the turnover of smi was consistently around 50%, with lower turnover in September and October, while the turnover of cap.usd was consistently around 10%.

> plot(bt, type = "turnover")

Figure 3: Monthly turnover.

Using by.var
In another type of backtest we can look at quantile spread returns by another variable. Specifying
by.var breaks up quantile returns into categories defined by the levels of the by.var column in the input
data frame. Consider a backtest of smi by sector:
> bt <- backtest(starmine, in.var = "smi",
+   ret.var = "fwd.ret.1m",
+   by.var = "sector")
> summary(bt)
Backtest conducted with:


1 in.var: smi;
1 ret.var: fwd.ret.1m;
and by.var: sector.

         low     2     3     4 high spread
Durbl 0.0063 0.007 0.009 0.007 0.01  0.004
Enrgy 0.0152 0.014 0.017 0.019 0.04  0.024
HiTec 0.0237 0.016 0.026 0.029 0.05  0.024
Hlth  0.0395 0.036 0.021 0.038 0.05  0.006
Manuf 0.0005 0.005 0.014 0.009 0.02  0.022
Money 0.0190 0.024 0.021 0.026 0.04  0.017
NoDur 0.0036 0.010 0.010 0.019 0.03  0.025
Other 0.0045 0.006 0.015 0.017 0.02  0.017
Shops 0.0020 0.004 0.005 0.017 0.03  0.026
Telcm 0.0277 0.014 0.022 0.023 0.03  0.005
Utils 0.0128 0.021 0.013 0.016 0.02  0.007

This backtest categorises observations by the quantiles of smi and the levels of sector. The highest spread return of 2.6% occurs in Shops. Since smi quantiles were computed before the observations were split into groups by sector, however, we cannot be sure how much confidence to place in this result. There could be very few observations in this sector, or one of the top and bottom quantiles could have a disproportionate number of observations, thereby making the return calculation suspect. counts provides a simple check.

> counts(bt)

$smi
       low    2    3    4 high
Durbl  348  349  261  231  223
Enrgy  246  250  158  130   64
HiTec  647  660  824 1004 1432
Hlth   380  377  410  464  424
Manuf 1246 1265 1279 1395 1576
Money  959 1265 1244 1095  875
NoDur  615  563  528  441  371
Other 1034  940  784  760  710
Shops  870  714  710  697  548
Telcm  186  177  140  129   95
Utils  152  245  252  198  130

While there seems to be an adequate number of observations in Shops, it is important to note that there are approximately 60% more observations contributing to the mean return of the lowest quantile than to the mean return of the highest quantile, 870 versus 548. Overall, we should be more confident in results for Manuf and Money due to their larger sample sizes. We might want to examine the result for HiTec more closely, however, since there are more than twice the number of observations in the highest quantile than in the lowest.

by.var can also be numeric, as in this backtest using cap.usd:

> bt <- backtest(starmine,
+   in.var = "smi", ret.var = "fwd.ret.1m",
+   by.var = "cap.usd",
+   buckets = c(5, 10))
> summary(bt)

Backtest conducted with:

1 in.var: smi;
1 ret.var: fwd.ret.1m;
and by.var: cap.usd.

        low      2      3     4  high spread
low  0.0105 0.0139 0.0236 0.028 0.038  0.028
2    0.0078 0.0093 0.0216 0.025 0.046  0.038
3    0.0186 0.0072 0.0167 0.031 0.034  0.016
4    0.0124 0.0142 0.0139 0.013 0.038  0.026
5    0.0080 0.0124 0.0087 0.010 0.025  0.017
6    0.0126 0.0121 0.0191 0.021 0.026  0.013
7    0.0080 0.0070 0.0160 0.019 0.034  0.026
8    0.0050 0.0181 0.0101 0.014 0.027  0.022
9    0.0104 0.0153 0.0167 0.014 0.028  0.018
high 0.0156 0.0207 0.0133 0.023 0.026  0.011

Since cap.usd is numeric, the observations are now split by two sets of quantiles. Those listed across the top are, as before, the input variable quantiles of smi. The row names are the quantiles of cap.usd. The buckets parameter of backtest controls the number of quantiles. The higher returns in the lower quantiles of cap.usd suggest that smi performs better in small-cap stocks than in large-cap stocks.

Multiple return horizons
Using backtest we can also analyse the performance
of a signal relative to multiple return horizons. Below is a backtest that considers one month and six
month forward returns together:
> bt <- backtest(starmine, in.var = "smi",
+   buckets = 4, ret.var = c("fwd.ret.1m",
+   "fwd.ret.6m"))
> summary(bt)
Backtest conducted with:
1 in.var: smi;
2 ret.vars: fwd.ret.1m, fwd.ret.6m;
and no by.var.
             low     2     3 high spread
fwd.ret.1m 0.011 0.015 0.018 0.03  0.019
fwd.ret.6m 0.112 0.121 0.142 0.17  0.059

The performance of smi over these two return
horizons tells us that the power of the signal degrades after the first month. Using six month forward return, fwd.ret.6m, the spread is 6%. This
is only 3 times larger than the 2% spread return in
the first month despite covering a period which is 6
times longer. In other words, the model produces 2%
spread returns in the first month but only 4% in the 5
months which follow.
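The arithmetic behind that statement, using the spreads reported in the summary:

spread.1m <- 0.019
spread.6m <- 0.059
spread.6m - spread.1m   # roughly 0.04 earned over the following five months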

Conclusion
The backtest package provides a simple collection of
tools for performing portfolio-based tests of financial
conjectures. A much more complex package, portfolioSim, provides facilities for historical portfolio
performance analysis using more realistic assumptions. Built on the framework of the portfolio package (see Enos and Kane (2006) for an introduction), portfolioSim tackles the issues of risk exposures
and liquidity constraints, as well as arbitrary portfolio construction and trading rules. Above all, the
flexibility of R itself allows users to extend and modify these packages to suit their own needs. Before
reaching that level of complexity, however, backtest
provides a good starting point for testing a new conjecture.

Bibliography
J. Enos and D. Kane. Analysing equity portfolios in R. R News, 6(2):13–19, May 2006. URL http://CRAN.R-project.org/doc/Rnews.

Kyle Campbell, Jeff Enos, Daniel Gerlanc and David Kane
Kane Capital Management
Cambridge, Massachusetts, USA
Kyle.W.Campbell@williams.edu, jeff@kanecap.com,
dgerlanc@gmail.com and david@kanecap.com

Review of John Verzani’s Book
Using R for Introductory Statistics
Andy Liaw
To the best of my knowledge, this
book is the first of its kind: a standalone introductory statistics textbook that integrates R throughout.
The advantages should be obvious: Students would not need to
purchase a supplement that covers the software, in addition to the
main textbook (although the author states in the Preface that the
book should also be useful as an accompaniment for
a standard introductory text). That the software is
freely available is a big bonus. Moreover, the book
covers basic descriptive statistics before any probability models are mentioned. For students that are
less mathematically inclined, this should make materials easier to absorb. (The author states in the Preface that the book aims at classes that are based on
pre-calculus skills.)
The book contains 12 chapters. The first four
chapters of the book cover descriptive statistics, both
numerical and graphical, from general introduction
(Chapter 1), through univariate and bivariate data
(Chapters 2 and 3) to multivariate data (Chapter 4).
Each chapter covers both categorical and numerical data. The author chose to treat two independent samples as bivariate data and several independent samples as multivariate data, which I think is
a bit unusual. Chapter 5 covers probability models.
Chapter 6 covers simulations, setting up for the topics on inference in the chapters that follow. Chapters 7 and 8 cover confidence intervals and significance tests, respectively. Chapter 9 discusses the
χ2 tests for the multinomial distribution, the test
for independence, and goodness-of-fit tests such as
Kolmogorov-Smirnov and Shapiro-Wilk. Chapter 10
covers both simple and multiple linear regression.
Chapter 11 covers one- and two-way ANOVA as well
as ANCOVA. Chapter 12 covers logistic and nonlinear regression. There are also five appendices that
cover various aspects of R (installation, GUI, teaching with R, graphics, programming). Throughout the
book, examples of R usage are interspersed among
the main text, and some sections devoted to R topics are introduced as the need arises (e.g., in Chapter 6, Simulations, Section 6.2 covers for() loops).
Data used as examples were drawn from a wide variety of areas. Exercises are given at the end of sections (rather than chapters). The book also has an
accompanying add-on package, UsingR (available on
CRAN), which contains data sets and some functions
used in the book. The book also has a web site that
contains answers to selected problems, the UsingR
package for various platforms (including one for S-PLUS), as well as errata.
Several ideas presented in the book deserve accolades (e.g., covering EDA before introducing probability models, coverage of robust/resistant methods,
thorough integration of R into the materials). However, there are also drawbacks. The most glaring one
is the fact that many rather technical terms are used
before they are introduced or explained, and some
are not sufficiently elaborated. (E.g., “density” is first
used to show how a kernel density estimate can be
added to a histogram, but no explanation was given
for what a density is or what it means.) In my teaching experience, one of the most difficult (but absolutely essential) concepts for students to grasp is the


idea of the sampling distribution of a statistic, yet
this did not receive nearly the attention I believe it
deserves. The discussion of scatterplot smoothing
(Section 3.4.7, “Trend lines”) gave a minimal description of what smoothers are available in base R, such
as smoothing splines, loess, and Friedman’s supersmoother. I would be surprised if students in intro
stat courses are not completely bewildered by such
a minimal description. This will make it harder for
some students to follow the text along on their own.
Some people might be interested in learning how
to do basic data analysis using R. As these people
are not among the intended audience, this book may
not serve them as nicely as others, because the R-specific topics are scattered throughout the book in
bits and pieces, each making its entrance as the statistical topic being covered requires it.
Now, some real nit-picking on more esoteRic
things: The author seems to use “library” and “package” interchangeably, which could make some R

users cringe. Also, on page 8, the virtue of using

• ', which means they can be parsed/scanned.
parsed/scanned.
• Printing functions (without source attributes)
and expressions now preserves integers (using
the L suffix) and NAs (using NA_real_ etc where
necessary).
• The ‘internal’ objects .helpForCall, .tryHelp
and topicName are no longer exported from
utils.
• The internal regex code has been upgraded to
glibc 2.5 (from 2.3.6).
• Text help now attempts to display files which
have an encoding section in the specified encoding via file.show().
• R now attempts to keep track of character
strings which are known to be in Latin-1 or
UTF-8 and print or plot them appropriately in
other locales. This is primarily intended to

make it possible to use data in Western European languages in both Latin-1 and UTF8 locales. Currently scan(), read.table(),
readLines(), parse() and source() allow encodings to be declared, and console input in
suitable locales is also recognized.
New function Encoding() can read or set the
declared encodings for a character vector.
• There have been numerous performance improvements to the data editor on both Windows and X11. In particular, resizing the window works much better on X11.

Deprecated & defunct
• symbol.C() and symbol.For() are defunct,
and have been replaced by wrappers that give
a warning.
• Calling a builtin function with an empty argument is now always an error.
• The autoloading of ts() is defunct.
• The undocumented reserved word GLOBAL.ENV
has been removed. (It was yet another way to
get the value of the symbol .GlobalEnv.)
• The deprecated behaviour of structure() in
adding a class when specifying with tsp or
levels attributes is now defunct.
• unix() is now finally defunct, having been
deprecated for at least seven years.
• Sys.putenv() is now deprecated in favour
of Sys.setenv(), following the POSIX recommendation.
• Building R with ‘--without-iconv’ is deprecated.
• Using $ on an atomic vector is deprecated (it
was previously valid and documented to return NULL).
• The use of storage.mode<- for other than
standard types (and in particular for value
"single") is deprecated: use mode<- instead.

Installation
• A suitable iconv (e.g., from glibc or GNU
libiconv) is required.
For 2.5.x only you
can build R without it by configuring using
‘--without-iconv’.
• There is support again for building on AIX
(tested on 5.2 and 5.3) thanks to Ei-ji Nakama.

• Autoconf 2.60 or later is used to create configure. This makes a number of small changes, and incorporates the changes to the detection of a C99-compliant C compiler backported for 2.4.1.

• Detection of a Java development environment was added such that packages don't need to provide their own Java detection. R CMD javareconf was updated to look for the corresponding Java tools as well.

• Added workaround for reported non-POSIX sh on OSF1. (PR#9375)

• make install-strip now works, stripping the executables and also the shared libraries and modules on platforms where libtool knows how to do so.

• Building R as a shared library and standalone nmath now installs pkg-config files 'libR.pc' and 'libRmath.pc' respectively.

• Added test for insufficiently complete implementation of sigaction.

C-level facilities

• Functions str2type, type2char and type2str are now available in 'Rinternals.h'.

• Added support for Objective C in R and packages (if available).

• R_ParseVector() has a new 4th argument SEXP srcfile allowing source references to be attached to the returned expression list.

• Added ptr_R_WriteConsoleEx callback which allows consoles to distinguish between regular output and errors/warnings. To ensure backward compatibility it is only used if ptr_R_WriteConsole is set to NULL.

Utilities

• Additional Sweave() internal functions are exported to help writing new drivers, and RweaveLatexRuncode() is now created using a helper function (all from a patch submitted by Seth Falcon).

• The following additional flags are accessible from R CMD config: OBJC, OBJCFLAGS, JAR, JAVA, JAVAC, JAVAH, JAVA_HOME, JAVA_LIBS.

• R CMD build now takes the package name from the 'DESCRIPTION' file and not from the directory. (PR#9266)

• checkS3methods() (and hence R CMD check) now checks agreement with primitive internal generics, and checks for additional arguments in methods where the generic does not have a ... argument. codoc() now knows the argument lists of primitive functions.

• R CMD INSTALL and R CMD REMOVE now use as the default library (if '-l' is not specified) the first library that would be used if R were run in the current environment (and they run R to find it).

• There is a new front-end 'Rscript' which can be used for #! scripts and similar tasks. See help("Rscript") and 'An Introduction to R' for further details.

• R CMD BATCH (not Windows) no longer prepends 'invisible(options(echo = TRUE))' to the input script. This was the default unless '--slave' is specified and the latter is no longer overridden.

On all OSes it makes use of the ‘-f’ argument
to R, so file("stdin") can be used from BATCH
scripts.
On all OSes it reports proc.time() at the end
of the script unless q() is called with options to
inhibit this.
• R CMD INSTALL now prepends the installation
directory (if specified) to the library search
path.
• Package installation now re-encodes R files and
the ‘NAMESPACE’ file if the ‘DESCRIPTION’
file specifies an encoding, and sets the encoding used for reading files in preparing for
LazyData. This will help if a package needs to
be used in (say) both Latin-1 and UTF-8 locales
on different systems.
• R CMD check now reports on non-ASCII strings
in datasets. (These are a portability issue,
which can be alleviated by marking their encoding: see ‘Writing R Extensions’.)
• Rdiff now converts CRLF endings in the target
file, and converts UTF-8 single quotes in either
to ASCII quotes.
• New recommended package codetools by
Luke Tierney provides code-analysis tools.
This can optionally be used by R CMD check to
detect problems, especially symbols which are
not visible.
• R CMD config now knows about LIBnn .

• New recommended package rcompgen
by Deepayan Sarkar provides support for
command-line completion under the Unix terminal interface (provided readline is enabled)
and the Windows Rgui and Rterm front ends.

Bug fixes
• gc() can now report quantities of ‘Vcells’ in excess of 16Gb on 64-bit systems (rather than reporting NA).
• Assigning class factor to an object now requires it has integer (and not say double) codes.
• structure() ensures that objects with added
class factor have integer codes.
• The formula and outer attributes of datasets
ChickWeight, CO2, DNase, Indometh, Loblolly,
Orange and Theoph now have an empty environment and not the environment used to
dump the datasets in the package.
• Dataset Seatbelts now correctly has class
c("mts", "ts").
• str() now labels classes on data frames more
coherently.
• Several ‘special’ primitives and .Internals could
return invisibly if the evaluation of an argument led to the visibility flag being turned off.
These included as.character(), as.vector(),
call(), dim(), dimnames(), lapply(), rep(),
seq() and seq_along(). Others (e.g., dput()
and print.default()) could return visibly
when this was not intended.
• Several primitives such as dim() were not
checking the number of arguments supplied
before method dispatch.
• Tracing of primitive functions has been corrected. It should now be the case that tracing
either works or is not allowed for all primitive functions. (Problems remain if you make a
primitive into a generic when it is being traced.
To be fixed later.)
• max.col() now omits infinite values in determining the relative tolerance.
• R CMD Sweave and R CMD Stangle now respond to ‘--help’ and ‘--version’ like other
utilities.
• .libPaths() adds only existing directories (as
it was documented to, but could add nondirectories).

• setIs() and setClassUnion() failed to find
some existing subclasses and produced spurious warnings, now fixed.
• data.frame() ignored row.names for 0-column
data frames, and no longer treats an explicit
row.names=NULL differently from the default
value.
• identical() looked at the internal structure of
the row.names attribute, and not the value visible at R level.
• abline(reg) now also correctly works with
intercept-only lm models, and abline() warns
more when it’s called illogically.
• warning() was truncating messages at
getOption("warning.length") - 1 (not as documented), with no indication. It now appends
[... truncated].
• Stangle/Sweave were throwing spurious
warnings if options result or strip.white
were unset.
• all.equal() was ignoring check.attributes
for list and expression targets, and checking
only attributes on raw vectors. Logical vectors
were being compared as if they were numeric,
(with a mean difference being quoted).
• Calculating the number of significant digits in
a number was itself subject to rounding errors for digits ≥ 16. The calculation has
been changed to err on the side of slightly
too few significant digits (but still at least 15)
rather than far too many. (An example is
print(1.001, digits=16).)
• unlink() on Unix-alikes failed for paths containing spaces.
• substr() and friends treated NA start or stop
incorrectly.
• merge(x, y, all.y = TRUE) would sometimes incorrectly return logical columns for
columns only in y when there were no common rows.
• read.table(fn, col.names=) on an empty
file returned NULL columns, rather than
logical(0) columns (which is what results
from reading a file with just a header).
• grid.[xy]axis(label=logical(0)) failed.
• expression() was unnecessarily duplicating
arguments.
• as.expression(list) returned a single-element expression vector, which was not compatible with S: it now copies lists element-by-element.
• supsmu(periodic = TRUE) could segfault.
(PR#9502, detection and patch by Bill Dunlap.)

• pmax/pmin called with only logical arguments
did not coerce to numeric, although they were
documented to do so (as max/min do).

• str(obj) could fail when obj contained a dendrogram.

• methods() did not know that cbind() and
rbind() are internally generic.

• Using data_frame[, last_column] <- NULL
failed (PR#9565).

• dim(x) <- NULL removed the names of x, but
this was always undocumented. It is not clear
that it is desirable but it is S-compatible and relied on, so is now documented.
• which(x, arr.ind = TRUE) did not return a
matrix (as documented) if x was an array of
length 0.

• choose(n, k) could return non-integer values
for integer n and small k on some platforms.
• nclass.scott(x) and nclass.FD(x) no longer
return NaN when var(x) or IQR(x) (respectively) is zero.
hist() now allows breaks = 1 (which the
above patch will return), but not breaks = Inf
(which gave an obscure error).

• C-level duplicate() truncated CHARSXPs with
embedded nuls.

• strptime("%j") now also works for the first
days of Feb-Dec. (PR#9577)

• Partial matching of attributes was not working as documented in some cases if there were
more than two partial matches or if ‘names’
was involved.

• write.table() now recovers better if file is
an unopened connection. (It used to open it for
both the column names and the data.)

• data(package=character(0)) was not looking
in ‘./data’ as documented.
• summary.mlm() failed if some response names
were "" (as can easily happen if cbind() is
used).
• The postscript() and pdf() drivers shared
an encoding list but used slightly different formats. This caused problems if both were used
with the same non-default encoding in the
same session. (PR#9517)
• The data editor was not allowing Inf, NA and
NaN to be entered in numerical columns. It was
intended to differentiate between empty cells
and NAs, but did not do so: it now does so for
strings.
• supsmu() could segfault if all cases had nonfinite values. (PR#9519)
• plnorm(x, lower.tail=FALSE) was returning
the wrong tail for x ≤ 0. (PR#9520)
• which.min() would not report a minimum
of +Inf, and analogously for which.max().
(PR#9522)
• R CMD check could fail with an unhelpful error when checking Rd files for errors if there
was only one file and that had a serious error.
(PR#9459)
• try() has been reimplemented using tryCatch() to solve two problems with the original implementation: (i) try() would run non-NULL options("error") expressions for errors within a try, and (ii) try() would catch user interrupts. (A brief illustration of this and the which.min() fix appears after this list.)
• Fixed bug in mosaicplot(sort=) introduced
by undocumented change in R 2.4.1 (changeset
r39655).
• contr.treatment(n=0) failed with a spurious
error message. (It remains an error.)
• as.numeric() was incorrectly documented: it
is identical to as.double().
• jitter(rep(-1, 3)) gave NaNs. (PR#9580)
• max.col() was not random for a row of zeroes.
(PR#9542)
• ansari.test(conf.int=TRUE, exact=FALSE)
failed.
• trace() now works on S3 registered methods,
by modifying the version in the S3 methods table.
• rep(length=1, each=0) segfaulted.
• postscript() could overflow a buffer if used
with a long command argument.
• The internal computations to copy complete attribute lists did not copy the flag marking S4
objects, so the copies no longer behaved like S4
objects.
• The C code of nlminb() was altering a variable without duplicating it. (This did not affect nlminb() but would have if the code was
called from a different wrapper.)
• smooth(kind = "3RS3R") (the current default)
used .C(DUP = FALSE) but altered its input argument. (This was masked by duplication in
as.double().)
• The signature for the predefined S4 method for
as.character() was missing ... .
• readBin(raw_vector) could read beyond the
end of the vector when size-changing was involved.
• The C entry point PrintValue (designed to emulate auto-printing) would not find show() for use on S4 objects, and did not have the same search path (for show(), print() and print() methods) as auto-printing. Also, auto-printing and print() of S4 objects would fail to find show if the methods namespace was loaded but the package was not attached (or otherwise not in the search path).
• print() (and auto-printing) now recognize S4
objects even when methods is not loaded, and
print a short summary rather than dump the
internal structure.
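
For readers who want to see a couple of these fixes in action, the following short R session is a minimal illustration of the documented try() and which.min() behaviour (assuming R 2.5.0 or later); the commented results are what the fixed functions should return.

## (i) try() no longer runs a user-set options(error = ...) handler for
##     errors it catches, and returns an object of class "try-error".
options(error = function() cat("this handler is not run inside try()\n"))
res <- try(log("not a number"), silent = TRUE)
inherits(res, "try-error")     # TRUE
options(error = NULL)          # restore the default error handling

## which.min() and which.max() now report an infinite extreme:
which.min(c(Inf, Inf, Inf))    # 1 (earlier versions reported no minimum)
which.max(c(-Inf, -Inf))       # 1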

Changes on CRAN
by Kurt Hornik

New contributed packages
ARES Allelic richness estimation, with extrapolation beyond the sample size. Generates an allelic richness accumulation curve. This curve
shows the expected number of unique alleles
in a population when taking a sample of individuals. The function aresCalc takes a binary
data matrix as input, showing the presence of
alleles per individual, and returns an accumulation curve (mean with 95% confidence bounds). The function aresPlot can be used to
plot the output from aresCalc. By Emiel van
Loon and Scott Davis.
BayHaz Bayesian Hazard Rate Estimation: a suite of
R functions for Bayesian estimation of smooth
hazard rates via Compound Poisson Process
(CPP) priors. By Luca La Rocca.
BiasedUrn Biased Urn model distributions. Statistical models of biased sampling in the form
of univariate and multivariate non-central hypergeometric distributions, including those of
Wallenius and Fisher (also called extended hypergeometric distribution). By Agner Fog.
BootCL Bootstrapping test for chromosomal localization. By Eun-Kyung Lee, Samsun Sung, and
Heebal Kim.
Brobdingnag Very large numbers in R. Real numbers are held using their natural logarithms,
plus a logical flag indicating sign. The package
includes a vignette that gives a step-by-step introduction to using S4 methods. By Robin K. S.
Hankin.
CCA Canonical correlation analysis. Provides a set
of functions that extend the cancor function
with new numerical and graphical outputs.
It also include a regularized extension of the
canonical correlation analysis to deal with data
sets with more variables than observations. By
Ignacio González and Sébastien Déjean.
CreditMetrics Functions for calculating the CreditMetrics risk model. By Andreas Wittmann.
DAAGxtras Data sets and functions additional to
DAAG, used in additional exercises for the
book “Data Analysis and Graphics Using R”
by J. H. Maindonald and W. J. Braun (2007, 2nd
edn.), and for laboratory exercises prepared for
a ‘Data Mining’ course. By John Maindonald.
GenABEL Genome-wide association analysis between quantitative or binary traits and SNPs.
By Yurii Aulchenko.
GeoXp Interactive exploratory spatial data analysis. A tool for researchers in spatial statistics, spatial econometrics, geography, ecology, etc., allowing users to dynamically link statistical plots with elementary maps. By Christine Thomas-Agnan, Yves Aragon, Anne Ruiz-Gazen, Thibault Laurent, and Laurianne Robidou.
HydroMe Estimation of the parameters in infiltration and water retention models by a curve-fitting method. The models considered are
those that are commonly used in soil science.
By Christian Thine Omuto.
IPSUR Data sets and functions accompanying the
book “Introduction to Probability and Statistics
Using R” by G. Andy Chang and G. Jay Kerns
(in progress). By G. Jay Kerns with contributions by Theophilius Boye, adapted from the
work of John Fox et al.
InfNet Simulation of epidemics in a network of contacts. The simulations consider SIR epidemics
with events in continuous time (exponential
inter-event times). It can consider a structure
of local networks and has an option to visualize it with the animator called SoNIA (http:
//www.stanford.edu/group/sonia/). By Lilia
Ramirez Ramirez and Mary Thompson.
JointModeling Joint modeling of mean and dispersion through two interlinked GLMs or GAMs.
By Mathieu Ribatet and Bertrand Iooss.
NMMAPSlite NMMAPS Data Lite. Provides remote access to daily mortality, weather, and
air pollution data from the National Morbidity, Mortality, and Air Pollution Study for 108
U.S. cities (1987–2000); data are obtained from
the Internet-based Health and Air Pollution
Surveillance System (iHAPSS). By Roger D.
Peng.
NRAIA Data sets from the book “Nonlinear Regression Analysis and Its Applications” by D. M.
Bates and D. G. Watts (1998), John Wiley and
Sons. By Douglas Bates.
PKtools Unified computational interfaces for pop
PK. By M. Suzette Blanchard.
PhySim Phylogenetic tree simulation based on a birth-death model. Functions are provided to model a lag-time to speciation and extract sister species ages. By Jason T. Weir and Dolph Schluter.
PresenceAbsence Presence-absence model evaluation. Includes functions for calculating threshold dependent measures such as confusion matrices, pcc, sensitivity, specificity, and Kappa,
and produces plots of each measure as the
threshold is varied. Can also optimize threshold choice, and plot the threshold independent
ROC curves along with the associated AUC By
Elizabeth Freeman.
RJDBC Provides access to databases through the JDBC interface, implementing R's DBI interface using JDBC as a back-end. This allows R to connect to any DBMS that has a JDBC driver; a brief usage sketch appears after this list. By Simon Urbanek.
RLadyBug Analysis of small scale infectious disease data using stochastic Susceptible-Exposed-Infectious-Recovered (SEIR) models. By Michael Hoehle and Ulrike Feldmann.
RaschSampler A Rasch sampler for sampling binary matrices with fixed margins. By Reinhold
Hatzinger, Patrick Mair, and Norman Verhelst.
Rcmdr.HH Rcmdr support for the introductory
course at Temple University, which spends
time on several topics that are not yet in the
R Commander. By Richard M. Heiberger, with
contributions from Burt Holland.
Rserve A binary R server which acts as a socket
server (TCP/IP or local sockets) allowing binary requests to be sent to R. Every connection
has a separate workspace and working directory. Client-side implementations are available
for popular languages such as C/C++ and Java,
allowing any application to use facilities of R
without the need of linking to R code. Supports
remote connection, user authentication and file
transfer. A simple R client is included as well.
By Simon Urbanek.
SLmisc Miscellaneous functions for analysis of
gene expression data at SIRS-Lab GmbH. By
Matthias Kohl.
SMPracticals Data sets and a few functions for use
with the practicals outlined in Appendix A
of the book “Statistical Models” by A. Davison (2003), Cambridge University Press. The
practicals themselves can be found at http://
statwww.epfl.ch/davison/SM/. By Anthony
Davison.
SNPmaxsel Asymptotic methods related to maximally selected statistics, with applications to
SNP data. By Anne-Laure Boulesteix.
Snowball Snowball stemmers. By Kurt Hornik.
SpherWave Spherical wavelets and SW-based spatially adaptive methods. Carries out spherical
wavelet transform developed by Li (1999) and Oh
(1999), and implements wavelet thresholding
approaches proposed by Oh and Li (2004). By
Hee-Seok Oh and Donghoh Kim.
Synth Causal inference using the synthetic control group method of Abadie and Gardeazabal (2003) and Abadie, Diamond, Hainmueller
(2007). By Alexis Diamond and Jens Hainmueller.
TRAMPR Matching of terminal restriction fragment length polymorphism (TRFLP) profiles
between unknown samples and a database of
knowns. Facilitates the analysis of many unknown profiles at once, and provides tools for
working directly with electrophoresis output
through to generating summaries suitable for
community analyses with R’s rich set of statistical functions. Also resolves the issues of multiple TRFLP profiles within a species, and shared
TRFLP profiles across species. By Rich FitzJohn
and Ian Dickie.
VGAM Vector generalized linear and additive models, and associated models (Reduced-Rank
VGLMs, Quadratic RR-VGLMs, Reduced-Rank
VGAMs). Fits many models and distributions
by maximum likelihood estimation (MLE) or
penalized MLE. Also fits constrained ordination models in ecology. By Thomas W. Yee.
WWGbook Functions and data sets for the book
“Linear Mixed Models: A Practical Guide Using Statistical Software” by B. T. West, K. B.
Welch, and A. T. Galecki (2006), Chapman & Hall/CRC Press. By Shu Chen and Andrzej
Galecki.
WeedMap Map of weed intensity. Compute spatial
predictions from exact count and a covariate.
By Gilles Guillot.
ZIGP Adapt and analyze Zero Inflated Generalized
Poisson (ZIGP) regression models. By Vinzenz
Erhardt.
adimpro Adaptive smoothing of digital images
via the Propagation Separation approach by
Polzehl and Spokoiny (2006). By Karsten Tabelow and Joerg Polzehl.
agricolae Statistical Procedures for Agricultural Research. These functions are currently utilized
by the International Potato Center Research
(CIP), the Statistics and Informatics Instructors and the Students of the Universidad Nacional Agraria La Molina Peru, and the Specialized Master in “Bosques y Gestión de Recursos Forestales” (Forest Resource Management).
Contains functionality for the statistical analysis of experimental designs applied especially to field experiments in agriculture and plant
breeding. By Felipe de Mendiburu.
analogue Analogue methods for palaeoecology. Fits
modern analogue technique transfer function
models for prediction of environmental data
from species data. Also performs analogue
matching, a related technique used in palaeoecological restoration. By Gavin L. Simpson.
arm Data analysis using regression and multilevel/hierarchical models. By Andrew Gelman, Yu-Sung Su, Jennifer Hill, Maria Grazia
Pittau, Jouni Kerman, and Tian Zheng.
arrayMissPattern Exploratory analysis of missing
patterns for microarray data. By Eun-kyung
Lee and Taesung Park.
bbmle Modifications and extensions of the stats4
mle code. By Ben Bolker, modifying code by
Peter Dalgaard and other R-core members.

classifly Exploration of classification models in high
dimensions, implementing methods for understanding the division of space between groups.
See http://had.co.nz/classifly for more
details. By Hadley Wickham.
clustTool GUI for clustering data with spatial information. By Matthias Templ.
clusterGeneration Random cluster generation (with
specified degree of separation). By Weiliang
Qiu and Harry Joe.
clusterSim Searching for an optimal clustering procedure for a data set. By Marek Walesiak and Andrzej Dudek.
cobs99 Constrained B-splines – outdated 1999 version. By Pin T. Ng and Xuming He; R port by
Martin Maechler.
corrgram Plot correlograms. By Kevin Wright.
distrDoc Documentation for packages distr, distrEx, distrSim, and distrTEst. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel,
and Thomas Stabla.
drm Likelihood-based marginal regression and association modeling for repeated, or otherwise
clustered, categorical responses using dependence ratio as a measure of the association. By
Jukka Jokinen.
elrm Exact Logistic Regression via MCMC. By
David Zamar, Jinko Graham, and Brad McNeney.
emdbook Data sets and auxiliary functions for the
book “Ecological Models and Data” by B.
Bolker (in progress). By Ben Bolker.
epicalc Epidemiological calculator. By Virasakdi Chongsuvivatwong.
fEcofin Selected economic and financial data for
teaching financial engineering and computational finance. By Diethelm Wuertz and many
others.

bcp Bayesian Change Point. An implementation of
the Barry and Hartigan (1993) product partition model for the standard change point problem using Markov Chain Monte Carlo. By
Chandra Erdman.

fame Interface for the FAME time series database.
Includes FAME storage and retrieval function,
as well as functions and S3 classes for time indexes and time indexed series, which are compatible with FAME frequencies. By Jeff Hallman.

binGroup Evaluation and experimental design
for binomial group testing.
By Frank
Schaarschmidt.

filehashSQLite Simple key-value database using
SQLite as the backend. By Roger D. Peng.

cgh Analysis of microarray comparative genome
hybridization data using the Smith-Waterman
algorithm. By Tom Price.
gamlss.dist Extra distributions for GAMLSS modeling. By Mikis Stasinopoulos and Bob Rigby
with contributions from Calliope Akantziliotou
and Raydonal Ospina.
glmc Fits generalized linear models where the parameters are subject to linear constraints. The
model is specified by giving a symbolic description of the linear predictor, a description
of the error distribution, and a matrix of constraints on the parameters. By Sanjay Chaudhuri, Mark S. Handcock, and Michael S. Rendall.
heplots Visualizing Tests in Multivariate Linear
Models.
Represents sums-of-squares-andproducts matrices for linear hypotheses and for
error using ellipses (in two dimensions) and ellipsoids (in three dimensions). By John Fox,
Michael Friendly, and Georges Monette.
homtest A collection of homogeneity tests and other
stuff for hydrological applications. By Alberto
Viglione.
ibdreg A method to test genetic linkage with covariates by regression methods with response IBD
sharing for relative pairs. Accounts for correlations of IBD statistics and covariates for relative
pairs within the same pedigree. By Jason P. Sinnwell and Daniel J. Schaid.
kernelPop Spatially explicit population genetic simulations. Creates an individual-based population genetic simulation in discrete time, where
individuals have spatial coordinates, with dispersal governed by mixtures of Weibull and
normal pdfs. Simulates diploid and haploid
inheritance. Allows the development of null
distributions of genotypes for complex demographic scenarios. By Allan Strand and James
Niehaus.
klin Implements efficient ways to evaluate and
solve equations of the form Ax = b, where A is
a Kronecker product of matrices. Also includes
functions to solve least squares problems of this
type. By Tamas K Papp.
knnflex A more flexible KNN implementation
which allows the specification of the distance
used to calculate nearest neighbors (euclidean,
binary, etc.), the aggregation method used
to summarize response (majority class, mean,
etc.) and the method of handling ties (all,
random selection, etc.). Additionally, continuous responses are handled. By Atina Dunlap
Brooks.

mFilter Several time series filters useful for smoothing and extracting trend and cyclical components of a time series. Currently, Christiano-Fitzgerald, Baxter-King, Hodrick-Prescott, Butterworth, and trigonometric regression filters
are included in the package. By Mehmet Balcilar.
meboot Maximum entropy density based dependent data bootstrap for time series. An algorithm is provided to create a population of time
series (ensemble) without assuming stationarity. By Hrishikesh D. Vinod and Javier Lópezde-Lacalle.
memisc Miscellaneous tools for data management,
including tools for preparing (especially social science) survey data, conducting simulation studies, and presentation of results of statistical analyses. By Martin Elff.
mixstock Functions for mixed stock analysis. By
Ben Bolker.
mixtools Tools for analyzing mixture models. By
Ryan Elmore, Tom Hettmansperger, David
Hunter, Hoben Thomas, Fengjuan Xuan, and
Derek Young.
mlCopulaSelection Use numerical maximum likelihood to choose and fit a bivariate copula model
(from a library of 40 models) to the data. By Jesus Garcia and Veronica Gonzalez-Lopez.
msgcop Semiparametric Bayesian Gaussian copula
estimation, treating the univariate marginal
distributions as nuisance parameters in Hoff
(2006). Also provides a semiparametric imputation procedure for missing multivariate data.
By Peter Hoff.
muS2RC S-PLUS to R compatibility for package
muStat. By Knut M. Wittkowski and Tingting
Song.
muStat Prentice Rank Sum Test and McNemar Test.
Performs Wilcoxon rank sum test, Kruskal rank
sum test, Friedman rank sum test and McNemar test. By Knut M. Wittkowski and Tingting
Song.
muUtil A collection of utility functions for package
muStat. By Knut M. Wittkowski and Tingting
Song.

languageR Data sets and functions for the book
“Analyzing Linguistic Data: A practical introduction to Statistics” by R. H. Baayen (2006),
Cambridge University Press. By R. H. Baayen.

np Nonparametric kernel smoothing methods for
mixed data types. By Tristen Hayfield and Jeffrey S. Racine.

lss Fits an accelerated failure time model to right-censored data based on the least-squares principle. By Lin Huang and Zhezhen Jin.

nsRFA Tools for objective (non-supervised) applications of the Regional Frequency Analysis methods in hydrology. By Alberto Viglione.

pARtial (Partial) attributable risk estimates, corresponding variance estimates and confidence
intervals. By Andrea Lehnert-Batar.

sciplot Scientific graphs with error bars for data collected from one-way or higher factorial designs. By Manuel Morales.

pmml Generate PMML for various models. The
Predictive Model Markup Language (PMML) is a language for representing models in an application-independent way. Such
models can then be loaded into other applications supporting PMML, such as Teradata
Warehouse Miner and IBM’s DB2. By Graham
Williams.

signal A set of generally Matlab/Octave-compatible
signal processing functions. Includes filter generation utilities, filtering functions, resampling
routines, and visualization of filter models. By
Tom Short.

popbio Construct and analyze projection matrix
models from a demography study of marked
individuals classified by age or stage. Covers
methods described in the books “Matrix Population Models” by Caswell (2001) and “Quantitative Conservation Biology” by Morris and
Doak (2002). By Chris Stubben and Brook Milligan.

smoothtail Smooth Estimation of the Generalized
Pareto distribution shape parameter. By Kaspar Rufibach and Samuel Mueller.
snp.plotter Creates plots of p-values using single
SNP and/or haplotype data, with options to
display a linkage disequilibrium (LD) plot and
the ability to plot multiple data sets simultaneously. By Augustin Luna and Kristin K.
Nicodemus.

prim Patient Rule Induction Method (PRIM) for
estimating highest density regions in highdimensional data. By Tarn Duong.

spsurvey Algorithms required for design and analysis of probability surveys such as those utilized by the U.S. Environmental Protection
Agency’s Environmental Monitoring and Assessment Program (EMAP). By Tom Kincaid
and Tony Olsen, with contributions from Don
Stevens, Christian Platt, Denis White, and
Richard Remington.

qgen Analysis of quantitative genetic data, especially helpful to perform parametric resampling of quantitative genetic data sets. By
Thomas Fabbro.

staRt Classical inference with the TI-83 Plus: a library for easily using the inferential statistics techniques available on the TI-83 Plus graphing scientific calculator. By Fabio Frascati.

rcdk Interface to the CDK libraries, a Java framework for cheminformatics. Allows the user to load
molecules, evaluate fingerprints, calculate
molecular descriptors and so on. By Rajarshi
Guha.

stashR A set of tools for administering shared repositories. By Sandy Eckel and Roger D. Peng.

rcompgen Generates potential completions for a
given command snippet based on circumstantial evidence. By Deepayan Sarkar.

stream.net Functions with example data for creating, importing, attributing, analyzing, and displaying stream networks represented as binary trees.
Capabilities include upstream
and downstream distance matrices, stochastic
network generation, segmentation of network
into reaches, adding attributes to reaches with
specified statistical distributions, interpolating
reach attributes from sparse data, analyzing
auto-correlation of reach attributes, and creating maps with legends of attribute data. Target applications include dynamic fish modeling. By Denis White.

portfolioSim Framework for simulating equity
portfolio strategies. By Jeff Enos and David
Kane, with contributions from Kyle Campbell.

rcompletion Pseudo-intelligent TAB completion for
R when run from a terminal, using the GNU
Readline library. By Deepayan Sarkar.
relaxo Relaxed Lasso, a generalization of the Lasso
shrinkage technique for linear regression. Both variable selection and parameter estimation are achieved by regular Lasso, but the two steps need not use the same penalty parameter. By Nicolai Meinshausen.
rggm Fitting Gaussian Graphical Models with robustified methods. By Masashi Miyamura.
rjacobi Compute Jacobi polynomials and GaussJacobi quadrature related operations. By Paulo
Jabardo.
stochmod Stochastic Modeling: learning and inference algorithms for a variety of probabilistic
models. By Artem Sokolov.

svcm Fit 2-d and 3-d space-varying coefficient models to regular grid data using either a full B-spline tensor product approach or a sequential approximation. By Susanne Heim, with
support from Paul Eilers, Thomas Kneib, and
Michael Kobl.
tdm A tool for therapeutic drug monitoring. Can estimate individual pharmacokinetic parameters
with one or more drug serum/plasma concentrations obtained from a single subject or multiple subjects, calculate a suggested dose with
the target drug concentration or calculate a predicted drug concentration with a given dose.
By Miou-Ting Chen, Yung-Jin Lee.
titan GUI to analyze mass spectrometric data on the
relative abundance of two substances from a
titration series. By Tom Price.

wombsoft Wombling Computation. Analyses individually geo-referenced multilocus genotypes
for the inference of genetic boundaries between populations. Based on the wombling
method which estimates the systemic function
by looking for the local variation of the allele
frequencies. By Ameline Crida.
yaImpute Perform k-NN imputation. By Nicholas
L. Crookston and Andrew O. Finley.
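
As a concrete illustration of the RJDBC entry above, the following sketch connects to a database through the DBI interface. It assumes a Java runtime and the rJava package are available; the driver class, JAR path, connection URL, credentials, and table name shown here are hypothetical placeholders to be replaced by the values documented for your DBMS's JDBC driver.

library(DBI)
library(RJDBC)

## Hypothetical driver class and JAR location -- substitute the values
## shipped with your database's JDBC driver.
drv <- JDBC(driverClass = "org.some.db.Driver",
            classPath = "/path/to/jdbc-driver.jar")

## Hypothetical connection URL and credentials.
con <- dbConnect(drv, "jdbc:somedb://dbhost:1234/testdb",
                 user = "analyst", password = "secret")

dbListTables(con)                      # tables visible to this login
dbGetQuery(con, "SELECT COUNT(*) FROM some_table")
dbDisconnect(con)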

Other changes

tkrgl TK widget tools for package rgl. By Duncan
Murdoch and Ming Chen.

• Package micEcdat was moved from the main
CRAN section to the Archive.

tm A framework for text mining applications within
R. By Ingo Feinerer.

• Package SAGx was removed from CRAN.

trip Access and manipulate spatial data for animal
tracking. By Michael D. Sumner.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

R Foundation News
by Kurt Hornik

Donations and new members

Donations
Peter M Maksymuk (USA)

New benefactors
Schröder Investment Management Ltd., London, GB
Prediction Company, Santa Fe, New Mexico, USA

New supporting institutions
Ef-prime Inc., Tokyo, Japan

New supporting members
David Vonka (CZ)
Olivier Celhay (France)
Katarzyna Kopczewska (Poland)

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

R News Referees 2006
by Torsten Hothorn

R News articles are peer-reviewed by external and independent experts. The editorial board members would like to take the opportunity to thank all referees who read and commented on submitted manuscripts during the last year. Much of the quality of R News publications is due to their invaluable and timely service. Thank you!

• Murray Aitkin
• Doug Bates
• Axel Benner
• Adrian Bowman
• Patrick Burns
• Philip Dixon
• Romain Francois
• Thomas Gerds
• Jos Hageman
• Frank Harrell
• David James
• Sigbert Klinke
• Eric Lecoutre
• Fritz Leisch
• Martin Maechler
• Andrew Martin
• Georges Monette
• Martyn Plummer
• Matt Pocernich
• Christina Rabe
• Tony Rossini
• Peter Ruckdeschel
• Alec Stephenson
• Simon Urbanek
• Bill Venables
• Ron Wehrens
• Keith Worsley
• Achim Zeileis

Torsten Hothorn
Friedrich–Alexander–Universität Erlangen–Nürnberg
Germany
Torsten.Hothorn@R-project.org

Editor-in-Chief:
Torsten Hothorn
Institut für Statistik
Ludwigs–Maximilians–Universität München
Ludwigstraße 33, D-80539 München
Germany
Editorial Board:
John Fox and Vincent Carey.
Editor Programmer’s Niche:
Bill Venables
Editor Help Desk:
Uwe Ligges
Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing. Communications regarding this
publication should be addressed to the editors. All
articles are copyrighted by the respective authors.
Please send submissions to regular columns to the
respective column editor and all other submissions
to the editor-in-chief or another member of the editorial board. More detailed submission instructions
can be found on the R homepage.
R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/

ISSN 1609-3631


