R News: The Newsletter of the R Project, Volume 6/2, May 2006
Editorial
by Paul Murrell
Welcome to the first regular issue of R News for 2006.
This is a bumper issue, with thirteen contributed
articles. Many thanks to the contributors and review-
ers for all of the enthusiasm and effort that they put
into these articles.
This issue begins with some “methodological”
papers. Peter Ruckdeschel, Matthias Kohl, Thomas
Stabla, and Florian Camphausen introduce their
distr package, which provides a set of S4 classes and
methods for defining distributions of random vari-
ables. This is followed by an article from David Clif-
ford and Peter McCullagh on the many uses of the
regress package, then Lukasz Komsta describes his
outliers package, which contains general tests for
outlying observations.
The next set of articles involves applications of R
to particular settings. David Kane and Jeff Enos de-
scribe their portfolio package for working with eq-
uity portfolios. Roman Pahl, Andreas Ziegler, and
Inke König discuss the GroupSeq package for de-
signing clinical trials. Using Sweave in clinical prac-
tice is the topic of Sven Garbade and Peter Burgard’s
article, while M. Wangler, J. Beyersmann, and M.
Schumacher focus on length of hospital stay with the
changeLOS package.
Next, there are three graphics articles. Nitin Jain
and Gregory Warnes describe the “balloonplot”, a
tool for visualizing tabulated data. Jing Hua Zhao
explains how to produce pedigree plots in R, and Brian
Ripley and I describe some improvements in R’s font
support for PDF and PostScript graphics output.
SAS users struggling to make the transition to R
might enjoy Søren Højsgaard’s article on his doBy
package, and the contributed articles are rounded out
by two pieces from Robin Hankin: one on the onion
package for normed division algebras and one on
the ResistorArray package for analysing resistor net-
works.
David Meyer provides us with a review of the
book “R Graphics”.
This issue accompanies the release of R version
2.3.0. Changes in R itself and new CRAN packages
Contents of this issue:
  Editorial ... 1
  S4 Classes for Distributions ... 2
  The regress function ... 6
  Processing data for outliers ... 10
  Analysing equity portfolios in R ... 13
  GroupSeq: Designing clinical trials using group sequential designs ... 21
  Using R/Sweave in everyday clinical practice ... 26
  changeLOS: An R-package for change in length of hospital stay based on the Aalen-Johansen estimator ... 31
  BalloonPlot ... 35
  Drawing pedigree diagrams with R and graphviz ... 38
  Non-Standard Fonts in PostScript and PDF Graphics ... 41
  The doBy package ... 47
  Normed division algebras with R: Introducing the onion package ... 49
  Resistor networks R: Introducing the ResistorArray package ... 52
  The good, the bad, and the ugly—Review of Paul Murrell's new book: "R Graphics" ... 54
  Changes in R ... 54
  Changes on CRAN ... 64
  R Foundation News ... 70
  News from the Bioconductor Project ... 70
  Forthcoming Events: DSC 2007 ... 71
that have appeared since the release of R 2.2.0 are de-
scribed in the relevant regular sections. There is also
an update of individuals and institutions who are
providing support for the R Project via the R Foun-
dation and there is an announcement and call for
abstracts for the DSC 2007 conference to be held in
Auckland, New Zealand.
A new addition with this issue is “News from
the Bioconductor Project” by Seth Falcon, which de-
scribes important features of the 1.8 release of Bio-
conductor packages.
Finally, I would like to remind everyone of the
upcoming useR! 2006 conference. It has attracted
a huge number of presenters and promises to be a
great success. I hope to see you there!
Paul Murrell
The University of Auckland, New Zealand
paul@stat.auckland.ac.nz
S4 Classes for Distributions
by Peter Ruckdeschel, Matthias Kohl, Thomas Stabla, and
Florian Camphausen
distr is an R package that provides a conceptual treatment of random variables (r.v.'s) by means of S4 classes. A mother class Distribution is introduced with slots for a parameter and for methods r, d, p, and q, respectively for simulation and for evaluation of the density, c.d.f., and quantile function of the corresponding distribution. All distributions of
the base package are implemented as subclasses. By
means of these classes, we can automatically gener-
ate new objects of these classes for the laws of r.v.’s
under standard mathematical univariate transforma-
tions and under convolution of independent r.v.’s. In
the distrSim and distrTEst packages, we also pro-
vide classes for a standardized treatment of simula-
tions (also under contamination) and evaluations of
statistical procedures on such simulations.
Motivation
R contains powerful techniques for virtually any useful distribution, using the suggestive naming convention [prefix]<name> for functions, where [prefix] stands for r, d, p, or q, and <name> is the name of the distribution.
There are limitations of this concept: You can
only use distributions that are already implemented
in some library or for which you have provided an
implementation. In many natural settings you want
to formulate algorithms once for all distributions, so
you should be able to treat the actual distribution
<name> as if it were a variable.
You may of course paste together a prefix and
the value of <name> as a string and then use
eval(parse(....)). This is neither very elegant nor
flexible, however.
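For concreteness, a minimal sketch of this string-pasting approach (the distribution name and parameter values here are only placeholders):

dname <- "norm"
## build the call "rnorm(10, mean = 2, sd = 1.3)" as text and evaluate it
x <- eval(parse(text = paste("r", dname, "(10, mean = 2, sd = 1.3)", sep = "")))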
Instead, we would prefer to implement the algo-
rithm by passing an object of some distribution class
as an argument to a function. Even better, though,
we would use a generic function and let the S4-
dispatching mechanism decide what to do at run-
time. In particular, we would like to automatically generate the corresponding functions r, d, p, and q for the law of expressions like X+3Y for objects X and Y of class Distribution, or, more generally, of a transformation of X, Y under a function \(f : \mathbb{R}^2 \to \mathbb{R}\) which is already realized as a function in R.
This is possible with package distr. As an exam-
ple, try¹
> require("distr")
Loading required package: distr
[1] TRUE
> N <- Norm(mean=2,sd=1.3)
> P <- Pois(lambda=1.2)
> (Z <- 2*N+3+P)
Distribution Object of Class: AbscontDistribution
> plot(Z)
> p(Z)(0.4)
[1] 0.002415384
> q(Z)(0.3)
[1] 6.70507
> r(Z)(10)
[1] 11.072931 7.519611 10.567212 3.358912
[5] 7.955618 9.094524 5.293376 5.536541
[9] 9.358270 10.689527
Figure 1: density, c.d.f. and quantile function of Z
¹From version 1.7 on, additionally some startup message is shown after require(); confer Ruckdeschel et al. (2005, Section 5).
Comment:
N and P are assigned objects of classes "Norm" and "Pois", respectively, with corresponding parameters. The command Z <- 2*N+3+P generates a new distribution object and assigns it to Z, an object of class "AbscontDistribution". Identifying N, P, Z with r.v.'s distributed according to these distributions, \(\mathcal{L}(Z) = \mathcal{L}(2N+3+P)\). Hence, p(Z)(0.4) returns \(P(Z \le 0.4)\); q(Z)(0.3) returns the 30%-quantile of Z; r(Z)(10) generates 10 pseudo-random numbers distributed according to Z; and the plot command produces Figure 1.
Concept
Our class design is deliberately open so that our
classes may easily be extended by any volunteer in
the R community; loose ends are multivariate dis-
tributions, time series distributions, and conditional
distributions. As an exercise, the reader is encour-
aged to implement extreme value distributions from
package evd. The greatest effort will be the docu-
mentation.
As to the naming of the slots, our goal was to preserve naming and notation from the base package as far as possible so that any programmer familiar with S could quickly use our distr package. In addition, as the distributions already implemented in R are all well tested and programmed, we use the existing r, d, p, and q functions wherever possible, simply wrapping them in small code snippets into our class hierarchy. All this should make intensive use of object
orientation in order to be able to use inheritance and
method overloading.
Contrary to the standard R-functions like rnorm,
we only permit length 1 for parameters like mean, be-
cause we see the objects as implementations of uni-
variate random variables, for which vector-valued
parameters make no sense; rather one could gather
several objects with possibly different parameters in
a vector or list of distributions. Of course, the origi-
nal functions rnorm, etc., remain unchanged and still
allow vector-valued parameters.
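As a small illustration of this convention (a sketch, assuming distr is loaded), several normal distributions with different means are collected in a list rather than created from a vector-valued mean:

## one distribution object per mean, instead of Norm(mean = c(-1, 0, 1))
Ns <- lapply(c(-1, 0, 1), function(m) Norm(mean = m, sd = 1))
## rnorm(3, mean = c(-1, 0, 1)) still works as usual, independently of distr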
Organization in classes
Loosely speaking, distr contains distribution classes
and corresponding methods, distrSim simulation
classes and corresponding methods, and distrTEst
an evaluation class and corresponding methods.
The latter two classes are to be considered only as
tools that allow a unified treatment of simulations
and evaluation of statistical estimation (perhaps also
later, tests and predictions) under varying simulation
situations.
Distribution classes
The Distribution class and its descendants imple-
ment the concept of a random variable/distribution
in R. They all have a param slot for a parameter, an
img slot for the range of the corresponding r.v., and
r, d, p, and q slots.
Subclasses
At present, the distr package is limited to univari-
ate distributions; these are derived from the subclass
UnivariateDistribution, and as typical subclasses,
we introduce classes for absolutely continuous (a.c.)
and discrete distributions — AbscontDistribution
and DiscreteDistribution. The latter has an
extra slot support, a vector containing the sup-
port of the distribution, which is truncated to the
lower/upper TruncQuantile in case of an infinite
support. TruncQuantile is a global option of distr
described in Ruckdeschel et al. (2005, Section 4).
As subclasses we have implemented all paramet-
ric families from the base package simply by pro-
viding wrappers to the original R-functions of form
[prefix]<name>. More precisely, as subclasses of
class AbscontDistribution, we have implemented
Beta, Cauchy, Chisq, Exp, Logis, Lnorm, Norm, Unif, and Weibull; to avoid name conflicts, the F-, t-, and Γ-distributions are implemented as Fd, Td, and Gammad. As subclasses of class DiscreteDistribution, we have implemented Binom, Dirac, Geom, Hyper, NBinom, and Pois.
Parameter classes
The param slot of a distribution class is of class
Parameter. With method liesIn, some checking is
possible; for details confer Ruckdeschel et al. (2005).
Simulation classes and evaluation class
Simulation classes and an evaluation class are pro-
vided in the distrSim and distrTEst packages, re-
spectively. As a unified class for both “real” and sim-
ulated data, DataClass serves as a common mother
class for both types of data. For simulations, we
gather all relevant information in the derived classes
Simulation and, for simulations under contamina-
tion, ContSimulation.
When investigating properties of a procedure by
means of simulations, one typically evaluates this
procedure on a lot of simulation runs, giving a re-
sult for each run. These results are typically worth
storing. To organize all relevant information about
these results, we introduce an Evaluation class.
For details concerning these groups of classes,
consult Ruckdeschel et al. (2005).
Methods
We have made available quite general arithmetical
operations for our distribution objects, generating
new image distributions automatically. These arith-
metics operate on the corresponding r.v.’s and not
on the distributions. (For the latter, they would only
make sense in restricted cases like convex combina-
tions.)
Affine linear transformations
We have overloaded the operators "+","-","*",
and "/" such that affine linear transformations
that involve only single univariate r.v.’s are avail-
able; i.e., expressions like Y=(3*X+5)/4 are permitted for an object X of class AbscontDistribution or DiscreteDistribution (or some subclass), producing an object Y that is also of class AbscontDistribution or DiscreteDistribution (in general). Here the corresponding transformations of the d, p, and q functions are determined analytically.
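A minimal sketch of such a transformation (assuming distr is loaded):

X <- Norm(mean = 0, sd = 1)
Y <- (3*X + 5)/4   # again an object of class AbscontDistribution
p(Y)(1.25)         # P(Y <= 1.25); equals P(X <= 0) = 0.5 here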
Unary mathematical operations
The math group of unary mathematical operations
is also available for distribution classes; so expres-
sions like exp(sin(3*X+5)/4) are permitted. The
corresponding r-method consists in simply perform-
ing the transformation on the simulated values of X.
The corresponding (default-) d-, p- and q-functions
are obtained by simulation, using the technique de-
scribed in the following subsection.
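Continuing the sketch above, a math-group expression applied to X:

W <- exp(sin(3*X + 5)/4)
r(W)(5)            # simulate from W; d, p and q are filled in by simulation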
Construction of d, p, and q from r
For automatic generation of new distributions, in
particular those arising as image distributions un-
der transformations of correspondingly distributed
r.v.’s, we provide ad hoc methods that should be
overloaded by more exact ones wherever possible:
Function RtoDPQ (re)constructs the slots d, p, and q from slot r, generating a certain number K of random numbers according to the desired law, where K is controlled by a global option of the distr package, as described in Ruckdeschel et al. (2005, Sec-
tion 4). For a.c. distributions, a density estimator is
evaluated along this sample, the distribution func-
tion is estimated by the empirical c.d.f., and, finally,
the quantile function is produced by numerical in-
version. Of course the result is rather crude as it
relies only on the law of large numbers, but in this
manner all transformations within the math group
become available.
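The principle behind RtoDPQ can be sketched with base R tools (an illustration of the idea only, not the package's code):

K <- 10^5
x <- rnorm(K)                      # stands in for a sample drawn via the r slot
d.new <- approxfun(density(x))     # density estimator evaluated along the sample
p.new <- ecdf(x)                   # empirical c.d.f.
q.new <- function(p) quantile(x, p, names = FALSE)  # numerical inversion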
Convolution
A convolution method for two independent r.v.’s
is implemented by means of explicit calculations
for discrete summands, and by means of the FFT²
if one of the summands is a.c. This method au-
tomatically generates the law of the sum of two
independent variables Xand Yof any univariate
distributions — or, in S4-jargon, the addition op-
erator "+" is overloaded for two objects of class
UnivariateDistribution and corresponding sub-
classes.
Overloaded generic functions
Methods print, plot, and summary have been overloaded for our new classes Distribution (package distr), DataClass, Simulation, ContSimulation (package distrSim), as well as EvaluationClass (package distrTEst), to produce "pretty" output. For Distribution, plot displays the density/probability function, the c.d.f. and the quantile function of a distribution. For objects of class DataClass — or of a corresponding subclass — plot graphs the sample against the run index, and in the case of ContSimulation, the contaminating variables are highlighted by a different color. For an object of class EvaluationClass, plot yields a boxplot of the results of the evaluation.
Simulation
For the classes Simulation and ContSimulation in
the distrSim package, we normally will not save the
current values of the simulation, so when declaring a
new object of either of the two classes, the slot Data
will be empty (NULL). We may fill the slot by an ap-
plication of the method simulate to the object. This
has to be redone whenever another slot of the object
is changed. To guarantee reproducibility, we use the
slot seed. This slot is controlled and set through Paul
Gilbert’s setRNG package. For details confer Ruck-
deschel et al. (2005).
Options
Contrary to the generally preferable functional style
in S, we have chosen to set a few “constants” glob-
ally in order to retain binary notation for operators
like "+". Analogously to the options command in R,
you can specify a number of these global constants.
We only list them here; for details consult Ruckde-
schel et al. (2005) or see ?distroptions. Distribution
options include
DefaultNrFFTGridPointsExponent
DefaultNrGridPoints
DistrResolution
RtoDPQ.e
TruncQuantile

²Details are to be found in Kohl et al. (2005).
These options may be inspected/modified by means
of distroptions() and getdistrOption() which
are defined just in the same way as options() and
getOption().
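For example (a small sketch; the value set below is arbitrary):

getdistrOption("TruncQuantile")      # inspect the current value
distroptions(TruncQuantile = 1e-6)   # modify it, in analogy to options()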
System/version requirements etc.
As our packages are completely written in R, there
are no dependencies on the underlying OS. After some reorganization, the three packages distr, distrSim, and distrTEst, as described in this paper, are only available from R version 2.2.0 on, but there are precursors of our packages running under older versions of R; see Ruckdeschel et al. (2005). For control of the seed of the random number generator in our simulation classes in package distrSim, we use Paul Gilbert's package setRNG, and for our startup messages we need package startupmsg; both packages need to be installed from CRAN. Packages distr, distrSim, and distrTEst are distributed under the terms of the GNU GPL Version 2, June 1991 (see
http://www.gnu.org/copyleft/gpl.html).
Examples
In the demo folder of the distr package, the interested
reader will find the sources for some worked out
examples; we confine ourselves to discussing only
one (the shortest) of them — NormApprox.R: In ba-
sic statistics courses, the 12-fold convolution of Unif(0, 1)
variables is a standard illustration for the CLT. Also,
this is an opaque generator of N(0, 1)–variables; see
Rice (1988, Example C, p. 163).
> N <- Norm(0,1) # a standard normal
> U <- Unif(0,1)
> U2 <- U + U
> U4 <- U2 + U2
> U8 <- U4 + U4
> U12 <- U4 + U8
> NormApprox <- U12 - 6
In the 3rd to 6th line, we try to minimize the num-
ber of calls to FFT for reasons of both time and ac-
curacy; in the end, NormApprox contains the centered
and standardized 12-fold convolution of Unif(0, 1).
Keep in mind that instead of U <- Unif(0,1), we might have assigned any distribution to U without
having to change the subsequent code — except for
a different centering/standardization. For a more so-
phisticated use of FFT for convolution powers see
nFoldConvolution.R in the demo folder.
Next we explore the approximation quality of
NormApprox for N(0, 1) graphically (see Figure 2):
> par(mfrow = c(2,1))
> x <- seq(-4,4,0.001)
> plot(x, d(NormApprox)(x), type = "l",
+ xlab = "", ylab = "Density",
+ main = "Exact and approximated density")
> lines(x, d(N)(x), col = "red")
> legend(-4, d(N)(0),
+ legend = c("NormApprox",
+ "Norm(0,1)"), fill = c("black", "red"))
> plot(x, d(NormApprox)(x) - d(N)(x),
+ type = "l", xlab = "",
+ ylab = "\"black\" - \"red\"",
+ col = "darkgreen", main = "Error")
> lines(c(-4,4), c(0,0))
Figure 2: densities of N(0, 1) and NormApprox and their difference
Details/odds and ends
For further details of the implementation, see Ruck-
deschel et al. (2005). In particular, we recommend
Thomas Stabla’s utility standardMethods for the au-
tomatic generation of S4-accessor and -replacement
functions. For more details, see ?standardMethods.
For odds and ends, consult the web-page for our
packages,
http://www.uni-bayreuth.de/departments/math/org/mathe7/DISTR/.
Acknowledgement
We thank Martin Mächler and Josef Leydold for their
helpful suggestions in conceiving the package. John
Chambers also gave several helpful hints and in-
sights. We got stimulating replies to an RFC on
r-devel by Duncan Murdoch and Gregory Warnes.
We also thank Paul Gilbert for drawing our attention
to his package setRNG and making it available in a
stand-alone version.
Bibliography
J. M. Chambers. Programming with Data. A
guide to the S language. Springer, 1998. URL
http://cm.bell-labs.com/cm/ms/departments/
sia/Sbook/.
M. Kohl, P. Ruckdeschel, and T. Stabla. Gen-
eral Purpose Convolution Algorithm for
Distributions in S4-Classes by means of
FFT. Technical Report. Also available under
http://www.uni-bayreuth.de/departments/
math/org/mathe7/RUCKDESCHEL/pubs/comp.pdf,
Feb. 2005.
J. A. Rice. Mathematical statistics and data analysis.
Wadsworth & Brooks/Cole Advanced Books &
Software, Pacific Grove, California, 1988.
P. Ruckdeschel, M. Kohl, T. Stabla, and F. Cam-
phausen. S4 Classes for Distributions — a manual for packages distr, distrSim, distrTEst, version 1.7, and distrEx, version 0.4-3, May 2006.
http://www.uni-bayreuth.de/departments/
math/org/mathe7/DISTR.
Peter Ruckdeschel
Matthias Kohl
Thomas Stabla
Florian Camphausen
Mathematisches Institut
Universität Bayreuth
D-95440 Bayreuth
Germany
peter.ruckdeschel@uni-bayreuth.de
The regress function
An R function that uses the Newton-Raphson algorithm for fitting certain doubly linear Gaussian
models.
by David Clifford and Peter McCullagh
Introduction
The purpose of this article is to highlight the
many uses of the regress function contained in the
regress package. The function can be used to fit lin-
ear Gaussian models in which the mean is a linear
combination of given covariates, and the covariance
is a linear combination of given matrices. A Newton-
Raphson algorithm is used to maximize the residual
log likelihood with respect to the variance compo-
nents. The regression coefficients are subsequently
obtained by weighted least squares, and a further
matrix computation gives the best linear predictor
for the response on a further out-of-sample unit.
Many Gaussian models have a covariance struc-
ture that can be written as a linear combination of
matrices, for example random effects models, poly-
nomial spline models and some multivariate mod-
els. However it was our research on the nature of
spatial variation in crop yields, McCullagh and Clif-
ford (2006), that motivated the development of this
function.
We begin this paper with a review of the kinds of
spatial models we wish to fit and explain why a new
function is needed to achieve this. Following this
we discuss other examples that illustrate the broad
range of uses for this function. Uses include basic
random effects models, multivariate linear models
and examples that include prediction and smooth-
ing. The techniques used in these examples can also
be applied to growth curves.
Spatial Models
McCullagh and Clifford (2006) model spatial depen-
dence of crop yields as a planar Gaussian process in
which the covariance function at points z, z' in the plane has the form

\[ \sigma_0^2\,\delta_{z-z'} + \sigma_1^2\,K(|z - z'|) \tag{1} \]

with non-negative coefficients \(\sigma_0^2\) and \(\sigma_1^2\). In the
first term, Dirac’s delta is the covariance function for
white noise, and is independent on non-overlapping
sets. The second term is a stationary isotropic covari-
ance function from the Matérn class or power class.
McCullagh and Clifford (2006) proposed that the
spatial correlation for any crop yield is well de-
scribed by a convex combination of the Dirac func-
tion and the logarithmic function associated with
the de Wijs process. The de Wijs process (de Wijs,
1951, 1953) is a special case of both the Matérn and
power classes and is invariant to conformal transfor-
mation. The linear combination of white noise and
the de Wijs process is called the conformal model.
We examined many examples from a wide variety
of crops worldwide, comparing the conformal model
against the more general Matérn class. Apart from
a few small-scale anisotropies, little evidence was
found of systematic departures from the conformal
model. Stein (1999) is a good reference for more de-
tails on spatial models.
It is important to point out that the de Wijs pro-
cess and certain other boundary cases of the Matérn
family are instances of generalized covariance func-
tions corresponding to intrinsic processes. The fitting
of such models leads to the use of generalized co-
variance matrices that are positive definite on certain
contrasts. Such matrices usually have one or more
negative eigenvalues, and as such the models cannot
be fitted using standard functions in R.
Clifford (2005) shows how to compute the gener-
alized covariance matrices when yields are recorded
on rectangular plots laid out on a grid. For the
de Wijs process this is an analytical calculation and
the spatialCovariance package in R can be used to
evaluate the covariance matrices.
In order to illustrate exactly what the spatial
model looks like let us consider the Fairfield Smith
wheat yield data (Fairfield Smith,1938). We wish to
fit the conformal model to this dataset and to include
row and column effects in our model of the mean.
Assume at this point that the following information
is loaded into R: the yield y and factors row and col
that indicate the position of each plot.
The plots are 6 inches by 12 inches in a regular
grid of 30 rows and 36 columns with no separations
between plots. Computing the covariance matrix in-
volves two steps. The first stage called precompute
is carried out once for each dataset. The second stage
called computeV is carried out for each member of the
Matérn family one wishes to fit to the model.
require("spatialCovariance")
foot <- 0.3048 ## convert from ft to m
info <- precompute(nrows=30,ncols=36,
rowwidth=0.5*foot,
colwidth=1*foot,
rowsep=0,colsep=0)
V <- computeV(info)
The model we wish now to fit to the response variable y, observed on plots of common area \(|A| = 0.0465\,\text{m}^2\), is

\[ y \sim \text{row} + \text{col} + \text{spatial error} \]

where the spatial error has generalized covariance structure

\[ \Sigma = \sigma_0^2\,|A|\,I + \sigma_1^2\,|A|^2\,V. \tag{2} \]
The fact that yield is an extensive variable means one goes from Equation 1 to Equation 2 by aggregating yield over plots, i.e. by integrating the covariance function over plots. We fit this model using the regress function. One can easily check that V has a negative eigenvalue and hence the model cannot be fitted using other packages such as lme, for example.
require("regress")
model1 <- regress(y~row+col,~V,identity=TRUE)
In order to determine the residual log likelihood relative to a white noise baseline, we need to fit the model in which \(\sigma_1^2 = 0\), i.e. \(\Sigma = \sigma_0^2\,|A|\,I\). The residual log likelihood is the difference in log likelihood between these two models.
> BL <- regress(y~row+col,~,identity=TRUE)
> model1$llik - BL$llik
[1] 12.02406
> summary(model1,fixed.effects=FALSE)
Maximised Resid. Log Likelihood is -4373.721
Linear Coefficients: not shown
Variance Coefficients:
Estimate Std. Error
identity 39243.69 2139.824
V 60491.21 19664.338
The summary of the conformal model shows that the estimated variance components are \(\hat\sigma_0^2|A| = 3.92 \times 10^4\) and \(\hat\sigma_1^2|A|^2 = 6.05 \times 10^4\). The standard errors associated with each parameter are based on the Fisher Information at the point of convergence.
Random Effects Models
Exchangeable Gaussian random effects models, also known as linear mixed effects models (Pinheiro and Bates, 2000), can also be fitted using regress. However, the code is not optimised to use the groupedData structure in the way lmer is, and regress cannot handle the large datasets that one can fit using lmer (Bates, 2005).
The syntax for basic models using regress is
straightforward. Consider the example in Venables
and Ripley (2002)[p. 286] which uses data from Yates
(1935). In addition to the treatment factors N and
V, corresponding to nitrogen levels and varieties, the
model has a block factor B and a random B:V inter-
action.
The linear model is written explicitly as

\[ y_{bnv} = \mu + \alpha_n + \beta_v + \eta_b + \zeta_{bv} + \epsilon_{bnv} \tag{3} \]

where the random effects are distributed as \(\eta_b \sim N(0, \sigma^2_B)\), \(\zeta_{bv} \sim N(0, \sigma^2_{B:V})\) and \(\epsilon_{bnv} \sim N(0, \sigma^2)\), all independent with independent components, Venables and Ripley (2002). The regress syntax mirrors how the model is defined in (3).
oats.reg <- regress(Y~1+N+V,~B+I(B:V),
identity=TRUE,data=Oats)
Similar syntax is used by the lmer function, but the residual log likelihoods reported by the two functions are not the same. The difference is attributable in part to the term \(+\frac{1}{2}\log|X^\top X|\), which is constant for comparison of two variance-component models, but is affected by the basis vectors used to define the subspace for the means. The value reported by regress is unaffected by the choice of basis vectors.
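For comparison, a sketch of a corresponding lmer call (assuming the same Oats data frame with factors B, V and N; as explained above, its reported REML criterion will differ from the regress value by a constant):

library(lme4)
oats.lmer <- lmer(Y ~ 1 + N + V + (1 | B) + (1 | B:V), data = Oats)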
While the regress function can be used for ba-
sic random effects models, the lmer function can be
used to fit models where the covariance structure
cannot be written as a linear combination of known
matrices. The regress function does not apply to
such models. An example of such a model is one that
includes a random intercept and slope for a covariate
for each level of a group factor. A change of scale for
the covariate would result in a different model unless
the intercept and slope are correlated. In lmer the
groupedData structure automatically includes a cor-
relation parameter in such a model. See Bates (2005)
for an example of such a model.
Multivariate Models Using regress
In this section we show how to fit a multivariate model. The example we choose to work with is a very basic multivariate example, but the method illustrates how more complex models can be fitted. Suppose that the observations are i.i.d. bivariate normal, i.e. \(Y_i \sim N(\mu, \Sigma)\) where

\[ \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} \quad \text{and} \quad \Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}. \]

The goal is to estimate the parameter \((\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \gamma = \rho\sigma_1\sigma_2)\).
Let Y denote a vector of length 2n created by concatenating the observations for each unit. The model for the data can be written as \(Y \sim N(X\mu,\, I_n \otimes \Sigma)\) where \(X = 1 \otimes I_2\) and 1 is a vector of ones of length n. Next we use the fact that one can write \(\Sigma\) as

\[ \Sigma = \sigma_1^2 \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} + \gamma \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} + \sigma_2^2 \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix} \]

to express the model as \(Y \sim N(X\mu, D)\) where

\[ D = \sigma_1^2\, I_n \otimes \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} + \gamma\, I_n \otimes \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} + \sigma_2^2\, I_n \otimes \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}. \]
The code to generate such data and to fit such a model is given below; note that \(A \otimes B\) = kronecker(A,B) in R.
library("regress")
library("MASS") ## needed for mvrnorm
n <- 100
mu <- c(1,2)
Sigma <- matrix(c(10,5,5,10),2,2)
Y <- mvrnorm(n,mu,Sigma)
## this simulates multivariate normal rvs
y <- as.vector(t(Y))
X <- kronecker(rep(1,n),diag(1,2))
V1 <- matrix(c(1,0,0,0),2,2)
V2 <- matrix(c(0,0,0,1),2,2)
V3 <- matrix(c(0,1,1,0),2,2)
sig1 <- kronecker(diag(1,n),V1)
sig2 <- kronecker(diag(1,n),V2)
gam <- kronecker(diag(1,n),V3)
reg.obj <- regress(y~X-1,~sig1+sig2+gam,
identity=FALSE,start=c(1,1,0.5))
The summary of this model shows that the estimated mean parameters are \(\hat\mu_1 = 1.46\) and \(\hat\mu_2 = 1.78\). The estimates for the variance components are \(\hat\sigma_1^2 = 9.25\), \(\hat\sigma_2^2 = 10.27\) and \(\hat\gamma = 3.96\). The true parameter is \((\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \gamma = \rho\sigma_1\sigma_2) = (1, 2, 10, 10, 5)\).
Maximised Residual Log Likelihood is -315.48
Linear Coefficients:
Estimate Std. Error
X1 1.461 0.304
X2 1.784 0.320
Variance Coefficients:
Estimate Std. Error
sig1 9.252 1.315
sig2 10.271 1.460
gam 3.963 1.058
A closed form solution exists for the estimate of \(\Sigma\) in this case, \(\hat\Sigma = \frac{1}{n-1}(Y - \bar Y)^\top (Y - \bar Y)\), where Y denotes the \(n \times 2\) matrix of observations. Our computed result can be checked using:
> Sig <- var(Y)
> round(Sig, 3)
[,1] [,2]
[1,] 9.252 3.963
[2,] 3.963 10.271
It may be of interest to fit the sub-model with equal variances, \(\sigma^2 = \sigma_1^2 = \sigma_2^2\), a case for which no closed form solution exists. This model can be fitted using the code shown below. The estimate for \(\sigma^2\) is \(\hat\sigma^2 = 9.76\) and we can see that the difference in residual log likelihood between the sub-model and the full model is only 0.16 units.
> sig <- sig1+sig2
> reg.obj.sub <- regress(y~X-1,~sig+gam,
+ identity=FALSE,start=c(1,.5))
> summary(reg.obj.sub)
Maximised Residual Log Likelihood is -315.65
Linear Coefficients:
Estimate Std. Error
X1 1.461 0.312
X2 1.784 0.312
Variance Coefficients:
Estimate Std. Error
sig 9.761 1.059
gam 3.963 1.059
The argument start allows one to specify the
starting point for the Newton-Raphson algorithm.
Another argument, pos, allows one to force certain parameters to be positive or negative; in this example we may only be interested in cases with negative correlation. By default the support for the variance
components is the real line.
Prediction and Smoothing
regress can also be used to implement smoothing
and best linear prediction, McCullagh (2005)[exam-
ple 5]. Stein (1999) notes that in the geostatistical lit-
erature best linear prediction is also known as krig-
ing and in effect
BLP(x) = E(Y(x)|data)
for an out-of-sample unit whose x-value is x.
Smoothing usually requires the choice of a bandwidth in each example (Efron, 2001), but bandwidth selection occurs automatically through REML esti-
mation of polynomial spline models. This is best
illustrated by the cubic spline generalized covari-
ance function (Wahba,1990;Green and Silverman,
1994). The cubic spline model has a two-dimensional
kernel spanned by the constant and linear functions
of x. Consequently the model matrix X must include 1+x, but could also include additional functions.
Figure 1: Smoothing spline and best linear predictor fitted by REML.
In the code below we simulate data as in Wahba
(1990)[p. 45], but after generating data for one hun-
dred points in the unit interval we take a sample of
these points to be used for fitting purposes. The goal
of fitting a smooth curve to the data is achieved by
the best linear predictor, which is a cubic spline. The
spline is computed at all x-values and has knots at
the observed values. This experiment can be carried
out in the following four steps.
## 1: Data simulation
n <- 101
x <- (0:100)/100
prb <- (1 + cos(2*pi*x))/2
## indicator which x-values are observed
obs <- (runif(n) < prb)
mu <- 1 + x + sin(3*log(x+0.1))/(2+x)
## full data
y <- mu + rnorm(n,0,0.1)
## 2. Fit the cubic spline to observed data
one <- rep(1,n)
d <- abs(x %*% t(one) - one %*% t(x))
V <- d^3
X <- model.matrix(y~1+x)
fit <- regress(y[obs]~X[obs,],~V[obs,obs],
identity=TRUE)
## 3. BLP at all x given the observed data
wlsfit <- X %*% fit$beta
blp <- wlsfit + fit$sigma[1] * V[,obs] %*%
  fit$W %*% (y - wlsfit)[obs]
## 4. Plot of results
plot(x[obs],y[obs], xlab="x",ylab="y")
lines(x,mu,lty=2)
lines(x,blp)
The results of this experiment can be seen in Fig-
ure 1. The points are the observed data, the solid line
is the best linear predictor and the dashed line shows
the function used to generate the data.
Bibliography
D. Bates. Fitting linear mixed models in R. R News, 5(1), May 2005.
D. Clifford. Computation of spatial covariance matri-
ces. Journal of Computational and Graphical Statistics,
14(1):155–167, 2005.
H. de Wijs. Statistics of ore distribution. Part I: fre-
quency distribution of assay values. Journal of the
Royal Netherlands Geological and Mining Society, 13:
365–375, 1951. New Series.
H. de Wijs. Statistics of ore distribution. Part II: The-
ory of binomial distribution applied to sampling
and engineering problems. Journal of the Royal
Netherlands Geological and Mining Society, 15:125–
24, 1953. New Series.
B. Efron. Selection criteria for scatterplot smoothers.
Annals of Statistics, 2001.
H. Fairfield Smith. An empirical law describing het-
erogeneity in the yields of agricultural crops. Jour-
nal of Agricultural Science, 28:1–23, January 1938.
P. Green and B. Silverman. Nonparametric Regression
and Generalized Linear Models. Chapman and Hall,
London, 1994.
P. McCullagh. Celebrating Statistics: Papers in hon-
our of Sir David Cox on his 80th birthday, chapter
Exchangeability and regression models, pages 89–
113. Oxford, 2005.
P. McCullagh and D. Clifford. Evidence for confor-
mal invariance of crop yields. Accepted at Proceed-
ings of the Royal Society A, 2006.
J. Pinheiro and D. Bates. Mixed-effects models in S and
S-PLUS. New York : Springer-Verlag, 2000.
M. Stein. Interpolation of Spatial Data: Some Theory for
Kriging. Springer, 1999.
W. Venables and B. Ripley. Modern Applied Statistics
with S. Springer-Verlag New York, Inc., 4th edition,
2002.
G. Wahba. Spline Models for Observational Data. SIAM,
Philadephia, 1990.
F. Yates. Complex experiments. Journal of the Royal
Statistical Society (Supplement), 2:181–247, 1935.
David Clifford, CSIRO
Peter McCullagh, University of Chicago
David.Clifford@csiro.au
pmcc@galton.uchicago.edu
Processing data for outliers
by Lukasz Komsta
The results obtained by repeating the same measure-
ment several times can be treated as a sample com-
ing from an infinite, most often normally distributed
population.
In many cases, for example quantitative chemi-
cal analysis, there is no possibility to repeat the measurement many times due to the very high cost of such validation. Therefore, all estimates and parameters of an
experiment must be obtained from a small sample.
Some repeats of an experiment can be biased by crude error, resulting in values which do not match the other data. Such values, called outliers, are very easy to identify in large samples. But in small samples, often of fewer than 10 values, identifying outliers is more difficult, yet even more important. A small sample contaminated with outlying values will result in estimates significantly different from the real parameters of the whole population (Barnett, 1994).
The problem of identifying outliers in small sam-
ples properly (and making a proper decision about
removing or leaving suspicious data) is very old and
the first papers discussing this problem were pub-
lished in the 1920s. The problem remained unresolved until the 1950s, due to the lack of computing technology to perform valid Monte-Carlo simulations. Al-
though current trends in outlier detection rely on ro-
bust estimates, the tests described below are still in
use in many cases (especially chemical analysis) due
to their simplicity.
Dixon test
All concepts of outlier analysis were collected by
Dixon (1950):
1. chi-squared scores of data
2. score of extreme value
3. ratio of range to standard deviation (two oppo-
site outliers)
4. ratio of variances without suspicious value(s)
and with them
5. ratio of ranges and subranges
The last concept seemed to the author to have the
best performance and Dixon introduced his famous
test in the next part of the paper. He defined several
coefficients which must be calculated on an ordered sample \(x_{(1)}, x_{(2)}, \dots, x_{(n)}\). The formulas below show these coefficients in two variants for each of them: when the suspicious value is the lowest and when it is the highest.
\[ r_{10} = \frac{x_{(2)} - x_{(1)}}{x_{(n)} - x_{(1)}}, \qquad \frac{x_{(n)} - x_{(n-1)}}{x_{(n)} - x_{(1)}} \tag{1} \]

\[ r_{11} = \frac{x_{(2)} - x_{(1)}}{x_{(n-1)} - x_{(1)}}, \qquad \frac{x_{(n)} - x_{(n-1)}}{x_{(n)} - x_{(2)}} \tag{2} \]

\[ r_{12} = \frac{x_{(2)} - x_{(1)}}{x_{(n-2)} - x_{(1)}}, \qquad \frac{x_{(n)} - x_{(n-1)}}{x_{(n)} - x_{(3)}} \tag{3} \]

\[ r_{20} = \frac{x_{(3)} - x_{(1)}}{x_{(n)} - x_{(1)}}, \qquad \frac{x_{(n)} - x_{(n-2)}}{x_{(n)} - x_{(1)}} \tag{4} \]

\[ r_{21} = \frac{x_{(3)} - x_{(1)}}{x_{(n-1)} - x_{(1)}}, \qquad \frac{x_{(n)} - x_{(n-2)}}{x_{(n)} - x_{(2)}} \tag{5} \]

\[ r_{22} = \frac{x_{(3)} - x_{(1)}}{x_{(n-2)} - x_{(1)}}, \qquad \frac{x_{(n)} - x_{(n-2)}}{x_{(n)} - x_{(3)}} \tag{6} \]
The critical values of the above statistics were not
given in the original paper, but only discussed. A
year later (Dixon, 1951) the next paper with critical
values appeared. Dixon discussed the distribution of
his variables in a very comprehensive way, and con-
cluded that a numerical procedure to obtain critical
values is available only for samples with n = 3. Quantiles for other sample sizes were obtained in this paper by simulation and tabulated.
Dixon did not publish the 0.975 quantile for his
test, which is needed to perform a two-sided test at the 95% confidence level. These values were obtained later by interpolation (Rorabacher, 1991) with some cor-
rections of typographical errors in the original Dixon
table.
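For illustration, the second variant of equation (1) can be transcribed directly (a hypothetical helper, not part of the outliers package):

r10.high <- function(x) {
  x <- sort(x)
  n <- length(x)
  (x[n] - x[n-1]) / (x[n] - x[1])
}
## applied to the six measurements of the Examples section below,
## this returns 0.574, the Q statistic reported there by dixon.test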
Grubbs test
Other criteria for outlying observations proposed by Dixon were discussed at the same time by Grubbs (1950). He proposed three variants of his test. For one outlier,

\[ G = \frac{|x_{\text{outlying}} - \bar{x}|}{s}, \qquad U = \frac{S_1^2}{S^2} \tag{7} \]
For two outliers on opposite tails,

\[ G = \frac{x_{\max} - x_{\min}}{s}, \qquad U = \frac{S_2^2}{S^2} \tag{8} \]

For two outliers on the same tail,

\[ U = G = \frac{S_2^2}{S^2} \tag{9} \]
where \(S_1^2\) is the variance of the sample with one suspicious value excluded, and \(S_2^2\) is the variance with two values excluded.
If the estimators in equation (7) are biased (with n in the denominator), then a simple relation holds between G and U:

\[ U = 1 - \frac{1}{n-1}\,G^2 \tag{10} \]

which makes the U and G statistics equivalent in their testing power.
The G distribution for the one-outlier test was discussed earlier (Pearson and Chekar, 1936) and the following formula can be used to approximate the critical value:

\[ G = t_{\alpha/n,\,n-2}\,\sqrt{\frac{n-1}{n-2+t^2_{\alpha/n,\,n-2}}} \tag{11} \]
When discussing the statistics in (8), Grubbs gives only G values, because U was too complicated to calculate. He writes ". . . the limits of integration do not turn
out to be functions of single variables and the task of com-
puting the resulting multiple integral may be rather diffi-
cult.”
The simple formula for approximating critical
values of the G statistic (range to standard deviation) was given by David, Hartley and Pearson (1954):

\[ G = \sqrt{\frac{2(n-1)\,t^2_{\alpha/[n(n-1)],\,n-2}}{n-2+t^2_{\alpha/[n(n-1)],\,n-2}}} \tag{12} \]
The ratio of variances used to test for two outliers on the same tail (9) cannot be obtained by numerical integration. Thus, critical values were simulated by Grubbs and must be approximated by interpolation from tabulated data.
Cochran test
The other test used to detect crude errors in an experiment is the Cochran test (Snedecor and Cochran, 1980). It is designed to detect an outlying (or inlying) variance in a group of datasets. It is based on a simple statistic: the ratio of the maximum (or minimum) variance to the sum of all variances:

\[ C = \frac{S^2_{\max}}{\sum S^2} \tag{13} \]
The critical value approximation can be obtained using the following formula:

\[ C = \frac{1}{1 + (k-1)\,F_{\alpha/k,\; n(k-1),\; n}} \tag{14} \]

where k is the number of groups, and n is the number of observations in each group. If the groups have different lengths, the arithmetic mean of the group sizes is used as n.
The R implementation
Package outliers, available now on CRAN, contains
a set of functions for performing these tests and cal-
culating their critical values. The most difficult problem to solve was devising a properly working algorithm for calculating quantiles and p-values of the tabulated distributions by interpolation.
After some experiments I decided to use regression on a 3rd degree orthogonal polynomial, fitted to the four points from the original Dixon or Grubbs table nearest to the argument. When an argument falls outside the tabulated range, the value is extrapolated by fitting a linear model to the first (or last) two known values. This method is implemented in the internal function qtable, used for calculation of the Dixon values and of the Grubbs values for two outliers on the same tail.
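The idea behind this interpolation can be sketched as follows (the tabulated numbers below are hypothetical, and this is not the internal qtable code):

alpha <- c(0.01, 0.02, 0.05, 0.10)      # four tabulated significance levels
crit  <- c(0.988, 0.976, 0.941, 0.886)  # corresponding tabulated critical values
fit   <- lm(crit ~ poly(alpha, 3))      # 3rd degree orthogonal polynomial
predict(fit, newdata = data.frame(alpha = 0.03))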
Figure 1: Interpolated quantile curves of the Dixon test for n = 3 (solid), n = 4 (dashed), n = 5 (dotted) and n = 6 (dotdashed).
The proposed algorithm provides a good continuity-like effect in the quantile calculation, as shown in Figures 1 and 3. Critical values can be obtained with qdixon, qgrubbs, and qcochran. The corresponding reverse routines (calculating p-values) are named pdixon, pgrubbs, and pcochran. The continuity effect of the reverse functions is depicted in Figure 2.
Figure 2: Interpolated p-value curve of the Dixon test for n = 4, with percentage points given in the original Dixon table.
The most important functions implemented in
the package are dixon.test, grubbs.test and cochran.test. According to the given parameters,
these functions can perform all of the tests men-
tioned above. Additionally, there is an easy way to
obtain different kinds of data scores, by the scores
function.
It is also possible to obtain the most suspicious
value (the largest difference from the mean) by the
outlier function. If this value is examined and proves to be an outlier, it can easily be removed or replaced by the arithmetic mean using rm.outlier.
Figure 3: Interpolated quantile curves of the Grubbs test for two outliers on one tail; n = 4 (solid), n = 5 (dashed), n = 6 (dotted) and n = 7 (dotdashed).
Examples
Suppose we have six measurements, where the last
one is visibly larger than the others. We can now test whether it should be treated as an outlier and removed from further calculations.
> x = c(56.5,55.1,57.2,55.3,57.4,60.5)
> dixon.test(x)
Dixon test for outliers
data: x
Q = 0.5741, p-value = 0.08689
alternative hypothesis: highest value 60.5 is an outlier
> grubbs.test(x)
Grubbs test for one outlier
data: x
G = 1.7861, U = 0.2344, p-value = 0.06738
alternative hypothesis: highest value 60.5 is an outlier
> scores(x) # example only, not recommended for small samples
[1] -0.2551552 -0.9695897 0.1020621 -0.8675276 0.2041241
[6] 1.7860863
> scores(x,prob=0.95)
[1] FALSE FALSE FALSE FALSE FALSE TRUE
> scores(x,prob=0.975)
[1] FALSE FALSE FALSE FALSE FALSE FALSE
>
As we can see, neither test rejected the null hypothesis and the suspicious value should not be re-
moved. Further calculations should be done on the
full sample.
Another example is testing a simple dataset for
two opposite outliers, and two outliers at one tail:
> x = c(41.3,44.5,44.7,45.9,46.8,49.1)
> grubbs.test(x,type=11)
Grubbs test for two opposite outliers
data: x
G = 2.9908, U = 0.1025, p-value = 0.06497
alternative hypothesis: 41.3 and 49.1 are outliers
> x = c(45.1,45.9,46.1,46.2,49.1,49.2)
> grubbs.test(x,type=20)
Grubbs test for two outliers
data: x
U = 0.0482, p-value = 0.03984
alternative hypothesis: highest values 49.1 , 49.2 are outliers
>
In the first dataset, the smallest and greatest value
should not be rejected. The second example rejects
the null hypothesis: 49.1 and 49.2 are outliers, and
calculations should be made without them.
The last example is testing for outlying variance.
We have calculated variance in 8 groups (5 measure-
ments in each group) of results and obtained: 1.2, 2.5,
2.9, 3.5, 3.6, 3.9, 4.0, 7.9. We must now check whether the smallest or largest variance should be included in the calculations:
> v = c(1.2, 2.5, 2.9, 3.5, 3.6, 3.9, 4.0, 7.9)
> n = rep(5,8)
> cochran.test(v,n)
Cochran test for outlying variance
data: v
C = 0.2678, df = 5, k = 8, p-value = 0.3579
alternative hypothesis: Group 8 has outlying variance
sample estimates:
  1   2   3   4   5   6   7   8
1.2 2.5 2.9 3.5 3.6 3.9 4.0 7.9
> cochran.test(v,n,inlying=TRUE)
Cochran test for inlying variance
data: v
C = 0.0407, df = 5, k = 8, p-value < 2.2e-16
alternative hypothesis: Group 1 has inlying variance
sample estimates:
  1   2   3   4   5   6   7   8
1.2 2.5 2.9 3.5 3.6 3.9 4.0 7.9
>
The tests show that the first group has an inlying variance, significantly smaller than the others.
Bibliography
V. Barnett and T. Lewis. Outliers in Statistical Data. Wiley, 1994.

W.J. Dixon. Analysis of extreme values. Ann. Math. Stat., 21(4):488-506, 1950.

W.J. Dixon. Ratios involving extreme values. Ann. Math. Stat., 22(1):68-78, 1951.

D.B. Rorabacher. Statistical Treatment for Rejection of Deviant Values: Critical Values of Dixon Q Parameter and Related Subrange Ratios at the 95 percent Confidence Level. Anal. Chem., 63(2):139-146, 1991.

F.E. Grubbs. Sample Criteria for testing outlying observations. Ann. Math. Stat., 21(1):27-58, 1950.

E.S. Pearson and C.C. Chekar. The efficiency of statistical tools and a criterion for the rejection of outlying observations. Biometrika, 28(3):308-320, 1936.

H.A. David, H.O. Hartley, and E.S. Pearson. The distribution of the ratio, in a single normal sample, of range to standard deviation. Biometrika, 41(3):482-493, 1954.

G.W. Snedecor and W.G. Cochran. Statistical Methods. Iowa State University Press, 1980.
Lukasz Komsta
Department of Medicinal Chemistry
Skubiszewski Medical University of Lublin, Poland
luke@novum.am.lublin.pl
Analysing equity portfolios in R
Using the portfolio package
by David Kane and Jeff Enos
Introduction
R is used by major financial institutions around the
world to manage billions of dollars in equity (stock)
portfolios. Unfortunately, there is no open source R
package for facilitating this task. The portfolio pack-
age is meant to fill that gap. Using portfolio, an an-
alyst can create an object of class portfolioBasic
(weights only) or portfolio (weights and shares),
examine its exposures to various factors, calculate
its performance over time, and determine the contri-
butions to performance from various categories of
stocks. Exposures, performance and contributions
are the basic building blocks of portfolio analysis.
One Period, Long-Only
Consider a simple long-only portfolio formed from
the universe of 30 stocks in the Dow Jones Industrial
Average (DJIA) in January, 2005. Start by loading and
examining the input data.
> library(portfolio)
> data(dow.jan.2005)
> summary(dow.jan.2005)
symbol name
Length:30 Length:30
Class :character Class :character
Mode :character Mode :character
price sector
Min. : 21.0 Industrials :6
1st Qu.: 32.1 Staples :6
Median : 42.2 Cyclicals :4
Mean : 48.6 Financials :4
3rd Qu.: 56.0 Technology :4
Max. :103.3 Communications:3
(Other) :3
cap.bil month.ret
Min. : 22.6 Min. :-0.12726
1st Qu.: 53.9 1st Qu.:-0.05868
Median : 97.3 Median :-0.02758
Mean :126.0 Mean :-0.02914
3rd Qu.:169.3 3rd Qu.: 0.00874
Max. :385.9 Max. : 0.04468
> head(dow.jan.2005)
symbol name price
140 AA ALCOA INC 31.42
214 MO ALTRIA GROUP INC 61.10
270 AXP AMERICAN EXPRESS CO 56.37
294 AIG AMERICAN INTERNATIONAL GROUP 65.67
946 BA BOEING CO 51.77
1119 CAT CATERPILLAR INC 97.51
sector cap.bil month.ret
140 Materials 27.35 -0.060789
214 Staples 125.41 0.044681
270 Financials 70.75 -0.051488
294 Financials 171.04 0.009441
946 Industrials 43.47 -0.022600
1119 Industrials 33.27 -0.082199
The DJIA consists of exactly 30 large US stocks.
We provide a minimal set of information for con-
structing a long-only portfolio. Note that cap.bil
is market capitalization in billions of dollars, price
is the per share closing price on December 31, 2004,
and month.ret is the one month return from Decem-
ber 31, 2004 through January 31, 2005.
In order to create an object of class
portfolioBasic, we can use this data and form a
portfolio on the basis of a “nonsense” variable like
price.
> p <- new("portfolioBasic",
+ date = as.Date("2004-12-31"),
+ id.var = "symbol", in.var = "price",
+ sides = "long", ret.var = "month.ret",
+ data = dow.jan.2005)
An object of class "portfolioBasic" with 6 positions
Selected slots:
name: Unnamed portfolio
date: 2004-12-31
in.var: price
ret.var: month.ret
type: equal
size: quintile
> summary(p)
Portfolio: Unnamed portfolio
count weight
Long: 6 1
Top/bottom positions by weight:
id pct
1 AIG 16.7
2 CAT 16.7
3 IBM 16.7
4 JNJ 16.7
5 MMM 16.7
6 UTX 16.7
In other words, we have formed a portfolio of
the highest priced 6 stocks out of the 30 stocks in
the DJIA. The id.var argument causes the portfo-
lio to use symbol as the key for identifying securi-
ties throughout the analysis. in.var selects the vari-
able by which stocks are ranked in terms of desir-
ability. sides set to long specifies that we want a
long-only portfolio. ret.var selects the return vari-
able for measuring performance. In this case, we are
interested in how the portfolio does for the month of
January 2005.
The default arguments to portfolioBasic form
equal-weighted positions; in this case, each of the 6
stocks has 16.67% of the resulting portfolio. The de-
faults also lead to a portfolio made up of the best 20%
(or quintile) of the universe of stocks provided to
the data argument. Since 20% of 30 is 6, there are 6
securities in the resulting portfolio.
Exposures
Once we have a portfolio, the next step is to analyse
its exposures. These can be calculated with respect
to both numeric or factor variables. The method
exposure will accept a vector of variable names.
> exposure(p, exp.var = c("price", "sector"))
numeric
variable exposure
1 price 85.1
sector
variable exposure
2 Industrials 0.500
1 Financials 0.167
3 Staples 0.167
4 Technology 0.167
The weighted average price of the portfolio is
85. In other words, since the portfolio is equal-
weighted, the average share price of AIG, CAT, IBM,
JNJ, MMM, and UTX is 85. This is a relatively high
price, but makes sense since we explicitly formed the
portfolio by taking the 6 highest priced stocks out of
the DJIA.
Each of the 6 stocks is assigned a sector and 3 of
them are Industrials. This compares to the 20% of
the entire universe of 30 stocks that are in this sec-
tor. In other words, the portfolio has a much higher
exposure to Industrials than the universe as a whole.
Similarly, the portfolio has no stocks from the Com-
munications, Cyclicals or Energy sectors despite the
fact that these make up almost 27% of the DJIA uni-
verse.
Performance
Time plays two roles in the portfolio class. First,
there is the moment of portfolio formation. This is
the instant when all of the data, except for future re-
turns, is correct. After this moment, of course, things
change. The price of AIG on December 31, 2004 was
$65.67, but by January 5, 2005 it was $67.35.
The second role played by time on the portfolio
class concerns future returns. ret.var specifies a re-
turn variable which measures the performance of in-
dividual stocks going forward. These returns can be
of any duration — an hour, a day, a month, a year —
but should generally start from the moment of port-
folio formation. In this example, we are using one
month forward returns. Now that we know the port-
folio’s exposures at the start of the time period, we
can examine its performance for the month of Jan-
uary.
> performance(p)
Total return: -1.71 %
Best/Worst performers:
id weight ret contrib
2 CAT 0.167 -0.08220 -0.01370
3 IBM 0.167 -0.05234 -0.00872
6 UTX 0.167 -0.02583 -0.00431
1 AIG 0.167 0.00944 0.00157
4 JNJ 0.167 0.02018 0.00336
5 MMM 0.167 0.02790 0.00465
The portfolio lost 1.7% of its value in January. The
worst performing stock was CAT (Caterpillar), down
more than 8%. The best performing stock was MMM
(3M), up almost 3%. The contrib (contribution) of
each stock to the overall performance of the portfo-
lio is simply its weight in the portfolio multiplied by
its return. The sum of the 6 individual contributions
yields -1.7%.
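A quick arithmetic check of this definition, using the weights and returns from the output above:

> w <- rep(1/6, 6)
> ret <- c(-0.08220, -0.05234, -0.02583, 0.00944, 0.02018, 0.02790)
> sum(w * ret)
[1] -0.01714167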
Contributions
The contributions of individual stocks are not that in-
teresting in and of themselves. More important is
to examine summary statistics of the contributions
across different categories of stocks. Consider the use
of the contribution method:
> contribution(p, contrib.var = c("sector"))
sector
variable weight contrib roic
5 Communications 0.000 0.00000 0.00000
6 Conglomerates 0.000 0.00000 0.00000
7 Cyclicals 0.000 0.00000 0.00000
8 Energy 0.000 0.00000 0.00000
1 Financials 0.167 0.00157 0.00944
2 Industrials 0.500 -0.01336 -0.02671
9 Materials 0.000 0.00000 0.00000
3 Staples 0.167 0.00336 0.02018
4 Technology 0.167 -0.00872 -0.05234
10 Utilities 0.000 0.00000 0.00000
contribution, like exposure, accepts a vector of
variable names. In the case of sector, the contribu-
tion object displays the 10 possible values, the total
weight of the stocks for each level and the sum of the
contributions for those stocks. Only 4 of the 10 lev-
els are represented by the stocks in the portfolio. The
other 6 levels have zero weight and, therefore, zero
contributions.
The sector with the biggest weight is Industrials,
with half of the portfolio. Those 3 stocks did poorly,
on average, in January and are therefore responsible
for -1.3% in total losses. There is only a single Tech-
nology stock in the portfolio, IBM. Because it was
down 5% for the month, its contribution was -0.87%.
Since IBM is the only stock in its sector in the portfo-
lio, the contribution for the sector is the same as the
contribution for IBM.
The last column in the display shows the roic, the return on invested capital, for each level. This
captures the fact that raw contributions are a func-
tion of both the total size of positions and their re-
turn. For example, the reason that the total contri-
bution for Industrials is so large is mostly because
they account for such a large proportion of the port-
folio. The individual stocks performed poorly, on
average, but not that poorly. Much worse was the
performance of the Technology stock. Although the
total contribution from Technology was only 60% of
that of Industrials, this was on a total position weight
only 1/3 as large. In other words, the return on total
capital was much worse in Technology than in Indus-
trials even though Industrials accounted for a larger
share of the total losses.
Think about roic as useful in determining con-
tributions on the margin. Imagine that you have the
chance to move $1 out of one sector and into another.
Where should that initial dollar come from? Not
from the sector with the worst total contribution. In-
stead, the marginal dollar should come from the sec-
tor with the worst roic and should be placed into the
sector with the best roic. In this example, we should
move money out of Technology and into Staples.
contribution can also work with numeric vari-
ables.
> contribution(p, contrib.var = c("cap.bil"))
cap.bil
rank variable weight contrib roic
1 1 - low (22.6,50.9] 0.167 -0.013700 -0.08220
2 2 (50.9,71.1] 0.333 0.000345 0.00103
4 3 (71.1,131] 0.000 0.000000 0.00000
3 4 (131,191] 0.500 -0.003787 -0.00757
5 5 - high (191,386] 0.000 0.000000 0.00000
Analysing contributions only makes sense in the
context of categories into which each position can
be placed. So, we need to break up a numeric vari-
able like cap into discrete, exhaustive categories. The
contribution function provides various options for
doing so, but most users will be satisfied with the
default behavior of forming 5 equal sized quintiles
based on the distribution of the variable in the entire
universe.
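The idea behind the default bucketing can be sketched in plain R; this is only an illustration, not the contribution internals, and assumes cap.bil is the market capitalisation column of dow.jan.2005 used in the calls above:

## A minimal sketch: quintile categories formed from the distribution
## of cap.bil in the entire universe (here the 30 DJIA stocks).
breaks <- quantile(dow.jan.2005$cap.bil, probs = seq(0, 1, by = 0.2))
cap.q  <- cut(dow.jan.2005$cap.bil, breaks = breaks, include.lowest = TRUE)
table(cap.q)   # number of stocks in each quintile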
In this example, we see that there are no portfo-
lio holdings among the biggest 20% of stocks in the
DJIA. Half the portfolio comes from stocks in the sec-
ond largest quintile. Within each category, the anal-
ysis is the same as that above for sectors. The worst
performing category in terms of total contribution is
the smallest quintile. This category also has the low-
est roic.
One Period Long-Short
Having examined a very simple long-only portfo-
lio in order to explain the concepts behind exposures,
performance and contributions, it is time to consider a
more complex case, a long-short portfolio which uses
the same underlying data.
> p <- new("portfolioBasic",
+ date = as.Date("2004-12-31"),
+ id.var = "symbol", in.var = "price",
+ type = "linear", sides = c("long",
+ "short"), ret.var = "month.ret",
+ data = dow.jan.2005)
An object of class "portfolioBasic" with
12 positions
Selected slots:
name: Unnamed portfolio
date: 2004-12-31
in.var: price
ret.var: month.ret
type: linear
size: quintile
> summary(p)
Portfolio: Unnamed portfolio
count weight
Long: 6 1
Short: 6 -1
Top/bottom positions by weight:
id pct
12 UTX 28.57
5 IBM 23.81
2 CAT 19.05
8 MMM 14.29
1 AIG 9.52
10 PFE -9.52
9 MSFT -14.29
11 SBC -19.05
6 INTC -23.81
4 HPQ -28.57
Besides changing to a long-short portfolio, we
have also provided a value of “linear” to the type
argument. As the summary above shows, this yields
a portfolio in which the weights on the individual
positions are (linearly) proportional to their share
prices. At the end of 2004, the lowest priced stock
in the DJIA was HPQ (Hewlett-Packard) at $20.97.
Since we are using price as our measure of desirabil-
ity, HPQ is the biggest short position. Since we have
not changed the default value for size from “quin-
tile,” there are still 6 positions per side.
> plot(p)
(Panels: “Weight vs. in.var”, position weight against price; “Weight distribution”, a histogram of position weights.)
Figure 1: Plot of a portfolioBasic object.
Figure 1 shows the result of calling plot on this
portfolio object. The top plot displays the relation-
ship between position weights and their correspond-
ing in.var values, while the bottom plot shows a his-
togram of position weights.
The display for the exposure object is now some-
what different to accommodate the structure of a
long-short portfolio.
> exposure(p, exp.var = c("price", "sector"))
numeric
variable long short exposure
1 price 92.6 -24.2 68.4
sector
variable long short exposure
2 Industrials 0.6190 0.0000 0.6190
1 Financials 0.0952 0.0000 0.0952
3 Staples 0.0476 -0.0952 -0.0476
5 Communications 0.0000 -0.2381 -0.2381
4 Technology 0.2381 -0.6667 -0.4286
The long exposure to price is simply the
weighted average of price on the long side of the
portfolio, where the weighting is done in proportion
to the size of the position in each stock. The same
is true on the short side. Since the linear weighting
used here emphasises the tail of the distribution, the
long exposure is greater than that of the equal-
weighted portfolio considered above, $93 versus $85.
Since the weights on the short side are actually
negative — in the sense that we are negatively ex-
posed to positive price changes in these stocks —
the weighted average on the short side for a positive
variable like price is also negative. Another way to
read this is to note that the weighted average price
on the short side is about $24 but that the portfolio
has a negative exposure to this number because these
positions are all on the short side.
One reason for the convention of using a negative
sign for short side exposures is that doing so makes
the overall exposure of the portfolio into a simple
summation of the long and short exposures. (Note
the assumption that the weights on both sides are
equal. In future versions of the portfolio package,
we hope to weaken these and other requirements.)
For this portfolio, the overall exposure is 68. Because
the portfolio is long the high priced stocks and short
the low priced ones, the portfolio has a positive ex-
posure to the price factor. Colloquially, we are “long
price.”
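That arithmetic can be sketched with a signed weight vector; w and price below are hypothetical stand-ins for the portfolio's weights and share prices, not objects from the package:

## A minimal sketch, assuming 'w' holds signed position weights
## (long weights sum to 1, short weights sum to -1) and 'price'
## holds the corresponding share prices.
long.exp  <- sum(w[w > 0] * price[w > 0])   # weighted average price, long side
short.exp <- sum(w[w < 0] * price[w < 0])   # negative by construction
total.exp <- long.exp + short.exp           # overall exposure, about 68 here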
A similar analysis applies to sector exposures. We
have 62% of our long holdings in Industrials but zero
of our short holdings. We are, therefore, 62% long In-
dustrials. We have 24% of the long holdings in Tech-
nology, but 67% of the short holdings; so we are 43%
short Technology.
Calling plot on the exposure object produces a
bar chart for each exposure category, as seen in Fig-
ure 2. Since all numeric exposures are grouped to-
gether, a plot of numeric exposure may only make
sense if all numeric variables are on the same scale.
In this example, the only numeric exposure we have
calculated is that to price.
> plot(exposure(p, exp.var = c("price",
+ "sector")))
(Panels: “numeric exposure”, the exposure to price; “sector exposure”, the exposures to Technology, Communications, Staples, Financials and Industrials.)
Figure 2: Plot of an exposure object for a single pe-
riod. Bar charts are sorted by exposure.
The performance object is similar for a long-short
portfolio.
> performance(p)
Total return: 2.19 %
Best/Worst performers:
id weight ret contrib
2 CAT 0.1905 -0.08220 -0.01566
5 IBM 0.2381 -0.05234 -0.01246
12 UTX 0.2857 -0.02583 -0.00738
3 DIS -0.0476 0.02986 -0.00142
1 AIG 0.0952 0.00944 0.00090
8 MMM 0.1429 0.02790 0.00399
6 INTC -0.2381 -0.04019 0.00957
10 PFE -0.0952 -0.10152 0.00967
11 SBC -0.1905 -0.06617 0.01260
4 HPQ -0.2857 -0.06581 0.01880
The portfolio was up 2.2% in January. By de-
fault, the summary of the performance object only
provides the 5 worst and best contributions to re-
turn. HPQ was the biggest winner because, though
it was only down 7% for the month, its large weight-
ing caused it to contribute almost 2% to overall per-
formance. The 19% weight of CAT in the portfolio
placed it as only the third largest position on the long
side, but its -8% return for the month made it the
biggest drag on performance.
The contribution function provides similar out-
put for a long-short portfolio.
> contribution(p, contrib.var = c("cap.bil",
+ "sector"))
cap.bil
rank variable weight contrib roic
1 1 - low (22.6,50.9] 0.0952 -0.01566 -0.16440
2 2 (50.9,71.1] 0.3810 0.01399 0.03671
3 3 (71.1,131] 0.0952 0.01260 0.13235
4 4 (131,191] 0.3095 -0.00103 -0.00334
5 5 - high (191,386] 0.1190 0.01202 0.10098
sector
variable weight contrib roic
1 Communications 0.1190 0.0112 0.0939
6 Conglomerates 0.0000 0.0000 0.0000
7 Cyclicals 0.0000 0.0000 0.0000
8 Energy 0.0000 0.0000 0.0000
2 Financials 0.0476 0.0009 0.0189
3 Industrials 0.3095 -0.0191 -0.0616
9 Materials 0.0000 0.0000 0.0000
4 Staples 0.0714 0.0106 0.1488
5 Technology 0.4524 0.0183 0.0404
10 Utilities 0.0000 0.0000 0.0000
As in the last example, the weight column reflects
the proportion of the portfolio invested in each cate-
gory. In this case, we have 45% of the total capital
(or weight) of the long and short sides considered to-
gether invested in the Technology sector. We know
from the exposure results above that most of this is
invested on the short side, but in the context of con-
tributions, it does not matter on which side the capi-
tal is deployed.
> plot(contribution(p, contrib.var = c("cap.bil",
+ "sector")))
(Panels: “roic by cap.bil” and “roic by sector”, showing roic for each category.)
Figure 3: Plot of a contribution object for a single pe-
riod.
Plotting objects of class contribution produces
output similar to plotting exposures, as can be seen
in Figure 3. This time, however, we see values of
roic plotted against the categories that make up each
contrib.var. For numeric variables, roic for each
interval is shown; numeric intervals will always ap-
pear in order.
Multi-Period, Long-Short
Analysing a portfolio for a single time period is a
useful starting point, but any serious research will
require considering a collection of portfolios over
time. The help pages of the portfolio package pro-
vide detailed instructions on how to construct a
portfolioHistory object. Here, we will just load up
the example object provided with the package and
consider the associated methods for working with it.
> data(global.2004.history)
> global.2004.history
An object of class "portfolioHistory"
Contains:
Object of class "exposureHistory"
Number of periods: 12
Period range: 2003-12-31 -- 2004-11-30
Data class: exposure
Variables: numeric currency sector
Object of class "performanceHistory"
Number of periods: 12
Period range: 2003-12-31 -- 2004-11-30
Data class: performance
Object of class "contributionHistory"
Number of periods: 12
Period range: 2003-12-31 -- 2004-11-30
Data class: contribution
Variables: currency sector liq.w
Note that the portfolios analysed in this object
were created on a monthly frequency for 2004 using a
universe of the 500 largest global stocks each month.
Market capitalization in billions of dollars (cap.bil)
was used as a measure of desirability, so the portfo-
lio is long the mega-large caps and short the merely
typical large caps.
We have designed the architecture of the portfo-
lio package so that working with multi-period port-
folios is similar to working with single period portfo-
lios. For starters, we can call the exposure method on
objects of class portfolioHistory just as we called
them on objects of class portfolio. An added call to
summary prints a summary of the returned object.
> summary(exposure(global.2004.history))
Mean exposure:
numeric
variable long short exposure
11 cap.bil 109.20 -10.483 98.71
1 liq.w 1.22 0.822 2.04
currency
variable long short exposure
6 JPY 0.062978 -0.158931 -0.095952
8 SEK 0.000948 -0.040208 -0.039260
5 HKD 0.000000 -0.017283 -0.017283
1 AUD 0.001242 -0.005734 -0.004491
7 NOK 0.000000 -0.003949 -0.003949
9 SGD 0.000000 -0.000438 -0.000438
3 EUR 0.179902 -0.176904 0.002998
4 GBP 0.113589 -0.109119 0.004470
2 CHF 0.042340 -0.019232 0.023108
10 USD 0.598999 -0.468202 0.130797
sector
variable long short exposure
10 Utilities 0.00803 -0.08832 -0.080290
3 Cyclicals 0.04980 -0.12475 -0.074945
7 Materials 0.00112 -0.05876 -0.057636
6 Industrials 0.04147 -0.09762 -0.056152
2 Conglomerates 0.00000 -0.00356 -0.003563
9 Technology 0.08784 -0.08730 0.000545
8 Staples 0.24521 -0.23168 0.013532
5 Financials 0.24544 -0.19245 0.052982
4 Energy 0.10328 -0.03456 0.068719
1 Communications 0.19144 -0.05385 0.137588
The difficulty in considering exposures — basi-
cally a static concept about how the portfolio looks
at this moment — in the context of a time series of
portfolio objects is that it is not obvious how to
summarize the exposures over time, especially in the
context of a portfolio whose overall size is changing.
In this release of the portfolio package, we will sim-
ply report the average exposure for all time periods.
For this example portfolio, it is easy to see that the
portfolio is long the biggest stocks. The average mar-
ket capitalization is over $100 billion on the long side
versus only $10 billion on the short side. This cap
exposure causes the portfolio to take a similar “bet”
on liquidity, since larger stocks are, on average, more
liquid than smaller stocks. The liq.w variable is our
measure of liquidity — basically how much stock is
traded on a “typical” day — scaled to have mean 0
and variance 1 for the universe.
The currency and sector exposures are, like-
wise, simple averages of the 12 monthly ex-
posures. Even though the portfolio is constructed
without reference to currency or sector, it still ends
up with non-trivial exposures to these variables. In
a typical month, the portfolio is 14% long Communi-
cations and 10% short Japan.
Plotting the exposureHistory object yields box
and whisker plots of net exposure for each exposure
variable, as in Figure 4.
> plot(exposure(global.2004.history))
(Panels: “numeric exposure” for cap.bil and liq.w, “currency exposure” and “sector exposure”, drawn as box-and-whisker plots.)
Figure 4: Plot of an exposureHistory object.
The performance method aggregates perfor-
mance for each individual time period into an overall
figure and provides a listing of the largest contribu-
tions.
> summary(performance(global.2004.history))
Performance summary (frequency = 12):
ret
mean -0.0081
mean (ann) -0.0971
sd 0.0152
sd (ann) 0.0527
sharpe -1.8439
Best period:
date ret
2004-03-31 2004-03-31 0.0149
Worst period:
date ret
2004-10-31 2004-10-31 -0.0292
Top/Bottom performers (total contribution):
id contrib
140 EBAY 0.00423
231 MRW.LN 0.00387
234 MU 0.00357
338 XOM 0.00337
196 JNJ 0.00333
211 LTR -0.00416
323 VOLVB.SS -0.00453
329 WLP -0.00474
227 MON -0.00494
11 5201 -0.00555
The best month for the portfolio was April 2004
when it earned 1.5%. The worst month was a loss of
almost 3% in November. The biggest contributor on
the positive side was EBAY. The mean annualized re-
turn of the portfolio is almost -10%; the annualized
volatility is 5.3%. Sharpe ratios — defined¹ as the
mean return divided by the standard deviation of re-
turn — are generally not provided for unprofitable
portfolios, but the ratio here is -1.8.
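These summary figures are consistent with the usual monthly-to-annual scaling; the following sketch is not the package code, and the risk-free rate is ignored as noted in the footnote:

## A minimal sketch: annualise a vector of monthly returns and form
## the Sharpe ratio as mean return over standard deviation of return.
sharpe.monthly <- function(ret) {
  mean.ann <- mean(ret) * 12        # annualised mean return
  sd.ann   <- sd(ret) * sqrt(12)    # annualised volatility
  c(mean.ann = mean.ann, sd.ann = sd.ann, sharpe = mean.ann / sd.ann)
}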
Note that the portfolioHistory object does not
keep a record of every position for each time pe-
riod. By design, it only retains summary informa-
tion, albeit on a stock-by-stock basis. We want to be
able to use this package for high frequency data —
say, a decade worth of daily portfolios — so it is not
easy (or even possible) to manipulate an object which
maintains information for every stock for each time
period.
As with the exposure method, the contribution
method works with portfolioHistory objects in the
same way that it works with portfolio objects.
> summary(contribution(global.2004.history))
Mean contribution:
currency
variable weight contrib roic
1 AUD 0.003488 -0.00014179 -0.02622
2 CHF 0.030786 0.00000868 -0.00291
11 DKK 0.000000 0.00000000 0.00000
3 EUR 0.178403 -0.00104765 -0.00620
4 GBP 0.111354 -0.00048120 -0.00409
5 HKD 0.008641 0.00011748 -0.00720
6 JPY 0.110954 0.00005111 -0.00273
21 NOK 0.001974 -0.00012765 -0.03745
7 SEK 0.020578 -0.00111390 -0.04674
31 SGD 0.000219 -0.00000963 -0.01264
8 USD 0.533601 -0.00534746 -0.00996
sector
variable weight contrib roic
1 Communications 0.12610 0.0008478 0.00814
11 Conglomerates 0.00179 -0.0000746 -0.01039
2 Cyclicals 0.08961 -0.0025430 -0.02863
3 Energy 0.07077 0.0008145 0.01384
4 Financials 0.22484 -0.0010191 -0.00416
5 Industrials 0.07153 -0.0014918 -0.01990
6 Materials 0.03067 -0.0002470 -0.01246
7 Staples 0.24511 -0.0036175 -0.01483
8 Technology 0.09007 0.0002594 0.00231
9 Utilities 0.04950 -0.0011930 -0.02607
liq.w
variable weight contrib roic
1 1 - low 0.2628 -0.00390 -0.0140
2 2 0.1607 -0.00545 -0.0349
3 3 0.0668 -0.00136 -0.0197
4 4 0.0952 -0.00167 -0.0156
5 5 - high 0.4144 0.00384 0.0090
Again, the basic approach consists of calculat-
ing the contributions for each time period separately,
saving those results, and then presenting an average
for the entire period.
Figure 5 shows the plot of the resulting
contributionHistory object. As with exposure his-
tory, the box and whisker plot shows the median and
quartile range of variable values over the given time
period.
> plot(contribution(global.2004.history))
(Panels: “roic by currency”, “roic by sector” and “roic by liq.w”, drawn as box-and-whisker plots.)
Figure 5: Plot of a contributionHistory object.
Conclusion
The current release of the portfolio package is meant
to serve as a proof-of-concept. Relatively sophisti-
cated portfolio analytics are possible using an open
source package. Although money management is
largely a zero-sum game — every dollar that I make
is a dollar that someone else has lost — there is no
reason why we might not cooperate on the tools that
we all use.
David Kane and Jeff Enos
Kane Capital Management, Cambridge, Massachusetts,
USA
david@kanecap.com and jeff@kanecap.com
¹The definition of the Sharpe ratio subtracts the risk-free rate from the mean return in the numerator. But, in the context of a long-short
portfolio, this is typically ignored.
GroupSeq: Designing clinical trials using
group sequential designs
by Roman Pahl, Andreas Ziegler, and Inke R. König
Introduction
Clinical trials are the gold standard to prove su-
periority of new interventions. An attractive vari-
ant to these studies is to use group sequential de-
signs (Ziegler et al.,2003). Widely used since their
introduction in the 1970s and 1980s, many differ-
ent versions of group sequential designs have been
developed. Common to all these designs is their
complex computation for which software tools are
required. In this context, existing software tools
like ADDPLAN (http://www.addplan.com/) often
are not free of charge, making it valuable to consider
freely distributed software alternatives.
GroupSeq provides a set of several methods for
computing group sequential boundaries of well es-
tablished group sequential designs, among others
the designs by Pocock (1977) and O‘Brien and Flem-
ing (1979). More precisely, the original designs are
approximated by the α-spending function approach
which was introduced by Lan and DeMets (1983) and
Kim and DeMets (1987), for which a Fortran pro-
gram by Reboussin et al. (2003) has been available.
For GroupSeq, the latter implementation was com-
pletely recoded in R, and, in addition, new features
were added. For instance, GroupSeq allows one to
calculate exact Pocock bounds besides the estimated
ones obtained by the α-spending approach.
Furthermore, as an important feature, this ap-
plication is embedded in a graphical user interface
(GUI) using the Tcl/Tk interface in the R tcltk pack-
age. Note that there is also a graphical user inter-
face by Reboussin et al. (2003) although it only runs
under Windows. In contrast, the GroupSeq GUI is
available on several platforms, and, moreover, pro-
vides customization possibilities; i.e., the user may
create as many windows as desired. Thus, he or she
may perform multiple tasks such as computing and
comparing several designs at the same time.
The computations yield valid results for any test
which is based on normally distributed test statistics
with independent increments, survival studies, and
certain longitudinal designs. Using the α-spending
function approach, interim analyses need not be
equally spaced, and their number need not be speci-
fied in advance.
Group sequential tests
This paper only gives a basic introduction to group
sequential designs. For a comprehensive overview
of this topic, see, e.g., Jennison and Turnbull (2000)
and Wassmer (2001).
The classical theory of testing goes back to Ney-
man and Pearson (1928) and is based on a sample of
fixed size. In this sample, the null hypothesis H0 is
tested against an alternative hypothesis H1. A sig-
nificance level α has to be defined a priori, which
implies the probability of the null hypothesis being
falsely rejected. Importantly, the evaluation always
takes place at the end after all observations have been
made.
By contrast, group sequential designs allow con-
secutive testing, with possible rejection of the null
hypothesis after observation of each group in a se-
quence of groups. Therefore, the significance levels
corresponding to each group have to be adjusted ap-
propriately. Many different group sequential designs
have been developed. The main difference among
these is the manner of allocating the specific signifi-
cance levels. A further distinction consists in equally
sized groups as in classical group sequential testing
(Pocock,1977;O‘Brien and Fleming,1979) versus un-
equally sized groups (Kim and DeMets,1987).
Determining critical regions
Normally one wants to compare two population
mean values µ1 and µ2 — i.e., one wants to test the
null hypothesis H0: µ1 = µ2 against the alternative
hypothesis H1: µ1 ≠ µ2. Consider a standard nor-
mally distributed Z statistic
\[ Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}} \qquad (1) \]
with $\bar{X}_1$, $\bar{X}_2$ being the means of two independent
samples of sizes n1, n2 which are distributed as
N(µ1, σ1²) and N(µ2, σ2²). Assuming n1 = n2 = n
and σ1² = σ2² = σ², analogously to (1), one can define
a statistic per group k:
\[ Z_k = \frac{\bar{X}_{1k} - \bar{X}_{2k}}{\sigma}\,\sqrt{\frac{n_k}{2}} \]
The standardized overall statistic until group k then
is defined as
\[ Z^*_k = \frac{\sum_{j=1}^{k}\sqrt{n_j}\,Z_j}{\sqrt{\sum_{j=1}^{k} n_j}} \]
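As a small illustration of this combination rule (not taken from GroupSeq), the overall statistics can be accumulated from the group-wise statistics and group sizes:

## A minimal sketch: combine group-wise statistics z[1..K] with
## group sizes n[1..K] into the standardized overall statistics Z*_k.
z.overall <- function(z, n) {
  cumsum(sqrt(n) * z) / sqrt(cumsum(n))
}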
In contrast to Z1, Z2, ..., ZK, the Z*1, Z*2, ..., Z*K are sta-
tistically dependent on each other. The joint multi-
variate normal distribution of Z*1, Z*2, ..., Z*K therefore
has to be computed numerically.
A specific group sequential design is character-
ized by its regions ξk with k = 1, 2, ..., K−1, where ξk
denotes the region in which the study continues; i.e.,
H0 has not been rejected by then. The last acceptance
region in the sequence is denoted ξK, and the proba-
bility of type I error for test K is given by
\[ 1 - P_{H_0}\!\left( \bigcap_{k=1}^{K} \{ Z^*_k \in \xi_k \} \right), \]
where P_{H0} denotes the distribution of Z*k under H0.
One obtains the corresponding power with regard to
the alternative hypothesis H1 by
\[ 1 - P_{H_1}\!\left( \bigcap_{k=1}^{K} \{ Z^*_k \in \xi_k \} \right). \]
The α-spending function approach
The α-spending function, or use function, approach was
introduced by Lan and DeMets (1983) and Kim and
DeMets (1987). It allows analyses at arbitrary points
of time in the study and hence of arbitrarily sized
groups. This is achieved via the α-spending function
α(tk) which determines the type I error "spent" until
time point tk, with k denoting the k-th interim analy-
sis. Usually, the entire period of analysis is standard-
ized to one so that 0 < t1 < t2 < ... < tK = 1, with
$t_k = \sum_{i=1}^{k} n_i / N$.
Given the maximum sample size N and the α-
spending function α(tk), the critical regions are ob-
tained by recursive calculation. Setting t1 = n1/N,
one obtains
\[ P_{H_0}(|Z^*_1| \geq u_1) = \alpha(t_1) \qquad (2) \]
for the first critical region. Analogously to (2), the
remaining probabilities are obtained by
\[ P_{H_0}\!\left( \bigcap_{i=1}^{k-1} \{ |Z^*_i| < u_i \} \cap \{ |Z^*_k| \geq u_k \} \right) = \alpha(t_k) - \alpha(t_{k-1}). \]
The α-spending function approach thus is a very
flexible tool for designing various studies because
neither group size nor number of interim analyses
have to be specified in advance. All calculations per-
formed by GroupSeq are based on the α-spending
approach. In this context, the classic designs are es-
timated as special cases by the following α-spending
functions:
• Pocock:
\[ \alpha^*_1(t_k) = \alpha \, \ln[\, 1 + (e - 1)\, t_k \,] \]
• O’Brien and Fleming:
\[ \alpha^*_2(t_k) = 4\left( 1 - \Phi\!\left( \Phi^{-1}(1 - \alpha/4) \big/ \sqrt{t_k} \right) \right) \]
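Both spending functions are easy to evaluate in plain R; the sketch below is only an illustration and not the GroupSeq implementation (alpha is the overall level, t the information times in (0, 1]):

## A minimal sketch of the two alpha-spending functions above.
alpha.pocock <- function(t, alpha)
  alpha * log(1 + (exp(1) - 1) * t)
alpha.obf <- function(t, alpha)
  4 * (1 - pnorm(qnorm(1 - alpha/4) / sqrt(t)))

## Example: alpha spent at four equally spaced interim analyses
alpha.pocock(c(0.25, 0.5, 0.75, 1), alpha = 0.05)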
Implementation
The program by Reboussin et al. (2003) constituted
the basis for GroupSeq. Recoding in R followed the
software engineering principles of unitization and
re-usability; i.e., the tasks were divided into many
subfunctions, which also allows improvement in fur-
ther development. One goal consisted in maintain-
ing the performance of the application. This was
achieved by converting the “low level syntax” of
Fortran into R specific functions as well as vector
and matrix operations. Some algorithms also were
slightly improved, e.g., by using Newton’s method
for optimization, replacing a bisection method. The
interested reader is referred to the source code which
is commented in detail. As it turned out, the per-
formance of GroupSeq is generally comparable with
the Fortran implementation. Notable differences oc-
cur only when the number of interim analyses rises
to more than ten. This, however, will usually not oc-
cur in practice. Still, efficiency could be improved
by outsourcing some computations into compiled
C/C++ code.
A main focus was on improving the user inter-
face. At the time of development of GroupSeq, the
Fortran program only provided a command line in-
terface. Because of this, GroupSeq was embedded
in a GUI using Tcl/Tk, making it available for every
platform on which R runs. The customization of the
GUI is left to the user who may create a separate win-
dow for each task, which allows any arrangement
and positioning on the desktop. For those who are
familiar with the program by Reboussin et al. (2003),
the menu structure of that program was retained in
GroupSeq.
Additionally, GroupSeq provides new features
that are not part of the program by Reboussin et al.
(2003). Specifically, new functionality is included
that allows the user to compute exact Pocock bounds.
Package features
GroupSeq provides a GUI using the Tcl/Tk interface
in the R tcltk package. Hence, GroupSeq’s usage is
almost self-explanatory. After loading the package,
the program starts automatically. The user will see
the menu shown in Figure 1.
A task is chosen by selecting an appropriate line
and confirming with “Perform Selected Task”. For
the purpose of multitasking, the user may open as
many task windows as desired. Every task is pro-
cessed independently. Choosing task “-1- Compute
Bounds” will lead to the dialog box given in Figure 2.
Figure 1: Menu on program start.
The number of interim analyses is selected by a
drop down menu at the top of the window. An
α-spending function can be selected and must be
confirmed by the “CONFIRM FUNCTION” button.
Figure 2: Window after “-1- Compute Bounds” was
chosen.
The user may choose between (1) and (2) which are
estimates of the classic designs by Pocock (1977) and
O‘Brien and Fleming (1979), respectively. (3) corre-
sponds to the design proposed by Kim and DeMets
(1987), and (4) to a family of functions proposed by
Hwang et al. (1990). Finally, (5) calculates the exact
bounds for the Pocock design. All GroupSeq operat-
ing windows are designed so that irrelevant informa-
tion is concealed. For example, if the user chooses to
use unequal time points and different spending func-
tions for upper and lower bounds, the window will
appear as shown in Figure 3, with GroupSeq provid-
ing meaningful default values.
Figure 3: User specification in “-1- Compute Bounds”.
Figure 4: Displaying results of calculated bounds.
Pressing the “CALCULATE” button at the bottom of
the window initiates the calculation, and each of the
results is shown in a new, separate window. Figure
4 gives an example. The parameter values that were
used in the computation are shown at the top part of
the window, and the most important parameters can
also be seen in the blue window title bar. This feature
helps the user to keep track of many windows when
elaborating many designs at once. The results of the
computation are presented in a table. In the case of
task -1-, the resulting bounds, α spent per stage, and
α spent in total are displayed. In a next step, the re-
sulting critical bounds can be visualized by choosing
“Show Graph” (see Figure 5) which requires the R
graphics environment.
Figure 5: Graphical visualization of results.
By using the option “Save to file”, the results may
be saved into a table in common *.html-file for-
mat, which conserves meta information and there-
fore provides the possibility of further processing
(see Figure 6).
Figure 6: Saved results (viewed in Mozilla Firefox).
Figure 7: “-2- Compute drift given power and bounds”
was chosen.
Choosing task “-2- Compute drift given power and
bounds” from the main menu will lead to the win-
dow shown in Figure 7. Besides specifying the de-
sired power, the user may enter specific bounds in-
stead of computing them with a spending function.
Activating the corresponding check box changes the
current window as shown in Figure 8. Calculating
the effect under the alternative hypothesis leads to
the window shown in Figure 9.
Figure 8: User enters specific bounds.
Figure 9: Displaying results of calculated drift under
re-specified power (here: Exact Pocock Bounds were
used).
The table contains exit probabilities (i.e., 1 − β) cor-
responding to the probability per stage of rejecting
the null hypothesis under the alternative. The effect
(drift) hereby is implicitly given by the pre-specified
power. The cumulative exit probabilities, in turn,
sum up to the power at last analysis.
Task “-3- Compute probability given bounds and drift”
works the other way around. Given a specified drift,
which, again, corresponds to the expected effect size
at the last analysis (see Figure 10), the exit probabili-
ties, and with them the overall power at the last anal-
ysis, are calculated. The computation, of course, also
depends on the given bounds and interim times.
Figure 10: User enters drift in Task -3- window (detail).
Finally, users may select task “-4- Compute confidence
interval”. As shown in Figure 11, the confidence level
and the overall effect (as a Z-value) have to be en-
tered.
Figure 11: User enters confidence level and desired effect
in Task -4- window (detail).
Figure 12: Resulting confidence interval
(Two-Sided-Bounds).
The window displaying the calculated confidence in-
terval differs somewhat from the other windows (see
Figure 12), as no table is presented. The bounds still
can be visualized using “Show Graph”, however.
Future development
Future versions of GroupSeq will include addi-
tional software ergonomic improvements such as
a status bar during calculation. Additionally, an
“undo” function will be added, as well as a fea-
ture to let users specify their own default values,
thereby achieving more efficiency in frequent use of
GroupSeq. Furthermore, it is planned to implement
adaptive group sequential designs.
Bibliography
I. K. Hwang, W. J. Shih, and J. S. DeCani. Group
sequential designs using a family of type I er-
ror probability spending functions. Statistics in
Medicine, 9:1439–1445, 1990.
C. Jennison and B. W. Turnbull. Group Sequential
Methods with Applications to Clinical Trials. Chap-
man & Hall/CRC, 2000.
K. Kim and D. DeMets. Design and analysis of group
sequential tests based on the type I error spending
rate function. Biometrika, 74:25–36, 1987.
K. K. G. Lan and D. L. DeMets. Discrete sequential
boundaries for clinical trials. Biometrika, 70:659–
663, 1983.
J. Neyman and E. S. Pearson. On the use and inter-
pretation of certain test criteria for purposes of sta-
tistical inference. Biometrika, 20A:263–295, 1928.
P. C. O‘Brien and T. R. Fleming. A multiple testing
procedure for clinical trials. Biometrics, 35:549–556,
1979.
S. J. Pocock. Group sequential methods in the de-
sign and analysis of clinical trials. Biometrika, 64:
191–199, 1977.
D. M. Reboussin, D. L. DeMets, K. Kim, and K. K. G.
Lan. Programs for Computing Group Sequential
Boundaries Using the Lan-DeMets Method, Ver-
sion 2.1. 2003. URL http://www.biostat.wisc.
edu/landemets/.
G. Wassmer. Statistische Testverfahren für gruppense-
quentielle und adaptive Pläne in klassischen Studien.
Alexander Mönch Verlag, Köln, 2001.
A. Ziegler, G. Dahmen, I. R. König, G. Wassmer,
P. Hildebrand, and M. Birth. Accelerating clinical
trials by flexible designs. Good Clinical Practice Jour-
nal, 10:16–19, 2003.
Roman Pahl
roman.pahl@gmx.de
Andreas Ziegler
ziegler@imbs.uni-luebeck.de
Inke R. König
Inke.Koenig@imbs.uni-luebeck.de
Institute of Medical Biometry and Statistics
University at Lübeck, Germany
Using R/Sweave in everyday clinical
practice
by Sven Garbade and Peter Burgard
The Sweave package by Friedrich Leisch is a pow-
erful tool to combine R with LaTeX facilities for text
formatting (R Development Core Team,2005;Leisch,
2002a,b). Sweave allows the dynamic generation of
statistical reports by using literate data analysis. A
further application of Sweave is the package vignette
(Leisch,2003). An introduction to the abilities of
Sweave can be found on Friedrich Leisch’s Home-
page (URL: http://www.ci.tuwien.ac.at/~leisch/)
and in several publications, for example Leisch
(2002b,c).
Some issues ago Damian Betebenner wrote an ar-
ticle on how to use R/Sweave in batch mode for au-
tomatic generation of prototype reports (Betebenner,
2005). His article pointed out very helpful and in-
structive notes on how to combine R and Sweave in
batch processing and how to use control structures
with Sweave.
Our article focuses on a combination of
R/Sweave/TclTk in batch mode to produce statis-
tical reports of a neuropsychological examination.
The emphasis of this article is on the background,
functional range and handling of our tool and its use
in our hospital.
Background and Motivation
We routinely use the computer based neuropsycho-
logical test-battery CORTEX (COntextual Reaction
Time EXercises) in our metabolic polyclinic for clin-
ical diagnostics, monitoring and evaluation. A neu-
ropsychological test-battery is a set of several psycho-
logical tests, each testing a specific function of the
brain. The emphasis of our tests is on cognitive
skills, namely visual short term memory, detection
and recognition abilities. The results from the tests
give an insight into a patient’s cognitive state, and
are useful for further diagnostics and treatment rec-
ommendations.
Currently six different tests are bundled in the
battery. An investigation with all six tests lasts about
one hour. Here we describe briefly the six tests; for a
more detailed description of some of these tests and
their theoretical background see Burgard et al. (1997).
The first test is a simple visuomotor reaction task
requiring the testee to press the button of a response
panel as fast as possible after a square appears on the
monitor screen (Test Reaction). The test CPT (Contin-
uous Performance Task) measures sustained atten-
tion by a binary choice task. The testee is required
to press the response button of the dominant hand
as quickly as possible when four dots appear, and
the response button of the non-dominant hand when
three or five dots appear. The LDT (Letter Detection
Task) requires the testee to scan four letters displayed
in a rectangular layout around the center of the com-
puter screen for one, two or three target letters. The
number of target letters determines task difficulty. In
the VWM (Visual Working Memory) task the testee
sees a 7×7 matrix filled with black and white squares
for one second. After a retention interval of four sec-
onds, the previous matrix and three other matrixes
are shown, and the testee has to click on the previ-
ously shown matrix. Task difficulty is determined
by similarity between the target matrix and the three
comparative matrixes. The Dual Task contains three
different tasks and the testee has to deal with visual
and auditory stimuli. In the first task, the testee has
to press a response button as fast as possible when
a particular tone among other tones is played; in the
second task, large or small rectangles are displayed
in random order and the testee has to press a but-
ton when the small rectangle appears. The third task
is a combination of both; i. e., the testee has to press
a response key when the small rectangle appears or
the particular tone is played. The Tracking task is a test of
the ability of eye-hand coordination. The testee has
to pursue manually with the mouse pointer a ran-
domly moving target on the computer screen. The
target is presented among 10 other moving objects.
To complicate this task, the movements of the mouse
pointer are delayed. The results of each test are saved
as integer variables in a tab-separated text file.
The neuropsychological tests and the analysis
tool for generating statistical reports are imple-
mented on a notebook computer running a Debian
GNU/Linux system. The response panel is con-
nected to the computer with a USB data acquisition
board (USB-DUX). Using a notebook guarantees mo-
bility and the neuropsychological examinations can
be arranged in almost every room in our clinic.
The motivation for the present project was the
need for a tool which could be easily operated by
psychologists, nurses, physicians, etc., to generate,
immediately after testing, a statistical report based on
the results of the neuropsychological tests.
Handling and functional range
A screen-shot of the main window is shown in Fig-
ure 1. A click on the Help button starts a PDF viewer
showing an instruction manual, and the Quit button
closes the application.
The process to generate a statistical report with
our tool can be summarized in four steps:
1. The user selects with radio buttons the neu-
ropsychological test for which a report should
be generated.
2. The user browses the file system and chooses
one or more raw data files associated with
the selected neuropsychological test. Choos-
ing data files from several testees as well as re-
peated measurements of the same testee is pos-
sible.
3. The user can add information about the tes-
tee and investigator not provided by the data
files from the neuropsychological test-battery,
for example, names, diagnosis, date of birth,
etc.
4. When all raw data files are selected, the user
generates the statistical report with a single
mouse click.
Figure 1: Screen shot of the main window. The label on top of the Data frame indicates that no data files are
selected.
The main window has four frames with several
buttons. The following paragraphs describe briefly
the functions and use of the program.
Data. The first frame, Data, integrates buttons for
loading and exporting data files. On top of this
group, information about how many data files were
loaded by the user and for which test the report will
be generated is displayed.
The data frame can be exported to a tab-separated
text file with a click on the Export data button. The
variables in the data frame could be inspected by a
click on Variables .
A click on the Create Data button opens the dia-
log shown in Figure 2. The dialog provides buttons
for browsing the filesystem, selecting, deleting and
inspecting raw data files saved by the neuropsycho-
logical test-battery. Once a file is selected, its name
and path are shown in the list.
The callback function for the Add button opens a
file selection dialog to browse the file system. With
Remove the user can delete a marked data file from
the list. The File Info button calls a function to dis-
play the content of a marked data file.
The Load button calls a function which does sev-
eral computations. First, duplicate entries are iden-
tified. Then the selected files shown in the list wid-
get are merged into a data frame. Clicking on the
Cancel button closes the dialog without applying
any changes.
Test. The Test frame is a set of several radio but-
tons. The value of the activated radio button defines
the Sweave file associated with a statistical report.
The value is also used to create the window title and
label in Figure 2.
Report. The Create Report button calls the function
responsible for constituting the correct command
generating the statistical report. When the report
is created successfully, a PDF viewer will be started
showing the generated report.
Personal Data. Clicking the button in this frame
opens a new window where the user can add infor-
mation not provided by the raw data from the test-
battery (see Figure 3). These data concern the name
of the testee, diagnosis and date of birth, the values
of medical parameters, the investigator’s name, and
Figure 2: The dialog for creating the data frame with some raw data files selected. The last file in the list is
marked by a mouse click.
Figure 3: The dialog for adding additional information about testee and investigator.
possible remarks. Only information that is present is
included in the statistical report.
The red label in the Personal Data frame in the
main window (see Figure 1) changes to a green col-
ored label "Personal data loaded" when any text is sup-
plied. When all characters from the text entries are
deleted, the label color changes back to red and the
label text to "No personal data loaded."
Variables and statistics
The results of each neuropsychological test are saved
as integer variables in a tab-separated text file, one
row per trial and one column per variable. The data
files are saved in separate directories, one directory
for each neuropsychological test. The amount of
saved data depends on the neuropsychological test.
Examples of saved variables are the reaction time a
testee needed to respond to the appearance of a stim-
ulus, the number of stimuli on the screen the tes-
tee had to attend to, the mode of the response, e. g.,
wrong, correct or false alarm (that is the testee re-
ported that there was a target to respond to, but in-
deed there was none), or the distance in pixels be-
tween the mouse pointer and the target to pursue.
Because all variables are saved as integers, the
program recodes integer variables representing a
grouping variable into a factor variable, e. g., 0/1 to
wrong/correct response.
The raw data files from the test-battery provide
no information about the ID (Identification Num-
ber) of the testee or date of psychological examina-
tion. But the file name does: The ID, date and num-
ber of the examination are extracted from the file-
name. The names of the data files have the pat-
tern id_date_time.dat, e. g. 004_041110_1513.dat
for testee 004 tested on 10 November 2004 at 15:13
o’clock (see marked file in Figure 2). A function splits
the filename at the underscores and builds three fac-
tor variables coding the ID, date and time of exam-
ination and the session, that is, the number of the ex-
amination. The first part of the filename is evaluated
only up to the first dot ".", e. g. 001 and
001.MSUD result in the same ID 001. This gives the
investigator the possibility to include some more in-
formation into the file name without disturbing the
computation of the ID, date and session factors.
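The convention can be illustrated with a short sketch; the file name is taken from the example above, and the code only illustrates the splitting rule, it is not the function used in the tool:

## A minimal sketch of the filename convention id_date_time.dat.
fname <- "004_041110_1513.dat"
parts <- strsplit(sub("\\.dat$", "", fname), "_")[[1]]
id    <- strsplit(parts[1], ".", fixed = TRUE)[[1]][1]  # ID up to the first dot
date  <- parts[2]   # "041110", i.e. 10 November 2004
time  <- parts[3]   # "1513",   i.e. 15:13 o'clock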
The statistical reports are compiled using the ba-
sic LaTeX article class and have about three to five
pages, depending on the neuropsychological test.
Each report consists of a fixed part, that is a header
and a basic analysis, and a second part covering a de-
tailed statistical description of the variables acquired
by the neuropsychological test.
The header contains a logo, the address of our
hospital, a title, and the issue date of the report. Then
variables from the text entries of the personal data
dialog (see Figure 3) are included, when available.
The subsequent basic analysis lists the number of tes-
tees, trials and dates of examination, and names of
the data files the statistics are based on.
The second part describes the results of the neu-
ropsychological test by tables and graphics, mostly
scatter plots and trellis graphics. The statistical re-
ports were designed to give a first glance into the
data, and therefore they have a primarily descriptive
character. In most cases, for each variable of inter-
est (e. g. the reaction time from the Reaction test) the
mean, median and standard deviation across testees
and number of examinations are computed and dis-
played graphically. So a report allows the compari-
son of the results from one or more testees at several
dates of examination.
Some remarks about the implemen-
tation
In principle, our tool has two parts: An R/TclTk
function containing all routines for the graphical user
interface (GUI) and data manipulation, and several
Sweave style files with R/LaTeX routines for dynamic
generation of statistical reports. A simple flowchart
of the tool is shown in Figure 4.
The GUI was built with the tcltk package that
comes with R. For a short but comprehensive intro-
duction to the tcltk package see Dalgaard (2001), the
Homepage of Wettenhall and of course the R help
pages.
The R/TclTk function holds the rank of a "master"
function. It provides dialogs for the user and defines
and calls R functions to generate a data frame suit-
able to load within an Sweave file. Further it con-
stitutes the correct command to generate the desired
statistical report.
The code chunks in Sweave files have no access
to variables within the running "master" function. To
supply the Sweave files with all necessary data, the
"master" function saves an R image which includes
all variables and a data frame created from the se-
lected raw data files. The variables are, for example,
names, investigator, and diagnosis that come from
the dialog in Figure 3, as well as some other vari-
ables that might be used to create a statistical report,
e. g., the value of the activated radio button in the
Test frame in the main window, and the path and
names of the selected raw data files.
To generate the statistical report, two Sweave
processes are started sequentially. Both Sweave
processes load the previously saved R image.
An Sweave process first evaluates the test-specific
Sweave file; the second Sweave command processes
the template Sweave file and includes the previously
generated test-specific LaTeX file. The resulting
LaTeX file is then processed by pdfLaTeX using the
texi2dvi() function in package tools. When the re-
port is generated successfully, a PDF reader showing
the statistical report is started.
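In outline, the report generation could look like the following sketch; the file and directory names are hypothetical and only illustrate the sequence of calls described above:

## A minimal sketch of the report pipeline (file names are made up).
library(tools)
load("shared/report.RData")            # R image saved by the "master" function
Sweave("reaction/reaction.Rnw")        # test-specific file -> reaction.tex
Sweave("shared/template.Rnw")          # template that includes the test-specific .tex
texi2dvi("template.tex", pdf = TRUE)   # run pdfLaTeX on the resulting file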
Because the program has to deal with different
Sweave and R files, each report is built in a sepa-
rate directory which includes an Sweave file for the
test-specific routines and sometimes an additional
file with R functions. The template Sweave file and
the R image are stored in a shared directory. In this
way, a more maintainable structure can be achieved.
Summary
Our purpose was to introduce a tool we developed
with a combination of R, Sweave and the tcltk pack-
age and which we now use routinely in our everyday
clinical practise. The sophisticated statistical analysis
and graphic capabilities of R can be bundled with
Sweave and TclTk to build an application allowing
the user to generate a comprehensive report of a neu-
ropsychological examination immediately after test-
Figure 4: Simple flowchart of the application.
ing, even without statistical expertise.
This software is currently in a testing phase and
under active development. We still are adapting
some of the statistical reports to the needs of the
users and maybe we will expand the number of tests
of our neuropsychological test-battery and therewith
the range of statistical reports our tool supports. But
so far, we can point out that using R/Sweave/TclTk
is much more powerful compared to data analysis
of test results in the past. Some of the neuropsycho-
logical tests we briefly reported about ran formerly
in an old version of MS-DOS, and subsequent data
analysis was a bit hampered because exporting the
raw data from an examination to a statistical analy-
sis system was necessary. After reprogramming the
neuropsychological tests on a modern computer sys-
tem with the facilities of R, it was a relatively sim-
ple matter to combine the neuropsychological test-
battery with a convenient tool for exploring the data.
One further improvement we are considering is
an interface for connecting our tool to a data base sys-
tem. This offers the opportunity to save and reload
easily the data from an examination.
Acknowledgment
We would like to thank an anonymous reviewer who
gave us helpful remarks on a previous version of this
manuscript, as well as suggestions about how to im-
prove our code.
Bibliography
D. Betebenner. Using control structures with Sweave.
R News, 5(1):40–44, May 2005. URL http://CRAN.
R-project.org/doc/Rnews/.
P. Burgard, F. Rey, A. Rupp, V. Abadie, and J. Rey.
Neuropsychologic functions of early treated pa-
tients with phenylketonuria, on and off diet: re-
sults of a cross-national and cross-sectional study.
Pediatric Research, 41:368–374, 1997.
P. Dalgaard. A primer on the R-Tcl/Tk package.
R News, 1(3):27–31, September 2001. URL http:
//CRAN.R-project.org/doc/Rnews/.
F. Leisch. Sweave: Dynamic generation of statistical
reports using literate data analysis. Presented at
”Österreichische Statistiktage 2002”, Vienna, Aus-
tria, October 23–25 2002a.
F. Leisch. Sweave, part I: Mixing R and LaTeX. R
News, 2(3):28–31, December 2002b. URL http:
//CRAN.R-project.org/doc/Rnews/.
F. Leisch. Sweave user manual. URL http://www.
ci.tuwien.ac.at/~leisch/Sweave. Institut für
Statistik und Wahrscheinlichkeitstheorie, Technis-
che Universität Wien, Vienna, Austria, 2002c.
F. Leisch. Sweave, part II: Package vignettes. R
News, 3(2):21–24, October 2003. URL http://
CRAN.R-project.org/doc/Rnews/.
R Development Core Team. R: A Language and Envi-
ronment for Statistical Computing. R Foundation for
Statistical Computing, Vienna, Austria, 2005. URL
http://www.R-project.org. ISBN 3-900051-07-0.
USB-DUX. URL http://www.linux-usb-daq.co.
uk/.
J. Wettenhall. URL http://bioinf.wehi.edu.au/
~wettenhall/RTclTkExamples/.
Sven Garbade and Peter Burgard
University Hospital for Pediatric
and Adolescent Medicine
Department of General Pediatrics
METABNET - Epidemiological Register
D-69120 Heidelberg, Germany
www.metabnet.de
Sven.Garbade@med.uni-heidelberg.de
changeLOS: An R-package for change in
length of hospital stay based on the
Aalen-Johansen estimator
by M. Wangler, J. Beyersmann, and M. Schumacher
Introduction
Length of hospital stay (LOS) is used to assess the
utilization of hospital resources, the costs and the
general impact of a disease. Change in LOS due
to a complication is frequently used to assess the
impact and the costs of a complication. Promi-
nent examples include nosocomial infections and ad-
verse drug events. In a data analysis, it is impor-
tant to regard the timing of events: A complica-
tion can only have an effect on LOS, once it has oc-
curred. Classical methods of two-group compari-
son, including matching, are commonly used, but
inadequate. Multi-state models provide a suitable
framework for analyzing change in LOS. For an
overview on multi-state models cf. (Andersen and
Keiding,2002). Change in LOS has been considered
in a multi-state framework by (Schulgen and Schu-
macher,1996), (Schulgen et al.,2000), (Beyersmann
et al.,2006) and (Beyersmann,2005). We introduce
an R package to adequately analyze change in LOS.
The R package is based on multi-state model theory.
The main features are:
• Describe multi-state models
• Compute and plot transition probabilities (Aalen-Johansen estimator)
• Compute and plot change in LOS
The estimation techniques used are fully nonpara-
metric, allowing for a time-inhomogeneous Markov
process, i.e. the future development of the pro-
cess depends only on the state currently occupied;
the Markov assumption may be dropped for estima-
tion of state occupation probabilities (Glidden,2002).
However, at present, adjusting for covariates is not
included. In both these aspects, the package change-
LOS differs from the package msm.
Simple example: Competing Risks
A simple example for a multi-state model is the com-
peting risks model, shown in the following figure:
(Diagram: state 0 “Alive”, with transitions to state 1 “Discharge” and state 2 “Death”.)
Figure 1: Multi-state model: competing risks
With our R package changeLOS it is possible to:
• describe the state names and possible transitions (function msmodel(...)):
      0     1     2
0  TRUE  TRUE  TRUE
1 FALSE  TRUE FALSE
2 FALSE FALSE  TRUE
• compute the Aalen-Johansen estimator for the matrix of transition probabilities P(s,t) over a time interval (s,t] (function aj(...)). Here these are the cumulative incidence functions P01(0,t) and P02(0,t)
• plot the temporal dynamics of the data illustrated by transition probabilities (function plot.aj(...)):
(Panels: “Estimate of P01(0,t)” and “Estimate of P02(0,t)”, plotted against time t.)
Figure 2: Cumulative Incidence Functions
Change in LOS
The model
The following model is used to analyze change in
LOS due to a complication, or in general, an inter-
mediate event:
(Diagram: state 0 “No complication”, with transitions to state 1 “Complication”, state 2 “Discharge” and state 3 “Death”; from state 1 there are transitions to states 2 and 3.)
Figure 3: Multi-state model: change in LOS
The first step is to describe the multi-state model
with the function msmodel(...):
## possible transitions
> tra <- matrix(F, 4, 4)
> diag(tra) <- T
> tra[1, ] <- T
> tra[2, 3:4] <- T
## describe the multi-state model
> my.model <- msmodel(c("0","1","2","3"),
tra, cens.name="cens")
The data set
A real data set (Grundmann et al.,2005) comes with
the package and can be loaded by
> data(los.data)
> los.data[1:4,]
and looks like the following example:
adm.id j.01 j.02 j.03 j.12 j.13 cens
1 12 Inf 3 Inf Inf Inf Inf
2 32 Inf Inf 20 Inf Inf Inf
3 41 20 Inf Inf 40 Inf Inf
4 175 6 Inf Inf Inf Inf 26
. . . . . . .
los.data has one row per patient and includes
the variables j.01, j.02, j.03, j.12, j.13 and
cens. The variables starting with ’j’ will hold the
time when a respective transition (or ’jump’) was
observed. E.g., the third patient experiences a com-
plication 20 days after admission to hospital and is
being discharged 40 days after admission. Entries
for non-observed transitions have to be set to infin-
ity, which is represented in R by Inf. In addition, if
a patient is censored, i.e. still in hospital by the end
of the study, the variable cens is set to the time when
censoring occurred. If the patient is not censored, it
is set to Inf.
Before further computations, the patient-orientated
data set los.data must be transformed to a
transition-orientated data set:
> my.observ <- prepare.los.data(x=los.data)
Then the data set looks like:
id from to time
1 12 0 2 3
2 32 0 3 20
3 41 0 1 20
4 41 1 2 40
5 175 0 1 6
6 175 1 cens 26
. . . . .
Now each row represents a transition. E.g. for the
patient with the id 41 there are two entries: one entry
for the transition from state 0 to state 1 and one entry
for the transition from state 1 to state 2.
The Aalen-Johansen estimator
> my.trans <- trans(model=my.model,
observ=my.observ)
computes the Aalen-Johansen estimator for the ma-
trix of transition probabilities P(u−, u) for all ob-
served transition times u. The entry (l, m) of the
matrix denotes the estimated probability that state m
has been reached by time u given state l has been oc-
cupied just before time u. The estimator for P(u−, u)
is described by (Andersen et al., 1993) at the bottom
of p. 288. Non-diagonal entries (h, j) are given as
the number of observed transitions from state h to
state j, divided by the number of individuals in state
h just prior to time u. The diagonal elements are cho-
sen such that the sum of each row equals 1.
The Aalen-Johansen estimator for P(s, t) can then
be computed as the matrix product of all matrices
P(u−, u) for the transition times u in (s, t]:
> my.aj <- aj(my.trans,s=0,t=80)
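The matrix product itself is easy to sketch outside the package; here P.u is assumed to be a list of the estimated matrices P(u−, u), ordered by the transition times in (s, t]:

## A minimal sketch, not the package code: the Aalen-Johansen
## estimate of P(s,t) as the ordered product of the P(u-,u) matrices.
aj.product <- function(P.u) Reduce("%*%", P.u)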
The temporal dynamics of the data can be illustrated
by the transition probabilities:
> plot(my.aj,c("0","0","0","0","1","1"),
c("0","1","2","3","2","3"))
Figure 4: Aalen-Johansen estimator for P00(s,t),
P01(s,t), P02(s,t), P03(s,t), P12(s,t), P13(s,t)
Expected change in LOS
The function clos estimates the expected change in
length of stay (LOS) associated with an intermedi-
ate event (IE), using the Aalen-Johansen estimator
for the matrix of transition probabilities.
> los <- clos(model=my.model,observ=my.observ)
los will be an object of class clos. This object is a list which contains, among others, the following components:
cLOS: change in LOS
e.given.1: estimates of E(LOS | state 1 on day t) for each day
e.given.0: estimates of E(LOS | state 0 on day t) for each day
weights: weights for the weighted average
trans: matrices of transition probabilities P(u-, u) for all observed transition times u
The expected change in LOS is computed as the
difference between e.given.1 and e.given.0 for
each day and the overall change in LOS cLOS as a
weighted average of these quantities. Figure 5 shows
the graph obtained by plotting change in LOS:
> plot(los, xlim=c(0,80), ylim.1=c(0,120))
Figure 5: Expected change in LOS due to a complication, visualized: the expected LOS over time t for patients with and without the intermediate event by time t, together with the weights used for the weighted average
> summary(los)
The summary for objects of class clos displays the following values (absolute and in percent):
• Number of observed patients
• Number of patients being discharged
• Number of patients who die
• Number of patients being censored
• Number of patients who experienced the intermediate event (IE)
• Number of patients who experienced the IE and were discharged
• Number of patients who experienced the IE and died
• Number of patients who experienced the IE and were censored
Progressive Disability Model
Change in LOS and impact of an intermediate event
on mortality can also be investigated in a so-called
progressive disability model, shown in Figure 6.
This multi-state model can be described and the
Aalen-Johansen estimator for transition probabilities
can be computed with our R package changeLOS:
Figure 6: Progressive disability model with states 0 (no complication), 1 (complication), 2 (discharge), 3 (death), 4 (discharge) and 5 (death)
Here, the state an individual ends in carries information on whether or not the individual has passed through the intermediate state.
Summary
In order to get realistic information on change in
LOS due to a complication, it is crucial to ade-
quately account for the timing of events. This can
be achieved as outlined by Beyersmann et al. (2006).
However, while tools of standard survival analysis
have become an integral part of most statistical soft-
ware packages, the multi-state methods envisaged
by Beyersmann et al. (2006) and Beyersmann (2005)
are less accessible. We have introduced the R pack-
age changeLOS so that it is freely available, easy to
use and allows people to visualize the situation at
hand.
changeLOS is an R package to:
• describe any multistate model,
• compute and visualize transition probabilities (Aalen-Johansen estimator), and
• compute and visualize change in LOS.
Bibliography
P. Andersen, Ø. Borgan, R. D. Gill, and N. Kei-
ding. Statistical models based on counting pro-
cesses. Springer Series in Statistics. New York, NY:
Springer., 1993.
P. Andersen and N. Keiding. Multi-state models for
event history analysis. Statistical Methods in Medi-
cal Research, 11(2):91–115, 2002.
J. Beyersmann. On change in length of stay associated
with an intermediate event: estimation within multi-
state models and large sample properties. PhD the-
sis, University of Freiburg, Faculty of Mathematics
and Physics, Germany, 2005. URL www.freidok.
uni-freiburg.de/volltexte/1843/.
J. Beyersmann, P. Gastmeier, H. Grundmann, S. Bär-
wolff, C. Geffers, M. Behnke, H. Rüden, and
M. Schumacher. Use of Multistate Models to As-
sess Prolongation of Intensive Care Unit Stay Due
to Nosocomial Infection. Infection Control and Hos-
pital Epidemiology, 27:493-499, 2006.
D. Glidden. Robust inference for event probabili-
ties with non-Markov data. Biometrics, 58:361–368,
2002.
H. Grundmann, S. Bärwolff, F. Schwab, A. Tami,
M. Behnke, C. Geffers, E. Halle, U. Göbel,
R. Schiller, D. Jonas, I. Klare, K. Weist, W. Witte,
K. Beck-Beilecke, M. Schumacher, H. Rüden, and
P. Gastmeier. How many infections are caused
by patient-to-patient transmission in intensive care
units? Critical Care Medicine (in press), 2005.
G. Schulgen, A. Kropec, I. Kappstein, F. Daschner,
and M. Schumacher. Estimation of extra hospital
stay attributable to nosocomial infections: hetero-
geneity and timing of events. Journal of Clinical Epi-
demiology, 53:409–417, 2000.
G. Schulgen and M. Schumacher. Estimation of pro-
longation of hospital stay attributable to nosoco-
mial infections. Lifetime Data Analysis, 2:219–240,
1996.
Matthias Wangler
mw@imbi.uni-freiburg.de
Balloon Plot
Graphical tool for displaying tabular data
by Nitin Jain and Gregory R. Warnes
Introduction
Numeric data is often summarized using rectangular
tables. While these tables allow presentation of all
of the relevant data, they do not lend themselves to
rapid discovery of important patterns. The primary
difficulty is that the visual impact of numeric values
is not proportional to the scale of the numbers repre-
sented.
We have developed a new graphical tool, the
“balloonplot”, which augments the numeric values
in tables with colored circles with area proportional
to the size of the corresponding table entry. This
visually highlights the prominent features of data,
while preserving the details conveyed by the nu-
meric values themselves.
In this article, we describe the balloonplot, as
implemented by the balloonplot function in the
gplots package, and describe the features of our im-
plementation. We then provide an example using the
“Titanic” passenger survival data. We conclude with
some observations on the balloonplot relative to the
previously developed “mosaic plot”.
Function description
The balloonplot function accepts a table (to be dis-
played as found) or lists of vectors for x (column
category), y (row category) and z (data value) from
which a table will be constructed.
The balloonplot function creates a graphical ta-
ble where each cell displays the appropriate numeric
value plus a colored circle whose size reflects the rel-
ative magnitude of the corresponding component.
The area of each circle is proportional to the fre-
quency of data. (The circles are scaled so that the
circle for the largest value fills the available space in the
cell.)
As a consequence, the largest values in the table
are “spotlighted” by the biggest circles, while smaller
values are displayed with smaller circles. Of course,
circles can only have positive radius, so the radii of circles for cells with negative values are set to zero.
(A warning is issued when this “truncation” occurs.)
Of course, when labels are present on the table
or provided to the function, the graphical table is
appropriately labeled. In addition, options are pro-
vided to allow control of various visual features of
the plot:
• rotation of the row and column headers
• balloon color and shape (globally or individually)
• number of displayed digits
• display of entries with zero values
• display of marginal totals
• display of cumulative histograms
• x- and y-axes group sorting
• formatting of row and column labels
• traditional graphics parameters (title, background, etc.)
Example using the Titanic data set
For illustration purposes, we use the Titanic data
set from the datasets package. Titanic provides
survival status for passengers on the tragic maiden
voyage of the ocean liner “Titanic”, summarized ac-
cording to economic status (class), sex, and age.
Typically, the number of surviving passengers is shown in tabular form, as in Figure 1.
BalloonPlot : Surviving passengers

               Class
Age    Sex      1st  2nd  3rd  Crew
Child  Male       5   11   13     0
Child  Female     1   13   14     0
Adult  Male      57   14   75   192
Adult  Female   140   80   76    20

Figure 1: Tabular representation of the surviving population by gender and age
Figure 1 was created by calling balloonplot with
the balloon color set to match the background color
and most options disabled. Note that one must ac-
tively focus on the individual cell values in order to
see any pattern in the data.
Now, we redraw the table with light-blue circles
(‘balloons’) superimposed over the numerical values
(Figure 2). This is accomplished using the code:
library(gplots)
data(Titanic)
# Convert to 1 entry per row format
dframe <- as.data.frame(Titanic)
# Select only surviving passengers
survived <- dframe[dframe$Survived=="Yes",]
attach(survived)
balloonplot(x=Class,
y=list(Age, Sex),
z=Freq,
sort=TRUE,
show.zeros=TRUE,
cum.margins=FALSE,
main=
"BalloonPlot : Surviving passengers"
)
title(main=list("Circle area is proportional to\
number of passengers",
cex=0.9),
line=0.5)
detach(survived)
Figure 2: Balloon plot of surviving individuals by class, gender and age. Circle area is proportional to number of passengers; row and column totals are shown in the margins.
With the addition of the blue “spotlights”, whose
area is proportional to the magnitude of the data
value, it is easy to see that only adult females and
adult male crew members survived in large num-
bers. Also note the addition of row and column
marginal totals.
Of course, the number of surviving passengers
is only half of the story. We could create a simi-
lar plot showing the number of passengers who did
not survive. Alternatively, we can simply add sur-
vival status as another variable to the display, setting
the color of the circles to green for passengers who
survived, and magenta for those who did not (Fig-
ure 3). This conveys considerably more information
than Figures 1 and 2 without substantial loss of clar-
ity. The large magenta circles make it clear that most
passengers did not survive.
To further improve the display, we add a visual
representation of the row and column sums (Figure
4). This is accomplished using light grey bars behind
the row and column headers. The length of each bar
is proportional to the corresponding sum, allowing
rapid visual ascertainment of their relative sizes. We
have also added appropriately colored markers ad-
jacent to the headers under “Survived” to emphasize
the meaning of each color.
Figure 3: Balloon plot of Titanic passengers by gender, age and class. Green circles represent passengers who survived and magenta circles represent the passengers who did not survive. Circle area is proportional to number of passengers.
attach(dframe)
colors <- ifelse( Survived=="Yes", "green",
"magenta")
balloonplot(x=Class,
y=list(Survived, Age, Sex),
z=Freq,
sort=FALSE,
dotcol=colors,
show.zeros=TRUE,
main="BalloonPlot : Passenger Class \
by Survival, Age and Sex")
points( x=1, y=8, pch=20, col="magenta")
points( x=1, y=4, pch=20, col="green")
title(main=list("Circle area is proportional to \
number of passengers", cex=0.9), line=0.5)
detach(dframe)
It is now easy to see several facts:
• A surprisingly large fraction (885/2201) of passengers were crew members
• Most passengers and crew were adult males
• Most adult males perished
• Most women survived, except in 3rd class
• Only 3rd class children perished
Figure 4: Balloon plot of all the passengers of the Titanic, stratified by survival, age, sex and class
Perhaps the most striking fact is that survival is
lowest among 3rd class passengers for all age and
gender groups. It turns out that there is a well known
reason for this difference in survival. Passengers
in 1st and 2nd class, as well as crew members, had
better access to the lifeboats. Since there were too
few lifeboats for the number of passengers and crew,
most women and children among the 1st class, 2nd
class and crew found space in a lifeboat, while many
of the later arriving 3rd class women and children
were too late: the lifeboats had already been filled
and had moved away from the quickly sinking ship.
Discussion
Our goals in developing the balloonplot were
twofold: First, to improve the ability of viewers to
quickly perceive trends. Second, to minimize the
need for viewers to learn new idioms. With these
goals in mind, we have restricted ourselves to sim-
ple modifications of the standard tabular display.
Other researchers have pursued more general ap-
proaches to the visual display of tabular data. (For
a review of that work, see Hartigan and Kleiner (1981) or Friendly (1992).) One of the most popular methods developed by these researchers is the mosaic plot (Snee, 1974).
We have previously experimented with mosaic
plots. Unfortunately, we found that they do not lend
themselves to rapid ascertainment of trends, partic-
ularly by untrained users. Even trained users find
that they must pay careful attention in order to de-
code the visual information presented by the mosaic
plot. In contrast, balloonplots lend themselves to
very quick perception of important trends, even for
users who have never encountered them before.
While there are, of course, tasks for which the mosaic plot is preferable, we feel that the balloonplot serves admirably to allow high-level patterns to be quickly perceived by untrained users.
Conclusion
Using the well-worn Titanic data, we have shown
how balloonplots help to convey important aspects
of tabular data, without obscuring the exact numeric
values. We hope that this new approach to visu-
alizing tabular data will assist other statisticians in
more effectively understanding and presenting tab-
ular data.
We wish to thank Ramon Alonso-Allende
allende@cnb.uam.es for the discussion on R-help
which led to the development of balloonplot, as
well as for the code for displaying the row and col-
umn sums.
Bibliography
M. Friendly. Graphical methods for categorical data.
Proceedings of SAS SUGI 17 Conference, 1992.
J. A. Hartigan and B. Kleiner. Mosaics for contin-
gency tables. Computer Science and Statistics: Pro-
ceedings of the 13th Symposium on the Interface. New
York: Springer-Verlag, 1981.
R. D. Snee. Graphical display of two-way contin-
gency tables. The American Statistician, 28:9–12,
1974.
Gregory R. Warnes, Pfizer Inc., USA
gregory.r.warnes@pfizer.com
Nitin Jain, Smith Hanley Inc, USA
nitin.jain@pfizer.com
Drawing pedigree diagrams with R and
graphviz
by Jing Hua Zhao
Human genetic studies often involve data collected from families, and graphical display of such data is useful. The wide interest in such data over the years has led to many software packages, both commercial and non-
commercial. A recent account of these packages is
available (Dudbridge et al.,2004), and a very flex-
ible package Madeline (http://eyegene.ophthy.
med.umich.edu/madeline/index.html) is now re-
leased under the GNU General Public License. A
comprehensive list of many packages, including
the package LINKAGE (Terwilliger and Ott,1994)
for human parametric linkage analysis and GAS
(Genetic Analysis System, http://users.ox.ac.uk/
~ayoung/gas.html) for some other analyses, can be
seen at the linkage server at Rockefeller University
(http://linkage.rockefeller.edu).
Here I describe two functions in R that are
able to draw pedigree diagrams; the first be-
ing plot.pedigree in kinship developed for S-
PLUS by Terry Therneau and Beth Atkinson and
ported to R by the author, and the second
pedtodot in gap based on David Duffy’s gawk
script (http://www2.qimr.edu.au/davidD/Course/
pedtodot) that requires graphviz (http://www.
graphviz.org). Both are easy to use and can draw
many pedigree diagrams quickly to a single file, and can therefore serve as alternatives to some programs that only offer interactive use.
Representation of pedigrees
The key element in storing pedigrees in a database is the so-called family trio, containing the individual's, father's and mother's IDs. Founders,
namely individuals whose parents are not in the
pedigree, are set to be zero or missing. Individual’s
gender (e.g. 1=male, 2=female) is included as auxil-
iary information, together with pedigree ID in order
to maintain multiple pedigrees in a single database,
each record of which indicates a node in the pedigree
graph.
For instance, information for pedigree numbered
10081 in genetic analysis workshop 14 (GAW14,
http://www.gaworkshop.org) is shown as follows.
pid id father mother sex affected
10081 1 2 3 2 2
10081 2 0 0 1 1
10081 3 0 0 2 2
10081 4 2 3 2 2
10081 5 2 3 2 1
10081 6 2 3 1 1
10081 7 2 3 2 1
10081 8 0 0 1 1
10081 9 8 4 1 1
10081 10 0 0 2 1
10081 11 2 10 2 1
10081 12 2 10 2 2
10081 13 0 0 1 1
10081 14 13 11 1 1
10081 15 0 0 1 1
10081 16 15 12 2 1
Here all IDs are integers with obvious meanings
just described, and the variable affected indicates
if an individual is alcoholic (1=nonalcoholic, 2=alco-
holic) according to DSMIIIR and Feighner definition
ALDX1 in the dataset.
In human genetic linkage studies, this is also
called pre-makeped format since these IDs can also
be string variables, e.g. individuals’ names, and a
utility program makeped in LINKAGE can be used
to generate the serial integer IDs and perform simple
checks on errors in family structure(s).
Suppose this is kept in a text file called 10081.pre; we use
pre <- read.table("10081.pre",header=TRUE)
to read it into object pre.
The pedigree-drawing algorithm
Typically, in a pedigree diagram males and fe-
males are shown in squares and circles, respectively.
Spouses can form marriage nodes from which nodes
for children are derived. It is also customary to draw
pedigree diagrams top down, so that children at a
given generation could have children of their own
in the next generation. This implies that the concep-
tually simple algorithm for pedigree drawing would
involve sorting members of a pedigree by generation
and align members of the same generation horizon-
tally and those at different generations vertically. In
other words, the family is drawn as a directed graph
with members as nodes and ordered by their genera-
tion numbers. The algorithm could be more involved
if there are marriage loops in the family, i.e. overlap-
ping generations, or if the pedigree is too large to fit
in a single page. Interested readers can find more details on the algorithmic aspects of pedigree drawing in Tores and Barillot (2001).
Fortunately, there is software publicly available
that implements this algorithm. Among these the
most notable is graphviz (http://www.graphviz.
org), consisting of the programs dot, dotty, neato and
lneato. In the following section two implemen-
tations of the pedigree-drawing algorithm with or
without use of graphviz will be described.
Drawing pedigree diagrams with R
and graphviz
via plot.pedigree in kinship
Package kinship was developed for linear mixed
and mixed-effects Cox models of family data, but
it has function plot.pedigree for drawing pedigree
diagrams. As an S3 method for class 'pedigree', it can be called simply as plot() when supplied with an object of class 'pedigree'.
library(kinship)
attach(pre)
par(xpd=TRUE)
ped <- pedigree(id,father,mother,sex,affected)
plot(ped)
This gives Figure 1. We can use R devices such
as postscript to keep the diagram as an outside
file, e.g., postscript("10081.ps"); plot(ped);
dev.off(). Either postscript or pdf format can hold more than one pedigree diagram.
Figure 1: Pedigree 10081 by kinship
via pedtodot in gap
The diagram produced by plot.pedigree in
kinship is a still image so that nodes in the pedigree
graph can not be pulled and dragged. This might
not be trivial to implement. However, we can use
graphviz to achieve this. A nice feature of the package is its ability to interpret the dot language, an ASCII format that allows for text editing. Keeping this in mind, we can simply generate a dot file for the
given pedigree(s) using dot syntax.
library(gap)
pedtodot(pre)
By default, this generates 10081.dot which can be
used by dot, e.g. dot -Tps -ofigure2.ps 10081.dot giv-
ing Figure 2.
Figure 2: Pedigree 10081 by dot
Alternatively, we use
library(gap)
pedtodot(pre,dir="forward")
Besides dot, we can also use neato -Tps -o figure3.ps
10081.dot, giving a more liberal graph (Figure 3).
Its unusual look makes it appealing for exposing the peeling algorithm in the likelihood calculation of pedigrees.
Figure 3: Pedigree 10081 by neato
A single file containing all pedigree diagrams can be generated through the use of the R sink command and screen output via the sink=FALSE option in pedtodot. For example, given that all the
pre-makeped pedigrees are contained in the object
gaw14ped, we can use the following command,
sink("gaw14.dot")
pedtodot(gaw14ped,sink=FALSE)
sink()
to generate a complete list of pedigree diagrams
available from the web site http://www.mrc-epid.
cam.ac.uk/~jinghua.zhao/r-progs.htm.
Summary and further remarks
Further information about the two functions is avail-
able from the packages themselves. Note that
plot.pedigree in kinship does not require a pedigree ID to be specified, while pedtodot in gap does. Unlike plot.pedigree, pedtodot requires graphviz to visualize graphics, but its output can be edited with dotty, and be
printed out in multiple pages when a pedigree dia-
gram is too big to fit in a single page. Both can pro-
duce a set of pedigree diagrams in a single file.
Although makeped is a utility program in LINK-
AGE, this function is also available in package gap.
If the pedigree file is in post-makeped format, then
the option makeped=TRUE can be used. However,
pedtodot can also use string IDs, or a file in the so-called GAS format in which gender can take values 'm', 'f', etc.
Most pedigrees in current studies have moderate sizes; therefore sparse two-dimensional arrays are used to keep track of marriages and the children from them, which are in turn used in generating the dot script. This is less efficient than the original gawk script in that the latter uses individual IDs directly to index the marriage arrays, leading to shorter code and better use of memory. In addition, further improvement
ter use of memory. In addition, further improvement
is possible by introducing other features available in
the dot language.
It is notable that the R package sem has a function path.diagram to generate dot files, and the Bio-
conductor (http://www.bioconductor.org) pack-
age Rgraphviz also uses graphviz. If more such
packages appear, it is desirable to be familiar with
the dot language and/or have support for it in R. Al-
though as yet human genetic linkage and genome-
wide association analysis is not widely conducted
with R, this might change in the near future, as has
been demonstrated by the great success of the Bio-
conductor project. I believe the two R functions de-
scribed in this note will be very useful to researchers
in their genetic data analysis.
Bibliography
F. Dudbridge, T. Carver, and G. Williams. Peli-
can: pedigree editor for linkage computer analysis.
Bioinformatics, 20(14):2327–2328, 2004.
F. Tores and E. Barillot. The art of pedigree drawing:
algorithmic aspects. Bioinformatics, 17(2):174–179,
2001.
J. Terwilliger and J. Ott. Handbook of Human Genetic Linkage. The Johns Hopkins University Press, Baltimore, 1994.
Jing Hua Zhao
MRC Epidemiology Unit
jinghua.zhao@mrc-epid.ac.uk
Non-Standard Fonts in PostScript and PDF
Graphics
by Paul Murrell and Brian Ripley
Introduction
By default, all of the text produced by R graphics de-
vices uses a sans-serif font family (e.g., Helvetica),
and in a PostScript or PDF plot only text in West-
ern European languages has been handled until re-
cently. This article describes several ways in which
the range of font families has been extended for R
2.3.0.
It has been possible to change the overall font
family that is used to draw text within a plot (Mur-
rell,2004), but the choice of font family for PDF and
PostScript devices was largely restricted to the set of
Adobe Core 14 fonts—Helvetica (4 faces), Times (4
faces), Courier (4 faces), Symbol, and ZapfDingbats
(see Table 1). As from R 2.3.0 much more choice is
available.
The examples in this article are written follow-
ing the conventions needed for a modern Unix-alike
operating system such as Linux, but will work with
minor changes (for example to the locale names) on
most other R platforms including Windows.
A little terminology will be helpful. A charac-
ter is an abstract concept, and a glyph is a visual
representation of a character. Characters can have
more than one representation, for example crossed
and uncrossed sevens, and the variants on Greek let-
ters much loved by mathematicians (see ?plotmath).
This matters as East Asian languages may write the
same character in different ways.1
Minor conveniences
The default font family on a PDF or PostScript device
can be controlled using the family argument. For ex-
ample, the expression pdf(family="Times") opens a
PDF device with the default font family set to Times
(a serif font). This is the simplest way to select a font
family, if the same font (possibly in different faces,
e.g. bold) is to be used for all of the text in a plot.
Prior to R 2.3.0, only a fixed set of font families
could be specified via the family argument (mostly
related to the Adobe Core 14 fonts), but it is now pos-
sible to specify any valid font family as the default.
For example, the expression pdf(family="mono")
opens a PDF device with the default font family set
to be a mono-spaced font (by default Courier on PDF
and PostScript devices).
The section on “Font databases” later in this arti-
cle will clarify what constitutes a “valid” font family.
Changing font family in-line
While it has been possible for some time to be able
to change the font family within a plot (e.g., have the
title in Times and the remaining labels in Helvetica),
in the traditional graphics system, this change could
only be made via the par() function.
It is now also possible to specify the family ar-
gument in low-level traditional graphics functions.
For example, if the default font family was sans-serif,
adding a label to a plot with a serif font used to re-
quire code of the form ...
> op <- par(family="serif")
> text("A label")
> par(op) # restore the setting
. . . but now the same thing can be achieved by the
following code.
> text("A label", family="serif")
Internationalization
As Martin Maechler likes to say, R has 'forever' supported ISO Latin 1, the character set of the major2 Western European languages. Further interna-
tionalization features were introduced into R in ver-
sion 2.1.0 (Ripley,2005), which allowed users in non-
Western-European locales to use variable names and
produce output in their native language.
There was good support for graphical devices on
Windows, and as from R 2.3.0 on NT-based versions
of Windows it is even possible to change to another
language and output in that language.
The support for the X11() graphics device was as
good as the X server could provide, often limited by
1http://en.wikipedia.org/wiki/Han_unification
2The definition is rather circular, as those are the languages written in Latin-1. Icelandic is the most prominent exception.
Table 1: Predefined font families for the PDF and PostScript devices which cover most of the Adobe Core 14
fonts. The Adobe Core fonts are usually available with Adobe (Acrobat) Reader and in PostScript printers and
viewers. The ZapfDingbats font is always included in PDF files produced by R, being used to draw small
circles.
"Helvetica" "Times" "Courier"
Helvetica Times Courier
Helvetica-Bold Times-Bold Courier-Bold
Helvetica-Oblique Times-Italic Courier-Oblique
Helvetica-BoldOblique Times-BoldItalic Courier-BoldOblique
Symbol Symbol Symbol
the availability of glyphs unless extra fonts were installed and selected.
These features did not at first include graphical
output on PDF or PostScript devices, but this has
been much improved in R 2.3.0, with support for
many Eastern European and Asian locales.
European and Cyrillic fonts
R now automatically selects an appropriate encod-
ing for PDF and PostScript output based on your lo-
cale. In multi-byte locales such as those using UTF-
8, an appropriate single-byte encoding is chosen, so
now the PDF and PostScript devices allow encodings
other than ISO Latin 1.
For example, consider the character ń, which is
needed in a Polish locale to write ‘hello’. This char-
acter is part of the ISO Latin 2 character set. For
demonstration purposes, we will temporarily switch
to a Polish locale using the following call:
> Sys.setlocale(category="LC_CTYPE",
locale="pl_PL.utf8")
Now when we start a PDF or PostScript device, it au-
tomatically selects an ISO Latin 2 encoding so that the
appropriate characters are available.
To show an example for users who do not have a
Polish keyboard, we will use an explicit Unicode es-
cape sequence to enter the character we want. This
will only work in UTF-8 locales in R 2.3.0.3
> library(grid)
> pdf("polish.pdf", width=3, height=0.8)
> grid.text(" Dzie\u0144 dobry
is hello in Polish ")
> dev.off()
Dzień dobry
is ’hello’ in Polish
On systems without access to UTF-8 locales, we
can achieve the same thing by switching to a Latin-2
locale: a Windows4 user could write
> Sys.setlocale(category="LC_CTYPE",
locale="polish")
> grid.text(" Dzie\xf1 dobry
is hello in Polish ")
It is still possible to explicitly specify an encoding
when starting the pdf device if the automated selec-
tion is not correct on your platform.
This process ought to work well whenever the
language you want can be written in a single-byte
locale: this includes all the Western and Eastern
European, Cyrillic and Greek languages written in
ISO 8859-1 (Western), -2 (Eastern), -5 (Cyrillic), -7
(Greek), -13 (Baltic rim), -15 (Western with Euro) and
KOI8-R, KOI8-U and the equivalent Microsoft code-
pages.
Installed fonts
By specifying the correct encoding you can create
a PDF or PostScript file that contains the correct de-
scription of the characters you want, but if you actu-
ally want to view or print that text, you also need
to have an appropriate font installed on your system
that contains the relevant characters.
In the above example we were fortunate because
the default Helvetica font does include the glyph ń.
In general, you may have to specify an appropriate
font (and ensure that it is installed on your system!).
The section “Font databases” will describe the issue
of specifying new fonts in detail.
Even if the glyph is known to R, it may not be
known to your viewer. For example, to view non-
Western European languages with Adobe Reader
you may need to install a Central European font
pack.5
A word of warning: the Adobe Core fonts are
often replaced by clones. This happens both in
3Unicode escape sequences will work in all locales in R 2.3.1 on platforms which support multi-byte character sets.
4NT4, 2000 or later, or (untested) a Polish-language version of 95/98/ME.
5http://www.adobe.com/products/acrobat/acrrasianfontpack.html
PostScript viewers (ghostscript uses the equivalent
URW fonts) and PDF viewers (recent versions of
Adobe Reader use multiple-master fonts) as well as
in some printers. So it can be hard to tell if you
have exceeded the set of glyphs that can reasonably
be assumed to be always available. If you use any-
thing beyond ISO Latin 1, it is safest to embed the
fonts (see below). However, for local use where up-
to-date URW fonts can be assumed the coverage is
much wider including Eastern European and Cyril-
lic glyphs.
CJK fonts
Some support has also been added for locales with
very large character sets, such as the Chinese,
Japanese, and Korean (CJK) ideographic languages.
On PDF and PostScript devices, it is now possible to
select from a small set of font families covering many
of the glyphs used in writing those languages.
There are tens of thousands of (originally) Chi-
nese characters which are common to Chinese
(where they are called hanzi), Japanese (kanji), and
Korean (hanja). However, different glyphs may be
used, for example in Simplified Chinese (used in
most of PR China and in Singapore) and Traditional
Chinese (used in Taiwan and some of PR China).
For example, the following code selects one of the
CJK font families to display ‘hello’ in (Traditional)
Chinese.
> pdf("chinese.pdf", width=3, height=1)
> grid.text("\u4F60\u597D", y=2/3,
gp=gpar(fontfamily="CNS1"))
> grid.text(
"is hello in (Traditional) Chinese",
y=1/3)
> dev.off()
is 'hello' in (Traditional) Chinese
Note that we select the CJK font family only to dis-
play Chinese: although these font families can dis-
play English, they do so poorly. You should switch
back to a more standard font such as Helvetica for
producing Latin characters.
The appropriate CJK fonts include non-Chinese
glyphs, for example Hiragana and Katakana for use
in Japanese and Hangul for use in Korean.
Again, R does not check whether the fonts are
properly installed on your system. For example,
in order to view the file chinese.pdf with Adobe
Reader, you will have to download the Traditional
Chinese font pack.6
The CJK font families available in R have been
chosen because they are widely (or freely) available
for common print or viewing technology. Table 2 lists
the font families and where they can be obtained.
The CJK fonts can be used in mathematical for-
mulas (see ?plotmath), but the metric information
on the sizes of characters is not as good as for Type 1
fonts, so the result may not be aesthetically pleasing.
Note that in this example we did not need to se-
lect an appropriate UTF-8 locale. It will in fact work
in any UTF-8 locale, and also in the older multi-byte
locales used on Unix-alikes (such as EUC-JP) and in
the double-byte locales used on Windows for the CJK
languages.
Font databases
In the previous two sections we mentioned that, in
addition to having the correct encoding, you may
also have to specify an appropriate font in order to
get all of the characters that you need. Here is an
example where that becomes necessary: we want to
write ‘hello’ in Russian and the Cyrillic characters
are not included in Helvetica; so we need to specify
a font that does contain Cyrillic characters. (A sim-
pler workaround would be to use the URWHelvetica
family which does since 2002 include Cyrillic glyphs,
but the versions installed with ghostscript on Win-
dows are from 2000. So we need to illustrate a more
general procedure.)
Again, we will pretend to be Russian temporarily
by setting the locale for the R session:
> Sys.setlocale(category="LC_CTYPE",
locale="ru_RU.utf8")
Now the single-byte encoding is automatically set to
ISO Latin 5, but we must also specify a font that con-
tains Cyrillic characters. We will generate a new one
using the function Type1Font() and specifying the
font name and a path to metric (.afm) files for the
font.
The font used in this example is from the PSCyr
font pack7. For this demonstration, the files have
only been downloaded to a local directory (not for-
mally installed) to keep the path manageable8. The
following command creates a Type 1 font object.
6http://www.adobe.com/products/acrobat/acrrasianfontpack.html
7ftp://ftp.vsu.ru/pub/tex/font-packs/pscyr
8The location of the installed .afm files will vary depending on your system. Most of the fonts used in this article are from TeX packages, so on a Linux system they might reside in /usr/share/texmf/fonts/afm/public/<fontpackname>.
Table 2: The CJK fonts available for use in R. These were collected from examples found on various systems,
and suggestions of Ei-ji Nakama.
Each entry gives the font family name in the font database, followed by the PostScript font name and the PDF font name.

"Japan1"
  PostScript: HeiseiKakuGo-W5, a Linotype Japanese printer font.
  PDF: KozMinPro-Regular-Acro, from the Adobe Reader 7.0 Japanese Font Pack.

"Japan1HeiMin"
  PostScript: HeiseiMin-W3, a Linotype Japanese printer font.
  PDF: HeiseiMin-W3-Acro, a version of the PostScript font, from the Adobe Reader 4.0 Japanese Font Pack.

"Japan1GothicBBB"
  PostScript: GothicBBB-Medium, a Japanese-market PostScript printer font.
  PDF: GothicBBB-Medium.

"Japan1Ryumin"
  PostScript: Ryumin-Light, a Japanese-market PostScript printer font.
  PDF: Ryumin-Light.

"Korea1"
  PostScript: Baekmuk-Batang, a TrueType font found on some Linux systems, and from ftp://ftp.mizi.com/pub/baekmuk/.
  PDF: HYSMyeongJoStd-Medium-Acro, from the Adobe Reader 7.0 Korean Font Pack.

"Korea1deb"
  PostScript: Batang-Regular, a TrueType font found on some Linux systems, probably the same as Baekmuk-Batang.
  PDF: HYGothic-Medium-Acro, from the Adobe Reader 4.0 Korean Font Pack.

"CNS1" (Traditional Chinese)
  PostScript: MOESung-Regular, from Ken Lunde's CJKV resources; can be installed for use with ghostscript.
  PDF: MSungStd-Light-Acro, from the Adobe Reader 7.0 Traditional Chinese Font Pack.

"GB1" (Simplified Chinese)
  PostScript: BousungEG-Light-GB, a TrueType font found on some Linux systems, and from ftp://ftp.gnu.org/pub/non-gnu/chinese-fonts-truetype/.
  PDF: STSong-Light-Acro, from the Adobe Reader 7.0 Simplified Chinese Font Pack.
> RU <- Type1Font(
"TimesNewRomanPSMT-Regular",
c("fonts/afm/public/pscyr/times.afm",
"fonts/afm/public/pscyr/timesbd.afm",
"fonts/afm/public/pscyr/timesi.afm",
"fonts/afm/public/pscyr/timesbi.afm"))
This next command registers the Type 1 font in the
font database for PDF devices. The database records
all valid font family names for use with PDF devices.
After this, for the duration of the current R session,
the font family "RU" can be used as a font family
name in graphics functions.
> pdfFonts(RU=RU)
Now we can use this font family like any other to produce Cyrillic characters (we do not assume a
Russian keyboard so fake typing the Cyrillic keys by
specifying the appropriate Unicode values9).
> pdf("russian.pdf", width=3, height=0.8)
> grid.text("\u0417\u0434\u0440
\u0430\u0432\u0441\u0442\u0432
\u0443\u0439\u0442\u0435
is hello in Russian (Cyrillic)",
gp=gpar(fontfamily="RU"))
> dev.off()
Здравствуйте
is ’hello’ in Russian (Cyrillic)
A separate function, postscriptFonts(), is pro-
vided for querying and modifying the font database
for PostScript devices, and there is a CIDFont() func-
tion for defining new CJK (CID-keyed) fonts10.
Embedding fonts
The Russian example above is not quite complete yet.
If you try to view it with Adobe Reader, it will almost
certainly not look as we intended11. The problem is
that Adobe Reader knows nothing about the PSCyr
font so it substitutes a font it knows about, trying to
make it look as close as possible to the font we’ve
asked for.
By defining a Type 1 font object, we have given R
enough information about the Cyrillic font to create
a PDF or PostScript file, but if we want to view that
file or print it, we may also have to install the font for
the viewing software or the printer. Furthermore, if
we want to share the file with someone else (e.g., put
it on the web), then that someone else may have to
install the font.
The best solution to all of these problems is to em-
bed (enough of) the font within the file itself. This
means that the font becomes part of the document
and viewing software or printers do not need any
additional information to display or print the text.
(Note that licence restrictions on the fonts may limit
or prohibit this approach.)
The new R function embedFonts() performs this
task (by calling ghostscript12).
Here is how to embed the PSCyr fonts for the
russian.pdf plot; the file russianembed.pdf is the
result. Because we have the fonts in a local direc-
tory, we must specify a font path so that ghostscript
knows where to find the font.
> embedFonts("russian.pdf",
outfile="russianembed.pdf",
fontpaths="fonts/type1/public/pscyr")
For most of the examples in this article we have em-
bedded the fonts using this function, which is why
you can view all of these different character sets
without having to download any extra fonts.
A Computer Modern fonts example
Computer Modern fonts are the fonts designed by Donald Knuth for the TeX system (Knuth, 1984) and are the default fonts used in LaTeX (Lamport, 1994).
For a long time, it has been possible to make use of Computer Modern fonts to produce PostScript plots specifically for inclusion in LaTeX documents
via a special "ComputerModern" font family (see
?postscript). However, this mechanism does have
some limitations: some characters are not available
(e.g., less-than and greater-than signs, curly braces,
and a number of mathematical symbols) and, more
importantly, it does not work for PDF output. A sim-
ple way to avoid these restrictions is to use a different
set of Computer Modern fonts that contain a more
complete set of characters.
Gaining access to a wider set of Computer Mod-
ern characters can be achieved simply by using a
version of the Computer Modern fonts that has
9This series of Unicode characters has been broken across lines for this article due to typesetting limitations; do not reproduce these
line breaks if you want to try this code yourself. Also, as there are at least three incompatible single-byte encodings for Russian, we do not
attempt to show how this might be done other than on a UTF-8 system.
10Defining a new CID-keyed font for use with PDF devices requires in-depth knowledge of Adobe font technology and the font itself,
so is not recommended except perhaps for expert users.
11What this actually looks like for you will depend on the fonts you have installed for Adobe Reader, however it is unlikely that you
will have the PSCyr fonts so it should at least look different to the “correct” output; on a vanilla Linux system, the Cyrillic characters are
all missing, and the Latin text is a serif font.
12Ghostscript should be installed already on most Linux systems and is available as a standard download and install for Windows
(http://www.cs.wisc.edu/~ghost/). You may need to tell R where ghostscript is installed via the environment variable R_GSCMD.
been reordered and regrouped for different encoding
schemes. We will use two such fonts in this example.
The first set of fonts is the CM-LGC font pack13,
which provides a Type 1 version of most of the Com-
puter Modern fonts with an ISO Latin 1 encoding
(e.g., a less-than sign will produce a less-than sign,
not an upside-down exclamation mark as in TeX).
The second font is a special Computer Modern sym-
bol font14 developed by Paul Murrell which covers
almost all of the Adobe Symbol encoding (for pro-
ducing mathematical formulas).
The following code demonstrates the use of the
CM-LGC fonts and the special Computer Modern
symbol font to produce a PDF format “plot” that in-
cludes mathematical symbols. The fonts have again
been unpacked in a local directory.
First of all, we define a new Type 1 Font object for
these fonts and register it with the PDF font database:
> CM <- Type1Font("CM",
c(paste("cm-lgc/fonts/afm/public/cm-lgc/",
c("fcmr8a.afm", "fcmb8a.afm",
"fcmri8a.afm", "fcmbi8a.afm"),
sep=""),
"cmsyase/cmsyase.afm"))
> pdfFonts(CM=CM)
Now we can use this font family in a PDF plot:
pdf("cm.pdf", width=3, height=3,
family="CM")
# ... much drawing code omitted ...
dev.off()
Finally, we can embed these fonts in the docu-
ment so that we can embed the plot in other doc-
uments (such as this article) for anyone to view or
print.
> embedFonts("cm.pdf",
outfile="cmembed.pdf",
fontpaths=
c("cm-lgc/fonts/type1/public/cm-lgc",
"cmsyase"))
The final result is shown in Figure 1.
Summary
With R 2.3.0, it is now easier to produce text in PDF
and PostScript plots using character sets other than
English. There is built-in support for a wider range of
encodings and fonts and there is improved support
for making use of non-standard fonts in plots. There
is also now a utility for embedding fonts so that plots
can be more easily included in other documents and
shared across systems.
We believe that it is now possible for users to
produce high-quality graphical output in all human
languages with a sizeable number of R users. Any-
one whose native language is not covered is invited
to contribute suitable encoding files and locale map-
pings so our coverage can be extended.
Figure 1: A "plot" ("Computer Modern Font Comparison") that demonstrates the use of Computer Modern fonts in PDF output.
Acknowledgements
Ei-ji Nakama provided the initial patch of the C code
for non-ISO Latin 1 encodings in multi-byte locales,
and for CJK (CID-keyed) font support, and he and
colleagues have patiently explained Japanese typog-
raphy to us.
Bibliography
D. Knuth. The TeXbook. Addison-Wesley, Reading, MA, 1984.
L. Lamport. LaTeX: a document preparation system. Addison-Wesley, Reading, MA, 1994.
P. Murrell. Fonts, lines, and transparency in R graph-
ics. R News, 4(2):5–9, September 2004. URL http:
//CRAN.R-project.org/doc/Rnews/.
B. D. Ripley. Internationalization features of R 2.1.0.
R News, 5(1):2–7, May 2005. URL http://CRAN.
R-project.org/doc/Rnews/.
Paul Murrell
The University of Auckland, New Zealand
paul@stat.auckland.ac.nz
Brian D. Ripley
University of Oxford, UK
ripley@stats.ox.ac.uk
13http://www.ctan.org/tex-archive/help/Catalogue/entries/cm-lgc.html, and available as a package for some Linux systems.
14http://www.stat.auckland.ac.nz/~paul/R/CM/CMR.html
The doBy package
by Søren Højsgaard
This article is not about rocket science; in fact it is not
about science at all. It is a description of yet another
package with utility functions.
I have used R in connection with teaching gener-
alized linear models and related topics to Ph.D. stu-
dents within areas like agronomy, biology, and vet-
erinary science at the Danish Institute of Agricultural
Sciences.
These students, many of whom are familiar with
the SAS system, have almost all come to appreciate R
very quickly. However, they have also from time to
time complained that certain standard tasks are hard
to do in R – and certainly harder than in SAS. The
doBy package is an attempt to make some of these
standard tasks easier.
Airquality data
The presentation of the package is based on the
airquality dataset which contains air quality mea-
surements in New York, May to September 1973.
(Note that months are coded as 5, . . . , 9).
> head(airquality)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
The summaryBy function
With the summary procedure of SAS (PROC SUMMARY)
one can easily calculate things like “the mean and
variance of x for each combination of two factors A
and B”. To calculate mean and variance of Ozone
and Wind for each combination of Month, do:
> summaryBy(Ozone + Wind ~ Month,
+ data = airquality, FUN = c(mean,
+ var), prefix = c("m",
+ "v"), na.rm = TRUE)
Month m.Ozone m.Wind v.Ozone v.Wind
1 5 23.62 11.623 493.9 12.471
2 6 29.44 10.267 331.5 14.207
3 7 59.12 8.942 1000.8 9.217
4 8 59.96 8.794 1574.6 10.407
5 9 31.45 10.180 582.8 11.980
The result above can clearly be obtained in other
ways. For example by using the aggregate function,
the summarize function in the Hmisc package or by
> a <- by(airquality, airquality$Month,
+ function(d) {
+ c(mean(d[, c("Ozone",
+ "Wind")], na.rm = T),
+ diag(var(d[, c("Ozone",
+ "Wind")], na.rm = T)))
+ })
> do.call("rbind", a)
However, my students have found this somewhat
cumbersome!
The orderBy function
Ordering (or sorting) a data frame is possible with
the orderBy function. Suppose we want to order the
dataframe by Temp and by Month (within Temp) and
that the ordering should be decreasing. This can be
achieved by:
> x <- orderBy(~Temp + Month,
+ data = airquality, decreasing = T)
The first lines of the result are:
Ozone Solar.R Wind Temp Month Day
120 76 203 9.7 97 8 28
122 84 237 6.3 96 8 30
121 118 225 2.3 94 8 29
123 85 188 6.3 94 8 31
126 73 183 2.8 93 9 3
127 91 189 4.6 93 9 4
Again, this can clearly be achieved in other ways,
but presumably not with so few commands as above.
The splitBy function
Suppose we want to split data into a list of
dataframes, e.g. one dataframe for each month. This
can be achieved by:
> x <- splitBy(~Month, data = airquality)
Information about the grouping is stored as a
dataframe in an attribute called groupid:
> attr(x, "groupid")
Month
1 1
2 2
3 3
4 4
5 5
The sampleBy function
Suppose we want a random sample of 50% of the ob-
servations from a dataframe. This can be achieved
with:
> sampleBy(~1, frac = 0.5, data = airquality)
Suppose instead that we want a systematic sam-
ple of every fifth observation within each month.
This is achieved with:
> sampleBy(~Month, frac = 0.2,
+ data = airquality, systematic = T)
The subsetBy function
Suppose we want to take out those rows within each
month for which the wind speed is larger than
the mean wind speed (within the month). This is
achieved by:
> subsetBy(~Month, subset = "Wind>mean(Wind)",
+ data = airquality)
Note that the statement "Wind>mean(Wind)" is eval-
uated within each month.
The esticon function
Consider a linear model which explains Ozone as a
linear function of Month and Wind:
> airquality <- transform(airquality,
+ Month = factor(Month))
> m <- lm(Ozone ~ Month * Wind,
+ data = airquality)
> coefficients(m)
(Intercept) Month6 Month7
50.748 -41.793 68.296
Month8 Month9 Wind
82.211 23.439 -2.368
Month6:Wind Month7:Wind Month8:Wind
4.051 -4.663 -6.154
Month9:Wind
-1.874
When a parameter vector β of (systematic) effects has been estimated, interest is often in a particular estimable function, i.e. a linear combination λ⊤β, and/or in testing the hypothesis H0: λ⊤β = β0, where λ is a specific vector defined by the user.
Suppose for example we want to calculate the expected difference in ozone between consecutive months at wind speed 10 mph (which is about the average wind speed over the whole period).
The esticon function provides a way of doing so.
We can specify several λ vectors at the same time by row-binding the transposed vectors λ⊤:
> Lambda <- rbind(
+   c(0, -1,  0,  0,  0, 0, -10,   0,   0,   0),
+   c(0,  1, -1,  0,  0, 0,  10, -10,   0,   0),
+   c(0,  0,  1, -1,  0, 0,   0,  10, -10,   0),
+   c(0,  0,  0,  1, -1, 0,   0,   0,  10, -10))
> esticon(m, Lambda)
Confidence interval ( WALD ) level = 0.95
beta0 Estimate Std.Error t.value
1 0 1.2871 10.238 0.1257
2 0 -22.9503 10.310 -2.2259
3 0 0.9954 7.094 0.1403
4 0 15.9651 6.560 2.4337
DF Pr(>|t|) Lower.CI Upper.CI
1 106 0.90019 -19.010 21.585
2 106 0.02814 -43.392 -2.509
3 106 0.88867 -13.069 15.060
4 106 0.01662 2.959 28.971
In other cases, interest is in testing a hypothesis about a contrast H0: Λβ = β0, where Λ is a matrix. For example, a test of no interaction between Month and Wind can be made by testing jointly that the last four parameters in m are zero (observe that the test is a Wald test):
> Lambda <- rbind(
+   c(0, 0, 0, 0, 0, 0, 1, 0, 0, 0),
+   c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0),
+   c(0, 0, 0, 0, 0, 0, 0, 0, 1, 0),
+   c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1))
> esticon(m, Lambda, joint.test = T)
X2.stat DF Pr(>|X^2|)
1 22.11 4 0.0001906
For a linear normal model, one would typically
prefer to do a likelihood ratio test instead. However,
for generalized estimating equations of glm–type (as
dealt with in the packages geepack and gee) there is
no likelihood. In this case the esticon function provides
an operational alternative.
Observe that another function for calculating con-
trasts as above is the contrast function in the De-
sign package but it applies to a narrower range of
models than esticon does.
Final remarks
The ease in using the data oriented functions lies in
that 1) the formula language is used in the specifi-
cation of both the variables and the grouping and 2)
the functions take a data argument. My “biologically
oriented" students (and ditto colleagues) seem to appreciate that.
On their wishlist are facilities along the lines of the ESTIMATE and LSMEANS statements available in many SAS procedures (including GLM, MIXED and GENMOD)
for easy specification of various contrasts (LSMEANS is
sometimes also denoted “population means”). While
LSMEANS are often misused (and certainly misinter-
preted) such facilities would be nice to have in R.
I would like to encourage anyone who has imple-
mented such facilities in a reasonable level of gen-
erality to come forward.
Søren Højsgaard
Statistics and Decision Analysis Unit
Department of Genetics and Biotechnology
Danish Institute of Agricultural Sciences
Tjele, Denmark
sorenh@agrsci.dk
Normed division algebras with R:
Introducing the onion package
The eccentric cousin and the crazy old uncle
by Robin K. S. Hankin
Preface
An algebra is a vector space V over a field (here the real numbers) in which the vectors may be multiplied. In addition to the usual vector space axioms, one requires, for any scalars λ, µ and any x, y, z ∈ V:

    x(λy + µz) = λxy + µxz
    (λy + µz)x = λyx + µzx

where multiplication is denoted by juxtaposition; note the absence of associativity. A division algebra is a nontrivial algebra in which division is possible: for any a ∈ V and any nonzero b ∈ V, there exists precisely one element x with a = bx and precisely one element y with a = yb. A normed division algebra is a division algebra with a norm ||·|| satisfying ||xy|| = ||x|| ||y||.
There are precisely four normed division algebras: the reals themselves (ℝ), the complex numbers (ℂ), the quaternions (ℍ) and the octonions (𝕆); the generic term is "onion", although the term includes other algebras such as the sedenions.
The R programming language (R Development Core Team, 2004) is well-equipped to deal with the first two: here, I introduce the onion package,
which provides some functionality for the quater-
nions and the octonions, and illustrate these interest-
ing algebras using numerical examples.
Introduction
Historically, the complex numbers arose from a num-
ber of independent lines of inquiry and our current
understanding of them (viz z = x + iy; the Argand
plane) developed over the eighteenth and nineteenth
centuries.
Hamilton was one of many mathematicians to at-
tempt to extend the complex numbers to a third di-
mension and discover what would in our terminol-
ogy be a three dimensional normed division alge-
bra. We now know that no such thing exists: divi-
sion algebras all have dimension 2^n for some non-negative integer n.
Hamilton came upon the multiplicative structure
of the quaternions in a now-famous flash of inspira-
tion on 16th October 1843: the quaternions are ob-
tained by adding the elements i, j, and k to the real numbers and requiring that

    i^2 = j^2 = k^2 = ijk = -1.   (1)

A general quaternion is thus written a + bi + cj + dk with a, b, c, d being real numbers; complex arithmetic is recovered if c = d = 0. Hamilton's relations above, together with distributivity and associativity, yield the full multiplication table and the quaternions are the unique four-dimensional normed division algebra.
However, Hamilton’s scheme was controversial
as his multiplication was noncommutative: a shock-
ing suggestion at the time. Today we recognize many
more noncommutative operations (such as matrix
multiplication), but even Hamilton had difficulty
convincing his contemporaries that such operations
were consistent, let alone worth studying.
The octonions
The fourth and final normed division algebra is that
of the octonions. These were discovered around 1844
and are an eight-dimensional algebra over the reals.
The full multiplication table is given by Baez (2001).
Package onion in use
A good place to start is function rquat(), which
returns a quaternionic vector of a specified length,
whose elements are random small integers:
> library(onion)
> x <- rquat(5)
> names(x) <- letters[1:5]
> print(x)
   a b c d e
Re 4 3 2 1 3
i  3 2 3 4 1
j  2 4 2 3 1
k  3 3 4 4 4
This quaternionic vector, of length 5, is of the
form of a four-row matrix. The rownames are the
standard basis for quaternions, and their entries may
be manipulated as expected; for example, we may set
component k to be component j plus 10:
> k(x) <- j(x) + 10
> x
    a  b  c  d  e
Re  4  3  2  1  3
i   3  2  3  4  1
j   2  4  2  3  1
k  12 14 12 13 11
Quaternionic vectors behave largely as one might ex-
pect. For example, we may concatenate x with a ba-
sis quaternion (the four bases are represented in the
package by H1, Hi, Hj, and Hk) and multiply the result
by the first element:
> c(x, Hk) * x[1]
       a    b    c    d    e
Re  -141 -170 -149 -170 -125  -12
i     24   -3   18    9   23    2
j     16    4   12   23  -11   -3
k     96  100   72   65   81    4
And indeed we may explicitly verify that quater-
nionic multiplication is not commutative using the
commutator() function, returning xy − yx:
> y <- rquat(5)
> commutator(x, y)
     a   b   c   d   e
Re   0   0   0   0   0
i   20  12  32  98  62
j  -42 -48 -24 -44 -40
k    2  12  -4 -20  -2
It is possible to verify that quaternionic
multiplication is associative using function
associator(x,y,z) which returns (xy)z − x(yz)
and is thus identically zero for associative operators:
> associator(x, y, rquat(5))
   a b c d e
Re 0 0 0 0 0
i  0 0 0 0 0
j  0 0 0 0 0
k  0 0 0 0 0
Compare the octonions, which are not associative:
> associator(roct(3), roct(3),
+ roct(3))
    [1]  [2]  [3]
Re    0    0    0
i   472 -408  242
j  -158  112  -66
k  -106 -408   84
l  -492  128  506
il  506   54 -700
jl  250  298  610
kl -160  518 -768
Many transcendental functions operate on
onions, via a reasonably straightforward generaliza-
tion of the complex case. Consider the square root,
defined as exp(log(x)/2). That this obeys similar
rules to the usual square root may be demonstrated
for octonions by calculating √x · √x − x, and showing
that its modulus is small:
> x <- roct(3)
> Mod(sqrt(x) * sqrt(x) - x)
[1] 7.601583e-15 5.291934e-15
[3] 4.110825e-15
showing acceptable accuracy in this context. How-
ever, many identities that are true for the real or com-
plex case fail for quaternions or octonions; for exam-
ple, although log(x2) = 2 log(x), it is not generally
the case that log(xy) = log(x) + log(y), as we may
verify numerically:
> x <- rquat(3)
> y <- rquat(3)
> Mod(log(x * x) - 2 * log(x))
[1] 0.000000e+00 3.845925e-16
[3] 0.000000e+00
> Mod(log(x * y) - log(x) - log(y))
[1] 0.0000000 0.6770686 1.1158104
Practical applications
I now show the quaternions in use: they can be used
to rotate an object in 3D space in an elegant and nat-
ural way. If we wish to rotate vector v, considered to
be a triple of real numbers, we first identify its three
components with the imaginary part of a quaternion
with zero real part (i.e., v = 0 + v₁i + v₂j + v₃k).
Then the transformation defined by

v → v′ = zvz⁻¹
where z ∈ ℍ is a nonzero quaternion, is a rotation1. Note that the noncom-
mutativity of the quaternions means that the map-
ping is not necessarily the identity. This scheme may
be used to produce Figure 1.
> data(bunny)
> par(mfrow = c(2, 2))
> par(mai = rep(0.1, 4))
> p3d(rotate(bunny, H1), box = FALSE,
+ d0 = 0.4, r = 1e+06, h = NULL,
+ pch = 16, cex = 0.3, theta = 0,
+ phi = 90, main = "z=1 (identity)")
> p3d(rotate(bunny, Hi), box = FALSE,
+ d0 = 0.4, r = 1e+06, h = NULL,
+ pch = 16, cex = 0.3, theta = 0,
+ phi = 90, main = "z=i")
> p3d(rotate(bunny, H1 - Hj),
+ box = FALSE, d0 = 0.4,
+ r = 1e+06, h = NULL, pch = 16,
+ cex = 0.3, theta = 0, phi = 90,
+ main = "z=1-i")
> p3d(rotate(bunny, Hall), box = FALSE,
+ d0 = 0.4, r = 1e+06, h = NULL,
+ pch = 16, cex = 0.3, theta = 0,
+ phi = 90, main = "z=1+i+j+k")
Figure 1: The Stanford Bunny (Turk,2005) in various
rotations, plotted using p3d(). Depth cue is via pro-
gressive greying out of points further from the eye,
as though viewed through a mist
Translating between quaternionic rotation and
(say) the equivalent orthogonal matrix is straightfor-
ward.
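As an illustration, here is a base-R sketch of one common convention (this helper is not part of the onion package, and the sign conventions may differ from those used by rotate()):

## for a unit quaternion z = a + bi + cj + dk, the matrix of the map
## v -> z v z^(-1), acting on column vectors, can be written as
quat2mat <- function(a, b, c, d) {
  matrix(c(a^2 + b^2 - c^2 - d^2,  2*(b*c - a*d),          2*(b*d + a*c),
           2*(b*c + a*d),          a^2 - b^2 + c^2 - d^2,  2*(c*d - a*b),
           2*(b*d - a*c),          2*(c*d + a*b),          a^2 - b^2 - c^2 + d^2),
         nrow = 3, byrow = TRUE)
}
s <- 1/sqrt(2)
quat2mat(s, s, 0, 0)   # z = (1+i)/sqrt(2): a 90 degree rotation about the x axis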
Conclusions
Quaternions and octonions are interesting and in-
structive examples of normed division algebras.
Quaternions are a natural and efficient representa-
tion of three dimensional rotations and this paper in-
cludes a brief illustration of their use in this context.
Octonions’ applicability to physics is currently
unclear, but a steady trickle of papers has appeared
in which the octonions are shown to be related to
spinor geometry and quantum mechanics (see Baez,
2001, and references therein). Reading the literature,
one cannot help feeling that octonions will eventu-
ally reveal some deep physical truth. When this hap-
pens, the onion package ensures that R will be ready
to play its part.
Acknowledgments
I would like to acknowledge the many stimulating
and helpful comments made by the R-help list over
the years.
Bibliography
J. C. Baez. The octonions. Bulletin of the American
Mathematical Society, 39(5):145–205, 2001.
R Development Core Team. R: A language and envi-
ronment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria, 2004. URL
http://www.R-project.org. ISBN 3-900051-07-0.
G. Turk. The Stanford bunny, 2005. Stanford
University Computer Graphics Laboratory,
http://graphics.stanford.edu/data/3Dscanrep/.
Robin Hankin
National Oceanography Centre, Southampton
European Way
Southampton
United Kingdom
SO14 3ZH
r.hankin@noc.soton.ac.uk
1The real part of v′ is again zero and thus may be interpreted as a three-vector. It is straightforward to prove that v·w = v′·w′. In R id-
iom, one would numerically verify these identities by evaluating d <- (z*v/z) %.% (z*w/z) - v %.% w (where v, w, z are quaternions)
and observing that Mod(d) is small.
Resistor networks R: Introducing the
ResistorArray package
Electrical properties of resistor networks
by Robin K. S. Hankin
Introduction
Many elementary physics courses show how resis-
tors combine in series and parallel (Figure 1); the
equations are
Rseries = R1 + R2    (1)

Rparallel = 1 / (R1⁻¹ + R2⁻¹).    (2)
However, these rules break down for many systems
such as the Wheatstone bridge (Figure 2); the reader
who doubts this should attempt to apply equations 1
and 2 and find the resistance between points A and B
in the general case.
Figure 1: Two resistors in series (top) and parallel (bottom)
Figure 2: The Wheatstone bridge
This paper introduces ResistorArray (Hankin,
2005), an R (R Development Core Team,2005) pack-
age to determine resistances and other electrical
properties of such networks.
Although this paper uses the language of elec-
trical engineering, the problem considered is clearly
very general: many systems are composed of iso-
lated nodes between which some quantity flows and
the steady states of such systems are generally of in-
terest.
Package ResistorArray has been applied to such
diverse problems as the diffusion of nutrients among
fungal hyphae networks, the propagation of salinity
between (moored) oceanographical buoys, and hy-
draulic systems such as networks of sewage pumps.
Matrix formulation
The general problem of determining the resistance
between two nodes of a resistor network requires
matrix techniques1.
Consider a network of n nodes, with node i con-
nected to node j by a resistor of resistance Rij. Then
the network has a “conductance matrix” L with

Lij = −1/Rij                if i ≠ j
Lij = Σ_{k: k≠j} 1/Rkj      if i = j.    (3)

Thus L is a symmetrical matrix, whose row sums and
column sums are zero (and therefore is singular).
Then the analogue of Ohm's law (viz V = IR)
would be

Lv = i    (4)

where v = (v1, ..., vn) is a vector of potentials
and i = (i1, ..., in) is a vector of currents; here ip is
the current flow into node p. Equation 4 is thus a re-
statement of the fact that charge does not accumulate
at any node.
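To illustrate equation 3, here is a short base-R sketch (not part of ResistorArray) that builds L from a symmetric matrix R of resistances, with entries set to Inf where two nodes are not directly connected:

conductance <- function(R) {
  L <- -1 / R              # off-diagonal entries: -1/Rij (zero where Rij = Inf)
  diag(L) <- 0
  diag(L) <- -colSums(L)   # diagonal entries: sum over k != j of 1/Rkj
  L
}
## the Wheatstone bridge of Figure 2: all resistances 1 ohm except R34 = 2 ohms,
## with no direct connection between nodes 1 and 4
R <- matrix(Inf, 4, 4)
R[1, 2] <- R[2, 1] <- 1
R[1, 3] <- R[3, 1] <- 1
R[2, 3] <- R[3, 2] <- 1
R[2, 4] <- R[4, 2] <- 1
R[3, 4] <- R[4, 3] <- 2
conductance(R)   # matches the tetrahedron() output shown below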
Each node of the circuit may either be fed a known
current2 and we have to calculate its potential; or it is
maintained at a known potential and we have to cal-
culate the current flux into that node to maintain that
potential. There are thus n unknowns altogether.
Thus some elements of v and i are known and
some are unknown. Without loss of generality, we
may partition these vectors into known and un-
known parts: v = (vk′, vu′)′ and i = (iu′, ik′)′.
Thus the known elements of v are vk = (v1, ..., vp)′:
these would correspond to nodes that are maintained
at a specified potential; the other elements vu =
(vp+1, ..., vn)′ correspond to nodes that are at an un-
known potential that we have to calculate.
The current vector i may similarly be decom-
posed, but in a conjugate fashion; thus elements iu =
(i1, ..., ip)′ correspond to nodes that require a cer-
tain, unknown, current to be fed into them to
maintain potentials vk; the other elements ik =
(ip+1, ..., in)′ would correspond to nodes that have
a known current flowing into them and whose po-
tential we seek.

1The method described here is a standard application of matrix algebra, but does not appear to be described in the literature. Many
books and papers, such as Doyle and Snell (1984), allude to it; but a formal description does not appear to have been published.
2This would include a zero current: in the case of the Wheatstone bridge of Figure 2, nodes 2 and 3 have zero current flux so i2 and i3
are known and set to zero.
Equation 4 may thus be expressed in terms of a
suitably partitioned matrix equation:

[ A   B ] [ vk ]   [ iu ]
[ Bᵀ  C ] [ vu ] = [ ik ]    (5)

where, in R idiom, A=L[1:p,1:p], B=L[1:p,(p+1):n],
and C=L[(p+1):n,(p+1):n]. Straightforward matrix
algebra gives

vu = C⁻¹(ik − Bᵀvk)    (6)
iu = (A − BC⁻¹Bᵀ)vk + BC⁻¹ik    (7)

Equations 6 and 7 are implemented by circuit().
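Equations 6 and 7 can also be written out directly; the following base-R sketch (circuit() itself is more general) takes the indices of the nodes held at the specified potentials vk, together with the known currents ik fed into the remaining nodes:

solve_network <- function(L, known, vk, ik) {
  A <- L[ known,  known, drop = FALSE]
  B <- L[ known, -known, drop = FALSE]
  C <- L[-known, -known, drop = FALSE]
  vu <- solve(C, ik - t(B) %*% vk)   # equation 6
  iu <- A %*% vk + B %*% vu          # equation 7, after substituting for vu
  list(vu = vu, iu = iu)
}
## check against the Wheatstone bridge: node 1 earthed, 1 amp fed into node 4
L <- matrix(c( 2, -1, -1.0,  0.0,
              -1,  3, -1.0, -1.0,
              -1, -1,  2.5, -0.5,
               0, -1, -0.5,  1.5), 4, 4, byrow = TRUE)
solve_network(L, known = 1, vk = 0, ik = c(0, 0, 1))

The unknown potentials come out as roughly 0.545, 0.455 and 1.182 volts, matching the circuit() output in the next section.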
Package ResistorArray in use
Consider the Wheatstone Bridge, illustrated in Fig-
ure 2. Here the resistance between node 1 and node 4
is calculated; all resistances are one ohm except R34,
which is 2 ohms. This resistor array may be viewed
as a skeleton tetrahedron, with edge 1–4 missing. We
may thus use function tetrahedron() to generate
the conductance matrix; this is
R> L <- tetrahedron(c(1, 1, Inf, 1, 1, 2))
     [,1] [,2] [,3] [,4]
[1,]    2   -1 -1.0  0.0
[2,]   -1    3 -1.0 -1.0
[3,]   -1   -1  2.5 -0.5
[4,]    0   -1 -0.5  1.5
Observe that L[1,4]==L[4,1]==0, as required.
The resistance may be determined by function
circuit(); there are several equivalent methods.
Here node 1 is earthed—by setting it to zero
volts—and the others are given a floating poten-
tial; viz v=c(0,NA,NA,NA). Then one amp is fed
into node 4, zero amps to nodes 2 and 3, and
node 1 an unknown current, to be determined; viz
currents=c(NA,0,0,1). The node 1-to-node 4 resis-
tance will then be given by the potential at node 4:
R> circuit(L, v = c(0, NA, NA, NA),
+ currents = c(NA, 0, 0, 1))
$potentials
[1] 0.0000000 0.5454545 0.4545455
[4] 1.1818182
$currents
[1] -1 0 0 1
Thus the resistance is about 1.181818, compar-
ing well with the theoretical value of 117/99. Note
that the system correctly returns the net current re-
quired to maintain node 1 at 0 V as -1 A (which is
clear from conservation of charge). The potential dif-
ference between nodes 2 and 3 indicates a nonzero
current flow along R23; currents through each resis-
tor are returned by function circuit() if argument
give.internal is set to TRUE.
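For example (a sketch of the call only; the layout of the extra output is described in the package documentation):

circuit(L, v = c(0, NA, NA, NA),
        currents = c(NA, 0, 0, 1),
        give.internal = TRUE)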
Conclusions
Package ResistorArray solves a problem that is fre-
quently encountered in electrical engineering: de-
termining the electrical properties of resistor net-
works. The techniques presented here are applica-
ble to many systems composed of isolated nodes be-
tween which some quantity flows; the steady states
of such systems generally have an electrical ana-
logue.
This paper explicitly presents a vectorized so-
lution method that uses standard numerical tech-
niques, and discusses the corresponding R idiom.
Acknowledgements
I would like to acknowledge the many stimulating
and helpful comments made by the R-help list over
the years.
Bibliography
P. G. Doyle and J. L. Snell. Random walks and electric
networks. The Mathematical Association of Amer-
ica, 1984.
R. K. S. Hankin. ResistorArray: electrical properties of
resistor networks, 2005. R package version 1.0-9.
R Development Core Team. R: A language and envi-
ronment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria, 2005. URL
http://www.R-project.org. ISBN 3-900051-07-0.
Robin Hankin
National Oceanography Centre, Southampton
European Way
Southampton
United Kingdom
SO14 3ZH
r.hankin@noc.soton.ac.uk
The good, the bad, and the ugly—Review
of Paul Murrell’s new book: “R Graphics”
by David Meyer
At the DSC 2001 workshop
in Vienna, Ross Ihaka pre-
sented an overview on the
“old” R graphics system us-
ing the title: “The good, the
bad, and the ugly”, basi-
cally meaning that the tra-
ditional graphics system, al-
beit flexible, was getting old
and the interface was cum-
bersome in several aspects.
Think, e.g., of the mysteries
of the dozens of par() pa-
rameters whose lengthy help
page most useRs keep browsing on a regular ba-
sis to filter out the appropriate settings. Thanks to
Paul Murrell’s new book, the secrets of both tradi-
tional graphics and the new, modern grid system
get unveiled, preventing useRs from writing “ugly”
code. Starting from scratch, both architectures are
presented and compared in detail, complemented
with illustrative examples and summary tables. The
text not only covers basic elements such as points,
lines, segments, and text, but also discusses the most
important high-level plot functions available in base
R, as well as some popular extension packages such
as scatterplot3d. In addition, a clear effort has
been made to enable the readers to develop their own
graphical methods, in the R spirit of “users becoming
developers”.
The first part of the book covers the traditional
graphics system. It starts with some simple plots us-
ing plot() and friends, and proceeds with a detailed
chapter on customizing, covering all low-level and
high-level par() settings, multiple plots, annotation,
and more complex, superposed plots. Even experi-
enced users will discover “new” features. For exam-
ple, most people will be familiar with line types—
but what about the various line join and line ending
styles, or the correct handling of fonts?
The second part is devoted to grid, the new
graphics system developed by the book author him-
self. Murrell chooses a practical approach and first
introduces the lattice package, R’s version of Trellis
graphics. After giving an overview on all high-level
lattice functions, the author provides customization
examples. Only after this “teaser” is the reader con-
fronted with the basic elements of the grid package.
The text explains the idea behind concepts like units,
graphical context, viewports, and how existing plot
elements can easily be integrated in more complex
displays. Particular emphasis is given to the possibil-
ities of interactive plot modifications and off-screen
computations. Finally, a whole chapter is devoted to
the development of new grid-based functions and
objects that can be reused by others.
Paul Murrell’s book is the first publication en-
tirely devoted to R graphics, written by the authorita-
tive expert in the field. It is definitely a must-have for
novices and professionals alike, the ultimate guide to
the power (and beauty) of R graphics.
David Meyer
Department of Information Systems and Operations, Vi-
enna University of Economics and Business Administra-
tion
Augasse 2-6, A-1090 Wien, Austria
david.meyer@wu-wien.ac.at
Changes in R
by the R Core Team
User-visible changes
In the grid package there are new ’arrow’ argu-
ments to grid.line.to(),grid.lines(), and
grid.segments() (grid.arrows() has been
deprecated).
The new ’arrow’ arguments have been added
BEFORE the ’name’, ’gp’ and ’vp’ arguments
so existing code that specifies any of these ar-
guments *by position* (not by name) will fail.
all.equal() is more stringent, see the PR#8191
bug fix below.
The data frame argument to transform() is no
longer called ’x’, but ’_data’. Since this is an in-
valid name, it is less likely to clash with names
given to transformed variables. (People were
getting into trouble with transform(data,
x=y+z).)
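A minimal sketch of the kind of call that now works as intended:

d <- data.frame(y = 1:3, z = 4:6)
transform(d, x = y + z)   # adds a column x without clashing with the data frame argument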
New features
arima.sim() has a new argument ’start.innov’
for compatibility with S-PLUS. (If not supplied,
the output is unchanged from previous ver-
sions in R.)
arrows() has been changed to be more simi-
lar to segments(): for example col=NA omits
the arrow rather than as previously (undocu-
mented) using par("col").
as.list() now accepts symbols (as given by
as.symbol() aka as.name()).
atan2() now allows one complex and one nu-
meric argument.
The ’masked’ warnings given by attach() and
library() now only warn for functions mask-
ing functions or non-functions masking non-
functions.
• New function Axis(), a generic version of
axis(), with Date and POSIX[cl]t meth-
ods. This is used by most of the stan-
dard plotting functions (boxplot, contour,
coplot, filled.contour, pairs, plot.default, rug,
stripchart) which will thus label x or y axes ap-
propriately.
pbeta() now uses TOMS708 in all cases and
so is more accurate in some cases (e.g. when
lower.tail = FALSE and when one of the shape
parameters is very small).
[qr]beta(),[qr]f() and [qr]t() now have a
non-centrality parameter.
[rc]bind() and some more cases of sub-
assignment are implemented for raw matrices.
(PR8529 and 8530)
The number of lines of deparsed calls printed
by browser() and traceback() can be limited
by the option "deparse.max.lines". (Wish of
PR#8638.)
• New canCoerce() utility function in "meth-
ods" package.
[pq]chisq() are considerably more accurate
for moderate (up to 80) values of ncp, and
lower.tail = FALSE is fully supported in that re-
gion. (They are somewhat slower than before.)
chol(pivot = TRUE) now gives a warning if
used on a (numerically) non-positive-definite
matrix.
chooseCRANmirror() consults the CRAN mas-
ter (if accessible) to find an up-to-date list of
mirrors.
cov.wt() is more efficient for ’cor = TRUE’ and
has a new ’method’ argument which allows
’Maximum Likelihood’.
do.call() gains an ’envir’ argument.
eigen() applied to an asymmetric real matrix
now uses a tolerance to decide if the result is
complex (rather than expecting the imaginary
parts of the eigenvalues to be exactly zero).
• New function embedFonts() for embedding
fonts in PDF or PostScript graphics files.
fisher.test() now uses p-values computed
via hypergeometric distributions for all 2 by 2
tables. This might be slightly slower for a few
cases, but works much better for tables with
some large counts.
There is a new option to simulate the p-value
for larger than 2 x 2 tables.
for() now supports raw vectors as the set of
indices.
getNativeSymbolInfo() is vectorized for the
’name’ argument. It returns a named list of Na-
tiveSymbolInfo objects, but is backward com-
patible by default when called with a character
vector of length 1, returning the NativeSymbol-
Info object.
help.search() no longer attempts to handle
packages installed prior to R 2.0.0, and reports
the current path to the package (rather than
where it was originally installed: this informa-
tion is not shown by the print() method).
Added "hexmode" to parallel "octmode".
install.packages() now does tilde expan-
sion on file paths supplied as ’pkgs’.
install.packages() has additional argu-
ments ’configure.args’ and ’clean’ which allow
the caller to provide additional arguments to
the underlying R CMD INSTALL shell com-
mand when installing source packages on a
Unix-alike.
is.loaded() has a new argument ’type’ to con-
fine the search to symbols for .C, .Fortran, .Call
or .External: by default it looks for a symbol
which will match any of them. It is now inter-
nal and not primitive, so argument matching
works in the usual way.
The symmetry test for matrices used in
eigen() has been “exported” as the ’matrix’
method of a new S3-generic isSymmetric().
.leap.seconds and the internal adjustment code
now know about the 23rd leap second on 2005-
12-31: the internal code uses a run-time test to
see if the OS does.
The ’col’ argument of legend() now defaults
to par("col") (which defaults to "black", the
previous default), so that the lines/symbols are
shown in the legend in the colour that is used
on the plot.
log2() and log10() call C functions of the
same name if available, and will then be more
likely to be precise to machine accuracy.
new.packages() gains a ... argument to pass
e.g. ’destdir’ to install.packages(). (Wish of
PR#8239.)
nls() now supports ’weights’.
The vector passed as the first argument of the
’fn’ and ’gr’ arguments of optim() has the
names (if any) given to argument ’par’.
options(expressions) is temporarily in-
creased by 500 during error-handling. This
enables e.g. traceback() to work when the
error is reaching the limit on the nesting of ex-
pressions.
page() accepts general R objects, not just
names (and previously undocumented) charac-
ter strings. This allows the object to be specified
as a call, for example. More options are allowed
in its ’...’ argument.
pairs() allows a wider class of inputs, in-
cluding data frames with date and date-time
columns.
par() and the in-line use of graphical parame-
ters produce more informative error messages,
distinguishing between non-existent pars and
inappropriate use of valid pars.
Graphical parameters ’family’, ’lend’, ’ljoin’
and ’lmitre’ can now be set in-line.
There is no longer a warning if non-settable
pars are used in-line, but there is an appropri-
ate warning if unknown pars are passed.
The length limit for the ’family’ parameter has
been increased to 200 bytes, to allow for the
names of some CID-keyed fonts in multi-byte
locales.
• The pdf() device now allows ’family’ to
be specified in the same generality as
postscript().
The pdf() device writes /FontDescriptor en-
tries for all fonts except the base 14, and does
not write font entries for unused fonts.
Plotmath allows ’vartheta’, ’varphi’ and ’var-
sigma’ (or ’stigma’) as synonyms for ’theta1’,
’phi1’ and ’sigma1’, and the help page has a
note for TeX users.
plot.xy() now takes its default arguments
from the corresponding par() settings, so
points(type="l") and lines(type="p") be-
have in the same way (and more obviously, also
for type="b").
poly() has a new argument ’raw’, mainly for
pedagogical purposes.
• The class "POSIXlt" now supports fractional
seconds (as "POSIXct" has always done). The
printing of fractional seconds is controlled by
the new option "digits.secs", and by default is
off.
postscript() supports family = "Computer-
ModernItalic" for Computer Modern with italic
(rather than slanted) faces.
The postscript()/pdf() font metrics for the
14 standard fonts (only, not the rest of the com-
mon 35) have been updated to versions from
late 1999 which cover more glyphs. There are
also a few differences in the metrics and hence
the output might be slightly different in some
cases.
The way families can be specified for
postscript() and pdf() has been expanded
to include CID-keyed fonts, with new func-
tions Type1Font() and CIDFont() to set up
such font families.
prettyNum() has new arguments ’pre-
serve.width’ and ’zero.print’. When the former
is not "none", as in calls from format() and
formatC(), the resulting strings are kept at the
desired width when possible even after adding
of ’big.mark’ or ’small.mark’.
proc.time() and system.time() now record
times to 1 ms accuracy where available (most
Unix-like systems).
The initialization methods for the quasi()
family have been changed to depend on the
variance function, and in particular to work
better for the "mu(1-mu)" variance function.
(PR#8486)
read.table() gains a ’flush’ argument passed
to scan().
require() now takes a ’lib.loc’ argument.
The second argument ’size’ to sample() is re-
quired to have length 1, so that errors when
supplying arguments are more easily detected.
The default is now compress = !ascii in save()
(but not save.image).
scan() and write.table() now have some in-
terruptibility, which may be useful when pro-
cessing very large files.
A new heuristic test, seemsS4Object() is
supplied, along with a similar C-level test,
R_seemsS4Object(object). The test detects
probable S4 objects by their class’s attribute.
See the help page.
• S3 classes can now be made non-virtual S4
classes by supplying a prototype object in the
arguments to setOldClass().
splinefun() returns a function that now also
has a ’deriv’ argument and can provide up to
the 3rd derivative of the interpolating spline,
thanks to Berwin Turlach.
stopifnot(A) now gives a better error mes-
sage when A has NAs, and uses "not all TRUE"
when A has length >= 2.
str()’s default method has a new argu-
ment ’strict.width’ which can be used to
produce strict ’width’ conforming output.
A new options(str = list(strict.width =
*)) setting allows one to control this for a whole
session.
summary.nls() has a new argument ’correla-
tion’ that defaults to FALSE (like summary.lm).
Sys.sleep() has sub-millisecond resolution on
Unix-alikes with gettimeofday().
Sys.time() now has sub-millisecond accuracy
on systems supporting the POSIX call gettime-
ofday, and clock-tick accuracy on Windows.
• The new function timestamp() adds a time
stamp to the saved command history on con-
soles which support it.
• New function tcrossprod() for efficiently
computing x %*% t(x) and x %*% t(y).
The suffix used by tempfile() is now in hex
on all platforms and guaranteed to be at least 6
hex digits (usually 8).
trace() now works more consistently and
more like its documentation, in particular the
assertions about old tracing being removed for
new. For debugging purposes (of R) a mech-
anism for debugging the trace computations
themselves was added. See trace.R.
The implementation of trace() has been
made more general by calling a function to do
the trace interaction, and recover() now de-
tects trace calls to trim the irrelevant code un-
derneath.
unserialize() can now also read a byte
stream from a raw vector.
• The useDynLib() directive in the NAMES-
PACE file now accepts the names of the na-
tive/foreign symbols that are to be resolved in
the DLL for use in .C/.Call/.Fortran/.External
calls. These can be used as regular R vari-
ables instead of the (routine name, PACK-
AGE) pairs currently recommended. Alterna-
tive names can be given for the R variables
mapping to these symbols. The native routine
registration information can also be used di-
rectly via useDynLib(name, .registration =
TRUE). See the ’Writing R Extensions’ manual
for more details.
checkFF() (package ’tools’) has been updated
accordingly.
validObject() has an option complete=TRUE
that recursively checks the objects in the slots.
Not used when new(...) checks validity.
• New Vectorize() function, a wrapper for
mapply().
write.ftable() has gained an argument ’ap-
pend = FALSE’ (thanks to Stephen Weigand).
• On Unix-alikes, X11() now has arguments
to request the initial position of the win-
dow, and ’gamma’ defaults to the value of
getOption("gamma"). These changes are con-
sistent with the windows() device.
X11() and the Unix-alike data entry window
can have properties (including geometry) set
by X resources: see their help files.
xy.coords() &xyz.coords() now have
NULL defaults for their ’y’ or ’y’ and ’z’ ar-
guments. This is more consistent with their
earlier documentation, and may be convenient
for using them.
Non-syntactic names of list elements are now
printed quoted by backticks rather than double
quotes.
There is some basic checking for imminent C
stack overflow (when the evaluation depth and
the user interrupts are checked). On systems
with suitable OS support (not Windows), seg-
faults from C stack overflow are caught and
treated as an R error.
New function Cstack_info() reports on stack
size and usage.
options(expressions) reverts to the default
of 5000 now that stack checking is in place.
Package tcltk does not try to initialize Tk
on Unix-alikes unless a DISPLAY variable is
present. This allows packages dependent on
tcltk to be installed without access to an X
server.
The code used to guess timezone offsets where
not supplied by the OS uses a different algo-
rithm that is more likely to guess the summer-
time transitions correctly.
Package tools contains translation tables
’Adobe_glyphs’ and ’charset_to_Unicode’.
• Changed the environment tree to be rooted
in an empty environment, available as
emptyenv(). baseenv() has been modified
to return an environment with emptyenv() as
parent, rather than NULL.
gettext has been updated to 0.14.5.
PCRE has been updated to version 6.4.
The method $.DLLInfo resolves the specified
symbol in the DLL, returning a NativeSymbol-
Info object. Use [[ to access the actual values in
the DLLInfo object.
On systems with either vasprintf or both
va_copy and a vsnprintf which reports the
size of buffer required, connections such as
gzfile() and bzfile() can now write arbi-
trarily long lines, not just 100000 chars.
The R session temporary directory is now set
in C code using the same algorithm whether
or not the shell front-end is used and on all
platforms. This looks at environment variables
TMPDIR, TMP and TEMP in turn, and checks
if they point to a writable directory.
Some of the classical tests put unnecessary re-
strictions on the LHS in the formula interface
(e.g., t.test(x+y ~ g) was not allowed).
On suitably equipped Unix-alike systems, seg-
faults, illegal operations and bus errors are
caught and there is a simple error-handler
which gives the user some choice as to what to
do in interactive use. [Experimental.]
On Windows access violations and illegal in-
structions are caught with a simple error han-
dler. [Experimental.]
Tracebacks now include calls such as
.C/.Fortran/.Call, which will help if errors oc-
cur in R code evaluated by compiled code and
in tracebacks presented by the segfault etc han-
dlers.
Treatment of signature objects and method
definition objects has been modified to give
cleaner printing and more consistency in the
treatment of signatures. A sometimes useful
utility, methodSignatureMatrix(), is now ex-
ported.
R refrains from printing a final EOL upon ex-
iting the main loop if the quiet flag is on and
if the save action is known (e.g. this is true for
--slave).
Deprecated & defunct
The deprecated and undocumented use of
atan() with two arguments has been removed:
instead use atan2().
write.table0() is defunct in favour of
write.table().
format.char() is defunct in favour of
format.default().
Support for the long-deprecated (and no
longer documented) arguments --min-vsize
--min-nsize --max-vsize --max-nsize --vsize --nsize
of R CMD BATCH has been removed.
The ’debian’ subdirectory has been removed
from the sources.
The ’vfont’ argument of axis() and mtext()
has been removed: use par(family=) instead.
The unused graphical parameter "type" has
been removed: it invited confusion with the
’type’ argument to default methods of plot(),
points() and lines().
nlsMethod() and profiler() are no longer
exported from the stats namespace (and
nlsMethod.plinear() is no longer registered
as a method, as nlsMethod() was not generic).
The re-named tcltk functions tkcmd, tkfile.tail,
tkfile.dir, tkopen, tkclose, tkputs, and tkread
are now formally deprecated.
Argument ’col’ of bxp() is now formally dep-
recated.
Use of NULL as an environment is deprecated
and gives a warning.
postscriptFont() is deprecated in favour of
Type1Font() (which is just a change of name).
La.chol() and La.chol2inv() are deprecated
(they have since R 1.7.0 been the same as the
default options of chol() and chol2inv()).
La.svd(method = "dgesvd") is deprecated.
The use of install.R and R_PROFILE.R files in
packages is deprecated: use the DESCRIPTION
file instead to arrange to save an image or to
load dependent packages.
The following command-line options to IN-
STALL are deprecated (use the fields in the DE-
SCRIPTION file instead): -s --save --no-save
--lazy --no-lazy --lazy-data --no-lazy-data
Graphical parameter ’tmag’ (which is long un-
used) is deprecated.
Internationalization
A set of patches supplied by Ei-ji Nakama has been
incorporated.
New postscript encodings for CP1253, CP1257
and Greek (ISO 8859-7).
Support for East Asian CID-keyed fonts in
pdf() and postscript(). Although these usu-
ally contain Latin characters no accurate AFMs
are available and so CID-keyed fonts are in-
tended only for use with CJK characters.
Wide-character width functions wc[s]width are
provided that overcome problems found with
OS-supplied ones (and those previously used
by R on Windows). This means that double-
width CJK characters are now supported on
all platforms. It seems that the width of some
characters (and not just CJK characters) de-
pends on which CJK locale’s fonts are in use
and also on the OS.
Revised wide-character classification functions
are provided for use on Windows, AIX and Ma-
cOS X to replace deficient OS-supplied ones.
There is support for MBCS charsets in the
pictex() graphics device, and rotated (by 90
degrees) text may work better.
The \u (and \U except on Windows) notation
for characters which is supported by the parser
in all MBCS charsets is now always interpreted
as a Unicode point, even on platforms which
do not encode wchar_t in Unicode. These are
now a syntax error in single-byte locales.
• The default encoding for postscript() and
pdf() is chosen to be suitable for the current
locale, if that is a single-byte locale which is
supported. This covers European (including
Greek) and Cyrillic languages.
In UTF-8 locales, a suitable single-byte encod-
ing is chosen for postscript() and pdf(), and
text translated to it.
xfig() gains an ’encoding’ argument.
There are some message translations into Span-
ish.
Installation changes
The encoding files for pdf()/postscript()
have been moved to directory ’enc’ in package
’grDevices’.
Support for MBCS is only enabled if iconv is
found and it supports enough conversions. (li-
biconv does.)
In an MBCS locale, make check now translates
the graphics examples from Latin-1. This en-
sures that they will work correctly in UTF-8:
it is possible that in other MBCS locales they
will now fail (rather than work completely in-
correctly).
There is a new test, ’test-Docs’, which as part of
’make check-devel’ tests the code in the docu-
mentation. Currently it runs doc/manual/R-
{exts,intro}.R and the compiled code in R-
exts.c.
The workaround to allow an external
LAPACK-containing BLAS such as libsunperf
to be used with the internal LAPACK has been
removed. If you have such a library you may
now need to use --with-lapack. It is no longer
possible to use some older versions of libsun-
perf, e.g. Forte 7 on 64-bit builds.
A substitute for mkdtemp is provided, so it is
now always used for R_TempDir.
Most of the functions checked for by ’config-
ure’ also have declarations checked for in the
appropriate header.
The top-level documentation files AUTHORS,
COPYING.LIB, COPYRIGHTS, FAQ, RE-
SOURCES, THANKS have been moved to doc,
and COPYING and NEWS are installed there.
The file Y2K has been removed from the distri-
bution.
The extension .lo is no longer used in building
R (only in the optional build of libRmath.so):
this allows a considerable simplification of the
Makefiles.
Direct support for f2c has been removed: it can
still be used via a script which makes it look
like a Fortran compiler. (src/scripts/f77_f2c is
an example of such a script.)
There is a new flag SAFE_FFLAGS which is
used for the compilation of dlamc.f. It is set
by configure for known problem cases (recent
g77 and gfortran), but can be overridden by the
user.
The standard autoconf macros for large-file
support are now used, and these are enabled
unless --disable-largefile is specified. This re-
places --enable-linux-lfs (and is now selected by
default).
Visibility attributes are used where sup-
ported (gcc4/gfortran on some platforms, also
gcc3/g77 on FC3 and partially elsewhere). The
main benefit should be faster loading (and per-
haps better optimized code) in some of the dy-
namic shared objects (e.g. libR.so and stats.so).
• The *PICFLAGS are taken to be -fpic rather
than -fPIC where possible. This will make no
difference on most platforms: -fPIC is needed
on Sparc (and still used there), but -fpic should
give slightly better performance on PowerPC
(although -fPIC is used on PPC64 as it is
needed to build libR.so there).
More use is made of inlining for small utility
functions such as isReal. Because this can only
be done portably with C99 constructs (and we
know of no actual implementation), this is only
done for the GNU C compiler.
There is an experimental feature to allow
shared installations of sub-architectures. See
the R-admin manual.
All platforms now use R’s internal implemen-
tation of strptime, which allows fractional sec-
onds. (The major platforms were already using
it.)
The dlcompat work-around for old Mac OS X
systems (<= 10.2) has been removed. External
dlcompat must be installed if needed.
Utilities
R CMD check now uses an install log by de-
fault.
R CMD check works for packages whose pack-
age name is different from the directory name
in which it is located.
R CMD INSTALL now uses more randomness
in the temporary directory name even on sys-
tems without mktemp -d.
R CMD f77 has been removed now that f2c is
no longer supported.
The version string shown in the startup mes-
sage and by "R --version", and that stored in
variable R.version.string, are now in exactly
the same format.
The base name of a help file needs to be valid
as part of a file:// URL, so R CMD check now
checks that the names are ASCII and do not
contain '%'.
R CMD check now warns about unknown sec-
tions in Rd files, and invalid names for help,
demo and R files, as well as unlikely file names
in the ’src’ directory. The latter is controlled by
option --check-subdirs and by default is done if
checking a tarball without a configure script.
R CMD build excludes invalid files in the
’man’, ’R’ and ’demo’ subdirectories.
\usepackage[noae]{Sweave} in the header of
an Sweave file suppresses auto-usage of the ae
package ("almost European" fonts) and T1 in-
put encoding.
Documentation
Rd format now allows \var{} markup inside
\code{} and \examples{}.
Markup such as --,---, < and > is handled bet-
ter when converting .Rd files to [C]HTML.
There is new markup \link[=dest]{name} to
generate a link to topic ’dest’ which is shown
as ’name’, and \linkS4class{abc} which ex-
pands to \link[=abc-class]{abc}, for cross-
referencing the recommended form of docu-
mentation for S4 classes.
Package Installation
There is now some support for Fortran 90/95
code in packages: see ‘Writing R Extensions’.
Installation of man sources and demos is now
done by R code. The restrictions on the names
of help files, R files and of demos are now en-
forced (see ‘Writing R Extensions’).
Packages that contain compiled code can now
have more than one dot in their name even on
Windows.
The Meta/hsearch.rds database saved now
contains LibPath="". This information is now
always recreated when help.search() is run,
but the field is retained for back-compatibility.
update.packages() now has a ’...’ argument
to be passed to install.packages(), includ-
ing the formerly separate arguments ’destdir’
and ’installWithVers’.
Make macros AR and RANLIB are now de-
clared in etc/Makeconf for use by packages
that wish to make static libraries.
C-level facilities
qgamma and rgamma in Rmath.h now check
for non-positive arguments.
The BLAS that ships with R now contains
the complete set of double-complex BLAS rou-
tines, rather than just those used in R.
<R_ext/BLAS.h> has been corrected to add
the missing double-precision BLAS functions
drotmg and drotm, and to exclude lsame
(which is a Lapack auxiliary function and is
now declared in <R_ext/Lapack.h>). It also in-
cludes the double complex routines added for
this release of R provided Fortran doublecom-
plex is usable on the platform.
<R_ext/BLAS.h> and <R_ext/Lapack.h> now
declare all the entry points as ’extern’.
The flag SAFE_FFLAGS is made available to
packages via etc/Makeconf and R CMD con-
fig. It can be used where optimization needs to
be defeated, e.g. in LAPACK setup.
getNativeSymbolInfo has a withRegistra-
tionInfo argument which causes the address
field to be a reference to the registration in-
formation if it is available for that symbol. If
the registration information is not available,
the address is a reference to the native sym-
bol. The default is FALSE which is backward
compatible, returning just the address of the
symbol and ignoring registration information.
errorcall and warningcall are now declared in
<Rinternals.h> (they might be needed in front-
ends).
R_FlushConsole and R_ProcessEvents are now
declared in <R.h>.
The R_Sock* functions supporting socket con-
nections are no longer declared in R-ftp-http.h
as they are not loaded into R itself, and are now
hidden in the module’s DLL on suitable sys-
tems.
Bug fixes
Quoted arguments to the R script after --args
are now passed quoted to the R exe-
cutable and so will be shown as expected by
commandArgs(). (They were previously split at
whitespace even inside quotes on Unix-alikes
but not on Windows.)
axis() now supports pars ’xaxp’/’yaxp’ as in-
line arguments.
sort() now does not return inappropriate at-
tributes such as "dim" and "tsp": it only returns
names.
sort(x, partial=) no longer returns un-
sorted names, and drops names (since it is sup-
plied for efficiency).
Use of non-central F in pf() gives accurate val-
ues for larger ncp.
R CMD build --binary does a better job of clean-
ing up after failure to re-make vignettes.
• reg-test-1.R tested system(intern=TRUE)
which depends on popen and so is not sup-
ported on all platforms.
Changed apparent mis-spelling of "Gibraltar"
in dataset ’eurodist’.
sysconf() is now used to find the number of
clock ticks/second: under some circumstances
glibc reported CLK_TCK = 60 when the true
value was 100.
identical() was not allowing for embedded
nuls in character strings. (NB: the comparison
operators including == do not, and never will.)
The profile() and profiler() methods for
"nls" objects now support algorithm = "plinear"
and algorithm = "port".
The signal handlers for signals USR1 and USR2
were not restored if the signal arrived when
interrupts were suspended.
Certain combinations of S4 inheritance could
cause inherited methods to override some di-
rectly specified methods.
Some cases of named signatures in calls to
setMethod() caused errors.
all.equal() is now more consistent and
"picky" about mismatching attributes, in partic-
ular names(); this is a part of the propositions
by Andy Piskorkski (PR#8191).
load() when applied to a connection leaves it
open/not as it found it, and checks explicitly
for having a binary readable connection.
• The p-values given by stat.anova() (called
from several anova() methods) are now NA
(rather than spurious) if non-nested models
give rise to changes in deviance with a differ-
ent sign from changes in degrees of freedom.
Built-ins were reported as the relevant call in C-
level error()s iff R profiling was in progress.
Now they are never reported.
Too-long signatures (with no names) were not
being caught in setMethod().
Slot names in prototype() are being more
thoroughly checked.
signif() is more likely to follow the ’round to
even’ rule for exactly representable numbers,
e.g. signif(0.25, 1). (Related to PR#8452.)
nls() now works correctly with some low-
dimensional fits, e.g. with one or zero non-
linear parameters.
glm() could give an inappropriate error mes-
sage if all possible coefficients were invalid
(e.g. a log-linear binomial model with no in-
tercept and a not all positive predictor).
solve() gives clearer error messages for some
incorrect usages. (PR#8494 and similar)
• The gaussian() family was missing the
’valideta’ component (which could be needed
for the "inverse" link function).
The starting values supplied by the gaussian
family could be invalid for the "log" and "in-
verse" link functions. This is now reported.
data.matrix() did not work correctly on zero-
row data frames. (PR#8496 and other prob-
lems.)
The DSC comments in the files from
postscript(onefile=FALSE) now label all
files as having page 1 of 1, as some other soft-
ware seems to expect that.
The axis labels chosen for logarithmic axis are
now less likely to be linear and inappropriate
(when the range is more than 10 and less than
100). (PR#1235)
Staircase lines (types "s" and "S") are now
drawn continuously rather than a point at a
time and so line types, mitring and so on work.
(PR#2630)
Calling par(mfg) before doing any plotting re-
sulted in NewPage never being called on the
device, which in turn resulted in incorrect out-
put for postscript() and pdf() devices. (Re-
ported by Marc Schwartz in discussion of the
non-bug PR#7820.)
• terms.formula needed to add parentheses to
formulae with terms containing ’|’. (PR#8462)
pbirthday() and qbirthday() now also work
for very improbable events — those you are
typically *not* interested in.
Only source help files starting with an upper-
or lower-case letter or digit and extension .Rd
or .rd are documented to be processed. This is
more liberal in that starting with a digit is now
also allowed, but the rule is now enforced.
nls(algorithm="port") was always taking
positive numeric differences and so could ex-
ceed the upper bounds.
methods:::.asEnvironmentPackage() was
not allowing for versioned installs.
.find.package() now reports which
package(s) it cannot find in case it stops with
an error.
The standard Unix-alike version of
file.show() gives an informative message if
it cannot open a file rather than the (possibly
incorrect) ’NO FILE’.
window() did not allow non-overlapping
ranges with extend = TRUE. (PR#8545)
pbinom(size = 0) now returns correct values
(not NaN). (PR#8560)
[dp]binom(x, *) for x < 0 now always returns
0 (PR#8700). Analogous change in pgeom(),
pnbinom() and ppois().
[dqpr]geom() and [dpqr]nbinom() now all
consistently accept prob = 1 but not prob = 0.
qgeom(prob=1) now gives the correct values
(not -1).
INSTALL on Unix-alikes was not loading de-
pendent packages when preparing for lazy-
loading.
qcauchy(1) now gives +Inf instead of just a
very large number.
df(0, f1, *) now properly returns Inf, 1, or 0
for f1 <, =, or > 2.
qbinom(),qnbinom() and qpois() now use a
better search and normally reach the answer
very quickly when it is large (instead of being
slow or infinite-looping).
pt(x, df) lost accuracy in the far tails (when
|x| > 1e154) for small df (like df = 0.001 for
which such extremes are not unlikely).
dbeta(x, a, b) underflowed internally and
incorrectly gave 0 for very small x and a.
None of the warnings about convergence fail-
ures or loss of precision in nmath (distribution
and special functions) were being reported to
the R user.
dnt was missing from standalone nmath (un-
der Unix-alikes).
split() now accepts factors with numeric (but
not storage mode integer) codes.
The utilities such as ’check’ now report active
version numbers again, as SVN ’last changed
revision’ numbers.
addmargins() did not accept a name for ’FUN’,
only an expression.
’+’ for POSIXt objects now takes the tzone from
whichever object has it, so date+x is the same
as x+date if x is numeric.
mean.default() and var() compute means
with an additional pass and so are often more
accurate, e.g. the variance of a constant vector
is (almost) always zero and the mean of such
a vector will be equal to the constant value to
machine precision. (PR#1228)
sum(),prod(),mean(),rowSums() and friends
use a long double accumulator where avail-
able and so may be more accurate. (This is
particularly helpful on systems such as Sparc
and AMD64 where long double gives consider-
ably greater exponent range and precision than
double.)
read.dcf() now gives a warning on mal-
formed lines.
add1.[g]lm now try harder to use the environ-
ment of the formula in the original fit to look
for objects such as the ’data’ and ’subset’ argu-
ments.
gaussian()$aic was inconsistent with e.g. the
lm results from AIC() and extractAIC() for
weighted fits: it treated the weights as case
weights and not variance factors.
system() on Unix-alikes ignored non-logical
values of ’intern’ and treated ’intern = NA’ as
true.
as.table() now produces non-NA rownames
when converting a matrix of more than 26
rows. (PR#8652)
Partial sorting used an algorithm that was in-
tended only for a few values of ’partial’ and
so could be far slower than a full sort. It now
switches to a barebones full sort for more than
10 values of ’partial’ and uses a more efficient
recursive implementation for 2...10.
summary.glm() returned an estimate of disper-
sion of Inf for a gaussian glm with zero resid-
ual degrees of freedom and then treated that as
a known value. It now uses the estimate NaN,
which is consistent with summary.lm().
Sys.sleep() on Unix-alikes was restricted to
about 2147 seconds and otherwise might never
have returned. (PR#8678)
is(obj, Cl) could wrongly report TRUE
when Cl was a classUnion and multiple inheri-
tance was involved.
confint[.lm / .default] used label "100 %" for
level = 0.999
Empty entries (i.e., extraneous ",") in NAMES-
PACE files now give a better error message
early at parsing time instead of a less compre-
hensible one later at load time.
all.equal(n1, n2) could erroneously return
NA when n1, n2 contained large integers.
anova.mlm() didn’t handle multi-df effects
properly in the single-model case (PR#8679)
anova.mlm() had its colnames mangled by
data.frame() (needed check.names=FALSE).
summary.glm() gave an NA estimate of disper-
sion for fits with zero weights. (PR#8720)
qhyper() had too small a tolerance for right-
continuity on some platforms so was not al-
ways an inverse to phyper().
rownames<-.data.frame() tested the
length(s) of the replacement value(s)
before coercion, which can change the
length (e.g. for class "POSIXlt"). Ditto
dimnames<-.data.frame().
max() and min() ignored the largest/smallest
representable integer, as well as Inf/-Inf.
(PR#8731)
write.table() assumed factors had integer
codes: it now allows malformed factors with
numeric codes (and otherwise throws an error).
Worked around a Solaris restriction which
meant that Sys.sleep() was only effective for
times of up to one second.
sink(, split=TRUE) now works correctly, but
is allowed only on platforms that support
va_copy or __va_copy. (PR#8716)
factanal(),prcomp() and princomp() now
only check that columns in the model frame
that will be used are numeric (they previously
also checked columns which were part of neg-
ative terms in the formula).
Misuse of $ in apply could corrupt memory.
(PR#8718)
apply() could fail if the function returned
NULL (e.g. if there was a single row).
registerS3method() failed due to a typo. (It
was almost never used.)
Registering an S3 method for an S3 generic in
another package that was converted to an S4
generic in the same package as the S3 method,
registered the method in the wrong place.
Recall() used lookup for the function in use
and so could fail if that was an S3 method not
on the search path.
Rdconv -t Ssgm failed if it encountered
\link[opt]{arg}.
uniroot() did not give a warning (as docu-
mented) if it failed to converge in ’maxiter’
steps. (PR#8751)
eapply (and as.list.environment) did not
work for the base environment/namespace.
(PR#8761)
Added protection in configure against systems
for which using xmkmf fails to report a C or C++
compiler.
expand.grid() was constructing a data frame
’by hand’ and so setting integer row.names
(which are documented to be character).
It now sets character row names, and
row.names.data.frame() coerces to character.
qbeta() used == on volatile doubles for its con-
vergence test, which failed with gcc 3.3.x on
ix86 Linux. We now use a less fragile test (and
lose a negligible amount of accuracy).
ls.str() was missing inherits=FALSE, and so
could have reported on an object of the same
name but a different mode in the enclosure of
the given environment.
logLik.nls assumed that sigma^2 had been es-
timated, but did not count this in the ’df’ at-
tribute.
Changes on CRAN
by Kurt Hornik
New contributed packages
BayesTree Software accompanying the paper
“Bayesian Additive Regression Trees” by
Chipman, George and McCulloch (2005), see
http://gsbwww.uchicago.edu/fac/robert.
mcculloch/research. By Hugh Chipman and
Robert McCulloch.
BayesValidate Implements the software validation
method described in the paper “Validation of
Software for Bayesian Models using Posterior
Quantiles” (Cook, Gelman, and Rubin, 2005).
It inputs a function to perform Bayesian in-
ference as well as functions to generate data
from the Bayesian model being fit, and repeat-
edly generates and analyzes data to check that
the Bayesian inference program works prop-
erly. By Samantha Cook.
CTFS The CTFS Large Plot Forest Dynamics Analy-
ses. By Pamela Hall.
CVThresh Carries out a level-dependent cross-
validation method for the selection of a thresh-
olding value in wavelet shrinkage. This pro-
cedure is implemented by coupling a con-
ventional cross validation with an imputation
method due to a limitation of data length, a
power of 2. It can be easily applied to classical
leave-one-out and k-fold cross validation. Since
the procedure is computationally fast, a level-
dependent cross validation can be performed
for wavelet shrinkage of various data such as a
data with correlated errors. By Donghoh Kim
and Hee-Seok Oh.
DescribeDisplay Produce publication quality
graphics from output of GGobi’s describe dis-
play plug-in. By Hadley Wickham, Di Cook,
and Andreas Buja.
FLCore Contains the core classes and methods for
FLR, a framework for fisheries modeling and
management strategy simulation in R. Devel-
oped by a team of fisheries scientists in various
countries. More information can be found at
http://flr-project.org/, including a devel-
opment mailing list. By the FLR Team and var-
ious contributors. Initial design by Laurence T.
Kell & Philippe Grosjean.
FactoMineR Exploratory data analysis. By François
Husson, Sébastien Lê, and Jérémy Mazet.
FortranCallsR Teaches how to implement Fortran
Code calling R functions. By Diethelm Wuertz.
FracSim Perform simulation of one- and two-
dimensional fractional and multifractional
Lévy motions. By S. Déjean and S. Cohen.
FunCluster Performs a functional analysis of mi-
croarray expression data based on Gene On-
tology & KEGG functional annotations. From
expression data and functional annotations
FunCluster builds classes of putatively co-
regulated biological processes through a spe-
cially designed clustering procedure. By Cor-
neliu Henegar.
GammaTest A suite of Gamma test analysis tools.
By Samuel E. Kemp.
GroupSeq Performing computations related to
group sequential boundaries. The computa-
tions are done via the alpha spending approach,
i.e., interim analyses need not be equally
spaced, and their number need not be spec-
ified in advance. It is appropriate for any trial
based on normally distributed test statistics
with independent increments, survival studies,
and certain longitudinal designs. Among other
things it computes critical boundaries for var-
ious spending functions and for prespecified
power and drift. Confidence intervals are also
obtained. The package provides this in a GUI
and users have the option of graphical output
of the results and/or saving designs into ‘.html’
file tables allowing further processing. By Ro-
man Pahl.
HSAUR Functions, data sets, analyses and exam-
ples from the book “A Handbook of Statisti-
cal Analyses Using R” by Brian S. Everitt and
Torsten Hothorn, Chapman & Hall/CRC, 2006.
The first chapter of the book, which is enti-
tled “An Introduction to R”, is completely in-
cluded in this package, for all other chapters,
a vignette containing all data analyses is avail-
able. By Brian S. Everitt and Torsten Hothorn.
JointGLM Joint modeling of mean and dispersion
through two interlinked GLM’s. By Mathieu
Ribatet and Bertrand Iooss.
LoopAnalyst Tools to conduct Levins’ Loop Anal-
ysis. Loop analysis makes qualitative predic-
tions of variable change in a system of causally
interdependent variables, where “qualitative”
means sign only (i.e., increases, decreases, non
change, and ambiguous). This implementation
includes output support for graphs in ‘.dot’
file format for use with visualization software
such as graphviz (http://www.graphviz.org).
Loop Analyst provides tools for the construc-
tion and output of community matrices, com-
putation and output of community effect ma-
trices, tables of correlations, adjoint, absolute
feedback, weighted feedback and weighted
prediction matrices, and feedback, path and
loop enumeration tools. By Alexis Dinno.
LowRankQP Routines and documentation for solving quadratic programming problems where the Hessian is represented as the product of two matrices. By J. T. Ormerod and M. P. Wand.
PK Estimation of pharmacokinetic parameters. By
Martin J. Wolfsegger and Thomas Jaki.
QCA Performs the Quine-McCluskey algorithm
for Qualitative Comparative Analysis, as de-
scribed in “The Comparative Method. Mov-
ing beyond qualitative and quantitative strate-
gies” by Charles C. Ragin, Berkeley: University
of California Press, 1987. It currently handles
about 8 conditions and one outcome. While
theoretically it could handle more conditions,
it requires a lot of computer resources and
is memory hungry; future versions will have
more functions to address this problem, as well
as functions for fuzzy-set QCA. The package does not currently handle missing values in the data; therefore it is not yet possible to use simplifying assumptions. By Adrian Dusa.
RBloomberg Fetch data from Bloomberg. By Robert
Sams.
RWeka An R interface to Weka (Version 3.4.7). Weka
is a collection of machine learning algorithms
for data mining tasks written in Java, con-
taining tools for data pre-processing, classifi-
cation, regression, clustering, association rules,
and visualization. Both the R interface and
Weka itself are contained in the RWeka pack-
age. For more information on Weka see
http://www.cs.waikato.ac.nz/ml/weka/. By
Kurt Hornik, with contributions from Christian
Buchta, Torsten Hothorn, Alexandros Karat-
zoglou, David Meyer, and Achim Zeileis.
RcppTemplate Package template illustrating the use
of the Rcpp R/C++ interface class library. By
Dominick Samperi.
Rmdr R-Multifactor Dimensionality Reduction
(MDR), a nonparametric and genetic model-
free alternative to logistic regression for detect-
ing and characterizing nonlinear interactions
among discrete genetic and environmental at-
tributes. By Mounir Aout, with contributions
from C. Wachter.
StoppingRules Stopping rules for microarray clas-
sifiers. By Wenjiang J. Fu et al. (functions) and
M. T. Mader (packaging).
VDCutil The VDC system is an open source digital library system for quantitative data. This package supports on-line analysis using VDC and Zelig, accuracy, and R2HTML. By Micah Altman and Akio Sone.
actuar Collection of functions related to actuarial
science applications, namely credibility theory
and risk theory, for the moment. The package
also includes the famous Hachemeister (1975)
data set. By Vincent Goulet and Sébastien Au-
clair.
aplpack Functions for drawing some special plots:
stem and leaf, bagplot, Chernoff faces, and an
inspection of a 3-dimensional point cloud. By
Peter Wolf.
blockrand Create randomizations for block random
clinical trials. Can also produce a PDF file of
randomization cards. By Greg Snow.
calibrate Draw calibrated scales with tick marks on
(non-orthogonal) variable vectors in scatter-
plots and biplots. By Jan Graffelman.
classInt Choose univariate class intervals for map-
ping or other graphics purposes. By Roger Bi-
vand.
clusterRepro Validate microarray clusters via repro-
ducibility. By Amy Kapp and Rob Tibshirani.
cocorresp Fits predictive and symmetric co-
correspondence analysis (CoCA) models to
relate one data matrix to another data ma-
trix. More specifically, CoCA maximizes the
weighted covariance between the weighted av-
eraged species scores of one community and
the weighted averaged species scores of an-
other community. CoCA attempts to find pat-
terns that are common to both communities.
Original Matlab routines by C.J.F. ter Braak
and A.P. Schaffers. R port by Gavin L. Simp-
son. Function simpls based on simpls.fit
from package pls by Ron Wehrens and Bjorn-
Helge Mevik.
coxrobust Robust fitting of the proportional hazards regression model. By Tadeusz Bednarski and Filip Borowicz.
crq Quantile regression for randomly censored data.
By Stephen Portnoy, with contributions from
Tereza Neocleous and Roger Koenker.
cwhmisc A bundle of miscellaneous functions by Christian W. Hoffmann. Contains packages cwhmath, cwhplot, cwhprint, cwhstring, cwhstat, and cwhtool.
data.table Data frames without rownames. The white book specifies that data frames must have rownames. This package defines a new class, data.table, which operates just like a data.frame, but uses up to 10 times less memory, and can be up to 10 times faster to create (and copy). It also takes the opportunity to allow subset()- and with()-like expressions for subscripting. By Matt Dowle.
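A minimal sketch of the expression-style subscripting described above (syntax as in current versions of data.table; the interface of this first release may differ in detail):

  library(data.table)

  # construct a data.table just as one would a data.frame
  DT <- data.table(id = rep(c("a", "b"), each = 5), x = rnorm(10))

  # with()-like expressions are evaluated inside the table
  DT[x > 0]        # rows where x is positive
  DT[id == "a"]    # rows belonging to group "a"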
denpro Provides tools to visualize (1) multivari-
ate density functions and density estimates
with level set trees, (2) level sets with shape
trees, (3) multivariate data with tail trees, (4)
scales of multivariate density estimates with
mode graphs and branching maps, and (5)
anisotropic spread with 2D volume functions
and 2D probability content functions. With
level set trees one visualizes mode structure,
with shape trees one visualizes shapes of level
sets of unimodal densities, and with tail trees
one visualizes connected data sets. The ker-
nel estimator is implemented but the package
may be applied for visualizing other density es-
timates, which have to be setwise constant. By
Jussi Klemelä.
diveMove Functions to filter and summarize time-
depth recorder (TDR) data, and miscellaneous
functions for handling location data. By Sebas-
tian P. Luque.
doBy Facilities for groupwise computations. By
Søren Højsgaard.
drc Non-linear regression analysis for multiple
curves with focus on concentration-response,
dose-response and time-response curves used,
for example, in environmental sciences, phar-
macology, toxicology and weed science. By
Christian Ritz and Jens Streibig.
ecodist Dissimilarity-based analysis functions in-
cluding ordination and Mantel test functions,
intended for use with spatial and community
data. By Sarah Goslee and Dean Urban.
femmeR Plot and summarize results calculated by
the modeling environment FEMME (Soetaert,
2002). By Henrik Andersson, Andreas Hof-
mann and Karline Soetaert.
financial Time value of money, cash flows and other
financial functions. By Lukasz Komsta.
fuzzyRankTests Fuzzy rank tests and confidence in-
tervals. By Charles J. Geyer.
gamair Data sets used in the book “Generalized Ad-
ditive Models: An Introduction with R” by Si-
mon Wood, CRC, 2006. By Simon Wood.
ggplot Grammar of graphics based plots for R. See
http://had.co.nz/ggplot/ for more informa-
tion. By Hadley Wickham.
glmpath A path-following algorithm for L1-regularized generalized linear models and the Cox proportional hazards model. By Mee Young Park and Trevor Hastie.
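For context, a path algorithm of this kind traces the solutions of the standard L1-penalized likelihood problem (a textbook formulation, not a description of glmpath's internals),

\[
\hat{\beta}(\lambda) \;=\; \arg\min_{\beta}\;\bigl\{ -\ell(\beta) + \lambda \lVert \beta \rVert_{1} \bigr\},
\]

as the penalty \(\lambda\) decreases from the value at which all coefficients are exactly zero.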
grnnR Synthesizes a generalized regression neu-
ral network from the supplied training data,
P(atterns) and T(argets). By Arnold Arrington.
gsubfn Miscellaneous string utilities. By G.
Grothendieck.
haplo.ccs Estimates haplotype and covariate rela-
tive risks in case-control data by weighted
logistic regression. Diplotype probabilities,
which are estimated by EM computation with
progressive insertion of loci, are utilized as
weights. By Benjamin French and Thomas
Lumley.
hddplot Use known groups in high-dimensional
data to derive scores for plots. Cross-validated
linear discriminant calculations determine the
optimum number of features. Test and training
scores from successive cross-validation steps
determine, via a principal components calcu-
lation, a low-dimensional global space onto
which test scores are projected, in order to plot
them. Further functions are included for didac-
tic purposes. By John Maindonald.
hdrcde Computation of highest density regions in
one and two dimensions and kernel estimation
of univariate density functions conditional on
one covariate. By Rob Hyndman.
hybridHclust Hybrid hierarchical clustering via
mutual clusters. By Hugh Chipman and
Rob Tibshirani, with tsvq code originally from
Trevor Hastie.
igraph Routines for creating and manipulating
graphs, and graph visualization. It can handle
graphs with millions of vertices and edges. By
Gabor Csardi.
km.ci Computes various confidence intervals for the Kaplan-Meier estimator, namely: Peto's CI, Rothman CI, CIs based on Greenwood's variance, the Thomas and Grunkemeier CI, and the simultaneous confidence bands of Nair and of Hall and Wellner. By Ralf Strobl.
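Greenwood's variance, referred to above, is the usual estimate

\[
\widehat{\operatorname{Var}}\bigl(\hat S(t)\bigr) \;=\; \hat S(t)^{2} \sum_{t_i \le t} \frac{d_i}{n_i\,(n_i - d_i)},
\]

where \(d_i\) events occur among the \(n_i\) subjects at risk at time \(t_i\).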
knnFinder Finds the p nearest neighbours for every point in a given data set in O(M log M) time. By Samuel E. Kemp.
kzft Functions implementing Kolmogorov-
Zurbenko Fourier transform based peri-
odograms and smoothing methods. By Wei
Yang and Igor Zurbenko.
lmomco Implements the statistical theory of L-
moments including L-moment estimation,
probability-weighted moment estimation, pa-
rameter estimation for numerous familiar and
not-so-familiar distributions, and L-moment
estimation for the same distributions from the
parameters. L-moments are derived from the
expectations of order statistics and are linear
with respect to the probability-weighted mo-
ments. L-moments are directly analogous to
the well-known product moments; however, L-
moments have many advantages including un-
biasedness, robustness, and consistency with
respect to the product moments. This package
is oriented around the FORTRAN algorithms
of J.R.M. Hosking, and the nomenclature for
many of the functions parallels that of the
Hosking library. Numerous additional features
are added to aid in extension of the breadth of
L-moment application. Much theoretical exten-
sion of L-moment theory has occurred in recent
years. E.A.H. Elamir and A.H. Seheult have
developed the trimmed L-moments, which are
implemented in this package. Further, recent
developments by Robert Serfling and Peng
Xiao have extended L-moments into multivari-
ate space; the so-called sample L-comoments
are implemented here. The supported dis-
tributions with moment type shown as L (L-
moments) or TL (trimmed L-moments) include
the Cauchy(TL), Exponential(L), Gamma(L),
Generalized Extreme Value(L), Generalized
Lambda(L/TL), Generalized Logistic (L), Gen-
eralized Normal(L), Generalized Pareto(L/TL),
Gumbel(L), Normal(L), Kappa(L), Pearson
Type III(L), and Wakeby(L). By William H.
Asquith.
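To make the linearity mentioned above concrete, the first four L-moments are fixed linear combinations of the probability-weighted moments \(\beta_r\) (a standard relationship due to Hosking, not specific to this package):

\[
\lambda_1 = \beta_0, \quad
\lambda_2 = 2\beta_1 - \beta_0, \quad
\lambda_3 = 6\beta_2 - 6\beta_1 + \beta_0, \quad
\lambda_4 = 20\beta_3 - 30\beta_2 + 12\beta_1 - \beta_0 .
\]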
longitudinal General data structures and func-
tions for longitudinal data with multiple vari-
ables, repeated measurements, and irregularly
spaced time points. It also implements a
shrinkage estimator of dynamical correlation
and dynamical covariance. By Rainer Opgen-
Rhein and Korbinian Strimmer.
lsa Latent Semantic Analysis (LSA). The basic idea
of LSA is that texts do have a higher order (la-
tent semantic) structure which, however, is ob-
scured by word usage (e.g., through the use of
synonyms or polysemy). By using conceptual
indices that are derived statistically via a trun-
cated singular value decomposition (a two-
mode factor analysis) over a given document-
term matrix, this variability problem can be
overcome. By Fridolin Wild.
lspls Implements the LS-PLS (least squares – partial least squares) method described in, for instance, Jørgensen, K., Segtnan, V. H., Thyholt, K., and Næs, T. (2004), “A Comparison of Methods for Analysing Regression Models with Both Spectral and Designed Variables”, Journal of Chemometrics, 18(10), 451–464. By Bjørn-Helge Mevik.
mapLD Measures linkage disequilibrium and constructs haplotype blocks using the methods described in Gabriel et al. (2002) and Wall & Pritchard (2003). By Peter Hu and Jared Lunceford, with contributions from Xiang Yu, Bret Musser and Peggy Wong.
monreg Estimates monotone regression and vari-
ance functions in a nonparametric model. By
Kay Pilz and Steffanie Titoff, with earlier de-
velopments by Holger Dette and Kay Pilz.
nltm Implements nonlinear transformation models (proportional odds, Gamma frailty, proportional hazards, ...) for survival analysis; see Tsodikov (2003), “Semiparametric models: a generalized self-consistency approach”, Journal of the Royal Statistical Society B, 65, Part 3, 759–774. By Gilda Garibotti and Alexander Tsodikov.
numDeriv Accurate Numerical Derivatives. By
Paul Gilbert.
nws Coordination and parallel execution facilities,
as well as limited cross-language data ex-
change, using the netWorkSpaces server devel-
oped by Scientific Computing Associates, Inc.
By Nick Carriero with support and contribu-
tions from Gregory R. Warnes.
pbatR A frontend to the PBAT software, automati-
cally reading in the output from the pbat pro-
gram and displaying the corresponding fig-
ure when appropriate (i.e., PBAT-logrank). In-
cludes support for multiple processes. By
Christoph Lange (PBAT) and Thomas Hoff-
mann (R interface).
pcaPP Robust PCA by Projection Pursuit. By Pe-
ter Filzmoser, Heinrich Fritz, and Klaudius
Kalcher.
polyapost Generate dependent samples from a non-
full dimensional polytope via a Markov Chain
sampler. By Glen Meeden and Radu Lazar.
portfolio Classes for analyzing and implementing
portfolios. By Jeff Enos and David Kane.
pwr Power analysis functions along the lines of Co-
hen (1988). By Stéphane Champely.
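As an illustration, the Cohen-style question “how many subjects per group are needed to detect a medium effect?” is answered with the package's t-test routine roughly as follows (argument names as in current versions of pwr):

  library(pwr)

  # sample size per group for a medium effect (d = 0.5),
  # 80% power, two-sided 5% test
  pwr.t.test(d = 0.5, power = 0.8, sig.level = 0.05,
             type = "two.sample", alternative = "two.sided")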
quantregForest Quantile Regression Forests is a
tree-based ensemble method for estimation of
conditional quantiles. It is particularly well
suited for high-dimensional data. Predictor
variables of mixed classes can be handled. By
Nicolai Meinshausen.
rJava Low-level interface to the Java VM, very much like .C/.Call and friends. Allows creation of objects, calling methods and accessing fields. By Simon Urbanek.
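A small sketch of this low-level style (function names as in current rJava releases; early versions may differ in detail):

  library(rJava)
  .jinit()                                       # start the Java VM

  s <- .jnew("java/lang/String", "hello world")  # create a Java object
  .jcall(s, "I", "length")                       # call a method returning int
  .jcall(s, "S", "toUpperCase")                  # call a method returning String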
rake Raking a survey data set entails re-weighting a sample by making the sample marginal totals agree with the population marginal totals for two survey response variables. Raking is a robust technique that is often useful for dealing with nonresponse. This package streamlines the process of raking by creating the special rake class, which is essentially a summary of the sample weights. By Toby Dylan Hocking.
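The computation behind raking is iterative proportional fitting of the sample weights; a bare-bones base-R illustration of the idea (not the interface of the rake package itself, whose class and function names may differ) is:

  # rescale weights w so that the weighted margins of f1 and f2 match the
  # population margins pop1 and pop2 (named vectors of category totals)
  ipf_rake <- function(f1, f2, pop1, pop2, w = rep(1, length(f1)), iter = 20) {
    f1 <- as.character(f1)
    f2 <- as.character(f2)
    for (i in seq_len(iter)) {
      w <- w * pop1[f1] / tapply(w, f1, sum)[f1]   # match the first margin
      w <- w * pop2[f2] / tapply(w, f2, sum)[f2]   # match the second margin
    }
    unname(w)
  }

  # toy example: 5 respondents, population margins 60/40 and 50/50
  f1 <- c("m", "m", "f", "f", "f")
  f2 <- c("y", "o", "y", "o", "o")
  w  <- ipf_rake(f1, f2, pop1 = c(f = 60, m = 40), pop2 = c(o = 50, y = 50))
  round(tapply(w, f1, sum))   # weighted margin now close to c(f = 60, m = 40)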
rankreg Obtains the rank regression estimator for the AFT model with right-censored data. Tests of a given value of the regression coefficient and a re-sampling variance estimator can also be computed. By Mai Zhou; Splus original of aft.fun by Jin Zhezhen.
rda Shrunken Centroids Regularized Discriminant
Analysis for classification in high dimensional
data. By Yaqian Guo, Trevor Hastie, and Robert
Tibshirani.
relaimpo Provides several metrics for assessing rel-
ative importance in linear models. These can
be printed, plotted and bootstrapped. The rec-
ommended metric is lmg, which provides a de-
composition of the model explained variance
into non-negative contributions. There is a ver-
sion of this package available that additionally
provides a new and also recommended metric
called pmvd. If you are a non-US user, you
can download this extended version from Ul-
rike Groemping’s web site. By Ulrike Groemp-
ing.
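In practice the recommended decomposition is obtained from a fitted linear model along these lines (calc.relimp is the package's main function; argument details may have changed across versions):

  library(relaimpo)

  fit <- lm(Fertility ~ Agriculture + Education + Catholic, data = swiss)

  # decompose the model R^2 into non-negative lmg shares
  calc.relimp(fit, type = "lmg", rela = TRUE)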
rggobi An interface from R to ggobi for program-
matic dynamic, interactive visualization. By
Duncan Temple Lang, Debby Swayne, Hadley
Wickham, and Michael Lawrence.
riv Finds a robust instrumental variables estimator
using a high breakdown point S-estimator of
multivariate location and scatter matrix. By
Beat Kaufmann, with contributions from R. H.
Zamar and G. V. Cohen-Freue.
roblm Robust regression estimators. By Matias
Salibian-Barrera.
robustbase “Essential” Robust Statistics. The goal is to provide tools for analyzing data with robust methods. This includes regression methodology (including model selection) and multivariate statistics, where we strive to cover the upcoming book “Robust Statistics: Theory and Methods” by Maronna, Martin and Yohai, Wiley 2006. By Valentin Todorov, Andreas Ruckstuhl, Matias Salibian-Barrera, and Martin Maechler, based on code by many authors, notably Peter Rousseeuw and Christophe Croux; see file ‘Copyrights’.
rrp Random Recursive Partitioning method for matching, missing data imputation, and nonparametric classification and prediction. By S. M. Iacus.
rtiff Read TIFF format images and return them as a
pixmap object. Because the resulting object can
be very large for even modestly sized TIFF im-
ages, images can be reduced as they are read
for improved performance. This package is a
wrapper around libtiff (http://www.libtiff.
org), on which it depends. By using libtiff’s
high-level TIFFReadRGBAImage function, this
package inherently supports a wide range of
image formats and compression schemes. It
also provides an implementation of the Ridler
Autothresholding algorithm for easy genera-
tion of binary masks. By Eric Kort.
rv Simulation-based random variable object class.
By Jouni Kerman.
scope Calculate, per data frame row, a value that
depends on information in a relevant subset
of rows and columns. These functions create
and refine scope objects, which identify rele-
vant rows on a per-row basis. Columns can be
aggregated within relevant scopes to aid iden-
tification of a row of interest, from which an ar-
bitrary column value can be selected. By Tim
Bergsma.
scuba Dive profiles, decompression models and gas
calculations for scuba diving. By Adrian Bad-
deley.
seas Capable of deriving seasonal statistics, such
as “normals”, and analysis of seasonal data,
such as departures. This package also has
graphics capabilities for representing seasonal
data, including boxplots for seasonal param-
eters, and bars for summed normals. There
are many specific functions related to climatol-
ogy, including precipitation normals, tempera-
ture normals, cumulative precipitation depar-
tures and precipitation interarrivals. However,
this package is designed to represent any time-
varying parameter with a discernible seasonal
signal, such as found in hydrology and ecology.
By M. W. Toews.
seewave Functions for analyzing, manipulating, displaying, editing and synthesizing time waves (particularly sound). The package covers time analysis (oscillograms and envelopes), spectral content, resonance quality factor, cross-correlation and autocorrelation, zero-crossing, dominant frequency, and 2D and 3D spectrograms. By Jérôme Sueur, Thierry Aubin and Caroline Simonis-Sueur.
smatr Provides methods of fitting bivariate lines in
allometry using the major axis (MA) or stan-
dardised major axis (SMA), and for making
inferences about such lines. The available
methods of inference include confidence inter-
vals and one-sample tests for slope and eleva-
tion, testing for a common slope or elevation
amongst several allometric lines, constructing
a confidence interval for a common slope or el-
evation, and testing for no shift along a com-
mon axis, amongst several samples. By David
Warton, translated to R by John Ormerod.
spgwr Functions for computing geographically weighted regressions, based on work by Chris Brunsdon, Martin Charlton and Stewart Fotheringham; see http://ncg.nuim.ie/ncg/GWR/index.htm. By Roger Bivand and Danlin Yu.
ssanv Functions to calculate sample size for two-sample difference-in-means tests, with adjustments for either nonadherence or the variability that comes from using data to estimate parameters. By Michael Fay.
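For orientation, the unadjusted per-group sample size for detecting a difference \(\delta\) between two means with common variance \(\sigma^2\), level \(\alpha\) and power \(1-\beta\) is the familiar

\[
n \;=\; \frac{2\,\bigl(z_{1-\alpha/2} + z_{1-\beta}\bigr)^{2}\,\sigma^{2}}{\delta^{2}},
\]

which is the kind of calculation that the package's adjustments then modify.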
sudoku Generates, plays, and solves Sudoku puz-
zles. The GUI playSudoku() needs pack-
age tkrplot if you are not on Windows. By
David Brahm and Greg Snow, with contribu-
tions from Curt Seeliger and Henrik Bengtsson.
surveillance A framework for the development
and the evaluation of outbreak detection al-
gorithms in routinely collected public health
surveillance data. Currently contains an im-
plementation of the procedures described in
Stroup et al (1989) and Farrington et al (1996),
a Bayesian approach and the method used at
the Robert Koch Institute, Germany. Contains
several real-world data sets and the ability to
simulate outbreak data. By M. Höhle, C. Lang,
and A. Riebler.
tcltk2 A series of widgets (themed controls, tktable,
combobox, multi-column list, etc.) and various
functions (under Windows: DDE exchange, ac-
cess to the registry and icon manipulation) to
supplement tcltk. By Philippe Grosjean.
tgp Bayesian semiparametric and nonstationary re-
gression by treed Gaussian processes with
jumps to the limiting linear model (LLM). Spe-
cial cases also implemented include Bayesian
linear models, linear CART, stationary separa-
ble and isotropic Gaussian process regression.
Provides 1-d and 2-d plotting functions (with
projection and slice capabilities) and tree draw-
ing, designed for visualization of tgp-class out-
put. By Robert B. Gramacy.
truncgof Goodness-of-fit tests and some adjusted
exploratory tools allowing for left truncated
data. By Thomas Wolter.
tsfa Extraction of Factors from Multivariate Time Se-
ries. By Paul Gilbert and Erik Meijer.
twang Functions for propensity score estimation and weighting, nonresponse weighting, and diagnosis of the weights. By Greg Ridgeway, Dan McCaffrey, and Andrew Morral.
untb Utilities for biodiversity data. Includes the
simulation of ecological drift under Hubbell’s
Unified Neutral Theory of Biodiversity, and the
calculation of various diagnostics such as Pre-
ston curves. By Robin K. S. Hankin.
wccsom SOM networks for comparing patterns
with peak shifts. By Ron Wehrens.
Other changes
• Packages fdim and sound were resurrected from the Archive.
• Packages Malmig, nlrq and nprq were moved from the main CRAN section to the Archive.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News
by Kurt Hornik
Donations and new members
Donations
Christof Schramm (Germany)
Gordon Blunt (UK)
BC Cancer Agency (Canada)
David Kinniburgh (UK)
Richard Leeuw (USA)
New benefactors
Numbers International Pty Ltd, Australia
Institute of Mathematical Statistics, USA
New supporting institutions
Department of Statistics, University of California at
Los Angeles, USA
Department of Statistics, Brigham Young University,
USA
Max-Planck-Institut für demographische Forschung
(MPI), Rostock, Germany
Center für digitale Systeme, Freie Universität Berlin,
Germany
New supporting members
Axel Benner (Germany)
Gordon Blunt (UK)
Christopher Brown (USA)
Seth Falcon (USA)
Jutta Gampe (Germany)
Jerome Goudet (Switzerland)
Philippe Grosjean (Belgium)
Leopoldo E. Guzman (USA)
Lorenz Gygax (Switzerland)
Matthias Kohl (Germany)
Ralph Leonhardt (Germany)
A.I. McLeod (Canada)
Katharine Mullen (Netherlands)
Christophe Pouzat (France)
Michael H. Prager (USA)
Manel Salamero (Spain)
Keith Satherley (Australia)
Gordon Smyth (Australia)
Arthur J. Stock (Canada)
Matthew Wilkes (UK)
Alejandro Veen (USA)
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
News from the Bioconductor Project
by Seth Falcon
On April 27, 2006 the Bioconductor project released
version 1.8 designed for the 2.3 release series of
R. This release brings 35 newly contributed pack-
ages for a total of 173 packages. The sustained
increase in the number of contributed packages
demonstrates that the ideas of “publishing software,
not just papers about software” and “reproducible
research” have been adopted by many bioinformati-
cians around the world.
To help navigate the growing collection of pack-
ages, we’ve implemented a categorization system,
inspired by Achim Zeileis’ and Kurt Hornik’s ctv
package, called biocViews. The views generated
by biocViews integrate Bioconductor’s three main
package repositories (software, annotation data, and
experiment data) and provide users with assistance
in finding packages appropriate for their analy-
sis needs. You can see the views for yourself by
browsing to http://bioconductor.org/packages/
1.8/BiocViews.html. The views that a given
package appears in are determined through the
biocViews field in the package’s ‘DESCRIPTION’ file.
Developers can control the categorization of their
packages by setting the values of this field.
Another new addition with the 1.8 release is a
collection of microarray annotation support pack-
ages for Affymetrix GeneChips. These cdf and probe
packages are produced by the Molecular and Behav-
ioral Neuroscience Institute (MBNI) at the Univer-
sity of Michigan. These new packages are funda-
mentally different from standard packages based on
the Affymetrix probe set designations; each probe
sequence is blasted against the current UniGene
build and then filtered using a particular annota-
tion source such as GenBank, Entrez Gene, EN-
SEMBL, or Tigr. Only those probes that blast to
a unique genomic sequence are retained, so each
probe set interrogates only one transcript. There are
also packages designed for special situations such as
analyzing chimpanzee samples with human chips,
and packages with 3’ biased probe sets for analyz-
ing samples that may contain degraded mRNA. A
word of caution is perhaps in order. The results
obtained using the new packages will be different
from those obtained with the standard packages;
despite the advantage of more up-to-date annota-
tion source data, it is not clear that the results based
upon the new cdf packages should be preferred. For
more information, visit the MBNI web page http:
//brainarray.mhri.med.umich.edu/Brainarray/
Database/CustomCDF/genomic_curated_CDF.asp.
Looking towards the 1.9 release, the core team
will be working on improvements to allow Biocon-
ductor to work with larger datasets and focusing on
integrating new core classes that will allow represen-
tation of data from array-based technologies such as
SNP and tiling arrays.
To this end, we have begun a redesign of our an-
notation data packages to allow them to use SQLite
rather than R environments as the underlying data
storage mechanism. This will provide additional
flexibility, a slight decrease in disk use, and a sig-
nificantly smaller memory footprint. The SQLite-
based annotation packages will have a compatibil-
ity layer that will allow most programs that work
with the current annotation packages to continue to
function without change. However, we recommend that interested users familiarize themselves with the RSQLite package.
Many of our users are unable to process their
microarray data on their current hardware due to
memory limitations, and we anticipate that these problems will increase as more investigators attempt to process larger numbers of arrays simultaneously and
as high-throughput technologies like SNP chips and
tiling arrays become more popular. To address these
issues, we will be working on bounded memory al-
gorithms. As a starting point, we will implement
a microarray normalization routine that will allow
processing of an arbitrary number of arrays while re-
quiring memory for just three arrays.
Seth Falcon
Fred Hutchinson Cancer Research Center
sfalcon@fhcrc.org
Forthcoming Events: DSC 2007
The fifth conference on Directions in Statistical Com-
puting will take place in Auckland, New Zealand,
on February 15 and 16, 2007. This conference will
deal with future directions in (open source) statisti-
cal computing and graphics.
Call for Papers
We invite abstracts on topics presenting innovations
or exciting applications of statistical computing.
A web page offering more information on DSC
2007 and on-line registration is available at http://www.stat.auckland.ac.nz/dsc-2007/. Please send all abstracts to dsc-2007@stat.auckland.ac.nz.
We hope to see you in Auckland!
The organizing committee:
Paul Murrell, Ross Ihaka, David Scott, and Thomas Yee.
dsc-2007@stat.auckland.ac.nz
Editor-in-Chief:
Paul Murrell
Department of Statistics
The University of Auckland
Private Bag 92019
Auckland
Editorial Board:
Torsten Hothorn and John Fox.
Editor Programmers Niche:
Bill Venables
Editor Help Desk:
Uwe Ligges
Email of editors and editorial board:
firstname.lastname@R-project.org
R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor and all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).
R Project Homepage:
http://www.R-project.org/
This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/