R News

The Newsletter of the R Project

Volume 3/1, June 2003

Editorial
by Friedrich Leisch
Welcome to the first issue of R News in 2003, which
is also the first issue where I serve as Editor-in-Chief.
This year has already seen a lot of exciting activities in the R project, one of them being the Distributed Statistical Computing Workshop (DSC 2003) that took place here in Vienna. DSC 2003 is also partly responsible for the rather late release of this newsletter, as both members of the editorial board and authors of regular columns were involved in conference organization.
R 1.7.0 was released on 2003-04-16; see “Changes
in R” for detailed release information. Two major
new features of this release are support for name
spaces in packages and stronger support of S4-style
classes and methods. While the methods package by
John Chambers has provided R with S4 classes for some time now, R 1.7.0 is the first version of R where the methods package is attached by default every time R is started. I know of several users who were rather anxious about the possible side effects of R going further in the direction of S4, but the transition seems
to have gone very well (and we were not flooded by
bug reports). Two articles by Luke Tierney and Doug
Bates discuss name spaces and converting packages
to S4, respectively.
In another article Jun Yan and Tony Rossini explain how Microsoft Windows versions of R and R
packages can be cross-compiled on Linux machines.

Changing my hat from R News editor to CRAN
maintainer, I want to take this opportunity to announce Uwe Ligges as the new maintainer of the Windows packages on CRAN. Until now these have been
built by Brian Ripley, who has spent a lot of time and
effort on the task. Thanks a lot, Brian!
Also in this issue, Greg Warnes introduces the genetics package, Jürgen Groß discusses the computation of variance inflation factors, Thomas Lumley
shows how to analyze survey data in R, and Carson,
Murison and Mason use parallel computing on a Beowulf cluster to speed up gene-shaving.
We also have a new column with book reviews.
2002 has seen the publication of several books explicitly written for R (or R and S-Plus) users. We hope to
make book reviews a regular column of this newsletter. If you know of a book that should be reviewed,
please tell the publisher to send a free review copy to
a member of the editorial board (who will forward it
to a referee).
Last, but not least, we feature a recreational page
in this newsletter. Barry Rowlingson has written a
crossword on R, and even offers a prize for the correct solution. In summary: We hope you will enjoy
reading R News 3/1.
Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org

Contents of this issue:

Editorial
Name Space Management for R
Converting Packages to S4
The genetics Package
Variance Inflation Factors
Building Microsoft Windows Versions of R and R packages under Intel Linux
Analysing Survey Data in R
Computational Gains Using RPVM on a Beowulf Cluster
R Help Desk
Book Reviews
Changes in R 1.7.0
Changes on CRAN
Crossword
Recent Events

Name Space Management for R
Luke Tierney

Introduction
In recent years R has become a major platform for the
development of new data analysis software. As the
complexity of software developed in R and the number of available packages that might be used in conjunction increase, the potential for conflicts among
definitions increases as well. The name space mechanism added in R 1.7.0 provides a framework for addressing this issue.
Name spaces allow package authors to control
which definitions provided by a package are visible
to a package user and which ones are private and
only available for internal use. By default, definitions
are private; a definition is made public by exporting
the name of the defined variable. This allows package authors to create their main functions as compositions of smaller utility functions that can be independently developed and tested. Only the main
functions are exported; the utility functions used in
the implementation remain private. An incremental
development strategy like this is often recommended
for high level languages like R—a guideline often
used is that a function that is too large to fit into a
single editor window can probably be broken down
into smaller units. This strategy has been hard to follow in developing R packages since it would have led to packages containing many utility functions
that complicate the user’s view of the main functionality of a package, and that may conflict with definitions in other packages.
Name spaces also give the package author explicit control over the definitions of global variables
used in package code. For example, a package that
defines a function mydnorm as
mydnorm <- function(z)
1/sqrt(2 * pi) * exp(- z^2 / 2)

most likely intends exp, sqrt, and pi to refer
to the definitions provided by the base package.
However, standard packages define their functions
in the global environment shown in Figure 1.

[Figure 1: Dynamic global environment: .GlobalEnv, followed by the attached packages (package:ctest, ..., package:base).]
The global environment is dynamic in the sense
that top level assignments and calls to library and
attach can insert shadowing definitions ahead of the
definitions in base. If the assignment pi <- 3 has
been made at top level or in an attached package,
then the value returned by mydnorm(0) is not likely
to be the value the author intended. Name spaces
ensure that references to global variables in the base
package cannot be shadowed by definitions in the
dynamic global environment.
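The effect is easy to reproduce interactively. The following sketch uses the mydnorm function from above, defined at the command line (and therefore in the dynamic global environment) rather than inside a package with a name space; the expected values are shown as comments:

mydnorm <- function(z)
    1/sqrt(2 * pi) * exp(- z^2 / 2)
mydnorm(0)    # 0.3989423, pi is found in the base package
pi <- 3       # a shadowing definition at top level
mydnorm(0)    # now 0.4082483, not what the author intended
rm(pi)        # base pi is visible again

Had mydnorm been defined in a package with a name space, the second call would still have returned 0.3989423.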
Some packages also build on functionality provided by other packages. For example, the package
modreg uses the function as.stepfun from the stepfun package. The traditional approach for dealing
with this is to use calls to require or library to add
the additional packages to the search path. This has
two disadvantages. First, it is again subject to shadowing. Second, the fact that the implementation of a
package A uses package B does not mean that users
who add A to their search path are interested in having B on their search path as well. Name spaces provide a mechanism for specifying package dependencies by importing exported variables from other packages. Formally importing variables from other packages again ensures that these cannot be shadowed by
definitions in the dynamic global environment.
From a user’s point of view, packages with a
name space are much like any other package. A call
to library is used to load the package and place its
exported variables in the search path. If the package
imports definitions from other packages, then these
packages will be loaded but they will not be placed
on the search path as a result of this load.
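For example (a sketch with invented package names, where pkgB imports from pkgA):

library(pkgB)                  # loads pkgA for the imports, attaches pkgB
"package:pkgB" %in% search()   # TRUE:  pkgB is on the search path
"package:pkgA" %in% search()   # FALSE: pkgA is loaded but not attached
loadedNamespaces()             # both "pkgA" and "pkgB" are listed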

Adding a name space to a package
A package is given a name space by placing a
‘NAMESPACE’ file at the top level of the package
source directory. This file contains name space directives that define the name space. This declarative
approach to specifying name space information allows tools that collect package information to extract
the package interface and dependencies without processing the package source code. Using a separate
file also has the benefit of making it easy to add a
name space to and remove a name space from a package.
The main name space directives are export and
import directives. The export directive specifies
names of variables to export. For example, the directive
export(as.stepfun, ecdf, is.stepfun, stepfun)

exports four variables. Quoting is optional for standard names such as these, but non-standard names
need to be quoted, as in
export("[<-.tracker")

As a convenience, exports can also be specified with
a pattern. The directive
exportPattern("\\.test$")

exports all variables ending with .test.
Import directives are used to import definitions
from other packages with name spaces. The directive
import(mva)

imports all exported variables from the package
mva. The directive
importFrom(stepfun, as.stepfun)

imports only the as.stepfun function from the stepfun package.
Two additional directives are available. The
S3method directive is used to register methods for
S3 generic functions; this directive is discussed further in the following section. The final directive is
useDynLib. This directive allows a package to declaratively specify that a DLL is needed and should be
loaded when the package is loaded.
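Taken together, a small ‘NAMESPACE’ file might combine several of these directives. The following sketch uses invented names; comment lines starting with # are allowed in the file:

# hypothetical 'NAMESPACE' file for a package mypkg
useDynLib(mypkg)                  # load the package DLL at load time
import(mva)                       # everything exported by mva
importFrom(stepfun, as.stepfun)   # a single function from stepfun
export(myfit, myplot)             # the public interface
exportPattern("\\.test$")         # plus all variables ending in .test
S3method(print, myfit)            # register an S3 print method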
For a package with a name space the two operations of loading the package and attaching the
package to the search path are logically distinct. A
package loaded by a direct call to library will first
be loaded, if it is not already loaded, and then be
attached to the search path. A package loaded to
satisfy an import dependency will be loaded, but
will not be attached to the search path. As a result,
the single initialization hook function .First.lib
is no longer adequate and is not used by packages
with name spaces. Instead, a package with a name
space can provide two hook functions, .onLoad and
.onAttach. These functions, if defined, should not be
exported. Many packages will not need either hook
function, since import directives take the place of
require calls and useDynLib directives can replace
direct calls to library.dynam.
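A minimal sketch of the two hooks (neither of which should be exported) follows; the package name and the actions taken are placeholders:

.onLoad <- function(libname, pkgname) {
    # runs whenever the name space is loaded, including when it is
    # loaded only to satisfy another package's imports
    options(mypkg.digits = 4)
}

.onAttach <- function(libname, pkgname) {
    # runs only when the package is attached to the search path
    cat("mypkg attached; see the package help for an overview\n")
}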
Name spaces are sealed. This means that, once a
package with a name space is loaded, it is not possible to add or remove variables or to change the
values of variables defined in a name space. Sealing serves two purposes: it simplifies the implementation, and it ensures that the locations of variable
bindings cannot be changed at run time. This will facilitate the development of a byte code compiler for
R (Tierney, 2001).
Adding a name space to a package may complicate debugging package code. The function fix,
for example, is no longer useful for modifying an
internal package function, since it places the new
definition in the global environment and that new
definition remains shadowed within the package by
the original definition. As a result, it is a good
idea not to add a name space to a package until it
is completely debugged, and to remove the name
space if further debugging is needed; this can be
done by temporarily renaming the ‘NAMESPACE’
file. Other alternatives are being explored, and R
1.7.0 contains some experimental functions, such as
fixInNamespace, that may help.
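For example, an unexported function myutil in a (hypothetical) package mypkg could be edited in place with

fixInNamespace("myutil", ns = "mypkg")

which opens the definition in an editor and assigns the edited version back into the sealed name space.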

Name spaces and method dispatch
R supports two styles of object-oriented programming: the S3 style based on UseMethod dispatch, and
the S4 style provided by the methods package. S3
style dispatch is still used the most and some support is provided in the name space implementation
released in R 1.7.0.
S3 dispatch is based on concatenating the generic
function name and the name of a class to form
the name of a method. The environment in which
the generic function is called is then searched for a
method definition. With name spaces this creates a
problem: if a package is imported but not attached,
then the method definitions it provides may not be
visible at the call site. The solution provided by the
name space system is a mechanism for registering S3
methods with the generic function. A directive of the
form
S3method(print, dendrogram)

in the ‘NAMESPACE’ file of a package registers the
function print.dendrogram defined in the package
as the print method for class dendrogram. The
variable print.dendrogram does not need to be
exported. Keeping the method definition private
ensures that users will not inadvertently call the
method directly.
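The practical consequence is that dispatch through the generic works even though the method itself is not exported. A small sketch, assuming the mva package of that era (which provided hclust and the dendrogram class):

library(mva)
data(USArrests)
hc <- hclust(dist(USArrests[1:5, ]))
d  <- as.dendrogram(hc)
d    # autoprinting dispatches to the registered print.dendrogram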
S4 classes and generic functions are by design
more formally structured than their S3 counterparts
and should therefore be conceptually easier to integrate with name spaces than S3 generic functions
and classes. For example, it should be fairly easy
to allow for both exported and private S4 classes;
the concatenation-based approach of S3 dispatch
does not seem to be compatible with having private
classes except by some sort of naming convention.
However, the integration of name spaces with S4 dispatch could not be completed in time for the release
of R 1.7.0. It is quite likely that the integration can
be accomplished by adding only two new directives,
exportClass and importClassFrom.

A simple example
A simple artificial example may help to illustrate
how the import and export directives work. Consider two packages named foo and bar. The R code
for package foo in file ‘foo.R’ is
x <- 1
f <- function(y) c(x,y)

The ‘NAMESPACE’ file contains the single directive

export(f)

Thus the variable x is private to the package and the function f is public. The body of f references a global variable x and a global function c. The global reference x corresponds to the definition in the package. Since the package does not provide a definition for c and does not import any definitions, the variable c refers to the definition provided by the base package.

The second package bar has code file ‘bar.R’ containing the definitions
c <- function(...) sum(...)
g <- function(y) f(c(y, 7))

and ‘NAMESPACE’ file
import(foo)
export(g)

The function c is private to the package. The function g calls a global function c. Definitions provided
in the package take precedence over imports and definitions in the base package; therefore the definition
of c used by g is the one provided in bar.
Calling library(bar) loads bar and attaches its
exports to the search path. Package foo is also loaded
but not attached to the search path. A call to g produces
> g(6)
[1] 1 13

This is consistent with the definitions of c in the two
settings: in bar the function c is defined to be equivalent to sum, but in foo the variable c refers to the
standard function c in base.

Some details
This section provides some details on the current
implementation of name spaces. These details may
change as name space support evolves.
Name spaces use R environments to provide
static binding of global variables for function definitions. In a package with a name space, functions defined in the package are defined in a name
space environment as shown in Figure 2. This environment consists of a set of three static frames
followed by the usual dynamic global environment. The first static frame contains the local definitions of the package. The second frame contains
imported definitions, and the third frame is the base
package. Suppose the function mydnorm shown in
the introduction is defined in a package with a name
space that does not explicitly import any definitions.
Then a call to mydnorm will cause R to evaluate the
variable pi by searching for a binding in the name
space environment. Unless the package itself defines
a variable pi, the first binding found will be the one
in the third frame, the definition provided in the base
package. Definitions in the dynamic global environment cannot shadow the definition in base.

[Figure 2: Name space environment: the static frames (internal defs, imports, base) followed by the dynamic global environment (.GlobalEnv, package:ctest, ..., package:base).]
When library is called to load a package with a
name space, library first calls loadNamespace and
then attachNamespace. loadNamespace first checks
whether the name space is already loaded and registered with the internal registry of loaded name
spaces. If so, the already loaded name space is returned. If not, then loadNamespace is called on all
imported name spaces, and definitions of exported
variables of these packages are copied to the imports
frame, the second frame, of the new name space.
Then the package code is loaded and run, and exports, S3 methods, and shared libraries are registered. Finally, the .onLoad hook is run, if the package
defines one, and the name space is sealed. Figure 3
illustrates the dynamic global environment and the
internal data base of loaded name spaces after two
packages, A and B, have been loaded by explicit calls
to library and package C has been loaded to satisfy
import directives in package B.

[Figure 3: Internal implementation: the internal registry of loaded name spaces (A, B, and C, each with exports, internal defs, imports, and base frames) alongside the global search path (.GlobalEnv, package:A, package:B, ..., package:base).]
The serialization mechanism used to save R
workspaces handles name spaces by writing out a
reference for a name space consisting of the package name and version. When the workspace is
loaded, the reference is used to construct a call
to loadNamespace. The current implementation ignores the version information; in the future this may
be used to signal compatibility warnings or select
among several available versions.
The registry of loaded name spaces can be examined using loadedNamespaces. In the current
implementation loaded name spaces are not unloaded automatically. Detaching a package with a
name space loaded by a call to library removes the
frame containing the exported variables of the package from the global search path, but it does not unload the name space. Future implementations may
automatically unload name spaces that are no longer
needed to satisfy import dependencies. The function
unloadNamespace can be used to force unloading of
a name space; this can be useful if a new version of
the package is to be loaded during development.
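During package development, a typical cycle might therefore look like the following sketch (using the package foo from the example above):

loadedNamespaces()      # is "foo" among the loaded name spaces?
detach("package:foo")   # remove foo's exports from the search path
unloadNamespace("foo")  # actually unload the name space
library(foo)            # load and attach the newly installed version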
Variables exported by a package with a name
space can also be referenced using a fully qualified
variable reference consisting of the package name
and the variable name separated by a double colon,
such as foo::f. This can be useful in debugging at
the top level. It can also occasionally be useful in
source code, for example in cases where functionality
is only to be used if the package providing it is available. However, dependencies resulting from use of
fully qualified references are not visible from the
name space directives and therefore may cause some
confusion, so explicit importing is usually preferable.
Computing the value of a fully qualified variable reference will cause the specified package to be loaded
if it is not loaded already.
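Continuing the example from the previous section, a fully qualified reference behaves like this:

foo::f(10)           # loads foo's name space if needed and calls f
# [1]  1 10
loadedNamespaces()   # "foo" is now listed, even if it was never attached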

Discussion
Name spaces provide a means of making R packages
more modular. Many languages and systems originally developed without name space or module systems have had some form of name space management added in order to support the development of
larger software projects. C++ and Tcl (Welch, 1999)
are two examples. Other languages, such as Ada
(Barnes, 1998) and Modula-3 (Harbison, 1992) were
designed from the start to include powerful module
systems. These languages use a range of approaches
for specifying name space or module information.
The declarative approach with separation of implementation code and name space information used in
the system included in R 1.7.0 is closest in spirit to
the approaches of Modula-3 and Ada. In contrast,
Tcl and C++ interleave name space specification and
source code.
The current R name space implementation, while
already very functional and useful, is still experimental and incomplete. Support for S4 methods still
needs to be included. More exhaustive error checking is needed, along with code analysis tools to aid
in constructing name space specifications. To aid in
package development and debugging it would also
be useful to explore whether strict sealing can be
relaxed somewhat without complicating the implementation. Early experiments suggest that this may
be possible.
Some languages, including Ada, ML (Ullman,
1997) and the MzScheme Scheme implementation
(PLT) provide a richer module structure in which parameterized partial modules can be assembled into
complete modules. In the context of R this might
correspond to having a parameterized generic data
base access package that is completed by providing a
specific data base interface. Whether this additional
power is useful and the associated additional complexity is warranted in R is not yet clear.

Bibliography
John Barnes. Programming in Ada 95. Addison-Wesley, 2nd edition, 1998.

Samuel P. Harbison. Modula-3. Prentice Hall, 1992.

PLT. PLT MzScheme. World Wide Web, 2003. URL http://www.plt-scheme.org/software/mzscheme/.

Luke Tierney. Compiling R: A preliminary report. In Kurt Hornik and Friedrich Leisch, editors, Proceedings of the 2nd International Workshop on Distributed Statistical Computing, March 15-17, 2001, Technische Universität Wien, Vienna, Austria, 2001. URL http://www.ci.tuwien.ac.at/Conferences/DSC-2001/Proceedings/. ISSN 1609-395X.

Jeffrey Ullman. Elements of ML Programming. Prentice Hall, 1997.

Brent Welch. Practical Programming in Tcl and Tk. Prentice Hall, 1999.
Luke Tierney
Department of Statistics & Actuarial Science
University of Iowa, Iowa City, Iowa, U.S.A
luke@stat.uiowa.edu

Converting Packages to S4
by Douglas Bates

Introduction
R now has two systems of classes and methods,
known informally as the ‘S3’ and ‘S4’ systems. Both
systems are based on the assignment of a class to an
object and the use of generic functions that invoke different methods according to the class of their arguments. Classes organize the representation of information and methods organize the actions that are applied to these representations.
‘S3’ classes and methods for the S language were
introduced in Chambers and Hastie (1992) (see also
Venables and Ripley, 2000, ch. 4) and have been implemented in R from its earliest public versions. Because many widely-used R functions, such as print,
plot and summary, are S3 generics, anyone using R
inevitably (although perhaps unknowingly) uses the
S3 system of classes and methods.
Authors of R packages can take advantage of the
S3 class system by assigning a class to the objects
created in their package and defining methods for
this class and either their own generic functions or
generic functions defined elsewhere, especially those
in the base package. Many R packages do exactly
this. To get an idea of how many S3 classes are used
in R, attach the packages that you commonly use
and call methods("print"). The result is a vector of
names of methods for the print generic. It will probably list dozens, if not hundreds, of methods. (Even
that list may be incomplete because recent versions
of some popular packages use namespaces to hide
some of the S3 methods defined in the package.)
The S3 system of classes and methods is both
popular and successful. However, some aspects of
its design are at best inconvenient and at worst dangerous. To address these inadequacies John Chambers introduced what is now called the ‘S4’ system of
classes and methods (Chambers, 1998; Venables and
Ripley, 2000). John implemented this system for R in
the methods package which, beginning with the 1.7.0
release of R, is one of the packages that are attached
by default in each R session.
There are many more packages that use S3 classes
and methods than use S4. Probably the greatest use
of S4 at present is in packages from the Bioconductor
project (Gentleman and Carey, 2002). I hope by this
article to encourage package authors to make greater
use of S4.
Conversion from S3 to S4 is not automatic. A
package author must perform this conversion manually and the conversion can require a considerable
amount of effort. I speak from experience. Saikat DebRoy and I have been redesigning the linear mixed-effects models part of the nlme package, including
conversion to S4, and I can attest that it has been a
lot of work. The purpose of this article is to indicate
what is gained by conversion to S4 and to describe
some of our experiences in doing so.

S3 versus S4 classes and methods
S3 classes are informal: the class of an object is determined by its class attribute, which should consist
of one or more character strings, and methods are
found by combining the name of the generic function with the class of the first argument to the function. If a function having this combined name is on
the search path, it is assumed to be the appropriate
method. Classes and their contents are not formally
defined in the S3 system - at best there is a “gentleman’s agreement” that objects in a class will have
certain structure with certain component names.
The informality of S3 classes and methods is convenient but dangerous. There are obvious dangers in
that any R object can be assigned a class, say "foo",
without any attempt to validate the names and types
of components in the object. That is, there is no guarantee that an object that claims to have class "foo"
is compatible with methods for that class. Also, a
method is recognized solely by its name so a function named print.foo is assumed to be the method
for the print generic applied to an object of class foo.
This can lead to surprising and unpleasant errors for
the unwary.
Another disadvantage of using function names to
identify methods is that the class of only one argument, the first argument, is used to determine the
method that the generic will use. Often when creating a plot or when fitting a statistical model we want
to examine the class of more than one argument to
determine the appropriate method, but S3 does not
allow this.
There are more subtle disadvantages to the S3
system. Often it is convenient to describe a class as
being a special case of another class; for example, a
model fit by aov is a special case of a linear model
(class lm). We say that class aov inherits from class
lm. In the informal S3 system this inheritance is described by assigning the class c("aov","lm") to an
object fit by aov. Thus the inheritance of classes becomes a property of the object, not a property of the
class, and there is no guarantee that it will be consistent across all members of the class. Indeed there
were examples in some packages where the class inheritance was not consistently assigned.
By contrast, S4 classes must be defined explicitly.
The number of slots in objects of the class, and the
names and classes of the slots, are established at the
time of class definition. When an object of the class
is created, and at some points during computation
with the object, it is validated against the definition.
Inheritance of classes is also specified at the time of
class definition and thus becomes a property of the
class, not a (possibly inconsistent) property of objects
in the class.
S4 also requires formal declarations of methods,
unlike the informal system of using function names
to identify a method in S3. An S4 method is declared
by a call to setMethod giving the name of the generic
and the “signature” of the arguments. The signature identifies the classes of one or more named arguments to the generic function. Special meta-classes
named ANY and missing can be used in the signature.
S4 generic functions are automatically created
when a method is declared for an existing function,
in which case the function becomes generic and the
current definition becomes the default method. A
new generic function can be declared explicitly by a
call to setGeneric. When a method for the generic
is declared with setMethod the number, names, and
order of its arguments are matched against those of
the generic. (If the generic has a ... argument, the
method can add new arguments but it can never omit
or rename arguments of the generic.)
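As a small illustration (every name in this sketch is invented), a class, a method for an existing function, and a new generic could be declared as follows:

library(methods)   # attached by default from R 1.7.0 on

# a class with typed slots, validated when objects are created
setClass("simpleFit",
         representation(coefficients = "numeric", call = "call"))

# declaring a method for an existing function such as coef turns it
# into a generic, with the old definition as the default method
setMethod("coef", signature(object = "simpleFit"),
          function(object, ...) object@coefficients)

# a completely new generic is declared explicitly
setGeneric("refit",
           function(object, data, ...) standardGeneric("refit"))

# methods can dispatch on the classes of several arguments
setMethod("refit", signature(object = "simpleFit", data = "data.frame"),
          function(object, data, ...) object)   # placeholder body

fit <- new("simpleFit", coefficients = c(a = 1, b = 2),
           call = quote(lm(y ~ x)))
coef(fit)   # dispatches on the class of object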
In summary, the S4 system is much more formal
regarding classes, generics, and methods than is the
S3 system. This formality means that more care must
be taken in the design of classes and methods for S4.
In return, S4 provides greater security, a more welldefined organization, and, in my opinion, a cleaner
structure to the package.

Package conversion
Chambers (1998, ch. 7,8) and Venables and Ripley
(2000, ch. 5) discuss creating new packages based on
S4. Here I will concentrate on converting an existing
package from S3 to S4.
S4 requires formal definitions of generics, classes,
and methods. The generic functions are usually the
easiest to convert. If the generic is defined externally
to the package then the package author can simply
begin defining methods for the generic, taking care
to ensure that the argument sequences of the method
are compatible with those of the generic. As described above, assigning an S4 method for, say, coef
automatically creates an S4 generic for coef with the
current externally-defined coef function as the default method. If an S3 generic was defined internally
to the package then it is easy to convert it to an S4
generic.
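For instance, an internally defined S3 generic of the form

mycoef <- function(object, ...) UseMethod("mycoef")

(the name is invented here) can be turned into an S4 generic with

setGeneric("mycoef", function(object, ...) standardGeneric("mycoef"))

after which the existing S3 methods can be converted one by one into setMethod calls.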
Converting classes is less easy. We found that
we could use the informal set of classes from the S3
version as a guide when formulating the S4 classes
but we frequently reconsidered the structure of the
classes during the conversion. We frequently found
ourselves adding slots during the conversion so we
had more slots in the S4 classes than components in
the S3 objects.
Increasing the number of slots may be an inevitable consequence of revising the package (we
tend to add capabilities more frequently than we remove them) but it may also be related to the fact that
S4 classes must be declared explicitly and hence we
must consider the components or slots of the classes
and the relationships between the classes more carefully.
Another reason that we incorporated more slots in the S4 classes than in the S3 prototype is that
we found it convenient that the contents of slots in
S4 objects can easily be accessed and modified in C
code called through the .Call interface. Several C
macros for working with S4 classed objects, including GET_SLOT, SET_SLOT, MAKE_CLASS and NEW (all described in Chambers (1998)) are available in R. The
combination of the formal classes of S4, the .Call
interface, and these macros allows a programmer to
manipulate S4 classed objects in C code nearly as easily as in R code. Furthermore, when the C code is
called from a method, the programmer can be confident of the classes of the objects passed in the call
and the classes of the slots of those objects. Much of
the checking of classes or modes and possible coercion of modes that is common in C code called from
R can be bypassed.
We found that we would initially write methods
in R then translate them into C if warranted. The
nature of our calculations, frequently involving multiple decompositions and manipulations of sections
of arrays, was such that the calculations could be expressed in R but not very cleanly. Once we had the
R version working satisfactorily we could translate
into C the parts that were critical for performance or
were awkward to write in R. An important advantage of this mode of development is that we could
use the same slots in the C version as in the R version
and create the same types of objects to be returned.
We feel that defining S4 classes and methods in
R then translating parts of method definitions to C
functions called through .Call is an extremely effective mode for numerical computation. Programmers who have experience working in C++ or Java
may initially find it more convenient to define classes
and methods in the compiled language and perhaps
define a parallel series of classes in R. (We did exactly that when creating the Matrix package for R.)
We encourage such programmers to try instead this
method of defining only one set of classes, the S4
classes in R, and use these classes in both the interpreted language and the compiled language.
The definition of S4 methods is more formal than
in S3 but also more flexible because of the ability to
match an argument signature rather than being constrained to matching just the class of the first argument. We found that we used this extensively when
defining methods for generics that fit models. We
would define the “canonical” form of the arguments,
which frequently was a rather wordy form, and one
“collector” method that matched this form of the arguments. The collector method is the one that actually does the work, such as fitting the model. All
the other methods are designed to take more conveniently expressed forms of the arguments and rearrange them into the canonical form. Our subjective
impression is that the resulting methods are much
easier to understand than those in the S3 version.
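In outline (all names invented for this sketch), the pattern looks like this: one method with the wordy canonical signature does the actual work, and the convenience methods only rearrange their arguments and dispatch again:

setGeneric("fitModel",
           function(formula, data, weights, ...) standardGeneric("fitModel"))

# the "collector" method: canonical, fully specified arguments
setMethod("fitModel",
          signature(data = "data.frame", weights = "numeric"),
          function(formula, data, weights, ...) {
              # the model would actually be fitted here; this sketch
              # just returns the canonical arguments
              list(formula = formula, data = data, weights = weights)
          })

# a convenience method: supply a default and dispatch again
setMethod("fitModel",
          signature(data = "data.frame", weights = "missing"),
          function(formula, data, weights, ...)
              fitModel(formula, data, weights = rep(1, nrow(data)), ...))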

Pitfalls
S4 classes and methods are powerful tools. With
these tools a programmer can exercise fine-grained
control over the definition of classes and methods.
This encourages a programming style where many
specialized classes and methods are defined. One of
the difficulties that authors of packages then face is
documenting all these classes, generics, and methods.
We expect that generic functions and classes will
be described in detail in the documentation, usually
in a separate documentation file for each class and
each generic function, although in some cases closely
related classes could be described in a single file. It
is less obvious whether, how, and where to document methods. In the S4 system methods are associated with a generic function and an argument signature. It is not clear if a given method should be documented separately or in conjunction with the generic
function or with a class definition.
Any of these places could make sense for some
methods. All of them have disadvantages. If all
methods are documented separately there will be
an explosion of the number of documentation files
to create and maintain. That is an unwelcome burden. Documenting methods in conjunction with the
generic can work for internally defined generics but
not for those defined externally to the package. Documenting methods in conjunction with a class only
works well if the method signature consists of one argument only but part of the power of S4 is the ability
to dispatch on the signature of multiple arguments.
Others working with S4 packages, including the
Bioconductor contributors, are faced with this problem of organizing documentation. Gordon Smyth
and Vince Carey have suggested some strategies for
organizing documentation but more experience is
definitely needed.
One problem with creating many small methods that rearrange arguments and then call
callGeneric() to get eventually to some collector
method is that the sequence of calls has to be unwound when returning the value of the call. In the
case of a model-fitting generic it is customary to preserve the original call in a slot or component named
call so that it can be displayed in summaries and
also so it can be used to update the fitted model.
To preserve the call, each of the small methods that
just manipulate the arguments must take the result
of callGeneric, reassign the call slot and return
this object. We discovered, to our chagrin, that this
caused the entire object, which can be very large in
some of our examples, to be copied in its entirety as
each method call was unwound.

Conclusions
Although the formal structure of the S4 system of
classes and methods requires greater discipline by
the programmer than does the informal S3 system,
the resulting clarity and security of the code makes
S4 worthwhile. Moreover, the ability in S4 to work
with a single class structure in R code and in C code
to be called by R is a big win.
S4 encourages creating many related classes and
many methods for generics. Presently this creates
difficult decisions on how to organize documentation and how to unwind nested method calls without
unwanted copying of large objects. However it is
still early days with regard to the construction of
large packages based on S4 and as more experience
is gained we expect that knowledge of the best
practices will be disseminated in the community so
we can all benefit from the S4 system.

Bibliography
J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

J. M. Chambers and T. J. Hastie. Statistical Models in S. Chapman & Hall, London, 1992. ISBN 0-412-83040-X.

R. Gentleman and V. Carey. Bioconductor. R News, 2(1):11–16, March 2002. URL http://CRAN.R-project.org/doc/Rnews/.

W. N. Venables and B. D. Ripley. S Programming. Springer, 2000. URL http://www.stats.ox.ac.uk/pub/MASS3/Sprog/. ISBN 0-387-98966-8.
Douglas Bates
University of Wisconsin–Madison, U.S.A.
Bates@stat.wisc.edu

The genetics Package
Utilities for handling genetic data
by Gregory R. Warnes

Introduction
In my work as a statistician in the Non-Clinical
Statistics and Biostatistical Applications group
within Pfizer Global Research and Development I
have the opportunity to perform statistical analysis
in a wide variety of domains. One of these domains
is pharmacogenomics, in which we attempt to determine the relationship between the genetic variability
of individual patients and disease status, disease
progression, treatment efficacy, or treatment side effect profile.
Our normal approach to pharmacogenomics is
to start with a small set of candidate genes. We
then look for markers of genetic variability within
these genes.
The most common marker types
we encounter are Single Nucleotide Polymorphisms
(SNPs). SNPs are locations where some individuals
differ from the norm by the substitution of one of the 4 DNA bases, adenine (A), thymine (T), guanine (G), and cytosine (C), by one of the other bases. For example, a single cytosine (C) might be replaced by a single thymine (T) in the sequence ‘CCTCAGC’, yielding ‘CCTTAGC’. We also encounter simple sequence length polymorphisms (SSLP), which are also known as microsatellite DNA. SSLPs are simple repeating patterns of bases where the number of repeats can vary. E.g., at a particular position, some
individuals might have 3 repeats of the pattern ‘CT’,
‘ACCTCTCTAA’, while others might have 5 repeats,
‘ACCTCTCTCTCTAA’.
Regardless of the type or location of genetic variation, each individual has two copies of each chromosome, hence two alleles (variants), and consequently
two data values for each marker. This information is
often presented together by providing a pair of allele
names. Sometimes a separator is used (e.g. ‘A/T’),
sometimes they are simply concatenated (e.g., ‘AT’).
A further layer of complexity arises from the
inability of most laboratory methods to determine
which observed variant comes from which copy of
the chromosome. (Statistical methods are often necessary to impute this information when it is needed.)
For this type of data ‘A/T’, and ‘T/A’ are equivalent.


The genetics package

The genetics package, available from CRAN, includes classes and methods for creating, representing, and manipulating genotypes (unordered allele pairs) and haplotypes (ordered allele pairs). Genotypes and haplotypes can be annotated with chromosome, locus (location on a chromosome), gene, and marker information. Utility functions compute genotype and allele frequencies, flag homozygotes or heterozygotes, flag carriers of certain alleles, count the number of a specific allele carried by an individual, and extract one or both alleles. These functions make it easy to create and use single-locus genetic information in R's statistical modeling functions.

The genetics package also provides a set of functions to estimate and test for departure from Hardy-Weinberg equilibrium (HWE). HWE specifies the expected allele frequencies for a single population when none of the variant alleles impart a survival benefit. Departure from HWE is often indicative of a problem with the laboratory assay, and is often the first statistical method applied to genetic data. In addition, the genetics package provides functions to test for linkage disequilibrium (LD), the non-random association of marker alleles which can arise from marker proximity or from selection bias. Further, to assist in determining the sample sizes needed when investigating potential markers, we provide a function which computes the probability of observing all alleles with a given true frequency.

My primary motivation in creating the genetics package was to overcome the difficulty of representing and manipulating genotypes in general-purpose statistical packages. Without an explicit genotype variable type, handling genetic variables requires considerable string manipulation, which can be quite messy and tedious. The genotype function has been designed to remove the need to perform string manipulation by allowing allele pairs to be specified in any of four commonly occurring notations:

• A single vector with a character separator:

g1 <- genotype( c('A/A','A/C','C/C','C/A',
                  NA,'A/A','A/C','A/C') )
g3 <- genotype( c('A A','A C','C C','C A',
                  '','A A','A C','A C'),
                sep=' ', remove.spaces=FALSE)

• A single vector with a positional separator:

g2 <- genotype( c('AA','AC','CC','CA','',
                  'AA','AC','AC'), sep=1 )

• Two separate vectors:

g4 <- genotype(
  c('A','A','C','C','','A','A','A'),
  c('A','C','C','A','','A','C','C')
)

• A dataframe or matrix with two columns:
gm <- cbind(
  c('A','A','C','C','','A','A','A'),
  c('A','C','C','A','','A','C','C') )
g5 <- genotype( gm )

For simplicity, the functions makeGenotypes and makeHaplotypes can be used to convert all of the genetic variables contained in a dataframe in a single
pass. (See the help page for details.)
A second difficulty in using genotypes is the need
to represent the information in different forms at different times. To simplify the use of genotype variables, each of the three basic ways of modeling the
effect of the allele combinations is directly supported
by the genetics package:
categorical Each allele combination acts differently.
This situation is handled by entering the
genotype variable without modification into a
model. In this case, it will be treated as a factor:
lm( outcome ~ genotype.var + confounder )

additive The effect depends on the number of copies
of a specific allele (0, 1, or 2).
The function allele.count( gene, allele )
returns the number of copies of the specified
allele:
lm( outcome ~ allele.count(genotype.var,'A')
    + confounder )

dominant/recessive The effect depends only on the
presence or absence of a specific allele.
The function carrier( gene, allele ) returns a boolean flag if the specified allele is
present:
lm( outcome ~ carrier(genotype.var,'A')
    + confounder )

Implementation
The basic functionality of the genetics package is
provided by the genotype class and the haplotype
class, which is a simple extension of the former.
Friedrich Leisch and I collaborated on the design of
the genotype class. We had four goals: First, we
wanted to be able to manipulate both alleles as a
single variable. Second, we needed a clean way of
accessing the individual alleles when this was required. Third, a genotype variable should be able
to be stored in dataframes as they are currently implemented in R. Fourth, the implementation of genotype variables should be space-efficient.
After considering several potential implementations, we chose to implement the genotype
class as an extension to the in-built factor variable type with additional information stored in attributes. Genotype objects are stored as factors
and have the class list c("genotype","factor").
The names of the factor levels are constructed as
paste(allele1,"/",allele2,sep=""). Since most
genotyping methods do not indicate which allele
comes from which member of a chromosome pair,
the alleles for each individual are placed in a consistent order controlled by the reorder argument.
In cases when the allele order is informative, the
haplotype class, which preserves the allele order,
should be used instead.
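For example (a small sketch, assuming the genetics package is attached), the two classes treat allele order differently:

g <- genotype( c('A/T','T/A') )    # unordered: both stored as A/T
h <- haplotype( c('A/T','T/A') )   # ordered: A/T and T/A stay distinct
g[1] == g[2]    # TRUE
h[1] == h[2]    # FALSE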
The set of allele names is stored in the attribute
allele.names. A translation table from the factor
levels to the names of each of the two alleles is stored
in the attribute allele.map. This map is a two column character matrix with one row per factor level.
The columns provide the individual alleles for each
factor level. Accessing the individual alleles, as performed by the allele function, is accomplished by simply indexing into this table,

allele.x <- attr(x, "allele.map")
allele.x[genotype.var, which]

where which is 1, 2, or c(1,2) as appropriate.
Finally, there is often additional meta-information associated with a genotype. The functions locus, gene, and marker create objects to store
information, respectively, about genetic loci, genes,
and markers. Any of these objects can be included as
part of a genotype object using the locus argument.
The print and summary functions for genotype objects properly display this information when it is
present.
This implementation of the genotype class met
our four design goals and offered an additional benefit: in most contexts factors behave the same as the
desired default behavior for genotype objects. Consequently, relatively few additional methods needed
to be written. Further, in the absence of the genetics
package, the information stored in genotype objects
is still accessible in a reasonable way.
The genotype class is accompanied by a full complement of helper methods for standard R operators
( [], [<-, ==, etc. ) and object methods ( summary,
print, is.genotype, as.genotype, etc. ). Additional
functions for manipulating genotypes include:
allele Extracts individual alleles as a matrix.
allele.names Extracts the set of allele names.
homozygote Creates a logical vector indicating
whether both alleles of each observation are the
same.
heterozygote Creates a logical vector indicating
whether the alleles of each observation differ.
carrier Creates a logical vector indicating whether
the specified alleles are present.
allele.count Returns the number of copies of the
specified alleles carried by each observation.

getlocus Extracts locus, gene, or marker information.

makeGenotypes Convert appropriate columns in a
dataframe to genotypes or haplotypes
write.pop.file Creates a ’pop’ data file, as used by
the GenePop (http://wbiomed.curtin.edu.
au/genepop/) and LinkDos (http://wbiomed.
curtin.edu.au/genepop/linkdos.html) software packages.
write.pedigree.file Creates a ’pedigree’ data file, as
used by the QTDT software package (http:
//www.sph.umich.edu/statgen/abecasis/
QTDT/).
write.marker.file Creates a ’marker’ data file, as
used by the QTDT software package (http:
//www.sph.umich.edu/statgen/abecasis/
QTDT/).
The genetics package provides four functions related to Hardy-Weinberg Equilibrium:
diseq Estimate or compute confidence interval for
the single marker Hardy-Weinberg disequilibrium
HWE.chisq Performs a Chi-square test for Hardy-Weinberg equilibrium.
HWE.exact Performs a Fisher’s exact test of Hardy-Weinberg equilibrium for two-allele markers.
HWE.test Computes estimates and bootstrap confidence intervals, as well as testing for Hardy-Weinberg equilibrium.
as well as three related to linkage disequilibrium
(LD):
LD Compute pairwise linkage disequilibrium between genetic markers.
LDtable Generate a graphical table showing the LD
estimate, number of observations and p-value
for each marker combination, color coded by
significance.
LDplot Plot linkage disequilibrium as a function of
marker location.
and one function for sample size calculation:
gregorius Probability of Observing All Alleles with
a Given Frequency in a Sample of a Specified
Size.
The algorithms used in the HWE and LD functions
are beyond the scope of this article, but details are
provided in the help pages or the corresponding
package documentation.

Example

Here is a partial session using tools from the genetics package to examine the features of 3 simulated markers and their relationships with a continuous outcome:

> library(genetics)
[...]
> # Load the data from a CSV file
> data <- read.csv("example_data.csv")
>
> # Convert genotype columns to genotype variables
> data <- makeGenotypes(data)
>
> ## Annotate the genes
> marker(data$a1691g) <-
+     marker(name="A1691G",
+            type="SNP",
+            locus.name="MBP2",
+            chromosome=9,
+            arm="q",
+            index.start=35,
+            bp.start=1691,
+            relative.to="intron 1")
[...]
>
> # Look at some of the data
> data[1:5,]
      PID DELTA.BMI c104t a1691g c2249t
1 1127409      0.62   C/C    G/G    T/T
2  246311      1.31   C/C    A/A    T/T
3  295185      0.15   C/C    G/G    T/T
4   34301      0.72   C/T    A/A    T/T
5   96890      0.37   C/C    A/A    T/T
>
> # Get allele information for c104t
> summary(data$c104t)
Marker: MBP2:C-104T (9q35:-104) Type: SNP

Allele Frequency:
  Count Proportion
C   137       0.68
T    63       0.32

Genotype Frequency:
    Count Proportion
C/C    59       0.59
C/T    19       0.19
T/T    22       0.22
>
>
> # Check Hardy-Weinberg Equilibrium
> HWE.test(data$c104t)
-----------------------------------
Test for Hardy-Weinberg-Equilibrium
-----------------------------------

Call:
HWE.test.genotype(x = data$c104t)
Raw Disequlibrium for each allele pair (D)

       C     T
C           0.12
T   0.12

Scaled Disequlibrium for each allele pair (D')

       C     T
C           0.56
T   0.56

Correlation coefficient for each allele pair (r)

       C      T
C   1.00  -0.56
T  -0.56   1.00

Overall Values

     Value
D     0.12
D'    0.56
r    -0.56

Confidence intervals computed via bootstrap
using 1000 samples

           Observed 95% CI             NA's Contains Zero?
Overall D     0.121 ( 0.073,  0.159)      0           *NO*
Overall D'    0.560 ( 0.373,  0.714)      0           *NO*
Overall r    -0.560 (-0.714, -0.373)      0           *NO*

Significance Test:

    Exact Test for Hardy-Weinberg Equilibrium

data:  data$c104t
N11 = 59, N12 = 19, N22 = 22, N1 = 137, N2 = 63, p-value = 3.463e-08

>
> # Check Linkage Disequilibrium
> ld <- LD(data)
Warning message:
Non-genotype variables or genotype variables with
more or less than two alleles detected. These
variables will be omitted: PID, DELTA.BMI
in: LD.data.frame(data)

> ld # text display
Pairwise LD
-----------
                a1691g  c2249t
c104t  D         -0.01   -0.03
c104t  D'         0.05    1.00
c104t  Corr.     -0.03   -0.21
c104t  X^2        0.16    8.51
c104t  P-value    0.69  0.0035
c104t  n           100     100

                c2249t
a1691g D         -0.01
a1691g D'         0.31
a1691g Corr.     -0.08
a1691g X^2        1.30
a1691g P-value    0.25
a1691g n           100

> LDtable(ld) # graphical display

[LDtable output: a graphical table, color coded by significance, showing (-)D', the number of observations (N), and the P-value for each marker pair (Marker 1 by Marker 2), titled "Linkage Disequilibrium": c104t vs a1691g: -0.0487 (100) 0.68780; c104t vs c2249t: -0.9978 (100) 0.00354; a1691g vs c2249t: -0.3076 (100) 0.25439.]
> # fit a model
> summary(lm( DELTA.BMI ~
+               homozygote(c104t,'C') +
+               allele.count(a1691g, 'G') +
+               c2249t, data=data))

Call:
lm(formula = DELTA.BMI ~ homozygote(c104t, "C") +
    allele.count(a1691g, "G") + c2249t, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9818 -0.5917 -0.0303  0.6666  2.7101

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
(Intercept)                 -0.1807     0.5996   -0.30     0.76
homozygote(c104t, "C")TRUE   1.0203     0.2290    4.46  2.3e-05 ***
allele.count(a1691g, "G")   -0.0905     0.1175   -0.77     0.44
c2249tT/C                    0.4291     0.6873    0.62     0.53
c2249tT/T                    0.3476     0.5848    0.59     0.55
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.1 on 95 degrees of freedom
Multiple R-Squared: 0.176,     Adjusted R-squared: 0.141
F-statistic: 5.06 on 4 and 95 DF,  p-value: 0.000969

Conclusion
The current release of the genetics package, 1.0.0,
provides a complete set of classes and methods for
handling single-locus genetic data as well as functions for computing and testing for departure from
Hardy-Weinberg and linkage disequilibrium using a

variety of estimators.
As noted earlier, Friedrich Leisch and I collaborated on the design of the data structures. While
I was primarily motivated by the desire to provide
a natural way to include single-locus genetic variables in statistical models, Fritz also wanted to support multiple genetic changes spread across one or
more genes. As of the current version, my goal has
largely been realized, but more work is necessary to
fully support Fritz’s goal.
In the future I intend to add functions to perform
haplotype imputation and generate standard genetics plots.
I would like to thank Friedrich Leisch for his
assistance in designing the genotype data structure, David Duffy for contributing the code for the
gregorius and HWE.exact functions, and Michael
Man for error reports and helpful discussion.
I welcome comments and contributions.
Gregory R. Warnes
Pfizer Global Research and Development
gregory_r_warnes@groton.pfizer.com

Variance Inflation Factors
by Jürgen Groß

A short discussion of centered and uncentered variance inflation factors. It is recommended to report both types of variance inflation for a linear model.

Centered variance inflation factors

A popular method in the classical linear regression model
$$ y = X\beta + \varepsilon, \qquad X = (1_n, x_2, \dots, x_p) \in \mathbb{R}^{n \times p}, \qquad \varepsilon \sim (0, \sigma^2 I_n), $$
to diagnose collinearity is to measure how much the variance $\mathrm{var}(\hat\beta_j)$ of the $j$-th element $\hat\beta_j$ of the ordinary least squares estimator $\hat\beta = (X'X)^{-1}X'y$ is inflated by near linear dependencies of the columns of $X$. This is done by putting $\mathrm{var}(\hat\beta_j)$ in relation to the variance of $\hat\beta_j$ as it would have been when

(a) the nonconstant columns $x_2, \dots, x_p$ of $X$ had been centered and

(b) the centered columns had been orthogonal to each other.

This variance is given by
$$ v_j = \frac{\sigma^2}{x_j' C x_j}, \qquad C = I_n - \frac{1}{n} 1_n 1_n', $$
where $C$ is the usual centering matrix. Then the centered variance inflation factor for the $j$-th variable is
$$ \mathrm{VIF}_j = \frac{\mathrm{var}(\hat\beta_j)}{v_j} = \sigma^{-2}\,\mathrm{var}(\hat\beta_j)\, x_j' C x_j, \qquad j = 2, \dots, p. $$
It can also be written in the form
$$ \mathrm{VIF}_j = \frac{1}{1 - \widetilde{R}_j^2}, $$
where
$$ \widetilde{R}_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' C x_j}, \qquad M_j = I_n - X_j (X_j' X_j)^{-1} X_j', $$
is the centered coefficient of determination when the variable $x_j$ is regressed on the remaining independent variables (including the intercept). The matrix $X_j$ is obtained from $X$ by deleting its $j$-th column. From this it is also intuitively clear that the centered VIF only works satisfactorily in a model with intercept.

The centered variance inflation factor (as well as a generalized variance-inflation factor) is implemented in the R package car by Fox (2002), see also Fox (1997), and can also be computed with the function listed below, which has been communicated by Venables (2002) via the r-help mailing list.
> vif <- function(object, ...)
      UseMethod("vif")

> vif.default <- function(object, ...)
      stop("No default method for vif. Sorry.")

> vif.lm <- function(object, ...) {
      V <- summary(object)$cov.unscaled
      Vi <- crossprod(model.matrix(object))
      nam <- names(coef(object))
      k <- match("(Intercept)", nam,
                 nomatch = FALSE)
      if(k) {
          v1 <- diag(V)[-k]
          v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
          nam <- nam[-k]
      }
      else {
          v1 <- diag(V)
          v2 <- diag(Vi)
          warning(paste("No intercept term",
                        "detected. Results may surprise."))
      }
      structure(v1 * v2, names = nam)
  }

Uncentered variance inflation factors
As pointed out by Belsley (1991, p. 28/29) the centered VIF can be disadvantageous when collinearity
effects are related to the intercept. To illustrate this
point let us consider the following small ad-hoc example:
> n <- 50
> x2 <- 5 + rnorm(n, 0, 0.1)
> x3 <- 1:n
> x4 <- 10 + x2 + x3 + rnorm(n, 0, 10)
> x5 <- (x3/100)^5 * 100
> y <- 10 + 4 + x2 + 1*x3 + 2*x4 + 4*x5 +
+      rnorm(n, 0, 50)
> cl.lm <- lm(y ~ x2 + x3 + x4 + x5)

Then centered variance inflation factors obtained
via vif.lm(cl.lm) are
      x2       x3       x4       x5
1.017126 6.774374 4.329862 3.122899

Several runs produce similar results. Usually
values greater than 10 are said to indicate harmful
collinearity, which is not the case above, although the
VIFs report moderate collinearity related to x3 and x4 .
Since the smallest possible VIF is 1, the above result
seems to indicate that x2 is completely unharmed by
collinearity.
No harmful collinearity in the model when the
second column x2 of the matrix X is nearly five times
the first column and the scaled condition number, see
Belsley (1991, Sec. 3.3), is 147.4653?
As a matter of fact, the centered VIF requires an intercept in the model but at the same time denies the status of
the intercept as an independent ‘variable’ being possibly
related to collinearity effects.
Now as indicated by Belsley (1991), an alternative way to measure variance inflation is simply to apply the classical (uncentered) coefficient of determination
$$ R_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' x_j} $$
instead of the centered $\widetilde{R}_j^2$. This can also be done for the intercept as an independent ‘variable’. For the above example we obtain

(Intercept)        x2        x3        x4        x5
   2540.378  2510.568  27.92701  24.89071  4.518498

as uncentered VIFs, revealing that the intercept and x2 are strongly affected by collinearity, which is in fact the case.
It should be noted that x_j' C x_j ≤ x_j' x_j, so that always R̃²_j ≤ R²_j. Hence the usual critical value R̃²_j > 0.9 (corresponding to centered VIF > 10) for harmful collinearity should be greater when the uncentered VIFs are regarded.

Conclusion
When we use collinearity analysis for finding a possible relationship between impreciseness of estimation
and near linear dependencies, then an intercept in
the model should not be excluded from the analysis,
since the corresponding parameter is as important as
any other parameter (e.g. for prediction purposes).
See also (Belsley, 1991, Sec. 6.3, 6.8) for a discussion
of mean centering and the constant term in connection with collinearity.
By reporting both centered and uncentered VIFs for a linear model, we can obtain an impression of the possible involvement of the intercept in collinearity effects. This can easily be done by complementing the above vif.lm function, since the uncentered VIFs can be computed as in the else part of the function, namely as diag(V) * diag(Vi). This way the uncentered VIFs are computed in any case, not only when there is no intercept in the model.
> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam, nomatch = FALSE)
    # uncentered VIFs are computed in any case
    v1 <- diag(V)
    v2 <- diag(Vi)
    uc.struct <- structure(v1 * v2, names = nam)
    if(k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
      c.struct <- structure(v1 * v2, names = nam)
      # return both sets of VIFs as a named list
      return(list(Centered.VIF = c.struct,
                  Uncentered.VIF = uc.struct))
    } else {
      warning(paste("No intercept term",
                    "detected. Uncentered VIFs computed."))
      return(list(Uncentered.VIF = uc.struct))
    }
  }
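For illustration, with the list return used above, both sets of VIFs for the ad-hoc example are then available by name (a sketch; output not shown):

> vifs <- vif.lm(cl.lm)
> vifs$Centered.VIF
> vifs$Uncentered.VIF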

The user of such a function should allow greater critical values for the uncentered than for the centered VIFs.

Bibliography

D. A. Belsley (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, New York.

J. Fox (1997). Applied Regression, Linear Models, and Related Methods. Sage Publications, Thousand Oaks.

J. Fox (2002). An R and S-Plus Companion to Applied Regression. Sage Publications, Thousand Oaks.

W. N. Venables (2002). [R] RE: [S] VIF Variance Inflation Factor. http://www.R-project.org/nocvs/mail/r-help/2002/8566.html.

Jürgen Groß
University of Dortmund
Vogelpothsweg 87
D-44221 Dortmund, Germany
gross@statistik.uni-dortmund.de

Building Microsoft Windows Versions of
R and R packages under Intel Linux
A Package Developer’s Tool


by Jun Yan and A.J. Rossini
It is simple to build R and R packages for Microsoft
Windows from an ix86 Linux machine. This is very
useful to package developers who are familiar with
Unix tools and feel that widespread dissemination
of their work is important. The resulting R binaries
and binary library packages require only a minimal
amount of work to install under Microsoft Windows.
While testing is always important for quality assurance and control, we have found that the procedure
usually results in reliable packages. This document
provides instructions for obtaining, installing, and
using the cross-compilation tools.
These steps have been put into a Makefile, which
accompanies this document, for automating the process. The Makefile is available from the contributed
documentation area on the Comprehensive R Archive
Network (CRAN). The current version has been
tested with R-1.7.0.
For the impatient and trusting, if a current version of Linux R is in the search path, then
make CrossCompileBuild
will run all Makefile targets described up to the
section, Building R Packages and Bundles. This assumes: (1) you have the Makefile in a directory
RCrossBuild (empty except for the makefile), and (2)
that ./RCrossBuild is your current working directory. After this, one should manually set up the packages of interest for building, though the makefile will
still help with many important steps. We describe this in detail in the section on Building R Packages and Bundles.

Setting up the build area
We first create a separate directory for R cross-building, say RCrossBuild, and put the Makefile into this directory. From now on this directory is stored in the environment variable $RCB. Make sure the file name is Makefile, unless you are familiar with using the make program with other control file names.

Create work area
The following steps should take place in the
RCrossBuild directory. Within this directory, the
Makefile assumes the following subdirectories either
exist or can be created, for use as described:
• downloads is the location to store all the sources
needed for cross-building.
• cross-tools is the location to unpack the
cross-building tools.
• WinR is the location to unpack the R source and
cross-build R.
• LinuxR is the location to unpack the R source
and build a fresh Linux R, which is only needed
if your system doesn’t have the current version
of R.

• pkgsrc is the location of package sources (tarred and gzipped from R CMD build) that need to cross-build.

• WinRlibs is the location to put the resulting Windows binaries of the packages.

These directories can be changed by modifying the
Makefile. One may find what exactly each step does
by looking into the Makefile.

Download tools and sources
make down
This command downloads the i386-mingw32 cross-compilers and R source (starting from R-1.7.0, other sources that used to be necessary for cross-building, pcre and bzip2, are no longer needed). It
places all sources into a separate directory called
RCrossBuild/downloads. The wget program is used
to get all needed sources. Other downloading tools
such as curl or links/elinks can be used as well. We
place these sources into the RCrossBuild/downloads
directory. The URLs of these sources are available in
file src/gnuwin32/INSTALL which is located within
the R source tree (or, see the Makefile associated
with this document).

Cross-tools setup
make xtools
This rule creates a separate subdirectory
cross-tools in the build area for the cross-tools. It unpacks the cross-compiler tools into the
cross-tools subdirectory.

Prepare source code

make prepsrc

This rule creates a separate subdirectory WinR to carry out the cross-building process of R and unpacks the R sources into it. As of 1.7.0, the sources for pcre and bzip2 are included in the R source tar archive; before that, we needed to worry about them.

Build Linux R if needed

make LinuxR

This step is required, as of R-1.7.0, if you don't have a current version of Linux R on your system and you don't have the permission to install one for system-wide use. This rule will build and install a Linux R in $RCB/LinuxR/R.

Building R

Configuring

If a current Linux R is available from the system and on your search path, then run the following command:

make mkrules
This rule modifies the file src/gnuwin32/Mkrules
from the R source tree to set BUILD=CROSS and
HEADER=$RCB/cross-tools/mingw32/include.
If a current Linux R is not available from the
system, and a Linux R has just been built by make
LinuxR from the end of the last section:
make LinuxFresh=YES mkrules
This rule will set R_EXE=$RCB/LinuxR/R/bin/R, in
addition to the variables above.

Compiling
Now we can cross-compile R:
make R
The environment variable $PATH is modified in
this make target to ensure that the cross-tools are
at the beginning of the search path. This step as
well as initiation of the compile process is accomplished by the R makefile target. This may take a
while to finish. If everything goes smoothly, a compressed file Win-R-1.7.0.tgz will be created in the
RCrossBuild/WinR directory.
This gzip’d tar-file contains the executables and
supporting files for R which will run on a Microsoft Windows machine. To install, transfer and
unpack it on the desired machine. Since there is
no InstallShield-style installer executable, one will
have to manually create any desired desktop shortcuts to the executables in the bin directory, such as
Rgui.exe. Remember, though, this isn’t necessarily
the same as the R for Windows version on CRAN!


Building R packages and bundles
Now we have reached the point of interest. As one
might recall, the primary goal of this document is to
be able to build binary packages for R libraries which
can be simply installed for R running on Microsoft
Windows. All the work up to this point simply obtains the required working build of R for Microsoft
Windows!
Now, create the pkgsrc subdirectory to put the
package sources into and the WinRlibs subdirectory
to put the Windows binaries into.
We will use the geepack package as an example for cross-building. First, put the source
geepack_0.2-4.tar.gz into the subdirectory
pkgsrc, and then do

make pkg-geepack_0.2-4
If there is no error, the Windows binary geepack.zip
will be created in the WinRlibs subdirectory, which
is then ready to be shipped to a Windows machine
for installation.
We can easily build bundled packages as well.
For example, to build the packages in bundle VR, we
place the source VR_7.0-11.tar.gz into the pkgsrc
subdirectory, and then do
make bundle-VR_7.0-11
The Windows binaries of packages MASS, class,
nnet, and spatial in the VR bundle will appear in
the WinRlibs subdirectory.
This Makefile assumes a tarred and gzipped
source for an R package, which ends with “.tar.gz”.
This is usually created through the R CMD build
command. It takes the version number together with
the package name.

The Makefile
The associated Makefile is used to automate many
of the steps. Since many Linux distributions come
with the make utility as part of their installation, it
hopefully will help rather than confuse people crosscompiling R. The Makefile is written in a format similar to shell commands in order to show what exactly
each step does.


The commands can also be cut-and-pasted out
of the Makefile with minor modification (such as,
change $$ to $ for environmental variable names),
and run manually.

Possible pitfalls
We have very little experience with cross-building
packages (for instance, Matrix) that depend on external libraries such as atlas, blas, lapack, or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these
cases. It is worth experimenting, though!

Acknowledgments
We are grateful to Ben Bolker, Stephen Eglen, and
Brian Ripley for helpful discussions.
Jun Yan
University of Wisconsin–Madison, U.S.A.
jyan@stat.wisc.edu
A.J. Rossini
University of Washington, U.S.A.
rossini@u.washington.edu

Analysing Survey Data in R
by Thomas Lumley

Introduction
Survey statistics has always been a somewhat specialised area due in part to its view of the world. In
the rest of the statistical world data are random and
we model their distributions. In traditional survey
statistics the population data are fixed and only the
sampling is random. The advantage of this ‘design-based’ approach is that the sampling, unlike the population, is under the researcher’s control.
The basic question of survey statistics is
If we did exactly this analysis on the
whole population, what result would we
get?
If individuals were sampled independently with
equal probability the answer would be very straightforward. They aren’t, and it isn’t.
To simplify the operations of a large survey it is
routine to sample in clusters. For example, a random
sample of towns or counties can be chosen and then
individuals or families sampled from within those
areas. There can be multiple levels of cluster sampling, eg, individuals within families within towns.
Cluster sampling often results in people having an
unequal chance of being included, for example, the
same number of individuals might be surveyed regardless of the size of the town.
For statistical efficiency and to make the results
more convincing it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all
strata are fairly represented. Strata for a national survey might divide the country into geographical regions and then subdivide into rural and urban. Ensuring that smaller strata are well represented may
also involve sampling with unequal probabilities.
Finally, unequal probabilities might be deliberately used to increase the representation of groups
that are small or particularly important. In US health
surveys there is often interest in the health of the
poor and of racial and ethnic minority groups, who
might thus be oversampled.
The resulting sample may be very different in
structure from the population, but it is different in
ways that are precisely known and can be accounted
for in the analysis.

Weighted estimation
The major technique is inverse-probability weighting to estimate population totals. If the probability
of sampling individual i is πi and the value of some
variable or statistic for that person is Yi then an unbiased estimate of the population total is
    ∑_i (1/π_i) Y_i

regardless of clustering or stratification. This is
called the Horvitz–Thompson estimator (Horvitz &
Thompson, 1951). Inverse-probability weighting can
be used in an obvious way to compute population
means and higher moments. Less obviously, it can
be used to estimate the population version of loglikelihoods and estimating functions, thus allowing
almost any standard models to be fitted. The resulting estimates answer the basic survey question: they
are consistent estimators of the result that would be
obtained if the same analysis was done to the whole
population.
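As a minimal sketch, assuming y holds the observed values and p the corresponding sampling probabilities, the Horvitz–Thompson total is just an inverse-probability weighted sum:

# Horvitz-Thompson estimate of a population total:
# weight each observation by the inverse of its sampling probability
ht.total <- function(y, p) sum(y / p)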
It is important to note that estimates obtained by
maximising an inverse-probability weighted loglikelihood are not in general maximum likelihood estimates under the model whose loglikelihood is being
used. In some cases semiparametric maximum likelihood estimators are known that are substantially
more efficient than the survey-weighted estimators
if the model is correctly specified. An important example is a two-phase study where a large sample is
taken at the first stage and then further variables are
measured on a subsample.

Standard errors
The standard errors that result from the more common variance-weighted estimation are incorrect for
probability weighting and in any case are model-based and so do not answer the basic question. To
make matters worse, stratified and clustered sampling also affect the variance, the former decreasing
it and the latter (typically) increasing it relative to independent sampling.
The formulas for standard error estimation are
developed by first considering the variance of a
sum or mean under independent unequal probability sampling. This can then be extended to cluster
sampling, where the clusters are independent, and
to stratified sampling by adding variance estimates
from each stratum. The formulas are similar to the
‘sandwich variances’ used in longitudinal and clustered data (but not quite the same).
If we index strata by s, clusters by c and observations by i then

    var̂[S] = ∑_s [ n_s / (n_s − 1) ] ∑_{c,i} ( Y_sci / π_sci − Ȳ_s )²

where S is the weighted sum, Ȳ_s are weighted stratum means, and n_s is the number of clusters in stratum s.
Standard errors for statistics other than means are
developed from a first-order Taylor series expansion,
that is, they use the same formula applied to the
mean of the estimating functions. The computations
for variances of sums are performed in the svyCprod
function, ensuring that they are consistent across the
survey functions.

Subsetting surveys
Estimates on a subset of a survey require some
care. Firstly, it is important to ensure that the correct weight, cluster and stratum information is kept
matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed
survey of its own.
The correct analysis corresponds approximately
to setting the weight to zero for observations not in
the subset. This is not how it is implemented, since
observations with zero weight still contribute to constraints in R functions such as glm, and they still take
up memory. Instead, the survey.design object keeps
track of how many clusters (PSUs) were originally
present in each stratum and svyCprod uses this to adjust the variance.

Example
The examples in this article use data from the National Health Interview Survey (NHIS 2001) conducted by the US National Center for Health Statistics. The data and documentation are available
from http://www.cdc.gov/nchs/nhis.htm. I used
NCHS-supplied SPSS files to read the data and then
read.spss in the foreign package to load them into
R. Unfortunately the dataset is too big to include
these examples in the survey package — they would
increase the size of the package by a couple of orders
of magnitude.

The survey package
In order to keep the stratum, cluster, and probability
information associated with the correct observations
the survey package uses a survey.design object created by the svydesign function. A simple example
call is

imdsgn <- svydesign(id=~PSU, strata=~STRATUM,
                    weights=~WTFA.IM,
                    data=immunize,
                    nest=TRUE)

The data for this, the immunization section of
NHIS, are in the data frame immunize. The strata
are specified by the STRATUM variable, and the inverse
probability weights are in the WTFA.IM variable. The
statement specifies only one level of clustering, by
PSU. The nest=TRUE option asserts that clusters are
nested in strata so that two clusters with the same
psuid in different strata are actually different clusters.
In fact there are further levels of clustering and it
would be possible to write
imdsgn <- svydesign(id=~PSU+HHX+FMX+PX,
                    strata=~STRATUM,
                    weights=~WTFA.IM,
                    data=immunize,
                    nest=TRUE)

to indicate the sampling of households, families,
and individuals. This would not affect the analysis,
which depends only on the largest level of clustering.
The main reason to provide this option is to allow
sampling probabilities for each level to be supplied
instead of weights.
The strata argument can have multiple terms:
eg strata=~region+rural, specifying strata defined
by the interaction of region and rural, but NCHS
studies provide a single stratum variable so this is
not needed. Finally, a finite population correction to
variances can be specified with the fpc argument,
which gives either the total population size or the
overall sampling fraction of top-level clusters (PSUs)
in each stratum.
Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling
functions.
The resulting survey.design object has methods
for subscripting and na.action as well as the usual
print and summary. The print method gives some
descriptions of the design, such as the number of
largest level clusters (PSUs) in each stratum, the distribution of sampling probabilities and the names of
all the data variables. In this immunisation study the
sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.
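For example, the design description mentioned above is obtained with the usual calls (a small sketch using the imdsgn object created earlier; output omitted):

> print(imdsgn)    # PSUs per stratum, sampling probabilities, variable names
> summary(imdsgn)  # the corresponding summary method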

Analyses
Suppose we want to see how many children
have had their recommended polio immunisations
recorded. The command
svytable(~AGE.P+POLCT.C, design=imdsgn)

produces a table of number of polio immunisations
by age in years for children with good immunisation
data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more
easily on the page, and since three doses is regarded
as sufficient we can do
> svytable(~AGE.P+pmin(3,POLCT.C), design=imdsgn)
     pmin(3, POLCT.C)
AGE.P      0      1      2       3
    0 473079 411177 655211  276013
    1 162985  72498 365445  863515
    2 144000  84519 126126 1057654
    3  89108  68925 110523 1123405
    4 133902 111098 128061 1069026
    5 137122  31668  19027 1121521
    6 127487  38406  51318  838825

where the last column refers to 3 or more doses. For
ages 3 and older, just over 80% of children have received 3 or more doses and roughly 10% have received none.
To examine whether this varies by city size we
could tabulate
svytable(~AGE.P+pmin(3,POLCT.C)+MSASIZEP,
design=imdsgn)

where MSASIZEP is the size of the ‘metropolitan statistical area’ in which the child lives, with categories
ranging from 5,000,000+ to under 250,000, and a final
category for children who don’t live in a metropolitan statistical area. As MSASIZEP has 7 categories the
data end up fairly sparse. A regression model might
be more useful.
Regression models
The svyglm and svycoxph functions have similar
syntax to their non-survey counterparts, but use a
design rather than a data argument. For simplicity
of implementation they do require that all variables
in the model be found in the design object, rather
than floating free in the calling environment.
To look at associations with city size we fit the logistic regression models in Figure 1. The resulting
svyglm objects have a summary method similar to
that for glm. There isn’t a consistent association, but
category 2: cities from 2.5 million to 5 million people,
does show some evidence of lower vaccination rates.
Similar analyses using family income don’t show any
real evidence of an association.
General weighted likelihood estimation
The svymle function maximises a specified weighted
likelihood. It takes as arguments a loglikelihood
function and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example the log=TRUE argument is
passed to dnorm and is the only fixed argument. The
other two arguments, mean and sd, are specified by

svyglm(I(POLCT.C>=3)~factor(AGE.P)+factor(MSASIZEP), design=imdsgn, family=binomial)
svyglm(I(POLCT.C>=3)~factor(AGE.P)+MSASIZEP, design=imdsgn, family=binomial)

Figure 1: Logistic regression models for vaccination status
gr <- function(x, mean, sd, log){
    dm <- 2*(x - mean)/(2*sd^2)
    ds <- (x - mean)^2*(2*(2*sd))/(2*sd^2)^2 - sqrt(2*pi)/(sd*sqrt(2*pi))
    cbind(dm, ds)
}

m2 <- svymle(loglike=dnorm, gradient=gr, design=dxi,
             formulas=list(mean=y~x+z, sd=~1),
             start=list(c(2,5,0), c(4)),
             log=TRUE)

Figure 2: Weighted maximum likelihood estimation
formulas. Exactly one of these formulas should have
a left-hand side specifying the response variable.
It is necessary to specify the gradient in order to
get survey-weighted standard errors. If the gradient
is not specified you can get standard errors based on
the information matrix. With independent weighted
sampling the information matrix standard errors will
be accurate if the model is correctly specified. As is
always the case with general-purpose optimisation it
is important to have reasonable starting values; you
could often get them from an analysis that ignored
the survey design.

Extending the survey package
It is easy to extend the survey package to include
any new class of model where weighted estimation
is possible. The svycoxph function shows how this is
done.
1. Create a call to the underlying model (coxph),
adding a weights argument with weights 1/πi
(or a rescaled version).
2. Evaluate the call
3. Extract the one-step delta-betas ∆_i, or compute them from the information matrix I and score contributions U_i as ∆_i = I^{-1} U_i
4. Pass ∆i and the cluster, strata and finite population correction information to svyCprod to
compute the variance.
5. Add a "svywhatever" class to the object and a
copy of the design object
6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE)
7. Override any likelihood extraction methods to
fail

It may also be necessary to override other methods such as residuals, as is shown for Pearson residuals in residuals.svyglm.
A second important type of extension is to add
prediction methods for small-area extrapolation of
surveys. That is, given some variables measured in a
small geographic area (a new cluster), predict other
variables using the regression relationship from the
whole survey. The predict methods for survey regression models currently give predictions only for individuals.
Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within
clusters, partly because I don’t know what the right
variance formula is.

Summary
Analysis of complex sample surveys used to require
expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a
reasonably powerful PC to analyze data from large
national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a
fairly short time. Feedback from practising survey
statisticians on this package would be particularly
welcome.

References
Horvitz DG, Thompson DJ (1951). A Generalization
of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association.
47, 663-685.
Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu

Computational Gains Using RPVM on a
Beowulf Cluster
by Brett Carson, Robert Murison and Ian A. Mason.


Introduction


The Beowulf cluster Becker et al. (1995); Scyld Computing Corporation (1998) is a recent advance in
computing technology that harnesses the power of
a network of desktop computers using communication software such as PVM Geist et al. (1994) and MPI
Message Passing Interface Forum (1997). Whilst the
potential of a computing cluster is obvious, expertise
in programming is still developing in the statistical
community.
Recent articles in R-news Li and Rossini (2001)
and Yu (2002) entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW Tierney (2002); Rossini et al. (2003) aims to skillfully
provide a wrapper interface to these packages, independent of the underlying cluster communication
method used in parallel computing. This article concentrates on RPVM and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations
to each node of the cluster. The statistical technique
used to motivate our RPVM application is the geneshaving algorithm Hastie et al. (2000b,a) for which
S-PLUS code has been written by Do and Wen (2002)
to perform the calculations serially.
The first section is a brief description of the Beowulf cluster used to run the R programs discussed
in this paper. This is followed by an explanation of
the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates
of the “strength" of a cluster and the rendering of
each matrix row orthogonal to the “eigen-gene". The
code for spawning child processes is then explained
comprehensively and the conclusion compares the
speed of RPVM on the Beowulf to serial computing.

A parallel computing environment
The Beowulf cluster Becker et al. (1995); Scyld Computing Corporation (1998) used to perform the parallel computation outlined in the following sections
is a sixteen node homogeneous cluster. Each node
consists of an AMD XP1800+ with 1.5Gb of memory,
running Redhat Linux 7.3. The nodes are connected
by 100Mbps Ethernet, and are accessible via the dual
processor AMD XP1800+ front-end. Each node has
installed versions of R, PVM Geist et al. (1994) and R’s
PVM wrapper, RPVM Li and Rossini (2001).
Gene-shaving

The gene-shaving algorithm Hastie et al. (2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes, and large variance between samples Hastie et al. (2000b). The algorithm consists of a number of steps:

1. The process starts with the matrix X_{N,p} which represents expressions of genes in the micro-array. Each row of X is data from a gene, the columns contain samples. N is usually of size 10³ and p usually no greater than 10².

2. The leading principal component of X is calculated and termed the “eigen-gene”.

3. The correlation of each row with the eigen-gene is calculated and the rows yielding the lowest 10% of R² are shaved. This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9N] genes in X*_1, [0.81N] in X*_2 and 1 in X*_N.

4. The optimal cluster size is selected from X*. The selection process requires that we estimate the distribution of R² by bootstrap and calculate G_k = R²_k − R*²_k, where R*²_k is the mean R² from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic and the optimal cluster is that which has the largest gap statistic.

5. Once a cluster has been determined, the process repeats searching for more clusters, after removing the information of previously determined clusters. This is done by “sweeping” the mean gene of the optimal cluster, x̄_Ŝk, from each row of X (see the sketch after this list),

       newX[i,] = X[i,] ( I − x̄_Ŝk ( x̄_Ŝk' x̄_Ŝk )^{-1} x̄_Ŝk' )

   where the bracketed matrix is denoted IPx.

6. The next optimal cluster is found by repeating
the process using newX.
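A minimal R sketch of the sweep in step 5, assuming xbar holds the mean gene x̄ of the optimal cluster as a numeric vector of length ncol(X):

# IPx = I - xbar xbar'/(xbar' xbar), the matrix applied to each row in step 5
IPx  <- diag(ncol(X)) - outer(xbar, xbar) / sum(xbar^2)
newX <- X %*% IPx   # sweep the mean gene out of every row of X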

Figure 1: Clustering of genes

The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.

Parallel implementation of gene-shaving
There are two steps in the shaving process that can
be computed in parallel. Firstly, the bootstrapping of
the matrix X, and secondly, the orthogonalization of
the matrix X with respect to x̄_Ŝk. In general, for an
algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16
nodes. However, given that this particular algorithm
will not be fully parallelized, we would not expect
such a reduction in computation time. In fact, the
cluster will not be fully utilised during the overall
computation, as most of the nodes will see significant idle time. Amdahl’s Law Amdahl (1967) can be
used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl’s Law can be
represented in mathematical form by:
    Speedup = 1 / ( r_s + r_p / n )

where r_s is the proportion of the program computed in serial, r_p is the parallel proportion and n is the
number of processors used. In a perfect world, if
a particular step computed in parallel were spread
over five nodes, the speedup would equal five. This
is very rarely the case however, due to speed limiting
factors like network load, PVM overheads, communication overheads, CPU load and the like.
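Amdahl's law is easily written as a small R helper; plugging in the parallel proportion of 0.78 reported in the results below reproduces the estimated speedups quoted there (a sketch added for illustration):

# expected speedup when a fraction rp of the work is spread over n processors
amdahl <- function(rp, n) 1 / ((1 - rp) + rp / n)
amdahl(0.78, c(2, 10))   # approximately 1.63 and 3.35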

The parallel setup
Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master), and a number of child processes that the master
spawns (the slaves). The master process controls the
entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait
R News

Slave 1

Slave 2

Slave 3

Slave n

Figure 2: The master-slave paradigm without slaveslave communication.

Bootstrapping
Bootstrapping is used to find R*² used in the gap
statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could
have arisen by chance. Elements of each row of
the matrix are rearranged, which requires N iterations, quite a CPU intensive operation, even more so
for larger expression matrices. In the serial R version, this would be done using R’s apply function,
whereby each row of X has the sample function applied to it.
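A one-line serial sketch of this permutation step (assuming X is the expression matrix):

# permute the entries within each row of X; apply() returns the permuted
# rows as columns, so transpose to restore the original dimensions
Xb <- t(apply(X, 1, sample))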
The matrix X is transformed into X* B times by randomly permuting within rows. For each of the B bootstrap samples (indexed by b), the statistic R²_b is derived and the mean of R²_b, b = 1, ..., B, is calculated to represent that R² which arises by chance only. In
RPVM, for each bootstrap sample a task is spawned;
for B bootstrap samples we spawn B tasks.

Orthogonalization
The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will
be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of
IPx, as defined in step 5 of the gene-shaving algorithm.

The value of IPx is passed to each child. Each
child then applies IPx to each row of X that the master has given it.
There are two ways in which X can be distributed. The matrix can be split into approximately
even portions, where the number of portions will
be equal to the number of slaves. Alternatively, the
matrix can be distributed on a row-by-row basis,
where each slave processes a row, sends it back, is
given a new row to process and so on. The latter
method becomes slow and inefficient, as more sending, receiving, and unpacking of buffers is required.
Such a method may become useful where the number of rows of the matrix is small and the number of
columns is very large.

The RPVM code
Li and Rossini (2001) have explained the initial steps
of starting the PVM daemon and spawning RPVM tasks
from within an R program; these steps precede the
following code. The full source code is available
from http://mcs.une.edu.au/~bcarson/RPVM/.

Bootstrapping
Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The
number of children is the number of bootstrap samples, (B) and the matrix is copied to each child,
(1, ..., B). Child process i permutes elements within
each row of X, calculates R∗2 and returns this to the
master process. The master averages R∗2 , i = 1, B
and computes the gap statistic.
The usual procedure for sending objects to PVM
child processes is to initialize the send buffer, pack
the buffer with data and send the buffer. Our function par.bootstrap is a function used by the master process and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the
buffer, and .PVM.mcast(boot.children,0) to send
the data to all children.
par.bootstrap <- function(X){
  brsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  #send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children,0)
  #get the R squared values back and
  #take the mean
  for(i in 1:tasks){
    .PVM.recv(boot.children[i],0)
    rsq <- .PVM.upkdouble(1,1)
    brsq <- brsq + (rsq/tasks)
  }
  return(brsq)
}

After the master has sent the matrix to the
children, it waits for them to return their calculated R² values. This is controlled by a simple for loop: as the master receives each answer via .PVM.recv(boot.children[i],0), it is unpacked from the receive buffer with:
rsq <- .PVM.upkdouble(1,1)
The value obtained is used in calculating an overall mean for all R2 values returned by the children.
In the slave tasks, the matrix is unpacked and a bootstrap permutation is performed, an R2 value calculated and returned to the master. The following code
performs these operations of the children tasks:
#receive the matrix
.PVM.recv(.PVM.parent(),0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
#perform the permutation
for(i in 1:dd[1]){
  permute <- sample(1:dd[2],replace=F)
  Xb[i,] <- X[i,permute]
}
#send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(),0)

.PVM.upkdblmat() is used to unpack the matrix.
After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master),
the result of Dfunct(Xb) (R2 calculation) is packed
into the send buffer, and then the answer is sent with
.PVM.send() to the parent of the slave, which will be
the master process.

Orthogonalization
The orthogonalization in the master process is handled by the function par.orth. This function takes as
arguments the IPx matrix, and the expression matrix
X. The first step of this procedure is to broadcast the
IPx matrix to all slaves by firstly packing it and then
sending it with .PVM.mcast(orth.children,0). The
next step is to split the expression matrix into portions and distribute to the slaves. For simplicity, this
example divides the matrix by the number of slaves,
so row size must be divisible by the number of children. Using even blocks on a homogeneous cluster
will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions
back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the par.orth function used by the master process:
par.orth <- function(IPx, X){
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  #initialize the send buffer
  .PVM.initsend()
  #pack the IPx matrix and send it to
  #all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children,0)
  #divide the X matrix into blocks and
  #send a block to each child
  n.rows <- as.integer(rows/tasks)
  for(i in 1:tasks){
    start <- (n.rows * (i-1))+1
    end <- n.rows * i
    if(end > rows)
      end <- rows
    X.part <- X[start:end,]
    .PVM.pkdblmat(X.part)
    .PVM.send(orth.children[i],start)
  }
  #wait for the blocks to be returned
  #and re-construct the matrix
  for(i in 1:tasks){
    .PVM.recv(orth.children[i],0)
    rZ <- .PVM.upkdblmat()
    if(i==1){
      Z <- rZ
    } else {
      Z <- rbind(Z,rZ)
    }
  }
  dimnames(Z) <- dimnames(X)
  Z
}

The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization and sends back the result to the master:

#wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(),0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(),-1)
X.part <- .PVM.upkdblmat()
Z <- X.part
#orthogonalize X
for(i in 1:dim(X.part)[1]){
  Z[i,] <- X.part[i,] %*% IPx
}
#send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(),0)

Gene-shaving results
Serial gene-shaving
The results for the serial implementation of gene-shaving are a useful guide to the speedup that can
be achieved from parallelizing. The proportion of
the total computation time the bootstrapping and
orthogonalization use is of interest. If these are of
significant proportion, they are worth performing in
parallel.
The results presented were obtained from running the serial version on a single cluster node. RPVM
is not used in the serial version, and the program
contains no parallel code at all. The data used are
randomly generated matrices of size N = 1600, 3200,
6400, and 12800 respectively, with a constant p of size
80, to simulate variable-sized micro-arrays. In each
case, the first two clusters are found, and ten permutations used in each bootstrap. Figure 3 is a plot of
the overall computation time, the total time of all invocations of the bootstrapping process, and the total
time of all invocations of the orthogonalization process.

Figure 3: Serial gene-shaving results (computation time in seconds against size of expression matrix in rows, for the overall run, the bootstrapping and the orthogonalization).
Clearly, the bootstrapping (blue) step is quite
time consuming, with the mean proportion being
0.78 of the time taken to perform gene-shaving (red).
The orthogonalization (black) equates to no more
than 1%, so any speedup through parallelization is
not going to have any significant effect on the overall
time.
We have two sets of parallel results, one using
two cluster nodes, the other using all sixteen nodes
of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to be no
more than half, i.e. a speedup of 2. In full cluster runs
we could expect a speedup of no more than 10, as
we can use ten of the sixteen processors (one processor for each permutation). It is now possible to
estimate the expected maximum speedup of parallelization with Amdahl’s law, as described earlier for
expression matrices of size N = 12800 and p = 80,
given the parallel proportion of 0.78.
Application of Amdahl’s law for the cases (n =
2, 10) gives speedups of 1.63 and 3.35 respectively.
We have implemented the orthogonalization in parallel, more for demonstration purposes, as it could
become useful with very large matrices.


Parallel gene-shaving
The parallel version of gene-shaving only contains
parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains
the same as the serial version. Both the runs using
two nodes and all sixteen nodes use this same R program. In both cases ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the
Beowulf cluster in the sixteen node runs, as most of
the nodes are going to see quite a lot of idle time, and
in fact six of the nodes will not be used to compute
the bootstrap, the most CPU intensive task of the entire algorithm.
Figure 4: Comparison of computation times for variable sized expression matrices for overall, orthogonalization and bootstrapping under the different environments (serial, 2 nodes, 16 nodes).

The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:

    S = T_serial / T_parallel

    S_2nodes = 1301.18 / 825.34 = 1.6

    S_16nodes = 1301.18 / 549.59 = 2.4

The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.

Conclusions

Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster; each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.

Bibliography

Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference, Sunnyvale, California. IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/.

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user’s guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). ’Gene shaving’ as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.

Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4–7.

Message Passing Interface Forum (1997). MPI2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html.

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193.

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html.

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html.

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10–14.
Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au

R Help Desk
Getting Help – R’s Help Facilities and Manuals
Uwe Ligges

Introduction
This issue of the R Help Desk deals with probably its most fundamental topic: Getting help!
Different amounts of knowledge about R and different R-related experience require different levels of help. Therefore a description of the available manuals, help functions within R, and mailing lists will be given, as well as an “instruction” on how to use these
facilities.
A secondary objective of the article is to point
out that manuals, help functions, and mailing list
archives are commonly much more comprehensive
than answers on a mailing list. In other words,
there are good reasons to read manuals, use help
functions, and search for information in mailing list
archives.

Manuals
The manuals described in this Section are shipped
with R (in directory ‘.../doc/manual’ of a regular R
installation). Nevertheless, when problems occur before R is installed or the available R version is outdated, recent versions of the manuals are also available from CRAN1 . Let me cite the “R Installation and
Administration” manual by the R Development Core

Team (R Core, for short; 2003b) as a reference for information on installation and updates not only for R
itself, but also for contributed packages.
I think it is not possible to adequately use a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in “An Introduction
to R” (Venables et al., 2003), and the “R Data Import/Export” manual (R Core, 2003a). Both manuals
are the first references for any programming tasks in
R.
Frequently asked questions (and corresponding
answers) are collected in the “R FAQ” (Hornik, 2003),
an important resource not only for beginners. It is a
good idea to look into the “R FAQ” when a question
arises that cannot be answered by reading the other
manuals cited above. Specific FAQs for Macintosh
(by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well2 .
Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual “R Language Definition” (R Core, 2003c, still labelled as
“draft”) is a comprehensive reference manual for a
couple of important language aspects, e.g.: objects,
working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan
to work regularly with R, since it provides really illuminating insights.
For those rather experienced users who plan to
create their own packages, write help pages and
optimize their code, “Writing R Extensions” (R Core,

1 http://CRAN.R-project.org/
2 http://CRAN.R-project.org/faqs.html

R News

ISSN 1609-3631

Vol. 3/1, June 2003

2003d) is the appropriate reference. It also includes descriptions of interfacing to C, C++, or Fortran
code, and compiling and linking this code into R.
Occasionally, library(help=PackageName) indicates that package specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory ‘.../library/PackageName/doc’.
Beside the manuals, several books dealing with R
are available these days, e.g.: introductory textbooks,
books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the “optimal” choice
of a book depends on the tasks a user is going to
perform in R. Nevertheless, I would like to mention
Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in
R (and S-P LUS).

Help functions
Access to the documentation on the topic help (in this
case a function name) is provided by ?help. The “?”
is probably the most frequently used way of getting
help, whereas the corresponding function help() is
rather flexible. As the default, help is shown as formatted text. This can be changed (by arguments
to help(), or for a whole session with options())
to, e.g., HTML (htmlhelp=TRUE), if available. The
HTML format provides convenient features like links
to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting
options(chmhelp=TRUE) in file ‘.Rprofile’.
When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character
string in, e.g. name, title, or keyword entries) with
help.search(). This function allows quite flexible searching using either regular expression or
fuzzy matching, see ?help.search for details. If
it is required to find an object by its partial name,
apropos() is the right choice – it also accepts regular
expressions.
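A few typical calls look like this (a small illustrative sketch; output omitted):

> ?lm                          # same as help("lm")
> help.search("linear model")  # search installed documentation
> apropos("test")              # objects whose names contain "test"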
Documentation and help in HTML format can
also be accessed by help.start(). This function
starts a Web browser with an index that links to
HTML versions of the manuals mentioned above, the
FAQs, a list of available packages of the current installation, several other information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to “search
for keywords, function and data names and text in
help page titles” of installed packages. The list of

27

keywords is useful when the search using other facilities fails: the user gets a complete list of functions
and data sets corresponding to a particular keyword.
Of course, it is not possible to search for functionality of (unknown) contributed packages or documentation packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and
their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it
might still be hard to find the function or package of interest. For that purpose, an R site search3 provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives
(see below) can be searched.

Mailing lists
Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For
these purposes there are the mailing lists4 r-announce
for important announcements, r-help for questions
and answers about problems and solutions using R,
and r-devel for the development of R.
Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals
(including the “R FAQ”) or the help pages, and regularly those questions are answered within a few
hours (or even minutes).
Before asking, it is a good idea to look into the
archives of the mailing list.5 These archives contain a huge amount of information and may be much
more helpful than a direct answer to a question, because most questions have already been asked – and answered – several times (focussing on slightly different aspects, and hence illuminating a bigger context).
It is really easy to ask questions on r-help, but
I would like to point out the advantage of reading
manuals before asking: Of course, reading (a section
of) some manual takes much more time than asking, and answers from voluntary help providers of
r-help will probably solve the asker’s immediate problem, but quite similar problems will certainly arise
pretty soon. Instead, the time one spends reading
the manuals will be saved in subsequent programming tasks, because many side aspects (e.g. convenient and powerful functions, programming tricks,
etc.) will be learned from which one benefits in other
applications.

3 http://finzi.psych.upenn.edu/search.html, see also http://CRAN.R-project.org/search.html
4 accessible via http://www.R-project.org/mail.html
5 archives are searchable, see http://CRAN.R-project.org/search.html
R News

ISSN 1609-3631

Vol. 3/1, June 2003

Bibliography
Hornik, K. (2003): The R FAQ, Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/ ISBN 3-901167-51-X.

R Development Core Team (2003a): R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b): R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c): R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d): Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000): S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002): Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M., and the R Development Core Team (2003): An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.

Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de

Book Reviews
Michael Crawley: Statistical Computing: An Introduction to Data
Analysis Using S-Plus
John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp/
The number of monographs related to the S
programming language is sufficiently small (a few
dozen, at most) that there are still great opportunities for new books to fill some voids in the literature.
Statistical Computing (hereafter referred to as SC) is
one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.
The approach taken by the author is to emphasize
"graphical data inspection, parameter estimation and
model criticism rather than classical hypothesis testing" (p. ix). Within this framework, the text covers
a vast range of applied data analysis topics. Background material on S-Plus, probability, and distributions is included in the text. The traditional topics such as linear models, regression, and analysis of
variance are covered, sometimes to a greater extent
than is common (for example, a split-split-split-split
plot experiment and a whole chapter on ANCOVA).
More advanced modeling techniques such as bootstrap, generalized linear models, mixed effects models and spatial statistics are also presented. For each
topic, some elementary theory is presented and usually an example is worked by hand. Crawley also
discusses more philosophical issues such as randomization, replication, model parsimony, appropriate
transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all
S-Plus language words appear in bold type.
A supporting web site contains all the data files,
scripts with all the code from each chapter of the
book, and a few additional (rough) chapters. The
scripts pre-suppose that the data sets have already
been imported into S-Plus by some mechanism–no
explanation of how to do this is given–presenting
the novice S-Plus user with a minor obstacle. Once
the data is loaded, the scripts can be used to re-do
the analyses in the book. Nearly every worked example begins by attaching a data frame, and then
working with the variable names of the data frame.
No detach command is present in the scripts, so the
work environment can become cluttered and even
cause errors if multiple data frames have common
column names. Avoid this problem by quitting the
S-Plus (or R) session after working through the script
for each chapter.
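As a minimal sketch of the issue (my own illustration with a made-up data frame, not code from the book's scripts), detaching explicitly keeps the search path clean:

    dat <- data.frame(x = 1:5, y = c(2, 4, 3, 5, 7))   # hypothetical data set
    attach(dat)      # puts the columns x and y on the search path
    mean(y)
    detach(dat)      # remove them again, so columns of the next data frame are not masked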
SC contains a very broad use of S-Plus commands,
to the point that the author says about subplot, "Having gone to all this trouble, I can’t actually see why
you would ever want to do this in earnest" (p. 158).
Chapter 2 presents a number of functions (grep,
regexpr, solve, sapply) that are not used within the
text, but are usefully included for reference.
Readers of this newsletter will undoubtedly want
to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, “The computing is presented in S-Plus, but all the examples will also work
in the freeware program called R.” Unfortunately,
this claim is too strong. The book’s web site does
contain (as of March, 2003) three modifications for
using R with the book. Many more modifications
are needed. A rough estimate would be that 90%
of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In
other cases, a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used
(multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other
topics (bootstrapping, trees) will require more extensive code modification efforts and some functionality
is not available in R (varcomp, subplot).
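For instance (a sketch with data that ship with R, not an excerpt from the book's scripts), two of the substitutions mentioned above look roughly like this:

    # S-Plus: ks.gof(x, y)  ->  R:
    x <- rnorm(50); y <- rnorm(50, mean = 0.3)
    ks.test(x, y)

    # S-Plus: multicomp(fit)  ->  R (one possible replacement):
    fit <- aov(count ~ spray, data = InsectSprays)
    TukeyHSD(fit)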
Because of these differences and the minimal documentation for R within the text, users inexperienced
with an S language are likely to find the incompatibilities confusing, especially for self-study of the text.
Instructors wishing to use the text with R for teaching should not find it too difficult to help students
with the differences in the dialects of S. A complete
set of online complements for using R with SC would
be a useful addition to the book’s web site.
In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of
Modern Applied Statistics with S (Venables and Ripley, Springer, 2002), but at a more introductory level.
With some modification, the book can be used with
R. The book is likely to be especially useful as an
introduction to a wide variety of data analysis techniques.
Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com

Peter Dalgaard:
Introductory Statistics with R
Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html
This is the first book, other than the manuals that
are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The
author would be familiar to the subscribers of the R
mailing lists: Prof. Dalgaard is a member of R Core,
and has been providing people on the lists with insightful, elegant solutions to their problems. The
author’s R wizardry is demonstrated in the many
valuable R tips and tricks sprinkled throughout the
book, although care is taken to not use too-clever
constructs that are likely to leave novices bewildered.
As R was itself originally written as a teaching tool,
and many are using it as such (including the author),
I am sure its publication will be welcomed by many.
The targeted audience of the book is “nonstatistician
scientists in various fields and students of statistics”.
It is based upon a set of notes the author developed
for the course Basic Statistics for Health Researchers
at the University of Copenhagen. A couple of caveats
were stated in the Preface: Given the origin of the
book, the examples used in the book were mostly, if
not all, from health sciences. The author also made
it clear that the book is not intended to be used as a
stand-alone text. For teaching, another standard introductory text should be used as a supplement, as
many key concepts and definitions are only briefly
described. For scientists who have had some statistics, however, this book itself may be sufficient for
self-study.
The titles of the chapters are:
1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis
plus three appendices: Obtaining and installing R,
Data sets in the ISwR package (help files for data sets
in the package), and Compendium (a short summary of
the R language). Each chapter ends with a few exercises involving further analysis of example data.
When I first opened the book, I was a bit surprised to
see the last four chapters. I don’t think most people
would expect to see those topics covered in an introductory text. I do agree with the author that these topics
are quite essential for data analysis tasks in practical
research.
Chapter one covers the basics of the R language,
as much as needed in interactive use for most data
analysis tasks. People with some experience with
S-PLUS but new to R will soon appreciate many
nuggets of features that make R easier to use (e.g.,
some vectorized graphical parameters such as col,
pch, etc.; the functions subset, transform, among
others). As much programming as is typically
needed for interactive use (and maybe a little more;
e.g., the use of deparse(substitute(...)) as default
argument for the title of a plot) is also described.
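A sketch of that idiom (my own illustration, not an excerpt from the book):

    # the default title is the expression that was passed as 'x'
    myplot <- function(x, main = deparse(substitute(x)), ...) {
      plot(x, main = main, ...)
    }
    myplot(rnorm(100))   # the plot is titled "rnorm(100)"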
Chapter two covers the calculation of probability distributions and random number generation for
simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables,
etc.), as well as how to graphically display summary
plots such as Q-Q plots, empirical CDFs, boxplots,
dotcharts, bar plots, etc. Chapter four describes the t
and Wilcoxon tests for one- and two-sample comparison as well as paired data, and the F test for equality
of variances.
Chapter five covers simple linear regression and
bivariate correlations. Chapter six contains quite a
bit of material: from oneway ANOVA, pairwise comparisons,
Welch’s modified F test, Bartlett’s test for
equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic
inferences for categorical data (tests for proportions,
contingency tables, etc.). Chapter eight covers power
and sample size calculations for means and proportions in one- and two-sample problems.
The last four chapters provide a very nice, brief
summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color
coding the magnitudes of the diagnostic measures)
can be very useful. However, I would have preferred that the coverage of variable selection
in multiple linear regression had been omitted entirely. The coverage
of these topics is necessarily brief, but does provide
enough material for the reader to follow through the
examples. It is impressive to cover so many topics
so briefly, without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA, regression diagnostics,
among others.)
As with (almost?) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton’s method for calculating square roots,
any should have been all; on page 191, the logit
should be defined as log[ p/(1 − p)]). On the whole,
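In R, the corrected definition corresponds to (my own illustration):

    p <- 0.25
    log(p / (1 - p))           # the logit, log[p/(1 - p)]
    qlogis(p)                  # built-in equivalent
    plogis(log(p / (1 - p)))   # inverse logit recovers p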
however, this is a very well written book, with logical ordering of the content and a nice flow. Ample
code examples are interspersed throughout the text,
making it easy for the reader to follow along on a
computer. I whole-heartedly recommend this book
to people in the “intended audience” population.
Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com

John Fox: An R and S-Plus Companion to Applied Regression
Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/
companion/
The aim of this book is to enable people with
knowledge of linear regression modelling to analyze
their data using R or S-Plus. Any textbook on linear
regression may provide the necessary background;
of course, both language and contents of this companion are closely aligned with the author’s own Applied Regression, Linear Models and Related Methods (Fox,
Sage Publications, 1997).
Chapters 1–2 give a basic introduction to R and
S-Plus, including a thorough treatment of data import/export and handling of missing values. Chapter 3 complements the introduction with exploratory
data analysis and Box-Cox data transformations. The
target audience of the first three chapters is clearly
new users who might never have used R or S-Plus
before.
Chapters 4–6 form the main part of the book and
show how to apply linear models, generalized linear
models and regression diagnostics, respectively. This
includes detailed explanations of model formulae,
dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using “Type II” tests as implemented by the Anova() function in package car.
The infamous “Type III” tests, which are requested
frequently on the r-help mailing list, are provided
by the same function, but their usage is discouraged
quite strongly in both book and help page.
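A minimal sketch of that usage (assuming the car package and its Prestige data; not an excerpt from the book):

    library(car)
    fit <- lm(prestige ~ income + type, data = Prestige)
    Anova(fit)                 # "Type II" tests, the default
    Anova(fit, type = "III")   # possible, but discouraged in book and help page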
Chapter 5 reads as a natural extension of linear
models and concentrates on error distributions provided by the exponential family and corresponding
link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics,
identify() and text-based menus for plots emphasizes the interactive and exploratory side of regression
modelling. Other sections deal with Box-Cox transformations, detection of non-linearities and variable
selection.
Finally, Chapters 7 & 8 extend the introductory
Chapters 1–3 with material on drawing graphics and
S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(),
both chapters are rather general and not “regression-only”.
Several real world data sets are used throughout
the book. All S code necessary to reproduce numerical and graphical results is shown and explained in
detail. Package car, which contains all data sets used
in the book and functions for ANOVA and regression
diagnostics, is available from CRAN and the book’s
homepage.
The style of presentation is rather informal and
hands-on; software usage is demonstrated using examples, without lengthy discussions of theory. Over
wide parts the text can be directly used in class to
demonstrate regression modelling with S to students
(after more theoretical lectures on the topic). However, the book contains enough “reminders” about
the theory, such that it is not necessary to switch to a
more theoretical text while reading it.
The book is self-contained with respect to S and
should easily get new users started. The placement
of R before S-Plus in the title is not only lexicographic:
the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly
marked in framed boxes; the graphical user interface
of S-Plus is not covered at all.
The only drawback of the book follows directly
from the target audience of possible S novices: More
experienced S users will probably be disappointed
by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal
directly with linear models; the major part is an introduction to S. The online appendix at the book’s
homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping,
time series, nonlinear or robust regression.
In summary, I highly recommend the book to
anyone who wants to learn or teach applied regression analysis with S.
Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org

Changes in R 1.7.0
by the R Core Team

User-visible changes
• solve(), chol(), eigen() and svd() now
use LAPACK routines unless a new back-compatibility option is turned on. The signs
and normalization of eigen/singular vectors
may change from earlier versions.
• The ‘methods’, ‘modreg’, ‘mva’, ‘nls’ and
‘ts’ packages are now attached by default at
startup (in addition to ‘ctest’). The option
"defaultPackages" has been added which
contains the initial list of packages. See
?Startup and ?options for details. Note that
.First() is no longer used by R itself.
class() now always (not just when ‘methods’ is attached) gives a non-null class, and
UseMethod() always dispatches on the class
that class() returns. This means that methods
like foo.matrix and foo.integer will be used.
Functions oldClass() and oldClass<-() get
and set the "class" attribute as R without ‘methods’ used to.
• The default random number generators have
been changed to ‘Mersenne-Twister’ and ‘Inversion’. A new RNGversion() function allows
you to restore the generators of an earlier R version if reproducibility is required.
• Namespaces can now be defined for packages
other than ‘base’: see ‘Writing R Extensions’.
This hides some internal objects and changes
the search path from objects in a namespace.
All the base packages (except methods and
tcltk) have namespaces, as well as the recommended packages ‘KernSmooth’, ‘MASS’,
‘boot’, ‘class’, ‘nnet’, ‘rpart’ and ‘spatial’. (A
minimal NAMESPACE file is sketched at the
end of this list.)
• Formulae are no longer automatically simplified when terms() is called, so the formulae in
results may still be in the original form rather
than the equivalent simplified form (which
may have reordered the terms): the results are
now much closer to those of S.
• The tables for plotmath, Hershey and Japanese
have been moved from the help pages
(example(plotmath) etc) to demo(plotmath)
etc.
• Errors and warnings are sent to stderr not stdout on command-line versions of R (Unix and
Windows).
• The R_X11 module is no longer loaded until it
is needed, so do test that x11() works in a new
Unix-alike R installation.
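As a rough illustration of the new name space support mentioned above (the package and object names are invented; the directives are those described in ‘Writing R Extensions’), a minimal NAMESPACE file could look like this:

    # NAMESPACE of a hypothetical package 'foo'
    export(fitFoo)            # only fitFoo is visible to users
    S3method(print, foo)      # register print.foo as an S3 method
    importFrom(mva, prcomp)   # use prcomp from package 'mva' without attaching it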

New features
• if() and while() give a warning if called with
a vector condition.
• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike
platforms (but not to other OSes because of
potential problems with line endings and OSspecific help files).
• The internal random number generators will
now never return values of 0 or 1 for runif().
This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library
is used.
• When assigning names to a vector, a value that
is too short is padded by character NAs. (Wishlist part of PR#2358)
• It is now recommended to use the ’SystemRequirements:’ field in the DESCRIPTION file for
specifying dependencies external to the R system.
• Output text connections no longer have a line-length limit.
• On platforms where vsnprintf does not return
the needed buffer size the output line-length
limit for fifo(), gzfile() and bzfile() has
been raised from 10k to 100k chars.
• The Math group generic does not check the
number of arguments supplied before dispatch: it used to if the default method had one
argument but not if it had two. This allows
trunc.POSIXt() to be called via the group
generic trunc().
• Logical matrix replacement indexing of data
frames is now implemented (interpreted as if
the lhs was a matrix).
• Recursive indexing of lists is allowed, so
x[[c(4,2)]] is shorthand for x[[4]][[2]] etc.
(Wishlist PR#1588)
• Most of the time series functions now check explicitly for a numeric time series, rather than
fail at a later stage.
• The postscript output makes use of relative
moves, and so is somewhat more compact.


• Add ‘difftime’ subscript method and methods for the group generics. (Thereby fixing
PR#2345)
• download.file() can now use HTTP proxies
which require ‘basic’ username/password authentication.
• dump() has a new argument ‘envir’. The search
for named objects now starts by default in the
environment from which dump() is called.
• The edit.matrix() and edit.data.frame()
editors can now handle logical data.
• New argument ‘local’ for example() (suggested by Andy Liaw).
• New function file.symlink() to create symbolic file links where supported by the OS.
• New generic function flush() with a method
to flush connections.
• New function force() to force evaluation of a
formal argument.

• %*% and crossprod() for complex arguments
make use of BLAS routines and so may be
much faster on some platforms.

• New functions getFromNamespace(),
fixInNamespace() and getS3method() to facilitate developing code in packages with
namespaces.

• arima() has coef(), logLik() (and hence AIC)
and vcov() methods.

• glm() now accepts ‘etastart’ and ‘mustart’ as
alternative ways to express starting values.

• New function as.difftime() for time-interval
data.

• New function gzcon() which wraps a connection and provides (de)compression compatible
with gzip.

• basename() and dirname() are now vectorized.
• biplot.default() (package ‘mva’) allows ‘xlab’ and
‘ylab’ parameters to be set (without partially
matching to ‘xlabs’ and ‘ylabs’). (Thanks to
Uwe Ligges.)
• New function capture.output() to send
printed output from an expression to a connection or a text string.
• ccf() (package ts) now coerces its x and y arguments to class "ts".
• chol() and chol2inv() now use LAPACK routines by default.
• as.dist(.) is now idempotent, i.e., works for
"dist" objects.
• Generic function confint() and ‘lm’ method
(formerly in package MASS, which has ‘glm’
and ‘nls’ methods).
• New function constrOptim() for optimisation
under linear inequality constraints.

• load() now uses gzcon(), so can read compressed saves from suitable connections.
• help.search() can now reliably match individual aliases and keywords, provided that all
packages searched were installed using R 1.7.0
or newer.
• hist.default() now returns the nominal
break points, not those adjusted for numerical
tolerances.
To guard against unthinking use, ‘include.lowest’ in hist.default() is now ignored, with a warning, unless ‘breaks’ is a vector. (It either generated an error or had no effect, depending how prettification of the range
operated.)
• New generic functions influence(),
hatvalues() and dfbeta() with lm and glm
methods; the previously normal functions
rstudent(), rstandard(), cooks.distance()
and dfbetas() became generic. These have
changed behavior for glm objects – all originating from John Fox’ car package.
• interaction.plot() has several new arguments, and the legend is not clipped anymore
by default. It internally uses axis(1,*) instead
of mtext(). This also addresses "bugs" PR#820,
PR#1305, PR#1899.
• New isoreg() function and class for isotonic
regression (‘modreg’ package).
• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK
error codes.
• legend() has a new ‘plot’ argument. Setting
it ‘FALSE’ gives size information without plotting (suggested by U.Ligges).
• library() was changed so that when the
methods package is attached it no longer complains about formal generic functions not specific to the library.
• list.files()/dir() have a new argument ‘recursive’.
• lm.influence() has a new ‘do.coef’ argument
allowing *not* to compute casewise changed
coefficients.
This makes plot.lm() much
quicker for large data sets.
• load() now returns invisibly a character vector of the names of the objects which were restored.
• New convenience function loadURL() to allow
loading data files from URLs (requested by
Frank Harrell).
• New function mapply(), a multivariate
lapply(). (A short example follows at the end
of this list.)

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g. on parts of
the R installation).
• medpolish() (package eda) now has an ‘na.rm’
argument (PR#2298).
• methods() now looks for registered methods
in namespaces, and knows about many objects
that look like methods but are not.
• mosaicplot() has a new default for ‘main’,
and supports the ‘las’ argument (contributed
by Uwe Ligges and Wolfram Fischer).
• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types
of connections if they are opened repeatedly.
• optim(method = "SANN") can now cover combinatorial optimization by supplying a move
function as the ‘gr’ argument (contributed by
Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of
R that produced them.
• On Unix(-alike) systems the default PDF
viewer is now determined during configuration, and available as the ’pdfviewer’ option.
• pie(...) has always accepted graphical pars
but only passed them on to title(). Now
pie(, cex=1.5) works.
• plot.dendrogram (‘mva’ package) now draws
leaf labels if present by default.
• New plot.design() function as in S.
• The postscript() and pdf() drivers now allow the title to be set.
• New function power.anova.test(), contributed by Claus Ekstrøm.
• power.t.test() now behaves correctly for
negative delta in the two-tailed case.
• power.t.test() and power.prop.test() now
have a ‘strict’ argument that includes rejections
in the "wrong tail" in the power calculation.
(Based in part on code suggested by Ulrich
Halekoh.)
• prcomp() is now fast for n × m inputs with m >> n.
• princomp() no longer allows the use of more
variables than units: use prcomp() instead.
• princomp.formula() now has principal argument ‘formula’, so update() can be used.
• Printing an object with attributes now dispatches on the class(es) of the attributes. See
?print.default for the fine print. (PR#2506)
• print.matrix() and prmatrix() are now separate functions. prmatrix() is the old Scompatible function, and print.matrix() is
a proper print method, currently identical to
print.default(). prmatrix() and the old
print.matrix() did not print attributes of a
matrix, but the new print.matrix() does.
• print.summary.lm() and print.summary.glm()
now default to symbolic.cor = FALSE,
but symbolic.cor can be passed to
the print methods from the summary
methods.
print.summary.lm() and
print.summary.glm() print correlations to
2 decimal places, and the symbolic printout
avoids abbreviating labels.
• If a prompt() method is called with ’filename’
as ’NA’, a list-style representation of the documentation shell generated is returned. New
function promptData() for documenting objects as data sets.

• New function Sys.getpid() to get the process
ID of the R session.

• qqnorm() and qqline() have an optional logical argument ‘datax’ to transpose the plot (SPLUS compatibility).

• The tempfile() function now takes an optional second argument giving the directory
name.

• qr() now has the option to use LAPACK routines, and the results can be used by the helper
routines qr.coef(), qr.qy() and qr.qty().
The LAPACK-using versions may be much
faster for large matrices (using an optimized
BLAS) but are less flexible.


• QR objects now have class "qr", and
solve.qr() is now just the method for solve()
for the class.
• New function r2dtable() for generating random samples of two-way tables with given
marginals using Patefield’s algorithm.
• rchisq() now has a non-centrality parameter
‘ncp’, and there’s a C API for rnchisq().
• New generic function reorder() with a dendrogram method; new order.dendrogram()
and heatmap().
• require() has a new argument named
character.only to make it align with library.
• New functions rmultinom() and dmultinom(),
the first one with a C API.
• New function runmed() for fast running medians (‘modreg’ package).
• New function slice.index() for identifying
indexes with respect to slices of an array.
• solve.default(a) now gives the dimnames
one would expect.
• stepfun() has a new ‘right’ argument for
right-continuous step function construction.
• str() now shows ordered factors different
from unordered ones. It also differentiates
"NA" and as.character(NA), also for factor
levels.
• symnum() has a new logical argument
‘abbr.colnames’.
• summary() now mentions NA’s as
suggested by Göran Broström.
• summaryRprof() now prints times with a precision appropriate to the sampling interval,
rather than always to 2dp.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The ordering of terms for
terms.formula(keep.order=FALSE)
is now defined on the help page and used consistently, so that repeated calls will not alter the
ordering (which is why delete.response()
was failing: see the bug fixes). The formula is
not simplified unless the new argument ‘simplify’ is true.
• added "[" method for terms objects.
• New argument ‘silent’ to try().
• ts() now allows arbitrary values for y in
start/end = c(x, y): it always allowed y <
1 but objected to y > frequency.
• unique.default() now works for POSIXct objects, and hence so does factor().
• Package tcltk now allows return values from
the R side to the Tcl side in callbacks and the
R_eval command. If the return value from the
R function or expression is of class "tclObj" then
it will be returned to Tcl.
• A new HIGHLY EXPERIMENTAL graphical
user interface using the tcltk package is provided. Currently, little more than a proof of
concept. It can be started by calling "R -g Tk"
(this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems
for now. It is not too stable at this point; in particular, signal handling is not working properly.
• Changes to support name spaces:
– Placing base in a name space
can no longer be disabled by
defining the environment variable
R_NO_BASE_NAMESPACE.
– New function topenv() to determine the
nearest top level environment (usually
.GlobalEnv or a name space environment).
– Added name space support for packages
that do not use methods.
• Formal classes and methods can be ‘sealed’,
by using the corresponding argument to
setClass or setMethod.
New functions
isSealedClass() and isSealedMethod() test
sealing.
• packages can now be loaded with version numbers. This allows for multiple versions of files
to be installed (and potentially loaded). Some
serious testing will be going on, but it should
have no effect unless specifically asked for.
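A short example of the new mapply() mentioned above (my own illustration, in the spirit of its help page):

    # apply rep() elementwise over two argument vectors
    mapply(rep, 1:4, 4:1)
    # equivalent to list(rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1))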

Installation changes
• TITLE files in packages are no longer used, the
Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.
• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu’s frt
and Compaq’s fort, but no longer for cf77 and
cft77.
• Configure checks that mixed C/Fortran code
can be run before checking compatibility on
ints and doubles: the latter test was sometimes
failing because the Fortran libraries were not
found.
• PCRE and bzip2 are built from versions in the R
sources if the appropriate library is not found.
• New configure option ‘--with-lapack’ to allow high-performance LAPACK libraries to be
used: a generic LAPACK library will be used if
found. This option is not the default.
• New configure options ‘--with-libpng’,
‘--with-jpeglib’, ‘--with-zlib’, ‘--with-bzlib’
and ‘--with-pcre’, principally to allow these
libraries to be avoided if they are unsuitable.
• If the precious variable R_BROWSER is set at
configure time it overrides the automatic selection of the default browser. It should be set to
the full path unless the browser appears at different locations on different client machines.
• Perl requirements are down again to 5.004 or
newer.
• Autoconf 2.57 or later is required to build the
configure script.

• Configure provides a more comprehensive
summary of its results.
• Index generation now happens when installing
source packages using R code in package tools.
An existing ’INDEX’ file is used as is; otherwise, it is automatically generated from the
\name and \title entries in the Rd files. Data,
demo and vignette indices are computed from
all available files of the respective kind, and
the corresponding index information (in the
Rd files, the ’demo/00Index’ file, and the
\VignetteIndexEntry{} entries, respectively).
These index files, as well as the package Rd
contents data base, are serialized as R objects in the ’Meta’ subdirectory of the top-level package directory, allowing for faster and
more reliable index-based computations (e.g.,
in help.search()).
• The Rd contents data base is now computed
when installing source packages using R code
in package tools. The information is represented as a data frame without collapsing the
aliases and keywords, and serialized as an R
object. (The ’CONTENTS’ file in Debian Control Format is still written, as it is used by the
HTML search engine.)
• A NAMESPACE file in the root directory of a
source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using
‘--save’ signals an error; this is a temporary
measure.

Deprecated & defunct
• The assignment operator ‘_’ will be removed in
the next release and users are now warned on
every usage: you may even see multiple warnings for each usage.
If the environment variable R_NO_UNDERLINE
is set to anything of positive length then use of
‘_’ becomes a syntax error.
• machine(), Machine() and Platform() are defunct.
• restart() is defunct. Use try(), as has long
been recommended.
• The deprecated arguments ‘pkg’ and ‘lib’ of
system.file() have been removed.
• printNoClass() (package methods) is deprecated (and
moved to base, since it was a copy of a base
function).
• Primitives dataClass() and objWithClass()
have been replaced by class() and class<-();
they were internal support functions for use by
package methods.
• The use of SIGUSR2 to quit a running R process
under Unix is deprecated, as the signal may need
to be reclaimed for other purposes.

Utilities
• R CMD check more compactly displays the
tests of DESCRIPTION meta-information. It
now reports demos and vignettes without
available index information. Unless installation tests are skipped, checking is aborted if
the package dependencies cannot be resolved
at run time. Rd files are now also explicitly
checked for empty \name and \title entries.
The examples are always run with T and F redefined to give an error if used instead of TRUE
and FALSE.
• The Perl code to build help now removes an
existing example file if there are no examples
in the current help file.
• R CMD Rdindex is now deprecated in favor of
function Rdindex() in package tools.
• Sweave() now encloses the Sinput and Soutput
environments of each chunk in an Schunk environment. This makes it possible to fix some vertical spacing problems when using the LaTeX class slides.

C-level facilities
• A full double-precision LAPACK shared library is made available as -lRlapack. To use
this include $(LAPACK_LIBS) $(BLAS_LIBS) in
PKG_LIBS.


• Header file R_ext/Lapack.h added. C
declarations of BLAS routines moved to
R_ext/BLAS.h and included in R_ext/Applic.h
and R_ext/Linpack.h for backward compatibility.
• R will automatically call initialization and
unload routines, if present, in shared
libraries/DLLs during dyn.load() and
dyn.unload() calls. The routines are named
R_init_<dll-name> and R_unload_<dll-name>,
respectively. See the Writing R Extensions
Manual for more information.
• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran()
and .External() are now accessed via the
registration mechanism (optionally) used by
packages. The ROUTINES file (in src/appl/)
and associated scripts to generate FFTab.h and
FFDecl.h are no longer used.
• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.
• Many conflicts between other headers
and R’s can be avoided by defining
STRICT_R_HEADERS and/or R_NO_REMAP
– see ‘Writing R Extensions’ for details.
• New entry point R_GetX11Image and formerly
undocumented ptr_R_GetX11Image are in new
header R_ext/GetX11Image. These are used by
package tkrplot.

Changes on CRAN
by Kurt Hornik and Friedrich Leisch

New contributed packages
Davies useful functions for the Davies quantile
function and the Generalized Lambda distribution. By Robin Hankin.
GRASS Interface between GRASS 5.0 geographical
information system and R, based on starting R
from within the GRASS environment using values of environment variables set in the GISRC
file. Interface examples should be run outside
GRASS, others may be run within. Wrapper
and helper functions are provided for a range
of R functions to match the interface metadata
structures. Interface functions by Roger Bivand, wrapper and helper functions modified
from various originals by interface author.
R News

MCMCpack This package contains functions for
posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda
mcmc objects that can then be summarized using coda functions or the coda menu interface.
The package also contains some useful utility
functions, including some additional PDFs and
pseudo-random number generators for statistical distributions. By Andrew D. Martin, and
Kevin M. Quinn.
RSvgDevice A graphics device for R that uses the
new w3.org xml standard for Scalable Vector
Graphics. By T Jake Luciani.
SenSrivastava Collection of datasets from Sen & Srivastava: Regression Analysis, Theory, Methods
and Applications, Springer. Sources for individual data files are more fully documented in
ISSN 1609-3631

Vol. 3/1, June 2003

the book. By Kjetil Halvorsen.
abind Combine multi-dimensional arrays. This is
a generalization of cbind and rbind. Takes
a sequence of vectors, matrices, or arrays
and produces a single array of the same or
higher dimension. By Tony Plate and Richard
Heiberger.
ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.
amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.
anm The package contains an analog model for statistical/empirical downscaling. By Alexandra
Imbert & Rasmus E. Benestad.
clim.pact The package contains R functions for retrieving data, making climate analysis and
downscaling of monthly mean and daily mean
global climate scenarios. By Rasmus E. Benestad.
dispmod Functions for modelling dispersion in
GLM. By Luca Scrucca.
effects Graphical and tabular effect displays, e.g., of
interactions, for linear and generalised linear
models. By John Fox.
gbm This package implements extensions to Freund and Schapire’s AdaBoost algorithm and
J. Friedman’s gradient boosting machine. Includes regression methods for least squares,
absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost
exponential loss. By Greg Ridgeway.
genetics Classes and methods for handling genetic
data.
Includes classes to represent genotypes and haplotypes at single markers up to
multiple markers on multiple chromosomes.
Functions include allele frequencies, flagging
homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, testing
Hardy-Weinberg equilibrium, . . . . By Gregory
Warnes and Friedrich Leisch.
glmmML A Maximum Likelihood approach to
mixed models. By Göran Broström.
gpclib General polygon clipping routines for R
based on Alan Murta’s C library. R interface
by Roger D. Peng; GPC library by Alan Murta.
grasper R-GRASP uses generalized regressions
analyses to automate the production of spatial
predictions. By A. Lehmann, J.R. Leathwick,
J.McC. Overton, ported to R by F. Fivaz.
gstat variogram modelling; simple, ordinary and
universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation;
variogram and map plotting utility functions.
By Edzer J. Pebesma and others.
hier.part Variance partition of a multivariate data
set. By Chris Walsh and Ralph MacNally.
hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir
(1996) and Weir and Wilson (1986). Contrast
terms are available that account for first order
interactions between loci. By J.H. Maindonald.
ismev Functions to support the computations carried out in ‘An Introduction to Statistical Modeling of Extreme Values’ by Stuart Coles. The
functions may be divided into the following groups; maxima/minima, order statistics,
peaks over thresholds and point processes.
Original S functions by Stuart Coles, R port and
R documentation files by Alec Stephenson.
locfit Local Regression, Likelihood and density estimation. By Catherine Loader.
mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.
mix Estimation/multiple imputation programs for
mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley.
multidim Multidimensional descriptive statistics:
factorial methods and classification. Original
by André Carlier & Alain Croquette, R port by
Mathieu Ros & Antoine Lucas.
normalp A collection of utilities referred to normal
of order p distributions (General Error Distributions). By Angelo M. Mineo.
pls.pcr Multivariate regression by PLS and PCR. By
Ron Wehrens.
polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass, and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.
rimage This package provides functions for image
processing, including sobel filter, rank filters,
fft, histogram equalization, and reading JPEG
files. This package requires fftw (http://www.fftw.org/)
and libjpeg (http://www.ijg.org).
By Tomomi Takashina.
session Utility functions for interacting with R processes from external programs. This package
includes functions to save and restore session
information (including loaded packages, and
attached data objects), as well as functions to
evaluate strings containing R commands and
return the printed results or an execution transcript. By Gregory R. Warnes.
snow Support for simple parallel computing in R.
By Luke Tierney, A. J. Rossini, and Na Li.
statmod Miscellaneous biostatistical modelling
functions. By Gordon Smyth.
survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally
weighted survey samples. By Thomas Lumley.
vcd Functions and data sets based on the book
“Visualizing Categorical Data” by Michael
Friendly. By David Meyer, Achim Zeileis,
Alexandros Karatzoglou, and Kurt Hornik.

New Omegahat packages
REventLoop An abstract event loop mechanism that
is toolkit independent and can be used to replace the R event loop. This allows one to use
the Gtk or Tcl/Tk’s event loop which gives better response and coverage of all event sources
(e.g. idle events, timers, etc.). By Duncan Temple Lang.
RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images.
This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By
Duncan Temple Lang.
RGtkExtra A collection of S functions that provide
an interface to the widgets in the gtk+extra library such as the GtkSheet data-grid display,


icon list, file list and directory tree. By Duncan
Temple Lang.
RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built
with the interactive Glade tool directly within
S. The callbacks can be specified as S functions
within the GUI builder. By Duncan Temple
Lang.
RGtkHTML A collection of S functions that provide an interface to creating and controlling an
HTML widget which can be used to display
HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S
code, including graphics devices, etc. By Duncan
Temple Lang.
RGtk Facilities in the S language for programming
graphical interfaces using Gtk, the Gnome GUI
toolkit. By Duncan Temple Lang.
SWinRegistry Provides access from within R to read
and write the Windows registry. By Duncan
Temple Lang.
SWinTypeLibs This provides ways to extract type
information from type libraries and/or DCOM
objects that describes the methods, properties,
etc. of an interface. This can be used to discover
available facilities in a COM object or automatically generate S and C code to interface to such
objects. By Duncan Temple Lang.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org

Crossword
by Barry Rowlingson
Following a conversation with 1 down at DSC 2003
post-conference dinner, I have constructed a crossword for R-news. Most of the clues are based on
matters relating to R or computing in general. The
clues are ‘cryptic’, in that they rely on word-play, but
most will contain a definition somewhere. A prize
of EUR50 is on offer. See http://www.maths.lancs.
ac.uk/~rowlings/Crossword for details.
Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk

[The crossword grid (cells numbered 1–27) is not reproducible in this text rendering; see the PDF version.]

Across
9 Possessing a file, or boat, we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation, then stir your café Breton
(10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)


Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (1,3,5)
7 Failed or applied ! (3,3)
8 International body member is ordered back before baker, but
can’t be shown (15)
15 I hear you consort with a skeleton, having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot, loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac

Recent Events
DSC 2003
Inspired by the success of its predecessors in 1999
and 2001, the Austrian Association for Statistical
Computing and the R Foundation for Statistical
Computing jointly organised the third workshop on
distributed statistical computing which took place
at the Technische Universität Vienna from March
20 to March 22, 2003. More than 150 participants
from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration,
graphs and graphical models, database connectivity
and applications. Compared to DSC 2001, a greater
fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as
their favourite computing engine. An online proceedings volume is currently under preparation.
Two events took place in Vienna just prior to DSC
2003. Future directions of R developments were discussed at a two-day open R-core meeting. Tutorials
on the Bioconductor project, R programming and R
graphics attracted a lot of interest the day before the
conference opening.
While the talks were given in the beautiful rooms
of the main building of the Technische Universität at
Karlsplatz in the city centre, the fruitful discussions
continued in Vienna’s beer pubs and at the conference dinner in one of the famous Heurigen restaurants.
Finally, one participant summarised this event
most to the point in an email sent to me right after
DSC 2003: ”At most statistics conferences, over 90%
of the stuff is not interesting to me, at this one, over
90% of it was interesting.”
Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg
Germany
Torsten.Hothorn@rzmail.uni-erlangen.de

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley.

Editor Programmer’s Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this
publication should be addressed to the editors. All
articles are copyrighted by the respective authors.
Please send submissions to regular columns to the
respective column editor, all other submissions to
the editor-in-chief or another member of the editorial board (more detailed submission instructions
can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/

ISSN 1609-3631


