The Newsletter of the R Project News Volume 3/2, October 2003 Editorial by Friedrich Leisch Welcome to a new issue of R News, focussing on graphics and user interfaces. Paul Murell’s grid graphics system offers a modern infrastructure for R graphics. As of R-1.8.0 it is part of the base distribution but, until now, grid graphics functions could not be mixed with traditional (base) plotting functions. In this newsletter Paul describes his gridBase package which allows mixing of grid and base graphics. Marc Schwartz, known as frequent poster of answers on the mailing list r-help, has been invited to contribute to the R Help Desk. He presents “An Introduction to Using R’s Base Graphics”. The Bioconductor column was written by Colin Smith and shows a web-based interface for microarray analysis. Angelo Mineo describes normalp, a new package for the general error distribution. In the second part of my mini-series on Sweave I demonstrate how users can interact with package vignettes and how package authors can add such documents to their packages. To close the circle, a version of Paul’s article is a vignette in package gridBase, so if you want to play with the code examples, the easiest way is to use R’s Contents of this issue: Editorial . . . . . . . . . . . . . . . . . . . . . . R Help Desk . . . . . . . . . . . . . . . . . . . . Integrating grid Graphics Output with Base Graphics Output . . . . . . . . . . A New Package for the General Error Distribution . . . . . . . . . . . . . . . . . . . . . . Web-based Microarray Analysis using Bioconductor . . . . . . . . . . . . . . . . . . . . . . 1 2 7 13 17 vignette tools. We hope to see vignettes used more frequently in the future, as they are a simple, effective way of delivering code examples to users. R 1.8.0 was released more than two weeks ago. The list of new features is long; see “Changes in R” for detailed release information. I will draw attention to a “minor” but rather emotional point that has sparked heated dicussions on the mailing lists in the past. Release 1.8.0 marks the end of the use of the underscore as an assignment operator in the R dialect of the S language. That is, x_1 is now a syntax error. A new column in R News lists the new members of the R Foundation for Statistical Computing. R has become a mature and valuable tool and we would like to ensure its continued development and the development of future innovations in software for statistical and computational research. We hope to attract sufficient funding to make these goals realities. Listing members in our newsletter is one (small) way of thanking them for their support. Friedrich Leisch Technische Universität Wien, Austria Friedrich.Leisch@R-project.org Sweave, Part II: Package Vignettes . . . . . . . R Foundation News . . . . . . . . . . . . . . . . Recent Events . . . . . . . . . . . . . . . . . . . Book Reviews . . . . . . . . . . . . . . . . . . . Changes in R 1.8.0 . . . . . . . . . . . . . . . . . Changes on CRAN . . . . . . . . . . . . . . . . Crossword Solution . . . . . . . . . . . . . . . . Correction to “Building Microsoft Windows Versions of R and R packages under Intel Linux” . . . . . . . . . . . . . . . . . . . . . . 21 25 26 28 29 35 38 39 Vol. 3/2, October 2003 2 R Help Desk An Introduction to Using R’s Base Graphics Marc Schwartz Preface As the use of R grows dramatically, an increasingly diverse base of users will begin their exploration of R’s programmatic approach to graphics. 
Some new users will start without prior experience generating statistical graphics using coded functions (ie. they may have used GUI based “point-and-click” or “drag-and-drop” graphic processes) and/or they may be overwhelmed by the vast array (pardon the pun) of graphic and plotting functions in R. This transition can not only present a steep learning curve, but can perhaps, by itself, become a barrier to using R entirely, which would be an unfortunate outcome. R has essentially two separate core plotting environments in the default (base plus ‘recommended package’) installation. The first is the extensive set of base graphic functions and the second is the combination of the grid (Murrell, 2002) and lattice packages (Sarkar, 2002), which together provide for extensive Trellis conditioning plots and related standardized functionality. For the purpose of this introduction, I shall focus exclusively on the former. The key advantages of a programmatic plotting approach are much finer control over the plotting process and, importantly, reproducibility. Days, weeks or even months later, you can return to reuse your same code with the same data to achieve the same output. Ultimately, productivity is also enhanced because, once created, a single plotting function can be called quickly, generating one or an entire series of graphics in a largely automated fashion. R has a large number of “high” and “low” level plotting functions that can be used, combined and extended for specific purposes. This extensibility enables R to meet a wide spectrum of needs, as demonstrated by the number of contributed packages on CRAN that include additional specialized plotting functionality. The breadth of base plotting functions is usually quite satisfactory for many applications. In conjunction with R’s innate ability to deal with data in vectorized structures and by using differing ‘methods’, one can further reduce the need for lengthy, repetitive and complex code. In many cases, entire data structures (ie. a linear model object) can be passed as a single argument to a single plotting function, creating a default plot or series of plots. Further, where default plot settings are perhaps inappropriate for a given task, these can be adjusted to your liking and/or disabled. The base R News graphic can be enhanced by using various lower level plotting functions to add data points, lines, curves, shapes, titles, legends and text annotations. Formatted complex mathematical formulae (Murrell and Ihaka, 2000; Ligges, 2002) can also be included where required. If a graphics ‘device’ is not explicitly opened by the user, R’s high level plotting functions will open the default device (see ?Devices) specified by options("device"). In an interactive session, this is typically the screen. However, one can also open an alternative device such as a bitmap (ie. PNG/JPEG) or a PostScript/PDF file for publishing and/or presentation. I will focus on using the screen here, since the particulars concerning other devices can be platform specific. Note that if you intend to create plots for output to something other than the screen, then you must explicitly open the intended device. Differences between the screen and the alternate device can be quite significant in terms of the resultant plot output. 
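For instance, writing a plot to a file rather than to the screen means opening the device, issuing the plotting commands, and then closing the device so the file is completed. The following is only a minimal sketch (the file names and plot content are illustrative and not taken from this article):

pdf("scatter.pdf", width = 6, height = 6)   # or png("scatter.png") for a bitmap version
plot(rnorm(50), rnorm(50), main = "Written directly to a file device")
dev.off()   # close the device; the file is not complete until this is called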
For example, you can spend a lot of time creating the screen version of a plot, only to find out it looks quite different in a PostScript file, Various parameters of the figure and plot regions within a device can be set in advance by the use of the par() function before calling the initial plot function. Others can be set as named arguments to the plot functions. Options set by par() affect all graphics; options set in a graphics call affect only that call. (See ?par and ?plot.default for some additional details). It is possible to divide the overall graphic device into a row/column grid of figures and create individual plots within each grid section (ie. a matrix of scatterplots like a pairs() plot) or create a graphic that contains different plot types (ie. a scatterplot with boxplots placed on the x and y axes). For more information, see ?layout, ?split.screen and graphic parameters ‘mfcol’ and ‘mfrow’ in ?par. For additional details regarding graphic devices, parameters and other considerations, please review “Graphical Procedures” (Ch. 12) in “An Introduction to R” (Venables, Smith and R Core, 2003) and “Graphics” (Ch. 4) in “Modern Applied Statistics with S” (Venables and Ripley, 2002). Let’s Get Plotting In this limited space, it is not possible to cover all the combinations and permutations possible with R’s base graphics functionality (which could be a thick book in its own right). Thus, I will put forth a finite set of practical examples that cover a modest range of base plots and enhancements. For each plot, ISSN 1609-3631 Vol. 3/2, October 2003 3 we will create some simple data to work with, create a basic plot using a standard function to demonstrate default behavior and then enhance the base plot with additional detail. The included graphic for each will show the final result. I recommend that you consult the R help system for each function (using ?FunctionName) to better understand the syntax of each function call and how each argument impacts the resultant output. Scatterplot with a regression line and confidence / prediction intervals Linear Regression Plot Fitted Line Confidence Bands Prediction Bands 4 ● ● ● ● ● ● ● ● 2 ● ● ● ● ● y vals ● ● ● ● ● ● −2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● 0 ● ● ● ● ● ● ● ● ● ● ● ● ● −4 −2 −1 0 1 2 x vals The plot() function is a generic graphing function that can accept of variety of data structures through specific defined ‘methods’. Frequently, these arguments are numeric vectors representing the two-dimensional (x,y) coordinate pairs of points and/or lines to display. If you want to get a feel for the breadth of plotting methods available use methods(plot). In the next example we first create a series of simple plots (not shown) then create the more complex scatterplot shown above. To do this we create an x-y scatterplot using type = "n" so that the axis ranges are established, but nothing is plotted initially. We then add the data points, the axes, a fitted regression line, and confidence and prediction intervals for the regression model: # Create our data set.seed(1) x <- runif(50, -2, 2) set.seed(2) y <- x + rnorm(50) # Create the model object mod <- lm(y ~ x) # Plot the data and add a regression line # using default plot() behavior plot(x, y) abline(mod) # Plot the model object, going through a R News # sequence of diagnostic plots. See ?plot.lm plot(mod) # Create prediction values and confidence limits # using a new dataframe of x values, noting the # colnames need to match your model term names. 
newData <- data.frame(x = seq(min(x), max(x), by = (max(x) - min(x)) / 49)) pred.lim <- predict(mod, newdata = newData, interval = "prediction") conf.lim <- predict(mod, newdata = newData, interval = "confidence") # Function to color plot region color.pr <- function(color = "white") { usr <- par("usr") if (par("xlog")) usr[1:2] <- 10 ^ usr[1:2] if (par("ylog")) usr[3:4] <- 10 ^ usr[3:4] rect(usr[1], usr[3], usr[2], usr[4], col = color) } # Color the plot background par(bg = "blue") # Define margins to enable space for labels par(mar = c(5, 6, 5, 3) + 0.1) # Create the plot. Do not plot the data points # and axes to allow us to define them our way plot(x, y, xlab = "x vals", ylab = "y vals", type = "n", col.lab = "yellow", font.lab = 2, cex.lab = 1.5, axes = FALSE, cex.main = 2, main = "Linear Regression Plot", col.main = "yellow", xlim = c(-2.1, 2.1), ylim = range(y, pred.lim, na.rm = TRUE)) # Color the plot region white color.pr("white") # Plot the data points points(x, y, pch = 21, bg = "yellow", cex=1.25) # Draw the fitted regression line and the # prediction and confidence intervals matlines(newData$x, pred.lim, lty = c(1, 4, 4), lwd = 2, col = c("black", "red", "red")) matlines(newData$x, conf.lim, lty = c(1, 3, 3), lwd = 2, col = c("black", "green4", "green4")) # Draw the X and Y axes, repectively axis(1, at = -2:2, col = "white", col.axis = "white", lwd = 2) axis(2, at = pretty(range(y), 3), las = 1, col = "white", col.axis = "white", lwd = 2) # Draw the legend legend(-2, max(pred.lim, na.rm = TRUE), legend = c("Fitted Line", "Confidence Bands", "Prediction Bands"), lty = c(1, 3, 4), lwd = 2, col = c("black", "green4", "red"), horiz = FALSE, cex = 0.9, bg = "gray95") # Put a box around the plot box(lwd = 2) ISSN 1609-3631 Vol. 3/2, October 2003 4 Barplot with confidence intervals and additional annotation Incidence of Event By Group barplot(table(Group)) barplot(table(Group), horiz = TRUE) barplot(table(Event, Group)) barplot(table(Event, Group), beside = TRUE) Total N = 819 12.0% Benchmark Value: 4.5% % Incidence (+/−95% CI) 10.0% 8.0% 6.0% 4.8% 4.0% 2.8% 2.7% 2.0% 0.0% A B C 126 p = 0.8285 409 p = 0.0931 284 p = 0.1977 barplot() can draw essentially three types of plots with either vertical or horizontal bars (using the argument horiz = TRUE / FALSE). The first is a series of individual bars where the height argument (which defines the bar values) is a simple vector. The second is a series of stacked multi-segment bars where height is a matrix and beside = FALSE. The third is a series of grouped bars where height is a matrix and beside = TRUE. In the second and third cases, each column of the matrix height represents either the values of the bar segments in each stacked bar, or the values of the individual bars in each bar group, respectively. barplot() returns either a vector or a matrix (when beside = TRUE) of bar midpoints that can be assigned to a variable (ie. mp <- barplot(...)). You can use this information to locate bar midpoints for text and/or line placement. To locate the midpoint of bar groups, use colMeans(mp) to enable the placement of a bar group label. Here we will create a vertical barplot, with each of the three bars representing a proportion. We will add binomial confidence intervals and p values from binom.test() using a ‘benchmark’ value that will be plotted. We will label the y axis with percentages (prop * 100), add bar values above the top of each bar and put sample sizes centered below each bar under the x axis. 
# Create our data A <- data.frame(Event = c(rep("Yes", 6), rep("No", 120)), Group = "A") B <- data.frame(Event = c(rep("Yes", 11), rep("No", 398)), Group = "B") C <- data.frame(Event = c(rep("Yes", 8), rep("No", 276)), Group = "C") BarData <- rbind(A, B, C) attach(BarData) # Create initial ’default’ barplots R News # Let’s get our summary data from the dataframe table.data <- table(Event, Group) # Get sample sizes n <- as.vector(colSums(table.data)) # Get number of "Yes" events events <- as.vector(table.data["Yes", ]) # Proportion of "Yes" events prop.events <- events / n # Group names from table dimnames Group.Names <- dimnames(table.data)$Group # Define our benchmark value benchmark <- 0.045 # Get binomial confidence limits and p values stats <- mapply(binom.test, x = events, n = n, p = benchmark) # ci[, 1] = lower and ci[, 2] = upper ci <- matrix(unlist(stats["conf.int", ]), ncol = 2, byrow = TRUE) p.val <- unlist(stats["p.value", ]) # Define Y axis range to include CI’s and # space for a legend in the upper LH corner YMax <- max(ci[, 2]) * 1.25 # Define margins to enable space for labels par(mar = c(5, 6, 5, 3) + 0.1) # Do the barplot, saving bar midpoints in MidPts MidPts <- barplot(prop.events, space = 1, axes = FALSE,axisnames = FALSE, ylim = c(0, YMax)) # Define formatted Y axis labels using # axTicks() and draw the Y Axis and label YLabels <- paste(formatC(axTicks(2) * 100, format = "f", digits = 1), "%", sep = "") YAxisLab <- "% Incidence (+/-95% CI)" axis(2, labels = YLabels, at = axTicks(2), las = 1) mtext(YAxisLab, side = 2, adj = 0.5, line = 4.5, cex = 1.1, font = 2) # Draw the X axis using Group Names at bar # midpoints axis(1, labels = Group.Names, at = MidPts, font = 2, cex.axis = 1.25) # Draw Sample Sizes and p Values below Group # Names mtext(n, side = 1, line = 2, at = MidPts, cex = 0.9) p.val.text <- paste("p = ", formatC(p.val, format = "f", digits = 4), sep = "") mtext(p.val.text, side = 1, line = 3, at = MidPts, cex = 0.9) # Place formatted bar values above the left edge # of each bar so that CI lines do not go through # numbers. Left edge = MidPts - (’width’ / 2) bar.vals <- paste(formatC( prop.events * 100, format = "f", digits=1), "%", sep = "") text(MidPts - 0.5, prop.events, cex = 0.9, labels = bar.vals, adj = c(0, -0.5), font=1) ISSN 1609-3631 Vol. 3/2, October 2003 5 # Draw confidence intervals, first drawing # vertical line segments and then upper and # lower horizontal boundary segments segments(MidPts, ci[, 1], MidPts, ci[, 2], lty = "solid", lwd = 2) segments(MidPts - 0.25, ci[, 1], MidPts + 0.25, ci[, 1], lty = "solid", lwd=2) segments(MidPts - 0.25, ci[, 2], MidPts + 0.25, ci[, 2], lty = "solid", lwd=2) # Plot benchmark line abline(h = benchmark, lty = "dotdash", lwd = 2, col = "blue") # Draw legend legend(1, YMax * 0.95, lty = "dotdash", legend = "Benchmark Value: 4.5%", lwd = 2, col = "blue", horiz = FALSE, cex = 0.9, bg = "gray95") # Draw title and sub-title mtext("Incidence of Event By Group", side = 3, line = 3, cex = 1.5, font = 2) mtext(paste("Total N = ", sum(n), sep = ""), side = 3, line = 1, cex = 1, font = 2) # Put box around plot box() detach(BarData) Paired Boxplots with outliers colored and median / mean values labeled # # # # # # # # # Distribution of ’Measure’ by ’Group’ 14 # # ● ● ● ● ● Mean Median ● ● ● ● ● ● 12 ● ● 10 # # ● ● ● ● ● 8 ● 7.6 7.5 6 5.1 5.1 5.0 4.6 4 # 3.2 2.7 ● 2 # 0 # A1 B1 A2 B2 135 175 250 500 J.W. 
Tukey’s Box-Whisker plots (Tukey, 1977) are a quick and easy way to visually review and compare the distributions of continuous variables. For some descriptive information on the structure and interpretation of these plots including additional references, see ?boxplot.stats. Here we will generate continuous measures in four groups. We will generate default plots and then enhance the layout of the plot to visually group the data and to annotate it with key labels. # Create our data set.seed(1) A1 <- data.frame(Group = "A1", R News # # # # # # Measure = rnorm(135, 5)) set.seed(2) A2 <- data.frame(Group = "A2", Measure = rgamma(250, 3)) set.seed(3) B1 <- data.frame(Group = "B1", Measure = rnorm(175, 7.5)) set.seed(4) B2 <- data.frame(Group = "B2", Measure = rgamma(500, 5)) BPData <- rbind(A1, A2, B1, B2) attach(BPData) Create default boxplots boxplot(Measure) boxplot(Measure, horizontal = TRUE) boxplot(Measure ~ Group) Adjust Group factor levels to put A1 / B1 and A2 / B2 pairs together Group <- factor(Group, levels = c("A1", "B1", "A2", "B2")) Show default boxplot with re-grouping boxplot(Measure ~ Group) Define that boxplot midpoints to separate the pairs of plots at <- c(1.25, 1.75, 3.25, 3.75) Draw boxplot, returning boxplot stats in S which will contain summary data for each Group. See ?boxplot.stats S <- boxplot(Measure ~ Group, boxwex = 0.25, col = c("orange", "yellow"), notch = TRUE, at = at, axes = FALSE) Draw thicker green lines for median values When notch = TRUE, median width = boxwex / 2 segments(at - 0.0625, S$stats[3, ], at + 0.0625, S$stats[3, ], lwd = 2, col = "darkgreen") Get Group means and plot them using a diamond plot symbol means <- by(Measure, Group, mean) points(at, means, pch = 23, cex = 0.75, bg = "red") Color outlier values using x,y positions from S points(at[S$group], S$out, pch = 21, bg="blue") Draw Y axis, rotating labels to horiz axis(2, las = 1) Draw X Axis Group Labels axis(1, at = at, labels = S$names, cex.axis = 1.5, font.axis = 2) mtext(S$n, side = 1, at = at, line = 3) Draw Mean values to the left edge of each boxplot text(at - 0.125, means, labels = formatC( means, format = "f", digits = 1), pos = 2, cex = 0.9, col = "red") Draw Median values to the right edge of each boxplot text(at + 0.125, S$stats[3, ], labels = formatC(S$stats[3, ], format = "f", digits = 1), pos = 4, cex = 0.9, col = "darkgreen") Draw a box around plot box() Add title and legend title("Distribution of ’Measure’ by ’Group’", ISSN 1609-3631 Vol. 3/2, October 2003 cex.main = 1.5) legend(0.5, max(Measure), legend = c("Mean", "Median"), fill = c("red", "darkgreen")) detach(BPData) Additional Resources For additional information on using R’s plotting functionality, see: Venables, Smith and R Core (2003); Venables and Ripley (2002); Fox (2002); Dalgaard (2002). In addition, Uwe Ligges’ recent R News article (Ligges, 2003) provides excellent insights into how best to utilize R’s documentation and help resources. If you are in need of expert guidance on creating analytic graphics, such as the pros and cons of using particular graphic formats and their impact on the interpretation of your data, two critically important references are “Visualizing Data” (Cleveland, 1993) and “The Elements of Graphing Data” (Cleveland, 1994). Bibliography Cleveland, W. S. (1993): Visualizing Data. Summit, NJ: Hobart Press. 6 Cleveland, W. S. (1994): The Elements of Graphing Data. Summit, NJ: Hobart Press, revised edition. 6 Dalgaard, P. (2002): Introductory Statistics with R. New York: Springer-Verlag. 
6 Fox, J. (2002): An R and S-PLUS Companion to Applied Regression. Thousand Oaks: Sage. 6 R News 6 Ligges, U. (2002): R Help Desk – Automation of Mathematical Annotation in Plots. R News, 2 (3), 32–34. ISSN 1609-3631. URL http://CRAN. R-project.org/doc/Rnews/. 2 Ligges, U. (2003): R Help Desk – Getting Help – R’s Help Facilities and Manuals. R News, 3 (1), 26–28. ISSN 1609-3631. URL http://CRAN.R-project. org/doc/Rnews/. 6 Murrell, P. (2002): The grid Graphics Package. R News, 2 (2), 14–19. ISSN 1609-3631. URL http: //CRAN.R-project.org/doc/Rnews/. 2 Murrell, P. and Ihaka, R. (2000): An Approach to Providing Mathematical Annotation in Plots. Journal of Computational and Graphical Statistics, 9 (3), 582– 599. 2 Sarkar, D. (2002): Lattice: An Implementation of Trellis Graphics in R. R News, 2 (2), 19–23. ISSN 1609-3631. URL http://CRAN.R-project.org/ doc/Rnews/. 2 Tukey, J. (1977): Exploratory Data Analysis. Reading, MA: Addison-Wesley. 5 Venables, W. N. and Ripley, B. D. (2002): Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition. 2, 6 Venables, W. N., Smith, D. M. and the R Development Core Team (2003): An Introduction to R. URL http://CRAN.R-project.org/doc/ manuals.html. 2, 6 Marc Schwartz MedAnalytics, Inc., Minneapolis, Minnesota, USA MSchwartz@MedAnalytics.com ISSN 1609-3631 Vol. 3/2, October 2003 7 Integrating grid Graphics Output with Base Graphics Output 10 by Paul Murrell This document describes the gridBase package which provides some support for combining grid and base graphics output. Annotating base graphics using grid The gridBase package provides the baseViewports() function, which supports adding grid output to a base graphics plot. This function creates a set of grid viewports that correspond to the current base plot. These allow simple operations such as adding lines and text using grid’s units to locate them relative to a wide variety of coordinate systems, or something more complex involving pushing further grid viewports. baseViewports() returns a list of three grid viewports. The first corresponds to the base “inner” region. This viewport is relative to the entire device; it only makes sense to push this viewport from the “top level” (i.e., only when no other viewports have been pushed). The second viewport corresponds to the base “figure” region and is relative to the inner region; it only makes sense to push it after the “inner” viewport has been pushed. The third viewport corresponds to the base “plot” region and is relative to the figure region; it only makes sense to push it after the other two viewports have been pushed in the correct order. 4 2 ten ht nin e en eig se v e r six fiv ee o fou thr tw on e 0 The grid graphics package (Murrell, 2002) is much more powerful than the standard R graphics system (hereafter “base graphics”) when it comes to combining and arranging graphical elements. It is possible to create a greater variety of graphs more easily with grid (see, for example, Deepayan Sarkar’s lattice package (Sarkar, 2002)). However, there are very many plots based on base graphics (e.g., biplots), that have not been implemented in grid, and the task of reimplementing these in grid is extremely daunting. It would be nice to be able to combine the readymade base plots with the sophisticated arrangement features of grid. 6 8 Introduction Figure 1: Annotating a base plot with grid.text(). A simple application of this facility involves adding text to the margins of a base plot at an arbitrary orientation. 
The base function mtext() allows text to be located in terms of a number of lines away from the plot region, but only at rotations of 0 or 90 degrees. The base text() function allows arbitrary rotations, but only locates text relative to the user coordinate system in effect in the plot region (which is inconvenient for locating text in the margins of the plot). By contrast, the grid function grid.text() allows arbitrary rotations and can be used in any grid viewport. In the following code we first create a base plot, leaving off the tick labels. > midpts <- barplot(1:10, axes = FALSE) > axis(2) > axis(1, at = midpts, labels = FALSE) Next we use baseViewports() to create grid viewports that correspond to the base plot and we push those viewports1 . > vps <- baseViewports() > par(new = TRUE) > push.viewport(vps$inner, vps$figure, + vps$plot) Finally, we draw rotated labels using grid.text() (and pop the viewports to clean up after ourselves). The final plot is shown in Figure 1. 1 The par(new=TRUE) is necessary currently because the first grid action will try to move to a new page; it should be possible to remove this step in future versions of R. R News ISSN 1609-3631 Vol. 3/2, October 2003 Now we read in the data and generate a base plot, but plot no points. > chinasea <- read.table("chinasea.txt", + header = TRUE) > plot(chinasea$lat, chinasea$long, + type = "n", xlab = "latitude", + ylab = "longitude", + main = "China Sea ...") Now we use baseViewports() to align a grid viewport with the plot region, and draw the symbols by speed <- 0.8 * chinasea$speed/14 + 0.2 temp <- chinasea$temp/40 vps <- baseViewports() par(new = TRUE) push.viewport(vps$inner, vps$figure, vps$plot) for (i in 1:25) { push.viewport(viewport( x = unit(chinasea$lat[i], "native"), y = unit(chinasea$long[i], "native"), angle = chinasea$dir[i])) novelsym(speed[i], temp[i]) pop.viewport() } pop.viewport(3) 122.0 China Sea Wind Speed/Direction and Temperature ● ● ● ● ●● ● ● ● 121.0 121.5 ● ● ● ● ● ● ● ● ● 120.5 ● ● ● ● ● 120.0 > novelsym <- function(speed, + temp, width = unit(3, "mm"), + length = unit(0.5, "inches")) { + grid.rect(height = length, + y = 0.5, just = "top", + width = width, + gp = gpar(fill = "white")) + grid.rect(height = temp * + length, y = unit(0.5, + "npc") - length, width = width, + just = "bottom", + gp = gpar(fill = "grey")) + grid.arrows(x = 0.5, + y = unit.c(unit(0.5, "npc"), + unit(0.5, "npc") + + speed * length), + length = unit(3, "mm"), + type = "closed", + gp = gpar(fill = "black")) + grid.points(unit(0.5, "npc"), + unit(0.5, "npc"), size = unit(2, + "mm"), pch = 16) + } > + > > > > + > + + + + + + + + + > ● 119.5 The next example is a bit more complicated because it involves embedding grid viewports within a base graphics plot. The dataset is a snapshot of wind speed, wind direction, and temperature at several weather stations in the South China Sea, south west of Japan2 . grid is used to produce novel plotting symbols for a standard base plot. First of all, we need to define the novel plotting symbol. This consists of a dot at the data location, with a thermometer extending “below” and an arrow extending “above”. The thermometer is used to encode temperature and the arrow is used to indicate wind speed (both scaled to [0, 1]). creating a grid viewport per ( x, y) location (we rotate the viewport to represent the wind direction). The final plot is shown in Figure 2. 
longitude > grid.text(c("one", "two", "three", + "four", "five", "six", "seven", + "eight", "nine", "ten"), + x = unit(midpts, "native"), + y = unit(-1, "lines"), just = "right", + rot = 60) > pop.viewport(3) 8 ● 22 23 24 25 latitude Figure 2: Using grid to draw novel symbols on a base plot. Embedding base graphics plots in grid viewports gridBase provides several functions for adding base graphics output to grid output. There are three functions that allow base plotting regions to be aligned with the current grid viewport; this makes it possible to draw one or more base graphics plots within a grid viewport. The fourth function provides a set of 2 Obtained from the CODIAC web site: http://www.joss.ucar.edu/codiac/codiac-www.html. The file chinasea.txt is in the gridBase/doc directory. R News ISSN 1609-3631 Vol. 3/2, October 2003 graphical parameter settings so that base par() settings can be made to correspond to some of3 the current grid graphical parameter settings. The first three functions are gridOMI(), gridFIG(), and gridPLT(). They return the appropriate par() values for setting the base “inner”, “figure”, and “plot” regions, respectively. The main usefulness of these functions is to allow you to create a complex layout using grid and then draw a base plot within relevant elements of that layout. The following example uses this idea to create a lattice plot where the panels contain dendrograms drawn using base graphics functions4 . First of all, we create a dendrogram and cut it into four subtrees5 . > > > + > > library(mva) data(USArrests) hc <- hclust(dist(USArrests), "ave") dend1 <- as.dendrogram(hc) dend2 <- cut(dend1, h = 70) 9 + + + + + } ps = 10) plot(dend2$lower[[subscripts]], axes = FALSE) pop.viewport() Finally, we draw a lattice xyplot, using lattice to set up the arrangement of panels and strips and our panel function to draw a base dendrogram in each panel. The final plot is shown in Figure 3. > library(lattice) > xyplot(y ~ x | height, subscripts = TRUE, + xlab = "", ylab = "", + strip = function(...) { + strip.default(style = 4, + ...) + }, scales = list(draw = FALSE), + panel = dendpanel) 39 44 45 55 39 44 45 55 > space <- max(unit(rep(1, 50), + "strwidth", + as.list(rownames(USArrests)))) > dendpanel <- function(x, y, + subscripts, ...) { + push.viewport(viewport(y = space, + width = 0.9, height = unit(0.9, + "npc") - space, + just = "bottom")) + grid.rect(gp = gpar(col = "grey", + lwd = 5)) + par(plt = gridPLT(), new = TRUE, Idaho Nebraska Kentucky Montana Ohio Utah Indiana Kansas Connecticut Pennsylvania Hawaii West Virginia Maine South Dakota North Dakota Vermont Minnesota Wisconsin Iowa New Hampshire 39 45 55 44 45 55 California Maryland Arizona New Mexico Delaware Alabama Louisiana Illinois New York Michigan Nevada Alaska Mississippi South Carolina 44 North Carolina Next we define a lattice panel function to draw the dendrograms. The first thing this panel function does is push a viewport that is smaller than the viewport lattice creates for the panel; the purpose is to ensure there is enough room for the labels on the dendrogram. The space variable contains a measure of the length of the longest label. The panel function then calls gridPLT() and makes the base plot region correspond to the viewport we have just pushed. Finally, we call the base plot() function to draw the dendrogram (and pop the viewport we pushed)6 . 
39 Florida > x <- 1:4 > y <- 1:4 > height <- factor(round(unlist( + lapply(dend2$lower, + attr, "height")))) Washington Oregon Wyoming Oklahoma Virginia Rhode Island Massachusetts New Jersey Missouri Arkansas Tennessee Georgia Colorado Texas Now we create some dummy variables which correspond to the four subtrees. Figure 3: Adding base dendrograms to a lattice plot. The gridPLT() function is useful for embedding just the plot region of a base graphics function (i.e., without labels and axes; another example of this usage is given in the next section). If labelling and axes are to be included it will make more sense to use gridFIG(). The gridOMI() function has pretty much the same effect as gridFIG() except that it allows for the possibility of embedding multiple base plots at once. In the following code, a lattice plot is placed alongside base diagnostic plots arranged in a 2-by-2 array. We use the data from page 93 of “An Introduction to Generalized Linear Models” (Annette Dobson, 1990). 3 Only lwd, lty, col are available yet. More should be available in future versions. that lattice is built on grid so the panel region in a lattice plot is a grid viewport. 5 the data and cluster analysis are copied from the example in help(plot.dendrogram). 6 The grid.rect() call is just to show the extent of the extra viewport we pushed. 4 Recall R News ISSN 1609-3631 Vol. 3/2, October 2003 10 I am looking at a way of plotting a series of pie charts at specified locations on an existing plot. The size of the pie chart would be proportion to the magnitude of the total value of each vector (x) and the values in x are displayed as the areas of pie slices. > counts <- c(18, 17, 15, 20, + 10, 20, 25, 13, 12) > outcome <- gl(3, 1, 9) > treatment <- gl(3, 3) We create two regions using grid viewports; the left region is for the lattice plot and the right region is for the diagnostic plots. There is a middle column of 1cm to provide a gap between the two regions. > push.viewport(viewport( + layout = grid.layout(1, + 3, widths = unit(rep(1, + 3), c("null", "cm", + "null"))))) We draw a lattice plot in the left region. > + > > + > > push.viewport(viewport( layout.pos.col = 1)) library(lattice) bwplot <- bwplot(counts ~ outcome | treatment) print(bwplot, newpage = FALSE) pop.viewport() We draw the diagnostic plots in the right region. Here we use gridOMI() to set the base inner region and par(mfrow) and par(mfg) to insert multiple plots7 . The final plot is shown in Figure 4. > > + > + > + > > + > push.viewport(viewport(layout.pos.col = 3)) glm.D93 <- glm(counts ~ outcome + treatment, family = poisson()) par(omi = gridOMI(), mfrow = c(2, 2), new = TRUE) par(cex = 0.5, mar = c(5, 4, 1, 2)) par(mfg = c(1, 1)) plot(glm.D93, caption = "", ask = FALSE) pop.viewport(2) First of all, we construct some fake data, consisting of four ( x, y) values, and four ( z1 , z2 ) values : > x <- c(0.88, 1, 0.67, 0.34) > y <- c(0.87, 0.43, 0.04, 0.94) > z <- matrix(runif(4 * 2), ncol = 2) Before we start any plotting, we save the current par() settings so that at the end we can “undo” some of the complicated settings that we need to apply. > oldpar <- par(no.readonly = TRUE) Now we do a standard base plot of the ( x, y) values, but do not plot anything at these locations (we’re just setting up the user coordinate system). > plot(x, y, xlim = c(-0.2, 1.2), + ylim = c(-0.2, 1.2), type = "n") Now we make use of baseViewports. 
This will create a list of grid viewports that correspond to the inner, figure, and plot regions set up by the base plot. By pushing these viewports, we establish a grid viewport that aligns exactly with the plot region created by the base plot, including a (grid) “native” coordinate system that matches the (base) user coordinate system9 . A more complex example > > > + > + + + + + + + + We will now look at a reasonably complex example involving embedding base graphics within grid viewports which are themselves embedded within a base plot. This example is motivated by the following problem8 : Before we draw the pie charts, we need to perform a couple of calculations to determine their size. In this case, we specify that the largest pie will be 1" in diameter and the others will be a proportion of that size based on ∑i z.i /max (∑i z.i ) Notice that because there is only ever one current grid viewport, it only makes sense to use one of gridOMI(), gridFIG(), or gridPLT(). In other words, it only makes sense to align either the inner region, or the figure region, or the plot region with the current grid viewport. vps <- baseViewports() par(new = TRUE) push.viewport(vps$inner, vps$figure, vps$plot) grid.segments(x0 = unit(c(rep(0, 4), x), rep(c("npc", "native"), each = 4)), x1 = unit(c(x, x), rep("native", 8)), y0 = unit(c(y, rep(0, 4)), rep(c("native", "npc"), each = 4)), y1 = unit(c(y, y), rep("native", 8)), gp = gpar(lty = "dashed", col = "grey")) 7 We use par(mfrow) to specify the 2-by-2 array and par(mfg) to start at position (1, 1) in the array. description is from an email to R-help from Adam Langley, 18 July 2003 9 The grid.segments call is just drawing some dashed lines to show that the pie charts we end up with are centred correctly at the appropriate ( x, y) locations. 8 This R News ISSN 1609-3631 ● ● ● counts 20 ● 1.0 2● 0.5 ● ● ● ● ● −1.0 −1.0 −0.5 Residuals ● 6● ● ● Std. deviance resid. 25 6● 2● 0.0 3 0.5 2 0.0 1 1.5 11 1.0 Vol. 3/2, October 2003 9● ● ● 9● ● 2.6 2.7 2.8 2.9 3.0 −1.5 −0.5 Predicted values 0.5 1.0 1.5 Theoretical Quantiles 6● 9● ● ● 0.8 ● ● ● 10 0.0 0.2 ● ● 0.6 ● 0.4 15 Std. deviance resid. 1.0 2● Cook’s distance ● ● 2.6 1 2 3 1 2 3 1 2 3 2.7 2.8 2.9 7 6 0.0 0.1 0.2 0.3 0.4 0.5 0.6 1.2 ● 9 3.0 2 Predicted values 4 6 8 Obs. number Figure 4: Drawing multiple base plots within a grid viewport. > for (i in 1:4) { + push.viewport(viewport(x = unit(x[i], + "native"), y = unit(y[i], + "native"), width = sizemult[i] * + maxpiesize, height = sizemult[i] * + maxpiesize)) + grid.rect(gp = gpar(col = "grey", + fill = "white", lty = "dashed")) + par(plt = gridPLT(), new = TRUE) + pie(z[i, ], radius = 1, + labels = rep("", 2)) + pop.viewport() + } Finally, we clean up after ourselves by popping the grid viewports and restoring the initial par settings. > pop.viewport(3) > par(oldpar) The final plot is shown in Figure 5. 1.0 0.8 0.6 −0.2 0.0 0.2 0.4 y We now enter a loop to draw a pie at each ( x, y) location representing the corresponding ( z1 , z2 ) values. The first step is to create a grid viewport at the ( x, y) location, then we use gridPLT() to set the base plot region to correspond to the grid viewport. With that done, we can use the base pie function to draw a pie chart within the grid viewport10 . 1.2 > maxpiesize <- unit(1, "inches") > totals <- apply(z, 1, sum) > sizemult <- totals/max(totals) −0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 x Figure 5: Base pie charts drawn within grid viewports, which are embedded within a base plot. 
Problems and limitations The functions provided by the gridBase package allow the user to mix output from two quite different graphics systems and there are limits to how much the systems can be combined. It is important that users are aware that they are mixing two not wholly compatible systems (which is why these functions are provided in a separate package) and it is of course important to know what the limitations are: • The gridBase functions attempt to match grid 10 We draw a grid.rect with a dashed border just to show the extent of each grid viewport. It is crucial that we again call par(new=TRUE) so that we do not move on to a new page. R News ISSN 1609-3631 Vol. 3/2, October 2003 graphics settings with base graphics settings (and vice versa). This is only possible under certain conditions. For a start, it is only possible if the device size does not change. If these functions are used to draw into a window, then the window is resized, the base and grid settings will almost certainly no longer match and the graph will become a complete mess. This also applies to copying output between devices of different sizes. • It is not possible to embed base graphics output within a grid viewport that is rotated. • There are certain base graphics functions which modify settings like par(omi) and par(fig) themselves (e.g., coplot()). Output from these functions may not embed properly within grid viewports. • grid output cannot be saved and restored so any attempts to save a mixture of grid and base output are likely to end in disappointment. Summary The functions in the gridBase package provide a simple mechanism for combining base graphics output with grid graphics output for static, fixed-size plots. R News 12 This is not a full integration of the two graphics systems, but it does provide a useful bridge between the existing large body of base graphics functions and the powerful new features of grid. Availability The grid package is now part of the base distribution of R (from R version 1.8.0). Additional information on grid is available from: http://www. stat.auckland.ac.nz/~paul/grid/grid.html. The gridBase package is available from CRAN (e.g., http://cran.us.r-project.org). Bibliography P. Murrell. The grid graphics package. R News, 2(2): 14–19, June 2002. URL http://CRAN.R-project. org/doc/Rnews/. 7 D. Sarkar. Lattice. R News, 2(2):19–23, June 2002. URL http://CRAN.R-project.org/doc/Rnews/. 7 Paul Murrell University of Auckland, NZ paul@stat.auckland.ac.nz ISSN 1609-3631 Vol. 3/2, October 2003 13 A New Package for the General Error Distribution 2. The most likely value of a quantity, for which direct measurements xi are available, must not depend on the adopted unit of measure. 0.5 0.4 p= 1 p= 1.5 p= 2 p= 3 0.3 1. The probability of an error ε depends only on the greatness of the error itself and can be expressed by a function ϕ(ε) with continuous first derivative almost everywhere. f(x) The General Error Distribution, whose first formulation could be ascribed to the Russian mathematician Subbotin (1923), is a general distribution for random errors. To derive this random error distribution, Subbotin extended the two axioms used by Gauss to derive the usual normal (Gaussian) error distribution, by generalizing the first one. Subbotin used the following axioms: 0.2 Introduction 0.1 Angelo M. 
Mineo changing the structure parameter p, we can recognize some known probability distribution: for example, for p = 1 we have the Laplace distribution, for p = 2 we have the normal (Gaussian) distribution, for p → +∞ we have the uniform distribution. A graphical description of some normal of order p curves is in figure 1 (this plot has been made with the command graphnp() of the package normalp). 0.0 The normalp package −4 In this way Subbotin obtains the probability distribution with the following density function: ϕ(ε) = mh · exp[−hm |ε|m ] 2Γ(1/m) with −∞ < ε < +∞, h > 0 and m ≥ 1. This distribution is also known as Exponential Power Distribution and it has been used, for example, by Box and Tiao (1992) in Bayesian inference. In the Italian statistical literature, a different parametrization of this distribution has been derived by Lunetta (1963), who followed the procedure introduced by Pearson (1895) to derive new probability distributions, solving this differential equation d log f log f − log a = p· dx x−c and obtaining a distribution with the following probability density function ! 1 |x − µ| p · exp − f ( x) = p 2σ p p1/ p Γ(1 + 1/ p) pσ p with −∞ < x < +∞ and −∞ < µ < +∞, σ p > 0 and p ≥ 1. This distribution is known as the order p normal distribution (Vianelli, 1963). It is easy to see how this distribution is characterized by three parameters: µ is the location parameter, σ p is the scale parameter and p is the structure parameter. By R News −2 0 2 4 x Figure 1: Normal of order p curves. In this paper we present the main functions of the package normalp and some examples of their use. The normalp functions The package contains the four classical functions dealing with the computation of the density function, the distribution function, the quantiles and the generation of pseudo-random observations from an order p normal distribution. Some examples related to the use of these commands are the following: > dnormp(3, mu = 0, sigmap = 1, p = 1.5, + log = FALSE) [1] 0.01323032 > pnormp(0.5, mu = 2, sigmap = 3, p = 1.5) [1] 0.3071983 > qnormp(0.3071983, mu = 2, sigmap = 3, + p = 1.5) [1] 0.5 > rnormp(6, mu = 2, sigmap = 5, p = 2.5) [1] 3.941597 -1.943872 -2.498598 [4] 1.869880 6.709037 14.873287 In case of generation of pseudo-random numbers we have implemented two methods: one, faster, based ISSN 1609-3631 Vol. 3/2, October 2003 14 on the relationship linking an order p normal distribution and a gamma distribution (see Lunetta, 1963), and one based on the generalization of the Marsaglia (1964) method to generate pseudo-random numbers from a normal distribution. Chiodi (1986) describes how the representation of the order p normal distribution as a generalization of a normal (Gaussian) distribution can be used for simulation of random variates. Another group of functions concerns the estimation of the order p normal distribution parameters. To estimate the structure parameter p, an estimation method based on an index of kurtosis is used; in particular, the function estimatep() formulates an estimate of p based on the index V I given by p √ µ2 Γ(1/ p)Γ(3/ p) . = VI = µ1 Γ(2/ p) method implemented by estimatep(). Through the function plot.simul.mp() it is possible to see graphically the behavior of the estimators. A possible use of the function simul.mp() is the following: by comparing its theoretical value and the empirical value computed on the sample. For a comparison between this estimation method and others based on the likelihood approach see Mineo (2003). 
With the function kurtosis() it is possible to compute the theoretical values of, besides V I, β2 and β p given by The command plot(res) will produce an histogram for every set of estimates created by the function simul.mp(). In figure 2 we have the histogram for p̂. For more details see Mineo (1995-a). 0.4 N. samples with a difficult convergence: 26 > plot(res) β̂2 = n ∑in=1 ( xi − M)4 [∑in=1 ( xi − M)2 ]2 β̂ p = n ∑in=1 | xi − M|2p . [∑in=1 | xi − M| p ]2 Concerning the estimation of the location parameter µ and the scale parameter σ p , we have used the maximum likelihood method, conditional on the estimate of p that we obtain from the function estimatep(). The function we have to use in this case is paramp(). We have implemented also a function simul.mp(), that allows a simulation study to verify the behavior of the estimators used for the estimation of the parameters µ, σ p and p. The compared estimators are: the arithmetic mean and the maximum likelihood estimator for the location parameter µ, the standard deviation and the maximum likelihood estimator for the scale parameter σ p ; for the structure parameter p we used the estimation 0.2 Density Moreover, it is possible to compute the empirical values of these indexes given by q n ∑in=1 ( xi − M)2 c VI = ∑in=1 | xi − M| 0.1 √ µ2p = p+1 µ 2p 0.3 µ4 Γ(1/ p)Γ(5/ p) = 2 [Γ(3/ p)]2 µ2 βp = R News Sp p Mean 2.9348828 3.415554 Variance 0.5481126 7.753024 0.0 β2 = > res <- simul.mp(n = 30, m = 1000, mu = 2, + sigmap = 3, p = 3) > res Mean Mp Sd Mean 1.9954033 1.9991151 2.60598964 Variance 0.2351292 0.2849199 0.08791664 2 4 6 8 10 Figure 2: Histogram of p̂ obtained with the command plot.simul.mp(res). It is also possible to estimate linear regression models when we make the hypothesis of random errors distributed according to an order p normal distribution. The function we have to use in this case is lmp(), which we can use like the function lm() from the base package. In fact, the function lmp() returns a list with all the most important results drawn from a linear regression model with errors distributed as a normal of order p curve; moreover, it returns an object that can form the argument of the functions summary.lmp() and plot.lmp(): the function summary.lmp() returns a summary of the main obtained results, while the function plot.lmp() returns a set of graphs that in some way reproduces the analysis of residuals that usually we conduct when ISSN 1609-3631 Vol. 3/2, October 2003 15 we estimate a linear regression model with errors distributed as a normal (Gaussian) distribution. 0.0 7 −0.5 8 distance 9 Residuals 10 11 0.5 Residuals vs Fitted 6 7 8 9 10 11 Fitted values lmp(formula = Car ~ Center, data = cyclist) 14 16 18 20 Figure 4: First graph obtained by using the command plot.lmp(res). space Figure 3: Plot of the data considered in the data frame cyclist. To show an example of use of these functions, we considered a data set reported in Devore (2000). 
In this data set (see figure 3) the distance between a cyclist and a passing car (variable distance) and the distance between the centre line and the cyclist in the bike lane (variable space) has been recorded for each of ten streets; by considering the variable distance as a dependent variable and the variable space as an independent variable, we produce the following example: > data(ex12.21, package = "Devore5") > res <- lmp(distance ~ space, + data = ex12.21) > summary(res) Call: lmp(formula = distance ~ space, data = ex12.21) Residuals: Min 1Q -0.7467 -0.5202 Median 0.0045 Coefficients: (Intercept) -2.4075 space 0.6761 3Q 0.3560 Max 0.8363 Estimate of p 1.353972 Power deviation of order p: 0.6111 > plot(res) In figure 4 we show one of the four graphs that we have obtained with the command plot(res). R News Also for a linear regression model with errors distributed as an order p normal distribution we have implemented a set of functions that allow a simulation study to test graphically the suitability of the estimators used. The main function is simul.lmp(); besides this function, we have implemented the functions summary.simul.lmp() and plot.simul.lmp() that allow respectively to visualize a summary of the results obtained from the function simul.lmp() and to show graphically the behavior of the produced estimates. A possible use of these functions is the following: > res <- simul.lmp(10, 500, 1, data = 1.5, + int = 1, sigmap = 1, p = 3, lp = FALSE) > summary(res) Results: (intercept) x1 Mean 0.9959485 1.497519 Variance 0.5104569 1.577187 Sp p Mean 0.90508039 3.196839 Variance 0.04555003 11.735883 Coefficients: (intercept) 1.0 x1 1.5 Formula: y ~ +x1 Number of samples: 500 Value of p: 3 N. of samples with problems on convergence 10 > plot(res) In figure 5 it is showed the result of plot(res). For more details see Mineo (1995-b). ISSN 1609-3631 Vol. 3/2, October 2003 16 mands where we have observations drawn from an order p normal distribution, known also as general error distribution. The implemented functions deal essentially with estimation problems for linear regression models, besides some commands that generalize graphical tools already implemented in the package base, related to observations distributed as a normal (Gaussian) distribution. In the next future we shall work on the computational improvement of the code and on the implementation of other commands to make this package still more complete. Histogram of x1 0.00 0.10 0.20 Density 0.4 0.2 0.0 Density 0.30 0.6 Histogram of intercept −1 0 1 2 3 4 5 −4 2 4 Histogram of p 0.2 0.4 Density 1.5 1.0 0.0 0.0 0.5 Density 0 0.6 Histogram of Sp −2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 2 4 6 8 Bibliography 10 Figure 5: Graphs obtained with the command plot.simul.lmp(res). Besides the described functions, we have implemented two graphical functions. The command graphnp() allows visualization of up to five different order p normal distributions: this is the command used to obtain the plot in figure 1. The command qqnormp() allows drawing a Quantile-Quantile plot to check graphically if a set of observations follows a particular order p normal distribution. Close to this function is the command qqlinep() that sketches a line passing through the first and the third quartile of the theoretical order p normal distribution, line sketched on a normal of order p Q-Q plot derived with the command qqnormp(). In figure 6 there is a graph produced by using these two functions. 
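As a brief sketch of how these two commands might be used together (the argument names follow the pattern of the other normalp functions shown above and should be checked against the package documentation):

set.seed(10)
obs <- rnormp(100, mu = 0, sigmap = 1, p = 3)   # sample from an order-3 normal distribution
qqnormp(obs, p = 3)   # Q-Q plot against the order-3 normal quantiles
qqlinep(obs, p = 3)   # add the line through the first and third quartiles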
0 −2 −1 Sample Quantiles 1 2 p= 3 −2 −1 0 1 2 Theoretical Quantiles Figure 6: Normal of order p Q-Q plot. Conclusion In this article we have described the use of the new package normalp, that implements some useful com- R News G.E.P. Box and G.C. Tiao. Bayesian inference in statistical analysis. Wiley, New York, 1992. First edition for Addison-Wesley, 1973. M. Chiodi. Procedures for generating pseudorandom numbers from a normal distribution of order p (p>1). Statistica Applicata, 1:7-26, 1986. J.L. Devore. Probability and Statistics for Engineering and the Sciences (5th edition). Duxbury, California, 2000. G. Lunetta. Di una generalizzazione dello schema della curva normale. Annali della Facoltà di Economia e Commercio dell’Università di Palermo, 17:237244, 1963. G. Marsaglia and T.A. Bray. A convenient method for generating normal variables. SIAM rev., 6:260-264, 1964. A.M. Mineo. Stima dei parametri di intensità e di scala di una curva normale di ordine p (p incognito). Annali della Facoltà di Economia dell’Università di Palermo (Area Statistico-Matematica), 49:125-159, 1995-a. A.M. Mineo. Stima dei parametri di regressione lineare semplice quando gli errori seguono una distribuzione normale di ordine p (p incognito). Annali della Facoltà di Economia dell’Università di Palermo (Area Statistico-Matematica), 49:161-186, 1995-b. A.M. Mineo. On the estimation of the structure parameter of a normal distribution of order p. To appear on Statistica, 2003. K. Pearson. Contributions to the mathematical theory of evolution. II. Skew variation in homogeneous material. Philosophical Transactions of the Royal Society of London (A), 186:343-414, 1895. M.T. Subbotin. On the law of frequency of errors. Matematicheskii Sbornik, 31:296-301, 1923. S. Vianelli. La misura della variabilità condizionata in uno schema generale delle curve normali di frequenza. Statistica, 23:447-474, 1963. Angelo M. Mineo University of Palermo, Italy elio.mineo@dssm.unipa.it ISSN 1609-3631 Vol. 3/2, October 2003 17 Web-based Microarray Analysis using Bioconductor by Colin A. Smith Introduction The Bioconductor project is an open source effort which leverages R to develop infrastructure and algorithms for the analysis of biological data, in particluar microarray data. Many features of R, including a package-based distribution model, rapid prototyping, and selective migration to high performance implementations lend themselves to the distributed development model which the Bioconductor project uses. Users also benefit from the flexible command line environment which allows integration of the available packages in unique ways suited to individual problems. However, features to one individual may be roadblocks to another. The use of microarrays for gene expression profiling and other applications is growing rapidly. Many biologists who perform these experiments lack the programming experience of the typical R user and would strongly object to using a command line interface for their analysis. Here we present an overview of a web-based interface that attempts to address some of the difficulties facing individuals wishing to use Bioconductor for their microarray data analysis. It is distributed as the webbioc package available on the Bioconductor web site at http://www.bioconductor.org/. Target audience and interface goals While targeted at many user types, the web interface is designed for the lowest common denominator of microarray users, e.g. 
a biologist with little computer savvy and basic, but not extensive, statistical knowledge in areas pertinant to microarray analysis. Note that although this is the lowest level user targeted by the web interface, this interface also caters to power users by exposing finer details and allows flexibility within the preconstructed workflows. This article presents only the first version of a Bioconductor web interface. With luck, more R and Perl hackers will see fit to add interfaces for more aspects of microarray analysis (e.g. two-color cDNA data preprocessing). To help maintain quality and provide future direction, a number of user interface goals have been established. • Ease of use. Using the web interface, the user should not need to know how to use either a command line interface or the R language. Depending on the technical experience of the R News user, R tends to have a somewhat steep learning curve. The web interface has a short learning curve and should be usable by any biologist. • Ease of installation. After initial installation by a system administrator on a single system, there is no need to install additional software on user computers. Installing and maintaining an R installation with all the Bioconductor packages can be a daunting task, often suited to a system administrator. Using the web interface, only one such installation needs to be maintained. • Discoverability. Graphical user interfaces are significantly more discoverable than command line interfaces. That is, a user browsing around a software package is much more likely to discover and use new features if they are graphically presented. Additionally, a unified user interface for the different Bioconductor packages can help add a degree to cohesiveness to what is otherwise a disparate collection of functions, each with a different interface. Ideally, a user should be able to start using the web interface without reading any external documentation. • Documentation. Embedding context-sensitive online help into the interface helps first-time users make good decisions about which statistical approaches to take. Because of its power, Bioconductor includes a myriad of options for analysis. Helping the novice statistician wade through that pool of choices is an important aspect of the web interface. Another aspect of the target audience is the deployment platform. The web interface is written in Perl, R, and shell scripts. It requires a Unix-based operating system. Windows is not supported. It also uses Netpbm and optionally PBS. For further information, see the webbioc vignette at http://www. bioconductor.org/viglistingindex.html. User-visible implementation decisions There are a number of existing efforts to create web interfaces for R, most notably Rweb, which presents the command line environment directly to the user. See http://www.math.montana.edu/Rweb/. The Bioconductor web interface, on the other hand, entirely abstracts the command line away from the user. This results in an entirely different set of design decisions. ISSN 1609-3631 Vol. 3/2, October 2003 The first choice made was the means by which data is input and handled within the system. In an R session, data is instantiated as variables which the user can use and manipulate. However, in the web interface, the user does not see variables associated with an R session but rather individual files which hold datasets, such as raw data, preprocessed data, and analysis result tables. Different stages of microarray analysis are divided into individual modules. 
Each module leads the user through a series of steps to gather processing parameters. When ready, the system creates an R script which it either runs locally or submits to a computer cluster using the Portable Batch System. Any objects to be passed to another module are saved in individual files.

Another decision which directly impacts the user experience is that the system does not maintain accounts for individual users. Instead, it uses the concept of a uniquely identified session. When a user first starts using the web interface, a session is created which holds all the uploaded and processed data. The system provides the user with a session token consisting of a random string of letters and numbers. The token allows the user to return to their session at a future date. This offers a number of advantages: 1) At the discretion of the local system administrator, the web analysis resource can be offered as either a public or a private resource. Such a restriction can be made at the web-server level rather than the code level. 2) It allows rapid development of the web interface without being bogged down in the implementation or integration of a user infrastructure. 3) As opposed to having no session whatsoever, this allows a user to input data only once. Raw data files are often quite large. Uploading multiple copies of such datasets for each change in parameters is not desirable.

Lastly, the web interface brings the idea of design-by-contract used in the Bioconductor project down to the package level. That is, individual interface modules are responsible for a specific stage or type of analysis. Modules may take the user through any number of steps as long as they use standard input and output formats. This allows the system to grow larger and more powerful over time without making individual components more complex than is necessary to fulfill their function.

Analysis workflow

The web interface is currently limited to processing data from microarray experiments based on the Affymetrix GeneChip platform. It does however handle an entire workflow going from raw intensity values through to annotated lists of differentially expressed genes.

Affymetrix microarray data comes in the form of CEL files containing intensity values for individual probes. Because all processing is done server-side, that data must first be transferred with the Upload Manager. While raw data files can each be over ten megabytes, today's fast ethernet networks provide very acceptable performance, with file transfers taking only a few seconds.

Figure 1: Upload Manager

Affymetrix preprocessing is handled by the affy Bioconductor package. The core functionality of that package is captured by only a handful of functions and thus lends itself to a simple web interface. The user may choose either the built-in high performance function for RMA or a custom expression measure. The custom expression measure also uses additional plug-in methods from the vsn and gcrma packages, which leverage the modular design of affy.

Figure 2: Affymetrix Data Preprocessing

There are a number of methods for identifying differentially expressed genes. The web interface currently uses basic statistical tests (t-tests, F-tests, etc.) combined with multiple testing procedures for error control of many hypotheses. These are implemented in the multtest package.

Figure 3: Differential Expression and Multiple Testing
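For readers curious about what the scripts generated by these modules boil down to, the following is a minimal command-line sketch of the same Affymetrix workflow using affy and multtest directly. It is only an illustration: the directory of CEL files, the two-group class labels, and the output file name are invented, and the code actually generated by webbioc may differ.

library(affy)
library(multtest)

## preprocess all *.CEL files found in the working directory with RMA
raw <- ReadAffy()
eset <- rma(raw)

## hypothetical design: three control arrays followed by three treatment arrays
cl <- c(0, 0, 0, 1, 1, 1)

## per-gene t statistics, nominal p-values and Benjamini-Hochberg adjustment
tstat <- mt.teststat(exprs(eset), classlabel = cl, test = "t")
rawp <- 2 * (1 - pnorm(abs(tstat)))
adjp <- mt.rawp2adjp(rawp, proc = "BH")

## like the web interface, store the object passed on to the next step in a file
save(eset, file = "exprset.RData")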
Additionally, the web interface automatically produces a number of diagnostic plots common to microarray analysis. Those include M vs. A (log fold change vs. overall expression) and normal quantile-quantile plots.

The web interface completes the workflow by producing tables with integrated results and metadata annotation. Where appropriate, the annotation links to other online databases including a chromosome viewer, PubMed abstracts, Gene Ontology trees, and biochemical pathway schematics. The metadata presentation is handled by annaffy, another Bioconductor package.

Figure 4: Annotated Results and Online Database Links

In addition to presenting metadata, the web interface provides facilities for searching that metadata. For instance, it is trivial to map a set of GenBank accession numbers onto a set of Affymetrix probeset ids or find all genes in a given Gene Ontology branch. This assists biologists in making specific hypotheses about differential gene expression while maintaining strong control over the error rate.

Lastly, because the web interface stores intermediate data as R objects, users of Bioconductor through either the command line or web interface can easily exchange data back and forth. Data exchange is currently limited to exprSet objects, which is the standard class for microarray data in Bioconductor. Future development of the interface should yield more data exchange options enabling novel collaboration between R users and non-users alike.

Final thoughts

An important consideration worthy of discussion is the inherent lack of flexibility in graphical user interfaces. The R command line interface does not box one into pre-scripted actions in the way that the web interface does. It allows one to exercise much more creativity in analysis and take more non-linear approaches. In the GUI, trivial questions may be impossible to answer simply because of unforeseen limitations.

There are, however, a number of strengths in the web interface beyond what is available in R. The aforementioned interface goals are good examples of this. Additionally, the web interface can help reduce errors by distilling long series of typed commands into simple point-and-click options. All actions and parameters are tracked in a log for verification and quality control.

Secondly, the web interface easily integrates into existing institutional bioinformatics resources. The web has been widely leveraged to bring universally accessible interfaces to common command-line bioinformatics tools. The system presented here can sit right next to those tools on a web site. Because it already uses PBS for dispatching computational jobs, the web interface can take advantage of existing computer clusters built for genomic search tools, such as BLAST, and can scale to many simultaneous users. The web interface has been deployed and is currently in use by two research groups. One group is split between institutions located in different states; they use common session tokens and collaborate by sharing data and analysis results over the web.

Lastly, Bioconductor has implementations of a number of algorithms not otherwise freely available. Some newer algorithms have been exclusively implemented in Bioconductor packages. The web interface helps bring such innovations to the mainstream. It may even whet the appetite of some users, convincing them to take the plunge and learn R.
Colin A. Smith
NASA Center for Computational Astrobiology and Fundamental Biology
webbioc@colinsmith.org

Sweave, Part II: Package Vignettes

Reading, writing and interacting with R package primers in Sweave format.

by Friedrich Leisch

This is the second article in a two-part miniseries on Sweave (Leisch, 2002a), a tool that allows embedding of R code in LaTeX documents so that code, results, and descriptions are presented in a consistent way. The first article (Leisch, 2002b) introduced the Sweave file format and the R functions to process it, and demonstrated how to use Sweave as a reporting tool for literate statistical practice. This article will concentrate on how to use files in Sweave format as primers or manuals for R packages, so that users have direct access to the code examples shown and so that all code can be checked automatically for syntax errors and for consistency of the usage description with the implementation.

R package documentation

The main vehicle for documenting R packages is the collection of help files, which are the sources, written in R documentation (Rd) format, of what you see after calling help() on a topic. These files, which are divided into sections, typically contain code in just two sections: usage and examples. All examples in the R help files are, by default, required to be executable such that the user can copy & paste the code to a running R process using

• the mouse,
• keyboard shortcuts if running R inside Emacs with ESS, or
• R's example() function.

Examples should be flagged as non-executable only if there are good reasons for this, e.g. because they require user interactivity like identify() and hence cannot be executed in batch mode. The tools for package quality control, available through the R CMD check command (R CMD xxx is Rcmd xxx in the Windows version of R), test if all the examples in the package can be successfully executed. Furthermore, the code in the usage section is compared with the actual implementation to check for inconsistencies and for missing documentation.

The Rd file format was designed for reference documentation on single R objects (functions, classes, data sets, ...). It is not intended for demonstrating the interaction of multiple functions in a package. For this task we have developed the concept of package vignettes — short to medium-sized documents explaining parts or all of the functionality of a package in a more informal tone than the strict format of reference help pages.

Reading vignettes

Books like Venables and Ripley (2002) that describe the use of S for data analysis typically contain a mixture of documentation, code, and output. Short documents in this style are ideally suited to explaining the functionality of a package to new users. The directory 'inst/doc' of an R source package may contain such package documentation in any format, although we recommend PDF files because of their platform independence. We call a user guide located in 'inst/doc' a vignette only when it is a document from which the user can extract the R code and interact with it. Sweave is the only format for such documents that is currently supported by R; there may be others in the future. In short: every vignette is a user guide, but not every user guide is a vignette.

Command line interface

Starting with R version 1.8.0 there is support in base R for listing and viewing package vignettes. The vignette() function works similarly to data() and demo().
If no argument is given, a list of all vignettes in all installed packages is returned — see the example R session in Figure 1. Do not be surprised if this list is rather short or even empty on your computer. At present only a few of the packages on CRAN have vignettes. For Bioconductor we have made package vignettes a requirement and thus all Bioconductor packages provide one or more vignettes.

R> vignette()
Vignettes in package 'AnnBuilder':
  AnnBuilder    AnnBuilder Basic (source, pdf)
  HowTo         AnnBuilder HowTo (source, pdf)

Vignettes in package 'Biobase':
  Biobase       Biobase Primer (source, pdf)
  Bioconductor  Howto Bioconductor (source, pdf)
  HowTo         HowTo HowTo (source, pdf)
  esApply       esApply Introduction (source, pdf)
...
Vignettes in package 'strucchange':
  strucchange-intro  strucchange: An R Package for Testing for
                     Structural Change in Linear Regression
                     Models (source, pdf)
...

Figure 1: Usage of vignette() to list available package vignettes.

Following the title of each vignette listed in Figure 1, you will see in parentheses a list of the formats that are available. In Figure 1 all the vignettes are available in both source and PDF format. To view the strucchange-intro vignette, all you need to do is to issue

R> vignette("strucchange-intro")

and the PDF file is opened in your favorite PDF reader (exactly which PDF reader depends on the platform that you use).

If the source file for a vignette is available, one can easily extract the code shown in the vignette, although we have not yet fully automated the procedure. First we get the full path to the vignette directory

R> vigdir = system.file("doc", package = "strucchange")

and then we examine the names of the files it contains

R> list.files(vigdir)
[1] "00Index.dcf"
[2] "strucchange-intro.R"
[3] "strucchange-intro.Rnw"
[4] "strucchange-intro.pdf"
[5] "strucchange-intro.tex"

File 'strucchange-intro.Rnw' is the original Sweave file, 'strucchange-intro.R' has the extracted R code for all code chunks and could now be executed using source() or opened in your favorite editor. If the '.R' file is not available, we can create it in the current working directory by

R> library("tools")
R> vig = listFilesWithType(vigdir, "vignette")
R> Stangle(vig[1])
Writing to file strucchange-intro.R

where listFilesWithType() returns the full path to all files in vigdir that have type "vignette", i.e., an extension marking them as Sweave files.

Graphical user interface

The simplest way to access vignettes is probably through the HTML help system. If you execute help.start() and click your way to a package containing vignettes, then you will see, at the beginning of the package's table of contents, a link to an index of all the available vignettes. In addition there is a link to the directory containing the vignettes so that, for example, you could use your browser to examine the source files.

A more advanced interface to package vignettes is available in the Bioconductor package tkWidgets, available from http://www.bioconductor.org. Function vExplorer() lists all available vignettes in a nice point & click menu. For example, after selecting the strucchange vignette the upper left window shown in Figure 2 is opened. The PDF version of the vignette can be opened by clicking on the "View PDF" button. Each code chunk of the vignette has a button on the left side of the window; clicking on the button shows the code in the "R Source Code" text field. The code can be executed and the resulting output is shown in the "Results of Execution" area. The most powerful feature of this kind of interface is that the S code in the source code field can be modified by the user, e.g., to try variations of the pre-fabricated examples. To modify the example, one simply edits the code in the "R Source Code" area and presses the "Execute Code" button again.

Figure 2: Screenshot of vExplorer() showing the vignette from package strucchange: main controls for code chunk execution (upper left), currently active R graphics window (lower left) and a pdf viewer (right).
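Starting the explorer requires only two commands, assuming the Bioconductor package tkWidgets and its Tcl/Tk dependencies are installed:

R> library("tkWidgets")
R> vExplorer()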
Dynamic statistical documents and their user interfaces are an open research area, see also Buttrey et al. (2001) and Sawitzki (2002) for other approaches.

Writing vignettes

Once the Sweave file is written (we cannot do that for you), it is almost trivial to include it in an R package and make it available to users by the tools described above. Say you have written a file 'foo.Rnw' to be used as a vignette for package foo. First you need to add some meta-information to the file along the lines of

%\VignetteIndexEntry{An R Package for ...}
%\VignetteDepends{foo, bar, ...}
%\VignetteKeyword{kwd1}
%\VignetteKeyword{kwd2}

All of these should be in LaTeX comments as we have not defined them as proper LaTeX commands. The index entry is used for the listings of vignette() or vExplorer(); frequently it is the same as the title of the document (or an abbreviated version thereof). Note that it is directly used in text and HTML files and hence should not contain any TeX markup. The dependency information is analogous to the Depends field of a package 'DESCRIPTION' file and lists packages needed to execute the code in the vignette. The list of \VignetteXXX meta-information specifications will probably get longer in the near future, especially for versioning etc.

Once this is done all you have to do is create a subdirectory 'inst/doc' in your package source tree and copy 'foo.Rnw' to it. All the rest is taken care of by the R package management system, e.g.

• R CMD check will extract the code from the vignette and test that it can be executed successfully.
• R CMD build will run Sweave() and pdflatex on the vignette to create the PDF version.
• The package installation mechanism creates an index of all vignettes in the package and links it into the HTML help system.

Note that even code chunks with option eval=FALSE are tested by R CMD check; if you want to include code that should not be tested in a vignette, move it to a normal LaTeX verbatim environment. The reason for this policy is that users should be able to rely on code examples being executable exactly as shown in the vignette. By including the PDF version in the package sources it is not necessary that the vignettes can be compiled at install time, i.e., the package author can use private LaTeX extensions or BibTeX files. Only the R code inside the vignettes is part of the checking procedure; typesetting manuals is not part of package quality control. For more details see the manual "Writing R Extensions", which features a section on package vignettes.
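To make this concrete, here is a sketch of what a minimal 'foo.Rnw' vignette for the hypothetical package foo of the example above might look like; the index entry, chunk label and code are invented for illustration, and a real vignette would contain considerably more explanatory text:

%\VignetteIndexEntry{An Introduction to the foo Package}
%\VignetteDepends{foo}
%\VignetteKeyword{introduction}
\documentclass{article}

\begin{document}
\title{An Introduction to the foo Package}
\author{A. N. Author}
\maketitle

A small example that is run when the vignette is built:

<<intro-example, echo=TRUE>>=
library(foo)
x <- rnorm(100)
summary(x)
@

\end{document}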
In general it is assumed that package authors run R CMD build on their machine (and may safely assume that only they do that). R CMD check, on the other hand, should be runnable by everybody, e.g., CRAN runs a check on all 250+ packages on a daily basis (the results are available at http://cran.r-project.org/src/contrib/checkSummary.html). Bioconductor has opted for a stricter policy such that even building packages (including running latex on vignettes) should be reproducible on every machine which has the necessary tools installed.

Acknowledgements

vignette() and most of R CMD check were written by Kurt Hornik. vExplorer() and its helper functions were written by Jeff Gentry and Jianhua Zhang as part of the Bioconductor project. I want to thank them and Robert Gentleman for helpful ideas and discussions.

Bibliography

S. E. Buttrey, D. Nolan, and D. T. Lang. An environment for creating interactive statistical documents. In E. J. Wegman, A. Braverman, A. Goodman, and P. Smyth, editors, Computing Science and Statistics, volume 33. Interface Foundation of North America, Fairfax Station, VA, USA, 2001.

F. Leisch. Sweave: Dynamic generation of statistical reports using literate data analysis. In W. Härdle and B. Rönz, editors, Compstat 2002 — Proceedings in Computational Statistics, pages 575–580. Physika Verlag, Heidelberg, Germany, 2002a. URL http://www.ci.tuwien.ac.at/~leisch/Sweave. ISBN 3-7908-1517-9.

F. Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28–31, December 2002b. URL http://CRAN.R-project.org/doc/Rnews/.

G. Sawitzki. Keeping statistics alive in documents. Computational Statistics, 17:65–88, 2002.

W. N. Venables and B. D. Ripley. Modern Applied Statistics with S. Fourth Edition. Springer, 2002. URL http://www.stats.ox.ac.uk/pub/MASS4/. ISBN 0-387-95457-0.

Friedrich Leisch
Institut für Statistik & Wahrscheinlichkeitstheorie
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org

R Foundation News

by Bettina Grün and Friedrich Leisch

New benefactors

• Department of Statistics, Brigham Young University, Utah, USA
• Institute of Mathematical Statistics (IMS), Ohio, USA
• MedAnalytics, Inc., Minnesota, USA

New supporting institutions

• Astra Zeneca R&D Mölndal, Mölndal, Sweden
• Baxter AG, Vienna, Austria
• Boehringer Ingelheim Austria GmbH, Vienna, Austria
• Department of Economics, Stockholm University, Sweden
• Department of Statistics, University of Wisconsin-Madison, Wisconsin, USA
• Lehrstuhl für Rechnerorientierte Statistik und Datenanalyse, University of Augsburg, Germany
• Spotfire, Massachusetts, USA

New supporting members

Klaus Abberger (Germany), Luc Anselin (USA), Anestis Antoniadis (France), Carlos Enrique Carleos Artime (Spain), Ricardo Azevedo (USA), Pierluigi Ballabeni (Switzerland), Saghir Bashir (UK), Marcel Baumgartner (Switzerland), Hans Werner Borchers (Germany), Rollin Brant (Canada), Alessandra R. Brazzale (Italy), Karl W. Broman (USA), Robert Burrows (USA), Federico Calboli (Italy), Charles M. Cleland (USA), Jorge de la Vega Góngora (Mexico), Jan de Leeuw (USA), Ramón Diaz-Uriarte (Spain), Dubravko Dolić (Germany), Dirk Eddelbuettel (USA), Stephen Eglen (UK), John Fox (Canada), Simon Gatehouse (Australia), Stefano Guazzetti (Italy), Frank Harrell (USA), Pascal Heus (USA), Paul Hewson (UK), Giles Heywood (UK), Johannes Hüsing (Germany), Rafael Irizarry (USA), David A. James (USA), Landon Jensen (USA), Diego Kuonen (Switzerland), Manuel Castéjon Limas (Spain), Greg Louis (Canada), Clifford E. Lunneborg (USA), John Marsland (UK), Andrew D. Martin (USA),
Gordon M. Morrison (UK), Rashid Nassar (USA), Vadim Ogranovich (USA), John C. Paolillo (USA), Thomas Petzoldt (Germany), Bernhard Pfaff (Germany), Jonas Ranneby (USA), Gus Rashid (USA), Greg Ridgeway (USA), Jeffrey Rosenthal (Canada), Claude Rubinson (USA), Ingo Ruczinski (USA), Erik A. Sauleau (France), Martin Schlather (Germany), Michael Scroggie (Australia), Frederik von Ameln (Switzerland), Scott R. Waichler (USA), Rainer Walke (Germany), Ko-Kang Kevin Wang (New Zealand), Stefan Werner (Finland), Victor D. Zurkowski (Canada)

New ordinary members

Roger Bivand (Norway), Bill Venables (Australia)

Bettina Grün & Friedrich Leisch
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Friedrich.Leisch@R-project.org

Recent Events

Statistical Computing 2003 at Reisensburg

The Reisensburg meeting has become a regular attraction for those interested in Computational Statistics. It is organized by three special interest groups (Computational Statistics of the Biometric Society - DR, Statistical Analysis Systems of the German Association of Medical Informatics, Biometry and Epidemiology GMDS, and Classification and Data Analysis in the biosciences of the Gesellschaft für Klassifikation GfKl) and it takes place near Ulm, Germany, in beautiful Reisensburg castle, situated above the river Danube. The main topics of this conference are fixed one year in advance by the members of the working groups. The organizers take great care that there is sufficient time for discussion after the talks and at the famous Reisensburg bartizan round tables.

Recent developments of statistical software have been a major topic of previous meetings. Merits and miseries of the various packages were discussed in depth. This has changed. Discussion of the large packages played a minor role this year, and R was featured in many presentations. F. Bretz, T. Hothorn and P. Westfall gave an overview of the multcomp package for multiple comparisons. F. Leisch introduced flexmix, a framework for fitting discrete mixtures of regression models. As we all know, a lot still has to be done in R to support advanced visualization and interactivity. A. Zeileis, D. Meyer and K. Hornik demonstrated visualizations using mosaic plots in R. S. Urbanek showed Java-based interactive graphics for R and H. Hoffmann demonstrated what can be done in other environments, using visualizations for conditional distributions as an example and tools derived from the Augsburg Dada collection. The list of speakers and topics is too long to be repeated here in full. Program and abstracts are available from http://www.dkfz.de/biostatistics/Reisensburg2003/.

The 2003 meeting highlighted the analysis of genomic data. Robert Gentleman presented a keynote session on exploring and visualizing genomic data. Subsequent sessions covered methodological aspects, in particular techniques for combining classifiers or variables and methods related to machine learning. On the more applied side, topics included, among others, C. Ittrich and A. Benner addressing the role of microarrays in clinical trials, and U. Mansmann discussing simulation techniques for microarray experiments. E. Brunner (Göttingen) used the opportunity to demonstrate how classical statistical analysis of the design of experiments may be applied in this field to give concise answers instead of vague conjectures.

High dimensional observations, combined with very low sample sizes, are a well known peculiarity of genomic data.
Another peculiarity comes from the strong dependence between the observed data. The data refer to gene activities, and these are only an aspect of the metabolic and regulatory dynamics of a cell. Little is known about how to include the knowledge of metabolic pathways and the resulting dependencies in a statistical analysis. Using statistical inference from genomic data to identify metabolic or regulatory structures is largely an open task. F. Markowetz and R. Spang studied the effect of perturbations on reconstructing network structure; C. Becker and S. Kuhnt addressed robustness in graphical modelling. From the application side, A. v. Heydebreck reported on estimation of oncogenetic tree models and W. Huber talked about identification of protein domain combinations.

The next Reisensburg working conference will take place June 27–30, 2004. By the time you read this article, the call for papers should have been issued. The main topics will be: applications of machine learning; statistical analysis of graphs/networks; statistical software; bioinformatics; exploration of large data sets. Till then, working groups in cooperation with the special research unit in Erlangen will organize a workshop on Ensemble Learning in Erlangen, January 23–24, 2004. Stay tuned, and see http://www.imbe.med.uni-erlangen.de/links/EnsembleWS/.

Günther Sawitzki
Universität Heidelberg
gs@statlab.uni-heidelberg.de

Statistical Inference, Computing and Visualization for Graphs

On August 1–2, 2003, a workshop on using graphs in statistical data analysis took place at Stanford University. Quoting the workshop homepage at http://www.research.att.com/~volinsky/Graphs/Workshop.html: "Graphs have become an increasingly popular way of representing data in many different domains, including telecommunications research, genomics and bioinformatics, epidemiology, computer networks and web connectivity, social networks, marketing and statistical graphical models. Analyzing these data effectively depends on contributions from the areas of data representation, algorithms, visualization (both static and interactive), statistical modeling (including graphical models) and inference. Each of these areas has its own language for describing graphs, and its own favorite tools and methods. Our goal for the workshop is to explore
Finally, I want to thank the organizers for the great job they did in organizing the workshop; both the scientific program and the social atmosphere made it a pleasure to participate. Friedrich Leisch Technische Universität Wien, Austria Friedrich.Leisch@R-project.org JSM 2003 At the 2003 Joint Statistical Meetings in San Francisco, an invited session was organized that is of particular interest to the R community. Jan de Leeuw from University of California, Los Angeles, led off the session with the a talk on “The State of Statistical Software” (http://gifi.stat.ucla.edu/pub/ jsm03.pdf). He began with a overview of types of statistical software one might use for activities such as consulting, teaching and research providing some history and thoughts for the future along the way. Luke Tierney from University of Iowa, spoke on “Some New Language Features of R” (http:// www.stat.uiowa.edu/~luke/talks/jsm03.pdf) focussing on namespaces, code analysis tools, exception handling and byte compilation. Duncan Temple Lang from Bell Laboratories spoke on “Connecting Scientific Software” (http://cm.bell-labs. com/stat/duncan/Talks/JSM2003). The talk dealt with connecting other software packages to R, with particular attention to R DCOM services. The discussant, Wolfgang Hartmann from SAS, provided an industry perspective (see http://www.cmat.pair. com/wolfgang/jsm03.pdf) comparing the features of different software, commercial and open-source, with specific attention to R. R News 27 Balasubramanian Narasimhan Stanford University, CA, USA naras@stat.stanford.edu gR 2003 On 17-20th September 2003, Aalborg University hosted a workshop bringing together people from many communities working with graphical models. The common interest is development of a package for R, supporting the use of graphical models for data analysis. The workshop followed up on the gR initiative described by Steffen Lauritzen in R News 2/3. The workshop provided a kaleidoscope of applications as well as insight in experiences dealing with practical graphical models. The applications presented were from the areas of epidemiology, geostatistics, genetics, bioinformatics and machine learning. The wide range of applications and methodology showed that a unifying software package for graphical models must be widely extensible and flexible — utilizing a variety of data formats, model specifications and estimation algorithms. The package should also provide an attractive user interface that aids in working with complex models interactively. Development of a gR-package is evolving at many levels. Some ’old’ stand-alone programs are being ported as R-packages (CoCoR, BRugs), some are being interfaced (mimR, JAGS, BugsR), while others have been developed in R (ggm, deal, GRAPPA). Experiences from other existing packages can inspire the gR project. For example, the Bayes Net Toolbox for Matlab includes many features that gR will include. Intel is currently re-implementing the Bayes Net Toolbox in C++ (called Probability Network Library, PNL) and plan a December 2003 release, expected to be open source. An R interface to PNL could be a possibility. During the workshop an outline of a package grbase with basic elements was discussed and thought to become a common ground for extensions. Important features were to separate data, model and inference. The grbase package will include • support for a variety of data formats, eg. as a list of cases, a dataframe or a database connection. It should also be possible to work with a model without data. 
• a general model language capable of specifying e.g. (block-) recursive graphical models and BUGS models.

• a variety of representation forms for graphs, e.g. using/extending the graph package from Bioconductor.

• a graphics system for interactively working with models, for example using R-Tcl/Tk, Rggobi or the R interface to Graphviz.

• an analyzing unit that combines data and model with the possibility of using different inference algorithms in the analyzing step.

A minimal version of grbase is planned for January 2004. An invited session concerned with the gR developments is being planned for the Joint Statistical Meeting in Toronto, 8–12 August 2004. See http://www.math.auc.dk/gr/gr2003/ for more information about the workshop and related links, including links to the aforementioned software.

Acknowledgments

The gR-2003 workshop was supported by the Danish National Research Foundation Network in Mathematical Physics and Stochastics - MaPhySto. The Danish activities of the gR project are supported by the Danish Natural Science Research Council.

Claus Dethlefsen
Aalborg University, Denmark
dethlef@math.auc.dk

Book Reviews

John Maindonald and John Braun: Data Analysis and Graphics Using R — An Example-based Approach

Cambridge University Press, Cambridge, United Kingdom, 2003
362 pages, ISBN 0-521-81336-0
http://cbis.anu.edu/DAAG/
http://www.stats.uwo.ca/DAAG/

The aim of the book is to describe the ideas and concepts of many statistical methodologies that are widely used in applications, by demonstrating the use of R on a number of examples. Most examples in the book use genuine data collected by the authors in their combined several decades of statistical consulting experience. The authors see the book as a companion to other books that include more mathematical treatments of the relevant theory, and they avoid mathematical notation and mathematical description of statistical methods. The book is aimed at both scientists and students interested in practical data analysis. Data and new R functions used in the book are included in the DAAG package available from the authors' web sites and through the Comprehensive R Archive Network (CRAN).

The book begins with a nice summary of the contents of the twelve chapters of the book. Chapter 1, A Brief Introduction to R, provides enough information on using R to get the reader started. Chapter 2, Style of Data Analysis, demonstrates with many examples the use of R to carry out basic exploratory data analysis involving both graphical and numerical summaries of data. The authors not only describe how to create graphs and plots but also show the reader what to look for in the data summaries and how to interpret the summaries in the context of each particular example. Chapter 3, Statistical Models, describes the authors' view on the importance of models as a framework for statistical analysis. Chapter 4, An Introduction to Formal Inference, introduces the basic ideas of random sampling and sampling distributions of statistics necessary to understand confidence intervals and hypothesis testing. It also includes chi-square tests for contingency tables and one-way ANOVA. The next several chapters demonstrate the use of R to analyze data using linear models.
Chapter 5, Regression with a Single Predictor, Chapter 6, Multiple Linear Regression, Chapter 7, Exploiting the Linear Model Framework, and Chapter 8, Logistic Regression and Other Generalized Linear Models, use increasingly complex models to lead the reader through several examples of practical data analysis.

The next three chapters discuss more specialized topics that arise frequently in practice. Chapter 9, Multi-level Models, Time Series, and Repeated Measures, goes through examples that use more complicated error structures than examples found in previous chapters. Chapter 10, Tree-based Classification and Regression Trees, provides an introduction to tree-based regression and classification modeling. Chapter 11, Multivariate Data Exploration and Discrimination, describes both principal components analysis and discriminant analysis. The final chapter, Chapter 12, The R System — Additional Topics, is a far more detailed introduction to R than that contained in the initial chapters. It is also intended as a reference to the earlier chapters.

The book is a primer on the nuts-and-bolts use of R for the types of statistical analysis that arise commonly in statistical practice, and it also teaches the reader to think statistically when interpreting the results of an analysis. The strength of the book is in the extensive examples of practical data analysis with complete examples of the R code necessary to carry out the analyses. Short R commands appear on nearly every page of the book and longer R code examples appear frequently as footnotes.

I would strongly recommend the book to scientists who have already had a regression or a linear models course and who wish to learn to use R. However, my recommendation has a few caveats. The first chapter of the book takes the reader through an introduction to R that has the potential to be a little frustrating for a reader with no prior R experience. For example, the first plotting command given is

plot(ACT ~ Year, data=ACTpop, pch=16)

The meaning of the pch=16 option is described and the option data=ACTpop is self evident, but the syntax ACT ~ Year is not explained and is potentially confusing to an R beginner who does not automatically translate ~ into "is modeled by". Page 5 gives the advice to create a new workspace before experimenting with R functions, but provides no details on how one actually does this. Most examples of R code in the book do contain adequate descriptions, but there are a number of exceptions.

A second caveat is that the descriptions of statistical methods are an adequate refresher, but are inadequate as a primary source of information. The authors indicate clearly that the book is meant to complement other books in the presentation of, and the mathematical description of, statistical methods. I agree that the book would not work well as a stand-alone textbook for a course on statistical modeling. However, it is also not short and I would hesitate to require students to buy it in addition to another comprehensive textbook. The scope of the book is greater than simply serving as a companion book for teaching R. Despite my hesitation to use this book in teaching, I give it a strong recommendation to the scientist or data analyst who wishes an easy-to-read and understandable reference on the use of R for practical data analysis.
Bret Larget
University of Wisconsin-Madison
larget@stat.wisc.edu

Changes in R 1.8.0

by the R Core Team

MacOS changes

• As from this release there is only one R port for the Macintosh, which runs only on MacOS X. (The 'Carbon' port has been discontinued, and the 'Darwin' port is part of the new version.) The current version can be run either as a command-line application or as an 'Aqua' console. There is a 'Quartz' device quartz(), and the download and installation of both source and binary packages is supported from the Aqua console. Those CRAN and BioC packages which build under MacOS X have binary versions updated daily.

User-visible changes

• The defaults for glm.control(epsilon=1e-8, maxit=25) have been tightened: this will produce more accurate results, slightly slower.

• sub, gsub, grep, regexpr, chartr, tolower, toupper, substr, substring, abbreviate and strsplit now handle missing values differently from "NA".

• Saving data containing name space references no longer warns about name spaces possibly being unavailable on load.

• On Unix-like systems interrupt signals now set a flag that is checked periodically rather than calling longjmp from the signal handler. This is analogous to the behavior on Windows. This reduces responsiveness to interrupts but prevents bugs caused by interrupting computations in a way that leaves the system in an inconsistent state. It also reduces the number of system calls, which can speed up computations on some platforms and make R more usable with systems like Mosix.

Changes to the language

• Error and warning handling has been modified to incorporate a flexible condition handling mechanism. See the online documentation of tryCatch() and signalCondition(). Code that does not use these new facilities should remain unaffected.

• A triple colon operator can be used to access values of internal variables in a name space (i.e. a:::b is the value of the internal variable b in name space a).

• Non-syntactic variable names can now be specified by inclusion between backticks `Like This`. The deparse() code has been changed to output non-syntactical names with this convention, when they occur as operands in expressions. This is controlled by a backtick argument, which is by default TRUE for composite expressions and FALSE for single symbols. This should give minimal interference with existing code.

• Variables in formulae can be quoted by backticks, and such formulae can be used in the common model-fitting functions. terms.formula() will quote (by backticks) non-syntactic names in its "term.labels" attribute. [Note that other code using terms objects may expect syntactic names and/or not accept quoted names: such code will still work if the new feature is not used.]
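As a quick illustration of two of these language changes, consider the following small session; it is not taken from the release notes, and the variable name is invented:

R> ## conditions can now be caught with tryCatch()
R> tryCatch(log(-1), warning = function(w) conditionMessage(w))
[1] "NaNs produced"
R> ## non-syntactic names can be quoted by backticks
R> `a strange name` <- 1:3
R> mean(`a strange name`)
[1] 2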
New features

• format() and hence print() make an effort to handle corrupt data frames, with a warning.

• format.info() now also works with 'nsmall' in analogy with format.default().

• gamma(n) is very slightly more precise for integer n in 11:50.

• ? and help() will accept more un-quoted arguments, e.g. NULL.

• The ? operator has new forms for querying documentation on S4 methods. See the online documentation.

• New argument frame.plot = axes (== TRUE) for filled.contour().

• New function bquote() does partial substitution like LISP backquote.

• New argument fixed = TRUE for grep() and regexpr() to avoid the need to escape strings to match.

• capture.output() takes arbitrary connections for file argument.

• grep(x, ..., value = TRUE) preserves names of x.

• contr.poly() has a new scores argument to use as the base set for the polynomials.

• hist.POSIXt() can now pass arguments to hist.default().

• cor() has a new argument method = c("pearson","spearman","kendall"), as cor.test() did forever. The two rank based measures do work with all three missing value strategies.

• legend() and symbols() now make use of xy.coords() and accept a wider range of coordinate specifications.

• New utility function cov2cor() Cov -> Corr matrix.

• cut.POSIXt() now allows 'breaks' to be more general intervals as allowed for the 'by' argument to seq.POSIXt().

• data() now has an envir argument.

• det() uses an LU decomposition and LAPACK. The method argument to det() no longer has any effect.

• dev.control() now accepts enable as well as inhibit. (Wishlist PR#3424)

• *, - and / work more generally on "difftime" objects, which now have a diff() method.

• dt(*, ncp = V) is now implemented, thanks to Claus Ekstroem.

• dump() only quotes object names in the file where necessary.

• eval() of a promise forces the promise.

• file.path() now returns an empty character vector if given at least one zero-length argument.

• Added function library.dynam.unload() to call dyn.unload() on a loaded DLL and tidy up. This is called for all the standard packages in namespaces with DLLs if their namespaces are unloaded.

• lm(singular.ok = FALSE) is now implemented.

• Empty lm() and glm() fits are now handled by the normal code: there are no methods for classes "lm.null" and "glm.null". Zero-rank fits are handled consistently.

• make.names() has improvements, and there is a new auxiliary function make.unique(). (Based on code contributed by Tom Minka, since converted to a .Internal function.) In particular make.names() now recognises that names beginning with a dot are valid and that reserved words are not.

• methods() has a print method which asterisks functions which are not user-visible. methods(class = "foo") now lists non-visible functions, and checks that there is a matching generic.

• model.matrix() now warns when it removes the response from the rhs of the formula: that this happens is now documented on its help page.

• New option "locatorBell" to control the confirmation beep during the use of locator() and identify().

• New option("scipen") provides some user control over the printing of numbers in fixed-point or exponential notation. (Contributed by David Brahm.)

• plot.formula() now accepts horizontal=TRUE and works correctly when boxplots are produced. (Wishlist PR#1207) The code has been much simplified and corrected.

• polygon() and rect() now interpret density < 0 or NA to mean filling (by colour) is desired: this allows filling and shading to be mixed in one call, e.g. from legend().

• The predict() methods for classes lm, glm, mlm and lqs take a 'na.action' argument that controls how missing values in 'newdata' are handled (and defaults to predicting NA). [Previously the value of getOption("na.action") was used and this by default omitted cases with missing values, even if set to 'na.exclude'.]

• print.summary.glm() now reports omitted coefficients in the same way as print.summary.lm(), and both show them as NAs in the table of coefficients.

• print.table() has a new argument 'zero.print' and is now documented.

• rank(x, na.last = "keep") now preserves NAs in 'x', and the argument 'ties.method' allows the use of non-averaging ranks in the presence of ties.
• read.table()'s 'as.is' argument can be character, naming columns not to be converted.

• rep() is now a generic function, with default, POSIXct and POSIXlt methods. For efficiency, the base code uses rep.int() rather than rep() where possible.

• New function replicate() for repeated evaluation of expression and collection of results, wrapping a common use of sapply() for simulation purposes.

• rev() is now a generic function, with default and dendrogram methods.

• serialize() and unserialize() functions are available for low-level serialization to connections.

• socketSelect() allows waiting on multiple sockets.

• sort(method = "quick", decreasing = TRUE) is now implemented.

• sort.list() has methods "quick" (a wrapper for sort(method = "quick", index.return = TRUE)) and "radix" (a very fast method for small integers). The default "shell" method works faster on long vectors with many ties.

• stripchart() now has 'log', 'add' and 'at' arguments.

• strsplit(x, *) now preserves names() but no longer works for non-character 'x' (it formerly used as.character(x), destroying names(x)).

• textConnection() now has a local argument for use with output connections. local = TRUE means the variable containing the output is assigned in the frame of the caller.

• Using UseMethod() with more than two arguments now gives a warning (as R-lang.texi has long claimed it did).

• New function vignette() for viewing or listing vignettes.

• which.min(x) and which.max(x) now preserve names.

• xy.coords() coerces "POSIXt" objects to "POSIXct", allowing lines, etc. to be added to plot.POSIXlt() plots.

• .Machine has a new entry, sizeof.pointer.

• .Random.seed is only looked for and stored in the user's workspace. Previously the first place a variable of that name was found on the search path was used.

• Subscripting for data.frames has been rationalized:

– Using a single argument now ignores any 'drop' argument (with a warning). Previously using 'drop' inhibited list-like subscripting.

– adf$name <- value now checks for the correct length of 'value', replicating a whole number of times if needed.

– adf[j] <- value and adf[[j]]

Crossword Solution

R> sample(c("Rolf","Saikat","Simon"))[1]
[1] "Simon"

Correction to "Building Microsoft Windows Versions of R and R packages under Intel Linux"

by Jun Yan and A.J. Rossini

Unfortunately, due to an inexcusable oversight on our part, we failed to be crystal clear in our article Yan and Rossini (2003) that all the described steps and the Makefile were summarized from several documents in the R sources (R Development Core Team, 2003). These documents are INSTALL, readme.package, and Makefile under the directory src/gnuwin32/ in the R source. We intended to automate and illustrate those steps by presenting an explicit example, hoping that it might save people's time. However, confusion has been caused and inquiries have been raised to the R-help mailing list. We apologize for the confusion and claim sole responsibility. In addition, we clarify that the final credit should go to the R Development Core Team.
Bibliography

R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2003. URL http://www.R-project.org. ISBN 3-900051-00-3.

J. Yan and A. Rossini. Building Microsoft Windows versions of R and R packages under Intel Linux. R News, 3(1):15–17, June 2003. URL http://CRAN.R-project.org/doc/Rnews/.

Jun Yan
University of Iowa, U.S.A.
jyan@stat.uiowa.edu

A.J. Rossini
University of Washington, U.S.A.
rossini@u.washington.edu

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board: Douglas Bates and Thomas Lumley.
Editor Programmer's Niche: Bill Venables
Editor Help Desk: Uwe Ligges
Email of editors and editorial board: firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage: http://www.R-project.org/

This newsletter is available online at http://CRAN.R-project.org/doc/Rnews/

ISSN 1609-3631