Cray Users Group Spring 1994 Proceedings
Page Count: 540
Spring Proceedings
Cray User Group, Inc.

INCORPORATED

PROCEEDINGS

Thirty-Third Semi-Annual
Cray User Group Meeting
San Diego, California
March 14-18, 1994

No part of this publication may be reproduced by any mechanical, photographic, or electronic process
without written permission from the author(s) and publisher. Permission requests should be made in
writing to:
Cray User Group Secretary
c/o Joan Palm
655A Lone Oak Drive
Eagan, MN 55121 USA

The CUG Proceedings is edited and produced by Karen Winget,
Fine Point Editorial Services, 1011 Ridge Valley Court, Shepherdstown, WV 25443
Cover designed by Carol Conway Leigh

Autotasking, CF77, CRAY, Cray Ada, CRAY Y-MP, CRAY-1, HSX, MPGS, SSD, SUPERSERVER, UniChem, UNICOS, and X-MP EA
are federally registered trademarks and CCI, CFT, CFT2, CFT77, COS, CRAY APP, Cray C++ Compiling System, CRAY C90, CRAY
EL, Cray NQS, CRAY S-MP, CRAY T3D, CRAY X-MP, CRAY XMS, CRAY-2, Cray/REELlibrarian, CRInform, CRI/TurboKiva,
CSIM, CVT, Delivering the power ..., Docview, EMDS, IOS, MPP Apprentice, Network Queuing Environment, Network Queuing
Tools, OLNET, RQS, SEGLDR, SMARTE, SUPERCLUSTER, SUPERLINK, Trusted UNICOS, and UNICOS MAX are trademarks of Cray
Research, Inc.

Cray User Group, Inc.
BOARD OF DIRECTORS
President
Gary Jensen (NCAR)
Vice-President
Jean Shuler (NERSC)
Secretary
Gunter Georgi (Grumman)
Treasurer
Howard Weinberger (TAI)
Director
Claude Lecoeuvre (CEL-V)
Regional Director Americas
Fran Pellegrino (Westinghouse)
Regional Director Asia/Pacific
Kyukichi Ohmura (CRC)
Regional Director Europe
Walter Wehinger (RUS)

PROGRAM COMMITTEE
Jean Shuler, Chair (NERSC)
Robert Price (Westinghouse)
Helene Kulsrud (IDA)
Phil Cohen (Scripps)

LOCAL ARRANGEMENTS
Anke Kamrath
Ange Mason
Mike Vildibill
Jay Dombrowski
Phil Cohen
Dan Drobnis
Gail Bamber
Russ Sinco
Sandy Davey
Jayne Keller

CONTENTS

GENERAL SESSIONS
Chairman's Report
John F. Carlson (CRI), 3

Cray Research Corporate Report
Robert Ewald (CRI), 6

CRI Software Report
Irene Qualters (CRI), 14

Unparalleled Horizons: Computing in Heterogeneous Environments
Reagan Moore (SDSC), 21

Cray T3D Project Update
Steve Reinhardt (CRI), 28

PARALLEL SESSIONS

Applications and Algorithms
Porting Third-Party Applications Packages to the Cray MPP: Experiences at the Pittsburgh Supercomputing Center
Frank C. Wimberley, Susheel Chitre, Carlos Gonzales, Michael H. Lambert, Nicholas Nystrom, Alex Ropelewski, William Young (PITTSC), 39

Some Loop Collapse Techniques to Increase Autotasking Efficiency
Mike Davis (CRI), 44

Collaborative Evaluation Project of Ingres on the Cray (CEPIC)
C.B. Hale, G.M. Hale, and K.F. Witte (LANL), 65

Providing Breakthrough Gains: Cray Research MPP for Commercial Applications
Denton A. Olson (CRI), 73

Asynchronous Double-Buffered I/O Applied to Molecular Dynamics Simulations of Macromolecular Systems
Richard J. Shaginaw, Terry R. Stouch, and Howard E. Alper (BMSPRI), 76

A PVM Implementation of a Conjugate Gradient Solution Algorithm for Ground-Water Flow Modeling
Dennis Morrow, John Thorp, and Bill Holter (CRI and NASA/Goddard), 80

Graphics
Decimation of Triangle Meshes
William J. Schroeder (GE-AE), 87

Visualization of Volcanic Ash Clouds
Mitchell Roth (ARSC) and Rick Guritz (Alaska Synthetic Aperture Radar Facility), 92

A Graphical User Interface for Networked Volume Rendering on the Cray C90
Allan Snavely and T. Todd Elvins (SDSC), 100

Management
Heterogeneous Computing Using the Cray Y-MP and T3D
Bob Carruthers (CRI), 109

Future Operating Systems Directions
Don Mason (CRI), 112

Future Operating Systems Directions-Serverized UNICOS
Jim Harrell (CRI), 114

Mass Storage Systems
Storage Management Update
Brad Strand (CRI), 119

RAID Integration on Model-E IOS
Bob Ciotti (NASA-Ames), 123

Automatic DMF File Expiration
Andy Haxby (Shell U.K.), 136

ER90® Data Storage Peripheral
Gary R. Early and Anthony L. Peterson (ESYSTEMS), 140

EMSS/Cray Experiences and Performance Issues
Anton L. Ogno (EXXON), 143

AFS Experience at the Pittsburgh Supercomputing Center
Bill Zumach (PITTSC), 153

AFS Experience at the University of Stuttgart
Uwe Fischer and Dieter Mack (RUS), 157

Cray Research Status of the DCE/DFS Project
Brian Gaffey (CRI), 161

Networking
SCinet '93-An Overview
Benjamin R. Peek (Peek & Associates, Inc.), 169

Experiences with OSF-DCE/DFS in a 'Semi-Production' Environment
Dieter Mack (RUS), 175

ATM-Current Status
Benjamin R. Peek (Peek & Associates, Inc.), 180

Network Queuing Environment
Robert E. Daugherty and Daniel M. Ferber (CRI), 203

Operating Systems
Livermore Computing's Production Control System, 3.0
Robert R. Wood (LLNL), 209

ALACCT: Limiting Activity Based On ACID
Sam West (CRI), 212

Priority-Based Memory Scheduling
Chris Brady (CRI), Don Thorp (CRI), and Jeff Pack (GRUMMAN), 221

Planning and Conducting a UNICOS Operating System Upgrade
Mary Ann Ciuffini (NCAR), 224

UNICOS Release Plans (1994-1995)
Janet Lebens (CRI), 231

UNICOS 8.0 Experiences
Hubert Busch (ZIB), 237

UNICOS 8.0-Field Test Experiences
Douglas A. Spragg (EUTEC), 239

Operations
SDSC Host Site Presentation
Daniel D. Drobnis and Michael Fagan (SDSC), 247

Tools for the Cray Research OWS-E Operator under UNICOS 8.0
Leon Vann (CRI), 251

Cray Research Product Resiliency
Gary Shorrel (CRI), 255

Performance Evaluation
UNICOS 7.C versus 8.0 Test Results
C.L. Berkey (CRI), 260

Workload Characterization of Cray Supercomputer Systems Running UNICOS for the Optimal Design of NQS Configuration in a Site
Young W. Lee and Yeong Wook Cho (KIST) and Alex Wight (Edinburgh University), 271

Cray C90D Performance
David Slowinski (CRI), 291

Software Tools
Cray File Permission Support Toolset
David H. Jennings and Mary Ann Cummings (Naval Surface Warfare Center), 295

Tools for Accessing Cray Datasets on Non-Cray Platforms
Peter W. Morreale (NCAR), 300

Centralized User Banking and User Administration on UNICOS
Morris A. Jette, Jr., and John Reynolds (NERSC), 304

Fortran 90 Update
Jon Steidel (CRI), 313

C and C++ Programming Environments
David Knaak (CRI), 320

The MPP Apprentice™ Performance Tool: Delivering the Performance of the Cray T3D®
Winifred Williams, Timothy Hoel, and Douglas Pase (CRI), 324

Cray Distributed Programming Environment
Lisa Krause (CRI), 333

Cray TotalView™ Debugger
Dianna Crawford (CRI), 338

Fortran I/O Libraries on T3D
Suzanne LaCroix (CRI), 344

User Services
User Contact Measurement Tools and Techniques
Ted Spitzmiller (LANL), 349

Balancing Services to Meet the Needs of a Diverse User Base
Kathryn Gates (MCSR), 353

Applications of Multimedia Technology for User Technical Support
Jeffrey A. Kuehn (NCAR), 360

Integrated Performance Support: Cray Research's Online Information Strategy
Marjorie L. Kyriopoulos (CRI), 365

New and Improved Methods of Finding Information via CRINFORM
Patricia A. Tovo (CRI), 369

Online Documentation: New Issues Require New Processes
Juliana Rew (NCAR), 372

MetaCenter Computational Science Bibliographic Information System: A White Paper
Mary Campana (Consultant) and Stephanie Sides (SDSC), 376

Electronic Publishing: From High-Quality Publications to Online Documentation
Christine Guzy (NCAR), 380

JOINT SESSIONS

MSS/Operating Systems
Workload Metrics for the NCAR Mass Storage System
J.L. Sloan (NCAR), 387

Scientific Data Storage Solutions: Meeting the High-Performance Challenge
Daniel Krantz, Lynn Jones, Lynn Kluegel, Cheryl Ramsey, and William Collins (LANL), 392

Beyond a Terabyte System
Alan K. Powers (SS-NAD), 402

Operations/Environmental MIG
Overview of Projects, Porting, and Performance of Environmental Applications
Tony Meys (CRI), 411

Operations/MPP MIG
T3D SN6004 is Well, Alive and Computing
Martine Gigandet, Monique Patron, Francois Robin (CEA-CEL), 419

System Administration Tasks and Operational Tools for the Cray T3D System
Susan J. Crawford (CRI), 426

Performance Evaluation/Applications and Algorithms
I/O Optimisation Techniques Under UNICOS
Neil Storer (ECMWF), 433

I/O Improvements in a Production Environment
Jeff Zais and John Bauer (CRI), 442

New Strategies for File Allocation on Multi-Device File Systems
Chris Brady (CRI), Dennis Colarelli (NCAR), Henry Newman (Instrumental), and Gene Schumacher (NCAR), 446

Performance Evaluation/MPP MIG
The Performance of Synchronization and Broadcast Communication on the Cray T3D Computer System
F. Ray Barriuso (CRI), 455

High Performance Programming Using Explicit Shared Memory Model on the Cray T3D
Subhash Saini, Horst Simon (CSC-NASA/Ames) and Charles Grassl (CRI), 465

Architecture and Performance for the Cray T3D
Charles M. Grassl (CRI), 482

User Services/Software Tools
Confessions of a Consultant
Tom Parker (NCAR), 491

SHORT PAPERS
Xnewu: A Client-Server Based Application for Managing Cray User Accounts
Khalid Warraich and Victor Hazlewood (Texas A&M), 505

QEXEC: A Tool for Submitting a Command to NQS
Glenn Randers-Pehrson (USARL), 507

ATTENDEE LIST, 513

AUTHOR INDEX
Alper, H.E. (Bristol-Myers Squibb), 76
Barriuso, F.R. (CRI), 455
Bauer, J. (CRI), 442
Berkey, C.L. (CRI), 260
Brady, C. (CRI), 221, 446
Busch, H. (ZIB), 237
Campana, M. (SDSC consultant), 376
Carlson, J.F. (CRI), 3
Carruthers, B. (CRI), 109
Chitre, S. (PITTSC), 39
Cho, Y.W. (KIST), 271
Ciotti, B. (NASA/Ames), 123
Ciuffini, M. (NCAR), 224
Colarelli, D. (NCAR), 446
Collins, W. (LANL), 392
Crawford, D. (CRI), 338
Crawford, S.J. (CRI), 426
Cummings, M.A. (Naval Surface Warfare Center), 295
Daugherty, R.E. (CRI), 203
Davis, M. (CRI), 44
Drobnis, D.D. (SDSC), 247
Early, G.R. (ESYSTEMS), 140
Elvins, T.T. (SDSC), 100
Ewald, R. (CRI), 6
Fagan, M. (SDSC), 247
Ferber, D.M. (CRI), 203
Fischer, U. (RUS), 157
Gaffey, B. (CRI), 161
Gates, K. (MCSR), 353
Gigandet, M. (CEA-CEL), 419
Gonzales, C. (PITTSC), 39
Grassl, C. (CRI), 465, 482
Guritz, R. (Alaska Synthetic Aperture Radar Facility), 92
Guzy, C. (NCAR), 380
Hale, C.B. (LANL), 65
Hale, G.M. (LANL), 65
Harrell, J. (CRI), 114
Haxby, A. (Shell U.K.), 136
Hazlewood, V. (Texas A&M), 505
Hoel, T. (CRI), 324
Holter, B. (NASA/Goddard), 80
Jennings, D.H. (Naval Surface Warfare Center), 295
Jette, M.A., Jr. (NERSC), 304
Jones, L. (LANL), 392
Kluegel, L. (LANL), 392
Knaak, D. (CRI), 320
Krantz, D. (LANL), 392
Krause, L. (CRI), 333
Kuehn, J.A. (NCAR), 360
Kyriopoulos, M.L. (CRI), 365
LaCroix, S. (CRI), 344
Lambert, M.H. (PITTSC), 39
Lebens, J. (CRI), 231

Lee, Y.W. (KIST), 271
Mack, D. (RUS), 157, 175
Mason, D. (CRI), 112
Meys, T. (CRI), 411
Moore, R. (SDSC), 21
Morreale, P. (NCAR), 300
Morrow, D. (CRI), 80
Newman, H. (Instrumental), 446
Nystrom, N. (PITTSC), 39
Ogno, A.L. (EXXON), 143
Olson, D.A. (CRI), 73
Pack, J. (GRUMMAN), 221
Parker, T. (NCAR), 491
Pase, D. (CRI), 324
Patron, M. (CEA-CEL), 419
Peek, B.R. (Peek & Assoc.), 169, 180
Peterson, A.L. (ESYSTEMS), 140
Powers, A.K. (SS-NAD), 402
Qualters, I. (CRI), 14
Ramsey, C. (LANL), 392
Randers-Pehrson, G. (USARL), 507
Reinhardt, S. (CRI), 28
Rew, J. (NCAR), 372
Reynolds, J. (NERSC), 304
Robin, F. (CEA-CEL), 419
Ropelewski, A. (PITTSC), 39
Roth, M. (ARSC), 92
Saini, S. (CSC-NASA/Ames), 465
Schroeder, W.J. (GE-AE), 87
Schumacher, G. (NCAR), 446
Shaginaw, R.J. (Bristol-Myers Squibb), 76
Shorrel, G. (CRI), 255
Sides, S. (SDSC), 376
Simon, H. (CSC-NASA/Ames), 465
Sloan, J.L. (NCAR), 387
Slowinski, D. (CRI), 291
Snavely, A. (SDSC), 100
Spitzmiller, T. (LANL), 349
Spragg, D.A. (EUTEC), 239
Steidel, J. (CRI), 313
Storer, N. (ECMWF), 433
Stouch, T.R. (Bristol-Myers Squibb), 76
Strand, B. (CRI), 119
Thorp, D. (CRI), 221
Thorp, J. (CRI), 80
Tovo, P.A. (CRI), 369
Vann, L. (CRI), 251
Warraich, K. (Texas A&M), 505
West, S. (CRI), 212
Williams, W. (CRI), 324
Wimberley, F.C. (PITTSC), 39
Witte, K.F. (LANL), 65
Wood, R.R. (LLNL), 209
Young, W. (PITTSC), 39
Zais, J. (CRI), 442
Zumach, B. (PITTSC), 153

GENERAL SESSIONS

CHAIRMAN'S REPORT
John F. Carlson
Cray Research, Inc.
Eagan, Minnesota
Good morning. Thank you for inviting me to
participate today. I want to take the time available to
review our 1993 performance and discuss where we
plan to go in our strategic plan covering us from now
through 1996.
I am taking some time to review our financial
condition because you are all investors in Cray
Research. Whether or not you own any of our
common stock, you and your colleagues have invested
your efforts and energy and professional challenges in
Cray Research. I want to use this opportunity to
show that your investment is well placed.
Simply stated, we had a very good year in 1993. We
delivered on our 1993 plan by increasing revenue,
improving margins, reducing costs and delivering our
shareholders solid profits for the year of $2.33 per
share versus a loss for 1992.
Our financial performance enables us to continue
investing at least 15 percent of our revenues into
research and development. No matter what changes
the market brings or we bring to the market, that
commitment will not change.
We achieved those financial results while also
delivering two major hardware platforms and a number
of software advances -- including the creation of
CraySoft -- our initiative to make Cray software
available on non-Cray platforms, right down to PCs.
I am pleased to note, also, that we delivered our new
hardware and software products on time, as promised.
Clearly, we have recovered from our 1992 restructuring
and are realizing the operating savings we hoped would
result from those difficult actions.
For a few specifics, our 1993 revenue grew 12 percent
to $895 million from $798 million in 1992. This
substantial increase resulted principally from strong
sales of large C90 systems. We sold 12 C916
systems in 1993 -- double the number sold a year
earlier. And having moved on from the C90's
introductory stages, we achieved better margins -- which certainly helped our earnings results.

We also improved our balance sheet along the way.
We grew our total assets by 15 percent to about $1.2
billion from 1992, generated about $170 million in
cash, improved stockholder's equity by $56 million
and our book value at year-end 1993 was roughly $30
a share.
Return on shareholder's equity was eight percent for the
year and return on capital employed rose to 11 percent.
While these figures are relatively good, they do not yet
hit the levels we have targeted for the company. Our
targets, long term, are to deliver 18 to 20 percent
improvements in stockholder's equity and 15 percent
improvement on our return on capital employed.
Also, we ended 1993 with an increase in inventories of
about $52 million. That increase is one figure going
the wrong direction and reflects the ramp-up
investments we made to launch both the T3D and
CS6400 and also the high level of deliveries we
anticipate completing this quarter.
In 1993 we signed a total net contract value of orders
of $711 million compared with $598 million for
1992. This 19 percent increase really reflects the
strong demand for C916s and T3Ds. Our order
backlog at year-end was $409 million, just shy of the
record $417 million we reported in 1992. And, yes,
the delayed acceptance of a C916 from December 1992
to January, 1993 was included in the 1992 record
number. Backing out that system for comparison
purposes confirms that our backlog number for 1993
is going in the right direction, upward.
Today, we have about 500 systems installed around the
world. Those systems are broadening out
geographically and by industry. In 1993, we added
new first-time customers in Malaysia and
Czechoslovakia and installed our first system in Africa
at the South African weather bureau. We also received
orders for our first systems to China -- in both the
PVP and SMP line. Three Superserver systems are
now installed at Chinese universities. The PVP
mainline system will be installed at the Chinese
Weather service, assuming the export license is granted
as we expect.
Right now we have systems installed in 30 countries.

Copyright © 1994. Cray Research Inc. All rights reserved.


The orders break out by market sector to: 44 percent
government customers, 29 percent commercial and 27
percent with universities.
Our commercial and industrial customers continue to
use our systems to deliver solutions for their technical
and scientific computing needs. I expect this sector to
continue to grow and increase its overall percentage of
total orders.
Insofar as our product mix is concerned, 1993 was a
very good year. As you know we stepped up deliveries
of our C916 systems. We also announced the
availability of extended C90 family ranging in size
from two to 16 processors. This range of availability
helps make the C90 product more flexible and
available to our customers in convenient
configurations. Convenient in size to fit your mission
and, hopefully, your budget.
The C90 continues to set the performance pace for our
customers. As you may know, more than 20 key
codes used by our customers run at sustained speeds
exceeding six gigaflops. Five of these 20 codes
reached speeds of more than 10 gigaflops each. And as
we speak, more performance records are being set.
These records are far more than bragging points. They
are the ultimate measures of getting real work done on
real problems. That will remain a corporate priority
for all our systems.
In the MPP arena, we had a very good launch for the
T3D system in September. On announcement day we
figure we captured third place in the MPP market and
expect to be the number one supplier by the end of
this year.
By year-end we had 15 orders in hand. I am
particularly pleased to note that this week we are able
to announce our first commercial and first petroleum
customer for the T3D -- Exxon Corporation. It was
particularly gratifying to read their news release in
which they described the T3D as a critical tool to their
growth strategies because it is the first production
quality MPP system in the market. As I said with the
C90 line, that means Exxon will be doing real work
on real problems day in and day out on the T3D.
We're also announcing this week some significant
benchmark results for the T3D on the NAS parallel
benchmark suite. Now I know that there has been a
rush to claim moral or technical victory by any
number of MPP companies using NAS results and I
promise not to enter into that debate. But I do want to
bring two points to your attention.
First point has little to do with MPP benchmarks.
Rather, it has to do with the results they are compared
to. The 256-processor system is the first of the MPP
crowd to approach the C916's performance on any of
the NAS parallel benchmarks beyond the
Embarrassingly Parallel test. So the C916, the
workhorse of the supercomputing field continues to be
the performance leader on these benchmarks. But at
256 processors the T3D is starting to give the C916 a
run for its money on highly parallel computing, and
since the T3D is showing near-linear scalability we're
confident it will widen its performance and scalability
leadership as we move to 512-processors and larger
configurations. We're also glad it took another system
from Cray to approach the C916's performance.
Second point is to note that these results arrived just
six months after the T3D was introduced. We all
know that we were late to the MPP party. But I think
it is even more important to note what is being
accomplished following our arrival. Some of our
competitors have had two years or more to improve
their performance against the C916. I can only
assume that they have not announced their complete
256 results to date because they can't efficiently scale
up that far.
The strength of the T3D -- and those systems to
follow -- reflects the input received from many of you
here. This system stands as an example of what can
be accomplished when we listen to our customers.
The strength of the T3D will be enhanced by another
program that had an excellent year -- the Parallel
Applications Technology Partners program, or PATP.
When MPP performance shows steady, consistent
improvement in a wide range of applications, it will
be in part due to the efforts underway between Cray
and its PATP partners, PSC, EPFL, JPL and Los
Alamos and Livermore labs. Great things are coming
from these collaborations.
The third sibling of the Cray architecture family is our
Superservers operation, based in Beaverton, Oregon.
They also delivered a beautiful, bouncing and robust
baby in 1993 -- the CS6400. The CS6400 is a binary
compatible extension of the Sun product line. It
serves as a high performance server for networks of
workstations and smaller servers running the huge
range of Sun applications. Any program that runs on
a Sun system will run on the CS6400 without
modification and the current product is Solaris 2.3
compatible.
We expect this newest arrival to create opportunities
for us in markets where we have not traditionally had a
presence. They include mainstream commercial
applications in financial services, banking,
telecommunications, government data processing
applications, et cetera. It is proving to be an excellent

alternative to a mainframe in the so-called
"rightsizing" market. It is also important to note that
the CS6400 is an excellent alternative to small-configuration MPP systems for both the commercial
and scientific and technical markets.
Introduction of the Superserver subsidiary brings me to
an important point and sets the stage to discuss our
strategic direction. First the important point: Cray
Research is focused on its mission of providing the
highest performing systems to its customers. We are
not architecture constrained. In fact, we are the only
firm that can provide top-performing systems in all
three major architectures -- Parallel Vector, Massively
Parallel, and Symmetric Multiprocessing.
You and your colleagues want solutions. We want to
help you get them. It matters little to us whether your
ideal solution comes from one specific architecture or
another. What matters is that Cray Research continue
as a technical partner in finding the solutions.
We don't have to -- nor do we want to -- make two
sales every time we talk with you. We don't have to
sell you on one architecture or another and then, next
sell you on our product over a competitor's. We
remain focused on helping deliver the highest
performance possible in each architecture. We are
certain that if we continue to stick to our knitting with
that focus, we will be successful as you become
successful with our systems.
Now I want to tie this together to our strategic plan.
Our plan is simple. We have to grow in order to
accomplish the things that are important to you and to
us.
If we want to remain at the head of the performance
class, we must continue to invest huge sums each year
in R&D. Those sums are only available if we
continue to grow revenues, earnings, margins and
backlogs. We wouldn't have to change anything in our
strategy if the technical and scientific market were
growing, say, 30 to 40 percent as it did in the 1970s.
But the reality is that the market is flat and, depending
on how you measure it, has been for four consecutive
years.
At the same time, we don't want to change who we
are. We are a technology driven company. We will
always be a technology driven company committed to
achieving the highest performance possible at the best
price.
So, while our focus can't change, our tactics can. As
we maintain our marketshare leadership at a relatively
flat level in the scientific market -- for now, at least --
we see a growing need for efficient, price-effective
systems to solve big problems in other markets.
Commercial entities of all sizes and shapes are
recognizing that they have done a great job in
accumulating huge data bases, but are limited in how
well they can manipulate or mine these data bases to
their advantage.
Some businesses have said they forgot the second part
of the equation -- "what are we going to do with all
this information and how, pray tell, are we going to
use it to grow our businesses?"
Here's where we believe Cray Research can help.
We believe that our technologies can help. Across all
three of the important high performance architectures
our products are distinguished by their technical
balance. Balance that unlocks the problem-solving
potential of increasingly fast microprocessors --
regardless of the architecture in which they are
embedded.
We combine fast processors with high-performance
memory and I/O subsystems and robust, proven
software to deliver unsurpassed balance right to your
desktop. That won't change. But we may be able to
add new applications in other parts of a commercial
enterprise.
We plan to use this unique strength to grow our
volume and market reach. Instead of finding a Cray
system at the R&D center only, I can picture one
being used by the marketing or finance functions as
well.
Just this last year we've seen new customers emerge
from non-traditional fields like Wall Street, and I
expect to see more and more utilization at new
locations.
In doing so, much like the early successes of the
superserver product, we'll probably draw some
attention. Some of our nervous competitors will
whisper that we've "lost focus" as we compete for and
win commercial business. Quite the contrary, I
believe our move to add customers in non-traditional
supercomputing areas is an example of our unique
focus. We see this approach as the best way to ensure
that we don't blink when we face the future. We
remain committed to the top of the performance
pyramid. We remain focused on the need to invest at
least 15 percent of our revenues into R&D so we can
move our performance up the scale to match your
demands. And we remain focused on the importance of
long-term, cooperative relationships with our
customers. And that may be the best single
investment we've ever made.
Thank you again for inviting me today.


Cray Research Corporate Report
Robert H. Ewald
Cray Research, Inc.
900 Lowater Road
Chippewa Falls, WI 54729

ABSTRACT
This paper provides an overview of Cray Research's business in 1993, an update on progress
in 1994, and a vision of Cray in the future.

1. 1993 Results

1993 proved to be a very good year for Cray Research,
Inc. We achieved our 1993 plan by delivering a broad
mix of high performance systems and software, and
achieved a return to solid profitability. 1993 was also a
year of substantial change for Cray Research, but before
describing the changes, I will review some terminology.
The "Supercomputer Operations" business unit is
focused on high performance, cost effective computational tools for solving the world's most challenging
scientific and industrial problems, and produces our parallel vector and highly-parallel products.
In this technical computing world we have defined three
segments that encompass different customer characteristics. The "Power" segment can be recognized as those
with "grand" challenge problems. These customers are
typically from the government and university sector and
require the highest performance solutions of scientific
and engineering problems. Most of the applications are
mathematically-oriented and are created by the customer. This segment typically invests more than $10
million in computer products. The "Endurance" segment can be characterized as customers with production
problems to be solved. These are a combination of
industrial, government, and academic customers each
solving high performance production engineering, scientific, and data-oriented problems. In this segment,

Copyright © 1994. Cray Research Inc. All rights reserved.


applications are frequently from third party providers.
Price/performance is of major concern for these products, ranging from about $3 million to $10 million. The
"Agile" market segment is comprised of industrial, government, and academic customers with engineering, scientific, and data-oriented problems. The applications
are generally third party and their price/performance is
of paramount importance for the production capability.
The hardware investment here is less than $3 million.
The "Superserver" business unit addresses applications
that require rapid manipulation and analysis of large
amounts of data in a networked production environment, and produces products based on Sun Microsystems' SPARC architecture. Many of the applications
are at non-traditional Cray customers in the business or
commercial segment.
As shown in Figure 1, we are changing our organization
to reflect these concepts. Les Davis and I share the
office of Chief Operating Officer and we have two
major business units: Supercomputer Operations and
Superserver Operations. Changes within the organization since the last CUG meeting include:
- Gary Ball, Vice President of Government Marketing
and General Manager of "Power" Systems - in addition
to his previous job, Gary is leading our efforts to continue to lead in the high-end of supercomputing.

- Rene Copeland, General Manager of "Endurance"
Systems - Rene is leading our efforts to ensure that we
provide products and services for the industrial, production-oriented customers.
- Dan Hogberg, General Manager of "Agile" Systems - Dan is leading our work with deskside and departmental systems.
- Dave Thompson, Vice President of Hardware
Research & Development - Dave has moved from the
Software Division to lead our hardware development
and engineering efforts. Dave is teamed with Tony
Vacca who also leads the Triton and other projects.
- Don Whiting, Sr. Vice President, Operations - we
have combined all product operational divisions (Integrated Circuits, Printed Circuits, Manufacturing, and
Systems Test & Checkout) under Don to improve our
operational processes and efficiency.

- Larry Betterley, Vice President of Finance & Administration - Larry leads the operational Finance and
Administration groups within the business units.
The Superservers business unit remains the same, and
in addition, Paul Iwanchuk and Penny Quist are acting
leaders of new initiatives in Government Systems &
Technology and Data Intensive computing.
Figure 2 shows the progression of orders and the results
for 1993, indicating our best year ever for orders for our
larger products and a good base of orders for Agile and
Superserver products.
During 1993 we installed half of the systems shipped in
North and South America and one-third in Europe. As
shown in Figure 3, that brings the total installed base to
54 percent North and South America, 30 percent
Europe, and 11 percent Japan. We expect that more of
our future business will come from non-U.S. based customers, so the U.S. percentage will continue to decline.
The Agile installs in 1993 were distributed 45 percent
to the Americas, 33 percent to Europe, and 14 percent
to Japan. This brings the total Agile installed base to 40
percent Americas, 33 percent Europe, and 19 percent
Japan, as shown in Figure 4. This again reflects the
increasing "internationalization" of our business.
Figure 5 shows the installed base at year-end 1993 by
industry. The most significant changes in the distribution that occurred during the year were increases in the
university, aerospace, and environmental segments,
and a decrease in the government research lab business
reflecting changing government spending.
Figure 6 shows the distribution of the 500 systems
installed at customer sites. The Y-MP and EL products
are dominant with C-90s and T3Ds just beginning to
ramp up.

There were 38 new customers in the Supercomputer
business unit and 12 in the Superserver business unit.
We welcome all of you to the growing Cray family and
hope you will share with us your ideas for the future of
computing.
We ended 1993 with the strongest offering of products
we have ever fielded. We offer three architectures,
each with superior price/performance: parallel vector,
massively parallel, and symmetric multiprocessing. The
parallel vector product line was expanded by extending
the Cray C-90 to configurations ranging from 2 to 16
processors. We began shipping the C-92, C-94, and
C-98 at mid-year. All the C-90 series systems feature
the newest, fastest memory technology available: four-megabit
SRAM (static random access memory). We
also introduced the big memory machines, the M90 series, with
up to 4 billion words of memory.
Two new departmental supercomputers were introduced: the EL92 and EL98. The EL92 is the company's smallest, lowest-priced system to date, with a
U.S. starting price of $125,000 and a footprint of about
four square feet. The Cray EL92 is designed to extend
Cray Research compatibility to office environments at
attractive prices. The EL98 is available with two, four,
six, or eight processors, offering up to 512 million
words of central memory and providing a peak performance of one gigaflops (billion floating point operations per second).
During 1993 we expanded our product offerings for the
scientific and technical market by introducing our first
massively parallel supercomputer, the CRAY T3D. We
now have five installed and are expecting to install over
thirty this year. Reaching that target will make us the
leading MPP vendor in 1994.
In addition to these products, the CS6400, a
SPARC-based symmetric multiprocessing system, was
introduced with configurations ranging from 4 to 64
processors. This platform is for commercial data management applications as well as scientific and technical
users. We expect to install over 50 of these new
machines in 1994.
New announcements in software include the release of
the Network Queuing Environment (NQE) and our Fortran
compiler (CF90). Our new business entity, CraySoft,
was successfully initialized to provide Cray software
products and technologies to non-Cray platforms.

2

1994 Plans

As we start 1994, we have set some aggressive goals for
ourselves that include:


- Continue to have a strong financial foundation with
some growth over 1993.
- Continue to invest about 15 percent of revenue in
R&D to develop hardware and software products and
services.
- Continue R&D work on all three product families:
parallel vector, MPP, and symmetric multiprocessing.
- Continue to unify our UNICOS software base across
our PVP and MPP platforms.
- Continue to make more software available on
workstations and servers.
- Improve the application availability, performance,
and price/performance on our systems.
- Continue our reliability improvements.
- Better understand the "cost of ownership" of our
products and make improvements.
3

Strategic Summary

Cray Research is one of the world's leading technology
companies. At the heart of our company are four core
technology competencies (Figure 7) which set us apart
from others, and upon which we will build our future:
1) Hardware and manufacturing technology to create
fast, balanced computers.

2) Applications and software that support production
computing.

3) The ability to solve problems in parallel.
4) Distributed computing skills that allow us to deliver
results directly to the user via a network.
We will focus on these four technology strengths to
create a set of products and services that help solve our
customers' most challenging problems. We will leverage
our technology to provide hardware products that
range from deskside systems to the world's most powerful
computers, as shown in Figure 8. Our software
products and services will enable us to deliver production
computing results to the user anytime, anywhere.
To enable this, some of our software and services will
be applied to other vendors' systems, primarily
workstations. Our CraySoft initiative addresses this market.
We will also expand our service offerings to better help
our customers solve their problems. We will develop
new distribution mechanisms in the form of service
products that enable us to package our hardware, software,
and application products as total solutions to our
customers' problems. We will develop plans to enable
our customers to buy total computational services on
demand, in ways that are consistent with their needs
and their internal budget constraints (shown as Anytime,
Anywhere Solutions in Figure 9).
Spanning all of our systems is the key element that
allows the customer to buy and use our systems - the
application programs that solve the customer's problem. We will port, adapt, optimize, create, or attract the
applications that the customers require. We will lead
the industry in converting applications to run in parallel
and deliver the performance inherent in parallel systems.
We also recognize that our core competencies, products,
and services can be applied to problems which are
more "commercial" in nature. We will focus our
supercomputing business on technically based problems,
and will focus our Superserver business and some new
efforts on helping solve open-system, performance-oriented
commercial problems. Typically, these commercial
problems will require the rapid manipulation and
analysis of large amounts of data in a networked, production
environment, as businesses seek to better understand
their data and business processes and make better
decisions more rapidly. Within both technical and
commercial segments, we will also become more customer
driven.
We will also open a pipeline between our technology
and selected government customers with our government
systems and technology effort. As the government
customers require, we may:
1) Tailor existing products to better suit application
needs.
2) Perform custom R&D work.
3) License selected core technologies for new
applications.
In the commercial markets, we will also tailor our products
to meet a new set of problems: those that are more
"data intensive." We will build our Superserver business
on top of Sun's business. We will understand the
market requirements for extracting and creating new
information from databases for commercial, industrial,
and scientific applications. We will then apply our
technology and products to that marketplace, and consider
partnerships to strengthen our position, particularly
at the high end of the commercial business.
Putting all of these pieces together yields a customer-driven
technology company, as depicted in Figure 10.
4

References

[CRI93] Cray Research, Inc. Annual Report. Cray Research, Inc., February 1994.


CRI Operational Organization
February 1994

Figure 1

Orders per Year, 1986-1993
(CRS, Agile, and Power/Endurance products)

Figure 2

Cray Installed Base
By Geography
December 31, 1993
(N & S America: 172 systems, 53.6%; Europe; Japan; Pacific/MDE)

Figure 3

Cray Agile Installed Base
By Geography
December 31, 1993
(N & S America; Europe; Japan; Pacific/MDE)

Figure 4

Cray Customer Installed Base
By Industry
As of December 31, 1993
(University; Chem/Pharm*; Other*; Research Labs*; Automotive; Petroleum; Aerospace*; Env/Weather*)
*Includes commercial/government

Figure 5

Installed Customer Systems
(distribution across CRAY-2, X-MP, Y-MP, ELs, C90, and T3D product lines)

Figure 6

TECHNOLOGY
Fast Computers
Production Oriented
Parallel Computing
Network Delivery

Core Technologies

Figure 7

Product Core

Figure 8

Technical Solution Business
(POWER, ENDURANCE, and AGILE systems serving the technical market)

Figure 9

Overall Strategy Concepts
(ENDURANCE and AGILE spanning the technical and commercial markets)

Figure 10

CRI Software Report
Irene M. Qualters
Senior Vice President
Cray Research, Inc.
655F Lone Oak Drive
Eagan, Minnesota 55121
This paper describes the Cray Research Software Division plans and strategies for
1994 and beyond.

1

Introduction

Since the last CUG, CRI improved software reliability
while increasing the installed base by 30%, improving
performance, and increasing functionality. We also
changed our organization to create smaller groups,
allowing greater focus and agility. This paper covers
these topics as well as recent and planned product
releases for operating systems, storage systems, connectivity,
and programming environments. The
progress by CraySoft is presented and product retirements
are noted. At the end of this paper, the status of
the recent CRAY T3D installations is highlighted.

2

Software Division Reorganization

Our two largest groups, UNICOS and Networking,
were reorganized into three new groups: Operating Systems,
Networks, and Storage Systems. Dave Thompson
(the former UNICOS group leader) was promoted to
Vice President, Hardware R&D in Chippewa Falls.
Paul Rutherford is the leader for the new Storage Systems
group. Wayne Roiger is the group leader for Networking.
Kevin Matthews is acting as the group leader
for Operating Systems until a permanent leader is chosen.
(See figure 1 for the updated organization chart.)

3

Reliability

From August 1993 through February 1994 the Software
Problem Report (SPR) backlog, incoming rate, and fix
rates improved (see figure 2). The six-month rolling
software and system MTTI also improved over this
period (see figure 3). The box plot metric (figure 4)
confirmed this improvement, but shows stability is still
an issue at some sites. The dashed lines at the bottom of
the chart indicate sites with stability substantially below
the mean. The number of sites with low stability has
decreased considerably over this period; nevertheless, in

Copyright © 1994, Cray Research, Inc. All rights reserved.


1994 we will emphasize reliability improvements at
sites with low MTTIs.
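A six-month rolling MTTI of the kind plotted in figure 3 can be computed as a trailing ratio of system-hours to interrupts. The sketch below is illustrative only: the monthly totals are hypothetical, and the paper does not give CRI's exact metric definition.

```python
# Illustrative six-month rolling MTTI (mean time to interrupt).
# The monthly system-hour and interrupt totals below are hypothetical;
# CRI's actual metric definition is not stated in the paper.

def rolling_mtti(hours, interrupts, window=6):
    """MTTI over a trailing window: total system-hours / total interrupts."""
    out = []
    for i in range(window - 1, len(hours)):
        h = sum(hours[i - window + 1 : i + 1])
        n = sum(interrupts[i - window + 1 : i + 1])
        out.append(h / n if n else float("inf"))
    return out

# Hypothetical data for Aug-93 .. Feb-94 (7 months): a growing installed
# base (more system-hours) with fewer interrupts per month.
monthly_hours = [72000, 72000, 74000, 74000, 76000, 78000, 80000]
monthly_ints  = [30, 28, 26, 24, 22, 20, 18]

print([round(m) for m in rolling_mtti(monthly_hours, monthly_ints)])
```

With these assumed inputs the rolling MTTI rises month over month, which is the shape of improvement the figure describes.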

4

Operating Systems

4.1

UNICOS 8.0 Features (3/10/94)

UNICOS 8.0 was released on March 10, 1994. Its
themes include robustness and resiliency, performance,
security, standards, and new hardware. On March 9,
1994, UNICOS 8.0 received an official DoD
Orange/Red Book B1 MDIA rating. This culminates
four years of effort, making CRI the sole supercomputer
vendor evaluated as a network node. UNICOS 8.0
improves our compliance with standards with direct
FDDI connectivity and improved POSIX 1003.2
compatibility.
The new platforms supported include M90, C90, EL98,
EL92 (plus UNICOS support for the T3D), and J90, as well
as DA-60/62/301, DD-301, and ND-12/14 disks; DD-5 and
RAID 10 disks for EL; EMASS ER90 D2 and Datatower;
3490E tapes; SSD-E 128i; and the FCA-1 channel adapter.
UNICOS 8.0 will also be supported on CRAY X-MP systems
(with EMA support) and CRAY-2 systems.
4.2

Field Tests - UNICOS 8.0

Extensive field tests were completed. We ran six field
tests, installed four prereleases, and have nine additional
installs in progress. We tested the widest variance of
platforms to date: CRAY X-MP, CRAY Y-MP, CRAY
C90, and CRAY Y-MP EL platforms. We included a
production tape site, with excellent results. Stability
improved compared to 7.0; the SPR count was reduced
and the severity of SPRs declined.
UNICOS 8.0 multithreading performance measurements
at NASA/Ames showed a 60% reduction in system
overhead, compared with UNICOS 7.0.
4.3

UNICOS 9.0 (3Q95)

UNICOS 9.0 will support Triton and IOS-F hardware. It
will support additional standards: X/Open XPG4 branding,
ATM, and ONC+ (NFS V3). It will improve the
heterogeneous computing capabilities with the Shared
File System (SFS) and UNICOS MAX.
A major theme for UNICOS 9.0 is Reliability, Availability,
and Serviceability (RAS). Specific RAS features
are UNICOS under UNICOS and checkpointing tapes.
UNICOS 9.0 potentially will support additional peripherals:
Escon/FCA-2, the Redwood Autoloader, and IBM NTP.

4.4

UNICOS MAX Releases

UNICOS MAX 1.1 will be released in 2Q94. It
includes improvements in resiliency and interactive
scheduling. It supports the CRAFT programming
model.
UNICOS MAX 1.2 is planned to be released in 4Q94.
It will support phase-II I/O and job rolling (coarse-grain
timesharing). Phase-II I/O adds HISP channels that act
as direct data channels between IOCs and the MPP.
Control remains on the host, and the IOCs maintain
physical HISP/LOSP connections to the host. This
generally increases the number of I/O gateways that can
function in parallel.
UNICOS MAX 1.3 is planned to be released in 2Q95,
with support for phase-III I/O. Phase-III I/O allows
IOCs to attach to the MPP without physical connections
to the host. Control remains on the host; the host
controls the remote IOCs through virtual channels that
pass through the MPP. This allows the number of IOCs
to exceed the number that can physically connect to the
host.

5

Storage Systems

Several new file systems will be available in 1994 and
1995. All of these features are unbundled and must be
ordered separately. DCE/DFS (the Distributed File
System) will be released for UNICOS 8.0 in 3Q94.
ONC+/NFS3 will be available with UNICOS 9.0. The
Shared File System (SFS) will first be available in UNICOS
8.2 and then in UNICOS 9.0. SFS will support
multiple UNICOS hosts sharing ND disks.
Planned hierarchical storage management improvements
include DMF 2.2, UniTree, and FileServ support.
With the introduction of DMF 2.3, we plan to
offer client/server support for SFS. This means a DMF host
could act as a DMF server for systems that share the
SFS disks.
On Y-MP ELs we will offer the following peripheral
upgrades: SCSI disks (DD-5s), IPI disks (DD-5i), the
Metrum tape library, and a SCSI tape and disk controller
(SI-2).

6

Connectivity

6.1

ATM

ATM pilot tests will run from 2Q94 through 4Q94.
ATM will be connected to CRAY Y-MP and CRAY
C90 hosts through Bussed Based Gateways (BBGs)
and with native connections on CRAY Y-MP EL systems.
The BBGs will support OC3 and OC12 (622
Mb/s). The native CRAY Y-MP EL connections will
support OC3 (155 Mb/s).
Software support for ATM will be in UNICOS 9.0.
Long-term plans are to prototype OC48 (2.6 Gb/s) in
1995 on IOS-F and to prototype OC192 (8 Gb/s) in this
decade on IOS-F.
6.2

NQX

NQX helps NQS on UNICOS to interoperate with
NQE. When one adds NQX to NQS and FTA, the
result is the NQE capabilities on a UNICOS host. NQE
will replace RQS and Stations.

7
Cray Application
Programming Environment
7.1

Cray Programming Environment 1.0

The CF90 Programming Environment 1.0 was released
in December 1993. Cray Research was the first vendor
to release full native Fortran 90. Twenty-six CF90
licenses have been purchased.
On average, CF90-compiled code runs within 10% of
the speed of CF77 code. Some codes run faster, some
slower.
A SPARC version of CF90 will be available in 3Q94.
The CF77 6.1 Programming Environment for the
CRAY T3D will be released in 2Q94. Its main feature
is support for the CRAFT programming environment.
The C++ Programming Environment 1.0 for the CRAY
T3D will also be released in 2Q94. With this release,
the MathPack.h++ and Tools.h++ class libraries will be
available. These libraries are unbundled and must be
purchased separately.
7.2

Programming Environment 2.0

The CF90 Programming Environment 2.0 for all platforms
(PVP, MPP, and SPARC) will be released in
2Q95. A goal we expect to meet for CF90 2.0 is to
exceed the performance of CF77.
The C++/C Programming Environment 2.0 for PVP
and MPP platforms will be released in 2Q95. Note: the
C++/C Programming Environment 2.0 is intended to
replace the Cray Standard C Programming Environment
1.0 and the C++ Programming Environment 1.0.
C++/C is both an ANSI Standard C compiler and a C++
compiler.

7.3

Distributed Programming Environments

The Distributed Programming Environment (DPE) 1.0
is scheduled for release in 3Q94 with support for the
SunOS and Solaris platforms. It includes the front end
for CF90 that allows test compiles on the workstation,
and dpe_f90, which performs remote compiles on the
host. It provides ATExpert and Xbrowse locally on the
workstation, linked through ToolTalk to the host.
DPE 2.0 is planned to be released in 2Q95 for the
Solaris platform. It will add a full CF90 2.0 cross compiler
and native Cray TotalView support through
ToolTalk. With DPE 2.0, the compiler will produce
host object code on the workstation, and the dpe_f90
command will upload the data to the host and link it on
the host with transparent remote commands.

8

CraySoft™

CraySoft released its first product in December 1993:
NQE for Solaris. This included NQS, a load balancer,
queuing clients, and the File Transfer Agent (FTA).
In 3Q94, CraySoft plans to release NQE 1.1 for multiple
vendors, including workstations and servers from
IBM, SGI, HP, Sun, and Cray Research.
Also in 3Q94, CraySoft plans to release DPE 1.0 for
Solaris. (See the Distributed Programming Environments
section above.)
In addition, in 3Q94, CraySoft plans to release CF90
1.0 for Solaris.

9

Product Retirement

Ada and Pascal will be placed in maintenance mode one
year from now. The last feature development for these
products is complete. They will be supported on
CRAY Y-MP, CRAY M90, and CRAY C90 platforms
through UNICOS 9.0. They will not be offered on new
hardware, such as the CRAY J90 series.
CrayDoc will replace DocView; DocView will be
placed in maintenance mode one year from now.
The final OS platform on which DocView will be supported
is UNICOS 9.0. CrayDoc will be available with
UNICOS 8.0.3.


10

CRAY T3D Product Status

Six sites have installed CRAY T3D systems, and we
have received fifteen additional orders. The user base
is diverse, including industry, government, and university
customers. The hardware in the field has been
highly reliable. The software is stable and maturing.
The I/O is performing as expected: over 100 MB/s on
one IOG to SSD and over 350 MB/s on four IOGs to
SSD.
The CRAY T3D system is the only MPP that has run all
eight of the NAS Parallel Benchmarks on 256 processors.
All other vendors have been unable to scale all
eight benchmarks to this level of parallelism. The
CRAY C916 system runs all but one of the benchmarks
faster than a 256-processor CRAY T3D system, but the
CRAY T3D system is approaching the CRAY C916
performance on many of these benchmarks and is
expected to exceed the CRAY C916 performance when
scaled to 512 processors. The scaling has been excellent;
the performance on 256 processors was almost
double that of 128 processors (see figure 5).
The CRAY T3D system is proving to be an excellent
graphics rendering engine. Microprocessors excel at
this task, compared with vector processors. The 256-processor
CRAY T3D system runs a ray-tracing application
7.8 times faster than a CRAY C916 system.
The heterogeneous nature of the CRAY T3D system is
proving to be an advantage for some codes. The
SUPERMOLECULE code is a good example. It contains
a mixture of serial and parallel code. Running the
serial code on a fast CRAY Y-MP processor substantially
improves the scalability. If all the code, including
the serial portion, is run on the MPP, the code runs only
1.3 times faster on 256 processors than on 64 processors.
When the serial portion of the code is run on a
CRAY Y-MP CPU, the program runs 3.3 times faster
on 256 processors than on 64 processors (see figure 6).
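These SUPERMOLECULE figures are consistent with a simple Amdahl-style model. The sketch below is an illustration, not the paper's analysis: the serial fraction and the Y-MP speed ratio it reports are inferred from the two quoted speedups, not values given in the text.

```python
# Amdahl-style model of the SUPERMOLECULE scaling figures (illustrative
# sketch). Work is normalized to 1; the serial part runs at `serial_speed`
# relative to one T3D PE, and the parallel part scales with the PE count.

def speedup_64_to_256(serial_frac, serial_speed=1.0):
    """Ratio of runtime on 64 PEs to runtime on 256 PEs."""
    def t(p):
        return serial_frac / serial_speed + (1.0 - serial_frac) / p
    return t(64) / t(256)

def bisect(f, lo, hi, tol=1e-12):
    """Root of f in [lo, hi], assuming f(lo) and f(hi) differ in sign."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

# Serial fraction implied by the observed 1.3x homogeneous (all-MPP) speedup.
s = bisect(lambda x: speedup_64_to_256(x) - 1.3, 1e-6, 0.5)

# Y-MP speed on the serial part implied by the 3.3x heterogeneous speedup.
k = bisect(lambda x: speedup_64_to_256(s, serial_speed=x) - 3.3, 1.0, 1000.0)

print(f"implied serial fraction: {s:.3f}")            # ~0.034
print(f"implied Y-MP speed on serial code: {k:.0f}x")  # ~30
```

Under these assumptions, a serial fraction of only a few percent is enough to cap the all-MPP scaling at 1.3x, and offloading that fraction to a processor roughly 30 times faster on scalar code recovers the reported 3.3x.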

11

Summary

Cray Research remains committed to an Open Supercomputing
strategy.
We build our systems using standards, such as POSIX,
System V, Solaris, X/Open, ANSI, and COSE. We
concentrate on performance, such as scalable parallel
processing and automatic parallelizing compilers. We
excel in resource management, such as comprehensive
accounting and NQS production batch facilities. We
have the most comprehensive UNIX security on a
major computing platform, including Trusted UNICOS,
with an official Orange/Red Book B1 MDIA rating.
These security features are very useful for
commercial sites.
We are constantly improving our data accessibility with
features such as ONC+/NFS3, DCE/DFS, hierarchical
storage management (DMF, UniTree, FileServ), FDDI,
HiPPI, and ATM. Finally, we will continue improvements
in providing a cohesive environment, including
COSE (CDE), the CraySoft Network Queuing Environment,
Fortran 90, and technology agreements with
workstation vendors such as Sun Microsystems, Inc.
We will continue our investments in Open Supercomputing
to constantly improve methods for bringing
supercomputing to the desktop.

12

Appendix: Explanation of Box Plots (Figure 4)

12.1

Comparing Estimated Software and System Time Between Failures

Figure 4 compares the distributions of customer Software
and System MTTI estimates. The solid line represents
the median customer Software MTTI, and the
dotted line represents the median customer System
MTTI. The boxplots allow us to see how these distributions
change over time. The graph compares the
MTTI distributions for Software and Systems for all
customer machines.

12.2

Boxplots

Boxplots are often used to give an idea of how a population
or sample is distributed, where it is centered,
how spread out it is, and whether there are outliers.
The box itself contains the middle 50% of the data.
The line in the middle represents the median. The
whiskers extend outward to cover all non-outlier data.
Outliers are plotted as small dashes. An outlier is a
point that "lies out" away from the main body of the
group.
The median is a "measure of location." That means it
is a nice metric for telling where we are centered. The
boxes help show us how spread out we are.
While boxplots do not display everything there is to
know about a data set, they are quite useful in allowing
us to compare one data set to another. By lining boxplots
up side by side we can often tell whether two or
more data sets are located around the same central
value, or whether they have the same amount of
spread.
We use a log scale on the Y axis, since otherwise we
would not be able to observe what is happening in the
lower quartile, and this is probably where we would
like to focus our attention.

12.3

Where does the Data Come From?

For each site a list was obtained from the IR database
of times when Software or System crashes occurred.
From this information we were able to estimate the
MTTI for each site at any moment in time. In each
graph the most recent month is not included. (We
expect data that will be reported late to bring the lower
part of the last month's box down.) For more information
please feel free to contact David Adams by email
(dadams@cray.com) or phone (612) 683-5332.

12.4

Analysis

All of the medians seem to be fairly stable. (They are
not moving up or down significantly.) The shaded
notch in the middle of each box is a 95% confidence
interval on the median for the box. If any two boxes
have shaded notches that do not overlap horizontally,
then those boxes have significantly different medians in
the statistical sense.

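The boxplot statistics described in the appendix can be sketched as follows. This is a minimal illustration assuming the conventional 1.5 x IQR outlier rule, which the paper does not state, and the MTTI values are hypothetical.

```python
# Minimal sketch of boxplot statistics (median, box, whiskers, outliers).
# Assumes the conventional 1.5 * IQR whisker rule; the paper does not say
# which outlier rule its plots use.

def boxplot_stats(data):
    xs = sorted(data)
    n = len(xs)

    def quantile(q):
        # Linear interpolation between closest ranks.
        pos = q * (n - 1)
        i = int(pos)
        frac = pos - i
        return xs[i] if i + 1 >= n else xs[i] * (1 - frac) + xs[i + 1] * frac

    q1, med, q3 = quantile(0.25), quantile(0.5), quantile(0.75)
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in xs if x < lo_fence or x > hi_fence]
    in_range = [x for x in xs if lo_fence <= x <= hi_fence]
    return {"median": med, "q1": q1, "q3": q3,
            "whiskers": (min(in_range), max(in_range)),
            "outliers": outliers}

# Hypothetical monthly MTTI estimates (hours) for a set of sites; one site
# with an unusually high MTTI shows up as an outlier.
mtti = [120, 300, 450, 500, 520, 610, 700, 800, 950, 5000]
stats = boxplot_stats(mtti)
print(stats["median"], stats["outliers"])
```

Plotting such summaries for each month, side by side and on a log scale, gives exactly the kind of comparison the appendix describes.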

Supercomputer Operations
Software Division

Irene Qualters, Senior Vice President, Software Division

Mike Booth, Group Leader - Compilers
Mark Furtney, Group Leader - TLC and MPP Software
Paul Rutherford, Group Leader - Storage Systems
Wayne Roiger, Group Leader - Networking
Pat Donlin, Project Leader - UNICOS Release
Leary Gates, Program Manager - ELS Software
Kevin Matthews, Acting Group Leader - Operating Systems
Janet Lebens, Group Leader - CraySoft
Nancy Hanna, Group Leader - Human Resources
Dick Nelson, Principal Scientist - Fortran/C Research
Pete Sydow, Group Leader - Systems Software
Bruce White, Group Leader - Operations & Administration

(Figure 1)

SPR Status
(backlog and fixed rate, 0-1000 SPRs, Aug-93 through Feb-94)

(Figure 2)

6 Month Rolling Software and System MTTI
(hours, 3/93 through 2/94)

(Figure 3)

Estimated Time Between Failures at Customer Sites
(median Software MTTI and median System MTTI; log scale from 1 hour to 10 years; 3/93 through 2/94)

(Figure 4)

NAS Parallel Benchmarks
Highest Performance Reported on Any Size Configuration
(C90 CPU equivalents for kernels EP, FT, MG, IS, CG and applications BT, SP, LU; CRAY C916 vs. CRAY T3D 256 PEs vs. other MPPs)
Note: No Other Major MPP Vendor Scaled All 8 Codes to 256 PEs

(Figure 5)

SUPERMOLECULE
Homogeneous versus Heterogeneous Performance
(relative performance vs. number of T3D processors, T3D alone vs. T3D + 1 Y-MP CPU; molecule = 18-Crown-6 3-21G; updated 2/3/94)

(Figure 6)

Unparalleled Horizons: Computing in
Heterogeneous Environments
Reagan W. Moore
San Diego Supercomputer Center
San Diego, California

Abstract
The effective use of scalable parallel processors
requires the development of heterogeneous hardware
and software systems. In this article, the Intel Paragon
is used to illustrate how heterogeneous systems can
support interactive, batch, and dedicated usage of a
parallel supercomputer. Heterogeneity can also be
exploited to optimize execution of task-decomposable
applications. Conditions for superlinear speedup of
applications are derived that can be achieved over both
loosely and tightly coupled architectures.

Introduction
Scalable heterogeneous parallel architectures are a
strong contender for future Teraflop supercomputers.
From the systems perspective, heterogeneous
architectures are needed to support the wide range of
user requirements for interactive execution of
programming tools, for production execution of
parallel codes, and for support for fast disk I/O. From
the application perspective, heterogeneity can also lead
to more efficient execution of programs. On
heterogeneous systems, applications can be
decomposed into tasks that are executed on the
appropriate hardware and software systems. For a
certain class of applications, superlinear speedups can
be achieved: the execution rate of the application can
be increased by a factor greater than the number of
decomposed tasks.
To demonstrate the viability of heterogeneous parallel
architectures, a comparison will be presented between
the job mixes supported on the Cray C90 and the Intel
Paragon XP/S-30 at the San Diego Supercomputer
Center. The observed C90 workload cannot be
efficiently executed on a homogeneous massively
parallel computer. The heterogeneous hardware and
software systems on the Paragon, however, do provide
support for a job mix similar to that of the C90. The
operating system software that controls the
heterogeneous resources on the Paragon will be
described. Conditions for achieving superlinear
speedup will be derived that are valid for both tightly
coupled architectures such as the C90/T3D, and for
loosely coupled architectures such as a Paragon and
C90 linked by a high-speed network.
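The sense in which task decomposition across heterogeneous hardware can yield superlinear speedup can be sketched with a two-task model. This is an illustration only, with symbols introduced here; the paper's own derivation appears later in the article.

```latex
% Two tasks with work $W_1, W_2$. Architecture A executes task $i$ at rate
% $r_{iA}$; architecture B at rate $r_{iB}$. Running both tasks on A takes
\[
T_A = \frac{W_1}{r_{1A}} + \frac{W_2}{r_{2A}} .
\]
% Decomposing the application and running each task concurrently on its
% matched architecture takes
\[
T_{\mathrm{het}} = \max\!\left(\frac{W_1}{r_{1A}},\ \frac{W_2}{r_{2B}}\right),
\qquad
S = \frac{T_A}{T_{\mathrm{het}}} .
\]
% If task 2 is poorly suited to A ($r_{2A} \ll r_{2B}$), the numerator is
% dominated by $W_2/r_{2A}$ while the denominator is not, so $S$ can exceed
% 2, the number of decomposed tasks: a superlinear speedup.
```

Intuitively, the baseline is penalized for running each task on mismatched hardware, so moving each task to the architecture it suits can buy more than the task count alone would suggest.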

Heterogeneity in Application Resource Requirements

The Cray C90 supercomputer supports a job mix with
widely varying application resource requirements. In
addition, the resource demands can vary significantly
between the CPU time, memory size, and disk space
required by a job. Jobs that need a large fraction, from
one-quarter to one-half, of any of these resources are
"boulders" that can dramatically affect the performance
of the C90. Examples of these types of resource
demands are support for fast turn-around for interactive
job development, execution of large memory
production jobs, and execution of jobs requiring large
amounts of disk space. "Boulders" constitute the
dominant need for support for heterogeneity on the
C90.
At SDSC, "boulders" are controlled through a dynamic
job mix scheduler [1-3]. The scheduler automatically
packs jobs in the C90 memory while satisfying
scheduling policy constraints. The turn-around time of
particular classes of jobs is enhanced while maintaining
high system utilization. The limiting resource that is
controlled on the C90 is memory. Enough jobs are kept
in memory to ensure that no idle time occurs because
of I/O wait time.
Job mix statistics for the C90 and the Paragon for the
month of January, 1994 are given in Table 1. The C90
at SDSC has 8 CPUs, 128 MWords of memory, and
189 GWords of disk space.

Table 1
Interactive and batch workload characteristics for the C90 and the Paragon for
the month of January, 1994

                                              C90    Paragon
Number of Interactive jobs                101,561      4,731
Number of Batch jobs                        8,279      1,111
CPU time interactive (processor-hrs)          410     13,409
CPU time batch (processor-hrs)              3,595    145,860
Average batch CPU time (processor-hrs)       0.43        131

There are several noteworthy items about the job mix
on the C90. Users rely on the C90 to support
interactive job development. These jobs constitute over
90% of the jobs, but use less than 10% of the CPU
time. Thus the dominant use of the C90 by number of
jobs is for fast interactive support of researcher
development efforts. The dominant use by execution
time is for deferred execution of batch jobs. Even for
jobs submitted to the batch queues, typically half of the
runs are for short execution times of less than five
minutes. Excluding these short execution time batch
jobs, the long-running batch jobs execute for about one
hour.
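As a quick consistency check on Table 1, the average batch CPU time rows follow from dividing the batch CPU totals by the batch job counts:

```python
# Consistency check on Table 1: average batch CPU time per job is the
# batch CPU total divided by the number of batch jobs.

c90_avg = 3595 / 8279          # processor-hours per batch job
paragon_avg = 145860 / 1111

print(f"C90:     {c90_avg:.2f} processor-hrs")  # 0.43
print(f"Paragon: {paragon_avg:.0f} processor-hrs")  # 131
```

Both results match the table's last row, and the two-orders-of-magnitude gap in average batch job size is itself a measure of how different the two machines' workloads are.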

Typical types of support needed for job development on
the C90 are shown in Table 2.
On the C90, these development support functions are
run in competition with the batch production jobs.
Although they comprise a small fraction of the total
CPU time, their need for fast turn-around times does
impose a heavy load on the operating system. On a
heterogeneous architecture, these functions could be
executed on a separate set of resources. Another
characteristic of the development tasks is their need for
a comprehensive set of UNIX system calls. Excluding
file I/O manipulation, most production batch jobs use a
relatively small fraction of the UNIX system call set.
A heterogeneous system that provides minimal UNIX
system call functionality for batch jobs, with a
complete set provided for development tasks, may also
allow optimization of use of system resources.

Table 2
Development Functionality Used on Vector Supercomputers

Function              Percent Wall-clock time
Archival storage      1.08%
Compilation           1.05%
Shell commands        0.73%
Accounting            0.57%
Editing               0.45%
Resource Management   0.17%

Heterogeneous Parallel Computers

The Intel Paragon in use at SDSC is a heterogeneous
system. It is shown schematically in Figure 1. The
different node types include 400 compute nodes with
varying amounts of memory (denoted by a circle), 5
nodes to support interactive job development (denoted
by S for service), 9 MIO nodes to support RAID disk
arrays (denoted by M), a HIPPI node to support access
to the 800 Mbit/second HIPPI backbone at SDSC
(denoted by H), an Ethernet node for interactive access
to the Paragon (denoted by E), and a boot node
(denoted by B). The positions labeled with an asterisk
are open slots in the system.

Figure 1
Node Organization for the Paragon

[2-D mesh node map, garbled in extraction: compute nodes (circles), service nodes (S), MIO nodes (M), HIPPI nodes (H), Ethernet node (E), boot node (B), and open slots (*); solid and dashed vertical lines mark the 32-MByte and interactive-partition boundaries.]


On the Paragon, traditional UNIX
interactive development tasks are
executed on the service nodes, while
parallel compute jobs are executed
on the 400 compute nodes. Note that
in Table 2, the two dominant
development efforts in terms of CPU
time are archiving of data and
compilation of codes. At SDSC, these
functions are migrated off of the service nodes onto compute nodes for execution in parallel.
Parallel MAKE jobs are typically executed on 4 compute nodes on the Paragon. The compute nodes to the right of the solid vertical line in Figure 1 have 32 MBytes of memory, while the remaining compute nodes have 16 MBytes of memory. The nodes to the left of the dashed vertical line are reserved for interactive execution of parallel compute jobs; the other nodes are controlled by the batch system.
The statistics presented in Table 1
indicate that this heterogeneous
system supports a workload whose
characteristics are similar to that of
the C90.
The number of interactive jobs on the Paragon only includes those jobs explicitly run in the interactive compute partition. Adding the traditional UNIX tasks executed on the service nodes would significantly increase this number. As on the C90, the amount of CPU time used by batch jobs is over 90% of the total time used. The number of interactive jobs on the Paragon exceeds the number of batch jobs by over a factor of 4. Thus the Paragon is supporting a large interactive job mix, with a concomitant need for fast turn-around, in addition to a production workload that is processed through batch. The preferred number of nodes for the batch jobs is 64. The average batch job execution time is about two hours.
Up to 25 login sessions are simultaneously supported on a single service node on the Paragon. To support more users, additional service nodes are added to the system. Similar scaling is used for disk support, with a separate MIO node used to control each 5-GByte RAID array. Adding more disk to the Paragon is accomplished by adding more MIO nodes.
Since batch compute jobs tend to require a smaller subset of the UNIX system call functionality, one important feature of the Paragon is the ability to run different operating system kernels on different nodes. Sandia National Laboratories and the University of New Mexico have developed a "nanokernel" (called SUNMOS) that is less than 256 kBytes in size and supports high-speed message passing between nodes. Bandwidths as high as 160 MB/sec have been reported for the SUNMOS operating system [4]. The reduced size of the operating system allows larger in-core problems to be run.
Operating System Support for Heterogeneous Systems

The Paragon architecture can consist of nodes with both varying hardware and software capabilities. Hence a job-mix scheduling system must recognize different node types and schedule jobs accordingly. At SDSC, such a system is in production use. Modifications have been made to the NQS batch system to support scheduling policies and packing of jobs onto the 2-D mesh. Job packing is done by a modified 2-D buddy algorithm. Scheduling is controlled by organizing nodes into uniform node sets, with assignment of lists of node sets to each NQS queue. Jobs submitted to a particular queue are then scheduled to run on only those nodes belonging to the node sets associated with the queue.
This allows jobs to be scheduled to use large memory nodes, or to use nodes that are executing the SUNMOS operating system. The scheduling policy picks the highest-priority job for execution. If not enough nodes are available, nodes may be held idle until the job can fit on the mesh.
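The node-set scheduling policy described above can be sketched in a few lines. This is an illustrative model only, with invented names (the `schedule` function and the queue and job dictionaries); the production system also packs jobs onto the 2-D mesh with a modified 2-D buddy algorithm, which is not shown here.

```python
# Sketch of node-set scheduling (illustrative, not the actual SDSC
# NQS modifications). Each queue owns a list of node sets; a job may
# run only on free nodes drawn from its own queue's node sets.

def schedule(queues, free_nodes):
    """Pick the highest-priority job whose queue's node sets can
    satisfy it now; otherwise hold nodes idle and schedule nothing."""
    jobs = sorted((job for q in queues for job in q["jobs"]),
                  key=lambda j: j["priority"], reverse=True)
    if not jobs:
        return None
    job = jobs[0]  # only the highest-priority job is considered
    queue = next(q for q in queues if job in q["jobs"])
    # Free nodes that belong to this queue's node sets.
    eligible = [n for s in queue["node_sets"] for n in s if n in free_nodes]
    if len(eligible) >= job["nodes_needed"]:
        return job, eligible[:job["nodes_needed"]]
    return None  # nodes held idle until the job can fit
```

Holding nodes idle for the top job trades some utilization for a guarantee that large-memory or SUNMOS jobs are not starved by smaller work.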

Superlinear Speedup

Heterogeneous systems can be used to improve individual application performance, as well as to support productivity requirements. For applications that can be decomposed into multiple tasks, performance can be increased by assigning each task to the hardware/software system that executes that task the quickest. An example is assigning sequential setup tasks to nodes with very fast execution rates, while assigning highly parallel solution algorithms to parallel systems. For problems that iterate between job setup and job solution, it is possible to pipeline the calculations and do most of the work in parallel. If the execution rate of each task is sufficiently faster on different compute platforms, a superlinear speedup can be achieved. The solution time obtained by distributing the application can be faster by a factor larger than the number of tasks into which the application is decomposed.

Simple algebraic equations can be derived to illustrate this effect. Consider two tasks, a sequential task, 1, and a highly parallel task, 2, that are executed iteratively on two compute platforms, a fast sequential platform, A, and a fast parallel platform, B. After the initial setup for task 1, data is pipelined through multiple iterations until convergence is reached. Thus on average, task 1 and task 2 can execute in parallel. The execution time on platform A is given by

TA = TA1 + TA2

where TA1 is the time to execute task 1 on platform A and TA2 is the time to execute task 2 on platform A. With similar definitions, the time for execution on platform B is

TB = TB1 + TB2

Assume that task 1 executes faster on platform A, and task 2 executes faster on platform B. The execution time for the application distributed between the two platforms and executing in parallel is the maximum of TA1 and TB2.

The speedup, S, is the ratio of the minimum of the stand-alone execution times on the two platforms, divided by the execution time on the distributed system. The speedup is then given by

S = min(SA, SB)
SA = TA / max(TA1, TB2)
SB = TB / max(TA1, TB2)

Each execution time can be modeled as the amount of work (N1 is the number of operations for task 1) divided by the execution rate of the particular platform (RA1 is the execution rate of task 1 on platform A). Thus

TA1 = N1/RA1
TB2 = N2/RB2

The speedup can be calculated as a scaled function of the relative amount of work, h, of the two tasks, where

h = N1/N2 * RB2/RA1
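The model above can be checked numerically. The sketch below follows the symbols in the text (N1, N2 are operation counts; RA1, RA2, RB1, RB2 are execution rates); the example rates are invented for illustration.

```python
# Numerical check of the two-task speedup model from the text.

def speedup(N1, N2, RA1, RA2, RB1, RB2):
    TA = N1 / RA1 + N2 / RA2          # stand-alone time on platform A
    TB = N1 / RB1 + N2 / RB2          # stand-alone time on platform B
    T_dist = max(N1 / RA1, N2 / RB2)  # pipelined: task 1 on A, task 2 on B
    return min(TA / T_dist, TB / T_dist)

# Platform A is 10x faster on task 1; platform B is 5x faster on task 2:
S = speedup(N1=1.0, N2=1.0, RA1=10.0, RA2=1.0, RB1=1.0, RB2=5.0)
assert S > 2.0  # superlinear: the two-task decomposition gains more than 2x
```

With these rates S is 5.5, consistent with the maximum S = 1 + RB2/RA2 = 6 derived in the text.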


The maximum obtainable speedup can then be shown to depend only on the relative execution rates of the two tasks on the two platforms. A plot of the dependence of the speedup on the relative amount of work between the two tasks is shown in Figure 2, for the case when the ratio RA1/RB1 is greater than the ratio RB2/RA2. The maximum attainable speedup is given by

S = 1 + RB2/RA2

When RB2 is greater than RA2, then S > 2, and superlinear speedups are achievable. Note that there can be a substantial range in the relative load balance over which superlinear speedup is attainable. The speedup given by the lower peak corresponds to the slower-running task on platform A being processed on platform B in the same amount of time as the faster task on platform A. The maximum speedup corresponds to perfect load balancing, which occurs when both of the distributed tasks execute in the same amount of time. Thus, the maximum speedup occurs when

TA1 = TB2

or, equivalently, h = 1.
Summary

Scalable heterogeneous parallel architectures are able to support the heterogeneous workloads seen on present vector supercomputers. They achieve this by assigning different types of work to different hardware resources. On the Paragon, the ability to schedule jobs in an environment with nodes with different amounts of memory and even different operating systems is necessary for handling heterogeneous workloads. By decomposing applications into multiple tasks, it is possible to take advantage of heterogeneous architectures and achieve superlinear speedups, with applications decomposed into two tasks executing over twice as fast as the original.

Acknowledgements

This work was funded in part by the National Science Foundation under Cooperative Agreement Number ASC-8902825 and in part by the National Science Foundation and Advanced Research Projects Agency under Cooperative Agreement NCR-8919038 with the Corporation for National Research Initiatives. The support of Mike Vildibill and Ken Steube in generating the statistics is gratefully acknowledged.

All brand and product names are
trademarks or registered trademarks
of their respective holders.


Figure 2
Speedup versus Scaled Load Distribution

[Plot garbled in extraction: speedup SB versus scaled load h = N1/N2 * RB2/RA1, annotated with R1 = RA1/RB1, R2 = RB2/RA2, and a lower peak at h = (R2-1)/(R1-1).]
References
1. Moore, Reagan W., "UNICOS Performance Dependence on Submitted Workload," Proceedings, Twenty-seventh Semiannual Cray User Group Meeting, London, Great Britain (April 1991), GA-A20509.
2. Moore, Reagan W., Michael Wan, and William Bennett, "UNICOS Tuning Strategies and Performance at the San Diego Supercomputer Center," Proceedings, Twenty-sixth Semiannual Cray User Group Meeting, Austin, TX (Oct. 1990), GA-A20286.
3. Wan, Michael and Reagan Moore, "Dynamic Adjustment/Tuning of UNICOS," Proceedings, Twenty-fifth Semiannual Cray User Group Meeting, Toronto, Canada (April 1990), GA-A20062.
4. Private communication, Rolf Riesen, Parallel Computing Science Department, Sandia National Laboratories.


CRAY T3D Project Update
Steve Reinhardt
Cray Research, Inc.
655F Lone Oak Drive
Eagan, MN 55121 USA
spr@cray.com

This paper describes significant CRAY T3D project events which have occurred since the last CUG meeting, emphasizing those events which will affect how programmers or users will use the machine and the performance they can expect.

1.0 Introduction1
At the time of the last CUG meeting, in September 1993, in Kyoto, the first customer CRAY T3D system had been shipped to the Pittsburgh Supercomputing Center. Hardware was stable, software was growing out of its infancy, and performance results were available beyond common industry kernels. Currently the CRAY T3D system is shipping in volume, and customers are using the CRAY T3D successfully for production computing and MPP application development. Topics covered in this paper include: shipments, reliability, CRAY T3D hardware plans, software plans, performance (kernel, I/O, and application), and third-party application work.

2.0 Shipments
The CRAY T3D architecture spans system sizes from 32 to 2048 processing elements (PEs). As of March 1994, we have shipped nine systems to customers. Those customers

1. The work described in this paper was supported in part by the Advanced Research Projects Agency (ARPA) of the U.S. Government under Agreement No. MDA972-92-H-0002 dated 21 January 1992.

Copyright © 1994. Cray Research Inc. All rights reserved.


include governmental, industrial, and academic organizations. The industrial shipments include seismic and electronics customers. Machines reside in Europe, Japan, and
the United States. Shipments include all of the chassis
types which will be built: multi-chassis liquid-cooled
(MC), single-chassis liquid-cooled (SC), and multi-chassis
air-cooled (MCA). The largest customer system contains
256 PEs.

3.0 Reliability
CRI developed and delivered the CRAY T3D in 26
months, and some customers expressed concerns about the
reliability of a machine developed so quickly. Given the
small amount of total CRAY T3D experience we have, we
cannot call the data conclusive, but some trends are
already emerging. Overall the reliability of CRAY T3D
systems is growing quickly. We plan to end the year 1994
with an MTTI of 1 month.
Effect on Y-MP host. Many current CRI customers wish
to add an MPP component to their production system, but
are extremely attentive to any impact this may cause to
their existing CRAY Y-MP or CRAY Y-MP C90 production workload. Because of these concerns, we designed the
CRAY T3D hardware and software to be isolated from the

normal functions of the host, in order to minimize reliability problems. In practice this has worked well. We have had some early growing pains, which we believe are now behind us. Even including these early growing pains, our data shows that for every six interruptions caused by host hardware and software, the CRAY T3D system has caused one extra interruption.
CRAY T3D itself. In isolation, the MTTI of the CRAY T3D system has been about 1 week. The hardware MTTI has been about 7 weeks. The software MTTI has been about one week.
Many customers were concerned about the binary-only
release policy for CRAY T3D software because of the
potential for slow response to problems observed on site.
In practice, this has turned out not to be an issue. The OS
has a handful of different packages, each of which can
usually be upgraded separately. This allows us to deliver
tested changes quickly. The infrastructure (tools and processes) is based on that being used by CRI compilers,
which have been binary-only for several years, and hence
is well proven.

4.0 Hardware Update
When we announced the CRAY T3D, two memory sizes were quoted, 16 MB (2 MW) and 64 MB (8 MW). Several of the early systems were shipped with 16 MB memories. We have now shipped a 64 MB memory system.
We plan to allow the follow-on system to the CRAY Y-MP/EL to be a host to a CRAY T3D system during 1995.

5.0 Software Plans
The software for the early shipments enabled users to run production jobs effectively and to develop further MPP applications. Future software releases will provide better performance, ease of use, and flexibility.

5.1 UNICOS MAX operating system
Release 1.1 (2Q94) Improvements to UNICOS MAX are being released in monthly updates. By the time of release 1.1, all OS support for the CRAFT programming model will be in place. Resilience will be improved by the ability to switch in the redundant hardware nodes more easily. Scheduling will be improved by allowing small programs to leap-frog in front of large programs which are waiting for resources; large programs will wait only a finite time.
Release 1.2 (4Q94) I/O connectivity and performance will be improved by the delivery of Phase II I/O, which allows a "back-door" channel from a Model E I/O cluster to connect directly to a CRAY T3D I/O gateway. This will especially improve the I/O picture for CRAY T3D customers who have host systems with few processors. Machine sharing will be improved by the implementation of rolling. Rolling enables a running program to be suspended, "rolled" out of its partition to secondary storage and all resources freed, another program to be run in the partition, and then finally the original program to be rolled in and resumed. With rolling, very large, long-running "hog" jobs can co-exist with many small development programs.
Release 1.3 (1H95) The delivery of Phase III I/O will increase the I/O connectivity and performance again, enabling Model E IOCs to be connected directly to the CRAY T3D for both data and control, and thus allowing I/O to scale with the size of the CRAY T3D and be less constrained by the size of the host.

5.2 Compilers/Programming Environments
Release 1.1 (2Q94) The CRAFT programming model will enable users to program the CRAY T3D as a global-address-space machine, with data-parallel and work-sharing constructs. We expect that many applications can be ported to the CRAY T3D more quickly with CRAFT than with a message-passing decomposition. The 1.1 release will allow C++ users to exploit the power of the CRAY T3D system for their programs, with compilation and debugging abilities and the class libraries most frequently used for scientific computations. Access to multi-PE operation from C++ will be via the PVM message-passing library.
3Q94 Users whose applications are dominated by operations that need only 32-bit data and operations will gain a significant performance improvement from the release of a signal processing option in Fortran in the third quarter of 1994. A subset of the Fortran 90 language and the mathematical intrinsic functions, suitable for signal processing, will be provided, along with visual tool support. Access to multi-PE operation will be via PVM.
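The rolling mechanism described above under Release 1.2 can be sketched as a pair of operations on a partition. All names here are invented for illustration; this is not the UNICOS MAX implementation.

```python
# Illustrative sketch of "rolling": a running program is suspended,
# rolled out of its partition so the nodes are freed, another program
# runs, and the original is rolled back in and resumed.

def roll(partition, storage, incoming_job):
    """Roll the current occupant out to secondary storage, then give
    the freed partition to the incoming job."""
    hog = partition.pop("job", None)
    if hog is not None:
        hog["state"] = "rolled-out"
        storage.append(hog)            # image saved, nodes freed
    partition["job"] = incoming_job    # small job gets the partition
    incoming_job["state"] = "running"
    return partition, storage

def unroll(partition, storage):
    """When the partition is free again, resume the rolled-out job."""
    partition.pop("job", None)         # small job has finished
    hog = storage.pop()
    hog["state"] = "running"           # rolled in and resumed
    partition["job"] = hog
    return partition
```

The point of the mechanism is visible in the two functions: the "hog" job never loses its place, yet its nodes are fully reusable while it is rolled out.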


4Q94 Users who do I/O which is spread across the memory of multiple processors will be able to do this more easily with the release of global I/O, which simplifies the task of doing I/O on shared objects and synchronizing access to files which are shared among PEs.
CF90 2.0 (1H95) The full Fortran 90 language will be delivered to CRAY T3D users with release 2.0 of the CF90 programming environment. Access to multi-PE operation will be via PVM. Implementation of the CRAFT model within CF90 is being scheduled.
Users will see improving application performance throughout this period as a consequence of further compiler and library optimizations. (See below under Kernel performance.)
The CRAFT programming model will deliver, we believe, an appropriate balance between the programmer's ease of use and the compiler's ability to deliver high performance [Pase94]. Many researchers believe that the HPF language will deliver good portability [HPF93]. Each of these languages is a global-address-space model for distributed-memory computers. Widespread MPP applications development depends on the emergence of a standard language. We believe that the implementation of each of these languages will add to the body of knowledge about languages of this type. We expect that both of these efforts will contribute to the Fortran 95 standard committee, and that is where we expect to resolve any conflicts between the two.

6.0 Performance

6.1 Livermore Loops
The Livermore Fortran Kernels measure the performance of a processor for certain common operations. Figure 1 displays the single-PE performance of the CRAY T3D with the September 1993 compiler in the front row and the current performance in the back row.

[Figure, garbled in extraction: Livermore Loops]

6.2 I/O Performance
For 2 MB transfers, the CRAY T3D system can sustain more than 100 MB/s across one HISP/LOSP channel pair to a disk file system. When using 4 channel pairs in parallel, 4 MB transfers can sustain more than 350 MB/s to disk.

6.3 Seismic Application Performance
Three-dimensional pre-stack depth migration describes the Grand Challenge of seismic computing. The application requires a high computational rate, but especially a high I/O rate. A CRAY Y-MP C90 implementation of this method was one of the 1993 Gordon Bell Award winners. The 3DPSDM program implements this technique for the CRAY T3D in Fortran and message passing, with some assembly code used [Wene94]. The absolute performance of 64 CRAY T3D PEs is 42% of the performance of a C90/16. The CRAY T3D is about 3.5 times more cost-effective for this application. The application scales very well on many PEs.

[Figures, garbled in extraction: 3D Prestack Migration Absolute Performance (C90 16 CPUs vs. T3D 64 PEs); 3D Prestack Migration Performance Per Dollar; 3D Prestack Migration Scaling on CRAY T3Ds]

6.4 Climate Modeling Application Performance
The Parallel Ocean Program (POP) performs climate modeling calculations on scalable computers [Duko93, Duko94]. The program is structured as an SPMD computation, with the overall domain decomposed into a block on each PE. Its basic structure is representative of many problems which devolve to the solution of partial differential equations. On other MPPs, POP has spent more than 25% of its time communicating; on the CRAY T3D it spends less than 5% of its time communicating. POP running on 256 PEs of a CRAY T3D runs about 27% as fast as it does on a C90/16. The price-performance of the CRAY T3D is 88% of the C90/16. POP scales very well up to 256 PEs.

[Figures, garbled in extraction: POP Absolute Performance; POP Performance Per Dollar; POP Scaling on CRAY T3Ds]

6.5 Chemistry Application Performance
The SUPERMOLECULE program is a public-domain code whose structure and function are representative of third-party chemistry applications [Sarg93, Feye93]. It implements the Hartree-Fock method and is used to understand the interaction of drug molecules. The absolute performance of a 64-PE CRAY T3D is 45% of a C90/16. The price-performance of a CRAY T3D is almost four times that of the C90.
The scaling of SUPERMOLECULE, however, is not good; in fact the time to solution does not decrease noticeably by using more than 64 processors. A matrix diagonalization step is being performed serially on the CRAY T3D, and Amdahl's Law limits the speedups which are possible. However, because of the close linkage between the CRAY T3D and its parallel vector host, the serial portion of the code can be executed on a single processor of the host, at much higher speed than on a single processor of the CRAY T3D. When that portion of the code is run on a CRAY Y-MP processor, the program can effectively use more PEs on the CRAY T3D side. In this way a large program can exploit the coupled system for faster time to solution than either system could provide by itself.

[Figures, garbled in extraction: SUPERMOLECULE Absolute Performance (C90 16 CPUs vs. T3D 64 PEs); SUPERMOLECULE Scaling on CRAY T3Ds; SUPERMOLECULE Performance Per Dollar; SUPERMOLECULE Heterogeneous Performance (T3D only vs. 64, 128, 256 PEs)]
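The serial-bottleneck effect described above can be illustrated with a small Amdahl's Law calculation. The function name and all timing numbers below are invented for illustration, not measured CRAY T3D data.

```python
# Illustrative Amdahl's Law model of a serial step (e.g. a matrix
# diagonalization) limiting scaling, and of offloading that step to
# a faster host processor. All numbers are invented.

def time_to_solution(n_pes, t_serial, t_parallel, host_speedup=1.0):
    """Parallel work scales with the number of PEs; the serial step
    runs on one processor, host_speedup times faster on the host."""
    return t_serial / host_speedup + t_parallel / n_pes

t64 = time_to_solution(64, t_serial=10.0, t_parallel=640.0)    # 20.0
t256 = time_to_solution(256, t_serial=10.0, t_parallel=640.0)  # 12.5
# Quadrupling the PEs gives only a 1.6x gain: the serial step dominates.
t_host = time_to_solution(256, 10.0, 640.0, host_speedup=10.0)  # 3.5
# Moving the serial step to a 10x-faster host restores most of the scaling.
```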

7.0 Third-Party Application Work
The success of the CRI MPP project depends heavily on
the availability of third-party applications programs to
enable many users to ignore the complexity of a distributed memory computer and yet tap the very high performance of the CRAY T3D system. CRI is working with
vendors of the following applications programs, with a
goal of having some of these applications available from
the vendor for the CRAY T3D system by the end of 1994.
chemistry: CHARMm, Discover, Gaussian, X-PLOR
structural analysis: LS-DYNA3D, PAMCRASH
CFD: FIRE, FLO67, STAR-CD
electronics: DaVinci, Medici, TSuprem
petroleum: DISCO, GEOVECTEUR
mathematical libraries: IMSL, NAG, Elegant Mathematics

8.0 Summary
The CRAY T3D computer system has been delivered to customers working in several scientific disciplines, and is enabling production MPP computing for those customers. Development of new MPP applications on the CRAY T3D system is fueling greater exploitation of the latent performance of the system.

9.0 References
[Duko93] A Reformulation and Implementation of the Bryan-Cox-Semtner Ocean Model on the Connection Machine, J.K. Dukowicz, R.D. Smith, and R.C. Malone. J. Atmos. Ocean. Techn., 10, 195-208 (1993).
[Duko94] Implicit Free-Surface Methods for the Bryan-Cox-Semtner Ocean Model, J.K. Dukowicz and R.D. Smith. To appear in J. Geophys. Res.
[Feye93] An Efficient Implementation of the Direct-SCF Algorithm on Parallel Computer Architectures, M. Feyereisen and R.A. Kendall. Theoret. Chim. Acta 84, 289 (1993).
[Geis93] PVM3 User's Guide and Reference Manual, Al Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Robert Manchek, and Vaidy Sunderam. Oak Ridge National Laboratory ORNL/TM-12187, May 1993.
[HPF93] High Performance Fortran, High Performance Fortran Language Specification version 1.0, May 1993. Also available as technical report CRPC-TR 92225, Center for Research on Parallel Computation, Rice University. To appear in "Scientific Programming."
[MacD92] The Cray Research MPP Fortran Programming Model, Tom MacDonald. Proceedings of the Spring 1992 Cray Users' Group Meeting, pp. 389-399.
[Pase94] The CRAFT Fortran Programming Model, Douglas M. Pase, Tom MacDonald, and Andrew Meltzer. To appear in Scientific Programming, 1994.
[Rein93] CRAY T3D System Software, Steve Reinhardt. Proceedings of the Fall 1993 Cray Users' Group Meeting, pp. 36-40.
[Sarg93] Electronic Structure Calculations in Quantum Chemistry, A.L. Sargent, J. Almlof, and M. Feyereisen. SIAM News 26(1), 14 (1993).
[Wene94] Seismic Imaging in a Production Environment, Geert Wenes, Janet Shiu, and Serge Kremer. Proceedings of the Spring 1994 Cray Users' Group (CUG) Meeting.


PARALLEL SESSIONS

Applications and Algorithms

Porting Third-Party Applications Packages to the Cray MPP: Experiences at
the Pittsburgh Supercomputing Center
Frank C. Wimberly, Susheel Chitre, Carlos Gonzalez, Michael H. Lambert,
Nicholas Nystrom, Alex Ropelewski, William Young
March 30, 1994

1 Background

Early in 1993 it was decided that the Pittsburgh Supercomputing Center would acquire the first Cray T3D MPP supercomputer to be delivered to a customer site. Shortly after
that decision was made we began a project to implement a
number of widely used third-party packages on the new platform with the goal that they be available for production use
by the time the hardware was made available to the PSC
user community.
The wide use of such packages on the Center's Cray C90
(C916/512) led us to appreciate the importance of their availability. Approximately 30 to 40 percent of the cycles on the
C90 are delivered to users of applications packages. Previously acquired massively parallel supercomputers at the
Center had seen less widespread use than the vector supercomputers, probably because of the lack of such packages.
These other MPP's were not underutilized but they were
used by a relatively smaller set of users, who had developed
their own codes.
In selecting the targets for the porting effort we took into
account: demand for the package on the C90; whether we
had a source license and the right to modify the program
and make the result available to users; and, whether message
passing parallel versions already existed which would give us a headstart on a T3D version. Based on these criteria we selected the following packages:

* GAMESS
* Amber 4
* CHARMM
* MaxSegs
* Gaussian 92
In addition, later in the year we began evaluating FIDAP
and X-PLOR as additional candidates for porting.
Although we did not have access to T3D hardware until later in the summer, extensive porting began early in the year by means of access to T3D emulators running at CRI and shortly thereafter on PSC's own C90. The emulator proved to be an excellent tool for validating program syntax and
correctness of results. Its usefulness for program optimization was limited in that performance data was restricted to
information about locality of memory references. Since several programs worked correctly on the emulator before the hardware became available, we turned to the T3D simulator, running on a Y-MP at CRI, as a means of testing performance. Although the simulator was important in operating system development, it was not as useful as we had hoped, since for testing applications programs it ran too slowly to permit realistic runs. The hardware became available shortly
after these attempts to use the simulator so this was not a
significant impediment to progress.
By the time of delivery of the 32 PE T3D to PSC in August 1993 all of the five initially selected packages ran in
one version or another on the emulator. Some of them ran
"heterogeneously" between the C90 and the emulator and
some ran "standalone" within the emulator. Since Heterogeneous PVM was not available at that time, a PSC-developed communications package, DHSC (for Distributed High Speed Communication), was used for communication between the two platforms. Within a few weeks after delivery of the hardware, versions of all five packages were running either on the
T3D or distributed between the T3D and the C90. Again,
DHSC was used for the heterogeneous programs. There were
various problems with the run time system and with the
CF77 compiler (for instance, incorrect handling of dynamic
arrays) which prevented the programs from running immediately on the hardware even though they had run on the
emulator. CRI was very responsive to the discovery of these
software problems and as a result of our efforts and support
from Cray personnel we were pleased with the progress we
had made.
We have recently begun to place more emphasis on performance as opposed to establishing program correctness. As
the focus has moved to performance we have been getting
regular hardware upgrades. The latest occurred in early
March of the current year, and we now have 256 PE's and
four I/O gateways on the T3D; an additional 256 PE's are
scheduled to be installed in the early summer of 1994.
Before presenting the current status of several of the porting projects we should comment on some general themes.
First, performance figures given below should be understood
in context. The CRAFT FORTRAN programming model
is not yet available to us. All parallel implementations have


been done using explicit message passing. The compiler does
not yet produce fully optimized code nor have the applications themselves been fully reorganized to exploit the power
of the T3D at this time. The PE's currently have 16 MB
of memory which has made it necessary to compensate in
ways which may adversely affect performance. For the heterogeneous programs, performance has been impacted by the
relatively slow speed of the C90 to T3D connection (the I/O
gateways); this is especially true for Heterogeneous PVM
but also for DHSC. The hardware is capable of 200 MB/sec
but we have realized process-to-process throughput of only
about 10 MB/sec. We expect this to improve substantially
as Heterogeneous PVM is expanded and improved.

2 CHARMM and the Molecular Dynamics Kernel

CHARMM (Chemistry at HARvard Macromolecular Mechanics) is a widely used program to model and simulate
molecular systems using an empirical energy function. In
the 10 years since CHARMM was developed by Brooks [1]
and co-workers, a great deal of work has gone into developing optimized versions for a large number of computer systems including vector [2] and parallel [3], [4], [5] machines.
Because of its heavy usage at PSC and the nature of the algorithms used by CHARMM we felt that it was an excellent
candidate for porting to the T3D.
The principal focus of most CHARMM-based research done
by PSC users is simulating poly-peptides and proteins in an
aqueous environment. As a starting point we developed a
specialized version of CHARMM for this problem and are
currently running production calculations while we explore
algorithms and methods for developing a full-featured version of CHARMM. This initial port extends ideas in heterogeneous computing previously explored at the PSC [6] using
a Cray Y-MP and a Thinking Machines Corporation CM-2
and CM-5. This heterogeneous computing approach couples a highly optimized molecular-solvent simulation code
[7] over a high-speed channel, using either DHSC routines or
network PVM [8].
In previous computational experiments in distributed computing we were able to observe good performance in small
demonstration calculations. But in scaling these calculations up for production, we faced a large number of technical
problems, including those related to data format conversion.
With the arrival of the T3D and its pairing with the C90,
we had a heterogeneous computing system from a single vendor and were hopeful that these issues could be resolved. At
present, many of those issues have been resolved and we are
currently running production calculations. For a wide range
of benchmark problems we currently see speedups of 2 to 3
in run time using a single C90 CPU and 32 T3D PE's over
the same calculation done on a single C90 CPU alone.


3 GAMESS

GAMESS (General Atomic and Molecular Electronic Structure System) [9], [10] is a quantum chemistry code currently
being developed at Iowa State University. It is written in
standard Fortran 77 and runs on a variety of platforms. The
program exhibits poor performance on Cray vector hardware.
However, as distributed by the author, it provides support for
message-passing parallelism. This is accomplished through
calls to the TCGMSG message passing library [11]. Because
most computation is done in parallel regions and these regions are scattered through the code, GAMESS is better
suited to a standalone T3D implementation than it is to the
heterogeneous C90/T3D platform.
Because PVM is the natural message-passing mechanism on
the Cray T3D, the TCGMSG library was replaced by a set
of wrappers that call the appropriate PVM routines. The code
would not run under the T3D emulator, so it was necessary
to do development work on the PSC workstation cluster.
Once the T3D hardware arrived and the software stabilized,
GAMESS ran. However, because of current memory limitations on the T3D, it is limited to direct (in-memory) calculations with about 200 basis functions. This is marginally
enough to handle interesting problems.
GAMESS running on the T3D scales well with the number of
processors. Communication accounts for about two percent
of the total time, and load-balancing overhead is at the five-percent
level. On a 110 basis-function direct SCF gradient calculation with 32 PEs, the two-electron gradient routine is over
99.5% parallel and the overall efficiency is just over 50%. Direct calculations run in about the same time on four PEs on
the T3D as on one processor of a C90.

4 Amber 4

Amber 4 [12] is a suite of programs designed for molecular
modeling and simulations. It is a robust package prominent in the toolkits of computational chemistry and biology
researchers, and its methods, tailored to the study of macromolecules in solution, have found widespread applicability in
modeling complex biochemical processes and in the pharmaceutical industry. Amber's genesis was in the application
of classical mechanics (i.e., integration of Newton's equations of motion) to large molecular assemblies using fitted
potentials to determine low-energy conformations, model biological processes, obtain bulk properties, and so on. New versions
of Amber have since introduced capabilities to treat nuclear
magnetic resonance data, support free energy perturbation
calculations, and otherwise implement the functionality required by its research audience.
The Amber 4 package consists of sixteen individual programs, loosely categorized as preparatory programs (prep,
link, edit, and parm), energy programs (minmd, gibbs,
sander, nmode), and utility programs (anal, mdanal,
runanal, lmanal, nucgen, pdbgen, geom, and a set of tutorial
shell scripts). The preparatory programs address the construction of Amber 4 input. They generally consume a small
amount of computational resources and are run a small number of times as the first step of any given calculation. The
energy programs perform the real work of minimizing structures and propagating trajectories through time. Computationally they are the most demanding programs in the Amber 4 system, and they are often run multiple times to model
different environments and initial conditions. The four energy programs together, including library routines shared between them, comprise only 53% of the Amber 4 package's source
code. The utility programs analyze configurations, compute
various properties from the system's normal modes, and interact with Brookhaven Protein Data Bank [13, 14] (PDB)
files. As with the preparatory programs, the utility programs
consume a relatively insignificant portion of computational
resources when compared with the energy programs.

4.1 Parallel implementations

The clear association of heavy CPU requirements with the
energy programs suggests them as ideal candidates for implementation on distributed and parallel computers. Distributing even moderate-sized programs such as these can
be laborious, so the PSC was fortunate to receive distributed
versions of minmd and gibbs based on PVM [8] from Professor
Terry P. Lybrand and his research group at the University
of Washington. (Work is also underway at the University of
Washington on sander.)
The PSC's initial choice for conversion to the Cray T3D
was minmd, because of its relevance to computational chemistry
and biology, the number of CPU cycles it consumes, its size,
and the time frame in which the PVM version was obtained.
Minmd performs minimization, in which the atoms' positions
are iteratively refined to minimize the energy gradient, and
molecular dynamics, in which the atoms' coordinates are integrated in time according to Newton's equations of motion.
Typical system sizes in contemporary research range from on
the order of 10^2 to upwards of 10^5 atoms. Realistic simulations can entail on the order of 10^5 integration time
steps, rendering the integration phase of the simulation dominant and also daunting.
Lybrand's PVM-based version of minmd, subsequently referred to as the distributed implementation, embodies a
host/node model to partition a simulation across a networked
group of workstations. One process, designated as the host,
spawns a specified number of node processes to compute nonbonded interactions. All other aspects of the calculation are
performed on the host, which efficiently overlaps its own computation with communication to and from the nodes.
Work on two distinct implementations of minmd is well underway: a standalone implementation which runs solely on the
Cray T3D, and a heterogeneous version which distributes
work between the Cray C90 and the Cray T3D. The standalone version employs Cray T3D PVM to exchange data
between processing element (PE) 0 and all other PE's. There

is occasional synchronization between the nodes, but no explicit node-node data transfer. The heterogeneous version
of minmd uses CRI Network PVM (also known as "Hetero
PVM") to communicate between the Cray C90 (host) and
the T3D PE's (nodes). Again, no node-node data transfer is
necessary.
The standalone and heterogeneous implementations of minmd
each have their own advantages and disadvantages. The
standalone version offers the greatest potential performance
because of the low-latency, high-bandwidth I/O available on
the Cray T3D hardware. Its principal disadvantage is that
conversion from distributed host/node code to standalone
code is tedious and error-prone because two separate sets of
source code must be merged. This results in a high maintenance cost for a package such as Amber 4 which is constantly
evolving. The heterogeneous implementation currently suffers from the low efficiency of the C90-T3D communications
mechanism, but it is very easily derived from the distributed
source code. The changes to PVM are trivial, and the only
extensive changes required concern file I/O and the processing of command line arguments. Communications rates between the C90 and the T3D are expected to improve with
time, so for now development and instrumentation of both
implementations of minmd will continue.
Preliminary timing data has already been obtained for distributed minmd running on the PSC's DEC Alpha workstation cluster and for the heterogeneous minmd running between the Cray C90 and T3D. Debugging of the standalone
Cray T3D implementation of minmd is in its final stages, and
timings are expected shortly.

4.2 Acknowledgements

The initial PVM-based implementations of minmd and gibbs
were developed and provided by Professor Terry P. Lybrand
and his research group at the University of Washington.

5 MaxSegs

MaxSegs [15] is a program written by the PSC for genetic
sequence analysis. MaxSegs is designed to take an experimental DNA/RNA or protein query sequence and compare
it with a library of all categorized DNA/RNA or protein
sequences. Searching categorized sequences with an experimental sequence is useful because it helps the researcher locate sequences that might share an evolutionary, functional,
or biochemical relationship with the query sequence.(1) The
MaxSegs program is written in standard FORTRAN-77 and
is highly optimized to run on Cray vector supercomputers.
For a typical protein the MaxSegs program operates at about
230 million vector operations per second on the PSC's C90.(2)

(1) There are currently about 40,000 categorized protein sequences, ranging in length from 2 to 6000 characters. The average size of a typical protein sequence is approximately 300 residues. There are approximately 170,000 DNA/RNA sequences, with lengths ranging from 100 to 200,000. The length of a typical DNA sequence is about 1000.

MaxSegs was also one of the first programs distributed between the Cray Y-MP and the Thinking Machines Corporation's CM-2 supercomputer at the Center [16]. This project
helped to show that for large problems, two supercomputers
could indeed cooperate to achieve results faster than either
supercomputer alone could achieve. The CM-2 code was implemented using data parallel methods; each virtual processor on the CM-2 received a unique library sequence residue
to compare with a broadcast query sequence residue. In this
implementation, many library sequences could be compared
with a query sequence simultaneously. In addition to comparing many library sequences with a query sequence at once,
this implementation also requires the use of very little per-processor memory. The disadvantages of this implementation include an enormous amount of nearest-neighbor communication, a startup and finishing penalty in which nodes
process zeros, and the pre-processing of enormous, frequently
updated sequence libraries.(3) Although programming in this
style has a number of disadvantages, the results have been
impressive enough to allow sequence analysis software implemented on SIMD machines to gain acceptance within the
sequence analysis community.
Both published research [17], [18] and unpublished research
by the biomedical group at the PSC have shown that if node
processors have sufficient memory, a MIMD style of programming can be applied to the sequence comparison problem,
yielding performance superior to the performance reported
on machines using data parallel approaches. In this style
each processor is given the experimental query sequence and
unique library sequences to compare. The results of these
comparisons are collected by a single host processor. One
advantage of this implementation is that superior load balancing can be achieved without having to pre-sort the frequently updated sequence databases. The increased load-balancing
capability also makes this implementation suitable for a wide selection of sequence comparison algorithms,
such as Gribskov's profile searching algorithm [19]. In addition to providing superior load balancing, overall communication is also reduced; communication occurs only when the
sequences are sent to the processors and when the results of the
comparisons are collected back at the host. The disadvantages of this method are that the communication patterns
are irregular, there is the potential for a bottleneck at the host processor, and the nodes must have sufficient
memory to perform the comparison.
We have decided to implement the code on the T3D using the
MIMD approach in two different ways. The first way uses the
C90 as the host processor, directing the T3D to perform the
work of comparing the sequences. The second method uses
PE 0 on the T3D as the host processor, leaving the remaining
T3D processors to compare the sequences. Preliminary results indicate that the overall communication speed between
the T3D and the C90 is currently insufficient to consider the
first approach as a viable alternative to using the native C90
code. However, the second approach is very promising: preliminary results indicate that 32 T3D processors take only
25% more time than a single C90 CPU and that 64 T3D
processors can match the performance of a single C90 CPU.
These performance results are on preliminary code, and they indicate that communication, rather than computation,
is the main bottleneck. Using some simple communication
reduction techniques, we expect the results to improve dramatically.

(2) The C90 version of MaxSegs is available upon request.
(3) The startup and finishing penalties result from the recursive nature of the sequence comparison algorithms; to improve efficiency on a SIMD machine, categorized sequences must be sorted according to size. See [16].

5.1 Acknowledgements

This work was developed in part under National Institutes of Health grants 1 P41 RR06009 (NCRR) and R01
LM05513 (NLM) to the Pittsburgh Supercomputing Center.
Additional funding was provided to the Pittsburgh Supercomputing Center by National Science Foundation grant
ASC-8902826.

6 Conclusion

Based on our experience, we would say that the problem of
porting "dusty decks", or, more accurately, large pre-existing
packages, to a modern, high-performance MPP platform is
difficult but possible. The difficulty varies depending on the
structure of the code, the intrinsic nature of the algorithms,
the existence of other message-passing versions, and several
other factors. In the best cases, the effort is clearly worthwhile. We have seen impressive performance even in the
absence of obvious optimizations of both the compilers and
the applications programs themselves. It seems clear that
in many cases the throughput realized at the PSC, measured
in the amount of scientific research accomplished per unit
time, will be substantially increased by the availability of
these packages either on the T3D or on the heterogeneous
platform (C90/T3D).

References
[1] B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus. CHARMM: A program for macromolecular energy, minimization, and dynamics calculations. J. Comp. Chem., 4:187-217, 1983.

[2] J. E. Mertz, D. J. Tobias, C. L. Brooks, III, and U. C. Singh. Vector and parallel algorithms for the molecular dynamics simulation of macromolecules on shared memory computers. J. Comp. Chem., 12:1270-1277, 1991.

[3] B. G. J. P. T. Murray, P. A. Bash, and M. Karplus. Molecular dynamics on the Connection Machine. Technical Report CB88-3, Thinking Machines Corporation, 1988.

[4] S. J. Lin, J. Mellor-Crummey, B. M. Pettitt, and G. N. Phillips, Jr. Molecular dynamics on a distributed-memory multiprocessor. J. Comp. Chem., 13:1022-1035, 1992.

[5] B. R. Brooks and Milan Hodoscek. Parallelization of CHARMM for MIMD machines. Chemical Design Automation News, 7:16-21, 1993.

[6] C. L. Brooks III, W. S. Young, and D. J. Tobias. Molecular simulations on supercomputers. Intl. J. Supercomputer App., 5:98-112, 1991.

[7] W. S. Young and C. L. Brooks III. Optimization of replicated data method for molecular dynamics. J. Comp. Chem., in preparation.

[8] Al Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Robert Manchek, and Vaidy Sunderam. PVM 3 user's guide and reference manual. Technical Report ORNL/TM-12187, Oak Ridge National Laboratory, Oak Ridge, Tennessee, 1993.

[9] M. Dupuis, D. Spangler, and J. J. Wendoloski. National Resource for Computations in Chemistry Software Catalog, Program QG01. University of California, Berkeley, 1980.

[10] Michael W. Schmidt, Kim K. Baldridge, Jerry A. Boatz, Steven T. Elbert, Mark S. Gordon, Jan H. Jensen, Shiro Koseki, Nikita Matsunaga, Kiet A. Nguyen, Shujun Su, Theresa L. Windus, Michel Dupuis, and John A. Montgomery, Jr. General atomic and molecular electronic structure system. J. Comp. Chem., 14(11):1347-1363, 1993.

[11] R. J. Harrison, now at Pacific Northwest Laboratory, v. 4.03, available by anonymous ftp in directory pub/tcgmsg from host ftp.tcg.anl.gov.

[12] David A. Pearlman, David A. Case, James C. Caldwell, George L. Seibel, U. Chandra Singh, Paul Weiner, and Peter A. Kollman. AMBER 4.0. University of California, San Francisco, 1991.

[13] F. C. Bernstein, T. F. Koetzle, G. J. B. Williams, E. F. Meyer, Jr., M. D. Brice, J. R. Rodgers, O. Kennard, T. Shimanouchi, and M. Tasumi. The Protein Data Bank: A computer-based archival file for macromolecular structures. J. Mol. Biol., 112:535-542, 1977.

[14] E. E. Abola, F. C. Bernstein, S. H. Bryant, T. F. Koetzle, and J. Weng. Protein Data Bank. In F. H. Allen, G. Bergerhoff, and R. Sievers, editors, Crystallographic Databases - Information Content, Software Systems, Scientific Applications, pages 107-132, Bonn/Cambridge/Chester, 1987. Data Commission of the International Union of Crystallography.

[15] M. S. Waterman and M. Eggert. A new algorithm for best subsequence alignments with applications to tRNA-rRNA comparisons. J. Mol. Biol., 197:723-728, 1987.

[16] H. Nicholas, G. Giras, V. Hartonas-Garmhausen, M. Kopko, C. Maher, and A. Ropelewski. Distributing the comparison of DNA and protein sequences across heterogeneous supercomputers. In Supercomputing '91 Proceedings, pages 139-146, 1991.

[17] A. Deshpande, D. Richards, and W. Pearson. A platform for biological sequence comparison on parallel computers. CABIOS, 7:237-347, 1991.

[18] P. Miller, P. Nadkarni, and W. Pearson. Comparing machine-independent versus machine-specific parallelization of a software platform for biological sequence comparison. CABIOS, 8:167-175, 1992.

[19] M. Gribskov, R. Lüthy, and D. Eisenberg. Profile analysis. In R. Doolittle, editor, Methods in Enzymology, Volume 183, 1990.


Some Loop Collapse Techniques to
Increase Autotasking Efficiency
Mike Davis
Customer Service Division
Cray Research, Inc.
Albuquerque, NM

Abstract

Loop collapsing has limited application as a technique to improve the efficiency of vectorization; however, when applied to a nest of loops in which Autotasking is taking place, the resulting transformed loop structure can perform much more efficiently. This paper describes the loop structures that can be collapsed, discusses the techniques for collapsing these loops, and measures the efficiencies of the transformations.

1.0 Introduction

The Cray Research Autotasking Compiling System recognizes several forms of vectorizable, parallelizable work in Fortran code. These forms fit the general Concurrent-Outer-Vector-Inner (COVI) model, where outer loop iterations are executed concurrently on separate CPUs and inner loop iterations are executed in vector mode on each CPU. The kinds of iterative structures that cannot be recognized as parallelizable or that cannot be efficiently parallelized have been classified into groups according to their characteristics [1,2]. The descriptions of some of these groupings are as follows:

a. The parallelized loop contains an insufficient amount of work over which to amortize the cost of initiating and terminating parallel processing (BPEP);

b. The number of iterations in the parallelized loop is not sufficiently high to permit the use of all (or a large majority) of the machine's CPUs (LI);

c. The parallel efficiency of the parallelized loop is limited by an ineffective work distribution scheme (LI);

d. The amount of work being done on each iteration of the parallelized loop is so large that delays in task scheduling can result in significant reductions in achieved speedup (LI);

e. The amount of work being done on each iteration of the parallelized loop varies greatly from one iteration to the next, causing high overhead to distribute and synchronize tasks (VW);

f. The parallel region is itself inside a loop that also contains a significant amount of serial work; this kind of code can potentially result in high overhead as the operating system repeatedly gives CPUs to the process for execution of the parallel region, then takes them away during execution of the serial region (RESCH).

The techniques described in this paper address these types of structures. Section 2 introduces the concept of iteration space and how looping structures can be described by the shape of their iteration spaces. Section 3 describes coding techniques for collapsing nests of loops and gives intuitive explanations of why the transformations provide performance benefits. Section 4 covers the results of performance-testing the various collapse transformations. Conclusions and areas of future work are presented in Section 5.

2.0 Geometric Characterization of Iterative Structures

The number of different looping structures that can be constructed is literally infinite; the number that have actually been constructed is probably as large as the number of computer applications. But the vast majority of looping constructs can be grouped into a handful of categories. For the purposes of this paper, the best system for classifying loops is the geometric system. In this system, iterative structures are described by the shape of the iteration space. The shape will have as many dimensions as the iterative structure has nested loops. The shape is constructed by projecting it outward by one dimension for every loop in the loop nest. The following helps to illustrate this process:

The outermost loop in the structure (call it L1) is represented by a line; its length, in geometric units, is equal to the trip count of the loop (N1); this is represented in Figure 2.1.

Copyright (C) 1994, Cray Research, Inc. All Rights Reserved.


Figure 2.1
The next outermost loop (L2), inside of L1, is represented by projecting the shape to two dimensions. The width of the shape at a point I1 units along the edge formed by L1 is equal to the trip count of L2 (N2) on iteration I1 of L1; this is shown in Figure 2.2.

The next outermost loop (L3), inside of L2, is represented by projecting the shape to three dimensions. The depth of the shape at a point I1 units along the edge formed by L1 and I2 units along the edge formed by L2 is equal to the trip count of L3 (N3) on iteration I1 of L1 and I2 of L2; this is depicted in Figure 2.3. Quite often, the shape of the iteration space matches the shape of the data upon which the iterative structure operates, but this is not always necessarily the case. Some examples will help reinforce the concept of representing iterative structures geometrically.

Figure 2.2

Figure 2.3

2.1 Linear Iteration Space

In a simple iterative structure consisting of only one loop, the iteration space is said to have a linear shape. The length of the line is equal to the trip count of the loop.

2.2 Rectangular Iteration Space

Consider the iterative structure of Figure 2.2. The iteration space of this structure has a rectangular shape. Its length is equal to N1, and its width at all points along its length is equal to N2.

2.3 Triangular Iteration Space

A structure whose iteration space is right-triangular would appear as shown in Figure 2.4. There are several different variations of the triangular iteration space, and each variation corresponds to a triangle with a different spatial orientation and a specification of whether or not the triangle includes the "main diagonal." These variations can be distinguished by the form of the DO statement for the inner loop, as shown in Table 2.1.

2.4 Nevada Iteration Space

A more generalized expression for both the rectangular and triangular iteration spaces is that of the "Nevada" iteration space. This shape has both rectangular and triangular components, such that it is shaped like the state of Nevada. The coding structure that corresponds to this shape is shown in Figure 2.5. In this structure, the shape has a length of N1 and a width that varies from N2+N2D to N2+N1*N2D. N2D represents the magnitude of the slope of the hypotenuse of the triangular component of the shape. If N2 is zero, then the shape degenerates into that of a triangle; if N2D is zero, then the shape degenerates into that of a rectangle.

2.5 Histogram Iteration Space

Another shape commonly encountered is that of a histogram. For this structure, the width of the shape at a given point along its length is dependent upon the contents of some data structure that has been built prior to entering the coding structure. The histogram iteration space is shown in Figure 2.6. Here, the identifier N2 is an integer array of extent (1:N1), and each element of N2 specifies the trip count of the inner loop. Geometrically, this corresponds to the notion that for a given "bar" I1, the "bar height" is equal to N2(I1).

Figure 2.4

Figure 2.5
TABLE 2.1: Variations of Triangular Iteration Space

I2 index values      Orientation    Inc. Main Diag
1, I1                Upper Right    Yes
1, I1-1              Upper Right    No
I1, N1               Lower Left     Yes
I1+1, N1             Lower Left     No
N1+1-I1, N1          Lower Right    Yes
N1+1-I1, N1-1        Lower Right    No
1, N1+1-I1           Upper Left     Yes
1, N1-I1             Upper Left     No


Figure 2.6

2.6 Porous Iteration Space
There is another characteristic of iterative structures that is
worth considering for the purposes of this study. Recall from the
previous section that the iterative structures that pose problems for
Autotasking include those that have low trip counts and low
amounts of work in the body. So far we have focused primarily on
characterizing loops based on their trip counts. Now we consider
how to describe loops in terms of the amounts of work contained
within. Consider the iterative structure of Figure 2.7. Notice that
the iteration space is rectangular, but the work is confined to a triangular subspace.

Now consider a more generic case, as depicted in Figure 2.8. Here, work is done only on iterations where the control variables suit some condition. The shape of the iteration space is rectangular, but its interior is porous; that is, certain cells of the space have substance while others do not. An iterative coding structure that executes varying amounts of work from one iteration to the next is analogous to a shape that varies in density from one cell to the next. In this work, we will restrict our study to structures that execute either some work or no work depending on the iteration, much like the one in Figure 2.8.

Figure 2.7

Figure 2.8

3.0 Loop Collapse Techniques

Loop collapsing is one of several loop transformation techniques for optimizing the performance of code. Some of the others include loop splitting, loop fusion, loop unrolling, and loop interchange. Each technique is suitable for its own class of loop structures, but loop collapsing is done primarily to increase the number of iterations that can be made available to the optimizing stage of the compiler at one time. Collapsing nested loops can improve the vector performance of a loop structure because it increases the trip count of the vector loop, and hence increases the vector length [3,4]. For Autotasking, a collapsed nest of loops can perform better because the entire iteration space is handled in one parallel region, rather than just individual sections; furthermore, the programmer has more control over the granularity of the parallelism in a collapsed loop.

3.1 Collapsing the Rectangle

Probably the simplest loop structure to collapse is one with a rectangular shape, such as the one below:

DO I1 = 1, N1
  DO I2 = 1, N2
    work
  END DO
END DO
Suppose that N1 = 17 and N2 = 3. If we chose to direct the compiling system to Autotask the loop varying I1, and we wanted to use all 8 CPUs of a CRAY Y-MP, then we could potentially suffer a high LI overhead when 7 CPUs had to wait for the 8th CPU to finish the 17th iteration. On the other hand, if we direct the compiling system to Autotask the loop varying I2, then we could only make effective use of 3 CPUs in the parallel region (LI overhead again), plus we would suffer a high cost for initiating and terminating the parallel region 17 times (BPEP overhead).

If we collapse the two loops into one, the resulting code
appears as shown below:

DO I12 = 0, N1*N2-1
  I1 = I12 / N2 + 1
  I2 = MOD (I12, N2) + 1
  work
END DO
Here, the amount of work to be done by the structure essentially has not changed. The trip count has been increased to 17 * 3 = 51. If we let Cwork be the cost of executing the work inside the loop body, then the worst case LI cost is Cwork/51, because out of 51 iterations to do, we might have to wait for one iteration's worth of work to be completed; by comparison, in the previous case, if the outer loop is Autotasked, the worst case LI cost is 3*Cwork/17, because out of 17 outer-loop iterations to do, we might have to wait for 3 inner-loop iterations' worth of work to be completed (each task does an entire DO I2 loop). The difference in LI cost is therefore a factor of 9.
Another point worth emphasizing here, on the subject of
load balancing, is that the collapsed structure offers more flexibility in terms of specifying work distribution parameters to the
Autotasking compiling system [5], thus further increasing the parallel efficiency of the collapsed structure relative to the original.
One of the dangers of collapsing loops is that the scope
of the data dependency analysis must be widened to include the
bodies of the outer and inner loops. If, for example, we had a
structure like the one below,
DO I1 = 1, N1
  DO I2 = 1, N2
    TEMP = A(I2-1,I1)
    work
    A(I2,I1) = TEMP
  END DO
END DO
Autotasking the loop varying I1 is safe, but Autotasking the loop varying I2 is not, because of the data dependency involving A. This data dependency would persist through the collapse of the loops, rendering the collapsed loop unparallelizable. In a case like this, it might be worthwhile to code the transformed loop so that the value of I1 varies fastest, if the logic within the body of the loop allows this; otherwise, it might be better just to leave the structure alone, since the loop varying I1 can Autotask.
Another potential danger occurs when collapsing a loop structure in which the inner loop has a trip count that is lower than the number of CPUs that could be working concurrently on the collapsed loop. In this case, there might be more than one task working on an iteration of the structure with the same value for the inner loop index. This may or may not be a problem, depending on the contents of the loop body. To illustrate, consider collapsing the following rectangular structure:
DO I1 = 1, N1
  DO I2 = 1, N2
    work
    SUM(I2) = SUM(I2) + A(I2,I1)
  END DO
END DO
In this example, if N2 were equal to 4 and the collapsed structure were Autotasked across more than 4 CPUs, then more than one task would execute concurrently with the same value for I2. Thus, a race condition on SUM(I2) could occur. Techniques to protect against this danger include installing GUARD directives in the loop body or interchanging the loops before collapsing.
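The collision is a simple pigeonhole effect, which a short script can make concrete. This is an illustrative Python sketch (not from the paper); the division/remainder index-recovery mapping is an assumed, standard scheme for the rectangle collapse:

```python
# Assumed mapping (not quoted from the paper): the collapsed rectangle
# runs I12 = 0 .. N1*N2-1 and recovers the original indices by integer
# division and remainder.
N1, N2 = 30, 50
cells = []
for i12 in range(N1 * N2):
    i1 = i12 // N2 + 1          # outer index, 1..N1
    i2 = i12 % N2 + 1           # inner index, 1..N2
    cells.append((i1, i2))
# Every (I1, I2) pair of the original nest is visited exactly once.
assert cells == [(i1, i2) for i1 in range(1, N1 + 1) for i2 in range(1, N2 + 1)]

# The danger discussed above: with an inner trip count of 4, any 16
# consecutive collapsed iterations contain only 4 distinct I2 values,
# so 16 concurrent tasks must collide on I2 (hence on SUM(I2)).
short_n2 = 4
i2_of = [(i12 % short_n2) + 1 for i12 in range(16)]
assert len(set(i2_of)) == short_n2
```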

3.2 Collapsing the Triangle
Collapsing a loop structure that corresponds to a triangular iteration space is a little more complicated. First we note that the number of cells NCtri in the triangular iteration space of length N1 is given by

NCtri(N1) = Σ_{i=1}^{N1} i = N1 × (N1 + 1) / 2
So we define a statement function NC_TRI which will
aid in the readability of the transformed code. Given the following original loop structure:
DO I1 = 1, N1
  DO I2 = 1, I1
    work
  END DO
END DO
The transformation would then look like this:
DO I12 = 0, NC_TRI(N1) - 1
  ISEEK = N1 - 1
  DO WHILE (NC_TRI(ISEEK) .GT. I12)
    ISEEK = ISEEK - 1
  END DO
  I1 = ISEEK + 1
  I2 = I12 - NC_TRI(ISEEK) + 1
  work
END DO
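The index-recovery logic of this transformation can be verified mechanically. The following Python sketch (illustrative, not part of the paper) reimplements NC_TRI and the ISEEK search, and checks that every cell of the triangular space is visited exactly once:

```python
# Illustrative check (not from the paper) of the triangle collapse:
# NC_TRI(n) counts the cells in the first n strips, and the ISEEK
# search inverts it to recover (I1, I2) from the collapsed index I12.
def nc_tri(n):
    return n * (n + 1) // 2

N1 = 50
cells = []
for i12 in range(nc_tri(N1)):
    iseek = N1 - 1
    while nc_tri(iseek) > i12:      # find the strip containing I12
        iseek -= 1
    i1 = iseek + 1
    i2 = i12 - nc_tri(iseek) + 1
    cells.append((i1, i2))
# The mapping enumerates the triangular space exactly once, in order.
assert cells == [(i1, i2) for i1 in range(1, N1 + 1) for i2 in range(1, i1 + 1)]
```

A linear search downward from N1-1 is O(N1) per iteration in the worst case; it is shown here exactly as the paper structures it, with each task using only its private ISEEK.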
The sole purpose of the code at the top of the collapsed loop is to determine, given the collapsed loop index I12, the values of the "original" loop indices I1 and I2. The first impulse of many programmers would be to create counters that get incremented on every iteration of the loop varying I12, and occasionally to zero one of the counters when a new strip of the triangle is begun. This kind of logic is probably easier to understand, but in order for it to be run Autotasked, it would have to be GUARDed to ensure that only one task at a time updates the counters. The logic shown here makes use of the collapsed loop index I12 and a private variable ISEEK; its primary advantage is that it requires no such protection as a critical section of the loop.

There are a few noteworthy aspects of this structure. First of all, observe that on the first few iterations of the outer loop in the original structure, the trip count of the inner loop is going to be low. Therefore, you are guaranteed to encounter the situation described above in the discussion of collapsing rectangular structures, namely several tasks running concurrently with the same value for I2. Furthermore, one of the techniques to circumvent this potential problem, that of interchanging the loops, is not an option here, because the inner loop trip count depends on the outer loop index: you can't iterate from 1 to I1 on the outside because you don't know what I1 is yet!

The second thing worth noting is that when the outer loop is Autotasked in the original version of the code, the iterations will contain variable amounts of work (VW); this corresponds to the variation in the height of the triangle at various points along the base. This phenomenon will produce added overhead in this kind of loop. The best way to avoid this situation, if the loops cannot be collapsed, is to arrange the iterations of the outer, Autotasked loop so that the amount of work performed on each iteration decreases as the iteration number increases. This VW problem occurs in essentially all non-rectangular iteration spaces.

3.3 Collapsing the Nevada
For the Nevada-shaped iteration space, it is best to make use of a statement function to compute the indices, much like that used for the triangle space discussed above. When the statement function is used, the collapsed loop looks very much like that for the triangle case. The number of cells in a Nevada structure is:

NCnev(N1, N2, N2D) = N1 × N2 + N2D × NCtri(N1)

The function computes the number of cells in a Nevada-shaped space with a rectangular component of size N1 by N2 and a triangular component whose base is N1 and whose hypotenuse has a slope N2D. The original Nevada iterative structure looks like this:
DO I1 = 1, N1
  DO I2 = 1, N2 + I1 * N2D
    work
  END DO
END DO
The transformation to collapse the Nevada space to a linear space appears below.
DO I12 = 0, NC_NEV(N1,N2,N2D) - 1
  ISEEK = N1 - 1
  DO WHILE (NC_NEV(ISEEK,N2,N2D) .GT. I12)
    ISEEK = ISEEK - 1
  END DO
  I1 = ISEEK + 1
  I2 = I12 - NC_NEV(ISEEK,N2,N2D) + 1
  work
END DO
In this case, so long as N2 is large enough, there is no need to worry about the possibility of two concurrent tasks having the same value for their inner loop index I2.

3.4 Collapsing the Histogram
Considering now the histogram iteration space, the same kind of transformation technique can be applied, but first, a special data structure must be built to assist in the collapse. The data structure is an integer array of extent (0:N1), where N1 is equal to the number of bars in the histogram, and the value of each element of the array is equal to the height of the histogram bar corresponding to the element's index, plus the value of the preceding element (the value of the zero-th element is zero). Hence, the data structure represents a running sum of the number of cells in the histogram up to a certain point:

NChist(N1) = Σ_{i=1}^{N1} N2(i)

We can call this data structure NC_HIST. The code to construct the NC_HIST data structure, prior to entering the loop structure, looks like this:
NC_HIST(0) = 0
DO I1 = 1, N1
  NC_HIST(I1) = NC_HIST(I1-1) + N2(I1)
END DO
Equipped with this data structure, we can now collapse the histogram iterative structure. The original histogram loop structure looks like this:
DO I1 = 1, N1
  DO I2 = 1, N2(I1)
    work
  END DO
END DO
The collapsed histogram loop structure is as shown below:
DO I12 = 0, NC_HIST(N1) - 1
  ISEEK = N1 - 1
  DO WHILE (NC_HIST(ISEEK) .GT. I12)
    ISEEK = ISEEK - 1
  END DO
  I1 = ISEEK + 1
  I2 = I12 - NC_HIST(ISEEK) + 1
  work
END DO
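The same inversion check used for the triangle applies here, with the prefix-sum array in place of the closed-form function. This Python sketch is illustrative, not from the paper; the random bar heights are hypothetical test data:

```python
import random

# Illustrative check (not from the paper): NC_HIST is a running sum of
# bar heights N2(I1), and the ISEEK search inverts it to recover
# (I1, I2) from the collapsed index I12.
random.seed(1)
N1 = 40
heights = [random.randint(1, 50) for _ in range(N1)]   # hypothetical N2(I1)

nc_hist = [0]                     # NC_HIST(0) = 0, then the running sum
for h in heights:
    nc_hist.append(nc_hist[-1] + h)

cells = []
for i12 in range(nc_hist[N1]):
    iseek = N1 - 1
    while nc_hist[iseek] > i12:   # find the bar containing I12
        iseek -= 1
    i1 = iseek + 1
    i2 = i12 - nc_hist[iseek] + 1
    cells.append((i1, i2))
# The mapping enumerates the histogram space exactly once, in order.
assert cells == [(i1, i2) for i1 in range(1, N1 + 1)
                 for i2 in range(1, heights[i1 - 1] + 1)]
```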
As with the triangle iteration space, we will see here the potential for more than one task running at a time with the same value for I2. And as is also the case for the triangle, the loop interchange remedy is not available to us because of the dependency between the loops. But it is worth remembering that this potential problem is only a real problem when there is code within the body of the loop that updates some data structure element using I2 as an index. In that case, the only good remedy is to create a critical section in the loop body that prevents I2-indexed data structures from being updated by one task while being referenced by another task.

3.5 Collapsing the Porous Shape
The last iterative structure we will consider in this section is that of the porous shape, representing a structure with conditional work inside. In this scenario, what we would like to do is
eliminate VW overhead. We are also interested in eliminating the
iteration overhead of distributing to the CPUs iterations that
essentially have no work in them. The technique involves essentially skipping the iterations for which the porosity condition
holds true, and distribute only those iterations for which real work
will be done. To accomplish this, we must build a data structure
before entering the parallel region, much like that used for the histogram above:
NCONDA = 0
DO I1 = 1, N1
  DO I2 = 1, N2
    IF (CONDA(I1,I2)) THEN
      NCONDA = NCONDA + 1
      ICONDA(1,NCONDA) = I1
      ICONDA(2,NCONDA) = I2
    END IF
  END DO
END DO
CONDA is a function that takes as arguments the loop structure indices and returns a logical result; it is essentially the porosity function. The data structure ICONDA is an integer array; its size in the first dimension must be equal to the nesting depth of the iterative structure; here it is 2. The size of ICONDA in the second dimension must be large enough to represent the completely non-porous iterative structure; in this case, NCONDA could get as large as N1 * N2, so ICONDA must be equipped to store that many entries. (Of course, if the programmer knows a priori what the degree of porosity of the structure is likely to be, he can dimension ICONDA accordingly.) The ICONDA array keeps track of those non-porous cells within the structure where there is work to be done. Its use in the transformation of the porous iterative structure is shown below. First, the original porous loops:

DO I1 = 1, N1
  DO I2 = 1, N2
    IF (CONDA(I1,I2)) THEN
      work
    END IF
  END DO
END DO
Now, the collapsed porous structure:
DO I12 = 1, NCONDA
  I1 = ICONDA(1,I12)
  I2 = ICONDA(2,I12)
  work
END DO
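The construction can be checked with a small script. This Python sketch is illustrative, not from the paper, and the CONDA predicate used here is a hypothetical porosity function chosen only for the demonstration:

```python
# Illustrative check (not from the paper) of the porous collapse:
# ICONDA records the (I1, I2) of every non-porous cell, and the
# collapsed loop visits only those cells.
N1, N2 = 25, 25

def conda(i1, i2):               # hypothetical porosity predicate
    return (i1 * 31 + i2 * 17) % 5 != 0

iconda = []                       # plays the role of the ICONDA array
for i1 in range(1, N1 + 1):
    for i2 in range(1, N2 + 1):
        if conda(i1, i2):
            iconda.append((i1, i2))

visited = [(i1, i2) for (i1, i2) in iconda]   # the collapsed loop body
# The collapsed loop touches exactly the cells where work exists...
assert visited == [(i1, i2) for i1 in range(1, N1 + 1)
                   for i2 in range(1, N2 + 1) if conda(i1, i2)]
# ...and iterates fewer times than the original N1 x N2 nest.
assert len(visited) < N1 * N2
```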
Note that, unlike the other loop collapse transformations,
this collapsed loop iterates fewer times than the original structure,
so the possibility exists that the collapsed loop will not have a high
enough trip count to warrant Autotasking. The effects of this condition, and techniques for accounting for it, will be covered in the
section on testing.
It is important to keep in mind that the shape of a structure and its porosity are two totally independent characteristics. In
fact, the technique for collapsing a porous structure can be applied
to any kind of loop nest, regardless of its shape. This makes the
porous collapse technique the most general of all the techniques.

4.0 Performance Testing of Collapsed
Loops
For each of the five collapse techniques discussed in the
previous sections (rectangle, triangle, Nevada, histogram,
porous), four test runs were performed. Two of the four test runs
compare wall clock times, and two compare CPU times. Of the
two wall clock tests, one compares the performance of the collapsed loop against the original with the inner loop Autotasked,
and the other compares the performance of the collapsed loop
against the original with the outer loop Autotasked. The same two
comparisons are done in the two CPU time tests. The test programs were compiled and executed on a 16-CPU Cray Y-MP C90
installed in the Corporate Computing Network at Cray Research,
Inc. in Eagan, MN. All tests were executed during SJS time,
which is essentially equivalent to dedicated time. CPU timing
tests used the SECOND function, and wall clock timing tests used
the RTC intrinsic [6].
The CPU timing plots should reveal gains achieved by collapsing, specifically in the area of reducing BPEP overhead; they may also show costs associated with collapsing, specifically in the areas of executing code to generate supporting data structures or to compute original index values. The wall clock timing plots should reveal gains achieved by collapsing, specifically in the area of reducing LI and VW overhead.

The bodies of the loops in all cases were the same, essentially:
CALL WORK (A(I2,I1))
where A is an appropriately-dimensioned array and the subroutine WORK looks like this:
SUBROUTINE WORK (A)
A = 2.7181
DO I = 1, 512
  A = EXP (LOG (A))
END DO
END
Thus, the WORK routine does nothing useful, but does exercise the hardware and accumulate CPU time. Notice also that memory traffic is minimal in the loop body. This makes the test results essentially unbiased by memory contention issues. In a real loop body, however, memory contention could be a very big issue.

The results of these tests are depicted in the 20 plots that make up the Appendix. Each will be discussed in the sections that follow.

4.1 Speedup from Collapsing the Rectangle
The graphs that describe the results of the rectangle tests are all 3-D mesh plots, with speedup shown on the vertical axis, as a function of the rectangle's length and width. In these tests, the outer loop of the original structure iterates over the width of the rectangle, and its trip count varies from 1 to 30; the inner loop iterates over the length, and its trip count is varied from 1 to 50.

4.1.1 Inner/CPU
In this plot, we see that the speedup of the collapsed structure is high where the length of the rectangle is small, and decreases to an asymptotic value around 1.0 as length increases. In the original version of this test code, the length dimension of the shape is being processed in the inner loop, and the inner loop is the one being Autotasked, so a small value for length means a high ratio of BPEP code execution to user code execution.
The plot also shows that the collapsed structure performs slightly better than the original when the width of the shape is large. This can be explained by the fact that an increase in width means, for the original code, increasing BPEP overhead. This overhead cost is not present in the collapsed structure.

4.1.2 Inner/Wall
This plot has the same general shape as the previous Inner/CPU plot, but the scales are much different. In this case, the speedup to be obtained by collapsing the loops is quite dramatic if the shape is short and wide. This speedup is due to improved load balancing.
Notice also that this plot is more "bumpy" than the previous plot. This is presumably due to the fact that the LI overhead of the original code is highly dependent on the trip count of the Autotasked loop; further, there is always an inherent slight variability in wall-clock timings.

4.1.3 Outer/CPU
This plot shows that collapsing the rectangle yields essentially no benefit in CPU time over Autotasking the outer loop.

4.1.4 Outer/Wall
In this plot, we see only slight speedups for the collapsed code over the original. There appears to be a significant drop in speedup at width = 16. This is probably because the original code performs most efficiently there, since the trip count is exactly equal to the number of CPUs in the machine.

4.2 Speedup from Collapsing the Triangle
The graphs that describe the results of the triangle tests are all 2-D line graphs, with speedup shown on the vertical axis, as a function of the triangle's base and height. In these tests, base and height were always equal, and they were allowed to vary from 1 to 50.

4.2.1 Inner/CPU
This graph shows speedup to be moderate when the size of the triangle is small, and decreasing as the triangle increases in size.

4.2.2 Inner/Wall
Speedups for this case are quite significant, as shown in this graph. Like the Inner/CPU case above, payoffs are highest for the small shape, tapering off as the size of the shape increases. This graph is quite jagged, perhaps indicating that the LI overhead is, in the original version of this test case, highly dependent upon small changes in the size of the problem.

4.2.3 Outer/CPU
In this case, the speedup is negligible over essentially the entire test space. Notice the scale of the vertical axis.

4.2.4 Outer/Wall
Speedups here are significant for small and moderate-sized problems. The spikes in the graph are evidence of LI overhead in the original version of the code, which is exacerbated by VW conditions.

4.3 Speedup from Collapsing the Nevada
The plots that describe the results of the Nevada tests are
all 3-D mesh plots, with speedup shown on the vertical axis, as a
function of the shape's rectangular dimensions (N1 and N2) and
diagonal slope (N2D). In these tests, N1 and N2 were always
equal, and they were allowed to vary from 1 to 50. N2D was
allowed to vary from 1 to 20. The test program iterates over the
length of the shape in the outer loop, and over the (variable) width
in the inner loop.

4.3.1 Inner/CPU
For this case, the speedup obtained by collapsing is negligible.

4.3.2 Inner/Wall
Here, the speedups are moderate (around 1.5) over most of the test space. The plot shows that speedups are quite variable when the iteration space is small, and more steady when the shape is large. The variation in the performance of the small test cases may be due to differences in task scheduling.

4.3.3 Outer/CPU
This is an interesting plot. Speedups take on a stair-step behavior based on the length of the iteration space. The steps occur at length values that are multiples of 16, the number of CPUs in the machine being used. In this test, the outer loop, iterating over the length of the shape, is Autotasked. Presumably, when an Autotasked loop has a trip count that is some integer multiple of the number of CPUs in the machine, the iterations will be distributed evenly and the efficiency will be near optimal. The plot seems to support this hypothesis.

4.3.4 Outer/Wall
Speedups are dramatic in this case, in the region where the rectangular component of the shape (given by N1 and N2) is small and the slope of the hypotenuse of the triangular component (given by N2D) is steep. The Autotasked loop in the original code iterates over N1, so when N1 is small, the ratio of overhead to useful work is high. Further, when N2D is large, the trip count of the inner loop varies greatly with respect to N1; this results in a very high amount of VW overhead. As in the Outer/CPU plot discussed above, there is a stair-step effect at length values of 16, 32, and 48; however, these effects are very subtle here.

4.4 Speedup from Collapsing the Histogram
The plots that describe the results of the histogram tests are all 3-D mesh plots, with speedup shown on the vertical axis, as a function of the length (number of bars) and the variation in bar height. The average bar height is 25, and the bar height range is allowed to vary from 2 to 50. Before a histogram is processed, the length and average bar height are chosen, then the bars are assigned random heights within the allowable range. The efficiency with which the original coding structure can process the histogram space should depend strongly on the variance in the histogram's bar heights. It is important to note that even though this series of tests used random numbers to generate the iteration spaces, the same series of shapes was generated in each test run, because each test started with the same (default) random number seed.

4.4.1 Inner/CPU
Speedups in this case are negligible. Any gains made in reducing the BPEP overhead are perhaps being offset by the cost of constructing the NC_HIST data structure.

4.4.2 Inner/Wall
This plot shows the wall-clock speedups to gather around the 1.5 value, but to vary between 1.0 and 2.0. The variation seems to be independent of both the length of the histogram and the variation in bar height. The random jaggedness of the plot is most likely due to the randomness of the shape of the iteration space itself. It is interesting to note, however, that the variation seems to be greater in the left rear corner of the plot and smaller in the right front corner.

4.4.3 Outer/CPU
This plot shows that very little is gained in terms of CPU
time from collapsing the histogram versus simply Autotasking the
outer loop.

4.4.4 Outer/Wall
Speedups in wall-clock time from collapsing the histogram can be quite dramatic, especially where the length of the
shape is small and the bar height variation is high. In this test, the
outer loop iterates along the length of the shape. Since each bar
being processed in an iteration of the outer loop has a random
height, the potential for VW overhead is very high; and the larger
the variation, the higher the overhead. Then there is the LI overhead that can arise when parallelizing in the large grain, especially
when the length of the histogram is low. These factors together
can severely impact the parallel efficiency of the original structure.

4.5 Speedup from Collapsing the Porous Rectangle
The plots that describe the results of the porous rectangle
tests are all 3-D mesh plots, with speedup shown on the vertical
axis, as a function of the size of the rectangle and the degree of

porosity. The rectangle is actually a square, because length and
width are always equal; they are allowed to vary from 1 to 50. The
porosity of the shape is varied from 0% to 95%, in intervals of 5%.
The "holes" in each shape are placed at random before the processing of the shape; as in the histogram tests described above, the
series of shapes that are used in each test are all the same. The
porosity of the shape should have a direct impact on the original
coding structure's ability to process the shape efficiently.

4.5.1 Inner/CPU
Speedups in this test are significant only in the cases
where the size of the rectangle is small, or when the porosity is
very high. The higher the porosity, the more frequently we are
executing concurrent iterations for no useful purpose. The collapsed structure has been designed in such a way that this overhead does not occur.

4.5.2 Inner/Wall
As in the Inner/CPU case described above, speedups are
greatest where length is small or porosity is high. Since iterations
of the inner loop will do either much work or no work, the VW
overhead of high-porosity structures is extreme.

4.5.3 Outer/CPU
This plot shows that speedups are negligible in all but the most pathological of cases. The BPEP overhead saved by collapsing is offset by the cost of building the ICONDA data structure.

4.5.4 Outer/Wall
For this test, the speedups were modest across the majority of the test space. The outer loop in the original code iterates across the length of the structure, and each iteration processes one row along the width. Although these rows are porous, they are all equally porous, so the load here is fairly well balanced. The quality of uniform porosity across the rows of the structure is an artifact of the way in which the test was constructed; it should not be presumed to be a quality of all porous structures. In general, it is probably reasonable to assume that the less uniform the porosity, the better the speedup from collapsing will be.

5.0 Conclusions and Future Work
The loop collapse technique can be applied to a wide variety of iterative structures, with performance benefits that range from modest (25%) to remarkable (over 500%). A collapsed loop offers the advantages of small-grain parallelism, such as good load balance, with the advantages of large-grain parallelism, such as low parallel-startup costs. In most cases the code transformations are trivial; the type of transformation needed is governed by the geometric characterization of the iterative structure. Several common geometries were presented, corresponding transformation strategies were developed, and performance characteristics were measured.
There are several areas that remain to be explored, including:
a. More work needs to be done to characterize loops with varying amounts of work, and the performance impact of splitting these loops into separate structures that each contain fixed body sizes;
b. It would be interesting to study the degree to which performance is impacted by body size in structures that have low trip counts;
c. The memory efficiency of collapsed loops versus their un-collapsed counterparts should be studied;
d. Some thought should be given to the feasibility of making these kinds of collapse transformations automatically, within the compiler;
e. It is not clear how transformations such as these fit into the MPP Fortran Programming Model [7]. It is possible that a DOSHARED directive with multiple induction variables specified can do the same job as, or perhaps a better job than, loop-collapse transformations on some kinds of loops.

6.0 Acknowledgements
I would like to thank the following people for providing
assistance and inspiration for this work:
Peter Feibelman, Distinguished Member of Technical
Staff, Surface and Interface Science Department 1114, Sandia
National Laboratories. I developed the concepts of collapsing
loops of different shapes while working with Peter on optimizing
and multitasking his impurity codes;
Al Iacoletti, John Noe, Rupe Byers, and Sue Goudy, Scientific Computing Directorate 1900, Sandia National Laboratories. The comments and suggestions from these reviewers helped
improve the clarity and structure of several important parts of the
paper. All remaining errors, omissions, inaccuracies, inconsistencies, etc., are, of course, my responsibility.

REFERENCES:
1. Cray Research, Inc., CF77 Optimization Guide, SG-3773 6.0, p. 179.
2. Ibid., p. 185.
3. Ibid., pp. 86-87.
4. Levesque, J., and Williamson, J., A Guidebook to Fortran on Supercomputers, Academic Press (1989), p. 91.
5. Cray Research, Inc., CF77 Commands and Directives, SR-3771 6.0, p. 79.
6. Cray Research, Inc., UNICOS Fortran Library Reference Manual, SR-2079 7.0, pp. 289-291.
7. Cray Research, Inc., Programming the Cray T3D with Fortran, SG-2509 Draft, pp. 45-48.


[Appendix: the 20 speedup plots referenced in Section 4. For each shape (rectangle, triangle, Nevada, histogram, porous rectangle) there are four plots: speedup from collapsing (inner/cpu), (inner/wall), (outer/cpu), and (outer/wall).]
Collaborative Evaluation Project Of Ingres On The Cray (CEPIC)
C. B. Hale, G. M. Hale, and K. F. Witte
Los Alamos National Laboratory
Los Alamos, NM
Abstract
At the Los Alamos National Laboratory, an evaluation project has been done that explores the
utility a commercial database management system (DBMS) has for supercomputer-based
traditional and scientific applications. This project studied application performance, DBMS
porting difficulty, and the functionality of the supercomputer DBMS relative to its more well-known workstation and minicomputer hosts. Results indicate that the use of a commercial DBMS
package with a scientific application increases the efficiency of the scientist and the utility of the
application to its broader scientific community.

Introduction

Ingres is a relational distributed database
management system that makes it possible to
quickly transform data into useful
information in a secure environment using
powerful database access and development
tools. It has the architecture for true
client/server performance.
Ingres offered Los Alamos a ninety-day trial evaluation of the Cray/Ingres product on a Los Alamos Cray. This included the use of the base product, ABF, C, FORTRAN, Knowledge Management, TCP/IP Interface, Net, Vision, and STAR. To ensure the success of the project, they provided Los Alamos with the Premium level of support (required for a Cray platform) during the trial period.
CRI also was very supportive of Los Alamos'
interest in putting Ingres on Crays, because
they see the potential for entering a new
market. Sara Graffunder, senior director of
Applications at Cray Research, said that with
Ingres functionality on Cray systems, users
will be able to apply the world's largest
memories and superior computational
performance of Cray systems to data
management. They wanted Los Alamos to
demonstrate that this is true. They offered
the one-processor YMP-2E (BALOO) in the
Advanced Computing Laboratory (ACL) for
Los Alamos to use for the Ingres/Cray trial
evaluation. Having a machine dedicated to

this effort allowed us to run basic tests on the
product for evaluation without affecting the
user community. They provided consulting
to go with the installation.
This project was an effective way of using
Los Alamos scientific and computational
expertise in collaboration with CRI and
Ingres to generate interest in the scientific
community in database management on
supercomputers.
Project Description

Members of Client Services and Marketing
(C-6), Internal Information Technology
(ADP-2), and Nuclear Theory and
Applications (T-2) Groups at Los Alamos
National Laboratory (LANL) collaborated
with representatives from Cray Research, Inc.
(CRI) and Ingres Corporation to successfully
complete the Collaborative Evaluation
Project of Ingres On The Cray (CEPIC). The
project objectives were to determine:
• how easy it is to use Ingres on Cray computers, and whether Ingres runs in a reasonable time, comparable to current utilities, in an environment that people like and will use;
• whether current standard database
applications could be ported to the Cray
and the level of effort required to
accomplish such porting;


• how well Ingres performs on the Cray in
managing data used in and generated by
large scientific codes;
• what additional hardware is required to use
Ingres on the Cray;
• whether Ingres is compatible with Los
Alamos UNICOS.

Project Plan
The CEPIC project plan was to:
• install Ingres on the Cray;
• port an existing traditional database and its three applications from a VAX to the Cray and run compatibility and performance tests;
• create an Ingres database (NUCEXPDAT) and its application (EDAAPPL) to manage the data and results for an existing scientific code (EDA);
• run performance tests on the Cray and other machines;
• install Ingres on machine RHO, a Cray Y-MP8/128 in the Los Alamos ICN;
• port the NUCEXPDAT database and the EDAAPPL application to machine RHO and run Los Alamos UNICOS compatibility and performance tests on machine RHO.

Installation Of Ingres On A Cray Y-MP
After deciding on the system configuration and Ingres file location, Ingres 6.3 was installed on a Cray Y-MP, named BALOO. It was configured as follows:

Cray CPU: Y-MP2E/116, S/N 1620
1 Central Processor
16 Million 64-bit words of central memory
1 HISP channel to IOS
1 LOSP channel to IOS

I/O Subsystem (IOS): Model E, serial number 1620
1 I/O Cluster
1 Million words of buffer memory
1 HISP channel to mainframe memory
1 LOSP channel to CPU
2 HIPPI channels

Disks: 15.68 GBytes on-line storage
8 DD-60 drives
1.96 GB (formatted) per drive
20 MB/s peak transfer rate per drive


Because BALOO was located in the ACL test
environment and was being used for nondatabase work, no database performance
tuning was done on the machine.
EDA Physics Code
The Los Alamos code EDA (Energy Dependent Analysis) was chosen for this
study because it has data management
requirements representative of many of the
scientific codes used at the Laboratory. EDA
is the most general, and among the most
useful, programs in the world for analyzing
and predicting data for nuclear reactions
among light nuclei. In its present form, the
code can be used only on Cray computers,
where all of its data files reside. These data
files are represented by the boxes shown in
Fig. 1; the ones in the upper part of the figure
are input files, and those in the lower part are
output files (results) of the analysis. Because
of the size and complexity of the data files
that are used and produced by the program,
the data management tasks associated with
EDA are quite challenging.
The primary data are the results of
experimental measurements (upper left-hand
box of Fig. 1) for reactions that can occur
between light nuclei.
Because this
information also has the most complex data
structure, we decided to concentrate on these
files for the CEPIC demonstration project.
An experimental data library containing on
the order of 50,000 measured points had
already been assembled in the specific form
required by the code before the project
began. These data entries are classified
according to several identifiers, including
those for compound system, reaction, energy,
observable type, and reference source. At
run time, the user selects a subset of these

data to be used for a particular problem. This
was being done mainly by grouping data

associated with different compound systems
in separate files.

[Figure 1. Schematic Of EDA Physics Code Files ("Energy Dependent Analysis"). Input files (top): Experimental Nuclear Data and R-Matrix parameters; the EDA code adjusts the parameters to fit the data; output files (bottom): Predicted Nuclear Data, and structure data/phase shifts.]


We anticipated that putting the experimental
information into an Ingres-based data
management system would allow far more
selectivity in the choice of data (e.g., by
energy ranges or data type) than was possible
with the existing system. Also, for purposes
of experimental data compilation, which is an
aspect of the EDA activity that is of great
potential interest to outside users, it would
provide the capability to sort the entire
experimental data file according to any
hierarchy of the identifiers listed above.
However, we also obtained some unexpected benefits, related to the ease and accuracy with which new data could be entered into the system, and publication-quality bibliographies of the experimental data references could be generated in a variety of formats.

NUCEXPDAT Database
The nine tables and four views in the NUCEXPDAT database are summarized in Table 1. A schematic of the table relationships, showing the number of rows and columns in each table, is given in Fig. 2. Unique identifiers join all tables except NEXTKEYS and TIMENOW. Single and double arrows indicate one-to-many and many-to-many relationships.

NAME         TYPE   DESCRIPTION
BIBLIO       table  Bibliographic data
CSREACTION   table  Compound System Reaction data
DATAHIST     table  Date and data identifiers of EDA runs
ENERGY       table  Energy data
EXPDATA      table  Experimental data (angle, value, error)
NEXTKEYS     table  Next keys for BIBID, RNO, ENO, and OBSID
OBSERVABLES  table  Observables data
RUNHIST      table  History of EDA runs (date, energy range, notes)
TIMENOW      table  Time stamp
BIBLIOVW     view   Join of CSREACTION, ENERGY, OBSERVABLES, and BIBLIO tables used for the BIBLIORPT, BIBNPLTRPT, BIBNPRPT, BIBPRLTRPT, and BIBPRRPT reports
EDADATAVW    view   Join of CSREACTION, ENERGY, OBSERVABLES, EXPDATA, and BIBLIO tables used for the EDADATARPT and EXPDATARPT reports
RENOBBIBVW   view   Join of CSREACTION, ENERGY, and OBSERVABLES tables used for the BLBIBIDRPT report
RENOBDATVW   view   Join of CSREACTION, ENERGY, OBSERVABLES, and EXPDATA tables used to view all data through QBF

Table 1. NUCEXPDAT Tables And Views

[Figure 2. Schematic of NUCEXPDAT table relationships. Table sizes: CSREACTION, 181 rows, 7 columns; ENERGY, 6497 rows, 9 columns; OBSERVABLES, 6886 rows, 13 columns; EXPDATA, 24,367 rows, 4 columns; DATAHIST, 457 rows, 5 columns; BIBLIO, 236 rows, 10 columns; TIMENOW, NEXTKEYS, and RUNHIST, 1 row each. Total: 22.5 Mbytes.]
EDAAPPL Application
Shown in Fig. 3 is a flow chart of the EDAAPPL application, while Table 2 describes its frames and procedure. Data entry and update frames are BIBLIOFRM, DATENTFRM, DATUPDFRM, FIXANGLEFRM, FIXENERGYFRM, RESEFRM, and RESERNGFRM. Report frames end in RPT.

[Figure 3. Flow chart of EDAAPPL. MAINMENU branches to DATENTFRM, RPTMENU, DATUPDFRM, and RUNHISTFRM. DATENTFRM leads to the BIBLIOFRM, FIXANGLEFRM, FIXENERGYFRM, RESEFRM (which calls NEXTLETPRC), and RESERNGFRM frames; RPTMENU leads to the BIBLIORPT, BIBNPLTRPT, BIBNPRPT, BIBPRLTRPT, BIBPRRPT, BLBIBIDRPT, EDADATARPT, and EXPDATARPT reports.]


NAME          TYPE       DESCRIPTION
BIBLIOFRM     user       Add data to the BIBLIO table
BIBLIORPT     report     Generate report of data in BIBLIO table with system, reaction, and observable
BIBNPLTRPT    report     Generate bibliography in Nuclear Physics form with LaTeX commands for boldface type
BIBNPRPT      report     Generate bibliography in Nuclear Physics form
BIBPRLTRPT    report     Generate bibliography in Physical Review form with LaTeX commands for boldface type
BIBPRRPT      report     Generate bibliography in Physical Review form
BLBIBIDRPT    report     Generate a list of blank BIBIDs
DATENTFRM     user       Add data to CSREACTION, ENERGY, OBSERVABLES, and EXPDATA tables
DATUPDFRM     user       Update data in CSREACTION, ENERGY, OBSERVABLES, and EXPDATA tables
EDADATARPT    report     Generate EDA input data file
EXPDATARPT    report     Generate file of experimental data with bibliographic references
FIXANGLEFRM   user       Add excitation function data (energy, value, and error)
FIXENERGYFRM  user       Add angular distribution data (angle, value, error)
MAINMENU      user       Select menu choices for application
NEXTLETPRC    procedure  Take a character and return the next greatest one
RESEFRM       user       Update blank resolution data fields in ENERGY for a given compound system, energy, and reaction
RESERNGFRM    user       Add resolution data to ENERGY for an energy range
RPTMENU       user       Select menu choices for report
RUNHISTFRM    user       Keep a history of data files used in EDA runs

Table 2. Description Of EDAAPPL Frames And Procedure
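The one 4GL procedure, NEXTLETPRC, is described as taking a character and returning the next greatest one; its logic amounts to a one-line successor function. As a sketch of the equivalent behavior (not the actual Ingres 4GL code):

```python
def next_letter(ch):
    # Return the character that follows ch in collating order,
    # as NEXTLETPRC is described as doing for key generation.
    return chr(ord(ch) + 1)

print(next_letter("A"))  # B
```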

What was done

In the analysis phase, we specified the data, their relationships in the various tables, the kinds of forms needed to cover all possible types of experimental data that would be input to the database, and the reports, including the format of the EDA data file. Then, the database and its application were designed and created to meet these specifications.

We located, with the help of an earlier bibliographic file, many complete references to the experimental data and put them into the database. We obtained, for the first time, a complete set of references for the data that had been used in an analysis of reactions in the 5He nuclear system.

Code was written, based on the part of EDA that processes the input data, that allowed existing EDA data files to be read directly into the database tables. Even though the EDA input code and input file format are extremely complex, in less than two weeks this code was written and used to load about half of the existing data into the NUCEXPDAT database.
Many different types of reports were
generated from the database, including EDA
run files, bibliographic files in standard-text
and TeX format, using the style of either of
the two major nuclear physics journals, and
annotated listings of experimental data. The
run files were checked by using them in
actual EDA calculations, and verifying that
the answers duplicated the results of previous

runs. By entering qualification data and then pressing just one key, we were able to generate the EDA data file for the 5He compound system in eleven seconds (clock time).
After installing Ingres on RHO, we unloaded
the database and copied the application from
BALOO. The entire procedure required less
than fifteen minutes and was straightforward
to do because both Cray machines were
running the same version of Ingres.
Operations on the database and use of the application gave results identical to those on BALOO. We had no problems running
Ingres nor did we have to modify any code
because of Los Alamos UNICOS. Using
Ingres on RHO was frustrating because of
delayed responses when there were many
users on the machine. Machine time was
comparable on the two Crays.
Discussion
Creating an application using ABF was identical to creating one on a VAX or a Sun, only much faster. Because compilation and report-saving times were so short, and because reports using complicated queries of large amounts of data could be run in real time, application development time was greatly diminished.
The EDA code is an extreme case of program
complexity for reading input. Therefore,
rewriting the relevant code to process
existing data files and entering the data in the
database in less than two weeks suggests a
very short conversion time for codes with
more usual data reading requirements.
The EDA experimental data application (EDAAPPL):
• makes the existing experimental data files more uniform, especially in labeling and bibliographic information;
• allows rapid and easy editing of data files (by reaction, energy ranges, data type, etc.);
• simplifies and speeds up the task of entering new data, and provides some error checking;
• produces readable listings of experimental data for outside distribution;
• greatly facilitates the generation of bibliographies for reports, articles, etc., which reduces the time it takes to document research for publication; and
• creates data files so quickly that there is no reason to save these files on CFS for reuse, thereby reducing long-term mass storage requirements.
These capabilities could make significant
changes in the way one of the authors (G. H.)
does his work. He could spend more time on
physics and less on editing, file management,
and looking for the right data decks. One of
the most tedious parts of writing his papers,
compiling the bibliography of experimental
data references, can now be done
automatically, using the BIBID as a bibitem
label in a LaTeX-formatted bibliographic
file. The fact that there are no longer funds to hire specialists who enter and help manage data can be offset in some cases by the ease of using a well-designed, automated DBMS.
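The bibliography step described above amounts to a simple transformation from BIBLIO rows to LaTeX bibitems keyed on the BIBID. A sketch of the idea (the real reports are produced by Ingres report frames; the field names and journal style here are illustrative):

```python
def bibitem(bibid, authors, journal, volume, page, year):
    # The BIBID doubles as the \bibitem label, as described in the text.
    return (r"\bibitem{%s} %s, %s \textbf{%s}, %s (%s)."
            % (bibid, authors, journal, volume, page, year))

entry = bibitem("B1", "A. Author and B. Author", "Nucl. Phys. A",
                "123", "456", "1990")
print(entry)
```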
If Ingres becomes available on the ICN Cray
computers, we would first put the remaining
data from the existing data files into the
NUCEXPDAT system, along with their
associated bibliographical references. Any
new data, of course, would be added to the
system using the data-entry forms.
Then we would address the other input- and output-file questions. The files containing the parameter values and covariances should be fairly straightforward to put into a DBMS, because they are relatively small and
have a fixed format. Using these files, the
code can predict the results of any
measurement and their associated
uncertainties. In many applications, large
files of these predicted quantities, produced
in a complicated and rigid data format, are
the desired output of the analysis.
In other cases, resonance parameters and/or
phase-shifts that are of more fundamental
interest are produced. Some of the requests
received for information from EDA analyses


are delayed because of difficulties in
retrieving and reconstructing the information
used in the analysis at its last stage
(especially if it was done some time ago).
Implementing the same type of sophisticated
data-base system that was used for the
experimental data files to manage the many
types of information that are produced as
output files would benefit all aspects of the
EDA work at Los Alamos.
Many of the data management problems
solved by using a database management
system are similar to those of scientific
applications involving gigabytes or terabytes
of data. For these applications a DBMS
could be used to query the metadata, load the
files containing the full data set needed for a
particular run, and keep track of runs and
their resulting output files. The database
could contain metadata of three types: user
administrative data; internal data
management data; and storage information
data. A menu-driven 4GL application could
be written to get, list, update, and put files on
a storage system and to run analysis codes
including those producing images.
Conclusions
Project participants feel that Ingres worked well on the Cray computers, ran in reasonable time, and provides an environment that people will like and will use.
It was no more difficult or time-consuming to
port an existing database and applications to
the Cray than it has been to port them to
other platforms.
For both scientific and traditional database
applications, Ingres looked, acted, felt, and
ran the same on the Cray as on other
platforms, only it was faster. Because the
database and target execution machines are
the same, accuracy when going from one
machine to another is not an issue.
It was possible to create a useful scientific
database application on the Cray in a
reasonably short time. Application development time was less because the developer was able to try things without long waits for compilation of code, reports, and forms. Queries for reports could be coded in a single SQL statement instead of in steps because queries, even with aggregates, were fast.
No additional hardware was required to use
Ingres on the Cray.
Ingres was compatible with Los Alamos
UNICOS.
The availability of a database management
system like Ingres on a Cray greatly enhances
the efficiency and flexibility of users and
producers of large quantities of scientific data
on supercomputers.

PROVIDING BREAKTHROUGH GAINS:

CRAY RESEARCH MPP FOR COMMERCIAL APPLICATIONS
Denton A. Olson
Cray Research, Inc.
Eagan, Minnesota
Abstract

Since announcing plans to build the CRAY T3D
massively parallel processor (MPP), Cray
Research, Inc. has been approached by
numerous customers who have enormous needs
for database mining. Many are not our
traditional science and engineering customers.
These customers want to mine information
about, for example, buying trends, from their
terabyte-sized data bases. They are not asking
for help with their payroll or accounting
systems. They are not asking for COBOL.
They want to build huge decision support
systems. They want to search for reports and
drawings done ten years and seven million pages
ago. These customers are looking for MPP to
provide breakthrough gains for competitive
advantage. They have hit the computational
brick wall with their traditional mainframes and
the current MPP offerings. They have come to
the supercomputer company for help.

Historically, Cray Research parallel vector processing (PVP) systems have provided numerous strengths for DBMS processing, including: large memories (e.g., the CRAY Y-MP M90 series with up to 32 GBytes of real memory); fast memory (250 GBytes/sec for the CRAY C90); fast and configurable I/O subsystems for connecting to many different kinds of peripherals; UNICOS for UNIX compatibility; the network supercomputer protocols required for implementing client-server architectures; and the ability to embed SQL (the ANSI-standard structured query language of most DBMSs) in vectorized and parallelized C and FORTRAN codes.

Third-party DBMS products now available on Cray Research PVP systems are INGRES, ORACLE, and EMPRESS.

New Market Needs

Background

According to Bob Rohloff, Mobil Exploration & Producing Division vice president, Mobil Oil cannot tolerate the "data dilemma of the geoscientist who spends as much as 60 percent of his time simply looking for data", not working with it. (Source: Open Systems Today, July 20, 1992, "Mobil's Collaboration")

One of the reasons Cray Research originally ported database management systems (DBMS) to our computers was to address the concerns of scientists and engineers who spent much of their time just looking for data, and possibly having to recreate the data when it could not be found, rather than working with the data. Our customers have asked us to provide database management systems on Cray Research systems that will help address their data management needs.
Copyright © 1994.

There is a new class of customer that has approached Cray Research in the past year with needs that are often quite different from those of scientists and engineers. These needs are in the commercial area of 'database mining': searching the contents of databases looking for relationships and trends that are not readily apparent from the data structure.

These customers want to find sales trends, do marketing analysis, and look for drawings that they stored a million pages and many years ago. They present a wide spectrum of requirements (see Figure 1) such as decision support; report generation; data visualization; econometric modeling; and traditional scientific/engineering applications.


[Figure 1. Spectrum of commercial customer requirements: decision support, report generation, data visualization, econometric modeling, and traditional scientific/engineering applications.]

MPP-DBMS Is Core Requirement

The consistent core requirement for these customer inquiries is an MPP database. These customers have run into performance 'brick walls' with their current systems. They have large AT&T-GIS/Teradata systems and many IBM 3090s and ES-9000s. They came to Cray Research for help, even before we announced the CRAY T3D. These customers are moving to Open Systems, which right now means UNIX. They sought out the high-end suppliers of UNIX systems and, as Figure 2 shows, found Cray Research. Cray Research's UNIX market share and experience gives these customers a level of comfort that a major player in this market can provide for their UNIX needs.

These commercial customers are not asking for COBOL or payroll systems or traditional data processing applications. Many are asking Cray Research for help with Decision Support Systems (DSS), not the time-critical OnLine Transaction Processing (OLTP) or Point-Of-Sale (POS) systems. They need fault resistance, not fault tolerance. There are a limited number of sophisticated users on-line, not the hundreds of users that might be required for banking or reservation systems. There are also significant 'number crunching' components to their applications, in concert with the database processing they want to do.

[Figure 2. Percentage of the market for large UNIX systems; total estimated 1993 market: $3.5 billion. Shares shown include Other at 40% and Sequent at 12%. Source: InfoCorp and September '93 Electronic Business Buyer.]

Requirements for MPP-DBMS

A technical overview of the CRAY T3D is beyond the scope of this paper and is adequately covered in other CUG papers. But briefly, when studying hardware requirements for running databases on MPP platforms, the following features must be examined:
• CPU performance. The 150-MHz Alpha RISC microprocessors from Digital Equipment can provide the horsepower needed for running scalar, CPU-intensive, MPP-DBMS applications.
• Latency. Low latency is important for MPP-DBMS codes that currently rely on message-passing models. The CRAY T3D provides industry-leading low latency.
• Bandwidth. MPP databases need to move massive amounts of data. The CRAY T3D provides unrivaled bisection bandwidth.
• I/O. The CRAY T3D's Model E IOS subsystem provides a wealth of peripheral capabilities for disks, production tapes, and a large spectrum of standard networking protocols.

Conclusion
Cray Research's activities in providing MPP databases for Cray Research systems are part of a Data Intensive Systems (DIS) focus. We are in discussions with numerous customers who have DIS and MPP-DBMS needs. These customers span a wide spectrum of industries, from petroleum, to government, to retailers. Cray Research is studying the hardware and software requirements for MPP-DBMS, and is establishing requirements for our follow-on products. We are also in discussions with various system integrators who can provide great synergy and help us move into the commercial marketplace.
Customers and prospects continue to come to Cray Research for help because it is clear from what they are telling us that there is no MPP-DBMS 'winner' to date.


Asynchronous Double-Buffered I/O Applied to Molecular
Dynamics Simulations of Macromolecular Systems
Richard J. Shaginaw, Terry R. Stouch, and Howard E. Alper
Bristol-Myers Squibb Pharmaceutical Research Institute
P.O. Box 4000
Princeton, NJ 08543-4000
shaginaw@bms.com stouch@bms.com

ABSTRACT

Researchers at the Bristol-Myers Squibb Pharmaceutical Research Institute use a locally modified Discover (tm, Biosym Technologies, Inc.) to simulate the dynamics of large macromolecular systems. In order to contain the growth of this code as problem size increases, we have further altered the program to permit storage of the atomic-neighbor lists outside of central memory while running on our Y-MP2E/232 with no SSD. We have done this efficiently by using BUFFER OUT and BUFFER IN to perform asynchronous double-buffered I/O between neighbor-list buffers and DD-60 disk storage. The result has been an improvement in turnaround for systems currently under study (because of reduced swapping) and enablement of much larger simulations.

1.0 Introduction
Each atom in a molecular dynamics simulation of a very large molecule or of a macromolecular system must sense the attractive/repulsive forces of neighboring atoms in the system. These atomic neighbors include the atoms covalently bonded to the atom in question as well as those not bonded to it but lying within a prescribed cutoff distance. Faced with the choice of either identifying all neighbors each time step or maintaining a periodically updated list of neighbors, researchers ordinarily choose the latter approach.

The size of such a neighbor list is linearly proportional to atom count and geometrically proportional to cutoff length. For researchers interested in treating very large systems as accurately as possible, the physical limitation of computer memory is a handicap. Especially in the Y-MP environment, with expensive central memory and no virtual paging, the hardware limits the size of the problem. On less powerful platforms, these computationally intensive problems require so much time as to be intractable. Moreover, a virtual-memory system without the ability to advise the system to pre-fetch pages cannot accommodate an extremely large program.

We have solved this dilemma by using the BUFFER IN and BUFFER OUT statements in FORTRAN on the Y-MP to achieve asynchronous transfer of very large neighbor lists to high-speed disk as the program creates the neighbor list, and from disk each time step when the neighbor list is needed. This paper details the implementation strategy and our results. Section 2 is a more complete statement of the problem. Section 3 discusses our programming strategy in detail. Section 4 presents the FORTRAN specifics of the implementation. Section 5 is a summary of our results and conclusions.

2.0 Statement of Problem
Numerical simulation of a biological system using molecular dynamics techniques requires repeated calculation (each time step along a numerically integrated trajectory) of the force exerted on each atom in the system by every other atom within a specified cutoff radius. Each atom then moves once per time step in response to these forces. The atoms whose electronic forces affect the motion of a given atom are considered "neighbors" of that atom. Most of an atom's neighbors are not chemically bonded directly to that atom; nevertheless, their position, charge, and motion are vital pieces of information.

All atoms in a system under study are indexed using positive integers. One approach to molecular dynamics involves tracking every atom's neighbors by keeping a list of neighbors' indices in an integer array. In turn, two other integer arrays maintain for each atom a pointer to its first neighbor's position in the list and another pointer to the last. The program reconstructs the neighbor list at an interval specified by the researcher based on his or her expectations for the movement of atoms into and out of cutoff range.

Obviously, a large macromolecular system contains many atoms; the neighbor list grows approximately linearly with atom count. Moreover, accurate simulation requires a long cutoff distance; the neighbor list grows geometrically with cutoff distance. Therefore the real memory available to a program limits the size and accuracy of any simulation.
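The index-list-plus-pointer layout described above can be sketched as follows. This is a toy one-dimensional version with 1-based indices in the FORTRAN style; the production code is FORTRAN and works in three dimensions:

```python
# Neighbor-list storage: one flat index list plus first/last pointer arrays.
def build_neighbor_list(positions, cutoff):
    n = len(positions)
    nlist, first, last = [], [0] * n, [0] * n
    for i in range(n):
        first[i] = len(nlist) + 1          # pointer to atom i's first neighbor
        for j in range(n):
            if i != j and abs(positions[i] - positions[j]) <= cutoff:
                nlist.append(j + 1)        # store 1-based neighbor index
        last[i] = len(nlist)               # pointer to atom i's last neighbor
    return nlist, first, last

# Three atoms on a line at x = 0.0, 1.0, 3.0 with a cutoff of 2.0:
nlist, first, last = build_neighbor_list([0.0, 1.0, 3.0], 2.0)
print(nlist, first, last)  # [2, 1, 3, 2] [1, 2, 4] [1, 3, 4]
```

The list grows linearly with atom count and, in three dimensions, with the cube of the cutoff, which is why it dominates memory as systems grow.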
At BMSPRI, researchers are interested in the structure and dynamics of lipid bilayers as instantiated in animal cell membranes. The membrane's structure and dynamics control the transport and diffusion of compounds in their vicinity, especially those which cross the membrane into or out of the cell interior. The compounds of interest include drug molecules, toxins, antigens, nutrients, and others. The cell membrane incorporates a variety of embedded proteins, some of which function as receptors for a variety of stimuli, and others of which act as ion channels.

As they have focused their resources on the lipid bilayer, BMSPRI scientists have increased the size and accuracy of the simulations they use in their research. This increase has led to a non-bond neighbor list approaching 7 million Cray words in size. Added to an already large, complex program, this memory requirement has pushed the size of the program beyond 14 megawords (roughly half of our Y-MP2E/232). Research progress dictates that future simulations must embrace much larger systems with longer cutoffs. In order to achieve this, the researchers have decided to try to reduce the strong dependency of program size on molecular system size and cutoff distance.

3.0 Solution Strategy
Multiple-buffered asynchronous I/O is in common use in graphical animation, in numerical modeling of very large physical systems, and in other computer applications. The basic approach is to create and open at least one file, using specialized I/O library functions or subroutines. These calls permit data to bypass the I/O buffers that are ordinarily part of user data space (in the FORTRAN or C library) and to bypass the buffers that are maintained by the system in kernel space. In other words, these I/O calls permit data transfer directly between the user program's data space and unstructured files on the storage medium. The programmer uses at least two program arrays as I/O buffers, and the program must include the bookkeeping needed to make well-formed I/O requests (I/O transfers that are integer multiples of the disk device's physical block size). Avoiding library and system buffers permits program execution to continue while I/O proceeds. Special calls then permit blocking of execution in order to synchronize data transfers before writing to or reading from the program arrays functioning as buffers, to protect data integrity.
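The fill/swap/synchronize bookkeeping this implies can be sketched in a language-neutral way. In this sketch a worker thread stands in for the hardware's asynchronous transfer, the two-item buffer capacity is arbitrary, and `write` is any sink; the real code uses FORTRAN BUFFER OUT and LENGTH instead:

```python
from concurrent.futures import ThreadPoolExecutor

def double_buffered_write(chunks, write):
    # Fill one buffer while the alternate buffer is being "written" asynchronously.
    executor = ThreadPoolExecutor(max_workers=1)
    pending = None                         # outstanding asynchronous write
    buffers = [[], []]
    current = 0
    for chunk in chunks:
        buffers[current].append(chunk)     # fill the current buffer
        if len(buffers[current]) >= 2:     # "buffer full" test
            if pending is not None:
                pending.result()           # synchronize the alternate buffer
            pending = executor.submit(write, list(buffers[current]))
            buffers[current].clear()
            current = 1 - current          # swap buffers
    if buffers[current]:                   # flush the end case
        if pending is not None:
            pending.result()
        pending = executor.submit(write, list(buffers[current]))
    if pending is not None:
        pending.result()
    executor.shutdown()

out = []
double_buffered_write(range(5), out.append)
print(out)  # [[0, 1], [2, 3], [4]]
```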
Cray Research supports several techniques for asynchronous I/O. Table 1 outlines these.

Table 1. Cray Research Asynchronous I/O Options.

Technique        Description
READDR/WRITDR    record-addressable random-access I/O
GETWA/PUTWA      word-addressable random-access I/O
AQREAD/AQWRITE   queued I/O
BUFFER IN/OUT    unbuffered, unblocked I/O

As structured at the start of this effort, BMSPRI-modified Discover generates three neighbor lists; two of these are subject to rapid growth with cutoff distance, and so these two are our candidates for disk storage. Our strategy is to store both lists of neighbors on disk at the time of list creation, and then to read up this list each time step in the routines that compute the non-bond energies of the system. We have no need for record- or word-addressable random access, because we know a priori that the energy routines require the data to be read sequentially. Likewise, the random-access capability of the queued I/O calls is unnecessary. We have decided to use BUFFER IN and BUFFER OUT to achieve asynchronous transfer.

For efficiency, the ratio of transfer time to the quantity (transfer time + latency) must be close to unity; therefore the buffer size must be sufficiently large to overwhelm latency. In contrast, for I/O to overlap execution completely, the buffer size must be sufficiently small to permit completion of the transfer before the buffer is needed again.

Our fastest filesystems reside on unstriped logical devices built on DD-60 drives, with one drive per I/O channel. The fastest user-writable file system is one we call /tmp/people, a continuous scratch area of about 5 GB, where every user with an account owns a directory. The worst-case maximum rotational latency for DD-60 devices is 26 milliseconds, according to Cray Research. We have found that unbuffered writes to existing unblocked DD-60 files run at about 19 megabytes per second, while unbuffered reads from the same files proceed at about 16 megabytes per second. This asymmetry may be due to the fact that each read requires a seek operation, while the drives when idle are positioned for writing.
At 16 MB per second, the smallest I/O request size that permits 90% efficiency is 490,734 words. The loop timing in the non-bond-energy routines (under a system load of four simultaneous batch jobs) averages 0.18 seconds, and so the maximum transfer size to achieve overlap is 377,487 words. These limiting values clearly eliminate the possibility of 90% efficient asynchronous I/O in our case. Nevertheless, we have chosen to accept an efficiency level of less than 90% in order to test our strategy. We have chosen a buffer size of 409,600 words, which is exactly 200 DD-60 sectors in length. This buffer size leads to an efficiency of 88%, but may lead to incomplete overlap on the read side under typical load. In the case of heavy load, overlap on both read and write will be complete.

We have decided to employ two files, each corresponding to one buffer, for each list, in order to maximize overlap in end cases. In other words, we use four files in the current implementation. We use the same buffers for reading and for writing, and for both neighbor lists.
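The request-size arithmetic above can be checked with a short calculation. This sketch assumes the 16 MB/s read rate is measured in binary megabytes (2^20 bytes) and that a Cray word is 8 bytes, which reproduces the figures quoted in the text:

```python
import math

WORD = 8                      # bytes per Cray word (assumption)
RATE = 16 * 1024 * 1024       # read rate: 16 MB/s, taken as binary MB (assumption)
LATENCY = 0.026               # worst-case DD-60 rotational latency, seconds

def min_request_words(efficiency):
    """Smallest request for which transfer/(transfer + latency) >= efficiency."""
    t = LATENCY * efficiency / (1 - efficiency)
    return math.ceil(t * RATE / WORD)

def max_overlap_words(loop_seconds):
    """Largest request that finishes within one compute-loop interval."""
    return int(loop_seconds * RATE / WORD)

def buffer_efficiency(words):
    t = words * WORD / RATE
    return t / (t + LATENCY)

print(min_request_words(0.90))                 # 490734
print(max_overlap_words(0.18))                 # 377487
print(round(100 * buffer_efficiency(409600)))  # 88
```

Since the 90%-efficiency minimum (490,734 words) exceeds the complete-overlap maximum (377,487 words), no buffer size can satisfy both, which is the trade-off the authors resolve by accepting 88% efficiency.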

4.0 FORTRAN Implementation
BMSPRI uses the program Discover from Biosym Technologies, Inc. to carry out its lipid bilayer simulations. The Institute holds a source license for Biosym, and researchers have modified the program substantially, to include theory useful in their specific problem area. Two neighbor-generation routines use several criteria to determine the relationship of each atom in the system to every other atom in the system. These routines create two separate integer lists of neighbor indices. Two non-bond-energy routines read through these neighbor lists each time step; these integer lists point into an array containing charge and position data for each atom. Four routines contain all the code modifications made in the double-buffering effort.

Two COMMON blocks contain six control variables necessary for bookkeeping and the buffers themselves, dimensioned (LBUF,2), where LBUF = LREC + LPADD; LREC = 409,600 and LPADD is a pad-region length, set to 10,000 words.

Each neighbor-generating routine has two loops that iterate across all "atom groups" in the system. The first of these loops is operative in the case where all atoms in the system are permitted to move; the second loop, when some atoms are held fixed in space. The energy routines each contain three exclusive loops in which the neighbor list is read, one atom at a time.

In the list-generation routines, the first action taken is to synchronize both files used by each of the two neighbor lists, using the FORTRAN extension LENGTH. Then the program repositions both files into which it is about to write to the beginning of data. Then we initialize all control variables. During each iteration of the main loop, the program stores each atom's neighbor indices sequentially into the "current" buffer and advances the buffer pointer. At the end of each iteration, the program tests the buffer pointer, and if the buffer has overflowed into the pad region, it initiates a BUFFER OUT for the current buffer. If this is a second or subsequent write, it uses LENGTH to synchronize the write of the alternate buffer. Then it moves the content of the pad region to the start of the alternate buffer. At this point, the program switches the alternate buffer to current status. Then, whether the "buffer full" test has passed or failed, if this is the last loop iteration, the program initiates a BUFFER OUT of the current buffer; otherwise it continues iterating.
The first action in the non-bond energy routines is to synchronize all four files and to initialize local control variables. Then we reposition both files about to be read to the beginning of data. Then the program initiates reads into both buffers and blocks execution, using LENGTH, to synchronize the read into the current buffer. The program uses buffer-swapping techniques analogous to those in the generation routines to manage the buffers during loop iteration.

To achieve asynchronous I/O, the files in use must be typed as "unblocked" files. The UNICOS command "assign" with the option "-s u" creates a file of type "unblocked". We name these files uniquely for each batch job by including the C-shell process-ID substitution string "$$" in each of the four file names. At the end of the job, we remove all four files.

5.0 Results and Conclusions
Implementation of these code changes has reduced the size of the executable code for a 30,000-atom case with a cutoff of 12 Angstroms by 3.4 megawords. Moreover, program size is no longer dependent on cutoff length.

The overhead incurred by BUFFER IN/OUT and the additional bookkeeping in the program has led to an increase in CPU time of 1% to 2%. In our environment this will cause a worst-case increase in turnaround time of one day (out of 6 weeks of wallclock time) for a nanosecond of simulated time. This turnaround delay is acceptable to Institute researchers. Moreover, the improvement in turnaround time due to a reduction in swap-in-queue residency more than compensates for this penalty in most cases. On the other hand, at this point the code sustains an increase in I/O wait time from effectively zero to between 3% and 5% of CPU time. We expect this wait time to increase turnaround time to an unacceptable level. Profiling of the affected routines reveals that essentially all of these I/O waits occur on the read side, in the non-bond energy calculation routines. We believe that this reflects the lower speed of a typical read. Writing the contents of a 409,600-word buffer to an unblocked file resident on an unstriped DD-60 takes an average of 0.16 seconds; reading a 409,600-word unit of data from the same file into a buffer takes about 0.19 seconds. With our typical system load of four simultaneous batch jobs, our I/O scheme tries to do a read or write every 0.17 to 0.18 seconds, on average. This asymmetry between read and write performance can explain the additional I/O wait time.

Our next modification to this program will be to add a
third buffer to accommodate a read-ahead of the data
chunk beyond the "next" in the energy routines. This
should nearly eliminate the I/O wait overhead. We
also plan to experiment further with the queued asynchronous I/O strategies.


A PVM Implementation of a Conjugate Gradient Solution
Algorithm for Ground-Water Flow Modeling
by
Dennis Morrow, John Thorp, and Bill Holter

Cray Research, Inc. and NASA Goddard Space Flight Center

SUMMARY

This paper concerns the application of a conjugate-gradient
solution method to a widely available U.S. Geological Survey
(USGS) ground-water model called MODFLOW, which was
used to solve a ground-water flow management problem in
North Carolina.

The need to model larger, more complex problems is coupled with
a need for more efficient parallel algorithms which can take
advantage of supercomputer hardware and reduce the wall-clock
time needed to reach a solution.

The performance of the MODFLOW model incorporating a
polynomial preconditioned conjugate gradient (PPCG) algorithm
is presented on the Cray C90, and a PVM implementation of the
algorithm on the Cray T3D emulator is outlined. For this
large-scale hydrologic application on a shared memory
supercomputer, the PPCG method is the fastest of the several
solution methods examined. Further work is needed to ascertain
how well PPCG will perform on distributed memory architectures.

The sections in this paper first introduce the USGS MODFLOW
model and its application to a North Carolina ground-water
flow problem. Next the PPCG algorithm and similar CRAY
library routines are discussed, followed by tables of CPU timing
results and the parallel speed-up ratio attained by the
MODFLOW model on the Cray C90. The final section discusses
the distributed memory implementation of the PPCG algorithm
using PVM on the Cray T3D emulator, followed by a summary
and plans for future work.

ABSTRACT

There is a need for additional computing power for modeling
complex, long-term, real-world, basin-scale hydrologic
problems. Two examples which illustrate the computational
nature of ground-water modeling are:

1. the stochastic nature of the model input data may require a
sensitivity analysis for each model input parameter and/or a
number of Monte Carlo simulations
2. optimal wellfield pumping scenarios for minimizing
drawdown or for control of contaminant plumes or saltwater
intrusion can require many independent simulations

INTRODUCTION
This paper examines replacements for the matrix solution algorithm
used in a USGS ground-water model, MODFLOW. MODFLOW is
the short name for the Modular Three-Dimensional Finite
Difference Ground-Water Flow Model [10], a publicly available
model which is widely used in industry, academia, and government.
MODFLOW simulates transient ground-water flow and can
include the influence of rivers, drains, recharge/discharge wells, and
precipitation on both confined and unconfined aquifers.

A WATER RESOURCE MANAGEMENT APPLICATION
The application presented in this paper is a water resources
management study conducted by Eimers [2,3] with the MODFLOW
model to determine the influence of pumping on a 3,600
square-mile study area that is a subset of the ten-vertical-layer
aquifer system composing the 25,000 square-mile North Carolina
Coastal Plain. The aquifer model consists of 122,400 finite
difference cells on a 10 x 120 x 102 grid. This is a transient problem
with 120 time steps representing a total simulation time of 87 years.
The number of wells in the model ranges from 1,416 to 1,680,
although some of these are not actual well sites but are
pseudo-wells needed to constrain the hydraulic condition at certain
political boundaries where there is no corresponding hydrogeologic
boundary. Vertical conductance, transmissivity, and storage
coefficients can vary by node within each layer but are assumed to
have no directional component.

GROUND-WATER SOLUTION ALGORITHMS
Many algorithms are available to solve the ground-water flow
equations, and there is certainly not one best algorithm for all
problems on all computers. MODFLOW constructs a sparse
matrix called the A matrix from the discrete form of the flow
equations, then solves the matrix. The matrix is symmetric,
positive definite, and has three off-diagonals. In MODFLOW,
97% of the total CPU time is spent in the A matrix solution.
A direct linear equation solver such as banded Gauss elimination
computes an explicit factorization of the matrix and in general
guarantees an accurate solution. Iterative matrix solvers, such as
the strongly implicit procedure supplied with MODFLOW,
begin with an initial approximate solution and then converge
toward the solution, improving the approximate solution
with each iteration.
Used appropriately, iterative algorithms can be as accurate as,
and much faster than, direct methods applied to these kinds of
problems. Dubois [1], Hill [5], Jordan [8], Van der Vorst [15], and
many others have confirmed the desirable properties of iterative
conjugate gradient (CG) solvers on vector computers. Johnson [7]
suggested polynomial preconditioners for CG solvers, and Oppe,
et al. [12] implemented a general non-symmetric preconditioned
CG package on a CRAY.
Specifically for ground-water modeling, Kuiper [9] compared the
incomplete Cholesky CG method with the strongly implicit
procedure (SIP) described by Weinstein, et al. [16], and Scandrett
[14] extended the work and included timings on the CDC Cyber
205, reporting that PPCG has very good convergence and that the
iteration sequence is completely vectorizable. Morrow and
Holter [11] implemented a single-CPU vectorized PPCG solver
for MODFLOW on the Cyber 205. Saad [13] discussed the steps
needed to implement a parallel version of the PPCG algorithm,
and Holter, et al. [6] attained 1.85 gigaflops on a CRAY Y-MP8
with a multitasked PPCG solver for a two-dimensional
ground-water diffusion problem.

The PPCG algorithm provides an efficient, vector-parallel
solution of AX = B, where A is symmetric, banded, and diagonally
dominant. It is assumed that A has been normalized (by a
straightforward procedure) so that all its diagonal elements are
equal to unity.
The algorithm utilizes a least squares polynomial approximation
to the inverse of A, and calls for the repeated multiplication of
this approximate inverse with a vector of residuals, R [11].
The steps in the PPCG algorithm can be summarized as follows:
1. Set an initial estimate of the groundwater pressure heads in
the aquifer.
2. Compute the vector of residuals.
3. Form two auxiliary vectors from the residuals.
4. Iteratively cycle through a six-step process which updates the
heads and residuals until convergence.
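The steps above can be sketched as a conventional preconditioned CG iteration. This is a minimal illustration, assuming a normalized symmetric positive definite A, and it uses a truncated Neumann polynomial for the preconditioner rather than the authors' least squares polynomial:

```python
import numpy as np

def poly_precondition(A, r, degree=3):
    # Approximate inv(A) @ r by the truncated Neumann series
    # sum_{k=0..degree} (I - A)^k r, valid when A is normalized
    # (unit diagonal) and diagonally dominant.
    z = r.copy()
    term = r.copy()
    for _ in range(degree):
        term = term - A @ term     # multiply by (I - A)
        z += term
    return z

def ppcg(A, b, tol=1e-10, max_iter=200):
    # Preconditioned conjugate gradient with the polynomial
    # preconditioner above (a sketch, not the authors' code).
    x = np.zeros_like(b)           # step 1: initial head estimate
    r = b - A @ x                  # step 2: residual vector
    z = poly_precondition(A, r)    # step 3: auxiliary vectors
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):      # step 4: iterate to convergence
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = poly_precondition(A, r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x
```

Applying the polynomial preconditioner costs only a few extra matrix-vector products per iteration, which is what keeps the whole iteration sequence vectorizable.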

CRAY SCIENTIFIC LIBRARY ROUTINES

As mentioned previously, several variations of preconditioned CG
matrix solvers perform well on vector computers. Several of these
methods are incorporated into a single routine in the CRAY
UNICOS Math and Scientific Library (SCILIB). The routine is
called SITRSOL and it is a general sparse matrix solver which
allows the selection of a preconditioned conjugate gradient-like
method. SITRSOL has many selections for the combination of
preconditioner and iteration method. The six options for iterative
method (accelerators) are:
1. (BCG) -- Bi-conjugate gradient method
2. (CGN) -- Conjugate gradient method applied with Craig's
method
3. (CGS) -- Conjugate gradient squared method
4. (GMR) -- Generalized minimum residual method
5. (OMN) -- Orthomin/generalized conjugate residual method
6. (PCG) -- Preconditioned conjugate gradient method
For preconditioners, there are also six options:
1. No preconditioning
2. Diagonal (Jacobi) preconditioning
3. Incomplete Cholesky factorization
4. Incomplete LU factorization
5. Truncated Neumann polynomial expansion
6. Truncated least squares polynomial expansion

Not all combinations of preconditioners work with all the
selections for accelerators. For instance, incomplete LU
factorization cannot be used with a symmetric matrix in
half-storage mode, and PCG can be used only if the matrix is
positive definite, which the matrix resulting from MODFLOW
always is.
The performance of several of these SITRSOL matrix solution
routines is compared for solving a test problem of similar size to
the North Carolina ground-water modeling problem (see Table
1). All the runs were made on one CPU of a Y-MP 2E.

10x125x125 cell MODFLOW problem

Algorithm   Preconditioner    memory    algorithm    CPU
                              Mwords    megaflops    seconds
PPCG*       polynomial         1,435       232         190
BCG         least squares     11,400        45         956
CGS         least squares      3,650       106         275
PCG         least squares      3,156        98         232
PCG         Neumann            5,774       127         441
PCG         Cholesky          18,342        16         951

*not a part of SITRSOL

Table 1. PPCG and SITRSOL timing comparisons.

The column headed memory is the CPU memory integral
reported by the UNICOS ja command. This indicates an average
memory requirement for the duration of execution of the entire
code. The numbers can be used to infer a comparative memory
requirement for the solution algorithms.

For the SITRSOL solution routines, the least squares
preconditioner combined with the PCG iterative method had the
lowest time and memory requirement of the five attempted
SITRSOL combinations, but PPCG was the overall best. The
PPCG algorithm is coded to take advantage of the specific
sparse-diagonal matrix structure in the MODFLOW model.

All the listed computer runs were made during the day on
production systems, and none of the results are benchmark runs,
though, as noted, one of the runs was made in the
single-job-streaming queue (only one batch job is allowed to run
at a time). The reported wall-clock times will vary depending on
the number of jobs in the system. The listed CPU times and
megaflop rates are fairly independent of system load.

SHARED MEMORY PARALLEL-VECTOR STRATEGY
Compiler-level strip mining of DO loops was the data-parallel
approach used to implement the PPCG algorithm on multiple
processors of the Cray C90. Several modifications to the
single-CPU PPCG algorithm were made to accomplish this. Some
of these modifications also improved the performance of the
single-CPU implementation of the algorithm.
The MODFLOW code has a large scratch array dimensioned in the
main program, and individual variables are referenced by pointers to
locations within the scratch array. In some cases, this degrades
data locality. To improve data locality and to avoid unnecessary
references to memory, MODFLOW variables which were a part of
the matrix solution were removed from the large scratch array and
recast as arrays which were local to the PPCG solution routine.
Also, some variables were removed from DATA and SAVE
statements in order to have as many variables as possible stored on
the stack.
Next, the lengths of the off-diagonals were padded by
zero-filling to match the length of the longest diagonal of the matrix
(the main diagonal). This allowed the elimination of some IF
tests associated with the shorter diagonals. Also, several DO
loops of varying lengths could then be collapsed into a single DO
loop, thus organizing a large amount of work into a single parallel region.
For the North Carolina water management problem, loop lengths
were fixed by PARAMETER statements at 122,400 to eliminate the
need for execution-time checking of loop lengths prior to
strip-mining.
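The effect of the zero-padding can be illustrated with a small sketch (not the MODFLOW source): storing each off-diagonal at full length turns the matrix-vector product at the heart of PPCG into a few long, branch-free vector operations.

```python
import numpy as np

# Sketch of the zero-padding idea for a symmetric banded matrix.
# Each off-diagonal is stored at full length, zero-filled past its
# true end, so no IF tests are needed inside the loops. The offsets
# (1, NX, ND) in the text mirror the 3-D grid structure; this
# routine is an illustration only.

def banded_matvec(main, off_diags, x):
    """main: length-n main diagonal.
    off_diags: {offset k: padded length-n diagonal d}, where d[i]
    couples unknowns i and i+k (d is zero for i >= n-k)."""
    n = x.size
    y = main * x
    for k, d in off_diags.items():
        y[:n - k] += d[:n - k] * x[k:]      # upper-diagonal term
        y[k:] += d[:n - k] * x[:n - k]      # symmetric lower term
    return y
```

Because every diagonal has the same padded length, all rows are treated uniformly and the loops vectorize (or strip-mine) without per-iteration length checks.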
Table 2 presents the Y-MP C98 timing results for the entire
MODFLOW model solving the North Carolina water management
problem (122,400 groundwater cells for the 87-year simulation
period) with the shared memory implementation of PPCG.

10x120x102 cells (122,400 equations)
Y-MP C98

#CPUs   Wall    CPU    Mflops   Concurrent   Total
        sec.    sec.   /CPU     avg. CPUs    Gflops
1        61      57      641       1.0        0.64
2        40      62      589       1.6        0.94
4        31      65      558       2.2        1.2
8        28      67      541       2.4        1.3
8*       15      78      466       5.2        2.4

*dedicated run

Table 2. Timing results for the MODFLOW model

The CPU times increase slightly with additional CPUs, and the
reduction in wall clock time illustrates that multitasking can cut
the turnaround time on a production system.
DISTRIBUTED MEMORY MESSAGE PASSING STRATEGY
Parallel Virtual Machine (PVM) was used to implement the
distributed version of the PPCG algorithm on the T3D emulator
running on the multiple processors of the Cray C90.
PVM is a public domain set of library routines originally
developed for explicit communication between heterogeneous
systems tied to a network [4]. PVM was developed at Oak Ridge
National Laboratory, and it has become a de facto message passing
standard.
Message passing is a parallel programming paradigm which is
natural for network-based or (in this case) distributed memory
systems. An additional benefit of the message passing paradigm is
that it is portable. A message consists of a user-defined message
tag and the accompanying data. Messages may be either
point-to-point or global (broadcast). In point-to-point
message passing, the sender specifies the task to which the message
is to be sent, the data to be sent, and the unique message tag to label
the message. Then the message is packed into a buffer and sent to
the interconnection network. The receiver specifies the task from
which the message is expected, the unique message tag which is
expected, and where to store the data which is received from the
network.

In an actual PVM implementation in FORTRAN, the sending task
makes three PVM calls which (1) create the send buffer, (2) pack
the data into the send buffer, and (3) send the message. Similarly,
the receiving task makes two PVM calls which (1) receive the
message from the network, and (2) unpack the data into the local
memory user space.
A general observation is that message passing is to parallel systems
as assembly language is to mainframes. Considerable modification
of the shared-memory version of the PPCG algorithm was
necessary to accomplish the implementation with message passing.
The explicit nature of message passing can also be tedious (unique
message tags, five FORTRAN routine calls to send/receive any
piece of data, etc.). This discourages frequent communication.
Avoiding unnecessary communication is an important part of
distributed computing.
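The three-call send and two-call receive sequence can be mocked up as follows. A queue stands in for the PVM network, and the function names only echo, not reproduce, the real pvmf* interface (whose argument lists differ); matching on the sending task is omitted for brevity.

```python
import queue

network = queue.Queue()   # stand-in for the interconnection network

def mock_pvm_send(dest, msgtag, data):
    buf = []                           # (1) create the send buffer
    buf.extend(data)                   # (2) pack the data into it
    network.put((dest, msgtag, buf))   # (3) send the message

def mock_pvm_recv(msgtag):
    dest, tag, buf = network.get()     # (1) receive from the network
    assert tag == msgtag               # match on the expected tag
    return list(buf)                   # (2) unpack into user space
```

Even this toy version shows why frequent fine-grained communication is discouraged: every transfer pays the pack/send/receive/unpack overhead regardless of how little data it carries.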
The message passing strategy for implementing the PPCG
algorithm on the T3D emulator began with equally distributing
the data contained in FORTRAN arrays among the total number of
processors. All four diagonals of the A matrix (the matrix to be
solved) are copied from the master processor (PE0), so that each
processor has a local copy of its part of the matrix. This is
necessary because the A matrix is set up by the MODFLOW model
and passed to the matrix solution routine and, to date, only the
PPCG solution routine has been implemented in PVM. In the
program, the arrays W1, WM1, P, R, SQINV, and B can all be
distributed equally across the processors. For example, Figure 1
shows three of these six arrays distributed across four processors.

Figure 1. Distributed linear arrays R, SQINV, and P

A second part of the message passing strategy for implementing
the PPCG algorithm was to accomplish communication between
processors by making and distributing offset copies of the A matrix
and the temporary work vectors needed in the iterative solution.
There are six regular communication patterns which trace back to
the three dimensions of the groundwater problem being solved.
These six patterns involve offsetting the data by either plus or
minus 1, plus or minus ND, or plus or minus NX, where for this
particular groundwater problem, ND=12,240 and NX=102.
This communication is similar to the End-Off Shift (EOSHIFT)
operation, where data does not wrap around from the last processor
to the first. Two of these communication patterns are shown in
Figure 2.

Figure 3 shows a complete pattern involving all six offsets.
Note that not all offsets will result in off-processor
communication. In Figure 3, the -1 offset on PE1 is an
on-processor communication. The mix between on- and
off-processor communication also will change with the total
number of processors.

Figure 3. Communication Pattern

Also, there are a number of reduction operations that are
accomplished using PVM to communicate the partial sums from
each processor back to PE0 using a standard library-like routine.
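The end-off shift exchange can be sketched with a list of per-processor chunks standing in for distributed memory; the offsets and chunk sizes here are illustrative, not the problem's actual ND and NX.

```python
def eoshift_exchange(chunks, offset):
    # chunks: equal-length per-processor arrays standing in for
    # distributed memory. Returns what each processor would hold
    # after a global end-off shift by `offset`: data shifted off the
    # end is lost and zeros shift in (no wraparound, as in EOSHIFT).
    flat = [v for c in chunks for v in c]        # global view
    n = len(flat)
    pad = [0] * min(abs(offset), n)
    if offset >= 0:
        shifted = flat[offset:] + pad
    else:
        shifted = pad + flat[:max(n + offset, 0)]
    size = len(chunks[0])
    return [shifted[i * size:(i + 1) * size] for i in range(len(chunks))]
```

On the T3D proper, each processor would exchange only the boundary values that cross a processor edge with its neighbors; the flat global list here is just for clarity, and small offsets stay mostly on-processor, as noted in the text.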
RESULTS and FUTURE PLANS
For the MODFLOW ground-water modeling application, CPU
times for several shared-memory sparse matrix solution
algorithms are compared. The PPCG algorithm on a Cray
YMP-C98 ran at 2.4 gigaflops and attained a speedup ratio of
5.2. The distributed-memory version of the PPCG algorithm
was implemented on the emulator and the next step is to port the
code to the CRAY T3D.
CONCLUSIONS
Supercomputers and efficient vector-parallel solution
algorithms can speed processing and reduce turn-around time for
large hydrologic models, thus allowing for more accurate
simulations in a production environment where wall-clock
turnaround is important to the ground-water model user.
SUMMARY
A data-parallel shared memory implementation and a PVM
distributed memory implementation of a polynomial
preconditioned conjugate gradient solution algorithm for the U.S.
Geological Survey ground-water model MODFLOW were
presented. The PVM solution is accomplished on multiple
processors and can be ported to the T3D.
Performance on the Cray C90 is presented for a three-dimensional
122,400 cell anisotropic ground-water flow problem
representing a transient simulation of pumping a North Carolina
aquifer for 87 years.

Figure 2. EOSHIFT Operation


ACKNOWLEDGEMENTS
CRAY, CRAY Y-MP, and UNICOS are federally registered
trademarks and Autotasking and CRAY EL are trademarks of
Cray Research, Inc. The UNICOS operating system is derived
from the UNIX System Laboratories, Inc. UNIX System V
operating system. UNICOS is also based in part on the Fourth
Berkeley Software Distribution under license from The Regents
of the University of California.

REFERENCES

1. Dubois, P.F., et al., "Approximating the Inverse of a Matrix
for Use in Iterative Algorithms on Vector Processors",
Computing, 22, 257-268, 1979.

2. Eimers, J., Lyke, W., and Brockman, A., "Simulation of
Ground-Water Flow in Aquifers in Cretaceous Rocks in the
Central Coastal Plain, North Carolina", Water Resources
Investigations Report, Doc 119.42/4:89-4153, USGS, Raleigh,
NC, 1989.

3. Eimers, J. and Morrow, D., "The Utility of Supercomputers
for Ground-Water Flow Modeling in the North Carolina
Coastal Plain", Proceedings of the ASCE Water Resources
Conference, Sacramento, CA, May 1989.

4. Grant and Skjellum, "The PVM Systems: An In-Depth
Analysis and Documenting Study - Concise Edition", The 1992
MPCI Yearly Report: Harnessing the Killer Micros, LLNL #
UCRL-ID-107022-92.

5. Hill, Mary, "Preconditioned Conjugate Gradient (PCG2) --
A Computer Program for Solving Ground-Water Flow
Equations", Water Resources Investigations Report # Doc
119.42/4:90-4048, USGS, Denver, CO, 1990.

6. Holter, B., et al., "1990 Cray Gigaflop Performance Award".

7. Johnson, D.G., et al., "Polynomial Preconditioners for
Conjugate Gradient Calculations", SIAM J. Numer. Anal., 20(2),
362-376, 1983.

8. Jordan, T.L., "Conjugate Gradient Preconditioners for Vector
and Parallel Processors", Elliptic Problem Solvers II, Academic
Press Inc., 127-139, 1984.

9. Kuiper, L.K., "A Comparison of the Incomplete
Cholesky-Conjugate Gradient Method With the Strongly
Implicit Method as Applied to the Solution of Two-Dimensional
Groundwater Flow Equations", Water Resources Research,
17(4), 1082-1086, August 1981.

10. McDonald, M.G. and Harbaugh, A.W., "A Modular
Three-Dimensional Finite-Difference Ground-Water Flow
Model", Open-File Report 83-875, U.S. Geological Survey,
1984.

11. Morrow, D. and Holter, B., "A Vectorized Polynomial
Preconditioned Conjugate Gradient Solver Package for the USGS
3-D Ground-Water Model", Proceedings of the ASCE Water
Resources Conference, Norfolk, VA, June 1988.

12. Oppe, T., et al., "NSPCG User's Guide, Version 1.0", Center
for Numerical Analysis, The University of Texas at Austin, April
1988.

13. Saad, Y., "Practical Use of Polynomial Preconditionings for
the Conjugate Gradient Method", SIAM J. Sci. Stat. Comput.,
6(4), 865-881, 1985.

14. Scandrett, C., "A Comparison of Three Iterative Techniques in
Solving Symmetric Systems of Linear Equations on a CYBER 205",
Supercomputer Computations Research Institute, Florida State
Univ., Report # FSU-SCRI-87-44, August 1987.

15. Van der Vorst, H., "A Vectorizable Variant of Some ICCG
Methods", SIAM J. Sci. Stat. Comput., 3(3), 350-356, 1982.

16. Weinstein, H.G., Stone, H.L., and Kwan, T.V., "Iterative
Procedure for Solution of Systems of Parabolic and Elliptic
Equations in Three Dimensions", Industrial and Engineering
Chemistry Fundamentals, 8(2), 281-287, 1969.

Graphics

Decimation of Triangle Meshes
William J. Schroeder
General Electric Company
Schenectady, NY

1.0 INTRODUCTION

The polygon remains a popular graphics primitive for
computer graphics applications. Besides having a simple
representation, computer rendering of polygons is widely
supported by commercial graphics hardware and software.
However, because the polygon is linear, often thousands
or millions of primitives are required to capture the details
of complex geometry. Models of this size are generally
not practical, since rendering speeds and memory requirements are proportional to the number of polygons. Consequently, applications that generate large polygonal meshes
often use domain-specific knowledge to reduce model
size. There remain algorithms, however, where domain-specific reduction techniques are not generally available
or appropriate.
One algorithm that generates many polygons is marching cubes. Marching cubes is a brute force surface construction algorithm that extracts isodensity surfaces from
volume data, producing from one to five triangles within
voxels that contain the surface. Although originally developed for medical applications, marching cubes has found
more frequent use in scientific visualization, where the sizes
of the volume data sets are much smaller than those found
in medical applications. A large computational fluid
dynamics volume could have a finite difference grid size
of order 100 by 100 by 100, while a typical medical computed tomography or magnetic resonance scanner produces over 100 slices at a resolution of 256 by 256 or 512
by 512 pixels each. Industrial computed tomography, used
for inspection and analysis, has even greater resolution,
varying from 512 by 512 to 1024 by 1024 pixels. For
these sampled data sets, isosurface extraction using
marching cubes can produce from 500k to 2,000k triangles. Even today's graphics workstations have trouble
storing and rendering models of this size.
Other sampling devices can produce large polygonal
models: range cameras, digital elevation data, and satellite
data. The sampling resolution of these devices is also
improving, resulting in model sizes that rival those
obtained from medical scanners.
This paper describes an application-independent algorithm that uses local operations on geometry and topology
to reduce the number of triangles in a triangle mesh.
Although our implementation is for the triangle mesh, it
can be directly applied to the more general polygon mesh.
After describing other work related to model creation
from sampled data, we describe the triangle decimation
process and its implementation. Results from two different geometric modeling applications illustrate the
strengths of the algorithm.

2.0 THE DECIMATION ALGORITHM

The goal of the decimation algorithm is to reduce the
total number of triangles in a triangle mesh, preserving
the original topology and a good approximation to the
original geometry.

2.1 OVERVIEW

The decimation algorithm is simple. Multiple passes are
made over all vertices in the mesh. During a pass, each
vertex is a candidate for removal and, if it meets the specified decimation criteria, the vertex and all triangles that
use the vertex are deleted. The resulting hole in the mesh
is patched by forming a local triangulation. The vertex
removal process repeats, with possible adjustment of the
decimation criteria, until some termination condition is
met. Usually the termination criterion is specified as a
percent reduction of the original mesh (or equivalent), or
as some maximum decimation value. The three steps of
the algorithm are:
1. characterize the local vertex geometry and topology,
2. evaluate the decimation criteria, and
3. triangulate the resulting hole.

2.2 CHARACTERIZING LOCAL GEOMETRY/TOPOLOGY

The first step of the decimation algorithm characterizes
the local geometry and topology for a given vertex. The
outcome of this process determines whether the vertex is
a potential candidate for deletion, and if it is, which criteria to use.
Each vertex may be assigned one of five possible classifications: simple, complex, boundary, interior edge, or
corner vertex. Examples of each type are shown in the
figure below.

[Figure: examples of simple, complex, boundary, interior edge, and corner vertices]

A simple vertex is surrounded by a complete cycle of
triangles, and each edge that uses the vertex is used by
exactly two triangles. If the edge is not used by two triangles, or if the vertex is used by a triangle not in the cycle of
triangles, then the vertex is complex. These are non-manifold cases.
A vertex that is on the boundary of a mesh, i.e., within a
semi-cycle of triangles, is a boundary vertex.
A simple vertex can be further classified as an interior
edge or corner vertex. These classifications are based on the
local mesh geometry. If the dihedral angle between two
adjacent triangles is greater than a specified feature angle,
then a feature edge exists. When a vertex is used by two feature edges, the vertex is an interior edge vertex. If one or
three or more feature edges use the vertex, the vertex is classified as a corner vertex.
Complex vertices are not deleted from the mesh. All other
vertices become candidates for deletion.
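The classification rules above can be summarized in a short sketch, assuming the per-vertex edge use counts and feature-edge count have already been gathered (the check for a triangle outside the cycle is omitted for brevity):

```python
def classify_vertex(edge_use_counts, n_feature_edges):
    # edge_use_counts: for each edge meeting the vertex, the number
    # of surrounding triangles that share it.
    # n_feature_edges: edges at the vertex whose dihedral angle
    # exceeds the feature angle.
    if any(c > 2 for c in edge_use_counts):
        return "complex"           # non-manifold: never a candidate
    if any(c == 1 for c in edge_use_counts):
        return "boundary"          # semi-cycle of triangles
    if n_feature_edges == 2:
        return "interior edge"
    if n_feature_edges == 1 or n_feature_edges >= 3:
        return "corner"
    return "simple"
```

The classification decides which deletion criterion applies in the next step: distance to plane for simple vertices, distance to edge for boundary and interior edge vertices.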

2.3 EVALUATING THE DECIMATION CRITERIA

The characterization step produces an ordered loop of vertices and triangles that use the candidate vertex. The evaluation step determines whether the triangles forming the loop
can be deleted and replaced by another triangulation exclusive of the original vertex. Although the fundamental decimation criterion we use is based on vertex distance to plane
or vertex distance to edge, others can be applied.
Simple vertices use the distance to plane criterion (see
figure below). If the vertex is within the specified distance
to the average plane it may be deleted. Otherwise it is
retained.

[Figure: the distance-to-plane criterion, showing the average plane]
Boundary and interior edge vertices use the distance to
edge criterion (figure below). In this case, the algorithm
determines the distance to the line defined by the two vertices creating the boundary or feature edge. If the distance to
the line is less than d, the vertex can be deleted.
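Both criteria reduce to short vector computations. A sketch, assuming the average plane is given by a point and a unit normal and the boundary or feature edge by its two endpoints (names are illustrative):

```python
import numpy as np

def distance_to_plane(v, plane_point, unit_normal):
    # Distance-to-plane criterion for simple vertices: projection
    # of (v - plane_point) onto the unit plane normal.
    return abs(np.dot(v - plane_point, unit_normal))

def distance_to_edge(v, e0, e1):
    # Distance-to-edge criterion for boundary and interior edge
    # vertices: distance from v to the line through e0 and e1.
    d = e1 - e0
    t = np.dot(v - e0, d) / np.dot(d, d)
    return np.linalg.norm(v - (e0 + t * d))
```

A candidate vertex would then be deleted when the relevant distance falls below the user-specified threshold d.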

It is not always desirable to retain feature edges. For
example, meshes may contain areas of relatively small triangles with large feature angles, contributing relatively little
to the geometric approximation. Or, the small triangles may
be the result of "noise" in the original mesh. In these situations, corner vertices, which are usually not deleted, and
interior edge vertices, which are evaluated using the distance to edge criterion, may be evaluated using the distance
to plane criterion. We call this edge preservation, a user-specifiable parameter.
If a vertex can be eliminated, the loop created by removing the triangles using the vertex must be triangulated. For
interior edge vertices, the original loop must be split into
two halves, with the split line connecting the vertices forming the feature edge. If the loop can be split in this way, i.e.,
so that the resulting two loops do not overlap, then the loop is
split and each piece is triangulated separately.

2.4 TRIANGULATION

Deleting a vertex and its associated triangles creates one
(simple or boundary vertex) or two loops (interior edge vertex). Within each loop a triangulation must be created
whose triangles are non-intersecting and non-degenerate. In
addition, it is desirable to create triangles with good aspect
ratio and that approximate the original loop as closely as
possible.
In general it is not possible to use a two-dimensional
algorithm to construct the triangulation, since the loop is
usually non-planar. In addition, there are two important
characteristics of the loop that can be used to advantage.
First, if a loop cannot be triangulated, the vertex generating
the loop need not be removed. Second, since every loop is
star-shaped, triangulation schemes based on recursive loop
splitting are effective. The next section describes one such
scheme.
Once the triangulation is complete, the original vertex and
its cycle of triangles are deleted. From the Euler relation it
follows that removal of a simple, corner, or interior edge
vertex reduces the mesh by precisely two triangles. If a
boundary vertex is deleted then the mesh is reduced by precisely one triangle.

3.0 IMPLEMENTATION

3.1 DATA STRUCTURES

The data structure must contain at least two pieces of information: the geometry, or coordinates, of each vertex, and
the definition of each triangle in terms of its three vertices.
In addition, because ordered lists of triangles surrounding a
vertex are frequently required, it is desirable to maintain a
list of the triangles that use each vertex.
Although data structures such as Weiler's radial edge or
Baumgart's winged-edge data structure can represent this
information, our implementation uses a space-efficient vertex-triangle hierarchical ring structure. This data structure
contains hierarchical pointers from the triangles down to the
vertices, and pointers from the vertices back up to the triangles using the vertex. Taken together these pointers form a
ring relationship. Our implementation uses three lists: a list
of vertex coordinates, a list of triangle definitions, and
another list of lists of triangles using each vertex. Edge definitions are not explicit; instead edges are implicitly defined
as ordered vertex pairs in the triangle definition.

3.2 TRIANGULATION

Although other triangulation schemes can be used, we chose
a recursive loop splitting procedure. Each loop to be triangulated is divided into two halves. The division is along a
line (i.e., the split line) defined by two non-neighboring
vertices in the loop. Each new loop is divided again, until
only three vertices remain in each loop. A loop of three vertices forms a triangle, which may be added to the mesh, and
terminates the recursion.
Because the loop is non-planar and star-shaped, the loop
split is evaluated using a split plane. The split plane, as
shown in the figure below, is the plane orthogonal to the
average plane that contains the split line. In order to determine whether the split foons two non-overlapping loops,
the split plane is used for a half-space comparison. That is,
if every point in a candidate loop is on one side of the split
plane, then the two loops do not overlap and the split plane
is acceptable. Of course, it is easy to create examples where
this algorithm will fail to produce a successful split. In such
cases we simply indicate a failure of the triangulation process, and do not remove the vertex or surrounding triangle
from the mesh.
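The recursive structure of the loop-splitting procedure can be sketched in a simplified planar form. This sketch omits the split-plane half-space test for non-planar loops and picks the middle vertex as the split partner; the real procedure evaluates candidate split lines as described above.

```python
# Minimal sketch of recursive loop splitting: choose a split line from
# one vertex to a non-neighboring vertex, divide the loop into two
# smaller loops, and recurse until each loop is a single triangle.

def split_loop(loop):
    """loop: list of vertex ids in cyclic order; returns triangles."""
    if len(loop) == 3:
        return [tuple(loop)]          # base case ends the recursion
    j = len(loop) // 2                # split line from loop[0] to loop[j]
    left = loop[:j + 1]               # loop[0] .. loop[j]
    right = loop[j:] + loop[:1]       # loop[j] .. loop[-1], back to loop[0]
    return split_loop(left) + split_loop(right)

tris = split_loop([0, 1, 2, 3, 4, 5])
print(len(tris))   # a loop of n vertices always yields n - 2 triangles -> 4
```

The n - 2 triangle count is consistent with the Euler-relation argument earlier: removing a vertex whose cycle contained n triangles and retriangulating its loop with n - 2 triangles reduces the mesh by precisely two triangles.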

[Figure: the split plane, orthogonal to the average plane and containing the split line]
Typically, however, each loop may be split in more than
one way. In this case, the best splitting plane must be
selected. Although many possible measures are available,
we have been successful using a criterion based on aspect
ratio. The aspect ratio is defined as the minimum distance of
the loop vertices to the split plane, divided by the length of
the split line. The best splitting plane is the one that yields
the maximum aspect ratio. Constraining this ratio to be
greater than a specified value, e.g., 0.1, produces acceptable
meshes.
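The aspect-ratio criterion can be sketched as a small function. The plane representation (a point and a unit normal) and the function name are our illustrative choices; the loop points passed in are assumed to exclude the two split-line endpoints, which lie on the plane.

```python
# Hedged sketch of the aspect-ratio measure: minimum distance of the
# loop vertices to the candidate split plane, divided by the length of
# the split line.  Larger is better; ratios below ~0.1 are rejected.

def aspect_ratio(loop_points, plane_point, plane_normal, split_line_len):
    dists = []
    for p in loop_points:
        # signed distance via dot product with the unit normal
        d = sum((pi - qi) * ni
                for pi, qi, ni in zip(p, plane_point, plane_normal))
        dists.append(abs(d))
    return min(dists) / split_line_len

# A unit square loop split down the middle by the plane x = 0.5:
pts = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
r = aspect_ratio(pts, (0.5, 0, 0), (1, 0, 0), 1.0)
print(r)   # every vertex lies 0.5 from the plane -> ratio 0.5, well above 0.1
```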
Certain special cases may occur during the triangulation
process. Repeated decimation may produce a simple closed
surface such as a tetrahedron. Eliminating a vertex in this
case would modify the topology of the mesh. Another special case occurs when "tunnels" or topological holes are
present in the mesh. The tunnel may eventually be reduced
to a triangle in cross section. Eliminating a vertex from the
tunnel boundary then eliminates the tunnel and creates a
non-manifold situation.
These cases are treated during the triangulation process.
As new triangles are created, checks are made to ensure that
duplicate triangles and triangle edges are not created. This
preserves the topology of the original mesh, since new connections to other parts of the mesh cannot occur.

4.0 RESULTS

Two different applications illustrate the triangle decimation
algorithm. Although each application uses a different
scheme to create an initial mesh, all results were produced
with the same decimation algorithm.

4.1 VOLUME MODELING

The first application applies the decimation algorithm to
isosurfaces created from medical and industrial computed
tomography scanners. Marching cubes was run on a 256 by
256 pixel by 93 slice study. Over 560,000 triangles were
required to model the bone surface. Earlier work reported a
triangle reduction strategy that used averaging to reduce the

(569K Gouraud shaded triangles)

75% decimated
(142K Gouraud shaded triangles)

75% decimated
(142K flat shaded triangles)

90% decimated
(57K flat shaded triangles)

number of triangles on this same data set. Unfortunately,
averaging applies uniformly to the entire data set, blurring
high frequency features. The first set of figures shows the
resulting bone isosurfaces for 0%, 75%, and 90% decimation, using a decimation threshold of 1/5 the voxel dimension. The next pair of figures shows decimation results for
an industrial CT data set comprising 300 slices, 512 by 512,
the largest we have processed to date. The isosurface created from the original blade data contains 1.7 million triangles. In fact, we could not render the original model because
we exceeded the swap space on our graphics hardware.
Even after decimating 90% of the triangles, the serial number on the blade dovetail is still evident.

4.2 TERRAIN MODELING

We applied the decimation algorithm to two digital elevation data sets: Honolulu, Hawaii and the Mariner Valley on
Mars. In both examples we generated an initial mesh by creating two triangles for each uniform quadrilateral element in
the sampled data. The Honolulu example illustrates the
polygon savings for models that have large flat areas. First
we applied a decimation threshold of zero, eliminating over
30% of the co-planar triangles. Increasing the threshold
removed 90% of the triangles. The next set of four figures
shows the resulting 30% and 90% triangulations. Notice the
transitions from large flat areas to fine detail around the
shore line.
The Mars example is an appropriate test because we had
access to sub-sampled resolution data that could be compared with the decimated models. The data represents the
western end of the Mariner Valley and is about 1000 km by
500 km on a side. The last set of figures compares the shaded
and wireframe models obtained via sub-sampling and decimation. The original model was 480 by 288 samples. The
sub-sampled data was 240 by 144. After a 77% reduction,
the decimated model contains fewer triangles, yet shows
more fine detail around the ridges.

5.0 REFERENCES

[1] Baumgart, B. G., "Geometric Modeling for Computer Vision," Ph.D. Dissertation, Stanford University, August 1974.
[2] Bloomenthal, J., "Polygonalization of Implicit Surfaces," Computer Aided Geometric Design, Vol. 5, pp. 341-355, 1988.
[3] Cline, H. E., Lorensen, W. E., Ludke, S., Crawford, C. R., and Teeter, B. C., "Two Algorithms for the Three Dimensional Construction of Tomograms," Medical Physics, Vol. 15, No. 3, pp. 320-327, June 1988.
[4] DeHaemer, M. J., Jr. and Zyda, M. J., "Simplification of Objects Rendered by Polygonal Approximations," Computers & Graphics, Vol. 15, No. 2, pp. 175-184, 1992.
[5] Dunham, J. G., "Optimum Uniform Piecewise Linear Approximation of Planar Curves," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. PAMI-8, No. 1, pp. 67-75, January 1986.
[6] Finnigan, P., Hathaway, A., and Lorensen, W., "Merging CAT and FEM," Mechanical Engineering, Vol. 112, No. 7, pp. 32-38, July 1990.
[7] Fowler, R. J. and Little, J. J., "Automatic Extraction of Irregular Network Digital Terrain Models," Computer Graphics, Vol. 13, No. 2, pp. 199-207, August 1979.
[8] Ihm, I. and Naylor, B., "Piecewise Linear Approximations of Digitized Space Curves with Applications," in Scientific Visualization of Physical Phenomena, pp. 545-569, Springer-Verlag, June 1991.
[9] Kalvin, A. D., Cutting, C. B., Haddad, B., and Noz, M. E., "Constructing Topologically Connected Surfaces for the Comprehensive Analysis of 3D Medical Structures," SPIE Image Processing, Vol. 1445, pp. 247-258, 1991.
[10] Lorensen, W. E. and Cline, H. E., "Marching Cubes: A High Resolution 3D Surface Construction Algorithm," Computer Graphics, Vol. 21, No. 3, pp. 163-169, July 1987.
[11] Miller, J. V., Breen, D. E., Lorensen, W. E., O'Bara, R. M., and Wozny, M. J., "Geometrically Deformed Models: A Method for Extracting Closed Geometric Models from Volume Data," Computer Graphics, Vol. 25, No. 3, July 1991.
[12] Preparata, F. P. and Shamos, M. I., Computational Geometry, Springer-Verlag, 1985.
[13] Schmitt, F. J., Barsky, B. A., and Du, W., "An Adaptive Subdivision Method for Surface-Fitting from Sampled Data," Computer Graphics, Vol. 20, No. 4, pp. 179-188, August 1986.
[14] Schroeder, W. J., "Geometric Triangulations: With Application to Fully Automatic 3D Mesh Generation," PhD Dissertation, Rensselaer Polytechnic Institute, May 1991.
[15] Terzopoulos, D. and Fleischer, K., "Deformable Models," The Visual Computer, Vol. 4, pp. 306-311, 1988.
[16] Turk, G., "Re-Tiling of Polygonal Surfaces," Computer Graphics, Vol. 26, No. 3, July 1992.
[17] Weiler, K., "Edge-Based Data Structures for Solid Modeling in Curved-Surface Environments," IEEE Computer Graphics and Applications, Vol. 5, No. 1, pp. 21-40, January 1985.

75% decimated
(425K flat shaded triangles)

90% decimated
(170K flat shaded triangles)

32% decimated
(276K flat shaded triangles)

(shore line detail)

90% decimated
(40K Gouraud shaded triangles)

90% decimated
(40K wireframe)

(68K Gouraud shaded triangles)

(68K wireframe)

77% decimated
(62K Gouraud shaded triangles)

(62K wireframe)


VISUALIZATION OF VOLCANIC ASH CLOUDS
Mitchell Roth
Arctic Region Supercomputing Center
University of Alaska
Fairbanks, AK 99775
roth@acad5.alaska.edu

Rick Guritz
Alaska Synthetic Aperture Radar Facility
University of Alaska
Fairbanks, AK 99775
rguritz@iias.images.alaska.edu

ABSTRACT
Ash clouds resulting from volcanic eruptions pose a serious hazard to aviation safety. In Alaska alone, there are over 40 active
volcanoes whose eruptions may affect more than 40,000 flights using the great circle polar routes each year. Anchorage International
Airport, a hub for flights refueling between Europe and Asia, has been closed due to volcanic ash on several occasions in recent years.
The clouds are especially problematic because they are invisible to radar and nearly impossible to distinguish from weather clouds. The
Arctic Region Supercomputing Center and the Alaska Volcano Observatory have used AVS to develop a system for predicting and
visualizing the movement of volcanic ash clouds when an eruption occurs. Based on eruption parameters obtained from geophysical
instruments and meteorological data, a model was developed to predict the movement of the ash particles over a 72 hour period. The
output from the model is combined with a digital elevation model to produce a realistic view of the ash cloud, which may be examined
interactively from any desired point of view at any time during the prediction period. This paper describes the visualization techniques
employed in the system and includes a video animation of the Mount Redoubt eruption on December 15, 1989, that caused complete engine
failure on a 747 passenger jet when it entered the ash cloud.

1. Introduction
Alaska is situated on the northern boundary of the Pacific Rim.
Home to the highest mountains in North America, the mountain
ranges of Alaska contain over 50 active volcanoes. In the past
200 years most of Alaska's active volcanoes have erupted at least
once. Alaska is a polar crossroads where aircraft traverse the great
circle airways between Asia, Europe and North America.
Volcanic eruptions in Alaska and the resulting airborne ash
clouds pose a significant hazard to the more than 40,000
transpolar flights each year.
The ash clouds created by volcanic eruptions are invisible to radar
and are often concealed by weather clouds. This paper describes a
system developed by the Alaska Volcano Observatory and the
Arctic Region Supercomputing Center for predicting the
movement of ash clouds. Using meteorological and geophysical
data from volcanic eruptions, a supercomputer model provides
predictions of ash cloud movements for up to 72 hours. The
AVS visualization system is used to control the execution of the
ash cloud model and to display the model output in three
dimensional form showing the location of the ash cloud over a
digital terrain model.


Eruptions of Mount Redoubt on the morning of December 15,
1989, sent ash particles more than 40,000 feet into the
atmosphere. On the same day, a Boeing 747 experienced
complete engine failure when it penetrated the ash cloud. The ash
cloud prediction system was used to simulate this eruption and to
produce an animated flyby of Mount Redoubt during a 12 hour
period of the December 15 eruptions including the encounter of
the 747 jetliner with the ash cloud. The animation combines the
motion of the viewer with the time evolution of the ash cloud
above a digital terrain model.

2. Ash Plume Model
The ash cloud visualization is based on the output of a model
developed by Hiroshi Tanaka of the Geophysical Institute of the
University of Alaska and Tsukuba University, Japan. Using
meteorological data and eruption parameters for input, the model
predicts the density of volcanic ash particles in the atmosphere as
a function of time. The three dimensional Lagrangian form of the
diffusion equation is employed to model particle diffusion, taking
into account the size distribution of the ash particles and
gravitational settling described by Stokes' law. Details of the
model are given in Tanaka [2] [3].

The meteorological data required are winds in the upper
atmosphere. These are obtained from UCAR Unidata in NetCDF
format. Unidata winds are interpolated to observed conditions on
12 hour intervals. Global circulation models are used to provide
up to 72 hour predictions at 6 hour intervals.
The eruption parameters for the model include the geographical
location of the volcano, the time and duration of the event,
altitude of the plume, particle density, and particle density
distribution.
The model has been implemented in both Sun and Cray
environments. An AVS module was created for the Cray version
which allows the model to be controlled interactively from an
A VS network. In this version, the AVS module executes the
model on the Cray, reads the resulting output file and creates a
3D AVS scalar field representing the particle densities at each
time step.

At any point in time, the particle densities in the ash cloud are
represented by the values in a 150 x 150 x 50 element integer
array. The limits of the cloud may be observed using the
isosurface module in the network shown in Figure 1 with the
isosurface level set equal to 1. As the cloud disperses, the
particle concentrations in the array decrease and holes and isolated
cells begin to appear in the isosurface around the edges of the
plume where the density is between zero and one particle. These
effects are readily apparent in the plume shown in Figure 2 and
are especially noticeable in a time animation of the plume
evolution. To create a more uniform cloud for the video
animation, without increasing the overall particle counts, the
density array was low pass filtered by an inverse square kernel
before creating the isosurface. An example of a filtered plume
created by this technique is shown in Figure 8.
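The smoothing step above can be sketched in one dimension. The exact kernel the authors used is not given beyond "inverse square"; the weights 1/(1 + d^2) and the normalization below are our assumptions for illustration.

```python
# Sketch of low-pass filtering ragged density data with an assumed
# inverse-square kernel, as described for the plume isosurfaces.
# Weighted average over a window; weights fall off as 1/(1 + d^2).

def inverse_square_filter(signal, radius=2):
    offsets = range(-radius, radius + 1)
    weights = [1.0 / (1 + d * d) for d in offsets]
    out = []
    for i in range(len(signal)):
        num = den = 0.0
        for d, w in zip(offsets, weights):
            j = i + d
            if 0 <= j < len(signal):       # clip the window at the edges
                num += w * signal[j]
                den += w
        out.append(num / den)
    return out

ragged = [0, 4, 0, 4, 0, 4]                # holes like those at the plume edge
print(inverse_square_filter(ragged))       # values pulled toward a smoother profile
```

The effect mirrors the one described in the text: isolated empty cells at the plume fringe are filled in by their neighbors, so the isosurface closes up without increasing the overall particle counts.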

The raw output from the model for each time step consists of a
list of particles with an (x,y,z) coordinate for each particle. The
AVS module reads the particle data and increments the particle
counts for the cells formed by an array indexed over (x,y,z). We
chose a resolution of 150 x 150 x 50 for the particle density
array, which equals 1.1 million data points at each solution point
in time. For the video animation, we chose to run the model
with a time step of 5 minutes. For 13 hours of simulated time,
the model produced 162 plumes, amounting to approximately
730 MB of integer valued volume data.
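The binning step the AVS module performs can be sketched as follows. The grid resolution matches the 150 x 150 x 50 array described above; the extent mapping and function names are our assumptions, since the model's coordinate domain is not given.

```python
# Sketch of binning the model's particle list into a density array:
# each (x, y, z) particle is mapped to a cell index and that cell's
# integer count is incremented.

NX, NY, NZ = 150, 150, 50   # resolution used for the video animation

def bin_particles(particles, extent):
    """particles: iterable of (x, y, z);
    extent: ((xmin, xmax), (ymin, ymax), (zmin, zmax)) of the domain."""
    counts = [[[0] * NZ for _ in range(NY)] for _ in range(NX)]
    dims = (NX, NY, NZ)
    for p in particles:
        idx = []
        for coord, (lo, hi), n in zip(p, extent, dims):
            i = int((coord - lo) / (hi - lo) * n)
            idx.append(min(max(i, 0), n - 1))   # clamp to the grid
        i, j, k = idx
        counts[i][j][k] += 1
    return counts

extent = ((0.0, 150.0), (0.0, 150.0), (0.0, 50.0))   # assumed domain
c = bin_particles([(10.2, 20.7, 5.1), (10.9, 20.3, 5.8)], extent)
print(c[10][20][5])   # both particles land in the same cell -> 2
```

At 150 x 150 x 50 = 1.125 million cells per time step, 162 time steps of 4-byte integers come to roughly 730 MB, consistent with the figure quoted above.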

3. Ash Cloud Visualization
The ash cloud is rendered as an isosurface with a brown color
approximating volcanic ash. The rendering obtained through this
technique gives the viewer a visual effect showing the boundaries
of the ash cloud. Details of the cloud shape are highlighted
through lighting effects and, when viewed on a computer
workstation, the resulting geometry can be manipulated
interactively to view the ash cloud from any desired direction.

Figure 2. Unfiltered plume data displayed as an isosurface.

4. Plume Animation
The plume model must use time steps of 5 minutes or greater
due to limitations of the model. Plumes that are generated at 5
minute intervals may be displayed to create a flip chart animation
of the time evolution of the cloud. However, the changes in the
plume over a 5 minute interval can be fairly dramatic and shorter
time intervals are required to create the effect of a smoothly
evolving cloud. To accomplish this without generating
additional plume volumes we interpolate between successive
plume volumes. Using the field math module, we implemented
linear interpolation between plume volumes in the network
shown in Figure 3.
Figure 1. AVS plume isosurface network


The corners of the region define a Cartesian coordinate system
and the extents of the volcano plume data must be adjusted to
obtain the correct registration of the plume data in relation to the
terrain. The terrain features are based on topographic data
obtained from the US Geological Survey with a grid spacing of
approximately 90 meters. This grid was much too large to
process at the original resolution and was downsized to a 1426 x
1051 element array of terrain elevations, which corresponds to a
grid size of approximately 1/2 mile. As shown in Figure 4, the
terrain data were read in field format and were converted to a
geometry using the field to mesh module. We included a
downsize module
Figure 3. AVS interpolation network.
The linear interpolation formula is:

P(t) = Pi + t (Pi+1 - Pi),  0 <= t <= 1    (1)

where Pi is the plume volume at time step i and t is time. The
difference term in (1) is formed in the upper field math module.
The lower field math module sums its inputs. Normally, a
separate field math module would be required to perform the
multiplication by t. However, it is possible to multiply the
output port of a field math module by a constant value when the
network is executed from a CLI script, and this is the approach
we used to create the video animation of the eruption. If it is
desired to set the interpolation parameter interactively, it is
necessary to insert a third field math module to perform the
multiplication on the output of the upper module. This can be
an extremely effective device for producing smooth time
animation of discrete data sets in conjunction with the AVS
Animator module.
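The interpolation scheme above can be sketched in plain Python rather than AVS field math modules; the function name is illustrative.

```python
# Linear interpolation between successive plume volumes:
# P(t) = P_i + t * (P_{i+1} - P_i) for t in [0, 1], applied cell by cell.

def interpolate_volumes(p0, p1, t):
    return [a + t * (b - a) for a, b in zip(p0, p1)]

# Two tiny "plume volumes" (flattened density arrays) at successive steps:
frame0 = [0.0, 2.0, 4.0]
frame1 = [4.0, 2.0, 0.0]
print(interpolate_volumes(frame0, frame1, 0.5))   # halfway: [2.0, 2.0, 2.0]
```

Sampling t at intermediate values between each pair of 5-minute plumes yields the extra frames needed for a smoothly evolving cloud without running the model at a finer time step.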
Figure 4. AVS terrain network.
One additional animation effect was introduced to improve the
appearance of the plume at the beginning of the eruption. The
plume model assumes that the plume reaches the specified
eruption height instantaneously. Thus, the plume model for the
first time step produces a cylindrical isosurface of uniform
particle densities above the site of the eruption. To create the
appearance of a cloud initially rising from the ground, we defined
an artificial plume for time O. The time 0 plume isosurface
consists of an inverted cone of negative plume densities
centered over the eruption coordinates. The top of the plume
volume contains the most negative density values. When this
plume volume is interpolated with the model plume from time
step 1, the resulting plume rises from the ground and reaches the
full eruption height at t=l.
5. Terrain Visualization
The geographical region for this visualization study is an area in
south-central Alaska which lies between 141° and 160° west
longitude and 60° and 67° north latitude. Features in the study area
include Mount Redoubt, Mount McKinley, the Alaska Range,
Cook Inlet and the cities of Anchorage, Fairbanks, and Valdez.


ahead of field to mesh because even the 1426 x 1051 terrain
exceeded available memory on all but our largest machines. For
prototyping and animation design, we typically downsized by
factors of 2 to 4 in order to speed up the terrain rendering.
The colors of the terrain are set in the generate colormap
module according to elevation of the terrain and were chosen to
approximate ground cover during the fall season in Alaska. The
vertical scale of the terrain was exaggerated by a factor of 60 to
better emphasize the topography.
The resulting terrain is shown Figure 5 with labels that were
added using image processing techniques. To create the global
zoom sequence in the introduction to the video animation, this
image was used as a texture map that was overlaid onto a lower
resolution terrain model for the entire state of Alaska. This
technique also allowed the study area to be highlighted in such a
way as to create a smooth transition into the animation sequence.

Figure 5. Study area with texture mapped labels.
6. Flight Path Visualization
The flight path of the jetliner in the animation was produced by
applying the tube module to a polyline geometry obtained
through read geom as shown in Figure 6. The animation of the
tube was performed by a simple program which takes as its input
a time dependent cubic spline. The program evaluates the spline
at specified points to create a polyline geometry for read geom.
Each new point added to the polyline causes a new segment of
the flight path to be generated by tube. In Figure 7, the entire
flight path spline function is displayed. Four separate tube
modules were employed to allow the flight path segments to be
colored green, red, yellow, and green during the engine failure and

restart sequence.

Figure 6. AVS flight path network.

The path of the jetliner is based on flight recorder data obtained
from the Federal Aviation Administration. The flight path was
modeled using three dimensional time dependent cubic splines.
The technique for deriving and manipulating the spline functions
is so powerful that we created a new module called the Spline
Animator for this purpose. The details of this module are
described in a paper by Astley [1]. A similar technique is used to
control the camera motion required for the flyby in the video
animation.
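The key idea, evaluating a time-dependent cubic spline at chosen times to emit polyline points, can be sketched as follows. A Catmull-Rom spline through key frames is used here for brevity; the actual Spline Animator formulation is the one described in Astley [1], and all names below are illustrative.

```python
# Sketch of flight-path generation: evaluate a cubic spline through
# key-frame points and emit a polyline for the tube module to render.

def catmull_rom(p0, p1, p2, p3, t):
    """Cubic interpolation between p1 and p2 (scalars), t in [0, 1]."""
    return 0.5 * (2 * p1
                  + (p2 - p0) * t
                  + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t ** 2
                  + (3 * p1 - 3 * p2 + p3 - p0) * t ** 3)

def flight_path(keys, samples_per_segment=4):
    """keys: list of (x, y, z) key frames; returns polyline points."""
    pts = []
    for i in range(1, len(keys) - 2):
        for s in range(samples_per_segment):
            t = s / samples_per_segment
            pts.append(tuple(
                catmull_rom(keys[i - 1][c], keys[i][c],
                            keys[i + 1][c], keys[i + 2][c], t)
                for c in range(3)))
    pts.append(keys[-2])   # close the last segment at its end key frame
    return pts

keys = [(0, 0, 0), (0, 0, 1), (1, 0, 1), (1, 0, 2)]
pts = flight_path(keys)
print(pts[0])   # the path starts at the second key frame: (0.0, 0.0, 1.0)
```

Because the curve passes smoothly through a handful of key frames, the same evaluation can drive the camera position for the flyby, which is why far fewer key frames were needed than with sinusoidal key-frame interpolation.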
By combining the jetliner flight path with the animation of the
ash plume described earlier, a simulated encounter of the jet with
the ash cloud can be studied in an animated sequence. The
resulting simulation provides valuable information about the
accuracy of the plume model. Because ash plumes are invisible
to radar and may be hidden from satellites by weather clouds, it is
often very difficult to determine the exact position and extent of
an ash cloud from direct observations. However, when a jetliner
penetrates an ash cloud, the effects are immediate and
unmistakable and the aircraft position is usually known rather
accurately. This was the case during the December 15 encounter.
Thus, by comparing the intersection point of the jetliner flight
path with the plume model to the point of intersection with the
actual plume, one can determine if the leading edge of the plume
model is in the correct position. Both the plume model and the
flight path must be correctly co-registered to the terrain data in
order to perform such a test. Using standard transformations
between latitude-longitude and x-y coordinates for the terrain, we
calculated the appropriate coordinate transformations for the
plume model and jet flight path. The first time the animation


was run we were quite amazed to observe the flight path turn red,
denoting engine failure, at precisely the point where the flight
path encountered the leading edge of the modeled plume.

Apparently this was one of those rare times when we got
everything right. The fact that the aircraft position is well
known at all times, and that it encounters the ash cloud at the
correct time and place lends strong support for the correctness of
the model. Figure 8 shows a frame from the video animation at
the time when the jetliner landed in Anchorage. The ash cloud in
this image is drifting from left to right and away from the
viewer.
7. Satellite Image Comparison
Ash clouds can often be detected in AVHRR satellite images.
For the December 15 events, only one image recorded at 1:27pm
AST was available. At the time of this image most of the study
area was blanketed by clouds. Nevertheless, certain atmospheric
features become visible when the image is subjected to
enhancement, as shown in Figure 9. A north-south frontal
system is moving northeasterly from the left side of the image.
To the left of the front, the sky is generally clear and surface
features are visible. To the right of the front, the sky is
completely overcast and no surface features are visible. One
prominent cloud feature is a mountain wave created by Mount
McKinley. This shows up as a long plume moving in a north-northeasterly direction from Mount McKinley and is consistent
with upper altitude winds on this date.

Figure 7. Flight path geometry created by spline animator.

Figure 8. Flight path through ash plume.


Figure 9. Enhanced AVHRR satellite image taken at 1:27pm AST.

Figure 10. Position of simulated plume at 1:30pm AST.
The satellite image was enhanced in a manner which causes ash
clouds to appear black. There is clearly a dark plume extending
from Mount Redoubt near the lower edge of the image to the
northeast and ending in the vicinity of Anchorage. The size of

this plume indicates that it is less than an hour old. Thus, it
could not be the source of the plume which the jet encountered
approximately 2 hours before this image was taken.


There are additional black areas in the upper right quadrant of the
image which are believed to have originated with the 10:15am
eruption. These are the clouds which the jet is believed to have
penetrated approximately 2 hours before this image was taken.
The image has been annotated with the jetliner flight path
entering in the top center of the image and proceeding from top
to bottom in the center of the image. The ash cloud encounter
occurred at the point where the flight path reverses course to the
north and east. However, the satellite image does not show any
ash clouds remaining in the vicinity of the flight path by the
time of this image.
When the satellite image is compared with the plume model for
the same time period, shown at approximately the same scale in
Figure 10, a difference in the size of the ash cloud is readily
apparent. While the leading edge of the simulated plume
stretching to the northeast is located in approximately the same
position as the dark clouds in the satellite image, the cloud from
the simulated plume is much longer. The length of the
simulated plume is controlled by the duration of the eruption,
which was 40 minutes.
Two explanations for the differences have been proposed. The
first is that the length of the eruption was determined from
seismic data. Seismicity does not necessarily imply the
emission of ash and therefore the actual ash emission time may
have been less than 40 minutes. The second possibility is that
the trailing end of the ash cloud may be invisible in the satellite
image due to cloud cover. It is worth noting that the ash cloud
signatures in this satellite image are extremely weak compared to
cloudless images. In studies of the few other eruptions where
clear images were available, the ash clouds are unmistakable in
the satellite image and the model showed excellent agreement
with the satellite data.
8. Flyby Animation
One of the great advantages of three dimensional animation
methods is the ability to move around in a simulated 3D
environment interactively, or to create a programmed tour or
flyby. For the ash cloud visualization, we wanted to follow the
moving ash clouds in relation to the terrain and to look at them
from different distances and different directions. AVS allows
fully interactive manipulation of the views of the ash cloud, but
the rendering process is too slow (minutes per frame) to allow
for realtime animation. For this reason we decided to create a
flyby of the events on December 15 by combining camera
animation with the time dependent animations of the plume and
jet flight path.
In our initial attempts, we used the AVS Animator module and
found that it worked well for the linear time dependent portions
of the animation. However, the camera animation was a different
story altogether because camera motion in a flyby situation is
seldom linear. When we attempted to use the Animator in its
"smooth" mode, we found it was only possible to control the
camera accurately when we introduced dozens of key frames in
order to tightly constrain the frame acceleration introduced by the


sinusoidal interpolation technique used in the Animator.
Having to use a large number of key frames makes it very time
consuming to construct a flight path, because changing the flight
path requires all the key frames near the change to be modified in
a consistent manner. We eventually realized that a minor
extension to the flight path spline algorithm already developed
could easily provide the desired camera coordinates. In essence,
the camera coordinates could be determined from the flight path
of the viewer in the same manner that we computed the flight
path of the jetliner.
The first version of the Spline Animator used a text file for
input which contained the key frame information. The output
was a sequence of camera coordinates which were edited into a
CLI script which could be played back interactively or in batch
mode. In this first effort we were able to define a flight path
using about a half dozen key frames in place of the dozens
required by the AVS Animator, and the smoothness and
predictability of results were far superior. After the video
animation of the eruption visualization was completed, a second
version of the Spline Animator was created with a Motif
interface, and a module is now available for use in AVS
networks. For more information about this module, the reader is
referred to Astley [1].
9. Conclusions
An ash plume modeling and prediction system has been
developed using AVS for visualization and a Cray supercomputer
for model computations. A simulation of the December 15
encounter with ash clouds from Mount Redoubt by a jetliner
provides strong support for the accuracy of the model. Although
the satellite data for this event are relatively limited, agreement
of the model with satellite data for other events is very good.
The animated visualization of the eruption which was produced
using AVS demonstrates that AVS is an extremely effective tool
for developing visualizations and animations. The Spline
Animator module was developed to perform flybys and may be
used to construct animated curves or flight paths in 3D.
10. Acknowledgments
The eruption visualization of Mount Redoubt Volcano was
produced in a collaborative effort by the University of Alaska
Geophysical Institute and the Arctic Region Supercomputing
Center. Special thanks are due Ken Dean of the Alaska Volcano
Observatory and to Mark Astley and Greg Johnson of ARSC.
This project was supported by the Strategic Environmental
Research and Development Program (SERDP) under the
sponsorship of the Army Corps of Engineers Waterways
Experiment Station.

11. References
[1] Astley, M. and M. Roth, Spline Animator: Smooth camera motion for AVS animations, AVS '94 Conference Proceedings, May 1994.
[2] Tanaka, H., K. G. Dean, and S. Akasofu, Prediction of the movement of volcanic ash clouds, submitted to EOS Transactions, Am. Geophys. Union, Dec. 1992.
[3] Tanaka, H., Development of a prediction scheme for the volcanic ash fall from Redoubt Volcano, First International Symposium on Volcanic Ash and Aviation Safety, Seattle, Washington, July 8-12, 1991, U. S. Geological Survey Circular 165, 58 pp.


A Graphical User Interface for Networked
Volume Rendering on the CRAY C90
Allan Snavely
T. Todd Elvins
San Diego Supercomputer Center
Abstract
SDSC_NetV is a networked volume rendering package developed at
the San Diego Supercomputer Center. Its purpose is to offload computationally intensive aspects of three-dimensional data image rendering
to appropriate rendering engines. This means that SDSC_NetV users
can transparently obtain network-based imaging capabilities that may
not be available to them locally. An image that might take minutes to
render on a desktop computer can be rendered in seconds on an SDSC
rendering engine. The SDSC_NetV graphical user interface (GUI),
a Motif-based application developed using a commercially available
tool, TeleUSE, was recently ported to SDSC's CRAY C90. Because
TeleUSE is not available on the C90, the interface was developed on
a SUN SPARC workstation and ported to the C90. Now, if users
have an account on the C90, they can use SDSC_NetV directly on the
CRAY platform. All that is required is a terminal running X Windows,
such as a Macintosh running MacX; the SDSC_NetV graphical user
interface runs on the C90 and displays on the terminal.

1 Introduction

Volume rendering is the process of generating a two-dimensional image of
a three-dimensional data-set. The inputs are a data-set representing either
a real world object or a theoretical model, and a set of parameters such
as viewing angle, substance opacities, substance color ranges and lighting


values. The output is an image which represents the data as viewed under
these constraints. The user of a volume rendering program will want to be
able to input these parameters in an easy fashion and to get images back
quickly.
The task of generating the image is usually compute intensive. Three-dimensional objects are represented as three-dimensional arrays where each
cell of the array corresponds to a sample value for the object at a point in
space. The size of the array depends on the size of the object and the sampling rate. Data collected from tomographic devices such as CT scanners
are often 256*256*256 real numbers. Grids with 1024 sample points per dimension are becoming common. As sampling rates increase due to improved
technology, the data sizes will grow proportionally. Data generated from a
theoretical model can also be very large.
There are several algorithms that traverse such sample point grids to
generate images. Two of the most popular are splatting and ray-casting.
Both of these involve visiting each cell in the array and building up an image
as the array is traversed. Without going into the details of the algorithms,
it is apparent that their theoretical time complexity is

O(N1 * N2 * N3),

where Ni is the size of the i-th dimension. When large arrays are considered,
the actual run time on a workstation class CPU may be quite long. The CPU
speed and memory limitations of the typical scientific workstation make it
unsuitable for interactive speed rendering.
If we were to characterize an ideal rendering machine, it would be: inexpensive, so everyone could have one; very fast, to allow interactive exploration
of large three-dimensional datasets; and sitting on the desktop, to allow
researchers to do their visualizations without traveling.
SDSC_NetV, a networked volume rendering tool developed at the San
Diego Supercomputer Center, addresses the needs of researchers who have
limited desktop power. SDSC_NetV distributes CPU-intensive jobs to the
appropriate rendering resources. At SDSC these resources include fast rendering engines on the machine floor. Researchers access these resources via a
graphical user interface (GUI) running on their desktop machines. The GUI
allows the researcher to enter viewing parameters and color classifications in
an easy, intuitive way. The GUI also presents the image when the result is
sent back over the network.


The GUI itself is quite powerful. It allows the user to interactively examine slices of the data-set and to do substance property classification without
requesting services via the network. Up to eight data ranges can each be
associated with a color and an opacity value.

Once the user has set all the parameters to his/her satisfaction, a render
request causes the render job to be spawned on the appropriate rendering
engine at SDSC. The optimized renderer subsequently sends images to the
GUI.
The design of such a network GUI is a significant software engineering
task, actually as complicated as the coding of the rendering algorithms. Programmers can write GUIs for X-window based applications in raw X/Motif
or with a GUI-building utility. The second approach is the more reasonable
one when the envisioned GUI is large and complex.

2 A C90 GUI

Recently, the SDSC_NetV GUI was ported to the SDSC CRAY C90. The
goal was to extend the availability of SDSC_NetV. Previously, the GUI ran
only on workstation-class platforms: specifically, Sun SPARC, SGI, DEC and
DEC Alpha workstations. The porting challenge proved to be significant
because we needed to preserve I/O compatibility between the different
platforms, and because there is no GUI builder program on the SDSC
C90. An examination of SDSC_NetV's architecture highlights the need for
I/O compatibility. SDSC_NetV is a distributed program. The GUI usually
runs on the user's workstation. This first component sends requests across
the network to a second component, running on an SDSC server, which accepts incoming requests for render tasks and delegates these tasks to a third
class of machines, the rendering engines. This means that any time a new
architecture is added to the mix, communication protocols between the various machines must be established. Typically, this sort of problem is solved
by writing socket-level code which reads and writes ASCII data between the
machines. Communication via a more efficient and compact binary protocol
requires each machine to determine what type of architecture is writing to its
port and to decode the input based on what it knows about the sender's data
representations. Among a network of heterogeneous machines, establishing
all the correct protocols can become a big programming job.


3 A Better Solution

The GUI for SDSC_NetV was developed on workstation platforms using
TeleUSE. TeleUSE is a commercially available GUI builder that allows one
to quickly define widget hierarchies and call-back functions. A GUI that
might take several days to develop in raw X/Motif can be implemented in
a few hours using TeleUSE. TeleUSE generates source code in C with calls
to the X-Motif library. To get SDSC_NetV's GUI running on the C90, we
ported the TeleUSE-generated source code and compiled it. Although this
source code compiled almost without modification, the object produced was
not executable because of pointer arithmetic unsuitable to the C90's 64-bit
word. Programmers in C on 32-bit word machines often treat addresses and
integers interchangeably. Of course this will not work on a machine with an
address word length different from the integer representation length. These
sorts of problems were easily corrected.
As described in the previous section, the GUI has to communicate across
the network to request a render. The GUI has to be I/O compatible with the
SDSC machines in order for this communication to take place. The problem
of I/O compatibility is one that has been encountered before at SDSC. In
response to this problem, we have built a library, the SDSC Binary
I/O library, for compatible I/O between heterogeneous architectures. A network version allows data to be written across the network with a call to
SNetWrite. This results in conversion of the data representation to the appropriate format for the target architecture. When the data are read in at the
other end, using SNetRead, they are reassembled into the correct representation for the receiver. SDSC_NetV and SDSC Binary I/O are available via
ftp from ftp.sdsc.edu.
Once the SDSC_NetV GUI was linked with the SDSC Binary I/O library,
a more subtle problem arose. The main window widget was responsive to
user input, but none of the sub-windows brought up in response would take
input. Eventually it was discovered that an X application structured as
groups of cooperating top-level widgets would not function. The X-Motif
library installed on the C90 expected one widget to be the designated top-level manager. The exact reason for this remains unclear, as the version of X
on the CRAY is the same as the version on the workstations. Once this fact
was discovered, the GUI was restructured on our development workstations
and re-ported to the C90. With the new widget hierarchy the GUI worked
fine.
Considering the description of rendering given in the introduction, we
can form an idea of the attributes of an ideal rendering machine. It would
be very fast, to allow interactive exploration of large three-dimensional data
sets. It would sit on the desktop, to allow researchers to do their visualizations without traveling. It would be inexpensive, so everyone could have one.
SDSC_NetV gives many CRAY users a virtual ideal machine.

4 Results

The goal of SDSC_NetV is to provide cutting-edge rendering technology to
the researcher on the desktop. Porting SDSC_NetV to the C90 has helped
to realize this goal. The GUI runs quickly on the C90. Graphically intensive operations such as slice examination and classification run at interactive
speed. The availability of SDSC_NetV has been extended. Now a user with
an account on the C90 can display the GUI on a Mac or even an X terminal.
The performance of the GUI over the network depends on the speed of
the link. However, the basic functions of setting parameters and displaying
images are now available on a platform where such functionality was not
available before. SDSC_NetV is used by scientists in a number of disciplines.

5 Images

Art Winfree, a researcher at The University of Arizona, is using SDSC_NetV
to gain intuition into stability in chemically reactive environments. Figure
1 shows the main window of the SDSC_NetV GUI with an image of a
theoretical model known as an equal diffusion meander vortex. It represents
a compact organizing center in a chemically excitable medium (or really, a
mathematical model thereof). The main feature is that all substances involved diffuse
at the same rate. The organizing center, in this case, is a pair of linked rings,
which can be seen only as the edges of iso-value fronts colored with the
Classifier. The key thing about this, besides equal diffusion, is that the rings
are not stationary, but are forever wiggling in a way discovered only a couple
of years ago, called meander. Despite their endless writhing, which precludes
the adjective 'stable', the rings and their linkage persist quite stably. A picture
allows reasoning about such structures in ways that may not be apparent from an
equational model.

6 Conclusion

The accessibility of the C90 and the versatility of SDSC_NetV work together
to provide a state-of-the-art tool for scientists involved in a wide range of
disciplines. Visualization of data representing natural phenomena and theoretical models is now available on more scientists' desktops.

Acknowledgements
This work was supported by the National Science Foundation under grant
ASC8902825 to the San Diego Supercomputer Center. Additional support
was provided by the State of California and industrial partners.
Thanks to the network volume rendering enthusiasts at SDSC, in particular, Mike Bailey, Max Pazirandeh, Tricia Koszycki, and the members of the
visualization technology group.


Figure 1: An Equal Diffusion Meander Vortex.


Management

HETEROGENEOUS COMPUTING USING THE CRAY Y-MP AND T3D
Bob Carruthers
Cray Research (UK) Ltd.
Bracknell, England

Introduction
Two applications which use a CRAY Y-MP and a T3D in a
heterogeneous environment will be described. The first
application is a chemistry code that was ported from the
Y-MP to the T3D, and then distributed between the two
machines. The combined solution using both machines ran
the test cases faster than either the Y-MP or the T3D
separately.

The second application is slightly different, since the
complete problem could not be run on the T3D because of
memory limitations imposed by the design of the code
and the strategy used to generate a parallel version.
Distributing the problem across the Y-MP and T3D
allowed the application to run, and produced a time
faster than that on the Y-MP.

Both these applications were encountered as part of two
benchmark efforts, and thus show how real user problems
can be adapted to heterogeneous computing.

Application 1
The first application is a large chemistry code that is
regularly used in a production environment, and runs on
both CRAY Y-MP's and on several MPP platforms. The
version that was included in the benchmark used PVM for
message passing, and was initially run on a Y-MP in this
form to validate the code. Following this, the code was
ported to the T3D, and various optimisations to the I/O
and message passing were implemented to improve the
performance. In this form, the T3D ran the code
considerably faster than the Y-MP.

As part of the benchmark activity, we were looking for
ways to improve the performance of the code, and to
demonstrate the benefits of the Cray Research
Heterogeneous Computing Environment. This particular
code had sections that were not well suited to an MPP, but
were known to run well on a vector machine. With this in
mind, a strategy was evolved to do the well vectorised
part on the Y-MP, and leave the rest on the T3D where it
could take advantage of the MPP architecture. The
strategy for this is outlined below.

First, a master control program had to be written that
executed on the Y-MP. This was responsible for
establishing contact with PVM, reading in the number of
PE's to be used on the T3D, and then spawning the
necessary tasks on the T3D via PVM. The Fortran code for
this part of the operation is shown below:

      include '../sizes'
c
      common/wordlen/ nrbyte,nibyte
      common/pvm/ npvm,ipvm
c
      print *,' Input the Number of Processes Required ',
     2        'on the CRAY T3D'
      read(*,*) npvm
c
      call pvm_control()
      stop 'End of Solver'
      end
c
      subroutine pvm_control()
      include '../sizes'
      include '../fpvm3.h'
      include '../jrc_buf_sizes'
c
      common/wordlen/ nrbyte,nibyte
      common/pvm/ npvm,ipvm
      character*8 mess_buff
      dimension itids(npvm)
c
      call pvmfmytid(itid)
      if(jrc_debug.eq.1) then
        write(0,*) ' '
        write(0,2000) itid
 2000   format('TID of SOLVER Process on the ',
     2         'CRAY Y-MP is ',z16)
      end if
c
      call pvmfspawn('a.out', PvmTaskArch,
     2               'CRAY', npvm, itids, numt)
c
      if(numt.ne.npvm) then
        write(0,*) 'Response from PVMFSPAWN ',
     2             'was ',numt,' rather than ',npvm

Copyright (C) 1994. Cray Research Inc. All rights reserved.


        stop 'Error in PVMFSPAWN'
      else
        do i=1,npvm
          if(itids(i).ne.1) then
            itid_pe0=itids(i)
            ntid_pe0=i
            goto 1000
          endif
        end do
        write(0,*) 'All T3D TID''s are 1 - PE 0 is absent'
        stop 'TID Error'
 1000   continue
        if(jrc_debug.eq.1) write(0,2100) itid_pe0
 2100   format('T3D Initialised by SOLVER - PE 0 ',
     2         ' has TID ',z16)
      endif
The main program begins by asking the user for the
number of PE's to be used on the T3D, and then calls the
main control routine, pvm_control. This enables the
latter routine to accurately dimension the array 'itids',
which holds the PVM Task Identifiers (TID's) of the PE's
on the T3D.

This routine finds the TID of the Y-MP process and prints
it out, and then spawns the T3D processes. The variable
'numt' contains the number of processes actually spawned
by PVM, which is checked against the number requested.

The final part of the set-up procedure involves the Y-MP
searching for the TID of PE 0 on the T3D. By default only
PE 0 can communicate with the Y-MP, and PE's that
cannot communicate with the Y-MP have their TID's set
to unity. This can be modified if required by an
environment variable.

The control routine then enters a state where it waits for
the T3D to send it work to do.
c
 1100 continue
      call pvmfrecv(itid_pe0, 9971, ibufid)
      call pvmfunpack(BYTE1, mess_buff, 8, 1, info)
c
      if(jrc_debug.eq.1) write(0,*) 'SOLVER ',
     2  'Received Message "',mess_buff,'" from PE 0'
c
      if(mess_buff(1:5).eq.'Solve') then
        call bob_do_work(itid_pe0, jrc_debug)
      else if(mess_buff(1:8).eq.'Finished') then
        return
      else
        write(0,*) 'Illegal Message Found in ',
     2             'SOLVER - ',mess_buff
        stop 'Protocol Error'
      end if
      goto 1100
      end

The control routine can interpret two messages from the
T3D - 'Solve', indicating that it should do some work, and
'Finished', indicating that the T3D has finished its work
and that the Y-MP process should clean up and terminate.

Further data is exchanged between the T3D and Y-MP
via both PVM and the UNICOS file system. Control
parameters are sent via PVM, while the main data array
is sent over as a file. The Y-MP is responsible for
performing the necessary data format conversion from
IEEE to Cray Research floating-point format, and for
performing the reverse operation when it has finished
computing and wishes to return information to the
T3D. This is done using IEG2CRAY and CRAY2IEG
respectively, with conversion type 8.

While the Y-MP is working, the T3D enters a wait loop
similar to that on the Y-MP, and waits for the signal
'Done' from the Y-MP. At that point it picks up the new
data file. Note that both the T3D and the Y-MP must
have set up their data files before signalling that there
is work to do.

The final point to remember is that any task spawned
by PVM will use the directory pointed to by the shell
variable $HOME in which to create and search for files.
If the Y-MP executes in any directory other than $HOME,
the exchange of files with the T3D will not take place.
This can be controlled by changing $HOME prior to
starting the Y-MP process so that it points to the correct
directory while the tasks are running.

This heterogeneous approach enabled the time for the
complete job to be reduced, so that it was less than either
the time on the Y-MP or the T3D.

Application 2

Like the first code, this application is a large user
program that regularly runs on a Y-MP. The owners of the
code funded one of the universities in the UK to produce a
parallel, distributed version which could be run on a
group of workstations.
We started to look at this version of the code when we
received a benchmark containing it. The initial part of
the work consisted of converting the code from its 32-bit
version using a local message passing language to a 64-bit
version that used PVM. This PVM version was
eventually ported to both the Y-MP and the T3D, and ran
the small test cases provided.
However, the design of the program and the parallel
implementation constrained the maximum size problem
that could be run on the T3D, and meant that the large
problems of greatest interest could not be run. The
underlying method of locating data in memory used the
"standard" Fortran technique of allocating most of the
large data arrays in one large common block via a memory
manager. The strategy to generate a parallel version of
the code relied on one master processor doing all the input
and data set up, and then distributing the data to the
other processors immediately before running the parallel,
compute-intensive part of the code. Similarly at the end,
the results were collected back into the master processor
for output and termination processing. This meant that
the master processor needed sufficient space to store all
the data at the start and the end of the compute phase.
For the problems of interest, this space is typically over
30 Mwords on the Y-MP, well beyond the capacity of a
single PE on the T3D.
The strategy to solve this dilemma was to split the
computation into three distinct phases. The first, which
is the initialisation, runs on the Y-MP, and instead of
distributing the data to other processors prior to the
parallel section, outputs the required arrays to a series of
files and then terminates. The second phase which runs
on the T3D picks up the required data from the UNICOS
file system, completes the parallel compute-intensive
part of the calculation, and then outputs the results to a
second set of files. The third phase runs on the Y-MP and
performs the merging of the resultant arrays and outputs
the results.
In this approach, those phases that require a large
amount of memory are run on the Y-MP, while the T3D
executes the parallel part which contains the heavy
computation. Although the scheme sounds simple in
outline, the implementation was actually quite tricky for
a number of reasons:
•	The arrays to be distributed for parallel processing
are defined by the user in the input data, as are
those required to be saved after the parallel
processing. This means that all three phases must
be able to read the input data, but only select those
commands that are necessary to perform their
operations.

•	Data other than the arrays mentioned above need to
be passed between the various phases - for example,
control variables and physical constants that define
the problem. These are typically stored in separate
common blocks, and do not have to be selected via
the user input for distribution and merging.
For the transition between the input processing and
the parallel computation phases, this proved easy
to sort out, since the original parallel code had had
to distribute most of this information from the
master processor as well. At the end of the parallel
processing, however, the master processor was
assumed to have all the data it needed, and the
other PE's only sent back their share of the
distributed arrays. The merge process thus needed
careful analysis to ensure that all the relevant data
was regenerated in the master processor.

•	Although the problem is defined above in terms of
the T3D performing phase 2, there is no reason why
any other machine running PVM could not perform
the parallel part, in particular several processors of
a Y-MP or C90. To allow this to happen, the T3D-specific code had to be compiled in or out
conditionally, and data conversion applied only
when strictly necessary.

•	For small problems, there is no need for three
separate images to be run. The code therefore
needed an option to allow all three phases to be
executed during one pass through the code, either on
the T3D or any other platform. This imposes some
extra logic in the code, but means that it can be run
as originally intended.

•	For a given hardware platform, the same binary
executes any of the phases, or all three phases
together. The logic for this is embedded in the code,
and controlled by input variables.

•	To simplify code maintenance, it was decided that
there would be only one version of the source.
Different hardware platforms are selected via
conditional compilation.

The progress through the various stages is made
transparent to the user via the use of shell scripts. The
user can submit a problem for execution to the Y-MP, and
the various phases are executed on the appropriate
hardware. Restart facilities can be included if necessary.
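Such a driver might look like the following minimal sketch; the job names, phase ordering comments, and NQS-style submission line are hypothetical placeholders, not the benchmark's actual scripts:

```shell
#!/bin/sh
# Hypothetical three-phase driver.  Data travels between phases as
# files on the shared UNICOS file system, so each phase can also be
# restarted by hand when convenient.

run_phase() {
    phase=$1 host=$2
    echo "phase $phase on $host"
    # a real script would submit here, e.g.:  qsub -q $host phase$phase.job
}

run_phase 1 ymp   # initialisation: read input, write distribution files
run_phase 2 t3d   # parallel compute: read files, compute, write results
run_phase 3 ymp   # merge the result arrays and produce output
```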
This approach also offers considerable flexibility when
either running or testing the program. Unlike the first
application, it is not necessary to run each part
immediately after the preceding one, nor do the Y-MP
and T3D have to wait for messages from each other. It is
simply necessary to preserve the files between the
various phases, so that each phase can be started when
convenient. Finally, code can be tested or developed on
either the Y-MP, T3D or both.
This approach allows what seems at first glance to be an
intractable problem to be solved using the heterogeneous
environment available with the Cray Research T3D.
The Y-MP component is used for the pieces that it is best
suited to, while the T3D performs the heavy computation
for which it was designed.


FUTURE OPERATING SYSTEM DIRECTION
Don Mason
Cray Research, Inc.
Eagan, MN
INTRODUCTION
The UNICOS operating system has been under
development since UNIX System V was ported to Cray
Research hardware in 1983. With ten years of
development effort, UNICOS has become the leading
UNIX implementation in a supercomputing
environment, in all aspects important to our customers:
functionality, performance, stability and ease of use.
During these ten years, we have matured from a system
architecture supporting only four-processor CRAY-2s or
X-MPs, to the complexity of 16 processors of the C90,
and hundreds of processors on a T3D. We invented and
implemented new I/O structures and multiprocessing
support, including automatic parallelization, and a host
of industry-leading tools and utilities to help the users
to solve their problems, and the system administrators
to maximize their investment.
The system, which initially was simple and small,
grew in both size and complexity.
The evolution of hardware architectures towards
increasing number of processors in both shared and
distributed memory systems and the requirements of
open, heterogeneous environments, present challenges
that will be difficult to meet with current operating
system architecture. We decided that it was time for us
to revise our operating system strategy, and to define a
new path for the evolution of UNICOS. This evolution
should enable us to face the challenges of the future,
and at the same time preserve our customers' and our
own software investments.
After two years of careful evaluation and studies, in
1993 we decided to base the future architecture of
UNICOS on microkernel technology, and we selected
Chorus Systems as the technology provider.
This evolution will preserve all the functionality and
applications interfaces of the current system, and the
enhancements that will come in the meantime.
MICROKERNEL/SERVER ARCHITECTURE
Figure 1 compares the current UNICOS architecture
to the future one, that we refer to as "serverized."
The current system architecture is depicted on the left
side. It is characterized by a relatively large, monolithic
system kernel, that executes in the system address

Copyright © 1994. Cray Research Inc. All rights reserved.


Serverized UNICOS

UNICOS 7.0


Figure 1

space, and that performs system work on behalf of
user's applications, on the top part of the figure. The
applications interact with the system via a "system call
interface," represented on the figure by the "interface"
area. The kernel, to perform its tasks, uses "privileged"
instructions, that can only be executed in the system
space.
The right side of Figure 1 represents the new
architecture. Nothing changes in the upper part of the
figure: the users applications are exactly the same, and
they use the same system interface as before. The
monolithic kernel is now replaced by a "microkernel"
and a set of new entities, called "servers." The
microkernel provides the "insulation" of the system
software from the particular hardware architecture, and
provides a message passing capability for transferring
messages between applications and servers, as well as
between servers themselves.
System tasks that previously were performed by the
system kernel will be executed by the specialized
servers. Each of them is assigned a particular and well
defined task: "memory manager", "file manager",
"process manager" etc... The servers, in their majority,
operate in the system space. However, some of them,
which do not require access to privileged instructions,
can perform in user space. Each of the servers is
"frrewalled" from the others, and can communicate with
the others only through a well defined message passing
interface, under the microkemel's control.
Microkernels can also communicate from one physical
system to another, across a network. This is the
situation represented by Figure 2. In a configuration
like the one depicted, not all systems need to have all
the servers. Certain systems can be specialized for
particular tasks, and therefore might require only a
subset of the available servers. In the case of a CRAY
MPP system, each of the processing elements contains
a microkernel and a few servers, only those that are
necessary for executing applications. Most of the
system services can be provided either by some other
PEs of the system, which have the appropriate server,
or even by a different system on the network.

Parallel Vector Platform

MPP Platform


For an MPP in particular, this frees the PE's memory
space for user applications, rather than using it for
the system.
A long-term objective with the new architecture is to
provide a single UNICOS Operating System which will
manage the resources of diverse hardware architectures.
This is called "Single System Image" (SSI) in the
industry. From an administrator's point of view SSI
will facilitate management of computing resources as if
they were a single platform; for example, an MPP and
a C90. From an applications point of view SSI means
OS support for scheduling computing resources such
that components of the application can efficiently
utilize diverse platform characteristics.

Figure 2

There are several advantages of adopting a serverized
architecture:

o Most of the system software is "insulated" from the
hardware. Therefore, porting to new architectures is
made much easier, and safer.

o System functions can be distributed across platforms.

o Servers are easier to maintain, since each of them is
relatively small and self-contained. Interactions between
the server and the "external world" are done via clear
interfaces. A change to a server should not have any
impact on other parts of the system.

o Servers can be made more reliable, precisely because
of their smaller size, and well defined functions and
interfaces. They can be "cleanly" designed, and well
tested.

o A serverized system can evolve in a "safer" way than
a monolithic kernel.

o Systems can be customized by introduction of
custom servers, designed to perform a particular task
not required by other customers.

o If an industry agreement on microkernel interfaces is
achieved, this would open the way for leveraging
servers across different platforms and architectures.

THE CHORUS TECHNOLOGY CHOICE
Before selecting Chorus technology, we conducted a
comprehensive study of the technologies available in
the market, and their adaptability to Cray Research
hardware architectures. Several options were examined,
including Mach. There are more similarities than
differences among the available microkernel
technologies. All microkernels attempt to mask
hardware uniqueness and manage message passing. The
differences become evident only when looking at the
application of the technology to specific hardware
platforms such as the CRAY Y-MP or CRAY T3D
series. The selection criteria included the following:

o The adaptability to a real memory architecture

o Performance

o Serverization model (existence of a multi-server
implementation)

o Ease of implementation - time-to-market

o Existence of programming tools

The technology from Chorus Systems was selected,
since it satisfied all of our selection criteria. In
particular, this choice will allow us to deliver a stand-alone MPP capability approximately 18 months sooner
than with any other technology. Also, the same
technology can be used on both real and virtual
memory architectures.
TIMETABLE
The transition from current UNICOS and UNICOS
MAX to the new architecture will take from two to four
years, depending on hardware platforms. Initial
implementation will become available on the MPP
systems. In parallel with the development of current
structure of UNICOS, we will work on a serverized
version for parallel vector platforms.


Future Operating System Directions - Serverized Unicos
Jim Harrell
Cray Research, Inc.
655-F Lone Oak Drive
Eagan, Minnesota 55121

ABSTRACT
This paper is the technical component of a discussion of plans for changes in operating system
design at Cray Research. The paper focuses on the organization and architecture of Unicos in
the future. The Unicos operating system software is being modified and ported to this new architecture now.

1 Introduction

Over the past several years the Unicos Operating System Group has been studying the needs and requirements for changes in Unicos to provide support for new
Cray hardware architectures, and new software features
as required by Cray customers. At previous Cray User
Group Meetings and in Unicos Advisory Meetings there
have been open discussions of the requirements and
choices available. In 1993 a decision was made to move
forward with the Chorus microkernel and Unicos as the
feature base and application interface. This talk
describes some of the technical components of the new
operating system architecture, explains how the new
system is organized, and what has been learned so far.
The talk is composed of five parts. The first part
describes the Chorus microkernel and the server model.
This is the base architecture we have chosen to use in
the future. We reviewed the choices at the Cray User
Group Meeting in Montreux in March of 1993. The
second part of this talk describes our experiences in
porting Chorus to a Cray YMP architecture. This phase
of the program was an important step in proving the
technology is capable of supporting high performance
computing. The third part of the talk explains how we
expect Unicos will "map" onto the Chorus model. We
have chosen to move Unicos forward as the feature base
and application interface. Our goal remains to provide
full application compatibility with Unicos. The fourth
part describes the technical milestone for 1993 and the
status of that milestone. The final part of the talk discusses our interest in interfaces for servers and microkernels that can allow vendors and users with different
operating system bases to provide heterogeneous distributed systems in the future.

Copyright © 1994. Cray Research Inc. All rights reserved.


2

The Chorus Model

The Chorus model is based on decomposing a monolithic
system into a microkernel and a set of servers. The microkernel manages the hardware and provides a basic set of
operations for the servers. The servers implement a user
or application interface and, in Chorus, are usually bound
into the same address space as the microkernel. This
binding of servers is done for performance. Servers can
run in user mode. This allows flexibility in the operating
system organization and can add resiliency. An important
component of the Chorus model is the Remote Procedure
Call (RPC) support. Traditionally RPCs require a context
switch, and message or data copies. In the Chorus model
this is referred to as Full Weight RPC (FWRPC). Chorus
provides a Light Weight RPC (LWRPC) that can be used
to communicate between servers in kernel space, that is,
servers bound in the same address space as the microkernel. LWRPC does not context switch or change stacks.
Instead of copying message data, a pointer to the message
is passed. The result is a significant reduction in the cost
of communication between servers. The trade-off for the performance improvement is that servers in the same address
space are not protected from random memory stores.
There is no "firewall" between the servers. There is an
obvious requirement for FWRPC across a network, and
for servers running in user mode, user space. But when
servers are in the same address space the requirement is
not as obvious.
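As a toy illustration of this trade-off, the sketch below (purely illustrative Python; Chorus itself is not written this way, and the names are invented) contrasts a copying, full-weight call with a pointer-passing, light-weight call:

```python
# Conceptual sketch (not Chorus code) contrasting full-weight RPC,
# which copies the message across a protection boundary, with
# light-weight RPC, which hands the server a reference to the
# caller's message because both live in the same address space.

def fwrpc(server, message: bytes) -> bytes:
    # Full-weight: the kernel copies the request into the server's
    # space (and the reply back out), modeling the context switch
    # and data-copy cost described in the text.
    request_copy = bytes(message)          # copy in
    reply = server(request_copy)
    return bytes(reply)                    # copy out

def lwrpc(server, message: bytearray) -> bytes:
    # Light-weight: no copy -- the server sees the caller's buffer
    # directly. Faster, but a buggy server can scribble on it:
    # there is no "firewall" between caller and server.
    return server(message)

def echo_server(msg):
    return msg

print(fwrpc(echo_server, b"stat /tmp"))    # reply copied on the way out
buf = bytearray(b"stat /tmp")
print(lwrpc(echo_server, buf))             # same object, zero copies
```

The light-weight path avoids both copies, which is where the measured speedup comes from, but the server receives the caller's actual buffer, so a misbehaving server can corrupt it.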
3

Porting Chorus to a YMP

In 1992 we worked with Chorus Systems to port a minimal Chorus operating system to a YMP machine. The
purpose of this test was to determine if the software architecture was viable on a YMP hardware architecture.
There were concerns about Chorus memory management
on a YMP. Most microkernel-based systems use virtual

memory extensively. Chorus claimed to be architecture
neutral in memory management. There were other concerns about machine dependent differences. Chorus is
normally ported to smaller machines. The port would
provide answers to these concerns and allow us real
hands-on experience with Chorus. Use of the code
would make the evaluation and comparison with other
systems real. We set a goal of demonstrating at the end
of six months.
The Chorus components of the port were the microkernel, a Process Manager (PM), and a very simplified
Object Manager (OM), or filesystem server. The
machine-dependent parts of the microkernel were modified to support the YMP, and a simple memory management scheme was put in place to support the YMP.
The Chorus PM was modified to use Unicos system call
numbering so we could run Unicos binaries. This
greatly simplified what had to be done to run tests. The
Chorus OM was greatly simplified. The support for
Unix filesystems was removed and in its place was put
a table of files that would be "known" to this OM. All
of the disk device support was removed from Chorus,
and replaced with code to access files from YMP memory. This dispensed with the need for a filesystem, drivers, and IOS packet management.
The ported system was booted on a YMP and simple
tests run from a standard Unicos shell, /bin/sh. This
confirmed our view that Unicos could be used as the
operating system personality. The Unicos shell had
been built under standard Unicos and yet under the test
system it functioned exactly as it did under Unicos. The
tests that were run did a variety of very simple system
operations. The test results were compared to Unicos
results using the same tests. The Unicos system was run
on the same YMP and configured to use a memory filesystem. The results showed two important facts. Using
FWRPC the performance of Chorus was 2 times slower
than Unicos for the same system call. Using LWRPC
the performance was comparable to Unicos.
Cray Research is not planning on using all of the Chorus product. We had previously decided, in conjunction
with our customers, that we should use Unicos as the
base of any future systems. We want to use certain features from Chorus to help split Unicos into components.
The primary Chorus technology that is being used in the
restructuring of Unicos is the microkernel. It is much
the same as the Chorus version, with the exception of
the machine dependent portions. We are augmenting it
to provide support for some other services like swapping and enhancements to scheduling and monitoring.
We are also using the Chorus Process Manager (PM) as
the basis for our PM. We are modifying the way that the
system calls, signalling, etc., work to match Unicos.

The last major piece of Chorus technology we are using
is the server "wrappers". This is a series of routines that
provide two capabilities. The first is a model for the
interfaces needed to get a server to communicate with
another server. The second is as a model for how to
mimic Unix or Unicos internal kernel requests or convert the kernel requests to microkernel requests.

4

Mapping Unicos onto the Chorus Model

The restructuring of Unicos will maintain complete
user compatibility. At a very early stage of the project
we determined that the best way to provide this compatibility is to use as much Unicos code directly as possible. This has a side effect, in that we can move more
quickly to serverize Unicos without having to rewrite
code. We have chosen to have a large number of servers, aggressively trying to modularize Unicos wherever
possible. We believe that there are at least a dozen different potential servers. For example the device drivers
for disk, tape, and networking will form three separate
servers. The IOS packet management code has already
been made into a server. The terminal or console support is also already a separate server.

5

1993 Milestone - Some Progress

At the end of 1993 we completed one of the formal
project milestones. We ran a system composed of a
Chorus-based microkernel, our PM, an OM or filesystem server that implements the NC1 filesystem, a disk
server for disk device support and a packet server. We
also added a terminal server for console communications. The system was run on a YMP using an 8.0 filesystem on Model E IOS-connected disks. This
milestone verified that several major servers were functioning together and that device access worked. This
milestone also continues to monitor progress with the
goal of Unicos application compatibility.

6

Future Interfaces

In the computer industry there are several companies
and research facilities that are studying microkernels
and serverized systems. We expect that a number of
distributed systems will take a similar form to the direction we have chosen. Cray Research believes that in the
future heterogeneous systems will depend on different
operating systems from different vendors being able to
interact and interoperate at a deeper level than currently
exists. This interoperation is required to support Single
System Image and system resource management in distributed systems. In order to facilitate this communication Cray is taking a leadership role in trying to find


ways to standardize server and microkernel interfaces.
This work has met with some success, but will require
the interest and participation of our customers to convince computer vendors that Single System Image is a
serious requirement in the future.

7

Summary

We have shown progress towards the reorganization of
Unicos into a more modular form. We expect that this
new system will be capable of supporting all Cray hardware architectures, and capable of supporting all Cray
customer requirements by providing a better base for
new functionality and compatibility for current Unicos
applications.


Mass Storage Systems

Storage Management Update
Brad Strand
Section Leader, UNICOS Storage Management Products
Cray Research, Inc.
655-F Lone Oak Drive
Eagan, Minnesota 55121
bstrand@cray.com

ABSTRACT
This paper presents an update to the status of UNICOS Storage Management products and
projects at Cray Research. Status is reported in the areas of Hierarchical Storage Management products (HSMs), Volume Management products, and Transparent File Access products. The paper concludes with a timeline indicating the approximate introduction of several
Storage Management products.

1

Topics

Work on Storage Management products and projects at
Cray Research is currently focused in three major areas.
The first area is Hierarchical Storage Management
products, or HSMs. These products are designed to
allow sites to increase the effective size of their disks by
transparently moving, or migrating, data between disks
and cheaper media, such as tape. These products allow
sites to make their disk storage pool appear larger than
it actually is. The second area where Cray Research
is currently doing Storage Management work is in Volume Management. Volume Management products
allow users and administrators to use and manage a
large set of logical and physical tape volumes in an easy
manner. Finally, Cray Research is very active in the
area of Remote Transparent File Access products.
These are products which allow users and applications
to access files which physically reside on another system as if they were on the local system. These products
are often called "Remote File System" products,
because of the way they effectively extend the local
physical file system across the network. Each of these
three areas will be discussed in terms of current product
availability, and in terms of development projects currently active and underway. The paper concludes with a
timeline designed to indicate the relative times in which
Storage Management products are expected to be introduced into the UNICOS system.

2
Hierarchical Storage Management Products (HSMs)

2.1

Data Migration Facility (DMF)

2.1.1

DMF 2.1

A new version of the CRAY Data Migration Facility, or
DMF, version 2.1, is now available. DMF 2.1 is designed to run on top of UNICOS 7.0, UNICOS 7.C, and
UNICOS 8.0. DMF 2.1 provides several important new
features in DMF, a few of which will be described below.
DMF 2.1 adds support for Multi-Level Security, or
MLS. This means that sites running with UNICOS MLS
can use DMF to provide their HSM solution. DMF is
even part of the "Evaluated System," which means that
sites running Trusted UNICOS can run DMF without violating the security rating of their system.
DMF 2.1 also provides support for a multi-tiered data
management hierarchy, in that data may be moved between Media Specific Processes (or MSPs). This means
that sites can configure their systems to migrate data
from, for example, one tape format to another.
DMF 2.1 also adds support for gathering a variety of
dmdaemon statistics. These statistics may then be analyzed using the new dmastat(1) utility.

Copyright © 1994. Cray Research Inc. All rights reserved.


2.1.2

Client/Server DMF Project

One of the DMF development projects currently underway at Cray Research is called Client/Server DMF.
This project adds the functionality which will allow
sites to use DMF to manage their data which are stored
on Shared File Systems, or SFSs. The Shared File System is an evolving product, not yet available, which allows multiple UNICOS systems to share direct access
to disk devices. The Client/Server DMF project allows
a single DMF server to manage all the data in a UNICOS Shared File System environment.
Each UNICOS machine in the SFS complex needs to
run a copy of the DMF Client. One or more UNICOS
hosts with access to the secondary storage media
(tapes) needs to run the DMF Server. The system can
also be configured to provide redundancy, should one
particular DMF Server process fail.
The Client/Server DMF project internal goal is to demonstrate the functionality by year-end. More information on Client/Server DMF will be available at the
"Clusters BOF," hosted by Dan Ferber.
2.1.3
New Tape Media Specific Process (MSP)
Project

The other major DMF development project currently
underway is that which is developing a new tape MSP.
This is the tape MSP that was originally planned to be
available in DMF 2.1, but has since slipped into DMF
2.2. This new tape MSP is designed to provide a variety
of important improvements. Several of these are listed
below.
• Support for Improved Data Recording Capability
(IDRC). Some tape devices have controllers which pro-

vide on-the-fly data compression. This feature is called
Improved Data Recording Capability, or IDRC. The
new tape MSP will provide support for controllers using IDRC.
• Improved Media Utilization. The current tape MSP
design does not support the function of "append" to partially written tapes. The new tape MSP will support
appending to tapes, and will thereby obtain greater utilization of tape media.
• Much Improved Media Recovery. The new tape MSP
writes to tapes in blocks. The new tape MSP is designed
to be able to read and recover all blocks which do not
contain unrecoverable media errors. Thus, only data in
blocks which contain unrecoverable media errors
would be lost. This is a major improvement to the current design, which is unable to retrieve data written beyond bad spots on the tape.


• Absolute Block Positioning. Some new tape devices

support high speed tape positioning to absolute block
addresses. The new tape MSP will utilize this feature
whenever it is available.
• Asynchronous, Double-Buffered I/O. To fully utilize
the greater bandwidth available on some tape devices,
the new tape MSP will use asynchronous, double buffered I/O. We expect this will yield very near full channel I/O rates for these devices.
• New Tape and Database Formats. To provide some of
the new functionality, changes were made to the tape
format, and to the MSP database format. The new tape
MSP will support reading tapes in the old format. Conversion utilities will be supplied to convert databases
from the old format to the new format.
• Availability. The new tape MSP will be available in
DMF 2.2. We expect this release to be available in the
fourth quarter of 1994.
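The asynchronous, double-buffered I/O scheme listed above can be sketched as follows. This is a hypothetical Python model, not the MSP implementation: while one buffer's contents are being written out, the next block is already being read into the other buffer, keeping both sides of the channel busy.

```python
# Illustrative model of double-buffered copying (names invented):
# alternate between two buffers so the "write" of block i overlaps
# the "read" of block i+1.
import threading

BUFSIZE = 4  # tiny for illustration; a real MSP would use large tape blocks

def copy_double_buffered(src: bytes, sink: list):
    buffers = [bytearray(BUFSIZE), bytearray(BUFSIZE)]
    blocks = [src[i:i + BUFSIZE] for i in range(0, len(src), BUFSIZE)]
    writer = None
    for i, block in enumerate(blocks):
        buf = buffers[i % 2]            # alternate between the two buffers
        buf[:len(block)] = block        # "read" the next block into buf
        if writer:                      # wait for the previous write only
            writer.join()               # after the next read has started
        writer = threading.Thread(target=sink.append,
                                  args=(bytes(buf[:len(block)]),))
        writer.start()                  # "write" proceeds asynchronously
    if writer:
        writer.join()

out = []
copy_double_buffered(b"0123456789abcdef", out)
print(b"".join(out))                    # the original data, block by block
```

Because the previous write is joined only after the next read has been issued, the two operations overlap, which is how near full-channel rates are approached on devices with enough bandwidth.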
2.2

UniTree

The UniTree HSM product has now been ported to the
Y-MP EL platform by Titan Client/Server Technologies. We are currently awaiting a Titan installation of
UniTree at our first customer site. The UniTree version
ported to UNICOS 7.0.6 is version 1.7. Titan has not
shared their plans to port version 1.8 to UNICOS.
2.3

FileServ

A port of the FileServ HSM product to UNICOS is currently underway. This port is being done by EMASS.
Cray Research has cooperated with EMASS by providing "hooks" into the UNICOS kernel which allows
FileServ to obtain the information it needs more easily.
These hooks are integrated into UNICOS 8.0. Since the
porting work is being done by EMASS, persons interested in obtaining more detailed information, or project
schedules, should contact them directly.
2.4

"Open HSM" Project

Although DMF provides an excellent, high-performance HSM solution on UNICOS platforms, some of
our customers have indicated that the proprietary nature
of DMF is a disadvantage to them. They would prefer a
solution that is more "open," in the sense of being available on more than one hardware platform. In this way,
the customer's choice of HSM solution would not necessarily dictate their choice of hardware platform. In response to this requirement, Cray Research has begun a
project with the goal of providing an HSM product on
UNICOS which is close to DMF in performance and

functionality, yet which is also available on other hardware platforms.
We have been evaluating potential candidates to become our open HSM product for about six months.
Much of our work has been analyzing product designs,
to see which ones have the potential for being integrated into our unique supercomputing environment. As
one might imagine, there are many challenges to address when attempting to integrate an HSM product
with our Tape Daemon, Multi-Level Security, and very
large, high performance peripherals. Moreover, most of
the products we evaluated were not designed for multi-processor architectures, so there is a significant amount
of design work required to determine just where we can
add parallelism into these products.
Despite these hurdles, we feel we have made significant
progress on the project. Indeed, we feel we are close to
a decision point for selecting the product we will use as
our base. Our target is to have an open HSM product
running on UNICOS in 1995. Depending on which
product we choose, and the platforms on which it already runs, it may be possible that a version of the open
HSM product will be available on the Cray Research
SuperServer platform before a UNICOS-based product
is available.
3

Volume Management Products

3.1

CRAY/REELlibrarian (CRL)

3.1.1

CRL 2.0.5 Complete and Available

Release 2.0.5 of CRAY/REELlibrarian is now available. There are several important changes and improvements to the product, including MLS support, ER90
tape device support, and support for 44-character file
ids. The database format for CRL 2.0.5 is incompatible
with the format used in CRL 1.0.x, but a conversion
utility is supplied with CRL 2.0.5. Because CRL 2.0.5
takes advantage of Tape Daemon interface enhancements in UNICOS 8.0, CRL 2.0.5 will only run on
UNICOS 8.0 or higher. CRL 2.0.5 is not supported on
UNICOS 7.0 or UNICOS 7.C.
3.1.2

CRL Database Study Project

The primary CRL project currently underway is a study
which is examining the feasibility of incorporating an
improved database technology into CRL. The motive
for this study is to improve the reliability and the scalability of the CRL product. No decision has yet been
made as to whether or not we will proceed with this database upgrade.

4

Remote Transparent File Access Products

4.1
Open Network Computing / Network File System (ONC/NFS)
4.1.1

NFS Improvements in UNICOS 8.0

A great deal of effort went into improving the NFS
product Cray Research released in UNICOS 8.0. Improvements include the implementation of server side
readaheads, improved management techniques of the
mbufs used by the Y-MP EL networking code, new options for the mount(8) and exportfs(8) commands
which can provide dramatically improved performance
in the appropriate circumstances, and the support for
B1 level MLS.
Much more information about the NFS changes introduced in UNICOS 8.0 was given in the CUG presentation in Kyoto last September. Please refer to
that presentation for further details.
4.1.2
ONC+ Project
The primary active development project in the NFS
area is ONC+. ONC+ is a set of enhancements to the
current set of ONC protocols. Features of our ONC+
product are listed below.
• NFS Version 3 is an enhancement to the current NFS
protocol, NFS Version 2. NFS Version 3 provides
native support for 64-bit file and file system sizes, provides improved support for files with Access Control
Lists (ACLs), and provides a wide range of performance improvements.
• Support for Version 3 of the LockManager protocol,
which provides advisory record locking for NFS Version 3 files.
• Support for the AUTH_KERB flavor of Remote Procedure Call (RPC) authentication. AUTH_KERB
implements Kerberos Version 4 authentication to RPC
on a per-request basis. This component of the project
adds AUTH_KERB to both user-level RPC calls, and
to the kernel-level RPC calls that are used by NFS. The
result is a much greater level of RPC security than is
offered by either AUTH_NONE, AUTH_UNIX, or
AUTH_DES, the current supported RPC authentication
types.
• Support for NIS+, the enhanced version of the Network Information Services (NIS) protocols. These are
important enhancements which add security, functionality, and performance to NIS.
The CRAY ONC+ product will be a separately licensed
product, available in UNICOS 9.0.
4.2
Open Software Foundation / Distributed File System (OSF/DFS)


An important project in the area of Remote Transparent
File Access software is OSF/DFS. The DFS product
provided by Cray Research will provide most of the
important features of DFS, including support for both
client and server, as well as for file caching. However,
the Episode file system is not provided with DFS, so
certain Episode file system specific functions, such as
support for filesets, are not yet supported by our DFS
implementation.
The DFS server will be available as the CRAY DCE
DFS Server. The DFS client will be available as part of
the CRAY DCE Client Services product. Both of these
products are separately licensed, and both will be available 3Q94.
More detailed information about the Cray Research
DFS product will be given by Brian Gaffey in his talk
"DFS on UNICOS," scheduled for Thursday, 3/17/94
at 9:30.

5

Approximate Timeline
[Timeline figure, 1994-1996: UNICOS 8.0, DMF 2.1, CRL 2.0.5, and UniTree in 1994; New Tape MSP; DFS]


RAID Integration on Model-E lOS
Bob Ciotti
Numerical Aerodynamic Simulation
NASA Ames Research Center
Moffett Field, CA 94035, USA
ciotti@nas.nasa.gov

Abstract
The Redundant Array of Inexpensive Disks (RAID) technology has finally made its way into the supercomputing market.
CRI has recently made available software for UNICOS to
support this. This paper discusses the experiences over the
past twelve months of integrating a Maximum Strategy RAID
into the C90 environment. Initial performance and reliability
were poor using the early release Cray driver. Over time, performance and reliability have risen to expected levels albeit
with some caveats. Random i/o tests show that RAID is much
faster than expected compared to CRI DD60s.

1.0 Introduction
The Numerical Aerodynamic Simulation facility at NASA
Ames Research Center provides a large scale simulation capability that is recognized as a key element of NASA's aeronautics program, augmenting both theory and experimentation
[cooper93]. As a pathfinder in advanced large scale computing,
the NAS program tests and evaluates new hardware and software to provide a unique national resource. This role provided
the basis for NAS entering into a development/integration
project using HiPPI connected RAID. As such, NAS was the
first site to use the CRI IPI-3 driver to access a HiPPI connected RAID from the C90.
Maximum Strategy Incorporated (MSI) began manufacturing a
HiPPI attached RAID beginning with the Gen-3 system in
early 1990. These systems cost $15/megabyte. Comparable
CRI disk was available at $50/megabyte (DD40), with their top
of the line disk offered at a hefty $200/megabyte (DD49). With
such a difference in cost and the potential of high performance,
the MSI systems were extremely attractive.
The availability of the first Gen-3 systems led to the prospect
of providing inexpensive directly attached disks which transferred data at over eighty megabytes/second. IDA Princeton
led the first integration project of this technology into a Cray
environment [cave92], connecting a Gen-3 system to a

CRAY2. They developed a software driver which was the starting point for that available on CRI Model-E IOS systems
today. NAS considered attaching the Gen-3 to a YMP8-8/256
IOS-D system, but development time, loss of production, and
short expected lifetime negated any cost savings. At that time
the CRI solution cost $39/megabyte while MSI RAID was
$13/megabyte. These prices included all required hardware
(e.g., $250,000 for a CRI IOS-D HiPPI channel).
Understandably, CRI has been extremely slow to integrate this
cost effective storage into their product offerings, choosing
instead to build their own narrow stripe RAID product from
Single Large Expensive Disks (SLEDs) [badger92]. This
largely ignores the calls of customers to provide fast inexpensive media. Thus, the procurement for High Speed Processor 3
(HSP3) contained a requirement that potential vendors supply
support for IPI-3 over HiPPI.
Better performance would be achieved with direct support of
the MSI RAID system in CRI IOS hardware, yet even with the
20%-30% overhead of IPI-3 over HiPPI, performance is still
very good.
In late 1992, a separate procurement for HiPPI attached RAID
awarded MSI a contract to supply 75 gigabytes of storage. The
cost was approximately $9/megabyte. Since that time, competition has fostered falling prices with Gen-4 systems available
in quantity today at around $5/megabyte.
After the installation of HSP3 (C916/1024) in March 1993,
twelve months of testing were required before RAID provided
a reliable low cost and high performance alternative to CRI
proprietary disks.

2.0 Overview
The original RAID papers came out of the University of California at Berkeley in late 1987 [patterson87, patterson88]. It
was clear the gains in capacity and performance of SLEDs were
modest compared to those achievable from RAID. At the time of


[patterson87], Maximum Strategy had already built and marketed its first RAID product and completed the design of its
second. MSI introduced the Strategy-1 in mid-1987, a RAID
level 0 system capable of a sustained 10 megabytes/second
over VME. August 1988 marked the Strategy-2 introduction, a
RAID level 3 product capable of 20 megabytes/second over
VME. In June 1990, MSI introduced the Gen-3, also RAID
level 3, that sustained a transfer rate of over 80 megabytes/second via HiPPI. Gen-4 became available in August 1992.

is not unrealistic to imagine such a scenario. For this reason,
MSI has agreed to add automatic reallocation. Automatic reallocation will map out bad sectors which cause firm errors the
first time they are encountered. This will lessen the likelihood
of data loss. Failed drives are easily replaced by operations
staff. For a further description of the MSI RAID see
[homan92]. For a discussion of Cray directions in disk technology, see [badger92] or [anderson93].

2.1

Gen-4

2.2

Overview

The MSI Gen-4 product is composed of a main processor,
HiPPI channel interface, Ethernet interface and 1 or 2 facilities.
At NAS, each facility has 20 1.3 gigabyte drives. Two drives
are combined into a module and are striped either 4, 8 or 9
wide. Optimal conditions can produce transfer rates of over 80
megabytes/second. A hot standby module is available in the 8
wide stripe configuration for automatic substitution should any
drive fail within the facility. We chose the 8+1+1 (8 data, 1 parity, 1 spare) configuration for the best transfer rate and reliability.
The Gen-4 supports RAID levels 1, 3, and 5, and the capability
to partition facilities into different RAID levels. We configured
the entire system as RAID level 5.
The MSI RAID achieves fault tolerance in several ways. Data
reads cause an access to 8 modules. A read that fails on any
drive (because of BCC, time-out, etc.) is retried up to five
times. Successful retries result in successful reads. A soft error
is then logged and its sector address optionally saved in the
suspect permanent flaw table. If the data cannot be successfully
read after 5 retries, the data is reconstructed using the XOR of
the remaining 7 drives and the parity drive, called "parity
replacement". In this case, a firm error is logged and the sector
address saved in the suspect permanent flaw table. A read failure occurs when more than one firm error occurs at the same
sector offset (2 or more of 9). This results in a hard error being
logged.
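The "parity replacement" described above can be illustrated with a short sketch (illustrative Python, not MSI firmware): the parity drive stores the XOR of the eight data drives, so the contents of any single failed drive equal the XOR of the seven surviving data drives and the parity drive.

```python
# Illustrative sketch of RAID parity replacement in an 8+1 stripe:
# parity = XOR of the 8 data drives, so one failed drive can be
# rebuilt from the 7 survivors plus parity.
from functools import reduce

def xor_blocks(blocks):
    # Bytewise XOR of equal-length sectors.
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# Eight data "drives", one small sector each (contents invented).
data = [bytes([d] * 4) for d in range(8)]
parity = xor_blocks(data)                 # written at stripe-update time

failed = 3                                # pretend drive 3 returns firm errors
survivors = [blk for i, blk in enumerate(data) if i != failed]
rebuilt = xor_blocks(survivors + [parity])

print(rebuilt == data[failed])            # True: the sector is recovered
```

This is also why reconstruction fails when a second firm error occurs at the same sector offset: with two of the nine blocks unreadable, the XOR no longer determines the missing data.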
A higher failure level is the loss of an entire disk. If, in the process of any operation, a drive fails, an immediate automatic hot
spare replacement and reconstruction is initiated. This operation is transparent but requires approximately half of the bandwidth of the RAID (i.e., throughput drops by 1/2 during
reconstruction). Reconstruction takes approximately 15 minutes. If there are firm errors on any of the remaining drives,
reconstruction will fail for those sectors and data loss occurs.
With the effective system MTBF of a drive at eight months, it

2.3

System Maintenance Console (SMC)

The SMC monitors activity on the system and supports configuration modification and maintenance. It is accessible by a direct
vt100/rs232 connection or Telnet. The SMC, while providing
robust control over the RAID, is non-intuitive and cumbersome to use at times. It is time for MSI to redesign this software.

2.4

Status and Preventative Maintenance

Operations staff must perform preventative maintenance regularly. While some operations are inherently manual, others lend
themselves to automation. MSI needs to automate some of
these functions, such as the ones described below. While not a
big problem for a few systems, a center considering the installation of a large number of systems will find it necessary to do
so.

2.4.1 Daily RAID Preventative Maintenance
UNICOS Kernel Log - Inspect the UNICOS kernel log for
"hdd.c" errors. These indicate problems detected on the CRI
side. Look to the SMC to diagnose the problem.
Response Error Log - Accessible via the SMC, messages in the
response log indicate the nature of the problem with an error
code and a text description.
Real Time Status - On the Main display of the SMC (Real Time
Status display) counters indicate accumulated errors (e.g., soft,
firm, hard).

2.4.2 Weekly RAID Preventative Maintenance
Read Scrub - Reading all sectors on the RAID is necessary to
check for disk flaws. A utility program provided for this purpose can be run during production. This operation takes
approximately 20 minutes per facility and should be done during periods of low activity.
Reallocation - Manual reallocation of suspected permanent
flaws is necessary to prevent data loss. This operation does not
use significant bandwidth.


2.4.3 Monthly RAID Preventative Maintenance
Flaw Management - The sudden occurrence of a large number
of permanent flaws on a drive may indicate a failing drive. To
monitor this accumulation, one must download the information
to a 3 1/2" floppy, and inspect the logs on a system capable of
reading DOS floppies.

3.0 Product Impressions
During the past 18 months, there have been a number of goals
met, problems encountered and obstacles overcome. The
appendix contains a chronological listing of these events. The following were consistent throughout the process of testing the MSI RAID:
1. MSI always responded immediately to problems.
2. MSI diagnosed hardware problems rapidly and replaced
boards immediately.
3. MSI added software functionality as requested.
4. MSI fixed software bugs immediately.
5. CRI would respond to problems expeditiously, in that problems were acknowledged and duplicated.
6. CRI did not experience any hardware problems.
7. CRI software functionality was not added when requested.
8. CRI software bugs were fixed at the leisure of CRI. Fixes for critical bugs (e.g., corrupted data) took as long as 6 weeks.

3.1

Reliability

Overall reliability of the RAID system for the first 10 months
has been poor. This is due exclusively to the support and quality of the CRI driver. RAID reliability has been 100% over the
two months since the last bug fix was installed.
Other than the initial board failure, the MSI hardware has been
stable and reliable. A visual inspection of the boards indicates
that they are well constructed and cleanly engineered. The finish work on the cabinets and other mechanical aspects of the
construction is also well done. Overall I would rate the quality
of the MSI RAID product as excellent.

4.0 Performance Analysis
Several tests were done to test the performance of the MSI
RAID under various configurations. When possible, comparative performance numbers are provided for CRI proprietary
disks. All tests were run under UNICOS 8.0. All data was generated on a dedicated machine, except those shown in figure

10, and those of the random i/o test (figures 16, 17, 18, and 19).
Tests were run to simulate unix functions, applications, and
administration.

4.1

Key to figures

Below is a description of the test configurations used for the
results shown in section 4.2. Tests were conducted to evaluate
the performance of the MSI RAID system against CRI proprietary disks and the effectiveness of combining the two.

4.1.1 Filesystem types
RAID-H - This "hybrid" filesystem was composed of two slices, a primary and a secondary. The primary was a CRI DD60 and the secondary was one facility of an MSI RAID (approximately 24 gigabytes). This configuration is such that inodes and small data blocks are allocated on the primary, while files that grow over 65k are allocated on the secondary. This feature of the CRI software is extremely useful in enhancing the performance of the MSI RAID. As testing will show, small block transfers on RAID are slow compared to the proprietary CRI DD60 disks. Peak performance of the DD60 is approximately 20 megabytes/second, while that of the MSI RAID is approximately 80 megabytes/second. The following mkfs command was used to create the filesystem:
mkfs -A 4 -B 65536 -P 4 -S 64 -p 1 -q /dev/dsk/raid
RAID-P - This filesystem was composed of one primary slice, a single facility of the MSI RAID. All inodes and data blocks are stored directly on the MSI RAID. Peak performance of the MSI RAID is approximately 80 megabytes/second. The following mkfs command was used to create the filesystem:
mkfs -A 64 -B 65536 -P 16 -q /dev/dsk/raid
DD60-SP - This filesystem consisted of 2 primary slices. Each slice was composed of 4 DD60 drives that were software striped via UNICOS. This was used as the gold standard against which to measure the others. Peak performance is approximately 80 megabytes/second.
DD42-P - This filesystem consisted of one primary slice, a portion of a DD42 disk subsystem (approximately 8 gigabytes). The DD42 is based on a previous generation technology, the DD40. Big file throughput should be no better than 9 megabytes/second.

4.1.2 ldcaching
Several tests were run to show the benefit of ldcache on filesystem operations. Three levels of cache were used. The first level, no, simply means that no cache was used in the test. The second level of cache, sm, consisted of 20 ldcache units, where the size of the cache unit was 128 for RAID-H and RAID-P, 92 for DD60-SP, and 48 for DD42-P. This resulted in a total amount of ldcache of 10.0 megabytes for the RAID-H and RAID-P filesystems, 7.19 megabytes for the DD60-SP filesystem, and 3.75 megabytes for the DD42-P filesystem. The objective of the sm cache was to provide a minimal amount of buffer memory so that commands could be more effectively queued and/or requests coalesced, if the OS was so inclined. The third level of cache, lg, consisted of 1438 cache units for the RAID-H and RAID-P filesystems, 2000 cache units for the DD60-SP filesystem, and 3833 cache units for the DD42-P filesystem. The cache unit size was the same as that of the sm cache configuration. This resulted in approximately 719 megabytes of ldcache for each of the 4 filesystem types. The objective of lg cache was to check for anomalous behavior. Figures 3 through 8 and 16 through 19 are annotated along the x-axis at the base of the bar graph to indicate the ldcache level associated with the results.

RAID Integration on Model-E IOS
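The sm cache totals quoted above follow from unit count times unit size, if one assumes the cache unit size is expressed in 4096-byte blocks (an inference from the stated totals, not something the text spells out). A quick shell check:

```shell
# sm-level ldcache totals: 20 units per filesystem, unit sizes as in the text.
# Assumption (hypothetical): a cache unit "size" counts 4096-byte blocks.
for cfg in "RAID:128" "DD60-SP:92" "DD42-P:48"; do
  fs=${cfg%:*}
  size=${cfg#*:}
  echo "$fs: $(( 20 * size * 4096 )) bytes"
done
```

This prints 10,485,760 bytes (10.0 MB), 7,536,640 bytes (7.19 MB), and 3,932,160 bytes (3.75 MB), matching the three totals in the text.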

4.1.3 Test Filesystems
In the tests for fsck, find, and ls -lR, two different data sets were duplicated over the 4 different filesystems described in section 4.1.1. The smfs filesystem consisted of 11,647 files and 639 directories, totaling 1 gigabyte of data. The bigfs filesystem consisted of 33,051 files and 2,077 directories, totaling 3 gigabytes of data. Figure 1 shows the distribution of files and their sizes. The smfs count and bigfs count graph lines represent the distribution by size as a function of the running total percentage of the total number of files. The smfs size and bigfs size lines represent the distribution by size as a function of the running total percentage of all outstanding bytes. The data show that a large number of small files occupy a small portion of the space used. This mimics the home directory structure; in fact, the bigfs was a copy of our NAS C90 lulva filesystem. Figure 9 is annotated along the x-axis at the base of each bar graph to indicate the file data set associated with the results.

[Figure 2: /etc/fsck elapsed times (seconds) for RAID-H, RAID-P, and DD60-SP]

4.2 Tests

4.2.1 /etc/fsck
Several tests were run to compare the difference in time required to perform filesystem checks. /etc/fsck(1) was run on the RAID-H, RAID-P, and DD60-SP filesystems using the smfs data set (figure 2). The ranking and magnitude of the results are as expected. The 6x performance differential between RAID-H/DD60-SP and RAID-P is attributed to the 3x average latency of the RAID (25 ms) and the 4x sector size (64k). Of interest is how effectively the RAID-H filesystem is able to take advantage of the DD60 primary partition. /etc/fsck times are further analyzed in the section below that compares bigfs vs. smfs performance.

[Figure 1: File size and count distributions for the smfs and bigfs data sets]

[Figure 3: find $FS -print elapsed times at each ldcache level for RAID-H, RAID-P, DD60-SP, and DD42-P]

[Figure 4: /bin/ls -lR elapsed times at each ldcache level for RAID-H and RAID-P]

4.2.2 /bin/find
A /bin/find . -print was executed on the smfs data set for each of the 4 filesystems at 3 ldcache levels. Shown in figure 3, we take the performance of the DD60-SP filesystem at face value. It is interesting to note the factor of 3 improvement in wall clock time achieved by sm cache over no cache in both the DD60-SP and RAID-H filesystems. An additional option to ldcache to cache filesystem metadata only is justified by these results. The RAID-P filesystem is showing one of its weaknesses here: the greater latency that cannot be amortized with small block reads. A 3x margin at the no cache level stretches to 6x for the sm cache. A large amount of cache is effective only for the RAID-P and DD42-P filesystems. Again, note the effectiveness of RAID-H.

4.2.3 /bin/ls -lR
A /bin/ls -lR was executed on the smfs data set for the RAID-H and RAID-P filesystems with 3 different cache levels (figure 4). The results are quite similar to the /bin/find results above, except that the sm ldcache has much better performance for the RAID-P filesystem. Confusing is the observation that the /bin/ls -lR test requires more processing than the /bin/find . -print test, and its execution time on RAID-H bears this out. However, the RAID-P test completes in less time in all cases!

4.2.4 /bin/dd
This test was run twice, once to write a 1 gigabyte file to the filesystem under test, and once to read a 1 gigabyte file from the filesystem under test. The source for the write test and the destination for the read test was an SSD resident filesystem (RAM disk). In figures 5 through 8, the transfer rate in megabytes/second is shown along the top of each bar graph. The block transfer size for each of the tests was 16 megabytes.

[Figure 5: /bin/dd bs=16meg write test, by filesystem and ldcache level]

Figure 5 shows the performance achieved while writing to the filesystem under test. The RAID-P, DD60-SP and DD42-P configurations performed at expected levels; however, the RAID-H filesystem showed dramatic performance degradation when ldcache is used. This clearly indicates a problem which needs to be addressed by CRI.

[Figure 6: /bin/dd bs=16meg read test, by filesystem and ldcache level]

Figure 6 shows the performance achieved while reading from the filesystem under test. Each configuration performed satisfactorily close to the peak sustainable rate for the underlying hardware, although sm cache degrades RAID-H and RAID-P performance by 10%.
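The shape of the dd tests can be rehearsed at toy scale on any POSIX system. This sketch uses stand-in sizes and temporary files rather than the actual 1-gigabyte SSD-to-RAID transfer with bs=16meg:

```shell
# Toy-scale rehearsal of the dd write test: copy a source file to the
# filesystem under test with one large fixed block size, then verify.
# Stand-ins: 1 MB file and bs=64k here, vs. 1 GB and bs=16meg in the paper.
src=$(mktemp)
dst=$(mktemp)
dd if=/dev/zero of="$src" bs=65536 count=16 2>/dev/null   # build 1 MB source
dd if="$src" of="$dst" bs=65536 2>/dev/null               # the "write test"
cmp -s "$src" "$dst" && echo "copy ok"
rm -f "$src" "$dst"
```

In the real runs the destination (write test) or source (read test) was the filesystem under test, and elapsed time divided by file size gave the rates shown atop each bar.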

4.2.5 /bin/cp
This test was run twice, once to write a 1 gigabyte file to the filesystem under test, and once to read a 1 gigabyte file from the filesystem under test. The source for the write test and the destination for the read test was an SSD resident filesystem.

[Figure 7: /bin/cp write test, by filesystem and ldcache level]

Figure 7 shows the performance achieved while writing to the filesystem under test. A combination of factors, a 32k i/o library buffer size, the inability of the kernel to coalesce or otherwise optimize embarrassingly sequential requests, and the 25ms average latency of the MSI RAID system, all lead to extremely poor performance for both the RAID-H and RAID-P tests with no cache. Accounting for the latency, the performance of the no cache DD60-SP exceeds expected performance. This leads to the conclusion that UNICOS is performing additional optimization beyond that which is done for the RAID. These shortcomings can largely be overcome with sm cache on the RAID-P filesystem. In particular, this shows the benefit of write-behind optimization, boosting the RAID-P performance from 5 megabytes/second to over 69 megabytes/second. Consistent with the dd write test is the severe performance problem with sm and lg cache, with RAID-H performance at 10% of its potential. Again the CRI proprietary filesystems, DD60-SP and DD42-P, perform well.

[Figure 8: /bin/cp read test, by filesystem and ldcache level]

Figure 8 shows the performance achieved while reading from the filesystem under test. The data clearly show the optimization effort which CRI has put into their proprietary disk systems. At the other end of the spectrum, the worst performing filesystem was RAID-P. Although performance is somewhat enhanced with a small amount of ldcache, it is still far below what should be obtainable from the device. Doing far better is RAID-H, which performed at about 1/2 of its potential, but could be substantially improved with read-ahead optimization. The fact that the RAID-H filesystem outperforms the RAID-P filesystem is very interesting, given that the file primarily resides on the RAID disk in both tests. Is the OS possibly doing read-ahead here, or is it simply inode/extent caching?

4.2.6 Bigfs vs. smfs
This test attempts to gauge the relative increase in time required to perform certain operations, based upon the size and number of files and directories in a filesystem.

[Figure 9: RAID-P bigfs vs. smfs elapsed times for /bin/find, /bin/ls -lR, and /etc/fsck, with and without cache]

All tests were run on RAID-P, with one set under the smfs file/directory collection and the other under the bigfs file/directory collection.

Figure 9 shows run times for /bin/find . -print, /bin/ls -lR ., and /etc/fsck -u. The results are consistent with those shown in figures 2, 3 and 4. The application of a small amount of ldcache gives some benefit, which is also consistent with other results. The caveat is that the performance of RAID-P is still between 1/6 and 1/3 of the performance of DD60-SP.

The increase in elapsed time for the /bin/find test when going from the smfs to the bigfs tests with no cache is somewhat less than expected, since the increase in complexity between the two data sets is about 3. The /etc/fsck and /bin/ls -lR tests complete in about the expected time.

Because of the comparable performance on these and other tasks between DD60-SP and RAID-H, it would be most advantageous to utilize RAID-H in production.

4.2.7 Well formed vs. ill formed I/O test

The next series of figures (10-15) contrasts the significant performance differential for applications that make i/o requests on boundaries that map well to the allocation unit of the device and those that do not. Each figure consists of 4 graph lines: a read and a write request stream that is well formed, and a read and a write request stream that is ill formed. Well formed requests in this case are successive synchronous sequential i/o accesses with a block size of 2^n, where n ranges from 15 (32k) to 24 (16 megabytes). Ill formed requests are also successive synchronous sequential i/o accesses, with a block size of 2^n + 1024, where n ranges from 15 (33k) to 24 (16 megabytes + 1024 bytes). The blocks are passed directly to the system using the read(2) and write(2) system calls. The test case results for block size 2^n are shown by the horizontal line running from 2^n to 2^(n+1). Block size reference points are shown along the read well graph line.

[Figure 10: RAID-H well/ill formed I/O performance (no ldcache)]
[Figure 11: RAID-H well/ill formed I/O performance (sm ldcache)]
[Figure 12: RAID-P well/ill formed I/O performance (no ldcache)]
[Figure 13: RAID-P well/ill formed I/O performance (sm ldcache)]
[Figure 14: DD60-SP well/ill formed I/O performance (no ldcache)]
[Figure 15: DD60-SP well/ill formed I/O performance (sm ldcache)]

Figure 10 shows the RAID-H filesystem with no ldcache. Without the benefit of read ahead or write behind, these graphs depict worst case performance for the range tested. In fact, due to CRI's implementation, the performance results should be identical for an i/o test that was random instead of sequential (see section 4.2.9 on random i/o test results)! Also shown is the dramatic degradation (a factor of 2 to 3) that occurs with the ill formed requests.

Figure 11 shows the RAID-H filesystem with sm ldcache. The results are consistent with those in figures 5 and 7 in that there is an apparent problem in using ldcache effectively with RAID-H. In all instances, performance is lower than expected and worse than that obtained from RAID-P (figure 13).

Figure 12 shows RAID-P with no ldcache. The results are consistent with those of RAID-H (figure 10), again representing throughputs indicative of a random i/o test. It is confounding that CRI would insist that read ahead and write behind would not be advantageous (see the June 17, 1993 entry in the chronology appendix).

Figure 13 shows RAID-P with sm ldcache. The sm ldcache configuration provides an insight into what write behind optimization can actually do. For 64k byte transfers, RAID-P shows a remarkable 72 megabytes/second. Write behind allows for stacking commands in the MSI RAID controller and allows the application to continue on asynchronously to the i/o processing. Similar performance gains are possible from read-ahead optimization, which could be implemented utilizing ldcache. In fact, contrasting figures 12 and 13 shows this. For all transfers (read/write, well/ill) of less than a megabyte, utilizing ldcache boosts performance by as much as a factor of 10 for reads and a factor of 40 for writes. To better illustrate the problem for reads: the amount of time required to return 32k to an application is approximately 0.022 seconds, while it takes 0.039 seconds to return 1 megabyte. For a 77% increase in time, a 32 fold increase in data is achieved.

Consistent in figures 10 through 13 is the unexplained degradation when going from 8 to 16 megabyte transfers.

Figures 14 and 15 are included for completeness and show that CRI disks are also subject to degradation with ill formed requests, but to a much lesser extent.

4.2.8 Workload Test

Table 1 shows the results of a simulated workload run on the RAID-P and RAID-H configurations. Both were configured with 170 units of ldcache at a size of 128. The test consisted of the simultaneous execution of 12 data streams, all on the tested filesystem. Block transfer sizes ranged from 32k to 8 megabytes. The ratio of well-formed to ill-formed i/o requests was 6 to 1. The tests were run over a period of approximately 1 hour. With the RAID-P filesystem, good performance is maintained even with the mixing of small and large requests. Again, the problem of ldcaching the RAID-H filesystem is apparent.

TABLE 1.

          Rate      User    System  Megabytes    Duration
          mby/sec   Time    Time    Transferred  (Hours)
RAID-H    15.51     129.65  100.89   66,998      1.20
RAID-P    51.78     410.66  284.98  210,634      1.13

4.2.9 Random I/O Test

This test was added in an effort to highlight how much better CRI's DD60 disks were at handling random small block transfers than the MSI RAID. Given that there is a 2x to 3x difference in average latency, one would expect quite a performance differential. As it turns out, this may be the coup de grace for SLEDs.

Dedicated machine time was unavailable to run these tests. To report best case results for each configuration, each test was run 10 times and the run yielding the least elapsed time was selected as representative.

This test takes a randomly generated set of numbers which are offsets into a 64 megabyte test file. The range of the numbers potentially spans the entire file. The program reads in a number, calls lseek(2) to position itself in the file at the specified offset, and then does either a read or a write (depending on the test) of block-size bytes at that offset. This is done 1024 times for each execution of the test. The test was run for block sizes of 4096, 16384, and 65536, and was executed against RAID-H, RAID-P, and a filesystem created on a single DD60.

Two different sets of input lseek numbers were used. One set was completely random in that the offset into the file could be at any byte address; these are the "Off Boundary" results shown in figures 16 and 17. The other set was also random, but the offset was restricted to be an integer multiple of the block-size used in the test; these are the "On Boundary" results shown in figures 18 and 19. Block-size is annotated at the base of each bar graph.
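The test program itself is not reproduced in the paper. A rough shell rendering of the on-boundary read variant, with dd standing in for the lseek(2)/read(2) loop, a toy file size, and deterministic rather than random offsets, would look like:

```shell
# Rough stand-in for the on-boundary random read test. The real test was
# a C program doing 1024 lseek(2)/read(2) pairs on a 64 MB file; this
# sketch does 64 block-aligned single-block reads on a 1 MB file, with
# deterministic pseudo-random offsets so it is repeatable anywhere.
f=$(mktemp)
dd if=/dev/zero of="$f" bs=4096 count=256 2>/dev/null    # 1 MB test file
i=0
while [ "$i" -lt 64 ]; do
  off=$(( (i * 97 + 13) % 256 ))                         # block-aligned offset
  dd if="$f" of=/dev/null bs=4096 skip="$off" count=1 2>/dev/null || exit 1
  i=$(( i + 1 ))
done
echo "64 aligned reads completed"
rm -f "$f"
```

The off-boundary variant differs only in that the seek offset may be any byte address, which is what forces the extra sector accesses discussed below.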
[Figure 16: Off boundary random read I/O test, elapsed times by block size, filesystem, and ldcache level]

[Figure 17: On boundary random read I/O test, elapsed times by block size, filesystem, and ldcache level]

Figure 16 shows the read results for the off boundary test. With the sector size of the MSI RAID at 64k, the 4k and 16k tests are likely to require access to only one sector. The 64k test will likely require access to 2 sectors, which justifies the additional time required for the operation. The sm cache is quite useful in this test in negating this effect, while only slightly degrading the 4k and 16k results. Most unexpected are the results of the DD60 test. Averaging the 4k, 16k and 64k results and comparing the best RAID configuration against the best DD60 configuration, the DD60 performs only 28% faster than the RAID! For some reason, the CRI disks are taking, at best, 17ms to return a 4k block. This has not been a substantial problem in production, as no one has noticed either here or at CRI. Since most file accesses are sequential [ousterhout85], sequential performance tends to dominate; thus latency from random i/o may not be a problem, which would imply that optimizing sequential access is most important. The primary advantage that CRI disks have over the RAID is in filesystem metadata access (e.g., inodes), which is not a factor with RAID-H.

Figure 17 shows the read results for the on boundary test. Cache has a typically negative (though minor) effect on the results. This is expected, in that the on boundary tests only require access to one sector (except the DD60 64k test, which requires 4). Here, the DD60 is only 16% faster than RAID-P.

[Figure 18: Off boundary random write I/O test, elapsed times by block size, filesystem, and ldcache level]

[Figure 19: On boundary random write I/O test, elapsed times by block size, filesystem, and ldcache level]

Figure 18 shows the write results for the off boundary test. Notice that the y-axis scale has been increased to 90 seconds for these tests. The results are as expected, with the 64k RAID tests requiring up to two read/modify/write operations for each user level write. The best DD60 configuration is only 13% faster than RAID-P in this test.

Figure 19 shows the write results for the on boundary test. For both RAID based filesystems, sm cache significantly improves performance at the 64k level. The 4k and 16k tests are as expected, because of the read/modify/write that occurs with transfers of less than 64k on the RAID based systems. Note also that the RAID based filesystems are 3x faster than the DD60 for the 64k test! Overall, the DD60 is only 32% faster in this test.

Given this information, one can say with some certainty that the DA60 RAID product from CRI, which is a RAID level 3 system with a 64k sector size, should perform worse than the DD60 used for this test, especially on the 16k test from figure 19, the only test in which the DD60 was significantly better.

The overall random i/o performance advantage that the best DD60 configuration (no cache) has over the best RAID configuration (RAID-H sm cache) is 24% for writes and 29% for reads.
4.2.10 Suggested Additions

On the MSI Side:
• Reallocation - As previously discussed, data loss can occur when a drive fails and there are bad sectors on other drives. Having the capability to automatically reallocate bad sectors on the fly would all but eliminate this potential for data loss.
• SMC - The addition of a mode that the SMC can be placed into that prevents any unintended modification of operational parameters. As it stands now, anyone who has physical access to the SMC effectively has unlimited power to change anything at will, or by mistake.
• Remote Status - A command that could be run from a remote workstation that would tell an operator whether or not someone actually needed to take further action. This could then be put into a cron script to status the system several times per day and fire off email if a problem is indicated. This could be extended to cover such maintenance activities as read scrub and flaw management.
• Rewrite SMC software - This software needs to be rethought. Many common operations are not intuitive and/or are awkward. For example, it takes approximately 200 keystrokes to examine suspect permanent flaws.
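The remote-status request could be driven from cron along the following lines. Note that raidstat is hypothetical, since no such command exists today (which is the point of the request), so it is stubbed here to exercise the control flow:

```shell
# Sketch of the requested cron-driven remote status check.
# "raidstat" is a hypothetical remote-status command (the one being
# requested of MSI); stubbed here for illustration.
raidstat() { echo "OK: all facilities operational"; }

status=$(raidstat)
case "$status" in
  OK:*) echo "no action needed" ;;
  *)    echo "ALERT: $status"    # a real script would mail the operator here
        ;;
esac
```

Run several times per day from crontab, a check like this would tell operators whether a trip to the SMC is actually needed.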
On the CRI Side:
• Buffer Alignment/grouping - As shown in the performance
section, some operations performed utilizing ldcache cause
unexplained and significantly degraded performance, most
notably with the RAID-H configuration.
• Read Ahead into ldcache - When read-ahead is indicated,
perform the read-ahead into ldcache.
• Default Buffer Sizes - Evaluate the potential for enhancing
performance by modifying the default buffer size for library
routines and elsewhere as indicated. Matching this to
better suit the MSI RAID should help to improve performance while requiring only a modest increase in the
amount of main memory.
• Metadata Cache - Add an option to ldcache so that filesystem metadata and data can be cached separately. A metadata write-through cache should improve performance
without degrading filesystem integrity when recovering
from crashes.

RAID Integration on Model-E IOS

132

• Metadata Access - The results of the random i/o test indicate that the DD60 outperforms the RAID by 25% to 30%.
Why then are the /bin/find times 3x and the /etc/fsck times
6x those of a DD60?

• Nov 5 1992 - Product Performance Demo
As a requirement for the procurement of the RAID, the
potential vendors were required to demonstrate performance. The testing was performed at the Maximum Strategy facility in Milpitas, CA. The test environment consisted
of a Gen-3 RAID serving as the testing client and the IFB
bid hardware consisting of a Gen-4 controller and 20
Seagate IPI-2 drives with a storage capacity of 27
gigabytes. Results are shown in Table 2.

5.0 Summary
It has taken much longer than expected to get the MSI RAID
up to production quality standards, and there are still some performance problems that need to be addressed. Overall, though, the performance and reliability of the MSI RAID system is good and it is
certainly the least expensive high performance alternative.
The RAID-H filesystem is advantageous for several reasons.
Combining this configuration with some amount of ldcache is
the desired configuration for NAS, provided that the ldcache
buffer problems can be resolved.

• Mar 25 1993 - First Installed
HSP-3 (C9016-1024/256) installed. Cabled up the MSI RAID.
Able to access the RAID with the alpha release of the CRI
IPI-3/HiPPI driver. Having problems talking to more than
one facility.

• Mar 31 1993 - Software Problem
Due to a minor device number conflict, we are unable to
access both facilities of the MSI RAID. Ldcache gives an
ENXIO (errno 6) error when trying to ldcache a RAID filesystem.

The results from the random i/o test indicate that CRI proprietary disks work better primarily because UNICOS optimizes
their access. The same optimization techniques applied to the
RAID should bring all performance to within 10% to 20% of
the DD60-SP configuration. The results also show that DD60s do not
significantly outperform the RAID. It then follows that questions concerning RAID latency cannot serve as an excuse to
prevent optimization efforts any longer. Direct hardware support (e.g., eliminating IPI-3) would all but eliminate the small
DD60 performance advantage.
RAID technology is already an attractive option. With fast
SCSI-2 drives approaching $0.50 per megabyte, there is still a
10-fold markup to build a fast RAID controller that integrates
commodity disks and supercomputers, leading to the expectation of even lower prices.

TABLE 2.

           Requirement     MSI Gen-4
read       720 mbit/sec    740 mbit/sec (92.4 mby/sec)
write      680 mbit/sec    701 mbit/sec (87.6 mby/sec)
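As a quick sanity check on the Table 2 figures, megabit-per-second rates convert to megabytes per second by dividing by eight; a small sketch (the rates are the table's, the helper name is ours):

```python
def mbit_to_mby(mbit_per_sec):
    """Convert megabits per second to megabytes per second (8 bits/byte)."""
    return mbit_per_sec / 8.0

# Table 2 measured rates: the /8 conversion lands within about
# 0.1 mby/sec of the quoted 92.4 and 87.6 figures.
for label, rate in [("read", 740), ("write", 701)]:
    print(f"{label}: {rate} mbit/sec = {mbit_to_mby(rate):.1f} mby/sec")
```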

• April 9, 1993 - Hardware Problem
Data corruption is detected on reads; further diagnosis
shows a parity error over HiPPI. Replaced the HiPPI cable and the
MSI HiPPI controller board.

• April 12, 1993 - Hardware Problem
More data corruption errors on read. MSI replaces the HiPPI
controller board again. Diagnosis at MSI shows a hardware
failure on the first replacement board.

• April 23, 1993 - Software Problem
Ldcaching on the MSI RAID now crashes the C90.

• April 28, 1993 - Software Problem
Utilizing the primary and secondary allocation on the RAID
causes the root filesystem to hang (i.e., ls -l / never returns)
and eventually requires a reboot to fix.

• May 19, 1993 - Software Problem
Inappropriate handling of soft errors on the CRI side. When
a conditional success is sent to the Cray, it is interpreted as
a failed request. The Cray then retries this 5 times and quits,

6.0 Acknowledgments
I would like to thank Bob Cave and the others involved in the
IDA Princeton project that proved this type of technology was
usable in a supercomputer environment. Ken Katai and Neal
Murry of MSI also deserve recognition for providing excellent
support of their Gen-4 system.




propagating the problem to the application. It is suggested
that the Cray attempt some type of error recovery.
• Jun 9, 1993 - Software Problem
Ldcache still not working. Primary/secondary allocations
also still not working. No error recovery is done. No read
ahead/write behind. I/o requests must be VERY well
formed to extract performance from the RAID disk.
• Jun 15, 1993 - Software Fix
Kernel mod installed to fix ldcache problem.
• Jun 17, 1993 - CRI Response to Issues
Primary/secondary allocations are not supported in the current HiPPI/IPI-3 implementation under UNICOS 7.C.2.
This capability is available in UNICOS 7.C.3, which is currently scheduled for release August 6th.
CRI declines to do any error processing (other than retries)
on the C90 side. They do make a reasonable argument.
The issue of read ahead/write behind for an IPI-3 driver
came up during the HSP-3 contract negotiations. Cray
Research replied in a letter dated November 4, 1992:
"Cray Research has investigated implementing read-ahead
and write-behind in either the mainframe itself or in the
IOS and believes that such an implementation would be
ineffective in enhancing performance of a HIPPI-based
IPI-3 driver. This is because both the mainframe and the
IOS are too far away from the disk device itself to provide
meaningful improvement in transfer rates. The appropriate
place to put read-ahead and write-behind, in our view, is in
the controller of the RAID device itself. This has not yet
been done in Maximum Strategy products."
• July 12, 1993 - Upgraded UNICOS
Primary/Secondary mods from 7.C.3 are added into 7.C.2
in an attempt to create the hybrid filesystem.
• July 19, 1993 - Fell back to plain 7.C.2
System time has increased greatly. Experiencing lost
mount points with mixed device filesystems. Returning to the
unmodified 7.C.2 system. Since we will be beta testing
UNICOS 8.0 in August, no upgrade to the official 7.C.3
release is planned.
• Sep 17, 1993 - Software Problem
Duplicated, after several tries, the auto-reconstruct failure that occurred 2
weeks ago. Occasionally when powering
off a drive, multiple retry failures cause an EIO (errno 5, i/o error) to be propagated to applications. Two problems are
apparent here:
1. The CRI driver is not appropriately handling conditional success errors.

2. When a drive fails AND there are active flaws on other
drives, correct data cannot be reconstructed and the read fails.
A request is made to MSI to add the capability to automatically reallocate suspected permanent flaws that occur during operation.
• Sep 28, 1993 - Software Fix
New driver available to fix the inappropriate handling of
conditional success status.
• Sep 30, 1993 - Software Problem
Installed new CRI software and new MSI software to fix
all currently known problems. On the positive side, performance increased by almost 30% with the new software (80
mby/sec reads and 73 mby/sec writes). Testing has, however,
uncovered a serious problem that causes corrupted data to
be propagated to applications when a drive is powered off.
• Oct 1, 1993 - Response
MSI and CRI are investigating the problem.
• Oct 1, 1993 - Software Problem
Duplicated the data corruption problems without powering off
drives.
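The two-failure data loss described in item 2 above falls directly out of single-parity reconstruction: XOR parity (as in RAID level 3) can rebuild exactly one missing block per stripe. A minimal sketch of the principle, not MSI's firmware:

```python
from functools import reduce

def parity(blocks):
    """XOR parity across a stripe's data blocks (RAID level 3 style)."""
    return reduce(lambda a, b: a ^ b, blocks)

def reconstruct(blocks, parity_block):
    """Rebuild the one missing block (None) from survivors plus parity.
    Raises if more than one block in the stripe is unreadable."""
    missing = [i for i, b in enumerate(blocks) if b is None]
    if len(missing) != 1:
        raise IOError("multiple unreadable blocks in stripe: data lost")
    survivors = [b for b in blocks if b is not None]
    return parity(survivors + [parity_block])

data = [0b1010, 0b0110, 0b1111]
p = parity(data)
# one failed drive: recoverable
assert reconstruct([None, data[1], data[2]], p) == data[0]
# failed drive plus a flawed sector on a second drive: unrecoverable
try:
    reconstruct([None, None, data[2]], p)
except IOError:
    pass  # this is the EIO the applications saw
```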

• Oct 7, 1993 - UNICOS 8.0
Began beta UNICOS 8.0 testing; primary/secondary allocations are still not operating correctly.
• Nov 17, 1993 - Software Fix
Installed a new IOS and a new HiPPI driver on the CRI side
and a new driver on the MSI side Nov 10. Testing over the
last week has not turned up any problems. Performance has
dropped somewhat (about 10%) for both reads and writes.
• Dec 3, 1993 - Production UNICOS 8.0
Primary/Secondary allocations functioning correctly. Performance is mixed yet consistent with 7.C.2/3.
• Jan 2, 1994 - Hardware Problem
Machine powered off for several days during facility maintenance; when power returned, the RAID would not boot. MSI was
able to give instructions over the phone to return the system
back on-line. The problem was traced to a battery failure that has been
fixed in subsequent systems. MSI provides an upgraded system processor board.
• Jan 30, 1994 - Limited Production
Extensive testing has turned up no further problems with
the MSI RAID. The system will now be put into limited
production.
• Mar 10, 1994 - Everything OK
No errors reported. No outstanding problems.



8.0 References
[anderson93] Anderson, "Mass Storage Industry Directions,"
Proceedings of the Cray User Group, Montreux, Switzerland, Spring 1993, pp. 307-310.
[badger92] Badger, "The Future of Disk Array Products at
Cray Research," Proceedings of the Cray User Group,
Washington D.C., Fall 1992, pp. 177-190.
[cave92] Cave, Kelley, Prisner, "RAID Disk for a Cray-2 System," Proceedings of the Cray User Group, Berlin, Germany, Spring 1992, pp. 126-129.
[cooper93] Cooper, et al., "Numerical Aerodynamic Simulation
Program Plan," NASA Communication, September 93.
[homan92] Homan, "Maximum Strategy's Gen-4 Storage
Server," Proceedings of the Cray User Group, Washington
D.C., Fall 1992, pp. 191-194.
[ousterhout85] Ousterhout, Da Costa, Harrison, Kunze,
Kupfer, Thompson, "A Trace Driven Analysis of the UNIX
4.2 BSD File System," ACM Operating Systems Review,
Vol 19, No. 5 (1985), pp. 15-24.
[patterson87] Patterson, Gibson, Katz, "A Case for
Redundant Arrays of Inexpensive Disks (RAID)," University of California Technical Report UCB/CSD 87/391, Berkeley CA, December 1987, preprint of [patterson88].
[patterson88] Patterson, Gibson, Katz, "A Case for
Redundant Arrays of Inexpensive Disks (RAID)," Proceedings of the 1988 ACM SIGMOD Conference on Management of Data, Chicago IL, June 1988, pp. 109-116.


Automatic DMF File Expiration
Andy Haxby (andy@tnllnpet.demon.co.uk)
SCIS - Shell Common Information Services
(Formerly Shell UK Information and Computing Services)
Wythenshawe, Manchester M22 5SB England

Abstract

SCIS has a YMP/264 (until recently an XMP-EA/264) with a
4000 slot STK 4400 silo that is mainly used to hold DMF and
backup tapes. Since running DMF, users have tended to regard
the file systems as an infinite resource and have been very lax
about deleting unwanted files. Eighteen months ago it was
apparent that our DMF pools would grow beyond the capacity
of our silo, so it was necessary to implement a file expiration
date for DMF that could be set by users on their files. The
system has significantly reduced our DMF pools. This paper
describes the external and internal workings of the file
expiration system.

Background
SCIS is a Royal Dutch Shell Group company that provides a
range of computing services to other members of the Shell
group in the UK and the Netherlands. The SCIS Cray is used
for petroleum engineering applications, mainly reservoir
modelling. A typical model will consist of a 10MB executable
and a number of data files each up to 100MB. There are
approximately 400 users in 30 different groups. The majority of
users are located at various sites around the UK, but a
diminishing number are in locations as far apart as Canada,
New Zealand and the Middle East. Many users access the Cray
infrequently and only by batch, and the large geographic
spread of users makes it very difficult to manually ask people
using large amounts of storage to delete unwanted files.
For the purpose of disk organisation users are split into two
5GB file systems, /u2 and /u4, depending on whether they are
UK or non-UK based. DMF is run on these file systems and a
few others such as the support staff's file system and a file
system used for archiving old log files and accounting data. In
July 1992 the DMF tape pools had grown to 900 cartridges
each. The rate of growth was linear, consistent with users not
deleting old files, and indicating that we would run out of
space in the silo early in 1993. It was clearly necessary to
implement a system whereby files would be deleted after a
predetermined length of time (a 'retention period') unless a user
had explicitly requested that a file should be kept for longer.


Requirements For The Automatic Expiration
System
It was considered important that the system must:

• Be easy to use.
• Require little or no maintenance or administration once
installed.
• Require the minimum amount of local code.
• Integrate cleanly with Unicos, i.e. have a Unix-flavour
interface and be completely application independent.
• Not compromise any existing security mechanism, i.e.
users can only set retention periods on files they have write
access to.
• Work with subsequent versions of Unicos. At the time we
were running Unicos 6.1.6 and 7.0 was still in Beta.

Design Considerations
Two routes to achieving the requirements were considered:
• A 'database' type system whereby users execute locally
written commands to read and write lists of files into a
database. The database could then be regularly
interrogated by a cron job that would delete any files that
had passed their 'best before' date. This system would
require no modifications to Unicos source to write, and if it
went wrong it would not interfere with Unicos itself.
However, a large amount of local code would be needed to
check file permissions and ownerships, cope with files
restored from backup, delete the database entry if the file is
deleted by the user rather than the system, etc.

must first be opened with open(2). utime(2) takes a path name as
an argument and does not require that the file is open, which is
more sensible since only the inode is being updated, not the
data block.
The following commands were modified so that users could set
and read retention times. The 'd' flag was chosen to be a
mnemonic for 'detention' since the ls command already has 'c'
and 'r' flags.
• touch [-a] [-c] [-m] [-d mmddhhmm[yy]] [-D dd]
where -d is the detention (retention) time as a time stamp
and -D is the detention time in 'days from now'. The
default for touch is still -am. Using the -d or -D options on
their own does not cause the modification time to be
altered. Unicos sets the sitebits member of the inode
structure to be 0 on every file by default, and hence the
retention time stamp for all files will be 0 by default. It is
not possible to set a retention time on a directory as the
touch(1) command must open(2) the file first and it is not
possible to open(fd, O_WRONLY) a directory.

• Modify Unicos commands to provide the tools to enable a
simple file expiration facility to be written. The inode
contains three time stamps, cdi_atmsec, cdi_ctmsec and
cdi_mtmsec (atime, ctime and mtime), which correspond to
date last accessed, date inode last modified and date file
last modified respectively. Under Unicos 7.0 and above
there is a site modifiable member of the inode structure,
cdi_sitebits, which can be written to with fcntl(2) and read
with stat(2) system calls. The sitebits inode member can be
used to store an additional time stamp corresponding to
how long the file should be kept for. Commands such as
touch(1), ls(1) and find(1) can be modified to read and write
time stamps into the inode sitebits member.
This method has the advantage of being an elegant and
totally seamless enhancement to Unicos requiring little
local code, and utilising all of the security checking
mechanisms already coded into the commands. The
disadvantages are that a source licence would always be
required and the mods would have to be carried forwards
to each Unicos release. Additionally, under Unicos 6.1,
whilst the sitebits member was present in the inode, the
fcntl(2) and stat(2) system calls were missing the code to
read and write to it, so kernel mods were necessary to
provide this functionality until Unicos 7.0 was available.

It is interesting to note an 'undocumented feature' of
touch(1): because utime(2) is used to change the actime and
modtime time stamps on a file, it is not possible to touch a
file you don't own even if you have write permission. This
is not the case for the retention time, since the open(2) and
fcntl(2) system calls are used to update the retention time
stamp.
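The practical difference between the path-based and descriptor-based approaches discussed above can be sketched in modern terms; this is an illustration, not the UNICOS code, and the sitebits fcntl command itself is site-specific and stubbed out:

```python
import os

def set_times_by_path(path, atime, mtime):
    """utime(2)-style: takes a path, needs no open(), so only the
    inode is touched. Under DMF a migrated file would not need to
    be recalled just to update its metadata."""
    os.utime(path, (atime, mtime))

def set_retention_by_fd(path):
    """fcntl(2)-style: the file must be opened first. Under DMF the
    open(2) itself is what forces a needless recall of a migrated
    file. The actual sitebits fcntl command is UNICOS-specific and
    stubbed here."""
    fd = os.open(path, os.O_WRONLY)
    try:
        pass  # real code: fcntl(fd, <sitebits command>, retention_stamp)
    finally:
        os.close(fd)
```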

• To enable users to list the retention time on their files, both
ls(1) and ls(1bsd) commands were modified to accept a -D
option which, when used in conjunction with the -l or -t
options, lists the retention time in the same format as the -c
and -u options. The code generates an error message if the
-D option is used in conjunction with either -c or -u. If the
retention time is zero, i.e. has not been set on a file, the
mtime is used in the default manner of ls(1) and ls(1bsd).
The -D option used on its own is valid but meaningless, as
are the -c and -u options.

• In order to search file systems for files that have exceeded
their retention period, the find(1) command was modified to
accept a -dtime argument in the manner of -atime, -ctime and
-mtime. If the retention time is zero, i.e. has not been set on
a file, the last date of modification is used in the manner of
the -mtime option.
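The zero-means-unset fallback shared by the modified ls and find can be sketched as follows; the function is illustrative, not the actual command code:

```python
def effective_retention_time(dtime, mtime):
    """Return the timestamp the modified ls/find report for -D/-dtime:
    the retention stamp if one was ever set, else the mtime fallback.
    A dtime of 0 means 'never set', matching the sitebits default."""
    return dtime if dtime != 0 else mtime

assert effective_retention_time(0, 1_000) == 1_000      # unset -> mtime
assert effective_retention_time(2_000, 1_000) == 2_000  # set -> retention stamp
```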

Implementation Details
The second method of implementing a retention system was
chosen. Kernel mods were made to Unicos 6.1.6 to provide the
extra functionality in the fcntl(2) and stat(2) system calls that is
available in Unicos 7.0. Whilst doing this it was apparent that
it would have been better if Cray had not implemented the
facility to write into the inode sitebits member through fcntl(2),
but had rather written another system call that would function
similarly to utime(2). This is because fcntl(2) principally
performs file locking and other operations on the data of a file
and so takes a file descriptor as an argument; therefore the file

To set a retention period the touch(1) command was
modified:

Sites implementing a system such as this could give
consideration to modifying other commands such as fck(1) to


report the retention time, or perhaps rm(1) to warn users trying
to delete files with an unexpired retention time.
The mods to commands described above provided the necessary
tools with which to implement a 'file expiration' system. Users
were forewarned, run stream generators for user jobs were
modified to allow batch users the option of touch'ing their files
with a retention time, and a shell script was written to delete
files past their retention time. The shell script could have been
run from cron, but in order to enable us to mail users who only
read VAX mail, the script is run from a batch job submitted
from a VAX front end once every month. It was decided to
delete all files more than 93 days past both their retention time
and date of last modification. Date of last access was not used
as it is updated by commands such as dmget(1), dmput(1) and
file(1). The shell script essentially just does:
find $FS -dtime +93 -mtime +93 -type f -exec rm {} \; -print
Note that the above command leaves directory structures in
place in case this is necessary for some applications to run. A
list of the files deleted is mailed to the user, along with a list of
files that will be deleted next month unless a new retention
time is set upon them.
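The deletion rule the script applies (more than 93 days past both the retention time and the last modification) can be sketched outside find(1) like so; timestamps are seconds since the epoch and the helper name is ours:

```python
DAY = 86_400  # seconds

def eligible_for_deletion(now, mtime, dtime, grace_days=93):
    """True if a file is more than grace_days past BOTH its retention
    time (mtime stands in when no retention stamp was ever set) and
    its last modification, mirroring 'find -dtime +93 -mtime +93'."""
    retention = dtime if dtime != 0 else mtime
    cutoff = now - grace_days * DAY
    return retention < cutoff and mtime < cutoff

now = 200 * DAY
assert eligible_for_deletion(now, mtime=50 * DAY, dtime=0)              # long idle
assert not eligible_for_deletion(now, mtime=50 * DAY, dtime=150 * DAY)  # retained
```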

Effect Of Implementation On DMF
The system went live on February 1st 1993. The first deletion
removed approximately 80GB of files. A month later, when the
files were hard deleted, utilisation of our DMF tape pools was
reduced by nearly 50%. Having made a large number of tapes
free, we wanted to remove a contiguous range of higher VSN
tapes from the DMF pools rather than just removing random
VSNs as they became free. This was time consuming and
messy, and could only be done by setting the hold read only
(hro) flag on the VSNs with the dmvdbgen(8) command, and
then waiting until more files were hard deleted from the tapes
before merging them. Multi-volume files that were not due to
be deleted had to be manually recalled, touch(1)'ed to make
DMF think they had been modified and so put them back
somewhere else, and then re-migrated. Tidying the tape pools
up after the start of the file expiration system took several
months.
One initial problem was caused by the fact that the modified
touch(1) command uses the fcntl(2) system call to write the
retention date, and so must open(2) a file first. This means that
if a user sets a retention time on a migrated file, the file is
needlessly recalled. As a result of this there was some
thrashing of tapes during the month before the first delete as
users set retention periods on old migrated files. Fortunately
the act of touch'ing the retention time on the file does not
update the modification time, else a subsequent dmput(1) of the
file would create yet another copy! On the 2nd of December


1992, Design SPR No 58900 was submitted suggesting that a
new system call should be written to allow cdi_sitebits to be
updated without recalling migrated files.
The system has been running for over a year now. It had been
anticipated that some users would try to circumvent the
retention system by setting very long retention periods on all
their files by default, but so far this has not happened. We also
anticipated being deluged by requests to restore deleted files,
but apart from a small number of genuine mistakes this has not
happened either.
Some users do unfortunately consider the retention system to be
an alternative to the rm(1) command and just leave files lying
around until the retention system deletes them. This causes a
problem because the files are held in DMF for three months
before the retention system deletes them and then a further
month before they are hard deleted from DMF. In the worst
case this can result in garbage being held in DMF for nearly
five months.

First Problems
In July '93 a disk problem caused a user file system to be
restored from a backup tape. Shortly afterwards it was noticed
that all of the retention periods set on files had been lost. This
was due to a bug in the restore(8) command. The dump(8)
command correctly dumps the value of cdi_sitebits but there is
no code in restore(8) to put it back again. Dump and restore had
been inadvertently omitted during testing of the retention
system! Cray confirmed that the bug was present in 7.0.
Since restore(8) runs as 'user code', i.e. does everything through
the system call interface rather than accessing the file system
directly, it was not possible to fix this bug in Unicos 7.0:
restore would have to open(2) and unmigrate every restored file!
Design SPR No 66926 was submitted on the 8th July 1993
against this problem. The Cray reply was that a new system
call would have to be written, which would not be available for
some time. The Design SPR suggesting that a new system call
should be used to set cdi_sitebits had been submitted over six
months previously, but no action had been taken by Cray. This
reinforces a general concern of the author that Cray pays
insufficient attention to Design SPRs.
Fortunately Cray did suggest a work-around to the problem. A
script was written to read a dump tape and produce a list of
inode numbers and file names. A program then re-reads the
dump tape and produces a list of inode numbers and cdi_sitebits
values. The two lists are read by an awk program that matches
path names to cdi_sitebits values and pipes the result into fsed(8)
to reset the values. Finally, the file system must be unmounted
and mounted again before Unicos recognises the cdi_sitebits
values. The whole process is a kludge and requires multiple
reads of the dump tapes, but it works.
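The heart of the work-around is an inode-number join between the two listings pulled from the dump tape; a simplified sketch of the matching step the awk program performs (the names and values here are invented):

```python
def match_sitebits(names_by_inode, sitebits_by_inode):
    """Join an 'inode -> path' listing and an 'inode -> cdi_sitebits'
    listing into the 'path -> sitebits' pairs fed to fsed(8) to reset
    retention times. Files with no retention stamp (0) are skipped."""
    return {
        path: sitebits_by_inode[ino]
        for ino, path in names_by_inode.items()
        if ino in sitebits_by_inode and sitebits_by_inode[ino] != 0
    }

names = {101: "/u2/model/run1.dat", 102: "/u2/model/run2.dat"}
sitebits = {101: 731163600, 102: 0}  # 0 == retention time never set
assert match_sitebits(names, sitebits) == {"/u2/model/run1.dat": 731163600}
```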

restricted to super-user. This should allow the bug in
restore(8) to be fixed, but may cause problems with our local
mods to touch(1): it may be necessary to make touch(1) a
suid program to allow users to set retention periods. The
restriction to super-user may be lifted in a later release,
and the functionality of fcntl(2) will remain indefinitely
should we decide to stick with the current limitations.
Unicos 8.2 is scheduled for release fourth quarter 1994.

Where Are We Now?
Our DMF pools are still growing and are now, at 900 tapes
each, the same size as before we introduced the retention
system. This is due mainly to the following factors.
• The users in the /u2 file system have generated a large
number of files which legitimately must be kept for several
months. This is unavoidable and the main cause of the
growth.
• We keep more log files and accounting data on the 'other'
file systems than we used to.
• Systems support staff themselves are untidy and are
keeping ever increasing amounts of data.
• Unwanted files can still remain on DMF tapes for several
months. The only way to prevent this is to reduce the time
after which expired files are deleted, and/or reduce the time
after which files are hard deleted.

Future Plans
We plan to run the retention system on support staff file
systems in the near future, and will look at reducing the
amount of log files kept. These changes should remove up to
10 GB of migrated data.
Files more than one day (rather than 93 days) past their
retention period will be deleted, provided they are more than 93
days past the date of last modification. This change will delete
21 GB of migrated data.

Conclusions

Outstanding Problems
The following problems are outstanding:

• It is necessary to unmigrate a file in order to set a retention
time on it.
• Restoring a dump tape is a convoluted process.
Both of the above problems would be solved if Cray were
to implement a new system call for writing into cdi_sitebits.
According to CRAY PRIVATE PRE-RELEASE
information (liable to change), there will be a new system
call, lsetattr(2), in 8.2 which will allow all user-accessible
meta-data associated with a file to be set in a single system
call. The system call takes a pointer to a file name and a
structure containing the desired changes as arguments, and
does not require the file to be open.
However, due to the internal workings of the Unicos 8.x
virtual file system, the initial implementation will be

It is very easy to implement a simple and reliable method of
removing unwanted files and preventing DMF tape pools from
growing indefinitely. From Unicos 8.2, the existing problems of
having to unmigrate files to set a retention time on them, and
the failure of restore(8) to reset the retention times on restored
files, should be fixed.
The system is not, however, a magical remedy for all storage
space limitations, and good storage management is still
required to ensure that users (and systems staff!) do not abuse
the retention system or carelessly keep unnecessarily large
amounts of data.

Acknowledgements
I would like to thank my colleague Guy Hall, who wrote the
code to run the monthly file deletion from the VAX and send
mail to users; Neil Storer at ECMWF, for assistance with the
kernel mods to 6.1.6; and the staff at Cray UK, in particular Barry
Howling, who provided much help during early discussions
about how to implement a file retention system, and who
endured endless phone calls and complaints about design SPRs
and the way Cray implemented the sitebits facility.


ER90® DATA STORAGE PERIPHERAL
Gary R. Early
Anthony L. Peterson
EMASS Storage System Solutions
From E-Systems
Dallas, Texas

Abstract
This paper provides an architectural overview of the EMASS®
ER90® data storage peripheral, a DD-2 tape storage subsystem.
The areas discussed cover the functionality of tape formatting,
tape drive design, and robotics support. The design of the ER90
transport is an innovative approach in helical recording. The unit
utilizes proven technology developed for the video broadcast
industry as core technology. The remainder of the system is
designed specifically for the computer processing industry.

Introduction
In 1988, E-Systems initiated a project which required a high
performance, high capacity tape storage device for use in a mass
storage system. E-Systems performed an extensive, worldwide
search of the current technologies. That search resulted in the
identification of the Ampex broadcast recorder that utilized D-2
media as the best transport. E-Systems then initiated a joint
development effort with Ampex to use their proven video
transport technology and design the additional electronics
required to produce a computer peripheral device. The result of
this development was the EMASS ER90 tape drive which
connects to computers using the IPI-3 tape peripheral interface.

All motors are equipped with tachometers to provide speed,
direction, or position information to the servo system, including
the gear motors which power the cassette elevator and the
threading apparatus. There are no end position sensors; instead,
the servo learns the limit positions of the mechanisms and
subsequently applies acceleration profiles to drive them rapidly
and without crash stops. This approach also permits the machine
to recover from an interruption during any phase of operation
without damage to the machine or tape.
The tape transport also features a functional intermediate tape
path that allows high speed searches and reading or writing of the
longitudinal tracks without the tape being in contact with the
helical scan drum.
The ER90 tape drive architecture and media provide several
unique functions that enhance the ability to achieve high media
space utilization and fast access. Access times are enhanced
through the implementation of multiple areas (called system
zones) on the media where the media may be unloaded. This
feature reduces positioning time in loading and unloading the
cassette. Access times are reduced through high speed positioning
in excess of 800 megabytes per second. These core technology
designs support a set of advantages unique to tape transports.
Figure 1 depicts the ER90 transport and associated electronics.

Core Technology
The ER90 transport design meets the stringent requirements for
long media life in its approach to tape handling. A simplified
threading mechanism, a simplified tape path, and automatic scan
tracking, along with a proven head-to-tape interface, are all features
that led to the selection of the Ampex transport for the ER90 drive.
The ER90 transport uses a direct-coupled capstan hub similar to
high performance reel-to-reel tape drives instead of the usual
pinch-roller design. Advantages include fast accelerations and
direction reversal without tape damage, plus elimination of the
scuffing and stretching problems of pinch roller systems. Since a
direct drive capstan must couple to the backside of the tape, it
must be introduced inside the loop extracted from the cassette. In
order to avoid a "pop up" or moving capstan and the problems of
precise registration, the capstan was placed under the cassette
elevator, so that it is introduced into the threading cavity as the
cassette is lowered onto the turntables.


In order to prevent tension buildup and potential tape damage,
none of the tape guides within the transport are conventional fixed
posts. Air film lubricated guides are used throughout; one
exception is the precision rotating guide which is in contact with
the backside of the tape.


Figure 1. ER90 transport and associated electronics.

Head Life
Helical heads are warranted for 500 hours; however, experience
with helical head contact time exceeds 2000 hours. Because of
the system zones and the ability to move between system zones
without tape loaded to the helical scanner drum, the actual head
life with tape mounted on the drive may be substantially longer.
In addition, when heads do need to be replaced, service personnel
may quickly install new heads on-site, without shipping any
transport subassemblies off-site, in about 20 minutes.

Table 1: Summary of the Error Management System

Format Item: Format Description

Bytes Per Track: 48,972

User Bytes Per Track: 37,495

C1 Dimensions: RS(228,220,8) T=4

C2 Dimensions: RS(106,96,10) T=5

C3 Dimensions: RS(96,86,10) T=5

Safe Time on Stopped Tape: Whenever the flow of data to or from
the tape drive is interrupted, the media is moved to a system zone
and unloaded from the helical drum. When data is being written,
this should be a rare occurrence because each drive has a 64
megabyte buffer. When in retrieval mode, returning to a system
zone whenever the access queue is zero should be standard
practice. In this way, if the drive is needed for a different cassette,
it is available sooner, and if another access is directed at the same
cassette, the average access time is not affected by where the tape
now rests. With this type of drive management, the cassette may
remain mounted indefinitely without exposure to the tape or heads.

Channel Code: Miller-Squared (rate 1/2)

C1-C2 Product Code Array: In-track block interleaver with
dimensions 456 x 106 (two C1 words by one C2 word)

C3 Code Cross-Track Interleave Description: C3 codewords
interleaved across a 32-track physical block

Outer CRC Error Detection of C1-C2-C3 Failure: Four 64-parity-bit
CRC codewords interlaced over 32 tracks, which provide an
undetected error probability of 10^-20

Write Retry: Yes

Coding Overhead: 28%

Erasure Flagging Capability of Channel Code: Excellent;
probability of flagging a burst error is near 1.0

Maximum Cross-Track Burst Correction: 3.3 tracks (139,520 bytes)

Maximum Length of Tape Defect Affecting 4 Adjacent Tracks that
is Correctable: 35,112 bytes

Maximum Raw Byte Error Rate Which Maintains Corrected Error
Rate < 10^-13: 0.021

Maximum Width of Longitudinal Scratch that is Correctable:
4,560 bytes

Data Processing Design
An ER90 cassette can be partitioned into fixed-size units which
can be reclaimed for rewriting without invalidating other recorded
data on the tape cassette. Most tape management systems achieve
space reclamation by deleting an entire tape volume, then
allowing users to request a "scratch tape" or "non-specific"
volume when they wish to record data to tape. Physical cassette
sizes of 25, 75, or 165 gigabytes make this existing process
inefficient or unusable. The ER90 cassette partitioning capability
provides an efficient mechanism for addressing the tape space
utilization problem.
ER90 cassette formatting provides for three levels of Reed-Solomon
error correction. In addition, data is shuffled across the
32 tracks that make up a physical block, and interleaved within
the physical track so that each byte of a block has maximum
separation from every other byte that makes up an error correction
code word. Data is then recorded using a patented process called
Miller Squared. This process is a self-checking, DC-free, rate 1/2
coding process that has a 100% probability of flagging a burst
error. This has the effect of doubling the efficiency of a Reed-Solomon
code by knowing where the power of the code should be
applied. A data rate of 15 MB/sec is achieved with all error
correction applied, resulting in no loss of drive performance for
maximum data reliability.
An error rate of 10^-15 should be achieved without factoring in the
effect of the interleave, write retry, and write bias. C3 error
correction is disabled during read-back check when writing in
order to bias the write process. If C2 is unable to correct the error
of any one byte, a retry is invoked. Table 1 summarizes the error
management system.

Drive Configuration
The drive configuration allows for physical separation of the
electronics from the transport module at distances up to 100 feet if
desired. This allows users to maximize the transport density in
robotic environments, and to place the electronics modules in
close proximity to host computers.


Media Usage Life

Multiple Unload Positions

One of the major applications for ER90 technology is a backstore
for a disk/tape hierarchy. As such, the number of tape load and
unload cycles, thread/unthread cycles, and searches may be
significant. The expected usage capabilities for the ER90 media
should exceed 50,000 load/unload cycles, 50,000 tape thread/
unthread cycles per system zone, and 5,000 end-to-end shuttle
forward and rewind cycles. The number of end-to-end reads using
incremental motion (less than 15 MB/sec) should exceed 2,000,
while the number of reads of 1 gigabyte files using incremental
motion should exceed 5,000. The operating environment should
be maintained between 12 and 20 degrees centigrade with relative
humidity between 30 and 70% to achieve best results.

Access times are enhanced through the implementation of
multiple system zone areas where the media may be loaded and
unloaded. Full rewind is therefore unnecessary. This reduces
positioning time when loading and unloading the cassette, while
eliminating mechanical actions of threading and unloading over
recorded data, as well as eliminating the wear that is always
inherent in any design that requires a return to beginning of tape.

Archival Stability
Assuming cassettes are stored within temperature ranges of 16 to
32 degrees centigrade with relative humidity between 20 and 80%
non-condensing, storage of over 10 years is expected. For even
longer archival stability, an environment maintained between 18.3
and 26.1 degrees centigrade with relative humidity between 20
and 60% non-condensing should result in archival stability
exceeding 14 years.
Recent testing by Battelle Institute on D-2 metal particle tapes
from four vendors revealed no detectable change after 28 days
of exposure to accelerated testing in a mixed gas environment,
equivalent to 14 years of typical storage in a computer facility.
The following results were determined:
• No evidence was found of localized surface imperfections.
• Improved surface formulations provided a protective coating
for the metal particles.
• The D-2 cassette housing protected the tape against damage
by absorbing (gettering) corrosive gases.
• Change of magnetic remanence does not differ significantly
when compared to other tape formulations in use today.¹

High Speed Search
ER90 data formats include full function use of the longitudinal
tracks that can be read in either the forward or reverse direction.
One of these tracks contains the geometric address of each
physical block of data. This track can be searched at speeds of
greater than 300 inches per second, equivalent to searching user
data at more than 800 megabytes per second. Another longitudinal
track is automatically recorded on tape which provides either
addressability to the user data block or to a byte offset within a
user file. No user action is required to cause these tracks to be
written and they provide high speed search to any point in the
recorded data, not just to points explicitly recorded at the time of
data creation.

1. Fraser Morrison and John Corcoran, "Accelerated Life Testing
of Metal Particle Tape," SMPTE Journal, January 1994.


Robotics
Full robotics support is provided for the ER90 drives by the
EMASS DataTower® and DataLibrary™ archives, with storage
capacities up to 10 petabytes. Both robotics are supported by
EMASS VolServ™ volume management software, which is
interfaced to the tape daemon in the UNICOS operating system.
The DataTower archive uses proven robotics to transfer D-2 small
cassettes between 227 storage bins and up to four ER90 drives.
This yields a capacity of almost six terabytes of user data in a
footprint of only 27 square feet. A bar code reader on the robot
scans a bar code on each cassette for positive identification and
manageability. Under VolServ software control, the robot inside
the DataTower archive rotates on a vertical pole, grabs the
designated D-2 cassette from its bin, and moves it to an available
ER90 drive, where the cassette is automatically loaded. This load
operation completes in less than a minute. When the application
completes its use of the D-2 cassette, VolServ software will
instruct the robot to return the cassette to a storage bin.
For larger storage needs, the DataLibrary archive offers a modular
solution that can be expanded to contain petabytes of user data.
The DataLibrary archive stores D-2 small and medium cassettes
in shelves that make up the cassette cabinets. Each four foot wide
cassette cabinet can hold up to 14.4 terabytes of user data. Up to
20 cabinets can be added together to form a row of storage. Rows
are placed parallel to each other to form aisles in which the robot
travels to access the cassettes. An added benefit of this architecture
is that cassettes on interior rows are accessed by two robots.
ER90 drives are placed on the ends of the rows to complete the
DataLibrary archive. As demand for robotic-accessed storage
grows, a DataLibrary archive can be expanded by adding more
storage cabinets, more robots, or more ER90 drives. As with the
DataTower archive, VolServ software provides complete mount
and dismount services through the UNICOS tape daemon.

Conclusions
The ER90 tape drive, by borrowing innovative techniques used in
high-resolution video recording, provides the computer processing
industry with a helical scan tape format that delivers data from
a high-density 19 mm metal particle tape. With a sustained rate of
15 MB/sec, input/output-intensive applications now have a device
that complements the processing speeds of Cray supercomputers.
The implementation of system zones allows safe load and unload
at other than BOT, providing improved access times. The dense
DD-2 format means that the cost of storage media is dramatically
reduced to less than $2 per gigabyte.

EMASS/CRAY EXPERIENCES
AND PERFORMANCE ISSUES
Anton L. Ogno
Exxon Upstream Technical Computing Company

Introduction

At the Exxon Upstream Technical Computing
Company (EUTeC) we have recently purchased an
Emass Data Tower from E-Systems. The Data
Tower is capable of holding 228 D2 tapes, with each
tape having a capacity of 25 gigabytes, for a grand
total of approximately 6 terabytes of storage. The
Tower also provides a robotics system to mount the
tapes, a cabinet with room for four ER90 recorders,
and a SparcStation system (VolServ) for managing
media, drive, and other component statuses. IPI
channels connect our Tower directly to the Cray, and
each recorder is capable of sustaining 15 MB/s. We
intend to harness the storage capacity of the Data
Tower to provide a centralized repository for
managing our large tape library. In the future, we
may expand its usage to our IBM MVS machines and
our UNIX workstations.

We faced two challenges in making the ER90 drives
available for production use. The first challenge was
to make the ER90 drives available to our users. To
do so, we decided to extend the functionality of Data
Migration (DMF) to include multiple Media Specific
Processes (MSPs) solely for D2 media. This
approach gave us a simple mechanism for storage
and retrieval of data, as well as archival and
protection services for our data files. DMF allows us
to support ER90 use with minimal programming
effort, but at the expense of some flexibility and
some performance. The hurdles involved in making
ER90s available through DMF include configuration
and operational issues with VolServ and the Data
Tower network, configuration of the tape daemon,
formatting of the tape media, configuration of DMF,
and some modifications to DMF to work with
multiple MSPs.

Performance has been our second major challenge.
Each drive in the tower is capable of sustaining 15
MB/s from Cray memory to tape on a dedicated
system. At the moment, Data Migration is giving us
at most 7 MB/s per drive, or about 20 MB/s
aggregate. By improving the I/O algorithm used by
DMF in a standalone program, we have been able to
achieve a rate of 13.5 MB/s from a single drive to
disk and a rate of 26 MB/s with two drives reading or
writing concurrently. With such a significant
performance gain, we concluded that the
performance loss seen in Data Migration was due to
the I/O algorithm of the DMF code. An explanation
of these findings follows later.

DMF/ER90 Experiences
Hardware Hookup and Configuration
of a Standalone ER90
Initially, EUTeC opted to have one ER90 drive on
site until a Data Tower became available for
installation. Before we could make the ER90
available to the Cray, we had to rebuild and
reconfigure the tape daemon with ER90 software
support, which Cray distributed as an asynchronous
product. These drivers installed without incident;
however, there were changes to the Tape
Configuration File that did not install so easily.
The ER90 Tape Daemon Support Technical Note
(SG-2137) gave us a rough outline of configuring a tape


loader for the ER90. Unfortunately, the
documentation was insufficient to steer us around
issues such as appropriate time-out values, maximum
block size configuration, and operator confirmation of
"blp" tape mounts.¹ As the technology matures, we
expect that this document will mature
commensurately. (All sites)

Installing and Configuring DMF
After we had established communication between the
tape daemon and the ER90, we focused on
configuring DMF to use the D2 MSPs. Because we
needed to make the ER90s available quickly, we
realized that only two packages could satisfy our data
integrity requirements: Data Migration or Cray Reel
Librarian (CRL). Unfortunately, CRL had to be
abandoned because it cannot coexist with IBM Front
End Servicing (a site requirement). Thus, we
decided to allow access to the ER90s through DMF
only. We had been running DMF with 3490 tapes for
a year, which significantly lowered the learning
curve for EUTeC and our clients. Before the first
ER90 arrived, we were running DMF 2.04. To gain
ER90 support, we upgraded to DMF 2.05, and with
the installation of UNICOS 8.0, we upgraded to DMF
2.10, all of which caused us no problems.
The following paragraphs outline the customizations
we have made or expect to make to DMF:

Multiple MSP Selection
To allow our users to migrate to the D2 MSP while
continuing to use the 3490 MSP, we made local
mods to the dmmfunc routine, which Cray provides
for just such a purpose.² At EUTeC, if a user sets the
site bit in the inode to a number between 1000 and
1010, our modified dmmfunc returns an appropriate
MSP index from 0 to 10. This approach has several
disadvantages, including a lack of access control to
the ER90s. We have submitted a design SPR with
CRI to add a UDB field for controlling access to each
MSP. (DMF sites with more than one MSP)

¹When formatting and labelling tapes, the
tpformat and tplabel commands use bypass label
processing (see below).

²dmmfunc receives a stat structure and returns an
MSP index for the specified file.

No tape unload
To leave cartridges mounted after DMF performs an
operation, we have also added a "no unload" option
to the ER90 tape mounts for DMF. If a program
accesses files from the same cartridge sequentially,
then this measure greatly decreases the number of
tape mounts required. (DMF sites)

Hard Deletes and Default Copies.
Other issues concerning Multiple MSPs include the
lack of hard delete and default copy parameters on a
per MSP basis or per file system basis. We have
submitted design SPRs for both of these parameters.
(DMF sites with more than one MSP)

Read access equals dmput access.
Another problem that we ran into as users shared
these D2 files was that, while a user could retrieve
files he did not own with dmget, he could not release
their space with dmput. At EUTeC, we modified
dmput to allow users with read or write access to
dmput those files. (DMF sites)

Configuring the Tape Daemon and
VolServ for the Data Tower
When the tower arrived, with the two additional
ER90s, we had to set up the VolServ Sun to
communicate with both the Data Tower and the Cray.
Also, with the addition of the Data Tower, we added
another set of mods to the tape daemon, and
configured another tape configuration file. In this
file, we configured the ER90 loader to use VolServ
as the Front End Service. We have made only minor
changes to VolServ, increasing the time-out values
for RPC messages, and sending VolServ log
messages to an alternate display. (All sites)

Formatting and Labelling Tapes
Before the tape daemon uses a D2 tape, the tape must
be formatted with tpformat. This process encodes
cartridge identification onto the tape, and divides
each tape into logical partitions, which the tape
daemon treats as separate volumes. For performance
reasons, these logical partitions should be sized
appropriately to the users' file sizes.³
You must run tpformat on every cartridge in the
tower. Initializing all 228 tapes in a Data Tower is a
time-consuming process and could take several days
utilizing all three recorders simultaneously. If you
require labels on each partition, then the time for
initialization doubles. Tape labelling is optional and,
if used, must be done for every partition on every
tape. Labelling will also slow each tape mount by
about 30 seconds, and for that reason is not
recommended by CRI. We have written a script to
perform this onerous task, but have run into problems
because tpformat and tplabel perform their own rsv
command, thus forcing the script to be single-threaded.
Also, our users' average file sizes may
change over time, which may require us to reformat
the remaining free tapes to achieve peak
performance. One site has written a C program to
format and label their entire archive by formatting on
several drives at one time. I recommend that
approach heartily. (All sites)
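The multi-drive idea can be sketched with ordinary shell job control. Everything below is illustrative: the VSNs and drive count are made up, and tpformat_one is a placeholder for the real per-cartridge work (rsv, tpformat, tplabel with site-specific options, which are deliberately elided here).

```shell
#!/bin/sh
# Sketch: initialize a batch of cartridges, several recorders at a time.
# tpformat_one stands in for the real per-cartridge sequence
# (rsv, tpformat, tplabel ... with site-specific options elided).
tpformat_one() {
    vsn=$1; drive=$2
    echo "formatting $vsn on drive $drive"
    # actual tpformat/tplabel invocation goes here
}

drives=3    # recorders available (illustrative)
i=0
for vsn in AA0001 AA0002 AA0003 AA0004 AA0005 AA0006; do   # hypothetical VSNs
    tpformat_one "$vsn" $((i % drives)) &   # run in background, one per drive
    i=$((i + 1))
    if [ $((i % drives)) -eq 0 ]; then
        wait    # keep at most $drives format jobs running at once
    fi
done
wait
```

Because each background job holds its own drive, this avoids the single-threading that the rsv inside tpformat/tplabel forces on a naive sequential script.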

Emass Training and Establishing
Operational Procedures
When installing a Data Tower, consider the training
required for its support and administration. Several
members of our staff have attended the VolServ
class offered by E-Systems. This class gives an
overview of VolServ only. It does not cover
configuring a Data Tower attached to a Cray, Data
Migration with ER90s, performance, or tape daemon
configuration. Most of what we have learned in
those areas has been through first-hand experience,
talking to Cray personnel, and hashing out problems
with E-Systems onsite support. (All sites)
Along with the burden of learning the ins and outs of
the system ourselves, we have spent considerable
effort training our operators in handling emergency
situations, cycling the drives and the VolServ
software, and monitoring. (All sites)

Hardware Errors
Since the installation of the Data Tower, we have had
a handful of hardware outages. Due to two recent
computer room blackouts, we have lost two power
supplies. We have also had a card in one drive
controller go bad and some failures with unmounting
tapes. Onsite support has been available to fix these
problems, and E-Systems is working to improve its
support at Cray installations. (All sites)

Media Recognition Problem
We have seen cases where tapes that we had
formatted and labelled became unmountable. In
some of these cases, when DMF mounts a tape, the
tape daemon returns a recoverable error and attempts
another mount; in other cases, the drive goes down
and our hardware technician must manually remove
the tape from the drive. Worse yet, we have seen a
case where a formatted tape that DMF has written
data to is sporadically unmountable. In this case,
DMF has written 8 out of 11 partitions on one
cartridge, yet tape mounts of that cartridge still fail
with the message "Unable to position to partition X."
This problem was fixed in an upgrade of VolServ.
(All sites)

³An appendix to the DMF 2.05 Administrator's
Guide helps you choose the optimum partition sizes.

Transition to FileServ
There are many potential advantages to switching to
FileServ. To start, FileServ allows users to write
directly to D2 tape. Second, FileServ allows variable
partition sizes, which will optimize tape usage (see
Formatting and Labelling Tapes). Third, FileServ
backs up the data inodes to tape when migrating files.
Lastly, FileServ will use asynchronous I/O (see
DMF/ER90 Performance). Overall, these
enhancements will be an improvement over DMF in
performance, functionality, and recoverability.
FileServ is due to be released in 3Q94, and we will
consider it as an alternative to DMF. (Sites
considering FileServ)

DMF/ER90 Performance
With a high-performance tape subsystem and high-performance
disks come high expectations for I/O
performance. Unfortunately, accessing the ER90s

through DMF did not give us the 15 MB/s transfer
rate that we expected. By simulating DMF's I/O
methodology in a standalone program, we were able
to benchmark transfer rates for DMF and then
improve upon the algorithm. Ultimately, we found
that the ER90 caching features, combined with
DMF's synchronous I/O, caused a transfer rate of
between 4 and 8 MB/s. We also found that a program
using buffered, asynchronous I/O could produce
nearly 14 MB/s on a loaded system. Because of this
study, we believe that our I/O methods could
generically improve DMF/ER90 performance by
100%. To test our theory about buffering, we wrote
several C programs using unbuffered raw I/O,
Flexible File I/O (FFIO), and asynchronous,
buffered I/O.

Unbuffered, Raw I/O
First, we wrote a program to simulate DMF's
dmtpput and dmtpget operations. This program
opened a disk file with the O_RAW attribute to
bypass system buffers.⁴ As with DMF, this program
looped, reading 4 MB at a time from a striped DD60
disk file and writing that block of data to D2 tape (or
vice versa for tape reads). As with DMF, reads and
writes occurred synchronously, and the program
achieved a peak performance in the 7 MB/s range,
significantly lower than the maximum speed of the
ER90s and the DD60 drives.

Running this program, we noticed a distinct pattern in
our read and write times to and from tape.⁵ When
writing to tape, the first write generally takes about
2,800,000,000 clocks, which is two orders of
magnitude larger than most other writes. Since the
first write includes drive ramp-up, tape positioning,
and label processing time, we rewrote the program to
throw away this time (see Appendix). After the
first write, the program wrote the next 4 MB chunks
in about 40,000,000 clocks each, until about the 13th
write, which was generally about 400,000,000
clocks. After that, writes to tape followed a pattern
of 10-14 writes at 40,000,000 clocks, and one write
at 400,000,000 clocks.

Here is a sample of the timing:

readtime 25000000 clocks, writetime 2800000000 clocks
readtime 25000000 clocks, writetime 40000000 clocks
(Last result repeated 12-13 times)
readtime 25000000 clocks, writetime 400000000 clocks
readtime 25000000 clocks, writetime 40000000 clocks
(Last result repeated 12-13 times)
readtime 25000000 clocks, writetime 400000000 clocks

⁴Note that we did not use the O_SYNC option that
DMF uses.
⁵Between each read and write we called rtclock() to
give us an approximate time for reads versus time for
writes. We used time() to time the entire transfer
from start to finish.

After some digging, we found that there were two
causes for this pattern, both associated with the ER90
caching mechanism. Each ER90 drive buffers input
into a 64 MB cache before sending it to tape. When
the buffer fills to 45 MB, the drive ramps up and
starts writing to tape. Because the buffer is initially
empty, the Cray must write about 11 x 4 MB blocks
before the transport starts. Ramp-up for the drive and
the transfer from buffer to tape then account for the
slow write after the first series of writes. After that,
we found that the ER90 actually reads its cache faster
than the Cray disk could write. This would cause the
drive buffer to empty, which caused the transport to
stop until the buffer filled up and the transport started
again. The combined effects of synchronous I/O,
drive buffering, and drive ramp-up caused the entire
transfer to be sluggish.

FFIO Program
Based on the results of the first program, we decided
to rewrite the program to allow us the flexibility to
experiment with various I/O methods without
recoding. The tool of choice was Flexible File I/O
(FFIO), which allowed us to change the I/O method
of the program by assigning attributes to a file with
the "asgcmd" command. A sample "asgcmd"
command, assigning 2 x 4 MB library buffers, would
be:

asgcmd -F bufa:1024:2 diskfile

For our tests, we used 1024-block⁶ buffers to
correspond with the 4 MB tape block size.⁷
Results from the FFIO code showed a dramatic
overall transfer rate increase when we used FFIO
buffering with the disk files. The pattern of reads
and writes between disk and tape also changed
dramatically. When writing to tape, the first write
still takes about two orders of magnitude longer than
most other writes, and the first 13 writes behave the
same as in the unbuffered I/O case. But after the
drive ramps up once, reads and writes occur
consistently in under 40,000,000 clocks, until the end
of the transfer. Because the disk I/O is sufficiently
fast to fill or empty the drive cache, the drive
transport never stops, which nearly doubles the
transfer rate to 13.6 MB/s.

Here is a sample of the timing (note that the read
times from disk are significantly faster than the write
times to tape):

readtime 800000 clocks, writetime 2800000000 clocks
readtime 800000 clocks, writetime 40000000 clocks
(Last result repeated 12-13 times)
readtime 800000 clocks, writetime 400000000 clocks
readtime 800000 clocks, writetime 40000000 clocks
(Last result repeated until end of transfer)

The following diagram illustrates the data path from
disk to tape with FFIO library buffering. Notice that
12 MB of mainframe memory is required to hold the
library and user buffers and that the ER90 cache is
several times larger than the tape block size:

[Figure: files striped across DD60 disk drives at 256
blocks/stripe -> two 4 MB FFIO library buffers -> 4 MB
user memory buffer -> ER90 buffer (~65 MB) -> ER90
recorder]

See Appendix A for a listing of the FFIO code.

⁶1024 blocks x 512 words/block x 8 bytes/word =
4,194,304 bytes = 4 MB.
⁷Varying buffer sizes did not produce significant
performance changes, and varying the number of
buffers produced marginal changes.

Asynchronous I/O Program
Finally, once we found the optimal buffering scheme
via our FFIO program, we wrote an asynchronous
I/O program with the same algorithm to benchmark
against the FFIO program. We wanted to eliminate
any overhead in CPU time, memory usage, and
memory-to-memory copies that the FFIO buffering
layer may have added.
In the asynchronous I/O code, we hard coded the
double buffering algorithm. Using the reada and
recall system calls, we used a double buffering
scheme that most closely approximated the FFIO
example with 2 x 4 MB buffers (asgcmd -F
bufa:1024:2 diskfile).
The results from transferring 2 GB of data were
similar to the FFIO program at 13.6 MB/s, but we
managed to eliminate one 4 MB buffer and most of
our user CPU time. The elimination of an additional
memory-to-memory copy between the FFIO library
space and user space easily accounts for this decrease
in CPU overhead.
See Appendix B for a listing of the asynchronous
I/O code.

Recommendations


Because ER90 tape performance is so dependent
upon disk performance, I believe that Cray should
consider writing separate I/O routines for ER90 tapes
using the fastest and cheapest I/O method available.
Because of the enhanced speed, low memory
overhead, and low CPU overhead of the
asynchronous I/O cases, the results of our testing at
EUTeC clearly support asynchronous I/O as the
preferred I/O method. I feel that, although recoding
may not produce 13.6 MB/s consistently from DMF,
it would speed file transfers to consistently over 10
MB/s from DD60s.
Further tests that I would recommend are running
transfers from unstriped files residing on DD60s, and
running transfers from DD40 series disks. Other
experiments could include concurrent ER90 testing
and a bottleneck analysis of the data flow.

Acknowledgements
I would like to thank the many people who
contributed to the success of this investigation. I am
especially grateful to Bryan White from Cray
Research, Kent Johnson and Barney Norton from
E-Systems, and Doug Spragg, Troy Brown and John
Cavanaugh from EUTeC for their insight and
guidance in conducting the transfer tests and in
critiquing this paper. Thank you.

Appendix A - FFIO Program Partial Listing

#define AMEG (long) 1024*1024
#define BSZ  (long) BLOCKS * 16384    /* 2097152 bytes */

/* do first write without timer on */
RET1=read(tapefd, buf, BSZ);
RET2=ffwrite(diskfd, buf, BSZ, &diskstb, FULL);
tfwrite=time((long *) 0);
do {
    bef_read=rtclock();
    RET1=read(tapefd, buf, BSZ);
    aft_read=rtclock();
    RET2=ffwrite(diskfd, buf, BSZ, &diskstb, FULL);
    aft_write=rtclock();
    bytes_w+=(long) RET2;
    printf("readtime %d clocks, writetime %d clocks\n",
           aft_read-bef_read,
           aft_write-aft_read);
} while ( RET1 == BSZ && RET2 == BSZ );
tfinish=time((long *) 0);
speed=( (float)bytes_w / ((float)AMEG*((float)tfinish-(float)tfwrite)) );
printf("Wrote %d bytes in %d seconds at %7.3f MB/s\n",
       bytes_w, tfinish-tfwrite, speed);

Appendix B - Asynchronous I/O Program Partial Listing

#define BUFF_SIZE (4 * 1024 * 1024)

/*
 * Priming Read
 */
statlist[0] = &rsw[0];
rc = read(in_fd, buffer[curr_buffer], BUFF_SIZE);
if (rc < BUFF_SIZE) {
    finished = 1;
    write_count = rc;
}
while (!finished) {
    rc = reada(in_fd, buffer[(curr_buffer+1) % 2], BUFF_SIZE, &rsw[0], 0);
    rc = write(out_fd, buffer[curr_buffer], BUFF_SIZE);
    rc = recall(in_fd, 1, statlist);
    if (rsw[0].sw_count < BUFF_SIZE) {
        write_count = rsw[0].sw_count;
        finished = 1;
    }
    bread += rsw[0].sw_count;
    curr_buffer = (curr_buffer+1) % 2;
    /* Start timer after first write */
    if (loops == 1) tfwrite=time((long *) 0);
    loops++;
}
rc = write(out_fd, buffer[curr_buffer], write_count);
tfinish=time((long *) 0);
printf("Wrote %d bytes in %d seconds at %10.3f MB/s\n",
       bread, tfinish-tfwrite,
       (float)bread/(1024*1024*(tfinish-tfwrite)));

Appendix C - Write Results
For easier comparison, we always used two-gigabyte files, user-striped across four disks, for all transfers.
Since these tests were run on a loaded system, the results were slightly lower than the 15 MB/s rate we had achieved
from memory to tape on an idle system.

Read from Disk or Memory / Write to Tape
(tape block size 4 MB in all cases; CPU seconds are for a ~2 GB transfer)

1. Like DMF: open(disk,O_RAW), synchronous read/write.
   Buffers: 1 user (no FFIO). CPU: 2.48 sys / 0.05 usr. Max rate: 7.117 MB/s.

2. Like DMF, but using FFIO: ffopen(disk,O_RAW), asgcmd -F system disk, synchronous read/write.
   Buffers: 1 user (FFIO). CPU: 1.12 sys / 0.06 usr. Max rate: 7.299 MB/s.

3. Double buffered using FFIO: ffopen(disk,O_RAW), asgcmd -F bufa:1024:2 disk.
   Buffers: 2 library + 1 user, 1024 blocks (4 MB) each (FFIO). CPU: 1.04 sys / 1.79 usr. Max rate: 13.699 MB/s.

4. Double buffered using FFIO: ffopen(disk,O_RAW), asgcmd -F bufa:1024:4 disk.
   Buffers: 4 library + 1 user, 1024 blocks (4 MB) each (FFIO). CPU: 1.07 sys / 1.79 usr. Max rate: 13.605 MB/s.

5. Double buffered using asynchronous I/O: opena(disk,O_RAW).
   Buffers: 2 user, 1024 blocks (4 MB) each (no FFIO). CPU: 0.85 sys / 0.01 usr. Max rate: 13.644 MB/s.

6. Memory to tape: open(tape,O_RAW).
   Buffers: 1 user. CPU: 0.18 sys / 0.02 usr. Max rate: 13.699 MB/s.

Appendix D - Read Results
For easier comparison, we always used two-gigabyte files, user striped across four disks, for transfers.
Since these tests were run on a loaded system, the results were slightly lower than the 15 MB/s rate we had achieved
from tape to memory on an idle system.

Read from Tape / Write to Disk or Memory

I/O methodology | Tape block size | Buffer size | Number of buffers | FFIO? | CPU seconds to transfer ~2GB | MAX transfer rate
Like DMF: open(disk,O_RAW), synchronous read/write | 4MB | n/a | 1 User | No | 1.61 sys / 0.05 usr | 6.770 MB/s
Like DMF, but using FFIO: ffopen(disk,O_RAW), asgcmd -F system disk, synchronous read/write | 4MB | n/a | 1 User | Yes | 2.58 sys / 0.06 usr | 6.549 MB/s
Double buffered using FFIO: ffopen(disk,O_RAW), asgcmd -F bufa:1024:2 disk | 4MB | 1024 blocks (4MB) | 2 Library + 1 User | Yes | 1.03 sys / 1.86 usr | 13.184 MB/s
Double buffered using FFIO: ffopen(disk,O_RAW), asgcmd -F bufa:1024:4 disk | 4MB | 1024 blocks (4MB) | 4 Library + 1 User | Yes | 1.17 sys / 1.85 usr | 13.098 MB/s
Double buffered using async I/O: opena(disk,O_RAW) | 4MB | 1024 blocks (4MB) | 2 User | No | 1.05 sys / 0.01 usr | 13.132 MB/s
Tape to memory: open(tape,O_RAW) | 4MB | n/a | 1 User | Yes | 0.15 sys / 0.02 usr | 13.158 MB/s

AFS EXPERIENCE AT THE PITTSBURGH SUPERCOMPUTING CENTER
Bill Zumach
Pittsburgh Supercomputing Center
Pittsburgh, Pennsylvania
Introduction
The Pittsburgh Supercomputing Center is one of four
supercomputing centers funded by the National Science
Foundation. We serve users across the country on projects
ranging from gene sequencing to modeling to graphics
simulation. Their data processing is done on either our C90,
T3D, CM-2 Connection Machine, MasPar, or our DEC Alpha
workstation farm. We have an EL/YMP for a file server
using Cray DMF.
To support the data needs of our external users as well as our
support staff, we use AFS, a distributed file system, to provide
uniform, Kerberos secure access to data and for ease of
management. To store the high volume of data, PSC designed a
hierarchical mass storage system to provide access to DMF as
well as other mass storage systems through AFS. We refer to
this as multi-resident AFS.
This paper discusses this mass storage solution and how we use
multi-resident AFS to provide access to data for our users. This
presentation is done in the form of a chronology of our need for
an ever larger and more flexible mass storage system. Also
described are the other sites using our ports of the AFS client
and multi-resident AFS. Lastly, we give a brief description of
future work with both AFS and DFS.

AFS at PSC
PSC uses the Andrew File System to serve the binaries for
upwards of 120 workstations and the home directories of about
100 staff. AFS provides a location transparent global name
space. This gives users the same view of the directory structure
no matter where they log in. It also makes workstation
administration much easier as the programs need only be located
in a single location. AFS also scales well, allowing for the huge
amount of data that needs to be managed at the PSC.
We chose AFS over the de facto standard NFS for several
reasons. First, NFS does not scale well. Most of the system
binaries, all home directories and many project areas, totaling
about 40 gigabytes, are currently stored in AFS. NFS has two
main difficulties dealing with this amount of data spread across
this many workstations. First, an NFS server becomes
overloaded with requests with a sufficiently large number of
clients. Second, administering the file name space quickly
becomes cumbersome if many partitions are being exported.
An NFS client needs to contact the server for each read or write
of a file. This quickly bogs down an NFS server. AFS on the
other hand caches the pieces of a file which are in use on the
client. The server grants a callback promise to the client,
guaranteeing the data is good. This guarantee holds until some
other client writes to the file. Thus, the AFS client only needs
to talk to the server for a significant change of state in a file. At
the same time, repeated reads and writes to a file by a single
client occur at near local disk speeds.
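The callback mechanism described above can be sketched as a client-side cache check. The structure and function names below are illustrative stand-ins, not the actual AFS data structures: the point is only that the client contacts the server when its callback promise has been broken, and otherwise reads from its cache.

```c
/*
 * Illustrative sketch of AFS-style callback checking. All names
 * and fields here are hypothetical; real AFS tracks callbacks per
 * file on both client and server.
 */
#include <string.h>

struct cache_entry {
    char fid[32];        /* file identifier                    */
    int  callback_valid; /* server promise still in force?     */
    char data[256];      /* cached file contents (toy size)    */
};

/* The server breaks the callback when another client writes. */
void break_callback(struct cache_entry *e) { e->callback_valid = 0; }

/* Read through the cache; only contact the (simulated) server
 * when the callback has been broken. Returns 1 if the server
 * had to be contacted, 0 if the cached copy was used. */
int cached_read(struct cache_entry *e, const char *server_copy,
                char *out, size_t len)
{
    int fetched = 0;
    if (!e->callback_valid) {
        /* Simulated fetch: refresh cache, regain the callback. */
        strncpy(e->data, server_copy, sizeof e->data - 1);
        e->callback_valid = 1;
        fetched = 1;
    }
    memcpy(out, e->data, len);
    return fetched;
}
```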

With NFS, each client can mount an NFS partition anywhere in
the directory structure. At PSC, most of the system binaries for
all workstation architectures are in the distributed file system.
This makes updating operating system software and data
processing packages extremely easy. But, for NFS this makes it
incumbent upon system administrators to be extremely careful
in setting up each NFS client. When the number of
workstations gets sufficiently high, this task becomes
cumbersome and subject to error. AFS has a location
independent, uniform global name space. So wherever a user
logs in from, they see the same directory structure. An AFS
client finds a file by looking in an AFS maintained database for
the server offering the file and then goes to that server for the
file. This all happens as part of the file name lookup and
nothing is explicitly mounted on the client.
One other significant feature of AFS is security. This comes in
two forms. First a user is authenticated using Kerberos security
and file transfers from server to client depend on that
authentication. This is a major improvement over NFS.
Secondly, AFS supports the notion of access control lists.
These lists apply to directories and give explicit permissions
based on the Kerberos authentication.
AFS also has the concept of a volume which is a logically
related set of files. Volumes are mounted in a manner similar to
disk partitions, that is, at directories in the AFS name space.
Volumes are typically used to house users' home directories, sets
of binaries such as /usr/local/bin and for space for projects. They
can have quotas attached to them for managing disk usage and
since they are mounted on directories, access control lists apply
to volumes as well. There can be several volumes per disk
partition, so they provide a finer control of disk quota allocation.
As quotas can be dynamically changed, disk usage can be
modified as well. Volumes can also be moved from partition to
partition and across servers, making data distribution
manageable. Dumps are done on a per volume basis, giving
more control over what gets backed up and when.
Volumes can also have backup volumes, which are read only
snapshots of the volume at a given time. One use of this at the
PSC is to maintain an OldFiles directory in each user's home
directory, which contains a copy of the home directory as it
appeared the previous day. This makes it extremely easy for the
user to get back that file they decided they should not have
deleted yesterday.
Volumes can also be cloned. These are also read only snapshots
of a volume and can reside on a different file server than the read
write original. This is useful for distributing the read load for
data across several machines. An AFS client randomly chooses
which read only clone to access if one is available when reading
a file. This is typically used for operating system binaries.
One last major concept in AFS is that of a cell. An AFS cell is
an administrative domain. A user authenticates to an AFS cell to


access the files in that cell. A cell is delimited by a given set of
file servers. This allows individual sites to maintain their own
set of secure file servers and to restrict access to a selected set of
users. At the PSC, we currently have two cells. One is a stock
production cell and the other is the cell which implements our
solution to the mass storage problem. We are currently looking
into merging these two cells.
AFS is produced by Transarc, and is supported on a wide variety
of platforms. Among the manufacturers supported are IBM, Sun,
DEC, HP and SGI. In addition, Convex and CDC provide AFS
for their machines. This list is by no means exhaustive.
Through the work at PSC, the AFS client has been available for
some time for Cray C90s and YMPs. We have recently ported
the multi-resident AFS file server as well.
From its inception, the PSC used AFS to store binaries and
users' home directories. AFS also served as the home directory
and project space for the Sun front ends to the CM-2 Connection
Machine. As stated earlier, this provided us with a uniform view
of the file system. Since almost all of our users are off site, they
could create their data on any AFS based client, and immediately
work with it at the center without having to explicitly move any
data to a particular machine's local file system. Thus, they could
examine their data on any of the front ends or their own work
station, process it on the CM-2 and view the results on their
own system.
This provided the impetus to port the AFS 3.1 client to the
Cray, a YMP at the time. This not only tied our main
processing computer into the distributed file system, but allowed
for users to easily split their processing tasks based on which
machine, CM-2 or YMP, was best suited to the task
without having to move their data.
As is usual, data storage demands began to outstrip our capacity.
The PSC decided to acquire a RAID disk system for fast, reliable
access to data. We settled on a RAID-3 system from Maximum
Strategy running on a Sun. The problem was that AFS does not
support anything other than native local file systems and the
RAID disks only had a user mode file system. To support this
file system, the file I/O sub-system of the AFS file server was
generalized to be a generic I/O system. We then had an AFS file
server running on a Sun which was able to use the RAID disk
bank as its local file system.
We soon found that, as the RAID disks stored data in 8-kilobyte
blocks, they were too inefficient at storing many of
the files generated by a typical user. To this end we wanted to
split the data up between SCSI disks, which are faster for
writing and reading small files, and the RAID disks, which are
more efficient at storing large files. Standard AFS only
determines where a file should be stored based on where the
volume is. That is, a volume's files are stored on the same
partition where the volume was created. The goal was to
separate the notion of a file from the storage medium. This gives
rise to the concept of a file's residency. That is, a file is in a
volume, but we also store information as to where the file's data
resides. In this case, files in a single volume could reside on
either the RAID disk or the SCSI disk. The location information
is stored in the meta-data in the volume. To determine where a
file should reside, one needs a stored set of characteristics for the
storage device. We call this a residency database. There is one
entry in the database for each storage device. Each entry


contains, among other things, the desired minimum and
maximum file size for that storage device. So, when the
modified AFS file server wants to store a file the residency
database is consulted to determine which storage device wants
files of that particular size.
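The residency-database lookup just described can be sketched as follows. The structure fields and names are illustrative; the paper specifies only that each entry records, among other things, the desired minimum and maximum file size for its storage device.

```c
/*
 * Sketch of a residency-database lookup by file size. Names and
 * layout are hypothetical, not the actual multi-resident AFS
 * structures.
 */
#include <stddef.h>

struct residency {
    const char *device;   /* e.g. "scsi", "raid", "cfs"        */
    long min_size;        /* smallest file this device wants    */
    long max_size;        /* largest file this device wants     */
};

/* Return the first residency whose desired size range contains
 * the file, or NULL if no device wants files of that size. */
const struct residency *
choose_residency(const struct residency *db, size_t n, long file_size)
{
    for (size_t i = 0; i < n; i++)
        if (file_size >= db[i].min_size && file_size <= db[i].max_size)
            return &db[i];
    return NULL;
}
```

In the configuration described above, small files would fall in the SCSI entry's range and large files in the RAID entry's range.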
As demands for storage continued to grow, we realized that we
needed some type of mass storage system. For some time, we
had been using Los Alamos' Common File System (CFS) for
our needs. This provided a simple get and put interface for users'
files, but is also rather cumbersome for users as one needs to
explicitly obtain one's files prior to using them in a program.
In order to tie CFS into AFS, several new concepts were
required. We needed to get files into and out of CFS from an
AFS file server. We needed to be able to move files
automatically into CFS so that they got archived. And we
needed to be able to free up disk space.
CFS did not run on any of our file server machines. We did not
want to port the AFS file server to Unicos and we did not want
to impact the performance of the YMP by making it a file
server. The solution was to have a small server running on the
YMP which handled file requests from an AFS file server. We
refer to this small server as a remote I/O server. So, AFS clients
would request a file from the file server. The volume meta data
indicates the file is remote and the file server sends an RPC to
the remote I/O server to deliver the file to the file server, which
in turn delivers the file to the AFS client. In this case, a file in
CFS would be spooled onto the Cray by the remote I/O server
which would hand it back to the AFS file server who sends the
data to the client. This made CFS access transparent to the user.
In order to migrate files from the disks local to the AFS server
we needed a mechanism which would do this automatically and
on a timely basis. This data migration process is accomplished
by a daemon running on the AFS file server machine which
periodically scans the disk, looking for older files which are
candidates for migration to slower storage devices. In our case
this meant scanning the SCSI and RAID disks, looking for files
to migrate to CFS. This also meant that the residency data base
entries needed an entry which indicated how old a file should be
before being migrated to that storage system. We decided that if
a file had not been accessed in 6 hours, it was a candidate for
migration to mass storage. When the scanning daemon finds
files which are 6 hours old it informs the AFS file server on the
same machine and the file server is in charge of moving that file
to CFS using the remote I/O server.
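The scanning daemon's six-hour rule can be reduced to a simple candidate test. This is a sketch of the policy only; the real daemon walks the SCSI and RAID partitions, stats each file, and notifies the file server, which moves candidates to CFS via the remote I/O server.

```c
/*
 * Sketch of the migration-candidate test described above: a file
 * not accessed in six hours becomes eligible for migration to
 * mass storage. Policy logic only; names are illustrative.
 */
#include <time.h>

#define MIGRATE_AGE_SECONDS (6 * 60 * 60)   /* six hours */

int is_migration_candidate(time_t last_access, time_t now)
{
    return (now - last_access) >= MIGRATE_AGE_SECONDS;
}
```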
Now just because a file is old, that alone does not mean it
should be deleted from faster storage media. So we leave the
original copy of the file on the faster media. This means there
are now two copies of the file. One on either SCSI or RAID
disks and the other copy in CFS. This is called multiple
residency and gave rise to the name multi-resident AFS. A
residency is defined to be a storage device along with the list of
file servers which can access that storage device on behalf of an
AFS client. We store a list of all residencies for a given file in
the volume's meta-data so the file server knows where it can go
to find the file an AFS client requests. Note that one does not
want to go to CFS if the file is on local disk. This gives rise to
the concept of a priority for a residency, and each residency
entry in the database contains a priority. While priorities can be
arbitrary, we set priorities based on speed of file access. This
means that if a file is both on a disk local to the AFS file server

as well as in CFS, the local disk copy would be obtained for a
client, since it's at a higher priority.
Since we don't automatically delete a file from a given residency
once it has been moved to tape, it is quite likely that the local
disks would soon fill up. To avoid this problem, each fileserver
machine has a scanning daemon running on it which ensures
that older, migrated files are removed from the disk, once a free
space threshold is reached. The removal algorithm is based on
file age and file size.
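The reclamation policy above can be sketched in two small pieces. The paper states only that removal is triggered by a free-space threshold and is "based on file age and file size"; the scoring function here (age times size, larger score removed first) is an illustrative stand-in, not the actual algorithm.

```c
/*
 * Sketch of threshold-triggered space reclamation. Both functions
 * are hypothetical illustrations of the stated policy.
 */

/* Reclamation starts once free space drops below the threshold. */
int needs_reclaim(long free_bytes, long threshold_bytes)
{
    return free_bytes < threshold_bytes;
}

/* Higher score = better removal candidate: prefer files that are
 * both old and large. Only already-migrated files are scored. */
double removal_score(long age_seconds, long size_bytes)
{
    return (double)age_seconds * (double)size_bytes;
}
```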
We shortly encountered a major problem using CFS for mass
storage. While CFS works well for storing files, the transaction
overhead on each file update is quite high, on the order of 3
seconds. This causes problems with backup volumes in AFS.
When a backup is made, the volume is first cloned and the
volume is off line until the clone is complete in order to ensure
a consistent image of the volume. Also, in multi-resident AFS,
if a file is not on the local disk, it is not explicitly copied. Its
reference count is incremented instead. This means that if the file
is in CFS, CFS takes 3 seconds to increment the file's reference
count. So cloning a volume with 1200 files in CFS would
mean that the volume would be offline for an hour.
It was not possible to fix this transaction time overhead problem
in CFS. As a result we investigated other mass storage systems
and settled on Cray's Data Migration Facility (DMF). DMF
provided us with simple access to the tape media simply by
placing the files on a DMF backed partition. As multi-resident
AFS is already taking care of migration policy, data landing on
this DMF backed partition is already considered to be on slow
media, so we explicitly call dmput to flush the data to tape and
dmget to retrieve required files upon demand.

Current Usage and Performance

Figure 1. Current multi-resident system
A part of our current multi-resident AFS configuration is
presented in figure 1. This figure shows two RS6000s for fast
storage. Both have SCSI disks for small files and Maximum
Strategy RAID disks for larger files. For archival storage we are
currently sending the data to either the C90 or the EL/YMP,
both using DMF for tape storage. We are in the process of
migrating all our DMF usage from the C90 to the EL. DMF
originally only wrote files larger than 4 kilobytes to tape, so we

only archive files larger than this. The small files are backed up
using standard AFS backup practices. We have modified the
standard AFS dump routines so that only files actually present
on the local disk are backed up. Our AFS mass storage system
currently contains approximately 383,000 files and about 98.7
gigabytes of data used by 100 users.
AFS Server    Media        Network    Read Speed
DS 3100       SCSI drive   Ethernet    360 KB/sec
DS 5000/200   SCSI drive   FDDI        726 KB/sec
Sun 4/470     IPI drive    Ultranet     20 KB/sec
Sun 4/470     IPI drive    Ethernet    554 KB/sec
Sun 4/470     RAID drives  FDDI        906 KB/sec
Sun 4/470     IPI drive    FDDI        986 KB/sec
IBM RS6000    IPI drive    FDDI       1667 KB/sec
EL/YMP        IPI drive    HIPPI       497 KB/sec
Table 1. Cray AFS client read performance.
Table 1 presents the read performance we see on our C90 AFS
client from a variety of servers. Note that, while the EL file
server performance is not spectacular, it is reasonable. The
performance is somewhat slow due to the fact that we are using
the Unix file system as the interface to the disk. This involves a
lot of system overhead in opening and reading meta-data files
which contain file location and access information. It would be
possible to develop an interface for the Cray similar to the one
standard AFS uses to obtain an appreciable improvement in file
server performance. This would require modifying the fsck
program in a fairly straightforward manner and adding 5 system
call entry points to the kernel.

PSC Ports to Unicos of AFS
What follows is a brief technical discussion of the details of
porting the AFS 3.1 client and multi-resident AFS to Unicos.
As mentioned earlier, we ported the AFS 3.1 client early on in
order to give uniform access to data to users using the YMP. We
ported multi-resident AFS for use by the Max Planck Institute in
Garching, Germany.
The initial AFS client port was to Unicos 6.0 on the YMP. This
port has since been upgraded to Unicos 7.C on our C90. There
were several substantive porting issues. First, Unicos, until
version 8.0, uses a file system switch, whereas AFS is vnode
based. This meant a fake vnode system needed to be designed to
map AFS vnodes for the cache files and the server files to
Unicos NC1 inodes. There are also a large number of problems
associated with the 64 bit word size and structure alignment.
These problems appear in the packets which get sent across the
network, AFS directory layout, and data encryption algorithms.
In addition, since a Cray is a physical memory machine, a
buffer pool needed to be devised to handle malloc'ing data areas.
Lastly, we had to find every place where the kernel could sleep
and ensure either that none of the variables in use across the sleep
were stack variables, or fix up the stack addresses once the kernel
came back from the sleep. This is another problem which has
been fixed in Unicos 8.0.
We are now in the process of porting the AFS 3.3 client to
Unicos 7.C as part of the final plan to port the AFS 3.3 client
to Unicos 8.0. The AFS 3.3 client is expected to run much
faster than the AFS 3.1 client owing to improvements in the
network layer, developed initially here by Jonathan Goldick.
Further performance enhancements have also been added by


Transarc for AFS 3.3 and we also are beginning to investigate
making further performance enhancements to the client.
Multi-resident AFS was written with Unicos in mind, so
combined with the effort that had gone into the port of the AFS
client, this port was much easier. Multi-resident AFS is based
on AFS 3.2, whereas the Unicos client is based on AFS 3.1.
The client and server share several common libraries, including
the RPC layer, AFS directories, and data encryption for
authentication. Porting these involved bringing the Cray port of
AFS 3.1 up to AFS 3.2. There remained a number of 64 bit
issues for the file server, including handling internet addresses in
the hostent structure as well as a few word alignment issues. In
addition, we need to lock a Unix file system when salvaging
(AFS version of fsck for volumes). Multi-resident AFS depended
on locking a directory using flock which is not possible with
Unicos. The AFS vnode structure needed integer data types
converted to bitfields and is now twice the size of the vnode for
standard AFS. But this helped us optimize the file server as well
as allowing volumes to move between Cray AFS file servers
and AFS inode based file servers. One additional modification
was to port the dump and restore routines to correctly dump
access control lists. These were previously dumped as blobs of
data. But with the change from 32 to 64 bit word size, we needed
to ensure we converted during reads and writes of the dump.
We are currently in the process of porting multi-resident AFS
3.2 to the AFS 3.3 code base. Most of the port is complete and
we are now in the process of testing multi-resident AFS 3.3 on
several platforms.

Other Sites Using Multi-Resident AFS
The Max Planck Institute, IPP, in Garching, Germany recently
purchased a Cray EL to serve as a multi-resident AFS file server
and to support DMF for mass storage. This is currently
beginning operation and should be a full production environment
this summer.
NERSC is currently testing multi-resident AFS at their facility
and intends to use the Unix file system interface to connect to
Unitree. In addition Transarc is evaluating multi-resident AFS
and if they decide to make a product of it, it will be available by
the end of 1994. This Transarc product will not provide direct
support for Unicos, but will retain the modifications we have
made.
Several sites use our port of the AFS client in a production
environment. Among them are NCSA in Illinois, SDSC in San
Diego, MP/IPP, LRZ in Munich, ETH in Zurich, RUS in
Stuttgart, and EPFL in France. These sites appear to be satisfied
with the AFS client.

DFS projects at PSC
As part of our close working relationship with Cray, we did the
initial port of the DFS client for Unicos 8.0. During the course
of this work we have also assisted in debugging the DFS file
server.
We are beginning to think about the design of a multi-resident
version of DFS. This will be backwards compatible with
multi-resident AFS and will be able to use the same AFS-to-DFS
translator that Transarc is supplying. DFS is still immature and


there are several basic design questions which need to be
answered, particularly with regard to DFS filesets before we can
devote a lot of time to this project.

Next Generation of Multi-resident AFS
As noted above, network performance is improved dramatically
in AFS 3.3 as a result of initial work done here at the PSC with
respect to packet size over FDDI. Our initial tests indicate a
doubling in file transfer rate with a doubling of the packet size
for FDDI. This provided the initial spark to consider adding
alternate methods of delivering file data from a storage device to
an AFS client. If one had the full bandwidth of HIPPI available
and both the residency and the AFS client were on the same
HIPPI switch, spectacular improvements in data transmission
speeds could be achieved. So the means of asking for the data
needs to be separated from the actual delivery of the data. In this
scenario, the AFS file server serves as an arbitrator, deciding
what is the best network transport (and storage device) to use to
get data to the client. This notion is referred to as third party
transport, the third party in this case being the storage system
offering the file's data, with the first two parties being the client
and the file server. Most of the software is written for this
generation and we are at the point of beginning to debug it.

References
Collins, Bill, Debaney, Marjorie, and Kitts, David, Profiles in
Mass Storage: A Tale of Two Systems, IEEE.
Satyanarayanan, M., Scalable, Secure, and Highly Available
Distributed File Access, IEEE Computer, May 1990, pp. 9-21.
Satyanarayanan, M., A Survey of Distributed File Systems,
CMU-CS-89-116.
Nydick, D., et al., An AFS-Based Mass Storage System at the
Pittsburgh Supercomputing Center, Proc. Eleventh IEEE
Symposium on Mass Storage Systems, October 1991.
Goldick, Jonathan, et al., An AFS-Based Supercomputing
Environment, Proc. Twelfth IEEE Symposium on Mass Storage
Systems, April 1993.

AFS Experience
at the University of Stuttgart
Uwe Fischer, Dieter Mack
Regionales Rechenzentrum der Universitat Stuttgart
Stuttgart, Germany
Since late 1991 the Andrew File System (AFS) has been in use
at the University of Stuttgart. For the Center's Service
Cluster, comprising more than 15 RISC workstations, it
is a key component in providing a single-system image.
In addition, new services like distribution of public
domain or licensed software use AFS wherever it
is appropriate. In the long run, the introduction of
AFS was one of the first steps into the emerging OSF/
DCE technologies.

1. About the Center

The University of Stuttgart Regional Computer Center
(RUS) provides computing resources and related services to the
academic users of the university. In particular, the supercomputing
service is available to all other state universities in Baden-Württemberg
and to corporations under public law as well as industrial
customers.

2. Chronology

Back in 1989/90 the Center started the process of replacing
its midrange-type mainframes front-ending the CRAY-2 supercomputer.
Focusing on UNIX derivatives as the major operating
system was a joint intention. As a result, in 1991 this effort led
to the challenging task of investigating whether and how a RISC-based
workstation cluster could cover the typical mainframe-based
services within a university environment.

OSF/DCE - and DME - were a name and a concept at that time,
but product availability was not expected within the next two
years. And this was far too long ahead in the future. Thus the Center
formed a DCE project to investigate and evaluate the forthcoming
technologies of distributed computing. RUS asked
vendors for support, and only IBM was capable of providing it.
As DCE is not built from scratch but evolves and melds existing
technologies, wherever the functional components existed as
independent products they could be used to start with. AFS, being
the predecessor of the Distributed File Service (DFS), one of the
extended DCE services, which might also be regarded as one of
the major DCE applications, is one of them. Thus it is obvious
that our AFS- and DCE-related milestones are tightly coupled.

The first tasks after the DCE project had been formed in
August 1991 were to work with the xntp time service and the Kerberos
security system. In November 1991 the RUS cell rus.uni-stuttgart.de
was the first AFS cell installed in Germany. During
summer 1992 RUS took part in IBM's AIX DCE Early Participation
Program. At that time the SERVus workstation cluster was
installed and AFS run in a preproduction mode. In November
1992 the DCE cell dce.rus.uni-stuttgart.de was configured and the
port of an AFS client to the CRAY Y-MP 2E file server was
completed. In January 1993 the service cluster, joined with AFS,
went into full production, replacing the midrange-type mainframes,
which were finally shut down by the end of March 1993.
Since summer 1993 a DFS prototype has been running, and in September
1993 the attempts to port the AFS client to the CRAY-2 were
dropped.

3. Configurations

Since late 1986 the Center has run a CRAY-2 as its main
supercomputer resource. As shown in figure 1, this will be
replaced in April 1994 by a CRAY C94D. The CRAY Y-MP 2E is
used for high-end visualization applications and as a base for the
mass storage service. In the middle layer, the SERVus workstation
cluster consists of IBM RS/6000 systems with models ranging
from a 580 down to 220s. Today, the cluster is going heterogeneous
by incorporating a multi-processor Sun SPARCserver. Often
neglected in such a picture drawn from a center's perspective is
that the workstations on campus now number several thousand.

The shadowed area shows the AFS file space, with the
AFS server machines tightly coupled to the SERVus cluster. The
CRAY file server in the AFS context is acting only as a client, and
it distributes its mass storage service to all requesting systems on
the campus using NFS.


[Figure 1: RUS Configuration - file servers, compute servers,
and massively/distributed parallel systems (MasPar MP-1216,
Intel), serving IPVR, ICAII, and RUS]

AFS Cells
As of today, the University of Stuttgart has 4 registered
AFS cells. There is the main cell rus.uni-stuttgart.de, which
houses all the HOME directories for the workstation cluster
machines. In addition, public domain and licensed software is
distributed using AFS for those platforms where AFS clients are
available. And one particular university department placed its
AFS file server in this cell.
There is a second cell, rus-cip.uni-stuttgart.de, dedicated
to a workstation pool to which students have public access.
Due to the deficiencies in delegating the administration
of dedicated file servers, two university departments are running
their own AFS cells, ihJ.uni-stuttgart.de and mathematik.uni-stuttgart.de.
And there are more departments which show interest in
using or running AFS file servers.

rus AFS Cell Configuration
The main cell was upgraded to AFS 3.3 last month.
It is based on a total of 4 file server machines. Three of them are
provided by the Center, and the AFS database services are replicated
on them. One is owned by a university department and run
as their dedicated file server. All servers are IBM RS/6000 workstations,
and the total disk capacity controlled by AFS is 19 + 8
GB. By a campus license agreement, AFS client software is
available for a variety of platform-OS combinations:

DEC DECstation          Ultrix 4.0, 4.3
DEC VAXstation          Ultrix 4.0, 4.3
HP 9000 Series 700      HP-UX 9.0
IBM RS/6000             AIX 3.2
NeXT NeXTstation        NeXT OS 3.0
SGI 3000, 4000 Series   IRIX 5.0
SGI Indigo              IRIX 5.0, IRIX 5.1
SUN 3, 4, 4c            SunOS 4.1.1 - 4.1.3
SUN 4m                  SunOS 4.1.2, Solaris 2.2, 2.3

4. Purpose
Key Component for the Service Cluster
As stated above, AFS is of strategic use for the SERVus
workstation cluster in providing a single-system image. First, the
login authentication is done using only the Kerberos authentication
service, which is part of AFS. Second, all the HOME directories
of about 1600 users are within AFS. Thus every user has a
consistent view of and access to all his data, regardless of which
machine of the cluster he is using. In addition, the NQS batch subsystem
running on the batch worker machines of the cluster has
been modified so that a user's NQS job gets authenticated
and gains the appropriate authorization for file access.

Software Distribution
The basic software products available on the service
cluster RS/6000 machines are installed only once, in AFS, e.g. C
and Fortran compilers, X11 and application software. That is essentially
everything except the software needed to bring up a single system.
During the 2nd half of 1993, RUS developed a concept for
distributing software to workstation users. The basic idea is that not
every user or system administrator has to care about software
installation and maintenance; instead this is done only once, at the
Center. All platform-OS specific executables are available for
direct access by linking a specific directory into the user's search
PATH, and in addition the software is ready to be pulled and
unfolded at the client side. As the underlying technology AFS is used
where appropriate, else NFS.
Thus without any big effort a huge variety of public
domain software is available at every workstation. Licensed
software can be made available as well, using the AFS access
control mechanism to restrict access to authorized users. Currently
this scheme is being introduced to distribute PC software.

Statistics
The three RUS AFS file server machines have allocated 17
disks (partitions) with 19 GB capacity. There are about 1500 user
HOME directory volumes spread over 7 partitions, with an
assigned disk quota of 15 GB and an effective usage of 3.7 GB.
For the purpose of software distribution there are about 280 volumes in use, with an assigned disk quota of 9.4 GB and a data volume of 6.2 GB.

5. UNICOS Client
The AFS client installation on the CRAY Y-MP 2E happened at the time of the UNICOS upgrade from 6.1 to 7.0. Thus
the first tests used a Pittsburgh Supercomputing Center code for a
UNICOS 6 system, but finally a derivative of PSC's UNICOS 7.C
AFS client went into production. The port itself was a minor
effort.
Because the CRAY Y-MP 2E as a file server runs neither
a general interactive nor a batch service, the AFS client is used
mainly for AFS backups. Every night, all AFS volumes are
checked, and for modified ones the backup clone is "vos dumped"
into a special file system on the CRAY Y-MP 2E whose purpose is
to store backup data. This file system is subject to the Data Migration
Facility, thus the tape copies to 3480 cartridges stored in two ACS
4400 silos are handled by DMF's tape-media-specific processes.
This solution replaced the previous procedures, which
took the volume dumps on an RS/6000-based AFS client and wrote
the dump files via NFS to the file server. Today's procedures are
more reliable and robust, using AFS's Rx protocol rather than
NFS over UDP/IP. Better transfer rates are assumed as well, but an honest
comparison is not possible, because at the time of the change serious
NFS bottlenecks on the file server side had been detected and
eliminated.
The Center intended to integrate the CRAY-2 supercomputer into its AFS environment, but due to major differences in
UNICOS internal structures the port of the AFS client code
turned out to be a complicated and time-consuming task. In view
of the imminent replacement of the CRAY-2, this effort has
been dropped.

6. Key Concepts
Security
AFS security comprises authentication, which is based
on a Kerberos variant, and authorization via access control lists
(ACLs), placed on directories.
The proof of a user's identity is not guaranteed by local
authentication but by a Kerberos token, which has to be acquired
from a server on the network. Authorization to access data can be
granted to individual users as well as user-defined groups. This
allows for much finer granularity of access rights than the UNIX
mode bits.

Volume ( Fileset ) Concept
One of the fundamental technologies of AFS is the volume concept. The name of these conceptual containers for sets of
files changes to fileset in DFS. A volume corresponds to a directory in the file tree, and all data underneath this subtree, except for
other volume mount directories, is kept in that volume.
Volumes are the basis for location transparency. AFS is
able to determine on which file server and physical partition a
volume resides. Thus once an AFS client has been set up, the user
needs no further configuration; something like registering new NFS file servers and file systems at every client side is not necessary.
Volumes decouple the files from the physical media.
There are mechanisms to move volumes, transparently to the user,
from one partition to another, even between different file servers.
Thus disk usage balancing becomes a manageable task. In the NFS
context, assigning user directories to new file systems could not be
realized without impacting the users.
Volumes represent a new intermediate granularity level.
When running data migration on file systems, disk space is no longer the
limiting factor. Approaching 1 million files on the
CRAY Y-MP 2E mass storage file system, it turns out, as expected,
that the number of files (i-nodes) is the critical resource.
Volumes can be cloned for replication or backup purposes. Through replication high availability of basic or important
data is achievable. In addition, network and access load could be
balanced.


Global File System
Server and client machines are grouped into organizational units called cells, but it is possible to mount volumes which
belong to foreign cells. This allows for the construction of a truly
global file system. The AFS file space on the Internet currently
comprises about 100 cells.

PSC's Multiple-Residency Extensions
There is great potential in this feature. First, it provides
the hooks to integrate hierarchical storage management schemes
into the distributed file system. In addition, the separation of file
data from file attributes is the first step toward third-party transfers.
Thus functionality similar to NSL UniTree might be achieved.

7. User Barriers
Today's users are used to working with NFS. In small
configurations using NFS is quite straightforward, especially
when the user is relieved of system administration tasks and the
system administrator has done a good job. Starting with AFS
requires some additional setup work and extra knowledge.

File Access Control
First, the user has to acquire a token. Although this can
be achieved transparently by the login process, the token has a
finite lifetime, which has to be considered and taken care of. Access
control is governed not only by the UNIX mode bits, but also by
the ACLs. The user has to be aware of this and familiarize himself with some new procedures and commands.

AFS Cache Configuration
An AFS client has to provide some cache space, either
memory-resident or disk-based. It is highly recommended that the
disk cache be provided in a separate file system, because the
cache manager relies on the specified cache capacity; if the
file system that contains the cache runs out of space, the
results are unpredictable.
But in most cases the available disks are completely
divided into partitions already in use, so providing a separate file system
for the AFS cache often requires disk repartitioning, which is
not an easy task.

8. Direction
Consulting Transarc's worldwide CellServDB gives an
indication of the presence of AFS. Although an AFS cell need
not be registered there, in some sense representative numbers
can be derived from that information. As of March 1993, a total
of 12 AFS cells were in place in Germany, all running at universities except for one at a research laboratory. Seven of those
cells are based in the local state of Baden-Württemberg (Germany
consists of 16 states), five of them in the state capital Stuttgart. Given
those numbers it becomes quite obvious that there are focal points.


The general trend of workstation clusters replacing
general-purpose mainframes might be more common in university-like environments than at commercial sites. Distributed
computing technologies are of key importance for the success of
integrating not only the clusters themselves but linking all general and
specialized computers together. With this functionality in place,
distributed computing will not stay restricted to single sites. As an example, the state universities have started collaborative
efforts on software distribution and a "state-wide" file system; the
latter can be provided by OSF/DCE's Distributed File Service.
Provided DCE/DFS is mature enough and supported by
a sufficient number of vendors, RUS might decide to switch from
AFS to DFS servers by the end of this year. One of the major arguments is to use the CRAY Y-MP 2E as a file server in the DCE/
DFS context.

References
Goldick, J. S., Benninger, K., Brown, W., Kirby, Ch.,
Nydick, D. S., Zumach, B. "An AFS-Based Supercomputing Environment." Digest of Papers, 12th IEEE Symposium on Mass Storage Systems, April 1993.
Lanzatella, T. W. "An Evaluation of the Andrew File
System." Proceedings, 28th CRAY User Group Meeting,
Santa Fe, September 1991.
Mack, D. "Distributed Computing at the Computer Center of the University of Stuttgart." Proceedings, 29th
CRAY User Group Meeting, Berlin, April 1992.
Mack, D. "Experiences with OSF-DCE/DFS in a 'Semi-Production' Environment." Proceedings, 33rd CRAY
User Group Meeting, San Diego, March 1994.
Nydick, D., Benninger, K., Bosley, B., Ellis, J., Goldick,
J., Kirby, Ch., Levine, M., Maher, Ch., Mathis, M. "An
AFS-Based Mass Storage System at the Pittsburgh
Supercomputing Center." Digest of Papers, 11th IEEE
Symposium on Mass Storage Systems, October 1991.
OSF. "Introduction to DCE." Preliminary Revision
(Snapshot 4), for OSF Members only, June 1991.
Transarc. "AFS System Administrator's Guide."
FS-D200-00.10.4, 1993.
Wehinger, W. "Client-Server Computing at RUS."
Proceedings, 31st CRAY User Group Meeting, Montreux, March/April 1993.
Zeller, Ch., Wehinger, W. "SERVus - Ein RISC-Cluster
für allgemeine Dienstleistungen am Rechenzentrum der
Universität Stuttgart." To be published.

CRAY Research Status of the DCE/DFS Project
Brian Gaffey
CRAY Research, Inc.
Eagan, Minnesota

DCE is an integrated solution to distributed computing.
It provides the services for customers to create distributed
programs and to share data across a network.
These services include timing, security, naming, remote
procedure call, and a user-space threads package. A key
component of DCE is the Distributed File System
(DFS). This talk will review CRI's plans for DCE,
relate our early experiences porting DCE to UNICOS,
and describe the issues related to integrating DCE into
UNICOS.

1. Distributed Computing Program
DCE is part of the Distributed Computing Program. The Distributed Computing Program defines the
overall requirements and direction for many sub-programs. These sub-programs cover the major areas of
the system needed to support the distributed computing
model. More detail for each can be found in the program roadmaps. The intent is to show how each of these
sub-programs supports the goals of distributed computing. The highest-level description, called the Distributed
Computing RoadMap, describes the entire Program. A RoadMap exists for
each of the sub-programs, which in turn is a summary of
product presentations. The other roadmaps are as follows:
Distributed Job
Distributed Data
Connectivity
Distributed Programming
Network Security
Visualization
Distributed Administration
OSF DCE is covered in three of the RoadMaps:
Distributed Programming, which includes threads,
RPC/IDL, and naming; Distributed Data, which includes
the Distributed File System (DFS); and Network
Security, which includes the DCE Security Services.

2. Distributed Computing Framework
This Framework represents a future CRI architecture that meets the needs of Distributed Computing. All
Programs are represented, but not necessarily in complete detail. Components of Federated Services
(X/Open Federated Naming and the Generic Security
Switch [GSS]) such as NIS and Kerberos also exist
today but are not federated. Distributed System
Administration will track industry standards such as
OSF's DME or COSE's working group. Meanwhile
CRI will provide products to address the needs of customers in a heterogeneous environment. Nearly everything in our Framework is a standard or a de facto standard, and nearly all of the software in our Framework was
obtained, or will be obtained, from outside CRI. The Framework represents the elements which are
essential to high-performance supercomputing and to
our strategy of making connections to Cray systems
easy.
OSF DCE is a key element in the Framework.
DCE services will co-exist with ONC and ONC+ services at the RPC, Distributed File System, Security, and
Naming levels of the model. New services, such as
CORBA, will be built on top of DCE services.

3. Product Positioning
Architecturally, DCE lies between the operating
system and network services on one hand, and the distributed applications it supports on the other. DCE is
based on the client/server model of distributed computing. DCE servers provide a variety of services to
clients. These services are of two types: Fundamental
Services, i.e. tools for software developers to create the
end-user services needed for distributed computing
(the distributed applications); and Data Sharing Services,
i.e. distributed file services for end-users and software
developers. Clients and servers require common networking protocol suites for communication; they may
run DCE on the same or different operating systems.
CRI will support all of the client services of DCE.
CRI will also support the DFS server facility. CRI has
no plans to support security, directory, or time servers on
UNICOS.

Copyright © 1994. Cray Research Inc. All rights reserved.

4. DCE Component Review
The DCE source is an integrated set of technologies. The technologies rely upon one another to provide
certain services. For example, all of the services rely on
threads, and most of the services make use of RPC to
accomplish their task. The following is a review of the
major components of DCE.
4.1. Threads in user space
In many computing environments, there is one
thread of control. This thread of control starts and terminates when a program is started and terminated.
With DCE threads, a program can make calls that start
and terminate other threads of control within the same
program. Other DCE components/services make calls
to the threads package and therefore depend on DCE
threads.
Cray already provides the following mechanisms
which allow for additional threads of control:
1. Autotasking
2. Multitasking (macrotasking)
3. Multitasking (microtasking)
4. UNICOS libc.a multitasking
Autotasking allows directives that use items 2 and 3
to be inserted automatically. Items 2
and 3 are a set of UNICOS library routines which provide multiple threads of execution. They may or may
not use item 4 to manage the multiple threads of execution.
Item 4 is a low-level set of UNICOS system calls and
library routines which provide a multithreaded environment. The interfaces provided by these four mechanisms
are Cray proprietary and therefore not a "standard."
The DCE threads interface is based on the Portable Operating System Interface (POSIX) 1003.4a standard (Draft 4). This interface is also known as the
Pthreads interface. DCE threads has also implemented
some additional capabilities above and beyond the
Pthreads interface.
In CRI's product, the DCE thread interface routines are mapped directly to existing Multitasking
(macrotasking) routines. This can be configured to
restrict all threads to be within one real UNICOS thread
or to allow for multiple UNICOS threads. With this
approach, existing Multitasking (macrotasking) implementations function correctly. The downside to this
approach is that not all of the DCE Threads functionality
can be provided in the short term; for example,
multiple scheduling algorithms cannot be requested.

4.2. Threads in the kernel
In addition to threads in user space, the DFS kernel components require threads in the kernel. Actually,
DFS relies on RPC runtime libraries which use the
pthreads interface. The pthreads interface in the kernel
maps into newproc(), which creates a new process in the
kernel. This process is scheduled as a normal process,
not as a thread.
4.3. RPC and IDL
RPC (Remote Procedure Call) allows programmers to call functions which execute on remote
machines by extending the procedure interface across
the network. RPC is broken into kernel RPC, used only
by DFS, and user-space RPC, which is used by most
other DCE components. DCE provides a rich set of
easy-to-use interfaces for creating remote processes,
binding to them, and communicating between the components.
Interfaces to RPC functions are written in a C-like
language called the Interface Definition Language (IDL).
These interfaces are then compiled with the IDL compiler to produce object code or C source code stubs. The
stubs in turn are linked with the programmer's code and
the RPC libraries to produce client and server executables.
A few technical items to note:
-- communication, naming and security are handled
transparently by the RPC runtime library
-- the network encoding is negotiable, but currently
only Network Data Representation (NDR) is supported
-- "receiver makes right" which means that machines
with similar network data types will not need to do data
conversions
-- DCE RPC supports three types of execution
semantics: "at most once", idempotent (possibly many
times), and broadcast
-- RPC will run over TCP/IP or UDP (with DCE
RPC providing transport mechanisms)
CRI plans to rely on ONC's NTP protocol for
clock synchronization, since it is already implemented
and a single system cannot have two daemons changing
the system clock.
4.4. Directory Services
Directory services is the naming service of DCE.
It provides a universally consistent way to identify and
locate people and resources anywhere in the network.
The service consists of two major portions, the Cell
Directory Service (CDS) which handles naming within

a local network or cell of machines and the Global
Directory Agent (GDA) which deals with resolution of
names between cells.
Applications requiring directory information will
initiate a CDS client process on the local machine called
a Clerk. The Clerk resolves the application's query by
contacting one or more CDS Servers. The Servers each
physically store portions of the namespace with
appropriate redundancy for speed and replication for
handling host failures. Queries referencing objects
external to the local cell will access the GDA to locate
servers capable of resolving the application's request.
When the GDA is resolving inter-cell queries, it
uses either the Global Directory Service (GDS) or
Domain Name Service (DNS). GDS is an X.500 implementation that comes with the DCE release, while DNS
is the Internet distributed naming database. Both of
these services will locate a host address in a remote cell
and pass this value back to the CDS Clerk who will then
use it to resolve the application's query.
4.5. Security Services
DCE security services consist of three parts: the
Registry service, the authentication service, and the
privilege service. The Registry maintains a database of
users, groups, organizations, accounts, and policies. The
authentication service is a "trusted third party" for
authentication of principals. The authentication service
is based on Kerberos version five with extensions from
HP. The Privilege Service certifies the credentials of
principals. A principal's credentials consist of its identity
and group memberships, which are used by a server
principal to grant access to a client principal. The
authorization checking is based on POSIX Access Control Lists (ACLs). The security service also provides
cryptographic checksums for data integrity and secret
key encryption for data privacy.
Supporting DCE security on UNICOS can lead to
compatibility problems with current and future
UNICOS products. Specifically, DCE ACLs are a
superset of POSIX ACLs and DCE's Kerberos is based
on version five whereas UNICOS's Kerberos is based
on version four. We don't believe an application can be
part of a V4 realm and a V5 realm. Also, the two protocols aren't compatible, but it is possible to run a Kerberos server that is capable of responding to both version 4 and version 5 requests.
4.6. Distributed File System
The Distributed File System appears to users as a
local file system with a uniform name space, file location transparency, and high availability. A log-based
physical file system allows quick recovery from server

failures. Replication and caching are used to provide
high availability. Location transparency allows easier
management of the file system because an administrator
can move a file from one disk to another while the system is available.
DFS retains the state of the file system through
the use of tokens. Data can be cached on the clients by
granting the client a token for the appropriate access
(read/write). If the data is changed by another user of
the file, the token can be revoked by the server, thus
notifying the client that the cached data is no longer
valid. This can't be accomplished with a stateless file
system, which caches data for some period of time
before assuming that it is no longer valid. If changes
are made by another user, there is no mechanism for the
server to notify the client that its cached data is no
longer valid.
DFS supports replication, which means that multiple copies of files are distributed across multiple
servers. If a server becomes unavailable, the clients can
be automatically switched to one of the replicated
servers. The clients are unaware of the change of file
server usage.
DFS uses Kerberos to provide authentication of
users and an access control list mechanism for authorization. Access Control Lists allow a user to receive
permission from the file server to perform operations on
a particular file, but at the same time access to other
files can be denied. This is an extension of UNIX file
permissions, in that access can be allowed or denied on
a per user basis. UNIX allows authorization based on
group membership, but not to a list of individual users.
DFS is a log-based file system which allows quick
recovery from system crashes. Most file systems must
go through a file system check to ensure that there was
no corruption of the file system. This can occur because
much of the housekeeping information is kept in main
memory and can be lost across a system crash. In contrast, DFS logs every disk operation which allows it to
check only the changes made to the disk since the last
update. This greatly reduces the file system check
phase and consequently speeds up file server restarts.
To summarize, the use of a token manager
ensures shared data consistency across multiple clients.
A uniform name space is enforced to provide location
transparency. Kerberos authentication is used and
access control lists provide authorization. DFS allows
its databases and files to be replicated, which provides
for reliability and availability. It can interoperate with
NFS clients through the use of protocol gateways.
Cray's initial port of DFS won't include the
Episode file system. This means that log-based
recovery and cloning won't be available. Cloning is the
mechanism used for replication.

4.7. LFS
OSF has selected the Episode file system to be the
local file system for the OSF/DCE product. In the initial
port of DCE it was decided to retain the UNICOS file
system instead of LFS. However, there are significant
features of the Episode file system that, when used with
DFS, provide reliability and performance enhancements.
The most important feature is the ability to support multiple copies of data. This provides redundancy for
reliability/availability, increased throughput, and
load balancing.
CRI will evaluate different log-based file systems
and decide on the best alternative for Cray from those
available. The current possibilities include
Episode from Transarc, Veritas, Polycenter from DEC,
writing our own, and others. This evaluation must first
produce a clear set of requirements which will be used
to select the best choice for Cray Research, Inc.

5. Comparison to ONC Services

Almost all components of DCE have corresponding ONC services. This section is a quick overview of
the technologies available on UNICOS which can be
used now to support distributed computing and distributed data.
Both DCE and ONC have an RPC mechanism,
which have different protocols. To write a program
using ONC RPC, a user makes use of a tool called
RPCGEN, which produces stubs. To write a program
using DCE RPC, a user makes use of an IDL compiler
which also produces stubs. ONC RPC uses XDR for
data translation, while DCE RPC uses NDR. DCE RPC
has an asynchronous option. User programs may use
either DCE RPC or ONC RPC, but not both. The client
and server portions of an application must both use
either DCE RPC or ONC RPC.
NFS is a stateless file system. DFS relies on state
information and uses tokens to control file access. A
user program could access files that exist in a DFS file
system and files in an NFS file system, however a file
must reside in only one file system.
Network Information Service (NIS) is ONC's
directory service. It interfaces to the Domain Name
Server to extract internet naming and addressing information for hosts. DCE's CDS is similar in this respect.
The protocols for NIS and CDS are incompatible.
DCE's User Registry doesn't have a corresponding ONC service, but it will have to coexist or be
integrated with the UNICOS User Data Base. Both
environments support Kerberos for network security.


ONC's time service is called Network Time Protocol
(NTP). DCE's time protocol, DTP, isn't compatible at
the protocol level. It is possible to tie together an NTP
time service with a DTP time service by having the
DTP server get its time from NTP.
6. CRI's DCE Plans
The DCE Toolkit has been released. It supports
the client implementation of the DCE core services
(threads, rpc/idl, security and directory). We rely on
ONC's NTP to provide the correct time. The Toolkit
passes over 95% of the OSF-supplied tests. The test
failures are in the area of thread scheduling. Our Toolkit
is built on top of libu multitasking. Since libu has its
own scheduler, we removed the OSF-supplied scheduler.
CRI does not provide documentation or training
for DCE. Both of these services can be obtained from
OSF or from third parties.
CRI's next release of the OSF technology will be
in two parts: the DCE Client and the DFS Server. The
DCE Client will incorporate an updated version of the
Toolkit and the DFS client. The DCE DFS Server
includes the full OSF/DCE DFS server, providing transparent access to remote files that are physically distributed across systems throughout the network. Implementation of DFS requires UNICOS support for the following new features: pthreads, krpc, and vnodes. These
features are available in UNICOS 8.0. The Cray DCE DFS
Server is planned to be available in mid-1994. In addition to UNICOS 8.0, Cray DCE Client Services is a
prerequisite for this product.
The Cray DCE Client Services product provides
all of the functionality of the Toolkit as well as DFS
client capabilities. With the introduction of Cray
DCE Client Services, the Cray DCE Toolkit is no longer
available to new customers. Since Cray DCE Client
Services requires UNICOS 8.0, a transition period has
been established for the upgrade of existing Cray DCE
Toolkit customers to Cray DCE Client Services. The
transition period extends from the release of Cray DCE
Client Services until one year after the release of
UNICOS 8.0. During this transition period, the Cray
DCE Toolkit will continue to be supported on UNICOS
7.0 and UNICOS 7.C systems. Cray DCE Toolkit
licenses include rights to DCE Client Services on a
"when available" basis. There is no upgrade fee.
In future product releases, CRI will provide support for a log-based file system (LFS). We will investigate Episode, the OSF-supplied file system, and other
log-based file systems which support the advanced
fileset operations. CRI will also integrate DFS with other
components of UNICOS (e.g. DMF, SFS, accounting,
etc.). Finally, we intend to track all major releases of the

DCE technology from OSF.

7. DFS Advantages
In the DFS distributed file environment, users
work with copies of files that are cached on the clients.
DFS solves problems that arise when multiple users on
different clients access and modify the same file. If file
consistency is to be controlled, care must be taken to
ensure that each user working with a particular file can
see changes that others are making to their copy of that
file. DFS uses a token mechanism to synchronize concurrent file accesses by multiple users. A DFS server
has a token manager which manages the tokens that are
granted to clients of that file server. On the clients it is
the cache manager's responsibility to comply with token
control.
Caching of information is transparent to users.
DFS ensures that users are always working with the
most recent version of a file. A DFS file server keeps
track of which clients have cached copies of each file.
Servers, such as DFS servers, that keep such information,
or 'state', about the clients are said to be 'stateful'
(as opposed to 'stateless' servers in some other distributed file systems). Caching file data locally improves
DFS performance. The client computer does not need to
send requests for data across the network every time the
user needs a file; once the file is cached, subsequent
access to it is fast because it is stored locally.
Replication improves performance by allowing
read-only copies to be located close to the user of the
data. This load balancing of data locations reduces network overhead. All DFS databases (fileset location,
backup, update) use an underlying technology which
allows replication. This further improves performance
and allows more reliability. DFS allows for multiple
administrative domains in a cell. Each domain is controlled via a number of administrative lists which can be
distributed.

8. DCE and UNICOS Subsystems
8.1. DMF
Since DFS uses NC1 as its local file system and
has its own cache, the integration is transparent. DMF can
migrate and unmigrate DFS files at any time. In future
releases we will study the possibility of a special communications path between DMF and DFS.
8.2. NQS
In a future NQS release, NQS will be able to
access and use DFS files for retrieving job scripts and
returning output.

8.3. SFS
In future releases the Shared File System and
DFS will be integrated in order to maintain high performance and network wide consistency. Our initial work
will be to synchronize DFS tokens and SFS semaphores.
This will ensure that users outside the cluster can access
files in the cluster while maintaining data integrity.
Next, we will extend a facility already within DFS
called the Express path. The Express path will allow
DFS clients within the cluster to access data in the cluster without moving the data.
8.4. Security
Our initial release requires users to validate themselves to DCE before using DCE services. This validation occurs during the initial entry into the DCE. If that
entry occurs on UNICOS, then a second login is required. If
the entry occurs elsewhere in the network, then no second
UNICOS login is required. DCE security is separate
and distinct from MLS. However, since DCE makes use
of the standard network components that are part of
MLS, some of the benefits of MLS apply to DCE. Later
releases of DCE will make use of other components of
MLS.

9. DCE's impact on UNICOS
Since the reference implementation of DFS is
based on the latest file system technology in System V
(vnodes), it was necessary to change UNICOS to support vnodes. To support vnodes, all of the old FSS (file
system switch) code had to be removed and replaced
with vnode code. This required all existing file systems
(e.g. NC1 and NFS) to change from FSS calls to VFS
calls. DFS also requires RPC and threads in the kernel.
The RPC code is a copy of some of the user-space RPC
code. The threads support in UNICOS is completely different from the user-space threads code; in the kernel,
we implemented threads through direct system calls.
DFS and RPC contain large amounts of code. This is
much more of an issue for real-memory systems like
CRAYs than it is for other systems. The other components of DCE also contain large amounts of code, but
since they are in user space they can swap out. The user-space
libraries require threads. We implemented DCE
threads on top of libu multitasking. This has the advantage of easier integration with existing libraries and
tools (e.g. cdbx). However, there are some restrictions.

10. Current Status
The DCE toolkit is released. The Toolkit supports
client implementations of all core components except
DFS. The Toolkit allows UNICOS systems to participate in a DCE environment. The Toolkit is at the OSF


1.0.1 level. The components include:
Threads
Rpc/idl
Directory
Security
Time API
UNICOS 8.0 contains the infrastructure to support DFS. All DFS products will require 8.0. The infrastructure includes vnodes, kernel rpc, and kernel threads.
The DCE client product will include all of the
Toolkit components (updated to the 1.0.2 level) and the
DFS client. Currently, the DCE Client passes all of the
Toolkit tests and the NFS Connectathon tests. More
than 90% of the DFS tests are working. The DCE DFS
Server product will include support for the DFS servers.
It currently passes Connectathon as well. A major
undertaking, in conjunction with other OSF members, is
underway to multi-thread DFS. LFS (aka Episode) has
been evaluated but will not be part of the initial release.

11. Summary
CRI has a DCE Toolkit available and plans for a
DFS product in the third quarter of 1994. CRI is committed to
tracking DCE and enhancing it.


Networking

SCinet '93 - An Overview
March 14, 1994, San Diego, California
Cray User Group Meeting

By: Benjamin R. Peek
Peek & Associates, Inc.
5765 SW 161st Avenue
Beaverton, OR 97007-4019
(503) 642-2727 voice
(503) 642-0961 fax


SCinet '93 at the Supercomputing '93 Conference in Portland, Oregon
SCinet '93 - An Overview
SCinet '93
SCinet '93 got underway during SUPERCOMPUTING '92. Bob Borchers had asked me
to help locate someone locally (in Oregon) that could handle the SCinet '93
responsibilities. At that point, I was just beginning to understand requirements for the
Local Arrangements responsibilities and had not quite received a commitment from Dick
Allen to take the Education Chair. I decided to spend time during SUPERCOMPUTING
'92 at the SCinet booth in Minneapolis with Dick Kachelmeyer and the SCinet '92 crew.
I was most impressed with the level of service that Dick and his crew were able to give to
all of the exhibitors and researchers at SC'92. The conference period went well for me
but I admit that I was buried in more information than I could handle, especially the
SCinet participants and the Local Arrangements scope and organization. I also attended
meetings with the SC'92 Education folks, along with Borchers, Crawford, Allen and
others, to better understand what they were doing. Basically, my time was spent
collecting a lot of information.

Time track
During December 1992, things got even more interesting. Dona Crawford, Bill Boas, and
several other folks began organizing the whole idea of the National Information
Infrastructure Testbed, spawned out of the success of SCinet '92 and a desire to establish
a testbed with the attributes of SUPERCOMPUTING experimentation in terms of
industry, academia, and government research participation. This NIIT idea became an
active meeting just before Christmas 1992. On December 17, 1992, Dona Crawford and
several interested folks met in Albuquerque to discuss a permanent network testbed:
what it would mean, who might be interested, who would benefit, and who could fund
such an idea and how. Because demonstrations could result from the NIIT
idea that might be showcased at SC'93 on SCinet '93, Dona invited me to participate. I
had other commitments and was unable to make the meeting. Clearly, it was a productive
meeting.
The NIIT showcase demonstrations for SCinet '93 were massive, impressive, and
technically superior. The showcase of technology for ATM over the NIIT infrastructure
and SCinet '93 was impressive enough that ATM, as a protocol and technology, moved
ahead in its deployment schedule by at least 18 months during 1993, perhaps even more,
in my view.
Starting in March 1993, Mike Cox and I began the process of collecting requirements by
survey for SCinet '93. The survey was sent by fax, email, U.S. Mail, and by phone to all
participants in SUPERCOMPUTING '92, specifically to all of the folks involved in
networking or SCinet. The survey was thought, at this early date, to magically collect all
of the information that we would need to begin the network design process for all of the
networks planned. The survey included questions about Ethernet™, FDDI, ATM, HiPPI,
Serial HiPPI, Fibre Channel, and SONET. Some conference calls were made.
Perhaps the next major event for SCinet '93 came in June 1993. Dona Crawford planned
a SC'93 Program Committee meeting in Portland for June and it seemed convenient and
pertinent that we have the organizing committee meeting for SCinet participants as well,
especially since some were traveling to Dona's meeting already. The meeting at the
Benson Hotel started at 8:30 a.m. with about 30 SCinet folks attending in the morning. Bob
Borchers introduced the meeting and gave the group an overview of SUPERCOMPUTING


in general, the expected number of attendees at SC'93, and other general information. I
gave an overview of how I planned to organize and run the SCinet '93 committee. We
reviewed preliminary plans including permanent infrastructure at the Oregon Convention
Center and the fund raising requirements to make such an idea possible.
We heard from Jim McCabe (who committed to be vice chair technology at this meeting),
Mark Wilcop, U.S. West (telephone long lines and other issues), Doug Bird, Pacific
Datacomm on Ethernet and FDDI, Mark Clinger, Fore Systems on ATM, Bill Boas,
Essential Communications on HiPPI, Dona Crawford, Sandia on NIIT, and the Intel folks
on experiences at SCinet '92. The meeting was a great success and got the team working
on all of the serious issues. Other ideas were presented, including the National Storage
Labs (NSL) plans, and of course Tom Kitchens (DoE) and Fran Berman (UCSD) presented
the Heterogeneous Computing Challenge and the Le Mans FunRun ideas. A massive
amount of activity was initiated and specifically, the connectivity data collection
functions went into high gear.
The next major "event" was a meeting two months later in Minneapolis (August).
NetStar was our host and I paid for lunch for around 35-40 people. At the NetStar
meeting, we were able to reach consensus on many of the major functions and tasks for
SCinet '93. Connectivity requirements were proving hard to come by: no one knows
what they will do at a SUPERCOMPUTING conference, in terms of equipment, projects,
etc., until a very few weeks before the conference. Before the NetStar meeting, email
surveys were sent a total of three times, and faxes were sent twice to everyone not yet
responding. Finally, in late August, Linda Owen at Peek & Associates, Inc. began the
telephone process and called everyone that we knew to collect additional requirements.
Jim McCabe and the various project leaders were also calling each participant.
During September and specifically on Labor Day, Jim McCabe and his crew from NASA
spent the weekend at the Oregon Convention Center making a physical inventory of the
conduit system, inspecting all aspects of the center layout, and getting preliminary
implementation plans in place for the physical networks defined at that point.

An October meeting was held in Albuquerque at the same time as the SC'93 steering
committee meeting in Portland on October 22-23, 1993. I was unable to attend due to
the Steering Committee meeting in Portland. The purpose of the meeting was final
review of all network designs for the conference. Each of the project leaders (for Ethernet,
FDDI, ATM, HiPPI, FCS, plus external connectivity) gave final reports on the design of
their specific network. Issues and problems were discussed, especially those that related
to interfaces between networks and those issues dependent on loaned equipment from the
many communication vendors that were participants. McCabe reported that the meeting
was very productive. Meeting reports were received from most of the project leaders
outlining problems, proposed solutions, and unfilled requirements.
Prestaging was also planned for October at Cray Research, Superservers Division in
Beaverton, OR. The space was donated, power was arranged, facilities were organized,
and no one showed up. Prestaging is just not a concept that will work for SCinet, in my
view. The elements of the conference do not come together clearly, in enough time, for
the big players in the network like Cray Research, IBM, HP, MasPar, Intel, Thinking
Machines, etc. to take advantage of prestaging. The network exhibitors and vendors
seemed much more prepared to take advantage of the early timeframe, though not all
of them were. There is also the logistics problem of getting all of the people scheduled for a
prestaging activity at the same time. It does not do a lot of good to prestage equipment
that must talk to other equipment if the "other folks" cannot be there.


As we moved into mid-October, it became clear that the size, scope, and complexity of
SCinet '93 was perhaps four times that of SCinet '92. I expect a step function for SCinet '94 as
well. The costs have also exploded due to the cost of single and multimode connectors,
fiber optic cable (the volume of it continues to increase), and the labor (very skilled and
expensive) to install and test such networks. Fiber optic installation is also slower than
conventional network technology. Rushing a fiber optic installation just results in
rework. During October, Peek & Associates, Vertex and RFI were installing networks at
the Oregon Convention Center at every available time slot. The convention center had
many other conferences that made working in the center really difficult. We would work
for 2-4 hours at a time, generally from midnight to early morning, between events.
The real press started the first week of November. We were in the Oregon Convention
Center installing networks almost continuously from October 30 until November 15 with
the pace becoming more intense with each passing day.

External Networks
SCinet was connected to the outside world in many ways including DS3, OC3, gigabit
fiber optics (HiPPI and FDDI), SONET, etc. External networks were developed,
managed and setup primarily by Mark Wilcop. Mark worked with many phone
companies, research organizations such as NIIT and Sandia, ANS for the Internet
connection, and with the SCinet team to establish the finest set of external networks that
have been established to date for a technical conference such as SUPERCOMPUTING
'93.


The Sandia "Restricted Production Computing Network" connected Sandia, New Mexico
to Sandia, California over an 1100-mile DS3 trunk. This network was terminated at both
ends with AT&T BNS-2000 switching equipment that integrated both local FDDI
networks and ATM/SONET testbeds. An additional link was developed at OC3 that
covered 1,500 miles of connectivity from Albuquerque to Portland. This connectivity
included several players and was arranged primarily by Mark Wilcop (U.S. West).
The NIIT experiment was massive. It included connections to the University of New
Hampshire; the Earth, Ocean, Space Institute in Durham; Ellery Systems in Boulder,
CO; Los Alamos; Sandia in Albuquerque; Sandia in Livermore, CA; and Oregon State
University in Corvallis, OR.

An AT&T Frame Relay capability connected NIIT to the University of California, Berkeley;
University of California, Santa Barbara; NASA/Jet Propulsion Lab, Pasadena, CA; High
Energy Astrophysics Div., Smithsonian Astrophysical Observatory, Cambridge, MA; and
AT&T in Holmdel, NJ.
A private Boeing ATM network connected to Seattle, WA with Fore Systems equipment and
participation. This experiment developed and used switched virtual circuits (SVCs) for
the first time in a trial over the switched public network.

Fiber Optic Backbone Network at Oregon Convention Center
A backbone network was installed in the Oregon Convention Center. The backbone
network is described elsewhere in this document (Oregon Convention Center Support
Infrastructure Letters section). The backbone has from 12 to 48 single-mode and multimode
fiber optic connections running throughout the convention center.


The diagram attached shows the network details.

Ethernet
The Ethernet network was very extensive and had hundreds of connections. Ethernet
connected to both the FDDI networks and to the ATM backbone. The Ethernet
equipment was supplied by ODS with conference support from both Pacific Datacomm in
Seattle and ODS in Dallas.

Fiber Distributed Data Interface (FDDI)
FDDI networks interfaced with the ATM backbone. The FDDI equipment was supplied
by ODS with conference support from both Pacific Datacomm in Seattle and ODS in
Dallas.

Asynchronous Transfer Mode (ATM)
The SCinet '93 ATM network was one of the most interesting ATM experiments to date. The ATM network
served as the backbone network for the conference. The ATM activity was managed by
Mark Clinger of Fore Systems, Inc. There were many firsts, specifically the first use
of switched virtual circuits (SVCs) over the public switched network, the most extensive
backbone network based on ATM, and in general, the use of production ATM equipment
from a mixed set of vendors over a mixed set of media including the public switched
network, private networks, the OCC backbone network, etc.

SCinet '93 Network Diagrams
On the following pages are detailed diagrams of each of the networks from SCinet '93 as
well as figures describing the network experiments.
High Performance Parallel Interface (HiPPI)
HiPPI is a suite of communications protocols defining a very high-speed, 800 Mbps
channel and related services. Using crosspoint switches, multiple HiPPI channels can be
combined to form a LAN architecture.
The HiPPI protocol suite consists of the HiPPI Physical Layer (HiPPI-PH), a switch
control (HiPPI-SC), HiPPI Framing Protocol (HiPPI-FP), and three HiPPI Upper Layer
Protocols (ULPs). The ULPs define HiPPI services such as IEEE 802.2 Link
Encapsulation (HiPPI-LE), HiPPI Mapping to Fibre Channel (HiPPI-FC), and HiPPI
Intelligent Peripheral Interface (HiPPI-IPI).
SCinet '93 implemented both HiPPI-IPI and HiPPI-LE. There was the possibility of
HiPPI-FC but no experiment was conducted due to lack of time. It would have been an
interesting experiment and there was a convenient Fibre Channel network that could have
been used.
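For quick reference, the HiPPI layering described above can be captured as a small table. This is only a summary of the text, not code from any HiPPI implementation:

```python
# The HiPPI protocol suite as described in the text, keyed by standard name.
HIPPI_SUITE = {
    "HiPPI-PH":  "physical layer: 800 Mbps channel",
    "HiPPI-SC":  "switch control",
    "HiPPI-FP":  "framing protocol",
    "HiPPI-LE":  "ULP: IEEE 802.2 link encapsulation",
    "HiPPI-FC":  "ULP: mapping to Fibre Channel",
    "HiPPI-IPI": "ULP: intelligent peripheral interface",
}

# The two ULPs that SCinet '93 actually implemented:
implemented = [name for name in HIPPI_SUITE if name in ("HiPPI-IPI", "HiPPI-LE")]
```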
The HiPPI team was headed by Randy Butler with NCSA. Randy did an outstanding job
of designing, building and operating the largest HiPPI network built to date. The network
extended approximately 28 miles to Beaverton, OR connecting a Paragon at Intel SSD to
the Oregon Convention Center with three separate fiber optic networks. There were
several private HiPPI networks dedicated to a specific application or demonstration. In


all, there were more than 30 HiPPI connections over single mode fiber, multimode fiber,
25 and 50 meter HiPPI cables, connected by a wide assortment of HiPPI switches,
routers, and extenders. The system operated flawlessly throughout the conference.
See the attached figure from Network Systems describing the HiPPI networks for SC'93.

HiPPI Serial
In addition to the above, there were two separate HiPPI Serial links and one FDDI
connection to Intel in Beaverton, approximately 24 miles one way, over three dedicated
fiber optic private networks configured using both U.S. West and General Telephone dark
fiber connecting the Oregon Convention Center to Intel in Beaverton, OR. These serial
experiments used BCP extenders everywhere. The results were impressive. Thirteen
terabytes of information per day moved between Intel's Paragon configuration in
Beaverton and Intel equipment at the Supercomputing conference within the Oregon
Convention Center.
These fiber optic networks experienced unusually low bit error rates and were functional
throughout four days of the conference. Much of the "impossibility" of doing gigabit
networks at the WAN level evaporated during the experiment. Engineers and
management, across a spectrum of companies and organizations, were convinced that
such technology could be implemented.
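As a back-of-envelope check (our arithmetic, not a figure from the conference), thirteen decimal terabytes per day corresponds to an average rate of roughly 1.2 Gbps, consistent with sustained gigabit-class serial HiPPI traffic:

```python
# Average data rate implied by moving 13 TB/day (decimal terabytes assumed).
BYTES_PER_DAY = 13e12
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 s

avg_bps = BYTES_PER_DAY * 8 / SECONDS_PER_DAY
print(f"average rate: {avg_bps / 1e9:.2f} Gbps")
# -> average rate: 1.20 Gbps
```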


Experiences with OSF-DCE/DFS
in a 'Semi-Production' Environment
Dieter Mack
Computer Center of the University of Stuttgart

Talk presented at CUG, San Diego, March 1994

Abstract:

RUS has been running a DCE cell since late 1992, and DFS since summer 1993. What was originally a mere
test cell is now being used as the day-to-day computing environment by volunteering staff members. The
expressed purpose is to obtain the necessary skills and prepare for the transition from an AFS based
distributed environment to OSF-DCE. We describe experiences gained, difficulties encountered, tools being
developed, and actions taken in preparation for the switch to DCE/DFS.

Introduction.
The Computer Center of the University of Stuttgart is a regional computer center and its purpose
is to provide computing resources to the University of Stuttgart, the other universities of the state
of Baden-Wiirttemberg, and to commercial customers.

Fig. 1: Central Systems (figure: client-server configuration at the Computer Center of the University of Stuttgart (RUS) showing file server, compute server, massively parallel and distributed parallel systems, and IPVR; dated 2.12.93)
Figure 1 outlines the central services offered to the user community. The Cray 2, which serves the
need for supercomputer power, will be replaced by a C94 shortly. General services and scalar


batch are now being served by the SERVus cluster, a cluster of RISC servers (IBM RS/6000), as
shown in figure 2. This cluster successfully replaced a mainframe-based timesharing system
at the beginning of last year.
There are about 2800 registered
users of the central computers.
The LAN of the University of
Stuttgart comprises 3200
computer nodes, including more
than 1600 workstations. As the
users and owners of these local
systems become aware of the
work involved in their
administration and maintenance,
they start to demand new central
services to take this work off
their shoulders.


The RUS-DCE project.
In response to these needs, in
summer 1991, with some support
from IBM, we started a project to
investigate the technologies of
distributed computing. At this
time OSF-DCE was but an
interesting blue print, so we
decided to look into and evaluate
the available technologies, and to
make them available to the
members of the project team. The
ultimate goal of course was and
still is to offer these as new
services to the users.
The main parts of the project were:
• Time-Services - xntpd
• Kerberos
• AFS - Andrew File System
• NQS - Network Queuing System
• Distributed Software Maintenance and Software Distribution

Fig. 2: SERVus RISC Cluster (figure: 17 CPUs; Linpack 255 MFlops; SPECmark 846; 51 GB of disk; services include scalar batch, program development, libraries, I/O server, X server, databases, word processing, network services, software licensing, user administration, security, and distributed data)

The time service is just a part of the infrastructure of distributed computing, without any directly
visible impact on the users.
The AFS cell rus.uni-stuttgart.de was installed in November 1991 as the first cell in Germany.
AFS is one of the key components of the SERVus cluster in order to achieve a 'single system
image'. Today AFS is in full production use at the university, and it is hard to imagine how we
could get our work done without it. AFS is the most mature and widely used production Distributed
Computing Environment available today.
But AFS lacks a finer granularity of administrative rights, as would be desirable at a university
in order to keep departments under a 'common roof', but give them as much autonomy as possible


without sacrificing security. This is the reason for having multiple AFS cells (currently 4) on the
campus, as departments have to run their own cell, if they want to run and administer their own
file servers.
OSF-DCE and DFS at Stuttgart.
While the main component technologies of OSF-DCE are in production use at RUS, it was
obviously desirable to get hold of DCE at an early stage. We had the opportunity to take part in
IBM's EPP (beta test) for AIX-DCE. We first configured our DCE test cell in November 1992, and
we have had DFS running since summer 1993. One does not expect everything to run absolutely
smoothly in a beta test, but one hopes to get one's hands on a new product at an early stage. So
we started off with one server machine and a handful of clients, the working machines of
participating project members.
Today we have three server machines with the following roles:
rusdce: CDS, DTS, DFS simple file server
zerberus: Security, CDS, DTS
syssrv2: DTS, DFS file server, FLDB server, system control machine
The clients are essentially the same, but besides the RS/6000's there now is a Sun Sparc machine.
There are 30 user accounts in the registry, and these people have their own DFS filesets,
containing their home directories. We plan to install the DFS/AFS translator as soon as we can
get hold of it.
Experiences with DCE and DFS.
The first impression with DCE is that this is a really monolithic piece of software. Nevertheless
configuring server and client machines is relatively straightforward with the supplied shell
scripts and especially with the SMIT interface supplied by IBM.
Next one realizes that security is much more complex to administer, but on the other hand
this offers the finer granularity which AFS lacks. The cell_admin account of course still
is master of the universe, but there are numerous ways to delegate well-defined administrative
tasks. This is mostly due to the fact that one can put ACLs on almost everything from files to
directories in the security space.
The Cell Directory Service (CDS) is intended to locate resources transparently. From this it is
clear that it is one of the key components of DCE. If the CDS is not working properly, all the
other services may no longer be found and contacted, and everything will eventually grind to a
halt. Its performance should be improved before one can really use DCE in large production
environments.
DFS essentially still is AFS with different command names. From the user's perspective it is as
transparent as AFS or maybe even more transparent, due to the synchronisation between ACLs
and mode bits. And the administration is still the same. As an example: I just replaced the
command names etc. in an old AFS administrative shell script of mine, and it just ran on DFS.
And DFS appears to be quite stable and robust, but we have not yet really stressed it.
Nevertheless it is a good idea to copy one's files from DFS back to AFS, just to be on the safe side.
This has proven to be a wise measure, not because of DFS failures but due to CDS problems, which
can prevent one from getting at one's data. Mounting a DFS fileset locally on the file server can
also help in case of problems.
Another problem worth mentioning was encountered when we tried to move the security server


to a new machine. We succeeded in the end, but only by having the old and the new security
server running simultaneously for takeover. Had the old security server been broken, we would
not have been able to bring up a new security server from a database backup, and would have had
to reconfigure the cell.
This brings us to one very important remark: as all objects in DCE are internally represented
not by their names but by their UUIDs, it is crucial to have the cell's UUID written down somewhere.
It is possible to configure a cell with a given UUID. Reconfiguring a cell with a new UUID makes
all data in DFS inaccessible, until one forcibly changes the ACLs by hand.
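The consequence of UUID-based naming can be illustrated with a short sketch. This is illustrative Python, not DCE code; a plain dictionary stands in for an ACL:

```python
import uuid

# Illustrative sketch (not DCE code): ACL entries reference identities
# by UUID rather than by name, so entries survive renames, but a
# regenerated UUID orphans every existing ACL entry.

cell_uuid = uuid.uuid4()        # recorded at cell creation time
acl = {cell_uuid: "rwx"}        # ACL entry keyed by the cell's UUID

# Reconfiguring the cell with the *same* UUID keeps data reachable:
assert acl.get(cell_uuid) == "rwx"

# Reconfiguring with a *new* UUID does not -- the old entries no
# longer match, exactly the situation described above:
new_cell_uuid = uuid.uuid4()
assert acl.get(new_cell_uuid) is None
```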
What is missing in DCE.
There are some features missing in DCE which are absolutely necessary.
The most important of these is a Login program, which allows a user to log into DCE and the local
machine in one step. This is the only way to be able to have one's home directory in DFS, and thus
have the very same home directory on every machine one works on. AFS has this indispensable
feature, without which a single system image on a workstation cluster may not be achieved.
We are currently using a 'single login' program developed by IBM banking solutions, which has
some very interesting features for limiting access to machines to specified users.
DCE versions of utilities like ftp and telnet (or rlogin) are needed. As long as there are machines
which are not DCE/DFS clients, there will be a need for shipping files the conventional way. And
there will always be a need for running programs on the computer most suited for the problem at
hand. One of the goals of distributed computing is for the user to see the same data on whichever
computer he deems appropriate for solving his problems, and to allow him access to these in a
secure manner, i.e. no passwords on the network, etc.
Batch processing in a distributed environment.
The other field which appears to have been forgotten by those developing DCE and DME is batch
processing. Despite the personal computing revolution, most computing is still done in batch. Users
use their desktop systems for program development etc., but the long production runs are usually
being done on more powerful systems in batch.
Batch in the context of DCE faces two problems:
1. credential lifetime
2. distributed queue management
The lifetime problem can be overcome by renewable tickets, and by a special 'batch ticket granting
service', which hands a new ticket for the batch user to the batch system in return for a ticket
granted for the 'batch service' (possibly valid only for this job's UUID), ignoring this ticket's
lifetime. This is, at least in the given framework, the only way to overcome the problem of a job
being initiated after the ticket of the user who submitted it has already expired. This means
accepting an expired ticket, but if, hours later, I no longer believe that this user submitted the
job, I had better not run it anyway. Batch processing means doing tasks on behalf of the user
in his absence; this is the very nature of it.
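The 'batch ticket granting service' idea can be sketched as follows. All names here are invented for illustration; this is not DCE or Kerberos code, just the exchange logic described above:

```python
import time

# Hypothetical sketch of the 'batch ticket granting service' described
# in the text. A ticket granted for the batch service at submission
# time is exchanged for a fresh ticket at job start, deliberately
# ignoring the original ticket's expiry.

class BatchTGS:
    def __init__(self, ticket_lifetime=8 * 3600):
        self.lifetime = ticket_lifetime

    def exchange(self, batch_ticket, now=None):
        now = now if now is not None else time.time()
        # A normal service would reject an expired ticket; the batch
        # TGS accepts it, because running the job later, in the
        # user's absence, is the whole point of batch.
        assert batch_ticket["service"] == "batch"
        return {
            "service": "batch",
            "principal": batch_ticket["principal"],
            "job_uuid": batch_ticket["job_uuid"],
            "expires": now + self.lifetime,
        }

# A job submitted hours ago: the user's original ticket has expired.
submitted = {"service": "batch", "principal": "dieter",
             "job_uuid": "job-42", "expires": time.time() - 3600}
fresh = BatchTGS().exchange(submitted)
assert fresh["expires"] > time.time()   # job can still run
```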
The second problem, namely distributed queue management, is at the heart of all currently
available distributed batch systems. Their problem is: how to get the job input to the batch
machine, and how to get the output back to the user. In 1976 we ran two Control Data
mainframes with shared disks and, due to special software developed at RUS, shared queues. In
its very essence this was a distributed file system, if only shared between two machines. The


existence of a distributed file system makes the problem of sending input and output obsolete. The
data is there, at least logically, no matter from where you look at it. Queues then are essentially
special directories in the distributed file system, with proper ACLs to regulate who is allowed to
submit jobs to which queues. On the other side, the batch initiating service on a batch machine
knows in which queues to look for work, just as on a stand-alone system. And it is in any case best
to let each machine decide when to schedule a new job. The only thing necessary might
be a locking mechanism to prevent two machines from starting the same job at the same time. A
centralized scheduler, which decides which batch machine should run which jobs and then sends
them there, is no longer needed, thus eliminating another single point of failure.
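The queue-as-directory idea with a per-job lock can be illustrated in a few lines. This is a minimal invented example, not the RUS software; it assumes atomic exclusive-create semantics, which hold on a local or fully coherent distributed file system:

```python
import os
import tempfile

# Minimal sketch (invented names) of a batch queue as a shared
# directory: each job is a file, and a batch machine claims a job by
# atomically creating a lock file, so two machines cannot start the
# same job at the same time.

def claim_job(queue_dir, job):
    lock = os.path.join(queue_dir, job + ".lock")
    try:
        fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)             # atomic create succeeded: the job is ours
        return True
    except FileExistsError:
        return False             # another machine got there first

queue = tempfile.mkdtemp()
open(os.path.join(queue, "job1"), "w").close()   # a submitted job
assert claim_job(queue, "job1") is True          # first machine claims it
assert claim_job(queue, "job1") is False         # second machine is refused
```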
A model for a campuswide DCE cell.
The envisaged goal of a campuswide DCE cell is to provide a transparent single system image to
the users of computing equipment on the whole university campus. Of course the Computer
Center cannot force departments into becoming part of this cell, but if we offer and provide them
with a good service without too much administrative hassle, they will accept our offer. This is at
least our experience from offering the AFS service to the campus.
There are two requirements for campuswide distributed computing: The user should be registered
in only one place, with one user name, one password, one home directory, but be able to get at all
available resources for which he is authorised. And departments should be able to register their own
users and run their own servers, without opening their services to uncontrolled access.
The requirements on user management can easily be achieved in DCE due to the fact that the
object space of the security service is not flat, but a true tree structure. Principals can be grouped
in directories in the security space, and ACLs on these directories allow for the selective
delegation of the rights to create or remove principals from a specific directory. Hence, by creating
a separate directory for each university department under the principal and groups branches of
the security space, it is possible to have department administrators do their own user
administration, and still have a common campuswide security service.
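The delegation scheme can be modelled in a few lines. This is an illustrative sketch with invented names, not the DCE registry API; a dictionary of per-department directories stands in for the tree-structured security space:

```python
# Illustrative sketch of per-department delegation in a tree-structured
# registry: an ACL on each directory controls who may create principals
# beneath it, so departments administer their own users under one
# common campuswide security service.

registry = {
    "/principal/physics":   {"admins": {"phys_admin"}, "principals": set()},
    "/principal/chemistry": {"admins": {"chem_admin"}, "principals": set()},
}

def create_principal(caller, directory, name):
    entry = registry[directory]
    if caller not in entry["admins"]:
        raise PermissionError(f"{caller} may not create principals in {directory}")
    entry["principals"].add(name)

create_principal("phys_admin", "/principal/physics", "student1")   # allowed
try:
    create_principal("phys_admin", "/principal/chemistry", "x")    # refused
except PermissionError:
    pass
```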
In the same fashion it is possible to have departments mount their DFS file sets in a common file
tree. By using DFS administrative domains, departments wanting to run their own file servers
can safely do so, and still be part of the same cell. They may thus use centrally maintained
software, share data with users in other departments, even other universities, and do so in a
secure manner. And they may nevertheless have their file sets backed up centrally. This is a
definite improvement of DFS over AFS, where the only way to securely run departmental file
servers was to have multiple cells.
One of the principles of DCE, to which one should stick under all circumstances, is that
authorisation should be regulated by ACLs. A DCE single login, whether invoked directly or via
a DCE telnet, should thus grant access only to principals listed in an ACL for the machines
interactive service. The hooks for this may be found by defining appropriate attributes for the
corresponding objects in the CDS name space.
Conclusion.
To summarize our experiences thus far: DCE/DFS is a good, secure system with great potential,
but it is not yet mature and stable enough to be used in a true production environment. But we
can use it today to learn about all its new and rich features, especially its powerful security
mechanisms. When it is more mature, and supported by more vendors, we will be ready to
use it as a truly distributed environment for scientific computing at the university.


ATM - Current Status

March 14, 1994, San Diego, California
Cray User Group Meeting

By: Benjamin R. Peek
Peek & Associates, Inc.
5765 SW 161st Avenue
Beaverton, OR 97007-4019
(503) 642-2727 voice
(503) 642-0961 fax


ASYNCHRONOUS TRANSFER MODE (ATM)
Overview
Today's Local Area Networks (LANs) do not support emerging high-bandwidth
applications such as visualization, image processing, and multimedia communications.
Asynchronous Transfer Mode (ATM) techniques bring computational visualization,
video, and networked multimedia to the desktop. With ATM, each user is guaranteed a
nonshared bandwidth of 622 Mbps or 155 Mbps.
Conceived originally for Wide Area Network (WAN) carrier applications, ATM's
capability to multiplex information from different traffic types, to support different
speeds, and to provide different classes of service enables it to serve both LAN and WAN
requirements. A detailed specification for developing very high-speed LANs using ATM
technology was published in 1992. The UNI Specification is now at V.3.0, as of last fall.
It describes the interface between a workstation and an ATM hub/switch and the interface
between such a private device and the public network. Signaling is being developed to
enable multipoint-to-multipoint conferencing.
Local ATM products are available now, and many more are expected by mid- to late 1994. In fact, 1994 is the year that ATM comes of age. UNI V.3.0 will allow the
technology to be deployed into large scale production environments.
Bandwidth-intensive applications including multimedia, desk-to-desk videoconferencing,
computational visualization, and image processing are now appearing as business
applications. Existing 1M to 16M bps LANs (Ethernet and token-ring), and 100M bps
LANs (Fiber Distributed Data Interface - FDDI, FDDI II), can marginally support these
applications and their expected growth. Peek & Associates, Inc. forecasts indicate that
there could be 1,000,000 multimedia PCs in business, manufacturing, education, and
other industries by 1995.
Peek & Associates, Inc. estimates that the total market for ATM-based equipment and
services will grow from approximately $50 million in 1992 to more than a $1.3 billion
market by mid- to late-1995.
The shortcomings of these shared-medium LANs are related to the limited effective
bandwidth per user, and to the communication delay incurred by users. A more serious
problem involves the potential delay variation. For example, a 10M bps LAN may have
an effective throughput of only 2M to 4M bps. Sharing that bandwidth among 10 users
would provide a sustained throughput of only 200K to 400K bps per user, which, even if
there were no delay variation, is not adequate to support quality video or networked
multimedia applications.
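The arithmetic behind this claim is easy to check. A minimal sketch (the 20-40% effective-utilization range and the 10-user count are the text's illustrative figures, not measurements):

```python
# Back-of-the-envelope model of per-user throughput on a shared-medium LAN.
# The nominal rate, utilization range, and user count are the illustrative
# numbers from the text above, not measured values.

def per_user_bps(nominal_bps, utilization, users):
    """Sustained per-user throughput once the shared medium is divided up."""
    return nominal_bps * utilization / users

# A 10M bps LAN with 2M-4M bps effective throughput, shared by 10 users:
low = per_user_bps(10_000_000, 0.2, 10)
high = per_user_bps(10_000_000, 0.4, 10)
print(low, high)  # 200000.0 400000.0
```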
Originally developed for Wide Area Network (WAN) carrier applications, ATM's
capability to effectively multiplex signals carrying different network traffic and support
different speeds makes it ideal for local (LAN) and for remote (WAN) applications. The
term private ATM has been used to describe the use of ATM principles to support LAN
applications.


Progress is being made on two fronts:

1. Within the Exchange Carriers Standards Association (ECSA) T1 Committee,
   ANSI, and the Consultative Committee on International Telephony and Telegraphy
   (CCITT), for WAN and carrier applications; and

2. Within the ATM Forum for LAN applications. In general, good harmonization
   exists between the CCITT/ECSA work and the ATM Forum's work. If the ATM
   Forum is any indication, vendors are very interested in ATM. There are now (3/94)
   485 member organizations cooperating internationally in the ATM Forum.

Current Status

ATM and related standards have been developed in the past eight years under the
auspices of Broadband Integrated Services Digital Network (BISDN), as the blueprint for
carriers' broadband networks for the mid-1990s and beyond. BISDN is positioned as the
technology for the new fast packet WAN services, such as cell relay and frame relay.
Vendors have selected ATM-based systems for three key reasons:


• Local ATM enables major synergies between the LAN and the WAN, since both
  networks rely on the same transport technology. This allows a LAN to be
  extended through the public switched network, transparently. With seamless
  transparency, the LAN extends to the WAN, then to the GAN. This idea has the
  power to integrate communication protocols globally, establishing a ubiquitous
  marketplace of private virtual networks (PVNs).

• ATM and related standards are nearly complete. ATM has already had an eight-year
  life. By 1993, the carrier-driven standards were fully published, and through
  the activities of the ATM Forum, a full complement of Local ATM specs was
  published in the fall of 1993. The ATM standard itself was stable as of December
  1990, but support standards such as signaling (to enable call control, such as setup
  and teardown) are still being developed, though much progress was made in 1993.
  The other (non-ATM) standards are only beginning to be developed, and it might
  take several years for them to reach maturity, particularly FFOL.

• Local ATM allows the delivery of 155M bps signals to the desktop over twisted-pair
  cable. FDDI has only recently made progress in that arena, and FCS relies on
  fiber or coax cables. Several documented technical studies have shown the
  viability of 155M bps on twisted pair, without exceeding the FCC's radiation
  limits. Chipsets exist today (3/94), priced at $20 each in quantities of 1000, that
  implement ATM 155M bps over Category 5 copper infrastructure.

[Figure: ATM LAN/WAN PVN, with ATM workstations connected at each end.]

The ATM Forum provides the focal point for this activity. The Forum is a consortium of
equipment vendors and other providers with a mandate to work with official ATM
standards groups, such as ECSA and CCITT, to ensure interoperability among ATM-based
equipment. Vendors supporting the development include manufacturers of
workstations, routers, switches, and companies in the local loop. The Forum was formed
in late 1991, and membership has grown to the current level of 485 organizations in the
short intervening time. Based on these activities, a number of proponents claim that
ATM will become a dominant LAN standard in the near future. 1
BISDN standards address the following key aspects:

• User-Network Interface (UNI)
• Network-Node Interface (NNI)
• ATM
• ATM Adaptation Layer (AAL), to support interworking ("convergence") with
  other systems (such as circuit emulation)
• Signaling, particularly for point-to-multipoint and multipoint-to-multipoint
  (conferencing) calls
• End-to-end Quality of Service (QoS)
• Operations and maintenance

Connections can either be supported in a Permanent Virtual Circuit (PVC) mode or in a
Switched Virtual Circuit (SVC) mode. In the former case, signaling is not required, and
the user has a permanently assigned path between the sending point and the receiving
point. In the latter case, the path is activated only for the call's duration; signaling is
required to support SVC service.
ATM-based product vendors realize that the market will go flat if the inconsistencies
characteristic of early ISDN products resurface in the broadband market. The first
M. Fahey, "ATM Gains Momentum," Lightwave, September 1992, pp. 1 ff.


"implementers' agreement" to emerge from the ATM Forum was the May 1992 UNI
Specification. 1 The interface is a crucial first goal for designing ATM-based equipment
that can interoperate with equipment from other developers. The development of a 622M
bps UNI is also important for applications coming later in the decade.
A related start-up consortium is the SONET-ATM User Network (Saturn). Saturn's
mandate was to create an ATM UNI chipset. The goal was achieved during 1993 and
there are now several chipsets from which to select.

The ATM Forum's UNI Specification

ATM is a telecommunications concept defined by ANSI and CCITT standards for
carrying voice, data, and video signals on any UNI. On the basis of its numerous
strengths, ATM has been chosen by standards committees (such as ANSI T1 and CCITT
SG XVIII) as an underlying transport technology for BISDN. "Transport" refers to the
use of ATM switching and multiplexing techniques at the data link layer (OSI Layer 2) to
convey end-user traffic from a source to a destination.

The ATM technology can be used to aggregate user traffic from existing applications
onto a single access line/UNI (such as PBX trunks, host-to-host private lines, or video-conference
circuits) and to facilitate multimedia networking between high-speed devices
(such as supercomputers, workstations, servers, routers, or bridges) at speeds in the 155M to
622M bps range. An ATM user device, to which the specification addresses itself, may
be either of the following:
• An IP router that encapsulates data into ATM cells, and then forwards the cells
  across an ATM UNI to a switch (either privately owned or within a public
  network).

• A private network ATM switch, which uses a public network ATM service for
  transferring ATM cells (between public network UNIs) to connect to other ATM
  user devices.

The initial Local ATM Specification covers the following interfaces:
1. Public UNI - which may typically be used to interconnect an ATM user with an
   ATM switch deployed in a public service provider's network; and

2. Private UNI - which may typically be used to interconnect an ATM user with an
   ATM switch that is managed as part of the same corporate network.

The major distinction between these two types of UNI is physical reach. Both UNIs
share an ATM layer specification but may utilize different physical media. Facilities that
connect users to switches in public central offices must be capable of spanning distances
of up to 18,000 feet. In contrast, private switching equipment can often be located in the
same room as the user device or nearby (such as within 100 meters) and hence can use
limited-distance transmission technologies.
ATM Forum, UNI Specification, May 1992.


ATM Bearer Service Overview

Carrying user information within ATM format cells is defined in standards as the ATM
bearer service. Providing this service involves specifying both an ATM protocol layer
(Layer 2) and a compatible physical medium (Layer 1).

The ATM bearer service provides a connection-oriented, sequence-preserving cell
transfer service between source and destination with a specified QoS and throughput.
The ATM physical (PHY) layers are service independent and support capabilities
applicable to (possibly) different layers residing immediately above them. Adaptation
layers, residing above the ATM layer, have been defined in standards to adapt the ATM
bearer service to provide several classes of service, particularly Constant Bit-Rate (CBR)
and Variable Bit-Rate (VBR) service.
An ATM bearer service at a public UNI is defined to offer point-to-point, bi-directional
virtual connections at either a virtual path (VP) level and/or a virtual channel (VC) level;
networks can provide either VP or VC (or combined VP and VC) level services. For
ATM users desiring only a VP service from the network, the user can allocate individual
VCs within the VP connection (VPC) as long as none of the VCs are required to have a
higher QoS than the VP connection. A VPC's QoS is determined at subscription time and
is selected to accommodate the tightest QoS of any VC to be carried within that VPC.
For VC level service at the UNI, the QoS and throughput are configured for each virtual
channel connection (VCC) individually. These permanent virtual connections are
established or released on a subscription basis.

ATM will initially support three categories of virtual connections:

1. Specified QoS Deterministic
2. Specified QoS Statistical
3. Unspecified QoS

The two specified categories differ in the Quality of Service provided to the user.
Specified Deterministic QoS connections are characterized by stringent QoS
requirements in terms of delay, cell delay variation, and loss, and are subject to Usage
Parameter Control (UPC) of the peak rate at the ATM UNI. Specified Deterministic QoS
connections could be used to support CBR service (T1/T3 circuit emulation) or VBR
services with minimal loss and delay requirements. Specified Statistical QoS connections
are characterized by less stringent QoS requirements and are subject to Usage Parameter
Control of the peak rate at the ATM UNI. Typically, Specified Statistical QoS
connections would be used to support variable bit rate services (data) that are tolerant of
higher levels of network transit delay compared to applications such as voice, which may
require CBR.

When ATM equipment evolves to require dynamically established ("switched") virtual
connections, the ATM bearer service definition must be augmented to specify the ATM
Adaptation Layer protocol (in the Control Plane - C-plane) required for User-Network
signaling.


The ATM bearer service attributes to be supported by network equipment conforming to
the Local ATM UNI specification are summarized in the following table.

Summary of Required ATM Functionality

ATM Bearer Service Attribute                      Private UNI   Public UNI
Support for Point-to-Point VPCs                   Optional      Required
Support for Point-to-Point VCCs                   Required      Required
Support for Point-to-Multipoint VPCs              Optional      Optional
Support for Point-to-Multipoint VCCs/SVC          Optional      Optional
Support for Point-to-Multipoint VCCs/PVC          Optional      Optional
Support for Permanent Virtual Connections         Required      Required
Support for Switched Virtual Connections          Required      Required
Support for Specified QoS Classes                 Optional      Required
Support of an Unspecified QoS Class               Optional      Required
Multiple Bandwidth Granularities for
  ATM Connections                                 Optional      Required
Peak Rate Traffic Enforcement via UPC             Optional      Required
Traffic Shaping                                   Optional      Optional
ATM Layer Fault Management                        Optional      Required
Interim Local Management Interface                Required      Required

Physical Layer Specification
The private UNI connects customer premises equipment, such as computers, bridges,
routers, and workstations, to a port on an ATM switch and/or ATM hub. Local ATM
specifies physical layer ATM interfaces for the public and private UNI. Currently, a
44.736M bps, a 100M bps, and two 155.52M bps interfaces are specified. Physical layer
functions in the "User Plane" are grouped into the Physical Media Dependent (PMD)
sublayer and the Transmission Convergence (TC) sublayer. The PMD sublayer deals
with aspects that depend on the transmission medium selected. The PMD sublayer
specifies physical medium and transmission characteristics (such as bit timing or line
coding) and does not include framing or overhead information. The transmission
convergence sublayer deals with physical layer aspects that are independent of the
transmission medium characteristics.

SONET Physical Layer Interface
The physical transmission system for both the public and private User-Network Interface
is based on the Synchronous Optical Network (SONET) standards. Through a framing
structure, SONET provides the payload envelope necessary for the transport of ATM
cells. The channel operates at 155.52M bps and conforms to the Synchronous Transport
Signal Level 3 Concatenated (STS-3c) frame. The UNI's physical characteristics must
comply with the SONET PMD criteria specified in ECSA T1E1.2/92-020. Given that
SONET is an international standard, it is expected that SONET hierarchy-based interfaces
will be a means for securing interoperability in the long term for both the public and
private UNI. For various availability and/or economic reasons, however, other physical
layers have been specified to accelerate the deployment of interoperable ATM equipment.


The following table depicts the Physical Layer functions which need to be supported by
ATM-based equipment, such as private switches and hubs.

Physical Layer Functions Required for SONET-Based ATM Equipment
Transmission Convergence Sublayer
  Header Error Control generation/verification
  Cell scrambling/descrambling
  Cell delineation
  Path signal identification
  Frequency justification and pointer processing
  Multiplexing
  Scrambling/descrambling
  Transmission frame generation/recovery

Physical Media Dependent Sublayer
  Bit timing
  Line coding
  Physical interface

Most of the functions comprising the TC sublayer are involved with generating and
processing overhead bytes contained in the SONET STS-3c frame. The "B-ISDN
independent" TC sublayer functions and procedures involved at the UNI are defined in
the relevant sections of ANSI T1.105-1991, ECSA T1E1.2/92-020, and ECSA
T1S1/92-185. The "B-ISDN specific" TC sublayer contains functions necessary to adapt the
service offered by the SONET physical layer to the service required by the ATM layer
(these are the top four functions depicted in the previous table under the TC sublayer).
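The overhead structure can be made concrete with a capacity sketch. The STS-3c frame geometry assumed here (270 columns by 9 rows every 125 microseconds, 9 columns of transport overhead, one column of path overhead) comes from the SONET standard, not from the text above:

```python
# Capacity arithmetic for ATM cells carried in a SONET STS-3c frame.
# Assumed geometry: 270 columns x 9 rows per 125-microsecond frame,
# with 9 transport-overhead columns and 9 bytes of path overhead.

FRAME_RATE = 8000          # frames per second (one frame every 125 us)
ROWS, COLS = 9, 270        # STS-3c frame geometry
TOH_COLS = 9               # transport overhead columns (3 per STS-1)
POH_BYTES = 9              # one column of path overhead

line_rate = ROWS * COLS * 8 * FRAME_RATE                        # gross bit rate
payload_rate = (ROWS * (COLS - TOH_COLS) - POH_BYTES) * 8 * FRAME_RATE

cell_rate = payload_rate / (53 * 8)   # 53-byte ATM cells

print(line_rate)         # 155520000  -> the 155.52M bps channel rate
print(payload_rate)      # 149760000  -> bits available for cells
print(round(cell_rate))  # 353208 cells per second
```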

[Figure: SONET frame format and payload options - the STS-1 frame is 90 columns wide,
comprising 3 columns of transport overhead and an 87-column STS-1 synchronous payload
envelope.]


Physical Layer Functions Required to Support DS3-Based Local ATM Equipment
Transmission Convergence Sublayer

Header Error Control generation/verification
Physical layer convergence protocol framing
Cell delineation
Path overhead utilization
Physical layer convergence protocol timing
Nibble stuffing

Physical Media Dependent Sublayer

Bit timing
Line coding
Physical interface

The 44.736M bps interface format at the physical layer is based on asynchronous DS3
using the C-Bit parity application (CCITT G.703, ANSI T1.101, ANSI T1.107, ANSI
T1.107a, and Bellcore TR-TSY-000499). Using the C-Bit parity application is the
default mode of operation. If equipment supporting C-Bit parity interfaces with
equipment that does not support C-Bit parity, however, then the equipment supporting
C-Bit parity must be capable of "dropping back" into a clear channel mode of operation.

To carry ATM traffic over existing DS3, 44.736M bps communication facilities, a
Physical Layer Convergence Protocol (PLCP) for DS3 is defined. This PLCP is a subset
of the protocol defined in IEEE P802.6 and Bellcore TR-TSV-000773. Mapping ATM
cells into the DS3 is accomplished by inserting the 53-byte ATM cells into the DS3
PLCP. Extracting ATM cells is done in the inverse manner; that is, by framing on the
PLCP and then simply extracting the ATM cells directly.
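The capacity of this mapping can be sketched numerically. The figure of 12 cells per 125-microsecond PLCP frame is an assumption taken from the IEEE 802.6-style PLCP the text references, not a number stated above:

```python
# Rough capacity sketch for ATM cells mapped into a DS3 via the PLCP.
# Assumption: the 802.6-style PLCP carries 12 cells per 125-us frame.

CELLS_PER_FRAME = 12
FRAME_RATE = 8000            # 125-microsecond PLCP frames per second

cell_rate = CELLS_PER_FRAME * FRAME_RATE   # cells per second
payload_bps = cell_rate * 48 * 8           # counting only the 48 payload bytes

print(cell_rate)    # 96000
print(payload_bps)  # 36864000 -> ~36.9M bps of user payload on a 44.736M bps DS3
```

The gap between the 44.736M bps line rate and the payload rate is consumed by DS3 framing, PLCP overhead, nibble stuffing, and the 5-byte cell headers.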

Physical Layer for 100M bps Multimode Fiber Interface

The Local ATM specification also describes a physical layer for a 100M bps multimode
fiber for the private UNI. The motivation is to re-utilize FDDI chip sets at the physical
layer (with ATM at the Data Link Layer). The private UNI does not need the link
distance and the operation and maintenance complexity provided by telecom lines;
therefore, a simpler physical layer can be used, if desired. The Interim Local
Management Interface (ILMI) specification provides the physical layer Operations and
Maintenance functions performed over the local fiber link. As with the previous case, the
physical layer (U-plane) functions are grouped into the physical media dependent
sublayer and the transmission convergence sublayer. The network interface unit (NIU),
in conjunction with user equipment, provides frame segmentation and reassembly
functions and includes the local fiber link interface.
The ATM Forum document specifies the rate, format, and function of the 100M bps fiber
interface. The fiber interface is based on the FDDI physical layer. The bit rate used
refers to the logical information rate, before line coding; the term line rate is used when
referring to the rate after line coding (such as a 100M bps bit rate resulting in a 125M
baud line rate if using 4B/5B coding). This physical layer carries 53-byte ATM cells
with no physical layer framing structure.

This physical layer follows the FDDI PMD specification. The link uses 62.5-micrometer
multimode fiber at 100M bps (125M baud line rate). The optical transmitter and fiber
bandwidth adhere to the specification ISO DIS 9314-3. The fiber connector is the MIC
duplex connector specified for FDDI, allowing single-connector attachment and keying if
desired.

The fiber link encoding scheme is based on the ANSI X3T9.5 (FDDI) committee 4-Bit/5-Bit
(4B/5B) code. An ANSI X3T9.5 system uses an 8-bit parallel data pattern. This
pattern is divided into two 4-bit nibbles, which are each encoded into a 5-bit symbol. Of
the 32 patterns possible with these five bits, 16 are chosen to represent the 16 input data
patterns. Some of the others are used as command symbols. Control codes are formed
with various combinations of FDDI control symbol pairs.
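The nibble-to-symbol mapping described above can be sketched as a lookup table. The 16 data code groups below are the FDDI 4B/5B data symbols; the command and control symbols the text mentions are omitted:

```python
# The 4B/5B data-symbol table from ANSI X3T9.5 (FDDI): each 4-bit nibble
# maps to a 5-bit line symbol chosen so the coded stream keeps enough
# transitions for clock recovery.

SYMBOLS_4B5B = [
    0b11110, 0b01001, 0b10100, 0b10101,   # nibbles 0, 1, 2, 3
    0b01010, 0b01011, 0b01110, 0b01111,   # nibbles 4, 5, 6, 7
    0b10010, 0b10011, 0b10110, 0b10111,   # nibbles 8, 9, A, B
    0b11010, 0b11011, 0b11100, 0b11101,   # nibbles C, D, E, F
]

def encode_byte(b):
    """Split a byte into two 4-bit nibbles and encode each as a 5-bit symbol."""
    hi, lo = b >> 4, b & 0x0F
    return SYMBOLS_4B5B[hi], SYMBOLS_4B5B[lo]

print(encode_byte(0x4E))   # (0b01010, 0b11100) -> (10, 28)
```

Since every 8 data bits become 10 line bits, a 100M bps logical rate yields the 125M baud line rate quoted in the text.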

Physical Layer for 155M bps Multimode Fiber Interface Using 8B/10B

The ATM specification also supports an 8B/10B physical layer based on the Fibre
Channel Standard. This PMD provides the digital baseband point-to-point
communication between stations and switches in the ATM LAN. The specification
supports a 155M bps (194.4M baud), 1300-nm multimode fiber (private) UNI.

The PMD provides all the services required to transport a suitably coded digital bit
stream across the link segment. It meets the topology and distance requirements of
building and wiring standards EIA/TIA 568. A 62.5/125-micron, graded index,
multimode fiber, with a minimum modal bandwidth of 500 MHz/km, is used as the
communication link (alternatively, a 50-micron core fiber may be supported as the
communication link). The interface can operate up to a 2 km maximum with the 62.5/125-micron
fiber; the maximum link length may be shortened when 50-micron fiber is
utilized. The non-encoded line frequency is 155.52M bps, which is identical to the
SONET STS-3c rate. This rate is derived from the insertion of one physical layer cell for
every 26 data cells (the resultant media transmission rate is 194.40M baud).
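The bit-rate/baud-rate relationship above follows directly from the 8B/10B expansion, in which every 8 data bits are transmitted as a 10-bit code group:

```python
# Line-rate arithmetic for the 8B/10B interface: baud rate = bit rate * 10/8.

bit_rate = 155_520_000             # non-encoded rate, equal to SONET STS-3c
baud_rate = bit_rate * 10 // 8     # after 8B/10B coding

print(baud_rate)   # 194400000 -> the 194.4M baud figure quoted in the text
```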

ATM Cell Structure and Encoding at the UNI

Equipment supporting the UNI encodes and transmits cells according to the structure and
field encoding convention defined in T1S1.5/92-002R3.

ATM Cell Fields

Generic Flow Control (GFC): This field has local significance only and can be used to
provide standardized local functions (such as flow control) on the customer site. The
value encoded in the GFC is not carried end to end and can be overwritten in the public
network. Two modes of operation have been defined for operation of the GFC field:
uncontrolled access and controlled access. The uncontrolled access mode of operation is
used in the early ATM environment. This mode has no impact on the traffic which a host
generates.


Virtual Path/Virtual Channel Identifier (VPI/VCI): The actual number of routing bits
in the VPI and VCI subfields used for routing is negotiated between the user and the
network, such as on a subscription basis. This number is determined on the basis of the
lower requirement of the user or the network. Routing bits within the VPI and VCI fields
are allocated using the following rules:

• The VPI subfield's allocated bits are contiguous
• The VPI subfield's allocated bits are its least significant, beginning at bit 5 of octet 2
• The VCI subfield's allocated bits are contiguous
• The VCI subfield's allocated bits are its least significant, beginning at bit 5 of octet 4

Any VPI subfield bits that are not allocated are set to 0.

Payload Type (PT): This is a 3-bit field used to indicate whether the cell contains user
information or Connection Associated Layer Management information. It is also used to
indicate a network congestion state or for network resource management.

Cell Loss Priority (CLP): This is a 1-bit field allowing the user or the network to
optionally indicate the cell's explicit loss priority.

Header Error Control (HEC): The HEC field is used by the physical layer for
detection/correction of bit errors in the cell header. It may also be used for cell
delineation.

[Figure: ATM Cell Structure - a 5-byte header (Generic Flow Control, Virtual Path
Identifier (VPI), Virtual Channel Identifier (VCI), Payload Type, reserved/CLP bit, and
Header Error Check (HEC)) followed by the 48-byte payload.]

Late Breaking News
Federal legislation enabling communications companies to develop a national
information highway took its first step through Congress on March 2, 1994. The House
telecommunications subcommittee unanimously backed a bill that would overhaul the


nation's 1934 Communications Act now that industry alliances and technological
advances are blurring boundaries between telephone and cable industries, Joanne Kelley
of Reuters news service reported.
The information superhighway, championed by Vice President Al Gore, could link the
nation's homes, businesses, and public facilities on high-speed networks to allow them to
quickly send and receive information. "This is the most significant change in
communications policy in 60 years," said Massachusetts Democrat Edward Markey, who
heads the House panel. "This legislation will pave the way for the much-anticipated
information superhighway."
The bill, which next faces a vote by the Energy and Commerce committee, would break
down barriers that currently separate the phone and cable television industries, freeing
them to invade each other's turf. Several revisions to the bill dealt with issues raised by
various industry factions and consumer groups during lengthy hearings during the past
several weeks, Kelley reported.
"Legislation to create a competitive telecommunications marketplace should and can be
completed this year," said Decker Anstrom, president of the National Cable Television
Association. Lawmakers left some key issues unresolved in hopes of maintaining the
tight schedule they set to ensure passage of the legislation this year. Two lawmakers were
persuaded to withdraw controversial amendments until the full committee takes up the
legislation.
The deferred measures include a provision that would allow broadcasters to use existing
spectrum licenses to provide new wireless communications services. That raised fears
among some lawmakers that this spring's first-ever auctions of spectrum for such services
might command lower bids by rivals. The lawmakers also supported a measure pending
before another House judiciary panel that would relax restrictions on regional telephone
companies, which are currently barred from entering the long distance and
equipment-making businesses.


Broadband Network Future - 1996

[Figure: broadband network connecting a central office or remote unit to the residence
or office.]

ATM Nationally and Internationally
MFS Datanet, Inc., an operating company of MFS Communications Company, Inc.,
launched the first end-to-end international ATM service in February 1994, making it
possible to send data globally. Availability of MFS Datanet's ATM service solves the
problem for multinational businesses of how to address increasing requirements for speed
and reliability. More than half of all communications by international companies today
are data, rather than voice.
MFS Datanet -- which launched the first national ATM service on Aug. 4, 1993 -- will
initially offer the service between the U.S. and the United Kingdom, with expansion
planned in Western Europe and the Pacific Rim.1
MFS Datanet's High-speed Local Area Network Interconnect services (HLI), a range of
services allowing personal computer networks to intercommunicate, will now be offered
on an international basis. "MFS Datanet's global expansion will mean customers around
the world will have access to a flexible, economical, high-speed medium to accommodate
emerging high-volume applications such as multimedia," said Al Fenn, president of MFS
Datanet.

1. Lightwave, September 1993.

The international network will be managed and controlled by the Network Control Center
in MFS Datanet's headquarters office in San Jose, Calif., with parallel control for the
U.K. operation in the London offices of MFS Communications Limited. In the U.S., the
ATM service is offered in 24 metropolitan areas.
Sprint ATM Service

SCinet '93 was the prototype for the NIITnet, and all of the facilities that were prototyped
are currently a part of the NIIT. This briefing was delivered at this conference in a
different session. Additional detail is available in the conference proceedings.
[Figure: NIITnet Topology 1993 - Hewlett-Packard workstations; Network Systems
routers (T3); Sprint ATM Service (T3); AT&T (T1); SynOptics FDDI and Ethernet;
Pacific Bell ATM, frame relay, and 950-1ATI dial access services.]


ATM Links Supercomputing and Multimedia in Europe

A cluster using ATM (Asynchronous Transfer Mode) technology has been put into
operational use at Tampere University of Technology, Finland.1 "We have tested ATM
technology with real applications," said project manager Mika Uusitalo. "The results
show the good feasibility and scalability of ATM. It really seems to offer a new way to
distribute and use supercomputing facilities."

Used together, high-speed ATM networks and clustering give a platform for the new
generation of scientific applications in which visual information plays the key role,
Uusitalo said. He added that ATM is a solution for distributed audio-visual
communication applications requiring high network bandwidth and also low, predictable
delays, satisfying the requirements by transferring data extremely fast in small, fixed-length
packets called cells. ATM also has properties that guarantee real-time transmission
of data, video, and audio signals.
Tampere University of Technology's cluster, connected directly to an ATM network, can
be used for traditional supercomputing and running distributed multimedia applications.
Currently, the ATM network at the university operates at 100 Mb/sec., and the cluster
consists of eight Digital Equipment (DEC) Alpha AXP machines. Users noted that the
ATM cluster has much better price/performance than traditional supercomputers.
Among the first applications run at Tampere University of Technology was the
computation and visualization of 3D magnetic and electrical fields, an application typical
of today's needs: computationally very demanding and generating a lot of data for
visualization.
"Earlier, the visualization of simulations included two phases," said Jukka Helin, senior
research scientist. "First, the results of the simulation were generated on a supercomputer
and later post processed and visualized by a workstation. Now the user in the ATM
network sees how the solution evolves in real time." For further information, contact
Jukka Helin at helin@cc.tut.fi or Mika Uusitalo at uusitalo@cc.tut.fi.

1. "ATM Links Supercomputing and Multimedia in Finland," HPCwire 2130, Oct. 4, 1993.

APPENDIX
Background
A number of special interest groups address broadband services and technologies, serving
the information needs of both end users and vendors. Four important groups are the
ATM Forum, the Frame Relay Forum, the Fibre Channel Association, and the SMDS Special
Interest Group. The primary function of these groups is to educate, not to set standards.
Their objective is to promote their respective technologies by expediting the development
of formalized standards, encouraging equipment interoperability, actuating
implementation agreements, and disseminating information through published articles and
other materials. These groups also provide knowledgeable speakers. These groups were
seldom chartered with defining technology standards; however, some of these groups
have become more active in the standards process. The ATM Forum has become the most
active in making specifications work, according to Fred Sammartino, President and
Chairman of the Board of the ATM Forum.

Established Forums Sponsored by INTEROP
Three of these four separate forums, Asynchronous Transfer Mode (ATM), Switched
Multi-megabit Data Service (SMDS), and frame relay, selected INTEROP Co. (Mountain
View, CA) to perform as their secretariat. INTEROP Co. was founded in 1985 as an
educational organization to further the understanding of current and emerging networking
standards and technologies. INTEROP sponsors the annual INTEROP Conference and
Exhibition in the spring (Washington, DC) and in the fall (San Francisco, CA).
Additionally, INTEROP publishes the technical journal ConneXions. The Fibre Channel
Association is a separate organization and functions differently.
INTEROP's Director of Associations, Elaine Kearney, is responsible for the strategic
development of each new forum. Each new forum must be formally incorporated and
seek legal support so that antitrust laws are not violated. INTEROP handles membership
databases, mailings, and trade show representation, and participates on each forum's
board of trustees as a nonvoting member.
The ATM, SMDS, and Frame Relay Forums, with the assistance of INTEROP Co.,
perform as conduits for distributing information relating to their representative
technologies. Additionally, these forums promote cooperative efforts between industry
members and technology users.
All of these groups are described in more detail in this Appendix. These descriptions
include a location (address) from which to acquire more information, the structure of the
management team (where available), and a short description of the charter and
organization objectives.


ATM Forum
ATM Forum
480 San Antonio Road
Suite 100
Mountain View, CA 94040
Phone: (415) 962-2585; Fax: (415) 941-0849
Annual Membership Dues: Full membership $10,000 per company.
Auditing membership: $1,500 (observers may attend general meetings free of charge).
Board of Directors
Chairman and President - Fred Sammartino, Sun Microsystems
Vice President, Business Development - Charlie Giancarlo, ADAPTIVE Corp., an
N.E.T. Company
Vice President, Committee Management - Steve Walters, Bellcore
Vice President, Marketing & Treasurer - Irfan Ali, Northern Telecom
Vice President, Operations & Secretary - Dave McDysan, MCI Communications.
Executive Director - Elaine Kearney, INTEROP Co.
Charter
The ATM Forum's charter is to accelerate the use of ATM products and services through
a rapid convergence and demonstration of interoperability specifications, and promote
industry cooperation and awareness. It is not a standards body, but instead works in
cooperation with standards bodies such as ANSI and CCITT.
Overview of the ATM Forum
The ATM Forum was announced in October 1991 at INTEROP 91 and currently
represents 350+ member organizations. The ATM Forum's goal is to speed up the
development and deployment of ATM products and services. Its initial focus was to
complete the ATM User Network Interface (UNI), which was accomplished June 1,
1992. The UNI isolates the particulars of the physical transmission facilities from upper-layer applications. The ATM Forum specification was based upon existing and draft
ANSI, CCITT, and Internet standards.
The ATM Forum has two principal committees: Technical, and Market Awareness and
Education. The Technical Committee meets monthly to work on ATM specification
documents. Areas targeted include Signaling, Data Exchange Interface (DXI), Network-to-Network Interface, Congestion Control, Traffic Management, and additional physical
media (such as twisted-pair cable). The Market Awareness and Education
Committee's goal is to increase vendor and end-user understanding of ATM technology
and to promote its use.
In response to increasing European interest in ATM technology and the ATM Forum,
the Forum is expanding its activities outside of North America to the European
Community. It held a meeting in Paris in November 1992 to organize its European
activities, and further meetings occurred in 1993.

Frame Relay Forum
The Frame Relay Forum
480 San Antonio Road
Suite 100
Mountain View, CA 94040
Phone: (415) 962-2579; Fax: (415) 941-0849
Annual Membership Dues: Full membership $5,000 per company.
Joint membership for company affiliates: $2,000.
Auditing level: $1,000 per year, reserved for universities, consultants, users, and
nonprofit organizations.
Board of Trustees and Officers
Chairman and President - Alan Taffel, Sprint
Vice President - Sylvie Ritzenthaler, OST
Treasurer - John Shaw, NYNEX
Secretary - John Valentine, Northern Telecom
Trustees - Richard Klapman, AT&T Data Comm. Services; Lawrence J. Mauceri,
Hughes Network Systems; Holger Opderbeck, Netrix
Executive Director - Elaine Kearney, INTEROP Co.
Publications
The Frame Relay Forum News
UNI Implementation Agreement
NNI Implementation Agreement
Charter
The Frame Relay Forum is a group of frame relay service providers, equipment vendors,
users, and other interested parties promoting frame relay acceptance and implementation
based on national and international standards.
Overview of the Frame Relay Forum
The Frame Relay Forum was established in January 1991 and currently has 107 member
organizations. It is a nonprofit corporation dedicated to promoting the acceptance and
implementation of frame relay based on national and international standards. The
Forum's activities include work on technical issues associated with implementation,
promotion of interoperability and conformance guidelines, market development and
education, and round tables on end-user experiences. Forum membership is open to
service providers, equipment vendors, users, consultants, and other interested parties. By
participating in the Forum, members can have an active role in developing this new
technology.


The Frame Relay Forum is organized with a Board of Trustees and four committees. The
committees are the following:

• Technical Committee
• Interoperability & Testing
• Market Development and Education
• Inter-Carrier Committee

In addition, the Forum maintains an International branch (FRF International), with
European and Australian Chapters:
European Chapter
Paris, France
Contact: Sylvie Ritzenthaler
Phone: +33-99-32-50-06; Fax: +33-99-41-71-75
Australia/New Zealand Chapter
Allambie Heights, New South Wales
Contact: Linda Clarke
Phone: +612-975-2577; Fax: +612-452-5397
The Frame Relay Forum has organized User Roundtables so that users of frame relay can
exchange experiences and ideas with one another. The Frame Relay Forum intends User
Roundtables to serve as the basis for a wide array of other user activities.
The Frame Relay Forum Speakers Program
The Frame Relay Forum sponsors a speakers program free of charge. An expert on frame
relay will be made available to visit a company's site and give a 30- to 45-minute
presentation introducing frame relay technology, products, and services. Contact the
Frame Relay Forum for further details and scheduling.

SMDS Interest Group
The SMDS Interest Group
480 San Antonio Road
Suite 100
Mountain View, CA 94040-1219
Phone: (415) 962-2590; Fax: (415) 941-0849
Annual Membership Dues: $4,999.
Annual Affiliate Membership: $3,000.
Annual Associate Membership: $800.
Annual Observer Membership: Free.
Board of Trustees & Officers
Chairman and President - Steve Starliper, Pacific Bell
Vice President - Scott Brigham, MCI


Treasurer - Robyn Aber, Bellcore
Executive Director - Anne M. Ferris, INTEROP Co.
Board Members: Tac Berry, Digital Link; Bill Buckley, Verilink; Joe DiPeppe, Bell
Atlantic; Jack Lee, ADC Kentrox; Hanafy Meleis, Digital Equipment Corp.; Allen C.
Millar, IBM; Connie Morton, AT&T Network Systems; David Yates, Wellfleet
Communications
Publication
SMDS News (published quarterly)
Sharon Hume, Editor
3Com Corporation
5400 Bayfront Plaza
Santa Clara, CA 95052
Phone: (408) 764-5166; Fax: (408) 764-5002
Charter
The SMDS Interest Group is a group of SMDS providers, equipment vendors, users, and
other interested parties working toward the proliferation and interoperability of SMDS
products, applications, and services. The group's activities include advancing SMDS and
providing SMDS education, fostering the exchange of information, ensuring worldwide
interoperability, determining and resolving issues, identifying and stimulating
applications, and ensuring the orderly evolution of SMDS.
Overview of the SMDS Interest Group
The SMDS Interest Group (SIG) was established in the fall of 1990 and currently has 58
member organizations. The SIG provides a central point for discussing and focusing on
SMDS. Members who are SMDS users can influence products and services, gain
exposure to a wide array of vendor offerings, and receive timely information for network
planning. Likewise, SMDS vendors receive timely information for planning and
developing products and services. Vendors also benefit from having the SIG's marketing
efforts to supplement their own. Each SIG member can participate in resolving issues
raised in SIG forums by exercising their membership voting privileges. Each SIG
membership is allowed one vote. Besides having a vote, members can help shape the
direction of SIG activities by participating in several working groups which the SIG
sponsors. These working groups include the following:
• Technical Working Group
• Inter-Carrier Working Group
• PR and Market Awareness Working Group

Each of these working groups meets at the quarterly SIG meeting and at various other
times. The meetings are open to all interested parties.
Members receive meeting minutes outlining the discussions and actions taken at these
forums. Members are kept informed of industry developments via a monthly clipping
service dedicated to tracking published news items on SMDS and competitive services


that appear in the trade press. A quarterly newsletter tracks the SMDS activities of SIG
members. Members can use the newsletter for product announcements and profiles.
The SIG is planning several projects that will further SMDS market awareness. These
activities include the development of a speakers package, an SMDS training course, and a
compendium of SMDS applications.

Fibre Channel Association
Fibre Channel Association
12407 MoPac Expressway North, 100-357
P.O. Box 9700, Austin, TX 78766-9700
Phone: (800) 272-4618
Initiation Fee: $2,500
Annual Principal Membership Dues: $5,000.
Annual Associate Membership: $1,500.
Annual Documentation Fee: $1,000.
Annual Observer Membership: $400 per meeting
Annual Educational Membership: $250.
Board of Trustees & Officers
President - Jeff Silva (needs to be verified).
Vice President - no information.
Treasurer - no information.
Executive Director - no information.
Board Members - Jeff Silva; no further information.
Publication
No information.
Charter
Promote industry awareness, acceptance and advancement of Fibre Channel. Encourage
article publication, educational seminars and conferences along with trade shows, round
tables, and special events. Accelerate the use of Fibre Channel products and services.
Distribute technical and market-related information and maintain a products and services
database. Foster a Fibre Channel infrastructure that enables interoperability. Develop
application-specific profiles and test specifications and environments.
Committees
• Marketing Committee
• Strategic Committee
• Technical Committee

Information About the Fibre Channel Association
Twenty organizations were originally involved in the formation of the Fibre Channel
Association (FCA) in January 1993. The idea is to promote Fibre Channel technology as
a high-performance interconnect standard for moving large amounts of information.
Fibre Channel can transmit large data files bi-directionally at one gigabit per second.1
"While computer processors have become increasingly faster and capable of handling
larger amounts of data, the interconnects between these systems, and the input/output
devices that feed them data, are unable to run at speeds necessary to take advantage of
these more powerful products," said Jeff Silva, FCA board member.
"Fibre Channel offers an excellent solution for these data-intensive environments. At the
same time, when links within and between companies, schools, hospitals, governments
and other organizations are created, Fibre Channel will provide the standards-based, high-speed on-ramps necessary for these digital information highways of the future."
Under development for more than four years, the Fibre Channel standard was initially
endorsed by the Fibre Channel Systems Initiative (FCSI), a joint effort of Hewlett-Packard, IBM and Sun Microsystems Computer Corp. "One of the primary missions of
the FCSI was to kick-start the development of profiles and use of Fibre Channel
technology for the workstation arena," said Ed Frymoyer, FCSI's program manager. "The
FCA's wide spectrum of industry players and their varied interests helps ensure the
development of a broad array of Fibre Channel profiles for products aimed at a multitude
of applications."
Dataquest, a San Jose-based market research firm, estimates that total revenue for the
Fibre Channel adapter market alone will grow from under $2 million in 1993 to $1.2
billion in 1998. It estimates that by 1998, Fibre Channel adapter shipments will reach
more than 500,000 units for workstations, PCs, midrange, mainframe and supercomputer
systems. Similar trends are forecasted for other Fibre Channel products, like ports and
switches.
INTEROP is not the key sponsor of the Fibre Channel Association. For this reason, it
looks and acts a bit differently from the forums that are sponsored by INTEROP. The
following information should give the reader an overview of the Fibre Channel Systems
Initiative.

Fibre Channel Systems Initiative (FCSI)
The Fibre Channel Systems Initiative, a joint effort between Hewlett-Packard, IBM and
Sun Microsystems, announced the first prototype implementation of the high-speed Fibre
Channel standard in August 1993. The companies said the new fiber-optic technology
"dramatically reduces the time it takes to transfer large, complex files between
computers." The first implementation of the prototype technology is at Lawrence

1297 Fibre Channel Association to Advance Interconnect Standard, Aug. 16, 1993, HPCwire.

Livermore National Laboratory, the interoperability test site for the Fibre Channel
Systems Initiative (FCSI).1
Launched in February, 1993, the FCSI is working toward the advancement of Fibre
Channel as an affordable, high-speed interconnection standard for workstations and
peripherals. Because the results of its efforts will be open and available to the public, the
companies said the eventual impact of the technology will enhance the way all computers
are used in business, medicine, science, education and government.
Fibre Channel simplifies the connection between workstations and supercomputers, and
its speed is not affected by additional connections. It allows data to be transmitted bi-directionally at 1 gigabit per second at distances up to 10 kilometers.
"I wanted an interconnect technology that could transfer data as fast as the human eye
could visualize it, and Fibre Channel was exactly what I was looking for," said Paul R.
Rupert, manager of the Advanced Telecommunications Program at Lawrence Livermore
National Laboratory.
Lawrence Livermore is planning to use Fibre Channel in complex computer simulations
of occurrences such as fusion experiments. Because these models are so complex, they
often cannot be completed on a supercomputer without first being manipulated on a
workstation. This requires transferring the data from the supercomputer to a workstation
for manual correction and then back to the supercomputer for completion.
Transferring this data takes up to 40 minutes using an Ethernet connection. With the
prototype Fibre Channel interconnect, it will take eight minutes; with future gigabit-speed
interconnects, it will take two seconds, the companies said.
"Lawrence Livermore's needs were ideally suited for Fibre Channel interconnects
because their applications involve so many different computers, ranging in scale from
workstations to supercomputers, and include several major brands," said Dr. Ed
Frymoyer, program manager for the Fibre Channel Systems Initiative.
Frymoyer said that Livermore's wide-ranging workstation inventory has given the FCSI
much-needed input for developing "profiles," or specifications, that will be available to
the public. These specifications will assist computer manufacturers in designing products
that have Fibre Channel technology built into them, he said.

1280 Livermore Hosts First Fibre Channel Gigabit Implementation, Aug. 11, 1993, HPCwire.

Network Queuing Environment
Robert E. Daugherty
Daniel M. Ferber
Cray Research, Inc.
655-F Lone Oak Drive
Eagan, Minnesota 55121

ABSTRACT
This paper describes the Cray Network Queuing Environment. NQE consists of four components: the Network Queuing System (NQS), Network Queuing Clients (NQC), the Network Load Balancer (NLB), and the File Transfer Agent (FTA). These are integrated to promote scheduling and
distribution of jobs across a network of systems. The systems can be clustered or mixed over
a wide area network, or both. The Network Queuing Environment offers an integrated solution
that covers an entire batch complex, from workstations to supercomputers.
Topics covered include features, platforms supported, and future directions.

CraySoft's Network Queuing Environment (NQE) is a
complete job management facility for distributing work
across a heterogeneous network of UNIX workstations.
NQE supports computing in a large network made up of
clients and servers. Within this batch complex, the NQE
servers provide reliable, unattended batch processing and
management of the complex. The NQE clients (NQC) submit jobs to the NQE batch complex and monitor job status. The
Network Load Balancer (NLB) provides status and control
of work scheduling, seeking to balance the network load
and route jobs to the best available server. The File Transfer Agent (FTA) provides intelligent, unattended file transfer across a network using the ftp protocol.

Copyright © 1994, Cray Research, Inc. All rights reserved.

NQE Configuration

[Figure 1: Sample NQE configuration — NQE clients (running cload, cqstat, and ftad) connected to the NQE Master Server]

From an administrative standpoint, an NQE configuration
could be represented by the figure above. The NQE Master Server is the host which
runs all of the software: Cray NQS, the Network Load Balancer, FTA, and the collectors
used by NLB to monitor machine workload. The NQE Execution Servers, which are replicated on several machines in
the network, contain everything needed to execute jobs on
the host. The NQE clients, running on numerous machines,
provide all the necessary interface commands for job submission and control.

NQE license-manages the server hosts but provides you
with unlimited access to NQE clients. Configuration
options allow you to build redundancy into your system.
Multiple NLB servers may be installed in order to provide
access to the entire system in the event of the failure of one
NLB server.

Network Queuing System

Cray NQS provides a powerful, reliable batch processing
system for UNIX networks. The systems can be clustered
(tightly coupled) or mixed over a wide area network
(loosely coupled), or both. NQE offers an integrated solution covering an entire batch network, from workstations
to supercomputers.

Using the qmgr command, the NQE manager defines the
systems in the network and the system operators, creates
NQS queues, and defines other system parameters related
to queues and how jobs are processed.

NQS operates with batch and pipe queues. Batch queues
accept and process batch job requests. Pipe queues receive
batch job requests and route the requests to another destination for processing. In creating queues, options are provided to allow control over queue destinations, the
maximum number of requests that can be processed at one
time, and the priority in which the queues are processed.
Permissions can be created to grant or restrict user access
to specified queues. Limits can be set up for each queue to
control the number of requests allowed in a queue at one
time, the amount of memory that can be requested by jobs
in the queue, the number of requests that can be run concurrently, and the number of requests one user can submit. Queues may be configured into queue complexes. In
addition to the limits that may be specified for each queue,
global limits can be implemented to restrict activity
throughout the NQS system.
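The effect of these limits can be pictured as a set of admission checks applied when a request enters a queue. The following Python sketch is purely illustrative (the class, field names, and limit values are invented, not the actual NQS implementation); it models a queue-resident limit and a per-user limit:

```python
# Illustrative model of NQS-style queue limits (not Cray's implementation).
from dataclasses import dataclass, field

@dataclass
class Queue:
    name: str
    queue_limit: int        # max requests resident in the queue at one time
    user_limit: int         # max requests one user may have in the queue
    requests: list = field(default_factory=list)  # (user, state) pairs

    def can_accept(self, user: str) -> bool:
        # Enforce the per-queue and per-user limits before queuing a request.
        if len(self.requests) >= self.queue_limit:
            return False
        per_user = sum(1 for u, _ in self.requests if u == user)
        return per_user < self.user_limit

    def submit(self, user: str) -> bool:
        if not self.can_accept(user):
            return False
        self.requests.append((user, "queued"))
        return True

q = Queue("batch", queue_limit=3, user_limit=2)
print(q.submit("alice"))  # True
print(q.submit("alice"))  # True
print(q.submit("alice"))  # False: per-user limit of 2 reached
print(q.submit("bob"))    # True
print(q.submit("carol"))  # False: queue limit of 3 reached
```

Global NQS limits would compose the same way, as an additional check spanning all queues.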
NQE provides both command-based and graphical user
interfaces for job submission and control. The
cqstat utility provides status information for requests
throughout the batch complex.
[Figure 2: cqstat display window — the "Complex qstat display," showing Network Job Status (NQS id, run user, status, CPU and memory used and limits) for server yankee]

Network Load Balancer

The Network Load Balancer (NLB) makes load balancing
recommendations to the NLB server. NLB also provides
an interface for the collection and display of workload
information about the network. This information, in addition to being used to set NLB job destination policies, is
useful to the administrator as a graphical tool for network
load display.

Collectors are installed on each host in the network where NQE is installed. These collectors send load and status
information to the NLB server, installed on the NQE master server. The server stores and processes this information. NQE then queries the server for a recommended host, and the job is
sent to the host recommended by the load
balancer. The destination selection process can be configured to take into account site-specific factors. Users may
tailor the balancing policy and algorithm; exits are provided so that jobs can override the destination selection
process.

The utility cload provides a continually updated display
of machine and load status. This tool enables you to easily
compare machine loads.

[Figure 3: cload display main window]

Popup windows are provided to give additional information about any selected host. This tool provides a visual
way to determine which machines are heavily used and
whether a machine is not providing new data.

[Figure 4: Host load display]

The Network Load Balancer allows you to use your heterogeneous network of systems from various vendors as a
single computing resource. By utilizing the NLB destination selection process, running jobs on the least loaded
system will in many cases provide the best turnaround
time for the user and make the best use of system
resources.
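The destination selection just described — collectors report load, the server recommends a host, and a job may override the recommendation — can be sketched as follows. This is an illustrative model only, not Cray's algorithm; the host names, load figures, and function names are invented:

```python
# Sketch of NLB-style destination selection (illustrative, not Cray's algorithm).
hosts = {"yankee": 0.82, "zulu": 0.35, "xray": 0.55}  # host -> reported load

def default_policy(load_by_host):
    """Recommend the least loaded host."""
    return min(load_by_host, key=load_by_host.get)

def pick_destination(load_by_host, policy=default_policy, override=None):
    # 'override' models the user exit that lets a job force its own destination.
    if override is not None:
        return override
    return policy(load_by_host)

print(pick_destination(hosts))                     # zulu (least loaded)
print(pick_destination(hosts, override="yankee"))  # yankee (job overrode selection)
```

A site-specific policy would simply replace `default_policy` with a function weighing whatever factors the site configures.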

File Transfer Agent
The File Transfer Agent (FTA) provides reliable
(asynchronous and synchronous) unattended file
transfer across the batch complex. The ftad daemon services FTA requests issued by users and
NQE itself. The ftua and rft commands are
provided for file transfer. Built upon the TCP/IP
ftp utility, FTA provides a reliable way to transfer files across the network from your batch
requests. Using ftp, all file transfers take place
interactively; you must wait until the transfer is
completed before proceeding to another task.
With FTA, file transfer requests can be queued.
FTA executes the transfer; if the transfer does not
complete due to an error, FTA automatically requeues
the transfer. The ftua facility is similar to ftp in
its user interface and offers the full range of file
manipulation features found in ftp. The rft
command is a one-line command interface to
FTA. It is a simplified interface to the ftua feature of copying files between hosts. In general, it is
a reliable mechanism for transferring files in simple operations, especially from within shell
scripts or NQE.
FTA also allows network peer-to-peer authorization, which enables you to transfer files without specifying passwords in batch request files or
.netrc files, and without transmitting passwords over
the network. It requires use of FTA on the local
system and support for peer-to-peer authorization on the remote system. It can be used to
authorize both batch and interactive file transfers.
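The queue-and-requeue behavior of FTA can be sketched as follows. This is an illustrative model only; the function names and retry count are invented, and the real FTA performs transfers over the network using the ftp protocol:

```python
# Sketch of FTA-style queued, unattended transfer with automatic requeue
# on failure (illustrative model, not the actual FTA implementation).
from collections import deque

def run_queue(transfers, attempt, max_tries=3):
    """Drain a queue of transfer requests, requeuing any that fail."""
    queue = deque((t, 0) for t in transfers)
    done, failed = [], []
    while queue:
        name, tries = queue.popleft()
        if attempt(name):
            done.append(name)
        elif tries + 1 < max_tries:
            queue.append((name, tries + 1))   # automatic requeue
        else:
            failed.append(name)               # give up after max_tries
    return done, failed

# Simulated network: 'bigfile' fails once, then succeeds.
failures = {"bigfile": 1}
def attempt(name):
    if failures.get(name, 0) > 0:
        failures[name] -= 1
        return False
    return True

print(run_queue(["bigfile", "results"], attempt))
# (['results', 'bigfile'], [])
```

The point of the sketch is that the user's session never blocks: requests sit in a queue, and transient errors cost a retry rather than an aborted batch job.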

NQE Clients

The Network Queuing Client (NQC) provides a
simplified user interface for submitting batch jobs
to the network. Typically the NQC is installed on
all user systems, providing submit-only access
to NQE. No NQE jobs will be run on the NQC
system. NQC also provides access to the GUI job
status tools, FTA, and the job submission, deletion, and control commands. With each NQE
license, NQC stations may be installed in unlimited quantities at no additional charge.

UNICOS NQE

A new product, NQX, will make available the following NQE components on UNICOS 8 systems:
the NLB collector and associated libraries, the NQS interface to the network load balancer, NQS server support for NQC clients, the NQE clients, and the
Network Load Balancer.
With the addition of NQX, UNICOS systems can
be part of the NQE batch complex. CraySoft NQE
for workstations is not required, but both UNICOS and workstation systems may belong to the
NQE.

Summary

The Cray Network Queuing Environment runs
each job submitted to a network as efficiently as
possible on the available resources. This translates
into faster turnaround for users.
When developing NQS for use on Cray Research
supercomputers, we made it as efficient, reliable,
and production-oriented as possible. With NQE
we've created a high-performance integrated
computing environment that is unsurpassed in
performance and reliability.
This environment is now available to Cray and
workstation systems in an integrated package.

Operating Systems

Livermore Computing's Production Control System, 3.0*

Robert R. Wood
Livermore Computing, Lawrence Livermore National Laboratory
Abstract
The Livermore Computing Production Control System, commonly
called the PCS, is described in this paper. The intended audiences for this document are system administrators and resource
managers of computer systems.
In 1990, Livermore Computing, a supercomputing center at
Lawrence Livermore National Laboratory, committed itself to
convert its supercomputer operations from the New Livermore
TimeSharing System (NLTSS) to UNIX-based systems. This was
a radical change for Livermore Computing in that over thirty years
had elapsed during which the only operating environments used
on production platforms were LTSS and then later NLTSS. Elaborate facilities were developed to help the lab's scientists productively use the machines and to accurately account for their use to
government oversight agencies.
UNIX systems were found to have extremely weak means by
which the machine's resources could be allocated, controlled and
delivered to organizations and their users. This is a result of the
origins of UNIX which started as a system for small platforms in
a collegial or departmental atmosphere without stringent or complex budgetary requirements. Accounting is also weak, being
generally limited to reporting that a user used an amount of time
on a process. A process accounting record is made only after the
completion of a process, and then only if the system does not
"panic" first.
Thus, resources can only be allocated to users and can only be
accounted for after they are used. No cognizance can be taken of
organizational structure or projects. Denying service to a user
who has access to a machine is crude: administrators can beg
users to log off, can "nice" undesirable processes or can disable a
login.

Large computing centers frequently have thousands of users working on hundreds of projects. These users and the projects are
funded by several organizations with varying ability or willingness to pay for the computer services provided. With only typical UNIX tools, the appropriate delivery of resources to the correct
organizations, projects, tasks and users requires continual
human intervention.
UNIX and related systems have become increasingly more reliable over the past few years. Consequently, resource managers
have been presented with an attendant problem of accurate accounting for resources used. Processes can now run for days or
months without terminating. Thus, a run-away process or a malicious or uninformed user can use an organization's budgeted
resource allocation many times over before an accounting record
to that effect is written. If a process's accounting record is not
written because of a panic, the computer center is faced with possibly significant financial loss.
The PCS project, begun in 1991, addresses UNIX shortcomings
in the areas of project and organizational level accounting and
control of production on the machines:

• The PCS provides the basic data reporting mechanisms required for project level accounting systems. This raw data is provided in near-real time.
• The PCS provides the means for customers of UNIX based production systems to be allocated resources according to an organizational budget. Customers are then able to control their users' access to resources and to control the rate of delivery of resources.
• The PCS provides the mechanisms for the automated delivery of resources to all production being managed for customers by the system.
• The PCS does more than merely prevent the overuse of the machine where not authorized. It also proactively delivers resources to organizations that are behind in consumption of resources, to the extent possible through the use of the underlying batch system.
• Customers are able to manage their users' access directly, to a large extent without requiring heavy or continual involvement of the computer center staff.

* This work was performed under the auspices of the United
States Department of Energy by Lawrence Livermore
National Laboratory under contract No. W-7405-Eng-48.

It helps to state what the PCS is NOT. It is not a CPU scheduler;
rather, it relies on a standard kernel scheduler to perform this
function. It is not a batch system; however, it may be regarded
as a policy enforcement module which controls the functioning
of a batch system. Finally, it does not do process level accounting.
While the PCS is not a memory or CPU scheduler, it does adapt
the production demand on the machine to present an efficiently
manageable workload to the standard memory and CPU
schedulers. The PCS monitors memory load, swap device load,
idle time, checkpoint space demand (if applicable), etc. to keep
resource demands within bounds that are configurable by site administrators.
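The bounds checking described above might look like the following sketch. The metric names and thresholds are illustrative assumptions, not Livermore Computing's actual configuration:

```python
# Sketch of a PCS-style resource monitor: compare sampled loads against
# site-configurable bounds and report which demands are out of range.
# (Metric names and limits are invented for illustration.)
bounds = {"memory": 0.90, "swap": 0.75, "checkpoint_space": 0.80}

def out_of_bounds(sample, bounds):
    """Return the metrics whose sampled demand exceeds the configured bound."""
    return [m for m, limit in bounds.items() if sample.get(m, 0.0) > limit]

sample = {"memory": 0.95, "swap": 0.40, "checkpoint_space": 0.85}
print(out_of_bounds(sample, bounds))  # ['memory', 'checkpoint_space']
```

In the real system the response to an out-of-range metric would be to throttle the production workload handed to the batch system, rather than to touch the kernel schedulers directly.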

Overview
There are two major components of the PCS: the Resource Allocation & Control system (RAC) and the Production
Workload Scheduler (PWS).

Resource Allocation & Control system (RAC)
Generally, the RAC system is used to manage recharge accounts,
to manage allocation pools and to manage user allocations within
the allocation pools. A recharge account should not be confused
with a user "login" account. So that the term "group" not continue to be overused, the PCS has borrowed another term to mean
an allocation pool or group. This term is "bank".
As resources are consumed on the machine, the RAC system associates those resources with the users who are consuming them,
a bank and a recharge account. The user and the bank are debited
the resources used and an accounting report is made for the purpose of charging the account. All of this is done in near-real
time. If the RAC system determines that a user, recharge account
or bank has consumed as much of the resources as has been permitted, the RAC system takes steps to prohibit further resource
consumption by that user, recharge account or bank via the use of
graceful denial of service. Denial of service includes automatically "nicing" processes, suspension of interactive sessions and
checkpointing of batch jobs.
A recharge account, or simply account, is essentially a credit that
represents an amount of usable resources (which may be unlimited). Users may be permitted to charge an account for resources
used. Some users, called account coordinators, may be permitted
to manage the account. That is, account coordinators may grant
and deny other users access to an account. Accounts are independent from each other; that is, accounts have no sub-accounts.
The primary purposes of accounts are 1) as a mechanism by which
budgetary charges may be determined for organizations whose
personnel use the computer and 2) a "one stop" limit on resource
accessibility to users.
A bank represents a resource pool available to sub-banks and
users who are permitted access to the bank. As implied, banks
exist in a hierarchical structure. There is one "root" bank which
"owns" all resources on a machine. Resources of the root bank
are apportioned to its sub-banks. The resources available to each
bank in turn may also be apportioned among its sub-banks. There
is no limit to the depth of the hierarchy.
Some users, called bank coordinators, may create and destroy sub-banks and may grant and deny other users access to a bank. The
authority of coordinators extends from the highest level bank at
which they are named coordinator throughout that bank's sub-tree.
Users are permitted access to part or all of a bank's resources
through a user allocation. There is no limit on the number of
allocations a user may have.
A primary purpose of banks is to provide a means by which the
rate of delivery of allocated resources is managed. Also, the PWS
uses the banking system to prioritize and manage production on
the machine. This is further described in the PWS section below.

Production Workload Scheduler (PWS)
The Production Workload Scheduler schedules production on the
machine. Production requirements are made known to the PWS
in the form of batch requests. When the PWS is installed, users
do not submit their requests directly to the batch system, but rather
submit them to the PWS which then submits them to the batch
system. When users submit a batch request, they must specify
the bank from which resources are to be drawn and the account
to be charged for the request's resources.
One important function of the PWS is to keep the machine busy
without overloading it. Interactive work on the machine is not scheduled by the PWS. However, the PWS does track the resource load presented by interactive usage and adjusts the amount of production accordingly, to "load level" the machine.
At any point, there is a set of production requests being managed
by the PWS. This set is called the production workload. Requests in this workload are prioritized according to rules and allocations laid out by system administrators and coordinators. High
priority requests are permitted to run insofar as the machine is
not overloaded.
The PWS uses a mechanism called adaptive scheduling to prioritize a set of sibling banks. Simply stated, to schedule a set of sibling banks adaptively is to schedule from the bank (or its subtree) which has the highest percentage of its allocated time not yet delivered in the current shift. Each bank has associated with it a scheduling priority. This priority is a non-negative, floating-point number. The scheduling priority is a multiplier or accelerator on the bank's raw adaptive scheduling priority.
Each request has associated with it several priority attributes: coordinator priority, intra-bank priority (also called the group priority), and individual priority. These attributes establish a multi-tiered scheme used to prioritize requests drawing from the same bank.
Some requests may be declared to be short production by a user. Requests with this attribute are scheduled on demand (as if they were interactive). A coordinator may grant or deny any user permission to use this feature.

Vendor Requirements
The PCS requires an underlying batch processing system. The
ability to checkpoint batch jobs is highly recommended where
feasible, but not required.
System functionality required of the platform vendor is as follows:
The processes of a session (in the POSIX sense of that word) must be identifiable at the session level. For instance, when a user logs in, the login process (or telnetd process) should call the POSIX routine setsid(). Every process spawned from any child of that login (or telnetd) process should be considered a member of the same session (unless one of them calls setsid() again, in which case a new session is created for that process subtree). The locality of processes in a session is not material to membership in the session. This is meant to address the issue of "parallel" processes executing in several nodes of a system. Each session existing simultaneously on a platform must have a unique identifier. This identifier must be independent of the node(s) on which the processes of the session are executing.
The resources used by processes on the platform must be collected by the underlying system at the session level. The most critical pieces of information are the session owner user id and the amount of user CPU, system CPU, and user-induced idle CPU time consumed by the processes of the session. User-induced idle CPU time results when a CPU is dedicated to a user process which then sleeps. Other data are useful as well, such as characters of I/O, paging (on virtual machines), swap space consumed, etc.
The resources used by the system's idle processes must be collected as "idle time," and the resources used by other system processes that are not part of any session must be collected as "system time."
There must be a way to periodically extract the data collected by
the system. The periodicity of data extraction must be
configurable. The manner of extraction (reading a kernel table,
report by vendor supplied software, etc.) must be efficient.
Various session management functions are required. These functions take a session id and apply the function to every process in the session. (Usually, these functions are written to allow several classes of process identifier sets, such as process, process group leader, session, etc.) These functions are:
• Send any signal to all processes in the session.
• Set the nice value or priority of all processes in the session.
• Temporarily suspend (i.e., deny any CPU time to) all processes in the session. (This function must not be revocable by the user.)
• Revoke the action of the suspend function for all processes in the session.
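A portable sketch of these session-level operations follows. The function names are invented, and a real implementation would consult vendor kernel tables rather than scan a caller-supplied pid list; the suspend/revoke pair maps onto SIGSTOP/SIGCONT, since SIGSTOP cannot be caught and therefore cannot be revoked by the user:

```python
import os
import signal

def session_pids(pids, sid):
    """Filter a candidate pid list down to members of session sid."""
    members = []
    for pid in pids:
        try:
            if os.getsid(pid) == sid:
                members.append(pid)
        except ProcessLookupError:
            pass                      # process exited during the scan
    return members

def signal_session(pids, sid, sig, kill=os.kill):
    """Send sig to every process in the session; return pids signalled."""
    members = session_pids(pids, sid)
    for pid in members:
        kill(pid, sig)
    return members

def suspend_session(pids, sid, kill=os.kill):
    """Deny CPU time to the whole session (not revocable by the user)."""
    return signal_session(pids, sid, signal.SIGSTOP, kill)

def resume_session(pids, sid, kill=os.kill):
    """Revoke the action of suspend_session."""
    return signal_session(pids, sid, signal.SIGCONT, kill)
```

The `kill` parameter is injectable only so the sketch can be exercised without actually stopping processes.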

Addressing the Future
Several development needs have been identified for PCS. These
enhancements will make the use of PCS by administrators more
flexible and will encompass a larger environment (more platforms,
for instance). Users will also find the PCS to be more of an aid in
the handling of their production.
PCS Support of Distributed Computing
Support should be provided for multi-host execution of a single job where the hosts are not necessarily of the same type. For instance, a portion of a job might execute on an MPP while another portion executes on a large vector machine and, finally, a supervisor process for the job may execute on a workstation.
Distributed Management of the PCS
One problem with the current PCS in an environment with more
than one host is that system administrators and coordinators
must work with each PCS system independently. We need to
reduce the management workload by managing the PCS for all
hosts from a single platform. Furthermore, coordinators and
system administrators should be able to manage the entire PCS
system from any host rather than being required to manage it
from a single platform.
Cross Host Submission and Management of Production
Users should be able to submit their production requests on one machine and have them execute on another. The results should be available at the submitting host or the executing host, as requested by the user. Users should be able to establish cross-host dependencies for their production.
Enhanced Dependent Execution of Production
PCS already supports a dependency type wherein a request is
prohibited from running until a designated job terminates. New
dependency types should be supported. Some of these are:
1. Don't run a request until and unless a designated request terminates with a specified exit status. (If a designated request terminates with an exit status other than that specified in the dependency, the dependent request is removed.)
2. Don't run a request until a designated request has begun executing.
3. Begin the execution of a set of production requests simultaneously.
4. Don't run a request until an "event" is posted by some other request. (This would require mechanisms by which "event definitions" are declared, the occurrence of the event is posted and
event definition removed when no longer needed.)
5. Don't run a request until specified resources (files, etc.) are staged
and available.
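One way such dependency types might be evaluated is sketched below. This is a hypothetical illustration of the proposed feature, not PCS code; the field names and dependency encoding are invented:

```python
def runnable(request, state):
    """Decide whether a request's dependency is satisfied.

    request["dep"] is (kind, target[, extra]); state maps request ids
    to dicts with 'started', 'done' and 'exit_status' fields, and
    state["events"] is the set of event names posted so far.
    """
    dep = request.get("dep")
    if dep is None:
        return True                          # no dependency: free to run
    kind, target = dep[0], dep[1]
    if kind == "exit":                       # type 1: designated request must
        t = state[target]                    # finish with a given exit status
        return t["done"] and t["exit_status"] == dep[2]
    if kind == "started":                    # type 2: target has begun executing
        return state[target]["started"]
    if kind == "event":                      # type 4: a named event was posted
        return target in state["events"]
    raise ValueError("unknown dependency kind: " + kind)
```

Types 3 and 5 (simultaneous start, resource staging) would need scheduler and data-management cooperation beyond a per-request predicate like this one.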
Enhancement of PCS to Use Various Batch & Operating Systems
The PCS has been written to use the Network Queuing System
provided by Cray Research, Inc. and Sterling Software. It
should be extended to use other batch systems as appropriate.
It has been written to run on UNICOS 7.0 and Solaris 2.1.3. It should, of course, be maintained to remain compatible with future releases of vendors' systems.


ALACCT: LIMITING ACTIVITY BASED ON ACID
Sam West
Cray Research, Inc.
National Supercomputing Center for Energy and the Environment
University of Nevada, Las Vegas
Las Vegas, NV

Abstract
While UNICOS provides a means to account for a user's computing activity under one or more Account IDs, it doesn't allow for limiting a user's activity on the same basis. Nor does UNICOS provide for a hierarchy of Account IDs as it does with Fair-Share Resource Groups.
Alacct is an automated, hierarchically organized, SBU-oriented administrative system that provides a mechanism to limit a user's activity based on the SBU balance associated with an Account ID which the user must select.
Implementation comprises an acidlimits file along with utilities and library routines for managing and querying the file.

1.0 Introduction
NOTE: In the following, the term account is used synonymously with Login ID. When reference is made to an 'account' in the accounting sense, Account ID or ACID will be used.
In this paper we will be discussing Alacct, an administrative system for automatically enforcing access limitation to the NSCEE's Cray Y-MP2/216 based on year-to-date accumulation of SBUs within an Account ID. We will discuss the motivation for such a system and the limitations of UNICOS that led to its development. An earlier solution to the problem, along with its shortcomings, is also described.

1.1 Motivation
The user base at NSCEE is, like that of many computing centers, a dynamic entity. Our users come from the University of Nevada System, private industry, government agencies, and universities outside of Nevada. Their accounts evolve over time as projects come and go and affiliations change and, likewise, the bill for their computing activity must get charged to different funding sources as those projects and affiliations change.


This dynamic nature creates problems for the system manager who has to deal with the changes. At NSCEE, since account
creation and deletion is done manually, automation of any portion
of that process is a boon. Fortunately, accounting for a particular
user's computing activity is a feature that is already provided by
Unix, and, by extension, UNICOS. In fact, UNICOS goes a step
further and allows for accounting by UID-ACID pair, to facilitate
accounting for one user working on more than one project. Also,
CSA, Cray Research System Accounting, which is an extension to
Standard Unix Accounting, provides a site-configurable System
Billing Unit, or SBU, to express in a single number a user's
resource utilization.

1.2 Limitations of UNICOS
However, two important account-related system management capabilities are not directly provided by UNICOS: the ability to automatically force a user to select a project (Account ID) against which to bill a session's computing, and then to limit his or her computing activity on that same basis. These capabilities are important for at least two reasons: to the extent that it is possible, you want to be able to correlate the consumption of resources to actual projects; and, like a 1-900 telephone number, you don't want users consuming resources for which they may be unable, or unwilling, to pay.
Currently, UNICOS provides no way to force Account ID selection. With respect to the second capability, the only means provided, other than manually turning off a user's account, is the cpu resource limitation feature of the limit(2) system call (with the user's limit set in the cpuquota field of the UDB). This mechanism suffers from two inadequacies, however: cpu utilization does not completely represent the use of a complex resource like a supercomputer; and, even worse, since there is only one lnode per user, the cpuquota is tied only to the UID, not to a UID-ACID pair. So, disabling a user on this basis means disabling him or her entirely, not just for activity on a particular project.

2.0 Background
2.1 Multiple UID Accounting
An early attempt at solving this problem at NSCEE, Multiple UID Accounting, involved the creation of multiple accounts per individual user, with each account being a member of a class of account types: Interim, Class, DOE, etc. The individual accounts were distinguished from one another by a suffix letter that indicated their class membership: i for Interim, c for Class, d for DOE, etc. Likewise, each account type had its own Account ID. Users would then use the Login ID that was associated with the particular project he or she wanted to work on for that session. UIDs were allocated on multiple-of-5 boundaries to allow for future account types.

This mechanism allowed for billing to be tied to a particular user's activity on a particular project, insofar as was possible, while guaranteeing that, when a user had reached his or her limit of resources for that project, the system administrator could disable the user's account associated with that project without disabling any of the user's other, viable, accounts. Since, at NSCEE, a user's bill is generated directly from his or her SBU charges, the aforementioned limit was expressed in SBUs.
The implementation of this scheme involved a file of per-Login ID limits and a daily report that cross-referenced each Login ID's year-to-date system utilization with its limit to produce an exception list. The system administrator would use that exception report to manually disable any account that was over its SBU limit. Again, the primary reasons for solving the problem in this way were that it allowed NSCEE to take advantage of existing mechanisms for forcing users to consciously choose a project and for allowing the system administrator to limit activity based on SBU utilization on one project without disabling the user entirely.
Needless to say, this became something of a nightmare to administer, and the users liked it even less. Limits had to be set on a per-user basis, instead of for a class of users. Also, it should be pointed out that this scheme essentially subverted the purpose of allowing multiple Account IDs per Login ID as a means of accounting for the multiple projects of a single user.

2.2 A New Way
Therefore, a New Way was sought. NSCEE's Director, Dr. Bahram Nassersharif, had recently come from Texas A&M University, where Victor Hazelwood had implemented a strategy for automatically limiting, at login, a user's access to the system when the user had exceeded an SBU quota. This sounded like a much better way, and we decided to head in that direction and begin work on implementing it immediately.
After some discussion, however, we concluded that we needed some features that were not provided by the TAMU system, especially a hierarchically organized Account ID framework. So, we decided to write the system from scratch, incorporating the ideas we got from TAMU and the experience we already had with Multiple UID Accounting.

3.0 Requirements and Constraints
After further discussion, a list was formulated of the features and capabilities the NSCEE needed out of this project, as well as the constraints under which we should proceed.
Alacct should:
• Be easy to implement.
• Limit users based on the aggregate SBU consumption by all users within a particular Account ID.
• Allow for hierarchical Account ID management.
• Force Account ID selection at login, etc.
• Be easy to administer.
• Be portable to new releases of UNICOS.
• Be robust.
Alacct should not:
• Require kernel mods.
• Require multiple accounts per user.
• Have significant system impact.
• Have significant user impact.

4.0 Implementation
Alacct is implemented at the user level. It is not real-time, in the sense that new usage is reflected only after daily accounting has been run. It is fairly simple: a central file keeps track of Account ID limits and usages; a library of user-callable routines queries that file, and modifications to critical UNICOS commands make calls to those routines; and administrative utilities are used to maintain that file.

4.1 Acidlimits
The focal point of the implementation of alacct is the
/usr/local/etc/acidlimits
file. This file contains limit, usage, and hierarchy information for each Account ID on the system. It is queried whenever an Account ID must be verified to have a balance greater than 0.00. This happens during login, during execution of the su(1) and newacct(1) commands, and also in nqs_spawn, prior to starting a user's NQS job.
The acidlimits file is symbolic and made up of entries like:
21100:102.00:0.00:5.72:20000

Where the five fields represent:
acid:limit:usage:progeny_usage:parent_acid

The acid field is an integer representing the numeric Account ID.


The limit field is a float representing the maximum number of SBUs that can be consumed, collectively, by all users that process under the specified Account ID and its progeny (see the progeny_usage field).
The usage field is a float representing the year-to-date, collective SBU consumption by all users who have processed under the specified Account ID (as well as optional historical usage; see al_usages).
The progeny_usage field is a float representing the recursively computed sum of all usage field entries for all progeny in the hierarchy of the Account ID represented in the acid field.
The parent_acid field is an integer representing a numeric Account ID. This field defines a precedence relation on the entries in the file to establish a hierarchy of Account IDs. This hierarchy takes the form of an inverted tree like that in figure 1.
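Parsing such an entry is straightforward. The sketch below follows the field layout just described; the structure name and the textual spelling of an unlimited limit are assumptions of this sketch, not details given by the paper:

```python
from collections import namedtuple

AcidLimit = namedtuple("AcidLimit",
                       "acid limit usage progeny_usage parent_acid")

def parse_acidlimits_line(line):
    """Parse one acid:limit:usage:progeny_usage:parent_acid entry."""
    acid, limit, usage, progeny, parent = line.strip().split(":")
    # Model an 'unlimited' limit as infinity so balance arithmetic works.
    lim = float("inf") if limit == "unlimited" else float(limit)
    return AcidLimit(int(acid), lim, float(usage), float(progeny),
                     int(parent))
```

For the sample entry above, 21100 is the Account ID, 102.00 its limit, 0.00 its own usage, 5.72 its progeny usage, and 20000 its parent.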
In figure 1 we see a list of users at the bottom. Each of these users has the Account ID GrCP13 as one of their Account IDs. GrCP13's parent Account ID is GrantE, whose parent is Grant, whose parent is Internal. Internal has, in its parent field, 0, which, when appearing as a parent Account ID, represents the meta-root Account ID (since 0 is a valid Account ID and must also have a line in the acidlimits file). All top-level Account IDs, including root, have the meta-root as their parent, to complete the hierarchy.
Note that, although in figure 1 users only appear on a leaf of the tree, there is nothing that prevents users from having one of the hierarchy's internal Account IDs as one of their Account IDs.
This file contains an entry for all Account IDs currently in use by any user on the system. If a new Account ID is added to the
/etc/acid
file, then, the next time the acidlimits file is created, that new Account ID will be inserted with meta-root as its parent and a limit of unlimited. As you will see later, a new acidlimits file is created from, among other sources, the old acidlimits file. In the bootstrap process, when the system is first installed and there is no old acidlimits file, an acidlimits file is created from /etc/acid with a one-level hierarchy, meta-root as every Account ID's parent, and unlimited limits for every Account ID.
Note that since acidlimits is a symbolic file, it can be created or modified by hand with your favorite editor. In a system with few Account IDs this might be feasible. However, there is an acidlimits editor, aledit(8l), that understands the structure of the file and makes changing the file a simple task (see aledit).

4.2 Local System Modifications
In this implementation there are two times when the user is forced to select the Account ID to which his or her computing activity will be billed (and, by virtue of which, access may be denied): at login and when su'ing. Likewise, it is at these times, and also when requesting a change of Account IDs with newacct(1) and when starting an NQS job with the user's current Account ID, that the Account ID is validated for a positive balance. To achieve these actions, modifications were made to the following source files:
• login.c
• newacct.c
• su.c
• nqs_spawn.c

To keep the number of mods small, this list is, intentionally, short. Notably absent are cron(1) (and, by extension, at(1)) and remsh(1). The user's default Account ID is used in these cases. In practice, this has not created problems. Also unmodified is the acctid(2) system call, since kernel mods were specifically avoided and, anyway, only root can change a user's Account ID.

4.3 libal.a
The aforementioned mods involve calls to a library, libal, which was written to support the alacct system. That library contains the following routines:

[figure 1: the Account ID hierarchy as an inverted tree, with users (swcsl, regf, regu, crreef, yangf) at the leaves under GrCP13]

figure 1

• alget()     Get a single entry from the acidlimits file
• alput()     Update a single entry in the acidlimits file
• alread()    Read the entire acidlimits file
• alwrite()   Write the entire acidlimits file
• algetbal()  Retrieve the current balance of a particular Account ID
• alselect()  Display, and query for selection, the user's current Account IDs and their respective balances

Cray UNICOS (clark.nscee.edu) (ttyp005)

National Supercomputing Center for Energy and the Environment
University of Nevada Las Vegas

Use of this system is restricted to authorized users.

login: swest
Password:
Valid account ids and balances for user swest:
1.   root - unlimited SBUs
2.    adm - unlimited SBUs
3.  Nscee - 908.35 SBUs
4.  Class - 0.00 SBUs
5. GrCP13 - 17.81 SBUs
please select by line number.
account id?> 3
Last successful login was : Thu Mar 10 09:33:29 from localhost

figure 2

The user must select one of the line numbers corresponding to an Account ID with a positive balance. If a valid Account ID is selected, then that Account ID becomes the current Account ID and the login process (or su) is successfully completed. If an invalid Account ID is selected, the user is notified and logged off (or the su fails).
An Account ID's balance is recursively computed as the lesser of: a) the difference of its limit and the sum of its usage and its children's usage, and b) the difference of its parent's limit and the sum of its parent's usage and its parent's children's usage, with the meta-root as the limiting case for the recursion (meta-root's limit is unlimited).
In other words, an Account ID's balance is the least of all balances of all Account IDs on the path back to the meta-root. This means that if any Account ID has exhausted its allocation, then all its progeny have, effectively, exhausted theirs.
It is the requirement that the predecessors' balances be positive that gives function to this hierarchy. If, for example, it were the case, referring again to figure 1, that Internal accounts (i.e., Account IDs whose predecessor path includes the Internal Account ID) had an aggregate limit greater than the limit of the Internal Account ID itself, then it would be possible to exhaust the allocation to all Internal accounts before any one of those accounts had reached its own limit. Also, this scheme allows for special users, perhaps the principal investigator on a project, to have access to one of the Account IDs that is an internal node of this hierarchy, and to draw on its balance directly, using it up before the users who have access to Account IDs further down the hierarchy have a chance to use theirs.

5.2 newacct(1)
In the case of the newacct(1) command, its invocation is unchanged. To change Account IDs, the user still specifies the new Account ID to which he or she wishes to change. With alacct in operation, however, the user's selection is not only verified as a legitimate Account ID for that user, but is also verified for a balance greater than 0.00. Examples of newacct's operation appear in figure 3.
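The recursive balance rule described above can be sketched as follows. The table layout is invented for this sketch; an 'unlimited' limit is modelled as infinity, and a parent field of 0 denotes the meta-root, exactly as in the acidlimits conventions described earlier:

```python
def balance(acid, table):
    """Balance of an Account ID under the alacct rule.

    table maps acid -> (limit, usage, progeny_usage, parent_acid).
    An ID's own headroom is limit - (usage + progeny_usage); its
    balance is the least headroom along the path back to the
    meta-root, whose limit is unlimited.
    """
    limit, usage, progeny_usage, parent = table[acid]
    headroom = limit - (usage + progeny_usage)
    if parent == 0:       # parent 0 denotes the meta-root: stop recursing
        return headroom
    return min(headroom, balance(parent, table))
```

Note how an exhausted ancestor (headroom <= 0) automatically drives every descendant's balance to zero or below, which is exactly the hierarchical cutoff the paper describes.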

5.3 nqs_spawn
NQS jobs carry the Account ID of the qsub'ing user. It is possible that, by the time the user's job is selected for processing by the nqsdaemon, the limit of the requested Account ID will have been exceeded. For this reason, the check for positive balance is deferred until the job is actually spawned by NQS. If at that time the Account ID is over its limit, the job is rejected and e-mail is sent to the user.

5.0 User Interaction
As mentioned, there are four instances when a user would be aware that there is something affecting his or her processing: at login, when issuing the su and newacct commands, and when his or her NQS job is started.

5.1 login(1) and su(1)
When a user logs in, or issues an su command, he or she will see a screen like that in figure 2. This is the output from the alselect() library call.

/u1/cri/swest 102=> newacct -a
root   (0),     account balance - unlimited
adm    (7),     account balance - unlimited
Nscee  (21400), account balance - 908.35
Class  (21600), account balance - 0.00
GrCP13 (23513), account balance - 17.81
/u1/cri/swest 103=> newacct -l
Current account name: Nscee, account balance - 908.35
/u1/cri/swest 104=> newacct Class
newacct: account balance < 0.0, account unchanged
/u1/cri/swest 105=>

figure 3

6.0 Administration
The day-to-day administrative requirements of this system are minimal. In addition to any Account ID additions, deletions or modifications (which would be made with the aforementioned aledit program), the acidlimits file must be revised daily to reflect the current year-to-date usage of the various Account IDs on the system. Of course, this means that the site must produce a year-to-date accounting file at least as often as the site wishes the usage information stored in the acidlimits file to be accurate. UNICOS provides a fairly easy method of doing this without requiring that the system administrator maintain on-line all accounting data for the year.

6.1 CSA
CSA produces summary data for the current accounting


period when daily accounting (csarun) is run (it is referred to as daily accounting, but it can be run on any boundary desired). This data is 'consolidated' by the csacon(8) program into cacct format. CSA also allows for 'periodic' accounting to be run, which adds together, with csaaddc(8), an arbitrary collection of cacct files to produce a file that represents accounting for a specified period. By default, csaperiod stores its output in the fiscal subdirectory of the /usr/adm/acct directory.
At NSCEE, csaperiod is run on the first day of every month for the previous month to produce monthly accounting. These monthly files are kept on-line, and the daily files that produced them are cleaned off to start the new month. Year-to-date accounting, then, would be the summation of all the fiscal files along with the current month's daily files. To accomplish this, a local variation of csaperiod was written, csaperiod_year. Csaperiod_year is like csaperiod, except that, in addition to the /usr/adm/acct/sum directory, it also knows about the /usr/adm/acct/fiscal directory, to add in monthly data as well.
Figure 4 illustrates the year-to-date accounting process which would take place on July 14. Periodic (fiscal) data for each of the months from January through June, along with the daily (sum) accounting files for the month of July, which are all in cacct format, would be added together to produce the current year-to-date accounting file, pdacct.

[figure 4: the fiscal and sum cacct files being summed into the year-to-date file, pdacct]

figure 4

6.2 mkal
As mentioned before, the year-to-date accounting file contains the information that will be used to keep the usage fields in the acidlimits file current. For this purpose, there is the mkal(8l) program. Mkal updates the acidlimits file with current usage information by reading the year-to-date accounting data file, pdacct (see figure 5).

[figure 5: the mkal update process]

figure 5

The program operates in the following steps:
1. Read /etc/acid to get a current list of valid Account IDs. All Account IDs start with the default limit of unlimited.
2. Read /usr/local/etc/acidlimits to bring forward the most recent limit information.
3. Read /usr/local/etc/al_usages to update the Account IDs' usage fields to reflect historic usage (optional).
4. Read /usr/local/adm/acct/pdacct, summing all usage for each Account ID, and then update the Account IDs accordingly.
5. Rewrite /usr/local/etc/acidlimits.

6.3 crontab
The accounting and mkal operations have been automated, and figure 6 shows the relevant crontab entries which, each day, run daily accounting, year-to-date accounting, and mkal.

[figure 6: crontab entries running ckpacct, ckdacct nqs, the local daily accounting script, dodisk, csarun, the monthly and yearly csaperiod runs, and mkal, along with other site jobs (sa system-activity collection, data migration, etc.); the cron time fields are not reproduced]

figure 6
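In-memory, the five mkal steps amount to something like the sketch below. The file formats are simplified to dicts and the function name is invented; this is an illustration of the update logic, not the actual mkal code:

```python
def mkal(valid_acids, old_limits, historic_usage, pdacct_usage,
         default_parent=0):
    """Rebuild the acidlimits table from its four input sources.

    valid_acids    - Account IDs listed in /etc/acid
    old_limits     - acid -> (limit, parent) from the previous acidlimits
    historic_usage - acid -> SBUs from al_usages (optional history)
    pdacct_usage   - acid -> year-to-date SBUs summed from pdacct
    Returns acid -> (limit, usage, parent); progeny usage is derived
    elsewhere from the parent relation.
    """
    table = {}
    for acid in valid_acids:
        # Steps 1-2: default to unlimited/meta-root, then carry the old
        # limit and parent forward where an entry already existed.
        limit, parent = old_limits.get(acid, (float("inf"), default_parent))
        # Step 3: start the usage field from optional historic usage.
        usage = historic_usage.get(acid, 0.0)
        # Step 4: add the year-to-date usage summed from pdacct.
        usage += pdacct_usage.get(acid, 0.0)
        table[acid] = (limit, usage, parent)
    return table    # step 5: the caller rewrites acidlimits from this
```

New Account IDs thus enter the file with an unlimited limit and the meta-root as parent, exactly the bootstrap behavior described in section 4.1.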


6.4 aledit
If resource limits change, or a new Account ID has been created, the acidlimits file must be rewritten to reflect this change or addition.
The aledit(8l) program is the primary maintenance tool for the alacct system. It operates in two modes, Interactive and Display, and allows a suitably privileged user to interactively view and edit the acidlimits file, or to display a formatted, hierarchically arranged report of the contents of that file.
In Display Mode the aledit command accepts options to produce reports showing:

• The entire acidlimits hierarchy, appropriately indented (a portion of an example of which is shown in figure 7).

[figure 8: the aledit Interactive Mode display — the children of the meta-root account with their limits (5 accounts, total = unlimited), with the cursor on a limit field and key help at the bottom: (k)up (j)down (h)left (l)right (^U)users (b,B)balance (x)mark (p,P)put (U)unlimit (u)update]

figure 8
/usr/local/lib/acct/aledit -t | more

root          unlimited       0.00   unlimited
  Internal     12432.33    4383.17     8049.16
    Bench          45.33      13.00       32.33
      BenBal        4.00       3.85        0.15
      BenChem      41.33       9.15       32.18
    Interim      1000.00     495.18      504.82
    Unfunded     5203.00    1974.07     3228.93
      UnfP25       200.00       0.03      199.97
      ...
      UnfP1        100.00     106.58       -6.58
    Class           0.00       0.00        0.00
    Grant        5184.00    1737.69     3446.31
      GrantE      1722.00     206.18     1515.82
        GrCP18      40.00       0.00       40.00
        ...
      GrantI      3462.00    1531.52     1930.48
        GrUcp19     60.00       2.10       57.90
        ...

figure 7 (a portion; the columns are limit, usage, and balance, and the full report continues through the UnfP, GrCP/GrantCP, and GrUcp accounts)

• The entire acidlimits hierarchy along with users
• A single Account ID, including users, progenitor path and children
Executing aledit with no options places the user in Interactive Mode. When you execute aledit in Interactive Mode, the program reads and assembles, into a tree, the Account ID hierarchy defined by the predecessor relation found in the acidlimits file. The terminal display is then initialized to show the children of the 'meta-root' account, and input is accepted directly from the keyboard (see figure 8).
As you can see, the normal display mode shows a list of Account IDs down one column, and their corresponding limits in another. Notice the special limit 'unlimited'. Account IDs with an unlimited limit are not subject to usage deductions when calculating a balance.
The accounts on the screen at any given time constitute a peer group of accounts, in that they are at the same level in the hierarchy and have the same parent. The parent is shown, along with some other information, and help messages, at the bottom of the display (users of the UNICOS shrdist(8) command may recognize the format; portions of the display code for aledit were derived from shrdist).
There are several major functions provided within aledit:
• Movement within the Hierarchy
• Changing Limit Values
• Changing the Hierarchy
• Changing Display Modes

Notice that in figure 8 the cursor is resting on the limit value
for the Internal Account ID. Changes to that limit can be made by
simply typing in a new value. To move up or down the list, the 'k'
or 'j' keys are pressed. If this peer group had a parent Account ID,
which in this case it doesn't, pressing the 'i' key would move the
display up one level to display the peer group of the parent. Also,
if Account ID Internal had children, which it does, pressing the
'm' key while the cursor is resting on Internal will take the user
down the hierarchy to display the children of that Account ID (see
figure 9).

217

[figures 8 and 9, screen images not reproduced: aledit displays listing Account IDs with limits, page and account totals, and the key-help lines: (k)up (j)down (h)left (l)right (i)higher (m)lower / (/)srch (+,-)pg+,- (^U)users (b,B)balance (?)help / (x)mark (p)put (P)Put (U)unlimit (u)update (q)quit]

figure 9

The Account IDs in figure 9 are peers of one another and are
children of the Internal Account ID. Notice that the total of all limits for all Account IDs on this display is 12432.33 SBUs, which
matches the limit on the Internal Account ID itself. This is not a
requirement; that total may be greater or less than the parent
limit, depending on the circumstances. As an aid, however, since
those values will frequently be the same, there is a balance
function that propagates the total up one level to the parent, or all
the way up the hierarchy to meta-root.
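The balance function's total-propagation step might be sketched as follows (data structures and names are illustrative, not aledit's internals):

```python
# Sketch: sum a peer group's limits into the parent's limit, one
# level up, or recursively all the way up from the leaves.

def propagate(children, limits, acid):
    """Set limits[acid] to the sum of its children's limits (one level)."""
    kids = children.get(acid, [])
    if kids:
        limits[acid] = round(sum(limits[k] for k in kids), 2)
    return limits[acid]

def propagate_all(children, limits, acid):
    """Balance the whole subtree bottom-up (call on meta-root's children)."""
    for k in children.get(acid, []):
        propagate_all(children, limits, k)
    return propagate(children, limits, acid)

children = {"Internal": ["Grant", "GrantE"]}
limits = {"Internal": 0.0, "Grant": 3462.00, "GrantE": 5184.00}
print(propagate(children, limits, "Internal"))   # 8646.0
```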
Continuing with this example, we see in figures 10 and 11 the
displays for the children of the Grant Account ID, followed by the
children of the GrantE Account ID.

figure 11

Finally, in figure 12, if we change display modes to show
users, we see the list of users who have Account ID GrCP13 as
one of their Account IDs.
As mentioned before, aledit also allows for the manipulation
of the hierarchy itself. This is accomplished in three steps:
1. Marking with an 'x' an Account ID, or list of Account IDs,
2. Positioning the cursor on the parent (or, alternatively, the
peer group) where the Account ID is to be moved, and
3. Pressing 'p' (or 'P').

Note that if the hierarchy is changed in any way, mkal must
be rerun to correctly compute the new progeny_usages.
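The mark-and-put move, and the bottom-up progeny_usage recomputation that mkal must then perform, can be sketched as follows (all structures and names are illustrative):

```python
# Sketch: reparent marked Account IDs ('x' then 'p'), after which
# per-subtree usage totals (progeny_usage) must be recomputed.

def move(children, marked, new_parent):
    """Detach every marked acid from its old parent, attach under new_parent."""
    for parent, kids in children.items():
        children[parent] = [k for k in kids if k not in marked]
    children.setdefault(new_parent, []).extend(sorted(marked))

def progeny_usage(children, usage, acid):
    """An account's own usage plus the usage of every descendant."""
    return usage.get(acid, 0.0) + sum(
        progeny_usage(children, usage, k) for k in children.get(acid, []))

children = {"Internal": ["Grant"], "Grant": ["GrCP13"], "External": []}
usage = {"Grant": 1531.52, "GrCP13": 2.61}
move(children, {"GrCP13"}, "External")
print(children["Grant"])                           # []
print(progeny_usage(children, usage, "External"))  # 2.61
```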

6.5 al_usages
As mentioned earlier, the al_usages file is an optional feature
of alacct that allows for mapping SBU usage by a particular UID
to a new Account ID against which this usage must be deducted.
[screen images not reproduced: figure 10 lists 2 accounts with an account total of 5184.00; figure 12 lists the users holding Account ID GrCP13 with their usage and balance values]
figure 10

figure 12

218


This mechanism allows us to bring forward from previous years
historical usage that must be tied to a particular Account ID. The
mkal_usages(8) program is usually used at year-end to create,
from the final year-to-date accounting file, an al_usages file for the
coming year which contains Account IDs that will be carried forward into the new year.
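The carry-forward lookup could be sketched like this (the file shapes and the per-(UID, Account ID) keying are illustrative assumptions, not the real al_usages layout):

```python
# Sketch: the optional al_usages file maps historical usage, recorded
# per UID, onto the Account ID it must be deducted from in the new year.
al_usages = {
    (1001, "GrantCP1"): 32.23,    # (uid, acid) -> prior-year SBU usage
    (1002, "Grant I"):  1531.52,
}

def carried_forward(acid):
    """Total prior-year usage to deduct from this Account ID's new-year limit."""
    return round(sum(u for (_uid, a), u in al_usages.items() if a == acid), 2)

print(carried_forward("GrantCP1"))  # 32.23
```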

System impact is minimal: approximately 45 seconds of real
time (10 CPU seconds) were required on a lightly loaded system to
compute year-to-date accounting for March 12 (2 fiscal files and
11 daily files). mkal operates on the year-to-date accounting file in
under 1 second to produce the acidlimits file. Disk requirements at
NSCEE are approximately 700 blocks (~3MB) per cacct file,
whether monthly or daily, so on December 31 we would need
approximately (11 + 30) * 700 blocks = 28700 blocks (~120MB).
These values will, of course, vary from system to system depending on the number of active Account IDs and UIDs.

7.0 Future Plans
Just like the shrdist(8) command allows a point-of-contact
(POC) for each resource group to control the allocation of
resource shares in his or her group, a POC should be able to distribute limits to children of the Account ID he or she controls.

Improvements to the timeliness of the acidlimits file could be
made with a new routine to summarize the current accounting
period's data and update a new field in the acidlimits file, todays_usage.

Approximation to real-time operation could be implemented
with a daemon process which would scan the process list as often
as desired and, using the aforementioned improvement to the acidlimits file, kill processes with Account IDs which are over the
newly computed usages.

8.0 Summary
Alacct is an administrative extension to UNICOS that allows
for the automatic limitation of computing activity based on SBU
consumption by individual user accounts within an Account ID.

Alacct requires modifications to selected commands in UNICOS. These modifications cause the commands to make calls to a
library of routines, libal.a, to query a file that contains limit information for individual Account IDs. This 'acidlimits' file contains
an entry for each Account ID on the system and is organized hierarchically such that limits are imposed from the top down while
usage is propagated from the bottom up.
Administratively, alacct is maintained by manipulating the
acidlimits file directly with the aledit program, which allows for
changes to limits and to the hierarchy itself, and indirectly with the
mkal program which, using current year-to-date accounting information, reconstructs the file when necessary to reflect current
usages.
Users interact with this system directly, and are forced to
select an Account ID under which to compute, when logging in
and when executing the su(1) and newacct(1) commands. Users
interact with this system indirectly when their NQS job is started.
Since the implementation of alacct is fairly simplistic and it
does not operate in real time, the potential exists for users to subvert its purpose. However, the safety net of a robust UNICOS
accounting system guarantees that no SBU will go unbilled, making the greatest risk, therefore, that a user might have to be turned
off manually.

9.0 Acknowledgments
Thanks very much to Joe Lombardo and Mike Ekedahl at
NSCEE for a final reading of the paper. Also, thanks to Joe for
using aledit while I got it right. Thanks to Victor Hazelwood for
suggestions and for already figuring out that this was a reasonable
thing to do.

10.0 References
1 Cray Research, Inc., UNICOS System Administration Volume 1, SG-2113 7.0, A Cray Research, Inc. Publication.
2 Cray Research, Inc., UNICOS System Calls Reference
Manual, SR-2012 7.0, A Cray Research, Inc. Publication.
3 Cray Research, Inc., shrdist v. 7.0, Unpublished Source
Code.
4 Cray Research, Inc., UNICOS 7.0, Unpublished Source
Code.
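The real-time approximation proposed in Future Plans (a daemon that periodically scans the process list and kills processes whose Account IDs are over their newly computed usages) might look like this in outline; every function and structure here is a hypothetical stand-in for real system interfaces:

```python
# Sketch of the proposed enforcement daemon's sweep: terminate
# processes whose Account ID has exceeded its limit. scan results
# and kill() are stand-ins for real process-table interfaces.

def over_limit(limits, usages, acid):
    return (limits.get(acid) != "unlimited"
            and usages.get(acid, 0.0) > limits.get(acid, 0.0))

def sweep(processes, limits, usages, kill):
    """processes: list of (pid, acid). Returns the pids killed."""
    killed = []
    for pid, acid in processes:
        if over_limit(limits, usages, acid):
            kill(pid)
            killed.append(pid)
    return killed

# One illustrative sweep, with limits/usages from figure 7:
procs = [(101, "GrantCP5"), (102, "GrantCP8")]
limits = {"GrantCP5": 20.00, "GrantCP8": 145.00}
usages = {"GrantCP5": 24.31, "GrantCP8": 7.35}
print(sweep(procs, limits, usages, kill=lambda pid: None))  # [101]
```

A real daemon would wrap sweep() in a timed loop and draw usages from the todays_usage field proposed above.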

219

Appendix 1 - Alacct(3l) man page

alacct(3l)

LOCAL INFORMATION

NAME
    alacct - Introduction to local alacct UNICOS enhancement

DESCRIPTION
    alacct is the name given to an NSCEE enhancement to UNICOS that
    permits NSCEE to automatically limit computing activity based on
    SBU utilization within an account.

    alacct comprises the following elements:

    o /usr/local/etc/acidlimits    File of acid limits, usages and
                                   hierarchy information. Usage is
                                   updated on a daily basis by the
                                   mkal utility.

    o /usr/local/etc/al_usages     (optional) File of historical usages.

    o /usr/local/lib/libal.a       Library of acidlimits access and
                                   update routines.

    o /usr/local/lib/acct/mkal     Administrative utility to update
                                   usages in the acidlimits file.

    o /usr/local/lib/acct/aledit   Administrative utility to maintain
                                   the acidlimits file.

    o /usr/local/adm/acct/pdacct   Year-to-date accounting information.

    o modifications to
      /bin/login
      /bin/su
      /bin/newacct
      /usr/lib/nqs/nqsdaemon

NOTES
    This extension to UNICOS requires mods to the following system source
    files (and their respective Nmakefiles):

    /usr/src/cmd/newacct/newacct.c    (mod 00cmd00007a)
    /usr/src/cmd/su/su.c              (mod 00cmd00008a)
    /usr/src/cmd/login/login.c        (mod 00cmd00009a)
    /usr/src/net/nqs/src/nqs_spawn.c  (mod 00nqs00010a)

    For further information regarding these mods, please see their source in
    /usr/src/rev/local.

FILES
    /usr/local/etc/acidlimits
    /usr/local/adm/acct/pdacct
    /usr/local/etc/al_usages
    /usr/local/lib/libal.a
    /usr/local/lib/acct/mkal
    /usr/local/lib/acct/aledit

SEE ALSO
    libal(3l), acidlimits(5l), mkal(8l), aledit(8l)

[Figure 11 character plot not reproduced: job counts binned by CPU time (secs, y-axis, 0.00-100000.00) versus memory (MW, x-axis)]

Figure 11

286

[Figure 12 character plot not reproduced: job counts binned by CPU time (secs, y-axis, 20.00-200.00) versus memory (KW, x-axis, 10.00-2000.00)]

Figure 12

287

[Figure 13 character plot not reproduced: job counts binned by CPU time (secs, y-axis, 0.00-100000.00) versus memory (KW, x-axis)]

Figure 13

288

***** Final Clustering Results (data size = 517) *****

The Cluster Size : 10
[membership listings for clusters 1-10 omitted; the OCR text is not recoverable]

--------------------------------------------------------------------
Ind   Size   Freq.   Minimum   Maximum   Centroid   Median   Std_dev
--------------------------------------------------------------------
[per-cluster statistics omitted; the OCR text is not recoverable]

Figure 14

289

[Figure 15: character-mode histogram, counts by memory size with MEM (KW) on the x-axis; the plot itself is not recoverable from the OCR text]

Figure 15

290

CRAY C90D PERFORMANCE
David Slowinski
Cray Research, Inc.
Chippewa Falls, Wisconsin

What is a CRAY C90D? Cray Research, Inc. (CRI) is
most famous for vector computers built with the fastest
memory technology available. For every new generation of
systems CRI has designed numerous enhancements to add new
capabilities, improve reliability, reduce manufacturing costs, and
increase memory size. The CRAY-1 M, CRAY X-MP/EA, and
CRAY Y-MP M90 are all computer systems that use denser,
less expensive memory components to achieve big memory
systems. The CRAY C90D computer is our big memory
version of the CRAY C90 system with configurations that go
up to 8 processors and 2 gigawords of memory.
C90D Configurations

C92D    2/512
C94D    4/1024
C98D    8/2048

How is the CRAY C90D different from a CRAY
C90? The memory module is a new design that incorporates
C90 circuit technology, 60 nanosecond 16 megabit commodity
DRAM memory components, and the lessons learned on the
successful CRAY Y-MP M90 series. There are minor changes
to power distribution to handle the voltages required for the
DRAM chips. And, we made changes to the CRAY C90 CPU
to allow it to run with the new memory design. The new
Revision 6 CPU includes CRI's latest reliability enhancements
and can run in either a C90 or C90D system. Everything else is
the same! This greatly reduces the costs and risks of introducing
a new system for manufacturing and customer support. The
C90D is bit compatible with the C90 supercomputer and runs
all supported C90 software.
How is performance affected? The good news is that
the C90D subsection busy time is the same as the C90
supercomputer. I wish I knew how to build a less expensive,
more reliable, bigger memory system that is faster than a C90
system. Unfortunately, using a slower memory chip means a
longer memory access time and less memory bandwidth. In my
view, the C90D does have a good "balance." But, of course,
every code would run faster with faster memory.

Performance (in clock periods)

                    CRAY    CRAY       CRAY    CRAY
                    Y-MP    Y-MP M90   C90     C90D
---------------------------------------------------
Subsection Busy       5       15         7       7
Bank Busy             4       20         6      28
Scalar Latency       17       27        23      35

Livermore Loops

Loop   C94D/C916   Application
  1       .92      Hydro fragment
  2       .81      Incomplete Cholesky-CG
  3       .83      Inner product
  4       .76      Banded linear equations
  5       .76      Tri-diagonal elimination
  6       .66      General linear recurrence
  7       .95      Equation of state
  8       .69      A.D.I. integration
  9       .85      Integrate predictors
 10       .83      Difference predictors
 11       .85      First sum, partial sums
 12       .92      First difference
 13       .79      2D particle in cell
 14       .93      1D particle in cell
 15      1.03      Casual FORTRAN
 16       .76      Monte Carlo search
 17       .87      Implicit, conditional computation
 18       .85      2D explicit hydrodynamics
 19       .87      Linear recurrence
 20       .82      Discrete ordinates transport
 21       .71      Matrix multiply
 22       .49      Planckian distribution
 23       .97      2D implicit hydrodynamics
 24      1.01      First minimum

          .80      Harmonic mean

Performance of standard benchmarks. There is a
wide variation in performance depending on the code. The
harmonic mean average performance of the loops relative to the
CRAY Y-MP C916 is .80, but there is no single performance
number that tells the whole story.

Copyright © 1994 Cray Research, Inc. All Rights Reserved. This document or parts thereof may not be reproduced in any form unless permitted by contract or by written permission of Cray Research, Inc.

291

Let's look at these numbers a little closer. The scalar loops (5,
11, 16, 17, 19, and 20) are slower than with the BiCMOS memory
of the C90 computer, as we might guess. Loops 13 and 14 use
gather; these do relatively well due to the fast subsection busy
time. Many factors affect the performance of the vector loops
including memory demands of the code, access patterns, and data
allocation.
PERFECT Club Benchmarks
Single CPU Optimized Versions

Code      C94D/C916      Code      C94D/C916
adm          .78         qcd          .72
arc2d        .89         trfd         .90
flo52        .80         dyfesm       .77
ocean        .82         spice        .59
spec77       .77         mg3d         .89
bdna         .86         track        .89
mdg          .97         Average      .85

The performance of the C90D computer on the PERFECT Club
benchmarks is similar to what we see on many customer
programs. There is a wide range of performance depending on
each code's demands on memory. Some, like mdg, run nearly
as fast on the C90D as on the C90 supercomputer. Others, like
spice, suffer from the longer memory latency.

Why big memory? I see three main reasons:
Run big jobs. Some jobs are simply too big to run on current
systems. Even though there may be only a few big jobs these
are often the ones poised to make new breakthroughs and
advance the state-of-the-art. In my view, the justification for
supercomputers is their ability to solve big problems and big
memory is the feature that allows us to attempt unsolved
problems. We cannot advance science by rerunning yesterday's
jobs with better price performance.
Reduced coding complexity. Yes, it is theoretically possible to
run any problem using only a small memory. I have heard one
of the original authors of Nastran tell stories about running out-of-core
problems with paper tape. It IS possible, but there is a
high cost in coding complexity. Out-of-core solutions may
require many months or even years to implement and they add
nothing to the quality of the solution.
Improved performance. This may come in many different ways.
Reduced I/O latency and system overhead may improve wall
clock performance by an order of magnitude in some cases. The
Cray EIE I/O library effectively uses large memory buffers to
significantly reduce I/O wait time and system overhead for many
programs without any user code changes. The biggest speedups
come when big memory allows us to use a more efficient
algorithm.

292

An exciting example of using a more efficient algorithm for
large memory systems was implemented in Gaussian-92 from
Gaussian Incorporated. Gaussian-92 is a computational
chemistry code used by over one hundred Cray customers.
Gaussian-92 has the ability to use different algorithms depending
on the characteristics of the computer it is running on. The time
consuming parts of a Gaussian-92 run are often the calculations
of the 2-electron integrals.
The original "conventional" scheme generates the integrals once,
writes them to disk, and then reads them for each iteration of a
geometry optimization. This method uses the least floating-point
operations and little memory but requires big disk files and
lots of I/O.
The "direct" method was developed for vector computers and
was released in Gaussian-88. It recomputes the integrals as they
are needed. This method uses little memory and I/O, but requires
lots of floating-point operations to recompute the integrals. The
direct method is optimal for vector computers. The direct
method also allows RISC workstations to run problems that
previously could only be attempted using supercomputers.
With enough memory we could compute the integrals just once
and save them in memory. This "incore" method was first
developed for the big memory CRAY-2 systems and was released
in Gaussian-90.
Gaussian-92 Rev. C Benchmark
mp2=(fc)/6-311+g(3df,2p)

              Basis        Direct        Incore
              Functions    CPU seconds   CPU seconds
                           (64 MW)       (1700 MW)
------------------------------------------------------
Molecule A      249          8127          2121
Molecule B      240          7097          1712

Here are the timings for two real customer problems run on a
CRAY M98 with 4 gigawords of memory using both the direct
and incore methods. The direct runs used 64 million words of
memory and would not benefit from more memory. The incore
runs needed 1700 million words of memory. Using big memory
with a more efficient algorithm for these problems gains about a
factor of four performance improvement over the best performing
algorithm with small memory.
I believe there are similar opportunities for significant speed
increases in many other applications programs.
Conclusion. The cost for a million words of memory has
dropped by a factor of 50,000 since the first CRAY-1 computer.
With the many benefits of big memory and continuing dramatic
improvements in memory cost, big memory will certainly be an
important feature of future supercomputers.

Software Tools

Cray File Permission Support Toolset
David H. Jennings & Mary Ann Cummings
Naval Surface Warfare Center / Dahlgren Division
Code K51
Dahlgren, VA 22448
ABSTRACT
The Cray UNICOS Multilevel Security (MLS) Operating System supports the use of Access Control Lists
(ACLs) to control permissions to directories and files. However, the standard commands (spacl, spget,
spset) are difficult to use and do not allow all the capabilities needed in a multi-user environment. This
paper describes a toolset which enhances the standard UNICOS commands by easily allowing multiple
users the ability to give permissions to a directory or file (or multiple directories or files), even if they are
not the owners.

1. Introduction
The Systems Simulation Branch (K51) is responsible for
the design and programming of large software models
which are typically composed of hundreds of C and
FORTRAN source files. The source, header, and executable
files are divided into separate directories, and often files
from other directories are needed to execute the entire
model. The Cray File Permission Support Toolset was
designed to aid the K51 users in using the UNICOS 1 Access
Control Lists to control file accesses to these large models.
It is general enough that any user of the system can benefit
from using this toolset for access control of any number of
files or directories.
The Cray File Permission Support Toolset was developed
because of a need for a development environment where a
set of files and directories were created and modified by
multiple developers. Other users also needed access to some
of the files and directories. Also, in our environment the
development teams and users of one set of files may need
access to other sets as well. The traditional UNIX2
permission scheme and the added ACL permission scheme
provided with our UNICOS MLS system did not provide
the functionality that was needed.

2. Terminology
Before we begin our discussion of the toolset, the following
terms must be defined: ACL, project, model, POC,
development group, and users.
1. UNICOS is a registered trademark of Cray Research, Inc.
2. UNIX is a registered trademark of AT&T.

• In a UNICOS MLS environment, an Access Control List
  or ACL provides an additional level of permission
  control for files and directories. An ACL does not take
  the place of the traditional UNIX permission scheme
  where the owner of a file or directory controls access to
  the owner, group, and all other users of the system (i.e.,
  world); instead it works in conjunction with the UNIX
  permissions. An ACL provides the ability to give one or
  more users permission to a file or directory.
• A project is a collection of files spread across multiple
  directories.
• A model consists of one or more projects.
• A POC is the point of contact for a project and owns all
  the directories (not necessarily all the files) under a
  project.
• A development group is a set of users who have the
  responsibility of changing a project's files. They are
  members of the UNIX group for the project's files.
• Users are the set of those who need access to certain files
  within the project. They are part of the world for UNIX
  permissions on files, but certain project directories will
  be given an ACL so users can navigate into directories
  where the files are located.

3. Requirements
In order to build a set of tools to aid in granting permissions
in our complicated environment, a set of requirements were
first created. These requirements included:
• Ability to change permissions for a project.
• Ability to change permissions for multiple projects (i.e.,
models).
• Ability to give different permissions to different
individuals for a project.
• Ability to allow others besides the project's POC to

295

apply permissions to a project. These individuals are
usually in the project's development group.
• Ability to allow files (not directories) within a project to
be owned by different individuals. This allows the
toolset to be used in conjunction with UNIX
configuration management tools such as the Source
Code Control System (SCCS).

would like other users to have the ability to place ACLs on
the owner's project. The permit and permacl scripts are
able to be shared by all users; however, the permit.exe
program must be available under the project and owned by
the POC.

permit

4. Deficiencies with Cray Commands
Based on the above requirements, it was apparent that a set
of tools must be written to compensate for the deficiencies
in the traditional UNIX permission scheme and UNICOS
MLS ACL commands.

setuid C executable
program

With UNIX file permissions, the owner of a file or directory
controls access to members of the group and to all other
users on the system. The user's umask value determines the
permission of the file or directory upon creation. No one
other than the owner of the file/directory can grant
permissions. Only three types of permissions can be given -
owner, group, and world. If two individuals belong to the
same group, those two individuals cannot have different
permissions to a particular file or directory. Also, two
different groups cannot be given permission to the same file
or directory.
The UNICOS MLS ACL permissions solve some of the
problems with the UNIX file permissions, but they also
introduce others. With an ACL, multiple groups can be
given access to a file or directory, but only the owner can
apply an ACL. An ACL works with the UNIX group
permissions of a file or directory. In order to have the
flexibility to grant an individual any permission to a file or
directory, the UNIX group permission must be set to the
highest permission needed. For example, if a file's UNIX
group permissions were read and execute (rx), then no
individual could be granted write (w) permission to that file
with an ACL. By granting the highest UNIX group
permission needed to a file or directory, the owner has now
granted everyone in the UNIX group this permission. The
owner must then use an ACL to restrict permissions to
certain individuals in the UNIX group if necessary.
Finally, the UNICOS ACL commands (spacl, spget, spset)
are difficult to use and force the user to apply the ACL to
each file or directory. This could cause errors if the user
fails to include a file or directory within the project.

5. Overview of Toolset
Figure 1 shows the basic flow of the tools within the
Permission Support Toolset. Both permit and permacl are
UNIX scripts written in the Bourne Shell programming
language. The C executable program, permit.exe, can be
defined to be a setuid1 program if the owner decides he

1. A setuid program is one that allows anyone who can
execute the program to do so as the owner.

296

permit.exe

permacl

Figure 1 - Basic Flow of Toolset

6. permit Script
The permit script is the interface between the user and the
ACL commands on the Cray. The following are the valid
input parameters (defaults listed first):

project     path to project (required parameter)

addrem      add | rem (option to grant/remove permissions)

ckacl       no | yes (option to only check permissions on
            project)

type        users | project | permit (different classes of
            permissions allowed)

file        file containing names of user/group names

name(s)     user/group names, with group names
            preceded by a colon (:). Note: either the file
            parameter or at least one name must be
            present.

The input parameters are in name=value format. Except for
the list of names (if present), the parameters can appear in
any order.
Before executing the script with the desired input
parameters, the user may define certain environment
variables that the script will use. The environment variables
used and their defaults are:

PERMITEXE      project/cpy_scpt/permit.exe
PERMISSION     rx
LOCKFILE       /tmp/.lck
PROJFILE       project/cpy_scpt/projfile.txt

8. lists.txt File

The LISTS macro in the permit.c program is a file
containing the set of directories/files upon which to apply
ACLs. Figure 3 shows an example lists.txt file. It contains a
maximum of three non-blank, non-commented lines
representing the directories/files to apply users, project, and
permit permissions. These lines correspond to the different
classes of permissions allowed by the type parameter of the
permit script. The USERSLIST would be used to grant
permissions to the users of the project, those who just need
to access the executable and perhaps header files. As shown
in Figure 3, if permit is executed with type=users, then the
given name(s)/group(s) would be granted permission to the
include and exec directories. If permit is executed with
type=project, then the name(s)/group(s) would be granted
permission to the SCCS, src, include, obj, and exec
directories. If permit is executed with type=permit, then
permission is granted to the cpy_scpt directory and the
cpy_scpt/permit.exe file. This is used in conjunction with a
setuid permit.exe executable to allow those users the ability
to place ACLs on the project. Any valid directory/file under
the project can be specified in the lists.txt file. Wildcards
can also be used (e.g., exec/* would signify all files within
the exec directory under the project). However, it is much
simpler to place ACLs only on the directories containing
the files, instead of the files themselves. Then by making the
files within that directory world readable, those given ACL
permission to the directory will be able to access the files.

PERMITEXE signifies the location of the permit.exe
program to execute. The PERMISSION applied via the
ACL is read/execute by default. A lock file is used to
prevent the possibility of multiple users changing the ACLs
on the same project at the same time. PROJFILE defines a
file which may contain a list of other projects to which the
ACL should be applied. This saves the user from having to
execute permit for each project; instead he can execute
permit once for each model.
The permit script basically performs parameter error
checks, then executes the file defined by the PERMITEXE
variable. If type=users and PROJFILE exists, then permit
will call PERMITEXE for each entry in PROJFILE.

7. permit.exe Program
The permit.exe file is a C executable program which will
pass the parameters from permit to permacl. It can be
made setuid by the owner if he wishes others to have the
ability to place ACLs on the owner's files. This program
will also pass the file name (LISTS macro) which contains
the set of directories/files upon which to apply ACLs. The
user cannot override the location of this file. Figure 2 shows
the permit.c program, which is compiled to produce the
permit.exe executable program.

/* @(#)permit.c 1.4 08:14:07 9/11/92 */
static char SCCS_Id[] = "@(#)permit.c 1.4 08:14:07 9/11/92";
#define LISTS "/home/k51/tools/permit/0/exec/lists.txt"
#define PERMACL "/home/k51/tools/exec/permacl"

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(int argc, char **argv) {
    int i,
        code,
        status;
    char **nargv = (char **) malloc(sizeof(char *) * (argc + 2));
    nargv[0] = (char *) malloc(sizeof(char) * (strlen(PERMACL) + 1));
    strcpy(nargv[0], PERMACL);
    nargv[1] = (char *) malloc(sizeof(char) * (strlen(LISTS) + 1));
    strcpy(nargv[1], LISTS);
    for (i = 1; i < argc; ++i) {
        nargv[i+1] = (char *) malloc(sizeof(char) * (strlen(argv[i]) + 1));
        strcpy(nargv[i+1], argv[i]);
    }
    nargv[i+1] = NULL;

    if (fork() == 0) {
        code = execv(PERMACL, nargv);
        fprintf(stderr, "ERROR: could not execute %s\n", PERMACL);
    }
    wait(&status);
    return 0;
}

# This file is located in ~k51/tools/permit/0/exec.
# It is used to define three classes of directories/files to be used with
# the K51 permit tool. Comments (preceded by #) can appear beginning in
# column 1 or at the end of the line containing the directories/files.
# They cannot appear on the same line, but preceding the data. Blank
# lines can also appear anywhere in the file.
# WARNING - THE ORDER OF THE DATA IS FIXED.
# If three or more non-commented, non-blank lines appear in the file then
#     USERSLIST is set to the first line
#     PROJECTLIST is set to the second line
#     PERMITLIST is set to the third line
#     remaining lines (if any) are ignored
# If two non-commented, non-blank lines appear in the file then
#     USERSLIST is set to the first line
#     PROJECTLIST and PERMITLIST are set to the second line
# If one non-commented, non-blank line appears in the file then
#     USERSLIST, PROJECTLIST, and PERMITLIST are set to that line
#
# USERSLIST
include exec
# PROJECTLIST
SCCS src include obj exec
# PERMITLIST
cpy_scpt cpy_scpt/permit.exe

Figure 2 - permit.c C Program

Figure 3 - lists.txt File


9. permacl Script

The permacl script is called by the C executable program
and is executed as the owner if the executable is setuid. All
input parameters are passed from the permit script to
permacl. After error checks are performed, an ACL
modification file is created from the desired name(s)/
group(s) input by the user. A lock file is created to lock
other users from placing ACLs on the project's files until
the current job is completed. A list of directories/files is
defined based on the value of the permit type parameter.
For each element of this list the following is performed:

• text version of ACL is retrieved.
• ACL file is created from the text version.
• name(s)/group(s) are removed from ACL file if addrem
is set to add.
• ACL modification file is applied to the ACL file.
• ACL file is applied to the element.
• lock file is removed.

10. Toolset Setup - Owners of Directories/Files

In order to use the toolset, certain steps must first be taken
to create setup files, to create necessary programs, and to
set needed environment variables. These steps are different
for owners of directories/files and for others who will apply
permissions to these directories/files.
To use the toolset, the owner of directories/files must first:
• Create the setuid C executable program, permit.c, to
pass parameters from the permit script to the permacl
script.
• Ensure that the LISTS macro in permit.c is set
appropriately.
• Compile permit.c; no special compilation parameters
are required (cc -o permit.exe permit.c).
• Create the LISTS file that is specified in the above
LISTS macro.
• Set the setuid permission bits on permit.exe to allow the
UNIX group to execute the file as the owner. This is
done with the command chmod 4750 permit.exe.
WARNING: Be careful with UNIX group and world
permissions on setuid programs!

11. Toolset Use - Users

Once the owner has followed the above instructions, other
users may apply permissions to the directories/files as well.
To do this:
• Check/set the following environment variables:
PERMITEXE, PERMISSION, LOCKFILE, PROJFILE.
• Execute the permit script with the appropriate
parameters.

12. Toolset Example

The following example shows how the owner and
users of a project can use the toolset. Figure 4 shows a
project called my_project, which is divided into
various directories, each containing files. The UNIX
mode permissions on all directories/files are shown.


home/  (drwxr-x--x)
    user1/  (drwxr-x--x)
        my_project/  (drwxrwx--x)
            SCCS/      (drwxrwx---)  s.a.c  s.b.c
            src/       (drwxrwx---)  a.c  b.c
            include/   (drwxrwx---)  a.h  b.h
            obj/       (drwxrwx---)  a.o  b.o
            exec/      (drwxrwx---)  my_project.exe
            cpy_scpt/  (drwxr-x---)  permit.exe

Notes:
All directories end with a slash (/).
All files are shown beside the appropriate directory.
UNIX mode permissions are shown in parentheses after each directory.
All files within the SCCS directory are (-r--r--r--).
All files within the src directory are (-r--r--r--).
All files within the include directory are (-r--r--r--).
All files within the obj directory are (-rw-rw-r--).
my_project.exe is (-rwxrwxr-x).
permit.exe is (-rwsr-x---), which makes it setuid.

Figure 4 - Sample Project

The owner has set up this project with cpy_scpt/permit.exe
as a setuid executable. If the lists.txt file of Figure 3 is used
and the owner executes the command

permit project=/home/user1/my_project type=project joe mary

then the users joe and mary are given read/execute (rx)
permission to all the directories except cpy_scpt. Since the
files within those directories are at least world readable,
they can access all the files within those directories.
Next the owner executes the command

permit project=/home/user1/my_project type=permit bill

The user bill is given permission to execute the cpy_scpt/
permit.exe executable. Since this is setuid, he can then
grant ACLs on the project to other users.
Next, if the user bill or the owner executes

permit project=/home/user1/my_project don kate :math

then the users don and kate and the UNIX group math are
given rx permission to the include and exec directories. If
the PERMISSION environment variable had been set to rwx
before executing permit, they would have been given rwx
permission to those directories.

13. Summary
The Cray File Permission Support Toolset was developed to
enhance the standard ACL commands and meet the
requirements of granting directory/file permissions in a
complicated environment. The simplest use is to easily
allow the owner of a project to set ACLs on the project's
directories. Users who are allowed to navigate within the
directory via the ACL can access all the files that have
world read permission. A more complicated scenario would
involve a setuid executable program allowing others besides
the owner or members of the project's group to grant
permissions to directories or files within the project. The
toolset is general enough to be tailored to any
development environment and any type of directory
structure. It has allowed developers to easily grant
different classes of permissions to users of multiple projects
combined into models.


TOOLS FOR ACCESSING CRAY DATASETS
ON NON-CRAY PLATFORMS
Peter W. Morreale
National Center for Atmospheric Research
Scientific Computing Division

ABSTRACT

NCAR has a long history of using Cray computers and as a result, some 25 terabytes of data on our Mass Storage
System are in Cray-blocked format. With the addition of several non-Cray compute servers, software was written
to give users the ability to read and write Cray-blocked and unblocked files on these platforms. These non-Cray
platforms conform to the Institute for Electrical and Electronics Engineers (IEEE) standard describing floating-point
data. Therefore, any tools for manipulating Cray datasets must also be able to convert between Cray data
formats and IEEE data formats. While it is true that the Cray Flexible File I/O (FFIO) software can provide this
capability on the Cray, moving this non-essential function from the Cray allows more Cray cycles for other
compute-intensive jobs. This paper will outline a library of routines that allow users to manipulate Cray datasets on
non-Cray platforms. The routines are available for both C and Fortran programs. In addition, three utilities that
also manipulate Cray datasets will be discussed.

1. Introduction
The National Center for Atmospheric Research (NCAR) has
been using Cray supercomputers since a Cray-1A was
installed in late 1976. NCAR now has a CRAY Y-MP 8/864,
a CRAY Y-MP 2D/216, and a CRAY EL92/2-512 that are
used for the bulk of computing by our user community.
NCAR has an enormous amount of data in Cray format
stored on NCAR's Mass Storage System. Currently, there are
40 terabytes of data on the NCAR Mass Storage System.
Approximately 25 terabytes of that data are in a Cray-blocked
format.
Cray computers use their own format to represent data. On
Cray computers, a 64-bit word is used to define both
floating-point values and integer values. In addition, Cray
computers support a number of different file structures. The
most common Cray file format is the Cray-blocked, or
COS-blocked, file structure. This file structure, known as a Cray
dataset, is used by default when a file is created from a
Fortran unformatted WRITE statement. The Cray-blocked
dataset contains various 8-byte control words which define
512-word (4096-byte) blocks, end-of-record (EOR),
end-of-file (EOF), and end-of-data (EOD). Cray also has an
unblocked dataset structure that contains only data. No
control words of any kind are present in an unblocked
dataset.
In contrast to Cray systems, a number of vendors of other
platforms use Institute for Electrical and Electronics
Engineers (IEEE) binary floating-point standard for
describing data in binary files. Generally, these vendors also
use the same file structure for Fortran unformatted sequential
access binary files. This file structure consists of a 4-byte
control word, followed by data, terminated by another 4-byte
control word for each record in the file. Because the same
file structure and data representation are used by a number of


vendors, binary files created from Fortran programs are
generally portable between these vendors' computers.
In recent years, a number of IEEE-based compute servers have
been added to our site. In particular, NCAR now has a four-node
cluster of IBM RISC System/6000 model 550 workstations, an
eight-node IBM Scalable POWERparallel 1 (SP1), and a
Thinking Machines, Inc. CM-5. Since many of the users on these
platforms also use our Cray systems, the ability to use the same
data files on all systems is extremely important.
One possible solution would be to use the Cray systems' Flexible
File I/O (FFIO) package. This software allows the user to create
data files in binary formats suitable for direct use on different
vendors' platforms. The FFIO solution works for users creating
new data files on the Cray; however, we have over 25 terabytes
of Cray-blocked data already in existence on our Mass Storage
System. If FFIO were the only solution, users would spend a
good deal of their computing allocations just reformatting
datasets. In addition, these Cray jobs would consume a large
number of Cray cycles that would otherwise be used for
compute-intensive work.
Another solution would be to use formatted data files. This
solution poses several problems: 1) formatted files are generally
larger than their binary counterparts, 2) formatted I/O is the
slowest form of I/O on any computer since the text must be
interpreted and converted into binary format, and 3) formatted
files can incur a loss of precision.
A third solution would be to provide software that can interpret
Cray file structures and convert the Cray data representation into
the non-Cray data format. This solution has several benefits for
the user. One advantage is that the user can use the same datasets
on both the Cray and non-Cray machines. Another benefit is that
even accounting for the data conversion, the I/O on the non-Cray
platform is significantly faster than equivalent formatted I/O on
the non-Cray platform.

At NCAR, we have implemented the third solution in the form
of a library of routines that perform I/O to and from Cray
datasets. This paper describes the library named NCAR Utilities
(ncaru). The paper also describes three stand-alone utilities that
manipulate Cray datasets.

2. The ncaru software package
The ncaru library contains a complete set of routines for
performing I/O on Cray datasets. In addition, a number of
routines that convert data between Cray format and IEEE
format are also included in the library.
The user can use the Cray I/O routines to transfer data in Cray
format or have the routine automatically convert the data to the
native format. Having the option of converting data allows the
user to read datasets that contain both numeric and character
data records.
The ncaru library is written in the C language. Since there is no
standard for inter-language communication, the user entry
points to the library must be ported to the different platforms for
use in Fortran programs. The current implementation has been
ported to IBM RISC System/6000 systems running AIX 3.2.2,
Silicon Graphics Inc. Challenge-L running IRIX V5.1.1.2 and
to Sun Microsystems, Inc. systems running SunOS 4.1.1.
The ncaru software package also includes three utilities that aid
the user in manipulating Cray-blocked files on non-Cray
platforms: cosfile, cosconvert, and cossplit. These utilities
describe records and file structure of a Cray dataset (cosfile),
strip Cray-blocking from a Cray dataset (cosconvert), and split
multi-file datasets into separate datasets (cossplit).
Documentation for the ncaru package consists of UNIX man
pages for each routine and utility. There is also an ncaru man
page that describes the library and lists all the user entry point
routine names.

3. The Cray I/O routines
The Cray I/O routines in the ncaru library allow the user to read,
create, or append to a Cray dataset. The user also specifies
whether the dataset uses a Cray-blocked or Cray-unblocked file
structure.
The Cray I/O routines use a library buffer to block I/O transfers
to and from the disk file. This buffer imitates the library buffer
used in Cray system I/O libraries. Use of a library buffer can
reduce the amount of system work necessary to perform I/O,
with the trade-off being increased memory usage for the
program.
The following is a list of the Cray I/O routines with their
arguments:

ier  = crayblocks(n)
icf  = crayopen(path, iflag, mode)
nwds = crayread(icf, loc, nwords, iconv)
nwds = craywrite(icf, loc, nwords, iconv)
ier  = crayback(icf)
ier  = crayrew(icf)
ier  = crayweof(icf)
ier  = crayweod(icf)
ier  = crayclose(icf)

The crayblocks routine allows the user to specify a library
buffer size. The argument n specifies the number of
4096-byte blocks used by the buffer. This library buffer is
dynamically allocated and is released when the file is closed
with a crayclose routine. If the crayblocks routine is used, all
Cray datasets opened with a crayopen use the specified block
size until another crayblocks routine is executed. The
crayblocks routine must be executed prior to a crayopen
routine if something other than the default library buffer size
(1 block) is desired.
The crayopen routine opens a dataset for either reading or
writing. The path argument specifies the pathname to the file.
The iflag argument specifies the transfer mode, whether the
file structure is blocked or unblocked, and the position of the
file. The mode argument specifies the file permissions and is
used only if the file is being created. The crayopen routine
dynamically allocates a data structure that contains fields
used by the various I/O routines. The return from a
successful crayopen is the address of this data structure. By
returning the address of the structure as an integer, portability
between Fortran and C is assured.
The crayread routine reads data from an existing Cray
dataset. The icf argument is the return from a previously
executed crayopen routine. The loc argument is the location
where the first word of data is placed and must conform both
in type and wordsize to the data being read. The nwords
argument specifies the number of words being read. The
iconv argument specifies the desired data conversion.
For blocked files, crayread is fully record-oriented. This
means that if the user specifies a read of a single word, the
first word of the record is transferred and the file is left
positioned at the next record. This feature is useful for
skipping records. The user can also specify a read of more
words than the record actually contains, and only the actual
number of words in the record are transferred. This feature is
useful if the user is not sure of the exact number of words in
the record. In all cases, crayread returns the number of Cray
words actually transferred or an error code.
The craywrite routine writes data to a Cray dataset. Like
crayread, the arguments icf, loc, nwords, and iconv,
correspond to the crayopen return value, location of the data
being written, the number of words to write, and the
conversion flag. Both the crayread and craywrite routines use
a library buffer to reduce the number of physical read or
write requests to disk. For writing, when the buffer is filled,
the library buffer is flushed to disk. This means that if the
user does not close the file via a crayclose, the resulting Cray
dataset may be unusable on the Cray computer.
If the iconv flag for both crayread and craywrite specifies a
numeric conversion, then a conversion buffer is dynamically
allocated. The initial size of the conversion buffer is set to the
number of words in the request. The size of the conversion
buffer is then checked for each subsequent I/O request, and if
a subsequent request is larger than the current size of the
conversion buffer, the buffer is re-allocated to the larger size.
On every request, every byte of the conversion buffer is
preset to zero to prevent bit conversion problems.


The crayback routine allows the user to back up a record in a
Cray dataset. The crayback routine can be used on datasets
opened for reading or writing. If the last operation to the
dataset was a write, then crayback will truncate the dataset
prior to positioning at the previous record. This allows the
user to overwrite a record if desired and mimics the behavior
of a Cray Fortran BACKSPACE.
The crayweof routine writes a Cray end-of-file control word.
The crayweod routine writes a Cray end-of-data control
word. These two routines are available for historical
purposes and are seldom used directly by the user. One
possible use of the crayweof routine would be the creation
of a multi-file dataset.
The crayclose routine properly terminates and, if necessary,
flushes the library buffer. The crayclose routine then closes
the dataset and releases all dynamically allocated memory
used for that file.

4. The numeric conversion routines
In addition to the Cray I/O routines, a number of routines to
convert Cray data formats to IEEE data formats are included
in the ncaru library. These routines are implemented as
Fortran subroutine calls and in C as void functions. Here is a
list of the routines and their arguments:

ctodpf(carray, larray, n)
ctospf(carray, larray, n)
ctospi(carray, larray, n, zpad)
dptocf(larray, carray, n)
sptocf(larray, carray, n)
sptoci(larray, carray, n, zpad)

In all the routines, the first argument is the location of the
input values and the second argument is the location for the
converted values. The third argument to all the routines is
the number of words to convert. If the routine has a fourth
argument, it is used to tell the conversion routine whether
the IBM RISC System/6000 double-padded integer option
was used during compilation.
In all cases, the carray argument is a pointer to an array
containing 64-bit Cray values and the larray argument is a
pointer to an array containing the IEEE 32-bit or IEEE
64-bit values.
The ctodpf, ctospf, and ctospi routines convert Cray data to
local DOUBLE PRECISION, REAL, and INTEGER values.
The dptocf, sptocf, and sptoci routines convert local data to
Cray REAL and INTEGER values.
For the routines that convert to IEEE format, any values that
are too large to be properly represented are set to the largest
value that can be represented with the correct sign. Any
values that are too small are set to 0 (zero).
Both the Sun Microsystems and IBM RISC System/6000
Fortran compilers allow the user to specify a command line
option that automatically promotes variables declared as
REAL to DOUBLE PRECISION. This causes the word size
to double from the default 4 bytes to 8 bytes. The routines
with "dpf" in their names should be used in these cases. In
addition, the IBM xlf compiler has an option that allows the
compiler to increase the size of Fortran INTEGERs to 8 bytes, 4
bytes to hold the data and 4 bytes of alignment space. For the
integer conversion routines, the zpad argument is used to inform
the routine whether the compiler option was used.
The numeric conversion routines can either be executed directly
from the user program or automatically called through the use of
the Cray I/O routines via the iconv argument to crayread and
craywrite.

5. Example Fortran program
The following sample Fortran program creates a Cray-blocked
dataset with a single record. The IEEE 32-bit REAL values are
converted to Cray single-precision REAL values prior to being
written.

      PROGRAM TST
      REAL A(1024)
      INTEGER CRAYOPEN, CRAYWRITE, CRAYCLOSE
      INTEGER ICF, NWDS, IER
      ICF = CRAYOPEN('data', 1, O'660')
      IF (ICF .LE. 0) THEN
         PRINT*, 'Unable to open dataset'
         STOP
      ENDIF
      NWDS = CRAYWRITE(ICF, A, 1024, 1)
      IF (NWDS .LE. 0) THEN
         PRINT*, 'Write failed'
         STOP
      ENDIF
      IER = CRAYCLOSE(ICF)
      IF (IER .NE. 0) THEN
         PRINT*, 'Unable to close dataset'
         STOP
      ENDIF
      PRINT*, 'Success!'
      END

6. Cray dataset utilities
To assist users with manipulating Cray datasets, three utilities
that operate on Cray-blocked files were created. These utilities
are cosfile, cosconvert, and cossplit.
The cosfile utility verifies that the specified file is in a
Cray-blocked format and gives information about the contents of the
dataset. The number of records and their sizes are displayed for
each file in the dataset. In addition, cosfile attempts to determine
whether the file contains ASCII or binary data and reports the
percentages of each. Here is a sample cosfile command and its
resulting output:

% cosfile -v /tmp/data

Processing dataset: /tmp/data
  Rec#   Bytes
     1     800
     2    8000
EOF 1: Recs=2 Min=800 Max=8000 Avg=4400 Bytes=8800
       Type=Binary or mixed -- Binary= 99% ASCII= 1%
EOD. Min=800 Max=8000 Bytes=8800

The cosconvert utility converts a Cray-blocked dataset into one
of several formats. Most often, cosconvert is used to strip Cray
control words from a dataset, leaving only data. In some cases,
Cray datasets may contain character data with Blank Field
Initiation (BFI). BFI was used under COS to compress datasets
by replacing three or more blanks in a row with a special
two-character code. The cosconvert utility can be used to expand
the blanks in those datasets.
The cossplit utility creates single-file datasets from a multi-file
dataset. Each output single-file dataset will have a unique name.

7. Acknowledgments
The ncaru software package is the result of the work of several
people at NCAR. Charles D'Ambra of the Climate and Global
Dynamics (CGD) division of NCAR wrote the original
Cray-IEEE numeric conversion routines. Craig Ruff of the Scientific
Computing Division (SCD) wrote the original Cray I/O
routines for the purpose of adding and stripping Cray-blocking
from files. Dan Anderson and Greg Woods of SCD combined
both the numeric conversion routines and the Cray routines into
a single interface. Tom Parker (SCD) originally wrote cosfile,
cosconvert, and cossplit for use on Cray systems. The author,
also of SCD, modified the library code to handle
Cray-unblocked files, added backspacing functionality, rewrote the
utilities to use the library, and added other enhancements.

8. Availability
This package is available to interested organizations without
charge. Please contact Peter Morreale by sending email to
morreale@ncar.ucar.edu for details.


CENTRALIZED USER BANKING AND USER ADMINISTRATION ON UNICOS

Morris A. Jette, Jr. and John Reynolds
National Energy Research Supercomputer Center
Livermore, California
Abstract
A Centralized User Banking (CUB) and user administration
capability has been developed at the National Energy
Research Supercomputer Center (NERSC) for UNICOS and
other UNIX platforms. CUB performs resource allocation,
resource accounting and user administration with a
workstation as the server and four Cray supercomputers as
the current clients. Resources allocated at the computer
center may be consumed on any available computer.
Accounts are administered through the CUB server and
modifications are automatically propagated to the appropriate
computers. These tools facilitate the management of a
computer center as a single resource rather than a collection
of independent resources.
The National Energy Research Supercomputer Center
NERSC is funded by the United States Department of Energy
(DOE). The user community consists of about 4,800
researchers worldwide and about 2,000 high school and
college students. The user community is divided into about
600 organizations or accounts. The available compute
resources include a Cray Y-MP C90/16-256, Cray-2/8-128,
Cray-2/4-128, Cray Y-MP EL and an assortment of
workstations. Other resources include an international
network, a Common File System (CFS) with 12 terabytes of
data and the National Storage Laboratory (NSL, a
high-performance archive being developed as a collaborative
venture with industrial partners and other research
institutions).
General Allocation and Accounting Requirements
While our student users are generally confined to use of the
Cray Y-MP EL, the researchers are free to use any of the
other computers. The DOE allocates available NERSC
compute resources for the center as a whole, not by
individual computer. Researchers are expected to consume
their computer resources on whichever computers are most
cost effective for their problems.
Resources are allocated in Cray Recharge Units (CRUs),
which are for historical reasons based upon the compute
power of a Cray-1. A "typical" problem may be executed on
any available resource and be charged a similar number of
CRUs. The CPU time consumed is multiplied by a CPU
speed factor to normalize these charges. These factors have
been derived from benchmarks. Our CPU speed factors are
as follows:

COMPUTER                           CPU SPEED FACTOR
Cray Y-MP C90/16-256               3.50
Cray-2/4-128                       1.63
Cray-2/8-128                       1.44
Cray-1A (no longer in service)     1.00

The charge rates vary with the level of service desired. Our
machines have been tuned to provide a wide range of service
levels depending upon the nice value of a process and
whether the work is performed interactively or under batch.
A charge priority factor permits users to prioritize their work
based upon the level of service desired and be charged
accordingly. This scheme has proven very popular with our
clients. The lowest priority batch jobs have a charge priority
of 0.1. The highest priority interactive jobs have a charge
priority of 3.0. The actual formulas for charge priority
factors are as follows:
JOB TYPE      NICE VALUE   CHARGE PRIORITY ALGORITHM   CHARGE PRIORITY
Interactive   0 to 10      0.2*(10-nice)+1.0           3.0 to 1.0
NQS           5 to 19      0.1*(10-nice)+1.0           1.5 to 0.1

Charges are made on a per process basis. It must be possible
to alter the nice value of a process up or down at any time.
NERSC has developed a simple program called CONTROL
which alters nice values of processes and sessions up and
down. CONTROL runs as a root process and restricts nice
values to the ranges specified in the above table. The
NICEM program has been replaced by a front-end to the
CONTROL command. The system call to increase nice
values has not been altered. The charge rate must change
when the nice value is altered. It is a requirement to be able
to allow changes in service level or charge rate once a process
has begun execution. The ability to react to changing
workloads is considered essential.
CPU use must be monitored in near real time. This can be
accomplished on many UNIX systems by reading the process
and/or session tables and noting changes in CPU time used.

Waiting for timecards to be generated on process completion,
as in standard UNIX accounting, is not considered
acceptable. With some processes running for hundreds of
CPU hours, monitoring timecards would occasionally result
in substantially more resources being consumed than
authorized. Additionally, many users have discovered that
they can avoid charges at some computer centers by
preventing their jobs from terminating. Their jobs can
complete useful work and sleep until the computer crashes,
avoiding the generation of a timecard.
The disadvantage of near real time accounting is that jobs are
charged for resources consumed, even if they fail to complete
due to system failure. The Cray hardware and software is
reliable enough to make this only a minor issue. Refunds are
not given in the cases of job failure due to user error.
Refunds are not given in cases where less than 30 CRU
minutes are lost, since the user benefit is probably less than
the effort involved in determining the magnitude of the loss.
The maximum refund is generally 600 CRU minutes. Users
are expected to provide adequate protection from greater
losses through checkpoints and backups. Since the UNICOS
checkpoint/restart mechanism is fairly fragile, many users
have developed their own checkpoint/restart mechanisms. In
most cases, user checkpoint/restart mechanisms are quite
robust and have modest storage requirements.

interfaces, archive user interfaces and a few others. These
permit the user to finish working in some sort of graceful
fashion whenever necessary. Once a user or his account has
additional resources available, held or suspended jobs must
be restarted automatically and the login shell restored to its
former value. Interactive jobs are kept suspended only for 24
hours, then killed in order to release their resources. NERSC
clients execute substantial interactive jobs. The temporary
suspension of interactive jobs minimizes the impact of
resource controls. In practice, a user with an important
interactive job suspended would contact an account manager,
be allocated additional resources and continue working in
short order. E-mail and terminal messages must be sent to
notify users of each action.
It must be possible to monitor resource availability and usage
in real time. It must be possible to alter the percentage of an
account allocated to a user in real time. It is highly desirable
that the manager of an account have the ability to control the
interval at which periodic installments of allocation are made.
It is highly desirable that CUB user interfaces be available in
both X window and command line forms.

At present, we are charging only for CPU use. The net
charge is calculated by the following formula:

CRU charge = CPU time * CPU speed factor * charge priority

Monthly accounting reports must be produced and mailed to
each manager. A variety of reports are required, including the
following information: raw CPU time consumed, average
charge priority and CRUs consumed by user and computer.
The precision of all accounting information must be
maintained to within seconds.

General User Administration Requirements

The DOE allocates resources annually by account. In order
to maintain a relatively uniform workload throughout the year
and ensure that an account does not prematurely consume its
entire allocation, the allocated resources need to be made
available in periodic installments. Each account has one or
more managers who are responsible for the account
administration. A wide range of administrative controls are
essential, including: the rate at which the annual allocation is
made available (weekly, monthly or other periodic updates
plus allocation infusions on demand), designating other
managers, and allocating resources to the individual users in
the account.

The NERSC user community is quite large and has a
substantial turnover rate, particularly for the students.
Managing 6,800 user accounts on several machines with
different environments is very time consuming. NERSC
formerly had a centralized user administration system for a
proprietary operating system (the Cray Time Sharing System,
CTSS, developed at Lawrence Livermore National
Laboratory). When NERSC installed UNICOS, our support
staff immediately found its workload increased by a factor of
about three. Rather than increase our support staff
accordingly, we decided that the implementation of a
centralized user management system was essential.

Each user may consume some percentage of the account(s) to
which he has access. In order to simplify administration, the
sum of percentages allocated to each account's users need not
equal 100. For example, the manager may want to make the
account available in its entirety to each user. In this case,
each user would have access to 100 percent of the account's
allocation.
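The per-user limit implied above can be sketched as follows (names are ours; note that shares across an account's users may deliberately sum to more than 100):

```python
def user_limit(account_allocation, user_percent):
    # A user may consume up to this share of the account's allocation.
    return account_allocation * user_percent / 100.0

# Shares need not sum to 100: an account can be opened in its
# entirety to every one of its users.
shares = {"u445": 100, "u447": 100}
limits = {user: user_limit(5000.0, pct) for user, pct in shares.items()}
```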

The desired functionality was basic UNIX user
administration: adding and deleting users and groups,
changing user groups, etc. NERSC also required
site-specific administration: assigning email aliases and
propagating them to mail hubs; updating local banking
databases on each CRAY; putting postal mail addresses into
an address database. Ideally, all of the required user
administration would be minimally specified once, within one
user interface, and the required actions would be performed
automatically on the appropriate computers. Local daemons
would eliminate the need for repetitive actions and insulate
the support staff from peculiarities of user administration on
any computer.

Once a user or his account has exhausted his allocation, he
should be prevented from consuming significant additional
resources. Rather than preventing the user from doing any
work when his allocation has been exhausted, running
interactive jobs must be suspended, running and queued NQS
jobs must be held and the login shell must be changed to a
Very Restricted Shell (VRSH). VRSH permits execution of a
limited set of programs such as: MAIL, NEWS, RM, LS,
KILL, CAT, LPR, MV, TAR, QSTAT, QDEL, CUB user
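In spirit, VRSH is a whitelist shell. A toy sketch follows (the command set is taken from the text; the logic, and the assumption that the truncated list ends with the CUB user interfaces such as setcub and viewcub, are ours):

```python
# Commands an over-allocation user may still run (assumed set).
ALLOWED = {"mail", "news", "rm", "ls", "kill", "cat", "lpr",
           "mv", "tar", "qstat", "qdel", "setcub", "viewcub"}

def vrsh_permits(command_line: str) -> bool:
    # Permit only a fixed set of housekeeping commands; everything
    # else is refused until the allocation is replenished.
    words = command_line.split()
    return bool(words) and words[0] in ALLOWED
```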

Basic Design Criteria
Previous experience developing and using a similar system
led us to believe that having a single centralized database
would be desirable. This makes for simple administration of
the entire computer center. Clients on various computers
would insure that the necessary allocation and user
management functions are performed, maintaining a
consistent state across all computers at the center in a highly
reliable fashion. Centralized accounting, consistent with the
allocation system information, would follow quite naturally.
While we were willing to take advantage of special features
available with UNICOS, the operating system providing most
of our compute capacity, the system would have to support
UNIX platforms of any variety. Specialized software
requirements (e.g., database licenses) could be required on
the CUB server, but standard UNIX software should be the
only requirement for CUB clients. The system would have to
be user friendly. Use of commercial software, wherever
possible, was encouraged.
Centralized User Banking Development
While no available system at the start of this project satisfied
our needs, there were two resource allocation systems which
provided partial solutions. The San Diego Supercomputer
Center (SDSC) had developed a CPU quota enforcement
system in 1990. NERSC adopted this system in 1991 and
found it generally satisfied our needs, except for managing
allocations on each machine independently. We enhanced
this system to permit the transfer of resources between
machines, but it still was not entirely satisfactory. The
Livermore Computer Center used SDSC's allocation system
as a model for the allocation sub-system of their Production
Control System (PCS). We found the PCS software gave us
somewhat greater flexibility and decided to use that as the
basis of CUB's banking client.
We knew that the design of CUB was certain to be quite
complex, involving several programmers from different
NERSC groups. It was decided that a CASE tool would
provide for easier software development and more reliability.
After examining several options, we selected Software
Through Pictures from Interactive Development
Environments. While the training cost was substantial in
terms of initial programmer productivity, there was general
agreement that the use of CASE was a real advantage. A
small portion of the CUB database is shown in Figure 1 to
indicate its complexity.
The banking client is a daemon which reads the process and
session tables at intervals of ten seconds. The user ID,
process or session ID, system and user CPU time used, and a
few other fields are recorded. Changes in CPU time used are
noted each time the tables are read. Periodically, CPU time
used and CRU charge information is transferred for all active
users and accounts to the banking server. The banking server
updates its database then transfers updated records of
resources available by user and account.
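The banking client's delta computation can be sketched as follows (a simplification with names of our choosing; the real daemon reads the UNICOS process and session tables every ten seconds):

```python
def cpu_deltas(previous, current):
    """Per-session CPU time consumed since the last snapshot.

    `previous` and `current` map session IDs to cumulative CPU
    seconds (system plus user time). A session absent from
    `previous` is newly started, so its whole cumulative total
    counts toward this interval.
    """
    return {sid: cpu - previous.get(sid, 0.0)
            for sid, cpu in current.items()}
```

The periodic transfer to the banking server would then sum these deltas by user and account.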
Communications are UDP/IP based, for performance reasons.
Additional logic has been added to provide for sequencing
and message integrity. A very simple RPC has been
developed along with an API which insulates the
application-level programmers from the complexity of the
network. Each connection is fully authenticated and a unique
session key is established. Transactions are protected by a
crypto-checksum utilizing the session key to insure security.
The result is efficient, easily portable, and fairly secure.
The record of previous process and session table information
permits one to note the starting of new processes and
sessions. The system and user CPU times are reestablished
shortly after a held NQS job is restarted. We avoid charging
for these restarted jobs by logging, though not charging, in
cases where the record of CPU time consumed between
checks exceeds a "reasonable" value. We consider anything
over the real time elapsed multiplied by the number of CPUs
multiplied by 2.0 to be an "unreasonable" value. The only
CPU time not accounted for is that consumed prior to the first
snapshot on a restarted NQS job and that consumed after the
last snapshot of any session. Since CUB's sampling interval
is ten seconds, the CPU time not accounted for is typically
under 0.2 percent. The precision of the data can be increased
in proportion to the resources consumed by the CUB local
banker. The resources consumed by the local banker itself
are 6.92 CPU minutes per day on a Cray Y-MP C90.
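The restart filter described above reduces to a one-line test (function name is ours):

```python
def charge_is_reasonable(cpu_delta, elapsed_seconds, ncpus):
    # Anything above wall-clock time * number of CPUs * 2.0 is treated
    # as a restarted job re-establishing its counters: it is logged,
    # but not charged.
    return cpu_delta <= elapsed_seconds * ncpus * 2.0
```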
A client daemon called ACCTMAN performs updates to
local databases as required for user administration. It
performs system administration actions such as adding,
deleting or modifying users or groups. This includes creating
home directories, copying start-up files, creating mail
directories, etc.
NERSC has used ORACLE for most of its database work
over the past few years. The logical choice for the CUB
server database was thus ORACLE. We had an initial design
goal of making CUB completely database independent, but
quickly found that to be impractical given the urgency of our
need for the capabilities represented by CUB and the
knowledge base of the available programmers. Using
ORACLE's 4GL FORMS proved to be the quickest way to
develop an interface for our support staff, and thus the
quickest way to get a base-line version into production. The
result is that CUB's BANKER, and ACCTMAN daemons,
and the SETCUB utility suite can run on any UNIX host; the
server code is completely written in standard SQL and could
employ any UNIX SQL engine; but the support staff interface
is wedded to ORACLE (but not to UNIX) for the foreseeable
future.
Accounting reports are generated monthly based upon the
CUB database and mailed to the primary manager of each
account.
In order to insure reliability, three machines are available as
CUB servers: a production machine, a "warm backup" and a
development platform. The "warm backup" has a database
that is only 15 minutes old and can be made into the
production machine within about 30 minutes, if needed. The
clients can continue operation for an extended period without
a server. The lack of an operating server merely prevents the
clients from synchronizing their databases and prevents other
database updates. Once communications are reestablished,
the databases are synchronized. In practice, the servers have

been quite reliable, in spite of the continuing development
effort to complete and enhance CUB.

Database alterations (other than consumption of resources)
are recorded in a journal in ORACLE to insure persistence
across restarts. The database is also backed up regularly,
including the journals. If a client goes down for an extended
period of time, transactions destined for it are not lost.

Once a single user can be associated with multiple accounts,
we will enable an "add/delete" capability through SETCUB
that will grant PI's and their managers the ability to add/delete
existing NERSC users to their accounts, without NERSC
support staff intervention.

No alterations to the UNICOS kernel were required for CUB.
The LOGIN.C program was modified to request an account
name or number if the user belongs to more than one account.
A default account is used if the user enters RETURN.

We plan to implement Kerberos V5 as a centralized
authentication service for NERSC users on all NERSC
platforms. Having Kerberos will provide much greater
network security for those users able to take advantage of it,
because with Kerberos, passwords are never transmitted in
the clear. Kerberos V5 also allows for a higher level of
convenience. Once the Kerberos security token is obtained,
and assuming the proper client software is installed on the
user's workstation, the user can securely telnet or ftp to
NERSC machines without providing a password. There are
some problems that must be solved regarding use of Kerberos
at NERSC. What to do about the many users who will not
have Kerberos on their local hosts is the biggest one. We are
working on this now.

The primary user interface to CUB is called SETCUB. This
program can be used by account managers to modify the
CUB database. Other users can execute SETCUB to monitor
resource allocations. In order to make CUB easier to use, the
SETCUB program can also be executed by the name
VIEWCUB or USERINFO. Both VIEWCUB and
USERINFO provide a subset of SETCUB's functionality.
VIEWCUB can only be used to report allocation information.
USERINFO only reports user information, such as telephone
number and postal address. Examples of SETCUB reports are
shown in Figures 2, 3 and 4.
Figure 5 shows the overall CUB architecture and the
interrelationships between its major components.
Future Plans
While a GUI CUB interface is available to NERSC support
staff, only a command line interface is available to most
users. An X window interface is currently under
development.
Most NERSC users have login names of the form "u"
followed by a number. We plan to permit users the ability to
modify their login name to something they find more
suitable, on a one-time-basis. We will maintain a flat
namespace that includes login names, email aliases, and
certain reserved words.
NERSC found that charging for archival storage was
ineffective in controlling growth. We instituted a quota by
account in 1993, which has substantially reduced the archive
growth rate and resulted in the elimination of a substantial
amount of old data. At present, this quota is administered
only by account (not user) and is not yet integrated with
CUB. This will be rectified in the second quarter of 1994.
Problems of controlling storage use exist not only in the
archive, but also on-line disks. We are planning to impose
charges for disk space. This will be accomplished by a
daemon periodically calculating on-line storage by user and
account then relaying that information to the CUB server.
At present, a user can only be associated with a single
account. This will soon be changed to permit use of multiple
accounts. The UNICOS NEWACCT command can alter the
account ID on process table entries. Other platforms would
rely upon a user program relaying to the local banking client
which account should be associated with each session or
process.
CUB records resource usage only by user and account. In
order to determine which processes consumed the resources,
it is necessary to rely upon UNIX timecards. At some point
in the future, we would like to be able to keep track of
resource use by program. Given the volume of data required
to record each process executed, this would likely only be
used on long running processes.
The client daemons were originally designed for use with
UNICOS. They are currently being ported to Hewlett-Packard
workstations. Ports are planned for the Cray T3D and Intel
Paragon.
Accounts are presently independent of each other. We regard
hierarchical accounts as desirable. In PCS,
account managers can create and delete sub-accounts, move
users between sub-accounts, and move allocated resources
between sub-accounts. At some time in the future, we may
incorporate hierarchical accounts patterned after the work of
PCS.
Acknowledgments
CUB's development has been the result of substantial efforts
on the part of many programmers. The following
programmers developed CUB: Patty Clemo, Steve Green,
Harry Massaro, Art Scott, Sue Smith and the authors.
This work was funded by the U.S. Department of Energy
under contract W-7405-Eng-48.
References
1. Hutton, Thomas E., "Implementation of the UNICOS CPU
Quota Enforcement System," Proceedings, Twenty-Fifth
Semi-Annual Cray User Group Meeting, Toronto, Canada,
April 1990.
2. Wood, Robert, "Livermore Computing's Production
Control System Product Description," Lawrence
Livermore National Laboratory Report, November 1993.

307

[Figure 1: a portion of the CUB database schema (diagram not reproducible in text).]
Object: Generate allocation information about the users in account p6.
Input:

setcub view repo=p6 members

Output:

Users In Repository p6 (id 1032)

Login                 Initial   User Time  Charge Time  % of
Name   Name           Alloc.    Remaining  Used         Alloc.  %      Perm %
u445   M. CHANCE      1674:06   1667:35    6:31         0.18    46.00  46.00
u447   J. MANICKAM    3093:28   2512:09    581:19       15.97   85.00  85.00
u4100  J. MANICKAM    727:52    727:52     0:00         0.00    20.00  20.00
u4150  R. DEWAR       727:52    727:52     0:00         0.00    20.00  20.00
u4477  J. MANICKAM    1455:45   1455:45    0:00         0.00    40.00  40.00

Current amounts for p6: 3639:23  3051:33  587:50  16.15
Object: Generate detailed allocation information about the account p6.
Input:

setcub view repo=p6 long

Output:

Repository: p6 (id 1032)
Total Year Allocation:    42540:00
Period Increment:         3420:00
Remaining Year Alloc.:    34007:00
Initial Period Alloc.:    4525:10
Remaining Period Alloc.:  412:10
Period Carry Over:        305:56
Update Period:            monthly
Next Update:              01-DEC-93
PI: STEVE JARDIN (login name u431)
Managers (privileges):
u44444 - LISA HANCOCK (t,u,c)

Object: Generate information about a specific user.
Input:

setcub locate user=u7145

Output:

NERSC DATABASE INFORMATION
Login Name: u7145
Common Name: MOE JETTE
Title:
LCode: 561  Bldg: 451  Rm: 2062  Box: C86
CFSid: 7145
NERSC Domain Email Aliases (@nersc.gov): u7145, jette
U.S. Postal Information:
NERSC
P.O. Box 5509, L-561
Livermore, CA 94550 USA
Work Phone: +1 510-423-4856

[Figure 5: CUB architecture. The CUB server (an Oracle database) communicates
via SQLNET and RUDP with client daemons on each CUB platform (CRAY now; soon
HP and MPP), which maintain the local bank, udb, group, passwd, acid, and
local.users files and interact with CFS or UNITREE archival storage and NQS
or PBS batch queuing. Interfaces: operator (opcon, cubconsole to all daemons
and the server), user management for support staff (Oracle Forms, opconview),
and end user (setcub, userinfo, viewcub).]

FORTRAN 90 UPDATE

Jon Steidel
Compiler Group
Cray Research, Inc.
Eagan, Minnesota

1.0 Introduction

The first release of the Cray Fortran 90
Programming Environment (CF90 1.0) occurred in
December of 1993. CF90 will replace Cray's
current production Fortran product (CF77) in
future releases. This paper discusses the
current status of the CF90 product, its
performance compared to CF77, the CF77 to CF90
transition plan, and current feature plans for
the second release of the CF90 programming
environment.
2.0 CF90 1.0 Status

The CF90 1.0 status is divided into three
subsections. These are the release status of
CF90 on various platforms, the performance of
CF90 1.0 on parallel vector platforms compared
to CF77, and the status of Fortran 90
interpretation processing by the American
National Standards Institute (ANSI) and the
International Standards Organization (ISO).
Potential impact of Fortran 90 interpretations
on CF90 users is discussed in the
third subsection.
2.1 CF90 Release Status

containing over two hundred thousand lines of
Fortran 90 code has been ported and tested
using CF90 with very few (less than 10
distinct) problems. A small number of these
were due to defects in CF90 itself. The
remainder were due to differences in
implementation due to interpretation of the
Fortran 90 standard or differences in machine
arithmetic between Cray and the platform
where these libraries were developed. While
it is very early in the release cycle of the
CF90 product, the cumulative failure profile,
based on weighted severity of SPRs, looks very
favorable compared with the last three major
CF77 releases. There are very few Fortran 90
production codes running currently; this may
also influence the favorable cumulative
failure profile.
2.2 Performance of CF90 1.0

CF90 release criteria included four measures
of the programming environment's performance.
These were based on CF77 5.0, as CF77 6.0 was
not released until late in the CF90 1.0
development cycle. These criteria were for
the performance of generated code, size of
generated code, compile time, and compile
size. Specifically, these criteria were:

The CF90 Programming Environment 1.0 was
released December 23, 1993 on Cray Y-MP, C90,
and YMP-EL systems. To date, there have been
forty-five CF90 licenses purchased. These
licenses represent twenty-five Cray Y-MP and
C90 systems installed at twenty-one customer
sites, plus twenty licenses on YMP-EL systems.
As of early March, three upgrade releases have
been made to the product to address bugfixes,
and the first revision level release is
planned for early second quarter 1994.
The CF90 programming environment will be
released as a native programming environment
on SPARC Solaris systems in third quarter
1994. Components of the CF90 programming
environment also serve as the basis of the
Distributed Programming Environment (DPE)
which will also be released in the third
quarter of 1994. Please see the Distributed
Programming Environment paper prepared by Lisa
Krause for more information about DPE.

The geometric mean ratio (GMR) of
execution time for code generated by CF90
compared to that of CF77 5.0 shall not
exceed 1.15 for the six performance
benchmark suites: Perfect Club, Livermore
Kernels, NAS Kernels, Linpack Kernels,
Cray 100 kernels, and the New Fortran
Performance Suite (NFPS). Note: NFPS is
a collection of Cray applications and some
Cray user proprietary codes used to
measure Fortran performance by the
Performance Test and Evaluation section.
The size of code generated by CF90 shall
not exceed that of CF77 5.0 by more than
20% when measured using the Perfect Club
and NFPS benchmark suites.

Compile time of CF90 shall not exceed
twice that of CF77 5.0 when measured using
the Perfect Club and NFPS benchmark
suites.

Initial experiences with CF90 1.0 have been
favorable. A third party vendor's Fortran 90
version of their library package (NAG)

Copyright © 1994 Cray Research, Inc. All rights reserved.

313

Compile size shall not exceed twice that
of CF77 5.0 when measured using the
Perfect Club and NFPS benchmark suites.
The rationale for these criteria was based on
the fact that CF90 uses new performance
technology for optimization, vectorization,
and tasking which is not nearly as mature as
that of the CF77 compilation system. Also,
these measurements are based on the CFT77
compiler alone, and CF90 has integrated
the analysis and restructuring capabilities of
FPP and FMP into the CF90 compiler itself.
Since the CF90 compiler combines the three
phases of CF77 into a single phase, it was
expected that CF90 would exceed the compile
time and space requirements of the CFT77
compiler.
Three of the four criteria were met.
Execution performance for the various suites
is as follows (GMR of CF90 to CF77):

    Perfect Club        1.06
    NFPS Suite          1.09
    Livermore Kernels   0.97
    NAS Kernels         1.02
    LINPACK Kernels     0.99
    Cray 100 Kernels    1.09
A number less than one indicates that CF90
took less time than CF77.
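The GMR figures above are geometric means of per-benchmark time ratios, which can be sketched as:

```python
import math

def geometric_mean_ratio(ratios):
    """Geometric mean of CF90/CF77 execution-time ratios,
    one ratio per benchmark in the suite."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```

Unlike an arithmetic mean, the geometric mean treats slowdowns and speedups symmetrically: a suite where one code runs twice as slowly (ratio 2.0) and another twice as fast (ratio 0.5) has a GMR of exactly 1.0.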
For the Perfect Club Suite, the execution
size, compile time, and compile sizes are
respectively, 1.05, 2.22, and 1.82 when
compared to CF77 5.0. For the NFPS tests, the
execution size, compile time, and compile
sizes are 1.08, 2.46, and 1.72 respectively
compared with CF77 5.0. The compile time
goals which were not met are being addressed
in revisions of the 1.0 release, and in the
2.0 release of CF90.
An additional known performance issue involves
codes that make use of the Fortran 90 array
transformational intrinsic functions. These
intrinsic functions are generic in their
definition, operating on many differing data
types, arbitrary dimensions of arrays of any
dimensionality, with one or more optional
arguments which change the behavior of the
function based on the presence or value of the
argument. In short, each intrinsic may
exhibit a number of special cases that can be
optimized differently based on the specific
way in which it is called. CF90 1.0 has
implemented these intrinsic functions as
external calls. As external calls, all
local optimization is inhibited. Loop fusion
of neighboring array syntax statements cannot
occur; statements calling the array intrinsic
functions must be broken up into multiple

314

statements, and each call of these functions
may require allocation and deallocation of
array temporaries as well as copy-in/copy-out
semantics for the array arguments. Inlining
of these intrinsics removes many of the
optimization barriers and the need for many of
the array temporaries. Revisions of 1.0 and
release 2.0 will implement automatic inlining
of many instances of these functions. The
1.0.1 revision targets the MATMUL, CSHIFT,
TRANSPOSE, and DOT_PRODUCT intrinsic
functions. Future revisions will inline
additional array intrinsic functions. The
benefits of this inlining are seen not only in
execution time, but also in compile time and
size.
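The cost of treating each array operation as an opaque external call can be illustrated in any language. The Python sketch below (not CF90 code; names are ours) contrasts a "library call per operation" evaluation, which materializes an intermediate temporary array, with the fused single loop that inlining makes possible:

```python
def add(a, b):
    # Stand-in for an external array routine: returns a new temporary.
    return [x + y for x, y in zip(a, b)]

def scale(a, s):
    return [x * s for x in a]

def unfused(b, c, s):
    # Two external calls: allocates one temporary array for (b + c).
    return scale(add(b, c), s)

def fused(b, c, s):
    # One loop, no temporaries: the kind of code inlining enables.
    return [(x + y) * s for x, y in zip(b, c)]
```

Both produce the same result; the fused form simply avoids the allocation, deallocation, and copy traffic of the temporary.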
2.3 Status of Fortran 90 Interpretation Processing

Since Fortran 90 became an American and
international standard, there have been
approximately 200 formal requests for
interpretation or clarification of the intent
of the standard. At present, about 50 of
these requests are still under initial
consideration. The majority of these requests
are for clarification. Some have required
edits to the standard which will appear in
later revisions of the Fortran language
specification. Cray has attempted to take a
conservative approach in our implementation of
the Fortran 90 language in areas open to
interpretation. Thus, the current
implementation may be more restrictive than
what it may need to be when the
interpretations are resolved. Most
outstanding interpretations involve end cases
in the implementation, and if resolved
differently than the current CF90
implementation, changes would result in a
syntax or semantic error in future versions of
CF90. However, there are a small number of
interpretations pending which may be resolved
in a manner which could result in runtime
behavior different from current CF90 runtime
behavior. If these interpretations are
resolved in a manner incompatible with the
current CF90 implementation, an appropriate
mechanism will be provided so that current
behavior will continue to be supported for at
least one year after the change occurs. This
may take the form of supporting old and new
behavior within a compilation with a message
issued stating old behavior will be removed at
some release level, or, in some cases may
require a compile time switch to select old
and new behavior.
3.0 CF77 to CF90 Transition Plan

Cray Research has selected CF90 as our primary
Fortran compiler for the future. CF90 is
based on the new Fortran 90 standard.

Portability is maintained with CF77 codes as
Fortran 90 is a superset of FORTRAN 77, and
CF90 supports CF77 language extensions. CF90
will further promote portable parallel
applications as it is introduced on SPARC
platforms, providing the same programming
environment on workstations as on Cray
platforms. As CF90's integrated optimization, vectorization,
and tasking technology matures, it will
provide better performance than that of CF77,
without the use of preprocessors.
CF77 release 6.0 was the final major feature
release of the CF77 compilation system. There will be
no more major feature releases, only bugfixes.
New hardware support will be provided in
revisions as necessary. Revision support of
CF77 6.0 will continue through third quarter
of 1996.
The CF77 to CF90 transition period is
implemented in two phases beginning
in fourth quarter 1993 and continuing through
third quarter 1996. This allows gradual
migration from CF77 to CF90. During this
transition period, new codes can be developed
using Fortran 90 features, and CF77
applications can be gradually ported to the
CF90 environment.
The first phase of the transition began with
the release of CF90 1.0 and will end when CF90
provides full functionality of the CF77
compilation system and delivers equal or
greater performance than CF77. This phase is
planned to end with a revision of CF90 2.0 in
1995. Phase two begins at this time and
continues through the subsequent release of
CF90.
Existing customers can upgrade their CF77
license to a CF90 license during phase one by
paying the difference in price of a CF90
license over a CF77 license for their system.
CF90 maintenance fees will include CF77
maintenance. Customers who received CF77
bundled with their system receive full credit
for a CF77 paid up license. During phase two,
existing customers must pay the full price for
a CF90 license. The CF90 maintenance
price will then include CF77 maintenance also.
New customers and customers upgrading their
systems during phase one can purchase a CF90
license and will also receive CF77.
Maintenance will only be paid for CF90, but
will include both CF77 and CF90 maintenance.
After the initial release of CF90, CF77 will
not be sold separately to new customers.
During phase two, new customers and customers
upgrading their systems can only purchase
CF90. CF77 will not be available to these

customers during phase two.

4.0 Feature Plans for CF90 2.0
Release 2.0 of the CF90 programming
environment is planned for mid-year 1995. The
primary focus of this release is to provide
features of the CF77 programming environment
not available with CF90 1.0, and to equal or
surpass the performance provided by the CF77
compiling system. Initial MPP support is
planned, and cross compilers running on
workstations targeting Cray platforms will be
introduced in the second release of the
Distributed Programming Environment.
The features planned for CF90 2.0 include
automatic inlining of external and internal
procedures, runtime checking of array bounds
and conformance, runtime checking of argument
data types and counts, and runtime character
substring range checking. Conditional
compilation facilities will be provided by the
compiler. This feature is available to CF77
users through the gpp preprocessor. User
defined VFUNCTIONs written in Fortran are
planned, though the HPF syntax of PURE and
ELEMENTAL procedures may be used instead of
the VFUNCTION directive. No support is
planned for CFPP$ directives, though similar
functionality may be provided for tuning PDGCS
optimization and restructuring through a new
directive set.
Inlining will be similar to CFT77's automatic
inlining capabilities. CF90 will use a
textual interface that produces files in an
intermediate format. Inlining will work off
these intermediate files. This will allow the
frontend of the compiler to run as a stand
alone process producing intermediate text
files. The inliner and optimization/code
generation phases can also be run
independently. The use of the textual
interface is intended to allow for cross-language
inlining and additional
interprocedural optimizations in future
releases. No source to source inlining
capability (similar to FPP's inlining) is
planned in the 2.0 time frame.
CF90 2.0 for SPARC platforms will provide 128
bit floating point support (Cray DOUBLE
PRECISION) and 64 bit integer and logical data
types. CF90 1.0 on SPARC platforms supports
only 32 bit integer and logicals, and 32
and 64 bit floating point representations.
For MPP platforms, CF90 2.0 will be the first
release of the full Fortran 90 language. CF90
will provide packed 32 bit integer, real, and
logical data types which will allow
applications to achieve twice the memory and
I/O bandwidth permitted by 64 bit data types.
In addition, CF90 will support 128 bit

315

floating point data types, which are not
currently available with CF77 on MPP
platforms.
5.0 Summary

The Cray Fortran 90 Programming Environment
was released in December 1993, beginning the
transition from CF77 to CF90. Initial
experiences have been promising. Performance
of generated code is near that of CF77, and in
some cases exceeds that of CF77. Subsequent
releases of CF90 will move the programming
environment to additional platforms, aiding
portability of applications across platforms.
The CF90 2.0 release will match the
capabilities and performance of CF77, and
introduce cross compilers. In 1995, CF90
should become the Fortran environment of
choice for Cray users.
Autotasking, CRAY, CRAY Y-MP, and UNICOS are
federally registered trademarks and CF77,
CFT77, CRAY-2, SEGLDR, Y-MP, and Y-MP C90 are
trademarks of Cray Research, Inc.

316

~ Fortran 90 Update

~ Fortran 90 Update

• Status
• CF77/CF90 Transition Plan

Jon Steidel

• Plans
Cray Research
Software Development

~ Fortran 90 Status

~ Fortran 90 Status
• CF90 Programming Envlrionment 1.0

.1.0 Performance

- Cr.roJ Y·MP. coo. and YMp·EL release December 23,1993
·45 Ucenses
• 21-,-.. 21 C.._

...

• 218.-,.-.

- CFoo Spare Native programming environment 3094
- Initial FortranOO MPP support release 2.0

• Distributed Programming Environment Release
1.0
-3094
- Lisa Krause Thursday 8:30

~ Fortran 90 Status

-

b_..

CoM,.I.
11....

- Execution Tunes
(GMR of CF90 1.0 to cm 5.0)
• CAlO 1.0 Goal
1.15
• PerledChb
1.06
.NFPS
1.09
.97
• Uvermor. Kernels
1.02
• NASKerneis
.99
• UNPACK Kernels
1.09
• ~Wf 100 Kernels

~ Fortran 90 Status

•..

c-,.I.

CF90 1.0 Goal

1.2

2.0

2.0

Perfect Club
NFPS

1.05
1.08

2.22

1.82

2.46

1.72

• Array Intrinslcs
- 1.0 performance poor
• External calls Inhibit optirrization
• Heavy use of fAInlXInriu. C09Y lnlout
- Difficult to Inline
• Optional .gunents
• Variabr. runber of cirn.nslons
• t.l1II1)' spKial cases
- Inlining of some cases in 1.0 revisions
• t.lAllAUL. CSHIFT, DOT_PRODUCT. TRANSPOSE in

1.0.1

317

~ Fortran 90 Status
Fortran 90 Interpretation processing
• Over 200 formal questions about meaning of
the standard
- Some edits to the standard required

• X3J3 and WGS approval required

~ CF77 to CF90 Transition
• Cray Research has selected CF90 as our
primary Fortran complier for the future
- New Fortran standard
-Portability
• Fortran 90 is a superaet of Fortran n
• CF77 extensions supported
• Portable path to parallel applications

- WGS appI'OYed wHI be In F95 standard
- Approximately 50 yet unresolYed

- Performance
• Advlll'lCed compiler technology
• Integrated Autotasklng and optimization

• Some Interpretations may require changes to
CF90

~ CF77 to CF90 Transition

Plan
• CF77 6.0 will be the final major release of the CF77
compiler (no more major features, only bug fixes). Actively
supported through 3Q96.
• Transition period will be implemented in two phases from
4Q93 through 3Q96 to allow for gradual migration from
CF77 to CF90.
  - Development of F90 applications
  - Porting of CF77 applications

~ CF77 to CF90 Transition
(for existing CF77 customers)

• Phase 1:
  - Existing CF77 customers can upgrade to CF90 by
    paying the "Delta" licensing and maintenance fees.
  - CF90 upgrade licensing price = CF90 price - CF77 price
  - CF77 support is bundled with CF90 maintenance.
  - Customers who received CF77 bundled with a system
    received a credit for a CF77 paid-up license.
• Phase 2:
  - CF90 license upgrades will be full price.
  - CF77 support is bundled with CF90 maintenance.

~ CF77 to CF90 Transition
(for new customers and system upgrades)

• Phase 1:
  - New customers and system upgrades purchase the CF90
    license only. CF77 is bundled with CF90.
    Maintenance will only be paid for CF90.
  - After the initial release of CF90, CF77 will not be
    sold separately to new customers.
• Phase 2:
  - New customers and system upgrades can only
    purchase CF90. CF77 will not be available to
    these customers.

~ CF90 2.0 Plans

• 2Q95 release
• Full CF77 functionality
  - Inlining (with textual interface)
  - Runtime checking
    • Array bounds and conformance checking
    • Runtime argument checking
    • Substring range checking
  - Conditional compilation
  - User-defined VFUNCTIONs
• MPP support
  - Full Fortran 90 feature support
  - Packed 32-bit data types
    • INTEGER, LOGICAL, REAL
    • 64-bit COMPLEX (two 32-bit reals)
  - 128-bit floating point representation
    • May use HPF ELEMENTAL and PURE syntax
  - No support for CFPP$ directives

~ CF90 2.0 Plans
• SPARC features
  - 128-bit floating point
  - 64-bit integers
• DPE Features
  - Target machine arithmetic simulation (constant folding)
  - Cross compilers
    • SPARC host, Cray PVP target
  - Lisa Krause, Thursday 8:30

C AND C++ PROGRAMMING ENVIRONMENTS
David Knaak
Cray Research, Inc.
Eagan, Minnesota

Overview
Cray Research has released Standard C programming
environments for both Cray parallel vector (PVP)
systems and Cray massively parallel (MPP) systems
that include high-performance, reliable compilers and
supporting tools and libraries. A C++ compiler for
PVP systems has been released as a first step towards
a high-performance C++ compiler. A C++ compiler
for MPP systems will soon be released. A transition
will occur over the next few years from separate C
and C++ compilers to a single high-performance
C/C++ compiler, which will be part of a rich
programming environment. High-performance C++
class libraries from Cray Research and third-party
vendors will complement the C++ environment. The
direction for parallel C and C++ is still under study.

Historical Perspective on
Programming Environments

C

and

C++

Prior to 1993, Cray Research compilers were released
asynchronously from the libraries and tools. This
made it difficult to coordinate the delivery of new
functionality when it required changes to both the
compiler and to the tools or libraries. In 1993, Cray
Research released several "programming
environments", integrating the compiler, the
supporting tools, and the supporting libraries into a
single package.
The compilers in these programming environments
need to be high-performance compilers. While the C
and C++ languages are considered good for system
implementation, they are not necessarily good for
numerical programming. Several years ago, Cray
Research committed to delivering a high performance
C compiler suitable for numerical programming. We
have achieved this goal with a combination of
language extensions, optimization directives, and
automatic optimizations. We are now committed to
also delivering a high performance C++ compiler. As
was our goal for C, our goal for C++ is to achieve
performance at or near Fortran performance (within
20%) for equivalent codes.


The transition to a high-performance C++ compiler
starts with standard-compliant functionality and will
progress in each release with greater functionality and
with better performance that requires less programmer
effort. Supporting tools and libraries will also be
enhanced. A specific product transition plan will be
presented at a future CUG meeting. We will
collaborate with customers on important applications
to help guide our functionality and performance
enhancements.

Standard C Programming Environment 1.0
for PVP
The Standard C Programming Environment 1.0 was
released for Cray PVP systems in December of 1993.
The components of the PVP version are:
• SCC
• CrayTools: CDBX, xbrowse, PVP performance
tools, [cclint in 1.0.1 release]
• CrayLibs: libm, libsci
The original goal for the Cray Standard C compiler
was to provide an ISO/ANSI compliant compiler that
delivers Cray performance at or near the performance
level of equivalent Fortran code. This goal was
achieved with the 2.0 release and improved on with
the 3.0 and 4.0 releases. Several language extensions
were added to provide some Fortran-like capabilities
that were necessary for delivering the performance.
These language extensions have been proposed to the
ISO/ANSI C committees.
Reliability of the compiler has improved with each
release and is quite good.

Standard C Programming Environment 1.0
for MPP
The Standard C Programming Environment 1.0 was
released for Cray MPP systems in November of
1993. The components of the MPP version are:
• SCC
• CrayTools: TotalView, xbrowse, MPP Apprentice,
[cclint in 1.1 release]
• CrayLibs: libm, libsci, libpvm3

The 1.0 programming environment for MPP supports
multiple PE execution only through message passing.
Two message passing models are available: PVM and
shared memory get and put. Architectural differences
between PVP and MPP systems necessitate some
significant compiler feature differences. On MPP
systems, memory is distributed so distinctions must
be made between local and remote memory
references. On PVP systems, type "float" is 64 bits,
but on MPP systems type "float" is 32 bits. The
computation speed for 32-bit floats is the same
as for 64-bit doubles, but arrays of floats are packed 2
values per 64-bit word, so less memory is used
and the effective bandwidth to and from memory or to and from
disk is doubled. The same applies to shorts. Vector
capabilities don't exist on MPP systems. MPP
systems have additional intrinsics and don't support
PVP-specific intrinsics.

Cray C++ 1.0 for MPP
C++ 1.0 for MPP will be released mid-1994. As with
SCC for MPP, C++ 1.0 for MPP supports PVM and
shared memory get and put models only. TotalView
will support C++, including handling of mangled
names. At about the same time, the MathPack.h++
and Tools.h++ class libraries for MPP will be
released as separate unbundled products.

Some C++ Customer Experiences
Some significant performance successes have been
achieved with the current compiler technology. But
this has required changes to user code, changes to
C++ 1.0, and changes to SCC. To illustrate that C++
codes can perform quite well, two examples are
described below.

Cray C++ 1.0 for PVP
Application 1
Cray C++ 1.0 for PVP was released in August of
1992. It was not released as part of a full
programming environment; rather, it depends on the
C programming environment for compilation of the
generated C code and for the supporting tools and
libraries. C++ 1.0 is an enhanced version of USL
C++ 3.0.2 (cfront). The major enhancements are
pragma support, additional inlining, and restricted
pointers.
The functionality of Cray C++ 1.0 matches that of the
emerging C++ standard as it was at that time and the
performance is adequate where performance is not a
critical concern. Where performance is a critical
concern, good performance has been achieved in
some cases, often with programmer intervention.
There is still room for improvement. Currently,
performance is highly dependent on programming
style. The tech note SN-2131 provides some
guidance for which techniques work best.
The C++ debugging support was significantly
enhanced with the release of SCC 4.0, CDBX 8.1
(both in the C programming environment 1.0) and
with C++ 1.0.1.
There are currently about 40 CRI customers licensed
for C++. If we had not released C++ 1.0, many
customers would have ported the USL code
themselves, duplicating work and not benefiting from
our enhancements.

This application contained the following C++
statement in the kernel of its code:
t(i, j) += temp * matl.index(i, j);
// t, temp, matl are complex class objects

Though this seems like a very simple statement, it is
expanded by C++ 1.0 into about 150 lines of C code
that SCC must compile. (It really doesn't need to be
quite that complicated.) For this statement, the
equivalent Fortran or C code would contain a loop and
perhaps function calls. The class involved here,
complex, is a relatively simple class but the
optimization issues are general and apply to many
classes. In particular, though it is easier for the
compiler to optimize operations on objects that contain
arrays, the more natural style of writing C++ codes is
usually to have an array of objects. This code uses
arrays of objects. The initial performance (on a single
CRAY C90 CPU) for this part of the code with SCC
3.0 was about 9 MFLOPS. Using a more aggressive
optimization level, -h vector3, resulted in partial
vectorization of the loop and the performance reached
about 135 MFLOPS. Using SCC 4.0 and the
compiler options -h ivdep,vector3 resulted in
performance of about 315 MFLOPS.


Application 2
This application is a hydrocode that simulates the
impact of solid bodies at high velocities. The code
has been written to be run on a wide variety of
architectures including scalar, parallel vector, and
massively parallel. A major focus was developing
appropriate classes for the application and tuning the
classes for different architectures. This keeps the
main code quite portable. Another advantage of this
approach is that all the architecture-dependent features
are buried in the class definitions and remain invisible
to the class users. The initial performance of the
entire application on a CRAY Y-MP was about 1
MFLOP.
By using reference counting and
management of temporaries to eliminate unnecessary
memory allocations and deallocations, the
performance reached about 25 MFLOPS. Using
restricted pointers and more aggressive inlining
brought that up to 75 MFLOPS on 1 processor of the
Y-MP. The code has also been ported to a CRAY
T3D using a message passing system based on
get/put. The code runs a typical real-life problem at
about 5.8 MFLOPS per PE with excellent scaling to at
least 128 PEs.
C++ Programming Environment 2.0
The focus for C and C++ programming environments
in 1994 and 1995 is to provide a C/C++ programming
environment with a native, high performance C++
compiler and enhanced tools support of C++. The
2.0 release is planned for mid-1995. In order to focus
our resources on the 2.0 environment, there will be no
major release of the C programming environment in
1994 and 1995. There will be revision releases of the
C and C++ environments as necessary to fix
problems, to support new hardware, and perhaps to
provide some performance enhancements.
The 2.0 environment will include full tools support
for C++. This will include TotalView, xbrowse,
MPP Apprentice, and possibly a class browser.
For the MPP environment, distributed memory
versions of some libsci routines will be available in
2.0. These will include some or all of BLAS2,
BLAS3, LAPACK, FFT, and sparse solvers.
The C++ 2.0 compiler will utilize more advanced
compiler technology. The advantages of the new
compiler technology are:
• eliminates the step of translation from C++ code to
(messy) C code
• able to pass C++ specific information to the
optimizer
• same optimization technology as CF90 compiler
• new internal data structure has potential for:
- lower memory usage
- higher throughput
- inter-procedural optimization
- inter-language inlining
C++ 2.0 will be functionally equivalent to C++ 1.0
with somewhat better performance, and with
exception handling. C++ 2.0 will be an ISO/ANSI C
compliant compiler but won't have all of the
functionality of SCC 4.0. The C++ 2.0 programming
environment will not be dependent on the Standard C
programming environment. The target date for C++
2.0 is mid-1995.
C++ Programming Environment 3.0
The target date for C++ 3.0 is mid-1996. C++ 3.0
will keep up with the emerging C++ standard and will
have the functionality of SCC 4.0 with maybe a few
minor exceptions. C++ 3.0 will outperform SCC 4.0
for C code. C++ 3.0 will outperform C++ 2.0 for
C++ code.
C++ and Class Libraries
The productivity advantage of C++ is realized when
well defined and optimized classes are available to the
end user that match his or her discipline. C++ can be
used as a meta language that allows the class
implementer to define and implement a new language
that suits the discipline. The class user can then think
in terms of the discipline rather than in terms of
computer science. It does put a greater burden on the
class implementer. This division of labor is a net gain
if the classes are used for more than one application.
The more complicated the hardware gets, the more
important it is that software help hide this complexity
from the user.
Cray Research has been and will be very selective in
which C++ class libraries we distribute. Any that we
do support are, and will be, separately licensed and
priced. We believe that there should be, and will be, a
smorgasbord of class libraries from various vendors
in the future. This gives customers the greatest choice

and will allow even small vendors to compete in niche
markets.

Cray Research has taken on the challenge to do its part to
make C++ a success.

The two class libraries that Cray has released are
MathPack.h++ 1.0 and Tools.h++ 1.0 for PVP,
released in December 1993. MathPack.h++ 1.0 and
Tools.h++ 1.0 for MPP will be released mid-1994. As
with the evolution of the compilers, this first release
of class libraries focused primarily on providing
functionality, and then as good performance as time
would allow. These libraries are based on Rogue
Wave libraries and therefore have the benefit of
portability of code across many platforms. Industry
standards (even de facto standards) for class libraries
are needed so that the portability benefits can be
reaped. More tuning of the libraries will improve the
performance.

Parallel C and C++
Cray Research has not committed to any C or C++
language extensions for parallelism, or for distributed
memory systems. We are still looking at various
models, experimenting with some, and encouraging
others to experiment. Any model we select must have
potential for high performance. We are also watching
the parallel Fortran models to see what techniques and
paradigms prove most successful. We want to hear
about customers' experiences with parallel extensions.

Summary and Conclusion
The Standard C programming environments for PVP
and MPP include high-performance and reliable
compilers. We have not yet achieved a high level of
automatic high performance for C++. We are making
the transition over the next few years to a single, high
performance C/C++ compiler that will be part of a rich
programming environment. In the short term, good
C++ performance still requires work by the
programmer. We will work with customers on
important applications to guide our efforts at
enhancing the compiler, tools, and libraries.

In the long run, the success of C++ will depend on:
• compilers that deliver the performance of the
hardware with minimal programmer intervention,
• good supporting tools and libraries,
• well defined and implemented class libraries for
different disciplines,
• and overall, a significant improvement over other
languages in ease of use and time to solution.


The MPP Apprentice™ Performance Tool:
Delivering the Performance of the Cray T3D®
Winifred Williams, Timothy Hoel, Douglas Pase
Cray Research, Inc.
655-F Lone Oak Drive
Eagan, Minnesota 55121

ABSTRACT
The MPP Apprentice™ performance tool is designed to help users tune the performance of their
Cray T3D® applications. By presenting performance information from the perspective of the user's
original source code, MPP Apprentice helps users rapidly identify the location and cause of the most
significant performance problems. The data collection mechanism and displays scale to permit performance analysis on long-running codes on thousands of processors. Information displayed within
the tool includes elapsed time through sections of code, time spent in shared memory overheads and
message passing routines, instruction counts, calling tree information, performance measures, and
observations. We will demonstrate how the MPP Apprentice guides the user's identification of performance problems using application examples from benchmarks and industry.

1  Introduction

Fast hardware and system software are great, but the
performance that really counts is that which the end
user can achieve. Early MPPs were notoriously difficult to program, and it was even harder to approach the
peak performance of the systems. While this has been
improving across the board, Cray has a particularly
strong story to tell. The Cray T3D has very fast hardware [CRI93-1], system software, and libraries which
overcome the limitations of many other systems.
Cray's MPP Fortran programming model [Pa93],
along with hardware support from the Cray T3D,
attacks the programming issue by letting the user program a distributed memory system as though the memory were shared. Good compilers provide the
performance. But there is always more the user can do
to improve the performance of a code.
The focus of the MPP Apprentice tool is to deliver the
performance of the Cray T3D to the user. The tool
assists the user in the location, identification, and resolution of performance problems on the Cray T3D. Its
instrumentation and displays were designed specifically to permit performance analysis on thousands of
processors, to handle the longer-running codes that
large systems enable, and to support work-sharing,
data parallel, and message passing programming models. Performance characteristics are related back to the
Copyright © 1994. Cray Research, Inc. All rights reserved.


user's original source code, since this is what the user
has the ability to change in order to improve performance, and are available for the entire program, or a
subroutine, or even small code blocks.
2  Possible approaches

Many performance tools today collect a time-stamped
record of user and system specified events, more commonly called event traces. AIMS [Ya94], GMAT
[CRI91], Paragraph [He91], and TraceView [Ma91] are
examples of event trace based tools. These tools present
information from the perspective of time, and try to
give the user an idea of what was happening in every
processor in the system at any given moment. While
event trace tools have great potential to help the user
understand program behavior, the volume of data is difficult to manage at run-time and when post-processing.
The size of an event trace data file is proportional to the
number of processors, the frequency of trace points,
and the length of program execution. Handling such
large data files at run-time can perturb network performance, interprocessor communication, and I/O. At
post-processing time, the volume can be difficult to display and interpret. The data volume also raises concerns
about how event trace tools will scale to handle longer
running programs and larger numbers of processors.
Other tools record and summarize pass counts and
elapsed times through sections of code, and will be

called "stopwatch" tools for the purposes of this paper.
ATExpert [Wi93] and MPP Apprentice [CRI93-2,
Wi94] are examples of these tools. Stopwatch tools
present information from the perspective of the user's
program and provide the user with performance characteristics at a particular point in the program. Profiling tools present information from a similar
perspective, although their data collection mechanism
is very different. The data volume of stopwatch tools,
which is proportional to the size of the program, is
much smaller than that of event trace tools. The reduced data
volume intrudes less on program behavior and permits
much finer-grained data collection. The data volume is
also more manageable for a tool to post-process and a
user to interpret, and scales well to growing numbers
of processors and lengths of program execution. The
MPP Apprentice tool is a stopwatch tool.

3  MPP Apprentice method

A compile-time option produces a compiler information file (CIF) for each source file and turns on MPP
Apprentice instrumentation. A CIF contains a description of the source code from the front end of the compiler and encapsulates some of the knowledge of the
user's code from the back end of the compiler, such as
instructions to be executed and estimated timings. The
instrumentation occurs during compilation after all
optimizations have occurred. Instructions are added
strategically to collect timing information and pass
counts while minimizing the impact on the user's program.
During program execution, timings and pass counts for
each code block are summed within each processor
and kept locally in each processor's memory, enabling
the MPP Apprentice to handle very long-running codes
without any increase in the use of processor memory.
At the end of program execution, or when requested by
the user, the power of the Cray T3D is used to sum the
statistics for each code object across the processors,
keeping high and low peaks, and a run-time information file (RIF) is created. The MPP Apprentice post-processes the RIF, the CIFs, and the user's source files.

4  Visualization of data

When a user initially runs the MPP Apprentice on their
code, the tool provides a summary of the statistics for
the program and all subroutines, sorting them from the
most to the least critical. The list of subroutines
includes both instrumented as well as uninstrumented
subroutines, such as math and scientific library functions. MPP Apprentice defines long-running routines
as "critical", and permits the user to redefine "critical"
if desired. The summary breaks down the total time for
each instrumented subroutine into time spent in overhead, parallel work, I/O, and called routines. For uninstrumented subroutines, only the total time is available.
Overhead is defined as time that would not occur in a
single processor version of the program, and includes
PVM communication, explicit synchronization constructs (such as barriers), and implicit synchronization
constructs (such as time spent waiting on data, or
implicit barriers at the end of shared loops). MPP
Apprentice details the exact amount and specific types
of overhead.
Figure 1 shows a sample main window of MPP
Apprentice. The upper panel of the window, or navigational display, shows summarized statistics for the program and each of its subroutines. The legend window
identifies the breakdown of the total time for each code
object. A small arrow to the right of a subroutine name
indicates that the subroutine has been instrumented and
that it may be expanded to see performance characteristics of nested levels. All of the detailed information
available for the program and subroutines is also available for nested levels. Nested code objects are identified by a name identifying the code construct, e.g., If or
Do, and a line number from the original source code.
The user may ask to see the full source code for a code
object. A source code request invokes Cray's Xbrowse
source code browser, loads the appropriate file automatically, and highlights the corresponding lines of
source code (see Figure 2).
The middle panel of MPP Apprentice details the costs
for the code object selected in the navigational display.
It toggles between providing information on instruction counts, shared memory overheads, and PVM overheads. When displaying instruction counts, the exact
number of each type of floating point and integer
instruction is available, as well as the number of local,
local shared, and global memory loads and stores.
These values assist the user in balancing the use of the
integer and floating point functional units and maximizing local memory accesses to fully utilize the
power of the Cray T3D. The shared memory overheads
display gives the amount of time spent in each type of
synchronization construct as well as time spent waiting
on the arrival of data. The PVM overheads display
gives the amount of time spent in each type of PVM
call. Samples of each of these displays are available in
Section 5.
A call sites display helps look at algorithmic problems
related to the calling of one subroutine from another. It
lets the user see all the sites from which a selected subroutine was called, and all the subroutines to which the
selected subroutine makes calls. Timings and pass
counts are available with this information.


Navigational Display
Selecting items (code objects) in
this window updates all other
windows accordingly. Code objects
with arrows may be expanded to
provide performance information on
nested levels by selecting the
arrow or using the navigate menu.

Figure 1: Main window of the MPP Apprentice tool



Figure 2: MPP Apprentice tool working cooperatively with Xbrowse

A knowledge base and analysis engine built into MPP
Apprentice derives secondary statistics from the measured values. It provides the user with performance
measures (such as MFLOPS ratings), analyzes the
program's use of cache, makes observations about the
user's code, and suggests ways to pursue performance
improvements. It attempts to put the knowledge of
experienced users into the hands of novices. Since the
Cray T3D is a relatively new machine, even experienced users have a lot to learn, so the knowledge base
is expected to grow.

5  Identification of Performance Problems

Many of the commercial and benchmark codes optimized with the MPP Apprentice so far have had a
large amount of time spent on a synchronization construct as their primary performance bottleneck. Time
at synchronization constructs indicates some type of
imbalance in the code, a problem that is typical of
MPP codes in general. The interconnection topology
of the Cray T3D is much faster than other commercially available MPPs, but these performance problems, while reduced substantially, still exist.


5.1  Load imbalance with a message passing code

Figure 1 shows the initial MPP Apprentice output for
a chemical code. With the subroutines sorted in order
of decreasing total time, it is evident that PVMFRECV,
a blocking message receive function, is consuming
the most time, more than half of the total time of the
program. Intuitively it seems undesirable to spend
more than half of program execution time waiting for
messages. By selecting subroutine PVMFRECV in the
Navigational Display and taking a look at the call
sites display, we can see each call to PVMFRECV and
the amount of time spent in it.

5.2  Load imbalance with Fortran programming model code

In Figure 3 we can see all of the calls to PVMFRECV.
There are not many of them, but two calls are taking
substantially more time than the others. If we were to
select the units button and switch to viewing pass
counts, we would see that for the same number of calls
to PVMFRECV from different call sites, call times
vary significantly. Some type of load imbalance
exists. By resolving this problem, developers realized
a speedup of three and a half times.

The main window in Figure 4 is from the NAS SP
benchmark code. This code uses one of the synchronization constructs available as part of Cray's Fortran
Programming Model, a barrier. The navigational display makes it evident that the barrier is the most critical subroutine in the user's program. With the middle
panel toggled to show shared memory overheads, the
time spent in the barrier shows up as a type of overhead. Since a barrier is a subroutine call, the user
could approach this similarly to the PVM problem
shown previously by using the call sites display to
view the locations from which the barrier was called,
and the time spent at each location.


Figure 3: Call sites display for PVMFRECV



Figure 4: Main window of CRAFT code with a barrier

5.3  Poor balance of floating point and integer calculations
Figure 5 shows the main window for a hydrodynamics code. The most critical subroutine is an uninstrumented subroutine called $sldiv. If the user were to
search the source code, $sldiv would not be found.
The observations window in Figure 6 notes and
explains the time spent in $sldiv. Since there is no
integer divide functional unit on the Alpha chip used
in the Cray T3D, a subroutine call must be made to do
the divide. Since the cost to call a subroutine is significantly greater than the direct use of a functional unit,
the program performance will benefit by limiting the
number of integer divides. If we return to Figure 5,
and look at the middle panel, which is currently displaying instruction counts, we will note a large number of integer instructions relative to floating point
instructions. Since there are both integer and floating
point functional units which may be used simultaneously, it is desirable to balance their use. The
instructions display indicates an underutilized floating
point unit. When this is combined with the large


Figure 5: Main window of a hydrodynamics code

amount of time spent in $sldiv, a strong case could
be made for converting some integers to floats.

6  Conclusion

Initial users have had a tremendous amount of success
with the MPP Apprentice on substantial codes. The
success of developers working on the chemical code
has already been mentioned. Several developers
working on an electromagnetics benchmark were surprised to find that the routines they had been working
to optimize were the two running most efficiently, and
their biggest bottleneck was elsewhere. Another user
was able to take his code from 29.1 MFLOPS to 491
MFLOPS in a short period of time by resolving problems with PVM communication and barriers that
MPP Apprentice pointed out.
The method of data collection, and the choice of the
data being collected, permits the MPP Apprentice to
scale well to long-running programs on large numbers
of processors. By presenting data from the perspective
of the user's original source code, the user is able to
quickly identify the location of performance problems. Detailed information on instructions, shared

Code Performance:

4 Total processors (PEs) allocated to this application

 7.90 x 10^6  Floating point operations per second (for 4 PEs)
32.63 x 10^6  Integer operations per second (for 4 PEs)

 1.97 x 10^6  Floating point operations per second per processor
 8.16 x 10^6  Integer operations per second per processor

 3.58 x 10^6  Private loads per second per processor
 1.38 x 10^6  Private stores per second per processor
 0.00 x 10^6  Local shared loads per second per processor
 0.00 x 10^6  Local shared stores per second per processor
 0.00 x 10^6  Remote loads per second per processor
 0.00 x 10^6  Remote stores per second per processor

15.36 x 10^6  Other instructions per second per processor
30.45 x 10^6  Instructions per second per processor

0.55 Floating point operations per load
2.28 Integer operations per load

Time spent performing different task types:
1744039 usec (20.21%) executing "work" instructions
2237192 usec (25.92%) loading instruction and data caches
      0 usec ( 0.00%) waiting on shared memory operations
  34695 usec ( 0.40%) waiting on PVM communication
      0 usec ( 0.00%) executing "read" or other input operations
 745985 usec ( 8.64%) executing "write" or other output operations
3868666 usec (44.83%) executing uninstrumented functions
        (100.00%) Total

Detailed Description:
The combined expenditure of time for $sldiv routines is measured to be
3342606 usec, or 38.73% of the measured time for this program.
The DEC Alpha microprocessor has no integer divide instruction. When the
code calls for an integer divide operation the compiler must insert a
call to a library routine to perform the divide. The name of the divide
routine is "$sldiv". $sldiv performs a full integer divide which is
expensive because it is done in software. If the full range of integer
values is not used but other integer properties are needed, it can be
faster to use floating point values and floating point divides in place
of integers, truncating or rounding fractions as needed.
Figure 6: Observations window for the hydrodynamics code


memory overheads, and PVM communication allows a user to quickly identify the cause of performance
problems. The knowledge base encapsulated in observations provides performance numbers and helps the
user identify and resolve more difficult problems. As
demonstrated above, the identification of problems in
Cray T3D codes today can be achieved efficiently
using the MPP Apprentice.

7 References

[CRI91] UNICOS Performance Utilities Reference Manual, SR-2040 6.0, Cray Research, Inc., Eagan, Minnesota, 1991.
[CRI93-1] Cray T3D System Architecture Overview, HR-04033, Cray Research, Inc., Eagan, Minnesota, 1993.
[CRI93-2] Introducing the MPP Apprentice Tool, IN-2511 1.0, Cray Research, Inc., Eagan, Minnesota, 1993.
[He91] M. Heath and J. Etheridge, "Visualizing the Performance of Parallel Programs," IEEE Software, Volume 8, #5, September 1991, pp. 28-39.
[Ma91] A. Malony, D. Hammerslag, and D. Jablonowski, "Traceview: A Trace Visualization Tool," IEEE Software, Volume 8, #5, September 1991, pp. 19-28.
[Pa93] Douglas M. Pase, Tom MacDonald, and Andrew Meltzer, "MPP Fortran Programming Model," Cray Internal Report, February 1993. To appear in Scientific Programming, John Wiley and Sons.
[Re93] D. Reed, R. Aydt, T. Madhyastha, R. Noe, K. Shields, and B. Schwartz, "The Pablo Performance Analysis Environment," Technical Report, University of Illinois at Urbana-Champaign, Department of Computer Science.
[Ya93] Jerry C. Yan, "Performance Tuning with AIMS -- An Automated Instrumentation and Monitoring System for Multicomputers," HICSS 27, Hawaii, January 1994.
[Wi93] Winifred Williams and James Kohn, "ATExpert," The Journal of Parallel and Distributed Computing, 18, Academic Press, June 1993, pp. 205-222.
[Wi94] Winifred Williams, Timothy Hoel, and Douglas Pase, "MPP Apprentice Performance Tool," to appear in Proceedings of IFIP, April 1994.


Cray Distributed Programming Environment

Lisa Krause
Compiler Group
Cray Research, Inc.
Eagan, Minnesota

Abstract
This paper covers the first release of the Cray Distributed Programming
Environment. This will be one of the first releases of Cray software that
is targeted for a user's workstation and not a Cray system.
The paper covers the goals and functionality provided by this initial
release and shows how the user's environment for doing code development
will move to his workstation. This paper also briefly describes plans for
upcoming releases of this distributed environment.

Introduction
In response to customer requests, Cray
Research, Inc. will provide the interactive
portion of the Cray Programming Environments
on the user's desktop system. This first
release of the Cray Distributed Programming
Environment (DPE) 1.0 is scheduled to coincide
with the Programming Environment 1.0.1 release
in the third quarter of this year. The
predominant goal for DPE 1.0 is to provide the
Cray Programming Environment on the user's
desktop. Although not all of the programming
environment components can be moved to the
desktop with this initial release, the focus
for DPE 1.0 is on Fortran 90. By providing
Cray's Fortran 90 on both the user's
workstation and Cray system, we hope to
enhance the user's ability to do code
development on whatever system is currently
available and useful to them.

Programming Environments

Currently, without having the Distributed
Programming Environment on the user's desktop,
all program development that is done to
confirm code correctness and to optimize
performance is done on their Cray parallel
vector system. To be able to fully utilize
the capabilities of the program browser and
the performance analysis tools, an interactive
session on the Cray system is needed. If no
interactive sessions are available at the
site, the user must try to perform all of his
tasks through a batch interface. Although
batch use is valuable for determining Cray
system utilization for big jobs, using batch
for program development can be slow and
involve numerous delays.
The Cray Distributed Programming Environment
will seek to alleviate the interruptions and

delays that can occur when doing code
development through
a batch environment or even a heavily loaded
interactive Cray environment. With DPE 1.0,
the user now has the performance tools, such
as ATExpert, flowview, perfview, profview,
procview, and jumpview, residing directly on
his workstation. The program browser, xbrowse,
also is on the workstation and can utilize the
Fortran 90 "front-end" of the compiler to
examine, edit, and generate compilation
listings. Although not a full compiler, this
"front-end" component parses the code and
diagnoses syntactic and semantic errors. The
CF90 "front-end" also produces a partial CIF
file which can be used through xbrowse or
cflist on the desktop. Users can communicate
with their Cray system when they need to
finish compilations, load their programs, and
execute them to obtain performance results by
using scripts that facilitate this
communication. The communication mechanism
used is ToolTalk which is provided in the
CrayTools package. CrayTools is included with
the CF90 Programming Environments and is
bundled with the host platforms targeted for
DPE, with the exception being SunOS.
The advantages provided with the addition of
the Cray Distributed Programming Environment
to a user's working environment are numerous.
With DPE 1.0 the user can now utilize Cray's
Fortran 90 compiler on both the workstation
and their Cray system. Access to the
interactive performance analysis tools that
are contained in the Cray Programming
Environments is easier, since these tools
reside on the user's workstations. By having
these components of the environment available
for the user on their workstation, code
development can move off of the Cray systems.
This frees up the Cray system cpu cycles to be
utilized for large jobs that cannot run

Copyright © 1994 Cray Research, Inc. All rights reserved.


elsewhere. Using the program browser and the
tools interactively on their workstation frees
users from needing to have an interactive
session on their Cray systems. This favorable
environment for the tools also provides a
consistent interactive response time when they
are invoked.

The Cray Distributed Programming Environment
1.0 will need to have the CF90 Programming
Environment installed on a Cray Y-MP, Cray
C90, or Cray EL system, running a UNICOS
release level of 7.0.6, 7.C.3, or 8.0.

Release Plans

The release plan for the Cray Distributed
Programming Environment 1.0 is consi~tent with
the release plans for the other Cray
Programming Environments. DPE 1.0 is due to be
released in the third quarter of 1994 and DPE
2.0 will be released in coordination with the
Cray Programming Environment 2.0 releases. The
goals for this first release are to provide
the interactive tools on the desktop and to
gather customer feedback on this distributed
environment.
Functionality of DPE 1.0

The Cray Distributed Programming Environment
will provide many features of the Cray
Programming Environments on the user's
desktop. With the first release of DPE, the
user will be able to browse and generate a
compilation information file (CIF) by using
the program browser, xbrowse. Using the
interface provided by xbrowse, or with the
cflist tool, the user can examine this CIF to
obtain information on syntax and semantic
errors regarding their codes without having to
go to the Cray system. When a full compilation
of the code is needed, the user can initiate
this Fortran 90 compilation for the Cray
system from their workstation. Additional
executable components and commands contained
in the first release of DPE provide both a
framework and a method for communicating what
is to be accomplished on the workstation and
on the Cray system. Besides obtaining binaries
from a full Fortran 90 compilation, commands
can also be used to generate an executable and
move this to the Cray system to be run for
performance information. Performance analysis
can then be done on the desktop, especially
the analysis that can be obtained utilizing
the ATExpert performance tool. To understand
and learn more about the performance tools and
the Cray Fortran 90 compiler, the user will be
able to reference on-line documentation using
the CrayDoc reader.
For the first release of DPE, the Sun SPARC
workstation platform was chosen as the host
platform, as well as the CRS CS6400 platform.
These workstations can be running either SunOS
4.1 or Solaris 2.3 or later to be compatible
with DPE 1.0. For the SunOS workstations
ToolTalk will need to be obtained, as this
communications tool does not come bundled into
the operating system. The FLEXlm license
manager is needed to control access to the DPE


Future Plans

The second release of the Cray Distributed
Programming Environment, which is planned for
the first half of 1995, will focus on creating
a goal-directed environment. Instead of simply
providing tools needed to support code
development and analysis, the second release
will move to provide a process where the user
can simply indicate what the final goal should
be for the code being provided. If the goal is
to "optimize" the code, this goal-directed
environment, directed by a visual interface
tool, will lead the user through the steps
needed to provide optimization to the code.
Tentative plans for DPE 2.0 will strive to
support this goal-directed emphasis by
providing cross-compilers, distributed
debugging capabilities and a visual interface
tool. The plans may also include expanding the
target platforms to encompass future MPP
systems. Analysis is ongoing to determine the
value of increasing the host platforms that
will support DPE 2.0 to workstation platforms
such as RS6000 and SGI.
Summary
In summary, by responding to customer
requests, Cray Research, Inc. will release the
major components of the Cray Programming
Environment to run on desktop systems. The
first release will provide tools for
performance and static analysis including the
Fortran 90 front-end. The second release is
currently planning to provide a goal-directed
environment highlighted by cross-compilers and
a distributed debugging capability.

Slide: Cray Distributed Programming Environment -- Provide the Cray Programming Environment on the User's Desktop (Cray User Group, San Diego, CA, March 1994)

Slide: Without DPE -- DPE 1.0 focus is on Fortran 90

Slide: Customer Advantage, DPE 1.0
• Cray's Fortran 90 available from desktop to Cray systems
• Easy access to the Cray Programming Environment
• Development moves off of Cray systems, allowing more Cray cycles for large jobs that cannot run elsewhere
• Users can take advantage of the Cray programming environment without an interactive session
• Consistent interactive response time when running the tools

Slide: Goals, Release 1.0
• Provide interactive tools on the desktop
• Gather customer feedback on distributed environment

Slide: Release Plans
• Release 1.0 - 3Q94
• Release 2.0 - 1H95

Slide: Functionality, Release 1.0
• Browse and generate compilation information from the desktop
• Perform syntax checking and static analysis from the desktop
• Initiate the compilation of Fortran 90 programs on Cray parallel vector systems
• Generate an executable and analyze the performance from the desktop
• Execute ATExpert from the desktop
• Reference online documentation from the desktop

Slide: Platforms, Release 1.0
• Host: Sun SPARC (Solaris 2.3 or later; SunOS 4.1), CRS CS6400 (Solaris 2.3 or later)
• Target: Cray Y-MP, C90, EL; future Cray PVP systems

Slide: System Requirements, Release 1.0
• Host: ToolTalk for SunOS; FLEXlm
• Target: CF90 Programming Environment 1.0; UNICOS 7.0.6, 7.C.3, or 8.0

Slide: DPE 2.0 -- Focus is on Goal-Directed Environment

Slide: Goals, Release 2.0
• cross-compilation
• distributed debugging
• visual interface tool
• target platform extends to MPP systems
• host platforms extended

Slide: Summary
Responding to customer requests, Cray Research Inc. will release major components of the Cray Programming Environment to run on desktop systems.
• Release 1.0: tools for performance and static analysis, CF90 front-end
• Release 2.0: goal-directed environment, cross-compilers and distributed debugger

Cray TotalView™ Debugger
Dianna Crawford
Cray Research, Inc.
655-F Lone Oak Drive
Eagan, Minnesota 55121

ABSTRACT

Cray TotalView is a window-oriented multiprocessor symbolic debugger. First released in late
1993, Cray TotalView is a key component in the Cray Programming Environments. In 1994,
it will be available on all current hardware platforms and will debug programs written in Fortran 90, FORTRAN 77, C, C++, and assembler. This paper gives an overview of the features included in Cray TotalView, describes the release plans, and outlines the plans for the CDBX debugger,
which Cray TotalView will eventually replace.

1 Introduction

Cray Research has released a new debugger, Cray
TotalView. This new debugger will span all current and
new Cray-supported hardware platforms and compiler
languages and will eventually replace the Cray CDBX
debugger.
Faced with the challenge of providing a debugger on
Cray Research's massively parallel system, the Cray
T3DTM, we realized that it would be difficult to provide
a useful product by porting CDBX to that environment.
CDBX was designed to debug serial programs and the
extensions to provide debugging of moderately parallel
programs were of limited success. Also, CDBX was
designed for a single architecture and was not very portable. We evaluated the currently available debuggers
and selected the TotalView debugger from Bolt
Beranek and Newman Inc. (BBN) as the starting point
for the development of the new Cray parallel debugger.
TotalView was selected because it was a powerful, easy
to use debugger that had been designed to debug parallel programs, and it was portable enough to provide a
good base for developing a debugger that would span
multiple languages and hardware architectures.
Cray TotalView was released in the fourth quarter of
1993 and provided initial support for Cray T3D programs and Cray Fortran 90 programs running on the
Cray Y-MP and C90 platforms. In 1994, Cray TotalView will support Fortran 90, FORTRAN 77, C, C++
and assembler programs running on Cray massively
parallel, parallel vector and SPARC systems. This will

Copyright © 1994. Cray Research Inc. All rights reserved.


provide a common debugger across all Cray supported
platforms, allowing programmers to leverage their
learning investment when they program for more than
one type of system.
Section 2 describes several of the features available with
Cray TotalView. Section 3 describes the Cray TotalView release features and schedules. Section 4 describes
the current plans for support of the CDBX debugger.
Section 5 provides a summary of the paper.
(Note: To simplify the terminology, through the remainder of this paper the term TotalView is used to refer to
the Cray TotalView debugger.)

2 Product Overview

TotalView is a window-oriented multiprocessor debugger. The TotalView debugger has an X Window System™ graphical interface that is easy to learn and
displays by default the information most commonly
needed when debugging. The window interface conveys
the basic concepts of parallel codes, providing a separate
window for each process the user is currently debugging. Execution control, including breakpoints and stepping through code, is available across all processes or
for a single process. TotalView allows the user to work
within their language of choice and supports expression
evaluation for both Fortran and C. Viewing large, complex data arrays is supported through an array browser
which displays a two-dimensional slice of a multidimensional array. TotalView functionality is described below
using the screen images shown in Figures 1 through 5.

Figure 1 shows an example of the Root window. This is
the main TotalView window and the first that the user
sees when invoking TotalView. From this window, the
user can start a debug session by loading an executable
or a core file or attaching to a running process, access
the on-line help, and view process information. In the
example Root window, the display shows process
information for each of the four processes that are part
of a single program. One of the processes, cray_transpose, is stopped at a breakpoint, designated by the "B"
after the process id. The user can debug any number of
the processes by clicking the mouse on that process status line which will open a Process window for the
selected process. TotalView also allows debugging of
multiple programs from a single interface, which
means, for example, that the user can debug both sides
of a program distributed between a Cray Y-MP and
T3D.
An example of the Process window is shown in Figure
2. There is one Process window for each process the
user is actively debugging. From this window, the user
can control processes, examine code and data and
manipulate program data. Much of TotalView's functionality is available from this versatile and powerful
window, only a small fraction of which is described
here.
The source pane is at the bottom of the Process window.
This displays the source code of the routine the user is
currently stopped in. The user can change the display to
show the assembler code or source code interleaved
with the corresponding assembler code. Lines where
breakpoints can be set are designated with a small hexagon to the left of the line number. The user clicks the
mouse on the hexagon breakpoint symbol to set or
remove a breakpoint, conditional breakpoint or expression evaluation. The user can run to a breakpoint or a
specific line and can single step through the program,


including stepping across function calls, controlling all
processes or just the process shown in the window. The
text within the source pane is also active. Clicking the
mouse on a variable name allows the user to display and
change its value. Clicking the mouse on a function call
displays the source for that function. The example in
Figure 2 shows a display of the source code for the
function do_work. The program is stopped at a breakpoint at line 278. An evaluation point is set at line 274
(Figure 4 shows the contents of this evaluation point)
and another breakpoint is set at line 281.
The pane at the top right of the Process window is the
current routine pane. It displays the calling parameters,
local variable information, current program counter and
access to the registers. The user can click the mouse on
the information shown to display additional information or change values. For example, Figure 2 shows that
the current routine, do_work, has one calling parameter, prob_p, which is a pointer to a structure. The user
can click the mouse on (Structure) to view that structure's elements and values. Clicking the mouse on the
value I of local variable i allows the user to change the
value.
The top left pane of the Process window, called the
sequence pane, displays the stack trace. The user can
walk the stack by clicking the mouse on an entry point
name which updates the source pane and current routine pane with information for the selected routine. In
the example shown in Figure 2, the user is currently
stopped in the routine do_work, which was called by
$TSKHEAD. $TSKHEAD is the tasking library routine which initiates all slave tasks.
Figure 3 shows an example of the Action_point window. TotalView allows the user to save and load breakpoints across debug sessions. The Action_point
window shows all of the breakpoints and evaluation
points currently set within the program.

30071  T  cray_transpose.3
30072  T  cray_transpose.2
30073  T  cray_transpose.1
30055  B  cray_transpose

Figure 1: Root window


Figure 2: Process window


From this window the user can disable or delete breakpoints and evaluation points and can locate and display a breakpoint
within the source.


The Breakpoint window is shown in Figure 4. From this
window the user can set a conditional breakpoint or an
expression evaluation point, using either C or Fortran
syntax. These action points can be shared across all processes or a single process and can STOP all processes
when the breakpoint is hit or only a single process. The
example shows a conditional breakpoint (set at line 274
in Figure 2) which is shared across all processes and
will STOP all processes when it is hit.

3 Release Plans

TotalView 0.1 is available today; it was released in the
fourth quarter of 1993 with CrayTools 1.1. It provides
full functionality for source-level debugging of parallel
codes. All of the functionality described in section 2 is
provided in this first release. TotalView 0.1 provides
the initial support for Cray T3D programs and Cray
Fortran 90 applications on Cray Y-MP and C90 systems.

Figure 5 shows the array browser. The array browser
displays a two dimensional slice of a multidimensional
array. The user can specify the slice to display, the array
location to display in the upper right hand corner, can
scroll through the array, locate specific values and
change values. The array browser has been optimized
for large arrays and will only read in the portion of the array being displayed plus a small buffer zone for scrolling, with additional portions of the array read in only as needed.

The first general release of TotalView is release 1.0,
scheduled for the second quarter of 1994 with CrayTools 1.2. TotalView 1.0 will support Cray Fortran 90,
FORTRAN 77, C, C++ and assembler on Cray T3D,
Y-MP, C90, EL and SPARC systems. With this release
TotalView will support all current Cray languages and
system architectures. Additional features include


Figure 3: Action_point window

Figure 4: Breakpoint window


Figure 5: Array browser

CRAFT T3D programming model support and a batch
interface for debugging core files. Significant effort has
been made in developing reliable, portable code and a
strong regression test suite in order to provide a stable,
usable product across a broad platform base for this first
general release of TotalView.

(Note: CrayTools is a component of the Cray Programming Environment releases. Each Cray Programming
Environment release provides a complete development
environment including a compiling system, CrayTools
development tools, and CrayLibs high performance
libraries.)

A minor release, TotalView 1.1, is scheduled for the
fourth quarter of 1994 to provide two important features, data watchpoints and fast conditional breakpoints. This will be delivered in the CrayTools 1.3
release.


TotalView 2.0, scheduled for the middle of 1995 with
CrayTools 2.0, will provide support for new Cray hardware. Release 2.0 will enhance the array browser to
provide visualization of the array data and will provide
enhancements to aid in debugging large numbers of
processes. It will also provide distributed debugging for
the Cray Distributed Programming Environment. This
will allow the user to run TotalView on their desktop
workstation to debug a program running on a Cray system.


4 CDBX Plans

CDBX will eventually be replaced by TotalView. The
following outlines the current plans for CDBX support.
CDBX 8.1, which was released in late 1993, is the last feature release of CDBX. The 8.1 release supports Fortran 90, FORTRAN 77, C, C++, Pascal, and assembler
on Cray X-MP, Cray-2, Cray Y-MP, C90 and EL systems. No new language or new hardware support is
planned. CDBX does not support the Cray T3D or
SPARC systems, for example, nor will it support
Cray's new native C++ compiler planned for 1995.
Minor changes will be made to CDBX to support compiler and UNICOS upgrades. CDBX will be supported
with fixes through 1996.

5 Summary

Cray Research has released a new window-oriented
multiprocessor debugger called Cray TotalView. Cray
TotalView will support all current and new Cray languages and architectures, providing a common debugger across the entire range of products.
TotalView 0.1 is available today, providing full functionality for source-level debugging of parallel T3D
and Fortran 90 codes. TotalView 1.0, scheduled for second quarter of 1994, will support Cray Fortran 90,
FORTRAN 77, C, C++ and assembler on Cray T3D,
Y-MP, C90, EL and SPARC systems. With this release,
TotalView will support all current Cray languages and
system architectures. Data watchpoints and fast conditional breakpoints will be supported in the TotalView
1.1 release scheduled for the fourth quarter of 1994. In
mid-1995 we will deliver data visualization capabilities
and a distributed debugger for debugging Cray programs from the desktop.


Fortran I/O Libraries on T3D
Suzanne LaCroix

Cray Research, Inc.
655-F Lone Oak Drive
Eagan, Minnesota 55121

ABSTRACT

The fundamentals of Fortran I/O on the CRAY T3D system will be covered. Topics include
physical I/O environments, I/O paradigms, library functionality, performance considerations,
and tuning with the I/O library.

1 Introduction

The CRAY T3D is a Massively Parallel Processor system with physically distributed, logically shared memory. The CRAY T3D is attached to, or "hosted by", a
CRAY Y-MP or CRAY C90 computer system. This
paper covers the fundamentals of Fortran I/O on the
CRAY T3D. Section 2 provides background information about the CRAY T3D physical I/O environment,
section 3 discusses T3D I/O paradigms, section 4 gives
a high level description of library functionality available on the T3D, section 5 describes performance considerations and tuning opportunities, and section 6
provides a summary.

2 Background

Cray Research offers two physical I/O environments for
T3D systems, Phase I and Phase II. The Phase I I/O
environment is the default and is available today. In this
environment, all disk data flows from an I/O Cluster
through the host Y-MP or C90 system to the T3D. The
host maintains one common set of filesystems so that all
files available to a user on the host are also available to
a user on the T3D.
The Phase II I/O environment is optional and is currently under development. In this environment, the
optional second high speed channel (HISP) on an I/O
Cluster may be connected to the T3D. This allows disk
data to flow directly from the I/O Cluster to the T3D,

Copyright © 1994. Cray Research Inc. All rights reserved.


bypassing the host. However, just as in Phase I, the host
maintains one common set of filesystems. Phase II augments Phase I by providing additional connectivity for
T3D systems hosted by small Y-MP or C90 systems.

3 I/O Paradigms

Two I/O paradigms are offered for use in T3D applications, Private I/O and Global I/O. Private I/O is the
default and is available today. With Private I/O, each
processing element (PE) opens and accesses a file independently. That is, each PE establishes its own file connection and maintains its own file descriptor. Another way
to describe this paradigm is that the PEs behave like
unrelated processes on a Y-MP. The user provides any
necessary coordination of file access among the multiple
PEs. This I/O paradigm is analogous to the message
passing programming model. It is available to both C
and Fortran programs.
The Global I/O paradigm is currently under development. Global I/O allows multiple PEs to share access to
a single file in a cooperative fashion. Global I/O must
be selected by the user by inserting a directive prior to
the Fortran OPEN statement for which global I/O is
desired. In this paradigm, all PEs open a file cooperatively. The file connection and the library buffer are
global, shared resources. Each PE may access the file,
or not, independently, and the library coordinates the
access. Another way to describe this paradigm is that
the PEs behave like the cooperative processes in a multitasked program on a Y-MP. Global I/O is available to
Fortran programs following Cray Research MPP Fortran Programming Model rules.

4 Library Functionality

Cray Research supports a rich set of Fortran I/O library
functionality on CRAY Y-MP and CRAY C90 systems. All of this functionality is available or planned to
be available on the CRAY T3D.
Functionality that is available now includes support for
formatted and unformatted I/O, sequential and direct
access I/O, word addressable I/O, Asynchronous
Queued I/O (AQIO), and Flexible File I/O (FFIO).

5 Performance Considerations

All I/O requests initiated on the T3D are serviced on the
host. This results in a longer transaction latency for
T3D users than for users on the host. Thus, T3D users
should consider these three closely related I/O performance factors: optimum request size, latency hiding,
and load balancing.
On the T3D, users should minimize the number of I/O
system calls generated and make sufficiently large
requests to amortize the cost of each system call. The
assign command has several options that allow a user to
select a library buffer size and library buffering algorithm to help reduce the number of system calls. These
options specify FFIO buffering layers and require no
changes to the application source code. Some of the
more notable FFIO buffering layers are
• assign -F mr (memory resident file)
• assign -F bufa (asynchronous buffering)
• assign -F cache (memory cached I/O)
• assign -F cachea (asynchronous memory cached I/O)
• assign -F cos (asynchronous buffering, COS blocking)
Latency hiding is another important consideration. The
FFIO layers that use an asynchronous buffering algorithm are effective at hiding latency. The user could
also employ either AQIO or BUFFER IN/BUFFER OUT
to overlap I/O with computation.
Load balancing refers to balancing the number of PEs
doing I/O with the number of PEs doing computation.
The user must make sure that the compute PEs have
data delivered at a reasonable rate to avoid idle time.
With Global I/O, an application should scale up or
down with little user effort to create a good load balance.

6 Summary

This paper has described the fundamentals of Fortran
I/O on the CRAY T3D. The T3D I/O environment
takes advantage of the filesystem management and I/O
capabilities on the host. Users can mix and match the
Private I/O and Global I/O paradigms to best fit their
application. A powerful set of Fortran I/O library packages and tuning features is available on the T3D.
Features that will be available in the short term include
support for BUFFER IN/BUFFER OUT, implicit floating point format conversion, and Global I/O.
Support for namelist I/O will be available in the longer
term, in conjunction with support of the Fortran 90
compiler on the T3D.

User Services

USER CONTACT MEASUREMENT TOOLS AND TECHNIQUES
Ted Spitzmiller
Los Alamos National Laboratory
Los Alamos, New Mexico

Overview
Determining the effectiveness of user services is often
a subjective assessment based on vague perceptions
and feelings. Although somewhat more objective,
formal periodic surveys of the user community may
serve more to harden stereotypical attitudes than to
reveal areas where effective improvement is needed.
Surveys tend to bring out perceptions of the most
recent interaction or long-standing horror stories, rather
than long-term performance. This paper examines how
statistical data, surveys, and user performance can be
used to determine areas which need improvement.
The user services function is often the focal point for
several aspects of a computing operation:
• It provides information on how to use the various
computing resources through personal interaction
by phone or E-mail. (Documentation and
training may also be a part of user services.)
• It provides feedback as to the efficiency of
computing resources and the effectiveness of the
documentation and training for these resources.
How well are we serving the user community? We
occasionally ask ourselves this question, and
management will sometimes wonder as well, should an
irate user complain.
Perhaps equally important is the ability of user contact
measurement tools to effectively assess trends in the
use of computing resources. A sudden increase in
queries relating to a specific resource can be an
important indicator.
Has the user community
discovered the existence of a particular utility or
feature? Is the documentation inadequate? Is training
needed? These are questions that a simple set of tools
can help answer.
This paper will limit the scope of user services analysis
to that portion which deals directly with the
customer. This function may be referred to as the
"Help Desk" or "Consulting".

Evaluation Methods and Criteria
Several criteria may be applied to evaluate the
performance of user services:
• How well does the staff handle queries
(effectiveness)?

• Are responses to queries timely and accurate
(efficiency)?
• Are user concerns accurately and effectively fed
back into the organization (closing the loop)?
• Are future user needs being anticipated?
Several methods may be used to evaluate these aspects
of the user services operation.
• A formal survey
• An informal survey
• A statistical database
• The ambient complaint rate
All these methods have a place in the evaluation
process, and relying on any one method can lead to a
distorted view of the effectiveness of the operation.
Any survey, whether formal or informal, involves
varying degrees of subjectivity. Perhaps the greatest
constraint in using a survey is that the responses have
to be carefully weighed against the demographics of
the respondents. Without understanding the context in
which responses are submitted, evaluation can be
virtually meaningless or, even worse, misinterpreted.
Evaluating demographic profiles can be as time-consuming a task as the survey itself.
Small, informal, and personalized surveys may elicit
the most accurate responses. These are typically one-on-one interviews in which the demographics can be
determined as a function of the interview. The
interviewer can determine the user's knowledge level
and degree of interaction with the computing facility
and the user services operation. Probing questions
based on previous responses may be asked. There are
several caveats to this method.
• The interviewer must be skilled with interviewing
techniques and must understand the customer
service environment as well as the various
computing resources available to the respondent.
• Only a limited number of users can be contacted
in this manner; thus, selection can be the critical
factor. Two dozen "selected" users could make
the computing facilities look "bad" or "good".


Formal "broadcast" surveys are often distributed to a
large percentage of the user base and may be biased by
human nature. If individuals are satisfied with the
computing facilities, they will often not bother to
respond to the survey. Those who have an "axe to
grind" will almost always respond and be quite vocal.
Statistical data from some form of "call-tracking"
system should be added to the subjective answers
provided by a survey. In its most elementary form, a
call tracking system is simply a log of all calls
received by the help desk. The following information
is considered minimal to provide useful analysis.
• The total number of calls handled per unit time (a
business day is a good unit of measure).
• The type of resource about which information is
needed. This can be a list of supported resources
such as operating systems, network functions (E-mail, ftp, etc.), mass storage, or graphics
output facilities.
• The resolution time for the query.
• The need to refer the query to another person for
additional expertise.
• The organization to which the person belongs.
Several systems are available to automate this process.
These systems typically provide additional levels of
sophistication to include report generation and
statistical analysis.
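The minimal log described above can be sketched as a record with the five suggested fields plus a per-day aggregation; the field and record names here are illustrative and do not correspond to any particular call-tracking product.

```python
from dataclasses import dataclass

@dataclass
class CallRecord:
    day: str           # business day the call was received, e.g. "1994-03-14"
    resource: str      # supported resource area, e.g. "UNICOS" or "mass storage"
    minutes: int       # resolution time for the query, in minutes
    referred: bool     # whether the query was referred for additional expertise
    organization: str  # organization to which the caller belongs

def calls_per_day(log):
    """Total number of calls handled per business day."""
    counts = {}
    for rec in log:
        counts[rec.day] = counts.get(rec.day, 0) + 1
    return counts

# Invented sample entries showing how the log is used.
log = [
    CallRecord("1994-03-14", "UNICOS", 5, False, "T-Division"),
    CallRecord("1994-03-14", "Network", 12, True, "C-Division"),
    CallRecord("1994-03-15", "UNICOS", 3, False, "T-Division"),
]
print(calls_per_day(log))  # {'1994-03-14': 2, '1994-03-15': 1}
```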

Performance Metrics
Assuming that a valid survey can be constructed and
administered, what are some of the key responses that
will provide strong indicators of areas that may need
improvement? Responses may be categorized in the
following areas of user satisfaction:
• Are problems handled effectively by the
consulting staff?
- Are they polite, helpful, and friendly?
- Do they exhibit knowledge of most resources?
- Are problems diagnosed with minimal probing?
- Are calls responsibly referred to more appropriate individuals?
- Do they follow up and effectively 'close' calls?
• Are the computing facilities adequate?
- Does the user need some capability that is not
currently being provided?
- Are the computing resources easy to use?


• Is the information provided adequate?
- Are problems consistently resolved with the
initial information provided?
- Is a documentation reference given to the user
for future needs?
- Is documentation available which covers the
information needed by the query?
• Is there an effective feedback mechanism to the
development and engineering people to permit
enhancements based on observed user performance?
This paper reviews some data retrieved from the Los
Alamos National Laboratory user services team to
illustrate the use of survey and statistical data as well
as user performance.

Analysis of User Services
Example One
The following user responses are often the focal point
for the argument to reorganize the operation around
phone numbers of specialists instead of a single
generic number.
• User "A" says "Consulting by phone has proven
very ineffective - the consultant cannot easily
visualize my problem, and not being extremely
computer knowledgeable, I can't communicate
the problem. Usually, I ask one of my colleagues
for advice. I suggest setting up consultants to
deal with specific levels of computer expertise."
• User "B" says "It is very difficult to get detailed
consulting information over the phone. It would
be nice to have a list of the consultants along
with their areas of expertise so as to narrow the
'turkey-shoot' somewhat."
These appear to be valid perceptions and, based on
comments like these, some organizations have tried
alternative arrangements. But before making a
decision to change, let's look at what the statistical data
can add.
Consider that of 62 queries per day, 52% are
concluded within 6 minutes and 85% within 12 minutes.
These numbers indicate that most queries need
relatively simple explanations. Note also that the
typical consultant refers less than 23% of queries to
others within the help desk team and less than 8% to
"outside" expertise. This data doesn't support a
"turkey-shoot" assessment.
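Percentages like these fall directly out of a call-tracking log. The sketch below computes resolution-time and referral fractions from a list of (resolution-minutes, referral-destination) pairs; the sample data is invented, not Los Alamos figures.

```python
# Each entry: (resolution time in minutes, where the query was referred:
# None, "team" for another help-desk member, or "outside" for outside expertise).
queries = [
    (3, None), (5, None), (4, "team"), (10, None),
    (12, "team"), (8, None), (30, "outside"), (6, None),
]

n = len(queries)
within_6 = sum(1 for mins, _ in queries if mins <= 6) / n
within_12 = sum(1 for mins, _ in queries if mins <= 12) / n
team_refs = sum(1 for _, ref in queries if ref == "team") / n
outside_refs = sum(1 for _, ref in queries if ref == "outside") / n

print(f"{within_6:.0%} within 6 min, {within_12:.0%} within 12 min, "
      f"{team_refs:.0%} referred in-team, {outside_refs:.0%} outside")
```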

Demographics of user "A": A physicist who is
sensitive about his lack of computer literacy. He feels
uneasy about bothering a "high powered" consultant
with his trivial questions.
Demographics of user "B": A sophisticated user whose
queries require a higher level of expertise; thus,
almost every time he calls he gets referred to "another
person."
Now let's add comments from users "C", "D", and "E":
"Talking to John Jones [consultant] is like going to the
dentist - I'd rather not"; "John Jones is arrogant and not
user friendly"; "Some [consultants] are extremely
responsive, courteous, and knowledgeable, while others
are terrible."
Action item: Clearly there is a problem. John Jones was a
very knowledgeable consultant but did not relate well
to the user community, especially the neophyte. He
was eventually moved to another area.
Having people with the right mix of technical and
social skills is an important aspect of user services.
Example Two
User "F" reports that "There is sometimes a long wait
for consulting services".
User "G" reports "The consulting phone is always
busy".
The statistical data indicates that these users are
frequent callers (F had 22 queries and G had 14
queries in a two-month period).
Action Item: Install voice mail on the consulting
phone numbers to handle the overload and avoid the
"busy" signal.
Example Three
The call-tracking log (or database) can be used to
determine specific activity areas. Knowing which
resources are requiring significant assistance can assist
in pinpointing poorly designed resources or the need
for improved documentation or training. The
following table represents the percent of queries in various
major resource areas for a recent two-month period:

Resource                    Percent of Queries
UNICOS                             24
ALL-IN-1 Mail                      18
Network                            12
SMTP/UNIX Mail                      8
Registration/Validation             7
PAGES/Graphics                      7
Workstations                        6
Other                               6
Common File System                  5
VMS                                 5
ADP/Admin                           5
UNIX/ULTRIX                         4

An analysis of the highest activity category, UNICOS,
reveals the top 10 query areas for the past three years.
Query Area                  Percent of UNICOS Queries
                            1991    1992    1993
PROD (job scheduler)          8       6       8
VI (editor)                   4       9       6
Fortran                       5       7       7
CLAMS (math software)         8       7       7
FRED (editor)                 3       4       5
Login/logout                  3       9       5
Environment (.cshrc etc.)     2       9       5
/usr/tmp                      3       3       4
CFT/CFT77                     5       7       4
FCL                           5       3       3

Looking more closely at the UNICOS queries reveals
that many users attempt to use UNICOS without any
formal training and without referring to available
documentation. This phenomenon, which may be
termed pseudo-similarity, is not unique to UNICOS.
Most computer users who have had some experience
on one system will attempt to use another operating
system with the assumption that the similarities and
intuitive nature of the beast will permit an acceptable
degree of progress. Although there is an element of
truth in this presumption, it causes the consulting effort
to be loaded with elementary level questions.
An example of the effect of the pseudo-similarity
syndrome can be seen by analyzing the queries related
to the vi editor. Less than 10% of vi queries actually
relate to the use of the vi editor itself. Typical
symptoms of vi problems relate to the failure of vi to
properly initialize as a result of one of the following:


• Not correctly defining the appropriate TERM
value.
• Using a "non-default" window size (other than 80
X 35) without resize.
• Not establishing the -tabs parameter (stty).
• Using an incorrect version or option with the Los
Alamos CONNECT utility.
Pseudo-similarity type problems are good indicators of
poor design or implementation. In this case the user
has four chances to fail. The real problem is most
likely the UNICOS environment, which the user failed
to properly establish in the .login or .cshrc files.
This example also points to the possible ambiguity of
logging queries into a database. One consultant may
log the problem as vi, while another may define it as
environment. This lack of a "common view" can
spawn widely divergent descriptors for similar
problems, making analysis difficult.
The origins of the pseudo-similarity problem appear to
lie with the traditional approach to preparing new users
or introducing new resources into an established
environment. The computer is a tool and, like any
tool, the user wants it to be easy to use. They don't
want to spend time learning how to use it. Historically,
computer documentation and training have been so
poorly implemented that many users will simply apply
what they do know and forge ahead.
When Los Alamos users were asked where they
obtained most of their current knowledge, 53% cited
co-workers and "trial and error". Only 39% said they
would most likely refer to documentation for new
information. Online "complete text" documentation
was ranked last in importance by the user community.
Example Four
A review of the database shows that the category
OTHER has shown a dramatic increase over the past
six months.
Year                 Total OTHER Queries    Charge Code Queries
1991                         754                    407
1992                         958                    560
1993 (first half)            755                    620

Closer examination revealed that the majority of
OTHER queries were for assistance in using or
understanding the charge codes associated with
computing. The increase was directly attributed to
tight budgets which have caused many departments to
look more closely at their computing expenses. The
financial analysts lacked adequate training in the use of
utilities to retrieve available data and, therefore,
required heavy phone consultations.
Action Item: A training class was provided for the
FIN representatives in the use of the COST accounting
program. A separate category for CHARGES was
established in the tracking log to make analysis of this
problem easier.

Summary
The most recent survey of Los Alamos computer users
revealed that the Consulting Team ranked fourth out of
52 computing resources for overall satisfaction and
tenth for importance. This is a strong indicator that
personalized user services are needed and valued in a
large and diverse environment. Consulting activity has
likewise grown over the past four years from 40 to 62
queries per day.
Los Alamos users are relying more on the Consulting
Team and less on documentation for their
informational needs. This trend is troubling in that it
is a strong indicator of:
• poor quality documentation available to users
(specifically man pages) and
• poor user interface design.
With respect to improved user interface design,
software development should examine the call-tracking
database when making changes to an existing product
(or when starting a new product) to review the problem
areas of similar resources.
It is imperative that at least a modest effort be made
periodically to evaluate the effectiveness of the
computing resources and the user services activities in
particular. Simple tools and techniques can make this
task relatively easy and cost effective.

Balancing Services to Meet the Needs of a Diverse User Base
Kathryn Gates
Mississippi Center for Supercomputing Research
Oxford, MS

Abstract
The Mississippi Center for Supercomputing Research (MCSR) is in the unique position of providing high performance computing support to faculty and professional staff, as well as students, at the eight comprehensive institutions of higher learning in
Mississippi. Addressing the needs of both students and veteran researchers calls for a varied approach to user education.
MCSR utilizes on-line services and makes available technical guides and newsletters, seminars, and personal consultation. In
spite of limited funding, MCSR maintains a high level of service through carefully considered policy designed to make the most
efficient and proper use of computer and network resources.

Overview
Established in 1987, the Mississippi Center for Supercomputing Research (MCSR) provides high performance computing support to the faculty, professional staff, and students
at the eight state-supported senior institutions of higher learning in Mississippi. The center originated with the gift of a
Cyber 205 supercomputer system from Standard Oil to the
University of Mississippi and is located on the UM campus in
Oxford. A single processor Cray X-MP system was added in
January 1991, and its capacity was doubled in April 1992 with
the acquisition of a second processor. MCSR currently supports a Cray Y-MP8D/364 system, a front-end Sun workstation, an SGI Challenge L Series workstation and visualization
equipment located at the eight IHL universities and the UM
Medical Center.
MCSR was among the first sites to operate a Cray Research
supercomputer system without permanent, on-site Cray Research support. The MCSR systems staff consists of two analysts and a student intern, and the user services staff consists
of an administrative director, three consultants, and a student
intern. Together, these groups evaluate user needs, plan operating system configurations, install and maintain operating
systems and software applications, define user policies, and
provide all other user support. Other groups provide operations, hardware, and network support.

A Diverse User Base
As with many supercomputing centers, the MCSR user base
includes university faculty, professional staff, and graduate
student researchers. The MCSR user base also includes undergraduates and a growing number of high school students
and teachers. With major high performance computing centers at the John C. Stennis Space Center in Bay St. Louis and

the U.S. Army Corps of Engineers Waterways Experiment
Station in Vicksburg, an important MCSR function is to aid in
preparing students for careers utilizing supercomputing capabilities. At the close of Fiscal Year 1993, the MCSR user
base included about 350 high school and undergraduate students (referred to as instructional users) and 150 graduate students, professional staff, and faculty members (referred to as
researchers).
The experience level among users ranges from beginners who
have some proficiency in using personal computers to researchers who are active on systems at national supercomputing centers. The jobs running on MCSR computing platforms represent many disciplines including computational chemistry, computational fluid dynamics, computer science, economics, electrical engineering, mechanical engineering, forestry, mathematics, ocean and atmospheric science, and physics.

A Geographically Dispersed User Base
Some users are located on the University of Mississippi campus and access MCSR machines through the campus network
or through a local dial-in line, but the majority of users are
located at other institutions in Mississippi. All but three of
the universities supported by MCSR are connected to the Internet network through the Mississippi Higher Education Research Network (MISNET) as shown in Figure 1. The most
active users are located at Jackson State University, Mississippi State University, the University of Mississippi, and the
University of Southern Mississippi as shown in Figure 2.
The level of local network development varies from institution to institution. The Mississippi State University campus
network is quite developed while several smaller universities
only support one or two networked labs. A handful of Mississippi high schools have Internet access, typically through a


Note: Both conventional and/or SLIP dialup connections are available on most
campuses allowing MISNET/Internet connections for authorized users.

Figure 1. The Mississippi Higher Education Research Network (MISNET)


SLIP connection to a nearby university. The lack of convenient network access is a primary hindrance for many MCSR
users.

Problem Statement
Given these constraints - a user base that is diverse in terms of
experience level and one that includes both students and veteran researchers (with the students greatly outnumbering the
researchers), users with and without network access distributed across the state, and limited staffing and funding levels -
what is the best means for providing a needed, valuable service that makes effective use of available resources? More
specifically,
(1) What techniques can be used to manage this user base so
that students are accommodated but not allowed to interfere
with production work? and
(2) What sorts of services can be reasonably provided to best
meet the needs of the user base without overextending the
staff or budget?
This paper describes the methods that have been used by
MCSR staff to address these two issues.

Managing the User Base
Carefully considered policies designed to make the most efficient and proper use of computer and network resources are
employed in managing the user base. First, Cray X-MP and
Cray Y-MP computing resources are allocated so that researchers are favored over instructional users. This hierarchy of users is carried out through the application of operating system
utilities. Second, expectations are clarified with the purpose
of guiding users, particularly students, toward a successful
path as they utilize MCSR systems.

Establishing a User Hierarchy
Without imposing some sort of hierarchy on accounts, instructional users would very quickly monopolize system resources
due to their sheer numbers. Typically, these users need relatively small amounts of CPU time, memory, and disk space.
Moreover, instructional accounts exist for brief, well-defined
periods, normally until a semester closes or until the user graduates. A hierarchy is imposed by defining four categories of
users:
Funded Research - Faculty, staff, and graduate students conducting research associated with one or more funded grants.
Pending Funded Research - Faculty, staff, and graduate students
conducting research associated with one or more pending
grants.
Research - Graduate students doing thesis or dissertation
work, faculty and staff conducting open-ended research.
Instructional - High school students, undergraduates, class
accounts, faculty and staff who are using MCSR systems for
instructional rather than research purposes.
Users are, by default, assigned to the Instructional category.
Faculty and staff members apply annually to be placed in the
Research, Pending Funded Research, and Funded Research
categories.

Implementing the Hierarchy through UNICOS Tools

Category assignment affects users in several ways, including
through disk space quotas, total CPU time allocated, and relative "share" of the system, with the most stringent limits being placed on instructional users. A priority system is established by employing various UNICOS tools including the Fair
Share Scheduler, the Network Queuing System (NQS), user
database (udb) limits, and disk space quotas.

Figure 2. Cray X-MP CPU distribution by institution: July 1992 - December 1993

Resource groups corresponding to the four categories of users are defined through the Fair Share Scheduler. Figure 3
shows how CPU time is distributed among the resource groups
on the Cray Y-MP system. The Fair Share Scheduler has
been effective in controlling how CPU time is allocated among
users; it is the primary means for insuring that instructional
users do not disrupt production work.
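The effect of the Fair Share Scheduler can be sketched as proportional allocation: each resource group is entitled to CPU time in proportion to its assigned shares, regardless of how many jobs it submits. The share values below are illustrative only, not MCSR's actual configuration.

```python
def cpu_fractions(shares):
    """Fraction of the machine each resource group is entitled to."""
    total = sum(shares.values())
    return {group: s / total for group, s in shares.items()}

# Illustrative share assignments for the four user categories.
groups = {
    "Funded Research": 60,
    "Pending Funded Research": 20,
    "Research": 15,
    "Instructional": 5,
}

for group, frac in sorted(cpu_fractions(groups).items(), key=lambda kv: -kv[1]):
    print(f"{group:24s} {frac:.0%}")
```

With shares like these, hundreds of instructional accounts collectively compete for one small slice of the machine rather than overwhelming it by sheer numbers.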


Figure 3. Cray Y-MP Fair Share resource groups

A separate NQS queue for instructional users is another means
for insuring that these users do not disrupt production work.
Without separate queues, instructional jobs can prevent research-related jobs from moving to the running state within a
queue, frustrating any attempts by the Fair Share Scheduler to
grant preference to research-related jobs.
User database limits play a role in managing accounts. Instructional users are limited to five hours of Cray Y-MP CPU
time for the lifetime of the account. This limit is enforced
through the udb cpuquota field. Interactive process CPU,
memory, and file limits are applied to all users to encourage
the execution of jobs under NQS. Each semester, as students
in C programming classes learn about fork and exec system
calls, the interactive and batch job process limits perform a
particularly useful function in controlling "runaway" programs.
UNICOS disk quotas are enforced on all user accounts to conserve much sought-after disk space.
The Fair Share Scheduler, NQS, udb limits, and disk quotas
permit control over accounts that is typically not available on
UNIX-based systems. Once configured, these tools provide
an orderly, automated method for directing resource allocation, with reasonable administrative overhead. Instructional
accounts are easily accommodated and, at the same time, are
prevented from causing a degradation in system performance.

Guiding Student Users Toward Appropriate Activities
With each passing semester, improper and unauthorized student use has become more of a problem. Unacceptable activities have ranged from students "snooping" in professors'
accounts (the students, in some cases, are far more adept at
maneuvering through UNIX than are their professors) to breaking into systems - fortunately, not MCSR Cray Research systems - by exploiting obscure operating system holes. Other
activities have included those that are not particularly malicious but certainly unsuitable for a research facility where
computing resources are limited and very much in demand.
An example might be students playing computer games across
the network, using network bandwidth, valuable CPU time,
and memory.


The cost has been highest in terms of the staff time and effort
required to identify unacceptable activities and to pursue the
responsible persons. Moreover, this situation conflicts with
the original purpose for giving students access to
supercomputing systems. Only a few students have caused
significant problems, yet many students seem to lack an appreciation for the seriousness of computer crime and an awareness of computer ethics.
An "Appropriate Use Policy" was introduced to address these
problems. The policy explicitly defines appropriate and inappropriate use of MCSR computing systems. Users read
and sign a copy of the policy when they apply for an account
on MCSR computer systems. Individuals found to be in violation of the policy lose access to MCSR systems. Portions of
the Mississippi Code regarding computer crime are printed
periodically in the MCSR technical newsletter. Greater emphasis has been placed on UNIX file and directory permissions and on guidelines for choosing secure passwords in
MCSR documentation. Password aging is used on all MCSR
systems, requiring users to change their passwords every three
months.
Since taking these steps, there have been fewer incidents of
students misusing MCSR computing facilities. In the future,
it may be necessary to utilize part or all of the UNICOS Multilevel Security system (MLS) to create a more secure environment for researchers. For now, these strategies - controlling resource allocation and educating users about computer
ethics and account security - are sufficient.

Understanding User Needs
The Mississippi Supercomputer Users Advisory Group provides MCSR with program and policy advice to ensure a productive research environment for all state users. The advisory
group consists of faculty and staff representing each university served by MCSR. MCSR staff members work with representatives to develop policies governing the allocation of
supercomputer resources and to define and prioritize user
needs. This relationship is an important means for understanding user needs and evaluating existing services.

Serving Users
A varied approach to user education is required to adequately
address the needs of all users. Given that individuals learn in
different ways, it is advantageous to offer information in multiple formats. Given the disparity in experience levels among
users, it is necessary to provide assistance covering a wide range
of topics. MCSR offers many services to its users (summarized in Table 1), including on-line aids, technical guides and
newsletters, seminars, and personal consultation. The services offered by MCSR complement on-line vendor documentation (e.g., the UNICOS docview facility) and tutorials.

Publications

Manuals and technical newsletters are fundamental to educating users in making best use of available resources. MCSR
offers a user's guide for each major system or category of
systems that it supports. These guides provide the site-specific information needed to work on the system. They highlight operating system features and tools, provide many examples, and direct the user to more detailed and comprehensive on-line documentation.
The MCSR document, The Rookie's Guide to UNIX, targets
the subset of users who have little or no prior experience with
the UNIX operating system. It gives beginning-level UNIX
training in a tutorial style. Topics covered include logging in,
getting help, file system basics, the vi editor, basic commands,
communicating with other users, networking tools, and shell
programming. This guide is essential due to the large number
of novice users.

Figure 5. Top level menu for the MCSR gopher server

A separate guide, MCSR Services, gives an overview of the
services provided by MCSR. It reviews the MCSR computing and network facilities, lists contacts, describes available
publications and programs, and provides account application
information. The MCSR technical newsletter, The Super Bullet, provides timely supercomputing news and tips, features
noteworthy operating system utilities, gives answers to frequently asked questions, and reports usage statistics.

On-line Services

On-line services play an increasingly important role in educating users. Through on-line services, infonnation is available immediately, and users may retrieve it at their convenience. The active role shifts from the consultant to the user,
freeing the consultant to handle more complex questions. Online services work well in the MCSR environment, because
they are universally available to all who have access to the
computer systems, and they aid in extending a limited consulting staff.
MCSR supports an anonymous ftp site that contains ASCII
and PostScript versions of MCSR manuals and account application forms, example files, and a frequently asked questions
(FAQ) file. This year, MCSR started a gopher server. The
server provides much of the information that is available
through MCSR Services, as well as other useful items such as
advisory group meeting minutes, access to on-line manuals,
and staff listings. Documentation efforts are maximized by
making text available in multiple formats, as shown in Figure 4.
When possible, MCSR employs on-line vendor tutorials such
as the IMSL "Interactive Documentation Facility" and the Cray
Research on-line course, "Introduction to UNICOS for Application Programmers."

Figure 4. Information flow and reuse


Table 1. Summary of Services offered by MCSR

Publications

  Rookie's Guide to UNIX
    Manual (hard copy & on-line); tutorial for beginning UNIX users.
    Audience: MCSR users who are new to UNIX.

  User's Guide to the MCSR [Cray Research, SGI Challenge L, Vislab] System(s)
    Manual (hard copy & on-line); provides site-specific information for the
    specified system, highlights operating system features and tools, lists
    applications, offers examples.
    Audience: MCSR users who have accounts on the specified system.

  MCSR Services
    Manual (hard copy & on-line); provides an overview of the computing
    platforms and services offered by MCSR.
    Audience: Current and potential MCSR users.

  Super Bullet
    Technical newsletter (hard copy & on-line); provides timely news and tips.
    Audience: All MCSR users.

  Availability: Hard copy by request, to UM users and to other sites;
    on-line versions are immediately available through anonymous ftp and
    gopher to all users who have network or dial-in access to MCSR systems.
  Advantages: Hard copy contains text and graphics images for an enhanced
    presentation, and some users prefer to read hard copy as opposed to
    on-screen guides; on-line versions are immediately available.
  Disadvantages: Hard copy is costly to produce and distribute and is
    difficult to keep up-to-date; on-line, users can peruse text only while
    in line mode, and documents are unavailable when the host machine is
    down.

Seminars

  Application Programming on the MCSR [Cray Research, SGI Challenge L,
  Vislab] System(s)
    Provides information about developing code, running programs, and
    accessing applications on the specified system.
    Audience: MCSR users who have accounts on the specified system.

  Beginning UNIX
    Assists beginning-level UNIX users.
    Audience: MCSR users who are new to UNIX.

  Intermediate UNIX
    Assists intermediate-level UNIX users.
    Audience: MCSR users who have mastered beginning UNIX topics.

  Cray FORTRAN Optimization Techniques
    Provides information about Cray FORTRAN extensions and optimization
    techniques.
    Audience: FORTRAN programmers working on Cray Research systems.

  Using the [Abaqus, Patran, MPGS, ...] Application
    Provides information about using the specified application.
    Audience: MCSR users who are working with the specified application.

  Availability: To UM users; to other sites by request.
  Advantages: Personal, interactive; can easily be customized to target
    specific needs.
  Disadvantages: Reaches only a small subset of users; costly to send an
    instructor to a remote site; poses scheduling problems.

Consultation

  In Person
    Provides "face-to-face" assistance covering a range of topics.
    Audience: All MCSR users. Availability: To users on the UM campus.
    Advantages: Fewest communication barriers; can focus directly on the
    problem. Disadvantages: Unavailable to off-campus users; restricted to
    MCSR office hours.

  By Phone
    Provides assistance over the telephone covering a range of topics.
    Audience: All MCSR users. Availability: To all MCSR users.
    Advantages: Can provide personal attention and focus directly on the
    problem. Disadvantages: Restricted to MCSR office hours; some
    communication barriers.

  Electronic Mail (assist)
    Provides assistance through electronic mail covering a range of topics.
    Audience: All MCSR users. Availability: To all MCSR users who have
    network or dial-in access. Advantages: Can focus directly on the
    problem, include session transcripts, and offer after-hour support.
    Disadvantages: Response is "asynchronous" and may be slightly delayed.

On-line Services

  MCSR gopher Server
    On-line, menu-driven tool for distributing information, including an
    overview of computing facilities and support, account application
    procedures, and technical manuals.
    Audience: Current and potential MCSR users. Availability: To all MCSR
    users who have network or dial-in access. Advantages: Content
    modifications are easily handled; immediately available.
    Disadvantages: Unavailable when the host machine is down.

  "Frequently Asked Questions" File
    On-line file containing answers to frequently asked questions, available
    through gopher and anonymous ftp.
    Audience: All MCSR users. Availability: To all MCSR users who have
    network or dial-in access. Advantages: Answers basic questions;
    immediately available. Disadvantages: Unavailable when the host machine
    is down.

Interactive Assistance
Personal consultation (in person and by phone) is provided
by the three MCSR consultants, each focusing on a particular
area. Often, the questions addressed in one-on-one sessions
are beyond the beginning level. Faculty, staff, and graduate
students are targeted with this service; however, individual
assistance is occasionally required for users at all levels.
An on-line consulting service, assist, is available to handle all
types of inquiries. Users send an electronic mail message to
assist on any of the MCSR platforms, and the message is forwarded to a consultant on call after a copy is saved in a database. If necessary, the consultant reassigns the request to another staff member who is better equipped to handle it. Questions submitted after normal working hours or during holidays are forwarded to an operator who decides if the question
needs immediate attention. Both instructional users and researchers make frequent use of assist.
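The assist triage flow just described (save a copy to a database, forward to the on-call consultant, reassign to a better-equipped colleague if needed, route after-hours mail to an operator) can be sketched as a toy routing function. All names, hours, and data shapes below are hypothetical illustrations, not MCSR's actual implementation:

```python
from datetime import datetime

OFFICE_HOURS = range(8, 17)  # assumed 8 a.m.-5 p.m. weekday window

def route_assist_message(message, db, consultants, operator, now=None):
    """Toy model of the MCSR 'assist' triage flow described above."""
    now = now or datetime.now()
    db.append(message)  # a copy is saved to a database first
    # mail arriving after normal working hours or on weekends/holidays
    # goes to an operator, who decides if it needs immediate attention
    if now.weekday() >= 5 or now.hour not in OFFICE_HOURS:
        return operator
    # otherwise the on-call consultant handles it, reassigning to a
    # colleague whose specialty matches the question when possible
    on_call = consultants[0]
    for c in consultants:
        if message["topic"] in c["specialties"]:
            return c
    return on_call
```

A message about FORTRAN sent on a weekday morning would thus land with a FORTRAN specialist, while the same message on a Sunday would go to the operator.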
MCSR offers a full set of technical seminars covering beginning, intermediate, and advanced topics. The seminar series
is repeated each semester on the UM campus and by request
at remote sites. Because seminars target only a small subset
of users, they are viewed as secondary to other means for disseminating information.

Other Special Programs
Frequently, MCSR is host to regional high school groups for
special sessions on high performance computing. Depending
on the audience, some sessions are introductory in nature, while
others are more technical. Students from the Mississippi

School for Mathematics and Science visit annually for a daylong program on supercomputing. In the following weeks,
they use MCSR facilities from their school to complete research projects. Often these users have little experience with
computers of any kind before working on MCSR systems.

Outlook
Lack of convenient network access remains a primary hindrance to using MCSR facilities for several smaller institutions and most high schools. MCSR consultants will continue to educate users at such sites concerning the benefits of
Internet access. An auxiliary role will be to facilitate the expansion of MISNET through additional sites, by making users aware of potential grants.
MCSR publications will remain at the same level with respect
to content and number of documents offered; however, efforts will be made to continue to enhance the presentation and
organization of publications. The seminar offering will remain at the same level, with the exception of a new series
covering visualization techniques.
The most significant changes will occur in the area of network-based and on-line services. As funding permits, suitable on-line vendor tutorials will be utilized. Consultants will
continue to expand the existing anonymous ftp site and gopher server. An upcoming project is to offer site-specific information through multimedia tools such as NCSA Mosaic.
MCSR consultants will capitalize on network-based and online services to target the majority of users with the widest
range of material.


APPLICATIONS OF MULTIMEDIA TECHNOLOGY
FOR USER TECHNICAL SUPPORT
Jeffery A. Kuehn
National Center for Atmospheric Research
Scientific Computing Division
Boulder, Colorado

Abstract
Supercomputing center technical support groups currently face problems reaching a growing and
geographically dispersed user population for purposes of training, providing assistance, and distributing time critical information on changes within the center. The purpose of this presentation
is to examine the potential applications of several multimedia tools, including Internet multicasts
and Mosaic, for technical support. The paper and talk will explore both the current and future
uses for these new tools. The hardware and software requirements and costs associated with
these projects will be addressed. In addition, the recent success of an NCAR pilot project will be
discussed in detail.


0 INTRODUCTION

Budget cuts, rising support costs, growing and
geographically dispersed user communities, and an
influx of less sophisticated users are driving user
support to a critical junction. In the face of this
adversity, how will we continue to provide high-quality
support to our user communities? Old
methods of one-on-one assistance and hardcopy
documentation have been supplemented with on-line
documentation, but the questions still remain: Where
does a new user start learning? Where do they find
the documentation? Where is the answer to this
problem to be found in the documentation? Classes
can be taught, but when the users are geographically
dispersed, the travel costs are prohibitive. One answer
to this dilemma may lie in the strategic application of
the latest wave of multimedia tools. These tools,
however, are not a panacea for all of the problems,
and in fact pose several new issues unique to their
realm. The Scientific Computing Division at the
National Center for Atmospheric Research (NCAR)
has examined several such tools, and this paper is an
attempt to summarize our experience with the two
which appear most promising for user support.

1 BACKGROUND

1.1 Internet Multicast
An Internet multicast is not really a single tool, but
rather a broadcast across the network by a
combination of tools used together in concert: a slow-scan
network video tool, a network audio tool, and a
network session tool. The individual audio and video
tools are interesting and can be used for a variety of
applications. The network audio tool allows you to
use your workstation like a speaker phone on a party
line, carrying meetings, conversations, and Radio Free
VAT (a network music program) between
workstations on the network. The network video tool
transmits and receives slow-scan video across the
network, allowing a user at one workstation to transmit
images to users at other workstations. Common uses
for the network video tool include transmitting a video
image of yourself sitting in front of your workstation
or transmitting an image of the view from your office
window. The former image lets everyone on the
network (including your supervisor) know that you are
in fact using your workstation (presumably for
something the company considers productive), while
the latter image is presumably presented as a service to
those of us on the network who have been denied an
office with a window. It is, however, the combination
of audio and video through a session tool from which
the multicast draws its name, and it is this capability
which is most interesting for application to user
support.
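The composition just described, in which independent audio and video streams are bound into one broadcast by a session tool and each receiver is free to subscribe to only some of them, can be modeled with a small sketch. The class and method names below are hypothetical; they do not reflect the actual interfaces of the sd, vat, or nv tools:

```python
from dataclasses import dataclass, field

@dataclass
class Stream:
    owner: str
    kind: str  # "audio" or "video"

@dataclass
class Session:
    """One multicast session carrying any number of audio/video streams."""
    name: str
    streams: list = field(default_factory=list)

    def announce(self, owner, kind):
        # a participant starts an audio or video tool that sends signals
        s = Stream(owner, kind)
        self.streams.append(s)
        return s

    def subscribe(self, *, kinds=("audio", "video"), owners=None):
        # a receiver picks which (and how many) streams to display or hear
        return [s for s in self.streams
                if s.kind in kinds and (owners is None or s.owner in owners)]

# Five users each send audio; only two send video worth watching.
sess = Session("cray-fortran-class")
for user in ["ann", "bob", "cho", "dee", "eli"]:
    sess.announce(user, "audio")
for user in ["ann", "bob"]:
    sess.announce(user, "video")

heard = sess.subscribe(kinds=("audio",))                 # all five audio streams
seen = sess.subscribe(kinds=("video",), owners={"ann", "bob"})
```

The point of the model is only that subscription is per-stream, not per-session: joining a session does not force a receiver to render every signal it carries.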

The network session tool does not add to the
capabilities of the network video and audio tools, but
rather serves as a user interface for combining the
video and audio tools into a single broadcast signal.
Thus, to receive a multicast, you would choose a
session using the session tool, which would initiate an
audio tool session and a video tool session, allowing
you to hear the audio portion of the broadcast and see
the video portion simultaneously. Transmitting a
session works in a similar fashion: the session tool is
used to select and set up a session and then start an
audio and video tool which will send signals. A single
session can carry multiple audio and video signals,
allowing a user tapping into the session to pick and
choose which and how many of the audio and video
links to display or listen to independently. For
instance, five users carrying on a session may decide
that each one wishes to hear the audio from all of the
others, but that only two of the others are transmitting
a video image that is worth seeing.

These tools are available for Sun SPARC, DEC, and
SGI architectures. The DEC and SGI kernels will
support multicast by default, but the Sun kernel will
require modification to the default kernel and a copy
of the multicast routing daemon. The support
personnel at NCAR strongly recommend that kernel
updates and modifications be tracked carefully on all
three platforms.

1.2 Mosaic
Mosaic is a tool developed at the National Center for
Supercomputing Applications (NCSA) for browsing
information on a network. The information can take
one of several forms: hypertext documents, other
documents, pictures, audio bites, and animation
sequences. The hypertext documents are the central
"glue" which links all of the information together.
These documents contain coded embedded references
to files existing somewhere on the worldwide network.
The reference coding instructs Mosaic as to the
method of retrieval and display of the selected file.
Thus, a user usually starts out with Mosaic from a
"home page" hypertext document which contains links
to other documents (hypertext, ASCII, PostScript),
image files, sound files, and animation files on the
network. Currently, a great wealth of information is
available, but it is organized mostly on a site-by-site
basis, and the author is unaware of any concerted
effort to provide a more network-wide structure at the
time of this writing.

2 NCAR PILOT PROJECT EXPERIENCES

2.1 Multicasting User Training Classes
NCAR's Scientific Computing Division has multicast
two of its four user training classes to date: the Cray
Fortran/C Optimization class and the NCAR Graphics
training class. In both cases, the motivation for
attempting a multicast came from one of our user sites
at which there were several individuals interested in
attending the class in question, but because of tight
budgets, funding for travel to Boulder could not be
obtained. The decision to try the multicast was made
as a compromise solution, allowing the students to
view the class and have two-way audio
communication with the instructor in Boulder, while
sparing the travel budgets. Class materials were
mailed to the remote students ahead of time, allowing
them to follow the materials with the people in
Boulder. The teaching materials provided in both
cases were an integrated set of lecture notes and slides
in a left-page/right-page format which provided the
information from the overhead slide on one page and a
textual discussion of the slide on the opposing page.

2.1.1 Software, Hardware, and Staff Requirements
There are several start-up costs associated with
producing multicast training classes. The software
required to produce a multicast is freely available via
anonymous FTP (see SOFTWARE for information
on how to obtain the software). The costs associated
with this software are mainly those related to the staff
time required to get it working correctly.
Additionally, each workstation will require a kernel
configured to handle multicasts, and the multicast
routing daemon must be run on exactly one host on
the local subnetwork segment.
The hardware costs are a different story. The largest
single piece of equipment required is a low-end
workstation dedicated to handling the transmission.
Additionally, low-end workstations are required for
the viewers. NCAR used a vanilla Sun SPARCstation
IPX for one class and an SGI Indy for the other as the
classroom transmission platform. The cost of such
machines runs between $5,000 and $10,000 depending


on the configuration. The systems that will be
transmitting video signals require a video capture
card, which costs between $500 and $1,000 if not
supplied with the workstation. A high-quality camera
(costing approximately $1,000) is highly
recommended for transmitting the images of the
instructor and their A/V material. An inexpensive
CCD video camera (approximately $100, and often
supplied with workstations today) can be used for the
students' images. Additionally, microphones will be
needed for both the instructor and the remote students.
The inexpensive microphones commonly supplied
with most workstations will suffice for the students,
but it is recommended that the instructor use a
high-quality wireless microphone (approximately $250).
Finally, regarding staff time, we found that setting up
and testing the multicast link required about four
hours with at least one person on each end of the
connection. During the class and in addition to the
instructor, a camera person is required. This person
handles the camera work for the instructor and must
be competent in setting up and managing the link
which may occasionally need to be restarted. Given
this, there is little economic incentive to set up a
multicast for only one or two remote students;
however, other issues may outweigh the economic
factors. This overhead could obviously be figured into
any cost associated with the class for the remote
students.

2.1.2 Results
Overall, both of NCAR's multicast classes were
highly successful. The remote students claimed that
the multicast was "almost as good as being there",
while the local students reported that the multicast did
not noticeably disrupt their learning process. The
course evaluation ratings from both the local and the
remote groups were comparable. Later interactions
with local and remote students of the optimization
class suggested that both groups had acquired a
similar grasp of the material. In all respects, the
multicast worked well in producing a learning
environment for remote students similar to that
available to those in the local classroom.
All of this success did not come without a price. The
multicast caused noticeable and sometimes maddening
slowdowns on the computing division's network. The
available network bandwidth allowed transmission of
a clear audio signal, and the video signal was able to
send between one-half frame per second and five
frames per second at the cost of a heavy impact on
other network traffic. This imposed restrictions on
both the instructor and the camera operator, which
required changes in presentation and camera
technique.
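To get a rough sense of why half to five frames per second still hurt, a back-of-envelope bandwidth estimate can be worked out. The per-frame size below is an assumption chosen for illustration; the paper reports only the observed frame-rate range, not frame sizes or measured throughput:

```python
# Back-of-envelope slow-scan video bandwidth estimate.
# COMPRESSED_FRAME_BYTES is an assumed size for one compressed
# slow-scan frame, not a figure measured at NCAR.
COMPRESSED_FRAME_BYTES = 8_000

def video_kbps(frames_per_second, frame_bytes=COMPRESSED_FRAME_BYTES):
    """Video bandwidth in kilobits per second for a given frame rate."""
    return frames_per_second * frame_bytes * 8 / 1000

low = video_kbps(0.5)   # half a frame per second
high = video_kbps(5)    # five frames per second
```

Under that assumption the video stream alone spans roughly 32 to 320 kbit/s, a substantial share of a shared-Ethernet segment of the era, which is consistent with the heavy impact on other traffic reported above.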
The instructor needed to refrain from gesturing and
shift their style more towards "posing". Specifically,
because of the low frame rate, one could not move
continuously, as this would have two effects: motions
would appear disjointed and nonsensical and the
network traffic would climb intolerably. Thus, a more
effective technique would have the instructor point to
material on the projection screen, and hold that
"pointing pose" while describing the material. Pacing
and rapid swapping of slides were two other areas
which caused problems, and both need to be avoided.
It turns out that it is actually rather difficult to alter
such gestures made in an unconscious fashion, and
that some instructors have much more difficulty
making this adjustment than others. When the
instructor breaks for questions, they need to be
conscientious about repeating local questions so that
they may be heard by the remote students. Also, as it
may have been necessary to mute the remote
microphones to reduce feedback on the audio link, the
instructor should make a point of allocating time for
the remote students to pose questions.
Similar adjustments were required of the camera
operator. Specifically, slow panning and zooming of
the camera came across very poorly on the slow-scan
video link and swamped the network in the process.
Thus, the camera person was better off making a fast
pan or zoom adjustment, and then leaving the camera
stationary on the new subject. However, much of this
depends on the instructor and cannot be controlled by
the camera person. If possible, it is recommended that
the instructor and camera operator collaborate on the
presentation.
The last of the problems was related to the stability of
the links. It was possible that either the audio (most
likely) or the video link might fail at some point
during the presentation or that the audio signal might
break up badly enough that it was unintelligible. The
SGI platform had more trouble keeping the link up
and running because of software problems. The Sun
platform had more difficulty with audio break-up
because of the slower CPU. While neither platform
was perfect, both served adequately.

2.2 Online Documentation with Mosaic
Since the release of NCSA's Mosaic, sites worldwide
have begun displaying their interest by developing the
hypertext documents which form the links on which
Mosaic's access to network information thrives.
NCAR has developed a great deal of information
regarding its mission and the part played in that
mission by each of the NCAR divisions. Additionally,
portions of the institutional plans for the future are
available. The information can be referenced from
any workstation (running X-Windows), PC (running
Microsoft's Windows software), or Macintosh which
is connected to the Internet and runs the Mosaic
software.

2.2.1 Software, Hardware, and Staff Requirements
NCSA's license agreement for Mosaic is very
generous and, as mentioned above, it will run on a
wide variety of platforms provided that a network
connection is available (see SOFTWARE for
information on obtaining Mosaic). Thus, the primary
costs involved in using Mosaic are for the support
staff time. The Scientific Computing Division at
NCAR has currently dedicated one full time person to
developing Mosaic material. Additionally, others are
involved in projects to move current documentation
over to such online formats.

2.2.2 Results
This project is a work in progress, but the headway
made over the last few months is very impressive.
Already we have a great deal of material on NCAR in
general, and the computing division has a very
detailed description of the work done by each of its
individual groups. The documents contain static
images, movies, music, speeches, and many other bits
of information to flesh out the work we do at NCAR.
Other divisions are beginning to follow suit and
produce material describing their own functions.
Individuals are also "getting into the act," and the
author himself has just recently learned HyperText
Markup Language (HTML) and written his own
home document with hypertext links to information on
the network which he accesses frequently. Much of
the information developed over the last few months
has been of an advertising or overview nature to
demonstrate the capabilities of Mosaic, but we are
now beginning to move toward making more technical

information available. It should be noted, however,
that as a site moves toward becoming a hub for
Mosaic users, it would be wise to move the Mosaic
information services onto their own machine to reduce
the impact of network activity on internal functions.
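A home document of the sort the author describes is, at its simplest, a short HTML file of anchors. The sketch below is a period-style illustration with made-up link targets (only the ftp.ee.lbl.gov path appears in this paper), not the author's actual page:

```html
<title>My Home Document</title>
<h1>Frequently Used Links</h1>
<p>Pointers to information I access often:</p>
<ul>
<li><a href="http://www.ncsa.uiuc.edu/">NCSA, home of Mosaic</a>
<li><a href="gopher://gopher.example.edu/">A gopher server (hypothetical)</a>
<li><a href="ftp://ftp.ee.lbl.gov/conferencing/">LBL conferencing tools</a>
</ul>
```

Each anchor's URL tells Mosaic both where the file lives and how to retrieve it (HTTP, gopher, or FTP), which is the "reference coding" described in Section 1.2.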

3 FUTURE DIRECTIONS

3.1 Multicasts
There are several possible future uses for multicasts in
user support at NCAR. Training via multicast has
proven so successful that the author hopes to see it
extended to the other classes taught by the computing
division, and possibly to classes taught by other
divisions within our organization. However, the
possibilities do not end there.
Many advisory committees and panels for the
computing division must operate without direct input
from that portion of our user base which is not local to
the Boulder area. Frequently, remote users must be
represented by a member of our staff who has been
designated as the advocate for a particular group.
While this has worked well in the past, it may soon be
possible to allow these groups a more direct method of
representation via multicast.
The computing division's consulting office currently
operates in a help-desk mode. Whoever is on duty
answers all of the walk-in, telephone, and electronic
mail questions coming into the consulting office. The
others are handling either special appointments made
by local users or catching up from their last turn on the
duty shift. Multicasts offer two possibilities here.
First, a multicast link could serve as an additional
medium for handling user questions, over and above
the standard walk-in, phone, and e-mail. Secondly, it
could serve to extend the capabilities of special
appointments to remote users.

3.2 Mosaic
The future of Mosaic at NCAR can be expected to
cover the two possibilities in user documentation: that
which is relatively static and that which is more
dynamic in nature. Static documentation might
include things such as documents describing site-specific
features and commands, descriptions of how
to use a service, and frequently asked question lists.
The dynamic documentation, however, will be much


more interesting and include items such as our daily
bulletin and change control system.
The daily bulletin is published every morning before 9
A.M. It describes the current state of the computing
facility and includes short term announcements of
upcoming changes. In addition to our current scheme
of a local command which will page the current
bulletin on any terminal, it would be relatively
straightforward to make the information available via
Mosaic as well. Furthermore, our current text format
could eventually be enhanced to include audio reports
of status and changes, video clips of significant events
(such as the installation of new machines), and
graphic images relevant to the daily announcements.
The change control system at NCAR is currently set
up via e-mail notices with a browser based on the day
the message was sent. The system works, but
Mosaic's capability for including links presents a
better model for representing the interconnected
nature of the computing facility and how changes in
one area may affect another. The advantage to the
current system is simplicity; a programmer making a
change fills out a template describing the change and
the effective date, then mails the completed template
to the mailing list. To make any real improvement
over this simplicity and take advantage of Mosaic's
hypertext capabilities will require a system
significantly more sophisticated than the one currently
in place.

4 SUMMARY
The tools discussed in this paper provide a cost-effective
answer to many of the questions posed in the
introduction. Though users are geographically
dispersed, they usually have some level of network
access. Thus, the multimedia tools based on the
network can support their needs regardless of their
location. Because multicasts can be used to provide
training without the associated travel costs, budgets
can be saved for other critical needs. Because Mosaic
provides entry points into web-like information
structures, the question of the starting point becomes
less important. Users can start at a point covering
their immediate needs and branch out to other areas of
interest as time permits. Furthermore, both multicast
and Mosaic avail themselves to the possibilities of
stretching our current capabilities beyond our facilities
by exploiting the conferencing capabilities of
multicasts and the audio/video capabilities of Mosaic.
However, it should be remembered that the tools are
However, it should be remembered that the tools are


heavily dependent on the accessibility of high-bandwidth
networks, though our infrastructure is
quickly growing to fulfill this need. The final word
becomes a question of support. What difference exists
between supercomputing centers today other than
differences in support?

5 SOFTWARE
1. SD: The network session directory tool was
written by Van Jacobson at Lawrence Berkeley
Laboratory. Binaries are available via
anonymous FTP at ftp.ee.lbl.gov in the directory
/conferencing.
2. VAT: The network audio conferencing tool
was written by Van Jacobson and Steve
McCanne at Lawrence Be