SAS/STAT 9.2 User's Guide: The FASTCLUS Procedure (Book Excerpt) SAS Users Guide

SAS/STAT 9.2 User’s Guide

The FASTCLUS Procedure
(Book Excerpt)

SAS Documentation

This document is an individual chapter from SAS/STAT® 9.2 User’s Guide.
The correct bibliographic citation for the complete manual is as follows: SAS Institute Inc. 2008. SAS/STAT® 9.2
User’s Guide. Cary, NC: SAS Institute Inc.
Copyright © 2008, SAS Institute Inc., Cary, NC, USA
All rights reserved. Produced in the United States of America.
For a Web download or e-book: Your use of this publication shall be governed by the terms established by the vendor
at the time you acquire this publication.
U.S. Government Restricted Rights Notice: Use, duplication, or disclosure of this software and related documentation
by the U.S. government is subject to the Agreement with SAS Institute and the restrictions set forth in FAR 52.227-19,
Commercial Computer Software-Restricted Rights (June 1987).
SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513.
1st electronic book, March 2008
2nd electronic book, February 2009
SAS® Publishing provides a complete selection of books and electronic products to help customers use SAS software to
its fullest potential. For more information about our e-books, e-learning products, CDs, and hard-copy books, visit the
SAS Publishing Web site at support.sas.com/publishing or call 1-800-727-3228.
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute
Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies.

Chapter 34

The FASTCLUS Procedure
Contents
   Overview: FASTCLUS Procedure
      Background
   Getting Started: FASTCLUS Procedure
   Syntax: FASTCLUS Procedure
      PROC FASTCLUS Statement
      BY Statement
      FREQ Statement
      ID Statement
      VAR Statement
      WEIGHT Statement
   Details: FASTCLUS Procedure
      Updates in the FASTCLUS Procedure
      Missing Values
      Output Data Sets
      Computational Resources
      Using PROC FASTCLUS
      Displayed Output
      ODS Table Names
   Examples: FASTCLUS Procedure
      Example 34.1: Fisher’s Iris Data
      Example 34.2: Outliers
   References

Overview: FASTCLUS Procedure
The FASTCLUS procedure performs a disjoint cluster analysis on the basis of distances computed
from one or more quantitative variables. The observations are divided into clusters such that every
observation belongs to one and only one cluster; the clusters do not form a tree structure as they do
in the CLUSTER procedure. If you want separate analysis for different numbers of clusters, you
can run PROC FASTCLUS once for each analysis. Alternatively, to do hierarchical clustering on
a large data set, use PROC FASTCLUS to find initial clusters, and then use those initial clusters as
input to PROC CLUSTER.
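
The two-stage approach mentioned above can be sketched as follows. This is a minimal sketch under stated assumptions, not a prescribed workflow: the data set name Big, the variables x1-x5, the choice of 50 preliminary clusters, and METHOD=WARD are hypothetical, and the sketch assumes that the OUTSEED= data set carries the variables _FREQ_ and _RMSSTD_ for use in the FREQ and RMSSTD statements of PROC CLUSTER.

```sas
/* Stage 1: reduce a large data set to 50 preliminary clusters.      */
/* Big and x1-x5 are hypothetical names used only for illustration.  */
proc fastclus data=Big maxclusters=50 outseed=Prelim noprint;
   var x1-x5;
run;

/* Stage 2: hierarchical clustering of the preliminary cluster       */
/* centers, weighting each center by its cluster size and spread.    */
proc cluster data=Prelim method=ward outtree=Tree;
   var x1-x5;
   freq _FREQ_;
   rmsstd _RMSSTD_;
run;
```

The number of preliminary clusters trades speed against resolution: more preliminary clusters preserve more detail for the hierarchical stage but make that stage slower.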

By default, the FASTCLUS procedure uses Euclidean distances, so the cluster centers are based
on least squares estimation. This kind of clustering method is often called a k-means model, since
the cluster centers are the means of the observations assigned to each cluster when the algorithm is
run to complete convergence. Each iteration reduces the least squares criterion until convergence is
achieved.
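
As a sketch of what "least squares" means here, the k-means criterion can be written as below. The notation is an assumption introduced for illustration (it does not appear in the source): x_i is the i-th observation, c_k is the center of cluster k, C_k is the set of observations assigned to cluster k, and n_k is its size.

```latex
% Least squares (k-means) criterion: total squared Euclidean distance
% from each observation to its nearest cluster center.
W \;=\; \sum_{i=1}^{n} \min_{k} \,\lVert x_i - c_k \rVert^{2}

% For a fixed assignment, W is minimized by taking each center
% to be the mean of the observations assigned to it.
c_k \;=\; \frac{1}{n_k} \sum_{i \in C_k} x_i
```

Reassigning observations to the nearest center and then recomputing the means can each only lower W or leave it unchanged, which is why each iteration reduces the criterion.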
Often there is no need to run the FASTCLUS procedure to convergence. PROC FASTCLUS is
designed to find good clusters (but not necessarily the best possible clusters) with only two or three
passes through the data set. The initialization method of PROC FASTCLUS guarantees that, if
there exist clusters such that all distances between observations in the same cluster are less than all
distances between observations in different clusters, and if you tell PROC FASTCLUS the correct
number of clusters to find, it can always find such a clustering without iterating. Even with clusters
that are not as well separated, PROC FASTCLUS usually finds initial seeds that are sufficiently
good that few iterations are required. Hence, by default, PROC FASTCLUS performs only one
iteration.
The initialization method used by the FASTCLUS procedure makes it sensitive to outliers. PROC
FASTCLUS can be an effective procedure for detecting outliers because outliers often appear as
clusters with only one member.
The FASTCLUS procedure can use an Lp (least pth powers) clustering criterion (Spath 1985,
pp. 62–63) instead of the least squares (L2 ) criterion used in k-means clustering methods. The
LEAST=p option specifies the power p to be used. Using the LEAST= option increases execution
time since more iterations are usually required, and the default iteration limit is increased when
you specify LEAST=p. Values of p less than 2 reduce the effect of outliers on the cluster centers
compared with least squares methods; values of p greater than 2 increase the effect of outliers.
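
A request for an outlier-resistant analysis might look like the following sketch. The data set MyData, the variables x1-x5, and the choice of five clusters are hypothetical; only the LEAST= option itself comes from the text above.

```sas
/* LEAST=1 asks for an L1 (least absolute deviations) criterion,     */
/* which reduces the influence of outliers on the cluster centers.   */
/* MyData and x1-x5 are hypothetical names.                          */
proc fastclus data=MyData maxclusters=5 least=1 out=ClustL1;
   var x1-x5;
run;
```

Because more iterations are usually required with LEAST=, expect this run to take longer than the default least squares analysis of the same data.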
The FASTCLUS procedure is intended for use with large data sets, with 100 or more observations.
With small data sets, the results can be highly sensitive to the order of the observations in the data
set.
PROC FASTCLUS uses algorithms that place a larger influence on variables with larger variance,
so it might be necessary to standardize the variables before performing the cluster analysis. See the
“Using PROC FASTCLUS” section for standardization details.
PROC FASTCLUS produces brief summaries of the clusters it finds. For more extensive examination of the clusters, you can request an output data set containing a cluster membership variable.

Background
The FASTCLUS procedure combines an effective method for finding initial clusters with a standard
iterative algorithm for minimizing the sum of squared distances from the cluster means. The result
is an efficient procedure for disjoint clustering of large data sets. PROC FASTCLUS was directly
inspired by Hartigan’s (1975) leader algorithm and MacQueen’s (1967) k-means algorithm. PROC
FASTCLUS uses a method that Anderberg (1973) calls nearest centroid sorting. A set of points
called cluster seeds is selected as a first guess of the means of the clusters. Each observation is
assigned to the nearest seed to form temporary clusters. The seeds are then replaced by the means
of the temporary clusters, and the process is repeated until no further changes occur in the clusters.
Similar techniques are described in most references on clustering (Anderberg 1973; Hartigan 1975;
Everitt 1980; Spath 1980).
The FASTCLUS procedure differs from other nearest centroid sorting methods in the way the initial
cluster seeds are selected. The importance of initial seed selection is demonstrated by Milligan
(1980).
The clustering is done on the basis of Euclidean distances computed from one or more numeric
variables. If there are missing values, PROC FASTCLUS computes an adjusted distance by using
the nonmissing values. Observations that are very close to each other are usually assigned to the
same cluster, while observations that are far apart are in different clusters.
The FASTCLUS procedure operates in four steps:
1. Observations called cluster seeds are selected.
2. If you specify the DRIFT option, temporary clusters are formed by assigning each observation
to the cluster with the nearest seed. Each time an observation is assigned, the cluster seed is
updated as the current mean of the cluster. This method is sometimes called incremental,
on-line, or adaptive training.
3. If the maximum number of iterations is greater than zero, clusters are formed by assigning
each observation to the nearest seed. After all observations are assigned, the cluster seeds are
replaced by either the cluster means or other location estimates (cluster centers) appropriate
to the LEAST=p option. This step can be repeated until the changes in the cluster seeds
become small or zero (MAXITER=n ≥ 1).
4. Final clusters are formed by assigning each observation to the nearest seed.
If PROC FASTCLUS runs to complete convergence, the final cluster seeds will equal the cluster
means or cluster centers. If PROC FASTCLUS terminates before complete convergence, which
often happens with the default settings, the final cluster seeds might not equal the cluster means or
cluster centers. If you want complete convergence, specify CONVERGE=0 and a large value for
the MAXITER= option.
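
A run to complete convergence, as described above, might be requested as in this sketch. The data set MyData, the variables x1-x5, the cluster count, and the iteration limit of 1000 are hypothetical choices; "a large value" is all that the text requires.

```sas
/* CONVERGE=0 with a large MAXITER= lets the iterations continue     */
/* until the cluster seeds stop changing.                            */
/* MyData and x1-x5 are hypothetical names.                          */
proc fastclus data=MyData maxclusters=5 converge=0 maxiter=1000
              out=Final;
   var x1-x5;
run;
```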
The initial cluster seeds must be observations with no missing values. You can specify the maximum
number of seeds (and, hence, clusters) by using the MAXCLUSTERS= option. You can also specify
a minimum distance by which the seeds must be separated by using the RADIUS= option.
PROC FASTCLUS always selects the first complete (no missing values) observation as the first
seed. The next complete observation that is separated from the first seed by at least the distance
specified in the RADIUS= option becomes the second seed. Later observations are selected as new
seeds if they are separated from all previous seeds by at least the radius, as long as the maximum
number of seeds is not exceeded.
If an observation is complete but fails to qualify as a new seed, PROC FASTCLUS considers using
it to replace one of the old seeds. Two tests are made to see if the observation can qualify as a new
seed.
First, an old seed is replaced if the distance between the observation and the closest seed is greater
than the minimum distance between seeds. The seed that is replaced is selected from the two
seeds that are closest to each other. The seed that is replaced is the one of these two with the
shortest distance to the closest of the remaining seeds when the other seed is replaced by the current
observation.
If the observation fails the first test for seed replacement, a second test is made. The observation
replaces the nearest seed if the smallest distance from the observation to all seeds other than the
nearest one is greater than the shortest distance from the nearest seed to all other seeds. If the
observation fails this test, PROC FASTCLUS goes on to the next observation.
You can specify the REPLACE= option to limit seed replacement. You can omit the second test
for seed replacement (REPLACE=PART), causing PROC FASTCLUS to run faster, but the seeds
selected might not be as widely separated as those obtained by the default method. You can also
suppress seed replacement entirely by specifying REPLACE=NONE. In this case, PROC FASTCLUS runs much faster, but you must choose a good value for the RADIUS= option in order to
get good clusters. This method is similar to Hartigan’s (1975, pp. 74–78) leader algorithm and the
simple cluster seeking algorithm described by Tou and Gonzalez (1974, pp. 90–92).
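
The two ways of limiting seed replacement can be sketched as follows. The data set MyData, the variables x1-x5, and the values RADIUS=2.5 and MAXCLUSTERS=20 are hypothetical; a suitable radius depends on the scale of the data.

```sas
/* Skip the second replacement test: faster, but the seeds might     */
/* be less widely separated. MyData and x1-x5 are hypothetical.      */
proc fastclus data=MyData maxclusters=20 replace=part out=PartOut;
   var x1-x5;
run;

/* No seed replacement at all: the RADIUS= value now determines      */
/* which observations become seeds, so it must be chosen with care.  */
proc fastclus data=MyData maxclusters=20 replace=none radius=2.5
              out=LeaderOut;
   var x1-x5;
run;
```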

Getting Started: FASTCLUS Procedure
The following example demonstrates how to use the FASTCLUS procedure to compute disjoint
clusters of observations in a SAS data set.
The data in this example are measurements taken on 159 freshwater fish caught from the same lake
(Laengelmavesi) near Tampere in Finland. This data set is available from the Data Archive of the
Journal of Statistics Education. The complete data set is displayed in Chapter 82, “The STEPDISC
Procedure.”
The species (bream, parkki, pike, perch, roach, smelt, and whitefish), weight, three different length
measurements (measured from the nose of the fish to the beginning of its tail, the notch of its tail,
and the end of its tail), height, and width of each fish are tallied. The height and width are recorded
as percentages of the third length variable.
Suppose that you want to group empirically the fish measurements into clusters and that you want to
associate the clusters with the species. You can use the FASTCLUS procedure to perform a cluster
analysis.
The following DATA step creates the SAS data set Fish:
   proc format;
      value specfmt
         1='Bream'
         2='Roach'
         3='Whitefish'
         4='Parkki'
         5='Perch'
         6='Pike'
         7='Smelt';
   run;

   data fish (drop=HtPct WidthPct);
      title 'Fish Measurement Data';
      input Species Weight Length1 Length2 Length3 HtPct WidthPct @@;

      *** transform variables;
      if Weight <= 0 or Weight =. then delete;
      Weight3=Weight**(1/3);
      Height=HtPct*Length3/(Weight3*100);
      Width=WidthPct*Length3/(Weight3*100);
      Length1=Length1/Weight3;
      Length2=Length2/Weight3;
      Length3=Length3/Weight3;
      logLengthRatio=log(Length3/Length1);

      format Species specfmt.;
      symbol = put(Species, specfmt.);
      datalines;
   1 242.0 23.2 25.4 30.0 38.4 13.4   1 290.0 24.0 26.3 31.2 40.0 13.8
   1 340.0 23.9 26.5 31.1 39.8 15.1   1 363.0 26.3 29.0 33.5 38.0 13.3
   1 430.0 26.5 29.0 34.0 36.6 15.1   1 450.0 26.8 29.7 34.7 39.2 14.2
   1 500.0 26.8 29.7 34.5 41.1 15.3   1 390.0 27.6 30.0 35.0 36.2 13.4
   1 450.0 27.6 30.0 35.1 39.9 13.8   1 500.0 28.5 30.7 36.2 39.3 13.7
   1 475.0 28.4 31.0 36.2 39.4 14.1   1 500.0 28.7 31.0 36.2 39.7 13.3
   1 500.0 29.1 31.5 36.4 37.8 12.0   1   .   29.5 32.0 37.3 37.3 13.6
   1 600.0 29.4 32.0 37.2 40.2 13.9   1 600.0 29.4 32.0 37.2 41.5 15.0

      ... more lines ...

   7   9.8 11.4 12.0 13.2 16.7  8.7   7  12.2 11.5 12.2 13.4 15.6 10.4
   7  13.4 11.7 12.4 13.5 18.0  9.4   7  12.2 12.1 13.0 13.8 16.5  9.1
   7  19.7 13.2 14.3 15.2 18.9 13.6   7  19.9 13.8 15.0 16.2 18.1 11.6
   ;

The double trailing at sign (@@) in the INPUT statement specifies that observations are input from
each line until all values are read. The variables are rescaled in order to adjust for dimensionality.
Because the new variables Weight3–logLengthRatio depend on the variable Weight, observations with
missing values for Weight are not added to the data set. Consequently, there are 157 observations in
the SAS data set Fish.
In the Fish data set, the variables are not measured in the same units and cannot be assumed to have
equal variance. Therefore, it is necessary to standardize the variables before performing the cluster
analysis.
The following statements standardize the variables and perform a cluster analysis on the standardized data:
   proc standard data=Fish out=Stand mean=0 std=1;
      var Length1 logLengthRatio Height Width Weight3;
   proc fastclus data=Stand out=Clust
                 maxclusters=7 maxiter=100;
      var Length1 logLengthRatio Height Width Weight3;
   run;

The STANDARD procedure is first used to standardize all the analytical variables to a mean of 0 and
standard deviation of 1. The procedure creates the output data set Stand to contain the transformed
variables.
The FASTCLUS procedure then uses the data set Stand as input and creates the data set Clust.
This output data set contains the original variables and two new variables, Cluster and Distance.
The variable Cluster contains the cluster number to which each observation has been assigned. The
variable Distance gives the distance from the observation to its cluster seed.
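
One simple way to inspect the two new variables is to summarize Distance within each cluster, as in the sketch below. The choice of PROC MEANS and of the N, MEAN, and MAX statistics is an illustration, not part of the original example.

```sas
/* Per-cluster count, mean, and maximum of the distance from each    */
/* observation to its cluster seed, using the Clust data set         */
/* created by the PROC FASTCLUS step above.                          */
proc means data=Clust n mean max;
   class Cluster;
   var Distance;
run;
```

Unusually large maximum distances within a cluster can point to observations that fit their cluster poorly.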
It is usually desirable to try several values of the MAXCLUSTERS= option. A reasonable beginning
for this example is to use MAXCLUSTERS=7, since there are seven species of fish represented in
the data set Fish.
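
Trying several values of MAXCLUSTERS= can be automated with a small macro loop. The sketch below is an assumption-laden illustration: the macro name tryk, its parameters kmin and kmax, the range 5 to 9, and the output data set names Clust5 through Clust9 are all hypothetical. The SHORT option is used only to keep the printed output compact.

```sas
/* Run PROC FASTCLUS once for each candidate number of clusters.     */
/* tryk, kmin, and kmax are hypothetical names.                      */
%macro tryk(kmin=5, kmax=9);
   %local k;
   %do k=&kmin %to &kmax;
      title2 "MAXCLUSTERS=&k";
      proc fastclus data=Stand out=Clust&k
                    maxclusters=&k maxiter=100 short;
         var Length1 logLengthRatio Height Width Weight3;
      run;
   %end;
   title2;
%mend tryk;

%tryk(kmin=5, kmax=9)
```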
The VAR statement specifies the variables used in the cluster analysis.
The results from this analysis are displayed in the following figures.
Figure 34.1 Initial Seeds Used in the FASTCLUS Procedure

                            Fish Measurement Data

                           The FASTCLUS Procedure
      Replace=FULL  Radius=0  Maxclusters=7  Maxiter=100  Converge=0.02

                                Initial Seeds

Cluster        Length1  logLengthRatio         Height          Width        Weight3
-----------------------------------------------------------------------------------
      1    1.388338414    -0.979577858   -1.594561848   -2.254050655    2.103447062
      2   -1.117178039    -0.877218192   -0.336166276    2.528114070    1.170706464
      3    2.393997461    -0.662642015   -0.930738701   -2.073879107   -1.839325419
      4   -0.495085516    -0.964041012   -0.265106856   -0.028245072    1.536846394
      5   -0.728772773     0.540096664    1.130501398   -1.207930053   -1.107018207
      6   -0.506924177     0.748211648    1.762482687    0.211507596    1.368987826
      7    1.573996573    -0.796593995   -0.824217424    1.561715851   -1.607942726

                  Criterion Based on Final Seeds = 0.3979

Figure 34.1 displays the table of initial seeds used for each variable and cluster. The first line in the
figure displays the option settings for REPLACE, RADIUS, MAXCLUSTERS, and MAXITER.
These options, with the exception of MAXCLUSTERS and MAXITER, are set at their respective
default values (REPLACE=FULL, RADIUS=0). Both the MAXCLUSTERS= and MAXITER=
options are set in the PROC FASTCLUS statement.
Next, PROC FASTCLUS produces a table of summary statistics for the clusters. Figure 34.2 displays the number of observations in the cluster (frequency) and the root mean squared standard
deviation. The next two columns display the largest Euclidean distance from the cluster seed to any
observation within the cluster and the number of the nearest cluster.
The last column of the table displays the distance between the centroid of the nearest cluster and
the centroid of the current cluster. A centroid is the point having coordinates that are the means of
all the observations in the cluster.

Figure 34.2 Cluster Summary Table from the FASTCLUS Procedure

                                    Cluster Summary

                                  Maximum Distance                        Distance Between
                       RMS Std           from Seed    Radius    Nearest            Cluster
Cluster  Frequency   Deviation      to Observation  Exceeded    Cluster          Centroids
------------------------------------------------------------------------------------------
      1         17      0.5064              1.7781                    4             2.5106
      2         19      0.3696              1.5007                    4             1.5510
      3         13      0.3803              1.7135                    1             2.6704
      4         13      0.4161              1.3976                    7             1.4266
      5         11      0.2466              0.6966                    6             1.7301
      6         34      0.3563              1.5443                    5             1.7301
      7         50      0.4447              2.3915                    4             1.4266

Figure 34.3 displays the table of statistics for the variables. The table lists for each variable the
total standard deviation, the pooled within-cluster standard deviation and the R-square value for
predicting the variable from the cluster. The ratio of between-cluster variance to within-cluster
variance (R² to 1 − R²) appears in the last column.
Figure 34.3 Statistics for Variables Used in the FASTCLUS Procedure

                        Statistics for Variables

Variable          Total STD    Within STD     R-Square    RSQ/(1-RSQ)
---------------------------------------------------------------------
Length1             1.00000       0.31428     0.905030       9.529606
logLengthRatio      1.00000       0.39276     0.851676       5.741989
Height              1.00000       0.20917     0.957929      22.769295
Width               1.00000       0.55558     0.703200       2.369270
Weight3             1.00000       0.47251     0.785323       3.658162
OVER-ALL            1.00000       0.40712     0.840631       5.274764

                 Pseudo F Statistic = 131.87

       Approximate Expected Over-All R-Squared = 0.57420

The pseudo F statistic, approximate expected overall R square, and cubic clustering criterion (CCC)
are listed at the bottom of the figure. You can compare values of these statistics by running PROC
FASTCLUS with different values for the MAXCLUSTERS= option. The R square and CCC values
are not valid for correlated variables.
Values of the cubic clustering criterion greater than 2 or 3 indicate good clusters. Values between
0 and 2 indicate potential clusters, but they should be taken with caution; large negative values can
indicate outliers.
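
As a worked check on the numbers in Figure 34.3, the ratio column and the pseudo F statistic can be reproduced from the reported R-square values. The pseudo F formula below is the usual ratio of between-cluster to within-cluster mean squares and is stated here as an assumption; n = 157 observations and c = 7 clusters are taken from the example.

```latex
% Ratio of between-cluster to within-cluster variance for Length1
\frac{R^2}{1 - R^2} = \frac{0.905030}{1 - 0.905030}
                    = \frac{0.905030}{0.094970} \approx 9.5296

% Pseudo F from the overall R-square, with n = 157 and c = 7
F = \frac{R^2 / (c - 1)}{(1 - R^2) / (n - c)}
  = \frac{R^2}{1 - R^2} \cdot \frac{n - c}{c - 1}
  = 5.274764 \times \frac{150}{6} \approx 131.87
```

Both results agree with the values displayed in the figure.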
PROC FASTCLUS next produces the within-cluster means and standard deviations of the variables,
displayed in Figure 34.4.
Figure 34.4 Cluster Means and Standard Deviations from the FASTCLUS Procedure

                                Cluster Means

Cluster        Length1  logLengthRatio         Height          Width        Weight3
-----------------------------------------------------------------------------------
      1    1.747808245    -0.868605685   -1.327226832   -1.128760946    0.806373599
      2   -0.405231510    -0.979113021   -0.281064162    1.463094486    1.060450065
      3    2.006796315    -0.652725165   -1.053213440   -1.224020795   -1.826752838
      4   -0.136820952    -1.039312574   -0.446429482    0.162596336    0.278560318
      5   -0.850130601     0.550190242    1.245156076   -0.836585750   -0.567022647
      6   -0.843912827     1.522291347    1.511408739   -0.380323563    0.763114370
      7   -0.165570970    -0.048881276   -0.353723615    0.546442064   -0.668780782

                         Cluster Standard Deviations

Cluster        Length1  logLengthRatio         Height          Width        Weight3
-----------------------------------------------------------------------------------
      1   0.3418476428    0.3544065543   0.1666302451   0.6172880027   0.7944227150
      2   0.3129902863    0.3592350778   0.1369052680   0.5467406493   0.3720119097
      3   0.2962504486    0.1740941675   0.1736086707   0.7528475622   0.0905232968
      4   0.3254364840    0.2836681149   0.1884592934   0.4543390702   0.6612055341
      5   0.1781837609    0.0745984121   0.2056932592   0.2784540794   0.3832002850
      6   0.2273744242    0.3385584051   0.2046010964   0.5143496067   0.4025849044
      7   0.3734733622    0.5275768119   0.2551130680   0.5721303628   0.4223181710

It is useful to study further the clusters calculated by the FASTCLUS procedure. One method is
to look at a frequency tabulation of the clusters with other classification variables. The following
statements invoke the FREQ procedure to crosstabulate the empirical clusters with the variable
Species:
proc freq data=Clust;
tables Species*Cluster;
run;

Figure 34.5 displays the marked division between clusters.
Figure 34.5 Frequency Table of Cluster versus Species

                              Fish Measurement Data

                               The FREQ Procedure

                           Table of Species by CLUSTER

Species     CLUSTER(Cluster)

Frequency |
Percent   |
Row Pct   |
Col Pct   |       1|       2|       3|       4|       5|       6|       7|  Total
----------+--------+--------+--------+--------+--------+--------+--------+
Bream     |      0 |      0 |      0 |      0 |      0 |     34 |      0 |     34
          |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |  21.66 |   0.00 |  21.66
          |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 | 100.00 |   0.00 |
          |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 | 100.00 |   0.00 |
----------+--------+--------+--------+--------+--------+--------+--------+
Roach     |      0 |      0 |      0 |      0 |      0 |      0 |     19 |     19
          |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |  12.10 |  12.10
          |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 | 100.00 |
          |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |  38.00 |
----------+--------+--------+--------+--------+--------+--------+--------+
Whitefish |      0 |      2 |      0 |      1 |      0 |      0 |      3 |      6
          |   0.00 |   1.27 |   0.00 |   0.64 |   0.00 |   0.00 |   1.91 |   3.82
          |   0.00 |  33.33 |   0.00 |  16.67 |   0.00 |   0.00 |  50.00 |
          |   0.00 |  10.53 |   0.00 |   7.69 |   0.00 |   0.00 |   6.00 |
----------+--------+--------+--------+--------+--------+--------+--------+
Parkki    |      0 |      0 |      0 |      0 |     11 |      0 |      0 |     11
          |   0.00 |   0.00 |   0.00 |   0.00 |   7.01 |   0.00 |   0.00 |   7.01
          |   0.00 |   0.00 |   0.00 |   0.00 | 100.00 |   0.00 |   0.00 |
          |   0.00 |   0.00 |   0.00 |   0.00 | 100.00 |   0.00 |   0.00 |
----------+--------+--------+--------+--------+--------+--------+--------+
Perch     |      0 |     17 |      0 |     12 |      0 |      0 |     27 |     56
          |   0.00 |  10.83 |   0.00 |   7.64 |   0.00 |   0.00 |  17.20 |  35.67
          |   0.00 |  30.36 |   0.00 |  21.43 |   0.00 |   0.00 |  48.21 |
          |   0.00 |  89.47 |   0.00 |  92.31 |   0.00 |   0.00 |  54.00 |
----------+--------+--------+--------+--------+--------+--------+--------+
Pike      |     17 |      0 |      0 |      0 |      0 |      0 |      0 |     17
          |  10.83 |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |  10.83
          | 100.00 |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |
          | 100.00 |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |
----------+--------+--------+--------+--------+--------+--------+--------+
Smelt     |      0 |      0 |     13 |      0 |      0 |      0 |      1 |     14
          |   0.00 |   0.00 |   8.28 |   0.00 |   0.00 |   0.00 |   0.64 |   8.92
          |   0.00 |   0.00 |  92.86 |   0.00 |   0.00 |   0.00 |   7.14 |
          |   0.00 |   0.00 | 100.00 |   0.00 |   0.00 |   0.00 |   2.00 |
----------+--------+--------+--------+--------+--------+--------+--------+
Total           17       19       13       13       11       34       50      157
             10.83    12.10     8.28     8.28     7.01    21.66    31.85   100.00
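
When only the counts are of interest, the crosstabulation can be made more compact. The following sketch is a variation on the PROC FREQ step above, not part of the original example; it uses the NOROW, NOCOL, and NOPERCENT options of the TABLES statement to suppress the three percentage lines in each cell.

```sas
/* Same crosstabulation as before, showing cell frequencies only.    */
proc freq data=Clust;
   tables Species*Cluster / norow nocol nopercent;
run;
```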

For cases in which you have three or more clusters, you can use the CANDISC and SGPLOT procedures to obtain a graphical check on the distribution of the clusters. In the following statements, the
CANDISC and SGPLOT procedures are used to compute canonical variables and plot the clusters:
   proc candisc data=Clust out=Can noprint;
      class Cluster;
      var Length1 logLengthRatio Height Width Weight3;
   proc sgplot data=Can;
      scatter y=Can2 x=Can1 / group=Cluster;
   run;

First, the CANDISC procedure is invoked to perform a canonical discriminant analysis by using the
data set Clust and creating the output SAS data set Can. The NOPRINT option suppresses display
of the output. The CLASS statement specifies the variable Cluster to define groups for the analysis.
The VAR statement specifies the variables used in the analysis.
Next, the SGPLOT procedure plots the two canonical variables from PROC CANDISC, Can1 and
Can2. The GROUP= option in the SCATTER statement specifies the variable Cluster as the grouping variable. The resulting
plot (Figure 34.6) illustrates the spatial separation of the clusters calculated in the FASTCLUS
procedure.
Figure 34.6 Plot of Canonical Variables and Cluster Value

Syntax: FASTCLUS Procedure
The following statements are available in the FASTCLUS procedure:
   PROC FASTCLUS < DATA=SAS-data-set >
                 < MAXCLUSTERS=n >
                 < RADIUS=t > ;
      VAR variables ;
      ID variables ;
      FREQ variable ;
      WEIGHT variable ;
      BY variables ;

Usually you need only the VAR statement in addition to the PROC FASTCLUS statement. The
BY, FREQ, ID, VAR, and WEIGHT statements are described in alphabetical order after the PROC
FASTCLUS statement.

PROC FASTCLUS Statement
PROC FASTCLUS MAXCLUSTERS= n | RADIUS=t < options > ;

You must specify the MAXCLUSTERS= option or RADIUS= option or both in the PROC FASTCLUS statement.
MAXCLUSTERS=n
MAXC=n

specifies the maximum number of clusters permitted. If you omit the MAXCLUSTERS=
option, a value of 100 is assumed.
RADIUS=t
R=t

establishes the minimum distance criterion for selecting new seeds. No observation is considered as a new seed unless its minimum distance to previous seeds exceeds the value given
by the RADIUS= option. The default value is 0. If you specify the REPLACE=RANDOM
option, the RADIUS= option is ignored.

You can specify the following options in the PROC FASTCLUS statement. Table 34.1 summarizes
the options.
Table 34.1 PROC FASTCLUS Statement Options

Option                 Description
-------------------------------------------------------------------------------
Specify input and output data sets
   DATA=               specifies input data set
   INSTAT=             specifies input SAS data set previously created by the
                       OUTSTAT= option
   SEED=               specifies input SAS data set for selecting initial
                       cluster seeds
   VARDEF=             specifies divisor for variances

Output Data Processing
   CLUSTER=            specifies name for cluster membership variable in
                       OUTSEED= and OUT= data sets
   CLUSTERLABEL=       specifies label for cluster membership variable in
                       OUTSEED= and OUT= data sets
   OUT=                specifies output SAS data set containing original data
                       and cluster assignments
   OUTITER             specifies writing to OUTSEED= data set on every
                       iteration
   OUTSEED= or MEAN=   specifies output SAS data set containing cluster centers
   OUTSTAT=            specifies output SAS data set containing statistics

Initial Clusters
   DRIFT               permits cluster seeds to drift during initialization
   MAXCLUSTERS=        specifies maximum number of clusters
   RADIUS=             specifies minimum distance for selecting new seeds
   RANDOM=             specifies seed to initialize pseudo-random number
                       generator
   REPLACE=            specifies seed replacement method

Clustering Methods
   CONVERGE=           specifies convergence criterion
   DELETE=             deletes cluster seeds with few observations
   LEAST=              optimizes an Lp criterion, where 1 ≤ p ≤ ∞
   MAXITER=            specifies maximum number of iterations
   STRICT              prevents an observation from being assigned to a cluster
                       if its distance to the nearest cluster seed is large

Arcane Algorithmic Options
   BINS=               specifies number of bins used for computing medians for
                       LEAST=1
   HC=                 specifies criterion for updating the homotopy parameter
   HP=                 specifies initial value of the homotopy parameter
   IRLS                uses an iteratively reweighted least squares method
                       instead of the modified Ekblom-Newton method for
                       1 < p < 2

Missing Values
   IMPUTE              imputes missing values after final cluster assignment
   NOMISS              excludes observations with missing values

Control Displayed Output
   DISTANCE            displays distances between cluster centers
   LIST                displays cluster assignments for all observations
   NOPRINT             suppresses displayed output
   SHORT               suppresses display of large matrices
   SUMMARY             suppresses display of all results except for the cluster
                       summary

The following list provides details on these options. The list is in alphabetical order.
BINS=n

specifies the number of bins used in the bin-sort algorithm for computing medians for
LEAST=1. By default, PROC FASTCLUS uses from 10 to 100 bins, depending on the
amount of memory available. Larger values use more memory and make each iteration somewhat slower, but they can reduce the number of iterations. Smaller values have the opposite
effect. The minimum value of n is 5.
CLUSTER=name

specifies a name for the variable in the OUTSEED= and OUT= data sets that indicates cluster
membership. The default name for this variable is CLUSTER.
CLUSTERLABEL=name

specifies a label for the variable CLUSTER in the OUTSEED= and OUT= data sets. By
default this variable has no label.
CONVERGE=c
CONV=c

specifies the convergence criterion. Any nonnegative value is permitted. The default value is
0.0001 for all values of p if LEAST=p is explicitly specified; otherwise, the default value is
0.02. Iterations stop when the maximum relative change in the cluster seeds is less than or
equal to the convergence criterion and additional conditions on the homotopy parameter, if
any, are satisfied (see the HP= option). The relative change in a cluster seed is the distance
between the old seed and the new seed divided by a scaling factor. If you do not specify the
LEAST= option, the scaling factor is the minimum distance between the initial seeds. If you
specify the LEAST= option, the scaling factor is an L1 scale estimate and is recomputed on
each iteration. Specify the CONVERGE= option only if you specify a MAXITER= value
greater than 1.
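
For the case in which the LEAST= option is not specified, the stopping rule described above can be written as follows. The notation is an assumption introduced for illustration: s_k^{(t)} is the seed of cluster k after iteration t, s_k^{(0)} is its initial seed, and c is the CONVERGE= value.

```latex
% Iterations stop when the largest relative seed movement is at most c.
% The scaling factor is the minimum distance between the initial seeds.
\max_{k} \;
\frac{\lVert s_k^{(t)} - s_k^{(t-1)} \rVert}
     {\min_{j \neq l} \lVert s_j^{(0)} - s_l^{(0)} \rVert}
\;\le\; c
```

With the default c = 0.02, iterations therefore stop once no seed moves by more than 2 percent of the smallest gap between the initial seeds.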
DATA=SAS-data-set

specifies the input data set containing observations to be clustered. If you omit the DATA=
option, the most recently created SAS data set is used. The data must be coordinates, not
distances, similarities, or correlations.
DELETE=n

deletes cluster seeds to which n or fewer observations are assigned. Deletion occurs after
processing for the DRIFT option is completed and after each iteration specified by the MAXITER= option. Cluster seeds are not deleted after the final assignment of observations to
clusters, so in rare cases a final cluster might not have more than n members. The DELETE=
option is ineffective if you specify MAXITER=0 and do not specify the DRIFT option. By
default, no cluster seeds are deleted.
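As an illustration only, the following sketch combines the DELETE= option with several iterations. The data set name MYDATA, the variables x and y, the output data set CLUSOUT, and the particular values (20 clusters, 10 iterations, a threshold of 5) are assumptions chosen for the example, not recommendations.

```sas
/* Hypothetical input: WORK.MYDATA with numeric variables x and y */
/* Seeds with 5 or fewer observations are dropped after each iteration */
proc fastclus data=mydata maxclusters=20 maxiter=10 delete=5 out=clusout;
   var x y;
run;
```

Because seeds are not deleted after the final assignment, a cluster in CLUSOUT can still end up with five or fewer observations.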
DISTANCE | DIST

computes distances between the cluster means.
DRIFT

executes the second of the four steps described in the section “Background” on page 1622.
After initial seed selection, each observation is assigned to the cluster with the nearest seed.
After an observation is processed, the seed of the cluster to which it is assigned is recalculated
as the mean of the observations currently assigned to the cluster. Thus, the cluster seeds drift
about rather than remaining fixed for the duration of the pass.
HC=c
HP=p1 < p2 >

pertains to the homotopy parameter for LEAST=p, where 1 < p < 2. You should specify
these options only if you encounter convergence problems when you use the default values.
For 1 < p < 2, PROC FASTCLUS tries to optimize a perturbed variant of the Lp clustering
criterion (Gonin and Money 1989, pp. 5–6). When the homotopy parameter is 0, the optimization criterion is equivalent to the clustering criterion. For a large homotopy parameter,
the optimization criterion approaches the least squares criterion and is therefore easy to optimize. Beginning with a large homotopy parameter, PROC FASTCLUS gradually decreases
it by a factor in the range [0.01,0.5] over the course of the iterations. When both the homotopy parameter and the convergence measure are sufficiently small, the optimization process
is declared to have converged.
If the initial homotopy parameter is too large or if it is decreased too slowly, the optimization
can require many iterations. If the initial homotopy parameter is too small or if it is decreased
too quickly, convergence to a local optimum is likely. The following list gives details on
setting the homotopy parameter.

HC=c

specifies the criterion for updating the homotopy parameter. The homotopy
parameter is updated when the maximum relative change in the cluster seeds
is less than or equal to c. The default is the minimum of 0.01 and 100 times
the value of the CONVERGE= option.

HP=p1

specifies p1 as the initial value of the homotopy parameter. The default is 0.05
if the modified Ekblom-Newton method is used; otherwise, it is 0.25.

HP=p1 p2

also specifies p2 as the minimum value for the homotopy parameter, which
must be reached for convergence. The default is the minimum of p1 and 0.01
times the value of the CONVERGE= option.
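The following sketch shows one way the homotopy options might be set together for a case with 1 < p < 2. The data set MYDATA, the variables x and y, the output data set CLUSOUT, and every numeric value are assumptions made for illustration; they are not tuned settings.

```sas
/* Hypothetical input: WORK.MYDATA with numeric variables x and y */
/* HP=p1 p2 sets the initial and minimum homotopy parameter;      */
/* HC= sets the seed-change threshold that triggers an update     */
proc fastclus data=mydata maxclusters=2 least=1.5 maxiter=50
              converge=0.0001 hc=0.001 hp=0.1 0.000001 out=clusout;
   var x y;
run;
```

With CONVERGE=0.0001, the defaults described above would be HC=0.01, p1=0.05 (modified Ekblom-Newton method), and p2=0.000001, so this sketch starts from a larger homotopy parameter and updates it under a tighter criterion than the defaults.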

IMPUTE

requests imputation of missing values after the final assignment of observations to clusters. If
an observation that is assigned (or would have been assigned) to a cluster has a missing value
for variables used in the cluster analysis, the missing value is replaced by the corresponding
value in the cluster seed to which the observation is assigned (or would have been assigned).
If the observation cannot be assigned to a cluster, missing value replacement depends on
whether or not the NOMISS option is specified. If NOMISS is not specified, missing values
are replaced by the mean of all observations in the DATA= data set having a value for that
variable. If NOMISS is specified, missing values are replaced by the mean of only observations
used in the analysis. (A weighted mean is used if a variable is specified in the WEIGHT
statement.) For information about cluster assignment see the section “OUT= Data Set” on
page 1643. If you specify the IMPUTE option, the imputed values are not used in computing
cluster statistics.
If you also request an OUT= data set, it contains the imputed values.
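A minimal sketch of requesting imputation follows. The data set MYDATA, the variables x and y, and the output data set IMPUTED are hypothetical names used only for this illustration.

```sas
/* Hypothetical input: WORK.MYDATA, where x and y contain some missing values */
proc fastclus data=mydata maxclusters=3 maxiter=10 impute out=imputed;
   var x y;
run;

/* Inspect the first rows; missing x and y values have been filled in */
proc print data=imputed(obs=10);
run;
```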
INSTAT=SAS-data-set

reads a SAS data set previously created with the FASTCLUS procedure by using the OUTSTAT= option. If you specify the INSTAT= option, no clustering iterations are performed and
no output is displayed. Only cluster assignment and imputation are performed as an OUT=
data set is created.
IRLS

causes PROC FASTCLUS to use an iteratively reweighted least squares method instead of
the modified Ekblom-Newton method. If you specify the IRLS option, you must also specify
LEAST=p, where 1 < p < 2. Use the IRLS option only if you encounter convergence
problems with the default method.
LEAST=p | MAX
L=p | MAX

causes PROC FASTCLUS to optimize an Lp criterion, where 1 ≤ p ≤ ∞ (Spath 1985,
pp. 62–63). Infinity is indicated by LEAST=MAX. The value of this clustering criterion is
displayed in the iteration history.
If you do not specify the LEAST= option, PROC FASTCLUS uses the least squares (L2)
criterion. However, the default number of iterations is only 1 if you omit the LEAST= option,
so the optimization of the criterion is generally not completed. If you specify the LEAST=
option, the maximum number of iterations is increased to permit the optimization process a
chance to converge. See the MAXITER= option for details.

Specifying the LEAST= option also changes the default convergence criterion from 0.02 to
0.0001. See the CONVERGE= option for details.
When LEAST=2, PROC FASTCLUS tries to minimize the root mean squared difference
between the data and the corresponding cluster means.
When LEAST=1, PROC FASTCLUS tries to minimize the mean absolute difference between
the data and the corresponding cluster medians.
When LEAST=MAX, PROC FASTCLUS tries to minimize the maximum absolute difference
between the data and the corresponding cluster midranges.
For general values of p, PROC FASTCLUS tries to minimize the pth root of the mean of the
pth powers of the absolute differences between the data and the corresponding cluster seeds.
The divisor in the clustering criterion is either the number of nonmissing data used in the
analysis or, if there is a WEIGHT statement, the sum of the weights corresponding to all
the nonmissing data used in the analysis (that is, an observation with n nonmissing data
contributes n times the observation weight to the divisor). The divisor is not adjusted for
degrees of freedom.
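The criterion described above can be summarized in one expression. This is a notational sketch for the unweighted case with finite p; the symbols x_{ij}, s_{k(i),j}, J_i, and N are introduced here for illustration.

```latex
\[
\left( \frac{1}{N} \sum_{i} \sum_{j \in J_i}
       \left| x_{ij} - s_{k(i),j} \right|^{p} \right)^{1/p},
\qquad
N = \sum_{i} \lvert J_i \rvert
\]
```

Here x_{ij} is the value of variable j for observation i, J_i is the set of variables that are nonmissing for observation i, and s_{k(i),j} is coordinate j of the seed of the cluster k(i) to which observation i is assigned. With a WEIGHT statement, the divisor N is replaced by the sum of the observation weights over all nonmissing data, as described above. For LEAST=MAX, the expression is replaced by the maximum absolute difference.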
The method for updating cluster seeds during iteration depends on the LEAST= option, as
follows (Gonin and Money 1989).
LEAST=p
pD1
1 1

If you specify the LEAST=p option, with 1 < p < 2, and you omit the IRLS option, an additional
column is displayed in the Iteration History table. This column contains a character to identify the
method used in each iteration. PROC FASTCLUS chooses the most efficient method to cluster
the data at each iterative step, given the condition of the data. Thus, the method chosen is data
dependent. The possible values are described as follows:
Value     Method
N         Newton’s Method
I or L    iteratively weighted least squares (IRLS)
1         IRLS step, halved once
2         IRLS step, halved twice
3         IRLS step, halved three times

PROC FASTCLUS displays a Cluster Summary, giving the following for each cluster:
• Cluster number
• Frequency, the number of observations in the cluster
• Weight, the sum of the weights of the observations in the cluster, if you specify the WEIGHT
  statement
• RMS Std Deviation, the root mean squared across variables of the cluster standard deviations,
  which is equal to the root mean square distance between observations in the cluster
• Maximum Distance from Seed to Observation, the maximum distance from the cluster seed
  to any observation in the cluster
• Nearest Cluster, the number of the cluster with mean closest to the mean of the current cluster
• Centroid Distance, the distance between the centroids (means) of the current cluster and the
  nearest other cluster
A table of statistics for each variable is displayed unless you specify the SUMMARY option. The
table contains the following:
• Total STD, the total standard deviation
• Within STD, the pooled within-cluster standard deviation
• R-Squared, the R square for predicting the variable from the cluster
• RSQ/(1 - RSQ), the ratio of between-cluster variance to within-cluster variance (R²/(1 − R²))
• OVER-ALL, all of the previous quantities pooled across variables
PROC FASTCLUS also displays the following:
• Pseudo F Statistic,

  \[
  \frac{R^2 / (c - 1)}{(1 - R^2) / (n - c)}
  \]

  where R square is the observed overall R square, c is the number of clusters, and n is the
  number of observations. The pseudo F statistic was suggested by Calinski and Harabasz
  (1974). See Milligan and Cooper (1985) and Cooper and Milligan (1988) regarding the use of
  the pseudo F statistic in estimating the number of clusters. See Example 29.2 in Chapter 29,
  “The CLUSTER Procedure,” for a comparison of pseudo F statistics.
• Observed Overall R-Squared, if you specify the SUMMARY option
• Approximate Expected Overall R-Squared, the approximate expected value of the overall R
  square under the uniform null hypothesis assuming that the variables are uncorrelated. The
  value is missing if the number of clusters is greater than one-fifth the number of observations.
• Cubic Clustering Criterion, computed under the assumption that the variables are uncorrelated.
  The value is missing if the number of clusters is greater than one-fifth the number of
  observations.
  If you are interested in the approximate expected R square or the cubic clustering criterion but
  your variables are correlated, you should cluster principal component scores from the PRINCOMP
  procedure. Both of these statistics are described by Sarle (1983). The performance
  of the cubic clustering criterion in estimating the number of clusters is examined by Milligan
  and Cooper (1985) and Cooper and Milligan (1988).
• Distances Between Cluster Means, if you specify the DISTANCE option
Unless you specify the SHORT or SUMMARY option, PROC FASTCLUS displays the following:
• Cluster Means for each variable
• Cluster Standard Deviations for each variable
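
As a rough arithmetic check of the pseudo F statistic listed above, the values reported for the two-cluster iris analysis in Output 34.1.1 (overall R square 0.776410, c = 2 clusters, n = 150 observations) can be substituted directly. The intermediate rounding shown is approximate.

```latex
\[
\frac{R^2/(c-1)}{(1-R^2)/(n-c)}
= \frac{0.776410/1}{0.223590/148}
\approx 3.4725 \times 148
\approx 513.9
\]
```

This agrees with the displayed value, Pseudo F Statistic = 513.92, and the intermediate ratio 3.4725 matches the OVER-ALL value of RSQ/(1-RSQ) in the same output.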

ODS Table Names
PROC FASTCLUS assigns a name to each table it creates. You can use these names to reference
the table when using the Output Delivery System (ODS) to select tables and create output data sets.
These names are listed in Table 34.4. For more information on ODS, see Chapter 20, “Using the
Output Delivery System.”
Table 34.4 ODS Tables Produced by PROC FASTCLUS

ODS Table Name         Description                                               Statement   Option
ApproxExpOverAllRSq    Approximate expected overall R-squared, single number     PROC        default
CCC                    CCC, Cubic Clustering Criterion, single number            PROC        default
ClusterList            Cluster listing, obs, id, and distances                   PROC        LIST
ClusterSum             Cluster summary, cluster number, distances                PROC        PRINTALL
ClusterCenters         Cluster centers                                           PROC        default
ClusterDispersion      Cluster dispersion                                        PROC        default
ConvergenceStatus      Convergence status                                        PROC        PRINTALL
Criterion              Criterion based on final seeds, single number             PROC        default
DistBetweenClust       Distance between clusters                                 PROC        default
InitialSeeds           Initial seeds                                             PROC        default
IterHistory            Iteration history, various statistics for each iteration  PROC        PRINTALL
MinDist                Minimum distance between initial seeds, single number     PROC        PRINTALL
NumberOfBins           Number of bins                                            PROC        default
ObsOverAllRSquare      Observed overall R-squared, single number                 PROC        SUMMARY
PrelScaleEst           Preliminary L(1) scale estimate, single number            PROC        PRINTALL
PseudoFStat            Pseudo F statistic, single number                         PROC        default
SimpleStatistics       Simple statistics for input variables                     PROC        default
VariableStat           Statistics for variables within clusters                  PROC        default
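
The table names in Table 34.4 can be used in an ODS OUTPUT statement to save displayed results as data sets. The following is a minimal sketch, assuming the iris data set from Example 34.1 exists; the output data set names PSEUDO_F and CCC_VALUE are hypothetical.

```sas
/* Save two default tables as data sets; the names on the right are hypothetical */
ods output PseudoFStat=pseudo_f CCC=ccc_value;

proc fastclus data=iris maxclusters=3 maxiter=10;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
```

Both tables are listed with the option "default" in Table 34.4, so no extra display option is needed in this sketch.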

Examples: FASTCLUS Procedure

Example 34.1: Fisher’s Iris Data
The iris data published by Fisher (1936) have been widely used for examples in discriminant analysis and cluster analysis. The sepal length, sepal width, petal length, and petal width are measured
in millimeters on 50 iris specimens from each of three species, Iris setosa, I. versicolor, and I.
virginica. Mezzich and Solomon (1980) discuss a variety of cluster analyses of the iris data.
In this example, the FASTCLUS procedure is used to find two and then three clusters. In the
following code, an output data set is created, and PROC FREQ is invoked to compare the clusters
with the species classification. See Output 34.1.1 and Output 34.1.2 for these results.
For three clusters, you can use the CANDISC procedure to compute canonical variables for plotting
the clusters. See Output 34.1.3 and Output 34.1.4 for the results.
/* Format that maps the numeric species code to a name */
proc format;
   value specname
      1='Setosa    '
      2='Versicolor'
      3='Virginica ';
run;

data iris;
   title 'Fisher (1936) Iris Data';
   input SepalLength SepalWidth PetalLength PetalWidth Species @@;
   format Species specname.;
   label SepalLength='Sepal Length in mm.'
         SepalWidth ='Sepal Width in mm.'
         PetalLength='Petal Length in mm.'
         PetalWidth ='Petal Width in mm.';
   symbol = put(species, specname10.);
   datalines;
50 33 14 02 1 64 28 56 22 3 65 28 46 15 2 67 31 56 24 3
63 28 51 15 3 46 34 14 03 1 69 31 51 23 3 62 22 45 15 2
59 32 48 18 2 46 36 10 02 1 61 30 46 14 2 60 27 51 16 2
65 30 52 20 3 56 25 39 11 2 65 30 55 18 3 58 27 51 19 3
68 32 59 23 3 51 33 17 05 1 57 28 45 13 2 62 34 54 23 3
... more lines ...
55 23 40 13 2 66 30 44 14 2 68 28 48 14 2 54 34 17 02 1
51 37 15 04 1 52 35 15 02 1 58 28 51 24 3 67 30 50 17 2
63 33 60 25 3 53 37 15 02 1
;

/* Two-cluster solution; OUT= saves the cluster assignment of each observation */
proc fastclus data=iris maxc=2 maxiter=10 out=clus;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Compare the clusters with the species classification */
proc freq;
   tables cluster*species;
run;

/* Three-cluster solution */
proc fastclus data=iris maxc=3 maxiter=10 out=clus;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

proc freq;
   tables cluster*Species;
run;

/* Canonical variables for plotting the three clusters */
proc candisc anova out=can;
   class cluster;
   var SepalLength SepalWidth PetalLength PetalWidth;
   title2 'Canonical Discriminant Analysis of Iris Clusters';
run;

proc sgplot data=Can;
   scatter y=Can2 x=Can1 / group=Cluster;
   title2 'Plot of Canonical Variables Identified by Cluster';
run;

Output 34.1.1 Fisher’s Iris Data: PROC FASTCLUS with MAXC=2 and PROC FREQ

Fisher (1936) Iris Data

The FASTCLUS Procedure
Replace=FULL  Radius=0  Maxclusters=2  Maxiter=10  Converge=0.02

Initial Seeds

Cluster    SepalLength     SepalWidth    PetalLength     PetalWidth
--------------------------------------------------------------------
1          43.00000000    30.00000000    11.00000000     1.00000000
2          77.00000000    26.00000000    69.00000000    23.00000000

Minimum Distance Between Initial Seeds = 70.85196

Iteration History

                           Relative Change in Cluster Seeds
Iteration    Criterion           1            2
-------------------------------------------------
1              11.0638      0.1904       0.3163
2               5.3780      0.0596       0.0264
3               5.0718      0.0174      0.00766

Convergence criterion is satisfied.

Criterion Based on Final Seeds = 5.0417

Cluster Summary

                                   Maximum Distance
                        RMS Std           from Seed      Radius    Nearest
Cluster    Frequency  Deviation      to Observation    Exceeded    Cluster
---------------------------------------------------------------------------
1                 53     3.7050             21.1621                      2
2                 97     5.6779             24.6430                      1

Cluster Summary

            Distance Between
Cluster    Cluster Centroids
-----------------------------
1                    39.2879
2                    39.2879

Statistics for Variables

Variable       Total STD    Within STD    R-Square    RSQ/(1-RSQ)
------------------------------------------------------------------
SepalLength      8.28066       5.49313    0.562896       1.287784
SepalWidth       4.35866       3.70393    0.282710       0.394137
PetalLength     17.65298       6.80331    0.852470       5.778291
PetalWidth       7.62238       3.57200    0.781868       3.584390
OVER-ALL        10.69224       5.07291    0.776410       3.472463

Pseudo F Statistic = 513.92

Approximate Expected Over-All R-Squared = 0.51539

Cubic Clustering Criterion = 14.806

WARNING: The two values above are invalid for correlated variables.

Cluster Means

Cluster    SepalLength     SepalWidth    PetalLength     PetalWidth
--------------------------------------------------------------------
1          50.05660377    33.69811321    15.60377358     2.90566038
2          63.01030928    28.86597938    49.58762887    16.95876289

Cluster Standard Deviations

Cluster    SepalLength     SepalWidth    PetalLength     PetalWidth
--------------------------------------------------------------------
1          3.427350930    4.396611045    4.404279486    2.105525249
2          6.336887455    3.267991438    7.800577673    4.155612484

Fisher (1936) Iris Data

The FREQ Procedure

Table of CLUSTER by Species

CLUSTER(Cluster)     Species

Frequency|
Percent  |
Row Pct  |
Col Pct  |Setosa  |Versicol|Virginic|  Total
         |        |or      |a       |
---------+--------+--------+--------+
       1 |     50 |      3 |      0 |     53
         |  33.33 |   2.00 |   0.00 |  35.33
         |  94.34 |   5.66 |   0.00 |
         | 100.00 |   6.00 |   0.00 |
---------+--------+--------+--------+
       2 |      0 |     47 |     50 |     97
         |   0.00 |  31.33 |  33.33 |  64.67
         |   0.00 |  48.45 |  51.55 |
         |   0.00 |  94.00 | 100.00 |
---------+--------+--------+--------+
Total          50       50       50      150
            33.33    33.33    33.33   100.00

Output 34.1.2 Fisher’s Iris Data: PROC FASTCLUS with MAXC=3 and PROC FREQ

Fisher (1936) Iris Data

The FASTCLUS Procedure
Replace=FULL  Radius=0  Maxclusters=3  Maxiter=10  Converge=0.02

Initial Seeds

Cluster    SepalLength     SepalWidth    PetalLength     PetalWidth
--------------------------------------------------------------------
1          58.00000000    40.00000000    12.00000000     2.00000000
2          77.00000000    38.00000000    67.00000000    22.00000000
3          49.00000000    25.00000000    45.00000000    17.00000000

Minimum Distance Between Initial Seeds = 38.23611

Iteration History

                           Relative Change in Cluster Seeds
Iteration    Criterion           1            2            3
--------------------------------------------------------------
1               6.7591      0.2652       0.3205       0.2985
2               3.7097           0       0.0459       0.0317
3               3.6427           0       0.0182       0.0124

Convergence criterion is satisfied.

Criterion Based on Final Seeds = 3.6289

Cluster Summary

                                   Maximum Distance
                        RMS Std           from Seed      Radius    Nearest
Cluster    Frequency  Deviation      to Observation    Exceeded    Cluster
---------------------------------------------------------------------------
1                 50     2.7803             12.4803                      3
2                 38     4.0168             14.9736                      3
3                 62     4.0398             16.9272                      2

Cluster Summary

            Distance Between
Cluster    Cluster Centroids
-----------------------------
1                    33.5693
2                    17.9718
3                    17.9718

Statistics for Variables

Variable       Total STD    Within STD    R-Square    RSQ/(1-RSQ)
------------------------------------------------------------------
SepalLength      8.28066       4.39488    0.722096       2.598359
SepalWidth       4.35866       3.24816    0.452102       0.825156
PetalLength     17.65298       4.21431    0.943773      16.784895
PetalWidth       7.62238       2.45244    0.897872       8.791618
OVER-ALL        10.69224       3.66198    0.884275       7.641194

Pseudo F Statistic = 561.63

Approximate Expected Over-All R-Squared = 0.62728

Cubic Clustering Criterion = 25.021

WARNING: The two values above are invalid for correlated variables.

Cluster Means

Cluster    SepalLength     SepalWidth    PetalLength     PetalWidth
--------------------------------------------------------------------
1          50.06000000    34.28000000    14.62000000     2.46000000
2          68.50000000    30.73684211    57.42105263    20.71052632
3          59.01612903    27.48387097    43.93548387    14.33870968

Cluster Standard Deviations

Cluster    SepalLength     SepalWidth    PetalLength     PetalWidth
--------------------------------------------------------------------
1          3.524896872    3.790643691    1.736639965    1.053855894
2          4.941550255    2.900924461    4.885895746    2.798724562
3          4.664100551    2.962840548    5.088949673    2.974997167

Fisher (1936) Iris Data

The FREQ Procedure

Table of CLUSTER by Species

CLUSTER(Cluster)     Species

Frequency|
Percent  |
Row Pct  |
Col Pct  |Setosa  |Versicol|Virginic|  Total
         |        |or      |a       |
---------+--------+--------+--------+
       1 |     50 |      0 |      0 |     50
         |  33.33 |   0.00 |   0.00 |  33.33
         | 100.00 |   0.00 |   0.00 |
         | 100.00 |   0.00 |   0.00 |
---------+--------+--------+--------+
       2 |      0 |      2 |     36 |     38
         |   0.00 |   1.33 |  24.00 |  25.33
         |   0.00 |   5.26 |  94.74 |
         |   0.00 |   4.00 |  72.00 |
---------+--------+--------+--------+
       3 |      0 |     48 |     14 |     62
         |   0.00 |  32.00 |   9.33 |  41.33
         |   0.00 |  77.42 |  22.58 |
         |   0.00 |  96.00 |  28.00 |
---------+--------+--------+--------+
Total          50       50       50      150
            33.33    33.33    33.33   100.00

Output 34.1.3 Fisher’s Iris Data using PROC CANDISC

Fisher (1936) Iris Data
Canonical Discriminant Analysis of Iris Clusters

The CANDISC Procedure

Total Sample Size    150      DF Total               149
Variables              4      DF Within Classes      147
Classes                3      DF Between Classes       2

Number of Observations Read    150
Number of Observations Used    150

Class Level Information

           Variable
CLUSTER    Name        Frequency     Weight    Proportion
1          _1                 50    50.0000      0.333333
2          _2                 38    38.0000      0.253333
3          _3                 62    62.0000      0.413333

Fisher (1936) Iris Data
Canonical Discriminant Analysis of Iris Clusters

The CANDISC Procedure

Univariate Test Statistics

F Statistics,   Num DF=2,   Den DF=147

                                        Total      Pooled     Between
                                     Standard    Standard    Standard                R-Square
Variable       Label                Deviation   Deviation   Deviation   R-Square   / (1-RSq)   F Value   Pr > F
SepalLength    Sepal Length in mm.     8.2807      4.3949      8.5893     0.7221      2.5984    190.98   <.0001
SepalWidth     Sepal Width in mm.      4.3587      3.2482      3.5774     0.4521      0.8252     60.65   <.0001
PetalLength    Petal Length in mm.    17.6530      4.2143     20.9336     0.9438     16.7849   1233.69   <.0001
PetalWidth     Petal Width in mm.      7.6224      2.4524      8.8164     0.8979      8.7916    646.18   <.0001

Average R-Square

Unweighted              0.7539604
Weighted by Variance    0.8842753

Multivariate Statistics and F Approximations

S=2    M=0.5    N=71

Statistic                        Value    F Value    Num DF    Den DF    Pr > F
Wilks’ Lambda               0.03222337     164.55         8       288    <.0001
Pillai’s Trace              1.25669612      61.29         8       290    <.0001
Hotelling-Lawley Trace     21.06722883     377.66         8     203.4    <.0001
Roy’s Greatest Root        20.63266809     747.93         4       145    <.0001

NOTE: F Statistic for Roy’s Greatest Root is an upper bound.
NOTE: F Statistic for Wilks’ Lambda is exact.

Fisher (1936) Iris Data
Canonical Discriminant Analysis of Iris Clusters

The CANDISC Procedure

                     Adjusted    Approximate        Squared
      Canonical     Canonical       Standard      Canonical
    Correlation   Correlation          Error    Correlation
1      0.976613      0.976123       0.003787       0.953774
2      0.550384      0.543354       0.057107       0.302923

Eigenvalues of Inv(E)*H = CanRsq/(1-CanRsq)

     Eigenvalue    Difference    Proportion    Cumulative
1       20.6327       20.1981        0.9794        0.9794
2        0.4346                      0.0206        1.0000

Test of H0: The canonical correlations in the
current row and all that follow are zero

     Likelihood    Approximate
          Ratio        F Value    Num DF    Den DF    Pr > F
1    0.03222337         164.55         8       288    <.0001
2    0.69707749          21.00         3       145    <.0001

Fisher (1936) Iris Data
Canonical Discriminant Analysis of Iris Clusters

The CANDISC Procedure

Total Canonical Structure

Variable       Label                       Can1         Can2
SepalLength    Sepal Length in mm.     0.831965     0.452137
SepalWidth     Sepal Width in mm.     -0.515082     0.810630
PetalLength    Petal Length in mm.     0.993520     0.087514
PetalWidth     Petal Width in mm.      0.966325     0.154745

Between Canonical Structure

Variable       Label                       Can1         Can2
SepalLength    Sepal Length in mm.     0.956160     0.292846
SepalWidth     Sepal Width in mm.     -0.748136     0.663545
PetalLength    Petal Length in mm.     0.998770     0.049580
PetalWidth     Petal Width in mm.      0.995952     0.089883

Pooled Within Canonical Structure

Variable       Label                       Can1         Can2
SepalLength    Sepal Length in mm.     0.339314     0.716082
SepalWidth     Sepal Width in mm.     -0.149614     0.914351
PetalLength    Petal Length in mm.     0.900839     0.308136
PetalWidth     Petal Width in mm.      0.650123     0.404282

Fisher (1936) Iris Data
Canonical Discriminant Analysis of Iris Clusters

The CANDISC Procedure

Total-Sample Standardized Canonical Coefficients

Variable       Label                          Can1            Can2
SepalLength    Sepal Length in mm.     0.047747341     1.021487262
SepalWidth     Sepal Width in mm.     -0.577569244     0.864455153
PetalLength    Petal Length in mm.     3.341309573    -1.283043758
PetalWidth     Petal Width in mm.      0.996451144     0.900476563

Pooled Within-Class Standardized Canonical Coefficients

Variable       Label                           Can1            Can2
SepalLength    Sepal Length in mm.     0.0253414487    0.5421446856
SepalWidth     Sepal Width in mm.      -.4304161258    0.6442092294
PetalLength    Petal Length in mm.     0.7976741592    -.3063023132
PetalWidth     Petal Width in mm.      0.3205998034    0.2897207865

Raw Canonical Coefficients

Variable       Label                           Can1            Can2
SepalLength    Sepal Length in mm.     0.0057661265    0.1233581748
SepalWidth     Sepal Width in mm.      -.1325106494    0.1983303556
PetalLength    Petal Length in mm.     0.1892773419    -.0726814163
PetalWidth     Petal Width in mm.      0.1307270927    0.1181359305

Class Means on Canonical Variables

CLUSTER            Can1            Can2
1          -6.131527227     0.244761516
2           4.931414018     0.861972277
3           1.922300462    -0.725693908

Output 34.1.4 Plot of Fisher’s Iris Data using PROC CANDISC

Example 34.2: Outliers
This example involves data artificially generated to contain two clusters and several severe outliers.
A preliminary analysis specifies 20 clusters and outputs an OUTSEED= data set to be used for
a diagnostic plot. The exact number of initial clusters is not important; similar results could be
obtained with 10 or 50 initial clusters. Examination of the plot suggests that clusters with more
than five (again, the exact number is not important) observations can yield good seeds for the main
analysis. A DATA step deletes clusters with five or fewer observations, and the remaining cluster
means provide seeds for the next PROC FASTCLUS analysis.
Two clusters are requested; the LEAST= option specifies the mean absolute deviation criterion
(LEAST=1). Values of the LEAST= option less than 2 reduce the effect of outliers on cluster
centers.
The next analysis also requests two clusters; the STRICT= option is specified to prevent outliers
from distorting the results. The STRICT= value is chosen to be close to the _GAP_ and _RADIUS_
values of the larger clusters in the diagnostic plot; the exact value is not critical.
A final PROC FASTCLUS run assigns the outliers to clusters.

The following SAS statements implement these steps, and the results are displayed in Output 34.2.3
through Output 34.2.8. First, an artificial data set is created with two clusters and some outliers.
Then PROC FASTCLUS is run with many clusters to produce an OUTSEED= data set. A diagnostic
plot using the variables _GAP_ and _RADIUS_ is then produced using the SGSCATTER procedure.
The results from these steps are shown in Output 34.2.1 and Output 34.2.2.
data x;
   title 'Using PROC FASTCLUS to Analyze Data with Outliers';
   drop n;
   /* first cluster: 100 points centered at (2, 0) */
   do n=1 to 100;
      x=rannor(12345)+2;
      y=rannor(12345);
      output;
   end;
   /* second cluster: 100 points centered at (-2, 0) */
   do n=1 to 100;
      x=rannor(12345)-2;
      y=rannor(12345);
      output;
   end;
   /* 10 widely scattered outliers */
   do n=1 to 10;
      x=10*rannor(12345);
      y=10*rannor(12345);
      output;
   end;
run;

title2 'Preliminary PROC FASTCLUS Analysis with 20 Clusters';
proc fastclus data=x outseed=mean1 maxc=20 maxiter=0 summary;
   var x y;
run;

/* Diagnostic plot of _GAP_ and _RADIUS_ against cluster frequency */
proc sgscatter data=mean1;
   compare y=(_gap_ _radius_) x=_freq_;
run;

Output 34.2.1 Preliminary Analysis of Data with Outliers Using PROC FASTCLUS

Using PROC FASTCLUS to Analyze Data with Outliers
Preliminary PROC FASTCLUS Analysis with 20 Clusters

The FASTCLUS Procedure
Replace=FULL  Radius=0  Maxclusters=20  Maxiter=0

Criterion Based on Final Seeds = 0.6873

Cluster Summary

                                   Maximum Distance
                        RMS Std           from Seed      Radius    Nearest
Cluster    Frequency  Deviation      to Observation    Exceeded    Cluster
---------------------------------------------------------------------------
1                  8     0.4753              1.1924                     19
2                  1      .                  0                           6
3                 44     0.6252              1.6774                      5
4                  1      .                  0                          20
5                 38     0.5603              1.4528                      3
6                  2     0.0542              0.1085                      2
7                  1      .                  0                          14
8                  2     0.6480              1.2961                      1
9                  1      .                  0                           7
10                 1      .                  0                          18
11                 1      .                  0                          16
12                20     0.5911              1.6291                     16
13                 5     0.6682              1.4244                      3
14                 1      .                  0                           7
15                 5     0.4074              1.2678                      3
16                22     0.4168              1.5139                     19
17                 8     0.4031              1.4794                      5
18                 1      .                  0                          10
19                45     0.6475              1.6285                     16
20                 3     0.5719              1.3642                     15

Cluster Summary

            Distance Between
Cluster    Cluster Centroids
-----------------------------
1                     1.7205
2                     6.2847
3                     1.4386
4                     5.2130
5                     1.4386
6                     6.2847
7                     2.5094
8                     1.8450
9                     9.4534
10                    4.2514
11                    4.7582
12                    1.5601
13                    1.9553
14                    2.5094
15                    1.7609
16                    1.4936
17                    1.5564
18                    4.2514
19                    1.4936
20                    1.8999

Pseudo F Statistic = 207.58

Observed Over-All R-Squared = 0.95404

Approximate Expected Over-All R-Squared = 0.96103

Cubic Clustering Criterion = -2.503

WARNING: The two values above are invalid for correlated variables.

Output 34.2.2 Preliminary Analysis of Data with Outliers: Plot Using PROC SGSCATTER

In the following SAS statements, a DATA step removes the low-frequency clusters. The
FASTCLUS procedure is then run again with the LEAST=1 clustering criterion, selecting
seeds from the high-frequency clusters in the previous analysis. The results are shown in
Output 34.2.3 and Output 34.2.4.

/* Keep only seeds of clusters with more than five observations */
data seed;
   set mean1;
   if _freq_>5;
run;

title2 'PROC FASTCLUS Analysis Using LEAST= Clustering Criterion';
title3 'Values < 2 Reduce Effect of Outliers on Cluster Centers';
proc fastclus data=x seed=seed maxc=2 least=1 out=out;
   var x y;
run;

proc sgplot data=out;
   scatter y=y x=x / group=cluster;
run;

Output 34.2.3 Analysis of Data with Outliers Using the LEAST= Option

Using PROC FASTCLUS to Analyze Data with Outliers
PROC FASTCLUS Analysis Using LEAST= Clustering Criterion
Values < 2 Reduce Effect of Outliers on Cluster Centers

The FASTCLUS Procedure
Replace=FULL  Radius=0  Maxclusters=2  Maxiter=20  Converge=0.0001  Least=1

Initial Seeds

Cluster               x               y
------------------------------------------
1           2.794174248    -0.065970836
2          -2.027300384    -2.051208579

Minimum Distance Between Initial Seeds = 6.806712

Preliminary L(1) Scale Estimate = 2.796579

Number of Bins = 100

Iteration History

                            Maximum    Relative Change in Cluster Seeds
Iteration    Criterion     Bin Size            1            2
---------------------------------------------------------------
1               1.3983       0.2263       0.4091       0.6696
2               1.0776       0.0226      0.00511       0.0452
3               1.0771      0.00226      0.00229      0.00234
4               1.0771     0.000396     0.000253     0.000144
5               1.0771     0.000396            0            0

Convergence criterion is satisfied.

Criterion Based on Final Seeds = 1.0771

Cluster Summary

                           Mean    Maximum Distance
                       Absolute           from Seed      Radius    Nearest
Cluster    Frequency  Deviation      to Observation    Exceeded    Cluster
---------------------------------------------------------------------------
1                102     1.1278             24.1622                      2
2                108     1.0494             14.8292                      1

Cluster Summary

           Distance Between
Cluster     Cluster Medians
----------------------------
1                    4.2585
2                    4.2585

Cluster Medians

Cluster               x               y
------------------------------------------
1           1.923023887     0.222482918
2          -1.826721743    -0.286253041

Mean Absolute Deviations from Final Seeds

Cluster               x               y
------------------------------------------
1           1.113465261     1.142120480
2           0.890331835     1.208370913

Output 34.2.4 Analysis Plot of Data with Outliers

The FASTCLUS procedure is run again, selecting seeds from high frequency clusters in the previous analysis. STRICT= prevents outliers from distorting the results. The results are shown in
Output 34.2.5 and Output 34.2.6.

title2 'PROC FASTCLUS Analysis Using STRICT= to Omit Outliers';
proc fastclus data=x seed=seed
              maxc=2 strict=3.0 out=out outseed=mean2;
   var x y;
run;

proc sgplot data=out;
   scatter y=y x=x / group=cluster;
run;

Output 34.2.5 Cluster Analysis with Outliers Omitted: PROC FASTCLUS and PROC SGPLOT

Using PROC FASTCLUS to Analyze Data with Outliers
PROC FASTCLUS Analysis Using STRICT= to Omit Outliers

The FASTCLUS Procedure
Replace=FULL  Radius=0  Strict=3  Maxclusters=2  Maxiter=1

Initial Seeds

Cluster               x               y
------------------------------------------
1           2.794174248    -0.065970836
2          -2.027300384    -2.051208579

Criterion Based on Final Seeds = 0.9515

Cluster Summary

                                   Maximum Distance
                        RMS Std           from Seed      Radius    Nearest
Cluster    Frequency  Deviation      to Observation    Exceeded    Cluster
---------------------------------------------------------------------------
1                 99     0.9501              2.9589                      2
2                 99     0.9290              2.8011                      1

Cluster Summary

            Distance Between
Cluster    Cluster Centroids
-----------------------------
1                     3.7666
2                     3.7666

12 Observation(s) were not assigned to a cluster
because the minimum distance to a cluster seed exceeded the STRICT= value.

Statistics for Variables
Variable
Total STD
Within STD
R-Square
RSQ/(1-RSQ)
-----------------------------------------------------------------x
2.06854
0.87098
0.823609
4.669219
y
1.02113
1.00352
0.039093
0.040683
OVER-ALL
1.63119
0.93959
0.669891
2.029303
Pseudo F Statistic =

397.74

Approximate Expected Over-All R-Squared =
Cubic Clustering Criterion =

0.60615

3.197

1670 F Chapter 34: The FASTCLUS Procedure

Output 34.2.5 continued
WARNING: The two values above are invalid for correlated variables.

Cluster Means
Cluster
x
y
------------------------------------------1
1.825111432
0.141211701
2
-1.919910712
-0.261558725
Cluster Standard Deviations
Cluster
x
y
------------------------------------------1
0.889549271
1.006965219
2
0.852000588
1.000062579
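
As a check on the listing, the Pseudo F Statistic in Output 34.2.5 can be recovered from the over-all R-square. The sketch below assumes the usual pseudo F form with c clusters and n observations, and it assumes that n counts only the 198 observations assigned to clusters (99 + 99), so the 12 observations excluded by STRICT= are not counted.

```latex
F \;=\; \frac{R^2/(c-1)}{(1-R^2)/(n-c)}
  \;=\; \frac{R^2}{1-R^2}\cdot\frac{n-c}{c-1}
  \;=\; 2.029303 \times \frac{198-2}{2-1}
  \;\approx\; 397.74
```

Here 2.029303 is the OVER-ALL value in the RSQ/(1-RSQ) column, which is 0.669891 / (1 - 0.669891). Under the same assumptions, the figures in Output 34.2.7 give 0.572621 × (210 - 2) ≈ 119.11, where all 210 observations are assigned.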

Output 34.2.6 Cluster Analysis with Outliers Omitted: Plot Using PROC SGPLOT

Finally, the FASTCLUS procedure is run one more time with zero iterations (MAXITER=0), using the cluster means saved in the mean2 data set as seeds, to assign outliers and tails to clusters. The results are shown in Output 34.2.7 and Output 34.2.8.

title2 'Final PROC FASTCLUS Analysis Assigning Outliers to '
       'Clusters';
proc fastclus data=x seed=mean2 maxc=2 maxiter=0 out=out;
   var x y;
run;

proc sgplot data=out;
   scatter y=y x=x / group=cluster;
run;
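
One way to cross-check the final assignment is to recompute per-cluster statistics directly from the OUT= data set. The following is a minimal sketch, not part of the original example. It assumes that the data set out written by the MAXITER=0 run contains the variables cluster, x, and y, and that both procedures use their default variance divisor.

```sas
/* Sketch: per-cluster count, mean, and standard deviation of x and y,   */
/* computed from the OUT= data set of the final MAXITER=0 run.           */
proc means data=out n mean std maxdec=4;
   class cluster;   /* one row group per assigned cluster */
   var x y;
run;
```

Under these assumptions, the counts should correspond to the Frequency column of the Cluster Summary, and the means and standard deviations should be close to the Cluster Means and Cluster Standard Deviations tables in the listing for this run.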

Output 34.2.7 Cluster Analysis with Outliers Omitted: PROC FASTCLUS
Using PROC FASTCLUS to Analyze Data with Outliers
Final PROC FASTCLUS Analysis Assigning Outliers to Clusters

The FASTCLUS Procedure
Replace=FULL  Radius=0  Maxclusters=2  Maxiter=0

Initial Seeds

Cluster    x               y
--------------------------------------
1          1.825111432     0.141211701
2         -1.919910712    -0.261558725

Criterion Based on Final Seeds = 2.0594

Cluster Summary

                                Maximum Distance
                     RMS Std    from Seed         Radius    Nearest
Cluster  Frequency   Deviation  to Observation    Exceeded  Cluster
-------------------------------------------------------------------
1        103         2.2569     17.9426                     2
2        107         1.8371     11.7362                     1

Cluster Summary

         Distance Between
Cluster  Cluster Centroids
--------------------------
1        4.3753
2        4.3753

Statistics for Variables

Variable   Total STD   Within STD   R-Square   RSQ/(1-RSQ)
----------------------------------------------------------
x          2.92721     1.95529      0.555950   1.252000
y          2.15248     2.14754      0.009347   0.009435
OVER-ALL   2.56922     2.05367      0.364119   0.572621

Pseudo F Statistic = 119.11

Approximate Expected Over-All R-Squared = 0.49090

Cubic Clustering Criterion = -5.338

WARNING: The two values above are invalid for correlated variables.

Cluster Means

Cluster    x               y
--------------------------------------
1          2.280017469     0.263940765
2         -2.075547895    -0.151348765

Cluster Standard Deviations

Cluster    x               y
--------------------------------------
1          2.412264861     2.089922815
2          1.379355878     2.201567557

Output 34.2.8 Cluster Analysis with Outliers Omitted: Plot Using PROC SGPLOT

References
Anderberg, M. R. (1973), Cluster Analysis for Applications, New York: Academic Press.
Bock, H. H. (1985), “On Some Significance Tests in Cluster Analysis,” Journal of Classification, 2, 77–108.
Calinski, T. and Harabasz, J. (1974), “A Dendrite Method for Cluster Analysis,” Communications in Statistics, 3, 1–27.
Cooper, M. C. and Milligan, G. W. (1988), “The Effect of Error on Determining the Number of Clusters,” Proceedings of the International Workshop on Data Analysis, Decision Support, and Expert Knowledge Representation in Marketing and Related Areas of Research, 319–328.
Everitt, B. S. (1980), Cluster Analysis, Second Edition, London: Heinemann Educational Books.
Fisher, R. A. (1936), “The Use of Multiple Measurements in Taxonomic Problems,” Annals of Eugenics, 7, 179–188.
Gonin, R. and Money, A. H. (1989), Nonlinear Lp-Norm Estimation, New York: Marcel Dekker.
Hartigan, J. A. (1975), Clustering Algorithms, New York: John Wiley & Sons.
Hartigan, J. A. (1985), “Statistical Theory in Clustering,” Journal of Classification, 2, 63–76.
Journal of Statistics Education, “Fish Catch Data Set,” http://www.amstat.org/publications/jse/jse_data_archive.html.
MacQueen, J. B. (1967), “Some Methods for Classification and Analysis of Multivariate Observations,” Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1, 281–297.
McLachlan, G. J. and Basford, K. E. (1988), Mixture Models, New York: Marcel Dekker.
Mezzich, J. E. and Solomon, H. (1980), Taxonomy and Behavioral Science, New York: Academic Press.
Milligan, G. W. (1980), “An Examination of the Effect of Six Types of Error Perturbation on Fifteen Clustering Algorithms,” Psychometrika, 45, 325–342.
Milligan, G. W. and Cooper, M. C. (1985), “An Examination of Procedures for Determining the Number of Clusters in a Data Set,” Psychometrika, 50, 159–179.
Pollard, D. (1981), “Strong Consistency of k-Means Clustering,” Annals of Statistics, 9, 135–140.
Sarle, W. S. (1983), “The Cubic Clustering Criterion,” SAS Technical Report A-108, Cary, NC: SAS Institute Inc.
Spath, H. (1980), Cluster Analysis Algorithms, Chichester, Eng.: Ellis Horwood.
Spath, H. (1985), Cluster Dissection and Analysis, Chichester, Eng.: Ellis Horwood.
Titterington, D. M., Smith, A. F. M., and Makov, U. E. (1985), Statistical Analysis of Finite Mixture Distributions, New York: John Wiley & Sons.
Tou, J. T. and Gonzalez, R. C. (1974), Pattern Recognition Principles, Reading, MA: Addison-Wesley.




