SAS/STAT 9.2 User's Guide: The FASTCLUS Procedure (Book Excerpt)
SAS/STAT® 9.2 User’s Guide
The FASTCLUS Procedure
(Book Excerpt)

SAS® Documentation

This document is an individual chapter from SAS/STAT® 9.2 User’s Guide.

The correct bibliographic citation for the complete manual is as follows: SAS Institute Inc. 2008. SAS/STAT® 9.2 User’s Guide. Cary, NC: SAS Institute Inc.

Copyright © 2008, SAS Institute Inc., Cary, NC, USA

All rights reserved. Produced in the United States of America.

For a Web download or e-book: Your use of this publication shall be governed by the terms established by the vendor at the time you acquire this publication.

U.S. Government Restricted Rights Notice: Use, duplication, or disclosure of this software and related documentation by the U.S. government is subject to the Agreement with SAS Institute and the restrictions set forth in FAR 52.227-19, Commercial Computer Software-Restricted Rights (June 1987).

SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513.

1st electronic book, March 2008
2nd electronic book, February 2009

SAS® Publishing provides a complete selection of books and electronic products to help customers use SAS software to its fullest potential. For more information about our e-books, e-learning products, CDs, and hard-copy books, visit the SAS Publishing Web site at support.sas.com/publishing or call 1-800-727-3228.

SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

Chapter 34
The FASTCLUS Procedure

Contents

Overview: FASTCLUS Procedure
    Background
Getting Started: FASTCLUS Procedure
Syntax: FASTCLUS Procedure
    PROC FASTCLUS Statement
    BY Statement
    FREQ Statement
    ID Statement
    VAR Statement
    WEIGHT Statement
Details: FASTCLUS Procedure
    Updates in the FASTCLUS Procedure
    Missing Values
    Output Data Sets
    Computational Resources
    Using PROC FASTCLUS
    Displayed Output
    ODS Table Names
Examples: FASTCLUS Procedure
    Example 34.1: Fisher’s Iris Data
    Example 34.2: Outliers
References

Overview: FASTCLUS Procedure

The FASTCLUS procedure performs a disjoint cluster analysis on the basis of distances computed from one or more quantitative variables. The observations are divided into clusters such that every observation belongs to one and only one cluster; the clusters do not form a tree structure as they do in the CLUSTER procedure. If you want separate analysis for different numbers of clusters, you can run PROC FASTCLUS once for each analysis.
Alternatively, to do hierarchical clustering on a large data set, use PROC FASTCLUS to find initial clusters, and then use those initial clusters as input to PROC CLUSTER.

By default, the FASTCLUS procedure uses Euclidean distances, so the cluster centers are based on least squares estimation. This kind of clustering method is often called a k-means model, since the cluster centers are the means of the observations assigned to each cluster when the algorithm is run to complete convergence. Each iteration reduces the least squares criterion until convergence is achieved.

Often there is no need to run the FASTCLUS procedure to convergence. PROC FASTCLUS is designed to find good clusters (but not necessarily the best possible clusters) with only two or three passes through the data set. The initialization method of PROC FASTCLUS guarantees that, if there exist clusters such that all distances between observations in the same cluster are less than all distances between observations in different clusters, and if you tell PROC FASTCLUS the correct number of clusters to find, it can always find such a clustering without iterating. Even with clusters that are not as well separated, PROC FASTCLUS usually finds initial seeds that are sufficiently good that few iterations are required. Hence, by default, PROC FASTCLUS performs only one iteration.

The initialization method used by the FASTCLUS procedure makes it sensitive to outliers. PROC FASTCLUS can be an effective procedure for detecting outliers because outliers often appear as clusters with only one member.

The FASTCLUS procedure can use an Lp (least pth powers) clustering criterion (Spath 1985, pp. 62–63) instead of the least squares (L2) criterion used in k-means clustering methods. The LEAST=p option specifies the power p to be used.
Using the LEAST= option increases execution time since more iterations are usually required, and the default iteration limit is increased when you specify LEAST=p. Values of p less than 2 reduce the effect of outliers on the cluster centers compared with least squares methods; values of p greater than 2 increase the effect of outliers.

The FASTCLUS procedure is intended for use with large data sets, with 100 or more observations. With small data sets, the results can be highly sensitive to the order of the observations in the data set.

PROC FASTCLUS uses algorithms that place a larger influence on variables with larger variance, so it might be necessary to standardize the variables before performing the cluster analysis. See the “Using PROC FASTCLUS” section for standardization details.

PROC FASTCLUS produces brief summaries of the clusters it finds. For more extensive examination of the clusters, you can request an output data set containing a cluster membership variable.

Background

The FASTCLUS procedure combines an effective method for finding initial clusters with a standard iterative algorithm for minimizing the sum of squared distances from the cluster means. The result is an efficient procedure for disjoint clustering of large data sets. PROC FASTCLUS was directly inspired by Hartigan’s (1975) leader algorithm and MacQueen’s (1967) k-means algorithm.

PROC FASTCLUS uses a method that Anderberg (1973) calls nearest centroid sorting. A set of points called cluster seeds is selected as a first guess of the means of the clusters. Each observation is assigned to the nearest seed to form temporary clusters. The seeds are then replaced by the means of the temporary clusters, and the process is repeated until no further changes occur in the clusters. Similar techniques are described in most references on clustering (Anderberg 1973; Hartigan 1975; Everitt 1980; Spath 1980).
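The nearest centroid sorting loop described above can be illustrated with a short Python sketch. This is not the PROC FASTCLUS implementation; it is written in Python rather than SAS purely for illustration, and the function names, the toy data, and the rule of keeping an old seed when a cluster is empty are assumptions made for this example.

```python
import math

def nearest(point, seeds):
    """Index of the seed closest to point (Euclidean distance)."""
    return min(range(len(seeds)), key=lambda j: math.dist(point, seeds[j]))

def nearest_centroid_sort(points, seeds, max_iter=100):
    """Assign points to the nearest seed, replace seeds by cluster means, repeat."""
    seeds = [tuple(s) for s in seeds]
    assign = None
    for _ in range(max_iter):
        new_assign = [nearest(p, seeds) for p in points]
        if new_assign == assign:       # no further changes occur in the clusters
            break
        assign = new_assign
        for j in range(len(seeds)):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:                # assumption: keep the old seed if a cluster is empty
                seeds[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return seeds, assign

# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
seeds, assign = nearest_centroid_sort(points, seeds=[points[0], points[3]])
print(assign)   # [0, 0, 0, 1, 1, 1]
```

With these well-separated groups the assignments stop changing after the first pass, and each returned seed equals the mean of its cluster.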
The FASTCLUS procedure differs from other nearest centroid sorting methods in the way the initial cluster seeds are selected. The importance of initial seed selection is demonstrated by Milligan (1980).

The clustering is done on the basis of Euclidean distances computed from one or more numeric variables. If there are missing values, PROC FASTCLUS computes an adjusted distance by using the nonmissing values. Observations that are very close to each other are usually assigned to the same cluster, while observations that are far apart are in different clusters.

The FASTCLUS procedure operates in four steps:

1. Observations called cluster seeds are selected.

2. If you specify the DRIFT option, temporary clusters are formed by assigning each observation to the cluster with the nearest seed. Each time an observation is assigned, the cluster seed is updated as the current mean of the cluster. This method is sometimes called incremental, on-line, or adaptive training.

3. If the maximum number of iterations is greater than zero, clusters are formed by assigning each observation to the nearest seed. After all observations are assigned, the cluster seeds are replaced by either the cluster means or other location estimates (cluster centers) appropriate to the LEAST=p option. This step can be repeated until the changes in the cluster seeds become small or zero (MAXITER=n ≥ 1).

4. Final clusters are formed by assigning each observation to the nearest seed.

If PROC FASTCLUS runs to complete convergence, the final cluster seeds will equal the cluster means or cluster centers. If PROC FASTCLUS terminates before complete convergence, which often happens with the default settings, the final cluster seeds might not equal the cluster means or cluster centers. If you want complete convergence, specify CONVERGE=0 and a large value for the MAXITER= option.

The initial cluster seeds must be observations with no missing values.
You can specify the maximum number of seeds (and, hence, clusters) by using the MAXCLUSTERS= option. You can also specify a minimum distance by which the seeds must be separated by using the RADIUS= option.

PROC FASTCLUS always selects the first complete (no missing values) observation as the first seed. The next complete observation that is separated from the first seed by at least the distance specified in the RADIUS= option becomes the second seed. Later observations are selected as new seeds if they are separated from all previous seeds by at least the radius, as long as the maximum number of seeds is not exceeded.

If an observation is complete but fails to qualify as a new seed, PROC FASTCLUS considers using it to replace one of the old seeds. Two tests are made to see if the observation can qualify as a new seed.

First, an old seed is replaced if the distance between the observation and the closest seed is greater than the minimum distance between seeds. The seed that is replaced is selected from the two seeds that are closest to each other. The seed that is replaced is the one of these two with the shortest distance to the closest of the remaining seeds when the other seed is replaced by the current observation.

If the observation fails the first test for seed replacement, a second test is made. The observation replaces the nearest seed if the smallest distance from the observation to all seeds other than the nearest one is greater than the shortest distance from the nearest seed to all other seeds. If the observation fails this test, PROC FASTCLUS goes on to the next observation.

You can specify the REPLACE= option to limit seed replacement. You can omit the second test for seed replacement (REPLACE=PART), causing PROC FASTCLUS to run faster, but the seeds selected might not be as widely separated as those obtained by the default method. You can also suppress seed replacement entirely by specifying REPLACE=NONE.
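As a rough illustration of the basic selection rule with seed replacement suppressed (the REPLACE=NONE behavior), the following Python sketch walks through the observations in order. It is not the procedure's code: the function name select_seeds, the use of None for a missing value, and the sample data are assumptions for this example, and the two replacement tests are deliberately left out. The text above says seeds must be separated by "at least" the radius, while the RADIUS= option description says the distance must exceed it; the sketch assumes the strict form.

```python
import math

def select_seeds(observations, max_clusters, radius=0.0):
    """Pick seeds in data order with no seed replacement (a REPLACE=NONE-style pass)."""
    seeds = []
    for obs in observations:
        if any(v is None for v in obs):
            continue                   # seeds must be complete observations
        if len(seeds) == max_clusters:
            break                      # maximum number of seeds reached
        # Assumption: a new seed must be farther than `radius` from every existing seed.
        if all(math.dist(obs, s) > radius for s in seeds):
            seeds.append(obs)
    return seeds

# Hypothetical observations; None marks a missing value.
obs = [(0.0, 0.0), (0.5, 0.0), (None, 3.0), (3.0, 0.0), (3.2, 0.1), (0.0, 4.0)]
seeds = select_seeds(obs, max_clusters=3, radius=1.0)
print(seeds)   # [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
```

Because nothing is ever replaced, the result depends entirely on the order of the data and on the radius.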
With REPLACE=NONE, PROC FASTCLUS runs much faster, but you must choose a good value for the RADIUS= option in order to get good clusters. This method is similar to Hartigan’s (1975, pp. 74–78) leader algorithm and the simple cluster seeking algorithm described by Tou and Gonzalez (1974, pp. 90–92).

Getting Started: FASTCLUS Procedure

The following example demonstrates how to use the FASTCLUS procedure to compute disjoint clusters of observations in a SAS data set.

The data in this example are measurements taken on 159 freshwater fish caught from the same lake (Laengelmavesi) near Tampere in Finland. This data set is available from the Data Archive of the Journal of Statistics Education. The complete data set is displayed in Chapter 82, “The STEPDISC Procedure.”

The species (bream, parkki, pike, perch, roach, smelt, and whitefish), weight, three different length measurements (measured from the nose of the fish to the beginning of its tail, the notch of its tail, and the end of its tail), height, and width of each fish are tallied. The height and width are recorded as percentages of the third length variable.

Suppose that you want to group empirically the fish measurements into clusters and that you want to associate the clusters with the species. You can use the FASTCLUS procedure to perform a cluster analysis.

The following DATA step creates the SAS data set Fish:

    proc format;
       value specfmt
          1='Bream'
          2='Roach'
          3='Whitefish'
          4='Parkki'
          5='Perch'
          6='Pike'
          7='Smelt';
    run;

    data fish (drop=HtPct WidthPct);
       title 'Fish Measurement Data';
       input Species Weight Length1 Length2 Length3 HtPct WidthPct @@;

       *** transform variables;
       if Weight <= 0 or Weight =. then delete;
       Weight3=Weight**(1/3);
       Height=HtPct*Length3/(Weight3*100);
       Width=WidthPct*Length3/(Weight3*100);
       Length1=Length1/Weight3;
       Length2=Length2/Weight3;
       Length3=Length3/Weight3;
       logLengthRatio=log(Length3/Length1);

       format Species specfmt.;
       /* specfmt2. is not defined in this excerpt; define it before running */
       symbol = put(Species, specfmt2.);
       datalines;
    1  242.0 23.2 25.4 30.0 38.4 13.4   1  290.0 24.0 26.3 31.2 40.0 13.8
    1  340.0 23.9 26.5 31.1 39.8 15.1   1  363.0 26.3 29.0 33.5 38.0 13.3
    1  430.0 26.5 29.0 34.0 36.6 15.1   1  450.0 26.8 29.7 34.7 39.2 14.2
    1  500.0 26.8 29.7 34.5 41.1 15.3   1  390.0 27.6 30.0 35.0 36.2 13.4
    1  450.0 27.6 30.0 35.1 39.9 13.8   1  500.0 28.5 30.7 36.2 39.3 13.7
    1  475.0 28.4 31.0 36.2 39.4 14.1   1  500.0 28.7 31.0 36.2 39.7 13.3
    1  500.0 29.1 31.5 36.4 37.8 12.0   1     .  29.5 32.0 37.3 37.3 13.6
    1  600.0 29.4 32.0 37.2 40.2 13.9   1  600.0 29.4 32.0 37.2 41.5 15.0

       ... more lines ...

    7    9.8 11.4 12.0 13.2 16.7  8.7   7   12.2 11.5 12.2 13.4 15.6 10.4
    7   13.4 11.7 12.4 13.5 18.0  9.4   7   12.2 12.1 13.0 13.8 16.5  9.1
    7   19.7 13.2 14.3 15.2 18.9 13.6   7   19.9 13.8 15.0 16.2 18.1 11.6
    ;

The double trailing at sign (@@) in the INPUT statement specifies that observations are input from each line until all values are read. The variables are rescaled in order to adjust for dimensionality. Because the new variables Weight3–logLengthRatio depend on the variable Weight, observations with missing values for Weight are not added to the data set. Consequently, there are 157 observations in the SAS data set Fish.

In the Fish data set, the variables are not measured in the same units and cannot be assumed to have equal variance. Therefore, it is necessary to standardize the variables before performing the cluster analysis.
The following statements standardize the variables and perform a cluster analysis on the standardized data:

    proc standard data=Fish out=Stand mean=0 std=1;
       var Length1 logLengthRatio Height Width Weight3;
    proc fastclus data=Stand out=Clust
                  maxclusters=7 maxiter=100;
       var Length1 logLengthRatio Height Width Weight3;
    run;

The STANDARD procedure is first used to standardize all the analytical variables to a mean of 0 and standard deviation of 1. The procedure creates the output data set Stand to contain the transformed variables.

The FASTCLUS procedure then uses the data set Stand as input and creates the data set Clust. This output data set contains the original variables and two new variables, Cluster and Distance. The variable Cluster contains the cluster number to which each observation has been assigned. The variable Distance gives the distance from the observation to its cluster seed.

It is usually desirable to try several values of the MAXCLUSTERS= option. A reasonable beginning for this example is to use MAXCLUSTERS=7, since there are seven species of fish represented in the data set Fish.

The VAR statement specifies the variables used in the cluster analysis.

The results from this analysis are displayed in the following figures.
Figure 34.1 Initial Seeds Used in the FASTCLUS Procedure

    Fish Measurement Data

    The FASTCLUS Procedure
    Replace=FULL  Radius=0  Maxclusters=7  Maxiter=100  Converge=0.02

    Initial Seeds

    Cluster         Length1  logLengthRatio          Height           Width         Weight3
    ---------------------------------------------------------------------------------------
          1     1.388338414    -0.979577858    -1.594561848    -2.254050655     2.103447062
          2    -1.117178039    -0.877218192    -0.336166276     2.528114070     1.170706464
          3     2.393997461    -0.662642015    -0.930738701    -2.073879107    -1.839325419
          4    -0.495085516    -0.964041012    -0.265106856    -0.028245072     1.536846394
          5    -0.728772773     0.540096664     1.130501398    -1.207930053    -1.107018207
          6    -0.506924177     0.748211648     1.762482687     0.211507596     1.368987826
          7     1.573996573    -0.796593995    -0.824217424     1.561715851    -1.607942726

    Criterion Based on Final Seeds = 0.3979

Figure 34.1 displays the table of initial seeds used for each variable and cluster. The first line in the figure displays the option settings for REPLACE, RADIUS, MAXCLUSTERS, and MAXITER. These options, with the exception of MAXCLUSTERS and MAXITER, are set at their respective default values (REPLACE=FULL, RADIUS=0). Both the MAXCLUSTERS= and MAXITER= options are set in the PROC FASTCLUS statement.

Next, PROC FASTCLUS produces a table of summary statistics for the clusters. Figure 34.2 displays the number of observations in the cluster (frequency) and the root mean squared standard deviation. The next two columns display the largest Euclidean distance from the cluster seed to any observation within the cluster and the number of the nearest cluster.

The last column of the table displays the distance between the centroid of the nearest cluster and the centroid of the current cluster. A centroid is the point having coordinates that are the means of all the observations in the cluster.
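Two of the quantities just described can be checked by hand. The Python sketch below is an illustration rather than the procedure's own computation. It assumes that the root mean squared standard deviation is the root mean square, taken across the five variables, of a cluster's per-variable standard deviations; the input values are the Cluster 1 standard deviations printed in Figure 34.4 later in this section. The helper names rms_std and centroid and the two-point centroid example are made up for this sketch.

```python
import math

# Cluster 1 standard deviations as printed in Figure 34.4
# (Length1, logLengthRatio, Height, Width, Weight3).
cluster1_std = [0.3418476428, 0.3544065543, 0.1666302451,
                0.6172880027, 0.7944227150]

def rms_std(stds):
    """Root mean square of the per-variable standard deviations (assumed definition)."""
    return math.sqrt(sum(s * s for s in stds) / len(stds))

def centroid(points):
    """Point whose coordinates are the means of the observations' coordinates."""
    return tuple(sum(c) / len(points) for c in zip(*points))

rms1 = rms_std(cluster1_std)
print(round(rms1, 4))                           # 0.5064
print(centroid([(0.0, 0.0), (2.0, 4.0)]))       # (1.0, 2.0)
```

Under that assumption the result agrees with the RMS Std Deviation of 0.5064 reported for Cluster 1 in Figure 34.2.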
Figure 34.2 Cluster Summary Table from the FASTCLUS Procedure

    Cluster Summary

                                       Maximum Distance                        Distance Between
                            RMS Std           from Seed     Radius   Nearest            Cluster
    Cluster   Frequency   Deviation      to Observation   Exceeded   Cluster          Centroids
    -------------------------------------------------------------------------------------------
          1          17      0.5064              1.7781                    4             2.5106
          2          19      0.3696              1.5007                    4             1.5510
          3          13      0.3803              1.7135                    1             2.6704
          4          13      0.4161              1.3976                    7             1.4266
          5          11      0.2466              0.6966                    6             1.7301
          6          34      0.3563              1.5443                    5             1.7301
          7          50      0.4447              2.3915                    4             1.4266

Figure 34.3 displays the table of statistics for the variables. The table lists for each variable the total standard deviation, the pooled within-cluster standard deviation, and the R-square value for predicting the variable from the cluster. The ratio of between-cluster variance to within-cluster variance (R² to 1 − R²) appears in the last column.

Figure 34.3 Statistics for Variables Used in the FASTCLUS Procedure

    Statistics for Variables

    Variable          Total STD   Within STD   R-Square   RSQ/(1-RSQ)
    -----------------------------------------------------------------
    Length1             1.00000      0.31428   0.905030      9.529606
    logLengthRatio      1.00000      0.39276   0.851676      5.741989
    Height              1.00000      0.20917   0.957929     22.769295
    Width               1.00000      0.55558   0.703200      2.369270
    Weight3             1.00000      0.47251   0.785323      3.658162
    OVER-ALL            1.00000      0.40712   0.840631      5.274764

    Pseudo F Statistic = 131.87
    Approximate Expected Over-All R-Squared = 0.57420

The pseudo F statistic, approximate expected overall R square, and cubic clustering criterion (CCC) are listed at the bottom of the figure. You can compare values of these statistics by running PROC FASTCLUS with different values for the MAXCLUSTERS= option. The R square and CCC values are not valid for correlated variables.

Values of the cubic clustering criterion greater than 2 or 3 indicate good clusters.
Values between 0 and 2 indicate potential clusters, but they should be taken with caution; large negative values can indicate outliers.

PROC FASTCLUS next produces the within-cluster means and standard deviations of the variables, displayed in Figure 34.4.

Figure 34.4 Cluster Means and Standard Deviations from the FASTCLUS Procedure

    Cluster Means

    Cluster         Length1  logLengthRatio          Height           Width         Weight3
    ---------------------------------------------------------------------------------------
          1     1.747808245    -0.868605685    -1.327226832    -1.128760946     0.806373599
          2    -0.405231510    -0.979113021    -0.281064162     1.463094486     1.060450065
          3     2.006796315    -0.652725165    -1.053213440    -1.224020795    -1.826752838
          4    -0.136820952    -1.039312574    -0.446429482     0.162596336     0.278560318
          5    -0.850130601     0.550190242     1.245156076    -0.836585750    -0.567022647
          6    -0.843912827     1.522291347     1.511408739    -0.380323563     0.763114370
          7    -0.165570970    -0.048881276    -0.353723615     0.546442064    -0.668780782

    Cluster Standard Deviations

    Cluster         Length1  logLengthRatio          Height           Width         Weight3
    ---------------------------------------------------------------------------------------
          1    0.3418476428    0.3544065543    0.1666302451    0.6172880027    0.7944227150
          2    0.3129902863    0.3592350778    0.1369052680    0.5467406493    0.3720119097
          3    0.2962504486    0.1740941675    0.1736086707    0.7528475622    0.0905232968
          4    0.3254364840    0.2836681149    0.1884592934    0.4543390702    0.6612055341
          5    0.1781837609    0.0745984121    0.2056932592    0.2784540794    0.3832002850
          6    0.2273744242    0.3385584051    0.2046010964    0.5143496067    0.4025849044
          7    0.3734733622    0.5275768119    0.2551130680    0.5721303628    0.4223181710

It is useful to study further the clusters calculated by the FASTCLUS procedure. One method is to look at a frequency tabulation of the clusters with other classification variables.
The following statements invoke the FREQ procedure to crosstabulate the empirical clusters with the variable Species:

    proc freq data=Clust;
       tables Species*Cluster;
    run;

Figure 34.5 displays the marked division between clusters.

Figure 34.5 Frequency Table of Cluster versus Species

    Fish Measurement Data

    The FREQ Procedure

    Table of Species by CLUSTER

    Species     CLUSTER(Cluster)

    Frequency |
    Percent   |
    Row Pct   |
    Col Pct   |      1 |      2 |      3 |      4 |      5 |      6 |      7 |  Total
    ----------+--------+--------+--------+--------+--------+--------+--------+
    Bream     |      0 |      0 |      0 |      0 |      0 |     34 |      0 |     34
              |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |  21.66 |   0.00 |  21.66
              |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 | 100.00 |   0.00 |
              |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 | 100.00 |   0.00 |
    ----------+--------+--------+--------+--------+--------+--------+--------+
    Roach     |      0 |      0 |      0 |      0 |      0 |      0 |     19 |     19
              |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |  12.10 |  12.10
              |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 | 100.00 |
              |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |  38.00 |
    ----------+--------+--------+--------+--------+--------+--------+--------+
    Whitefish |      0 |      2 |      0 |      1 |      0 |      0 |      3 |      6
              |   0.00 |   1.27 |   0.00 |   0.64 |   0.00 |   0.00 |   1.91 |   3.82
              |   0.00 |  33.33 |   0.00 |  16.67 |   0.00 |   0.00 |  50.00 |
              |   0.00 |  10.53 |   0.00 |   7.69 |   0.00 |   0.00 |   6.00 |
    ----------+--------+--------+--------+--------+--------+--------+--------+
    Parkki    |      0 |      0 |      0 |      0 |     11 |      0 |      0 |     11
              |   0.00 |   0.00 |   0.00 |   0.00 |   7.01 |   0.00 |   0.00 |   7.01
              |   0.00 |   0.00 |   0.00 |   0.00 | 100.00 |   0.00 |   0.00 |
              |   0.00 |   0.00 |   0.00 |   0.00 | 100.00 |   0.00 |   0.00 |
    ----------+--------+--------+--------+--------+--------+--------+--------+
    Perch     |      0 |     17 |      0 |     12 |      0 |      0 |     27 |     56
              |   0.00 |  10.83 |   0.00 |   7.64 |   0.00 |   0.00 |  17.20 |  35.67
              |   0.00 |  30.36 |   0.00 |  21.43 |   0.00 |   0.00 |  48.21 |
              |   0.00 |  89.47 |   0.00 |  92.31 |   0.00 |   0.00 |  54.00 |
    ----------+--------+--------+--------+--------+--------+--------+--------+
    Pike      |     17 |      0 |      0 |      0 |      0 |      0 |      0 |     17
              |  10.83 |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |  10.83
              | 100.00 |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |
              | 100.00 |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |
    ----------+--------+--------+--------+--------+--------+--------+--------+
    Smelt     |      0 |      0 |     13 |      0 |      0 |      0 |      1 |     14
              |   0.00 |   0.00 |   8.28 |   0.00 |   0.00 |   0.00 |   0.64 |   8.92
              |   0.00 |   0.00 |  92.86 |   0.00 |   0.00 |   0.00 |   7.14 |
              |   0.00 |   0.00 | 100.00 |   0.00 |   0.00 |   0.00 |   2.00 |
    ----------+--------+--------+--------+--------+--------+--------+--------+
    Total           17       19       13       13       11       34       50      157
                 10.83    12.10     8.28     8.28     7.01    21.66    31.85   100.00

For cases in which you have three or more clusters, you can use the CANDISC and SGPLOT procedures to obtain a graphical check on the distribution of the clusters. In the following statements, the CANDISC and SGPLOT procedures are used to compute canonical variables and plot the clusters:

    proc candisc data=Clust out=Can noprint;
       class Cluster;
       var Length1 logLengthRatio Height Width Weight3;
    proc sgplot data=Can;
       scatter y=Can2 x=Can1 / group=Cluster;
    run;

First, the CANDISC procedure is invoked to perform a canonical discriminant analysis by using the data set Clust and creating the output SAS data set Can. The NOPRINT option suppresses display of the output. The CLASS statement specifies the variable Cluster to define groups for the analysis. The VAR statement specifies the variables used in the analysis.
Next, the SGPLOT procedure plots the two canonical variables from PROC CANDISC, Can1 and Can2. The GROUP= option in the SCATTER statement specifies the variable Cluster as the grouping variable.

The resulting plot (Figure 34.6) illustrates the spatial separation of the clusters calculated in the FASTCLUS procedure.

Figure 34.6 Plot of Canonical Variables and Cluster Value

Syntax: FASTCLUS Procedure

The following statements are available in the FASTCLUS procedure:

    PROC FASTCLUS < DATA=SAS-data-set > < MAXCLUSTERS=n > < RADIUS=t > ;
       VAR variables ;
       ID variables ;
       FREQ variable ;
       WEIGHT variable ;
       BY variables ;

Usually you need only the VAR statement in addition to the PROC FASTCLUS statement. The BY, FREQ, ID, VAR, and WEIGHT statements are described in alphabetical order after the PROC FASTCLUS statement.

PROC FASTCLUS Statement

    PROC FASTCLUS MAXCLUSTERS=n | RADIUS=t < options > ;

You must specify the MAXCLUSTERS= option or RADIUS= option or both in the PROC FASTCLUS statement.

MAXCLUSTERS=n
MAXC=n
    specifies the maximum number of clusters permitted. If you omit the MAXCLUSTERS= option, a value of 100 is assumed.

RADIUS=t
R=t
    establishes the minimum distance criterion for selecting new seeds. No observation is considered as a new seed unless its minimum distance to previous seeds exceeds the value given by the RADIUS= option. The default value is 0. If you specify the REPLACE=RANDOM option, the RADIUS= option is ignored.

You can specify the following options in the PROC FASTCLUS statement. Table 34.1 summarizes the options.
Table 34.1 PROC FASTCLUS Statement Options

    Option              Description
    ------------------------------------------------------------------------------
    Specify input and output data sets
    DATA=               specifies input data set
    INSTAT=             specifies input SAS data set previously created by the
                        OUTSTAT= option
    SEED=               specifies input SAS data set for selecting initial
                        cluster seeds
    VARDEF=             specifies divisor for variances

    Output Data Processing
    CLUSTER=            specifies name for cluster membership variable in
                        OUTSEED= and OUT= data sets
    CLUSTERLABEL=       specifies label for cluster membership variable in
                        OUTSEED= and OUT= data sets
    OUT=                specifies output SAS data set containing original data
                        and cluster assignments
    OUTITER             specifies writing to OUTSEED= data set on every iteration
    OUTSEED= or MEAN=   specifies output SAS data set containing cluster centers
    OUTSTAT=            specifies output SAS data set containing statistics

    Initial Clusters
    DRIFT               permits cluster seeds to drift during initialization
    MAXCLUSTERS=        specifies maximum number of clusters
    RADIUS=             specifies minimum distance for selecting new seeds
    RANDOM=             specifies seed to initialize the pseudo-random number
                        generator
    REPLACE=            specifies seed replacement method

    Clustering Methods
    CONVERGE=           specifies convergence criterion
    DELETE=             deletes cluster seeds with few observations
    LEAST=              optimizes an Lp criterion, where 1 ≤ p ≤ ∞
    MAXITER=            specifies maximum number of iterations
    STRICT              prevents an observation from being assigned to a cluster
                        if its distance to the nearest cluster seed is large

    Arcane Algorithmic Options
    BINS=               specifies number of bins used for computing medians for
                        LEAST=1
    HC=                 specifies criterion for updating the homotopy parameter
    HP=                 specifies initial value of the homotopy parameter
    IRLS                uses an iteratively reweighted least squares method
                        instead of the modified Ekblom-Newton method for
                        1 < p < 2

    Missing Values
    IMPUTE              imputes missing values after final cluster assignment
    NOMISS              excludes observations with missing values

    Control Displayed Output
    DISTANCE            displays distances between cluster centers
    LIST                displays cluster assignments for all observations
    NOPRINT             suppresses displayed output
    SHORT               suppresses display of large matrices
    SUMMARY             suppresses display of all results except for the cluster
                        summary

The following list provides details on these options. The list is in alphabetical order.

BINS=n
    specifies the number of bins used in the bin-sort algorithm for computing medians for LEAST=1. By default, PROC FASTCLUS uses from 10 to 100 bins, depending on the amount of memory available. Larger values use more memory and make each iteration somewhat slower, but they can reduce the number of iterations. Smaller values have the opposite effect. The minimum value of n is 5.

CLUSTER=name
    specifies a name for the variable in the OUTSEED= and OUT= data sets that indicates cluster membership. The default name for this variable is CLUSTER.

CLUSTERLABEL=name
    specifies a label for the variable CLUSTER in the OUTSEED= and OUT= data sets. By default this variable has no label.

CONVERGE=c
CONV=c
    specifies the convergence criterion. Any nonnegative value is permitted. The default value is 0.0001 for all values of p if LEAST=p is explicitly specified; otherwise, the default value is 0.02. Iterations stop when the maximum relative change in the cluster seeds is less than or equal to the convergence criterion and additional conditions on the homotopy parameter, if any, are satisfied (see the HP= option). The relative change in a cluster seed is the distance between the old seed and the new seed divided by a scaling factor. If you do not specify the LEAST= option, the scaling factor is the minimum distance between the initial seeds. If you specify the LEAST= option, the scaling factor is an L1 scale estimate and is recomputed on each iteration. Specify the CONVERGE= option only if you specify a MAXITER= value greater than 1.
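The stopping rule in the CONVERGE= description, for the case where LEAST= is not specified, can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure's implementation: the function names and the seed coordinates are invented for the example, and the L1 scale estimate used when LEAST= is specified is not covered.

```python
import math
from itertools import combinations

def min_seed_distance(seeds):
    """Scaling factor when LEAST= is omitted: minimum distance between initial seeds."""
    return min(math.dist(a, b) for a, b in combinations(seeds, 2))

def max_relative_change(old_seeds, new_seeds, scale):
    """Largest distance moved by any seed, divided by the scaling factor."""
    return max(math.dist(o, n) for o, n in zip(old_seeds, new_seeds)) / scale

# Hypothetical seeds before and after one iteration.
initial = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
updated = [(0.03, 0.0), (4.0, 0.04), (0.0, 3.0)]

scale = min_seed_distance(initial)                     # 3.0
change = max_relative_change(initial, updated, scale)  # 0.04 / 3

print(change <= 0.02)     # True:  stops under the default CONVERGE=0.02
print(change <= 0.0001)   # False: would continue under CONVERGE=0.0001
```

The same seed movement therefore counts as converged under the 0.02 default but not under the stricter 0.0001 default that applies when LEAST=p is specified.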
DATA=SAS-data-set
    specifies the input data set containing observations to be clustered. If you omit the DATA= option, the most recently created SAS data set is used. The data must be coordinates, not distances, similarities, or correlations.

DELETE=n
    deletes cluster seeds to which n or fewer observations are assigned. Deletion occurs after processing for the DRIFT option is completed and after each iteration specified by the MAXITER= option. Cluster seeds are not deleted after the final assignment of observations to clusters, so in rare cases a final cluster might not have more than n members. The DELETE= option is ineffective if you specify MAXITER=0 and do not specify the DRIFT option. By default, no cluster seeds are deleted.

DISTANCE | DIST
    computes distances between the cluster means.

DRIFT
    executes the second of the four steps described in the section “Background.” After initial seed selection, each observation is assigned to the cluster with the nearest seed. After an observation is processed, the seed of the cluster to which it is assigned is recalculated as the mean of the observations currently assigned to the cluster. Thus, the cluster seeds drift about rather than remaining fixed for the duration of the pass.

HC=c
HP=p1 < p2 >
    pertains to the homotopy parameter for LEAST=p, where 1 < p < 2. You should specify these options only if you encounter convergence problems when you use the default values.

    For 1 < p < 2, PROC FASTCLUS tries to optimize a perturbed variant of the Lp clustering criterion (Gonin and Money 1989, pp. 5–6). When the homotopy parameter is 0, the optimization criterion is equivalent to the clustering criterion. For a large homotopy parameter, the optimization criterion approaches the least squares criterion and is therefore easy to optimize. Beginning with a large homotopy parameter, PROC FASTCLUS gradually decreases it by a factor in the range [0.01,0.5] over the course of the iterations.
   When both the homotopy parameter and the convergence measure are sufficiently small, the optimization process is declared to have converged. If the initial homotopy parameter is too large or if it is decreased too slowly, the optimization can require many iterations. If the initial homotopy parameter is too small or if it is decreased too quickly, convergence to a local optimum is likely. The following list gives details on setting the homotopy parameter.

   HC=c
      specifies the criterion for updating the homotopy parameter. The homotopy parameter is updated when the maximum relative change in the cluster seeds is less than or equal to c. The default is the minimum of 0.01 and 100 times the value of the CONVERGE= option.

   HP=p1
      specifies p1 as the initial value of the homotopy parameter. The default is 0.05 if the modified Ekblom-Newton method is used; otherwise, it is 0.25.

   HP=p1 p2
      also specifies p2 as the minimum value for the homotopy parameter, which must be reached for convergence. The default is the minimum of p1 and 0.01 times the value of the CONVERGE= option.

IMPUTE
   requests imputation of missing values after the final assignment of observations to clusters. If an observation that is assigned (or would have been assigned) to a cluster has a missing value for variables used in the cluster analysis, the missing value is replaced by the corresponding value in the cluster seed to which the observation is assigned (or would have been assigned). If the observation cannot be assigned to a cluster, missing value replacement depends on whether or not the NOMISS option is specified. If NOMISS is not specified, missing values are replaced by the mean of all observations in the DATA= data set having a value for that variable. If NOMISS is specified, missing values are replaced by the mean of only the observations used in the analysis. (A weighted mean is used if a variable is specified in the WEIGHT statement.)
   For information about cluster assignment, see the section “OUT= Data Set” on page 1643. If you specify the IMPUTE option, the imputed values are not used in computing cluster statistics. If you also request an OUT= data set, it contains the imputed values.

INSTAT=SAS-data-set
   reads a SAS data set previously created with the FASTCLUS procedure by using the OUTSTAT= option. If you specify the INSTAT= option, no clustering iterations are performed and no output is displayed. Only cluster assignment and imputation are performed as an OUT= data set is created.

IRLS
   causes PROC FASTCLUS to use an iteratively reweighted least squares method instead of the modified Ekblom-Newton method. If you specify the IRLS option, you must also specify LEAST=p, where 1 < p < 2. Use the IRLS option only if you encounter convergence problems with the default method.

LEAST=p | MAX
L=p | MAX
   causes PROC FASTCLUS to optimize an Lp criterion, where 1 ≤ p ≤ ∞ (Spath 1985, pp. 62–63). Infinity is indicated by LEAST=MAX. The value of this clustering criterion is displayed in the iteration history.
   If you do not specify the LEAST= option, PROC FASTCLUS uses the least squares (L2) criterion. However, the default number of iterations is only 1 if you omit the LEAST= option, so the optimization of the criterion is generally not completed. If you specify the LEAST= option, the maximum number of iterations is increased to give the optimization process a chance to converge. See the MAXITER= option for details.
   Specifying the LEAST= option also changes the default convergence criterion from 0.02 to 0.0001. See the CONVERGE= option for details.
   When LEAST=2, PROC FASTCLUS tries to minimize the root mean squared difference between the data and the corresponding cluster means.
   When LEAST=1, PROC FASTCLUS tries to minimize the mean absolute difference between the data and the corresponding cluster medians.
   When LEAST=MAX, PROC FASTCLUS tries to minimize the maximum absolute difference between the data and the corresponding cluster midranges.
   For general values of p, PROC FASTCLUS tries to minimize the pth root of the mean of the pth powers of the absolute differences between the data and the corresponding cluster seeds.
   The divisor in the clustering criterion is either the number of nonmissing data used in the analysis or, if there is a WEIGHT statement, the sum of the weights corresponding to all the nonmissing data used in the analysis (that is, an observation with n nonmissing data contributes n times the observation weight to the divisor). The divisor is not adjusted for degrees of freedom.
   The method for updating cluster seeds during iteration depends on the LEAST= option, as follows (Gonin and Money 1989).

   LEAST=p
   p=1 ...

If you specify the LEAST=p option, with 1 < p < 2, and you omit the IRLS option, an additional column is displayed in the Iteration History table. This column contains a character to identify the method used in each iteration. PROC FASTCLUS chooses the most efficient method to cluster the data at each iterative step, given the condition of the data. Thus, the method chosen is data dependent.
The possible values are described as follows:

   Value     Method
   N         Newton’s Method
   I or L    iteratively weighted least squares (IRLS)
   1         IRLS step, halved once
   2         IRLS step, halved twice
   3         IRLS step, halved three times

PROC FASTCLUS displays a Cluster Summary, giving the following for each cluster:

   Cluster number
   Frequency, the number of observations in the cluster
   Weight, the sum of the weights of the observations in the cluster, if you specify the WEIGHT statement
   RMS Std Deviation, the root mean squared across variables of the cluster standard deviations, which is equal to the root mean square distance between observations in the cluster
   Maximum Distance from Seed to Observation, the maximum distance from the cluster seed to any observation in the cluster
   Nearest Cluster, the number of the cluster with mean closest to the mean of the current cluster
   Centroid Distance, the distance between the centroids (means) of the current cluster and the nearest other cluster

A table of statistics for each variable is displayed unless you specify the SUMMARY option. The table contains the following:

   Total STD, the total standard deviation
   Within STD, the pooled within-cluster standard deviation
   R-Squared, the R square for predicting the variable from the cluster
   RSQ/(1 - RSQ), the ratio of between-cluster variance to within-cluster variance, $R^2/(1-R^2)$
   OVER-ALL, all of the previous quantities pooled across variables

PROC FASTCLUS also displays the following:

   Pseudo F Statistic,

   $$\frac{R^2/(c-1)}{(1-R^2)/(n-c)}$$

   where R square is the observed overall R square, c is the number of clusters, and n is the number of observations. The pseudo F statistic was suggested by Calinski and Harabasz (1974). See Milligan and Cooper (1985) and Cooper and Milligan (1988) regarding the use of the pseudo F statistic in estimating the number of clusters. See Example 29.2 in Chapter 29, “The CLUSTER Procedure,” for a comparison of pseudo F statistics.
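The pseudo F formula can be checked against the values reported in Output 34.1.1 and Output 34.1.2 for Fisher's iris data. The following DATA step is a minimal sketch, not part of the original example; the variable names r2, c, n, and pseudo_f are chosen here for illustration, and the R-squared inputs are the rounded OVER-ALL values from those outputs.

```sas
/* Sketch: recompute the pseudo F statistic from the overall R-square, */
/* the number of clusters c, and the number of observations n.         */
/* Variable names are illustrative only.                                */
data _null_;
   input r2 c n;
   pseudo_f = (r2 / (c - 1)) / ((1 - r2) / (n - c));
   put c= n= r2= pseudo_f= 8.2;
   datalines;
0.776410 2 150
0.884275 3 150
;
run;
```

The two results should come out near 513.9 and 561.6, in line with the Pseudo F Statistic values of 513.92 and 561.63 in the listings; small differences in the last digit are expected because the R-squared inputs are rounded.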
   Observed Overall R-Squared, if you specify the SUMMARY option
   Approximate Expected Overall R-Squared, the approximate expected value of the overall R square under the uniform null hypothesis assuming that the variables are uncorrelated. The value is missing if the number of clusters is greater than one-fifth the number of observations.
   Cubic Clustering Criterion, computed under the assumption that the variables are uncorrelated. The value is missing if the number of clusters is greater than one-fifth the number of observations.
   If you are interested in the approximate expected R square or the cubic clustering criterion but your variables are correlated, you should cluster principal component scores from the PRINCOMP procedure. Both of these statistics are described by Sarle (1983). The performance of the cubic clustering criterion in estimating the number of clusters is examined by Milligan and Cooper (1985) and Cooper and Milligan (1988).
   Distances Between Cluster Means, if you specify the DISTANCE option

Unless you specify the SHORT or SUMMARY option, PROC FASTCLUS displays the following:

   Cluster Means for each variable
   Cluster Standard Deviations for each variable

ODS Table Names

PROC FASTCLUS assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in Table 34.4.
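As an illustration of referencing tables by name, the following sketch uses the ODS OUTPUT statement to save two of the tables from Table 34.4 as data sets. The input data set mydata, the variables x and y, and the output data set names are hypothetical. Whether a given table is produced depends on the options in effect, and the columns of the saved data sets are not described here, so inspect them before relying on them.

```sas
/* Sketch: capture PROC FASTCLUS tables as data sets by ODS table name. */
/* mydata, x, y, pseudo_f, and varstat are hypothetical names.          */
ods output PseudoFStat=work.pseudo_f
           VariableStat=work.varstat;

proc fastclus data=mydata maxclusters=3 maxiter=10;
   var x y;
run;

/* Inspect the saved tables */
proc print data=work.pseudo_f;
run;

proc print data=work.varstat;
run;
```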
For more information on ODS, see Chapter 20, “Using the Output Delivery System.”

Table 34.4 ODS Tables Produced by PROC FASTCLUS

ODS Table Name         Description                                                     Statement   Option
ApproxExpOverAllRSq    Approximate expected overall R-squared, single number           PROC        default
CCC                    CCC, Cubic Clustering Criterion, single number                  PROC        default
ClusterList            Cluster listing, obs, id, and distances                         PROC        LIST
ClusterSum             Cluster summary, cluster number, distances                      PROC        PRINTALL
ClusterCenters         Cluster centers                                                 PROC        default
ClusterDispersion      Cluster dispersion                                              PROC        default
ConvergenceStatus      Convergence status                                              PROC        PRINTALL
Criterion              Criterion based on final seeds, single number                   PROC        default
DistBetweenClust       Distance between clusters                                       PROC        default
InitialSeeds           Initial seeds                                                   PROC        default
IterHistory            Iteration history, various statistics for each iteration       PROC        PRINTALL
MinDist                Minimum distance between initial seeds, single number           PROC        PRINTALL
NumberOfBins           Number of bins                                                  PROC        default
ObsOverAllRSquare      Observed overall R-squared, single number                       PROC        SUMMARY
PrelScaleEst           Preliminary L(1) scale estimate, single number                  PROC        PRINTALL
PseudoFStat            Pseudo F statistic, single number                               PROC        default
SimpleStatistics       Simple statistics for input variables                           PROC        default
VariableStat           Statistics for variables within clusters                        PROC        default

Examples: FASTCLUS Procedure

Example 34.1: Fisher’s Iris Data

The iris data published by Fisher (1936) have been widely used for examples in discriminant analysis and cluster analysis. The sepal length, sepal width, petal length, and petal width are measured in millimeters on 50 iris specimens from each of three species, Iris setosa, I. versicolor, and I. virginica. Mezzich and Solomon (1980) discuss a variety of cluster analyses of the iris data.

In this example, the FASTCLUS procedure is used to find two and then three clusters.
In the following code, an output data set is created, and PROC FREQ is invoked to compare the clusters with the species classification. See Output 34.1.1 and Output 34.1.2 for these results. For three clusters, you can use the CANDISC procedure to compute canonical variables for plotting the clusters. See Output 34.1.3 and Output 34.1.4 for the results.

proc format;
   value specname
      1='Setosa '
      2='Versicolor'
      3='Virginica ';
run;

data iris;
   title 'Fisher (1936) Iris Data';
   input SepalLength SepalWidth PetalLength PetalWidth Species @@;
   format Species specname.;
   label SepalLength='Sepal Length in mm.'
         SepalWidth ='Sepal Width in mm.'
         PetalLength='Petal Length in mm.'
         PetalWidth ='Petal Width in mm.';
   symbol = put(species, specname10.);
   datalines;
50 33 14 02 1 64 28 56 22 3 65 28 46 15 2 67 31 56 24 3
63 28 51 15 3 46 34 14 03 1 69 31 51 23 3 62 22 45 15 2
59 32 48 18 2 46 36 10 02 1 61 30 46 14 2 60 27 51 16 2
65 30 52 20 3 56 25 39 11 2 65 30 55 18 3 58 27 51 19 3
68 32 59 23 3 51 33 17 05 1 57 28 45 13 2 62 34 54 23 3
... more lines ...
55 23 40 13 2 66 30 44 14 2 68 28 48 14 2 54 34 17 02 1
51 37 15 04 1 52 35 15 02 1 58 28 51 24 3 67 30 50 17 2
63 33 60 25 3 53 37 15 02 1
;

proc fastclus data=iris maxc=2 maxiter=10 out=clus;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

proc freq;
   tables cluster*species;
run;

proc fastclus data=iris maxc=3 maxiter=10 out=clus;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

proc freq;
   tables cluster*Species;
run;

proc candisc anova out=can;
   class cluster;
   var SepalLength SepalWidth PetalLength PetalWidth;
   title2 'Canonical Discriminant Analysis of Iris Clusters';
run;

proc sgplot data=Can;
   scatter y=Can2 x=Can1 / group=Cluster;
   title2 'Plot of Canonical Variables Identified by Cluster';
run;

Output 34.1.1 Fisher’s Iris Data: PROC FASTCLUS with MAXC=2 and PROC FREQ

Fisher (1936) Iris Data

The FASTCLUS Procedure
Replace=FULL  Radius=0  Maxclusters=2  Maxiter=10  Converge=0.02

Initial Seeds

Cluster    SepalLength     SepalWidth    PetalLength     PetalWidth
-------------------------------------------------------------------
1          43.00000000    30.00000000    11.00000000     1.00000000
2          77.00000000    26.00000000    69.00000000    23.00000000

Minimum Distance Between Initial Seeds = 70.85196

Iteration History

                          Relative Change in Cluster Seeds
Iteration    Criterion           1            2
----------------------------------------------
1              11.0638      0.1904       0.3163
2               5.3780      0.0596       0.0264
3               5.0718      0.0174      0.00766

Convergence criterion is satisfied.
Criterion Based on Final Seeds = 5.0417

Cluster Summary

                                    Maximum Distance
                        RMS Std            from Seed      Radius    Nearest
Cluster    Frequency  Deviation       to Observation    Exceeded    Cluster
---------------------------------------------------------------------------
1                 53     3.7050              21.1621                      2
2                 97     5.6779              24.6430                      1

Cluster Summary

            Distance Between
Cluster    Cluster Centroids
-----------------------------
1                    39.2879
2                    39.2879

Statistics for Variables

Variable        Total STD    Within STD    R-Square    RSQ/(1-RSQ)
------------------------------------------------------------------
SepalLength       8.28066       5.49313    0.562896       1.287784
SepalWidth        4.35866       3.70393    0.282710       0.394137
PetalLength      17.65298       6.80331    0.852470       5.778291
PetalWidth        7.62238       3.57200    0.781868       3.584390
OVER-ALL         10.69224       5.07291    0.776410       3.472463

Pseudo F Statistic = 513.92

Approximate Expected Over-All R-Squared = 0.51539

Cubic Clustering Criterion = 14.806

WARNING: The two values above are invalid for correlated variables.
Cluster Means

Cluster    SepalLength     SepalWidth    PetalLength     PetalWidth
-------------------------------------------------------------------
1          50.05660377    33.69811321    15.60377358     2.90566038
2          63.01030928    28.86597938    49.58762887    16.95876289

Cluster Standard Deviations

Cluster    SepalLength     SepalWidth    PetalLength     PetalWidth
-------------------------------------------------------------------
1          3.427350930    4.396611045    4.404279486    2.105525249
2          6.336887455    3.267991438    7.800577673    4.155612484

Fisher (1936) Iris Data

The FREQ Procedure

Table of CLUSTER by Species

CLUSTER(Cluster)     Species

Frequency|
Percent  |
Row Pct  |
Col Pct  |Setosa  |Versicol|Virginic|  Total
         |        |or      |a       |
---------+--------+--------+--------+
       1 |     50 |      3 |      0 |     53
         |  33.33 |   2.00 |   0.00 |  35.33
         |  94.34 |   5.66 |   0.00 |
         | 100.00 |   6.00 |   0.00 |
---------+--------+--------+--------+
       2 |      0 |     47 |     50 |     97
         |   0.00 |  31.33 |  33.33 |  64.67
         |   0.00 |  48.45 |  51.55 |
         |   0.00 |  94.00 | 100.00 |
---------+--------+--------+--------+
Total          50       50       50      150
            33.33    33.33    33.33   100.00

Output 34.1.2 Fisher’s Iris Data: PROC FASTCLUS with MAXC=3 and PROC FREQ

Fisher (1936) Iris Data

The FASTCLUS Procedure
Replace=FULL  Radius=0  Maxclusters=3  Maxiter=10  Converge=0.02

Initial Seeds

Cluster    SepalLength     SepalWidth    PetalLength     PetalWidth
-------------------------------------------------------------------
1          58.00000000    40.00000000    12.00000000     2.00000000
2          77.00000000    38.00000000    67.00000000    22.00000000
3          49.00000000    25.00000000    45.00000000    17.00000000

Minimum Distance Between Initial Seeds = 38.23611

Iteration History

                          Relative Change in Cluster Seeds
Iteration    Criterion           1            2            3
-----------------------------------------------------------
1               6.7591      0.2652       0.3205       0.2985
2               3.7097           0       0.0459       0.0317
3               3.6427           0       0.0182       0.0124

Convergence criterion is satisfied.
Criterion Based on Final Seeds = 3.6289

Cluster Summary

                                    Maximum Distance
                        RMS Std            from Seed      Radius    Nearest
Cluster    Frequency  Deviation       to Observation    Exceeded    Cluster
---------------------------------------------------------------------------
1                 50     2.7803              12.4803                      3
2                 38     4.0168              14.9736                      3
3                 62     4.0398              16.9272                      2

Cluster Summary

            Distance Between
Cluster    Cluster Centroids
-----------------------------
1                    33.5693
2                    17.9718
3                    17.9718

Statistics for Variables

Variable        Total STD    Within STD    R-Square    RSQ/(1-RSQ)
------------------------------------------------------------------
SepalLength       8.28066       4.39488    0.722096       2.598359
SepalWidth        4.35866       3.24816    0.452102       0.825156
PetalLength      17.65298       4.21431    0.943773      16.784895
PetalWidth        7.62238       2.45244    0.897872       8.791618
OVER-ALL         10.69224       3.66198    0.884275       7.641194

Pseudo F Statistic = 561.63

Approximate Expected Over-All R-Squared = 0.62728

Cubic Clustering Criterion = 25.021

WARNING: The two values above are invalid for correlated variables.

Cluster Means

Cluster    SepalLength     SepalWidth    PetalLength     PetalWidth
-------------------------------------------------------------------
1          50.06000000    34.28000000    14.62000000     2.46000000
2          68.50000000    30.73684211    57.42105263    20.71052632
3          59.01612903    27.48387097    43.93548387    14.33870968

Cluster Standard Deviations

Cluster    SepalLength     SepalWidth    PetalLength     PetalWidth
-------------------------------------------------------------------
1          3.524896872    3.790643691    1.736639965    1.053855894
2          4.941550255    2.900924461    4.885895746    2.798724562
3          4.664100551    2.962840548    5.088949673    2.974997167

Fisher (1936) Iris Data

The FREQ Procedure

Table of CLUSTER by Species

CLUSTER(Cluster)     Species

Frequency|
Percent  |
Row Pct  |
Col Pct  |Setosa  |Versicol|Virginic|  Total
         |        |or      |a       |
---------+--------+--------+--------+
       1 |     50 |      0 |      0 |     50
         |  33.33 |   0.00 |   0.00 |  33.33
         | 100.00 |   0.00 |   0.00 |
         | 100.00 |   0.00 |   0.00 |
---------+--------+--------+--------+
       2 |      0 |      2 |     36 |     38
         |   0.00 |   1.33 |  24.00 |  25.33
         |   0.00 |   5.26 |  94.74 |
         |   0.00 |   4.00 |  72.00 |
---------+--------+--------+--------+
       3 |      0 |     48 |     14 |     62
         |   0.00 |  32.00 |   9.33 |  41.33
         |   0.00 |  77.42 |  22.58 |
         |   0.00 |  96.00 |  28.00 |
---------+--------+--------+--------+
Total          50       50       50      150
            33.33    33.33    33.33   100.00

Output 34.1.3 Fisher’s Iris Data using PROC CANDISC

Fisher (1936) Iris Data
Canonical Discriminant Analysis of Iris Clusters

The CANDISC Procedure

Total Sample Size      150          DF Total               149
Variables                4          DF Within Classes      147
Classes                  3          DF Between Classes       2

Number of Observations Read    150
Number of Observations Used    150

Class Level Information

           Variable
CLUSTER    Name        Frequency      Weight    Proportion
1          _1                 50     50.0000      0.333333
2          _2                 38     38.0000      0.253333
3          _3                 62     62.0000      0.413333

Univariate Test Statistics
F Statistics, Num DF=2, Den DF=147

                                       Total      Pooled     Between
                                    Standard    Standard    Standard                R-Square
Variable       Label               Deviation   Deviation   Deviation    R-Square   / (1-RSq)    F Value    Pr > F
SepalLength    Sepal Length in mm.    8.2807      4.3949      8.5893      0.7221      2.5984     190.98    <.0001
SepalWidth     Sepal Width in mm.     4.3587      3.2482      3.5774      0.4521      0.8252      60.65    <.0001
PetalLength    Petal Length in mm.   17.6530      4.2143     20.9336      0.9438     16.7849    1233.69    <.0001
PetalWidth     Petal Width in mm.     7.6224      2.4524      8.8164      0.8979      8.7916     646.18    <.0001

Average R-Square

Unweighted              0.7539604
Weighted by Variance    0.8842753

Multivariate Statistics and F Approximations
S=2  M=0.5  N=71

Statistic                        Value     F Value    Num DF    Den DF    Pr > F
Wilks’ Lambda               0.03222337      164.55         8       288    <.0001
Pillai’s Trace              1.25669612       61.29         8       290    <.0001
Hotelling-Lawley Trace     21.06722883      377.66         8     203.4    <.0001
Roy’s Greatest Root        20.63266809      747.93         4       145    <.0001

NOTE: F Statistic for Roy’s Greatest Root is an upper bound.
NOTE: F Statistic for Wilks’ Lambda is exact.
Fisher (1936) Iris Data
Canonical Discriminant Analysis of Iris Clusters

The CANDISC Procedure

                      Adjusted     Approximate       Squared
      Canonical      Canonical        Standard     Canonical
    Correlation    Correlation           Error   Correlation
1      0.976613       0.976123        0.003787      0.953774
2      0.550384       0.543354        0.057107      0.302923

Eigenvalues of Inv(E)*H = CanRsq/(1-CanRsq)

     Eigenvalue    Difference    Proportion    Cumulative
1       20.6327       20.1981        0.9794        0.9794
2        0.4346                      0.0206        1.0000

Test of H0: The canonical correlations in the current row and all that follow are zero

     Likelihood    Approximate
          Ratio        F Value    Num DF    Den DF    Pr > F
1    0.03222337         164.55         8       288    <.0001
2    0.69707749          21.00         3       145    <.0001

Fisher (1936) Iris Data
Canonical Discriminant Analysis of Iris Clusters

The CANDISC Procedure

Total Canonical Structure

Variable       Label                       Can1         Can2
SepalLength    Sepal Length in mm.     0.831965     0.452137
SepalWidth     Sepal Width in mm.     -0.515082     0.810630
PetalLength    Petal Length in mm.     0.993520     0.087514
PetalWidth     Petal Width in mm.      0.966325     0.154745

Between Canonical Structure

Variable       Label                       Can1         Can2
SepalLength    Sepal Length in mm.     0.956160     0.292846
SepalWidth     Sepal Width in mm.     -0.748136     0.663545
PetalLength    Petal Length in mm.     0.998770     0.049580
PetalWidth     Petal Width in mm.      0.995952     0.089883

Pooled Within Canonical Structure

Variable       Label                       Can1         Can2
SepalLength    Sepal Length in mm.     0.339314     0.716082
SepalWidth     Sepal Width in mm.     -0.149614     0.914351
PetalLength    Petal Length in mm.     0.900839     0.308136
PetalWidth     Petal Width in mm.      0.650123     0.404282

Fisher (1936) Iris Data
Canonical Discriminant Analysis of Iris Clusters

The CANDISC Procedure

Total-Sample Standardized Canonical Coefficients

Variable       Label                          Can1            Can2
SepalLength    Sepal Length in mm.     0.047747341     1.021487262
SepalWidth     Sepal Width in mm.     -0.577569244     0.864455153
PetalLength    Petal Length in mm.     3.341309573    -1.283043758
PetalWidth     Petal Width in mm.      0.996451144     0.900476563

Pooled Within-Class Standardized Canonical Coefficients

Variable       Label                           Can1             Can2
SepalLength    Sepal Length in mm.     0.0253414487     0.5421446856
SepalWidth     Sepal Width in mm.      -.4304161258     0.6442092294
PetalLength    Petal Length in mm.     0.7976741592     -.3063023132
PetalWidth     Petal Width in mm.      0.3205998034     0.2897207865

Raw Canonical Coefficients

Variable       Label                           Can1             Can2
SepalLength    Sepal Length in mm.     0.0057661265     0.1233581748
SepalWidth     Sepal Width in mm.      -.1325106494     0.1983303556
PetalLength    Petal Length in mm.     0.1892773419     -.0726814163
PetalWidth     Petal Width in mm.      0.1307270927     0.1181359305

Class Means on Canonical Variables

CLUSTER            Can1            Can2
1          -6.131527227     0.244761516
2           4.931414018     0.861972277
3           1.922300462    -0.725693908

Output 34.1.4 Plot of Fisher’s Iris Data using PROC CANDISC

Example 34.2: Outliers

This example involves data artificially generated to contain two clusters and several severe outliers. A preliminary analysis specifies 20 clusters and outputs an OUTSEED= data set to be used for a diagnostic plot. The exact number of initial clusters is not important; similar results could be obtained with 10 or 50 initial clusters. Examination of the plot suggests that clusters with more than five (again, the exact number is not important) observations can yield good seeds for the main analysis. A DATA step deletes clusters with five or fewer observations, and the remaining cluster means provide seeds for the next PROC FASTCLUS analysis.

Two clusters are requested; the LEAST= option specifies the mean absolute deviation criterion (LEAST=1). Values of the LEAST= option less than 2 reduce the effect of outliers on cluster centers.

The next analysis also requests two clusters; the STRICT= option is specified to prevent outliers from distorting the results.
The STRICT= value is chosen to be close to the _GAP_ and _RADIUS_ values of the larger clusters in the diagnostic plot; the exact value is not critical.

A final PROC FASTCLUS run assigns the outliers to clusters.

The following SAS statements implement these steps, and the results are displayed in Output 34.2.3 through Output 34.2.8. First, an artificial data set is created with two clusters and some outliers. Then PROC FASTCLUS is run with many clusters to produce an OUTSEED= data set. A diagnostic plot using the variables _GAP_ and _RADIUS_ is then produced using the SGSCATTER procedure. The results from these steps are shown in Output 34.2.1 and Output 34.2.2.

data x;
   title 'Using PROC FASTCLUS to Analyze Data with Outliers';
   drop n;
   do n=1 to 100;
      x=rannor(12345)+2;
      y=rannor(12345);
      output;
   end;
   do n=1 to 100;
      x=rannor(12345)-2;
      y=rannor(12345);
      output;
   end;
   do n=1 to 10;
      x=10*rannor(12345);
      y=10*rannor(12345);
      output;
   end;
run;

title2 'Preliminary PROC FASTCLUS Analysis with 20 Clusters';
proc fastclus data=x outseed=mean1 maxc=20 maxiter=0 summary;
   var x y;
run;

proc sgscatter data=mean1;
   compare y=(_gap_ _radius_) x=_freq_;
run;

Output 34.2.1 Preliminary Analysis of Data with Outliers Using PROC FASTCLUS

Using PROC FASTCLUS to Analyze Data with Outliers
Preliminary PROC FASTCLUS Analysis with 20 Clusters

The FASTCLUS Procedure
Replace=FULL  Radius=0  Maxclusters=20  Maxiter=0

Criterion Based on Final Seeds = 0.6873

Cluster Summary

                                    Maximum Distance
                        RMS Std            from Seed      Radius    Nearest
Cluster    Frequency  Deviation       to Observation    Exceeded    Cluster
---------------------------------------------------------------------------
1                  8     0.4753               1.1924                     19
2                  1      .                        0                      6
3                 44     0.6252               1.6774                      5
4                  1      .                        0                     20
5                 38     0.5603               1.4528                      3
6                  2     0.0542               0.1085                      2
7                  1      .                        0                     14
8                  2     0.6480               1.2961                      1
9                  1      .                        0                      7
10                 1      .                        0                     18
11                 1      .                        0                     16
12                20     0.5911               1.6291                     16
13                 5     0.6682               1.4244                      3
14                 1      .                        0                      7
15                 5     0.4074               1.2678                      3
16                22     0.4168               1.5139                     19
17                 8     0.4031               1.4794                      5
18                 1      .                        0                     10
19                45     0.6475               1.6285                     16
20                 3     0.5719               1.3642                     15

Cluster Summary

            Distance Between
Cluster    Cluster Centroids
-----------------------------
1                     1.7205
2                     6.2847
3                     1.4386
4                     5.2130
5                     1.4386
6                     6.2847
7                     2.5094
8                     1.8450
9                     9.4534
10                    4.2514
11                    4.7582
12                    1.5601
13                    1.9553
14                    2.5094
15                    1.7609
16                    1.4936
17                    1.5564
18                    4.2514
19                    1.4936
20                    1.8999

Pseudo F Statistic = 207.58

Observed Over-All R-Squared = 0.95404

Approximate Expected Over-All R-Squared = 0.96103

Cubic Clustering Criterion = -2.503

WARNING: The two values above are invalid for correlated variables.

Output 34.2.2 Preliminary Analysis of Data with Outliers: Plot Using PROC SGSCATTER

In the following SAS statements, a DATA step is used to remove low-frequency clusters. The FASTCLUS procedure is then run again with the LEAST=1 clustering criterion, selecting seeds from the high-frequency clusters in the previous analysis. The results are shown in Output 34.2.3 and Output 34.2.4.
data seed;
   set mean1;
   if _freq_>5;
run;

title2 'PROC FASTCLUS Analysis Using LEAST= Clustering Criterion';
title3 'Values < 2 Reduce Effect of Outliers on Cluster Centers';
proc fastclus data=x seed=seed maxc=2 least=1 out=out;
   var x y;
run;

proc sgplot data=out;
   scatter y=y x=x / group=cluster;
run;

Output 34.2.3 Analysis of Data with Outliers Using the LEAST= Option

Using PROC FASTCLUS to Analyze Data with Outliers
PROC FASTCLUS Analysis Using LEAST= Clustering Criterion
Values < 2 Reduce Effect of Outliers on Cluster Centers

The FASTCLUS Procedure
Replace=FULL  Radius=0  Maxclusters=2  Maxiter=20  Converge=0.0001  Least=1

Initial Seeds

Cluster                x                y
-------------------------------------------
1            2.794174248     -0.065970836
2           -2.027300384     -2.051208579

Minimum Distance Between Initial Seeds = 6.806712

Preliminary L(1) Scale Estimate = 2.796579

Number of Bins = 100

Iteration History

                                          Relative Change
                           Maximum       in Cluster Seeds
Iteration    Criterion    Bin Size           1            2
-----------------------------------------------------------
1               1.3983      0.2263      0.4091       0.6696
2               1.0776      0.0226     0.00511       0.0452
3               1.0771     0.00226     0.00229      0.00234
4               1.0771    0.000396    0.000253     0.000144
5               1.0771    0.000396           0            0

Convergence criterion is satisfied.
Criterion Based on Final Seeds = 1.0771

Cluster Summary

                           Mean     Maximum Distance
                       Absolute            from Seed      Radius    Nearest
Cluster    Frequency  Deviation       to Observation    Exceeded    Cluster
---------------------------------------------------------------------------
1                102     1.1278              24.1622                      2
2                108     1.0494              14.8292                      1

Cluster Summary

           Distance Between
Cluster     Cluster Medians
----------------------------
1                    4.2585
2                    4.2585

Cluster Medians

Cluster                x                y
-------------------------------------------
1            1.923023887      0.222482918
2           -1.826721743     -0.286253041

Mean Absolute Deviations from Final Seeds

Cluster                x                y
-------------------------------------------
1            1.113465261      1.142120480
2            0.890331835      1.208370913

Output 34.2.4 Analysis Plot of Data with Outliers

The FASTCLUS procedure is run again, selecting seeds from high-frequency clusters in the previous analysis. STRICT= prevents outliers from distorting the results. The results are shown in Output 34.2.5 and Output 34.2.6.

title2 'PROC FASTCLUS Analysis Using STRICT= to Omit Outliers';
proc fastclus data=x seed=seed maxc=2 strict=3.0 out=out outseed=mean2;
   var x y;
run;

proc sgplot data=out;
   scatter y=y x=x / group=cluster;
run;

Output 34.2.5 Cluster Analysis with Outliers Omitted: PROC FASTCLUS and PROC SGPLOT

Using PROC FASTCLUS to Analyze Data with Outliers
PROC FASTCLUS Analysis Using STRICT= to Omit Outliers

The FASTCLUS Procedure
Replace=FULL  Radius=0  Strict=3  Maxclusters=2  Maxiter=1

Initial Seeds

Cluster                x                y
-------------------------------------------
1            2.794174248     -0.065970836
2           -2.027300384     -2.051208579

Criterion Based on Final Seeds = 0.9515

Cluster Summary

                                    Maximum Distance
                        RMS Std            from Seed      Radius    Nearest
Cluster    Frequency  Deviation       to Observation    Exceeded    Cluster
---------------------------------------------------------------------------
1                 99     0.9501               2.9589                      2
2                 99     0.9290               2.8011                      1

Cluster Summary

            Distance Between
Cluster    Cluster Centroids
-----------------------------
1                     3.7666
2                     3.7666

12 Observation(s) were not assigned to a cluster because the minimum distance to a cluster seed exceeded the STRICT= value.

Statistics for Variables

Variable     Total STD    Within STD    R-Square    RSQ/(1-RSQ)
------------------------------------------------------------------
x              2.06854       0.87098    0.823609       4.669219
y              1.02113       1.00352    0.039093       0.040683
OVER-ALL       1.63119       0.93959    0.669891       2.029303

Pseudo F Statistic = 397.74

Approximate Expected Over-All R-Squared = 0.60615

Cubic Clustering Criterion = 3.197

WARNING: The two values above are invalid for correlated variables.

Cluster Means

Cluster                x                y
-------------------------------------------
1            1.825111432      0.141211701
2           -1.919910712     -0.261558725

Cluster Standard Deviations

Cluster                x                y
-------------------------------------------
1            0.889549271      1.006965219
2            0.852000588      1.000062579

Output 34.2.6 Cluster Analysis with Outliers Omitted: Plot Using PROC SGPLOT

Finally, the FASTCLUS procedure is run one more time with zero iterations to assign outliers and tails to clusters. The results are shown in Output 34.2.7 and Output 34.2.8.
title2 'Final PROC FASTCLUS Analysis Assigning Outliers to '
       'Clusters';
proc fastclus data=x seed=mean2 maxc=2 maxiter=0 out=out;
   var x y;
run;

proc sgplot data=out;
   scatter y=y x=x / group=cluster;
run;

Output 34.2.7 Cluster Analysis with Outliers Omitted: PROC FASTCLUS

Using PROC FASTCLUS to Analyze Data with Outliers
Final PROC FASTCLUS Analysis Assigning Outliers to Clusters

The FASTCLUS Procedure
Replace=FULL  Radius=0  Maxclusters=2  Maxiter=0

Initial Seeds

Cluster                x                y
-------------------------------------------
1            1.825111432      0.141211701
2           -1.919910712     -0.261558725

Criterion Based on Final Seeds = 2.0594

Cluster Summary

                                    Maximum Distance
                        RMS Std            from Seed      Radius    Nearest
Cluster    Frequency  Deviation       to Observation    Exceeded    Cluster
---------------------------------------------------------------------------
1                103     2.2569              17.9426                      2
2                107     1.8371              11.7362                      1

Cluster Summary

            Distance Between
Cluster    Cluster Centroids
-----------------------------
1                     4.3753
2                     4.3753

Statistics for Variables

Variable     Total STD    Within STD    R-Square    RSQ/(1-RSQ)
------------------------------------------------------------------
x              2.92721       1.95529    0.555950       1.252000
y              2.15248       2.14754    0.009347       0.009435
OVER-ALL       2.56922       2.05367    0.364119       0.572621

Pseudo F Statistic = 119.11

Approximate Expected Over-All R-Squared = 0.49090

Cubic Clustering Criterion = -5.338

WARNING: The two values above are invalid for correlated variables.

Cluster Means

Cluster                x                y
-------------------------------------------
1            2.280017469      0.263940765
2           -2.075547895     -0.151348765

Cluster Standard Deviations

Cluster                x                y
-------------------------------------------
1            2.412264861      2.089922815
2            1.379355878      2.201567557

Output 34.2.8 Cluster Analysis with Outliers Omitted: Plot Using PROC SGPLOT

References

Anderberg, M. R. (1973), Cluster Analysis for Applications, New York: Academic Press.
Bock, H. H.
(1985), “On Some Significance Tests in Cluster Analysis,” Journal of Classification, 2, 77–108. Calinski, T. and Harabasz, J. (1974), “A Dendrite Method for Cluster Analysis,” Communications in Statistics, 3, 1–27. Cooper, M. C. and Milligan, G. W. (1988), “The Effect of Error on Determining the Number of Clusters,” Proceedings of the International Workshop on Data Analysis, Decision Support, and Expert Knowledge Representation in Marketing and Related Areas of Research, 319–328. Everitt, B. S. (1980), Cluster Analysis, Second Edition, London: Heineman Educational Books. Fisher, R. A. (1936), “The Use of Multiple Measurements in Taxonomic Problems,” Annals of Eugenics, 7, 179–188. Gonin, R. and Money, A. H. (1989), Nonlinear Lp -Norm Estimation, New York: Marcel Dekker. Hartigan, J. A. (1975), Clustering Algorithms, New York: John Wiley & Sons. Hartigan, J. A. (1985), “Statistical Theory in Clustering,” Journal of Classification, 2, 63–76. Journal of Statistics Education, “Fish Catch Data Set,” http://www.amstat.org/ publications/jse/jse_data_archive.html. MacQueen, J. B. (1967), “Some Methods for Classification and Analysis of Multivariate Observations,” Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1, 281–297. McLachlan, G. J. and Basford, K. E. (1988), Mixture Models, New York: Marcel Dekker. Mezzich, J. E. and Solomon, H. (1980), Taxonomy and Behavioral Science, New York: Academic Press. Milligan, G. W. (1980), “An Examination of the Effect of Six Types of Error Perturbation on Fifteen Clustering Algorithms,” Psychometrika, 45, 325–342. Milligan, G. W. and Cooper, M. C. (1985), “An Examination of Procedures for Determining the Number of Clusters in a Data Set,” Psychometrika, 50, 159–179. Pollard, D. (1981), “Strong Consistency of k-Means Clustering,” Annals of Statistics, 9, 135–140. Sarle, W. S. (1983), “The Cubic Clustering Criterion,” SAS Technical Report A-108, Cary, NC: SAS Institute Inc. Spath, H. 
(1980), Cluster Analysis Algorithms, Chichester, Eng.: Ellis Horwood. 1674 F Chapter 34: The FASTCLUS Procedure Spath, H. (1985), Cluster Dissection and Analysis, Chichester, Eng.: Ellis Horwood. Titterington, D. M., Smith, A. F. M., and Makov, U. E. (1985), Statistical Analysis of Finite Mixture Distributions, New York: John Wiley & Sons. Tou, J. T. and Gonzalez, R. C. (1974), Pattern Recognition Principles, Reading, MA: AddisonWesley. Subject Index analyzing data in groups FASTCLUS procedure, 1640 bin-sort algorithm, 1634 cluster centers, 1623, 1637 deletion, 1635 final, 1623 initial, 1622, 1623 mean, 1637 median, 1634, 1637 midrange, 1637 minimum distance separating, 1623 seeds, 1622 cluster analysis disjoint, 1621 large data sets, 1621 robust, 1622, 1637 clustering criterion FASTCLUS procedure, 1622, 1636, 1637 clustering methods FASTCLUS procedure, 1622, 1624 computational problems convergence (FASTCLUS), 1635 computational resources FASTCLUS procedure, 1647 disjoint clustering, 1621, 1622, 1624 distance between clusters (FASTCLUS), 1642 data (FASTCLUS), 1622 Euclidean (FASTCLUS), 1623 distance, 1622, 1623, 1642 DRIFT option, 1623 Ekblom-Newton algorithm, 1637 homotopy parameter, 1635 imputation of missing values, 1636 incompatibilities, 1642 iteratively reweighted least squares, 1636 Lp clustering, 1622, 1636 MEAN= data sets, 1638 memory requirements, 1647 Merle-Spath algorithm, 1637 missing values, 1623, 1636, 1638, 1642 Newton algorithm, 1637 OUT= data sets, 1643 outliers, 1622 output data sets, 1638, 1643 output table names, 1652 OUTSTAT= data set, 1638, 1645 random number generator, 1639 scale estimates, 1635, 1637, 1642, 1644 seed replacement, 1623, 1639 weighted cluster means, 1639 homotopy parameter FASTCLUS procedure, 1635 imputation of missing values FASTCLUS procedure, 1636 initial seeds FASTCLUS procedure, 1622, 1623, 1639 k-means clustering, 1622 Ekblom-Newton algorithm FASTCLUS procedure, 1637 Euclidean distances, 1623 leader algorithm, 1622 Lp 