MineSet™
User’s Guide

Document Number 007-3214-004

CONTRIBUTORS
Written by Dieter Rathjens and Helen Vanderberg
Illustrated by Dany Galgani
Production by Kirsten Pekarek
Engineering contributions by Barry Becker, Dave Bouvier, Cliff Brunk, Eric Eros,
Ariel Faigon, Eben Haber, Georges Harik, John Hawkes, Andy Kar, Ed Karrels,
Ronny Kohavi, Alex Kozlov, Clay Kunz, Peter Rathmann, Dan Sommerfield,
Peter Welch, and Brett Zane-Ulman.
St. Peter’s Basilica image courtesy of ENEL SpA and InfoByte SpA. Disk Thrower
image courtesy of Xavier Berenguer, Animatica.
© 1998, Silicon Graphics, Inc.— All Rights Reserved
The contents of this document may not be copied or duplicated in any form, in whole
or in part, without the prior written permission of Silicon Graphics, Inc.
RESTRICTED RIGHTS LEGEND
Use, duplication, or disclosure of the technical data contained in this document by
the Government is subject to restrictions as set forth in subdivision (c) (1) (ii) of the
Rights in Technical Data and Computer Software clause at DFARS 52.227-7013
and/or in similar or successor clauses in the FAR, or in the DOD or NASA FAR
Supplement. Unpublished rights reserved under the Copyright Laws of the United
States. Contractor/manufacturer is Silicon Graphics, Inc., 2011 N. Shoreline Blvd.,
Mountain View, CA 94043-1389.
Silicon Graphics and the Silicon Graphics logo are registered trademarks, and
IRIX, MineSet and IRIS InSight are trademarks, of Silicon Graphics, Inc. Oracle is a
registered trademark, and SQL*Net is a trademark of Oracle Corporation.
INFORMIX is a registered trademark of Informix Software, Inc. Sybase is a registered
trademark, and SQL Server is a trademark of Sybase Inc. UNIX is a registered
trademark in the United States and other countries, licensed exclusively through
X/Open Company, Ltd. X Window System is a trademark of the Massachusetts
Institute of Technology.
The Tree Visualizer is patented under United States Patents No. 5,528,735, 5,555,354
and 5,671,381.

Contents

List of Figures xxiii
List of Tables xxxi
About This Guide xxxiii
Audience for This Guide xxxiii
Structure of This Document xxxiv
Illustration in This Guide xxxvii
Typographical Conventions xxxvii
1. Getting Started 39
MineSet Tools Suite 39
Tool Manager 41
DataMover 41
Association Rules Generator 41
Automatic Binning 42
Clustering 42
Column Importance 42
Decision Table Inducer and Classifier 43
Decision Tree Inducer and Classifier 43
Evidence Inducer and Classifier 43
Option Tree Inducer and Classifier 44
Regression Tree Inducer and Regressor 44
Cluster Visualizer 44
Decision Table Visualizer 45
Evidence Visualizer 45
Map Visualizer 45
Record Viewer 46
Rules Visualizer 46
Scatter Visualizer 46
Splat Visualizer 47
Statistics Visualizer 47
Tree Visualizer 47
Basic Tool Execution Scenario 48
2. Setting Up MineSet 51
Configuring the DataMover Server 51
The User Configuration File 51
File Handling 54
Mandatory Configuration File 54
Using MineSet With Existing Data Files 56
Using MineSet to Connect to Remote Databases 58
Loading Sample Datasets 59
3. The Tool Manager 63
Overview 63
Connecting to an Existing Data Source 64
Transforming the Data 64
Visualizing the Data on the Screen 65
Starting the Tool Manager 66
Choosing a Data Source 68
Choosing an Existing Data File 69
Choosing a Database Table 70
Transforming the Data 75
The Remove Column Button 76
The Bin Columns Button 77
Aggregation 83
The Filter Button 87
The Change Types Button 88
The Add Column Button 91
The Apply Model Button 92
The Sample Button 93
The Table History Buttons 94
The “Current view is” Field 94
The Prev and Next Buttons 94
Investigating the Data 99
Using Visualization Tools 99
Using Mining Tools 102
Using Data Files 107
Session Files 108
Pulldown Menus 109
The File Menu 109
The View Menu 111
The Visual Tools Menu 111
The Help Menu 112
The Tool Manager Options File 112
The Record Viewer 113
Color Options for the MineSet Visualizers 115
Choosing Colors 115
Using the Color Browser 117
4. Using the Statistics Visualizer 119
Overview of the Statistics Visualizer 119
File Requirements 121
Starting the Statistics Visualizer 121
Starting the Statistics Visualizer 122
Working in the Statistics Visualizer’s Main Window 123
Pulldown Menus 123
The File Menu 123
The View Menu 124
The Help Menu 125
Sample Data Files 126
5. Using the Tree Visualizer 127
Overview of Tree Visualizer 127
File Requirements 129
Starting the Tree Visualizer 130
Configuring the Tree Visualizer Using the Tool Manager 132
Selecting the Tree Visualizer Tool 132
Undoing Mappings 134
Specifying Tool Options 134
Saving Tree Visualizer Settings 141
Invoking the Tree Visualizer 141
Working in the Tree Visualizer’s Main Window 142
Highlighting an Object or Node 143
Selecting an Object 144
Spotlighting an Object 144
Using the Right Mouse Button 145
Navigating With the Middle Mouse Button 146
External Controls 147
Buttons 147
Thumbwheels 149
Height Slider 150
Pulldown Menus 150
The File Menu 151
The Show Menu 152
The Display Menu 165
The Selections Menu 166
The Go Menu 167
The Help Menu 169
Null Handling in the Tree Visualizer 170
Sample Configuration and Data Files 171
6. Using the Map Visualizer 173
Overview of Map Visualizer 173
File Requirements 176
Starting the Map Visualizer 178
Configuring the Map Visualizer Using the Tool Manager 180
Generating .gfx and .hierarchy Files 180
Selecting the Map Visualizer Tool 181
Mapping Columns to Visual Elements 182
Undoing Mappings 183
Slider Creation for Mapviz 183
Specifying Tool Options 184
Saving Map Visualizer Settings 189
Invoking the Map Visualizer 189
Working in the Map Visualizer’s Main Window 189
Viewing Modes 191
External Main Window Controls 194
Buttons 194
Height-Adjust Slider and Label 195
Thumbwheels 196
The Animation Control Panel 196
Sliders Controlling Independent Dimensions 197
The Summary Window 199
Animation Buttons and Sliders 201
Pulldown Menus 204
The File Menu 204
The View Menu 204
The Selections Menu 208
The InterTool Menu 209
The Help Menu 209
Null Handling in the Map Visualizer 210
Sample Configuration and Data Files 211
7. Using the Scatter Visualizer 215
Overview of Scatter Visualizer 215
File Requirements 217
Starting the Scatter Visualizer 218
Configuring the Scatter Visualizer Using the Tool Manager 220
Selecting the Scatter Visualizer Tool 220
Mapping Requirements to Columns 221
Undoing Mappings 221
Slider Creation for Scatterviz 221
Specifying Tool Options 223
Invoking the Scatter Visualizer 229
Saving the Scatter Visualizer Settings 229
Null Handling in the Scatter Visualizer 229
Working in the Scatter Visualizer’s Main Window 230
Viewing Modes 232
External Controls 234
The Animation Control Panel 234
Sliders Controlling Independent Dimensions 234
The Summary Window 238
Animation Buttons and Sliders 239
Pulldown Menus 241
The File Menu 241
The View Menu 241
The Selections Menu 244
The Help Menu 245
Sample Configuration and Data Files 245
8. Using the Splat Visualizer 249
Overview of the Splat Visualizer 249
Opacity 252
File Requirements 255
Starting the Splat Visualizer 255
Configuring the Splat Visualizer Using the Tool Manager 256
Selecting the Splat Visualizer Tool 257
Mapping Columns to Requirements 258
Undoing Mappings 258
Specifying Tool Options 258
Invoking the Splat Visualizer 262
Saving the Splat Visualizer Settings 262
Null Handling in the Splat Visualizer 263
Working in the Splat Visualizer’s Main Window 264
Viewing Modes 264
External Controls 267
The Animation Control Panel 267
Sliders Controlling Independent Dimensions 267
The Summary Window 271
Animation Buttons and Sliders 272
Pulldown Menus 276
The View Menu 276
The Selection Menu 279
Splat Type Menu 282
Sample Configuration and Data Files 283
9. Using the Rules Visualizer 287
Overview of Rules Visualizer 287
Data Conversion 290
Association Rules Generator 290
Rules Visualization 292
File Requirements 294
Starting the Rules Visualizer 295
Configuring the Rules Visualizer Using the Tool Manager 297
Setting Up Associations 297
Applying Association Rule Options 299
Mapping Columns to Association Items 300
Specifying Ruleviz Options 301
Mapping Columns to Visual Elements 304
Invoking the Rules Visualizer 305
Working in the Rules Visualizer’s Main Window 305
Viewing Modes 306
External Controls 308
The Height Slider 308
Pulldown Menus 309
The File Menu 309
The Filter Menu 310
The View Menu 312
The Help Menu 312
Sample Files 312
Sample Files for the Association Data Converter 312
Sample Files for the Association Rules Generator 313
Sample Files for the Rules Visualization Part 313
10. MineSet Inducers and Classifiers 315
Classifiers 315
Decision Tree Classifiers 316
Option Tree Classifiers 317
Evidence Classifiers 319
Inducers 320
Training Set 322
Applying a Model 322
Error Estimation 324
Backfitting in Error Estimation 328
Confusion Matrices in Error Estimation 329
Lift Curves in Error Estimation 330
Learning Curves in Error Estimation 332
Advanced Options 335
Return-on-Investment Curves 338
Inducer Modes in Tool Manager 340
Error Options for Inducers 341
Backfitting 342
Confusion Matrices 343
ROI Option 343
Lift Curves 343
Loss Matrices 344
Weight Setting 344
Learning Curves 344
OK and Cancel Buttons 345
Go! Button 346
The Status Window 346
Applying Models, Testing Models, and Fitting New Data 348
Apply Model 349
Test Model 349
Fit Data to Model 350
Special Options and Limitations 351
Setting Special Options 351
Default Limits and How to Override Them 352
Other Limitations 353
11. Inducing and Visualizing the Decision Tree Classifier 355
Overview 355
Inducing Decision Trees 356
File Requirements 357
Running the Decision Tree Inducer 357
Configuring the Decision Tree Inducer Using the Tool Manager 358
Discrete Labels 358
Classifier Name 359
Parallelization 359
Decision Tree Options 359
Working in the Tree Visualizer’s Main Window 363
Nodes 363
Lines 364
Using the Main Window to Classify Records 365
External Controls 365
Pulldown Menus 366
The Search and Filter Panels 366
Sample Files 368
12. Inducing and Visualizing the Option Tree Classifier 377
Overview 377
Inducing Option Trees 380
File Requirements 380
Running the Option Tree Inducer 380
Configuring the Decision Tree Inducer Using the Tool Manager 381
Discrete Labels 381
Parallelization 382
Classifier Name 382
Option Tree: Further Options 382
Working in the Tree Visualizer’s Main Window 385
Sample Files 385
13. Inducing and Visualizing the Evidence Classifier 389
Overview 389
Inducing Evidence Classifiers 397
File Requirements 398
Running the Evidence Inducer 398
Starting the Evidence Visualizer 399
Configuring the Evidence Inducer Using the Tool Manager 400
Discrete Labels 401
Classifier Name 401
Refining the Inducer With Further Options 401
Working in the Evidence Visualizer’s Panes 403
Viewing Modes 405
External Controls 415
Sliders 415
Pulldown Menus 416
The File Menu 416
The View Menu 417
The Nominal Order Menu 418
The Selection Menu 418
Sample Files 420
14. Inducing and Visualizing the Decision Table 431
Overview 431
Inducing Decision Tables 436
File Requirements 437
Running the Decision Table Inducer 437
Starting the Decision Table Visualizer 438
Configuring the Decision Table Inducer Using the Tool Manager 439
Discrete Labels 440
Classifier Name 440
Exploring Data by Mapping Columns to Axes 440
Decision Table Options 441
Working in the Decision Table Visualizer’s Main Window 442
Viewing Modes 444
External Main Window Controls 446
Sliders 446
Pulldown Menus 446
The File Menu 446
The View Menu 447
The Nominal Order Menu 447
The Selection Menu 448
The Help Menu 449
Sample Files 450
15. Inducing and Visualizing the Regression Tree 465
Overview 465
Running the Regression Tree Inducer 466
Configuring the Regression Tree Inducer Using the Tool Manager 467
Continuous Label 467
Regressor Name 468
Regression Tree Options 468
Error Estimation 471
Visualizing the Regression Tree 472
Lines 473
Using the Main Window to Predict Values 473
External Controls 474
Pulldown Menus 474
Sample Files 474
16. Inducing and Visualizing Clustering 479
Overview of Clustering 479
Using Clustering and the Cluster Visualizer 482
Single k-Means Clustering Method 483
Iterative k-Means Clustering Method 484
Evaluation of Clustering 485
Using Attribute Weights 486
Further Clustering Options 488
Starting the Cluster Visualizer 489
File Requirements 490
Working in Cluster Visualizer Main Window 490
Pulldown Menus 492
Sample File 492
Alternative Visualization of Clustering 492

17. Column Importance 493
Finding Important Columns 493
Column Importance Notes 497
Column Importance and Relation to Classifiers 497
The Discretization Process 497
The Importance Function 498
Dependence on Other Attributes 498
Sample File 499
18. Selection and Drill-Through 501
Multiple Selection 501
Drill-Through 502
Tree Visualizer Specific Details 503
Map Visualizer Specific Details 504
Scatter Visualizer Specific Details 504
Splat Visualizer Specific Details 504
Rules Visualizer Specific Details 504

19. File Exchange Between MineSet and SAS 505
Overview 505
Converting MineSet Data Files to SAS Data Sets 505
The -names namefile Command Line Option 506
The -svsc Option 506
Converting SAS Data Sets Into MineSet Data Files 507
The -nolabel Option 507
The -names namefile Option 507
The -nodata Option 508
The -svsc Option 508

20. MineSet Web Extensions 509
Overview 509
MineSet Web Extension Files 510
scripts Subdirectory 510
examples Subdirectory 510
examples/rview_dir Subdirectory 511
MineSet Web Installation (Client) 511
MineSet Web Installation (Server) 512
Setting Up the Server 512
Local Installation 513
MineSet mtr Files 514
Creating mtr Files 514
MineSet Remote View 516
Installing MineSet Remote View 516
Configuring and Using rview_dir.cgi 516
Configuring and Using rview_file.cgi 519
MineSet Web Extension Security-Related Issues 520
A. Flat File Support for MineSet 521
The Data File 521
Data Types 522
Arrays 523
The .schema File 525
Variable Names 525
Strings and Characters 526
Comments 526
File Statements 526
Data Statements 526
Input Options 529
Exceptions 529

B. Creating Data and Configuration Files for the Tree Visualizer 531
The Data File 531
Data Types 532
Enumerations 533
Arrays 534
The Configuration File 536
Sections 536
Options Files 536
Statements 537
Variable Names 537
Option Statements 537
Include Statements 538
Sinclude Statements 538
Strings and Characters 538
Keywords 539
Expressions 540
The Input Section 541
File Statements 541
Data Statements 542
Input Options 544
The Expression Section 546
The Hierarchy Section 547
Levels Statements 547
Key Statements 548
Aggregate Subsection 551
Aggregate Base Subsection 552
Expressions Subsection 553
Sort Statements 553
Hierarchy Options 554
The View Section 555
Height Statements 556
Base Height Statements 558
Disk Height Statements 559
Color Statements 560
Base Color Statements 562
Disk Color Statements 563
Label Statements 563
Message Statements 563
The View Options 565
C. Creating Data, Configuration, Hierarchy, and GFX Files for the Map Visualizer 573
The Data File 573
Data Types 574
Fixed Arrays 575
The Configuration File 576
Overview 576
Keywords 579
Expressions 580
The Input Section 581
The Expressions Section 587
The View Section 588
The Hierarchy File 595
The .gfx File 596

D. Creating Data and Configuration Files for the Scatter Visualizer 601
The Data File 601
Data Types 602
Arrays 603
Null Values 603
The Configuration File 604
Sections 604
Defaults Files 604
Statements 605
Variable Names 605
Options Statements 605
Include Statements 606
Sinclude Statements 606
Strings and Characters 606
Comments 606
Keywords 607
Expressions 608
The Input Section 609
File Statements 610
Enumeration Statements 610
Data Statements 612
Input Options 614
The Expressions Section 614
The View Section 615
Slider Statement 616
Entity Statement 616
Size Statement 617
Color Statement 618
Axis Statement 621
Summary Statement 622
Message Statement 624
Execute Statement 625
The Filter Statement 625
View Options 626
E. Creating Data and Configuration Files for the Splat Visualizer 627
The Data File 627
Data Types 628
Null Values 629
The Configuration File 629
Sections 629
Defaults Files 630
Statements 630
Variable Names 630
Options Statements 631
Include Statements 631
Sinclude Statements 631
Strings and Characters 632
Comments 632
Keywords 632
The Input Section 633
File Statements 633
Enumeration Statements 634
Data Statements 636
Input Options 637
The View Section 637
Slider Statement 638
Opacity Statement 638
Color Statement 640
Axis Statement 643
Summary Statement 644
View Options 645
F. Creating Data and Configuration Files for the Rules Visualizer 647
The Association Data Converter 648
Association Data Converter File Requirements 648
Files Generated by the Association Data Converter 650
The Association Data Converter Command-Line Operation 650
Association Data Converter Examples 651
Association Rules Generator 652
Association Rules Generator Files Requirements 652
Association Rules Generator Command-Line Operation 652
Association Rule Examples 657
Rules Visualization 663
Rules Visualization File Requirements 663

G. Format of the Evidence Visualizer’s Data File 677

H. Creating Data and Configuration Files for the Decision Table Visualizer 681
Sample File 682

I. Command-Line Interface to MIndUtil: Analytical Data Mining Algorithms 683
MIndUtil Invocation and Options 683
General Options 687
Induction Modes 690
Decision Tree Inducer Options 692
Option Tree Inducer Options 693
Evidence Inducer Options 693
Decision Table Inducer Options 694
Regression Tree Inducer Options 695
Estimate Error 695
Learning Curve 696
Clustering 696
Discretization 697
Column Importance and Auto Selection 698
Fit-Data 699
MineSet-to-MLC, MLC-to-MineSet 699
Visualize 700
J. Nulls in MineSet 701
Semantics of Nulls 701
Representation of Nulls 702
Operations on Nulls 702
Arithmetic Expressions 702
Boolean Expressions 702
Relational Operations 703
Testing for Nulls 703
Aggregations in the Presence of Nulls 704
Sort Order for Nulls 705
Bins and Arrays With Nulls 705
K. Further Reading and Acknowledgments 707
Further Reading 707
Acknowledgments 711
Index 713

List of Figures

Figure 1-1 Tool Execution Sequence 48
Figure 3-1 The Tool Manager Startup Window 67
Figure 3-2 File Pulldown Menu 68
Figure 3-3 Open New Data File Dialog Box 69
Figure 3-4 Choosing New Database Table Dialog Box 71
Figure 3-5 Specifying Server Name, Login, and Password 71
Figure 3-6 Sample Dialog Box Listing Available DBMS Names/Vendors 72
Figure 3-7 Dialog Box After Selecting Informix or Sybase DBMS 73
Figure 3-8 SQL Query Dialog Box 74
Figure 3-9 The Data Transformations Panel 75
Figure 3-10 Bin Columns Dialog Box 77
Figure 3-11 Binning With Automatically Computed Thresholds 79
Figure 3-12 Aggregate Dialog Box 86
Figure 3-13 Filter Dialog Box 87
Figure 3-14 Change Types Dialog Box 88
Figure 3-15 Types Popup List 89
Figure 3-16 The Add Column Dialog Box 91
Figure 3-17 Sampling Dialog Box 93
Figure 3-18 Table History Buttons 94
Figure 3-19 View History Dialog Box 96
Figure 3-20 Zoom Buttons 97
Figure 3-21 Overview Button 97
Figure 3-22 Vertical/Horizontal View Button 97
Figure 3-23 Data Destination Panel 100
Figure 3-24 Columns Mapped to Requirements 101
Figure 3-25 The Associations Tab 103
Figure 3-26 The Column Importance Tab 104
Figure 3-27 Advanced Mode of Column Importance 105
Figure 3-28 The Data Files Panel 107
Figure 3-29 File Menu 109
Figure 3-30 View Menu 111
Figure 3-31 Sample Record Viewer Screen 114
Figure 3-32 Configuration Option With a Single Color Swatch 115
Figure 3-33 Color Browser 115
Figure 3-34 Multiple Colors Swatches 116
Figure 3-35 Scroll Arrows on Color Browser 116
Figure 3-36 Color Browser Out of Colors 116
Figure 4-1 Numeric Column Displayed by Statistics Visualizer 120
Figure 4-2 Discrete Column Displayed by Statistics Visualizer 120
Figure 4-3 File > Open Menu Selection for Statistics Visualizer 121
Figure 4-4 Data Destination Panel With Statistics Visualizer Selected 122
Figure 4-5 StatViz View Pulldown Menu 124
Figure 4-6 Statistics Visualizer Help Menu 125
Figure 5-1 Example Display in the Tree Visualizer’s Main Window 128
Figure 5-2 Tree Visualizer’s File Pulldown Menu 130
Figure 5-3 Data Destination Panel of Tool Manager With Tree Visualizer Selected 133
Figure 5-4 Tree Visualizer’s Configuration Options Dialog Box 135
Figure 5-5 Tree Visualizer’s Initial View When Specifying store.treeviz 142
Figure 5-6 A Highlighted Object and the Information It Represents 143
Figure 5-7 Example of a Selected (Spotlighted) Object 145
Figure 5-8 Example of the Square as Navigational Base 146
Figure 5-9 Tree Visualizer’s External Button Controls 147
Figure 5-10 Tree Visualizer’s Thumbwheels 149
Figure 5-11 Tree Visualizer’s Height Slider 150
Figure 5-12 Tree Visualizer’s File Pulldown Menu With Options 151
Figure 5-13 Tree Visualizer’s Show Pulldown Menu With Options 152
Figure 5-14 Tree Visualizer’s Overview Window 153
Figure 5-15 Tree Visualizer’s Search Dialog Box 154
Figure 5-16 Sample Results of a Search in the Tree Visualizer 155
Figure 5-17 Detail of the Tree Visualizer’s Search Dialog Box 156
Figure 5-18 Tree Visualizer’s Filter Dialog Box 159
Figure 5-19 Tree Visualizer’s Marks Panel 163
Figure 5-20 Window Resulting From Clicking Mark Button 163
Figure 5-21 Main Window With Flags Representing Marks 164
Figure 5-22 Tree Visualizer’s Display Menu 165
Figure 5-23 Tree Visualizer’s Selection Menu 166
Figure 5-24 Tree Visualizer’s Go Pulldown Menu 167
Figure 5-25 Tree Visualizer’s Help Pulldown Menu 169
Figure 5-26 Representation of a Null Value Mapped to Height, Color, Disk, and Label 171
Figure 6-1 Sample Map Visualizer Screen Showing 1990 U.S. Population 174
Figure 6-2 Sample Map Visualizer Screen Showing Relative Population of Major U.S. Cities 175
Figure 6-3 Sample Map Visualizer Screen Showing the United States With Specific Endpoints 176
Figure 6-4 Map Visualizer’s Startup Screen, With File Pulldown Menu Selected 178
Figure 6-5 Data Destination Panel, With Map Visualizer Selected 182
Figure 6-6 Map Visualizer’s Options Dialog Box 185
Figure 6-7 Population.usa.mapviz Example With the Slider Moved to 1990 190
Figure 6-8 Highlighted Information in the Viewing Window and Selected Information 192
Figure 6-9 Detail View of Top Right Buttons 194
Figure 6-10 Lower Half of Window With Thumbwheels 196
Figure 6-11 Map Visualizer’s Summary Window With Slider and Animation Controls 197
Figure 6-12 Map Visualizer’s Summary Window With One Slider and Animation Controls 198
Figure 6-13 If There Are No Independent Dimensions, No Animation Control Panel Appears 199
Figure 6-14 Map Visualizer’s View Pulldown Menu 204
Figure 6-15 Map Visualizer Filter Panel 205
Figure 6-16 Map Visualizer Selections Menu 208
Figure 6-17 Map Visualizer’s InterTool Pulldown Menu 209
Figure 6-18 Representation of a Null Value Mapped to Height (Top Middle Object) and to Color (Bottom Right Object) 211
Figure 7-1 Sample Scatter Visualizer Screen 216
Figure 7-2 Scatter Visualizer Start-Up File Pulldown Menu Selected 219
Figure 7-3 Data Destination Panel With Scatter Visualizer Selected 220
Figure 7-4 Scatter Visualizer’s Options Dialog Box 224
Figure 7-5 Initial View When Specifying company.scatterviz 231
Figure 7-6 Displayed Information When Cursor is Over a Selected Entity 233
Figure 7-7 Animation Control Panel With Summary Window and Both Slider Controls 235
Figure 7-8 Animation Control Panel With Summary Window and One Slider Control 236
Figure 7-9 Scatter Visualizer With No Independent Dimension or Animation Control Panel 237
Figure 7-10 Scatter Visualizer View Menu 241
Figure 7-11 Scatter Visualizer Filter Panel 242
Figure 7-12 The Scatter Visualizer Selections Menu 244
Figure 8-1 Sample Splat Visualizer With One Slider Control 250
Figure 8-2 Shape of Opacity Function For Low and High Values of u 252
Figure 8-3 Image Where u = 5.3, and u = 30 253
Figure 8-4 Data Destination Panel With Splat Visualizer Selected 257
Figure 8-5 Splat Visualizer’s Options Dialog Box 259
Figure 8-6 Pick Dragger Over Data 266
Figure 8-7 Animation Control Panel With Summary Window and Both Slider Controls 268
Figure 8-8 Splat Visualizer Without Independent Dimension or An Animation Control Panel 270
Figure 8-9 Changed Visualization as a Result of Moving the Slider (Compare to Figure 8-1) 273
Figure 8-10 Splat Visualizer View Menu 276
Figure 8-11 Splat Visualizer Filter Panel 277
Figure 8-12 The Splat Visualizer’s Selection Menu 279
Figure 8-13 Image With Fixed Selection Box (Gray) and Active Selection Box (Yellow) 280
Figure 9-1 Execution Sequence of the Rules Visualizer 289
Figure 9-2 Detail View of the Rules Visualizer’s Main Window 293
Figure 9-3 Initial Tool Manager Window for Association Generation 298
Figure 9-4 Association Rule Options Dialog Box 299
Figure 9-5 Association Mappings Dialog Box 300
Figure 9-6 Rule Visualizer Options Dialog Box 301
Figure 9-7 The Rules Visualizer’s Mappings Panel 304
Figure 9-8 Initial Rules Visualizer View When Specifying group.ruleviz 305
Figure 9-9 Cursor Over a Rules Visualizer Object 307
Figure 9-10 Rules Visualizer’s Height Slider 308
Figure 9-11 Rules Visualizer File Menu 309
Figure 9-12 Rules Visualizer Filter Panel 310
Figure 9-13 Rules Visualizer View Menu 312
Figure 10-1 The Decision Tree Generated by the Decision Tree Inducer for Churn Dataset 316
Figure 10-2 The Option Tree Generated by the Option Tree Inducer for the Cars Dataset 317
Figure 10-3 Results of Evidence Inducer for Iris Dataset 319
Figure 10-4 Method for Building a Classifier 320
Figure 10-5 Using a Classifier to Label New Records 320
Figure 10-6 Tool Execution Sequence for Classifiers 321
Figure 10-7 Sample Records From a Training Set 322
Figure 10-8 Iris Dataset Misclassification, Example 1 323
Figure 10-9 Iris Dataset Misclassification, Example 2 324
Figure 10-10 Estimating the Classifier’s Accuracy 326
Figure 10-11 Classifier Cross-Validation (k=3) 327
Figure 10-12 Confusion Matrix for Iris Dataset 329
Figure 10-13 Lift Curve for the Churn Dataset 331
Figure 10-14 Learning Curve for the Churn Dataset 333
Figure 10-15 Learning Curve for the Adult Dataset With Label Set to Gross Income Binned at $50,000 334
Figure 10-16 Confusion Matrix for the Mushroom Dataset Using Defaults Settings 335
Figure 10-17 Confusion Matrix for the Mushroom Dataset With Loss Matrix 336
Figure 10-18 Confusion Matrix for the Mushroom Dataset With Loss Matrix Allowing Unknown Predictions 337
Figure 10-19 Options for Running the Inducer 340
Figure 10-20 Error Estimation Options With Holdout 341
Figure 10-21 Error Estimation Options With Cross Validation 342
Figure 10-22 Backfitting, Confusion Matrices, Lift Curve, and ROI Curve Options 342
Figure 10-23 ROI Option for Generating a Return on Investment Curve 343
Figure 10-24 Enabling Loss Matrices and Setting the Weight Attribute 344
Figure 10-25 Learning Curve Options 345
Figure 10-26 The Status Window 346
Figure 10-27 The Test and Apply Model Dialog Box: Selecting a Classifier 348
Figure 10-28 The Apply Model Panel 349
Figure 10-29 The Test Model Panel 350
Figure 10-30 The Fit Data to Model Panel 351
Figure 11-1 Decision Tree for the Iris Dataset 356
Figure 11-2 Data Destination Panel in Tool Manager Showing Classifiers 358
Figure 11-3 Further Inducer Options 360
Figure 11-4 Tree Visualizer’s Search Dialog Box 366
Figure 12-1 Option Decision Tree for the Cars Dataset 379
Figure 12-2 Data Destination Panel in Tool Manager Showing Classifiers 381
Figure 12-3 Further Inducer Options 383
Figure 13-1 The Evidence Visualizer Applied to the Iris Dataset 390
Figure 13-2 Evidence Visualizer Showing Probabilities 391
Figure 13-3 Selecting sepal length < 5.45 and sepal width > 3.05 Using the Iris Dataset 394
Figure 13-4 Selecting Two Contradictory Pies Results in a Gray Pie on the Right 395
Figure 13-5 Veil-Color Attribute in the Mushroom Dataset 396
Figure 13-6 File > Open Menu Selection 399
Figure 13-7 Tool Manager With Data Destination Panel Showing Classifiers 400
Figure 13-8 Classification Options Dialog Box Without Accuracy Estimate 402
Figure 13-9 Evidence Visualizer Window for cars.eviviz 404
Figure 13-10 Label Value “Japan” Selected Using the Cars Dataset 406
Figure 13-11 Loss Matrix to Avoid Predicting Poisonous Mushrooms as Being Edible 407
Figure 13-12 Loss Matrix Applied to Probabilities in the Label Probability Pane 408
Figure 13-13 Pie Charts With the First Binned Range of weightlbs Highlighted 409
Figure 13-14 Bar Chart With a Range Selected 411
Figure 13-15 Iris Dataset With the Value petal width .75 - 1.65 Selected 412
Figure 13-16 Bars Showing Evidence For iris-virginica 413
Figure 13-17 Bars Showing Evidence Against iris-virginica 414
Figure 13-18 Evidence Visualizer Height Scale Slider 415
Figure 13-19 Evidence Visualizer Detail Slider 416
Figure 13-20 Evidence Visualizer Percent Weight Threshold Slider 416
Figure 13-21 Evidence Visualizer’s View Menu 417
Figure 13-22 Evidence Visualizer’s Nominal Order Menu 418
Figure 13-23 Evidence Visualizer’s Selection Menu 419
Figure 13-24 Filtered Adult Dataset With Multiple Selection 420
Figure 14-1 Decision Table for the Mushroom Dataset 432
Figure 14-2 Decision Table for the Mushroom Dataset, Showing Drill-Down 433
Figure 14-3 Mushroom Dataset Close-Up of “odor=none and spore-print-color=white” 434
Figure 14-4 Data Destination Panel in Tool Manager Showing Classifiers 439
Figure 14-5 Further Inducer Options 441
Figure 14-6 Decision Table Showing Classifier Induced From adult94 Dataset 443
Figure 14-7 Example of Making Multiple Selections 445
Figure 14-8 Decision Table Visualizer’s View Menu 447
Figure 14-9 Decision Table Visualizer’s Nominal Order Menu 447
Figure 14-10 Decision Table Visualizer’s Selection Menu 448
Figure 14-11 Drilling Down on the Churn Dataset 451
Figure 14-12 Decision Table Visualizer Using the Adult Dataset 455
Figure 14-13 Closer Inspection of the Adult Dataset 457
Figure 15-1 Regression Tree for the Adult Dataset 466
Figure 15-2 Data Destination Panel in Tool Manager Showing Regressors 467
Figure 15-3 Further Inducer Options 469
Figure 16-1 Clustering Visualization on Adult Dataset 480
Figure 16-2 The Clustering Tab 481
Figure 16-3 Clustering Using Iterative K-Means 484
Figure 16-4 Clustering Options Dialog Box 487
Figure 16-5 Cluster Visualizer Main Window 491
Figure 17-1 The Column Importance Tab 494
Figure 17-2 Advanced Mode of Column Importance 495
Figure 18-1 Table of Values for Selected Objects 502

List of Tables

Table 3-1 Aggregate Example 1 83
Table 3-2 Aggregate Example 2 83
Table 3-3 Aggregate Example 3 84
Table 3-4 Example of Binning 84
Table 3-5 Results When Making Total $ Spent an Array 84
Table 3-6 Results When Specifying Sex_bin 85
Table 3-7 Results of Making an Array by Age_bin and Sex_bin 85
Table 3-8 Results of Distributing Sex_bin and Indexing by Age_bin 85
Table 8-1 Ages 40 to 50 274
Table 8-2 Ages 50 to 60 274
Table 8-3 Interpolation Midway Between Table 1 and Table 2 275
Table 9-1 Association Rules Components 292
Table 9-2 Example of Hierarchical Levels 292
Table B-1 Keywords for the Tree Visualizer 539
Table C-1 Keywords for the Map Visualizer 579
Table C-2 Operators Used With Expressions 580
Table C-3 Characters That Can Follow the Percent Symbol in the Format String 583
Table D-1 Scatter Visualizer Keywords 607
Table D-2 Operators Used With Expressions 608
Table D-3 Characters That Can Follow the Percent Symbol in the Format String 611
Table E-1 Splat Visualizer Keywords 632
Table E-2 Characters That Can Follow the Percent Symbol in the Format String 635
Table F-1 Single-Item Format 649
Table F-2 Multiple-Item Format 649
Table F-3 Options for the Association Data Converter 650
Table F-4
Table F-5
Table F-6
Table F-7
Table F-8
Table F-9
Table F-10
Table F-11
Table F-12
Table F-13
Table F-14
Table F-15

xxxii

Options for Controlling Rule Generation 653
Options for Restricting Generated Rules 654
Options for the mapassocgen Command 655
Example Hierarchy 656
Options Set 3 657
Data Example 2 658
Rule Generation Example 1 659
Example Hierarchy 660
Example of Rules at the Lowest Hierarchical Level 661
Second Example of Rules Generated at Lowest Hierarchical Level
Field Names and Types for Rules File 665
Operators Used With Expressions 666

663

About This Guide

The MineSet User’s Guide describes the features and capabilities of this suite of nine
data mining and ten visualization tools. Current information about the MineSet
product can be found on the World Wide Web at
http://www.sgi.com/Products/software/MineSet

Audience for This Guide
If you are using the Tool Manager to extract data from a database into the MineSet tools,
you should understand database structures. It also would be helpful to know SQL.
If you are configuring the tools directly (through the configuration files, or through the
command line in the case of the association rules), you should have some knowledge of
UNIX as well as some programming experience.
Once the data has been loaded into the various visualization tools, you will not need a
database or programming background, although you will be able to interpret the
displays more easily if you have an understanding of the data and what it represents.

Structure of This Document
In addition to this preface, the documentation for MineSet consists of the following
chapters:
Chapter 1, “Getting Started”
This chapter provides a brief overview of each MineSet tool and describes the processes that
occur when invoking and using a tool.
Chapter 2, “Setting Up MineSet”
This chapter describes how to set up MineSet by configuring the DataMover.
Chapter 3, “The Tool Manager”
This chapter describes the menus and functions of the initial interface for invoking tools
and tells how to produce their respective configuration files.
Chapter 4, “Using the Statistics Visualizer”
This chapter provides a description of the Statistics Visualizer. This tool is valuable for
comprehending variations in statistics by comparing box plots and histograms.
Chapter 5, “Using the Tree Visualizer”
This chapter provides a complete description of the Tree Visualizer tool interface. This
tool is valuable for visualizing hierarchical data.
Chapter 6, “Using the Map Visualizer”
This chapter provides a complete description of the Map Visualizer interface. This tool is
valuable for visualizing data that is connected with a geographical location.
Chapter 7, “Using the Scatter Visualizer”
This chapter provides a complete description of the Scatter Visualizer interface. This tool
is valuable for visualizing multidimensional data.
Chapter 8, “Using the Splat Visualizer”
This chapter provides a complete description of the Splat Visualizer. This tool, which is
particularly well suited for application to very large datasets, lets you visually analyze
relationships among several variables, either statically or by animation.
Chapter 9, “Using the Rules Visualizer”
This chapter provides a complete description of the Rules Visualizer. This tool is valuable
for mining large datasets and visualizing correlations in that data.

Chapter 10, “MineSet Inducers and Classifiers”
This chapter provides a brief introduction to classifiers and regressors, and the
algorithms that generate them, called inducers. Specifically, it introduces the three
MineSet classifiers: Decision Tree, Option Tree and Evidence.
Chapter 11, “Inducing and Visualizing the Decision Tree Classifier”
This chapter describes how to generate and use the Decision Tree Classifier. This tool is
valuable for classifying data according to a set of attributes by making a series of
decisions based on those attributes.
Chapter 12, “Inducing and Visualizing the Option Tree Classifier”
This chapter describes how to generate and use the Option Tree Classifier. This tool
assigns each record to a class. Option trees can contain special option nodes that allow
the classifier to consider the influence of splitting on multiple attributes simultaneously.
Chapter 13, “Inducing and Visualizing the Evidence Classifier”
This chapter describes how to generate and use the Evidence Classifier. This tool is
valuable for classifying data by examining the probabilities of a specified result
occurring based on a given attribute.
Chapter 14, “Inducing and Visualizing the Decision Table”
This chapter describes how to generate and use the Decision Table Classifier. This tool is
useful for examining data and visualizing correlations between pairs of attributes.
Chapter 15, “Inducing and Visualizing the Regression Tree”
This chapter describes how to generate and use the Regression Tree Regressor. This tool
is useful for predicting attributes that take continuous values.
Chapter 16, “Inducing and Visualizing Clustering”
This chapter describes how to generate and use clustering to explore data. This tool is
useful to detect groups of records that have similar characteristics.
Chapter 17, “Column Importance”
This chapter provides a complete description of the column importance tool. It also
describes the relationship between column importance and the importance ranking in
the other data mining tools.
Chapter 18, “Selection and Drill-Through”
This chapter describes how to use multiple selection in the MineSet tools, as well as
the concept of drill-through.

Chapter 19, “File Exchange Between MineSet and SAS”
This chapter describes the support for file exchanges between the MineSet and SAS
formats.
Chapter 20, “MineSet Web Extensions”
This chapter describes the MineSet extensions that are provided to let you create or view
visualizations and/or interact with MineSet over the web.
Appendix A, “Flat File Support for MineSet”
This appendix describes the .schema and the .data files that are required for MineSet to
read flat files.
Appendix B, “Creating Data and Configuration Files for the Tree Visualizer”
This appendix explains the required formats of the Tree Visualizer data and
configuration files.
Appendix C, “Creating Data, Configuration, Hierarchy, and GFX Files for the Map
Visualizer”
This appendix explains the required formats of the Map Visualizer data, configuration,
hierarchy, and .gfx files.
Appendix D, “Creating Data and Configuration Files for the Scatter Visualizer”
This appendix explains the required formats of the Scatter Visualizer data and
configuration files.
Appendix E, “Creating Data and Configuration Files for the Splat Visualizer”
This appendix describes the format of the Splat Visualizer’s data file.
Appendix F, “Creating Data and Configuration Files for the Rules Visualizer”
This appendix explains the required formats of the Rules Visualizer data and
configuration files.
Appendix G, “Format of the Evidence Visualizer’s Data File”
This appendix describes the format of the Evidence Visualizer’s data file.
Appendix H, “Creating Data and Configuration Files for the Decision Table Visualizer”
This appendix describes the format of the Decision Table’s data file.

Appendix I, “Command-Line Interface to MIndUtil: Analytical Data Mining
Algorithms”
This appendix describes how the server side of the MineSet images handles classifiers,
regressors, discretization, column importance, file conversions, and their options.
Appendix J, “Nulls in MineSet”
This appendix describes how MineSet supports nulls in the data access tools, the mining
tools, and the visualization tools.
Appendix K, “Further Reading and Acknowledgments”
This appendix lists reference sources for further reading about concepts and their
implementations used in the MineSet tools. It also lists acknowledgments for data
sources used in the examples provided with these tools.

Illustration in This Guide
The hard copy of this documentation provides all screen shots and illustrations in black
and white. The online version, however, provides these visuals in full, original color.
Thus, if you are reading the hard copy version and find a particular graphic or screen
shot difficult to see, go to the respective page of the online version for greater clarity.

Typographical Conventions
The following type conventions and symbols are used in this guide:
Italics
    Executable names, filenames, program variables, tools, utilities, variable
    command-line arguments, and variables to be supplied by the user in
    examples, code, and syntax statements.
Bold
    Keywords.
Fixed-width type
    On-screen command-line text and prompts.
Bold fixed-width type
    User input, including keyboard keys (printing and non-printing);
    literals supplied by the user in examples, code, and syntax statements.
[]
    Syntax statement arguments surrounded by square brackets denote that
    these arguments are optional.

Chapter 1

1. Getting Started

This introduction provides an overview of MineSet™, an integrated suite of data mining
and visualization tools, and describes the basic tool execution scenario.
Note: Before using any of the MineSet tools, follow the installation and licensing
instructions in the MineSet release notes. Then your system administrator must set up
the DataMover configuration file. You also can choose to set up various options. The
setup details are described in Chapter 2.

MineSet Tools Suite
The MineSet suite of tools lets you mine and graphically display quantitative
information in ways that can help you better visualize, explore, and understand your
data. This suite of data mining and analysis tools can help you organize and examine
your data in new and meaningful ways. The mining tools automatically find patterns
and build models that can be viewed using the visualization tools. The visualization
tools can also be applied directly to the data for further insights. These tools provide an
enabling power that lets you gain a deeper, intuitive understanding of your data, and
helps you discover hidden patterns and important trends.
These tools provide a highly interactive, three-dimensional (3D) visual interface that lets
you manipulate visual objects on the screen, as well as search, filter and perform
animations. This ability to visualize and survey complex data patterns can prove
invaluable for decision support, in business intelligence and knowledge management.

The MineSet suite consists of three basic components:
• a centralized control module, consisting of a graphical user interface tool called the
  Tool Manager, and a process called the DataMover, which runs on the server part of
  MineSet’s client/server architecture.
• analytical data mining, with nine data mining tools:
  – Association Rules Generator
  – Automatic Binning
  – Cluster Generator
  – Column Importance
  – Decision Table Inducer and Classifier
  – Decision Tree Inducer and Classifier
  – Evidence Inducer and Classifier
  – Option Tree Inducer and Classifier
  – Regression Tree Inducer and Regressor
• visualization tools, which let you view your data using ten different visual
  metaphors:
  – Cluster Visualizer
  – Decision Table Visualizer
  – Evidence Visualizer
  – Map Visualizer
  – Record Viewer
  – Rules Visualizer
  – Scatter Visualizer
  – Splat Visualizer
  – Statistics Visualizer
  – Tree Visualizer

The following sections provide a brief description of each of the above-mentioned
components.

Tool Manager
Each of the mining and visualization tools described below can be configured and started
via a consistent graphical user interface known as the Tool Manager. The Tool Manager
• connects you to the server on which the analytical mining and transformations are
  performed
• lets you access, query and transform data
• creates configuration files for each tool

DataMover
The DataMover is a process that runs on the server on behalf of the user. The DataMover
• connects to databases and flat files (ASCII or binary) and retrieves the data
• invokes the mining tools
• performs additional data manipulation such as binning and aggregation
• returns the data to the Tool Manager for distribution to the visualization tools
• can store the data in files on the server or client for future operations

Association Rules Generator
The Association Rules Generator processes an input file, then generates an output file
consisting of rules. These rules indicate the frequency with which one item occurs in a
record along with another item. The strength of the association is quantified by three
numbers.
• The first number, the predictability of the rule, quantifies how often an item X and an
  item Y occur together as a fraction of the number of records in which X occurs. For
  example, given that someone has bought milk, how often do they also buy eggs?
• The second number, the prevalence of the rule, quantifies how often X and Y occur
  together in the file as a fraction of the total number of records. For example, how
  often were milk and eggs bought together?
• The third number is expected predictability. This gives an indication of what the
  predictability would be if there were no relationship between the items in the
  record. For example, how often were eggs bought, regardless of whether milk was
  bought as well?
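
The three numbers above can be illustrated with a short calculation. The following Python sketch is not part of MineSet; the function name rule_strength and the sample baskets are hypothetical, and the formulas simply follow the definitions given in this section.

```python
from typing import Iterable, Set, Tuple


def rule_strength(records: Iterable[Set[str]], x: str, y: str) -> Tuple[float, float, float]:
    """Return (predictability, prevalence, expected predictability) for the rule X -> Y."""
    rows = list(records)
    total = len(rows)
    with_x = sum(1 for r in rows if x in r)
    with_y = sum(1 for r in rows if y in r)
    with_both = sum(1 for r in rows if x in r and y in r)
    if total == 0 or with_x == 0:
        raise ValueError("need at least one record containing X")
    predictability = with_both / with_x      # X and Y together, among records with X
    prevalence = with_both / total           # X and Y together, among all records
    expected_predictability = with_y / total  # how often Y occurs at all
    return predictability, prevalence, expected_predictability


# Hypothetical shopping baskets, one set of items per record.
baskets = [
    {"milk", "eggs"},
    {"milk", "eggs", "bread"},
    {"milk"},
    {"eggs"},
    {"bread"},
]
pred, prev, expected = rule_strength(baskets, "milk", "eggs")
print(pred, prev, expected)
```

In this sample, eggs appear in two of the three baskets that contain milk, so the predictability (about 0.67) is slightly higher than the expected predictability (0.6).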

Automatic Binning
Automatic Binning groups together closely spaced numerical data into discrete
categories. Some data mining algorithms, such as the Decision Tree Inducer, require
some discrete (categorical) data; similarly, visualization tools such as the Splat Visualizer
may need data categorized in this way.
MineSet can determine these categories automatically, or you can specify how the binning
should be done. Requirements can be as simple as dividing the data into three equal ranges,
or as complex as having MineSet choose ranges differentiated according to some chosen
attribute while discarding the outer five percent of the data as outliers.
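
The simplest case mentioned above, dividing the data into three equal ranges, might look like the following Python sketch. It is an illustration only, not MineSet's binning code; the function name equal_range_bins and the sample ages are hypothetical, and outlier trimming and attribute-driven ranges are not covered.

```python
from typing import List, Sequence


def equal_range_bins(values: Sequence[float], n_bins: int = 3) -> List[int]:
    """Assign each value a bin index from 0 to n_bins - 1 using equal-width ranges."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last range, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [18, 22, 30, 41, 47, 55, 63, 78]
bins = equal_range_bins(ages, 3)
print(bins)
```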

Clustering
Clustering segments data into similar groups or clusters. For example, you can ask
MineSet to suggest a segmentation of customers into five distinct groups, without giving
any further parameters. Once the clustering operation has been run, you can view the
results in the Cluster Visualizer; or apply the clustering model to the current data, then
analyze the resulting clusters in any MineSet visualization or mining tool.
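
As a rough illustration of what a clustering operation does, the following Python sketch runs a plain k-means loop on two numeric columns. The figure list of this guide mentions iterative k-means, but this sketch does not claim to match MineSet's algorithm or options (see Chapter 16 for those); the function names, starting centers, and sample records are all hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def nearest(point: Point, centers: Sequence[Point]) -> int:
    """Index of the center with the smallest squared distance to the point."""
    distances = [(point[0] - c[0]) ** 2 + (point[1] - c[1]) ** 2 for c in centers]
    return distances.index(min(distances))


def k_means(points: Sequence[Point], initial_centers: Sequence[Point],
            max_iter: int = 100) -> Tuple[List[int], List[Point]]:
    """Alternate assignment and center update until the assignments stop changing."""
    centers = list(initial_centers)
    assignment: List[int] = []
    for _ in range(max_iter):
        new_assignment = [nearest(p, centers) for p in points]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return assignment, centers


# Hypothetical customer records: (age, income in thousands).
customers = [(25, 30), (27, 32), (26, 29), (60, 90), (62, 95), (58, 88)]
labels, centers = k_means(customers, [(25, 30), (60, 90)])
print(labels, centers)
```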

Column Importance
Column Importance determines how important various attributes are for determining
the value of a given label attribute. For example, you can ask MineSet to select
automatically the best three attributes that help determine whether someone is a good
credit risk. The system might select income, own-house, and car-cost. These attributes
can then be used to configure various visualizers.
Column Importance has an advanced mode that provides additional capabilities. First, it
lets you determine how important each of the attributes is. (For example, you could
determine that both income and salary are similar in importance in determining credit
risk. Although income might be slightly better at determining credit risk, you might
prefer to use salary because it is easier to obtain.) Second, once you explicitly choose an
attribute, you can determine what other attributes are important in conjunction with it.
(For example, if you have chosen salary rather than income, house-cost might become
more important than own-house, and income would have a very low importance.)

Decision Table Inducer and Classifier
The Decision Table Classifier classifies data by making a series of consecutive decisions
leading to the classification based on a record’s attributes. It can be used to predict events
such as whether a bank customer is likely to default on a loan, or a homeowner is likely
to refinance their mortgage.
The Decision Table Inducer creates a Decision Table Classifier from the data. Attributes
are tested to classify the data, and you have the option to set the order in which the tests
are run as well. The resulting Decision Table Classifier can be viewed using the Decision
Table Visualizer, so you can simultaneously explore multiple attribute tests, two at a
time.

Decision Tree Inducer and Classifier
The Decision Tree Classifier classifies data according to a set of attributes by making a
series of decisions based on those attributes. For example, when assessing
creditworthiness, a decision tree might determine whether someone who owns a home,
owns a car that cost between $15,000 and $23,000, and has two children is a good
credit risk.
The Decision Tree Inducer generates a Decision Tree Classifier, the structure of which is
displayed using the Tree Visualizer, each decision being represented by a node of the tree.
The graphical representation helps you understand the model and, through visual
searching and filtering, gives valuable insight into the data.

Evidence Inducer and Classifier
The Evidence Classifier classifies data by examining the probabilities of a specified result
occurring based on a given attribute. For example, it might determine that someone who
owns a car that cost between $15,000 and $23,000 has a 70% chance of being a good credit
risk, and a 30% chance of being a bad credit risk. The classifier predicts the class with the
highest probability based on a simple probabilistic model.
The model is displayed using the Evidence Visualizer, which shows pie charts
illustrating the different probabilities. This graphical representation can help you
understand the classification algorithm, provide valuable insights into the data, and
answer “what if” questions.
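
The idea of predicting the class with the highest probability can be sketched for a single attribute as follows. This Python example is a simplification under stated assumptions: it estimates probabilities by counting, handles one attribute only, and is not the Evidence Classifier's actual model (see Chapter 13). The function names, value labels, and sample counts are hypothetical, chosen to reproduce the 70%/30% example above.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def class_probabilities(rows: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Estimate P(label | attribute value) from (value, label) pairs by counting."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for value, label in rows:
        counts[value][label] += 1
    return {
        value: {label: n / sum(labels.values()) for label, n in labels.items()}
        for value, labels in counts.items()
    }


def predict(probs: Dict[str, Dict[str, float]], value: str) -> str:
    """Return the label with the highest estimated probability for this value."""
    by_label = probs[value]
    return max(by_label, key=by_label.get)


# Hypothetical training rows: (car cost range, credit risk).
rows = [("15k-23k", "good")] * 7 + [("15k-23k", "bad")] * 3 + [("under-15k", "bad")] * 2
probs = class_probabilities(rows)
print(probs["15k-23k"], predict(probs, "15k-23k"))
```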

Option Tree Inducer and Classifier
The Option Tree Classifier classifies data using a technique similar to the Decision Tree
Classifier. Unlike decision trees, option trees can contain special option nodes, which
allow the classifier to consider the influence of splitting on multiple attributes
simultaneously. For example, an option node in an option tree built to identify a car's
country of origin might choose miles per gallon, horsepower, number of cylinders, and
weight as informative attributes. In a decision tree, a node can choose at most one
attribute for consideration at a time. In an option tree, the results of all options are
“voted” when performing classification. Option trees are often more accurate than
decision trees; however, they generally are much larger.
The Option Tree Inducer generates an Option Tree Classifier from a training set in much
the same way that the Decision Tree inducer generates a Decision Tree. The induced
option tree is displayed using the Tree Visualizer. This visualization helps you
understand the classifier, and provides insight into which attributes are important in
determining the value of the label.

Regression Tree Inducer and Regressor
The Regression Tree Regressor predicts continuous attributes, in the same way that
the Decision Tree and Option Tree Classifiers predict discrete attributes. While a classifier
predicts an event, such as whether a customer will churn (leave you) or not, a regressor
predicts specific numerical values, such as the profit margin for a business for the next
financial quarter.
The Regression Tree Inducer builds a Regression Tree Regressor model from your data.
As with Decision and Option Trees, this model can be viewed and analyzed using the
Tree Visualizer, so you can understand the basis from which its predictions are made.

Cluster Visualizer
The Cluster Visualizer displays statistics about the clusters or groups that are generated
by the clustering mining tool. It places these statistics side-by-side with those for the
entire data set, so that you can see which features make each cluster unique.
The Cluster Visualizer places the attributes in the display in the order of importance for
understanding the clustering. When you select one particular cluster, Cluster Visualizer
produces an ordering which is the most useful for discriminating between that cluster
and the remainder of the data set.

Decision Table Visualizer
The Decision Table Visualizer allows you to view the distribution of data from a discrete
column at multiple levels of a hierarchy. For example, you can examine the profitability
of a business along dimensions of product class, geography, sales promotions and
sales-representative compensation plan. The Decision Table Visualizer distributes the
data two attributes at a time, allowing you to drill-down to further pairs of attributes at
each level.
The Decision Table Visualizer explores the results of the Decision Table Inducer, so that
the discrete column you examine is the label that the inducer classifies. When this is
done, the Decision Table Inducer arranges the attributes to determine which pair to
display first, and how to drill down from that top level to subsequent levels.

Evidence Visualizer
The Evidence Visualizer visually represents the model generated by the Evidence
Classifier. It initially shows cake charts that represent how the various attributes
contribute to the decision, and allow “what-if” analysis.

Map Visualizer
The Map Visualizer lets you visualize data relationships that exist across geographically
meaningful areas. For example, you can visualize different areas of a country, showing
the relative impact of a marketing program. The Map Visualizer’s drill-down capabilities
let you focus on designated regions and perform a more detailed analysis in smaller
geographical elements. One application might be analyzing how one or more products
are being sold across different geographies. A powerful animation feature, coupled with
a capability to connect different views of the same or related data, permits fast
comparisons and difference analyses. This tool lets you visually examine patterns in your
data that are difficult to detect when that data is shown in a tabular, two-dimensional
form.

Record Viewer
The Record Viewer lets you view the data in the current table in a row/column
spreadsheet-like tool.

Rules Visualizer
The Rules Visualizer visually represents the model of the Association Rules Generator
mining tool. It provides detailed data analysis that lets you examine relationships across
data elements in new ways. In doing so, you might discover relationships that
significantly differ from what you might have expected; this, in turn, can lead to
important discoveries about your data or the processes behind that data. This tool’s
visualization capabilities let you discover additional patterns of co-occurrence between
these data elements. For example, you can use the analysis of products sold during the
last sales promotion to guide your advertising campaign for the next sales period. The
Rules Visualizer’s high performance would let you analyze the results from today’s sales
data in time to alter the advertising campaign for the future.

Scatter Visualizer
The Scatter Visualizer lets you examine the behavior of data across eight different
dimensions. The data is shown in a grid representing up to three dimensions. Extra
dimensions can map to the size, color, and label of each displayed entity. Two further
independent dimensions can be assigned as dynamic dimensions. A slider can be used
to select specific values along those dimensions, or a path can be traced through those
dimensions, for animation. During the path traversal, the display changes automatically
to reflect the change in the independent variables.

Splat Visualizer
The Splat Visualizer produces 3D plots of very large data sets. Instead of showing
individual data points, it renders the density of data using varying opacity. It has many
of the same features as the Scatter Visualizer.

Statistics Visualizer
The Statistics Visualizer computes and displays summary information for the current
dataset (maximum, minimum, median, standard deviation, distinct values, and
quartiles).
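
For a single numeric column, the summary values listed above can be computed with Python's standard statistics module, as in the sketch below. The sample values are hypothetical, and the guide does not say here which standard deviation or quartile method MineSet uses; this example assumes the sample standard deviation and the module's default quartile method.

```python
import statistics

# Hypothetical values for one numeric column.
values = [23, 35, 35, 41, 47, 52, 58, 64]

summary = {
    "minimum": min(values),
    "maximum": max(values),
    "median": statistics.median(values),
    "standard_deviation": statistics.stdev(values),   # sample standard deviation (assumed)
    "distinct_values": len(set(values)),
    "quartiles": statistics.quantiles(values, n=4),   # three cut points
}
for name, value in summary.items():
    print(name, value)
```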

Tree Visualizer
The Tree Visualizer helps you analyze data that has hierarchical relationships. It provides
an interactive “fly-through” capability for examining relationships among data at
different hierarchical levels. For example, the Tree Visualizer can be used to examine a
company’s product line, graphically displaying each product’s contribution to the
company’s total revenue. Each branch of the hierarchy displays information at increasing
levels of detail, breaking revenues down by product lines and, eventually, individual
products. Another example of using the Tree Visualizer is to show company sales
revenue, displaying a company-wide total as well as sub-totals at regional and other
levels. The fly-through capability in the Tree Visualizer lets you rapidly reposition your
view of the data. The Tree Visualizer’s filtering and searching capabilities let you focus
on specific data elements and queries.
The Tree Visualizer is also used to view the resulting models of the Decision Tree and
Option Tree Classifiers, and the Regression Tree Regressor; with each decision being
represented by a separate node in the tree. Each node also contains bars showing how the
data is modeled based on the decisions up to that point (for example, 73% of people who
own a home and have two children are good credit risks, while 27% are not).

Basic Tool Execution Scenario
Each of the MineSet tools is started, configured, and run in a consistent manner. The
sequence of actions you follow at your MineSet client and at the MineSet server is shown
schematically in Figure 1-1. A description of the steps inherent in this figure follows.

Figure 1-1    Tool Execution Sequence
(Diagram. Labeled elements: MineSet client, MineSet server, User, Tool Manager,
configuration file, visualization tool, Inducer (MIndUtil), model, information and
statistics (error estimate), data file, visual files, visual display, user’s data
source, DataMover.)

The following steps describe a “typical” interaction with a MineSet tool, and the
sequence of the tool’s actions. Depending on your requirements, some steps might be
skipped (for instance, if the data and configuration files have been generated in a
previous work session).
1. Start the Tool Manager, which is the graphical interface for generating and
   specifying the configuration file, data file, and tools to be used. The Tool Manager
   runs on your MineSet client.
2. The Tool Manager opens a network connection to the DataMover, which runs on the
   MineSet server. The server may be your client workstation or a separate machine.
3. Use the Tool Manager to specify
   • the database and table, or a binary or ASCII flat file containing the data on
     either the client or the server
   • which mining or visualization tools are to be applied
   • how that data is to be displayed, through tool options
   • a session file to save the history of your work
   Information retrieved via the DataMover is used to guide this interaction. As a
   result, the Tool Manager generates a configuration file. This file contains the
   user-defined parameters that determine the execution of the following steps.
4. The Tool Manager transmits a copy of the configuration file from step 3 to the
   DataMover. The DataMover processes the file by
   • accessing the database or flat file
   • performing the specified data transformations
   • running the mining tools when requested
   • generating the visualization files when requested
   These visualization files consist of your data in a specific format readable by the
   MineSet tool. Then a copy of these visualization files is transferred to the MineSet
   client.
5. The Tool Manager invokes the appropriate MineSet visualization tool.
6. The tool accesses the visualization files and displays the data.
7. If you generated a model, that model can be applied to additional data (see
   Figure 10-5).

Note: The MineSet client and server can run on different machines, using a network to
communicate. Because network bandwidth is often scarce, you should be cautious about
transferring large files between client and server regularly. If you are doing mining
operations on a large database or file, you can achieve greater efficiency by storing that
file on the server, where the DataMover runs, rather than on the client.

Chapter 2

2. Setting Up MineSet

This chapter describes how to set up MineSet, which requires configuring the
DataMover. The configuration has two parts:
• configuring the user’s account on the server (optional), and
• a global configuration, which usually is done by the system administrator

Parallelization is offered through the multiprocessor (n32) version of MineSet only. The
DataMover is a process that runs on the server, although it is not directly accessible to
users. The DataMover provides access to databases and data stored in flat files, and
transforms data for the mining and visualization tools. The last section of this chapter
describes how to load sample datasets into the supported relational databases.

Configuring the DataMover Server
In order to use the MineSet tools, two configuration files must be created on the server:
one by you, the other by the system administrator.

The User Configuration File
Note: You must have a UNIX account on every server you want to access.

The DataMover creates files on the server machine on behalf of each user. The
DataMover configuration file, .datamove, lets you control where these files are created and
whether different classes of files are saved or discarded. This file is located on the server,
in your home directory. A sample .datamove file called datamove.sample is located on the
server, in the /usr/lib/MineSet/datamove directory.

If the .datamove file is absent, or if a particular entry is not present in the .datamove file, the
DataMover uses a default value for that entry.
Each entry in the DataMover’s configuration file must be on a separate line. For example:
file_cache = directory_name

where file_cache specifies the location in which the DataMover stores its output data files
and models resulting from mining algorithms. If the file_cache directory does not exist,
the DataMover attempts to create it on its first invocation. The default file_cache directory
is ./mineset_files/%U. The %U is a wildcard that is filled in with the user’s login name on
the client machine. This is useful in reducing contention if many users want to log in to
a common account on the server. If multiple sessions were simultaneously connected to
the same file_cache directory, they could overwrite each other’s server files, causing
incorrect and unexpected results. To prevent this, DataMover maintains a lock at the
file_cache directory level. The second and later attempts to connect to a particular
file_cache directory result in failure and an error message. The user can recover from such
a failure by killing one of the DataMover’s attempts to connect to a given file in the cache
directory.
The file_cache should be a directory in a file system with sufficient room to hold all of a
user’s output and temporary files. (The DataMover creates this directory if it does not
already exist.) The files are deleted when the DataMover no longer needs them, unless one
of the following keep options is set:
keep_client_upload
keep_client_download
keep_classifier_files
keep_classifier_options_files
keep_mlc_input
use_ascii_mlc_input

Each of these entries is described below.
keep_client_upload (default no)

Keep files uploaded from the client for processing. If kept, they will be in the client_upload
subdirectory.
keep_client_download (default no)

Retain on the server a copy of data files and visualizations after they are downloaded to
the client. If kept, the files will be in the client_download subdirectory.
keep_classifier_files (default yes)

Keep the persistent classifiers (decision trees and so forth) generated by mining
operations. Keeping these files is generally useful.
keep_classifier_options_files (default no)

Keep the options file that is used when generating (inducing) the classifier. Keeping this
file is generally not useful. If kept, the files will be in the mlc_work subdirectory.
keep_mlc_input (default no)

Keep input files used for mining (MIndUtil or associations) operations. If kept, the files
will be in the mlc_work subdirectory.
use_ascii_mlc_input (default no)

Normally the DataMover creates MineSet binary files for MIndUtil input. If this option
is set, it creates ASCII files instead.
aggregation_memory_limit (default 2147483647)

Memory limit (in bytes) for aggregation operations. This can be no larger than the
system-wide limit set in the dm_config file.
optimize_history (default yes)

The DataMover is able to rewrite histories to remove redundant computations. The
optimize_history parameter controls whether or not to do this. Since this rewriting can
speed up processing considerably, it is normally turned on.
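
The way defaults fill in for missing entries can be illustrated with a short Python sketch. It is not part of MineSet: the function names and the /bigdisk/mineset path are hypothetical, and the parsing rule (one "name = value" pair per line, anything else ignored) is an assumption based on the example above. The default values are the ones listed in this section.

```python
# Defaults as listed in this section; the parsing rules are an assumption.
DEFAULTS = {
    "file_cache": "./mineset_files/%U",
    "keep_client_upload": "no",
    "keep_client_download": "no",
    "keep_classifier_files": "yes",
    "keep_classifier_options_files": "no",
    "keep_mlc_input": "no",
    "use_ascii_mlc_input": "no",
    "aggregation_memory_limit": "2147483647",
    "optimize_history": "yes",
}


def read_datamove(text):
    """Return the effective settings for the given .datamove contents."""
    settings = dict(DEFAULTS)
    for line in text.splitlines():
        if "=" not in line:
            continue  # blank or unrecognized line: the default stays in effect
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def expand_file_cache(path, client_login):
    """Fill in the %U wildcard with the user's login name on the client."""
    return path.replace("%U", client_login)


# Hypothetical .datamove contents: two entries set, the rest left to defaults.
sample = "file_cache = /bigdisk/mineset/%U\nkeep_client_upload = yes\n"
settings = read_datamove(sample)
cache_dir = expand_file_cache(settings["file_cache"], "jdoe")
print(cache_dir)
print(settings["keep_classifier_files"])
```

Here only file_cache and keep_client_upload come from the file; every other entry keeps its default, which mirrors the behavior described for an absent entry.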

File Handling
A file in the file_cache directory is the result of a successful operation. If an operation
returns an error (that is, Tool Manager reports a message beginning “fatal error on
server,”) nothing should be changed in the file_cache directory. Two examples help
illustrate the point:
•   Example 1: A user’s file_cache directory contains the files cars.data and cars.schema,
    both the result of a previous database query. The user then selects the same table,
    and sets the output to server_file, filtering for examples with mpg>55. Since no
    records in the dataset have mpg values this high, when the history executes, it
    returns no rows, which is flagged as a fatal error. After this happens, the user’s
    file_cache directory will still contain the old cars.schema and cars.data files.

•   Example 2: A user’s file_cache directory contains the files cars.data and cars.schema,
    both the result of a previous database query. The user then selects the same table,
    and sets the output to a visualization. The operation completes and the
    visualization launches successfully. Once again, the user’s file_cache directory still
    contains the old cars.schema and cars.data files. The file_cache directory is not updated
    unless the user specifically chooses server_file as the output.

Mandatory Configuration File
If you are using relational databases, the MineSet DataMover server must be configured
to find information in the databases. The DataMover works with Oracle® versions 7.2 or
later, INFORMIX®, and Sybase®.
The DataMover server reads the /usr/lib/MineSet/datamove/dm_config file during start up.
This file is not created by Inst during installation. It must be created by the system
administrator, who must log in as root to edit this file. It can be created via an editor such
as jot, vi, or Emacs. An example file can be found in
/usr/lib/MineSet/datamove/dm_config.sample. The format of this file is as follows:

Oracle {
"ORACLE_SID", "ORACLE_HOME";
}
Oracle_Remote {
"DATABASE_NAME", "ADMIN_DIRECTORY";
}
Informix {
"INFORMIXSERVER", "INFORMIXDIR";
}
Sybase {
"DSQUERY", "SYBASE";
}

Each entry is optional and describes a database in use at your site. If your server is not
running any databases, that is, you intend to use MineSet with ASCII files only, simply
make an empty dm_config file.
The line "ORACLE_SID", "ORACLE_HOME" is filled in with the specific information and
repeated once for each Oracle database to be accessed via the DataMover. ORACLE_SID
and ORACLE_HOME are Oracle specific parameters defining an Oracle instance.
The Oracle_Remote section is for accessing remote Oracle databases via SQL*Net V2.
The DATABASE_NAME entry is a logical name for the remote database, as defined in a
tnsnames.ora file. The ADMIN_DIRECTORY entry is where DataMover searches for the
tnsnames.ora file. This file is described in Oracle’s SQL*Net documentation. Remote
access to databases is described in more detail in “Using MineSet to Connect to Remote
Databases” on page 58.
Each line in the Informix section defines a database server that, in turn, can contain
several databases. The server is checked at runtime to determine which databases it
contains, so there is no need to record the individual databases in the dm_config file. The
first entry is the INFORMIX server (corresponding to the INFORMIXSERVER
environment variable), and the second is the INFORMIX directory (corresponding to the
INFORMIXDIR environment variable).
Each entry in the Sybase section defines a database server (or, in Sybase terminology, an
SQL Server™). The first entry is the Sybase SQL Server name (corresponding to the
DSQUERY environment variable); the second is the Sybase home directory
(corresponding to the SYBASE environment variable).

An example configuration file might be as follows:
Oracle {
"v73", "/usr/people/oracle/v73";
"wrhse", "/opt/oracle";
}
Oracle_Remote {
"lifeseq", "/usr/lib/MineSet2/datamove/";
}
Informix {
"learn_online", "/u5/informix";
}
Sybase {
"MINESET", "/usr/sybase/10.0.2.4";
}

This configuration file lets the DataMover access:
•   three Oracle databases: one named v73 (installed in /usr/people/oracle/v73), another
    named wrhse (installed in /opt/oracle), and a remote database named lifeseq
•   an INFORMIX server
•   a Sybase SQL Server

Each of the INFORMIX and Sybase servers can, in turn, contain multiple databases.
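
To check a dm_config file before the DataMover reads it, an administrator could list its entries with a small script. The following Python sketch is only an illustration and is not shipped with MineSet: parse_dm_config is a hypothetical name, and the regular expressions assume exactly the layout shown in the example above (a section name, braces, and quoted pairs ending in a semicolon).

```python
import re

# A section is "Name { ... }"; an entry is a pair of quoted strings ending in ";".
SECTION = re.compile(r"(\w+)\s*\{(.*?)\}", re.DOTALL)
ENTRY = re.compile(r'"([^"]*)"\s*,\s*"([^"]*)"\s*;')


def parse_dm_config(text):
    """Map each section name to its list of (name, directory) pairs."""
    return {name: ENTRY.findall(body) for name, body in SECTION.findall(text)}


example = '''
Oracle {
"v73", "/usr/people/oracle/v73";
"wrhse", "/opt/oracle";
}
Oracle_Remote {
"lifeseq", "/usr/lib/MineSet2/datamove/";
}
Informix {
"learn_online", "/u5/informix";
}
Sybase {
"MINESET", "/usr/sybase/10.0.2.4";
}
'''

config = parse_dm_config(example)
for section, entries in config.items():
    for name, directory in entries:
        print(section, name, directory)
```

An empty dm_config file yields an empty dictionary, which corresponds to a server that uses ASCII files only.
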
For Sybase, DataMover uses vendor-supplied shared libraries as its connection to the
databases. One of the purposes of the dm_config file is to specify where DataMover must
look for its shared libraries. DataMover looks in the $SYBASE/lib/ directory for the
following shared libraries: libct.so, libcs.so, libcomn.so, libintl.so, libtcl.so, libinsck.so.

Using MineSet With Existing Data Files
Sometimes it is convenient to use MineSet with data that is already stored as a file, but
requires further processing before it can be mined or visualized. In this case, the data file
can be made available (with a modest effort) to the Tool Manager/DataMover.
First, the data file must be in a tab-delimited format, with the same number of fields in
each line. A numeric or string field with a single “?” character appearing between
delimiters is loaded as a Null value.

For a detailed discussion of null values, refer to Appendix J, “Nulls in MineSet.”
The contents of the data file must be described to Tool Manager/DataMover via a file
with the .schema extension. The format of the .schema file is shown next:
#
# A line beginning with a "#" is a comment
#
input {
# The first line lists the data file which is described.  It
# must be a simple filename, not a path.
file "carmodels.data";
# Fields are listed left to right in the line, legal
# types are float, double, int, string, date, fixedString and
# dataString
# Be sure to end every line with a semicolon ";"
float mpg;
int cylinders;
float cubicinches;
int horsepower;
int weightlbs;
double timeaccelerate;
date when_introduced;
string origin;
fixedString(3) manufacturer_code;
dataString model;
}

The schema and data files must be located in the same directory. If you prepare a dataset
in this fashion on the client machine, it can be opened with the Tool Manager’s Find File
dialog. If the file requires any additional processing, it is copied to the server. Sometimes
this is not convenient, especially if the file already exists on the server, or is large. In this
case, the .schema and .data files must be copied (or symbolically linked) into your file_cache
directory on the server. The directory used as the file cache is specified in your .datamove
file; the default is ./mineset_files/%U, where %U becomes your login name on the client
machine.
For a more extended description of MineSet .schema files see Appendix A.
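
If the data originates in a program rather than in an existing file, the .data and .schema pair can be generated together. The Python sketch below is an illustration under stated assumptions: write_dataset is a hypothetical helper, the sample rows are invented, and string values are written without quoting, which this section does not specify. It follows the rules given above: tab-delimited fields, the same number of fields on every line, a single "?" for a null value, and a simple filename in the schema.

```python
import os
import tempfile


def write_dataset(directory, name, fields, rows):
    """Write name.data (tab-delimited) and name.schema in the same directory.

    fields is a list of (type, column) pairs; None values are written as "?".
    """
    data_path = os.path.join(directory, name + ".data")
    schema_path = os.path.join(directory, name + ".schema")
    with open(data_path, "w") as data:
        for row in rows:
            if len(row) != len(fields):
                raise ValueError("every line needs the same number of fields")
            data.write("\t".join("?" if v is None else str(v) for v in row) + "\n")
    with open(schema_path, "w") as schema:
        schema.write("input {\n")
        schema.write('file "%s.data";\n' % name)  # simple filename, not a path
        for type_name, column in fields:
            schema.write("%s %s;\n" % (type_name, column))
        schema.write("}\n")
    return data_path, schema_path


fields = [("float", "mpg"), ("int", "cylinders"), ("string", "origin")]
rows = [(18.0, 8, "US"), (None, 4, "Japan")]  # invented sample rows
workdir = tempfile.mkdtemp()
data_path, schema_path = write_dataset(workdir, "carmodels", fields, rows)
print(open(schema_path).read())
```

Both files land in the same directory, so the pair can then be copied or symbolically linked into the file_cache directory as described above.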

Using MineSet to Connect to Remote Databases
Sometimes it might not be feasible to install DataMover on the machine running the
database server. In this situation, DataMover can be installed on an intermediate server,
and DataMover then can use the database vendor’s networking facility to connect to the
remote database. (This sometimes is referred to as a three-tier architecture.)
Oracle

MineSet supports two ways to access remote Oracle databases:
•   The remote database is specifically mentioned in the dm_config file. For this method,
    add entries to the Oracle_Remote section of the dm_config file, as described in the
    “Mandatory Configuration File” section, above. Every remote database named in
    the dm_config file must be defined in the tnsnames.ora file. This file can be manually
    edited, or, more commonly, generated automatically by a network administration
    tool provided by Oracle. If this method is chosen, the only Oracle-specific file
    needed on the DataMover server is tnsnames.ora; in particular, Oracle need not be
    installed on this machine.

•   A local Oracle install is used as a gateway to a remote database. In this case, the
    dm_config file requires an entry for the local Oracle install, with ORACLE_HOME and
    ORACLE_SID. This entry must be in the Oracle section, not the Oracle_Remote section.
    Entries for any remote databases must be added to the
    $ORACLE_HOME/network/admin/tnsnames.ora file of the Oracle install on the
    intermediate server.
    Then, when users want to log in to user “system”, password “manager” at database
    “remotedb”, they must provide the name of the intermediate server for the Tool
    Manager “Log on to server...” dialog and select the intermediate server’s Oracle
    database. When logging in to the database, use system@remotedb for the database
    username, and manager for the password. (The added @remotedb specifies that
    Oracle must use SQL*Net™ to connect to the remote database, instead of using a
    local connection.)

Operating across SQL*Net is substantially slower than a local connection, especially for
queries that return a large amount of data. If possible, install DataMover on the same
machine as the Oracle server.

Sybase

A Sybase installation is required on the intermediate DataMover server; this Sybase
installation need not be running an active database, but it is needed for access to the
shared libraries and the interfaces file.
In order to access the Sybase SQL server running on the remote machine, the interfaces
file on the DataMover server machine must have an entry for this Sybase SQL server.
Please refer to your Sybase manuals for the procedure for creating such entries. Also, the
name of this Sybase SQL server on the remote machine must be included in the dm_config
file on the intermediate DataMover server machine.
Once this setup is done, access to the Sybase SQL server on the remote machine is
handled transparently. The user can choose it and access data from it just like any other
database source, using the panels from the Tool Manager.

Loading Sample Datasets
This section describes how to load the sample datasets included with the MineSet
distribution into one of the supported relational databases.
Installed on the server in /usr/lib/MineSet/DBexamples are
•   all the sample data, along with a brief description of what it contains
•   directions (README.server) on how to load the data using the provided scripts

Load the sample datasets into a database that has been set up on your server.
The /usr/lib/MineSet/DBexamples directory also contains scripts for loading the complete
set of data files into one of the supported databases. To load the complete set of data, run
one of the following loader scripts, depending on which database you have. (This assumes
your database and environment are already set up.)
sh load_all_Oracle.sh  
sh load_all_Sybase.sh  

If you are going to work with an INFORMIX database, use the dbaccess interface to
select
create_all_Informix.sql

followed by
load_all_Informix.sql

Loading Individual Datasets

Alternatively, you can load, or reload, the sample data separately. Each data directory in
/usr/lib/MineSet/DBexamples on the server contains files necessary to load the data into
any of the supported databases. These files are:
README - explains the data
*.sql - sets up an Oracle table
*.ctl - control file for loading into Oracle
*_syb.sql - sets up a Sybase table
*.bcp.fmt - Sybase format file
*_inf.sql - sets up an INFORMIX table
*_load.sql - loads the data into the INFORMIX table
In the *.ctl file, the separator is declared in the line
fields terminated by X'20'

The separator is specified in ASCII hexadecimal; thus:
X'20' is used for ‘ ’
X'2c' is used for ‘,’
X'09' is used for ‘\t’
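
The hexadecimal form of any other single-character separator follows from its ASCII code. A minimal Python sketch (ctl_separator is a hypothetical helper name, not part of MineSet or Oracle) reproduces the three values listed above:

```python
def ctl_separator(char):
    """Return the X'..' hexadecimal form of a one-character separator."""
    if len(char) != 1:
        raise ValueError("separator must be a single character")
    return "X'%02x'" % ord(char)


# Space, comma, and tab, as in the list above.
for sep in (" ", ",", "\t"):
    print(repr(sep), "->", ctl_separator(sep))
```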

Loading Into Oracle

Perform the following steps on the server with an Oracle database:
1.  Ensure the following environment variables are set correctly:
    ORACLE_HOME
    ORACLE_SID

2.  Type
    sqlplus <userid>/<passwd>
    SQL> @<dataset>.sql

    where <dataset> is the name of the dataset being loaded, and <userid>/<passwd> are your
    assigned username and password for the Oracle database.
    To delete an already existing table, type
    SQL> drop table <dataset>;

3.  Type
    sqlload control = <dataset>.ctl userid = <userid>/<passwd>
    log = /tmp/<dataset>.log direct = true

4.  Check the resulting <dataset>.log to ensure the data was loaded correctly.

Loading Into Sybase

Perform the following steps on the server with a Sybase database:
1.  Ensure that the following environment variables are set:
    SYBASE
    DSQUERY

2.  To create the table, type
    isql -U <userid> -P <passwd> -i <dataset>_syb.sql

    where <dataset> is the name of the dataset being loaded, and <userid>/<passwd> are your
    assigned username and password for the Sybase database.
    To delete an already existing table, type
    isql -U <userid> -P <passwd>
    drop table <dataset>
    go

3.  To load the data, type
    bcp <dataset> in <dataset>.data -U <userid> -P <passwd> -f <dataset>.bcp.fmt

    where <dataset> is the table name (created using <dataset>_syb.sql), in means
    "load into the dbms," <dataset>.data refers to the name of the ASCII data file, and -f
    points to the already-created format file. (When reading in from a file, the data types
    are character.)

Loading Into INFORMIX

Perform the following steps on the server with an INFORMIX database:
1.  Ensure the following environment variables are set:
    ONCONFIG
    INFORMIXSERVER
    INFORMIXTERM

2.  To create the table, type
    dbaccess

3.  If necessary, log into the appropriate database.
4.  Choose Query-language, then choose the appropriate database from those listed.
5.  Choose <dataset>_inf.sql, and run it.
6.  Choose <dataset>_load.sql, and run it (where <dataset> is the name of the dataset
    being loaded).

Chapter 3

3. The Tool Manager

This chapter discusses the functions of the Tool Manager, which is the graphical user
interface (GUI) that lets you specify data and configuration information for the MineSet
tools in this package. It provides an overview of this interface, then describes every
component of each panel that this interface displays for all MineSet tools.
Note: Any screens dedicated to a specific tool are discussed in the chapter for that tool;
for example, the screen for specifying the Tree Visualizer’s configuration file is discussed
in Chapter 5, “Using the Tree Visualizer.”

Overview
The Tool Manager is the initial graphical user interface (GUI) you use for most of your
interactions with the MineSet components. With Tool Manager you can select an existing
data source, transform or analyze that data, and visualize the results using any of the
MineSet individual tools. You can step through the process in these sections:
•   “Connecting to an Existing Data Source”
•   “Transforming the Data”
•   “Visualizing the Data on the Screen”

Note: Data files that were not created by the Tool Manager generally require some manual
work before the Tool Manager can use them.

Connecting to an Existing Data Source
You can specify the source of the data as being from a:
•   database table
•   database SQL query
•   file

Transforming the Data
Often the original data is unsuitable for mining or visualization. It may contain irrelevant
or redundant columns, data types that are not applicable for viewing, or inconsistencies
that result in unhelpful visualization. You can transform the data with the Tool Manager
to display it in a useful form in any of these ways:

•   mining tools: finds patterns in data
•   binning variables: discretizes column values into groups, such as grouping years
    by decade
•   removing columns: excises unneeded columns to save space
•   adding new columns: creates columns that are functions of existing columns
•   aggregation: finds the average, sum, min, max, or counts of column values
•   filtering: selects a subset of the data based on an expression using column values
•   sampling: selects a random subset of the data
•   making arrays: takes the values of one column and turns them into an array
    indexed by discrete values in another column
•   distributing columns: makes two or more new columns from a single column of
    values, distributed by the discrete values of another column

Visualizing the Data on the Screen
The final step, having transformed the data, is to visualize the results. For example, you
can display the data on the screen as:
•   a hierarchy (Tree Visualizer: Option, Decision, Regression)
•   a map (Map Visualizer)
•   a scatter plot showing relations of numerous independent variables (Scatter
    Visualizer and Splat Visualizer)
•   association rules (Rules Visualizer)
•   evidence and probability (Evidence Visualizer)
•   box plots and histograms (Statistics Visualizer and Cluster Visualizer)
•   layered tables or cakes (Decision Table)

With Tool Manager you can map data values to specific visual elements on the screen
such as:
•   colors
•   bars
•   heights

Finally, Tool Manager lets you control those options not related to data, including:
•   background colors
•   grid spacing
•   label sizes

Starting the Tool Manager
You can run the Tool Manager in two modes:
•   interactive mode: the Tool Manager provides windows, menus, buttons, and so on,
    to let you access, mine, and visualize your data. Interactive mode also lets you save
    a description of your actions to a “session file” for future use.
•   batch mode: the Tool Manager performs all the actions described in a session file
    without bringing up windows. For example, batch mode is useful for lengthy
    computations that need to be done every night, so that the data can be fully
    prepared each morning.

There are three ways to start the Tool Manager in interactive mode:
•   Double-click the MineSet icon, which is in the Applications or the MineSet page of
    the icon catalog. The Tool Manager starts with the same configuration used in the
    last Tool Manager session.
•   Double-click an icon representing a session file saved from a previous invocation of
    the Tool Manager. This starts the Tool Manager with that session file.
•   Start the Tool Manager from the UNIX shell command line by entering this
    command at the prompt:
    mineset [ sessionFile ]

    Here, sessionFile is optional and specifies the name of the session file to use. If
    you do not specify a session file, MineSet starts up with the configuration
    most recently used.
To start the Tool Manager in batch mode, enter this command at the UNIX shell prompt:
mineset_batch [-s serverPassword -d databasePassword] sessionFile

The -s and -d options let you specify the passwords for logging in to the server and the
database, respectively. If you do not specify these options, mineset_batch asks you to
type in the passwords; the options are therefore useful when running mineset_batch from a
shell script. To specify that there is no password for either the server or database, use -s
or -d followed by two double quotes, that is,
mineset_batch -s "" -d "" foo.mineset

If you specify one of the two passwords, you must specify both.
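
For nightly runs driven from a script, the both-or-neither rule for the passwords can be enforced before mineset_batch is launched. The Python sketch below only builds the argument list; mineset_batch_command is a hypothetical helper, and only the -s and -d options and the session file argument described above are used.

```python
def mineset_batch_command(session_file, server_password=None, database_password=None):
    """Build the mineset_batch argument list for one session file.

    Passing None for both passwords leaves mineset_batch to prompt for them.
    An empty string means "no password", as with -s "" -d "" on the command line.
    """
    if (server_password is None) != (database_password is None):
        raise ValueError("specify both passwords or neither")
    command = ["mineset_batch"]
    if server_password is not None:
        command += ["-s", server_password, "-d", database_password]
    command.append(session_file)
    return command


print(mineset_batch_command("foo.mineset", "", ""))
```

The resulting list can be handed to Python's subprocess.run on a machine where mineset_batch is installed.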

Figure 3-1 shows the Tool Manager’s startup window.

Figure 3-1    The Tool Manager Startup Window

This window consists of two panels related to the specific dataset and tool chosen, and
two information sections. Specification of servers and data sources is done via popup
dialogs accessible from the File menu.

The panels and information sections are
•   Data Transformations, which lets you modify the data from your data source.
•   Data Destination, which lets you create visualizations based on your data, save the
    data to a file, mine the data for association rules, create classifiers based on the data,
    or find important columns in the data.
•   The top panel, which provides information on the currently selected data source.
•   The bottom panel, which contains a stream of information on the status of certain
    operations.

The following sections describe each panel of the main Tool Manager window.

Choosing a Data Source
Data sources are selected using the first set of menu items in the File menu.

Figure 3-2    File Pulldown Menu


The first three options in the File menu let you select the data source from a
•   DBMS Table
•   DBMS Query
•   Data File

The fourth option, Connect to Server, lets you connect to a server without specifying the
data source.
You must connect to a server to get information from a database or mining tool, or to
apply transformations to an existing data file. A server connection is not necessary if you
plan to visualize an existing client data file without transforming it.

Choosing an Existing Data File
Use the Open New Data File menu option to work with an existing data file. When you
select this option, the dialog in Figure 3-3 appears.

Figure 3-3    Open New Data File Dialog Box


This dialog box, which is similar to a standard file selection dialog box, provides a toggle
at the top to select client versus server files; it also has a label indicating the name of the
current MineSet server, and a push button to let you log in to a new server. The radio
buttons at the top let you select files on your client machine (in any directory accessible
to you) or files that exist in your single cache on the DataMover server (see “Configuring
the DataMover Server” in Chapter 2).
When you select the name of a file from the list in the left window, the columns of that
data file are shown in the right window.
When you click the Change Server button, a dialog prompts you for a server name, login
name, and password to connect to the server (see Figure 3-5).
If you want to access a data file created outside of Tool Manager, you must create a
.schema file for it. This is a text file containing a configuration “input” section, which gives
the name of the data file and describes its layout. The Tool Manager supports input
sections similar to those for the Tree Visualizer (described in Appendix B), except that it
does not support variable length arrays or the monitor option.

Choosing a Database Table
Use the Open New DBMS Table menu option to work with tables in a DBMS. Selecting
this option causes the dialog box in Figure 3-4 to be displayed.

Figure 3-4    Choosing New Database Table Dialog Box

The name of the currently selected server appears to the left of the Change Server button.
If you click this button, the dialog box shown in Figure 3-5 appears. This lets you specify
a server name, login, and password.

Figure 3-5    Specifying Server Name, Login, and Password


Once you have logged in to a server, click the Change DBMS button to bring up a dialog
box that contains a popup menu listing DBMS names/vendors (see Figure 3-6). Select a
DBMS from the menu, and enter the login name and password to connect to the DBMS.
Note that the DBMS login and password are usually different from those required to
connect to the server.

Figure 3-6    Sample Dialog Box Listing Available DBMS Names/Vendors

If you have logged on to an Oracle DBMS, the dialog box appears as shown in Figure 3-4,
with a list of tables on the left. When you select a table, the columns for that table are
shown on the right.
If the DBMS is Informix or Sybase, the dialog box shown in Figure 3-7 appears, with a list
of databases for the DBMS. Select a database, and the list of tables in that database is
shown.

Figure 3-7    Dialog Box After Selecting Informix or Sybase DBMS

To use a certain table in the Tool Manager, select the table you want to use and click OK.

Running an SQL Query

Use the Open New DBMS Query menu option to work with tables created via SQL
queries against a DBMS. Selecting this option causes the dialog box in Figure 3-8 to
appear.

Figure 3-8    SQL Query Dialog Box

Selecting a server and DBMS in this dialog box has the same effect as selecting those
items in the Open New DBMS Table dialog box.
The SQL query is shown in the panel at the lower left. You can enter the query there, or
load it from a disk file using the Load SQL from File button. The names of tables and
columns in the current DBMS are shown to help build queries. To have their names
transferred to the SQL query panel, double-click on them.
When you have entered the SQL query, click the Submit SQL Query button to send it to
the DBMS for execution. The table columns resulting from the query appear on the
right.

Transforming the Data
The Data Transformations panel lets you manipulate the tables with which you want to
work. After you have selected a table (via the File menu, described above), its column
headings appear in the Current Columns window of the Data Transformations panel
(Figure 3-9).

Figure 3-9    The Data Transformations Panel

The functions of the displayed options are:
•   Remove Column: lets you delete one or more columns that are not relevant to the
    current visualization or mining.
•   Bin Columns: lets you assign each record to a group that falls within a certain range
    (bin) of column values. For example, an age column may be binned into the ranges
    (bins): 0-18, 19-25, 26-35, and so on.
•   Aggregate: adds columns of records (sum), creates a new column representing
    maximum or minimum values, or makes an array from a column that is indexed by
    other columns.
•   Filter: lets you select a subset of the data based on an expression involving column
    values, for example, leave only those records in which the age is less than 20.
•   Change Types: lets you change a column’s name as well as its type.
•   Add Column: lets you add a new column based on a mathematical expression. For
    example, add a column “minor” based on the column “age,” using the expression:
    “if age is less than or equal to 18 then minor is true; else minor is false.”
•   Apply Model: lets you use a previously created classifier to label new records, to
    estimate probabilities for label values, to test the classifier on new data, or to backfit
    data to an existing classifier (see Chapter 10, “MineSet Inducers and Classifiers,” for
    details).
•   Sample: lets you select a random subset of the data. This is useful for very large
    data sets.
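
The effect of several of these transformations can be pictured on a tiny in-memory table. The Python sketch below is only an analogy and does not use MineSet or its expression syntax: the records are invented, and the "36+" label stands in for the "and so on" in the binning example above.

```python
import bisect

# Invented sample records standing in for a table with "name" and "age" columns.
records = [
    {"name": "Ann", "age": 17},
    {"name": "Bob", "age": 19},
    {"name": "Cy", "age": 42},
]

# Add Column: minor is true if age is less than or equal to 18, else false.
with_minor = [dict(r, minor=(r["age"] <= 18)) for r in records]

# Filter: leave only those records in which the age is less than 20.
under_20 = [r for r in with_minor if r["age"] < 20]

# Remove Column: drop the "name" column.
without_name = [{k: v for k, v in r.items() if k != "name"} for r in under_20]

# Bin Columns: assign each age to one of the ranges 0-18, 19-25, 26-35, 36+.
upper_bounds = [18, 25, 35]
labels = ["0-18", "19-25", "26-35", "36+"]


def age_bin(age):
    """Return the label of the bin whose range contains age."""
    return labels[bisect.bisect_left(upper_bounds, age)]


binned = [dict(r, age_bin=age_bin(r["age"])) for r in records]
print(without_name)
print([r["age_bin"] for r in binned])
```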

The Remove Column Button
Remove Column lets you delete columns by selecting the column name or names in the
Current Columns panel, then clicking this button. The items in the Current Columns panel
change to show the new table columns. To choose multiple contiguous columns for
simultaneous removal, click and drag the mouse over the columns. To choose multiple
non-contiguous columns for simultaneous removal, hold down the Ctrl key while
selecting the additional columns.

The Bin Columns Button
Binning lets you sort the information from one or more columns into groups in a new
column or columns (for example, with a range of ages, 0-18, 19-25, 26-35, and so on).
Click Bin Columns to get a dialog box that lets you specify the binning options
(Figure 3-10).

Figure 3-10    Bin Columns Dialog Box


This dialog box lets you
•   choose the column that is to be divided into bins
•   specify the name of the new column to contain values for the bins
•   set bin thresholds, or specify a range with thresholds at regular intervals

To specify binning options for one or more columns, select the column name(s), choose
the appropriate options below, and click the Apply button at the bottom of the dialog box.
If you select only one column for binning, the name of the resulting binned column
appears in the “New column name” box, and you can type in a new name if you like. In
the example shown in Figure 3-10, mpg_bin is the name for the new column; in this case,
it provides a range of fuel efficiencies. If you select more than one column for binning,
New column name stays inactive.
Next to New column name is a check box labeled “Delete original column”. When
chosen, this option automatically deletes the original column after binning. Click the
check box to turn this function on or off.
In the middle of the Bin Columns dialog box are two tabs for choosing Automatic
Thresholds or User Specified Thresholds. Choose Automatic Thresholds if you’d like the
computer to suggest the bins or User Specified Thresholds if you’d like to specify the
thresholds yourself.
Automatically Computed Thresholds

If you’ve chosen the Automatic Thresholds tab, the program can use machine learning to
suggest bins.

Figure 3-11    Binning With Automatically Computed Thresholds

The first choice under Automatic Thresholds is between the Automatically choose number
of bins and the Group into: ___ bins buttons. Click Automatically choose number of bins to let
the computer decide the best number of bins. If you choose to specify the number of bins,
click Group into: ___ bins, and type the number of bins you want into the field.
There are three ways to categorize data into bins:
• Automatic: you must also select a discrete label. The thresholds are chosen so that
  the distributions of labels within different bins are as different as possible. This
  approach continues to create thresholds that split the range until no additional
  interval is considered significant.
  The “Min weight per bin” text field lets you specify the minimum weight in any bin;
  this prevents the creation of bins with less weight than the number specified. No
  interval is split if the two resulting subintervals do not each contain at least the
  minimum weight you specify. By default, each instance has unit weight. In this
  situation, specifying the Min weight per bin is the same as specifying the minimum
  number of instances per bin.
  Rather than specifying the minimum weight per bin, it is possible to have the
  algorithm set that value automatically. The check box labeled Auto causes the
  algorithm to calculate a value for the minimum weight per bin based on the total
  weight of the instances: the more total weight, the higher the minimum weight per
  bin (the relationship is logarithmic).

• Uniform Range: the algorithm divides the value range into the specified number of
  uniformly sized subintervals. The upper and lower bounds for the extreme ranges
  include any values outside the ranges observed in the data. For example, if the
  values for an attribute are in the range 3-8, and you specify four bins, the thresholds
  identified are 4.25, 5.5 and 6.75, corresponding to the ranges:
    • ≤ 4.25
    • > 4.25 to 5.5
    • > 5.5 to 6.75
    • > 6.75

• Uniform Weight: the algorithm divides the value range into the specified number of
  equal weight bins. Unlike Uniform Range, in which thresholds are identified that
  separate the value range into intervals of equal size, Uniform Weight identifies
  thresholds that group the instances into subsets of equal weight. By default, each
  instance has unit weight. In this case, the Uniform Weight approach produces the
  specified number of bins, each containing an approximately equal number of
  instances.
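The difference between the Uniform Range and Uniform Weight approaches can be sketched in a few lines of Python. This is only an illustration of the two ideas described above, not the Tool Manager's implementation; the function names are hypothetical, and the handling of ties and of the Automatic approach is not covered.

```python
def uniform_range_thresholds(values, n_bins):
    """Return the n_bins - 1 cut points that split [min, max] into equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def uniform_weight_thresholds(values, n_bins, weights=None):
    """Return cut points so that each bin holds roughly total_weight / n_bins."""
    if weights is None:
        weights = [1.0] * len(values)  # default: every instance has unit weight
    pairs = sorted(zip(values, weights))
    total = sum(weight for _, weight in pairs)
    thresholds, running, k = [], 0.0, 1
    for value, weight in pairs:
        running += weight
        if k < n_bins and running >= k * total / n_bins:
            thresholds.append(value)
            # Skip any further shares that this one instance already covers.
            while k < n_bins and running >= k * total / n_bins:
                k += 1
    return thresholds


# Values in the range 3-8 with four bins, as in the Uniform Range example.
print(uniform_range_thresholds([3, 4, 6, 8], 4))               # [4.25, 5.5, 6.75]
print(uniform_weight_thresholds([1, 2, 3, 4, 5, 6, 7, 8], 4))  # [2, 4, 6]
```

With unit weights, the second call places two instances in each of the four bins. Passing a `weights` list changes only the Uniform Weight result.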


Both Uniform Range and Uniform Weight let you specify a trimming fraction, which
indicates the fraction of extreme values to be excluded from the value range prior to
generating bins. The default trimming fraction is 0.05. This excludes the 5% of the
instances with the most extreme values (2.5% with the lowest values in the range, and
2.5% with the highest values in the range). Trimming tends to reduce the influence of
outliers on the generation of thresholds.
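As a rough illustration of trimming, the sketch below drops half of the trimming fraction from each end of the sorted values before any thresholds are computed. It assumes unit instance weights and simple truncation when the count is not a whole number; `trim_extremes` is a hypothetical name, and the Tool Manager's exact rounding is not documented here.

```python
def trim_extremes(values, trimming_fraction=0.05):
    """Drop the most extreme values, half of the fraction from each tail."""
    ordered = sorted(values)
    per_tail = int(len(ordered) * trimming_fraction / 2)  # instances dropped per tail
    return ordered[per_tail:len(ordered) - per_tail]


# 40 values, one of which is an outlier; 5% trimming removes one value per tail.
values = list(range(1, 40)) + [1000]
trimmed = trim_extremes(values)
print(min(values), max(values))    # 1 1000
print(min(trimmed), max(trimmed))  # 2 39
```

Without trimming, a Uniform Range split of 1 to 1000 would put nearly every instance in the first bin; after trimming, the range is 2 to 39.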
All of the approaches let you decide whether you want to specify the number of bins or
let the algorithm select the number automatically. For the Uniform Range and Uniform
Weight approaches, the automatic selection of the bins is based on the number of distinct
values: the more distinct values, the more bins are chosen (the relationship is
logarithmic).
Typically, all of the available instances are used when identifying thresholds. When
binned attributes are later used to induce a classifier, the error estimates for that classifier
tend to be overly optimistic, because distributional information from the test set
was used to identify thresholds. The Use training set only option prevents the binning
approaches from looking at the records in the test set when identifying thresholds. This
tends to give a more realistic estimate of the classifier’s error rate. Use training set only
requires you to specify the same Holdout ratio and Random seed (see “Error Options for
Inducers” in Chapter 10) that are used to create the holdout set for estimating classifier error.
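The reason the same Holdout ratio and Random seed are needed can be seen in a small sketch: thresholds must be computed from exactly the records that later form the training set. The code below is an assumption-laden illustration, not MineSet's splitting procedure. In particular, it treats the holdout ratio as the fraction of records kept for training, and `split_holdout` and `range_thresholds` are hypothetical names.

```python
import random


def split_holdout(records, holdout_ratio, seed):
    """Shuffle reproducibly, then split into (training, test) portions."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = int(round(len(shuffled) * holdout_ratio))
    return shuffled[:n_train], shuffled[n_train:]


def range_thresholds(values, n_bins):
    """Equal-width cut points over the observed range."""
    lo, hi = min(values), max(values)
    return [lo + i * (hi - lo) / n_bins for i in range(1, n_bins)]


ages = [18, 22, 25, 31, 37, 44, 52, 58, 63, 71]

# The same ratio and seed always reproduce the same split, so the binning
# step and the later error estimate agree on which records are "test".
train, test = split_holdout(ages, holdout_ratio=0.7, seed=7)
thresholds = range_thresholds(train, 3)  # test records are never examined
```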
The Use Weight menu lets you weight the instances by any numeric attribute. Changing
instance weight affects both Automatic and Uniform Weight, but has no effect on
Uniform Range.
If you click Apply, the Tool Manager picks bin thresholds and displays them in the
“Thresholds for selected column are” text field. The text field at the bottom of the Bin
Columns window shows the progress of the binning algorithm and any errors that occur.


Specifying Thresholds

If you specify your own thresholds (as shown in Figure 3-10), you can choose between
Use custom thresholds or Use evenly spaced thresholds by clicking either button. When you
type in the thresholds, you must click Apply to make those thresholds effective for the
selected columns.
The Use custom thresholds text box lets you enter the range criteria. For example, you
could enter the numbers 18, 30, 50, 60. This results in the following ranges: 0-18, 19-30,
31-50, 51-60, 61+. Note that you enter only the digits and commas, not the ranges.
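The mapping from a list of custom thresholds to ranges can be sketched as follows. This is a minimal illustration that assumes integer values and a lower bound of 0, as in the example above; `bin_labels` and `assign_bin` are hypothetical helper names.

```python
from bisect import bisect_left


def bin_labels(thresholds):
    """Build integer range labels such as '0-18', '19-30', ..., '61+'."""
    labels, low = [], 0
    for threshold in thresholds:
        labels.append(f"{low}-{threshold}")
        low = threshold + 1
    labels.append(f"{low}+")
    return labels


def assign_bin(value, thresholds):
    """Return the index of the bin that holds value."""
    # bisect_left finds the first threshold >= value, so a value equal
    # to a threshold falls in the lower of the two neighboring bins.
    return bisect_left(thresholds, value)


thresholds = [18, 30, 50, 60]
labels = bin_labels(thresholds)
print(labels)                              # ['0-18', '19-30', '31-50', '51-60', '61+']
print(labels[assign_bin(30, thresholds)])  # 19-30
print(labels[assign_bin(75, thresholds)])  # 61+
```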
To specify equally spaced bins over a range of values, click the Use evenly spaced thresholds button.
This activates the three text fields below it. You can type the start of the binning range,
the end of the range, and the spacing of the bins, respectively, into these fields. If you are
binning a column that is a date, you can specify units of time for the bin spacing (using
the “Date units” popup menu under the text fields). This would permit you, for example,
to bin a time period into bins of three weeks. Dates entered into these fields must be
typed in the form “MM/DD/YY”. Possible time units are as follows:
•

years

•

quarters

•

months

•

days

•

hours

•

minutes

•

seconds
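Evenly spaced date thresholds can be pictured with the sketch below, which steps from a start date to an end date in fixed increments. Weeks are not among the units listed above, so the three-week example is expressed as 21 days. Whether the Tool Manager counts the start of the range as a threshold is not stated; including it here is an assumption, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def evenly_spaced_date_thresholds(start, end, spacing_days):
    """Return MM/DD/YY strings from start to end, spacing_days apart."""
    current = datetime.strptime(start, "%m/%d/%y")
    last = datetime.strptime(end, "%m/%d/%y")
    step = timedelta(days=spacing_days)
    thresholds = []
    while current <= last:
        thresholds.append(current.strftime("%m/%d/%y"))
        current += step
    return thresholds


# Three-week bins between January 1 and March 1, 1996.
print(evenly_spaced_date_thresholds("01/01/96", "03/01/96", 21))
# ['01/01/96', '01/22/96', '02/12/96']
```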

The Use custom thresholds text box accepts dates either in double quotes (as shown below),
or without. If you enter dates without quotes, the quotes are added automatically.
"1/1/96", "2/1/96", "3/1/96", "4/1/96", "5/1/96", "6/1/96"

However, do not put quotes around dates used with Use evenly spaced thresholds.
Note: If you enter an invalid parameter, an error message is displayed after you click
Apply, informing you of the valid options and letting you either cancel the command or
return to the dialog box to make the appropriate changes.


Aggregation
Before describing the features and effects of the Aggregate button (see page 86), this
section provides an introduction to the concept of arrays and distribution as used in the
aggregation feature.
Introduction to Arrays and Distribution

The Aggregate button lets you perform simple aggregations (for example, sum, min, max,
and so on), make arrays, and distribute columns. (See Table 3-1)
Table 3-1

Aggregate Example 1

State    Age_bin    Total $ Spent
CA       0-20       $50
CA       21-40      $454
CA       41-60      $693
NY       0-20       $35
NY       21-40      $541
NY       41-60      $628

If you make Total $ Spent into an array indexed by the binned column Age_bin, the
resulting table has only two columns:
Table 3-2

Aggregate Example 2

State    Total $ Spent [Age_bin]
CA       [$50, $454, $693]
NY       [$35, $541, $628]

In this case, making an array reduces the number of columns by one, and also reduces
the number of rows by four. Arrays are useful for the Tree Visualizer tool; they are
necessary if you want to use sliders in Scatter Visualizer and Map Visualizer displays.
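Conceptually, making an array gathers the values that share the same remaining columns into one list, ordered by the index column. The following sketch reproduces the step from Table 3-1 to Table 3-2 with plain Python data structures; it is an illustration only, and `make_array` is a hypothetical name.

```python
rows = [
    ("CA", "0-20", 50), ("CA", "21-40", 454), ("CA", "41-60", 693),
    ("NY", "0-20", 35), ("NY", "21-40", 541), ("NY", "41-60", 628),
]
age_bins = ["0-20", "21-40", "41-60"]


def make_array(rows, bins):
    """Collapse (state, bin, value) rows into one array per state, indexed by bin."""
    position = {name: i for i, name in enumerate(bins)}
    result = {}
    for state, age_bin, spent in rows:
        array = result.setdefault(state, [None] * len(bins))
        array[position[age_bin]] = spent
    return result


print(make_array(rows, age_bins))
# {'CA': [50, 454, 693], 'NY': [35, 541, 628]}
```

Six rows become two, and the Age_bin column disappears because it now serves as the array index.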


Distributing columns is similar, but differs in several important ways. Instead of
producing a single new column holding many values, distributing produces one new
column for each value of the index. For example, if Total $ Spent in the first table is not
made an array, but is instead distributed by Age_bin, the result is:
Table 3-3

Aggregate Example 3

State    Total $_0-20    Total $_21-40    Total $_41-60
CA       $50             $454             $693
NY       $35             $541             $628

Thus, distributing increases the number of columns but decreases the number of rows.
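Distribution can be sketched in the same way as array creation, except that each index value becomes its own named column. The code below reproduces the step from Table 3-1 to Table 3-3; the column naming follows the table, and `distribute` is a hypothetical name rather than a MineSet function.

```python
rows = [
    ("CA", "0-20", 50), ("CA", "21-40", 454), ("CA", "41-60", 693),
    ("NY", "0-20", 35), ("NY", "21-40", 541), ("NY", "41-60", 628),
]


def distribute(rows, value_name="Total $"):
    """Turn (state, bin, value) rows into one record per state with a column per bin."""
    result = {}
    for state, age_bin, spent in rows:
        result.setdefault(state, {})[f"{value_name}_{age_bin}"] = spent
    return result


table = distribute(rows)
print(table["CA"])
# {'Total $_0-20': 50, 'Total $_21-40': 454, 'Total $_41-60': 693}
```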
If you have more than one binned column (for example, Age_bin and Sex_bin), you can
make a two-dimensional array (indexed by combinations of Age_bin and Sex_bin). You
also can distribute and make an array at the same time.
This table has two binned columns, one for age and one for sex:
Table 3-4

Example of Binning

State    Age_bin    Sex_bin    Total $ Spent
CA       0-20       1          $20
CA       0-20       2          $30
CA       21-40      1          $220
CA       21-40      2          $234
CA       41-60      1          $401
CA       41-60      2          $292

If you make Total $ Spent an array indexed by age, and remove Sex_bin, the results are:
Table 3-5

Results When Making Total $ Spent an Array

State    Total $ Spent [Age_bin]
CA       [$50, $454, $693]

If you do not remove Sex_bin, the results are:
Table 3-6

Results When Specifying Sex_bin

State    Sex_bin    Total $ Spent [Age_bin]
CA       1          [$20, $220, $401]
CA       2          [$30, $234, $292]

If you make an array by both Age_bin and Sex_bin, the results are:
Table 3-7

Results of Making an Array by Age_bin and Sex_bin

State    Total $ Spent [Age_bin] [Sex_bin]
CA       [$20, $220, $401, $30, $234, $292]

Finally, if you distribute by Sex_bin and index by Age_bin, the results are:
Table 3-8

Results of Distributing Sex_bin and Indexing by Age_bin

State    Total $ Spent [Age_bin], Sex = 1    Total $ Spent [Age_bin], Sex = 2
CA       [$20, $220, $401]                   [$30, $234, $292]

The examples above (with the exception of Table 3-5) had exactly one relevant value for
each array element, and the distribution merely rearranged existing data values. For the
example in Table 3-5, there were two data values for each array element, and these were
summed. MineSet provides several aggregation options for datasets containing more
than one value to be distributed into a given output array element. The most common
option is to add the values (as done in Table 3-5). This is useful when accumulating
expenditures into budgets, for example. You also can take the minimum, maximum, or
average of the values, or count them.
When distributing values for a given dataset, it is possible that there are no values
appropriate for a particular bin. In this case, for min, max, avg, and sum aggregations, the
DataMover fills in a value of NULL. For count aggregations, the DataMover fills in a
value of 0.
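The following sketch illustrates how several values can be combined into one array element, and how empty bins are filled. It uses the CA rows from Table 3-4 with Sex_bin removed, plus one hypothetical NY row added only to show an empty bin. Python's `None` stands in for NULL, and `aggregate_into_array` is a hypothetical name, not part of the DataMover.

```python
def aggregate_into_array(rows, bins, how="sum"):
    """Combine all values that land in the same array element."""
    position = {name: i for i, name in enumerate(bins)}
    collected = {}
    for state, age_bin, spent in rows:
        cells = collected.setdefault(state, [[] for _ in bins])
        cells[position[age_bin]].append(spent)

    reducers = {
        "sum": sum, "min": min, "max": max, "count": len,
        "avg": lambda values: sum(values) / len(values),
    }
    reduce_cell = reducers[how]

    def fill(cell):
        if cell:
            return reduce_cell(cell)
        return 0 if how == "count" else None  # empty bin: 0 for count, NULL otherwise

    return {state: [fill(cell) for cell in cells] for state, cells in collected.items()}


age_bins = ["0-20", "21-40", "41-60"]
rows = [
    ("CA", "0-20", 20), ("CA", "0-20", 30),
    ("CA", "21-40", 220), ("CA", "21-40", 234),
    ("CA", "41-60", 401), ("CA", "41-60", 292),
    ("NY", "0-20", 35),  # hypothetical row, not in Table 3-4
]

print(aggregate_into_array(rows, age_bins, "sum"))
# {'CA': [50, 454, 693], 'NY': [35, None, None]}
print(aggregate_into_array(rows, age_bins, "count"))
# {'CA': [2, 2, 2], 'NY': [1, 0, 0]}
```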


The Aggregate Button

You can use the Aggregate button to create simple aggregations, make arrays, or
distribute columns. Clicking this button causes the Aggregate dialog box to appear
(Figure 3-12). It shows three lists, with the columns in the current table appearing in the
middle list. If you want to aggregate, distribute, or turn a column into an array, select the
name of the column, and click the left arrow button between the left and center lists.
Below are popup menus that let you specify indexes (if the result is to be an array) and a
distribution column (if the result is to be distributed). In addition, at the bottom of the
dialog box are five toggles that let you specify how different values are to be combined
when aggregated: either summed, averaged, the min or max value, or the count. When
you are aggregating number-valued columns, you can choose any combination of these
options. For other types, only count is permitted. If you choose more than one option,
you get more than one result. For example, selecting average and max gives you one
result with average values, and another one holding the max values.

Figure 3-12

Aggregate Dialog Box

The three lists of column names are given below:
•

Columns to aggregate.

•

Group-By columns (the default); this keeps the columns unchanged throughout the
operation. For each set of records with the same combination of values in the Group
By columns, only one record is output in the resulting table, with values in the
aggregated columns summed, averaged, minned, maxed, or counted (depending
on the checkboxes at the bottom of the panel).

•

Columns to remove, as can be seen with the Sex_bin column in Table 3-5.

After you have finished with the additional aggregate criteria dialog box, the Current
Columns text box in the Table Processing window shows the new column names that
result from applying these criteria.

The Filter Button
This button lets you filter the data via a mathematical expression. The resulting table
includes only records for which the expression is true (or, if numerical, non-zero). When
you click Filter, the Filter dialog (Figure 3-13) appears.

Figure 3-13

Filter Dialog Box


This dialog box lets you select column names and operators on the left to build an
expression on the right. For a complete description of the expression definition language,
see “The Configuration File” in Appendix B.
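The filtering rule (keep a record when the expression is true or, for a numeric result, non-zero) can be illustrated with a short sketch. Python functions stand in for the MineSet expression language here, which is described in Appendix B and is not reproduced; `filter_rows` and the sample records are hypothetical.

```python
def filter_rows(rows, expression):
    """Keep only rows for which the expression is true or non-zero."""
    return [row for row in rows if expression(row)]


rows = [
    {"State": "CA", "Spent": 50},
    {"State": "CA", "Spent": 454},
    {"State": "NY", "Spent": 35},
]

# A boolean expression keeps rows where it is true.
big_spenders = filter_rows(rows, lambda row: row["Spent"] > 100)

# A numeric expression keeps rows where the result is non-zero.
not_fifty = filter_rows(rows, lambda row: row["Spent"] - 50)
```

Here `big_spenders` holds only the record with 454, while `not_fifty` drops only the record whose expression evaluates to 0.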

The Change Types Button
This button lets you change the name of a column, as well as its type.
Changing a Column Type

Some databases store numerical values as strings. Oracle stores all numbers (both
integers and real numbers) in a single format, which defaults to the data type double in
the Tool Manager. You can use the Change Types button to ensure that these values are
processed correctly. To change the type of one or more columns, click the Change Types
button. A new dialog box appears (see Figure 3-14). This dialog box contains a window
with a list of column headings and their respective types.

Figure 3-14

Change Types Dialog Box

First select a column heading in the window. Then click the New type button. This
produces a popup list of the possible types (invalid types are grayed out), as shown in
Figure 3-15.

Figure 3-15

Types Popup List

•

int—represents a 32-bit signed integer.

•

float—represents a single-precision floating-point number. The decimal point is
optional when representing a floating-point number.

•

double—represents a double-precision floating-point number. The decimal point is
optional when representing a floating-point number.

•

dataString—represents a string that is unlikely to appear multiple times. If it
appears multiple times, several copies are made. A dataString can be used to store
an address. Addresses are unlikely to be compared, and each record can have a
different address.


•

string—represents a string of characters that can appear multiple times in the data
file. Unlike a dataString, only a single copy of a given string is stored in memory, no
matter how many times it appears in the data. This saves memory for strings
appearing many times.
Comparing strings is also much quicker than comparing dataStrings. However,
reading in strings can be slower than reading in dataStrings because it is necessary
to look for duplications. An example of string use would be for a division name that
appears once for each department in the division. If you are unsure whether to use a
string or a dataString, use a string.

•

date—represents the date type from the database.

•

bin—represents a column created by a binning operation.

•

fixed-length array—an array of values of fixed size, not created by the Tool Manager.

•

bin-base array—an array of values as can be created by the Tool Manager.

•

variable-based array—an array of values of variable size, not created by the Tool
Manager.
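The trade-off between string and dataString can be pictured with a small pooling sketch: each distinct value is stored once, at the cost of a lookup when the value is read in. This is a conceptual illustration only. `StringPool` is a hypothetical class and says nothing about how the Tool Manager actually stores strings.

```python
class StringPool:
    """Keep a single shared copy of each distinct string."""

    def __init__(self):
        self._pool = {}

    def intern(self, text):
        # The lookup is the extra cost paid while reading the data.
        return self._pool.setdefault(text, text)

    def __len__(self):
        return len(self._pool)


pool = StringPool()
# Build the same division name twice, as if read from two records.
first = pool.intern("".join(["Sales", " Division"]))
second = pool.intern("".join(["Sales ", "Division"]))

print(first is second)  # True
print(len(pool))        # 1
```

Because both records share one object, comparing them can be done by identity rather than character by character.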

After selecting a new type, click Apply to have the change take effect.
If you try to convert an inappropriate field (such as a name) to a number, the resulting
values are all zeroes.
Note: When the data source is an existing file, there are fewer possibilities available for
changing any given column.
Changing a Column Name

Select the original column, type a new name in the text field, and click Apply.
To exit this dialog box, click Close.


The Add Column Button
You can use the Add Column button to create a new column whose values are computed
based on a mathematical expression. For example, you could add a new column whose
values are the ratio of values from two existing columns. Click Add Column to get a dialog
box that lets you specify the new column name and expression (Figure 3-16).

Figure 3-16

The Add Column Dialog Box

In the upper left of this dialog box is a field for entering the new column’s name. Below
this is a popup menu that lets you specify the column type (integer, string, floating point,
and so on).


The right-hand side of the dialog contains a large text entry area where you can type in
a definition of the expression (for a complete description of the expression definition
language, see “The Configuration File” in Appendix B). As a shortcut to typing column
names and operators, scrolled lists in the lower left of the dialog display all columns in
the current table and all possible operators. To insert a column name or operator into the
expression, either double-click it in its scrolled list, or select it and click the arrow button
to the right of the scrolled list.
To check the expression you have created, click the Check Expression button. If there is an
error, a dialog box appears, indicating what the error is and where it occurred. When you
click OK, the expression is automatically checked, and the dialog box is not removed
unless the expression is correct.
The Add Column dialog box checks for type compatibility: if you have assigned a
numerical expression to a string column (or vice versa), a warning message appears, and
the type of the new column is automatically changed to be correct.
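As an illustration of a computed column and of the type check described above, the sketch below adds a ratio column and corrects a mismatched declared type. Python functions again stand in for the expression language; `add_column`, the sample columns, and the warning text are hypothetical.

```python
def add_column(rows, name, expression, declared_type=float):
    """Add a computed column; return the type actually used for it."""
    values = [expression(row) for row in rows]
    is_text = all(isinstance(value, str) for value in values)
    column_type = str if is_text else float
    if column_type is not declared_type:
        # Mirror the behavior described above: warn, then correct the type.
        print(f"warning: column {name!r} changed from "
              f"{declared_type.__name__} to {column_type.__name__}")
    for row, value in zip(rows, values):
        row[name] = value
    return column_type


rows = [
    {"sales": 100.0, "cost": 80.0},
    {"sales": 50.0, "cost": 25.0},
]

# A numerical expression assigned to a column declared as a string.
used_type = add_column(rows, "ratio", lambda row: row["sales"] / row["cost"],
                       declared_type=str)
```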

The Apply Model Button
The Apply Model button lets you use a previously created model to label new records in
the current table, to estimate probabilities for a label value, to test the performance of the
model on the current table, or to backfit the current table onto an existing model. See
Chapter 10, “MineSet Inducers and Classifiers,” for details.


The Sample Button
This button lets you select a random subset of the data. This is useful for data sets that
are too large to work with efficiently. When you click Sample, the Sampling dialog box
(Figure 3-17) appears.

Figure 3-17

Sampling Dialog Box

You can sample two ways: as a percentage of the current table, or by setting the
maximum number of records to put in the sample. Percentage sampling is approximate:
you can get slightly more or slightly fewer records than the exact percentage would
indicate. The random sample is based on a numeric seed that can be specified in the
Sampling dialog box. If no seed is specified, the number 1 is used as the seed. If you want a
different random sample, specify a different random seed.
When you click the Complementary Sample toggle, you get all records except those that fall
in the random sample. That is, if you get a 10% sample with the Complementary Sample
not clicked, when you click it, you get the remaining 90% of the data.
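One way to picture seeded percentage sampling and its complement is sketched below. Each record is kept independently with the requested probability, which is why the sample size is only approximately the requested percentage. This is an assumption about the mechanism made for illustration, not a description of the DataMover's algorithm, and `sample` is a hypothetical name.

```python
import random


def sample(records, percent, seed=1, complementary=False):
    """Keep each record with probability percent/100; optionally return the rest."""
    rng = random.Random(seed)  # the same seed always gives the same sample
    picked = [rng.random() < percent / 100.0 for _ in records]
    return [record for record, keep in zip(records, picked)
            if keep != complementary]


records = list(range(1000))
ten_percent = sample(records, 10)
the_rest = sample(records, 10, complementary=True)

# Together, the sample and its complement cover every record exactly once.
print(len(ten_percent) + len(the_rest))  # 1000
```

Calling `sample(records, 10, seed=2)` would select a different subset of roughly the same size.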


The Table History Buttons
Table processing is a series of operations performed by using the buttons described
above. To allow you to see this series of steps, and go back if you made a mistake, there
are two Table History buttons at the bottom of the Table Processing panel (Figure 3-18).
When you click the left arrow button, the columns window shows the table as it
appeared at an earlier step. Clicking the right arrow button returns the table to its current
state.

Figure 3-18

Table History Buttons

The “Current view is” Field
To the right of the history buttons is the information field Current view is, which counts
the changes you’ve made and indicates which step you are viewing. The two integers in
this field indicate which table view you’re looking at, out of the total number of table
views that exist. For example, if you’ve made two changes, you can view the original
table (1 of 3), the table after the first change (2 of 3), or the table after the second change
(3 of 3).

The Prev and Next Buttons
As you go back and forth using the Table History buttons to view earlier versions of the
table, the Prev: and Next: fields (under the arrow buttons) help you keep track of where
you are in the history of the table. For any table you view, the Prev: field tells you what
the previous change was, and the Next: field tells you the next change.
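The relationship between the history buttons, the Current view is field, and the Prev: and Next: fields can be modeled with a list of table views and a cursor. The sketch below is only a mental model under that assumption; `TableHistory` and the operation descriptions are hypothetical.

```python
class TableHistory:
    """A list of table views plus the operations that connect them."""

    def __init__(self, original_columns):
        self.views = [list(original_columns)]  # view 1 is the original table
        self.operations = []                   # operations[i] leads from view i+1 to view i+2
        self.current = 0

    def apply(self, description, new_columns):
        self.operations.append(description)
        self.views.append(list(new_columns))
        self.current = len(self.views) - 1

    def back(self):
        self.current = max(0, self.current - 1)

    def forward(self):
        self.current = min(len(self.views) - 1, self.current + 1)

    def current_view_is(self):
        return f"{self.current + 1} of {len(self.views)}"

    def prev_op(self):
        return self.operations[self.current - 1] if self.current > 0 else None

    def next_op(self):
        return self.operations[self.current] if self.current < len(self.operations) else None


history = TableHistory(["age", "income"])
history.apply("Bin age", ["age", "income", "age_bin"])
history.apply("Remove income", ["age", "age_bin"])
print(history.current_view_is())  # 3 of 3

history.back()
print(history.current_view_is(), history.prev_op(), history.next_op())
# 2 of 3 Bin age Remove income
```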


The Edit Prev. Op Button

The Edit Prev. Op. button allows you to edit the operation shown in the Prev. field. (This
button is not active when Current view is: 1 of some number, because that is the original
table, with no previous changes.) When you click the Edit Prev. Op. button, the dialog box
for the previous operation comes up, and you can make changes to that operation. For
example, if the previous operation was binning columns, when you click Edit Prev. Op.,
the Bin Columns dialog box appears.
Note that by changing a previous operation, you could affect operations you set up
subsequent to it. For example, if you delete a column that you used in a
subsequent binning operation, that binning operation becomes invalid. The View History
button can help you avoid such problems.
The View History Button

When you click the View History button, the panels showing the current column and data
destination are replaced by a panel showing you the complete history of the Data
Transformation table (Figure 3-19). Each version of the table appears as a box containing a
list of the columns, linked by a smaller box (indicating the operation performed on the
table) to the next version of it.


Figure 3-19

View History Dialog Box

As with Edit Prev. Op, changing one operation usually affects (and sometimes invalidates)
subsequent operations in the history. You can select a specific operation to edit, add, or
view. The View History dialog warns you when changes affect the history and shows you
the new history. The row of buttons beneath the diagram window of the View History panel
allows you to change the size and orientation of the diagram as detailed below.


Zoom Buttons

Under the window displaying this flow chart are the zoom buttons that let you view the
flow chart closer up or farther away (Figure 3-20). You can choose the zoom by using the
button indicating the percentage, or by clicking the arrow buttons to increase or decrease
the size. The increments of change are the same whether you use the percentage button
or the arrow buttons.

Figure 3-20

Zoom Buttons

Overview Button

This button (Figure 3-21) creates, in a separate window, an overview of the entire history
chart that is synchronized with the View History dialog. The overview window shows
you which part of the history is currently visible, and lets you pan to other parts of the
history.

Figure 3-21

Overview Button

Vertical/Horizontal View Button

Next to the zoom buttons is a toggle button that lets you view the flow chart vertically or
horizontally (Figure 3-22). Clicking the button switches you back and forth between the
two points of view.

Figure 3-22

Vertical/Horizontal View Button


Data Source

Under the Data Source heading is the Change Data Source... button, which lets you change
the table on which the history operates. When you hold the button down, a menu
appears that lets you choose
•

...to DBMS table

•

...to DBMS query

•

...to Data File

Selecting one of these items causes a dialog box to appear that lets you select the new data
source.
Note: As with editing the history, changing the data source can invalidate history
operations.
View

Under the indicator View is a View Single Ops/Dest button. When this button is pressed,
the panel showing the history is hidden, and the panels showing current columns and
data destinations return to view. The function of this button is the same as choosing the
Single Ops and Destination option from the View pulldown menu at the top of the
window.
For Selected Operation/Table

Under the indicator For Selected Operation are three rows of buttons that become active if
you click one of the operations or tables in the flow chart. Once you select an operation,
you can alter it.


•

The Edit Op button brings up the dialog box for the selected operation, so you can
make changes to it.

•

The Delete Op button removes the operation from the table history, and the elements
that follow in the flow chart move over when it disappears.

•

The Add New Op. Before and Add New Op. After buttons let you insert a new
operation into the table history.

•

The View Data button shows the data for any selected table in the history. When you
click this button, a menu is displayed enabling you to select the entire dataset, or a
random sample of 10, 100, or 1,000 records.


Other

Under the indicator Other there are three buttons that affect the total history file:
•

Undo Change—undoes the most recent change to the history (except changes to the
data source)

•

Redo Change—redoes any change you have undone.

•

Save to PostScript—save a picture of the history flow chart to a file in PostScript
format.

Investigating the Data
The Data Destination panel (Figure 3-23) lets you direct your processed data to one of the
MineSet visualization or mining tools, or to a data file.
There are three tabs at the top of this panel:
•

Viz Tools

•

Mining Tools

•

Data Files

These are the three possible destinations for your data. They are discussed in greater
detail in later chapters dealing with the Data Destination tools.

Using Visualization Tools
If you choose the Viz Tools tab, the visualization tool panel appears under Data
Destination (Figure 3-23).


Figure 3-23

Data Destination Panel

Viz Tool is a popup menu that lets you choose among Map Visualizer, Scatter Visualizer,
Splat Visualizer, Tree Visualizer, Statistics Visualizer and Record Viewer, to determine the type
of visual representation you want for your data.
The first five tools are described in their respective chapters.
The Record Viewer lets you view the data in the current table in a row/column
spreadsheet-like tool. To use the Record Viewer select it from the tool menu, and click
Invoke Tool.


•

Tool Options—lets you further specify options you want to set in the specified tool’s
configuration file.

•

Clear Selected—lets you undo the mapping to a selected Visual Element.

•

Clear All—clears all mappings.

•

Invoke Tool—lets you start the tool you specified (via the top button) using the
configuration file named in the Saved as text field.


Each tool’s requirements are listed individually in the Visual Elements pane. This pane lets
you map a table column to a requirement. To do this,
1. Select a column by clicking its name in the Current Columns pane.
2. Select the requirement to which you want to map the column by clicking on that
   requirement in the Visual Elements pane.
The Viz Tool panel now shows the Visual Element and the column to which it has been
mapped (see Figure 3-24).

Figure 3-24

Columns Mapped to Requirements


You can clear the mapping at any time by selecting the requirement that has the mapping
you want to change, then clicking the Clear Selected button. You can clear all mappings
using the Clear All button.
If you want to specify other details to fine-tune your mappings or to change the settings
so that the data representations more clearly reflect your intentions, click the Tool Options
button. A dialog box specific to each MineSet tool appears, where you can manually
specify the options to use.
Note: For details on a specific tool’s options, see that tool’s chapter.

Using Mining Tools
The MineSet Classifiers are described in Chapter 10, “MineSet Inducers and Classifiers,”
Chapter 11, “Inducing and Visualizing the Decision Tree Classifier,” Chapter 12,
“Inducing and Visualizing the Option Tree Classifier,” Chapter 13, “Inducing and
Visualizing the Evidence Classifier,” Chapter 14, “Inducing and Visualizing the Decision
Table,” and Chapter 15, “Inducing and Visualizing the Regression Tree.” Clustering is
described in Chapter 16, and Column Importance is described in Chapter 17.
Creating Associations for the Rule Visualizer

If you click the Mining Tools tab, then the Associations tab, the panel lets you take the data
file you created in Data Transformations and proceed to the Rule Visualizer. Each step of
the process is shown in the subpanels:
•

Assoc Settings—creates rule generation options and mappings from the columns in
your table to elements of association rules

•

Ruleviz Settings—provides options to tailor the representation of the association
rules in the Ruleviz tool

•

Execution—a button to invoke the process of finding association rules and
visualizing them

If you don’t want to go through this process manually, click the Execution button, and the
computer will perform the process using defaults.


Figure 3-25

The Associations Tab

Finding Important Columns

The Column Importance tab (Figure 3-26) lets you determine how important various
columns are in discriminating the different values of the label column you choose. You
might, for example, want to find the best three columns for discriminating the label good
credit risk so you can choose them for the Scatter Visualizer. When you select the label and
click Go!, a popup window appears with the three columns that are the best three
discriminators. A measure called “purity” (a number from 0 to 100) informs you how
well the columns discriminate the different labels. Adding more columns can only
increase the purity.


Figure 3-26

The Column Importance Tab

There are two modes of column importance:
•

Simple Mode
To invoke the simple mode, choose a discrete label from the popup menu, and
specify the number of columns you want to see.


•

Advanced Mode
Advanced mode lets you control the choice of columns. To enter advanced mode,
click Advanced Mode in the Column Importance panel. A dialog box appears, as
shown in Figure 3-27. The dialog box contains two lists of column names: the left
list contains available attributes, and the right list contains attributes chosen as
important (by either the user or the column importance algorithm).

Figure 3-27

Advanced Mode of Column Importance


Advanced mode can work two different ways: finding several new important
attributes, or ranking available attributes.
•

Finding Several Important Attributes
To enter this sub-mode, click the first of the two radio buttons at the bottom of
the dialog (...find [number] additional important columns). If you click Go! with no
further changes, the effect is the same as if you were in Simple Mode, finding
the specified number of important columns and automatically moving them to
the right column. Near each column, the cumulative purity is given (that is, the
purity of all the columns up to and including the one on the line). More
attributes can only increase the purity.
Alternatively, by moving column names from the left list to the right list, you
can pre-specify columns that you want included and let the system add more.
For example, to select the age column and let the system find three more
columns, click the age column name, then click the right arrow.
Clicking Go! lets you see the cumulative purity of each column, together with
the previous ones in the list. A purity of 100 means that using the given
columns, you can perfectly discriminate the different label values.

•

Ranking Available Attributes
Advanced Mode also lets you compute the change in purity that each column
would add to all those that were already selected. For example, you might
choose age, and then ask the system to compute the incremental improvement
in purity that each column would yield.
To enter this sub-mode, click the second of the two radio buttons at the bottom
of the dialog (...compute improved purity for left columns, cumulative purity for right
columns.). This sub-mode permits fine control over the process. If two columns
are ranked very closely, you might prefer one over the other (for example,
cheaper to gather, more reliable, easier to understand).

Column Importance Notes
Note that a column’s importance in combination with other columns can differ from its
individual ranking. For example, while net-income might be a good column individually,
it might not be as important together with salary because they are likely to be highly
correlated. The best set of three columns is not necessarily composed of the columns that
rank highest individually. If two columns give the income in dollars and in another
currency, they are ranked equally alone; however, once one of them is chosen, the other
adds no discriminatory power to the set of best features.
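The effect of correlated columns on cumulative purity can be demonstrated with a small greedy selection sketch. The purity measure used here is a stand-in chosen for illustration (the percentage of records that carry the majority label of their group); it is not MineSet's actual purity computation, which is described in Chapter 17. The function names and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def purity(rows, columns, label):
    """Stand-in purity: percent of records carrying their group's majority label."""
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[column] for column in columns)
        groups[key][row[label]] += 1
    majority = sum(max(counts.values()) for counts in groups.values())
    return 100.0 * majority / len(rows)


def find_important(rows, candidates, label, how_many, chosen=()):
    """Greedily add the column that raises cumulative purity the most."""
    chosen = list(chosen)
    remaining = [column for column in candidates if column not in chosen]
    result = []
    for _ in range(how_many):
        if not remaining:
            break
        best = max(remaining, key=lambda column: purity(rows, chosen + [column], label))
        chosen.append(best)
        remaining.remove(best)
        result.append((best, purity(rows, chosen, label)))
    return result


# income_eur repeats the information in income_usd (same income, other currency).
data = [("high", "yes", "good"), ("high", "yes", "good"), ("high", "yes", "good"),
        ("high", "no", "bad"), ("low", "yes", "bad"), ("low", "yes", "bad"),
        ("low", "no", "bad"), ("low", "no", "bad")]
rows = [{"income_usd": usd, "income_eur": usd, "owns_home": home, "credit": credit}
        for usd, home, credit in data]

ranking = find_important(rows, ["income_usd", "income_eur", "owns_home"], "credit", 2)
print(ranking)  # [('income_usd', 87.5), ('owns_home', 100.0)]
```

Individually, `income_eur` ranks as high as `income_usd` (87.5) and above `owns_home` (75.0), yet it is not chosen second, because it adds nothing once `income_usd` is in the set.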


Column selection is useful for finding the best three axes for the Scatter Visualizer, as well
as for finding a good discriminatory hierarchy for the Tree Visualizer.
All floating point values (double or float) are pre-discretized using the automatic
discretization. If a column has no value given to it in the left list, the algorithm did not
consider it, because it either had a single value (for example, when it is discretized into
one interval), or the number of records that it would separate is not statistically
significant.

Using Data Files
The Tool Manager lets you save the manipulated table for future use in a data file on the
client or server. If you click the Data Files tab, the panel shown in Figure 3-28 appears.

Figure 3-28   The Data Files Panel

The two toggle buttons in this panel let you specify whether the file is to be saved on the
server or your client machine. The selected name for the client file appears next to the
Client checkbox. If you select Client, the Choose new client file button brings up a dialog for
you to choose the name for the client file. If you select Server, you can type the server
filename directly into the adjacent text field.
Note: Pathnames are not permitted for server files; all server files are stored in the
DataMover cache directory.

Session Files
The Tool Manager can save a description of your work to a “session file” for future use.
A session file contains a description of the data source you selected, all the
transformations on the data, and the mining or visualization of the data. Each session file
can hold descriptions of only one data source and one data destination; thus, if you
change the destination visual tool or source data table, the session file loses its links to
any previous data source or destination.
Session files can be saved at any time through the entries in the File menu, described
below. The name of the current session appears in the window’s title bar. The Tool
Manager also keeps a parallel session file, called .latest.mineset, in your home directory. It
always has a record of your most recent actions in the Tool Manager. Whenever you start
the Tool Manager without a session file, it reads the contents of the .latest.mineset file to
return you to the state when you last ran MineSet.
Session files also can be used for running the Tool Manager in batch mode, by issuing this
command at the UNIX shell prompt:
mineset_batch [-s serverPassword -d databasePassword] sessionFile

The -s and -d options let you specify the passwords for logging in to the server and the
database, respectively. If you do not specify these options, mineset_batch prompts you to
type in the passwords; the options are therefore useful when running mineset_batch from a
shell script. To specify that there is no password for either the server or database, use -s
or -d followed by two double quotes, that is,
mineset_batch -s "" -d "" foo.mineset

If you specify one of the two passwords, you must specify both.
In batch mode, the Tool Manager does not bring up tools or windows; however, it creates
files for tools. For example, if the session file includes the Tree Visualizer as the data
destination, running the Tool Manager in batch mode produces files for running the Tree
Visualizer, but the Tool Manager does not invoke it.
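
When batch runs are driven from a script, the argument rules above can be captured in a small helper. The sketch below is in Python and only builds the argument list; the function name is hypothetical, and it assumes nothing about mineset_batch beyond the -s and -d options and the both-or-neither rule described in this section.

```python
def build_batch_command(session_file, server_password=None, database_password=None):
    """Build the argument list for a mineset_batch run without executing it.

    None means "let mineset_batch prompt for the password"; an empty string
    means "there is no password", matching the -s "" -d "" form above.
    """
    if (server_password is None) != (database_password is None):
        raise ValueError("specify both passwords or neither")
    argv = ["mineset_batch"]
    if server_password is not None:
        argv += ["-s", server_password, "-d", database_password]
    argv.append(session_file)
    return argv


no_passwords = build_batch_command("foo.mineset", "", "")
interactive = build_batch_command("foo.mineset")
print(no_passwords)
print(interactive)
```

The resulting list could be handed to subprocess.run() on a machine where mineset_batch is installed. Keep in mind that passwords passed this way are visible to anyone who can read the script.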

Pulldown Menus
At the top of the Tool Manager window (see Figure 3-1 on page 67) are four pulldown
menus:
• File
• View
• Visual Tools
• Help

The following sections describe each of these menus.

The File Menu
The File menu lets you choose what to do with your current session, which is one
complete session with a tool. This includes choosing the server, data source and table, all
the table manipulations, the mapping or classifying of the data, as well as opening or
saving a tool history, changing the working directory, and setting preferences.

Figure 3-29   File Menu

The File menu provides five sets of functions:
• The first set is for selecting a data source.
  – Open New DBMS Table lets you select a single table from a DBMS.
  – Open New DBMS Query lets you make an SQL query against the DBMS.
  – Open New Data File lets you select a table from a data file on disk.
  – Connect To Server lets you open a connection to a MineSet server.
• The second set is for opening or saving .mineset files.
  – Open Saved Session... lets you open a .mineset file.
  – Reopen Current Session lets you reopen the current session file from the disk, in
    case you do not want to save the current changes.
  – Save Current Session lets you save a currently open .mineset file.
  – Save Current Session As... lets you name (or rename) and save a currently open
    history as a .mineset file.
• The third set is for changing the current directory.
  – Change Current Directory lets you specify the directory in which the Tool
    Manager creates all data and visualization files.
• The fourth set is for setting preferences. Here you can specify whether to
  – use ASCII or binary files
  – include an entry for NULL values when creating arrays
  – automatically load the most recent session when starting up the Tool Manager
  – run MIndUtil in single- or multi-threaded mode; a slider allows you to select
    how many threads to use.
• The last option, Exit, lets you end the current session and exit the Tool Manager.

The View Menu
The View Menu lets you select whether to see the history panel or the current columns
and data destination panels.

Figure 3-30   View Menu

• Single Ops and Destination shows the current columns and data destination panels.
  This menu option performs the same function as the View Single Ops/Dest button on
  the history panel.
• Entire History shows the history panel. This menu option performs the same
  function as the View History button on the Data Transformations panel.

The Visual Tools Menu
The Visual Tools menu lets you invoke any of the following visual tools directly:
• Cluster
• Decision Table
• Evidence Visualizer
• Map Visualizer
• Rule Visualizer
• Scatter Visualizer
• Splat Visualizer
• Statistics Visualizer
• Record Viewer
• Tree Visualizer

If you have created a file that runs within one of these tools, and you want to go back to
it, click the tool. From within the tool, use File > Open to open the data file. These viewers
are described in their respective chapters, except for Record Viewer, which is described
later in this chapter, in “The Record Viewer” on page 113.

The Help Menu
The Help menu provides information about the elements of the Tool Manager and how
they work:
• Click for Help gives help information about a particular item. Press Shift-F1,
  then click the item for which you want help.
• Overview gives an overview of the online help and how to use it.
• Index provides an index of the complete help system. This option is currently
  disabled.
• Keys & Shortcuts provides the keyboard shortcuts for all of the Tool Manager’s
  functions that have accelerator keys.
• Product Information indicates what version of the Tool Manager you are using.
• MineSet User’s Guide invokes the IRIS InSight viewer with the online version of
  this manual.

The Tool Manager Options File
The Tool Manager creates a .mineset file in your home directory. This is used to store the
preference indicating whether to restore the most recent session on startup, as well as the
default server name, login, and password. If you log in to the same server often, edit this
file and specify a server name and login as follows:
default_server_name: mineset
default_server_login: guest
default_server_password:

Whenever you try to log in to a server, these names appear as defaults.

Warning: Putting a password in a file is a great security risk. Do not place a
password in the Tool Manager options file unless you want other people to know that
password.
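
If you want to inspect the defaults from a script, the three lines shown above can be read as simple "name: value" pairs. The Python sketch below assumes the options file contains only lines of that form; the manual does not describe any other syntax, so treat the parser and its function name as hypothetical.

```python
def parse_options(text):
    """Parse "name: value" lines into a dictionary.

    Assumption: every meaningful line has the form "name: value";
    blank lines and lines without a colon are skipped.
    """
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        name, _, value = line.partition(":")
        options[name.strip()] = value.strip()
    return options


sample = """default_server_name: mineset
default_server_login: guest
default_server_password:
"""
options = parse_options(sample)
print(options)
```

An empty value, as for default_server_password here, is kept as an empty string, which is the safe choice given the warning above.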

The Record Viewer
The Record Viewer lets you view MineSet data files in a format similar to spreadsheets.
There are five ways to start the Record Viewer.
• Use the Tool Manager to start the Record Viewer. This invokes the Record Viewer on
  the data currently configured in the Tool Manager.
• Double-click on the Record Viewer icon, which is in the MineSet page of the icon
  catalog. Since no .schema file is specified, you must select one by using File > Open.
• Double-click on any MineSet .schema file. This launches the Record Viewer on that
  .schema file.
• Drag a .schema file onto the Record Viewer icon.
• Start the Record Viewer from the UNIX shell command line by entering this
  command at the prompt:
  recordview [ file.schema ]
  where file.schema is optional and specifies the name of the .schema file to use. If
  you do not specify a .schema file, you must use File > Open to specify one.
The Record Viewer shows the data specified by the .schema file in spreadsheet format (see
Figure 3-31).

Figure 3-31   Sample Record Viewer Screen

If a column is not wide enough to see a specific value, click on it to display that value at
the top of the Record Viewer. You also can change the width of columns by dragging the
separators between the columns.
To read a new .schema file into the Record Viewer, select File > Open. To close the Record
Viewer, select File > Exit.
Note that some of the visual tools also bring up record viewers to display the current
selections. These record viewers are built into the visual tools; while their behavior is the
same as the Record Viewer discussed above, they do not allow opening other .schema
files.

Color Options for the MineSet Visualizers
Many of the tool option dialogs have options for choosing colors. MineSet has a color list
chooser that uses color swatches. This section describes how to choose, apply, and
change color options for the MineSet Visualizers.

Choosing Colors
If only one color is to be chosen (for example, a grid color), a single color swatch appears
(Figure 3-32).

Figure 3-32   Configuration Option With a Single Color Swatch

Clicking the swatch brings up a Color Browser that lets you change the color of that
swatch (Figure 3-33). The Color Browser is described in more detail in “Using the
Color Browser” later in this section.

Figure 3-33   Color Browser

If a list of color swatches is to be chosen, the list of swatches appears (these can be empty
initially), as shown in Figure 3-34.

Figure 3-34   Multiple Color Swatches

To edit the color, click a swatch with the left mouse button. This also selects the swatch
for making changes to the colors with the buttons. If you click on the swatch with the
middle mouse button, the swatch is selected, but the color chooser does not appear.
Next to the list of swatches are four buttons. First is the Add button, labeled with a plus
sign (+), which adds a new color at the end of the list. A swatch is added, and the color
chooser appears, where you can select the color of that swatch. The Add button is
disabled if the maximum number of colors is already in the list.
Next to the Add button is a Delete button, labeled with a minus sign (-). This button
deletes the selected color. It is disabled if no swatch is selected, or if the list already has
the minimum number of colors.
Next to the Delete button are two buttons that shift the selected color right and left. These
buttons are disabled if no swatch is selected; a shift button is also disabled when the
selected swatch is already at the end of the list in that direction.
If there are more colors in the list than room to display them, scroll arrows are added at
each end of the list (Figure 3-35).

Figure 3-35   Scroll Arrows on Color Browser

If the hardware runs out of colors, the color swatches are replaced with text labels
showing the color in X notation (Figure 3-36).

Figure 3-36   Color Browser Out of Colors

Using the Color Browser
The Color Browser (Figure 3-33) appears when you click a color swatch or the Add button
in the Colors panel of the visualizer’s Configuration Options panel.
To select a color using the Color Browser:
1. Move your mouse cursor on top of the small circle in the colored hexagon.

2. Press the left mouse button, and move your mouse around the hexagon. The color
beneath the small circle appears in the rectangle next to the Current Color label. This
rectangle acts as your color palette while you choose a color.
3. Release the mouse button when the small circle is on top of a color you want. The
selected swatch immediately takes on the chosen color.
You can edit several colors without dismissing the Color Browser; clicking any color
in the options panel lets you edit that color in the already posted Color Browser.
4. Click the OK button when you decide on a color. The Color Browser window closes.

Chapter 4

4. Using the Statistics Visualizer

This chapter discusses the features and capabilities of the Statistics Visualizer. It provides
an overview of this data visualization tool, then explains the Statistics Visualizer’s
functionality when working with the
• main window
• external controls
• pulldown menus

Finally, it lists and describes the sample files provided for this tool.

Overview of the Statistics Visualizer
The Statistics Visualizer lets you visualize statistics on columns. It presents a window
that contains one small panel for each column listed in the Current Columns pane of the
Tool Manager. The Statistics Visualizer main window has a default size
and shows only a restricted number of column panels. If the number of columns is large,
scrollbars appear; alternatively, you can stretch the Statistics Visualizer window
horizontally or vertically to view more column panels.
The format of the column panel varies according to the column type, and the number of
distinct values that exist for that column. Columns are generally divided into two types:
numeric and discrete, shown as box plots and histograms, respectively.
A numeric column has integer, float, double or date values. Each box plot panel shows
statistics about data from a single column, including the minimum, maximum, mean,
median, and two quartiles (25th and 75th percentiles) of these numeric values. These
values are shown as lines across a vertical bar in graduated shades of green, and the
standard deviation of the population is shown as a +/- value. The quartiles are shown
whenever there are fewer than 50,000 distinct values (see Figure 4-1). If there are more
than 50,000 distinct values in the column, the statistics are shown as a gray vertical bar.
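
As a rough illustration of the quantities a box plot panel reports, the following Python sketch computes them for a small list of numbers with the standard statistics module. The quartile interpolation method is an assumption; the manual does not say which convention the Statistics Visualizer uses, and the function name is hypothetical.

```python
import statistics


def box_plot_summary(values):
    """Compute the statistics shown in a numeric column panel.

    Assumption: quartiles use the "inclusive" interpolation method of
    statistics.quantiles; the tool itself may use another convention.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "median": median,
        "q1": q1,  # 25th percentile
        "q3": q3,  # 75th percentile
        "stdev": statistics.pstdev(values),  # population standard deviation
    }


summary = box_plot_summary([2, 4, 4, 4, 5, 5, 7, 9])
print(summary)
```

The population form (pstdev) is used because the text above refers to the standard deviation of the population rather than of a sample.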

Figure 4-1   Numeric Column Displayed by Statistics Visualizer

A discrete (or nominal) column has non-numeric (string, bin, or enum) values shown as
histograms (see Figure 4-2). The discrete column panel shows up to 100 distinct values,
as well as a histogram of the number of instances of each distinct value. The default
ordering of the discrete rows is by decreasing count, but you can use the View pulldown
menu to select an alternative sorting. If there are 100 or fewer distinct categories, then the
column panel also contains the count of distinct values.
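
The two orderings of a discrete column can be sketched in Python as follows. The function name, the tie-breaking rule (by value name), and the sample values are assumptions made for the illustration; only the 100-value limit and the two sort orders come from the description above.

```python
from collections import Counter


def histogram_rows(values, sort_by="count", limit=100):
    """Return (rows, distinct) for a discrete column panel.

    rows is a list of (value, count) pairs, at most `limit` long.
    distinct is the number of distinct values, or None when there are
    more than `limit` of them (the panel then omits the count).
    """
    counts = Counter(values)
    if sort_by == "count":
        # Decreasing count; ties are broken by name (an assumption).
        rows = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    elif sort_by == "name":
        rows = sorted(counts.items())
    else:
        raise ValueError("sort_by must be 'count' or 'name'")
    distinct = len(counts) if len(counts) <= limit else None
    return rows[:limit], distinct


colors = ["red", "blue", "red", "green", "blue", "red"]
by_count, distinct = histogram_rows(colors, sort_by="count")
by_name, _ = histogram_rows(colors, sort_by="name")
print(by_count, by_name, distinct)
```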

Figure 4-2   Discrete Column Displayed by Statistics Visualizer

After creating a visualization of your data, the Statistics Visualizer lets you see truncated
textual information in the histograms with a brush highlighter. The brush highlighter
activates as you pass the mouse across a field without clicking.

File Requirements
The Statistics Visualizer requires a data file, consisting of ASCII or binary fields. This file
is easily created when running the Tool Manager (see Chapter 3).

Starting the Statistics Visualizer
There are five ways to start the Statistics Visualizer:
• Use the Tool Manager to configure and start the Statistics Visualizer. See Chapter 3
  for details on most of the Tool Manager’s functionality, which is common to all
  MineSet tools.
• Double-click the Statistics Visualizer icon, which is in the MineSet page of the icon
  catalog. The icon is labeled statviz. Since no configuration file is specified, the
  start-up screen requires you to select one by using File > Open.
• Double-click the Statistics Visualizer icon on your Silicon Graphics desktop. The
  startup screen requires you to select a data file by choosing File > Open.

Figure 4-3   File > Open Menu Selection for Statistics Visualizer

Starting the Statistics Visualizer from the icon activates only the File and Help
pulldown menus. For the main window to be fully functional, open a .statviz file by
selecting File > Open.

• If you know what .statviz file you want to use, double-click the icon for that file. This
  starts the Statistics Visualizer and automatically loads the file you specified. This
  works only if the filename ends in .statviz (which is always the case for data files
  created for the Statistics Visualizer using the Tool Manager).
• Drag the .statviz file icon onto the Statistics Visualizer icon. This starts the Statistics
  Visualizer and automatically loads the file you specified. This works even if the
  configuration filename does not end in .statviz.

Starting the Statistics Visualizer
Select the Viz Tools tab in the Data Destination panel of the Tool Manager’s main screen
(Figure 4-4). From the popup list of tools, choose Statistics Visualizer.

Figure 4-4   Data Destination Panel With Statistics Visualizer Selected

Working in the Statistics Visualizer’s Main Window
If you started the Statistics Visualizer from the icon, the main window shows the
copyright notice and license agreement for the Statistics Visualizer. Only the File and
Help pulldown menus can be used. For the main window to show all menus and
controls, open a .statviz file. Use File > Open (Figure 4-3) to see a list of configuration files.

Pulldown Menus
Three pulldown menus let you access additional Statistics Visualizer functions. These are
labeled File, View, and Help. If you start the Statistics Visualizer without specifying a
configuration file, only the File and the Help menus are available.

The File Menu
The File pulldown menu for the Statistics Visualizer contains four options.
• Open loads a file and displays it in the main window.
• Save As saves the current state of the Statistics Visualizer main window to an
  image file.
• Print Image captures the current state of the Statistics Visualizer main window
  and sends it to a printer.
• Exit closes all windows and exits the application.

The View Menu
The Statistics Visualizer View pulldown menu (Figure 4-5) contains two options.

Figure 4-5   StatViz View Pulldown Menu

• Sort Nominals By Count specifies that the nominal (discrete) columns show the
  histogram of values that is ordered by decreasing per-value counts.
• Sort Nominals By Name specifies that those same columns be ordered by the relative
  alphabetical order of each data value name.

The Help Menu
The Help menu provides access to six help functions (see Figure 4-6).

Figure 4-6   Statistics Visualizer Help Menu

• Click for Help turns the cursor into a question mark. Placing this cursor over an
  object in the Statistics Visualizer’s main window and clicking the mouse causes a
  help screen to appear; this screen contains information about that object. Closing the
  help window restores the cursor to its arrow form and deselects the help function.
  The keyboard shortcut for this function is Shift+F1. (Note that it also is possible to
  place the arrow cursor over an object and press the F1 function key to access a help
  screen about that object.)
• Overview provides a brief summary of the major functions of this tool, including
  how to open a file and how to interact with the resulting view.
• Index provides an index of the complete help system. This option is currently
  disabled.
• Keys & Shortcuts provides the keyboard shortcuts for all of the Statistics Visualizer’s
  functions that have accelerator keys.
• Product Information brings up a screen with the version number and copyright notice
  for the Statistics Visualizer.
• MineSet User’s Guide invokes the IRIS InSight viewer with the online version of this
  manual.

Sample Data Files
The provided sample data files demonstrate the Statistics Visualizer’s features and
capabilities. The following files are in the /usr/lib/MineSet/statviz/examples directory:
mushroom.statviz
census95.statviz

Chapter 5

5. Using the Tree Visualizer

This chapter discusses the features and capabilities of the Tree Visualizer. It provides an
overview of this visualization tool, discusses ways of invoking it, then explains the Tree
Visualizer’s functionality when working with the following elements.
• main window
• external controls
• pulldown menus
• overview window

Finally, this chapter lists and describes the sample files provided for this tool.

Overview of Tree Visualizer
The Tree Visualizer is a graphical interface that displays data as a three-dimensional
“landscape.” It presents your data as clustered, hierarchical blocks (nodes) and bars with
disks through which you can dynamically navigate, viewing part, or all, of the dataset.
As shown in Figure 5-1, the Tree Visualizer displays quantitative and relational
characteristics of your data by showing them as hierarchically connected nodes. Each
node contains bars whose height, color and disk correspond to aggregations of data
values. The lines connecting nodes show the relationship of one set of data to its subsets.

Figure 5-1   Example Display in the Tree Visualizer’s Main Window

Values in subgroups can be summed and displayed automatically in the next higher
level. The base under the bars can provide information about the aggregate value of all
the bars. Bars representing negative values are shown below the top of the base. You can
see negative value bars more clearly by disabling the base height (see “The Display
Menu” on page 165, or the “Base Height Statements” section in Appendix B, “Creating
Data and Configuration Files for the Tree Visualizer”).

File Requirements
The Tree Visualizer requires the following files:
• A data file consisting of rows of tab-separated fields. This file is easily created using
  the Tool Manager (see Chapter 3). If you are generating this file yourself, see
  Appendix B, “Creating Data and Configuration Files for the Tree Visualizer” for the
  required file format.
  Data files are generated by extracting data from a source (such as an Oracle,
  INFORMIX, or Sybase database) and formatting it specifically for use by the Tree
  Visualizer. Data files have user-defined extensions (the sample files provided with
  the Tree Visualizer have a .data extension).
• A configuration file describing the format of the input data and how these are
  converted to a hierarchy. This file also is easily created using the Tool Manager (see
  Chapter 3). You also can use an editor (such as jot, vi, or Emacs) to produce this file
  (see Appendix B, “Creating Data and Configuration Files for the Tree Visualizer”).
  Configuration files must have a .treeviz extension. When starting the Tree Visualizer,
  or when opening a file, specify the configuration file, not the data file.

Starting the Tree Visualizer
There are five ways to start the Tree Visualizer:
• Use the Tool Manager to configure and start the Tree Visualizer. (See Chapter 3 first
  for details on most of the Tool Manager’s functionality, which is common to all
  MineSet tools; see below for details about using the Tool Manager in conjunction
  with the Tree Visualizer.)
• Double-click the Tree Visualizer icon, which is in the MineSet page of the icon
  catalog. The icon is labeled treeviz. Since no configuration file is specified, the
  start-up screen requires you to select one by using File > Open.
  Starting the Tree Visualizer without specifying a configuration file causes the main
  window to show the copyright notice for this tool. Only the File and Help pulldown
  menus can be used. For the main window to be fully functional, open a
  configuration file by selecting File > Open (Figure 5-2).

  Figure 5-2   Tree Visualizer’s File Pulldown Menu

• If you know what configuration file you want to use, double-click the icon for that
  file. This starts the Tree Visualizer and automatically loads the file you specified.
  This only works if the filename ends in .treeviz (which is always the case for
  configuration files created for the Tree Visualizer via the Tool Manager).
• Drag the configuration file icon onto the Tree Visualizer icon. This starts the Tree
  Visualizer and automatically loads the file you specified. This works even if the
  filename does not end in .treeviz.
• Start the Tree Visualizer from the UNIX shell command line by entering this
  command at the prompt:
  treeviz [ configFile ]
  where configFile is optional and specifies the name of the configuration file to use. If
  you don’t specify a configuration file, you must use File > Open to specify one (see
  Figure 5-2).
Options for Invoking the Tree Visualizer

There are two options that affect how this tool is invoked:
• -warnexecute indicates that if you attempt to execute a command specified in an
  execute statement, a warning is displayed and you are given the option to execute
  the command or not. This is intended for an insecure environment, such as files
  obtained from the Web, and is used automatically when commands are executed via
  mtr files.
  You can enable this option permanently by adding the line
  *minesetWarnExecute:TRUE
  to your .Xdefaults file, or by setting the environment variable
  MINESET_WARN_EXECUTE.
• -quiet eliminates the dialogs that pop up to indicate progress. You can enable this
  option permanently by adding the line
  *minesetQuiet:TRUE
  to your .Xdefaults file.

Configuring the Tree Visualizer Using the Tool Manager
This section describes how the Tree Visualizer can be configured using the Tool Manager.
Although the Tool Manager greatly simplifies the task of configuring the Tree Visualizer,
you can construct a configuration file manually for this tool using an editor (see
Appendix B, “Creating Data and Configuration Files for the Tree Visualizer”).
For the Tree Visualizer, the Tool Manager does not support the following:
• Non-aggregated hierarchies where the data is displayed directly without
  aggregating it.
• Real-time monitoring.
• A number of very rarely used options (skip missing, overview, shrinkage, root label,
  speed, climb speed, leaf margin, root leaf margin, leaf edge margin, initial position,
  initial angle, bar label size, base label size, and lod). See Appendix B.
• Variable-length arrays.
• Expressions computed after creating the hierarchy. For example, if you are
  computing a percentage, the percentage must be computed after the hierarchy
  aggregation takes place, since it is not possible to aggregate the percentages.

Note that the steps required to connect to a data source are described in Chapter 3.

Selecting the Tree Visualizer Tool
Select the Viz Tools tab in the Data Destination panel of the Tool Manager’s main screen
(Figure 5-3). From the popup list of tools, select Tree Visualizer. The mapping
requirements for the Tree Visualizer are displayed in the window on the right side of this
panel. Items in the Visual Elements: list that are preceded by an asterisk are optional.

Figure 5-3   Data Destination Panel of Tool Manager With Tree Visualizer Selected

Key - Bars lets you define what the bars shown in the Tree Visualizer main window
represent. For example, in a table representing the budget of the 50 United States, the
keys could be state names. If the first key is associated with Alabama, the first bar
represents the values for Alabama.
Height - Bar lets you specify what the bar heights represent. Typically, the higher the bar,
the greater the value represented.
Sort By lets you specify a column, the values of which are used to sort the layout of the
nodes. The sort order defaults to ascending from left to right.
Hierarchy Root Level lets you specify how the table from your data source is converted into
a hierarchy. The Visual Elements list defaults to six hierarchical levels. If you specify a
sixth hierarchy level, the Tree Visualizer automatically adds a seventh. With every extra
level you specify, the Tree Visualizer adds another one. You can specify as many
hierarchy levels as necessary.

Height - Disk lets you specify what the heights represent for optional disks placed at the
same location as the bar. If no mapping is specified, no disks are displayed.
Height - Base lets you specify what the base heights represent. If no mapping is
specified, the bar height mapping is used.
Color - Bar lets you specify what the bar colors represent. The specific colors must be
assigned via the Tool Manager’s Tool Options panel (see “Choosing Colors” and “Using
the Color Browser” in Chapter 3).
Color - Disk lets you specify what the disk colors represent. This option has an effect
only if the disk height is specified (see “Choosing Colors” and “Using the Color Browser”
in Chapter 3).
Color - Base lets you specify what the base colors represent. If no mapping is specified,
the bar color mapping is used (see “Choosing Colors” and “Using the Color Browser” in
Chapter 3).

Undoing Mappings
To undo any mapping, select that mapping in the Requirements: window, then click the
Clear Selected button. To undo all mappings, click the Clear All button.

Specifying Tool Options
Clicking the Tool Options button causes a new dialog box to be displayed (Figure 5-4).
This lets you change some of the Tree Visualizer options from their default values.

Figure 5-4   Tree Visualizer’s Configuration Options Dialog Box

The top of the dialog box has three columns: Bars, Node Bases, and Disks.

Normalize Heights

This option lets you normalize heights across each level of the hierarchy (or across all
levels) of bars, node bases, and disks. Normalizing determines the maximum
value of the height variable and scales all values relative to that maximum. Thus, if the
maximum value is 30.0 and the maximum bar height is set to 1.0 (in arbitrary units),
a value of 15.0 is mapped to a height of 0.5.
Normalizing across each level independently normalizes each level of the hierarchy. This
option is most useful if data has been summed up the hierarchy, and prevents the top
level of the hierarchy from dwarfing items at the lowest level. Normalizing across all
levels normalizes everything together, regardless of the level in the hierarchy. If neither
box is checked for bars, no normalization takes place.
Node Bases are normalized independently of Bars. If no boxes are checked, the same
normalization method used for bars is used for node bases, although the values are
normalized independently.
If disks are present and normalize with bars is checked, the disks are normalized in
conjunction with the bars: a disk and a bar representing the same value have the same
height. If one of the other normalize boxes is checked in the Disks column, disks are
normalized independently of the bars: the highest disk and the tallest bar have the same
height, regardless of the actual values represented by them.
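
The arithmetic behind the two normalization choices can be sketched in Python. The function and its data layout (a dictionary from hierarchy level to a list of bar values) are hypothetical, and the sketch assumes positive values; it only reproduces the scaling rule described above, not the Tree Visualizer’s implementation.

```python
def normalize(values_by_level, max_height=1.0, per_level=True):
    """Scale bar values so the largest value maps to max_height.

    per_level=True normalizes each hierarchy level independently;
    per_level=False normalizes all levels against one overall maximum.
    Assumes every level holds at least one positive value.
    """
    if per_level:
        peaks = {level: max(values) for level, values in values_by_level.items()}
    else:
        overall = max(max(values) for values in values_by_level.values())
        peaks = {level: overall for level in values_by_level}
    return {
        level: [max_height * value / peaks[level] for value in values]
        for level, values in values_by_level.items()
    }


# Level 0 holds a summed-up total; level 1 holds the items below it.
data = {0: [30.0], 1: [15.0, 7.5]}
across_all = normalize(data, per_level=False)
each_level = normalize(data, per_level=True)
print(across_all)
print(each_level)
```

Normalizing across all levels reproduces the example in the text (15.0 maps to 0.5 when the maximum is 30.0), while normalizing each level keeps the lower level from being dwarfed by the summed value above it.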
Max/Scale Heights

This option lets you specify the height of the tallest bars and node bases. The default is
1.0 (in arbitrary units). If after looking at the view, you see that the heights are too low or
too high, use this field to adjust them. For example, entering 2 in the field causes all bars
to be doubled in height; entering .5 makes all bars half as big.
If normalization was specified, this value represents the height of the tallest bar or base.
If normalization was not specified, all values are scaled by this amount. The latter can be
useful when comparing views of two different datasets.

Configuring the Tree Visualizer Using the Tool Manager

Filter out % shortest

This option lets you filter out nodes containing only short bars. First, the tallest bar in the
scene is calculated (if heights are normalized by level, then the tallest bar in each level).
Then only those nodes that contain at least one bar that is the appropriate percentage of
the tallest bar are shown. For example, if you enter 5% in this field, then only those nodes
containing at least one bar that is at least 5% of the height of the tallest bar are shown.
(Also shown are ancestors of such bars.) This option is intended as a coarse way to filter
out small, uninteresting nodes. It is not intended as an exact mechanism for identifying
specific nodes of a certain value. Use of this option can accelerate the rendering of slow,
complex scenes, or reduce clutter resulting from many bars near zero height.
Although small nodes are filtered out, they are nonetheless counted in any cumulation
up the hierarchy.
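
The filtering rule can be sketched as follows. This is a minimal illustration, not the product's implementation: the dictionary layout, the function name, and the use of a single tallest bar for the whole scene (rather than one per level) are assumptions made for the example.

```python
def visible_nodes(tree, root, percent):
    """Return the names of nodes that survive the "% shortest" filter.

    tree maps a node name to {"bars": [heights], "children": [names]}.
    A node is kept if one of its bars reaches the threshold, or if any
    descendant is kept (so ancestors of tall bars stay visible).
    """
    tallest = max((h for node in tree.values() for h in node["bars"]), default=0.0)
    threshold = tallest * percent / 100.0
    shown = set()

    def visit(name):
        node = tree[name]
        keep = any(h >= threshold for h in node["bars"])
        for child in node["children"]:
            # Visit every child, even if this node is already kept,
            # so that every subtree is examined.
            if visit(child):
                keep = True
        if keep:
            shown.add(name)
        return keep

    visit(root)
    return shown

tree = {
    "root": {"bars": [100.0], "children": ["a", "b"]},
    "a": {"bars": [2.0], "children": ["c"]},
    "b": {"bars": [1.0], "children": []},
    "c": {"bars": [50.0], "children": []},
}
print(sorted(visible_nodes(tree, "root", 5)))
```

Here node a has only a short bar, but it stays visible because its child c has a bar above the 5% threshold; node b is filtered out.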
Height Aggregation

By default, the height of the bars of a parent node is the sum of the heights of all the bars
of its children; however, you can instead use the average, max, min, count, or any of the
other values that appear. This aggregation can be applied to bar heights, base
heights, and disk heights.
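
As a rough sketch of what each aggregation does to a parent's height, the following Python maps the names used above onto standard functions. The function and table names are hypothetical, and only the five aggregations named in the text are covered.

```python
from statistics import mean

# Aggregation names from the text, mapped onto Python built-ins.
AGGREGATORS = {
    "sum": sum,
    "average": mean,
    "max": max,
    "min": min,
    "count": len,
}

def parent_height(child_heights, how="sum"):
    """Combine the heights of a node's children into the parent's height."""
    return AGGREGATORS[how](child_heights)

children = [10, 20, 30]
for how in AGGREGATORS:
    print(how, parent_height(children, how))
```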
Colors

This set of options lets you
•

specify the list of colors to use

•

specify the kind of mapping

•

map colors to bars, node bases, and disks

To use these Colors options, you must have mapped a column to the *Color - Bar,
*Color - Disk, or *Color - Base requirements of the Data Destination panel. See
“Choosing Colors” and “Using the Color Browser” in Chapter 3 for a more detailed
explanation of how to choose and change colors.

Chapter 5: Using the Tree Visualizer

Color list to use lets you specify the color list using the + button next to the color list label.
This brings up a color editor that lets you specify a color to be added to the list.
Kind of mapping lets you specify whether the color change that is shown in the graphic
display is Continuous or Discrete. If you choose Continuous, the color values (of the bars,
node bases, or disks) shift gradually between the colors entered in the Color list to use
field as a function of the values that are mapped to those colors in the Color mapping field.
If you choose Discrete, the colors change only at the specified boundaries.
Color mapping lets you specify values to which the colors are mapped.
Example 5-1

If you
•

used the Color Browser to apply red and green to bars

•

selected Discrete for the Kind of mapping

•

entered the values 0 100

then the display shows all bars (or node bases or disks) with values of less than 100 in
red, and all those with values greater than or equal to 100 in green.
Example 5-2

If you
•

used the Color Browser to apply red and green to bars

•

selected Continuous for the Kind of mapping

•

entered the values 0 100

then the display shows all bars (or node bases or disks) with values less than or equal to
0 as completely red, those greater than or equal to 100 as completely green, and those
between 0 and 100 as shadings from red to green.
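
The difference between the two kinds of mapping in Examples 5-1 and 5-2 can be sketched in Python. This is an illustration under stated assumptions, not the Tree Visualizer's code: colors are taken to be RGB tuples in the range 0.0 to 1.0, the mapped values are assumed to be strictly ascending with one value per color, and the continuous blend is assumed to be linear.

```python
from bisect import bisect_right

RED = (1.0, 0.0, 0.0)
GREEN = (0.0, 1.0, 0.0)

def map_color(value, colors, stops, kind="continuous"):
    """Map a value to a color, given one stop value per color.

    Discrete: the color changes only when a stop is reached.
    Continuous: colors are blended linearly between neighboring stops.
    Values outside the stops take the first or last color.
    """
    if value <= stops[0]:
        return colors[0]
    if value >= stops[-1]:
        return colors[-1]
    i = bisect_right(stops, value) - 1  # stops[i] <= value < stops[i + 1]
    if kind == "discrete":
        return colors[i]
    t = (value - stops[i]) / (stops[i + 1] - stops[i])
    return tuple(a + t * (b - a) for a, b in zip(colors[i], colors[i + 1]))

print(map_color(50, [RED, GREEN], [0, 100], kind="discrete"))    # still red
print(map_color(50, [RED, GREEN], [0, 100], kind="continuous"))  # halfway blend
```

With the values 0 100, a bar of value 50 is drawn in pure red under Discrete and in an even mix of red and green under Continuous; a bar of value 100 or more is green in both cases.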

Color Aggregation

By default, the color values of the bars of a parent node are the sum of the values
of all the bars of its children; however, you can instead use the average, max, min, or any of
the other values that appear. This aggregation can be applied to bar colors, node base
colors, and disk colors.
Color by Key

This option lets you automatically color the bars by their key value. This option is
ignored if another coloring was specified. If you specify no color list, or specify
insufficient colors, additional colors are chosen at random. If extra colors are specified,
they are ignored.
Make Fixed

By default, this option places all bars across one row. This option allows changing the
number of rows or columns. If neither rows nor columns are selected, or the number is
set to 0, then neither rows nor columns are fixed, and the closest approximation to a
square is displayed.
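
One way to picture the row and column choices is the sketch below. It is a guess at the arithmetic, not the documented algorithm: in particular, how the Tree Visualizer picks the "closest approximation to a square" is not specified here, so the square-root rule and the function name are assumptions.

```python
import math

def grid_shape(n_bars, rows=0, columns=0):
    """Return (rows, columns) for laying out n_bars on a node base.

    A positive rows or columns value fixes that dimension; with neither
    fixed, a near-square grid is chosen (assumed rule: the column count
    is the ceiling of the square root of the number of bars).
    """
    if n_bars <= 0:
        return (0, 0)
    if rows > 0:
        return (rows, math.ceil(n_bars / rows))
    if columns > 0:
        return (math.ceil(n_bars / columns), columns)
    columns = math.ceil(math.sqrt(n_bars))
    return (math.ceil(n_bars / columns), columns)

print(grid_shape(10, rows=1))     # all bars across one row
print(grid_shape(10, columns=5))  # five columns fixed
print(grid_shape(10))             # nothing fixed: near-square grid
```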
Message

This option lets you type in any message you want. The message statement specifies the
message displayed when the pointer is moved over an object or when an object is
selected. By default, the same message is used for the base as for the bars. If no message
is specified, a default message containing the names and values of all the columns is
used.
The format of the message must match the type of data being used:
•

Strings must use %s.

•

Ints must use integer formats (like %d).

•

Floats and doubles must use floating-point formats (like %f).

For a detailed description of the message field, see “Message Statements” in Appendix B.
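
The type-matching rule follows printf conventions. Python's % operator uses the same conversion characters, so it can serve as a quick analogy; the snippet below illustrates only the conversions, not the message statement syntax itself (which is described in Appendix B), and the helper name and sample values are made up.

```python
def format_message(template, *values):
    """Fill a printf-style template with values, one per conversion."""
    return template % values

# %s for a string, %d for an integer, %.2f for a floating-point value.
print(format_message("%s: %d units, $%.2f", "Widgets", 12, 3.5))

# A conversion that does not match the data type fails.
try:
    format_message("%d", "abc")
except TypeError as error:
    print("format mismatch:", error)
```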

Execute and Base Execute

These options let you type in a UNIX command that is executed when you double-click
a bar or base. If only the Execute field is filled in, it applies to both bars and bases. If
both are filled in, Execute applies to bars, and Base Execute applies to bases. The format
is similar to the message statement. If no execute statement appears, double-clicking has
no effect.
For a detailed description of the Execute field, see “The Execute Statement” in
Appendix B.
Sky Color

You can specify either one or two colors. If only one color is specified, the sky is solid. If
two colors are specified, the sky is shaded between the colors. When specifying two
colors, the first color is for the top of the sky, the second for the bottom.
Ground Color

You can specify either one or two colors. If only one color is specified, the ground is solid.
If two colors are specified, the ground is shaded between the colors. For the ground, the
first color is for the far horizon, the second is for the near ground.
Base Label Color

You can specify the color of the labels on the front of the bases.
Bar Label Color

You can specify the color of the labels on the front of the bars.
Line Color

You can specify the color of the lines connecting the bases.

Sort Order

If you select the Sort by Key checkbox, the nodes in the display are in sorted order. The
menu next to the checkbox lets you specify whether to sort in ascending or descending
order.
Resetting the Tool Options

If, after you have made changes to the Tool Options dialog box, you want to reset the
values of all options to their default values, click the Reset Options button.
Saving the New Tool Options

Once you have finished making changes to the Tool Options dialog box, click OK to
return to the Tool Manager’s main screen.

Saving Tree Visualizer Settings
The Tool Manager stores information for the Tree Visualizer in several files, all sharing
the same prefix:
•

.treeviz.data contains data.

•

.treeviz.schema describes the data file.

•

.treeviz contains information needed by the Tree Visualizer.

•

.mineset contains all the information needed to create the other files.

To specify a prefix, use the Save Current Session As ... menu option in the File menu of the
Tool Manager’s main window. If you do not specify a prefix, it is based on the data
source.
When you use the Invoke Tool button, the .data, .schema, and .treeviz files are updated, if
necessary.

Invoking the Tree Visualizer
To see the Tree Visualizer graphically represent your data, click the Invoke Tool button at
the bottom of the Data Destination panel.

Working in the Tree Visualizer’s Main Window
A file’s hierarchy is visible only after a valid configuration file is specified. For example,
specifying store.treeviz results in Figure 5-5.

Figure 5-5

Tree Visualizer’s Initial View When Specifying store.treeviz

The root node of the hierarchy is at the front of the scene, near the bottom of the Tree
Visualizer’s main window. In back of the root node are its descendents; each one consists
of a base with bars on it. You can change what the heights and colors of the bars represent
via the Tool Manager or by manually changing the .treeviz configuration file; usually, the
base represents the aggregate of all the bars. Bases are connected with lines representing
the connection of the nodes to their descendents.

Highlighting an Object or Node
To highlight an object, move the mouse over that object (either a base or a bar). This
causes information about that object to appear over the top left of the view area, under
the Pointer is over: label (Figure 5-6). To highlight a node and obtain information about
that node, place the pointer over a line leading to that node. This information appears in
the same place as that for an object.

Figure 5-6

A Highlighted Object and the Information It Represents

Selecting an Object
To select an object and zoom to it, left-click the mouse on that object. Hold the Ctrl key
down while clicking to select the object without zooming to it. At the top of the window,
under the label “Selection:”, you see information about a selected object. The information
is the same as that shown when highlighting an object. As long as the object is selected,
the information is displayed. This lets you compare information about two objects by
selecting one, then highlighting the other. Using the mouse, you can cut and paste
selection information into other applications, such as reports or databases.
If you hold the Shift key while left-clicking on an object, the selection of that object is
toggled. If the object is currently not selected, it then is selected; conversely, if it is
currently selected, it then is deselected. Using this technique, it is possible to select
multiple objects simultaneously. While the information under the “Selection:” label only
shows the information on the last object selected, it is possible to see the values for all
selections by using Selections > Show Values or by drilling through to the original data
behind the selections (see “The Selections Menu” on page 166).
If an execute statement was specified via Tool Manager or the configuration file, then
double-clicking on an object executes the appropriate command. If the -warnexecute
option was specified when invoking the Tree Visualizer, a warning is given first.

Spotlighting an Object
When you select an object, a white spotlight appears on it (Figure 5-7). A yellow spotlight
appears when you are searching (see “The Search Panel” on page 154). Spotlights are
visible even if the selected object is a descendent node in the far background.
The edges of spotlights are surrogates for an object: when you move the pointer over the
edge of a spotlight, the associated object is highlighted, and information about that object
appears above the top left of the view. Left-click the edge of a spotlight to select the
associated object and (if the Ctrl key is not held down) to zoom to it. The spotlight is
active only on the solid lines along the edges, not the translucent section in the center.
This lets you select objects behind the spotlight.

Figure 5-7

Example of a Selected (Spotlighted) Object

Using the Right Mouse Button
When the cursor is in the main window, clicking the right mouse button (or, if the mouse
has been reconfigured, the third button) brings up a menu that lets you select the children
of a node. If you click on a node with children, it provides you with a list of the children.
This list is displayed as long as you hold the mouse button down. If you do not click on
a node, but one is selected, it provides you with a list of children of the selected node. If
nothing is selected, or if the selected node has no children, no menu is displayed.

Navigating With the Middle Mouse Button
To navigate over the scene in the main window, use the middle mouse button. You also
can use external controls to perform all middle mouse button functions (see “External
Controls” on page 147).
To move through the main window, click the middle mouse button. A small square
appears (see Figure 5-8). Move the cursor out of this square while pressing the mouse to
move your point of reference dynamically through the 3D landscape. The farther the
cursor is from the square, the faster your viewpoint moves. To move the viewpoint
forward, move the mouse up. To move the viewpoint back, move the mouse down.
Moving the mouse left and right causes the viewpoint to shift accordingly. You can move
in any direction as long as a part of your data is visible.

Figure 5-8

Example of the Square as Navigational Base

To move the viewpoint up and down, hold the Shift key down when pressing the middle
mouse button. To move the viewpoint up, move the mouse up. To move the viewpoint
down, move the mouse down. You cannot move below ground level.
To combine horizontal and vertical motion (that is, to move the viewpoint back and forth,
as well as up and down), hold the Alt key down when pressing the middle mouse button.
Note that while moving forward, the viewpoint also moves down, based on the current
tilt. Similarly, while moving backward, the viewpoint moves up, based on the tilt.
Note: You cannot turn from side to side. Tilting the viewpoint requires using external
controls.

External Controls
Several external controls surround the graphics window. These consist of buttons and
thumbwheels.

Buttons
At the top right of the image area are eleven buttons as shown in Figure 5-9.
Home
Set Home
View All
Go Back
Go Forward
Parent
Move Left
Move Right
First Child
Last Child
Choose Child

Figure 5-9

Tree Visualizer’s External Button Controls

•

Home takes you to a designated location. Initially, this location is the first viewpoint
shown after invoking the Tree Visualizer and specifying a configuration file. If you
have been working with the Tree Visualizer and have clicked the Set Home button,
then clicking Home returns you to the viewpoint that was current when you last
clicked Set Home.

•

Set Home makes your current location the Home location. Clicking the Home button
returns you to the last location where you clicked Set Home.

•

View All lets you view the whole hierarchy, keeping the tilt of the camera. To get an
overhead view of the scene, tilt the camera to point straight down, then click the
View All button. To tilt the camera, see the description of the Tilt thumbwheel (see
“Thumbwheels” on page 149).

•

Go Back lets you return to the previous location. If you have just started the Tree
Visualizer and have not moved from the home view, this button is grayed out.

•

Go Forward lets you proceed to the location from which you clicked the Go Back
button. If you have not clicked the Go Back button, the Go Forward button is grayed
out.

•

Parent is active only when you have an object selected. If a bar is selected, clicking
this button selects the base containing the bar. If a base is selected, clicking this
button moves up the hierarchy to the parent node. Once the root node has been
reached (highest level of the hierarchy), the Parent button is grayed out. Note that
when using Parent, the selected node is changed to the parent of the previously
selected one.

•

Move Left lets you select the next sibling to the left. If a bar is selected, the bar to the
left of it is selected. If a base is selected, then, if the parent has another child to the
left, that is selected. This button is grayed out if nothing is selected, or if the current
selection has no sibling to the left.

•

Move Right lets you select the next sibling to the right. If a bar is selected, the bar to
the right of it is selected. If a base is selected, then, if the parent has another child to
the right, that is selected. This button is grayed out if nothing is selected, or if the
current selection has no sibling to the right.

•

First Child lets you select the first child of the current node. This button is grayed out
if there is no selection, if a bar is selected, or if the current selection has no children.

•

Last Child lets you select the last child of the current node. This button is grayed out
if there is no selection, if a bar is selected, or if the current selection has no children.

•

Choose Child produces a popup menu that lists all the children of the current node.
This button is grayed out if there is no selection, if a bar is selected, or if the current
selection has no children.

You also can perform these functions using the Go menu (see “The Go Menu” on
page 167).
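
The selection-moving buttons behave like ordinary tree navigation. The sketch below models that behavior for bases only; the Node class and function names are hypothetical, and a return value of None stands for the case in which the corresponding button is grayed out.

```python
class Node:
    """A node in the hierarchy, with an ordered list of children."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

def sibling(node, offset):
    """Return the sibling `offset` places to the right (negative means left)."""
    if node.parent is None:
        return None
    siblings = node.parent.children
    i = siblings.index(node) + offset
    return siblings[i] if 0 <= i < len(siblings) else None

def first_child(node):
    return node.children[0] if node.children else None

def last_child(node):
    return node.children[-1] if node.children else None

root = Node("root")
a, b, c = Node("a", root), Node("b", root), Node("c", root)

print(sibling(b, -1).name)     # Move Left from b selects a
print(sibling(c, +1))          # Move Right from c: no sibling, button grayed out
print(first_child(root).name)  # First Child of root
print(b.parent.name)           # Parent of b
```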

Thumbwheels
Four thumbwheels appear around the lower part of the graphics window border (see
Figure 5-10). They let you dynamically move the viewpoint.

Figure 5-10

Tree Visualizer’s Thumbwheels

•

The vertical H (height) thumbwheel, on the upper left, moves the camera up and
down. You cannot move the viewpoint below ground level.

•

The vertical Tilt thumbwheel, at the bottom left, tilts the camera. You can tilt the
viewpoint to any position from straight ahead to straight down. You cannot tilt
the viewpoint to look up.

•

The horizontal <--> (pan) thumbwheel, at the bottom left, moves the viewpoint
from left to right and back. You cannot rotate the viewpoint.

•

The vertical Dolly thumbwheel, on the right, moves the viewpoint forward and
backward.

Height Slider
A slider to the top left of the main window (Figure 5-11) lets you rescale all objects in the
window. Pushing the slider up to a value of 2.0 doubles the size of all objects in the main
window. Pulling the slider back down to a value of 1.0 returns the objects in the window
to their original heights.

Figure 5-11

Tree Visualizer’s Height Slider

Pulldown Menus
You also can access all of the Tree Visualizer’s functions via five pulldown menus. These
are labeled File, Show, Display, Go, and Help.
If you start the Tree Visualizer without specifying a configuration file, only the File and
the Help menus are available. The Show, Display, and Go menus are available after a
graph is loaded.

The File Menu
The File menu (Figure 5-12) contains nine options.

Figure 5-12

Tree Visualizer’s File Pulldown Menu With Options

•

Open loads and opens a configuration file, displaying it in the main window.
Previously displayed data is discarded. Use Open to view a new dataset, or to view
the same dataset after changing its configuration.

•

Open Other Window opens a configuration file, but displays its results in a different
window. The current dataset remains open.

•

Reopen reopens the currently opened file. This can be used after the configuration or
data file has been updated.

•

Copy Other Window opens a new window that displays the same view of the current
dataset. You can interact with these windows independently.

•

Save As saves the state of the current Tree Visualizer window into an image file. You
specify the file name (default is treeviz.rgb), the format (default is rgb), and
whether to save the entire window, including any legends, or just the main scene
with the graphical objects (default is the full window).

•

Print Image outputs the state of the current Tree Visualizer window to a printer. You
can specify the output printer using a Print dialog panel (default is your system's
default printer) and, like the Save As dialog, choose whether to print the entire
window or just the main scene window.

•

Start Tool Manager starts the Tool Manager (if not already running), and restores it to
the state it was in when the Tree Visualizer was invoked.

•

Close closes the current window (and all panels associated with it). If no other
windows are open, Close exits the application.

•

Exit closes all windows and exits the application.

The Show Menu
The Show menu (Figure 5-13) contains four options:
•

Overview

•

Search Panel

•

Filter Panel

•

Marks Panel

Each of these options brings up another dialog box for interacting with the data.

Figure 5-13

Tree Visualizer’s Show Pulldown Menu With Options

The Overview Window

Select Overview in the Show menu to bring up a new window with an overhead view of
the complete hierarchy (Figure 5-14). If you want the Overview to be brought up
automatically each time the scene is viewed, set the Overview option in the configuration
file (see “Overview” on page 566).

Figure 5-14

Tree Visualizer’s Overview Window

The “X” in the Overview window shows your current location. The Overview helps you
keep track of your location and viewpoint in the entire scene. It can also help you quickly
go to a specific node.
To select an object in the Overview and have the main view zoom to it, left-click that
object. This is similar to left-clicking the object in the main view. Middle-clicking
anywhere in the overview zooms your viewpoint to that location, even if no object is at
that point.

The Search Panel

Select Search Panel in the Show menu to bring up a dialog box that lets you specify criteria to
search for objects (Figure 5-15).

Figure 5-15

Tree Visualizer’s Search Dialog Box

Once the search is complete, yellow spotlights highlight objects matching the search
criteria (see Figure 5-16). To display information about an object under a yellow
spotlight, move the pointer over that spotlight; the information appears in the upper left
corner, under the label Pointer is over:. To select and zoom to an object under a yellow
spotlight, left-click the spotlight; if you press the Ctrl key while clicking, zooming does
not occur.

Figure 5-16

Sample Results of a Search in the Tree Visualizer

Items in the Search Panel

To specify whether a search is case-sensitive, click the Ignore Case In Searches checkbox, at
the top of the Search panel. For example, if this toggle is on (a check mark appears on that
button), the string “hello” is the same as “HellO.”
To the right of the case sensitivity checkbox is another, labeled Treat Nulls as Zeros. If this
checkbox is off (the default), comparisons involving nulls cannot return TRUE in a
search. If it is on, nulls are treated as equal to zero.

Below the case-sensitivity checkbox are controls that let you specify the parts of the
hierarchy to be searched. By default, the whole hierarchy is searched. To limit the levels
searched, select a relational operator (such as <=) from the option menu that lets you
specify the operand for the level. Then use the slider to select the level to be searched.
Level 0 is the root of the hierarchy, level 1 is the level below that, and so forth. To search
the root and the two levels below that, for example, choose <= 2.
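
The level restriction amounts to comparing each node's level with the slider value. A minimal Python sketch follows; the names are made up for the example, and the operator set shown is the one listed for numeric searches later in this section, so the level menu itself may offer a different set.

```python
import operator

# Relational operators, keyed by the symbols used in the panel.
RELATIONS = {
    "=": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}

def level_selected(level, relation, bound):
    """Return True if a node at `level` falls within the searched levels."""
    return RELATIONS[relation](level, bound)

# With "<= 2", the root (level 0) and the two levels below it are searched.
print([level for level in range(5) if level_selected(level, "<=", 2)])
```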
Checkboxes also let you choose whether to search the bars or the bases.
When searching through bars, the default is that all bars are searched. To search only a
specific list of bars, you must select them. The Set All button turns on all bars; this is
useful if most of the bars are to be searched, and only a few are to be turned off. The Clear
button turns off all bars. If no bar is selected, the bar list is ignored, and all bars are
searched.
Below the panel for bar labels is a Hierarchy field that lets you specify nodes to search
(Figure 5-17). Below the Hierarchy field are fields that let you specify search criteria for
individual columns (defined in the Current Columns: window of the Tool Manager’s
Table Processing pane, see “Selecting the Tree Visualizer Tool” on page 132).

Figure 5-17

Detail of the Tree Visualizer’s Search Dialog Box

To search for numeric values, enter the value, and select a relational operation (=, !=, >,
<, >=, <=). To search for alphanumeric values, enter the string for which you want to
search. You can use any of three types of string comparisons:
•

“Contains” matches any value that contains the specified string. For example, California
contains the strings Cal and forn.

•

“Equals” requires the strings to match exactly.

•

“Matches” allows wildcards:
–

An asterisk (*) represents any number of characters.

–

A question mark (?) represents one character.

–

Square brackets ([ ]) enclose a list of characters to match.

For example, California matches Cal*, Cal?fornia, and Cal[a-z]fornia.
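
The three string comparisons can be approximated in Python, whose standard fnmatch module happens to use the same *, ?, and [ ] wildcards. This is an analogy for experimenting with patterns, not the Tree Visualizer's matcher; the function name and the way the ignore-case setting is applied (lowercasing both sides) are assumptions.

```python
from fnmatch import fnmatchcase

def string_test(value, mode, pattern, ignore_case=False):
    """Apply a Contains, Equals, or Matches comparison to a string value."""
    if ignore_case:
        value, pattern = value.lower(), pattern.lower()
    if mode == "contains":
        return pattern in value
    if mode == "equals":
        return value == pattern
    if mode == "matches":
        return fnmatchcase(value, pattern)
    raise ValueError("unknown comparison: " + mode)

print(string_test("California", "contains", "forn"))
print(string_test("California", "matches", "Cal[a-z]fornia"))
print(string_test("hello", "equals", "HellO", ignore_case=True))
```

All three calls return True; without ignore_case, the last comparison fails because the strings differ in case.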
In some cases (usually associated with binning in the Tool Manager), an option menu of
values appears, instead of a text field. To ignore that variable, select Ignored in the Option
menu. You can use relational operators (such as >=) with these options. This means that
the specified value as well as subsequent ones are selected.
In addition to numeric and string comparison operations, you can specify Is Null,
which is true if the value is null.
To the right of each search field is an additional option menu that lets you specify “And”
or “Or” options. For example, you could specify “sales > 20 And < 40.” You can have any
number of And or Or clauses for a given column, but cannot mix And and Or in a single
column.
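
Combining several clauses on one column, together with the Treat Nulls as Zeros setting, can be sketched as follows. The clause format and function name are assumptions made for the example, None stands in for a null value, and the separate Is Null test is not modeled.

```python
import operator

RELATIONS = {
    "=": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}

def column_matches(value, clauses, combine="and", nulls_as_zeros=False):
    """Test one column value against clauses joined by all-And or all-Or.

    clauses is a list of (relation, target) pairs, for example
    [(">", 20), ("<", 40)] for "sales > 20 And < 40".
    """
    if value is None:
        if not nulls_as_zeros:
            return False  # comparisons involving nulls cannot be true
        value = 0
    results = [RELATIONS[relation](value, target) for relation, target in clauses]
    return all(results) if combine == "and" else any(results)

between = [(">", 20), ("<", 40)]
print(column_matches(30, between))    # True: 30 lies between 20 and 40
print(column_matches(45, between))    # False
print(column_matches(None, between))  # False: nulls never match by default
```

Because a column uses either And or Or throughout, a single combine setting is enough; mixed expressions would need grouping, which the panel does not offer.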
Note that if different levels of the hierarchy are keyed by different types of data (for
example, the top level is selected by strings, while the second level is selected by
integers), then the “Hierarchy” search field is treated as a string and provides string
operations, not number operations.
If the Ignore Case In Searches checkbox is checked, the comparisons of all string searches
are case-insensitive.

Six buttons are placed across the bottom of the Search panel:
•

Search causes the search to be started. This button is automatically activated if the
Enter key is pressed and the panel is active.

•

Clear turns off all search spotlights and erases the values from the search fields.

•

Next selects and zooms to the next matched object, in left-to-right order. After the
last matched object is selected, clicking Next returns the view to the Home position.
Next is valid only after a search that has found matches.

•

Previous selects and zooms in the opposite order from that of the Next button.

•

Select causes all objects that matched the search criteria to be selected. The Selections
menu can then interact with these objects.

•

Close closes the search window and turns off the search spotlights. If the Search
panel is reopened, it is in the same state as it was before the last Close; clicking Search
again repeats the last search.

The Filter Panel

The Filter panel filters out selected information, thus fine-tuning the displayed hierarchy.
You can use the Filter panel to emphasize specific information, or to shrink the amount
of data for better performance. Figure 5-18 shows a sample Filter panel.

Figure 5-18

Tree Visualizer’s Filter Dialog Box

To specify whether a filter is case-sensitive, click the Ignore Case In Filter checkbox, at the
top of the Filter panel. For example, if this toggle is on (a check mark appears on that
button), the string “hello” is the same as “HellO.”

To the right of the case sensitivity checkbox is another, labeled Treat Nulls as Zeros. If this
checkbox is off (the default), comparisons involving nulls cannot return TRUE in a filter.
If it is on, nulls are treated as equal to zero.
Below the case-sensitivity checkbox are controls that let you specify the parts of the
hierarchy to be filtered. By default, the whole hierarchy is filtered. To limit the levels
filtered, select a relational operator (such as <=) from the option menu that lets you
specify the operand for the level. Then use the slider to select the level to be filtered. Level
0 is the root of the hierarchy, level 1 is the level below that, and so forth. To filter the root
and the two levels below that, for example, choose <= 2.
Checkboxes also let you choose whether to filter the bars or bases.
When filtering bars, the default is that all bars are filtered. To filter only a specific list of
bars, you must select them. The Set All button turns on all bars; this is useful if most of
the bars are to be filtered, and only a few are to be turned off. The Clear button turns off
all bars. If no bar is selected, the bar list is ignored.
Filtering bars does not affect the information in the base, which continues to include the
summary of all bars.
Below the panel for bar labels is a Hierarchy field, which lets you specify nodes to filter.
Below the Hierarchy field are fields that let you specify filter criteria for individual
columns (defined in the Current Columns: window of the Tool Manager’s Table
Processing pane, see “Selecting the Tree Visualizer Tool” on page 132).
To filter for numeric values, enter the value, and select a relational operation (=, !=, >, <,
>=, <=). To filter for alphanumeric values, enter the string for which you want to filter.
You can use any of three types of string comparisons:
•

“Contains” matches any value that contains the specified string. For example, California
contains the strings Cal and forn.

•

“Equals” requires the strings to match exactly.

•

“Matches” allows wildcards:
–

An asterisk (*) represents any number of characters.

–

A question mark (?) represents one character.

–

Square brackets ([ ]) enclose a list of characters to match.

For example, California matches Cal*, Cal?fornia, and Cal[a-z]fornia.

In some cases (usually associated with binning in the Tool Manager), an option menu of
values appears, instead of a text field. To ignore that variable, select Ignored in the Option
menu. You can use relational operators (such as >=) with these options. This means that
the specified value as well as subsequent ones are selected.
In addition to numeric and string comparison operations, you can specify Is Null,
which is true if the value is null.
To the right of each filter field is an additional option menu that lets you specify “And”
or “Or” options. For example, you could specify “sales > 20 And < 40.” You can have any
number of And or Or clauses for a given column, but cannot mix And and Or in a single
column.
Note that if different levels of the hierarchy are keyed by different types of data (for
example, the top level is selected by strings, while the second level is selected by
integers), then the “Hierarchy” filter field is treated as a string and provides string
operations, not number operations.
If the Ignore Case In Filters checkbox is checked, the comparisons of all string filters are
case-insensitive.
If a node does not meet the filter criteria, has no bars that meet the criteria, and has no
children that meet the criteria, the node is not shown. There can be, however, cases in
which a specific object meets the filter criteria, but its ancestors up the tree do not. Also,
other bars in the same node might not meet the criteria. Since position is important in
interpreting context, it might not be good to eliminate those bars. Consequently, you are
given an option of selecting one of three radio buttons that control how these objects
should be drawn: Solid, Outline, and Hidden. Note, however, that if objects are drawn in
a less solid form due to the Display Zeros or Display Null menu, they are displayed
appropriately. For example, if Nulls are to be hidden, they are always hidden, regardless
of the filter criteria.
The exception to this is when filtering to specific bars. In such a case, the other bars are
eliminated and don’t take up space, regardless of the radio button settings.
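
The drawing rule can be sketched as a small recursive function. This is an interpretation rather than the actual implementation: the dictionary layout and names are hypothetical, "children that meet the criteria" is read as any kept descendant, and both the Display Zeros and Display Null overrides and the specific-bars exception are left out.

```python
def draw_styles(node, meets, unmatched="outline"):
    """Return {object name: style} for a subtree; dropped nodes are absent.

    node is {"name": str, "bars": [bar names], "children": [nodes]}.
    meets(name) says whether an object passes the filter criteria.
    unmatched is the radio button choice: "solid", "outline", or "hidden".
    """
    styles = {}
    for child in node["children"]:
        styles.update(draw_styles(child, meets, unmatched))
    bar_hits = [meets(bar) for bar in node["bars"]]
    node_hit = meets(node["name"])
    # Drop the node only if it, its bars, and everything below it all fail.
    if not (node_hit or any(bar_hits) or styles):
        return {}
    styles[node["name"]] = "solid" if node_hit else unmatched
    for bar, hit in zip(node["bars"], bar_hits):
        styles[bar] = "solid" if hit else unmatched
    return styles

tree = {
    "name": "root", "bars": ["r1"],
    "children": [
        {"name": "a", "bars": ["a1", "a2"], "children": []},
        {"name": "b", "bars": ["b1"], "children": []},
    ],
}
print(draw_styles(tree, lambda name: name == "a1"))
```

Only bar a1 meets the criteria, so it is drawn solid. Its neighbor a2, its base a, and the root are kept for context and drawn in the chosen style, while node b disappears entirely.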
The Height Filter slider lets you filter out those nodes containing only short bars. The size
of a value is shown as a percentage of the maximum height. First, the tallest bar in the
scene is calculated (if heights are normalized by level, then the tallest bar in each level).
Then only those nodes that contain at least one bar that is the appropriate percentage of
the tallest bar are shown.

Chapter 5: Using the Tree Visualizer

For example, if you enter 5% in this field, then only those nodes containing at least one
bar that is at least 5% of the height of the tallest bar are shown. (Ancestors of such bars
are also shown.) This option is intended as a coarse way to filter out small, uninteresting
nodes. It is not intended as an exact mechanism of identifying specific nodes of a certain
value; use the search panel for that purpose. Use of this option can accelerate the
rendering of slow, complex scenes, or reduce clutter resulting from many bars near zero
height. You can also set this filtering option in the configuration file by using the Height
Filter command.
Although small nodes are filtered out, they are nonetheless counted in any cumulation
up the hierarchy.
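The Height Filter rule can be sketched in Python as follows. This is a simplified illustration, not the Tree Visualizer's implementation: it assumes heights are not normalized by level, and the function name and the node dictionary layout are invented for the example.

```python
def visible_nodes(nodes, percent):
    """nodes maps node name -> (parent name or None, list of bar heights).

    A node is kept if at least one of its bars reaches the given
    percentage of the tallest bar in the scene. Ancestors of kept
    nodes are kept as well.
    """
    tallest = max(h for _, bars in nodes.values() for h in bars)
    threshold = tallest * percent / 100.0
    keep = set()
    for name, (_, bars) in nodes.items():
        if any(h >= threshold for h in bars):
            # Walk up the tree so that the ancestors stay visible.
            while name is not None and name not in keep:
                keep.add(name)
                name = nodes[name][0]
    return keep

scene = {
    "chain": (None, [1.0, 2.0]),
    "west": ("chain", [100.0, 40.0]),
    "east": ("chain", [3.0, 4.0]),
}
print(sorted(visible_nodes(scene, 5)))  # ['chain', 'west']
```

In this made-up scene, "east" is dropped because its tallest bar is under 5% of 100, while "chain" survives only as the ancestor of "west".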
The Depth slider, which is under the Height Filter slider, lets you display the hierarchy
so that only a given number of levels are displayed at any given time. When you are at
the top of the hierarchy, only the number of hierarchical levels specified by the slider is
seen. The nodes in the rows are arranged to optimize their visibility. When navigating to
nodes lower in the hierarchy, additional rows are made visible automatically. The nodes
above them automatically adjust their locations to accommodate the newly added nodes;
thus, some nodes might seem to move. Note that the overview shows all nodes in the
hierarchy, not just the top nodes; thus, the layout of the overview might not match the
layout of the main view. The X in the overview approximates the corresponding location
in the main view; there is no exact mapping between the two layouts.
•

Click the Filter button to start filtering. If the Enter key is pressed while the panel is
active, filtering automatically starts.

•

Click the Close button to close the panel.

The Marks Panel

The Marks panel, opened from the Tree Visualizer's Show pulldown menu (Figure 5-13), lets you
name and store important locations (viewpoints) so that you can easily and quickly
return to them (see Figure 5-19). The location is stored relative to the currently selected
object. If no object is selected, the absolute location is recorded.
All marks can be indicated by colored flags in the main view. If the mark represents a
selected object, the flag is placed on that object. If it represents an absolute position, the
flag is placed at that position. To go to the mark, click the flag. All flags can be turned on
and off using the Mark Flags menu entry in the Display menu. (See Mark Flags in “The
Display Menu” on page 165).

Pulldown Menus

Figure 5-19

Tree Visualizer’s Marks Panel

•

Click the Mark button to mark the current location. Another dialog box appears (see
Figure 5-20) to prompt you for the name and color of the mark. The default name is
that of the currently selected object. The color controls the color of the flag
appearing in the main window and represents the mark. If you do not want a flag to
represent the mark, click the button with the “Not” symbol (slash through a circle).
To add another color to the palette, click the button with the plus symbol (+) to
bring up a color chooser.

Figure 5-20

Window Resulting From Clicking Mark Button

Figure 5-21 shows a sample main window with flags representing the created marks.

Figure 5-21

Main Window With Flags Representing Marks

•

Click the Go to button to go to the current location associated with the selected mark
in the panel. Double-clicking a mark has the same effect. If the object selected by
that mark no longer exists (because it was filtered out, or the data was changed
since the mark was created), the location shown is close to where the object would
have been.

•

Click the Delete button to delete the selected mark in the panel.

•

Click the Modify button to change the name or color of the selected mark in the
panel.

•

Click the Up button to move the selected mark in the panel up the listing order.

•

Click the Down button to move the selected mark in the panel down the listing
order.

•

Click the Close button to exit the marks panel.

The file storing the marks information has the same name as the configuration file, with
a .marks suffix appended. Whenever a mark is changed, all marks are saved to that file.
If all marks are deleted, the .marks file is removed. If mark changes cannot be saved
(because of a permission error, for instance), a warning appears; this warning is not
repeated when subsequent mark changes are attempted.
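The naming and save rules above can be summarized in a small Python sketch. The function names are hypothetical, and the one-name-per-line layout written here is only a placeholder, since this section does not describe the actual contents of a .marks file.

```python
import os

def marks_path(config_path):
    """Name of the marks file: the configuration filename plus ".marks"."""
    return config_path + ".marks"

def save_marks(config_path, marks):
    """Write all marks, or remove the file when no marks remain.

    The file layout used here is a placeholder, not the real format.
    """
    path = marks_path(config_path)
    if not marks:
        if os.path.exists(path):
            os.remove(path)
        return
    with open(path, "w") as f:
        for name in marks:
            f.write(name + "\n")

print(marks_path("store.treeviz"))  # store.treeviz.marks
```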

The Display Menu
The Tree Visualizer's Display menu lets you control several display parameters.

Figure 5-22

Tree Visualizer’s Display Menu

Base Heights is a checkbox that lets you turn the heights of the bases on and off. To see
negative numbers, or to make it easier to compare the bar heights, turn this option off.
Turning it on provides summary information about all the bars. The initial value of this
toggle can be changed with the “base height” statement in the configuration file.
Mark Flags is a toggle option that lets you turn on or off the flags representing marks (also
see “The Marks Panel”).
Zeros is a submenu that controls how objects with zero height are displayed. By default,
they are shown like other objects: a solid cube of height zero (a plane). The submenu lets
you specify them to be displayed as outlines (appearing as a hollow square), or to be
hidden completely (not drawn). The initial value of this setting can be changed using the
“zero” option in the configuration file (see “Zero” on page 568).
Nulls is a submenu that controls how objects of null height are displayed. It has the same
options as the zero menu; however, the default for null options is to display the objects
as an outline. The initial value can be changed using the “null” option in the
configuration file (see “Null” on page 569).

The Selections Menu
The Selections menu lets you drill through to the underlying data. This menu has five
items (see Figure 5-23).

Figure 5-23

Tree Visualizer’s Selection Menu

•

Show Values displays a table (Record Viewer) of the values for all selected objects.

•

Show Original Data retrieves and displays the records corresponding to what has
been selected. The resulting records are shown in a table viewer.

•

Send To Tool Manager inserts a filter operation, based on the current box selection(s),
at the beginning of the Tool Manager history. The actual expression used to do the
drill through is determined by the extents of the current box selection(s). If nothing is
selected, a warning message appears.

•

Complementary Drill Through causes the Show Original Data and Send To Tool Manager
selections, when used, to fetch all the data that are not selected.

•

Normalize Subtree determines the maximum height of the elements in the subtree,
and normalizes all values relative to that height.

For further details on drill-through, see Chapter 18, “Selection and Drill-Through.”
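As a rough illustration of what Complementary Drill Through changes, the following Python sketch returns either the records behind a selection or all the others. The record layout, the "state" key, and the function name are assumptions for the example; the expression MineSet actually generates is covered in Chapter 18.

```python
def drill_through(records, selected, complementary=False):
    """Return the records behind the selection, or everything else."""
    if complementary:
        return [r for r in records if r["state"] not in selected]
    return [r for r in records if r["state"] in selected]

# Made-up records keyed by a hypothetical "state" column.
records = [{"state": "CA", "sales": 30}, {"state": "NY", "sales": 18}]
print(drill_through(records, {"CA"}))
print(drill_through(records, {"CA"}, complementary=True))
```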

The Go Menu
The Go menu duplicates the functions of the buttons on the upper right-hand side of the
main window (see Figure 5-24). It also identifies keyboard shortcuts for some functions.

Figure 5-24

Tree Visualizer’s Go Pulldown Menu

•

Home takes you to a designated location. Initially, this location is the first viewpoint
shown after invoking the Tree Visualizer and specifying a configuration file. If you
have clicked the Set Home menu item, then clicking Home returns you to the
viewpoint that was current when you last clicked Set Home. The keyboard shortcut for
this function is Ctrl+H.

•

Set Home changes the Home location to your current location. Clicking the Home
menu item then returns you to the viewpoint that was current when you last clicked
Set Home.

•

View All shows the whole hierarchy, keeping the tilt of the camera. To get an
overhead view of the scene, tilt the camera to point straight down, then click the
View All menu item. (To tilt the camera, see the description of the Tilt thumbwheel
in “Thumbwheels” on page 149.)

•

Go Back lets you return to the previous location. If you have just started the Tree
Visualizer and have not moved from the home view, this menu item is grayed out.
The keyboard shortcut for this function is Ctrl+B.

•

Go Forward lets you proceed to the location from which you clicked the Go Back
menu item. If you have not clicked the Go Back menu item, the Go Forward menu
item is grayed out. The keyboard shortcut for this function is Ctrl+R.

•

Parent is active only when an object is selected. If a bar is selected, clicking this
menu item selects the base containing the bar. If a base is selected, clicking this
menu item moves up the hierarchy to the parent node. Once the root node has been
reached (highest level of the hierarchy), the Parent menu is grayed out. The
keyboard shortcut for this function is Ctrl+U.

•

Move Left lets you select the next sibling to the left. If a bar is selected, the bar to the
left of it is selected. If a base is selected, then, if the parent has another child to the
left, that is selected. This button is grayed out if nothing is selected, or if the current
selection has no sibling to the left.

•

Move Right lets you select the next sibling to the right. If a bar is selected, the bar to
the right of it is selected. If a base is selected, then, if the parent has another child to
the right, that is selected. This button is grayed out if nothing is selected, or if the
current selection has no sibling to the right.

•

First Child lets you select the first child of the current node. This button is grayed out
if there is no selection, if a bar is selected, or if the current selection has no children.

•

Last Child lets you select the last child of the current node. This button is grayed out
if there is no selection, if a bar is selected, or if the current selection has no children.
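The Go Back and Go Forward items behave like a browser-style history. A minimal Python sketch of that idea follows; the class name is invented, and the choice to discard the forward list after a new move is an assumption, since the text above does not say what happens in that case.

```python
class ViewHistory:
    """Back/forward list of viewpoints, in the style of a web browser."""

    def __init__(self, home):
        self.back = []       # viewpoints visited before the current one
        self.forward = []    # viewpoints left by using Go Back
        self.current = home

    def move_to(self, viewpoint):
        self.back.append(self.current)
        self.forward.clear()  # assumption: a new move discards the forward list
        self.current = viewpoint

    def can_go_back(self):
        return bool(self.back)     # False corresponds to a grayed-out item

    def can_go_forward(self):
        return bool(self.forward)

    def go_back(self):
        self.forward.append(self.current)
        self.current = self.back.pop()

    def go_forward(self):
        self.back.append(self.current)
        self.current = self.forward.pop()

history = ViewHistory("home")
print(history.can_go_back())   # False
history.move_to("west region")
history.go_back()
print(history.current)         # home
```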

The Help Menu
The Help menu (see Figure 5-25) provides access to six help functions.

Figure 5-25

Tree Visualizer’s Help Pulldown Menu

•

Click for Help turns the cursor into a question mark. Placing this cursor over an
object in the main window and clicking the mouse causes a help screen to appear;
this screen contains information about that object. Closing the help window restores
the cursor to its arrow form and deselects the help function. The keyboard shortcut
for this function is Shift+F1. (Note that it also is possible to place the arrow cursor
over an object and press the F1 function key to access a help screen about that
object.)

•

Overview provides a brief summary of the major functions of this tool, including
how to open a file and how to interact with the resulting view.

•

Index provides an index of the complete help system. This option is currently
disabled.

•

Keys & Shortcuts provides the keyboard shortcuts for all of the Tree Visualizer’s
functions that have accelerator keys.

•

Product Information brings up a screen with the version number and copyright notice
for the Tree Visualizer.

•

MineSet User’s Guide invokes the IRIS InSight viewer with the online version of this
manual.

Null Handling in the Tree Visualizer
Nulls represent unknown data (see Appendix J, “Nulls in MineSet”).
In the Tree Visualizer, nulls can occur in the following cases:
•

The database or data file contains a null value.

•

The skipMissing option is not present in the configuration file (see skipMissing in
Appendix B), and data is present for the key value in one node of the hierarchy but
not in another. For example, in a representation of state budgets, if there is no record
for state income tax for Texas, Texas would have an income tax of null. This differs
from the case where there is a record showing 0 as the income tax for Texas, in which
case Texas would show a tax of 0.

•

When the Tool Manager is used to make an array based on bins and no data falls
into a specific bin, the value for that bin is null. For example, if there is no data for
30-40 year olds, that bin is null.

•

When making an array in the Tool Manager and the null enum option is specified,
an extra array entry, corresponding to the first bar in each bar chart, is created to
represent the aggregation of all the values where the bin value is null (see
“Aggregations in the Presence of Nulls” in Appendix J). This bar is labeled with a
question mark (?), representing null. If there is no data for that null bin, the values
associated with it are null as well.
Note: If all values throughout the data associated with the null bin are null, the Tree
Visualizer ignores the null bin and does not display it.
•

Expressions and aggregations of nulls can generate nulls (see Appendix J).
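The difference between a bin that sums to zero and a bin that received no data can be shown with a short Python sketch. None stands in for null here; the function name, the half-open bin boundaries, and the sample figures are assumptions made for the illustration.

```python
def sum_by_bin(records, bins):
    """Sum values per age bin; a bin with no records stays None (null)."""
    totals = {label: None for label, _, _ in bins}
    for age, value in records:
        for label, low, high in bins:
            if low <= age < high:
                totals[label] = (totals[label] or 0) + value
                break
    return totals

bins = [("20-30", 20, 30), ("30-40", 30, 40), ("40-50", 40, 50)]
records = [(25, 10.0), (27, 0.0), (45, 5.0)]   # made-up (age, dollars) pairs
print(sum_by_bin(records, bins))
# {'20-30': 10.0, '30-40': None, '40-50': 5.0}
```

No record falls into the 30-40 bin, so its value is null rather than 0, which matches the bin example in the list above.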

When a null value is mapped to a visual attribute, special representations are used in the
Tree Visualizer. If null is mapped to height, the object is normally drawn in outline mode,
although this is configurable through the Display menu (see “The Display Menu”) or the
configuration file (see “Null” in Appendix B). For a bar or a base, this looks
like an empty square. (It does not look like a cube, since it has no height.) For a disk, it
looks like a circle. If a null value is mapped to a color, it is drawn in a dark grey (see
Figure 5-26).

Figure 5-26

Representation of a Null Value Mapped to Height, Color, Disk, and Label

When you select an object with a null value, the value is shown as a question mark (?) in
the selection field.

Sample Configuration and Data Files
The provided sample configuration and data files demonstrate the Tree Visualizer’s
features and capabilities. The following files are in the directory
/usr/lib/MineSet/treeviz/examples:
•

store.data and store.treeviz
When graphically displayed, these files show hypothetical sales data for a store
chain. The hierarchy includes the entire chain, regions, states, cities, and individual
stores. Four products are shown for each level in the hierarchy. In this configuration,
heights represent sales in dollars; colors represent the percentage of the target dollar
amount.

•

stateRevenue.data and stateRevenue.treeviz
When graphically displayed, these files show the revenue components of each
state’s budget for 1992, as obtained from the United States Census Bureau (from
http://www.census.gov/govs/state/stfin92.dat). Heights represent the dollar
amounts in taxes. The descendant nodes in the background show the contribution
of various taxes to the total revenues shown in the root node.

•

beer.data and beer2.data, and beer.treeviz and beer2.treeviz
When graphically displayed, these files show fictitious data based on consumer
research of beer purchases. The hierarchy contains three levels:
1.

The first is category (for example, beer or ale).

2.

The second level is brand codes (randomly assigned).

3.

The third is the individual product codes; for example, twelve-pack versus
six-pack (randomly assigned).

Each chart contains seven bars, representing seven age groups. Bar height
represents the total dollars spent by that age group. Colors represent the percentage
of dollars spent by males and females. Brands, products, and data used in these files
are samples only.
Both beer.treeviz and beer2.treeviz produce the same graphical output, but they have
been constructed differently. In beer.treeviz, each type of beer is represented by a
single record, with values for male and for female consumption; these values are
stored in an enumerated array (explained in Appendix B, “Creating Data and
Configuration Files for the Tree Visualizer”).
In beer2.treeviz, there are seven records for each beer, with each record representing
one age group. Note that in the beer file, the age groups are represented in the
configuration file; in the beer2 file, they are included in the data file.
The beer file requires less storage space than the beer2 file; however, the
configuration file is a little more complicated. In some cases, it might be easier to
produce data in the form used by the beer2 file.
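The relationship between the two layouts can be sketched in Python. The example collapses one-record-per-age-group rows (the beer2 style) into one record per product with an array of values (the beer style). The function name, the product codes, and the three age groups are made up for brevity; the sample files use seven age groups, and their exact file syntax is described in Appendix B rather than here.

```python
def to_array_form(rows, age_groups):
    """Collapse (product, age_group, dollars) rows into product -> array.

    The array is ordered like age_groups; an age group with no row is
    left as None.
    """
    index = {group: i for i, group in enumerate(age_groups)}
    records = {}
    for product, group, dollars in rows:
        record = records.setdefault(product, [None] * len(age_groups))
        record[index[group]] = dollars
    return records

age_groups = ["21-30", "31-40", "41-50"]   # hypothetical labels
rows = [("P1", "21-30", 12.0), ("P1", "41-50", 7.5), ("P2", "31-40", 3.0)]
print(to_array_form(rows, age_groups))
# {'P1': [12.0, None, 7.5], 'P2': [None, 3.0, None]}
```

The array form stores the age-group labels once instead of repeating them on every row, which is the storage saving mentioned above.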
Additional examples that use the Tree Visualizer to visualize a decision tree are provided
in Chapter 11.

Chapter 6

6. Using the Map Visualizer

This chapter discusses the features and capabilities of the Map Visualizer. It provides an
overview of this visualization tool, then explains the Map Visualizer’s functionality
when working with the following elements:
•

main window

•

viewing modes

•

external controls

•

pulldown menus

Finally, it lists and describes the sample files provided for this tool.

Overview of Map Visualizer
The Map Visualizer is a graphical interface that displays data as a three-dimensional
“landscape” of arbitrarily specified and positioned “bar chart” shapes. This tool displays
quantitative and relational characteristics of your geographically oriented data.
Data items are associated with graphical “bar chart” objects in the visual landscape.
However, the objects have recognizable geographical shapes and positions. The
landscape can consist of a collection of these geographical objects, each with individual
heights and colors (see Figure 6-1). You can dynamically navigate through this landscape
by
•

panning

•

rotating

•

zooming to more clearly see areas of interest

•

drilling down to see increased granularity of geographic details

•

drilling up to aggregate data into coarser-grained graphical objects

•

using animation to see how the data changes across one or two independent
dimensions.

Chapter 6: Using the Map Visualizer

Figure 6-1

Sample Map Visualizer Screen Showing 1990 U.S. Population

The landscape can also consist of a flat plane of these geographical objects drawn as
simple outlines, with “bar chart” cylinders placed at specific locations (see Figure 6-2).

Figure 6-2

Sample Map Visualizer Screen Showing Relative Population of Major U.S. Cities

Another landscape possibility is lines with endpoints at specific point locations, all with
individual widths and colors (see Figure 6-3). Lines have width and color properties,
instead of the height and color properties of the arbitrarily shaped objects and cylinders.

Figure 6-3

Sample Map Visualizer Screen Showing the United States With Specific Endpoints

File Requirements
The Map Visualizer requires the following files:
•

A data file consisting of rows of tab-separated fields. Typically, the Tool Manager
creates this file (see Chapter 3). You can also generate this file without using the Tool
Manager (for the required file format, see Appendix C, “Creating Data,
Configuration, Hierarchy, and GFX Files for the Map Visualizer”).
Data files are the result of extracting raw data from a source (such as an Oracle,
INFORMIX, or Sybase database) and formatting it specifically for use by the Map
Visualizer. Data files have user-defined extensions (the sample files provided with
the Map Visualizer have a .data extension).

•

A gfx file consisting of a description of the shapes and locations of the 1-, 2-, or
3-dimensional objects to be displayed.
Gfx files must have a .gfx extension. MineSet includes various .gfx files, including
the United States to the granularity of counties, telephone area codes, and postal zip
codes, as well as Canada to the granularity of provinces. You can also manually
generate .gfx files (see Appendix C, “Creating Data, Configuration, Hierarchy, and
GFX Files for the Map Visualizer” for the required file format).

•

A hierarchy file consisting of a description of
–

the column names of the various graphical objects to be displayed

–

the filenames of the .gfx files that describe the locations and shapes of the
graphical objects

–

an optional description of the hierarchical relationship of the graphical objects,
which is used for the drill-down and drill-up functions.

Hierarchy files enable drill down and drill up. This means that information
associated with objects at one level can be aggregated (or, conversely, shown in
greater detail) and displayed at a different level. For example, a hierarchy file
defining the relationships between states and regions comprising multiple states
allows values such as population levels to be displayed at both the individual state
level as well as at regional levels. The gfx_files/usa.state.gfx file, for example,
describes the shapes of the 50 United States; the gfx_files/usa.state.hierarchy file
describes the hierarchy grouping individual states into regions, regions into
East-West areas, and the East-West areas into an aggregated United States.
For more information, see Appendix C, “Creating Data, Configuration, Hierarchy,
and GFX Files for the Map Visualizer.”
•

A configuration file describing the format of the input data and how these are to be
displayed. Typically, this file is created using the Tool Manager (see Chapter 3). You
also can use an editor (such as jot, vi, or Emacs) to produce this file without using
the Tool Manager (see Appendix C, “Creating Data, Configuration, Hierarchy, and
GFX Files for the Map Visualizer”).
Configuration files should have a .mapviz extension. If they do not, they are not
listed when selecting the Open option from the File pulldown menu. When starting
the Map Visualizer, or when opening a file, specify the configuration file, not the
data file.
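The drill-up behavior that a hierarchy file enables, described in the list above, can be pictured with a short Python sketch. The state-to-region grouping and the figures are invented for the example, and aggregating by sum is an assumption; the real grouping comes from the .hierarchy file.

```python
def drill_up(values, parent_of):
    """Aggregate child values (for example, states) into their parents."""
    totals = {}
    for child, value in values.items():
        parent = parent_of[child]
        totals[parent] = totals.get(parent, 0) + value
    return totals

population = {"CA": 30, "OR": 3, "NY": 18}               # made-up figures
region_of = {"CA": "West", "OR": "West", "NY": "East"}   # hypothetical grouping
print(drill_up(population, region_of))  # {'West': 33, 'East': 18}
```

Applying the same function again with a region-to-area mapping would give the next level up, in the way the usa.state.hierarchy example groups regions into East-West areas.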

Starting the Map Visualizer
There are five ways to start the Map Visualizer:
•

Use the Tool Manager to configure and start the Map Visualizer. See Chapter 3 first
for details on most of the Tool Manager’s functionality, which is common to all
MineSet tools; see below for details about using the Tool Manager in conjunction
with the Map Visualizer.

•

Double-click the Map Visualizer icon, which is in the MineSet page of the icon
catalog. The icon is labeled mapviz. Since no configuration file is specified, the
start-up screen requires you to select one by using File > Open.

Figure 6-4

Map Visualizer’s Startup Screen, With File Pulldown Menu Selected

Starting the Map Visualizer without specifying a configuration file causes the main
window to show the copyright notice for this tool. Only the File and Help pulldown
menus can be used. For the main window to be fully functional, open a
configuration file by selecting File > Open (Figure 6-4).
•

If you know what configuration file you want to use, double-click the icon for that
configuration file. This starts the Map Visualizer and automatically loads the
configuration file you specified. This only works if the configuration filename ends
in .mapviz (which is always the case for configuration files created for the Map
Visualizer using the Tool Manager).

•

Drag the configuration file icon onto the Map Visualizer icon. This starts the Map
Visualizer and automatically loads the configuration file you specified. This works
even if the configuration filename does not end in .mapviz.

•

Start the Map Visualizer from the UNIX shell command line by entering this
command at the prompt:
mapviz [ configFile ]

where configFile is optional and specifies the name of the configuration file to use. If
you don’t specify a configuration file, you must use File > Open to specify one (see
Figure 6-4).
Options for Invoking the Map Visualizer

There are two options that affect how this tool is invoked:
•

-warnexecute indicates that if you attempt to execute a command specified in an
execute statement, a warning is displayed and you are given the option to execute
the command or not. This is intended for an insecure environment, such as files
obtained from the Web, and is used automatically when commands are executed via
mtr files.
You can enable this option permanently by adding the line
*minesetWarnExecute:TRUE

to your .Xdefaults file, or by setting the environment variable
MINESET_WARN_EXECUTE.

•

-quiet eliminates the dialogs that pop up to indicate progress. You can enable this
option permanently by adding the line
*minesetQuiet:TRUE

to your .Xdefaults file.
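For scripted use, the command line described above can be assembled in Python. This is a sketch only: the helper name and the file name population.mapviz are hypothetical, and placing the options before the configuration file is an assumption, since the text does not specify an order.

```python
def mapviz_command(config_file=None, quiet=False, warn_execute=False):
    """Build the argument list for starting the Map Visualizer."""
    argv = ["mapviz"]
    if quiet:
        argv.append("-quiet")
    if warn_execute:
        argv.append("-warnexecute")
    if config_file is not None:
        argv.append(config_file)   # optional, as in: mapviz [ configFile ]
    return argv

print(mapviz_command("population.mapviz", quiet=True))
# ['mapviz', '-quiet', 'population.mapviz']
```

On a system where MineSet is installed, the resulting list could be passed to subprocess.run() to launch the tool.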

Configuring the Map Visualizer Using the Tool Manager
This section describes how the Map Visualizer can be configured using the Tool Manager.
Although the Tool Manager greatly simplifies the task of configuring the Map Visualizer,
you can construct a configuration file manually for this tool using a text editor (see
Appendix C, “Creating Data, Configuration, Hierarchy, and GFX Files for the Map
Visualizer”).
Note that the steps required to connect to a data source are described in Chapter 3.

Generating .gfx and .hierarchy Files
To use the Map Visualizer, you must provide the application with two files that define
the graphical objects to be displayed:
•

One or more .gfx files, which define the shapes of the graphical objects displayed.

•

A .hierarchy file, which describes the relationship of multiple, interrelated map (.gfx)
files.

These files are not created by the Tool Manager; they must already exist as part of
MineSet (residing in the /usr/lib/MineSet/mapviz/gfx_files directory), or they must be
created by the user. For instructions on their creation, see Appendix C, “Creating Data,
Configuration, Hierarchy, and GFX Files for the Map Visualizer.”
The .gfx and .hierarchy files that are part of the MineSet package include

•

the individual states of the United States

•

the areas covered by the individual counties of the United States

•

the areas covered by the individual five-digit ZIP codes of the United States

•

the areas covered by the telephone area codes of the United States

•

the individual provinces and territories of Canada

•

the individual states of Mexico

•

the individual states and territories of Australia

•

the individual countries of Western and Central Europe

•

regional subdivisions of both France and The Netherlands

The Map Visualizer requires a data file with
•

One column indicating geographical objects (for example, states). Each row in this
column must indicate a unique geographical object (staying with the example, this
means one row for each state).

•

At least one column with numeric values mapped (using arithmetic expressions) to
the heights and/or colors of each geographic bar. These columns can be scalar, a 1D
array, or a 2D array. If the column is an array, a slider must be used to select specific
data points for this mapping to heights and colors.

If both heights and colors are mapped to 1D or 2D arrays, the arrays must have the same
indexes (see Appendix C, “Creating Data, Configuration, Hierarchy, and GFX Files for
the Map Visualizer”).
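The two data file requirements above (one row per geographical object, and a numeric value column) can be checked before the file is handed to the tool. The Python sketch below assumes a plain tab-separated layout with the entity in the first field and a scalar value in the second; the function name and column positions are assumptions, and array columns or null markers would need the format details from Appendix C.

```python
import csv
import io

def check_entities(text, entity_col=0, value_col=1):
    """Check that each entity appears once and that its value is numeric.

    Returns a list of problem descriptions (empty if the data looks fine).
    """
    problems, seen = [], set()
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    for n, row in enumerate(reader, 1):
        entity = row[entity_col]
        if entity in seen:
            problems.append(f"line {n}: duplicate entity {entity}")
        seen.add(entity)
        try:
            float(row[value_col])
        except ValueError:
            problems.append(f"line {n}: non-numeric value {row[value_col]}")
    return problems

sample = "CA\t30\nNY\t18\nCA\t5\n"   # made-up rows
print(check_entities(sample))  # ['line 3: duplicate entity CA']
```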

Selecting the Map Visualizer Tool
Select the Viz Tools tab in the Data Destination panel of the Tool Manager’s main screen
(Figure 6-5). From the popup list of tools, select Map Visualizer. The window on the right
side of this panel displays the mapping requirements for the Map Visualizer. Items in the
Visual Elements list that are preceded by an asterisk are optional.

Figure 6-5

Data Destination Panel, With Map Visualizer Selected

•

Entity - Bars lets you specify which column contains the keywords of the graphical
objects.

•

Height - Bars lets you specify the heights of the geographic bars on the map.

•

*Color - Bars lets you assign the colors of the geographic bars. See “Choosing
Colors” and “Using the Color Browser” in Chapter 3 for a more detailed
explanation of how to choose and change colors.

•

*Slider1 and *Slider2 let you map columns directly to one or two animation Sliders
(see “Slider Creation for Mapviz,” below).

Mapping Columns to Visual Elements
Map a column in the Current Columns window to the Visual Element Height - Bars by
clicking the column first, then Height - Bars. Optionally, map another column (or even
the same column) to the Visual Element *Color - Bars. You must also map a string column
to the Visual Element Entity - Bars.

Undoing Mappings
To undo a mapping, select the mapping in the Requirements: window, then click the
Clear Selected button. To undo all mappings, click the Clear All button.

Slider Creation for Mapviz
Sliders can be created manually or automatically. The following subsections describe
these methods.
Manual Slider Creation

Tool Manager generates sliders whenever there is an array column present in the current
table. The sliders correspond to the indices of the array columns. If the column has one
index (one-dimensional array), only one slider is created, but if the column has two
indices (two-dimensional array), both an X and a Y slider are created. The current slider
indices are indicated in the Tool Options dialog box from the Tool Manager.
Note that for a slider to be created, all array columns in the current table must have the
same indices. If array columns with differing indices exist in the current table, no slider
is created.
See “Aggregation” in Chapter 3 for more information on creating arrayed columns.
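The rule that all array columns must share the same indices can be expressed as a small check. The Python sketch below is an illustration only; the function name and the way indices are represented (one list per array dimension) are assumptions.

```python
def slider_indices(array_columns):
    """Return the shared index lists, or None if the columns disagree.

    array_columns maps a column name to a tuple of index lists: one
    list for a 1D array, two lists for a 2D array.
    """
    distinct = {tuple(map(tuple, idx)) for idx in array_columns.values()}
    if len(distinct) != 1:
        return None              # differing indices: no slider is created
    return [list(axis) for axis in distinct.pop()]

years = ["1990", "1991"]
print(slider_indices({"sales": (years,), "profit": (years,)}))
# [['1990', '1991']]
print(slider_indices({"sales": (years,), "profit": (["1990"],)}))
# None
```

The length of the returned list corresponds to the number of sliders: one for a one-dimensional array, two for a two-dimensional array.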
Automatic Slider Creation

If no arrayed columns are in the current table, Tool Manager can automatically generate
sliders by use of the Slider1 and Slider2 mappings. Sliders are created through a
combination of automatic binning and aggregation. These automatic operations occur
after clicking Invoke Tool. The operations do not affect the current history operations of
Tool Manager, but they do appear in the configuration files for the tool.
Columns mapped to Slider1 and Slider2 eventually form the indices for the sliders. These
columns must be either numeric (int, float, double) or binned. If a column mapped to a
slider is already binned, no automatic binning is needed for this column, and this column
is used as an index for a slider. However, if the column is not binned, a binned column is
created using the automatic binning options in the Tool Options dialog box.

The three methods of binning are:
•

Selecting All Distinct Values creates a bin for every unique value of the column.

•

Specifying a number of bins creates that many bins. The thresholds for the bins are
determined using the Uniform Range approach.

•

Selecting Automatic automatically determines the number of bins to create and
determines the bin thresholds using the Uniform Range approach.

(See “The Bin Columns Button” in Chapter 3 for more information about binning.) The
column used in forming the automatic bins is deleted from the current table.
The binned columns now form the indices of array columns. Note that if you want to
create only one slider, the index must be mapped to Slider1. Attempting to create only
one slider with a mapping to Slider2 is not allowed and generates a Tool Manager error.
Also, a column mapped to a slider cannot be mapped to any other mapping, since it is
removed during the aggregation process.
Once the slider indices are formed, the arrayed columns are created. This is done using
automatic aggregation. Any numeric columns mapped to Height or Color are
aggregated using the automatic aggregation options in the Tool Options dialog box, where
you can aggregate by either Sum or Average. The binned columns created from
the slider mappings form the indices for the aggregation. The column mapped to Entity
is the only Group-By column. Any remaining columns in the table are removed. (See
“Aggregation” in Chapter 3 for a description of the aggregation process.)
The aggregation step automatically forms the arrayed columns used for sliders. These
arrayed columns form the new tool mappings. For example, if the column “mpg” is
mapped to Height, a new column “avg_mpg[]” is formed and remapped to Height. The
progress of the automatic slider generation is displayed in the Tool Manager status
window.
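As a rough illustration of the aggregation step, the Python sketch below groups rows by the Entity column and builds one array per entity, indexed by the slider bin. The row layout, the column names ("state", "year_bin"), and the use of None for empty bins are assumptions made for this example only; they do not describe how the Tool Manager stores its data.

```python
from collections import defaultdict


def aggregate_by_entity(rows, entity_col, bin_col, value_col, num_bins, how="Average"):
    """Group rows by entity and return {entity: array indexed by slider bin}."""
    sums = defaultdict(lambda: [0.0] * num_bins)
    counts = defaultdict(lambda: [0] * num_bins)
    for row in rows:
        entity, index = row[entity_col], row[bin_col]
        sums[entity][index] += row[value_col]
        counts[entity][index] += 1
    if how == "Sum":
        return {entity: list(total) for entity, total in sums.items()}
    # Average: bins with no rows are reported as None (an assumption).
    return {
        entity: [s / c if c else None for s, c in zip(sums[entity], counts[entity])]
        for entity in sums
    }


# Hypothetical table: "state" is mapped to Entity, "year_bin" is the binned
# slider column, and "mpg" is mapped to Height.
rows = [
    {"state": "CA", "year_bin": 0, "mpg": 20.0},
    {"state": "CA", "year_bin": 0, "mpg": 30.0},
    {"state": "CA", "year_bin": 1, "mpg": 40.0},
    {"state": "NY", "year_bin": 1, "mpg": 10.0},
]
avg_mpg = aggregate_by_entity(rows, "state", "year_bin", "mpg", num_bins=2)
sum_mpg = aggregate_by_entity(rows, "state", "year_bin", "mpg", num_bins=2, how="Sum")
print(avg_mpg)  # plays the role of the avg_mpg[] array column
```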

Specifying Tool Options
Clicking the Tool Options button causes a new dialog box to be displayed (Figure 6-6).
This lets you change some of the Map Visualizer options from their default values.

Figure 6-6

Map Visualizer’s Options Dialog Box

The following sections describe the buttons and fields of the Map Visualizer’s Options
dialog box.

Geography

The Entities File specifies a .hierarchy file to be used for the representation of the
geographical "entity" objects in the Map Visualizer's main window.
The Outlines File specifies outline objects to draw, which appear as a flat plane on which
the 3-D entity objects are placed.
The Find File button lets you browse your files to find the .hierarchy file to be used.
Note that the Entities File and Outlines File fields are optional. If the Entities File is not
supplied, then the Map Visualizer creates graphical entity objects consisting of simple
rectangles that are arbitrarily sized and placed in the scene.
Height

This section specifies an initial height Scale value (default is 1.0) and whether to display
a height legend at the bottom of the Map Visualizer window.
Color

To use these Color options, you must have mapped a column to the *Color - Bars
requirement of the Data Destination panel. See “Choosing Colors” and “Using the Color
Browser” in Chapter 3 for a more detailed explanation of how to choose and change
colors.
Color List—You can specify the color list using the + button next to the color list label. This
brings up a color editor that lets you specify a color to be added to the list.
Mapping—You can specify whether the color change that is shown in the graphic display
is Continuous or Discrete. If you choose Continuous, the color values shift gradually
between the colors entered in the “Color List” field as a function of the values that are
mapped to those colors in the “Mapping” field.
The field to the right of the popup button lets you enter specific values to which the colors
are mapped. You must have the same number of values in this field as there are colors
entered in the “Color List” field.

Example 6-1

If you
•

used the Color Browser to choose gray and red

•

selected Discrete for the Mapping

•

entered the values 0 150000

then the display shows the population of the United States across the time period
1770-1990. States with more than 150,000 square miles are shown in red; the rest are
shown in gray.
Example 6-2

If you
•

used the Color Browser to choose gray and red

•

selected Continuous for the Mapping

•

entered the values 0 300000

then the display shows the population of the United States across the same time period.
The states’ colors vary from gray to red, depending on their size; the largest states are
shown with the greatest density of red.
You can enter as many colors into this field as necessary for your display. If the number
of values in the column that maps to *Color - Bars exceeds the number of distinct colors
you have chosen, the Map Visualizer adds an appropriate number of randomly chosen
colors at runtime.
Legend On—lets you determine whether a color legend is displayed or hidden.
Normalize On—lets you determine whether the Map Visualizer automatically scales the
colors between the color column's minimum and maximum values (this is called color
normalization), as opposed to you manually specifying threshold values. When
Normalize On is enabled, the threshold values must lie within the range 0 to 100,
representing a percentage of the color column's minimum to maximum numeric range.
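The Discrete and Continuous mappings and the Normalize On percentages can be sketched in Python as follows. This is only an illustration under stated assumptions: the RGB triples chosen for gray and red, the linear blend used for Continuous, and the rule that a value takes the color of the highest threshold it exceeds (consistent with Example 6-1) are all assumptions, and the function names are hypothetical.

```python
import bisect

# Assumed RGB triples (components from 0.0 to 1.0) for the two colors.
GRAY = (0.5, 0.5, 0.5)
RED = (1.0, 0.0, 0.0)


def discrete_color(value, thresholds, colors):
    """Return the color of the highest threshold that value strictly exceeds."""
    index = bisect.bisect_left(thresholds, value) - 1
    return colors[max(index, 0)]


def continuous_color(value, thresholds, colors):
    """Blend linearly between the colors of the two surrounding thresholds."""
    if value <= thresholds[0]:
        return colors[0]
    if value >= thresholds[-1]:
        return colors[-1]
    i = bisect.bisect_right(thresholds, value) - 1
    t = (value - thresholds[i]) / (thresholds[i + 1] - thresholds[i])
    return tuple(a + t * (b - a) for a, b in zip(colors[i], colors[i + 1]))


def percent_to_value(percent, column_min, column_max):
    """Convert a Normalize On threshold (0 to 100) into a column value."""
    return column_min + (percent / 100.0) * (column_max - column_min)


# Example 6-1: Discrete, values 0 150000.
print(discrete_color(200000, [0, 150000], [GRAY, RED]))    # red
# Example 6-2: Continuous, values 0 300000.
print(continuous_color(150000, [0, 300000], [GRAY, RED]))  # halfway blend
# Normalize On: a 50 percent threshold for a column ranging from 200 to 1000.
print(percent_to_value(50, 200, 1000))
```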

Sliders

You can manually select a binned column to be associated with the slider(s), where the
binned column indexes an aggregated array that is mapped to height or color.
Alternatively, you can have the Tool Manager automatically perform the binning and
aggregations. For more details on the Slider options, see “Slider Creation for Mapviz” on
page 183.
Message Field

This lets you specify the message displayed when an entity is selected. For a listing and
description of format types that can be entered in this field, see the “Message Statement”
section in Appendix C, “Creating Data, Configuration, Hierarchy, and GFX Files for the
Map Visualizer.”
Title Field

This lets you specify a string that appears at the bottom of the Map Visualizer main
window. This string must be enclosed in double-quotes.
Execute Field

This option lets you type in a UNIX command that is executed when you double-click
an entity. The format is similar to the message statement. If no execute statement appears,
double-clicking has no effect.
For a detailed description of the Execute field, see “Execute Statement” in Appendix C.
Resetting the Tool Options

If, after making changes to the Tool Options dialog box, you want to reset the values of
all options to their default values, click the Reset Options button.
Accepting the Tool Options

Once you have finished making changes to the Tool Options dialog box, click OK to
return to the Tool Manager’s main screen.

Saving Map Visualizer Settings
The Tool Manager stores information for the Map Visualizer in several files, all sharing
the same prefix:
•

.mapviz.data contains data.

•

.mapviz.schema describes the data file.

•

.mapviz contains information needed by the Map Visualizer.

•

.mineset contains all the information needed to create the other files.

To specify a prefix, use the Save ... menu option in the File menu of the Tool Manager’s
main window. If you do not specify a prefix, it is based on the data source.
When you use the Invoke Tool button, the .data, .schema, and .mapviz files are updated, if
necessary.
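The relationship between the prefix and the four files can be sketched in Python as follows. The sketch assumes that each suffix listed above is appended directly to the prefix; the function name and the sample prefix are hypothetical.

```python
def mapviz_file_names(prefix):
    """Return the file names that share the given prefix."""
    suffixes = (".mapviz.data", ".mapviz.schema", ".mapviz", ".mineset")
    return [prefix + suffix for suffix in suffixes]


for name in mapviz_file_names("population.usa"):
    print(name)
```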

Invoking the Map Visualizer
To see the Map Visualizer graphically represent your data, click the Invoke Tool button at
the bottom of the Data Destination panel.

Working in the Map Visualizer’s Main Window
If you started the Map Visualizer without specifying a configuration file, the main
window shows the copyright notice for the Map Visualizer. Only the File and Help
pulldown menus can be used. For the main window to show all menus and controls,
open a configuration file. Use File > Open (Figure 6-4) to see a list of configuration files.
When a valid configuration file has been specified, its geographical landscape is visible.
For example, Figure 6-7 shows the results of specifying population.usa.mapviz and moving
the Year slider to the far right.

Figure 6-7

Population.usa.mapviz Example With the Slider Moved to 1990

This shows the population and population density for each state of the United States. The
population of each state is represented by the height of the state’s graphical shape.
Heights are relative to each other across the entire range of the animation controls.

Viewing Modes
The two modes of viewing are grasp and select. To toggle between these modes, move the
cursor into the main window, and press the Esc key. You can also change from one mode
to the other by clicking the appropriate button: to enter select mode, left-click the arrow
button (to the top-right of the main window); to enter grasp mode, left-click the hand
button (immediately below the arrow button, near the top right of the main window).
Grasp Mode

In grasp mode, the cursor appears as a hand. This mode supports panning, rotating, and
scaling the scene’s size in the main window.
•

To pan the display, press the middle mouse button and drag it in the direction you
want the display panned.

•

To rotate the display, press the left mouse button and move the mouse in the
direction you want to rotate.

•

To move the viewpoint forward, press the left and middle mouse buttons
simultaneously and move the mouse downwards. To move the viewpoint
backward, press the left and middle mouse buttons simultaneously and move the
mouse upwards. This is equivalent to the functions provided by the Dolly
thumbwheel.

Select Mode

In select mode, you can highlight an object by positioning the cursor over that object.
Information about that object then appears at the top of the view area. This information
remains visible in the window only as long as the pointer cursor remains over the object.
If you position the pointer cursor over an object and click the left mouse button, the same
information appears in the Selection Window, which is above the main window, under
the “Selection” label (Figure 6-8).

Figure 6-8

Highlighted Information in the Viewing Window and Selected Information

This Selection information remains visible until you select another object or click the
background. Using the mouse, you can cut and paste this text into other applications,
such as reports or databases.

Drilling Down and Drilling Up

To view a finer level of geographical granularity for an object (if the .data and .hierarchy
files support it), click the right mouse button while the cursor is over that object. This is
called “drilling down.” You can repeat this down to the finest level of granularity
supported by the data. If the cursor is positioned over a specific object when drilling
down, only the more detailed sub-objects of that object appear. If, instead, the cursor is
positioned on the background at the time of the mouse click, then the more detailed
sub-objects of the entire set of objects appear. This might produce a display with a large
number of individual objects. The greater the number of objects, the longer the Map
Visualizer takes to construct the scene, and the slower the performance when moving the
animation controls.
To move up one level and view a coarser geographical granularity (“drill up”), click the
middle mouse button. If the cursor is positioned on the background when you click, all
the higher-level objects appear. If the cursor is positioned on a specific object in the scene,
then the scene “returns” to the group of higher-level objects visible when you last drilled
down with the right mouse button.
If an execute statement was specified via Tool Manager or the configuration file, then
double-clicking an object executes the appropriate command. If the -warnexecute
option was specified when invoking the Map Visualizer, a warning is given first.
Note: By default, the Map Visualizer initially displays objects at the lowest level of the
hierarchy (the finest granularity); thus, initially, only drill-up (to coarser granularity) is
active.

External Main Window Controls
Several external controls surround the graphics window. These consist of buttons,
sliders, and a summary window. Each of these controls is described in this section.

Buttons
At the top right of the image area are 11 buttons (see Figure 6-9).

Arrow
Hand
Viewer help
Home
Set Home
View All
Seek
Perspective
Top View
Front View
Right View

Figure 6-9

Detail View of Top Right Buttons

•

Arrow puts you in select mode, which lets you highlight entities in the main
window. When in this mode, the cursor shape is an arrow.

•

Hand puts you in grasp mode, which lets you rotate, zoom, and pan the display in
the main window. When in this mode, the cursor shape is a hand.

•

Viewer help brings up a help window describing the viewer itself.

•

Home takes you to a designated location. Initially, this location is the first viewpoint
shown after invoking the Map Visualizer and specifying a configuration file. If you
have been working with the Map Visualizer and have clicked the Set Home button,
then clicking Home returns you to the viewpoint that was current when you last
clicked Set Home.

•

Set Home makes your current location the Home location. Clicking the Home button
returns you to the last location where you clicked Set Home.

•

View All lets you view the entire graphic display, without changing the angle of
view you had before clicking on this option. To get an overhead view of the scene,
rotate the camera so that you are looking directly down on the entities, then click
the View All button.

•

Seek takes you to the point or object you click after selecting this button.

•

Perspective is a toggle button that lets you view the scene in 3D perspective (closer
objects appear larger, farther objects appear smaller). Clicking this button again turns
3D perspective off. If Perspective is off, the Dolly thumbwheel becomes the Zoom
thumbwheel.

•

Top View lets you view the scene from the top.

•

Front View lets you view the scene from the front.

•

Right View lets you view the scene from the right side.

Height-Adjust Slider and Label
To the left of the Map Visualizer’s main window is a vertical height-adjust slider and,
below it, a label containing a numeric value between 0.1 and 100. This slider lets you
change the absolute heights of all the graphical objects in the main window. Moving the
slider up increases the heights of the objects; moving it down decreases their heights. The
numeric value in the label changes accordingly. This value indicates the height
multiplier, the default value of which is 1.0. The height-adjust slider is useful for
accentuating relative height differences between objects in the view window.

Thumbwheels
Three thumbwheels appear around the lower part of the main window border (see
Figure 6-10). They let you dynamically move the viewpoint.

Figure 6-10

Lower Half of Window With Thumbwheels

•

The vertical thumbwheel Rotx (rotate about the x axis), on the left, rotates the
display up and down.

•

The horizontal thumbwheel Roty (rotate about the y axis), at the bottom left, rotates
the scene in the main window around its centerpoint left and right.

•

The vertical Dolly thumbwheel, on the right, moves the viewpoint forward and
backward. Note that as you use the Dolly thumbwheel to magnify the scene in the
main window, additional detail can appear. This is not the case with the Zoom
thumbwheel, which merely enlarges the scene without adding detail.

The Animation Control Panel
To the right of the Map Visualizer’s main window are several external controls,
depending on the type of data being displayed (see Figure 6-11). These controls can
include

•

sliders for independent dimensions

•

a summary window containing a color density profile

•

a color legend showing the color density value limits

•

buttons and sliders for animation

Figure 6-11

Map Visualizer’s Summary Window With Slider and Animation Controls

Sliders Controlling Independent Dimensions
The number of sliders appearing adjacent to the summary window is dependent on the
dataset displayed in the Map Visualizer’s main window. Datasets can have two, one, or
no independent dimensions.
Datasets With Two Independent Dimensions

If the dataset has two dimensions of independently varying data (such as
nl.births.mapviz), the animation control panel to the right of the main graphics window
becomes visible (as in Figure 6-11).

Within this animation control panel are the 2D summary window and two sliders. The
summary window has a horizontal slider below it for selecting data points of the first
independent dimension, and a vertical slider to the left for selecting data points of the
second independent dimension. The horizontal slider’s dimension is identified by a label
below it. The vertical slider’s dimension is identified by a label above it.
Datasets With One Independent Dimension

For datasets with one independent dimension (such as population.usa.mapviz), only the
slider below the summary window appears, and the summary window is compressed
(see Figure 6-12). This slider’s dimension is identified by a label below it.

Figure 6-12

Map Visualizer’s Summary Window With One Slider and Animation Controls

Datasets With No Independent Dimension

For datasets with no independent dimensions (such as population.europe.mapviz), no
animation control panel appears (see Figure 6-13).

Figure 6-13

If There Are No Independent Dimensions, No Animation Control Panel Appears

The Summary Window
The summary window provides a 2D representation of the aggregation of values that the
main window displays in 3D. Above this window is a label, Sum Heights, followed by
two rectangles: the first white, the second red. Within the rectangles are numbers; each is
the respective value for the maximum density of that color. This summary color legend
provides a visual and numeric comparison to the densities in the summary window.

The whiter the areas of the summary window, the lower the total values represented by
the heights of the objects in the main window. The greater the density of red shown in
areas of the summary window, the higher the total of those values. The density of these
colors in the summary window provides a summary of the data across the one or two
independent dimensions in the dataset, which is useful for guiding your exploration
through the data.
By default, the summary window also contains a set of black dots, evenly spaced across
the one or two dimensions of data. These dots indicate the precise positions of the
discrete datapoints of the data. You can turn off the dots using the View > Show Data
Points menu option.
Color Density Examples in the Summary Window

After opening the population.usa.mapviz file, for example, the 2D summary window
shows a color range from white (on the left) to red (on the right). White corresponds to
the low aggregate population in the early years of the United States; red represents the
higher aggregate population in later years. In this example, the greater the density of red,
the higher the total population of the United States.
For a more complex example, open perhouse.perage.mapviz. This dataset has two
independent dimensions: time and age. The summary window displays these
dimensions as a complex pattern of colors. Place the cursor on the horizontal line with
the greatest density of red, which runs across the summary window (this line
corresponds to the age group making the greatest number of purchases). Click the left mouse
button. The information displayed in the field below the horizontal slider shows that this
represents purchases made by 30- to 39-year-olds.
Now place the cursor at the junction of the densest red horizontal (age group) and
vertical (time frame) parts of the summary window, and click the left mouse button. The
information displayed in the field below the horizontal slider shows that most purchases
were made by 30- to 39-year-olds in May-June 1989 and May-June 1990.

Creating a Path in the Summary Window

If the dataset loaded into the Map Visualizer has at least one independent dimension, it
is possible to view all or any part of that dataset via animation. This is done by first
creating a path in the summary window, then activating the animation controls
described in the next section.
The three ways to draw a path in the summary window are as follows:
•

Define a starting point by clicking and holding down the left mouse button, then
draw a path by dragging the cursor over the window. The actual path passes
through intermediate discrete points closest to the path of the mouse. End the path
by releasing the left mouse button.

•

Define a starting point by clicking the left mouse button, then define an endpoint by
moving the cursor to another part of the window and clicking the middle mouse
button. A path appears between those two endpoints, passing through the
intermediate discrete data point(s) that are closest to the hypothetical straight line
between the endpoints. To add more line segments, continue with repeated middle
mouse clicks.

•

Define a starting point by clicking the left mouse button, then drag one of the
independent dimension sliders to draw a straight line along this dimension. If there
are two sliders, then using the second slider will continue to draw a straight line
along the axis controlled by this second slider.

The path you draw can only go through the well-defined discrete data points, identified
by the black dots in the summary window.
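The way a drawn path is restricted to discrete data points can be illustrated with the Python sketch below. It assumes that the data points lie on an integer grid and that each straight segment is approximated by rounding evenly spaced samples to the nearest grid point; the Map Visualizer's actual snapping rule is not documented here, and the function names are hypothetical.

```python
def snap_segment(start, end):
    """Return the grid points closest to the straight line from start to end."""
    (x0, y0), (x1, y1) = start, end
    steps = max(abs(x1 - x0), abs(y1 - y0))
    if steps == 0:
        return [start]
    path = []
    for i in range(steps + 1):
        t = i / steps
        point = (round(x0 + t * (x1 - x0)), round(y0 + t * (y1 - y0)))
        if not path or path[-1] != point:
            path.append(point)
    return path


def build_path(clicks):
    """Join the segments between successive clicks into one path."""
    path = [clicks[0]]
    for start, end in zip(clicks, clicks[1:]):
        # Skip the first point of each segment: it is already on the path.
        path.extend(snap_segment(start, end)[1:])
    return path


# A left click at (0, 0) followed by middle clicks at (3, 1) and (3, 3).
print(build_path([(0, 0), (3, 1), (3, 3)]))
```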

Animation Buttons and Sliders
Use the seven VCR-like buttons and two sliders (Path and Speed) below the 2D summary
window to control animation.

Animation Buttons

Once a path is drawn in the summary window (see “Creating a Path in the Summary
Window,” above), you can use the VCR-like buttons to control animation along this path.
The middle Stop button is highlighted in blue to indicate an initial state. Use the adjacent
Play Forward button (to the right of Stop) or Play Reverse (to the left) to begin simple
movement along the drawn path in a forward or reverse direction. Forward and Reverse
are defined by the sequence in which the path was drawn, not by a sense of left-to-right
or right-to-left movement.
To stop and restart the animation, click the Stop button, then use the Play Forward or
Reverse button. When you use the Stop button, the animation continues in the current
direction until the position falls on a discrete data point.
Adjacent to the Play buttons are the Single-Step buttons, also Forward and Reverse.
Clicking one of these buttons causes the current path position to change to the next
discrete data point.
On the outside are the Fast Forward and Fast Reverse buttons. Clicking one of these Fast
buttons while in Stop state changes the path position to the end (for Forward) or to the
beginning (for Reverse) of the path. Clicking a Fast button when in Play state increases the
animation speed.
Animation Flow

Below the Animation Buttons are the three Animation Flow buttons.
Play-once (default)—the animation moves either forward or in reverse until it reaches the
end of the path, then stops.
Loop—when the animation reaches the end of the path, it automatically resets to the
beginning and starts over again.
Swing—when the animation reaches the end of the path, it reverses direction and retraces
its path to the other end; upon reaching that end, the animation reverses direction again,
beginning the cycle again.
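The three flow modes can be pictured as different orderings of the positions along the path. The Python sketch below is illustrative only: it numbers the discrete data points on the path from 0, plays forward from the start, and uses a hypothetical generator name.

```python
from itertools import islice


def animation_positions(path_length, flow="Play-once"):
    """Yield the path indices visited when playing forward from the start."""
    if path_length < 2 or flow == "Play-once":
        yield from range(path_length)
    elif flow == "Loop":
        while True:
            yield from range(path_length)
    elif flow == "Swing":
        while True:
            yield from range(path_length - 1)         # forward, up to the last point
            yield from range(path_length - 1, 0, -1)  # back, down to the first point
    else:
        raise ValueError("unknown flow mode: " + flow)


print(list(animation_positions(4)))                     # stops at the end
print(list(islice(animation_positions(4, "Loop"), 6)))  # restarts from the beginning
print(list(islice(animation_positions(4, "Swing"), 8))) # reverses at each end
```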

Animation Sliders

While animation is stopped, you can move the Path slider to reset the position along the
path. Note that when you use the Path slider, the cursor in the summary window moves
across the drawn path, and the 1D sliders (below and to the left of the drawing area)
move consistently with the cursor position. Then use the Play or Reverse button to restart
the animation from the newly specified point.
You can drag the Path slider to an arbitrary position on the path between discrete data
points; however, when you release the slider, the path position changes to a stop at the
nearest discrete data point.
Use the Speed slider to adjust the speed of the animation along the path.
Data Points and Interpolation

As animation proceeds, the variables mapped to height and color in the Map Visualizer
also change. However, the variables displayed in the Selection: message box show only
the data values of the nearest discrete data position, not intermediate (interpolated) data
values.
The animation is produced in the following manner: Assume you have data for 10 years,
on a per-year basis (that is, 10 data values) and that these correspond to the height of one
state in the Map Visualizer. The years are 1991 to 2000, the height for 1991 is 20, and the
height for 1992 is 40. As you move the year slider from 1991 to 1992, the height changes
by being uniformly interpolated between 20 and 40. For example, midway between 1991
and 1992, the height appears to be 30. As you approach 1992, the height approaches 40.
However, you cannot stop an animation between discrete data points, and you cannot
drag the Path slider to a stationary position between discrete data points.
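The interpolation described above can be written as a short Python sketch. Only the two data values given in the example (20 for 1991 and 40 for 1992) are used; the function names are hypothetical, and the nearest-point rule for the Selection message follows the description at the start of this section.

```python
import bisect

YEARS = [1991, 1992]
HEIGHTS = [20, 40]


def interpolated_height(position, years=YEARS, heights=HEIGHTS):
    """Height shown during animation at a (possibly fractional) slider position."""
    if position <= years[0]:
        return heights[0]
    if position >= years[-1]:
        return heights[-1]
    i = bisect.bisect_right(years, position) - 1
    t = (position - years[i]) / (years[i + 1] - years[i])
    return heights[i] + t * (heights[i + 1] - heights[i])


def selection_height(position, years=YEARS, heights=HEIGHTS):
    """Value reported in the Selection message: the nearest discrete data point."""
    nearest = min(range(len(years)), key=lambda k: abs(years[k] - position))
    return heights[nearest]


print(interpolated_height(1991.5))  # midway between 20 and 40
print(selection_height(1991.75))    # nearest data point is 1992
```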
The data points in the summary window represent the slider positions corresponding to
the actual data from the data file. For example, the heights 20 and 40 are representations
of actual data, but the height 30 is not. In this example, there would be data points in the
summary window at the slider positions corresponding to each year.
Note that not all variables are required to vary with a slider. For example, in the Map
Visualizer, the area and name of the state do not vary with the slider (for example, year).
If there are two sliders, some variables can vary with only one of the sliders, while other
variables vary with both.

Pulldown Menus
Five pulldown menus let you access additional Map Visualizer functions. These are
labeled File, View, Selections, InterTool, and Help. If you start the Map Visualizer
without specifying a configuration file, only the File and the Help menus are available.
The View menu is available after a valid dataset is loaded.

The File Menu
The File menu is the same for all visualization tools; see “The File Menu” in Chapter 5.

The View Menu
The View menu (Figure 6-14) contains six options, which are described below.

Figure 6-14

Map Visualizer’s View Pulldown Menu

Filter Panel brings up a filter panel (Figure 6-15), which lets you reduce the number of
entities displayed in the main viewing area, based on one or more criteria. You can use
the filter panel to fine-tune the display, emphasize specific information, or simply shrink
the amount of information displayed. Scale to Filter lets you specify whether the heights
of the graphical objects are scaled across the entire dataset or just across the filtered data.

Figure 6-15

Map Visualizer Filter Panel

The filter panel has two panes. The top pane lets you filter based on string variables. To
select all values of a variable, click Set All. To clear the current selections, click Clear. To
select a value, click it. To deselect a value, simply click it again.
The bottom pane lets you filter based on the values of both string and numeric variables.
Only variables whose values do not change as you navigate the slider can be used in
filtering.

To filter numeric values, enter the value, and select a relational operation (=, !=, >, <, >=,
<=). To filter alphanumeric values, enter the string. You can use any of three types of
string comparisons:
•

Contains indicates that it contains the appropriate string. For example, California
contains the strings Cal and forn.

•

Equals requires the strings to match exactly.

•

Matches allows wildcards:
•

An asterisk (*) represents any number of characters.

•

A question mark (?) represents one character.

•

Square brackets ([ ]) enclose a list of characters to match.

For example, California matches Cal*, Cal?fornia, and Cal[a-z]fornia.
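The three string comparisons behave much like the following Python sketch, which uses the standard fnmatch module for the wildcard case. This is an approximation for illustration: it assumes case-sensitive comparison, and fnmatch's pattern rules may differ in detail from those of the filter panel. The function name is hypothetical.

```python
from fnmatch import fnmatchcase


def string_filter(value, mode, pattern):
    """Apply a Contains, Equals, or Matches comparison to a string value."""
    if mode == "Contains":
        return pattern in value
    if mode == "Equals":
        return value == pattern
    if mode == "Matches":
        # Supports *, ?, and [ ] character lists.
        return fnmatchcase(value, pattern)
    raise ValueError("unknown comparison: " + mode)


for mode, pattern in [
    ("Contains", "forn"),
    ("Equals", "California"),
    ("Matches", "Cal*"),
    ("Matches", "Cal?fornia"),
    ("Matches", "Cal[a-z]fornia"),
]:
    print(mode, pattern, string_filter("California", mode, pattern))
```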
In some cases (usually associated with binning in the Tool Manager), an option menu of
values appears, instead of a text field. To ignore that variable, select Ignored in the Option
menu. You can use relational operators (such as >=) with these options. This means that
the specified value as well as subsequent ones are selected.
In addition to numeric and string comparison operations, you can specify Is Null,
which is true if the value is null.
To the right of each field is an additional option menu that lets you specify “And” or “Or”
options. For example, you could specify “sales > 20 And < 40.” You can have any number
of And or Or clauses for a given variable, but cannot mix And and Or in a single variable.
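The effect of combining several clauses for one variable can be sketched in Python as follows. The clause representation and the function name are hypothetical; the sketch covers only the numeric relational operators, not string comparisons or Is Null.

```python
import operator

OPERATORS = {
    "=": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}


def passes(value, clauses, combine="And"):
    """Test value against (operator, operand) clauses joined by And or by Or."""
    results = [OPERATORS[op](value, operand) for op, operand in clauses]
    return all(results) if combine == "And" else any(results)


# "sales > 20 And < 40"
print(passes(30, [(">", 20), ("<", 40)]))
print(passes(45, [(">", 20), ("<", 40)]))
# "sales < 20 Or > 40"
print(passes(45, [("<", 20), (">", 40)], combine="Or"))
```

Because a single variable uses either And or Or throughout, one combine setting per variable is enough in this sketch.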
Click the Apply button to start filtering. If you press Enter while the panel is active,
filtering starts automatically.
Click the Close button to close the panel.

•

Show Window Decoration causes the buttons around the main window to be
displayed. Default for this option is on. Toggle this option to make the window
decoration disappear.

•

Show Animation Panel causes the animation control panel to be displayed to the right
of the main view. Click this option again to deselect it. When this option is
deselected, the animation panel is not displayed. Not displaying the animation
panel can be useful when you have applied the InterTool menu’s Synchronize All
Mapviz Sliders option (described in “The InterTool Menu” on page 209) and need
only a single animation control panel on the screen.

•

Show Data Points causes a grid of black dots to appear (or disappear) in the 2D
summary window. Each dot denotes the precise position of a discrete data value in
the input dataset. For example, if the input dataset has 10 data values across one
independent dimension, then you see heights and colors of the graphical objects in
the main window vary continuously, based on data values that are interpolations
between these discrete data points. These data point dots in the summary window
help you better understand when the heights and colors are derived directly from
the input data values, and when they are derived indirectly from interpolated
values.

•

Use Random Colors causes the configuration file’s color mapping specifications (for
example, white-to-red shadings representing population density) to be ignored.
Random, constant colors are assigned to the graphical objects. Click this option
again to deselect it.

•

Display X-Y Coordinates puts the Map Visualizer into a special mode that lets you
identify X-Y vertex pairs at specific points of the scene in the main window. In this
mode, the Map Visualizer resets the cursor to select mode and displays 3D objects
as flat background lines. Clicking the left mouse button on various parts of the
displayed scene causes the corresponding X-Y vertex pair values to appear in the
Selection Details window. You can also enter the vertex pair points into the .gfx file
to identify point objects or the endpoints of line objects for subsequent display. Note
that displaying X-Y coordinates is used for developing and refining .gfx files, not for
data analysis.
When Display X-Y Coordinates mode is initially enabled, or when a point in the
background is selected, the selection window shows the minimum and maximum
X-Y pairs of the currently displayed image in the main window. Add these two
value pairs to the new .gfx file you are generating. The first record in the file
gfx_files/usa.cities.gfx shows an example of how the min-max pairs of the usa.states.gfx
file were entered into the associated usa.cities.gfx file. This ensures that the X-Y
coordinate pairs in usa.cities.gfx share the same coordinate system as the X-Y
coordinate pairs in usa.states.gfx.

The Selections Menu
The Selections menu lets you drill through to the underlying data. The menu has six items.

Figure 6-16

Map Visualizer Selections Menu

•

Select All performs the equivalent of selecting (with the mouse pointer) all the
visible graphical objects in the current scene.

•

Show Values displays a table (Record Viewer) of the values for all selected objects.

•

Show Original Data retrieves and displays the records corresponding to what has
been selected. The resulting records are shown in a table viewer.

•

Send To Tool Manager inserts a filter operation, based on the current box selection(s),
at the beginning of the Tool Manager history. The actual expression used to do the
drill through is determined by extents of the current box selection(s). If nothing is
selected, a warning message appears.

•

Use Slider On Drill Through determines whether or not to use the slider position
when creating the drill-through expression. If checked (default), an additional term
is added to the drill-through expression, limiting the drill-through to those records
defined by the slider’s position. If this option is not checked, no such limiting term
is added.

•

Complementary Drill Through causes the Show Original Data and Send To Tool Manager
selections, when used, to fetch all the data that are not selected.

For further details on drill-through, see Chapter 18, “Selection and Drill-Through.”

Pulldown Menus

The InterTool Menu
The InterTool menu has one option, as shown in Figure 6-17.

Figure 6-17

Map Visualizer’s InterTool Pulldown Menu

Selecting Synchronize All Mapviz Sliders identifies this Map Visualizer window as one in
a “synchronized sliders” cooperative: changing the current slider positions in one Map
Visualizer window produces the same change in all others currently open. Click
this option again to deselect it. This menu option must be selected in every Mapviz main
window that is to be part of the synchronization.
Note that currently only the sliders’ physical positions are synchronized, not the
underlying meanings of those positions. For example, synchronizing
population.usa.mapviz (with dates ranging from 1770 to 1990) and population.canada.mapviz
(with dates ranging from 1871 to 1991) probably is not useful, since the slider physical
midpoint position represents 1880 in the United States and 1931 in Canada. Generally,
synchronization is useful only when the sliders of each dataset represent the same range
of independent variables.
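
The mismatch in the example above can be checked with a little arithmetic. The following sketch assumes that a slider position can be modeled as a fraction between 0.0 and 1.0 of the slider's length and that values are spread linearly along it. Both are illustrative assumptions, and value_at is a hypothetical helper, not part of MineSet.

```python
def value_at(position, low, high):
    """Return the value under a slider at a fractional position (0.0 to 1.0).

    Assumes values are spread linearly along the slider (an illustrative
    assumption; this is not a MineSet function).
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be between 0.0 and 1.0")
    return low + position * (high - low)


# Same physical midpoint, different meanings in the two sample datasets.
usa_year = value_at(0.5, 1770, 1990)     # range of population.usa.mapviz
canada_year = value_at(0.5, 1871, 1991)  # range of population.canada.mapviz
print(usa_year, canada_year)  # prints "1880.0 1931.0"
```

Two datasets whose sliders cover the same range give the same value at every position, which is the case where synchronization is useful.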

The Help Menu
The Help menu is the same for all visualization tools; see “The Help Menu” in Chapter 5.

Null Handling in the Map Visualizer
Nulls represent unknown data (see Appendix J, “Nulls in MineSet”).
In the Map Visualizer, nulls can occur when any of the following is true:

•

The database or data file contains a null.

•

The Tool Manager is used to make an array based on bins and no data falls into a
specific bin. For example, if there is no data for the 30-40-year-old population, that
bin is null.

•

The Tool Manager is used to make an array and the null enum option is specified. In
this case, an extra array element is created to represent the aggregation of all the
values for which the bin value is null. The Tool Manager assigns the question mark
(?) character to this extra bin. To view the values of this bin, move the corresponding
slider to its left-most position. If there are no data for that null bin, the values
associated with it are null as well, and the Map Visualizer represents the
corresponding graphical object(s) as a “null object.”

•

Expressions and aggregations of nulls can generate nulls (see Appendix J, “Nulls in
MineSet”).

The Map Visualizer uses special representations when a null value is mapped to a
visual attribute. A null height results in a dark gray object with zero height; a null
color results in an object with appropriate height (as defined by the value mapped
to height), but with a dark gray color (see Figure 6-18).

Figure 6-18

Representation of a Null Value Mapped to Height (Top Middle Object) and to
Color (Bottom Right Object)

When you select an object with a null value, a question mark (?) is shown in the
selection field.

Sample Configuration and Data Files
The provided sample configuration and data files demonstrate the Map Visualizer’s
features and capabilities. The .data and .mapviz files are in the directory
/usr/lib/MineSet/mapviz/examples; the .gfx and .hierarchy files are in the directory
/usr/lib/MineSet/mapviz/gfx_files.
•

blocks.mapviz, blocks.data, blocks.gfx, and blocks.hierarchy
This simple example shows four adjacent blocks. The height and color of each block
varies based on the underlying data in blocks.data. You can drill up using the middle
mouse button (see the “Select Mode” section) to see the upper pair and the lower
pair of blocks aggregate; then drill up again to see these upper and lower blocks
aggregate into a single block. You can drill down using the right mouse button to
see the objects of finer granularity reappear.

•

population.australia.mapviz, population.australia.data, australia.states.gfx, and
australia.states.hierarchy
The data file contains one row for each Australian state and territory. Each row
contains three tab-separated items: a keyword name for the state or territory, the
population value, and the size of the territory.
This sample graphically displays the 1991 population and population density of the
Australian states and territories. Heights of the graphical objects represent the
relative population; color represents the relative population density. A legend at the
bottom of the display describes the color range and the associated values.

•

population.canada.mapviz, population.canada.data, canada.provinces.gfx, and
canada.provinces.hierarchy
The data file contains one row for each Canadian province and territory. In this
example, each row contains 13 blank-separated values (one for each decade
between 1871 and 1991).
This sample graphically displays the population and population density of the
Canadian provinces and territories from 1871 to 1991, in 10-year increments. The
animation control panel lets you dynamically view the datasets across a range of
time. Animation operation is explained in “Sliders Controlling Independent
Dimensions” on page 197.

•

population.europe.mapviz, population.europe.data, europe.countries.hierarchy, and
europe.countries.gfx
When graphically displayed, this shows the 1992 population and population
density of countries in Western and Central Europe.

•

population.usa.mapviz, population.usa.data, usa.sates.gfx, and usa.sates.hierarchy
When graphically displayed, this shows the population and population density of
the United States from 1770 to 1990. The animation controls let you dynamically
view population and density changes across time.

•

population.usa.cities.mapviz, population.usa.cities.data, usa.sates.gfx, usa.sates.hierarchy,
and usa.cities.gfx and usa.cities.hierarchy
The usa.sates.gfx file specifies the United States, which is displayed as a background.
The usa.cities.gfx file specifies the location of the cities on this background. The .data
file specifies the population of each city.
This sample graphically displays the population of the 48 largest U.S. cities from
1950 to 1990. No data has been mapped to the colors. The animation controls let you
dynamically view changes across time.

•

perhouse.perage.mapviz, perhouse.perage.data, usa.sates.gfx, and usa.sates.hierarchy
This sample graphically displays consumer household spending data from
July-August 1988 to May-June 1991. Color is mapped to the gender of the spending
household member; height represents the average dollar amount spent per
household for a given time period and age group. This data has two independent
dimensions: time and age. The highest spending is indicated in the summary
window (see “The Summary Window” on page 199) by the areas with the greatest
color density, namely “May-June 1989 (Age: 30-39)” and “May-June 1990 (Age:
30-39).”

•

telecom.mapviz, telecom.data, usa.cities.lines.gfx, usa.cities.lines.hierarchy, usa.sates.gfx,
and usa.sates.hierarchy
This sample graphically displays a flat map with arched lines on it. These lines
connect two endpoints. The lines can have variable width and color. In this
example, the widths and colors are random; however, they could relate to the
volume and duration of the connections between the endpoints.

•

fasta.m.data, fasta.m.mapviz, fasta.m.gfx, and fasta.m.hierarchy
The data file for this example contains the partial results of a full biological
sequence comparison between two complete genomes (courtesy of Dr. Tom Flores,
European Bioinformatics Institute). When the data is graphically displayed, scientists can
quickly identify and locate the regions of similarity between the two genomes. The
ability to display such large amounts of information in a visual data exploration
method such as this could be extended to include much more information about the
individual genomes. Scientists could explore this data more easily and thereby
perhaps better understand the function and purpose of the similar genetic
sequences.
In this example, the “map” is the circular-shaped genome of a biological organism
called Mycoplasma genitalium (MG). The MG genome is divided into 500 equal
segments, each representing a 1000-nucleotide sequence in the genome. The slider
selects one of the segments of the second genome, called Haemophilus influenzae
(HI), for cross-comparison between the two genomes. The Summary Window in the
Animation Control Panel indicates which segments show the greatest similarities,
and you can move the slider to examine those particular segments of interest. The
bar heights and colors on the “map” therefore indicate the relative similarity of each
MG segment to each HI segment, where higher bars correspond to greater measures
of similarity. This similarity is measured by the “Reciprocal Evalues,” which range
from 0.0 to 1.0.
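
As an illustration of the row layouts used by these samples, the following sketch reads rows in the three-column layout described above for population.australia.data (name, population, size, separated by tabs). The function name and the sample rows are made up for illustration. Density is computed here as population divided by size, which is the usual definition; how the sample's .mapviz file derives it is not shown in this section.

```python
import csv
import io


def read_population(text):
    """Parse rows of name<TAB>population<TAB>size and add a density field."""
    records = []
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        if not fields:
            continue  # skip blank lines
        name, population, size = fields
        population, size = float(population), float(size)
        records.append(
            {"name": name, "population": population, "size": size,
             "density": population / size}
        )
    return records


# Made-up rows for illustration; these are not values from the sample file.
sample = "StateA\t1000\t50\nStateB\t300\t600\n"
for record in read_population(sample):
    print(record["name"], record["density"])
```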

Chapter 7

7. Using the Scatter Visualizer

This chapter discusses the features and capabilities of the Scatter Visualizer. It provides
an overview of this database visualization tool, then explains the Scatter Visualizer’s
functionality when working with the
•

main window

•

external controls

•

pulldown menus

Finally, it lists and describes the sample files provided for this tool.

Overview of Scatter Visualizer
The Scatter Visualizer lets you visually analyze relationships among several variables
(see Figure 7-1), either statically or by animation. It is particularly useful for seeing
individual data points when you do not have a large number of records. If your dataset
has a very large number of records, consider using the Splat Visualizer. Analysis in the
Scatter Visualizer is done using
•

a three-dimensional landscape

•

an animation control panel that includes a two-dimensional slider

•

graphical objects, called entities, that can be animated in the three-dimensional
landscape

Chapter 7: Using the Scatter Visualizer

Figure 7-1

Sample Scatter Visualizer Screen

The Scatter Visualizer lets you visualize your data by mapping each record, or row, in the
dataset to an entity in the three-dimensional landscape. Variables in the data can be
mapped to the sizes, colors, and positions of the entities. Also, you can map one or two
numeric variables to the sliders in the animation control panel. If the variables mapped
to sizes, colors, or positions of the entities depend on the variables mapped to sliders, the
sliders can be used to drive an animation. For example, the data might represent the sales
of several companies over time. If the time variable is mapped to a slider and the sales
variable is mapped to size, then the entities grow or shrink as the time slider is animated.

After you create a visualization of your data, the Scatter Visualizer lets you analyze the
data in various ways. The animation control panel lets you trace animation paths in one
or two dimensions. By playing back the path you created, you can watch the size, color,
and motion of the entities for trends or anomalies. In the three-dimensional landscape,
you can orient the display to emphasize particular dimensions or a point of view. The
Scatter Visualizer lets you scale the values of variables to give them greater emphasis.
Also, you can filter the display to show only those entities meeting certain criteria.

File Requirements
The Scatter Visualizer requires the following files:
•

A data file, consisting of rows of tab-separated fields. This file is easily created using
the Tool Manager (see Chapter 3). If you are generating this file yourself, see
Appendix D, “Creating Data and Configuration Files for the Scatter Visualizer” for
the required file format.
You can generate data files by extracting data from a source (such as a database) and
formatting it specifically for use by the Scatter Visualizer. Data files have
user-defined extensions (the sample files provided with the Scatter Visualizer have
a .data extension).

•

A configuration file, describing the format of the input data and how it is to be
displayed. The Tool Manager can create this file (see Chapter 3), or you can use an
editor (such as jot, vi, or Emacs) to produce this file yourself (see Appendix D,
“Creating Data and Configuration Files for the Scatter Visualizer”).
Configuration files must have a .scatterviz extension. When starting the Scatter
Visualizer, or when opening a file, you must specify the configuration file, not the
data file.

Options for Invoking the Scatter Visualizer

There are two options that affect how this tool is invoked:
•

-warnexecute indicates that if you attempt to execute a command specified in an
execute statement, a warning is displayed and you are given the option to execute
the command or not. This is intended for an insecure environment, such as files
obtained from the Web, and is used automatically when commands are executed via
mtr files.
You can enable this option permanently by adding the line
*minesetWarnExecute:TRUE

to your .Xdefaults file, or by setting the environment variable
MINESET_WARN_EXECUTE.

•

-quiet eliminates the dialogs that pop up to indicate progress. You can enable this
option permanently by adding the line
*minesetQuiet:TRUE

to your .Xdefaults file.

Starting the Scatter Visualizer
There are five ways to start the Scatter Visualizer:

•

Use the Tool Manager to configure and start the Scatter Visualizer. (See Chapter 3
for details on most of the Tool Manager’s functionality, which is common to all
MineSet tools; see “Configuring the Scatter Visualizer Using the Tool Manager” on
page 220 for details about using the Tool Manager in conjunction with the Scatter
Visualizer.)

•

Double-click the Scatter Visualizer icon, which is in the MineSet page of the icon
catalog. The icon is labeled scatterviz. Since no configuration file is specified, the
start-up screen requires you to select one by using File > Open.

Figure 7-2

Scatter Visualizer Start-Up File Pulldown Menu Selected

Starting the Scatter Visualizer without specifying a configuration file causes the
main window to show the copyright notice and license agreement for this tool.
Only the File and Help pulldown menus can be used. For the main window to be
fully functional, open a configuration file by selecting File > Open.
•

If you know which configuration file you want to use, double-click the icon for that
configuration file. This starts the Scatter Visualizer and automatically loads the
configuration file you specified. This works only if the configuration filename ends
in .scatterviz (which is always the case for configuration files created for the Scatter
Visualizer using the Tool Manager).

•

Drag the configuration file icon onto the Scatter Visualizer icon. This starts the
Scatter Visualizer and automatically loads the configuration file you specified.

•

Start the Scatter Visualizer from the UNIX shell command line by entering this
command at the prompt:
scatterviz [ configFile ]

configFile is optional and specifies the name of the configuration file to use. If you
don’t specify a configuration file, you must use File > Open to specify one (see
Figure 7-2).
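
If you launch the tool from a script, the command line above can be assembled programmatically. The following is a minimal sketch: scatterviz_command is a hypothetical helper, and placing the -warnexecute and -quiet options before the configuration file is an assumption, since this guide does not state the required order.

```python
def scatterviz_command(config_file=None, warn_execute=False, quiet=False):
    """Build an argument list for launching the Scatter Visualizer.

    Hypothetical helper: putting options before the configuration file is an
    assumption, not something this guide specifies.
    """
    args = ["scatterviz"]
    if warn_execute:
        args.append("-warnexecute")
    if quiet:
        args.append("-quiet")
    if config_file is not None:
        # Configuration files must have a .scatterviz extension.
        if not config_file.endswith(".scatterviz"):
            raise ValueError("configuration files must end in .scatterviz")
        args.append(config_file)
    return args


print(scatterviz_command("example.scatterviz", quiet=True))
```

On a system with MineSet installed, the resulting list could be passed to subprocess.run to start the tool.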

Configuring the Scatter Visualizer Using the Tool Manager
This section describes how the Scatter Visualizer can be configured using the Tool
Manager. Although the Tool Manager greatly simplifies the task of configuring the
Scatter Visualizer, you can construct a configuration file manually for this tool using a
text editor (see Appendix D, “Creating Data and Configuration Files for the Scatter
Visualizer”).
The steps required to connect to a data source are described in Chapter 3.

Selecting the Scatter Visualizer Tool
Select the Viz Tools tab in the Data Destination panel of the Tool Manager’s main screen
(Figure 7-3). From the popup list of tools, select Scatter Visualizer. The mapping
requirements for the Scatter Visualizer are displayed in the window on the right side of
this panel. Items in the Visual Elements list that are preceded by an asterisk are optional.

Figure 7-3

Data Destination Panel With Scatter Visualizer Selected

•

Axis 1, *Axis 2, *Axis 3 let you assign to the axes in the Scatter Visualizer’s main
window the data you want represented. Assigning data to Axis 1 is required.
However, this alone does not produce a useful display. By assigning data to Axis 2,
you can create an XY chart. Assigning data to all three axes produces a 3-D chart.

•

*Entity-size, *Entity-color, *Entity-label let you assign size, color, and label to the
entities appearing in the Scatter Visualizer’s main window.

•

*Summary is the value mapped to the summary column, if you have a slider. It
determines the color of the slider’s background.

•

*Slider1 and *Slider2 let you map columns directly to one or two animation Sliders
(see “Slider Creation for Scatterviz,” below).

Mapping Requirements to Columns
You can map requirements to columns by selecting a column name in the Current
Columns window of the Table Processing panel, then selecting a category in the Visual
Elements window.

Undoing Mappings
To undo a specific mapping, select that mapping in the Visual Elements window, then
click the Clear Selected button. To undo all mappings, click the Clear All button.

Slider Creation for Scatterviz
Sliders can be created manually or automatically. The following subsections describe
these methods.

Manual Slider Creation

Tool Manager generates sliders whenever there is an array column present in the current
table. The sliders correspond to the indices of the array columns. If the column has one
index (one-dimensional array), only one slider is created, but if the column has two
indices (two-dimensional array), both an X and a Y slider are created. The current slider
indices are indicated in the Tool Options dialog box from the Tool Manager. Array
columns can be created using the “index by” menus in the Tool Manager aggregation
panel (see “Aggregation” in Chapter 3).
Note that for a slider to be created, all array columns in the current table must have the
same indices. If array columns with differing indices exist in the current table, no sliders
are created.
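
The rule above can be summarized in a short sketch. This is an illustration of the rule as stated, not Tool Manager code; sliders_for and the column and index names are hypothetical.

```python
def sliders_for(array_indices):
    """Return the slider index names implied by a table's array columns.

    array_indices maps each array column name to a tuple of its index names.
    An empty tuple means no sliders would be created.
    """
    distinct = set(array_indices.values())
    if len(distinct) != 1:
        # No array columns at all, or array columns with differing indices.
        return ()
    (indices,) = distinct
    if len(indices) not in (1, 2):
        # Assumption: only one- and two-dimensional arrays produce sliders.
        return ()
    return indices


print(sliders_for({"sales[]": ("year",), "profit[]": ("year",)}))
print(sliders_for({"sales[]": ("year", "age")}))
print(sliders_for({"sales[]": ("year",), "profit[]": ("age",)}))
```

The first call describes one slider, the second an X and a Y slider, and the third no sliders because the indices differ.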
Automatic Slider Creation

If no arrayed columns are in the current table, Tool Manager can automatically generate
sliders by use of the Slider1 and Slider2 mappings. Sliders are created through a
combination of automatic binning and aggregation. These automatic operations occur
after clicking Invoke Tool in the Data Destination Panel. The operations do not affect the
current history operations of Tool Manager, but they do appear in the configuration files
for the tool.
Columns mapped to Slider1 and Slider2 eventually form the indices for the sliders. These
columns must be either numeric (int, float, double) or binned. If a column mapped to a
slider is already binned, no automatic binning is needed for this column, and this column
is used as an index for a slider. However, if the column is not binned, a binned column is
created using the automatic binning options in the Tool Options dialog box.
The three methods of binning are:
•

Selecting All Distinct Values creates a bin for every unique value of the column.

•

Specify the number of bins you want to create. The thresholds for the bins are
determined using the Uniform Range approach.

•

Selecting Automatic automatically determines the number of bins to create and
determines the bin thresholds using the Uniform Range approach.

(See “The Bin Columns Button” in Chapter 3 for more information about binning.) The
column used in forming the automatic bins is deleted from the current table.
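
The following sketch illustrates two of these binning methods. It assumes that Uniform Range means equal-width bins between the column's minimum and maximum; see Chapter 3 for the authoritative definition. The function names and sample values are hypothetical.

```python
def distinct_value_bins(values):
    """All Distinct Values: one bin per unique value, in sorted order."""
    return sorted(set(values))


def uniform_range_thresholds(values, bin_count):
    """Return the bin_count - 1 interior thresholds of equal-width bins.

    Assumes Uniform Range splits [min, max] into bins of equal width.
    """
    if bin_count < 1:
        raise ValueError("bin_count must be at least 1")
    low, high = min(values), max(values)
    width = (high - low) / bin_count
    return [low + i * width for i in range(1, bin_count)]


def bin_index(value, thresholds):
    """Return the 0-based bin a value falls into (a threshold starts a new bin)."""
    return sum(value >= t for t in thresholds)


values = [10, 20, 30, 40, 50]
thresholds = uniform_range_thresholds(values, 4)
print(thresholds)                             # [20.0, 30.0, 40.0]
print([bin_index(v, thresholds) for v in values])  # [0, 1, 2, 3, 3]
```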

The binned columns now form the indices of array columns. Note that if you want to
create only one slider, the index must be mapped to Slider1. Attempting to create only
one slider with a mapping to Slider2 is not allowed and generates a Tool Manager error.
Also, a column mapped to a slider cannot be mapped to any other mapping, since it is
removed during the aggregation process.
Once the slider indices are formed, the arrayed columns are created. This is done using
automatic aggregation. Any numeric columns mapped to Axis 1, Axis 2, Axis 3,
Entity-size, Entity-color, Entity-label, or Summary are aggregated using the automatic
aggregation options in the Tool Options dialog box. You can either specify aggregating
by Sum or by Average. The binned columns created from the slider mappings form the
indices for the aggregation, and any remaining columns in the table are Group-By
columns. (See “Aggregation” in Chapter 3 for a description of the aggregation process.)
Be sure to remove any columns you do not wish to use in the grouping process. If you
need different types of aggregates for different mappings, you must aggregate manually.
The aggregation step automatically forms the arrayed columns used for sliders. These
arrayed columns form the new tool mappings. For example, if the column “mpg” were
mapped to Axis 1, a new column “avg_mpg[]” is formed and remapped to Axis 1. The
progress of the automatic slider generation is displayed in the Tool Manager status
window.
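
To make the aggregation step concrete, the following sketch builds one array per Group-By value, indexed by a binned slider column, using either Sum or Average. It is an illustration under stated assumptions, not the Tool Manager's implementation: the function, the "origin" and "year_bin" column names, and the data rows are made up, and a bin with no data is represented by None to stand in for a null.

```python
from collections import defaultdict


def aggregate_to_arrays(rows, group_by, index, column, bins, how="avg"):
    """Aggregate `column` into one array per group, indexed by `index` bins.

    Bins with no data hold None, standing in for a null.
    """
    if how not in ("sum", "avg"):
        raise ValueError("how must be 'sum' or 'avg'")
    cells = defaultdict(list)
    for row in rows:
        cells[(row[group_by], row[index])].append(row[column])

    result = {}
    for group in sorted({row[group_by] for row in rows}):
        array = []
        for b in bins:
            values = cells.get((group, b))
            if not values:
                array.append(None)
            elif how == "sum":
                array.append(sum(values))
            else:
                array.append(sum(values) / len(values))
        result[group] = array
    return result


# Made-up rows: "mpg" is mapped to an axis, "year_bin" is the binned slider column.
rows = [
    {"origin": "A", "year_bin": 0, "mpg": 20.0},
    {"origin": "A", "year_bin": 0, "mpg": 30.0},
    {"origin": "A", "year_bin": 1, "mpg": 28.0},
    {"origin": "B", "year_bin": 1, "mpg": 18.0},
]
avg_mpg = aggregate_to_arrays(rows, "origin", "year_bin", "mpg", bins=[0, 1])
print(avg_mpg)  # {'A': [25.0, 28.0], 'B': [None, 18.0]}
```

Each resulting array plays the role of a column such as avg_mpg[], with one element per slider position.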

Specifying Tool Options
Clicking the Tool Options button causes a new dialog box to be displayed (Figure 7-4).
This lets you change some of the Scatter Visualizer options from their default values.

Figure 7-4

Scatter Visualizer’s Options Dialog Box

The Scatter Visualizer’s Options dialog box has four basic options blocks:
•

Entities

•

Sliders

•

Axes

•

Other

Entity Options

This option block lets you specify a number of characteristics for the entities that the
Scatter Visualizer then graphically displays.
•

Entity Legend On—lets you determine whether the entity legend is displayed or
hidden.

•

Entity Size—lets you scale the entity to a max size, a scale size, or a default (no
adjustment). You also can specify whether the legend for entity size is displayed or
hidden.

•

Entity Colors—lets you control the colors in which entities are displayed. You can

–

specify the list of colors to use

–

specify the kind of mapping

–

map the list of colors to a list of values

–

specify whether the legend for color is displayed or hidden

–

map colors to entities

•

Entity Shape—lets you choose a visual representation for the entities: cubes, bars, or
diamonds.

To use these Colors options, you must have mapped a column to the *Entity-color
requirement of the Data Destination panel. See “Choosing Colors” and “Using the Color
Browser” in Chapter 3 for a more detailed explanation of how to choose and change
colors.
Color list to use lets you specify the color list using the + button next to the color list label.
This brings up a color editor that lets you specify a color to be added to the list.

Color mapping lets you specify whether the color change that is shown in the graphic
display is Continuous or Discrete. If you choose Continuous, the color values shift
gradually between the colors entered in the Color list to use field as a function of the
values that are mapped to those colors in the Color mapping field.
The field to the right of the popup button lets you enter specific values for mapping the
colors. If you do not specify any mapping values, the range of values in the color variable
is used.
Example 7-1

If you
•

used the Color Browser to apply red and green to bars

•

selected Continuous for the Kind of mapping

•

entered the values 0 100

then the display shows all entities with values less than or equal to 0 as completely red,
those greater than or equal to 100 as completely green, and those between 0 and 100
as shadings from red to green.
Example 7-2

If you
•

used the Color Browser to apply red and green to entities

•

selected Discrete for the Kind of mapping

•

entered the values 0 50

then the display shows all entities with values of less than 50 in red, and all those with
values greater than or equal to 50 in green.
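
Examples 7-1 and 7-2 can be reproduced with a short sketch. It assumes that Continuous mapping blends linearly between RGB colors and that Discrete mapping uses the color of the last mapping value the data value has reached; the Scatter Visualizer's exact interpolation is not documented here. The function names are hypothetical, and the mapping values are assumed to be strictly increasing.

```python
RED = (255, 0, 0)
GREEN = (0, 255, 0)


def continuous_color(value, colors, stops):
    """Blend linearly between colors; clamp outside the first and last stops."""
    if value <= stops[0]:
        return colors[0]
    if value >= stops[-1]:
        return colors[-1]
    for i in range(len(stops) - 1):
        lo, hi = stops[i], stops[i + 1]
        if lo <= value <= hi:
            t = (value - lo) / (hi - lo)
            return tuple(
                round(a + t * (b - a)) for a, b in zip(colors[i], colors[i + 1])
            )


def discrete_color(value, colors, stops):
    """Use the color of the last stop that the value has reached."""
    chosen = colors[0]
    for color, stop in zip(colors, stops):
        if value >= stop:
            chosen = color
    return chosen


# Example 7-1: Continuous, values 0 100.
print(continuous_color(-5, [RED, GREEN], [0, 100]))   # completely red
print(continuous_color(25, [RED, GREEN], [0, 100]))   # a shade between red and green
print(continuous_color(100, [RED, GREEN], [0, 100]))  # completely green

# Example 7-2: Discrete, values 0 50.
print(discrete_color(49, [RED, GREEN], [0, 50]))  # red
print(discrete_color(50, [RED, GREEN], [0, 50]))  # green
```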

•

Entity Label Color lets you modify a label color by clicking on it. This causes the
Color Chooser dialog box to appear, which lets you implement your color changes.

•

Entity Label Size controls the size of the entity labels. A smaller number decreases
the size, a larger one increases it.

Summary Options

Summary options let you specify what color to use for the Summary window. You can
also specify whether the summary legend, which indicates what the values are, is
displayed or hidden.
If you have an array of values, you can specify an X or Y slider. The popup buttons next
to these options provide a list of available keys, and let you specify which to use as
sliders.
Slider Options

The Slider options control how the slider mappings are interpreted. For details see
“Slider Creation for Scatterviz” on page 221.
Axis Options

The Axis options let you specify the following, for each axis:
•

A label. (If you leave this box blank, the Scatter Visualizer defaults to using the
column names for each axis.)

•

A color

•

A size type for each axis. (This can be Max Size, Scale Size, or No Adjustment.)
–

Max Size lets you specify that an axis is scaled independently to a specified size.
If one axis has a Max Size that is twice as large as the other, it will be twice as
long, regardless of the data values. This option is most useful when comparing
axes that are in different units (for example, comparing income to age). This
option has no effect on non-numeric data.

–

Scale Size lets you specify that the axis is scaled based on its maximum value. If
two axes have the same Scale Size, but one has a maximum that is twice the
value of the other, the former will be twice as long as the latter. This option is
useful for comparing axes with the same units (for example, income vs.
expenses). This option does affect the size of non-numeric axes.

–

No Adjustment is equivalent to a Scale Size of 1.0.

•

A size value

•

Whether the axis should be extended to include the value 0.
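
The difference between the size types can be sketched as follows. The sketch only models the proportions described above (Max Size ignores the data; Scale Size multiplies the axis maximum); axis_length is a hypothetical function, and the actual units and any offsets used by the Scatter Visualizer are not specified in this section.

```python
def axis_length(size_type, size_value, data_max):
    """Return a relative axis length for a size type and size value.

    Models only the proportions described in the text; units are arbitrary.
    """
    if size_type == "Max Size":
        return size_value             # independent of the data values
    if size_type == "Scale Size":
        return size_value * data_max  # proportional to the axis maximum
    if size_type == "No Adjustment":
        return 1.0 * data_max         # same as a Scale Size of 1.0
    raise ValueError("unknown size type: " + size_type)


# Max Size: income (max 90000) and age (max 80) can be given comparable lengths.
print(axis_length("Max Size", 2.0, 90000), axis_length("Max Size", 1.0, 80))

# Scale Size: with the same scale, twice the maximum gives twice the length.
print(axis_length("Scale Size", 0.5, 200), axis_length("Scale Size", 0.5, 100))
```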

Other Options

The Other Options, at the bottom of the dialog box, include the following fields:
•

Message lets you specify the message displayed when an entity is selected. For a
listing and description of format types that can be entered in this field, see the
“Message Statement” section in Appendix D, “Creating Data and Configuration
Files for the Scatter Visualizer.”

•

Execute lets you type in a UNIX command that is executed when double-clicking on
an entity. The format is similar to the message statement. If no execute statement
appears, double-clicking has no effect. For a detailed description of the Execute
field, see “Execute Statement” in Appendix D.

•

Hide Label Distance controls the distance at which entity labels become invisible.
Smaller distances might improve performance, but the labels disappear more
quickly. The higher the number, the greater the distance at which labels are hidden.

•

Axis Label Size controls the size of the axis labels. A smaller number decreases the
size, a larger one increases it.

•

Grid (X, Y, Z) Size lets you specify the spacing between grid lines for the respective
axis. A smaller number decreases the size, a larger one increases it.

•

Grid Color lets you modify a grid color by clicking on it. This causes the Color
Chooser dialog box to appear, which lets you implement your color changes.

Resetting the Tool Options

If you want to reset the values of all options to their default values, click the Reset Options
button.
Saving the New Tool Options

Once you have finished making changes to the Tool Options dialog box, click OK to
return to the Tool Manager’s main screen.

Invoking the Scatter Visualizer
To see the Scatter Visualizer graphically represent your data, click the Invoke Tool button at
the bottom of the Data Destination panel.

Saving the Scatter Visualizer Settings
When you press Invoke Tool, the Tool Manager stores information for the Scatter
Visualizer in three files, all sharing the same prefix:
•

prefix.scatterviz.data contains data.

•

prefix.scatterviz.schema describes the data file.

•

prefix.scatterviz contains information needed by the Scatter Visualizer.
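
If you need to locate these files from a script, the following sketch lists the three names for a given prefix and reports any that are missing. It assumes each name is formed by appending the extension directly to the shared prefix; the function names and the "sales" prefix are hypothetical.

```python
from pathlib import Path

SUFFIXES = (".scatterviz.data", ".scatterviz.schema", ".scatterviz")


def expected_outputs(prefix):
    """Return the three file names described above for a shared prefix."""
    return [prefix + suffix for suffix in SUFFIXES]


def missing_outputs(prefix, directory="."):
    """Return the expected files that are not present in `directory`."""
    return [
        name for name in expected_outputs(prefix)
        if not (Path(directory) / name).exists()
    ]


print(expected_outputs("sales"))
```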

To save the entire session along with the current tool options, use one of these menu
options from the File menu:
•

Save Current Session... where the default prefix is based on the data source

•

Save Current Session As... to specify your own prefix

The saved file is prefix.mineset, and contains all the information needed to return
MineSet to its current state.

Null Handling in the Scatter Visualizer
The Scatter Visualizer uses special representations when fields with unknown data
values, or nulls, are mapped to visual attributes. (For a discussion of null values, see
Appendix J, “Nulls in MineSet.”) When a null value is mapped to an entity’s size, the
entity is drawn as the outline of a cube. When a null value is mapped to an entity’s color,
it is drawn in dark grey. When a null value is displayed in the Selection Window or
“Pointer is Over” area, it is shown as a question mark (?). (The Selection Window and
“Pointer is Over” areas are discussed in the “Select Mode” section.)
If a null value is mapped to the x, y, or z position of an entity, the result depends on the
Show Entities with Null Positions option under the View Menu (see “The View Menu”
on page 241). If the option is set, the entity is shown just below the range of the
corresponding axis. If the option is not set, the entity is not shown.
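
The null-handling rules above can be restated as a small decision function. This is a sketch of the rules as described, not the Scatter Visualizer's rendering code: entity_appearance is hypothetical, None stands in for a null, and "solid" is simply a placeholder for the normal entity shape.

```python
def entity_appearance(size, color, position, show_null_positions=True):
    """Describe how an entity is drawn when some mapped values are null (None).

    Returns None when the entity is not shown at all.
    """
    if any(p is None for p in position) and not show_null_positions:
        return None
    return {
        "shape": "cube outline" if size is None else "solid",
        "color": "dark grey" if color is None else color,
        "placement": [
            "just below axis range" if p is None else p for p in position
        ],
    }


print(entity_appearance(None, "red", (1, 2, 3)))
print(entity_appearance(4.0, None, (1, None, 3)))
print(entity_appearance(4.0, "red", (1, None, 3), show_null_positions=False))
```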

Working in the Scatter Visualizer’s Main Window
If you started the Scatter Visualizer without specifying a configuration file, the main
window shows the copyright notice and license agreement for the Scatter Visualizer.
Only the File and Help pulldown menus can be used. For the main window to show all
menus and controls, open a configuration file. Use File > Open (Figure 7-2) to see a list of
configuration files.
When a valid configuration file has been selected, the 3D landscape it specifies is visible.
For example, selecting company-total.scatterviz gives results as shown in Figure 7-5.

Figure 7-5

Initial View When Specifying company.scatterviz

This shows the sales of life insurance, auto insurance, and home insurance with respect
to income brackets over time.

Viewing Modes
The two modes of viewing are grasp and select. To toggle between these modes, press the
Esc key or click the appropriate cursor button adjacent to the top-right of the viewing
area.
Grasp Mode

In grasp mode, the cursor appears as a hand. This mode supports panning, rotating, and
scaling the scene’s size in the main window.
• To pan the display, press the middle mouse button and drag it in the direction you
want the display panned.

• To rotate the display, press the left mouse button and move the mouse in the
direction you want to rotate. (Also see the thumbwheel controls Rotx and Roty,
described in “Thumbwheels” in Chapter 6.)

• To move the viewpoint forward, press the left and middle mouse buttons
simultaneously and move the mouse downwards. To move the viewpoint
backward, press the left and middle mouse buttons simultaneously and move the
mouse upwards. This is equivalent to the functions provided by the Dolly
thumbwheel.

Select Mode

In select mode, you can highlight an object by positioning the cursor over that object.
Information about that object then appears at the top of the view area, under the Pointer
is over: label (Figure 7-6). This information remains visible in the window only as long as
the pointer cursor remains over the object. Position the pointer cursor over an object and
click the left mouse button; the same information appears in the Selection Window, above
the main window. A white box appears around the entity, indicating it has been selected,
and a table viewer shows your current selection. Select several entities by holding down
the Shift key while clicking the left mouse button. The most recent selection is shown
under the Selection label at the top of the scene. All current selections are shown in the
Record Viewer. You can now drill through on your selection (see “The Selection Menu”
on page 279 for the different drill-through options).
This Selection information remains visible until another object is selected, or you click the
black background. Using the mouse, you can cut and paste this selection information into
other applications, such as reports or databases.

Figure 7-6

Displayed Information When Cursor is Over a Selected Entity

If an execute statement was specified using Tool Manager or the configuration file, then
double-clicking an object executes the appropriate command. If the -warnexecute
option was specified when invoking the Scatter Visualizer, a warning is given first.
Note: Users familiar with Open Inventor can configure the Scatter Visualizer so that the
right mouse button brings up the standard Inventor Menu. This provides additional
functions, such as stereo viewing and spin animation. These functions are provided by
the Open Inventor library. To enable the Open Inventor Menu, add the line
*minesetInventorMenu:TRUE

to your .Xdefaults file.

External Controls
Several external controls surround the main window, including buttons and
thumbwheels. These controls are substantially the same for most MineSet visualization
tools (see the descriptions “Buttons” in Chapter 6, and “Thumbwheels” in Chapter 6).

The Animation Control Panel
The animation control panel, which appears to the right of the main window, consists of
a summary window, with up to two adjacent sliders, an information field, animation
buttons, and animation sliders.

Sliders Controlling Independent Dimensions
The number of sliders appearing adjacent to the summary window is dependent on the
dataset displayed in the Scatter Visualizer’s main window. Datasets can have two, one,
or no independent dimensions.
Datasets With Two Independent Dimensions

If the dataset has two dimensions of independently varying data (such as
company.scatterviz), the controls to the right of the main graphics window become visible
(see Figure 7-7).

Figure 7-7

Animation Control Panel With Summary Window and Both Slider Controls

To the right of the main window are the summary window and slider controls. The
summary window has a horizontal slider below it for selecting data points of the first
independent dimension, and a vertical slider to the left for selecting data points of the
second independent dimension. The horizontal slider’s dimension is identified by a label
below it. The vertical slider’s dimension is identified by a label above it.

Datasets With One Independent Dimension

For datasets with one independent dimension (such as store-type.scatterviz), only the
slider below the summary window appears, and the summary window is compressed
(see Figure 7-8). This slider’s dimension is identified by a label below it.

Figure 7-8

Animation Control Panel With Summary Window and One Slider Control

Datasets With No Independent Dimension

For datasets with no independent dimensions (such as brand.scatterviz), no slider control
appears (see Figure 7-9).

Figure 7-9

Scatter Visualizer With No Independent Dimension or Animation Control Panel

The Summary Window
The summary window provides a 2D representation of the aggregation of values that the
main window displays in 3D. The whiter the areas of the summary window, the lower
the total values represented by the entities in the main window. The greater the color
density in areas of the summary window, the higher the total of those values. The density
of these colors in the summary window provides a summary of the data across the one
or two independent dimensions in the dataset.
By default, the summary window also contains a set of black dots, evenly spaced across
the one or two dimensions of data. These dots indicate the precise positions of the
discrete datapoints. You can turn off these black dots using the View > Show Data Points
menu option.
Color Density Examples in the Summary Window

After opening the company.scatterviz file, for example, the 2D summary window shows a
color range from white (on the left) to red (on the right). White corresponds to a low sales
volume; red represents a higher aggregate sales volume. In this example, the greater the
density of red, the higher the total sales of life, auto, and home insurance.
Creating a Path in the Summary Window

If the dataset loaded into the Scatter Visualizer has at least one independent dimension,
it is possible to view all or any part of that dataset via animation. This is done by first
creating a path in the summary window (this path connects a sequence of data points),
then activating the animation controls described in the next section.
The three ways to draw a path in the summary window are as follows:

• Define a starting point by clicking and holding down the left mouse button, then
draw a path by dragging the cursor over the window. End the path by releasing the
left mouse button.

• Define a starting point by clicking the left mouse button, then define an endpoint by
moving the cursor to another part of the window and clicking the middle mouse
button. A line appears between those two points. To add more line segments,
continue with repeated middle mouse clicks.

• Define a starting point by clicking the left mouse button, then drag one of the
independent dimension sliders, thus drawing a straight line along this dimension.
If there are two sliders, use of the second slider causes a straight line to be drawn
along the axis controlled by this second slider.

Animation Buttons and Sliders
The seven VCR-like buttons and two sliders (Path and Speed) below the 2D summary
window let you control the animation.
Animation Buttons

Once a path is drawn in the summary window (see “Creating a Path in the Summary
Window,” above), you can use the VCR-like buttons to control animation along this path.
The middle Stop button is highlighted in blue, indicating an initial state. Use the adjacent
Play Forward button (to the right of Stop) or Play Reverse (to the left) to begin simple
movement along the drawn path in a forward or reverse direction. (Forward and Reverse
are defined by the sequence in which the path was drawn, not by the left-to-right or
right-to-left movement.)
To stop and restart the animation, click the Stop button, then use the Play Forward or
Reverse button again. Note that when you stop, the animation continues in the current
direction until the position falls upon a discrete data point.
Adjacent to the Play buttons are the Single-Step Forward and Single-Step Reverse buttons.
Clicking one of these buttons changes the current path position to the next discrete
data point in that direction.
On the outside are the Fast Forward and Fast Reverse buttons. Clicking one of these
buttons while in Stop state changes the path position to the end (for Forward) or to the
beginning (for Reverse) of the path. Clicking a Fast button when in Play state increases the
animation speed.
Animation Flow

Below the Animation Buttons are the three Animation Flow buttons.
Play-once (default)—the animation moves either forward or in reverse until it reaches the
end of the path, then stops.
Loop—when the animation reaches the end of the path, it automatically resets to the
beginning and starts over again.
Swing—when the animation reaches the end of the path, it reverses direction and retraces
its path to the other end; upon reaching that end, the animation reverses direction again,
beginning the cycle again.
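
The three flow modes differ only in what happens when the animation runs past the end of the path. The following Python sketch illustrates that behavior for a path of discrete points numbered 0 through last_index. The function name and its arguments are hypothetical, and the handling of a loop played in reverse is an assumption, since the text describes only the forward case.

```python
def advance(index, direction, last_index, flow):
    """Take one discrete step along a path whose points are 0..last_index.

    direction is +1 (forward) or -1 (reverse). Returns (index, direction);
    a returned direction of 0 means the animation has stopped.
    Hypothetical helper for illustration, not MineSet code.
    """
    nxt = index + direction
    if 0 <= nxt <= last_index:
        return nxt, direction
    if flow == "play-once" or last_index == 0:
        # Play-once: stop at the end of the path.
        return index, 0
    if flow == "loop":
        # Loop: reset to the beginning and start over.
        # Assumption: a reverse loop restarts from the last point.
        return (0 if direction > 0 else last_index), direction
    if flow == "swing":
        # Swing: reverse direction and retrace the path.
        return index - direction, -direction
    raise ValueError(f"unknown flow mode: {flow!r}")
```

Calling advance repeatedly with the returned values traces the full animation for a given flow mode.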

Animation Sliders

While animation is stopped, you can move the Path slider to reset the position along the
path. Note that when you use the Path slider, the cursor in the summary window moves
across the drawn path, and the 1D sliders (below and to the left of the drawing area)
move consistently with the cursor position. Then use the Play or Reverse button to restart
the animation from the newly specified point. You can drag the Path slider to an arbitrary
position between discrete data points; however, when you release the slider, the path
position changes to the nearest discrete data point.
Use the Speed slider to adjust the speed of the animation along the path.
Data Points and Interpolation

As animation proceeds, the variables mapped to size, color, and axes (positions) in the
Scatter Visualizer change smoothly. However, the information displayed in the
“Selection:” message box and the “Pointer is over:” field shows only the data values of the
nearest discrete data position; it does not show interpolated data values.
The animation is produced in the following manner: Assume you have data for 10 years,
on a per-year basis (that is, 10 data values) and that these correspond to the size of one
entity in the Scatter Visualizer. Assume further that the years are 1991 to 2000, the size for
1991 is 20, and the size for 1992 is 40. As you move the year slider from 1991 to 1992, the
size changes by being uniformly interpolated between 20 and 40. For example, midway
between 1991 and 1992, the size is 30. As you approach 1992, the size approaches 40.
However, you cannot stop an animation between discrete data points, and you cannot
drag the Path slider to a stationary position between discrete data points.
The data points in the summary window represent the slider positions corresponding to
the actual data from the data file. For example, sizes 20 and 40 are representations of
actual data, but size 30 is not. In this example, there would be data points in the summary
window at the slider positions corresponding to each year.
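
The interpolation in this example can be checked with a short Python sketch. It assumes the discrete data points sit at consecutive integer positions (the years), and the function names are hypothetical. The first function gives the smoothly changing value used for drawing; the second gives the value at the nearest discrete data point, which is what the text fields report.

```python
import math

def interpolated(position, values):
    """Linearly interpolate between the two data points around position.

    values maps consecutive integer positions (for example, years) to data.
    """
    low, high = math.floor(position), math.ceil(position)
    if low == high:
        return values[low]
    fraction = position - low
    return values[low] + fraction * (values[high] - values[low])

def reported(position, values):
    """Value at the nearest discrete data point."""
    nearest = min(values, key=lambda point: abs(point - position))
    return values[nearest]

sizes = {1991: 20, 1992: 40}
print(interpolated(1991.5, sizes))  # 30.0, midway between 20 and 40
print(reported(1991.75, sizes))     # 40, the value for 1992
```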
Note that not all variables are required to vary with a slider. If there are two sliders, some
variables can vary with only one of the sliders, while other variables vary with both.

Pulldown Menus
Four pulldown menus let you access additional Scatter Visualizer functions. These are
labeled File, View, Selections, and Help. If you start the Scatter Visualizer without
specifying a configuration file, only the File and the Help menus are available.

The File Menu
The File menu is substantially the same for all visualization tools; see “The File Menu”
in Chapter 5.

The View Menu
The View menu lets you control certain aspects of what is shown in the Scatter Visualizer
window (Figure 7-10).

Figure 7-10

Scatter Visualizer View Menu

• Show Window Decoration lets you hide or show the external controls around the
main window.

• Show null Positions lets you hide or show entities that have null or unknown
position values along one or more axes.

• Show Animation Panel lets you show or hide the animation control panel. This menu
item is disabled for datasets with no independent dimension.

• Show Filter Panel brings up the Filter Panel. This panel (Figure 7-11) lets you
reduce the number of entities displayed in the main viewing area, based on one or
more criteria. You can use the filter panel to fine-tune the display, emphasize
specific information, or simply shrink the amount of information displayed. The Set
Landscape to Filter checkbox, which appears in the lower right of the filter panel,
lets you specify whether the landscape in the main window covers the entire
dataset or just the filtered data.

• Set Background Color brings up a color chooser to let you specify a new background
color.

Figure 7-11

Scatter Visualizer Filter Panel

The Filter panel has two panes. The top pane lets you filter based on string columns.
To select all values of a column, click Set All. To clear the current selections, click
Clear. To select a value, click it. To deselect a value, simply click it again.
The bottom pane lets you filter based on the values of both string and numeric
columns.
To filter numeric values, enter the value, and select a relational operation (=, !=, >, <,
>=, <=). To filter alphanumeric values, enter the string. You can use any of three
types of string comparisons:
• Contains selects values that contain the specified string. For example,
“California” contains the strings “Cal” and “forn”.

• Equals requires the strings to match exactly.

• Matches allows wildcards:

– An asterisk (*) represents any number of characters.

– A question mark (?) represents one character.

– Square brackets ([ ]) enclose a list of characters to match.

For example, California matches Cal*, Cal?fornia, and Cal[a-z]fornia.
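
The three string comparisons can be mimicked in Python for experimentation. This is a sketch, not the Filter panel's implementation: the function names are hypothetical, and it assumes case-sensitive comparison, which the text does not state. The standard fnmatch module happens to use the same three wildcard forms.

```python
from fnmatch import fnmatchcase

def contains(value, text):
    """Contains: the value includes the text anywhere."""
    return text in value

def equals(value, text):
    """Equals: the strings match exactly."""
    return value == text

def matches(value, pattern):
    """Matches: wildcard comparison using *, ?, and [ ]."""
    return fnmatchcase(value, pattern)

for pattern in ("Cal*", "Cal?fornia", "Cal[a-z]fornia"):
    print(pattern, matches("California", pattern))
```

Each of the three patterns in the loop matches “California”, as in the example above.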
For columns which were binned, an option menu of values appears, instead of a
text field. To ignore that column, select Ignored in the Option menu. You can use
relational operators, such as >=, with these options. This means that the specified
value as well as subsequent ones are selected.
In addition to numeric and string comparison operations, you can specify Is Null,
which is true if the value is null.
To the right of each field is an additional option menu that lets you specify “And” or
“Or” options. For example, you could specify “sales > 20 And < 40.” You can have
any number of And or Or clauses for a given column, but cannot mix And and Or in
a single column.
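
The rule that a column's clauses are joined either all by And or all by Or can be sketched as follows. The names below are hypothetical and the sketch covers only numeric comparisons; it is meant to show how “sales > 20 And < 40” is evaluated, not how MineSet implements filtering.

```python
import operator

# Relational operations offered for numeric values.
OPERATIONS = {
    "=": operator.eq, "!=": operator.ne,
    ">": operator.gt, "<": operator.lt,
    ">=": operator.ge, "<=": operator.le,
}

def passes(value, clauses, combine="And"):
    """Test one column value against a list of (operation, operand) clauses.

    All clauses for a column are joined by the same connective, And or Or.
    """
    results = [OPERATIONS[op](value, operand) for op, operand in clauses]
    return all(results) if combine == "And" else any(results)

# "sales > 20 And < 40"
print(passes(30, [(">", 20), ("<", 40)]))  # True
print(passes(50, [(">", 20), ("<", 40)]))  # False
```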
Scale to Filter lets you specify whether the filtered landscape is rescaled to the size of
the filtered data or remains the size of the entire data set.
Click the Filter button to start filtering. If you press Enter while the panel is active,
filtering starts automatically.
Click the Close button to close the panel.

The Selections Menu
The Selections menu lets you drill through to the underlying data.

Figure 7-12

The Scatter Visualizer Selections Menu

• Create Box Selection creates a 3D box selector that can be stretched and translated to
select regions of the volume. While the box is active, a table in Record Viewer format
shows information about all of the aggregated data represented by the entities
within the box. Closing this window clears all current selections. Any entities
within the selection box or selected using Shift-click are shown in the table window.
To translate the selection box, click on one of the faces with the left mouse button,
and drag it in the desired direction. Holding the Shift key while dragging constrains
the motion to the axis to which the drag motion is closest. To change the extent of
the selection box, drag one of the gray scale tabs in the desired direction. The box
cannot be resized or translated beyond the bounds of the volume. The gray scale
tabs continually resize to maintain a constant screen size. If at any time they appear
too big, you can zoom in closer, and they reduce their size relative to the box.

• Show Original Data retrieves and displays the records corresponding to what has
been selected. The resulting records are shown in a table viewer.

• Send To Tool Manager inserts a filter operation, based on the current box selection(s),
at the beginning of the Tool Manager history. The actual expression used to do the
drill through is determined by the extent of the current box selection. If nothing is
selected, a warning message appears.

• Complementary Drill Through causes the Show Original Data and Send To Tool Manager
selections, when used, to fetch all the data that are not selected.

• Preferences brings up a panel that lets you select which columns are used in
drill-through. Unlike other visual tools, there are no specific columns in the data
that are designated as the key to the data. It is impossible for the Scatter Visualizer
to determine which columns the user desires in the drill-through expression. For
example, you might have cars data with brand, model, and weight. Perhaps you
want to drill through to the original data, and specify that brand and model should
be considered, but weight should not. By default, all columns that have been
mapped to graphical requirements are considered significant on drill-through. The
others are not, but may be made so by highlighting them in the Preferences dialog
box.

For further details on drill-through, see Chapter 18, “Selection and Drill-Through.”
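
The effect of a box selection and of Complementary Drill Through can be pictured as a range test on the selected columns. The Python sketch below is an assumption about the general idea only: the names are hypothetical, and the actual filter expression MineSet generates is not shown in this guide.

```python
def in_box(record, box):
    """True if the record lies within the box on every listed column.

    box maps a column name to its (low, high) extent, inclusive.
    """
    return all(low <= record[column] <= high
               for column, (low, high) in box.items())

def drill_through(records, box, complementary=False):
    """Return the records inside the box, or those outside it if complementary."""
    return [r for r in records if in_box(r, box) != complementary]

cars = [
    {"model": "A", "weight": 2100, "horsepower": 90},
    {"model": "B", "weight": 3500, "horsepower": 150},
]
box = {"weight": (2000, 3000)}
print(drill_through(cars, box))                      # model A only
print(drill_through(cars, box, complementary=True))  # model B only
```

Only the columns listed in box take part in the test, which corresponds to choosing the significant columns in the Preferences panel.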

The Help Menu
The Help menu is substantially the same for all visualization tools; see “The Help Menu”
in Chapter 6.

Sample Configuration and Data Files
The provided sample data and configuration files demonstrate the Scatter Visualizer’s
features and capabilities. The following files are in the /usr/lib/MineSet/scatterviz/examples
directory:
• company.data
This file contains fictitious sales data of several insurance companies in three
product categories: life insurance, auto insurance, and home insurance. The data
span ten years (in increments of one year) and include five income brackets (the
customer’s annual income).

• company.scatterviz
This file specifies that the years form one slider dimension and the income brackets
form the other slider. Sales of life insurance, auto insurance, and home insurance
become the three dimensions in the Scatter Visualizer landscape. The color density
in the slider summary window represents the total sales of all companies across all
categories of insurance.

• company-total.scatterviz
This file contains the same specifications as company.scatterviz, except that the size of
each company is determined by the total sales of that company across all the
categories of insurance.

• company-life.scatterviz
This file contains the same specifications as company.scatterviz, except that the color
of each object indicates the life insurance sales as a fraction of total sales.

• store-type.data and store-type.scatterviz
These files show sales of various product groups by store type during a three-year
period. The single independent variable for which a slider appears is time. Each
entity represents a store type (such as Food Store, Drug Store, Service Station, and
so forth). For each store type, the data file contains the total sales of several product
groups, such as alcoholic beverages, cereal, and so forth. The data spans 36 months,
in increments of one month.
The configuration file uses the month as the single slider dimension. One axis is
sales of alcoholic beverages, the other is sales of tobacco products. A third axis is not
used.
Note: The data file includes other categories. You can edit the configuration file to
use other product categories for the axes (see Appendix D, “Creating Data and
Configuration Files for the Scatter Visualizer”).

• brand.data and brand.scatterviz
These files show sales of several soft-drink brands in a variety of store types. In this
dataset the brands form the entities, and the store types are associated with the axes.
The total sales are mapped to the size of each brand. The color mapping is random.
Since there are no independent variables, no slider is present.

• cars.data and cars.scatterviz
These files show the weight, horsepower, model year, and acceleration of several car
models.

• people.data and people.scatterviz
These files show the height, weight, density, and cholesterol level for a population
sample.

• nl.births.data and nl.births.scatterviz
These files show birth patterns in the Netherlands. For each region, the population
density, birth rate, and population are shown. The animation sliders are mapped to
the age of the mother and the year.

• adult94.data and adult94.scatterviz
These files show a complex example with scatterviz applied to
/usr/lib/MineSet/data/adult.data. The three axes in the visualization are avg_hrswk
(that is, average hours worked per week), avg_gross_income, and
avg_education_num. Unfortunately “education num” does not correspond exactly
to number of years of education, but it is close. The slider on the right side animates
across different age ranges. Each aggregate was created by grouping by occupation,
race, and sex. This means that there is an entity for every combination of values for
these three attributes. The color shows different occupations, as shown in the
legend. The size of each entity corresponds to record counts. The summary slider is
also colored by data density. To find out how this visualization was created, you
may select Start Tool Manager from the File menu. This will bring up the Tool
Manager with the session used to create this example.
Initially the scene shows information for people under 20 years of age. Note that the
average hours worked (about 14) and the average income (about $4000) are low. If
you animate over age using the slider, and examine the scene from the three
orthogonal views (try using the lower 3 buttons to the right of the main window),
you will notice various trends emerge. For example, if you orient the scene so you
see only income by hours per week, you can see that people start to work longer
hours as they age, until about age 25, then they seldom work more than 49 hours per
week until they retire. Income, however, grows until people reach age 50, then plateaus,
then goes lower again. The actual trend depends somewhat on the career choice and
other factors.
Suppose you were interested in comparing trends between the occupations
craft-repair and prof-specialty. Open the Filter panel (View > Show Filter Panel) and
select just “craft-repair” and “prof-specialty” from the list of occupations. Now
when you animate, you can see that “prof-specialty” actually starts with lower
incomes, but quickly outpaces “craft-repair” as people age. “Prof-specialty” is much
higher on the education axis than “craft-repair”. You may wish to limit your filter
further by showing just females, or those of a certain race.

Chapter 8

8. Using the Splat Visualizer

This chapter discusses the features and capabilities of the Splat Visualizer. It provides an
overview of this database visualization tool, then explains the Splat Visualizer’s
functionality when working with the
• main window

• external controls

• pulldown menus

Finally, it lists and describes the sample files provided for this tool.

Overview of the Splat Visualizer
The Splat Visualizer lets you visually analyze relationships among several variables (see
Figure 8-1), either statically or by animation. It is particularly well-suited for application
to datasets with large numbers of records. Choose the Scatter Visualizer if you want to
see individual data points and do not have a large number of records. Data analysis is
done using
• a three-dimensional landscape

• an animation control panel that includes a two-dimensional slider

• graphical objects, called splats, which represent aggregates of datapoints. Color and
opacity of the splats can change during animation.

Figure 8-1

Sample Splat Visualizer With One Slider Control

The Splat Visualizer lets you visualize your data by mapping columns to axes, sliders,
color, and opacity. The resulting three-dimensional landscape can be thought of as an
approximation to a scatterplot in which every datapoint is drawn separately. It is not
truly a scatterplot, because datapoints that are close together (fall in the same bin) are
aggregated and drawn as a single splat.

Each numeric column that is mapped to an axis or slider first must be binned. If this
binning step is skipped, the Tool Manager does it using automatic uniform binning (see
“The Bin Columns Button” in Chapter 3). String columns can be mapped directly to axes.
Any numeric column can be mapped to a color. The color of a splat is derived by
averaging the value of the column mapped to color for all the data points that fall in a
bin. The opacity of a splat is based on a weighting of the number of datapoints that fall
in a bin. If nothing is mapped to opacity, record counts are used to determine it. The
interactivity of the resulting visualization is independent of the number of data points
represented; it depends only on the number of bins in the axis dimensions. If your dataset
is very large, aggregate explicitly in the Tool Manager. This causes the server to perform
the processing, rather than having the entire dataset sent to the client and aggregated
there.
Up to two numeric columns can be mapped to the sliders in the animation control panel.
The splats change their color and opacity during animation as the sliders in the
animation panel are moved from point to point along the slider’s path. Unlike in the Scatter
Visualizer, neither the position nor the size of the splats changes; they are at fixed,
uniformly spaced positions. Only their color and opacity change, which can give the
illusion of actual movement.
After creating a visualization of your data, the Splat Visualizer lets you analyze the data
in various ways:
• The animation control panel lets you note global shifts and trends in the data.

• The three-dimensional landscape lets you orient the display to emphasize particular
dimensions or a point of view.

• You can use the scale slider (located to the left of the Main Window) to lower the
overall opacity of the splats, so only regions with dense data show up; conversely,
you can increase the scale slider so all regions having any data become visible. The
regions with dense data are likely to show less color variation, because the color is
based on the average of many values (see Figure 8-3).

• You can filter the display to show only those splats meeting certain criteria. You can
filter on the columns corresponding to axes, sliders, weight, and color.

• An opaque pick dragger lets you display textual information about individual
splats in the volume.

• A box selector lets you define a selected region for drilling through to the original
data or for sending to the Tool Manager.

If a string column is mapped onto an axis, binning is defined to be the distinct values of
that column. The order of the values along a string axis is automatically determined by
sorting the distinct values by the average aggregate value of the column mapped to color.
Looking at the color changes along a string-valued axis lets you see how well that
column correlates with the column mapped to color. The left axis in Figure 8-1 shows
occupations sorted by average income (the average income of everyone with that
occupation) along an axis. The occupation, executive-managerial, listed at the end of the
axis, has the highest average income. This ordering often presents a natural progression
for the values. For example, the ordering for the values of education (the right axis in
Figure 8-1) was generally from low to high; but, in a few cases, there were anomalies in
the order. This unexpected ordering might be interesting because it points out places
where the data does not agree with expectations.
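
The ordering of a string axis can be reproduced with a few lines of Python. This sketch assumes the order runs from lowest to highest average, which agrees with the highest-income occupation appearing at the end of the axis; the function and column names are hypothetical.

```python
from collections import defaultdict

def order_string_axis(rows, axis_column, color_column):
    """Sort the distinct values of axis_column by their average color value."""
    totals = defaultdict(lambda: [0.0, 0])  # value -> [sum, count]
    for row in rows:
        entry = totals[row[axis_column]]
        entry[0] += row[color_column]
        entry[1] += 1
    return sorted(totals, key=lambda value: totals[value][0] / totals[value][1])

rows = [
    {"occupation": "Sales", "income": 30},
    {"occupation": "Exec", "income": 90},
    {"occupation": "Sales", "income": 50},
    {"occupation": "Clerical", "income": 20},
]
print(order_string_axis(rows, "occupation", "income"))
# ['Clerical', 'Sales', 'Exec']
```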

Opacity
The column mapped to opacity should be record count or a column used to weight
record counts. A splat’s opacity, α, is based on this column according to the following
relation:
$\alpha = 1 - e^{-u \cdot \mathit{weight}}$
where weight is the value of the column mapped to opacity (or the record count if no such
column was mapped to opacity). The shape of this function is such that the opacity asymptotically
approaches 1 (totally opaque) as the value of weight becomes large. The variable u is what
is scaled when you adjust the opacity scale slider. Figure 8-2 shows the shape of this
function for low and high values of u. Figure 8-3 shows the same visualization with low
and high values of u.
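
The opacity relation is easy to explore numerically. The following Python sketch evaluates it for a few weights at the two values of u used in Figure 8-3; the function name is hypothetical.

```python
import math

def splat_opacity(weight, u):
    """Opacity of a splat: 1 - e^(-u * weight).

    Returns 0 for a weight of 0 and approaches 1 as the weight grows.
    """
    return 1.0 - math.exp(-u * weight)

for u in (5.3, 30.0):
    # A larger u makes sparsely populated splats more visible.
    print(u, [round(splat_opacity(w, u), 3) for w in (0.0, 0.01, 0.1, 1.0)])
```

For any fixed weight greater than zero, raising u raises the opacity, which is why increasing the scale slider makes regions with little data become visible.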

Figure 8-2

Shape of Opacity Function For Low and High Values of u

Figure 8-3

Image Where u = 5.3, and u = 30

If nothing is mapped to opacity, the Splat Visualizer generates a column of ones to
produce record counts when aggregating. This means all records are weighted equally.
A sum aggregation is done on this column, and an average aggregation is done on the
column mapped to color while grouping by all the axis and slider columns. All other
columns are unnecessary and removed. You do not need to map anything to opacity
unless you want each record to be weighted by something other than 1.
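
The default aggregation described above (a sum over a column of ones, plus an average of the color column, grouped by the axis and slider columns) can be sketched in Python as follows. The function and the sample column names are hypothetical; this is an illustration of the arithmetic, not the Splat Visualizer's code.

```python
def aggregate(rows, group_columns, color_column):
    """Group rows by the axis and slider columns.

    For each group, return the record count (the sum of an implicit column
    of ones) and the average of the color column.
    """
    groups = {}
    for row in rows:
        key = tuple(row[column] for column in group_columns)
        count, total = groups.get(key, (0, 0.0))
        groups[key] = (count + 1, total + row[color_column])
    return {key: {"count": count, "avg_color": total / count}
            for key, (count, total) in groups.items()}

rows = [
    {"age_bin": "20-30", "education": "HS", "income": 20.0},
    {"age_bin": "20-30", "education": "HS", "income": 30.0},
    {"age_bin": "30-40", "education": "HS", "income": 40.0},
]
print(aggregate(rows, ["age_bin", "education"], "income"))
```

Each key of the result corresponds to one splat: the count drives its opacity and the average drives its color.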
You can avoid processing on the client by aggregating in the Tool Manager. This also
avoids having to transfer a large dataset to the client. This is done by
1. Binning the numeric columns which are to be used for axes and sliders.
2. Aggregating the column to be mapped to color by count and average while
grouping by the axis and slider columns.
3. Mapping the resulting count aggregation to opacity.
4. Mapping the resulting average aggregation to color.

For example, using the adult94 data (provided with the distribution):
1. Bin age and hours_per_week.
2. Aggregate gross_income using count and average. Keep education, occupation,
age_bin, and hours_per_week_bin in the group-by pane while removing all the
other columns.
3. Map education, occupation, and hours_per_week_bin to the axes.
4. Map avg_gross_income to color, count_gross_income to opacity, and age_bin to a
slider.
When you invoke the tool, note that all the processing is done on the server, and that the
datafile, adult94.splatviz.data, contains rows that are aggregates of rows in the original
data. This produces the same visualization as seen in Figure 8-1.
In some cases, you might have a column by which you want to weight the records. For
example, if you have a dataset for which one column was population and another was
average_salary (which you want to map to color), you can map population to opacity,
and average_salary to color; then have the Splat Visualizer do the aggregation. Its
aggregation groups by the axis and slider columns and sum aggregates the opacity
column (which, in this case, is population). The new column is called sum_population.
The average_salary column is revised, so that it is still average salary, but weighted by
each row’s population. In this way, the average salary column still shows the average
salary for all the people it represents.
Alternatively, if you want to avoid client-side processing and storage because of the size
of your dataset, you can perform the same aggregation in Tool Manager by doing the
following:
1. Create a new column, defining temp = population*average_salary.
2. Perform an aggregation: group by the axis and slider columns, sum aggregate
population, and sum aggregate temp.
3. Create a new column, defining
avg_salary = sum_temp/sum_population
This creates the weighted average.
4. Now you can map sum_population to opacity, and avg_salary to color.
Note that these steps are the ones taken by the Splat Visualizer if you do not explicitly do
them in the Tool Manager.
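
The weighted-average steps above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not MineSet code: the function name, the region grouping column, and the sample rows are hypothetical, and the sketch assumes every group has a nonzero total weight.

```python
from collections import defaultdict

def weighted_aggregate(rows, group_cols, weight_col, value_col):
    """Group rows, sum the weights, and compute the weight-weighted average value."""
    sum_weight = defaultdict(float)  # plays the role of sum_population
    sum_temp = defaultdict(float)    # plays the role of sum_temp
    for row in rows:
        key = tuple(row[c] for c in group_cols)
        sum_weight[key] += row[weight_col]
        # Step 1 of the procedure: temp = population * average_salary
        sum_temp[key] += row[weight_col] * row[value_col]
    # Step 3 of the procedure: avg_salary = sum_temp / sum_population
    return {
        key: {"sum_population": sum_weight[key],
              "avg_salary": sum_temp[key] / sum_weight[key]}
        for key in sum_weight
    }

# Hypothetical sample data; "region" stands in for the axis and slider columns.
rows = [
    {"region": "north", "population": 100, "average_salary": 30000},
    {"region": "north", "population": 300, "average_salary": 50000},
    {"region": "south", "population": 200, "average_salary": 40000},
]
aggregated = weighted_aggregate(rows, ["region"], "population", "average_salary")
print(aggregated[("north",)])
```

In this sample, the two north rows combine into a population of 400 with a weighted average salary of 45000, which is closer to the salary of the more populous row than a plain average would be.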

File Requirements
The Splat Visualizer requires the following files:
• A data file, consisting of rows of tab-separated fields. This file is easily created using
the Tool Manager (see Chapter 3). If you are generating this file yourself, see
Appendix E, “Creating Data and Configuration Files for the Splat Visualizer” for
the required file format.
You can generate data files by extracting data from a source (such as a database) and
formatting it specifically for use by the Splat Visualizer. Data files have user-defined
extensions (the sample files provided with the Splat Visualizer have a .data
extension).

• A configuration file, describing the format of the input data and how it is to be
displayed. The Tool Manager can create this file (see Chapter 3), or you can use an
editor (such as jot, vi, or Emacs) to produce this file yourself (see Appendix E,
“Creating Data and Configuration Files for the Splat Visualizer”).
Configuration files must have a .splatviz extension. When starting the Splat
Visualizer, or when opening a file, you must specify the configuration file, not the
data file.

Starting the Splat Visualizer
There are five ways to start the Splat Visualizer:
• Use the Tool Manager to configure and start the Splat Visualizer. (See Chapter 3 for
details on most of the Tool Manager’s functionality, which is common to all MineSet
tools; see “Configuring the Splat Visualizer Using the Tool Manager” on page 256
for details about using the Tool Manager in conjunction with the Splat Visualizer.)

• Double-click the Splat Visualizer icon, which is in the MineSet page of the icon
catalog (or on your Indigo Magic desktop). The icon is labeled splatviz. Since no
configuration file is specified, the start-up screen requires you to select one by using
File > Open.
Starting the Splat Visualizer without specifying a configuration file causes the main
window to show the copyright notice and license agreement for this tool. Only the
File and Help pulldown menus can be used. For the main window to be fully
functional, open a configuration file by selecting File > Open.

• If you know what configuration file you want to use, double-click the icon for that
configuration file. This starts the Splat Visualizer and automatically loads the
configuration file you specified. This works only if the configuration filename ends
in .splatviz (which is always the case for configuration files created for the Splat
Visualizer via the Tool Manager).

• Drag the configuration file icon onto the Splat Visualizer icon.

• Start the Splat Visualizer from the UNIX shell command line by entering this
command at the prompt:
splatviz [ configFile ]

configFile is optional and specifies the name of the configuration file to use. If you
don’t specify a configuration file, you must use File > Open to specify one.

Options for Invoking the Splat Visualizer

The -quiet option eliminates the dialogs that pop up to indicate progress. You can enable
this option permanently by adding the line
*minesetQuiet:TRUE

to your .Xdefaults file.

Configuring the Splat Visualizer Using the Tool Manager
This section describes how the Splat Visualizer can be configured using the Tool
Manager. Although the Tool Manager greatly simplifies the task of configuring the Splat
Visualizer, you can construct a configuration file manually for this tool using a text editor
(see Appendix E, “Creating Data and Configuration Files for the Splat Visualizer”).
The steps required to connect to a data source are described in Chapter 3.

Selecting the Splat Visualizer Tool
Select the Viz Tools tab in the Data Destination panel of the Tool Manager’s main screen
(Figure 8-4). From the popup list of tools, select Splat Visualizer. The mapping
requirements for the Splat Visualizer are displayed in the window on the right side of this
panel. Items in the Visual Elements list that are preceded by an asterisk are optional.

Figure 8-4 Data Destination Panel With Splat Visualizer Selected

• Axis 1, *Axis 2, *Axis 3 — determine which columns are assigned to the axes in the
Splat Visualizer’s main window. Assigning data to the first axis is required;
however, this alone does not usually produce a useful display. By assigning data to
Axis 2, you can create an XY chart. Assigning data to all three axes produces a 3-D
chart.

• *Color — requires a numeric column, which is used to determine the color of the splats.
If you have a two-valued string column, you can create a new numeric column
using an expression such as:
('stringCol'=="value1") ? 1 : 0
If nothing is mapped to color, the resulting scene is monochromatic.

• *Opacity — the tool was designed to have the opacity based on a weighting of
records. If you do not aggregate in the Tool Manager, this requirement need not be
mapped; it will be determined automatically by the tool. If you do a count
aggregation in the Tool Manager, or there is a column in the data that already is
based on counts, use that column for this requirement.

• *Sliders — the summary slider dimensions. They must be numeric or binned.

• *Summary — this is the value to be shown in the summary slider. If no summary
column is mapped, count is used by default. If a summary column is mapped, a
weighted average value for that column is shown in the summary.

Mapping Columns to Requirements
You can map requirements to columns by selecting a column name in the Current
Columns window of the Table Processing panel, then selecting a category in the Visual
Elements window.

Undoing Mappings
To undo a specific mapping, select that mapping in the Requirements window, then click
the Clear Selected button. To undo all mappings, click the Clear All button.

Specifying Tool Options
Clicking the Tool Options button causes a new dialog box to be displayed (Figure 8-5).
This lets you change some of the default values of the Splat Visualizer options.

Figure 8-5 Splat Visualizer’s Options Dialog Box

The Splat Visualizer’s Options dialog box has three basic options blocks:
• Splats
• Summary
• Other

Splat Options

This option lets you specify a number of characteristics for the splats that the Splat
Visualizer then graphically displays.
• Splat Colors—lets you control the colors used for the splats. You can
  – specify the list of colors to use
  – specify the kind of mapping
  – map the list of colors to a list of values

• Splat Shape—lets you choose one of the following methods for drawing splats:
linear, gaussian, texture, sphere, cube, or diamond. See “Splat Type Menu” on
page 282 for a further explanation of each of these.

To use the Splat Colors options, you must have mapped a column to the *Color requirement
of the Data Destination panel. If nothing is entered in the color list, the default colormap
is used. The default colormap is a continuous spectrum from blue (lowest value) to red
(highest value). See “Choosing Colors” and “Using the Color Browser” in Chapter 3 for
a more detailed explanation of how to choose and change colors.

Color list—You can specify the color list using the + button next to the color list label.
This brings up a color editor that lets you specify a color to be added to the list.

Color mapping—You can specify whether the color change that is shown in the graphic
display is Continuous or Discrete. If you choose Continuous, the color values shift
gradually between the colors entered in the Color list to use field as a function of the
values that are mapped to those colors in the Color mapping field.
The field to the right of the popup button lets you enter specific values for mapping the
colors. If you do not specify any mapping values, the range of values in the color column
is used.

Example 8-1

If you
• used the Color Browser to apply red and green to the splats
• selected Continuous for the Kind of mapping
• entered the values 0 100

the display shows all splats with values less than or equal to 0 as completely red, those
with values greater than or equal to 100 as completely green, and those between 0 and
100 in a color that results from a linear interpolation between red and green.

Example 8-2

If you
• used the Color Browser to apply red and green to the splats
• selected Discrete for the Kind of mapping
• entered the values 0 50

the display shows all splats with values of less than 50 in red, and all those with values
greater than or equal to 50 in green.

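
The two kinds of mapping in Examples 8-1 and 8-2 can be sketched as follows. This is an illustrative Python model rather than the Splat Visualizer's own code: the function names are hypothetical, colors are assumed to be RGB triples interpolated component by component, and the mapping values are assumed to be listed in strictly increasing order.

```python
def continuous_color(value, breakpoints, colors):
    """Interpolate linearly between the colors assigned to the mapping values."""
    if value <= breakpoints[0]:
        return colors[0]            # at or below the first value: first color
    if value >= breakpoints[-1]:
        return colors[-1]           # at or above the last value: last color
    for i in range(len(breakpoints) - 1):
        lo, hi = breakpoints[i], breakpoints[i + 1]
        if lo <= value <= hi:
            t = (value - lo) / (hi - lo)
            return tuple((1 - t) * a + t * b
                         for a, b in zip(colors[i], colors[i + 1]))

def discrete_color(value, breakpoints, colors):
    """Use the color of the last mapping value reached; first color below all of them."""
    index = 0
    for i, breakpoint in enumerate(breakpoints):
        if value >= breakpoint:
            index = i
    return colors[index]

RED, GREEN = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
print(continuous_color(50, [0, 100], [RED, GREEN]))  # halfway between red and green
print(discrete_color(49, [0, 50], [RED, GREEN]))     # red, as in Example 8-2
print(discrete_color(50, [0, 50], [RED, GREEN]))     # green, as in Example 8-2
```

How the tool treats more than two colors in Discrete mode is not spelled out in the examples, so the generalization in discrete_color is an assumption.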
Summary Options

Summary options let you specify what color to use for the Summary window. This is
only applicable if you have mapped a column to the summary.

Other Options

The Other Options, at the bottom of the dialog box, include the following fields:
• Hide Label Distance — controls the distance at which axis tick labels (for
string-valued axes) become invisible. The higher the number, the greater the distance
at which labels are hidden, so increase this number to keep the labels visible at
greater distances.

• Axis Label Size — controls the size of the axis labels. A smaller number decreases
the size, a larger one increases it.

• Grid Color — lets you modify a grid color by clicking on it. This causes the Color
Chooser dialog box to appear, which lets you implement your color changes.

• Grid (X, Y, Z) Size — lets you specify the spacing between grid lines for the
respective axis. A smaller number decreases the spacing, a larger one increases it. If the
Size is set to 0, there are no grid lines in that dimension.

Resetting the Tool Options

Clicking the Reset Options button resets the values of all options to their default values.

Invoking the Splat Visualizer
To see the Splat Visualizer graphically represent your data, click Invoke Tool at the
bottom of the Data Destination panel.

Saving the Splat Visualizer Settings
When you press Invoke Tool, the Tool Manager stores information for the Splat Visualizer
in several files, all sharing the same prefix:
• .splatviz.data contains data.
• .splatviz.schema describes the data file.
• .splatviz contains information required by the Splat Visualizer.

To save the entire session along with the current tool options, use one of these menu
options from the File menu:
• Save Current Session... where the default prefix is based on the data source
• Save Current Session As... to specify your own prefix

The saved file has a .mineset extension and contains all the information needed to return
MineSet to its current state.
When you use Invoke Tool, the .data, .schema, and .splatviz files are updated, if necessary.

Null Handling in the Splat Visualizer
The Splat Visualizer uses special representations when fields with unknown data values,
or nulls, are mapped to visual attributes. (For a discussion of null values, see Appendix J,
“Nulls in MineSet.”) When every record in a bin has a null value for the column mapped
to color, the resulting color for that splat is gray. If one or more records in the aggregate
have non-null values for the column mapped to colors, then that value is (or those values
are) used to compute the color. While the sum of a value and null is null, the average of
a value and null is the value (that is, value + Null = Null; avg(val, Null) = val).
When a null value is displayed in the Pick Window, Selection Window or “Pointer is
Over” area, it is shown as a question mark (?). (The Selection Window and “Pointer is
Over” areas are discussed in the “Select Mode” section.)
For numeric columns containing nulls which are mapped to axes, there is a special null
position below the range defined by the axis. This is to help show that the null value is
discontinuous with the other values. The null positions for numeric axes can be turned
off using the Show Null Positions option under the View Menu (see “The View Menu”
on page 276). For string-valued columns mapped to axes, nulls (represented by a ‘?’) are
treated as just another value.
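
The null arithmetic described in this section (value + Null = Null; avg(val, Null) = val) can be modeled with a short Python sketch. It is an illustration only: Python's None stands in for a MineSet null, and the function names are hypothetical.

```python
def null_sum(values):
    """Sum that becomes null (None) as soon as any input is null."""
    if any(v is None for v in values):
        return None
    return sum(values)

def null_avg(values):
    """Average over the non-null inputs; null only when every input is null."""
    known = [v for v in values if v is not None]
    if not known:
        return None
    return sum(known) / len(known)

def display(value):
    """Nulls are shown as a question mark, as in the Pick and Selection windows."""
    return "?" if value is None else str(value)

print(display(null_sum([10, None])))    # the sum of a value and null is null
print(display(null_avg([10, None])))    # the average ignores the null
print(display(null_avg([None, None])))  # all null: such a splat is drawn gray
```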

Working in the Splat Visualizer’s Main Window
If you started the Splat Visualizer without specifying a configuration file, the main
window shows the copyright notice and license agreement for the Splat Visualizer. Only
the File and Help pulldown menus can be used. For the main window to show all menus
and controls, open a configuration file. Use File > Open to see a list of configuration files.
When a valid configuration file has been selected, the 3-D landscape it specifies is visible.

Viewing Modes
The two modes of viewing are grasp and select. To toggle between these modes, press Esc,
or click the appropriate cursor button adjacent to the top-right of the viewing area.
Grasp Mode

In grasp mode, the cursor appears as a hand. This mode supports panning, rotating, and
scaling the scene’s size in the main window.

• To pan the display, press the middle mouse button and drag it in the direction you
want the display panned.

• To rotate the display, press the left mouse button and move the mouse in the
direction you want to rotate. (Also see the thumbwheel controls Rotx and Roty,
described in “Thumbwheels” in Chapter 6.)

• To move the viewpoint forward, press the left and middle mouse buttons
simultaneously and move the mouse downwards. To move the viewpoint
backward, press the left and middle mouse buttons simultaneously and move the
mouse upwards. This is equivalent to the functions provided by the Dolly
thumbwheel.

Select Mode

In select mode, you can move a 3-D pick dragger through the volume in order to display
information about regions in the scene. This pick dragger is composed of a cylinder and
a square. If you pick on the cylinder and drag, motion is constrained to be parallel to the
cylinder’s axis. If you pick on the square and drag, motion is constrained to the plane
defined by the square. You can cycle through the three possible orientations of the pick
dragger by pressing the Control key with the cursor over the dragger. (You need not press
the mouse button.) In the case of dragging the square portion of the dragger, you can use
the Shift key to constrain the motion along one of the two axes within the plane.
Alternatively, each axis has a disk that aligns with the pick dragger position. Moving the
disk on an axis moves the dragger, and vice versa.
The dragger lets you pick within a dense cloud of points, freeing you from the limitation
of having to pick regions on the surface.
When the pick dragger is over data, the cylinder changes its color to that of the splat
under it, and information about that region appears at the top of the view area
(Figure 8-6). If no data is present, the cylinder remains light gray, and information about
its position is displayed at the top of the render area for aid in navigation.
When you are finished dragging, and have released the mouse button, the message for
the splat you are currently over is shown in the Pick Window at the top. This pick
information is updated if the animation slider is moved. Using the mouse, you can cut
and paste this selection information into other applications, such as reports or databases.
The pick dragger may be removed from the scene by unchecking Selection > Show Pick
Dragger.

Figure 8-6 Pick Dragger Over Data

The information is displayed when the pick dragger is over the object.
Note: Users familiar with Open Inventor can configure the Splat Visualizer so that the
right mouse button brings up the standard Inventor Menu. This provides additional
functions, such as stereo viewing and spin animation. These functions are provided by
the Open Inventor library. To enable the Open Inventor Menu, add the line
*minesetInventorMenu:TRUE

to your .Xdefaults file.

External Controls
Several external controls surround the main window, including buttons and
thumbwheels. These controls are substantially the same as those in the other
MineSet visualization tools and are described in “Buttons” and “Thumbwheels”
in Chapter 6.

The Animation Control Panel
The animation control panel, which appears to the right of the main window, consists of
a summary window, with up to two adjacent sliders, an information field, animation
buttons, and animation sliders.

Sliders Controlling Independent Dimensions
The number of sliders appearing adjacent to the summary window is dependent on the
slider mappings specified in the configuration file. Datasets can have two, one, or no
independent dimensions.
Datasets With Two Independent Dimensions

If the dataset has two dimensions of independently varying data (such as
adultJobs2.splatviz), the controls to the right of the main graphics window become visible
(see Figure 8-7).

Figure 8-7 Animation Control Panel With Summary Window and Both Slider Controls

To the right of the main window are the 2-D summary window and slider controls. The
summary window has a horizontal slider below it for selecting data points of the first
independent dimension, and a vertical slider to the left for selecting data points of the
second independent dimension. The horizontal slider’s dimension is identified by a label
below it. The vertical slider’s dimension is identified by a label above it.
Datasets with One Independent Dimension

For datasets with one independent dimension (such as adultJobs.splatviz), only the slider
below the summary window appears, and the summary window is compressed (see
Figure 8-1). This slider’s dimension is identified by a label below it.
Datasets With No Independent Dimension

For datasets with no independent dimensions (such as mushroom.splatviz), no slider
control appears (see Figure 8-8). In this example, the splats that are neither completely
red nor completely blue indicate that both poisonous and edible mushrooms are plotted
at that location.

Figure 8-8 Splat Visualizer Without Independent Dimension or an Animation Control Panel

The Summary Window
The summary window provides a 2-D representation of the aggregation of values that
the main window displays in 3-D. The whiter the areas of the summary window, the
lower the summary value represented by the splats in the main window. The greater the
color density in areas of the summary window, the higher the summary values. The
summary value is either the total weight of data at that slider position, or the weighted
average of the column that was mapped to summary. The density of these colors in the
summary window provides a summary of the data across the one or two independent
dimensions in the dataset. If no column is explicitly mapped to summary, count is used
to show which positions on the slider represent the most data.
By default, the summary window also contains a set of black dots, evenly spaced across
the one or two dimensions of data. These dots indicate the precise positions of the
discrete datapoints. You can turn off these black dots by unchecking the box at the
bottom of the summary slider window. Slider positions between these positions use
interpolation of the underlying data to produce an image.
Color Density in the Summary Window

After opening the adultJobs.splatviz file, for example, the 2-D summary window shows a
color range from white (on the left) to red (in the middle) to white (on the right). Red
represents more records (12,838 in this case), while white represents fewer records
(3,606). In this example, the greater density of red in the middle of the slider means
that the highest concentration of people is in the 20-50 age range.
Creating a Path in the Summary Window

If the dataset loaded into the Splat Visualizer has at least one independent dimension, it
is possible to view all or any part of that dataset via animation. This is done by first
creating a path in the summary window (this path connects a sequence of data points),
then activating the animation controls described in the next section.

The three ways to draw a path in the summary window are:
• Define a starting point by clicking and holding down the left mouse button, then
draw a path by dragging the cursor over the window. End the path by releasing the
left mouse button.

• Define a starting point by clicking the left mouse button, then define an endpoint by
moving the cursor to another part of the window and clicking the middle mouse
button. A path appears between those two endpoints. To add more line segments,
continue with repeated middle mouse clicks.

• Define a starting point by clicking the left mouse button, then drag one of the
independent dimension sliders, thus drawing a straight line along this dimension.
If there are two sliders, use of the second slider causes a straight line to be drawn
along the axis controlled by this second slider.

Animation Buttons and Sliders
The animation panel is identical to that of the Map Visualizer; see “Animation Buttons
and Sliders” in Chapter 6. The behavior described in the following section, however,
is different.
Slider Data Points and Interpolation

As animation proceeds, the size and color of the splats change smoothly. The information
displayed in the message box field shows the interpolated data values. When the slider
motion stops, the slider position snaps to the nearest discrete data position where
interpolated data values are not used.
There is a table for each binned position on the summary slider. Each row in one of these
tables (which is an aggregate of original data) defines a splat in the scene. Tables
corresponding to adjacent bins on the summary need not have the same number of rows
because of the differences in data distribution from one position to the next. For example,
if we change the visualization in Figure 8-1 from showing 40-50 year-olds to one showing
50-60 year-olds by moving the slider one notch to the right (see Figure 8-9), some
positions might show splats where there were none before, and vice versa.

Figure 8-9 Changed Visualization as a Result of Moving the Slider (Compare to Figure 8-1)

For interpolation on a one dimensional slider, two adjacent tables are merged, then
aggregated using the spatial columns as unique keys. The count is simply interpolated
(0 count is assumed if one of the tables lacks a particular row). The average value used
for color is also interpolated, but weighted by the count.
Example 8-3

(This example describes technical details of the interpolation process.) Suppose we want
to show an image that represents an interpolation between the tables for the 40-50
year-olds and the 50-60 year-olds on the external slider. Let Table 8-1 and Table 8-2 be the
tables for age=40-50 and age=50-60, respectively, for the two slider positions.
Table 8-1 Ages 40 to 50

education   occupation   hours_worked   income   count
HS-grad     Exec-Man.    15-25          25000    2
HS-grad     Mach-op      15-25          30000    1
Masters     Technician   25-35          35000    3

Table 8-2 Ages 50 to 60

education   occupation   hours_worked   income   count
HS-grad     Exec-Man.    15-25          70000    1
Vocational  Mach-op      35-45          40000    2

This is how the Splat Visualizer performs the interpolation, where t is the fractional
position of the slider between the two tables (t=0 at Table 8-1 and t=1 at Table 8-2).
For Table 8-1, a new count column equal to (1-t)(count) and a new weighted value
column equal to (1-t)(count)(value) are added. For Table 8-2, a new count column equal
to (t)(count) and a new weighted value column equal to (t)(count)(value) are added.
The two tables are merged together.
The merged table is aggregated using the spatial axes columns as keys, and sum
aggregating the two new columns. This ensures that no two rows have the same binned
values for all the spatial axes. Finally, the summed weighted value is divided by the
summed count to get the interpolated values. In this case, the interpolated values are
for income. If t=.5, the resulting table would be Table 8-3.

Table 8-3 Interpolation Midway Between Table 8-1 and Table 8-2

education   occupation   hours_worked   income   count
HS-grad     Exec-Man.    15-25          40000    1.5
HS-grad     Mach-op      15-25          30000    .5
Masters     Technician   25-35          35000    1.5
Vocational  Mach-op      35-45          40000    1

If the external query slider has two dimensions, bilinear interpolation is used.
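
The one-dimensional interpolation of Example 8-3 can be reproduced with a short Python sketch. It is an illustration of the described steps, not the Splat Visualizer's implementation: the function name and the tuple layout of the rows are hypothetical, and dropping rows whose interpolated count is zero is an assumption.

```python
from collections import defaultdict

def interpolate_tables(table_a, table_b, t):
    """Blend two slider tables; rows are (education, occupation, hours, income, count)."""
    sum_count = defaultdict(float)
    sum_weighted = defaultdict(float)
    # table_a is scaled by (1 - t) and table_b by t, then both are merged.
    for weight, table in ((1 - t, table_a), (t, table_b)):
        for education, occupation, hours, income, count in table:
            key = (education, occupation, hours)  # the spatial axis columns
            sum_count[key] += weight * count
            sum_weighted[key] += weight * count * income
    # Divide the summed weighted value by the summed count for each key.
    return {key: (sum_weighted[key] / sum_count[key], sum_count[key])
            for key in sum_count if sum_count[key] > 0}

table_8_1 = [
    ("HS-grad", "Exec-Man.", "15-25", 25000, 2),
    ("HS-grad", "Mach-op", "15-25", 30000, 1),
    ("Masters", "Technician", "25-35", 35000, 3),
]
table_8_2 = [
    ("HS-grad", "Exec-Man.", "15-25", 70000, 1),
    ("Vocational", "Mach-op", "35-45", 40000, 2),
]
midway = interpolate_tables(table_8_1, table_8_2, 0.5)
for key, (income, count) in midway.items():
    print(*key, income, count)
```

With t = 0.5 the four rows printed match the income and count columns of Table 8-3. A row that exists in only one table contributes a count of 0 from the other, so its income is unchanged and only its count is scaled.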
This census dataset contains nearly 150,000 rows. The purpose of the external slider is to
allow navigation through, and show summary information for, additional dimensions in
the data. The red regions represent places where the summary value is high; white shows
areas where it is low. When the slider is positioned over a black point, the image shows
uninterpolated data. You can trace a path on the slider and animate it using the VCR
control panel below the slider.
To show how animation is produced, assume you have data for 8 years, 1990-1997 (that
is, eight data points in the summary window). Let’s examine how one splat changes as
the slider is moved from one year to the next. Assume that in 1990 a splat at a given
position has a value of 20 (to be mapped to color) and a count of 2. Assume further that in
1991 that same splat has a value of 40 and a count of 200. The splat in year 1991 is much
more opaque than the one in 1990 because it represents an aggregation of many more
records (or of much more heavily weighted records). As you move the year slider from
1990 to 1991, the count is linearly interpolated between 2 and 200. The
value is computed by taking an average of the two values weighted by record counts (or
weights). For example, midway between 1990 and 1991, the count is 101, and the value
is ((1-.5)*2*20+.5*200*40)/((1-.5)*2+.5*200) = 39.8. As you approach 1991, the value
approaches 40. You cannot stop an animation between discrete data points, and you
cannot drag the Path slider to a stationary position between discrete data points.
The data points in the summary window represent the slider positions corresponding to
the actual data from the data file. For example, values 20 and 40 represent aggregations
of actual data, but the value 39.8 does not.
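
To see how strongly the larger count dominates the interpolated value, the single-splat case above can be evaluated at several slider positions. This Python sketch simply applies the formula from the text; the function name is hypothetical, and it assumes the interpolated count is never zero.

```python
def interpolate_splat(value0, count0, value1, count1, t):
    """Return (value, count) at fraction t between two adjacent slider positions."""
    count = (1 - t) * count0 + t * count1
    value = ((1 - t) * count0 * value0 + t * count1 * value1) / count
    return value, count

# 1990: value 20, count 2; 1991: value 40, count 200.
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    value, count = interpolate_splat(20, 2, 40, 200, t)
    print(f"t={t:.2f}  count={count:6.1f}  value={value:.2f}")
```

Because the 1991 splat carries 100 times the weight of the 1990 splat, the value moves most of the way toward 40 early in the transition instead of changing at a constant rate.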

Pulldown Menus
Five pulldown menus let you access additional Splat Visualizer functions. These are
labeled File, View, Selection, Splat Type, and Help. If you start the Splat Visualizer
without specifying a configuration file, only the File and the Help menus are available.
The File and Help menus are the same as found in other MineSet visualization tools. For
a description see “The File Menu” in Chapter 5, and “The Help Menu” in Chapter 5.

The View Menu
The View menu lets you control certain aspects of what is shown in the Splat Visualizer
window.

Figure 8-10 Splat Visualizer View Menu

• Show Window Decoration lets you hide or show the external controls around the
main window.

• Show Null Positions lets you hide or show splats that have null or unknown position
values along one or more axes.

• Show Animation Panel lets you show or hide the animation control panel. This menu
item is disabled for datasets with no independent dimension.

• Show Filter Menu brings up a filter panel (Figure 8-11) that lets you reduce the
number of splats displayed in the main viewing area, based on one or more criteria.
You can use the filter panel to fine-tune the display, emphasize specific information,
or simply shrink the amount of information displayed. Columns other than those
mapped to axes, sliders, opacity, and color are not available for filtering because
they are removed during aggregation. The Scale to filter checkbox, which appears in
the lower right of the filter panel, lets you specify whether the landscape in the main
window covers the entire dataset or just the filtered data.

• Set Background Color brings up a color chooser to let you specify a new background
color.

Figure 8-11 Splat Visualizer Filter Panel

The Filter panel has two panes. The top pane lets you filter based on string columns.
To select all values of a column, click Set All. To clear the current selections, click
Clear. To select a value, click it. To deselect a value, simply click it again.
The bottom pane lets you filter based on the values of both string and numeric
columns.
To filter numeric values, enter the value, and select a relational operation (=, !=, >, <,
>=, <=). To filter alphanumeric values, enter the string. You can use any of three
types of string comparisons:
• Contains indicates that the value contains the specified string. For example,
“California” contains the strings “Cal” and “forn”.

• Equals requires the strings to match exactly.

• Matches allows wildcards:
  – An asterisk (*) represents any number of characters.
  – A question mark (?) represents one character.
  – Square brackets ([ ]) enclose a list of characters to match.

For example, California matches Cal*, Cal?fornia, and Cal[a-z]fornia.
For columns which were binned, an option menu of values appears, instead of a
text field. To ignore that column, select Ignored in the Option menu. You can use
relational operators, such as >=, with these options. This means that the specified
value as well as subsequent ones are selected.
In addition to numeric and string comparison operations, you can specify Is Null,
which is true if the value is null.
To the right of each field is an additional option menu that lets you specify “And” or
“Or” options. For example, you could specify “sales > 20 And < 40.” You can have
any number of And or Or clauses for a given column, but cannot mix And and Or in
a single column.
Scale to Filter lets you specify whether the filtered landscape is rescaled to the size of
the filtered data or remains the size of the entire data set.
Click the Filter button to start filtering. If you press Enter while the panel is active,
filtering starts automatically.
Click the Close button to close the panel.
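
The three string comparisons offered by the filter panel behave much like the following Python sketch, which uses the standard fnmatch module for the wildcard case. This is an analogy for experimenting with patterns, not the filter panel's own code; the function names are hypothetical, and treating the comparisons as case-sensitive is an assumption.

```python
from fnmatch import fnmatchcase

def contains(value, text):
    """Contains: the value includes the given substring."""
    return text in value

def equals(value, text):
    """Equals: the strings match exactly."""
    return value == text

def matches(value, pattern):
    """Matches: shell-style wildcards (*, ?, [ ]) applied to the whole value."""
    return fnmatchcase(value, pattern)

print(contains("California", "forn"))
print(equals("California", "Cal"))
for pattern in ("Cal*", "Cal?fornia", "Cal[a-z]fornia"):
    print(pattern, matches("California", pattern))
```

Each of the three patterns from the example above matches California, while Equals with the partial string Cal does not.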

The Selection Menu
The Selection menu (Figure 8-12) lets you drill through to the underlying data. The menu
has six items.

Figure 8-12 The Splat Visualizer’s Selection Menu

• Create Box Selection creates a 3-D box selector that can be stretched and translated to
select regions of the volume. While the box selector is active, a window in Record
Viewer format is opened showing information about all of the aggregated data that
is represented by the splats within it (see top of Figure 8-13). Closing this window
clears the current box selection(s). Selecting this option again creates a new box
selection, making the previous selection fixed. The fixed-selection boxes are gray,
while the active one is light yellow (see Figure 8-13). The selected bins, shown in the
selection window, are the bins enclosed by the union of all the selection boxes.

Figure 8-13 Image With Fixed Selection Box (Gray) and Active Selection Box (Yellow)

To translate the active selection box, click on one of the faces with the left mouse
button, and drag it in the desired direction. Holding the Shift key while dragging
constrains the motion to the axis to which the drag motion is closest. To change the
extent of the selection box, drag one of the gray scale tabs in the desired direction.
You cannot resize or translate the box beyond the bounds of the volume. The
gray scale tabs continually resize to maintain a constant screen size. If at any time they
appear too big, you can zoom in closer, and they reduce their size relative to the
box.

• Show Original Data retrieves and displays the records corresponding to what has
been selected via Box Selection(s). The resulting records are shown in a table viewer.

• Send To Tool Manager inserts a filter operation, based on the current box selection(s),
at the beginning of the Tool Manager history. The actual expression used to do the
drill-through is determined by the extents of the current box selection(s). If nothing is
selected, a warning message appears.

• Use Slider On Drill-Through determines whether or not to use the slider position
when creating the drill-through expression. If checked (default), an additional term
is added to the drill-through expression, limiting the drill-through to those records
defined by the slider’s position. If this option is not checked, no such limiting term
is added.

• Complementary Drill Through causes the Show Original Data and Send To Tool Manager
selections, when used, to fetch all the data that are not selected.

• Show Pick Dragger toggles the visibility of the pick dragger (on by default). The pick
dragger is removed when a box selection is started, but it can be made active at the
same time that a box selection is active.

For further details on drill-through, see Chapter 18, “Selection and Drill-Through.”


Splat Type Menu
Splats are used in this tool to model clouds of small points (see Lee Westover, “Footprint
Evaluation for Volume Rendering” in Proceedings of SIGGRAPH ’90, Vol. 24, No. 4, pages
367-376).
The Splat Type menu lets you change the method for drawing the splats. You can choose
to exchange accuracy for interactivity. Every method approximates an ideal Gaussian
density, and texture splats are the most accurate representation of it. Since most computers
support hardware-assisted texturing well, the texture splat is usually the best choice.
Among SGI platforms, only the Indy or earlier systems are restricted to the slower
software implementation. The three splat types are:
•

Linear draws a small set of triangles to give a linear approximation to a Gaussian
splat.

•

Gaussian draws a large set of triangles to approximate a Gaussian splat.

•

Texture uses a texture mapped rectangle to give the most accurate representation.
This can be very slow on machines that don’t support hardware-assisted texture
mapping.

Alternatively, the following opaque primitives are allowed.

•

Sphere draws an opaque sphere, the radius for which varies with the cube root of the
count (or weight).

•

Cube draws an opaque cube.

•

Diamond draws an opaque diamond.

Sample Configuration and Data Files
The provided sample data and configuration files demonstrate the Splat Visualizer’s
features and capabilities. The following files are in the /usr/lib/MineSet/splatviz/examples
directory:
•

mushroom
The mushroom.data file contains pre-aggregated data concerning more than 5,000
mushrooms. The group-by columns were odor, gill_color, and cap_color. For every
combination of these three columns in the original data, there is a count and an
average edibility, where 0 is edible and 1 is poisonous. An average edibility
between 0 and 1 means some of the mushrooms in that aggregate are edible and
some are poisonous, since mushrooms cannot be partially poisonous.
The visualization (Figure 8-8) shows that the unique values for each of these
columns have been sorted along the axes according to average edibility. Odor is
clearly the best determinant of edibility. Also note that most splats are either all 0 or
all 1, meaning these three columns are useful in segmenting the two classes of
mushrooms. Lower the opacity slider to determine which splats have the highest
counts. The most opaque splat represents 288 mushrooms having common values
for odor, gill_color, and cap_color. To confirm this, try filtering based on
sum_count_poison>280 and picking the remaining splats to see their counts.
Note that all mushrooms with gill_color=buff are poisonous.

•

adultJobs
The adultJobs.data file was derived from adult94, a dataset provided with the
distribution. It was created using an aggregation that grouped by education,
occupation, hrs_worked_per_week(binned), and age (binned). The gross_income
column was aggregated by count and average. For a display using the Splat
Visualizer (Figure 8-1), age_bin was mapped to a slider, while the other group-by
columns were mapped to axes. The count_gross_income column was mapped to
opacity, and avg_gross_income was mapped to color.
When the slider is in the left-most position, the color of the plot is almost entirely
blue. This means that regardless of occupation, education, or number of hours
worked, people younger than 20 have low incomes. Move the slider to the right,
and note how incomes rise faster for higher education and occupations toward the
end of the axis. The opacity variation shows that the most common types of
education are HS, some college, and Bachelors degree.
Moving the Summary slider shows how the distribution of income changes with
respect to the axis columns as people age.


•

adultJobs2
The adultJobs2 file is also based on the adult94 dataset. Here, the axis columns are
working_class, education, and occupation. The two columns mapped to sliders are
age(binned) and hours_worked_per_week(binned). Again, income was aggregated
by count and average for use with opacity and color, respectively. Since there are
more positions on the 2D slider, there are fewer records represented by each
position. This causes greater variation of color and opacity. The red region in the
center of the hrs_per_week dimension of the Summary slider shows that nearly
everyone works between 35 and 45 hours per week (see Figure 8-7). Note that some
occupations are aligned with specific working classes. For example, everyone in the
Armed-forces has Fed-Government for their working class.

•

censusIncome
This example is based on a dataset that is similar to adult94 but was not included with
the distribution because of its size. In an attempt to understand the differences between
gross income and total income, gross_income, total_income, and hrs_per_week
have been mapped to axes. Color shows age. By studying the image we can learn
that there are many records where total_income=gross_income, but there is also a
large portion of records with high total_income, but 0 gross_income. It is
surprising that in many cases gross_income is greater than total_income.
Note where the people of different ages are concentrated. Many old people (yellow)
are in the hrs_per_week=0 plane. They are probably retirees. Many children and
young adults (blue) are in the line gross_income=total_income=0. Note the fairly
opaque splats near the outside edges of the volume. These positions include all
points that fell in the maximum bin shown for an axis. For example, the highest bin
for total_income is 70300+. Any point higher than 70300 goes in this bin.
To better see the varying density, adjust the opacity slider. At low opacity scales, the
diagonal lines show that for most people gross_income=total_income, or they have
just total_income and no gross_income. As you raise the scale, you can see that
almost the entire volume contains data. This dataset contains 150,000 records.


•

churn
Churn is when a customer leaves one company for another. This example shows
customer churn for a telephone company. The data used to generate this example is
in /usr/lib/MineSet/data/churn.schema.
Using column importance, we found that total_day_charge,
number_customer_service_calls, and international_plan were important
discriminators. These columns were mapped to axes. We then created a new
numeric column, churn, which equals churned==Yes, and mapped it to color.
In the resulting visualization, red areas of the volume indicate high churn. The region
with three or more customer service calls and a low total_day_charge shows high
churn. You might want to weight big-spending customers more
heavily than others. To do this, create a new column, total_charge, equal to
`total_day_charge`+`total_eve_charge`+`total_night_charge`

or some power of this sum. Then map this total_charge column to opacity. This
means every record is weighted by total_charge. Now the visualization shows
additional areas of interest near the high end of the total_day_charge axis.
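The derived columns described above can be sketched outside MineSet as follows. This is an illustration under stated assumptions: the sample records are invented, the helper name total_charge and its power parameter are hypothetical, and the weighted average at the end only assumes that a weight acts as a per-record multiplier; it does not reproduce how the Splat Visualizer computes opacity or color.

```python
def total_charge(record, power=1.0):
    """Sum the three charge columns, optionally raised to a power."""
    total = (
        record["total_day_charge"]
        + record["total_eve_charge"]
        + record["total_night_charge"]
    )
    return total ** power


# Invented sample records, for illustration only.
records = [
    {"total_day_charge": 45.0, "total_eve_charge": 17.0,
     "total_night_charge": 11.0, "churned": "Yes"},
    {"total_day_charge": 20.0, "total_eve_charge": 15.0,
     "total_night_charge": 9.0, "churned": "No"},
]

for r in records:
    # New numeric column: 1 when churned == Yes, otherwise 0.
    r["churn"] = 1 if r["churned"] == "Yes" else 0
    # New weight column, as in the expression above.
    r["total_charge"] = total_charge(r)

# Churn rate with every record weighted by total_charge (assumed semantics).
weighted_churn = (
    sum(r["churn"] * r["total_charge"] for r in records)
    / sum(r["total_charge"] for r in records)
)
unweighted_churn = sum(r["churn"] for r in records) / len(records)
print(unweighted_churn, weighted_churn)
```

In this invented sample, the churned customer spends more, so the weighted rate is higher than the unweighted one.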

Chapter 9

9. Using the Rules Visualizer

This chapter discusses the components and capabilities of the Rules Visualizer. It first
provides an overview of this data mining and visualization tool, then it explains this
tool’s functionality when working with the
•

main window

•

external controls

•

pulldown menus

Finally, it lists and describes the provided sample files for these tools.

Overview of Rules Visualizer
The Rules Visualizer gives you the power to mine data by constructing, verifying, and
graphically representing models of patterns in large databases. These patterns are
expressed via association rules, which indicate the frequency of items occurring together
in a database.
Discovering and graphically displaying association rules can be relevant to many
enterprises, including supermarket inventory planning, shelf planning, and attached
mailing in direct marketing.
The tool execution scenario described in Chapter 1 of this document (see Figure 1-1) is
slightly modified for the Rules Visualizer. First, the “raw” data in your database must be
converted into a specially formatted file that can be processed by the association rules
generator part of the Rules Visualizer. When the association rules generator has
processed this file, the results can be displayed by the rules visualizer part of this tool.


Chapter 9: Using the Rules Visualizer

Thus, the Rules Visualizer consists of three operations:
1. Data conversion. The association data converter processes a “raw” data file and
creates a file usable by the association rules generator.
2. Association rules generation. The data file created by the association data converter
is processed by the association rules generator, which creates a file usable by the
rules visualizer.
3. Rules visualization. This operation displays the generated association rules.
In addition to the input data and rules file requirements, each operation requires a
configuration file that specifies operational parameters.
The sequence of actions by the user, at the user’s workstation, and at the host server is
shown schematically in Figure 9-1. The phases indicated at the right of the illustration
correlate to the operations listed above.

Figure 9-1 Execution Sequence of the Rules Visualizer
(Diagram labels: the user and Tool Manager on the client workstation; on the host server, the user’s data source or Data Mover, the configuration file, the format file, the “raw” data file, the association data converter, the binary (flat) data file, the association rules generator, the rules file, and the Rules Visualizer display. Phases: data conversion, rules generation, rules visualization.)

Data Conversion
The association data converter takes a “raw” data file, such as one resulting from a
database query, and creates a binary data file in the format used by the association rules
generator. The internal format of this generated file allows optimum processing by the
rules generator. The data converter also accepts input from flat files as well as databases.

Association Rules Generator
One example of applying the association rules generator is to obtain “market basket”
data for customer buying patterns. Here, “market basket” is the set of items bought by
each customer on a single visit to a store. An example rule in this context might be: “80%
of the people that buy diapers buy baby powder.” This percentage is known as the
predictability of the rule.
In the example, “diapers” is the item on the left-hand side (LHS) of the rule, and “baby
powder” is the item on the right-hand side (RHS) of the rule.
Some applications of these rules are as follows:
•

If “Fizzy Pop” appears on the RHS, the LHS can help us determine what the store
should do to boost sales of this beverage.

•

If “Bagels” appears on the LHS, the RHS can help us determine what products
might be affected if the store no longer sells bagels.

The association rules generator part of this tool processes an input file, then generates an
output file consisting of the rules. If X and Y are items in a record, then a rule such as
X⇒Y
indicates that whenever X occurs in a record, expect Y to occur with some frequency.


Components of a Generated Association Rule

The strength of the association is quantified by three numbers. The first number, the
predictability of the rule, quantifies how often X and Y occur together as a fraction of the
number of records in which X occurs. For example, if the predictability is 50%, X and Y
occur together in 50% of the records in which X occurs. Thus, knowing that X occurs in
a record, expect that 50% of the time Y occurs in that record.
The second number, the prevalence of the rule, quantifies how often X and Y occur
together in the file as a fraction of the total number of records. For example, if the
prevalence is 1%, X and Y occur together in 1% of the total number of records.
You can specify a minimum prevalence threshold for the generated rules. The default
minimum prevalence threshold is 1%. The lower the minimum prevalence, the more
rules are generated, and the slower the performance of the tool might be. You can also
specify a minimum predictability threshold for the generated rules. The minimum
predictability threshold default is 50%.
Rules that meet a minimum prevalence threshold are important for two reasons:
1. A rule might have business value only if a reasonably significant fraction of records
support the rule. For example, if everyone who buys caviar also buys vodka, the
rule Caviar ⇒Vodka has 100% predictability. However, if only a handful of people
buy caviar, the rule might be of limited value to the retailer.

2. A rule might not be statistically significant if a very small number of records
support the rule. The rule might be due to chance, and it would not be prudent to
make decisions based on such a rule.
The third number is expected predictability. The expected predictability is the frequency of
occurrence of the RHS items. So the difference between expected predictability and
predictability is a measure of the change in predictive power due to the presence of the
LHS items. Expected predictability gives an indication of what the predictability would be
if there were no relationship between the items.
The Association Rules generator does not report rules in which the predictability is less
than the expected predictability. In other words, a rule such as A ⇒ B is not reported if
B occurs less often in the records that contain A than it does in the records overall.
Note: Given just Y and a rule of the form X ⇒ Y, nothing is known about X. Rules specify
implications only from the LHS to the RHS.

Table 9-1 summarizes the three numbers that quantify the strength of each association
rule.
Table 9-1 Association Rules Components

Measure                  Description
Prevalence               Frequency of LHS and RHS occurring together.
Predictability           Fraction of the records with LHS that also contain RHS;
                         that is, the prevalence divided by the frequency of
                         occurrence of LHS items.
Expected Predictability  Frequency of occurrence of RHS items.
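
The three measures can be computed from a list of records with a short sketch like the one below. It follows the definitions given in this section. The function name rule_measures and the sample baskets are invented for illustration, and the code is not the algorithm used by the association rules generator.

```python
def rule_measures(records, lhs, rhs):
    """Compute the three measures for the rule lhs => rhs.

    records is a list of sets of items; lhs and rhs are sets of items.
    """
    total = len(records)
    lhs_count = sum(1 for r in records if lhs <= r)
    rhs_count = sum(1 for r in records if rhs <= r)
    both_count = sum(1 for r in records if (lhs | rhs) <= r)
    if total == 0 or lhs_count == 0:
        raise ValueError("need at least one record containing the LHS")
    return {
        # LHS and RHS together, as a fraction of all records.
        "prevalence": both_count / total,
        # LHS and RHS together, as a fraction of records with the LHS.
        "predictability": both_count / lhs_count,
        # RHS alone, as a fraction of all records.
        "expected_predictability": rhs_count / total,
    }


# Invented market-basket data: 10 visits to a store.
baskets = [
    {"diapers", "baby powder"},
    {"diapers", "baby powder", "milk"},
    {"diapers", "baby powder"},
    {"diapers", "baby powder", "bread"},
    {"diapers", "milk"},
    {"milk", "bread"},
    {"bread"},
    {"baby powder"},
    {"milk"},
    {"bread", "milk"},
]

m = rule_measures(baskets, {"diapers"}, {"baby powder"})
print(m)
# {'prevalence': 0.4, 'predictability': 0.8, 'expected_predictability': 0.5}
```

In this sample, 80% of the baskets with diapers also contain baby powder, while baby powder appears in only 50% of all baskets, so the presence of diapers raises the predictability.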

Hierarchical Data

The rules generator also works on hierarchical data, which includes a component that
relates (or maps) data to new data at varying degrees of generality. The ability to handle
hierarchical data allows rules to be generated at the desired level of generality.
For example, consider the hierarchy shown in Table 9-2. This hierarchical information, in
addition to the “market basket” data that lists the products purchased in each record,
allows rules to be generated at four levels. In contrast to rules learned at the lowest level,
which relate specific products to each other, a rule at the highest level might be “Milk
implies Bread.”
Table 9-2 Example of Hierarchical Levels

Level                      Example
Product Group              Milk
Category                   Non-Refrigerated Milk
Brand                      Lucerne®
Product ID (UPC/SKU Code)  1 pint can of Premium Condensed Milk
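
One way to picture rule generation at a chosen level of generality is to replace each product in a record by its ancestor at that level before looking for rules. The sketch below assumes a simple lookup table. The SKU codes, the bread entries, and the names HIERARCHY and generalize are hypothetical, and the actual mapping and description file formats are those described in Appendix F, not this dictionary.

```python
# Hypothetical hierarchy: each product ID maps to its ancestors.
HIERARCHY = {
    "SKU-1001": {
        "brand": "Lucerne",
        "category": "Non-Refrigerated Milk",
        "product_group": "Milk",
    },
    "SKU-2002": {
        "brand": "ExampleBakery",
        "category": "Sliced Bread",
        "product_group": "Bread",
    },
}

LEVELS = ("product_id", "brand", "category", "product_group")


def generalize(basket, level):
    """Rewrite a basket of product IDs at the requested hierarchy level."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    if level == "product_id":
        return set(basket)
    return {HIERARCHY[item][level] for item in basket}


basket = {"SKU-1001", "SKU-2002"}
print(sorted(generalize(basket, "product_group")))  # ['Bread', 'Milk']
print(sorted(generalize(basket, "category")))
# ['Non-Refrigerated Milk', 'Sliced Bread']
```

Rules found on the generalized baskets relate groups, such as "Milk implies Bread", rather than specific products.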

Rules Visualization
The rules visualization part lets you graphically display and explore the generated
association rules. The rules are presented on a grid landscape, with left-hand side (LHS)
items on one axis, and right-hand side (RHS) items on the other. As shown in Figure 9-2,
attributes of a rule are displayed at the junction of its LHS and RHS item. The display can
include bars, disks, and labels.

Figure 9-2 Detail View of the Rules Visualizer’s Main Window (callouts: Bar, Disk, Label)

If the displayed view is too small, item labels do not appear on the side of the axes. You
can zoom in on the view until the item labels appear (see the Dolly description in
“Thumbwheels” in Chapter 6).
A legend indicating the mapping between displayed attributes (such as bar heights and
colors) and the values associated with the underlying rules (such as predictability and
prevalence) can be displayed at the bottom of the main window.
The Tool Manager interface for associations allows you to run the Rules Visualizer
without running associations, by using the Visual Tools menu from Tool Manager’s main
window.


File Requirements
Each of the Rules Visualizer’s three components has its own file requirements. These are
detailed in the following subsections.
Files Required by the Association Data Converter Part

•

A “raw” data file that results from extracting raw data from a source (such as a
relational database). This file is processed by the association data converter to
produce the internal binary data file used by the association rules generator.

•

A format file that specifies the format of the data file. If the internal binary data file
(see next subsection) is created via the Tool Manager, this format file is created
automatically. If the internal binary data file is created via the command line, this
format file must be created manually (see Appendix F, “Creating Data and
Configuration Files for the Rules Visualizer”).

Files Required by the Association Rules Generator Part

•

An internal binary data file, which results from running the association data
converter on your original data.
If you have hierarchical data, the association rules generator also requires the
following two files:

•

A mapping file, which specifies the mapping between hierarchical levels.

•

A description file, which specifies a string description for each item at a specific
hierarchical level.

Files Required by the Rules Visualization Part

•

A rules file that results from running the association rules generator.

•

A .ruleviz configuration file that specifies parameters used by the rules visualizer
program (such as mapping colors to prevalence values) when displaying the
generated rules. This file is easily created using the Tool Manager (see Chapter 3).
You also can use an editor (such as jot, vi, or Emacs) to produce this file (see
Appendix F, “Creating Data and Configuration Files for the Rules Visualizer”).
These configuration files must have a .ruleviz extension.


Starting the Rules Visualizer
The Rules Visualizer has three components. The following subsections describe the
procedure for starting each one.
Starting the Association Data Converter Part

There are two ways to start the association data converter part of the Rules Visualizer:
•

Use the Tool Manager to configure and start the data converter. (See Chapter 3 first
for details on most of the Tool Manager’s functionality, which is common to all
MineSet tools; see below for details about using the Tool Manager in conjunction
with the data converter.)

•

Enter the following command at the UNIX shell command-line prompt:
assoccvt parameters

The parameters are described in Appendix F, “Creating Data and Configuration Files
for the Rules Visualizer.”
Starting the Association Rules Generator Part

There are two ways to start the association rules generator part of the Rules Visualizer:
•

Use the Tool Manager to configure and start the association rules generator. (See
Chapter 3 first for details on most of the Tool Manager’s functionality, which is
common to all MineSet tools; see below for details about using the Tool Manager in
conjunction with the association rules generator.)

•

If the data with which you are working is non-hierarchical, enter this command at
the UNIX shell command line prompt:
assocgen parameters

If your data is hierarchical, enter this command at the UNIX shell command-line
prompt:
mapassocgen parameters

The parameters for both instances are described in Appendix F, “Creating Data and
Configuration Files for the Rules Visualizer.”


Starting the Rules Visualization Part

There are five ways to start the rules visualization part of this tool:
•

Use the Tool Manager to configure and start the Rules Visualizer. (See Chapter 3
first for details on most of the Tool Manager’s functionality, which is common to all
MineSet tools; see below for details about using the Tool Manager in conjunction
with the Rules Visualizer.)

•

Double-click the Rules Visualizer icon, which is in the MineSet page of the icon
catalog. The icon is labeled ruleviz. Since no configuration file is specified, the
start-up screen requires you to select one by using File > Open.

•

If you know what configuration file you want to use, double-click the icon for that
configuration file. This starts the Rules Visualizer and automatically loads the
configuration file you specified. This only works if the configuration filename ends
in .ruleviz (which is always the case for configuration files created for the Rules
Visualizer via the Tool Manager).

•

Drag the configuration file icon onto the Rules Visualizer icon. This starts the Rules
Visualizer and automatically loads the configuration file you specified. This works
even if the configuration filename does not end in .ruleviz.

•

Enter this command at the UNIX shell command-line prompt:
ruleviz [ configFilename ]

When starting the rules visualization part of this tool, you must specify the configuration
file, not the data or rules file.
Option for Invoking the Rules Visualizer

The -quiet option eliminates the dialogs that pop up to indicate progress. You can enable
this option permanently by adding the line
*minesetQuiet:TRUE

to your .Xdefaults file.


Configuring the Rules Visualizer Using the Tool Manager
This section describes how the components of the Rules Visualizer can be configured
using the Tool Manager. Although the Tool Manager greatly simplifies the task of
configuring the Rules Visualizer, you can construct a configuration file for this tool using
an editor (see Appendix F, “Creating Data and Configuration Files for the Rules
Visualizer”).
Note that the steps required to connect to a data source are described in Chapter 3.
The sections below follow the configuration and invocation of the Rules Visualizer
components in the conventional order:
•

creating a file for the association rules generator

•

generating rules

•

displaying rules

Setting Up Associations
To show how to set up associations, the following example uses the cars database table.
Assume that you want to find out if there is an association between miles per gallon,
horsepower, and the year the car was built. For example, did mileage improve over time?
Did engines become less powerful? The following steps (and Figure 9-3) show you how
to set up the associations and map table columns to the data you want to study.


Figure 9-3 Initial Tool Manager Window for Association Generation

1. Connect to a MineSet server. Refer to Chapter 2, “Setting Up MineSet,” if you need
help.

2. Open a data source.
3. (Optional step) In the Data Transformations tab you can choose the transformations
you want to apply to the data before you give it to the associations engine. One
recommended transformation is to create bins for numeric data. (The binning
operation and the options available for it are described in detail in Chapter 3.) This
leads to more “meaningful” rules from the association engine. For example, instead
of using discrete values for the weightlbs attribute in the “cars” table such as 3504,
3693, 3436, 3433, and so on, it may be more meaningful to give weightlbs_bin value
ranges such as 1600-2500, 2501-3500, and so on.
For this example, click on the Bin Columns button, and select all the columns in the
Bin Column window for binning.
Note: If you run associations without binning any of the numerical columns (ints,
floats, doubles) you get the warning message
Running associations on unbinned non-categorical data. Binning is
recommended for producing more useful results.

4. Choose the Mining Tools tab from the Data Destination tab.
5. Choose the Assoc. tab (abbreviation for Associations) from the Mining Tools tab.
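
The effect of the binning recommended in step 3 can be illustrated with a small sketch. It is not the Tool Manager’s binning operation. The function name bin_label, the boundary list, and the labels for values outside the listed ranges are assumptions, and only the 1600-2500 and 2501-3500 ranges come from the text.

```python
import bisect


def bin_label(value, bounds):
    """Return a range label for an integer value.

    bounds is a sorted list such as [1600, 2500, 3500, 4500]; the first
    entry is the lowest value covered and the others are inclusive upper
    limits, giving the bins 1600-2500, 2501-3500, and 3501-4500.
    """
    if value < bounds[0]:
        return f"<{bounds[0]}"
    if value > bounds[-1]:
        return f"{bounds[-1] + 1}+"
    # Index of the first upper limit that is >= value.
    idx = bisect.bisect_left(bounds, value, lo=1)
    low = bounds[0] if idx == 1 else bounds[idx - 1] + 1
    return f"{low}-{bounds[idx]}"


# Assumed boundaries; the upper two are invented for the example.
bounds = [1600, 2500, 3500, 4500]
weights = [3504, 3693, 3436, 3433]
labels = [bin_label(w, bounds) for w in weights]
print(labels)
# ['3501-4500', '3501-4500', '2501-3500', '2501-3500']
```

After binning, four distinct weightlbs values collapse into two items, so the same item recurs across many more records and a rule that involves it is more likely to reach the minimum prevalence.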


Applying Association Rule Options
After selecting a data source, you can run the Association Rules generator. You can
choose options for this by clicking on the Assoc Options button. This causes the dialog box
in Figure 9-4 to be displayed.

Figure 9-4

Association Rule Options Dialog Box

Prevalence—lets you specify the minimum prevalence threshold as a percentage of the
total number of records. Rules with a prevalence below this value are not generated. The
default is 1%. The possible values are 0–100.
Predictability—lets you specify the minimum predictability threshold for rules. Rules
with a predictability below this value are not generated. The default is 50%. The possible
values are 0–100.
6. Once you have made your association rule options selections, click the OK button.
This returns you to the Tool Manager startup screen.
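
The effect of the two thresholds can be sketched with a brute-force generator for rules that have one item on each side. This is an illustration only: the function name generate_rules and the sample records are invented, and the real association rules generator is not limited to single-item rules, nor does it necessarily work this way. The defaults mirror the dialog box defaults of 1% and 50%.

```python
from itertools import permutations


def generate_rules(records, min_prevalence=0.01, min_predictability=0.50):
    """Return (lhs, rhs, prevalence, predictability, expected) tuples."""
    total = len(records)
    if total == 0:
        return []
    items = sorted(set().union(*records))
    count = {i: sum(1 for r in records if i in r) for i in items}
    rules = []
    for lhs, rhs in permutations(items, 2):
        both = sum(1 for r in records if lhs in r and rhs in r)
        prevalence = both / total
        predictability = both / count[lhs]
        expected = count[rhs] / total
        # Rules below either threshold are not generated.
        if prevalence < min_prevalence or predictability < min_predictability:
            continue
        # Rules weaker than their expected predictability are not reported.
        if predictability < expected:
            continue
        rules.append((lhs, rhs, prevalence, predictability, expected))
    return rules


# Invented data: 100 records, only one of which contains caviar and vodka.
records = (
    [{"caviar", "vodka"}]
    + [{"bread", "milk"}] * 60
    + [{"bread"}] * 10
    + [{"milk"}] * 10
    + [{"eggs"}] * 19
)

default_pairs = {(l, r) for l, r, *_ in generate_rules(records)}
strict_pairs = {(l, r) for l, r, *_ in generate_rules(records, min_prevalence=0.05)}
print(sorted(default_pairs))
print(sorted(strict_pairs))
```

With the default 1% threshold the caviar and vodka rules are kept even though a single record supports them. Raising the minimum prevalence to 5% leaves only the bread and milk rules.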


Mapping Columns to Association Items

Figure 9-5

Association Mappings Dialog Box

The database in the Current Columns text panel can contain multiple table columns. By
mapping specific columns to association items, the association rules generator can find
the association between any possible pair of those items.
1. Click the Assoc. Mappings button to open the Mapping Columns to Assoc Items
dialog box.
The Mapping Columns to Assoc Items window shows two panels:
•

Columns shows the columns in the data

•

Items shows the mapping between columns in the data and items

The Map All button on this window can be used to map all the attributes in the data
source to items for the associations engine. The Clear All and Clear Selected buttons
can be used to clear/change the mapping between columns and items.
2. The default behavior is to map all columns to items. Therefore, if you omit this step
or if you open this window, you find all columns mapped. For this example, click
OK.


Specifying Ruleviz Options
Clicking on the Ruleviz Options button causes a new dialog box to be displayed
(Figure 9-6). This lets you change some of the Rules Visualizer options from their default
values.

Figure 9-6

Rule Visualizer Options Dialog Box

This dialog box has two panels: the top one lets you set options for bars and disks; the
bottom one lets you specify options for items, the grid, and labels.


Items in the top panel are listed below:
•

Height button—lets you specify whether the bars and disk heights are to be
normalized so that the tallest bar equals the height field value (Max Height), or
whether they are to be scaled by the height field value (Scale Height).

•

Height field—lets you enter the maximum or scale value for bar and disk heights.

•

Hide Distance—lets you specify the distance at which disks are not graphically
represented. Smaller numbers in this field specify a shorter distance; this means
fewer disks are shown and performance is greater. Larger numbers indicate a
greater distance; this means disks further away remain visible.

•

Legends—lets you enter a text string that appears as mapping information displayed
at the bottom of the main Rules Visualizer window. This is information about
mapping between display entities and data values (for example, bar height
corresponds to predictability values).

•

Color list—lets you add or edit a color. To add a color to the list, click the + button. To
edit a color, click the color. See “Choosing Colors” and “Using the Color Browser”
in Chapter 3 for a more detailed explanation of how to choose and change colors.

•

Mapping—lets you specify whether the color change that is shown in the graphic
display is Continuous or Discrete. If you choose Continuous, the color values (of the
bars or disks) shift gradually between the colors entered in the Color list field as a
function of the values that are mapped to those colors in the Color list field.

Example 9-1

If you
•

used the Color Browser to apply red and green (for bars and/or disks)

•

selected Discrete for the Mapping

•

entered the values 0 100

then the display shows all bars and/or disks with values of less than 50 in red, and all
those with values greater than or equal to 50 in green.


Example 9-2

If you
•

used the Color Browser to apply red and green (for bars and/or disks)

•

selected Continuous for the Mapping

•

entered the values 0 100

then the display shows all bars and/or disks with values less than or equal to 0 as
completely red, those with values greater than or equal to 100 as completely green, and
those between 0 and 100 as shadings from red to green.
If no mapping and values are specified, a continuous mapping is used, and values are
generated automatically from the minimum value to the maximum value in the data.
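The two mapping modes in Examples 9-1 and 9-2 can be sketched as follows. This is an assumption-laden illustration, not the Rules Visualizer’s implementation: the function name map_color and the RGB tuples are invented, and the discrete rule used here (switch colors at the midpoint between two mapped values) is chosen only because it reproduces Example 9-1 for two colors.

```python
RED = (1.0, 0.0, 0.0)
GREEN = (0.0, 1.0, 0.0)


def map_color(value, stops, colors, mode="Continuous"):
    """Map a value to an RGB tuple.

    stops are strictly ascending values, one per color in colors.
    """
    if mode not in ("Continuous", "Discrete"):
        raise ValueError("mode must be 'Continuous' or 'Discrete'")
    if value <= stops[0]:
        return colors[0]
    if value >= stops[-1]:
        return colors[-1]
    for i in range(len(stops) - 1):
        lo, hi = stops[i], stops[i + 1]
        if lo <= value <= hi:
            t = (value - lo) / (hi - lo)
            if mode == "Discrete":
                # Assumed rule: the nearer stop wins, ties go to the upper one.
                return colors[i] if t < 0.5 else colors[i + 1]
            # Continuous: blend linearly between the two neighboring colors.
            return tuple(a + t * (b - a) for a, b in zip(colors[i], colors[i + 1]))


stops, colors = [0, 100], [RED, GREEN]
print(map_color(49, stops, colors, "Discrete"))    # (1.0, 0.0, 0.0)
print(map_color(50, stops, colors, "Discrete"))    # (0.0, 1.0, 0.0)
print(map_color(25, stops, colors, "Continuous"))  # (0.75, 0.25, 0.0)
```
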
Items in the bottom panel are as follows:
•

Items On and Grids On checkbox buttons let you determine whether items (the
names on the side of the grid) and the grid are displayed or hidden.

•

Size (for Items, Grid, and Bar Labels) lets you specify the size for items, the grid, and
bar labels. If you mapped a column value to bar labels in the Requirements panel of
the Tool Manager startup screen, you can specify a size for those labels.

•

Color (for Left-Hand Items, Right-Hand Items, Grid, and Bar Labels) lets you specify
the color for LHS and RHS items, the grid, and bar labels. If you mapped a column
value to bar labels in the Requirements panel of the Tool Manager startup screen,
you can specify a color for those labels.

•

Hide Distance lets you specify the distance at which the LHS items, RHS items, grid,
or labels become invisible. Smaller distances might improve performance, but the
objects disappear more quickly. The higher the number, the greater the distance at
which labels are hidden.

•

Message lets you specify the message displayed when the pointer is moved over an
object or when an object is selected. (See Figure 9-8.) The syntax of the message
string is the same as for the mapviz message string. See the Message Statement
section in Appendix C, “Creating Data, Configuration, Hierarchy, and GFX Files for
the Map Visualizer.”

Mapping Columns to Visual Elements
The Rules Visualizer lets you map attributes of the rules to visual elements of the display.
Clicking on the RuleViz Mappings button brings up the RuleViz Mappings panel shown in
Figure 9-7.

Figure 9-7

The Rules Visualizer’s Mappings Panel

The visual elements that can be mapped are listed below, where the items with “*” are
optional:

•

Height - Bars—lets you specify what the bar heights represent.

•

*Height - Disks—lets you specify what the disk heights represent.

•

*Color - Bars—lets you specify what the bar colors represent.

•

*Color - Disks—lets you specify what the disk colors represent.

•

*Label - Bars—lets you specify what the bar labels represent.

The default mappings are as follows:
•

predictability to bar heights

•

expected predictability to disk heights

•

lift to bar and disk colors

Lift is the predictability divided by the expected predictability.
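As a small worked illustration of this definition, the sketch below computes lift from the two quantities. The function name and the numbers are hypothetical; they are not taken from any MineSet sample file.

```python
def lift(predictability, expected_predictability):
    """Lift = predictability / expected predictability (both in the same units)."""
    if expected_predictability == 0:
        raise ValueError("expected predictability must be nonzero")
    return predictability / expected_predictability

# Hypothetical rule: predictability 60%, expected predictability 20%.
print(lift(60.0, 20.0))
```

A lift above 1 means the rule's predictability is higher than its expected predictability; a lift of exactly 1 means the two are equal.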

Invoking the Rules Visualizer
To see the Rules Visualizer graphically represent your data, click the Run Assoc & Rules
button at the bottom of the Associations tab in the Data Destination panel of the main
Tool Manager window.

Working in the Rules Visualizer’s Main Window
The Rules Visualizer part of this tool graphically displays the data in a rules file using the
specifications of a valid configuration file. For example, specifying group.ruleviz results in
the image shown in Figure 9-8.

Figure 9-8

Initial Rules Visualizer View When Specifying group.ruleviz

The rules are presented on a grid, initially with left-hand side (LHS) items
displayed on the left side of the window and right-hand side (RHS) items on the right. A
rule is displayed at the junction of its LHS and RHS items. The display can include bars,
disks, and labels.
When the scene is close enough, the LHS and RHS axes are labeled with the item names,
unless this has been turned off in the configuration file. (To view the grid and rules at
closer range, use the Dolly thumbwheel, described in “Thumbwheels” in Chapter 6.)
You can change the labels as well as what the heights and colors of the bars and disks
represent by modifying the configuration file via the Tool Manager (see Chapter 3) or
using an editor to change the configuration file.
For example, in Figure 9-8, bar heights correspond to predictability values, bar colors
correspond to prevalence values, and disk heights correspond to expected predictability.

Viewing Modes
The two modes of viewing are grasp and select. To toggle between these modes, press the
Esc key. You also can change from one mode to the other by clicking the appropriate
button: to enter select mode, left-click the arrow button (to the top right of the main
window); to enter grasp mode, left-click the hand button (immediately below the arrow
button, near the top right of the main window as shown in Figure 9-2).
Grasp Mode

In grasp mode the cursor appears as a hand. This mode supports panning, rotating, and
scaling the scene’s size in the main window.

•

To rotate the display, press the left mouse button and move the mouse in the
direction you want to rotate. (Also see the rotating controls Rotx and Roty described
in “Thumbwheels” in Chapter 6.)

•

To pan the display, press the middle mouse button and drag it in the direction you
want the display panned.

•

To move the viewpoint forward, press the left and middle mouse buttons
simultaneously and move the mouse downwards. To move the viewpoint
backward, press the left and middle mouse buttons simultaneously and move the
mouse upwards. This is equivalent to the functions provided by the Dolly
thumbwheel.

Select Mode

In select mode, you can obtain additional information about a rule by placing the cursor
over a bar. This highlights the selected bar and causes information about the rule
represented by that bar to appear at the top of the main window.

Figure 9-9

Cursor Over a Rules Visualizer Object

The information is displayed as long as the cursor remains over the object. If you position
the pointer cursor over an object and click the left mouse button, that same information
appears in the Selection Window, which is above the main window, under the
“Selection” label.
This Selection information remains visible until another object is selected, or until no
object is selected (if you click the black background). Using the mouse, you can cut and
paste this text into other applications, such as reports or databases.

External Controls
Several external controls surround the main window, including buttons and
thumbwheels. (These buttons and thumbwheels are the same as those in the other MineSet
visualization tools and are described in “Buttons” and “Thumbwheels” in
Chapter 6.)

The Height Slider
The Height slider, at the upper left corner of the main window, lets you scale the heights
of objects (bars and disks) in the main window.

Figure 9-10

Rules Visualizer’s Height Slider

Pulldown Menus
The Rules Visualizer has three pulldown menus, labeled File, View, and Help.

The File Menu
The File menu (Figure 9-11) contains six options.

Figure 9-11

Rules Visualizer File Menu

•

Open loads and opens a configuration file, displaying it in the main window.
Previously displayed data is discarded.

•

Reopen reloads the current configuration file. This is useful if either the
configuration file or data file has changed.

•

Save As... lets you save the file under a different file name.

•

Print Image... lets you print the image from the screen.

•

Filter... lets you filter the data.

•

Exit closes the current window and exits the application.

The Filter Menu
Choosing Filter from the File menu brings up a Filter panel (Figure 9-12) that lets you reduce the number of
rules displayed in the main viewing area, based on one or more criteria. You can use the
filter panel to fine-tune the display, emphasize specific information, or simply shrink the
amount of information displayed.

Figure 9-12

Rules Visualizer Filter Panel

The top pane lets you filter based on string variables, such as LHS and RHS. To select all
values of a variable, click Set All. To clear the current selections, click Clear. To select a
value, click it. To deselect a value, click it again.
The bottom pane lets you filter based on the values of both string and numeric variables.
To filter numeric values, enter the value, and select a relational operation (=, !=, >, <, >=,
<=). To filter alphanumeric values, enter the string. You can use any of three types of
string comparisons:
•

Contains matches any value that includes the specified string. For example, California
contains the strings Cal and forn.

•

Equals requires the strings to match exactly.

•

Matches allows wildcards:
–

An asterisk (*) represents any number of characters.

–

A question mark (?) represents one character.

–

Square brackets ([ ]) enclose a list or range of characters, any one of which matches.

For example, California matches Cal*, Cal?fornia, and Cal[a-z]fornia.
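The three string comparisons can be tried out with the short sketch below. It uses Python's standard fnmatch module, whose wildcard syntax happens to resemble the one described above; this is only an analogy for experimenting with patterns, and the helper names are hypothetical. Whether the Filter panel is case-sensitive, or supports anything beyond the three wildcards listed above, is not stated here.

```python
from fnmatch import fnmatchcase

def contains(value, text):
    """Contains: the value includes the text somewhere."""
    return text in value

def equals(value, text):
    """Equals: the strings match exactly."""
    return value == text

def matches(value, pattern):
    """Matches: *, ?, and [ ] wildcards, compared case-sensitively."""
    return fnmatchcase(value, pattern)

for pattern in ("Cal*", "Cal?fornia", "Cal[a-z]fornia", "Cal?"):
    print(pattern, matches("California", pattern))
```

The first three patterns match California; the last does not, because a question mark stands for exactly one character.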
In addition to numeric and string comparison operations, you can specify Is Null.
Currently, this option does not match any rules, resulting in an empty display.
To the right of each field is an additional option menu that lets you specify “And” or “Or”
options. For example, you could specify “sales > 20 And < 40.” You can have any number
of And or Or clauses for a given variable, but cannot mix And and Or in a single variable.
Click the Filter button to start filtering. If you press Enter while the panel is active,
filtering starts automatically.
Click the Close button to close the panel.

The View Menu
The View menu (Figure 9-13) contains one option.

Figure 9-13

Rules Visualizer View Menu

Use Symmetric Axes controls how items are displayed along the left- and right-hand side
axes. If enabled, every item appears on both axes, making the axes identical. Otherwise,
only the required items appear on each axis.

The Help Menu
The Help menu is the same for all MineSet visualization tools (see “The Help Menu” in
Chapter 5 for a complete description).

Sample Files
The provided sample data, rules, and configuration files demonstrate the features and
capabilities of the Rules Visualizer.

Sample Files for the Association Data Converter
There are two sample files provided for each of the two formats of the association data
converter. These files are located in the /usr/lib/MineSet/assoccvt/examples directory.

•

sing.dat and sing.fmt
The sing.dat file is a “raw” data file type, as described in the “Files Required by the
Association Data Converter Part” on page 294. The sing.fmt file is the format file
described in the same section. Both files are of the single-item-per-record format.

•

mult.dat and mult.fmt
The mult.dat file is a “raw” data file type, as described in the “Files Required by the
Association Data Converter Part” on page 294. The mult.fmt file is the format file
described in the same section. Both files are of the multiple-item-per-record format.

Sample Files for the Association Rules Generator
These files are located in the /usr/lib/MineSet/assocgen/examples directory. Except for the
synthn.dsc file, the sample files for the association rules generator are provided in 2-byte
and 4-byte integer versions. The difference between the respective files is that the 4-byte
integer version requires twice the amount of storage space of the 2-byte integer version.
•

synthn.dsc
This is a description file for items at the nth level of the hierarchy. For example, if n
is 0, this file describes the lowest level; if n = 1, the file describes the next higher
level of the hierarchy, and so forth. Description files are common to both 2-byte and
4-byte integer files.

Two-byte Integer Version

•

synths.dat
This is a data file with 2-byte integers. It corresponds to the data shown in Table F-9
on page 658.

•

synths.map
This is a 2-byte integer mapping file for hierarchical data.

Four-byte Integer Version

•

synthb.dat
This is a data file with 4-byte integers. It corresponds to the data shown in Table F-9
on page 658.

•

synthb.map
This is a 4-byte integer mapping file for hierarchical data.

Sample Files for the Rules Visualization Part
The following sample rules and configuration files are provided for use with the rules
visualization part of this tool. These files correspond to the hierarchical datasets. Rules
files contain the generated rules obtained by running the association rules generator part
of the Rules Visualizer. Rules files must have a .rules extension. Each configuration file
specifies how the corresponding rules file is displayed. Configuration files must have a
.ruleviz extension. The files mentioned in this subsection are in the
/usr/lib/MineSet/ruleviz/examples directory.

•

group.rules and group.ruleviz
These files provide the generated rules and configuration specifications for product
groups, such as bread and baked goods, dairy milk, and carbonated beverages.

•

category.rules and category.ruleviz
These files provide the generated rules and configuration specifications for product
categories within product groups, such as refrigerated or non-refrigerated milk.

•

people94.rules and people94.ruleviz
These files provide the generated rules and configuration specifications for a census
database, showing associations among marital status, education level, age, income,
and other variables.

•

germanCredit.rules and germanCredit.ruleviz
These files provide the generated rules and configuration specifications for a credit
database from Germany, showing associations among credit history, employment,
savings, and other variables.

See /usr/lib/MineSet/ruleviz/examples/README for additional information on the files in
that directory.

Chapter 10

10. MineSet Inducers and Classifiers

This chapter provides an introduction to classifiers and the algorithms that build them,
called inducers. MineSet provides three inducer-classifier pairs:
•

Decision Tree

•

Option Tree

•

Evidence

The information in this chapter is equally applicable to all the MineSet classifiers and
inducers. The chapter consists of two parts: the first part introduces the basic concepts,
and the second part details how to apply those concepts via the Tool Manager.
Detailed descriptions of the MineSet inducers and classifiers are provided in Chapter 11,
“Inducing and Visualizing the Decision Tree Classifier,” Chapter 12, “Inducing and
Visualizing the Option Tree Classifier,” and Chapter 13, “Inducing and Visualizing the
Evidence Classifier.”

Classifiers
A classifier predicts one attribute of a set of data, given several other attributes. For
example, if you have data on customers of a telecommunications company, a classifier
can be generated to predict whether the customer will churn (leave the company) given
information such as whether the customer has voice mail or an international plan,
and how much time the customer spends on the phone. The attribute being predicted is called the
label, and the attributes used for prediction are called the descriptive attributes.
MineSet can build a classifier automatically from a training set. The training set consists
of records in the data for which the label has been supplied. For example, you supply a
database table with one column for each descriptive attribute (such as the presence of a
voice mail plan, the average number of calling minutes per day), and one column for the
label (churned or not). An algorithm that automatically builds a classifier from a training
set is called an inducer.

When a classifier is generated, MineSet also generates a visualization that can help you
understand how the classifier operates. This visualization can also provide valuable
insight into the data itself. Once a classifier is generated, it can be used to classify records
that do not contain the label attribute. This value is predicted by the classifier.
Note: See Appendix K for a list of further readings about classifiers as well as
acknowledgements for the datasets used in MineSet sample files.

Decision Tree Classifiers
Figure 10-1 shows the Decision Tree generated by the Decision Tree Inducer for the churn
dataset.

Figure 10-1

The Decision Tree Generated by the Decision Tree Inducer for Churn Dataset

To understand how the Decision Tree classifier assigns a label to each record, look at the
attributes tested at the nodes and the values on the connecting lines. In the Decision Tree
shown in Figure 10-1, the first test (at the root of the tree) is for total day minutes. There
are two branches from this root. If the total day minutes is <= 264.45, the left branch is
taken; otherwise the right branch is taken. The process is repeated until a leaf (node with
no branches) is reached. The leaf is labeled with the predicted class. The leaf represents
a rule that is the conjunction of all tests from the root to the leaf. For example, the
rightmost leaf, labeled No, matches the rule:
total day minutes > 264.45 and voice mail plan = yes and
international plan = yes and total day minutes > 276.3
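The walk from root to leaf can be sketched as follows. Only the tests along the rightmost path are taken from the rule above; every other subtree of Figure 10-1 is replaced by a placeholder leaf, and the class and function names are hypothetical. This is an illustration of how a tree assigns a label, not the Decision Tree classifier's implementation.

```python
class Leaf:
    def __init__(self, label):
        self.label = label

class Test:
    def __init__(self, description, predicate, if_true, if_false):
        self.description = description
        self.predicate = predicate
        self.if_true = if_true
        self.if_false = if_false

def classify(node, record):
    """Follow tests from the root until a leaf is reached; return (label, path)."""
    path = []
    while isinstance(node, Test):
        outcome = node.predicate(record)
        path.append((node.description, outcome))
        node = node.if_true if outcome else node.if_false
    return node.label, path

# Placeholder for the subtrees that the text does not describe.
OTHER = Leaf("(subtree not described in the text)")

tree = Test(
    "total day minutes > 264.45",
    lambda r: r["total day minutes"] > 264.45,
    Test(
        "voice mail plan = yes",
        lambda r: r["voice mail plan"] == "yes",
        Test(
            "international plan = yes",
            lambda r: r["international plan"] == "yes",
            Test(
                "total day minutes > 276.3",
                lambda r: r["total day minutes"] > 276.3,
                Leaf("No"),
                OTHER,
            ),
            OTHER,
        ),
        OTHER,
    ),
    OTHER,
)

# Hypothetical record that satisfies every test on the rightmost path.
record = {"total day minutes": 300.0, "voice mail plan": "yes", "international plan": "yes"}
label, path = classify(tree, record)
print(label)
for description, outcome in path:
    print(description, outcome)
```

The list of tests returned with the label is the conjunction that the leaf represents.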

Option Tree Classifiers
Figure 10-2 shows an Option Tree generated by the Option Tree inducer.

Figure 10-2

The Option Tree Generated by the Option Tree Inducer for the Cars Dataset

The top node in this figure is an “Option node,” which indicates that several good
attributes can be chosen at the root. A Decision Tree Inducer picks the single “best”
attribute for each subtree; however, there might be several good attributes on which to
split. In such cases, an Option Tree can create option nodes. In the example dataset
(Figure 10-2), the task is to predict whether a car is manufactured in Europe, Japan, or the
US. The Decision Tree Inducer picks cubic inches for the root. The Option Tree inducer
chooses several options: cubic inches, cylinders, weight, mpg, and brand are all good
choices for the root.
Option nodes can appear elsewhere besides the root. With the default settings, however,
they appear only at the root or one level below the root (after a single test node).
Option Trees usually take 10 to 15 times longer to build than do Decision Trees, but they
provide two significant advantages:
1.

Comprehensibility — Option nodes let you see several likely options. Instead of
having to settle for a single attribute, option nodes let you choose from several.
When you fly over the tree, you can choose to follow an option that you believe is
easier to understand or that you believe is better for predictions based on your
background knowledge of the problem.

2. Accuracy — In many cases, Option Trees are more accurate (have lower error-rates)
than Decision Trees. Option Trees classify by letting each option “vote” for each
label value, then average the votes. This is similar to having a series of “experts,”
each one attempting to predict the label based on a different main criterion. The
option node averages all these experts’ votes. Just as distributing stock investments
reduces the risk, using a mixture of options usually results in a more stable, less
risky classifier.
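The voting described under Accuracy can be sketched as below. The vote values are hypothetical, the function name is an assumption, and the sketch averages the options with equal weight; how the Option Tree classifier scores and weights each option internally is not specified here.

```python
def option_node_vote(option_votes):
    """Average the per-label votes of several options and pick the top label.

    option_votes is a list with one dict per option, mapping label -> vote.
    """
    labels = sorted({label for votes in option_votes for label in votes})
    count = len(option_votes)
    averaged = {
        label: sum(votes.get(label, 0.0) for votes in option_votes) / count
        for label in labels
    }
    predicted = max(labels, key=lambda label: averaged[label])
    return averaged, predicted

# Hypothetical votes from two options (for example, cubic inches and weight).
votes = [
    {"Europe": 0.25, "Japan": 0.5, "US": 0.25},
    {"Europe": 0.0, "Japan": 0.25, "US": 0.75},
]
averaged, predicted = option_node_vote(votes)
print(averaged)
print(predicted)
```

Here the first option alone would predict Japan, but the averaged vote favors US, which illustrates how the options temper one another.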

Evidence Classifiers
Figure 10-3 shows the evidence information generated by the Evidence Inducer.

Figure 10-3

Results of Evidence Inducer for Iris Dataset

The right window of the screen shows the distribution of the classes in the training set.
The left side shows rows of pie charts, one for each attribute. For every value of an
attribute in the data, there is one chart matching it in the row for the attribute. Given a
record with an attribute value corresponding to a chart, the chart represents how much
evidence the classifier “adds” to each possible label value. For example, in Figure 11-3, a
record with total day minutes < 175 shows much evidence for the No label value, and
little evidence for the Yes label value. After evidence is accumulated from all the
attributes, the label value with the most evidence is predicted.
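The accumulation step can be sketched as follows. The evidence amounts are hypothetical, the names are assumptions, and simple addition is used to combine evidence across attributes; the way the Evidence classifier actually computes and combines evidence is described in Chapter 13, not here.

```python
def predict(record, evidence, labels):
    """Add up the evidence each attribute value contributes; return the top label."""
    totals = {label: 0.0 for label in labels}
    for attribute, value in record.items():
        for label, amount in evidence[attribute][value].items():
            totals[label] += amount
    return max(labels, key=lambda label: totals[label]), totals

# Hypothetical evidence tables: attribute -> value (or range) -> label -> evidence.
evidence = {
    "total day minutes": {"< 175": {"No": 0.75, "Yes": 0.25}},
    "international plan": {"no": {"No": 0.5, "Yes": 0.5}},
}
record = {"total day minutes": "< 175", "international plan": "no"}
label, totals = predict(record, evidence, ["No", "Yes"])
print(label, totals)
```

Each inner dictionary plays the role of one chart: it says how much evidence that attribute value adds to each label value.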

Inducers
An inducer is an algorithm that builds a classifier from a training set, which consists of
records with labels. The training set is used by the inducer to “learn” how to construct
the classifier, as shown in Figure 10-4.

[Diagram: a training set (records with labels) is fed to the inducer, which produces a
classifier and visualization files.]

Figure 10-4

Method for Building a Classifier

Once the classifier is built, its structure can be visualized or used to classify unlabeled
records, as shown in Figure 10-4 and Figure 10-5.

[Diagram: records without labels are passed to the classifier, which outputs labels.]

Figure 10-5

Using a Classifier to Label New Records

Running inducers can be a CPU- and I/O-intensive process. For this reason, the MineSet
inducers run on the MineSet server, rather than on the MineSet client (see Figure 10-6).

[Diagram components: MineSet client and MineSet server; user; Tool Manager or
configuration file; configuration file; visualization tool; visual display; user's data
source; DataMover; data file; inducer (MIndUtil); classifier; information and statistics
(error estimate); visual files.]

Figure 10-6

Tool Execution Sequence for Classifiers

Training Set
Inducers require a training set, which is a table containing attributes, one of which is
designated as the class label. The label attribute type must be discrete (binned values,
character string values, or a few integers). The number of possible values for the label
attribute should be small, preferably two or three values. Let’s look at an example where
the goal is to classify the type of an iris flower (iris-setosa, iris-versicolor, or iris-virginica)
given as descriptive attributes its sepal length, sepal width, petal length and petal width.
Figure 10-7 shows several records from a sample training set for this problem.
            Descriptive Attributes                                    Label

            sepal length   sepal width   petal length   petal width   iris type
Record 1    5.1            3.5           1.4            0.2           Iris-setosa
Record 2    5.9            3             5.1            1.8           Iris-virginica
Record 3    6.5            2.8           4.6            1.5           Iris-versicolor
            6.3            2.9           5.6            1.8           Iris-virginica
            6.5            3             5.8            2.2           Iris-virginica

Figure 10-7

Sample Records From a Training Set

Once a classifier is built, it can classify new records as belonging to one of the above
classes (see Figure 10-5). These new records must be in a table that has all the attributes
used by the classifier with the same name and type as they were in the training set. The
table need not contain the label attribute. If it exists, it is ignored during classification.

Applying a Model
After building a classifier, you can apply it to records to predict their label. For example,
if you built a classifier for predicting iris type, you can apply the classifier to records
containing only the descriptive attributes, and a new column is added with the predicted
iris type.
In a marketing campaign, for example, a training set can be generated by running the
campaign at one city and generating label values according to the responses in that city.
A classifier can then be induced and campaign mail can then be sent only to people who
are labeled by the classifier as likely to respond, thus saving mailing costs.

As an example of using mining tools for ensuring data quality, after building a classifier
you can apply it to the training set in order to identify records that are mislabeled by the
classifier. Such records might warrant closer investigation. Perhaps they are “noise,” or
they might yield special insights. If, for example, you have a Decision Tree for the iris
dataset induced using the Classify Only mode, by applying the classifier, you get a new
column (iris type_1) containing the predicted labels. You can then add a column that is
defined as type int with the expression (iris type != iris type_1). The new column has a 1
whenever the classifier misclassifies, and a zero when it correctly classifies. Figure 10-8
shows a Scatter Visualizer plot of the data where the new column is mapped to color with
the colors set such that 0 is green (OK) and 1 is red (error). By looking at the plot, it is
possible to determine where mistakes are being made.

Figure 10-8

Iris Dataset Misclassification, Example 1

Another alternative is to define the new column as a float with the expression
(iris type != iris type_1) + 0.01. The Scatter Visualizer can then be used with the original
label mapped to color, and this new column mapped to size. Incorrect predictions are
shown as big cubes; correct predictions are shown as small cubes (see Figure 10-9).
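The two derived columns described above can be sketched in a few lines. The predicted labels below are hypothetical, and the column names error_flag and error_size are chosen only for this illustration; in MineSet the columns are added through the Tool Manager rather than in code.

```python
# Hypothetical rows: the original label and the label predicted by the classifier.
rows = [
    {"iris type": "Iris-setosa", "iris type_1": "Iris-setosa"},
    {"iris type": "Iris-versicolor", "iris type_1": "Iris-virginica"},
    {"iris type": "Iris-virginica", "iris type_1": "Iris-virginica"},
]

for row in rows:
    wrong = row["iris type"] != row["iris type_1"]
    # int column: 1 when the classifier misclassifies, 0 otherwise (map to color).
    row["error_flag"] = int(wrong)
    # float column: the same comparison plus 0.01, so that correct predictions
    # still get a small nonzero size (map to size).
    row["error_size"] = float(wrong) + 0.01

for row in rows:
    print(row["iris type"], row["error_flag"], row["error_size"])
```

The small constant in the second column is what keeps correctly classified records visible as small cubes instead of giving them a size of zero.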

Figure 10-9

Iris Dataset Misclassification, Example 2

Error Estimation
When a classifier is built, it is useful to know how well you can expect it to perform in
the future (that is, what its error-rate is). Factors affecting classification error-rate
include:
•

The number of records available in the training set.
Since the inducer must learn from the training set, the larger the training set, the
more reliable the classifier should be; however, the larger the training set, the longer
it takes the inducer to build a classifier. The improvement to the error-rate decreases
as the size of the training set increases (this is a case of diminishing returns).

•

The number of attributes.
More attributes mean more combinations for the inducer to compute, making the
problem more difficult for the inducer and requiring more time. Note that
sometimes random correlations can lead the inducer astray; consequently, it might
build less accurate classifiers (technically, this is known as “overfitting”). If an
attribute is irrelevant to the task, remove it from the training set (this can be done
using the Tool Manager).

•

The information in the attributes.
Sometimes there is not enough information in the attributes to correctly predict the
label with a low error-rate (for example, trying to determine someone’s salary based
on their eye color). Adding other attributes (such as profession, hours per week,
and age) might reduce the error-rate.

•

The distribution of future unlabeled records.
If future records come from a distribution different from that of the training set, the
error-rate probably will be high. For example, if you build a classifier from a
training set containing family cars, it might not be useful when attempting to
classify records containing many sports cars, because the distribution of attribute
values might be very different.

The two common methods for estimating the error-rate of a classifier are described
below. Both of these assume that future records will be sampled from the same
distribution as the training set.
•

Holdout: A portion of the records (commonly two-thirds) is used as the training set,
while the rest is kept as a test set. The inducer is shown only two-thirds of the data
and builds a classifier. The test set is then classified using the induced classifier, and
the error-rate or loss on this test set is the estimated error-rate or estimated loss.
Figure 10-10 shows this error estimation method.

[Diagram: training set, inducer, classifier, evaluation (percent incorrect predictions).]

Figure 10-10

Estimating the Classifier’s Accuracy

This method is fast, but since it uses only two-thirds of the data for building the
classifier, it does not make efficient use of the data for learning. If all the data were
used, it is possible that a more accurate classifier could be built.
•

Cross-validation: The data is split into k mutually exclusive subsets (folds) of
approximately equal size. The inducer is trained and tested k times; each time, it is
trained on all the data minus a different fold, then tested on that holdout fold. The
estimated error-rate is then the average of the errors obtained. Figure 10-11 shows
cross-validation with k=3 (note that the default value is k=10).
Cross-validation can be repeated multiple times (t). For a t times k-fold
cross-validation, k*t classifiers are built and evaluated. This means the time for
cross-validation is k*t times longer. By default, k=10 and t=1, so cross-validation
takes approximately 10 times longer than building a single classifier.
Increasing the number of repetitions (t) increases the running time and improves
the error estimate and the corresponding confidence interval.
You can increase or decrease k. Reducing it to 3 or 5 shortens the running time;
however, estimates are likely to be biased pessimistically because of the smaller
training set sizes. You can increase k, but this is recommended only for very small
datasets.
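Both estimation methods can be sketched as below. The inducer here is a deliberately trivial stand-in that always predicts the most common label in its training data, the records are hypothetical, and all function names are assumptions; the sketch shows only how the data is split and how the errors are averaged, not how MineSet implements either method.

```python
import random
from collections import Counter

def induce_majority(training):
    """Toy inducer: the classifier always predicts the most common training label."""
    majority = Counter(label for _, label in training).most_common(1)[0][0]
    return lambda attributes: majority

def error_rate(classifier, test):
    """Fraction of test records whose predicted label differs from the actual label."""
    wrong = sum(1 for attributes, label in test if classifier(attributes) != label)
    return wrong / len(test)

def holdout_estimate(records, induce, train_fraction=2 / 3, seed=0):
    """Train on a random portion of the records and test on the rest."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return error_rate(induce(shuffled[:cut]), shuffled[cut:])

def cross_validation_estimate(records, induce, k=10, t=1, seed=0):
    """t times k-fold cross-validation; returns (average error, classifiers built)."""
    rng = random.Random(seed)
    errors = []
    for _ in range(t):
        shuffled = records[:]
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]  # k mutually exclusive subsets
        for i in range(k):
            training = [r for j, fold in enumerate(folds) if j != i for r in fold]
            errors.append(error_rate(induce(training), folds[i]))
    return sum(errors) / len(errors), len(errors)

# Hypothetical dataset: 20 records labeled "No" and 10 labeled "Yes".
records = [({"id": i}, "No") for i in range(20)] + [({"id": i}, "Yes") for i in range(20, 30)]
print(holdout_estimate(records, induce_majority))
print(cross_validation_estimate(records, induce_majority, k=10, t=2))
```

The second value returned by cross_validation_estimate is the number of classifiers built, k*t, which is why the running time grows by that factor.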

[Diagram: the training set feeds three inducers; each produces a classifier that is
evaluated, and the three evaluations are averaged.]

Figure 10-11

Classifier Cross-Validation (k=3)

Generally, a holdout estimate should be used at the exploratory stage, as well as on
datasets of over 5,000 records. Cross-validation should be used for the final classifier
building phase, as well as on small datasets.

Backfitting in Error Estimation
An inducer builds a classifier, which has two parts:
•

Structure — For Decision Trees and Option Trees, the structure is the shape of the
tree. For the Evidence classifier, the structure is the number of bins for every attribute and the
thresholds if the attribute is numeric.

•

Probability estimates — Each part of the structure estimates the probability of each
class. These estimates are commonly based on the counts of training records at
different points in the structure. For Decision Trees, the probabilities are determined
by the weight of records at the leaves. For the Evidence classifier, the probabilities
are determined by the conditional probabilities for every attribute value or range.

Backfitting a classifier with a set of records does not alter the structure of the classifier,
but updates the probability estimates based on the given data. Backfitting is useful for
several reasons:
1.

A structure can be built from a small training set, then backfitted with a big dataset
to improve the probability estimates in the structure. Backfitting is a faster process
than inducing the classifier's structure.

2. When holdout error estimation is used, a portion of the data is left out for testing.
Once the classifier structure is induced and the error estimated, it is possible to
backfit all of the data through the structure, which can reduce the error of the final
classifier. When counts, weights, and probabilities are shown in the classifier's
structure, they reflect all the data, not just the training set portion.
When using drill-through from the visualizers, you can see data corresponding to the
weights shown, which reflect the whole dataset. If backfitting is not used, the weights
shown represent only the training set.
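The separation between structure and probability estimates can be sketched with a one-test tree. The class name, the threshold, and the records are hypothetical, and the sketch replaces the leaf counts with those of the backfitted records; it is meant only to show that backfitting leaves the structure alone while the counts, and therefore the probabilities, change.

```python
from collections import Counter

class Stump:
    """A fixed structure: one numeric test with two leaves."""

    def __init__(self, attribute, threshold):
        self.attribute = attribute
        self.threshold = threshold
        self.counts = {"left": Counter(), "right": Counter()}

    def leaf_for(self, record):
        return "left" if record[self.attribute] <= self.threshold else "right"

    def backfit(self, records, label):
        """Recompute the leaf counts from records; the test itself is not changed."""
        self.counts = {"left": Counter(), "right": Counter()}
        for record in records:
            self.counts[self.leaf_for(record)][record[label]] += 1

    def probabilities(self, record):
        counts = self.counts[self.leaf_for(record)]
        total = sum(counts.values())
        return {name: n / total for name, n in counts.items()}

# Hypothetical records: a small training portion and the full dataset.
small = [
    {"total day minutes": 100, "churn": "No"},
    {"total day minutes": 300, "churn": "Yes"},
]
full = small + [
    {"total day minutes": 120, "churn": "No"},
    {"total day minutes": 200, "churn": "Yes"},
    {"total day minutes": 280, "churn": "Yes"},
    {"total day minutes": 150, "churn": "No"},
]

stump = Stump("total day minutes", 264.45)
stump.backfit(small, "churn")
print(stump.probabilities({"total day minutes": 110}))
stump.backfit(full, "churn")
print(stump.threshold, stump.probabilities({"total day minutes": 110}))
```

After the second call the threshold is unchanged, but the left leaf's estimate reflects four records instead of one.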

Confusion Matrices in Error Estimation
Confusion matrices give a more detailed picture of the errors made by a classifier. Instead
of simply analyzing the number of correct and incorrect predictions, the confusion
matrix shows the type of errors being made.
Figure 10-12 shows a confusion matrix for a Decision Tree that was induced on the iris
dataset.

Figure 10-12

Confusion Matrix for Iris Dataset

The two axes represent:
•

the class values predicted by the classifier, and

•

the actual class values given in the test set (holdout set).

Entries on the diagonal are correct predictions. Entries off the diagonal indicate incorrect
predictions. This representation shows that iris-versicolor and iris-virginica are frequently
confused, but iris-setosa is always predicted correctly.
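A confusion matrix of this kind can be tallied as sketched below. The actual and predicted labels are hypothetical and do not reproduce the counts in Figure 10-12; the function name is an assumption.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair of class values occurs."""
    return Counter(zip(actual, predicted))

# Hypothetical test-set labels and classifier predictions.
actual = ["iris-setosa", "iris-setosa", "iris-versicolor",
          "iris-versicolor", "iris-virginica", "iris-virginica"]
predicted = ["iris-setosa", "iris-setosa", "iris-versicolor",
             "iris-virginica", "iris-virginica", "iris-versicolor"]

matrix = confusion_matrix(actual, predicted)
correct = sum(n for (a, p), n in matrix.items() if a == p)   # diagonal entries
errors = sum(n for (a, p), n in matrix.items() if a != p)    # off-diagonal entries
for (a, p), n in sorted(matrix.items()):
    print(a, "predicted as", p, ":", n)
print("error-rate:", errors / (correct + errors))
```

The overall error-rate collapses the off-diagonal entries into a single number; the matrix keeps them apart, which is what makes the type of error visible.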
When the cost of making different types of mistakes is uneven, it is frequently useful to
understand the type of errors that are being made (see loss matrices below).
Note: The confusion matrix shows the errors made on the test set; thus, it represents the
expected true distribution of errors in an actual situation if the underlying distribution
of the data does not change significantly. The confusion matrix in MineSet is computed
prior to backfitting and is the same whether or not backfitting is applied.

Lift Curves in Error Estimation
A lift curve is a graph that plots the cumulative weight of the records from a specified
label value as a function of the weight of all the records. The order in which the records
occur determines the slope of the curve. Typically, a lift curve plots the difference
between randomly ordered records and records sorted based on a classifier's predictions.
For example, in telecommunications, it is valuable to be able to predict which customers
are likely to switch providers (churn). In the dataset churn, about 13.5% of the customers
are likely to switch providers. Figure 10-13 shows the lift curve obtained by using a
Decision Tree classifier on this dataset.

Figure 10-13

Lift Curve for the Churn Dataset

The X axis shows the number of records sampled; the Y axis shows the number of records
corresponding to customers who churn. The lower curve (red) shows the number of
customers expected to churn given a random ordering of the records. The upper curve
(white) shows the number of customers that churn when records are ordered according to the
classifier's score (probability estimate) for each record. Records representing customers
that the classifier identifies as most likely to churn appear first; those less likely to churn
appear last. The lift that the classifier ordering provides can be seen by the difference
between the classifier curve and the random curve.
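
The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not MineSet's implementation: the function names, the toy scores, and the labels are all assumptions made for the example. Records are sorted by descending classifier score, cumulative weights are accumulated, and the random-ordering baseline is the straight line to the final point.

```python
def lift_curve(scores, labels, target, weights=None):
    """Return cumulative (weight_seen, target_weight_seen) points.

    Records are visited in descending order of the classifier's score
    for `target`, so the records judged most likely come first.
    """
    if weights is None:
        weights = [1.0] * len(scores)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    points, seen, hits = [(0.0, 0.0)], 0.0, 0.0
    for i in order:
        seen += weights[i]
        if labels[i] == target:
            hits += weights[i]
        points.append((seen, hits))
    return points


def random_baseline(points):
    """Expected curve for a random ordering: a straight line to the end point."""
    total_seen, total_hits = points[-1]
    return [(x, total_hits * x / total_seen) for x, _ in points]


# Hypothetical toy data: the scores are made-up churn probabilities.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = ["yes", "yes", "no", "yes", "no", "no", "no", "no"]

curve = lift_curve(scores, labels, "yes")
base = random_baseline(curve)
# The lift at each point is the gap between the two curves.
lift = [y - b for (_, y), (_, b) in zip(curve, base)]
print(curve[2], base[2], max(lift))
```

The point where the gap is largest is one candidate for the cutoff when contacting customers costs money.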

If some action should be taken before customers churn, it should be prioritized according
to the classifier's score. If the action costs money (for example, an operator contacting the
customer or a mailing), lift curves can help identify a cutoff point that maximizes returns.
Note: The lift curve shows lift on the test set; thus, it represents the expected true lift in
actual situations if the underlying distribution of the data does not change significantly.
The lift curve in MineSet is computed prior to backfitting and is the same whether or not
backfitting is applied.

Learning Curves in Error Estimation
A learning curve is a graph that shows the error of the classifier generated by an inducer
as a function of the number of records used to create the classifier. Typically, the more
records used to generate the classifier, the lower its error.
A learning curve is created by generating the specified number of classifiers for each of
the points on the curve. Each classifier is generated using a random sample of the
records, and its error is estimated using the rest of the records (those not used for
training).
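
The procedure above can be sketched as follows. This is an illustrative outline only: `learning_curve`, the toy `induce_nearest_mean` inducer, and the synthetic data are assumptions for the example and are not MineSet inducers or options. For each training size, several classifiers are trained on random samples and each is tested on the records left out of its sample.

```python
import random
from statistics import mean


def learning_curve(records, induce, sizes, runs, seed=0):
    """For each size, train `runs` classifiers on random samples and
    estimate each one's error on the records not used for training."""
    rng = random.Random(seed)
    curve = []
    for n in sizes:
        errors = []
        for _ in range(runs):
            shuffled = records[:]
            rng.shuffle(shuffled)
            train, test = shuffled[:n], shuffled[n:]
            classify = induce(train)
            wrong = sum(1 for x, y in test if classify(x) != y)
            errors.append(wrong / len(test))
        curve.append((n, mean(errors), errors))
    return curve


def induce_nearest_mean(train):
    """Toy inducer (an assumption for this sketch): predict the label
    whose training mean is closest to the value x."""
    by_label = {}
    for x, y in train:
        by_label.setdefault(y, []).append(x)
    means = {y: mean(xs) for y, xs in by_label.items()}
    return lambda x: min(means, key=lambda y: abs(x - means[y]))


# Synthetic two-class data with one numeric attribute.
rng = random.Random(1)
data = [(rng.gauss(0, 1), "a") for _ in range(200)] + \
       [(rng.gauss(2, 1), "b") for _ in range(200)]

curve = learning_curve(data, induce_nearest_mean, [10, 50, 200], runs=3)
for n, avg, errs in curve:
    print(n, round(avg, 3), [round(e, 3) for e in errs])
```

Each tuple holds the training size, the average error, and the individual run errors.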
Figure 10-14 shows a learning curve for the Decision Tree Inducer on the churn dataset.
Figure 10-15 shows a learning curve for the Decision Tree Inducer on the adult dataset
with the label set to gross income binned at $50,000 (so one class is gross income less than
or equal to $50,000 and the other class is gross income >$50,000). The X axis shows the
number of records used for training the inducer; the Y axis shows the error. The graph
consists of four types of points:
• The yellow points are the actual error estimates taken from the runs.
• The white points are averages.
• The blue points interpolate between the white points.
• The red points show a 95% confidence interval about the average based on actual
error estimates for each run.

The more runs that are requested, and the bigger the test set (portion used to test), the
smaller the confidence interval. The error is generally reduced as more records are used
for training.
We can see that for the churn dataset, the error continues to decrease as the training set
size grows, while for the adult dataset, there is little advantage to training on the whole
dataset. On the adult curve, the third point represents about 13,000 records and has an estimated error of
16.96%, while the last point represents about 44,000 records and has an estimated error
of 16.85%.

Figure 10-14

Learning Curve for the Churn Dataset

Figure 10-15

Learning Curve for the Adult Dataset With Label Set to Gross Income
Binned at $50,000

A small sample might suffice for most of the study, with the full data used only for the
final runs. In many cases, a small sample can result in a sufficiently accurate classifier,
with the error reducing only slightly if the number of records is increased (diminishing
returns). Once a learning curve is seen, the desired sampling point can be determined,
and the “sample” transformation in Tool Manager can be used to generate a sample of
this size (see “The Sample Button” in Chapter 3). Small samples reduce the time needed
to build a classifier and make the knowledge discovery process more interactive.

Advanced Options
MineSet supports several advanced options for all inducers. These let you take into
account different costs for making mistakes, as well as an experimental
design that has a non-uniform sampling process (that is, some parts of the true
population are sampled more heavily than others). Another option lets you create more
complicated classifiers, which may have better accuracy at the expense of added compute
time.

Loss Matrices: Not All Mistakes Were Created Equally

Suppose you are trying to classify mushrooms as poisonous or edible. Classifying a
mushroom that is actually edible as poisonous might cost you $2, since you are not eating
it; however, classifying a poisonous mushroom as edible (that is, eating it) might incur the
cost of a $10,000 operation.
Figure 10-16 shows a confusion matrix for the mushroom dataset with the Decision Tree
Inducer when only a ratio of 0.1 (10%) was used for a training set.

Figure 10-16

Confusion Matrix for the Mushroom Dataset Using Default Settings

Eight records, representing poisonous mushrooms, were classified as edible (0.1%); 15
records, representing edible mushrooms, were classified as poisonous (0.2%). 3793
edible mushrooms and 3496 poisonous mushrooms were correctly classified. While the
error-rate for the classifier is only 0.31% (less than one percent), our estimated loss would
be $10000*8 + $2*15 = $80,030.
Figure 10-17 shows a confusion matrix for the same dataset, but with the Decision Tree
Inducer run using a loss matrix representing the above costs. The new classifier is very
conservative and makes no mistakes in classifying a poisonous mushroom as edible, but
it makes 1558 mistakes (1543+15) in classifying edible mushrooms as poisonous. The total
estimated loss we would incur is thus $10000*0 + $2*1558 = $3116, about 4% of the cost of
the classifier that did not take losses into account.
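
The loss arithmetic above can be checked with a short Python sketch. The `total_loss` helper and the dictionary layout are assumptions made for illustration; the counts and costs are the ones quoted in the text, and correct predictions are assumed to cost nothing.

```python
def total_loss(confusion, loss):
    """Sum count * cost over every (actual, predicted) cell."""
    return sum(count * loss[cell] for cell, count in confusion.items())


# Keys are (actual, predicted). Correct predictions are assumed to cost 0.
loss = {
    ("edible", "edible"): 0, ("edible", "poisonous"): 2,
    ("poisonous", "edible"): 10000, ("poisonous", "poisonous"): 0,
}

# Counts from the run with default settings.
default_run = {
    ("edible", "edible"): 3793, ("edible", "poisonous"): 15,
    ("poisonous", "edible"): 8, ("poisonous", "poisonous"): 3496,
}

# Only the error cells of the loss-matrix run are quoted in the text.
loss_aware_errors = {("poisonous", "edible"): 0, ("edible", "poisonous"): 1558}

default_loss = total_loss(default_run, loss)
aware_loss = total_loss(loss_aware_errors, loss)
error_rate = 100 * (15 + 8) / sum(default_run.values())

print(default_loss, aware_loss, round(error_rate, 2))
```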

Figure 10-17

Confusion Matrix for the Mushroom Dataset With Loss Matrix

Loss matrices also allow predicting unknown (null) values, which are shown as question
marks (?). For example, suppose it costs us $1 to ask an outside expert whether a
mushroom is poisonous or edible. In that case, some classifications result in an unknown
prediction. Running the Decision Tree Inducer yields the confusion matrix shown in
Figure 10-18, where there are 1551 unknowns, and only 15 edible mushrooms are
classified as poisonous. The overall cost is thus $10000*0 + $1*1551 + $2*15 = $1581.

Figure 10-18

Confusion Matrix for the Mushroom Dataset With Loss Matrix Allowing
Unknown Predictions

Note that loss matrices are based on probability estimates made at the leaves of the tree.
For reliable estimates:
1. Raise the "split lower bound" in Further Options of Decision Trees and Option Trees
from the default value to a higher value (for example, 5). In general, the larger and
noisier the training set, the higher this value should be.
2. Use large training sets. You might need large training sets to get reliable estimates
when the costs are not as extreme as in this example.
3. Use Option Trees. While they do not always help, they usually provide better
probability estimates that tend to reduce the loss. For example, running the above
example with $10000 changed to $100 and unknowns not allowed yields an
estimated loss of $1464 for Decision Trees and an estimated loss of $662 for Option
Trees.
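
One common way to combine a loss matrix with probability estimates is to choose the prediction whose expected loss is lowest. The sketch below illustrates that rule under stated assumptions: the source does not spell out MineSet's exact decision rule, `predict_min_loss` is a hypothetical helper, and the probability estimates are made up. The costs are those of the mushroom example, including the $1 cost of an unknown prediction.

```python
def predict_min_loss(probs, loss, predictions):
    """Pick the prediction with the lowest expected loss.

    probs: {actual_class: probability estimate at the leaf}
    loss:  {(actual_class, prediction): cost}
    """
    def expected(pred):
        return sum(p * loss[(actual, pred)] for actual, p in probs.items())
    return min(predictions, key=expected)


loss = {
    ("edible", "edible"): 0, ("edible", "poisonous"): 2, ("edible", "?"): 1,
    ("poisonous", "edible"): 10000, ("poisonous", "poisonous"): 0,
    ("poisonous", "?"): 1,
}
options = ["edible", "poisonous", "?"]

# Hypothetical leaf estimates.
sure_edible = predict_min_loss({"edible": 0.99999, "poisonous": 0.00001}, loss, options)
uncertain = predict_min_loss({"edible": 0.9, "poisonous": 0.1}, loss, options)
likely_poison = predict_min_loss({"edible": 0.1, "poisonous": 0.9}, loss, options)
print(sure_edible, uncertain, likely_poison)
```

With costs this lopsided, even a 10% chance of poison makes "edible" far too expensive, which is why reliable probability estimates at the leaves matter.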

Return-on-Investment Curves
A Return-on-Investment (ROI) curve is similar to a Lift Curve, but displays accuracy in
terms of loss rather than in terms of error, taking into account the Loss Matrix used. The
points in an ROI curve are ordered by the expected loss for each record, were it to be
labeled with the chosen label value. Similarly, the height of each point in the curve indicates
the cumulative profit (inverse loss), rather than the cumulative accuracy (inverse error),
of all records up to this point. The expected loss is computed by multiplying the entries in
the Loss Matrix under the chosen label's column (see “Loss Matrices,” under “Advanced
Options,” below) by the probabilities that the classifier assigns to the corresponding
classes, and summing the products. Hence, if the classifier is very sure about its prediction, the
expected loss will be low, and the record will appear near the left side of the ROI curve.
The idea behind the ROI curve is that the user will take an action for each individual
record in the dataset. That action will be the one associated with the chosen label value.
For example, in the churn dataset, the action associated with the label Yes, might be to
send that person some marketing material. This might stop the person from churning;
but the action is costly if done indiscriminately. The peak of the ROI curve shows
approximately how much money would have been saved on the test set, if the classifier
were used to predict whether or not to send the mailing to a particular person.
Special care needs to be taken when filling out a Loss Matrix for use with an ROI Curve.
The column under a certain predicted label determines the resulting ROI curve for that
label value. The entries in this column need to represent the expected gain or loss for
taking the action associated with that label value, on all of the possible classes. For
example, the entry under the column “prediction yes” in churn, under the row “actual
value no”, may contain the value 2 to indicate that the cost of mailing a brochure (the
action associated with “yes”) to someone who was not going to churn, is 2 dollars. On
the other hand, the entry under the column yes, row yes, may have a value of -10 to
indicate that a customer was prevented from churning, saving the company ten dollars
over the cost of the mailing.
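
The ROI curve construction can be sketched as follows. This is an illustration under assumptions, not MineSet's code: `roi_curve` is a hypothetical helper, and the probability estimates and actual labels are made up. The column values (2 and -10) are the ones from the churn example above.

```python
def roi_curve(records, loss_column):
    """records: list of (probs, actual), with probs = {class: probability}.
    loss_column: {actual_class: loss for taking the action}, that is, the
    Loss Matrix column under the chosen label (negative entries are gains).
    Returns the cumulative profit after acting on the first k records."""
    def expected_loss(probs):
        return sum(p * loss_column[c] for c, p in probs.items())

    # Records with the lowest expected loss come first (left side of the curve).
    ordered = sorted(records, key=lambda r: expected_loss(r[0]))
    profit, curve = 0.0, []
    for _, actual in ordered:
        profit -= loss_column[actual]  # profit is the negated actual loss
        curve.append(profit)
    return curve


column_yes = {"yes": -10, "no": 2}  # entries from the churn example
records = [  # hypothetical estimates and actual labels
    ({"yes": 0.9, "no": 0.1}, "yes"),
    ({"yes": 0.6, "no": 0.4}, "yes"),
    ({"yes": 0.5, "no": 0.5}, "no"),
    ({"yes": 0.2, "no": 0.8}, "no"),
    ({"yes": 0.1, "no": 0.9}, "no"),
    ({"yes": 0.05, "no": 0.95}, "no"),
]

curve = roi_curve(records, column_yes)
peak = max(curve)
print(curve, "peak after", curve.index(peak) + 1, "records")
```

In this toy data the profit peaks after two mailings; mailing everyone lowers it.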
Record Weighting: Not All Records Were Sampled Equally

In certain experimental designs, portions of the true population are sampled more
frequently than others. For example, while you might want a 1% sample of some
population, a small minority that is already 0.1% of the population results in a 0.001%
sample, which might be too small (for instance, you might get two people). Record
weighting lets you give each record a weight; thus, a subpopulation that was sampled
twice as frequently might get a weight of 0.5, while the rest of the population is given a
weight of 1.

As another example, a phone company stores all fraudulent phone calls in the dataset,
while storing only a small fraction of non-fraudulent calls. By using record weighting, it
is possible to give each record its true portion of the population.
Finally, some datasets are already aggregated, and the records have a natural “count”
associated with them (for example, statistics about cities in the U.S. usually have an
associated count of the population). This count attribute can be mapped to weight, which
is equivalent to replicating each record by the number of counts.
The semantics of record weighting is that a record weight of 2 is equivalent to two records
with a record weight of 1. Floating point weights are allowed.
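
The weighting semantics can be illustrated with a small sketch. Both helper functions are assumptions made for the example. The first shows that one record of weight 2 gives the same weighted error as two identical records of weight 1; the second derives weights as the inverse of a relative sampling rate, matching the "sampled twice as frequently, weight 0.5" example above.

```python
def weighted_error(records):
    """records: list of (actual, predicted, weight) tuples."""
    total = sum(w for _, _, w in records)
    wrong = sum(w for actual, predicted, w in records if actual != predicted)
    return wrong / total


def weights_from_sampling(relative_rates):
    """Weight each subpopulation by the inverse of its relative sampling rate."""
    return {name: 1.0 / rate for name, rate in relative_rates.items()}


# A weight of 2 behaves like two identical records of weight 1.
one_heavy = [("yes", "no", 2.0), ("no", "no", 1.0)]
two_light = [("yes", "no", 1.0), ("yes", "no", 1.0), ("no", "no", 1.0)]
print(weighted_error(one_heavy), weighted_error(two_light))

# A minority sampled twice as often as the rest gets half the weight.
weights = weights_from_sampling({"minority": 2.0, "rest": 1.0})
print(weights)
```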
Boosting: Accuracy is Sometimes Crucial

In some cases, the most important issue in creating a classifier is its error rate. For
example, suppose you have analyzed a dataset for churn prediction to a point that you
are satisfied with, and are ready to create a classifier that will predict which of your
customers are the most likely to churn. At this point, you are no longer interested in
visualizing your classifier, since you have a reasonably good understanding of the factors
that are involved. You also want to achieve the best classification accuracy possible. In
this case, you might want to enable Boosting, which is an algorithm that creates several
different classifiers and combines their predictions using a weighted voting scheme.
Boosting often improves classifier accuracy by focusing the induction process
on examples in the data which are harder to model than others.
Boosting will not always increase accuracy, but it often does. Boosted classifiers cannot
be visualized, though you can still see confusion matrices, lift curves, learning curves
and ROI curves for boosted classifiers. Boosting is a computationally intensive process,
often taking 25 times longer to run than the corresponding inducer without Boosting.
Backfitting does not work with Boosted classifiers, because of the special way Boosting
weights multiple classifiers and the records used to train them.
The theory behind boosting has not been generalized past two-class problems. MineSet
2.5 and later versions, however, allow you to use boosting with labels that have any
number of values. Boosting will not always improve the error rate of induced classifiers;
this caveat applies especially to problems that have more than two label values.

Boosting works by repeatedly assigning new weight distributions to the training set and
inducing classifiers on the reweighted sets. The number of times this occurs is limited by
the BOOST_NUM_TRIALS option, which can be set using the .mineset-classopt files on
the client (see Appendix I, “Command-Line Interface to MIndUtil: Analytical Data
Mining Algorithms,” for more information about the .mineset-classopt file). The number of
classifiers generated may be lower than this parameter if the training set error rate drops
to zero before this many classifiers are generated.
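
The reweighting loop can be illustrated with an AdaBoost-style sketch for a two-class problem. This is an assumption-laden illustration: the source does not say which boosting variant MineSet uses, and the one-attribute threshold "stump" learner, the function names, and the toy data are all hypothetical. The `num_trials` argument plays the role of the trial limit, and the loop stops early once the training set error reaches zero.

```python
import math


def train_stump(xs, ys, w):
    """Weak learner: the threshold and polarity on one numeric attribute
    with the lowest weighted error. Labels are +1 or -1."""
    best = None
    for t in sorted(set(xs)):
        for sign in (1, -1):
            err = sum(wi for x, y, wi in zip(xs, ys, w)
                      if (sign if x >= t else -sign) != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best


def predict(ensemble, x):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(a * (s if x >= t else -s) for a, t, s in ensemble)
    return 1 if score >= 0 else -1


def boost(xs, ys, num_trials):
    n = len(xs)
    w = [1.0 / n] * n
    ensemble = []  # list of (vote_weight, threshold, sign)
    for _ in range(num_trials):
        err, t, sign = train_stump(xs, ys, w)
        if err >= 0.5:  # weak learner is no better than chance
            break
        err = max(err, 1e-10)  # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, sign))
        # Raise the weight of misclassified records, lower the rest.
        w = [wi * math.exp(-alpha * y * (sign if x >= t else -sign))
             for x, y, wi in zip(xs, ys, w)]
        total = sum(w)
        w = [wi / total for wi in w]
        if all(predict(ensemble, x) == y for x, y in zip(xs, ys)):
            break  # training set error has dropped to zero
    return ensemble


# Toy data that no single threshold can separate.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
model = boost(xs, ys, num_trials=10)
print(len(model), [predict(model, x) for x in xs])
```

On this data the loop stops after three classifiers even though ten trials were allowed, because the combined vote already classifies every training record correctly.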
The following section describes the options provided for the classifiers by the Tool
Manager.

Inducer Modes in Tool Manager
There are four modes for running an inducer (shown in Figure 10-19):
• Classifier and Error
• Classifier Only
• Estimate Error
• Learning Curve

Figure 10-19

Options for Running the Inducer

The Classifier and Error mode uses a holdout method to build a classifier: a random
portion of the data is used for training (commonly two-thirds) and the rest for testing.
This holdout proportion can be set in Further Inducer Options (see “Error Estimation” on
page 324). This method is the default mode and is recommended for initial explorations.
It is fast and provides an error estimate.

The Classifier Only mode uses all the data to build the classifier. There is no error
estimation. Use this mode when there is little data or when you build the final classifier.
The Estimate Error mode assesses the error of a classifier that would be built if all the data
were used (as with Classifier Only mode). Estimate Error uses cross-validation, resulting
in long running times. Cross-validation splits the data into k folds (commonly 10) and
builds k classifiers. The process can be repeated multiple times to increase the reliability
of the estimate. You can set the number k and the number of times in Further Inducer
Options, as explained in “Error Options for Inducers,” below. Use this method when
there is little data. The induced classifier is exactly the same as the one induced by the
Classifier Only mode.
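
The cross-validation procedure can be sketched as follows. This is a generic illustration, not MineSet's implementation: the function names, the toy majority-class inducer, and the data are assumptions. The data is split into k folds, each fold is held out once while a classifier is built on the remaining folds, and the whole process can be repeated a number of times.

```python
import random


def cross_validation_error(records, induce, k=10, times=1, seed=0):
    """Average error over k folds, repeated `times` with fresh shuffles."""
    rng = random.Random(seed)
    fold_errors = []
    for _ in range(times):
        shuffled = records[:]
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]
        for i, test in enumerate(folds):
            # Train on every fold except the one held out for testing.
            train = [r for j, f in enumerate(folds) if j != i for r in f]
            classify = induce(train)
            wrong = sum(1 for x, y in test if classify(x) != y)
            fold_errors.append(wrong / len(test))
    return sum(fold_errors) / len(fold_errors), fold_errors


def induce_majority(train):
    """Toy inducer (an assumption): always predict the majority label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority


records = [(i, "a") for i in range(20)] + [(i, "b") for i in range(10)]
error, per_fold = cross_validation_error(records, induce_majority, k=10, times=2)
print(round(error, 4), len(per_fold))
```

Because k classifiers are built per repetition, the running time grows roughly with k times the number of repetitions.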
The Learning Curve mode assesses the effect of training set size on the error of a
classifier.

Error Options for Inducers
The following options are available to fine tune the error estimation for the inducers. The
Error Options available to you depend on the mode you have chosen.
In both Classifier & Error and Estimate Error, you can set a random seed that determines
how the data is split into training and testing sets. Changing the random seed causes a
different split of the data into training and test sets. If the error estimate varies
appreciably, the induction process is not stable.
In Classifier & Error (see Figure 10-20), you can set the Holdout Ratio of records to keep
as the training set. This defaults to 0.666667 (two-thirds). The rest of the records are used
for assessing the error.
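
A seeded holdout split can be sketched in a few lines. The `holdout_split` function is a hypothetical helper written for illustration; only the default ratio of 0.666667 comes from the text. The same seed reproduces the same split, and a different seed generally gives a different one, which is how the stability of the error estimate can be checked.

```python
import random


def holdout_split(records, holdout_ratio=0.666667, seed=0):
    """Shuffle with a fixed seed, keep `holdout_ratio` of the records
    for training and the rest for testing."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * holdout_ratio)
    return shuffled[:n_train], shuffled[n_train:]


records = list(range(150))
train, test = holdout_split(records, seed=7)
print(len(train), len(test))
```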

Figure 10-20

Error Estimation Options With Holdout

In Estimate Error (see Figure 10-21), you can set the number of folds in cross validation
and the number of times to repeat the process.

Figure 10-21

Error Estimation Options With Cross Validation

Backfitting
The Backfit test set option is a checkmark that can be found under Further Options for all
inducers when using Classifier & Error mode, and is shown in Figure 10-22. The backfit
checkmark is disabled when Boosting is enabled.

Figure 10-22

Backfitting, Confusion Matrices, Lift Curve, and ROI Curve Options

Confusion Matrices
The Display Confusion Matrix option is a checkmark under Further Options for all inducers
when using Classifier & Error mode, and is shown in Figure 10-22.

ROI Option
The Display ROI Curve option is a checkmark under Further Options for all inducers when
using Classifier & Error mode, and is shown in Figure 10-23. An ROI curve requires a label
value to be chosen. An ROI curve is then generated and displayed for that label value.

Figure 10-23

ROI Option for Generating a Return on Investment Curve

Lift Curves
The Display Lift Curve option is a checkmark under Further Options for all inducers when
using Classifier & Error mode, and is shown in Figure 10-22. A Lift Curve requires a label
value to be chosen. A lift curve is generated and displayed for that label value.

Loss Matrices
The Use Loss Matrix option is a checkmark under Further Options for all inducers (see
Figure 10-24). The Edit matrix button can then be used to define the loss matrix. To prevent
unknowns from being predicted, fill the unknown prediction column with the highest
value in the matrix.

Figure 10-24

Enabling Loss Matrices and Setting the Weight Attribute

Weight Setting
The Use Weight option is a checkmark under Further Options for all inducers (see
Figure 10-24). Choose the column for the weight. The Weight is Attribute option
determines whether the inducer can use this attribute for classification purposes or not.
In certain cases where the weight is a result of a stratified sample that is part of the
experimental design, the classifier should not be given access to the weight column as it
is not a property of the real-world entity.

Learning Curves
Learning Curve is a mode in the Classify menu of the Mining Tools tab. It can be used
with any of the inducers. When the Learning Curve mode is selected, the Further Options
dialog box lets you specify Learning Curve Options (shown in Figure 10-25), including:
• the number of points in the learning curve,
• the number of runs per point, and
• the number of records to use at the start and end points.

The number of records to use at each intermediate point is calculated automatically.

Figure 10-25

Learning Curve Options

The number of points in the learning curve must be specified, and it must be greater than
or equal to 1. The number of records for the starting and ending points can be specified
to generate a learning curve for a specific range of the training set. If either of
these options is left blank, it is calculated automatically based on the number of
points in the learning curve and the total number of records in the training set. This
default covers the entire range of the training set. For instance, assume a file containing
80,000 records. If you specified 3 points in the learning curve, the algorithm generates
points at 20,000, 40,000 and 60,000 records. Often it is useful to “zoom in” on a smaller
range. For example, a learning curve might be generated only for a range of 1000 to
10,000 records.
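
The spacing of the points can be sketched as follows. Both functions are guesses made for illustration: the text gives only one example of the default spacing (80,000 records, 3 points), and it does not state how intermediate points are placed in a user-specified range, so the even spacing used here is an assumption.

```python
def default_points(total_records, num_points):
    """Evenly spaced sizes that reproduce the 80,000-record example;
    the actual MineSet rule may differ."""
    step = total_records // (num_points + 1)
    return [step * i for i in range(1, num_points + 1)]


def spaced_points(start, end, num_points):
    """Evenly spaced sizes from start to end inclusive (assumed spacing)."""
    if num_points == 1:
        return [end]
    step = (end - start) / (num_points - 1)
    return [round(start + i * step) for i in range(num_points)]


print(default_points(80000, 3))
print(spaced_points(1000, 10000, 4))
```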
Generating a learning curve takes a significant amount of CPU time. If $t_i$ is the time to
train an inducer on training set $i$ (where $i$ ranges from 1 to the number of points), and
there are $k$ runs per point, the total time is $k \sum_i t_i$. Increasing the number of runs per
point increases the running time proportionally, but improves the estimate of the
average. The default value of the number of runs is 3.
The Scatter Visualizer’s filter panel can be used to filter some of the data types shown
(average points, confidence intervals, interpolated points, or actual trials). For example,
you might want to remove the data points for the trials and confidence intervals and
show only the averages and interpolated points.

OK and Cancel Buttons
Once you have specified the Classification Options, click OK to have these options take
effect and to return to the Data Destination panel. To return to the Data Destination panel
without having changes to the options take effect, click Cancel.

Go! Button
After you have set the options, click the Go! button in the Data Destination panel to run
the inducer. The appropriate visualizer will automatically be launched.

The Status Window
After you press Go! in the Data Destination panel, the Status Window at the bottom of
the Tool Manager’s main window shows the inducer’s progress and the output
classifier’s statistics. It displays specific information for the induced classifier. For
example, for Decision Trees it shows the number of nodes, the number of leaves, and the
depth of the Decision Tree (Figure 10-26). This information is saved automatically on
your workstation under the session file name with a -dt.out, -odt.out, or -eviviz.out
extension, depending on whether a Decision Tree, Option Tree, or Evidence Inducer was
executed.
For Classifier & Error, the first series of dots represents reading the file, then information
about the classifier build progress is shown, then the test set classification progress is
shown.
For Classifier Only mode, there is no test set classification phase.
For Estimate Error, the times and folds are shown.
For Learning Curve mode, each average point on the x-axis is described on a line, and each
run for that average point is represented by a dot.

Figure 10-26

The Status Window

When you have selected the Classifier & Error mode, the Status window contains the
following information:
• The random seed used to split the data into training and test sets.
• The number of records used for training the inducer.
• The number of records used for evaluating the resulting classifier; of the test
records, how many were seen during training, excluding the label attribute. It is
possible to have duplicate records (“seen”) in a dataset; some records can be in both
the training and test set. A large value of seen records indicates that there are many
duplicate records. If their labels are contradictory, it might be impossible to achieve
high accuracy without adding more attributes to the dataset.
• The number of correct and incorrect predictions made.
• The average normalized mean squared error represents the accuracy of the
probability estimates. For each test record, the mean squared error is the square of
one minus the probability estimate for the correct label value, plus the sum of the
squares of the probability estimates for the other (incorrect) label values. The
normalized mean squared error is half the mean squared error, which is a value
between zero and one. The average normalized mean squared error is the
normalized mean squared error averaged over all the records in the test set by their
appropriate weights (weighted average).
• The classification error, which is the percent of incorrect predictions.
• Both the average mean squared error and the classification error show the standard
deviation of the mean and the confidence interval for the mean. This is the range
you can expect from the classifier if the data comes from the same distribution. For
error estimates (not losses), a more accurate formula than the usual two-standard
deviation rule is used.
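
The average normalized mean squared error defined in the list above can be written out directly. The function names and the sample records are assumptions for illustration; the formula follows the description in the text.

```python
def normalized_mse(probs, actual):
    """Half of: (1 - p_correct)^2 plus the squared probabilities of the
    other label values. The result lies between 0 and 1."""
    mse = (1.0 - probs.get(actual, 0.0)) ** 2
    mse += sum(p ** 2 for label, p in probs.items() if label != actual)
    return mse / 2.0


def average_normalized_mse(records):
    """records: list of (probs, actual, weight); returns the weighted average."""
    total_weight = sum(w for _, _, w in records)
    return sum(w * normalized_mse(p, a) for p, a, w in records) / total_weight


# Hypothetical test records: a perfect estimate with weight 3 and a
# completely wrong estimate with weight 1.
records = [
    ({"yes": 1.0, "no": 0.0}, "yes", 3.0),
    ({"yes": 0.0, "no": 1.0}, "yes", 1.0),
]
print(normalized_mse({"yes": 0.5, "no": 0.5}, "yes"))
print(average_normalized_mse(records))
```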

When you have selected the Estimate Error mode, the Status window contains the
following information:
• The number of cross-validation folds and times.
• The random seed.
• The estimated accuracy with standard deviation.
• The 95% confidence interval for the estimated accuracy.

Applying Models, Testing Models, and Fitting New Data
The Apply Model button in the Data Transformations panel lets you:
• take a previously created classifier and apply it to new data.
• test a previously created classifier’s performance on the current table.
• fit the current table into a previously created classifier’s structure.

On the top left of this dialog box (Figure 10-27) is a list of all classifiers currently available
on the server. If you select a classifier, the right-hand side lists the column names and
types required by that classifier. If these requirements match the current table, a message
at the bottom states this, and the button at the bottom (OK, Run Test, or Fit Data) is
activated. If the current table does not have all the columns required for the selected
classifier, the message at the bottom states this, the columns that are missing are selected
in the list on the right, and the button at the bottom is deactivated.

Figure 10-27

The Test and Apply Model Dialog Box: Selecting a Classifier

Apply Model
The Apply Model panel is used to apply a previously created classifier to the current
table, as shown in Figure 10-28. There are two modes of application for the classifier:
• To Predict discrete label values for the records in the current table. For example, if you
created a classifier to determine churn, you can use this option to add a column that
labels each customer as either likely to churn or not likely to churn.
• To generate Estimated probability values for a specified label value. Instead of using
the classifier to predict the label value of each record, it is used to estimate the
probability that each record has a specified label value (for example, churn = yes).
Given the classifier created to determine churn, you can use this option to add a
column that indicates the probability that each customer is likely to churn.

The New column name text field lets you specify the name of the new column.

Figure 10-28

The Apply Model Panel

Test Model
The Test Model panel is used to test a previously created classifier on the current table,
as shown in Figure 10-29. The table must contain columns with the names and types
required by the selected classifier. Unlike Apply Model, Test Model also requires the
table to contain a label column with the same name and type as the label column used
when building the classifier.

The Test Model panel has options that let you
• show the confusion matrix of the classifier on the table records
• show the lift curve of the classifier for a specified label value
• show the ROI curve of the classifier for a specified label value
• show a visualization of the classifier with the table used as the test-set (this is only
relevant for Decision Tree and Option Tree classifiers)
• select an attribute to use as the record weight

The text field at the bottom of the Test Model panel shows the results.

Figure 10-29

The Test Model Panel

Fit Data to Model
The Fit Data to Model panel is used to fit the data in the current table to a previously
created classifier, as shown in Figure 10-30. This produces a new classifier with the same
structure as the original one; however, the new one uses the data from the table to update
the probability estimates (see “Backfitting in Error Estimation” on page 328). Because all
of the data from the table is being fit into the structure of the classifier, there is no error
estimation. Fit Data to Model cannot be used on classifiers that were built using boosting.
Use Test Model to evaluate the performance of the new classifier on a separate test set
(disjoint from the fit data).

The Fit Data to Model panel has options that let you
• show a visualization of the new classifier
• specify a name for the new classifier
• select an attribute to use as the record weight

Figure 10-30

The Fit Data to Model Panel

Special Options and Limitations
The following subsections describe how to set special options and the limitations of the
inducers.

Setting Special Options
When the Tool Manager runs an inducer on the server (the MIndUtil program), it passes
certain options to the inducers. Not all options are controlled through the Tool Manager
GUI. Those options not controlled by Tool Manager take on their default values and can
be overridden by setting them in a special file, called .mineset-classopt. Tool Manager
prepends this file to the options sent. This file is optional. Tool Manager looks for it first
in the current directory, then in your home directory. See Appendix I, “Command-Line
Interface to MIndUtil: Analytical Data Mining Algorithms” for more details about the
options.

The file should contain one line per option, in the following format: