
MineSet™ User's Guide
Document Number 007-3214-004
CONTRIBUTORS
Written by Dieter Rathjens and Helen Vanderberg
Illustrated by Dany Galgani
Production by Kirsten Pekarek
Engineering contributions by Barry Becker, Dave Bouvier, Cliff Brunk, Eric Eros,
Ariel Faigon, Eben Haber, Georges Harik, John Hawkes, Andy Kar, Ed Karrels,
Ronny Kohavi, Alex Kozlov, Clay Kunz, Peter Rathmann, Dan Sommerfield,
Peter Welch, and Brett Zane-Ulman.
St. Peter’s Basilica image courtesy of ENEL SpA and InfoByte SpA. Disk Thrower
image courtesy of Xavier Berenguer, Animatica.
© 1998, Silicon Graphics, Inc.— All Rights Reserved
The contents of this document may not be copied or duplicated in any form, in whole
or in part, without the prior written permission of Silicon Graphics, Inc.
RESTRICTED RIGHTS LEGEND
Use, duplication, or disclosure of the technical data contained in this document by
the Government is subject to restrictions as set forth in subdivision (c) (1) (ii) of the
Rights in Technical Data and Computer Software clause at DFARS 52.227-7013
and/or in similar or successor clauses in the FAR, or in the DOD or NASA FAR
Supplement. Unpublished rights reserved under the Copyright Laws of the United
States. Contractor/manufacturer is Silicon Graphics, Inc., 2011 N. Shoreline Blvd.,
Mountain View, CA 94043-1389.
Silicon Graphics and the Silicon Graphics logo are registered trademarks, and
IRIX, MineSet, and IRIS InSight are trademarks, of Silicon Graphics, Inc. Oracle is a
registered trademark, and SQL*Net is a trademark of Oracle Corporation.
INFORMIX is a registered trademark of Informix Software, Inc. Sybase is a registered
trademark, and SQL Server is a trademark of Sybase Inc. UNIX is a registered
trademark in the United States and other countries, licensed exclusively through
X/Open Company, Ltd. X Window System is a trademark of the Massachusetts
Institute of Technology.
The Tree Visualizer is patented under United States Patents No. 5,528,735, 5,555,354
and 5,671,381.
Contents
List of Figures xxiii
List of Tables xxxi
About This Guide xxxiii
Audience for This Guide xxxiii
Structure of This Document xxxiv
Illustration in This Guide xxxvii
Typographical Conventions xxxvii
1. Getting Started 39
MineSet Tools Suite 39
Tool Manager 41
DataMover 41
Association Rules Generator 41
Automatic Binning 42
Clustering 42
Column Importance 42
Decision Table Inducer and Classifier 43
Decision Tree Inducer and Classifier 43
Evidence Inducer and Classifier 43
Option Tree Inducer and Classifier 44
Regression Tree Inducer and Regressor 44
Cluster Visualizer 44
Decision Table Visualizer 45
Evidence Visualizer 45
Map Visualizer 45
Record Viewer 46
Rules Visualizer 46
Scatter Visualizer 46
Splat Visualizer 47
Statistics Visualizer 47
Tree Visualizer 47
Basic Tool Execution Scenario 48
2. Setting Up MineSet 51
Configuring the DataMover Server 51
The User Configuration File 51
File Handling 54
Mandatory Configuration File 54
Using MineSet With Existing Data Files 56
Using MineSet to Connect to Remote Databases 58
Loading Sample Datasets 59
3. The Tool Manager 63
Overview 63
Connecting to an Existing Data Source 64
Transforming the Data 64
Visualizing the Data on the Screen 65
Starting the Tool Manager 66
Choosing a Data Source 68
Choosing an Existing Data File 69
Choosing a Database Table 70
Transforming the Data 75
The Remove Column Button 76
The Bin Columns Button 77
Aggregation 83
The Filter Button 87
The Change Types Button 88
The Add Column Button 91
The Apply Model Button 92
The Sample Button 93
The Table History Buttons 94
The Current view is Field 94
The Prev and Next Buttons 94
Investigating the Data 99
Using Visualization Tools 99
Using Mining Tools 102
Using Data Files 107
Session Files 108
Pulldown Menus 109
The File Menu 109
The View Menu 111
The Visual Tools Menu 111
The Help Menu 112
The Tool Manager Options File 112
The Record Viewer 113
Color Options for the MineSet Visualizers 115
Choosing Colors 115
Using the Color Browser 117
4. Using the Statistics Visualizer 119
Overview of the Statistics Visualizer 119
File Requirements 121
Starting the Statistics Visualizer 121
Starting the Statistics Visualizer 122
Working in the Statistics Visualizer's Main Window 123
Pulldown Menus 123
The File Menu 123
The View Menu 124
The Help Menu 125
Sample Data Files 126
5. Using the Tree Visualizer 127
Overview of Tree Visualizer 127
File Requirements 129
Starting the Tree Visualizer 130
Configuring the Tree Visualizer Using the Tool Manager 132
Selecting the Tree Visualizer Tool 132
Undoing Mappings 134
Specifying Tool Options 134
Saving Tree Visualizer Settings 141
Invoking the Tree Visualizer 141
Working in the Tree Visualizer's Main Window 142
Highlighting an Object or Node 143
Selecting an Object 144
Spotlighting an Object 144
Using the Right Mouse Button 145
Navigating With the Middle Mouse Button 146
External Controls 147
Buttons 147
Thumbwheels 149
Height Slider 150
Pulldown Menus 150
The File Menu 151
The Show Menu 152
The Display Menu 165
The Selections Menu 166
The Go Menu 167
The Help Menu 169
Null Handling in the Tree Visualizer 170
Sample Configuration and Data Files 171
6. Using the Map Visualizer 173
Overview of Map Visualizer 173
File Requirements 176
Starting the Map Visualizer 178
Configuring the Map Visualizer Using the Tool Manager 180
Generating .gfx and .hierarchy Files 180
Selecting the Map Visualizer Tool 181
Mapping Columns to Visual Elements 182
Undoing Mappings 183
Slider Creation for Mapviz 183
Specifying Tool Options 184
Saving Map Visualizer Settings 189
Invoking the Map Visualizer 189
Working in the Map Visualizer's Main Window 189
Viewing Modes 191
External Main Window Controls 194
Buttons 194
Height-Adjust Slider and Label 195
Thumbwheels 196
The Animation Control Panel 196
Sliders Controlling Independent Dimensions 197
The Summary Window 199
Animation Buttons and Sliders 201
Pulldown Menus 204
The File Menu 204
The View Menu 204
The Selections Menu 208
The InterTool Menu 209
The Help Menu 209
Null Handling in the Map Visualizer 210
Sample Configuration and Data Files 211
7. Using the Scatter Visualizer 215
Overview of Scatter Visualizer 215
File Requirements 217
Starting the Scatter Visualizer 218
Configuring the Scatter Visualizer Using the Tool Manager 220
Selecting the Scatter Visualizer Tool 220
Mapping Requirements to Columns 221
Undoing Mappings 221
Slider Creation for Scatterviz 221
Specifying Tool Options 223
Invoking the Scatter Visualizer 229
Saving the Scatter Visualizer Settings 229
Null Handling in the Scatter Visualizer 229
Working in the Scatter Visualizer's Main Window 230
Viewing Modes 232
External Controls 234
The Animation Control Panel 234
Sliders Controlling Independent Dimensions 234
The Summary Window 238
Animation Buttons and Sliders 239
Pulldown Menus 241
The File Menu 241
The View Menu 241
The Selections Menu 244
The Help Menu 245
Sample Configuration and Data Files 245
8. Using the Splat Visualizer 249
Overview of the Splat Visualizer 249
Opacity 252
File Requirements 255
Starting the Splat Visualizer 255
Configuring the Splat Visualizer Using the Tool Manager 256
Selecting the Splat Visualizer Tool 257
Mapping Columns to Requirements 258
Undoing Mappings 258
Specifying Tool Options 258
Invoking the Splat Visualizer 262
Saving the Splat Visualizer Settings 262
Null Handling in the Splat Visualizer 263
Working in the Splat Visualizer's Main Window 264
Viewing Modes 264
External Controls 267
The Animation Control Panel 267
Sliders Controlling Independent Dimensions 267
The Summary Window 271
Animation Buttons and Sliders 272
Pulldown Menus 276
The View Menu 276
The Selection Menu 279
Splat Type Menu 282
Sample Configuration and Data Files 283
9. Using the Rules Visualizer 287
Overview of Rules Visualizer 287
Data Conversion 290
Association Rules Generator 290
Rules Visualization 292
File Requirements 294
Starting the Rules Visualizer 295
Configuring the Rules Visualizer Using the Tool Manager 297
Setting Up Associations 297
Applying Association Rule Options 299
Mapping Columns to Association Items 300
Specifying Ruleviz Options 301
Mapping Columns to Visual Elements 304
Invoking the Rules Visualizer 305
Working in the Rules Visualizer's Main Window 305
Viewing Modes 306
External Controls 308
The Height Slider 308
Pulldown Menus 309
The File Menu 309
The Filter Menu 310
The View Menu 312
The Help Menu 312
Sample Files 312
Sample Files for the Association Data Converter 312
Sample Files for the Association Rules Generator 313
Sample Files for the Rules Visualization Part 313
10. MineSet Inducers and Classifiers 315
Classifiers 315
Decision Tree Classifiers 316
Option Tree Classifiers 317
Evidence Classifiers 319
Inducers 320
Training Set 322
Applying a Model 322
Error Estimation 324
Backfitting in Error Estimation 328
Confusion Matrices in Error Estimation 329
Lift Curves in Error Estimation 330
Learning Curves in Error Estimation 332
Advanced Options 335
Return-on-Investment Curves 338
Inducer Modes in Tool Manager 340
Error Options for Inducers 341
Backfitting 342
Confusion Matrices 343
ROI Option 343
Lift Curves 343
Loss Matrices 344
Weight Setting 344
Learning Curves 344
OK and Cancel Buttons 345
Go! Button 346
The Status Window 346
Applying Models, Testing Models, and Fitting New Data 348
Apply Model 349
Test Model 349
Fit Data to Model 350
Special Options and Limitations 351
Setting Special Options 351
Default Limits and How to Override Them 352
Other Limitations 353
11. Inducing and Visualizing the Decision Tree Classifier 355
Overview 355
Inducing Decision Trees 356
File Requirements 357
Running the Decision Tree Inducer 357
Configuring the Decision Tree Inducer Using the Tool Manager 358
Discrete Labels 358
Classifier Name 359
Parallelization 359
Decision Tree Options 359
Working in the Tree Visualizer's Main Window 363
Nodes 363
Lines 364
Using the Main Window to Classify Records 365
External Controls 365
Pulldown Menus 366
The Search and Filter Panels 366
Sample Files 368
12. Inducing and Visualizing the Option Tree Classifier 377
Overview 377
Inducing Option Trees 380
File Requirements 380
Running the Option Tree Inducer 380
Configuring the Decision Tree Inducer Using the Tool Manager 381
Discrete Labels 381
Parallelization 382
Classifier Name 382
Option Tree: Further Options 382
Working in the Tree Visualizer's Main Window 385
Sample Files 385
13. Inducing and Visualizing the Evidence Classifier 389
Overview 389
Inducing Evidence Classifiers 397
File Requirements 398
Running the Evidence Inducer 398
Starting the Evidence Visualizer 399
Configuring the Evidence Inducer Using the Tool Manager 400
Discrete Labels 401
Classifier Name 401
Refining the Inducer With Further Options 401
Working in the Evidence Visualizer's Panes 403
Viewing Modes 405
External Controls 415
Sliders 415
Pulldown Menus 416
The File Menu 416
The View Menu 417
The Nominal Order Menu 418
The Selection Menu 418
Sample Files 420
14. Inducing and Visualizing the Decision Table 431
Overview 431
Inducing Decision Tables 436
File Requirements 437
Running the Decision Table Inducer 437
Starting the Decision Table Visualizer 438
Configuring the Decision Table Inducer Using the Tool Manager 439
Discrete Labels 440
Classifier Name 440
Exploring Data by Mapping Columns to Axes 440
Decision Table Options 441
Working in the Decision Table Visualizer's Main Window 442
Viewing Modes 444
External Main Window Controls 446
Sliders 446
Pulldown Menus 446
The File Menu 446
The View Menu 447
The Nominal Order Menu 447
The Selection Menu 448
The Help Menu 449
Sample Files 450
15. Inducing and Visualizing the Regression Tree 465
Overview 465
Running the Regression Tree Inducer 466
Configuring the Regression Tree Inducer Using the Tool Manager 467
Continuous Label 467
Regressor Name 468
Regression Tree Options 468
Error Estimation 471
Visualizing the Regression Tree 472
Lines 473
Using the Main Window to Predict Values 473
External Controls 474
Pulldown Menus 474
Sample Files 474
16. Inducing and Visualizing Clustering 479
Overview of Clustering 479
Using Clustering and the Cluster Visualizer 482
Single k-Means Clustering Method 483
Iterative k-Means Clustering Method 484
Evaluation of Clustering 485
Using Attribute Weights 486
Further Clustering Options 488
Starting the Cluster Visualizer 489
File Requirements 490
Working in Cluster Visualizer Main Window 490
Pulldown Menus 492
Sample File 492
Alternative Visualization of Clustering 492
17. Column Importance 493
Finding Important Columns 493
Column Importance Notes 497
Column Importance and Relation to Classifiers 497
The Discretization Process 497
The Importance Function 498
Dependence on Other Attributes 498
Sample File 499
18. Selection and Drill-Through 501
Multiple Selection 501
Drill-Through 502
Tree Visualizer Specific Details 503
Map Visualizer Specific Details 504
Scatter Visualizer Specific Details 504
Splat Visualizer Specific Details 504
Rules Visualizer Specific Details 504
19. File Exchange Between MineSet and SAS 505
Overview 505
Converting MineSet Data Files to SAS Data Sets 505
The -names namefile Command Line Option 506
The -svsc Option 506
Converting SAS Data Sets Into MineSet Data Files 507
The -nolabel Option 507
The -names namefile Option 507
The -nodata Option 508
The -svsc Option 508
20. MineSet Web Extensions 509
Overview 509
MineSet Web Extension Files 510
scripts Subdirectory 510
examples Subdirectory 510
examples/rview_dir Subdirectory 511
MineSet Web Installation (Client) 511
MineSet Web Installation (Server) 512
Setting Up the Server 512
Local Installation 513
MineSet mtr Files 514
Creating mtr Files 514
MineSet Remote View 516
Installing MineSet Remote View 516
Configuring and Using rview_dir.cgi 516
Configuring and Using rview_file.cgi 519
MineSet Web Extension Security-Related Issues 520
A. Flat File Support for MineSet 521
The Data File 521
Data Types 522
Arrays 523
The .schema File 525
Variable Names 525
Strings and Characters 526
Comments 526
File Statements 526
Data Statements 526
Input Options 529
Exceptions 529
B. Creating Data and Configuration Files for the Tree Visualizer 531
The Data File 531
Data Types 532
Enumerations 533
Arrays 534
The Configuration File 536
Sections 536
Options Files 536
Statements 537
Variable Names 537
Option Statements 537
Include Statements 538
Sinclude Statements 538
Strings and Characters 538
Keywords 539
Expressions 540
The Input Section 541
File Statements 541
Data Statements 542
Input Options 544
The Expression Section 546
The Hierarchy Section 547
Levels Statements 547
Key Statements 548
Aggregate Subsection 551
Aggregate Base Subsection 552
Expressions Subsection 553
Sort Statements 553
Hierarchy Options 554
The View Section 555
Height Statements 556
Base Height Statements 558
Disk Height Statements 559
Color Statements 560
Base Color Statements 562
Disk Color Statements 563
Label Statements 563
Message Statements 563
The View Options 565
C. Creating Data, Configuration, Hierarchy, and GFX Files for the Map Visualizer 573
The Data File 573
Data Types 574
Fixed Arrays 575
The Configuration File 576
Overview 576
Keywords 579
Expressions 580
The Input Section 581
The Expressions Section 587
The View Section 588
The Hierarchy File 595
The .gfx File 596
D. Creating Data and Configuration Files for the Scatter Visualizer 601
The Data File 601
Data Types 602
Arrays 603
Null Values 603
The Configuration File 604
Sections 604
Defaults Files 604
Statements 605
Variable Names 605
Options Statements 605
Include Statements 606
Sinclude Statements 606
Strings and Characters 606
Comments 606
Keywords 607
Expressions 608
The Input Section 609
File Statements 610
Enumeration Statements 610
Data Statements 612
Input Options 614
The Expressions Section 614
The View Section 615
Slider Statement 616
Entity Statement 616
Size Statement 617
Color Statement 618
Axis Statement 621
Summary Statement 622
Message Statement 624
Execute Statement 625
The Filter Statement 625
View Options 626
E. Creating Data and Configuration Files for the Splat Visualizer 627
The Data File 627
Data Types 628
Null Values 629
The Configuration File 629
Sections 629
Defaults Files 630
Statements 630
Variable Names 630
Options Statements 631
Include Statements 631
Sinclude Statements 631
Strings and Characters 632
Comments 632
Keywords 632
The Input Section 633
File Statements 633
Enumeration Statements 634
Data Statements 636
Input Options 637
The View Section 637
Slider Statement 638
Opacity Statement 638
Color Statement 640
Axis Statement 643
Summary Statement 644
View Options 645
F. Creating Data and Configuration Files for the Rules Visualizer 647
The Association Data Converter 648
Association Data Converter File Requirements 648
Files Generated by the Association Data Converter 650
The Association Data Converter Command-Line Operation 650
Association Data Converter Examples 651
Association Rules Generator 652
Association Rules Generator Files Requirements 652
Association Rules Generator Command-Line Operation 652
Association Rule Examples 657
Rules Visualization 663
Rules Visualization File Requirements 663
G. Format of the Evidence Visualizer's Data File 677
H. Creating Data and Configuration Files for the Decision Table Visualizer 681
Sample File 682
I. Command-Line Interface to MIndUtil: Analytical Data Mining Algorithms 683
MIndUtil Invocation and Options 683
General Options 687
Induction Modes 690
Decision Tree Inducer Options 692
Option Tree Inducer Options 693
Evidence Inducer Options 693
Decision Table Inducer Options 694
Regression Tree Inducer Options 695
Estimate Error 695
Learning Curve 696
Clustering 696
Discretization 697
Column Importance and Auto Selection 698
Fit-Data 699
MineSet-to-MLC, MLC-to-MineSet 699
Visualize 700
J. Nulls in MineSet 701
Semantics of Nulls 701
Representation of Nulls 702
Operations on Nulls 702
Arithmetic Expressions 702
Boolean Expressions 702
Relational Operations 703
Testing for Nulls 703
Aggregations in the Presence of Nulls 704
Sort Order for Nulls 705
Bins and Arrays With Nulls 705
K. Further Reading and Acknowledgments 707
Further Reading 707
Acknowledgments 711
Index 713
List of Figures
Figure 1-1 Tool Execution Sequence 48
Figure 3-1 The Tool Manager Startup Window 67
Figure 3-2 File Pulldown Menu 68
Figure 3-3 Open New Data File Dialog Box 69
Figure 3-4 Choosing New Database Table Dialog Box 71
Figure 3-5 Specifying Server Name, Login, and Password 71
Figure 3-6 Sample Dialog Box Listing Available DBMS Names/Vendors 72
Figure 3-7 Dialog Box After Selecting Informix or Sybase DBMS 73
Figure 3-8 SQL Query Dialog Box 74
Figure 3-9 The Data Transformations Panel 75
Figure 3-10 Bin Columns Dialog Box 77
Figure 3-11 Binning With Automatically Computed Thresholds 79
Figure 3-12 Aggregate Dialog Box 86
Figure 3-13 Filter Dialog Box 87
Figure 3-14 Change Types Dialog Box 88
Figure 3-15 Types Popup List 89
Figure 3-16 The Add Column Dialog Box 91
Figure 3-17 Sampling Dialog Box 93
Figure 3-18 Table History Buttons 94
Figure 3-19 View History Dialog Box 96
Figure 3-20 Zoom Buttons 97
Figure 3-21 Overview Button 97
Figure 3-22 Vertical/Horizontal View Button 97
Figure 3-23 Data Destination Panel 100
Figure 3-24 Columns Mapped to Requirements 101
Figure 3-25 The Associations Tab 103
Figure 3-26 The Column Importance Tab 104
Figure 3-27 Advanced Mode of Column Importance 105
Figure 3-28 The Data Files Panel 107
Figure 3-29 File Menu 109
Figure 3-30 View Menu 111
Figure 3-31 Sample Record Viewer Screen 114
Figure 3-32 Configuration Option With a Single Color Swatch 115
Figure 3-33 Color Browser 115
Figure 3-34 Multiple Colors Swatches 116
Figure 3-35 Scroll Arrows on Color Browser 116
Figure 3-36 Color Browser Out of Colors 116
Figure 4-1 Numeric Column Displayed by Statistics Visualizer 120
Figure 4-2 Discrete Column Displayed by Statistics Visualizer 120
Figure 4-3 File > Open Menu Selection for Statistics Visualizer 121
Figure 4-4 Data Destination Panel With Statistics Visualizer Selected 122
Figure 4-5 StatViz View Pulldown Menu 124
Figure 4-6 Statistics Visualizer Help Menu 125
Figure 5-1 Example Display in the Tree Visualizer's Main Window 128
Figure 5-2 Tree Visualizer's File Pulldown Menu 130
Figure 5-3 Data Destination Panel of Tool Manager With Tree
Visualizer Selected 133
Figure 5-4 Tree Visualizer's Configuration Options Dialog Box 135
Figure 5-5 Tree Visualizer's Initial View When Specifying store.treeviz 142
Figure 5-6 A Highlighted Object and the Information It Represents 143
Figure 5-7 Example of a Selected (Spotlighted) Object 145
Figure 5-8 Example of the Square as Navigational Base 146
Figure 5-9 Tree Visualizer's External Button Controls 147
Figure 5-10 Tree Visualizer's Thumbwheels 149
Figure 5-11 Tree Visualizer's Height Slider 150
Figure 5-12 Tree Visualizer's File Pulldown Menu With Options 151
Figure 5-13 Tree Visualizer's Show Pulldown Menu With Options 152
Figure 5-14 Tree Visualizer's Overview Window 153
Figure 5-15 Tree Visualizer's Search Dialog Box 154
Figure 5-16 Sample Results of a Search in the Tree Visualizer 155
Figure 5-17 Detail of the Tree Visualizer's Search Dialog Box 156
Figure 5-18 Tree Visualizer's Filter Dialog Box 159
Figure 5-19 Tree Visualizer's Marks Panel 163
Figure 5-20 Window Resulting From Clicking Mark Button 163
Figure 5-21 Main Window With Flags Representing Marks 164
Figure 5-22 Tree Visualizer's Display Menu 165
Figure 5-23 Tree Visualizer's Selection Menu 166
Figure 5-24 Tree Visualizer's Go Pulldown Menu 167
Figure 5-25 Tree Visualizer's Help Pulldown Menu 169
Figure 5-26 Representation of a Null Value Mapped to Height, Color,
Disk, and Label 171
Figure 6-1 Sample Map Visualizer Screen Showing 1990 U.S. Population 174
Figure 6-2 Sample Map Visualizer Screen Showing Relative Population
of Major U.S. Cities 175
Figure 6-3 Sample Map Visualizer Screen Showing the United States
With Specific Endpoints 176
Figure 6-4 Map Visualizer's Startup Screen, With File Pulldown
Menu Selected 178
Figure 6-5 Data Destination Panel, With Map Visualizer Selected 182
Figure 6-6 Map Visualizer's Options Dialog Box 185
Figure 6-7 Population.usa.mapviz Example With the Slider Moved to 1990 190
Figure 6-8 Highlighted Information in the Viewing Window and
Selected Information 192
Figure 6-9 Detail View of Top Right Buttons 194
Figure 6-10 Lower Half of Window With Thumbwheels 196
Figure 6-11 Map Visualizer's Summary Window With Slider and
Animation Controls 197
Figure 6-12 Map Visualizer's Summary Window With One Slider and
Animation Controls 198
Figure 6-13 If There Are No Independent Dimensions, No Animation
Control Panel Appears 199
Figure 6-14 Map Visualizer's View Pulldown Menu 204
Figure 6-15 Map Visualizer Filter Panel 205
Figure 6-16 Map Visualizer Selections Menu 208
Figure 6-17 Map Visualizer's InterTool Pulldown Menu 209
Figure 6-18 Representation of a Null Value Mapped to Height
(Top Middle Object) and to Color (Bottom Right Object) 211
Figure 7-1 Sample Scatter Visualizer Screen 216
Figure 7-2 Scatter Visualizer Start-Up File Pulldown Menu Selected 219
Figure 7-3 Data Destination Panel With Scatter Visualizer Selected 220
Figure 7-4 Scatter Visualizer's Options Dialog Box 224
Figure 7-5 Initial View When Specifying company.scatterviz 231
Figure 7-6 Displayed Information When Cursor is Over a Selected Entity 233
Figure 7-7 Animation Control Panel With Summary Window and Both
Slider Controls 235
Figure 7-8 Animation Control Panel With Summary Window and One
Slider Control 236
Figure 7-9 Scatter Visualizer With No Independent Dimension or
Animation Control Panel 237
Figure 7-10 Scatter Visualizer View Menu 241
Figure 7-11 Scatter Visualizer Filter Panel 242
Figure 7-12 The Scatter Visualizer Selections Menu 244
Figure 8-1 Sample Splat Visualizer With One Slider Control 250
Figure 8-2 Shape of Opacity Function For Low and High Values of u 252
Figure 8-3 Image Where u = 5.3, and u = 30 253
Figure 8-4 Data Destination Panel With Splat Visualizer Selected 257
Figure 8-5 Splat Visualizer's Options Dialog Box 259
Figure 8-6 Pick Dragger Over Data 266
Figure 8-7 Animation Control Panel With Summary Window and
Both Slider Controls 268
Figure 8-8 Splat Visualizer Without Independent Dimension or An
Animation Control Panel 270
Figure 8-9 Changed Visualization as a Result of Moving the Slider
(Compare to Figure 8-1) 273
Figure 8-10 Splat Visualizer View Menu 276
Figure 8-11 Splat Visualizer Filter Panel 277
Figure 8-12 The Splat Visualizer's Selection Menu 279
Figure 8-13 Image With Fixed Selection Box (Gray) and Active Selection
Box (Yellow) 280
Figure 9-1 Execution Sequence of the Rules Visualizer 289
Figure 9-2 Detail View of the Rules Visualizer's Main Window 293
Figure 9-3 Initial Tool Manager Window for Association Generation 298
Figure 9-4 Association Rule Options Dialog Box 299
Figure 9-5 Association Mappings Dialog Box 300
Figure 9-6 Rule Visualizer Options Dialog Box 301
Figure 9-7 The Rules Visualizer's Mappings Panel 304
Figure 9-8 Initial Rules Visualizer View When Specifying group.ruleviz 305
Figure 9-9 Cursor Over a Rules Visualizer Object 307
Figure 9-10 Rules Visualizer's Height Slider 308
Figure 9-11 Rules Visualizer File Menu 309
Figure 9-12 Rules Visualizer Filter Panel 310
Figure 9-13 Rules Visualizer View Menu 312
Figure 10-1 The Decision Tree Generated by the Decision Tree Inducer
for Churn Dataset 316
Figure 10-2 The Option Tree Generated by the Option Tree Inducer for
the Cars Dataset 317
Figure 10-3 Results of Evidence Inducer for Iris Dataset 319
Figure 10-4 Method for Building a Classifier 320
Figure 10-5 Using a Classifier to Label New Records 320
Figure 10-6 Tool Execution Sequence for Classifiers 321
Figure 10-7 Sample Records From a Training Set 322
Figure 10-8 Iris Dataset Misclassification, Example 1 323
Figure 10-9 Iris Dataset Misclassification, Example 2 324
Figure 10-10 Estimating the Classifier's Accuracy 326
Figure 10-11 Classifier Cross-Validation (k=3) 327
Figure 10-12 Confusion Matrix for Iris Dataset 329
Figure 10-13 Lift Curve for the Churn Dataset 331
Figure 10-14 Learning Curve for the Churn Dataset 333
Figure 10-15 Learning Curve for the Adult Dataset With Label Set to Gross
Income Binned at $50,000 334
Figure 10-16 Confusion Matrix for the Mushroom Dataset Using
Defaults Settings 335
Figure 10-17 Confusion Matrix for the Mushroom Dataset With Loss Matrix 336
Figure 10-18 Confusion Matrix for the Mushroom Dataset With Loss Matrix
Allowing Unknown Predictions 337
Figure 10-19 Options for Running the Inducer 340
Figure 10-20 Error Estimation Options With Holdout 341
Figure 10-21 Error Estimation Options With Cross Validation 342
Figure 10-22 Backfitting, Confusion Matrices, Lift Curve, and ROI
Curve Options 342
Figure 10-23 ROI Option for Generating a Return on Investment Curve 343
Figure 10-24 Enabling Loss Matrices and Setting the Weight Attribute 344
Figure 10-25 Learning Curve Options 345
Figure 10-26 The Status Window 346
Figure 10-27 The Test and Apply Model Dialog Box: Selecting a Classifier 348
Figure 10-28 The Apply Model Panel 349
Figure 10-29 The Test Model Panel 350
Figure 10-30 The Fit Data to Model Panel 351
Figure 11-1 Decision Tree for the Iris Dataset 356
Figure 11-2 Data Destination Panel in Tool Manager Showing Classifiers 358
Figure 11-3 Further Inducer Options 360
Figure 11-4 Tree Visualizer's Search Dialog Box 366
Figure 12-1 Option Decision Tree for the Cars Dataset 379
Figure 12-2 Data Destination Panel in Tool Manager Showing Classifiers 381
Figure 12-3 Further Inducer Options 383
Figure 13-1 The Evidence Visualizer Applied to the Iris Dataset 390
Figure 13-2 Evidence Visualizer Showing Probabilities 391
Figure 13-3 Selecting sepal length < 5.45 and sepal width > 3.05 Using the
Iris Dataset 394
Figure 13-4 Selecting Two Contradictory Pies Results in a Gray Pie
on the Right 395
Figure 13-5 Veil-Color Attribute in the Mushroom Dataset 396
Figure 13-6 File > Open Menu Selection 399
Figure 13-7 Tool Manager With Data Destination Panel Showing Classifiers 400
Figure 13-8 Classification Options Dialog Box Without Accuracy Estimate 402
Figure 13-9 Evidence Visualizer Window for cars.eviviz 404
Figure 13-10 Label Value Japan Selected Using the Cars Dataset 406
Figure 13-11 Loss Matrix to Avoid Predicting Poisonous Mushrooms as
Being Edible 407
Figure 13-12 Loss Matrix Applied to Probabilities in the Label Probability Pane 408
Figure 13-13 Pie Charts With the First Binned Range of weightlbs Highlighted 409
Figure 13-14 Bar Chart With a Range Selected 411
Figure 13-15 Iris Dataset With the Value petal width .75 - 1.65 Selected 412
Figure 13-16 Bars Showing Evidence For iris-virginica 413
Figure 13-17 Bars Showing Evidence Against iris-virginica 414
Figure 13-18 Evidence Visualizer Height Scale Slider 415
Figure 13-19 Evidence Visualizer Detail Slider 416
Figure 13-20 Evidence Visualizer Percent Weight Threshold Slider 416
Figure 13-21 Evidence Visualizer's View Menu 417
Figure 13-22 Evidence Visualizer's Nominal Order Menu 418
Figure 13-23 Evidence Visualizer's Selection Menu 419
Figure 13-24 Filtered Adult Dataset With Multiple Selection 420
Figure 14-1 Decision Table for the Mushroom Dataset 432
Figure 14-2 Decision Table for the Mushroom Dataset, Showing Drill-Down 433
Figure 14-3 Mushroom Dataset Close-Up of odor=none and
spore-print-color=white 434
Figure 14-4 Data Destination Panel in Tool Manager Showing Classifiers 439
Figure 14-5 Further Inducer Options 441
Figure 14-6 Decision Table Showing Classifier Induced From adult94 Dataset 443
Figure 14-7 Example of Making Multiple Selections 445
Figure 14-8 Decision Table Visualizer's View Menu 447
Figure 14-9 Decision Table Visualizer's Nominal Order Menu 447
Figure 14-10 Decision Table Visualizer's Selection Menu 448
Figure 14-11 Drilling Down on the Churn Dataset 451
Figure 14-12 Decision Table Visualizer Using the Adult Dataset 455
Figure 14-13 Closer Inspection of the Adult Dataset 457
Figure 15-1 Regression Tree for the Adult Dataset 466
Figure 15-2 Data Destination Panel in Tool Manager Showing Regressors 467
Figure 15-3 Further Inducer Options 469
Figure 16-1 Clustering Visualization on Adult Dataset 480
Figure 16-2 The Clustering Tab 481
Figure 16-3 Clustering Using Iterative K-Means 484
Figure 16-4 Clustering Options Dialog Box 487
Figure 16-5 Cluster Visualizer Main Window 491
Figure 17-1 The Column Importance Tab 494
Figure 17-2 Advanced Mode of Column Importance 495
Figure 18-1 Table of Values for Selected Objects 502
List of Tables
Table 3-1 Aggregate Example 1 83
Table 3-2 Aggregate Example 2 83
Table 3-3 Aggregate Example 3 84
Table 3-4 Example of Binning 84
Table 3-5 Results When Making Total $ Spent an Array 84
Table 3-6 Results When Specifying Sex_bin 85
Table 3-7 Results of Making an Array by Age_bin and Sex_bin 85
Table 3-8 Results of Distributing Sex_bin and Indexing by Age_bin 85
Table 8-1 Ages 40 to 50 274
Table 8-2 Ages 50 to 60 274
Table 8-3 Interpolation Midway Between Table 1 and Table 2 275
Table 9-1 Association Rules Components 292
Table 9-2 Example of Hierarchical Levels 292
Table B-1 Keywords for the Tree Visualizer 539
Table C-1 Keywords for the Map Visualizer 579
Table C-2 Operators Used With Expressions 580
Table C-3 Characters That Can Follow the Percent Symbol in the
Format String 583
Table D-1 Scatter Visualizer Keywords 607
Table D-2 Operators Used With Expressions 608
Table D-3 Characters That Can Follow the Percent Symbol in the
Format String 611
Table E-1 Splat Visualizer Keywords 632
Table E-2 Characters That Can Follow the Percent Symbol in the
Format String 635
Table F-1 Single-Item Format 649
Table F-2 Multiple-Item Format 649
Table F-3 Options for the Association Data Converter 650
Table F-4 Options for Controlling Rule Generation 653
Table F-5 Options for Restricting Generated Rules 654
Table F-6 Options for the mapassocgen Command 655
Table F-7 Example Hierarchy 656
Table F-8 Options Set 3 657
Table F-9 Data Example 2 658
Table F-10 Rule Generation Example 1 659
Table F-11 Example Hierarchy 660
Table F-12 Example of Rules at the Lowest Hierarchical Level 661
Table F-13 Second Example of Rules Generated at Lowest Hierarchical Level 663
Table F-14 Field Names and Types for Rules File 665
Table F-15 Operators Used With Expressions 666
About This Guide
The MineSet User's Guide describes the features and capabilities of this suite of nine
data mining and ten visualization tools. Current information about the MineSet
product can be found on the World Wide Web at
http://www.sgi.com/Products/software/MineSet
Audience for This Guide
If you are using the Tool Manager to extract data from a database into the MineSet tools,
you should understand database structures. It also would be helpful to know SQL.
If you are configuring the tools directly (through the configuration files, or through the
command line in the case of the association rules), you should have some knowledge of
UNIX as well as some programming experience.
Once the data has been loaded into the various visualization tools, you will not need a
database or programming background, although you will be able to interpret the
displays more easily if you have an understanding of the data and what it represents.
Structure of This Document
In addition to this preface, the documentation for MineSet consists of the following
chapters:
Chapter 1, Getting Started
This provides a brief overview of each MineSet tool and describes the processes that
occur when invoking and using a tool.
Chapter 2, Setting Up MineSet
This chapter describes how to set up MineSet by configuring the DataMover.
Chapter 3, The Tool Manager
This chapter describes the menus and functions of the initial interface for invoking tools
and tells how to produce their respective configuration files.
Chapter 4, Using the Statistics Visualizer
This chapter provides a description of the Statistics Visualizer. This tool is valuable for
comprehending variations in statistics by comparing box plots and histograms.
Chapter 5, Using the Tree Visualizer
This chapter provides a complete description of the Tree Visualizer tool interface. This
tool is valuable for visualizing hierarchical data.
Chapter 6, Using the Map Visualizer
This chapter provides a complete description of the Map Visualizer interface. This tool is
valuable for visualizing data that is connected with a geographical location.
Chapter 7, Using the Scatter Visualizer
This chapter provides a complete description of the Scatter Visualizer interface. This tool
is valuable for visualizing multidimensional data.
Chapter 8, Using the Splat Visualizer
This chapter provides a complete description of the Splat Visualizer. This tool, which is
particularly well suited for application to very large datasets, lets you visually analyze
relationships among several variables, either statically or by animation.
Chapter 9, Using the Rules Visualizer
This chapter provides a complete description of the Rules Visualizer. This tool is valuable
for mining large datasets and visualizing correlations in that data.
Chapter 10, MineSet Inducers and Classifiers
This chapter provides a brief introduction to classifiers and regressors, and the
algorithms that generate them, called inducers. Specifically, it introduces the three
MineSet classifiers: Decision Tree, Option Tree, and Evidence.
Chapter 11, Inducing and Visualizing the Decision Tree Classifier
This chapter describes how to generate and use the Decision Tree Classifier. This tool is
valuable for classifying data according to a set of attributes by making a series of
decisions based on those attributes.
Chapter 12, Inducing and Visualizing the Option Tree Classifier
This chapter describes how to generate and use the Option Tree Classifier. This tool
assigns each record to a class. Option trees can contain special option nodes that allow
the classifier to consider the influence of splitting on multiple attributes simultaneously.
Chapter 13, Inducing and Visualizing the Evidence Classifier
This chapter describes how to generate and use the Evidence Classifier. This tool is
valuable for classifying data by examining the probabilities of a specified result
occurring based on a given attribute.
Chapter 14, Inducing and Visualizing the Decision Table
This chapter describes how to generate and use the Decision Table Classifier. This tool is
useful for examining data and visualizing correlations between pairs of attributes.
Chapter 15, Inducing and Visualizing the Regression Tree
This chapter describes how to generate and use the Regression Tree Regressor. This tool
is useful for predicting attributes that take continuous values, such as occur in real life.
Chapter 16, Inducing and Visualizing Clustering
This chapter describes how to generate and use clustering to explore data. This tool is
useful for detecting groups of records that have similar characteristics.
Chapter 17, Column Importance
This chapter provides a complete description of the column importance tool. It also
describes the relationship between column importance and the importance ranking in
the other data mining tools.
Chapter 18, Selection and Drill-Through
This chapter describes how to use multiple selection in the MineSet tools, as well as
the concept of drill-through.
Chapter 19, File Exchange Between MineSet and SAS
This chapter describes the support for file exchanges between the MineSet and SAS
formats.
Chapter 20, MineSet Web Extensions
This chapter describes the MineSet extensions that are provided to let you create or view
visualizations and/or interact with MineSet over the web.
Appendix A, Flat File Support for MineSet
This appendix describes the .schema and the .data files that are required for MineSet to
read flat files.
Appendix B, Creating Data and Configuration Files for the Tree Visualizer
This appendix explains the required formats of the Tree Visualizer data and
configuration files.
Appendix C, Creating Data, Configuration, Hierarchy, and GFX Files for the Map
Visualizer
This appendix explains the required formats of the Map Visualizer data, configuration,
hierarchy, and .gfx files.
Appendix D, Creating Data and Configuration Files for the Scatter Visualizer
This appendix explains the required formats of the Scatter Visualizer data and
configuration files.
Appendix E, Creating Data and Configuration Files for the Splat Visualizer
This appendix describes the format of the Splat Visualizer's data file.
Appendix F, Creating Data and Configuration Files for the Rules Visualizer
This appendix explains the required formats of the Rules Visualizer data and
configuration files.
Appendix G, Format of the Evidence Visualizer's Data File
This appendix describes the format of the Evidence Visualizer's data file.
Appendix H, Creating Data and Configuration Files for the Decision Table Visualizer
This appendix describes the format of the Decision Table's data file.
Appendix I, Command-Line Interface to MIndUtil: Analytical Data Mining
Algorithms
This appendix describes how the server side of MineSet handles classifiers,
regressors, discretization, column importance, file conversions, and their options.
Appendix J, Nulls in MineSet
This appendix describes how MineSet supports nulls in the data access tools, the mining
tools, and the visualization tools.
Appendix K, Further Reading and Acknowledgments
This appendix lists reference sources for further reading about concepts and their
implementations used in the MineSet tools. It also lists acknowledgments for data
sources used in the examples provided with these tools.
Illustration in This Guide
The hard copy of this documentation provides all screen shots and illustrations in black
and white. The online version, however, provides these visuals in full, original color.
Thus, if you are reading the hard copy version and find a particular graphic or screen
shot difficult to see, go to the respective page of the online version for greater clarity.
Typographical Conventions
The following type conventions and symbols are used in this guide:
Italics Executable names, filenames, program variables, tools, utilities, variable
command-line arguments, and variables to be supplied by the user in
examples, code, and syntax statements.
Bold Keywords
Fixed-width type
On-screen command-line text and prompts.
Bold fixed-width type
User input, including keyboard keys (printing and non-printing);
literals supplied by the user in examples, code, and syntax statements.
[ ] Syntax statement arguments surrounded by square brackets denote that
these arguments are optional.
Chapter 1
1. Getting Started
This introduction provides an overview of MineSet, an integrated suite of data mining
and visualization tools, and describes the basic tool execution scenario.
Note: Before using any of the MineSet tools, follow the installation and licensing
instructions in the MineSet release notes. Then your system administrator must set up
the DataMover configuration file. You can also choose to set up various options. The
setup details are described in Chapter 2.
MineSet Tools Suite
The MineSet suite of tools lets you mine and graphically display quantitative
information in ways that can help you better visualize, explore, and understand your
data. This suite of data mining and analysis tools can help you organize and examine
your data in new and meaningful ways. The mining tools automatically find patterns
and build models that can be viewed using the visualization tools. The visualization
tools can also be applied directly to the data for further insights. These tools provide an
enabling power that lets you gain a deeper, intuitive understanding of your data, and
helps you discover hidden patterns and important trends.
These tools provide a highly interactive, three-dimensional (3D) visual interface that lets
you manipulate visual objects on the screen, as well as search, filter and perform
animations. This ability to visualize and survey complex data patterns can prove
invaluable for decision support in business intelligence and knowledge management.
The MineSet suite consists of three basic components:
•  a centralized control module, consisting of a graphical user interface tool called the
   Tool Manager, and a process called the DataMover, which runs on the server part of
   MineSet's client/server architecture
•  analytical data mining, with nine data mining tools:
   –  Association Rules Generator
   –  Automatic Binning
   –  Cluster Generator
   –  Column Importance
   –  Decision Table Inducer and Classifier
   –  Decision Tree Inducer and Classifier
   –  Evidence Inducer and Classifier
   –  Option Tree Inducer and Classifier
   –  Regression Tree Inducer and Regressor
•  visualization tools, which let you view your data using ten different visual
   metaphors:
   –  Cluster Visualizer
   –  Decision Table Visualizer
   –  Evidence Visualizer
   –  Map Visualizer
   –  Record Viewer
   –  Rules Visualizer
   –  Scatter Visualizer
   –  Splat Visualizer
   –  Statistics Visualizer
   –  Tree Visualizer
The following sections provide a brief description of each of the above-mentioned
components.
Tool Manager
Each of the mining and visualization tools described below can be configured and started
via a consistent graphical user interface known as the Tool Manager. The Tool Manager
•  connects you to the server on which the analytical mining and transformations are
   performed
•  lets you access, query, and transform data
•  creates configuration files for each tool
DataMover
The DataMover is a process that runs on the server on behalf of the user. The DataMover
•  connects to databases and flat files (ASCII or binary), and retrieves the data
•  invokes the mining tools
•  performs additional data manipulation such as binning and aggregation
•  returns the data to the Tool Manager for distribution to the visualization tools
•  can store the data in files on the server or client for future operations.
Association Rules Generator
The Association Rules Generator processes an input file, then generates an output file
consisting of rules. These rules indicate the frequency with which one item occurs in a
record along with another item. The strength of the association is quantified by three
numbers.
•  The first number, the predictability of the rule, quantifies how often an item X and an
   item Y occur together as a fraction of the number of records in which X occurs. For
   example, given that someone has bought milk, how often do they also buy eggs?
•  The second number, the prevalence of the rule, quantifies how often X and Y occur
   together in the file as a fraction of the total number of records. For example, how
   often were milk and eggs bought together?
•  The third number is the expected predictability. This gives an indication of what the
   predictability would be if there were no relationship between the items in the
   record. For example, how often were eggs bought, regardless of whether milk was
   bought as well?
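
As an illustration only (not taken from MineSet), the following Python sketch computes these three numbers for a hypothetical rule "milk => eggs" over a handful of made-up transactions; the data and the function name rule_metrics are invented for this example.

    # Hypothetical market-basket records; each set is one customer transaction.
    transactions = [
        {"milk", "eggs", "bread"},
        {"milk", "eggs"},
        {"milk"},
        {"eggs", "butter"},
        {"bread"},
    ]

    def rule_metrics(transactions, x, y):
        """Compute the three numbers described above for the rule x => y."""
        total = len(transactions)
        with_x = sum(1 for t in transactions if x in t)
        with_y = sum(1 for t in transactions if y in t)
        with_both = sum(1 for t in transactions if x in t and y in t)
        predictability = with_both / with_x        # of the records with x, the fraction that also contain y
        prevalence = with_both / total             # fraction of all records containing both x and y
        expected_predictability = with_y / total   # fraction of all records containing y at all
        return predictability, prevalence, expected_predictability

    print(rule_metrics(transactions, "milk", "eggs"))   # (0.666..., 0.4, 0.6)

Here milk appears in three of the five records and eggs in two of those three, so the predictability is 2/3, the prevalence is 2/5, and the expected predictability is 3/5.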
Automatic Binning
Automatic Binning groups together closely spaced numerical data into discrete
categories. Some data mining algorithms, such as the Decision Tree Inducer, require
some discrete (categorical) data; similarly, visualization tools such as the Splat Visualizer
may need data categorized in this way.
MineSet can automatically determine these categories, or you can determine how you
need it done. Requirements can be as simple as dividing the data into three equal ranges;
or as complex as having MineSet choose ranges differentiated according to some chosen
attribute, at the same time discarding the outer five percent of the data as outliers.
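
Purely for illustration, here is a minimal Python sketch of equal-range binning with optional trimming of the tails; the function equal_range_bins and its parameters are hypothetical and are not MineSet's actual binning algorithm or option names.

    import numpy as np

    def equal_range_bins(values, n_bins=3, trim=0.0):
        """Split a numeric column into n_bins equally wide ranges.
        Optionally ignore the outer `trim` fraction on each tail when
        computing the bin edges (e.g. trim=0.05 discards 5% per tail)."""
        v = np.asarray(values, dtype=float)
        lo = np.quantile(v, trim)
        hi = np.quantile(v, 1.0 - trim)
        edges = np.linspace(lo, hi, n_bins + 1)
        # np.digitize assigns each value to a bin; clip keeps trimmed outliers
        # in the first or last bin instead of creating extra categories.
        bins = np.clip(np.digitize(v, edges[1:-1]), 0, n_bins - 1)
        return bins, edges

    ages = [18, 22, 25, 31, 34, 40, 47, 52, 61, 95]
    bins, edges = equal_range_bins(ages, n_bins=3, trim=0.05)
    print(edges)   # four edge values defining three equally wide ranges
    print(bins)    # bin index (0, 1, or 2) for each input value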
Clustering
Clustering segments data into similar groups or clusters. For example, you can ask
MineSet to suggest a segmentation of customers into five distinct groups, without giving
any further parameters. Once the clustering operation has been run, you can view the
results in the Cluster Visualizer; or apply the clustering model to the current data, then
analyze the resulting clusters in any MineSet visualization or mining tool.
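
The sketch below uses scikit-learn's KMeans purely as a stand-in to illustrate the idea of segmenting records into five clusters; the customer data is randomly generated, and this is not MineSet's clustering implementation (MineSet's own k-means options are described in Chapter 16).

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical customer table: three numeric columns, e.g. age, income, visits.
    rng = np.random.default_rng(0)
    customers = rng.normal(size=(200, 3))

    # Ask for five clusters, analogous to "segment customers into five groups".
    model = KMeans(n_clusters=5, n_init=10, random_state=0).fit(customers)

    print(model.labels_[:10])       # cluster assignment for the first ten customers
    print(model.cluster_centers_)   # the "typical" customer of each cluster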
Column Importance
Column Importance determines how important various attributes are for determining
the value of a given label attribute. For example, you can ask MineSet to select
automatically the best three attributes that help determine whether someone is a good
credit risk. The system might select income, own-house, and car-cost. These attributes
can then be used to configure various visualizers.
Column Importance has an advanced mode that provides additional capabilities. First, it
lets you determine how important each of the attributes is. (For example, you could
determine that both income and salary are similar in importance in determining credit
risk. Although income might be slightly better in determining importance, you might
prefer to use salary because it is easier to obtain.) Second, once you explicitly choose an
attribute, you can determine what other attributes are important in conjunction with it.
(For example, if you have chosen salary rather than income, house-cost might become
more important than own-house, and income would have a very low importance.)
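
As a rough analogy only (not MineSet's algorithm), the sketch below ranks three invented columns by their mutual information with a label, which is one common way of scoring how much each column tells you about the label; every column name and value here is made up.

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif

    # Hypothetical columns (one row per customer) and a binary "good risk" label.
    rng = np.random.default_rng(1)
    income   = rng.normal(50, 15, 500)
    car_cost = rng.normal(20, 5, 500)
    children = rng.integers(0, 4, 500).astype(float)
    label    = (income + rng.normal(0, 5, 500) > 50).astype(int)

    X = np.column_stack([income, car_cost, children])
    names = ["income", "car_cost", "children"]

    # Score every column against the label, from most to least informative.
    scores = mutual_info_classif(X, label, random_state=0)
    print(sorted(zip(names, scores), key=lambda p: p[1], reverse=True))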
Decision Table Inducer and Classifier
The Decision Table Classifier classifies data by making a series of consecutive decisions
leading to the classification based on a record's attributes. It can be used to predict events
such as whether a bank customer is likely to default on a loan, or a homeowner is likely
to refinance their mortgage.
The Decision Table Inducer creates a Decision Table Classifier from the data. Attributes
are tested to classify the data, and you have the option to set the order in which the tests
are run as well. The resulting Decision Table Classifier can be viewed using the Decision
Table Visualizer, so you can simultaneously explore multiple attribute tests, two at a
time.
Decision Tree Inducer and Classifier
The Decision Tree Classifier classifies data according to a set of attributes by making a
series of decisions based on those attributes. Applying this classifier to determine the
profile of someone with credit worthiness, for example, a decision tree might determine
if someone who owns a home, owns a car that cost between $15,000 and $23,000, and has
two children, is a good credit risk.
The Decision Tree Inducer generates a Decision Tree Classifier, the structure of which is
displayed using the Tree Visualizer, each decision being represented by a node of the tree.
The graphical representation helps you understand the model, as well as gives valuable
insight into the data, by using visual searching and filtering.
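
A decision tree is, in effect, a set of nested attribute tests. The hand-written Python sketch below mimics that structure for the credit example above; the thresholds and class names are made up, and a real MineSet tree is induced automatically from training data rather than written by hand.

    def classify_credit_risk(record):
        """A hand-written stand-in for an induced decision tree: each nested
        test corresponds to a node, and each return value is a leaf (class)."""
        if record["owns_home"]:
            if 15000 <= record["car_cost"] <= 23000:
                return "good risk"
            return "unknown risk"      # a real tree would keep splitting here
        if record["children"] <= 2:
            return "fair risk"
        return "poor risk"

    print(classify_credit_risk({"owns_home": True, "car_cost": 18000, "children": 2}))
    # -> "good risk"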
Evidence Inducer and Classifier
The Evidence Classifier classifies data by examining the probabilities of a specified result
occurring based on a given attribute. For example, it might determine that someone who
owns a car that cost between $15,000 and $23,000 has a 70% chance of being a good credit
risk, and a 30% chance of being a bad credit risk. The classifier predicts the class with the
highest probability based on a simple probabilistic model.
The model is displayed using the Evidence Visualizer, which shows pie charts
illustrating the different probabilities. This graphical representation can help the user
understand the classification algorithm, as well as providing valuable insights into the
data and answering "what if" questions.
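
The guide says only that the model is a simple probabilistic one. Assuming a Naive Bayes-style combination of per-attribute evidence (an assumption made for illustration, not a statement of MineSet's exact algorithm), a minimal sketch with made-up probabilities looks like this:

    # Made-up prior and per-attribute evidence for a two-class credit example.
    prior = {"good": 0.6, "bad": 0.4}
    evidence = {
        # P(attribute value | class), one entry per observed attribute value
        "car_cost=15k-23k": {"good": 0.7, "bad": 0.3},
        "owns_home=yes":    {"good": 0.8, "bad": 0.5},
    }

    def classify(observed_values):
        """Combine the evidence for each class and normalize to probabilities."""
        scores = dict(prior)
        for value in observed_values:
            for cls in scores:
                scores[cls] *= evidence[value][cls]
        total = sum(scores.values())
        return {cls: s / total for cls, s in scores.items()}

    print(classify(["car_cost=15k-23k", "owns_home=yes"]))
    # {'good': 0.848..., 'bad': 0.151...} -- "good" has the higher probability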
Option Tree Inducer and Classifier
The Option Tree Classifier classifies data using a technique similar to the Decision Tree
Classifier. Unlike decision trees, option trees can contain special option nodes, which
allow the classifier to consider the influence of splitting on multiple attributes
simultaneously. For example, an option node in an option tree built to identify a car's
country of origin might choose miles per gallon, horsepower, number of cylinders, and
weight as informative attributes. In a decision tree, a node can choose at most one
attribute for consideration at a time. In an option tree, the results of all options are
"voted" when performing classification. Option trees are often more accurate than
decision trees; however, they generally are much larger.
The Option Tree Inducer generates an Option Tree Classifier from a training set in much
the same way that the Decision Tree Inducer generates a Decision Tree. The induced
option tree is displayed using the Tree Visualizer. This visualization helps you
understand the classifier, and provides insight into which attributes are important in
determining the value of the label.
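
To make the idea of option nodes concrete, the toy sketch below shows an option node whose subtrees each split on a different car attribute and whose answer is the majority vote; the attributes, thresholds, and classes are invented for illustration.

    from collections import Counter

    def option_node_predict(record, option_subtrees):
        """Each option subtree classifies the record using a different attribute;
        the option node returns the majority ("voted") class."""
        votes = [subtree(record) for subtree in option_subtrees]
        return Counter(votes).most_common(1)[0][0]

    # Hypothetical one-test subtrees, each splitting on a different attribute.
    by_mpg        = lambda r: "Japan" if r["mpg"] > 30 else "USA"
    by_horsepower = lambda r: "USA" if r["horsepower"] > 150 else "Japan"
    by_weight     = lambda r: "USA" if r["weight"] > 3500 else "Japan"

    car = {"mpg": 34, "horsepower": 90, "weight": 2100}
    print(option_node_predict(car, [by_mpg, by_horsepower, by_weight]))   # "Japan"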
Regression Tree Inducer and Regressor
The Regression Tree Regressor predicts continuous attributes, in the same way that
the Decision Tree and Option Tree Classifiers predict discrete attributes. While a classifier
predicts an event, such as whether a customer will churn (leave you) or not, a regressor
predicts specific numerical values, such as the profit margin for a business for the next
financial quarter.
The Regression Tree Inducer builds a Regression Tree Regressor model from your data.
As with Decision and Option Trees, this model can be viewed and analyzed using the
Tree Visualizer, so you can understand the basis from which its predictions are made.
Cluster Visualizer
The Cluster Visualizer displays statistics about the clusters or groups that are generated
by the clustering mining tool. It places these statistics side-by-side with those for the
entire data set, so that you can see which features make each cluster unique.
The Cluster Visualizer places the attributes in the display in the order of importance for
understanding the clustering. When you select one particular cluster, Cluster Visualizer
produces an ordering which is the most useful for discriminating between that cluster
and the remainder of the data set.
Decision Table Visualizer
The Decision Table Visualizer allows you to view the distribution of data from a discrete
column at multiple levels of a hierarchy. For example, you can examine the profitability
of a business along dimensions of product class, geography, sales promotions and
sales-representative compensation plan. The Decision Table Visualizer distributes the
data two attributes at a time, allowing you to drill down to further pairs of attributes at
each level.
The Decision Table Visualizer explores the results of the Decision Table Inducer, so that
the discrete column you examine is the label that the inducer classifies. When this is
done, the Decision Table Inducer arranges the attributes to determine which pair to
display first, and how to drill down from that top level to subsequent levels.
Evidence Visualizer
The Evidence Visualizer visually represents the model generated by the Evidence
Classifier. It initially shows cake charts that represent how the various attributes
contribute to the decision, and allow what-if analysis.
Map Visualizer
The Map Visualizer lets you visualize data relationships that exist across geographically
meaningful areas. For example, you can visualize different areas of a country, showing
the relative impact of a marketing program. The Map Visualizer's drill-down capabilities
let you focus on designated regions and perform a more detailed analysis in smaller
geographical elements. One application might be analyzing how one or more products
are being sold across different geographies. A powerful animation feature, coupled with
a capability to connect different views of the same or related data, permits fast
comparisons and difference analyses. This tool lets you visually examine patterns in your
data that are difficult to detect when that data is shown in a tabular, two-dimensional
form.
Record Viewer
The Record Viewer lets you view the data in the current table in a row/column
spreadsheet-like tool.
Rules Visualizer
The Rules Visualizer visually represents the model of the Association Rules Generator
mining tool. It provides detailed data analysis that lets you examine relationships across
data elements in new ways. In doing so, you might discover relationships that
significantly differ from what you might have expected; this, in turn, can lead to
important discoveries about your data or the processes behind that data. This tool's
visualization capabilities let you discover additional patterns of co-occurrence between
these data elements. For example, you can use the analysis of products sold during the
last sales promotion to guide your advertising campaign for the next sales period. The
Rules Visualizer's high performance would let you analyze the results from today's sales
data in time to alter the advertising campaign for the future.
Scatter Visualizer
The Scatter Visualizer lets you examine the behavior of data across eight different
dimensions. The data is shown in a grid representing up to three dimensions. Extra
dimensions can map to the size, color, and label of each displayed entity. Two further
independent dimensions can be assigned as dynamic dimensions. A slider can be used
to select specic values along those dimensions, or a path can be traced through those
dimensions, for animation. During the path traversal, the display changes automatically
to reflect the change in the independent variables.
Splat Visualizer
The Splat Visualizer produces 3D plots of very large data sets. Instead of showing
individual data points, it renders the density of data using varying opacity. It has many
of the same features as the Scatter Visualizer.
Statistics Visualizer
The Statistics Visualizer computes and displays summary information for the current
dataset (maximum, minimum, median, standard deviation, distinct values, and
quartiles).
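
For reference, the same summary statistics can be sketched in a few lines of Python with NumPy; this only illustrates what the Statistics Visualizer reports, not how it computes or displays it.

    import numpy as np

    column = np.array([3.1, 4.7, 4.7, 5.0, 6.2, 7.8, 9.4])

    summary = {
        "minimum":         column.min(),
        "maximum":         column.max(),
        "median":          np.median(column),
        "std deviation":   column.std(ddof=1),
        "distinct values": len(np.unique(column)),
        "quartiles":       np.percentile(column, [25, 50, 75]),
    }
    print(summary)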
Tree Visualizer
The Tree Visualizer helps you analyze data that has hierarchical relationships. It provides
an interactive "fly-through" capability for examining relationships among data at
different hierarchical levels. For example, the Tree Visualizer can be used to examine a
company's product line, graphically displaying each product's contribution to the
company's total revenue. Each branch of the hierarchy displays information at increasing
levels of detail, breaking revenues down by product lines and, eventually, individual
products. Another example of using the Tree Visualizer is to show company sales
revenue, displaying a company-wide total as well as sub-totals at regional and other
levels. The fly-through capability in the Tree Visualizer lets you rapidly reposition your
view of the data. The Tree Visualizer's filtering and searching capabilities let you focus
on specific data elements and queries.
The Tree Visualizer is also used to view the resulting models of the Decision Tree and
Option Tree Classifiers, and the Regression Tree Regressor, with each decision being
represented by a separate node in the tree. Each node also contains bars showing how the
data is modeled based on the decisions up to that point (for example, 73% of people who
own a home and have two children are good credit risks, while 27% are not).
Basic Tool Execution Scenario
Each of the MineSet tools is started, configured, and run in a consistent manner. The
sequence of actions you follow at your MineSet client and at the MineSet server is shown
schematically in Figure 1-1. A description of the steps inherent in this figure follows.
Figure 1-1 Tool Execution Sequence
[Figure 1-1 shows the MineSet client (user, Tool Manager, visualization tool, and their
configuration, data, and visual files) exchanging files with the MineSet server, where the
DataMover reads the user's data source and the inducer (MIndUtil) produces either a
model or information and statistics, such as an error estimate.]
The following steps describe a typical interaction with a MineSet tool, and the
sequence of the tool's actions. Depending on your requirements, some steps might be
skipped (for instance, if the data and configuration files have been generated in a
previous work session).
1. Start the Tool Manager, which is the graphical interface for generating and
specifying the configuration file, data file, and tools to be used. The Tool Manager
runs on your MineSet client.
2. The Tool Manager opens a network connection to the DataMover, which runs on the
MineSet server. In some cases the server is the same machine as your client
workstation; in others it is a separate machine.
3. Use the Tool Manager to specify
•  the database and table, or a binary or ASCII flat file containing the data on
either the client or the server
•  which mining or visualization tools are to be applied
•  how that data is to be displayed, through tool options
•  a session file to save the history of your work
Information retrieved via the DataMover is used to guide this interaction. As a
result, the Tool Manager generates a configuration file. This file contains the
user-defined parameters that determine the execution of the following steps.
4. The Tool Manager transmits a copy of the configuration file from step 3 to the
DataMover. The DataMover processes the file by
•  accessing the database or flat file
•  performing the specified data transformations
•  running the mining tools when requested
•  generating the visualization files when requested
These visualization files consist of your data in a specific format readable by the
MineSet tool. A copy of these visualization files is then transferred to the MineSet
client.
5. The Tool Manager invokes the appropriate MineSet visualization tool.
6. The tool accesses the visualization files and displays the data.
7. If you generated a model, that model can be applied to additional data (see
Figure 10-5).
Note: The MineSet client and server can run on different machines, using a network to
communicate. Because network bandwidth is often scarce, you should be cautious about
transferring large files between client and server regularly. If you are performing mining
operations on a large database or file, you can achieve greater efficiency by storing that
file on the server, where the DataMover runs, rather than on the client.
Chapter 2
2. Setting Up MineSet
This chapter describes how to set up MineSet, which requires configuring the
DataMover. The configuration has two parts:
•  configuring the user's account on the server (optional), and
•  a global configuration, which usually is done by the system administrator
Parallelization is offered through the multiprocessor (n32) version of MineSet only. The
DataMover is a process that runs on the server, although it is not directly accessible to
users. The DataMover provides access to databases and data stored in flat files, and
transforms data for the mining and visualization tools. The last section of this chapter
describes how to load sample datasets into the supported relational databases.
Configuring the DataMover Server
In order to use the MineSet tools, two configuration files must be created on the server:
one by you, the other by the system administrator.
The User Configuration File
Note: You must have a UNIX account on every server you want to access.
The DataMover creates files on the server machine on behalf of each user. The
DataMover configuration file, .datamove, lets you control where these files are created and
whether different classes of files are saved or discarded. This file is located on the server,
in your home directory. A sample .datamove file called datamove.sample is located on the
server, in the /usr/lib/MineSet/datamove directory.
If the .datamove file is absent, or if a particular entry is not present in the .datamove file, the
DataMover uses a default value for that entry.
Each entry in the DataMover's configuration file must be on a separate line. For example:
file_cache = directory_name
where file_cache specifies the location in which the DataMover stores its output data files
and models resulting from mining algorithms. If the file_cache directory does not exist,
the DataMover attempts to create it on its first invocation. The default file_cache directory
is ./mineset_files/%U. The %U is a wildcard that is filled in with the user's login name on
the client machine. This is useful in reducing contention if many users want to log in to
a common account on the server. If multiple sessions were simultaneously connected to
the same file_cache directory, they could overwrite each other's server files, causing
incorrect and unexpected results. To prevent this, the DataMover maintains a lock at the
file_cache directory level. The second and later attempts to connect to a particular
file_cache directory fail with an error message. You can recover from such
a failure by killing one of the DataMover sessions connected to the given file_cache
directory.
The file_cache should be a directory in a file system with sufficient room to hold all of a
user's output and temporary files. The DataMover creates this directory if it doesn't
already exist. These files are deleted when the DataMover no longer needs them, unless one
of the following options is set:
keep_client_upload
keep_client_download
keep_classifier_files
keep_classifier_options_files
keep_mlc_input
use_ascii_mlc_input
Each of these entries is described below.
keep_client_upload (default no)
Keep files uploaded from the client for processing. If kept, they will be in the client_upload
subdirectory.
keep_client_download (default no)
Retain on the server a copy of data files and visualizations after they are downloaded to
the client. If kept, the files will be in the client_download subdirectory.
keep_classifier_files (default yes)
Keep the persistent classifiers (decision trees and so forth) generated by mining
operations. Keeping these files is generally useful.
keep_classifier_options_files (default no)
Keep the options file that is used when generating, or inducing, the classifier. Keeping
these files is rarely useful. If kept, the files will be in the mlc_work subdirectory.
keep_mlc_input (default no)
Keep input files used for mining (MIndUtil or associations) operations. If kept, the files
will be in the mlc_work subdirectory.
use_ascii_mlc_input (default no)
Normally the DataMover creates MineSet binary files for MIndUtil input. If this option
is set, ASCII files are created instead.
aggregation_memory_limit (default 2147483647)
Memory limit (in bytes) for aggregation operations. This can be no larger than the
system-wide limit set in the dm_config file.
optimize_history (default yes)
The DataMover can rewrite histories to remove redundant computations. The
optimize_history parameter controls whether this is done. Because this rewriting can
speed up processing considerably, it is normally turned on.
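As an illustration only, a short .datamove file might look like the following (the cache
directory is a placeholder, and it is assumed here that the options take yes/no values in
the same name = value form shown above; any entry you omit keeps its default value):
file_cache = /disk2/mineset_cache/%U
keep_client_download = yes
keep_mlc_input = no
optimize_history = yes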
File Handling
A file in the file_cache directory is the result of a successful operation. If an operation
returns an error (that is, Tool Manager reports a message beginning "fatal error on
server"), nothing should be changed in the file_cache directory. Two examples help
illustrate the point:
•  Example 1: A user's file_cache directory contains the files cars.data and cars.schema,
both the result of a previous database query. The user then selects the same table,
and sets the output to a server file, filtering for examples with mpg>55. Since no
records in the dataset have mpg values this high, when the history executes, it
returns no rows, which is flagged as a fatal error. After this happens, the user's
file_cache directory still contains the old cars.schema and cars.data files.
•  Example 2: A user's file_cache directory contains the files cars.data and cars.schema,
both the result of a previous database query. The user then selects the same table,
and sets the output to a visualization. The operation completes and the
visualization launches successfully. Once again, the user's file_cache directory still
contains the old cars.schema and cars.data files. The file_cache directory is not updated
unless the user specifically chooses a server file as the output.
Mandatory Configuration File
If you are using relational databases, the MineSet DataMover server must be configured
to find information in the databases. The DataMover works with Oracle® versions 7.2 or
later, INFORMIX®, and Sybase®.
The DataMover server reads the /usr/lib/MineSet/datamove/dm_config file during startup.
This file is not created by Inst during installation; it must be created by the system
administrator, who must log in as root to edit it. It can be created via an editor such
as jot, vi, or Emacs. An example file can be found in
/usr/lib/MineSet/datamove/dm_config.sample. The format of this file is as follows:
Oracle {
"ORACLE_SID", "ORACLE_HOME";
}
Oracle_Remote {
"DATABASE_NAME", "ADMIN_DIRECTORY";
}
Informix {
"INFORMIXSERVER", "INFORMIXDIR";
}
Sybase {
"DSQUERY", "SYBASE";
}
Each optional entry describes the databases in use at your site. If your server is not
running any databases, that is, you intend to use MineSet with ASCII files only, simply
make an empty dm_config file.
The line "ORACLE_SID", "ORACLE_HOME" is filled in with the specific information and
repeated once for each Oracle database to be accessed via the DataMover. ORACLE_SID
and ORACLE_HOME are Oracle-specific parameters defining an Oracle instance.
The Oracle_Remote section is for accessing remote Oracle databases via SQL*NET V2.
The DATABASE_NAME entry is a logical name for the remote database, as defined in a
tnsnames.ora file. The ADMIN_DIRECTORY entry is where the DataMover searches for the
tnsnames.ora file. This file is described in Oracle's SQL*NET documentation. Remote
access to databases is described in more detail in "Using MineSet to Connect to Remote
Databases" on page 58.
Each line in the Informix section defines a database server that, in turn, can contain
several databases. The server is checked at runtime to determine which databases it
contains, so there is no need to record the individual databases in the dm_config file. The
first entry is the INFORMIX server (corresponding to the INFORMIXSERVER
environment variable), and the second is the INFORMIX directory (corresponding to the
INFORMIXDIR environment variable).
Each entry in the Sybase section defines a database server (or, in Sybase terminology, an
SQL Server). The first entry is the Sybase SQL Server name (corresponding to the
DSQUERY environment variable); the second is the Sybase home directory
(corresponding to the SYBASE environment variable).
An example configuration file might be as follows:
Oracle {
"v73", "/usr/people/oracle/v73";
"wrhse", "/opt/oracle";
}
Oracle_Remote {
"lifeseq", "/usr/lib/MineSet2/datamove/";
}
Informix {
"learn_online", "/u5/informix";
}
Sybase {
"MINESET", "/usr/sybase/10.0.2.4";
}
This configuration file lets the DataMover access:
•  three Oracle databases, one named v73 (installed in /usr/people/oracle/v73), another
named wrhse (installed in /opt/oracle), and a remote database named lifeseq;
•  an INFORMIX server;
•  a Sybase SQL Server.
Each of the INFORMIX and Sybase servers can, in turn, contain multiple databases.
For Sybase, the DataMover uses vendor-supplied shared libraries as its connection to the
databases. One of the purposes of the dm_config file is to specify where the DataMover must
look for these shared libraries. The DataMover looks in the $SYBASE/lib/ directory for the
following shared libraries: libct.so, libcs.so, libcomn.so, libintl.so, libtcl.so, libinsck.so.
Using MineSet With Existing Data Files
Sometimes it is convenient to use MineSet with data that is already stored as a file, but
requires further processing before it can be mined or visualized. In this case, the data file
can be made available (with a modest effort) to the Tool Manager/DataMover.
First, the data file must be in a tab-delimited format, with the same number of fields in
each line. A numeric or string field with a single ? character appearing between
delimiters is loaded as a Null value.
For a detailed discussion of null values, refer to Appendix J, "Nulls in MineSet."
The contents of the data file must be described to the Tool Manager/DataMover via a file
with the .schema extension. The format of the .schema file is shown next:
#
# A line beginning with a "#" is a comment
#
input {
# The first line lists the data file which is described. It
# must be a simple filename, not a path.
file "carmodels.data";
# Fields are listed left to right in the line, legal
# types are float, double, int, string, date, fixedString and
# dataString
# Be sure to end every line with a semicolon ";"
float mpg;
int cylinders;
float cubicinches;
int horsepower;
int weightlbs;
double timeaccelerate;
date when_introduced;
string origin;
fixedString(3) manufacturer_code;
dataString model;
}
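For illustration only, a couple of lines of a matching carmodels.data file might look like
the following (the values, and the date format shown, are hypothetical; each field is
separated by a single tab character, and the lone ? in the second line marks a Null date):
18.0	8	307.0	130	3504	12.0	01/01/70	USA	CHV	Chevelle Malibu
15.0	8	350.0	165	3693	11.5	?	USA	BUI	Skylark 320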
The schema and data files must be located in the same directory. If you prepare a dataset
in this fashion on the client machine, it can be opened with the Tool Manager's Find File
dialog. If the file requires any additional processing, it is copied to the server. Sometimes
this is not convenient, especially if the file already exists on the server, or is large. In this
case, the .schema and .data files must be copied (or symbolically linked) into your file_cache
directory on the server. The directory used as the file cache is specified in your .datamove
file; the default is ./mineset_files/%U, where %U becomes your login name on the client
machine.
For a more extended description of MineSet .schema files, see Appendix A.
Using MineSet to Connect to Remote Databases
Sometimes it might not be feasible to install the DataMover on the machine running the
database server. In this situation, the DataMover can be installed on an intermediate server
and can then use the database vendor's networking facility to connect to the
remote database. (This is sometimes referred to as a three-tier architecture.)
Oracle
MineSet supports two ways to access remote Oracle databases:
•  The remote database is specifically mentioned in the dm_config file. For this method,
add entries to the Oracle_Remote section of the dm_config file, as described in the
"Mandatory Configuration File" section, above. Every remote database named in
the dm_config file must be defined in the tnsnames.ora file (a sample entry appears
after this list). This file can be manually edited, or, more commonly, generated
automatically by a network administration tool provided by Oracle. If this method
is chosen, the only Oracle-specific file needed on the DataMover server is
tnsnames.ora; in particular, Oracle need not be installed on this machine.
•  A local Oracle installation is used as a gateway to a remote database. In this case, the
dm_config file requires an entry for the local Oracle installation, with ORACLE_HOME and
ORACLE_SID. This entry must be in the Oracle, not the Oracle_Remote, section. Entries
for any remote databases must be added to the
$ORACLE_HOME/network/admin/tnsnames.ora file of the Oracle installation on the
intermediate server.
Then, when users want to log in as user system, password manager, at database
remotedb, they must provide the name of the intermediate server in the Tool
Manager "Log on to server..." dialog and select the intermediate server's Oracle
database. When logging in to the database, use system@remotedb for the database
username, and manager for the password. (The added @remotedb specifies that
Oracle must use SQL*Net to connect to the remote database, instead of using a
local connection.)
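For reference, a tnsnames.ora entry for the logical name lifeseq used in the earlier
dm_config example might look roughly like the following (the host name, port, and SID
are placeholders; consult Oracle's SQL*NET documentation for the authoritative syntax
for your installation):
lifeseq =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = dbhost.yourcompany.com)(PORT = 1521))
    (CONNECT_DATA = (SID = lifeseq))
  )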
Operating across SQL*Net is substantially slower than a local connection, especially for
queries that return a large amount of data. If possible, install DataMover on the same
machine as the Oracle server.
Sybase
A Sybase installation is required on the intermediate DataMover server; this Sybase
installation need not be running an active database, but it is needed for access to the
shared libraries and the interfaces file.
In order to access the Sybase SQL Server running on the remote machine, the interfaces
file on the DataMover server machine must have an entry for this Sybase SQL Server.
Refer to your Sybase manuals for the procedure for creating such entries. Also, the
name of this Sybase SQL Server on the remote machine must be included in the dm_config
file on the intermediate DataMover server machine.
Once this setup is done, access to the Sybase SQL Server on the remote machine is
handled transparently. You can choose it and access data from it just like any other
database source, using the panels from the Tool Manager.
Loading Sample Datasets
This section describes how to load the sample datasets included with the MineSet
distribution into one of the supported relational databases.
Installed on the server in /usr/lib/MineSet/DBexamples are
•  all the sample data, along with a brief description of what it contains;
•  directions on how to load the data using the provided scripts.
Load the sample datasets into a database that has been set up on your server. The data
and these directions (README.server) are installed in /usr/lib/MineSet/DBexamples on the
server.
The /usr/lib/MineSet/DBexamples directory contains scripts for loading the complete set of
data files into one of the supported databases. To load the complete set of data, run one
of the following loader scripts, depending on which database you have. (This assumes
your database and environment are already set up.)
sh load_all_Oracle.sh <userid> <passwd>
sh load_all_Sybase.sh <userid> <passwd>
If you are going to work with an INFORMIX database, use the dbaccess interface to
select
create_all_Informix.sql
followed by
load_all_Informix.sql
Loading Individual Datasets
Alternatively, you can load, or reload, the sample data separately. Each data directory in
/usr/lib/MineSet/DBexamples on the server contains the files necessary to load the data into
any of the supported databases. These files are:
README - explains the data
*.sql - sets up an Oracle table
*.ctl - control file for loading into Oracle
*_syb.sql - sets up a Sybase table
*.bcp.fmt - Sybase format file
*_inf.sql - sets up an INFORMIX table
*_load.sql - loads the data into the INFORMIX table
In the *.ctl file, the separator is declared in the line
" fields terminated by X'20' "
The separator is specified in ASCII hexadecimal; thus:
X'20' is used for a space
X'2c' is used for a comma (,)
X'09' is used for a tab (\t)
Loading Into Oracle
Perform the following steps on the server with an Oracle database:
1. Ensure the following environment variables are set correctly:
ORACLE_HOME
ORACLE_SID
2. Type
sqlplus <userid>/<passwd>
SQL> @<dataset>.sql
Where dataset is the name of the dataset being loaded, and userid/passwd are your
assigned username and password for the Oracle database.
To delete an already existing table, type
SQL> drop table <dataset>;
3. Type
sqlload control = <dataset>.ctl userid = <userid>/<passwd>
log = /tmp/<dataset>.log direct = true
4. Check the resulting /tmp/<dataset>.log file to ensure the data was loaded correctly.
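As a concrete illustration, loading a hypothetical dataset named cars as an Oracle user
named mineset (the user name, password, and dataset name here are placeholders only)
would look like this:
sqlplus mineset/minesetpw
SQL> @cars.sql
SQL> exit
sqlload control = cars.ctl userid = mineset/minesetpw log = /tmp/cars.log direct = true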
Loading Into Sybase
Perform the following steps on the server with a Sybase database:
1. Ensure that the following environment variables are set:
SYBASE
DSQUERY
2. To create the table, type
isql -U<userid> -P<passwd> -i <dataset>_syb.sql
Where dataset is the name of the dataset being loaded, and userid/passwd are your
assigned username and password for the Sybase database.
To delete an already existing table, type
isql -U<userid> -P<passwd>
drop table <dataset>
go
3. To load the data, type
bcp <dataset> in <dataset>.data -U<userid> -P<passwd> -f
<dataset>.bcp.fmt
where dataset is the table name (created using <dataset>_syb.sql), in means
"load into the DBMS," <dataset>.data refers to the name of the ASCII data file, and -f
points to the already-created format file. (When reading in from a file, the data types are
character.)
Loading Into INFORMIX
Perform the following steps on the server with an INFORMIX database:
1. Ensure the following environment variables are set:
ONCONFIG
INFORMIXSERVER
INFORMIXTERM
2. To create the table, type
dbaccess
3. If necessary, log into the appropriate database.
4. Choose Query-language, then choose the appropriate database from those listed.
5. Choose <dataset>_inf.sql, and run it.
6. Choose <dataset>_load.sql, and run it (where <dataset> is the name of the dataset
being loaded).
Chapter 3
3. The Tool Manager
This chapter discusses the functions of the Tool Manager, which is the graphical user
interface (GUI) that lets you specify data and configuration information for the MineSet
tools in this package. It provides an overview of this interface, then describes every
component of each panel that this interface displays for all MineSet tools.
Note: Any screens dedicated to a specific tool are discussed in the chapter for that tool;
for example, the screen for specifying the Tree Visualizer's configuration file is discussed
in Chapter 5, "Using the Tree Visualizer."
Overview
The Tool Manager is the initial graphical user interface (GUI) you use for most of your
interactions with the MineSet components. With the Tool Manager you can select an existing
data source, transform or analyze that data, and visualize the results using any of the
individual MineSet tools. You can step through the process in these sections:
•  "Connecting to an Existing Data Source"
•  "Transforming the Data"
•  "Visualizing the Data on the Screen"
Note: The Tool Manager generally does not support data files not created by the Tool
Manager without some manual work to make them compatible.
Connecting to an Existing Data Source
You can specify the source of the data as being from a:
•  database table
•  database SQL query
•  file
Transforming the Data
Often the original data is unsuitable for mining or visualization. It may contain irrelevant
or redundant columns, data types that are not applicable for viewing, or inconsistencies
that result in unhelpful visualization. You can transform the data with the Tool Manager
to display it in a useful form in any of these ways:
•  mining tools—finds patterns in data
•  binning variables—discretizes column values into groups, such as grouping years
by decade
•  removing columns—excises unneeded columns to save space
•  adding new columns—creates columns that are functions of existing columns
•  aggregation—finds the average, sum, min, max, or counts of column values
•  filtering—selects a subset of the data based on an expression using column values
•  sampling—selects a random subset of the data
•  making arrays—takes the values of one column and turns them into an array
indexed by discrete values in another column
•  distributing columns—makes two or more new columns from a single column of
values, distributed by the discrete values of another column
Visualizing the Data on the Screen
The final step, having transformed the data, is to visualize the results. You can do this in
any of several ways; for example, you can display the data on the screen as:
•  a hierarchy (Tree Visualizer—Option, Decision, Regression)
•  a map (Map Visualizer)
•  a scatter plot showing relations of numerous independent variables (Scatter
Visualizer and Splat Visualizer)
•  associated rules (Rules Visualizer)
•  evidence and probability (Evidence Visualizer)
•  box plots and histograms (Statistics Visualizer and Cluster Visualizer)
•  layered tables or cakes (Decision Table)
With the Tool Manager you can map data values to specific visual elements on the screen,
such as:
•  colors
•  bars
•  heights
Finally, the Tool Manager lets you control those options not related to data, including:
•  background colors
•  grid spacing
•  label sizes
Starting the Tool Manager
You can run the Tool Manager in two modes:
•  interactive mode—the Tool Manager provides windows, menus, buttons, and so on,
to let you access, mine, and visualize your data. Interactive mode also lets you save
a description of your actions to a session file for future use.
•  batch mode—the Tool Manager performs all the actions described in a session file
without bringing up windows. For example, batch mode is useful for lengthy
computations that need to be done every night, so that the data can be fully
prepared each morning.
There are three ways to start the Tool Manager in interactive mode:
•  Double-click the MineSet icon, which is in the Applications or the MineSet page of
the icon catalog. The Tool Manager starts with the same configuration used in the
last Tool Manager session.
•  Double-click an icon representing a session file saved from a previous invocation of
the Tool Manager. This starts the Tool Manager with that session file.
•  Start the Tool Manager from the UNIX shell command line by entering this
command at the prompt:
mineset [ sessionFile ]
Here, sessionFile is optional and specifies the name of the session file to use. If
you do not specify a session file, MineSet starts up with the configuration
most recently used.
To start the Tool Manager in batch mode, enter this command at the UNIX shell prompt:
mineset_batch [-s serverPassword -d databasePassword] sessionFile
The -s and -d options let you specify the passwords for logging in to the server and the
database, respectively. If you do not specify these options, mineset_batch asks you to
type in the passwords; thus these options are useful when running mineset_batch from a
shell script. To specify that there is no password for either the server or the database, use -s
or -d followed by two double quotes, that is:
mineset_batch -s "" -d "" foo.mineset
If you specify one of the two passwords, you must specify both.
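For example, a nightly cron job could run a saved session in batch mode from a small
shell script like the following (the session file path and environment variable names are
placeholders):
#!/bin/sh
# Rebuild the nightly visualization files from a saved MineSet session.
mineset_batch -s "$SERVER_PASSWORD" -d "$DB_PASSWORD" /usr/people/analyst/nightly.mineset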
Figure 3-1 shows the Tool Manager's startup window.
Figure 3-1 The Tool Manager Startup Window
This window consists of two panels related to the specific dataset and tool chosen, and
two information sections. Specification of servers and data sources is done via popup
dialogs accessible from the File menu.
The panels and information sections are
•  Data Transformations, which lets you modify the data from your data source.
•  Data Destination, which lets you create visualizations based on your data, save the
data to a file, mine the data for association rules, create classifiers based on the data,
or find important columns in the data.
•  The top panel, which provides information on the currently selected data source.
•  The bottom panel, which contains a stream of information on the status of certain
operations.
The following sections describe each panel of the main Tool Manager window.
Choosing a Data Source
Data sources are selected using the first set of menu items in the File menu.
Figure 3-2 File Pulldown Menu
The first three options in the File menu let you select the data source from a
•  DBMS Table
•  DBMS Query
•  Data File
The fourth option, Connect to Server, lets you connect to a server without specifying the
data source.
You must connect to a server to get information from a database or mining tool, or to
apply transformations to an existing data file. Connecting is not necessary if you plan to
visualize an existing client data file without transforming it.
Choosing an Existing Data File
Use the Open New Data File menu option to work with an existing data file. When you
select this option, the dialog box in Figure 3-3 appears.
Figure 3-3 Open New Data File Dialog Box
This dialog box, which is similar to a standard file selection dialog box, provides a toggle
at the top to select client versus server files; it also has a label indicating the name of the
current MineSet server, and a push button that lets you log in to a new server. The radio
buttons at the top let you select files on your client machine (in any directory accessible
to you) or files that exist in your single cache on the DataMover server (see "Configuring
the DataMover Server" in Chapter 2).
When you select the name of a file from the list in the left window, the columns of that
data file are shown in the right window.
When you click the Change Server button, a dialog prompts you for a server name, login
name, and password to connect to the server (see Figure 3-5).
If you want to access a data file created outside of Tool Manager, you must create a
.schema file for it. This is a text file containing a configuration input section, which gives
the name of the data file and describes its layout. The Tool Manager supports input
sections similar to those for the Tree Visualizer (described in Appendix B), except that it
does not support variable length arrays or the monitor option.
Choosing a Database Table
Use the Open New DBMS Table menu option to work with tables in a DBMS. Selecting
this option causes the dialog box in Figure 3-4 to be displayed.
Figure 3-4 Choosing New Database Table Dialog Box
The name of the currently selected server appears to the left of the Change Server button.
If you click this button, the dialog box shown in Figure 3-5 appears. This lets you specify
a server name, login, and password.
Figure 3-5 Specifying Server Name, Login, and Password
Once you have logged in to a server, click the Change DBMS button to bring up a dialog
box that contains a popup menu listing DBMS names/vendors (see Figure 3-6). Select a
DBMS from the menu, and enter the login name and password to connect to the DBMS.
Note that the DBMS login and password are usually different from those required to
connect to the server.
Figure 3-6 Sample Dialog Box Listing Available DBMS Names/Vendors
If you have logged on to an Oracle DBMS, the dialog box appears as shown in Figure 3-4,
with a list of tables on the left. When you select a table, the columns for that table are
shown on the right.
If the DBMS is Informix or Sybase, the dialog box shown in Figure 3-7 appears, with a list
of databases for the DBMS. Select a database, and the list of tables in that database is
shown.
Figure 3-7 Dialog Box After Selecting Informix or Sybase DBMS
To use a certain table in the Tool Manager, select the table you want to use and click OK.
Running an SQL Query
Use the Open New DBMS Query menu option to work with tables created via SQL
queries against a DBMS. Selecting this option causes the dialog box in Figure 3-8 to
appear.
Figure 3-8 SQL Query Dialog Box
Selecting a server and DBMS in this dialog box has the same effect as selecting those
items in the Open New DBMS Table dialog box.
The SQL query is shown in the panel at the lower left. You can enter the query there, or
load it from a disk file using the Load SQL from File button. The names of tables and
columns in the current DBMS are shown to help build queries. To have their names
transferred to the SQL query panel, double-click on them.
When you have entered the SQL query, click the Submit SQL Query button to send it to
the DBMS for execution. The table columns resulting from the query appear on the
right.
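For example, assuming a hypothetical table named car with columns such as mpg,
cylinders, and horsepower (the table and column names here are illustrative only), you
could enter or load a query such as:
select mpg, cylinders, horsepower
from car
where mpg > 30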
Transforming the Data
The Data Transformations panel lets you manipulate the tables with which you want to
work. After you have selected a table (via the File menu, described above), its column
headings appear in the Current Columns window of the Data Transformations panel
(Figure 3-9).
Figure 3-9 The Data Transformations Panel
The functions of the displayed options are:
•  Remove Column—lets you delete one or more columns that are not relevant to the
current visualization or mining.
•  Bin Columns—lets you assign each record to a group that falls within a certain range
(bin) of column values. For example, an age column may be binned into the ranges
(bins): 0-18, 19-25, 26-35, and so on.
•  Aggregate—adds columns of records (sum), creates a new column representing
maximum or minimum values, or makes an array from a column that is indexed by
other columns.
•  Filter—lets you select a subset of the data based on an expression involving column
values, for example, leave only those records in which the age is less than 20.
•  Change Types—lets you change a column's name as well as its type.
•  Add Column—lets you add a new column based on a mathematical expression. For
example, add a column minor based on the column age, using the expression:
if age is less than or equal to 18 then minor is true; else minor is false.
•  Apply Model—lets you use a previously created classifier to label new records, to
estimate probabilities for label values, to test the classifier on new data, or to backfit
data to an existing classifier (see Chapter 10, "MineSet Inducers and Classifiers," for
details).
•  Sample—lets you select a random subset of the data. This is useful for very large
data sets.
The Remove Column Button
Remove Column lets you delete columns by selecting the column name or names in the
Current Columns panel, then clicking this button. The items in the Current Columns panel
change to show the new table columns. To choose multiple contiguous columns for
simultaneous removal, click and drag the mouse over the columns. To choose multiple
non-contiguous columns for simultaneous removal, hold down the Ctrl key while
selecting the additional columns.
The Bin Columns Button
Binning lets you sort the information from one or more columns into groups in a new
column or columns (for example, with a range of ages, 0-18, 19-25, 26-35, and so on).
Click Bin Columns to get a dialog box that lets you specify the binning options
(Figure 3-10).
Figure 3-10 Bin Columns Dialog Box
This dialog box lets you
•  choose the column that is to be divided into bins
•  specify the name of the new column to contain values for the bins
•  set bin thresholds, or specify a range with thresholds at regular intervals
To specify binning options for one or more columns, select the column name(s), choose
the appropriate options below, and click the Apply button at the bottom of the dialog box.
If you select only one column for binning, the name of the resulting binned column
appears in the New column name box, and you can type in a new name if you like. In
the example shown in Figure 3-10, mpg_bin is the name for the new column; in this case,
it provides a range of fuel efficiencies. If you select more than one column for binning,
New column name stays inactive.
Next to New column name is a check box labeled Delete original column. When
chosen, this option automatically deletes the original column after binning. Click the
check box to turn this function on or off.
In the middle of the Bin Columns dialog box are two tabs for choosing Automatic
Thresholds or User Specified Thresholds. Choose Automatic Thresholds if you'd like the
computer to suggest the bins, or User Specified Thresholds if you'd like to specify the
thresholds yourself.
Automatically Computed Thresholds
If you've chosen the Automatic Thresholds tab, the program can use machine learning to
suggest bins.
Figure 3-11 Binning With Automatically Computed Thresholds
The first choice under Automatic Thresholds is between the Automatically choose number
of bins and the Group into: ___ bins buttons. Click Automatically choose number of bins to let
the computer decide the best number of bins. If you choose to specify the number of bins,
click Group into: ___ bins, and type the number of bins you want into the field.
There are three ways to categorize data into bins:
•  Automatic—you must also select a discrete label. The thresholds are chosen so that
the distributions of labels within different bins are as different as possible. This
approach continues to create thresholds that split the range until no additional
interval is considered significant.
The Min weight per bin text field lets you specify the minimum weight in any bin;
this prevents the creation of bins with less weight than the number specified. No
interval is split if the two resulting subintervals do not each contain at least the
minimum weight you specify. By default, each instance has unit weight. In this
situation, specifying the Min weight per bin is the same as specifying the minimum
number of instances per bin.
Rather than specifying the minimum weight per bin, it is possible to have the
algorithm set that value automatically. The check box labeled Auto causes the
algorithm to calculate a value for the minimum weight per bin based on the total
weight of the instances: the more total weight, the higher the minimum weight per
bin (the relationship is logarithmic).
•  Uniform Range—the algorithm divides the value range into the specified number of
uniformly sized subintervals. The upper and lower bounds for the extreme ranges
include any values outside the ranges observed in the data. For example, if the
values for an attribute are in the range 3-8, and you specify four bins, the thresholds
identified are 4.25, 5.5, and 6.75, corresponding to the ranges:
≤ 4.25
> 4.25 to 5.5
> 5.5 to 6.75
> 6.75
•  Uniform Weight—the algorithm divides the value range into the specified number of
equal-weight bins. Unlike Uniform Range, in which thresholds are identified that
separate the value range into intervals of equal size, Uniform Weight identifies
thresholds that group the instances into subsets of equal weight. By default, each
instance has unit weight. In this case, the Uniform Weight approach produces the
specified number of bins, each containing an approximately equal number of
instances.
Both Uniform Range and Uniform Weight let you specify a trimming fraction, which
indicates the fraction of extreme values to be excluded from the value range prior to
generating bins. The default trimming fraction is 0.05. This excludes the 5% of the
instances with the most extreme values (2.5% with the lowest values in the range, and
2.5% with the highest values in the range). Trimming tends to reduce the influence of
outliers on the generation of thresholds.
All of the approaches let you decide whether you want to specify the number of bins or
let the algorithm select the number automatically. For the Uniform Range and Uniform
Weight approaches, the automatic selection of the number of bins is based on the number
of distinct values: the more distinct values, the more bins are chosen (the relationship is
logarithmic).
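To make the Uniform Range and Uniform Weight calculations concrete, here is a minimal
sketch in Python. It is an illustration of the idea only, not MineSet's actual
implementation; the function names are invented.
# Illustrative sketch: Uniform Range and Uniform Weight thresholds for a
# numeric column, with an optional trimming fraction.

def trim(values, fraction=0.05):
    """Drop the most extreme values: fraction/2 from each end."""
    values = sorted(values)
    k = int(len(values) * fraction / 2)
    return values[k:len(values) - k] if k else values

def uniform_range_thresholds(values, bins, trim_fraction=0.05):
    """Split the (trimmed) value range into equally sized intervals."""
    v = trim(values, trim_fraction)
    lo, hi = v[0], v[-1]
    width = (hi - lo) / bins
    return [lo + width * i for i in range(1, bins)]

def uniform_weight_thresholds(values, bins, trim_fraction=0.05):
    """Pick thresholds so each bin holds roughly the same number of
    (unit-weight) instances."""
    v = trim(values, trim_fraction)
    return [v[len(v) * i // bins] for i in range(1, bins)]

# The example from the text: values spanning 3-8, four bins, no trimming.
print(uniform_range_thresholds([3, 4, 5, 6, 7, 8], 4, 0))  # [4.25, 5.5, 6.75]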
Typically, all of the available instances are used when identifying thresholds. When
binned attributes are later used to induce a classifier, the error estimates for that classifier
tend to be overly optimistic. This is because distributional information from the test set
was used to identify thresholds. Use training set only prevents the binning approaches
from looking at the records in the test set when identifying thresholds. This tends to give
a more realistic estimate of the classifier's error rate. Use training set only requires you
to specify the same Holdout ratio and Random seed (see "Error Options for Inducers" in
Chapter 10) that are used to create the holdout set for estimating classifier error.
The Use Weight menu lets you weight the instances by any numeric attribute. Changing
instance weight affects both Automatic and Uniform Weight, but has no effect on
Uniform Range.
If you click Apply, the Tool Manager picks bin thresholds and displays them in the
Thresholds for selected column are text field. The text field at the bottom of the Bin
Columns window shows the progress of the binning algorithm and any errors that occur.
Specifying Thresholds
If you specify your own thresholds (as shown in Figure 3-10), you can choose between
Use custom thresholds and Use evenly spaced thresholds by clicking either button. When you
type in the thresholds, you must click Apply to make those thresholds effective for the
selected columns.
The Use custom thresholds text box lets you enter the range criteria. For example, you
could enter the numbers 18, 30, 50, 60. This results in the following ranges: 0-18, 19-30,
31-50, 51-60, 61+. Note that you enter only the digits and commas, not the ranges.
To specify equally spaced bins over a range of values, click the Equally Spaced Bins button.
This activates the three text fields below it. You can type the start of the binning range,
the end of the range, and the spacing of the bins, respectively, into these fields. If you are
binning a column that is a date, you can specify units of time for the bin spacing (using
the Date units popup menu under the text fields). This would permit you, for example,
to bin a time period into bins of three weeks. Dates entered into these fields must be
typed in the form MM/DD/YY. Possible time units are as follows:
years
quarters
months
days
hours
minutes
seconds
The Use custom thresholds text box accepts dates either in double quotes (as shown below),
or without. If you enter dates without quotes, the quotes are added automatically.
"1/1/96", "2/1/96", "3/1/96", "4/1/96", "5/1/96", "6/1/96"
However, do not put quotes around dates used with Use evenly spaced thresholds.
Note: If you enter an invalid parameter, an error message is displayed after you click
Apply, informing you of the valid options and letting you either cancel the command or
return to the dialog box to make the appropriate changes.
Aggregation
Before describing the features and effects of the Aggregate button (see page 86), this
section provides an introduction to the concept of arrays and distribution as used in the
aggregation feature.
Introduction to Arrays and Distribution
The Aggregate button lets you perform simple aggregations (for example, sum, min, max,
and so on), make arrays, and distribute columns. (See Table 3-1.)
Table 3-1 Aggregate Example 1
State    Age_bin    Total $ Spent
CA       0-20       $50
CA       21-40      $454
CA       41-60      $693
NY       0-20       $35
NY       21-40      $541
NY       41-60      $628
If you make Total $ Spent into an array indexed by the binned column Age_bin, the
resulting table has only two columns (Table 3-2):
Table 3-2 Aggregate Example 2
State    Total $ Spent [Age_bin]
CA       [$50, $454, $693]
NY       [$35, $541, $628]
In this case, making an array reduces the number of columns by one, and also reduces
the number of rows by four. Arrays are useful for the Tree Visualizer tool; they are
necessary if you want to use sliders in Scatter Visualizer and Map Visualizer displays.
Distributing columns is similar, but different in several important ways. Instead of
producing a single new column holding many values, distributing produces one new
column for each value of the index. For example, if Total $ Spent in the first table was not
made into an array, but instead distributed by Age_bin, the result is Table 3-3:
Table 3-3 Aggregate Example 3
State    Total $_0-20    Total $_21-40    Total $_41-60
CA       $50             $454             $693
NY       $35             $541             $628
Thus, distributing increases the number of columns but decreases the number of rows.
If you have more than one binned column (for example, Age_bin and Sex_bin), you can
make a two-dimensional array (indexed by combinations of Age_bin and Sex_bin). You
also can distribute and make an array at the same time.
The following table (Table 3-4) has two binned columns: one for age, one for sex.
Table 3-4 Example of Binning
State    Age_bin    Sex_bin    Total $ Spent
CA       0-20       1          $20
CA       0-20       2          $30
CA       21-40      1          $220
CA       21-40      2          $234
CA       41-60      1          $401
CA       41-60      2          $292
If you make Total $ Spent an array indexed by age, and remove Sex_bin, the results are
shown in Table 3-5:
Table 3-5 Results When Making Total $ Spent an Array
State    Total $ Spent [Age_bin]
CA       [$50, $454, $693]
If you do not remove Sex_bin, the results are shown in Table 3-6:
Table 3-6 Results When Specifying Sex_bin
State    Sex_bin    Total $ Spent [Age_bin]
CA       1          [$20, $220, $401]
CA       2          [$30, $234, $292]
If you make an array by both Age_bin and Sex_bin, the results are shown in Table 3-7:
Table 3-7 Results of Making an Array by Age_bin and Sex_bin
State    Total $ Spent [Age_bin] [Sex_bin]
CA       [$20, $220, $401, $30, $234, $292]
Finally, if you distribute by Sex_bin and index by Age_bin, the results are shown in
Table 3-8:
Table 3-8 Results of Distributing Sex_bin and Indexing by Age_bin
State    Total $ Spent [Age_bin], Sex = 1    Total $ Spent [Age_bin], Sex = 2
CA       [$20, $220, $401]                   [$30, $234, $292]
The examples above (with the exception of Table 3-5) had exactly one relevant value for
each array element, and the distribution merely rearranged existing data values. For the
example in Table 3-5, there were two data values for each array element, and these were
summed. MineSet provides several aggregation options for datasets containing more
than one value to be distributed into a given output array element. The most common
option is to add the values (as done in Table 3-5). This is useful when accumulating
expenditures into budgets, for example. You also can take the minimum, maximum, and
average of the values, as well as count them.
When distributing values for a given dataset, it is possible that there are no values
appropriate for a particular bin. In this case, for min, max, avg, and sum aggregations, the
DataMover fills in a value of NULL. For count aggregations, the DataMover fills in a
value of 0.
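To make these operations concrete, the following short Python sketch (an illustration
only, not MineSet code) builds the array form and the distributed form of the example
table from Table 3-4:
# Illustrative sketch: turning a column into an array indexed by a binned
# column, and distributing it into one new column per value of another column.

rows = [  # the example table from Table 3-4
    {"State": "CA", "Age_bin": "0-20",  "Sex_bin": 1, "Spent": 20},
    {"State": "CA", "Age_bin": "0-20",  "Sex_bin": 2, "Spent": 30},
    {"State": "CA", "Age_bin": "21-40", "Sex_bin": 1, "Spent": 220},
    {"State": "CA", "Age_bin": "21-40", "Sex_bin": 2, "Spent": 234},
    {"State": "CA", "Age_bin": "41-60", "Sex_bin": 1, "Spent": 401},
    {"State": "CA", "Age_bin": "41-60", "Sex_bin": 2, "Spent": 292},
]
age_bins = ["0-20", "21-40", "41-60"]

# Make Spent an array indexed by Age_bin, grouping by State and summing
# duplicate values (as in Table 3-5).
arrays = {}
for r in rows:
    arr = arrays.setdefault(r["State"], [0] * len(age_bins))
    arr[age_bins.index(r["Age_bin"])] += r["Spent"]
print(arrays)  # {'CA': [50, 454, 693]}

# Distribute Spent by Sex_bin instead: one new column per Sex_bin value,
# each still indexed by Age_bin (as in Table 3-8).
distributed = {}
for r in rows:
    cols = distributed.setdefault(r["State"], {})
    col = cols.setdefault(r["Sex_bin"], [0] * len(age_bins))
    col[age_bins.index(r["Age_bin"])] += r["Spent"]
print(distributed)  # {'CA': {1: [20, 220, 401], 2: [30, 234, 292]}}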
The Aggregate Button
You can use the Aggregate button to create simple aggregations, make arrays, or
distribute columns. Clicking this button causes the Aggregate dialog box to appear
(Figure 3-12). It shows three lists, with the columns in the current table appearing in the
middle list. If you want to aggregate, distribute, or turn a column into an array, select the
name of the column, and click the left arrow button between the left and center lists.
Below are popup menus that let you specify indexes (if the result is to be an array) and a
distribution column (if the result is to be distributed). In addition, at the bottom of the
dialog box are five toggles that let you specify how different values are to be combined
when aggregated: summed, averaged, the min or max value, or the count. When
you are aggregating number-valued columns, you can choose any combination of these
options. For other types, only count is permitted. If you choose more than one option,
you get more than one result. For example, selecting average and max gives you one
result with average values, and another holding the max values.
Figure 3-12 Aggregate Dialog Box
The three lists of column names are:
•  Columns to aggregate.
•  Group-By columns (the default); this keeps the columns unchanged throughout the
operation. For each set of records with the same combination of values in the Group-By
columns, only one record is output in the resulting table, with values in the
aggregated columns summed, averaged, minned, maxed, or counted (depending
on the checkboxes at the bottom of the panel).
•  Columns to remove, as can be seen with the Sex_bin column in Table 3-5.
After you have finished with the additional aggregate criteria dialog box, the Current
Columns text box in the Table Processing window shows the new column names that
result from applying these criteria.
The Filter Button
This button lets you filter the data via a mathematical expression. The resulting table
includes only records for which the expression is true (or, if numerical, non-zero). When
you click Filter, the Filter dialog box (Figure 3-13) appears.
Figure 3-13 Filter Dialog Box
This dialog box lets you select column names and operators on the left to build an
expression on the right. For a complete description of the expression definition language,
see "The Configuration File" in Appendix B.
The Change Types Button
This button lets you change the name of a column, as well as its type.
Changing a Column Type
Some databases store numerical values as strings. Oracle stores all numbers (both
integers and real numbers) in a single format, which defaults to the data type double in
the Tool Manager. You can use the Change Types button to ensure that these values are
processed correctly. To change the type of one or more columns, click the Change Types
button. A new dialog box appears (see Figure 3-14). This dialog box contains a window
with a list of column headings and their respective types.
Figure 3-14 Change Types Dialog Box
First select a column heading in the window. Then click the New type button. This
produces a popup list of the possible types (invalid types are grayed out), as shown in
Figure 3-15.
Figure 3-15 Types Popup List
•  int—represents a 32-bit signed integer.
•  float—represents a single-precision floating-point number. The decimal point is
optional when representing a floating-point number.
•  double—represents a double-precision floating-point number. The decimal point is
optional when representing a floating-point number.
•  dataString—represents a string that is unlikely to appear multiple times. If it
appears multiple times, several copies are made. A dataString can be used to store
an address. Addresses are unlikely to be compared, and each record can have a
different address.
•  string—represents a string of characters that can appear multiple times in the data
file. Unlike a dataString, only a single copy of a given string is stored in memory, no
matter how many times it appears in the data. This saves memory for strings
appearing many times.
Comparing strings is also much quicker than comparing dataStrings. However,
reading in strings can be slower than reading in dataStrings because it is necessary
to look for duplications. An example of string use would be for a division name that
appears once for each department in the division. If you are unsure whether to use a
string or a dataString, use a string.
•  date—represents the date type from the database.
•  bin—represents a column created by a binning operation.
•  fixed-length array—an array of values of fixed size, not created by the Tool Manager.
•  bin-based array—an array of values as can be created by the Tool Manager.
•  variable-based array—an array of values of variable size, not created by the Tool
Manager.
After selecting a new type, click Apply to have the change take effect.
If you try to convert an inappropriate field (such as a name) to a number, the resulting
values are all zeroes.
Note: When the data source is an existing file, there are fewer possibilities available for
changing any given column.
Changing a Column Name
Select the original column, type a new name in the text field, and click Apply.
To exit this dialog box, click Close.
The Add Column Button
You can use the Add Column button to create a new column whose values are computed
based on a mathematical expression. For example, you could add a new column whose
values are the ratio of values from two existing columns. Click Add Column to get a dialog
box that lets you specify the new column name and expression (Figure 3-16).
Figure 3-16 The Add Column Dialog Box
In the upper left of this dialog box is a field for entering the new column's name. Below
this is a popup menu that lets you specify the column type (integer, string, floating point,
and so on).
The right-hand side of the dialog box contains a large text entry area where you can type in
a definition of the expression (for a complete description of the expression definition
language, see "The Configuration File" in Appendix B). As a shortcut to typing column
names and operators, scrolled lists in the lower left of the dialog box display all columns in
the current table and all possible operators. To insert a column name or operator into the
expression, either double-click it in its scrolled list, or select it and click the arrow button
to the right of the scrolled list.
To check the expression you have created, click the Check Expression button. If there is an
error, a dialog box appears, indicating what the error is and where it occurred. When you
click OK, the expression is automatically checked, and the dialog box is not removed
unless the expression is correct.
The Add Column dialog box checks for type compatibility: if you have assigned a
numerical expression to a string column (or vice versa), a warning message appears, and
the type of the new column is automatically changed to be correct.
The Apply Model Button
The Apply Model button lets you use a previously created model to label new records in
the current table, to estimate probabilities for a label value, to test the performance of the
model on the current table, or to backfit the current table onto an existing model. See
Chapter 10, "MineSet Inducers and Classifiers," for details.
The Sample Button
This button lets you select a random subset of the data. This is useful for data sets that
are too large to work with efficiently. When you click Sample, the Sampling dialog box
(Figure 3-17) appears.
Figure 3-17 Sampling Dialog Box
You can sample in two ways: as a percentage of the current table, or by setting the
maximum number of records to put in the sample. Percentage sampling is approximate;
you can get slightly more or slightly fewer records than the exact percentage would
indicate. The random sample is based on a numeric seed that can be specified in the
Sampling dialog box. If no seed is specified, the number 1 is used as the seed. If you want a
different random sample, specify a different random seed.
When you click the Complementary Sample toggle, you get all records except those that fall
in the random sample. That is, if you get a 10% sample with Complementary Sample not
selected, clicking it gives you the remaining 90% of the data.
The Table History Buttons
Table processing is a series of operations performed by using the buttons described
above. To allow you to see this series of steps, and go back if you made a mistake, there
are two Table History buttons at the bottom of the Table Processing panel (Figure 3-18).
When you click the left arrow button, the columns window shows the table as it
appeared at an earlier step. Clicking the right arrow button returns the table to its current
state.
Figure 3-18 Table History Buttons
The Current view is Field
To the right of the history buttons is the information field Current view is, which counts
the changes you've made and indicates which step you are viewing. The two integers in
this field indicate which table view you're looking at, out of the total number of table
views that exist. For example, if you've made two changes, you can view the original
table (1 of 3), the table after the first change (2 of 3), or the table after the second change
(3 of 3).
The Prev and Next Buttons
As you go back and forth using the Table History buttons to view earlier versions of the
table, the Prev: and Next: fields (under the arrow buttons) help you keep track of where
you are in the history of the table. For any table you view, the Prev: field tells you what
the previous change was, and the Next: field tells you the next change.
The Edit Prev. Op Button
The Edit Prev. Op button allows you to edit the operation shown in the Prev: field. (This
button is not active when Current view is: 1 of some number, because that is the original
table, with no previous changes.) When you click the Edit Prev. Op button, the dialog box
for the previous operation comes up, and you can make changes to that operation. For
example, if the previous operation was binning columns, when you click Edit Prev. Op,
the Bin Columns dialog box appears.
Note that by changing a previous operation, you could affect operations you set up
subsequent to the current one. For example, if you delete a column that you used in a
subsequent binning operation, that binning operation becomes invalid. The Edit History
button can help you avoid such problems.
The View History Button
When you click the View History button, the panels showing the current columns and data
destination are replaced by a panel showing you the complete history of the Data
Transformation table (Figure 3-19). Each version of the table appears as a box containing a
list of the columns, linked by a smaller box (indicating the operation performed on the
table) to the next version of it.
Figure 3-19 View History Dialog Box
As with Edit Prev. Op, changing one operation usually affects (and sometimes invalidates)
subsequent operations in the history. You can select a specific operation to edit, add, or
view. The View History dialog warns you when changes affect the history and shows you the
new history. The row of buttons beneath the diagram window of the View History panel
allows you to change the size and orientation of the diagram, as detailed below.
Zoom Buttons
Under the window displaying this flow chart are the zoom buttons that let you view the
flow chart closer up or farther away (Figure 3-20). You can choose the zoom by using the
button indicating the percentage, or by clicking the arrow buttons to increase or decrease
the size. The increments of change are the same whether you use the percentage button
or the arrow buttons.
Figure 3-20 Zoom Buttons
Overview Button
This button (Figure 3-21) creates, in a separate window, an overview of the entire history
chart that is synchronized with the Edit History dialog. The overview window shows
you which part of the history is currently visible, and lets you pan to other parts of the
history.
Figure 3-21 Overview Button
Vertical/Horizontal View Button
Next to the zoom buttons is a toggle button that lets you view the flow chart vertically or
horizontally (Figure 3-22). Clicking the button switches back and forth between the
two orientations.
Figure 3-22 Vertical/Horizontal View Button
Data Source
Under the Data Source heading is the Change Data Source... button, which lets you change
the table on which the history operates. When you hold the button down, a menu
appears that lets you choose
...to DBMS table
...to DBMS query
...to Data File
Selecting one of these items causes a dialog box to appear that lets you select the new data
source.
Note: As with editing the history, changing the data source can invalidate history
operations.
View
Under the indicator View is a View Single Ops/Dest button. When this button is pressed,
the panel showing the history is hidden, and the panels showing current columns and
data destinations return to view. The function of this button is the same as choosing the
Single Ops and Destination option from the View pulldown menu at the top of the
window.
For Selected Operation/Table
Under the indicator For Selected Operation are three rows of buttons that become active if
you click one of the operations or tables in the flow chart. Once you select an operation,
you can alter it.
The Edit Op button brings up the dialog box for the selected operation, so you can
make changes to it.
The Delete Op button removes the operation from the table history; the elements
that follow it in the flow chart move over to fill the gap.
The Add New Op. Before and Add New Op. After buttons let you insert a new
operation into the table history.
The View Data button shows the data for any selected table in the history. When you
click this button, a menu appears that lets you select the entire dataset, or a
random sample of 10, 100, or 1,000 records.
Other
Under the indicator Other there are three buttons that affect the entire history file:
Undo Change: undoes the most recent change to the history (except changes to the
data source).
Redo Change: redoes any change you have undone.
Save to PostScript: saves a picture of the history flow chart to a file in PostScript
format.
Investigating the Data
The Data Destination panel (Figure 3-23) lets you direct your processed data to one of the
MineSet visualization or mining tools, or to a data file.
There are three tabs at the top of this panel:
Viz Tools
Mining Tools
Data Files
These are the three possible destinations for your data. They are discussed in greater
detail in later chapters dealing with the Data Destination tools.
Using Visualization Tools
If you choose the Viz Tools tab, the visualization tool panel appears under Data
Destination (Figure 3-23).
Figure 3-23 Data Destination Panel
Viz Tool is a popup menu that lets you choose among Map Visualizer, Scatter Visualizer,
Splat Visualizer, Tree Visualizer, Statistics Visualizer, and Record Viewer, to determine the type
of visual representation you want for your data.
The first five tools are described in their respective chapters.
The Record Viewer lets you view the data in the current table in a row/column
spreadsheet-like tool. To use the Record Viewer, select it from the tool menu and click
Invoke Tool.
Tool Options: lets you further specify options to set in the specified tool's
configuration file.
Clear Selected: lets you undo the mapping to a selected Visual Element.
Clear All: clears all mappings.
Invoke Tool: lets you start the tool you specified (via the top button) using the
configuration file named in the Saved as text field.
Each tool's requirements are listed individually in the Visual Elements pane. This pane lets
you map a table column to a requirement. To do this:
1. Select a column by clicking its name in the Current Columns pane.
2. Select the requirement to which you want to map the column by clicking on that
requirement in the Visual Elements pane.
The Viz Tool panel now shows the Visual Element and the column to which it has been
mapped (see Figure 3-24).
Figure 3-24 Columns Mapped to Requirements
You can clear the mapping at any time by selecting the requirement that has the mapping
you want to change, then clicking the Clear Selected button. You can clear all mappings
using the Clear All button.
If you want to specify other details to fine-tune your mappings or to change the settings
so that the data representations more clearly reflect your intentions, click the Tool Options
button. A dialog box specific to each MineSet tool appears, where you can manually
specify the options to use.
Note: For details on a specific tool's options, see that tool's chapter.
Using Mining Tools
The MineSet classifiers are described in Chapter 10, "MineSet Inducers and Classifiers";
Chapter 11, "Inducing and Visualizing the Decision Tree Classifier"; Chapter 12,
"Inducing and Visualizing the Option Tree Classifier"; Chapter 13, "Inducing and
Visualizing the Evidence Classifier"; Chapter 14, "Inducing and Visualizing the Decision
Table"; and Chapter 15, "Inducing and Visualizing the Regression Tree." Clustering is
described in Chapter 16, and Column Importance is described in Chapter 17.
Creating Associations for the Rule Visualizer
If you click the Mining Tools tab, then the Associations tab, the panel lets you take the data
file you created in Data Transformations and proceed to the Rule Visualizer. Each step of
the process is shown in the subpanels:
Assoc Settings: lets you set rule generation options and mappings from the columns in
your table to elements of association rules.
Ruleviz Settings: provides options to tailor the representation of the association
rules in the Ruleviz tool.
Execution: a button that invokes the process of finding association rules and
visualizing them.
If you don't want to go through this process manually, click the Execution button, and
MineSet performs the whole process using defaults.
Figure 3-25 The Associations Tab
Finding Important Columns
The Column Importance panel (Figure 3-26) lets you determine how important various
columns are in discriminating the different values of the label column you choose. You
might, for example, want to find the best three columns for discriminating the label "good
credit risk" so you can choose them for the Scatter Visualizer. When you select the label and
click Go!, a popup window appears listing the three columns that are the best
discriminators. A measure called "purity" (a number from 0 to 100) indicates how
well the columns discriminate the different labels. Adding more columns can only
increase the purity.
Figure 3-26 The Column Importance Tab
There are two modes of column importance:
Simple Mode
To invoke the simple mode, choose a discrete label from the popup menu, and
specify the number of columns you want to see.
Advanced Mode
Advanced mode lets you control the choice of columns. To enter advanced mode,
click Advanced Mode in the Column Importance panel. A dialog box appears, as
shown in Figure 3-27. The dialog box contains two lists of column names: the left
list contains available attributes, and the right list contains attributes chosen as
important (by either the user or the column importance algorithm).
Figure 3-27 Advanced Mode of Column Importance
Advanced mode can work in two different ways: finding several new important
attributes, or ranking the available attributes.
Finding Several Important Attributes
To enter this sub-mode, click the first of the two radio buttons at the bottom of
the dialog (...find [number] additional important columns). If you click Go! with no
further changes, the effect is the same as in Simple Mode: the specified number
of important columns is found and automatically moved to
the right list. Next to each column, the cumulative purity is given (that is, the
purity of all the columns up to and including the one on that line). More
attributes can only increase the purity.
Alternatively, by moving column names from the left list to the right list, you
can pre-specify columns that you want included and let the system add more.
For example, to select the age column and let the system find three more
columns, click the age column name, then click the right arrow.
Clicking Go! lets you see the cumulative purity of each column, together with
the previous ones in the list. A purity of 100 means that using the given
columns, you can perfectly discriminate the different label values.
Ranking Available Attributes
Advanced Mode also lets you compute the change in purity that each column
would add to all those that were already selected. For example, you might
choose age, and then ask the system to compute the incremental improvement
in purity that each remaining column would yield.
To enter this sub-mode, click the second of the two radio buttons at the bottom
of the dialog (...compute improved purity for left columns, cumulative purity for right
columns.). This sub-mode permits fine control over the process. If two columns
are ranked very closely, you might prefer one over the other because it is, for example,
cheaper to gather, more reliable, or easier to understand.
Column Importance Notes
Note that a column's importance in combination with other columns can differ from its
ranking alone. For example, while net-income might be a good column individually, it
might not be as important together with salary, because the two are likely to be highly
correlated. The best set of three columns is not necessarily composed of the columns that
rank highest individually. If two columns give the income in dollars and in another
currency, they are ranked equally alone; however, once one of them is chosen, the other
adds no discriminatory power to the set of best features.
Column selection is useful for finding the best three axes for the Scatter Visualizer, as well
as for finding a good discriminatory hierarchy for the Tree Visualizer.
All floating-point values (double or float) are pre-discretized using automatic
discretization. If a column has no value given to it in the left list, the algorithm did not
consider it, because it either had a single value (for example, when it is discretized into
one interval), or the number of records that it would separate is not statistically
significant.
Using Data Files
The Tool Manager lets you save the manipulated table for future use in a data file on the
client or server. If you click the Data Files tab, the panel shown in Figure 3-28 appears.
Figure 3-28 The Data Files Panel
The two toggle buttons in this panel let you specify whether the file is to be saved on the
server or your client machine. The selected name for the client file appears next to the
Client checkbox. If you select Client, the Choose new client file button brings up a dialog for
you to choose the name for the client file. If you select Server, you can type the server
filename directly into the adjacent text field.
Note: Pathnames are not permitted for server files; all server files are stored in the
DataMover cache directory.
Session Files
The Tool Manager can save a description of your work to a session file for future use.
A session file contains a description of the data source you selected, all the
transformations on the data, and the mining or visualization of the data. Each session file
can hold descriptions of only one data source and one data destination; thus, if you
change the destination visual tool or source data table, the session file loses its links to
any previous data source or destination.
Session files can be saved at any time through the entries in the File menu, described
below. The name of the current session appears in the window's title bar. The Tool
Manager also keeps a parallel session file, called .latest.mineset, in your home directory. It
always has a record of your most recent actions in the Tool Manager. Whenever you start
the Tool Manager without a session file, it reads the contents of the .latest.mineset file to
return you to the state when you last ran MineSet.
Session files also can be used for running the Tool Manager in batch mode, by issuing this
command at the UNIX shell prompt:
mineset_batch [-s serverPassword -d databasePassword] sessionFile
The -s and -d options let you specify the passwords for logging in to the server and the
database, respectively. If you do not specify these options, mineset_batch asks you to
type in the passwords; thus, these options are useful when running mineset_batch from a
shell script. To specify that there is no password for either the server or the database, use -s
or -d followed by two double quotes, that is,
mineset_batch -s "" -d "" foo.mineset
If you specify one of the two passwords, you must specify both.
In batch mode, the Tool Manager does not bring up tools or windows; however, it creates
files for tools. For example, if the session file includes the Tree Visualizer as the data
destination, running the Tool Manager in batch mode produces files for running the Tree
Visualizer, but the Tool Manager does not invoke it.
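As a minimal sketch (the session filename and its path are hypothetical, and only the options documented above are assumed), a shell script that rebuilds the files for a saved session without prompting might contain:
#!/bin/sh
# Regenerate the files for a saved session without prompting for passwords.
# The empty quotes mean that neither the server nor the database has a password.
mineset_batch -s "" -d "" /usr/people/guest/nightly_sales.mineset
Because the passwords are supplied on the command line, such a script can run unattended, for example from cron.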
Pulldown Menus
At the top of the Tool Manager window (see Figure 3-1 on page 67) are four pulldown
menus:
File
View
Visual Tools
Help
The following section describes each of these menus.
The File Menu
The File menu lets you manage your current session, which is one complete session with a
tool: the server, data source, and table you chose; all the table manipulations; and the
mapping or classifying of the data. The menu also lets you open or save a tool history,
change the working directory, and set preferences.
Figure 3-29 File Menu
The File menu provides five sets of functions:
The first set is for selecting a data source.
Open New DBMS Table: lets you select a single table from a DBMS.
Open New DBMS Query: lets you make an SQL query against the DBMS.
Open New Data File: lets you select a table from a data file on disk.
Connect To Server: lets you open a connection to a MineSet server.
The second set is for opening or saving .mineset files.
Open Saved Session...: lets you open a .mineset file.
Reopen Current Session: lets you reopen the current session file from the disk, in
case you do not want to save the current changes.
Save Current Session: lets you save a currently open .mineset file.
Save Current Session As...: lets you name (or rename) and save a currently open
history as a .mineset file.
The third set is for changing the current directory.
Change Current Directory: lets you specify the directory in which the Tool
Manager creates all data and visualization files.
The fourth set is for setting preferences. Here you can specify whether to
use ASCII or binary files
include an entry for NULL values when creating arrays
automatically load the most recent session when starting up the Tool Manager
run MIndUtil in single- or multi-threaded mode; a slider allows you to select
how many threads to use.
The last option, Exit, lets you end the current session and exit the Tool Manager.
The View Menu
The View Menu lets you select whether to see the history panel or the current columns
and data destination panels.
Figure 3-30 View Menu
Single Ops and Destination: shows the current columns and data destination panels.
This menu option performs the same function as the View Single Ops/Dest button on
the history panel.
Entire History: shows the history panel. This menu option performs the same
function as the View History button on the Data Transformations panel.
The Visual Tools Menu
The Visual Tools menu lets you invoke any of the following visual tools directly:
Cluster
Decision Table
Evidence Visualizer
Map Visualizer
Rule Visualizer
Scatter Visualizer
Splat Visualizer
Statistics Visualizer
Record Viewer
Tree Visualizer
If you have created a file that runs within one of these tools, and you want to go back to
it, click the tool. From within the tool, use File > Open to open the data file. These viewers
are described in their respective chapters, except for the Record Viewer, which is described
later in this chapter, in "The Record Viewer" on page 113.
The Help Menu
The Help menu provides information about the elements of the Tool Manager and how
they work:
Click for Help: Gives help information about a particular item if you press Shift-F1,
then click the item for which you want help.
Overview: Gives an overview of the online help and how to use it.
Index: Provides an index of the complete help system. This option is currently
disabled.
Keys & Shortcuts: Provides the keyboard shortcuts for all of the Tool Manager's
functions that have accelerator keys.
Product Information: Indicates what version of the Tool Manager you are using.
MineSet User's Guide: Invokes the IRIS Insight viewer with the online version of
this manual.
The Tool Manager Options File
The Tool Manager creates a .mineset file in your home directory. This is used to store the
preference indicating whether to restore the most recent session on startup, as well as the
default server name, login, and password. If you log in to the same server often, edit this
file and specify a server name and login as follows:
default_server_name: mineset
default_server_login: guest
default_server_password:
Whenever you try to log in to a server, these names appear as defaults.
Warning: Putting a password in a file is a serious security risk. Do not place a
password in the Tool Manager options file unless you want other people to know that
password.
The Record Viewer
The Record Viewer lets you view MineSet data files in a format similar to spreadsheets.
There are five ways to start the Record Viewer:
Use the Tool Manager to start the Record Viewer. This invokes the Record Viewer on
the data currently configured in the Tool Manager.
Double-click on the Record Viewer icon, which is in the MineSet page of the icon
catalog. Since no .schema file is specified, you must select one by using File > Open.
Double-click on any MineSet .schema file. This launches the Record Viewer on that
.schema file.
Drag a .schema file onto the Record Viewer icon.
Start the Record Viewer from the UNIX shell command line by entering this
command at the prompt:
recordview [ file.schema ]
where file.schema is optional and specifies the name of the .schema file to use. If
you do not specify a .schema file, you must use File > Open to specify one.
The Record Viewer shows the data specified by the .schema file in spreadsheet format (see
Figure 3-31).
Figure 3-31 Sample Record Viewer Screen
If a column is not wide enough to see a specific value, click on it to display that value at
the top of the Record Viewer. You also can change the width of columns by dragging the
separators between the columns.
To read a new .schema file into the Record Viewer, select File > Open. To close the Record
Viewer, select File > Exit.
Note that some of the visual tools also bring up record viewers to display the current
selections. These record viewers are built into the visual tools; while their behavior is the
same as the Record Viewer discussed above, they do not allow opening other .schema
files.
Color Options for the MineSet Visualizers
Many of the tool option dialogs have options for choosing colors. MineSet has a color list
chooser that uses color swatches. This section describes how to choose, apply, and
change color options for the MineSet Visualizers.
Choosing Colors
If only one color is to be chosen (for example, a grid color), a single color swatch appears
(Figure 3-32).
Figure 3-32 Configuration Option With a Single Color Swatch
Clicking the swatch brings up a Color Browser that lets you change the color of that
swatch (Figure 3-33). The Color Browser is described in more detail in "Using the
Color Browser" later in this chapter.
Figure 3-33 Color Browser
If a list of colors can be chosen, a list of swatches appears (the list can be empty
initially), as shown in Figure 3-34.
Figure 3-34 Multiple Colors Swatches
To edit the color, click a swatch with the left mouse button. This also selects the swatch
for making changes to the colors with the buttons. If you click on the swatch with the
middle mouse button, the swatch is selected, but the color chooser does not appear.
Next to the list of swatches are four buttons. First is the Add button, labeled with a plus
sign (+), which adds a new color at the end of the list. A swatch is added, and the color
chooser appears, where you can select the color of that swatch. The Add button is
disabled if the maximum number of colors is already in the list.
Next to the Add button is a Delete button, labeled with a minus sign (-). This button
deletes the selected color. It is disabled if no swatch is selected, or if the list already has
the minimum number of colors.
Next to the Delete button are two buttons to shift the selected color right and left. These
buttons are disabled if no swatch is selected, or if the swatch is already at the end of the
list.
If there are more colors in the list than there is room to display them, scroll arrows are added at
each end of the list (Figure 3-35).
Figure 3-35 Scroll Arrows on Color Browser
If the hardware runs out of colors, the color swatches are replaced with text labels
showing the color in X notation (Figure 3-36).
Figure 3-36 Color Browser Out of Colors
Using the Color Browser
The Color Browser (Figure 3-33) appears when you click a color swatch or the add button
in the Colors panel of the visualizer's Configuration Options panel.
To select a color using the Color Browser:
1. Move your mouse cursor on top of the small circle in the colored hexagon.
2. Press the left mouse button, and move your mouse around the hexagon. The color
beneath the small circle appears in the rectangle next to the Current Color label. This
rectangle acts as your color palette while you choose a color.
3. Release the mouse button when the small circle is on top of a color you want. The
selected swatch immediately takes on the chosen color.
You can edit several colors without dismissing the Color Browser; clicking any color
in the options panel lets you edit that color in the already posted Color Browser.
4. Click the OK button when you decide on a color. The Color Browser window closes.
Chapter 4
4. Using the Statistics Visualizer
This chapter discusses the features and capabilities of the Statistics Visualizer. It provides
an overview of this data visualization tool, then explains the Statistics Visualizer's
functionality when working with the
main window
external controls
pulldown menus
Finally, it lists and describes the sample files provided for this tool.
Overview of the Statistics Visualizer
The Statistics Visualizer lets you visualize statistics on columns. The Statistics Visualizer
presents a window that contains one small panel for each column listed in the Current
Columns pane of the Tool Manager. The Statistics Visualizer main window has a default size
and shows only a restricted number of column panels. If the number of columns is large,
scrollbars appear; alternatively, you can stretch the Statistics Visualizer window
horizontally or vertically to view more column panels.
The format of a column panel varies according to the column type and the number of
distinct values that exist for that column. Columns are generally divided into two types:
numeric and discrete, shown as box plots and histograms, respectively.
A numeric column has integer, float, double, or date values. Each box plot panel shows
statistics about data from a single column, including the minimum, maximum, mean,
median, and two quartiles (25th and 75th percentiles) of these numeric values. These
values are shown as lines across a vertical bar in graduated shades of green, and the
standard deviation of the population is shown as a +/- value. The quartiles are shown
whenever there are fewer than 50,000 distinct values (see Figure 4-1). If there are more
than 50,000 distinct values in the column, the statistics are shown as a gray vertical bar.
Figure 4-1 Numeric Column Displayed by Statistics Visualizer
A discrete (or nominal) column has non-numeric (string, bin, or enum) values shown as
histograms (see Figure 4-2). The discrete column panel shows up to 100 distinct values,
as well as a histogram of the number of instances of each distinct value. The default
ordering of the discrete rows is by decreasing count, but you can use the View pulldown
menu to select an alternative sorting. If there are 100 or fewer distinct categories, then the
column panel also contains the count of distinct values.
Figure 4-2 Discrete Column Displayed by Statistics Visualizer
After creating a visualization of your data, the Statistics Visualizer lets you see truncated
textual information in the histograms with a brush highlighter. The brush highlighter
activates as you pass the mouse across a field without clicking.
File Requirements
The Statistics Visualizer requires a data file, consisting of ASCII or binary fields. This file
is easily created when running the Tool Manager (see Chapter 3).
Starting the Statistics Visualizer
There are five ways to start the Statistics Visualizer:
Use the Tool Manager to configure and start the Statistics Visualizer. See Chapter 3
for details on most of the Tool Manager's functionality, which is common to all
MineSet tools.
Double-click the Statistics Visualizer icon, which is in the MineSet page of the icon
catalog. The icon is labeled statviz. Since no configuration file is specified, the
start-up screen requires you to select one by using File > Open.
Double-click the Statistics Visualizer icon on your Silicon Graphics desktop. The
startup screen requires you to select a data file by choosing File > Open.
Figure 4-3 File > Open Menu Selection for Statistics Visualizer
Starting the Statistics Visualizer from the icon activates only the File and Help
pulldown menus. For the main window to be fully functional, open a .statviz file by
selecting File > Open.
If you know what .statviz file you want to use, double-click the icon for that file. This
starts the Statistics Visualizer and automatically loads the file you specified. This
works only if the filename ends in .statviz (which is always the case for data files
created for the Statistics Visualizer using the Tool Manager).
Drag the .statviz file icon onto the Statistics Visualizer icon. This starts the Statistics
Visualizer and automatically loads the file you specified. This works even if the
filename does not end in .statviz.
Starting the Statistics Visualizer
Select the Viz Tools tab in the Data Destination panel of the Tool Manager's main screen
(Figure 4-4). From the popup list of tools, choose Statistics Visualizer.
Figure 4-4 Data Destination Panel With Statistics Visualizer Selected
Working in the Statistics Visualizer's Main Window
If you started the Statistics Visualizer from the icon, the main window shows the
copyright notice and license agreement for the Statistics Visualizer. Only the File and
Help pulldown menus can be used. For the main window to show all menus and
controls, open a .statviz file. Use File > Open (Figure 4-3) to see a list of configuration files.
Pulldown Menus
Three pulldown menus let you access additional Statistics Visualizer functions. These are
labeled File, View, and Help. If you start the Statistics Visualizer without specifying a
configuration file, only the File and the Help menus are available.
The File Menu
The File pulldown menu for the Statistics Visualizer contains four options:
Open loads a file and displays it in the main window.
Save As saves the current state of the Statistics Visualizer main window into an
image file.
Print Image captures the current state of the Statistics
Visualizer main window and prints it to a printer.
Exit closes all windows and exits the application.
The View Menu
The Statistics Visualizer View pulldown menu (Figure 4-5) contains two options.
Figure 4-5 StatViz View Pulldown Menu
Sort Nominals By Count specifies that the nominal (discrete) columns show their
histograms ordered by decreasing per-value counts.
Sort Nominals By Name specifies that those same columns be ordered
alphabetically by data value name.
The Help Menu
The Help menu provides access to five help functions (see Figure 4-6).
Figure 4-6 Statistics Visualizer Help Menu
Click for Help turns the cursor into a question mark. Placing this cursor over an
object in the Statistics Visualizer's main window and clicking the mouse causes a
help screen to appear; this screen contains information about that object. Closing the
help window restores the cursor to its arrow form and deselects the help function.
The keyboard shortcut for this function is Shift+F1. (Note that it also is possible to
place the arrow cursor over an object and press the F1 function key to access a help
screen about that object.)
Overview provides a brief summary of the major functions of this tool, including
how to open a file and how to interact with the resulting view.
Index provides an index of the complete help system. This option is currently
disabled.
Keys & Shortcuts provides the keyboard shortcuts for all of the Statistics Visualizer's
functions that have accelerator keys.
Product Information brings up a screen with the version number and copyright notice
for the Statistics Visualizer.
MineSet User's Guide invokes the Insight viewer with the online version of this
manual.
Sample Data Files
The provided sample data files demonstrate the Statistics Visualizer's features and
capabilities. The following files are in the /usr/lib/MineSet/statviz/examples directory:
mushroom.statviz
census95.statviz.
Chapter 5
5. Using the Tree Visualizer
This chapter discusses the features and capabilities of the Tree Visualizer. It provides an
overview of this visualization tool, discusses ways of invoking it, then explains the Tree
Visualizer's functionality when working with the following elements:
main window
external controls
pulldown menus
overview window
Finally, this chapter lists and describes the sample files provided for this tool.
Overview of Tree Visualizer
The Tree Visualizer is a graphical interface that displays data as a three-dimensional
landscape. It presents your data as clustered, hierarchical blocks (nodes) and bars with
disks, through which you can dynamically navigate, viewing part or all of the dataset.
As shown in Figure 5-1, the Tree Visualizer displays quantitative and relational
characteristics of your data by showing them as hierarchically connected nodes. Each
node contains bars whose height, color, and disk correspond to aggregations of data
values. The lines connecting nodes show the relationship of one set of data to its subsets.
Figure 5-1 Example Display in the Tree Visualizers Main Window
Values in subgroups can be summed and displayed automatically in the next higher
level. The base under the bars can provide information about the aggregate value of all
the bars. Bars representing negative values are shown below the top of the base. You can
see negative-value bars more clearly by disabling the base height (see "The Display
Menu" on page 165, or the "Base Height Statements" section in Appendix B, "Creating
Data and Configuration Files for the Tree Visualizer").
File Requirements
The Tree Visualizer requires the following files:
A data file consisting of rows of tab-separated fields (a small sketch of such a file
follows this list). This file is easily created using
the Tool Manager (see Chapter 3). If you are generating this file yourself, see
Appendix B, "Creating Data and Configuration Files for the Tree Visualizer," for the
required file format.
Data files are generated by extracting data from a source (such as an Oracle,
INFORMIX, or Sybase database) and formatting it specifically for use by the Tree
Visualizer. Data files have user-defined extensions (the sample files provided with
the Tree Visualizer have a .data extension).
A configuration file describing the format of the input data and how these are
converted to a hierarchy. This file also is easily created using the Tool Manager (see
Chapter 3). You also can use an editor (such as jot, vi, or Emacs) to produce this file
(see Appendix B, "Creating Data and Configuration Files for the Tree Visualizer").
Configuration files must have a .treeviz extension. When starting the Tree Visualizer,
or when opening a file, specify the configuration file, not the data file.
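As a purely illustrative sketch (the columns and values here are hypothetical, and the exact format rules appear in Appendix B), a data file is simply one record per line, with the fields separated by tab characters:
Alabama	1995	1234.56
Alaska	1995	789.01
Arizona	1995	2345.67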
Starting the Tree Visualizer
There are five ways to start the Tree Visualizer:
Use the Tool Manager to configure and start the Tree Visualizer. (See Chapter 3 first
for details on most of the Tool Manager's functionality, which is common to all
MineSet tools; see below for details about using the Tool Manager in conjunction
with the Tree Visualizer.)
Double-click the Tree Visualizer icon, which is in the MineSet page of the icon
catalog. The icon is labeled treeviz. Since no configuration file is specified, the
start-up screen requires you to select one by using File > Open.
Starting the Tree Visualizer without specifying a configuration file causes the main
window to show the copyright notice for this tool. Only the File and Help pulldown
menus can be used. For the main window to be fully functional, open a
configuration file by selecting File > Open (Figure 5-2).
Figure 5-2 Tree Visualizer's File Pulldown Menu
If you know what configuration file you want to use, double-click the icon for that
file. This starts the Tree Visualizer and automatically loads the file you specified.
This only works if the filename ends in .treeviz (which is always the case for
configuration files created for the Tree Visualizer via the Tool Manager).
Drag the configuration file icon onto the Tree Visualizer icon. This starts the Tree
Visualizer and automatically loads the file you specified. This works even if the
filename does not end in .treeviz.
Start the Tree Visualizer from the UNIX shell command line by entering this
command at the prompt:
treeviz [ configFile ]
where configFile is optional and specifies the name of the configuration file to use. If
you don't specify a configuration file, you must use File > Open to specify one (see
Figure 5-2).
Options for Invoking the Tree Visualizer
There are two options that affect how this tool is invoked:
-warnexecute indicates that if you attempt to execute a command specified in an
execute statement, a warning is displayed and you are given the option of executing
the command or not. This is intended for an insecure environment, such as files
obtained from the Web, and is used automatically when commands are executed via
mtr files.
You can enable this option permanently by adding the line
*minesetWarnExecute:TRUE
to the user's .Xdefaults file, or by setting the environment variable
MINESET_WARN_EXECUTE
-quiet eliminates the dialogs that pop up to indicate progress. You can enable this
option permanently by adding the line
*minesetQuiet:TRUE
to your .Xdefaults file.
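For example, assuming the two options can be combined on one command line (a sketch only), you could open the store.treeviz sample configuration file with progress dialogs suppressed and the execute warning enabled:
treeviz -warnexecute -quiet store.treeviz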
Configuring the Tree Visualizer Using the Tool Manager
This section describes how the Tree Visualizer can be configured using the Tool Manager.
Although the Tool Manager greatly simplifies the task of configuring the Tree Visualizer,
you can construct a configuration file manually for this tool using an editor (see
Appendix B, "Creating Data and Configuration Files for the Tree Visualizer").
For the Tree Visualizer, the Tool Manager does not support the following:
Non-aggregated hierarchies, where the data is displayed directly without
aggregating it.
Real-time monitoring.
A number of very rarely used options (skip missing, overview, shrinkage, root label,
speed, climb speed, leaf margin, root leaf margin, leaf edge margin, initial position,
initial angle, bar label size, base label size, and lod). See Appendix B.
Variable-length arrays.
Expressions computed after creating the hierarchy. For example, if you are
computing a percentage, the percentage must be computed after the hierarchy
aggregation takes place, since it is not possible to aggregate the percentages.
Note that the steps required to connect to a data source are described in Chapter 3.
Selecting the Tree Visualizer Tool
Select the Viz Tools tab in the Data Destination panel of the Tool Manager's main screen
(Figure 5-3). From the popup list of tools, select Tree Visualizer. The mapping
requirements for the Tree Visualizer are displayed in the window on the right side of this
panel. Items in the Visual Elements: list that are preceded by an asterisk are optional.
Figure 5-3 Data Destination Panel of Tool Manager With Tree Visualizer Selected
Key - Bars lets you define what the bars shown in the Tree Visualizer main window
represent. For example, in a table representing the budget of the 50 United States, the
keys could be state names. If the first key is associated with Alabama, the first bar
represents the values for Alabama.
Height - Bar lets you specify what the bar heights represent. Typically, the higher the bar,
the greater the value represented.
Sort By lets you specify a column, the values of which are used to sort the layout of the
nodes. The sort order defaults to ascending from left to right.
Hierarchy Root Level lets you specify how the table from your data source is converted into
a hierarchy. The Visual Elements list defaults to six hierarchical levels. If you specify a
sixth hierarchy level, the Tree Visualizer automatically adds a seventh. With every extra
level you specify, the Tree Visualizer adds another one. You can specify as many
hierarchy levels as necessary.
Height - Disk lets you specify what the heights represent for optional disks placed at the
same location as the bar. If no mapping is specified, no disks are displayed.
Height - Base lets you specify what the base heights represent. If no mapping is
specified, the bar height mapping is used.
Color - Bar lets you specify what the bar colors represent. The specific colors must be
assigned via the Tool Manager's Tool Options panel (see "Choosing Colors" and "Using
the Color Browser" in Chapter 3).
Color - Disk lets you specify what the disk colors represent. This option has an effect
only if the disk height is specified (see "Choosing Colors" and "Using the Color Browser"
in Chapter 3).
Color - Base lets you specify what the base colors represent. If no mapping is specified,
the bar color mapping is used (see "Choosing Colors" and "Using the Color Browser" in
Chapter 3).
Undoing Mappings
To undo any mapping, select that mapping in the Requirements: window, then click the
Clear Selected button. To undo all mappings, click the Clear All button.
Specifying Tool Options
Clicking the Tool Options button causes a new dialog box to be displayed (Figure 5-4).
This lets you change some of the Tree Visualizer options from their default values.
Figure 5-4 Tree Visualizer's Configuration Options Dialog Box
The top of the dialog box has three columns: Bars, Node Bases, and Disks.
Normalize Heights
This option lets you normalize heights across each level of the hierarchy (or across all
levels) of bars, node bases, and disks. Normalizing the heights determines the maximum
value of the height variable; it normalizes all values relative to that height. Thus, if the
maximum value is 30.0, and the maximum bar height was set to 1.0 (in arbitrary units),
a value of 15.0 would be mapped to a value of 0.5.
Normalizing across each level independently normalizes each level of the hierarchy. This
option is most useful if data has been summed up the hierarchy, and prevents the top
level of the hierarchy from dwarfing items at the lowest level. Normalizing across all
levels normalizes everything together, regardless of the level in the hierarchy. If neither
box is checked for bars, no normalization takes place.
Node Bases are normalized independently of Bars. If no boxes are checked, the same
normalization method used for bars is used for node bases, although the values are
normalized independently.
If disks are present and normalize with bars is checked, the disks are normalized in
conjunction with the bars: a disk and a bar representing the same value have the same
height. If one of the other normalize boxes is checked in the Disks column, disks are
normalized independently of the bars: the highest disk and the tallest bar have the same
height, regardless of the actual values represented by them.
Max/Scale Heights
This option lets you specify the height of the tallest bars and node bases. The default is
1.0 (in arbitrary units). If, after looking at the view, you see that the heights are too low or
too high, use this field to adjust them. For example, entering 2 in the field causes all bars
to be doubled in height; entering .5 makes all bars half as tall.
If normalization was specified, this value represents the height of the tallest bar or base.
If normalization was not specified, all values are scaled by this amount. The latter can be
useful when comparing views of two different datasets.
Filter out % shortest
This option lets you filter out nodes containing only short bars. First, the tallest bar in the
scene is calculated (if heights are normalized by level, then the tallest bar in each level).
Then only those nodes that contain at least one bar that is the appropriate percentage of
the tallest bar are shown. For example, if you enter 5% in this field, then only those nodes
containing at least one bar that is at least 5% of the height of the tallest bar are shown.
(Ancestors of such bars are also shown.) This option is intended as a coarse way to filter
out small, uninteresting nodes. It is not intended as an exact mechanism for identifying
specific nodes of a certain value. Use of this option can accelerate the rendering of slow,
complex scenes, or reduce clutter resulting from many bars near zero height.
Although small nodes are filtered out, they are still counted in any cumulation
up the hierarchy.
Height Aggregation
By default, the height of each bar of the parent node is the sum of the heights of the
corresponding bars of the children; however, these heights can instead be aggregated
using average, max, min, count, or any of the other values that appear. This aggregation
can be used for the values of the bar heights, base heights, and disk heights.
Colors
This set of options lets you
specify the list of colors to use
specify the kind of mapping
map colors to bars, node bases, and disks
To use these Colors options, you must have mapped a column to the *Color - Bar,
*Color - Disk, or *Color - Base requirements of the Data Destination panel. See
"Choosing Colors" and "Using the Color Browser" in Chapter 3 for a more detailed
explanation of how to choose and change colors.
Color list to use lets you specify the color list using the + button next to the color list label.
This brings up a color editor that lets you specify a color to be added to the list.
Kind of mapping lets you specify whether the color change that is shown in the graphic
display is Continuous or Discrete. If you choose Continuous, the color values (of the bars,
node bases, or disks) shift gradually between the colors entered in the Color list to use
field as a function of the values that are mapped to those colors in the Color mapping field.
If you choose Discrete, the colors change only at the specified boundaries.
Color mapping lets you specify the values to which the colors are mapped.
Example 5-1
If you
used the Color Browser to apply red and green to bars
selected Discrete for the Kind of mapping
entered the values 0 100
then the display shows all bars (or node bases or disks) with values of less than 100 in
red, and all those with values greater than or equal to 100 in green.
Example 5-2
If you
used the Color Browser to apply red and green to bars
selected Continuous for the Kind of mapping
entered the values 0 100
then the display shows all bars (or node bases or disks) with values less than or equal to
0 as completely red, those greater than or equal to 100 as completely green, and those
between 0 and 100 as shadings from red to green.
Color Aggregation
By default, the color value of each bar of the parent node is the sum of the color values of
the corresponding bars of the children; however, these colors can instead be aggregated
using average, max, min, or any of the other values that appear. This aggregation can be
used for the values of the bar colors, base node colors, and disk colors.
Color by Key
This option lets you automatically color the bars by their key value. This option is
ignored if another coloring was specified. If you specify no color list, or specify
insufficient colors, additional colors are chosen at random. If extra colors are specified,
they are ignored.
Make Fixed
By default, all bars are placed across one row. This option allows you to change the
number of rows or columns. If neither rows nor columns are selected, or the number is
set to 0, then neither rows nor columns are fixed, and the closest approximation to a
square is displayed.
Message
This option lets you type in any message you want. The message statement specifies the
message displayed when the pointer is moved over an object or when an object is
selected. By default, the same message is used for the base as for the bars. If no message
is specified, a default message containing the names and values of all the columns is
used.
The format of the message must match the type of data being used:
Strings must use %s.
Ints must use integer formats (like %d).
Floats and doubles must use floating-point formats (like %f).
For a detailed description of the message field, see "Message Statements" in Appendix B.
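As a sketch only (the column names here are hypothetical, and the full statement syntax is given in Appendix B), a message built from a string column such as state and a float column such as sales would need matching specifiers, for example:
Sales for %s: %f
where %s is filled in with the string value and %f with the floating-point value for the object under the pointer.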
Execute and Base Execute
These options let you type in a UNIX command that is executed when you double-click
a bar or base. If only the Execute field is filled in, it applies to both bars and bases. If
both are filled in, Execute applies to bars, and Base Execute applies to bases. The format
is similar to the message statement. If no execute statement appears, double-clicking has
no effect.
For a detailed description of the Execute field, see "The Execute Statement" in
Appendix B.
Sky Color
You can specify either one or two colors. If only one color is specified, the sky is solid. If
two colors are specified, the sky is shaded between the colors. When specifying two
colors, the first color is for the top of the sky, the second for the bottom.
Ground Color
You can specify either one or two colors. If only one color is specified, the ground is solid.
If two colors are specified, the ground is shaded between the colors. For the ground, the
first color is for the far horizon, the second is for the near ground.
Base Label Color
You can specify the color of the labels on the front of the bases.
Bar Label Color
You can specify the color of the labels on the front of the bars.
Line Color
You can specify the color of the lines connecting the bases.
Sort Order
If you select the Sort by Key checkbox, the nodes in the display are in sorted order. The
menu next to the checkbox lets you specify whether to sort in ascending or descending
order.
Resetting the Tool Options
If, after you have made changes to the Tool Options dialog box, you want to reset the
values of all options to their default values, click the Reset Options button.
Saving the New Tool Options
Once you have finished making changes to the Tool Options dialog box, click OK to
return to the Tool Manager's main screen.
Saving Tree Visualizer Settings
The Tool Manager stores information for the Tree Visualizer in several files, all sharing
the same prefix:
<prefix>.treeviz.data contains the data.
<prefix>.treeviz.schema describes the data file.
<prefix>.treeviz contains information needed by the Tree Visualizer.
<prefix>.mineset contains all the information needed to create the other files.
To specify a prefix, use the Save Current Session As ... menu option in the File menu of the
Tool Manager's main window. If you do not specify a prefix, it is based on the data
source.
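For example, if you saved a session under the hypothetical prefix store, the Tool Manager would create:
store.treeviz.data
store.treeviz.schema
store.treeviz
store.mineset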
When you use the Invoke Tool button, the .data, .schema, and .treeviz files are updated, if
necessary.
Invoking the Tree Visualizer
To see the Tree Visualizer graphically represent your data, click the Invoke Tool button at
the bottom of the Data Destination panel.
Working in the Tree Visualizer's Main Window
A file's hierarchy is visible only after a valid configuration file is specified. For example,
specifying store.treeviz results in Figure 5-5.
Figure 5-5 Tree Visualizer's Initial View When Specifying store.treeviz
The root node of the hierarchy is at the front of the scene, near the bottom of the Tree
Visualizer's main window. In back of the root node are its descendants; each one consists
of a base with bars on it. You can change what the heights and colors of the bars represent
via the Tool Manager or by manually changing the .treeviz configuration file; usually, the
base represents the aggregate of all the bars. Bases are connected with lines representing
the connection of the nodes to their descendants.
Highlighting an Object or Node
To highlight an object, move the mouse over that object (either a base or a bar). This
causes information about that object to appear over the top left of the view area, under
the Pointer is over: label (Figure 5-6). To highlight a node and obtain information about
that node, place the pointer over a line leading to that node. This information appears in
the same place as that for an object.
Figure 5-6 A Highlighted Object and the Information It Represents
Selecting an Object
To select an object and zoom to it, left-click the mouse on that object. Hold the Ctrl key
down while clicking to select the object without zooming to it. At the top of the window,
under the label Selection:, you see information about a selected object. The information
is the same as that shown when highlighting an object. As long as the object is selected,
the information is displayed. This lets you compare information about two objects by
selecting one, then highlighting the other. Using the mouse, you can cut and paste
selection information into other applications, such as reports or databases.
If you hold the Shift key while left-clicking on an object, the selection of that object is
toggled. If the object is currently not selected, it becomes selected; conversely, if it is
currently selected, it becomes deselected. Using this technique, it is possible to select
multiple objects simultaneously. While the information under the Selection: label only
shows the information on the last object selected, it is possible to see the values for all
selections by using Selections > Show Values or by drilling through to the original data
behind the selections (see "The Selections Menu" on page 166).
If an execute statement was specified via the Tool Manager or the configuration file, then
double-clicking on an object executes the appropriate command. If the -warnexecute
option was specified when invoking the Tree Visualizer, a warning is given first.
Spotlighting an Object
When you select an object, a white spotlight appears on it (Figure 5-7). A yellow spotlight
appears when you are searching (see "The Search Panel" on page 154). Spotlights are
visible even if the selected object is a descendant node in the far background.
The edges of spotlights are surrogates for an object: when you move the pointer over the
edge of a spotlight, the associated object is highlighted, and information about that object
appears above the top left of the view. Left-click the edge of a spotlight to select the
associated object and (if the Ctrl key is not held down) to zoom to it. The spotlight is
active only on the solid lines along the edges, not the translucent section in the center.
This lets you select objects behind the spotlight.
Figure 5-7 Example of a Selected (Spotlighted) Object
Using the Right Mouse Button
When the cursor is in the main window, clicking the right mouse button (or, if the mouse
has been reconfigured, the third button) brings up a menu that lets you select the children
of a node. If you click on a node with children, it provides you with a list of the children.
This list is displayed as long as you hold the mouse button down. If you do not click on
a node, but one is selected, it provides you with a list of children of the selected node. If
nothing is selected, or if the selected node has no children, no menu is displayed.
Navigating With the Middle Mouse Button
To navigate over the scene in the main window, use the middle mouse button. You also
can use external controls to perform all middle mouse button functions (see "External
Controls" on page 147).
To move through the main window, click the middle mouse button. A small square
appears (see Figure 5-8). Move the cursor out of this square while pressing the mouse to
move your point of reference dynamically through the 3D landscape. The farther the
cursor is from the square, the faster your viewpoint moves. To move the viewpoint
forward, move the mouse up. To move the viewpoint back, move the mouse down.
Moving the mouse left and right causes the viewpoint to shift accordingly. You can move
in any direction as long as a part of your data is visible.
Figure 5-8 Example of the Square as Navigational Base
To move the viewpoint up and down, hold the Shift key down when pressing the middle
mouse button. To move the viewpoint up, move the mouse up. To move the viewpoint
down, move the mouse down. You cannot move below ground level.
To combine horizontal and vertical motion (that is, to move the viewpoint back and forth,
as well as up and down), hold the Alt key down when pressing the middle mouse button.
Note that while moving forward, the viewpoint also moves down, based on the current
tilt. Similarly, while moving backward, the viewpoint moves up, based on the tilt.
Note: You cannot turn from side to side. Tilting the viewpoint requires using external
controls.
External Controls
Several external controls surround the graphics window. These consist of buttons and
thumbwheels.
Buttons
At the top right of the image area are eleven buttons, as shown in Figure 5-9.
Figure 5-9 Tree Visualizer's External Button Controls
Home takes you to a designated location. Initially, this location is the rst viewpoint
shown after invoking the Tree Visualizer and specifying a conguration le. If you
have been working with the Tree Visualizer and have clicked the Set Home button,
then clicking Home returns you to the viewpoint that was current when you last
clicked Set Home.
Set Home makes your current location the Home location. Clicking the Home button
returns you to the last location where you clicked Set Home.
View All lets you view the whole hierarchy, keeping the tilt of the camera. To get an
overhead view of the scene, tilt the camera to point straight down, then click the
View All button. To tilt the camera, see the description of the Tilt thumbwheel (see
"Thumbwheels" on page 149).
Go Back lets you return to the previous location. If you have just started the Tree
Visualizer and have not moved from the home view, this button is grayed out.
Go Forward lets you proceed to the location from which you clicked the Go Back
button. If you have not clicked the Go Back button, the Go Forward button is grayed
out.
Parent is active only when you have an object selected. If a bar is selected, clicking
this button selects the base containing the bar. If a base is selected, clicking this
button moves up the hierarchy to the parent node. Once the root node has been
reached (highest level of the hierarchy), the Parent button is grayed out. Note that
when using Parent, the selected node is changed to the parent of the previously
selected one.
Move Left lets you select the next sibling to the left. If a bar is selected, the bar to the
left of it is selected. If a base is selected, then, if the parent has another child to the
left, that is selected. This button is grayed out if nothing is selected, or if the current
selection has no sibling to the left.
Move Right lets you select the next sibling to the right. If a bar is selected, the bar to
the right of it is selected. If a base is selected, then, if the parent has another child to
the right, that is selected. This button is grayed out if nothing is selected, or if the
current selection has no sibling to the right.
First Child lets you select the first child of the current node. This button is grayed out
if there is no selection, if a bar is selected, or if the current selection has no children.
Last Child lets you select the last child of the current node. This button is grayed out
if there is no selection, if a bar is selected, or if the current selection has no children.
Choose Child produces a popup menu that lists all the children of the current node.
This button is grayed out if there is no selection, if a bar is selected, or if the current
selection has no children.
You also can perform these functions using the Go menu (see "The Go Menu" on
page 167).
Thumbwheels
Four thumbwheels appear around the lower part of the graphics window border (see
Figure 5-10). They let you dynamically move the viewpoint.
Figure 5-10 Tree Visualizer's Thumbwheels
The vertical H (height) thumbwheel, on the upper left, moves the camera up and
down. You cannot move the viewpoint below ground level.
The vertical Tilt thumbwheel, at the bottom left, tilts the camera. You can tilt the
viewpoint to any position between straight ahead and straight down. You cannot tilt
the viewpoint to look up.
The horizontal <--> (pan) thumbwheel, at the bottom left, moves the viewpoint
from left to right and back. You cannot rotate the viewpoint.
The vertical Dolly thumbwheel, on the right, moves the viewpoint forward and
backward.
Height Slider
A slider to the top left of the main window (Figure 5-11) lets you rescale all objects in the
window. Pushing the slider up to a value of 2.0 doubles the size of all objects in the main
window. Pulling the slider back down to a value of 1.0 returns the objects in the window
to their original heights.
Figure 5-11 Tree Visualizer's Height Slider
Pulldown Menus
You also can access all of the Tree Visualizer's functions via five pulldown menus. These
are labeled File, Show, Display, Go, and Help.
If you start the Tree Visualizer without specifying a configuration file, only the File and
the Help menus are available. The Show, Display, and Go menus are available after a
graph is loaded.
The File Menu
The File menu (Figure 5-12) contains nine options.
Figure 5-12 Tree Visualizer's File Pulldown Menu With Options
Open loads and opens a configuration file, displaying it in the main window.
Previously displayed data is discarded. Use Open to view a new dataset, or to view
the same dataset after changing its conguration.
Open Other Window opens a configuration file, but displays its results in a different
window. The current dataset remains open.
Reopen reopens the currently opened file. This can be used after the configuration or
data file has been updated.
Copy Other Window opens a new window that displays the same view of the current
dataset. You can interact with these windows independently.
Save As saves the state of the current Tree Visualizer window into an image file. You
specify the file name (default is treeviz.rgb), the format (default is rgb), and
whether to save the entire window, including any legends, or just the main scene
with the graphical objects (default is the full window).
Print Image outputs the state of the current Tree Visualizer window to a printer. You
can specify the output printer using a Print dialog panel (default is your system's
default printer) and, like the Save As dialog, choose whether to print the entire
window or just the main scene window.
Start Tool Manager starts the Tool Manager (if not already running), and restores it to
the state it was in when the Tree Visualizer was invoked.
Close closes the current window (and all panels associated with it). If no other
windows are open, Close exits the application.
Exit closes all windows and exits the application.
The Show Menu
The Show menu (Figure 5-13) contains four options:
Overview
Search Panel
Filter Panel
Marks Panel
Each of these options brings up another dialog box for interacting with the data.
Figure 5-13 Tree Visualizer's Show Pulldown Menu With Options
The Overview Window
Select Overview in the Show menu to bring up a new window with an overhead view of
the complete hierarchy (Figure 5-14). If you want the Overview to be brought up
automatically each time the scene is viewed, set the Overview option in the configuration
file (see "Overview" on page 566).
Figure 5-14 Tree Visualizer's Overview Window
The X in the Overview window shows your current location. The Overview helps you
keep track of your location and viewpoint in the entire scene. It can also help you quickly
go to a specific node.
To select an object in the Overview and have the main view zoom to it, left-click that
object. This is similar to left-clicking the object in the main view. Middle-clicking
anywhere in the overview zooms your viewpoint to that location, even if no object is at
that point.
The Search Panel
Select Search in the Show menu to bring up a dialog box that lets you specify criteria to
search for objects (Figure 5-15).
Figure 5-15 Tree Visualizer's Search Dialog Box
Once the search is complete, yellow spotlights highlight objects matching the search
criteria (see Figure 5-16). To display information about an object under a yellow
spotlight, move the pointer over that spotlight; the information appears in the upper left
corner, under the label "Pointer is over:". To select and zoom to an object under a yellow
spotlight, left-click the spotlight; if you press the Ctrl key while clicking, zooming does
not occur.
Figure 5-16 Sample Results of a Search in the Tree Visualizer
Items in the Search Panel
To specify whether a search is case-sensitive, click the Ignore Case In Searches checkbox, at
the top of the Search panel. For example, if this toggle is on (a check mark appears on that
button), the string hello is the same as HellO.
To the right of the case sensitivity checkbox is another, labeled Treat Nulls as Zeros. If this
checkbox is off (the default), comparisons involving nulls cannot return TRUE in a
search. If it is on, nulls are treated as equal to zero.
Below the case-sensitivity checkbox are controls that let you specify the parts of the
hierarchy to be searched. By default, the whole hierarchy is searched. To limit the levels
searched, select a relational operator (such as <=) from the option menu that lets you
specify the operand for the level. Then use the slider to select the level to be searched.
Level 0 is the root of the hierarchy, level 1 is the level below that, and so forth. To search
the root and the two levels below that, for example, choose <= 2.
Checkboxes also let you choose whether to search the bars or the bases.
When searching through bars, the default is that all bars are searched. To search only a
specific list of bars, you must select them. The Set All button turns on all bars; this is
useful if most of the bars are to be searched, and only a few are to be turned off. The Clear
button turns off all bars. If no bar is selected, the bar list is ignored, and all bars are
searched.
Below the panel for bar labels is a Hierarchy field that lets you specify nodes to search
(Figure 5-17). Below the Hierarchy field are fields that let you specify search criteria for
individual columns (defined in the Current Columns: window of the Tool Manager's
Table Processing pane; see "Selecting the Tree Visualizer Tool" on page 132).
Figure 5-17 Detail of the Tree Visualizer's Search Dialog Box
To search for numeric values, enter the value, and select a relational operation (=, !=, >,
<, >=, <=). To search for alphanumeric values, enter the string for which you want to
search. You can use any of three types of string comparisons:
• "Contains" indicates that the value contains the given string. For example, California
contains the strings Cal and forn.
• "Equals" requires the strings to match exactly.
• "Matches" allows wildcards:
An asterisk (*) represents any number of characters.
A question mark (?) represents one character.
Square braces ([ ]) enclose a list of characters to match.
For example, California matches Cal*, Cal?fornia, and Cal[a-z]fornia.
In some cases (usually associated with binning in the Tool Manager), an option menu of
values appears, instead of a text field. To ignore that variable, select Ignored in the Option
menu. You can use relational operators (such as >=) with these options. This means that
the specified value as well as subsequent ones are selected.
In addition to numeric and string comparison operations, you can specify "Is Null",
which is true if the value is null.
To the right of each search field is an additional option menu that lets you specify "And"
or "Or" options. For example, you could specify sales > 20 And < 40. You can have any
number of And or Or clauses for a given column, but cannot mix And and Or in a single
column.
Note that if different levels of the hierarchy are keyed by different types of data (for
example, the top level is selected by strings, while the second level is selected by
integers), then the Hierarchy search field is treated as a string and provides string
operations, not number operations.
If the Ignore Case In Searches checkbox is checked, the comparisons of all string searches
are case-insensitive.
Six buttons are placed across the bottom of the Search panel:
Search causes the search to be started. This button is automatically activated if the
Enter key is pressed and the panel is active.
Clear turns off all search spotlights and erases the values from the search fields.
Next selects and zooms to the next matched object, in left-to-right order. After the
last matched object is selected, clicking Next returns the view to the Home position.
Next is valid only after a search that has found matches.
Previous selects and zooms in the opposite order from that of the Next button.
Select causes all objects that matched the search criteria to be selected. The Selections
menu can then be used to interact with these objects.
Close closes the search window and turns off the search spotlights. If the Search
panel is reopened, it is in the same state as it was before the last Close; clicking Search
again repeats the last search.
The Filter Panel
The Filter panel filters out selected information, thus fine-tuning the displayed hierarchy.
You can use the Filter panel to emphasize specific information, or to shrink the amount
of data for better performance. Figure 5-18 shows a sample Filter panel.
Figure 5-18 Tree Visualizer's Filter Dialog Box
To specify whether a filter is case-sensitive, click the Ignore Case In Filter checkbox, at the
top of the Filter panel. For example, if this toggle is on (a check mark appears on that
button), the string hello is the same as HellO.
To the right of the case sensitivity checkbox is another, labeled Treat Nulls as Zeros. If this
checkbox is off (the default), comparisons involving nulls cannot return TRUE in a filter.
If it is on, nulls are treated as equal to zero.
Below the case-sensitivity checkbox are controls that let you specify the parts of the
hierarchy to be filtered. By default, the whole hierarchy is filtered. To limit the levels
filtered, select a relational operator (such as <=) from the option menu that lets you
specify the operand for the level. Then use the slider to select the level to be filtered. Level
0 is the root of the hierarchy, level 1 is the level below that, and so forth. To filter the root
and the two levels below that, for example, choose <= 2.
Checkboxes also let you choose whether to filter the bars or bases.
When filtering bars, the default is that all bars are filtered. To filter only a specific list of
bars, you must select them. The Set All button turns on all bars; this is useful if most of
the bars are to be filtered, and only a few are to be turned off. The Clear button turns off
all bars. If no bar is selected, the bar list is ignored.
Filtering bars does not affect the information in the base, which continues to include the
summary of all bars.
Below the panel for bar labels is a Hierarchy field, which lets you specify nodes to filter.
Below the Hierarchy field are fields that let you specify filter criteria for individual
columns (defined in the Current Columns: window of the Tool Manager's Table
Processing pane; see "Selecting the Tree Visualizer Tool" on page 132).
To filter for numeric values, enter the value, and select a relational operation (=, !=, >, <,
>=, <=). To filter for alphanumeric values, enter the string for which you want to filter.
You can use any of three types of string comparisons:
• "Contains" indicates that the value contains the given string. For example, California
contains the strings Cal and forn.
• "Equals" requires the strings to match exactly.
• "Matches" allows wildcards:
An asterisk (*) represents any number of characters.
A question mark (?) represents one character.
Square braces ([ ]) enclose a list of characters to match.
For example, California matches Cal*, Cal?fornia, and Cal[a-z]fornia.
In some cases (usually associated with binning in the Tool Manager), an option menu of
values appears, instead of a text field. To ignore that variable, select Ignored in the Option
menu. You can use relational operators (such as >=) with these options. This means that
the specified value as well as subsequent ones are selected.
In addition to numeric and string comparison operations, you can specify "Is Null",
which is true if the value is null.
To the right of each filter field is an additional option menu that lets you specify "And"
or "Or" options. For example, you could specify sales > 20 And < 40. You can have any
number of And or Or clauses for a given column, but cannot mix And and Or in a single
column.
Note that if different levels of the hierarchy are keyed by different types of data (for
example, the top level is selected by strings, while the second level is selected by
integers), then the Hierarchy filter field is treated as a string and provides string
operations, not number operations.
If the Ignore Case In Filters checkbox is checked, the comparisons of all string filters are
case-insensitive.
If a node does not meet the filter criteria, has no bars that meet the criteria, and has no
children that meet the criteria, the node is not shown. There can be, however, cases in
which a specific object meets the filter criteria, but its ancestors up the tree do not. Also,
other bars in the same node might not meet the criteria. Since position is important in
interpreting context, it might not be good to eliminate those bars. Consequently, you are
given an option of selecting one of three radio buttons that control how these objects
should be drawn: Solid, Outline, and Hidden. Note, however, that if objects are drawn in
a less solid form because of the Zeros or Nulls settings in the Display menu, they are
displayed appropriately. For example, if Nulls are to be hidden, they are always hidden,
regardless of the filter criteria.
The exception to this is when filtering to specific bars. In such a case, the other bars are
eliminated and don't take up space, regardless of the radio button settings.
The Height Filter slider lets you filter out those nodes containing only short bars. The size
of a value is shown as a percentage of the maximum height. First, the tallest bar in the
scene is calculated (if heights are normalized by level, then the tallest bar in each level).
Then only those nodes that contain at least one bar that is the appropriate percentage of
the tallest bar are shown.
For example, if you enter 5% in this field, then only those nodes containing at least one
bar that is at least 5% of the height of the tallest bar are shown. (Also shown are ancestors
of such bars). This option is intended as a coarse way to filter out small, uninteresting
nodes. It is not intended as an exact mechanism of identifying specific nodes of a certain
value; use the search panel for that purpose. Use of this option can accelerate the
rendering of slow, complex scenes, or reduce clutter resulting from many bars near zero
height. You can also set this filtering option in the configuration file by using the Height
Filter command.
Although small nodes are filtered out, they are nonetheless counted in any accumulation
up the hierarchy.
The Depth slider, which is under the Height Filter slider, lets you display the hierarchy
so that only a given number of levels are displayed at any given time. When you are at
the top of the hierarchy, only the number of hierarchical levels specied by the slider is
seen. The nodes in the rows are arranged to optimize their visibility. When navigating to
nodes lower in the hierarchy, additional rows are made visible automatically. The nodes
above them automatically adjust their locations to accommodate the newly added nodes;
thus, some nodes might seem to move. Note that the overview shows all nodes in the
hierarchy, not just the top nodes; thus, the layout of the overview might not match the
layout of the main view. The X in the overview approximates the corresponding location
in the main view; there is no exact mapping between the two layouts.
Click the Filter button to start filtering. If the Enter key is pressed while the panel is
active, filtering automatically starts.
Click the Close button to close the panel.
The Marks Panel
The Marks panel, from the Tree Visualizer's Show pulldown menu (Figure 5-13), lets you
name and store important locations (viewpoints) so that you can easily and quickly
return to them (see Figure 5-19). The location is stored relative to the currently selected
object. If no object is selected, the absolute location is recorded.
All marks can be indicated by colored flags in the main view. If the mark represents a
selected object, the flag is placed on that object. If it represents an absolute position, the
flag is placed at that position. To go to the mark, click the flag. All flags can be turned on
and off using the Mark Flags menu entry in the Display menu. (See Mark Flags in "The
Display Menu" on page 165.)
Figure 5-19 Tree Visualizer's Marks Panel
Click the Mark button to mark the current location. Another dialog box appears
(Figure 5-20) to prompt you for the name and color of the mark. The default name is
that of the currently selected object. The color controls the color of the flag that
appears in the main window and represents the mark. If you do not want a flag to
represent the mark, click the button with the Not symbol (slash through a circle).
To add another color to the palette, click the button with the plus symbol (+) to
bring up a color chooser.
Figure 5-20 Window Resulting From Clicking Mark Button
Figure 5-21 shows a sample main window with flags representing the created marks.
Figure 5-21 Main Window With Flags Representing Marks
Click the Go to button to go to the current location associated with the selected mark
in the panel. Double-clicking a mark has the same effect. If the object selected by
that mark no longer exists (because it was filtered out, or the data was changed
since the mark was created), the location shown is close to where the object would
have been.
Click the Delete button to delete the selected mark in the panel.
Click the Modify button to change the name or color of the selected mark in the
panel.
Click the Up button to move the selected mark in the panel up the listing order.
Click the Down button to move the selected mark in the panel down the listing
order.
Click the Close button to exit the marks panel.
The file storing the marks information has the same name as the configuration file, with
a .marks suffix appended. Whenever a mark is changed, all marks are saved to that file.
If all marks are deleted, the .marks file is removed. If mark changes cannot be saved
(because of a permission error, for instance), a warning appears; this warning is not
repeated when subsequent mark changes are attempted.
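For example, assuming you opened the sample configuration file store.treeviz (described in "Sample Configuration and Data Files" at the end of this chapter), the marks you create would be saved to a file named store.treeviz.marks.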
The Display Menu
The Tree Visualizer's Display menu lets you control several display parameters.
Figure 5-22 Tree Visualizer's Display Menu
Base Heights is a checkbox that lets you turn the heights of the bases on and off. To see
negative numbers, or to make it easier to compare the bar heights, turn this option off.
Turning it on provides summary information about all the bars. The initial value of this
toggle can be changed with the base height statement in the configuration file.
Mark Flags is a toggle option that lets you turn on or off the flags representing marks (also
see "The Marks Panel").
Zeros is a submenu that controls how objects with zero height are displayed. By default,
they are shown like other objects: a solid cube of height zero (a plane). The submenu lets
you specify them to be displayed as outlines (appearing as a hollow square), or to be
hidden completely (not drawn). The initial value of this option can be changed using the
zero option in the configuration file (see "Zero" on page 568).
Nulls is a submenu that controls how objects of null height are displayed. It has the same
options as the zero menu; however, the default for null options is to display the objects
as an outline. The initial value can be changed using the null option in the
configuration file (see "Null" on page 569).
The Selections Menu
The Selections menu lets you drill through to the underlying data. This menu has five
items (see Figure 5-23).
Figure 5-23 Tree Visualizer's Selections Menu
Show Values displays a table (Record Viewer) of the values for all selected objects.
Show Original Data retrieves and displays the records corresponding to what has
been selected. The resulting records are shown in a table viewer.
Send To Tool Manager inserts a filter operation, based on the current box selection(s),
at the beginning of the Tool Manager history. The actual expression used to do the
drill through is determined by extents of the current box selection(s). If nothing is
selected, a warning message appears.
Complementary Drill Through causes the Show Original Data and Send To Tool Manager
selections, when used, to fetch all the data that are not selected.
Normalize Subtree determines the maximum height of the elements in the subtree,
and normalizes all values relative to that height.
For further details on drill-through, see Chapter 18, "Selection and Drill-Through."
The Go Menu
The Go menu duplicates the functions of the buttons on the upper right-hand side of the
main window (see Figure 5-24). It also identifies keyboard shortcuts for some functions.
Figure 5-24 Tree Visualizer's Go Pulldown Menu
Home takes you to a designated location. By default, this is the first viewpoint
shown after invoking the Tree Visualizer and specifying a configuration file. If you
have been working with the Tree Visualizer and have clicked the Set Home menu
item, then clicking Home returns you to the viewpoint that was current when you
last clicked Set Home. The keyboard shortcut for this function is Ctrl+H.
Set Home changes the Home location to your current location. Clicking the Home
menu item then returns you to the viewpoint that was current when you last clicked
Set Home.
View All shows the whole hierarchy, keeping the tilt of the camera. To get an
overhead view of the scene, tilt the camera to point straight down, then click the
View All menu item. (To tilt the camera, see the description of the Tilt thumbwheel
in "Thumbwheels" on page 149.)
Go Back lets you return to the previous location. If you have just started the Tree
Visualizer and have not moved from the home view, this menu item is grayed out.
The keyboard shortcut for this function is Ctrl+B.
Go Forward lets you proceed to the location from which you clicked the Go Back
menu item. If you have not clicked the Go Back menu item, the Go Forward menu
item is grayed out. The keyboard shortcut for this function is Ctrl+R.
Parent is active only when an object is selected. If a bar is selected, clicking this
menu item selects the base containing the bar. If a base is selected, clicking this
menu item moves up the hierarchy to the parent node. Once the root node has been
reached (highest level of the hierarchy), the Parent menu item is grayed out. The
keyboard shortcut for this function is Ctrl+U.
Move Left lets you select the next sibling to the left. If a bar is selected, the bar to the
left of it is selected. If a base is selected, then, if the parent has another child to the
left, that is selected. This button is grayed out if nothing is selected, or if the current
selection has no sibling to the left.
Move Right lets you select the next sibling to the right. If a bar is selected, the bar to
the right of it is selected. If a base is selected, then, if the parent has another child to
the right, that is selected. This button is grayed out if nothing is selected, or if the
current selection has no sibling to the right.
First Child lets you select the first child of the current node. This button is grayed out
if there is no selection, if a bar is selected, or if the current selection has no children.
Last Child lets you select the last child of the current node. This button is grayed out
if there is no selection, if a bar is selected, or if the current selection has no children.
The Help Menu
The Help menu (see Figure 5-25) provides access to six help functions.
Figure 5-25 Tree Visualizer's Help Pulldown Menu
Click for Help turns the cursor into a question mark. Placing this cursor over an
object in the main window and clicking the mouse causes a help screen to appear;
this screen contains information about that object. Closing the help window restores
the cursor to its arrow form and deselects the help function. The keyboard shortcut
for this function is Shift+F1. (Note that it also is possible to place the arrow cursor
over an object and press the F1 function key to access a help screen about that
object.)
Overview provides a brief summary of the major functions of this tool, including
how to open a file and how to interact with the resulting view.
Index provides an index of the complete help system. This option is currently
disabled.
Keys & Shortcuts provides the keyboard shortcuts for all of the Tree Visualizer's
functions that have accelerator keys.
Product Information brings up a screen with the version number and copyright notice
for the Tree Visualizer.
MineSet User's Guide invokes the IRIS InSight viewer with the online version of this
manual.
Null Handling in the Tree Visualizer
Nulls represent unknown data (see Appendix J, "Nulls in MineSet").
In the Tree Visualizer, nulls can occur in the following cases:
The database or data le contains a null value.
The skipMissing option is not present in the configuration file (see "skipMissing" in
Appendix B) and data is present for the key value in one node of the hierarchy, but
not in another. For example, in a representation of state budgets, if there is no record
for state income tax for Texas, Texas would have an income tax of null. This is
different from the case where there is a record showing 0 as the income tax for
Texas, in which case it would show a tax of 0.
When the Tool Manager is used to make an array based on bins and no data falls
into a specic bin, the value for that bin is null. For example, if there is no data for
30-40 year olds, that bin is null.
When making an array in the Tool Manager and the null enum option is specified,
an extra array entry, corresponding to the first bar in each bar chart, is created to
represent the aggregation of all the values where the bin value is null (see
"Aggregations in the Presence of Nulls" in Appendix J). This bar is labeled with a
question mark (?), representing null. If there is no data for that null bin, the values
associated with it are null as well.
Note: If all values throughout the data associated with the null bin are null, the Tree
Visualizer ignores the null bin and does not display it.
Expressions and aggregations of nulls can generate nulls (see Appendix J).
When a null value is mapped to a visual attribute, special representations are used in the
Tree Visualizer. If null is mapped to height, the object is normally drawn in outline mode,
although this is configurable through the Display menu (see "The Display Menu"
section) or the configuration file (see "Null" in Appendix B). For a bar or a base, this looks
like an empty square. (It does not look like a cube, since it has no height.) For a disk, it
looks like a circle. If a null value is mapped to a color, it is drawn in a dark grey (see
Figure 5-26).
Figure 5-26 Representation of a Null Value Mapped to Height, Color, Disk, and Label
When selecting an object with a null value, it is shown as a question mark (?) in the
selection field.
Sample Configuration and Data Files
The provided sample configuration and data files demonstrate the Tree Visualizer's
features and capabilities. The following files are in the directory
/usr/lib/MineSet/treeviz/examples:
store.data and store.treeviz
When graphically displayed, these files show hypothetical sales data for a store
chain. The hierarchy includes the entire chain, regions, states, cities, and individual
stores. Four products are shown for each level in the hierarchy. In this configuration,
heights represent sales in dollars; colors represent the percentage of the target dollar
amount.
stateRevenue.data and stateRevenue.treeviz
When graphically displayed, these files show the revenue components of every
state's budget for 1992, as obtained from the United States Census Bureau (from
http://www.census.gov/govs/state/stn92.dat). Heights represent the dollar
amounts in taxes. The descendent nodes in the background show the contribution
of various taxes to the total revenues shown in the root node.
beer.data and beer2.data, and beer.treeviz and beer2.treeviz
When graphically displayed, these files show fictitious data based on consumer
research of beer purchases. The hierarchy contains three levels:
1. The first is category (for example, beer or ale).
2. The second level is brand codes (randomly assigned).
3. The third is the individual product codes; for example, twelve-pack versus
six-pack (randomly assigned).
Each chart contains seven bars, representing seven age groups. Bar height
represents the total dollars spent by that age group. Colors represent the percentage
of dollars spent by males and females. Brands, products, and data used in these files
are samples only.
Both beer.treeviz and beer2.treeviz produce the same graphical output, but they have
been constructed differently. In beer.treeviz, each type of beer is represented by a
single record, with values for male and for female consumption; these values are
stored in an enumerated array (explained in Appendix B, "Creating Data and
Configuration Files for the Tree Visualizer").
In beer2.treeviz, there are seven records for each beer, with each record representing
one age group. Note that in the beer file, the age groups are represented in the
configuration file; in the beer2 file, they are included in the data file.
The beer file requires less storage space than the beer2 file; however, the
configuration file is a little more complicated. In some cases, it might be easier to
produce data in the form used by the beer2 file.
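To explore these samples, you can open one of the configuration files from the Tree Visualizer with File > Open. Assuming the Tree Visualizer also accepts a configuration file on the shell command line, as the Map Visualizer does (see Chapter 6), a command along these lines should work (the command name and path are assumptions based on conventions shown elsewhere in this guide):
treeviz /usr/lib/MineSet/treeviz/examples/store.treeviz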
Additional examples of using the Tree Visualizer to visualize a decision tree are provided
in Chapter 11.
Chapter 6
6. Using the Map Visualizer
This chapter discusses the features and capabilities of the Map Visualizer. It provides an
overview of this visualization tool, then explains the Map Visualizer's functionality
when working with the following elements:
main window
viewing modes
external controls
pulldown menus
Finally, it lists and describes the sample files provided for this tool.
Overview of Map Visualizer
The Map Visualizer is a graphical interface that displays data as a three-dimensional
landscape of arbitrarily specied and positioned bar chart shapes. This tool displays
quantitative and relational characteristics of your geographically oriented data.
Data items are associated with graphical bar chart objects in the visual landscape.
However, the objects have recognizable geographical shapes and positions. The
landscape can consist of a collection of these geographical objects, each with individual
heights and colors (see Figure 6-1). You can dynamically navigate through this landscape
by
panning
rotating
zooming to more clearly see areas of interest
drilling down to see increased granularity of geographic details
drilling up to aggregate data into coarser-grained graphical objects
using animation to see how the data changes across one or two independent
dimensions.
Figure 6-1 Sample Map Visualizer Screen Showing 1990 U.S. Population
The landscape can also consist of a flat plane of these geographical objects drawn as
simple outlines, with bar chart cylinders placed at specific locations (see Figure 6-2).
Figure 6-2 Sample Map Visualizer Screen Showing Relative Population of Major U.S. Cities
Another landscape possibility is lines with endpoints at specific point locations, all with
individual widths and colors (see Figure 6-3). Lines have width and color properties,
instead of the height and color properties of the arbitrarily shaped objects and cylinders.
Figure 6-3 Sample Map Visualizer Screen Showing the United States With Specific Endpoints
File Requirements
The Map Visualizer requires the following files:
A data file consisting of rows of tab-separated fields. Typically, the Tool Manager
creates this file (see Chapter 3). You can also generate this file without using the Tool
Manager (for the required file format, see Appendix C, "Creating Data,
Configuration, Hierarchy, and GFX Files for the Map Visualizer").
Data files are the result of extracting raw data from a source (such as an Oracle,
INFORMIX, or Sybase database) and formatting it specifically for use by the Map
Visualizer. Data files have user-defined extensions (the sample files provided with
the Map Visualizer have a .data extension).
A gfx file consisting of a description of the shapes and locations of the 1-, 2-, or
3-dimensional objects to be displayed.
Gfx files must have a .gfx extension. MineSet includes various .gfx files, including
the United States to the granularity of counties, telephone area codes, and postal zip
codes, as well as Canada to the granularity of provinces. You can also manually
generate .gfx files (see Appendix C, "Creating Data, Configuration, Hierarchy, and
GFX Files for the Map Visualizer" for the required file format).
A hierarchy file consisting of a description of
the column names of the various graphical objects to be displayed
the filenames of the .gfx files that describe the locations and shapes of the
graphical objects
an optional description of the hierarchical relationship of the graphical objects,
which is used for the drill-down and drill-up functions.
Hierarchy files enable drill down and drill up. This means that information
associated with objects at one level can be aggregated (or, conversely, shown in
greater detail) and displayed at a different level. For example, a hierarchy file
defining the relationships between states and regions comprising multiple states
allows values such as population levels to be displayed at both the individual state
level as well as at regional levels. The gfx_files/usa.state.gfx file, for example,
describes the shapes of the 50 United States; the gfx_files/usa.state.hierarchy file
describes the hierarchy grouping individual states into regions, regions into
East-West areas, and the East-West areas into an aggregated United States.
For more information, see Appendix C, "Creating Data, Configuration, Hierarchy,
and GFX Files for the Map Visualizer."
A configuration file describing the format of the input data and how these are to be
displayed. Typically, this file is created using the Tool Manager (see Chapter 3). You
also can use an editor (such as jot, vi, or Emacs) to produce this file without using
the Tool Manager (see Appendix C, "Creating Data, Configuration, Hierarchy, and
GFX Files for the Map Visualizer").
Configuration files should have a .mapviz extension. If they do not, they are not
listed when selecting the Open option from the File pulldown menu. When starting
the Map Visualizer, or when opening a file, specify the configuration file, not the
data file.
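As an illustration, displaying a population value per state might involve a file set such as the following (usa.state.gfx, usa.state.hierarchy, and population.usa.mapviz are names of files shipped with MineSet; the data filename shown is an assumption for this example):
population.usa.data        the tab-separated data file
population.usa.mapviz      the configuration file you open in the Map Visualizer
usa.state.hierarchy        the hierarchy file naming the .gfx files to use
usa.state.gfx              the shapes of the individual states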
Starting the Map Visualizer
There are five ways to start the Map Visualizer:
Use the Tool Manager to configure and start the Map Visualizer. See Chapter 3 first
for details on most of the Tool Manager's functionality, which is common to all
MineSet tools; see below for details about using the Tool Manager in conjunction
with the Map Visualizer.
Double-click the Map Visualizer icon, which is in the MineSet page of the icon
catalog. The icon is labeled mapviz. Since no configuration file is specified, the
start-up screen requires you to select one by using File > Open.
Figure 6-4 Map Visualizer's Startup Screen, With File Pulldown Menu Selected
Starting the Map Visualizer without specifying a configuration file causes the main
window to show the copyright notice for this tool. Only the File and Help pulldown
menus can be used. For the main window to be fully functional, open a
configuration file by selecting File > Open (Figure 6-4).
If you know what configuration file you want to use, double-click the icon for that
configuration file. This starts the Map Visualizer and automatically loads the
configuration file you specified. This only works if the configuration filename ends
in .mapviz (which is always the case for configuration files created for the Map
Visualizer using the Tool Manager).
Drag the configuration file icon onto the Map Visualizer icon. This starts the Map
Visualizer and automatically loads the configuration file you specified. This works
even if the configuration filename does not end in .mapviz.
Start the Map Visualizer from the UNIX shell command line by entering this
command at the prompt:
mapviz [ configFile ]
where configFile is optional and specifies the name of the configuration file to use. If
you don't specify a configuration file, you must use File > Open to specify one (see
Figure 6-4).
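For example, to open the population sample discussed later in this chapter directly from the shell, a command like the following should work (the path assumes the Map Visualizer samples are installed under /usr/lib/MineSet/mapviz/examples, which is an assumption based on the Tree Visualizer layout; adjust it for your installation):
mapviz /usr/lib/MineSet/mapviz/examples/population.usa.mapviz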
Options for Invoking the Map Visualizer
There are two options that affect how this tool is invoked:
-warnexecute indicates that if you attempt to execute a command specified in an
execute statement, a warning is displayed and you are given the option to execute
the command or not. This is intended for an insecure environment, such as files
obtained from the Web, and is used automatically when commands are executed via
mtr files.
You can enable this option permanently by adding the line
*minesetWarnExecute:TRUE
to your .Xdefaults file, or by setting the environment variable
MINESET_WARN_EXECUTE
-quiet eliminates the dialogs that pop up to indicate progress. You can enable this
option permanently by adding the line
*minesetQuiet:TRUE
to your .Xdefaults file.
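For example, to enable both behaviors for every session, you could add these two lines (taken from the descriptions above) to your .Xdefaults file:
*minesetWarnExecute:TRUE
*minesetQuiet:TRUE
The execute warning can also be enabled through the environment; in a csh-style shell this might look like setenv MINESET_WARN_EXECUTE 1 (the required value, if any, is not specified above, so the value shown here is only an assumption).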
Configuring the Map Visualizer Using the Tool Manager
This section describes how the Map Visualizer can be configured using the Tool Manager.
Although the Tool Manager greatly simplifies the task of configuring the Map Visualizer,
you can construct a configuration file manually for this tool using a text editor (see
Appendix C, "Creating Data, Configuration, Hierarchy, and GFX Files for the Map
Visualizer").
Note that the steps required to connect to a data source are described in Chapter 3.
Generating .gfx and .hierarchy Files
To use the Map Visualizer, you must provide the application with two files that define
the graphical objects to be displayed:
One or more .gfx files, which define the shapes of the graphical objects displayed.
A .hierarchy file, which describes the relationship of multiple, interrelated map (.gfx)
files.
These files are not created by the Tool Manager; they must already exist as part of
MineSet (residing in the /usr/lib/MineSet/mapviz/gfx_files directory), or they must be
created by the user. For instructions on their creation, see Appendix C, "Creating Data,
Configuration, Hierarchy, and GFX Files for the Map Visualizer."
The .gfx and .hierarchy files that are part of the MineSet package include
the individual states of the United States
the areas covered by the individual counties of the United States
the areas covered by the individual five-digit ZIP codes of the United States
the areas covered by the telephone area codes of the United States
the individual provinces and territories of Canada
the individual states of Mexico
the individual states and territories of Australia
the individual countries of Western and Central Europe
regional subdivisions of both France and The Netherlands
The Map Visualizer requires a data file with
One column indicating geographical objects (for example, states). Each row in this
column must indicate a unique geographical object (staying with the example, this
means one row for each state).
At least one column with numeric values mapped (using arithmetic expressions) to
the heights and/or colors of each geographic bar. These columns can be scalar, a 1D
array, or a 2D array. If the column is an array, a slider must be used to select specific
data points for this mapping to heights and colors.
If both heights and colors are mapped to 1D or 2D arrays, the arrays must have the same
indexes (see Appendix C, "Creating Data, Configuration, Hierarchy, and GFX Files for
the Map Visualizer").
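For example, a minimal data file meeting these requirements might look like the following hypothetical tab-separated excerpt (the column names and values are illustrative only and are not taken from the shipped samples):
state         population
California    29760021
Texas         16986510
New York      17990455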
Selecting the Map Visualizer Tool
Select the Viz Tools tab in the Data Destination panel of the Tool Manager's main screen
(Figure 6-5). From the popup list of tools, select Map Visualizer. The window on the right
side of this panel displays the mapping requirements for the Map Visualizer. Items in the
Visual Elements list that are preceded by an asterisk are optional.
Figure 6-5 Data Destination Panel, With Map Visualizer Selected
Entity - Bars lets you specify which column contains the keywords of the graphical
objects.
Height - Bars lets you specify the heights of the geographic bars on the map.
*Color - Bars lets you assign the colors of the geographic bars. See "Choosing
Colors and Using the Color Browser" in Chapter 3 for a more detailed
explanation of how to choose and change colors.
*Slider1 and *Slider2 let you map columns directly to one or two animation Sliders
(see "Slider Creation for Mapviz," below).
Mapping Columns to Visual Elements
A column in the Current Columns window should be mapped to the Visual Element
Height - Bars by clicking the column first, then Height - Bars. Optionally, another column
(perhaps even the same column) can be mapped to the Visual Element *Color - Bars.
Another column must be mapped to the Visual Element Entity. This must be a string
column.
Undoing Mappings
To undo a mapping, select the mapping in the Requirements: window, then click the
Clear Selected button. To undo all mappings, click the Clear All button.
Slider Creation for Mapviz
Sliders can be created manually or automatically. The following subsections describe
these methods.
Manual Slider Creation
Tool Manager generates sliders whenever there is an array column present in the current
table. The sliders correspond to the indices of the array columns. If the column has one
index (one-dimensional array), only one slider is created, but if the column has two
indices (two-dimensional array), both an X and a Y slider are created. The current slider
indices are indicated in the Tool Options dialog box from the Tool Manager.
Note that for a slider to be created, all array columns in the current table must have the
same indices. If array columns with differing indices exist in the current table, no slider
is created.
See "Aggregation" in Chapter 3 for more information on creating arrayed columns.
Automatic Slider Creation
If no arrayed columns are in the current table, Tool Manager can automatically generate
sliders by use of the Slider1 and Slider2 mappings. Sliders are created through a
combination of automatic binning and aggregation. These automatic operations occur
after clicking Invoke Tool. The operations do not affect the current history operations of
Tool Manager, but they do appear in the configuration files for the tool.
Columns mapped to Slider1 and Slider2 eventually form the indices for the sliders. These
columns must be either numeric (int, float, double) or binned. If a column mapped to a
slider is already binned, no automatic binning is needed for this column, and this column
is used as an index for a slider. However, if the column is not binned, a binned column is
created using the automatic binning options in the Tool Options dialog box.
The three methods of binning are:
Selecting All Distinct Values creates a bin for every unique value of the column.
Specifying the number of bins lets you choose how many bins to create; the
thresholds for the bins are determined using the Uniform Range approach.
Selecting Automatic automatically determines the number of bins to create and
determines the bin thresholds using the Uniform Range approach.
(See "The Bin Columns Button" in Chapter 3 for more information about binning.) The
column used in forming the automatic bins is deleted from the current table.
The binned columns now form the indices of array columns. Note that if you want to
create only one slider, the index must be mapped to Slider1. Attempting to create only
one slider with a mapping to Slider2 is not allowed and generates a Tool Manager error.
Also, a column mapped to a slider cannot be mapped to any other mapping, since it is
removed during the aggregation process.
Once the slider indices are formed, the arrayed columns are created. This is done using
automatic aggregation. Any numeric columns mapped to Height or Color are
aggregated using the automatic aggregation options in the Tool Options dialog box. You
can either specify aggregating by Sum or by Average. The binned columns created from
the slider mappings form the indices for the aggregation. The column mapped to Entity
is the only Group-By column. Any remaining columns in the table are removed. (See
"Aggregation" in Chapter 3 for a description of the aggregation process.)
The aggregation step automatically forms the arrayed columns used for sliders. These
arrayed columns form the new tool mappings. For example, if the column mpg were
mapped to Height, a new column avg_mpg[] is formed and remapped to Height. The
progress of the automatic slider generation is displayed in the Tool Manager status
window.
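As a fuller hypothetical example (the column and slider names here are illustrative, not from the shipped samples): suppose the current table contains state, year, and population, with state mapped to Entity, population mapped to Height, and year mapped to Slider1. The year column is binned using the automatic binning options, the resulting binned column becomes the slider index, and population is aggregated over those bins with state as the Group-By column, producing an arrayed column such as sum_population[] (or avg_population[], depending on the aggregation option chosen) that is remapped to Height.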
Specifying Tool Options
Clicking the Tool Options button causes a new dialog box to be displayed (Figure 6-6).
This lets you change some of the Map Visualizer options from their default values.
Figure 6-6 Map Visualizer's Options Dialog Box
The following sections describe the buttons and fields of the Map Visualizer's Options
dialog box.
Geography
The Entities File specifies a .hierarchy file to be used for the representation of the
geographical "entity" objects in the Map Visualizer's main window.
The Outlines File specifies outline objects to draw, which appear as a flat plane on which
the 3-D entity objects are placed.
The Find File button lets you browse your files to find the .hierarchy file to be used.
Note that the Entities File and Outlines File fields are optional. If the Entities File is not
supplied, then the Map Visualizer creates graphical entity objects consisting of simple
rectangles that are arbitrarily sized and placed in the scene.
Height
This section specifies an initial height Scale value (default is 1.0) and whether to display
a height legend at the bottom of the Map Visualizer window.
Color
To use these Color options, you must have mapped a column to the *Color - Bars
requirement of the Data Destination panel. See "Choosing Colors and Using the Color
Browser" in Chapter 3 for a more detailed explanation of how to choose and change
colors.
Color List: You can specify the color list using the + button next to the color list label. This
brings up a color editor that lets you specify a color to be added to the list.
Mapping: You can specify whether the color change that is shown in the graphic display
is Continuous or Discrete. If you choose Continuous, the color values shift gradually
between the colors entered in the Color List field as a function of the values that are
mapped to those colors in the Mapping field.
The field to the right of the popup button lets you enter specific values to which the colors
are mapped. You must have the same number of values in this field as there are colors
entered in the Color list to use field.
Example 6-1
If you
used the Color Browser to choose gray and red
selected Discrete for the Mapping
entered the values 0 150000
then the display shows the population of the United States across the time period
1770-1990. States with more than 150,000 square miles are shown in red, the rest are in
gray.
Example 6-2
If you
used the Color Browser to choose gray and red
selected Continuous for the Mapping
entered the values 0 300000
then the display shows the population of the United States across the same time period.
The states' colors vary from gray to red, depending on their size; the largest states are
shown with the greatest density of red.
You can enter as many colors into this field as necessary for your display. If the number
of values in the column that maps to *Color - Bars exceeds the number of distinct colors
you have chosen, the Map Visualizer adds an appropriate number of randomly chosen
colors at runtime.
Legend On: Lets you determine whether a color legend is displayed or hidden.
Normalize On: Lets you determine whether the Map Visualizer automatically scales the
colors between the color column's minimum and maximum values (this is called color
normalization), as opposed to you manually specifying threshold values. When
Normalize On is enabled, the threshold values must lie within the range 0 to 100,
representing a percentage of the color column's minimum to maximum numeric range.
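For example (a hypothetical illustration): if the color column's values range from 100 to 1100 and Normalize On is enabled, a threshold of 50 refers to 50% of that range, that is, to the actual column value 100 + 0.50 x (1100 - 100) = 600.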
Sliders
You can manually select a binned column to be associated with the slider(s), where the
binned column indexes an aggregated array that is mapped to height or color.
Alternatively, you can have the Tool Manager automatically perform the binning and
aggregations. For more details on the Slider options, see "Slider Creation for Mapviz" on
page 183.
Message Field
This lets you specify the message displayed when an entity is selected. For a listing and
description of format types that can be entered in this field, see the "Message Statement"
section in Appendix C, "Creating Data, Configuration, Hierarchy, and GFX Files for the
Map Visualizer."
Title Field
This lets you specify a string that appears at the bottom of the Map Visualizer main
window. This string must be enclosed in double-quotes.
Execute Field
This option lets you type in a UNIX command that is executed when double-clicking on
an entity. The format is similar to the message statement. If no execute statement appears,
double-clicking has no effect.
For a detailed description of the Execute field, see "Execute Statement" in Appendix C.
Resetting the Tool Options
If, after making changes to the Tool Options dialog box, you want to reset the values of
all options to their default values, click the Reset Options button.
Accepting the Tool Options
Once you have finished making changes to the Tool Options dialog box, click OK to
return to the Tool Manager's main screen.
Saving Map Visualizer Settings
The Tool Manager stores information for the Map Visualizer in several files, all sharing
the same prefix:
<prefix>.mapviz.data contains data.
<prefix>.mapviz.schema describes the data file.
<prefix>.mapviz contains information needed by the Map Visualizer.
<prefix>.mineset contains all the information needed to create the other files.
To specify a prefix, use the Save ... menu option in the File menu of the Tool Manager's
main window. If you do not specify a prefix, it is based on the data source.
When you use the Invoke Tool button, the .data, .schema, and .mapviz files are updated, if
necessary.
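For example, if you chose the prefix population (a name used here only for illustration), the Tool Manager would write population.mapviz.data, population.mapviz.schema, population.mapviz, and population.mineset.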
Invoking the Map Visualizer
To see the Map Visualizer graphically represent your data, click the Invoke Tool button at
the bottom of the Data Destination panel.
Working in the Map Visualizer's Main Window
If you started the Map Visualizer without specifying a configuration file, the main
window shows the copyright notice for the Map Visualizer. Only the File and Help
pulldown menus can be used. For the main window to show all menus and controls,
open a configuration file. Use File > Open (Figure 6-4) to see a list of configuration files.
When a valid configuration file has been specified, its geographical landscape is visible.
For example, Figure 6-7 shows the results of specifying population.usa.mapviz and moving
the Year slider to the far right.
Figure 6-7 Population.usa.mapviz Example With the Slider Moved to 1990
This shows the population and population density for each state of the United States. The
population of each state is represented by the height of the state's graphical shape.
Heights are relative to each other across the entire range of the animation controls.
Viewing Modes
The two modes of viewing are grasp and select. To toggle between these modes, move the
cursor into the main window, and press the Esc key. You can also change from one mode
to the other by clicking the appropriate button: to enter select mode, left-click the arrow
button (to the top-right of the main window); to enter grasp mode, left-click the hand
button (immediately below the arrow button, near the top right of the main window).
Grasp Mode
In grasp mode, the cursor appears as a hand. This mode supports panning, rotating, and
scaling the scene's size in the main window.
To pan the display, press the middle mouse button and drag it in the direction you
want the display panned.
To rotate the display, press the left mouse button and move the mouse in the
direction you want to rotate.
To move the viewpoint forward, press the left and middle mouse buttons
simultaneously and move the mouse downwards. To move the viewpoint
backward, press the left and middle mouse buttons simultaneously and move the
mouse upwards. This is equivalent to the functions provided by the Dolly
thumbwheel.
Select Mode
In select mode, you can highlight an object by positioning the cursor over that object.
Information about that object then appears at the top of the view area. This information
remains visible in the window only as long as the pointer cursor remains over the object.
If you position the pointer cursor over an object and click the left mouse button, the same
information appears in the Selection Window, which is above the main window, under
the Selection label (Figure 6-8).
Figure 6-8 Highlighted Information in the Viewing Window and Selected Information
This Selection information remains visible until you select another object or click the
background. Using the mouse, you can cut and paste this text into other applications,
such as reports or databases.
Drilling Down and Drilling Up
To view a finer level of geographical granularity for an object (if the .data and .hierarchy files support it), click the right mouse button while the cursor is over that object. This is called drilling down. You can repeat this down to the finest level of granularity supported by the data. If the cursor is positioned over a specific object when drilling down, only the more detailed sub-objects of that object appear. If, instead, the cursor is positioned on the background at the time of the mouse click, then the more detailed sub-objects of the entire set of objects appear. This might produce a display with a large number of individual objects. The greater the number of objects, the longer the Map Visualizer takes to construct the scene, and the slower the performance when moving the animation controls.
To move up one level and view a coarser geographical granularity (drill up), click the middle mouse button. If the cursor is positioned on the background when you click, all the higher-level objects appear. If the cursor is positioned on a specific object in the scene, then the scene returns to the group of higher-level objects visible when you last drilled down with the right mouse button.
If an execute statement was specified via the Tool Manager or the configuration file, then double-clicking an object executes the appropriate command. If the -warnexecute option was specified when invoking the Map Visualizer, a warning is given first.
Note: By default, the Map Visualizer initially displays objects at the finest granularity (the lowest level of the hierarchy); thus, initially, only drill-up (to coarser granularity) is active.
External Main Window Controls
Several external controls surround the graphics window. These consist of buttons,
sliders, and a summary window. Each of these controls is described in this section.
Buttons
At the top right of the image area are 11 buttons (see Figure 6-9).
Figure 6-9 Detail View of Top Right Buttons
Arrow puts you in select mode, which lets you highlight entities in the main
window. When in this mode, the cursor shape is an arrow.
Hand puts you in grasp mode, which lets you rotate, zoom, and pan the display in
the main window. When in this mode, the cursor shape is a hand.
Viewer help brings up a help window describing the viewer itself.
Home takes you to a designated location. Initially, this location is the first viewpoint shown after invoking the Map Visualizer and specifying a configuration file. If you
have been working with the Map Visualizer and have clicked the Set Home button,
then clicking Home returns you to the viewpoint that was current when you last
clicked Set Home.
Set Home makes your current location the Home location. Clicking the Home button
returns you to the last location where you clicked Set Home.
View All lets you view the entire graphic display, without changing the angle of
view you had before clicking on this option. To get an overhead view of the scene,
rotate the camera so that you are looking directly down on the entities, then click
the View All button.
Seek takes you to the point or object you click after selecting this button.
Perspective is a toggle button that lets you view the scene in 3D perspective (closer objects appear larger, farther objects appear smaller). Clicking this button again turns 3D perspective off. If Perspective is off, the Dolly thumbwheel becomes the Zoom thumbwheel.
Top View lets you view the scene from the top.
Front View lets you view the scene from the front.
Right View lets you view the scene from the right side.
Height-Adjust Slider and Label
To the left of the Map Visualizer's main window is a vertical height-adjust slider and,
below it, a label containing a numeric value between 0.1 and 100. This slider lets you
change the absolute heights of all the graphical objects in the main window. Moving the
slider up increases the heights of the objects; moving it down decreases their heights. The
numeric value in the label changes accordingly. This value indicates the height
multiplier, the default value of which is 1.0. The height adjust slider is useful for
accentuating relative height differences between objects in the view window.
Thumbwheels
Three thumbwheels appear around the lower part of the main window border (see
Figure 6-10). They let you dynamically move the viewpoint.
Figure 6-10 Lower Half of Window With Thumbwheels
The vertical thumbwheel Rotx (rotate about the x axis), on the left, rotates the
display up and down.
The horizontal thumbwheel Roty (rotate about the y axis), at the bottom left, rotates
the scene in the main window around its centerpoint left and right.
The vertical Dolly thumbwheel, on the right, moves the viewpoint forward and
backward. Note that as you use the Dolly thumbwheel to magnify the scene in the
main window, additional detail can appear. This is not the case with the Zoom thumbwheel, which merely enlarges the scene without adding detail.
The Animation Control Panel
To the right of the Map Visualizer's main window are several external controls, depending on the type of data being displayed (see Figure 6-11). These controls can include
sliders for independent dimensions
a summary window containing a color density profile
a color legend showing the color density value limits
buttons and sliders for animation
Figure 6-11 Map Visualizer's Summary Window With Slider and Animation Controls
Sliders Controlling Independent Dimensions
The number of sliders appearing adjacent to the summary window is dependent on the
dataset displayed in the Map Visualizer's main window. Datasets can have two, one, or
no independent dimensions.
Datasets With Two Independent Dimensions
If the dataset has two dimensions of independently varying data (such as
nl.births.mapviz), the animation control panel to the right of the main graphics window
becomes visible (as in Figure 6-11).
Within this animation control panel are the 2D summary window and two sliders. The summary window has a horizontal slider below it for selecting data points of the first independent dimension, and a vertical slider to the left for selecting data points of the second independent dimension. The horizontal slider's dimension is identified by a label below it. The vertical slider's dimension is identified by a label above it.
Datasets With One Independent Dimension
For datasets with one independent dimension (such as population.usa.mapviz), only the slider below the summary window appears, and the summary window is compressed (see Figure 6-12). This slider's dimension is identified by a label below it.
Figure 6-12 Map Visualizer's Summary Window With One Slider and Animation Controls
Datasets With No Independent Dimension
For datasets with no independent dimensions (such as population.europe.mapviz), no
animation control panel appears (see Figure 6-13).
Figure 6-13 If There Are No Independent Dimensions, No Animation Control Panel Appears
The Summary Window
The summary window provides a 2D representation of the aggregation of values that the
main window displays in 3D. Above this window is a label, Sum Heights, followed by
two rectangles: the first white, the second red. Within the rectangles are numbers; each is
the respective value for the maximum density of that color. This summary color legend
provides a visual and numeric comparison to the densities in the summary window.
The whiter the areas of the summary window, the lower the total values represented by
the heights of the objects in the main window. The greater the density of red shown in
areas of the summary window, the higher the total of those values. The density of these
colors in the summary window provides a summary of the data across the one or two
independent dimensions in the dataset, which is useful for guiding your exploration
through the data.
By default, the summary window also contains a set of black dots, evenly spaced across
the one or two dimensions of data. These dots indicate the precise positions of the
discrete datapoints of the data. You can turn off the dots using the View > Show Data
Points menu option.
Color Density Examples in the Summary Window
After opening the population.usa.mapviz file, for example, the 2D summary window
shows a color range from white (on the left) to red (on the right). White corresponds to
the low aggregate population in the early years of the United States; red represents the
higher aggregate population in later years. In this example, the greater the density of red,
the higher the total population of United States.
For a more complex example, open perhouse.perage.mapviz. This dataset has two
independent dimensions: time and age. The summary window displays these
dimensions as a complex pattern of colors. Place the cursor on the horizontal line with the greatest density of red, which runs across the summary window (this line represents the age group making the greatest number of purchases). Click the left mouse button. The information displayed in the field below the horizontal slider shows that this represents purchases made by 30- to 39-year-olds.
Now place the cursor at the junction of the densest red horizontal (age group) and
vertical (time frame) parts of the summary window, and click the left mouse button. The
information displayed in the field below the horizontal slider shows that most purchases
were made by 30- to 39-year-olds in May-June 1989 and May-June 1990.
Creating a Path in the Summary Window
If the dataset loaded into the Map Visualizer has at least one independent dimension, it
is possible to view all or any part of that dataset via animation. This is done by rst
creating a path in the summary window, then activating the animation controls
described in the next section.
The three ways to draw a path in the summary window are as follows:
Define a starting point by clicking and holding down the left mouse button, then draw a path by dragging the cursor over the window. The actual path passes through the intermediate discrete points closest to the path of the mouse. End the path by releasing the left mouse button.
Define a starting point by clicking the left mouse button, then define an endpoint by moving the cursor to another part of the window and clicking the middle mouse button. A path appears between those two endpoints, passing through the intermediate discrete data point(s) that are closest to the hypothetical straight line between the endpoints. To add more line segments, continue with repeated middle mouse clicks.
Define a starting point by clicking the left mouse button, then drag one of the independent-dimension sliders to draw a straight line along that dimension. If there are two sliders, using the second slider continues the path with a straight line along the axis controlled by that slider.
The path you draw can pass only through the well-defined discrete data points, identified by the black dots in the summary window.
Animation Buttons and Sliders
Use the seven VCR-like buttons and two sliders (Path and Speed) below the 2D summary
window to control animation.
Animation Buttons
Once a path is drawn in the summary window (see "Creating a Path in the Summary Window," above), you can use the VCR-like buttons to control animation along this path.
The middle Stop button is highlighted in blue to indicate an initial state. Use the adjacent
Play Forward button (to the right of Stop) or Play Reverse (to the left) to begin simple
movement along the drawn path in a forward or reverse direction. Forward and Reverse
are defined by the sequence in which the path was drawn, not by a sense of left-to-right
or right-to-left movement.
To stop and restart the animation, click the Stop button, then use the Play Forward or
Reverse button. When you use the Stop button, the animation continues in the current
direction until the position falls on a discrete data point.
Adjacent to the Play buttons are the Single-Step buttons, also Forward and Reverse.
Clicking one of these buttons causes the current path position to change to the next
discrete data point.
On the outside are the Fast Forward and Fast Reverse buttons. Clicking one of these Fast
buttons while in Stop state changes the path position to the end (for Forward) or to the
beginning (for Reverse) of the path. Clicking a Fast button when in Play state increases the
animation speed.
Animation Flow
Below the Animation Buttons are the three Animation Flow buttons.
Play-once (default): the animation moves either forward or in reverse until it reaches the end of the path, then stops.
Loop: when the animation reaches the end of the path, it automatically resets to the beginning and starts over again.
Swing: when the animation reaches the end of the path, it reverses direction and retraces its path to the other end; upon reaching that end, the animation reverses direction again, beginning the cycle anew.
Animation Sliders
While animation is stopped, you can move the Path slider to reset the position along the
path. Note that when you use the Path slider, the cursor in the summary window moves
across the drawn path, and the 1D sliders (below and to the left of the drawing area)
move consistently with the cursor position. Then use the Play or Reverse button to restart
the animation from the newly specified point.
You can drag the Path slider to an arbitrary position on the path between discrete data
points; however, when you release the slider, the path position changes to a stop at the
nearest discrete data point.
Use the Speed slider to adjust the speed of the animation along the path.
Data Points and Interpolation
As animation proceeds, the variables mapped to height and color in the Map Visualizer
also change. However, the variables displayed in the Selection: message box show only
the data values of the nearest discrete data position, not intermediate (interpolated) data
values.
The animation is produced in the following manner: Assume you have data for 10 years,
on a per-year basis (that is, 10 data values) and that these correspond to the height of one
state in the Map Visualizer. The years are 1991 to 2000, the height for 1991 is 20, and the
height for 1992 is 40. As you move the year slider from 1991 to 1992, the height changes
by being uniformly interpolated between 20 and 40. For example, midway between 1991
and 1992, the height appears to be 30. As you approach 1992, the height approaches 40.
However, you cannot stop an animation between discrete data points, and you cannot
drag the Path slider to a stationary position between discrete data points.
The data points in the summary window represent the slider positions corresponding to
the actual data from the data file. For example, the heights 20 and 40 are representations
of actual data, but the height 30 is not. In this example, there would be data points in the
summary window at the slider positions corresponding to each year.
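As a minimal sketch of the uniform interpolation just described (illustrative only, not MineSet code; the function name is hypothetical), the displayed height between two discrete slider positions can be computed as follows:

def interpolated_height(t, t0, t1, h0, h1):
    # Linear interpolation between the heights at two consecutive
    # discrete slider positions t0 and t1.
    fraction = (t - t0) / float(t1 - t0)
    return h0 + fraction * (h1 - h0)

# Midway between 1991 (height 20) and 1992 (height 40):
print(interpolated_height(1991.5, 1991, 1992, 20, 40))   # prints 30.0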
Note that not all variables are required to vary with a slider. For example, in the Map
Visualizer, the area and name of the state do not vary with the slider (for example, year).
If there are two sliders, some variables can vary with only one of the sliders, while other
variables vary with both.
Pulldown Menus
Five pulldown menus let you access additional Map Visualizer functions. These are
labeled File, View, Selections, InterTool, and Help. If you start the Map Visualizer
without specifying a configuration file, only the File and the Help menus are available.
The View menu is available after a valid dataset is loaded.
The File Menu
The File menu is the same for all visualization tools; see "The File Menu" in Chapter 5.
The View Menu
The View menu (Figure 6-14) contains six options, described below.
Figure 6-14 Map Visualizer's View Pulldown Menu
Filter Panel brings up a filter panel (Figure 6-15), which lets you reduce the number of entities displayed in the main viewing area, based on one or more criteria. You can use the filter panel to fine-tune the display, emphasize specific information, or simply shrink the amount of information displayed. Scale to Filter lets you specify whether the heights of the graphical objects are scaled across the entire dataset or just across the filtered data.
Figure 6-15 Map Visualizer Filter Panel
The filter panel has two panes. The top pane lets you filter based on string variables. To select all values of a variable, click Set All. To clear the current selections, click Clear. To select a value, click it. To deselect a value, simply click it again.
The bottom pane lets you filter based on the values of both string and numeric variables. Only variables whose values do not change as you navigate the slider can be used in filtering.
To filter numeric values, enter the value, and select a relational operation (=, !=, >, <, >=, <=). To filter alphanumeric values, enter the string. You can use any of three types of string comparisons:
Contains indicates that the value contains the specified string. For example, California contains the strings Cal and forn.
Equals requires the strings to match exactly.
Matches allows wildcards:
An asterisk (*) represents any number of characters.
A question mark (?) represents one character.
Square brackets ([ ]) enclose a list of characters to match.
For example, California matches Cal*, Cal?fornia, and Cal[a-z]fornia.
In some cases (usually associated with binning in the Tool Manager), an option menu of values appears instead of a text field. To ignore that variable, select Ignored in the option menu. You can use relational operators (such as >=) with these options. This means that the specified value as well as subsequent ones are selected.
In addition to numeric and string comparison operations, you can specify Is Null, which is true if the value is null.
To the right of each field is an additional option menu that lets you specify And or Or options. For example, you could specify sales > 20 And < 40. You can have any number of And or Or clauses for a given variable, but cannot mix And and Or in a single variable.
Click the Apply button to start filtering. If you press Enter while the panel is active, filtering starts automatically.
Click the Close button to close the panel.
Show Window Decoration causes the buttons around the main window to be displayed. The default for this option is on. Toggle this option to make the window decoration disappear.
Show Animation Panel causes the animation control panel to be displayed to the right of the main view. Click this option again to deselect it. When this option is deselected, the animation panel is not displayed. Hiding the animation panel can be useful when you have applied the InterTool menu's Synchronize All Mapviz Sliders option (described in "The InterTool Menu" on page 209) and need only a single animation control panel on the screen.
Show Data Points causes a grid of black dots to appear (or disappear) in the 2D summary window. Each dot denotes the precise position of a discrete data value in the input dataset. For example, if the input dataset has 10 data values across one independent dimension, then as the animation moves between those positions, the heights and colors of the graphical objects in the main window vary continuously, based on values interpolated between the discrete data points. The data point dots in the summary window help you better understand when the heights and colors are derived directly from the input data values, and when they are derived indirectly from interpolated values.
Use Random Colors causes the configuration file's color mapping specifications (for example, white-to-red shadings representing population density) to be ignored. Random, constant colors are assigned to the graphical objects. Click this option again to deselect it.
Display X-Y Coordinates puts the Map Visualizer into a special mode that lets you identify X-Y vertex pairs at specific points of the scene in the main window. In this mode, the Map Visualizer resets the cursor to select mode and displays 3D objects as flat background lines. Clicking the left mouse button on various parts of the displayed scene causes the corresponding X-Y vertex pair values to appear in the Selection Details window. You can also enter the vertex pair points into the .gfx file to identify point objects or the endpoints of line objects for subsequent display. Note that displaying X-Y coordinates is used for developing and refining .gfx files, not for data analysis.
When Display X-Y Coordinates mode is initially enabled, or when a point in the background is selected, the selection window shows the minimum and maximum X-Y pairs of the currently displayed image in the main window. Add these two value pairs to the new .gfx file you are generating. The first record in the file gfx_files/usa.cities.gfx shows an example of how the min-max pairs of the usa.states.gfx file were entered into the associated usa.cities.gfx file. This ensures that the X-Y coordinate pairs in usa.cities.gfx share the same coordinate system as the X-Y coordinate pairs in usa.states.gfx.
The Selections Menu
The Selections menu lets you drill through to the underlying data. The menu has six items.
Figure 6-16 Map Visualizer Selections Menu
Select All performs the equivalent of selecting (with the mouse pointer) all the
visible graphical objects in the current scene.
Show Values displays a table (Record Viewer) of the values for all selected objects.
Show Original Data retrieves and displays the records corresponding to what has
been selected. The resulting records are shown in a table viewer.
Send To Tool Manager inserts a filter operation, based on the current box selection(s), at the beginning of the Tool Manager history. The actual expression used to do the drill-through is determined by the extents of the current box selection(s). If nothing is
selected, a warning message appears.
Use Slider On Drill Through determines whether or not to use the slider position
when creating the drill-through expression. If checked (default), an additional term
is added to the drill-through expression, limiting the drill-through to those records defined by the slider's position. If this option is not checked, no such limiting term
is added.
Complementary Drill Through causes the Show Original Data and Send To Tool Manager
selections, when used, to fetch all the data that are not selected.
For further details on drill-through, see Chapter 18, "Selection and Drill-Through."
The InterTool Menu
The InterTool menu has one option, as shown in Figure 6-17.
Figure 6-17 Map Visualizer's InterTool Pulldown Menu
Selecting Synchronize All Mapviz Sliders identifies this Map Visualizer window as one in a synchronized-sliders cooperative: changing the current slider positions in one Map Visualizer window produces the same change in all others currently open. Click this option again to deselect it. This menu option must be selected in every Mapviz main window that is to be part of the synchronization.
Note that currently only the sliders' physical positions are synchronized, not the underlying meanings of those positions. For example, synchronizing population.usa.mapviz (with dates ranging from 1770 to 1990) and population.canada.mapviz (with dates ranging from 1871 to 1991) probably is not useful, since the slider's physical midpoint position represents 1880 in the United States and 1931 in Canada. Generally,
synchronization is useful only when the sliders of each dataset represent the same range
of independent variables.
The Help Menu
The Help menu is the same for all visualization tools; see "The Help Menu" in Chapter 5.
Null Handling in the Map Visualizer
Nulls represent unknown data (see Appendix J, "Nulls in MineSet").
In the Map Visualizer, nulls can occur when any of the following is true:
The database or data file contains a null.
The Tool Manager is used to make an array based on bins and no data falls into a
specific bin. For example, if there is no data for the 30-40-year-old population, that
bin is null.
The Tool Manager is used to make an array and the null enum option is specified. In
this case, an extra array element is created to represent the aggregation of all the
values for which the bin value is null. The Tool Manager assigns the question mark
(?) character to this extra bin. To view the values of this bin, move the corresponding
slider to its left-most position. If there are no data for that null bin, the values
associated with it are null as well, and the Map Visualizer represents the
corresponding graphical object(s) as a null object.
Expressions and aggregations of nulls can generate nulls (see Appendix J, "Nulls in MineSet").
The Map Visualizer uses special representations when a null value is mapped to a visual attribute. A null height results in a dark gray object with zero height; a null color results in an object with appropriate height (as defined by the value mapped to height), but with a dark gray color (see Figure 6-18).
Figure 6-18 Representation of a Null Value Mapped to Height (Top Middle Object) and to
Color (Bottom Right Object)
When selecting an object with a null value, a question mark (?) is shown in the
selection field.
Sample Configuration and Data Files
The provided sample configuration and data files demonstrate the Map Visualizer's features and capabilities. The .data and .mapviz files are in the directory /usr/lib/MineSet/mapviz/examples; the .gfx and .hierarchy files are in the directory /usr/lib/MineSet/mapviz/gfx_files.
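To browse one of these samples from a UNIX shell, you can pass its configuration file to the tool on the command line; assuming the Map Visualizer executable is invoked as mapviz (an assumption; you can also simply double-click the configuration file's icon), the command resembles:
mapviz /usr/lib/MineSet/mapviz/examples/population.usa.mapviz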
blocks.mapviz, blocks.data, blocks.gfx, and blocks.hierarchy
This simple example shows four adjacent blocks. The height and color of each block vary based on the underlying data in blocks.data. You can drill up using the middle mouse button (see the "Select Mode" section) to see the upper pair and the lower pair of blocks aggregate; then drill up again to see these upper and lower blocks aggregate into a single block. You can drill down using the right mouse button to see the objects of finer granularity reappear.
population.australia.mapviz, population.australia.data, australia.states.gfx, and australia.states.hierarchy
The data file contains one row for each Australian state and territory. Each row
contains three tab-separated items: a keyword name for the state or territory, the
population value, and the size of the territory.
This sample graphically displays the 1991 population and population density of the
Australian states and territories. Heights of the graphical objects represent the
relative population; color represents the relative population density. A legend at the
bottom of the display describes the color range and the associated values.
population.canada.mapviz, population.canada.data, canada.provinces.gfx, and canada.provinces.hierarchy
The data file contains one row for each Canadian province and territory. In this
example, each row contains 13 blank-separated values (one for each decade
between 1871 and 1991).
This sample graphically displays the population and population density of the
Canadian provinces and territories from 1871 to 1991, in 10-year increments. The
animation control panel lets you dynamically view the datasets across a range of
time. Animation operation is explained in "Sliders Controlling Independent Dimensions" on page 197.
population.europe.mapviz, population.europe.data, europe.countries.hierarchy, and europe.countries.gfx
When graphically displayed, this shows the 1992 population and population
density of countries in Western and Central Europe.
population.usa.mapviz, population.usa.data, usa.states.gfx, and usa.states.hierarchy
When graphically displayed, this shows the population and population density of
the United States from 1770 to 1990. The animation controls let you dynamically
view population and density changes across time.
population.usa.cities.mapviz, population.usa.cities.data, usa.states.gfx, usa.states.hierarchy, usa.cities.gfx, and usa.cities.hierarchy
The usa.states.gfx file specifies the United States, which is displayed as a background. The usa.cities.gfx file specifies the location of the cities on this background. The .data file specifies the population of each city.
This sample graphically displays the population of the 48 largest U.S. cities from
1950 to 1990. No data has been mapped to the colors. The animation controls let you
dynamically view changes across time.
perhouse.perage.mapviz, perhouse.perage.data, usa.states.gfx, and usa.states.hierarchy
This sample graphically displays consumer household spending data from
July-August 1988 to May-June 1991. Color is mapped to the gender of the spending
household member; height represents the average dollar amount spent per
household for a given time period and age group. This data has two independent
dimensions: time and age. The highest spending is indicated in the summary window (see "The Summary Window" on page 199) by the areas with the greatest
color density, namely May-June 1989 (Age: 30-39) and May-June 1990 (Age:
30-39).
telecom.mapviz, telecom.data, usa.cities.lines.gfx, usa.cities.lines.hierarchy, usa.states.gfx, and usa.states.hierarchy
This sample graphically displays a flat map with arched lines on it. These lines
connect two endpoints. The lines can have variable width and color. In this
example, the widths and colors are random; however, they could relate to the
volume and duration of the connections between the endpoints.
fasta.m.data, fasta.m.mapviz, fasta.m.gfx, and fasta.m.hierarchy
The data file for this example contains the partial results of a full biological sequence comparison between two complete genomes (courtesy of Dr. Tom Flores, European Bioinformatics Institute). When the data is graphically displayed, scientists can
quickly identify and locate the regions of similarity between the two genomes. The
ability to display such large amounts of information in a visual data exploration
method such as this could be extended to include much more information about the
individual genomes. Scientists could explore this data more easily and thereby
perhaps better understand the function and purpose of the similar genetic
sequences.
In this example, the map is the circular-shaped genome of a biological organism
called Mycoplasma genitalium (MG). The MG genome is divided into 500 equal
segments, each representing a 1000-nucleotide sequence in the genome. The slider
selects one of the segments of the second genome, called Haemophilus influenzae
(HI), for cross-comparison between the two genomes. The Summary Window in the
Animation Control Panel indicates which segments show the greatest similarities,
and you can move the slider to examine those particular segments of interest. The
bar heights and colors on the map therefore indicate the relative similarity of each
MG segment to each HI segment, where higher bars correspond to greater measures
of similarity. This similarity is measured by the Reciprocal E-values, which range from 0.0 to 1.0.
Chapter 7
7. Using the Scatter Visualizer
This chapter discusses the features and capabilities of the Scatter Visualizer. It provides an overview of this database visualization tool, then explains the Scatter Visualizer's functionality when working with the
main window
external controls
pulldown menus
Finally, it lists and describes the sample files provided for this tool.
Overview of Scatter Visualizer
The Scatter Visualizer lets you visually analyze relationships among several variables
(see Figure 7-1), either statically or by animation. It is particularly useful for seeing
individual data points when you do not have a large number of records. If your dataset has a very large number of records, consider using the Splat Visualizer. Analysis in the
Scatter Visualizer is done using
a three-dimensional landscape
an animation control panel that includes a two-dimensional slider
graphical objects, called entities, that can be animated in the three-dimensional
landscape
Figure 7-1 Sample Scatter Visualizer Screen
The Scatter Visualizer lets you visualize your data by mapping each record, or row, in the
dataset to an entity in the three-dimensional landscape. Variables in the data can be
mapped to the sizes, colors, and positions of the entities. Also, you can map one or two
numeric variables to the sliders in the animation control panel. If the variables mapped
to sizes, colors, or positions of the entities depend on the variables mapped to sliders, the
sliders can be used to drive an animation. For example, the data might represent the sales
of several companies over time. If the time variable is mapped to a slider and the sales
variable is mapped to size, then the entities grow or shrink as the time slider is animated.
After you create a visualization of your data, the Scatter Visualizer lets you analyze the
data in various ways. The animation control panel lets you trace animation paths in one
or two dimensions. By playing back the path you created, you can watch the size, color,
and motion of the entities for trends or anomalies. In the three-dimensional landscape,
you can orient the display to emphasize particular dimensions or a point of view. The
Scatter Visualizer lets you scale the values of variables to give them greater emphasis.
Also, you can filter the display to show only those entities meeting certain criteria.
File Requirements
The Scatter Visualizer requires the following files:
A data file, consisting of rows of tab-separated fields. This file is easily created using the Tool Manager (see Chapter 3). If you are generating this file yourself, see Appendix D, "Creating Data and Configuration Files for the Scatter Visualizer," for the required file format. (A brief illustrative example appears after this list.)
You can generate data files by extracting data from a source (such as a database) and formatting it specifically for use by the Scatter Visualizer. Data files have user-defined extensions (the sample files provided with the Scatter Visualizer have a .data extension).
A configuration file, describing the format of the input data and how it is to be displayed. The Tool Manager can create this file (see Chapter 3), or you can use an editor (such as jot, vi, or Emacs) to produce this file yourself (see Appendix D, "Creating Data and Configuration Files for the Scatter Visualizer").
Configuration files must have a .scatterviz extension. When starting the Scatter Visualizer, or when opening a file, you must specify the configuration file, not the data file.
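As an illustration only (the fields and values below are hypothetical), a data file is simply one record per line, with the fields separated by single tab characters (the whitespace shown here stands for tabs); the accompanying .schema and .scatterviz files, normally generated by the Tool Manager, define how those fields are named and interpreted:
west    1996    152000.0
east    1996    187500.0
west    1997    161250.0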
Options for Invoking the Scatter Visualizer
There are two options that affect how this tool is invoked:
-warnexecute indicates that if you attempt to execute a command specified in an execute statement, a warning is displayed and you are given the option to execute the command or not. This is intended for an insecure environment, such as files obtained from the Web, and is used automatically when commands are executed via mtr files.
You can enable this option permanently by adding the line
*minesetWarnExecute:TRUE
to your .Xdefaults file, or by setting the environment variable MINESET_WARN_EXECUTE.
-quiet eliminates the dialogs that pop up to indicate progress. You can enable this option permanently by adding the line
*minesetQuiet:TRUE
to your .Xdefaults file.
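For example, to enable both options permanently, you could add these two lines to the .Xdefaults file in your home directory:
*minesetWarnExecute:TRUE
*minesetQuiet:TRUE
Alternatively, for -warnexecute you can set the environment variable from the shell; the csh syntax and the value 1 shown here are assumptions (the manual states only that MINESET_WARN_EXECUTE must be set):
setenv MINESET_WARN_EXECUTE 1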
Starting the Scatter Visualizer
There are five ways to start the Scatter Visualizer:
Use the Tool Manager to configure and start the Scatter Visualizer. (See Chapter 3 for details on most of the Tool Manager's functionality, which is common to all MineSet tools; see "Configuring the Scatter Visualizer Using the Tool Manager" on page 220 for details about using the Tool Manager in conjunction with the Scatter Visualizer.)
Double-click the Scatter Visualizer icon, which is in the MineSet page of the icon catalog. The icon is labeled scatterviz. Since no configuration file is specified, the start-up screen requires you to select one by using File > Open.
Figure 7-2 Scatter Visualizer Start-Up File Pulldown Menu Selected
Starting the Scatter Visualizer without specifying a configuration file causes the main window to show the copyright notice and license agreement for this tool. Only the File and Help pulldown menus can be used. For the main window to be fully functional, open a configuration file by selecting File > Open.
If you know which configuration file you want to use, double-click the icon for that configuration file. This starts the Scatter Visualizer and automatically loads the configuration file you specified. This works only if the configuration filename ends in .scatterviz (which is always the case for configuration files created for the Scatter Visualizer using the Tool Manager).
Drag the configuration file icon onto the Scatter Visualizer icon. This starts the Scatter Visualizer and automatically loads the configuration file you specified.
Start the Scatter Visualizer from the UNIX shell command line by entering this command at the prompt:
scatterviz [ configFile ]
configFile is optional and specifies the name of the configuration file to use. If you don't specify a configuration file, you must use File > Open to specify one (see Figure 7-2).
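For example, assuming the Scatter Visualizer sample files are installed in /usr/lib/MineSet/scatterviz/examples (a path analogous to the Map Visualizer samples; verify the location on your system), you could open a sample configuration file directly:
scatterviz /usr/lib/MineSet/scatterviz/examples/company.scatterviz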
Configuring the Scatter Visualizer Using the Tool Manager
This section describes how the Scatter Visualizer can be configured using the Tool Manager. Although the Tool Manager greatly simplifies the task of configuring the Scatter Visualizer, you can construct a configuration file manually for this tool using a text editor (see Appendix D, "Creating Data and Configuration Files for the Scatter Visualizer").
The steps required to connect to a data source are described in Chapter 3.
Selecting the Scatter Visualizer Tool
Select the Viz Tools tab in the Data Destination panel of the Tool Manager's main screen
(Figure 7-3). From the popup list of tools, select Scatter Visualizer. The mapping
requirements for the Scatter Visualizer are displayed in the window on the right side of
this panel. Items in the Visual Elements list that are preceded by an asterisk are optional.
Figure 7-3 Data Destination Panel With Scatter Visualizer Selected
Axis 1, *Axis 2, *Axis 3 let you assign to the axes in the Scatter Visualizer's main window the data you want represented. Assigning data to Axis 1 is required. However, this alone does not produce a useful display. By assigning data to Axis 2, you can create an XY chart. Assigning data to all three axes produces a 3D chart.
*Entity-size, *Entity-color, *Entity-label let you assign size, color, and label to the entities appearing in the Scatter Visualizer's main window.
*Summary is the value mapped to the summary column, if you have a slider. It determines the color of the slider's background.
*Slider1 and *Slider2 let you map columns directly to one or two animation sliders (see "Slider Creation for Scatterviz," below).
Mapping Requirements to Columns
You can map requirements to columns by selecting a column name in the Current
Columns window of the Table Processing panel, then selecting a category in the Visual
Elements window.
Undoing Mappings
To undo a specic mapping, select that mapping in the Visual Elements window, then
click the Clear Selected button. To undo all mappings, click the Clear All button.
Slider Creation for Scatterviz
Sliders can be created manually or automatically. The following subsections describe
these methods.
Manual Slider Creation
Tool Manager generates sliders whenever there is an array column present in the current table. The sliders correspond to the indices of the array columns. If the column has one index (a one-dimensional array), only one slider is created; if the column has two indices (a two-dimensional array), both an X and a Y slider are created. The current slider indices are indicated in the Tool Manager's Tool Options dialog box. Array columns can be created using the "index by" menus in the Tool Manager aggregation panel (see "Aggregation" in Chapter 3).
Note that for a slider to be created, all array columns in the current table must have the
same indices. If array columns with differing indices exist in the current table, no sliders
are created.
Automatic Slider Creation
If no arrayed columns are in the current table, Tool Manager can automatically generate
sliders by use of the Slider1 and Slider2 mappings. Sliders are created through a
combination of automatic binning and aggregation. These automatic operations occur
after clicking Invoke Tool in the Data Destination Panel. The operations do not affect the
current history operations of Tool Manager, but they do appear in the configuration files for the tool.
Columns mapped to Slider1 and Slider2 eventually form the indices for the sliders. These
columns must be either numeric (int, float, double) or binned. If a column mapped to a
slider is already binned, no automatic binning is needed for this column, and this column
is used as an index for a slider. However, if the column is not binned, a binned column is
created using the automatic binning options in the Tool Options dialog box.
The three methods of binning are:
Selecting All Distinct Values, which creates a bin for every unique value of the column.
Specifying the number of bins you want to create. The thresholds for the bins are determined using the Uniform Range approach.
Selecting Automatic, which automatically determines the number of bins to create and determines the bin thresholds using the Uniform Range approach.
(See "The Bin Columns Button" in Chapter 3 for more information about binning.) The column used in forming the automatic bins is deleted from the current table.
The binned columns now form the indices of array columns. Note that if you want to
create only one slider, the index must be mapped to Slider1. Attempting to create only
one slider with a mapping to Slider2 is not allowed and generates a Tool Manager error.
Also, a column mapped to a slider cannot be mapped to any other mapping, since it is
removed during the aggregation process.
Once the slider indices are formed, the arrayed columns are created. This is done using
automatic aggregation. Any numeric columns mapped to Axis 1, Axis 2, Axis 3,
Entity-size, Entity-color, Entity-label, or Summary are aggregated using the automatic
aggregation options in the Tool Options dialog box. You can either specify aggregating
by Sum or by Average. The binned columns created from the slider mappings form the
indices for the aggregation, and any remaining columns in the table are Group-By
columns. (See "Aggregation" in Chapter 3 for a description of the aggregation process.)
Be sure to remove any columns you do not wish to use in the grouping process. If you
need different types of aggregates for different mappings, you must aggregate manually.
The aggregation step automatically forms the arrayed columns used for sliders. These
arrayed columns form the new tool mappings. For example, if the column mpg is mapped to Axis 1, a new column avg_mpg[] is formed and remapped to Axis 1. The
progress of the automatic slider generation is displayed in the Tool Manager status
window.
Specifying Tool Options
Clicking the Tool Options button causes a new dialog box to be displayed (Figure 7-4).
This lets you change some of the Scatter Visualizer options from their default values.
Figure 7-4 Scatter Visualizer's Options Dialog Box
The Scatter Visualizer's Options dialog box has four basic options blocks:
Entities
Sliders
Axes
Other
Entity Options
This option block lets you specify a number of characteristics for the entities that the
Scatter Visualizer then graphically displays.
Entity Legend On lets you determine whether the entity legend is displayed or hidden.
Entity Size lets you scale the entity to a max size, a scale size, or a default (no adjustment). You also can specify whether the legend for entity size is displayed or hidden.
Entity Colors lets you control the colors in which entities are displayed. You can
specify the list of colors to use
specify the kind of mapping
map the list of colors to a list of values
specify whether the legend for color is displayed or hidden
map colors to entities
Entity Shape lets you choose a visual representation for the entities: cubes, bars, or diamonds.
To use these Colors options, you must have mapped a column to the *Entity-color requirement of the Data Destination panel. See "Choosing Colors" and "Using the Color Browser" in Chapter 3 for a more detailed explanation of how to choose and change colors.
Color list to use lets you specify the color list using the + button next to the color list label.
This brings up a color editor that lets you specify a color to be added to the list.
Color mapping lets you specify whether the color change that is shown in the graphic display is Continuous or Discrete. If you choose Continuous, the color values shift gradually between the colors entered in the Color list to use field, as a function of the values that are mapped to those colors in the Color mapping field.
The field to the right of the popup button lets you enter specific values for mapping the colors. If you do not specify any mapping values, the range of values in the color variable is used.
Example 7-1
If you
used the Color Browser to apply red and green to bars
selected Continuous for the Kind of mapping
entered the values 0 100
then the display shows all entities with values less than or equal to 0 as completely red, those with values greater than or equal to 100 as completely green, and those between 0 and 100 as shadings from red to green.
Example 7-2
If you
used the Color Browser to apply red and green to entities
selected Discrete for the Kind of mapping
entered the values 0 50
then the display shows all entities with values of less than 50 in red, and all those with
values greater than or equal to 50 in green.
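The behavior in these two examples can be summarized by the following sketch (illustrative only, not the Scatter Visualizer's implementation; the function names and RGB triples are hypothetical):

def continuous_color(value, low=0.0, high=100.0):
    # Blend from red (1, 0, 0) at or below `low` to green (0, 1, 0)
    # at or above `high`, as in Example 7-1.
    f = min(max((value - low) / (high - low), 0.0), 1.0)
    return (1.0 - f, f, 0.0)            # (red, green, blue)

def discrete_color(value, threshold=50.0):
    # Red below the threshold, green at or above it, as in Example 7-2.
    return (1.0, 0.0, 0.0) if value < threshold else (0.0, 1.0, 0.0)

print(continuous_color(50.0))   # halfway between red and green: (0.5, 0.5, 0.0)
print(discrete_color(49.9))     # red: (1.0, 0.0, 0.0)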
Entity Label Color lets you modify a label color by clicking on it. This causes the Color Chooser dialog box to appear, which lets you implement your color changes.
Entity Label Size controls the size of the entity labels. A smaller number decreases
the size, a larger one increases it.
Summary Options
Summary options let you specify what color to use for the Summary window. You can
also specify whether the summary legend, which indicates what the values are, is
displayed or hidden.
If you have an array of values, you can specify an X or Y slider. The popup buttons next
to these options provide a list of available keys, and let you specify which to use as
sliders.
Slider Options
The Slider options control how the slider mappings are interpreted. For details, see "Slider Creation for Scatterviz" on page 221.
Axis Options
The Axis options let you specify the following, for each axis:
A label. (If you leave this box blank, the Scatter Visualizer defaults to using the
column names for each axis.)
A color
A size type for each axis. (This can be Max Size, Scale Size, or No Adjustment.)
Max Size lets you specify that an axis is scaled independently to a specified size.
If one axis has a Max Size that is twice as large as the other, it will be twice as
long, regardless of the data values. This option is most useful when comparing
axes that are in different units (for example, comparing income to age). This
option has no effect on non-numeric data.
Scale Size lets you specify that the axis is scaled based on its maximum value. If
two axes have the same Scale Size, but one has a maximum that is twice the
value of the other, the former will be twice as long as the latter. This option is
useful for comparing axes with the same units (for example, income vs.
expenses). This option does affect the size of non-numeric axes.
No Adjust is equivalent to a Scale Size of 1.0.
A size value
Whether the axis should be extended to include the value 0.
Other Options
The Other Options, at the bottom of the dialog box, include the following fields:
Message lets you specify the message displayed when an entity is selected. For a listing and description of format types that can be entered in this field, see the "Message Statement" section in Appendix D, "Creating Data and Configuration Files for the Scatter Visualizer."
Execute lets you type in a UNIX command that is executed when you double-click an entity. The format is similar to that of the message statement. If no execute statement is specified, double-clicking has no effect. For a detailed description of the Execute field, see "Execute Statement" in Appendix D.
Hide Label Distance controls the distance at which entity labels become invisible.
Smaller distances might improve performance, but the labels disappear more
quickly. The higher the number, the greater the distance at which labels are hidden.
Axis Label Size controls the size of the axis labels. A smaller number decreases the
size, a larger one increases it.
Grid (X, Y, Z) Size lets you specify the spacing between grid lines for the respective
axis. A smaller number decreases the size, a larger one increases it.
Grid Color lets you modify a grid color by clicking on it. This causes the Color
Chooser dialog box to appear, which lets you implement your color changes.
Resetting the Tool Options
If you want to reset the values of all options to their default values, click the Reset Options
button.
Saving the New Tool Options
Once you have finished making changes to the Tool Options dialog box, click OK to return to the Tool Manager's main screen.
Invoking the Scatter Visualizer
To see the Scatter Visualizer graphically represent your data, click the Invoke Tool button at the bottom of the Data Destination panel.
Saving the Scatter Visualizer Settings
When you press Invoke Tool, the Tool Manager stores information for the Scatter Visualizer in three files, all sharing the same prefix:
<prefix>.scatterviz.data contains data.
<prefix>.scatterviz.schema describes the data file.
<prefix>.scatterviz contains information needed by the Scatter Visualizer.
To save the entire session along with the current tool options, use one of these menu options from the File menu:
Save Current Session..., where the default prefix is based on the data source
Save Current Session As..., to specify your own prefix
The saved file is <prefix>.mineset, and contains all the information needed to return MineSet to its current state.
Null Handling in the Scatter Visualizer
The Scatter Visualizer uses special representations when fields with unknown data values, or nulls, are mapped to visual attributes. (For a discussion of null values, see Appendix J, "Nulls in MineSet.") When a null value is mapped to an entity's size, the entity is drawn as the outline of a cube. When a null value is mapped to an entity's color, it is drawn in dark gray. When a null value is displayed in the Selection Window or Pointer is Over area, it is shown as a question mark (?). (The Selection Window and Pointer is Over areas are discussed in the "Select Mode" section.)
If a null value is mapped to the x, y, or z position of an entity, the result depends on the Show Entities with Null Positions option under the View menu (see "The View Menu" on page 241). If the option is set, the entity is shown just below the range of the corresponding axis. If the option is not set, the entity is not shown.
Working in the Scatter Visualizer's Main Window
If you started the Scatter Visualizer without specifying a configuration file, the main window shows the copyright notice and license agreement for the Scatter Visualizer. Only the File and Help pulldown menus can be used. For the main window to show all menus and controls, open a configuration file. Use File > Open (Figure 7-2) to see a list of configuration files.
When a valid configuration file has been selected, the 3D landscape it specifies is visible. For example, selecting company-total.scatterviz gives results as shown in Figure 7-5.
Figure 7-5 Initial View When Specifying company.scatterviz
This shows the sales of life insurance, auto insurance, and home insurance with respect
to income brackets over time.
Viewing Modes
The two modes of viewing are grasp and select. To toggle between these modes, press the
Esc key or click the appropriate cursor button adjacent to the top-right of the viewing
area.
Grasp Mode
In grasp mode, the cursor appears as a hand. This mode supports panning, rotating, and
scaling the scene's size in the main window.
To pan the display, press the middle mouse button and drag the mouse in the direction you want the display panned.
To rotate the display, press the left mouse button and move the mouse in the
direction you want to rotate. (Also see the thumbwheel controls Rotx and Roty, described in "Thumbwheels" in Chapter 6.)
To move the viewpoint forward, press the left and middle mouse buttons
simultaneously and move the mouse downwards. To move the viewpoint
backward, press the left and middle mouse buttons simultaneously and move the
mouse upwards. This is equivalent to the functions provided by the Dolly
thumbwheel.
Select Mode
In select mode, you can highlight an object by positioning the cursor over that object.
Information about that object then appears at the top of the view area, under the Pointer
is over: label (Figure 7-6). This information remains visible in the window only as long as
the pointer cursor remains over the object. Position the pointer cursor over an object and
click the left mouse button; the same information appears in the Selection Window, above
the main window. A white box appears around the entity, indicating it has been selected,
and a table viewer shows your current selection. Select several entities by holding down
the Shift key while clicking the left mouse button. The most recent selection is shown
under the Selection label at the top of the scene. All current selections are shown in the
Record Viewer. You can now drill through on your selection (see "The Selection Menu" on page 279 for the different drill-through options).
This Selection information remains visible until another object is selected, or you click the
black background. Using the mouse, you can cut and paste this selection information into
other applications, such as reports or databases.
Figure 7-6 Displayed Information When Cursor is Over a Selected Entity
If an execute statement was specified using the Tool Manager or the configuration file, then double-clicking an object executes the appropriate command. If the -warnexecute option was specified when invoking the Scatter Visualizer, a warning is given first.
Note: Users familiar with Open Inventor can configure the Scatter Visualizer so that the right mouse button brings up the standard Inventor Menu. This provides additional functions, such as stereo viewing and spin animation. These functions are provided by the Open Inventor library. To enable the Open Inventor Menu, add the line
*minesetInventorMenu:TRUE
to your .Xdefaults file.
External Controls
Several external controls surround the main window, including buttons and
thumbwheels. These controls are substantially the same for most MineSet visualization
tools (see the descriptions Buttons in Chapter 6, and Thumbwheels in Chapter 6).
The Animation Control Panel
The animation control panel, which appears to the right of the main window, consists of
a summary window, with up to two adjacent sliders, an information field, animation
buttons, and animation sliders.
Sliders Controlling Independent Dimensions
The number of sliders appearing adjacent to the summary window is dependent on the
dataset displayed in the Scatter Visualizer's main window. Datasets can have two, one,
or no independent dimensions.
Datasets With Two Independent Dimensions
If the dataset has two dimensions of independently varying data (such as
company.scatterviz), the controls to the right of the main graphics window become visible
(see Figure 7-7).
Figure 7-7 Animation Control Panel With Summary Window and Both Slider Controls
To the right of the main window are the summary window and slider controls. The
summary window has a horizontal slider below it for selecting data points of the first
independent dimension, and a vertical slider to the left for selecting data points of the
second independent dimension. The horizontal slider's dimension is identified by a label
below it. The vertical slider's dimension is identified by a label above it.
Datasets With One Independent Dimension
For datasets with one independent dimension (such as store-type.scatterviz), only the
slider below the summary window appears, and the summary window is compressed
(see Figure 7-8). This slider's dimension is identified by a label below it.
Figure 7-8 Animation Control Panel With Summary Window and One Slider Control
Datasets With No Independent Dimension
For datasets with no independent dimensions (such as brand.scatterviz), no slider control
appears (see Figure 7-9).
Figure 7-9 Scatter Visualizer With No Independent Dimension or Animation Control Panel
The Summary Window
The summary window provides a 2D representation of the aggregation of values that the
main window displays in 3D. The whiter the areas of the summary window, the lower
the total values represented by the entities in the main window. The greater the color
density in areas of the summary window, the higher the total of those values. The density
of these colors in the summary window provides a summary of the data across the one
or two independent dimensions in the dataset.
By default, the summary window also contains a set of black dots, evenly spaced across
the one or two dimensions of data. These dots indicate the precise positions of the
discrete datapoints. You can turn off these black dots using the View > Show Data Points
menu option.
Color Density Examples in the Summary Window
After opening the company.scatterviz file, for example, the 2D summary window shows a
color range from white (on the left) to red (on the right). White corresponds to a low sales
volume; red represents a higher aggregate sales volume. In this example, the greater the
density of red, the higher the total sales of life, auto, and home insurance.
Creating a Path in the Summary Window
If the dataset loaded into the Scatter Visualizer has at least one independent dimension,
it is possible to view all or any part of that dataset via animation. This is done by first
creating a path in the summary window (this path connects a sequence of data points),
then activating the animation controls described in the next section.
The three ways to draw a path in the summary window are as follows:
Define a starting point by clicking and holding down the left mouse button, then
draw a path by dragging the cursor over the window. End the path by releasing the
left mouse button.
Define a starting point by clicking the left mouse button, then define an endpoint by
moving the cursor to another part of the window and clicking the middle mouse
button. A line appears between those two points. To add more line segments,
continue with repeated middle mouse clicks.
Define a starting point by clicking the left mouse button, then drag one of the
independent dimension sliders, thus drawing a straight line along this dimension.
If there are two sliders, use of the second slider causes a straight line to be drawn
along the axis controlled by this second slider.
Animation Buttons and Sliders
The seven VCR-like buttons and two sliders (Path and Speed) below the 2D summary
window let you control the animation.
Animation Buttons
Once a path is drawn in the summary window (see Creating a Path in the Summary
Window, above), you can use the VCR-like buttons to control animation along this path.
The middle Stop button is highlighted in blue, indicating an initial state. Use the adjacent
Play Forward button (to the right of Stop) or Play Reverse (to the left) to begin simple
movement along the drawn path in a forward or reverse direction. (Forward and Reverse
are defined by the sequence in which the path was drawn, not by the left-to-right or
right-to-left movement.)
To stop and restart the animation, click the Stop button, then use the Play Forward or
Reverse button again. Note that when you stop, the animation continues in the current
direction until the position falls upon a discrete data point.
Adjacent to the Play buttons are the Single-Step buttons, which also come in Forward
and Reverse versions. Clicking one of these buttons changes the current path position to
the next discrete data point.
On the outside are the Fast Forward and Fast Reverse buttons. Clicking one of these
buttons while in Stop state changes the path position to the end (for Forward) or to the
beginning (for Reverse) of the path. Clicking a Fast button when in Play state increases the
animation speed.
Animation Flow
Below the Animation Buttons are the three Animation Flow buttons.
Play-once (default): the animation moves either forward or in reverse until it reaches the
end of the path, then stops.
Loop: when the animation reaches the end of the path, it automatically resets to the
beginning and starts over again.
Swing: when the animation reaches the end of the path, it reverses direction and retraces
its path to the other end; upon reaching that end, the animation reverses direction again,
beginning the cycle again.
Animation Sliders
While animation is stopped, you can move the Path slider to reset the position along the
path. Note that when you use the Path slider, the cursor in the summary window moves
across the drawn path, and the 1D sliders (below and to the left of the drawing area)
move consistently with the cursor position. Then use the Play or Reverse button to restart
the animation from the newly specified point. You can drag the Path slider to an arbitrary
position between discrete data points; however, when you release the slider, the path
position changes to the nearest discrete data point.
Use the Speed slider to adjust the speed of the animation along the path.
Data Points and Interpolation
As animation proceeds, the variables mapped to size, color, and axes (positions) in the
Scatter Visualizer change smoothly. However, the information displayed in the
Selection: message box and the Pointer is over: field show only the data values of the
nearest discrete data position; they do not show interpolated data values.
The animation is produced in the following manner: Assume you have data for 10 years,
on a per-year basis (that is, 10 data values) and that these correspond to the size of one
entity in the Scatter Visualizer. Assume further that the years are 1991 to 2000, the size for
1991 is 20, and the size for 1992 is 40. As you move the year slider from 1991 to 1992, the
size changes by being uniformly interpolated between 20 and 40. For example, midway
between 1991 and 1992, the size is 30. As you approach 1992, the size approaches 40.
However, you cannot stop an animation between discrete data points, and you cannot
drag the Path slider to a stationary position between discrete data points.
The data points in the summary window represent the slider positions corresponding to
the actual data from the data file. For example, sizes 20 and 40 are representations of
actual data, but size 30 is not. In this example, there would be data points in the summary
window at the slider positions corresponding to each year.
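The following is a minimal sketch of that blending rule in plain Python (illustrative only;
it is not part of MineSet and simply restates the interpolation described above):
# Illustrative only: how a value mapped to size is blended between two
# adjacent slider data points.
def interpolate_size(size_a, size_b, t):
    # t is the fractional slider position between the two data points (0.0 to 1.0)
    return (1 - t) * size_a + t * size_b

# Size 20 in 1991 and 40 in 1992: midway between them (t = 0.5) the displayed size is 30.
print(interpolate_size(20, 40, 0.5))   # prints 30.0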
Note that not all variables are required to vary with a slider. If there are two sliders, some
variables can vary with only one of the sliders, while other variables vary with both.
Pulldown Menus
Four pulldown menus let you access additional Scatter Visualizer functions. These are
labeled File, View, Selections, and Help. If you start the Scatter Visualizer without
specifying a configuration file, only the File and the Help menus are available.
The File Menu
The File menu is substantially the same for all visualization tools; see "The File Menu"
in Chapter 5.
The View Menu
The View menu lets you control certain aspects of what is shown in the Scatter Visualizer
window (Figure 7-10).
Figure 7-10 Scatter Visualizer View Menu
Show Window Decoration lets you hide or show the external controls around the
main window.
Show null Positions lets you hide or show entities that have null or unknown
position values along one or more axes.
Show Animation Panel lets you show or hide the animation control panel. This menu
item is disabled for datasets with no independent dimension.
Show Filter Panel brings up the Filter Panel. This panel (Figure 7-11) lets you
reduce the number of entities displayed in the main viewing area, based on one or
more criteria. You can use the filter panel to fine-tune the display, emphasize
specific information, or simply shrink the amount of information displayed. The Set
Landscape to Filter checkbox, which appears in the lower right of the filter panel,
lets you specify whether the landscape in the main window covers the entire
dataset or just the filtered data.
Set Background Color brings up a color chooser to let you specify a new background
color.
Figure 7-11 Scatter Visualizer Filter Panel
The Filter panel has two panes. The top pane lets you filter based on string columns.
To select all values of a column, click Set All. To clear the current selections, click
Clear. To select a value, click it. To deselect a value, simply click it again.
The bottom pane lets you filter based on the values of both string and numeric
columns.
To filter numeric values, enter the value, and select a relational operation (=, !=, >, <,
>=, <=). To filter alphanumeric values, enter the string. You can use any of three
types of string comparisons:
Contains indicates that it contains the appropriate string. For example,
California contains the strings Cal and forn.
Equals requires the strings to match exactly.
Matches allows wildcards:
An asterisk (*) represents any number of characters.
A question mark (?) represents one character.
Square braces ([ ]) enclose a list of characters to match.
For example, California matches Cal*, Cal?fornia, and Cal[a-z]fornia.
For columns which were binned, an option menu of values appears, instead of a
text field. To ignore that column, select Ignored in the Option menu. You can use
relational operators, such as >=, with these options. This means that the specied
value as well as subsequent ones are selected.
In addition to numeric and string comparison operations, you can specify Is Null,
which is true if the value is null.
To the right of each field is an additional option menu that lets you specify And or
Or options. For example, you could specify sales > 20 And < 40. You can have
any number of And or Or clauses for a given column, but cannot mix And and Or in
a single column.
Scale to Filter lets you specify whether the filtered landscape is rescaled to the size of
the filtered data or remains the size of the entire data set.
Click the Filter button to start filtering. If you press Enter while the panel is active,
filtering starts automatically.
Click the Close button to close the panel.
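The Matches wildcards described above behave like standard glob patterns. The
following sketch uses Python's standard fnmatch module purely to illustrate those
pattern rules (fnmatch is an assumption for the example; it is not part of MineSet):
import fnmatch   # standard-library glob-style matching, used only to illustrate the pattern rules

# Each of these patterns matches "California", mirroring the examples above.
for pattern in ("Cal*", "Cal?fornia", "Cal[a-z]fornia"):
    print(pattern, fnmatch.fnmatchcase("California", pattern))   # prints True for all three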
The Selections Menu
The Selections menu lets you drill through to the underlying data.
Figure 7-12 The Scatter Visualizer Selections Menu
Create Box Selection creates a 3-D box selector that can be stretched and translated to
select regions of the volume. While active, a table in Record Viewer format is
opened showing information about all of the aggregated data that is represented by
the entities within it. Closing this window clears all current selections. Any entities
within the selection box or selected using Shift-click are shown in the table window.
To translate the selection box, click on one of the faces with the left mouse button,
and drag it in the desired direction. Holding the Shift key while dragging constrains
the motion to the axis to which the drag motion is closest. To change the extent of
the selection box, drag one of the gray scale tabs in the desired direction. You cannot
resize or translate the box beyond the bounds of the volume. The gray scale
tabs constantly resize to maintain constant screen size. If at any time they appear
too big, you can zoom in closer, and they reduce their size relative to the box.
Show Original Data retrieves and displays the records corresponding to what has
been selected. The resulting records are shown in a table viewer.
Send To Tool Manager inserts a filter operation, based on the current box selection(s),
at the beginning of the Tool Manager history. The actual expression used to do the
drill through is determined by the extent of the current box selection. If nothing is
selected, a warning message appears.
Complementary Drill Through causes the Show Original Data and Send To Tool Manager
selections, when used, to fetch all the data that are not selected.
Preferences brings up a panel that lets you select which columns are used in
drill-through. Unlike other visual tools, there are no specific columns in the data
that are designated as the key to the data. It is impossible for the Scatter Visualizer
to determine which columns the user desires in the drill-through expression. For
example, you might have cars data with brand, model, and weight. Perhaps you
want to drill through to the original data, and specify that brand and model should
be considered, but weight should not. By default, all columns that have been
mapped to graphical requirements are considered significant on drill-through. The
others are not, but may be made so by highlighting them in the Preferences dialog
box.
For further details on drill-through, see Chapter 18, Selection and Drill-Through.
The Help Menu
The Help menu is substantially the same for all visualization tools; see "The Help Menu"
in Chapter 6.
Sample Configuration and Data Files
The provided sample data and configuration files demonstrate the Scatter Visualizer's
features and capabilities. The following files are in the /usr/lib/MineSet/scatterviz/examples
directory:
company.data
This file contains fictitious sales data of several insurance companies in three
product categories: life insurance, auto insurance, and home insurance. The data
span ten years (in increments of one year) and include five income brackets (the
customers' annual income).
company.scatterviz
This file specifies that the years form one slider dimension and the income brackets
form the other slider. Sales of life insurance, auto insurance, and home insurance
become the three dimensions in the Scatter Visualizer landscape. The color density
in the slider summary window represents the total sales of all companies across all
categories of insurance.
company-total.scatterviz
This file contains the same specifications as company.scatterviz, except that the size of
each company is determined by the total sales of that company across all the
categories of insurance.
company-life.scatterviz
This file contains the same specifications as company.scatterviz, except that the color
of each object indicates the life insurance sales as a fraction of total sales.
store-type.data and store-type.scatterviz
These les show sales of various product groups by store type during a three-year
period. The single independent variable for which a slider appears is time. Each
entity represents a store type (such as Food Store, Drug Store, Service Station, and
so forth). For each store type, the data file contains the total sales of several product
groups, such as alcoholic beverages, cereal, and so forth. The data spans 36 months,
in increments of one month.
The configuration file uses the month as the single slider dimension. One axis is
sales of alcoholic beverages, the other is sales of tobacco products. A third axis is not
used.
Note: The data file includes other categories. You can edit the configuration file to
use other product categories for the axes (see Appendix D, Creating Data and
Configuration Files for the Scatter Visualizer).
brand.data and brand.scatterviz
These les show sales of several soft-drink brands in a variety of store types. In this
dataset the brands form the entities, and the store types are associated with the axes.
The total sales are mapped to the size of each brand. The color mapping is random.
Since there are no independent variables, no slider is present.
cars.data and cars.scatterviz
These les show the weight, horsepower, model year, and acceleration of several car
models.
people.data and people.scatterviz
These les show the height, weight, density, and cholesterol level for a population
sample.
nl.births.data and nl.births.scatterviz
These les show birth patterns in the Netherlands. For each region, the population
density, birth rate, and population are shown. The animation sliders are mapped to
the age of the mother and the year.
adult94.data and adult94.scatterviz
These files show a complex example with scatterviz applied to
/usr/lib/MineSet/data/adult.data. The three axes in the visualization are avg_hrswk
(that is, average hours worked per week), avg_gross_income, and
avg_education_num. Unfortunately, education_num does not correspond exactly
to the number of years of education, but it is close. The slider on the right side animates
across different age ranges. Each aggregate was created by grouping by occupation,
race and sex. This means that there is an entity for every combination of values for
these three attributes. The color shows different occupations, as shown in the
legend. The size of each entity corresponds to record counts. The summary slider is
also colored by data density. To find out how this visualization was created, you
may select Start Tool Manager from the File menu. This will bring up the Tool
Manager with the session used to create this example.
Initially the scene shows information for people under 20 years of age. Note that the
average hours worked (about 14) and the average income (about $4000) are low. If
you animate over age using the slider, and examine the scene from the three
orthogonal viewpoints (try using the lower three buttons to the right of the main
window), you will notice various trends emerge. For example, if you orient the scene
so you see only income by hours per week, you can see that people start to work
longer hours as they age, until about age 25; then they seldom work more than 49
hours per week until they retire. Income, however, grows until people reach age 50,
then plateaus, then goes lower again. The actual trend depends somewhat on the career choice and
other factors.
Suppose you were interested in comparing trends between the occupations
craft-repair and prof-specialty. Open the Filter panel (View > Show Filter Panel) and
select just craft-repair and prof-specialty from the list of occupations. Now
when you animate, you can see that prof-specialty actually starts with lower
incomes, but quickly outpaces craft-repair as people age. Prof-specialty is much
higher on the education axis than craft-repair. You may wish to limit your filter
further by showing just females, or those of a certain race.
Chapter 8
8. Using the Splat Visualizer
This chapter discusses the features and capabilities of the Splat Visualizer. It provides an
overview of this database visualization tool, then explains the Splat Visualizer's
functionality when working with the
main window
external controls
pulldown menus
Finally, it lists and describes the sample files provided for this tool.
Overview of the Splat Visualizer
The Splat Visualizer lets you visually analyze relationships among several variables (see
Figure 8-1), either statically or by animation. It is particularly well-suited for application
to datasets with large numbers of records. Choose the Scatter Visualizer if you want to
see individual data points and do not have a large number of records. Data analysis is
done using
a three-dimensional landscape
an animation control panel that includes a two-dimensional slider
graphical objects, called splats, which represent aggregates of datapoints. Color and
opacity of the splats can change during animation.
Figure 8-1 Sample Splat Visualizer With One Slider Control
The Splat Visualizer lets you visualize your data by mapping columns to axes, sliders,
color, and opacity. The resulting three-dimensional landscape can be thought of as an
approximation to a scatterplot in which every datapoint is drawn separately. It is not
truly a scatterplot, because datapoints that are close together (fall in the same bin) are
aggregated and drawn as a single splat.
Each numeric column that is mapped to an axis or slider first must be binned. If this
binning step is skipped, the Tool Manager does it using automatic uniform binning (see
The Bin Columns Button in Chapter 3). String columns can be mapped directly to axes.
Any numeric column can be mapped to a color. The color of a splat is derived by
averaging the value of the column mapped to color for all the data points that fall in a
bin. The opacity of a splat is based on a weighting of the number of datapoints that fall
in a bin. If nothing is mapped to opacity, record counts are used to determine it. The
interactivity of the resulting visualization is independent of the number of data points
represented; it depends only on the number of bins in the axis dimensions. If your dataset
is very large, aggregate explicitly in the Tool Manager. This causes the server to perform
the processing, rather than having the entire dataset sent to the client and aggregated
there.
Up to two numeric columns can be mapped to the sliders in the animation control panel.
The splats change their color and opacity during animation as the sliders in the
animation panel are moved from point to point along the slider's path. Unlike the Scatter
Visualizer, neither the position nor the size of the splats changes; they are at fixed,
uniformly spaced positions. Only their color and opacity change, which can give the
illusion of actual movement.
After creating a visualization of your data, the Splat Visualizer lets you analyze the data
in various ways:
The animation control panel lets you note global shifts and trends in the data.
The three-dimensional landscape lets you orient the display to emphasize particular
dimensions or a point of view.
You can use the scale slider (located to the left of the Main Window) to lower the
overall opacity of the splats, so only regions with dense data show up; conversely,
you can increase the scale slider so all regions having any data become visible. The
regions with dense data are likely to show less color variation, because the color is
based on the average of many values (see Figure 8-3).
You can filter the display to show only those splats meeting certain criteria. You can
filter on the columns corresponding to axes, sliders, weight, and color.
An opaque pick dragger lets you display textual information about individual
splats in the volume.
A box selector lets you define a selected region for drilling through to the original
data or for sending to the Tool Manager.
If a string column is mapped onto an axis, binning is defined to be the distinct values of
that column. The order of the values along a string axis is automatically determined by
sorting the distinct values by the average aggregate value of the column mapped to color.
Looking at the color changes along a string-valued axis lets you see how well that
column correlates with the column mapped to color. The left axis in Figure 8-1 shows
occupations sorted by average income (the average income of everyone with that
occupation) along an axis. The occupation, executive-managerial, listed at the end of the
axis, has the highest average income. This ordering often presents a natural progression
for the values. For example, the ordering for the values of education (the right axis in
Figure 8-1) was generally from low to high; but, in a few cases, there were anomalies in
the order. This unexpected ordering might be interesting because it points out places
where the data does not agree with expectations.
Opacity
The column mapped to opacity should be record count or a column used to weight
record counts. A splat's opacity, α, is based on this column according to the following
relation:
α = 1 - e^(-u*weight)
where weight is the value of the column mapped to opacity (or the record count if no
such column was mapped to opacity). The shape of this function is such that the opacity
asymptotically approaches 1 (totally opaque) as the value of weight becomes large. The
variable u is what is scaled when you adjust the opacity scale slider. Figure 8-2 shows the
shape of this function for low and high values of u. Figure 8-3 shows the same
visualization with low and high values of u.
Figure 8-2 Shape of Opacity Function For Low and High Values of u
Figure 8-3 Image Where u = 5.3, and u = 30
If nothing is mapped to opacity, the Splat Visualizer generates a column of ones to
produce record counts when aggregating. This means all records are weighted equally.
A sum aggregation is done on this column, and an average aggregation is done on the
column mapped to color while grouping by all the axis and slider columns. All other
columns are unnecessary and removed. You do not need to map anything to opacity
unless you want each record to be weighted by something other than 1.
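As a minimal sketch of the opacity relation above (plain Python, illustrative only; weight
and u are the quantities named in the formula):
import math

def splat_opacity(weight, u):
    # Opacity approaches 1.0 (fully opaque) as weight grows large;
    # u is the factor scaled by the opacity scale slider.
    return 1.0 - math.exp(-u * weight)

# A sparsely populated bin versus a dense one, at the same slider setting.
print(splat_opacity(weight=2, u=0.05))     # about 0.10, nearly transparent
print(splat_opacity(weight=200, u=0.05))   # essentially 1.0, opaque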
You can avoid processing on the client by aggregating in the Tool Manager. This also
avoids having to transfer a large dataset to the client. This is done by
1. Binning the numeric columns which are to be used for axes and sliders.
2. Aggregating the column to be mapped to color by count and average while
grouping by the axis and slider columns.
3. Mapping the resulting count aggregation to opacity.
4. Mapping the resulting average aggregation to color.
For example, using the adult94 data (provided with the distribution):
1. Bin age and hours_per_week.
2. Aggregate gross_income using count and average. Keep education, occupation,
age_bin, and hours_per_week_bin in the group-by pane while removing all the
other columns.
3. Map education, occupation, and hours_per_week_bin to the axes.
4. Map avg_gross_income to color, count_gross_income to opacity, and age_bin to a
slider.
When you invoke the tool, note that all the processing is done on the server, and that the
data file, adult94.splatviz.data, contains rows that are aggregates of rows in the original
data. This produces the same visualization as seen in Figure 8-1.
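A minimal sketch of the count-and-average aggregation in step 2, written as plain Python
over an in-memory list of records (illustrative only; in MineSet this work is done by the
Tool Manager on the server):
from collections import defaultdict

def aggregate(records):
    # Group by the axis and slider columns, then compute the record count
    # and the average gross_income for each group.
    groups = defaultdict(list)
    for r in records:
        key = (r["education"], r["occupation"], r["age_bin"], r["hours_per_week_bin"])
        groups[key].append(r["gross_income"])
    return {key: (len(vals), sum(vals) / len(vals)) for key, vals in groups.items()}

# Each resulting (count, average) pair becomes one splat: the count drives
# opacity and the average drives color.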
In some cases, you might have a column by which you want to weight the records. For
example, if you have a dataset for which one column was population and another was
average_salary (which you want to map to color), you can map population to opacity,
and average_salary to color; then have the Splat Visualizer do the aggregation. Its
aggregation groups by the axis and slider columns, so that it sum aggregates the opacity
column (which, in this case, is population). The new column is called sum_population.
The average_salary column is revised, so that it is still average salary, but weighted by
each row's population. In this way, the average salary column still shows the average
salary for all the people it represents.
Alternatively, if you want to avoid client-side processing and storage because of the size
of your dataset, you can perform the same aggregation in Tool Manager by doing the
following:
1. Create a new column, defining temp = population*avg_income.
2. Perform an aggregation: group-by axis and slider columns, sum aggregate
population, and sum aggregate temp.
3. Create a new column, defining
avg_salary = sum_temp/sum_population
This creates the weighted average.
4. Now you can map sum_population to opacity, and avg_salary to color.
Note that these steps are the ones taken by the Splat Visualizer if you do not explicitly do
them in the Tool Manager.
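A minimal sketch of these four steps, again as plain Python over an in-memory table
(illustrative only; the group-by column names axis_bin and slider_bin are stand-ins,
while population and avg_income come from the example above):
from collections import defaultdict

def weighted_average_salary(rows):
    sums = defaultdict(lambda: [0.0, 0.0])        # key -> [sum_population, sum_temp]
    for r in rows:
        key = (r["axis_bin"], r["slider_bin"])    # stand-ins for the group-by columns
        temp = r["population"] * r["avg_income"]  # step 1: temp = population*avg_income
        sums[key][0] += r["population"]           # step 2: sum aggregate population
        sums[key][1] += temp                      # step 2: sum aggregate temp
    # step 3: avg_salary = sum_temp/sum_population, the weighted average
    return {key: (pop, tmp / pop) for key, (pop, tmp) in sums.items()}
# step 4: map sum_population to opacity and avg_salary to color.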
File Requirements
The Splat Visualizer requires the following files:
A data file, consisting of rows of tab-separated fields. This file is easily created using
the Tool Manager (see Chapter 3). If you are generating this file yourself, see
Appendix E, Creating Data and Configuration Files for the Splat Visualizer for
the required file format.
You can generate data files by extracting data from a source (such as a database) and
formatting it specifically for use by the Splat Visualizer. Data files have user-defined
extensions (the sample files provided with the Splat Visualizer have a .data
extension).
A configuration file, describing the format of the input data and how it is to be
displayed. The Tool Manager can create this file (see Chapter 3), or you can use an
editor (such as jot, vi, or Emacs) to produce this file yourself (see Appendix E,
Creating Data and Configuration Files for the Splat Visualizer).
Configuration files must have a .splatviz extension. When starting the Splat
Visualizer, or when opening a file, you must specify the configuration file, not the
data file.
Starting the Splat Visualizer
There are five ways to start the Splat Visualizer:
Use the Tool Manager to configure and start the Splat Visualizer. (See Chapter 3 for
details on most of the Tool Manager's functionality, which is common to all MineSet
tools; see Configuring the Splat Visualizer Using the Tool Manager on page 256
for details about using the Tool Manager in conjunction with the Splat Visualizer.)
Double-click the Splat Visualizer icon, which is in the MineSet page of the icon
catalog (or on your Indigo Magic desktop). The icon is labeled splatviz. Since no
configuration file is specified, the start-up screen requires you to select one by using
File > Open.
Starting the Splat Visualizer without specifying a configuration file causes the main
window to show the copyright notice and license agreement for this tool. Only the
File and Help pulldown menus can be used. For the main window to be fully
functional, open a configuration file by selecting File > Open.
If you know what configuration file you want to use, double-click the icon for that
configuration file. This starts the Splat Visualizer and automatically loads the
configuration file you specified. This works only if the configuration filename ends
in .splatviz (which is always the case for configuration files created for the Splat
Visualizer via the Tool Manager).
Drag the configuration file icon onto the Splat Visualizer icon.
Start the Splat Visualizer from the UNIX shell command line by entering this
command at the prompt:
splatviz [ configFile ]
configFile is optional and specifies the name of the configuration file to use. If you
don't specify a configuration file, you must use File > Open to specify one.
Options for Invoking the Splat Visualizer
The -quiet option eliminates the dialogs that pop up to indicate progress. You can enable
this option permanently by adding the line
*minesetQuiet:TRUE
to your .Xdefaults file.
Configuring the Splat Visualizer Using the Tool Manager
This section describes how the Splat Visualizer can be configured using the Tool
Manager. Although the Tool Manager greatly simplifies the task of configuring the Splat
Visualizer, you can construct a configuration file manually for this tool using a text editor
(see Appendix E, Creating Data and Configuration Files for the Splat Visualizer).
The steps required to connect to a data source are described in Chapter 3.
Selecting the Splat Visualizer Tool
Select the Viz Tools tab in the Data Destination panel of the Tool Manager's main screen
(Figure 8-4). From the popup list of tools, select Splat Visualizer. The mapping
requirements for the Splat Visualizer are displayed in the window on the right side of this
panel. Items in the Visual Elements list that are preceded by an asterisk are optional.
Figure 8-4 Data Destination Panel With Splat Visualizer Selected
Axis 1, *Axis 2, *Axis 3 determine which columns are assigned to the axes in the
Splat Visualizer's main window. Assigning data to the first axis is required;
however, this alone does not usually produce a useful display. By assigning data to
Axis 2, you can create an XY chart. Assigning data to all three axes produces a 3-D
chart.
*Color: Requires a numeric column used to determine the color of the splats.
If you have a two-valued string column, you can create a new numeric column
using an expression such as:
('stringCol'==value1)? 1:0
If nothing is mapped to color, the resulting scene is monochromatic.
*Opacity: the tool was designed to have the opacity based on a weighting of
records. If you do not aggregate in the Tool Manager, this requirement need not be
mapped; it will be determined automatically by the tool. If you do a count
aggregation in the Tool Manager, or there is a column in the data that already is
based on counts, use that column for this requirement.
*Sliders: the summary slider dimensions. They must be numeric or binned.
*Summary: this is the value to be shown in the summary slider. If no summary
column is mapped, count is used by default. If a summary column is mapped, a
weighted average value for that column is shown in the summary.
Mapping Columns to Requirements
You can map requirements to columns by selecting a column name in the Current
Columns window of the Table Processing panel, then selecting a category in the Visual
Elements window.
Undoing Mappings
To undo a specic mapping, select that mapping in the Requirements window, then click
the Clear Selected button. To undo all mappings, click the Clear All button.
Specifying Tool Options
Clicking the Tool Options button causes a new dialog box to be displayed (Figure 8-5).
This lets you change some of the default values of the Splat Visualizer options.
Figure 8-5 Splat Visualizer's Options Dialog Box
The Splat Visualizer's Options dialog box has three basic options blocks:
Splats
Summary
Other
Splat Options
These options let you specify a number of characteristics for the splats that the Splat
Visualizer then graphically displays.
Splat Colors: lets you control the colors used for the splats. You can
specify the list of colors to use
specify the kind of mapping
map the list of colors to a list of values
Splat Shape: lets you choose one of the following methods for drawing splats:
linear, gaussian, texture, sphere, cube, or diamond. See Splat Type Menu on
page 282 for a further explanation of each of these.
To use these Colors options, you must have mapped a column to the *color requirement
of the Data Destination panel. If nothing is entered in the color list, the default colormap
is used. The default colormap is a continuous spectrum from blue (lowest value) to red
(highest value). See Choosing Colors and Using the Color Browser in Chapter 3 for
a more detailed explanation of how to choose and change colors.
Color list: You can specify the color list using the + button next to the color list label.
This brings up a color editor that lets you specify a color to be added to the list.
Color mapping: You can specify whether the color change that is shown in the graphic
display is Continuous or Discrete. If you choose Continuous, the color values shift
gradually between the colors entered in the Color list to use field as a function of the
values that are mapped to those colors in the Color mapping field.
The field to the right of the popup button lets you enter specific values for mapping the
colors. If you do not specify any mapping values, the range of values in the color column
is used.
Example 8-1
If you
used the Color Browser to apply red and green to the splats
selected Continuous for the Kind of mapping
entered the values 0 100
the display shows all splats with values less than or equal to 0 as completely red, those
with values greater than or equal to 100 as completely green, and those between 0 and
100 in a color that results from a linear interpolation between red and green.
Example 8-2
If you
used the Color Browser to apply red and green to the splats
selected Discrete for the Kind of mapping
entered the values 0 50
the display shows all splats with values less than 50 in red, and all those with values
greater than or equal to 50 in green.
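A minimal sketch of the two kinds of mapping in plain Python (illustrative only; the red
and green endpoints and the values 0, 100, and 50 come from the two examples above):
def continuous_color(value, lo=0.0, hi=100.0):
    # Linear interpolation between red (at or below lo) and green (at or above hi).
    t = min(max((value - lo) / (hi - lo), 0.0), 1.0)
    red, green = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
    return tuple((1 - t) * r + t * g for r, g in zip(red, green))

def discrete_color(value, threshold=50.0):
    # Values below the threshold are red; values at or above it are green.
    return (1.0, 0.0, 0.0) if value < threshold else (0.0, 1.0, 0.0)

print(continuous_color(50))    # (0.5, 0.5, 0.0), midway between red and green
print(discrete_color(49.9))    # red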
Summary Options
Summary options let you specify what color to use for the Summary window. This is
only applicable if you have mapped a column to the summary.
Other Options
The Other Options, at the bottom of the dialog box, include the following fields:
Hide Label Distance controls the distance at which axis tick labels (for string-valued
axes) become invisible. Increase this number to make the labels appear at
further distances. The higher the number, the greater the distance at which labels
are hidden.
Axis Label Size controls the size of the axis labels. A smaller number decreases
the size, a larger one increases it.
Grid Color lets you modify a grid color by clicking on it. This causes the Color
Chooser dialog box to appear, which lets you implement your color changes.
Grid (X, Y, Z) Size lets you specify the spacing between grid lines for the
respective axis. A smaller number decreases the size, a larger one increases it. If the
Size is set to 0, there are no grid lines in that dimension.
Resetting the Tool Options
Clicking the Reset Options button resets the values of all options to their default values.
Invoking the Splat Visualizer
To see the Splat Visualizer graphically represent your data, click Invoke Tool at the bottom of
the Data Destination panel.
Saving the Splat Visualizer Settings
When you press Invoke Tool, the Tool Manager stores information for the Splat Visualizer
in several files, all sharing the same prefix:
<prefix>.splatviz.data contains data.
<prefix>.splatviz.schema describes the data file.
<prefix>.splatviz contains information required by the Splat Visualizer.
To save the entire session along with the current tool options, use one of these menu
options from the File menu:
Save Current Session... where the default prefix is based on the data source
Save Current Session As... to specify your own prefix
The saved file is <prefix>.mineset, and contains all the information needed to return
MineSet to its current state.
When you use Invoke Tool, the .data, .schema, and .splatviz files are updated, if necessary.
Null Handling in the Splat Visualizer
The Splat Visualizer uses special representations when fields with unknown data values,
or nulls, are mapped to visual attributes. (For a discussion of null values, see Appendix J,
Nulls in MineSet.) When every record in a bin has a null value for the column mapped
to color, the resulting color for that splat is gray. If one or more records in the aggregate
have non-null values for the column mapped to color, then that value is (or those values
are) used to compute the color. While the sum of a value and null is null, the average of
a value and null is the value (that is, value + Null = Null; avg(val, Null) = val).
When a null value is displayed in the Pick Window, Selection Window or Pointer is
Over area, it is shown as a question mark (?). (The Selection Window and Pointer is
Over areas are discussed in the Select Mode section.)
For numeric columns containing nulls which are mapped to axes, there is a special null
position below the range defined by the axis. This is to help show that the null value is
discontinuous with the other values. The null positions for numeric axes can be turned
off using the Show Null Positions option under the View Menu (see The View Menu
on page 276). For string-valued columns mapped to axes, nulls (represented by a ?) are
treated as just another value.
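A minimal sketch of that averaging rule in plain Python, with None standing in for a null
value (illustrative only):
def null_average(values):
    # Nulls (None) are ignored when averaging; if every value is null,
    # the result is null and the splat is drawn gray.
    known = [v for v in values if v is not None]
    return sum(known) / len(known) if known else None

print(null_average([30000, None]))   # 30000.0 -- avg(val, Null) = val
print(null_average([None, None]))    # None    -- the splat would be gray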
Working in the Splat Visualizer's Main Window
If you started the Splat Visualizer without specifying a configuration file, the main
window shows the copyright notice and license agreement for the Splat Visualizer. Only
the File and Help pulldown menus can be used. For the main window to show all menus
and controls, open a configuration file. Use File > Open to see a list of configuration files.
When a valid configuration file has been selected, the 3-D landscape it specifies is visible.
Viewing Modes
The two modes of viewing are grasp and select. To toggle between these modes, press Esc,
or click the appropriate cursor button adjacent to the top-right of the viewing area.
Grasp Mode
In grasp mode, the cursor appears as a hand. This mode supports panning, rotating, and
scaling the scene's size in the main window.
To pan the display, press the middle mouse button and drag it in the direction you
want the display panned.
To rotate the display, press the left mouse button and move the mouse in the
direction you want to rotate. (Also see the thumbwheel controls Rotx and Roty,
described in Thumbwheels in Chapter 6.)
To move the viewpoint forward, press the left and middle mouse buttons
simultaneously and move the mouse downwards. To move the viewpoint
backward, press the left and middle mouse buttons simultaneously and move the
mouse upwards. This is equivalent to the functions provided by the Dolly
thumbwheel.
Select Mode
In select mode, you can move a 3-D pick dragger through the volume in order to display
information about regions in the scene. This pick dragger is composed of a cylinder and
a square. If you pick on the cylinder and drag, motion is constrained to be parallel to the
cylinder's axis. If you pick on the square and drag, motion is constrained to the plane
defined by the square. You can cycle through the three possible orientations of the pick
dragger by pressing the Control key with the cursor over the dragger. (You need not press
the mouse button.) In the case of dragging the square portion of the dragger, you can use
the Shift key to constrain the motion along one of the two axes within the plane.
Alternatively, each axis has a disk that aligns with the pick dragger position. Moving the
disk on an axis moves the dragger, and vice-versa.
The dragger lets you pick within a dense cloud of points, freeing you from the limitation
of having to pick regions on the surface.
When the pick dragger is over data, the cylinder changes its color to that of the splat
under it, and information about that region appears at the top of the view area
(Figure 8-6). If no data is present, the cylinder remains light gray, and information about
its position is displayed at the top of the render area for aid in navigation.
When you are finished dragging, and have released the mouse button, the message for
the splat you are currently over is shown in the Pick Window at the top. This pick
information is updated if the animation slider is moved. Using the mouse, you can cut
and paste this selection information into other applications, such as reports or databases.
The pick dragger may be removed from the scene by unchecking Selection > Show Pick
Dragger.
Figure 8-6 Pick Dragger Over Data
The information is displayed when the pick dragger is over the object.
Note: Users familiar with Open Inventor can configure the Splat Visualizer so that the
right mouse button brings up the standard Inventor Menu. This provides additional
functions, such as stereo viewing and spin animation. These functions are provided by
the Open Inventor library. To enable the Open Inventor Menu, add the line
*minesetInventorMenu:TRUE
to your .Xdefaults file.
External Controls
Several external controls surround the main window, including buttons and
thumbwheels. (These controls are substantially the same as in the other MineSet
visualization tools and are described in "Buttons" and "Thumbwheels" in Chapter 6.)
The Animation Control Panel
The animation control panel, which appears to the right of the main window, consists of
a summary window, with up to two adjacent sliders, an information field, animation
buttons, and animation sliders.
Sliders Controlling Independent Dimensions
The number of sliders appearing adjacent to the summary window is dependent on the
slider mappings specified in the configuration file. Datasets can have two, one, or no
independent dimensions.
Datasets With Two Independent Dimensions
If the dataset has two dimensions of independently varying data (such as
adultJobs2.splatviz), the controls to the right of the main graphics window become visible
(see Figure 8-7).
Figure 8-7 Animation Control Panel With Summary Window and Both Slider Controls
To the right of the main window are the 2-D summary window and slider controls. The
summary window has a horizontal slider below it for selecting data points of the first
independent dimension, and a vertical slider to the left for selecting data points of the
second independent dimension. The horizontal slider's dimension is identified by a label
below it. The vertical slider's dimension is identified by a label above it.
Datasets With One Independent Dimension
For datasets with one independent dimension (such as adultJobs.splatviz), only the slider
below the summary window appears, and the summary window is compressed (see
Figure 8-1). This slider's dimension is identified by a label below it.
Datasets With No Independent Dimension
For datasets with no independent dimensions (such as mushroom.splatviz), no slider
control appears (see Figure 8-8). In this example, the splats that are neither completely
red nor completely blue indicate that both poisonous and edible mushrooms are plotted
at that location.
Figure 8-8 Splat Visualizer Without Independent Dimension or An Animation Control Panel
The Summary Window
The summary window provides a 2-D representation of the aggregation of values that
the main window displays in 3-D. The whiter the areas of the summary window, the
lower the summary value represented by the splats in the main window. The greater the
color density in areas of the summary window, the higher the summary values. The
summary value is either the total weight of data at that slider position, or the weighted
average of the column that was mapped to summary. The density of these colors in the
summary window provides a summary of the data across the one or two independent
dimensions in the dataset. If no column is explicitly mapped to summary, count is used
to show which positions on the slider represent the most data.
By default, the summary window also contains a set of black dots, evenly spaced across
the one or two dimensions of data. These dots indicate the precise positions of the
discrete datapoints. You can turn off these black dots by unchecking the box at the
bottom of the summary slider window. Slider positions between these data points use
interpolation of the underlying data to produce an image.
Color Density in the Summary Window
After opening the adultJobs.splatviz file, for example, the 2-D summary window shows a
color range from white (on the left) to red (in the middle) to white (on the right). Red
represents more records (12,838 in this case), while white represents fewer records
(3,606). In this example, the greater density of red in the middle of the slider means that
the highest concentration of people is in the 20-50 age range.
Creating a Path in the Summary Window
If the dataset loaded into the Splat Visualizer has at least one independent dimension, it
is possible to view all or any part of that dataset via animation. This is done by first
creating a path in the summary window (this path connects a sequence of data points),
then activating the animation controls described in the next section.
The three ways to draw a path in the summary window are:
Define a starting point by clicking and holding down the left mouse button, then
draw a path by dragging the cursor over the window. End the path by releasing the
left mouse button.
Define a starting point by clicking the left mouse button, then define an endpoint by
moving the cursor to another part of the window and clicking the middle mouse
button. A path appears between those two endpoints. To add more line segments,
continue with repeated middle mouse clicks.
Define a starting point by clicking the left mouse button, then drag one of the
independent dimension sliders, thus drawing a straight line along this dimension.
If there are two sliders, use of the second slider causes a straight line to be drawn
along the axis controlled by this second slider.
Animation Buttons and Sliders
The animation panel is identical to that of the Map Visualizer; see "Animation Buttons
and Sliders" in Chapter 6. The following section is different, however.
Slider Data Points and Interpolation
As animation proceeds, the color and opacity of the splats change smoothly. The information
displayed in the message box field shows the interpolated data values. When the slider
motion stops, the slider position snaps to the nearest discrete data position where
interpolated data values are not used.
There is a table for each binned position on the summary slider. Each row in one of these
tables (which is an aggregate of original data) defines a splat in the scene. Tables
corresponding to adjacent bins on the summary need not have the same number of rows
because of the differences in data distribution from one position to the next. For example,
if we change the visualization in Figure 8-1 from showing 40-50 year-olds to one showing
50-60 year-olds by moving the slider one notch to the right (see Figure 8-9), some
positions might show splats where there were none before, and vice versa.
Figure 8-9 Changed Visualization as a Result of Moving the Slider (Compare to Figure 8-1)
For interpolation on a one-dimensional slider, two adjacent tables are merged, then
aggregated using the spatial columns as unique keys. The count is simply interpolated
(0 count is assumed if one of the tables lacks a particular row). The average value used
for color is also interpolated, but weighted by the count.
Example 8-3
(This example describes technical details of the interpolation process.) Suppose we want
to show an image that represents an interpolation between the tables for the 40-50
year-olds and the 50-60 year-olds on the external slider. Let Table 8-1 and Table 8-2 be the
tables for age=40-50 and age=50-60, respectively, for the two slider positions.
This is how the Splat Visualizer performs the interpolation. For Table 8-1, a new count
column equal to (1-t)*count and a new weighted value column equal to (1-t)*count*value
are added. For Table 8-2, a new count column equal to t*count, and a new weighted value
column equal to t*count*value are added. The two tables are merged together.
The merged table is aggregated using the spatial axes columns as keys, and sum
aggregating the two new columns. This ensures that no two rows have the same binned
values for all the spatial axes. Finally, divide the summed value by the summed count to
get the interpolated values. In this case, the interpolated values are for income. If t=.5, the
resulting table would be Table 8-3.
Table 8-1 Ages 40 to 50
education   occupation   hours_worked   income   count
HS-grad     Exec-Man.    15-25          25000    2
HS-grad     Mach-op      15-25          30000    1
Masters     Technician   25-35          35000    3
Table 8-2 Ages 50 to 60
education    occupation   hours_worked   income   count
HS-grad      Exec-Man.    15-25          70000    1
Vocational   Mach-op      35-45          40000    2
If the external query slider has two dimensions, bilinear interpolation is used.
This census dataset contains nearly 150,000 rows. The purpose of the external slider is to
allow navigation through, and show summary information for, additional dimensions in
the data. The red regions represent places where the summary value is high; white shows
areas where it is low. When the slider is positioned over a black point, the image shows
uninterpolated data. One can trace a path on the slider and animate it using the VCR
control panel below the slider.
To show how animation is produced, assume you have data for 8 years, 1990-1997 (that
is, eight data points in the summary window). Let's examine how one splat changes as
the slider is moved from one year to the next. Assume that in 1990 a splat at a given
position has a value of 20 (to be mapped to color) and a count of 2. Assume further that in
1991 that same splat has a value of 40 and a count of 200. The splat in year 1991 is much
more opaque than the one in 1990 because it represents an aggregation of many more
records (or of much more heavily weighted records). As you move the year slider from
1990 to 1991, the count changes by being linearly interpolated between 2 and 200. The
value is computed by taking an average of the two values weighted by records counts (or
weights). For example, midway between 1990 and 1991, the count is 101, and the value
is ((1-.5)*2*20+.5*200*40)/((1-.5)*2+.5*200) = 39.8. As you approach 1991, the value
approaches 40. You cannot stop an animation between discrete data points, and you
cannot drag the Path slider to a stationary position between discrete data points.
The data points in the summary window represent the slider positions corresponding to
the actual data from the data file. For example, values 20 and 40 represent aggregations
of actual data, but the value 39.8 does not.
Table 8-3 Interpolation Midway Between Table 8-1 and Table 8-2
education    occupation   hours_worked   income   count
HS-grad      Exec-Man.    15-25          40000    1.5
HS-grad      Mach-op      15-25          30000    .5
Masters      Technician   25-35          35000    1.5
Vocational   Mach-op      35-45          40000    1
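A minimal sketch of this merge-and-interpolate step in plain Python (illustrative only);
with t = 0.5 and the two tables above it reproduces the incomes and counts of Table 8-3:
from collections import defaultdict

def interpolate_tables(table_a, table_b, t):
    # Each table maps (education, occupation, hours_worked) -> (income, count).
    merged = defaultdict(lambda: [0.0, 0.0])        # key -> [sum_count, sum_weighted_value]
    for table, w in ((table_a, 1 - t), (table_b, t)):
        for key, (income, count) in table.items():
            merged[key][0] += w * count             # new count column
            merged[key][1] += w * count * income    # new weighted value column
    # Divide the summed value by the summed count to recover the interpolated income.
    return {key: (val / cnt, cnt) for key, (cnt, val) in merged.items()}

ages_40_50 = {("HS-grad", "Exec-Man.", "15-25"): (25000, 2),
              ("HS-grad", "Mach-op", "15-25"): (30000, 1),
              ("Masters", "Technician", "25-35"): (35000, 3)}
ages_50_60 = {("HS-grad", "Exec-Man.", "15-25"): (70000, 1),
              ("Vocational", "Mach-op", "35-45"): (40000, 2)}

for key, (income, count) in interpolate_tables(ages_40_50, ages_50_60, 0.5).items():
    print(key, income, count)   # matches the rows of Table 8-3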
Pulldown Menus
Five pulldown menus let you access additional Splat Visualizer functions. These are
labeled File, View, Selection, Splat Type, and Help. If you start the Splat Visualizer
without specifying a configuration file, only the File and the Help menus are available.
The File and Help menus are the same as found in other MineSet visualization tools. For
a description see The File Menu in Chapter 5, and The Help Menu in Chapter 5.
The View Menu
The View menu lets you control certain aspects of what is shown in the Splat Visualizer
window.
Figure 8-10 Splat Visualizer View Menu
Show Window Decoration lets you hide or show the external controls around the
main window.
Show Null Positions lets you hide or show splats that have null or unknown position
values along one or more axes.
Show Animation Panel lets you show or hide the animation control panel. This menu
item is disabled for datasets with no independent dimension.
Show Filter Menu brings up a filter panel (Figure 8-11) that lets you reduce the
number of splats displayed in the main viewing area, based on one or more criteria.
You can use the filter panel to fine-tune the display, emphasize specific information,
or simply shrink the amount of information displayed. Columns other than those
mapped to axes, sliders, opacity, and color are not available for filtering because
they are removed during aggregation. The Scale to filter checkbox, which appears in
the lower right of the filter panel, lets you specify whether the landscape in the main
window covers the entire dataset or just the filtered data.
Set Background Color brings up a color chooser to let you specify a new background
color.
Figure 8-11 Splat Visualizer Filter Panel
The Filter panel has two panes. The top pane lets you filter based on string columns.
To select all values of a column, click Set All. To clear the current selections, click
Clear. To select a value, click it. To deselect a value, simply click it again.
The bottom pane lets you filter based on the values of both string and numeric
columns.
To filter numeric values, enter the value, and select a relational operation (=, !=, >, <,
>=, <=). To filter alphanumeric values, enter the string. You can use any of three
types of string comparisons:
Contains indicates that the value contains the specified string. For example,
"California" contains the strings "Cal" and "forn".
Equals requires the strings to match exactly.
Matches allows wildcards:
An asterisk (*) represents any number of characters.
A question mark (?) represents one character.
Square braces ([ ]) enclose a list of characters to match.
For example, "California" matches "Cal*", "Cal?fornia", and "Cal[a-z]fornia".
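These wildcard semantics are the familiar shell-style globbing rules. As a sketch only (MineSet performs this matching internally), Python's fnmatch module reproduces the same behavior:

from fnmatch import fnmatchcase

# * matches any number of characters, ? matches exactly one character,
# and [ ] encloses a set (or range) of characters to match.
for pattern in ("Cal*", "Cal?fornia", "Cal[a-z]fornia"):
    print(pattern, fnmatchcase("California", pattern))   # all three print True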
For columns that were binned, an option menu of values appears instead of a
text field. To ignore that column, select Ignored in the Option menu. You can use
relational operators, such as >=, with these options. This means that the specified
value as well as subsequent ones are selected.
In addition to numeric and string comparison operations, you can specify Is Null,
which is true if the value is null.
To the right of each field is an additional option menu that lets you specify And or
Or options. For example, you could specify sales > 20 And < 40. You can have
any number of And or Or clauses for a given column, but cannot mix And and Or in
a single column.
Scale to Filter lets you specify whether the filtered landscape is rescaled to the size of
the filtered data or remains the size of the entire data set.
Click the Filter button to start filtering. If you press Enter while the panel is active,
filtering starts automatically.
Click the Close button to close the panel.
The Selection Menu
The Selection menu (Figure 8-12) lets you drill through to the underlying data. The menu
has six items.
Figure 8-12 The Splat Visualizer's Selection Menu
Create Box Selection creates a 3-D box selector that can be stretched and translated to
select regions of the volume. While the box selector is active, a window in Record
Viewer format is opened showing information about all of the aggregated data that
is represented by the splats within it (see top of Figure 8-13). Closing this window
clears the current box selection(s). Selecting this option again creates a new box
selection, making the previous selection fixed. The fixed-selection boxes are gray,
while the active one is light yellow (see Figure 8-13). The selected bins, shown in the
selection window, are the bins enclosed by the union of all the selection boxes.
Figure 8-13 Image With Fixed Selection Box (Gray) and Active Selection Box (Yellow)
To translate the active selection box, click on one of the faces with the left mouse
button, and drag it in the desired direction. Holding the Shift key while dragging
constrains the motion to the axis to which the drag motion is closest. To change the
extent of the selection box, drag one of the gray scale tabs in the desired direction.
You cannot resize or translate the selection box beyond the bounds of the volume. The
gray scale tabs resize continuously to maintain a constant size on the screen. If at any
time they appear too big, you can zoom in closer, and they shrink relative to the
box.
Show Original Data retrieves and displays the records corresponding to what has
been selected via Box Selection(s). The resulting records are shown in a table viewer.
Send To Tool Manager inserts a filter operation, based on the current box selection(s),
at the beginning of the Tool Manager history. The actual expression used to do the
drill through is determined by the extents of the current box selection(s). If nothing is
selected, a warning message appears.
Use Slider On Drill-Through determines whether or not to use the slider position
when creating the drill-through expression. If checked (default), an additional term
is added to the drill-through expression, limiting the drill-through to those records
defined by the slider's position. If this option is not checked, no such limiting term
is added.
Complementary Drill Through causes the Show Original Data and Send To Tool Manager
selections, when used, to fetch all the data that are not selected.
Show Pick Dragger toggles the visibility of the pick dragger (on by default). The pick
dragger is removed when a box selection is started, but it can be made active at the
same time that a box selection is active.
For further details on drill-through, see Chapter 18, Selection and Drill-Through.
Splat Type Menu
Splats are used in this tool to model clouds of small points (see Lee Westover, "Footprint
Evaluation for Volume Rendering," in Proceedings of SIGGRAPH 90, Vol. 24, No. 4, pages
367-376).
The Splat Type menu lets you change the method for drawing the splats. You can choose
to exchange accuracy for interactivity. Texture splats are the most accurate representation
of the ideal Gaussian density that each approach approximates. Since most computers
support hardware-assisted texturing well, the texture splat is usually the best choice.
Among SGI platforms, only the Indy or earlier systems are restricted to the slower
software implementation. The three splat types are:
Linear draws a small set of triangles to give a linear approximation to a Gaussian
splat.
Gaussian draws a large set of triangles to approximate a Gaussian splat.
Texture uses a texture-mapped rectangle to give the most accurate representation.
This can be very slow on machines that don't support hardware-assisted texture
mapping.
Alternatively, the following opaque primitives are allowed.
Sphere draws an opaque sphere, the radius for which varies with the cube root of the
count (or weight).
Cube draws an opaque cube, the size of which varies with the cube root of the
count (or weight).
Diamond draws an opaque diamond, the size of which varies with the cube root of
the count (or weight).
Sample Configuration and Data Files
The provided sample data and configuration files demonstrate the Splat Visualizer's
features and capabilities. The following files are in the /usr/lib/MineSet/splatviz/examples
directory:
mushroom
The mushroom.data file contains pre-aggregated data concerning more than 5,000
mushrooms. The group-by columns were odor, gill_color, and cap_color. For every
combination of these three columns in the original data, there is a count and an
average edibility, where 0 is edible, and 1 is poisonous. An average edibility
between 0 and 1 means some of the mushrooms in that aggregate are edible and
some are poisonous, since mushrooms cannot be partially poisonous.
The visualization (Figure 8-8) shows that the unique values for each of these
columns have been sorted along the axes according to average edibility. Odor is
clearly the best determinant of edibility. Also note that most splats are either all 0 or
all 1, meaning these three columns are useful in segmenting the two classes of
mushrooms. Lower the opacity slider to determine which splats have the highest
counts. The most opaque splat represents 288 mushrooms having common values
for odor, gill_color, and cap_color. To confirm this, try filtering based on
sum_count_poison>280 and picking on the remaining splats to see their counts.
Note that all mushrooms with gill_color=buff are poisonous.
adultJobs
The adultJobs.data file was derived from adult94, a dataset provided with the
distribution. It was created using an aggregation that grouped by education,
occupation, hrs_worked_per_week (binned), and age (binned). The gross_income
column was aggregated by count and average. For a display using the Splat
Visualizer (Figure 8-1), age_bin was mapped to a slider, while the other group-by
columns were mapped to axes. The count_gross_income column was mapped to
opacity, and avg_gross_income was mapped to color.
When the slider is in the left-most position, the color of the plot is almost entirely
blue. This means that regardless of occupation, education, or number of hours
worked, people younger than 20 have low incomes. Move the slider to the right,
and note how incomes rise faster for higher education and occupations toward the
end of the axis. By the opacity variation you can see that the most common types of
education are HS, some college, and Bachelor's degree.
Moving the Summary slider shows how the distribution of income changes with
respect to the axis columns as people age.
284
Chapter 8: Using the Splat Visualizer
adultJobs2
The adultJobs2 file is also based on the adult94 dataset. Here, the axis columns are
working_class, education, and occupation. The two columns mapped to sliders are
age (binned) and hours_worked_per_week (binned). Again, income was aggregated
by count and average for use with opacity and color, respectively. Since there are
more positions on the 2D slider, there are fewer records represented by each
position. This causes greater variation of color and opacity. The red region in the
center of the hrs_per_week dimension of the Summary slider shows that nearly
everyone works between 35 and 45 hours per week (see Figure 8-7). Note that some
occupations are aligned with specic working classes. For example, everyone in the
Armed-forces has Fed-Government for their working class.
censusIncome
This example is based on a dataset similar to adult94, but was not included with the
distribution because of its size. In an attempt to understand the differences between
gross income and total income, gross_income, total_income, and hrs_per_week
have been mapped to axes. Color shows age. By studying the image we can learn
that there are many records where total_income=gross_income, but there is also a
large portion of records with high total_income but 0 gross_income. It is
surprising that in many cases gross_income is greater than total_income.
Note where the people of different ages are concentrated. Many old people (yellow)
are in the hrs_per_wk=0 plane. They are probably retirees. Many children and
young adults (blue) are in the line gross_income=total_income=0. Note the fairly
opaque splats near the outside edges of the volume. These positions include all
points that fell in the maximum bin shown for an axis. For example, the highest bin
for total_income is 70300+. Any point higher than 70300 goes in this bin.
To better see the varying density, adjust the opacity slider. At low opacity scales, the
diagonal lines show that for most people gross_income=total_income, or they have
just total_income and no gross_income. As you raise the scale, you can see that
almost the entire volume contains data. This dataset contains 150,000 records.
churn
Churn is when a customer leaves one company for another. This example shows
customer churn for a telephone company. The data used to generate this example is
in /usr/lib/MineSet/data/churn.schema.
Using column importance, we found that total_day_charge,
number_customer_service_calls, and international_plan were important
discriminators. These columns were mapped to axes. We then created a new
numeric column, churn, which equals churned==Yes, and mapped it to color.
In the resulting visualization, red areas of the volume indicate high churn. The area
corresponding to three or more customer service calls and low total_day_charge
shows high churn. You might want to weight big-spending customers more
heavily than others. To do this, create a new column, total_charge, equal to
`total_day_charge`+`total_eve_charge`+`total_night_charge`
or some power of this sum. Then map this total_charge column to opacity. This
means every record is weighted by total_charge. Now the visualization shows
additional areas of interest near the high end of the total_day_charge axis.
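In MineSet these columns are added with Tool Manager column expressions; the pandas sketch below, with made-up values and the column names used above, only illustrates the arithmetic involved:

import pandas as pd

# Hypothetical slice of the churn table.
df = pd.DataFrame({
    "churned":            ["Yes", "No", "Yes"],
    "total_day_charge":   [45.1, 20.3, 38.7],
    "total_eve_charge":   [16.8, 12.0, 21.4],
    "total_night_charge": [ 9.2, 11.5,  8.0],
})

# Numeric churn column: 1 when the customer churned, 0 otherwise.
df["churn"] = (df["churned"] == "Yes").astype(int)

# Weighting column for opacity: the sum of the charge columns.
df["total_charge"] = (df["total_day_charge"]
                      + df["total_eve_charge"]
                      + df["total_night_charge"])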
Chapter 9
9. Using the Rules Visualizer
This chapter discusses the components and capabilities of the Rules Visualizer. It first
provides an overview of this data mining and visualization tool, then it explains this
tool's functionality when working with the
main window
external controls
pulldown menus
Finally, it lists and describes the provided sample files for these tools.
Overview of Rules Visualizer
The Rules Visualizer gives you the power to mine data by constructing, verifying, and
graphically representing models of patterns in large databases. These patterns are
expressed via association rules, which indicate the frequency of items occurring together
in a database.
Discovering and graphically displaying association rules can be relevant to many
enterprises, including supermarket inventory planning, shelf planning, and attached
mailing in direct marketing.
The tool execution scenario described in Chapter 1 of this document (see Figure 1-1) is
slightly modified for the Rules Visualizer. First, the raw data in your database must be
converted into a specially formatted file that can be processed by the association rules
generator part of the Rules Visualizer. When the association rules generator has
processed this file, the results can be displayed by the rules visualizer part of this tool.
Thus, the Rules Visualizer consists of three operations:
1. Data conversion. The association data converter processes a raw data file and
creates a file usable by the association rules generator.
2. Association rules generation. The data file created by the association data converter
is processed by the association rules generator, which creates a file usable by the
rules visualizer.
3. Rules visualization. This operation displays the generated association rules.
In addition to the input data and rules file requirements, each operation requires a
configuration file that specifies operational parameters.
The sequence of actions by the user, at the user's workstation, and at the host server is
shown schematically in Figure 9-1. The phases indicated at the right of the illustration
correlate to the operations listed above.
Figure 9-1 Execution Sequence of the Rules Visualizer
(Diagram showing the client workstation and host server components: the user's data source, Tool Manager, DataMover, format file, "raw" data file, association data converter, binary (flat) data file, association rules generator, rules file, configuration file, and Rules Visualizer, grouped into the data conversion, rules generation, and rules visualization phases.)
Data Conversion
The association data converter takes a raw data file, such as one resulting from a
database query, and creates a binary data file in the format used by the association rules
generator. The internal format of this generated file allows optimum processing by the
rules generator. The data converter accepts input from flat files as well as databases.
Association Rules Generator
One example of applying the association rules generator is to obtain "market basket"
data for customer buying patterns. Here, "market basket" is the set of items bought by
each customer on a single visit to a store. An example rule in this context might be: "80%
of the people that buy diapers buy baby powder." This percentage is known as the
predictability of the rule.
In the example, "diapers" is the item on the left-hand side (LHS) of the rule, and "baby
powder" is the item on the right-hand side (RHS) of the rule.
Some applications of these rules are as follows:
If Fizzy Pop appears on the RHS, the LHS can help us determine what the store
should do to boost sales of this beverage.
If Bagels appears on the LHS, the RHS can help us determine what products
might be affected if the store no longer sells bagels.
The association rules generator part of this tool processes an input file, then generates an
output file consisting of the rules. If X and Y are items in a record, then a rule such as
X → Y
indicates that whenever X occurs in a record, expect Y to occur with some frequency.
Components of a Generated Association Rule
The strength of the association is quantified by three numbers. The first number, the
predictability of the rule, quantifies how often X and Y occur together as a fraction of the
number of records in which X occurs. For example, if the predictability is 50%, X and Y
occur together in 50% of the records in which X occurs. Thus, knowing that X occurs in
a record, expect that 50% of the time Y occurs in that record.
The second number, the prevalence of the rule, quantifies how often X and Y occur
together in the file as a fraction of the total number of records. For example, if the
prevalence is 1%, X and Y occur together in 1% of the total number of records.
You can specify a minimum prevalence threshold for the generated rules. The default
minimum prevalence threshold is 1%. The lower the minimum prevalence, the more
rules are generated, and the slower the performance of the tool might be. You can also
specify a minimum predictability threshold for the generated rules. The minimum
predictability threshold default is 50%.
Rules that meet a minimum prevalence threshold are important for two reasons:
1. A rule might have business value only if a reasonably significant fraction of records
support the rule. For example, if everyone who buys caviar also buys vodka, the
rule Caviar → Vodka has 100% predictability. However, if only a handful of people
buy caviar, the rule might be of limited value to the retailer.
2. A rule might not be statistically significant if a very small number of records
support the rule. The rule might be due to chance, and it would not be prudent to
make decisions based on such a rule.
The third number is expected predictability. The expected predictability is the frequency of
occurrence of the RHS items. So the difference between expected predictability and
predictability is a measure of the change in predictive power due to the presence of the
LHS of the rule. Expected predictability gives an indication of what the predictability would be
if there were no relationship between the items.
The Association Rules generator does not report rules in which the predictability is less
than the expected predictability. In other words, a rule such as A → B is not reported if the
frequency of B among records that contain A is less than the frequency of B alone.
Note: Given just Y and a rule of the form X → Y, nothing is known about X. Rules specify
implications only from the LHS to the RHS.
Table 9-1 summarizes the three numbers that quantify the strength of each association
rule.
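To make the three measures concrete, the sketch below computes them for a rule X → Y over a small list of market baskets. It simply restates the definitions above in Python and is not MineSet code; the basket data is invented.

def rule_measures(baskets, lhs, rhs):
    """Return prevalence, predictability, and expected predictability
    (all as percentages) for the rule lhs -> rhs."""
    n = len(baskets)
    n_lhs  = sum(1 for b in baskets if lhs in b)
    n_rhs  = sum(1 for b in baskets if rhs in b)
    n_both = sum(1 for b in baskets if lhs in b and rhs in b)

    prevalence     = 100.0 * n_both / n      # both items, out of all records
    predictability = 100.0 * n_both / n_lhs  # both items, out of records with LHS
    expected       = 100.0 * n_rhs / n       # RHS items, out of all records
    return prevalence, predictability, expected

baskets = [{"diapers", "baby powder"},
           {"diapers", "baby powder", "beer"},
           {"diapers"},
           {"bread"},
           {"baby powder"}]
print(rule_measures(baskets, "diapers", "baby powder"))
# -> (40.0, 66.66..., 60.0): knowing a basket contains diapers raises the
#    chance of baby powder from 60% (expected predictability) to about 67%.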
Hierarchical Data
The rules generator also works on hierarchical data, which includes a component that
relates (or maps) data to new data at varying degrees of generality. The ability to handle
hierarchical data allows rules to be generated at the desired level of generality.
For example, consider the hierarchy shown in Table 9-2. This hierarchical information, in
addition to the market basket data that lists the products purchased in each record,
allows rules to be generated at four levels. In contrast to rules learned at the lowest level,
which relate specific products to each other, a rule at the highest level might be "Milk
implies Bread."
Rules Visualization
The rules visualization part lets you graphically display and explore the generated
association rules. The rules are presented on a grid landscape, with left-hand side (LHS)
items on one axis, and right-hand side (RHS) items on the other. As shown in Figure 9-2,
attributes of a rule are displayed at the junction of its LHS and RHS item. The display can
include bars, disks, and labels.
Table 9-1 Association Rules Components

Measure                     Description
Prevalence                  Frequency of LHS and RHS occurring together.
Predictability              Fraction of RHS out of all items with LHS, or the prevalence
                            divided by the frequency of occurrence of LHS items.
Expected Predictability     Frequency of occurrence of RHS items.

Table 9-2 Example of Hierarchical Levels

Level                        Example
Product Group                Milk
Category                     Non-Refrigerated Milk
Brand                        Lucerne®
Product ID (UPC/SKU Code)    1 pint can of Premium Condensed Milk
Figure 9-2 Detail View of the Rules Visualizer's Main Window
If the displayed view is too small, item labels do not appear on the side of the axes. You
can zoom in on the view until the item labels appear (see the Dolly description in
Thumbwheels in Chapter 6).
A legend indicating the mapping between displayed attributes (such as bar heights and
colors) and the values associated with the underlying rules (such as predictability and
prevalence) can be displayed at the bottom of the main window.
The Tool Manager interface for associations allows you to run the Rules Visualizer
without running associations, by using the Visual Tools menu from Tool Manager's main
window.
File Requirements
Each of the Rules Visualizer's three components has its own file requirements. These are
detailed in the following subsections.
Files Required by the Association Data Converter Part
A raw data file that results from extracting raw data from a source (such as a
relational database). This file is processed by the association data converter to
produce the internal binary data file used by the association rules generator.
A format file that specifies the format of the data file. If the internal binary data file
(see next subsection) is created via the Tool Manager, this format file is created
automatically. If the internal binary data file is created via the command line, this
format file must be created manually (see Appendix F, "Creating Data and
Configuration Files for the Rules Visualizer").
Files Required by the Association Rules Generator Part
An internal binary data file, which results from running the association data
converter on your original data.
If you have hierarchical data, the association rules generator also requires the
following two files:
A mapping file, which specifies the mapping between hierarchical levels.
A description file, which specifies a string description for each item at a specific
hierarchical level.
Files Required by the Rules Visualization Part
A rules file that results from running the association rules generator.
A .ruleviz configuration file that specifies parameters used by the rules visualizer
program (such as mapping colors to prevalence values) when displaying the
generated rules. This file is easily created using the Tool Manager (see Chapter 3).
You also can use an editor (such as jot, vi, or Emacs) to produce this file (see
Appendix F, "Creating Data and Configuration Files for the Rules Visualizer").
These configuration files must have a .ruleviz extension.
Starting the Rules Visualizer
The Rules Visualizer has three components. The following subsections describe the
procedure for starting each one.
Starting the Association Data Converter Part
There are two ways to start the association data converter part of the Rules Visualizer:
Use the Tool Manager to configure and start the data converter. (See Chapter 3 first
for details on most of the Tool Manager's functionality, which is common to all
MineSet tools; see below for details about using the Tool Manager in conjunction
with the data converter.)
Enter the following command at the UNIX shell command-line prompt:
assoccvt parameters
The parameters are described in Appendix F, "Creating Data and Configuration Files
for the Rules Visualizer."
Starting the Association Rules Generator Part
There are two ways to start the association rules generator part of the Rules Visualizer:
Use the Tool Manager to configure and start the association rules generator. (See
Chapter 3 first for details on most of the Tool Manager's functionality, which is
common to all MineSet tools; see below for details about using the Tool Manager in
conjunction with the association rules generator.)
If the data with which you are working is non-hierarchical, enter this command at
the UNIX shell command line prompt:
assocgen parameters
If your data is hierarchical, enter this command at the UNIX shell command-line
prompt:
mapassocgen parameters
The parameters for both instances are described in Appendix F, "Creating Data and
Configuration Files for the Rules Visualizer."
Starting the Rules Visualization Part
There are five ways to start the rules visualization part of this tool:
Use the Tool Manager to configure and start the Rules Visualizer. (See Chapter 3
first for details on most of the Tool Manager's functionality, which is common to all
MineSet tools; see below for details about using the Tool Manager in conjunction
with the Rules Visualizer.)
Double-click the Rules Visualizer icon, which is in the MineSet page of the icon
catalog. The icon is labeled ruleviz. Since no configuration file is specified, the
start-up screen requires you to select one by using File > Open.
If you know what configuration file you want to use, double-click the icon for that
configuration file. This starts the Rules Visualizer and automatically loads the
configuration file you specified. This only works if the configuration filename ends
in .ruleviz (which is always the case for configuration files created for the Rules
Visualizer via the Tool Manager).
Drag the configuration file icon onto the Rules Visualizer icon. This starts the Rules
Visualizer and automatically loads the configuration file you specified. This works
even if the configuration filename does not end in .ruleviz.
Enter this command at the UNIX shell command-line prompt:
ruleviz [ configFilename ]
When starting the rules visualization part of this tool, you must specify the configuration
file, not the data or rules file.
Option for Invoking the Rules Visualizer
The -quiet option eliminates the dialogs that pop up to indicate progress. You can enable
this option permanently by adding the line
*minesetQuiet:TRUE
to your .Xdefaults file.
Configuring the Rules Visualizer Using the Tool Manager
This section describes how the components of the Rules Visualizer can be configured
using the Tool Manager. Although the Tool Manager greatly simplifies the task of
configuring the Rules Visualizer, you can construct a configuration file for this tool using
an editor (see Appendix F, "Creating Data and Configuration Files for the Rules
Visualizer").
Note that the steps required to connect to a data source are described in Chapter 3.
The sections below follow the configuration and invocation of the Rules Visualizer
components in the conventional order:
creating a file for the association rules generator
generating rules
displaying rules
Setting Up Associations
To show how to set up associations, the following example uses the cars database table.
Assume that you want to find out if there is an association between miles per gallon,
horsepower, and the year the car was built. For example, did mileage improve over time?
Did engines become less powerful? The following steps (and Figure 9-3) show you how
to set up the associations and map table columns to the data you want to study.
Figure 9-3 Initial Tool Manager Window for Association Generation
1. Connect to a MineSet server. Refer to Chapter 2, "Setting Up MineSet," if you need
help.
2. Open a data source.
3. (Optional step) In the Data Transformations tab you can choose the transformations
you want to do on the data before you give it to the associations engine. One
recommended transformation is to create bins for numeric data. (The binning
operation and the options available for it are described in detail in Chapter 3.) This
leads to more meaningful rules from the association engine. For example, instead
of using discrete values for the weightlbs attribute in the cars table such as 3504,
3693, 3436, 3433, and so on, it may be more meaningful to give weightlbs_bin value
ranges such as 1600-2500, 2501-3500, and so on.
For this example, click on the Bin Columns button, and select all the columns in the
Bin Column window for binning.
Note: If you run associations without binning any of the numerical columns (ints,
floats, doubles), you get the warning message
Running associations on unbinned non-categorical data. Binning is
recommended for producing more useful results.
4. Choose the Mining Tools tab from the Data Destination tab.
5. Choose the Assoc. tab (abbreviation for Associations) from the Mining Tools tab.
Applying Association Rule Options
After selecting a data source, you can run the Association Rules generator. You can
choose options for this by clicking on the Assoc Options button. This causes the dialog box
in Figure 9-4 to be displayed.
Figure 9-4 Association Rule Options Dialog Box
Prevalence lets you specify the minimum prevalence threshold as a percentage of the
total number of records. Rules with a prevalence below this value are not generated. The
default is 1%. The possible values are 0-100.
Predictability lets you specify the minimum predictability threshold for rules. Rules
with a predictability below this value are not generated. The default is 50%. The possible
values are 0-100.
6. Once you have made your association rule options selections, click the OK button.
This returns you to the Tool Manager startup screen.
Mapping Columns to Association Items
Figure 9-5 Association Mappings Dialog Box
The Current Columns text panel can list multiple table columns. By
mapping specific columns to association items, the association rules generator can find
the association between any possible pair of those items.
1. Click on the Assoc. Mappings button to open the Mapping Columns to Assoc Items
dialog box.
The Mapping Columns to Assoc Items window shows two panels:
Columns shows the columns in the data
Items shows the mapping between columns in the data and items
The Map All button on this window can be used to map all the attributes in the data
source to items for the associations engine. The Clear All and Clear Selected buttons
can be used to clear/change the mapping between columns and items.
2. The default behavior is to map all columns to items. Therefore, if you omit this step
or if you open this window, you find all columns mapped. For this example, click
OK.
Specifying Ruleviz Options
Clicking on the Ruleviz Options button causes a new dialog box to be displayed
(Figure 9-6). This lets you change some of the Rules Visualizer options from their default
values.
Figure 9-6 Rule Visualizer Options Dialog Box
This dialog box has two panels: the top one lets you set options for bars and disks; the
bottom one lets you specify options for items, the grid, and labels.
Items in the top panel are listed below:
Height button lets you specify whether the bar and disk heights are to be
normalized so that the tallest bar equals the height field value (Max Height), or
whether they are to be scaled by the height field value (Scale Height).
Height field lets you enter the maximum or scale value for bar and disk heights.
Hide Distance lets you specify the distance at which disks are not graphically
represented. Smaller numbers in this field specify a shorter distance; this means
fewer disks are shown and performance is greater. Larger numbers indicate a
greater distance; this means disks further away remain visible.
Legends lets you enter a text string that appears as mapping information displayed
at the bottom of the main Rules Visualizer window. This is information about
mapping between display entities and data values (for example, bar height
corresponds to predictability values).
Color list lets you add or edit a color. To add a color to the list, click the + button. To
edit a color, click the color. See "Choosing Colors" and "Using the Color Browser"
in Chapter 3 for a more detailed explanation of how to choose and change colors.
Mapping lets you specify whether the color change that is shown in the graphic
display is Continuous or Discrete. If you choose Continuous, the color values (of the
bars or disks) shift gradually between the colors entered in the Color list field as a
function of the values that are mapped to those colors in the Color list field.
Example 9-1
If you
used the Color Browser to apply red and green (for bars and/or disks)
selected Discrete for the Mapping
entered the values 0 and 100
then the display shows all bars and/or disks with values less than 50 in red, and all
those with values greater than or equal to 50 in green.
Example 9-2
If you
used the Color Browser to apply red and green (for bars and/or disks)
selected Continuous for the Mapping
entered the values 0 and 100
then the display shows all bars and/or disks with values less than or equal to 0 as
completely red, those greater than or equal to 100 as completely green, and those
between 0 and 100 as shadings from red to green.
If no mapping and values are specified, a continuous mapping is used, and values are
generated automatically from the minimum value to the maximum value in the data.
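The continuous mapping amounts to a linear blend between adjacent colors in the list, clamped at the end values. The sketch below only illustrates that idea for the two-color red/green case above; it is not MineSet's exact color model:

def continuous_color(value, lo=0.0, hi=100.0,
                     lo_rgb=(1.0, 0.0, 0.0),    # red at the low value
                     hi_rgb=(0.0, 1.0, 0.0)):   # green at the high value
    """Linearly blend between two RGB colors, clamping outside [lo, hi]."""
    t = max(0.0, min(1.0, (value - lo) / (hi - lo)))
    return tuple((1 - t) * a + t * b for a, b in zip(lo_rgb, hi_rgb))

print(continuous_color(0))     # (1.0, 0.0, 0.0)  completely red
print(continuous_color(50))    # (0.5, 0.5, 0.0)  a shading between red and green
print(continuous_color(120))   # (0.0, 1.0, 0.0)  completely green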
Items in the bottom panel are as follows:
Items On and Grids On checkbox buttons let you determine whether items (the
names on the side of the grid) are displayed or hidden.
Size (for Items, Grid, and Bar Labels) lets you specify the size for items, the grid, and
bar labels. If you mapped a column value to bar labels in the Requirements panel of
the Tool Manager startup screen, you can specify a size for those labels.
Color (for Left-Hand Items, Right-Hand Items, Grid, and Bar Labels) lets you specify
the color for LHS and RHS items, the grid, and bar labels. If you mapped a column
value to bar labels in the Requirements panel of the Tool Manager startup screen,
you can specify a color for those labels.
Hide Distance lets you specify the distance at which the LHS items, RHS items, grid,
or labels become invisible. Smaller distances might improve performance, but the
objects disappear more quickly. The higher the number, the greater the distance at
which labels are hidden.
Message lets you specify the message displayed when the pointer is moved over an
object or when an object is selected. (See Figure 9-8.) The syntax of the message
string is the same as for the mapviz message string. See the "Message Statement"
section in Appendix C, "Creating Data, Configuration, Hierarchy, and GFX Files for
the Map Visualizer."
Mapping Columns to Visual Elements
The Rules Visualizer lets you map attributes of the rules to visual elements of the display.
Clicking on the RuleViz Mappings button brings up the Ruleviz Mappings panel shown in
Figure 9-7.
Figure 9-7 The Rules Visualizer's Mappings Panel
The visual elements that can be mapped are listed below, where the items with * are
optional:
Height - Bars lets you specify what the bar heights represent.
*Height - Disks lets you specify what the disk heights represent.
*Color - Bars lets you specify what the bar colors represent.
*Color - Disks lets you specify what the disk colors represent.
*Label - Bars lets you specify what the bar labels represent.
The default mappings are as follows:
predictability to bar heights
expected predictability to disk heights
lift to bar and disk colors
Lift is the predictability divided by the expected predictability.
Invoking the Rules Visualizer
To see the Rules Visualizer graphically represent your data, click the Run Assoc & Rules
button at the bottom of the Associations tab in the Data Destination panel of the main
Tool Manager window.
Working in the Rules Visualizer's Main Window
The Rules Visualizer part of this tool graphically displays the data in a rules file using the
specifications of a valid configuration file. For example, specifying group.ruleviz results in
the image shown in Figure 9-8.
Figure 9-8 Initial Rules Visualizer View When Specifying group.ruleviz
The rules are presented on a grid, initially displayed with left-hand side (LHS) items
displayed on the left side of the window and right-hand side (RHS) items on the right. A
rule is displayed at the junction of its LHS and RHS items. The display can include bars,
disks, and labels.
When the scene is close enough, the LHS and RHS axes are labeled with the item names,
unless this has been turned off in the configuration file. (To view the grid and rules at
closer range, use the Dolly thumbwheel, described in "Thumbwheels" in Chapter 6.)
You can change the labels as well as what the heights and colors of the bars and disks
represent by modifying the configuration file via the Tool Manager (see Chapter 3) or
using an editor to change the configuration file.
For example, in Figure 9-8, bar heights correspond to predictability values, bar colors
correspond to prevalence values, and disk heights correspond to expected predictability.
Viewing Modes
The two modes of viewing are grasp and select. To toggle between these modes, press the
Esc key. You also can change from one mode to the other by clicking the appropriate
button: to enter select mode, left-click the arrow button (to the top right of the main
window); to enter grasp mode, left-click the hand button (immediately below the arrow
button, near the top right of the main window as shown in Figure 9-2).
Grasp Mode
In grasp mode the cursor appears as a hand. This mode supports panning, rotating, and
scaling the scene's size in the main window.
To rotate the display, press the left mouse button and move the mouse in the
direction you want to rotate. (Also see the rotating controls Rotx and Roty described
in Thumbwheels in Chapter 6.)
To pan the display, press the middle mouse button and drag it in the direction you
want the display panned.
To move the viewpoint forward, press the left and middle mouse buttons
simultaneously and move the mouse downwards. To move the viewpoint
backward, press the left and middle mouse buttons simultaneously and move the
mouse upwards. This is equivalent to the functions provided by the Dolly
thumbwheel.
Select Mode
In select mode, you can obtain additional information about a rule by placing the cursor
over a bar. This highlights the selected bar and causes information about the rule
represented by that bar to appear at the top of the main window.
Figure 9-9 Cursor Over a Rules Visualizer Object
The information is displayed as long as the cursor remains over the object. If you position
the pointer cursor over an object and click the left mouse button, that same information
appears in the Selection Window, which is above the main window, under the
Selection label.
This Selection information remains visible until another object is selected, or until no
object is selected (if you click the black background). Using the mouse, you can cut and
paste this text into other applications, such as reports or databases.
External Controls
Several external controls surround the main window, including buttons and
thumbwheels. (These controls are the same as in the other MineSet visualization tools
and are described in "Buttons" in Chapter 6 and "Thumbwheels" in
Chapter 6.)
The Height Slider
The Height slider, at the upper left corner of the main window, lets you scale the heights
of objects (bars and disks) in the main window.
Figure 9-10 Rules Visualizer's Height Slider
Pulldown Menus
The Rules Visualizer has three pulldown menus, labeled File, View, and Help.
The File Menu
The File menu (Figure 9-11) contains six options.
Figure 9-11 Rules Visualizer File Menu
Open loads and opens a configuration file, displaying it in the main window.
Previously displayed data is discarded.
Reopen reloads the current configuration file. This is useful if either the
configuration file or data file has changed.
Save As... allows you to save the file under a different file name.
Print Image... allows you to print the image from the screen.
Filter... allows you to filter the data.
Exit closes the current window and exits the application.
The Filter Menu
The Filter menu brings up a Filter panel (Figure 9-12) that lets you reduce the number of
rules displayed in the main viewing area, based on one or more criteria. You can use the
filter panel to fine-tune the display, emphasize specific information, or simply shrink the
amount of information displayed.
Figure 9-12 Rules Visualizer Filter Panel
The top pane lets you filter based on string variables, such as LHS and RHS. To select all
values of a variable, click Set All. To clear the current selections, click Clear. To select a
value, click it. To deselect a value, click it again.
The bottom pane lets you filter based on the values of both string and numeric variables.
To filter numeric values, enter the value, and select a relational operation (=, !=, >, <, >=,
<=). To filter alphanumeric values, enter the string. You can use any of three types of
string comparisons:
Contains indicates that the value contains the specified string. For example, "California"
contains the strings "Cal" and "forn".
Equals requires the strings to match exactly.
Matches allows wildcards:
An asterisk (*) represents any number of characters.
A question mark (?) represents one character.
Square braces ([ ]) enclose a list of characters to match.
For example, "California" matches "Cal*", "Cal?fornia", and "Cal[a-z]fornia".
In addition to numeric and string comparison operations, you can specify Is Null.
Currently, this option does not match any rules, resulting in an empty display.
To the right of each field is an additional option menu that lets you specify And or Or
options. For example, you could specify sales > 20 And < 40. You can have any number
of And or Or clauses for a given variable, but cannot mix And and Or in a single variable.
Click the Filter button to start filtering. If you press Enter while the panel is active,
filtering starts automatically.
Click the Close button to close the panel.
The View Menu
The View menu (Figure 9-13) contains one option.
Figure 9-13 Rules Visualizer View Menu
Use Symmetric Axes controls how items are displayed along the left- and right-hand side
axes. If enabled, every item appears on both axes, making the axes identical. Otherwise,
only the required items appear on each axis.
The Help Menu
The Help menu is the same for all MineSet visualization tools (see "The Help Menu" in
Chapter 5 for a complete description).
Sample Files
The provided sample data, rules, and configuration files demonstrate the features and
capabilities of the Rules Visualizer.
Sample Files for the Association Data Converter
There are two sample files provided for each of the two formats of the association data
converter. These files are located in the /usr/lib/MineSet/assoccvt/examples directory.
sing.dat and sing.fmt
The sing.dat file is a raw data file type, as described in "Files Required by the
Association Data Converter Part" on page 294. The sing.fmt file is the format file
described in the same section. Both files are of the single-item-per-record format.
mult.dat and mult.fmt
The mult.dat file is a raw data file type, as described in "Files Required by the
Association Data Converter Part" on page 294. The mult.fmt file is the format file
described in the same section. Both files are of the multiple-item-per-record format.
Sample Files for the Association Rules Generator
These files are located in the /usr/lib/MineSet/assocgen/examples directory. Except for the
synthn.dsc file, the sample files for the association rules generator are provided in 2-byte
and 4-byte integer versions. The difference between the respective files is that the 4-byte
integer version requires twice the amount of storage space of the 2-byte integer version.
synthn.dsc
This is a description file for items at the nth level of the hierarchy. For example, if n
is 0, this file describes the lowest level; if n = 1, the file describes the next higher
level of the hierarchy, and so forth. Description files are common to both 2-byte and
4-byte integer files.
Two-byte Integer Version
synths.dat
This is a data file with 2-byte integers. It corresponds to the data shown in Table F-9
on page 658.
synths.map
This is a 2-byte integer mapping file for hierarchical data.
Four-byte Integer Version
synthb.dat
This is a data file with 4-byte integers. It corresponds to the data shown in Table F-9
on page 658.
synthb.map
This is a 4-byte integer mapping file for hierarchical data.
Sample Files for the Rules Visualization Part
The following sample rules and configuration files are provided for use with the rules
visualization part of this tool. These files correspond to the hierarchical datasets. Rules
files contain the generated rules obtained by running the association rules generator part
of the Rules Visualizer. Rules files must have a .rules extension. Each configuration file
specifies how the corresponding rules file is displayed. Configuration files must have a
.ruleviz extension. The files mentioned in this subsection are in the
/usr/lib/MineSet/ruleviz/examples directory.
group.rules and group.ruleviz
These files provide the generated rules and configuration specifications for product
groups, such as bread and baked goods, dairy milk, and carbonated beverages.
category.rules and category.ruleviz
These files provide the generated rules and configuration specifications for product
categories within product groups, such as refrigerated or non-refrigerated milk.
people94.rules and people94.ruleviz
These files provide the generated rules and configuration specifications for a census
database, showing associations among marital status, education level, age, income,
and other variables.
germanCredit.rules and germanCredit.ruleviz
These files provide the generated rules and configuration specifications for a credit
database from Germany, showing associations among credit history, employment,
savings, and other variables.
See /usr/lib/MineSet/ruleviz/examples/README for additional information on the files in
that directory.
Chapter 10
10. MineSet Inducers and Classifiers
This chapter provides an introduction to classifiers and the algorithms that build them,
called inducers. MineSet provides three inducer-classifier pairs:
Decision Tree
Option Tree
Evidence
The information in this chapter is equally applicable to all the MineSet classifiers and
inducers. The chapter consists of two parts: the first part introduces the basic concepts,
and the second part details how to apply those concepts via the Tool Manager.
Detailed descriptions of the MineSet inducers and classifiers are provided in Chapter 11,
"Inducing and Visualizing the Decision Tree Classifier," Chapter 12, "Inducing and
Visualizing the Option Tree Classifier," and Chapter 13, "Inducing and Visualizing the
Evidence Classifier."
Classifiers
A classifier predicts one attribute of a set of data, given several other attributes. For
example, if you have data on customers of a telecommunications company, a classifier
can be generated to predict whether the customer will churn (leave the company) given
information such as whether the customer has voice mail or an international plan,
and how much time they spend on the phone. The attribute being predicted is called the
label, and the attributes used for prediction are called the descriptive attributes.
MineSet can build a classifier automatically from a training set. The training set consists
of records in the data for which the label has been supplied. For example, you supply a
database table with one column for each descriptive attribute (such as the presence of a
voice mail plan, or the average number of calling minutes per day), and one column for
the label (churned or not). An algorithm that automatically builds a classifier from a
training set is called an inducer.
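The same induce-then-classify flow can be sketched with any learning library. The example below uses scikit-learn purely as a stand-in for a MineSet inducer, with an invented four-record training set (two descriptive attributes and a churn label):

from sklearn.tree import DecisionTreeClassifier

# Training set: descriptive attributes plus a supplied label per record.
X_train = [[265.0, 1], [120.0, 0], [300.0, 1], [90.0, 0]]  # day minutes, intl plan
y_train = ["Yes", "No", "Yes", "No"]                        # churned label

inducer = DecisionTreeClassifier()           # the inducer (the learning algorithm)
classifier = inducer.fit(X_train, y_train)   # building the classifier

# Classifying a new, unlabeled record: only the descriptive attributes are given,
# and the classifier predicts the label.
print(classifier.predict([[280.0, 1]]))      # likely ['Yes'] for this toy data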
When a classifier is generated, MineSet also generates a visualization that can help you
understand how the classifier operates. This visualization can also provide valuable
insight into the data itself. Once a classifier is generated, it can be used to classify records
that do not contain the label attribute; the label value is then predicted by the classifier.
Note: See Appendix K for a list of further readings about classifiers as well as
acknowledgements for the datasets used in MineSet sample files.
Decision Tree Classifiers
Figure 10-1 shows the Decision Tree generated by the Decision Tree Inducer for the churn
dataset.
Figure 10-1 The Decision Tree Generated by the Decision Tree Inducer for Churn Dataset
To understand how the Decision Tree classifier assigns a label to each record, look at the
attributes tested at the nodes and the values on the connecting lines. In the Decision Tree
shown in Figure 10-1, the first test (at the root of the tree) is for total day minutes. There
are two branches from this root. If the total day minutes is <= 264.45, the left branch is
taken; otherwise the right branch is taken. The process is repeated until a leaf (node with
no branches) is reached. The leaf is labeled with the predicted class. The leaf represents
a rule that is the conjunction of all tests from the root to the leaf. For example, the
rightmost leaf, labeled No, matches the rule
total day minutes > 264.45 and voice mail plan = yes and
international plan = yes and total day minutes > 276.3
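Written as a predicate over a record (field names taken from the figure; this is a sketch, not MineSet syntax), that path through the tree is simply a conjunction of its tests:

def rightmost_leaf_matches(record):
    """True when a record follows the path from the root to the rightmost
    leaf of the tree in Figure 10-1 (that leaf predicts the class No).
    The second total day minutes test subsumes the first; both appear
    because the attribute is tested twice along the path."""
    return (record["total day minutes"] > 264.45
            and record["voice mail plan"] == "yes"
            and record["international plan"] == "yes"
            and record["total day minutes"] > 276.3)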
Option Tree Classifiers
Figure 10-2 shows an Option Tree generated by the Option Tree inducer.
Figure 10-2 The Option Tree Generated by the Option Tree Inducer for the Cars Dataset
The top node in this figure is an Option node, which indicates that several good
attributes can be chosen at the root. A Decision Tree Inducer picks the single best
attribute for each subtree; however, there might be several good attributes on which to
split. In such cases, an Option Tree can create option nodes. In the example dataset
(Figure 10-2), the task is to predict whether a car is manufactured in Europe, Japan, or the
US. The Decision Tree Inducer picks cubic inches for the root. The Option Tree inducer
chooses several options: cubic inches, cylinders, weight, mpg, and brand are all good
choices for the root.
Option nodes can appear elsewhere besides the root. With the default settings, however,
they appear only at the root or one level below the root (after a single test node).
Option Trees usually take 10 to 15 times longer to build than do Decision Trees, but they
provide two significant advantages:
1. Comprehensibility. Option nodes let you see several likely options. Instead of
having to settle for a single attribute, option nodes let you choose from several.
When you fly over the tree, you can choose to follow an option that you believe is
easier to understand or that you believe is better for predictions based on your
background knowledge of the problem.
2. Accuracy. In many cases, Option Trees are more accurate (have lower error-rates)
than Decision Trees. Option Trees classify by letting each option vote for each
label value, then averaging the votes. This is similar to having a series of experts,
each one attempting to predict the label based on a different main criterion. The
option node averages all these experts' votes. Just as distributing stock investments
reduces the risk, using a mixture of options usually results in a more stable, less
risky classifier.
Evidence Classifiers
Figure 10-3 shows the evidence information generated by the Evidence Inducer.
Figure 10-3 Results of Evidence Inducer for Iris Dataset
The right window of the screen shows the distribution of the classes in the training set.
The left side shows rows of cake charts, one for each attribute. For every value of an
attribute in the data, there is one chart matching it in the row for the attribute. Given a
record with an attribute value corresponding to a chart, the chart represents how much
evidence the classifier adds to each possible label value. For example, in Figure 11-3, a
record with total day minutes < 175 shows much evidence for the No label value, and
little evidence for the Yes label value. After evidence is accumulated from all the
attributes, the label value with the most evidence is predicted.
Inducers
An inducer is an algorithm that builds a classifier from a training set, which consists of
records with labels. The training set is used by the inducer to "learn" how to construct
the classifier, as shown in Figure 10-4.
Figure 10-4 Method for Building a Classifier
Once the classifier is built, its structure can be visualized or used to classify unlabeled
records, as shown in Figure 10-4 and Figure 10-5.
Figure 10-5 Using a Classifier to Label New Records
Running inducers can be a CPU- and I/O-intensive process. For this reason, the MineSet
inducers run on the MineSet server, rather than on the MineSet client (see Figure 10-6).
Figure 10-6 Tool Execution Sequence for Classifiers
(Diagram: the MineSet client runs the Tool Manager and visualization tools for the user's visual display; the MineSet server runs DataMover against the user's data source and the inducer (MIndUtil), which produces the classifier along with visualization files, data files, and information and statistics such as the error estimate.)
Training Set
Inducers require a training set, which is a table containing attributes, one of which is
designated as the class label. The label attribute type must be discrete (binned values,
character string values, or a few integers). The number of possible values for the label
attribute should be small, preferably two or three values. Let's look at an example where
the goal is to classify the type of an iris flower (iris-setosa, iris-versicolor, or iris-virginica)
given as descriptive attributes its sepal length, sepal width, petal length, and petal width.
Figure 10-7 shows several records from a sample training set for this problem.
Figure 10-7 Sample Records From a Training Set
Once a classifier is built, it can classify new records as belonging to one of the above
classes (see Figure 10-5). These new records must be in a table that has all the attributes
used by the classifier with the same name and type as they were in the training set. The
table need not contain the label attribute. If it exists, it is ignored during classification.
Applying a Model
After building a classifier, you can apply it to records to predict their label. For example,
if you built a classifier for predicting iris type, you can apply the classifier to records
containing only the descriptive attributes, and a new column is added with the predicted
iris type.
In a marketing campaign, for example, a training set can be generated by running the
campaign in one city and generating label values according to the responses in that city.
A classifier can then be induced, and campaign mail can then be sent only to people who
are labeled by the classifier as likely to respond, thus saving mailing costs.
The sample records in Figure 10-7 are:

sepal length   sepal width   petal length   petal width   iris type (label)
5.1            3.5           1.4            0.2           Iris-setosa
5.9            3             5.1            1.8           Iris-virginica
6.5            2.8           4.6            1.5           Iris-versicolor
6.3            2.9           5.6            1.8           Iris-virginica
6.5            3             5.8            2.2           Iris-virginica
As an example of using mining tools for ensuring data quality, after building a classifier
you can apply it to the training set in order to identify records that are mislabeled by the
classifier. Such records might warrant closer investigation. Perhaps they are noise, or
they might yield special insights. If, for example, you have a Decision Tree for the iris
dataset induced using the Classifier Only mode, by applying the classifier you get a new
column (iris type_1) containing the predicted labels. You can then add a column that is
defined as type int with the expression (iris type != iris type_1). The new column has a 1
whenever the classifier misclassifies, and a 0 when it classifies correctly. Figure 10-8
shows a Scatter Visualizer plot of the data in which the new column is mapped to color,
with the colors set such that green is 0 (OK) and red is 1 (error). By looking at the plot, it is
possible to determine where mistakes are being made.
Figure 10-8 Iris Dataset Misclassification, Example 1
Another alternative is to define the new column as a float with the expression
(iris type != iris type_1) + 0.01. The Scatter Visualizer can then be used with the original
label mapped to color, and this new column mapped to size. Incorrect predictions are
shown as big cubes; correct predictions are shown as small cubes (see Figure 10-9).
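The same flag column can be reproduced outside MineSet for experimentation. The following
is a minimal sketch in Python using pandas; the DataFrame and the column names iris_type
and iris_type_1 are hypothetical stand-ins for the applied-classifier output described above,
not MineSet functionality.

import pandas as pd

# Hypothetical frame holding the original label and the predicted label
# (iris_type_1) produced by applying the classifier.
df = pd.DataFrame({
    "iris_type":   ["Iris-setosa", "Iris-virginica", "Iris-versicolor"],
    "iris_type_1": ["Iris-setosa", "Iris-versicolor", "Iris-versicolor"],
})

# Integer flag: 1 where the classifier misclassifies, 0 where it is correct.
df["error_flag"] = (df["iris_type"] != df["iris_type_1"]).astype(int)

# Float variant analogous to (iris type != iris type_1) + 0.01, so that
# correctly classified records still get a small nonzero size.
df["error_size"] = df["error_flag"] + 0.01

print(df)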
Figure 10-9 Iris Dataset Misclassification, Example 2
Error Estimation
When a classifier is built, it is useful to know how well you can expect it to perform in
the future (that is, what the classifier's error-rate is). Factors affecting the classification
error-rate include:
• The number of records available in the training set.
Since the inducer must learn from the training set, the larger the training set, the
more reliable the classifier should be; however, the larger the training set, the longer
it takes the inducer to build a classifier. The improvement to the error-rate decreases
as the size of the training set increases (this is a case of diminishing returns).
• The number of attributes.
More attributes mean more combinations for the inducer to compute, making the
problem more difficult for the inducer and requiring more time. Note that
sometimes random correlations can lead the inducer astray; consequently, it might
build less accurate classifiers (technically, this is known as overfitting). If an
attribute is irrelevant to the task, remove it from the training set (this can be done
using the Tool Manager).
• The information in the attributes.
Sometimes there is not enough information in the attributes to correctly predict the
label with a low error-rate (for example, trying to determine someone's salary based
on their eye color). Adding other attributes (such as profession, hours per week,
and age) might reduce the error-rate.
• The distribution of future unlabeled records.
If future records come from a distribution different from that of the training set, the
error-rate probably will be high. For example, if you build a classifier from a
training set containing family cars, it might not be useful when attempting to
classify records containing many sports cars, because the distribution of attribute
values might be very different.
The two common methods for estimating the error-rate of a classifier are described
below. Both of these assume that future records will be sampled from the same
distribution as the training set.
• Holdout: A portion of the records (commonly two-thirds) is used as the training set,
while the rest is kept as a test set. The inducer is shown only two-thirds of the data
and builds a classifier. The test set is then classified using the induced classifier, and
the error-rate or loss on this test set is the estimated error-rate or estimated loss.
Figure 10-10 shows this error estimation method.
Figure 10-10 Estimating the Classifier's Accuracy
This method is fast, but since it uses only two-thirds of the data for building the
classifier, it does not make efficient use of the data for learning. If all the data were
used, it is possible that a more accurate classifier could be built.
• Cross-validation: The data is split into k mutually exclusive subsets (folds) of
approximately equal size. The inducer is trained and tested k times; each time, it is
trained on all the data minus a different fold, then tested on that holdout fold. The
estimated error-rate is then the average of the errors obtained. Figure 10-11 shows
cross-validation with k=3 (note that the default value is k=10).
Cross-validation can be repeated multiple times (t). For a t times k-fold
cross-validation, k*t classifiers are built and evaluated. This means the time for
cross-validation is k*t times longer. By default, k=10 and t=1, so cross-validation
takes approximately 10 times longer than building a single classifier.
Increasing the number of repetitions (t) increases the running time and improves
the error estimate and the corresponding confidence interval.
You can increase or decrease k. Reducing it to 3 or 5 shortens the running time;
however, estimates are likely to be biased pessimistically because of the smaller
training set sizes. You can increase k, but this is recommended only for very small
datasets. A minimal sketch of both estimation methods appears after this list.
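The two procedures can be sketched in a few lines of Python with scikit-learn. This is only
an illustration under assumed choices (the iris data, a decision-tree learner, a fixed random
seed); it is not MineSet code.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Holdout: train on two-thirds of the records, test on the remaining third.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=2 / 3, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
holdout_error = 1.0 - clf.score(X_test, y_test)

# Cross-validation: k folds (k=10 is the MineSet default); the estimated
# error is the average error over the k held-out folds.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
cv_error = 1.0 - scores.mean()

print(f"holdout error: {holdout_error:.3f}, 10-fold CV error: {cv_error:.3f}")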
Figure 10-11 Classifier Cross-Validation (k=3)
Generally, a holdout estimate should be used at the exploratory stage, as well as on
datasets of over 5,000 records. Cross-validation should be used for the final classifier
building phase, as well as on small datasets.
Backfitting in Error Estimation
An inducer builds a classifier, which has two parts:
• Structure: For Decision Trees and Option Trees, the structure is the shape of the
tree. For the Evidence classifier, the structure is the number of bins for every attribute
and the thresholds if the attribute is numeric.
• Probability estimates: Each part of the structure estimates the probability of each
class. These estimates are commonly based on the counts of training records at
different points in the structure. For Decision Trees, the probabilities are determined
by the weight of records at the leaves. For the Evidence classifier, the probabilities
are determined by the conditional probabilities for every attribute value or range.
Backfitting a classifier with a set of records does not alter the structure of the classifier,
but updates the probability estimates based on the given data. Backfitting is useful for
several reasons:
1. A structure can be built from a small training set, then backfitted with a big dataset
to improve the probability estimates in the structure. Backfitting is a faster process
than inducing the classifier's structure.
2. When holdout error estimation is used, a portion of the data is left out for testing.
Once the classifier structure is induced and the error estimated, it is possible to
backfit all of the data through the structure, which can reduce the error of the final
classifier. When counts, weights, and probabilities are shown in the classifier's
structure, they reflect all the data, not just the training set portion.
When using drill-through from the visualizers, you can see data corresponding to the
weights shown, which reflect the whole dataset. If backfitting is not used, the weights
shown represent only the training set.
Confusion Matrices in Error Estimation
Confusion matrices give a more detailed picture of the errors made by a classifier. Instead
of simply analyzing the number of correct and incorrect predictions, the confusion
matrix shows the types of errors being made.
Figure 10-12 shows a confusion matrix for a Decision Tree that was induced on the iris
dataset.
Figure 10-12 Confusion Matrix for Iris Dataset
The two axes represent:
• the class values predicted by the classifier, and
• the actual class values given in the test set (holdout set).
Entries on the diagonal are correct predictions. Entries off the diagonal indicate incorrect
predictions. This representation shows that iris-versicolor and iris-virginica are frequently
confused, but iris-setosa is always predicted correctly.
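A confusion matrix of this kind is simply a tally of (actual, predicted) label pairs over the
test set. The short Python sketch below, with invented label lists, illustrates the idea; it is
not how MineSet computes or displays the matrix.

from collections import Counter

# Hypothetical actual and predicted labels for a small test set.
actual    = ["setosa", "versicolor", "virginica", "versicolor", "virginica"]
predicted = ["setosa", "virginica",  "virginica", "versicolor", "versicolor"]

# Tally (actual, predicted) pairs; diagonal entries are correct predictions.
matrix = Counter(zip(actual, predicted))
classes = sorted(set(actual) | set(predicted))

print("actual/predicted".ljust(18) + "".join(c.ljust(12) for c in classes))
for a in classes:
    print(a.ljust(18) + "".join(str(matrix[(a, p)]).ljust(12) for p in classes))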
When the cost of making different types of mistakes is uneven, it is frequently useful to
understand the types of errors that are being made (see loss matrices below).
Note: The confusion matrix shows the errors made on the test set; thus, it represents the
expected true distribution of errors in an actual situation if the underlying distribution
of the data does not change significantly. The confusion matrix in MineSet is computed
prior to backfitting and is the same whether or not backfitting is applied.
Lift Curves in Error Estimation
A lift curve is a graph that plots the cumulative weight of the records from a specified
label value as a function of the weight of all the records. The order in which the records
occur determines the slope of the curve. Typically, a lift curve plots the difference
between randomly ordered records and records sorted based on a classifier's predictions.
For example, in telecommunications, it is valuable to be able to predict which customers
are likely to switch providers (churn). In the churn dataset, about 13.5% of the customers
are likely to switch providers. Figure 10-13 shows the lift curve obtained by using a
Decision Tree classifier on this dataset.
Figure 10-13 Lift Curve for the Churn Dataset
The X axis shows the number of records sampled; the Y axis shows the number of records
corresponding to customers who churn. The lower curve (red) shows the number of
customers expected to churn given a random ordering of the records. The upper curve
(white) shows the number of customers that churn when the records are ordered according
to the classifier's score (probability estimate) for each record. Records representing customers
that the classifier identifies as most likely to churn appear first; those less likely to churn
appear last. The lift that the classifier ordering provides can be seen by the difference
between the classifier curve and the random curve.
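In outline, the classifier curve is obtained by sorting the test records by the classifier's
score for the chosen label value and accumulating how many of them actually have that
label. The sketch below uses invented scores and labels and unit record weights; it is an
illustration, not the MineSet implementation.

import numpy as np

# Hypothetical classifier scores (estimated probability of churn = yes)
# and actual labels (1 = churned, 0 = did not churn) for a test set.
scores  = np.array([0.9, 0.1, 0.8, 0.4, 0.7, 0.2, 0.6, 0.3])
churned = np.array([1,   0,   1,   0,   0,   0,   1,   0])

# Classifier curve: cumulative churners when records are sorted by score.
order = np.argsort(-scores)
classifier_curve = np.cumsum(churned[order])

# Random curve: churners accumulate in proportion to the records sampled.
n = len(churned)
random_curve = churned.sum() * np.arange(1, n + 1) / n

for i in range(n):
    print(i + 1, classifier_curve[i], round(random_curve[i], 2))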
If some action should be taken before customers churn, it should be prioritized according
to the classifier's score. If the action costs money (for example, an operator contacting the
customer or a mailing), lift curves can help identify a cutoff point that maximizes returns.
Note: The lift curve shows lift on the test set; thus, it represents the expected true lift in
actual situations if the underlying distribution of the data does not change significantly.
The lift curve in MineSet is computed prior to backfitting and is the same whether or not
backfitting is applied.
Learning Curves in Error Estimation
A learning curve is a graph that shows the error of the classifier generated by an inducer
as a function of the number of records used to create the classifier. Typically, the more
records used to generate the classifier, the lower its error.
A learning curve is created by generating the specified number of classifiers for each of
the points on the curve. Each classifier is generated using a random sample of the
records, and its error is estimated using the rest of the records (those not used for
training).
Figure 10-14 shows a learning curve for the Decision Tree Inducer on the churn dataset.
Figure 10-15 shows a learning curve for the Decision Tree Inducer on the adult dataset
with the label set to gross income binned at $50,000 (so one class is gross income less than
or equal to $50,000 and the other class is gross income >$50,000). The X axis shows the
number of records used for training the inducer; the Y axis shows the error. The graph
consists of four types of points:
• The yellow points are the actual error estimates taken from the runs.
• The white points are averages.
• The blue points interpolate between the white points.
• The red points show a 95% confidence interval about the average based on actual
error estimates for each run.
The more runs that are requested, and the bigger the test set (portion used to test), the
smaller the confidence interval. The error is generally reduced as more records are used
for training.
We can see that for the churn dataset, the error continues to decrease as the training set
size grows, while for the adult dataset, there is little advantage to training on the whole
dataset. The third point represents about 13,000 records and has an estimated error of
16.96%, while the last point represents about 44,000 records and has an estimated error
of 16.85%.
Figure 10-14 Learning Curve for the Churn Dataset
Figure 10-15 Learning Curve for the Adult Dataset With Label Set to Gross Income
Binned at $50,000
A small sample might suffice for most of the study, with the full data used only for the
final runs. In many cases, a small sample can result in a sufficiently accurate classifier,
with the error reducing only slightly if the number of records is increased (diminishing
returns). Once a learning curve is seen, the desired sampling point can be determined,
and the sample transformation in Tool Manager can be used to generate a sample of
this size (see "The Sample Button" in Chapter 3). Small samples reduce the time needed
to build a classifier and make the knowledge discovery process more interactive.
Advanced Options
MineSet supports several advanced options for all inducers. These let you take into
account different costs for making mistakes, as well as an experimental
design that has a non-uniform sampling process (that is, some parts of the true
population are sampled more heavily than others). Another option lets you create more
complicated classifiers, which may have better accuracy at the expense of added compute
time.
Loss Matrices: Not All Mistakes Were Created Equally
Suppose you are trying to classify mushrooms as poisonous or edible. Classifying a
mushroom that is actually edible as poisonous might cost you $2, since you are not eating
it; however, classifying a poisonous mushroom as edible (that is, eating it) might result in a
$10,000 operation.
Figure 10-16 shows a confusion matrix for the mushroom dataset with the Decision Tree
Inducer when only a ratio of 0.1 (10%) of the data was used as the training set.
Figure 10-16 Confusion Matrix for the Mushroom Dataset Using Default Settings
Eight records, representing poisonous mushrooms, were classified as edible (0.1%); 15
records, representing edible mushrooms, were classified as poisonous (0.2%). 3793
edible mushrooms and 3496 poisonous mushrooms were correctly classified. While the
error-rate for the classifier is only 0.31% (less than one percent), our estimated loss would
be $10000*8 + $2*15 = $80,030.
Figure 10-17 shows a confusion matrix for the same dataset, but with the Decision Tree
Inducer run using a loss matrix representing the above costs. The new classifier is very
conservative and makes no mistakes in classifying a poisonous mushroom as edible, but
it makes 1558 mistakes (1543+15) in classifying edible mushrooms as poisonous. The total
estimated loss we would incur is thus $10000*0 + $2*1558 = $3116, only 3% of the cost of
the classifier that did not take losses into account.
Figure 10-17 Confusion Matrix for the Mushroom Dataset With Loss Matrix
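The estimated losses quoted above come from weighting each cell of the confusion matrix
by the corresponding entry of the loss matrix. A brief sketch of that arithmetic for the
default classifier follows; the dictionary layout is only an illustration.

# Confusion matrix cells as (actual, predicted): record counts quoted above.
confusion = {
    ("edible",    "edible"):    3793,
    ("edible",    "poisonous"): 15,
    ("poisonous", "edible"):    8,
    ("poisonous", "poisonous"): 3496,
}

# Loss matrix: cost of predicting a given label for a given actual class.
loss = {
    ("edible",    "edible"):    0,
    ("edible",    "poisonous"): 2,       # discarding an edible mushroom
    ("poisonous", "edible"):    10000,   # eating a poisonous mushroom
    ("poisonous", "poisonous"): 0,
}

total_loss = sum(count * loss[cell] for cell, count in confusion.items())
print(total_loss)   # 10000*8 + 2*15 = 80030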
Loss matrices also allow predicting unknown (null) values, which are shown as question
marks (?). For example, suppose it costs us $1 to ask an outside expert whether a
mushroom is poisonous or edible. In that case, some classifications result in an unknown
prediction. Running the Decision Tree Inducer yields the confusion matrix shown in
Figure 10-18, where there are 1551 unknowns, and only 15 edible mushrooms are
classified as poisonous. The overall cost is thus $10000*0 + $1*1551 + $2*15 = $1,581.
Figure 10-18 Confusion Matrix for the Mushroom Dataset With Loss Matrix Allowing
Unknown Predictions
Note that loss matrices are based on probability estimates made at the leaves of the tree.
For reliable estimates:
1. Raise the "split lower bound" in Further Options of Decision Trees and Option Trees
from the default value to a higher value (for example, 5). In general, the larger and
noisier the training set, the higher this value should be.
2. Use large training sets. You might need large training sets to get reliable estimates
when the costs are not as extreme as in this example.
3. Use Option Trees. While they do not always help, they usually provide better
probability estimates that tend to reduce the loss. For example, running the above
example with $10,000 changed to $100 and unknowns not allowed yields an
estimated loss of $1464 for Decision Trees and an estimated loss of $662 for Option
Trees.
Return-on-Investment Curves
A Return-on-Investment (ROI) curve is similar to a lift curve, but displays accuracy in
terms of loss rather than in terms of error, taking into account the Loss Matrix used. The
points in an ROI curve are ordered by the expected loss for each record, were it to be
labeled with the chosen label value. Similarly, the height of each point in the curve indicates
the cumulative profit (inverse loss), rather than the cumulative accuracy (inverse error),
of all records up to this point. The expected loss is computed by multiplying the entries in
the Loss Matrix under the chosen label's column (see "Loss Matrices" under "Advanced
Options," above) by the probabilities assigned to the corresponding classes by the
classifier. Hence, if the classifier is very sure about its prediction, the
expected loss will be low, and the record will appear near the left side of the ROI curve.
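Put differently, the quantity that orders the records on the ROI curve is the chosen label's
loss-matrix column combined with the class probabilities the classifier assigns to each
record. The sketch below illustrates this for the churn example, using invented probabilities
and the mailing costs discussed in the paragraphs that follow; it is not the MineSet
computation.

# Loss-matrix column for the chosen prediction "yes" (send the mailing):
# cost per actual class, as in the churn example below.
loss_for_yes = {"yes": -10.0,   # churner saved: a gain of $10
                "no":    2.0}   # non-churner mailed anyway: costs $2

# Hypothetical class probabilities assigned by the classifier to two records.
records = [
    {"yes": 0.9, "no": 0.1},    # very likely to churn
    {"yes": 0.2, "no": 0.8},    # unlikely to churn
]

expected_losses = [
    sum(probs[c] * loss_for_yes[c] for c in probs) for probs in records
]
# Records are placed on the ROI curve in order of increasing expected loss,
# so confident, profitable predictions appear near the left.
print(sorted(expected_losses))   # [-8.8, -0.4]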
The idea behind the ROI curve is that the user will take an action for each individual
record in the dataset. That action will be the one associated with the chosen label value.
For example, in the churn dataset, the action associated with the label "yes" might be to
send that person some marketing material. This might stop the person from churning,
but the action is costly if done indiscriminately. The peak of the ROI curve shows
approximately how much money would have been saved on the test set if the classifier
were used to predict whether or not to send the mailing to a particular person.
Special care needs to be taken when filling out a Loss Matrix for use with an ROI curve.
The column under a certain predicted label determines the resulting ROI curve for that
label value. The entries in this column need to represent the expected gain or loss for
taking the action associated with that label value, for each of the possible classes. For
example, the entry under the column "prediction yes" in churn, under the row "actual
value no," may contain the value 2 to indicate that the cost of mailing a brochure (the
action associated with "yes") to someone who was not going to churn is 2 dollars. On
the other hand, the entry under the column "yes," row "yes," may have a value of -10 to
indicate that a customer was prevented from churning, saving the company ten dollars
over the cost of the mailing.
Record Weighting: Not All Records Were Sampled Equally
In certain experimental designs, portions of the true population are sampled more
frequently than others. For example, while you might want a 1% sample of some
population, a small minority that makes up only 0.1% of the population then yields a
0.001% sample relative to the whole population, which might be too small (for instance,
you might get two people). Record weighting lets you give each record a weight; thus,
a subpopulation that was sampled twice as frequently might get a weight of 0.5, while
the rest of the population is given a weight of 1.
As another example, a phone company stores all fraudulent phone calls in the dataset,
while storing only a small fraction of non-fraudulent calls. By using record weighting, it
is possible to give each record its true portion of the population.
Finally, some datasets are already aggregated, and the records have a natural count
associated with them (for example, statistics about cities in the U.S. usually have an
associated count of the population). This count attribute can be mapped to weight, which
is equivalent to replicating each record by the number of counts.
The semantics of record weighting is that a record with a weight of 2 is equivalent to two
records with a weight of 1. Floating point weights are allowed.
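A common way to derive such weights is to make each record's weight inversely
proportional to the rate at which its subpopulation was sampled, so that the weighted data
again reflects the true population. The sketch below uses assumed sampling rates purely
for illustration.

# Assumed sampling rates: the minority stratum was sampled at 2%,
# the rest of the population at 1% (twice as frequently as the baseline).
sampling_rate = {"minority": 0.02, "majority": 0.01}

# One stratum name per record in the sampled dataset (invented example).
record_strata = ["minority", "majority", "majority"]

# Weight each record by the inverse of its stratum's sampling rate,
# scaled so that the baseline (majority) stratum gets weight 1.0.
base = sampling_rate["majority"]
weights = [base / sampling_rate[s] for s in record_strata]
print(weights)   # [0.5, 1.0, 1.0]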
Boosting: Accuracy is Sometimes Crucial
In some cases, the most important issue in creating a classifier is its error rate. For
example, suppose you have analyzed a dataset for churn prediction to a point that you
are satisfied with, and are ready to create a classifier that will predict which of your
customers are the most likely to churn. At this point, you are no longer interested in
visualizing your classifier, since you have a reasonably good understanding of the factors
that are involved. You also want to achieve the best classification accuracy possible. In
this case, you might want to enable Boosting, an algorithm that creates several
different classifiers and combines their predictions using a weighted voting scheme.
Boosting often improves classifier accuracy by focusing the induction process
on examples in the data that are harder to model than others.
Boosting will not always increase accuracy, but it often does. Boosted classifiers cannot
be visualized, though you can still see confusion matrices, lift curves, learning curves,
and ROI curves for boosted classifiers. Boosting is a computationally intensive process,
often taking 25 times longer to run than the corresponding inducer without Boosting.
Backfitting does not work with boosted classifiers, because of the special way Boosting
weights multiple classifiers and the records used to train them.
The theory behind boosting has not been generalized past two-class problems. MineSet
2.5 and later versions, however, allow you to use boosting with labels that have any
number of values. Boosting will not always improve the error rate of induced classifiers;
this is especially true for problems that have more than two label values.
Boosting works by repeatedly assigning new weight distributions to the training set and
inducing classifiers on the reweighted sets. The number of times this occurs is limited by
the BOOST_NUM_TRIALS option, which can be set using the .mineset-classopt file on
the client (see Appendix I, "Command-Line Interface to MIndUtil: Analytical Data
Mining Algorithms," for more information about the .mineset-classopt file). The number of
classifiers generated may be lower than this parameter if the training set error rate drops
to zero before this many classifiers are generated.
The following section describes the options provided for the classifiers by the Tool
Manager.
Inducer Modes in Tool Manager
There are four modes for running an inducer (shown in Figure 10-19):
• Classifier and Error
• Classifier Only
• Estimate Error
• Learning Curve
Figure 10-19 Options for Running the Inducer
The Classifier and Error mode uses a holdout method to build a classifier: a random
portion of the data is used for training (commonly two-thirds) and the rest for testing.
This holdout proportion can be set in Further Inducer Options (see "Error Estimation"
earlier in this chapter). This method is the default mode and is recommended for initial
explorations. It is fast and provides an error estimate.
The Classifier Only mode uses all the data to build the classifier. There is no error
estimation. Use this mode when there is little data or when you build the final classifier.
The Estimate Error mode assesses the error of a classifier that would be built if all the data
were used (as with Classifier Only mode). Estimate Error uses cross-validation, resulting
in long running times. Cross-validation splits the data into k folds (commonly 10) and
builds k classifiers. The process can be repeated multiple times to increase the reliability
of the estimate. You can set the number k and the number of times in Further Inducer
Options, as explained in "Error Options for Inducers," below. Use this method when
there is little data. The induced classifier is exactly the same as the one induced by the
Classifier Only mode.
The Learning Curve mode assesses the effect of training set size on the error of a
classifier.
Error Options for Inducers
The following options are available to fine-tune the error estimation for the inducers. The
Error Options available to you depend on the mode you have chosen.
In both Classifier & Error and Estimate Error, you can set a random seed that determines
how the data is split into training and testing sets. Changing the random seed causes a
different split of the data into training and test sets. If the error estimate varies
appreciably across different seeds, the induction process is not stable.
In Classifier & Error (see Figure 10-20), you can set the Holdout Ratio of records to keep
as the training set. This defaults to 0.666667 (two-thirds). The rest of the records are used
for assessing the error.
Figure 10-20 Error Estimation Options With Holdout
In Estimate Error (see Figure 10-21), you can set the number of folds in cross-validation
and the number of times to repeat the process.
Figure 10-21 Error Estimation Options With Cross Validation
Backfitting
The Backfit test set option is a checkmark that can be found under Further Options for all
inducers when using Classifier & Error mode, and is shown in Figure 10-22. The backfit
checkmark is disabled when Boosting is enabled.
Figure 10-22 Backfitting, Confusion Matrices, Lift Curve, and ROI Curve Options
Confusion Matrices
The Display Confusion Matrix option is a checkmark under Further Options for all inducers
when using Classifier & Error mode; it is shown in Figure 10-22.
ROI Option
The Display ROI Curve option is a checkmark under Further Options for all inducers when
using Classifier & Error mode; it is shown in Figure 10-23. An ROI curve requires a label
value to be chosen. An ROI curve is then generated and displayed for that label value.
Figure 10-23 ROI Option for Generating a Return on Investment Curve
Lift Curves
The Display Lift Curve option is a checkmark under Further Options for all inducers when
using Classifier & Error mode; it is shown in Figure 10-22. A lift curve requires a label
value to be chosen. A lift curve is then generated and displayed for that label value.
Loss Matrices
The Use Loss Matrix option is a checkmark under Further Options for all inducers (see
Figure 10-24). The Edit matrix button can then be used to define the loss matrix. To prevent
unknowns from being predicted, fill the unknown prediction column with the highest
value in the matrix.
Figure 10-24 Enabling Loss Matrices and Setting the Weight Attribute
Weight Setting
The Use Weight option is a checkmark under Further Options for all inducers (see
Figure 10-24). Choose the column for the weight. The Weight is Attribute option
determines whether the inducer can use this attribute for classification purposes or not.
In certain cases where the weight is a result of a stratified sample that is part of the
experimental design, the classifier should not be given access to the weight column, as it
is not a property of the real-world entity.
Learning Curves
Learning Curve is a mode in the Classify menu of the Mining Tools tab. It can be used
with any of the inducers. When the Learning Curve mode is selected, the Further Options
dialog box lets you specify Learning Curve Options (shown in Figure 10-25), including:
• the number of points in the learning curve,
• the number of runs per point, and
• the number of records to use at the start and end points.
The number of records to use at each intermediate point is calculated automatically.
Figure 10-25 Learning Curve Options
The number of points in the learning curve must be specified; also, it must be greater than
or equal to 1. The number of records for the starting and ending points can be specified
to allow generating a learning curve for a specific range of the training set. If either of
these options is left blank, it is calculated automatically based on the number of
points in the learning curve and the total number of records in the training set. This
default covers the entire range of the training set. For instance, assume a file containing
80,000 records. If you specified 3 points in the learning curve, the algorithm generates
points at 20,000, 40,000, and 60,000 records. Often it is useful to zoom in on a smaller
range. For example, a learning curve might be generated only for a range of 1,000 to
10,000 records.
Generating a learning curve takes a significant amount of CPU time. If ti is the time to
train an inducer on training set i (where i ranges from 1 to the number of points), and
there are k runs per point, the total time is k * (t1 + t2 + ... + tn), that is, k times the sum
of the ti over all points. Increasing the number of runs per
point increases the running time proportionally, but improves the estimate of the
average. The default value of the number of runs is 3.
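For example, assuming three points with hypothetical training times of 1, 2, and 4 minutes
and the default of 3 runs per point, the total time would be about 3 * (1 + 2 + 4) = 21 minutes.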
The Scatter Visualizer's filter panel can be used to filter some of the data types shown
(average points, confidence intervals, interpolated points, or actual trials). For example,
you might want to remove the data points for the trials and confidence intervals and
show only the averages and interpolated points.
OK and Cancel Buttons
Once you have specified the Classification Options, click OK to have these options take
effect and to return to the Data Destination panel. To return to the Data Destination panel
without having changes to the options take effect, click Cancel.
Go! Button
After you have set the options, click the Go! button in the Data Destination panel to run
the inducer. The appropriate visualizer will automatically be launched.
The Status Window
After you press Go! in the Data Destination panel, the Status Window at the bottom of
the Tool Manager's main window shows the inducer's progress and the output
classifier's statistics. It displays specific information for the induced classifier. For
example, for Decision Trees it shows the number of nodes, the number of leaves, and the
depth of the Decision Tree (Figure 10-26). This information is saved automatically on
your workstation under the session file name with a -dt.out, -odt.out, or -eviviz.out
extension, depending on whether a Decision Tree, Option Tree, or Evidence Inducer was
executed.
For Classifier & Error, the first series of dots represents reading the file; then information
about the classifier build progress is shown, and then the test set classification progress is
shown.
For Classifier Only mode, there is no test set classification phase.
For Estimate Error, the times and folds are shown.
For Learning Curve, each average point on the x-axis is described on a line, and each
run for that average point is represented by a dot.
Figure 10-26 The Status Window
When you have selected the Classifier & Error mode, the Status window contains the
following information:
• The random seed used to split the data into training and test sets.
• The number of records used for training the inducer.
• The number of records used for evaluating the resulting classifier and, of these test
records, how many were seen during training (excluding the label attribute). It is
possible to have duplicate records in a dataset, so some records can be in both
the training and test set. A large number of seen records indicates that there are many
duplicate records. If their labels are contradictory, it might be impossible to achieve
high accuracy without adding more attributes to the dataset.
• The number of correct and incorrect predictions made.
• The average normalized mean squared error, which represents the accuracy of the
probability estimates. For each test record, the mean squared error is the square of
one minus the probability estimate for the correct label value, plus the sum of the
squares of the probability estimates for the other (incorrect) label values. The
normalized mean squared error is half the mean squared error, which is a value
between zero and one. The average normalized mean squared error is the
normalized mean squared error averaged over all the records in the test set by their
appropriate weights (weighted average). A minimal sketch of this computation
appears after this list.
• The classification error, which is the percent of incorrect predictions.
Both the average mean squared error and the classification error show the standard
deviation of the mean and the confidence interval for the mean. This is the range
you can expect from the classifier if the data comes from the same distribution. For
error estimates (not losses), a more accurate formula than the usual two-standard-deviation
rule is used.
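A minimal sketch of the per-record computation just described follows. The probability
vector is an invented example, not MineSet output, and record weights are ignored.

# Hypothetical probability estimates for one test record over three labels,
# where the correct label is "iris-virginica".
probs = {"iris-setosa": 0.1, "iris-versicolor": 0.3, "iris-virginica": 0.6}
correct = "iris-virginica"

# Mean squared error: (1 - p_correct)^2 plus the sum of squares of the
# probabilities assigned to the incorrect labels.
mse = (1.0 - probs[correct]) ** 2 + sum(
    p ** 2 for label, p in probs.items() if label != correct)

# The normalized mean squared error is half the mean squared error (0..1).
normalized_mse = mse / 2.0
print(normalized_mse)   # (0.4**2 + 0.1**2 + 0.3**2) / 2 = 0.13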
When you have selected the Estimate Error mode, the Status window contains the
following information:
• The number of cross-validation folds and times.
• The random seed.
• The estimated accuracy with standard deviation.
• The 95% confidence interval for the estimated accuracy.
Applying Models, Testing Models, and Fitting New Data
The Apply Model button in the Data Transformations panel lets you:
• take a previously created classifier and apply it to new data.
• test a previously created classifier's performance on the current table.
• fit the current table into a previously created classifier's structure.
On the top left of this dialog box (Figure 10-27) is a list of all classifiers currently available
on the server. If you select a classifier, the right-hand side lists the column names and
types required by that classifier. If these requirements match the current table, a message
at the bottom states this, and the appropriate button at the bottom (OK, Run Test, or Fit Data) is
activated. If the current table does not have all the columns required for the selected
classifier, the message at the bottom states this, the columns that are missing are selected
in the list on the right, and the button at the bottom is deactivated.
Figure 10-27 The Test and Apply Model Dialog Box: Selecting a Classifier
Apply Model
The Apply Model panel is used to apply a previously created classifier to the current
table, as shown in Figure 10-28. There are two modes of application for the classifier:
• To Predict discrete label values for the records in the current table. For example, if you
created a classifier to determine churn, you can use this option to add a column that
labels each customer as either likely to churn or not likely to churn.
• To generate Estimated probability values for a specified label value. Instead of using
the classifier to predict the label value of each record, it is used to estimate the
probability that each record has a specified label value (for example, churn = yes).
Given the classifier created to determine churn, you can use this option to add a
column that indicates the probability that each customer is likely to churn.
The New column name text field lets you specify the name of the new column.
Figure 10-28 The Apply Model Panel
Test Model
The Test Model panel is used to test a previously created classifier on the current table,
as shown in Figure 10-29. The table must contain columns with the names and types
required by the selected classifier. Unlike Apply Model, Test Model also requires the
table to contain a label column with the same name and type as the label column used
when building the classifier.
The Test Model panel has options that let you
• show the confusion matrix of the classifier on the table records
• show the lift curve of the classifier for a specified label value
• show the ROI curve of the classifier for a specified label value
• show a visualization of the classifier with the table used as the test set (this is only
relevant for Decision Tree and Option Tree classifiers)
• select an attribute to use as the record weight
The text field at the bottom of the Test Model panel shows the results.
Figure 10-29 The Test Model Panel
Fit Data to Model
The Fit Data to Model panel is used to fit the data in the current table to a previously
created classifier, as shown in Figure 10-30. This produces a new classifier with the same
structure as the original one; however, the new one uses the data from the table to update
the probability estimates (see "Backfitting in Error Estimation" earlier in this chapter).
Because all of the data from the table is being fit into the structure of the classifier, there
is no error estimation. Fit Data to Model cannot be used on classifiers that were built using
boosting. Use Test Model to evaluate the performance of the new classifier on a separate
test set (disjoint from the fit data).
The Fit Data to Model panel has options that let you
• show a visualization of the new classifier
• specify a name for the new classifier
• select an attribute to use as the record weight
Figure 10-30 The Fit Data to Model Panel
Special Options and Limitations
The following subsections describe how to set special options and the limitations of the
inducers.
Setting Special Options
When the Tool Manager runs an inducer on the server (the MIndUtil program), it passes
certain options to the inducers. Not all options are controlled through the Tool Manager
GUI. Those options not controlled by Tool Manager take on their default values and can
be overridden by setting them in a special file, called .mineset-classopt. Tool Manager
prepends this file to the options sent. This file is optional. Tool Manager looks for it first
in the current directory, then in your home directory. See Appendix I, "Command-Line
Interface to MIndUtil: Analytical Data Mining Algorithms," for more details about the
options.
The file should contain one line per option, in the following format:
<OPTION>=<value>
For example, the special option LOGLEVEL increases the amount of information shown
during the induction process. The default of zero shows very little information. Level 1
shows other options and slightly more information. Level 2 and higher show large
amounts of information about the induction process. These levels are appropriate only if
you have a firm understanding of the induction process. (See Appendix K, "Further
Reading and Acknowledgments.")
Note that these options are not part of the saved session. If you send files to other users,
you might have to send this file separately to them.
Default Limits and How to Override Them
Three limits and their respective options are as follows:
• Discrete attributes are ignored if they have more than 100 values. Discrete attributes
with many values are usually inappropriate for classification. For example, first
names and street addresses are unlikely to form predictive patterns.
To speed up the induction process, attributes with over 100 values are ignored.
You can override this value by setting MAX_ATTR_VALS to a higher number. For
example, your .mineset-classopt file could contain the line
MAX_ATTR_VALS=500
• Discrete labels with over 25 values are not allowed by default. Automatically
induced classifiers are rarely appropriate for predicting one of a large number of
label values. You should limit the label to a few values (preferably two or three).
You can override this default limit by setting the option MAX_LABEL_VALS to a
higher value in your .mineset-classopt file.
• Boosting works by repeatedly assigning new weight distributions to the training set
and inducing classifiers on the reweighted sets. The number of times this occurs is
limited by the BOOST_NUM_TRIALS option, which can be set in the
.mineset-classopt file on the client. The number of classifiers generated may be lower
than this parameter if the training set error rate drops to zero before this many
classifiers are generated.
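Putting these together, a .mineset-classopt file is simply a list of <OPTION>=<value> lines,
one per option. The example below is purely illustrative; the values shown are arbitrary
choices, not recommended settings.
LOGLEVEL=1
MAX_ATTR_VALS=500
MAX_LABEL_VALS=30
BOOST_NUM_TRIALS=10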
Other Limitations
There are three further limitations:
• Floating point numbers are read into MIndUtil as floats (4 bytes) even if they are
represented as doubles (8 bytes) in the database, in the ASCII file, or in the binary
file. This limits the precision and magnitude of the representations allowed.
• Attributes of type array are always ignored.
• Dates are considered strings. Unless there are few dates, such attributes are usually
ignored because of the limit on discrete attributes. You should bin dates before
running an inducer.
Chapter 11
11. Inducing and Visualizing the Decision Tree Classifier
This chapter discusses the features and capabilities of the Decision Tree Inducer. Its
associated visualizer, the Tree Visualizer, is described in Chapter 5. This chapter provides
an overview of this tool and discusses the ways of using it to generate Decision Tree
classifiers. It then explains the Tree Visualizer's functionality when working with the
main window. Finally, it lists and describes the sample files provided for this tool.
Note: It is assumed that you have read Chapter 10, "MineSet Inducers and Classifiers,"
before proceeding with this chapter.
Overview
A Decision Tree classifier assigns each record to a class. The underlying structure used
for classification is a Decision Tree, such as the one shown in Figure 11-1.
Figure 11-1 Decision Tree for the Iris Dataset
Inducing Decision Trees
A Decision Tree classifier is induced (generated) automatically from data. The data,
which is made up of records and a label associated with each record, is called the training
set (see Chapter 10, "MineSet Inducers and Classifiers").
File Requirements
The Decision Tree Inducer requires a training set, as described in "Training Set" in
Chapter 10. Files are generated by extracting data from a source (such as a MineSet ASCII
or binary file, or a table in an Oracle, INFORMIX, or Sybase database). To apply the
generated classifier, you should have a dataset of records with the attributes used by the
classifier, except that the label need not be present.
Running the Decision Tree Inducer
There are two ways to run the Decision Tree inducer:
• From the Tool Manager.
Connect to the server and select a data source (see "Choosing a Data Source" in
Chapter 3).
From the File menu, choose Open New Data File. Log in to a server, and enter the
filename. For the example shown here, the filename entered would be
/usr/lib/MineSet/data/iris.schema. You'll see four continuous
attributes and one discrete attribute in the Data Transformations panel. Since there is
only one discrete attribute, the label option automatically shows it. Select the
Decision Tree inducer, and ensure you have selected the Classifier & Error mode. To
run the inducer, click Go!.
The status window shows the progress and statistics, and the Tree Visualizer is
launched automatically.
• From the command line.
To induce a Decision Tree classifier from the command line, refer to Appendix I,
"Command-Line Interface to MIndUtil: Analytical Data Mining Algorithms."
Configuring the Decision Tree Inducer Using the Tool Manager
To access the options for configuring the Decision Tree inducer, select the Mining Tools
tab on the Data Destination panel (Figure 11-2). From the tabs at the right, select Classify.
Ensure that the inducer you select is Decision Tree (the default). Your selections in the
Mode and Inducer menus determine the options available in the Further Inducer
Options menu. After you have made your selections in these menus, click Go! to run the
inducer, which, in turn, creates the classifier.
Figure 11-2 Data Destination Panel in Tool Manager Showing Classifiers
Discrete Labels
The Discrete Labels menu provides a list of possible discrete labels. Discrete attributes
(binned values, character string values, or a few integers) have a limited number of
values. You should select a label attribute with few values; for instance, two or three (see
"Training Set" in Chapter 10). If there are no discrete attributes, the menu shows No
Discrete Label, and the Go! button is disabled. You then must create a discrete attribute by
binning or adding a new column using the Tool Manager's Data Transformations panel.
Classifier Name
The generated classifier is named with the prefix of the session filename (as determined
in Tool Manager) and the suffix -dt.class. By default, all classifiers are stored on the server
in the file_cache directory, which defaults to mineset_files. These classifiers can be used for
future classification of unlabeled records; that is, they can be used to predict the labels for
unlabeled datasets (see "Applying a Model" and "Backfitting in Error Estimation" in
Chapter 10).
Parallelization
If you have installed the multiprocessor version of MineSet, it is possible to compute
tree-based algorithms in parallel whenever a branch contains over 1000 records. Each
node on the tree estimates the best possible split on the corresponding level, and these
tasks are performed in parallel. The maximum number of threads a program can spawn
is determined automatically by default. You can control the number of threads by
changing the parallelization mode in the Preferences panel of Tool Manager (see "The File
Menu" in Chapter 3). Parallelization may cause memory fragmentation, so the
largest datasets that can be computed in parallel may be smaller than the largest datasets
that can be computed on a uniprocessor.
Decision Tree Options
Selecting Further Classifier Options causes the Classifier Options dialog box
(Figure 11-3) to appear. This dialog box consists of four panels:
• The top panel indicates the choices you made in the Tool Manager's Data
Destination panel.
• The second panel from the top lets you set the loss matrix and the weight attribute.
See "Loss Matrices" and "Weight Setting" in Chapter 10.
• The bottom-left panel lets you specify further Inducer Options.
• The bottom-right panel lets you specify the Error Estimation Options (unless the
mode you chose in the Data Destination panel was Classifier Only, in which case
this area is empty). The options shown in this panel depend on the type of Error
Estimation you chose (see "Applying Models, Testing Models, and Fitting New
Data" in Chapter 10).
Figure 11-3 Further Inducer Options
Decision Tree Inducer Options
To fine-tune the Decision Tree induction algorithm, you can change the following
Decision Tree inducer options (see Figure 11-3).
Limit tree height by
By default, there is no limit to the height (number of levels) in the Decision Tree.
Limit the height by clicking the checkbox and typing a number for the limit.
Limiting the number of levels speeds up the induction and is useful for studying
the Decision Tree without the distraction of too many nodes. Note that restricting
the size decreases the run time but might increase the error rate. Setting this option
does not affect the attributes chosen at levels before the maximum level.
Splitting criterion
This option offers three splitting criteria. The definitions below are
technical. For a given problem, it is difficult to know which criterion will be best. Try
them all, and select the one that leads to the lowest error estimate, or to a Decision
Tree you find easiest to understand. (A small numeric sketch of these criteria appears
after this list of options.)
Mutual Info is the change in purity (that is, the entropy) between the parent node and
the weighted average of the purities of the child nodes. The weighted average is
based on the number of records at each child node.
Normalized Mutual Info (the default) is the Mutual Info divided by the log (base 2) of
the number of child nodes.
Gain Ratio is the Mutual Info divided by the entropy of the split while ignoring the
label values.
Normalized Mutual Info and Gain Ratio give preference to attributes with few values.
Split lower bound
This is a lower bound on the weight (normally the number of records if weight was
not set) that must be present in at least two of the node's children. The default for
this option is 2. For example, if there is a three-way split in the node, at least two out
of the three children must have a weight of two or more (two records or more if
weight is not set). This provides another method of limiting the size of the Decision
Tree.
Increasing the split lower bound tends to increase the reliability of the probability
estimates, because the number of records at each leaf is larger. It also creates smaller
trees and decreases the induction time. If you expect the data to contain noise
(errors or anomalies), or if you use the tree for estimating probabilities (see
"Applying a Model" in Chapter 10), increase the split lower bound to 5 or more. If
your dataset is very small (< 100 records), you might want to decrease this number
to 1.
Pruning factor
A Decision Tree is built based on the limits imposed by Limit Tree Height and Split
Lower Bound. Statistical tests are then made to determine when some subtrees are
not significantly better than a single leaf node, in which case those subtrees are
pruned.
The default pruning factor of 0.7 indicates the recommended amount of pruning to
be applied to the Decision Tree. Higher numbers indicate more pruning; lower
numbers indicate less pruning. If your data might contain noise (errors or
anomalies), increase this number to create smaller trees. The lowest possible value is
0 (no pruning); there is no upper limit.
Pruning is slower than limiting the tree height or increasing the split lower bound
because a full tree is built and then pruned. Pruning, however, is done selectively,
resulting in a more accurate classifier.
Boosting
Boosting is employed to improve the accuracy of classification, although it is a
time-intensive process. Visualization is not performed during boosting.
The estimated error appears in the status window as the process runs. This
estimated error is the result of the algorithm repeatedly assigning new weight
distributions to the training set and inducing a classifier on the reweighted sets.
Estimating error in this manner is normally done after you have done a
visualization to gain insight into the dataset. See "Boosting: Accuracy is Sometimes
Crucial" in Chapter 10 for more information on boosting.
Allow one-off splits
Clicking this checkbox allows the inducer to make two-way splits on nominal
attributes that have more than two values. Normally, splits on nominal attributes
have as many lines as they have values (see the section on Decision Nodes, below).
For example, a split on the attribute color might have lines for red, green, yellow,
and blue. If one-off splits are enabled, then the inducer can also make splits with
just two lines, for example "red" and "not red." One-off splits isolate exactly one of the
possible values that the column can have. These kinds of splits may be useful when
the data contains attributes with many possible values, some of which are
exceptionally good at discriminating the label.
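As a rough numeric illustration of the splitting criteria described under "Splitting
criterion" above, the entropy-based quantities can be computed as follows. This is a
simplified sketch of the standard information-theoretic definitions with an invented
two-way split; it ignores record weights and is not the MineSet implementation.

import math

def entropy(counts):
    """Entropy (base 2) of a class-count distribution."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Invented example: parent node with 50/50 labels, split into two children.
parent = [50, 50]
children = [[40, 10], [10, 40]]

n = sum(parent)
child_sizes = [sum(c) for c in children]

# Mutual Info: parent entropy minus the weighted average of child entropies.
mutual_info = entropy(parent) - sum(
    size / n * entropy(c) for size, c in zip(child_sizes, children))

# Normalized Mutual Info: Mutual Info divided by log2(number of children).
normalized_mi = mutual_info / math.log2(len(children))

# Gain Ratio: Mutual Info divided by the entropy of the split itself
# (the distribution of records among the children, ignoring labels).
gain_ratio = mutual_info / entropy(child_sizes)

print(mutual_info, normalized_mi, gain_ratio)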
Working in the Tree Visualizer's Main Window
The Tree Visualizer's main window shows the Decision Tree. This Decision Tree consists
of nodes connected by lines (see Figure 11-1).
Nodes
There are two types of nodes:
• decision
• leaf
Decision Nodes
Decision nodes specify the attribute that is tested at the node. Values (or ranges of values)
against which the attributes are tested are shown at the lines. Each possible value for the
attribute matches exactly one line. For example, the root of the Decision Tree in
Figure 11-1 tests the attribute petal_length; the two lines emanating from the node
specify the ranges of values for that attribute (<= 2.6 and > 2.6), so that every possible
value matches either the right branch or the left branch. If the value is unknown and
there is no line labeled with a question mark (?), the majority class of the current node is
predicted.
Leaf Nodes
Leaf nodes in a Decision Tree specify a class. Follow the left branch in Figure 11-1 from
the root to a leaf labeled iris-setosa. Note that the Decision Tree classifier classifies all
records with petal_length <= 2.6 inches as belonging to the class iris-setosa.
Node Information
The vertical bars atop each node show the distribution of the classes at the node. The base
of each node has a height and a color. The height corresponds to the weight of the
training set records that have reached this node (this is the number of records if weight
was not set). In general, the higher the weight, the more reliable the class distribution at
every node.
The color of the base indicates the error estimate of the subtree: indigo indicates high error,
grey medium error, and white low error. The color of the base is black if no test
set records reached the node, in which case there is no error estimate.
Pointing to a node causes the following information to be displayed:
• Subtree weight: The weight of the training set records in the subtree below the
node pointed to. This value is mapped to the height of the base.
• Test set error/loss: An estimate of the subtree error (or loss if a loss matrix was
given). The number after the +/- is the standard deviation of the estimate. The
higher the standard deviation, the less accurate the error estimate. The error/loss
estimate and the standard deviation are less reliable for leaves with few records or
when the test set error is close to 0% or 100%.
• Test set weight: The weight of records from the test set that reached the node
(number of records if weight was not set).
• Purity: A number from 0 to 100 indicating the skewness of the label value
distribution at the node. If a node has records from a single class, the purity is 100. If
the label values have the same weight, the purity is 0. The purity is computed after
backfitting.
When backfitting is enabled and Display training set as disks is checked, the vertical bars
show the distribution of the training set as disks. The heights of the disks are on the same
scale as the heights of the bars. You should expect the disks to appear about as high on
the bars as the value used for the Holdout ratio in Further inducer options.
Note that only Classifier & Error mode yields the test set error/loss and weight. You can use
the Test Classifier option (see "Applying Models, Testing Models, and Fitting New Data" in
Chapter 10) to generate a visualization based on an existing classifier and a test set.
Lines
All possible outcomes are marked on the horizontal lines emanating from each decision
node. Each line indicates the value (or range of values) against which the attribute of that
node was tested.
Using the Main Window to Classify Records
To classify a record, start at the root, and test how to branch at every decision node. By
following the appropriate lines based on the record's attribute values, you reach a leaf
node. The label, or class, associated with the leaf node is the predicted classification of
the record.
Some decisions are quickly made and take a shorter path (for example, petal_length
<= 2.6 implies iris-setosa). Other decisions can take a longer path (for example, the right
branches, petal_length > 2.6 and petal_width > 1.65). In general, every leaf corresponds
to a rule that is the conjunction of all tests at the decision nodes and all the values (or
ranges of values) on the lines leading to it from the root.
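The same procedure can be expressed as a small walk over a tree structure. The sketch
below encodes a simplified fragment of the tree in Figure 11-1 using the thresholds
mentioned in the text (petal_length 2.6 and petal_width 1.65); the exact structure and the
tuple representation are illustrative only, not the Tree Visualizer's own format.

# A decision node is (attribute, threshold, left_subtree, right_subtree);
# a leaf is just the predicted class label.
tree = ("petal_length", 2.6,
        "iris-setosa",                 # petal_length <= 2.6
        ("petal_width", 1.65,          # simplified second split
         "iris-versicolor",
         "iris-virginica"))

def classify(node, record):
    """Follow the branches matching the record's attribute values to a leaf."""
    while isinstance(node, tuple):
        attribute, threshold, left, right = node
        node = left if record[attribute] <= threshold else right
    return node

print(classify(tree, {"petal_length": 1.4, "petal_width": 0.2}))  # iris-setosa
print(classify(tree, {"petal_length": 5.1, "petal_width": 1.8}))  # iris-virginica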
In the root of the tree shown in Figure 11-1, the error rate is 6%, with a standard deviation
of 3.39%. The standard deviation is high because the file is small, and the test set only has
50 records. The purity is 0.0, indicating that the distribution is uniform.
The left child of the root has 0 test set error and a purity of 100 because all records with
petal_length <= 2.6 inches are of the iris-setosa class; thus, the prediction of iris-setosa is
likely to be very accurate for all records with petal_length <= 2.6 inches. The right child
of the root has an estimated error of 8.57%. In this child, which matches records whose
petal_length > 2.6 inches, there are no records belonging to the iris-setosa class; thus, the
class is more likely to be iris-versicolor or iris-virginica. Because only two possibilities
exist at this node, there is a higher purity than at the root (36.91).
The Decision Tree leaves segment the data into clusters sharing the same classification
rule (the path that leads to each leaf). By looking at the leaves, it is possible to see clusters
that share the same set of properties.
External Controls
The external controls for the visualizer associated with the Decision Tree classifier are the
same as those for the Tree Visualizer. For a description of these controls, see "External
Controls" in Chapter 5.
One particularly useful control for decision trees is to click the right mouse button when
pointing to a node. This shows the list of children of that node.
Pulldown Menus
The pulldown menus for the visualizer associated with the Decision Tree classifier are the
same as those for the Tree Visualizer. For a description of these menus, see "External
Controls" in Chapter 5.
The Search and Filter Panels
Select Search Panel and Filter Panel in the Show menu to bring up a dialog box that lets
you specify criteria to search/filter for objects (Figure 11-4). The panels are the same ones described in "The Search Panel" in Chapter 5; however, the item choices for decision
trees are always the same. These are described below.
Figure 11-4 Tree Visualizer's Search Dialog Box
The search/filter can be restricted to specific class labels, either by selecting the values in
the class list or by using the class item, which allows more powerful comparison
operators (such as Matches). Other items are described below:
•  Subtree weight lets you restrict the search/filter to bars or bases (depending on the choice of the radio button bars/bases) with a given weight (number of records if weight is not set) for the subtree. For example, you can restrict the search to bars containing a weight of at least 50.
•  Test attribute lets you restrict the search/filter to nodes labeled by the given value that the node is testing. Note that decision node labels represent the test attribute, while leaf node labels show the predicted label. For example, if you select Test attribute contains age, only nodes that test the value of age are considered.
•  Test value lets you restrict the search/filter to nodes having an incoming line labeled with a value you specify.
•  Percent lets you restrict the search/filter to bars representing a percentage of the overall weight at a node. For example, you might want to find all nodes such that a given class accounts for more than 80 percent of the weight. To do this, click the class label, and select Percent > 80. Setting this item is meaningless if you select bases and not bars (the value for the bases is 0).
•  Purity lets you restrict the search/filter to nodes with a range of purity levels. For example, if you want to look at pure nodes (with one class predominant), you can select Purity > 90.
•  Test-set subtree weight lets you restrict the search/filter to subtrees with a given test-set weight (number of test-set records if weight is not set).
•  Test set error/loss lets you restrict the search/filter to nodes with a range of estimated error/loss.
•  Mean error/loss standard deviation lets you restrict the search/filter to nodes with a range of estimated standard deviation for the test set error/loss.
•  Level lets you restrict the search/filter to a specific level or range of levels. For example, you can search only the first five levels.
The following items and options are less useful for decision trees:
•  Hierarchy finds all the nodes and lines that match the given value at the tail of the path from the root. It then marks the children of these nodes.
•  Treat Nulls as Zeros is not used by the Decision Tree inducer because there are no null items generated for decision trees.
Once the search is complete, yellow spotlights highlight objects matching the search
criteria. To display information about an object under a yellow spotlight, move the
pointer over that spotlight; the information appears in the upper left corner, under the
label "Pointer is over:". To select and zoom to an object under a yellow spotlight,
left-click the spotlight; if you press the Shift key while clicking, zooming does not occur.
Once the filtering is complete, the scene shows only nodes matching the filtering criteria.
Sample Files
The following examples illustrate cases in which the Decision Tree inducer can be useful.
Each of these examples is associated with a sample data file provided with MineSet. By running the inducer, you can generate the -dt.treeviz files described below.
Note: The data files, which have a .schema extension, are located in /usr/lib/MineSet/data on the client workstation. The classifier visualization files, which have a -dt.treeviz
extension, reside on the client workstation in /usr/lib/MineSet/treeviz/examples.
Churn
When customers change their phone carrier from one telecommunications company to
another, this is termed churning. This is a common problem in the telecommunications
industry. The file /usr/lib/MineSet/treeviz/examples/churn-dt.treeviz shows a Decision Tree classifier induced for this problem. The file was generated by running the inducer on /usr/lib/MineSet/data/churn.schema with the label set to churn (yes, no). The file given is fictitious, but based on patterns found in real data.
Note that in this tree the root split is on the amount of time the customers talk during the
day (total day minutes). Customers who talk more than 264 minutes per day churn at a
significantly higher rate than those who don't (60% versus 11%). These also are probably
the most protable customers.
The left subtree represents customers who talk less than 264 minutes per day. They have
a churn rate of 11%; but if they make more than three customer service calls, the churn
rate increases to 49%.
The right subtree represents customers who talk over 264 minutes per day. They have a
churn rate of 59%; but if they have a voice-mail plan, the rate decreases to 9.3%. If they
do not have a voice-mail plan, the churn rate is almost 75%.
Origin of Cars
The cars dataset contains information about different models of cars from the 1970s and
early 1980s. Attributes include weight, acceleration, and miles per gallon (mpg). The file /usr/lib/MineSet/treeviz/examples/cars-dt.treeviz shows the Decision Tree classifier induced for this problem. This file was generated by running the inducer on
/usr/lib/MineSet/data/cars.schema with the label set to origin (Japan, U.S., Europe). If you
have a dataset of car attributes, you might want to know what characterizes cars of
different origins.
Note that in the tree the left split is on brand. The root split is not brand because the Decision Tree inducer penalizes multi-way splits, and the split on cubic_inches was deemed a better discriminator. You can use the Tool Manager's Remove Column transformation to hide the brand, thus making the problem more interesting.
In the Decision Tree, you can see that cubic inches is an excellent discriminator for
U.S.-made cars. Cars with large engines (>169.5 cubic inches) are all made in the U.S., but
smaller cars are made everywhere. By choosing Selections > Show Original Data, you can
see that the one car with a big engine that was not made in the U.S. is a Mercedes. Note that in this tree, the root node (that is, the entire training dataset) has many more U.S. cars (62.50%), yet after a single split on the cubic inches attribute, it is more difficult to predict the origin of cars with small engines. The purity of the root is 16.2, showing that there is one class (U.S., in this case) that is dominant. The right node (cubic inches > 169.5) has purity 96.81, indicating that we have identified a very pure subpopulation (almost all
cars with large engines were made in the U.S.). Indeed, the error rate for the right subtree
is estimated at 0% (green base). The left node from the root has purity 0.23 and a much
higher error rate of 31.25% (orange base). This subproblem is much harder than the
original one: the number of records for each class is approximately the same.
Gender Attribution
The adult dataset contains information about working adults. This dataset was extracted
from the U.S. Census Bureau. It contains data about people older than 16, with a gross
income of more than $100 per year who work at least one hour a week. You might want
to know how to characterize males and females. The file /usr/lib/MineSet/treeviz/examples/adult-sex-dt.treeviz shows the Decision Tree classifier induced for this problem. This file was generated by running the inducer on
/usr/lib/MineSet/data/adult.schema, with the label set to sex. Note that this dataset contains
almost 50,000 records; thus, running the Decision Tree Inducer can take several minutes
when you run this on your workstation.
The resulting visualization provides the following insights:
•  Relationship is a giveaway attribute for some values. Husbands usually are male. (Interestingly, there is one husband who is a female, showing data quality problems at the Census Bureau, which does not recognize same-sex marriages.) Similarly, if the person is a wife, the person is usually a female, except for three records that show otherwise.
To make the problem more interesting, remove the relationship attribute and generate a new Decision Tree. Note that:
•  The most important attribute is marital status.
•  From the height of the bases, most people are either divorced, married to a civilian spouse, or never married. Few are married with spouse absent, separated, married to an armed-forces spouse, or widowed.
•  The distribution at the root shows more males in this dataset. (This database contains information about working adults and is not representative of the entire population.)
•  The left-most node contains divorced working adults. We can see that the distribution is more balanced than at the root (60% female, 40% male). The second node contains married working adults. We can see that 89% are males. The third node contains working adults that have never married. Their numbers are approximately equal to those in the divorced group, with slightly more males. The right-most node contains working widowed adults, of which 81% are females (probably because of their higher life expectancy).
If you want to target working females for a new product, you can use the search panel to identify segments that have a large population of females. You can do this by choosing
    sex matches female (click female on the top portion of the window)
    subtree weight > 1000
    percent > 80
Three yellow spotlights show the matching nodes. Since two are on one path, look at the node closest to the root (on the right). The paths translate into the rules
    marital status = Widowed implies that 81.23% are female
    marital status = Divorced and occupation = administrative clerical implies that 87.67% are female
In this training set, 1233 (widowed) and 1045 (divorced and administrative clerical) females satisfy these rules out of 16,192 at the root. This simple segment (1233 + 1045 = 2278 records) contains over 14% of working women.
Salary Factors
If you have a dataset of working adults, you might want to find out what factors affect salary. You might then divide the records into two classes: those adults earning at most $50,000 a year, and those earning more. Each record then has an attribute with two values: -50,000 and 50,000+. You can run a MineSet classifier to help determine what factors influence salary. The file /usr/lib/MineSet/treeviz/examples/adult-salary-dt.treeviz shows the Decision Tree classifier induced for this problem. This file was generated by running the inducer on /usr/lib/MineSet/data/adult.schema with gross_income binned at the user-specified threshold of 50000 and the label set to gross_income_bin.
The resulting visualization provides the following insights:
•  The root, which represents the entire training set, shows that 76.07% of the working adults earn $50,000 or less.
•  Age is the most important factor. Only 3.07% of the people under 27 years old earn more than $50,000. Note that the base color is green, indicating a very accurate rule (about 3% error rate).
•  Education is an important factor for predicting salary for people over 27 years old. The Census Bureau assigns education levels to each person. The Decision Tree classifier splits on 12.5; level 13 matches a Bachelor's degree. People with a Bachelor's degree or higher go right, to the node where about 55% earn over $50,000.
•  Of the segment that is older than 27 years and well educated, relationship is an important predictor of salary. For those persons that are married, chances of earning $50,000 or more increase to 73% for husbands and 75% for wives. (Note, however, that the node containing wives has a small base, indicating that few females match this rule.) If the person in this group is not married, chances of earning $50,000 or more decrease to 27% for males and 25% for females.
Iris Classification
In this dataset, each record describes four characteristics of iris flowers: petal width, petal length, sepal width, and sepal length. Each iris was further classified into the types
iris-setosa, iris-versicolor, or iris-virginica. The goal is to understand what characterizes
each iris type.
Before running a classifier, click the Importance tab in the Tool Manager's Classifiers tab;
then click Go!. You obtain a ranking of the importance of the features: petal_width,
petal_length, and sepal_length. You can map these to the axes in the Scatter Visualizer,
with the iris_type mapped to the color, and see the clusters.
The file /usr/lib/MineSet/treeviz/examples/iris-dt.treeviz shows the Decision Tree classifier induced for this problem. This file was generated by running the inducer on
/usr/lib/MineSet/data/iris.schema.
Running the Tree Visualizer, you can see that the root has a 6% error rate, even though the purity is very low (0). The purity measures the skewness of the distribution, and, at the root, the distribution is perfectly uniform: 50 records for each label value. The left branch (petal-length <= 2.6 inches) goes to a green node (zero error) containing only iris-setosas. The other branches are also quickly able to separate the classes using another test on the petal_width. The path petal-length > 2.6 and petal-width <= 1.65 and petal-length > 5 ends with an impure leaf containing 4 records. There are three records of type iris-virginica and one of iris-versicolor. The Decision Tree did not split this node because it was deemed insignificant (by default, every split must contain two children with at least a weight of two). The node color is also black, indicating that no test instances reach this node, so we do not have an estimated error rate for it.
To summarize: the flowers with petal length <= 2.6 inches are predicted as iris-setosa, those with petal length > 2.6 inches and <= 5 inches and petal width <= 1.65 inches are predicted as iris-versicolor, and those with a petal length > 2.6 inches and a petal width > 1.65 inches, or a petal length > 5 inches and a petal width <= 1.65 inches, are predicted as iris-virginica.
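Restated in code, the summary above corresponds to three rules. The Python sketch below is only an illustration of those rules as written; it is not the classifier file that MineSet produces.

    # The three summary rules above, restated as code for clarity.
    # The thresholds come from the example in the text; this is an illustration,
    # not the classifier that MineSet generates.
    def predict_iris(petal_length, petal_width):
        if petal_length <= 2.6:
            return "iris-setosa"
        if petal_length <= 5 and petal_width <= 1.65:
            return "iris-versicolor"
        return "iris-virginica"

    print(predict_iris(1.4, 0.2))   # iris-setosa
    print(predict_iris(4.5, 1.3))   # iris-versicolor
    print(predict_iris(6.0, 2.1))   # iris-virginica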
Note that because the Decision Tree makes binary splits on continuous attributes while
Column Importance discretizes the data, the root split of the tree is different from the first
attribute in column importance (see Chapter 17 for more details).
Mushroom Classification
The file /usr/lib/MineSet/treeviz/examples/mushroom-dt.treeviz shows the Decision Tree classifier induced for the classification of mushrooms. This file was generated by running
the inducer on /usr/lib/MineSet/data/mushroom.schema.
The goal is to understand which mushrooms are edible and which are poisonous, given
this dataset. There are over 8000 records in this set; thus, running this inducer might take
several minutes.
Each mushroom has many characteristics, including cap color, bruises, and odor. If you
build a Decision Tree classifier, you can see that using only the odor attribute lets you
determine in 50% of the cases whether the mushroom is poisonous or edible. If the
mushroom has no odor, there is a 3.4% chance it is poisonous. The next attribute to look
at is the shape of the stalk. If it tapers, the mushroom is edible; but if it enlarges, there is an 11.6% chance the mushroom is poisonous. There are 1032 mushrooms that reach this node. You can follow the tree down to further nodes to see what other attributes to consider.
Party Affiliation
This dataset consists of voting records. The goal is to identify the party a congressperson
belongs to given data about key votes. The dataset includes votes for each member of the
U.S. House of Representatives on the 16 key votes identified by the Congressional Quarterly Almanac (CQA). The CQA lists nine types of votes: voted for, paired for, and announced for (these three are simplified to yes); voted against, paired against, and announced against (these three are simplified to no); voted present, voted present to avoid conflict of interest, and did not vote or otherwise make a position known (these three are simplified to an unknown disposition).
Before running a classifier, look at the 16 votes to see if you can perceive which features are important. Then run the Decision Tree classifier.
The file /usr/lib/MineSet/treeviz/examples/vote-dt.treeviz shows the Decision Tree classifier induced for this problem. This file was generated by running the inducer on
/usr/lib/MineSet/data/vote.schema.
Breast Cancer Diagnosis
The breast cancer dataset contains information about females undergoing breast cancer
diagnosis. Each record is a patient with attributes such as cell size, clump thickness, and
marginal adhesion. The final attribute is whether the diagnosis is malignant or benign. The file /usr/lib/MineSet/treeviz/examples/breast-dt.treeviz shows the Decision Tree classifier induced for this problem. This file was generated by running the inducer on
/usr/lib/MineSet/data/breast.schema.
The Decision Tree shows that uniformity_of_cell_size is a very strong discriminatory
attribute. While the root distribution is about 65% versus 35% (purity is 7.07), the two
children of the root are much more skewed, with the left node having an error rate of only
1.29%. The root alone is an excellent discriminator: if you limit the tree height to a single
level (see "Decision Tree Inducer Options"), the error rate is 7.3%.
Hypothyroid Diagnosis
The hypothyroid diseases dataset is similar to the one for breast cancer. The file /usr/lib/MineSet/treeviz/examples/hypothyroid-dt.treeviz shows the Decision Tree classifier
induced for this problem. This le was generated by running the inducer on
/usr/lib/MineSet/data/hypothyroid.schema.
There are 3163 records in this dataset and most of them do not have hypothyroid
(95.23%). While this means that one can predict negative and be correct with high
probability, it's those people that have hypothyroid that we are most worried about: the false negatives are very important. By selecting a confusion matrix from Further Inducer Options, you'll see that there are five patients with hypothyroid who were misclassified.
Looking at the Decision Tree, you can see that the root node is green (highly accurate). The single-attribute split on fti at the root shows that it is relatively easy to identify many of the negative diagnoses. People with high fti are 99.7% negative, and all those where the value is unknown are also negative (perhaps the doctor decided not to measure this attribute because something else was obvious), but the rest (218 people) are hard cases (the node base is colored orange). We started with 3163 records, but only 218 are really interesting to mine because it was very easy to determine the classification of most cases. In this example most of the data is uninteresting, and you want to concentrate on a small part quickly. Of the 218 people, you can see that about 66% are positive and 34% negative.
As you move down the tree, increase the height scale (slider on the top left of the
visualizer) to see the different heights. The node that catches most of the people with
hypothyroid has the conditions fti <= 64.5 and tsh > 5.95. It contains 140 of the 151
records that have hypothyroid.
Pima Diabetes Diagnosis
This dataset is a diagnosis problem for diabetes using statistics gathered from a Native
American tribe in Phoenix, Arizona. The task is to determine whether a patient has
diabetes, given some medical attributes, such as blood pressure, body mass, glucose
level, and age.
The file /usr/lib/MineSet/treeviz/examples/pima-dt.treeviz shows the Decision Tree classifier induced for this problem. This file was generated by running the inducer on
/usr/lib/MineSet/data/pima.schema.
DNA Boundaries
There are 3,186 records in this DNA dataset. The domain is drawn from the field of molecular biology. Splice junctions are points on a DNA sequence at which superfluous DNA is removed during protein creation. The task is to recognize exon/intron boundaries, referred to as EI sites; intron/exon boundaries, referred to as IE sites; or neither. The IE borders are referred to as acceptors, and the EI borders are donors. The records were originally taken from GenBank 64.1 (genbank.bio.net). The attributes provide a window of 60 nucleotides. The classification is the middle point of the window, thus providing 30 nucleotides on each side of the junction.
In this example, the root of the Decision Tree shows the distribution of the three classes. By pointing to the bars, you can see that the composition is about 24% exon/intron, 24% intron/exon, and 52% none. The left_01 in front of the root node indicates that this is an important attribute to look at first. The left_01 notation refers to the first nucleotide found to the left of the splice junction in question. The choices of attribute values for this first nucleotide (and all nucleotides in general) are the A, G, T, and C nucleotides. If the left_01 nucleotide is a G, then the G branch is taken and followed to the next node, where the distribution now shows that such a nucleotide is more likely to be an exon/intron or an intron/exon than at the root: the distribution is 34% for exon/intron, 42% for intron/exon, and 24% for none. If the left_01 nucleotide is an A, T, or C, then the corresponding A, T, or C branch is taken instead, and in all three cases the probability of none increases dramatically (87%, 87%, and 95%, respectively). This testing and branching process is repeated until the final node with the predicted class (exon/intron, intron/exon, or none) is reached.
For this dataset, the Evidence Classifier (Chapter 13) is more appropriate than a Decision Tree due to the probabilistic nature of this domain. This can be verified by comparing the
estimated error rates.
Chapter 12
12. Inducing and Visualizing the Option Tree Classifier
This chapter discusses the features and capabilities of the Option Tree inducer. It provides an overview of this tool and discusses methods for using it to generate Option Tree classifiers. The Option Tree Visualizer's functionality is the same as for Decision Trees and was described in Chapter 11. Finally, this chapter lists and describes the sample files provided for this tool.
Note: It is assumed that you have read Chapter 10, "MineSet Inducers and Classifiers," and Chapter 11, "Inducing and Visualizing the Decision Tree Classifier," before proceeding with this chapter.
Overview
An Option Tree classifier assigns each record to a class. The underlying structure used for classification is a Decision Tree, as described in Chapter 11. Figure 12-1 shows an Option Tree where the goal is to predict, for the cars dataset, the origin of a car built in the 1970s or early 1980s (the origin being the U.S., Japan, or Europe). Option Trees extend a regular Decision Tree classifier by allowing Option Nodes. An Option Node shows several options that can be chosen at a decision node in the tree. For example, in Figure 12-1, the root is an option node with five options:
1. cubicinches
2. cylinders
3. weightlbs
4. mpg
5. brand
Option nodes serve two purposes:
1. They enhance comprehensibility of the factors affecting the determination of the class label by showing several choices that can be made. Instead of using a single attribute at a node, an option node provides you with several options. When flying over the tree, you can choose to follow an option that
   •  you believe is easier to understand, or
   •  you believe is better for predictions based upon your previous experience, or
   •  you select based on the error estimate.
In the cars dataset shown in Figure 12-1, you can fly down the cylinders subtree because it has few values, or you can fly down to the weightlbs subtree because its estimated error is lower (1.53). Note that error estimates are only estimates; generally, if the error difference between two options is less than twice their mean standard deviation, then statistically the errors are not different.
2. They reduce the risk of making a mistake by averaging the votes made by the options below. Every option leads to a subtree that can be thought of as an expert. The option node averages these experts' votes. Such averaging can lead to a better classifier with a lower error rate (a small sketch of this averaging idea follows below).
In the cars dataset, shown in Figure 12-1, the root node has an estimated error rate of 0.76%, which is lower than any of its children! Note that while brand might seem like a giveaway attribute for this task, the training set might not contain all brands (in fact, it does not contain all of them). For an unseen brand, the Decision Tree guesses the majority class (U.S.) and makes two errors. However, when there are other options, they are averaged, and, indeed, the error is reduced.
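The following Python sketch illustrates the averaging idea in point 2 above. The class-probability distributions are invented numbers, not MineSet output; the point is only that the averaged vote can differ from any single option's vote.

    # Rough sketch of option-node averaging: average the class-probability
    # estimates of the option subtrees ("experts") and predict the class with
    # the highest average.  The distributions below are invented for illustration.
    def average_options(option_distributions):
        classes = option_distributions[0].keys()
        n = len(option_distributions)
        averaged = {c: sum(d[c] for d in option_distributions) / n for c in classes}
        return max(averaged, key=averaged.get), averaged

    options = [
        {"U.S.": 0.70, "Japan": 0.20, "Europe": 0.10},
        {"U.S.": 0.55, "Japan": 0.30, "Europe": 0.15},
        {"U.S.": 0.40, "Japan": 0.45, "Europe": 0.15},
    ]
    print(average_options(options))
    # ('U.S.', {'U.S.': 0.55, 'Japan': 0.316..., 'Europe': 0.133...})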
Figure 12-1 Option Decision Tree for the Cars Dataset
Option Trees, however, have two disadvantages:
1. The time necessary to build an Option Tree under the default setting is about 10 to
15 times longer than that needed to build a Decision Tree.
2. The Tree Visualizer file that is created is more complex, containing 10 to 15 times as
many nodes.
Run the Option Tree inducer on your dataset to determine whether the advantages in
comprehensibility and error rates justify the longer induction time. You might gain
additional insight as to which attributes to remove or use when building a Decision Tree.
Inducing Option Trees
An Option Tree classifier is induced (generated) automatically from data. The data, which is made up of records and a label associated with each record, is called the training set (see Chapter 10, "MineSet Inducers and Classifiers").
File Requirements
The Option Tree inducer requires a training set, as described in "Training Set" in Chapter 10. Files are generated by extracting data from a source (such as a MineSet ASCII or binary file, or a table in an Oracle, INFORMIX, or Sybase database). To apply the generated classifier, you should have a dataset of records with the attributes used by the classifier, except that the label need not be present.
Running the Option Tree Inducer
There are two ways to run the Option Tree inducer:
•  From the Tool Manager.
   Connect to the server and select a data source (see "Choosing a Data Source" in Chapter 3). For the example shown here, the filename entered would be /usr/lib/MineSet/data/cars.schema. You'll see several attributes in the Data Transformation panel. Verify that origin is shown as the discrete label. Select the Option Tree inducer, and ensure you have selected the Classifier & Error mode. To run the inducer, click Go!.
•  From the command line.
   To induce an Option Tree classifier from the command line, refer to Appendix I, "Command-Line Interface to MIndUtil: Analytical Data Mining Algorithms."
Configuring the Option Tree Inducer Using the Tool Manager
To access the options for configuring the Option Tree inducer, select the Mining Tools tab on the Data Destination panel (Figure 12-2). From the tabs at the right, select Classify. Ensure that the inducer you select is Option Tree. Your selections in the Mode and Inducer menus determine the options available in the Further Inducer Options menu (Figure 12-3). After you have made your selections in these menus, click Go! to run the inducer, which, in turn, creates the classifier.
Figure 12-2 Data Destination Panel in Tool Manager Showing Classifiers
Discrete Labels
The Discrete Labels menu provides a list of possible discrete labels. Discrete attributes (binned values, character string values, or a few integers) have a limited number of values. You should select a label attribute with few values; for instance, two or three (see "Training Set" in Chapter 10). If there are no discrete attributes, the menu shows No Discrete Label, and the Go! button is disabled. You then must create a discrete attribute by binning or adding a new column using the Tool Manager's Data Transformations panel.
Parallelization
If you have installed the multiprocessor version of MineSet, it is possible to compute
tree-based algorithms in parallel whenever a branch contains over 1000 records. Each
node on the tree estimates the best possible split on the corresponding level, and these
tasks are performed in parallel. The maximum number of threads that a program can
spawn is determined automatically by default. You can control the number of threads by changing the parallelization mode in the Preferences panel of Tool Manager (see "The File Menu" in Chapter 3). Parallelization may cause memory fragmentation, causing the
largest data sets that can be computed in parallel to be smaller than the largest data sets
that can be computed on a uniprocessor.
Classifier Name
The generated classifier is named with the prefix of the session filename (as determined in Tool Manager) and the suffix -odt.class. By default, all classifiers are stored on the server in the file_cache directory, which defaults to mineset_files. These classifiers can be used for future classification of unlabeled records; that is, they can be used to predict the labels for unlabeled datasets (see "Applying a Model" and "Backfitting in Error Estimation" in Chapter 10).
Option Tree: Further Options
Selecting Further Classifier Options causes the Further Inducer Options dialog box to appear. This dialog box consists of four panels:
•  The top panel indicates the choices you made in the Tool Manager's Data Destination panel.
•  The second panel from the top lets you set the loss matrix and the weight attribute. See "Loss Matrices: Not All Mistakes Were Created Equally" and "Record Weighting: Not All Records Were Sampled Equally" in Chapter 10.
•  The bottom-left panel lets you specify further Inducer Options.
•  The bottom-right panel lets you specify the Error Estimation Options (unless the mode you chose in the Data Destination panel was Classifier Only, in which case this area is empty). The options shown in this panel depend on the type of Error Estimation you chose (see "Error Estimation" in Chapter 10).
Figure 12-3 Further Inducer Options
Option Tree Inducer Further Options
To fine-tune the Option Tree induction algorithm, you can change any of the options for
Decision Trees described in Chapter 11. In addition, the following new options are
provided (see Figure 12-3).
Max # root options
This integer, which defaults to 5, restricts the maximum number of options that may
be created at the root. While the inducer might not allow this number of options
because other attributes are inferior, many natural datasets will have many good
attributes that could be chosen.
Decrease
This integer, which defaults to 2, defines the amount by which the number of
options decreases at every level. With the default of 5 for Max # root options, it
implies that there are at most three options (5-2=3) for the second level of decision
nodes. The third level of decision nodes is restricted to a single option (3-2=1).
Levels further down are similarly restricted to a single option.
Min fitness ratio
This ratio determines when to exclude attributes as options. When the inducer gives a fitness score to each attribute, it chooses the best attribute as well as other attributes that might also be good as options. The fitness ratio determines how good those other options must be to be chosen. A factor value of f implies that, to be considered an option, an attribute must rank at least (1-f)*b, where b is the score for the best attribute. A fitness ratio of 1 picks all the attributes (so the limiting options described above are reached if there are attributes on which to split). A fitness ratio of 0 causes a regular Decision Tree to be created (no option nodes).
The time to induce an Option Tree is closely related to the number of option nodes
created. Because option nodes usually are created near the top (where they are most
useful for both comprehensibility and error reduction), a good approximation for the
time to induce an Option Tree is: the number of options created that have no children
options times the time to build a Decision Tree. Under the default setting, the root node
can have up to five options, and each child can have up to three options. The total options then can be up to 15 (3 times 5). If the option Max # root options is increased to 6, the number of options is then limited to 48 (6*4*2); if it is increased to 7, the number of options is then limited to 105 (7*5*3). Keeping the Max # root options at 5, but changing the decrease to 1, limits the options to 120 (5*4*3*2). The expected induction time for the last example, thus, is two orders of magnitude longer than for a regular Decision Tree. Decreasing the Min fitness ratio option usually results in fewer options than the limiting factor, thus reducing induction time.
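The limits quoted in this paragraph follow from a simple product of the per-level maxima. The Python sketch below is an illustrative way to reproduce that arithmetic; it is not part of MineSet.

    # Illustrative arithmetic for the option limits described above: at each
    # level the maximum number of options drops by `decrease` (stopping at one),
    # and the overall limit is the product of the per-level maxima.
    def option_limit(max_root_options, decrease):
        limit, options = 1, max_root_options
        while options > 1:
            limit *= options
            options -= decrease
        return limit

    print(option_limit(5, 2))   # 5 * 3 = 15   (the default setting)
    print(option_limit(6, 2))   # 6 * 4 * 2 = 48
    print(option_limit(7, 2))   # 7 * 5 * 3 = 105
    print(option_limit(5, 1))   # 5 * 4 * 3 * 2 = 120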
Working in the Tree Visualizer's Main Window
The Tree Visualizer's main window shows the Option Tree. The navigation is the same
as for decision trees. One feature that is very useful for option trees is clicking the right
mouse button on an option node. This presents the list of children, which are the options.
Note that:
•  The left-most option would have been the only option chosen by the Decision Tree inducer. As you go right, the options are ranked in decreasing order by the fitness scoring. Sometimes, it is interesting to see that the fitness scores do not necessarily match the test-set error shown. This is expected, as the inducer is using a non-perfect scoring function. The test-set estimate also has natural variability: the larger the test set, the more accurate the estimate.
•  The option node can have a different error rate than every one of its children. Because the option node averages the children's predictions, its error rate can be different. In some cases, its error is strictly lower than that of every child, showing that averaging helps.
•  The distribution of instances (shown in bars) at every child of an option node is exactly the same as that of the option node itself. This is because there was no decision made by the option node: options are being presented as children.
Sample Files
The following examples show cases in which the Option Tree inducer can be useful. Each
of these examples is associated with a sample data file provided with MineSet. By running the inducer, you can generate the -odt.treeviz files described below. The scenario and goal for each task are described in "Sample Files" in Chapter 11. Here we describe the specific advantages and disadvantages of Option Trees for several of the example datasets.
Note: The data files, which have a .schema extension, are located in /usr/lib/MineSet/data on the client workstation. The classifier visualization files, which have a -odt.treeviz extension, reside on the client workstation in /usr/lib/MineSet/treeviz/examples.
Churn
The Option Tree for this dataset shows that total day charge, total day minutes, and
customer service calls are all good attributes for the root: they all have approximately the
same estimated error rate. You can choose to fly down to one subtree or another, based
on your preferences and understanding of the data. Note that while the right subtree
starts with customer service calls, the second test is on total day charge or total day minutes (as in the root's left option). However, because a split already occurred on an attribute, the thresholds are different.
Origin of Cars
The Option Tree for this dataset shows several good attributes for the root, including:
cubic inches, cylinders, weight lbs, mpg, and brand. Note that the root has a lower
estimated error rate than any of the children.
Iris Classification
This is an example where Option Trees seem to be performing worse than Decision Trees.
The root for the Decision Tree shows 6% error and the root for the Option Tree shows 8%
error, so it seems that Option Trees perform worse. Be cautioned about making inferences
regarding the error rates:
•  The standard deviation of the error estimate is fairly high: 3.88% and 3.39%. A rule of thumb in statistics is that if the difference is less than two standard deviations, the difference is not statistically significant at the 95% confidence level. A difference of 2% is not larger than even a single standard deviation; hence, the classifier error rates are not statistically different at the 95% confidence level.
•  For small files (Iris has 150 records), different random seeds give different results. For example, changing the seed to 3 improves the Option Tree classifier's error from 8% to 4% without changing the Decision Tree classifier's error rate (remember to reset the seed). This does not imply that a more accurate classifier has been generated, but rather that the error estimate is not stable. Because only 50 records are used for testing, each mistake is 2%. The difference between 4% and 8% is making two more mistakes.
•  For small files (Iris has 150 records), use the Estimate Error option in MineSet. It results in better estimates that have narrower confidence intervals. When you run this mode, the status window shows that the Decision Tree classifier has an estimated error of 4.67% +/- 1.73%, and the Option Tree classifier has an estimated error of 4.00% +/- 1.61%. The difference is not significant in this case either, but the Option Tree is slightly superior.
•  Even if the error rate is higher for Option Trees, they might be (and usually are) better at assigning probability estimates. For this dataset, the estimated mean squared error for Decision Trees is 3.94; for Option Trees it is 3.67 (although the difference is not significant at the 95% confidence level).
Mushroom Classification
The Option Tree for this dataset shows that all five options chosen at the root have zero error rate estimates. Looking at the result, you might prefer the left option (bruises) because it is as accurate but is easier to measure than odor (the root test of the induced Decision Tree). You might want to remove odor and gill size, then build a regular Decision Tree that turns out to be just as accurate (0% estimated error rate).
Note, however, that removal of a root option to have a sibling option selected by the Decision Tree might not necessarily result in the same accurate classifier that is shown in the Option Tree. The removed attribute might have been used lower down in the tree. For example, removing brand from the cars dataset significantly increases the error rate, even though four out of five options do not use it at the root.
Party Affiliation
This dataset behaves very similarly to the Iris dataset. The Option Tree has the same error
rate as the Decision Tree. Under Estimate error, the cross-validated estimate shows that
it is slightly better than the Decision Tree (but not significantly so at the 95% level) both
on error rate and on mean squared error.
Breast Cancer Diagnosis
The error rate for Option Trees is slightly lower than that for Decision Trees, both for
Classifier & Error and for Estimate Error; however, the difference is not significant (at 95%).
Hypothyroid Diagnosis
The error rates for this dataset are very low (less than 1%), but this is because most people
who were tested for hypothyroid (95%) did not suffer from it. If we use a loss matrix that
attempts to avoid false negatives (by penalizing by 100 a prediction of negative when the
actual value is hypothyroid), we can see that the loss for Option Trees is significantly lower than that of Decision Trees: 182 versus 523 (total), or 0.17 versus 0.5 (per record). This difference is significant at the 95% confidence level.
DNA Boundaries
For this dataset, the Option Tree is slightly more accurate than the Decision Tree;
however, looking at the root options, you might notice that it chooses left 1,2, and right
1,2,5. Given the background knowledge that attributes closer to the boundary can be
more important, you might want to exclude the option split on right 5. After updating the
maximum number of root options to 4 (down from 5), the error rate increases from 5.65%
to 6.59%. This might be surprising, given that the root no longer uses right 5 as an option;
another effect of changing the number of root options from 5 to 4 was to also reduce the
number of options that appear further down the tree (because of the decrease parameter).
This caused the individual error rates for each of the other 4 subtrees to increase. Still, the
Option Tree's error rate is significantly better (at the 95% confidence level) than the
Decision Tree error rate of 7.06% +/- 0.79%.
Chapter 13
13. Inducing and Visualizing the Evidence Classifier
This chapter discusses the features and capabilities of the Evidence Classifier and Visualizer. It provides an overview of this classification tool as well as the inducer that generates it. It describes the ways of invoking this tool. It then explains the Evidence Visualizer's functionality when working with the
•  Label Probability Pane
•  Main Window
Finally, it lists and describes the sample files provided for this tool.
Note: It is assumed that you have read Chapter 10, "MineSet Inducers and Classifiers," before proceeding with this chapter.
Overview
The Evidence Classifier assigns each record in a dataset to a class. The Evidence Visualizer displays the structure of an evidence classifier (Figure 13-1). The visualizer can help you understand the importance of specific attribute values for classification. Also, it can be used to gain insight into how classification is done, as well as to answer "what if" questions.
Figure 13-1 The Evidence Visualizer Applied to the Iris Dataset
The Main Window (left) contains rows of cake charts, pie charts, or bars for each attribute used by the classifier. Initially rows of cake charts are shown (as in Figure 13-1). A cake chart resembles a pie chart in that it shows proportions, but it is square with rectangular slices. To toggle between cakes, representing evidence, and pies, representing probabilities, click on the boxed title Evidence, and the display shown in Figure 13-2 appears. To characterize a particular class label, select one of the values in the Label Probability Pane (right).
Figure 13-2 Evidence Visualizer Showing Probabilities
There is one chart or bar for each discrete value of the attribute. In the case where the
attributes are not discrete, the continuous range has been discretized (binned) in a way
that maximizes the differences between adjacent charts. A chart's height is proportional to the weight of records having that attribute value. (If no weight attribute is set, the height represents the number of records.) If no filtering is done, the sum of the chart heights for every row is the same because it is equal to the total weight of the dataset.
The height of the graphical objects can be scaled to exaggerate the differences between the charts. You can adjust a Detail slider to reduce the number of attributes shown, starting with the ones that are less useful for classification.
By adjusting the percent weight threshold, all values having counts below a certain percentage of the total are filtered out.
The kinds of questions you might answer by using the Evidence Visualizer are as follows:
•  What is the likelihood that a new record, for which you know only the values for a few attributes, has a certain label?
•  Which values of which attributes are the most useful for classifying the label?
•  What is the distribution of records by attribute values?
•  What are the characteristics of records that have a certain label?
•  What is the probability that an attribute takes on a certain value given that it has a specific label value?
The prior probability for each class label is depicted in the pie chart in the Label Probability
Pane, on the right of the screen. The prior probability for a class label is the probability of
seeing this label in the data for a randomly chosen record, ignoring all attribute values.
Mathematically, this is the number of records with the class label divided by the total
number of records.
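As a small illustration of this computation, the Python sketch below derives the prior probabilities from a list of labels; the label list is invented and simply mirrors the iris example (50 records per class).

    # Sketch: the prior probability of each class is the number of records with
    # that label divided by the total number of records.  The labels are invented
    # to mirror the iris example (50 records per class).
    from collections import Counter

    labels = ["iris-setosa"] * 50 + ["iris-versicolor"] * 50 + ["iris-virginica"] * 50
    counts = Counter(labels)
    priors = {label: count / len(labels) for label, count in counts.items()}
    print(priors)   # each class has prior 50/150, or about 0.333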
The conditional probabilities, depicted by cake charts in the Main Window on the left of the
screen, show the relative probability of each attribute value given (conditioned on) each
label value. The size of a cake slice indicates the amount of evidence the classifier adds to the prior probability after taking into account a given attribute value in a record. If the sizes of the slices are equal, the value is irrelevant, and the classifier adds the same amount of evidence to all classes.
The probability distribution for each value, depicted by the pie chart representation in the Main Window on the left of the screen, shows the proportion of records in each class considering only records having that particular value. This distribution is arrived at by
multiplying the conditional probabilities for the value by the prior probability, then
normalizing so that they sum to one. These probabilities are precise for the data given
(Figure 13-2). You may toggle between the cakes and pies representation by clicking on
the boxed title Evidence or Probabilities in the scene.
By default, values of nominal attributes are sorted by the size of the slices corresponding
to one of the classes. This aids in identifying important values. If the label is a binned
attribute, the class that is the highest bin is used. If the label is nominal, then the class
with the largest slice in the prior probability pie is used. If a particular class is selected,
and then a sort by label probability is requested, the selected class is used for determining
the ordering. Alternatively, the values of the nominal attributes can be sorted
alphabetically or by weight. Binned attributes never change their ordering.
Technically, the slice of the chart represents the normalized conditional probability of an
attribute value A, given the class label L. The conditional probability, P(A|L), is the
probability that a random record chosen only from records with label L takes the value
A. Under the default settings, the probability is computed based on record weights. For
example, P(0.75 < petal width < 1.65 | iris-versicolor) is 91.6%, because there are 36 records with label iris-versicolor, and 33 of them (33/36 = 0.916) have a petal width in this range.
The Evidence Inducer, sometimes called Naive-Bayes (or Simple Bayes), builds a model
that assumes the probabilities of each attribute value are independent given the class. For
example, this assumes that the four attributes (sepal length, sepal width, petal length,
and petal width) are independent for each class of iris (iris-setosa, iris-versicolor, and
iris-virginica). While this simplistic model is rarely true, the model is excellent for initial
explorations of data, and its classification prediction performance is very good in
practical applications.
Each attribute value, or range of values (for binned continuous attributes), defines
exactly one chart, which, in turn, gives the conditional probabilities for each class label.
To classify a given record, one computes the probability of each class by multiplying its
prior probability by the appropriate conditional probability from each row in the matrix.
The nal product gives the relative probability for each class and the highest value is the
predicted class. If an attribute has an unknown value, it is ignored. These NULL values
are represented by charts that are slightly offset from the rest of the charts and are on the
left. For example, stalk root in Figure 13-5 has a null value.
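The classification procedure just described can be sketched as follows. The probability tables in this Python example are invented stand-ins for the ones the inducer computes, and the code is only meant to show the multiply-and-compare step, including the rule that unknown (NULL) values are ignored.

    # Sketch of evidence (Naive-Bayes style) classification: multiply the prior
    # for each class by the conditional probability of every known attribute
    # value, then predict the class with the largest product.  Unknown (None)
    # values are skipped.  All probability tables here are invented.
    def classify(record, priors, conditionals):
        scores = {}
        for label, prior in priors.items():
            score = prior
            for attr, value in record.items():
                if value is None:          # unknown value: ignore this attribute
                    continue
                score *= conditionals[attr][value][label]
            scores[label] = score
        return max(scores, key=scores.get), scores

    priors = {"edible": 0.52, "poisonous": 0.48}
    conditionals = {
        "odor": {"none": {"edible": 0.60, "poisonous": 0.05},
                 "foul": {"edible": 0.01, "poisonous": 0.40}},
    }
    print(classify({"odor": "foul"}, priors, conditionals))
    # ('poisonous', {'edible': 0.0052, 'poisonous': 0.192})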
This process of classification can be done interactively using the Evidence Visualizer.
Simply select all the values for the attributes that you know. The probability pie on the
right changes to show the distribution you would expect, given the attribute values you
selected on the left. For example, selecting the chart for sepal length < 5.45 inches and the
chart for sepal width > 3.05 inches shows that an iris with these characteristics belongs
almost certainly to the class iris-setosa (see Figure 13-3). The behavior is the same
whether you are using the cake, pie, or bar mode.
Figure 13-3 Selecting sepal length < 5.45 and sepal width > 3.05 Using the Iris Dataset
If the classes listed under the pie chart on the right are not numerical, they are shown in
order of slice size. The class with the largest probability is at the top. As values on the left
are selected, this order changes to reflect the changing probability pie. If the label is a
binned attribute, the values are not reordered, and colors are assigned according to a
continuous spectrum: the highest bin is red; otherwise, random colors are used.
For some combinations of selected values, the probability pie on the right turns
completely gray. This occurs when the values selected are contradictory according to the
model. For example, in iris.eviviz there are no iris flowers that have petal width < .75
inches and petal length > 4.85 inches. Thus, selecting the two charts on the left
representing these two values results in a gray chart on the right (see Figure 13-4).
Figure 13-4 Selecting Two Contradictory Pies Results in a Gray Pie on the Right
You can eliminate the possibility of getting a gray pie by using the Laplace correction option (see "Evidence Inducer Options" on page 402). If Laplace correction is not used, clicking more than one chart on the left yields exact posterior proportions on the right. After selecting charts with more than one attribute on the left, the posterior probability pie might not reflect exactly the true proportions in the original data; however, it is a good estimate. Laplace correction can be toggled on or off from within the tool by checkmarking View > Use Laplace Correction. Even if a value for the Laplace correction was not specified in the further inducer options, a default value for it will be used if this menu item has a check mark.
Importance is a measure of predictive power with respect to a label. The Main Window provides valuable insight not only into the importance of each attribute affecting the class value, but also into the importance of specific attribute values. For example, in the mushroom dataset (described in "Mushroom Classification" on page 426), the veil-color attribute has little importance because its attribute value usually is white (see Figure 13-5) and does not add much evidence to either class. The importance assigned to
each attribute is a number between 1 and 100. It is possible to see the value at the top by
selecting the attribute name in the scene.
Figure 13-5 Veil-Color Attribute in the Mushroom Dataset
However, if the veil color is brown or orange, the mushroom is likely to be edible, while
if it is yellow, it is likely to be poisonous. Similarly, a test for AIDS might not be an
important attribute for determining whether a patient has a deadly disease because most
people would not test positive. However, the value positive for this test is highly
informative because most patients that test positive do have a deadly disease.
Inducing Evidence Classifiers
The automatic induction of Evidence classifiers is a process whereby counts (or weights) are used to calculate the probabilities. Evidence classifiers are automatically induced (generated) from data. The data, which is made up of records and a label associated with
each record, is called the training set (see Chapter 10).
The probabilities are generated using the following method:
1. All continuous attributes are discretized (binned), such that class distributions in
these ranges are as different as possible. The number of ranges is determined automatically. If the multiprocessor version of MineSet is installed on your system, the binning is parallelized so that attributes are binned in parallel. The automatic binning for any attribute may be overridden by explicitly performing the binning operation in Tool Manager.
2. The prior probabilities are the proportions of each class in the training set.
3. The conditional probabilities are the probabilities of each attribute value
conditioned on each class label in the training set. (The charts show them
normalized for each attribute value.)
The number of charts in a row is the number of discrete ranges produced by the inducer.
If there is just one range, it means that this attribute by itself was not useful in predicting
the label. Initially, the prior probabilities of the labels are displayed in the Label
Probability Pane.
An optional Auto Column selection mode removes attributes that are not useful or that
increase the error rate.
An optional Laplace correction can be applied to the probabilities, which avoids extreme
probabilities (for example, probabilities of zeros and ones). We may prefer not to assign a probability of 1 to the event "a patient tested positive for AIDS has a deadly disease."
We may want to assign a probability close to 1 (but not 1), in order to allow for errors or
unrepresentative samples.
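One common form of Laplace correction, sketched below in Python, adds a small factor k to every count so that no probability is exactly 0 or 1. The value of k here is illustrative, since the factor MineSet chooses heuristically is not reproduced in this sketch.

    # Sketch of a common Laplace correction: add a factor k to each count so
    # that no conditional probability is exactly 0 or 1.  The value of k is
    # illustrative; MineSet picks its own factor heuristically.
    def laplace_probability(count, total, num_values, k=1.0):
        return (count + k) / (total + k * num_values)

    # Without correction: 0/36 = 0.  With correction the estimate stays small
    # but non-zero, and a count equal to the total no longer yields exactly 1.
    print(laplace_probability(0, 36, 3))    # about 0.026
    print(laplace_probability(36, 36, 3))   # about 0.949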
Note that the Evidence Visualizer shows the probabilities of classes. The classifier can have a Loss Matrix. In the visualization, the Loss Matrix (if present) may be applied by clicking the button to the lower right of the pie in the Label Probability Pane. The predicted class is the one with the least expected loss under the probability estimates. The Loss Matrix button appears only if a loss matrix has been specified in the .eviviz file (see Appendix G, "Format of the Evidence Visualizer's Data File").
File Requirements
The Evidence Visualizer requires a training set, as described in "Training Set" on page 322 of Chapter 10, "MineSet Inducers and Classifiers." Files are generated by extracting data from a source (such as a MineSet ASCII or binary file, or a table in an Oracle, INFORMIX, or Sybase database). The Evidence Visualizer data file is output as a result of running the Evidence Inducer. The format of this file, which has a .eviviz extension, is described in Appendix G. When starting the Evidence Visualizer or when opening a file, you must specify the data filename. To apply the generated classifier, you should have a dataset of records with the same attributes and types as those used by the classifier, except that the label need not be present.
Running the Evidence Inducer
There are two ways to run the Evidence Inducer:
•  From the Tool Manager
   Connect to the server and select a data source (see "Choosing a Data Source" in Chapter 3). From the File menu, choose Open New Data File. Log in to a server, and type /usr/lib/MineSet/data/iris.schema as the filename. You'll see four continuous attributes and one discrete attribute in the Data Transformation panel. Since there is only one discrete attribute, the label option automatically shows it. Select the Evidence Inducer, and ensure that you have selected the Classifier & Error mode. To run the inducer, click Go!.
   The Status window shows the progress and resulting statistics.
•  From the command line
   To induce an evidence classifier from the command line, refer to Appendix I, "Command-Line Interface to MIndUtil: Analytical Data Mining Algorithms."
Starting the Evidence Visualizer
There are six ways to start the Evidence Visualizer:
•  Run the Evidence Inducer from the Tool Manager under the Classify tab. After the inducer builds the classifier, it automatically invokes the Evidence Visualizer. See below for details about using the Tool Manager in conjunction with the Evidence Visualizer.
•  Use the Tool Manager to start the Evidence Visualizer from the Visual Tools menu. (See Chapter 3 for details on the Tool Manager's functionality, which is common to all MineSet tools.)
•  Double-click the Evidence Visualizer icon on your Indigo Magic desktop. The startup screen requires you to select a data file by choosing File > Open.
Figure 13-6 File > Open Menu Selection
•  If you know what configuration file you want to use, double-click the icon for that configuration file. This starts the Evidence Visualizer and automatically loads the configuration file you specified. This works only if the configuration filename ends in .eviviz (which is always the case for configuration files created for the Evidence Visualizer via the Tool Manager).
•  If you know what configuration file you want to use, drag its icon onto the Evidence Visualizer icon. This starts the Evidence Visualizer and automatically loads the configuration file you specified.
•  Start the Evidence Visualizer from the UNIX shell command line by entering this command at the prompt:
   eviviz [ dataFile ]
Here, dataFile is optional and specifies the name of the configuration file to use. If
you don't specify a configuration file, you then must use File > Open to specify one
(see Figure 13-6).
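For example, to open one of the sample configuration files shipped with MineSet (see
"Sample Files" later in this chapter) directly from the shell, you could enter:

eviviz /usr/lib/MineSet/eviviz/examples/cars.eviviz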
Options for Invoking the Evidence Visualizer
The -quiet option eliminates the dialogs that pop up to indicate progress. You can enable
this option permanently by adding the line
*minesetQuiet:TRUE
to your .Xdefaults file.
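For example, assuming the option is given on the command line before the configuration
filename, the cars sample file could be opened without progress dialogs as follows:

eviviz -quiet /usr/lib/MineSet/eviviz/examples/cars.eviviz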
Configuring the Evidence Inducer Using the Tool Manager
To access the options for configuring the Evidence Inducer, select the Mining Tools tab on
the Data Destination panel (Figure 13-7). From the subsequent tabs, select Classify.
Ensure that the inducer you select is Evidence. Your selections in the Mode menu
determine the options available in the Further Inducer Options menu. After you have
made your selections in these menus, click Go! to run the inducer, which, in turn, creates
the classifier.
Figure 13-7 Tool Manager With Data Destination Panel Showing Classifiers
Discrete Labels
The Discrete Labels menu provides a list of possible discrete labels. Discrete attributes
(binned values, character string values, or integers) can have a number of values. Select
a label attribute (see "Training Set" in Chapter 10). If there are no discrete attributes, the
menu shows No Discrete Label, and the Go! button is disabled. You then must create a
discrete attribute by binning or adding a new column using the Tool Manager's Data
Transformations panel.
Classifier Name
The generated classifier is named with the prefix of the session filename (as determined
in Tool Manager) and the suffix -evi.class. By default, all classifiers are stored on the
server. These classifiers can be used for future classification of unlabeled records; that is,
they can be used to predict the labels for unlabeled datasets (see "The Apply Model
Button" in Chapter 3).
Refining the Inducer With Further Options
Selecting Further Inducer Options causes the Inducer Options dialog box to appear (see
Figure 13-8). This dialog box consists of three panels:
The top panel shows the choices you made in the Tool Manager's Data Destination
panel. The type of Error Estimation is determined by the mode.
The bottom-left panel lets you specify further Inducer Options.
The bottom-right panel lets you specify the Error Estimation Options (unless the
mode you chose in the Data Destination panel was Classifier Only, in which case
this area is empty). The options shown in this panel depend on the type of Error
Estimation you chose (see "Error Estimation" in Chapter 10).
Figure 13-8 Classification Options Dialog Box Without Accuracy Estimate
Evidence Inducer Options
By choosing Further Inducer Options, you can fine-tune the Evidence inducer.
Laplace Correction
This biases the probabilities towards the average, thus avoiding extreme numbers
(such as 0 and 1). This means every chart in the Main Window has a non-zero slice
for each class. The fewer the records in a bin, the more it is changed towards the
average. If the Laplace correction is checked, and the factor is left empty or set to 0,
an automatic Laplace correction is applied, using a heuristic that applies a factor of
1/training-set-weight.
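As a rough sketch of the general idea behind Laplace correction (not the exact MineSet
formula), a corrected conditional probability can be computed as follows, where factor is
the correction factor described above:

# Rough sketch of Laplace-corrected probability estimation. Illustrative
# only; not the exact MineSet implementation.
def laplace_probability(class_weight, bin_weight, num_classes, factor):
    # class_weight: weight of records in this bin that have the given label
    # bin_weight:   total weight of records in this bin
    # factor:       Laplace correction factor, e.g. 1.0 / training_set_weight
    return (class_weight + factor) / (bin_weight + factor * num_classes)

# With no correction, a bin containing only one class gives probability 1.0;
# with a small factor the estimate is pulled slightly toward 1/num_classes.
print(laplace_probability(5, 5, 3, 0.0))   # 1.0
print(laplace_probability(5, 5, 3, 0.5))   # about 0.85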
Set Minimum Weight per Bin
The Evidence Inducer discretizes all continuous attributes. This option lets you
define the minimum number of instances per bin. The automatic setting has a
heuristic that sets this number based on the dataset size: the larger the dataset, the
greater the minimum number of records allowed in the bin, and the smaller the
width of the bin, in general. If your dataset is very large, you might obtain more
discrete ranges than you want. To reduce the number of bins, raise this value.
Automatic column selection
This applies a process that chooses only those columns that help prediction the
most. Because extra columns can degrade the predictive accuracy of the evidence
classifier, this process searches for a good subset of the columns automatically. Only
those columns found to be useful are used. This process can take a long time,
especially if there are many columns. It is useful for eliminating highly correlated
columns that could degrade accuracy. Automatic column selection and Boosting
cannot both be enabled, as together they would take far too long to complete.
Automatic column selection conducts a search for the best set of columns to
reduce the error of the classifier. The selection of these columns is done by
estimating the error of different attribute sets using the wrapper approach (see
Appendix K, "Further Reading and Acknowledgments"). Each feature subset is
evaluated by estimating the classifier's error using cross-validation. Columns are
added or removed based on the error estimates using a best-first search mechanism.
In the default mode, the search begins with an empty set of features. By selecting
the Backwards option, the search starts with the full set of columns; this is slower,
since larger models are initially built.
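The Python sketch below illustrates the general flavor of this kind of wrapper search,
using a simple greedy forward pass rather than MineSet's best-first search; the
error-estimation function and dataset are placeholders, not MineSet code.

# Simplified sketch of forward feature selection driven by cross-validated
# error estimates. Illustrative only; MineSet performs a best-first search
# and supplies its own error estimation.
def cross_validated_error(columns, data):
    """Placeholder: estimate the classifier's error using only `columns`."""
    raise NotImplementedError

def forward_selection(all_columns, data):
    selected = []
    best_error = 1.0
    improved = True
    while improved:
        improved = False
        best_column = None
        for column in all_columns:
            if column in selected:
                continue
            error = cross_validated_error(selected + [column], data)
            if error < best_error:
                best_error, best_column = error, column
                improved = True
        if best_column is not None:
            selected.append(best_column)
    return selected, best_error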
Working in the Evidence Visualizer's Panes
If you started the Evidence Visualizer without specifying a configuration file, the main
screen shows the copyright notice for the Evidence Visualizer. Only the File and Help
pulldown menus are available. To view all menus and controls in the Main Window,
open a configuration file. Use File > Open (see Figure 13-6) to see a list of configuration
files.
When a valid configuration file is specified, the two panes in the main screen display
graphics. For example, specifying cars.eviviz results in the output displayed in
Figure 13-9.
Figure 13-9 Evidence Visualizer Window for cars.eviviz
In the Main Window on the left, one row of cake charts appears for each attribute in the
dataset that the classifier is using. Each chart corresponds to a value for the attribute
associated with the row. In the Label Probability Pane on the right, a list of all class labels
appears under a large pie chart of the prior probability distribution. Note that the colors
of the slices correspond to the colors associated with each class label. The prior
probability represented by the pie shows the proportion of data with each class label.
Viewing Modes
Both of the Evidence Visualizer's window panes have two modes of viewing: grasp and
select.
Viewing Modes in the Label Probability Pane
The Label Probability Pane is located on the right of the Evidence Visualizer's Main
Window. The top two buttons of those aligned vertically between the panes toggle
between the grasp and select modes. Alternatively, the Esc key also toggles the viewing
mode for both panes.
In grasp mode, the cursor appears as a hand that lets you pan and scale the scene's size.
To move the display within the pane, press the middle mouse button and drag it in
the direction you want the display moved.
To enlarge the scene, press the left mouse button, and drag the mouse downward.
To shrink the scene, press the left mouse button, and move the mouse upward.
Viewing Modes in the Main Window
The Main Window is located on the left of the Evidence Visualizer's display window. The
top two buttons of those aligned vertically between the panes toggle between the grasp
and select modes. Alternatively, the Esc key also toggles the viewing mode for both
panes.
In grasp mode, the cursor appears as a hand, so you can pan, rotate, and scale the scene's
size. (The Label Probability Pane contains only 2D geometry; thus, rotation is disabled.)
To rotate the display, press the left mouse button and move the mouse in the
direction you want.
To move the display within the pane, press the middle mouse button, and drag it in
the direction you want the display moved.
To enlarge the viewpoint, simultaneously press the left and middle mouse buttons
and move the mouse downward. To shrink the viewpoint, simultaneously press the
left and middle mouse buttons, and move the mouse upward. This is equivalent to
the functions provided by the Dolly thumbwheel.
Selecting Items in the Label Probability Pane
In select mode, the cursor appears as an arrow. If you then click the button to the left of
one of the class labels, a white box appears around the button next to the label (see
Figure 13-10). The size of that slice (the probability of predicting that label value) appears
in the text output line at the top. To deselect a class label, click on it again. Moving the
mouse over the button next to a class label, in select mode, causes the size of that slice (in
percentage terms) to appear in the output line at the top.
Figure 13-10 Label Value Japan Selected Using the Cars Dataset
If no label is selected, the Main Window on the left displays cake or pie charts (see
Figure 13-10).
If a label is selected, the representation on the left displays bar charts (see Figure 13-10).
The height of each bar shows the evidence in favor of the selected label value. "Evidence
for" is the negative log of one minus the size of the slice matching the selected label in the
corresponding chart representation.
The grayness of the bars is based on the 95% confidence interval. This, in turn, depends
on the weight for that value. Hence, bars that are nearly gray have low weight and a large
confidence interval. The height of gray bars is not likely to be very accurate. Conversely,
the height and corresponding evidence value for a fully saturated bar can be relied on
because it is based on large weight, representing many records. The exact number of
records (weight) is reflected in the text output line when that bar is highlighted.
As the default, the amount of evidence common to all the labels is subtracted. This means
that the height of a bar for each value is reduced by the height representing the label for
which the evidence is smallest. If you select a different label, the bars and their colors
change to represent the new class label. Selecting the same label again deselects it, and
the Main Window again displays the cake or pie charts. Uncheck the View > Subtract
minimum evidence option if you do not want to subtract the common evidence.
If a Loss Matrix has been specified in the Further Inducer Options, a button labelled Use
Loss Matrix appears to the lower right of the probability pie. When it is selected (the default),
the Loss Matrix is used to adjust the probabilities shown. The largest slice shown is the
class that takes the Loss Matrix into account. To see what the probabilities would be
without using the Loss Matrix, deselect the Use Loss Matrix button. When using the Loss
Matrix, a gray slice may be present if the edited Loss Matrix contained a column for
predicting null. If the gray slice is the largest slice, the classifier predicts null.
Figure 13-11 Loss Matrix to Avoid Predicting Poisonous Mushrooms as Being Edible
Figure 13-11 shows a Loss Matrix that has been edited to reduce the likelihood of
predicting a mushroom edible when it is really poisonous. Figure 13-12 shows the initial
state of the Label Probability pane for the resulting Evidence Visualizer. Normally the
prior probability on the right shows slices of almost equal sizes for edible and poisonous
mushrooms. Using this Loss Matrix, however, we can be much more cautious about
predicting that a mushroom is edible.
Figure 13-12 Loss Matrix Applied to Probabilities in the Label Probability Pane
Selecting Items in the Main Window
In select mode, the cursor appears as an arrow. You can highlight an object (either a pie
chart or a bar) by moving the cursor over that object. Information about that object then
appears above the Main Window. The information is displayed as long as the cursor is
over the object.
If the object is a pie chart, then the message takes this format:
<attribute name>: <value or range>
weight = <weight>
Here, weight is the total weight of the data points that fall in that range or have that
value for that attribute (see Figure 13-13). The chart height is proportional to this
number. Unless record weighting is used, the weight shows record counts.
Figure 13-13 Pie Charts With the First Binned Range of weightlbs Highlighted
If the object is a bar, then the message takes this format:
(<attribute> = <value>) ==> Prob(<selected label>) = x%
[low%-high%] Evidence=z
<selected label> ==> Prob(<attribute> = <value>) = y% [low%-high%]
weight = <weight>
See Figure 13-14.
Here, x is the probability that a record has the selected label given that it has the
highlighted attribute value. The bracketed range, [low%-high%], gives the 95%
confidence interval. Similarly, y% is the probability that a record has the highlighted
attribute value given the selected label (see Figure 13-14). Note that the height of the
bar shows evidence, not probability. The amount of evidence, z, is directly related to
the bar heights. Evidence can be summed in order to determine which class is
predicted (unlike probability, which must be multiplied). Weight is the weight of
data points having that value.
Technically, "evidence for" is defined as

    -log(1 - P(A|L) / Σ_{i=1}^{N} P(A|L_i))

while "evidence against" is defined as

    -log(P(A|L) / Σ_{i=1}^{N} P(A|L_i))

Here, A is the attribute value, L is the selected label value, and N is the number of label
values. When computing the bar heights, a very small number is added inside the
parentheses of the above expressions to prevent the bars from becoming infinitely tall.
The word "for" or "against" in the Main Window has a box around it to indicate
that it may be clicked on. Click on the box to toggle the representation.
The height of the gray rectangular base (on which the bars stand) represents the
amount of evidence contributed by the prior probability. For example, if the label is
car cylinders, there are very few three-cylinder cars, so the base is low when evidence
for is showing, and high when evidence against is showing. You can add to this height
the height of individual bars that are on top.
"Evidence for" can be useful in determining which values are the most helpful in
predicting a particular label value.
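The following Python sketch shows how evidence values of this form can be computed
from the conditional probabilities. It is illustrative only; the small constant added inside
the logarithm stands in for the correction mentioned above, and the probabilities in the
example are hypothetical.

import math

# Illustrative sketch of the evidence computation described above.
# cond_probs[i] is P(A | L_i) for each label value; `selected` indexes L.
def evidence_for(cond_probs, selected, epsilon=1e-6):
    slice_size = cond_probs[selected] / sum(cond_probs)
    return -math.log(1.0 - slice_size + epsilon)

def evidence_against(cond_probs, selected, epsilon=1e-6):
    slice_size = cond_probs[selected] / sum(cond_probs)
    return -math.log(slice_size + epsilon)

# Hypothetical conditional probabilities for three label values:
probs = [0.7, 0.2, 0.1]
print(evidence_for(probs, 0), evidence_against(probs, 0))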
The amount of evidence (bar height) is not derived directly from either probability
shown while highlighting. Instead, the evidence depends on the conditional
probability relative to the other probabilities for all the other label values according
to the equation above.
Figure 13-14 Bar Chart With a Range Selected
You can also select any number of values from an attribute row by clicking the left mouse
button while the cursor is over one of the attribute values. This causes the object to be
drawn with a white bounding box surrounding it (see Figure 13-14). The large pie chart
in the Label Probability Pane on the right changes to reflect the items you select; it now
shows the posterior probability, given the attribute values just selected. Note that the
classes remain ordered, so the one corresponding to the largest slice is at the top of the
list on the right. The Evidence Visualizer arrives at the new posterior probability
distribution by multiplying the conditional probabilities for each attribute together, then
multiplying this result by the prior probability and normalizing to one.
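The following Python sketch illustrates this calculation. It is a simplified illustration of
the computation described above, not MineSet code, and the numbers in the example are
hypothetical.

# Simplified sketch of how the posterior distribution is obtained: multiply
# the prior by the conditional probabilities of each selected attribute
# value, then normalize so the result sums to one.
def posterior(prior, conditionals):
    # prior: list of P(L_i); conditionals: one list per selected attribute
    # value, each giving P(A = v | L_i) for every label value i.
    scores = list(prior)
    for cond in conditionals:
        scores = [s * c for s, c in zip(scores, cond)]
    total = sum(scores)
    return [s / total for s in scores]

prior = [0.33, 0.33, 0.34]
selected_values = [[0.05, 0.90, 0.05]]    # one selected attribute value
print(posterior(prior, selected_values))  # dominated by the second class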
This multiplication corresponds to a conditional independence assumption. When this
assumption is violated, and multiple values for attributes are chosen, the predicted class
probabilities are likely to be too extreme, although the final classification might be
correct. The estimated error shown in the Status window when you run the inducer can
help you determine how reasonable this assumption is. If the error rate/loss is low, the
assumption is reasonably robust in the domain.
Before clicking on a square chart, the Evidence Visualizer appears as shown in
Figure 13-1. This shows that given no additional information, there is an approximately
equal likelihood that an iris will be designated type iris-setosa, iris-versicolor, or
iris-virginica. If you click a chart for petal width .75 - 1.65, the pie on the right changes to
that shown in Figure 13-15. This indicates that if the petal width is between .75 and 1.65,
the iris probably belongs to the class iris-versicolor. You then can select additional values
to further change the distribution. The order in which you select charts or bars does not
matter.
Figure 13-15 Iris Dataset With the Value petal width .75 - 1.65 Selected
When a particular label has been selected in the Label Probability Pane, the Main
Window shows bars rather than cakes or pies for each value of an attribute. The title over
the bars reads "Evidence For." The box around the "For" indicates that it can be selected
(Figure 13-16).
Figure 13-16 Bars Showing Evidence For iris-virginica
Clicking the "For" in the "Evidence For" title toggles it to display "Against." As a result,
the bar heights change to show evidence against the label (Figure 13-17).
Figure 13-17 Bars Showing Evidence Against iris-virginica
Selecting bars has the same effect on the large probability pie in the Label Probability
Pane to the right as did selecting cakes or pies. The bar height indicates the amount of
evidence for or against the selected label contributed by that selected value. Since log
probabilities are used to represent evidence, the bar heights are added to accumulate
evidence (whereas probabilities must be multiplied).
External Controls
The external controls for the Evidence Visualizer are the same as those for the Map
Visualizer. For a description of these controls, see "External Controls" in Chapter 5.
Sliders
The Evidence Visualizer contains three sliders: Height Scale, Detail Slider, and Percent
Weight Threshold.
The Height Scale Slider (Figure 13-18), which is located in the upper left of the Evidence
Visualizer, scales the height of the charts and bars. You can use this slider to magnify
small differences.
Figure 13-18 Evidence Visualizer Height Scale Slider
The Detail Slider, located at the bottom right of the Evidence Visualizer window
(Figure 13-19), filters out attributes that are not as useful for classifying the selected label.
This quality, assigned a value between 0 and 100 by the inducer, is called importance. This
measure is on an absolute scale. To understand how importance is calculated, see
"Column Importance and Relation to Classifiers" on page 497. As the slider is moved to
the right, attributes that fall below the requisite importance value are removed from the
scene. If the attributes are sorted by importance (the default), then the ones at the bottom
are the first to be removed.
Figure 13-19 Evidence Visualizer Detail Slider
The Percent Weight Threshold Slider, located at the bottom right of the Evidence
Visualizer window (Figure 13-20), filters out values having counts less than the
percentage indicated by the slider (up to a maximum of 2%). This slider helps visualize
attributes that have a large number of values, many of which occur infrequently (and,
hence, are not as useful). For example, if an attribute has 101 values, removing values
with counts less than 1% of the total might remove all of the values, and must remove at least two.
Figure 13-20 Evidence Visualizer Percent Weight Threshold Slider.
Pulldown Menus
Five pulldown menus let you access additional Evidence Visualizer functions: File, View,
Selection, Nominal Order, and Help. If you start the Evidence Visualizer without
specifying a configuration file, only the File and the Help menus are available.
The File Menu
The File menu lets you open a new configuration file, reopen the current configuration
file, save an image of the current display, send an image of the display to the printer, start
the Tool Manager, or exit the Evidence Visualizer.
The View Menu
The View menu lets you control certain aspects of what is shown in the Evidence
Visualizer pane (Figure 13-21).
Figure 13-21 Evidence Visualizer's View Menu
This menu contains six options:
Show Window Decoration lets you hide or show the external controls around the
display window.
Sort By Importance lets you display the attributes sorted according to their
usefulness in classifying with respect to the chosen label. If this option is turned off,
then the attributes will appear in the same order they did under Current Columns
in the Tool Manager.
Subtract Minimum Evidence applies only when a label has been selected and the bars
are shown. With this option on (the default), the height that is the minimum over all
the label values is subtracted. This amount may be different for each value of each
attribute, but for a given attribute value, the amount subtracted is constant across
label values. Activating this option magnifies small differences by subtracting the
least common denominator among all the label values.
Show Nulls toggles the display of null values. Null values, if present, are shown as
the first value, offset slightly from other non-Null values.
Use Laplace Correction toggles the use of Laplace correction. If a specific Laplace
correction was specified in the Further Inducer Options dialog box, then that value
is used; otherwise, a default value is provided.
Use Landscape Viewer (or Use Examiner Viewer) switches to an alternative mode of 3D
navigation. To understand navigation using the Landscape Viewer, see "Navigating
With the Middle Mouse Button" in Chapter 5.
The Nominal Order Menu
The Nominal Order menu lets you control how values for nominal attributes are ordered
(Figure 13-22).
Figure 13-22 Evidence Visualizer's Nominal Order Menu
The three choices are:
Alphabetical implies values for nominal attributes are sorted from left to right, in
alphabetical order.
by Weight sorts values from left to right, with those having the largest number of
records appearing toward the left.
by Label Probability (the default) sorts the values of nominal attributes by the size of
the slices corresponding to one of the classes. If the label is a binned attribute, the
highest bin is used by default. If the label is nominal, then whatever class has the
largest slice in the prior probability pie is used by default. If a particular class is
selected, and then sort by label probability is requested, the selected class is used for
determining the ordering. In all cases, if there is a NULL value, it remains at the far
left.
The Selection Menu
The Selection menu allows drill-through to the underlying data (Figure 13-23). To do a
drill-through, first select some combination of values and/or a class, then choose one of
the two methods of drilling through to the underlying records.
Figure 13-23 Evidence Visualizer's Selection Menu
There are three menu items:
Show Original Data causes the records corresponding to the selected item to be
displayed in a record viewer.
Send to Tool Manager causes a filter operation to be inserted at the beginning of the
Tool Manager history. The actual expression used to do the drill-through is
determined by
the values selected in the Main Window, and
the class (optionally) selected on the right.
All selected values within a single attribute are ORed together. These terms for each
attribute are then ANDed along with the class selection, if present, to form a
drill-through expression that is used to do the filtering in Tool Manager (see the
sketch following this list). If nothing is currently selected, a warning message appears.
For Figure 13-24 the filter expression would be:
(`gross income`>60000.0) &&
((`occupation`==Protective-serv)||(`occupation`==Tech-support))
&&(`education`==Masters)
&&(`sex`==Female)
Complementary Drill Through uses the complement of the expression defined by the
selected objects for drill-through.
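As a rough sketch of how such a drill-through expression is assembled (illustrative only,
not MineSet code), the selected values within each attribute are ORed and the resulting
terms, plus any selected class, are ANDed:

# Illustrative sketch of assembling a drill-through filter expression.
# Values within an attribute are ORed; attributes and any selected class
# are ANDed. The selections below are hypothetical.
def build_filter(selections, class_term=None):
    terms = []
    for attribute, values in selections.items():
        ors = "||".join("(`%s`==%s)" % (attribute, v) for v in values)
        terms.append("(" + ors + ")" if len(values) > 1 else ors)
    if class_term:
        terms.append(class_term)
    return "&&".join(terms)

selections = {"occupation": ["Protective-serv", "Tech-support"],
              "education": ["Masters"]}
print(build_filter(selections, "(`sex`==Female)"))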
Figure 13-24 Filtered Adult Dataset With Multiple Selection
Sample Files
The following examples show cases in which classifiers might be useful. Each of these
examples is associated with a sample dataset provided with MineSet. By running the
inducer, you can generate the .eviviz files described below.
Note: The data files, which have a .schema extension, are located in /usr/lib/MineSet/data
on the client workstation. The classifier visualization files, which have a .eviviz extension,
reside on the client workstation in /usr/lib/MineSet/eviviz/examples.
Churn
Churn is when a customer leaves one company for another. This example shows what
causes customer churn for a telephone company. The data used to generate this example
is in /usr/lib/MineSet/data/churn.schema. The file
/usr/lib/MineSet/eviviz/examples/churn.eviviz shows the structure of the classifier induced
using the attribute churned as the label. The error rate for this classifier is 12%. 14.1% of
the records represent customers who churned. The two most important attributes,
total_day_minutes and total_day_charge, are clearly correlated. A more accurate
classifier can be induced if one of these attributes is removed first (the error rate becomes
11%). If you run the inducer after selecting Automatic Feature Selection from the Further
Inducer Options, the error rate drops to 10.5% using only 4 attributes (total day charge,
number of service calls, voice mail plan, and number of voice mail messages). All 29
customers who had a total_day_charge above 53.78 churned.
A high number of customer service calls is a predictor of churn. Many customer service
calls might indicate frustration in using complicated equipment or receiving unreliable
service. Customers with the International plan are also more likely to churn. The people
in some states were much more likely to churn than those in others; for example,
California and New Jersey have the most churn, Virginia the least. To see just those states
that have more than 2% of the total number of records, slide the % Weights Threshold
slider all the way to the right. This eliminates most of the values for state from the
display. If you also select Nominal Order > Weight, then the state with the most records,
West Virginia (WV), is left-most. Many of the attributes (at the bottom of the list) are not
useful in discriminating churn. Note that day_charge is a great predictor, but
night_charge is not.
Origin of Cars
The cars dataset contains information about different models of cars from the 1970s and
early 1980s. Attributes include weight, acceleration, and miles per gallon (mpg). The file
/usr/lib/MineSet/eviviz/examples/cars.eviviz shows the structure of the Evidence Classifier
induced for this problem. This file was generated by running the inducer on
/usr/lib/MineSet/data/cars.schema with the label set to origin (Japan, U.S., Europe) and the
cylinders column changed to type string. The cylinders were changed to type string in
order to see all values and avoid the automatic discretization.
If you have a dataset of car attributes, you might want to know what characterizes cars
of different origins. From the distribution of label values in the pie on the right we can
see that most cars in this dataset were made in the U.S. (62.5%) and a smaller number in
Japan (20.2%) and Europe (17.3%). Clearly brand is the best predictor of origin, since each
brand is associated with only one country of origin. For this reason, it has the highest
importance and is at the top of the list. By looking at the height of the pies, it can be seen
that many cars have four cylinders, most weigh less than 3000 lbs, and most can reach 60
miles per hour in less than 20 seconds but more than 13.
Look at the distribution of slices for individual attribute values. If a car has an engine size
>169 cubic inches, it is almost certainly made in the U.S.; it certainly was not made in
Japan. Other pies show that U.S. cars generally have six or eight cylinders, low miles per
gallon, high horsepower (over 134), heavy weight (over 2981 lbs), and fast acceleration.
Japanese cars have better gas mileage, three or four cylinders (and a few six cylinders),
and smaller engines. If you click Europe in the Label Probability Pane, you can see bars
representing evidence for a car being European. For example, five cylinders strongly
indicates that a car is European. The height of the corresponding pie, however, shows
that there were only three cars with five cylinders in the data. If a car's mileage is good,
there is much evidence for it being European. If a car's mileage is > 41, then there is an
83% chance that it's European. If a car is European, there is only a 10.4% chance that its
mileage is better than 41 mpg. But only 2% of Japanese cars, and no U.S. cars, have
mpg in this range, so Europe gets the most evidence.
Suppose you wanted to predict where a car came from knowing only that it got 40 mpg
and weighed 3000 lbs. Select the appropriate pies (or bars): mpg=30.95-41.15 and
weightlbs=2981.5+. The resulting probability distribution on the right shows 84% U.S.,
16% European. There is no possibility it is Japanese because there were no Japanese cars
in the training set with weightlbs>2981.5. If you run the inducer again with Laplace
correction turned on (with a value of .5), you get a different answer: 16% chance for
European, 82% chance for U.S., and a 2% chance for Japanese. This is because Laplace
correction prevents any slice in the pies from going completely to 0. Certainly, there is no
fundamental reason why the Japanese could not make a car that weighs more than
2981 lbs; hence, when the probabilities (pies) are multiplied together, the possibility of
predicting a Japanese car is not eliminated.
Gender Attribution
The adult dataset contains information about working adults. This dataset was extracted
from the U.S. Census Bureau. It contains data about people older than 16, with a gross
income of more than $100 per year who work at least one hour a week. You might want
to know how to characterize males and females. The file
/usr/lib/MineSet/eviviz/examples/adult-sex.eviviz shows the structure of the Evidence
Classifier induced for this problem. This file was generated by running the inducer on
/usr/lib/MineSet/data/adult.schema, with the label set to sex, after removing the relationship
column (which would have made the classifier trivial).
In the Evidence Visualizer, the Label Probability Pane shows that the prior probability of
working males is higher than that of females.
Marital status is the most important predictor of gender. If a worker is a
married-civilian-spouse there is a greater probability of being male. A worker who
is widowed and working, however, is much more likely to be female.
The second attribute listed shows occupation. Study this to learn which occupations
are popular with a particular gender. The various occupations are listed from left to
right in order of decreasing male dominance: Armed forces (100%), Craft-repair
(95%), Transport-moving (95%), and Farming-fishing (94%). Female trades are
Private-house-service (94%) and Adm-clerical (67%). By clicking on the button next
to Female in the Label Probability Pane, and then moving the mouse over
occupation = Adm-clerical, one can see that 23% of females have an Adm-clerical
job. Conversely, given that one's job is Adm-clerical, there is a 67% chance that the
gender is Female.
Suppose you wanted to find out the probability of being female given that a person
is widowed and has occupation = Adm-clerical. This can be done by clicking on the
values and reading 95% from the text at the top when you move the mouse over the
box next to Female (in select mode).
If the working class is either self-employed-incorporated or
self-employed-not-incorporated, the probability that the person is a male is higher.
Conversely, if the working class is state-gov, the conditional probability that the
person is a female is higher, but the posterior probability (after taking into account
the prior probability) is not higher (click it and look at the posterior probability on
the right). The size of the female slice increased by selecting state-gov, but not so
much that it would lead you to predict that a person was female, given only that
they worked for the state.
By rotating the view and looking at the heights of the charts, you can see that most
people work in private industry.
By looking at the gross-income attribute, you can see that the higher the income
range, the higher the probability of being male.
Education generally does not indicate much about gender, except for doctorate
degrees, where you are more likely to find males.
Different occupations have different distributions for males and females.
The race attribute shows that African-Americans have a higher percentage of
females working than the percentage of other races in the conditional probability.
Click the value to see that the posterior is about equal between males and females.
Males in this dataset work more hours per week than do females.
Salary Factors
If you have a dataset of working adults, you might want to find out what factors affect
salary. First bin gross_income into five bins, with thresholds at 10,000, 20,000, 30,000, and
60,000. Each record then has an attribute with one of five values. You can run a MineSet
classifier to help determine what factors influence salary. The file
/usr/lib/MineSet/eviviz/examples/adult-salary.eviviz shows the Evidence Classifier induced
for this problem. This file was generated by running the inducer on
/usr/lib/MineSet/data/adult.schema with gross_income divided into five bins using
user-specified thresholds.
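As a small illustration of this binning step (not MineSet code), a record's gross_income
can be assigned to one of five bins defined by these thresholds as follows; the bin labels
used here are hypothetical.

import bisect

# Illustrative sketch: assign gross_income to one of five bins using the
# user-specified thresholds mentioned above.
thresholds = [10000, 20000, 30000, 60000]
labels = ["<10000", "10000-20000", "20000-30000", "30000-60000", "60000+"]

def income_bin(gross_income):
    return labels[bisect.bisect_right(thresholds, gross_income)]

print(income_bin(45000))   # "30000-60000"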
The attributes in the Evidence Visualizer are ranked by importance; thus, relationship,
marital status, age, occupation, education, hours per week, and sex are considered most
important. Since the label is numeric, a continuous spectrum is used to assign colors to
each class. Red is assigned to the highest bin (60,000+). The class labels are listed in the
Label Probability Pane according to slice size. As you click on values in the Main
Window, the order of the class labels changes to keep the label for the largest predicted
class at the top.
Relationship shows that husbands and wives are likely to make more money than
unmarried workers or workers not in a family. Wives have slightly higher incomes
than husbands.
Marital status shows that most people are married (the second chart from the left is
tall). Married workers earn more money than unmarried people.
Age shows that age is a crucial factor. Until the age of 61, when many people retire,
the probability of making over $50,000 increases as workers get older.
Different occupations yield different probabilities. Executive and professional jobs
raise the evidence for making over $60,000 per year.
Education is an important factor. When considering just education, the highest
evidence for earning over $60,000 is given to workers whose educational level
includes a master's or doctoral degree, or matriculation from professional schools.
Hours per week show that the more hours worked, the higher the evidence for
earning more money.
Sex shows that being a female gives evidence for making less than $60,000 per year.
Adjust the Percent Weights slider to remove the values of native_country, education, and
occupation that have low weights.
Iris Classification
In this dataset, each record describes four characteristics of iris flowers: petal width, petal
length, sepal width, and sepal length. Each iris was further classified into the types
iris-setosa, iris-versicolor, or iris-virginica. The goal is to understand what characterizes
each iris type.
Before running a classifier, click the Column Importance tab in the Tool Manager's
Classifiers tab; then click Go! You obtain a ranking of the importance of the features: petal
width, petal length, and sepal length. You can map these to the axes in the Scatter
Visualizer, with the iris_type mapped to the color, and see the clusters.
The file /usr/lib/MineSet/eviviz/examples/iris.eviviz shows the structure of the Evidence
Classifier induced for this problem. This file was generated by running the inducer on
/usr/lib/MineSet/data/iris.schema.
In the Evidence Visualizer, we can see that petal length and petal width are excellent
discriminatory attributes, while sepal length and sepal width are not as good. Move the
importance threshold slider to the right to see that the sepal-based attributes disappear
first.
Mushroom Classification
The file /usr/lib/MineSet/eviviz/examples/mushroom.eviviz shows the structure of the
Evidence Classifier induced for this problem. This file was generated by running the
inducer on /usr/lib/MineSet/data/mushroom.schema.
The goal is to understand which mushrooms are edible and which ones are poisonous,
given this dataset. There are over 8000 records in this set; thus, running this inducer
might take several minutes. Note that under the default mode of the one-third holdout
for accuracy estimation, a third of the records are kept for testing.
Each mushroom has many characteristics, including cap color, bruises, and odor. The
Evidence Visualizer orders attributes by importance (that is, usefulness in predicting the
label). Odor and spore print color appear at the top of the list because the distributions
in the pies are most different from value to value for these attributes. Since all the attributes
in this dataset are nominal, all the values are sorted from left to right by how well they
predict edibility. You might want to order the values alphabetically or by weight
(prevalence). To do this, select the appropriate method from the nominal order menu.
You can see a characterization of poisonous mushrooms by changing the pointer to an
arrow (click the arrow icon at the top right of the main screen), then clicking the button
by that class label in the right pane. High bars are associated with values that indicate the
mushrooms are poisonous.
In the Evidence Visualizer, move the Detail slider to the right. The attributes with the
lowest importance are removed from the scene. The most important attribute by far is
odor, as its importance is 92; all other attributes have importance less than 48. Almost all
values are good discriminators, but if there is no odor (none), then there is a mix of both
classes. The Evidence Visualizer lets you see specic values that might be critical, even if
the attribute itself is not always important. For example, stalk_color_below_ring is not a
good discriminatory attribute because most of the time it takes on the value white. White
offers no predictive power because there are equal amounts of edible and poisonous
mushrooms with this value. When stalk_color_below_ring takes the value gray or buff,
it provides excellent discrimination, but there are very few mushrooms with these
values.
Party Affiliation
This dataset consists of voting records. The goal is to identify the party a congressperson
belongs to given data about key votes. The dataset includes votes for each member of the
U.S. House of Representatives on the 16 key votes identified by the Congressional
Quarterly Almanac (CQA). The CQA lists nine types of votes: voted for, paired for, and
announced for (these three are simplified to yes); voted against, paired against, and
announced against (these three are simplified to no); and voted present, voted present to
avoid conflict of interest, and did not vote or otherwise make a position known (these
three are simplified to an unknown disposition).
Before running a classifier, look at the 16 votes to see if you can perceive which features
are important. Then run the Evidence Visualizer. For this dataset, you might want to
order the values alphabetically, so that all "no" votes are on the left, undecided is in the
middle, and "yes" is on the right.
Some issues clearly define one's party affiliation. Democrats tended to vote for a
physician fee freeze and aid for El Salvador, while Republicans voted for adoption of a
budget resolution and aid to the Contras in Nicaragua.
Immigration was an issue not split along party lines; nevertheless, politicians had strong
positions on it because only 7 out of the 235 were undecided on this issue.
The file /usr/lib/MineSet/eviviz/examples/vote.eviviz shows the structure of the Evidence
Classifier induced for this problem. This file was generated by running the inducer on
/usr/lib/MineSet/data/vote.schema.
Breast Cancer Diagnosis
The breast cancer dataset contains information about females undergoing breast cancer
diagnosis. Each record represents a patient with attributes such as cell size, clump
thickness, and marginal adhesion. The final attribute is whether the diagnosis is
malignant or benign. The file /usr/lib/MineSet/eviviz/examples/breast.eviviz shows the
structure of the Evidence Classifier induced for this problem. This file was generated by
running the inducer on /usr/lib/MineSet/data/breast.schema.
In the Evidence Visualizer, you can see that sample_code_number was discretized into
one range that is equally split, meaning that it does not indicate whether the breast cancer
is benign or malignant.
Hypothyroid Diagnosis
The hypothyroidism dataset is similar to the one for breast cancer. The file
/usr/lib/MineSet/eviviz/examples/hypothyroid.eviviz shows the structure of the Evidence
Classifier induced for this problem. This file was generated by running the inducer on
/usr/lib/MineSet/data/hypothyroid.schema.
There are 3163 records in this dataset and most of them do not have hypothyroidism
(95.45%). While this means that one can predict negative and be correct with high
probability, it's those people who have hypothyroidism that we are most worried about.
In technical terms, the false negatives are very important.
Look at the pie for tsh between 6.35 and 27.5. It shows much evidence for
hypothyroidism. When you click on it, however, the posterior pie still predicts
negative because the prior probability for negative was so great.
This is a case where you might want to adjust the Loss Matrix to skew the posterior
probability toward predicting hypothyroidism in order to avoid false negatives. There
might be a high cost associated with predicting that someone is healthy when they
actually have the disease; predicting that they are sick when they are actually healthy
means that they take a more accurate test or a treatment they do not need.
In the Evidence Visualizer, you can see that fti is very important. The first two ranges
(besides the unknown) give a lot of evidence for hypothyroidism.
Pima Diabetes Diagnosis
This dataset is a diagnosis problem for diabetes, using statistics gathered from an Indian
tribe in Phoenix, Arizona. The task is to determine whether a patient has diabetes, given
some medical attributes, such as blood pressure, body mass, glucose level, and age.
The file /usr/lib/MineSet/eviviz/examples/pima.eviviz shows the structure of the Evidence
Classifier induced for this problem. This file was generated by running the inducer on
/usr/lib/MineSet/data/pima.schema.
In the Evidence Visualizer, you can see that many attributes are irrelevant by themselves.
As plasma_glucose increases, the probability of having diabetes increases. The number
of pregnancies is also a good indicator when it is high (above 6), as is age (above 27).
DNA Boundaries
The file /usr/lib/MineSet/eviviz/examples/dna.eviviz shows the structure of the Evidence
Classifier induced for this problem. This file was generated by running the inducer on
/usr/lib/MineSet/data/dna.schema.
There are 3,186 records in this DNA dataset. The domain is drawn from the field of
molecular biology. Splice junctions are points on a DNA sequence at which
superfluous DNA is removed during protein creation. The task is to recognize
exon/intron boundaries, referred to as EI sites; intron/exon boundaries, referred to as IE
sites; or neither. The IE borders are referred to as "acceptors" and the EI borders as
"donors." The records were originally taken from GenBank 64.1 (genbank.bio.net). The
attributes provide a window of 60 nucleotides. The classification is the middle point of
the window, thus providing 30 nucleotides at each side of the junction.
From the Evidence Visualizer, you can see that attributes near the center are chosen as
very important. Attributes further away from the splice junction are less important.
If you click and select the charts in the left pane corresponding to left_01: G and
left_02: A, then the pie chart in the label probability pane on the right will change to
show the probability distribution of each class as predicted by the evidence classifier.
Given these two values, the pie chart shows that the evidence model built assigns the
highest probability to intron/exon, followed by exon/intron and none.
The accuracy improves slightly if you invoke automatic feature selection, although
running time increases dramatically (sometimes hours). In such cases, run feature
selection once, and continue mining only with the chosen features.
Chapter 14
14. Inducing and Visualizing the Decision Table
This chapter discusses the features and capabilities of the Decision Table Visualizer,
provides an overview of this tool, and discusses the methods of inducing Decision
Tables. It then explains the Decision Table Visualizer's functionality when working with
the
Label Probability Pane
Main Window
Finally, it lists and describes the sample files provided for this tool.
Note: It is assumed that you have read Chapter 10, "MineSet Inducers and Classifiers,"
before proceeding with this chapter.
Overview
The Decision Table presents correlations between pairs of attributes, and can be used as
a visualizer alone, or as a classifier and visualizer, in the same manner as MineSet's other
inducers and visualizers. The underlying structure used for classification is a Decision
Tree which splits on a single attribute at each level of the tree hierarchy. Such a structure
is called an Oblivious Decision Tree. That attribute is rendered discrete in an identical
manner across the whole level. Accuracy is inhibited to a small degree by representing
the Decision Tree in this way, but the advantage is a uniform structure in a compact
tabular form, which can easily show correlations between pairs of attributes at adjacent
levels in the tree.
The method of classification used by Decision Tables is similar to that of a Decision Tree.
Classification is done by picking the majority class of the region in which the example is
found. If a record to be classified falls in a region where no training data occurred,
classification is done by picking the predominant class one level up in the table hierarchy.
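A minimal sketch of this classification rule, assuming the table is stored as a mapping
from paths of discretized attribute values to majority classes (a hypothetical
representation, not MineSet's internal one), might look like this:

# Minimal sketch of Decision Table classification as described above:
# look up the record's region level by level; if a region contains no
# training data, fall back to the prediction from one level up.
def classify(record, levels, majority, default_class):
    # levels:   list of (x_attr, y_attr) pairs, top level first
    # majority: dict mapping a tuple of discretized values along the path
    #           to the majority class of that region
    prediction = default_class
    path = ()
    for x_attr, y_attr in levels:
        path = path + (record[x_attr], record[y_attr])
        if path not in majority:
            break                     # no training data in this region:
        prediction = majority[path]   # keep the prediction from one level up
    return prediction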
An example Decision Table is shown in Figure 14-1 for the mushroom dataset. It is
hierarchical in structure like a Decision Tree, but breaks the data down by two attributes
at each level instead of one. Decision Tables do not assume that the attributes are
independent.
Figure 14-1 Decision Table for the Mushroom Dataset
You might, for example, create a Decision Table from the mushroom dataset to determine
mushroom edibility (see Figure 14-1). The Main Window (on the left) shows the top level of
detail for a Decision Table induced from the mushroom dataset. Odor and
spore-print-color were chosen to be at the top level because doing so improves the
accuracy of the underlying classifier. In this dataset, a dependence is shown between odor
and spore-print-color at the top level. There is only one top-level block with any
ambiguity, meaning more than one class is present. It is the block with odor=none and
spore-print-color=white. When you drill down to the next level (by clicking the right
mouse button over the chart) the attributes habitat and population are shown at the next
lower level (see Figure 14-2). The result of zooming in close to this drilled-down region
is shown in Figure 14-3.
Figure 14-2 Decision Table for the Mushroom Dataset, Showing Drill-Down
Figure 14-3 Mushroom Dataset Close-Up of odor=none and spore-print-color=white
The Decision Table Visualizer may be used to answer questions such as:
When you know the values for only a few attributes, what can you predict for a new
record? If the attributes for which you know the values appear in the top levels of
the hierarchy, then this question is answered precisely, instead of approximately, as
with the Evidence Visualizer.
How are the records distributed with respect to the label for the combination of
attribute values that you are interested in?
The prior probability for each class label is depicted in the pie chart in the Label Probability
Pane, on the right of the screen. The prior probability for a class label is the probability of
seeing this label in the data for a randomly chosen record, ignoring all attribute values.
Mathematically, this is the number of records with the class label divided by the total
number of records.
The probability distribution for each rectangular block or cake in the Main Window on the
left, shows the proportion of records in each class considering only records having that
particular combination of values. These probabilities are precise for the data given.
By default, values of nominal attributes are sorted by how well they predict one of the
classes. This helps identify important values. If the label is a binned attribute, the class
that is the highest bin is used. If the label is nominal, then the class with the largest slice
in the prior probability pie is used. When you select a particular class, and request a sort
by label probability (nominal ordering > sort by label probability), that class determines the
order of the result. Alternatively, the values of the nominal attributes can be sorted
alphabetically or by weight. Values of binned attributes are always shown in their
natural order.
When you drill down into a rectangular cake (using the right mouse button) the data
represented by that cake is redisplayed in a new matrix of cakes where the attributes
assigned to that level of detail are used as the axes. The cake charts sit on top of a gray
base. If the base is completely covered with cakes, then every combination of values at
this level is represented by the data. Often much of the base is uncovered. This shows
regions where no data exists. In a very sparse dataset there are large regions that are
empty. Picking or selecting a gray base has the same effect as selecting the cake that was
one level up in level of detail. Drilling down on a base causes every cake sitting on top of
it to expand to the next level of detail.
If an attribute has unknown (NULL) values, then the unknown values are denoted by a
question mark (?). The NULL always appears as the rst value if present and does not
get sorted with the rest. It is possible to toggle the display of NULL values using View >
Show Null Values.
If a cake is selected (left mouse click) on the left, then the probability pie on the right will
show a posterior probability distribution which exactly matches that shown by the cake.
When more than one cake chart on the left is selected (Shift-left mouse click), then the pie
on the right will show the probability distribution (with respect to the label) for the set of
records defined by the selected cakes. It is also possible to select a whole group of cakes
by selecting the gray base below them. This has the same effect as selecting the cake one
level up in the hierarchy.
Classes are listed under the pie chart on the right, and are in order of slice size. The class
with the largest probability is at the top. As values on the left are selected, this order
changes to reflect the changing probability pie. The class that would be predicted given
current selections is shown at the top. If the label is a binned attribute, the order of the
classes is not changed based on slice size. In addition, if the label is a binned attribute, colors
are assigned according to a continuous spectrum: the highest bin is red; otherwise,
random colors are used.
Inducing Decision Tables
A Decision Table classifier can be induced, or generated automatically, from data. The
data is made up of records, and a label associated with each record. (See Chapter 10,
"MineSet Inducers and Classifiers.")
The automatic induction of Decision Table classifiers is a process in which record counts
(or more generally record weights) are used to calculate the probabilities using the
following method:
1. All continuous attributes are separated into discrete bins, so that class distributions
in these ranges are as different as possible. The number of ranges is determined
automatically. You can override automatic binning for any attribute by explicitly
binning using Tool Manager.
2. The prior probabilities are the proportions of each class in the training set.
The number of cakes in any row along one of the axis (attribute) pairs shows the number
of discrete ranges produced by the inducer. If there is just one range, it means that this
attribute by itself was not useful in predicting the label. Initially, the prior probabilities
of the labels are displayed in the Label Probability Pane.
There are three ways to assign attributes to X and Y axes at every level in the Decision
Table hierarchy: manually, automatically, and automatically with feature search.
The classier can have a loss matrix. When the classier is applied, the loss matrix skews
the predictions. In the visualization, the loss matrix can be used to adjust the posterior
probability distribution.
File Requirements
The Decision Table Inducer requires the following files:
A data file (.data) consisting of rows of tab-separated fields.
A .dtableviz file containing the schema. The format of this schema is the same as
described in "The .schema File" in Appendix A.
A training set, as described in "Training Set" in Chapter 10. Files are generated by
extracting data from a source (such as a MineSet ASCII or binary file, or a table in an
Oracle, INFORMIX, or Sybase database).
To apply the generated classifier, you should have a dataset of records with the attributes
used by the classifier, except that the label need not be present.
Running the Decision Table Inducer
There are two ways to run the Decision Table inducer:
From the Tool Manager
Connect to the server and select a data source (see "Choosing a Data Source" in
Chapter 3).
From the File menu, choose Open New Data File. Log in to a server, and enter the
filename. For the example shown here, the filename entered is
/usr/lib/MineSet/data/mushroom.schema. In the Data Transformation panel you see 23
discrete attributes. Select the Mining Tools tab in the Data Destination panel and
click the Classify tab. Select edibility as the discrete label. Select the Decision Table
inducer, and ensure you have selected the Classifier & Error mode. To run the
Inducer, click Suggest, then click Go!.
The status window shows the progress and statistics, and the Decision Table
Visualizer is launched automatically. You can also interrupt the automatic attribute
suggestion process by selecting Cancel or Show Viz Now in the progress dialog.
Selecting Show Viz Now stops the current server computation and uses the
intermediate result to construct the Decision Table for the columns mapped so far.
Alternatively, you may manually map columns to the X and Y requirements as
described in the next section.
From the command line
To induce a Decision Table classifier from the command line, refer to step 6 in
"Starting the Decision Table Visualizer" on page 438.
Starting the Decision Table Visualizer
As with the other visualization tools, there are six ways to start the Decision Table
Visualizer:
1. Run the Decision Table Inducer from the Tool Manager under the Classify tab. After
the inducer builds the classifier, it automatically invokes the Decision Table
Visualizer. See below for details about using the Tool Manager in conjunction with
the Decision Table Visualizer.
2. Use the Tool Manager to start the Decision Table Visualizer from the Visual Tools
menu. (See Chapter 3 for details of the Tool Manager's functions, which are
common to all MineSet tools.)
3. Double-click the Decision Table Visualizer icon on your Silicon Graphics desktop.
The startup screen requires you to select a data file by choosing File > Open.
4. If you know what configuration file you want to use, double-click the icon for that
configuration file. This starts the Decision Table Visualizer and automatically loads
the configuration file you specified. This works only if the configuration filename
ends in .dtableviz. (Configuration files created for the Decision Table Visualizer with
Tool Manager all carry the .dtableviz suffix.)
5. If you know what configuration file you want to use, drag its icon onto the Decision
Table Visualizer icon. This starts the Decision Table Visualizer and automatically
loads the configuration file you specified.
6. Start the Decision Table Visualizer from the UNIX shell command line by entering
this command at the prompt:
dtableviz [dataFile]
Here, dataFile is optional and specifies the name of the configuration file to use. If
you don't specify a configuration file, you then must use File > Open to specify one.
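For example, entering dtableviz /usr/lib/MineSet/dtableviz/examples/adult.dtableviz at the shell prompt starts the visualizer with the adult example configuration already loaded (assuming the standard MineSet example files are installed in their default location).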
Configuring the Decision Table Inducer Using the Tool Manager
To access the options for configuring the Decision Table inducer, select the Mining Tools
tab on the Data Destination panel, shown in Figure 14-4.
Figure 14-4 Data Destination Panel in Tool Manager Showing Classifiers
From the tabs at the right, select Classify. Ensure that the inducer you select is Decision
Table. Your selections in the Mode and Inducer menus determine the options available
in the Further Inducer Options menu. After you have made your selections in these
menus, click Go! to run the inducer, which, in turn, creates the classifier. Or you can
simply click Suggest and then Go!
Discrete Labels
The Discrete Labels menu provides a list of available discrete labels selected from the
dataset you chose. Discrete attributes (binned values, character string values, or a few
integers) have a limited number of values. You should select a label attribute with few
values (two or three; see "Training Set" in Chapter 10). If there are no discrete attributes,
the menu shows No Discrete Label, and the Go! button is disabled. You then must create a
discrete attribute by binning or adding a new column, using the Tool Manager's Data
Transformations panel.
Classifier Name
The generated classifier is named with the prefix of the session filename (as determined
in Tool Manager) and the suffix -dtable.class. By default, all classifiers are stored on the
server in the file_cache directory, which defaults to mineset_files. These classifiers can be
used for future classification of unlabeled records; that is, they can be used to predict the
labels for unlabeled datasets (see "Applying a Model and Backfitting" in "Error
Estimation" in Chapter 10).
Exploring Data by Mapping Columns to Axes
You can explore the relationships between attributes in data by mapping columns to X
and Y axes. Select a column name in the Current Columns panel of the Data
Transformations pane of the Tool Manager window, then select an item in the X-axis or Y-axis
windows. The first two columns mapped are shown at the top level, and subsequent
mappings descend through the levels. If you map an odd number of attributes, the last
level of detail shows a single column of cakes corresponding to the values of that odd
attribute. If an interaction between two attributes is suspected, map one attribute to the
X axis and the other to the Y axis. If you have no idea which attribute to map to which
axis, simply click the Suggest and Go button.
There are two different ways to run Suggest, depending on whether the feature subset
search checkbox in the Further Inducer Options is turned on:
Checkbox Off runs Column Importance to find the features. Column Importance is
described further in Chapter 17, "Column Importance."
Checkbox On begins with Column Importance, then runs through all possible orders,
beginning with the order recommended by Column Importance. An error estimate
is run at every stage, and each option is explored. Predictably, this can be a tedious
process, and if you become impatient you can click Stop at any time. The longer the
feature search continues, the more accurate the resulting classifier.
Decision Table Options
Selecting Further Inducer Options causes the Inducer Options dialog box (Figure 14-5) to
appear. This dialog box consists of four panels:
The top panel indicates the choices you made in the Tool Manager's Data
Destination panel.
The second panel from the top lets you set the loss matrix and the weight attribute.
See "Loss Matrices" and "Weight Setting" in Chapter 10.
The bottom-left panel lets you specify further Inducer Options.
The bottom-right panel lets you specify the Error Estimation Options (unless the
mode you chose in the Data Destination panel was Classifier Only, in which case
this area is empty). The options shown in this panel depend on the type of Error
Estimation you chose (see "Applying Models, Testing Models, and Fitting New
Data" in Chapter 10).
Figure 14-5 Further Inducer Options
Decision Table Inducer Options
To fine-tune the Decision Table induction algorithm, you can change the following
Decision Table inducer options (see Figure 14-5).
Maximum size
This is particularly useful when you have a large dataset. By default, there is a limit
of 10000 nodes (these nodes correspond exactly to the cakes in the visualization).
Limit the size by clicking the checkbox and typing a number for the limit. Limiting
the number speeds up induction and limits the storage required. Although
restricting the size decreases the run time, it might increase the error rate.
Maximum Attributes
This option determines how many different columns you allow to participate in the
Decision Table when searching. Limiting the attributes simplifies the result and
speeds its arrival. This limit affects only columns added by the Suggest mode.
Columns added manually are not subject to the limit.
Minimum Weight per Bin
The Decision Table inducer discretizes all continuous attributes. This option lets
you define the minimum number of instances per bin. The automatic setting
calculates this number based on the dataset size: the larger the dataset, the larger
the bin size. If your dataset is very large, you might obtain more discrete ranges
than you want. To reduce the number of bins, raise this value.
Working in the Decision Table Visualizer's Main Window
If you started the Decision Table Visualizer without specifying a configuration file, the
main screen shows the copyright notice for the Decision Table Visualizer. Only the File
and Help pulldown menus are available. To view all menus and controls in the main
window, open a configuration file. Use File > Open to see a list of configuration files.
When a valid configuration file is specified, the two panes in the main screen display
graphics. For example, specifying /usr/lib/MineSet/dtableviz/examples/adult.dtableviz results
in the output displayed in Figure 14-6.
The Main Window pane on the left shows the Decision Table. The elements in this
Decision Table may be further subdivided into smaller and smaller cake charts (see
Figure 14-6). Each chart shows the same information as a pie chart, but with rectangular
slices, rather than wedges.
Figure 14-6 Decision Table Showing Classifier Induced From adult94 Dataset
In the Label Probability Pane on the right, a list of all class labels appears under a large
pie chart of the prior probability distribution. The color of the slices corresponds to the
color associated with each class label; the size of the slices shows the proportion of data
with each class label.
If you pick an object (place the cursor over it without clicking), you can see the values of
the two attributes at that level of detail. The weight of records represented is proportional
to the height of the cake. Picking a base (the gray block under the cake) shows the values
for the two attributes for the cake one level higher in the hierarchy, together with the
weight of records.
To see the values that define a particular chart, start by picking the base at the coarsest
level of detail, and continue through the next most detailed base until you reach the base
immediately below the chart of interest. The relevant value pair is then displayed at the
top for each level of detail.
At each level of detail the attribute names are shown to the left and bottom of the array
of charts; their values are shown to the right and top, respectively. If there is an odd total
number of attributes, the lowest level shows only one attribute.
Viewing Modes
Both of the Decision Table Visualizer's window panes have two modes of viewing: grasp
and select. Methods of viewing are the same as those for the Evidence Visualizer; see
"Viewing Modes" in Chapter 13, and "Viewing Modes in the Main Window" in
Chapter 13.
Selecting Items in the Label Probability Pane
In select mode, the cursor appears as an arrow. You can then select one of the class labels
by clicking the button to the left of it. Once a class label is selected, a white box appears
around the button next to the label (see Figure 14-6). The size of that slice (the probability
of predicting that label value) appears as text at the top of the display. To deselect a class
label, click on it again. When you move the mouse over the button next to a class label in
pick mode, the size of that slice (in percentage terms) appears in the top display.
If you select one of the classes in the Label Probability pane (on the right), and then select
a cake chart on the left, you see probabilities and confidence intervals for that class
shown in the top display, together with attribute values and weight. To select multiple
charts on the left, hold down the Shift key while clicking the left mouse button.
For example, choose the /usr/lib/MineSet/dtableviz/examples/adult.dtableviz dataset. Select
class 70,000+ on the right, then select a cake chart. The display at the top shows something
like:
marital-status = Husband; weight = 484
occupation = Exec-Manag; Prob(70000+) = 15% [12%, 18%]
The numbers in brackets, [12%, 18%], indicate a 95% confidence interval. That is, we
are 95% confident that the actual probability that a husband executive
manager earns over 70,000 is between 12% and 18%, assuming our dataset is a true
random sample of the desired population.
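As a rough illustration, a confidence interval of this kind can be approximated from the observed proportion and the record weight. The Python fragment below is only a sketch using the common normal approximation; the exact method MineSet uses to compute its intervals may differ:

    import math

    p, n = 0.15, 484      # observed proportion and record weight from the example above
    z = 1.96              # approximately 95% confidence
    half = z * math.sqrt(p * (1 - p) / n)
    print(f"[{p - half:.3f}, {p + half:.3f}]")   # roughly [0.118, 0.182]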
Selecting Items in the Main Window Pane
Select an object using the left mouse button. Select multiple objects using Shift-select, in
the same way you do for other tools. For example, you can select a number of cakes at
different levels of detail in the scene as shown in Figure 14-7. On the right the distribution
for the classes shows just the records described by the multiple selections.
Figure 14-7 Example of Making Multiple Selections
Drilling Down and Drilling Up
Drill down and drill up operate the same as in mapviz. (See "Drilling Down and Drilling
Up" in Chapter 6.) Drill down by clicking the right mouse button on one of the cakes. This
replaces the one cake with smaller cakes, which show the data broken down by the pair
of attributes for the next level down. If the cursor is positioned on the background,
clicking the right mouse button drills down or up globally on all cakes. If the cursor is positioned
on a gray base, clicking the right mouse button drills down on all cakes on that base. To
drill up, click the middle mouse button when over a cake, a gray base, or the background.
External Main Window Controls
The external controls for the visualizer associated with the Decision Table classifier are
the same as those for the Map Visualizer. For a description of these controls, see "External
Controls" in Chapter 5.
Sliders
The Decision Table Visualizer contains three sliders: Height Scale, Detail Slider, and
Percent Weight Threshold. The function of these sliders is the same as for the Evidence
Visualizer. See "Sliders" in Chapter 13 for an explanation.
In using the Percent Weight Threshold slider, the rectangular slices atop each cake show
the distribution of the classes for the records represented by that cake. The height
corresponds to the weight of the training set records that have reached this node (this is
the number of records if weight was not set). In general, the higher the weight, the more
reliable the class distribution shown on the cake.
Pulldown Menus
Four pulldown menus let you access additional Decision Table Visualizer functions: File,
View, Selection, and Help. If you start the Decision Table Visualizer without specifying a
configuration file, only the File and the Help menus are available.
The File Menu
The File menu lets you open a new configuration file, save under a new name, print an
image of the current display, start the Tool Manager, reopen the current configuration file,
or exit the Decision Table Visualizer.
The View Menu
The View menu lets you control certain aspects of what is shown in the Main Window
(Figure 14-8).
Figure 14-8 Decision Table Visualizer's View Menu
This menu contains three options:
Show Window Decoration lets you hide or show the external controls around the
main window.
Show Nulls toggles the display of null values. Null values, if present, are shown as
the first value, offset slightly from other non-Null values.
Use Landscape Viewer (or Use Examiner Viewer) switches to an alternative mode of 3D
navigation. To understand navigation using the Landscape viewer, see "Navigating
With the Middle Mouse Button" in Chapter 5. Note that since the middle mouse
button is used for navigation in the Landscape viewer, to drill up you must click the
middle mouse button while holding down the Ctrl key.
The Nominal Order Menu
The Nominal Order menu lets you control how values for nominal attributes are ordered
(Figure 14-9).
Figure 14-9 Decision Table Visualizer's Nominal Order Menu
The three choices are:
Alphabetical means that the values of nominal attributes are sorted from left to right (or
bottom to top) alphabetically.
Weight sorts values from left to right, with those having the largest weight of
records appearing toward the left.
Label Probability (the default) sorts the values of nominal attributes by the size of the
slices corresponding to one of the classes. If the label is a binned attribute, the
highest bin is used by default. If the label is nominal, then whatever class has the
largest slice in the prior probability pie is used by default. If a particular class is
selected, and then sort by label probability is requested, the selected class is used for
determining the ordering. In all cases, if there is a NULL value, it remains the first
value.
The Selection Menu
The Selection menu allows drill-through to the underlying data (Figure 14-10). To do a
drill-through, first select some combination of values and/or a class, then choose one of the
two methods of drilling through to the underlying records.
Figure 14-10 Decision Table Visualizers Selection Menu
There are three menu items:
Show Original Data causes the records corresponding to the selected item to be
displayed in a record viewer.
Send to Tool Manager causes a filter operation to be inserted at the beginning of the
Tool Manager history. The actual expression used to do the drill-through is
determined by
the cakes selected in the Main Window, and
the class (optionally) selected on the right.
Selected cakes are ORed together and then ANDed with the selected class, if
present, to form a drill-through expression to do the filtering in Tool Manager. If
nothing is selected, a warning message appears.
For example, for Figure 14-10, the filter expression corresponding to the selections is:
(`gross income`>70000.0) &&
(
((`occupation`==Tech-support)&&(`education`==Doctorate)) ||
(
(`occupation`==Prof-specialty)&&(`education`==Doctorate)&&
(`sex`==Female)&&(`race`==Amer-Indian-Eskimo)
)
)
The three major terms in this expression correspond to the selected class on the
right, the selected cake at the top level, and the selected cake at the second level,
respectively. (A schematic sketch of how such an expression can be composed from
the selections appears after this list.)
Complementary Drill Through uses the complement of the expression defined by the
selected objects for drill-through.
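The following Python fragment is only a conceptual sketch of how the Send to Tool Manager drill-through expression described above could be composed from the current selections; the data structures and helper code are hypothetical, not MineSet internals:

    # Each selected cake is a list of (attribute, value) conditions;
    # the optionally selected class contributes one more condition.
    selected_cakes = [
        [("occupation", "Tech-support"), ("education", "Doctorate")],
        [("occupation", "Prof-specialty"), ("education", "Doctorate"),
         ("sex", "Female"), ("race", "Amer-Indian-Eskimo")],
    ]
    selected_class = "(`gross income`>70000.0)"   # or None if no class is selected

    # AND the conditions within each cake, OR the cakes together,
    # then AND the result with the selected class, if any.
    cake_terms = ["(" + "&&".join(f"(`{a}`=={v})" for a, v in cake) + ")"
                  for cake in selected_cakes]
    expr = "(" + " || ".join(cake_terms) + ")"
    if selected_class:
        expr = selected_class + " && " + expr
    print(expr)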
The Help Menu
The Help menu provides access to six help functions. These are the same for all MineSet
tools (see "The Help Menu" in Chapter 6).
Sample Files
The following examples show cases in which the Decision Table can be useful. Each of
these examples is associated with a sample data file provided with MineSet. By running
the Decision Table inducer (with Suggest using feature search turned on in the Further
Inducer Options), you can generate the -dtab.dtableviz and -dtab.dtableviz.data files
described below.
Note: The data (.data) and accompanying schema (.schema) files are located in
/usr/lib/MineSet/data on the client workstation. The classifier visualization files, which
have a -dtab.dtableviz extension, reside on the client workstation in
/usr/lib/MineSet/dtableviz/examples.
Churn
Churn is when a customer leaves one company for another. This example shows what
causes customer churn for a telephone company. The data used to generate this example
is in /usr/lib/MineSet/data/churn.schema.
The file /usr/lib/MineSet/dtableviz/examples/churn-dtab.dtableviz shows the structure of the
classifier induced using the attribute churned as the label. The error rate for this classifier
is 5.5%. Of the records, 14.3% represent customers who churned. The two attributes
selected for the first level of detail were number_of_customer_service_calls and
total_day_charge. By looking at the distribution over these two attributes, you can see
that churn increases as total day charge increases, except when the total day charge is
less than 29.75, in which case churn is high if the number of service calls is greater than 3.
About 3/4 of the records have total day charge less than 38 and 3 or fewer customer
service calls.
Figure 14-11 Drilling Down on the Churn Dataset
Begin to drill down on regions where it's not clear when a customer churns. Figure 14-11
shows drilling down on all cakes in which there was not a clear majority class. The next
attributes considered are international_plan and number_of_vmail_messages. Among
those with heavy day charge, it appears clear that having the international plan and
having few vmail messages correlates well with customer churn. Selecting the lower
right cake in each of the drill-down regions, and then brushing with the mouse across the
box next to churned=yes in the probability pane on the right, shows that only 3.4
percent of the customers in the selected regions churned.
Now drill down further on the cake in the upper left of each previous drill-down region.
Doing so shows a very similar distribution for each. The consistent pattern shows: high
total_evening_charge and high total_international_charge correlate well with churn. You
can even drill down another level to see the effect of total_international_calls, but doing
so leaves so few records from which to draw conclusions that you could not be confident of
the distributions seen. If you are interested in how total_international_calls affects churn
and how it correlates with other variables, return to the Tool Manager and explicitly map
total_international_calls to a higher level in the hierarchy, and rerun the decision table.
Although state is fairly well correlated with churn, it was not selected because the
algorithm has a built-in preference for variables with few values. This prevents the
algorithm from selecting attributes like social security number, which uniquely identify
each record, thus yielding high training set accuracy, but are not useful for classifying
future unlabeled data.
Origin of Cars
The cars dataset contains information about different models of cars from the 1970s and
early 1980s. Attributes include weight, acceleration, and miles per gallon (mpg). The file
/usr/lib/MineSet/dtableviz/examples/cars-dtab.dtableviz shows the structure of the Decision
Table Classifier induced for this dataset. This file was generated by running the inducer
on /usr/lib/MineSet/data/cars.schema with the label set to origin (Japan, U.S., Europe).
Since the brand attribute uniquely determines the origin, the structure of the classifier is
extremely simple. The only two attributes shown are brand and cylinders. It is interesting
to see which brands tend toward high or low cylinder types. For example, there are 21
different models of Mazda, and 18 models of Honda, but they all have 5 or fewer
cylinders. Conversely, Cadillac only makes cars with six or more cylinders.
This example could probably be made more interesting by first removing the brand
attribute. Another useful transformation might be to convert cylinders to string so each
unique cylinder value is shown, rather than a bin. Alternatively, one can create
additional levels of detail beyond brand and cylinder, by mapping them explicitly.
Gender Attribution
The adult dataset contains information about working adults. This dataset was extracted
from the U.S. Census Bureau. It contains data about people older than 16, with a gross
income of more than $100 per year who work at least one hour a week. You might want
to know how to characterize males and females.
The file /usr/lib/MineSet/dtableviz/examples/adult-sex-dtab.dtableviz shows the structure of
the Decision Table Classifier induced for this problem. This file was generated by
running the inducer on /usr/lib/MineSet/data/adult.schema, with the label set to sex, after
removing the relationship column (which would have made the classifier trivial). To
make it easier to see the distribution of records for each combination of values, you can
scale the cake heights using the scale slider on the left.
In the Decision Table Visualizer, the Label Probability Pane shows that the prior
probability of working males is higher than that of females. The Evidence Visualizer
showed us that marital_status and occupation are very important attributes for
determining gender; however, it did not show us the dependencies between these two
attributes. The top level shows several interactions. For example, most people with
occupation craft repair are married civilian spouses (more specifically, 98.6% of them are
husbands), while most people with occupation Other-service are Never-married
(48% male).
At first it may seem odd that most of the people with marital_status =
Married-civilian-spouse are male, but once you consider that this data was probably
gathered from tax returns, it seems reasonable that the wives of these males are not
working but filing jointly with their husbands.
The divorce rate seems highest among those with occupation = Admin-clerical. Those
in Other-service also have high divorce rates, but they seem to prefer separation, as the
number who are separated is even greater than that of Admin-clerical.
Suppose you wanted to find out the probability of being female given that a person is
widowed and has occupation = Adm-clerical. In the Evidence Visualizer one can get
an approximate answer (94.7% female) for this by clicking on these two attribute values.
Here we can get the exact answer by clicking the left mouse button on the cake at the intersection
of these two values (95.2% female).
Drill down to the lowest level on the cake for Married-civilian-spouse and
Occupation = Unknown. There is a pattern evident here more than in any of the other
combinations of occupation and marital status. For this cake, the younger members tend
to be women, and the older members tend to be men.
Salary Factors
For a dataset of working adults, you might want to find out what factors affect salary.
First bin gross_income into two bins, those that earn less than 50,000, and those earning
over 50,000. You can run a MineSet classifier to help determine what factors influence
salary. The file /usr/lib/MineSet/dtableviz/examples/adult-salary-dtab.dtableviz shows the
Decision Table classifier induced for this problem. This file was generated by running the
inducer on /usr/lib/MineSet/data/adult.schema with gross_income divided into five bins
using user-specified thresholds.
Since the label is numeric, a continuous spectrum is used to assign colors to each class.
Also the classes in the probability pane on the right are not sorted by slice size because
they have a numeric order. Red is assigned to the highest bin (50,000+).
The two attributes chosen at the top of the hierarchy are relationship and
education_num. The attribute education_num is not particularly useful because it is
simply an enumeration of the different educations possible, not years of education, as
you might think. However there is an approximate correlation. Replace this column with
education if you prefer to see the actual string values. If you simply remove the column,
education_num, and rerun using feature search, the algorithm may not pick education at
the top of the hierarchy because it has so many values.
The order of the attributes in this model was selected automatically to increase accuracy.
Often a model in which domain knowledge is used to perform the mappings can give a
more useful visualization. Such a model is provided in
/usr/lib/MineSet/dtableviz/examples/adult-salary3-dtab.dtableviz, and shown in Figure 14-12.
Here the salary has first been binned into three ranges (20,000 and 60,000 are the thresholds).
The attributes mapped at level one are relationship and sex; level two has education and
occupation; and level three has hrswk (hours worked per week) and age.
Figure 14-12 Decision Table Visualizer Using the Adult Dataset
At the top level we know relationship and sex have a strong correlation. Of course
we expect all husbands to be male, and wives to be female, but we can see right away this
is not the case. By brushing on the cake for male wives we see that there are 3 of them
and all their salaries fall in the 20,000-60,000 range. Drilling down on these cakes reveals
more information about these anomalous records. You may wish to select these cakes
(using the left mouse button) and drill through to the underlying data so you can find
the values of all the other fields.
Click the right mouse button on the background; this drills down globally to the next level.
Every cake is now replaced with a matrix for every combination of education and
occupation. The ordering is the same for every matrix, and overall the ordering is by
correlation with income. If you choose Nominal Order > Weight from the pulldown
menu, the overall ordering will be by record weights. The most popular occupations and
education levels will appear in the lower left corner of each matrix. High school grads is
the most prevalent education level, and Professional-specialty is the most common
occupation, but there are not many high school graduates whose occupation is
professional specialty.
Figure 14-13 Closer Inspection of the Adult Dataset
Returning to the ordering of nominals by income, you can see distinct distributions for each
combination of sex and relationship. There is not much difference between males and
females whose relationship is not-in-family. The difference between unmarried males
and females, however, is very pronounced. (See Figure 14-13.) There is a very distinctive
cluster of red cakes in the lower left of the male matrix that does not exist in the female
matrix. By scaling up the heights somewhat, you will notice that the female matrix has
obvious spikes at occupation = admin clerical and other service. No such spikes are
visible in the corresponding male matrix.
Click with the right mouse button on the most populous cake (male husbands). This
operation may take a minute to perform because the visualizer needs to construct all the
geometry for the next level for this cake. The geometry is constructed on demand because
the time needed to create it all at the beginning would be excessive, and wasteful since
the user rarely explores most of the high detail regions. If you click on the background
by mistake, you may be forced to wait a very long time if the amount of detail at the next
level is very large, as it is in this case. Figure 14-12 shows the result of drilling down on
the male husband base.
Consider the many age by hrswk distributions that are displayed for every combination
of occupation and education. The first surprising fact is that, in spite of the many
hundreds of cakes shown, there is a single spike that accounts for 2.5% of all husbands!
(See Figure 14-13.) If you had to pick characteristics for a typical husband, it would be
reasonable to say he is a HS-grad doing craft-repair, aged between 41 and 59 and working
between 38 and 41 hours a week.
Compare the salary distributions for husbands who are HS-grads in sales with those who
are HS-grads and executive managers. Although the distribution of age and hours
worked is similar, the probability of being in the income greater than 60,000 class is 34%
for this group of managers, compared with 27% for the salesmen. To see these
probabilities shown at the top, first click the button next to the 60000+ income class in the
label probability pane on the right, then brush over cakes on the left.
Iris Classification
In this dataset, each record describes four characteristics of iris flowers: petal width, petal
length, sepal width, and sepal length. Each iris was further classified into the types
iris-setosa, iris-versicolor, or iris-virginica. The goal is to understand what characterizes
each iris type.
Before running a classifier, click the Column Importance tab in the Tool Manager's
Classifiers tab; then click Suggest, then Go!. You obtain a ranking of the importance of the
features: petal_width, petal_length, and sepal_length. You can map these to the axes in
the Scatter Visualizer, with iris_type mapped to color, and see the clusters.
The file /usr/lib/MineSet/dtableviz/examples/iris-dtab.dtableviz shows the structure of the
Decision Table Classifier induced for this problem. This file was generated by running
the inducer on /usr/lib/MineSet/data/iris.schema.
In the Decision Table Visualizer, we can see that petal_width is an excellent
discriminatory attribute. When you add sepal_width, you see that all instances of
iris_versicolor appear in the sepal_width < 3.05 bin for those records which have
petal_width = 0.75-1.65.
Drill down on the three cakes which are not 100% pure. The top two cakes each contain
a single instance of iris-versicolor, which prevents them from being pure. For the
sepal_width < 3.05 cake it is very difficult to isolate the anomalous iris-versicolor. For
the sepal_width > 3.05 cake, however, the iris_versicolor is isolated by using petal_length.
Mushroom Classification
The file /usr/lib/MineSet/dtableviz/examples/mushroom-dtab.dtableviz shows the structure of
the Decision Table Classifier induced for this problem. This file was generated by
running the inducer on /usr/lib/MineSet/data/mushroom.schema.
The goal is to understand which mushrooms are edible and which ones are poisonous,
given this dataset. There are over 8000 records in this set; thus, running this inducer
might take several minutes. Note that under the default mode of the one-third holdout
for accuracy estimation, a third of the records are kept for testing.
Each mushroom has many characteristics, including cap-color, bruises, and odor. In the
Decision Table Visualizer, odor and stalk-shape appear at the top level. Note that odor
alone does an excellent job of discriminating edibility. Only when the odor = none and
the stalk-shape is enlarging is there any ambiguity. So naturally we drill down on this
lone cake. Now we see just the records with these two values broken down by their values
for bruises and gill-size. Notice the interaction between gill-size and bruises. This
interaction is nearly impossible to discern using any other classifier.
Since all the attributes in this dataset are nominal, all the values are sorted by how
well they predict edibility. You might want to order the values alphabetically or by
weight (prevalence). To do this, select the appropriate method from the Nominal Order
menu. If you considered either bruises or gill_size alone, you would not be able to predict
large classes of completely edible or poisonous mushrooms, but by considering them
together, we see that if there are no bruises and the gill-size is broad, then all 814
mushrooms of this type are edible. Conversely, if there are bruises and the gill-size is
narrow, then all 11 mushrooms of this type are poisonous. To disambiguate the other two
cases we would have to drill down further.
There is actually a mapping which can completely discriminate among all the examples
in our training set using only two levels. The attribute pairs for these two levels are odor and
spore_print_color, then habitat and population. The reason the automatic suggestion did not
use this mapping is the algorithm's preference for attributes with fewer
values.
In the Decision Table Visualizer, move the % Weight Threshold slider to the right.
Eventually the odor = musty cakes are deleted from the scene. The reason for this is that
fewer than 1% of the records have odor = musty.
Party Affiliation
This dataset consists of voting records. The goal is to identify the party a congressperson
belongs to, given data about key votes. The dataset includes votes for each member of the
U.S. House of Representatives on the 16 key votes identified by the Congressional
Quarterly Almanac (CQA). The CQA lists nine types of votes: voted for, paired for, and
announced for (these three are simplified to yes), voted against, paired against, and
announced against (these three are simplified to no), voted present, voted present to
avoid conflict of interest, and did not vote or otherwise make a position known (these
three are simplified to an unknown disposition).
Before running a classifier, look at the 16 votes to see if you can perceive which features
are important. Then run the Decision Table Visualizer. For this dataset, you may wish to
order the values alphabetically, so that all yes votes appear in the upper right and no
votes appear in the lower left.
At the top level we see synfuels_corporation_cutback and physician_fee_freeze. There is
a fascinating relationship between these variables that would be next to impossible to point
out with any other model. Everyone (but 3) who voted against physicians was a
Democrat. Nearly every Democrat that voted against physicians also voted against the
synfuels cutback (only 3 of 206 did not fit this pattern). Surprisingly, nearly every (all but
5) Republican that voted for the physicians voted against the synfuels cutback. This odd
relationship between these very different issues hints that these bills may have been
connected in ways that would require further investigation.
Most of the cakes at the top level are nearly pure except for the middle one (which
contains only 6 records) and the one where the representatives voted yes on both issues.
Here we can drill down another level to discriminate the political affiliation of the
representatives in this group. Doing so uses anti-satellite test ban and adoption of the
budget resolution to further discriminate among the 55 representatives in this group.
The file /usr/lib/MineSet/dtableviz/examples/vote-dtab.dtableviz shows the structure of the
Decision Table Classifier induced for this problem. This file was generated by running
the inducer on /usr/lib/MineSet/data/vote.schema.
Breast Cancer Diagnosis
The breast cancer dataset contains information about females undergoing breast cancer
diagnosis. Each record represents a patient with attributes such as cell size, clump
thickness, and marginal adhesion. The final attribute is whether the diagnosis is
malignant or benign. The file /usr/lib/MineSet/dtableviz/examples/breast-dtab.dtableviz
shows the structure of the Decision Table Classifier induced for this problem. This file
was generated by running the inducer on /usr/lib/MineSet/data/breast.schema.
In the Decision Table Visualizer, mitosis and uniformity of cell shape are shown at the top
of the hierarchy. If both attributes have low values at the same time, the given sample is
99.2% likely to be benign. On the other hand, 100% of the training records that had high
values for both were malignant.
Drill down on the four cakes that are not so pure. Now marginal adhesion and
bare-nuclei are used to discriminate. There are far fewer records in each cake at this level;
as a result, there is more noise, and trends are more difficult to detect. High values for
both marginal-adhesion and bare-nuclei seem to contribute to malignancy, but it's
uncertain. Note that the first value of bare-nuclei is null. The distributions for these null
cakes are more suspect than others, so you may wish to hide them by unchecking View >
Show Nulls.
If you drill down globally two more levels, you can note a few interesting features. The
cakes get very small, and there are large regions of the multi-dimensional space which
are empty. There are a few tiny regions where many records are clustered. There is one
huge spike (100% benign) where all the values are low. This spike alone accounts for
about 20% of the data.
Hypothyroid Diagnosis
The hypothyroidism dataset is similar to the one for breast cancer. The file
/usr/lib/MineSet/dtableviz/examples/hypothyroid-dtab.dtableviz shows the structure of the
Decision Table Classifier induced for this problem. This file was generated by running
the inducer on /usr/lib/MineSet/data/hypothyroid.schema.
There are 3,163 records in this dataset, and most of them do not have hypothyroidism
(95.45%). While this means that one can predict negative and be correct with high
probability, it's those people who have hypothyroidism that we are most worried about.
In technical terms, the false negatives are very important.
This is a case where you might want to adjust the loss matrix to skew the posterior
probability toward predicting hypothyroidism in order to avoid false negatives. There
might be a high cost associated with predicting that someone is healthy when they
actually have the disease; predicting them sick when they are actually healthy means
they take a more accurate test or a treatment they do not need.
Using the Decision Table Visualizer on this dataset, we note:
If fti is null, then tbg is not null and t3 is almost always null (drilling down one more level).
Except for 2 instances, the only time t4u is null is when fti is null.
Hypothyroidism is more prevalent for lower values of fti.
Pima Diabetes Diagnosis
This dataset is a diagnosis problem for diabetes using statistics gathered from an Indian
tribe in Phoenix, Arizona. The task is to determine whether a patient has diabetes, given
some medical attributes, such as blood pressure, body mass, glucose level, and age.
The file /usr/lib/MineSet/dtableviz/examples/pima-dtab.dtableviz shows the structure of the
Decision Table Classifier induced for this problem. This file was generated by running
the inducer on /usr/lib/MineSet/data/pima.schema.
Using the Decision Table Visualizer, we note:
At the second level of detail you can see that few women under 28 have more than 6
pregnancies (only 4 do). You may wish to drill through to these records to get all the
information available for these 4.
High values of plasma glucose and body mass indicate a greater likelihood of diabetes.
DNA Boundaries
The file /usr/lib/MineSet/dtableviz/examples/dna-dtab.dtableviz shows the structure of the
Decision Table Classifier induced for this problem. This file was generated by running
the inducer on /usr/lib/MineSet/data/dna.schema.
There are 3,186 records in this DNA dataset. The domain is drawn from the field of
molecular biology. Splice junctions are points on a DNA sequence at which
superfluous DNA is removed during protein creation. The task is to recognize
exon/intron boundaries, referred to as EI sites; intron/exon boundaries, referred to as IE
sites; or neither. The IE borders are referred to as acceptors and the EI borders are
donors. The records were originally taken from GenBank 64.1 (genbank.bio.net). The
attributes provide a window of 60 nucleotides. The classification is the middle point of
the window, thus providing 30 nucleotides at each side of the junction.
From the Decision Table Visualizer, you can see a surprising pattern not nearly as evident
in any other classifier model. At the top level there is a pronounced interaction between
left_01 and right_02. Exon/intron is only present if right_02 = T. Intron/exon is only
present if left_01 = G. For other values of left_01 and right_02 there are very few splice
junctions.
Drill down globally to the next level (left_02 and right_01). Among the records where
right_02 = T and left_01 = G we see a pattern which is consistent with the patterns along
each edge.
15. Inducing and Visualizing the Regression Tree
This chapter discusses the features and capabilities of the Regression Tree Inducer. Its
associated visualizer, the Tree Visualizer, is described in Chapter 5. This chapter provides
an overview of this tool and discusses methods of using it to generate Regression Trees.
It then explains the Tree Visualizer's functionality when working with regression trees.
Finally, it lists and describes the sample files provided for this tool.
Note: It is assumed that you have read Chapter 10, "MineSet Inducers and Classifiers,"
before proceeding with this chapter.
Overview
Regression is the task of predicting a continuous label value, given a set of descriptive
attributes. A regressor is a predictive model that performs regression. Regression and
classification are similar, the difference being that in classification the predicted label can
take on only a small number of discrete values. For example, an iris may be classified by
type as either iris-setosa, iris-versicolor, or iris-virginica. In regression, the predicted
label can be any value in a continuous range; for example, an individual's annual income
may be any amount greater than zero.
MineSet can build a regressor automatically from a training set using an algorithm called
an inducer. The training set consists of records in the database for which the continuous
label is known. For example, you could supply a database table with columns containing
descriptive attributes (such as age, education, occupation, hours worked per week, and
so forth), and one column containing the continuous label to be predicted (gross income).
When a regressor is generated, MineSet also generates a visualization. This visualization
can help you understand the regressor and how it makes predictions. In addition, it can
provide valuable insight into the data itself. Once generated, a regressor can be used to
predict the label value for unlabeled records.
The underlying structure used for regression is a Regression Tree, such as the one shown
in Figure 15-1.
Figure 15-1 Regression Tree for the Adult Dataset
Running the Regression Tree Inducer
There are two ways to run the Regression Tree Inducer:
From the Tool Manager.
Connect to the server and select a data source (see "Choosing a Data Source" in
Chapter 3).
From the File menu, choose Open New Data File. Log in to a server, and enter the
filename. For the example shown here, the filename entered would be
/usr/lib/MineSet/data/adult.schema. In the Continuous Label popup menu of the Data
Destination pane you see a list of continuous attributes. Of the several continuous
attributes available, one is automatically selected as the label option. You may
change this. Regression Tree Inducer is automatically shown as the inducer. To run
the Inducer, click Go!.
The status window shows the progress and statistics, and the Tree Visualizer is
launched automatically.
From the command line.
To induce a Regression Tree from the command line, refer to Appendix I,
"Command-Line Interface to MIndUtil: Analytical Data Mining Algorithms."
Configuring the Regression Tree Inducer Using the Tool Manager
To access the options for configuring the Regression Tree Inducer, select the Mining Tools
tab on the Data Destination panel (Figure 15-2). From the tabs at the right, select Regress.
The selected inducer is the Regression Tree, the only regression model currently available
in MineSet. Your selections in the Mode menu determine the options available in the
Further Inducer Options menu. After you have made your selections in these menus,
click Go! to induce the regressor.
Figure 15-2 Data Destination Panel in Tool Manager Showing Regressors
Continuous Label
The Continuous Label menu provides a list of possible continuous labels. This list
includes all attributes that take on numeric values. Select the label attribute you wish to
model. For instance, to generate a regressor for predicting gross income, select gross
income. (See "Training Set" in Chapter 10.) If there are no continuous attributes, the
menu shows No Continuous Label, and the Go! button is disabled. Regressors can only be
generated for continuous attributes. If the dataset does not contain a continuous attribute,
you may add a new continuous column using the Tool Manager's Data Transformations
panel.
Regressor Name
The generated regressor is named with the prefix of the session filename (as determined
in Tool Manager) and the suffix -rt.regress. By default, all regressors are stored on the
server in the file_cache directory, which defaults to mineset_files.
These regressors can be used to predict the labels for unlabeled datasets. To apply a
stored regressor, you need to select a dataset of records with the attributes used by the
regressor. The regressor will predict a new continuous label value for each record. See
"Applying a Model and Backfitting" in "Error Estimation" in Chapter 10.
Regression Tree Options
Selecting Further Inducer Options displays the Inducer Options dialog box (Figure 15-3).
This dialog box consists of four panels:
The top panel indicates the choices you made in the Tool Manager's Data
Destination panel.
The second panel from the top lets you set the weight attribute. See "Weight
Setting" in Chapter 10.
The bottom-left panel lets you specify other Inducer Options.
The bottom-right panel lets you specify the Error Estimation Options (unless the
mode you chose in the Data Destination panel was Classifier Only, in which case
this area is empty). The options shown in this panel depend on the type of Error
Estimation you chose (see "Error Options for Inducers" in Chapter 10).
Figure 15-3 Further Inducer Options
Regression Tree Inducer Options
To fine-tune the Regression Tree induction algorithm, you can change the following
Regression Tree Inducer options (see Figure 15-3).
Limit tree height by
By default, there is no limit to the height (number of levels) in the Regression Tree.
You can limit the height by clicking the check box and typing a number for the limit.
Limiting the number of levels speeds up the induction and is useful for studying
the Regression Tree without the distraction of too many nodes. Note that restricting
the size decreases the run time, but may increase the error rate. Setting this option
does not affect the attributes chosen at levels before the maximum level.
Splitting criterion
This option allows you to specify which criterion will be used to select among
competing attribute splits during tree induction. For regression trees, MineSet
supports four splitting criteria:
Variance chooses splits that minimize the within-node variance at each point in the
tree. When generating a leaf, the prediction at the leaf is the mean of the label values
of the records reaching that leaf. Variance is the sum of the squared differences between
each value in the set and the mean of the set, divided by the total number of values. This
criterion is the one most commonly used statistically.
Absolute Deviation chooses splits that minimize the absolute deviation within the
node at each point in the tree. When generating a leaf, the prediction at the leaf is
the median of the label values of the records reaching that leaf.
Normalized Variance (the default) is variance divided by the log (base 2) of the
number of child nodes.
Normalized Absolute Deviation is absolute deviation divided by the log (base 2) of the
number of child nodes.
For a given problem, it is difficult to know which criterion will be best. Try each,
and select the one that leads to the lowest error estimate, or the Regression Tree you
find easiest to understand. (A small numerical sketch of these criteria follows the
list of options below.)
Split lower bound
This is a lower bound on the weight (or the number of records if weight was not set)
that must be present in at least two of the node's children. The default for this
option is 5. For example, if there is a three-way split in the node, at least two out of
the three children must have a weight of ve or more. This provides another
method of limiting the size of the Regression Tree.
Increasing the split lower bound tends to increase the reliability of the probability
estimates, because the number of records at each leaf is larger. It also creates smaller
trees and decreases the induction time. If you expect the data to contain noise
(errors or anomalies), increase the split lower bound. If your dataset is very small (<
100 records), you might want to decrease the split lower bound.
Cost Complexity Pruning
Cost complexity pruning attempts to generate optimally sized trees by trading off
the error rate of the tree (its cost) and the number of leaves in the tree (its
complexity). During cost complexity pruning the training set is partitioned into a
learning set and a pruning set. The learning set is used to grow a pruning tree. This
tree is pruned to generate a sequence of trees with decreasing complexity. The
pruning set is then used to identify the minimum cost tree in this sequence. The size
of the minimum cost tree is noted. The learning and pruning sets are recombined
and used to grow a tree. This tree is then pruned to the size of the minimum cost
tree.
The cost complexity pruning parameter allows you to select trees smaller than the
minimum cost tree. The parameter indicates the number of standard errors more
costly than the minimum cost tree that you are willing to accept. Setting the
parameter to zero selects the minimum cost tree; setting the parameter to 0.5 selects
the minimum size tree that had an error rate no more than 0.5 standard errors worse
than the minimum cost tree. The default setting, 1, selects the minimum size tree
that has an error rate no more than one standard error worse than the minimum
cost tree. Higher numbers indicate more pruning. If your data may contain noise
(errors and anomalies), increase the number to create smaller trees. If the tree is
pruned back to a single node, decrease the number to decrease the amount of
pruning and show more of the tree's structure.
Pruning is slower than limiting the tree height or increasing the split lower bound
because a full tree is built and then pruned. Pruning, however, is done selectively,
resulting in a more accurate regressor.
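As a rough numerical illustration of the splitting criteria described above, the Python fragment below computes the four quantities for one hypothetical candidate split. It is a sketch only, not the MineSet implementation; in particular, weighting each child by its record count and measuring absolute deviation about the median are assumptions made here for illustration:

    import math
    from statistics import mean, median

    def variance(values):
        m = mean(values)
        return sum((v - m) ** 2 for v in values) / len(values)

    def abs_deviation(values):
        md = median(values)
        return sum(abs(v - md) for v in values) / len(values)

    # Hypothetical label values of the records reaching each child of a split.
    children = [[10.0, 12.0, 11.0], [30.0, 28.0, 35.0, 33.0]]
    n = sum(len(c) for c in children)

    within_variance = sum(len(c) / n * variance(c) for c in children)
    within_deviation = sum(len(c) / n * abs_deviation(c) for c in children)
    normalized_variance = within_variance / math.log2(len(children))
    normalized_deviation = within_deviation / math.log2(len(children))
    print(within_variance, within_deviation,
          normalized_variance, normalized_deviation)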
Error Estimation
When evaluating a classifier, the natural metric is error (the number of examples for
which the classifier predicts the wrong label). When a loss matrix is supplied, different
types of misclassification errors may have different associated costs. In this situation, loss
is a natural measure.
For regression, where the task is to predict a real value, there is no single natural
evaluation metric. The two measures that are frequently used are mean squared error and
mean absolute error. In mean squared error, the mean of the squared difference between
the predicted label value and the actual label value is used. In mean absolute error, the
mean of the absolute value of the difference between the predicted label value and the
actual label value is used.
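A minimal sketch of the two measures, using hypothetical predicted and actual label values (illustrative only, not MineSet code):

    actual    = [5540.0, 21000.0, 43000.0, 61000.0]
    predicted = [6000.0, 20000.0, 45000.0, 58000.0]

    n = len(actual)
    mean_squared_error  = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n
    mean_absolute_error = sum(abs(p - a) for p, a in zip(predicted, actual)) / n
    print(mean_squared_error, mean_absolute_error)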
Visualizing the Regression Tree
A Regression Tree visualization is similar to Decision Tree visualization. Like a Decision
Tree, a Regression Tree consists of nodes connected by lines (see Figure 15-1).
There are two types of nodes: decision nodes and leaf nodes.
Decision Nodes
Decision nodes specify the attribute that is tested at the node. Values (or ranges of values)
against which the attributes are tested are shown at the lines. Each possible value for the
attribute matches exactly one line. For example, the root of the Regression Tree in
Figure 15-1 tests the attribute age; the two lines emanating from the node partition values
for that attribute (< 27.5, >= 27.5) so that every possible value matches either the right
branch or the left branch. If the value is unknown and there is no line labeled with a
question mark, the mean or median label value at the current node is predicted.
Leaf Nodes
Leaf nodes in a Regression Tree predict a value. Follow the left-most branch in
Figure 15-1 from the root to the leaf labeled 5540.038. Note that the Regression Tree
predicts that people under age 23.5 who work less than 35.5 hours per week will average a
gross income of $5540.04.
Node Information
The bars atop nodes in a Regression Tree are different from the bars atop nodes in a
Decision Tree. In a Decision Tree each bar corresponds to a specic discrete label value.
In a Regression Tree each bar corresponds to a subrange of continuous label values. In a
Decision Tree the bars have a simple mapping throughout the entire tree. For example,
in Figure 11-1 of Chapter 11, "Inducing and Visualizing the Decision Tree Classifier," the
leftmost bar of each node indicates the weight (number) of records with the iris-setosa
label. By contrast, in a Regression Tree the range of continuous label values covered by
each node can be different. The bars at each node form a histogram indicating how the
weight (number) of records is distributed over this range. The leftmost bar always shows
the weight (number) of records with the lowest label values. But the size of that subrange
and its midpoint may be different at every node. In a Decision Tree, the color of a bar
indicates the discrete class label. In a Regression Tree, the color of a bar indicates the
midpoint of the subrange that bar covers.
In the Decision Tree, each bar corresponds to a label value. The height of the bar indicates
the total weight of records with that label value at that node. The bars are consistent
throughout the entire tree. For example, if you run the Decision Tree on the iris dataset,
the leftmost bar at the root node indicates the weight (number) of records with the
iris-setosa label. The leftmost bar at each node indicates the weight (number) of records
with the label iris-setosa.
In a Regression Tree, the set of bars on a node collectively form a histogram of the label
values at that node. The number of bars is determined by the weight of records at the
root. The color of each bar indicates the midpoint of the subrange it covers. The maximum
range is indicated by blue on the left and red on the right. A node which only covers
records with a limited range of label values will have a histogram that doesn't range from
blue to red.
The base of each node has a height. The height corresponds to the weight of the training
set records that have reached this node (this is the number of records if weight was not
set). In general, the higher the weight, the more reliable the distribution at a node.
Pointing to a node causes the following information to be displayed:
• Subtree weight: the weight of the training set records in the subtree below the node pointed to. This value is mapped to the height of the base.
• Mean: the mean of the continuous label.
• Standard deviation: the standard deviation of the continuous label. The higher the standard deviation, the less reliable the model.
• Median: the median of the continuous label.
• Absolute deviation: the absolute deviation of the continuous label. The higher the absolute deviation, the less reliable the model.
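The manual does not spell out the exact formulas behind these statistics, but the minimal Python sketch below shows how such per-node values could be derived from the (optionally weighted) label values of the records that reach a node. The function name is invented, and the use of a weighted mean and of deviation from the median for the absolute deviation are assumptions.

import numpy as np

def node_statistics(label_values, record_weights=None):
    # Illustrative only: per-node statistics for the continuous label values
    # of the records that reach a node (subtree weight, mean, standard
    # deviation, median, absolute deviation).
    y = np.asarray(label_values, dtype=float)
    w = np.ones_like(y) if record_weights is None else np.asarray(record_weights, dtype=float)

    subtree_weight = w.sum()                       # mapped to the height of the node's base
    mean = np.average(y, weights=w)
    std_dev = np.sqrt(np.average((y - mean) ** 2, weights=w))
    median = np.median(y)                          # unweighted median for simplicity
    abs_dev = np.average(np.abs(y - median), weights=w)

    return {"subtree weight": subtree_weight, "mean": mean,
            "std dev": std_dev, "median": median, "abs dev": abs_dev}

# Example: hypothetical gross-income values for records reaching a node
print(node_statistics([5200, 6100, 4800, 5900], record_weights=[1, 1, 2, 1]))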
Lines
All possible outcomes are marked on the horizontal lines emanating from each decision
node. Each line indicates the value (or range of values) against which the attribute of that
node was tested.
Using the Main Window to Predict Values
To predict a value for a record, start at the root, and test how to branch at every decision
node. By following the appropriate lines based on the record's attribute values, you reach
a leaf node. The label, or prediction, associated with the leaf node is the predicted value
of the record.
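As an illustration of this procedure, here is a small hypothetical sketch (not MineSet code) of predicting a record's value by walking a binary Regression Tree. The node layout, attribute names, thresholds, and leaf values are invented, loosely mirroring the age and hours-per-week splits discussed above.

class Node:
    def __init__(self, attribute=None, threshold=None, left=None, right=None, value=None):
        self.attribute = attribute    # attribute tested at a decision node
        self.threshold = threshold    # records with attribute < threshold go left (an assumption)
        self.left, self.right = left, right
        self.value = value            # predicted label at a leaf node

def predict(node, record):
    # Follow the branches matching the record's attribute values until a leaf is reached.
    while node.value is None:
        if record[node.attribute] < node.threshold:
            node = node.left
        else:
            node = node.right
    return node.value

tree = Node("age", 27.5,
            left=Node("hours_per_week", 35.5,
                      left=Node(value=5540.04), right=Node(value=12000.0)),
            right=Node(value=40000.0))

print(predict(tree, {"age": 22, "hours_per_week": 20}))   # 5540.04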
External Controls
The external controls for the visualizer associated with a Regression Tree are the same as those for the Tree Visualizer. For a description of these controls, see "External Controls" in Chapter 5.
One particularly useful control for Regression Trees is to select a node and then choose Normalize Sub Tree from the Selections menu (or press Ctrl-N). This normalizes the height and color range of the tree using the current node. By default the tree is normalized to the root node.
Another useful control is to click the right mouse button when pointing to a node. This
shows the list of children of that node.
Pulldown Menus
The pulldown menus for the visualizer associated with a Regression Tree are the same as those for the Tree Visualizer. For a description of these menus, see "Pulldown Menus" in Chapter 5, and "The Search and Filter Panels" in Chapter 11.
Sample Files
The following examples show cases in which regression might be useful and highlight some of the capabilities of the Regression Tree Inducer. Each of these examples is associated with a sample data file provided with MineSet. By running the inducer, you can generate the -rt.treeviz files described below.
Note: The data files, which have a .schema extension, are located in /usr/lib/MineSet/data on the client workstation; the regressor visualization files, with a -rt.treeviz extension, reside on the client workstation in /usr/lib/MineSet/treeviz/examples.
Churn
The churn dataset contains generated information on the calling patterns of a telecommunication company's customers. In the classification examples, this dataset is used to determine which factors lead a customer to churn, or leave the company for one of its competitors. In this regression example, we try to determine what factors influence how much the company charges each customer per day.
The file churn-rt.treeviz shows the Regression Tree generated on this dataset to predict the total day charge. Interestingly, the tree branches on only one attribute throughout, total day minutes, continuously dividing this attribute further and further into progressively smaller ranges. This is because the relationship between total day minutes and total day charge is completely deterministic: the customers are charged only for the minutes they use the system. The Regression Tree is quickly able to locate and capitalize on this fact.
Car Mileage
The cars dataset contains information about different models of cars from the 1970s and the early 1980s. Attributes in this dataset include weight, acceleration, and miles per gallon. The file cars-rt.treeviz shows the Regression Tree regressor induced on this dataset, using miles per gallon as the continuous label.
By clicking on the top node, we see that the average mpg of cars in this dataset is around 23.5. The first split in the Regression Tree for this dataset shows that the most important factor contributing to the mileage of a car is its weight. The Regression Tree has discovered that heavier cars get lower mileage, a fact that may be obvious to a human observer, but is insightful for an automated discovery program. By looking at the two children of the base node, we note that the right child is bluer than the left one; that is, it gets fewer miles per gallon. By highlighting the nodes, we see that cars that weigh less than 3018 lbs. get around 28.3 mpg, while cars weighing more get around 16.6 mpg.
Heading over to the heavier cars, we see that the next split is on the horsepower of the car, and that more powerful cars tend to get lower mileage. The split at the next level is on the year the car was made, with newer cars getting better mileage. Now, let's look for an unusual car. Using the filter panel, let's try to find a node with a mean mpg < 24 but with a maximum > 30. Applying this filter, we quickly reduce the tree to one node: cars weighing less than 3018 lbs, with more than 77 horsepower, and made before 1980. In this category, there is an unusual car; by selecting the rightmost bar on that node and sending this information to the Tool Manager via the Selections > Show Original Data menu item, we see that this car is a 1978 Dodge, weighing around 2000 lbs, with 83 hp, but getting a high 33.5 miles per gallon.
Salary Factors
The adult dataset contains information about working adults, extracted from the U.S. Census Bureau. It contains data about people older than 16, with a gross income of more than $100 per year, who work at least one hour a week. We can use the Regression Tree Inducer to determine which factors influence a person's salary, as well as to give a rough prediction of what that person's salary would be, given the other information.
The file adult-rt.treeviz shows the Regression Tree regressor induced on the adult dataset, using gross income as the continuous label. Note that this dataset is large (around 50,000 records), and therefore inducing the regressor on this dataset may take a few minutes on your workstation.
The bars at the top node provide a histogram of salary values in the Census Bureau's data. Note that the amount of data available decreases as the salary level increases. We have a lot of data for people earning around $3,000 a year, but less as that figure increases. This trend is reversed in the last part of the histogram, which indicates a sizeable amount of data on people earning roughly $100,000 per year. This discrepancy might be the result of either a genuine trend in the data or a biased sampling.
The first division in the Regression Tree is on the age attribute. As expected, younger people generally make less money than older people. Brushing the top node and its two child nodes, we can see some summary statistics for these three groupings of people. We note that the mean salary for everyone in this study is around $33,500, while the mean salary of people under 27 is around $14,300, and the mean salary of those over 27 is around $40,000.
Following the next two divisions of people under 27, we see that the tree again splits them into two categories: those 23 and under, and those over 23. Interestingly, the split past these two divisions is the same for both and is on the hours per week attribute, indicating that for both age ranges the more hours worked, the higher one's gross income.
Now, focusing on those over 27, we find that the tree splits immediately on the amount of education a person has had. Those with an education number of 13 and over (which corresponds to a bachelor's degree) tend to make more money. By looking over the two children of the education number split, we can see that most of the people making around $90,000 a year have at least some advanced education.
We can use the filter panel to quickly locate those categories of people making on average over $50,000 a year. In the filter panel, select mean > 50000. Top level nodes disappear in this filter, as making that amount of money is a rare occurrence. People with a bachelor's degree who are over 27 fall in this category. By following the left branch of the first split to the end, we find another group of people in this category: married men over 36 years old, who work over 35 hours a week and have a good education (10 years or more).
If we revisit the filter and look for nodes with an absolute deviation larger than $25,000, we can find those people whose economic condition offers the widest variability. The first remaining node in this filter is those people over 27 and with a bachelor's degree. The histogram above this node shows a distribution centered around its mean, but with an unusual number of people making around $100,000 a year.
Iris
Each record in this dataset describes five characteristics of iris flowers: petal width, petal length, sepal width, sepal length, and iris type. Our goal in this regression is to predict the petal width based on the other characteristics. The file iris-rt.treeviz shows the results of the Regression Tree Inducer run on this dataset in order to predict petal width.
Looking at the top node, we see a gap in the petal width values, where no flowers exist. The Regression Tree Inducer splits on this dataset first using the petal length variable. If the petal length is less than 2.6, only a restricted set of petal widths seems possible. On the other hand, petal length values greater than 2.6 indicate a more even distribution with larger corresponding petal widths. The mean petal width for those irises with petal lengths less than 2.6 is 0.24, while the corresponding mean petal width for those with lengths greater than 2.6 is 1.68. Following the large petal length irises, we see that the tree splits again on petal length, this time on the value 4.85. These two consecutive splits on the same variable point to some kind of restricted functional relationship between these two variables.
Going back to those irises with a petal length less than 2.6, we see that the following split is on the sepal width attribute. Interestingly, the values in this part of the tree seem segregated, with those irises with sepal width less than 3.25 taking on values in three narrow but separated ranges.
Pima
This dataset is a diabetes diagnosis problem using statistics gathered from an Indian tribe in Phoenix, Arizona. The file pima-rt.treeviz shows the results of running the Regression Tree Inducer on this dataset, using the plasma glucose level as the predicted continuous variable.
The first split in this tree is on the diabetes indicator, showing that people with diabetes tend to have a higher plasma glucose level than those without (141 versus 110). The next split is on the 2-hour serum insulin attribute, where values greater than 125 lead to higher plasma glucose.
When examining new records, the Regression Tree predicts by following the decisions at each node from the top node down. For example, a diabetic patient with a 2-hour serum insulin of 110 would have a predicted plasma glucose level of 105.
Chapter 16
16. Inducing and Visualizing Clustering
This chapter discusses the features and capabilities of the Clustering mining tool, and the selection of parameters by which to run clustering in relation to the other data mining tools. A sample file, provided with MineSet, is discussed at the end of this chapter.
Note: This chapter assumes that you have read Chapter 10, "MineSet Inducers and Classifiers."
Overview of Clustering
The Clustering mining tool is useful for exploring a dataset that is not well known. Because clustering is a descriptive mining task similar to the discovery of association rules, you do not need to designate a specific column as a label. In addition, the dataset is never split into training and test sets; clustering models are always built from and evaluated on the full dataset.
A clustering model is stored as a set of prototypical records, one per cluster. These
represent a weighted average of all data in the cluster and are known as Cluster centers or
centroids. Unlike standard database records, Cluster centers maintain a distribution for
each column in the form of summary statistics for numerical columns or histograms for
categorical columns.
Once the clustering operation has been run, you can view the Cluster centers directly
using the Cluster Visualizer as shown in Figure 16-1. Alternatively, you can use the apply
model facility (see "Applying a Model" in Chapter 10) directly on the training data to
assign a cluster to each record. You can then use many of the other tools in MineSet to
explore the resulting clusters.
Figure 16-1 Clustering Visualization on Adult Dataset
Clustering is run from the Mining tab in the Data Destinations panel of Tool Manager.
The purpose of clustering is to determine which characteristics in the dataset, if any, are similar. You then look at the resulting clusters and experiment with different parameters.
All clustering in MineSet uses a combinatorial algorithm based on the k-means objective
function; the algorithm forms clusters by grouping similar records together, aiming to
maximize the overall similarity within each group. MineSet provides two distinct modes
of clustering, single and iterative, both of which are discussed below.
Figure 16-2 The Clustering Tab
There are two methods of clustering:
• Single k-means
The term k-means refers to the objective function that determines a possible good clustering based on similarity between records.
• Iterative k-means
This method requires three parameters: a lower boundary, an upper boundary, and a choice point. The algorithm picks a number of clusters somewhere between the lower and upper boundary that is appropriate for the dataset. The choice point is a value between zero and one that guides the selection of the number of clusters; higher choice points suggest larger numbers of clusters, while lower choice points suggest smaller numbers. A choice point of 1.0 always picks the upper boundary. For example, if your boundaries are one and five clusters, a choice point of 0.4 might pick two clusters while a choice point of 0.8 will pick four. A choice point of 1.0 will always pick five.
Using Clustering and the Cluster Visualizer
Select the Mining Tools tab in the Data Destination panel of the Tool Manager main
window. From the subsequent tabs select Cluster. This will expose the main clustering
panel.
You can begin the clustering process by clicking Go! on the main Cluster panel (see Figure 16-2); there are no required options. By default, you will use the single k-means clustering method to discover three clusters in the data. Once the clustering is complete, you will see an evaluation of the clustering (see "Evaluation of Clustering" on page 485), and then the Cluster Visualizer will appear. The Cluster panel provides the following options:
• Method: Allows you to choose between the single k-means and iterative k-means methods. The default is single k-means. Both methods are described in detail below.
• Number of clusters: For the single k-means method only, you must specify the number of clusters to find. The default is 3.
• Number of clusters range: For the iterative k-means method only, you must specify lower and upper bounds on the possible number of clusters. The default is 1 ... 10.
• Choice point: For the iterative k-means method only, you must specify a choice point. This is a value between 0 and 1 which helps choose the final number of clusters. See "Iterative k-Means Clustering Method" on page 484. The default is 0.5.
Note: Clustering is a computationally intensive operation and will take some time to
complete on larger datasets, especially when running iterative k-means mode. We
suggest clustering a sample of the data if your dataset has more than 10000 records.
Single k-Means Clustering Method
This method is the simplest form of clustering in MineSet. You specify the desired
number of clusters, and the algorithm groups the records in the data to minimize the
overall dispersion within each cluster. Dispersion refers to the cohesiveness of a cluster;
the higher the dispersion, the farther each record falls from the cluster center. Technically,
dispersion is measured as the root mean squared distance from each record to the center
of the cluster to which it is assigned.
The algorithm itself is iterative and proceeds as follows:
1. You select, for example, five clusters to find.
2. The five cluster centers are initialized to be in random positions in the space of all
records. Different choices of the random seed parameter will produce different
starting positions.
3. Each record in the data is assigned to the cluster whose center is closest to it. The
cluster centers are then recomputed based on the new data in each cluster.
The next step is run repeatedly until no improvement can be made:
4. If any records are closer to the center of a different cluster than to the center of the cluster to which they are currently assigned, those records are moved to the closer clusters. Then the cluster centers are recomputed based on the new data in each cluster.
This algorithm is guaranteed to terminate in a finite number of iterations.
Step 4 dominates the running time of clustering. Therefore, the progress window will
show one progress bar for each run of step 4, combined with a note on how many
iterations have been run so far. You can set a limit on the maximum number of iterations
(runs of step 4) allowed before the algorithm stops. The default is 20.
Clusters are assigned sequential numerical names starting from 1. Although cluster
names are represented by numbers, there is no ordering to the clusters.
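For readers who want to see the procedure end to end, the following is a minimal sketch of single k-means on purely numeric data, written in Python with NumPy. It is not MineSet's implementation (which also handles categorical columns, record weights, and attribute weights), and the function name is invented. It also reports the root mean squared dispersion discussed above and under "Evaluation of Clustering" below.

import numpy as np

def single_kmeans(records, k, max_iterations=20, random_seed=0):
    # Sketch only: numeric columns, Euclidean distance, no weights.
    rng = np.random.default_rng(random_seed)
    X = np.asarray(records, dtype=float)

    # Step 2: initialize k centers at random positions within the data's range
    lo, hi = X.min(axis=0), X.max(axis=0)
    centers = rng.uniform(lo, hi, size=(k, X.shape[1]))

    assignments = np.full(len(X), -1)
    for _ in range(max_iterations):
        # Steps 3/4: assign each record to the closest center
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assignments = distances.argmin(axis=1)
        if np.array_equal(new_assignments, assignments):
            break                                # no record moved: done
        assignments = new_assignments
        # Recompute each center as the mean of the records assigned to it
        for c in range(k):
            if np.any(assignments == c):
                centers[c] = X[assignments == c].mean(axis=0)

    # Dispersion: root mean squared distance from each record to its center
    rms = np.sqrt(((X - centers[assignments]) ** 2).sum(axis=1).mean())
    return centers, assignments, rms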
Iterative k-Means Clustering Method
The iterative k-means clustering method is a more complex extension of the single k-means clustering method. Unlike single k-means, it does not require the specification of an exact number of clusters to create.
Figure 16-3 Clustering Using Iterative K-Means
To run iterative k-means, you select three parameters: a lower bound on the number of
clusters (default 1), an upper bound on the number of clusters (default 10), and a choice
point (default 0.5). See Figure 16-3. Given these parameters, the algorithm proceeds as
follows:
1. Use the minimum number of clusters for a run of the single k-means algorithm.
This will produce an initial clustering.
The next steps are run repeatedly until the maximum number of clusters has been
reached:
2. Find the cluster with the greatest dispersion and split it in half, creating two new
clusters. Half of the records in the original cluster will be distributed to one of the
new clusters and the other half will be distributed to the other new cluster.
Recompute the centers of the new clusters based on the data they contain.
3. If any records are closer to the center of a different cluster than to the center of the cluster they are currently in, those records are moved to the closer clusters. The cluster
centers are then recomputed based on the new data in each cluster. This is the same
as step 4 of the single k-means method. As in the single k-means algorithm, this step
is repeated until either no records need to be moved or until it has been run the
maximum number of times allowed, as determined by the maximum iterations
parameter (default 20).
This process has the effect of producing a range of clusterings between the minimum and
maximum numbers of clusters.
The final clustering you are presented with is determined using the choice point parameter (default 0.5) as follows:
Each clustering is evaluated by measuring the average dispersion of each cluster. As the
number of clusters increases, the average dispersion always decreases. However, it does
not decrease uniformly. The choice point selects the desired degree of dispersion,
measured as the proportion from the dispersion of the minimum number of clusters to
the dispersion of the maximum number of clusters. The clustering with a measured
dispersion closest to this value is chosen as the nal clustering. Note that a choice point
of 1.0 will always pick the maximum number of clusters, and a choice point of 0.0 will
always pick the minimum.
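The following short Python sketch illustrates how the choice point could select the final clustering from the measured dispersions, consistent with the rule just described; the dispersion values and function name are invented for illustration.

def pick_clustering(dispersions, choice_point=0.5):
    # `dispersions` maps each candidate number of clusters (lower to upper
    # bound) to its measured average dispersion.  The target dispersion lies
    # `choice_point` of the way from the dispersion at the minimum number of
    # clusters to the dispersion at the maximum number of clusters.
    ks = sorted(dispersions)
    d_min, d_max = dispersions[ks[0]], dispersions[ks[-1]]
    target = d_min + choice_point * (d_max - d_min)
    # Return the candidate whose measured dispersion is closest to the target
    return min(ks, key=lambda k: abs(dispersions[k] - target))

# Hypothetical dispersions for 1 through 5 clusters
dispersions = {1: 1.00, 2: 0.55, 3: 0.40, 4: 0.33, 5: 0.30}
print(pick_clustering(dispersions, choice_point=0.4))   # picks 2 clusters
print(pick_clustering(dispersions, choice_point=1.0))   # always picks the maximum, 5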
Clusters are named based on their derivation during the splitting process. The initial
clustering (based on the minimum number of clusters) is named using sequential
numbers, just as in single k-means. Every time a cluster is split, the two new clusters are
given the name of the split cluster, but with an A or a B appended. For example, the
cluster named 2-B-A was derived from cluster 2 in the initial clustering, and was split
twice.
Evaluation of Clustering
When a clustering model is built, you are presented with a display of statistics in the status window. The following example is from the iris dataset, available in /usr/lib/MineSet/data; we built three clusters using the single k-means algorithm with all default options, except that we set the attribute weight for iris_type to zero:
Result of clustering:
-----------------------------------------
Overall root-mean-squared distance from record to centroid: 0.216 +- 0.0928
RMS distance from records to each centroid:
Cluster 1: 0.2306 +- 0.09484
Cluster 2: 0.1921 +- 0.09932
Cluster 3: 0.2247 +- 0.08058
Model saved as iris.cluster
The numbers are measures of the dispersion of each cluster, as well as of the overall excellence of fit of the clustering. Dispersions from clusterings based on different datasets or different numbers of clusters are not immediately comparable.
Using Attribute Weights
The k-means clustering algorithm relies on computing the distance between records and the center of a cluster. Each column in the data is treated as a separate dimension in a multi-dimensional space of all records. By default, each column in the data has an equal influence on the final distance. However, you may change the influence of each column, or attribute, by specifying Attribute Weights in the Clustering Options dialog box (see Figure 16-4).
The attribute weight of a column is a value greater than or equal to zero, which determines the column's influence in the distance computations of the clustering algorithm. Setting an attribute's weight to 1 gives it an average influence. Setting it to 2 gives it the influence of two copies of exactly the same column. Setting it to 0 causes the column to have no effect on the clustering (the clustering proceeds as if you had removed the column from the data before running clustering). Attribute weights need not be integral and may be less than one.
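As a rough illustration of how attribute weights might enter the distance computation, here is a hedged Python sketch; it shows both the default Euclidean metric and the Manhattan alternative described under "Further Clustering Options" below. The exact way MineSet combines weights with each metric is an assumption, as is the function name.

import numpy as np

def weighted_distance(record, center, attribute_weights, metric="euclidean"):
    # Sketch only: numeric columns; a weight of 2 behaves like two copies of
    # the column, and a weight of 0 removes the column from the computation.
    diff = np.abs(np.asarray(record, dtype=float) - np.asarray(center, dtype=float))
    w = np.asarray(attribute_weights, dtype=float)
    if metric == "manhattan":
        return float((w * diff).sum())             # sum of per-axis distances
    return float(np.sqrt((w * diff ** 2).sum()))   # straight-line distance

# Weight 0 removes the third column; weight 2 doubles the first column's influence
print(weighted_distance([1.0, 5.0, 9.0], [2.0, 5.5, 0.0], [2, 1, 0]))   # 1.5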
To set attribute weights, first click on the Further options button in the main clustering panel. The top portion of the Further Options dialog box (illustrated in Figure 16-4) shows the current attribute weights, which all default to 1. To change one or more weights, select the column(s) whose weights you wish to change (use Shift-click to select or deselect a range, and Ctrl-click to select or deselect more than one value at once). Now type the new weight value into the Selected weights field. Finally, click the Set button and the selected weights will change to the new value. You may select and set all weights at once using the Select All button.
You will need to adjust attribute weights to help discover a more understandable clustering. Guidelines for weighting are:
• String or enum type columns with large numbers of values should generally get a low weight (often 0). While such columns will heavily skew the clustering if used by the algorithm, they can help explain the clustering later.
• If you have detected several columns that correlate strongly, adjust the weights so that the total weight of this set of columns is 1. Otherwise these columns may excessively drive the clustering.
• Since distances between categorical (string or enum) columns tend to be greater, it is often useful to give them lower weights than real-valued columns.
Figure 16-4 Clustering Options Dialog Box
Further Clustering Options
• Attribute Weights: Clustering gives you the opportunity to weight the value of each attribute in the dataset differently. See "Using Attribute Weights" for more information about this option.
• Distance Metric: This metric determines the way in which distances between records and cluster centers should be measured. The default choice is Euclidean distance, which measures distance along a straight line in multidimensional space, with one dimension per column in the data. The popup menu provides an alternate choice, Manhattan. Manhattan distance is computed by adding the distance along the axis of each dimension. This metric gets its name from the way one traverses street blocks in Manhattan: it is not possible to go straight from point A to point B because travel may only proceed along paths parallel to the axes (streets).
• Max. # Iterations: This option sets a limit on the number of passes through the dataset the clustering algorithm may use. More specifically, it limits the number of runs of step 4 in the single k-means algorithm. See the section on the single k-means method.
• Random Seed: Different random seeds result in different starting points for the initial cluster centers. See the section on the single k-means method.
• Use Weight: Like most mining tools in MineSet, Clustering supports record weights. This option allows you to specify a column (which must have a numerical value) that specifies the weight of each record in the data.
• Weight is Attribute: If this box is selected, the designated weight column is also used as a normal attribute by the clustering algorithm. If this box is unchecked, the designated weight column is not used as an attribute; it also disappears from the attribute weights section (it is given an implicit weight of zero).
Starting the Cluster Visualizer
There are six ways to start the Cluster Visualizer:
• Use the Tool Manager to configure and start the Cluster Visualizer. (See Chapter 3 for details on most of the Tool Manager's functionality, which is common to all MineSet tools; see "Working in Cluster Visualizer Main Window" on page 490 for details about using the Tool Manager with the Cluster Visualizer.)
• Double-click the Cluster Visualizer icon, which is in the MineSet page of the icon catalog. The icon is labeled clusterviz. Since no configuration file is specified, the start-up screen requires you to select one by using File > Open.
• Double-click the Cluster Visualizer icon on your Silicon Graphics desktop. The startup screen requires you to select a data file by choosing File > Open. Starting the Cluster Visualizer without specifying a configuration file displays the copyright notice and license agreement for this tool. Only the File and Help pulldown menus can be used. For the main window to be fully functional, open a configuration file by choosing File > Open.
• If you know which configuration file you want to use, double-click the icon for that configuration file. This starts the Cluster Visualizer and automatically loads the configuration file you specified. This works only if the configuration filename ends in .clusterviz (which is always the case for configuration files created for the Cluster Visualizer using the Tool Manager).
• Drag the configuration file icon onto the Cluster Visualizer icon. This starts the Cluster Visualizer and automatically loads the configuration file you specified. This works even if the configuration filename does not end in .clusterviz.
• Start the Cluster Visualizer from the UNIX shell command line by entering this command at the prompt:
clusterviz [ configFile ]
configFile is optional and specifies the name of the configuration file to use. If you don't specify a configuration file, you must use File > Open to specify one.
Options for Invoking the Cluster Visualizer
The -quiet option eliminates the dialogs that pop up to indicate progress. You can
enable this option permanently by adding the line
*minesetQuiet:TRUE
to your .Xdefaults file.
File Requirements
The Cluster Visualizer requires the following files:
• A data file consisting of rows of tab-separated fields. This file is easily created using the Tool Manager (see Chapter 3). You can generate data files by extracting data from a source (such as a database) and formatting it specifically for use by the Cluster Visualizer. Data files have user-defined extensions (the sample files provided with the Cluster Visualizer have a .clusterviz.data extension).
• A configuration file, describing the format of the input data and how it is to be displayed. The Tool Manager can create this file (see Chapter 3), or you can use an editor (such as jot, vi, or Emacs) to produce this file yourself. Configuration files must have a .clusterviz extension. When starting the Cluster Visualizer, or when opening a file, you must specify the configuration file, not the data file.
Working in Cluster Visualizer Main Window
The Cluster Visualizer, illustrated in Figure 16-5, bears a resemblance to the Statistics Visualizer (StatViz) with its box plots and histograms. Unlike in StatViz, the box plots are arranged in rows and columns. ClusterViz shows one row for each attribute in the data, and one column for each cluster. At the top of each cluster is a pane containing the name of the cluster. There is one additional column with the title population. This column displays statistics for the dataset as a whole and is equivalent to a StatViz view of the data.
Clusters are named based on the mode used to create them. Clusters created using single
k-means are numbered sequentially. Clusters created using iterative k-means are
numbered according to the process used to create them. In this example the clusters are
simply numbered sequentially. The algorithm has no way of detecting what the clusters
represent, and human interpretation is required. Each box plot shows statistics about
data from a single column, including the minimum, maximum, mean, median, and two
quartiles (25th and 75th percentiles). These values are shown as lines, and the standard
deviation is shown as a value.
The order in which attributes are displayed in the Cluster Visualizer window is
significant; by default, attributes are displayed in order of importance for discriminating
between the clusters (this is the same importance ordering as shown in the Evidence
Visualizer). Clicking on the pane containing the name of a cluster will redisplay the
attributes in order of importance for discriminating that cluster from all others. Clicking
on the pane containing population will restore the default ordering.
Figure 16-5 Cluster Visualizer Main Window
Pulldown Menus
The Cluster Visualizer has three pulldown menus, labeled File, View, and Help. These are similar to menus in the Evidence Visualizer (see "Pulldown Menus" in Chapter 13).
Sample File
The following example shows a case in which clustering might be useful. This example
is associated with a sample dataset provided with MineSet. It shows how to work with
the Clustering mining tool, and explains the different outcomes and options.
The cars dataset is relatively simple, dealing with familiar concepts of horsepower,
vehicle weight, and time required to reach 60 mph.
In working with the Cluster Visualizer window, when you select cluster 1, that cluster controls the priority ordering of attributes represented by the bar charts and histograms. The order of attributes in other clusters is based on cluster 1. For example, if you click on cluster 1, the attribute importance is cylinders, weight, then miles per gallon. The controlling cluster affects only the ordering used in the visualization. You can compare the same row across the other clusters to see how that attribute differs from cluster to cluster. When you select cluster 2, you see a differentiation in the order of attributes at a lower level. In this case, origin is more important, then cylinders, then horsepower, then miles per gallon.
Alternative Visualization of Clustering
The Cluster Visualizer provides an easy-to-use basic visualization of the clustering.
However, it is not the only way to visualize clustering. An example of transferring the results of clustering into a display using the Scatter Visualizer is shown in the MineSet 2.5 Tutorial.
Chapter 17
17. Column Importance
This chapter discusses the features and capabilities of the Column Importance mining
tool, and the relationship between column importance and the importance ranking in the
other data mining tools. Because of the differences in representation for classification models, different attributes may be judged more important for different models. A sample file, provided with MineSet, is discussed at the end of this chapter.
Note: This chapter assumes that you have read Chapter 10, "MineSet Inducers and Classifiers."
Finding Important Columns
Column Importance is run from the tab labelled Col. Imp. on the Mining Tools panel (Figure 17-1). It determines how important various columns are in discriminating the different values of the label column you choose. You might, for example, want to find the best three columns for discriminating the label good credit risk so you can choose them for mapping to the axes of the Scatter Visualizer. When you select the label and click Go!, a popup window appears with the three columns that are the best three discriminators. A measure called purity (a number from 0 to 100) informs you how well the columns discriminate the different labels. Adding more columns can only increase the purity.
Purity is a measure of the skewness of the label value distribution. The cumulative purity is a measure of the purity of a partitioning of the data. The data is partitioned using the columns found to be important, in the same way data is partitioned in a Decision Tree. Each set in the partition has its own purity measure, and the purity measure for the partition is a combination of these individual measures. For a given set in the partition, the purity is 0 if each class has equal representation, and 100 if every record is of the same class. Similarly, the cumulative purity is 0 if each set in the partition has an equal representation of classes, and 100 if each set in the partition contains records that all have the same class.
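MineSet's purity is based on mutual information (see "The Importance Function" later in this chapter), and its exact formula is not reproduced here. The Python sketch below is only a hypothetical score with the same boundary behavior (0 for equal class representation, 100 for a single class), built from normalized entropy, plus a weighted combination across the sets of a partition; the function names are invented.

import math

def purity(class_counts):
    # Assumption: normalized-entropy-based score; MineSet's formula may differ.
    total = sum(class_counts)
    k = len(class_counts)
    if total == 0 or k < 2:
        return 100.0
    probs = [c / total for c in class_counts if c > 0]
    entropy = -sum(p * math.log(p, k) for p in probs)   # 1 for uniform, 0 for pure
    return 100.0 * (1.0 - entropy)

def cumulative_purity(partition):
    # Weighted combination of per-set purities over a partition of the data.
    total = sum(sum(counts) for counts in partition)
    return sum(sum(counts) / total * purity(counts) for counts in partition)

print(purity([10, 10]))          # 0.0   (equal representation)
print(purity([20, 0]))           # 100.0 (all one class)
print(cumulative_purity([[8, 2], [1, 9]]))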
Figure 17-1 The Column Importance Tab
There are two modes of Column Importance:
• Simple Mode
To invoke the Simple mode, choose a discrete label from the popup menu, specify the number of columns you want to see, and then click Go!.
• Advanced Mode
Advanced mode lets you control the choice of columns. To enter Advanced mode, click Advanced Mode in the Column Importance panel. A dialog box appears, as shown in Figure 17-2. As with Further Inducer Options (see Chapter 10, "MineSet Inducers and Classifiers"), you can select a weight attribute and decide whether it behaves as a regular attribute for determining importance. The dialog box contains two lists of column names: the left list contains the available attributes, and the right list contains attributes chosen as important (by either the user or the Column Importance algorithm).
Figure 17-2 Advanced Mode of Column Importance
Advanced mode can work in two different ways: finding several new important attributes or ranking the available attributes.
Finding Several Important Attributes
To enter this submode, click the first of the two radio buttons in the middle of the dialog (...find [number] additional important attributes). If you click Go! with no further changes, the effect is the same as if you were in Simple mode, finding the specified number of important columns and automatically moving them to the right column. Near each column, the cumulative purity is given (that is, the purity of all the columns up to and including the one on the line).
Alternatively, by moving column names from the left list to the right list, you can prespecify columns that you want included and let the system add more. For example, to select the cylinders column and let the system find three more columns, click the cylinders column name, then click the right arrow between the lists.
Clicking Go! lets you see the cumulative purity of each column, together with the previous ones in the list. A purity of 100 means that using the given columns, you can perfectly discriminate the different label values in the dataset.
Ranking Available Attributes
Advanced mode also lets you compute the change in purity that each column would add to all those that were already marked important, that is, those in the list on the right. For example, you might move cylinders to the list on the right, and then ask the system to compute the incremental improvement in purity that each column remaining in the left list would yield. The cumulative purity is computed for columns on the right (already marked important).
To enter this submode, click the second of the two radio buttons at the bottom of the dialog (...compute improved purity for left columns, cumulative purity for right columns). This submode permits fine control over the process. If two columns are ranked very closely, you might prefer one over the other (for example, because it is cheaper to gather, more reliable, or easier to understand).
Column Importance Notes
The importance of a column depends on the columns previously marked as important.
For example, while net-income might be a good column individually, it might not be as
important together with salary because they are likely to be highly correlated. The best
set of three columns is not necessarily composed of the columns that rank highest
individually. If two columns give the income in dollars and in another currency, they are
ranked equally alone; however, once one of them is chosen, the other adds no
discriminatory power to the set of best features.
Column selection is useful for finding the best three axes for the Scatter Visualizer and Splat Visualizer. It is also useful for finding a good discriminatory hierarchy (a hierarchy that separates different label values) for the Tree Visualizer when you select the label to be the key used in the Tree Visualizer.
All floating point values (doubles or floats) are prediscretized using automatic discretization (see Chapter 3, "The Tool Manager"). If a column has no value given to it in the left list, the algorithm did not consider it; this is because it either had a single value (for example, when it is discretized into one interval), or the number of records that it would separate is not statistically significant.
Column Importance and Relation to Classifiers
This section describes the differences among Column Importance, the importance ranking chosen by the Evidence Inducer, and the splits chosen by the Decision Tree Inducer. As Column Importance uses all of the data, these descriptions assume that you are running the inducers in Classifier Only mode, so that the inducers are using all of the data as well.
The Discretization Process
The Column Importance algorithm and the Evidence Inducer discretize all continuous
attributes using the automatic discretization algorithm (the same algorithm that is
applied in automatic binning in the Tool Manager). The Decision Tree algorithm does not
pre-discretize attributes (columns); instead, it finds thresholds as the tree is built.
The main advantage of the automatic discretization is that it discretizes the continuous
range into several intervals at once, while the Decision Tree makes only binary splits.
The main advantage of the Decision Tree algorithm is that it discretizes subsets of the
data (those that reach a specific node where a test is done). Thus the discretization is
local to those records as opposed to a global discretization.
The Importance Function
The Evidence Inducer and Column Importance rank attributes based on mutual information as the purity measure. The Decision Tree Inducer, Option Tree Inducer, and Decision Table Inducer default to normalized mutual information, which penalizes multi-way splits (see the description of the splitting criterion in Chapter 11, "Inducing and Visualizing the Decision Tree Classifier"). Thus, the Decision Tree Inducer prefers an attribute with few values over attributes with many values. The default for Decision Trees can be changed to mutual information.
Dependence on Other Attributes
The Evidence Inducer ranks each attribute independently. If several attributes are highly
correlated, they have similar rankings. If you use the Advanced mode of Column Importance and the "compute improved purity" option without any attributes chosen as important (that is, moved to the list on the right), the attribute ranking shown matches the sort order chosen by the Evidence Inducer.
The Column Importance tool, the Decision Tree Inducer, and the Decision Table Inducer
all provide more powerful importance capabilities than the Evidence Inducer. All choose
an importance ranking with respect to other columns.
In Column Importance, columns are judged as important relative to the set of columns
in the list on the right. If two columns are highly correlated and one is chosen, the other
will probably never be chosen; each column is chosen for its ability to provide more
information about the label than is already present.
In the Decision Table Inducer, the Suggest button provides an importance facility similar
to the column importance tool. Columns are judged as important relative to the set of
columns already present in the mapping box. There are three major differences, however,
between the Decision Table Inducer's suggest mode and the column importance tool.
First, the Decision Table Inducer penalizes multi-way splits. Second, the Decision Table
Inducer is capable of performing a more exhaustive search to find a better column ordering. Finally, the Decision Table Inducer does not report any purity scores; it merely
ranks the columns.
The Decision Tree Inducer provides a more flexible importance ranking because different
columns can be selected at different subtrees. For example, one column can be chosen for
the left child of the root and another for the right child of the root. While this is
appropriate for a Decision Tree, it is inappropriate for choosing a small set of columns to
show in the Scatter Visualizer or Splat Visualizer. For these cases, the Column
Importance tool is superior because it builds an oblivious Decision Tree in which every
level of the tree tests the same column across the nodes. With Column Importance, a
single column must be chosen for all combinations of the previously chosen columns.
Sample File
The following example shows a case in which Column Importance might be useful. This
example is associated with a sample dataset provided with MineSet. It shows how to
work with the Column Importance mining tool, and explains the different outcomes and
options.
When customers change their phone carrier from one telecommunications company to
another, this is termed churning. This is a common problem in the telecommunications
industry. The data used to generate this example is in /usr/lib/MineSet/data/churn.schema.
The Column Importance mining tool lets you look at properties of this file.
Running the simple Column Importance mode yields the following three attributes:
• Total Day Charge
• Number of Customer Service Calls
• State
By running compute improved purity from the advanced mode, you can see that Total
Day Charge and Total Day Minutes have the same purity ranking (48.67). By moving one
of them to the right (for example, Total Day Minutes) and rerunning Compute Improved
Purity, you can see that there is no value to the other (Total Day Charge). These two
attributes are highly correlated.
Looking at the attributes when Total Day Minutes is on the right, we can see that the following are good:
• International Plan (4.1)
• Number of Customer Service Calls (8.1)
• State (4.7)
You can choose to move International Plan to the right, because this information is
readily available and easy to measure.
The other two attributes (Number of Customer Service Calls and State) remain highly
important (in fact, their importance increases), so they are apparently not correlated with
the International Plan.
By looking at the importance of attributes this way, you can determine which ones can
be substituted with others that are equally good (or almost as good), but are easier to
measure or understand. By looking at the purity, you can determine how much the
additional attributes help. For example, in the above scenario, State significantly improves the purity. In the iris dataset, the third attribute chosen (sepal length) raises the purity only slightly. In some cases, the simpler, two-dimensional scatterplot
might be easier to understand.
Chapter 18
18. Selection and Drill-Through
This chapter provides an introduction to selection and drill-through. With these features,
you can select graphical objects in the visual tools, show the data associated with that
tool, and send your selection to the Tool Manager to view the original data or to analyze
it using another tool.
These features are common to all tools in MineSet, although certain tools have additional
behavioral characteristics. Tool-specic details are provided at the end of this chapter.
Multiple Selection
In most of the tools, selection is done using Shift-Mouse 1. Clicking mouse button 1 on an object without pressing Shift selects the object under the cursor while deselecting all other previously selected objects. Holding down Shift while clicking mouse button 1
toggles the selection of that object without affecting any other selections. (Note that the
Splat Visualizer has a different interface, described at the end of this chapter.)
When you select an item, a message describing that item appears in the tool's main
window; by default, the visual tools only show information on the last object selected. A
separate Record Viewer window displays a table, which shows the values for all
selections. In the Tree and Map Visualizers, choose the Selection > Show Values to see
this. Figure 18-1 shows an example of the selected values in the Record Viewer. See also
The Record Viewer in Chapter 3.
Figure 18-1 Table of Values for Selected Objects
If a message has been set for the particular tool, that message also appears in the table.
Columns in this table can be resized by dragging the separators between the columns.
You also can click on a value to display the complete text of that value at the top of the
table.
Drill-Through
It is often useful to see or manipulate the original data from the data source that resulted
in the current selections. This is referred to as drill-through. There are two entries on the
Selections menu that perform drill-through:
• Send to Tool Manager: When this is selected, the history of operations that produced the visualization is placed in the Tool Manager. A filter operation corresponding to the user's selection in the visual tool is added to the history. The filter is placed as early in the history as possible, given the restrictions described below.
• Show Original Data: As with Send to Tool Manager, this option starts with the history of operations used to produce the visualization, and adds a filter operation as early as possible in the history. All non-filter operations coming after the filter in the history are removed. This new history is used to produce a table shown in the Record Viewer. If the filter can be placed at the beginning of the history, the data shown are the original records; otherwise, a warning is issued, indicating that the data are not totally original. The state of the Tool Manager is not altered (unless it is not currently running).
In either case, the Tool Manager performs the operation. If the Tool Manager is not
running, it is started automatically.
Only visualizations generated using the Tool Manager can be used for drill-through.
Each .schema file for these visualizations includes a history section that informs the Tool Manager how that file was generated. A few special-purpose mining visualizations such
as Learning Curves, Confusion Matrices, Lift Curves, and the Rules Visualizer do not
include a history and do not support drill-through.
When you select objects for drill-through, you are implicitly specifying a filter statement based on the visualized table. If the table was transformed before visualization, the Tool Manager might have to change the filter to place it earlier in the history. For example, if the filter is based on a binned column, the Tool Manager must change it so it refers to the pre-binned column.
The filter cannot always be placed at the beginning of the history because the Tool Manager cannot adjust the filter for all operations. Specifically, if the filter refers to any column created via Add Column, Aggregate, or Apply Model, the filter cannot be placed earlier in the history than the operation that created the column. Furthermore, the Tool Manager can never move a filter earlier than an existing Sample operation, since that would dramatically change the output of the sampling.
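The following hedged Python sketch illustrates the placement rule just described: walk backwards through the history, stopping at a Sample operation or at the operation that created a column the filter refers to. The operation records and field names are hypothetical, not the Tool Manager's actual history format, and the sketch ignores the rewriting of filters on binned columns mentioned above.

COLUMN_CREATORS = {"Add Column", "Aggregate", "Apply Model"}

def place_filter(history, filter_columns):
    # Return the earliest index at which the drill-through filter can be inserted.
    position = len(history)               # start by placing it at the end
    for i in range(len(history) - 1, -1, -1):
        op = history[i]
        if op["type"] == "Sample":
            break                          # never move earlier than a Sample
        if op["type"] in COLUMN_CREATORS and op.get("creates") in filter_columns:
            break                          # the filter depends on this created column
        position = i                       # safe to move before this operation
    return position

history = [
    {"type": "Sample"},
    {"type": "Add Column", "creates": "profit"},
    {"type": "Bin Columns"},
]
print(place_filter(history, {"profit"}))   # 2: just after the Add Column that made 'profit'
print(place_filter(history, {"region"}))   # 1: right after the Sample operation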
The Show Original Data mode tries to truncate the history to show the records from as
early a stage as possible. To prevent this mode from showing more records than were
selected, however, the truncation does not remove any other existing filter operations
(either user-created or the result of previous drill-through).
Tree Visualizer Specific Details
The Tree Visualizer selects drill-through criteria based on the location in the hierarchy
and, if appropriate, the bar selected on that node. For example, in a sales hierarchy with
region, city, and store, and with bars representing products, selecting the furniture bar
for a particular city can generate a drill-through request selecting everything for which
region=Western and city=Mountain View and product=furniture.
In the Decision Tree, the drill-through request is based on the decision criteria. For example, for an automobile decision tree, a drill-through request might be: cubicinches <= 170 and mpg > 20 and origin=US. In both cases, for Decision Trees and normal hierarchies alike, selecting the top node of the tree generates a drill-through of the entire dataset.
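A small sketch of how such a drill-through expression might be assembled from the path to a selected node and the selected bar is shown below (Python; the helper name is invented and the expression syntax is simplified to match the examples above).

def drill_through_expression(path, bar=None):
    # Build a filter expression from the hierarchy path to the selected node
    # plus, optionally, the selected bar.  An empty path (the top node) means
    # the whole dataset is drilled through.
    tests = [f"{column}={value}" for column, value in path]
    if bar is not None:
        tests.append(f"{bar[0]}={bar[1]}")
    return " and ".join(tests) if tests else "<entire dataset>"

# Mirrors the sales example above: a city node with the furniture bar selected
print(drill_through_expression(
    [("region", "Western"), ("city", "Mountain View")],
    bar=("product", "furniture")))
# region=Western and city=Mountain View and product=furniture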
Map Visualizer Specific Details
Drill-through in the Map Visualizer is based on the regions selected.
Scatter Visualizer Specific Details
The Scatter Visualizer uses a selection box in addition to Shift-select. From the Selection
menu select Create Box Selection. The box has tabs that can be used to resize or move it.
You cannot create more than one box.
The Scatter Visualizer does not have a key with which to select columns used for
drill-through. The default is to use all non-array columns that have been mapped to
graphical requirements (such as axes, size, color and sliders). From the Selections
pull-down menu, the Preferences panel lets you select an alternate list of columns to be
used in the drill-through.
A Record Viewer appears whenever a selection has been made. It shows the current
selections. If the Record Viewer is closed, the selections are cleared.
Splat Visualizer Specific Details
The Splat Visualizer uses a Selection box rather than Shift-click to select multiple objects.
From the Selection menu select Create Box Selection. The box has tabs that can be used
to resize it, or it can be moved. You can create more than one box.
The Record Viewer is tied to the Selection box; when the Record Viewer is dismissed, the
selections are cleared. Drill-through is based on where the box is within the axes, and,
optionally, on the sliders, if the Use Slider on Drill Thru item on the Selection menu is
checked.
Rules Visualizer Specific Details
Selection and drill-through are not implemented in the Rules Visualizer.
Chapter 19
19. File Exchange Between MineSet and SAS
This chapter describes the support for file exchanges between MineSet and SAS.
Overview
Exchanging data sets between MineSet and SAS is done through two utilities: mineset2sas
and sas2mineset. To convert a MineSet .schema and .data file pair into a SAS data set, use mineset2sas. To convert a SAS data set into MineSet .schema and .data files, use sas2mineset.
Both mineset2sas and sas2mineset invoke the SAS executable; thus, SAS must be installed
on the machine on which these conversion utilities are used.
Converting MineSet Data Files to SAS Data Sets
Use mineset2sas to convert MineSet data files into SAS data sets. The syntax for this is
mineset2sas <MineSet file> <SAS libref.datafile> [options]
Options are:
• -svsc saves the script sent to SAS. The script normally is deleted after use.
• -names <namefile> saves trimmed column names in <namefile>.
For example:
mineset2sas cars sasuser.cars -svsc -names cars.names
mineset2sas converts the MineSet .schema and .data files (in this case, cars.schema and cars.data) into a SAS data file. Currently, only string and numeric data types are supported. The MineSet .data file must be in ASCII format; binary format is not supported. To save MineSet data files in ASCII, deselect Use binary data files in Tool Manager Preferences.
SAS column names (or, in SAS terminology, variable names) can consist only of letters, digits, and underscore characters. The first character in a column name cannot be a number. Furthermore, SAS column names can be up to eight characters long. Since any character string can be a legal MineSet column name, mineset2sas maps MineSet column names to legal SAS column names. The rules for this mapping are:
• Any invalid character is replaced with an underscore.
• If the first character is a digit, an underscore is prepended to the column name.
• Column names are truncated to eight characters. If this truncation results in non-unique column names, the ends of the conflicting column names are replaced with sequential numbers, thus creating unique column names.
To preserve as much of the full column names as possible, mineset2sas also saves the first 40 characters of each column name as the column label.
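A hedged Python sketch of these mapping rules is shown below; the real utility's scheme for numbering conflicting names may differ in detail from the simple suffixing used here, and the function name is invented.

import re

def sas_column_name(name, used):
    # Apply the mapping rules listed above to one MineSet column name.
    # `used` is the set of SAS names already assigned.
    name = re.sub(r"[^A-Za-z0-9_]", "_", name)   # replace invalid characters
    if name[:1].isdigit():
        name = "_" + name                        # cannot start with a digit
    name = name[:8]                              # truncate to eight characters
    if name in used:                             # resolve conflicts with sequential numbers
        suffix = 0
        while f"{name[:7]}{suffix}" in used:
            suffix += 1
        name = f"{name[:7]}{suffix}"
    used.add(name)
    return name

used = set()
for original in ["date of birth", "92census"]:
    print(f"`{original}` -> `{sas_column_name(original, used)}`")
# `date of birth` -> `date_of_`
# `92census` -> `_92censu`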
The -names namefile Command Line Option
To get a listing of the column names before and after conversion into SAS format, specify the -names <namefile> command line option. When mineset2sas executes, it writes out a mapping of the column name changes to the specified file. For example:
`date of birth` -> `date_of_`
`92census` -> `_92censu`
`# of days to end of quarter` -> `__of_da0`
`# of days to end of year` -> `__of_da1`
`# of davenports` -> `__of_da2`
The -svsc Option
The mineset2sas utility reads the schema for the specified data file, and writes a customized SAS script. SAS, which must be installed in /usr/sbin/sas, is invoked with this script to read and convert the data. The script sent to SAS is normally deleted after use. With the -svsc option, the script is saved as the file mineset2sas.sas. If there is an error in the script processing, the SAS error log is saved as mineset2sas.log.
Converting SAS Data Sets Into MineSet Data Files
Use sas2mineset to convert SAS data sets into MineSet data files. The syntax for this is
sas2mineset <SAS libref.datafile> <MineSet file> [options]
Options are:
• -nodata creates only a .schema file, no .data file.
• -svsc saves the scripts sent to SAS.
• -nolabel indicates that you do not want labels used for column names.
• -names <namefile> restores long column names from <namefile>, created by mineset2sas.
For example,
sas2mineset sasuser.houses -svsc -names houses.names
The sas2mineset utility converts a SAS data file into MineSet .schema and .data files. Currently, this utility supports only string, numeric, and date data types.
The -nolabel Option
SAS supports only eight-character-long column names, but allows an optional 40-character label for each column. MineSet sets no limit on the column name length, so, by default, sas2mineset uses the column labels to name the columns in the output file, if labels have been defined. To force sas2mineset to use the SAS column name for each column, even if a label is specified, add the -nolabel option to the command line.
The -names namefile Option
If a MineSet data file is converted into SAS with mineset2sas and then back to MineSet format with sas2mineset, a column name map file can be created to keep track of the original column names. To have sas2mineset use a name map file created by mineset2sas, add the -names <namefile> option to the command line, and specify the same map file as was specified when the file was converted into SAS format with mineset2sas. This option is useful only for data files with column names longer than 40 characters, since mineset2sas saves up to 40 characters in the column label.
Note that the -names <namefile> option overrides the -nolabel option.
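As an illustrative round trip (the file names are hypothetical), the same name map file is given to both utilities so the original long column names are restored:
mineset2sas survey sasuser.survey -names survey.names
sas2mineset sasuser.survey survey2 -names survey.names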
The -nodata Option
To create just a MineSet schema file without downloading the data from a SAS data file, add the -nodata option to the command line.
The -svsc Option
The sas2mineset utility writes two customized SAS scripts to retrieve the specified data file. The first script extracts the column descriptions; the second extracts the data. The scripts are normally deleted after use. With the -svsc option, the scripts are saved as getschema.sas and getdata.sas, respectively. If there is an error in the script processing, the SAS error logs are saved as getschema.log and getdata.log, respectively.
Chapter 20
20. MineSet Web Extensions
This chapter describes the MineSet extensions that let you create and view visualizations, and interact with MineSet, over the Web.
Overview
The MineSet Web extensions allow files and data generated by MineSet software to be visualized over the Web. This can be done in two ways.
MineSet mtr extension
The MineSet mtr extension lets you place MineSet configuration, schema, and data files into an archive file, which can be embedded in a web page as an HTML tag. When the user clicks the hyperlink in Netscape, the browser automatically invokes the mineset_weblaunch program, which brings up the MineSet visual tool. The machine on which the browser is running must have the MineSet client software installed.
MineSet Remote View
The MineSet Remote View extension allows machines that do not have MineSet software installed to view visualizations through the Web. This is done via two cgi scripts that are included with the MineSet software distribution. The cgi scripts must be configured properly and installed on the Web server machine. The cgi script rview_file.cgi must be called from a .html file with a MineSet visual tool file as an argument. The cgi script rview_dir.cgi provides the client (the user who accesses the script) with a list of available visualizations. The user can then select any file via popup menus and click the Invoke button to launch the MineSet tool.
MineSet Web Extension Files
All MineSet Web Extension files are located in the /usr/lib/MineSet/www directory, which contains three subdirectories.
scripts Subdirectory
Script Purpose
mineset_webinstall_server This program configures the httpd server. Use it to configure the server and the MineSet Remote View program. If the web server is a remote machine, copy this file along with mineset_wsf.tar to the remote machine.
mineset_webinstall_client This program configures the Web browser (Netscape). Use it to configure the client.
mineset_weblaunch This program is invoked by the web browser (Netscape). It un-archives the mtr file and brings up the appropriate viewer.
mineset_makemtr This program creates one or more mtr files from files created either by Tool Manager or created manually. The mtr file is sent over the network and is used by mineset_weblaunch to invoke MineSet visual tools.
mineset_wsf.tar This is a tar file of various files needed by mineset_webinstall_server. If the web server is a remote machine, copy this file along with mineset_webinstall_server to the remote machine.
examples Subdirectory
File Purpose
index.html This file provides an index of all the mtr files supplied with MineSet.
rview_file.html This file illustrates how you should embed hyperlinks to invoke mineset_rview on files and directories.
adult-salary.eviviz.mtr mtr file of /usr/lib/MineSet/eviviz/examples/adult-salary.eviviz.
nl.births.mapviz.mtr mtr file of /usr/lib/MineSet/mapviz/examples/nl.births.mapviz.
company.scatterviz.mtr mtr file of /usr/lib/MineSet/scatterviz/examples/company.scatterviz.
cars-dt.treeviz.mtr mtr file of /usr/lib/MineSet/treeviz/examples/cars-dt.treeviz.
cars-odt.treeviz.mtr mtr file of /usr/lib/MineSet/treeviz/examples/cars-odt.treeviz.
churn-dt.treeviz.mtr mtr file of /usr/lib/MineSet/treeviz/examples/churn-dt.treeviz.
examples/rview_dir Subdirectory
File Purpose
*.treeviz.* Treeviz configuration and data files to be dynamically indexed by rview_dir.cgi.
*.mapviz.* Mapviz configuration and data files to be dynamically indexed by rview_dir.cgi.
*.eviviz.* Eviviz configuration and data files to be dynamically indexed by rview_dir.cgi.
*.scatterviz.* Scatterviz configuration and data files to be dynamically indexed by rview_dir.cgi.
*.ruleviz.* Ruleviz configuration and data files to be dynamically indexed by rview_dir.cgi.
*.dtableviz.* Decision Table configuration and data files to be dynamically indexed by rview_dir.cgi.
*.clusterviz.* Clusterviz configuration and data files to be dynamically indexed by rview_dir.cgi.
MineSet Web Installation (Client)
During the MineSet client installation, the client part of the web extension is automatically installed under most circumstances. To see whether it was automatically installed, run the following command.
sh -c "grep mineset /usr/local/lib/netscape/mime.types > /dev/null ; echo $?"
If the output is 0, you do not need to do the client installation. If the output is not 0, you need to install the MineSet web-client extension. A program is provided for the MineSet Web installation. Before you start the installation, make sure that the MineSet client software is installed on the machine you are trying to configure. If it is not installed, install the MineSet software before you try to configure the MineSet Web extension.
To configure the client, run the following command.
cd /usr/lib/MineSet/www/scripts ; ./mineset_webinstall_client
The program prompts you for the location of the files mailcap and mime.types. After you provide the correct names, the program adds the corresponding entries to the files.
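The exact entries are written by mineset_webinstall_client, but as a rough sketch of what to look for (the MIME type name application/x-mineset-mtr used here is only an illustrative assumption), the mailcap entry maps a MineSet type to the launcher in the standard mailcap form:
application/x-mineset-mtr; /usr/lib/MineSet/www/scripts/mineset_weblaunch %s
and the mime.types entry associates the same type name with the mtr file suffix.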
MineSet Web Installation (Server)
A program is provided for the MineSet Web installation. Before you start the installation, make sure that the software listed below is installed. If you are not familiar with the details, contact your system administrator or webmaster and request that the installation be done for you.
For the MineSet Web installation to work properly, you need
a Netscape browser
an httpd server
Setting Up the Server
The server can be running on either the local machine or a remote machine. You must know:
the name of the machine on which your httpd server runs, and
the directory where the publicly accessible .html files are stored
Typically the files are located under /var/www/htdocs. The machine name is the part of the URL between http:// and the first /. If you are accessing files by using http://some-machine.xxx.com/file.html, then some-machine is the name of the machine, and xxx.com is the domain name. An httpd daemon must be installed and running on some-machine.
You also must know where the httpd configuration files are stored on the server. If you are running the Netscape FastTrack server, the configuration files usually are in /usr/ns-home/httpd-machine/config, where machine is the name of your machine. If you are running a version of ncsa-httpd or Apache, the files might be located in /usr/local/etc/httpd/conf or /var/www/server/conf. If you are not sure where the config files are located, contact your system administrator or webmaster.
Once you know where your server configuration files are located, you need the following two files to perform the installation
mineset_webinstall_server
mineset_wsf.tar
Local Installation
A script is provided with MineSet that helps with the MineSet installation. Since the configuration of Web servers varies from machine to machine, the program prompts you for the location of files. It tries to make a reasonable guess and provides defaults; however, it is better if you know where the files are located before you supply them to the program.
Configuring MineSet Web Extensions for the Local Server
To start the installation, enter the following
cd /usr/lib/MineSet/www/scripts
./mineset_webinstall_server -s
You are prompted for the location of your server's mime.types file. Once you give the correct file name, the program adds an entry and tries to restart the httpd daemon. If it cannot restart your daemon, you must do it manually. If you are not sure how to restart the httpd daemon, ask your system administrator or webmaster.
Configuring MineSet Web Extensions for a Remote Server
First copy the following two files to the server
mineset_webinstall_server
mineset_wsf.tar
Once you have copied them over, cd to the directory where you copied the files. Then start the installation by entering
./mineset_webinstall_server -s
The program prompts you for the location of your server's mime.types file. Once you give the correct file name, it adds an entry and tries to restart the httpd daemon. If it cannot restart your daemon, you must do it manually. If you are not sure how to restart the httpd daemon, ask your system administrator or webmaster.
MineSet mtr Files
For the MineSet mtr extension to work, MineSet software must be installed on your machine or on the machine where the Netscape browser is installed. MineSet mtr files are archives of MineSet files generated by the Tool Manager or created manually. Creating an mtr file is easy. Once created, it can be used as a hyperlink in an HTML page. The mtr files are an effective way to share multiple visualizations over the Web, eliminating the need to attach huge files to mail, make remote copies, or use file transfers (ftp).
Since mtr files are in a compressed format and use the underlying HTTP protocol, transferring an mtr file is fast and does not require a cumbersome setup on the part of administrators.
Creating mtr Files
An mtr file can be created in the following ways.
1. From files created by Tool Manager
Once you have launched a tool from the Tool Manager, go to the directory from which you launched the Tool Manager and invoke mineset_makemtr using the name of the file(s) as the arguments. For example, if you have launched the Tree Visualizer and the Rules Visualizer from Tool Manager, create mtr files with the following command.
% mineset_makemtr foo.treeviz foo.ruleviz
This creates two files called foo.treeviz.mtr and foo.ruleviz.mtr.
2. Mapviz files with .gfx extensions
If you have generated a Mapviz visualization that uses the .gfx extension, you must use the -f option of mineset_makemtr and name all the files that you want to include in the archive. For example, you can use the following command
% mineset_makemtr -f sales.mapviz sales.mapviz.schema sales.mapviz.data sales.hierarchy sales.gfx
3. From files not created by Tool Manager
If you have created files without using Tool Manager, you can still create an mtr file from those files. You must use the -l option of mineset_makemtr. For example, to create an mtr file from the MineSet scatterviz example directory, use the following command.
% mineset_makemtr -l /usr/lib/MineSet/scatterviz/examples/company.scatterviz
4. Creating an mtr file from given files
Under certain circumstances, you might want to generate an mtr file while bypassing the checks performed by mineset_makemtr. If the checks are not performed, there is no guarantee that the mtr file will work; you must check its usability by launching it from the Web. You also must use the -f option of mineset_makemtr. For example, to create an mtr file from the files foo.treeviz, foo.treeviz.schema, and foo.treeviz.data, use the following command.
% mineset_makemtr -f foo.treeviz foo.treeviz.schema foo.treeviz.data
Unlike the other options, this one can create only one mtr file.
5. Creating a hyperlink to the mtr file
After the mtr file is created, it should be moved to the directory containing all your .html files. For Netscape to launch an mtr file, you can invoke it directly by entering
http://yourserver/directory/foo.treeviz.mtr
in the Netscape Location window; or you can make a link to it from a page by adding the following line to the .html file for that page.
<A HREF="foo.treeviz.mtr">foo.treeviz.mtr</A>
After you launch a visualization via the Web browser from an mtr file, a temporary directory is created to store the files. These files are deleted after the visualization tools are launched. If you want to save these files, set the environment variable MINESET_MTR_FILE_SAVE to TRUE; mineset_weblaunch then prompts you whether you want to save or delete the files. If you click Save, the files are saved; otherwise, they are deleted.
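For example, one way to set this variable before starting Netscape from the same shell (csh and sh forms shown) is:
setenv MINESET_MTR_FILE_SAVE TRUE
MINESET_MTR_FILE_SAVE=TRUE; export MINESET_MTR_FILE_SAVE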
MineSet Remote View
MineSet Remote View lets you view MineSet visual tools over the Web on any UNIX platform and on PCs. The X server running on the UNIX machine must be OpenGL-enabled. The PCs must be running an OpenGL-enabled X server, such as Hummingbird Exceed 3D. You also should have:
Perl version 5.002 or greater installed on your (server) machine
the CGI.pm module, version 2.35 or greater. The CGI.pm module is included with the MineSet distribution and is installed in the proper place during Remote View installation.
Installing MineSet Remote View
First copy mineset_webinstall_server and mineset_wsf.tar to the machine on which you are going to install Remote View. To start the installation, enter
% cd /usr/lib/MineSet/www/scripts
% ./mineset_webinstall_server -r
You are prompted for the location of your Perl binary, the Perl library, and your Perl version. The program provides reasonable defaults. If the default is correct, you can press Enter to install MineSet Remote View in the proper place. If you are not sure about some of the answers, ask your system administrator or webmaster.
Configuring and Using rview_dir.cgi
The rview_dir.cgi program that is supplied with the MineSet distribution is generic and must be configured properly. By default, it works only for the example files installed under /usr/lib/MineSet/www/examples. Here are the steps you must follow to configure rview_dir.cgi.
1. Rename the rview_dir.cgi program
Copy the program and rename it with a unique suffix, such as rview_<login>.cgi, where <login> is your login name. Since rview_<login>.cgi indexes one directory at a time, you must have multiple copies of the script, each with its $DIR entry set to a particular directory, to have multiple directories indexed. To set the $DIR entry properly, edit the rview_<login>.cgi program and look for the following lines
############# EDIT THE LINE BELOW ################
$DIR="/usr/lib/MineSet/www/rview_dir"
##################################################
Replace /usr/lib/MineSet/www/rview_dir with the full pathname of the directory that will contain your visualization files. This directory must be available and readable by the user id (uid) under which the httpd daemon is running. For example, if your files reside in /usr/people/jdoe/mineset_files, the $DIR entry in rview_jdoe.cgi looks like
############# EDIT THE LINE BELOW ################
$DIR="/usr/people/jdoe/mineset_files"
##################################################
There are several other options that you can set in both rview_dir.cgi and in rview_file.cgi.
Variable Purpose
$BODY_FILE Default is /usr/lib/MineSet/www/examples/rview_file_body.txt. This file contains the text that is displayed when the cgi script is invoked. The file can contain embedded HTML tags.
$PR_KILL Default is OFF. This variable specifies whether a visual tool invoked by the user is killed after a certain period of time. This is a useful feature to avoid overloading the server. To turn this feature on, set $PR_KILL="ON".
$TIMEOUT Default is 10 minutes. If $PR_KILL is "ON", the processes are killed after this number of minutes.
$LOGGING Default is ON. This variable, if set to "ON", provides detailed logging information. The information is in the following format.
display user time program filename
where
display = name of the display the user entered
user = name of the user, if available
time = time the program was launched
program = name of the launched visual tool
filename = name of the file that was viewed
$LOGFILE Default is /usr/tmp/rview-log. If $LOGGING is ON, the log file where the entries are recorded is specified in $LOGFILE. The log file should reside in a secure place. The file must be writable by the userid under which the cgi scripts are run; this is generally nobody.
$THIS_URL Default is http://localhost/cgi-bin/rview_dir.cgi. This entry should be set to the full URL a user uses to access the script.
$RESTRICT Default is OFF. This option is available only in rview_file.cgi. If it is set to ON, only files in the directory $RESTRICT_DIR are shared. If $RESTRICT is set to ON and $RESTRICT_DIR="public_html", then files residing in public_html for any user can be accessed over the web. This is for security reasons; you can turn it off if you are sure there are no confidential files in your directory.
$RESTRICT_DIR Default is public_html. This entry should be set to the directory name where users store their .html files.
2. Invoke rview_dir.cgi
rview_dir.cgi is invoked in the following way
http://yourserver/cgi-bin/rview_dir.cgi
When invoked, rview_dir.cgi reads the text from the file specified in $BODY_FILE; it then creates a web page with popup menus and an Invoke button for each of the visual tools. A user can choose any file. Clicking the Invoke Viztool button invokes the selected visual tool.
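For reference, a copy of rview_dir.cgi edited by a hypothetical user jdoe might set the variables described above along these lines (the values are illustrative, not defaults; follow the comments in the script itself):
$DIR="/usr/people/jdoe/mineset_files";
$PR_KILL="ON";
$TIMEOUT=5;
$LOGGING="ON";
$LOGFILE="/usr/tmp/rview-log";
$THIS_URL="http://yourserver/cgi-bin/rview_jdoe.cgi";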
Configuring and Using rview_file.cgi
The rview_file.cgi script takes a file name as an argument, so once the script is installed properly in cgi-bin it can handle any request. The user does not need to change the script, only to enter the location of the file in a .html file. All the variables (listed above) that are set in rview_dir.cgi are also supported in rview_file.cgi. A single file can embed links to multiple visualizations via different submit buttons. A sample file, rview_file.html, is included with the distribution; it is located in the /usr/lib/MineSet/www/examples directory. For each visual file you embed, there must be two entries. If you want the user to see the visualization generated by the Scatter Visualizer on the iris dataset, add the following lines to a .html file
<INPUT NAME="View Scatterviz" TYPE="hidden"
VALUE="/usr/lib/MineSet/examples/iris.scatterviz">
Clicking on the following button will bring up a view of the iris
dataset. The tool that will be launched is scatterviz.
<INPUT TYPE="submit" NAME="button" value="View Scatterviz">
For the script to work properly, the quoted string after the NAME= entry on the first line must be exactly equal to the quoted string after the value= entry on the last line. In this example, the entry View Scatterviz appears both after NAME= in the first <INPUT> line and after value= in the <INPUT TYPE="submit" NAME="button" ...> line.
Also ensure that
the following line appears in the .html file
<form method=post action="/cgi-bin/rview_file.cgi">
you have a text box in which the user can set the DISPLAY. For example
DISPLAY <INPUT TYPE="TEXT" NAME="DISPLAY" SIZE=30>
you add a line asking the user to give the program access to the X server. You can put the following lines in your html page
Since the visual tools will run on this server, you need to grant
us access to your DISPLAY. You can do this easily by executing the
command <b>xhost + machine</b>
Replace "machine" with the name of the machine on which the cgi script resides.
The easiest way to create a customized .html file that includes all of the above is to copy rview_file.html and edit it as necessary.
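Putting the pieces together, a minimal page might look like the following sketch (the path and server name are placeholders; copying rview_file.html gives a complete starting point):
<HTML><BODY>
<form method=post action="/cgi-bin/rview_file.cgi">
Since the visual tools will run on this server, you need to grant
us access to your DISPLAY by executing <b>xhost + yourserver</b>.<br>
DISPLAY <INPUT TYPE="TEXT" NAME="DISPLAY" SIZE=30><br>
<INPUT NAME="View Scatterviz" TYPE="hidden"
VALUE="/usr/lib/MineSet/examples/iris.scatterviz">
<INPUT TYPE="submit" NAME="button" value="View Scatterviz">
</form>
</BODY></HTML>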
MineSet Web Extension Security-Related Issues
It is important to understand the security implications before installing the MineSet Web extensions. The following security concerns should be addressed before you install the web extensions.
MineSet mtr file extensions
When you publish an mtr file, the files that are included in it are sent to the client and can be saved by the user.
For rview_<login>.cgi to work, you (the client) must use the command xhost + machine. This allows the server to use your X display and compromises security. This is a limitation of the X protocol, not of the MineSet Web extensions. If you enter the command xhost + machine to invoke a visual tool, enter xhost - machine as soon as you quit the MineSet visual tool (see the example after this list).
Most MineSet visualization tools support an -execute option in the configuration files. Double-clicking an object executes the command listed in the -execute option. This could be used maliciously to embed arbitrary commands in the configuration file, which are then triggered when a user double-clicks an object. To prevent this, a warning message appears, asking whether the user really wants to execute the command. This feature can be turned off by setting the environment variable MINESET_IGNORE_WARN_EXECUTE to TRUE. Set this variable only if you launch mtr files from trusted machines.
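As a sketch of the recommended sequence from the client's shell (cgihost is a placeholder for the machine on which the cgi scripts reside):
xhost + cgihost     # grant access just before invoking the visual tool
                    # ...view the visualization...
xhost - cgihost     # revoke access as soon as you quit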
Appendix A
A. Flat File Support for MineSet
This appendix describes the .schema and the .data files that are required to define MineSet flat files. The Tool Manager also generates .schema files, including the .schema files for the Tree Visualizer, Map Visualizer, Scatter Visualizer, and Splat Visualizer.
This appendix first discusses the .data file (the data file), then the .schema file; the final section notes the exceptions to the descriptions of these files for some tools.
The Data File
In its simplest form, the data file consists of a list of lines, each containing a set of fields, each separated by one tab. (Other separators are also allowed; see "Input Options" on page 529. However, only one separator per file can be used between fields.) All lines must contain the same fields. (The interpretation of the fields is specified by the .schema file, described in the next section.) For example, the first few lines of retail store data might look like this:
Eastern Maryland Baltimore 1816 appliances 72 115 138
Eastern Maryland Baltimore 1816 clothing 355 344 395
Eastern Maryland Baltimore 1816 electronics 156 182 209
Eastern Maryland Baltimore 1816 furniture 78 75 82
Eastern Massachusetts Boston 1331 appliances 48 68 81
Eastern Massachusetts Boston 1331 clothing 307 258 296
Eastern Massachusetts Boston 1331 electronics 38 183 210
Eastern Massachusetts Boston 1331 furniture 52 69 75
Eastern Massachusetts Boston 1220 appliances 37 63 75
Eastern Massachusetts Boston 1220 clothing 233 240 276
Eastern Massachusetts Boston 1220 electronics 175 208 239
Eastern Massachusetts Boston 1220 furniture 35 53 58
In this example, the first five columns are strings: region, state, city, store ID, and product. These are followed by three numbers, representing current sales, last year's sales, and the sales target.
The data file cannot contain blank lines or comments. Missing or extra data on a line causes an error.
Note: One tab (the default separator) separates each field. Do not insert multiple tabs to line up the fields visually; doing so generates blank fields. It is possible to use other characters, such as a colon (:), as a separator. In this case, the first line appears as:
Eastern:Maryland:Baltimore:1816:appliances:72:115:138
The order of the columns must match the format of the .schema file. For some visual tools, the order of the rows can affect the layout of the final graphic. See the tool-specific appendices for details.
Any field in the data can also be a ?, indicating that the data is null (unknown). See Appendix J, "Nulls in MineSet."
Note: MineSet also supports a binary format, which currently is not documented.
Data Types
MineSet supports integer, floating-point number, and string data types, as well as arrays of these types. The following data types are supported:
int represents a 32-bit signed integer.
float represents a single-precision floating point number. The decimal point is optional. Numbers in exponential e notation are also accepted.
double represents a double-precision floating point number. The decimal point is optional when representing a floating point number. Numbers in exponential e notation are also accepted. The superior precision of double can be useful for accurately representing large numbers, since float can represent only seven or eight significant digits accurately. This superior accuracy, however, consumes twice the memory space of float.
dataString represents a string that is unlikely to appear multiple times. If it appears multiple times, several copies are made. A dataString can be used to store a memory address. Addresses are unlikely to be compared, and each record can have a different address.
string represents a string of characters that can appear multiple times in the data file. Unlike a dataString, only a single copy of a given string is stored in memory, no matter how many times it appears in the data. This saves memory for strings appearing many times.
Comparing strings is also much quicker than comparing dataStrings. Reading in strings can be slower than reading in dataStrings because it is necessary to look for duplications. An example of string use is a division name that appears once for each department in the division. If you are unsure whether to use a string or a dataString, use a string.
fixed string represents a string of fixed length. Like a dataString, if a fixed string appears multiple times, multiple copies are made. In general, fixed strings are used internally for representations of data from databases, and are generally better to use than strings or dataStrings.
date represents a date and time. In the data file, a date appears in the format MM/DD/YYYY HH:MM:SS. Output from MineSet always represents dates with four-digit years, although two-digit years are acceptable for input. MineSet follows the X/Open standard for interpreting two-digit years. Fields with values 69 or greater are considered to be from the 20th century (1969-1999), and values from 0 to 68 are considered to be from the 21st century (2000-2068).
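For example, a date field for 2:30 in the afternoon of July 4, 1998 appears in the data file as:
07/04/1998 14:30:00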
Arrays
In MineSet, you can use one-dimensional or two-dimensional arrays of fixed or variable size.
In a fixed-sized array, all entries of the given type have the same number of values. For example, the budgets of the 50 United States can be represented by a separate float column for each state, or by a single array with 50 floats.
A special form of a fixed array is an enumerated array. Like the normal fixed array, there are a fixed number of values in the array; however, the values are associated with an enumeration. For each value in the enumeration, there is a single entry in the array.
For example, if there is an enumeration representing the 50 states, an enumerated array based on this enumeration has 50 values.
A variant of the enumerated array is the null enumerated array, which has an additional entry at the beginning for null (represented as a ?). For example, with the enumeration of the 50 states, the null enumerated array has 51 values: one for null, and the remaining 50 for the 50 states. The null array element could be used for entries where the state is unknown.
The Tree Visualizer also supports variable-length arrays (see Appendix B, "Creating Data and Configuration Files for the Tree Visualizer," for details).
As with the columns, arrays are represented as values separated by tabs or other separators. For a fixed-sized array, the same separator can be used for columns and for individual array elements (in which case, array elements are not visually distinguished from separate columns). You can also define a different separator. In the sales example (in "The Data File" on page 521), for example, you can treat the location as a four-element array, rather than as four columns. It then could be represented like this:
Eastern:Maryland:Baltimore:1816 appliances 72 115 138
Here, the array is separated by colons, and the columns are separated by tabs. (For clarity, the rest of this document uses tabs to separate columns, and colons to separate array elements.)
For a variable-length array, you must use different separators for the array and for the columns; otherwise, it is impossible to determine where the variable-length array ends and the other columns begin.
Null Values
Any field or array element in the data file can also have the value ? (question mark), indicating an unknown or null value (see the discussion of nulls in Appendix J).
The .schema File
The .schema file consists of an input section, which defines the name and format of the data file. (The .schema files generated by the Tool Manager can also contain a history section, which is a copy of the .mineset file. This section is used by drill-through and would normally not be present in manually generated .schema files.)
A typical input section might look like this:
input {
file "store";
string region;
string state;
string city;
string storeId;
string product;
float sales;
float lastYear;
float target;
options separator ':';
}
This example states that the input file is called store, and that there are eight fields: five of type string, three of type float.
Variable Names
A variable name can appear in two formats:
In the first format, it is a letter followed by a number of letters, digits, or underscores. It cannot be a keyword, and should not be quoted.
In the alternate form, the variable name is surrounded by back quotes (`). In this form, the variable name can match a keyword, and can even contain non-alphanumeric characters. The primary purpose of this second form is for .schema files generated automatically by the Tool Manager.
There is no scoping of variable names; a given variable name can only be declared once in the .schema file.
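For example, both of the following declarations are legal; the second needs back quotes because the name contains spaces and a # character (the column names are illustrative):
float sales;
float `# of days to end of quarter`;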
Strings and Characters
Strings and characters in the .schema file follow C-language conventions. Strings are in double quotation marks ("), and characters are in single quotation marks ('). All standard backslash conventions are followed (for example, \n represents a new line).
Comments
Comments begin with a pound (#) symbol at the beginning of a line; anything from this symbol to the end of the line is ignored.
File Statements
The file statement names the data file to be read. This statement is required. Its syntax is:
file "filename";
Filename must be in double quotation marks. If it is a relative pathname (no leading slash), the file is first sought in the directory containing the current .schema file. If it is not found in the current .schema file's directory, the file is sought in the current directory.
Data Statements
The data statements declare the columns in the data file. The columns must be declared in the order in which they appear in the data file. The format of most data statements is
type name;
where type is int, float, double, string, dataString, date, or fixedString(n), where n is an integer representing the width of the string; name is the variable name. Unlike in C, only one variable can be declared per statement.
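For example, each of the following statements (with illustrative column names) declares one column:
int quantity;
float sales;
fixedString(2) stateCode;
date purchased;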
Enumerations
The syntax for declaring an enumeration is:
enum type name { value, value ... };
For example:
enum string state {
"Alabama",
"Alaska",
...
"Wyoming"
};
The word string indicates that the enumeration maps integers to strings; they can also be mapped to other types.
Note that for compatibility with MineSet 1.0, enum can be replaced with key.
Once an enumeration is declared, a column can be declared to be of that enumeration using the following syntax:
enum enumname columnname;
For example:
enum state st;
declares st to be a variable of the state enumeration. The input file corresponding to this column must contain values from 0-49 (or ? representing null); however, the output shows the state name.
Enumerations also can be used to declare enumerated arrays (see "Enumerated Arrays").
Fixed Arrays
Arrays are also declared using data declarations. The simplest form is the fixed array. The declaration syntax is
type name [ number ] ;
For example:
float revenue [50];
You can also override the separator by declaring the array as
type name [ number ] separator 'char';
For example:
float revenue [50] separator ':';
If no separator is specified, the default column separator (usually a tab) is used.
Enumerated Arrays
To declare an enumerated array, first declare the enumeration (see the "Enumerations" subsection on page 527). Then declare the array using the following syntax:
type name [ enum keyname ];
or
type name [ enum keyname ] separator 'char';
For example:
float revenue [enum state];
As with the normal fixed array, you can also specify a separator. Note that for compatibility with MineSet 1.0, the word "enum" can be omitted from within the brackets. To declare a null enumerated array, use the syntax:
type name [ null enum keyname ];
or
type name [ null enum keyname ] separator 'char';
For example:
float revenue [null enum state];
indicates that the array contains one additional value at the beginning, corresponding to null.
Input Options
The input section of a .schema file has several options. All options statements begin with the word options and have one or more comma-separated options.
The separator option defines the separator between columns in the data file. The default separator is a tab. The syntax is:
options separator 'char';
For example:
options separator ':';
Note: Arrays can override the separator.
The backslash option controls whether backslashes in the input data are treated specially or like other characters. The syntax is:
options backslash off;
options backslash on;
The default is off. If backslash processing is on, separators in the input data preceded by backslashes are treated as regular characters rather than separators. Also, within strings, standard C-style backslash processing is done.
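For example, a single options statement can set both of these options, separated by a comma (the colon separator is illustrative):
options separator ':', backslash on;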
Exceptions
The following exceptions apply to the .schema and .data files:
The Tree Visualizer supports only one-dimensional arrays.
The Tree Visualizer supports variable-length arrays.
The Map Visualizer and the Scatter Visualizer support a special enum format for dates.
The Tree Visualizer and the Map Visualizer support the Monitor option.
Note: These exceptions are discussed in detail in the respective tool's appendix.
Appendix B
B. Creating Data and Configuration Files for the Tree Visualizer
The first part of this appendix describes the types and formats of data supported by the Tree Visualizer. Data input to the Tree Visualizer must be provided as a single file containing raw data, usually in tab-separated ASCII text form.
The second part discusses the configuration file, which describes how the Tree Visualizer reads in, and graphically displays, the data file.
Note that both the data and configuration files can be generated automatically by the Tool Manager (see Chapter 3).
Note: Read Chapter 5, "Using the Tree Visualizer," before using this appendix.
The Data File
In its simplest form, the data file consists of a list of lines, each containing a set of fields, each separated by one tab. (Other separators are also allowed; see "Input Options" on page 544. However, only one separator per file can be used between fields.) All lines must contain the same fields. (The interpretation of the fields is specified by the configuration file, described in the next section.) For example, using the retail store data (store.treeviz file) provided as part of the Tree Visualizer package, the first few lines of the input file look like this:
Eastern Maryland Baltimore 1816 appliances 72 115 138
Eastern Maryland Baltimore 1816 clothing 355 344 395
Eastern Maryland Baltimore 1816 electronics 156 182 209
Eastern Maryland Baltimore 1816 furniture 78 75 82
Eastern Massachusetts Boston 1331 appliances 48 68 81
Eastern Massachusetts Boston 1331 clothing 307 258 296
Eastern Massachusetts Boston 1331 electronics 38 183 210
Eastern Massachusetts Boston 1331 furniture 52 69 75
Eastern Massachusetts Boston 1220 appliances 37 63 75
Eastern Massachusetts Boston 1220 clothing 233 240 276
Eastern Massachusetts Boston 1220 electronics 175 208 239
Eastern Massachusetts Boston 1220 furniture 35 53 58
In this example, the first five columns are strings: region, state, city, store ID, and product. These are followed by three numbers, representing current sales, last year's sales, and the sales target. (The specific mapping to those values is defined in the configuration file, described in the section "The Configuration File.")
The data file cannot contain blank lines or comments. Missing or extra data on a line causes an error.
Note: One tab (the default separator) separates each field. Do not insert multiple tabs to line up the fields visually; doing so generates blank fields. It is possible to use other characters, such as a colon (:), as a separator. In this case, the first line appears as:
Eastern:Maryland:Baltimore:1816:appliances:72:115:138
The order of the columns must match the format of the configuration file. The order of the rows affects the layout of the final graphic, unless the configuration file specifies sorting. Generally, objects appearing earlier in the file appear to the left of objects appearing later in the file.
Any field in the data can also be a ?, indicating that the data is null (unknown). See Appendix J, "Nulls in MineSet."
Data Types
The Tree Visualizer supports integer, floating-point number, and string data types, as well as arrays of these types. The following data types are supported:
int represents a 32-bit signed integer.
float represents a single-precision floating point number. The decimal point is optional. Numbers in exponential e notation are also accepted.
double represents a double-precision floating point number. The decimal point is optional when representing a floating point number. Numbers in exponential e notation are also accepted. The superior precision of double can be useful for accurately representing large numbers, since float can represent only seven or eight significant digits accurately. This superior accuracy, however, consumes twice the memory space of float.
dataString represents a string that is unlikely to appear multiple times. If it appears multiple times, several copies are made. A dataString can be used to store a memory address. Addresses are unlikely to be compared, and each record can have a different address.
string represents a string of characters that can appear multiple times in the data file. Unlike a dataString, only a single copy of a given string is stored in memory, no matter how many times it appears in the data. This saves memory for strings appearing many times.
Comparing strings is also much quicker than comparing dataStrings. Reading in strings can be slower than reading in dataStrings because it is necessary to look for duplications. An example of string use is a division name that appears once for each department in the division. If you are unsure whether to use a string or a dataString, use a string.
fixed string represents a string of fixed length. Like a dataString, if a fixed string appears multiple times, multiple copies are made. In general, fixed strings are used internally for representations of data from databases, and are generally better to use than strings or dataStrings.
date represents a date and time. In the data file, date must appear in the format MM/DD/YY HH:MM:SS.
Enumerations
You can create a special data type that maps consecutive integers to strings. These types are referred to as enumerations, or enums. For example, you can create a state enum that maps 0 to Alabama, 1 to Alaska, and so on. Often, enums are created by the Tool Manager to represent bins. For example, 0 might map to a bin representing <10, 1 to the bin 10-20, and so on.
Arrays
With the Tree Visualizer, you can use one-dimensional arrays of fixed or variable size.
In a fixed-sized array, all entries of the given type have the same number of values. For example, the budgets of the 50 United States can be represented by a separate float column for each state, or by a single array with 50 floats.
A special form of a fixed array is an enumerated array. Like the normal fixed array, there are a fixed number of values in the array; however, the values are associated with an enumeration. For each value in the enumeration, there is a single entry in the array. For example, if there is an enumeration representing the 50 states, an enumerated array based on this enumeration has 50 values. (Note that in MineSet release 1.0, the enumerated array was referred to as a keyed fixed array.)
A variant of the enumerated array is the null enumerated array, which has an additional entry at the beginning for null (represented as a ?). For example, with the enumeration of the 50 states, the null enumerated array has 51 values: one for null, and the remaining 50 for the 50 states. The null array element could be used for entries where the state is unknown.
A variable-length array can have a different number of entries in each instance of the array. Often this is useful for representing organizations in which different parts have different depths. For example, one department could be represented by Gomez:Shapiro:Lacy (three entries), while another is Gomez:Wong:McMartin:Singe (four entries).
A variable-length entry with zero values can also be declared by passing an empty string. This can be used to specify data for the top level of a hierarchy.
When representing an organization with variable-length arrays, be careful. The Tree Visualizer computes the height for each level of the hierarchy separately, giving the highest bar on each level a user-specified height and normalizing the other bars accordingly. For example, imagine a U.S.-based organization with a domestic and an international sales force. Domestic sales are divided up into states, which are further divided into cities. International sales are divided into continents, which are then divided into countries and cities. You can have locations such as domestic:California:Mountain View and international:Europe:Italy:Rome. When displaying organizational hierarchies of this type, it is best to normalize heights at each level. If this is not done, small parts of the organization (for example, Mountain View) would be dwarfed by large parts of the organization (for example, domestic).
When the system tries to match up the levels, the normalization process might introduce anomalies. Usually, this is not the case at the highest level (domestic is matched with international); however, at lower levels this correspondence is no longer valid. Domestic cities (for example, Mountain View) are at the third level, but the third level for international is a country (for example, Italy). Comparing domestic cities against foreign countries usually has little validity. In this case, it is recommended that you introduce artificial levels to balance the hierarchies (for example, domestic:USA:California:Mountain View), thus matching cities.
Variable-length arrays might also be useful when some of the regions being compared are subdivided further than others. For example, an organization might have USA:California:San Francisco and USA:California:Los Angeles, but only USA:Wyoming. There is no need to construct an artificial third level just to keep the arrays balanced, as long as each level in the array matches the same level in other arrays.
Starting up the Tree Visualizer takes longer when variable-length arrays are read in than when fixed-length arrays or individual columns are read in. Unless the data is variable length, it is best not to use variable-length arrays.
As with the columns, arrays are represented as values separated by tabs or other separators. For a fixed-sized array, the same separator can be used for columns and for individual array elements (in which case, array elements are not visually distinguished from separate columns). You can also define a different separator. In the sales example (in "The Data File" on page 531), for example, you can treat the location as a four-element array, rather than as four columns. It then could be represented like this:
Eastern:Maryland:Baltimore:1816 appliances 72 115 138
Here, the array is separated by colons, and the columns are separated by tabs. (For clarity, the rest of this document uses tabs to separate columns, and colons to separate array elements.)
For a variable-length array, you must use different separators for the array and for the columns; otherwise, it is impossible to determine where the variable-length array ends and the other columns begin.
The Configuration File
The configuration file format is flexible. Words in it must be separated by spaces, and it is case-sensitive. Except for the include statement and text within quoted strings, spacing and line breaks are irrelevant.
Comments begin with a pound (#) symbol at the beginning of a line; anything from this symbol to the end of the line is ignored.
Sections
The configuration file consists of a series of sections, each of which has this syntax:
sectionKeyword
{
statements...
}
where sectionKeyword names the section. A semicolon (;) can follow the closing brace (}) but is not required. The order of the sections is significant, since sections can refer to variables defined in previous sections.
Options Files
As each section is encountered, a special configuration file (referred to as an "options file") is also read in. Options files have names of the form:
sectionName.treeviz.options
Options files normally contain options statements. These files are read in the following order:
1. The directory /usr/lib/MineSet/treeviz. This directory usually contains system defaults.
2. The ~/.MineSet directory (where the tilde, ~, indicates your home directory). You can set up personal defaults in this directory.
3. The current directory. This lets you set up defaults for each directory.
Files with the same name can appear in more than one of the above-named directories; in this case, the order given is the one in which the directories are read. If the same option is found in multiple files, the last option read is used. Note that the appropriate section in the configuration file is read after all the options files have been read in; thus, options in the configuration file override those in the options files.
Statements
A statement has the following syntax:
statementKeyword info ;
where statementKeyword defines the statement, and info varies according to the keyword. A statement can be another section (using the brace format defined under "Sections").
Variable Names
A variable name can appear in two formats:
In the first format, it is a letter followed by a number of letters, digits, or underscores. It cannot be a keyword (defined later in this appendix), and should not be placed in quotation marks.
In the alternate form, the variable name is surrounded by back quotes (`). In this form, the variable name can match a keyword, and can even contain non-alphanumeric characters. The primary purpose of this second form is for configuration files generated automatically by the Tool Manager.
There is no scoping of variable names; a given variable name can be declared only once in the configuration file.
Option Statements
Many sections have options statements, which have this syntax:
options key info, key info... ;
where key defines the specific option, and info depends on the key. In some cases, the key can be more than one word. To maximize the number of allowable variable names, most option keys are meaningful only within the appropriate option statement; keys do not conflict with variable names. You can declare several options on the same line, separating them by commas, or place them in several options statements. For example, the following two examples are equivalent:
options home angle 30, shrinkage 10.0;
and
options home angle 30;
options shrinkage 10.0;
If two conflicting values for the same option appear, the last value is taken.
Include Statements
The configuration file can contain lines of the form:
include "filename"
These lines can appear anywhere in the configuration file, but each must be on its own line. The filename must be in quotation marks; anything after the closing quotation mark is ignored. Include statements can be nested. If a relative pathname (one not beginning with a slash) is specified, the file is first sought relative to the directory containing the current configuration file. (If include statements are present, this might not be the same as the initially loaded configuration file.) If it is not found in the directory containing the current configuration file, the include file is sought in the current directory. If the file is not found, an error message appears.
Sinclude Statements
A statement similar to an include is sinclude, which has the syntax:
sinclude "filename"
This is identical to the include statement, except that no error appears if the file does not exist; instead, the sinclude statement is ignored.
Strings and Characters
Strings and characters in the configuration file follow C-language conventions. Strings are in double quotation marks ("), and characters are in single quotation marks ('). All standard backslash conventions are followed (for example, \n represents a new line).
Keywords
The currently recognized keywords are listed in Table B-1. Variables cannot have these names unless they are surrounded by back quotes (`). Tokens appearing only in option statements are not keywords, and can be used for variable names.
Table B-1 Keywords for the Tree Visualizer
aggregate disk int normalize
any divide isSummary off
ascending double key on
average enum label options
back execute landscape scale
base expressions legend separator
buckets file levels sort
color filter max string
colors float message sum
count height min view
dataString hierarchy modulus
descending input none
Expressions
Expressions are accepted in several places in the input. Expressions follow standard C syntax. The following operations are supported:
+ - * / % == != > < >= <= && || ! & | ^ ?:
Also, the following functions are available:
divide(x, y, z) divides x by y, unless y is zero. If y is zero, the result is z; this is equivalent to the C construct y==0 ? z : x/y.
modulus(x, y, z) is similar to divide, but for modulus.
hierarchy(string) is valid only within a hierarchy. It produces a string describing the components of the hierarchy, separated by string. For example:
hierarchy(":")
might produce
Western:California:Mountain View
The hierarchy function is most useful in the execute statement, passing the hierarchy information to the command being executed.
isSummary() returns 1 if the expression is being applied to base information; otherwise, it returns zero. Often, this is useful with the ?: operator, particularly in message and execute statements.
Type handling is similar to that in C. Expressions using int and float promote both sides to float. Expressions using int and double, or float and double, promote both sides to double. The result of a relational expression (for example, ==, <) is always an int. Type casting is also supported.
Unlike in C, strings can be compared using relational expressions; the strings are compared lexicographically.
The Input Section
The first section of a configuration file is normally the input section. It defines the name and format of the data file. A typical input section might look like this:
input {
file "store";
string region;
string state;
string city;
string storeId;
string product;
float sales;
float lastYear;
float target;
options separator ':';
}
This example states that the input file is called store, and that there are eight fields: five of type string, three of type float.
When the input section is entered, the options file input.treeviz.options is read in.
File Statements
The file statement names the data file to be read. This statement is required. Its syntax is:
file "filename";
Filename must be in double quotation marks. If it is a relative pathname (no leading slash), the file is first sought in the directory containing the current configuration file. If include statements are present, this might not be the same as the initially loaded configuration file. If the file is not found in the current configuration file's directory, it is sought in the current directory.
Data Statements
The data statements declare the columns in the data file. The columns must be declared in the order in which they appear in the data file. The format of most data statements is
type name;
where type is int, float, double, string, dataString, date, or fixedString(n), where n is an integer representing the width of the string; name is the variable name. Unlike in C, only one variable can be declared per statement.
Enumerations
The syntax for declaring an enumeration is:
enum type name { value, value ... };
For example:
enum string state {
"Alabama",
"Alaska",
...
"Wyoming"
};
The word string indicates that the enumeration maps integers to strings; they can also be mapped to other types.
Note that for compatibility with MineSet 1.0, enum can be replaced with key.
Once an enumeration is declared, a column can be declared to be of that enumeration using the following syntax:
enum enumname columnname;
For example:
enum state st;
declares st to be a variable of the state enumeration. The input file corresponding to this column must contain values from 0-49 (or ? representing null); however, the output shows the state name.
Enumerations also can be used to declare enumerated arrays (see "Enumerated Arrays").
Fixed Arrays
Arrays are also declared using data declarations. The simplest form is the fixed array. The declaration syntax is
type name [ number ] ;
For example:
float revenue [50];
You can also override the separator by declaring the array as
type name [ number ] separator 'char';
For example:
float revenue [50] separator ':';
If no separator is specified, the default column separator (usually a tab) is used.
Enumerated Arrays
To declare an enumerated array, first declare the enumeration (see the "Enumerations" subsection). Then declare the array using the following syntax:
type name [ enum keyname ];
or
type name [ enum keyname ] separator 'char';
For example:
float revenue [enum state];
As with the normal fixed array, you can also specify a separator. Note that for compatibility with MineSet 1.0, the word "enum" can be omitted from within the brackets. To declare a null enumerated array, use the syntax:
type name [ null enum keyname ];
or
type name [ null enum keyname ] separator 'char';
For example:
float revenue [null enum state];
indicates that the array contains one additional value at the beginning, corresponding to null.
Variable-Length Arrays
To declare a variable-length array, do not include a number in the brackets ([ ]). With a variable-length array, you must include a separator that is different from the one specified as the column separator. The syntax is:
type name [] separator 'char';
For example:
string category [] separator ':';
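For example, with the category declaration above followed by a single float column (the values are illustrative), data lines might look like this, with a tab between the array and the number:
furniture:chairs:recliners	45.2
clothing	61.7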
Input Options
The input section of a configuration file has several options. All options statements begin with the word options and have one or more comma-separated options.
The separator option defines the separator between columns in the data file. The default separator is a tab. The syntax is:
options separator 'char';
For example:
options separator ':';
Note: Arrays can override the separator.
The monitor option allows a dynamic update of the data displayed. When the specified file is changed (for example, through the touch command), the data file (not the configuration file) is reread. The data file itself should not be used as the trigger file; using a separate trigger file prevents the data file from being read at the same time it is being updated. The syntax of the monitor option is
options monitor "filename";
options monitor "filename" timeout;
where filename is the file to watch, and the optional timeout specifies the number of seconds to wait after the file changes. If the user interacts with the application in any way during this timeout (via the mouse or keyboard), the timeout restarts. Updating the file can take a few seconds. By specifying a timeout, the chances of an update occurring while the user is interacting with the tool are minimized (but the update is delayed). If no timeout is specified, the update occurs immediately.
The file being monitored must exist at the start of the program. When this file is
being updated, it must not be removed and re-created; instead, only its modify time
should be updated (for example, through the touch command). If the file is deleted,
subsequent updates are not shown.
Suppose a program extractor extracts data from a database into a data file. If you
want the program to update the data file every 10 minutes, the script might look
like this:
extractor > dataFile; # create first data file
touch trigger; # create the trigger file
while (sleep 600) # sleep 10 minutes
do
extractor > dataFile; # create new data file
touch trigger; # force a reread
done & # this loop goes in the
# background
treeviz configFile; # run treeviz
kill $! # when treeviz exits, kill the
# update loop
The monitor option can be used only if the file alteration monitor /usr/etc/fam is
installed (this can be found in the subsystem desktop_eoe.sw.fam).
The input section of a configuration file might look like this:
input
{
file "dataFile"
#data declarations here
options monitor "trigger" 15;
}
The backslash option controls whether backslashes in the input data are treated
specially or like other characters. The syntax is:
options backslash off;
options backslash on;
The default is off. If backslash processing is on, separators in the input data
preceded by backslashes are treated as regular characters rather than separators.
Also, within strings, standard C-style backslash processing is done.
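As a brief, hedged sketch (the column names and values are hypothetical), an input section using a colon separator with backslash processing turned on might contain:
options separator :;
options backslash on;
# hypothetical two-column input
string category;
float amount;
A data line such as
Food\:Beverages:12.50
is then read as the string Food:Beverages followed by the number 12.50, because the backslash makes the first colon part of the value rather than a separator.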
The Expression Section
The expression section of the configuration file lets you define additional columns that are
expressions of existing columns. For example, one column can be defined as the sum of
two other columns. The expressions are calculated before the definition of the hierarchy.
In many cases, it is more appropriate to apply the expressions after creating the
hierarchy; the expressions then should be defined within the hierarchy section (described
later), and the expressions section can be omitted.
The following is a sample expression section. This section assumes two existing columns
of type float, male and female; these represent spending by males and females on
various goods. Two columns are added: total represents the total dollars spent, and
pctFemale represents the percentage of dollars spent by females.
expressions
{
float total = male+female;
float pctFemale = divide (female*100, total, 50.0);
}
Note: The pctFemale calculation uses total, defined in the previous statement. Also,
note the use of the divide function rather than the / operator. This results in 50% for the
case where there are no dollars spent at all; using the / operator generates a divide by zero
error in such a case.
The format of the expressions section is:
expressions
{
expressionDeclaration;
...
}
where expressionDeclaration has the following syntax:
type name = expression ;
Since the expressions section has no options, no options file is read in for it.
The Hierarchy Section
The hierarchy section of the configuration file describes how the previously read table is converted
into a hierarchy. Here is a sample hierarchy section:
hierarchy
{
levels region, state, city, storeId;
key product;
aggregate
{
sum sales;
sum lastYear;
sum target;
}
expressions
{
float pctLastYear = divide(sales*100, lastYear, 100.0);
float pctTarget = divide(sales*100, target, 100.0);
}
}
The parts of the hierarchy section are described below.
When entering the hierarchy section, the hierarchy.treeviz.options options file is read in.
Levels Statements
The levels statement defines how the table is converted into a hierarchy. The format is:
levels name, name...;
where name represents a column previously defined in the input or the expressions
section. How the hierarchy is created depends on the types of the columns defined.
If the columns represent simple types (for example, strings or numbers), each column is
converted into a single level of the hierarchy. The top level of the hierarchy is a single,
all-inclusive node. The next level contains one node for each unique value in the first
column. The third level contains one node for each unique value in the second column,
and so on. Hierarchies created in this way are always balanced: All branches in the
hierarchy go to the same depth (namely one greater than the number of columns
specified in the levels statement).
In the case where the column is an array, there can be only a single column specified in
the levels statement. Each value in the array is mapped to one level in the hierarchy. The
top level is a single node representing the total aggregation. The next level contains one
node for each unique value of the first value in the array; the third level contains one
node for each unique value of the first two values of the array, and so on.
If the array is of fixed type, this hierarchy is balanced. If a variable-length array is used, the
hierarchy is not necessarily balanced (some branches can go deeper than others).
A variable-length array can be used to specify the hierarchy, even if the hierarchy is
balanced to a fixed depth. When using columns or fixed arrays to specify the levels, you
can associate data only with the bottom (or leaf) nodes. In this
case, all higher nodes in the hierarchy must be aggregated. However, rather than relying
on automatic aggregation, you might want to supply your own data for each level of the
hierarchy (if, for example, the calculation cannot be done automatically by the Tree
Visualizer). In that case, use variable-length arrays to specify the levels and provide separate
data for each level.
For example, the data file might contain lines such as:
Domestic:Western 43
Domestic:Eastern 57
Domestic 85
Intl:Europe 52
Intl:Asia 39
Intl 94
133
Note: The last line has an empty value for the location; the number 133 is associated with
the top of the hierarchy.
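A minimal sketch of the matching declarations might look like the following (the file name is elided, the column names location and sales are hypothetical, and the remainder of the hierarchy section is omitted):
input
{
file "...";
# variable-length array that supplies the levels; sales carries the value on each line
string location [] separator :;
float sales;
}
hierarchy
{
levels location;
...
}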
Key Statements
The key statement specifies the keys that are used to select the bars at each node in the
hierarchy. The key corresponds to the bars displayed in the final view. The syntax of the
key statement is:
key name [sort [ascending|descending]];
where name is the name of one of the previously defined columns. It cannot be the name
of a column used in the levels statement. Only a single key statement can be made.
By default, the bars generated by the key statement appear in the order first encountered.
If the key is an enumerated array, the bars appear in the order of the enumeration;
otherwise they appear in the order in which values are first encountered in the data file.
Adding the word sort at the end of the key statement sorts the bars. Sorting depends on
the type: Strings are sorted alphabetically, and numbers are sorted numerically.
Enumerations are sorted on the index of the enumeration, not the string that the
enumeration refers to. If, however, the key is an enumerated array, the sorting takes place
according to the enumeration string (to sort based on the enumeration index, leave it
unsorted). Optionally, the word sort can be followed by ascending or descending to
specify the sort order; the default is ascending.
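For example, using the product key from the store data shown below, the bars can be sorted in reverse alphabetical order with:
key product sort descending;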
If the key column is a simple type (for example, a string), the unique values of that key
are looked up in the original table. The order of the values is the same as the one in which
the key values appear in the original input table. Although it is not required, the same
keys are often repeated in the same order. For example, in the following table, the fifth
column is the key, and has the values appliances, clothing, electronics, and
furniture.
Eastern Maryland Baltimore 1816 appliances 72 115 138
Eastern Maryland Baltimore 1816 clothing 355 344 395
Eastern Maryland Baltimore 1816 electronics 156 182 209
Eastern Maryland Baltimore 1816 furniture 78 75 82
Eastern Massachusetts Boston 1331 appliances 48 68 81
Eastern Massachusetts Boston 1331 clothing 307 258 296
Eastern Massachusetts Boston 1331 electronics 38 183 210
Eastern Massachusetts Boston 1331 furniture 52 69 75
Eastern Massachusetts Boston 1220 appliances 37 63 75
Eastern Massachusetts Boston 1220 clothing 233 240 276
Eastern Massachusetts Boston 1220 electronics 175 208 239
Eastern Massachusetts Boston 1220 furniture 35 53 58
The key can also be any column of the enumerated array type. In this case, the
enumeration is used as the key for specifying the bars. Other columns in the input can
also be enumerated array types, as long as they use the same enumeration. For example,
this table can also be input as
Eastern Maryland Baltimore 1816
72:355:156:78 115:344:182:75 138:395:209:82
Eastern Massachusetts Boston 1331
48:307:38:52 68:258:183:69 81:296:210:75
Eastern Massachusetts Boston 1220
37:233:175:35 63:240:208:53 75:276:239:58
For clarity, each line has been wrapped onto two lines; however, in the file these should
be on single lines. The input section for this data appears as
input
{
file "...";
key string product {
"appliances", "clothing", "electronics", "furniture"
}
string region;
string state;
string city;
string storeId;
float sales [ enum product ] separator : ;
float lastYear [ enum product ] separator : ;
float target [ enum product ] separator : ;
}
Note: Since the arrays are fixed, the use of a colon separator for the arrays is not required;
however, it might make it easier for a human to read the input.
In this example, the hierarchy section appears as follows:
hierarchy
{
levels region, state, city, storeId;
key sales;
...
}
Since sales is an enumerated array, its key type (product) is used as the key for generating
the bars; thus, each graph in the final view has four bars. Note that lastYear and target
must use the same key type for their array.
Arrays other than enumerated arrays cannot be specified as the key.
Aggregate Subsection
The aggregate subsection of the hierarchy section describes how values are aggregated
at higher levels of the hierarchy. An example is:
aggregate
{
sum sales;
sum lastYear;
sum target;
}
This indicates that sales, lastYear, and target are to be summed at higher levels of the
hierarchy (each level summing the values in the level below it). In addition to the sum
aggregation, the aggregations average, min, max, count, and any are allowed. All are
self-explanatory, except for any, which indicates that any of the values can be used. This
aggregation is used if you expect the same value (for example, a string) to appear
everywhere in the hierarchy and if you just want it to populate the entire hierarchy.
A special case is when the key is an enumerated array. Here, the key is normally also
aggregated.
In the case where a variable-length array specifies data for all levels of the hierarchy
simultaneously (as opposed to merely specifying the data at the leaf nodes), the
aggregate section cannot be used.
The two forms that an aggregate statement can take are
agg name;
name1 = agg name2;
In both cases, the aggregate (agg) is one of sum, average, min, max, count, and any. The
first form was illustrated above; it aggregates a column, and the result is given the same
name as the original column being aggregated. The second form aggregates the column
name2, but gives the result the name name1. This second form is useful if the same column
is being aggregated multiple times: using the first form would create two aggregations
with the same name, so the second form can be used to differentiate them.
For example, if you have a column named expenses and want to aggregate it to show the
maximum and minimum expenses, you can use
aggregate
{
maxExpenses = max expenses;
minExpenses = min expenses;
}
Aggregate Base Subsection
This subsection specifies how values in the base are aggregated. It can be used only if the
aggregate subsection is not present. (If the aggregate subsection is present, the base is
aggregated using the aggregations specified in it.)
A sample aggregate base subsection is:
aggregate base
{
sum sales;
sum lastYear;
}
An aggregate statement takes the form
agg name;
where the aggregate (agg) is one of sum, average, min, max, count, and any (similar to the
aggregate subsection). The aggregation is applied to all the bars on that base to give the
appropriate value for the base. After the base is aggregated, its values correspond to all
of the columns used in specifying the bars. Any column not specified in the aggregate
base subsection has a value of zero. Because the base values correspond to the bar values, the
second form of the aggregate statement (using the =) cannot be used in the aggregate
base section.
Expressions Subsection
An expressions subsection of the hierarchy section is similar to the expressions section
described earlier, except that it is applied after the hierarchy is created and aggregated.
The syntax is identical, but it is declared within the hierarchy section, not external to it.
To give an example of the difference between calculating the expressions before and after
creating the hierarchy, take the example of male and female dollars spent. Assume you
want to calculate the percentage of dollars spent by women. The expressions might be:
expressions
{
float total = male+female;
float pctFemale = divide (female*100, total, 50.0);
}
Assume you calculated these variables before creating the hierarchy. Then, when
aggregating the data up the hierarchy, summing the percentages is not useful. Averaging
the percentages results in a believable number; however, it averages percentages of large
dollars with percentages of small dollars, and produces incorrect results. (To make this
clearer, suppose that on one product, males spent $99, and females spent $0. On another
product, males spent $0, and females spent $1. On the first product females spent 0%, and
on the second they spent 100%. Averaging these gives 50%, but in reality, females spent
only 1% of the dollars spent on the two products combined.)
The base data should be aggregated first, then the expressions should be applied. (In the
example, after aggregating, the result is a combined spending of $99 for males, and $1 for
females; if the percentage is calculated after the aggregation, the correct value of 1%
results.)
Sort Statements
By default, the order of the nodes within each level of the hierarchy is based on the order
of the data in the input file. However, sometimes it is desirable to sort the hierarchy. The
sort statement can appear in one of two forms:
sort name [, ascending|descending];
sort key [, ascending|descending];
In the first form, one column name (not used in the levels statement) is used for sorting.
The column can be the result of an aggregation or an expression. In the second form, the
values used in the levels statement (the values that lay out the hierarchy) are used for sorting.
The hierarchy can be sorted in ascending or descending order. If neither option is
specified, the default is descending order if the first form of the sort is used (this places
the nodes with the largest values on the left); the default is ascending order if the second form is used
(this typically sorts alphabetically).
Note that sort statements affect the sorting of only the branches of the hierarchy; they do
not affect the bars within each node of the hierarchy.
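For example, using the aggregated sales column from the earlier samples, the branches could be ordered by total sales (largest first), or alphabetically by the values used in the levels statement:
sort sales;
or
sort key, ascending;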
Hierarchy Options
There are two options in the hierarchy section: skipMissing and organization. The format
for the skipMissing option is
options skipMissing;
If this option is off (the default) and some values of the key are not present for a given
hierarchy node, dummy entries are created with values of 0. This guarantees that all
graphs in the hierarchy have the same number of bars, and the same layout. If this option
is on, no such entries are generated. This results in variable-length tables in the hierarchy,
and bars exist only for items in the input. The position of these bars, however, is not
meaningful. This option is not useful if the key is an enumerated array (for which all
values are supplied).
The skipMissing option increases memory usage and should be avoided, if possible.
The format for the organization option is
options organization same;
options organization contains;
options organization unknown;
The organization option provides hints about the hierarchy organization that allow for
more efficient algorithms. This option is most useful if no hierarchy aggregation is done.
The same value specifies that all nodes in the hierarchy contain entries for the same item
(for example, all nodes could contain appliances, clothing, electronics, and
furniture). The contains value indicates that a parent node contains entries for all values
that its children contain. For example, if a node contains appliances, its parent node
must also contain appliances, although not all of its child nodes must contain
appliances. The unknown value means that no assumptions are to be made regarding the
contents of individual nodes.
If no organization is specified, the Tree Visualizer determines the organization as follows.
If there is no aggregate subsection, unknown is used.
If there is an aggregate section, but the skipMissing option is provided, contains is
used; otherwise, same is used. Since this is normally correct when an aggregate
subsection is provided (unless skipMissing is used but nothing is missing), there
normally is no need to provide an organization if the aggregate subsection is
present.
If the organization specified does not match the data, the results are unspecified. For
example, same should not be specified unless all nodes have the same entries.
The View Section
The view section of the configuration file describes how the hierarchy is displayed, including the
mapping of heights, colors, labels, and so forth. A sample view section is:
view hierarchy landscape
{
height sales, normalize levels, max 2.0;
height legend label "Height: Total sales";
base height max 1.0;
disk height target, legend label "Disk height: Target
sales";
color pctTarget, scale 0 100 200 500;
color colors "red" "gray" "green" "blue";
color legend label "Color: % of target" "0%" "100%"
"200%" "500%";
options columns 4;
message "$%,.2f, %.0f%% of target, %.0f%% of last year",
sales, pctTarget, pctLastYear;
}
The first words of the view section (before the opening brace) describe the type of view.
The only view type supported is view hierarchy landscape; thus, these words must
introduce the view section.
When entering the view section, the viewHierarchyLandscape.treeviz.options options file is
read in. Note that there is not a simple view.treeviz.options options file; the full name
viewHierarchyLandscape must be used.
Height Statements
The height statement describes how the columns are mapped to the height of objects. It
consists of a series of clauses separated by commas. Alternatively, it can be specified as
multiple height statements. Thus, the following three examples are equivalent:
height sales, normalize levels, max 2.0;
height sales;
height normalize levels;
height max 2.0;
height sales, normalize levels;
height max 2.0;
The first clause normally contains the name of a column that is to be mapped to height
(sales, in the example). The column must be of a number type (int, float, or double);
float is the most efficient. If no height column is specified, all bars are flat, and the
remaining height clauses have no effect.
Normalize Clause
The normalize clause determines the maximum value of the height variable; it
normalizes all values relative to that maximum. Thus, if the maximum value is 30.0 and
that value is given a height of 1.0 (in arbitrary units), a value of 15.0 would be mapped to
a height of 0.5.
The syntax of the normalize clause can be
normalize
This normalizes all values against one another, throughout the
hierarchy.
normalize levels
This performs independent normalization at each level of the hierarchy.
normalize none
This performs no normalization, and is the default.
The second form is particularly useful in cases where the data is aggregated up the
hierarchy. For example, assume the sales data is aggregated up the company. Comparing
the sales of the company as a whole to the sales of a single individual has little meaning;
in a large company, the heights of the bars for the individuals are so small as to be
indistinguishable from zero. It makes more sense to compare sales people to sales people,
offices to offices, regions to regions, and so on. Normalizing levels does this.
Regardless of which form of normalization is used, the base (if shown) is always
normalized independently of the bars. By default, the same normalization mechanism
for the bars is used for the base.
The scale Clause
The scale clause scales the height of all objects; all values are multiplied by the scale. The
syntax of the scale clause is:
scale float
where float is a floating point number (the decimal point is optional). For example, to
double the heights, specify
scale 2
The filter Clause
Large datasets can contain many graphical objects, which results in poor performance. In many
cases, the data values are small and of little informative value. The filter clause prefilters
the data based on the height variable, so that only the nodes with the highest bars are
shown. The syntax of the filter clause is:
filter > float%
The > and % characters must be typed literally. For example:
filter > 5%
This example filters out all charts in which no bar is greater than 5% of the maximum
bar height, except for those with descendants in the hierarchy that contain such
bars. Note that if a chart contains just one bar that meets this criterion, the entire chart is
shown.
The filter value can be changed interactively through the filter panel (see "The Filter
Panel" in Chapter 5).
The filter clause is permitted only on the height statement.
The legend Clause
The legend clause defines the meaning of the height mappings. Any string can be placed
in the height legend. The legend clause has the following syntaxes:
legend off
This turns off the height legend (this is the default).
legend on
This turns on the height legend.
legend label string
This changes the legend. If legend label is used, legend on is
unnecessary.
By default, the legend has the following syntax:
height:varname
where varname is the name of the variable that is mapped to height.
It is possible to declare separate legends for the height, the base height, and the disk
height.
Base Height Statements
The base height statement specifies how the height of the base is calculated. The format
is similar to the height statement, except that it is preceded by the word base. If the
base height statement is omitted, the height of the base is calculated using the same
values as in the height statement (the same variable, normalization mechanism, max
value, and so on). You also can specify only some of the clauses for the base, in which
case everything else is the same as the height statement. For example:
height sales, normalize levels, max 2.0;
base height max 1.0;
In this case, the base height is based on sales, and it is normalized by levels. The maximum
height, however, is only 1.0 instead of 2.0. Usually, the visual effect is better if the base
height max is less than the max for the bars.
The filter clause is not permitted on the base height statement.
The on and off Clauses
The initial display of the base height can be turned on and off via the on and off clauses. To
turn it off, use
base height off
To turn it on, use the default:
base height on
The base height can be changed interactively using the Base Height option in the Display
menu. The on and off clauses are valid only with base height. Do not use them with the
height or disk height statements.
Disk Height Statements
You can place a disk on each bar to indicate an additional item of data. This is done with
the disk height statement. The disk height statement's syntax is similar to that of the
height statement, but it is preceded by the word disk. For a disk to be displayed, there
must be a clause specifying the column to be mapped to the disk. Other clauses are
optional; if these are omitted, the height statement's defaults are used.
If the height statement has a normalize clause, and the disk height statement has no
normalize or max clause, then the disks are normalized with the bars (they are drawn to
the same scale). If the disk height statement has either a normalize clause or a max clause,
the disks are normalized independently of the bars. For example:
height sales, normalize levels, max 2.0;
disk height target;
In this case, the bars are mapped to the variable sales, and the disks are mapped to
target. Both are normalized, with the maximum value of sales or target on each level
mapped to a value of 2.0. If instead this example is written as
height sales, normalize levels, max 2.0;
disk height target, normalize levels;
the bars are mapped so the highest bar at each level is 2.0, and the highest disk on each
level is 2.0, but the bars and disks are not mapped to the same scale. This can be used, for
example, if the bars represent dollars and the disks represent head count.
The filter clause is not permitted on the disk height statement.
Color Statements
The color statement describes how values are mapped to colors. The format is similar to
that of the height statement, consisting of several clauses that can be separated by
commas or entered as multiple statements.
Color Naming
Color names follow the conventions of the X Window System, except that the names
must be in quotation marks. Examples of valid colors are "green", "Hot Pink", and
"#77ff42". The last one is in the form #rrggbb, in which the red, green, and blue
components of the color are specified as hexadecimal values. Pure saturation is
represented by ff, a lack of color by 00. For example, #000000 is black, #ffffff is white,
#ff0000 is red, and #00ffff is cyan. (A list of available colors is found in the file
/usr/lib/X11/rgb.txt.)
The Color Variable
As with height, you also can specify a single column to be mapped to a color. The column
must be a number type. Unlike for height, there is no normalization of colors.
The key Clause
Instead of specifying a variable, the word key can be specied. This assigns a different
color based on each key, normally for each bar. For example, if the 50 states were the keys,
key assigns a different color to the bar for each state. Since the base is not keyed, when
the key clause is used, the base is always gray.
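For example, with the store data shown earlier, where the key column product has four values, each bar can be given its own color (the particular colors below, supplied with the colors clause described next, are only an illustration):
color key, colors "red" "gray" "green" "blue";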
The colors Clause
The colors clause specifies the colors to be used. The colors clause syntax is:
colors "colorname" "colorname"...
The format for colorname is described in "Color Naming." Note that there are no commas
between the colors. This is because commas are used to separate clauses in the color
statement. A sample colors clause is
colors "red" "gray" "blue"
Colors in the list are subsequently referred to by their index, starting at zero. In the above
example, red is color 0, gray is color 1, and blue is color 2.
If there is no colors clause, colors are chosen randomly; however, if there is a colors
clause, at least as many colors must be specified as are to be mapped. If a key is used,
there must be one color for each key value.
The scale Clause
The scale clause allows assignment of values to a continuous range of colors. For
example, when displaying a percentage, red can be assigned to 0%, gray to 50%, and blue
to 100%. Intermediate values are interpolated; for example, 25% is pinkish, and 55% is a
slightly bluish gray.
The syntax for the scale clause is
scale float float ...
The first value is mapped to color 0, the second to color 1, and so forth. The colors
clause must contain at least as many colors as are to be mapped to the largest index.
Values in this clause must be in increasing order. Any value less than the first value is
assigned the first color. Any value greater than the last value is assigned the
last color. Intermediate values are interpolated.
For example, assume the pctFemale column indicates what percentage of the group is
female, and you want to map a group that is 100% female to red, 100% male to blue, and
50% each to gray. The color statement for this is:
color pctFemale, colors "blue" "gray" "red", scale 0 50 100;
The buckets Clause
The buckets clause is similar to the scale clause without interpolation. All values are
rounded down to the highest bucket value that does not exceed them, and that exact color is used. Values less
than the first value use the first color.
The syntax for the buckets clause is
buckets float float ...
The syntax and assignment of colors is the same as for the scale clause.
If, in the pctFemale example, you used the buckets clause instead of the scale clause, the
statement would be
color pctFemale, colors "blue" "gray" "red", buckets 0 50 100;
All values greater than or equal to 100 are colored red. Values greater than or equal to 50, but
less than 100, are gray. All other values are blue.
The legend Clause
The legend clause creates a legend of the colors. By default, a legend is on for the bar
colors, and off for base and disk colors, although separate legends are permitted for each.
The legend clause syntax can be any of the following:
legend off
legend on
legend "string" "string" ...
legend label "string"
legend "string" "string" ... label "string"
The legend off clause turns the legend off. The legend on clause turns the legend on. It
can be omitted if other legend statements are included. Specifying only legend on
generates the default legend.
The default legend includes a single label to the left (with the name of the column that is
mapped to color), and a list of colored labels on the right (with values obtained from the
scale clause, the buckets clause, or from the keys). To override the strings in the colored
labels, specify the strings: legend "string" "string" ...
To override the label on the left, specify it following the word label. To eliminate this
label, specify an empty string; that is:
legend label ""
Base Color Statements
The base color statement controls the color of the base. Its syntax is similar to the color
statement, except that it is preceded by the word base. If the base color statement is omitted, the base
has the same color as the bars. If the base color statement is present, any omitted clauses
default to the values of the color statement.
Disk Color Statements
The disk color statement controls the color of the disk. The syntax is similar to the color
statement, except that it is preceded by the word disks. If the disk color statement is
omitted, the disk has the same color as the bars. If the statement is present, any omitted
clauses default to the values of the color statement.
Since disks are drawn only if a disk height statement is present, a disk color statement
has no effect without a disk height statement.
Label Statements
Label statements specify the labels used when labeling objects in the scene. Normally,
these statements can be omitted. By default, each bar is labeled with its key; each base is
labeled with its position in the hierarchy. The syntaxes of the label statements are:
label name
base label name
line label name
back label name
where name is the name of the column to be used as the label. The first form is used as the
label on the bars. The second form is the label on the bases. The third form labels the lines
connecting the bases. The fourth places labels behind the bases. (Note that bases often
obscure the back labels, so this form is less useful; however, there might be occasions
where it is appropriate.)
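For example, assuming the data contained a column named department (a hypothetical column, not part of the earlier samples), the bars could be labeled with it instead of with the key:
label department;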
Message Statements
The message statement specifies the message displayed when the pointer is moved over
an object or when an object is selected. The syntax is similar to that of the C printf
statement. A sample message statement is
message "%s: $%f, %.0f%% of target, %.0f%% of last year",
product, sales, pctTarget, pctLastYear;
This could produce the following message:
furniture: $2425.37, 23% of target, 87% of last year
The formats must match the type of data being used:
Strings must use %s.
Ints must use integer formats (such as %d).
Floats and doubles must use floating point formats (such as %f).
For details of the printf format, see the printf (1) reference (man) page (type man printf
at the shell prompt).
A special format type has been added to printf. If the percent sign is followed by a
comma (for example, %,f), commas are inserted in the number for clarity. Currently,
only the United States convention of d,ddd,ddd.dddd is supported, with the decimal
point represented by a period, and commas separating every three places to the left of
the decimal point. For example, if the above format were:
message "%s: $%,f, %,.0f%% of target, %,.0f%% of last year,
product, sales, pctTarget, pctLastYear;
it would produce the message:
furniture: $2,425.37, 23% of target, 87% of last year
The $, *, h, l, ll, L, and n printf format options are not supported.
All values, including the format string, are expressions. Thus, if you had a pctFemale
column, but wanted a more gender-neutral message, you could use
message pctFemale>50?"%f%% females":"%f%% males",
pctFemale>50?pctFemale:100-pctFemale;
If pctFemale is 70, the message 70% females is displayed; if pctFemale is 30, the
message 70% males is displayed. In this case, you can also achieve the same result with
a single format string:
message "%f%% %s", pctFemale>50?pctFemale:100-pctFemale,
pctFemale>50?"females":"males";
By default, the same message is used for the base as for the bars. It is possible to specify
a different message by using a base message statement, which has the same syntax.
If no message is specied, a default message containing the names and values of all the
columns is used.
The Execute Statement
The execute statement lets you execute a shell command by double-clicking an object.
The syntax is similar to that of the message command; however, since hierarchy
information is not displayed on a separate line, it is useful to include the hierarchy
information and to pass the key information as arguments.
Here is a sample execute statement that uses xconfirm to show a window with
information about the item. (The first line, the string, is broken into multiple lines to fit
into a single page. In an actual file, it should be on a single line. Multi-line strings are not
supported.)
execute "xconfirm -t %s -t sales of %s -t $%,.0f
-t target $%,.0f (%.0f%% of target)
-t 'last year $%,.0f, %.0f%% of last year'>/dev/null",
hierarchy(" "), isSummary()?"everything":product,
sales, target, pctTarget, lastYear, pctLastYear;
This might produce a dialog with the message
Eastern Connecticut Milford
sales of clothing
$348
target $427 (81% of target)
last year $372, 94% of last year
Note the use of hierarchy(" ") to produce a blank-separated description of the hierarchy.
Also note the isSummary()?"everything":product; this produces the word everything
if the base was selected, but otherwise produces the product. An alternative to this is
using separate execute and base execute statements.
If there is no execute statement, double-clicking an object has the same effect as
single-clicking it.
The View Options
The view section has many options. Like other options statements, the options can be
separated by commas, or they can appear on separate lines.
Sky and Ground Colors
The sky and ground color can be specied using the following syntax:
options sky color colorname
options sky color colorname colorname
options ground color colorname
options ground color colorname colorname
The syntax for color names is the same as that described in "Color Naming."
For both the sky and the ground it is possible to specify either one or two colors. If only
one color is specified, the sky or ground is solid. If two colors are specified, the sky or
ground is shaded between the colors. For the sky, the first color is for the top of the sky,
the second for the bottom. For the ground, the first color is for the far horizon, the second
for the near ground.
For example, to have a solid black background, specify:
options sky color "black", ground color "black";
Bar Layout
By default, bars in each chart are laid out as close to a square as possible. You can override
this using either the rows or the columns option:
options rows number
options columns number
Only one of these can be specied.
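For example, to lay out the bars of each chart in two rows rather than the default near-square arrangement:
options rows 2;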
Overview
Although the overview can be brought up using the Show menu, it can also be
configured to come up automatically at startup. The overview syntax is:
options overview on
options overview off
The rst form causes the overview to be displayed at startup. The second form (the
default) turns the overview off. Regardless of the setting, the overview can be invoked
from the Show menu.
Shrinkage
Hierarchies normally have a large aspect ratio, having greater width than depth. In their
unaltered form, it is impossible to view the entire hierarchy, except from such a far
distance that no detail would be visible. To see the hierarchy more clearly, distant objects
can be shrunk more than perspective normally dictates. The shrinkage option lets you
control the shrinkage for a given graph. The shrinkage option syntax is any of the
following:
options shrinkage auto
options shrinkage float
options shrinkage off
The first form (the default) automatically calculates a shrinkage value. Its results are
usually reasonable, but not necessarily optimal in unusual hierarchical layouts. Thus,
you might want to explicitly set the shrinkage using the second form. For hierarchies in
which some parts are deeper than others, automatic calculation does not work well. The
best shrinkage value depends on the graph being displayed, as well as various layout
options such as margins. You should experiment with each graph. Start with a value of
10.0, then make adjustments. Smaller values result in a narrower hierarchy and increased
distortion. The shrinkage value must be positive; avoid values smaller than 5.0.
Shrinkage can be turned off. This is recommended only for very small hierarchies, as it
produces hierarchies with very large aspect ratios.
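For example, to start from the suggested value and adjust from there:
options shrinkage 10.0;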
Root Label
By default, the root node of the hierarchy gets a label based on the name of the
configuration file. You can override this by using the root label option. The format is
options root label string
This option also affects the string displayed when an object is selected, as well as the
result of the hierarchy() function.
Note that the root label option has no effect if the base label statement was used (that
statement defines the base label for the root as well as for all other bases).
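For example (the label string itself is arbitrary):
options root label "All stores";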
Font
The font option controls the font used for drawing the labels. The syntax is
options font "fontname"
where fontname can be any font in the directory /usr/lib/DPS/outline/base.
It also can be the string "default". This attempts to use Helvetica (if available), or the default
Inventor font (if Helvetica is not available). Note that different systems can have different
fonts installed.
Base Label Color
The base label color option controls the color of the labels in front of the bases. The syntax
is
options base label color "color"
Bar Label Color
The bar label color option controls the color of the labels in front of the bars. The syntax is
options bar label color "color"
Line Color
The line color option controls the color of the lines connecting the nodes in the hierarchy.
The syntax is
options line color "color"
Zero
The zero option lets you determine whether bars, disks, and bases of height zero are
drawn solid, as an outline, or hidden completely. In the last case, space is left for the
object, but it is not drawn. The default value is solid. This option can be changed at run
time using the Display menu (see "The Display Menu" in Chapter 5).
The syntax for the zero option is
options zero solid
options zero outline
options zero hidden
Null
The null option lets you determine whether bars, disks, and bases of height null (see
Appendix J, "Nulls in MineSet") are drawn solid, outline, or hidden completely. In the
last case, space is left for the object, but it is not drawn. The default value is outline. This
option can be changed at run time using the Display menu (see "The Display Menu" in
Chapter 5). The syntax is
options null solid
options null outline
options null hidden
Other Options
There are 10 other options to control the layout of the display, level of detail, and other
parameters. Generally, it is not necessary to adjust these parameters. The values of many
of the options are in arbitrary units. Adjust the options by increasing or decreasing the
value. For the default values of these parameters, see the file
/usr/lib/MineSet/treeviz/viewHierarchyLandscape.
options speed float
Controls the speed during free-form (middle-mouse) horizontal navigation
(forward, backward, and side to side). The larger the value, the faster the motion.
options climb speed float
Controls the speed when moving up and down using Shift + middle mouse. The
larger the value, the faster the motion.
options leaf leaf margin float
Controls the distance between adjacent nodes in the hierarchy. Larger values move
the nodes farther away.
options root leaf margin float
Controls the distance between a node and its children. Larger values move the
nodes farther away.
options leaf edge margin float
Adds margin space next to nodes at the edge of a subhierarchy.
options initial position float float float
Provides the initial x, y, and z position from which the scene is viewed. A value of 0
0 0 positions the viewer at the root of the hierarchy; since the user is looking
forward, the root probably is not visible. Increasing x, y, and z moves the camera to
the right, up, and back, respectively. A typical position has a zero x, positive y, and
positive z. If unspecied, the initial position depends on the layout of the hierarchy.
options initial angle float
Provides the initial angle, measured in degrees, from which the hierarchy is viewed.
The value must be between 0 and 90. A value of 0 looks at the scene horizontally; a
value of 90 looks straight down.
options bar label size float
Specifies the size of the labels in front of the bars. Larger values result in larger
labels.
options base label size float
Specifies the size of the labels in front of the bases. Larger values result in larger
labels.
options lod [bar float float] [bar label float [float]]
[base float float] [base label float [float]] [disk float]
[motion float]
Controls the level of detail. The parameters can appear in any order, be omitted, or
placed in multiple lod options. These options control the changing form, or
disappearance, of objects, thus providing better system performance.
Except for the motion parameter, all float values represent the size of the object when the
form change or disappearance takes place. The smaller the value specified, the smaller
and farther away the object is when the change takes place. Smaller values provide nicer
graphics but slower system performance. The numbers of the different parameters
cannot be compared directly because the size of the object also determines when the
change takes place. A value of 0.0 means no level of detail changes for that parameter.
This setting can significantly slow the rendering process.
bar controls when a bar is drawn with less detail. The first value specifies when the object
is drawn as a pair of planes; the second value specifies when the object is drawn as a
single line.
bar label controls when the labels on the bars disappear. If two values are specified, the
first value specifies when the label is drawn in a lower-quality, fast font; the second value
controls when it disappears.
base controls when the bases, and the bar charts on top of them, disappear. The
first number is based on the width of the base; the second on the height of the base plus
the tallest bar on it.
base label controls when the label in front of the base disappears. If two values are
specified, the first value specifies when the label is drawn in a lower-quality, fast font; the
second value controls when it disappears.
initial depth controls the initial depth to which the hierarchy is viewed. When you are at
the top of the hierarchy, you see only the number of hierarchical levels specified by the
slider. The nodes in the rows are arranged to optimize their visibility. When navigating
to nodes lower in the hierarchy, additional rows are made visible automatically. The
nodes above them automatically adjust their locations to accommodate the newly added
nodes; thus, some nodes might seem to move. Note that the overview shows all nodes in
the hierarchy, not just the top nodes, so the layout of the overview might not match the
layout of the main view. The X in the overview approximates the corresponding location
in the main view; there is no exact mapping between the two layouts.
An initial depth of zero, or one greater than the depth of the hierarchy, shows the entire
hierarchy.
Once the Tree Visualizer is running, the depth can be changed through the filter panel.
disk controls when the disk disappears.
motion controls changes in some of the level of detail calculations when the scene is
animated. A value greater than 1.0 defaults to 1.0. A value of 1.0 species that motion has
no effect on the level of detail. Smaller values change the level of detail at a proportional
distance. For example, a value of 0.5 means that during animation, level of detail changes
occur at half the normal distance.
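As a sketch only (every value below is illustrative, in the arbitrary units described above, and not a recommendation):
options speed 5.0;
options initial position 0 10 50, initial angle 30;
options lod motion 0.5;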
Appendix C
C. Creating Data, Configuration, Hierarchy, and GFX Files
for the Map Visualizer
The first part of this appendix describes the types and formats of data supported by the
Map Visualizer. Data input to the Map Visualizer must be provided as a single file
containing raw data, usually in a tab-separated ASCII text form.
The second part discusses the configuration file, which describes how the Map Visualizer
reads in, and displays, the data file.
Both the data and configuration files can be generated automatically by the Tool Manager
(see Chapter 3).
Note: Read Chapter 6, "Using the Map Visualizer," before using this appendix.
The Data File
In its simplest form, the data file consists of a list of lines, each containing a set of fields
separated by one tab. (Other separators are also allowed, but only one can separate each
field. See "Input Options" on page 585.) All lines must contain the same fields. The
interpretation of the fields is specified by the configuration file, described in "The
Configuration File" on page 576. Using the U.S. population data
(examples/population.usa.data file), provided as part of the Map Visualizer package, the
first few lines of this input file appear as shown below:
AL 0 0 0 1000 9000 127901 309527 590756 771623 964201 996992
1262505 1513401 1828697 2138093 2348174 2646248 2832961 3061743
3266740 3444354 3894025 4040587 51705
AR 0 0 0 0 1000 14000 30000 98000 210000 435000 484000 803000
1128000 1312000 1574000 1752000 1854000 1949000 1910000 1786000
1923000 2286000 2351000 53187
AZ 0 0 0 0 0 0 0 0 0 0 10000 40000 88000 123000 204000 334000
436000 499000 750000 1302000 1775000 2717000 3665000 114000
CA 0 0 0 0 0 0 0 0 93000 380000 560000 865000 1213000 1485000
2378000 3427000 5677000 6907000 10586000 15717000 19971000 23668000
29760021 158706
In this example, the first column is a two-character string identifying the graphical
object (the state). (This string locates a record in a .gfx file containing information about
the shape of the graphical object.) The tab separator is followed by a grouping of 23
numeric values, which represent the state's population from 1770 through 1990, in
10-year increments. The next tab separator is followed by a single numeric value, which
specifies the state's area in square miles.
The data file cannot contain blank lines or comments. Missing or extra data on a line
causes an error.
Note: One tab (the default separator) separates each field. Do not insert multiple tabs to
line up the fields visually; this generates blank fields. The order of the columns must
match the format specified by the configuration file.
Any field in the data can also be a ?, indicating that the data is null (unknown). See
Appendix J, "Nulls in MineSet."
Data Types
The Map Visualizer supports integer, floating point number, and string data types, as
well as arrays of these types. The following data types are supported:
int represents a 32-bit signed integer.
float represents a single-precision floating point number. The decimal point is
optional. Numbers in exponential e notation are also accepted.
double represents a double-precision floating point number. The decimal point is
optional when representing a floating point number. Numbers in exponential e
notation are also accepted. The superior precision of double can be useful for
accurately representing large numbers, since float can represent only seven or eight
significant digits accurately. This superior accuracy, however, consumes twice the
memory space of float.
dataString represents a string that is unlikely to appear multiple times. If it appears
multiple times, multiple copies are made.
string represents a string of characters that can appear multiple times in the data
file. Unlike a dataString, only a single copy of a given string is stored in memory, no
matter how many times it appears in the data. This saves memory for strings
appearing many times.
Comparing strings is also much quicker than comparing dataStrings. Processing is
somewhat slower when looking for duplicate strings as they are read in. An
example of string use is for a division name that appears once for each department
in the division. If you are unsure whether to use a string or a dataString, use a
string.
fixed string represents a string of fixed length. Like a dataString, if a fixed string
appears multiple times, multiple copies are made. In general, fixed strings are used
internally for representations of data from databases, and are generally better to
use than strings or dataStrings.
date represents a date and time. In the data le, date must appear in the format
MM/DD/YY HH:MM:SS.
Fixed Arrays
With the Map Visualizer, you can use one- or two-dimensional arrays of fixed size. In a
fixed-size array, all entries of the given type have the same number of values. Arrays
contain the data values across one or two independent variables, that is, those
dimensions controlled by the sliders.
A variant of the enumerated array is the null enumerated array, which has an additional
entry at the beginning for null, represented by ?. See "Enumerated Arrays" in
Appendix B for a discussion of enumerated arrays.
The Configuration File
The configuration file format is flexible. Words in it must be separated by spaces, and it
is case-sensitive. Except for the include statement and text within quoted strings, spacing
and line breaks are irrelevant.
Overview
The configuration file's structure and grammar are explained in the following sections.
Sections
The configuration file consists of a series of sections, each of which has the following
syntax:
sectionKeyword
{
statements...
}
where sectionKeyword names the section. A semicolon (;) can follow the closing brace (})
but is not required. The order of the sections is significant, since sections can refer to
variables defined in previous sections.
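Schematically, then, a Map Visualizer configuration file has the following overall shape; this is only a sketch, and the statements inside each section are described in the remainder of this appendix:
input
{
...
}
expressions
{
...
}
view geography
{
...
}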
Defaults Files
As each section is encountered, a special configuration file (referred to as a defaults file) is
also read in. The defaults file has the same name as the section. Defaults files contain
options statements. These files are searched in the following order, as specified by the
X-resource Mapviz*configPath in the file /usr/lib/X11/app-defaults/Mapviz.
1. The directory /usr/lib/MineSet/mapviz. This directory contains system defaults.
2. The ~/.MineSet directory (where the tilde, ~, indicates your home directory). You can
set up personal defaults in this directory.
3. The current directory. This lets you set up defaults for each directory.
Files with the same name can appear in more than one of the above-named directories;
in this case, the order given is the one in which the directories are read. If the same option
is found in multiple files, the last option read is used. Note that the appropriate section
in the configuration file is read after all the defaults files; thus, options in the
configuration file override those in the defaults files.
Statements
A statement has the following syntax:
statementKeyword info ;
where statementKeyword defines the statement, and info varies according to the keyword.
A statement can be another section (using the brace format defined under "Sections" on
page 576).
Variable Names
A variable name can appear in two formats:
In the first format, it is a letter followed by a number of letters, digits, or
underscores. It cannot be a keyword, and should not be placed in quotation marks.
In the alternate form, the variable name should be surrounded by back quotes (`).
In this form, the variable name can match a keyword, and can contain even
non-alphanumeric characters. The primary purpose of this second form is for
configuration files generated automatically by the Tool Manager.
There is no scoping of variable names; a given variable name can be declared only once
in the configuration file.
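For example, a column whose name contains a space (a hypothetical name here) can be declared by back-quoting it:
float `last year`;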
Option Statements
Many sections have options statements, which have the syntax:
options key info, key info... ;
where key defines the specific option, and info depends on the key. In some cases, the key
can be more than one word. To maximize the number of allowable variable names, most
option keys are meaningful only within the appropriate option statement; keys do not
conflict with variable names. You can declare several options on the same line, separating
them by commas or placing them in several options statements. If two conflicting values
for the same option appear, the last value is taken.
Include Statements
The configuration file may contain lines of the form
include "filename"
These lines can appear anywhere in the configuration file, but each must be on its own
line. The filename must be in quotes; anything after the closing quote is ignored. The
number of nested includes is unlimited. If a relative pathname (one not beginning with
a slash) is specified, the file is first sought in the directory containing the current
configuration file. If include statements are present, this might not be the same as the
initially loaded configuration file. If it is not found in the directory containing the current
configuration file, the include file is sought in the current directory. If the file is not found,
an error message appears.
Sinclude Statements
A statement similar to an include is sinclude, which has the syntax:
sinclude "filename"
This is identical to the include statement, except that no error is given if the file does not
exist; instead, the sinclude statement is ignored.
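For example (both file names are hypothetical):
include "common.mapviz.config"
sinclude "site.overrides"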
Strings, Characters, and Comments
Strings and characters in the configuration file follow C conventions. Strings are in
double quotation marks ("), and characters are in single quotation marks ('). All standard
backslash conventions are followed (for example, \n represents a new line).
Comments begin with a pound (#) symbol at the beginning of a line; anything after this
symbol to the end of the line is ignored.
Keywords
The currently recognized keywords are listed below. Variables cannot have these names
unless they are surrounded by back quotes (`). Tokens appearing only in option
statements are not keywords, and can be used for variable names.
Table C-1 Keywords for the Map Visualizer
buckets expressions level outlines
color file map scale
colors float message separator
datapoints from modulus slider
dataString height monitor string
date input null summary
divide int objects title
double key off to
enum label on view
execute legend options
Expressions
Expressions are accepted in several places in the input. Expressions follow standard
C-language syntax. The operations and their symbols are listed in Table C-2:
Table C-2 Operators Used With Expressions
Operator Description
+       Addition
-       Subtraction
*       Multiplication
/       Division
%       Modulus
==      Equals
!=      Not equals
>       Greater than
<       Less than
>=      Greater than or equal to
<=      Less than or equal to
&&      AND
||      OR
!       NOT
&       Bitwise AND
|       Bitwise OR
^       Bitwise XOR
A?B:C   If (A), then B else C
Also, the following functions are available:
divide(x, y, z) divides x by y, unless y is zero. If y is zero, the result is z; this is
equivalent to y==0 ? z : x/y.
modulus(x, y, z) is similar to divide, but for modulus.
Type handling is similar to that in C. Expressions using int and float promote both sides
to float. Expressions using int and double, or float and double, promote both sides to
double. The result of a relational expression (for example, ==, <) is always an int. Type
casting is also supported.
Unlike in C, strings can be compared using relational expressions; the strings are
compared lexicographically.
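As a brief, informal illustration (total, count, and state are hypothetical columns), each of the following is a legal expression wherever expressions are accepted:
# total, count, and state are hypothetical columns
divide(total, count, 0.0)
(float)count / 7
state < "M" ? "East" : "West"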
The following sections explain the use and syntax of the Map Visualizer configuration
file's input, expressions, and view sections.
The Input Section
The first section of a configuration file is normally the input section. It defines the name and
format of the data file. A typical input section might look like this:
input
{
file
"/usr/lib/MineSet/mapviz/examples/population.usa.data";
enum int Year from 1770 to 1990 by 10;
string states;
float population[enum Year] separator ' ';
float sqMiles;
}
This example specifies that the input data file is called population.usa.data, and that there
are three tab-separated (the default) fields, as follows:
one of type string
one a fixed-length vector of type float, with each value separated by a space
one a scalar value of type float
When the input section is entered, the defaults file,
/usr/lib/MineSet/mapviz/input.mapviz.options, is read in.
File Statements
The file statement names the data file to be read. This statement is required. Its syntax is
file "filename";
The file name must be in double quotation marks. If it is a relative pathname (no leading
slash), it is first sought in the directory containing the current configuration file. If include
statements are present, this might not be the same as the initially loaded configuration
file. If it is not found in the current configuration file's directory, the file is sought in the
current directory.
Enum Statements
Enum statements declare enumeration variables that index into array fields. The enum
statement has three forms.
The first form is
enum type name from value1 to value2 by increment;
This declares an enum with values starting at value1 and incremented by increment
until they reach or exceed value2. For example, the statement:
enum int age from 20 to 70 by 10;
declares age as an array dimension with the values 20, 30, 40, 50, 60, and 70.
Type must be a number type (int, float, or double) or date (see "Dates" on page 583).
The second enum statement form is
enum type name from value1 to value2 across numberOfValues;
This declares an enum with values ranging from value1 to value2. The
numberOfValues is an integer specifying the number of values. For example, the
statement
enum int age from 20 to 70 across 6;
declares age as an enum with the values 20, 30, 40, 50, 60, and 70.
Type must be a number type (int, float, or double) or date (see "Dates" on page 583).
The third enum statement explicitly lists the enumeration values. Its form is
enum type name { value1, value2, ..., valueN };
Type can be any type or date (see "Dates" on page 583).
Dates
The enum statement includes special support for a date type that handles date and time
values starting Jan 1, 1753. The date type is valid only within enum statements. A date
enum statement can have the following syntaxes:
enum date format name from value1 to value2 across numberOfValues;
enum date format name { value1, value2, ..., valueN };
enum date format name from value1 to value2 by increment;
The format string specifies the format of the values; it is useful for controlling how dates
are displayed in the animation control panel. The syntax of the format string is similar to
the scanf function in C. Various units of time are represented by special characters
preceded by the percent symbol (%). For example:
enum date cq "Calendar Q%Q, %Y" from "Calendar Q1, 1980" to "Calendar Q3, 1985" by "1 quarter";
The "Calendar Q" in the format string matches the "Calendar Q" in value1 and value2. The
%Q in the format string indicates that the next number in value1 and value2 is the calendar
quarter. The comma and space in the format string match the commas and spaces in the
values. Finally, the %Y in the format string specifies that the year values are next.
Table C-3 lists the characters that can follow the percent symbol and the units of time
they represent.
Table C-3 Characters That Can Follow the Percent Symbol in the Format String
Character Time Unit Precision
Y year 4
Q calendar quarter 1
M month 2
N month name >= 3
D day 2
h hour 2
m minute 2
s second 2
With the exception of N, each character matches an integer of the specified precision. N
matches 3 or more characters giving the English name of the month.
The from-to-by form of the enum statement includes an increment value. For dates, the
increment is a quoted string containing an integer, an optional space, and one of the
special characters in Table C-3 or one of the symbols year, quarter, month, day, hour,
minute, and second. The plural forms of these symbols are also accepted. Note that these
symbols are not keywords, since they have special meaning only in the increment string.
The following are examples of valid increments:
1 year
7 days
4h
Data Statements
The data statements declare the columns in the data file. The columns must be declared
in the order they appear in the data file. The format of most data statements is
type name;
where type is int, float, double, string, dataString, date, or fixedString(n), where n is an
integer representing the width of the string; name is the variable name. Unlike in C, only
one variable can be declared per statement.
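For example, the following sketch declares four scalar columns (lastCensus and stateCode are hypothetical columns added for illustration; states and sqMiles come from the earlier input example):
# lastCensus and stateCode are hypothetical columns
string states;
float sqMiles;
date lastCensus;
fixedString(2) stateCode;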
Fixed Arrays
Fixed arrays can also be declared using simple numeric data declarations; however, if
you also are going to declare a slider, you must use the enum declaration form. The
declaration syntax is
type name [ number ] ;
For example:
float revenue [50];
You can also override the separator by declaring it as
type name [ number ] separator char;
For example:
float revenue [50] separator ':';
If no separator is specified, the default separator (usually a tab) is used.
Fixed arrays can also be two-dimensional, such as
enum string products {"bread", "milk", "cheese", "cereal",
"apples", "lettuce", "juice", "toothpaste", "soap", "eggs"};
enum year from 1985 to 1994 by 1;
float prices[enum products][enum year];
or
float prices[10][20];
which might be used for an array of prices for a set of 10 products over a 20-year period.
Using the prices array, for example, if you specified in the Tool Manager that data was to
be retrieved from the database in "wide" mode (with a bin for null values), the
enumerated products are declared as:
float prices[null enum products][enum year];
and the first column contains the prices for unknown products (products not in the
enumerated list of ten known products) declared in the enum string products statement.
Input Options
The input section of a configuration file has several options. All options statements begin with the
word options and have one or more comma-separated options.
The separator option defines the separator between columns in the data file. The
default separator is a tab. The syntax is
options separator char;
For example:
options separator ':';
Note: Arrays can override the separator.
The monitor option allows a dynamic update of the data displayed. When the
specified file is changed (for example, through the UNIX touch command), the data
file (not the configuration file) is reread. Note that although the data file could be
used to trigger the updates, it is better to use a different file so that the data file is
not read while it is being updated. The syntax of the monitor option is:
options monitor "filename";
options monitor "filename" timeout;
where filename is the file to watch, and the optional timeout specifies the number of
seconds to wait after the file changes. If the user interacts with the application in
any way during this timeout (via the mouse or keyboard), the timeout restarts.
Updating the file can take a few seconds. By specifying a timeout, the chances of an
update occurring while the user is interacting with the tool are minimized. This
might delay the update. If no timeout is specified, the update occurs immediately.
The file being monitored must exist at the start of the program. When this file is
being updated, it must not be removed and re-created; instead, only its modify time
should be updated (for example, through the touch command). If the file is deleted,
subsequent updates are not shown.
Suppose a program extractor extracts data from a database into a data file. If you
want the program to update the data file every 10 minutes, the script you write
might look like this:
extractor > dataFile; # create first data file
touch trigger; # create the trigger file
while (sleep 600) # sleep 10 minutes
do
extractor > dataFile; # create new data file
touch trigger; # force a reread
done & # this loop goes in the
# background
mapviz configFile; # run mapviz
kill $! # when mapviz exits, kill
# the update loop
The monitor option can be used only if the file alteration monitor /usr/etc/fam is
installed (this can be found in the subsystem desktop_eoe.sw.fam).
The input section of a configuration file might look like this:
input
{
file "dataFile";
#data declarations here
options monitor "trigger" 15;
}
The backslash option controls whether backslashes in the input data are treated
specially or like other characters. The syntax is:
options backslash off;
options backslash on;
The default is off. If backslash processing is on, separators in the input data
preceded by backslashes are treated as regular characters rather than separators.
Also, within strings standard C-style backslash processing is done.
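For instance, a hypothetical input section could combine a colon separator with backslash processing:
# hypothetical options: colon-separated fields with backslash escapes
options separator ':', backslash on;
With these options, a field value written in the data file as 3\:30 would be read as the string 3:30 instead of being split at the colon.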
The Expressions Section
The expressions section of a configuration file lets you define additional columns that are
expressions of existing columns. For example, one column can be defined as the sum of
two other columns. The following is a sample expressions section. This section assumes
two existing fixed-length columns of type double: male and female; these represent
spending by males and females on various goods across time (one independent
dimension). Two columns are added: total represents the total dollars spent, and
pctFemale represents the percentage of dollars spent by females.
expressions
{
double total[enum month] = male+female;
double pctFemale[enum month] = divide(female*100,total,50.0);
}
Note: The pctFemale calculation uses total, defined in the previous statement. Also, note
the use of the divide function rather than the / operator. This results in 50% for the case
where there are no dollars spent at all; using the / operator generates a divide by zero
error in such a case. (The divide function is described in the "Expressions" section.)
The format of the expressions section is
expressions
{
expressionDeclaration;
...
}
where expressionDeclaration has the following syntax:
type name = expression ;
The format of expression has already been described.
Since the expressions section has no options, no defaults file is read in for it.
The View Section
The view section of a configuration file describes how the graphic objects are displayed, including
the mapping of heights, colors, labels, and so forth. A sample view section is
view map
{
map objects "usa.states.hierarchy";
slider Year;
height population;
height legend label "Height: U.S. Population (1770-1990)";
color density, scale 0 250 500 750 1000;
color colors "white" "#ffc0c0" "#ff8080" "#ff4040" "red";
color legend label "Color: Pop. Density" "0/sq-mile"
"250/sq-mile" "500/sq-mile" "750/sq-mile"
"1000/sq-mile";
message "population %,.0f %,.1f per sq mile",
population, density;
execute "xconfirm -t 'Population %,.0f'
-t 'averaging %,.1f per sq mile'
-t 'across %,.0f sq-miles' > /dev/null",
population, density, sqMiles;
}
The first words of the view section (before the opening brace) describe the type of view.
The only view type supported is view map; thus, these words must introduce the view
section.
When entering the view section, the viewMap.mapviz.options defaults file is read in. Note
that there is no simple view defaults file, so the full name viewMap.mapviz.options must
be used.
Title Statement
The title statement inserts a title string at the bottom of the main window. The syntax is
title string;
where string is a string enclosed in double quotation marks.
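For example, a configuration file for the population data shown earlier might use a title such as the following (the wording is illustrative only):
title "U.S. Population by State";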
Map Statement
The map statement specifies how the graphical objects are to be drawn in the main
window. The map statement has three possible syntaxes: one required, the other two
optional. The required syntax is
map objects hierarchy_filename;
where objects is a keyword, and hierarchy_filename is a filename enclosed in double
quotation marks. This statement names the .hierarchy file describing the 3D graphical
objects that exhibit heights and colors.
The following map statements are optional:
map outlines hierarchy_filename;
Declares graphical objects that are drawn as flat lines on which the objects named by
the map objects statement are placed. See the samples provided in
examples/population.usa.cities.mapviz.
map level column_name;
Specifies an alternative level of the geographical hierarchy for initial display. For
example, in the examples/population.usa.mapviz file, the unstated default is
map level states;
and the main window initially displays individual states. If, instead, the
configuration file specified
map level eastWest;
the main window initially displays the United States as two halves: East and West.
Slider Statement
The slider statement identifies a key to be used as a slider dimension. Its syntax is
slider [enum] enumName;
where enumName is the name of an enum variable declared in the input section. Note that
the enum keyword is optional.
There can be 0, 1, or 2 slider statements. The first slider statement applies to the
horizontal slider. The second slider statement applies to the vertical slider. If there is no
slider statement, the resulting display does not include animation.
No slider statement is required if height and color map to non-array variables. One
slider statement can be included if height and color map to one-dimensional arrays.
Two slider statements can be included if height and color map to:
two-dimensional arrays, or
one-dimensional arrays, where dimensions are enum variable names that one of the
sliders controls.
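For example, the sample view section shown earlier animates the one-dimensional population array with the single statement slider Year;. A sketch for a hypothetical two-dimensional array indexed by Year and quarter (both of which would have to be declared as enums in the input section) would use two statements:
# quarter is a hypothetical second enum dimension
slider Year;
slider quarter;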
Height Statement
The height statement describes how the columns of data are mapped to the height of
objects. It consists of a series of clauses separated by commas. The first clause normally
contains the name of a column to be mapped to height (population, in the example in
the section "The View Section" on page 588). The column must be of a number type (int,
float, or double), of which float is the most memory-efficient. If the column is a
fixed-length array, the view section also must contain at least one, and no more than two,
slider statements.
If no height column is specified, all bars are flat, and the remaining height clauses have
no effect.
The scale clause lets you scale the height values. Normally, the height variable is mapped
directly to the height of the graphical objects, so that the tallest object (with the largest
numeric value) rises towards the top of the view window. With the optional scale clause,
all values are multiplied by the scale. The scale clause syntax is
scale float
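For example, the following sketch maps population to height but halves every bar (the scale factor is arbitrary):
height population, scale 0.5;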
The legend clause defines the meaning of the height mappings. Any string can be placed
in the height legend. The legend clause has the following syntaxes:
legend off This turns off the height legend (this is the default).
legend on This turns on the height legend. The default legend is
height:varname
where varname is the name of the variable that is mapped to height.
legend label string
This turns the legend on and explicitly sets the legend to string. If legend label is
used, legend on is unnecessary.
Color Statement
The color statement describes how values are mapped to colors. The format is similar to
that of the height statement, consisting of several clauses that can be separated by
commas or entered as multiple statements.
Color naming follows the conventions of the X Window System, except that the names
must be in quotation marks. Examples of valid colors are "green", "Hot Pink", and
"#77ff42". The last one is in the form #rrggbb, in which the red, green, and blue
components of the color are specified as hexadecimal values. Pure saturation is
represented by ff, a lack of color by 00. For example, #000000 is black, #ffffff is white,
#ff0000 is red, and #00ffff is cyan.
The color variable lets you specify a single column to be mapped to a color (as with
height). The column must be a number type.
The colors clause specifies the colors to be used. The colors clause's syntax is
colors "colorname" "colorname"...
The format for colorname is described above. Note that there are no commas between the
colors. This is because commas are used to separate clauses in the color statement. A
sample colors clause is
colors "red" "gray" "blue"
Colors in the list are subsequently referred to by their index, starting at zero. In the above
example, red is color 0, gray is color 1, and blue is color 2.
If there is no colors statement, colors are chosen randomly; however, if there is a colors
statement, at least as many colors must be specified as are to be mapped.
The scale clause allows assignment of values to a continuous range of colors. For
example, when displaying a percentage, red can be assigned to 0%, gray to 50%, and blue
to 100%. Intermediate values are interpolated; for example 25% is pinkish, and 55% is a
slightly bluish gray.
The syntax for the scale clause is
scale float float ...
The first value is mapped to color 0, the second to color 1, and so forth. The colors
statement must contain at least as many colors as are to be mapped to the largest index.
Values in this statement must be in increasing order. Any value less than the first value is
assigned the first color. Any value greater than the last value is assigned the
last color. Intermediate values are interpolated.
For example, assume the pctFemale column indicates what percentage of the group is
female, and you want to map a group that is 100% female to red, 100% male to blue, and
50% each to gray. The color statement for this is:
color pctFemale, colors "blue" "gray" "red", scale 0 50 100;
The buckets clause is similar to the scale clause, but without interpolation. Each value is
rounded down to the largest value in the clause that does not exceed it, and that value's
exact color is used. Values less than the first value use the first color.
The syntax for the buckets clause is
buckets float float ...
The syntax and assignment of colors is the same as for the scale clause.
If, in the pctFemale example, you used the buckets clause instead of the scale clause, the
statement would be:
color pctFemale, colors "blue" "gray" "red", buckets 0 50 100;
All values greater than or equal to 100 are colored red. Values greater than or equal to 50, but
less than 100, are gray. All other values are blue.
The normalize clause controls a form of color normalization, analogous to height
normalization. By default, color normalization is off. The syntax is
normalize off;
normalize on;
When color normalization is on, the color scale (or buckets) list of values must range
between 0 and 100. These color values then represent relative percentages of the range
from the minimum to the maximum for a given viewed scene. For example,
color totalSales, legend off;
color scale 0 100, colors "white" "red", normalize on;
generates colors in the range of white to red, where white corresponds to the
minimum totalSales and red corresponds to the maximum totalSales for the
particular set of graphical objects being viewed. See
/usr/lib/MineSet/mapviz/examples/variations.articles.france.mapviz for a more elaborate
example.
The legend clause creates a legend of the colors. By default, the color legend is off. The
legend clause syntax can be any of the following:
legend off
legend on
legend "string" "string" ...
legend label "string"
legend "string" "string" ... label "string"
The legend off clause turns the legend off. The legend on clause turns the legend on. It
can be omitted if other legend statements are included. Specifying only legend on
generates the default legend.
The default legend includes a single label to the left (with the name of the column that is
mapped to color), and a list of colored labels on the right (with values obtained from the
scale clause, the buckets clause, or from the keys). To override the strings in the colored
labels, specify the strings as:
legend "string" "string
To override the label on the left, specify it following the word label. To eliminate this
label, specify an empty string; that is
legend ""
Message Statement
The message statement specifies the message displayed when an object is selected. The
syntax is similar to the C printf statement. A sample message statement is
message "%s: $%f, %.0f%% of target, %.0f%% of last year",
product, sales, pctTarget, pctLastYear;
This could produce the following message:
furniture: $2425.37, 23% of target, 87% of last year
The formats must match the type of data being used:
Strings must use %s.
Ints must use integer formats (such as %d).
Floats and doubles must use floating point formats (such as %f).
For details of the printf format, see the printf (1) reference (man) page (type man printf
at the shell prompt).
A special format type has been added to printf. If the percent sign is followed by a
comma (for example, %,f), commas are inserted in the number for clarity. Currently,
only the United States convention of d,ddd,ddd.dddd is supported, with the decimal
point represented by a period, and commas separating every three places to the left of
the decimal point. For example, if the above format were:
message "%s: $%,f, %,.0f%% of target, %,.0f%% of last year",
product, sales, pctTarget, pctLastYear;
it would produce the message:
furniture: $2,425.37, 23% of target, 87% of last year
The $, *, h, l, ll, L, and n printf format options are not supported.
All values, including the format string, are expressions. Thus, if you had a pctFemale
column, but wanted a more gender-neutral message, you can use:
message pctFemale>50?"%f%% females":"%f%% males",
pctFemale>50?pctFemale:100-pctFemale;
If pctFemale is 70, the message 70% females is displayed; if pctFemale is 30, the
message 70% males is displayed. In this case, you can also achieve the same result with
a single format string:
message "%f%% %s", pctFemale>50?pctFemale:100-pctFemale,
pctFemale>50?"females":"males";
If no message is specified, a default message containing the names and values of all the
columns is used.
Execute Statement
The execute statement lets you execute a shell command by double-clicking an object.
The syntax is similar to that of the message command.
Here is a sample execute statement that uses xconfirm to show a window with
information about the item. Note that the command line (string) is shown as three lines.
In an actual file, this should be on a single line. Multi-line strings are not supported.
execute "xconfirm -t '%s' -t 'population %,.0f' -t '%,.0f per
sq mile' -t '%,.0f sq-miles' > /dev/null", states,
population, density, sqMiles;
This might produce a dialog with the message:
CA
64 per sq mile
266,807 sq-miles
If there is no execute statement, double-clicking an object has the same effect as
single-clicking it.
Summary Statement
The summary statement specifies the initial setting of the Show Data Points pulldown
menu option. The syntax is
summary datapoints on;
or
summary datapoints off;
The summary statement is optional, and the default setting is off.
The Hierarchy File
The hierarchy file defines the object hierarchy, allowing objects to be displayed at
different levels of aggregation. It enables the drill up and drill down capabilities of the
Map Visualizer (see "File Requirements" in Chapter 6). The hierarchy file is specified in
the .mapviz configuration file with the map objects hierarchy_filename statement (see
"The View Section" on page 588 and "Map Statement" on page 589).
Here are the first few lines of the usa.states.hierarchy file:
states regions eastWest USA
usa.states.gfx usa.states.gfx
usa.states.gfx usa.states.gfx
AL E_S_CENTRAL USA_E USA_ALL
AR W_S_CENTRAL USA_W USA_ALL
AZ MOUNTAIN USA_W USA_ALL
CA PACIFIC USA_W USA_ALL
CO MOUNTAIN USA_W USA_ALL
CT NEW_ENGLAND USA_E USA_ALL
DE MID_ATLANTIC USA_E USA_ALL
This defines how states combine into regions, sectors, and into a single object
encompassing all states.
The first record is a list of column names of the hierarchy; each name must be separated
by a single tab (\t) character. One of the column names must match a type string
column in the data file, as declared in the configuration file's input section (see "The Input
Section" on page 581). In this example, the first column name, states, is also the name of a
data column in the example population.usa.mapviz. The number of column names in this
record must be the same as the number of columns of hierarchy data, beginning at the
third record of the .hierarchy file. If there is only one column name (for example,
gfx_files/canada.provinces.hierarchy), then there are only two records in the .hierarchy file.
The second record is a list of .gfx file pathnames, where each pathname is separated by a
single tab (\t) character. Each column name in the first record must have a matching .gfx
file pathname.
If there is a single column name (and .gfx file pathname), then only these two records
must be in the file. If there are multiple column names and pathnames, then starting at
the third record in the .hierarchy file is an N-column table of keywords of graphical
objects, where N is the number of column names in the first record. Looking at the
sample file, the first column contains states keywords, the second column regions
keywords, the third the eastWest keywords, and the fourth the USA keyword. The
matching .gfx files contain the positions and shapes of each column's graphical
objects.
The third and remaining records in the hierarchy file are the hierarchy data. These
records define how objects at one level correspond to objects at other levels.
The .gfx File
The .gfx files define the geometry of each object used by the Map Visualizer when
displaying the objects. Each .gfx file contains multiple records, one for each object being
displayed. Each record contains:
the gfx keyword name
the gfx full name
the vertex pair count
the shape hint
the vertex pairs
The following steps guide you through the procedure for building .gfx files.
1. Using a digitizing scanner, convert a geographical image into an RGB image file
format. Note that the image itself is not used by the Map Visualizer; it is just used as
a template for defining the graphical objects in Step 6 on page 597.
2. Launch the i3dm application in /usr/demos/bin/. (If this application is not currently
installed, it can be installed from the IRIX 5.3 or 6.2 distribution, in the subsystem
demos.sw.tools.) This creates windows on your screen: a Menu window on the left,
an Input window across the bottom, and four windows (labeled TOP, Pers, Front,
and Right) on the right. All i3dm windows must remain displayed (not iconified) for
i3dm to work.
3. Move the cursor to the Front window.
4. Press the right mouse button to display options. Continue holding the right mouse
button, and scroll to the Image Background option, then to the Load Image option.
The Input window (at the bottom of your screen) prompts you for a name to apply
to this image.
5. Enter the name of the RGB image file. The image appears in the Front window.
6. Delineate the shape of each object in the image by pointing and clicking at
significant points on the boundary of each object. Do this in a clockwise sequence
for each object. Each identified point is called a vertex and is represented by
numeric x- and y-axis values. These values are assigned by the i3dm application
and exist in a relative frame of reference for that RGB image file. The following
procedure is used to delineate each object's shape:
Use the middle mouse button to drag the image in the Front window so that the
object you are going to delineate is completely exposed. If this is not possible,
see step 8.
Go to the Menu window, and click the right mouse button on the Create
pulldown menu.
Choose the Line option.
Start the point-and-click process of selecting vertices with the left mouse button
in the Front window. Note that the greater the number of vertices you identify,
the more accurate the resulting graphical image is.
Note the red line crosshairs as you move the cursor over the image. As you click
the left mouse button to declare each vertex, a small red box appears at that
point. The box of the previous vertex changes to a small x, and a yellow line
connects the new vertex to the previous vertices. As you move clockwise
around the object, stop selecting vertices immediately before you are about to
close the shape (that is, before clicking on the first vertex you selected when
starting to delineate the object).
Go to the Menu window, and click the right mouse button on the Attrib
pulldown menu.
Scroll to the Name option. The Input window (at the bottom of your screen)
prompts you for a name.
Enter a unique identifier for the object you have just delineated. Do not use
spaces. This becomes the object's gfx keyword name. For example, in
population.usa.mapviz the gfx column is specified as the first column in the data
file. This first column contains strings such as CA and NY. These are the
keyword names for the states. These keyword names are the gfx keyword
names in the associated gfx file.
Go to the Menu window, and click the right mouse button on Done.
7. Repeat Step 6 for every other object in the same image. If the object adjoins a
previously identified object, you must reuse common vertices by selecting them
with the middle mouse button instead of the left mouse button. Using the middle
mouse button while the crosshairs are positioned close to a previously selected
vertex ensures that the newly selected vertex is identical to the previously selected
one.
Caution: If a graphical object is too large to fit into the Front window, you must
identify the vertices in sections. After all the objects are declared and the vertex
information written to an ASCII file, you must edit this output file to join the sections
of each subdivided object.
8. When all objects are identified, save the recorded vertices in a file. To do this:
Go to the Menu window and press the right mouse button on the File pulldown
menu.
Scroll down to the File i3dm format option and choose it. The Input window (at
the bottom of your screen) prompts you for a filename.
Enter a filename, specifying the .i3dm suffix.
9. Exit the i3dm application. To do this
Go to the Menu window, and choose the File pulldown menu.
Scroll to the Exit option, and choose it.
10. Convert the i3dm format file into a gfx file format by using the convert.i3dm utility,
using the following syntax:
/usr/lib/MineSet/mapviz/convert.i3dm inputFilename
outputFilename.gfx
For each object, the utility prompts you to
confirm the object's keyword name (which defaults to the Attrib name you
supplied in Step 6, substep 6, above, when identifying the vertices)
declare the object's full name (which is the name the user sees in the Map
Visualizer's Selection window when using the mouse to select a geographical
object)
declare if the object has a concave shape that requires special handling
Note: Declaring an object to be concave results in an accurate graphical display, but
at the cost of slower performance. One strategy is to declare no objects as concave,
examine the display to determine which objects are inaccurately drawn, then
manually edit the gfx files for those objects, changing the string "convex" to
"concave". Another strategy is to declare all objects as concave (assuming there are
few objects), then determine if the resulting performance is acceptable.
Appendix D
D. Creating Data and Configuration Files for the Scatter
Visualizer
The first part of this appendix describes the types and formats of data supported by the
Scatter Visualizer. Data input to the Scatter Visualizer must be provided as a single file
containing raw data, usually in a tab-separated ASCII text form.
The second part discusses the configuration file, which describes how the Scatter
Visualizer reads in, and displays, the data file.
Both the data and configuration files can be generated automatically by the Tool Manager
(see Chapter 3).
Note: Read Chapter 7, "Using the Scatter Visualizer," before using this appendix.
The Data File
In its simplest form, the data file consists of a list of lines, each containing a set of fields,
each separated by one tab. (Other separators are also allowed, but only one per file can
separate each field. See "Input Options" on page 614.) All lines must contain the same
fields. The interpretation of the fields is specified by the configuration file, described in
the next section. Using the store sales data provided as part of the Scatter Visualizer
package (file /usr/lib/MineSet/scatterviz/examples/store-type.data), the first few lines of the
input file appear as:
LIQUOR STORE 4300,4460,4800,4900,4700,4200,4250,4200
2700,2800,2750,3000,2900,2600,2500,2650
1600,1650,1900,1950,2000,2200,2300,2300
GROCERY STORE 700,900,600,800,877,755,800,600
3000,2900,3100,2800,2899,2950,3400,3300
10000,11000,9000,9800,9700,9650,9770,9700
In this sample file listing, each line consists of four fields, separated by tabs. The first field
is a string that identifies a store type. The second field is an array of eight numbers,
separated by commas, which might be sales of alcohol over an eight-day period. The
third and fourth fields are also arrays of eight numbers that could represent sales of
tobacco and food, respectively, over the same eight-day period.
The sample data file has other fields in the same format, but these are not shown. These
additional fields correspond to sales of other products (see the configuration file
/usr/lib/MineSet/scatterviz/examples/store-type.scatterviz for a listing of all the fields).
The data file cannot contain blank lines or comments. Missing or extra data on a line
causes an error.
Note: One tab (the default separator) separates each field. Do not insert multiple tabs to
line up the fields visually; this generates blank fields. The order of the fields must match
the format specified by the configuration file.
Data Types
The Scatter Visualizer supports integer, floating-point number, and string data types, as
well as arrays of these types. The following data types are supported:
int represents a 32-bit signed integer.
float represents a single-precision floating point number. The decimal point is
optional. Numbers in exponential e notation are also accepted.
double represents a double-precision floating point number. The decimal point is
optional when representing a floating point number. Numbers in exponential e
notation are also accepted. The superior precision of double can be useful for
accurately representing large numbers, since float can represent only seven or eight
significant digits accurately. This superior accuracy, however, consumes twice the
memory space of float.
dataString represents a string that is unlikely to appear multiple times. If it appears
multiple times, several copies are made. A dataString is typically used to store an
address. Addresses are unlikely to be compared, and each record can have
a different address.
string represents a string of characters that can appear multiple times in the data
file. Unlike a dataString, only a single copy of a given string is stored in memory, no
matter how many times it appears in the data. This saves much memory for strings
appearing many times.
Comparing strings is also much quicker than comparing dataStrings. Processing is
somewhat slower when looking for duplicate strings as they are read in. An
example of string use is for a division name that appears once for each department
in the division. If you are unsure whether to use a string or a dataString, use a
string.
fixedString represents a string of fixed length. Like a dataString, if a fixed string
appears multiple times, multiple copies are made. In general, fixed strings are used
internally for representations of data from databases, and are generally better to
use than strings or dataStrings.
date represents a date and time. In the data file, a date must appear in the format
MM/DD/YY HH:MM:SS.
Arrays
With the Scatter Visualizer, you can use fields that are one- or two-dimensional arrays of
fixed size. In a fixed-sized array field, all entries of the given field are arrays with the
same number of values. Arrays contain the data values across one or two independent
variables (those dimensions controlled by the sliders). In the listing from the file
store-type.data, the second, third, and fourth fields are arrays.
Null Values
Any field or array element in the data file can also have the value ? (question mark),
indicating an unknown or null value (see the discussion of nulls in Appendix J).
The Configuration File
The configuration file format is flexible. Words in it must be separated by spaces, and it
is case-sensitive. Except for the include statement and text within quoted strings, spacing
and line breaks are irrelevant.
Sections
The configuration file consists of a series of sections, each of which has the form:
sectionKeyword
{
statements...
}
where sectionKeyword names the section. The order of the sections is significant, since
sections can refer to variables defined in previous sections.
Defaults Files
As each section is encountered, a special configuration file (referred to as a defaults file) is
also read in. Defaults files normally contain options statements. These files are read in the
following order:
1. The directory /usr/lib/MineSet/scatterviz. This directory usually contains system
defaults.
2. The ~/.MineSet directory (where the tilde, ~, indicates your home directory). You can
set up personal defaults in this directory.
3. The current directory. This lets you set up defaults for each directory.
Files with the same name can appear in more than one of the above-named directories;
in this case, the order given is the one in which the directories are read. If the same option
is found in multiple files, the last option read is used. Note that the appropriate section
in the configuration file is read after all the defaults files; thus, options in the
configuration file override those in the defaults files.
Statements
A statement has the following form:
statementKeyword info ;
where statementKeyword defines the statement, and info varies according to the keyword.
Variable Names
A variable name can appear in two formats:
In the first format, it is a letter followed by a number of letters, digits, or
underscores. It cannot be a keyword, and should not be quoted.
In the alternate form, the variable name should be surrounded by back quotes (`).
In this form, the variable name can match a keyword, and can contain even
non-alphanumeric characters. The primary purpose of this second form is for
configuration files generated automatically by the Tool Manager.
There is no scoping of variable names; a given variable name can only be declared once
in the configuration file.
Options Statements
Many sections have options statements, which have the form
options optionName info, optionName info... ;
where optionName defines the specific option, and info depends on the option. In some
cases, optionName can be more than one word. To maximize the number of allowable
variable names, most option names are meaningful only within the appropriate options
statement; option names do not conflict with variable names. You can declare several
options on the same line, separating them by commas or placing them in several options
statements. If two conflicting values for the same option appear, the last value is taken.
Include Statements
The configuration file can contain lines of the form
include "filename"
These lines can appear anywhere in the configuration file, but each must be on its own
line. The filename must be in quotation marks; anything after the closing quote is
ignored. The number of nested includes is unlimited. If a relative pathname (one not
beginning with a slash) is specified, the file is first sought in the directory containing the
current configuration file. If include statements are present, this might not be the same as
the initially loaded configuration file. If it is not found in the directory containing the
current configuration file, the include file is sought in the current directory.
Sinclude Statements
A statement similar to an include is sinclude, which has the form
sinclude "filename"
This is identical to the include statement, except that no error is given if the file does not
exist; instead, the sinclude statement is ignored.
Strings and Characters
Strings and characters in the configuration file follow C conventions. Strings are in
double quotation marks ("), and characters are in single quotation marks ('). All standard
backslash conventions are followed (for example, \n represents a new line).
Comments
Comments begin with a pound (#) symbol at the beginning of a line; anything after this
symbol, up to the end of the line, is ignored.
Keywords
The keywords recognized by the Scatter Visualizer are listed in Table D-1. Variables
cannot have these names unless they are surrounded by back quotes (`). Tokens
appearing only in option statements are not keywords, and can be used for variable
names.
Currently, the keywords execute, min, monitor, and time are not used by the Scatter
Visualizer.
Table D-1 Scatter Visualizer Keywords
across average axis buckets
by color colors dataString
date divide double entity
execute expressions file float
from include input int
key label legend max
message min modulus monitor
off on options scale
separator sinclude size slider
string sum summary time
to view
Expressions
Expressions are accepted in several places in the input. Expressions follow the syntax of
C. The following operations and their symbols are listed in Table D-2.
Table D-2 Operators Used With Expressions
Operator Description
+       Addition
-       Subtraction
*       Multiplication
/       Division
%       Modulus
==      Equals
!=      Not equals
>       Greater than
<       Less than
>=      Greater than or equal to
<=      Less than or equal to
&&      AND
||      OR
!       NOT
&       Bitwise AND
|       Bitwise OR
^       Bitwise XOR
A?B:C   If (A), then B else C
Also, the following functions are available:
divide(x, y, z) divides x by y, unless y is zero. If y is zero, the result is z; this is
equivalent to y==0 ? z : x/y.
modulus(x, y, z) is similar to divide, but for modulus.
Type handling is similar to that in C. Expressions using int and float promote both sides
to float. Expressions using int and double, or float and double, promote both sides to
double. The result of a relational expression (for example, ==, <) is always an int. Type
casting is also supported.
Unlike in C, strings can be compared using relational expressions; the strings are
compared lexicographically.
The Input Section
The first section of a configuration file is normally the input section. It defines the name
and format of the data file. A typical input section might look like this:
input {
file "company.data";
string company;
slider int income from 20000 to 60000 by 10000;
slider date "%N %Y" purchaseDate from "Jan 1990" to "Dec 1992" by "1 month";
options array separator ',';
float lifeSales[income][purchaseDate];
float autoSales[income][purchaseDate];
float homeSales[income][purchaseDate];
string location;
}
This example states that the input file is called company.data, and that there are five fields:
company, lifeSales, autoSales, homeSales, and location. The company and location fields are of
type string, while the other three fields are two-dimensional arrays of type float. Two
slider dimensions are declared:
income, which is of type int, ranges from 20000 to 60000 in increments of 10000; and
purchaseDate, which is of type date and ranges from January 1990 to December 1992
in increments of 1 month.
The arrays lifeSales, autoSales, and homeSales contain values for each income and purchase
date. Individual values within the arrays are separated by commas.
When the input section is entered, the defaults file inputDefaults is read in.
File Statements
The file statement names the data file to be read. This statement is required. Its form is:
file "filename";
filename must be in double quotation marks. If it is a relative pathname (no leading slash),
it is first sought in the directory containing the current configuration file. If include
statements are present, this might not be the same as the initially loaded configuration
file. If it is not found in the current configuration file's directory, the file is sought in the
current directory.
Enumeration Statements
Enumeration statements declare enumerations, or enums, that index into array fields.
The enum statement has three forms.
The first enum statement form is
enum type name from value1 to value2 by increment;
This declares an enum with values starting at value1 and incremented by increment
until they reach or exceed value2. For example, the statement
enum int age from 20 to 70 by 10;
declares age as an enum with the values 20, 30, 40, 50, 60, and 70.
Type must be a number type (int, float, or double) or date (see "Dates" on page 611).
The second enum statement form is
enum type name from value1 to value2 across numberOfValues;
This declares an enum with values ranging from value1 to value2. The
numberOfValues is an integer specifying the number of values. For example, the
statement:
enum int age from 20 to 70 across 6;
declares age as an enum with the values 20, 30, 40, 50, 60, and 70.
Type must be a number type (int, float, or double) or date (see "Dates" on page 611).
The third enum statement explicitly lists the enum values. Its form is:
enum type name { value1, value2, ..., valueN };
Type can be any type or date (see "Dates" on page 611).
Dates
The enum statement includes special support for a date type that handles date and time
values starting Jan 1, 1753. The date type is valid only within enum statements. A date
enum statement can have the following syntaxes:
enum date format name from value1 to value2 across numberOfValues;
enum date format name { value1, value2, ..., valueN };
enum date format name from value1 to value2 by increment;
The format string specifies the format of the values; it is useful for controlling how dates
are displayed in the animation control panel. The syntax of the format string is similar to
the scanf function in C. Various units of time are represented by special characters
preceded by the percent symbol (%). For example,
enum date cq "Calendar Q%Q, %Y" from "Calendar Q1, 1980" to "Calendar Q3, 1985" by "1 quarter";
The "Calendar Q" in the format string matches the "Calendar Q" in value1 and value2. The
%Q in the format string indicates that the next number in value1 and value2 is the calendar
quarter. The comma and space in the format string match the commas and spaces in the
values. Finally, the %Y in the format string specifies that the year values are next.
Table D-3 lists the characters that can follow the percent symbol and the units of time
they represent.
Table D-3 Characters That Can Follow the Percent Symbol in the Format String
Character Time Unit Precision
Y year 4
Q calendar quarter 1
M month 2
N month name >= 3
D day 2
h hour 2
m minute 2
s second 2
With the exception of N, each character matches an integer of the specified precision. N
matches 3 or more characters giving the English name of the month.
The from-to-by form of the enum statement includes an increment value. For dates, the
increment is a quoted string containing an integer, an optional space, and one of the
special characters in Table D-3 or one of the symbols year, quarter, month, day, hour,
minute, and second. The plural forms of these symbols are also accepted. Note that these
symbols are not keywords, since they have special meaning only in the increment string.
The following are examples of valid increments:
1 year
7 days
4h
Data Statements
The data statements declare the fields in the data file. The fields must be declared in the
order they appear in the data file. The format of most data statements is
type name;
where type is int, float, double, string, dataString, date, or fixedString(n), where n is an
integer representing the width of the string; name is the variable name. Unlike in C, only
one variable can be declared per statement.
A data field can also be based on an enumeration. The syntax is
enum enumName name;
The field must contain ints corresponding to the values of the enum. For example, if the
enum ageGroup is declared as
enum string ageGroup {"below 30", "30-39", "40-49", "50-59",
"60 or above"};
the field age can be declared as
enum ageGroup age;
The field should contain ints between 0 and 4, where 0 is displayed as "below 30", 1 as
"30-39", and so forth.
Only one variable can be declared per statement.
Arrays
Arrays are also declared using data declarations. The declaration syntax for
one-dimensional arrays is one of the following:
type name [ number ] ;
type name [ enumName ] ;
type name [ null enumName ] ;
For example:
float revenue [50];
The declaration syntax for two-dimensional arrays is one of the following:
type name [ number1 ][ number2 ] ;
type name [ enumName1 ][ enumName2 ] ;
type name [ null enumName1 ][ null enumName2 ] ;
For example:
float revenue [50][10];
When enums are used, the number of values in the array is taken from the declaration of
the enum. For example, given the statements
enum int age from 20 to 70 by 10;
float clothingPurchases[age];
the array clothingPurchases must have six values, corresponding to the enum values 20,
30, 40, 50, 60, and 70.
The keyword null indicates an extra value at the beginning of the array, corresponding
to null. Thus, the statements
enum int age from 20 to 70 by 10;
float clothingPurchases[null age];
declare clothingPurchases as an array with seven values: the first value corresponding to
null or unknown age values, and the remaining six values corresponding to age values
20, 30, 40, 50, 60, and 70.
You can override the separator between values in an array by declaring it as:
type name [ number ] separator char;
For example:
float revenue [50][10] separator ':';
If no separator is specified, the default separator (usually a tab) is used.
Input Options
All options statements begin with the word options and have one or more
comma-separated options.
The separator option defines the separator between fields in the data file. The
default separator is a tab. The syntax is
options separator char;
For example:
options separator ':';
Note: The separator is used also to separate values within arrays; however, arrays
can override the separator.
The backslash option controls whether backslashes in the input data are treated
specially or like other characters. The syntax is:
options backslash off;
options backslash on;
The default is off. If backslash processing is on, separators in the input data
preceded by backslashes are treated as regular characters rather than separators.
Within strings, this causes standard C-style backslash processing.
The Expressions Section
The expressions section of a configuration file lets you define additional fields that are
expressions of existing fields. For example, one field can be defined as the sum of two
other fields.
The format of the expressions section is
expressions
{
expressionDeclaration;
...
}
where expressionDeclaration has the following form:
type name = expression ;
The following is a sample expressions section. This section assumes two existing array
fields of type double: male and female; these represent spending by males and
females on various goods across time (one independent dimension). Two fields are
added: total represents the total dollars spent, and pctFemale represents the
percentage of dollars spent by females.
expressions
{
double total[36] = male+female;
double pctFemale[36] = divide (female*100, total, 50.0);
}
Note: The pctFemale calculation uses total, defined in the previous statement. Also,
note the use of the divide function rather than the / operator. This results in 50% for the
case where there are no dollars spent at all; using the / operator generates a divide by zero
error.
The expressions section has no options; thus, no defaults file is read in for it.
The View Section
The view section of a configuration file describes how the data is displayed, including
the mapping of sizes, colors, axes, and so on. The default values for these options are in
/usr/lib/MineSet/scatterviz/view.scatterviz.options. Its form is
view
{
viewStatement;
...
}
A sample view section is
view {
slider month;
entity brand;
axis male$, color blue;
axis female$, color red;
size total$, max 5;
color pctFemale, scale 0 50 100, colors "blue" "gray"
"red";
message "brand %s, total sales %,.0f", brand, total$;
}
When entering the view section, the viewDefaults file is read in.
Slider Statement
The slider statement identifies an enum to be used as a slider dimension. Its syntax is one
of the following:
slider enumName;
slider null enumName;
The enum name is declared in the input section. If the keyword null is present, the slider
includes a position at the beginning corresponding to null or unknown values of the
enum. Arrays indexed by the slider must be declared to match the null in the slider
statement.
There can be 0, 1, or 2 slider statements. The first slider statement applies to the
horizontal slider, the second to the vertical slider. If there is no slider statement, the
resulting display does not include animation.
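For example, the sample view section above animates over a single month dimension; a sketch with a hypothetical second enum, region, would add a second statement for the vertical slider:
# region is a hypothetical second enum declared in the input section
slider month;
slider region;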
Entity Statement
The entity statement lets you specify a variable that uniquely identifies the entities in the
display. The entity statement consists of a series of clauses, separated by commas:
entity clause1, clause2,...
Alternatively, the clauses can be given in separate entity statements.
The Entity Variable
The first clause of the entity statement normally contains the name of the entity variable
(brand in the example on page 615).
The Label Clause
This clause defines how the entities are labeled. It has the following forms:
label off
This turns off the labels.
label on
This turns on the labels. The default labels use the entity variable as the label for
each entity.
label variable
This turns on the labels and uses the given variable to label the entities. When this
form is used, it is not necessary to specify label on.
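For example, continuing the sample view section above, a sketch could label the entities with a hypothetical brandName field instead of the entity variable itself:
# brandName is a hypothetical field holding display names
entity brand, label brandName;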
The Label Color Clause
This clause turns on the labels and specifies their color. It has the form:
label color "colorname"
where colorname is the name of a color in a special format. (Color naming is explained in
"Color Statement" on page 618.) The default label color is gray.
The Legend Clause
The legend clause explains what the entities are. Any string can be placed in the entity
legend. The legend clause has the following forms:
legend off
This turns off the entity legend.
legend on
This turns on the entity legend (this is the default). The default legend is
Entity: varname
where varname is the name of the entity variable.
legend label string
This turns the legend on and explicitly sets the legend string. If this form is used,
legend on is unnecessary.
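For illustration only, a complete entity statement combining these clauses might read as follows; the brand variable comes from the sample view section, while the label color and legend text are arbitrary choices:
entity brand, label on, label color "white", legend label "Product brand";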
Size Statement
The size statement describes how a field of data is mapped to the sizes of entities. The
size statement consists of a series of clauses, separated by commas:
size clause1, clause2,...
Alternatively, the clauses can be given in separate size statements.
The Size Variable
The first clause normally contains the name of a field to be mapped to size (total$, in the
view example in The View Section on page 615). The field must be of a number type
(int, float, or double), of which float is the most efficient. The field can be an array that is
indexed by slider dimensions. If no size field is specified, all entities are the same size.
The Max Clause
Normally, the size variable is mapped to the size of the entities, so that the biggest entity
has a size of 5. This size can be changed by specifying a different value. If there is no size
variable, the default maximum size is 2.5. The max clause has the form
max float
The Scale Clause
Instead of using the max clause to affect size values, the scale clause can be used to scale
these values; all values are multiplied by the scale. The scale clause's syntax is
scale float
The Legend Clause
The legend clause defines the meaning of the size mappings. Any string can be placed in
the size legend. The legend clause has the following forms:
legend off
This turns off the size legend.
legend on
This turns on the size legend (this is the default). The default legend is:
size:varname
where varname is the name of the variable that is mapped to size.
legend label string
This turns the legend on and explicitly sets the legend string. If this form is used,
legend on is unnecessary.
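As an illustrative sketch (the max value and legend text are arbitrary; total$ comes from the sample view section), a size statement combining these clauses might read:
size total$, max 8, legend label "Total sales ($)";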
Color Statement
The color statement describes how values are mapped to colors. The format is similar to
the size statement, consisting of several clauses that can be separated by commas, or
entered as multiple statements. The syntax is:
color clause1, clause2,...
Color Naming
Color names follow the conventions of the X Window System, except that the names must
be in quotes. Examples of valid colors are "green", "Hot Pink", and "#77ff42". The latter
is in the form #rrggbb, in which the red, green, and blue components of the color are
specified in hexadecimal. Pure saturation is represented by ff, a lack of color by 00.
For example, "#000000" is black, "#ffffff" is white, "#ff0000" is red, and "#00ffff" is cyan.
The Color Variable
As with size, you also can specify a single field to be mapped to an entity color. The field
can be an array that is indexed by slider dimensions. If the field is an array, it must be a
number type. If the field is a number type, the scale and buckets clauses described below
can be used to map a range of colors to the values of the field. If the field is not a number
type, it is sorted, and each unique value is assigned a color.
The colors Clause
The colors clause specifies the colors to be used. The colors clause's syntax is:
colors "colorname" "colorname"...
The format for colorname is described in Color Naming on page 619. Note that there are
no commas between the colors, because commas are used to separate clauses in the color
statement. A sample colors clause is:
colors "red" "gray" "blue"
Colors in the list are subsequently referred to by their index, starting at zero. In the above
example, red is color 0, gray is color 1, and blue is color 2.
If there is no colors statement, colors are chosen randomly. If there is a colors statement,
at least as many colors must be specied as are to be mapped.
The scale Clause
The scale clause allows assignment of values to a continuous range of colors. For
example, when displaying a percentage, red can be assigned to 0%, gray to 50%, and blue
to 100%. Intermediate values are interpolated; for example 25% is pinkish, and 55% is a
slightly bluish gray.
The syntax for the scale clause is
scale float float ...
The first value is mapped to color 0, the second to color 1, and so forth. The colors
statement must contain at least as many colors as are to be mapped to the largest index.
Values in this statement must be in increasing order. Any value less than the first value is
assigned the first color. Any value greater than the last value is assigned the
last color. Intermediate values are interpolated.
For example, assume the pctFemale field indicates what percentage of the group is
female, and you want to map a group that is 100% female to red, 100% male to blue, and
50% each to gray. The color statement for this is:
color pctFemale, colors "blue" "gray" "red", scale 0 50 100;
Use the scale clause only in conjunction with a numeric color variable.
The buckets Clause
The buckets clause is similar to the scale clause, but without interpolation. Each value is
rounded down to the highest bucket value that does not exceed it, and that exact color is
used. Values less than the first value use the first color.
The syntax for the buckets clause is
buckets float float ...
The syntax and assignment of colors is the same as for the scale clause.
If, in the above example, you used the buckets clause instead of the scale clause, the
statement would be:
color pctFemale, colors "blue" "gray" "red", buckets 0 50 100;
All values greater than or equal to 100 are colored red. Values greater than or equal to
50 but less than 100 are gray. All other values are blue.
Use the buckets clause only with a numeric color variable.
The legend Clause
The legend clause creates a legend of the colors. The legend clause syntax can be any of
the following:
legend off
legend on
legend "string" "string" ...
legend label "string"
The legend off clause turns the legend off. The legend on clause turns the legend on. It
can be omitted if other legend statements are included. Specifying only legend on
generates the default legend.
The default legend includes a single label to the left (with the name of the field that is
mapped to color), and a list of colored labels on the right (with values obtained from the
scale clause, the buckets clause, or from the field). To override the strings in the colored
labels, specify the strings as:
legend "string" "string"
To override the label on the left, specify it following the word label. To eliminate this label,
specify an empty string; that is:
legend label ""
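For illustration, the pctFemale color mapping from the earlier example could be given custom legend strings as in the following sketch; the legend text is arbitrary:
color pctFemale, colors "blue" "gray" "red", scale 0 50 100,
legend "male" "even" "female", legend label "Spending split";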
Axis Statement
The axis statement causes a variable to be used as an axis in the 3D landscape. The
variable's values determine where the entities are positioned on the axis. There can be up
to three axis statements. Like the size and color statements, the axis statement contains a
series of comma-separated clauses, but all of them must be specified in a single
statement.
axis clause1, clause2,...
The Axis Variable
As with size and color, you can specify a field to be used as an axis. The field can be an
array that is indexed by slider dimensions. If the field is an array, it must be of type
number. If the field is not of type number, it is sorted, and each unique value is assigned
a position along the axis.
The Label Clause
The label clause has the form:
label "string"
The string is used to label the axis. It appears in the landscape, at the end of the axis line.
The default label is the name of the axis variable.
The Max Clause
Normally, the axis variable is mapped directly to the position of the entities along the
axis. The max clause lets you normalize the values of the axis variable, so that the
maximum value is mapped to the specified max. The max clause's syntax is:
max float
The Scale Clause
Instead of using the max clause to affect position values, the scale clause can be used to
scale the values. All values are multiplied by the scale. The scale clause syntax is
scale float
The Color Clause
The color clause specifies the color used for the axis line and label. It has the form:
color "colorname"
The Extend Clause
The extend clause specifies whether the axis should be extended automatically to include
the value zero. It has the form:
extend on
extend off
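Putting these clauses together, an axis statement might look like the following sketch; male$ comes from the sample view section, and the max value, label text, and color are arbitrary choices:
axis male$, max 10, label "Male spending ($)", color "blue", extend on;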
Summary Statement
The summary statement specifies a summation to be calculated over all the entities. The
summary is used to color the drawing window in the animation control panel. Like the
size and color statements, the summary statement has several clauses that can be
specified in one statement, separated by commas, or in separate statements.
summary clause1, clause2,...
The Summary Variable
You can specify the variable to be used in the summary. This variable must be of number
type. Typically, the summary variable is an array indexed by slider dimensions, so that
the summary value varies across the slider dimensions.
The Color Clause
The color clause specifies the color used to display the summary values in the drawing
window. It has the form
color "colorname"
Various shades of the color, from white to the specified color, are used to represent
summary values. The minimum summary value is mapped to white, while the
maximum summary value is mapped to the specified color. The default summary color
is red.
The Legend Clause
The legend clause creates a legend of the summary colors. The legend clause syntax can
be any of the following:
legend off
legend on
legend label "string"
The legend off clause turns the legend off. The legend on clause turns the legend on. It
can be omitted if other legend statements are included. Specifying only legend on
generates the default legend.
The legend includes a single label to the left (which defaults to the aggregation function
and variable used in the summary), and two colored labels on the right (with the
minimum and maximum summary values). To override the label on the left, specify it
following the word label. To eliminate this label, specify an empty string; that is
legend label ""
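For example (an illustrative sketch only; total$ comes from the sample view section, and the color and legend text are arbitrary):
summary total$, color "green", legend label "Total sales";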
Message Statement
The message statement specifies the message displayed when an entity is selected. The
syntax is similar to that of the C printf statement. A sample message statement is
message "%s: $%f, %.0f%% of target, %.0f%% of last year",
product, sales, pctTarget, pctLastYear;
This could produce the following message:
furniture: $2425.37, 23% of target, 87% of last year
The formats must match the type of data being used:
Strings must use %s.
Ints must use integer formats (such as %d).
Floats and doubles must use floating point formats (such as %f).
For details of the printf format, see the printf (1) reference (man) page (type man printf
at the shell prompt).
A special format type has been added to printf. If the percent sign is followed by a
comma (for example, %,f), commas are inserted in the number for clarity. Only the
United States convention of d,ddd,ddd.dddd is supported, with the decimal point
represented by a period, and commas separating every three places to the left of the
decimal point. For example, if the above format were:
message "%s: $%,f, %,.0f%% of target, %,.0f%% of last year",
product, sales, pctTarget, pctLastYear;
it would produce the message:
furniture: $2,425.37, 23% of target, 87% of last year
The $, *, h, l, ll, L, and n printf format options are not supported.
All values, including the format string, are expressions. Thus, if you had a pctFemale
field, but wanted a more gender-neutral message, you could use:
message pctFemale>50?"%f%% females":"%f%% males",
pctFemale>50?pctFemale:100-pctFemale;
If pctFemale is 70, the message 70% females is displayed; if pctFemale is 30, the
message 70% males is displayed. In this case, you can also achieve the same result with
a single format string:
message "%f%% %s", pctFemale>50?pctFemale:100-pctFemale,
pctFemale>50?"females":"males";
If no message is specified, a default message containing the names and values of all the
fields is used.
Execute Statement
The execute statement lets you execute a shell command by double-clicking an object.
The syntax is similar to that of the message command.
Here is a sample execute statement that uses xconfirm to show a window with
information about the item. Note that the command line (string) is shown as three lines.
In an actual file, this should be on a single line. Multi-line strings are not supported.
execute "xconfirm -t '%s' -t 'population %,.0f' -t '%,.0f per
sq mile' -t '%,.0f sq-miles' > /dev/null", states,
population, density, sqMiles;
This might produce a dialog with the message:
CA
64 per sq mile
266,807 sq-miles
If there is no execute statement, double-clicking an object has the same effect as
single-clicking it.
The Filter Statement
The filter statement specifies that only entities meeting certain filter criteria are displayed
initially (see The View Menu on page 242 of Chapter 7). The filter criteria are in the
form of expressions whose values must all be true or nonzero for an entity to be
displayed (expressions are described in Expressions on page 608).
The syntax of the filter statement is
filter expression,expression,...
For example, the statement
filter state == "CA" || state == "WA", sales > 9000, pctTarget >= 90;
specifies that only records from California or Washington state, with sales greater than
9000 and a pctTarget value greater than or equal to 90, should be displayed initially.
After the Scatter Visualizer is invoked, the filter criteria can be changed or removed
interactively using the filter panel.
View Options
The view section of the configuration file has several options for controlling parameters
of the display. These options can appear in a single options statement, separated by
commas, or in separate options statements. The syntax of the options statement is
options option, option,...
The following options are available:
entity label size float
controls the size of the entity labels.
axis label size float
controls the size of the axis labels.
hide entity label distance float
controls the distance at which entity labels become invisible. Smaller distances
might improve performance, but the labels disappear more quickly.
grid color "colorname"
controls the color of the grid.
grid size float float float
controls the spacing between grid lines. It applies the three values to grid lines
along the x, y, and z axes, respectively.
entity shape shapeName
specifies the shape used to display entities. shapeName can be cube, bar, or
diamond.
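As an illustrative sketch (all values are arbitrary), several of these options could be set in a single statement:
options entity label size 2.0, grid color "gray", entity shape cube;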
Appendix E
E. Creating Data and Configuration Files for the Splat Visualizer
The first part of this appendix describes the types and formats of data supported by the
Splat Visualizer. Data input to the Splat Visualizer must be provided as a single file
containing raw data, as tab-separated ASCII text.
The second part discusses the configuration file, which describes how the Splat
Visualizer reads in, and displays, the data file.
Both the data and configuration files can be generated automatically by the Tool Manager
(see Chapter 3).
Note: Read Chapter 8, Using the Splat Visualizer, before using this appendix.
The Data File
In its simplest form, the data file consists of a list of lines, each containing a set of fields,
each separated by one tab. (Other separators are also allowed, but only one per file can
separate each field. See Input Options on page 637.) All lines must contain the same
fields. The interpretation of the fields is specified by the configuration file, described in
the next section. Using the adultJobs data file provided as part of the Splat Visualizer
package (file /usr/lib/MineSet/splatviz/examples/adultJobs.data), the first few lines of the
input file appear as:
BachelorsAdm-clerical3351189.4869565217115
BachelorsExec-managerial2570722.627118644159
BachelorsAdm-clerical2337876.328358209134
BachelorsExec-managerial3034436.85
Bachelors Tech-support1237583.666673
Bachelors Tech-support1313711.333333
Bachelors Tech-support1429878.7419331
In this sample file listing, each line consists of six fields, separated by tabs. The first field
is a string that identifies the level of education. The second field is a string that identifies
the occupation. The third field identifies the age bin. The fourth field identifies the
hours-worked-per-week bin. The fifth field quantifies the average gross income. The
sixth field is the weight of records in the aggregate (that is, the record count, unless record
weighting has been used). This data file was derived from
/usr/lib/MineSet/data/adult94.data by performing Tool Manager operations (specifically
binning and aggregation).
The data file cannot contain blank lines or comments. Missing or extra data on a line
causes an error.
Note: One tab (the default separator) separates each field. Do not insert multiple tabs to
line up the fields visually; this generates blank fields. The order of the fields must match
the format specified by the configuration file.
Data Types
The Splat Visualizer supports the following seven data types:
int represents a 32-bit signed integer.
float represents a single-precision floating point number. The decimal point is
optional. Numbers in exponential e notation are also accepted.
double represents a double-precision floating point number. The decimal point is
optional when representing a floating point number. Numbers in exponential e
notation are also accepted. The superior precision of double can be useful for
accurately representing large numbers, since float can represent only seven or eight
significant digits accurately. This superior accuracy, however, consumes twice the
memory space of float.
dataString represents a string that is unlikely to appear multiple times. If it appears
multiple times, several copies are made. A dataString is typically used to store an
address. Addresses are unlikely to be compared, and each record can have a
different address.
string represents a string of characters that can appear multiple times in the data
file. Unlike a dataString, only a single copy of a given string is stored in memory, no
matter how many times it appears in the data. This saves much memory for strings
appearing many times.
Comparing strings is also much quicker than comparing dataStrings. Processing is
somewhat slower when looking for duplicate strings as they are read in. An
example of string use is for a division name that appears once for each department
in the division. If you are unsure whether to use a string or a dataString, use a
string.
fixed string represents a string of fixed length. Like a dataString, if a fixed string
appears multiple times, multiple copies are made. In general, fixed strings are used
internally for representations of data from databases, and are generally better to
use than strings or dataStrings.
date represents a date and time. In the data file, a date must appear in the format
MM/DD/YY HH:MM:SS.
Null Values
Any field element in the data file can also have the value ? (question mark), indicating
an unknown or null value (see the discussion of nulls in Appendix J).
The Configuration File
The configuration file format is flexible. Words in it must be separated by spaces, and it
is case-sensitive. Except for the include statement and text within quoted strings, spacing
and line breaks are irrelevant.
Sections
The conguration le consists of a series of sections, each of which has the form:
sectionKeyword
{
statements...
}
where sectionKeyword names the section. The order of the sections is significant, since
sections can refer to variables defined in previous sections.
Defaults Files
As each section is encountered, a special configuration file (referred to as a defaults file) is
also read in. Defaults files normally contain options statements. These files are read in the
following order:
1. The directory /usr/lib/MineSet/splatviz. This directory usually contains system
defaults.
2. The ~/.MineSet directory (where the tilde [~] indicates your home directory). You
can set up personal defaults in this directory.
3. The current directory. This lets you set up defaults for each directory.
Files with the same name can appear in more than one of the above-named directories;
in this case, the order given is the one in which the directories are read. If the same option
is found in multiple files, the last option read is used. Note that the appropriate section
in the configuration file is read after all the defaults files; thus, options in the
configuration file override those in the defaults files.
Statements
A statement has the following form:
statementKeyword info ;
where statementKeyword defines the statement, and info varies according to the keyword.
Variable Names
A variable name can appear in two formats:
In the first format, it is a letter followed by a number of letters, digits, or
underscores. It cannot be a keyword, and should not be quoted.
In the alternate form, the variable name should be surrounded by back quotes (`).
In this form, the variable name can match a keyword, and can contain even
non-alphanumeric characters. Configuration files generated automatically by the
Tool Manager use this form.
There is no scoping of variable names; a given variable name can only be declared once
in the configuration file.
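For example (a hypothetical declaration for illustration only), a column whose name contains spaces could be declared in the input section as:
double `hours per week`;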
Options Statements
Many sections have options statements, which have the form
options optionName info,optionName info... ;
where optionName defines the specific option, and info depends on the option. In some
cases, optionName can be more than one word. To maximize the number of allowable
variable names, most option names are meaningful only within the appropriate options
statement; option names do not conflict with variable names. You can declare several
options on the same line, separating them by commas or placing them in several options
statements. If two conflicting values for the same option appear, the last value is taken.
Include Statements
The configuration file can contain lines of the form
include "filename"
These lines can appear anywhere in the configuration file, but each must be on its own
line. The filename must be in quotation marks; anything after the closing quote is
ignored. The number of nested includes is unlimited. If a relative pathname (one not
beginning with a slash) is specified, the file is first sought in the directory containing the
current configuration file. If include statements are present, this might not be the same as
the initially loaded configuration file. If it is not found in the directory containing the
current configuration file, the include file is sought in the current directory.
Sinclude Statements
A statement similar to an include is sinclude, which has the form
sinclude "filename"
This is identical to the include statement, except that no error is given if the file does not
exist; instead, the sinclude statement is ignored.
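For illustration (the filenames here are hypothetical), a configuration file could pull in shared declarations and optional site-specific settings as follows:
include "common.schema"
sinclude "site.options"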
Strings and Characters
Strings and characters in the configuration file follow C conventions. Strings are in
double quotation marks ("), and characters are in single quotation marks ('). All standard
backslash conventions are followed (for example, \n represents a new line).
Comments
Comments begin with a pound symbol (#) at the beginning of a line; everything from
that symbol to the end of the line is ignored.
Keywords
The keywords recognized by the Splat Visualizer are listed in Table E-1. Variables cannot
have these names unless they are surrounded by back quotes (`). Tokens appearing only
in option statements are not keywords, and can be used for variable names.
Currently, the keywords execute, monitor, weight, and time are not used by the Splat
Visualizer.
Table E-1 Splat Visualizer Keywords
across average axis buckets
by color colors weight
dataString date divide double
execute expressions file float
from include input int
key label legend max
message min modulus monitor
off on options opacity
scale separator sinclude size
slider string sum summary
time to view background
The Input Section
The first section of a configuration file is normally the input section. It defines the name
and format of the data file. A typical input section might look like this:
input {
file "adultJobs.data";
enum string `age_bin_k` {"- 20", "20-30", "30-40",
"40-50", "50-60", "60-70", "70+"};
enum string `hours_per_week_bin_k` {"- 20", "20-25", "25-30",
"30-35", "35-40", "40-45", "45-50", "50-55", "55-60", "60-65",
"65-70", "70+"};
string `education`;
string `occupation`;
enum `age_bin_k` `age_bin`;
enum `hours_per_week_bin_k` `hours_per_week_bin`;
double `avg_gross_income`;
int `count_gross_income`;
}
This example states that the input file is called adultJobs.data, and that there are six fields:
education, occupation, age_bin, hours_per_week_bin, avg_gross_income, and
count_gross_income. The education and occupation fields are of type string. The age_bin
and hours_per_week_bin fields are of type enum, where the values of these enums are
defined by age_bin_k and hours_per_week_bin_k, respectively. The column avg_gross_income
is of type double, and the field count_gross_income is of type int.
When the input section is entered, the defaults file inputDefaults is read in.
File Statements
The file statement names the data file to be read. This statement is required. Its form is:
file "filename";
filename must be in double quotation marks. If it is a relative pathname (no leading slash),
it is first sought in the directory containing the current configuration file. If include
statements are present, this might not be the same as the initially loaded configuration
file. If it is not found in the current configuration file's directory, the file is sought in the
current directory.
Enumeration Statements
Enumeration statements declare enumerations, or enums. The enum statement has three
forms.
The first enum statement form is
enum type name from value1 to value2 by increment;
This declares an enum with values starting at value1 and incremented by increment
until they reach or exceed value2. For example, the statement
enum int age from 20 to 70 by 10;
declares age as an enum with the values 20, 30, 40, 50, 60, and 70.
Type must be a number type (int, float, or double) or date (see Dates on page 634).
The second enum statement form is
enum type name from value1 to value2 across numberOfValues;
This declares an enum with values ranging from value1 to value2. The
numberOfValues is an integer specifying the number of values. For example, the
statement:
enum int age from 20 to 70 across 6;
declares age as an enum with the values 20, 30, 40, 50, 60, and 70.
Type must be a number type (int, float, or double) or date (see Dates on page 634).
The third enum statement explicitly lists the enum values. Its form is:
enum type name {value1,value2, ..., valueN };
Type can be any type or date (see Dates on page 634).
Dates
The enum statement includes special support for a date type that handles date and time
values starting Jan 1, 1753. The date type is valid only within enum statements. A date
enum statement can have the following syntaxes:
enum date formatname from value1 to value2 across
numberOfValues;
enum date formatname { value1,value2, ..., valueN };
enum date formatname from value1 to value2 by
increment;
The format string specifies the format of the values; it is useful for controlling how dates
are displayed in the animation control panel. The syntax of the format string is similar to
the scanf function in C. Various units of time are represented by special characters
preceded by the percent symbol (%). For example,
enum date cq "Calendar Q%Q, %Y" from "Calendar Q1, 1980" to
"Calendar Q3, 1985" by "1 quarter";
The "Calendar Q" in the format string matches the "Calendar Q" in value1 and value2. The
%Q in the format string indicates that the next number in value1 and value2 is the calendar
quarter. The comma and space in the format string match the commas and spaces in the
values. Finally, the %Y in the format string specifies that the year values are next.
Table E-2 lists the characters that can follow the percent symbol and the units of time they
represent.
With the exception of N, each character matches an integer of the specied precision. N
matches 3 or more characters giving the English name of the month.
Table E-2 Characters That Can Follow the Percent Symbol in the Format String
Character Time Unit Precision
Y year 4
Q calendar quarter 1
M month 2
N month name >= 3
D day 2
h hour 2
m minute 2
s second 2
The from-to-by form of the enum statement includes an increment value. For dates, the
increment is a quoted string containing an integer, an optional space, and one of the
special characters in Table E-2 or one of the symbols "year", "quarter", "month", "day",
"hour", "minute", and "second". The plural forms of these symbols are also accepted. Note
that these symbols are not keywords, since they have special meaning only in the increment
string. The following are examples of valid increments:
"1 year"
"7 days"
"4h"
Data Statements
The data statements declare the fields in the data file. The fields must be declared in the
order they appear in the data file. The format of most data statements is
type name;
where type is int, float, double, string, dataString, date, or fixedString(n), where n is an
integer representing the width of the string; name is the variable name. Unlike in C, only
one variable can be declared per statement.
A data field can also be based on an enumeration. The syntax is
enum enumName name;
The field must contain ints corresponding to the values of the enum. For example, if the
enum ageGroup is declared as
enum string ageGroup {"below 30", "30-39", "40-49", "50-59",
"60 or above"};
the field age can be declared as
enum ageGroup age;
The field should contain ints between 0 and 4, where 0 is displayed as "below 30", 1 as
"30-39", and so forth.
Only one variable can be declared per statement.
Input Options
All options statements begin with the word options and have one or more
comma-separated options.
The separator option defines the separator between fields in the data file. The
default separator is a tab. The syntax is
options separator char;
For example:
options separator ':';
The backslash option controls whether backslashes in the input data are treated
specially or like other characters. The syntax is:
options backslash off;
options backslash on;
The default is off. If backslash processing is on, separators in the input data
preceded by backslashes are treated as regular characters rather than separators.
Within strings, this causes standard C-style backslash processing.
The View Section
The view section of a configuration file describes how the data is displayed, including
the mapping of sizes, colors, axes, and so on. The default values for these options are in
/usr/lib/MineSet/splatviz/view.splatviz.options. Its form is
view
{
viewStatement;
...
}
A sample view section is
view {
slider `age_bin`;
opacity `count_gross_income`;
color `avg_gross_income`;
axis `education`, color "grey";
axis `occupation`, color "grey";
axis `hours_per_week_bin`, max 100, color "grey";
options grid size 0 0 0;
summary `count_gross_income`, color "red";
}
When entering the view section, the viewDefaults file is read in.
Slider Statement
The slider statement identifies an enum column to be used as a slider dimension. Its
syntax is:
slider columnName;
columnName is declared in the input section. If this column contains nulls, the
slider includes a beginning position corresponding to those null values.
There can be 0, 1, or 2 slider statements. The first slider statement applies to the
horizontal slider, the second to the vertical slider. If there is no slider statement, the
resulting display does not include animation.
Opacity Statement
In the Splat Visualizer, the opacity is based on counts or record weights. If a column is
mapped to this requirement, it is used to weight each record (rather than using 1), when
computing a value for the opacity. Thus, if you had a column with values for population,
density, or the result of a count aggregation, you might want to map this column to the
opacity (weight) requirement. If you had no such column, the requirement is left
unmapped, and a column of 1's is used by default.
An opacity type can be set for each axis; this can be Max Opacity or Scale Opacity.
Max Opacity lets you specify that an axis is scaled independently to a specified
opacity. If one axis has a Max Opacity that is twice as opaque as the other, it will
be twice as opaque, regardless of the data values. This option is most useful
when comparing axes that are in different units (for example, comparing
income to age). This option has no effect on non-numeric data.
Scale Opacity lets you specify that the axis is scaled based on its maximum
value. If two axes have the same Scale Opacity, but one has a maximum that is
twice the value of the other, the former will be twice as opaque as the latter. This
option is useful for comparing axes with the same units (for example, income
vs. expenses). This option does affect the opacity of non-numeric axes.
In the Splat Visualizer, this column (if present) is sum aggregated, and its name is
prepended with "sum_". If no column is present, the record count is used. In either case,
the values of this column are used to compute opacity. The legend at the bottom of the
main window shows this column name after "opacity:".
The opacity statement describes how a column is mapped to the opacity of the splats.
The opacity statement consists of a series of clauses, separated by commas:
opacity clause1, clause2,...
Alternatively, the clauses can be given in separate opacity statements.
The Opacity Variable
The first clause normally contains the name of a field to be mapped to opacity
(count_gross_income, in the view example in The View Section on page 637). The field
must be of a number type (int, float, or double).
The Legend Clause
The legend clause defines the meaning of the opacity mapping. The legend clause has
the following forms:
legend off
This turns off the opacity legend.
legend on
This turns on the opacity legend (this is the default). The default legend is:
opacity:count
where count is a column that the tool has created by counting the number of
records in each aggregate. If a column was mapped to opacity, the name of this
column prepended with "sum_" is shown in the legend. This new column is
computed by sum aggregating the column mapped to the opacity requirement.
legend label string
This turns the legend on and explicitly sets the legend string. If this form is used,
legend on is unnecessary.
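For illustration, an opacity statement using the count_gross_income column from the sample view section might read (the legend text is an arbitrary choice):
opacity `count_gross_income`, legend label "Number of people";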
Color Statement
The color statement describes how values are mapped to colors. The format is similar to
the opacity statement, consisting of several clauses that can be separated by commas, or
entered as multiple statements. The syntax is:
color clause1, clause2,...
Color Naming
Color names follow the conventions of the X Window System, except that the names must
be in quotes. Examples of valid colors are "green", "Hot Pink", and "#77ff42". The latter
is in the form #rrggbb, in which the red, green, and blue components of the color are
specified in hexadecimal. Pure saturation is represented by ff, a lack of color by 00.
For example, "#000000" is black, "#ffffff" is white, "#ff0000" is red, and "#00ffff" is cyan.
The Color Variable
As with opacity, you also can specify a column to be mapped to splat color. If the column
is a number type, the scale and buckets clauses described below can be used to map a
range of colors to the values of the column.
The colors Clause
The colors clause specifies the colors to be used. The colors clause's syntax is:
colors "colorname" "colorname"...
The format for colorname is described in Color Naming on page 640. Note that there are
no commas between the colors, because commas are used to separate clauses in the color
statement. A sample colors clause is:
colors "red" "gray" "blue"
Colors in the list are subsequently referred to by their index, starting at zero. In the above
example, red is color 0, gray is color 1, and blue is color 2.
If there is no colors statement, colors are chosen randomly. If there is a colors statement,
at least as many colors must be specied as are to be mapped.
The scale Clause
The scale clause allows assignment of values to a continuous range of colors. For
example, when displaying a percentage, red can be assigned to 0%, gray to 50%, and blue
to 100%. Intermediate values are interpolated; for example 25% is pinkish, and 55% is a
slightly bluish gray.
The syntax for the scale clause is
scale float float ...
The first value is mapped to color 0, the second to color 1, and so forth. The colors
statement must contain at least as many colors as are to be mapped to the largest index.
Values in this statement must be in increasing order. Any value less than the first value is
assigned the first color. Any value greater than the last value is assigned the
last color. Intermediate values are interpolated.
For example, assume the pctFemale field indicates what percentage of the group is
female, and you want to map a group that is 100% female to red, 100% male to blue, and
50% each to gray. The color statement for this is:
color pctFemale, colors "blue" "gray" "red", scale 0 50 100;
Use the scale clause only in conjunction with a numeric color variable.
The buckets Clause
The buckets clause is similar to the scale clause, but without interpolation. Each value is
rounded down to the highest bucket value that does not exceed it, and that exact color is
used. Values less than the first value use the first color.
The syntax for the buckets clause is
buckets float float ...
The syntax and assignment of colors is the same as for the scale clause.
If, in the above example, you used the buckets clause instead of the scale clause, the
statement would be:
color pctFemale, colors "blue" "gray" "red", buckets 0 50 100;
All values greater than or equal to 100 are colored red. Values greater than or equal to
50 but less than 100 are gray. All other values are blue.
Use the buckets clause only with a numeric color variable.
The legend Clause
The legend clause creates a legend of the colors. The legend clause syntax can be any of
the following:
legend off
legend on
legend "string" "string" ...
legend label "string"
The legend off clause turns the legend off. The legend on clause turns the legend on. It
can be omitted if other legend statements are included. Specifying only legend on
generates the default legend.
The default legend includes a single label to the left (with the name of the field that is
mapped to color), and a list of colored labels on the right (with values obtained from the
scale clause, the buckets clause, or from the field). To override the strings in the colored
labels, specify the strings as:
legend "string" "string"
To override the label on the left, specify it following the word label. To eliminate this label,
specify an empty string; that is:
legend label ""
Axis Statement
The axis statement causes a variable to be used as an axis in the 3D landscape. The
variable's values determine where the entities are positioned on the axis. There can be up
to three axis statements. Like the opacity and color statements, the axis statement contains
a series of comma-separated clauses, but all of them must be specified in a single
statement.
axis clause1, clause2,...
The Axis Variable
As with opacity and color, you can specify a field to be used as an axis. The field can be an
array that is indexed by slider dimensions. If the field is an array, it must be of type
number. If the field is not of type number, it is sorted, and each unique value is assigned
a position along the axis.
The Label Clause
The label clause has the form:
label "string"
The string is used to label the axis. It appears in the landscape, at the end of the axis line.
The default label is the name of the axis variable.
The Color Clause
The color clause specifies the color used for the axis line and label. It has the form:
color "colorname"
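For example, an axis statement from the sample view section could be extended with a label, as in this sketch (the label text is arbitrary):
axis `education`, label "Education level", color "grey";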
Summary Statement
The summary statement specifies aggregate information to be calculated for all data
defined by the slider position. The summary is used to color the drawing window in the
animation control panel. Like the opacity and color statements, the summary statement
has several clauses that can be specified in one statement, separated by commas, or in
separate statements.
summary clause1, clause2,...
The Summary Variable
You can specify the variable to be used in the summary. This variable must be of number
type. If no summary variable is specified, sum of counts is used. If a variable is specified,
then the weighted average of that variable (for all the data at the slider location) is used.
The Color Clause
The color clause specifies the color used to display the summary values in the drawing
window. It has the form
color "colorname"
Various shades of the color, from white to the specified color, are used to represent
summary values. The minimum summary value is mapped to white, while the
maximum summary value is mapped to the specified color. The default summary color
is red. If no slider variable is specified, this statement has no effect.
The Legend Clause
The legend clause creates a legend of the summary colors. The legend clause syntax can
be any of the following:
legend off
legend on
legend label "string"
The legend off clause turns the legend off. The legend on clause turns the legend on. It
can be omitted if other legend statements are included. Specifying only legend on
generates the default legend.
The legend includes a single label to the left (which defaults to the aggregation function
and variable used in the summary), and two colored labels on the right (with the
minimum and maximum summary values). To override the label on the left, specify it
following the word label. To eliminate this label, specify an empty string; that is
legend label ""
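As a sketch (the choice of variable, color, and legend text is arbitrary), a summary statement might read:
summary `avg_gross_income`, color "blue", legend label "Average income";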
View Options
The view section of the configuration file has several options for controlling parameters
of the display. These options can appear in a single options statement, separated by
commas, or in separate options statements. The syntax of the options statement is
options option, option,...
The following options are available:
axis label size float
controls the size of the axis labels.
background color "colorname"
controls the initial color of the background.
hide label distance float
controls the distance at which axis labels become invisible. Smaller distances might
improve performance, but the labels disappear more quickly.
grid color "colorname"
controls the color of the grid.
grid size float float float
controls the spacing between grid lines. It applies the three values to grid lines
along the x, y, and z axes, respectively.
shape splatType
specifies the type of splat used. splatType can be constant, linear, gaussian,
texture, or sphere.
Appendix F
F. Creating Data and Configuration Files for the Rules Visualizer
This appendix describes
data and configuration files
command-line operation
example files and commands
for each of the three components of the Rules tool (association data converter, association
rules generator, and rule visualizer).
The Rules tool is completely operable via the Tool Manager (see Chapter 3).
Alternatively, all components of the Rules tool can be invoked via the command-line
interface and/or les created with a text editor, such as jot, vi, or Emacs. This second
mode of operation lets you create conguration les needed for the association data
converter and the association rules generator. It also lets you set up a process (using
standard UNIX facilities) to run the association data converter and the association rules
program nightly on new data using those conguration les.
The examples used in the following sections can be found in the
/usr/lib/MineSet/assoccvt/examples/ and /usr/lib/MineSet/assocgen/examples/ directories.
Descriptions and instructions for use can be found in the README le in these
directories.
Note: Read Chapter 9, Using the Rules Visualizer, before using this appendix.
The Association Data Converter
The association data converter converts a raw data file (such as a user's ASCII data file)
into a file of the format used by the association rule generator program. The association
data converter requires as input a raw data file and a format file. Its output is a specially
formatted data file for use by the association rules generator. Note that the process
described below is for preparing data files for use by the associations program manually
(that is, via the command line). When the associations program is run via the Tool
Manager, this process is done automatically.
In the following description, %s denotes a string-valued input, %f denotes a
floating-point number input, and %d denotes an integer number input.
Association Data Converter File Requirements
The association data converter requires:
a raw data file (this is the user's data for running associations) in one of two accepted
formats.
a format file, which describes the raw data file's format.
The Raw Data File
The raw data file input to the association data converter can be in one of two formats:
Single-item format
Each record has one item and an identifier.
All items with the same identifier are grouped on successive lines in the file
(they need not be sorted, just grouped.)
Each record has the same length.
Multiple-item format
Each record has multiple items and an identifier.
All items associated with the same identifier are in a single record.
Each record has the same length.
The Format File
The format file specifies the format of the raw data file to the association data converter.
The format file follows one of the forms listed in Table F-1 or Table F-2, depending on the
raw data file's format (single or multiple item).
Table F-1 Single-Item Format
List of required items in the format file (Format 1)
Letter S to indicate single-item format file
Number of bytes in each record (excluding record separator, such as LF)
Number of fields that make up the identifier
Total number of bytes in the identifier
Offset and length in bytes for each field that makes up the identifier
Number of fields that make up the item
Total number of bytes in the item
Offset and length in bytes for each field that makes up the item
Description flag indicating if descriptions should be produced along with names
(either a 0 [meaning No] or 1 [meaning Yes])
If the description flag is 1, the following are required:
Number of fields that make up the description
Total number of bytes in the description
Offset and length in bytes for each field that makes up the description
Table F-2 Multiple-Item Format
List of required items in the format file (Format 2)
The letter M to indicate multiple-item format file
Number of bytes in each record (excluding record separator, such as LF)
Number of items in each record
For each item:
Name of the item (column name/domain)
Number of fields of the record that make up the item
Total number of bytes in the item
Number of buckets (discrete bins or categories); 0 for categorical items
Offset and length in bytes for each field that makes up the item
Files Generated by the Association Data Converter
The association data converter generates two files:
The output data file contains the converted data from the raw data file in a format
required by the association rules generator.
The output names file contains auxiliary descriptor information used by the
association rules generator.
The Association Data Converter Command-Line Operation
Table F-3 lists the set of options for controlling the association data converter. A
description of each option follows the table. An example of invoking the program is:
assoccvt -ifile sing.data -ofile sing.bin sing.format sing.names
Options for controlling data conversion from raw to internal format are listed below. A
description of each option follows the table.
-ifile %s
Specifies the name of the raw data file, which serves as input to the association data
converter. This file contains the data to be converted.
-ofile %s
Specifies the name of the file that contains the data converted by the association data
converter.
-isize %d
Specifies the binary integer size in the output data file.
Table F-3 Options for the Association Data Converter
Option Format Required Default Value Comments
-ifile %s no stdin Name of raw data file (input)
-ofile %s no stdout Name of output data file
-isize %d no 4 Size of binary numbers in output file
There are two required arguments on the association data converter command line:
The name of the format file to be used by the association data converter
The name of the file containing the description of the integer codes in the output
data file
Association Data Converter Examples
The following commands illustrate the use of the association data converter on the
example files in /usr/lib/MineSet/assoccvt/examples. The file sing.data is an example of data
in the single-item format and has some simple grocery store transactions. Each line has a
transaction number and the name of an item bought in that transaction. The format of
this file is described by sing.format. The file mult.data is an example about automobiles
and has data about cars of different origin (American, Japanese, European) regarding
attributes such as mpg, weight, etc. The values for these attributes are in discrete ranges
rather than exact numbers. The format of this file is described by the file mult.format.
assoccvt -ifile sing.data -ofile sing.bin sing.format sing.names
assoccvt -ifile mult.data -ofile mult.bin mult.format mult.names
To test whether the files for data conversion are correctly installed, run any or all of the
following commands from the shell command line. Then, using the UNIX diff command,
compare the files created to those with the same name in
/usr/lib/MineSet/assoccvt/examples.
Enter:
assoccvt -ifile sing.data -ofile sing.bin sing.format sing.names
Then compare sing.bin with /usr/lib/MineSet/assoccvt/examples/sing.bin, and compare
sing.names with /usr/lib/MineSet/assoccvt/examples/sing.names.
Enter:
assoccvt -ifile mult.data -ofile mult.bin mult.format mult.names
Then compare mult.bin with /usr/lib/MineSet/assoccvt/examples/mult.bin; and compare
mult.names with /usr/lib/MineSet/assoccvt/examples/mult.names.
Association Rules Generator
The association rules generator generates association rules among items in a set of data.
Its required inputs are described in the following subsections. Its output is a specially
formatted rules file, which can be used by the rule visualization part of the Rules
Visualizer (see Chapter 9).
Association Rules Generator Files Requirements
The association rules generator programs, assocgen and mapassocgen, require:
a data file in the internally required format
a configuration file, which specifies various program parameters
(for mapassocgen only) a mapping file, which specifies the mapping between
hierarchical levels
(for mapassocgen only) a description file, which specifies a string description for each
item at a specific hierarchical level
Association Rules Generator Command-Line Operation
Rules are generated by applying one of two commands, along with one or more
parameters. The command used depends on whether the data for which rules are to be
generated are non-hierarchical or hierarchical (see Starting the Association Rules
Generator Part in Chapter 9). The commands are:
assocgen, which generates rules based on nonhierarchical data.
mapassocgen, which generates rules based on hierarchical data.
Numerous options control the rule-generation process. Many of these are common to
both the assocgen and mapassocgen commands. Options fall into one of the following
categories:
Rule Generation Options control the process of rule generation.
Rule Restriction Options place restrictions on the set of generated rules.
Hierarchical Data Options define parameters used only when generating rules from
hierarchical data (using mapassocgen).
The -ropts string separates the first two sets of options. This string is required if there are
any options from the second or third set.
The -vopts string separates the second and third sets of options. This string is required if
there are any options from the third set.
An example rule generation command line (for which the parameters are explained in
the following sections) might be:
assocgen -prev 20 -tran mult.bin -ropts -names mult.names
-rout mult.rules
Rule Generation Options Common to assocgen and mapassocgen
Table F-4 lists the set of options for controlling the rule-generation process. A description
of each option follows the table.
-tran %s
Specifies the path for the file. By default, the file is read from stdin.
-prev %f
Specifies the minimum prevalence threshold as a percentage of the total number of
records. The default is 1.0%. If the prevalence threshold results in a minimum count less
than 3, an error message is displayed, and no rules are generated.
-uniq %d
Specifies the number of unique or distinct items across all records (if known). Specifying
this (or an upper bound) speeds processing.
Table F-4 Options for Controlling Rule Generation
Option Format Default Value Comments
-tran %s (stdin) Data file path
-prev %f (1.0) Prevalence threshold (as a percentage)
-uniq %d Number of items in dataset
-dir %s (/usr/tmp) Directory for temporary files
-tprefix %s (A_) Prefix for temporary files
-msg %s (assocgen.msg) Message file
-dir %s
Specifies the directory to store temporary files, including the message file (see -msg,
below). The default is ./ .
-tprefix %s
Specifies the prefix to be used for temporary files, except the message file (see -msg,
below). The default prefix is A_.
-msg %s
Specifies the message file. The default is assocgen.msg.
Rule Restriction Options Common to assocgen and mapassocgen
Table F-5 lists the set of options for restricting generated rules. Options in this set are used
after those listed in Table F-4 and separated on the command line from the former
options by -ropts. A description of each option follows the table.
-pred %f
Specifies the minimum predictability threshold for rules. Rules with a predictability
below this value are not generated. The default is 50%.
-rnum
Output only the number of rules generated, not the rules themselves.
Table F-5 Options for Restricting Generated Rules
Option Format Default Value Comments
-pred %f (50.0) Minimum predictability (as a percentage)
-rnum (FALSE) Print only the number of rules generated
-rsort %d [%s]+ (4 RHS PRED PREV LHS) Field sorting order. Fields can be all or any
subset. PRED and PREV are sorted in
descending order.
-names %s Name of file containing item descriptions
-rout %s (stdout) Name of file in which to output rules
-rsort %d [%s]+
Specifies the sort order for rules. The first number denotes the number of sorting fields
specified; the second field of the option specifies the fields to be sorted. The four keys for
sorting rules are (in order):
RHS: items on the right-hand side of the rule
PRED: predictability of the rule
PREV: prevalence of the rule
LHS: items on the left-hand side of the rule
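For illustration (the choice of sort keys is arbitrary), a command line that sorts rules first by predictability and then by prevalence might be written:
assocgen -tran mult.bin -ropts -rsort 2 PRED PREV -names mult.names -rout mult.rules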
-names %s
Specifies the name of the file which contains the descriptions of the items. This is
typically the names file created during the assoccvt step.
-rout %s
Specifies the name of the file to which rules are to be written. If this is not specified, rules
are written to stdout.
Hierarchical Data Options Common to assocgen and mapassocgen
There are no hierarchical data options common to both assocgen and mapassocgen. See
Hierarchical Data Options for mapassocgen Only on page 657.
Rule-Generation Options for mapassocgen Only
In addition to the options provided by the assocgen program, the mapassocgen program
provides two further options (Table F-6). These options are rule-generation options
and thus must be specified in the first set of options (the set before the -ropts string). A
description of each option follows the table.
Table F-6        Options for the mapassocgen Command

Option    Format          Comments
-agg      %d              Hierarchical level for which rules are to be obtained.
-map      %d [%d]+ %s     File for mapping lowest hierarchical level to level for
                          which rules are to be obtained.
-agg %d
Specifies the level of the hierarchy at which the rules are desired. Level 0 is the lowest
level of the hierarchy, level 1 is the next level up in the hierarchy, and so on.
-map %d [%d]+ %s
Specifies a file that allows the mapassocgen program to map the lowest level in the
hierarchy (present in the data) to the level at which the rules are desired. The first number
specifies the total number of levels of aggregation. Next, for each level, a number denotes
the size in bytes for the mapping value for that level. Finally, the path is given for the map
file.
The values at the lowest level are the integers 0, 1, 2, 3, and so on. The map file lists the
values at the new level (or levels) in the same order. First list the value(s) corresponding
to level 1, then the value(s) corresponding to level 2, and so on. The values at the lowest
level (level 0) are omitted because the list of values is in the implicit order 0, 1, 2, 3, and
so forth.
Table F-7 provides an example from the dataset used in the "Hierarchical Data Example"
section. This example has two hierarchical levels:
Table F-7        Example Hierarchy

Level 0    Level 1
Milk       Dairy
Chips      Snack
Coffee     Beverage
Eggs       Dairy
Tea        Beverage
Soda       Snack
Cheese     Dairy
Butter     Dairy
The first entry in the mapping file (/usr/lib/MineSet/assocgen/examples/synth.map) indicates
that the value 0 at the lowest level is mapped to value 1 at the next level; the second
entry indicates that the value 1 at the lowest level is mapped to value 0 at the next
level; and so forth.
Rule Restriction Options for mapassocgen Only
There are no rule restriction options specific to mapassocgen only. See "Rule Restriction
Options Common to assocgen and mapassocgen" on page 654.
Hierarchical Data Options for mapassocgen Only
The option used for hierarchical data is listed in Table F-8. A description follows the table.
-lvldesc %d %s
Specifies the hierarchical level and the corresponding string description file. Each string
in the description file must be on a separate line. The strings are mapped, in order, to items 0,
1, 2, 3, and so forth, at the hierarchy level.
This option follows all other options and must be separated from them by -vopts.
Association Rule Examples
Assume you have the data listed in Table F-9. This example is based on the data in the
file /usr/lib/MineSet/assoccvt/examples/synth.data. In this example, each row represents one
transaction, and the transaction id is implicit.
Table F-8        Options Set 3

Option             Comments
-lvldesc %d %s     Hierarchical level, string description file
After using assocgen on the file produced by running the sample data in Table F-9 through
assoccvt, the rule file that is output has the following format:
1 1 30.0000 75.00 60.00 Item=Chips Item=Milk
The first pair of numbers denotes the number of items on the LHS and RHS of the rule,
respectively. These are always 1s, since only a single item is supported for the LHS and
the RHS. The next three numbers denote (in percentages) the prevalence, predictability,
and expected predictability. Then the LHS item is listed, followed by the RHS item. In the
example above, the LHS item is Item=Chips, and the RHS item is Item=Milk.
The expected predictability is the frequency of occurrence of the RHS items. The
difference between expected predictability and observed predictability is a measure of
the increase in predictive power due to the presence of the LHS of the rule. Expected
predictability gives an indication of what the predictability would be if there were no
relationship between the items.
Table F-9 Data Example 2
Item Item Item
Milk Chips Coffee
Milk Chips Eggs
Milk Chips Coffee
Chips Coffee Eggs
Milk Coffee Cheese
Milk Eggs Cheese
Milk Tea Butter
Eggs Tea Soda
Eggs Tea Soda
Eggs Tea Soda
Nonhierarchical Data Example
Assume the minimum prevalence threshold is 30% (3 records out of 10 in the example
below). With a default minimum predictability threshold of 50%, and given the input file
described above, the assocgen program generates the set of rules shown in Table F-10.
The fields in each line correspond to
•   the number of items on the LHS of the rule (always 1)
•   the number of items on the RHS of the rule (always 1)
•   the prevalence
•   the predictability
•   the expected predictability (explained above)
•   the name (or code) of the item on the LHS
•   the name (or code) of the item on the RHS
Table F-10 Rule Generation Example 1
1 1 30.0000 75.00 60.00 Item=Chips Item=Milk
1 1 30.0000 75.00 60.00 Item=Coffee Item=Milk
1 1 30.0000 75.00 40.00 Item=Coffee Item=Chips
1 1 30.0000 50.00 40.00 Item=Milk Item=Chips
1 1 30.0000 75.00 40.00 Item=Chips Item=Coffee
1 1 30.0000 50.00 40.00 Item=Milk Item=Coffee
1 1 30.0000 100.00 60.00 Item=Soda Item=Eggs
1 1 30.0000 75.00 60.00 Item=Tea Item=Eggs
1 1 30.0000 100.00 40.00 Item=Soda Item=Tea
1 1 30.0000 50.00 40.00 Item=Eggs Item=Tea
1 1 30.0000 75.00 30.00 Item=Tea Item=Soda
1 1 30.0000 50.00 30.00 Item=Eggs Item=Soda
Hierarchical Data Example
Using the example dataset in "Nonhierarchical Data Example" on page 659, Table F-11
shows the mapping of the values at the lowest level to the highest level. The first column
represents data at the lowest hierarchical level, while the values Snack, Dairy, and
Beverage in the second column represent a higher hierarchical level.
In this example, value Milk is mapped to Dairy, value Chips is mapped to
Snack, and so on. Snack, Dairy, and Beverage can be represented as integers 0,
1, and 2 in the mapping file. Then, the -map option can be specified as
-map 2 4 synth.map
where 2 indicates two levels in the hierarchy, 4 indicates a 4-byte integer for each value
in the map file, and synth.map is the name of the map file. The binary file synth.map for
our running example can be found in /usr/lib/MineSet/assocgen/examples and contains the
following values (for purposes of illustration, numbers are provided in decimal rather
than binary form):
1 0 2 1 2 0 1 1
Table F-11       Example Hierarchy

Level 0    Level 1
Milk       Dairy
Chips      Snack
Coffee     Beverage
Eggs       Dairy
Tea        Beverage
Soda       Snack
Cheese     Dairy
Butter     Dairy
To obtain rules at the lowest hierarchical level, specify -agg 0. The program output for
this is shown in Table F-12.
When listing each item, the program shows the complete hierarchical description. For
example, 0|1 indicates that item 1 (Chips) is mapped to value 0 (Snack) at the next
higher level in the hierarchy. If you specify -agg 1, you get the rules at the next higher
level of the hierarchy (which, for this example, is the top level):
1 1 80.0000 80.00 80.00 1 0
1 1 80.0000 100.00 100.00 0 1
To see the strings Snack, Beverage, and Dairy instead of values 0, 1, and 2 at the
top-level hierarchy, specify a third set of options (described in "Example of Applying
Description Files" on page 662).
Table F-12 Example of Rules at the Lowest Hierarchical Level
1 1 30.0000 75.00 60.00 0 | 1 1 | 0
1 1 30.0000 75.00 60.00 2 | 2 1 | 0
1 1 30.0000 75.00 40.00 2 | 2 0 | 1
1 1 30.0000 50.00 40.00 1 | 0 0 | 1
1 1 30.0000 75.00 40.00 0 | 1 2 | 2
1 1 30.0000 50.00 40.00 1 | 0 2 | 2
1 1 30.0000 100.00 60.00 1 | 7 1 | 3
1 1 30.0000 75.00 60.00 0 | 5 1 | 3
1 1 30.0000 100.00 40.00 1 | 7 0 | 5
1 1 30.0000 50.00 40.00 1 | 3 0 | 5
1 1 30.0000 75.00 30.00 0 | 5 1 | 7
1 1 30.0000 50.00 30.00 1 | 3 1 | 7
Example of Applying Description Files
Using the example in "Hierarchical Data Example" on page 660, you can specify a
description file for the top-level hierarchy (level 1) as follows:
mapassocgen -tran <dataFileName> -agg 1 -map 2 4
<hierarchyMappingFileName> -ropts -vopts
-lvldesc 1 <level1descriptionFileName>
-rout <rulesFileName>
The description file level1descriptionFileName now looks like this:
Snack
Dairy
Beverage
The rules at level 1 of the hierarchy then appear like this:
1 1 80.0000 80.00 80.00 Dairy Snack
1 1 80.0000 100.00 100.00 Snack Dairy
Similarly, you can specify a description file for the items at the lowest level. If the
description file, synth0.names, looks like this:
Milk
Chips
Coffee
Eggs
Tea
Soda
Cheese
Butter
then item 0 is mapped to Milk, item 1 is mapped to Chips, and so on. Then if you
specify the options string
-lvldesc 1 synth1.names -lvldesc 0 synth0.names
after -vopts, the rules generated at the lowest hierarchical level are shown in Table F-13.
Rules Visualization
The rules visualization part of the Rules Visualizer graphically displays rules resulting
from the association rules generator.
Rules Visualization File Requirements
The rules visualization requires:
•   a rules file in the internally required format.
•   a configuration file, which specifies various display parameters.
Table F-13 Second Example of Rules Generated at Lowest Hierarchical Level
1 1 30.0000 75.00 60.00 Snack | Chips Dairy | Milk
1 1 30.0000 75.00 60.00 Beverage | Coffee Dairy | Milk
1 1 30.0000 75.00 40.00 Beverage | Coffee Snack | Chips
1 1 30.0000 50.00 40.00 Dairy | Milk Snack | Chips
1 1 30.0000 75.00 40.00 Snack | Chips Beverage | Coffee
1 1 30.0000 50.00 40.00 Dairy | Milk Beverage | Coffee
1 1 30.0000 100.00 60.00 Dairy | Butter Dairy | Eggs
1 1 30.0000 75.00 60.00 Snack | Soda Dairy | Eggs
1 1 30.0000 100.00 40.00 Dairy | Butter Snack | Soda
1 1 30.0000 50.00 40.00 Dairy | Eggs Snack | Soda
1 1 30.0000 75.00 30.00 Snack | Soda Dairy | Butter
1 1 30.0000 50.00 30.00 Dairy | Eggs Dairy | Butter
The Rules File
The rules file is generated by the association rules generator (see "Association Rules
Generator" on page 652).
The Configuration File
The configuration file describes how the data from the rules file is to be displayed. This
file consists of three sections:
•   The input section specifies the rules file to be used.
•   The expressions section (optional) creates new viewing parameters.
•   The view section specifies how the data is presented.
An example configuration file, group.ruleviz, is
input {
file group.rules;
}
expressions {
float ratio = expected / predictability;
}
view {
height predictability, max 10, legend on;
disk height expected, legend on;
color prevalence, scale 0 10, colors "white" "purple",
legend on;
options grid size 6;
message "LHS: %s\nRHS: %s\npredictability: %.2f
expected: %.2f prevalence: %.2f",
LHS, RHS, predictability, expected, prevalence;
}
Input Section
The input section has the form:
input { file rulesFilename; }
The file statement specifies the rules file. It is the only statement in the input section.
Expressions
The expressions section of the configuration file defines field names to be used
subsequently in both this section itself and in the view section. This section specifies new
fields in terms of the fields in the rules file. For example, the following line specifies the
ratio between predictability and expected predictability:
expressions { float ratio = predictability / expected; }
The expressions section uses field names defined in the rules file. These field names and
their types are listed in Table F-14.
Table F-14       Field Names and Types for Rules File

Rules File Field Name    Type      Notes
numLHS                   int       Always 1.
numRHS                   int       Always 1.
prevalence               float
predictability           float
expected                 float
LHS                      string
RHS                      string
Expressions are defined with a combination of field names and operators. The operations
and their symbols are listed in Table F-15.
Also, the following functions are available:
•   divide(x, y, z) divides x by y, unless y is zero. If y is zero, the result is z; this is
    equivalent to y==0 ? z : x/y.
•   modulus(x, y, z) is similar to divide, but for modulus.
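For example, a hypothetical expression that computes the ratio of predictability to
expected predictability while guarding against a zero expected value (the field name
lift is only illustrative) might be:
float lift = divide(predictability, expected, 0);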
Table F-15       Operators Used With Expressions

Operator    Description
+           Addition
-           Subtraction
*           Multiplication
/           Division
%           Modulus
==          Equals
!=          Not equals
>           Greater than
<           Less than
>=          Greater than or equal to
<=          Less than or equal to
&&          AND
||          OR
!           NOT
&           Bitwise AND
|           Bitwise OR
^           Bitwise XOR
A?B:C       If (A), then B else C
The following sample code illustrates some of the possible expressions:
float variable0 = expected / predictability;
float variable1 = prevalence + 1;
float variable2 = variable1 - 1;
float variable3 = variable1 * 3;
float variable5 = 10 % 4;
float variable6 = variable1;
float variable7 = variable1 || 87;
float variable8 = variable1 && 34;
float variable9 = (7 < 5 ? 4 : 3);
float variable10 = divide(15,8,9);
float variable11 = modulus(15,0,9);
float variable12 = (abc < def ? 1 : 2);
int variable13 = (int) variable12;
Expressions using int and float promote both sides to float. Expressions using int and
double, or float and double, promote both sides to double. The result of a relational
expression (for example, ==, <) is always an int. Type casting is also supported.
Strings can be compared using relational expressions; the strings are compared
lexicographically.
View Section
The view section describes how data is presented. A rule is displayed at the junction of
its left-hand-side and right-hand-side items. The view section lets you specify what is
shown at the junctions.
The view section has the form:
view { viewStatement; ... }
You can view bars, disks, and labels at the junctions. The bars and disks have heights and
colors.
Height Statement
The height statement describes how the rules are mapped to the heights of bars and
disks. The height statement consists of a series of clauses, separated by commas.
Alternatively, it can be specified as multiple height statements.
height sales, max 2.0;
or
height sales;
max 2.0;
The first clause normally contains the name of a column that is to be mapped to bar
height (sales, in the example). The column must be of a number type (int, float, or
double); float is the most efficient. If no height column is specified, all bars are flat, and
the remaining height clauses have no effect.
The max clause specifies the height of the tallest bars. If no max clause is specified, the
height is 1.0 in arbitrary units. If, after looking at the view, you see that the heights are
too low or too high, use the max clause to adjust them. The syntax of the max clause is
max float
where float is a floating point number (the decimal point is optional). For example, to
specify the maximum height as 2, enter:
max 2
The scale clause scales the arbitrary height values; all values are multiplied by the scale.
The syntax of the scale clause is
scale float
Do not use the scale clause with the max clause.
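For example, a hypothetical height statement that maps predictability to bar height and
multiplies all heights by 0.5 (the column name and factor are only illustrative) might be:
height predictability, scale 0.5;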
The legend clause specifies whether mapping information is displayed in the lower
window pane. This information is about the mapping between display entities and data
values (for example, bar height corresponds to predictability values). The legend clause
has the following syntaxes:
legend off
This turns off the height legend (this is the default).
legend on
This turns on the height legend. By default, the legend has the following format:
height:varname
where varname is the name of the variable that is mapped to height.
The legend can be changed by using the legend label form:
legend label "string"
If legend label is used, legend on is unnecessary.
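For example, a hypothetical height statement with a custom legend (the column name
and label text are only illustrative) might be:
height predictability, max 5, legend label "predictability (%)";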
By default, the height statement affects bars. To specify disks, begin the statement with
disk height:
disk height sales, max 2.0;
If no max or scale clause is specified for disk heights, the disks inherit the clause specified
for bars.
Color Statement
The color statement describes how values are mapped to colors. The format is similar to
the height statement, consisting of several clauses that can be separated by commas, or
entered as multiple statements.
Color names must be in quotation marks. Examples of valid colors are "green", "Hot
Pink", and "#77ff42". The last one is in the form #rrggbb, in which the red, green, and
blue components of the color are specified as hexadecimal values. Pure saturation is
represented by ff, a lack of color by 00. For example, #000000 is black, #ffffff is white,
#ff0000 is red, and #00ffff is cyan. You can use the colorview program to determine
the names of the colors available on your workstation.
The color variable lets you map a column to a color. The column must be a number type.
There is no normalization of colors.
The colors clause specifies the colors to be used. The colors clause syntax is:
colors "colorname" "colorname"...
The format for colorname has been described above. Note that there are no commas
between the colors. This is because commas are used to separate clauses in the color
statement. A sample colors clause is
colors "red" "gray" "blue"
Colors in the list are subsequently referred to by their index, starting at zero. In the above
example, red is color 0, gray is color 1, and blue is color 2.
If there is no colors statement, all bars have the same color.
The scale clause allows assignment of values to a continuous range of colors. For
example, when displaying a percentage, red can be assigned to 0%, gray to 50%, and blue
to 100%. Intermediate values are interpolated; for example, 25% is pinkish, and 55% is a
slightly bluish gray.
The syntax for the scale clause is
scale float float ...
The first value is mapped to color 0, the second to color 1, and so forth. The colors
statement must contain at least as many colors as are to be mapped to the largest index.
Values in the scale clause must be in increasing order. Any value less than the first value
is assigned the first color. Any value greater than the last value is assigned
the last color. Intermediate values are interpolated.
For example, assume the pctFemale column indicates what percentage of the group is
female, and you want to map a group that is 100% female to red, 100% male to blue, and
50% each to gray. The color statement for this is:
color pctFemale, colors "blue" "gray" "red", scale 0 50 100;
The buckets clause is similar to the scale clause without interpolation. All values are
rounded down to the highest value in the clause that does not exceed them, and that
exact color is used. Values less than the first value use the first color.
The syntax for the buckets clause is
buckets float float ...
The syntax and assignment of colors is the same as for the scale clause.
If, in the above example, you use the buckets clause instead of the scale clause, the
statement is:
color pctFemale, colors "blue" "gray" "red", buckets 0 50 100;
All values greater than or equal to 100 are colored red. Values greater than or equal to 50,
but less than 100, are gray. All remaining values are blue.
If a color variable is specified, but neither a scale clause nor a buckets clause is given, a
default scale clause is used. The values are generated automatically, ranging from the
minimum value to the maximum value in the data.
The Legend Clause
The legend clause creates a legend of the colors. The legend clause syntax can be any of
the following:
legend off
legend on
legend "string" "string" ...
legend label "string"
The legend off clause turns the legend off. The legend on clause turns the legend on. It
can be omitted if other legend statements are included. Specifying only legend on
generates the default legend.
The default legend includes a single label to the left (with the name of the field that is
mapped to color), and a list of colored labels on the right (with values obtained from the
scale clause or the buckets clause). To override the strings in the colored labels, specify
the strings as shown:
legend "string" "string"
To override the label on the left, specify it following the word label. To eliminate this label,
specify an empty string; that is:
legend label ""
By default, the color statement affects bars. To affect disks, begin the statement with disk
color:
disk color pctFemale;
If no colors, scale, or buckets clause is given for disk colors, the disks inherit the clauses
given for bars.
Label Statement
You can specify a variable name and a color for labels to appear in front of the bars, at the
base. By default, no labels appear. The color is a single color (unlike the bars and disks).
For example, the following line displays the numeric predictability value in red at the
base of each bar:
label predictability, color "red";
Message Statement
The message statement specifies the message displayed when the pointer is moved over
an object or when an object is selected. The syntax is similar to that of the C printf
statement. A sample message statement is
message "LHS: %s\nRHS: %s\npredictability/expected: %.2f",
LHS, RHS, predictability/expected;
This could produce the following message:
LHS: milk
RHS: bread
predictability/expected: 2.00
The formats must match the type of data being used:
•   Strings must use %s.
•   Ints must use integer formats (such as %d).
•   Floats and doubles must use floating point formats (such as %f).
For details of the printf format, see the printf (1) reference (man) page (type man printf
at the shell prompt).
A special format type has been added to printf. If the percent sign is followed by a
comma (for example, %,f), commas are inserted in the number for clarity. Currently, only
the United States convention of d,ddd,ddd.dddd is supported, with the decimal point
represented by a period, and commas separating every three places to the left of the
decimal point. For example, if the above format were:
message "LHS: %s\nRHS: %s\npredictability/expected: %,.2f",
LHS, RHS, predictability/expected;
it would produce the message:
LHS: milk
RHS: bread
predictability/expected: 1,000.00
The $, *, h, l, ll, L, and n printf format options are not supported.
All values, including the format string, are expressions. Thus, if you want to distinguish rules
with predictability greater than twice the expected, you can use
message predictability > expected*2 ? "LHS: %s\nRHS:
%s\npredictability/expected: HIGH" : "LHS: %s\nRHS:
%s\npredictability/expected: LOW" , LHS, RHS;
This could produce the message
LHS: milk
RHS: bread
predictability/expected: LOW
or:
LHS: milk
RHS: cake
predictability/expected: HIGH
You could also achieve the same result with a single format string:
message "LHS: %s\nRHS: %s\npredictability/expected: %s",
LHS, RHS, predictability > expected*2 ? "HIGH" : "LOW";
If no message is specified, a default message is used.
Item Statement
An item statement describes the item names displayed along the LHS and RHS axes. An
item statement has the form
item colors <colorLeft> <colorRight>;
The colors are absolute colors, like the label color. There is also a statement to turn on/off
the item names:
item <off|on>;
Display axes may be customized using the syntax:
item labels <leftLabel> <rightLabel>;
For example,
item labels LHS RHS;
produces the same result as the default found in the examples file
/usr/lib/MineSet/ruleviz/examples/category.ruleviz.
Grid Statement
The grid statement describes the grid on which the rules are displayed. A grid statement
has the form
grid color <color>;
or
grid <off|on>;
The color is a single color.
Options
The options statement lets you fine-tune certain parameters of the display. When the
view section is first entered, options are loaded from the defaults file
/usr/lib/MineSet/ruleviz/view.ruleviz.options. This file is searched for in the following
directories, in the order listed:
•   /usr/lib/MineSet/ruleviz
•   ~/.MineSet (where ~ is your home directory)
•   the current directory
The file is not required to be present.
Options specified directly in the configuration file override those in the defaults files. The
syntax of the options statement is
options option, option, ...
The following options are available:
•   bar label size
•   hide bar label distance
•   hide disk distance
•   grid size
•   item size
•   hide item distance
•   font
The following is a description of each option.
options bar label size float
Specifies the size of the labels in front of the bars. Larger values result in larger
labels.
options hide bar label distance float
Specifies the distance at which the bar labels are not drawn. Smaller distances
improve performance, but the labels might not be visible.
options hide disk distance float
Specifies the distance at which disks are not drawn. Smaller distances improve
performance, but the disks might not be visible.
options grid size float
Specifies the width and depth of grid cells.
options item size float
Specifies the size of the items.
options hide item distance float
Specifies the distance at which the items are not drawn. Smaller distances improve
performance, but the items might not be visible.
options font fontName
Specifies the font used for items and bar labels.
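For example, a hypothetical options statement that adjusts several of these parameters
at once (the values shown are only illustrative) might be:
options grid size 8, item size 2, hide disk distance 100;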
Appendix G
G. Format of the Evidence Visualizer's Data File
The purpose of this appendix is to describe the Evidence Visualizer's input data file. This
file is a textual representation of the evidence classifier. The data file is generated
automatically through the Tool Manager. In some instances one may wish to edit this file
in order to alter label, attribute, or value names.
The Evidence Visualizer requires a data file containing the label and attributes, along
with weights and probabilities. These are used to create the graphics. It is output as a
result of running the Evidence Inducer through the Tool Manager. The format of the data
file is:
#MineSet 2.5
<type> "<label>" <L>
"<label1>" <weight1> <probability1>
"<label2>" <weight2> <probability2>
:
"<labelL>" <weightL> <probabilityL>
<M>
<type> "<attrib1>" <N1> <importance1>
"<value1_1>" <weight1_1_1> <prob1_1_1> ... <weight1_1_L> <prob1_1_L>
"<value1_2>" <weight1_2_1> <prob1_2_1> ... <weight1_2_L> <prob1_2_L>
:
"<value1_N1>" <weight1_N1_1> <prob1_N1_1> ... <weight1_N1_L> <prob1_N1_L>
<type> "<attrib2>" <N2> <importance2>
"<value2_1>" <weight2_1_1> <prob2_1_1> ... <weight2_1_L> <prob2_1_L>
"<value2_2>" <weight2_2_1> <prob2_2_1> ... <weight2_2_L> <prob2_2_L>
:
"<value2_N2>" <weight2_N2_1> <prob2_N2_1> ... <weight2_N2_L> <prob2_N2_L>
:
:
:
"<attribM>" <NM> <importanceM>
"<valueM_1>" <weightM_1_1> <probM_1> ... <weightM_1_L> <probM_1_L>
"<valueM_1>" <weightM_2_1> <probM_2_2> ... <weightM_2_L> <probM_2_L>
:
"<valueM_NM>" <weightN1_NM_1> <probM_NM_1> ... <weightM_NM_L> <probM_NM_L>
history {
:
:
}
Where L is the number of label values, M is the number of attributes, and Ni is the number
of values or bins for attribute i. The <>s indicate variables. The actual file has numbers
or strings. A NULL is considered a unique value if it is present in an attribute. If NULLs
exist for an attribute, they always appear as the first value (i.e., the first line following the
attribute header) and are represented by "?".
The <type> can be
•   NOMINAL, which currently implies a string valued attribute (or an integer
    attribute which is used for the label).
•   ENUM, which is used for attributes binned in Tool Manager.
•   AUTO_ENUM, which is used for attributes that have been discretized automatically
    by the inducer. If a type is not present, AUTO_ENUM is assumed.
Lines beginning with # are comments (and ignored by the program).
An optional history section can be included at the end of the file. It is used by Tool
Manager for drill-through. Without this section, drill-through is not possible (in the
Evidence Visualizer or any other MineSet tool).
The weights are the number of records (or sum of weights) in the table with that particular
attribute value (or range of values); hence, the sum of the weights for each attribute
equals the total number of records in the table (unless record weighting was used). The
probability is the number of weights for that attribute value divided by the total number
of weights. If the data file was generated with Laplace correction turned on, the probability
is only approximately the number of weights for that attribute value divided by the total
number of weights (see "Refining the Inducer With Further Options" in Chapter 13).
Thus, the probability value indicates the proportion of records with a particular label that
have this attribute value instead of another value.
Data files must have a .eviviz extension. When starting the Evidence Visualizer, or when
opening a file, you must specify the data file.
A sample Evidence Visualizer data file, /usr/lib/MineSet/eviviz/examples/cars.eviviz,
follows.
#MineSet 2.5
#automatically generated
NOMINAL "origin" 3
"Europe" 73 0.179803
"Japan" 79 0.194581
"US" 254 0.625616
6
AUTO_ENUM "mpg" 5 25.448
"?" 3 0.0410959 0 0 5 0.019685
"- 16.1" 0 0 0 0 87 0.34252
"16.1-21.05" 10 0.136986 5 0.0632911 77 0.30315
"21.05-30.95" 43 0.589041 28 0.35443 67 0.26378
"30.95+" 17 0.232877 46 0.582278 18 0.0708661
NOMINAL "cylinders" 5 29.1759
"8" 0 0 0 0 108 0.425197
"4" 66 0.90411 69 0.873418 72 0.283465
"6" 4 0.0547945 6 0.0759494 74 0.291339
"3" 0 0 4 0.0506329 0 0
"5" 3 0.0410959 0 0 0 0
AUTO_ENUM "horsepower" 4 22.3514
"?" 2 0.0273973 0 0 4 0.015748
"- 78.5" 40 0.547945 46 0.582278 25 0.0984252
"78.5-134" 31 0.424658 33 0.417722 131 0.515748
"134+" 0 0 0 0 94 0.370079
AUTO_ENUM "weightlbs" 4 28.5157
"- 2379.5" 43 0.589041 57 0.721519 30 0.11811
"2379.5-2959.5" 18 0.246575 22 0.278481 57 0.224409
"2959.5-3274" 9 0.123288 0 0 29 0.114173
"3274+" 3 0.0410959 0 0 138 0.543307
AUTO_ENUM "time_to_60" 3 10.0055
"- 13.45" 3 0.0410959 3 0.0379747 78 0.307087
"13.45-19.45" 52 0.712329 75 0.949367 162 0.637795
"19.45+" 18 0.246575 1 0.0126582 14 0.0551181
AUTO_ENUM "year" 1 2.84217e-14
"ignore" 73 1 79 1 254 1
Note that the sum of the probabilities corresponding to a particular label value for a
given attribute always equals 1. Consider the attribute weightlbs: for label value US (the
third label value), we have .11811 + .224409 + .114173 + .543307 = 1.0. Also note that
attributes mpg and horsepower have NULL values.
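A similar check applies to the label weights listed at the top of the file: each label
probability is the label weight divided by the total weight. For Europe, for example,
73 / (73 + 79 + 254) = 73 / 406 ≈ 0.1798, which matches the 0.179803 shown.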
The eviviz data file (*.eviviz) also accommodates loss matrices and an optionally included
Laplace factor. The next example .eviviz file contains both of these.
MineSet 2.5
# MLC++ generated file for MineSet Evidence Visualizer.
NOMINAL edibility 2
edible 5 0.621212121212
poisonous 3 0.378787878788
TOTAL 8
LAPLACE yes 0
LOSS
10 0 20
10 10000 0
3
NOMINAL cap-shape 6 47.6129309656
bell 1 0.195652173913 0 0.0333333333333
conical 0 0.0217391304348 0 0.0333333333333
convex 2 0.369565217391 2 0.566666666667
flat 2 0.369565217391 0 0.0333333333333
knobbed 0 0.0217391304348 1 0.3
sunken 0 0.0217391304348 0 0.0333333333333
NOMINAL cap-surface 4 27.8397591211
fibrous 2 0.386363636364 0 0.0357142857143
grooves 0 0.0227272727273 0 0.0357142857143
scaly 1 0.204545454545 2 0.607142857143
smooth 2 0.386363636364 1 0.321428571429
NOMINAL cap-color 10 47.6129309656
brown 1 0.18 1 0.264705882353
buff 0 0.02 0 0.0294117647059
cinnamon 0 0.02 0 0.0294117647059
gray 0 0.02 1 0.264705882353
green 0 0.02 0 0.0294117647059
pink 1 0.18 0 0.0294117647059
purple 0 0.02 0 0.0294117647059
red 1 0.18 0 0.0294117647059
white 1 0.18 1 0.264705882353
yellow 1 0.18 0 0.0294117647059
The loss matrix follows the form shown in the GUI (see the description in "Loss
Matrices" in Chapter 10).
Appendix H
H. Creating Data and Configuration Files for the Decision Table Visualizer
The purpose of this appendix is to describe the Decision Table Visualizer's data and
configuration files. The *.dtableviz file contains a schema and an optional history section,
and the *.data file contains the data. The format of these two files is almost exactly that
described in Appendix A.
The Decision Table Visualizer's *.dtableviz file must contain a schema of the form: some
number of columns of type string or enum; followed by a float column containing the
weight of records; followed by a vector column, indexed by an enum containing possible
classes, and containing the proportion of the weight of records in each class. The
*.dtableviz and *.data files are automatically generated by the Tool Manager, and you
should not normally need to modify them.
There are two extra keywords that the schema in the *.dtableviz file uses that are not
present in other schemas. They are "auto" and "nominal". The Decision Table Visualizer
must be able to differentiate between different types of enums. For example, enums that
have been automatically rendered discrete by mining are handled differently in
drill-through than enums that have been binned by the user in Tool Manager. Hence the
keyword "auto" is used to distinguish columns that have been binned by mining.
The label may be a binned column, or it may be a string column. Either way it appears as
an enum in the schema, so that the final probs[] column can be indexed by it. If the label
is string valued, the "enum" keyword is preceded by "nominal" to distinguish it.
Sample File
An example configuration file created from the adult dataset follows:
MineSet 2.5
input {
file "adult-tmbin-dtab.dtableviz.data";
enum string `hours_per_week_bin_k` {"- 20", "20-28", "28-36",
"36-44", "44-52", "52-60", "60+"};
enum `hours_per_week_bin_k` `hours_per_week_bin`;
auto enum string `gross_income_k` {"- 9598", "9598-14579",
"14579-24794.5", "24794.5-24806", "24806-29606", "29606-42049.5",
"42049.5-46306.5", "46306.5-64885", "64885+"};
enum `gross_income_k` `gross_income`;
auto enum string `final_weight_k` {"- 223033", "223033+"};
enum `final_weight_k` `final_weight`;
float `weight`;
nominal enum string `label` {"Female ", "Male "}; # the
label values
float `probs[]`[ enum `label` ]; # and probabilities
}
history {
:
:
}
In this example, the name of the data file is adult-tmbin-dtab.dtableviz.data. The first three
columns are attributes from the data. The first one, hours_per_week_bin, has been binned
by the user using Tool Manager. The next two, gross_income and final_weight, were
binned automatically by the mining. Note the use of the auto keyword in their enum
definition. The fourth column gives the weight of records having the values given by the
first three columns.
The last column, probs[], is a vector valued column which gives the probability of each
class, computed by dividing the number of records for each class by the value in the
weight column. The entries of this vector must sum to 1. Note that the
definition of the index label includes the keyword nominal to show that the class values
are not numeric, and hence do not have an implied ordering. The nominal keyword
would not be present had the label been a binned numeric attribute.
Appendix I
I. Command-Line Interface to MIndUtil: Analytical Data Mining Algorithms
The first part of this appendix describes the MIndUtil program and its options. The
second part lists and describes the general options for MIndUtil. The final part describes
the specific modes available. The MIndUtil program comes with the server side of the
MineSet images and is invoked automatically by DataMover on the server when
working through Tool Manager. MIndUtil provides extra functionality not directly
available from Tool Manager and may be easier to use in a batch environment (for
example, when you use cron jobs).
MIndUtil Invocation and Options
MIndUtil provides the MineSet inducers and additional mining utilities, such as
discretization (binning). It also provides features for file conversions. In the following
description, all examples assume the UNIX shell is csh or tcsh. Users of sh and ksh can
transform setenv ENV val into ENV=val; export ENV.
MIndUtil is shorthand for one of two programs, MIndUtil_p and MIndUtil_s.
MIndUtil_p is the multi-threaded version of the program, while MIndUtil_s is the
single-threaded version. The number of threads spawned by MIndUtil_p is controlled by
the NUM_THREADS option (see below for how to set options). In the invocation
examples below, replace MIndUtil with whichever version of the program you
intend to use. The Tool Manager parallelization preference determines which version of
the program DataMover invokes.
The syntax for invoking MIndUtil is:
MIndUtil [-s] [-o <optionfile>] [-O <option>=<value>]
where the -s option suppresses environment options (described below). The -o option
allows reading options from an ASCII option specification file containing one
<option>=<value> per line. By convention, MineSet uses the suffix .classify-opt for such
option files. The -O option (uppercase) allows setting a specific option by following it
with the option name, an equal sign, and the value. The -o and -O can be repeated
multiple times. If an option is set more than once, the last time it is set determines its
value. For example, if it is set through an option file and then set again using the
-O flag, the latter one determines its value.
Each option has a unique name; all option names are written in uppercase letters. If you
want to set up the .datamove file to keep data and classifier option files on the server, the
following lines must be in the .datamove file:
keep_data_files=yes
keep_classifier_options_files=yes
This ensures that the option specification files ending with the .classify-opt extension
(consisting of the options passed to MIndUtil via the Tool Manager) are not erased from
the server after you invoke inducers through Tool Manager.
Example With MIndUtil Options
A typical file (iris-dt.classify-opt) might contain the following lines:
MODE=classify-and-error
LABEL=iris_type
ALGORITHM=decision-tree
DT_SPLIT_BY=normalized-mutual-info
DT_LBOUND_MIN_SPLIT=2
DT_PRUNING_FACTOR=0.7
DT_MAX_LEVEL=0
CLASSIFIER_NAME=iris-dt.class
VIZ_NAME=iris-dt.treeviz
HOLDOUT_PERCENT=0.666667
RANDOM_SEED=7258789
BACKFIT_TEST_SET=Yes
DISP_CONFUSION_MAT=No
DISP_LIFT_CURVE=No
Given a schema file (iris.schema, which references iris.data), you can run MIndUtil from
the command line to induce a decision tree by using the options file as follows:
MIndUtil -o iris-dt.classify-opt -O FLAT_FILE=iris-dt.schema
This is exactly the way MIndUtil is invoked by DataMover.
Options in MIndUtil can be set through a hierarchy of levels. An option set at a higher
level (see below) overrides any setting from a lower level. The levels are:
•   Hard-coded default: Many options have a hard-coded default value. If the value is
not overridden in any of the higher levels, the hard-coded default is used.
•   Environment option: An environment variable can contain the option's value. You
can set the environment variable with the same name as the option itself. For
example,
setenv FLAT_FILE iris-dt.schema
sets the FLAT_FILE option to iris-dt.schema.
An environment variable takes precedence over hard-coded defaults. The
command-line option -s suppresses environment variables.
•   Command-line options: You can set specific options with -O <option>=<value>.
For example, to generate a decision tree from iris-dt.classify-opt, set the pruning factor
to 0, and set the minimum records in a split to 1, use:
MIndUtil -o iris-dt.classify-opt \
-O FLAT_FILE=iris-dt.schema \
-O DT_PRUNING_FACTOR=0 \
-O DT_LBOUND_MIN_SPLIT=1
This induces a larger tree for the iris dataset. Command-line options take
precedence over environment variables and hard-coded defaults.
The order of command line arguments is important: Options to the right override
earlier options to their left. Thus, the -O options override the values set in the
iris-dt.classify-opt file.
•   User input: If an option is required but the option was not set using any of the
above levels in the hierarchy, you are prompted for the option, and you can type a
value.
If you type ?, a help string appears to explain the meaning of the option. User input
has the highest precedence and given values override command line options,
environment variables, and hard-coded defaults.
A special environment variable, called PROMPTLEVEL, determines when to prompt the
user for an option. The variable has three possible values:
•   Required-only prompts you for required options only. There are no prompts for
options with a hard-coded default value. This is the lowest level prompting mode
and the default.
•   Basic prompts you for basic options (each option is hard-coded as basic or not),
whether or not they have a default value. Some options are defined as nuisance
(non-basic) options and are not prompted for by this mode. The purpose of this
mode is to prompt for the most commonly used options.
If the option has a default, you can change it to be a nuisance option by setting the
option value to an exclamation mark (!). A nuisance option can be changed to a
non-nuisance option by setting its value to be a question mark (?). For example,
you can type:
setenv PROMPTLEVEL basic
MIndUtil -o iris-dt.classify-opt -O DT_MAX_LEVEL="?"
You are now prompted for most options (non-nuisance) with defaults taken from
iris-dt.classify-opt. To accept the default, press Enter. Since FLAT_FILE is not defined in
iris-dt.classify-opt and is a required option, you are prompted for FLAT_FILE without a
default. DT_MAX_LEVEL is a nuisance option, but setting it to a question mark
causes you to be prompted for it.
•   All prompts you for all options, regardless of their nuisance setting.
When MIndUtil is executed from Tool Manager, all options except the schema filename
are passed from the client through an options file <file>.classify-opt. On the server, the
DataMover invokes MIndUtil with the options file and the appropriate schema file.
Tool Manager prepends any options it finds in .mineset-classopt on the client workstation.
The file is searched for first in the current directory, then in the home directory. The first
one found is used.
When an option requires one of a given set of values (for instance, an enumerated
option), a prefix of the desired option value can be used, and comparison is
case-insensitive. If there are multiple values with the given prefix, the first one in the list
is chosen. For example, the first option in MIndUtil is MODE, which takes on one of the
following values: classify-only, classify-and-error, estimate-error, learning-curve, discretize,
auto-select, compute-importance, test-classifier, fit-data, mineset-to-mlc, mlc-to-mineset, score,
upload-mlc, upload-mineset, visualize. Setting the option to c selects classify-only. In
scripts, use the full option name for future compatibility.
To facilitate repeat runs of a program under the same options, these can be saved in a file.
The name of the file can be set through the environment variable OPTIONS_FILE. For
example, if you run MIndUtil as
MIndUtil -o iris-dt.classify-opt \
-O FLAT_FILE=iris-dt.schema \
-O DT_MAX_LEVEL=1 \
-O OPTIONS_FILE=foo.opt
the options will be saved in the file foo.opt, which you can edit and rerun by typing:
MIndUtil -o foo.opt
General Options
MIndUtil is written using MLC++, the machine learning library in C++ (see
http://www.sgi.com/Technology/mlc). Additional options are available for those familiar
with MLC++. Here are the important ones shared by many modes.
All filename specifications require the file suffix, except where detailed below.
MODE is an enumerated option containing one of the following: classify-only,
classify-and-error, estimate-error, learning-curve, discretize, auto-select,
compute-importance, test-classifier, assoc, cluster, fit-data, mineset-to-mlc, mlc-to-mineset,
score, upload-mlc, upload-mineset, visualize.
Classify-only builds a classifier using all the data.
Classify-and-error splits the data into a training set and a test set; a classifier is
built from the training set and evaluated on the test set. If the classifier is a
decision tree or option tree, the error estimates are added at the tree nodes for
the visualization.
Estimate-error performs cross-validation to estimate the error of a classifier
built using the induce option.
Learning-curve generates a learning curve.
Discretize allows discretizing continuous attributes.
Auto-select allows finding a set of important attributes together with
prespecified attributes.
Compute-importance computes the importance of each attribute as if it were
used individually with a prespecified set of attributes.
Test-classifier evaluates a previously induced classifier on new data. If a
decision tree or option tree is used, the error estimates are added at the tree
nodes for the visualization.
Assoc runs an association rules generator over the data and produces a file for
the Rules Visualizer.
Cluster invokes the various modes of clustering that are available. It produces a
file for the Cluster Visualizer.
Fit-data allows fitting new data to the structure of an existing classifier
(backfitting).
MineSet-to-MLC and MLC-to-MineSet allow converting files between MineSet
format and MLC++ format.
Score generates a label value for each record, performing the same function as
apply-classifier in the data-transformation panel of Tool Manager. This option is
not supported and not documented further.
Upload-MLC and upload-MineSet allow uploading a file in either MLC++ or
MineSet format to a database. This option is not supported and not documented
further. There may be loss of type information when files are uploaded, since
MLC++ supports a smaller set of types than MineSet.
Visualize allows converting a classifier to a visualization.
The details of each mode are described in the next section.
FLAT_FILE is a string defining the MineSet schema file to use. The file specification
must be complete (that is, with the .schema extension) and the file can be either a
MineSet ASCII or binary file. See "Using MineSet With Existing Data Files" in
Chapter 2 for a description of schema files.
LOSS_FILE is the name of the loss matrix file. The loss file is optional. If it is
supplied, the format should be as follows:
nodefault.
?,Iris-setosa: 10
Iris-setosa,Iris-setosa: 0
Iris-versicolor,Iris-setosa: 1
Iris-virginica,Iris-setosa: 1
?,Iris-versicolor: 10
Iris-setosa,Iris-versicolor: 1
Iris-versicolor,Iris-versicolor: 0
Iris-virginica,Iris-versicolor: 1
?,Iris-virginica: 1
Iris-setosa,Iris-virginica: 1
Iris-versicolor,Iris-virginica: 1
Iris-virginica,Iris-virginica: 0
endloss
where each line other than the first and the last contains the predicted and the actual
label values followed by a colon and the loss. The first line may be default, in which
case all unspecified entries are zero on the diagonal and one off the diagonal. If
nodefault is used, all matrix entries must be specified.
LABEL is the name of the column or attribute that is to be used as the label
whenever it is needed. The label name must be one of the columns in the schema
file.
WEIGHT is the name of the attribute that should determine the weight of each
instance. It must be an integer or a floating point attribute and should be part of the
schema.
WEIGHT_IS_ATTRIBUTE is a Boolean option that determines whether the weight
attribute can be used by the classifier as a regular attribute. In certain cases where
the weight is a result of a stratified sample that is part of the experimental design,
the classifier should not be given access to the weight column, as it is not a property
of the real-world entity.
DISC_TYPE is an enumerated option taking on the value uniform-weight,
uniform-range, or entropy. It determines the discretization mode: uniform-weight
invokes uniform binning by record weights (ranges are not uniform); uniform-range
invokes uniform binning by range, while entropy invokes automatic, nonuniform
binning based on minimizing entropy (see "The Bin Columns Button" in
Chapter 3). The default is entropy.
DISC_MIN_SPLIT is a floating point value specifying the minimum weight of
instances that must be in each bucket when discretization is being done. A value of
zero automatically determines a value for DISC_MIN_SPLIT that grows slowly as
the weight of the dataset grows.
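For example, a hypothetical command line that runs the discretize mode using only
options described in this appendix (file and label names are taken from the earlier iris
example, and any remaining required options are prompted for, as described earlier)
might be:
MIndUtil -O MODE=discretize -O FLAT_FILE=iris-dt.schema \
-O LABEL=iris_type -O DISC_TYPE=uniform-range -O DISC_MIN_SPLIT=0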
LOGLEVEL is an integer >= 0 defining the amount of logging information to print
during the run. The default is zero. This option is hidden; you are not prompted for
it.
DRIBBLE is a Boolean option defining whether to dribble output during
processing in order to show progress. The default is TRUE. This option is hidden;
you are not prompted for it.
LINE_WIDTH is an integer > 1 defining the line width for the output. Automatic
wrapping occurs at word boundaries before this width. Wrapped lines begin with the
WRAP_PREFIX string. The default line width is 79. This option is hidden; you are
not prompted for it.
Induction Modes
This section describes the options for induction modes: classify-only, classify-and-error,
estimate-error, and learning-curve. The classify-only mode induces a classifier using the
whole dataset. The classify-and-error mode induces a classifier on a portion of the
dataset and tests it on the remainder of the dataset. The estimate-error mode performs
cross-validation using the full dataset for a more robust error estimate. The
learning-curve mode induces many classifiers using differently sized samples of the
dataset to estimate the effect of sampling on the error rate.
All modes require specifying an ALGORITHM option, which determines the type of
classifier or regressor to build. Note that in MIndUtil (unlike in the Tool Manager)
regression models are treated as a type of classification model. The available values for
the ALGORITHM option are decision-tree, option-tree, evidence, decision-table, and
regression-tree. The following options are common to all algorithms:
VIZ_NAME is a string defining the visualization name whenever appropriate. For
the decision tree, option tree, and regression tree inducers, this option specifies the
full name of the configuration file to be generated for the Tree Visualizer
(recommended suffix is .treeviz). For the decision table inducer, this option specifies
the full name of the configuration file to be generated for the Decision Table
Visualizer (recommended suffix is .dtableviz). For the evidence inducer, this option
specifies the full name of the Visualizer file (recommended suffix is .eviviz). For all
inducers except the evidence inducer, a data file with an extra .data suffix will be
generated automatically.
CLASSIFIER_NAME is a string defining the name of the model which will be
generated. This option is not applicable in the learning-curve mode.
BACKFIT_TEST_SET is a Boolean option determining whether to backfit the test set
data into the classifier's structure (see "Boosting: Accuracy is Sometimes Crucial" in
Chapter 10).
DISP_CONFUSION_MAT is a Boolean defining whether to display a confusion
matrix (see "Confusion Matrices in Error Estimation" in Chapter 10). If this is set to
yes, the option CONFUSION_MAT_SCATTERVIZ_NAME is needed, which
determines the fully specified scatterviz file name (recommended suffix .scatterviz).
This option is only applicable in the classify-and-error mode. It is not applicable to
the regression-tree inducer.
DISP_LIFT_CURVE is a Boolean defining whether to display a lift curve (see "Lift
Curves in Error Estimation" in Chapter 10). If this is set to yes, the following options
are needed: LIFT_CURVE_LABEL_VALUE, identifying the label value for which to
generate the lift curve, and LIFT_CURVE_SCATTERVIZ_NAME, which determines
the fully specified scatterviz file name (recommended suffix .scatterviz). This option
is only applicable in the classify-and-error mode. It is not applicable to the
regression-tree inducer.
DISP_ROI_CURVE is a Boolean defining whether to display a return-on-investment
(ROI) curve (see "ROI Curves" in Chapter 11). If this is set to yes, the option
ROI_CURVE_SCATTERVIZ_NAME is needed, which determines the fully specified
scatterviz file name (recommended suffix .scatterviz). This option is only applicable
in the classify-and-error mode. It is not applicable to the regression-tree inducer.
BOOST_INDUCER is a Boolean defining whether to apply a boosting algorithm in
conjunction with the inducer to improve accuracy (see "Boosting: Accuracy is
Sometimes Crucial" in Chapter 10). There are a number of restrictions imposed if
you turn on this option:
•   No visualization is produced (the VIZ_NAME option is ignored).
•   You cannot use the regression-tree inducer.
•   You MUST set BACKFIT_TEST_SET to FALSE (or you will get an error).
•   You MUST set EVI_AUTO_FEATURE_SELECTION to FALSE if running the
evidence inducer.
HOLDOUT_PERCENT is a floating point number between 0 and 1. It determines
what ratio of the records to use as a training set. The rest are used as a test set. The
default is two-thirds.
RANDOM_SEED is an integer serving as the seed for the random-number
generator used to split the records into training and test sets.
Decision Tree Inducer Options
The following options are available for decision trees (ALGORITHM = decision-tree):
DT_MAX_LEVEL, an integer >= 0, limits the number of levels to grow the decision
tree. The default of zero implies no limit.
DT_PRUNING_FACTOR, a floating point number >= 0, determines the pruning
factor. Zero implies no pruning. The default is 0.7.
DT_LBOUND_MIN_SPLIT, a float that provides a lower bound on the weight of
records required to trickle down to at least two branches in a given node. No split
will be made otherwise. The default is 2.
DT_MIN_SPLIT_WEIGHT, a floating point number >= 0, is the minimum ratio of
training records divided by the number of classes that are required to trickle down
to at least two branches in a given node. The default is 0.1. This option is not
controllable from Tool Manager.
DT_SPLIT_BY is an enumerated option taking one of the following values:
mutual-info, normalized-mutual-info, gain-ratio. It specifies the evaluation criterion for
choosing the attribute to split on at every node (see Chapter 11 for details). The
default is normalized-mutual-info.
DT_ADJUST_THRESHOLDS is a Boolean option determining whether splits on
continuous attributes have thresholds that are midpoints between two data points
or whether the thresholds should be actual data values. The default is FALSE (that
is, not to adjust the thresholds to data values). This option is not controllable from
Tool Manager.
This option is useful when you want to avoid splits on fractional values if attributes
take on only integer values.
Option Tree Inducer Options
The following options are available for the option tree inducer (ALGORITHM =
option-tree) in addition to the options listed above for decision trees, which are all
applicable to option trees too:
ODT_ROOT_OPTIONS is an integer option determining the maximum number of
splits at the root. The default is five.
ODT_OPTION_CHANGE is an integer option determining the change in the
maximum width allowed at every level of the tree. The default is -2. With the
default ODT_ROOT_OPTIONS of 5 and ODT_OPTION_CHANGE of -2, option
nodes will only be generated for the root (up to 5) and for the second level (up to 3).
All levels below will have no option nodes.
ODT_INIT_FITNESS is a floating point value which determines when to exclude
attributes as options. When the inducer gives a fitness score to each attribute, it
chooses the best attribute and other attributes that might also be good as options.
The fitness ratio determines how good those other options must be. A factor value
of f implies that to be considered an option, an attribute must score at least (1 - f)*b,
where b is the score for the best attribute. A fitness ratio
of 1 picks all the attributes (so the option limits described above are reached if
there are attributes on which to split). A fitness ratio of 0 causes a regular decision
tree to be created (no option nodes).
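For example, a hypothetical fragment of a .classify-opt file that requests a narrower
option tree (the values shown are only illustrative) might be:
ALGORITHM=option-tree
ODT_ROOT_OPTIONS=3
ODT_OPTION_CHANGE=-1
ODT_INIT_FITNESS=0.2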
Evidence Inducer Options
The following options are available for the evidence inducer (ALGORITHM = evidence):

EVI_LAPLACE_CORRECTION is a Boolean determining whether to apply the Laplace correction (see Chapter 13). The default is FALSE.

EVI_LAPLACE_FACTOR specifies the correction factor to be applied during Laplace correction (see Chapter 13). The default is 0, which chooses a value automatically based on the data.

EVI_AUTO_FEATURE_SELECTION is a Boolean determining whether to apply feature subset search along with the induction (see Chapter 10). This may improve performance, but can be slow. You cannot set this option to TRUE if BOOST_INDUCER is set.

EVI_FSS_BACKWARDS is a Boolean determining whether to run the feature subset search starting from the full set of columns. Doing so may perform better but is even slower than a regular feature subset search.
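The following fragment is an illustrative sketch that enables the Laplace correction with an automatically chosen factor and turns on forward feature subset selection (BOOST_INDUCER must not be set in this case). The induction mode, data, and label options must be supplied as well:

ALGORITHM=evidence
EVI_LAPLACE_CORRECTION=TRUE
EVI_LAPLACE_FACTOR=0
EVI_AUTO_FEATURE_SELECTION=TRUE
EVI_FSS_BACKWARDS=FALSE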
Decision Table Inducer Options
The following options are available for the decision table inducer (ALGORITHM = decision-table):

DTAB_X_NUM_ATTRS, DTAB_X_ATTR_n, DTAB_Y_NUM_ATTRS, and DTAB_Y_ATTR_n specify the initial column mappings for the decision table inducer. Use them as follows:

Set DTAB_X_NUM_ATTRS to the number of columns you wish to map to the X axis.

Set DTAB_Y_NUM_ATTRS to the number of columns you wish to map to the Y axis.

Columns are numbered starting with 0. For each column n on the X axis, set DTAB_X_ATTR_n to the name of the column. For each column n on the Y axis, set DTAB_Y_ATTR_n to the name of the column.
Here is an example showing how to set the mappings for a sample file (cars.data from
/usr/lib/MineSet/data):
DTAB_X_NUM_ATTRS=1
DTAB_X_ATTR_0=time_to_sixty
DTAB_Y_NUM_ATTRS=0
This example maps a single column to the X axis and no columns to the Y axis.
DTAB_FEATURE_SELECTION is a Boolean determining whether to automatically select column mappings for the induction. Setting this option is equivalent to using the Suggest and then the Go button in the Tool Manager. See Chapter 14, "Inducing and Visualizing the Decision Table."

DTAB_FEATURE_SEARCH is a Boolean determining whether to run a search over the space of column orderings to determine a better mapping ordering. If DTAB_FEATURE_SELECTION is also turned on, the feature selection process is used to select the initial state for the search; otherwise the search proceeds from the current column mappings. Checking the search box in the Tool Manager and using the Suggest and then the Go button turns on both this option and DTAB_FEATURE_SELECTION.
DTAB_MAX_NUM_ATTRS is an integer which limits the number of columns
which may be suggested when DTAB_FEATURE_SELECTION is turned on. This
option is ignored unless DTAB_FEATURE_SELECTION is turned on.
DTAB_MAX_NUM_NODES is an integer which limits the total size of the decision
table. See Chapter 14, Inducing and Visualizing the Decision Table.
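As an illustration of the options above, the following fragment (values are arbitrary) lets the inducer suggest up to five columns automatically and then search over their ordering, rather than specifying the mappings explicitly as in the cars.data example:

ALGORITHM=decision-table
DTAB_FEATURE_SELECTION=Yes
DTAB_FEATURE_SEARCH=Yes
DTAB_MAX_NUM_ATTRS=5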
Regression Tree Inducer Options
The following options are available for the Regression Tree Inducer
(ALGORITHM=regression-tree):
RT_SPLIT_BY is an enumerated option that specifies the split criterion to use for regression tree generation. Allowable values are: variance, absolute-deviation, normalized-variance, and normalized-absolute-deviation.

RT_SPLIT_LOWER_BOUND is a floating point option that specifies the minimum weight of instances allowed in one particular branch of a split.

RT_PRUNING_METHOD is an enumerated option that specifies the pruning method to use for regression trees. This option currently allows only one value: cost_complexity.

RT_PRUNING_FACTOR is a floating point option that specifies how many standard errors worse than the minimum-cost tree the returned tree may be. A value of 0 returns the minimum-cost tree. This value may be any floating point number greater than or equal to zero.
RT_MAX_LEVELS is an integer option which limits the maximum depth of the tree.
Set to zero to disable the depth limit.
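As an illustrative sketch (the values are arbitrary), a regression tree run might set:

ALGORITHM=regression-tree
RT_SPLIT_BY=variance
RT_SPLIT_LOWER_BOUND=5
RT_PRUNING_METHOD=cost_complexity
RT_PRUNING_FACTOR=0.5
RT_MAX_LEVELS=0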
Estimate Error
If the MODE is estimate-error, cross-validation will be performed. The following options
are available:
CV_FOLDS is an integer determining the number of cross-validation folds. The
default is 10. See Chapter 10 for details.
CV_TIMES is an integer determining the number of times to repeat
cross-validation. The default is 1. See Chapter 10 for details.
All other options are the same as for the induction modes.
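For example, the following illustrative settings run 5-fold cross-validation three times and average the results:

MODE=estimate-error
CV_FOLDS=5
CV_TIMES=3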
Learning Curve
If MODE is learning-curve, a learning curve is generated (see Learning Curves in Error
Estimation in Chapter 10). All the options for the inducer modes are the same, plus the
following:
LEARN_CURVE_NUM_POINTS is an integer > 0 that determines the number of points on the learning curve.

LEARN_CURVE_RUNS_PER_POINT is an integer > 0 that determines how many times to run the inducer for each point on the learning curve. The more runs, the better the error estimate and the narrower the confidence interval.

LEARN_CURVE_MIN_RECORDS is an integer that determines the minimum number of records, that is, the x-axis value at which the learning curve starts. A value of -1 invokes an automatic heuristic.

LEARN_CURVE_MAX_RECORDS is an integer that determines the maximum number of records, that is, the x-axis value at which the learning curve ends. A value of -1 invokes an automatic heuristic.
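An illustrative learning-curve run with ten points, three runs per point, and automatically chosen record counts might use:

MODE=learning-curve
LEARN_CURVE_NUM_POINTS=10
LEARN_CURVE_RUNS_PER_POINT=3
LEARN_CURVE_MIN_RECORDS=-1
LEARN_CURVE_MAX_RECORDS=-1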
Clustering
If MODE is cluster, MIndUtil will invoke the clustering algorithm. Clustering behaves
similarly to an inducer, except that no label is required (the LABEL option is ignored).
Clustering uses the following options:
CLUSTER_METHOD is an enumerated option taking on one of two values: single or iterative. This option selects between the single k-means and iterative k-means clustering methods (see Chapter 16, "Inducing and Visualizing Clustering").

CLUSTER_NUM is an integer that specifies the number of clusters to find in the data. It is only used if CLUSTER_METHOD is set to single.

CLUSTER_LOW is an integer that specifies the lower bound on the number of clusters to find in the data. It is only used if CLUSTER_METHOD is set to iterative.

CLUSTER_HIGH is an integer that specifies the upper bound on the number of clusters to find in the data. It is only used if CLUSTER_METHOD is set to iterative.

CLUSTER_CHOICE_POINT is a floating point number between 0.0 and 1.0 that is used to determine the final number of clusters. It is only used if CLUSTER_METHOD is set to iterative. For more details see Chapter 16, "Inducing and Visualizing Clustering."
CLUSTER_DISTANCE_TYPE is an enumerated option taking on one of two values: euclidean and manhattan. This option chooses the distance metric used by clustering. See Chapter 16, "Inducing and Visualizing Clustering."

CLUSTER_NUM_ITERATIONS is an integer that specifies the limit on the number of iterations the clustering algorithm may use. This is equivalent to the max # iters parameter in the Tool Manager. See Chapter 3, "The Tool Manager."

RANDOM_SEED specifies the random initial starting position for the clustering algorithm. Different values of this option may produce different clusterings. See Chapter 16, "Inducing and Visualizing Clustering."

CLASSIFIER_NAME is a string defining the name of the clustering model to be built. Suggested suffix: .cluster.

VIZ_NAME is a string defining the name of the visualization files. This option should be set to the name of the file to generate for the Cluster Visualizer (suggested suffix: .clusterviz). A data file with an additional .data suffix is automatically generated.
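For example, the following fragment (file names and values are illustrative only) searches for between 2 and 8 clusters with iterative k-means and writes a clustering model and a Cluster Visualizer file:

MODE=cluster
CLUSTER_METHOD=iterative
CLUSTER_LOW=2
CLUSTER_HIGH=8
CLUSTER_CHOICE_POINT=0.5
CLUSTER_DISTANCE_TYPE=euclidean
CLUSTER_NUM_ITERATIONS=20
RANDOM_SEED=7258789
CLASSIFIER_NAME=example.cluster
VIZ_NAME=example.clusterviz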
Discretization
If MODE is discretize, discretization (binning) of attributes is performed and thresholds
are determined. The following options are available:
OUTPUT_NAME is the name of the file to contain the results.
DISC_TRAIN_ONLY is a Boolean option that determines whether only the training
set should be used for determining the discretization intervals. If a model is built
and tested, it is important that the test set not be used for any part of the induction
process. If this option is yes, HOLDOUT_PERCENT and RANDOM_SEED will be
requested (see Induction Modes on page 690).
ATTR_X, where X starts at 0 and increases, defines the names of the attributes that you would like discretized.

BINS_X, where X starts at 0 and increases, specifies the number of bins to discretize ATTR_X into. Note that this is an upper bound; entropy binning may choose a lower number of bins. If the number of bins is zero, automatic heuristics are used.
Example: A discretization of two attributes, sepal width and petal length, according to
label iris type, such that the number of bins is automatically determined, and only the
training set portion of the dataset is used, can be done using the following options:
MODE=discretize
ATTR_0=sepal width
BINS_0=0
ATTR_1=petal length
BINS_1=0
DISC_TRAIN_ONLY=Yes
HOLDOUT_PERCENT=0.666667
RANDOM_SEED=7258789
DISC_TYPE=entropy
DISC_MIN_SPLIT=0
LABEL=iris type
WEIGHT=sepal length
WEIGHT_IS_ATTRIBUTE=Yes
OUTPUT_NAME=iris.disc
Column Importance and Auto Selection
If MODE is auto-select or compute-importance, corresponding to the Find Importance and Compute Importance modes of the Tool Manager's Column Importance, the following options are available:

OUTPUT_FILE is the name of the file to contain the results.

ATTR_X, where X starts at 0 and increases, defines the names of preselected attributes. All other attributes are candidates for auto-selection or are ranked if column importance is chosen.

SELECT_N, an integer that determines the number of attributes to automatically select in auto-select mode, which corresponds to the non-Advanced mode and the Advanced "Find..." mode of Column Importance in the Tool Manager.
Example: To choose three attributes that can be used together with petal width to classify iris type, the following options can be used:
MODE=auto-select
LABEL=iris type
SELECT_N=3
ATTR_0=petal length
DISC_TYPE=entropy
DISC_MIN_SPLIT=0
OUTPUT_NAME=feature.fss
In this example, the discretization mode was entropy and the minimum number of
instances in a bin was set to 0, indicating automatic MIN_SPLIT.
Fit-Data
If the MODE is fit-data, the following options are available:

TEST_CLASSIFIER_IN determines the input classifier, which usually has a .class suffix.

TEST_CLASSIFIER_OUT is the name of the generated classifier (the .class suffix is recommended).

TEST_SHOW_VIZ is a Boolean option that determines whether a visualization should also be generated. If the option's value is yes or true, VIZ_NAME must be supplied as the file name to output the visualization.
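For example, the following illustrative fragment (the file names are placeholders) refits an existing decision tree classifier, together with the usual data options, and writes a visualization of the result:

MODE=fit-data
TEST_CLASSIFIER_IN=example-dt.class
TEST_CLASSIFIER_OUT=example-refit-dt.class
TEST_SHOW_VIZ=Yes
VIZ_NAME=example-refit-dt.treeviz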
MineSet-to-MLC, MLC-to-MineSet
These modes provide facilities to convert from MLC++ format to MineSet format and vice versa (see http://www.sgi.com/Technology/mlc). They can be used to convert UC Irvine (http://www.ics.uci.edu/~mlearn/MLRepository.html) formatted files or C4.5 formatted files, which are common in the machine learning community.

MineSet-to-MLC provides the following options:

SPLIT_TRAIN_TEST, whether to split the data into two files: a training set and a test set.

MLCFILE, the filename to export to. Suffixes of .names, .data, and .test are appended to this stem if SPLIT_TRAIN_TEST is true; otherwise, the suffixes are .names and .all. You can then run MLC++ inducers on these files, independent of MineSet.
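For example, with the MineSet-to-MLC mode selected, the following illustrative settings export the data as a training/test pair using the stem iris (producing iris.names, iris.data, and iris.test):

SPLIT_TRAIN_TEST=TRUE
MLCFILE=iris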
MLC-to-MineSet provides the following options:
DATAFILE, the file to import. If no suffix is given, it is assumed to be .data. It is recommended that you concatenate the training and test sets into a .all file and use that for importing.

NAMESFILE, the names file describing the DATAFILE. A reasonable default is automatically suggested based on the DATAFILE option.

OUTPUT_DATA, a MineSet output file. This should have a .data suffix.

OUTPUT_SCHEMA, a MineSet schema file. This should have a .schema suffix.
OUTPUT_LABEL, a string indicating what name to use for the label attribute.
REMOVE_UNKNOWN_INST, a Boolean option indicating whether to remove
records that have attributes with unknown values. The default is FALSE.
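For example, with the MLC-to-MineSet mode selected, the following illustrative settings (file names are placeholders) import a concatenated .all file:

DATAFILE=iris.all
NAMESFILE=iris.names
OUTPUT_DATA=iris.data
OUTPUT_SCHEMA=iris.schema
OUTPUT_LABEL=iris type
REMOVE_UNKNOWN_INST=FALSE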
Visualize
This mode lets you generate visualization files from classifiers. The following options are available:

CLASSIFIER_NAME is the name of the classifier, including the file suffix.

VIZ_NAME is a string defining the visualization name. For the decision tree inducer, this fully specified file name will be the name of the configuration file (the recommended suffix is .treeviz), and a suffix of .data is automatically added for the data file needed. For the evidence inducer, only one file name is needed (the recommended suffix is .eviviz).

Note that the classifier does not contain error estimation information, so decision trees and option trees show only the structure and distributions, not error estimates. You can use the test-classifier option within the apply-classifier option in the Tool Manager to generate a visualization that contains error estimates.
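For example, with the visualize mode selected, the following illustrative settings (file names are placeholders) generate Tree Visualizer files from an existing decision tree classifier:

CLASSIFIER_NAME=example-dt.class
VIZ_NAME=example-dt.treeviz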
Appendix J
J. Nulls in MineSet
Nulls represent unknown data. MineSet supports nulls in the data access tools, the
mining tools, and the visualization tools. The purpose of this appendix is to give you a
better understanding of the way MineSet handles nulls.
Semantics of Nulls
Unknown data values are often represented as nulls in data sources. While it is possible to associate different semantics with nulls, the most common is that nulls represent missing or unknown values. For example, if a data record is made up of fields representing firstname, middlename, and lastname, and a person's middlename is not known, it can be represented by the null value.

Nulls can occur in data for a variety of reasons: They can occur naturally in data as a means of representing unknown data, or they can come about as the result of doing certain kinds of aggregations. For example, if there are no flights between San Francisco and MineSet City, a query such as "find the average flight time from San Francisco to MineSet City" yields a null value.
MineSet generally follows the semantics of relational databases when dealing with nulls,
and treats them as unknown values.
Some databases, such as Oracle RDBMS, do not distinguish between null and empty
strings. In such a case, it is not possible to distinguish between an unknown middle name
and a person who does not have a middle name. On the other hand, Sybase RDBMS
distinguishes between null and empty strings. Hence MineSet can distinguish empty
and null strings when reading from Sybase, but not from Oracle.
Representation of Nulls
In data files, as well as in the visual tools, nulls are represented by the string "?" (question mark). Thus, if Joe Miner's middle name is unknown, his name is represented in our example data file (having schema firstname, middlename, lastname) as:

Joe ? Miner
The color gray is generally associated with null values in the visualizations. The graphical representation of nulls varies from tool to tool; see the chapters on the individual tools for a discussion of how each represents nulls graphically.
Operations on Nulls
Given that nulls represent unknown values, it becomes straightforward to give meaning
to expressions involving nulls.
Arithmetic Expressions
Arithmetic operations involving nulls always give a null result. For example:
(5 + ?) evaluates to ? (adding 5 to an unknown yields yet another unknown);
(6 / ?) evaluates to ?
Boolean Expressions
In addition to taking on the values of TRUE and FALSE, Boolean variables can also be
null. If a Boolean valued variable has a null (unknown) value, the result of combining it
with another Boolean variable in an expression is also unknown, unless it is possible to
determine just from the known value what the result is. In particular:
"? AND FALSE is FALSE", because FALSE ANDed with anything is always FALSE
"? AND TRUE is ?"
"? OR FALSE is ?"
"? OR TRUE is TRUE", because TRUE ORed with anything is always TRUE
"NOT ? is ?"
Relational Operations
Relational operations (==, !=, <, >, <=, and >=) involving nulls always evaluate to null.
Some particular cases worth emphasizing are:
"? == ? "evaluates to ?, not TRUE
"? != ?" evaluates to ?, not FALSE
"? != x"evaluates to ?, not FALSE
Given two unknown values, it is unknown whether the two are equal or unequal. This
behavior can be confusing when using a search panel. For example, when searching for
all values not equal to 0, nulls do not show up, yet neither do they show up when
searching for values equal to 0. Because of this, search panels provide the ability to search
explicitly for nulls. (Some search panels provide the option of treating nulls as zeros; see
the individual tool discussions for more information.)
Testing for Nulls
The function isNull() can determine whether or not a variable has the value null. For
example:
isNull(X) evaluates to TRUE if variable X has the null value
isNull(X) evaluates to FALSE if variable X has a non-null value
Aggregations in the Presence of Nulls
MineSet stays close to the semantics of SQL and relational databases when aggregating columns that might have null values. Thus, null values are ignored when computing SUM, AVG, MIN, MAX, and COUNT. This is best illustrated by an example. Consider a data file having records representing the number of pets a person has. The schema of this record is name, num_pets, and null (unknown) values are represented by "?":

Name        NUM_PETS
Tesler      3
Rathmann    ?
Haber       1
Bhargava    0
Sangudi     ?

Then,

SUM(NUM_PETS) = 4
COUNT(NUM_PETS) = 3 (and not 5, even though there are 5 rows of data)
AVG(NUM_PETS) = 1.33
MAX(NUM_PETS) = 3
MIN(NUM_PETS) = 0

In these aggregations, null values are basically ignored (note that the value 0 is different from ?, and is not ignored).

A special case of this is an aggregation where all the values being aggregated are themselves null. An even more specialized case is when there are no values being aggregated: for instance, when summing an empty column. In both these cases, the sum, average, min, and max are ?, while the count is 0.
Sort Order for Nulls
In an ascending sorted sequence, null values always appear before non-null values. In a
descending sorted sequence, null values always appear after non-null values.
Bins and Arrays With Nulls
MineSet lets you bin numeric data into bins or discrete intervals. It also lets you (via the
aggregation panel in the Tool Manager) create arrays on these bins. When a column of
values is binned, all null values are put in a bin labeled "?". Such a bin label is always
created, whether or not the data being binned has nulls in it. You control whether your application uses this bin for nulls: the Tool Manager's Preferences dialog has an option that lets arrays either keep or ignore bins for nulls. For example, if you know that the column being binned has no nulls, or you intend to study only the data corresponding to non-null values, you can choose to ignore the bin for nulls.
Appendix K
K. Further Reading and Acknowledgments
Some datasets were taken from the UCI repository (Merz, C. J., and Murphy, P. M. (1996).
UCI Repository of machine learning databases, Irvine, CA: University of California,
Department of Information and Computer Science) found at
http://www.ics.uci.edu/~mlearn/MLRepository.html
Further Reading
Several papers describing the technology used in MineSet are available at
http://mineset.sgi.com/tech
An excellent, non-technical introduction to data mining techniques is:
Michael Berry and Gordon Linoff. Data Mining Techniques. New York: John Wiley &
Sons, 1997. ISBN 0-471-17980-9. See
http://www.data-miners.com/
A comparative study of data mining tools, including MineSet, was done by the Two
Crows Corporation. It contains a good introduction to data mining.
Two Crows Corporation. Data Mining: Products, Applications & Technologies.
Ordering information is available at
http://www.twocrows.com
MLC++, the underlying analytical engine used in MineSet, is described in:

Kohavi, R., Sommerfield, D., and Dougherty, J., Data Mining using MLC++, a Machine Learning Library in C++. International Journal of Artificial Intelligence Tools, Vol. 6, No. 4, 1997, pp. 537-566. See http://robotics.stanford.edu/users/ronnyk/
A general and easy-to-read introduction to machine learning is:
Weiss, S. M., and C. A. Kulikowski. Computer Systems that Learn. San Mateo, CA:
Morgan Kaufmann Publishers, Inc., 1991.
A general comparison of algorithms and descriptions is provided in:
Taylor, C., D. Michie, and D. Spiegelhalter. Machine Learning, Neural and Statistical Classification. Paramount Publishing International, 1994.
An easy-to-read introduction to decision tree induction is:
Quinlan, J. R. C4.5: Programs for Machine Learning. Los Altos, CA: Morgan Kaufmann
Publishers, Inc., 1993.
An excellent book on decision trees from a statistical perspective is:
Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
A good edited volume of machine learning techniques is:
Dietterich, T. G. and J. W. Shavlik (Eds). Readings in Machine Learning. Morgan
Kaufmann Publishers, Inc., 1990.
A summary of accuracy estimation techniques is given in:
Kohavi, R. A study of cross-validation and bootstrap for accuracy estimation and model
selection. In Proceedings of the 14th International Joint Conference on Artificial
Intelligence, edited by C. S. Mellish. Morgan Kaufmann Publishers, Inc., 1995.
Available at
http://robotics.stanford.edu/users/ronnyk/
The following paper describes the decision table metaphor:
Kohavi, R., The Power of Decision Tables. In The European Conference on Machine
Learning, 1995.
A good reference to a paper explaining that no classifier can be best is:
Schaffer, C. A conservation law for generalization performance. In Machine Learning:
Proceedings of the Eleventh International Conference, 259-265. Morgan Kaufmann
Publishers, Inc., 1994. Available at
http://wwwcs.hunter.cuny.edu/faculty/schaffer/papers/list.html
Further Readings About Option Trees
MineSet uses an advanced version of the Option Trees described in:
Ron Kohavi and Clayton Kunz. Option Decision Trees with Majority Votes. Machine
Learning: Proceedings of the Fourteenth International Conference, Morgan
Kaufmann Publishers, Inc., 1997. (See http://robotics.stanford.edu/users/ronnyk).
The option trees used in MineSet average the predictions and do not simply vote
them as described in this paper. Option Trees were first introduced by Wray Buntine in his thesis A Theory of Learning Classification Rules, 1992, School of Computing
Science, University of Technology, Sydney.
Further Readings About the Evidence Inducer
The following paper describes the wrapper method used to select the features for the Evidence Classifier:

Kohavi, R., Sommerfield, D. (1995). Feature Subset Selection Using the Wrapper Model: Overfitting and Dynamic Search Space Topology. The First International Conference on Knowledge Discovery and Data Mining, pp. 192-197. Available at http://robotics.stanford.edu/users/ronnyk/

An excellent introduction to the Evidence Classifier (Naive-Bayes) is:

Kononenko, I. (1993). Inductive and Bayesian Learning in Medical Diagnosis. Applied Artificial Intelligence, 7:317-337.

The following paper describes conditions under which the Evidence Inducer is optimal:

Domingos, P., Pazzani, M., Beyond Independence: Conditions for the Optimality of the Simple Bayes Classifier. Machine Learning, Volume 29, No. 2/3, Nov/Dec 1997, pp. 103-130.

The following paper describes the use of the wrapper method in the Evidence Inducer:

Kohavi, R., John, G., Wrappers for Feature Subset Selection. In Artificial Intelligence Journal, special issue on relevance, Vol. 97, Nos. 1-2, pp. 273-324.

The following paper describes the Laplace correction option:

Cestnik, B. (1990). Estimating Probabilities: A Crucial Task in Machine Learning. Proceedings of the Ninth European Conference on Artificial Intelligence, pp. 147-149.
The following paper describes the automatic Laplace correction used in MineSet:
Kohavi, R., Becker, B., and Sommerfield, D., Improving Simple Bayes, European Conference on Machine Learning, 1997 (poster). Available at http://robotics.stanford.edu/users/ronnyk/

The following paper describes the Evidence Classifier (Naive-Bayes):

Langley, P., Iba, W., Thompson, K. (1992). An Analysis of Bayesian Classifiers. Proceedings of the Tenth National Conference on Artificial Intelligence, pp. 223-228. Available at http://www.isle.org/~langley/pubs.html

Simple Bayesian Classifier. To appear in Lecture Notes in Computer Science: Issues in the Integration of Data Mining and Data Visualization, Springer Verlag, 1998.

The following books describe the Evidence Classifier:

Good, I. J. The Estimation of Probabilities: An Essay on Modern Bayesian Methods. MIT Press, 1965.

Duda, R., Hart, P. Pattern Classification and Scene Analysis, Wiley, 1973.
The following paper shows that while the conditional independence assumption can be
violated, the classification accuracy of the evidence classifier (called Simple Bayes in this paper) can be good:

Domingos, P., Pazzani, M. (1996). Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier. Machine Learning, Proceedings of the 13th International
Conference (ICML 96), pp. 105-112. Available at
http://www.ics.uci.edu/~pedrod/
Further Readings About the Splat Visualizer
The following paper describes and provides further references for the technical details of
the Splat Visualizer:

Becker, Barry G., Volume Rendering for Relational Data, to appear in Proceedings of
Information Visualization '97, IEEE Computer Society Press, Los Alamitos CA,
October 19-24, 1997.
The following paper explains how to use Gaussian splats for volume rendering:

Westover, Lee, Footprint Evaluation for Volume Rendering. In Proceedings of SIGGRAPH 90, Vol. 24, No. 4, pages 367-376.
Acknowledgments
The iris database (described in Chapter 10, MineSet Inducers and Classifiers) was
originally used in Fisher, R. A. 1936. The use of multiple measurements in taxonomic
problems. Annals of Eugenics 7(1):179-188. It is a classical problem in many statistical
texts.
The breast cancer database was obtained from Dr. William H. Wolberg, University of Wisconsin Hospitals, Madison. See L. Mangasarian and W. H. Wolberg, Cancer diagnosis via linear programming, SIAM News 23(5):1 & 18, September 1990.
The data for the mushroom sample le comes from: Audubon Society Field Guide to North
American Mushrooms. New York: Alfred A. Knopf, 1981.
The data on congressional voting was taken from the Congressional Quarterly Almanac,
98th Congress, 2nd session 1984, Volume XL, Congressional Quarterly Inc.: Washington,
D.C., 1985.
The adult dataset was derived from the US Census Bureau survey in 1994
(http://www.census.gov/ftp/pub/DES/www/welcome.html).
713
Index
Symbols
, 446
# symbol (conguration les), 526, 536, 578, 606, 632
% (percent) character, 557
% shortest option, 137
% symbol (conguration les)
enum statements, 583, 611, 635
message statements, 564, 594, 624, 673
" (double quote) vs. (single quote), 538
* wildcard, 157, 160, 206, 243, 278, 311
; symbol (conguration les), 536, 576
> (greater than) symbol, 557
<--> thumbwheel, 149
? character, 702
? cursor, 169
? wildcard, 157, 160, 206, 243, 278, 311
[] wildcard, 157, 160, 206, 243, 278, 311
\ (backslash) sequences, 526, 538, 578, 606, 632
\ characters, 529, 545
\n sequence, 538
} symbol (conguration les), 536, 576
(single closing quotation) characters, 539
Numbers
2D aggregation, 199, 238, 271
2-dimensional arrays, 575, 585, 603
declaring, 609
3D charts, 221, 257, 621, 643
3D landscapes, 127, 173, 250
3D views, 195
A
accelerator keys, 169
accessing help screens
Tree Visualizer, 169
accuracy
boosting, 339
accuracy (classiers), 324
testing, 341
Add Column button, 91
Add Column dialog box, 91, 92
Add Column option, 76
Add New Op. After button, 98
Add New Op. Before button, 98
addresses, 523, 533
adultJobs.data, 283
adult-salary.dtableviz, 454
adult-salary-dt.treeviz, 371
adult-salary.eviviz, 424
adult-salary.schema, 371, 424, 454
adult.schema, 369, 423, 453
adult-sex.dtableviz, 453
adult-sex-dt.treeviz, 369
adult-sex.eviviz, 423
Advanced mode, 494-496
Advanced Mode button, 494
-agg %d command-line option, 656
Aggregate button, 83, 86
Aggregate dialog box, 86
aggregate keyword, 551
Aggregate option, 76
714
Index
aggregation, 83-87, 129
bar heights, 137
bases, 552
color values, 139
data points, 249, 253, 254
hierarchies, 548, 551, 555, 557
null values and, 701, 704
options, 86, 87
two-dimensional, 199, 238, 271
algorithms, 315
adjusting, 360, 384
aligning elds in data les, 522, 532
Alphabetical command, 418, 448
alphabetical comparisons, 540
alphanumeric values
ltering, 206, 243, 278, 311
searching for, 157, 160
analyzing
patterns and trends, 251, 287
relationships, 127, 173, 215, 249
And operations, 157, 206, 243, 278, 311
AND operator, 580, 608, 666
animation, 173, 249
animation control panel (Map Visualizer), 196-203
buttons, 202, 239
displaying dates, 583
hiding data points, 207
sliders, 203
starting animation, 202, 239
stopping animation, 202, 239
summary window, 198, 199-201
creating paths, 201
viewing data, 197-199
animation control panel (Scatter Visualizer), 234-240
buttons, 239
displaying, 241
summary window, 227, 236, 238
coloring, 622
creating paths, 238
viewing dates, 611
animation control panel (Splat Visualizer), 267-270
displaying, 276
summary window, 261, 269, 271
coloring, 644
creating paths, 271
viewing dates, 635
Animation Flow buttons, 202, 239
animations, 39
annotating data points, 162
any keyword, 551, 552
color values and, 139
Apply button, 206
Apply Classier button, 92, 348
Apply Classier dialog box, 348
Apply Classier option, 76
Apply Model
Estimated probability values mode, 349
Predict discrete label values mode, 349
Apply Model panel, 349
arithmetic functions, 540, 581, 609, 666
arithmetic operators, 580, 608, 666
null values and, 702
Arrays
necessary for Scatter Visualizer and Map
Visualizer, 83
arrays, 83, 523-524, 534-535, 575, 603
converting columns to, 86
declaring, 527, 534, 543, 584, 609, 613
dening keys, 549, 551, 610, 613, 634
enumerating, 523-524, 534
geographic locations, 181
hierarchies and, 548, 551, 554
inducers and, 353
null values and, 524, 528, 534, 543, 575, 705
separators, 524, 535, 544
overriding, 528, 543, 584, 613
sliders and, 227
zero values and, 534
715
Index
Arrow button, 194
ascending keyword, 549
ascending sort order, 141
hierarchies, 554
keys, 549
null values, 705
ASCII les, 55
aspect ratios, 567
assoccvt command-line option, 295, 650
assocgen command-line option, 295, 652
assocgen program, 652
examples, 313
generation options, 653
restriction options, 654
starting, 295, 652
association data converter, 290, 648-651
command-line options, 650-651
examples, 651
le requirements, 648
output, 650
sample les, 312
starting, 295
testing les, 651
Association Rule Options dialog box, 299
association rules, 290-292
displaying, 303, 306, 667
examples, 657-663
ltering, 310-311
generating, 652
mapping data to, 300, 304-305, 668
predictability, 290, 291, 658
expected, 291, 658
minimum threshold, 291, 299, 654
prevalence, 291, 658
minimum threshold, 291, 299, 653
sorting, 655
visualizing, 292, 312
association rules generator, 290-292, 300, 652-663
command-line options, 652-657
displaying legends, 293
le requirements, 652
output, 290, 654
overview, 41
sample les, 313, 314
setting options, 299
starting, 295
associations, 41, 102, 297-300
Associations tab, 102
attaching to servers, 49
attributes, 42, 315, 322, 358, 381
availability, 325
discretization algorithm, 497
options, 352
removing, 397
testing, 363, 367
Attribute Weights in clustering, 488
Australian maps, 180, 212
australia.states.gfx, 212
australia.states.hierarchy, 212
Auto Column selection mode, 397
automatically computed thresholds, 78
Automatic column selection option, 403
automatic discretization algorithm, 497
Automatic Thresholds
Uniform Range, 81
Uniform Weight, 80, 81
automatic thresholds
Uniform Range, 80
Automatic Thresholds Computation, 80
Automatic Thresholds tab, 78
avg keyword, 551, 552
color values and, 139
null values and, 704
716
Index
axes
assigning values, 221, 257, 621-622, 643
display options, 227
invisible labels, 262
labeling, 228, 262, 622, 626, 643, 645
color options, 622, 643
normalizing, 622
scaling values, 622
zero values and, 622
Axes requirement, 257
Axis 1 requirement, 221
axis keyword, 621, 643
Axis Label Size option, 228, 262
Axis Options
No Adjust, 227
Scale Size, 227, 639
Axis options, 227
Max Size, 227, 639
axis statements, 621-622, 643
axis variable, 621, 643
B
Backt test set option, 342
backtting classiers, 328, 342
backslash characters, 529, 545
backslash keyword, 614, 637
backslash sequences, 526, 538, 578, 606, 632
bar charts, 173
Bar Label Color option, 140
bars, 127, 133
color options, 134, 139, 302
based on keys, 139, 560
labels, 140, 568
mapping to, 138
decision trees and, 363
equalizing, 554
xed, 139
generating, 548
geographic regions, 182
heights, 136, 302, 308, 668
adjusting, 556
aggregating, 137
normalizing, 556, 558
labeling, 303, 563, 672, 676
colors, 140, 568
font options, 676
size, 570
laying out, 566
negative values and, 129
null values and, 569
scaling, 136, 302, 308
searching, 156, 160
sorting, 549
zero values and, 568
base color statements, 562
Base Execute option, 140
Base Heights command, 165
base height statements, 558
base keyword, 552, 562
Base Label Color option, 140
base message statements, 564
bases, 129, 134, 143
aggregation and, 552
color options, 134, 139, 562
labels, 140, 568
lines, 140
mapping to, 138
decision trees and, 363
heights, 558
legends and, 558
labeling, 140, 568
size, 570
null values and, 569
scaling, 136
selecting, 168
zero values and, 568
717
Index
batch mode, 66
beer2.data, 172
beer2.treeviz, 172
beer.data, 172
beer.treeviz, 172
bibliography, 708
binary les, 290, 294
binary formats, 522
bin-based array types, 90
Bin Column button, 77
Bin Columns dialog box, 77-78
Bin Columns option, 75
binned columns
make 2-d array, 84
Binning
with Automatically Computed Thresholds, 79
binning, 77
binning options, 78
bins, 77, 78
assigning dates, 82
creating automatically, 78
creating manually, 82
null values and, 705
range criteria, 82
bin type, 90
bitwise operators, 580, 608, 666
blank elds, 522, 532, 574, 602, 628
blank lines, 522, 532, 574, 602, 628
blocks.data, 211
blocks.gfx, 211
blocks.hierarchy, 211
blocks.mapviz, 211
Boolean expressions, 702
BOOST_NUM_TRIALS option, 352
Boosting, 339
brand.data, 246
brand.scatterviz, 246
Breast Cancer Diagnosis dataset, 373, 387, 427, 461
breast.dtableviz, 461
breast-dt.treeviz, 373
breast.eviviz, 427
breast.schema, 427, 461
buckets keyword, 561, 592, 620, 642, 670
budgets, 85, 171
buttons
Map Visualizer
animation control panel, 202, 239
Tree Visualizer, 147-148
search dialog box, 158
window control, 194-195
C
calculated columns, 546, 587, 614
calendar quarters, 583, 611, 635
canada.provinces.gfx, 212
canada.provinces.hierarchy, 212
Canadian maps, 180, 212
cars.data, 246, 297
cars.dtableviz, 452
cars-dt.treeviz, 369
cars.eviviz, 421
cars.scatterviz, 246
cars.schema, 369, 421, 452
case-insensitive lters, 161
case-insensitive searches, 157
case-sensitive lters, 159
case-sensitive searches, 155
category.rules, 314
category.ruleviz, 314
census95 data le, 126
census database, 314
718
Index
censusIncome data le, 284
Change DBMS button, 72
Change Server button, 70
Change Types Button, 88
Change Types button, 88
Change Types option, 76, 88
changing colors, 116
changing marks, 164
character strings, 90, 523, 533, 575, 603, 629, 632
conguration les, 526, 538, 578, 606, 632
ltering, 206, 243, 278, 311
searching, 157, 160
charts, 173, 221, 257
labeling, 622, 626, 643, 645
normalizing axes, 622
plotting values, 621-622, 643
zero values and, 622
Check Expression button, 92
child nodes
option trees and, 385
selecting, 145, 148, 168
viewing, 148
Choose Child button, 148
choosing colors, 115, 117
Churn dataset, 285, 368, 386, 421, 450, 499
learning curve for, 333
churn.dtableviz, 450
churn-dt.treeviz, 368
churn.eviviz, 421
churn.schema, 285, 421, 450
classes
assigning records, 355, 377, 389
specifying, 363
classication rules, 365
classication types, 325, 326
Classier & Error mode, 340, 341, 342, 343
viewing output, 346, 347
Classier only mode, 341, 346
Classier Options dialog box, 359
classiers, 315, 315-319
accuracy, 324
testing, 341
applying to records, 328
backtting, 328, 342
column importance and, 497
confusion matrices, 329-330, 343
dened, 315
generating, 316, 356, 380, 397
learning curves, 332-334
options, 344-345
viewing output, 346
lift curves, 330-332, 343
loading pre-existing, 92, 348
loss matrices, 335-337, 344, 397
naming, 359, 382, 401
predicting unknown values, 337
record weighting, 328, 338, 344
plotting cumulative weights, 330
return on investment curves, 343
selecting, 348
viewing output, 346
class labels, 322
searching, 367
selecting, 406, 444
Clear All button, 102, 134, 300
Clear button
lter dialog, 160
search dialog, 156, 158
Clear Selected button, 102, 134, 300
Click for Help command, 169
Close button
lter dialogs, 164
search dialogs, 158
Close command, 152
clustering algorithms
iterative k-means, 484
single-k means, 483
719
Index
Cluster Visualizer
starting, 489
.clusterviz.data les, 490
Color Aggregation options, 139
Color - Bar requirement, 134
Color - Bars requirement, 182
Color - Base requirement, 134
Color Browser, 117
opening, 115
Color by Key option, 139
Color Choose dialog box, 228, 262
Color - Disk requirement, 134
color editor, 138, 186
color keyword, 617, 618, 640
color list, 138, 186, 225, 260, 302
Color mapping option, 138, 226, 260
color mappings
Map Visualizer, 186-187, 591
overriding, 591
Rules Visualizer, 302, 669
overriding, 670
Scatter Visualizer, 226, 618-621
overriding, 620
Splat Visualizer, 257, 260-261, 640-643
overriding, 641
Tree Visualizer, 138, 560-563
null values and, 170
overriding, 561
color names, 560, 591, 619, 640, 669
Color requirement, 257
colors, 115-117, 137
bars, 134, 139, 302
based on keys, 139, 560
labels, 140, 568
bases, 134, 139, 562
labels, 140, 568
changing, 116
continuous ranges, 561, 619, 641
decision trees, 364
disks, 134, 139, 302, 563
entities, 225-226
lling by key, 139, 560
geographic regions, 182
grids, 228, 262, 303, 626, 645
ground, 140, 566
labels, 303, 617, 622, 643
bars, 140, 568
bases, 140, 568
legends, 562, 593, 621, 671
displaying for, 642
lines, 140
nodes, 568
normalizing, 592
sky, 140, 566
splats, 251, 257, 260
summary values, 623, 644
colors keyword, 560, 591, 619, 641, 670
Colors option, 137
color statements
Map Visualizer, 591-593
Rules Visualizer, 669-672
Scatter Visualizer, 618-621
Splat Visualizer, 640-643
Tree Visualizer, 560-562
color swatches, 115, 116
color variable, 560, 591, 619, 641, 670
column importance algorithm, 497
Column Importance option, 103-107
modes, 104
Column Importance tool, 493-500
dependence, 498
discrete attributes and, 497
importance ranking, 498
modes, 494-496
purity measure, 498
sample les, 499-500
720
Index
columns, 42, 526, 542, 584, 612, 636
aggregation options, 86, 87, 551
calculated, 546, 587, 614
changing types, 88
computed, 91
converting to arrays, 86
dening, 521, 529, 531, 544
deleting, 76
mapping to, 221, 250, 258, 556, 590
naming, 91
retrieving, 300
selecting, 493, 497
viewing, 75
Columns to aggregate option, 87
Columns to remove option, 87
Column Type
changing, 88
command-line options
data converter, 650-651
rules generator, 652-657
startup
Evidence Visualizer, 399
Map Visualizer, 179
Rules Visualizer, 295, 296
Scatter Visualizer, 219
Splat Visualizer, 256
Tree Visualizer, 131
commas, adding to numbers, 564, 594, 624, 673
comments
conguration les, 526, 536, 578, 606, 632
data les, 522, 532, 574, 602, 628
company.data, 245
company-life.scatterviz, 246
company.scatterviz, 245
company-total.scatterviz, 246
comparing
datasets, 136
locations, 535
regions, 535
strings, 581, 609, 667
alphabetically, 540
dataString vs. string types, 523, 533
ltering types, 243, 278
ltering types and, 206, 311
comparing strings, 90
Complementary Drill Through command, 166, 208,
244, 281, 419, 449
computed columns, 91
conditional probabilities, 392, 397
conguration les, 49
comments, 526, 536, 578, 606, 632
data les and, 522, 532
DataMover, 51-57
Evidence Visualizer
loading, 399
mandatory, 54
Map Visualizer, 177, 576
loading, 178
sample, 211, 212
naming variables, 525, 537, 577, 605, 630
option les and, 536
Rules Visualizer, 294, 652, 664
loading, 296, 309
sample, 314
Scatter Visualizer, 217, 219, 604
formatting, 604
loading, 219
sample, 245, 246
Splat Visualizer, 255, 629
formatting, 629
loading, 256
Tree Visualizer, 129, 536
loading, 130, 151
sample, 171, 172
721
Index
conguring
DataMover, 51-59
Decision Tree Inducer, 358-362
Evidence Visualizer, 400-403
Map Visualizer, 576-595
Tool Manager and, 180-189
Option Tree inducer, 381
Rules Visualizer, 664-676
Tool Manager and, 297-305
Scatter Visualizer, 604-626
Tool Manager and, 220-229
Splat Visualizer, 629-645
Tool Manager and, 256-262
Tree Visualizer, 536-571
Tool Manager and, 132-141
confusion matrices, 329-330, 343
confusion matrix, 350
connections, 49, 52, 55
remote, 58-59
consecutive integers, 533
Constant command, 282
consumer research sample les, 172
consumer spending sample les, 213
Contains search option, 157, 160, 206, 243, 278, 311
Continuous color setting, 138, 186, 226, 260, 302
controls
Map Visualizer, 194-196
hiding, 206
Rules Visualizer, 308-??
Scatter Visualizer, 234
hiding, 241, 242
Splat Visualizer, 267
hiding, 276
Tree Visualizer, 147-150
copying data les, 57
Copy Other Window command, 151
COUNT aggregations, 85
count keyword, 551, 552
null values and, 704
Create Box Selection command, 279
credit database, 314
cross-validation classication, 326, 342
Current Columns text box, 87
Current Columns window, 75
current views, 94
cursor
hand, 191, 232, 264, 306, 405
question mark, 169
cutting selection information, 144, 192, 232, 265, 307
D
database mining tools, 40
database servers
connecting to, 49
data/breast.schema, 373
Data Destination panel, 68, 99
data exchanges, 505
.data lename extensions, 129, 176, 217, 255
data les, 521-524
aligning elds, 522, 532
comments, 522, 532, 574, 602, 628
conguration les and, 522, 532
copying, 57
Decision Tree Inducer, 357
Evidence Visualizer, 398, 677
Map Visualizer, 176, 181, 573-575
naming, 581
reading, 582
sample, 211, 212
null values and, 702
Option Tree inducer, 380
pre-existing, 56
reading, 526
Rules Visualizer, 294, 648, 650, 652
sample, 312, 313
722
Index
Scatter Visualizer, 217, 601-603
naming, 609
reading, 610
sample, 245, 246
selecting, 69
Splat Visualizer, 255, 627-629
naming, 633
reading, 633
sample, 283
Tree Visualizer, 129, 531-535
naming, 541
reading, 541
sample, 171, 172
Data Files panel, 107
Data Files tab, 107
.datamove les, 51
DataMover, 40, 41, 51-59
connecting to, 49
pre-existing data les and, 56
remote connections, 58-59
startup cong le, 54
data points, 203, 240, 272
aggregating, 249, 253, 254
annotating, 162
hiding, 207
multi-dimensional, 235, 269
one-dimensional, 236, 269
datasets
classifying, 43, 44, 355, 377, 389
comparing, 136
displaying data, 174, 175, 197
3D landscapes, 127, 173, 250
animation control panel, 234-240, 267-270
drilling through, 502-504
restrictions, 503
ltering, 137, 158-162, 557
nding specic values, 154-158, 366-368
geographic regions and, 197-199
hierarchical, 292, 656
accessing, 295
example, 660-661
options, 657
sample les, 313
loading sample, 59
predictions, 315
confusion matrices and, 329
sampling, 340
SAS formats and, 505-508
saving, 107
selecting multiple values, 501
unlabeled, 359, 382
updating, 544, 585
data sources, 68
null values and, 701
data statements, 526
Map Visualizer, 584-585
Scatter Visualizer, 612-613
Splat Visualizer, 636
Tree Visualizer, 542-544
dataString, 89
dataString types, 523, 533, 575, 603, 628
Data Transformations panel, 68, 75
data types, 89, 522
changing, 88
invalid, 89
Map Visualizer, 574-575, 581, 584
Rules Visualizer, 665, 667
Scatter Visualizer, 602-603, 609, 612
Splat Visualizer, 628-629, 636
Tree Visualizer, 532-535, 540, 542
user-dened, 533
dates, 583-584, 611-612, 634-636
assigning to bins, 82
formatting, 583, 611, 635
incrementing, 583, 611, 634
inducers and, 353
date types, 90, 583, 611, 629, 634
723
Index
days, 583, 611, 635
decimal points, 564, 594, 624, 673
double types, 522, 532
oat types, 522, 532
decision nodes, 363
Decision Table Visualizer, 431-463
menus, 446-449
sample les, 450-463
selecting items, 444
viewing modes, 444
Decision Tree Classier, 316-317
controls, 365
generating, 356
menus, 366
naming, 359
overview, 43, 355
searching for objects, 366-368
Decision Tree Inducer, 43, 355-375
adjusting induction algorithm, 360
classication rules, 365
Column Importance tool and, 498
conguring, 358-362
overview, 355
required les, 357
sample les, 368-375
starting, 357
viewing node information, 364
decision trees
classifying records, 365
displaying, 346, 363-364, 385
drilling through, 503
error/loss estimates, 364, 367
ltering, 366
generating, 43
measure of purity, 364, 367
nodes, 363-364
viewing information, 364
null values and, 367
pruning, 362
searching, 366-368
setting options, 359-362, 382-384
splitting, 361
testing attributes, 363
declaring
arrays, 527, 534, 543, 584, 609, 613
data types, 542, 584, 612, 636
enumerations, 527, 542
keys, 549, 551, 610, 613, 634
sliders and, 589
variables, 526, 537, 542, 577, 584, 612, 636
Decrease option, 384
default directories, 536, 576, 604, 630
default mappings, 300, 305
defaults, resetting
Map Visualizer, 184
Rules Visualizer, 301
Scatter Visualizer, 223
Splat Visualizer, 258, 262
Tree Visualizer, 134
defaults les, 536, 576, 604, 630
views, 588, 615, 638
default sort order, 554
Delete button, 164
Delete Op button, 98
deleting columns, 76
deleting marks, 164
Depth slider, 162
descendent nodes, 143
descending keyword, 549
descending sort order, 141
hierarchies, 554
keys, 549
null values, 705
description les, 294, 652, 657
sample, 313
specifying, 662
Detail Slider, 415
Diabetes Diagnosis dataset, 374, 428, 463
724
Index
diff command (UNIX), 651
-dir %s command-line option, 654
disabling progress dialogs, 131, 179, 218, 296
discrete attributes, 352, 358, 381, 497
Discrete color setting, 138, 186, 226, 260, 302
discrete labels, 352, 358, 381, 401
Discrete Labels menu, 358, 381, 401
discretization algorithm, 497
disk color statements, 563
disk height statements, 559
disk keyword, 559, 563
disks, 134
color options, 134, 139, 302, 563, 672
mapping to, 138
distance between, 302, 676
heights, 136, 308, 559
legends and, 558
normalizing, 559
null values and, 569
size, 669
zero values and, 568
Display confusion matrix option, 343
displaying
animation control panel, 241, 276
association rules, 303, 306, 667
child nodes, 148
classier output, 346
data, 127, 173, 174, 175, 197, 250
animation control panel, 234-240, 267-270
overhead projections, 153, 566
decision tree nodes, 364
decision trees, 346, 363-364, 385
entities, 225, 626
hierarchies, 142, 555, 567
labels, 228, 262, 303, 616, 622, 643
messages, 502
Map Visualizer, 188, 593
Rules Visualizer, 303, 672
Scatter Visualizer, 228, 624
Tree Visualizer, 139, 563
option trees, 318, 385
selected objects, 501
splats, 260
Display lift curves option, 343
Display menu (Tree Visualizer), 165
display options
Map Visualizer, 184-188
Rules Visualizer, 301-303, 675
Scatter Visualizer, 223-228, 626
Splat Visualizer, 258-262, 645
Tree Visualizer, 134-141, 565
display parameters, 165
Display X-Y Coordinates command, 207
Distance Metric in clustering, 488
distributions, 83, 84, 85
divide by zero errors, 546
divide function, 540, 581, 609, 666
/ operator vs., 546, 587, 615
dm_cong le, 54, 56
remote connections and, 58, 59
DNA Boundaries dataset, 388
DNA dataset, 375, 429, 463
dna.dtableviz, 463
dna.eviviz, 429
dna.schema, 429, 463
documentation, xxxiv
typographic conventions, xxxvii
Dolly thumbwheel, 149, 196
double, 89
double-precision oating-point numbers, 522, 532,
574, 602, 628
double quotes vs. single quotes, 538
double types, 522, 532, 574, 602, 628
Down button, 164
downloading Internet les, 131, 179, 218
drill-down functions, 193
drilling preferences, 245
725
Index
drilling through datasets, 502-504
restrictions, 503
drill-up functions, 193
-dt.class les, 359
.dtableviz le, 681
-dt.out lename extension, 346
E
editing, 95, 98
colors, 116
Edit matrix button, 344
Edit Op button, 98
editors, 647
Edit Prev. Op. button, 95
empty strings, 534
null values vs., 701
endpoints, 175
sample les, 213
e notation, 522, 532, 574, 602, 628
entities, 616
color options, 225-226
displaying, 225, 626
ltering, 625
labeling, 226, 616, 626
legends, 618
null values and, 229, 241
selecting, 228, 624
size, 225, 617-618
unknown positions, 241
Entities File eld, 186
entity
assigning color, 221
assigning label, 221
assigning size, 221
Entity - Bars requirement, 182
Entity Colors option, 225
entity keyword, 616
Entity Label Color option, 226
Entity Label Size option, 226
Entity Legend On option, 225
Entity Options option, 225
Entity requirement, 258
Entity Shape option, 225
Entity Size option, 225
entity statements, 616-617
entity variable, 616
enumerated arrays, 523-524, 534
declaring, 528, 543
hierarchies and, 554
keys as, 549, 551
enumerated values, 582, 610, 634
dates, 583, 611, 634
enumerations, 533
declaring, 527, 542
sliders and, 616, 638
enum keyword, 527, 542, 582, 583, 610, 611, 634
enum statements, 582-584, 610-612, 634-636
equality, 580, 608, 666
Equally Spaced Bins button, 82
Equals search option, 157, 160, 206, 243, 278, 311
Error Estimate mode, 341
viewing output, 347
Error Estimate Options
mean absolute error, 471
mean square error, 471
Error Estimation
Regression Tree Inducer, 471
error/loss estimate, 364, 367
error options (inducers), 341-346
error rate
boosting accuracy, 339
error rate (option trees), 385
Estimated probability values mode, 349
726
Index
Estimate Error mode, 342
viewing output, 346
European maps, 180, 212
europe.countries.gfx, 212
europe.countries.hierarchy, 212
-evi.class les, 401
Evidence Classier, 319, 389
generating, 43, 397
loss matrix, 397
naming, 401
Evidence Inducer
Column Importance tool and, 497, 498
running, 398
setting options, 402
Evidence Pane, 405
selecting items, 409
Evidence Visualizer, 43, 389-429, 677
conguring, 400-403
controls, 415-416
history logs, 678
main window, 403-414
menus, 416-419
overview, 45, 389
predictions, 396
probabilities, 392, 397
correcting, 397, 402
required les, 398
sample les, 420-429
selecting items, 406, 409
starting, 399-400
from UNIX prompt, 399
startup options, 400
viewing modes, 405
Evidence Visualizer icon, 399
-evi.out lename exension, 346
eviviz command-line option, 399
.eviviz lename extensions, 678
example les
loading, 59
examples directory, 171, 211, 245, 283
association rules, 312, 313
exceptions, 529
exchanging data, 505
execute keyword, 565
Execute option, 140, 188, 228
execute statements
Map Visualizer, 594
enabling warnings, 179
running, 179
Scatter Visualizer, 625
enabling warnings, 218
running, 218
Tree Visualizer, 565
enabling warnings, 131
running, 131, 144
executing queries, 73
executing shell commands, 565, 594, 625
executing UNIX commands, 140, 188, 228
Execution button, 102
Exit command, 152, 309
exiting
Rules Visualizer, 309
Scatter Visualizer, 241
Tree Visualizer, 152
expected predictability, 291, 658
expenditures, 85
exponential notation, 522, 532, 574, 602, 628
expressions, 92, 540, 580, 608
dening, 546, 587, 614, 666
hierarchies and, 553
null values and, 702-703
expressions keyword, 546, 587, 614, 665
expressions sections
Map Visualizer, 587
Rules Visualizer, 665-667
Scatter Visualizer, 614
Tree Visualizer, 546
727
Index
extend keyword, 622
extension les (Web), 510
external controls
Decision Tree Classier, 365
Evidence Visualizer, 415-416
Map Visualizer, 194-196
hiding, 206
Regression Tree Visualizer, 474
Scatter Visualizer, 234
hiding, 241, 242
Splat Visualizer, 267
hiding, 276
Tree Visualizer, 147-150
F
far horizon, 140
fasta.m.data, 213
fasta.m.gfx, 213
fasta.m.hierarchy, 213
fasta.m.mapviz, 213
Fast Forward button (Map Visualizer), 202, 239
Fast Reverse button (Map Visualizer), 202, 239
eld names, 91
elds, 573, 601, 627
aligning, 522, 532
assigning colors, 560, 591, 619, 641
charts and, 621, 643
data les, 521, 529, 531, 544
dening, 546, 587, 614
data type, 526, 542, 584, 612, 636
input sections, 525, 541
entity size and, 617-618
format les, 649
rules les, 665
eld separators, 573, 601, 627
default, 522, 532
le_cache setting, 52
le alteration monitor, 545, 586
le caches, 57
le keyword, 526, 541, 582, 610, 633
File menu, 68
Decision Table Visualizer, 446
Evidence Visualizer, 416
Rules Visualizer, 309
Scatter Visualizer, 241
Tree Visualizer, 151
lenames
include statements, 538, 578, 606, 631
option les, 536
Rules Visualizer, 294, 313
Scatter Visualizer, 217
Splat Visualizer, 255
Tree Visualizer, 129, 176
le requirements
Decision Tree Inducer, 357
Evidence Visualizer, 398
Map Visualizer, 176
Option Tree inducer, 380
Rules Visualizer, 294, 648, 652, 663
Scatter Visualizer, 217
Splat Visualizer, 255
Tree Visualizer, 129
les, including, 538, 578, 606, 631
le statements, 526
Filter Button, 87
Filter button, 243, 278, 311
lter dialog box, 159
ltering
association rules, 310-311
data, 137, 158-162, 557
decision trees, 366
entities, 625
maps, 204
splats, 242-243, 277-278
lter keyword, 557, 625
Filter menu
Rules Visualizer, 310
728
Index
Filter option, 76
Filter Out % Shortest option, 137
Filter panel
Map Visualizer, 204-206
Rules Visualizer, 310-311
Scatter Visualizer, 243
Splat Visualizer, 243, 278
Tree Visualizer, 158-162
Filter Panel command, 158, 204, 366
lter statement, 625
Find File button, 186
Find File dialog box, 57
nding specic values, 154-158
decision trees, 366-368
First Child button, 148
First Child command, 168
scal year quarters, 583, 611, 635
Fit Data to Model, 350
Fit Data to Model mode, 350
Fit Data to Model Panel, 351
xed-length array types, 90
xed-sized arrays, 523, 534, 575, 603
declaring, 527, 543, 584, 609, 613
hierarchies and, 548
separators, 524, 535
xed strings, 523, 533, 575, 603, 629
at maps, 213
at planes, 174
oat, 89
oating-point numbers, 497, 522, 532, 574, 602, 628,
665
inducers and, 353
oat types, 522, 532, 574, 602, 628, 665
fonts, 568, 676
format les, 294, 649
formats
conguration les, 536, 576, 604, 629
data les, 521, 531, 573, 601, 627, 677
data converter, 648
messages, 139
numbers, 564, 594, 624, 673
format strings
dates/time, 583, 611, 635
messages, 564, 593, 624, 672
For Selected Operation options, 98
Front View button, 195
Further Classier Options command, 359, 382, 401
Further Inducer Options dialog box, 382
Further Inducer Options option, 402
further readings, 708
G
Gain Ratio option, 361
Gaussian command, 282
Gender attribution dataset, 369, 423, 453
generalities, 292
geographical objects, 176, 186
geographic regions, 173, 180, 589
assigning keywords, 182
bar heights, 182
color options, 182
displaying, 596
granularity, 193
legends, 187, 590
messages, 188
scaling, 186, 590
vertical height, 195
Geography File option, 186
germanCredit.rules, 314
germanCredit.ruleviz, 314
729
Index
gfx les, 176, 596-599
generating, 180-181
samples, 211, 212
Web environments, 514
Go Back button, 148
Go Back command, 168
Go Forward button, 148
Go Forward command, 168
Go menu (Tree Visualizer), 167
Go to button, 164
graphical user interface, 63
graphs, 173, 221, 257
labeling, 622, 626, 643, 645
normalizing axes, 622
plotting values, 621-622, 643
zero values and, 622
grasp mode
Evidence Visualizer, 405
Map Visualizer, 191, 194
Rules Visualizer, 306
Scatter Visualizer, 232
Splat Visualizer, 264
greater than symbol (>), 557
Grid (X, Y, Z) Size option, 228, 262
Grid Color option, 228, 262
grid keyword, 674
grids
association rules, 674, 676
color options, 228, 262, 303, 626, 645
labeling, 303
line spacing, 228, 262
grid statements, 674
ground colors, 140, 566
Group-By columns option, 87
group.rules, 314
group.ruleviz, 314
H
Hand button, 194
hand cursor, 191, 232, 264, 306, 405
Height-adjust slider, 195
Height Aggregation option, 137
Height - Bar requirement, 133
Height - Bars requirement, 182
Height - Base requirement, 134
Height button, 302
Height - Disk requirement, 134
Height eld, 302
Height lter slider, 161
height keyword, 556, 668, 669
height multiplier, 195
Height Scale slider, 415
Height slider, 150, 308
height statements
Map Visualizer, 590
Rules Visualizer, 668-669
Tree Visualizer, 556-558
Help menu
Map Visualizer, 209
Rules Visualizer, 312
Scatter Visualizer, 245
Tree Visualizer, 169
help screens, accessing, 169
help windows
Scatter Visualizer, 194
hexadecimal color values, 560, 591, 619, 640, 669
Hidden option, 161
Hide Distance option, 302
Hide Label Distance option, 228, 262
hiding
data points, 207
labels, 228, 262, 303, 616
730
Index
hierarchical data, 292, 656
accessing, 295
example, 660-661
options, 657
sample les, 313
hierarchies, 127, 547
aggregating, 551, 555, 557
assigning values, 133
defining keys, 548, 554
displaying, 142, 555, 567
getting descriptions, 565
moving through, 168
normalizing heights, 556
populating, 551
setting options, 554
sorting, 553
Hierarchy field, 156, 160
hierarchy files, 177, 595-596
generating, 180-181
samples, 212
specifying, 186
hierarchy function, 540, 567
hierarchy keyword, 547
Hierarchy option, 367
Hierarchy Root Level requirement, 133
hierarchy sections, 547-555
key statements, 548-550
levels statements, 547-548
options, 554
sort statements, 553
hierarchy.treeviz.options, 547
highlighting objects
Map Visualizer, 191, 208, 232
Rules Visualizer, 307
Scatter Visualizer, 232, 244
Splat Visualizer, 265
Tree Visualizer, 143
history sections, 678
history window, 95
removing items, 98
holdout classification, 325, 341
holdout ratio, 341
Home button, 147, 195
Home command, 167
home locations, 147, 167, 195
setting, 147, 167, 195
horizontal sliders, 589, 616, 638
hours, 583, 611, 635
H thumbwheel, 149
Hypothyroid Diagnosis dataset, 373, 388, 428, 462
hypothyroid.dtableviz, 462
hypothyroid-dt.treeviz, 373
hypothyroid.eviviz, 428
hypothyroid.schema, 373, 428, 462
I
icons, 66
-ifile %s command-line option, 650
Ignore Case In Filter option, 159
Ignore Case In Filters option, 161
Ignore Case In Searches option, 155, 157
importance (defined), 396
importance ranking, 498
include keyword, 538, 578, 606, 631
include statements, 631
Map Visualizer, 578
Scatter Visualizer, 606
Tree Visualizer, 538
incrementing dates, 583, 611, 634
incrementing numeric values, 582, 610, 634
Index command, 169
indexes, 84
color values and, 561, 591, 619, 620, 641, 670
defining keys, 548, 554, 610, 613, 634
Inducer Options dialog box, 401
inducers, 315, 320-321
adjusting information levels, 352
class labels, 322
error options, 341-346
execution modes, 340-341
limitations, 351-353
overriding, 352
running, 340, 346
setting options, 335, 351-353
tracking progress, 346
induction algorithms, 315
adjusting, 360, 384
INFORMIX tables, 54, 55, 60
loading, 62
input keyword, 525, 541, 581, 609, 633, 665
input sections, 525-529
Map Visualizer, 581-586
data statements, 584-585
enum statements, 582-584
file statements, 582
options, 585
Rules Visualizer, 665
Scatter Visualizer, 609-614
data statements, 612-613
enum statements, 610-612
file statements, 610
options, 614
Splat Visualizer, 633-637
data statements, 636
enum statements, 634-636
file statements, 633
options, 637
Tree Visualizer, 541-545
data statements, 542-544
file statements, 541
options, 544
input transaction files, 290
insurance sample files, 245
int, 89
integers, 522, 532, 574, 602, 628, 665
mapping consecutive, 533
interactive mode, 66
Internet files, 131, 179, 218
InterTool menu, 209
int types, 522, 532, 574, 602, 628, 665
invalid parameters, 82
invalid types, 89
invisible labels, 228, 262
invoking
Cluster Visualizer, 489
Decision Table Inducer, 436
Decision Tree Inducer, 357
Evidence Visualizer, 399-400
from UNIX prompt, 399
Map Visualizer, 178-179, 189
from UNIX prompt, 179
multi-threaded mode, 683
Option Tree inducer, 380
Rules Visualizer, 295-296, 305
Scatter Visualizer, 217, 218-219, 229
from UNIX prompt, 219
single-threaded mode, 683
Splat Visualizer, 255-256, 262
from UNIX prompt, 256
resetting defaults and, 262
Tool Manager, 49, 66-68
from UNIX prompt, 66
Tree Visualizer, 130-131, 132, 141
from UNIX prompt, 131
Iris classification dataset, 371, 386, 425, 459
iris.dtableviz, 459
iris-dt.treeviz, 372
iris.eviviz, 425
iris.schema, 372, 425, 459
isize %d command-line option, 650
isNull function, 243, 278, 703
Is Null operator, 157, 206, 311
isSummary function, 540
item keyword, 674
item statements, 674
iterative k-means clustering, 484
K
keep_classifier_files setting, 52
keep_classifier_options_files setting, 52
keep_client_download setting, 52
keep_client_upload setting, 52
keep_mlc_input setting, 52
Key - Bars requirement, 133
keyboard shortcuts, 169
key keyword, 548, 613
color statements and, 560
keys, 133
arrays, 549, 551, 554, 610, 613, 634
coloring bars and, 139, 560
geographic regions, 182
hierarchies, 548, 554
setting options, 537, 577
sliders, 589
Keys & Shortcuts command, 169
key statements, 548-550
keywords, 539, 579, 607, 632, 681
Kind of mapping color option, 138
L
label keyword, 643
Rules Visualizer, 672
Scatter Visualizer, 616, 622
Tree Visualizer, 563
Label Probability command, 418, 448
Label Probability Pane, 405
selecting items, 406, 444
labels
axes, 228, 262, 622, 626, 643, 645
bars, 303, 563, 672, 676
colors, 140, 568
size, 570
bases, 140, 568
size, 570
color options, 303, 617, 622, 643
bars, 140, 568
bases, 140, 568
distance between, 228, 262, 303
entities, 226, 616, 626
grids, 303
inducers and, 322, 352, 358, 381, 401
main windows, 188, 588
nodes, 567
resizing, 226
setting fonts, 568, 676
size, 626, 676
splats, 262, 643
label statements
Rules Visualizer, 672
Tree Visualizer, 563
landscapes, 127, 173, 250
Laplace correction, 395, 397
Laplace correction option, 402
large numbers, 522, 532, 574, 602, 628
Last Child button, 148
Last Child command, 168
leaf nodes, 363
Learning Curve mode, 344
learning curves, 332-334
options, 344-345
viewing output, 346
legend keyword, 590, 593
Rules Visualizer, 669, 671
Scatter Visualizer, 617, 618, 621, 623
Splat Visualizer, 640, 642, 644
Tree Visualizer, 558, 562
Legend On option, 187
legends
association rules, 293
color values, 562, 593, 621, 642, 671
entities, 225, 302, 617, 618
geographic regions, 187, 590
height mappings, 558
splats, 640, 642, 644
summary, 227, 623, 644
Legends option, 302
Level option, 367
levels keyword, 547
levels statements, 547-548
LHS (defined), 655
libraries, 56
lift curve, 350
lift curves, 330-332, 343
Limit tree height to option, 361
Linear command, 282
line breaks, 536
Line Color option, 140
line colors, 140
line spacing (grids), 228, 262
loading example files, 59
loading files
Map Visualizer, 178, 179
Rules Visualizer, 299, 309
Scatter Visualizer, 217
Splat Visualizer, 255-256
Tree Visualizer, 129, 131, 151, 296
loading tables
sample, 61, 62
Load SQL from File button, 74
locations
comparing, 535
marking, 162-164
loading files
Cluster Visualizer, 490
lod options, 570
LOGLEVEL option, 352
Loop button, 202, 239
loss matrix, 335-337, 344, 397
-lvldesc %d %s command-line option, 657
M
main window
Statistics Visualizer, 123
main windows
Cluster Visualizer, 490-491
Decision Table, 442-445
Evidence Visualizer, 403-414
Map Visualizer, 189-193
labeling, 188, 588
Regression Tree, 472-473
Rules Visualizer, 293, 305-307
displaying legends, 302, 669
Scatter Visualizer, 230-233
Splat Visualizer, 264-266
Tree Visualizer, 128, 142-146
Make Fixed option, 139
-map %d command-line option, 656
Map All button, 300
mapassocgen command-line option, 295, 652
mapassocgen program, 652
generation options, 653, 655
hierarchical data options, 657
restriction options, 654
starting, 295, 652
map keyword, 589
mapping
sliders, 182, 221
mapping files, 294, 652, 656
sample, 313
Mapping option, 186
mapping requirements
Map Visualizer, 181-183
Scatter Visualizer, 220-221
Splat Visualizer, 252, 257-258
Tree Visualizer, 132-134
mappings, 99-102, 555
association rules and, 300, 304-305, 668
consecutive integers, 533
default, 300, 305
entity size and, 617-618
geographic locations, 181
hierarchical data, 292
legends and, 558
null values and, 170, 229
strings, 252
table columns, 221, 250, 258, 556, 590
undoing, 102
Map Visualizer, 173-213
animation control panel, 196-203
buttons, 202, 239
displaying dates, 583
summary window, 198, 199-201
viewing data, 197-199
color mappings, 186-187, 591
configuring, 576-595
Tool Manager and, 180-189
data files, 176, 181, 573-575
naming, 581
reading, 582
data input, 573
data types, 574-575, 581
declaring, 584
displaying data, 173, 174, 175, 197, 588
gfx files and, 596
drilling through, 504
external controls, 194-196
hiding, 206
file requirements, 176
filter panel, 204-206
fine-tuning granularity, 193
geographic locations, 180
assigning keywords, 182
getting information, 191, 232
help with, 209
keywords, 579
loading files, 178, 179
main window, 189-193
labeling, 188, 588
manipulating views, 191
mapping requirements, 181-183
undoing, 183
menus, 204-209
moving through views, 195
null values and, 210
options, 184-189
resetting, 188
saving, 188
startup, 179
overview, 45, 173
resetting defaults, 184
sample files, 211-213
saving defaults, 189
selecting objects, 191, 208, 232
sliders, 240
starting, 178-179, 189
from UNIX prompt, 179
startup options, 179
startup screen, 178
synchronizing sliders, 209
viewing modes, 191, 194
Map Visualizer's Options dialog box, 185-188
Map Visualizer icon, 178
mapviz command-line option, 179
.mapviz extensions, 177
Mark button, 163
Mark Flags command, 165
Marks command, 162
.marks filename extensions, 164
Marks panel, 162-164
getting current location, 164
Matches search option, 157, 160, 206, 243, 278, 311
mathematical expressions, 91
mathematical functions, 540, 581, 609, 666
Max # root options, 384
MAX_ATTR_VALS option, 352
MAX_LABEL_VALS option, 352
max keyword, 551, 552
color values and, 139
null values and, 704
Rules Visualizer, 668
Scatter Visualizer, 618, 622
Max/Scale Heights option, 136
Mean error/loss standard deviation option, 367
mean squared error, 471
measure of purity, 364, 367
memory, 523, 533, 575, 603, 629
menus
Cluster Visualizer, 492
Decision Table Visualizer, 446-449
Decision Tree Classifier, 366
Evidence Visualizer, 416-419
Map Visualizer, 204-209
Regression Tree Visualizer, 474
Rules Visualizer, 309-312
Scatter Visualizer, 241-245
Splat Visualizer, 276-282
Statistics Visualizer, 123-125
Tool Manager, 109-112
Tree Visualizer, 150-169
message files, 654
message keyword, 563, 593, 624, 672
Message option, 139, 188
messages, 502
Map Visualizer, 188, 593
Rules Visualizer, 303, 672
Scatter Visualizer, 228, 624
Tree Visualizer, 139, 563
message statements
Map Visualizer, 593-594
Rules Visualizer, 672-673
Scatter Visualizer, 624-625
Tree Visualizer, 563-565
MineSet, 39
multi-threaded, 683
setting up, 51-59
tools, xxxiii
overview, 39-49
mineset_batch command-line option, 66
mineset_makemtr command-line option, 514
mineset_makemtr script, 510
MINESET_WARN_EXECUTE variable, 131, 179, 218
mineset_webinstall_client script, 510
mineset_webinstall_server script, 510
mineset_weblaunch script, 510
mineset_wsf.tar script, 510
mineset2sas command-line option, 505
mineset2sas utility, 505
running, 505
startup options, 506
.mineset-classopt file, 351
mineset command-line option, 66
MineSet icon, 66
MineSet mtr extension, 509, 514-515
publishing les, 520
startup options, 514, 515
MineSet Remote View, 509, 516-519
installing, 516
MineSet User's Guide command, 169
Min fitness ratio, 384
minimum predictability threshold, 291, 299, 654
minimum prevalence threshold, 291, 299, 653
min keyword, 551, 552
color values and, 139
null values and, 704
minutes, 583, 611, 635
missing data values, 701
models
applying to records, 322-324
modifying marks, 164
Modify button, 164
modifying colors, 116
modulus function, 540, 581, 609, 666
monitor keyword, 544, 585
months, 583, 611, 635
Move Left button, 148
Move Left command, 168
Move Right button, 148
Move Right command, 168
Move Up command, 168
moving through views, 94
-msg %s command-line option, 654
mtr les, 131, 179, 218, 514
publishing, 520
mult.dat, 312
mult.fmt, 312
multi-dimensional data points, 235, 269
multiple objects, selecting, 144
multiple users, 52
multiple values, 501
multiprocessor version, 51, 359, 382
multi-threaded mode
MIndUtil_p, 683
setting, 110
Mushroom classification dataset, 283, 372, 387, 426, 459
confusion matrix for, 335, 336, 337
mushroom.data, 283
mushroom.dtableviz, 459
mushroom-dt.treeviz, 372
mushroom.eviviz, 426
mushroom.schema, 372, 426, 459
Mutual Info option, 361
N
Naive-Bayes, 393
-names command-line option, 507
-names %s command-line option, 655
naming
classifiers, 359, 382, 401
columns, 91
data files, 541, 581, 609, 633
variables, 525, 537, 577, 605, 630
keywords and, 539, 579, 607, 632
viewpoints, 162
negative values, 129
nesting include statements, 538, 578, 606, 631
network connections, 49
New column name text field, 349
New Database Table dialog box, 71
New Data File dialog box, 69
new lines, 538
New type button, 89
Next button, 158
Next field, 94
nl.births.data, 246
nl.births.scatterviz, 246
-nodata command-line option, 508
nodes, 127, 143
base heights, 136
decision trees, 363-364
viewing information, 364
disk heights, 136
distance between, 569
filtering, 161
finding specific, 156, 160
labeling, 567
line options, 568
option trees, 385
populating, 555
regression trees, 472
selecting child, 145, 148, 168
viewing child, 148
-nolabel command-line option, 507
Nominal Order menu, 418, 447
Normalized Mutual Info option, 361
Normalize Heights option, 136
normalize keyword, 556, 592
Normalize On option, 187
Normalize Subtree command, 166
normalizing axes values, 622
normalizing colors, 592
normalizing heights, 136
bars, 556, 558
disks, 559
normalizing trees, 534
NOT operator, 580, 608, 666
null enumerated arrays, 524, 534, 575
declaring, 528, 543
Nulls command, 165
null values, 170, 210, 229, 524, 701
arrays and, 524, 528, 534, 543, 575
binning, 705
decision trees and, 367
defining, 702
display options, 241, 569
empty strings vs., 701
in expressions, 702-703
mapping, 170, 229
objects and, 165
predicting, 337
sorting, 705
splats and, 263
testing for, 703
treating as zeros, 160
numbers, 522, 532, 574, 602, 628, 665
as filters, 311
filtering, 206, 243, 278, 311
formatting, 564, 594, 624, 673
incrementing, 582, 610, 634
searching for, 157, 160
sorting, 549
O
objects
displaying messages
Map Visualizer, 188, 593
Rules Visualizer, 303, 672
Scatter Visualizer, 228, 624
Tree Visualizer, 139, 563
geographical, 176, 186
null heights and, 165
searching for, 154-158, 366-368
selecting
Map Visualizer, 191, 195, 208, 232
null values and, 171, 211
Rules Visualizer, 307
Scatter Visualizer, 232
Splat Visualizer, 265
Tree Visualizer, 143, 144, 153, 155
viewing selected, 501
zero heights and, 165
observed predictability, 658
-odt.class files, 382
-odt.out filename extension, 346
-ofile %s command-line option, 650
one-dimensional arrays, 534, 575, 603
declaring, 613
one-dimensional data points, 236, 269
opacity, 252, 258
opacity keyword, 639
Opacity requirement, 258
opacity statements, 638-640
opacity variable, 639
Open command, 151, 309
opening database tables, 70
opening files
Map Visualizer, 178, 179
Rules Visualizer, 299, 309
Scatter Visualizer, 217
Splat Visualizer, 255-256
Tree Visualizer, 129, 131, 151, 296
Open New Data File command, 69
Open New DBMS Query command, 73
Open New DBMS Table command, 70
Open Other Window command, 151
operators, 540, 580, 608, 666
relational, 157, 160
filtering data, 206, 243, 278, 311
optional mappings
Map Visualizer, 181
Scatter Visualizer, 220
Tree Visualizer, 132
Option Nodes, 318, 378, 384
defined, 377
ranking, 385
options files, 536
hierarchies, 547, 556
options keyword, 529
Map Visualizer, 577, 585
Rules Visualizer, 675
Scatter Visualizer, 605, 614, 626
Splat Visualizer, 631, 637, 645
Tree Visualizer, 537, 544, 565
options statements, 529, 536, 537, 577, 605, 631
defaults les and, 576, 604, 630
tokens and, 539, 579, 607, 632
views, 565, 626, 645
Option Tree Classifier
overview, 44
Option Tree classifier, 317, 377
generating, 380
naming, 382
Option Tree Inducer, 44
Option Tree inducer, 377-388
adjusting induction algorithm, 384
configuring, 381
Decision Tree vs., 379
displaying option trees, 318, 385
error rates, 385
overview, 377-379
required files, 380
sample files, 385-388
starting, 380
Option Tree Visualizer, 44
Oracle tables, 54, 55, 60
loading, 61
remote servers and, 58
organizational hierarchies, 534
organization option, 554
Origin of cars dataset, 369, 386, 421, 452
Or operations, 157, 206, 243, 278, 311
OR operator, 580, 608, 666
Other Options option, 228, 262
Outline option, 161
outlines, 165
Outlines File eld, 186
output les
data converter, 650
rules generator, 290
Overview command, 153, 169
Overview window (Tree Visualizer), 153, 566
P
panning views
Map Visualizer, 191
Rules Visualizer, 306
Scatter Visualizer, 232
Splat Visualizer, 264
pan thumbwheel, 149
parallel computing, 359, 382
parallelization, 51, 683
parameters, 82
assoccvt command-line options, 650
assocgen command-line options, 653, 654
display options, 165
mapassocgen command-line options, 655, 657
mineset2sas command-line options, 506
sas2mineset command-line options, 507
Parent button, 148, 168
Party affiliation dataset, 373, 387, 427, 460
pasting selection information, 144, 192, 232, 265, 307
pathnames, 582, 610, 633
data files, 526, 541
include files, 538, 578, 606, 631
Path slider (Map Visualizer), 203, 240
patterns, 251, 287
people94.rules, 314
people94.ruleviz, 314
people.data, 246
people.scatterviz, 246
percentages, 557
Percent option, 367
percent symbol (%) in configuration files
enum statements, 583, 611, 635
message statements, 564, 594, 624, 673
perhouse.perage.data, 213
perhouse.perage.mapviz, 213
Perspective button, 195
pick dragger, 265, 281
pima.dtableviz, 463
pima-dt.treeviz, 374
pima.eviviz, 428
pima.schema, 374, 428, 463
Play Forward button (Map Visualizer), 202, 239
Play-once button, 202, 239
Play Reverse button (Map Visualizer), 202, 239
population.australia.data, 212
population.australia.mapviz, 212
population.canada.data, 212
population.canada.mapviz, 212
population.europe.data, 212
population.europe.mapviz, 212
population sample files, 212
population sampling, 338
population.usa.cities.data, 212
population.usa.cities.mapviz, 212
population.usa.data, 212
population.usa.mapviz, 212
pound symbol (#) in configuration files, 526, 536, 578, 606, 632
-pred %f command-line option, 654
PRED (dened), 655
predictability, 290, 291, 658
expected, 291, 658
minimum threshold, 291, 299, 654
Predictability option, 299
Predict discrete label values mode, 349
predicting unknown values, 337
predictions, 315, 396
confusion matrices and, 329
pre-existing data files, 56
Preferences command, 245
-prev %f command-line option, 653
PREV (dened), 655
prevalence, 291, 658
minimum threshold, 291, 299, 653
Prevalence option, 299
Prev field, 94
Previous button, 158
printed documentation, xxxiv
typographic conventions, xxxvii
printf manual page, 564, 593, 624, 672
Print Image command, 152
prior probability, 392, 397
probabilities, 392
correcting, 397, 402
generating, 397
probability estimates, 328
product categories sample files, 314
product group sample files, 314
Product Information command, 169
progress dialogs, disabling, 131, 179, 218, 296, 400
Pruning factor option, 362
purity, 103, 364, 367, 493
testing, 496
purity measure, 498
Purity option, 367
Q
quantities, 127, 173
quarters (calendar), 583, 611, 635
queries
running, 73
question mark (?), 702
question mark cursor, 169
-quiet option, 131, 179, 218, 296, 400
quitting
Rules Visualizer, 309
Scatter Visualizer, 241
Tree Visualizer, 152
quotation marks, 82
R
random colors, 207, 561, 591, 619, 641
random samples, 340
random seeds, 341
range criteria, 82
range of values, 82
raw data, 531, 573, 601, 627
raw data files, 294
formats, 648
sample, 312
reading
arrays, 535
strings, 523, 533
README files, 60, 314
reassigning data types, 88
records
assigning to classes, 355, 377, 389
availability, 324
classifiers and, 328
classifying, 365
format files, 649
models and, 322-324
raw data files, 648
unlabeled, 325
Record Viewer, 504
record weighting, 328, 338, 344
plotting cumulative weights, 330
splats and, 252
Redo Change button, 99
references, 708
regions, comparing, 535
Regression Tree
starting, 466
Regression Tree Inducer, 465-478
error estimation, 471
options, 468
Cost Complexity Pruning, 471
Limit tree height by, 469
Split lower bound, 470
Splitting criterion, 470
overview, 465
Regression Tree Visualizer
controls, 474
menus, 474
regressor, 465
relational expressions, 540, 581, 609, 667
null values and, 703
strings, 540, 581, 609, 667
relational operators, 157, 160
filtering data, 206, 243, 278, 311
relationships, analyzing, 127, 173, 215, 249
relative pathnames, 582, 610, 633
data files, 526, 541
include files, 538, 578, 606, 631
remote server connections, 58-59
Remove Column button, 76
Remove Columns option, 75
Reopen command, 151, 309
required files
Decision Table, 437
Decision Tree Inducer, 357
Evidence Visualizer, 398
Map Visualizer, 176
Option Tree inducer, 380
Rules Visualizer, 294, 648, 652, 663
Scatter Visualizer, 217
Splat Visualizer, 255
Statistics Visualizer, 121
Tree Visualizer, 129
Reset Options button, 141, 188, 228, 262
resetting defaults
Map Visualizer, 184
Rules Visualizer, 301
Scatter Visualizer, 223
Splat Visualizer, 258, 262
Tree Visualizer, 134
resetting tool options
Map Visualizer, 188
Scatter Visualizer, 228
Splat Visualizer, 262
Tree Visualizer, 141
resizing labels, 226
Return-on-Investment curve, 338
revenue sample les, 171
RHS (dened), 655
Right View button, 195
-rnum command-line option, 654
ROI curve, 338
ROI Curve option, 343
root nodes, 143
labeling, 567
-ropts command-line option, 653
rotating views
Map Visualizer, 191
Rules Visualizer, 306
Scatter Visualizer, 196, 232
Splat Visualizer, 264
Rotx thumbwheel, 196
Roty thumbwheel, 196
-rout %s command-line option, 655
-rsort %d command-line option, 655
.rules filename extensions, 313
rules files, 294, 664
sample, 314
Rules Visualizer, 287-314, 647
color mappings, 302, 669
components, 288
configuring, 664-676
Tool Manager and, 297-305
controls, ??-308
converting data, 290, 297-300, 648
creating associations, 102
current columns, 300
data files, 294, 648, 650, 652
data types, 665, 667
default mappings, 300, 305
displaying rules, 292, 303, 306, 312, 667
drilling through, 504
exiting, 309
external controls, 308
file requirements, 294, 648, 652, 663
filtering association rules, 310-311
generating association rules, 290-292, 652
minimum predictability threshold, 291, 299, 654
minimum prevalence threshold, 291, 299, 653
setting options, 299
getting information, 307
help with, 312
loading les, 299, 309
main window, 293, 305-307
displaying legends, 302, 669
manipulating views, 306, 308
menus, 309-312
options, 301-305
startup, 296
overview, 46, 287
resetting defaults, 301
sample files, 312-314
selecting objects, 307
setting up associations, 297-300
starting, 295-296, 305
startup options, 296
viewing modes, 306
Rules Visualizer icon, 296
Rule Visualizer Options dialog box, 301-303
ruleviz command-line option, 296
.ruleviz filename extensions, 294
Ruleviz Mappings panel, 304
running clustering, 484
running execute statements
Map Visualizer, 179
Scatter Visualizer, 218
Tree Visualizer, 131, 144
running queries, 73
running shell commands, 565, 594, 625
running UNIX commands, 140, 188, 228
rview_dir.cgi program, 516
configuring, 519
running, 518
rview_login program, 520
S
Salary Factors dataset, 371, 424, 454
sales sample les, 171, 246
sample files
association rules generator, 313
Cluster Visualizer, 492
Column Importance, 499-500
Decision Table, 682
Decision Table Visualizer, 450-463
Decision Tree Inducer, 368-375
Evidence Visualizer, 420-429
loading, 59
Map Visualizer, 211-213
Option Tree inducer, 385-388
Regression Tree Visualizer, 474-478
Rules Visualizer, 312-314
Scatter Visualizer, 245-247
Splat Visualizer, 283-285
Statistics Visualizer, 126
Tree Visualizer, 171-172
Web extensions, 510
Sample option, 76
sas2mineset command-line option, 507
sas2mineset utility, 505
running, 507
SAS datasets, 505-508
SAS executables, 505
Save As command, 151
Save Current History As command, 141
Save to PostScript button, 99
saving data, 107
saving tool options
Map Visualizer, 188
Scatter Visualizer, 228
Tree Visualizer, 141
scale keyword, 557, 590, 591
Rules Visualizer, 668, 670
Scatter Visualizer, 618, 619, 622
Splat Visualizer, 641
Tree Visualizer, 561
Scale to Filter command, 204
scaling
axes values, 622
bars, 136, 302, 308, 668
bases, 136
colors, 561, 591, 619, 641, 670
disks, 308, 668
entities, 225, 618
geographic regions, 186, 590
main windows
Map Visualizer, 191
Rules Visualizer, 306
Scatter Visualizer, 196, 232
Splat Visualizer, 264
scatterplots, 250
Scatter Visualizer, 215-247
animation control panel, 234-240
displaying, 241
summary window, 227, 236, 238, 622
viewing dates, 611
color mappings, 226, 618-621
configuring, 604-626
Tool Manager and, 220-229
data files, 217, 601-603
naming, 609
reading, 610
data input, 601
data types, 602-603, 609
declaring, 612
displaying data, 238, 615
drilling preferences, 245
drilling through, 504
exiting, 241
external controls, 234
hiding, 241, 242
file requirements, 217
filtering data, 242-243, 625
getting information, 232
help with, 194, 245
keywords, 607
loading files, 217
main window, 230-233
manipulating views, 196, 232
mapping requirements, 220-221
undoing, 221
menus, 241-245
null values and, 229
options, 223-228
resetting, 228
saving, 228
startup, 218
overview, 46, 47, 215
resetting defaults, 223
sample files, 245-247
saving defaults, 229
selecting columns, 497
selecting objects, 195, 232, 244
starting, 217, 218-219, 229
from UNIX prompt, 219
startup options, 218
startup screen, 219
viewing modes, 232
Scatter Visualizer icon, 218
scatterviz command-line option, 219
.scatterviz filename extensions, 217
ScatterViz Options dialog box, 225-228
.schema files, 57, 70, 525-529
Search button, 158
Search command, 154
Search dialog box, 154, 155-158, 366
searches, 154-158, 366-368
specifying search criteria, 157, 160
wildcards, 157, 160, 206, 243, 278, 311
Search Panel command, 366
search paths
data files, 526, 541, 582, 610, 633
defaults files, 604, 630
include files, 538, 578, 606, 631
options files, 536, 576
search spotlights, 155
turning off, 158
seconds, 583, 611, 635
sections (configuration files), 536, 576, 604, 629
security, 520
Seek button, 195
Select All command, 208
Select button, 158
selecting bases, 168
selecting classifiers, 348
selecting colors, 115, 117
selecting data les, 69
selecting entities, 228, 624
selecting multiple values, 501
selecting objects
Map Visualizer, 191, 195, 208, 232
null values and, 171, 211
Rules Visualizer, 307
Scatter Visualizer, 232, 244
Splat Visualizer, 265
Tree Visualizer, 143, 144, 155
Overview window, 153
Selection menu
Decision Table Visualizer, 448
Evidence Visualizer, 418
Map Visualizer, 208
Scatter Visualizer, 244
Splat Visualizer, 279
Tree Visualizer, 166
select mode
Decision Table Visualizer, 444
Evidence Visualizer, 406, 409
Map Visualizer, 191, 232
Rules Visualizer, 307
Scatter Visualizer, 194, 232
Splat Visualizer, 265
semicolons (;) in configuration files, 536, 576
Send To Tool Manager command, 166, 208, 244, 281
Send to Tool Manager command, 419, 449, 462
usage discussed, 502
separator keyword, 528, 529, 637
Map Visualizer, 584, 585
Scatter Visualizer, 613, 614
Tree Visualizer, 543, 544
separators, 585
arrays, 524, 535, 544
overriding, 528, 543, 584, 613
data files, 521, 529, 531, 544, 573, 601, 627
default character, 522, 532
fields, 573, 601, 627
numeric formats, 564, 594, 624, 673
server connections, 52, 55
remote, 58-59
servers
connecting to, 49
Set All button
filter dialog, 160
search dialog, 156
Set Home button, 147, 195
Set Home command, 167
Set Minimum Weight per Bin option, 403
setting run modes, 110
shared libraries, 56
shell commands, 565, 594, 625
shortcut keys, 169
Show Animation Panel command, 206, 241, 276
Show Data Points command, 207
specifying initial settings, 595
Show Entities with Null Positions command, 241
Show Filter Menu command, 277
Show menu (Tree Visualizer), 152
Show Null Positions command, 276
Show Original Data command, 166, 208, 244, 281,
419, 449, 462, 463
usage discussed, 502
Show Pick Dragger command, 281
Show Values command, 166, 208, 244
Show Window Decoration command, 206, 241, 242,
276, 417, 447
shrinking aspect ratios, 567
signed integers, 522, 532, 574, 602, 628
Simple Bayes, 393
Simple mode, 494
sinclude keyword, 538, 578, 606, 631
sinclude statements, 631
Map Visualizer, 578
Scatter Visualizer, 606
Tree Visualizer, 538
sing.dat, 312
sing.fmt, 312
single closing quotation marks, 539
single k-means clustering, 483
single quotes vs. double quotes, 538
Single-Step buttons (Map Visualizer), 202, 239
single-threaded mode
MIndUtil_s, 683
setting, 110
size keyword, 617
size statements, 617-618
size variable, 617
skipMissing option, 554, 555
sky colors, 140, 566
slider
creation, 183, 221
creation, automatic, 183, 222
creation, manual, 183, 222
mapping options, 227
slider controls
Evidence Visualizer, 415
Map Visualizer, 195, 197, 240
animation control panel, 203
declaring, 584
synchronizing, 209
Rules Visualizer, 308
Scatter Visualizer, 234-237
Splat Visualizer, 267-269
Tree Visualizer, 150
slider keyword, 589, 616, 638
Slider options, 227
sliders
assigning keys, 589
defining dimensions, 589, 603, 609, 616, 638
geographic regions, 188
mapping, 182, 221
summary values, 221
Sliders requirement, 258
small values, filtering out, 161
Solid option, 161
Sort By Importance command, 417
Sort by Key option, 141
Sort By requirement, 133
sorting, 77, 141, 549
association rules, 655
hierarchies, 553
sort keyword, 554
sort order, 549
null values, 705
specifying, 141
sort statements, 553
Specifying Thresholds, 82
Use custom thresholds, 82
Use evenly spaced thresholds, 82
Speed slider (Map Visualizer), 203, 240
Sphere command, 282
Splat Colors option, 260
splats
color options, 251, 257, 260
defined, 249
displaying, 260
drawing options, 260, 282
filtering, 242-243, 277-278
labeling, 262, 643
legends, 640, 642, 644
record weighting and, 252
summary values, 258
Splat Shape option, 260
Splats option, 260
Splat Type menu, 282
Splat Visualizer, 249-285
aggregating data points, 249, 253, 254
analyzing data, 251
animation control panel, 267-270
displaying, 276
summary window, 261, 269, 271, 644
viewing dates, 635
color mappings, 257, 260-261, 640-643
configuring, 629-645
Tool Manager and, 256-262
data files, 255, 627-629
naming, 633
reading, 633
data input, 627
data types, 628-629
declaring, 636
displaying data, 250, 271, 279, 282, 637
drilling through, 504
external controls, 267
hiding, 276
file requirements, 255
filtering data, 277-278
getting information, 265
interpolation, 272-275
keywords, 632
loading les, 255-256
main window, 264-266
manipulating views, 264
mapping requirements, 252, 257-258
undoing, 258
menus, 276-282
null values and, 263
options, 258-262
resetting, 262
overview, 249
resetting defaults, 258, 262
sample files, 283-285
selecting objects, 265
starting, 255-256, 262
from UNIX prompt, 256
resetting defaults and, 262
startup options, 256
viewing modes, 264
Splat Visualizer icon, 255
Splat Visualizer Options dialog box, 259-262
splatviz command-line option, 256
.splatviz.data files, 262
.splatviz extensions, 255
.splatviz.schema files, 262
Split Lower Bound option, 361
Splitting criterion option, 361
spotlights, 144, 155
turning off, 158
SQL queries
running, 73
SQL Query dialog box, 74
standard deviation, 367
Starting, 219
starting
Cluster Visualizer, 489
Decision Table Inducer, 436
Decision Tree Inducer, 357
Evidence Visualizer, 399-400
from UNIX prompt, 399
Map Visualizer, 178-179, 189
from UNIX prompt, 179
Option Tree inducer, 380
Regression Tree, 466
Rules Visualizer, 295-296, 305
Scatter Visualizer, 217, 218-219, 229
from UNIX prompt, 219
Splat Visualizer, 255-256, 262
from UNIX prompt, 256
resetting defaults and, 262
Statistics Visualizer, 121
Tool Manager, 49, 66-68
from UNIX prompt, 66
Tree Visualizer, 130-131, 132, 141
from UNIX prompt, 131
Start Tool Manager command
Tree Visualizer, 152
startup window, 67
Map Visualizer, 178
Scatter Visualizer, 219
Tree Visualizer, 130
statements (conguration les), 537, 577, 605, 630
stateRevenue.data, 171
stateRevenue.treeviz, 171
statics, 249
Stop button (Map Visualizer), 202, 239
store.data, 171
store.treeviz, 171
store-type.data, 246
store-type.scatterviz, 246
storing
data locations, 162
strings, 523, 533, 575, 603, 629
temporary files, 654
string, 90
string function, 540
strings, 90, 523, 533, 575, 603, 628, 632
as filters, 311
comparing, 90, 206, 243, 278, 311, 581, 609, 667
alphabetically, 540
dataString vs. string types, 523, 533
configuration files, 526, 538, 578, 606, 632
empty, 534, 701
filtering, 206, 243, 278, 311
hierarchies and, 540
mapping, 252
searching, 157, 160
sorting, 549
storing, 523, 533, 575, 603, 629
zero values and, 534
string types, 523, 533, 575, 603, 629, 665
Submit SQL Query button, 74
Subtract Minimum Evidence command, 417
Subtree weight option, 367, 473
sum keyword, 551, 552
null values and, 704
summary keyword, 595, 622, 644
summary legends, 227, 623, 644
Summary options, 227, 261
Summary requirement, 182, 221, 258
summary statements
Map Visualizer, 595
Scatter Visualizer, 622-623
Splat Visualizer, 644-645
summary values, 129
color options, 623, 644
hierarchies, 551
sliders and, 221
splats, 258
summary variable, 623, 644
summary window (Map Visualizer), 198, 199-201
creating paths, 201
summary window (Scatter Visualizer), 227, 236, 238
creating paths, 238
summary window (Splat Visualizer), 261, 269, 271
creating paths, 271
-svsc command-line option, 506, 508
Swing button, 202, 239
Sybase tables, 54, 55, 60
loading, 61
remote servers and, 59
shared libraries and, 56
Synchronize All Mapviz Sliders command, 209
syntax (configuration files), 536, 537, 576, 604, 629
axis statements, 621, 643
base color statements, 562
base height statements, 558
color statements, 560, 591, 618, 640, 669
buckets clause, 561, 592, 620, 642, 671
colors clause, 560, 591, 619, 641, 670
key clause, 560
legend clause, 562, 593, 621, 642, 671
normalize clause, 592
scale clause, 561, 591, 620, 641, 670
disk color statements, 563
disk height statements, 559
entity statements, 616
enum statements, 610, 634
expressions sections, 546, 587, 614, 665
filter statements, 625
grid statements, 674
height statements, 556, 590, 668, 669
filter clause, 557
legend clause, 558, 590
normalize clause, 556
scale clause, 557, 590
hierarchy sections, 547, 554
aggregate statements, 551, 552
key statements, 548
levels statements, 547
sort statements, 554
include statements, 538, 578, 606, 631
input sections, 525, 541, 581, 609, 633, 665
data statements, 526, 542, 584, 612, 636
enum statements, 582, 583, 611, 634
file statements, 526, 541, 582, 610, 633
options, 585
item statements, 674
label statements, 563, 672
message statements, 563, 593, 624, 672
execute clause, 565
opacity statements, 639
options statements, 537, 577, 585, 605, 631
input sections, 529, 544
sinclude statements, 538, 578, 606, 631
size statements, 617
summary statements, 595, 622, 644
view sections, 555, 588, 615, 637, 667
map statements, 589
options, 566, 567, 568, 569, 626, 645, 675
slider statements, 589, 616, 638
title statements, 588
syntax (data files), 522, 532, 573, 601, 627, 677
synthb.dat, 313
synthb.map, 313
synthn.dsc, 313
synths.dat, 313
synths.map, 313
system defaults, 536, 576, 604, 630
T
tab character, 522, 532
Table History buttons, 94
Table history dialog box, 96
Table Processing window, 87
tables
accessing data, 51
classifiers and, 322
deleting columns, 76
hierarchies and, 547, 554
loading
sample, 61, 62
mapping to columns, 221, 250, 258, 556, 590
opening, 70
processing options, 75
retrieving columns, 300
saving data, 107
variable-length, 554
viewing current states, 94
telecom.data, 213
telecom.mapviz, 213
temporary files, 654
storing, 654
Test attribute option, 367
testing classifier accuracy, 341
Test Model
confusion matrix, 350
lift curve, 350
Test Model Panel, 350
Test Model panel, 349
Test set error/loss option, 367
Test value option, 367
text editors, 647
text field
Thresholds for selected column are, 81
text files, 55
Texture command, 282
three-dimensional charts, 221, 257, 621, 643
three-dimensional landscapes, 127, 173, 250
three-dimensional views, 195
thresholds, 78, 82
thumbwheels
Scatter Visualizer, 196
Tree Visualizer, 149
Tilt thumbwheel, 149
time, 583-584, 611-612, 634-636
bin values, 82
formatting, 583, 611, 635
timeout options, 544
title keyword, 588
tnsnames.ora, 58
tokens, 539, 579, 607, 632
Tool Manager, xxxiii, 41, 63-117
configuration options
Decision Tree Inducer, 358-362
Evidence Visualizer, 400-403
Map Visualizer, 180-189
Option Tree inducer, 381
Rules Visualizer, 297-305
Scatter Visualizer, 220-229
Splat Visualizer, 256-262
Tree Visualizer, 132-141
inducers and, 351
menus, 109-112
nonsupported options, 63
overview, 63
running in batch mode, 66
starting, 49, 66-68
from UNIX prompt, 66
startup window, 67
tool options
Map Visualizer, 184-189
Rules Visualizer, 301-305
Scatter Visualizer, 223-228
Splat Visualizer, 258-262
Tree Visualizer, 134-141
Tool Options button, 102
tools, xxxiii
invoking, 100
multiple selection and, 501
overview, 39-49
requirements, 101
Top View button, 195
-tprefix %s command-line option, 654
training sets, 322-323
-tran %s command-line option, 653
Treat Nulls as Zeros option, 160, 367
Tree Visualizer, 127-172
classifiers and, 346
color mappings, 138, 560-563
configuring, 536-571
Tool Manager and, 132-141
data files, 129, 531-535
naming, 541
reading, 541
data input, 531
data types, 532-535, 540
declaring, 542
decision trees and, 363, 385
displaying child nodes, 148
displaying hierarchies, 142, 555, 567
drilling through, 503
exiting, 152
external controls, 147-150
file requirements, 129
filtering data, 137, 158-162, 557
getting information, 143, 144, 155, 169
keywords, 539
loading files, 129, 131, 151, 296
main window, 128, 142-146
manipulating views, 146, 149, 569, 570
mapping requirements, 132-134
undoing, 134
marking viewpoints, 162-164
menus, 150-169
moving through, 146, 147, 167
nonsupported options, 132
null values and, 170
opening multiple windows, 151
options, 134-141
resetting, 141
saving, 141
startup, 131
overview, 47, 127
printing, 152
resetting defaults, 134
sample files, 171-172
saving defaults, 141
searching for objects, 154-158
selecting child nodes, 145, 148, 168
selecting columns, 497
selecting objects, 143, 144, 153, 155
null values and, 171, 211
spotlighting information, 144, 155, 158
starting, 130-131, 132, 141
from UNIX prompt, 131
startup options, 131
startup screen, 130
Tree Visualizer icon, 130
Tree Visualizer Options dialog box, 134
treeviz command-line option, 131
.treeviz extensions, 129
trends, 251, 287
two-dimensional aggregation, 199, 238, 271
two-dimensional arrays, 575, 585, 603
declaring, 609, 613
type casting, 540, 581, 609, 667
typographic conventions, xxxvii
U
u, 241
Undo Change button, 99
undoing mappings, 102
Map Visualizer, 183
Scatter Visualizer, 221
Splat Visualizer, 258
Tree Visualizer, 134
-uniq %d command-line option, 653
UNIX accounts, 51
UNIX commands, 140, 188, 228
UNIX diff command, 651
UNIX startup commands
Evidence Visualizer, 399
Map Visualizer, 179
Rules Visualizer, 295, 296
Scatter Visualizer, 219
Splat Visualizer, 256
Tree Visualizer, 131
unknown data values, 701
unlabeled datasets, 359, 382
unlabeled records, 325
Up button, 164
updating data, 544, 585
usa.cities.gfx, 212
usa.cities.hierarchy, 212
usa.cities.lines.gfx, 213
usa.cities.lines.hierarchy, 213
usa.states.gfx, 212, 213
usa.states.hierarchy, 212, 213, 595
use_ascii_mlc_input, 52
Use approach menu, 80
Auto, 80
Automatic, 80
Min weight per bin, 80
Use custom thresholds text box, 82
Use loss matrix option, 344
Use Random Colors command, 207
user-defined data types, 533
user interface, 63
User Specified Thresholds tab, 78, 82
Use Slider On Drill Through command, 208
Use Slider On Drill-Through command, 281
Use Symmetric Axes command, 312
Use Weight in clustering, 488
Use Weight menu, 81
Use Weight option, 344
US maps, 180, 212
usr/lib/MineSet/datamove, 51
usr/lib/MineSet/DBexamples, 59
usr/lib/MineSet/mapviz, 576
usr/lib/MineSet/scatterviz, 604
usr/lib/MineSet/splatviz, 630
usr/lib/MineSet/treeviz, 536
usr/lib/MineSet/www, 510
V
values, selecting multiple, 501
variable-based array types, 90
variable-length arrays, 534-535
declaring, 544
hierarchies and, 548, 551
separators, 524, 535
variables, 536
as filters, 205, 311
axes values and, 621, 643
axis, 621, 643
color, 560, 591, 619, 641, 670
declaring, 526, 537, 542, 577, 584, 612, 636
entity, 616
naming, 525, 537, 577, 605, 630
keywords and, 539, 579, 607, 632
null values and, 703
opacity, 639
size, 617
summary, 623, 644
variants, 524, 534
Vertical/Horizontal View button, 97
vertical sliders, 589, 616, 638
View All button, 147, 195
View All command, 168
Viewer button, 194
viewHierarchyLandscape.treeviz.options, 556
View History button, 95
viewing modes
Decision Table Visualizer, 444
Evidence Visualizer, 405
Map Visualizer, 191, 194
Rules Visualizer, 306
Scatter Visualizer, 232
Splat Visualizer, 264
view keyword, 555, 667
viewMap.mapviz.options, 588
View menu
Decision Table Visualizer, 447
Evidence Visualizer, 417
Map Visualizer, 204
Rules Visualizer, 312
Scatter Visualizer, 241
Splat Visualizer, 276
viewpoints, 162
manipulating, 146, 149, 569, 570
view.ruleviz.options, 675
views
current, 94
Map Visualizer, 173, 174, 175
display options, 184-188
moving through, 195
panning, 191
rotating, 191
moving through, 94
Rules Visualizer, 293
display options, 301-303, 675
panning, 306
rotating, 306
Scatter Visualizer
display options, 223-228, 626
panning, 232
rotating, 196, 232
Splat Visualizer
display options, 258-262, 645
panning, 264
rotating, 264
Tree Visualizer, 555
display options, 134-141, 565
moving through, 146, 147, 167
opening multiple, 151
overhead projections, 153, 566
spotlighting information, 144, 155, 158
view.scatterviz.options, 615
view sections
See also configuration files
Map Visualizer, 588-595
color statements, 591-593
execute statements, 594
height statements, 590
map statements, 589
message statements, 593-594
slider statements, 589
title statements, 588
Rules Visualizer, 667-676
color statements, 669-672
grid statements, 674
height statements, 668-669
item statements, 674
label statements, 672
message statements, 672-673
options, 675-676
Scatter Visualizer, 615-626
axis statements, 621-622
color statements, 618-621
entity statements, 616-617
execute statements, 625
message statements, 624-625
options, 626
size statements, 617-618
slider statements, 616
summary statements, 622-623
Splat Visualizer, 637-645
axis statements, 643
color statements, 640-643
opacity statements, 638-640
options, 645
slider statements, 638
summary statements, 644-645
Tree Visualizer, 555-571
base color statements, 562
base height statements, 558
color statements, 560-562
disk color statements, 563
disk height statements, 559
height statements, 556-558
label statements, 563
message statements, 563-565
options, 565-571
views
Map Visualizer
three-dimensional, 195
view.splatviz.options, 637
Visual Element Height - Bars option, 182
Visual Elements pane, 101
visualization tools, 40
Viz Tool menu, 100
Viz Tool panel, 99-102
-vopts command-line option, 653
vote.dtableviz, 461
vote-dt.treeviz, 373
vote.eviviz, 427
vote.schema, 373, 427, 461
voting records example, 373, 387, 427, 460
W
-warnexecute option, 131, 179, 218
warnings, 131, 179, 218
Web environments, 509-520
client installation, 511
extension files, 510
overview, 509
sample files, 510
security, 520
server installation, 512-513
locally, 513
remote systems, 513
Web files, 131, 179, 218
Weight command, 418, 448
Weight is Attribute in clustering, 488
Weight is Attribute option, 344
what-if questions, 389
wildcards, 243, 278
Map Visualizer, 206
Rules Visualizer, 311
Tree Visualizer, 157, 160
X
xconfirm command, 565, 594, 625
X coordinates (maps), 207
.Xdefaults files, 131, 179, 218
X sliders, 227
XY charts, 221, 257
X-Y vertex pairs, 207
Y
Y coordinates (maps), 207
years, 583, 611, 635
Y sliders, 227
Z
Zeros command, 165
zero values, 534
display options, 568
graphing, 622
nulls as, 160
objects and, 165
returning, 540
Tell Us About This Manual
As a user of Silicon Graphics products, you can help us to better understand your needs
and to improve the quality of our documentation.
Any information that you provide will be useful. Here is a list of suggested topics:
General impression of the document
Omission of material that you expected to find
Technical errors
Relevance of the material to the job you had to do
Quality of the printing and binding
Please send the title and part number of the document with your comments. The part
number for this document is 007-3214-004.
Thank you!
Three Ways to Reach Us
To send your comments by electronic mail, use either of these addresses:
On the Internet: techpubs@sgi.com
For UUCP mail (through any backbone site): [your_site]!sgi!techpubs
To fax your comments (or annotated copies of manual pages), use this
fax number: 650-932-0801
To send your comments by traditional mail, use this address:
Technical Publications
Silicon Graphics, Inc.
2011 North Shoreline Boulevard, M/S 535
Mountain View, California 94043-1389