The Data Warehouse Toolkit
Third Edition
Ralph Kimball
Margy Ross
The Definitive Guide to Dimensional Modeling
The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, Third Edition
Published by
John Wiley & Sons, Inc.
10475 Crosspoint Boulevard
Indianapolis, IN 46256
www.wiley.com
Copyright © 2013 by Ralph Kimball and Margy Ross
Published by John Wiley & Sons, Inc., Indianapolis, Indiana
Published simultaneously in Canada
ISBN: 978-1-118-53080-1
ISBN: 978-1-118-53077-1 (ebk)
ISBN: 978-1-118-73228-1 (ebk)
ISBN: 978-1-118-73219-9 (ebk)
Manufactured in the United States of America
10 9 8 7 6 5 4 3 2 1
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Web site is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or website may provide or recommendations it may make. Further, readers should be aware that Internet websites listed in this work may have changed or disappeared between when this work was written and when it is read.
For general information on our other products and services please contact our Customer Care
Department within the United States at (877) 762-2974, outside the United States at (317) 572-3993 or fax
(317) 572-4002.
Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.
Library of Congress Control Number: 2013936841
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
About the Authors

Ralph Kimball founded the Kimball Group. Since the mid-1980s, he has been the data warehouse and business intelligence industry’s thought leader on the dimensional approach. He has educated tens of thousands of IT professionals. The Toolkit books written by Ralph and his colleagues have been the industry’s best sellers since 1996. Prior to working at Metaphor and founding Red Brick Systems, Ralph coinvented the Star workstation, the first commercial product with windows, icons, and a mouse, at Xerox’s Palo Alto Research Center (PARC). Ralph has a PhD in electrical engineering from Stanford University.

Margy Ross is president of the Kimball Group. She has focused exclusively on data warehousing and business intelligence since 1982 with an emphasis on business requirements and dimensional modeling. Like Ralph, Margy has taught dimensional best practices to thousands of students; she also coauthored five Toolkit books with Ralph. Margy previously worked at Metaphor and cofounded DecisionWorks Consulting. She graduated with a BS in industrial engineering from Northwestern University.
Credits
Executive Editor
Robert Elliott
Project Editor
Maureen Spears
Senior Production Editor
Kathleen Wisor
Copy Editor
Apostrophe Editing Services
Editorial Manager
Mary Beth Wakefield
Freelancer Editorial Manager
Rosemarie Graham
Associate Director of Marketing
David Mayhew
Marketing Manager
Ashley Zurcher
Business Manager
Amy Knies
Production Manager
Tim Tate
Vice President and Executive Group Publisher
Richard Swadley
Vice President and Executive Publisher
Neil Edde
Associate Publisher
Jim Minatel
Project Coordinator, Cover
Katie Crocker
Proofreader
Word One, New York
Indexer
Johnna VanHoose Dinse
Cover Image
iStockphoto.com / teekid
Cover Designer
Ryan Sneed
Acknowledgments
First, thanks to the hundreds of thousands who have read our Toolkit books,
attended our courses, and engaged us in consulting projects. We have learned as
much from you as we have taught. Collectively, you have had a profoundly positive
impact on the data warehousing and business intelligence industry. Congratulations!
Our Kimball Group colleagues, Bob Becker, Joy Mundy, and Warren Thornthwaite,
have worked with us to apply the techniques described in this book literally thousands of times, over nearly 30 years of working together. Every technique in this
book has been thoroughly vetted by practice in the real world. We appreciate their
input and feedback on this book—and more important, the years we have shared
as business partners, along with Julie Kimball.
Bob Elliott, our executive editor at John Wiley & Sons, project editor Maureen
Spears, and the rest of the Wiley team have supported this project with skill and
enthusiasm. As always, it has been a pleasure to work with them.
To our families, thank you for your unconditional support throughout our
careers. Spouses Julie Kimball and Scott Ross and children Sara Hayden Smith,
Brian Kimball, and Katie Ross all contributed in countless ways to this book.
Contents
Introduction.......................................xxvii
1 Data Warehousing, Business Intelligence, and Dimensional
Modeling Primer.....................................1
Different Worlds of Data Capture and Data Analysis ................... 2
Goals of Data Warehousing and Business Intelligence .................. 3
Publishing Metaphor for DW/BI Managers ....................... 5
Dimensional Modeling Introduction ...............................7
Star Schemas Versus OLAP Cubes .............................8
Fact Tables for Measurements ............................... 10
Dimension Tables for Descriptive Context...................... 13
Facts and Dimensions Joined in a Star Schema ................... 16
Kimball’s DW/BI Architecture................................... 18
Operational Source Systems ................................. 18
Extract, Transformation, and Load System ...................... 19
Presentation Area to Support Business Intelligence................21
Business Intelligence Applications ............................22
Restaurant Metaphor for the Kimball Architecture ................ 23
Alternative DW/BI Architectures................................. 26
Independent Data Mart Architecture .......................... 26
Hub-and-Spoke Corporate Information Factory Inmon Architecture ..28
Hybrid Hub-and-Spoke and Kimball Architecture .................29
Dimensional Modeling Myths...................................30
Myth 1: Dimensional Models are Only for Summary Data.......... 30
Myth 2: Dimensional Models are Departmental, Not Enterprise ..... 31
Myth 3: Dimensional Models are Not Scalable ................... 31
Myth 4: Dimensional Models are Only for Predictable Usage ........ 31
Myth 5: Dimensional Models Can’t Be Integrated ................ 32
More Reasons to Think Dimensionally ............................32
Agile Considerations ..........................................34
Summary.................................................. 35
2 Kimball Dimensional Modeling Techniques Overview .........37
Fundamental Concepts ........................................ 37
Gather Business Requirements and Data Realities ................. 37
Collaborative Dimensional Modeling Workshops .................38
Four-Step Dimensional Design Process .........................38
Business Processes........................................ 39
Grain .................................................. 39
Dimensions for Descriptive Context ........................... 40
Facts for Measurements ....................................40
Star Schemas and OLAP Cubes ..............................40
Graceful Extensions to Dimensional Models ..................... 41
Basic Fact Table Techniques .................................... 41
Fact Table Structure ....................................... 41
Additive, Semi-Additive, Non-Additive Facts .................... 42
Nulls in Fact Tables ....................................... 42
Conformed Facts ......................................... 42
Transaction Fact Tables .................................... 43
Periodic Snapshot Fact Tables ............................... 43
Accumulating Snapshot Fact Tables...........................44
Factless Fact Tables .......................................44
Aggregate Fact Tables or OLAP Cubes ......................... 45
Consolidated Fact Tables................................... 45
Basic Dimension Table Techniques...............................46
Dimension Table Structure ..................................46
Dimension Surrogate Keys ..................................46
Natural, Durable, and Supernatural Keys.......................46
Drilling Down ........................................... 47
Degenerate Dimensions .................................... 47
Denormalized Flattened Dimensions.......................... 47
Multiple Hierarchies in Dimensions ........................... 48
Flags and Indicators as Textual Attributes .......................48
Null Attributes in Dimensions ...............................48
Calendar Date Dimensions ..................................48
Role-Playing Dimensions ................................... 49
Junk Dimensions .........................................49
Snowflaked Dimensions ....................................50
Outrigger Dimensions.....................................50
Integration via Conformed Dimensions ...........................50
Conformed Dimensions .................................... 51
Shrunken Dimensions ..................................... 51
Drilling Across........................................... 51
Value Chain ............................................. 52
Enterprise Data Warehouse Bus Architecture .................... 52
Enterprise Data Warehouse Bus Matrix ......................... 52
Detailed Implementation Bus Matrix.......................... 53
Opportunity/Stakeholder Matrix ............................. 53
Dealing with Slowly Changing Dimension Attributes................. 53
Type 0: Retain Original ....................................54
Type 1: Overwrite ........................................54
Type 2: Add New Row .....................................54
Type 3: Add New Attribute ................................. 55
Type 4: Add Mini-Dimension ................................ 55
Type 5: Add Mini-Dimension and Type 1 Outrigger ............... 55
Type 6: Add Type 1 Attributes to Type 2 Dimension............... 56
Type 7: Dual Type 1 and Type 2 Dimensions .................... 56
Dealing with Dimension Hierarchies ..............................56
Fixed Depth Positional Hierarchies ............................ 56
Slightly Ragged/Variable Depth Hierarchies ..................... 57
Ragged/Variable Depth Hierarchies with Hierarchy Bridge Tables .... 57
Ragged/Variable Depth Hierarchies with Pathstring Attributes ....... 57
Advanced Fact Table Techniques ................................ 58
Fact Table Surrogate Keys...................................58
Centipede Fact Tables ..................................... 58
Numeric Values as Attributes or Facts ......................... 59
Lag/Duration Facts........................................ 59
Header/Line Fact Tables .................................... 59
Allocated Facts ...........................................60
Profit and Loss Fact Tables Using Allocations ....................60
Multiple Currency Facts ....................................60
Multiple Units of Measure Facts .............................. 61
Year-to-Date Facts........................................ 61
Multipass SQL to Avoid Fact-to-Fact Table Joins ..................61
Timespan Tracking in Fact Tables .............................62
Late Arriving Facts ........................................ 62
Advanced Dimension Techniques ................................62
Dimension-to-Dimension Table Joins ..........................62
Multivalued Dimensions and Bridge Tables ..................... 63
Time Varying Multivalued Bridge Tables ....................... 63
Behavior Tag Time Series ................................... 63
Behavior Study Groups ....................................64
Aggregated Facts as Dimension Attributes ......................64
Dynamic Value Bands .....................................64
Text Comments Dimension.................................65
Multiple Time Zones ......................................65
Measure Type Dimensions ..................................65
Step Dimensions ......................................... 65
Hot Swappable Dimensions .................................66
Abstract Generic Dimensions ................................66
Audit Dimensions .........................................66
Late Arriving Dimensions ................................... 67
Special Purpose Schemas...................................... 67
Supertype and Subtype Schemas for Heterogeneous Products ...... 67
Real-Time Fact Tables ......................................68
Error Event Schemas ......................................68
3 Retail Sales .........................................69
Four-Step Dimensional Design Process............................ 70
Step 1: Select the Business Process ............................70
Step 2: Declare the Grain ................................... 71
Step 3: Identify the Dimensions ..............................72
Step 4: Identify the Facts ...................................72
Retail Case Study ............................................ 72
Step 1: Select the Business Process ............................ 74
Step 2: Declare the Grain ................................... 74
Step 3: Identify the Dimensions .............................. 76
Step 4: Identify the Facts ................................... 76
Dimension Table Details ....................................... 79
Date Dimension .......................................... 79
Product Dimension ....................................... 83
Store Dimension ......................................... 87
Promotion Dimension .....................................89
Other Retail Sales Dimensions............................... 92
Degenerate Dimensions for Transaction Numbers ................ 93
Retail Schema in Action ....................................... 94
Retail Schema Extensibility ....................................95
Factless Fact Tables ........................................... 97
Dimension and Fact Table Keys .................................. 98
Dimension Table Surrogate Keys .............................98
Dimension Natural and Durable Supernatural Keys.............. 100
Degenerate Dimension Surrogate Keys ....................... 101
Date Dimension Smart Keys ................................ 101
Fact Table Surrogate Keys.................................. 102
Resisting Normalization Urges ................................. 104
Snowflake Schemas with Normalized Dimensions............... 104
Outriggers ............................................. 106
Centipede Fact Tables with Too Many Dimensions ............... 108
Summary................................................. 109
4 Inventory ......................................... 111
Value Chain Introduction.....................................111
Inventory Models ...........................................112
Inventory Periodic Snapshot ................................113
Inventory Transactions ....................................116
Inventory Accumulating Snapshot........................... 118
Fact Table Types ............................................119
Transaction Fact Tables ................................... 120
Periodic Snapshot Fact Tables .............................. 120
Accumulating Snapshot Fact Tables.......................... 121
Complementary Fact Table Types ........................... 122
Value Chain Integration ...................................... 122
Enterprise Data Warehouse Bus Architecture ....................... 123
Understanding the Bus Architecture ......................... 124
Enterprise Data Warehouse Bus Matrix ........................ 125
Conformed Dimensions...................................... 130
Drilling Across Fact Tables ................................. 130
Identical Conformed Dimensions ............................ 131
Shrunken Rollup Conformed Dimension with Attribute Subset ..... 132
Shrunken Conformed Dimension with Row Subset .............. 132
Shrunken Conformed Dimensions on the Bus Matrix ............. 134
Limited Conformity...................................... 135
Importance of Data Governance and Stewardship ............... 135
Conformed Dimensions and the Agile Movement............... 137
Conformed Facts ........................................... 138
Summary................................................. 139
5 Procurement...................................... 141
Procurement Case Study ..................................... 141
Procurement Transactions and Bus Matrix ........................ 142
Single Versus Multiple Transaction Fact Tables .................. 143
Complementary Procurement Snapshot....................... 147
Slowly Changing Dimension Basics ............................. 147
Type 0: Retain Original ................................... 148
Type 1: Overwrite ....................................... 149
Type 2: Add New Row .................................... 150
Type 3: Add New Attribute ................................ 154
Type 4: Add Mini-Dimension ............................... 156
Hybrid Slowly Changing Dimension Techniques .................... 159
Type 5: Mini-Dimension and Type 1 Outrigger ................. 160
Type 6: Add Type 1 Attributes to Type 2 Dimension.............. 160
Type 7: Dual Type 1 and Type 2 Dimensions ................... 162
Slowly Changing Dimension Recap ............................. 164
Summary................................................. 165
6 Order Management ................................. 167
Order Management Bus Matrix ................................ 168
Order Transactions .......................................... 168
Fact Normalization ....................................... 169
Dimension Role Playing................................... 170
Product Dimension Revisited............................... 172
Customer Dimension .....................................174
Deal Dimension ......................................... 177
Degenerate Dimension for Order Number ..................... 178
Junk Dimensions ........................................ 179
Header/Line Pattern to Avoid ............................... 181
Multiple Currencies...................................... 182
Transaction Facts at Different Granularity ..................... 184
Another Header/Line Pattern to Avoid........................ 186
Invoice Transactions ......................................... 187
Service Level Performance as Facts, Dimensions, or Both .......... 188
Profit and Loss Facts ...................................... 189
Audit Dimension........................................ 192
Accumulating Snapshot for Order Fulfillment Pipeline ............... 194
Lag Calculations ......................................... 196
Multiple Units of Measure................................. 197
Beyond the Rearview Mirror ............................... 198
Summary................................................. 199
7 Accounting .......................................201
Accounting Case Study and Bus Matrix .......................... 202
General Ledger Data ........................................ 203
General Ledger Periodic Snapshot........................... 203
Chart of Accounts ....................................... 203
Period Close ............................................204
Year-to-Date Facts.......................................206
Multiple Currencies Revisited ...............................206
General Ledger Journal Transactions ......................... 206
Multiple Fiscal Accounting Calendars .........................208
Drilling Down Through a Multilevel Hierarchy ..................209
Financial Statements ..................................... 209
Budgeting Process .......................................... 210
Dimension Attribute Hierarchies ................................ 214
Fixed Depth Positional Hierarchies ........................... 214
Slightly Ragged Variable Depth Hierarchies.................... 214
Ragged Variable Depth Hierarchies .......................... 215
Shared Ownership in a Ragged Hierarchy ..................... 219
Time Varying Ragged Hierarchies ...........................220
Modifying Ragged Hierarchies .............................. 220
Alternative Ragged Hierarchy Modeling Approaches............. 221
Advantages of the Bridge Table Approach for Ragged Hierarchies ... 223
Consolidated Fact Tables ..................................... 224
Role of OLAP and Packaged Analytic Solutions ..................... 226
Summary................................................. 227
8 Customer Relationship Management....................229
CRM Overview ............................................. 230
Operational and Analytic CRM .............................. 231
Customer Dimension Attributes................................ 233
Name and Address Parsing ................................ 233
International Name and Address Considerations ................ 236
Customer-Centric Dates ................................... 238
Aggregated Facts as Dimension Attributes ..................... 239
Segmentation Attributes and Scores ......................... 240
Counts with Type 2 Dimension Changes...................... 243
Outrigger for Low Cardinality Attribute Set.................... 243
Customer Hierarchy Considerations ..........................244
Bridge Tables for Multivalued Dimensions ........................ 245
Bridge Table for Sparse Attributes ........................... 247
Bridge Table for Multiple Customer Contacts ................... 248
Complex Customer Behavior.................................. 249
Behavior Study Groups for Cohorts.......................... 249
Step Dimension for Sequential Behavior ....................... 251
Timespan Fact Tables ..................................... 252
Tagging Fact Tables with Satisfaction Indicators .................254
Tagging Fact Tables with Abnormal Scenario Indicators ........... 255
Customer Data Integration Approaches..........................256
Master Data Management Creating a Single Customer Dimension .. 256
Partial Conformity of Multiple Customer Dimensions .............258
Avoiding Fact-to-Fact Table Joins............................ 259
Low Latency Reality Check ................................... 260
Summary................................................. 261
9 Human Resources Management ........................ 263
Employee Profile Tracking ..................................... 263
Precise Effective and Expiration Timespans .................... 265
Dimension Change Reason Tracking ......................... 266
Profile Changes as Type 2 Attributes or Fact Events.............. 267
Headcount Periodic Snapshot .................................. 267
Bus Matrix for HR Processes................................... 268
Packaged Analytic Solutions and Data Models..................... 270
Recursive Employee Hierarchies ................................ 271
Change Tracking on Embedded Manager Key .................. 272
Drilling Up and Down Management Hierarchies ................ 273
Multivalued Skill Keyword Attributes ............................ 274
Skill Keyword Bridge ..................................... 275
Skill Keyword Text String .................................. 276
Survey Questionnaire Data .................................... 277
Text Comments ......................................... 278
Summary................................................. 279
10 Financial Services ...................................281
Banking Case Study and Bus Matrix............................. 282
Dimension Triage to Avoid Too Few Dimensions .................... 283
Household Dimension....................................286
Multivalued Dimensions and Weighting Factors ................. 287
Mini-Dimensions Revisited ................................. 289
Adding a Mini-Dimension to a Bridge Table .................... 290
Dynamic Value Banding of Facts ............................ 291
Supertype and Subtype Schemas for Heterogeneous Products ......... 293
Supertype and Subtype Products with Common Facts ........... 295
Hot Swappable Dimensions................................... 296
Summary................................................. 296
11 Telecommunications ................................297
Telecommunications Case Study and Bus Matrix ................... 297
General Design Review Considerations ........................... 299
Balance Business Requirements and Source Realities .............300
Focus on Business Processes ................................300
Granularity ............................................300
Single Granularity for Facts ................................ 301
Dimension Granularity and Hierarchies ....................... 301
Date Dimension ......................................... 302
Degenerate Dimensions ................................... 303
Surrogate Keys .......................................... 303
Dimension Decodes and Descriptions ........................ 303
Conformity Commitment .................................304
Design Review Guidelines .....................................304
Draft Design Exercise Discussion ...............................306
Remodeling Existing Data Structures ............................309
Geographic Location Dimension ............................... 310
Summary................................................. 310
12 Transportation ..................................... 311
Airline Case Study and Bus Matrix .............................. 311
Multiple Fact Table Granularities ............................ 312
Linking Segments into Trips ................................ 315
Related Fact Tables ....................................... 316
Extensions to Other Industries................................. 317
Cargo Shipper .......................................... 317
Travel Services .......................................... 317
Combining Correlated Dimensions .............................. 318
Class of Service ......................................... 319
Origin and Destination ................................... 320
More Date and Time Considerations ............................ 321
Country-Specific Calendars as Outriggers ..................... 321
Date and Time in Multiple Time Zones ....................... 323
Localization Recap.......................................... 324
Summary................................................. 324
13 Education ........................................325
University Case Study and Bus Matrix ............................325
Accumulating Snapshot Fact Tables ............................. 326
Applicant Pipeline ....................................... 326
Research Grant Proposal Pipeline ............................ 329
Factless Fact Tables .......................................... 329
Admissions Events....................................... 330
Course Registrations ..................................... 330
Facility Utilization ........................................ 334
Student Attendance ...................................... 335
More Educational Analytic Opportunities ......................... 336
Summary................................................. 336
14 Healthcare ........................................ 339
Healthcare Case Study and Bus Matrix ........................... 339
Claims Billing and Payments ................................... 342
Date Dimension Role Playing ............................... 345
Multivalued Diagnoses .................................... 345
Supertypes and Subtypes for Charges........................ 347
Electronic Medical Records ....................................348
Measure Type Dimension for Sparse Facts.....................349
Freeform Text Comments ................................. 350
Images ................................................ 350
Facility/Equipment Inventory Utilization.......................... 351
Dealing with Retroactive Changes .............................. 351
Summary................................................. 352
15 Electronic Commerce ................................ 353
Clickstream Source Data ...................................... 353
Clickstream Data Challenges............................... 354
Clickstream Dimensional Models............................... 357
Page Dimension ......................................... 358
Event Dimension........................................ 359
Session Dimension ....................................... 359
Referral Dimension .......................................360
Clickstream Session Fact Table .............................. 361
Clickstream Page Event Fact Table........................... 363
Step Dimension ......................................... 366
Aggregate Clickstream Fact Tables ........................... 366
Google Analytics........................................ 367
Integrating Clickstream into Web Retailers Bus Matrix ............... 368
Profitability Across Channels Including Web ....................... 370
Summary................................................. 373
16 Insurance ......................................... 375
Insurance Case Study........................................ 376
Insurance Value Chain.................................... 377
Draft Bus Matrix ........................................ 378
Policy Transactions .......................................... 379
Dimension Role Playing...................................380
Slowly Changing Dimensions ............................... 380
Mini-Dimensions for Large or Rapidly Changing Dimensions ....... 381
Multivalued Dimension Attributes........................... 382
Numeric Attributes as Facts or Dimensions .................... 382
Degenerate Dimension ................................... 383
Low Cardinality Dimension Tables........................... 383
Audit Dimension........................................ 383
Policy Transaction Fact Table............................... 383
Heterogeneous Supertype and Subtype Products ...............384
Complementary Policy Accumulating Snapshot .................384
Premium Periodic Snapshot................................... 385
Conformed Dimensions ...................................386
Conformed Facts ........................................386
Pay-in-Advance Facts .....................................386
Heterogeneous Supertypes and Subtypes Revisited .............. 387
Multivalued Dimensions Revisited ...........................388
More Insurance Case Study Background ..........................388
Updated Insurance Bus Matrix .............................. 389
Detailed Implementation Bus Matrix.........................390
Claim Transactions .......................................... 390
Transaction Versus Profile Junk Dimensions .................... 392
Claim Accumulating Snapshot................................. 392
Accumulating Snapshot for Complex Workflows................ 393
Timespan Accumulating Snapshot ........................... 394
Periodic Instead of Accumulating Snapshot .................... 395
Policy/Claim Consolidated Periodic Snapshot...................... 395
Factless Accident Events ...................................... 396
Common Dimensional Modeling Mistakes to Avoid ................. 397
Mistake 10: Place Text Attributes in a Fact Table ................. 397
Mistake 9: Limit Verbose Descriptors to Save Space .............. 398
Mistake 8: Split Hierarchies into Multiple Dimensions ............ 398
Mistake 7: Ignore the Need to Track Dimension Changes ......... 398
Mistake 6: Solve All Performance Problems with More Hardware ....399
Mistake 5: Use Operational Keys to Join Dimensions and Facts ...... 399
Mistake 4: Neglect to Declare and Comply with the Fact Grain ..... 399
Mistake 3: Use a Report to Design the Dimensional Model ........400
Mistake 2: Expect Users to Query Normalized Atomic Data ........400
Mistake 1: Fail to Conform Facts and Dimensions ...............400
Summary................................................. 401
17 Kimball DW/BI Lifecycle Overview......................403
Lifecycle Roadmap..........................................404
Roadmap Mile Markers ................................... 405
Lifecycle Launch Activities ....................................406
Program/Project Planning and Management ...................406
Business Requirements Definition ........................... 410
Lifecycle Technology Track .................................... 416
Technical Architecture Design.............................. 416
Product Selection and Installation........................... 418
Lifecycle Data Track......................................... 420
Dimensional Modeling .................................... 420
Physical Design ......................................... 420
ETL Design and Development.............................. 422
Lifecycle BI Applications Track ................................. 422
BI Application Specification................................ 423
BI Application Development ............................... 423
Lifecycle Wrap-up Activities................................... 424
Deployment ............................................ 424
Maintenance and Growth ................................. 425
Common Pitfalls to Avoid ..................................... 426
Summary................................................. 427
18 Dimensional Modeling Process and Tasks .................429
Modeling Process Overview ................................... 429
Get Organized............................................. 431
Identify Participants, Especially Business Representatives .......... 431
Review the Business Requirements ........................... 432
Leverage a Modeling Tool................................. 432
Leverage a Data Profiling Tool.............................. 433
Leverage or Establish Naming Conventions.................... 433
Coordinate Calendars and Facilities.......................... 433
Design the Dimensional Model ................................ 434
Reach Consensus on High-Level Bubble Chart .................. 435
Develop the Detailed Dimensional Model..................... 436
Review and Validate the Model ............................. 439
Finalize the Design Documentation.......................... 441
Summary................................................. 441
19 ETL Subsystems and Techniques .......................443
Round Up the Requirements...................................444
Business Needs .........................................444
Compliance ............................................445
Data Quality ........................................... 445
Security ...............................................446
Data Integration ........................................446
Data Latency........................................... 447
Archiving and Lineage .................................... 447
BI Delivery Interfaces .....................................448
Available Skills..........................................448
Legacy Licenses .........................................449
The 34 Subsystems of ETL ....................................449
Extracting: Getting Data into the Data Warehouse .................. 450
Subsystem 1: Data Profiling................................ 450
Subsystem 2: Change Data Capture System .................... 451
Subsystem 3: Extract System............................... 453
Cleaning and Conforming Data................................ 455
Improving Data Quality Culture and Processes .................. 455
Subsystem 4: Data Cleansing System ......................... 456
Subsystem 5: Error Event Schema ...........................458
Subsystem 6: Audit Dimension Assembler.....................460
Subsystem 7: Deduplication System ..........................460
Subsystem 8: Conforming System........................... 461
Delivering: Prepare for Presentation............................. 463
Subsystem 9: Slowly Changing Dimension Manager.............464
Subsystem 10: Surrogate Key Generator ...................... 469
Subsystem 11: Hierarchy Manager ........................... 470
Subsystem 12: Special Dimensions Manager ................... 470
Subsystem 13: Fact Table Builders ........................... 473
Subsystem 14: Surrogate Key Pipeline ........................ 475
Subsystem 15: Multivalued Dimension Bridge Table Builder ........ 477
Subsystem 16: Late Arriving Data Handler ..................... 478
Subsystem 17: Dimension Manager System .................... 479
Subsystem 18: Fact Provider System ..........................480
Subsystem 19: Aggregate Builder ............................ 481
Subsystem 20: OLAP Cube Builder ........................... 481
Subsystem 21: Data Propagation Manager .....................482
Managing the ETL Environment ................................ 483
Subsystem 22: Job Scheduler ............................... 483
Subsystem 23: Backup System .............................. 485
Subsystem 24: Recovery and Restart System ...................486
Subsystem 25: Version Control System .......................488
Subsystem 26: Version Migration System ......................488
Subsystem 27: Workflow Monitor ........................... 489
Subsystem 28: Sorting System ..............................490
Subsystem 29: Lineage and Dependency Analyzer ............... 490
Subsystem 30: Problem Escalation System ..................... 491
Subsystem 31: Parallelizing/Pipelining System .................. 492
Subsystem 32: Security System ............................. 492
Subsystem 33: Compliance Manager ......................... 493
Subsystem 34: Metadata Repository Manager ................. 495
Summary................................................. 496
20 ETL System Design and Development Process and Tasks .....497
ETL Process Overview ........................................ 497
Develop the ETL Plan........................................ 498
Step 1: Draw the High-Level Plan ............................ 498
Step 2: Choose an ETL Tool................................ 499
Step 3: Develop Default Strategies ...........................500
Step 4: Drill Down by Target Table ...........................500
Develop the ETL Specification Document ..................... 502
Develop One-Time Historic Load Processing ....................... 503
Step 5: Populate Dimension Tables with Historic Data............ 503
Step 6: Perform the Fact Table Historic Load ...................508
Develop Incremental ETL Processing............................. 512
Step 7: Dimension Table Incremental Processing................ 512
Step 8: Fact Table Incremental Processing ..................... 515
Step 9: Aggregate Table and OLAP Loads ..................... 519
Step 10: ETL System Operation and Automation................ 519
Real-Time Implications....................................... 520
Real-Time Triage ........................................ 521
Real-Time Architecture Trade-Offs........................... 522
Real-Time Partitions in the Presentation Server.................. 524
Summary................................................. 526
21 Big Data Analytics..................................527
Big Data Overview .......................................... 527
Extended RDBMS Architecture .............................. 529
MapReduce/Hadoop Architecture........................... 530
Comparison of Big Data Architectures........................ 530
Recommended Best Practices for Big Data........................ 531
Management Best Practices for Big Data...................... 531
Architecture Best Practices for Big Data....................... 533
Data Modeling Best Practices for Big Data ..................... 538
Data Governance Best Practices for Big Data................... 541
Summary................................................. 542
Index ............................................543
The data warehousing and business intelligence (DW/BI) industry certainly has
matured since Ralph Kimball published the first edition of The Data Warehouse
Toolkit (Wiley) in 1996. Although large corporate early adopters paved the way, DW/
BI has since been embraced by organizations of all sizes. The industry has built
thousands of DW/BI systems. The volume of data continues to grow as warehouses
are populated with increasingly atomic data and updated with greater frequency.
Over the course of our careers, we have seen databases grow from megabytes to
gigabytes to terabytes to petabytes, yet the basic challenge of DW/BI systems has
remained remarkably constant. Our job is to marshal an organization’s data and
bring it to business users for their decision making. Collectively, you’ve delivered
on this objective; business professionals everywhere are making better decisions
and generating payback on their DW/BI investments.
Since the fi rst edition of The Data Warehouse Toolkit was published, dimensional
modeling has been broadly accepted as the dominant technique for DW/BI presenta-
tion. Practitioners and pundits alike have recognized that the presentation of data
must be grounded in simplicity if it is to stand any chance of success. Simplicity is
the fundamental key that allows users to easily understand databases and software
to e ciently navigate databases. In many ways, dimensional modeling amounts
to holding the fort against assaults on simplicity. By consistently returning to a
business-driven perspective and by refusing to compromise on the goals of user
understandability and query performance, you establish a coherent design that
serves the organization's analytic needs. This dimensionally modeled framework
becomes the platform for BI. Based on our experience and the overwhelming feed-
back from numerous practitioners from companies like your own, we believe that
dimensional modeling is absolutely critical to a successful DW/BI initiative.
Dimensional modeling also has emerged as the leading architecture for building
integrated DW/BI systems. When you use the conformed dimensions and con-
formed facts of a set of dimensional models, you have a practical and predictable
framework for incrementally building complex DW/BI systems that are inherently
distributed.
For all that has changed in our industry, the core dimensional modeling tech-
niques that Ralph Kimball published 17 years ago have withstood the test of time.
Concepts such as conformed dimensions, slowly changing dimensions, heteroge-
neous products, factless fact tables, and the enterprise data warehouse bus matrix
continue to be discussed in design workshops around the globe. The original con-
cepts have been embellished and enhanced by new and complementary techniques.
We decided to publish this third edition of Kimball’s seminal work because we felt
that it would be useful to summarize our collective dimensional modeling experi-
ence under a single cover. We have each focused exclusively on decision support,
data warehousing, and business intelligence for more than three decades. We want
to share the dimensional modeling patterns that have emerged repeatedly during
the course of our careers. This book is loaded with specific, practical design recom-
mendations based on real-world scenarios.
The goal of this book is to provide a one-stop shop for dimensional modeling
techniques. True to its title, it is a toolkit of dimensional design principles and
techniques. We address the needs of those just starting in dimensional DW/BI and
we describe advanced concepts for those of you who have been at this a while. We
believe that this book stands alone in its depth of coverage on the topic of dimen-
sional modeling. It’s the definitive guide.
Intended Audience
This book is intended for data warehouse and business intelligence designers, imple-
menters, and managers. In addition, business analysts and data stewards who are
active participants in a DW/BI initiative will find the content useful.
Even if you’re not directly responsible for the dimensional model, we believe it
is important for all members of a project team to be comfortable with dimensional
modeling concepts. The dimensional model has an impact on most aspects of a
DW/BI implementation, beginning with the translation of business requirements,
through the extract, transformation and load (ETL) processes, and finally, to the
unveiling of a data warehouse through business intelligence applications. Due to the
broad implications, you need to be conversant in dimensional modeling regardless
of whether you are responsible primarily for project management, business analysis,
data architecture, database design, ETL, BI applications, or education and support.
We’ve written this book so it is accessible to a broad audience.
For those of you who have read the earlier editions of this book, some of the
familiar case studies will reappear in this edition; however, they have been updated
significantly and fleshed out with richer content, including sample enterprise data
warehouse bus matrices for nearly every case study. We have developed vignettes
for new subject areas, including big data analytics.
The content in this book is somewhat technical. We primarily discuss dimen-
sional modeling in the context of a relational database with nuances for online
analytical processing (OLAP) cubes noted where appropriate. We presume you
have basic knowledge of relational database concepts such as tables, rows, keys,
and joins. Given we will be discussing dimensional models in a nondenominational
manner, we won’t dive into specific physical design and tuning guidance for any
given database management system.
Chapter Preview
The book is organized around a series of business vignettes or case studies. We
believe developing the design techniques by example is an extremely effective
approach because it allows us to share very tangible guidance and the benefits of
real world experience. Although not intended to be full-scale application or indus-
try solutions, these examples serve as a framework to discuss the patterns that
emerge in dimensional modeling. In our experience, it is often easier to grasp the
main elements of a design technique by stepping away from the all-too-familiar
complexities of one’s own business. Readers of the earlier editions have responded
very favorably to this approach.
Be forewarned that we deviate from the case study approach in Chapter 2: Kimball
Dimensional Modeling Techniques Overview. Given the broad industry acceptance
of the dimensional modeling techniques invented by the Kimball Group, we have
consolidated the official listing of our techniques, along with concise descriptions
and pointers to more detailed coverage and illustrations of these techniques in
subsequent chapters. Although not intended to be read from start to finish like the
other chapters, we feel this technique-centric chapter is a useful reference and can
even serve as a professional checklist for DW/BI designers.
With the exception of Chapter 2, the other chapters of this book build on one
another. We start with basic concepts and introduce more advanced content as the
book unfolds. The chapters should be read in order by every reader. For example, it
might be di cult to comprehend Chapter 16: Insurance, unless you have read the
preceding chapters on retailing, procurement, order management, and customer
relationship management.
Those of you who have read the last edition may be tempted to skip the first
few chapters. Although some of the early fact and dimension grounding may be
familiar turf, we don’t want you to sprint too far ahead. You’ll miss out on updates
to fundamental concepts if you skip ahead too quickly.
NOTE This book is laced with tips (like this note), key concept listings, and
chapter pointers to make it more useful and easily referenced in the future.
Chapter 1: Data Warehousing, Business Intelligence,
and Dimensional Modeling Primer
The book begins with a primer on data warehousing, business intelligence, and
dimensional modeling. We explore the components of the overall DW/BI archi-
tecture and establish the core vocabulary used during the remainder of the book.
Some of the myths and misconceptions about dimensional modeling are dispelled.
Chapter 2: Kimball Dimensional Modeling
Techniques Overview
This chapter describes more than 75 dimensional modeling techniques and pat-
terns. This o cial listing of the Kimball techniques includes forward pointers to
subsequent chapters where the techniques are brought to life in case study vignettes.
Chapter 3: Retail Sales
Retailing is the classic example used to illustrate dimensional modeling. We start
with the classic because it is one that we all understand. Hopefully, you won’t need
to think very hard about the industry because we want you to focus on core dimen-
sional modeling concepts instead. We begin by discussing the four-step process for
designing dimensional models. We explore dimension tables in depth, including
the date dimension that will be reused repeatedly throughout the book. We also
discuss degenerate dimensions, snowflaking, and surrogate keys. Even if you’re not
a retailer, this chapter is required reading because it is chock full of fundamentals.
Chapter 4: Inventory
We remain within the retail industry for the second case study but turn your atten-
tion to another business process. This chapter introduces the enterprise data ware-
house bus architecture and the bus matrix with conformed dimensions. These
concepts are critical to anyone looking to construct a DW/BI architecture that is
integrated and extensible. We also compare the three fundamental types of fact
tables: transaction, periodic snapshot, and accumulating snapshot.
Chapter 5: Procurement
This chapter reinforces the importance of looking at your organization’s value chain
as you plot your DW/BI environment. We also explore a series of basic and advanced
techniques for handling slowly changing dimension attributes; we’ve built on the
long-standing foundation of type 1 (overwrite), type 2 (add a row), and type 3 (add
a column) as we introduce readers to type 0 and types 4 through 7.
Chapter 6: Order Management
In this case study, we look at the business processes that are often the first to be
implemented in DW/BI systems as they supply core business performance met-
rics—what are we selling to which customers at what price? We discuss dimensions
that play multiple roles within a schema. We also explore the common challenges
modelers face when dealing with order management information, such as header/
line item considerations, multiple currencies or units of measure, and junk dimen-
sions with miscellaneous transaction indicators.
Chapter 7: Accounting
We discuss the modeling of general ledger information for the data warehouse in
this chapter. We describe the appropriate handling of year-to-date facts and multiple
fiscal calendars, as well as consolidated fact tables that combine data from mul-
tiple business processes. We also provide detailed guidance on dimension attribute
hierarchies, from simple denormalized fixed depth hierarchies to bridge tables for
navigating more complex ragged, variable depth hierarchies.
Chapter 8: Customer Relationship Management
Numerous DW/BI systems have been built on the premise that you need to better
understand and service your customers. This chapter discusses the customer dimen-
sion, including address standardization and bridge tables for multivalued dimension
attributes. We also describe complex customer behavior modeling patterns, as well
as the consolidation of customer data from multiple sources.
Chapter 9: Human Resources Management
This chapter explores several unique aspects of human resources dimensional
models, including the situation in which a dimension table begins to behave like a
fact table. We discuss packaged analytic solutions, the handling of recursive man-
agement hierarchies, and survey questionnaires. Several techniques for handling
multivalued skill keyword attributes are compared.
Chapter 10: Financial Services
The banking case study explores the concept of supertype and subtype schemas
for heterogeneous products in which each line of business has unique descriptive
attributes and performance metrics. Obviously, the need to handle heterogeneous
products is not unique to fi nancial services. We also discuss the complicated rela-
tionships among accounts, customers, and households.
Chapter 11: Telecommunications
This chapter is structured somewhat differently to encourage you to think critically
when performing a dimensional model design review. We start with a dimensional
design that looks plausible at first glance. Can you find the problems? In addition,
we explore the idiosyncrasies of geographic location dimensions.
Chapter 12: Transportation
In this case study we look at related fact tables at different levels of granularity
while pointing out the unique characteristics of fact tables describing segments in
a journey or network. We take a closer look at date and time dimensions, covering
country-specific calendars and synchronization across multiple time zones.
Chapter 13: Education
We look at several factless fact tables in this chapter. In addition, we explore accu-
mulating snapshot fact tables to handle the student application and research grant
proposal pipelines. This chapter gives you an appreciation for the diversity of busi-
ness processes in an educational institution.
Chapter 14: Healthcare
Some of the most complex models that we have ever worked with are from the
healthcare industry. This chapter illustrates the handling of such complexities,
including the use of a bridge table to model the multiple diagnoses and providers
associated with patient treatment events.
Chapter 15: Electronic Commerce
This chapter focuses on the nuances of clickstream web data, including its unique
dimensionality. We also introduce the step dimension that’s used to better under-
stand any process that consists of sequential steps.
Chapter 16: Insurance
The final case study reinforces many of the patterns we discussed earlier in the book
in a single set of interrelated schemas. It can be viewed as a pulling-it-all-together
chapter because the modeling techniques are layered on top of one another.
Chapter 17: Kimball Lifecycle Overview
Now that you are comfortable designing dimensional models, we provide a high-
level overview of the activities encountered during the life of a typical DW/BI proj-
ect. This chapter is a lightning tour of The Data Warehouse Lifecycle Toolkit, Second
Edition (Wiley, 2008) that we coauthored with Bob Becker, Joy Mundy, and Warren
Thornthwaite.
Chapter 18: Dimensional Modeling Process and Tasks
This chapter outlines specific recommendations for tackling the dimensional mod-
eling tasks within the Kimball Lifecycle. The first 16 chapters of this book cover
dimensional modeling techniques and design patterns; this chapter describes
responsibilities, how-tos, and deliverables for the dimensional modeling design
activity.
Chapter 19: ETL Subsystems and Techniques
The extract, transformation, and load system consumes a disproportionate share
of the time and e ort required to build a DW/BI environment. Careful consider-
ation of best practices has revealed 34 subsystems found in almost every dimen-
sional data warehouse back room. This chapter starts with the requirements and
constraints that must be considered before designing the ETL system and then
describes the 34 extraction, cleaning, conforming, delivery, and management
subsystems.
Chapter 20: ETL System Design and Development
Process and Tasks
This chapter delves into specific, tactical dos and don’ts surrounding the ETL
design and development activities. It is required reading for anyone tasked with
ETL responsibilities.
Chapter 21: Big Data Analytics
We focus on the popular topic of big data in the final chapter. Our perspective
is that big data is a natural extension of your DW/BI responsibilities. We begin
with an overview of several architectural alternatives, including MapReduce and
Hadoop, and describe how these alternatives can coexist with your current DW/BI
architecture. We then explore the management, architecture, data modeling, and
data governance best practices for big data.
Website Resources
The Kimball Group’s website is loaded with complementary dimensional modeling
content and resources:
Register for Kimball Design Tips to receive practical guidance about dimen-
sional modeling and DW/BI topics.
Access the archive of more than 300 Design Tips and articles.
Learn about public and onsite Kimball University classes for quality, vendor-
independent education consistent with our experiences and writings.
Learn about the Kimball Group’s consulting services to leverage our decades
of DW/BI expertise.
Pose questions to other dimensionally aware participants on the Kimball
Forum.
Summary
The goal of this book is to communicate the official dimensional design and devel-
opment techniques based on the authors’ more than 60 years of experience and
hard won lessons in real business environments. DW/BI systems must be driven
from the needs of business users, and therefore are designed and presented from a
simple dimensional perspective. We are confident you will be one giant step closer
to DW/BI success if you buy into this premise.
Now that you know where you are headed, it is time to dive into the details. We’ll
begin with a primer on DW/BI and dimensional modeling in Chapter 1 to ensure that
everyone is on the same page regarding key terminology and architectural concepts.
Data Warehousing, Business Intelligence, and Dimensional Modeling Primer
This first chapter lays the groundwork for the following chapters. We begin by
considering data warehousing and business intelligence (DW/BI) systems from
a high-level perspective. You may be disappointed to learn that we don’t start with
technology and tools—first and foremost, the DW/BI system must consider the
needs of the business. With the business needs firmly in hand, we work backwards
through the logical and then physical designs, along with decisions about technol-
ogy and tools.
We drive stakes in the ground regarding the goals of data warehousing and busi-
ness intelligence in this chapter, while observing the uncanny similarities between
the responsibilities of a DW/BI manager and those of a publisher.
With this big picture perspective, we explore dimensional modeling core concepts
and establish fundamental vocabulary. From there, this chapter discusses the major
components of the Kimball DW/BI architecture, along with a comparison of alterna-
tive architectural approaches; fortunately, there’s a role for dimensional modeling
regardless of your architectural persuasion. Finally, we review common dimensional
modeling myths. By the end of this chapter, you’ll have an appreciation for the need
to be one-half DBA (database administrator) and one-half MBA (business analyst)
as you tackle your DW/BI project.
Chapter 1 discusses the following concepts:
Business-driven goals of data warehousing and business intelligence
Publishing metaphor for DW/BI systems
Dimensional modeling core concepts and vocabulary, including fact and
dimension tables
Kimball DW/BI architecture’s components and tenets
Comparison of alternative DW/BI architectures, and the role of dimensional
modeling within each
Misunderstandings about dimensional modeling
Different Worlds of Data Capture and
Data Analysis
One of the most important assets of any organization is its information. This asset
is almost always used for two purposes: operational record keeping and analytical
decision making. Simply speaking, the operational systems are where you put the
data in, and the DW/BI system is where you get the data out.
Users of an operational system turn the wheels of the organization. They take
orders, sign up new customers, monitor the status of operational activities, and log
complaints. The operational systems are optimized to process transactions quickly.
These systems almost always deal with one transaction record at a time. They predict-
ably perform the same operational tasks over and over, executing the organization’s
business processes. Given this execution focus, operational systems typically do not
maintain history, but rather update data to reflect the most current state.
Users of a DW/BI system, on the other hand, watch the wheels of the organiza-
tion turn to evaluate performance. They count the new orders and compare them
with last week’s orders, and ask why the new customers signed up, and what the
customers complained about. They worry about whether operational processes are
working correctly. Although they need detailed data to support their constantly
changing questions, DW/BI users almost never deal with one transaction at a time.
These systems are optimized for high-performance queries as users’ questions often
require that hundreds or hundreds of thousands of transactions be searched and
compressed into an answer set. To further complicate matters, users of a DW/BI
system typically demand that historical context be preserved to accurately evaluate
the organization’s performance over time.
In the first edition of The Data Warehouse Toolkit (Wiley, 1996), Ralph Kimball
devoted an entire chapter to describe the dichotomy between the worlds of opera-
tional processing and data warehousing. At this time, it is widely recognized that
the DW/BI system has profoundly different needs, clients, structures, and rhythms
than the operational systems of record. Unfortunately, we still encounter supposed
DW/BI systems that are mere copies of the operational systems of record stored on
a separate hardware platform. Although these environments may address the need
to isolate the operational and analytical environments for performance reasons,
they do nothing to address the other inherent differences between the two types
of systems. Business users are underwhelmed by the usability and performance
provided by these pseudo data warehouses; these imposters do a disservice to DW/
BI because they don’t acknowledge their users have drastically different needs than
operational system users.
Goals of Data Warehousing and
Business Intelligence
Before we delve into the details of dimensional modeling, it is helpful to focus on
the fundamental goals of data warehousing and business intelligence. The goals can
be readily developed by walking through the halls of any organization and listening
to business management. These recurring themes have existed for more than three
decades:
“We collect tons of data, but we can’t access it.”
“We need to slice and dice the data every which way.”
“Business people need to get at the data easily.”
“Just show me what is important.”
“We spend entire meetings arguing about who has the right numbers rather
than making decisions.”
“We want people to use information to support more fact-based decision
making.”
Based on our experience, these concerns are still so universal that they drive the
bedrock requirements for the DW/BI system. Now turn these business management
quotations into requirements.
The DW/BI system must make information easily accessible. The contents
of the DW/BI system must be understandable. The data must be intuitive and
obvious to the business user, not merely the developer. The data’s structures
and labels should mimic the business users’ thought processes and vocabu-
lary. Business users want to separate and combine analytic data in endless
combinations. The business intelligence tools and applications that access
the data must be simple and easy to use. They also must return query results
to the user with minimal wait times. We can summarize this requirement by
simply saying simple and fast.
The DW/BI system must present information consistently. The data in the
DW/BI system must be credible. Data must be carefully assembled from a
variety of sources, cleansed, quality assured, and released only when it is fit
for user consumption. Consistency also implies common labels and defini-
tions for the DW/BI system’s contents are used across data sources. If two
performance measures have the same name, they must mean the same thing.
Conversely, if two measures don’t mean the same thing, they should be labeled
di erently.
The DW/BI system must adapt to change. User needs, business conditions,
data, and technology are all subject to change. The DW/BI system must be
designed to handle this inevitable change gracefully so that it doesn’t invali-
date existing data or applications. Existing data and applications should not
be changed or disrupted when the business community asks new questions
or new data is added to the warehouse. Finally, if descriptive data in the DW/
BI system must be modified, you must appropriately account for the changes
and make these changes transparent to the users.
The DW/BI system must present information in a timely way. As the DW/
BI system is used more intensively for operational decisions, raw data may
need to be converted into actionable information within hours, minutes,
or even seconds. The DW/BI team and business users need to have realistic
expectations for what it means to deliver data when there is little time to
clean or validate it.
The DW/BI system must be a secure bastion that protects the information
assets. An organization’s informational crown jewels are stored in the data
warehouse. At a minimum, the warehouse likely contains information about
what you’re selling to whom at what price—potentially harmful details in the
hands of the wrong people. The DW/BI system must effectively control access
to the organization’s confidential information.
The DW/BI system must serve as the authoritative and trustworthy foun-
dation for improved decision making. The data warehouse must have the
right data to support decision making. The most important outputs from a
DW/BI system are the decisions that are made based on the analytic evidence
presented; these decisions deliver the business impact and value attributable
to the DW/BI system. The original label that predates DW/BI is still the best
description of what you are designing: a decision support system.
The business community must accept the DW/BI system to deem it successful.
It doesn’t matter that you built an elegant solution using best-of-breed products
and platforms. If the business community does not embrace the DW/BI environ-
ment and actively use it, you have failed the acceptance test. Unlike an opera-
tional system implementation where business users have no choice but to use
the new system, DW/BI usage is sometimes optional. Business users will embrace
the DW/BI system if it is the “simple and fast” source for actionable information.
Although each requirement on this list is important, the final two are the most
critical, and unfortunately, often the most overlooked. Successful data warehousing
and business intelligence demands more than being a stellar architect, technician,
modeler, or database administrator. With a DW/BI initiative, you have one foot
in your information technology (IT) comfort zone while your other foot is on the
unfamiliar turf of business users. You must straddle the two, modifying some tried-
and-true skills to adapt to the unique demands of DW/BI. Clearly, you need to bring
a spectrum of skills to the party to behave like you’re a hybrid DBA/MBA.
Publishing Metaphor for DW/BI Managers
With the goals of DW/BI as a backdrop, let’s compare the responsibilities of DW/BI
managers with those of a publishing editor-in-chief. As the editor of a high-quality
magazine, you would have broad latitude to manage the magazine’s content, style,
and delivery. Anyone with this job title would likely tackle the following activities:
Understand the readers:
Identify their demographic characteristics.
Find out what readers want in this kind of magazine.
Identify the “best” readers who will renew their subscriptions and buy
products from the magazine’s advertisers.
Find potential new readers and make them aware of the magazine.
Ensure the magazine appeals to the readers:
Choose interesting and compelling magazine content.
Make layout and rendering decisions that maximize the readers’
pleasure.
Uphold high-quality writing and editing standards while adopting a
consistent presentation style.
Continuously monitor the accuracy of the articles and advertisers’
claims.
Adapt to changing reader profiles and the availability of new input
from a network of writers and contributors.
Sustain the publication:
Attract advertisers and run the magazine profitably.
Publish the magazine on a regular basis.
Maintain the readers’ trust.
Keep the business owners happy.
You also can identify items that should be non-goals for the magazine’s editor-
in-chief, such as building the magazine around a particular printing technology
or exclusively putting management’s energy into operational efficiencies, such as
imposing a technical writing style that readers don’t easily understand, or creating
an intricate and crowded layout that is difficult to read.
By building the publishing business on a foundation of serving the readers effec-
tively, the magazine is likely to be successful. Conversely, go through the list and
imagine what happens if you omit any single item; ultimately, the magazine would
have serious problems.
There are strong parallels that can be drawn between being a conventional pub-
lisher and being a DW/BI manager. Driven by the needs of the business, DW/BI
managers must publish data that has been collected from a variety of sources and
edited for quality and consistency. The main responsibility is to serve the readers,
otherwise known as business users. The publishing metaphor underscores the need
to focus outward on your customers rather than merely focusing inward on prod-
ucts and processes. Although you use technology to deliver the DW/BI system, the
technology is at best a means to an end. As such, the technology and techniques
used to build the system should not appear directly in your top job responsibilities.
Now recast the magazine publisher’s responsibilities as DW/BI manager
responsibilities:
Understand the business users:
Understand their job responsibilities, goals, and objectives.
Determine the decisions that the business users want to make with the
help of the DW/BI system.
Identify the “best” users who make effective, high-impact decisions.
Find potential new users and make them aware of the DW/BI system’s
capabilities.
Deliver high-quality, relevant, and accessible information and analytics to
the business users:
Choose the most robust, actionable data to present in the DW/BI sys-
tem, carefully selected from the vast universe of possible data sources
in your organization.
Make the user interfaces and applications simple and template-driven,
explicitly matched to the users’ cognitive processing profiles.
Make sure the data is accurate and can be trusted, labeling it consis-
tently across the enterprise.
Continuously monitor the accuracy of the data and analyses.
Adapt to changing user profiles, requirements, and business priorities,
along with the availability of new data sources.
Sustain the DW/BI environment:
Take a portion of the credit for the business decisions made using the
DW/BI system, and use these successes to justify staffing and ongoing
expenditures.
Update the DW/BI system on a regular basis.
Maintain the business users’ trust.
Keep the business users, executive sponsors, and IT management
happy.
If you do a good job with all these responsibilities, you will be a great DW/BI
manager! Conversely, go through the list and imagine what happens if you omit
any single item. Ultimately, the environment would have serious problems. Now
contrast this view of a DW/BI manager’s job with your own job description. Chances
are the preceding list is more oriented toward user and business issues and may not
even sound like a job in IT. In our opinion, this is what makes data warehousing
and business intelligence interesting.
Dimensional Modeling Introduction
Now that you understand the DW/BI system’s goals, let’s consider the basics of dimen-
sional modeling. Dimensional modeling is widely accepted as the preferred technique
for presenting analytic data because it addresses two simultaneous requirements:
Deliver data that’s understandable to the business users.
Deliver fast query performance.
Dimensional modeling is a longstanding technique for making databases simple.
In case after case, for more than five decades, IT organizations, consultants, and
business users have naturally gravitated to a simple dimensional structure to match
the fundamental human need for simplicity. Simplicity is critical because it ensures
that users can easily understand the data, as well as allows software to navigate and
deliver results quickly and efficiently.
Imagine an executive who describes her business as, “We sell products in various
markets and measure our performance over time.” Dimensional designers listen
carefully to the emphasis on product, market, and time. Most people find it intui-
tive to think of such a business as a cube of data, with the edges labeled product,
market, and time. Imagine slicing and dicing along each of these dimensions. Points
inside the cube are where the measurements, such as sales volume or profit, for
that combination of product, market, and time are stored. The ability to visualize
something as abstract as a set of data in a concrete and tangible way is the secret
of understandability. If this perspective seems too simple, good! A data model that
starts simple has a chance of remaining simple at the end of the design. A model
that starts complicated surely will be overly complicated at the end, resulting in
slow query performance and business user rejection. Albert Einstein captured the
basic philosophy driving dimensional design when he said, “Make everything as
simple as possible, but not simpler.”
Although dimensional models are often instantiated in relational database man-
agement systems, they are quite different from third normal form (3NF) models which
seek to remove data redundancies. Normalized 3NF structures divide data into
many discrete entities, each of which becomes a relational table. A database of sales
orders might start with a record for each order line but turn into a complex spider
web diagram as a 3NF model, perhaps consisting of hundreds of normalized tables.
The industry sometimes refers to 3NF models as entity-relationship (ER)
models. Entity-relationship diagrams (ER diagrams or ERDs) are drawings that com-
municate the relationships between tables. Both 3NF and dimensional models can
be represented in ERDs because both consist of joined relational tables; the key
di erence between 3NF and dimensional models is the degree of normalization.
Because both model types can be presented as ERDs, we refrain from referring to
3NF models as ER models; instead, we call them normalized models to minimize
confusion.
Normalized 3NF structures are immensely useful in operational processing
because an update or insert transaction touches the database in only one place.
Normalized models, however, are too complicated for BI queries. Users can’t under-
stand, navigate, or remember normalized models that resemble a map of the Los
Angeles freeway system. Likewise, most relational database management systems
cant e ciently query a normalized model; the complexity of users’ unpredictable
queries overwhelms the database optimizers, resulting in disastrous query perfor-
mance. The use of normalized modeling in the DW/BI presentation area defeats the
intuitive and high-performance retrieval of data. Fortunately, dimensional modeling
addresses the problem of overly complex schemas in the presentation area.
NOTE A dimensional model contains the same information as a normalized
model, but packages the data in a format that delivers user understandability, query
performance, and resilience to change.
Star Schemas Versus OLAP Cubes
Dimensional models implemented in relational database management systems are
referred to as star schemas because of their resemblance to a star-like structure.
Dimensional models implemented in multidimensional database environments are
referred to as online analytical processing (OLAP) cubes, as illustrated in Figure 1-1.
If your DW/BI environment includes either star schemas or OLAP cubes, it lever-
ages dimensional concepts. Both stars and cubes have a common logical design with
recognizable dimensions; however, the physical implementation differs.
When data is loaded into an OLAP cube, it is stored and indexed using formats
and techniques that are designed for dimensional data. Performance aggregations
or precalculated summary tables are often created and managed by the OLAP cube
engine. Consequently, cubes deliver superior query performance because of the
precalculations, indexing strategies, and other optimizations. Business users can
drill down or up by adding or removing attributes from their analyses with excellent
performance without issuing new queries. OLAP cubes also provide more analyti-
cally robust functions that exceed those available with SQL. The downside is that you
pay a load performance price for these capabilities, especially with large data sets.
Figure 1-1: Star schema versus OLAP cube. (The figure contrasts a star schema, with Date, Market, and Product dimension tables joined to a Sales Facts table, against an OLAP cube whose edges are labeled Date, Market, and Product.)
Fortunately, most of the recommendations in this book pertain regardless of the
relational versus multidimensional database platform. Although the capabilities
of OLAP technology are continuously improving, we generally recommend that
detailed, atomic information be loaded into a star schema; optional OLAP cubes are
then populated from the star schema. For this reason, most dimensional modeling
techniques in this book are couched in terms of a relational star schema.
OLAP Deployment Considerations
Here are some things to keep in mind if you deploy data into OLAP cubes:
A star schema hosted in a relational database is a good physical foundation
for building an OLAP cube, and is generally regarded as a more stable basis
for backup and recovery.
OLAP cubes have traditionally been noted for extreme performance advan-
tages over RDBMSs, but that distinction has become less important with
advances in computer hardware, such as appliances and in-memory databases,
and RDBMS software, such as columnar databases.
OLAP cube data structures are more variable across different vendors than
relational DBMSs, thus the final deployment details often depend on which
OLAP vendor is chosen. It is typically more difficult to port BI applications
between different OLAP tools than to port BI applications across different
relational databases.
OLAP cubes typically o er more sophisticated security options than RDBMSs,
such as limiting access to detailed data but providing more open access to
summary data.
OLAP cubes o er signifi cantly richer analysis capabilities than RDBMSs,
which are saddled by the constraints of SQL. This may be the main justifi ca-
tion for using an OLAP product.
OLAP cubes gracefully support slowly changing dimension type 2 changes
(which are discussed in Chapter 5: Procurement), but cubes often need to be
reprocessed partially or totally whenever data is overwritten using alternative
slowly changing dimension techniques.
OLAP cubes gracefully support transaction and periodic snapshot fact tables,
but do not handle accumulating snapshot fact tables because of the limitations
on overwriting data described in the previous point.
OLAP cubes typically support complex ragged hierarchies of indeterminate
depth, such as organization charts or bills of material, using native query
syntax that is superior to the approaches required for RDBMSs.
OLAP cubes may impose detailed constraints on the structure of dimension
keys that implement drill-down hierarchies compared to relational databases.
Some OLAP products do not enable dimensional roles or aliases, thus requir-
ing separate physical dimensions to be defined.
We’ll return to the world of dimensional modeling in a relational platform as we
consider the two key components of a star schema.
Fact Tables for Measurements
The fact table in a dimensional model stores the performance measurements result-
ing from an organization’s business process events. You should strive to store the
low-level measurement data resulting from a business process in a single dimen-
sional model. Because measurement data is overwhelmingly the largest set of data,
it should not be replicated in multiple places for multiple organizational functions
around the enterprise. Allowing business users from multiple organizations to access
a single centralized repository for each set of measurement data ensures the use of
consistent data throughout the enterprise.
The term fact represents a business measure. Imagine standing in the marketplace
watching products being sold and writing down the unit quantity and dollar sales
amount for each product in each sales transaction. These measurements are captured
as products are scanned at the register, as illustrated in Figure 1-2.
Each row in a fact table corresponds to a measurement event. The data on each
row is at a specific level of detail, referred to as the grain, such as one row per product
sold on a sales transaction. One of the core tenets of dimensional modeling is that
all the measurement rows in a fact table must be at the same grain. Having the dis-
cipline to create fact tables with a single level of detail ensures that measurements
aren’t inappropriately double-counted.
Figure 1-2: Business process measurement events translate into fact tables. (The figure shows sales transactions at the register translating into a Retail Sales Facts table with Date Key, Product Key, Store Key, Promotion Key, Customer Key, and Clerk Key foreign keys, a Transaction #, and Sales Dollars and Sales Units measurements.)
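To make this concrete, here is a minimal SQL sketch of the fact table depicted in Figure 1-2. The table and column names follow the figure; the data types, name spellings, and choice of primary key are illustrative assumptions rather than prescribed physical design.

-- Illustrative DDL for the fact table in Figure 1-2 (data types are assumptions).
-- Grain: one row per product sold on a sales transaction.
CREATE TABLE retail_sales_fact (
    date_key           INTEGER       NOT NULL,  -- foreign key to the date dimension
    product_key        INTEGER       NOT NULL,  -- foreign key to the product dimension
    store_key          INTEGER       NOT NULL,  -- foreign key to the store dimension
    promotion_key      INTEGER       NOT NULL,  -- foreign key to the promotion dimension
    customer_key       INTEGER       NOT NULL,  -- foreign key to the customer dimension
    clerk_key          INTEGER       NOT NULL,  -- foreign key to the clerk dimension
    transaction_number VARCHAR(20)   NOT NULL,  -- operational transaction identifier
    sales_dollars      DECIMAL(12,2) NOT NULL,  -- additive fact
    sales_units        INTEGER       NOT NULL,  -- additive fact
    -- One illustrative choice of primary key consistent with the declared grain:
    PRIMARY KEY (transaction_number, product_key)
);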
NOTE The idea that a measurement event in the physical world has a one-to-one
relationship to a single row in the corresponding fact table is a bedrock principle
for dimensional modeling. Everything else builds from this foundation.
The most useful facts are numeric and additive, such as dollar sales amount.
Throughout this book we will use dollars as the standard currency to make the
case study examples more tangible—you can substitute your own local currency
if it isn’t dollars.
Additivity is crucial because BI applications rarely retrieve a single fact table
row. Rather, they bring back hundreds, thousands, or even millions of fact rows at
a time, and the most useful thing to do with so many rows is to add them up. No
matter how the user slices the data in Figure 1-2, the sales units and dollars sum
to a valid total. You will see that facts are sometimes semi-additive or even non-
additive. Semi-additive facts, such as account balances, cannot be summed across
the time dimension. Non-additive facts, such as unit prices, can never be added. You
are forced to use counts and averages or are reduced to printing out the fact rows
one at a time—an impractical exercise with a billion-row fact table.
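As a small, hypothetical illustration of additivity, assuming the retail_sales_fact table sketched above, a BI query can simply sum the additive facts across however many rows it touches:

-- Because the facts are additive, any large set of fact rows can simply be summed.
SELECT
    SUM(sales_dollars) AS total_sales_dollars,
    SUM(sales_units)   AS total_sales_units
FROM retail_sales_fact;
-- A semi-additive fact such as an account balance would instead be averaged
-- (or otherwise handled) across the date dimension rather than summed.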
Facts are often described as continuously valued to help sort out what is a fact
versus a dimension attribute. The dollar sales amount fact is continuously valued in
this example because it can take on virtually any value within a broad range. As an
observer, you must stand out in the marketplace and wait for the measurement before
you have any idea what the value will be.
It is theoretically possible for a measured fact to be textual; however, the condition
rarely arises. In most cases, a textual measurement is a description of something
and is drawn from a discrete list of values. The designer should make every effort to
put textual data into dimensions where they can be correlated more effectively with
the other textual dimension attributes and consume much less space. You should
not store redundant textual information in fact tables. Unless the text is unique
for every row in the fact table, it belongs in the dimension table. A true text fact is
rare because the unpredictable content of a text fact, like a freeform text comment,
makes it nearly impossible to analyze.
Referring to the sample fact table in Figure 1-2, if there is no sales activity for a
given product, you don’t put any rows in the table. It is important that you do not
try to fill the fact table with zeros representing no activity because these zeros would
overwhelm most fact tables. By including only true activity, fact tables tend to be
quite sparse. Despite their sparsity, fact tables usually make up 90 percent or more
of the total space consumed by a dimensional model. Fact tables tend to be deep in
terms of the number of rows, but narrow in terms of the number of columns. Given
their size, you should be judicious about fact table space utilization.
As examples are developed throughout this book, you will see that all fact table
grains fall into one of three categories: transaction, periodic snapshot, and accu-
mulating snapshot. Transaction grain fact tables are the most common. We will
introduce transaction fact tables in Chapter 3: Retail Sales, and both periodic and
accumulating snapshots in Chapter 4: Inventory.
All fact tables have two or more foreign keys (refer to the FK notation in Figure 1-2)
that connect to the dimension tables’ primary keys. For example, the product key in
the fact table always matches a specific product key in the product dimension table.
When all the keys in the fact table correctly match their respective primary keys in
the corresponding dimension tables, the tables satisfy referential integrity. You access
the fact table via the dimension tables joined to it.
The fact table generally has its own primary key composed of a subset of the for-
eign keys. This key is often called a composite key. Every table that has a composite
key is a fact table. Fact tables express many-to-many relationships. All others are
dimension tables.
There are usually a handful of dimensions that together uniquely identify each
fact table row. After this subset of the overall dimension list has been identified, the
rest of the dimensions take on a single value in the context of the fact table row’s
primary key. In other words, they go along for the ride.
Dimension Tables for Descriptive Context
Dimension tables are integral companions to a fact table. The dimension tables con-
tain the textual context associated with a business process measurement event. They
describe the “who, what, where, when, how, and why” associated with the event.
As illustrated in Figure 1-3, dimension tables often have many columns or
attributes. It is not uncommon for a dimension table to have 50 to 100 attributes;
although, some dimension tables naturally have only a handful of attributes.
Dimension tables tend to have fewer rows than fact tables, but can be wide with
many large text columns. Each dimension is defined by a single primary key (refer
to the PK notation in Figure 1-3), which serves as the basis for referential integrity
with any given fact table to which it is joined.
Figure 1-3: Dimension tables contain descriptive characteristics of business process nouns. (The figure shows a Product Dimension table keyed by Product Key (PK), with attributes including SKU Number (natural key), Product Description, Brand Name, Category Name, Department Name, Package Type, Package Size, Abrasive Indicator, Weight, Weight Unit of Measure, Storage Type, Shelf Life Type, Shelf Width, Shelf Height, Shelf Depth, and more.)
Dimension attributes serve as the primary source of query constraints, group-
ings, and report labels. In a query or report request, attributes are identified as the
“by” words. For example, when a user wants to see dollar sales by brand, brand must
be available as a dimension attribute.
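As a hedged example of these “by” words in action, assuming the retail_sales_fact table sketched earlier and a product_dimension table carrying the Brand Name attribute from Figure 1-3, the dimension attribute supplies the grouping and labeling while the fact table supplies the summed measurement:

-- "Dollar sales by brand": the dimension attribute is the grouping label,
-- and the fact table supplies the measurement being summed.
SELECT
    p.brand_name,
    SUM(f.sales_dollars) AS sales_dollars
FROM retail_sales_fact f
JOIN product_dimension p ON p.product_key = f.product_key
GROUP BY p.brand_name
ORDER BY sales_dollars DESC;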
Dimension table attributes play a vital role in the DW/BI system. Because they
are the source of virtually all constraints and report labels, dimension attributes are
critical to making the DW/BI system usable and understandable. Attributes should
consist of real words rather than cryptic abbreviations. You should strive to mini-
mize the use of codes in dimension tables by replacing them with more verbose
textual attributes. You may have already trained the business users to memorize
operational codes, but going forward, minimize their reliance on miniature notes
attached to their monitor for code translations. You should make standard decodes
for the operational codes available as dimension attributes to provide consistent
labeling on queries, reports, and BI applications. The decode values should never be
buried in the reporting applications where inconsistency is inevitable.
Sometimes operational codes or identifiers have legitimate business significance
to users or are required to communicate back to the operational world. In these
cases, the codes should appear as explicit dimension attributes, in addition to the
corresponding user-friendly textual descriptors. Operational codes sometimes have
intelligence embedded in them. For example, the first two digits may identify the
line of business, whereas the next two digits may identify the global region. Rather
than forcing users to interrogate or filter on substrings within the operational codes,
pull out the embedded meanings and present them to users as separate dimension
attributes that can easily be filtered, grouped, or reported.
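The following hypothetical sketch illustrates the recommendation; the operational_product_code, line_of_business, and global_region columns, along with the literal values, are invented for the example:

-- Without decoded attributes, users must dissect the operational code themselves:
SELECT SUM(f.sales_dollars) AS sales_dollars
FROM retail_sales_fact f
JOIN product_dimension p ON p.product_key = f.product_key
WHERE SUBSTRING(p.operational_product_code FROM 1 FOR 2) = '03';  -- which line of business is '03'?

-- With the embedded meanings pulled out as explicit dimension attributes:
SELECT SUM(f.sales_dollars) AS sales_dollars
FROM retail_sales_fact f
JOIN product_dimension p ON p.product_key = f.product_key
WHERE p.line_of_business = 'Housewares'
  AND p.global_region    = 'North America';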
In many ways, the data warehouse is only as good as the dimension attributes; the
analytic power of the DW/BI environment is directly proportional to the quality and
depth of the dimension attributes. The more time spent providing attributes with
verbose business terminology, the better. The more time spent populating the domain
values in an attribute column, the better. The more time spent ensuring the quality
of the values in an attribute column, the better. Robust dimension attributes deliver
robust analytic slicing-and-dicing capabilities.
NOTE Dimensions provide the entry points to the data, and the final labels and
groupings on all DW/BI analyses.
When triaging operational source data, it is sometimes unclear whether a
numeric data element is a fact or dimension attribute. You often make the decision
by asking whether the column is a measurement that takes on lots of values and
participates in calculations (making it a fact) or is a discretely valued description
that is more or less constant and participates in constraints and row labels (making
it a dimensional attribute). For example, the standard cost for a product seems like
a constant attribute of the product but may be changed so often that you decide it
is more like a measured fact. Occasionally, you can’t be certain of the classification;
it is possible to model the data element either way (or both ways) as a matter of the
designer’s prerogative.
NOTE The designer’s dilemma of whether a numeric quantity is a fact or a
dimension attribute is rarely a difficult decision. Continuously valued numeric
observations are almost always facts; discrete numeric observations drawn from a
small list are almost always dimension attributes.
Figure 1-4 shows that dimension tables often represent hierarchical relation-
ships. For example, products roll up into brands and then into categories. For each
row in the product dimension, you should store the associated brand and category
description. The hierarchical descriptive information is stored redundantly in the
spirit of ease of use and query performance. You should resist the perhaps habitual
urge to normalize data by storing only the brand code in the product dimension and
creating a separate brand lookup table, and likewise for the category description in a
separate category lookup table. This normalization is called snowflaking. Instead of
third normal form, dimension tables typically are highly denormalized with flattened
many-to-one relationships within a single dimension table. Because dimension tables
typically are geometrically smaller than fact tables, improving storage efficiency by
normalizing or snowflaking has virtually no impact on the overall database size. You
should almost always trade off dimension table space for simplicity and accessibility.
Product Key   Product Description   Brand Name    Category Name
1             PowerAll 20 oz        PowerClean    All Purpose Cleaner
2             PowerAll 32 oz        PowerClean    All Purpose Cleaner
3             PowerAll 48 oz        PowerClean    All Purpose Cleaner
4             PowerAll 64 oz        PowerClean    All Purpose Cleaner
5             ZipAll 20 oz          Zippy         All Purpose Cleaner
6             ZipAll 32 oz          Zippy         All Purpose Cleaner
7             ZipAll 48 oz          Zippy         All Purpose Cleaner
8             Shiny 20 oz           Clean Fast    Glass Cleaner
9             Shiny 32 oz           Clean Fast    Glass Cleaner
10            ZipGlass 20 oz        Zippy         Glass Cleaner
11            ZipGlass 32 oz        Zippy         Glass Cleaner
Figure 1-4: Sample rows from a dimension table with denormalized hierarchies.
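The following sketch contrasts the flattened design shown in Figure 1-4 with a snowflaked alternative; the brand_lookup and category_lookup table and key names are invented for the illustration. The flattened product dimension answers a category-level question with a single join, whereas the snowflaked version forces the query through extra lookup tables.

-- Flattened dimension (recommended): one join, with brand and category
-- stored redundantly on every product row.
SELECT p.category_name, SUM(f.sales_dollars) AS sales_dollars
FROM retail_sales_fact f
JOIN product_dimension p ON p.product_key = f.product_key
GROUP BY p.category_name;

-- Snowflaked design (discouraged in the presentation area): the same question
-- must chain through hypothetical brand and category lookup tables.
SELECT c.category_name, SUM(f.sales_dollars) AS sales_dollars
FROM retail_sales_fact f
JOIN product_dimension p ON p.product_key = f.product_key
JOIN brand_lookup      b ON b.brand_key    = p.brand_key
JOIN category_lookup   c ON c.category_key = b.category_key
GROUP BY c.category_name;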
Contrary to popular folklore, Ralph Kimball didn’t invent the terms fact and
dimension. As best as can be determined, the dimension and fact terminology
originated from a joint research project conducted by General Mills and Dartmouth
University in the 1960s. In the 1970s, both AC Nielsen and IRI used the terms con-
sistently to describe their syndicated data offerings and gravitated to dimensional
models for simplifying the presentation of their analytic information. They under-
stood that their data wouldn’t be used unless it was packaged simply. It is probably
accurate to say that no single person invented the dimensional approach. It is an
irresistible force in designing databases that always results when the designer places
understandability and performance as the highest goals.
Facts and Dimensions Joined in a Star Schema
Now that you understand fact and dimension tables, it’s time to bring the building blocks
together in a dimensional model, as shown in Figure 1-5. Each business process is repre-
sented by a dimensional model that consists of a fact table containing the event’s numeric
measurements surrounded by a halo of dimension tables that contain the textual context
that was true at the moment the event occurred. This characteristic star-like structure
is often called a star join, a term dating back to the earliest days of relational databases.
Retail Sales Fact
Date Key (FK)
Product Key (FK)
Store Key (FK)
Promotion Key (FK)
Customer Key (FK)
Clerk Key (FK)
Transaction #
Sales Dollars
Sales Units
Date Dimension
Product Dimension
Promotion Dimension
Clerk Dimension
Store Dimension
Customer Dimension
Figure 1-5: Fact and dimension tables in a dimensional model.
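A minimal table sketch of the fact table at the hub of this star join, assuming generic SQL data types and the keys shown in Figure 1-5, might look as follows; the REFERENCES clauses express the primary/foreign key relationships to the surrounding dimension tables:

CREATE TABLE retail_sales_fact (
    date_key           INTEGER       NOT NULL REFERENCES date_dimension (date_key),
    product_key        INTEGER       NOT NULL REFERENCES product_dimension (product_key),
    store_key          INTEGER       NOT NULL REFERENCES store_dimension (store_key),
    promotion_key      INTEGER       NOT NULL REFERENCES promotion_dimension (promotion_key),
    customer_key       INTEGER       NOT NULL REFERENCES customer_dimension (customer_key),
    clerk_key          INTEGER       NOT NULL REFERENCES clerk_dimension (clerk_key),
    transaction_number VARCHAR(20)   NOT NULL,   -- operational transaction identifier
    sales_dollars      DECIMAL(18,2) NOT NULL,   -- numeric measurements of the sales event
    sales_units        INTEGER       NOT NULL
);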
The first thing to notice about the dimensional schema is its simplicity and
symmetry. Obviously, business users benefit from the simplicity because the data
is easier to understand and navigate. The charm of the design in Figure 1-5 is that
it is highly recognizable to business users. We have observed literally hundreds of
instances in which users immediately agree that the dimensional model is their
business. Furthermore, the reduced number of tables and use of meaningful busi-
ness descriptors make it easy to navigate and less likely that mistakes will occur.
The simplicity of a dimensional model also has performance benefits. Database
optimizers process these simple schemas with fewer joins more efficiently. A data-
base engine can make strong assumptions about first constraining the heavily
indexed dimension tables, and then attacking the fact table all at once with the
Cartesian product of the dimension table keys satisfying the user’s constraints.
Amazingly, using this approach, the optimizer can evaluate arbitrary n-way joins
to a fact table in a single pass through the fact table’s index.
Finally, dimensional models are gracefully extensible to accommodate change.
The predictable framework of a dimensional model withstands unexpected changes
in user behavior. Every dimension is equivalent; all dimensions are symmetrically
equal entry points into the fact table. The dimensional model has no built-in bias
regarding expected query patterns. There are no preferences for the business ques-
tions asked this month versus the questions asked next month. You certainly don’t
want to adjust schemas if business users suggest new ways to analyze their business.
This book illustrates repeatedly that the most granular or atomic data has the
most dimensionality. Atomic data that has not been aggregated is the most expres-
sive data; this atomic data should be the foundation for every fact table design to
withstand business users’ ad hoc attacks in which they pose unexpected queries.
With dimensional models, you can add completely new dimensions to the schema
as long as a single value of that dimension is defined for each existing fact row.
Likewise, you can add new facts to the fact table, assuming that the level of detail
is consistent with the existing fact table. You can supplement preexisting dimen-
sion tables with new, unanticipated attributes. In each case, existing tables can be
changed in place either by simply adding new data rows in the table or by executing
an SQL ALTER TABLE command. Data would not need to be reloaded, and existing BI
applications would continue to run without yielding different results. We examine
this graceful extensibility of dimensional models more fully in Chapter 3.
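For instance, each of these graceful changes can be expressed as an ordinary schema statement; the column names below are hypothetical, and the exact ALTER TABLE syntax and default handling vary by database platform:

-- Add an unanticipated attribute to an existing dimension table
ALTER TABLE product_dimension ADD COLUMN package_recyclable_flag CHAR(1);

-- Add a new fact that is consistent with the existing grain
ALTER TABLE retail_sales_fact ADD COLUMN sales_cost_dollars DECIMAL(18,2);

-- Add a completely new dimension as a foreign key column, defaulted to a
-- placeholder dimension row so that existing fact rows remain valid
ALTER TABLE retail_sales_fact ADD COLUMN weather_key INTEGER DEFAULT 0;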
Another way to think about the complementary nature of fact and dimension
tables is to see them translated into a report. As illustrated in Figure 1-6, dimension
attributes supply the report filters and labeling, whereas the fact tables supply the
report’s numeric values.
In Figure 1-6, the Product Dimension (Product Key (PK), SKU Number (Natural Key), Product Description, Package Type, Package Size, Brand Name, Category Name, and more), Date Dimension (Date Key (PK), Date, Day of Week, Month, Year, and more), and Store Dimension (Store Key (PK), Store Number, Store Name, Store State, Store ZIP, District, Region, and more) surround the Sales Fact table (Date Key, Product Key, and Store Key foreign keys, plus Transaction #, Sales Dollars, and Sales Units). Dimension attributes supply the report’s filter and group-by columns, and the summed fact supplies the metric:

Sales Activity for June 2013

District    Brand Name     Sales Dollars
Atherton    PowerClean     2,035
Atherton    Zippy          707
Belmont     Clean Fast     2,330
Belmont     Zippy          527
Figure 1-6: Dimensional attributes and facts form a simple report.
You can easily envision the SQL that’s written (or more likely generated by a BI
tool) to create this report:
SELECT
store.district_name,
product.brand,
sum(sales_facts.sales_dollars) AS "Sales Dollars"
FROM
store,
product,
date,
sales_facts
WHERE
date.month_name='June' AND
date.year=2013 AND
store.store_key = sales_facts.store_key AND
product.product_key = sales_facts.product_key AND
date.date_key = sales_facts.date_key
GROUP BY
store.district_name,
product.brand
If you study this code snippet line-by-line, the first two lines under the SELECT
statement identify the dimension attributes in the report, followed by the aggre-
gated metric from the fact table. The FROM clause identifies all the tables involved
in the query. The first two lines in the WHERE clause declare the report’s filter, and
the remainder declare the joins between the dimension and fact tables. Finally, the
GROUP BY clause establishes the aggregation within the report.
Kimball’s DW/BI Architecture
Let’s build on your understanding of DW/BI systems and dimensional modeling
fundamentals by investigating the components of a DW/BI environment based on
the Kimball architecture. You need to learn the strategic significance of each com-
ponent to avoid confusing their roles and functions.
As illustrated in Figure 1-7, there are four separate and distinct components to
consider in the DW/BI environment: operational source systems, ETL system, data
presentation area, and business intelligence applications.
Operational Source Systems
These are the operational systems of record that capture the business’s transactions.
Think of the source systems as outside the data warehouse because presumably you
have little or no control over the content and format of the data in these operational
systems. The main priorities of the source systems are processing performance and avail-
ability. Operational queries against source systems are narrow, one-record-at-a-time
queries that are part of the normal transaction flow and severely restricted in their
demands on the operational system. It is safe to assume that source systems are not
queried in the broad and unexpected ways that DW/BI systems typically are queried.
Source systems maintain little historical data; a good data warehouse can relieve
the source systems of much of the responsibility for representing the past. In many
cases, the source systems are special purpose applications without any commitment
to sharing common data such as product, customer, geography, or calendar with other
operational systems in the organization. Of course, a broadly adopted cross-application
enterprise resource planning (ERP) system or operational master data management
system could help address these shortcomings.
Source transactions feed the back room ETL system (transform from source to target, conform dimensions, normalization optional, no user query support; design goals: throughput, integrity and consistency), which loads the front room presentation area (dimensional star schemas or OLAP cubes, atomic and summary data, organized by business process, using conformed dimensions and the enterprise DW bus architecture; design goals: ease of use and query performance), which in turn is accessed by the BI applications (ad hoc queries, standard reports, analytic apps, and data mining and models).
Figure 1-7: Core elements of the Kimball DW/BI architecture.
Extract, Transformation, and Load System
The extract, transformation, and load (ETL) system of the DW/BI environment consists
of a work area, instantiated data structures, and a set of processes. The ETL system
is everything between the operational source systems and the DW/BI presentation
area. We elaborate on the architecture of ETL systems and associated techniques
in Chapter 19: ETL Subsystems and Techniques, but we want to introduce this
fundamental piece of the overall DW/BI system puzzle.
Extraction is the first step in the process of getting data into the data warehouse
environment. Extracting means reading and understanding the source data and
copying the data needed into the ETL system for further manipulation. At this
point, the data belongs to the data warehouse.
After the data is extracted to the ETL system, there are numerous potential trans-
formations, such as cleansing the data (correcting misspellings, resolving domain
conflicts, dealing with missing elements, or parsing into standard formats), com-
bining data from multiple sources, and de-duplicating data. The ETL system adds
value to the data with these cleansing and conforming tasks by changing the data
and enhancing it. In addition, these activities can be architected to create diagnos-
tic metadata, eventually leading to business process reengineering to improve data
quality in the source systems over time.
The final step of the ETL process is the physical structuring and loading of data
into the presentation area’s target dimensional models. Because the primary mis-
sion of the ETL system is to hand off the dimension and fact tables in the delivery
step, these subsystems are critical. Many of these defined subsystems focus on
dimension table processing, such as surrogate key assignments, code lookups to
provide appropriate descriptions, splitting or combining columns to present the
appropriate data values, or joining underlying third normal form table structures
into flattened denormalized dimensions. In contrast, fact tables are typically large
and time consuming to load, but preparing them for the presentation area is typically
straightforward. When the dimension and fact tables in a dimensional model have
been updated, indexed, supplied with appropriate aggregates, and further quality
assured, the business community is notified that the new data has been published.
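As a rough illustration of the surrogate key handoff described above, the fact table load can be sketched as a single SQL statement that exchanges operational natural keys held in a staging table for dimension surrogate keys; the staging table, its column names, and the dimension tables' natural key columns are assumptions for this example, and production ETL tools typically perform these lookups with dedicated components rather than hand-written SQL:

INSERT INTO retail_sales_fact
    (date_key, product_key, store_key, promotion_key, customer_key, clerk_key,
     transaction_number, sales_dollars, sales_units)
SELECT
    d.date_key,            -- surrogate keys looked up from the dimension tables
    p.product_key,
    s.store_key,
    pr.promotion_key,
    c.customer_key,
    k.clerk_key,
    stg.transaction_number,
    stg.sales_dollars,
    stg.sales_units
FROM sales_staging stg
JOIN date_dimension      d  ON d.full_date       = stg.transaction_date
JOIN product_dimension   p  ON p.sku_number      = stg.sku_number
JOIN store_dimension     s  ON s.store_number    = stg.store_number
JOIN promotion_dimension pr ON pr.promotion_code = stg.promotion_code
JOIN customer_dimension  c  ON c.customer_id     = stg.customer_id
JOIN clerk_dimension     k  ON k.clerk_id        = stg.clerk_id;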
There remains industry consternation about whether the data in the ETL system
should be repurposed into physical normalized structures prior to loading into the
presentation area’s dimensional structures for querying and reporting. The ETL
system is typically dominated by the simple activities of sorting and sequential
processing. In many cases, the ETL system is not based on relational technology but
instead may rely on a system of flat files. After validating the data for conformance
with the defined one-to-one and many-to-one business rules, it may be pointless to
take the final step of building a 3NF physical database, just before transforming the
data once again into denormalized structures for the BI presentation area.
However, there are cases in which the data arrives at the doorstep of the ETL
system in a 3NF relational format. In these situations, the ETL system develop-
ers may be more comfortable performing the cleansing and transformation tasks
using normalized structures. Although a normalized database for ETL processing
is acceptable, we have some reservations about this approach. The creation of both
normalized structures for the ETL and dimensional structures for presentation
means that the data is potentially extracted, transformed, and loaded twice: once
into the normalized database and then again when you load the dimensional model.
Obviously, this two-step process requires more time and investment for the develop-
ment, more time for the periodic loading or updating of data, and more capacity to
store the multiple copies of the data. At the bottom line, this typically translates into
the need for larger development, ongoing support, and hardware platform budgets.
Unfortunately, some DW/BI initiatives have failed miserably because they focused
all their energy and resources on constructing the normalized structures rather
than allocating time to developing a dimensional presentation area that supports
improved business decision making. Although enterprise-wide data consistency is a
fundamental goal of the DW/BI environment, there may be effective and less costly
approaches than physically creating normalized tables in the ETL system, if these
structures don’t already exist.
NOTE It is acceptable to create a normalized database to support the ETL
processes; however, this is not the end goal. The normalized structures must be
o -limits to user queries because they defeat the twin goals of understandability
and performance.
Presentation Area to Support Business Intelligence
The DW/BI presentation area is where data is organized, stored, and made available
for direct querying by users, report writers, and other analytical BI applications.
Because the back room ETL system is off-limits, the presentation area is the DW/BI
environment as far as the business community is concerned; it is all the business
sees and touches via their access tools and BI applications. The original pre-release
working title for the first edition of The Data Warehouse Toolkit was Getting the Data
Out. This is what the presentation area with its dimensional models is all about.
We have several strong opinions about the presentation area. First of all, we insist
that the data be presented, stored, and accessed in dimensional schemas, either
relational star schemas or OLAP cubes. Fortunately, the industry has matured to the
point where we’re no longer debating this approach; it has concluded that dimen-
sional modeling is the most viable technique for delivering data to DW/BI users.
Our second stake in the ground about the presentation area is that it must
contain detailed, atomic data. Atomic data is required to withstand assaults from
unpredictable ad hoc user queries. Although the presentation area also may contain
performance-enhancing aggregated data, it is not sufficient to deliver these sum-
maries without the underlying granular data in a dimensional form. In other words,
it is completely unacceptable to store only summary data in dimensional models
while the atomic data is locked up in normalized models. It is impractical to expect
a user to drill down through dimensional data almost to the most granular level and
then lose the benefits of a dimensional presentation at the final step. Although DW/
BI users and applications may look infrequently at a single line item on an order,
they may be very interested in last week’s orders for products of a given size (or
flavor, package type, or manufacturer) for customers who first purchased within
the last 6 months (or reside in a given state or have certain credit terms). The most
finely grained data must be available in the presentation area so that users can ask
the most precise questions possible. Because users’ requirements are unpredictable
and constantly changing, you must provide access to the exquisite details so they
can roll up to address the questions of the moment.
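As one illustration, a performance-enhancing aggregate can be derived directly from the atomic fact table; the aggregate table name is hypothetical and the CREATE TABLE ... AS SELECT syntax varies by platform, but the key point is that the summary supplements the granular rows rather than replacing them:

CREATE TABLE monthly_brand_sales_aggregate AS
SELECT
    d.year,
    d.month,
    p.brand_name,
    SUM(f.sales_dollars) AS sales_dollars,
    SUM(f.sales_units)   AS sales_units
FROM retail_sales_fact f
JOIN date_dimension    d ON d.date_key    = f.date_key
JOIN product_dimension p ON p.product_key = f.product_key
GROUP BY d.year, d.month, p.brand_name;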
The presentation data area should be structured around business process mea-
surement events. This approach naturally aligns with the operational source data
capture systems. Dimensional models should correspond to physical data capture
events; they should not be designed to deliver the report-of-the-day. An enterprise’s
business processes cross the boundaries of organizational departments and func-
tions. In other words, you should construct a single fact table for atomic sales metrics
rather than populating separate similar, but slightly different, databases containing
sales metrics for the sales, marketing, logistics, and finance teams.
All the dimensional structures must be built using common, conformed dimen-
sions. This is the basis of the enterprise data warehouse bus architecture described
in Chapter 4. Adherence to the bus architecture is the final stake in the ground
for the presentation area. Without shared, conformed dimensions, a dimensional
model becomes a standalone application. Isolated stovepipe data sets that cannot be
tied together are the bane of the DW/BI movement as they perpetuate incompatible
views of the enterprise. If you have any hope of building a robust and integrated
DW/BI environment, you must commit to the enterprise bus architecture. When
dimensional models have been designed with conformed dimensions, they can be
readily combined and used together. The presentation area in a large enterprise
DW/BI solution ultimately consists of dozens of dimensional models with many of
the associated dimension tables shared across fact tables.
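For example, when two business process fact tables share a conformed product dimension, their results can be combined with a drill-across query; the orders and inventory fact tables and their measures below are hypothetical stand-ins, and each fact table is aggregated separately before the answer sets are merged on the shared dimension attribute:

SELECT
    o.brand_name,
    o.order_dollars,
    i.quantity_on_hand
FROM
    (SELECT p.brand_name, SUM(f.order_dollars) AS order_dollars
     FROM orders_fact f
     JOIN product_dimension p ON p.product_key = f.product_key
     GROUP BY p.brand_name) o
JOIN
    (SELECT p.brand_name, SUM(f.quantity_on_hand) AS quantity_on_hand
     FROM inventory_fact f
     JOIN product_dimension p ON p.product_key = f.product_key
     GROUP BY p.brand_name) i
    ON o.brand_name = i.brand_name;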
Using the bus architecture is the secret to building distributed DW/BI systems.
When the bus architecture is used as a framework, you can develop the enterprise
data warehouse in an agile, decentralized, realistically scoped, iterative manner.
NOTE Data in the queryable presentation area of the DW/BI system must be
dimensional, atomic (complemented by performance-enhancing aggregates), busi-
ness process-centric, and adhere to the enterprise data warehouse bus architecture.
The data must not be structured according to individual departments’ interpreta-
tion of the data.
Business Intelligence Applications
The final major component of the Kimball DW/BI architecture is the business intelligence
(BI) application. The term BI application loosely refers to the range of capabilities pro-
vided to business users to leverage the presentation area for analytic decision making.
By definition, all BI applications query the data in the DW/BI presentation area.
Querying, obviously, is the whole point of using data for improved decision making.
A BI application can be as simple as an ad hoc query tool or as complex as a sophis-
ticated data mining or modeling application. Ad hoc query tools, as powerful as they
are, can be understood and used effectively by only a small percentage of the potential
DW/BI business user population. Most business users will likely access the data via
prebuilt parameter-driven applications and templates that do not require users to con-
struct queries directly. Some of the more sophisticated applications, such as modeling
or forecasting tools, may upload results back into the operational source systems, ETL
system, or presentation area.
Restaurant Metaphor for the Kimball Architecture
One of our favorite metaphors reinforces the importance of separating the overall
DW/BI environment into distinct components. In this case, we’ll consider the simi-
larities between a restaurant and the DW/BI environment.
ETL in the Back Room Kitchen
The ETL system is analogous to the kitchen of a restaurant. The restaurant’s kitchen
is a world unto itself. Talented chefs take raw materials and transform them into
appetizing, delicious meals for the restaurant’s diners. But long before a commercial
kitchen swings into operation, a significant amount of planning goes into designing
the workspace layout and components.
The kitchen is organized with several design goals in mind. First, the layout must
be highly efficient. Restaurant managers want high kitchen throughput. When the
restaurant is packed and everyone is hungry, there is no time for wasted movement.
Delivering consistent quality from the restaurant’s kitchen is the second important
goal. The establishment is doomed if the plates coming out of the kitchen repeat-
edly fail to meet expectations. To achieve consistency, chefs create their special
sauces once in the kitchen, rather than sending ingredients out to the table where
variations will inevitably occur. Finally, the kitchen’s output, the meals delivered
to restaurant customers, must also be of high integrity. You wouldn’t want someone
to get food poisoning from dining at your restaurant. Consequently, kitchens are
designed with integrity in mind; salad preparation doesn’t happen on the same
surfaces where raw chicken is handled.
Just as quality, consistency, and integrity are major considerations when designing
the restaurant’s kitchen, they are also ongoing concerns for everyday management
of the restaurant. Chefs strive to obtain the best raw materials possible. Procured
products must meet quality standards and are rejected if they don’t meet minimum
standards. Most fine restaurants modify their menus based on the availability of
quality ingredients.
The restaurant sta s its kitchen with skilled professionals wielding the tools of
their trade. Cooks manipulate razor-sharp knives with incredible confi dence and
ease. They operate powerful equipment and work around extremely hot surfaces
without incident.
Given the dangerous surroundings, the back room kitchen is off limits to res-
taurant patrons. Things happen in the kitchen that customers just shouldn’t see. It
simply isn’t safe. Professional cooks handling sharp knives shouldn’t be distracted
by diners’ inquiries. You also wouldn’t want patrons entering the kitchen to dip their
fingers into a sauce to see whether they want to order an entree. To prevent these
intrusions, most restaurants have a closed door that separates the kitchen from the
area where diners are served. Even restaurants that boast an open kitchen format
typically have a barrier, such as a partial wall of glass, separating the two environ-
ments. Diners are invited to watch but can’t wander into the kitchen. Although part
of the kitchen may be visible, there are always out-of-view back rooms where the
less visually desirable preparation occurs.
The data warehouse’s ETL system resembles the restaurant’s kitchen. Source data
is magically transformed into meaningful, presentable information. The back room
ETL system must be laid out and architected long before any data is extracted from
the source. Like the kitchen, the ETL system is designed to ensure throughput.
It must transform raw source data into the target model efficiently, minimizing
unnecessary movement.
Obviously, the ETL system is also highly concerned about data quality, integrity, and
consistency. Incoming data is checked for reasonable quality as it enters. Conditions
are continually monitored to ensure ETL outputs are of high integrity. Business rules
to consistently derive value-add metrics and attributes are applied once by skilled
professionals in the ETL system rather than relying on each patron to develop them
independently. Yes, that puts extra burden on the ETL team, but it’s done to deliver a
better, more consistent product to the DW/BI patrons.
NOTE A properly designed DW/BI environment trades off work in the front
room BI applications in favor of work in the back room ETL system. Front room
work must be done over and over by business users, whereas back room work is
done once by the ETL staff.
Finally, the ETL system should be off limits to the business users and BI application
developers. Just as you don’t want restaurant patrons wandering into the kitchen
and potentially consuming semi-cooked food, you don’t want busy ETL profession-
als distracted by unpredictable inquiries from BI users. The consequences might
be highly unpleasant if users dip their fingers into interim staging pots while data
preparation is still in process. As with the restaurant kitchen, activities occur in
the ETL system that the DW/BI patrons shouldn’t see. When the data is ready and
quality checked for user consumption, it’s brought through the doorway into the
DW/BI presentation area.
Data Presentation and BI in the Front Dining Room
Now turn your attention to the restaurant’s dining room. What are the key fac-
tors that differentiate restaurants? According to the popular restaurant ratings and
reviews, restaurants are typically scored on four distinct qualities:
Food (quality, taste, and presentation)
Decor (appealing, comfortable surroundings for the patrons)
Service (prompt food delivery, attentive support staff, and food received
as ordered)
Cost
Most patrons focus initially on the food score when they’re evaluating dining
options. First and foremost, does the restaurant serve good food? That’s the res-
taurant’s primary deliverable. However, the decor, service, and cost factors also
a ect the patrons’ overall dining experience and are considerations when evaluating
whether to eat at a restaurant.
Of course, the primary deliverable from the DW/BI kitchen is the data in
the presentation area. What data is available? Like the restaurant, the DW/BI
system provides “menus” to describe what’s available via metadata, published
reports, and parameterized analytic applications. The DW/BI patrons expect con-
sistency and high quality. The presentation area’s data must be properly prepared
and safe to consume.
The presentation area’s decor should be organized for the patrons’ comfort. It
must be designed based on the preferences of the BI diners, not the development
sta . Service is also critical in the DW/BI system. Data must be delivered, as ordered,
promptly in a form that is appealing to the business user or BI application developer.
Finally, cost is a factor for the DW/BI system. The kitchen staff may be dream-
ing up elaborate, expensive meals, but if there’s no market at that price point, the
restaurant won’t survive.
If restaurant patrons like their dining experience, then everything is rosy for
the restaurant manager. The dining room is always busy; sometimes there’s even
a waiting list. The restaurant manager’s performance metrics are all promising:
high numbers of diners, table turnovers, and nightly revenue and profit, while staff
turnover is low. Things look so good that the restaurant’s owner is considering an
expansion site to handle the traffic. On the other hand, if the restaurant’s diners
aren’t happy, things go downhill in a hurry. With a limited number of patrons,
the restaurant isn’t making enough money to cover its expenses, and the staff isn’t
making any tips. In a relatively short time, the restaurant closes.
Restaurant managers often proactively check on their diners’ satisfaction with
the food and dining experience. If a patron is unhappy, they take immediate action
to rectify the situation. Similarly, DW/BI managers should proactively monitor sat-
isfaction. You can’t afford to wait to hear complaints. Often, people will abandon
a restaurant without even voicing their concerns. Over time, managers notice that
diner counts have dropped but may not even know why.
Inevitably, the prior DW/BI patrons will locate another “restaurant” that bet-
ter suits their needs and preferences, wasting the millions of dollars invested to
design, build, and staff the DW/BI system. Of course, you can prevent this unhappy
ending by managing the restaurant proactively; make sure the kitchen is properly
organized and utilized to deliver, as needed, on the presentation area’s food, decor,
service, and cost.
Alternative DW/BI Architectures
Having just described the Kimball architecture, let’s discuss several other DW/BI
architectural approaches. We’ll quickly review the two dominant alternatives to the
Kimball architecture, highlighting the similarities and differences. We’ll then close
this section by focusing on a hybrid approach that combines alternatives.
Fortunately, over the past few decades, the differences between the Kimball
architecture and the alternatives have softened. Even more fortunate, there’s a role
for dimensional modeling regardless of your architectural predisposition.
We acknowledge that organizations have successfully constructed DW/BI systems
based on the approaches advocated by others. We strongly believe that rather than
encouraging more consternation over our philosophical differences, the industry
would be far better off devoting energy to ensure that our DW/BI deliverables are
broadly accepted by the business to make better, more informed decisions. The
architecture should merely be a means to this objective.
Independent Data Mart Architecture
With this approach, analytic data is deployed on a departmental basis without con-
cern for sharing and integrating information across the enterprise, as illustrated in
Figure 1-8. Typically, a single department identifies requirements for data from an
operational source system. The department works with IT staff or outside consul-
tants to construct a database that satisfies their departmental needs, reflecting their
business rules and preferred labeling. Working in isolation, this departmental data
mart addresses the department’s analytic requirements.
Meanwhile, another department is interested in the same source data. It’s extremely
common for multiple departments to be interested in the same performance met-
rics resulting from an organization’s core business process events. But because this
department doesn’t have access to the data mart initially constructed by the other
department, it proceeds down a similar path on its own, obtaining resources and
building a departmental solution that contains similar, but slightly different data.
When business users from these two departments discuss organizational perfor-
mance based on reports from their respective repositories, not surprisingly, none of
the numbers match because of the differences in business rules and labeling.
Source transactions feed, through separate uncoordinated ETL processes, a data mart for Department #1, a data mart for Department #2, and a data mart for Department #3 in the back room, each serving its own departmental BI applications in the front room.
Figure 1-8: Simplified illustration of the independent data mart “architecture.”
These standalone analytic silos represent a DW/BI “architecture” that’s essen-
tially un-architected. Although no industry leaders advocate these independent
data marts, this approach is prevalent, especially in large organizations. It mirrors
the way many organizations fund IT projects, plus it requires zero cross-organi-
zational data governance and coordination. It’s the path of least resistance for fast
development at relatively low cost, at least in the short run. Of course, multiple
uncoordinated extracts from the same operational sources and redundant storage
of analytic data are inefficient and wasteful in the long run. Without any enterprise
perspective, this independent approach results in myriad standalone point solutions
that perpetuate incompatible views of the organization’s performance, resulting in
unnecessary organizational debate and reconciliation.
We strongly discourage the independent data mart approach. However, often
these independent data marts have embraced dimensional modeling because they’re
interested in delivering data that’s easy for the business to understand and highly
responsive to queries. So our concepts of dimensional modeling are often applied
in this architecture, despite the complete disregard for some of our core tenets, such
as focusing on atomic details, building by business process instead of department,
and leveraging conformed dimensions for enterprise consistency and integration.
Hub-and-Spoke Corporate Information Factory
Inmon Architecture
The hub-and-spoke Corporate Information Factory (CIF) approach is advocated
by Bill Inmon and others in the industry. Figure 1-9 illustrates a simplified version
of the CIF, focusing on the core elements and concepts that warrant discussion.
Source transactions flow through a data acquisition (ETL) process into the back room Enterprise Data Warehouse (EDW), a user-queryable repository of atomic data in normalized (3NF) tables; data delivery processes then populate front room data marts that are dimensional, often summarized, and often departmental, which are accessed by the BI applications.
Figure 1-9: Simplified illustration of the hub-and-spoke Corporate Information Factory architecture.
With the CIF, data is extracted from the operational source systems and processed
through an ETL system sometimes referred to as data acquisition. The atomic data
that results from this processing lands in a 3NF database; this normalized, atomic
repository is referred to as the Enterprise Data Warehouse (EDW) within the CIF
architecture. Although the Kimball architecture enables optional normalization to
support ETL processing, the normalized EDW is a mandatory construct in the CIF.
Like the Kimball approach, the CIF advocates enterprise data coordination and inte-
gration. The CIF says the normalized EDW fi lls this role, whereas the Kimball archi-
tecture stresses the importance of an enterprise bus with conformed dimensions.
NOTE The process of normalization does not technically speak to integration.
Normalization simply creates physical tables that implement many-to-one rela-
tionships. Integration, on the other hand, requires that inconsistencies arising
from separate sources be resolved. Separate incompatible database sources can be
normalized to the hilt without addressing integration. The Kimball architecture
based on conformed dimensions reverses this logic and focuses on resolving data
inconsistencies without explicitly requiring normalization.
Organizations that have adopted the CIF approach often have business users
accessing the EDW repository due to its level of detail or data availability timeli-
ness. However, subsequent ETL data delivery processes also populate downstream
reporting and analytic environments to support business users. Although often
dimensionally structured, the resultant analytic databases typically differ from
structures in the Kimball architecture’s presentation area in that they’re frequently
departmentally-centric (rather than organized around business processes) and popu-
lated with aggregated data (rather than atomic details). If the data delivery ETL
processes apply business rules beyond basic summarization, such as departmental
renaming of columns or alternative calculations, it may be difficult to tie these
analytic databases to the EDW’s atomic repository.
NOTE The most extreme form of a pure CIF architecture is unworkable as a data
warehouse, in our opinion. Such an architecture locks the atomic data in difficult-
to-query normalized structures, while delivering departmentally incompatible data
marts to different groups of business users. But before being too depressed by this
view, stay tuned for the next section.
Hybrid Hub-and-Spoke and Kimball Architecture
The final architecture warranting discussion is the marriage of the Kimball and
Inmon CIF architectures. As illustrated in Figure 1-10, this architecture populates
a CIF-centric EDW that is completely off-limits to business users for analysis and
reporting. It’s merely the source to populate a Kimball-esque presentation area
in which the data is dimensional, atomic (complemented by aggregates), process-
centric, and conforms to the enterprise data warehouse bus architecture.
Some proponents of this blended approach claim it’s the best of both worlds. Yes, it
blends the two enterprise-oriented approaches. It may leverage a preexisting invest-
ment in an integrated repository, while addressing the performance and usability
issues associated with the 3NF EDW by offloading queries to the dimensional presen-
tation area. And because the end deliverable to the business users and BI applications
is constructed based on Kimball tenets, who can argue with the approach?
If you’ve already invested in the creation of a 3NF EDW, but it’s not delivering
on the users’ expectations of fast and flexible reporting and analysis, this hybrid
approach might be appropriate for your organization. If you’re starting with a blank
sheet of paper, the hybrid approach will likely cost more time and money, both dur-
ing development and ongoing operation, given the multiple movements of data and
redundant storage of atomic details. If you have the appetite, the perceived need, and
perhaps most important, the budget and organizational patience to fully normalize
and instantiate your data before loading it into dimensional structures that are well
designed according to the Kimball methods, go for it.
Source transactions are loaded via ETL into a back room Enterprise Data Warehouse (EDW) of normalized (3NF) tables holding atomic data; a second ETL pass populates the front room presentation area, which is dimensional (star schema or OLAP cube), holds atomic and summary data, is organized by business process, uses conformed dimensions, and adheres to the enterprise DW bus architecture, and which is accessed by the BI applications.
Figure 1-10: Hybrid architecture with 3NF structures and dimensional Kimball
presentation area.
Dimensional Modeling Myths
Despite the widespread acceptance of dimensional modeling, some misperceptions
persist in the industry. These false assertions are a distraction, especially when you
want to align your team around common best practices. If folks in your organiza-
tion continually lob criticisms about dimensional modeling, this section should
be on their recommended reading list; their perceptions may be clouded by these
common misunderstandings.
Myth 1: Dimensional Models are Only
for Summary Data
This first myth is frequently the root cause of ill-designed dimensional models.
Because you can’t possibly predict all the questions asked by business users, you
need to provide them with queryable access to the most detailed data so they can
roll it up based on the business question. Data at the lowest level of detail is practi-
cally impervious to surprises or changes. Summary data should complement the
granular detail solely to provide improved performance for common queries, but
not replace the details.
A related corollary to this first myth is that only a limited amount of historical
data should be stored in dimensional structures. Nothing about a dimensional model
prohibits storing substantial history. The amount of history available in dimensional
models must only be driven by the business’s requirements.
Myth 2: Dimensional Models are Departmental,
Not Enterprise
Rather than drawing boundaries based on organizational departments, dimensional
models should be organized around business processes, such as orders, invoices, and
service calls. Multiple business functions often want to analyze the same metrics
resulting from a single business process. Multiple extracts of the same source data
that create multiple, inconsistent analytic databases should be avoided.
Myth 3: Dimensional Models are Not Scalable
Dimensional models are extremely scalable. Fact tables frequently have billions of
rows; fact tables containing 2 trillion rows have been reported. The database ven-
dors have wholeheartedly embraced DW/BI and continue to incorporate capabilities
into their products to optimize dimensional models’ scalability and performance.
Both normalized and dimensional models contain the same information and data
relationships; the logical content is identical. Every data relationship expressed in
one model can be accurately expressed in the other. Both normalized and dimen-
sional models can answer exactly the same questions, albeit with varying difficulty.
Myth 4: Dimensional Models are Only
for Predictable Usage
Dimensional models should not be designed by focusing on predefined reports
or analyses; the design should center on measurement processes. Obviously, it’s
important to consider the BI application’s filtering and labeling requirements. But
you shouldn’t design for a top ten list of reports in a vacuum because this list is
bound to change, making the dimensional model a moving target. The key is to
focus on the organization’s measurement events that are typically stable, unlike
analyses that are constantly evolving.
A related corollary is that dimensional models aren’t responsive to changing busi-
ness needs. On the contrary, because of their symmetry, dimensional structures are
extremely flexible and adaptive to change. The secret to query flexibility is building
fact tables at the most granular level. Dimensional models that deliver only summary
data are bound to be problematic; users run into analytic brick walls when they try
to drill down into details not available in the summary tables. Developers also run
into brick walls because they can’t easily accommodate new dimensions, attributes,
or facts with these prematurely summarized tables. The correct starting point for
your dimensional models is to express data at the lowest detail possible for maxi-
mum flexibility and extensibility. Remember, when you pre-suppose the business
question, you’ll likely pre-summarize the data, which can be fatal in the long run.
As the architect Mies van der Rohe is credited with saying, “God is in the details.”
Delivering dimensional models populated with the most detailed data possible ensures
maximum flexibility and extensibility. Delivering anything less in your dimensional
models undermines the foundation necessary for robust business intelligence.
Myth 5: Dimensional Models Can’t Be Integrated
Dimensional models most certainly can be integrated if they conform to the enterprise
data warehouse bus architecture. Conformed dimensions are built and maintained
as centralized, persistent master data in the ETL system and then reused across
dimensional models to enable data integration and ensure semantic consistency. Data
integration depends on standardized labels, values, and definitions. It is hard work
to reach organizational consensus and then implement the corresponding ETL rules,
but you can’t dodge the effort, regardless of whether you’re populating normalized
or dimensional models.
Presentation area databases that don’t adhere to the bus architecture
with shared conformed dimensions lead to standalone solutions. You can’t hold
dimensional modeling responsible for organizations’ failure to embrace one of its
fundamental tenets.
More Reasons to Think Dimensionally
The majority of this book focuses on dimensional modeling for designing databases
in the DW/BI presentation area. But dimensional modeling concepts go beyond the
design of simple and fast data structures. You should think dimensionally at other
critical junctures of a DW/BI project.
When gathering requirements for a DW/BI initiative, you need to listen for and
then synthesize the findings around business processes. Sometimes teams get lulled
into focusing on a set of required reports or dashboard gauges. Instead you should
constantly ask yourself about the business process measurement events producing
the report or dashboard metrics. When specifying the project’s scope, you must stand
firm to focus on a single business process per project and not sign up to deploy a
dashboard that covers a handful of them in a single iteration.
Although it’s critical that the DW/BI team concentrates on business processes, it’s
equally important to get IT and business management on the same wavelength. Due
to historical IT funding policies, the business may be more familiar with depart-
mental data deployments. You need to shift their mindset about the DW/BI rollout
to a process perspective. When prioritizing opportunities and developing the DW/
BI roadmap, business processes are the unit of work. Fortunately, business man-
agement typically embraces this approach because it mirrors their thinking about
key performance indicators. Plus, they’ve lived with the inconsistencies, incessant
debates, and never ending reconciliations caused by the departmental approach, so
they’re ready for a fresh tactic. Working with business leadership partners, rank each
business process on business value and feasibility, then tackle processes with the
highest impact and feasibility scores fi rst. Although prioritization is a joint activity
with the business, your underlying understanding of the organization’s business
processes is essential to its e ectiveness and subsequent actionability.
If tasked with drafting the DW/BI system’s data architecture, you need to wrap
your head around the organization’s processes, along with the associated master
descriptive dimension data. The prime deliverable for this activity, the enterprise
data warehouse bus matrix, will be fully vetted in Chapter 4. The matrix also serves
as a useful tool for touting the potential benefits of a more rigorous master data
management platform.
Data stewardship or governance programs should focus first on the major dimen-
sions. Depending on the industry, the list might include date, customer, product,
employee, facility, provider, student, faculty, account, and so on. Thinking about
the central nouns used to describe the business translates into a list of data gov-
ernance e orts to be led by subject matter experts from the business community.
Establishing data governance responsibilities for these nouns is the key to eventually
deploying dimensions that deliver consistency and address the business’s needs for
analytic filtering, grouping, and labeling. Robust dimensions translate into robust
DW/BI systems.
As you can see, the fundamental motivation for dimensional modeling is front and
center long before you design star schemas or OLAP cubes. Likewise, the dimen-
sional model will remain in the forefront during the subsequent ETL system and BI
application designs. Dimensional modeling concepts link the business and technical
communities together as they jointly design the DW/BI deliverables. We’ll elaborate
on these ideas in Chapter 17: Kimball DW/BI Lifecycle Overview and Chapter 18:
Dimensional Modeling Process and Tasks, but wanted to plant the seeds early so
they have time to germinate.
Agile Considerations
Currently, there’s significant interest within the DW/BI industry in agile development
practices. At the risk of oversimplification, agile methodologies focus on manage-
ably sized increments of work that can be completed within reasonable timeframes
measured in weeks, rather than tackling a much larger scoped (and hence riskier)
project with deliverables promised in months or years. Sounds good, doesn’t it?
Many of the core tenets of agile methodologies align with Kimball best practices,
including
Focus on delivering business value. This has been the Kimball mantra for
decades.
Value collaboration between the development team and business stakehold-
ers. Like the agile camp, we strongly encourage a close partnership with the
business.
Stress ongoing face-to-face communication, feedback, and prioritization with
the business stakeholders.
Adapt quickly to inevitably evolving requirements.
Tackle development in an iterative, incremental manner.
Although this list is compelling, a common criticism of the agile approaches is the
lack of planning and architecture, coupled with ongoing governance challenges. The
enterprise data warehouse bus matrix is a powerful tool to address these shortcom-
ings. The bus matrix provides a framework and master plan for agile development,
plus identifies the reusable common descriptive dimensions that provide both data
consistency and reduced time-to-market delivery. With the right collaborative mix
of business and IT stakeholders in a room, the enterprise data warehouse bus matrix
can be produced in relatively short order. Incremental development work can produce
components of the framework until sufficient functionality is available and then
released to the business community.
Some clients and students lament that although they want to deliver consistently
defined conformed dimensions in their DW/BI environments, it’s “just not feasible.”
They explain that they would if they could, but with the focus on agile development
techniques, it’s “impossible” to take the time to get organizational agreement on
conformed dimensions. We argue that conformed dimensions enable agile DW/BI
development, along with agile decision making. As you flesh out the portfolio of mas-
ter conformed dimensions, the development crank starts turning faster and faster.
The time-to-market for a new business process data source shrinks as developers
reuse existing conformed dimensions. Ultimately, new ETL development focuses
almost exclusively on delivering more fact tables because the associated dimension
tables are already sitting on the shelf ready to go.
Without a framework like the enterprise data warehouse bus matrix, some DW/
BI teams have fallen into the trap of using agile techniques to create analytic or
reporting solutions in a vacuum. In most situations, the team worked with a small
set of users to extract a limited set of source data and make it available to solve
their unique problems. The outcome is often a standalone data stovepipe that others
can’t leverage, or worse yet, delivers data that doesn’t tie to the organization’s other
analytic information. We encourage agility when appropriate; however, building
isolated data sets should be avoided. As with most things in life, moderation and
balance between extremes is almost always prudent.
Summary
In this chapter we discussed the overriding goals for DW/BI systems and the fun-
damental concepts of dimensional modeling. The Kimball DW/BI architecture and
several alternatives were compared. We closed out the chapter by identifying com-
mon misunderstandings that some still hold about dimensional modeling, despite
its widespread acceptance across the industry, and challenged you to think dimen-
sionally beyond data modeling. In the next chapter, you get a turbocharged tour
of dimensional modeling patterns and techniques, and then begin putting these
concepts into action in your first case study in Chapter 3.
Kimball Dimensional Modeling Techniques Overview
Starting with the first edition of The Data Warehouse Toolkit (Wiley, 1996), the
Kimball Group has defined the complete set of techniques for modeling data
in a dimensional way. In the first two editions of this book, we felt the techniques
needed to be introduced through familiar use cases drawn from various industries.
Although we still feel business use cases are an essential pedagogical approach, the
techniques have become so standardized that some dimensional modelers reverse
the logic by starting with the technique and then proceeding to the use case for
context. All of this is good news!
The Kimball techniques have been accepted as industry best practices.
As evidence, some former Kimball University students have published their own
dimensional modeling books. These books usually explain the Kimball techniques
accurately, but it is a sign of our techniques’ resilience that alternative books have
not extended the library of techniques in significant ways or offered conflicting
guidance.
This chapter is the “official” list of Kimball Dimensional Modeling Techniques
from the inventors of these design patterns. We don’t expect you to read this chapter
from beginning to end at first. But we intend the chapter to be a reference for our
techniques. With each technique, we’ve included pointers to subsequent chapters
for further explanation and illustrations based on the motivating use cases.
Fundamental Concepts
The techniques in this section must be considered during every dimensional
design. Nearly every chapter in the book references or illustrates the concepts in
this section.
Gather Business Requirements and Data Realities
Before launching a dimensional modeling effort, the team needs to understand the
needs of the business, as well as the realities of the underlying source data. You
uncover the requirements via sessions with business representatives to understand
their objectives based on key performance indicators, compelling business issues,
decision-making processes, and supporting analytic needs. At the same time, data
realities are uncovered by meeting with source system experts and doing high-level
data profiling to assess data feasibilities.
Chapter 1 DW/BI and Dimensional Modeling Primer, p 5
Chapter 3 Retail Sales, p 70
Chapter 11 Telecommunications, p 297
Chapter 17 Lifecycle Overview, p 412
Chapter 18 Dimensional Modeling Process and Tasks, p 431
Chapter 19 ETL Subsystems and Techniques, p 444
Collaborative Dimensional Modeling Workshops
Dimensional models should be designed in collaboration with subject matter experts
and data governance representatives from the business. The data modeler is in
charge, but the model should unfold via a series of highly interactive workshops
with business representatives. These workshops provide another opportunity to
flesh out the requirements with the business. Dimensional models should not be
designed in isolation by folks who don’t fully understand the business and their
needs; collaboration is critical!
Chapter 3 Retail Sales, p 70
Chapter 4 Inventory, p 135
Chapter 18 Dimensional Modeling Process and Tasks, p 429
Four-Step Dimensional Design Process
The four key decisions made during the design of a dimensional model include:
1. Select the business process.
2. Declare the grain.
3. Identify the dimensions.
4. Identify the facts.
The answers to these questions are determined by considering the needs of the
business along with the realities of the underlying source data during the collab-
orative modeling sessions. Following the business process, grain, dimension, and
fact declarations, the design team determines the table and column names, sample
domain values, and business rules. Business data governance representatives must
participate in this detailed design activity to ensure business buy-in.
Chapter 3 Retail Sales, p 70
Chapter 11 Telecommunications, p 300
Chapter 18 Dimensional Modeling Process and Tasks, p 434
Business Processes
Business processes are the operational activities performed by your organization,
such as taking an order, processing an insurance claim, registering students for a
class, or snapshotting every account each month. Business process events generate
or capture performance metrics that translate into facts in a fact table. Most fact
tables focus on the results of a single business process. Choosing the process is
important because it defines a specific design target and allows the grain, dimen-
sions, and facts to be declared. Each business process corresponds to a row in the
enterprise data warehouse bus matrix.
Chapter 1 DW/BI and Dimensional Modeling Primer, p 10
Chapter 3 Retail Sales, p 70
Chapter 17 Lifecycle Overview, p 414
Chapter 18 Dimensional Modeling Process and Tasks, p 435
Grain
Declaring the grain is the pivotal step in a dimensional design. The grain establishes
exactly what a single fact table row represents. The grain declaration becomes a bind-
ing contract on the design. The grain must be declared before choosing dimensions
or facts because every candidate dimension or fact must be consistent with the grain.
This consistency enforces a uniformity on all dimensional designs that is critical to
BI application performance and ease of use. Atomic grain refers to the lowest level at
which data is captured by a given business process. We strongly encourage you to start
by focusing on atomic-grained data because it withstands the assault of unpredictable
user queries; rolled-up summary grains are important for performance tuning, but they
pre-suppose the business’s common questions. Each proposed fact table grain results
in a separate physical table; different grains must not be mixed in the same fact table.
Chapter 1 DW/BI and Dimensional Modeling Primer, p 30
Chapter 3 Retail Sales, p 71
Chapter 4 Inventory, p 112
Chapter 6 Order Management, p 184
Chapter 11 Telecommunications, p 300
Chapter 12 Transportation, p 312
Chapter 18 Dimensional Modeling Process and Tasks, p 435
Dimensions for Descriptive Context
Dimensions provide the “who, what, where, when, why, and how” context surround-
ing a business process event. Dimension tables contain the descriptive attributes
used by BI applications for filtering and grouping the facts. With the grain of a fact
table firmly in mind, all the possible dimensions can be identified. Whenever pos-
sible, a dimension should be single valued when associated with a given fact row.
Dimension tables are sometimes called the “soul” of the data warehouse because
they contain the entry points and descriptive labels that enable the DW/BI system
to be leveraged for business analysis. A disproportionate amount of effort is put
into the data governance and development of dimension tables because they are
the drivers of the users’ BI experience.
Chapter 1 DW/BI and Dimensional Modeling Primer, p 13
Chapter 3 Retail Sales, p 72
Chapter 11 Telecommunications, p 301
Chapter 18 Dimensional Modeling Process and Tasks, p 437
Chapter 19 ETL Subsystems and Techniques, p 463
Facts for Measurements
Facts are the measurements that result from a business process event and are almost
always numeric. A single fact table row has a one-to-one relationship to a measurement
event as described by the fact table’s grain. Thus a fact table corresponds to a physi-
cal observable event, and not to the demands of a particular report. Within a fact
table, only facts consistent with the declared grain are allowed. For example, in a
retail sales transaction, the quantity of a product sold and its extended price are
good facts, whereas the store manager’s salary is disallowed.
Chapter 1 DW/BI and Dimensional Modeling Primer , p 10
Chapter 3 Retail Sales , p 72
Chapter 4 Inventory , p 112
Chapter 18 Dimensional Modeling Process and Tasks , p 437
Star Schemas and OLAP Cubes
Star schemas are dimensional structures deployed in a relational database management
system (RDBMS). They characteristically consist of fact tables linked to associated
dimension tables via primary/foreign key relationships. An online analytical processing
(OLAP) cube is a dimensional structure implemented in a multidimensional database;
it can be equivalent in content to, or more often derived from, a relational star schema.
An OLAP cube contains dimensional attributes and facts, but it is accessed through
languages with more analytic capabilities than SQL, such as XMLA and MDX. OLAP
cubes are included in this list of basic techniques because an OLAP cube is often
the final step in the deployment of a dimensional DW/BI system, or may exist as an
aggregate structure based on a more atomic relational star schema.
Chapter 1 DW/BI and Dimensional Modeling Primer , p 8
Chapter 3 Retail Sales , p 94
Chapter 5 Procurement , p 149
Chapter 6 Order Management , p 170
Chapter 7 Accounting , p 226
Chapter 9 Human Resources Management , p 273
Chapter 13 Education , p 335
Chapter 19 ETL Subsystems and Techniques , p 481
Chapter 20 ETL System Process and Tasks , p 519
Graceful Extensions to Dimensional Models
Dimensional models are resilient when data relationships change. All the following
changes can be implemented without altering any existing BI query or application,
and without any change in query results.
Facts consistent with the grain of an existing fact table can be added by creat-
ing new columns.
Dimensions can be added to an existing fact table by creating new foreign key
columns, presuming they don’t alter the fact table’s grain.
Attributes can be added to an existing dimension table by creating new
columns.
The grain of a fact table can be made more atomic by adding attributes to an exist-
ing dimension table, and then restating the fact table at the lower grain, being
careful to preserve the existing column names in the fact and dimension tables.
Chapter 3 Retail Sales , p 95
Basic Fact Table Techniques
The techniques in this section apply to all fact tables. There are illustrations of fact
tables in nearly every chapter.
Fact Table Structure
A fact table contains the numeric measures produced by an operational measure-
ment event in the real world. At the lowest grain, a fact table row corresponds to a
measurement event and vice versa. Thus the fundamental design of a fact table is
entirely based on a physical activity and is not influenced by the eventual reports
that may be produced. In addition to numeric measures, a fact table always contains
foreign keys for each of its associated dimensions, as well as optional degenerate
dimension keys and date/time stamps. Fact tables are the primary target of compu-
tations and dynamic aggregations arising from queries.
Chapter 1 DW/BI and Dimensional Modeling Primer , p 10
Chapter 3 Retail Sales , p 76
Chapter 5 Procurement, p 143
Chapter 6 Order Management, p 169
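As a minimal sketch only (the table and column names below are illustrative and not drawn from any specific schema in this book), a transaction-grain sales fact table might be declared with one foreign key per dimension, a degenerate dimension, and numeric facts:

    -- Illustrative transaction-grain fact table: one row per product scanned on a receipt
    CREATE TABLE sales_fact (
        date_key              INTEGER       NOT NULL REFERENCES date_dim (date_key),
        product_key           INTEGER       NOT NULL REFERENCES product_dim (product_key),
        store_key             INTEGER       NOT NULL REFERENCES store_dim (store_key),
        promotion_key         INTEGER       NOT NULL REFERENCES promotion_dim (promotion_key),
        pos_transaction_id    VARCHAR(20)   NOT NULL,  -- degenerate dimension; no dimension table
        sales_quantity        INTEGER       NOT NULL,  -- additive fact
        extended_sales_amount NUMERIC(12,2) NOT NULL   -- additive fact
    );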
Additive, Semi-Additive, Non-Additive Facts
The numeric measures in a fact table fall into three categories. The most flexible and
useful facts are fully additive; additive measures can be summed across any of the
dimensions associated with the fact table. Semi-additive measures can be summed
across some dimensions, but not all; balance amounts are common semi-additive facts
because they are additive across all dimensions except time. Finally, some measures
are completely non-additive, such as ratios. A good approach for non-additive facts is,
where possible, to store the fully additive components of the non-additive measure
and sum these components into the final answer set before calculating the final
non-additive fact. This final calculation is often done in the BI layer or OLAP cube.
Chapter 1 DW/BI and Dimensional Modeling Primer , p 10
Chapter 3 Retail Sales , p 76
Chapter 4 Inventory , p 114
Chapter 7 Accounting , p 204
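For example, a margin percentage is non-additive. A sketch of the recommended approach, using hypothetical table and column names, sums the additive margin and sales amounts first and divides only in the final answer set:

    SELECT d.month_name,
           SUM(f.gross_margin_amount) / SUM(f.extended_sales_amount) AS margin_pct
    FROM sales_fact AS f
    JOIN date_dim AS d ON d.date_key = f.date_key
    GROUP BY d.month_name;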
Nulls in Fact Tables
Null-valued measurements behave gracefully in fact tables. The aggregate functions
(SUM, COUNT, MIN, MAX, and AVG) all do the “right thing” with null facts. However,
nulls must be avoided in the fact table’s foreign keys because these nulls would
automatically cause a referential integrity violation. Rather than a null foreign key,
the associated dimension table must have a default row (and surrogate key) repre-
senting the unknown or not applicable condition.
Chapter 3 Retail Sales , p 92
Chapter 20 ETL System Process and Tasks , p 509
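As an illustration (the names and the reserved key value are hypothetical), the ETL process can set aside a default row in each dimension so that fact rows never carry a null foreign key:

    -- Reserve a default dimension row; fact rows reference key -1 instead of null
    INSERT INTO promotion_dim (promotion_key, promotion_name, promotion_type)
    VALUES (-1, 'No Promotion / Not Applicable', 'Not Applicable');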
Conformed Facts
If the same measurement appears in separate fact tables, care must be taken to make
sure the technical definitions of the facts are identical if they are to be compared
or computed together. If the separate fact definitions are consistent, the conformed
facts should be identically named; but if they are incompatible, they should be dif-
ferently named to alert the business users and BI applications.
Chapter 4 Inventory , p 138
Chapter 16 Insurance , p 386
Transaction Fact Tables
A row in a transaction fact table corresponds to a measurement event at a point in
space and time. Atomic transaction grain fact tables are the most dimensional and
expressive fact tables; this robust dimensionality enables the maximum slicing
and dicing of transaction data. Transaction fact tables may be dense or sparse
because rows exist only if measurements take place. These fact tables always con-
tain a foreign key for each associated dimension, and optionally contain precise
time stamps and degenerate dimension keys. The measured numeric facts must be
consistent with the transaction grain.
Chapter 3 Retail Sales , p 79
Chapter 4 Inventory , p 116
Chapter 5 Procurement , p 142
Chapter 6 Order Management , p 168
Chapter 7 Accounting , p 206
Chapter 11 Telecommunications , p 306
Chapter 12 Transportation , p 312
Chapter 14 Healthcare , p 351
Chapter 15 Electronic Commerce , p 363
Chapter 16 Insurance , p 379
Chapter 19 ETL Subsystems and Techniques , p 473
Periodic Snapshot Fact Tables
A row in a periodic snapshot fact table summarizes many measurement events occur-
ring over a standard period, such as a day, a week, or a month. The grain is the
period, not the individual transaction. Periodic snapshot fact tables often contain
many facts because any measurement event consistent with the fact table grain is
permissible. These fact tables are uniformly dense in their foreign keys because
even if no activity takes place during the period, a row is typically inserted in the
fact table containing a zero or null for each fact.
Chapter 4 Inventory , p 113
Chapter 7 Accounting , p 204
Chapter 9 Human Resources Management , p 267
Chapter 10 Financial Services , p 283
Chapter 13 Education , p 333
Chapter 14 Healthcare, p 351
Chapter 16 Insurance , p 385
Chapter 19 ETL Subsystems and Techniques , p 474
Accumulating Snapshot Fact Tables
A row in an accumulating snapshot fact table summarizes the measurement events
occurring at predictable steps between the beginning and the end of a process.
Pipeline or workflow processes, such as order fulfillment or claim processing, that
have a defined start point, standard intermediate steps, and defined end point can be
modeled with this type of fact table. There is a date foreign key in the fact table for
each critical milestone in the process. An individual row in an accumulating snap-
shot fact table, corresponding for instance to a line on an order, is initially inserted
when the order line is created. As pipeline progress occurs, the accumulating fact
table row is revisited and updated. This consistent updating of accumulating snap-
shot fact rows is unique among the three types of fact tables. In addition to the date
foreign keys associated with each critical process step, accumulating snapshot fact
tables contain foreign keys for other dimensions and optionally contain degener-
ate dimensions. They often include numeric lag measurements consistent with the
grain, along with milestone completion counters.
Chapter 4 Inventory , p 118
Chapter 5 Procurement , p 147
Chapter 6 Order Management , p 194
Chapter 13 Education , p 326
Chapter 14 Healthcare , p 342
Chapter 16 Insurance , p 392
Chapter 19 ETL Subsystems and Techniques , p 475
Factless Fact Tables
Although most measurement events capture numerical results, it is possible that
the event merely records a set of dimensional entities coming together at a moment
in time. For example, an event of a student attending a class on a given day may
not have a recorded numeric fact, but a fact row with foreign keys for calendar day,
student, teacher, location, and class is well-defined. Likewise, customer communi-
cations are events, but there may be no associated metrics. Factless fact tables can
also be used to analyze what didn’t happen. These queries always have two parts: a
factless coverage table that contains all the possibilities of events that might happen
and an activity table that contains the events that did happen. When the activity
is subtracted from the coverage, the result is the set of events that did not happen.
Chapter 3 Retail Sales , p 97
Chapter 6 Order Management , p 176
Chapter 13 Education , p 329
Chapter 16 Insurance , p 396
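A sketch of the what-didn't-happen pattern, assuming hypothetical promotion coverage and sales schemas: subtracting the activity from the coverage yields the products that were promoted in a store on a day but did not sell:

    SELECT c.date_key, c.store_key, c.product_key
    FROM promotion_coverage_fact AS c
    WHERE NOT EXISTS (
          SELECT 1
          FROM sales_fact AS s
          WHERE s.date_key    = c.date_key
            AND s.store_key   = c.store_key
            AND s.product_key = c.product_key);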
Aggregate Fact Tables or OLAP Cubes
Aggregate fact tables are simple numeric rollups of atomic fact table data built solely
to accelerate query performance. These aggregate fact tables should be available to
the BI layer at the same time as the atomic fact tables so that BI tools smoothly
choose the appropriate aggregate level at query time. This process, known as
aggregate navigation, must be open so that every report writer, query tool, and BI
application harvests the same performance benefits. A properly designed set of
aggregates should behave like database indexes, which accelerate query perfor-
mance but are not encountered directly by the BI applications or business users.
Aggregate fact tables contain foreign keys to shrunken conformed dimensions, as
well as aggregated facts created by summing measures from more atomic fact tables.
Finally, aggregate OLAP cubes with summarized measures are frequently built in
the same way as relational aggregates, but the OLAP cubes are meant to be accessed
directly by the business users.
Chapter 15 Electronic Commerce , p 366
Chapter 19 ETL Subsystems and Techniques , p 481
Chapter 20 ETL System Process and Tasks , p 519
Consolidated Fact Tables
It is often convenient to combine facts from multiple processes together into a single
consolidated fact table if they can be expressed at the same grain. For example, sales
actuals can be consolidated with sales forecasts in a single fact table to make the task
of analyzing actuals versus forecasts simple and fast, as compared to assembling a
drill-across application using separate fact tables. Consolidated fact tables add bur-
den to the ETL processing, but ease the analytic burden on the BI applications. They
should be considered for cross-process metrics that are frequently analyzed together.
Chapter 7 Accounting , p 224
Chapter 16 Insurance , p 395
Basic Dimension Table Techniques
The techniques in this section apply to all dimension tables. Dimension tables are
discussed and illustrated in every chapter.
Dimension Table Structure
Every dimension table has a single primary key column. This primary key is embedded
as a foreign key in any associated fact table where the dimension row's descriptive
context is exactly correct for that fact table row. Dimension tables are usually wide, flat
denormalized tables with many low-cardinality text attributes. While operational codes
and indicators can be treated as attributes, the most powerful dimension attributes
are populated with verbose descriptions. Dimension table attributes are the primary
target of constraints and grouping specifications from queries and BI applications. The
descriptive labels on reports are typically dimension attribute domain values.
Chapter 1 DW/BI and Dimensional Modeling Primer , p 13
Chapter 3 Retail Sales , p 79
Chapter 11 Telecommunications , p 301
Dimension Surrogate Keys
A dimension table is designed with one column serving as a unique primary key.
This primary key cannot be the operational system’s natural key because there will
be multiple dimension rows for that natural key when changes are tracked over time.
In addition, natural keys for a dimension may be created by more than one source
system, and these natural keys may be incompatible or poorly administered. The
DW/BI system needs to claim control of the primary keys of all dimensions; rather
than using explicit natural keys or natural keys with appended dates, you should
create anonymous integer primary keys for every dimension. These dimension sur-
rogate keys are simple integers, assigned in sequence, starting with the value 1,
every time a new key is needed. The date dimension is exempt from the surrogate
key rule; this highly predictable and stable dimension can use a more meaningful
primary key. See the section “Calendar Date Dimensions.”
Chapter 3 Retail Sales , p 98
Chapter 19 ETL Subsystems and Techniques , p 469
Chapter 20 ETL System Process and Tasks , p 506
Natural, Durable, and Supernatural Keys
Natural keys created by operational source systems are subject to business rules outside
the control of the DW/BI system. For instance, an employee number (natural key) may
be changed if the employee resigns and then is rehired. When the data warehouse
wants to have a single key for that employee, a new durable key must be created that is
persistent and does not change in this situation. This key is sometimes referred to as
a durable supernatural key. The best durable keys have a format that is independent of
the original business process and thus should be simple integers assigned in sequence
beginning with 1. While multiple surrogate keys may be associated with an employee
over time as their profile changes, the durable key never changes.
Chapter 3 Retail Sales , p 100
Chapter 20 ETL System Process and Tasks , p 510
Chapter 21 Big Data Analytics, p 539
Drilling Down
Drilling down is the most fundamental way data is analyzed by business users. Drilling
down simply means adding a row header to an existing query; the new row header
is a dimension attribute appended to the GROUP BY expression in an SQL query. The
attribute can come from any dimension attached to the fact table in the query. Drilling
down does not require the definition of predetermined hierarchies or drill-down paths.
See the section “Drilling Across.”
Chapter 3 Retail Sales , p 86
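For instance, assuming hypothetical sales, store, and product tables, drilling down from district sales to brand within district is nothing more than adding one row header to the GROUP BY:

    -- Original query: sales by district
    SELECT s.district_name, SUM(f.extended_sales_amount) AS sales
    FROM sales_fact AS f
    JOIN store_dim AS s ON s.store_key = f.store_key
    GROUP BY s.district_name;

    -- Drilling down: add the brand attribute as a new row header
    SELECT s.district_name, p.brand_name, SUM(f.extended_sales_amount) AS sales
    FROM sales_fact AS f
    JOIN store_dim   AS s ON s.store_key   = f.store_key
    JOIN product_dim AS p ON p.product_key = f.product_key
    GROUP BY s.district_name, p.brand_name;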
Degenerate Dimensions
Sometimes a dimension is defined that has no content except for its primary key.
For example, when an invoice has multiple line items, the line item fact rows inherit
all the descriptive dimension foreign keys of the invoice, and the invoice is left with
no unique content. But the invoice number remains a valid dimension key for fact
tables at the line item level. This degenerate dimension is placed in the fact table with
the explicit acknowledgment that there is no associated dimension table. Degenerate
dimensions are most common with transaction and accumulating snapshot fact tables.
Chapter 3 Retail Sales , p 93
Chapter 6 Order Management , p 178
Chapter 11 Telecommunications , p 303
Chapter 16 Insurance , p 383
Denormalized Flattened Dimensions
In general, dimensional designers must resist the normalization urges caused by years
of operational database designs and instead denormalize the many-to-one fixed depth
hierarchies into separate attributes on a flattened dimension row. Dimension denor-
malization supports dimensional modeling’s twin objectives of simplicity and speed.
Chapter 1 DW/BI and Dimensional Modeling Primer , p 13
Chapter 3 Retail Sales , p 84
Multiple Hierarchies in Dimensions
Many dimensions contain more than one natural hierarchy. For example, calendar
date dimensions may have a day to week to fiscal period hierarchy, as well as a
day to month to year hierarchy. Location intensive dimensions may have multiple
geographic hierarchies. In all of these cases, the separate hierarchies can gracefully
coexist in the same dimension table.
Chapter 3 Retail Sales , p 88
Chapter 19 ETL Subsystems and Techniques , p 470
Flags and Indicators as Textual Attributes
Cryptic abbreviations, true/false flags, and operational indicators should be sup-
plemented in dimension tables with full text words that have meaning when
independently viewed. Operational codes with embedded meaning within the
code value should be broken down with each part of the code expanded into its
own separate descriptive dimension attribute.
Chapter 3 Retail Sales , p 82
Chapter 11 Telecommunications, p 301
Chapter 16 Insurance , p 383
Null Attributes in Dimensions
Null-valued dimension attributes result when a given dimension row has not been
fully populated, or when there are attributes that are not applicable to all the dimen-
sion's rows. In both cases, we recommend substituting a descriptive string, such as
Unknown or Not Applicable in place of the null value. Nulls in dimension attributes
should be avoided because different databases handle grouping and constraining
on nulls inconsistently.
Chapter 3 Retail Sales , p 92
Calendar Date Dimensions
Calendar date dimensions are attached to virtually every fact table to allow navigation
of the fact table through familiar dates, months, fi scal periods, and special days on
the calendar. You would never want to compute Easter in SQL, but rather want to
look it up in the calendar date dimension. The calendar date dimension typically
has many attributes describing characteristics such as week number, month name,
fiscal period, and national holiday indicator. To facilitate partitioning, the primary
key of a date dimension can be more meaningful, such as an integer representing
YYYYMMDD, instead of a sequentially-assigned surrogate key. However, the date
dimension table needs a special row to represent unknown or to-be-determined
dates. When further precision is needed, a separate date/time stamp can be added
to the fact table. The date/time stamp is not a foreign key to a dimension table, but
rather is a standalone column. If business users constrain or group on time-of-day
attributes, such as day part grouping or shift number, then you would add a separate
time-of-day dimension foreign key to the fact table.
Chapter 3 Retail Sales , p 79
Chapter 7 Accounting , p 208
Chapter 8 Customer Relationship Management , p 238
Chapter 12 Transportation , p 321
Chapter 19 ETL Subsystems and Techniques , p 470
Role-Playing Dimensions
A single physical dimension can be referenced multiple times in a fact table, with
each reference linking to a logically distinct role for the dimension. For instance, a
fact table can have several dates, each of which is represented by a foreign key to the
date dimension. It is essential that each foreign key refers to a separate view of
the date dimension so that the references are independent. These separate dimen-
sion views (with unique attribute column names) are called roles.
Chapter 6 Order Management , p 170
Chapter 12 Transportation , p 312
Chapter 14 Healthcare , p 345
Chapter 16 Insurance , p 380
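A minimal sketch, assuming a hypothetical orders schema with order and ship dates: each role is a view of the single physical date dimension with uniquely named columns, and the fact table joins to each view through its own foreign key:

    CREATE VIEW order_date_dim AS
    SELECT date_key AS order_date_key, full_date AS order_date, month_name AS order_month
    FROM date_dim;

    CREATE VIEW ship_date_dim AS
    SELECT date_key AS ship_date_key, full_date AS ship_date, month_name AS ship_month
    FROM date_dim;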
Junk Dimensions
Transactional business processes typically produce a number of miscellaneous, low-
cardinality flags and indicators. Rather than making separate dimensions for each
flag and attribute, you can create a single junk dimension combining them together.
This dimension, frequently labeled as a transaction profile dimension in a schema,
does not need to be the Cartesian product of all the attributes’ possible values, but
should only contain the combination of values that actually occur in the source data.
Chapter 6 Order Management , p 179
Chapter 12 Transportation , p 318
Chapter 16 Insurance , p 392
Chapter 19 ETL Subsystems and Techniques , p 470
Snowflaked Dimensions
When a hierarchical relationship in a dimension table is normalized, low-cardinal-
ity attributes appear as secondary tables connected to the base dimension table by
an attribute key. When this process is repeated with all the dimension table's hier-
archies, a characteristic multilevel structure is created that is called a snowflake.
Although the snowflake represents hierarchical data accurately, you should avoid
snowflakes because it is difficult for business users to understand and navigate
snowflakes. They can also negatively impact query performance. A flattened denor-
malized dimension table contains exactly the same information as a snowflaked
dimension.
Chapter 3 Retail Sales , p 104
Chapter 11 Telecommunications , p 301
Chapter 20 ETL System Process and Tasks , p 504
Outrigger Dimensions
A dimension can contain a reference to another dimension table. For instance, a
bank account dimension can reference a separate dimension representing the date
the account was opened. These secondary dimension references are called outrigger
dimensions. Outrigger dimensions are permissible, but should be used sparingly. In
most cases, the correlations between dimensions should be demoted to a fact table,
where both dimensions are represented as separate foreign keys.
Chapter 3 Retail Sales , p 106
Chapter 5 Procurement , p 160
Chapter 8 Customer Relationship Management , p 243
Chapter 12 Transportation , p 321
Integration via Conformed Dimensions
One of the marquee successes of the dimensional modeling approach has been to
define a simple but powerful recipe for integrating data from different business
processes.
Conformed Dimensions
Dimension tables conform when attributes in separate dimension tables have the
same column names and domain contents. Information from separate fact tables
can be combined in a single report by using conformed dimension attributes that
are associated with each fact table. When a conformed attribute is used as the
row header (that is, the grouping column in the SQL query), the results from the
separate fact tables can be aligned on the same rows in a drill-across report. This
is the essence of integration in an enterprise DW/BI system. Conformed dimen-
sions, defined once in collaboration with the business's data governance represen-
tatives, are reused across fact tables; they deliver both analytic consistency and
reduced future development costs because the wheel is not repeatedly re-created.
Chapter 4 Inventory , p 130
Chapter 8 Customer Relationship Management , p 256
Chapter 11 Telecommunications , p 304
Chapter 16 Insurance , p 386
Chapter 18 Dimensional Modeling Process and Tasks , p 431
Chapter 19 ETL Subsystems and Techniques , p 461
Shrunken Dimensions
Shrunken dimensions are conformed dimensions that are a subset of rows and/or
columns of a base dimension. Shrunken rollup dimensions are required when con-
structing aggregate fact tables. They are also necessary for business processes that
naturally capture data at a higher level of granularity, such as a forecast by month
and brand (instead of the more atomic date and product associated with sales data).
Another case of conformed dimension subsetting occurs when two dimensions are
at the same level of detail, but one represents only a subset of rows.
Chapter 4 Inventory , p 132
Chapter 19 ETL Subsystems and Techniques , p 472
Chapter 20 ETL System Process and Tasks , p 504
Drilling Across
Drilling across simply means making separate queries against two or more fact tables
where the row headers of each query consist of identical conformed attributes. The
answer sets from the two queries are aligned by performing a sort-merge opera-
tion on the common dimension attribute row headers. BI tool vendors refer to this
functionality by various names, including stitch and multipass query.
Chapter 4 Inventory , p 130
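A sketch of a drill-across query, assuming hypothetical shipments and returns fact tables that share a conformed product dimension; each fact table is queried separately and the answer sets are aligned on the conformed brand attribute:

    WITH ship_totals AS (
        SELECT p.brand_name, SUM(f.shipment_quantity) AS shipped_qty
        FROM shipments_fact AS f
        JOIN product_dim AS p ON p.product_key = f.product_key
        GROUP BY p.brand_name),
    return_totals AS (
        SELECT p.brand_name, SUM(f.return_quantity) AS returned_qty
        FROM returns_fact AS f
        JOIN product_dim AS p ON p.product_key = f.product_key
        GROUP BY p.brand_name)
    SELECT COALESCE(s.brand_name, r.brand_name) AS brand_name,
           s.shipped_qty, r.returned_qty
    FROM ship_totals AS s
    FULL OUTER JOIN return_totals AS r ON r.brand_name = s.brand_name;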
Value Chain
A value chain identifies the natural flow of an organization's primary business
processes. For example, a retailer’s value chain may consist of purchasing to ware-
housing to retail sales. A general ledger value chain may consist of budgeting to
commitments to payments. Operational source systems typically produce transac-
tions or snapshots at each step of the value chain. Because each process produces
unique metrics at unique time intervals with unique granularity and dimensionality,
each process typically spawns at least one atomic fact table.
Chapter 4 Inventory , p 111
Chapter 7 Accounting , p 210
Chapter 16 Insurance , p 377
Enterprise Data Warehouse Bus Architecture
The enterprise data warehouse bus architecture provides an incremental approach
to building the enterprise DW/BI system. This architecture decomposes the DW/
BI planning process into manageable pieces by focusing on business processes,
while delivering integration via standardized conformed dimensions that are reused
across processes. It provides an architectural framework, while also decomposing
the program to encourage manageable agile implementations corresponding to the
rows on the enterprise data warehouse bus matrix. The bus architecture is tech-
nology and database platform independent; both relational and OLAP dimensional
structures can participate.
Chapter 1 DW/BI and Dimensional Modeling Primer , p 21
Chapter 4 Inventory , p 123
Enterprise Data Warehouse Bus Matrix
The enterprise data warehouse bus matrix is the essential tool for designing and com-
municating the enterprise data warehouse bus architecture. The rows of the matrix
are business processes and the columns are dimensions. The shaded cells of the
matrix indicate whether a dimension is associated with a given business process. The
design team scans each row to test whether a candidate dimension is well-defined for
the business process and also scans each column to see where a dimension should be
conformed across multiple business processes. Besides the technical design consid-
erations, the bus matrix is used as input to prioritize DW/BI projects with business
management as teams should implement one row of the matrix at a time.
Chapter 4 Inventory , p 125
Chapter 5 Procurement , p 143
Chapter 6 Order Management , p 168
Chapter 7 Accounting, p 202
Chapter 9 Human Resources Management , p 268
Chapter 10 Financial Services , p 282
Chapter 11 Telecommunications , p 297
Chapter 12 Transportation , p 311
Chapter 13 Education , p 325
Chapter 14 Healthcare , p 339
Chapter 15 Electronic Commerce , p 368
Chapter 16 Insurance , p 389
Detailed Implementation Bus Matrix
The detailed implementation bus matrix is a more granular bus matrix where each
business process row has been expanded to show specific fact tables or OLAP cubes.
At this level of detail, the precise grain statement and list of facts can be documented.
Chapter 5 Procurement , p 143
Chapter 16 Insurance , p 390
Opportunity/Stakeholder Matrix
After the enterprise data warehouse bus matrix rows have been identified, you can
draft a different matrix by replacing the dimension columns with business func-
tions, such as marketing, sales, and finance, and then shading the matrix cells to
indicate which business functions are interested in which business process rows.
The opportunity/stakeholder matrix helps identify which business groups should be
invited to the collaborative design sessions for each process-centric row.
Chapter 4 Inventory , p 127
Dealing with Slowly Changing Dimension Attributes
The following section describes the fundamental approaches for dealing with slowly
changing dimension (SCD) attributes. It is quite common to have attributes in the
same dimension table that are handled with different change tracking techniques.
Type 0: Retain Original
With type 0, the dimension attribute value never changes, so facts are always grouped
by this original value. Type 0 is appropriate for any attribute labeled “original,” such
as a customer's original credit score or a durable identifier. It also applies to most
attributes in a date dimension.
Chapter 5 Procurement , p 148
Type 1: Overwrite
With type 1, the old attribute value in the dimension row is overwritten with the new
value; a type 1 attribute always reflects the most recent assignment, and therefore
this technique destroys history. Although this approach is easy to implement and
does not create additional dimension rows, you must be careful that aggregate fact
tables and OLAP cubes affected by this change are recomputed.
Chapter 5 Procurement , p 149
Chapter 16 Insurance , p 380
Chapter 19 ETL Subsystems and Techniques , p 465
Type 2: Add New Row
Type 2 changes add a new row in the dimension with the updated attribute values.
This requires generalizing the primary key of the dimension beyond the natural or
durable key because there will potentially be multiple rows describing each member.
When a new row is created for a dimension member, a new primary surrogate key is
assigned and used as a foreign key in all fact tables from the moment of the update
until a subsequent change creates a new dimension key and updated dimension row.
A minimum of three additional columns should be added to the dimension row
with type 2 changes: 1) row effective date or date/time stamp; 2) row expiration
date or date/time stamp; and 3) current row indicator.
Chapter 5 Procurement , p 150
Chapter 8 Customer Relationship Management , p 243
Chapter 9 Human Resources Management , p 263
Chapter 16 Insurance , p 380
Chapter 19 ETL Subsystems and Techniques , p 465
Chapter 20 ETL System Process and Tasks , p 507
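A sketch of type 2 processing for one changed customer, with hypothetical names, keys, and dates: the prior row is expired, and a new row with a new surrogate key becomes current:

    UPDATE customer_dim
    SET    row_expiration_date   = DATE '2013-06-30',
           current_row_indicator = 'Expired'
    WHERE  durable_customer_id   = 12345
      AND  current_row_indicator = 'Current';

    INSERT INTO customer_dim
           (customer_key, durable_customer_id, customer_name, customer_zip,
            row_effective_date, row_expiration_date, current_row_indicator)
    VALUES (98765, 12345, 'Jane Doe', '94025',
            DATE '2013-07-01', DATE '9999-12-31', 'Current');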
Type 3: Add New Attribute
Type 3 changes add a new attribute in the dimension to preserve the old attribute
value; the new value overwrites the main attribute as in a type 1 change. This kind of
type 3 change is sometimes called an alternate reality. A business user can group and
filter fact data by either the current value or alternate reality. This slowly changing
dimension technique is used relatively infrequently.
Chapter 5 Procurement , p 154
Chapter 16 Insurance , p 380
Chapter 19 ETL Subsystems and Techniques , p 467
Type 4: Add Mini-Dimension
The type 4 technique is used when a group of attributes in a dimension rapidly
changes and is split off to a mini-dimension. This situation is sometimes called a
rapidly changing monster dimension. Frequently used attributes in multimillion-row
dimension tables are mini-dimension design candidates, even if they don’t fre-
quently change. The type 4 mini-dimension requires its own unique primary key;
the primary keys of both the base dimension and mini-dimension are captured in
the associated fact tables.
Chapter 5 Procurement , p 156
Chapter 10 Financial Services , p 289
Chapter 16 Insurance , p 381
Chapter 19 ETL Subsystems and Techniques , p 467
Type 5: Add Mini-Dimension and Type 1 Outrigger
The type 5 technique is used to accurately preserve historical attribute values,
plus report historical facts according to current attribute values. Type 5 builds on
the type 4 mini-dimension by also embedding a current type 1 reference to the
mini-dimension in the base dimension. This enables the currently-assigned mini-
dimension attributes to be accessed along with the others in the base dimension
without linking through a fact table. Logically, you'd represent the base dimension
and mini-dimension outrigger as a single table in the presentation area. The ETL
team must overwrite this type 1 mini-dimension reference whenever the current
mini-dimension assignment changes.
Chapter 5 Procurement , p 160
Chapter 19 ETL Subsystems and Techniques , p 468
Type 6: Add Type 1 Attributes to Type 2 Dimension
Like type 5, type 6 also delivers both historical and current dimension attribute
values. Type 6 builds on the type 2 technique by also embedding current type
1 versions of the same attributes in the dimension row so that fact rows can be
filtered or grouped by either the type 2 attribute value in effect when the measure-
ment occurred or the attribute’s current value. In this case, the type 1 attribute is
systematically overwritten on all rows associated with a particular durable key
whenever the attribute is updated.
Chapter 5 Procurement , p 160
Chapter 19 ETL Subsystems and Techniques , p 468
Type 7: Dual Type 1 and Type 2 Dimensions
Type 7 is the final hybrid technique used to support both as-was and as-is report-
ing. A fact table can be accessed through a dimension modeled both as a type 1
dimension showing only the most current attribute values, or as a type 2 dimen-
sion showing correct contemporary historical profiles. The same dimension table
enables both perspectives. Both the durable key and primary surrogate key of the
dimension are placed in the fact table. For the type 1 perspective, the current flag
in the dimension is constrained to be current, and the fact table is joined via the
durable key. For the type 2 perspective, the current flag is not constrained, and the
fact table is joined via the surrogate primary key. These two perspectives would be
deployed as separate views to the BI applications.
Chapter 5 Procurement , p 162
Chapter 19 ETL Subsystems and Techniques , p 468
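As a sketch with hypothetical names, the two perspectives can be published as views over the same type 2 customer dimension; the as-is view is constrained to current rows and joined to the fact table on the durable key, while the as-was view exposes every historical row and is joined on the surrogate key:

    -- As-is (type 1) perspective: join fact rows on durable_customer_id
    CREATE VIEW customer_current_dim AS
    SELECT durable_customer_id, customer_name, customer_segment
    FROM customer_dim
    WHERE current_row_indicator = 'Current';

    -- As-was (type 2) perspective: join fact rows on customer_key
    CREATE VIEW customer_historical_dim AS
    SELECT customer_key, durable_customer_id, customer_name, customer_segment
    FROM customer_dim;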
Dealing with Dimension Hierarchies
Dimensional hierarchies are commonplace. This section describes approaches for
dealing with hierarchies, starting with the most basic.
Fixed Depth Positional Hierarchies
A fixed depth hierarchy is a series of many-to-one relationships, such as product
to brand to category to department. When a fixed depth hierarchy is defined and
the hierarchy levels have agreed upon names, the hierarchy levels should appear
as separate positional attributes in a dimension table. A fixed depth hierarchy is
by far the easiest to understand and navigate as long as the above criteria are met.
It also delivers predictable and fast query performance. When the hierarchy is not
a series of many-to-one relationships or the number of levels varies such that the
levels do not have agreed upon names, a ragged hierarchy technique, described
below, must be used.
Chapter 3 Retail Sales , p 84
Chapter 7 Accounting , p 214
Chapter 19 ETL Subsystems and Techniques , p 470
Chapter 20 ETL System Process and Tasks , p 501
Slightly Ragged/Variable Depth Hierarchies
Slightly ragged hierarchies don't have a fixed number of levels, but the range in depth
is small. Geographic hierarchies often range in depth from perhaps three levels to
six levels. Rather than using the complex machinery for unpredictably variable
hierarchies, you can force-fit slightly ragged hierarchies into a fixed depth positional
design with separate dimension attributes for the maximum number of levels, and
then populate the attribute value based on rules from the business.
Chapter 7 Accounting , p 214
Ragged/Variable Depth Hierarchies with Hierarchy Bridge Tables
Ragged hierarchies of indeterminate depth are difficult to model and query in a
relational database. Although SQL extensions and OLAP access languages provide
some support for recursive parent/child relationships, these approaches have limita-
tions. With SQL extensions, alternative ragged hierarchies cannot be substituted at
query time, shared ownership structures are not supported, and time varying ragged
hierarchies are not supported. All these objections can be overcome in relational
databases by modeling a ragged hierarchy with a specially constructed bridge table.
This bridge table contains a row for every possible path in the ragged hierarchy
and enables all forms of hierarchy traversal to be accomplished with standard SQL
rather than using special language extensions.
Chapter 7 Accounting , p 215
Chapter 9 Human Resources Management , p 273
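A sketch of bridge table traversal, assuming a hypothetical organization bridge holding one row for every parent-to-descendant path (including each node to itself): rolling all facts up to a chosen parent node becomes an ordinary join plus a constraint on the parent key:

    SELECT SUM(f.ledger_amount) AS total_amount
    FROM general_ledger_fact AS f
    JOIN organization_bridge AS b
      ON b.descendant_org_key = f.organization_key
    WHERE b.parent_org_key = 42;   -- 42 = the organization node being rolled up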
Ragged/Variable Depth Hierarchies with Pathstring Attributes
The use of a bridge table for ragged variable depth hierarchies can be avoided by
implementing a pathstring attribute in the dimension. For each row in the dimen-
sion, the pathstring attribute contains a specially encoded text string containing
the complete path description from the supreme node of a hierarchy down to the
node described by the particular dimension row. Many of the standard hierarchy
analysis requests can then be handled by standard SQL, without resorting to SQL
language extensions. However, the pathstring approach does not enable rapid sub-
stitution of alternative hierarchies or shared ownership hierarchies. The pathstring
approach may also be vulnerable to structure changes in the ragged hierarchy that
could force the entire hierarchy to be relabeled.
Chapter 7 Accounting , p 221
Advanced Fact Table Techniques
The techniques in this section refer to less common fact table patterns.
Fact Table Surrogate Keys
Surrogate keys are used to implement the primary keys of almost all dimension
tables. In addition, single column surrogate fact keys can be useful, albeit not
required. Fact table surrogate keys, which are not associated with any dimension,
are assigned sequentially during the ETL load process and are used 1) as the single
column primary key of the fact table; 2) to serve as an immediate identifier of a fact
table row without navigating multiple dimensions for ETL purposes; 3) to allow an
interrupted load process to either back out or resume; 4) to allow fact table update
operations to be decomposed into less risky inserts plus deletes.
Chapter 3 Retail Sales , p 102
Chapter 19 ETL Subsystems and Techniques , p 486
Chapter 20 ETL System Process and Tasks , p 520
Centipede Fact Tables
Some designers create separate normalized dimensions for each level of a many-to-
one hierarchy, such as a date dimension, month dimension, quarter dimension, and
year dimension, and then include all these foreign keys in a fact table. This results
in a centipede fact table with dozens of hierarchically related dimensions. Centipede
fact tables should be avoided. All these fixed depth, many-to-one hierarchically
related dimensions should be collapsed back to their unique lowest grains, such as
the date for the example mentioned. Centipede fact tables also result when design-
ers embed numerous foreign keys to individual low-cardinality dimension tables
rather than creating a junk dimension.
Chapter 3 Retail Sales , p 108
Numeric Values as Attributes or Facts
Designers sometimes encounter numeric values that don’t clearly fall into either
the fact or dimension attribute categories. A classic example is a product’s standard
list price. If the numeric value is used primarily for calculation purposes, it likely
belongs in the fact table. If a stable numeric value is used predominantly for filtering
and grouping, it should be treated as a dimension attribute; the discrete numeric
values can be supplemented with value band attributes (such as $0-50). In some
cases, it is useful to model the numeric value as both a fact and dimension attribute,
such as a quantitative on-time delivery metric and qualitative textual descriptor.
Chapter 3 Retail Sales , p 85
Chapter 6 Order Management , p 188
Chapter 8 Customer Relationship Management , p 254
Chapter 16 Insurance , p 382
Lag/Duration Facts
Accumulating snapshot fact tables capture multiple process milestones, each with a
date foreign key and possibly a date/time stamp. Business users often want to analyze
the lags or durations between these milestones; sometimes these lags are just the
di erences between dates, but other times the lags are based on more complicated
business rules. If there are dozens of steps in a pipeline, there could be hundreds
of possible lags. Rather than forcing the users query to calculate each possible lag
from the date/time stamps or date dimension foreign keys, just one time lag can be
stored for each step measured against the process’s start point. Then every possible
lag between two steps can be calculated as a simple subtraction between the two
lags stored in the fact table.
Chapter 6 Order Management , p 196
Chapter 16 Insurance , p 393
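For example, if each milestone stores its lag in days from the order date (the column names here are hypothetical), any pairwise lag is a simple subtraction at query time:

    SELECT order_number,
           ship_lag_days     - pick_lag_days AS pick_to_ship_days,
           delivery_lag_days - ship_lag_days AS ship_to_delivery_days
    FROM order_fulfillment_fact;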
Header/Line Fact Tables
Operational transaction systems often consist of a transaction header row that’s
associated with multiple transaction lines. With header/line schemas (also known
as parent/child schemas), all the header-level dimension foreign keys and degenerate
dimensions should be included on the line-level fact table.
Chapter 6 Order Management , p 181
Chapter 12 Transportation , p 315
Chapter 15 Electronic Commerce , p 363
Allocated Facts
It is quite common in header/line transaction data to encounter facts of differ-
ing granularity, such as a header freight charge. You should strive to allocate
the header facts down to the line level based on rules provided by the business, so the
allocated facts can be sliced and rolled up by all the dimensions. In many cases, you
can avoid creating a header-level fact table, unless this aggregation delivers query
performance advantages.
Chapter 6 Order Management , p 184
Profit and Loss Fact Tables Using Allocations
Fact tables that expose the full equation of profit are among the most powerful deliv-
erables of an enterprise DW/BI system. The equation of profit is (revenue) – (costs) =
(profit). Fact tables ideally implement the profit equation at the grain of the atomic
revenue transaction and contain many components of cost. Because these tables are
at the atomic grain, numerous rollups are possible, including customer profitabil-
ity, product profitability, promotion profitability, channel profitability, and others.
However, these fact tables are difficult to build because the cost components must
be allocated from their original sources to the fact table's grain. This allocation step
is often a major ETL subsystem and is a politically charged step that requires high-
level executive support. For these reasons, profit and loss fact tables are typically
not tackled during the early implementation phases of a DW/BI program.
Chapter 6 Order Management , p 189
Chapter 15 Electronic Commerce , p 370
Multiple Currency Facts
Fact tables that record financial transactions in multiple currencies should contain
a pair of columns for every financial fact in the row. One column contains the fact
expressed in the true currency of the transaction, and the other contains the same
fact expressed in a single standard currency that is used throughout the fact table.
The standard currency value is created in an ETL process according to an approved
business rule for currency conversion. This fact table also must have a currency
dimension to identify the transaction’s true currency.
Chapter 6 Order Management , p 182
Chapter 7 Accounting , p 206
Multiple Units of Measure Facts
Some business processes require facts to be stated simultaneously in several units
of measure. For example, depending on the perspective of the business user, a
supply chain may need to report the same facts as pallets, ship cases, retail cases,
or individual scan units. If the fact table contains a large number of facts, each of
which must be expressed in all units of measure, a convenient technique is to store
the facts once in the table at an agreed standard unit of measure, but also simulta-
neously store conversion factors between the standard measure and all the others.
This fact table could be deployed through views to each user constituency, using
an appropriate selected conversion factor. The conversion factors must reside in the
underlying fact table row to ensure the view calculation is simple and correct, while
minimizing query complexity.
Chapter 6 Order Management , p 197
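A minimal sketch, assuming facts are stored once in ship cases with conversion factors carried on the same fact row; a view presents the same facts in retail cases for that user constituency:

    CREATE VIEW shipments_in_retail_cases AS
    SELECT date_key, product_key,
           ship_case_quantity * retail_cases_per_ship_case AS retail_case_quantity
    FROM shipments_fact;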
Year-to-Date Facts
Business users often request year-to-date (YTD) values in a fact table. It is hard to
argue against a single request, but YTD requests can easily morph into “YTD at the
close of the fiscal period” or “fiscal period to date.” A more reliable, extensible way
to handle these assorted requests is to calculate the YTD metrics in the BI applica-
tions or OLAP cube rather than storing YTD facts in the fact table.
Chapter 7 Accounting , p 206
Multipass SQL to Avoid Fact-to-Fact Table Joins
A BI application must never issue SQL that joins two fact tables together across the
fact table’s foreign keys. It is impossible to control the cardinality of the answer set
of such a join in a relational database, and incorrect results will be returned to the
BI tool. For instance, if two fact tables contain customer’s product shipments and
returns, these two fact tables must not be joined directly across the customer
and product foreign keys. Instead, the technique of drilling across two fact tables
should be used, where the answer sets from shipments and returns are separately
created, and the results sort-merged on the common row header attribute values to
produce the correct result.
Chapter 4 Inventory , p 130
Chapter 8 Customer Relationship Management , p 259
Timespan Tracking in Fact Tables
There are three basic fact table grains: transaction, periodic snapshot, and accu-
mulating snapshot. In isolated cases, it is useful to add a row effective date, row
expiration date, and current row indicator to the fact table, much like you do with
type 2 slowly changing dimensions, to capture a timespan when the fact row was
e ective. Although an unusual pattern, this pattern addresses scenarios such as
slowly changing inventory balances where a frequent periodic snapshot would load
identical rows with each snapshot.
Chapter 8 Customer Relationship Management , p 252
Chapter 16 Insurance , p 394
Late Arriving Facts
A fact row is late arriving if the most current dimensional context for new fact rows
does not match the incoming row. This happens when the fact row is delayed. In
this case, the relevant dimensions must be searched to find the dimension keys that
were effective when the late arriving measurement event occurred.
Chapter 14 Healthcare , p 351
Chapter 19 ETL Subsystems and Techniques , p 478
Advanced Dimension Techniques
The techniques in this section refer to more advanced dimension table patterns.
Dimension-to-Dimension Table Joins
Dimensions can contain references to other dimensions. Although these relation-
ships can be modeled with outrigger dimensions, in some cases, the existence of a
foreign key to the outrigger dimension in the base dimension can result in explosive
growth of the base dimension because type 2 changes in the outrigger force cor-
responding type 2 processing in the base dimension. This explosive growth can
often be avoided if you demote the correlation between dimensions by placing the
foreign key of the outrigger in the fact table rather than in the base dimension. This
means the correlation between the dimensions can be discovered only by traversing
the fact table, but this may be acceptable, especially if the fact table is a periodic
snapshot where all the keys for all the dimensions are guaranteed to be present for
each reporting period.
Chapter 6 Order Management , p 175
Multivalued Dimensions and Bridge Tables
In a classic dimensional schema, each dimension attached to a fact table has a single
value consistent with the fact table’s grain. But there are a number of situations in
which a dimension is legitimately multivalued. For example, a patient receiving a
healthcare treatment may have multiple simultaneous diagnoses. In these cases, the
multivalued dimension must be attached to the fact table through a group dimen-
sion key to a bridge table with one row for each simultaneous diagnosis in a group.
Chapter 8 Customer Relationship Management , p 245
Chapter 9 Human Resources Management , p 275
Chapter 10 Financial Services , p 287
Chapter 13 Education , p 333
Chapter 14 Healthcare , p 345
Chapter 16 Insurance , p 382
Chapter 19 ETL Subsystems and Techniques , p 477
Time Varying Multivalued Bridge Tables
A multivalued bridge table may need to be based on a type 2 slowly changing dimen-
sion. For example, the bridge table that implements the many-to-many relationship
between bank accounts and individual customers usually must be based on type
2 account and customer dimensions. In this case, to prevent incorrect linkages
between accounts and customers, the bridge table must include effective and expi-
ration date/time stamps, and the requesting application must constrain the bridge
table to a specific moment in time to produce a consistent snapshot.
Chapter 7 Accounting , p 220
Chapter 10 Financial Services , p 286
Behavior Tag Time Series
Almost all text in a data warehouse is descriptive text in dimension tables. Data
mining customer cluster analyses typically result in textual behavior tags, often
identified on a periodic basis. In this case, the customers' behavior measurements
over time become a sequence of these behavior tags; this time series should be
stored as positional attributes in the customer dimension, along with an optional
text string for the complete sequence of tags. The behavior tags are modeled in a
positional design because the behavior tags are the target of complex simultaneous
queries rather than numeric computations.
Chapter 8 Customer Relationship Management , p 240
Behavior Study Groups
Complex customer behavior can sometimes be discovered only by running lengthy
iterative analyses. In these cases, it is impractical to embed the behavior analyses
inside every BI application that wants to constrain all the members of the customer
dimension who exhibit the complex behavior. The results of the complex behavior
analyses, however, can be captured in a simple table, called a study group, consisting
only of the customers’ durable keys. This static table can then be used as a kind of
filter on any dimensional schema with a customer dimension by constraining the
study group column to the customer dimension's durable key in the target schema
at query time. Multiple study groups can be defined and derivative study groups
can be created with intersections, unions, and set differences.
Chapter 8 Customer Relationship Management , p 249
Aggregated Facts as Dimension Attributes
Business users are often interested in constraining the customer dimension based
on aggregated performance metrics, such as filtering on all customers who spent
over a certain dollar amount during last year or perhaps over the customer’s lifetime.
Selected aggregated facts can be placed in a dimension as targets for constraining and
as row labels for reporting. The metrics are often presented as banded ranges in the
dimension table. Dimension attributes representing aggregated performance metrics
add burden to the ETL processing, but ease the analytic burden in the BI layer.
Chapter 8 Customer Relationship Management , p 239
Dynamic Value Bands
A dynamic value banding report is organized as a series of report row headers that
define a progressive set of varying-sized ranges of a target numeric fact. For instance,
a common value banding report in a bank has many rows with labels such as
“Balance from 0 to $10,” “Balance from $10.01 to $25,” and so on. This kind of
report is dynamic because the specific row headers are defined at query time, not
during the ETL processing. The row definitions can be implemented in a small value
banding dimension table that is joined via greater-than/less-than joins to the fact
table, or the definitions can exist only in an SQL CASE statement. The value band-
ing dimension approach is probably higher performing, especially in a columnar
database, because the CASE statement approach involves an almost unconstrained
relation scan of the fact table.
Chapter 10 Financial Services , p 291
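A sketch of the banding-dimension approach with hypothetical names; the small band table is joined to the fact table with greater-than/less-than predicates so the row headers are defined at query time:

    SELECT b.band_label,
           COUNT(*)              AS account_count,
           SUM(f.balance_amount) AS total_balance
    FROM account_snapshot_fact AS f
    JOIN balance_band AS b
      ON  f.balance_amount >= b.lower_limit
      AND f.balance_amount <  b.upper_limit
    GROUP BY b.band_label;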
Text Comments Dimension
Rather than treating freeform comments as textual metrics in a fact table, they
should be stored outside the fact table in a separate comments dimension (or as
attributes in a dimension with one row per transaction if the comments’ cardinal-
ity matches the number of unique transactions) with a corresponding foreign key
in the fact table.
Chapter 9 Human Resources Management , p 278
Chapter 14 Healthcare , p 350
Multiple Time Zones
To capture both universal standard time, as well as local times in multi-time zone
applications, dual foreign keys should be placed in the affected fact tables that join
to two role-playing date (and potentially time-of-day) dimension tables.
Chapter 12 Transportation , p 323
Chapter 15 Electronic Commerce , p 361
Measure Type Dimensions
Sometimes when a fact table has a long list of facts that is sparsely populated in any
individual row, it is tempting to create a measure type dimension that collapses the
fact table row down to a single generic fact identified by the measure type dimen-
sion. We generally do not recommend this approach. Although it removes all the
empty fact columns, it multiplies the size of the fact table by the average number
of occupied columns in each row, and it makes intra-column computations much
more di cult. This technique is acceptable when the number of potential facts is
extreme (in the hundreds), but less than a handful would be applicable to any given
fact table row.
Chapter 6 Order Management , p 169
Chapter 14 Healthcare , p 349
Step Dimensions
Sequential processes, such as web page events, normally have a separate row in a
transaction fact table for each step in a process. To tell where the individual step
fits into the overall session, a step dimension is used that shows what step number
is represented by the current step and how many more steps were required to com-
plete the session.
Chapter 8 Customer Relationship Management , p 251
Chapter 15 Electronic Commerce , p 366
Hot Swappable Dimensions
Hot swappable dimensions are used when the same fact table is alternatively paired
with di erent copies of the same dimension. For example, a single fact table con-
taining stock ticker quotes could be simultaneously exposed to multiple separate
investors, each of whom has unique and proprietary attributes assigned to different
stocks.
Chapter 10 Financial Services , p 296
Abstract Generic Dimensions
Some modelers are attracted to abstract generic dimensions. For example, their
schemas include a single generic location dimension rather than embedded geo-
graphic attributes in the store, warehouse, and customer dimensions. Similarly,
their person dimension includes rows for employees, customers, and vendor
contacts because they are all human beings, regardless that significantly different
attributes are collected for each type. Abstract generic dimensions should be avoided
in dimensional models. The attribute sets associated with each type often differ. If
the attributes are common, such as a geographic state, then they should be uniquely
labeled to distinguish a store's state from a customer's. Finally, dumping all variet-
ies of locations, people, or products into a single dimension invariably results in
a larger dimension table. Data abstraction may be appropriate in the operational
source system or ETL processing, but it negatively impacts query performance and
legibility in the dimensional model.
Chapter 9 Human Resources Management , p 270
Chapter 11 Telecommunications , p 310
Audit Dimensions
When a fact table row is created in the ETL back room, it is helpful to create
an audit dimension containing the ETL processing metadata known at the time.
A simple audit dimension row could contain one or more basic indicators of data
quality, perhaps derived from examining an error event schema that records
data quality violations encountered while processing the data. Other useful audit
dimension attributes could include environment variables describing the versions
of ETL code used to create the fact rows or the ETL process execution time stamps.
These environment variables are especially useful for compliance and auditing
purposes because they enable BI tools to drill down to determine which rows were
created with what versions of the ETL software.
Chapter 6 Order Management , p 192
Chapter 16 Insurance , p 383
Chapter 19 ETL Subsystems and Techniques , p 460
Chapter 20 ETL System Process and Tasks , p 511
Late Arriving Dimensions
Sometimes the facts from an operational business process arrive minutes, hours,
days, or weeks before the associated dimension context. For example, in a real-time
data delivery situation, an inventory depletion row may arrive showing the natural
key of a customer committing to purchase a particular product. In a real-time ETL
system, this row must be posted to the BI layer, even if the identity of the customer
or product cannot be immediately determined. In these cases, special dimension
rows are created with the unresolved natural keys as attributes. Of course, these
dimension rows must contain generic unknown values for most of the descriptive
columns; presumably the proper dimensional context will follow from the source at
a later time. When this dimensional context is eventually supplied, the placeholder
dimension rows are updated with type 1 overwrites. Late arriving dimension data
also occurs when retroactive changes are made to type 2 dimension attributes.
In this case, a new row needs to be inserted in the dimension table, and then the
associated fact rows must be restated.
Chapter 14 Healthcare , p 351
Chapter 19 ETL Subsystems and Techniques , p 478
Chapter 20 ETL System Process and Tasks , p 523
Special Purpose Schemas
The following design patterns are needed for specific use cases.
Supertype and Subtype Schemas for Heterogeneous Products
Financial services and other businesses frequently offer a wide variety of products
in disparate lines of business. For example, a retail bank may offer dozens of types
of accounts ranging from checking accounts to mortgages to business loans, but all
are examples of an account. Attempts to build a single, consolidated fact table with
the union of all possible facts, linked to dimension tables with all possible attributes
of these divergent products, will fail because there can be hundreds of incompatible
facts and attributes. The solution is to build a single supertype fact table that has the
intersection of the facts from all the account types (along with a supertype dimen-
sion table containing the common attributes), and then systematically build separate
fact tables (and associated dimension tables) for each of the subtypes. Supertype and
subtype fact tables are also called core and custom fact tables.
Chapter 10 Financial Services , p 293
Chapter 14 Healthcare , p 347
Chapter 16 Insurance , p 384
Real-Time Fact Tables
Real-time fact tables need to be updated more frequently than the more traditional
nightly batch process. There are many techniques for supporting this requirement,
depending on the capabilities of the DBMS or OLAP cube used for final deployment
to the BI reporting layer. For example, a "hot partition" can be defined on a fact table
that is pinned in physical memory. Aggregations and indexes are deliberately not
built on this partition. Other DBMSs or OLAP cubes may support deferred updat-
ing that allows existing queries to run to completion but then perform the updates.
Chapter 8 Customer Relationship Management, p 260
Chapter 20 ETL System Process and Tasks , p 520
Error Event Schemas
Managing data quality in a data warehouse requires a comprehensive system of
data quality screens or filters that test the data as it flows from the source sys-
tems to the BI platform. When a data quality screen detects an error, this event
is recorded in a special dimensional schema that is available only in the ETL
back room. This schema consists of an error event fact table whose grain is the
individual error event and an associated error event detail fact table whose grain
is each column in each table that participates in an error event.
Chapter 19 ETL Subsystems and Techniques , p 458
Retail Sales
The best way to understand the principles of dimensional modeling is to work
through a series of tangible examples. By visualizing real cases, you hold the
particular design challenges and solutions in your mind more effectively than if they
are presented abstractly. This book uses case studies from a range of businesses to
help move past the idiosyncrasies of your own environment and reinforce dimen-
sional modeling best practices.
To learn dimensional modeling, please read all the chapters in this book, even
if you don't manage a retail store or work for a telecommunications company. The
chapters are not intended to be full-scale solutions for a given industry or business
function. Each chapter covers a set of dimensional modeling patterns that comes
up in nearly every kind of business. Universities, insurance companies, banks, and
airlines alike surely need the techniques developed in this retail chapter. Besides,
thinking about someone else’s business is refreshing. It is too easy to let historical
complexities derail you when dealing with data from your company. By stepping out-
side your organization and then returning with a well-understood design principle
(or two), it is easier to remember the spirit of the design principles as you descend
into the intricate details of your business.
Chapter 3 discusses the following concepts:
Four-step process for designing dimensional models
Fact table granularity
Transaction fact tables
Additive, non-additive, and derived facts
Dimension attributes, including indicators, numeric descriptors, and multiple
hierarchies
Calendar date dimensions, plus time-of-day
Causal dimensions, such as promotion
Degenerate dimensions, such as the transaction receipt number
Nulls in a dimensional model
Extensibility of dimension models
Factless fact tables
Surrogate, natural, and durable keys
Snowflaked dimension attributes
Centipede fact tables with “too many dimensions”
Four-Step Dimensional Design Process
Throughout this book, we will approach the design of a dimensional model by
consistently considering four steps, as the following sections discuss in more detail.
Step 1: Select the Business Process
A business process is a low-level activity performed by an organization, such as taking
orders, invoicing, receiving payments, handling service calls, registering students,
performing a medical procedure, or processing claims. To identify your organiza-
tion’s business processes, it’s helpful to understand several common characteristics:
Business processes are frequently expressed as action verbs because they repre-
sent activities that the business performs. The companion dimensions describe
descriptive context associated with each business process event.
Business processes are typically supported by an operational system, such as
the billing or purchasing system.
Business processes generate or capture key performance metrics. Sometimes
the metrics are a direct result of the business process; at other times, the
measurements are derivations. Analysts invariably want to scrutinize and evaluate
these metrics by a seemingly limitless combination of filters and constraints.
Business processes are usually triggered by an input and result in output
metrics. In many organizations, there’s a series of processes in which the
outputs from one process become the inputs to the next. In the parlance of a
dimensional modeler, this series of processes results in a series of fact tables.
You need to listen carefully to the business to identify the organization’s business
processes because business users can’t readily answer the question, “What busi-
ness process are you interested in?” The performance measurements users want to
analyze in the DW/BI system result from business process events.
Sometimes business users talk about strategic business initiatives instead of busi-
ness processes. These initiatives are typically broad enterprise plans championed
by executive leadership to deliver competitive advantage. In order to tie a business
initiative to a business process representing a project-sized unit of work for the
DW/BI team, you need to decompose the business initiative into the underlying
processes. This means digging a bit deeper to understand the data and operational
systems that support the initiative’s analytic requirements.
It’s also worth noting what a business process is not. Organizational business
departments or functions do not equate to business processes. By focusing on pro-
cesses, rather than on functional departments, consistent information is delivered
more economically throughout the organization. If you design departmentally bound
dimensional models, you inevitably duplicate data with different labels and data
values. The best way to ensure consistency is to publish the data once.
Step 2: Declare the Grain
Declaring the grain means specifying exactly what an individual fact table row
represents. The grain conveys the level of detail associated with the fact table
measurements. It provides the answer to the question, “How do you describe a
single row in the fact table?” The grain is determined by the physical realities of
the operational system that captures the business process’s events.
Example grain declarations include:
One row per scan of an individual product on a customer’s sales transaction
One row per line item on a bill from a doctor
One row per individual boarding pass scanned at an airport gate
One row per daily snapshot of the inventory levels for each item in a warehouse
One row per bank account each month
These grain declarations are expressed in business terms. Perhaps you were
expecting the grain to be a traditional declaration of the fact table’s primary key.
Although the grain ultimately is equivalent to the primary key, it’s a mistake to list
a set of dimensions and then assume this list is the grain declaration. Whenever
possible, you should express the grain in business terms.
Dimensional modelers sometimes try to bypass this seemingly unnecessary step
of the four-step design process. Please don’t! Declaring the grain is a critical step that
can’t be taken lightly. In debugging thousands of dimensional designs over the years,
the most frequent error is not declaring the grain of the fact table at the beginning
of the design process. If the grain isn't clearly defined, the whole design rests on
quicksand; discussions about candidate dimensions go around in circles, and rogue
facts sneak into the design. An inappropriate grain haunts a DW/BI implementation!
It is extremely important that everyone on the design team reaches agreement on
the fact table’s granularity. Having said this, you may discover in steps 3 or 4 of the
design process that the grain statement is wrong. This is okay, but then you must
return to step 2, restate the grain correctly, and revisit steps 3 and 4 again.
Step 3: Identify the Dimensions
Dimensions fall out of the question, “How do business people describe the data
resulting from the business process measurement events?” You need to decorate
fact tables with a robust set of dimensions representing all possible descriptions
that take on single values in the context of each measurement. If you are clear about
the grain, the dimensions typically can easily be identified as they represent the
“who, what, where, when, why, and how” associated with the event. Examples of
common dimensions include date, product, customer, employee, and facility. With
the choice of each dimension, you then list all the discrete, text-like attributes that
flesh out each dimension table.
Step 4: Identify the Facts
Facts are determined by answering the question, “What is the process measuring?”
Business users are keenly interested in analyzing these performance metrics. All
candidate facts in a design must be true to the grain defined in step 2. Facts that
clearly belong to a different grain must be in a separate fact table. Typical facts are
numeric additive figures, such as quantity ordered or dollar cost amount.
You need to consider both your business users’ requirements and the realities
of your source data in tandem to make decisions regarding the four steps, as illus-
trated in Figure 3-1. We strongly encourage you to resist the temptation to model
the data by looking at source data alone. It may be less intimidating to dive into the
data rather than interview a business person; however, the data is no substitute for
business user input. Unfortunately, many organizations have attempted this path-
of-least-resistance data-driven approach but without much success.
Figure 3-1: Key input to the four-step dimensional design process (business requirements and data realities drive the business process, grain, dimension, and fact decisions).
Retail Case Study
Let’s start with a brief description of the retail business used in this case study. We
begin with this industry simply because it is one we are all familiar with. But the
patterns discussed in the context of this case study are relevant to virtually every
dimensional model regardless of the industry.
Imagine you work in the headquarters of a large grocery chain. The business has
100 grocery stores spread across five states. Each store has a full complement of
departments, including grocery, frozen foods, dairy, meat, produce, bakery, floral,
and health/beauty aids. Each store has approximately 60,000 individual products,
called stock keeping units (SKUs), on its shelves.
Data is collected at several interesting places in a grocery store. Some of the most
useful data is collected at the cash registers as customers purchase products. The point-
of-sale (POS) system scans product barcodes at the cash register, measuring consumer
takeaway at the front door of the grocery store, as illustrated in Figure 3-2’s cash register
receipt. Other data is captured at the store’s back door where vendors make deliveries.
Figure 3-2: Sample cash register receipt (Allstar Grocery store 0022, transaction 649, April 15, 2013; four scanned items totaling $12.67, including one temporary price reduction and one coupon).
At the grocery store, management is concerned with the logistics of ordering,
stocking, and selling products while maximizing profit. The profit ultimately comes
from charging as much as possible for each product, lowering costs for product
acquisition and overhead, and at the same time attracting as many customers as
possible in a highly competitive environment. Some of the most significant manage-
ment decisions have to do with pricing and promotions. Both store management
and headquarters marketing spend a great deal of time tinkering with pricing and
promotions. Promotions in a grocery store include temporary price reductions, ads
in newspapers and newspaper inserts, displays in the grocery store, and coupons.
The most direct and effective way to create a surge in the volume of product sold
is to lower the price dramatically. A 50-cent reduction in the price of paper towels,
especially when coupled with an ad and display, can cause the sale of the paper
towels to jump by a factor of 10. Unfortunately, such a big price reduction usually
is not sustainable because the towels probably are being sold at a loss. As a result of
these issues, the visibility of all forms of promotion is an important part of analyz-
ing the operations of a grocery store.
Now that we have described our business case study, we’ll begin to design the
dimensional model.
Step 1: Select the Business Process
The first step in the design is to decide what business process to model by combin-
ing an understanding of the business requirements with an understanding of the
available source data.
NOTE The first DW/BI project should focus on the business process that is
both the most critical to the business users, as well as the most feasible. Feasibility
covers a range of considerations, including data availability and quality, as well as
organizational readiness.
In our retail case study, management wants to better understand customer pur-
chases as captured by the POS system. Thus the business process you’re modeling
is POS retail sales transactions. This data enables the business users to analyze
which products are selling in which stores on which days under what promotional
conditions in which transactions.
Step 2: Declare the Grain
After the business process has been identified, the design team faces a serious deci-
sion about the granularity. What level of data detail should be made available in
the dimensional model?
Tackling data at its lowest atomic grain makes sense for many reasons. Atomic
data is highly dimensional. The more detailed and atomic the fact measurement,
the more things you know for sure. All those things you know for sure translate
into dimensions. In this regard, atomic data is a perfect match for the dimensional
approach.
Atomic data provides maximum analytic flexibility because it can be con-
strained and rolled up in every way possible. Detailed data in a dimensional model
is poised and ready for the ad hoc attack by business users.
NOTE You should develop dimensional models representing the most detailed,
atomic information captured by a business process.
Of course, you could declare a more summarized granularity representing an
aggregation of the atomic data. However, as soon as you select a higher level grain,
you limit yourself to fewer and/or potentially less detailed dimensions. The less
granular model is immediately vulnerable to unexpected user requests to drill down
into the details. Users inevitably run into an analytic wall when not given access to
the atomic data. Although aggregated data plays an important role for performance
tuning, it is not a substitute for giving users access to the lowest level details; users
can easily summarize atomic data, but it’s impossible to create details from sum-
mary data. Unfortunately, some industry pundits remain confused about this point.
They claim dimensional models are only appropriate for summarized data and then
criticize the dimensional modeling approach for its supposed need to anticipate the
business question. This misunderstanding goes away when detailed, atomic data is
made available in a dimensional model.
In our case study, the most granular data is an individual product on a POS transac-
tion, assuming the POS system rolls up all sales for a given product within a shopping
cart into a single line item. Although users probably are not interested in analyzing
single items associated with a specific POS transaction, you can't predict all the ways
they'll want to cull through that data. For example, they may want to understand the
difference in sales on Monday versus Sunday. Or they may want to assess whether it's
worthwhile to stock so many individual sizes of certain brands. Or they may want
to understand how many shoppers took advantage of the 50-cents-off promotion on
shampoo. Or they may want to determine the impact of decreased sales when a com-
petitive diet soda product was promoted heavily. Although none of these queries calls
for data from one specific transaction, they are broad questions that require detailed
data sliced in precise ways. None of them could have been answered if you elected to
provide access only to summarized data.
NOTE A DW/BI system almost always demands data expressed at the lowest
possible grain, not because queries want to see individual rows but because queries
need to cut through the details in very precise ways.
Step 3: Identify the Dimensions
After the grain of the fact table has been chosen, the choice of dimensions is straight-
forward. The product and transaction fall out immediately. Within the framework
of the primary dimensions, you can ask whether other dimensions can be attributed
to the POS measurements, such as the date of the sale, the store where the sale
occurred, the promotion under which the product is sold, the cashier who handled
the sale, and potentially the method of payment. We express this as another design
principle.
NOTE A careful grain statement determines the primary dimensionality of the
fact table. You then add more dimensions to the fact table if these additional dimen-
sions naturally take on only one value under each combination of the primary
dimensions. If the additional dimension violates the grain by causing additional
fact rows to be generated, the dimension needs to be disqualified or the grain state-
ment needs to be revisited.
The following descriptive dimensions apply to the case: date, product, store,
promotion, cashier, and method of payment. In addition, the POS transaction ticket
number is included as a special dimension, as described in the section “Degenerate
Dimensions for Transaction Numbers” later in this chapter.
Before fleshing out the dimension tables with descriptive attributes, let's complete
the final step of the four-step process. You don't want to lose sight of the forest for
the trees at this stage of the design.
Step 4: Identify the Facts
The fourth and final step in the design is to make a careful determination of which
facts will appear in the fact table. Again, the grain declaration helps anchor your
thinking. Simply put, the facts must be true to the grain: the individual product
line item on the POS transaction in this case. When considering potential facts,
you may again discover adjustments need to be made to either your earlier grain
assumptions or choice of dimensions.
The facts collected by the POS system include the sales quantity (for example,
the number of cans of chicken noodle soup), per unit regular, discount, and net
paid prices, and extended discount and sales dollar amounts. The extended sales
dollar amount equals the sales quantity multiplied by the net unit price. Likewise,
the extended discount dollar amount is the sales quantity multiplied by the unit
discount amount. Some sophisticated POS systems also provide a standard dollar
cost for the product as delivered to the store by the vendor. Presuming this cost
fact is readily available and doesn’t require a heroic activity-based costing initiative,
you can include the extended cost amount in the fact table. The fact table begins
to take shape in Figure 3-3.
Retail Sales Fact
Date Key (FK)
Product Key (FK)
Store Key (FK)
Promotion Key (FK)
Cashier Key (FK)
Payment Method Key (FK)
POS Transaction # (DD)
Sales Quantity
Regular Unit Price
Discount Unit Price
Net Unit Price
Extended Discount Dollar Amount
Extended Sales Dollar Amount
Extended Cost Dollar Amount
Extended Gross Profit Dollar Amount
The fact table joins to the Date, Product, Store, Promotion, Cashier, and Payment Method dimension tables.
Figure 3-3: Measured facts in retail sales schema.
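To make the grain and fact choices concrete, the following is a minimal DDL sketch of the fact table in Figure 3-3. The table and column names, data types, and surrogate key conventions are illustrative assumptions rather than definitions from any source system; the degenerate POS transaction number is stored directly in the fact table with no corresponding dimension table.

create table retail_sales_fact (
    date_key                            integer not null references date_dim,
    product_key                         integer not null references product_dim,
    store_key                           integer not null references store_dim,
    promotion_key                       integer not null references promotion_dim,
    cashier_key                         integer not null references cashier_dim,
    payment_method_key                  integer not null references payment_method_dim,
    pos_transaction_number              varchar(30) not null, -- degenerate dimension (DD)
    sales_quantity                      integer,
    regular_unit_price                  decimal(9,2),
    discount_unit_price                 decimal(9,2),
    net_unit_price                      decimal(9,2),
    extended_discount_dollar_amount     decimal(12,2),
    extended_sales_dollar_amount        decimal(12,2),
    extended_cost_dollar_amount         decimal(12,2),
    extended_gross_profit_dollar_amount decimal(12,2)
);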
Four of the facts, sales quantity and the extended discount, sales, and cost dollar
amounts, are beautifully additive across all the dimensions. You can slice and dice
the fact table by the dimension attributes with impunity, and every sum of these
four facts is valid and correct.
Derived Facts
You can compute the gross profit by subtracting the extended cost dollar amount
from the extended sales dollar amount, or revenue. Although computed, gross profit
is also perfectly additive across all the dimensions; you can calculate the gross
profit of any combination of products sold in any set of stores on any set of days.
Dimensional modelers sometimes question whether a calculated derived fact should
be stored in the database. We generally recommend it be stored physically. In this
case study, the gross profit calculation is straightforward, but storing it means it's
computed consistently in the ETL process, eliminating the possibility of user cal-
culation errors. The cost of a user incorrectly representing gross profit overwhelms
the minor incremental storage cost. Storing it also ensures all users and BI reporting
applications refer to gross profit consistently. Because gross profit can be calculated
from adjacent data within a single fact table row, some would argue that you should
perform the calculation in a view that is indistinguishable from the table. This is
a reasonable approach if all users access the data via the view and no users with
ad hoc query tools can sneak around the view to get at the physical table. Views
are a reasonable way to minimize user error while saving on storage, but the DBA
must allow no exceptions to accessing the data through the view. Likewise, some
organizations want to perform the calculation in the BI tool. Again, this works if all
users access the data using a common tool, which is seldom the case in our expe-
rience. However, sometimes non-additive metrics on a report such as percentages
or ratios must be computed in the BI application because the calculation cannot
be precalculated and stored in a fact table. OLAP cubes excel in these situations.
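If you choose the view approach, a minimal sketch might look like the following, assuming the physical table stores the sales and cost amounts but not the derived gross profit; the table and column names are illustrative.

create view retail_sales_fact_vw as
select
    date_key,
    product_key,
    store_key,
    promotion_key,
    cashier_key,
    payment_method_key,
    pos_transaction_number,
    sales_quantity,
    extended_discount_dollar_amount,
    extended_sales_dollar_amount,
    extended_cost_dollar_amount,
    -- derived fact computed once here so users never calculate it themselves
    extended_sales_dollar_amount - extended_cost_dollar_amount
        as extended_gross_profit_dollar_amount
from retail_sales_fact;

All ad hoc and BI access would then be granted against the view rather than the underlying table.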
Non-Additive Facts
Gross margin can be calculated by dividing the gross profit by the extended sales
dollar revenue. Gross margin is a non-additive fact because it can’t be summarized
along any dimension. You can calculate the gross margin of any set of products,
stores, or days by remembering to sum the revenues and costs respectively before
dividing.
NOTE Percentages and ratios, such as gross margin, are non-additive. The
numerator and denominator should be stored in the fact table. The ratio can then
be calculated in a BI tool for any slice of the fact table by remembering to calculate
the ratio of the sums, not the sum of the ratios.
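As a sketch, a gross margin query by department would sum the components first and divide afterward; the table, column, and join names are illustrative.

select
    p.department_description,
    sum(f.extended_sales_dollar_amount - f.extended_cost_dollar_amount)
        / sum(f.extended_sales_dollar_amount) as gross_margin
from retail_sales_fact f
join product_dim p on p.product_key = f.product_key
group by p.department_description;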
Unit price is another non-additive fact. Unlike the extended amounts in the fact
table, summing unit price across any of the dimensions results in a meaningless,
nonsensical number. Consider this simple example: You sold one widget at a unit
price of $1.00 and four widgets at a unit price of 50 cents each. You could sum
the sales quantity to determine that five widgets were sold. Likewise, you could
sum the sales dollar amounts ($1.00 and $2.00) to arrive at a total sales amount
of $3.00. However, you can’t sum the unit prices ($1.00 and 50 cents) and declare
that the total unit price is $1.50. Similarly, you shouldn’t announce that the average
unit price is 75 cents. The properly weighted average unit price should be calcu-
lated by taking the total sales amount ($3.00) and dividing by the total quantity
(five widgets) to arrive at a 60 cent average unit price. You'd never arrive at this
conclusion by looking at the unit price for each transaction line in isolation. To
analyze the average price, you must add up the sales dollars and sales quantities
before dividing the total dollars by the total quantity sold. Fortunately, many BI
tools perform this function correctly. Some question whether non-additive facts
should be physically stored in a fact table. This is a legitimate question given
their limited analytic value, aside from printing individual values on a report or
applying a filter directly on the fact, which are both atypical. In some situations,
a fundamentally non-additive fact such as a temperature is supplied by the source
system. These non-additive facts may be averaged carefully over many records, if
the business analysts agree that this makes sense.
Transaction Fact Tables
Transactional business processes are the most common. The fact tables representing
these processes share several characteristics:
The grain of atomic transaction fact tables can be succinctly expressed in the
context of the transaction, such as one row per transaction or one row per
transaction line.
Because these fact tables record a transactional event, they are often sparsely
populated. In our case study, we certainly wouldn’t sell every product in
every shopping cart.
Even though transaction fact tables are unpredictably and sparsely populated,
they can be truly enormous. Most billion and trillion row tables in a data
warehouse are transaction fact tables.
Transaction fact tables tend to be highly dimensional.
The metrics resulting from transactional events are typically additive as long
as they have been extended by the quantity amount, rather than capturing
per unit metrics.
At this early stage of the design, it is often helpful to estimate the number of rows
in your largest table, the fact table. In this case study, it simply may be a matter of
talking with a source system expert to understand how many POS transaction line
items are generated on a periodic basis. Retail traffic fluctuates significantly from
day to day, so you need to understand the transaction activity over a reasonable
period of time. Alternatively, you could estimate the number of rows added to the
fact table annually by dividing the chain’s annual gross revenue by the average item
selling price. Assuming that gross revenues are $4 billion per year and that the aver-
age price of an item on a customer ticket is $2.00, you can calculate that there are
approximately 2 billion transaction line items per year. This is a typical engineer’s
estimate that gets you surprisingly close to sizing a design directly from your arm-
chair. As designers, you always should be triangulating to determine whether your
calculations are reasonable.
Dimension Table Details
Now that we’ve walked through the four-step process, let’s return to the dimension
tables and focus on populating them with robust attributes.
Date Dimension
The date dimension is a special dimension because it is the one dimension nearly
guaranteed to be in every dimensional model since virtually every business process
captures a time series of performance metrics. In fact, date is usually the first dimen-
sion in the underlying partitioning scheme of the database so that the successive
time interval data loads are placed into virgin territory on the disk.
For readers of the first edition of The Data Warehouse Toolkit (Wiley, 1996), this
dimension was referred to as the time dimension. However, for more than a decade,
we’ve used the “date dimension” to mean a daily grained dimension table. This helps
distinguish between date and time-of-day dimensions.
Unlike most of the other dimensions, you can build the date dimension table in
advance. You may put 10 or 20 years of rows representing individual days in the table,
so you can cover the history you have stored, as well as several years in the future.
Even 20 years’ worth of days is only approximately 7,300 rows, which is a relatively
small dimension table. For a daily date dimension table in a retail environment, we
recommend the partial list of columns shown in Figure 3-4.
Date Dimension
Date Key (PK)
Date
Full Date Description
Day of Week
Day Number in Calendar Month
Day Number in Calendar Year
Day Number in Fiscal Month
Day Number in Fiscal Year
Last Day in Month Indicator
Calendar Week Ending Date
Calendar Week Number in Year
Calendar Month Name
Calendar Month Number in Year
Calendar Year-Month (YYYY-MM)
Calendar Quarter
Calendar Year-Quarter
Calendar Year
Fiscal Week
Fiscal Week Number in Year
Fiscal Month
Fiscal Month Number in Year
Fiscal Year-Month
Fiscal Quarter
Fiscal Year-Quarter
Fiscal Half Year
Fiscal Year
Holiday Indicator
Weekday Indicator
SQL Date Stamp
...
Figure 3-4: Date dimension table.
Each column in the date dimension table is defined by the particular day that the
row represents. The day-of-week column contains the day’s name, such as Monday.
This column would be used to create reports comparing Monday business with
Sunday business. The day number in calendar month column starts with 1 at the
beginning of each month and runs to 28, 29, 30, or 31 depending on the month.
This column is useful for comparing the same day each month. Similarly, you could
have a month number in year (1, . . ., 12). All these integers support simple date
arithmetic across year and month boundaries.
For reporting, you should include both long and abbreviated labels. For exam-
ple, you would want a month name attribute with values such as January. In
addition, a year-month (YYYY-MM) column is useful as a report column header.
You likely also want a quarter number (Q1, . . ., Q4), as well as a year-quarter,
such as 2013-Q1. You would include similar columns for the fiscal periods if
they di er from calendar periods. Sample rows containing several date dimen-
sion columns are illustrated in Figure 3-5.
Figure 3-5: Date dimension sample rows for January 1 through January 8, 2013, showing the Date Key, Date, Full Date Description, Day of Week, Calendar Month, Calendar Quarter, Calendar Year, Fiscal Year-Month, Holiday Indicator, and Weekday Indicator columns.
NOTE A sample date dimension is available at www.kimballgroup.com under
the Tools and Utilities tab for this book title.
Some designers pause at this point to ask why an explicit date dimension table is
needed. They reason that if the date key in the fact table is a date type column, then
any SQL query can directly constrain on the fact table date key and use natural SQL
date semantics to filter on month or year while avoiding a supposedly expensive
join. This reasoning falls apart for several reasons. First, if your relational database
can't handle an efficient join to the date dimension table, you're in deep trouble.
Most database optimizers are quite efficient at resolving dimensional queries; it is
not necessary to avoid joins like the plague.
Since the average business user is not versed in SQL date semantics, he would
be unable to request typical calendar groupings. SQL date functions do not support
filtering by attributes such as weekdays versus weekends, holidays, fiscal periods,
or seasons. Presuming the business needs to slice data by these nonstandard date
attributes, then an explicit date dimension table is essential. Calendar logic belongs
in a dimension table, not in the application code.
NOTE Dimensional models always need an explicit date dimension table. There
are many date attributes not supported by the SQL date function, including week
numbers, fiscal periods, seasons, holidays, and weekends. Rather than attempting
to determine these nonstandard calendar calculations in a query, you should look
them up in a date dimension table.
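As a sketch, slicing sales by these nonstandard calendar attributes is simply a join to the date dimension and a constraint or grouping on its columns; the table and column names below are illustrative renderings of the attributes in Figure 3-4.

select
    d.fiscal_year_month,
    d.holiday_indicator,
    sum(f.extended_sales_dollar_amount) as sales_dollar_amount
from retail_sales_fact f
join date_dim d on d.date_key = f.date_key
where d.weekday_indicator = 'Weekend'
group by d.fiscal_year_month, d.holiday_indicator;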
Flags and Indicators as Textual Attributes
Like many operational flags and indicators, the date dimension's holiday indicator
is a simple indicator with two potential values. Because dimension table attributes
serve as report labels and values in pull-down query filter lists, this indicator should
be populated with meaningful values such as Holiday or Non-holiday instead of
the cryptic Y/N, 1/0, or True/False. As illustrated in Figure 3-6, imagine a report
comparing holiday versus non-holiday sales for a product. More meaningful domain
values for this indicator translate into a more meaningful, self-explanatory report.
Rather than decoding flags into understandable labels in the BI application, we prefer
that decoded values be stored in the database so they're consistently available to all
users regardless of their BI reporting environment or tools.
Figure 3-6: Sample reports with cryptic versus textual indicators. Both monthly sales reports show June 2013 extended sales dollar amounts for Baked Well Sourdough split by the holiday indicator; one labels the rows with cryptic Y/N codes, and the other with the self-explanatory Holiday and Non-holiday values.
A similar argument holds true for the weekday indicator that would have a value
of Weekday or Weekend. Saturdays and Sundays obviously would be assigned the
weekend value. Of course, multiple date table attributes can be jointly constrained,
so you can easily compare weekday holidays with weekend holidays.
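The decoding itself is typically a simple CASE expression applied in the ETL when the date dimension row is built; the staging table, source flag, and day-of-week column names below are assumptions for illustration.

select
    case s.holiday_flag when 'Y' then 'Holiday' else 'Non-holiday' end as holiday_indicator,
    case when s.day_of_week in ('Saturday', 'Sunday') then 'Weekend'
         else 'Weekday' end as weekday_indicator
from staging_calendar s;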
Current and Relative Date Attributes
Most date dimension attributes are not subject to updates. June 1, 2013 will always
roll up to June, Calendar Q2, and 2013. However, there are attributes you can add
to the basic date dimension that will change over time, including IsCurrentDay,
IsCurrentMonth, IsPrior60Days, and so on. IsCurrentDay obviously must be updated
each day; the attribute is useful for generating reports that always run for today. A
nuance to consider is the day that IsCurrentDay refers to. Most data warehouses
load data daily, so IsCurrentDay would refer to yesterday (or more accurately, the
mostrecent day loaded). You might also add attributes to the date dimension that
are unique to your corporate calendar, such as IsFiscalMonthEnd.
Some date dimensions include updated lag attributes. The lag day column would
take the value 0 for today, –1 for yesterday, +1 for tomorrow, and so on. This attribute
could easily be a computed column rather than physically stored. It might be useful
to set up similar structures for month, quarter, and year. Many BI tools include func-
tionality to do prior period calculations, so these lag columns may be unnecessary.
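A minimal sketch of the nightly maintenance for a current-day flag, assuming illustrative column names and noting that date arithmetic syntax varies by platform, might look like this:

-- reset and then flag the current day as part of the nightly ETL
update date_dim set is_current_day = 'Not Current Day';
update date_dim set is_current_day = 'Current Day'
where date_value = current_date; -- or the most recent day loaded, often yesterday

-- a lag-day value can be computed at query time rather than stored
select date_key, date_value, date_value - current_date as lag_days
from date_dim;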
Time-of-Day as a Dimension or Fact
Although date and time are comingled in an operational date/time stamp, time-of-
day is typically separated from the date dimension to avoid a row count explosion
in the date dimension. As noted earlier, a date dimension with 20 years of history
contains approximately 7,300 rows. If you changed the grain of this dimension to
one row per minute in a day, you'd end up with over 10 million rows to accommodate
the 1,440 minutes per day. If you tracked time to the second, you'd have more than
31 million rows per year! Because the date dimension is likely the most frequently
constrained dimension in a schema, it should be kept as small and manageable as
possible.
If you want to filter or roll up time periods based on summarized day part group-
ings, such as activity during 15-minute intervals, hours, shifts, lunch hour, or prime
time, time-of-day would be treated as a full-fledged dimension table with one row per
discrete time period, such as one row per minute within a 24-hour period resulting
in a dimension with 1,440 rows.
If there's no need to roll up or filter on time-of-day groupings, time-of-day should
be handled as a simple date/time fact in the fact table. By the way, business users
are often more interested in time lags, such as the transaction’s duration, rather
than discrete start and stop times. Time lags can easily be computed by taking the
difference between date/time stamps. These date/time stamps also allow an applica-
tion to determine the time gap between two transactions of interest, even if these
transactions exist in different days, months, or years.
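A sketch of such a lag calculation, assuming hypothetical begin and end date/time stamp facts in a transaction fact table and noting that timestamp subtraction syntax varies by DBMS:

-- duration between two date/time stamps stored as facts
select
    transaction_number,
    transaction_end_date_time - transaction_begin_date_time as transaction_duration
from service_call_fact;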
Product Dimension
The product dimension describes every SKU in the grocery store. Although a typi-
cal store may stock 60,000 SKUs, when you account for different merchandising
schemes and historical products that are no longer available, the product dimension
may have 300,000 or more rows. The product dimension is almost always sourced
from the operational product master file. Most retailers administer their product
master file at headquarters and download a subset to each store's POS system at
frequent intervals. It is headquarters' responsibility to define the appropriate product
master record (and unique SKU number) for each new product.
Flatten Many-to-One Hierarchies
The product dimension represents the many descriptive attributes of each SKU. The
merchandise hierarchy is an important group of attributes. Typically, individual
SKUs roll up to brands, brands roll up to categories, and categories roll up to depart-
ments. Each of these is a many-to-one relationship. This merchandise hierarchy and
additional attributes are shown for a subset of products in Figure 3-7.
Figure 3-7: Product dimension sample rows for nine bakery and frozen foods products, showing the Product Key, Product Description, Brand Description, Subcategory Description, Category Description, Department Description, and Fat Content columns.
For each SKU, all levels of the merchandise hierarchy are well defined. Some
attributes, such as the SKU description, are unique. In this case, there are 300,000
different values in the SKU description column. At the other extreme, there are only
perhaps 50 distinct values of the department attribute. Thus, on average, there are
6,000 repetitions of each unique value in the department attribute. This is perfectly
acceptable! You do not need to separate these repeated values into a second nor-
malized table to save space. Remember dimension table space requirements pale in
comparison with fact table space considerations.
NOTE Keeping the repeated low cardinality values in the primary dimension
table is a fundamental dimensional modeling technique. Normalizing these values
into separate tables defeats the primary goals of simplicity and performance, as
discussed in “Resisting Normalization Urges” later in this chapter.
Many of the attributes in the product dimension table are not part of the mer-
chandise hierarchy. The package type attribute might have values such as Bottle,
Bag, Box, or Can. Any SKU in any department could have one of these values.
It often makes sense to combine a constraint on this attribute with a constraint
on a merchandise hierarchy attribute. For example, you could look at all the SKUs
in the Cereal category packaged in Bags. Put another way, you can browse among
dimension attributes regardless of whether they belong to the merchandise hier-
archy. Product dimension tables typically have more than one explicit hierarchy.
A recommended partial product dimension for a retail grocery dimensional model
is shown in Figure 3-8.
Product Dimension
Product Key (PK)
SKU Number (NK)
Product Description
Brand Description
Subcategory Description
Category Description
Department Number
Department Description
Package Type Description
Package Size
Fat Content
Diet Type
Weight
Weight Unit of Measure
Storage Type
Shelf Life Type
Shelf Width
Shelf Height
Shelf Depth
...
Figure 3-8: Product dimension table.
Attributes with Embedded Meaning
Often operational product codes, identified in the dimension table by the NK notation
for natural key, have embedded meaning with different parts of the code representing
significant characteristics of the product. In this case, the multipart attribute should
be both preserved in its entirety within the dimension table, as well as broken down
into its component parts, which are handled as separate attributes. For example, if
the fifth through ninth characters in the operational code identify the manufacturer,
the manufacturer’s name should also be included as a dimension table attribute.
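In the ETL, deriving such component parts is usually simple string manipulation against the full code, which is also kept intact; the staging table name and code layout below are assumptions for illustration.

select
    s.operational_product_code as sku_number,
    -- characters five through nine identify the manufacturer in this hypothetical layout
    substring(s.operational_product_code from 5 for 5) as manufacturer_code
from product_staging s;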
Numeric Values as Attributes or Facts
You will sometimes encounter numeric values that don’t clearly fall into either the
fact or dimension attribute categories. A classic example is the standard list price
for a product. It's definitely a numeric value, so the initial instinct is to place it in
the fact table. But typically the standard price changes infrequently, unlike most
facts that are often differently valued on every measurement event.
If the numeric value is used primarily for calculation purposes, it likely belongs
in the fact table. Because standard price is non-additive, you might multiply it by
the quantity for an extended amount which would be additive. Alternatively, if the
standard price is used primarily for price variance analysis, perhaps the variance
metric should be stored in the fact table instead. If the stable numeric value is used
predominantly for filtering and grouping, it should be treated as a product dimen-
sion attribute.
Sometimes numeric values serve both calculation and filtering/grouping func-
tions. In these cases, you should store the value in both the fact and dimension
tables. Perhaps the standard price in the fact table represents the valuation at the
time of the sales transaction, whereas the dimension attribute is labeled to indicate
it’s the current standard price.
NOTE Data elements that are used both for fact calculations and dimension
constraining, grouping, and labeling should be stored in both locations, even
though a clever programmer could write applications that access these data
elements from a single location. It is important that dimensional models be as
consistent as possible and application development be predictably simple. Data
involved in calculations should be in fact tables and data involved in constraints,
groups and labels should be in dimension tables.
Drilling Down on Dimension Attributes
A reasonable product dimension table can have 50 or more descriptive attributes.
Each attribute is a rich source for constraining and constructing row header labels.
Drilling down is nothing more than asking for a row header from a dimension that
provides more information.
Let’s say you have a simple report summarizing the sales dollar amount by depart-
ment. As illustrated in Figure 3-9, if you want to drill down, you can drag any
other attribute, such as brand, from the product dimension into the report next to
department, and you can automatically drill down to this next level of detail. You
could drill down by the fat content attribute, even though it isn’t in the merchandise
hierarchy rollup.
NOTE Drilling down in a dimensional model is nothing more than adding row
header attributes from the dimension tables. Drilling up is removing row headers.
You can drill down or up on attributes from more than one explicit hierarchy and
with attributes that are part of no hierarchy.
Department Name   Sales Dollar Amount
Bakery            12,331
Frozen Foods      31,776

Drill down by brand name:

Department Name   Brand Name    Sales Dollar Amount
Bakery            Baked Well    3,009
Bakery            Fluffy        3,024
Bakery            Light         6,298
Frozen Foods      Coldpack      5,321
Frozen Foods      Freshlike     10,476
Frozen Foods      Frigid        7,328
Frozen Foods      Icy           2,184
Frozen Foods      QuickFreeze   6,467

Or drill down by fat content:

Department Name   Fat Content   Sales Dollar Amount
Bakery            Nonfat        6,298
Bakery            Reduced fat   5,027
Bakery            Regular fat   1,006
Frozen Foods      Nonfat        5,321
Frozen Foods      Reduced fat   10,476
Frozen Foods      Regular fat   15,979

Figure 3-9: Drilling down on dimension attributes.
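In SQL terms, the drill down in Figure 3-9 is nothing more than adding another dimension attribute to the select list and GROUP BY; the table and column names are illustrative.

-- summary by department
select p.department_description,
       sum(f.extended_sales_dollar_amount) as sales_dollar_amount
from retail_sales_fact f
join product_dim p on p.product_key = f.product_key
group by p.department_description;

-- drill down by adding the brand row header
select p.department_description,
       p.brand_description,
       sum(f.extended_sales_dollar_amount) as sales_dollar_amount
from retail_sales_fact f
join product_dim p on p.product_key = f.product_key
group by p.department_description, p.brand_description;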
The product dimension is a common dimension in many dimensional models.
Great care should be taken to fill this dimension with as many descriptive attributes
as possible. A robust and complete set of dimension attributes translates into robust
and complete analysis capabilities for the business users. We’ll further explore the
product dimension in Chapter 5: Procurement where we’ll also discuss the handling
of product attribute changes.
Store Dimension
The store dimension describes every store in the grocery chain. Unlike the product
master file that is almost guaranteed to be available in every large grocery business,
there may not be a comprehensive store master file. POS systems may simply sup-
ply a store number on the transaction records. In these cases, project teams must
assemble the necessary components of the store dimension from multiple opera-
tional sources. Often there will be a store real estate department at headquarters
who will help define a detailed store master file.
Multiple Hierarchies in Dimension Tables
The store dimension is the case study’s primary geographic dimension. Each store
can be thought of as a location. You can roll stores up to any geographic attribute,
such as ZIP code, county, and state in the United States. Contrary to popular
belief, cities and states within the United States are not a hierarchy. Since many
states have identically named cities, you’ll want to include a City-State attribute
in the store dimension.
Stores likely also roll up an internal organization hierarchy consisting of store
districts and regions. These two different store hierarchies are both easily repre-
sented in the dimension because both the geographic and organizational hierarchies
are well defined for a single store row.
NOTE It is not uncommon to represent multiple hierarchies in a dimension
table. The attribute names and values should be unique across the multiple
hierarchies.
A recommended retail store dimension table is shown in Figure 3-10.
Store Dimension
Store Key (PK)
Store Number (NK)
Store Name
Store Street Address
Store City
Store County
Store City-State
Store State
Store Zip Code
Store Manager
Store District
Store Region
Floor Plan Type
Photo Processing Type
Financial Service Type
Selling Square Footage
Total Square Footage
First Open Date
Last Remodel Date
...
Figure 3-10: Store dimension table.
The floor plan type, photo processing type, and finance services type are all short
text descriptors that describe the particular store. These should not be one-character
codes but rather should be 10- to 20-character descriptors that make sense when
viewed in a pull-down filter list or used as a report label.
The column describing selling square footage is numeric and theoretically addi-
tive across stores. You might be tempted to place it in the fact table. However, it is
clearly a constant attribute of a store and is used as a constraint or label more often
than it is used as an additive element in a summation. For these reasons, selling
square footage belongs in the store dimension table.
Dates Within Dimension Tables
The first open date and last remodel date in the store dimension could be date type
columns. However, if users want to group and constrain on nonstandard calendar
attributes (like the open date's fiscal period), then they are typically join keys to
copies of the date dimension table. These date dimension copies are declared in SQL
by the view construct and are semantically distinct from the primary date dimen-
sion. The view declaration would look like the following:
create view first_open_date (first_open_day_number, first_open_month, ...)
as select day_number, month, ...
from date
Now the system acts as if there is another physical copy of the date dimension
table called FIRST_OPEN_DATE. Constraints on this new date table have nothing to
do with constraints on the primary date dimension joined to the fact table. The first
open date view is a permissible outrigger to the store dimension; outriggers will be
described in more detail later in this chapter. Notice we have carefully relabeled all
the columns in the view so they cannot be confused with columns from the primary
date dimension. These distinct logical views on a single physical date dimension are
an example of dimension role playing, which we’ll discuss more fully in Chapter 6:
Order Management.
Promotion Dimension
The promotion dimension is potentially the most interesting dimension in the
retail sales schema. The promotion dimension describes the promotion condi-
tions under which a product is sold. Promotion conditions include temporary
price reductions, end aisle displays, newspaper ads, and coupons. This dimension
is often called a causal dimension because it describes factors thought to cause a
change in product sales.
Business analysts at both headquarters and the stores are interested in determin-
ing whether a promotion is effective. Promotions are judged on one or more of the
following factors:
Whether the products under promotion experienced a gain in sales, called
lift, during the promotional period. The lift can be measured only if the store
can agree on what the baseline sales of the promoted products would have
been without the promotion. Baseline values can be estimated from prior sales
history and, in some cases, with the help of sophisticated models.
Whether the products under promotion showed a drop in sales just prior to
or after the promotion, canceling the gain in sales during the promotion (time
shifting). In other words, did you transfer sales from regularly priced products
to temporarily reduced priced products?
Whether the products under promotion showed a gain in sales but other
products nearby on the shelf showed a corresponding sales decrease
(cannibalization).
Whether all the products in the promoted category of products experienced a
net overall gain in sales taking into account the time periods before, during,
and after the promotion (market growth).
Whether the promotion was profitable. Usually the profit of a promotion is
taken to be the incremental gain in profit of the promoted category over the
baseline sales taking into account time shifting and cannibalization, as well
as the costs of the promotion.
The causal conditions potentially affecting a sale are not necessarily tracked
directly by the POS system. The transaction system keeps track of price reduc-
tions and markdowns. The presence of coupons also typically is captured with
the transaction because the customer either presents coupons at the time of sale
or does not. Ads and in-store display conditions may need to be linked from other
sources.
The various possible causal conditions are highly correlated. A temporary price
reduction usually is associated with an ad and perhaps an end aisle display. For
this reason, it makes sense to create one row in the promotion dimension for each
combination of promotion conditions that occurs. Over the course of a year, there
may be 1,000 ads, 5,000 temporary price reductions, and 1,000 end aisle displays,
but there may be only 10,000 combinations of these three conditions affecting any
particular product. For example, in a given promotion, most of the stores would run
all three promotion mechanisms simultaneously, but a few of the stores may not
deploy the end aisle displays. In this case, two separate promotion condition rows
would be needed, one for the normal price reduction plus ad plus display and one
for the price reduction plus ad only. A recommended promotion dimension table
is shown in Figure 3-11.
From a purely logical point of view, you could record similar information about
the promotions by separating the four causal mechanisms (price reductions, ads,
displays, and coupons) into separate dimensions rather than combining them into
one dimension. Ultimately, this choice is the designer's prerogative. The trade-offs
in favor of keeping the four dimensions together include the following:
If the four causal mechanisms are highly correlated, the combined single
dimension is not much larger than any one of the separated dimensions
would be.
The combined single dimension can be browsed efficiently to see how the vari-
ous price reductions, ads, displays, and coupons are used together. However,
this browsing only shows the possible promotion combinations. Browsing in
the dimension table does not reveal which stores or products were affected
by the promotion; this information is found in the fact table.
Promotion Dimension
Promotion Key (PK)
Promotion Code
Promotion Name
Price Reduction Type
Promotion Media Type
Ad Type
Display Type
Coupon Type
Ad Media Name
Display Provider
Promotion Cost
Promotion Begin Date
Promotion End Date
...
Figure 3-11: Promotion dimension table.
The trade-o s in favor of separating the causal mechanisms into four distinct
dimension tables include the following:
The separated dimensions may be more understandable to the business com-
munity if users think of these mechanisms separately. This would be revealed
during the business requirement interviews.
Administration of the separate dimensions may be more straightforward than
administering a combined dimension.
Keep in mind there is no difference in the content between these two choices.
NOTE The inclusion of a promotion cost attribute in the promotion dimension
should be done with careful thought. This attribute can be used for constraining
and grouping. However, this cost should not appear in the POS transaction fact
table representing individual product sales because it is at the wrong grain; this
cost would have to reside in a fact table whose grain is the overall promotion.
Null Foreign Keys, Attributes, and Facts
Typically, many sales transactions include products that are not being promoted.
Hopefully, consumers aren’t just filling their shopping cart with promoted products;
you want them paying full price for some products in their cart! The promotion
dimension must include a row, with a unique key such as 0 or –1, to identify this
no promotion condition and avoid a null promotion key in the fact table. Referential
integrity is violated if you put a null in a fact table column declared as a foreign key
to a dimension table. In addition to the referential integrity alarms, null keys are
the source of great confusion to users because they can’t join on null keys.
WARNING You must avoid null keys in the fact table. A proper design includes
a row in the corresponding dimension table to identify that the dimension is not
applicable to the measurement.
We sometimes encounter nulls as dimension attribute values. These usually result
when a given dimension row has not been fully populated, or when there are attri-
butes that are not applicable to all the dimension’s rows. In either case, we recom-
mend substituting a descriptive string, such as Unknown or Not Applicable, in place
of the null value. Null values essentially disappear in pull-down menus of possible
attribute values or in report groupings; special syntax is required to identify them.
If users sum up facts by grouping on a fully populated dimension attribute, and then
alternatively, sum by grouping on a dimension attribute with null values, they’ll get
di erent query results. And you’ll get a phone call because the data doesn’t appear to
be consistent. Rather than leaving the attribute null, or substituting a blank space or
a period, it’s best to label the condition; users can then purposely decide to exclude
the Unknown or Not Applicable from their query. It’s worth noting that some OLAP
products prohibit null attribute values, so this is one more reason to avoid them.
Finally, we can also encounter nulls as metrics in the fact table. We generally
leave these null so that they’re properly handled in aggregate functions such as SUM,
MIN, MAX, COUNT, and AVG which do the “right thing” with nulls. Substituting a zero
instead would improperly skew these aggregated calculations.
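To make the behavior concrete, the following query sketch (table and column names are illustrative, echoing this chapter’s figures) shows how standard SQL aggregates treat a null fact: rows with a null discount simply drop out of AVG and the column-level COUNT, whereas substituting zero would drag the average down.

SELECT
    SUM(extended_discount_dollar_amount)   AS total_discount,      -- nulls are ignored
    AVG(extended_discount_dollar_amount)   AS avg_discount,        -- averages only non-null rows
    COUNT(extended_discount_dollar_amount) AS rows_with_discount,  -- counts only non-null rows
    COUNT(*)                               AS total_rows
FROM retail_sales_facts;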
Data mining tools may use different techniques for tracking nulls. You may need
to do some additional transformation work beyond the above recommendations if
creating an observation set for data mining.
Other Retail Sales Dimensions
Any descriptive attribute that takes on a single value in the presence of a fact table
measurement event is a good candidate to be added to an existing dimension or
be its own dimension. The decision whether a dimension should be associated
with a fact table should be a binary yes/no based on the fact table’s declared
grain. For example, there’s probably a cashier identified for each transaction. The
corresponding cashier dimension would likely contain a small subset of non-
private employee attributes. Like the promotion dimension, the cashier dimension
will likely have a No Cashier row for transactions that are processed through
self-service registers.
A trickier situation unfolds for the payment method. Perhaps the store has rigid
rules and only accepts one payment method per transaction. This would make
your life as a dimensional modeler easier because you’d attach a simple payment
method dimension to the sales schema that would likely include a payment method
description, along with perhaps a grouping of payment methods into either cash
equivalent or credit payment types.
In real life, payment methods often present a more complicated scenario. If
multiple payment methods are accepted on a single POS transaction, the payment
method does not take on a single value at the declared grain. Rather than altering
the declared grain to be something unnatural such as one row per payment method
per product on a POS transaction, you would likely capture the payment method in
a separate fact table with a granularity of either one row per transaction (then the
various payment method options would appear as separate facts) or one row per
payment method per transaction (which would require a separate payment method
dimension to associate with each row).
Degenerate Dimensions for Transaction Numbers
The retail sales fact table includes the POS transaction number on every line item
row. In an operational parent/child database, the POS transaction number would
be the key to the transaction header record, containing all the information valid
for the transaction as a whole, such as the transaction date and store identifier.
However, in the dimensional model, you have already extracted this interesting
header information into other dimensions. The POS transaction number is still
useful because it serves as the grouping key for pulling together all the products
purchased in a single market basket transaction. It also potentially enables you to
link back to the operational system.
Although the POS transaction number looks like a dimension key in the fact
table, the descriptive items that might otherwise fall in a POS transaction dimension
have been stripped off. Because the resulting dimension is empty, we refer to the
POS transaction number as a degenerate dimension (identified by the DD notation
in this book’s figures). The natural operational ticket number, such as the POS
transaction number, sits by itself in the fact table without joining to a dimension
table. Degenerate dimensions are very common when the grain of a fact table rep-
resents a single transaction or transaction line because the degenerate dimension
represents the unique identifier of the parent. Order numbers, invoice numbers,
and bill-of-lading numbers almost always appear as degenerate dimensions in a
dimensional model.
Degenerate dimensions often play an integral role in the fact table’s primary
key. In our case study, the primary key of the retail sales fact table consists of the
degenerate POS transaction number and product key, assuming scans of identical
products in the market basket are grouped together as a single line item.
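A minimal DDL sketch makes the composite key explicit; the table and column names are illustrative rather than dictated by the case study, and the degenerate POS transaction number is simply a fact table column with no dimension table behind it.

CREATE TABLE retail_sales_facts (
    date_key                     INTEGER NOT NULL,  -- FK to the date dimension
    product_key                  INTEGER NOT NULL,  -- FK to the product dimension
    store_key                    INTEGER NOT NULL,  -- FK to the store dimension
    promotion_key                INTEGER NOT NULL,  -- FK to the promotion dimension
    pos_transaction_number       BIGINT  NOT NULL,  -- degenerate dimension (DD)
    sales_quantity               INTEGER,
    extended_sales_dollar_amount NUMERIC(12,2),
    PRIMARY KEY (pos_transaction_number, product_key)
);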
NOTE Operational transaction control numbers such as order numbers, invoice
numbers, and bill-of-lading numbers usually give rise to empty dimensions and are
represented as degenerate dimensions in transaction fact tables. The degenerate
dimension is a dimension key without a corresponding dimension table.
If, for some reason, one or more attributes are legitimately left over after all the
other dimensions have been created and seem to belong to this header entity, you
would simply create a normal dimension row with a normal join. However, you would
no longer have a degenerate dimension.
Retail Schema in Action
With our retail POS schema designed, let’s illustrate how it would be put to use in
a query environment. A business user might be interested in better understanding
weekly sales dollar volume by promotion for the snacks category during January
2013 for stores in the Boston district. As illustrated in Figure 3-12, you would place
query constraints on month and year in the date dimension, district in the store
dimension, and category in the product dimension.
If the query tool summed the sales dollar amount grouped by week ending
date and promotion, the SQL query results would look similar to those below in
Figure 3-13. You can plainly see the relationship between the dimensional model
and the associated query. High-quality dimension attributes are crucial because they
are the source of query constraints and report labels. If you use a BI tool with more
functionality, the results would likely appear as a cross-tabular “pivoted” report,
which may be more appealing to business users than the columnar data resulting
from an SQL statement.
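A representative SQL statement for this request might look like the following sketch; the table and column names simply mirror the attribute labels in Figures 3-12 and 3-13 and would differ in a real implementation.

SELECT
    d.calendar_week_ending_date,
    p.promotion_name,
    SUM(f.extended_sales_dollar_amount) AS extended_sales_dollar_amount
FROM retail_sales_facts f
JOIN date_dimension      d  ON d.date_key      = f.date_key
JOIN product_dimension   pr ON pr.product_key  = f.product_key
JOIN store_dimension     s  ON s.store_key     = f.store_key
JOIN promotion_dimension p  ON p.promotion_key = f.promotion_key
WHERE d.calendar_month = 'January'
  AND d.calendar_year = 2013
  AND s.store_district = 'Boston'
  AND pr.category_description = 'Snacks'
GROUP BY d.calendar_week_ending_date, p.promotion_name
ORDER BY d.calendar_week_ending_date;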
Figure 3-12: Querying the retail sales schema. The retail sales fact table (date, product, store, promotion, cashier, and payment method foreign keys; the POS transaction number degenerate dimension; and the sales quantity, unit price, discount, sales, cost, and gross profit dollar facts) joins to the date, product, store, promotion, payment method, and cashier dimensions, with query constraints of January 2013 on the date dimension, the Boston district on the store dimension, and the Snacks category on the product dimension.
Figure 3-13: Query results and cross-tabular report. The columnar query results list the extended sales dollar amount by calendar week ending date and promotion name: January 6, 2013, No Promotion, 2,647; January 13, 2013, No Promotion, 4,851; January 20, 2013, Super Bowl Promotion, 7,248; and January 27, 2013, Super Bowl Promotion, 13,798. The cross-tabular report pivots the promotion names into separate No Promotion and Super Bowl Promotion extended sales columns (alongside a department name column) for the same four weeks.
Retail Schema Extensibility
Let’s turn our attention to extending the initial dimensional design. Several years
after the rollout of the retail sales schema, the retailer implements a frequent shop-
per program. Rather than knowing an unidentified shopper purchased 26 items on
a cash register receipt, you can now identify the specific shopper. Just imagine the
business users’ interest in analyzing shopping patterns by a multitude of geographic,
demographic, behavioral, and other differentiating shopper characteristics.
The handling of this new frequent shopper information is relatively straightfor-
ward. You’d create a frequent shopper dimension table and add another foreign key
in the fact table. Because you can’t ask shoppers to bring in all their old cash register
receipts to tag their historical sales transactions with their new frequent shopper
number, you’d substitute a default shopper dimension surrogate key, corresponding
to a Prior to Frequent Shopper Program dimension row, on the historical fact table
rows. Likewise, not everyone who shops at the grocery store will have a frequent
shopper card, so you’d also want to include a Frequent Shopper Not Identified row
in the shopper dimension. As we discussed earlier with the promotion dimension,
you can’t have a null frequent shopper key in the fact table.
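In SQL terms, the change amounts to something like the following sketch; the column name, the default key value of 0, and the exact ALTER syntax (shown here in PostgreSQL style) are all assumptions for illustration.

-- Add the new foreign key column to the existing fact table.
ALTER TABLE retail_sales_facts ADD COLUMN frequent_shopper_key INTEGER;

-- Tag every historical row with the default surrogate key pointing to the
-- Prior to Frequent Shopper Program row in the new shopper dimension.
UPDATE retail_sales_facts
   SET frequent_shopper_key = 0
 WHERE frequent_shopper_key IS NULL;

-- Disallow nulls going forward so every new fact row carries a shopper key.
ALTER TABLE retail_sales_facts ALTER COLUMN frequent_shopper_key SET NOT NULL;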
Our original schema gracefully extends to accommodate this new dimension
largely because the POS transaction data was initially modeled at its most granular
level. The addition of dimensions applicable at that granularity did not alter the
existing dimension keys or facts; all existing BI applications continue to run without
any changes. If the grain was originally declared as daily retail sales (transactions
summarized by day, store, product, and promotion) rather than the transaction line
detail, you would not have been able to incorporate the frequent shopper dimen-
sion. Premature summarization or aggregation inherently limits your ability to add
supplemental dimensions because the additional dimensions often don’t apply at
the higher grain.
The predictable symmetry of dimensional models enables them to absorb some
rather significant changes in source data and/or modeling assumptions without
invalidating existing BI applications, including:
New dimension attributes. If you discover new textual descriptors of a dimen-
sion, you can add these attributes as new columns. All existing applications
will be oblivious to the new attributes and continue to function. If the new
attributes are available only after a specific point in time, then Not Available
or its equivalent should be populated in the old dimension rows. Be fore-
warned that this scenario is more complicated if the business users want to
track historical changes to this newly identified attribute. If this is the case,
pay close attention to the slowly changing dimension coverage in Chapter 5.
New dimensions. As we just discussed, you can add a dimension to an exist-
ing fact table by adding a new foreign key column and populating it correctly
with values of the primary key from the new dimension.
New measured facts. If new measured facts become available, you can add
them gracefully to the fact table. The simplest case is when the new facts are
available in the same measurement event and at the same grain as the existing
facts. In this case, the fact table is altered to add the new columns, and the
values are populated into the table. If the new facts are only available from
a point in time forward, null values need to be placed in the older fact rows.
A more complex situation arises when new measured facts occur naturally
at a di erent grain. If the new facts cannot be allocated or assigned to the
original grain of the fact table, the new facts belong in their own fact table
because it’s a mistake to mix grains in the same fact table.
Factless Fact Tables
There is one important question that cannot be answered by the previous retail sales
schema: What products were on promotion but did not sell? The sales fact table
records only the SKUs actually sold. There are no fact table rows with zero facts
for SKUs that didn’t sell because doing so would enlarge the fact table enormously.
In the relational world, a promotion coverage or event fact table is needed to
answer the question concerning what didn’t happen. The promotion coverage fact
table keys would be date, product, store, and promotion in this case study. This
obviously looks similar to the sales fact table you just designed; however, the grain
would be significantly different. In the case of the promotion coverage fact table,
you’d load one row for each product on promotion in a store each day (or week, if
retail promotions are a week in duration) regardless of whether the product sold.
This fact table enables you to see the relationship between the keys as defined by a
promotion, independent of other events, such as actual product sales. We refer to it
as a factless fact table because it has no measurement metrics; it merely captures the
relationship between the involved keys, as illustrated in Figure 3-14. To facilitate
counting, you can include a dummy fact, such as promotion count in this example,
which always contains the constant value of 1; this is a cosmetic enhancement that
enables the BI application to avoid counting one of the foreign keys.
Determining what products were on promotion but didn’t sell requires a two-
step process. First, you’d query the promotion factless fact table to determine the
universe of products that were on promotion on a given day. You’d then determine
what products sold from the POS sales fact table. The answer to our original ques-
tion is the set difference between these two lists of products. If you work with data
in an OLAP cube, it is often easier to answer the “what didn’t happen” question
because the cube typically contains explicit cells for nonbehavior.
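In relational terms, the two-step logic is a set difference between the two fact tables, as in the following sketch; the key values are placeholders, and some databases spell EXCEPT as MINUS.

-- Products covered by a promotion in one store on one day (factless fact table)
-- minus the products that actually sold that day (sales fact table).
SELECT pc.product_key
FROM promotion_coverage_facts pc
WHERE pc.date_key  = 20130106   -- placeholder key values
  AND pc.store_key = 123
EXCEPT
SELECT rs.product_key
FROM retail_sales_facts rs
WHERE rs.date_key  = 20130106
  AND rs.store_key = 123;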
Figure 3-14: Promotion coverage factless fact table. The fact table carries only the date, product, store, and promotion foreign keys plus a Promotion Count fact that is always 1, joined to the date, product, store, and promotion dimensions.
Dimension and Fact Table Keys
Now that the schemas have been designed, we’ll focus on the dimension and fact
tables’ primary keys, along with other row identifiers.
Dimension Table Surrogate Keys
The unique primary key of a dimension table should be a surrogate key rather than
relying on the operational system identifier, known as the natural key. Surrogate keys
go by many other aliases: meaningless keys, integer keys, non-natural keys, artifi-
cial keys, and synthetic keys. Surrogate keys are simply integers that are assigned
sequentially as needed to populate a dimension. The first product row is assigned a
product surrogate key with the value of 1; the next product row is assigned product
key 2; and so forth. The actual surrogate key value has no business significance. The
surrogate keys merely serve to join the dimension tables to the fact table. Throughout
this book, column names with a Key suffix, identified as a primary key (PK) or
foreign key (FK), imply a surrogate.
Modelers sometimes are reluctant to relinquish the natural keys because they
want to navigate the fact table based on the operational code while avoiding a join
to the dimension table. They also don’t want to lose the embedded intelligence
that’s often part of a natural multipart key. However, you should avoid relying on
intelligent dimension keys because any assumptions you make eventually may be
invalidated. Likewise, queries and data access applications should not have any
built-in dependency on the keys because the logic also would be vulnerable to
invalidation. Even if the natural keys appear to be stable and devoid of meaning,
don’t be tempted to use them as the dimension table’s primary key.
NOTE Every join between dimension and fact tables in the data warehouse
should be based on meaningless integer surrogate keys. You should avoid using a
natural key as the dimension table’s primary key.
Initially, it may be faster to implement a dimensional model using operational
natural keys, but surrogate keys pay off in the long run. We sometimes think of
them as being similar to a flu shot for the data warehouse—like an immunization,
there’s a small amount of pain to initiate and administer surrogate keys, but the long
run benefits are substantial, especially considering the reduced risk of substantial
rework. Here are several advantages:
Buffer the data warehouse from operational changes. Surrogate keys enable
the warehouse team to maintain control of the DW/BI environment rather
than being whipsawed by operational rules for generating, updating, deleting,
recycling, and reusing production codes. In many organizations, historical
operational codes, such as inactive account numbers or obsolete product
codes, get reassigned after a period of dormancy. If account numbers get
recycled following 12 months of inactivity, the operational systems don’t miss
a beat because their business rules prohibit data from hanging around for that
long. But the DW/BI system may retain data for years. Surrogate keys provide
the warehouse with a mechanism to differentiate these two separate instances
of the same operational account number. If you rely solely on operational
codes, you might also be vulnerable to key overlaps in the case of an acquisi-
tion or consolidation of data.
Integrate multiple source systems. Surrogate keys enable the data warehouse
team to integrate data from multiple operational source systems, even if they
lack consistent source keys by using a back room cross-reference mapping
table to link the multiple natural keys to a common surrogate.
Improve performance. The surrogate key is as small an integer as possible
while ensuring it will comfortably accommodate the future anticipated car-
dinality (number of rows in the dimension). Often the operational code is a
bulky alphanumeric character string or even a group of fields. The smaller
surrogate key translates into smaller fact tables, smaller fact table indexes,
and more fact table rows per block input-output operation. Typically, a 4-byte
integer is sufficient to handle most dimensions. A 4-byte integer is a single
integer, not four decimal digits. It has 32 bits and therefore can handle approx-
imately 2 billion positive values (2³¹) or 4 billion total positive and negative
values (–2³¹ to +2³¹). This is more than enough for just about any dimension.
Remember, if you have a large fact table with 1 billion rows of data, every byte
in each fact table row translates into another gigabyte of storage.
Handle null or unknown conditions. As mentioned earlier, special surrogate
key values are used to record dimension conditions that may not have an
operational code, such as the No Promotion condition or the anonymous
customer. You can assign a surrogate key to identify these despite the lack of
operational coding. Similarly, fact tables sometimes have dates that are yet
to be determined. There is no SQL date type value for Date to Be Determined
or Date Not Applicable.
Support dimension attribute change tracking. One of the primary techniques
for handling changes to dimension attributes relies on surrogate keys to handle
the multiple profiles for a single natural key. This is actually one of the most
important reasons to use surrogate keys, which we’ll describe in Chapter 5.
A pseudo surrogate key created by simply gluing together the natural key
with a time stamp is perilous. You need to avoid multiple joins between the
dimension and fact tables, sometimes referred to as double-barreled joins, due
to their adverse impact on performance and ease of use.
Of course, some effort is required to assign and administer surrogate keys, but
it’s not nearly as intimidating as many people imagine. You need to establish and
maintain a cross-reference table in the ETL system that will be used to substitute the
appropriate surrogate key on each fact and dimension table row. We lay out a process
for administering surrogate keys in Chapter 19: ETL Subsystems and Techniques.
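As a rough illustration of that back room step, incoming source rows are joined to a cross-reference table on the natural key so the surrogate key can be written onto each fact row; all table and column names here are hypothetical, and each dimension gets a similar lookup.

SELECT stg.pos_transaction_number,
       stg.sales_quantity,
       map.product_key               -- surrogate key substituted for the natural SKU
FROM   staging_sales stg
JOIN   product_key_map map
  ON   map.sku_number = stg.sku_number;   -- natural key from the operational source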
Dimension Natural and Durable Supernatural Keys
Like surrogate keys, the natural keys assigned and used by operational source sys-
tems go by other names, such as business keys, production keys, and operational
keys. They are identified with the NK notation in the book’s figures. The natural
key is often modeled as an attribute in the dimension table. If the natural key comes
from multiple sources, you might use a character data type that prepends a source
code, such as SAP|43251 or CRM|6539152. If the same entity is represented in both
operational source systems, then you’d likely have two natural key attributes in
the dimension corresponding to both sources. Operational natural keys are often
composed of meaningful constituent parts, such as the product’s line of business
or country of origin; these components should be split apart and made available as
separate attributes.
In a dimension table with attribute change tracking, it’s important to have an iden-
tifier that uniquely and reliably identifies the dimension entity across its attribute
changes. Although the operational natural key may seem to fit this bill, sometimes
the natural key changes due to unexpected business rules (like an organizational
merger) or to handle either duplicate entries or data integration from multiple
sources. If the dimension’s natural keys are not absolutely protected and preserved
over time, the ETL system needs to assign permanent durable identifiers, also known
as supernatural keys. A persistent durable supernatural key is controlled by the DW/
BI system and remains immutable for the life of the system. Like the dimension
surrogate key, it’s a simple integer sequentially assigned. And like the natural keys
discussed earlier, the durable supernatural key is handled as a dimension attribute;
it’s not a replacement for the dimension table’s surrogate primary key. Chapter 19
also discusses the ETL system’s responsibility for these durable identifiers.
Degenerate Dimension Surrogate Keys
Although surrogate keys aren’t typically assigned to degenerate dimensions, each
situation needs to be evaluated to determine if one is required. A surrogate key is
necessary if the transaction control numbers are not unique across locations or get
reused. For example, the retailer’s POS system may not assign unique transaction
numbers across stores. The system may wrap back to zero and reuse previous con-
trol numbers when its maximum has been reached. Also, the transaction control
number may be a bulky 24-byte alphanumeric column. Finally, depending on the
capabilities of the BI tool, you may need to assign a surrogate key (and create an
associated dimension table) to drill across on the transaction number. Obviously,
control number dimensions modeled in this way with corresponding dimension
tables are no longer degenerate.
Date Dimension Smart Keys
As we’ve noted, the date dimension has unique characteristics and requirements.
Calendar dates are fixed and predetermined; you never need to worry about deleting
dates or handling new, unexpected dates on the calendar. Because of its predict-
ability, you can use a more intelligent key for the date dimension.
If a sequential integer serves as the primary key of the date dimension, it should
be chronologically assigned. In other words, January 1 of the first year would
be assigned surrogate key value 1, January 2 would be assigned surrogate key 2,
February 1 would be assigned surrogate key 32, and so on.
More commonly, the primary key of the date dimension is a meaningful integer
formatted as yyyymmdd. The yyyymmdd key is not intended to provide business
users and their BI applications with an intelligent key so they can bypass the date
dimension and directly query the fact table. Filtering on the fact table’s yyyymmdd
key would have a detrimental impact on usability and performance. Filtering and
grouping on calendar attributes should occur in a dimension table, not in the BI
application’s code.
However, the yyyymmdd key is useful for partitioning fact tables. Partitioning
enables a table to be segmented into smaller tables under the covers. Partitioning
a large fact table on the basis of date is effective because it allows old data to be
removed gracefully and new data to be loaded and indexed in the current parti-
tion without disturbing the rest of the fact table. It reduces the time required for
loads, backups, archiving, and query response. Programmatically updating and
maintaining partitions is straightforward if the date key is an ordered integer: year
increments by 1 up to the number of years wanted, month increments by 1 up to
12, and so on. Using a smart yyyymmdd key provides the benefits of a surrogate,
plus the advantages of easier partition management.
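As an example, with PostgreSQL-style declarative partitioning (the syntax differs across platforms, and the table definition here is only a sketch), a fact table keyed on the yyyymmdd integer can be range partitioned by month, so old months can be detached or dropped without touching current data.

CREATE TABLE retail_sales_facts (
    date_key                     INTEGER NOT NULL,   -- smart yyyymmdd key, e.g. 20130115
    product_key                  INTEGER NOT NULL,
    store_key                    INTEGER NOT NULL,
    promotion_key                INTEGER NOT NULL,
    pos_transaction_number       BIGINT  NOT NULL,
    extended_sales_dollar_amount NUMERIC(12,2)
) PARTITION BY RANGE (date_key);

-- One partition per month; adding next month's partition is a simple, predictable step.
CREATE TABLE retail_sales_facts_201301 PARTITION OF retail_sales_facts
    FOR VALUES FROM (20130101) TO (20130201);
CREATE TABLE retail_sales_facts_201302 PARTITION OF retail_sales_facts
    FOR VALUES FROM (20130201) TO (20130301);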
Although the yyyymmdd integer is the most common approach for date dimen-
sion keys, some relational database optimizers prefer a true date type column for
partitioning. In these cases, the optimizer knows there are 31 values between
March 1 and April 1, as opposed to the apparent 100 values between 20130301 and
20130401. Likewise, it understands there are 31 values between December 1 and
January 1, as opposed to the 8,900 integer values between 20121201 and 20130101.
This intelligence can impact the query strategy chosen by the optimizer and further
reduce query times. If the optimizer incorporates date type intelligence, it should
be considered for the date key. If the only rationale for a date type key is simplified
administration for the DBA, then you can feel less compelled.
With more intelligent date keys, whether chronologically assigned or a more
meaningful yyyymmdd integer or date type column, you need to reserve a special
date key value for the situation in which the date is unknown when the fact row is
initially loaded.
Fact Table Surrogate Keys
Although we’re adamant about using surrogate keys for dimension tables, we’re less
demanding about a surrogate key for fact tables. Fact table surrogate keys typically
only make sense for back room ETL processing. As we mentioned, the primary
key of a fact table typically consists of a subset of the table’s foreign keys and/or
degenerate dimension. However, single column surrogate keys for fact tables have
some interesting back room benefits.
Like its dimensional counterpart, a fact table surrogate key is a simple integer,
devoid of any business content, that is assigned in sequence as fact table rows are
generated. Although the fact table surrogate key is unlikely to deliver query perfor-
mance advantages, it does have the following benefits:
Immediate unique identification. A single fact table row is immediately iden-
tified by the key. During ETL processing, a specific row can be identified
without navigating multiple dimensions.
Backing out or resuming a bulk load. If a large number of rows are being
loaded with sequentially assigned surrogate keys, and the process halts before
completion, the DBA can determine exactly where the process stopped by
finding the maximum key in the table. The DBA could back out the complete
load by specifying the range of keys just loaded or perhaps could resume the
load from exactly the correct point.
Replacing updates with inserts plus deletes. The fact table surrogate key
becomes the true physical key of the fact table. No longer is the key of the
fact table determined by a set of dimensional foreign keys, at least as far as
the RDBMS is concerned. Thus it becomes possible to replace a fact table
update operation with an insert followed by a delete. The first step is to
place the new row into the database with all the same business foreign keys
as the row it is to replace. This is now possible because the key enforce-
ment depends only on the surrogate key, and the replacement row has a
new surrogate key. Then the second step deletes the original row, thereby
accomplishing the update. For a large set of updates, this sequence is more
e cient than a set of true update operations. The insertions can be pro-
cessed with the ability to back out or resume the insertions as described in
the previous bullet. These insertions do not need to be protected with full
transaction machinery. Then the final deletion step can be performed safely
because the insertions have run to completion; a brief SQL sketch of this sequence follows this list.
Using the fact table surrogate key as a parent in a parent/child schema. In
those cases in which one fact table contains rows that are parents of those in
a lower grain fact table, the fact table surrogate key in the parent table is also
exposed in the child table. The argument of using the fact table surrogate
key in this case rather than a natural parent key is similar to the argument
for using surrogate keys in dimension tables. Natural keys are messy and
unpredictable, whereas surrogate keys are clean integers and are assigned by
the ETL system, not the source system. Of course, in addition to including
the parent fact table’s surrogate key, the lower grained fact table should also
include the parent’s dimension foreign keys so the child facts can be sliced
and diced without traversing the parent fact table’s surrogate key. And as we’ll
discuss in Chapter 4: Inventory, you should never join fact tables directly to
other fact tables.
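Here is the brief sketch of the insert-plus-delete sequence referenced above; the key values are purely illustrative, and retail_sales_key stands for a hypothetical fact table surrogate key column.

-- Step 1: insert the replacement row with the same dimension foreign keys
-- but a newly assigned fact table surrogate key.
INSERT INTO retail_sales_facts
    (retail_sales_key, date_key, product_key, store_key, promotion_key,
     pos_transaction_number, sales_quantity)
VALUES
    (900000001, 20130115, 42, 17, 3, 556677, 2);

-- Step 2: once all insertions have completed, delete the superseded row
-- by its surrogate key.
DELETE FROM retail_sales_facts
 WHERE retail_sales_key = 123456789;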
Resisting Normalization Urges
In this section, let’s directly confront several of the natural urges that tempt model-
ers coming from a more normalized background. We’ve been consciously breaking
some traditional modeling rules because we’re focused on delivering value through
ease of use and performance, not on transaction processing efficiencies.
Snowflake Schemas with Normalized Dimensions
The flattened, denormalized dimension tables with repeating textual values make
data modelers from the operational world uncomfortable. Let’s revisit the case study
product dimension table. The 300,000 products roll up into 50 distinct depart-
ments. Rather than redundantly storing the 20-byte department description in the
product dimension table, modelers with a normalized upbringing want to store a
2-byte department code and then create a new department dimension for the depart-
ment decodes. In fact, they would feel more comfortable if all the descriptors in the
original design were normalized into separate dimension tables. They argue this
design saves space because the 300,000-row dimension table only contains codes,
not lengthy descriptors.
In addition, some modelers contend that more normalized dimension tables are
easier to maintain. If a department description changes, they’d need to update only
the one occurrence in the department dimension rather than the 6,000 repetitions
in the original product dimension. Maintenance often is addressed by normaliza-
tion disciplines, but all this happens back in the ETL system long before the data
is loaded into a presentation area’s dimensional schema.
Dimension table normalization is referred to as snowflaking. Redundant attributes
are removed from the flat, denormalized dimension table and placed in separate
normalized dimension tables. Figure 3-15 illustrates the partial snowflaking of the
product dimension into third normal form. The contrast between Figure 3-15 and
Figure 3-8 is startling. The plethora of snowflaked tables (even in our simplistic
example) is overwhelming. Imagine the impact on Figure 3-12 if all the schema’s
hierarchies were normalized.
Snowflaking is a legal extension of the dimensional model; however, we encour-
age you to resist the urge to snowflake given the two primary design drivers: ease
of use and performance.
Figure 3-15: Snowflaked product dimension. The product dimension keeps the SKU number, product description, and physical attributes but holds foreign keys to separate brand, package type, and storage type dimensions; the brand dimension points to a category dimension, which in turn points to a department dimension, and the storage type dimension points to a shelf life type dimension.
The multitude of snowflaked tables makes for a much more complex presen-
tation. Business users inevitably will struggle with the complexity; simplicity
is one of the primary objectives of a dimensional model.
Most database optimizers also struggle with the snowflaked schema’s complex-
ity. Numerous tables and joins usually translate into slower query performance.
The complexities of the resulting join specifications increase the chances that
the optimizer will get sidetracked and choose a poor strategy.
The minor disk space savings associated with snowflaked dimension tables
are insignificant. If you replace the 20-byte department description in the
300,000 row product dimension table with a 2-byte code, you’d save a whop-
ping 5.4 MB (300,000 x 18 bytes); meanwhile, you may have a 10 GB fact
table! Dimension tables are almost always geometrically smaller than fact
tables. E orts to normalize dimension tables to save disk space are usually
a waste of time.
Snowflaking negatively impacts the users’ ability to browse within a dimen-
sion. Browsing often involves constraining one or more dimension attributes
and looking at the distinct values of another attribute in the presence of these
constraints. Browsing allows users to understand the relationship between
dimension attribute values.
Obviously, a snowflaked product dimension table responds well if you just
want a list of the category descriptions. However, if you want to see all the
brands within a category, you need to traverse the brand and category dimen-
sions. If you want to also list the package types for each brand in a category,
you’d be traversing even more tables. The SQL needed to perform these seem-
ingly simple queries is complex, and you haven’t touched the other dimensions
or fact table.
Finally, snowflaking defeats the use of bitmap indexes. Bitmap indexes are
useful when indexing low-cardinality columns, such as the category and
department attributes in the product dimension table. They greatly speed
the performance of a query or constraint on the single column in question.
Snowflaking inevitably would interfere with your ability to leverage this per-
formance tuning technique.
NOTE Fixed depth hierarchies should be flattened in dimension tables.
Normalized, snowflaked dimension tables penalize cross-attribute browsing and
prohibit the use of bitmapped indexes. Disk space savings gained by normalizing
the dimension tables typically are less than 1 percent of the total disk space needed
for the overall schema. You should knowingly sacrifice this dimension table space
in the spirit of performance and ease of use advantages.
Some database vendors argue their platform has the horsepower to query a fully
normalized dimensional model without performance penalties. If you can achieve
satisfactory performance without physically denormalizing the dimension tables,
that’s fine. However, you’ll still want to implement a logical dimensional model with
denormalized dimensions to present an easily understood schema to the business
users and their BI applications.
In the past, some BI tools indicated a preference for snowflake schemas; snowflak-
ing to address the idiosyncratic requirements of a BI tool is acceptable. Likewise, if
all the data is delivered to business users via an OLAP cube (where the snowflaked
dimensions are used to populate the cube but are never visible to the users), then
snowflaking is acceptable. However, in these situations, you need to consider the
impact on users of alternative BI tools and the flexibility to migrate to alternatives
in the future.
Outriggers
Although we generally do not recommend snowflaking, there are situations in which
it is permissible to build an outrigger dimension that attaches to a dimension within
the fact table’s immediate halo, as illustrated in Figure 3-16. In this example, the
“once removed” outrigger is a date dimension snowflaked off a primary dimension.
The outrigger date attributes are descriptively and uniquely labeled to distinguish
them from the other dates associated with the business process. It only makes
sense to outrigger a primary dimension table’s date attribute if the business wants
to fi lter and group this date by nonstandard calendar attributes, such as the fi scal
period, business day indicator, or holiday period. Otherwise, you could just treat
the date attribute as a standard date type column in the product dimension. If a date
outrigger is used, be careful that the outrigger dates fall within the range stored in
the standard date dimension table.
Figure 3-16: Example of a permissible outrigger. The product dimension carries a Product Introduction Date Key (FK) that joins to a Product Introduction Date dimension whose attributes (date, calendar month and year, fiscal month, quarter, and year, and holiday period indicator) are uniquely labeled with the Product Introduction prefix.
You’ll encounter more outrigger examples later in the book, such as the han-
dling of customers’ county-level demographic attributes in Chapter 8: Customer
Relationship Management.
Although outriggers may save space and ensure the same attributes are referenced
consistently, there are downsides. Outriggers introduce more joins, which can nega-
tively impact performance. More important, outriggers can negatively impact the
legibility for business users and hamper their ability to browse among attributes
within a single dimension.
WARNING Though outriggers are permissible, a dimensional model should
not be littered with outriggers given the potentially negative impact. Outriggers
should be the exception rather than the rule.
Centipede Fact Tables with Too Many Dimensions
The fact table in a dimensional schema is naturally highly normalized and compact.
There is no way to further normalize the extremely complex many-to-many relation-
ships among the keys in the fact table because the dimensions are not correlated
with each other. Every store is open every day. Sooner or later, almost every product
is sold on promotion in most or all of our stores.
Interestingly, while uncomfortable with denormalized dimension tables, some
modelers are tempted to denormalize the fact table. They have an uncontrollable
urge to normalize dimension hierarchies but know snowflaking is highly discour-
aged, so the normalized tables end up joined to the fact table instead. Rather than
having a single product foreign key on the fact table, they include foreign keys for
the frequently analyzed elements on the product hierarchy, such as brand, category,
and department. Likewise, the date key suddenly turns into a series of keys joining
to separate week, month, quarter, and year dimension tables. Before you know it,
your compact fact table has turned into an unruly monster that joins to literally
dozens of dimension tables. We affectionately refer to these designs as centipede
fact tables because they appear to have nearly 100 legs, as shown in Figure 3-17.
Figure 3-17: Centipede fact table with too many normalized dimensions. The POS retail sales transaction fact table joins not only to date, product, store, and promotion dimensions, but also to separate week, month, quarter, year, fiscal year, and fiscal month dimensions; brand, category, department, and package type dimensions; store county, store state, store district, store region, and store floor plan dimensions; and promotion reduction type and promotion media type dimensions.
Even with its tight format, the fact table is the behemoth in a dimensional model.
Designing a fact table with too many dimensions leads to significantly increased fact
table disk space requirements. Although denormalized dimension tables consume
extra space, fact table space consumption is a concern because it is your largest
table by orders of magnitude. There is no way to index the enormous multipart
key e ectively in the centipede example. The numerous joins are an issue for both
usability and query performance.
Most business processes can be represented with less than 20 dimensions in the
fact table. If a design has 25 or more dimensions, you should look for ways to com-
bine correlated dimensions into a single dimension. Perfectly correlated attributes,
such as the levels of a hierarchy, as well as attributes with a reasonable statistical
correlation, should be part of the same dimension. It’s a good decision to combine
dimensions when the resulting new single dimension is noticeably smaller than the
Cartesian product of the separate dimensions.
NOTE A very large number of dimensions typically are a sign that several
dimensions are not completely independent and should be combined into a
single dimension. It is a dimensional modeling mistake to represent elements of
a single hierarchy as separate dimensions in the fact table.
Developments with columnar databases may reduce the query and storage penal-
ties associated with wide centipede fact table designs. Rather than storing each table
row, a columnar database stores each table column as a contiguous object that is
heavily indexed for access. Even though the underlying physical storage is colum-
nar, at the query level, the table appears to be made up of familiar rows. But when
queried, only the named columns are actually retrieved from the disk, rather than
the entire row in a more conventional row-oriented relational database. Columnar
databases are much more tolerant of the centipede fact tables just described; how-
ever, the ability to browse across hierarchically related dimension attributes may
be compromised.
Summary
This chapter was your first exposure to designing a dimensional model. Regardless
of the industry, we strongly encourage the four-step process for tackling dimensional
model designs. Remember it is especially important to clearly state the grain associ-
ated with a dimensional schema. Loading the fact table with atomic data provides
the greatest flexibility because the data can be summarized “every which way.” As
soon as the fact table is restricted to more aggregated information, you run into
walls when the summarization assumptions prove to be invalid. Also it is vitally
important to populate your dimension tables with verbose, robust descriptive attri-
butes for analytic filtering and labeling.
In the next chapter we’ll remain within the retail industry to discuss techniques
for tackling a second business process within the organization, ensuring your earlier
e orts are leveraged while avoiding stovepipes.
Inventory
In Chapter 3: Retail Sales, we developed a dimensional model for the sales transac-
tions in a large grocery chain. We remain within the same industry in this chapter
but move up the value chain to tackle the inventory process. The designs developed
in this chapter apply to a broad set of inventory pipelines both inside and outside
the retail industry.
More important, this chapter provides a thorough discussion of the enterprise
data warehouse bus architecture. The bus architecture is essential to creating an
integrated DW/BI system. It provides a framework for planning the overall environ-
ment, even though it will be built incrementally. We will underscore the importance
of using common conformed dimensions and facts across dimensional models, and
will close by encouraging the adoption of an enterprise data governance program.
Chapter 4 discusses the following concepts:
Representing organizational value chains via a series of dimensional models
Semi-additive facts
Three fact table types: periodic snapshots, transaction, and accumulating
snapshots
Enterprise data warehouse bus architecture and bus matrix
Opportunity/stakeholder matrix
Conformed dimensions and facts, and their impact on agile methods
Importance of data governance
Value Chain Introduction
Most organizations have an underlying value chain of key business processes. The
value chain identifies the natural, logical flow of an organization’s primary activi-
ties. For example, a retailer issues purchase orders to product manufacturers. The
products are delivered to the retailer’s warehouse, where they are held in inven-
tory. A delivery is then made to an individual store, where again the products sit in
inventory until a consumer makes a purchase. Figure 4-1 illustrates this subset of a
retailer’s value chain. Obviously, products sourced from manufacturers that deliver
directly to the retail store would bypass the warehousing processes.
Figure 4-1: Subset of a retailer’s value chain: issue purchase order to manufacturer, receive warehouse deliveries, warehouse product inventory, receive store deliveries, store product inventory, and retail sales.
Operational source systems typically produce transactions or snapshots at each
step of the value chain. The primary objective of most analytic DW/BI systems is
to monitor the performance results of these key processes. Because each process
produces unique metrics at unique time intervals with unique granularity and
dimensionality, each process typically spawns one or more fact tables. To this end,
the value chain provides high-level insight into the overall data architecture for an
enterprise DW/BI environment. We’ll devote more time to this topic in the “Value
Chain Integration” section later in this chapter.
Inventory Models
In the meantime, we’ll discuss several complementary inventory models. The first
is the inventory periodic snapshot where product inventory levels are measured at
regular intervals and placed as separate rows in a fact table. These periodic snapshot
rows appear over time as a series of data layers in the dimensional model, much like
geologic layers represent the accumulation of sediment over long periods of time.
We’ll then discuss a second inventory model where every transaction that impacts
inventory levels as products move through the warehouse is recorded. Finally, in the
third model, we’ll describe the inventory accumulating snapshot where a fact table
row is inserted for each product delivery and then the row is updated as the product
moves through the warehouse. Each model tells a different story. For some analytic
requirements, two or even all three models may be appropriate simultaneously.
Inventory Periodic Snapshot
Let’s return to our retail case study. Optimized inventory levels in the stores can have
a major impact on chain profitability. Making sure the right product is in the right
store at the right time minimizes out-of-stocks (where the product isn’t available
on the shelf to be sold) and reduces overall inventory carrying costs. The retailer
wants to analyze daily quantity-on-hand inventory levels by product and store.
It is time to put the four-step dimensional design process to work again. The
business process we’re interested in analyzing is the periodic snapshotting of retail
store inventory. The most atomic level of detail provided by the operational inven-
tory system is a daily inventory for each product in each store. The dimensions
immediately fall out of this grain declaration: date, product, and store. This often
happens with periodic snapshot fact tables where you cannot express the granular-
ity in the context of a transaction, so a list of dimensions is needed instead. In this
case study, there are no additional descriptive dimensions at this granularity. For
example, promotion dimensions are typically associated with product movement,
such as when the product is ordered, received, or sold, but not with inventory.
The simplest view of inventory involves only a single fact: quantity on hand.
This leads to an exceptionally clean dimensional design, as shown in Figure 4-2.
Figure 4-2: Store inventory periodic snapshot schema. The store inventory snapshot fact table contains date, product, and store foreign keys and a single Quantity on Hand fact, joined to the date, product, and store dimensions.
The date dimension table in this case study is identical to the table developed
in Chapter 3 for retail store sales. The product and store dimensions may be deco-
rated with additional attributes that would be useful for inventory analysis. For
example, the product dimension could be enhanced with columns such as the
minimum reorder quantity or the storage requirement, assuming they are constant
and discrete descriptors of each product. If the minimum reorder quantity varies for
a product by store, it couldn’t be included as a product dimension attribute. In the
store dimension, you might include attributes to identify the frozen and refrigerated
storage square footages.
Even a schema as simple as Figure 4-2 can be very useful. Numerous insights can
be derived if inventory levels are measured frequently for many products in many
locations. However, this periodic snapshot fact table faces a serious challenge that
Chapter 3’s sales transaction fact table did not. The sales fact table was reasonably
sparse because you don’t sell every product in every shopping cart. Inventory, on
the other hand, generates dense snapshot tables. Because the retailer strives to
avoid out-of-stock situations in which the product is not available, there may be
a row in the fact table for every product in every store every day. In that case you
would include the zero out-of-stock measurements as explicit rows. For the grocery
retailer with 60,000 products stocked in 100 stores, approximately 6 million rows
(60,000 products x 100 stores) would be inserted with each nightly fact table load.
However, because the row width is just 14 bytes, the fact table would grow by only
84 MB with each load.
Although the data volumes in this case are manageable, the denseness of some
periodic snapshots may mandate compromises. Perhaps the most obvious is to
reduce the snapshot frequencies over time. It may be acceptable to keep the last 60
days of inventory at the daily level and then revert to less granular weekly snap-
shots for historical data. In this way, instead of retaining 1,095 snapshots during
a 3-year period, the number could be reduced to 208 total snapshots; the 60 daily
and 148 weekly snapshots should be stored in two separate fact tables given their
unique periodicity.
Semi-Additive Facts
We stressed the importance of fact additivity in Chapter 3. In the inventory snap-
shot schema, the quantity on hand can be summarized across products or stores
and result in a valid total. Inventory levels, however, are not additive across dates
because they represent snapshots of a level or balance at one point in time. Because
inventory levels (and all forms of financial account balances) are additive across
some dimensions but not all, we refer to them as semi-additive facts.
The semi-additive nature of inventory balance facts is even more understand-
able if you think about your checking account balances. On Monday, presume
that you have $50 in your account. On Tuesday, the balance remains unchanged.
On Wednesday, you deposit another $50 so the balance is now $100. The account
has no further activity through the end of the week. On Friday, you can’t merely
add up the daily balances during the week and declare that the ending balance is
$400 (based on $50 + $50 + $100 + $100 + $100). The most useful way to combine
account balances and inventory levels across dates is to average them (resulting in
an $80 average balance in the checking example). You are probably familiar with
your bank referring to the average daily balance on a monthly account summary.
NOTE All measures that record a static level (inventory levels, financial account
balances, and measures of intensity such as room temperatures) are inherently
non-additive across the date dimension and possibly other dimensions. In these
cases, the measure may be aggregated across dates by averaging over the number
of time periods.
Unfortunately, you cannot use the SQL AVG function to calculate the average
over time. This function averages over all the rows received by the query, not just
the number of dates. For example, if a query requested the average inventory for
a cluster of three products in four stores across seven dates (e.g., the average daily
inventory of a brand in a geographic region during a week), the SQL AVG function
would divide the summed inventory value by 84 (3 products × 4 stores × 7 dates).
Obviously, the correct answer is to divide the summed inventory value by 7, which
is the number of daily time periods.
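In SQL, the workaround is to divide the summed balance by a count of the distinct time periods rather than relying on AVG. The following sketch (illustrative names, dialect-neutral) computes the average daily quantity on hand by brand for one week; the divisor is the number of dates, not the number of fact rows.

SELECT
    p.brand_description,
    SUM(f.quantity_on_hand) * 1.0
        / COUNT(DISTINCT f.date_key) AS avg_daily_quantity_on_hand   -- divide by the date count
FROM store_inventory_snapshot_fact f
JOIN product_dimension p ON p.product_key = f.product_key
JOIN date_dimension    d ON d.date_key    = f.date_key
WHERE d.date BETWEEN DATE '2013-01-06' AND DATE '2013-01-12'   -- seven daily snapshots
GROUP BY p.brand_description;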
OLAP products provide the capability to define aggregation rules within the
cube, so semi-additive measures like balances are less problematic if the data is
deployed via OLAP cubes.
Enhanced Inventory Facts
The simplistic view in the periodic inventory snapshot fact table enables you to see
a time series of inventory levels. For most inventory analysis, quantity on hand isn’t
enough. Quantity on hand needs to be used in conjunction with additional facts to
measure the velocity of inventory movement and develop other interesting metrics
such as the number of turns and number of days’ supply.
If quantity sold (or equivalently, quantity shipped for a warehouse location) was
added to each fact row, you could calculate the number of turns and days’ supply.
For daily inventory snapshots, the number of turns measured each day is calculated
as the quantity sold divided by the quantity on hand. For an extended time span,
such as a year, the number of turns is the total quantity sold divided by the daily
average quantity on hand. The number of days’ supply is a similar calculation. Over
a time span, the number of days’ supply is the final quantity on hand divided by
the average quantity sold.
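For illustration, once the quantity sold fact is added to the snapshot (as described next), annual turns and days’ supply could be computed roughly as follows. This is a sketch only; the table name, column names, and the year-end date literal are hypothetical, and numeric casts are omitted.

-- Annual turns  = total quantity sold / average daily quantity on hand.
-- Days' supply  = final quantity on hand / average daily quantity sold.
SELECT
    f.product_key,
    SUM(f.quantity_sold)
        / (SUM(f.quantity_on_hand) / COUNT(DISTINCT f.date_key)) AS annual_turns,
    SUM(CASE WHEN d.full_date = DATE '2013-12-31'
             THEN f.quantity_on_hand ELSE 0 END)
        / (SUM(f.quantity_sold) / COUNT(DISTINCT f.date_key))    AS days_supply
FROM store_inventory_snapshot_fact f
JOIN date_dim d ON d.date_key = f.date_key
WHERE d.calendar_year = 2013
GROUP BY f.product_key;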
In addition to the quantity sold, inventory analysts are also interested in the
extended value of the inventory at cost, as well as the value at the latest selling price.
The initial periodic snapshot is embellished in Figure 4-3.
[Figure 4-3 diagram: the Store Inventory Snapshot Fact table joins to the Date, Product, and Store dimensions via Date Key, Product Key, and Store Key foreign keys, and carries four facts: Quantity on Hand, Quantity Sold, Inventory Dollar Value at Cost, and Inventory Dollar Value at Latest Selling Price.]
Figure 4-3: Enhanced inventory periodic snapshot.
Notice that quantity on hand is semi-additive, but the other measures in the
enhanced periodic snapshot are all fully additive. The quantity sold amount has been
rolled up to the snapshot’s daily granularity. The valuation columns are extended,
additive amounts. In some periodic snapshot inventory schemas, it is useful to
store the beginning balance, the inventory change or delta, along with the ending
balance. In this scenario, the balances are again semi-additive, whereas the deltas
are fully additive across all the dimensions.
The periodic snapshot is the most common inventory schema. We’ll briefly dis-
cuss two alternative perspectives that complement the inventory snapshot just
designed. For a change of pace, rather than describing these models in the context
of the retail store inventory, we’ll move up the value chain to discuss the inventory
located in the warehouses.
Inventory Transactions
A second way to model an inventory business process is to record every transac-
tion that affects inventory. Inventory transactions at the warehouse might include
the following:
Receive product.
Place product into inspection hold.
Release product from inspection hold.
Return product to vendor due to inspection failure.
Place product in bin.
Pick product from bin.
Package product for shipment.
Ship product to customer.
Receive product from customer.
Return product to inventory from customer return.
Remove product from inventory.
Each inventory transaction identifies the date, product, warehouse, vendor, trans-
action type, and in most cases, a single amount representing the inventory quantity
impact caused by the transaction. Assuming the granularity of the fact table is one
row per inventory transaction, the resulting schema is illustrated in Figure 4-4.
[Figure 4-4 diagram: the Warehouse Inventory Transaction Fact table has foreign keys to the Date, Product, Warehouse, and Inventory Transaction Type dimensions, a degenerate Inventory Transaction Number, and an Inventory Transaction Dollar Amount fact. The Inventory Transaction Type dimension carries a type description and type group; the Warehouse dimension carries the warehouse number (natural key), name, address, city, city-state, state, ZIP, zone, total square footage, and so on.]
Figure 4-4: Warehouse inventory transaction model.
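A minimal DDL sketch of this grain (one row per inventory transaction) might look like the following; the table and column names are illustrative only, and data types and constraint syntax vary by platform.

CREATE TABLE warehouse_inventory_transaction_fact (
    date_key                       INTEGER NOT NULL REFERENCES date_dim (date_key),
    product_key                    INTEGER NOT NULL REFERENCES product_dim (product_key),
    warehouse_key                  INTEGER NOT NULL REFERENCES warehouse_dim (warehouse_key),
    inventory_transaction_type_key INTEGER NOT NULL
        REFERENCES inventory_transaction_type_dim (inventory_transaction_type_key),
    inventory_transaction_number   BIGINT  NOT NULL,  -- degenerate dimension
    inventory_transaction_quantity INTEGER NOT NULL   -- signed quantity impact of the transaction
);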
Even though the transaction fact table is simple, it contains detailed information
that mirrors individual inventory manipulations. The transaction fact table is use-
ful for measuring the frequency and timing of specific transaction types to answer
questions that couldn’t be answered by the less granular periodic snapshot.
Even so, it is impractical to use the transaction fact table as the sole basis for ana-
lyzing inventory performance. Although it is theoretically possible to reconstruct the
exact inventory position at any moment in time by rolling all possible transactions
forward from a known inventory position, it is too cumbersome and impractical
for broad analytic questions that span dates, products, warehouses, or vendors.
NOTE Remember there’s more to life than transactions alone. Some form of a
snapshot table to give a more cumulative view of a process often complements
a transaction fact table.
Before leaving the transaction fact table, our example presumes each type of
transaction impacting inventory levels positively or negatively has consistent dimen-
sionality: date, product, warehouse, vendor, and transaction type. We recognize
some transaction types may have varied dimensionality in the real world. For
example, a shipper may be associated with the warehouse receipts and shipments;
customer information is likely associated with shipments and customer returns. If the
transactions’ dimensionality varies by event, then a series of related fact tables should
be designed rather than capturing all inventory transactions in a single fact table.
NOTE If performance measurements have different natural granularity or
dimensionality, they likely result from separate processes that should be modeled
as separate fact tables.
Inventory Accumulating Snapshot
The final inventory model is the accumulating snapshot. Accumulating snapshot fact tables are used for processes that have a definite beginning, definite end, and identifiable milestones in between. In this inventory model, one row is placed in the
fact table when a particular product is received at the warehouse. The disposition
of the product is tracked on this single fact row until it leaves the warehouse. In
this example, the accumulating snapshot model is only possible if you can reliably
distinguish products received in one shipment from those received at a later time;
it is also appropriate if you track product movement by product serial number or
lot number.
Now assume the inventory for a product lot is captured as a series of well-defined events or milestones as it moves through the warehouse, such as receiving,
inspection, bin placement, and shipping. As illustrated in Figure 4-5, the inventory
accumulating snapshot fact table with its multitude of dates and facts looks quite
di erent from the transaction or periodic snapshot schemas.
[Figure 4-5 diagram: the Inventory Receipt Accumulating Fact table carries a degenerate Product Lot Receipt Number; foreign keys to the Date Received, Date Inspected, Date Bin Placement, Date Initial Shipment, and Date Last Shipment role-playing date dimensions plus the Product, Warehouse, and Vendor dimensions; quantity facts for Received, Inspected, Returned to Vendor, Placed in Bin, Shipped to Customer, Returned by Customer, Returned to Inventory, and Damaged; and lag facts for Receipt to Inspected, Receipt to Bin Placement, Receipt to Initial Shipment, and Initial to Last Shipment.]
Figure 4-5: Warehouse inventory accumulating snapshot.
The accumulating snapshot fact table provides an updated status of the lot as it
moves through standard milestones represented by multiple date-valued foreign
keys. Each accumulating snapshot fact table row is updated repeatedly until the
products received in a lot are completely depleted from the warehouse, as shown
in Figure 4-6.
[Figure 4-6 diagram, showing selected columns of a single fact row at three points in time:

Fact row inserted when lot received: Lot Receipt Number 101, Date Received Key 20130101, Date Inspected Key 0, Date Bin Placement Key 0, Product Key 1, Quantity Received 100.

Fact row updated when lot inspected: Date Inspected Key becomes 20130103 and Receipt to Inspected Lag becomes 2.

Fact row updated when lot placed in bin: Date Bin Placement Key becomes 20130104 and Receipt to Bin Placement Lag becomes 3.]
Figure 4-6: Evolution of an accumulating snapshot fact row.
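In the ETL system, each milestone in Figure 4-6 typically translates into a keyed update of the existing fact row. The following is only a sketch, using hypothetical table and column names along with the hard-coded values from the figure.

-- Update the lot's single fact row when it passes inspection on January 3, 2013.
UPDATE inventory_receipt_accumulating_fact
SET    date_inspected_key       = 20130103,
       quantity_inspected       = 100,
       receipt_to_inspected_lag = 2
WHERE  product_lot_receipt_number = 101;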
Fact Table Types
There are just three fundamental types of fact tables: transaction, periodic snapshot,
and accumulating snapshot. Amazingly, this simple pattern holds true regardless
of the industry. All three types serve a useful purpose; you often need two comple-
mentary fact tables to get a complete picture of the business, yet the administration
and rhythm of the three fact tables are quite different. Figure 4-7 compares and
contrasts the variations.
[Figure 4-7 table, comparing the three fact table types:

Periodicity — Transaction: discrete transaction point in time. Periodic snapshot: recurring snapshots at regular, predictable intervals. Accumulating snapshot: indeterminate time span for evolving pipeline/workflow.

Grain — Transaction: 1 row per transaction or transaction line. Periodic snapshot: 1 row per snapshot period plus other dimensions. Accumulating snapshot: 1 row per pipeline occurrence.

Date dimension(s) — Transaction: transaction date. Periodic snapshot: snapshot date. Accumulating snapshot: multiple dates for pipeline’s key milestones.

Facts — Transaction: transaction performance. Periodic snapshot: cumulative performance for time interval. Accumulating snapshot: performance for pipeline occurrence.

Fact table sparsity — Transaction: sparse or dense, depending on activity. Periodic snapshot: predictably dense. Accumulating snapshot: sparse or dense, depending on pipeline occurrence.

Fact table updates — Transaction: no updates, unless error correction. Periodic snapshot: no updates, unless error correction. Accumulating snapshot: updated whenever pipeline activity occurs.]
Figure 4-7: Fact table type comparisons.
Transaction Fact Tables
The most fundamental view of the business’s operations is at the individual transac-
tion or transaction line level. These fact tables represent an event that occurred at
an instantaneous point in time. A row exists in the fact table for a given customer
or product only if a transaction event occurred. Conversely, a given customer or
product likely is linked to multiple rows in the fact table because hopefully the
customer or product is involved in more than one transaction.
Transaction data fits easily into a dimensional framework. Atomic transaction
data is the most naturally dimensional data, enabling you to analyze behavior in
extreme detail. After a transaction has been posted in the fact table, you typically
don’t revisit it.
Having made a solid case for the charm of transaction detail, you may be think-
ing that all you need is a big, fast server to handle the gory transaction minutiae,
and your job is over. Unfortunately, even with transaction level data, there are busi-
ness questions that are impractical to answer using only these details. As indicated
earlier, you cannot survive on transactions alone.
Periodic Snapshot Fact Tables
Periodic snapshots are needed to see the cumulative performance of the business
at regular, predictable time intervals. Unlike the transaction fact table where a row
is loaded for each event occurrence, with the periodic snapshot, you take a picture
(hence the snapshot terminology) of the activity at the end of a day, week, or month,
then another picture at the end of the next period, and so on. The periodic snap-
shots are stacked consecutively into the fact table. The periodic snapshot fact table
often is the only place to easily retrieve a regular, predictable view of longitudinal
performance trends.
When transactions equate to little pieces of revenue, you can move easily from
individual transactions to a daily snapshot merely by adding up the transactions.
In this situation, the periodic snapshot represents an aggregation of the transac-
tional activity that occurred during a time period; you would build the snapshot
only if needed for performance reasons. The design of the snapshot table is closely
related to the design of its companion transaction table in this case. The fact tables
share many dimension tables; the snapshot usually has fewer dimensions overall.
Conversely, there are usually more facts in a summarized periodic snapshot table
than in a transactional table because any activity that happens during the period
is fair game for a metric in a periodic snapshot.
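When the snapshot truly is just a summarization of transactional activity, one day’s snapshot rows can be derived with a simple aggregation such as the following sketch; the table names, column names, and hard-coded date key are hypothetical.

-- Build one day's periodic snapshot rows by summing that day's transactions.
INSERT INTO daily_sales_snapshot_fact
    (date_key, product_key, store_key, sales_quantity, sales_dollar_amount)
SELECT date_key,
       product_key,
       store_key,
       SUM(sales_quantity),
       SUM(extended_sales_dollar_amount)
FROM   retail_sales_transaction_fact
WHERE  date_key = 20130101
GROUP BY date_key, product_key, store_key;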
In many businesses, however, transaction details are not easily summarized to
present management performance metrics. As you saw in this inventory case study,
crawling through the transactions would be extremely time-consuming, plus the
logic required to interpret the effect of different kinds of transactions on inventory
levels could be horrendously complicated, presuming you even have access to the
required historical data. The periodic snapshot again comes to the rescue to provide
management with a quick, flexible view of inventory levels. Hopefully, the data for
this snapshot schema is sourced directly from an operational system that handles
these complex calculations. If not, the ETL system must also implement this com-
plex logic to correctly interpret the impact of each transaction type.
Accumulating Snapshot Fact Tables
Last, but not least, the third type of fact table is the accumulating snapshot. Although
perhaps not as common as the other two fact table types, accumulating snapshots
can be very insightful. Accumulating snapshots represent processes that have a
definite beginning and definite end together with a standard set of intermediate process steps. Accumulating snapshots are most appropriate when business users want to perform workflow or pipeline analysis.
Accumulating snapshots always have multiple date foreign keys, representing the
predictable major events or process milestones; sometimes there’s an additional date
column that indicates when the snapshot row was last updated. As we’ll discuss in
Chapter 6: Order Management, these dates are each handled by a role-playing date
dimension. Because most of these dates are not known when the fact row is first loaded, a default surrogate date key is used for the undefined dates.
Lags Between Milestones and Milestone Counts
Because accumulating snapshots often represent the efficiency and elapsed time of a workflow or pipeline, the fact table typically contains metrics representing the durations or lags between key milestones. It would be difficult to answer duration questions using a transaction fact table because you would need to correlate rows to calculate time lapses. Sometimes the lag metrics are simply the raw difference between the milestone dates or date/time stamps. In other situations, the lag calculation is made more complicated by taking workdays and holidays into consideration. Accumulating snapshot fact tables sometimes include milestone completion counters, valued as either 0 or 1. Finally, accumulating snapshots often have a foreign key to a status dimension, which is updated to reflect the pipeline’s latest status.
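If the lag is the raw difference between two milestone dates, the ETL system can populate it when the later milestone arrives. The following is a sketch with hypothetical names; it assumes key 0 is the default surrogate for dates that have not yet occurred, ignores workday calendars, and uses date arithmetic and UPDATE syntax that varies somewhat by platform.

-- Populate the receipt-to-inspection lag once both milestone dates are known.
UPDATE inventory_receipt_accumulating_fact f
SET    receipt_to_inspected_lag =
         (SELECT di.full_date - dr.full_date
          FROM   date_dim dr
          JOIN   date_dim di ON di.date_key = f.date_inspected_key
          WHERE  dr.date_key = f.date_received_key)
WHERE  f.date_inspected_key <> 0            -- 0 = "hasn't happened yet" surrogate
  AND  f.receipt_to_inspected_lag IS NULL;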
Accumulating Snapshot Updates and OLAP Cubes
In sharp contrast to the other fact table types, you purposely revisit accumulating
snapshot fact table rows to update them. Unlike the periodic snapshot where the
prior snapshots are preserved, the accumulating snapshot merely reflects the most
current status and metrics. Accumulating snapshots do not attempt to accommodate
complex scenarios that occur infrequently. The analysis of these outliers can always
be done with the transaction fact table.
It is worth noting that accumulating snapshots are typically problematic for
OLAP cubes. Because updates to an accumulating snapshot force both facts and
dimension foreign keys to change, much of the cube would need to be reprocessed
with updates to these snapshots, unless the fact row is only loaded once the pipeline
occurrence is complete.
Complementary Fact Table Types
Sometimes accumulating and periodic snapshots work in conjunction with one
another, such as when you incrementally build the monthly snapshot by adding the
e ect of each day’s transactions to a rolling accumulating snapshot while also storing
36 months of historical data in a periodic snapshot. Ideally, when the last day of the
month has been reached, the accumulating snapshot simply becomes the new regular
month in the time series, and a new accumulating snapshot is started the next day.
Transactions and snapshots are the yin and yang of dimensional designs. Used
together, companion transaction and snapshot fact tables provide a complete view
of the business. Both are needed because there is often no simple way to combine
these two contrasting perspectives in a single fact table. Although there is some
theoretical data redundancy between transaction and snapshot tables, you don’t
object to such redundancy because as DW/BI publishers, your mission is to publish
data so that the organization can effectively analyze it. These separate types of fact tables each provide different vantage points on the same story. Amazingly, these
three types of fact tables turn out to be all the fact table types needed for the use
cases described in this book.
Value Chain Integration
Now that we’ve completed the design of three inventory models, let’s revisit our ear-
lier discussion about the retailer’s value chain. Both business and IT organizations
are typically interested in value chain integration. Business management needs to
look across the business’s processes to better evaluate performance. For example,
numerous DW/BI projects have focused on better understanding customer behavior
from an end-to-end perspective. Obviously, this requires the ability to consistently
look at customer information across processes, such as quotes, orders, invoicing,
payments, and customer service. Similarly, organizations want to analyze their
products across processes, or their employees, students, vendors, and so on.
IT managers recognize integration is needed to deliver on the promises of data
warehousing and business intelligence. Many consider it their fiduciary responsibility to manage the organization’s information assets. They know they’re not fulfilling their responsibilities if they allow standalone, nonintegrated databases to proliferate. In addition to addressing the business’s needs, IT also benefits from integration because it allows the organization to better leverage scarce resources and gain efficiencies through the use of reusable components.
Fortunately, the senior managers who typically are most interested in integration
also have the necessary organizational influence and economic willpower to make
it happen. If they don’t place a high value on integration, you face a much more
serious organizational challenge, or put more bluntly, your integration project will
probably fail. It shouldn’t be the sole responsibility of the DW/BI manager to garner
organizational consensus for integration across the value chain. The political sup-
port of senior management is important; it takes the DW/BI manager off the hook
and places the burden on senior leadership’s shoulders where it belongs.
In Chapters 3 and 4, we modeled data from several processes of the retailer’s value
chain. Although separate fact tables in separate dimensional schemas represent the
data from each process, the models share several common business dimensions:
date, product, and store. We’ve logically represented this dimension sharing in
Figure 4-8. Using shared, common dimensions is absolutely critical to designing
dimensional models that can be integrated.
[Figure 4-8 diagram: the Retail Sales Transaction Facts, Retail Inventory Snapshot Facts, and Warehouse Inventory Transaction Facts all share the Date and Product dimensions; the two retail fact tables also share the Store dimension, retail sales additionally connects to the Promotion dimension, and the warehouse facts connect to the Warehouse dimension.]
Figure 4-8: Sharing dimensions among business processes.
Enterprise Data Warehouse Bus Architecture
Obviously, building the enterprise’s DW/BI system in one galactic effort is too daunt-
ing, yet building it as isolated pieces defeats the overriding goal of consistency. For
long-term DW/BI success, you need to use an architected, incremental approach to
build the enterprise’s warehouse. The approach we advocate is the enterprise data
warehouse bus architecture.
Understanding the Bus Architecture
Contrary to popular belief, the word bus is not shorthand for business; it’s an old
term from the electrical power industry that is now used in the computer industry.
A bus is a common structure to which everything connects and from which every-
thing derives power. The bus in a computer is a standard interface specification
that enables you to plug in a disk drive, DVD, or any number of other specialized
cards or devices. Because of the computer’s bus standard, these peripheral devices
work together and usefully coexist, even though they were manufactured at different times by different vendors.
NOTE By defining a standard bus interface for the DW/BI environment, separate dimensional models can be implemented by different groups at different times.
The separate business process subject areas plug together and usefully coexist if
they adhere to the standard.
If you refer back to the value chain diagram in Figure 4-1, you can envision many
business processes plugging into the enterprise data warehouse bus, as illustrated
in Figure 4-9. Ultimately, all the processes of an organization’s value chain create
a family of dimensional models that share a comprehensive set of common, con-
formed dimensions.
[Figure 4-9 diagram: business process rows such as Store Sales, Store Inventory, and Purchase Orders plug into a bus of shared dimensions: Date, Product, Store, Promotion, Warehouse, Vendor, and Shipper.]
Figure 4-9: Enterprise data warehouse bus with shared dimensions.
The enterprise data warehouse bus architecture provides a rational approach to
decomposing the enterprise DW/BI planning task. The master suite of standard-
ized dimensions and facts has a uniform interpretation across the enterprise. This
establishes the data architecture framework. You can then tackle the implementation
of separate process-centric dimensional models, with each implementation closely
adhering to the architecture. As the separate dimensional models become available,
they fit together like the pieces of a puzzle. At some point, enough dimensional models
exist to make good on the promise of an integrated enterprise DW/BI environment.
The bus architecture enables DW/BI managers to get the best of both worlds.
They have an architectural framework guiding the overall design, but the problem
has been divided into bite-sized business process chunks that can be implemented
in realistic time frames. Separate development teams follow the architecture while
working fairly independently and asynchronously.
The bus architecture is independent of technology and database platforms. All
flavors of relational and OLAP-based dimensional models can be full participants
in the enterprise data warehouse bus if they are designed around conformed dimen-
sions and facts. DW/BI systems inevitably consist of separate machines with different
operating systems and database management systems. Designed coherently, they
share a common architecture of conformed dimensions and facts, allowing them
to be fused into an integrated whole.
Enterprise Data Warehouse Bus Matrix
We recommend using an enterprise data warehouse bus matrix to document and com-
municate the bus architecture, as illustrated in Figure 4-10. Others have renamed the
bus matrix, such as the conformance or event matrix, but these are merely synonyms
for this fundamental Kimball concept first introduced in the 1990s.
[Figure 4-10 matrix: business process rows (Issue Purchase Orders, Receive Warehouse Deliveries, Warehouse Inventory, Receive Store Deliveries, Store Inventory, Retail Sales, Retail Sales Forecast, Retail Promotion Tracking, Customer Returns, Returns to Vendor, Frequent Shopper Sign-Ups) are crossed with common dimension columns (Date, Product, Store, Promotion, Warehouse, Employee, Customer), with an X marking each dimension that participates in each process.]
Figure 4-10: Sample enterprise data warehouse bus matrix for a retailer.
Working in a tabular fashion, the organization’s business processes are represented as matrix rows. It is important to remember you are identifying business processes, not the organization’s business departments. The matrix rows translate into dimensional models representing the organization’s primary activities and events, which are often recognizable by their operational source. When it’s time to tackle a DW/BI development project, start with a single business process matrix row because that minimizes the risk of signing up for an overly ambitious implementation. Most implementation risk comes from biting off too much ETL system design
and development. Focusing on the results of a single process, often captured by a
single underlying source system, reduces the ETL development risk.
After individual business processes are enumerated, you sometimes identify more
complex consolidated processes. Although dimensional models that cross processes
can be immensely beneficial in terms of both query performance and ease of use, they are typically more difficult to implement because the ETL effort grows with each additional major source integrated into a single dimensional model. It is prudent to focus on the individual processes as building blocks before tackling the task of consolidating. Profitability is a classic example of a consolidated process in which separate revenue and cost factors are combined from different processes to provide a complete view of profitability. Although a granular profitability dimensional model is exciting, it is definitely not the first dimensional model you should attempt to
implement; you could easily drown while trying to wrangle all the revenue and
cost components.
The columns of the bus matrix represent the common dimensions used across
the enterprise. It is often helpful to create a list of core dimensions before filling
in the matrix to assess whether a given dimension should be associated with a busi-
ness process. The number of bus matrix rows and columns varies by organization.
For many, the matrix is surprisingly square with approximately 25 to 50 rows and
a comparable number of columns. In other industries, like insurance, there tend to
be more columns than rows.
After the core processes and dimensions are identified, you shade or “X” the
matrix cells to indicate which columns are related to each row. Presto! You can
immediately see the logical relationships and interplay between the organization’s
conformed dimensions and key business processes.
Multiple Matrix Uses
Creating the enterprise data warehouse bus matrix is one of the most important
DW/BI implementation deliverables. It is a hybrid resource that serves multiple
purposes, including architecture planning, database design, data governance
coordination, project estimating, and organizational communication.
Although it is relatively straightforward to lay out the rows and columns, the
enterprise bus matrix defines the overall data architecture for the DW/BI system.
The matrix delivers the big picture perspective, regardless of database or technol-
ogy preferences.
The matrix’s columns address the demands of master data management and
data integration head-on. As core dimensions participating in multiple dimensional
models are defined by folks with data governance responsibilities and built by the
DW/BI team, you can envision their use across processes rather than designing in
a vacuum based on the needs of a single process, or even worse, a single depart-
ment. Shared dimensions supply potent integration glue, allowing the business to
drill across processes.
Each business process-centric implementation project incrementally builds out
the overall architecture. Multiple development teams can work on components
of the matrix independently and asynchronously, with confidence they’ll fit together.
Project managers can look across the process rows to see the dimensionality of
each dimensional model at a glance. This vantage point is useful as they’re gauging
the magnitude of the project’s e ort. A project focused on a business process with
fewer dimensions usually requires less e ort, especially if the politically charged
dimensions are already sitting on the shelf.
The matrix enables you to communicate effectively within and across data
governance and DW/BI teams. Even more important, you can use the matrix to
communicate upward and outward throughout the organization. The matrix is a
succinct deliverable that visually conveys the master plan. IT management needs
to understand this perspective to coordinate across project teams and resist the
organizational urge to deploy more departmental solutions quickly. IT management
must also ensure that distributed DW/BI development teams are committed to the
bus architecture. Business management needs to also appreciate the holistic plan;
you want them to understand the staging of the DW/BI rollout by business process.
In addition, the matrix illustrates the importance of identifying experts from the
business to serve as data governance leaders for the common dimensions. It is a
tribute to its simplicity that the matrix can be used effectively to communicate
with developers, architects, modelers, and project managers, as well as senior IT
and business management.
Opportunity/Stakeholder Matrix
You can draft a di erent matrix that leverages the same business process rows,
but replaces the dimension columns with business functions, such as merchandis-
ing, marketing, store operations, and fi nance. Based on each function’s requirements,
the matrix cells are shaded to indicate which business functions are interested in
which business processes (and projects), as illustrated in Figure 4-11’s opportunity/
stakeholder matrix variation. It also identifies which groups need to be invited to the detailed requirements, dimensional modeling, and BI application specification
parties after a process-centric row is queued up as a project.
[Figure 4-11 matrix: the same business process rows (Issue Purchase Orders, Receive Warehouse Deliveries, Warehouse Inventory, Receive Store Deliveries, Store Inventory, Retail Sales, Retail Sales Forecast, Retail Promotion Tracking, Customer Returns, Returns to Vendor, Frequent Shopper Sign-Ups) are crossed with stakeholder columns (Merchandising, Marketing, Store Operations, Logistics, Finance), with an X marking each business function interested in each process.]
Figure 4-11: Opportunity/stakeholder matrix.
Common Bus Matrix Mistakes
When drafting a bus matrix, people sometimes struggle with the level of detail
expressed by each row, resulting in the following missteps:
Departmental or overly encompassing rows. The matrix rows shouldn’t cor-
respond to the boxes on a corporate organization chart representing functional
groups. Some departments may be responsible or acutely interested in a single
business process, but the matrix rows shouldn’t look like a list of the CEO’s
direct reports.
Report-centric or too narrowly defined rows. At the opposite extreme, the
bus matrix shouldn’t resemble a laundry list of requested reports. A single
business process supports numerous analyses; the matrix row should refer-
ence the business process, not the derivative reports or analytics.
When defining the matrix columns, architects naturally fall into the similar traps of defining columns that are either too broad or too narrow:
Overly generalized columns. A “person” column on the bus matrix may refer
to a wide variety of people, from internal employees to external suppliers
and customer contacts. Because there’s virtually zero overlap between these
populations, it adds confusion to lump them into a single, generic dimension.
Similarly, it’s not benefi cial to put internal and external addresses referring
to corporate facilities, employee addresses, and customer sites into a generic
location column in the matrix.
Separate columns for each level of a hierarchy. The columns of the bus
matrix should refer to dimensions at their most granular level. Some
business process rows may require an aggregated version of the detailed
dimension, such as inventory snapshot metrics at the weekly level. Rather
than creating separate matrix columns for each level of the calendar hierarchy,
use a single column for dates. To express levels of detail above a daily grain,
you can denote the granularity within the matrix cell; alternatively, you can
subdivide the date column to indicate the hierarchical level associated with
each business process row. It’s important to retain the overarching identification of common dimensions deployed at different levels of granularity. Some
industry pundits advocate matrices that treat every dimension table attribute
as a separate, independent column; this defeats the concept of dimensions
and results in a completely unruly matrix.
Retrofitting Existing Models to a Bus Matrix
It is unacceptable to build separate dimensional models that ignore a framework
tying them together. Isolated, independent dimensional models are worse than
simply a lost opportunity for analysis. They deliver access to irreconcilable views
of the organization and further enshrine the reports that cannot be compared with
one another. Independent dimensional models become legacy implementations
in their own right; by their existence, they block the development of a coherent
DW/BI environment.
So what happens if you’re not starting with a blank slate? Perhaps several dimen-
sional models have been constructed without regard to an architecture using
conformed dimensions. Can you rescue your stovepipes and convert them to the
bus architecture? To answer this question, you should start first with an honest
appraisal of your existing non-integrated dimensional structures. This typically
entails meetings with the separate teams (including the clandestine pseudo IT
teams within business organizations) to determine the gap between the current
environment and the organization’s architected goal. When the gap is understood,
you need to develop an incremental plan to convert the standalone dimensional
models to the enterprise architecture. The plan needs to be internally sold. Senior
IT and business management must understand the current state of data chaos, the
risks of doing nothing, and the benefits of moving forward according to your game plan. Management also needs to appreciate that the conversion will require a significant commitment of support, resources, and funding.
If an existing dimensional model is based on a sound dimensional design, per-
haps you can map an existing dimension to a standardized version. The original
dimension table would be rebuilt using a cross-reference map. Likewise, the fact
table would need to be reprocessed to replace the original dimension keys with the
conformed dimension keys. Of course, if the original and conformed dimension
tables contain di erent attributes, rework of the preexisting BI applications and
queries is inevitable.
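The key substitution described here is usually a bulk ETL operation driven by a cross-reference map; the following is a rough sketch with hypothetical table names (the UPDATE ... FROM syntax shown is not supported identically on every platform).

-- Swap the original product keys for conformed keys using a cross-reference map
-- built while rebuilding the dimension table.
UPDATE legacy_sales_fact f
SET    product_key = x.conformed_product_key
FROM   product_key_xref x
WHERE  x.original_product_key = f.product_key;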
More typically, existing dimensional models are riddled with dimensional model-
ing errors beyond the lack of adherence to standardized dimensions. In some cases,
the stovepipe dimensional model has outlived its useful life. Isolated dimensional
models often are built for a specific functional area. When others try to leverage the data, they typically discover that the dimensional model was implemented at an inappropriate level of granularity and is missing key dimensionality. The effort required to retrofit these dimensional models into the enterprise DW/BI architecture may exceed the effort to start over from scratch. As difficult as it is to admit,
stovepipe dimensional models often have to be shut down and rebuilt in the proper
bus architecture framework.
Conformed Dimensions
Now that you understand the importance of the enterprise bus architecture, let’s fur-
ther explore the standardized conformed dimensions that serve as the cornerstone
of the bus because they’re shared across business process fact tables. Conformed
dimensions go by many other aliases: common dimensions, master dimensions, ref-
erence dimensions, and shared dimensions. Conformed dimensions should be built
once in the ETL system and then replicated either logically or physically throughout
the enterprise DW/BI environment. When built, it’s extremely important that the
DW/BI development teams take the pledge to use these dimensions. It’s a policy
decision that is critical to making the enterprise DW/BI system function; their usage
should be mandated by the organization’s CIO.
Drilling Across Fact Tables
In addition to consistency and reusability, conformed dimensions enable you to com-
bine performance measurements from di erent business processes in a single report,
as illustrated in Figure 4-12. You can use multipass SQL to query each dimensional
model separately and then outer-join the query results based on a common dimen-
sion attribute, such as Figure 4-12’s product name. The full outer-join ensures all
rows are included in the combined report, even if they only appear in one set of
query results. This linkage, often referred to as drill across, is straightforward if the
dimension table attribute values are identical.
[Figure 4-12 report: product descriptions (Baked Well Sourdough, Fluffy Light Sliced White, Fluffy Sliced Whole Wheat) are listed alongside Open Orders Qty, Inventory Qty, and Sales Qty columns, with each quantity column drawn from a different fact table.]
Figure 4-12: Drilling across fact tables with conformed dimension attributes.
Drilling across is supported by many BI products and platforms. Their implemen-
tations differ on whether the results are joined in temporary tables, the application server, or the report. The vendors also use different terms to describe this technique, including multipass, multi-select, multi-fact, or stitch queries. Because metrics from different fact tables are brought together with a drill-across query, often any cross-
fact calculations must be done in the BI application after the separate conformed
results have been returned.
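As a rough illustration of the drill-across technique, the two passes can be expressed as separate aggregated queries that are then full outer joined on the conformed product description; all table and column names here are hypothetical.

-- Pass 1 and pass 2 query separate fact tables at the same (product) grain,
-- then the result sets are stitched together on the conformed attribute.
WITH sales AS (
    SELECT p.product_description, SUM(f.sales_quantity) AS sales_qty
    FROM   retail_sales_transaction_fact f
    JOIN   product_dim p ON p.product_key = f.product_key
    GROUP BY p.product_description
),
inventory AS (
    SELECT p.product_description, SUM(f.quantity_on_hand) AS inventory_qty
    FROM   store_inventory_snapshot_fact f
    JOIN   product_dim p ON p.product_key = f.product_key
    GROUP BY p.product_description
)
SELECT COALESCE(s.product_description, i.product_description) AS product_description,
       s.sales_qty,
       i.inventory_qty
FROM   sales s
FULL OUTER JOIN inventory i ON i.product_description = s.product_description;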
Conformed dimensions come in several different flavors, as described in the
following sections.
Identical Conformed Dimensions
At the most basic level, conformed dimensions mean the same thing with every pos-
sible fact table to which they are joined. The date dimension table connected to the
sales facts is identical to the date dimension table connected to the inventory facts.
Identical conformed dimensions have consistent dimension keys, attribute column
names, attribute definitions, and attribute values (which translate into consistent report labels and groupings). Dimension attributes don’t conform if they’re called Month in one dimension and Month Name in another; likewise, they don’t conform
if the attribute value is “July” in one dimension and “JULY” in another. Identical
conformed dimensions in two dimensional models may be the same physical table
within the database. However, given the typical complexity of the DW/BI system’s
technical environment with multiple database platforms, it is more likely that the
dimension is built once in the ETL system and then duplicated synchronously out-
ward to each dimensional model. In either case, the conformed date dimensions in
both dimensional models have the same number of rows, same key values, same
attribute labels, same attribute data definitions, and same attribute values. Attribute
column names should be uniquely labeled across dimensions.
Most conformed dimensions are defined naturally at the most granular level
possible. The product dimension’s grain will be the individual product; the date
dimension’s grain will be the individual day. However, sometimes dimensions at the
same level of granularity do not fully conform. For example, there might be product
and store attributes needed for inventory analysis, but they aren’t appropriate for
analyzing retail sales data. The dimension tables still conform if the keys and com-
mon columns are identical, but the supplemental attributes used by the inventory
schema are not conformed. It is physically impossible to drill across processes using
these add-on attributes.
Shrunken Rollup Conformed Dimension
with Attribute Subset
Dimensions also conform when they contain a subset of attributes from a more
granular dimension. Shrunken rollup dimensions are required when a fact table
captures performance metrics at a higher level of granularity than the atomic
base dimension. This would be the case if you had a weekly inventory snapshot in
addition to the daily snapshot. In other situations, facts are generated by another
business process at a higher level of granularity. For example, the retail sales pro-
cess captures data at the atomic product level, whereas forecasting generates data
at the brand level. You couldn’t share a single product dimension table across the
two business process schemas because the granularity is different. The product
and brand dimensions still conform if the brand table attributes are a strict subset
of the atomic product table’s attributes. Attributes that are common to both the
detailed and rolled-up dimension tables, such as the brand and category descrip-
tions, should be labeled, defined, and identically valued in both tables, as illustrated
in Figure 4-13. However, the primary keys of the detailed and rollup dimension
tables are separate.
NOTE Shrunken rollup dimensions conform to the base atomic dimension if
the attributes are a strict subset of the atomic dimension’s attributes.
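One way to guarantee this conformance is to derive the shrunken rollup dimension directly from the atomic dimension in the ETL system rather than building it independently; the following is a sketch with hypothetical names, assuming the brand surrogate key is assigned separately (for example, by an identity column).

-- Derive the brand dimension from the atomic product dimension so the shared
-- attributes are identically labeled and valued in both tables.
INSERT INTO brand_dim (brand_description, subcategory_description,
                       category_description, department_description)
SELECT DISTINCT brand_description,
                subcategory_description,
                category_description,
                department_description
FROM   product_dim;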
Shrunken Conformed Dimension with Row Subset
Another case of conformed dimension subsetting occurs when two dimensions are
at the same level of detail, but one represents only a subset of rows. For example, a
corporate product dimension contains rows for the full portfolio of products across
multiple disparate lines of business, as illustrated in Figure 4-14. Analysts in the
separate businesses may want to view only their subset of the corporate dimension,
restricted to the product rows for their business. By using a subset of rows, they
aren’t encumbered with the corporation’s entire product set. Of course, the fact table
joined to this subsetted dimension must be limited to the same subset of products.
If a user attempts to use a shrunken subset dimension while accessing a fact table
consisting of the complete product set, they may encounter unexpected query results
because referential integrity would be violated. You need to be cognizant of the
potential opportunity for user confusion or error with dimension row subsetting.
We will further elaborate on dimension subsets when we discuss supertype and
subtype dimensions in Chapter 10: Financial Services.
[Figure 4-13 diagram: the Product dimension (Product Key, Product Description, SKU Number natural key, Brand Description, Subcategory Description, Category Description, Department Description, Package Type Description, Package Size, Fat Content Description, Diet Type Description, Weight, Weight Units of Measure, and so on) conforms to a shrunken Brand dimension (Brand Key, Brand Description, Subcategory Description, Category Description, Department Description). Likewise, the Date dimension (Date Key, Date, Full Date Description, Day of Week, Day Number in Month, Calendar Month Name, Calendar Month Number, Calendar YYYY-MM, Calendar Year, Fiscal Week, Fiscal Month, and so on) conforms to a shrunken Month dimension (Month Key, Calendar Month Name, Calendar Month Number, Calendar YYYY-MM, Calendar Year).]
Figure 4-13: Conforming shrunken rollup dimensions.
[Figure 4-14 diagram: the Corporate Product dimension is subset into Appliance Products and Apparel Products rows at the same grain; drilling across requires common conformed attributes.]
Figure 4-14: Conforming dimension subsets at the same granularity.
Conformed date and month dimensions are a unique example of both row
and column dimension subsetting. Obviously, you can’t simply use the same date
dimension table for daily and monthly fact tables because of the difference in rollup
granularity. However, the month dimension may consist of the month-end daily
date table rows with the exclusion of all columns that don’t apply at the monthly
granularity, such as the weekday/weekend indicator, week ending date, holiday
indicator, day number within year, and others. Sometimes a month-end indicator on
the daily date dimension is used to facilitate creation of this month dimension table.
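A sketch of that derivation follows, assuming a hypothetical month_end_indicator column on the daily date dimension and a separately assigned month surrogate key.

-- Keep only month-end rows and only the columns meaningful at a monthly grain.
INSERT INTO month_dim (calendar_month_name, calendar_month_number,
                       calendar_yyyy_mm, calendar_year)
SELECT calendar_month_name,
       calendar_month_number,
       calendar_yyyy_mm,
       calendar_year
FROM   date_dim
WHERE  month_end_indicator = 'Month End';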
Shrunken Conformed Dimensions on the Bus Matrix
The bus matrix identifies the reuse of common dimensions across business processes.
Typically, the shaded cells of the matrix indicate that the atomic dimension is
associated with a given process. When shrunken rollup or subset dimensions are
involved, you want to reinforce their conformance with the atomic dimensions.
Therefore, you don’t want to create a new, unrelated column on the bus matrix.
Instead, there are two viable approaches to represent the shrunken dimensions within
the matrix, as illustrated in Figure 4-15:
Mark the cell for the atomic dimension, but then textually document the
rollup or row subset granularity within the cell.
Subdivide the dimension column to indicate the common rollup or subset
granularities, such as day and month if processes collect data at both of these
grains.
[Figure 4-15 matrix fragments: process rows (Issue Purchase Orders, Receive Deliveries, Inventory, Retail Sales, Retail Sales Forecast) are shown against a Date column, either with the grain noted inside the cell (for example, “Month” for the forecast row) or with the Date column subdivided into Day and Month subcolumns.]
Figure 4-15: Alternatives for identifying shrunken dimensions on the bus matrix.
Limited Conformity
Now that we’ve preached about the importance of conformed dimensions, we’ll
discuss the situation in which it may not be realistic or necessary to establish con-
formed dimensions for the organization. If a conglomerate has subsidiaries spanning
widely varied industries, there may be little point in trying to integrate. If each line
of business has unique customers and unique products and there’s no interest in
cross-selling across lines, it may not make sense to attempt an enterprise archi-
tecture because there likely isn’t much perceived business value. The willingness to seek a common definition for product, customer, or other core dimensions is a major litmus test for an organization theoretically intent on building an enterprise DW/BI system. If the organization is unwilling to agree on common definitions, the
organization shouldn’t attempt to build an enterprise DW/BI environment. It would
be better to build separate, self-contained data warehouses for each subsidiary. But
then don’t complain when someone asks for “enterprise performance” without going
through this logic.
Although organizations may find it difficult to combine data across disparate lines
of business, some degree of integration is typically an ultimate goal. Rather than
throwing your hands in the air and declaring it can’t possibly be done, you should
start down the path toward conformity. Perhaps there are a handful of attributes that
can be conformed across lines of business. Even if it is merely a product description,
category, and line of business attribute that is common to all businesses, this least-
common-denominator approach is still a step in the right direction. You don’t need
to get everyone to agree on everything related to a dimension before proceeding.
Importance of Data Governance and Stewardship
We’ve touted the importance of conformed dimensions, but we also need to acknowl-
edge a key challenge: reaching enterprise consensus on dimension attribute names
and contents (and the handling of content changes which we’ll discuss in Chapter 5:
Procurement). In many organizations, business rules and data definitions have
traditionally been established departmentally. The consequences of this commonly
encountered lack of data governance and control are the ubiquitous departmental
data silos that perpetuate similar but slightly different versions of the truth. Business
and IT management need to recognize the importance of addressing this shortfall
if you stand any chance of bringing order to the chaos; if management is reluctant
to drive change, the project will never achieve its goals.
Once the data governance issues and opportunities are acknowledged by senior
leadership, resources need to be identified to spearhead the effort. IT is often tempted
to try leading the charge. They are frustrated by the isolated projects re-creating
data around the organization, consuming countless IT and outside resources while
delivering inconsistent solutions that ultimately just increase the complexity of
the organization’s data architecture at signi cant cost. Although IT can facilitate
the defi nition of conformed dimensions, it is seldom successful as the sole driver,
even if it’s a temporary assignment. IT simply lacks the organizational authority to
make things happen.
Business-Driven Governance
To boost the likelihood of business acceptance, subject matter experts from the
business need to lead the initiative. Leading a cross-organizational governance
program is not for the faint of heart. The governance resources identified by busi-
ness leadership should have the following characteristics:
Respect from the organization
Broad knowledge of the enterprise’s operations
Ability to balance organizational needs against departmental requirements
Gravitas and authority to challenge the status quo and enforce policies
Strong communication skills
Politically savvy negotiation and consensus building skills
Clearly, not everyone is cut out for the job! Typically those tapped to spearhead
the governance program are highly valued and in demand. It takes the right skills,
experience, and confidence to rationalize diverse business perspectives and drive
the design of common reference data, together with the necessary organizational
compromises. Over the years, some have criticized conformed dimensions as being
too hard. Yes, it’s di cult to get people in di erent corners of the business to agree
on common attribute names, de nitions, and values, but that’s the crux of unifi ed,
integrated data. If everyone demands their own labels and business rules, there’s
no chance of delivering on the promises made to establish a single version of the
truth. The data governance program is critical in facilitating a culture shift away
from the typical siloed environment in which each department retains control of
their data and analytics to one where information is shared and leveraged across
the organization.
Governance Objectives
One of the key objectives of the data governance function is to reach agreement on
data definitions, labels, and domain values so that everyone is speaking the same language. Otherwise, the same words may describe different things; different words may describe the same thing; and the same value may have different meanings. Establishing common master data is often a politically charged issue; the challenges are cultural and geopolitical rather than technical. Defining a foundation of master descriptive conformed dimensions requires effort. But after it’s agreed upon, subsequent DW/BI efforts can leverage the work, both ensuring consistency and reducing the implementation’s delivery cycle time.
In addition to tackling data definitions and contents, the data governance func-
tion also establishes policies and responsibilities for data quality and accuracy, as
well as data security and access controls.
Historically, DW/BI teams created the “recipes” for conformed dimensions and
managed the data cleansing and integration mapping in the ETL system; the opera-
tional systems focused on accurately capturing performance metrics, but there was
often little effort to ensure consistent common reference data. Enterprise resource planning (ERP) systems promised to fill the void, but many organizations still
rely on separate best-of-breed point solutions for niche requirements. Recently,
operational master data management (MDM) solutions have addressed the need
for centralized master data at the source where the transactions are captured.
Although technology can encourage data integration, it doesn’t fix the problem.
A strong data governance function is a necessary prerequisite for conforming infor-
mation regardless of technical approach.
Conformed Dimensions and the Agile Movement
Some lament that although they want to deliver and share consistently defined master conformed dimensions in their DW/BI environments, it’s “just not feasible.”
They explain they would if they could, but with senior management focused on
using agile development techniques, it’s “impossible” to take the time to get organi-
zational agreement on conformed dimensions. You can turn this argument upside
down by challenging that conformed dimensions enable agile DW/BI development,
along with agile decision making.
Conformed dimensions allow a dimension table to be built and maintained once
rather than re-creating slightly different versions during each development cycle. Reusing conformed dimensions across projects is where you get the leverage for more agile DW/BI development. As you flesh out the portfolio of master conformed
dimensions, the development crank starts turning faster and faster. The time-to-
market for a new business process data source shrinks as developers reuse existing
conformed dimensions. Ultimately, new ETL development focuses almost exclusively
on delivering more fact tables because the associated dimension tables are already
sitting on the shelf ready to go.
Defining a conformed dimension requires organizational consensus and com-
mitment to data stewardship. But you don’t need to get everyone to agree on every
attribute in every dimension table. At a minimum, you should identify a subset
of attributes that have significance across the enterprise. These commonly referenced
descriptive characteristics become the starter set of conformed attributes, enabling
drill-across integration. Even just a single attribute, such as enterprise product
category, is a viable starting point for the integration effort. Over time, you can
iteratively expand from this minimalist starting point by adding attributes. These
dimensions could be tackled during architectural agile sprints. When a series of
sprint deliverables combine to deliver sufficient value, they constitute a release to
the business users.
If you fail to focus on conformed dimensions because you’re under pressure to
deliver something yesterday, the departmental analytic data silos will likely have
inconsistent categorizations and labels. Even more troubling, data sets may look
like they can be compared and integrated due to similar labels, but the underlying
business rules may be slightly di erent. Business users waste inordinate amounts
of time trying to reconcile and resolve these data inconsistencies, which negatively
impact their ability to be agile decision makers.
The senior IT managers who are demanding agile systems development practices
should be exerting even greater organizational pressure, in conjunction with their
peers in the business, on the development of consistent conformed dimensions if
they’re interested in both long-term development e ciencies and long-term decision-
making e ectiveness across the enterprise.
Conformed Facts
Thus far we have considered the central task of setting up conformed dimensions to
tie dimensional models together. This is 95 percent or more of the data architecture
e ort. The remaining 5 percent of the e ort goes into establishing conformed fact
defi nitions.
Revenue, profit, standard prices and costs, measures of quality and customer
satisfaction, and other key performance indicators (KPIs) are facts that must also
conform. If facts live in more than one dimensional model, the underlying definitions and equations for these facts must be the same if they are to be called the same thing. If they are labeled identically, they need to be defined in the same
dimensional context and with the same units of measure from dimensional model
to dimensional model. For example, if several business processes report revenue,
then these separate revenue metrics can be added and compared only if they have
the same financial definitions. If there are definitional differences, then it is essential
that the revenue facts be labeled uniquely.
NOTE You must be disciplined in your data naming practices. If it is impos-
sible to conform a fact exactly, you should give different names to the different
interpretations so that business users do not combine these incompatible facts in
calculations.
Sometimes a fact has a natural unit of measure in one fact table and another natu-
ral unit of measure in another fact table. For example, the flow of product down the
retail value chain may best be measured in shipping cases at the warehouse but in
scanned units at the store. Even if all the dimensional considerations have been cor-
rectly taken into account, it would be di cult to use these two incompatible units of
measure in one drill-across report. The usual solution to this kind of problem is to
refer the user to a conversion factor buried in the product dimension table and hope
that the user can fi nd the conversion factor and correctly use it. This is unacceptable
for both overhead and vulnerability to error. The correct solution is to carry the fact
in both units of measure, so a report can easily glide down the value chain, picking
o comparable facts. Chapter 6: Order Management talks more about multiple units
of measure.
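To make the dual units concrete, a fact table can simply carry the same measure twice, once per unit. The following sketch is illustrative only; the table and column names (shipment_fact, quantity_ship_cases, quantity_retail_units) are our assumptions, not taken from this case study:

    -- Hypothetical fact carrying the shipment measure in both units of measure,
    -- so drill-across reports need no conversion-factor lookup in the product dimension.
    CREATE TABLE shipment_fact (
        shipment_date_key      INTEGER NOT NULL,
        product_key            INTEGER NOT NULL,
        warehouse_key          INTEGER NOT NULL,
        quantity_ship_cases    DECIMAL(12,2),  -- natural unit at the warehouse
        quantity_retail_units  DECIMAL(12,2)   -- same measurement restated in store scan units
    );

    -- A value-chain report picks whichever unit matches the other business process:
    SELECT d.calendar_month, SUM(f.quantity_retail_units) AS shipped_retail_units
    FROM shipment_fact f
    JOIN date_dim d ON d.date_key = f.shipment_date_key
    GROUP BY d.calendar_month;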
Summary
In this chapter we developed dimensional models for the three complementary
views of inventory. The periodic snapshot is a good choice for long-running, con-
tinuously replenished inventory scenarios. The accumulating snapshot is a good
choice for finite inventory pipeline situations with a definite beginning and end.
Finally, most inventory analysis will require a transactional schema to augment
these snapshot models.
We introduced key concepts surrounding the enterprise data warehouse bus
architecture and matrix. Each business process of the value chain, supported by a
primary source system, translates into a row in the bus matrix, and eventually, a
dimensional model. The matrix rows share a surprising number of standardized,
conformed dimensions. Developing and adhering to the enterprise bus architecture
is an absolute must if you intend to build a DW/BI system composed of an integrated
set of dimensional models.
Procurement
We explore procurement processes in this chapter. This subject area has
obvious cross-industry appeal because it is applicable to any organization
that acquires products or services for either use or resale.
In addition to developing several purchasing models, this chapter provides
in-depth coverage of the techniques for handling dimension table attribute value
changes. Although descriptive attributes in dimension tables are relatively static,
they are subject to change over time. Product lines are restructured, causing product
hierarchies to change. Customers move, causing their geographic information to
change. We’ll describe several approaches to deal with these inevitable dimension
table changes. Followers of the Kimball methods will recognize the type 1, 2, and
3 techniques. Continuing in this tradition, we’ve expanded the slowly changing
dimension technique line-up with types 0, 4, 5, 6, and 7.
Chapter 5 discusses the following concepts:
Bus matrix snippet for procurement processes
Blended versus separate transaction schemas
Slowly changing dimension technique types 0 through 7, covering both basic
and advanced hybrid scenarios
Procurement Case Study
Thus far we have studied downstream sales and inventory processes in the retailer’s
value chain. We explained the importance of mapping out the enterprise data ware-
house bus architecture where conformed dimensions are used across process-centric
fact tables. In this chapter we’ll extend these concepts as we work our way further
up the value chain to the procurement processes.
For many companies, procurement is a critical business activity. Effective procure-
ment of products at the right price for resale is obviously important to retailers
and distributors. Procurement also has strong bottom line implications for any
organization that buys products as raw materials for manufacturing. Significant
cost savings opportunities are associated with reducing the number of suppliers
and negotiating agreements with preferred suppliers.
Demand planning drives efficient materials management. After demand is fore-
casted, procurement’s goal is to source the appropriate materials or products in
the most economical manner. Procurement involves a wide range of activities from
negotiating contracts to issuing purchase requisitions and purchase orders (POs)
to tracking receipts and authorizing payments. The following list gives you a better
sense of a procurement organization’s common analytic requirements:
Which materials or products are most frequently purchased? How many ven-
dors supply these products? At what prices? Looking at demand across the
enterprise (rather than at a single physical location), are there opportunities
to negotiate favorable pricing by consolidating suppliers, single sourcing, or
making guaranteed buys?
Are your employees purchasing from the preferred vendors or skirting the
negotiated vendor agreements with maverick spending?
Are you receiving the negotiated pricing from your vendors or is there vendor
contract purchase price variance?
How are your vendors performing? What is the vendor's fill rate? On-time
delivery performance? Late deliveries outstanding? Percent back ordered?
Rejection rate based on receipt inspection?
Procurement Transactions and Bus Matrix
As you begin working through the four-step dimensional design process, you deter-
mine that procurement is the business process to be modeled. In studying the
process, you observe a flurry of procurement transactions, such as purchase requisi-
tions, purchase orders, shipping notifications, receipts, and payments. Similar to the
approach taken in Chapter 4: Inventory, you could initially design a fact table with
the grain of one row per procurement transaction with transaction date, product,
vendor, contract terms, and procurement transaction type as key dimensions. The
procurement transaction quantity and dollar amount are the facts. The resulting
design is shown in Figure 5-1.
[Figure 5-1: Procurement fact table with multiple transaction types. The Procurement Transaction Fact table contains Procurement Transaction Date Key (FK), Product Key (FK), Vendor Key (FK), Contract Terms Key (FK), Procurement Transaction Type Key (FK), Contract Number (DD), Procurement Transaction Quantity, and Procurement Transaction Dollar Amount. It is surrounded by the Date, Product, Contract Terms (key, description, type), Procurement Transaction Type (key, description, category), and Vendor dimensions; the Vendor dimension carries descriptive attributes such as vendor name, street address, city, state-province, ZIP-postal code, country, status, minority ownership flag, and corporate parent.]
If you work for the same grocery retailer from the earlier case studies, the trans-
action date and product dimensions are the same conformed dimensions developed
originally in Chapter 3: Retail Sales. If you work with manufacturing procurement,
the raw materials products likely are located in a separate raw materials dimen-
sion table rather than included in the product dimension for salable products. The
vendor, contract terms, and procurement transaction type dimensions are new
to this schema. The vendor dimension contains one row for each vendor, along
with interesting descriptive attributes to support a variety of vendor analyses. The
contract terms dimension contains one row for each generalized set of negotiated
terms, similar to the promotion dimension in Chapter 3. The procurement trans-
action type dimension enables grouping or fi ltering on transaction types, such as
purchase orders. The contract number is a degenerate dimension; it could be used
to determine the volume of business conducted under each negotiated contract.
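Expressed as a rough DDL sketch (the data types and NOT NULL constraints are our assumptions, not part of the case study), the Figure 5-1 fact table might be declared as:

    -- Sketch of the single procurement transaction fact table from Figure 5-1.
    CREATE TABLE procurement_transaction_fact (
        procurement_transaction_date_key      INTEGER NOT NULL,  -- FK to date dimension
        product_key                           INTEGER NOT NULL,  -- FK to product dimension
        vendor_key                            INTEGER NOT NULL,  -- FK to vendor dimension
        contract_terms_key                    INTEGER NOT NULL,  -- FK to contract terms dimension
        procurement_transaction_type_key      INTEGER NOT NULL,  -- FK to transaction type dimension
        contract_number                       VARCHAR(20),       -- degenerate dimension
        procurement_transaction_quantity      DECIMAL(12,2),
        procurement_transaction_dollar_amount DECIMAL(12,2)
    );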
Single Versus Multiple Transaction Fact Tables
As you review the initial procurement schema design with business users, you learn
several new details. First, the business users describe the various procurement trans-
actions di erently. To the business, purchase orders, shipping notices, warehouse
receipts, and vendor payments are all viewed as separate and unique processes.
Several of the procurement transactions come from different source systems.
There is a purchasing system that provides purchase requisitions and purchase
orders, a warehousing system that provides shipping notices and warehouse receipts,
and an accounts payable system that deals with vendor payments.
You further discover that several transaction types have different dimensionality.
For example, discounts taken are applicable to vendor payments but not to the other
transaction types. Similarly, the name of the employee who received the goods at
the warehouse applies to receipts but doesn’t make sense elsewhere.
There are also a variety of interesting control numbers, such as purchase order
and payment check numbers, created at various steps in the procurement pipeline.
These control numbers are perfect candidates for degenerate dimensions. For certain
transaction types, more than one control number may apply.
As you sort through these new details, you are faced with a design decision.
Should you build a blended transaction fact table with a transaction type dimension
to view all procurement transactions together, or do you build separate fact tables
for each transaction type? This is a common design quandary that surfaces in many
transactional situations, not just procurement.
As dimensional modelers, you need to make design decisions based on a thor-
ough understanding of the business requirements weighed against the realities of
the underlying source data. There is no simple formula to make the definite deter-
mination of whether to use a single fact table or multiple fact tables. A single fact
table may be the most appropriate solution in some situations, whereas multiple
fact tables are most appropriate in others. When faced with this design decision,
the following considerations help sort out the options:
What are the users’ analytic requirements? The goal is to reduce complexity
by presenting the data in the most effective form for business users. How will
the business users most commonly analyze this data? Which approach most
naturally aligns with their business-centric perspective?
Are there really multiple unique business processes? In the procurement
example, it seems buying products (purchase orders) is distinctly different
from receiving products (receipts). The existence of separate control num-
bers for each step in the process is a clue that you are dealing with separate
processes. Given this situation, you would lean toward separate fact tables. By
contrast, in Chapter 4’s inventory example, the varied inventory transactions
were part of a single inventory process resulting in a single fact table design.
Are multiple source systems capturing metrics with unique granularities?
There are three separate source systems in this case study: purchasing, ware-
housing, and accounts payable. This would suggest separate fact tables.
What is the dimensionality of the facts? In this procurement example, several
dimensions are applicable to some transaction types but not to others. This
would again lead you to separate fact tables.
A simple way to consider these trade-offs is to draft a bus matrix. As illustrated in
Figure 5-2, you can include two additional columns identifying the atomic granular-
ity and metrics for each row. These matrix embellishments cause it to more closely
resemble the detailed implementation bus matrix, which we’ll more thoroughly
discuss in Chapter 16: Insurance.
Business Process        | Atomic Granularity             | Metrics
Purchase Requisitions   | 1 row per requisition line     | Requisition Quantity & Dollars
Purchase Orders         | 1 row per PO line              | PO Quantity & Dollars
Shipping Notifications  | 1 row per shipping notice line | Shipped Quantity
Warehouse Receipts      | 1 row per receipt line         | Received Quantity
Vendor Invoices         | 1 row per invoice line         | Invoice Quantity & Dollars
Vendor Payments         | 1 row per payment              | Invoice, Discount & Net Payment Dollars

Each business process row is also marked against the conformed dimensions it uses: Date, Product, Vendor, Contract Terms, Employee, Warehouse, and Carrier.

Figure 5-2: Sample bus matrix rows for procurement processes.
Based on the bus matrix for this hypothetical case study, multiple transaction
fact tables would be implemented, as illustrated in Figure 5-3. In this example, there
are separate fact tables for purchase requisitions, purchase orders, shipping notices,
warehouse receipts, and vendor payments. This decision was reached because users
view these activities as separate and distinct business processes, the data comes
from di erent source systems, and there is unique dimensionality for the various
transaction types. Multiple fact tables enable richer, more descriptive dimensions
and attributes. The single fact table approach would have required generalized
labeling for some dimensions. For example, purchase order date and receipt date
would likely have been generalized to simply transaction date. Likewise, purchasing
agent and receiving clerk would become employee. This generalization reduces the
legibility of the resulting dimensional model. Also, with separate fact tables as you
progress from purchase requisitions to payments, the fact tables inherit dimensions
from the previous steps.
Multiple fact tables may require more time to manage and administer because
there are more tables to load, index, and aggregate. Some would argue this approach
increases the complexity of the ETL processes. Actually, it may simplify the ETL
activities. Loading the operational data from separate source systems into separate
fact tables likely requires less complex ETL processing than attempting to integrate
data from the multiple sources into a single fact table.
[Figure 5-3: Multiple fact tables for procurement processes. Separate fact tables are shown for purchase requisitions, purchase orders, shipping notices, warehouse receipts, and vendor payments, all sharing the conformed Date, Product, Vendor, Contract Terms, Employee, Warehouse, and Carrier dimensions. Each fact table carries only the dimensions, degenerate control numbers (contract, purchase requisition, purchase order, shipping notification, warehouse receipt, and payment check numbers), and facts appropriate to its process, such as purchase requisition and purchase order quantities and dollar amounts, shipped quantity, received quantity, and vendor invoice, discount, and net payment dollar amounts.]
Complementary Procurement Snapshot
Apart from the decision regarding multiple procurement transaction fact tables,
you may also need to develop a snapshot fact table to fully address the business’s
needs. As suggested in Chapter 4, an accumulating snapshot such as Figure 5-4 that
crosses processes would be extremely useful if the business is interested in monitor-
ing product movement as it proceeds through the procurement pipeline (including
the duration of each stage). Remember that an accumulating snapshot is meant to
model processes with well-defined milestones. If the process is a continuous flow
that never really ends, it is not a good candidate for an accumulating snapshot.
[Figure 5-4: Procurement pipeline accumulating snapshot schema. The Procurement Pipeline Fact table carries one date key per pipeline milestone (purchase order, requested by, warehouse receipt, vendor invoice, and vendor payment dates), foreign keys to the Product, Vendor, Contract Terms, Employee, Warehouse, and Carrier dimensions, degenerate contract, purchase order, warehouse receipt, vendor invoice, and payment check numbers, the accumulated facts (purchase order quantity and dollar amount, shipped quantity, received quantity, and vendor invoice, discount, and net payment dollar amounts), and date lag facts such as PO to requested by, PO to receipt, requested by to receipt, receipt to payment, and invoice to payment lags.]
Slowly Changing Dimension Basics
To this point, we have pretended dimensions are independent of time. Unfortunately,
this is not the case in the real world. Although dimension table attributes are relatively
static, they aren't fixed forever; attribute values change, albeit rather slowly, over time.
Dimensional designers must proactively work with the business’s data governance
representatives to determine the appropriate change-handling strategy. You shouldn’t
simply jump to the conclusion that the business doesn’t care about dimension changes
just because they weren’t mentioned during the requirements gathering. Although
IT may assume accurate change tracking is unnecessary, business users may assume
the DW/BI system will allow them to see the impact of every attribute value change.
It is obviously better to get on the same page sooner rather than later.
NOTE The business’s data governance and stewardship representatives must be
actively involved in decisions regarding the handling of slowly changing dimension
attributes; IT shouldn’t make determinations on its own.
When change tracking is needed, it might be tempting to put every changing
attribute into the fact table on the assumption that dimension tables are static. This
is unacceptable and unrealistic. Instead you need strategies to deal with slowly
changing attributes within dimension tables. Since Ralph Kimball first introduced
the notion of slowly changing dimensions in 1995, some IT professionals in a never-
ending quest to speak in acronym-ese termed them SCDs. The acronym stuck.
For each dimension table attribute, you must specify a strategy to handle change.
In other words, when an attribute value changes in the operational world, how will
you respond to the change in the dimensional model? In the following sections, we
describe several basic techniques for dealing with attribute changes, followed by
more advanced options. You may need to employ a combination of these techniques
within a single dimension table.
Kimball method followers are likely already familiar with SCD types 1, 2, and 3.
Because legibility is part of our mantra, we sometimes wish we had given these tech-
niques more descriptive names in the first place, such as "overwrite." But after nearly
two decades, the “type numbers” are squarely part of the DW/BI vernacular. As you’ll
see in the following sections, we’ve decided to expand the theme by assigning new
SCD type numbers to techniques that have been described, but less precisely labeled,
in the past; our hope is that assigning specific numbers facilitates clearer communica-
tion among team members.
Type 0: Retain Original
This technique hasn’t been given a type number in the past, but it’s been around
since the beginning of SCDs. With type 0, the dimension attribute value never
changes, so facts are always grouped by this original value. Type 0 is appropriate
for any attribute labeled “original,” such as customer original credit score. It also
applies to most attributes in a date dimension.
As we staunchly advocated in Chapter 3, the dimension table’s primary key is
a surrogate key rather than relying on the natural operational key. Although we
demoted the natural key to being an ordinary dimension attribute, it still has special
significance. Presuming it's durable, it would remain inviolate. Persistent durable
keys are always type 0 attributes. Unless otherwise noted, throughout this chapter’s
SCD discussion, the durable supernatural key is assumed to remain constant, as
described in Chapter 3.
Type 1: Overwrite
With the slowly changing dimension type 1 response, you overwrite the old attri-
bute value in the dimension row, replacing it with the current value; the attribute
always reflects the most recent assignment.
Assume you work for an electronics retailer where products roll up into the retail
store’s departments. One of the products is IntelliKidz software. The existing row in
the product dimension table for IntelliKidz looks like the top half of Figure 5-5. Of
course, there would be additional descriptive attributes in the product dimension,
but we’ve abbreviated the attribute listing for clarity.
Original row in Product dimension:

Product Key | SKU (NK)  | Product Description | Department Name
12345       | ABC922-Z  | IntelliKidz         | Education

Updated row in Product dimension:

Product Key | SKU (NK)  | Product Description | Department Name
12345       | ABC922-Z  | IntelliKidz         | Strategy

Figure 5-5: SCD type 1 sample rows.
Suppose a new merchandising person decides IntelliKidz software should be
moved from the Education department to the Strategy department on February 1,
2013 to boost sales. With a type 1 response, you'd simply update the existing row
in the dimension table with the new department description, as illustrated in the
updated row of Figure 5-5.
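In SQL terms, the type 1 response is nothing more than an update in place; a minimal sketch, assuming a product_dimension table with the column names from Figure 5-5:

    -- Type 1 overwrite: the surrogate key (12345) and all fact table rows are untouched.
    UPDATE product_dimension
    SET department_name = 'Strategy'
    WHERE sku = 'ABC922-Z';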
In this case, no dimension or fact table keys were modified when IntelliKidz's
department changed. The fact table rows still reference product key 12345, regardless
of IntelliKidz's departmental location. When sales take off following the move to the
Strategy department, you have no information to explain the performance improve-
ment because the historical and more recent facts both appear as if IntelliKidz
always rolled up into Strategy.
Chapter 5
150
The type 1 response is the simplest approach for dimension attribute changes. In
the dimension table, you merely overwrite the preexisting value with the current
assignment. The fact table is untouched. The problem with a type 1 response is that
you lose all history of attribute changes. Because overwriting obliterates historical
attribute values, you’re left solely with the attribute values as they exist today. A type
1 response is appropriate if the attribute change is an insignificant correction. It also
may be appropriate if there is no value in keeping the old description. However, too
often DW/BI teams use a type 1 response as the default for dealing with slowly chang-
ing dimensions and end up totally missing the mark if the business needs to track
historical changes accurately. After you implement a type 1, it's difficult to change
course in the future.
NOTE The type 1 response is easy to implement, but it does not maintain any
history of prior attribute values.
Before we leave the topic of type 1 changes, be forewarned that the same BI
applications can produce different results before versus after the type 1 attribute
change. When the dimension attribute’s type 1 overwrite occurs, the fact rows are
associated with the new descriptive context. Business users who rolled up sales by
department on January 31 will get different department totals when they run the
same report on February 1 following the type 1 overwrite.
There’s another easily overlooked catch to be aware of. With a type 1 response
to deal with the relocation of IntelliKidz, any preexisting aggregations based on the
department value need to be rebuilt. The aggregated summary data must continue
to tie to the detailed atomic data, where it now appears that IntelliKidz has always
rolled up into the Strategy department.
Finally, if a dimensional model is deployed via an OLAP cube and the type 1
attribute is a hierarchical rollup attribute, like the product’s department in our
example, the cube likely needs to be reprocessed when the type 1 attribute changes.
At a minimum, similar to the relational environment, the cube’s performance aggre-
gations need to be recalculated.
WARNING Even though type 1 changes appear the easiest to implement,
remember they invalidate relational tables and OLAP cubes that have aggregated
data over the affected attribute.
Type 2: Add New Row
In Chapter 1: Data Warehousing, Business Intelligence, and Dimensional Modeling
Primer, we stated one of the DW/BI system’s goals was to correctly represent history.
A type 2 response is the predominant technique for supporting this requirement
when it comes to slowly changing dimension attributes.
Using the type 2 approach, when IntelliKidz’s department changed on February
1, 2013, a new product dimension row for IntelliKidz is inserted to reflect the new
department attribute value. There are two product dimension rows for IntelliKidz,
as illustrated in Figure 5-6. Each row contains a version of IntelliKidz’s attribute
profile that was true for a span of time.
Original row in Product dimension:

Product Key | SKU (NK) | Product Description | Department Name | ... | Row Effective Date | Row Expiration Date | Current Row Indicator
12345       | ABC922-Z | IntelliKidz         | Education       | ... | 2012-01-01         | 9999-12-31          | Current

Rows in Product dimension following department reassignment:

Product Key | SKU (NK) | Product Description | Department Name | ... | Row Effective Date | Row Expiration Date | Current Row Indicator
12345       | ABC922-Z | IntelliKidz         | Education       | ... | 2012-01-01         | 2013-01-31          | Expired
25984       | ABC922-Z | IntelliKidz         | Strategy        | ... | 2013-02-01         | 9999-12-31          | Current

Figure 5-6: SCD type 2 sample rows.
With type 2 changes, the fact table is again untouched; you don’t go back to
the historical fact table rows to modify the product key. In the fact table, rows for
IntelliKidz prior to February 1, 2013, would reference product key 12345 when the
product rolled up to the Education department. After February 1, new IntelliKidz
fact rows would have product key 25984 to reflect the move to the Strategy depart-
ment. This is why we say type 2 responses perfectly partition or segment history to
account for the change. Reports summarizing pre-February 1 facts look identical
whether the report is generated before or after the type 2 change.
We want to reinforce that reported results may differ depending on whether
attribute changes are handled as a type 1 or type 2. Let’s presume the electronic
retailer sells $500 of IntelliKidz software during January 2013, followed by a $100
sale in February 2013. If the department attribute is a type 1, the results from a
query reporting January and February sales would indicate $600 under Strategy.
Conversely, if the department name attribute is a type 2, the sales would be reported
as $500 for the Education department and $100 for the Strategy department.
Unlike the type 1 approach, there is no need to revisit preexisting aggregation
tables when using the type 2 technique. Likewise, OLAP cubes do not need to be
reprocessed if hierarchical attributes are handled as type 2.
If you constrain on the department attribute, the two product profiles are differen-
tiated. If you constrain on the product description, the query automatically fetches
both IntelliKidz product dimension rows and automatically joins to the fact table for
the complete product history. If you need to count the number of products correctly,
then you would just use the SKU natural key attribute as the basis of the distinct
count rather than the surrogate key; the natural key column becomes the glue that
holds the separate type 2 rows for a single product together.
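For example, a correct product count groups on the natural key rather than the surrogate key; a small sketch with the column names assumed above:

    -- Two type 2 rows (product keys 12345 and 25984) still count as one product.
    SELECT COUNT(DISTINCT sku) AS product_count
    FROM product_dimension;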
NOTE The type 2 response is the primary workhorse technique for accurately
tracking slowly changing dimension attributes. It is extremely powerful because
the new dimension row automatically partitions history in the fact table.
Type 2 is the safest response if the business is not absolutely certain about the
SCD business rules for an attribute. As we’ll discuss in the “Type 6: Add Type 1
Attributes to Type 2 Dimension” and “Type 7: Dual Type 1 and Type 2 Dimensions”
sections later in the chapter, you can provide the illusion of a type 1 overwrite when
an attribute has been handled with the type 2 response. The converse is not true. If
you treat an attribute as type 1, reverting to type 2 retroactively requires significant
effort to create new dimension rows and then appropriately rekey the fact table.
Type 2 Effective and Expiration Dates
When a dimension table includes type 2 attributes, you should include several
administrative columns on each row, as shown in Figure 5-6. The effective and
expiration dates refer to the moment when the row's attribute values become valid
or invalid. Effective and expiration dates or date/time stamps are necessary in the
ETL system because it needs to know which surrogate key is valid when loading
historical fact rows. The effective and expiration dates support precise time slic-
ing of the dimension; however, there is no need to constrain on these dates in the
dimension table to get the right answer from the fact table. The row effective date
is the first date the descriptive profile is valid. When a new product is first loaded
in the dimension table, the expiration date is set to December 31, 9999. By avoiding
a null in the expiration date, you can reliably use a BETWEEN command to find the
dimension rows that were in effect on a certain date.
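A point-in-time lookup might therefore look like the following sketch, using the administrative column names shown in Figure 5-6:

    -- Locate the product profile that was in effect on January 15, 2013; the
    -- 9999-12-31 expiration date on current rows lets BETWEEN work without null handling.
    SELECT product_key, department_name
    FROM product_dimension
    WHERE sku = 'ABC922-Z'
      AND DATE '2013-01-15' BETWEEN row_effective_date AND row_expiration_date;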
When a new profile row is added to the dimension to capture a type 2 attribute
change, the previous row is expired. We typically suggest the end date on the
old row should be just prior to the effective date of the new row, leaving no gaps
between these effective and expiration dates. The definition of "just prior" depends
on the grain of the changes being tracked. Typically, the effective and expiration
dates represent changes that occur during a day; if you're tracking more granular
changes, you'd use a date/time stamp instead. In this case, you may elect to apply
different business rules, such as setting the row expiration date exactly equal to the
effective date of the next row. This would require logic such as ">= effective date
and < expiration date" constraints, invalidating the use of BETWEEN.
Some argue that a single effective date is adequate, but this makes for more
complicated searches to locate the dimension row with the latest effective date
that is less than or equal to a date filter. Storing an explicit second date simpli-
fies the query processing. Likewise, a current row indicator is another useful
administrative dimension attribute to quickly constrain queries to only the cur-
rent profiles.
The type 2 response to slowly changing dimensions requires the use of surrogate
keys, but you're already using them anyhow, right? You certainly can't use the opera-
tional natural key because there are multiple profile versions for the same natural key.
It is not sufficient to use the natural key with two or three version digits because you'd
be vulnerable to the entire list of potential operational issues discussed in Chapter 3.
Likewise, it is inadvisable to append an effective date to the otherwise primary key
of the dimension table to uniquely identify each version. With the type 2 response,
you create a new dimension row with a new single-column primary key to uniquely
identify the new product profile. This single-column primary key establishes the link-
age between the fact and dimension tables for a given set of product characteristics.
There's no need to create a confusing secondary join based on the dimension row's
effective or expiration dates.
We recognize some of you may be concerned about the administration of surro-
gate keys to support type 2 changes. In Chapter 19: ETL Subsystems and Techniques
and Chapter 20: ETL System Design and Development Process and Tasks, we’ll dis-
cuss a workflow for managing surrogate keys and accommodating type 2 changes
in more detail.
Type 1 Attributes in Type 2 Dimensions
It is not uncommon to mix multiple slowly changing dimension techniques within
the same dimension. When type 1 and type 2 are both used in a dimension, some-
times a type 1 attribute change necessitates updating multiple dimension rows. Let’s
presume the dimension table includes a product introduction date. If this attribute
is corrected using type 1 logic after a type 2 change to another attribute occurs,
the introduction date should probably be updated on both versions of IntelliKidz’s
profile, as illustrated in Figure 5-7.
The data stewards need to be involved in defining the ETL business rules in
scenarios like this. Although the DW/BI team can facilitate discussion regarding
proper update handling, the business's data stewards should make the final deter-
mination, not the DW/BI team.
Original row in Product dimension:

Product Key | SKU (NK) | Product Description | Department Name | Introduction Date | Row Effective Date | Row Expiration Date | Current Row Indicator
12345       | ABC922-Z | IntelliKidz         | Education       | 2012-12-15        | 2012-01-01         | 9999-12-31          | Current

Rows in Product dimension following type 2 change to Department Name and type 1 change to Introduction Date:

Product Key | SKU (NK) | Product Description | Department Name | Introduction Date | ... | Row Effective Date | Row Expiration Date | Current Row Indicator
12345       | ABC922-Z | IntelliKidz         | Education       | 2012-01-01        | ... | 2012-01-01         | 2013-01-31          | Expired
25984       | ABC922-Z | IntelliKidz         | Strategy        | 2012-01-01        | ... | 2013-02-01         | 9999-12-31          | Current

Figure 5-7: Type 1 updates in a dimension with type 2 attributes sample rows.
Type 3: Add New Attribute
Although the type 2 response partitions history, it does not enable you to associ-
ate the new attribute value with old fact history or vice versa. With the type 2
response, when you constrain the department attribute to Strategy, you see only
IntelliKidz facts from after February 1, 2013. In most cases, this is exactly what
you want.
However, sometimes you want to see fact data as if the change never occurred.
This happens most frequently with sales force reorganizations. District boundaries
may be redrawn, but some users still want the ability to roll up recent sales for the
prior districts just to see how they would have done under the old organizational
structure. For a few transitional months, there may be a need to track history for
the new districts and conversely to track new fact data in terms of old district
boundaries. A type 2 response won’t support this requirement, but type 3 comes
to the rescue.
In our software example, let’s assume there is a legitimate business need to
track both the new and prior values of the department attribute for a period of
time around the February 1 change. With a type 3 response, you do not issue a
new dimension row, but rather add a new column to capture the attribute change,
as illustrated in Figure 5-8. You would alter the product dimension table to add
a prior department attribute, and populate this new column with the existing
department value (Education). The original department attribute is treated as a
type 1 where you overwrite to reflect the current value (Strategy). All existing
reports and queries immediately switch over to the new department description,
but you can still report on the old department value by querying on the prior
department attribute.
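A type 3 change boils down to one new column and two updates; the following is only a sketch with assumed column names:

    -- Add the prior-value column, preserve the old assignment there, then overwrite.
    ALTER TABLE product_dimension ADD prior_department_name VARCHAR(50);

    UPDATE product_dimension
    SET prior_department_name = department_name    -- push 'Education' into the prior column
    WHERE sku = 'ABC922-Z';

    UPDATE product_dimension
    SET department_name = 'Strategy'               -- type 1 style overwrite of the primary column
    WHERE sku = 'ABC922-Z';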
Original row in Product dimension:

Product Key | SKU (NK) | Product Description | Department Name
12345       | ABC922-Z | IntelliKidz         | Education

Updated row in Product dimension:

Product Key | SKU (NK) | Product Description | Department Name | Prior Department Name
12345       | ABC922-Z | IntelliKidz         | Strategy        | Education

Figure 5-8: SCD type 3 sample rows.
Don't be fooled into thinking the higher type number associated with type 3
indicates it is the preferred approach; the techniques have not been presented in
good, better, and best practice sequence. Frankly, type 3 is infrequently used. It is
appropriate when there’s a strong need to support two views of the world simulta-
neously. Type 3 is distinguished from type 2 because the pair of current and prior
attribute values are regarded as true at the same time.
NOTE The type 3 slowly changing dimension technique enables you to see
new and historical fact data by either the new or prior attribute values, sometimes
called alternate realities.
Type 3 is not useful for attributes that change unpredictably, such as a customer’s
home state. There would be no benefit in reporting facts based on a prior home state
attribute that reflects a change from 10 days ago for some customers or 10 years
ago for others. These unpredictable changes are typically handled best with type
2 instead.
Type 3 is most appropriate when there's a significant change impacting many
rows in the dimension table, such as a product line or sales force reorganization.
These en masse changes are prime candidates because business users often want
the ability to analyze performance metrics using either the pre- or post-hierarchy
reorganization for a period of time. With type 3 changes, the prior column is labeled
to distinctly represent the prechanged grouping, such as 2012 department or pre-
merger department. These column names provide clarity, but there may be unwanted
ripples in the BI layer.
Finally, if the type 3 attribute represents a hierarchical rollup level within the
dimension, then as discussed with type 1, the type 3 update and additional column
would likely cause OLAP cubes to be reprocessed.
Multiple Type 3 Attributes
If a dimension attribute changes with a predictable rhythm, sometimes the business
wants to summarize performance metrics based on any of the historic attribute
values. Imagine the product line is recategorized at the start of every year and the
business wants to look at multiple years of historic facts based on the department
assignment for the current year or any prior year.
In this case, we take advantage of the regular, predictable nature of these changes
by generalizing the type 3 approach to a series of type 3 dimension attributes, as
illustrated in Figure 5-9. On every dimension row, there is a current department
attribute that is overwritten, plus attributes for each annual designation, such as
2012 department. Business users can roll up the facts with any of the department
assignments. If a product were introduced in 2013, the department attributes for
2012 and 2011 would contain Not Applicable values.
Updated row in Product dimension:

Product Key | SKU (NK) | Product Description | Current Department Name | 2012 Department Name | 2011 Department Name
12345       | ABC922-Z | IntelliKidz         | Strategy                 | Education            | Not Applicable

Figure 5-9: Dimension table with multiple SCD type 3 attributes.
The most recent assignment column should be identified as the current depart-
ment. This attribute will be used most frequently; you don't want to modify existing
queries and reports to accommodate next year's change. When the departments are
reassigned in January 2014, you'd alter the table to add a 2013 department attribute,
populate this column with the current department values, and then overwrite the
current attribute with the 2014 department assignment.
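The January 2014 maintenance could follow the same pattern each year; a sketch with assumed column names (the 'Strategy' assignment shown is purely hypothetical, and in practice the overwrite would be driven by the 2014 reassignment source):

    -- Snapshot the outgoing year, then overwrite the current assignment.
    ALTER TABLE product_dimension ADD department_name_2013 VARCHAR(50);

    UPDATE product_dimension
    SET department_name_2013 = current_department_name;   -- preserve the 2013 assignments

    UPDATE product_dimension
    SET current_department_name = 'Strategy'              -- hypothetical 2014 assignment
    WHERE sku = 'ABC922-Z';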
Type 4: Add Mini-Dimension
Thus far we’ve focused on slow evolutionary changes to dimension tables. What
happens when the rate of change speeds up, especially within a large multimillion-
row dimension table? Large dimensions present two challenges that warrant special
treatment. The size of these dimensions can negatively impact browsing and query
filtering performance. Plus our tried-and-true type 2 technique for change tracking
is unappealing because we don’t want to add more rows to a dimension that already
has millions of rows, particularly if changes happen frequently.
Fortunately, a single technique comes to the rescue to address both the browsing
performance and change tracking challenges. The solution is to break off frequently
analyzed or frequently changing attributes into a separate dimension, referred to
as a mini-dimension. For example, you could create a mini-dimension for a group
of more volatile customer demographic attributes, such as age, purchase frequency
score, and income level, presuming these columns are used extensively and changes
to these attributes are important to the business. There would be one row in the
mini-dimension for each unique combination of age, purchase frequency score,
and income level encountered in the data, not one row per customer. With this
approach, the mini-dimension becomes a set of demographic profiles. Although the
number of rows in the customer dimension may be in the millions, the number of
mini-dimension rows should be significantly smaller. You leave behind the more
constant attributes in the original multimillion-row customer table.
Sample rows for a demographic mini-dimension are illustrated in Figure 5-10.
When creating the mini-dimension, continuously variable attributes, such as income,
are converted to banded ranges. In other words, the attributes in the mini-dimension
are typically forced to take on a relatively small number of discrete values. Although
this restricts use to a set of predefined bands, it drastically reduces the number of
combinations in the mini-dimension. If you stored income at a specific dollar and
cents value in the mini-dimension, when combined with the other demographic
attributes, you could end up with as many rows in the mini-dimension as in the
customer dimension itself. The use of band ranges is probably the most significant
compromise associated with the mini-dimension technique. Although grouping
facts from multiple band values is viable, changing to more discrete bands (such
as $30,000-34,999) at a later time is difficult. If users insist on access to a specific
raw data value, such as a credit bureau score that is updated monthly, it should be
included in the fact table, in addition to being value banded in the demographic
mini-dimension. In Chapter 10: Financial Services, we'll discuss dynamic value
banding of facts; however, such queries are much less efficient than constraining
the value band in a mini-dimension table.
Demographics Key | Age Band | Purchase Frequency Score | Income Level
1                | 21-25    | Low                      | <$30,000
2                | 21-25    | Medium                   | <$30,000
3                | 21-25    | High                     | <$30,000
4                | 21-25    | Low                      | $30,000-39,999
5                | 21-25    | Medium                   | $30,000-39,999
6                | 21-25    | High                     | $30,000-39,999
...              | ...      | ...                      | ...
142              | 26-30    | Low                      | <$30,000
143              | 26-30    | Medium                   | <$30,000
144              | 26-30    | High                     | <$30,000
...              | ...      | ...                      | ...

Figure 5-10: SCD type 4 mini-dimension sample rows.
Every time a fact table row is built, two foreign keys related to the customer would
be included: the customer dimension key and the mini-dimension demographics
key in e ect at the time of the event, as shown in Figure 5-11. The mini-dimension
delivers performance benefi ts by providing a smaller point of entry to the facts.
Queries can avoid the huge customer dimension table unless attributes from that
table are constrained or used as report labels.
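For example, a demographic rollup can be satisfied from the mini-dimension and the fact table alone; this sketch assumes a sales_fact table and a sales_dollar_amount fact that are not part of the text:

    -- The multimillion-row customer dimension is never touched in this query.
    SELECT m.age_band, m.income_level, SUM(f.sales_dollar_amount) AS sales
    FROM sales_fact f
    JOIN demographics_dimension m ON m.demographics_key = f.demographics_key
    GROUP BY m.age_band, m.income_level;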
[Figure 5-11: Type 4 mini-dimension with customer dimension. The fact table carries Date Key (FK), Customer Key (FK), Demographics Key (FK), more foreign keys, and facts. The Customer dimension contains Customer Key (PK), Customer ID (NK), name, address, city-state, state, ZIP-postal code, and date of birth; the Demographics dimension contains Demographics Key (PK), Age Band, Purchase Frequency Score, and Income Level.]
When the mini-dimension key participates as a foreign key in the fact table, another
benefit is that the fact table captures the demographic profile changes. Let's presume
we are loading data into a periodic snapshot fact table on a monthly basis. Referring
back to our sample demographic mini-dimension sample rows in Figure 5-10, if one
of our customers, John Smith, were 25 years old with a low purchase frequency score
and an income of $25,000, you'd begin by assigning demographics key 1 when loading
the fact table. If John has a birthday several weeks later and turns 26 years old, you'd
assign demographics key 142 when the fact table was next loaded; the demographics
key on John's earlier fact table rows would not be changed. In this manner, the fact
table tracks the age change. You'd continue to assign demographics key 142 when
the fact table is loaded until there's another change in John's demographic profile. If
John receives a raise to $32,000 several months later, a new demographics key would
be reflected in the next fact table load. Again, the earlier rows would be unchanged.
OLAP cubes also readily accommodate type 4 mini-dimensions.
Customer dimensions are somewhat unique in that customer attributes frequently
are queried independently from the fact table. For example, users may want to know
how many customers live in Dade County by age bracket for segmentation and profil-
ing. Rather than forcing any analysis that combines customer and demographic data
to link through the fact table, the most recent value of the demographics key also
can exist as a foreign key on the customer dimension table. We’ll further describe
this customer demographic outrigger as an SCD type 5 in the next section.
The demographic dimension cannot be allowed to grow too large. If you have
five demographic attributes, each with 10 possible values, then the demographics
dimension could have 100,000 (10^5) rows. This is a reasonable upper limit for the
number of rows in a mini-dimension if you build out all the possible combina-
tions in advance. An alternate ETL approach is to build only the mini-dimension
rows that actually occur in the data. However, there are certainly cases where even
this approach doesn't help and you need to support more than five demographic
attributes with 10 values each. We'll discuss the use of multiple mini-dimensions
associated with a single fact table in Chapter 10.
Demographic profile changes sometimes occur outside a business event, such
as when a customer's profile is updated in the absence of a sales transaction. If the
business requires accurate point-in-time profiling, a supplemental factless fact table
with effective and expiration dates can capture every relationship change between
the customer and demographics dimensions.
Hybrid Slowly Changing Dimension
Techniques
In this final section, we'll discuss hybrid approaches that combine the basic SCD
techniques. Designers sometimes become enamored with these hybrids because they
seem to provide the best of all worlds. However, the price paid for greater analytic
flexibility is often greater complexity. Although IT professionals may be impressed
by elegant flexibility, business users may be just as easily turned off by complexity.
You should not pursue these options unless the business agrees they are needed to
address their requirements.
These final approaches are most relevant if you've been asked to preserve the
historically accurate dimension attribute associated with a fact event, while sup-
porting the option to report historical facts according to the current attribute values.
The basic slowly changing dimension techniques do not enable this requirement
easily on their own.
We'll start by considering a technique that combines type 4 with a type 1 outrig-
ger; because 4 + 1 = 5, we're calling this type 5. Next, we'll describe type 6, which
combines types 1 through 3 for a single dimension attribute; it's aptly named type
6 because 2 + 3 + 1 and 2 × 3 × 1 both equal 6. Finally, we'll finish up with type 7,
which just happens to be the next available sequence number; there is no underly-
ing mathematical significance to this label.
Type 5: Mini-Dimension and Type 1 Outrigger
Let's return to the type 4 mini-dimension. An embellishment to this technique is to
add a current mini-dimension key as an attribute in the primary dimension. This
mini-dimension key reference is a type 1 attribute, overwritten with every profile
change. You wouldn't want to track this attribute as a type 2 because then you'd be
capturing volatile changes within the large multimillion-row dimension, and avoid-
ing this explosive growth was one of the original motivations for type 4.
The type 5 technique is useful if you want a current profile count in the absence
of fact table metrics or want to roll up historical facts based on the customer's cur-
rent profile. You'd logically represent the primary dimension and mini-dimension
outrigger as a single table in the presentation area, as shown in Figure 5-12. To
minimize user confusion and potential error, the current attributes in this role-
playing dimension should have distinct column names distinguishing them, such
as current age band. Even with unique labeling, be aware that presenting users with
two avenues for accessing demographic data, through either the mini-dimension
or outrigger, can deliver more functionality and complexity than some can handle.
[Figure 5-12: Type 4 mini-dimension with type 1 outrigger in customer dimension. Physically, the fact table references both the Customer dimension, which now carries a Current Demographics Key foreign key, and the Demographics mini-dimension; a Current Demographics dimension supplies Current Age Band, Current Purchase Frequency Score, and Current Income Level. Logically, the BI tools see the customer dimension presented as a single table with these current demographic attributes embedded, alongside the Demographics mini-dimension and the fact table.]
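One way to deliver the logical representation in Figure 5-12 is a view that joins the customer dimension to its current demographics outrigger; a sketch, with assumed physical names:

    -- Present the customer rows and their type 1 outrigger as a single dimension to BI tools.
    CREATE VIEW customer_with_current_demographics AS
    SELECT c.customer_key,
           c.customer_id,
           c.customer_name,
           d.age_band                 AS current_age_band,
           d.purchase_frequency_score AS current_purchase_frequency_score,
           d.income_level             AS current_income_level
    FROM customer_dimension c
    JOIN demographics_dimension d
      ON d.demographics_key = c.current_demographics_key;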
NOTE The type 4 mini-dimension terminology refers to when the demograph-
ics key is part of the fact table composite key. If the demographics key is a foreign
key in the customer dimension, it is referred to as an outrigger.
Type 6: Add Type 1 Attributes to Type 2 Dimension
Let's return to the electronics retailer's product dimension. With type 6, you would
have two department attributes on each row. The current department column
represents the current assignment; the historic department column is a type 2
attribute representing the historically accurate department value.
When IntelliKidz software is introduced, the product dimension row would look
like the first scenario in Figure 5-13.
Original row in Product dimension:

Product Key | SKU (NK) | Product Description | Historic Department Name | Current Department Name | ... | Row Effective Date | Row Expiration Date | Current Row Indicator
12345       | ABC922-Z | IntelliKidz         | Education                | Education               | ... | 2012-01-01         | 9999-12-31          | Current

Rows in Product dimension following first department reassignment:

Product Key | SKU (NK) | Product Description | Historic Department Name | Current Department Name | ... | Row Effective Date | Row Expiration Date | Current Row Indicator
12345       | ABC922-Z | IntelliKidz         | Education                | Strategy                | ... | 2012-01-01         | 2013-01-31          | Expired
25984       | ABC922-Z | IntelliKidz         | Strategy                 | Strategy                | ... | 2013-02-01         | 9999-12-31          | Current

Rows in Product dimension following second department reassignment:

Product Key | SKU (NK) | Product Description | Historic Department Name | Current Department Name | ... | Row Effective Date | Row Expiration Date | Current Row Indicator
12345       | ABC922-Z | IntelliKidz         | Education                | Critical Thinking       | ... | 2012-01-01         | 2013-01-31          | Expired
25984       | ABC922-Z | IntelliKidz         | Strategy                 | Critical Thinking       | ... | 2013-02-01         | 2013-06-30          | Expired
31726       | ABC922-Z | IntelliKidz         | Critical Thinking        | Critical Thinking       | ... | 2013-07-01         | 9999-12-31          | Current

Figure 5-13: SCD type 6 sample rows.
When the departments are restructured and IntelliKidz is moved to the Strategy
department, you'd use a type 2 response to capture the attribute change by issu-
ing a new row. In this new IntelliKidz dimension row, the current department will
be identical to the historical department. For all previous instances of IntelliKidz
dimension rows, the current department attribute will be overwritten to reflect the
current structure. Both IntelliKidz rows would identify the Strategy department as
the current department (refer to the second scenario in Figure 5-13).
In this manner you can use the historic attribute to group facts based on the attribute
value that was in effect when the facts occurred. Meanwhile, the current attri-
bute rolls up all the historical fact data for both product keys 12345 and 25984 into
the current department assignment. If IntelliKidz were then moved into the Critical
Thinking software department, the product table would look like Figure 5-13's final
set of rows. The current column groups all facts by the current assignment, while
the historic column preserves the historic assignments accurately and segments the
facts accordingly.
With this hybrid approach, you issue a new row to capture the change (type 2)
and add a new column to track the current assignment (type 3), where subsequent
changes are handled as a type 1 response. An engineer at a technology company
suggested we refer to this combo approach as type 6 because both the sum and the
product of 1, 2, and 3 equal 6.
Again, although this technique may be naturally appealing to some, it is impor-
tant to always consider the business users’ perspective as you strive to arrive at a
reasonable balance between flexibility and complexity. You may want to limit which
columns are exposed to some users so they’re not overwhelmed by choices.
Type 7: Dual Type 1 and Type 2 Dimensions
When we first described type 6, someone asked if the technique would be appropri-
ate for supporting both current and historic perspectives for 150 attributes in a large
dimension table. That question sent us back to the drawing board.
In this final hybrid technique, the dimension natural key (assuming it's durable)
is included as a fact table foreign key, in addition to the surrogate key for type 2
tracking, as illustrated in Figure 5-14. If the natural key is unwieldy or ever reas-
signed, you should use a separate durable supernatural key instead. The type 2
dimension contains historically accurate attributes for filtering and grouping based
on the effective values when the fact event occurred. The durable key joins to a
dimension with just the current type 1 values. Again, the column labels in this table
should be prefaced with "current" to reduce the risk of user confusion. You can use
these dimension attributes to summarize or filter facts based on the current profile,
regardless of the attribute values in effect when the fact event occurred.
[Figure 5-14: Type 7 with dual foreign keys for dual type 1 and type 2 dimension tables. The fact table carries Date Key (FK), Product Key (FK), Durable Product Key (FK), more foreign keys, and facts. The type 2 Product dimension holds Product Key (PK), Durable Product Key (DK), product description, department name, row effective and expiration dates, and a current row indicator. The Current Product dimension, delivered as a view of the Product dimension where Current Row Indicator = Current, holds Durable Product Key (PK), Current Product Description, and Current Department Name.]
This approach delivers the same functionality as type 6. Although the type 6
response spawns more attribute columns in a single dimension table, this approach
relies on two foreign keys in the fact table. Type 7 invariably requires less ETL effort
because the current type 1 attribute table could easily be delivered via a view of
the type 2 dimension table, limited to the most current rows. The incremental cost
of this final technique is the additional column carried in the fact table; however,
queries based on current attribute values would be filtering on a smaller dimension
table than previously described with type 6.
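The current dimension in Figure 5-14 could be delivered with a view along these lines (a sketch; the column names follow the figure):

    -- One row per durable key: just the current type 2 rows, relabeled as current attributes.
    CREATE VIEW current_product_dimension AS
    SELECT durable_product_key,
           product_description AS current_product_description,
           department_name     AS current_department_name
    FROM product_dimension
    WHERE current_row_indicator = 'Current';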
Of course, you could avoid storing the durable key in the fact table by joining the
type 1 view containing current attributes to the durable key in the type 2 dimension
table itself. In this case, however, queries that are only interested in current rollups
would need to traverse from the type 1 outrigger through the more voluminous
type 2 dimension before finally reaching the facts, which would likely negatively
impact query performance for current reporting.
A variation of this dual type 1 and type 2 dimension table approach again relies
on a view to deliver current type 1 attributes. However, in this case, the view associ-
ates the current attribute values with all the durable key's type 2 rows, as illustrated
in Figure 5-15.
[Figure 5-15: Type 7 variation with single surrogate key for dual type 1 and type 2 dimension tables. The fact table carries only Date Key (FK), Product Key (FK), more foreign keys, and facts. The type 2 Product dimension holds Product Key (PK), Durable Product Key, product description, department name, row effective and expiration dates, and a current row indicator. The Current Product dimension is a view of the Product dimension keyed on Product Key (PK) that carries the Durable Product Key, Current Product Description, and Current Department Name.]
Both dimension tables in Figure 5-15 have the same number of rows, but the contents of the tables are different, as shown in Figure 5-16.
Rows in Product dimension:
Product Key | SKU (NK)  | Durable Product Key | Product Description | Department Name   | ... | Row Effective Date | Row Expiration Date | Current Row Indicator
12345       | ABC922-Z  | 12345               | IntelliKidz         | Education         | ... | 2012-01-01         | 2013-01-31          | Expired
25984       | ABC922-Z  | 12345               | IntelliKidz         | Strategy          | ... | 2013-02-01         | 2013-06-30          | Expired
31726       | ABC922-Z  | 12345               | IntelliKidz         | Critical Thinking | ... | 2013-07-01         | 9999-12-31          | Current

Rows in Product dimension’s current view:
Product Key | SKU (NK)  | Durable Product Key | Current Product Description | Current Department Name | ...
12345       | ABC922-Z  | 12345               | IntelliKidz                 | Critical Thinking       | ...
25984       | ABC922-Z  | 12345               | IntelliKidz                 | Critical Thinking       | ...
31726       | ABC922-Z  | 12345               | IntelliKidz                 | Critical Thinking       | ...

Figure 5-16: SCD type 7 variation sample rows.
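In this variation, the current view could instead be defined by joining every type 2 row back to the current row for its durable key, so the current attributes repeat across all rows sharing that durable key (again, physical names are assumed):

-- sketch only: associate current attribute values with every type 2 row
create view current_product_dimension as
select p.product_key,
    p.sku,
    p.durable_product_key,
    c.product_description as current_product_description,
    c.department_name as current_department_name
from product_dimension p
join product_dimension c
    on c.durable_product_key = p.durable_product_key
    and c.current_row_indicator = 'Current'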
Type 7 for Random “As Of” Reporting
Finally, although it’s uncommon, you might be asked to roll up historical facts based on any specific point-in-time profile, in addition to reporting by the attribute values in effect when the fact event occurred or by the attribute’s current values. For example, perhaps the business wants to report three years of historical metrics based on the hierarchy in effect on December 1 of last year. In this case, you can use the dual dimension keys in the fact table to your advantage. First filter on the type 2 dimension row effective and expiration dates to locate the rows in effect on December 1 of last year. With this constraint, a single row for each durable key in the type 2 dimension is identified. Then join this filtered set to the durable key in the fact table to roll up any facts based on the point-in-time attribute values. It’s as if you’re defining the meaning of “current” on-the-fly. Obviously, you must filter on the row effective and expiration dates, or you’ll have multiple type 2 rows for each durable key. Finally, only unveil this capability to a limited, highly analytic audience; this embellishment is not for the timid.
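A sketch of such an “as of” rollup against the Figure 5-14 design, with a hypothetical fact column and a sample date standing in for December 1 of last year:

-- sketch only: roll up facts by the hierarchy in effect on 2012-12-01
select asof.department_name,
    sum(f.fact_amount) as total_amount
from fact_table f
join (select durable_product_key, department_name
      from product_dimension
      where row_effective_date <= date '2012-12-01'
        and row_expiration_date >= date '2012-12-01') asof
    on asof.durable_product_key = f.durable_product_key
group by asof.department_name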
Slowly Changing Dimension Recap
We’ve summarized the techniques for tracking dimension attribute changes in Figure 5-17. This chart highlights the implications of each slowly changing dimension technique on the analysis of performance metrics in the fact table.
SCD Type | Dimension Table Action | Impact on Fact Analysis
Type 0 | No change to attribute value. | Facts associated with attribute’s original value.
Type 1 | Overwrite attribute value. | Facts associated with attribute’s current value.
Type 2 | Add new dimension row for profile with new attribute value. | Facts associated with attribute value in effect when fact occurred.
Type 3 | Add new column to preserve attribute’s current and prior values. | Facts associated with both current and prior attribute alternative values.
Type 4 | Add mini-dimension table containing rapidly changing attributes. | Facts associated with rapidly changing attributes in effect when fact occurred.
Type 5 | Add type 4 mini-dimension, along with overwritten type 1 mini-dimension key in base dimension. | Facts associated with rapidly changing attributes in effect when fact occurred, plus current rapidly changing attribute values.
Type 6 | Add type 1 overwritten attributes to type 2 dimension row, and overwrite all prior dimension rows. | Facts associated with attribute value in effect when fact occurred, plus current values.
Type 7 | Add type 2 dimension row with new attribute value, plus view limited to current rows and/or attribute values. | Facts associated with attribute value in effect when fact occurred, plus current values.
Figure 5-17: Slowly changing dimension techniques summary.
Summary
In this chapter we discussed several approaches to handling procurement data. Effectively managing procurement performance can have a major impact on an organization’s bottom line.
We also introduced techniques to deal with changes to dimension attribute
values. The slowly changing responses range from doing nothing (type 0) to
overwriting the value (type 1) to complicated hybrid approaches (such as types 5
through 7) which combine techniques to support requirements for both historic
attribute preservation and current attribute reporting. You’ll undoubtedly need to
re-read this section as you consider slowly changing dimension attribute strategies
for your DW/BI system.
Order Management
Order management consists of several critical business processes, including order, shipment, and invoice processing. These processes spawn metrics, such as sales volume and invoice revenue, that are key performance indicators for any organization that sells products or services to others. In fact, these foundation metrics are so crucial that DW/BI teams frequently tackle one of the order management processes for their initial implementation. Clearly, the topics in this case study transcend industry boundaries.

In this chapter we’ll explore several different order management transactions, including the common characteristics and complications encountered when dimensionally modeling these transactions. We’ll further develop the concept of an accumulating snapshot to analyze the order fulfillment pipeline from initial order to invoicing.
Chapter 6 discusses the following concepts:
Bus matrix snippet for order management processes
Orders transaction schema
Fact table normalization considerations
Role-playing dimensions
Ship-to/bill-to customer dimension considerations
Factors to determine if single or multiple dimensions
Junk dimensions for miscellaneous flags and indicators versus alternative designs
More on degenerate dimensions
Multiple currencies and units of measure
Handling of facts with different granularity
Patterns to avoid with header and line item transactions
Invoicing transaction schema with profit and loss facts
Audit dimension
Quantitative measures and qualitative descriptors of service level performance
Order fulfillment pipeline as accumulating snapshot schema
Lag calculations
Order Management Bus Matrix
The order management function is composed of a series of business processes. In
its most simplistic form, you can envision a subset of the enterprise data warehouse
bus matrix that resembles Figure 6-1.
Figure 6-1: Bus matrix rows for order management processes.
(Business process rows: Quoting, Ordering, Shipping to Customer, Shipment Invoicing, Receiving Payments, Customer Returns. Conformed dimension columns: Date, Customer, Product, Sales Rep, Deal, Warehouse, Shipper.)
As described in earlier chapters, the bus matrix closely corresponds to the organization’s value chain. In this chapter we’ll focus on the order and invoice rows of the matrix. We’ll also describe an accumulating snapshot fact table to evaluate performance across multiple stages of the overall order fulfillment process.
Order Transactions
The natural granularity for an order transaction fact table is one row for each line
item on an order. The dimensions associated with the orders business process are
order date, requested ship date, product, customer, sales rep, and deal. The facts
include the order quantity and extended order line gross, discount, and net (equal
to the gross amount less discount) dollar amounts. The resulting schema would
look similar to Figure 6-2.
Order Line Transaction Fact: Order Date Key (FK), Requested Ship Date Key (FK), Customer Key (FK), Product Key (FK), Sales Rep Key (FK), Deal Key (FK), Order Number (DD), Order Line Number (DD), Order Line Quantity, Extended Order Line Gross Dollar Amount, Extended Order Line Discount Dollar Amount, Extended Order Line Net Dollar Amount
Associated dimensions: Order Date, Requested Ship Date, Customer, Product, Sales Rep, Deal
Figure 6-2: Order transaction fact table.
Fact Normalization
Rather than storing the list of facts in Figure 6-2, some designers want to further normalize the fact table so there’s a single, generic fact amount along with a dimension that identifies the type of measurement. In this scenario, the fact table granularity is one row per measurement per order line, instead of the more natural one row per order line event. The measurement type dimension would indicate whether the fact is the gross order amount, order discount amount, or some other measure. This technique may make sense when the set of facts is extremely lengthy, but sparsely populated for a given fact row, and no computations are made between facts. You could use this technique to deal with manufacturing quality test data where the facts vary widely depending on the test conducted.
However, you should generally resist the urge to normalize the fact table in this way. Facts usually are not sparsely populated within a row. In the order transaction schema, if you were to normalize the facts, you’d be multiplying the number of rows in the fact table by the number of fact types. For example, assume you started with 10 million order line fact table rows, each with six keys and four facts. If the fact rows were normalized, you’d end up with 40 million fact rows, each with seven keys and one fact. In addition, if any arithmetic function is performed between the facts (such as discount amount as a percentage of gross order amount), it is far easier if the facts are in the same row in a relational star schema because SQL makes it difficult to perform a ratio or difference between facts in different rows.
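For instance, with all the facts on one row, a discount-rate ratio is a simple intra-row calculation; here is a sketch against the Figure 6-2 schema, with assumed physical table names:

-- sketch only: intra-row ratio between two facts stored on the same row
select p.product_description,
    sum(f.extended_order_line_discount_dollar_amount)
        / sum(f.extended_order_line_gross_dollar_amount) as discount_rate
from order_line_transaction_fact f
join product_dimension p
    on p.product_key = f.product_key
group by p.product_description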
In Chapter 14: Healthcare, we’ll explore a situation where a measurement type
dimension makes more sense. This pattern is also more appropriate if the primary
platform supporting BI applications is an OLAP cube; the cube enables computations
that cut the cube along any dimension, regardless of whether it’s a date, product, customer, or measurement type.
Dimension Role Playing
By now you know to expect a date dimension in every fact table because you’re
always looking at performance over time. In a transaction fact table, the primary date
column is the transaction date, such as the order date. Sometimes you discover other
dates associated with each transaction, such as the requested ship date for the order.
Each of the dates should be a foreign key in the fact table, as shown in Figure 6-3.
However, you cannot simply join these two foreign keys to the same date dimension
table. SQL would interpret this two-way simultaneous join as requiring both the
dates to be identical, which isn’t very likely.
Order Line Transaction Fact: Order Date Key (FK), Requested Ship Date Key (FK), Customer Key (FK), Product Key (FK), Sales Rep Key (FK), Deal Key (FK), Order Number (DD), Order Line Number (DD), Order Quantity, Extended Order Line Gross Dollar Amount, Extended Order Line Discount Dollar Amount, Extended Order Line Net Dollar Amount
Date Dimension (single physical table): Date Key, Date, Day of Week, Month, Quarter, ...
Order Date Dimension (logical view or alias of the date dimension): Order Date Key, Order Date, Order Day of Week, Order Month, Order Quarter, ...
Requested Ship Date Dimension (logical view or alias of the date dimension): Requested Ship Date Key, Requested Ship Date, Requested Ship Day of Week, Requested Ship Month, Requested Ship Quarter, ...
Figure 6-3: Role-playing date dimensions.
Even though you cannot literally join to a single date dimension table, you can build and administer a single physical date dimension table. You then create the illusion of two independent date dimensions by using views or aliases. Be careful to uniquely label the columns in each of the views or aliases. For example, the order month attribute should be uniquely labeled to distinguish it from the requested ship month. If you don’t establish unique column names, you wouldn’t be able to tell the columns apart when both are dragged into a report.
As we briefly described in Chapter 3: Retail Sales, we would define the order date and requested ship date views as follows:
create view order_date
(order_date_key, order_day_of_week, order_month, ...)
as select date_key, day_of_week, month, ... from date
and
create view req_ship_date
(req_ship_date_key, req_ship_day_of_week, req_ship_month, ...)
as select date_key, day_of_week, month, ... from date
Alternatively, SQL supports the concept of aliasing. Many BI tools also enable
aliasing within their semantic layer. However, we caution against this approach
if multiple BI tools, along with direct SQL-based access, are used within the
organization.
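As a sketch of the aliasing alternative, a single query can reference the physical date table twice under different aliases; the fact table name is assumed, and the date table follows the view definitions above:

-- sketch only: two roles of the date dimension via table aliases
select od.month as order_month,
    rsd.month as requested_ship_month,
    sum(f.order_line_quantity) as total_quantity
from order_line_transaction_fact f
join date od on od.date_key = f.order_date_key
join date rsd on rsd.date_key = f.requested_ship_date_key
group by od.month, rsd.month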
Regardless of the implementation approach, you now have two unique logical date dimensions that can be used as if they were independent with completely unrelated constraints. This is referred to as role playing because the date dimension simultaneously serves different roles in a single fact table. You’ll see additional examples of dimension role playing sprinkled throughout this book.
NOTE Role playing in a dimensional model occurs when a single dimension
simultaneously appears several times in the same fact table. The underlying dimen-
sion may exist as a single physical table, but each of the roles should be presented
to the BI tools as a separately labeled view.
It’s worth noting that some OLAP products do not support multiple roles of the
same dimension; in this scenario, you’d need to create two separate dimensions for
the two roles. In addition, some OLAP products that enable multiple roles do not
enable attribute renaming for each role. In the end, OLAP environments may be
littered with a plethora of separate dimensions, which are treated simply as roles
in the relational star schema.
To handle the multiple dates, some designers are tempted to create a single date
table with a key for each unique order date and requested ship date combination.
This approach falls apart on several fronts. First, the clean and simple daily date
table with approximately 365 rows per year would balloon in size if it needed to
handle all the date combinations. Second, a combination date table would no longer
conform to the other frequently used daily, weekly, and monthly date dimensions.
Role Playing and the Bus Matrix
The most common technique to document role playing on the bus matrix is to
indicate the multiple roles within a single cell, as illustrated in Figure 6-4. We used
a similar approach in Chapter 4: Inventory for documenting shrunken conformed
dimensions. This method is especially appropriate for the date dimension on the
bus matrix given its numerous logical roles. Alternatively, if the number of roles is
limited and frequently reused across processes, you can create subcolumns within
a single conformed dimension column on the matrix.
Quoting              | Date: Quote Date
Ordering             | Date: Order Date, Requested Ship Date
Shipping to Customer | Date: Shipment Date
Shipment Invoicing   | Date: Invoice Date
Receiving Payments   | Date: Payment Receipt Date
Customer Returns     | Date: Return Date
Figure 6-4: Communicating role-playing dimensions on the bus matrix.
Product Dimension Revisited
Each of the case study vignettes presented so far has included a product dimension. The product dimension is one of the most common and most important dimension tables. It describes the complete portfolio of products sold by a company. In many cases, the number of products in the portfolio turns out to be surprisingly large, at least from an outsider’s perspective. For example, a prominent U.S. manufacturer of dog and cat food tracks more than 25,000 manufacturing variations of its products, including retail products everyone (or every dog and cat) is familiar with, as well as numerous specialized products sold through commercial and veterinary channels. Some durable goods manufacturers, such as window companies, sell millions of unique product configurations.
Most product dimension tables share the following characteristics:
Numerous verbose, descriptive columns. For manufacturers, it’s not unusual to maintain 100 or more descriptors about the products they sell. Dimension table attributes naturally describe the dimension row, do not vary because of the influence of another dimension, and are virtually constant over time, although some attributes do change slowly over time.
One or more attribute hierarchies, plus non-hierarchical attributes. Products typically roll up according to multiple defined hierarchies. The many-to-one fixed depth hierarchical data should be presented in a single flattened, denormalized product dimension table. You should resist creating normalized snowflaked sub-tables; the costs of a more complicated presentation and slower intra-dimension browsing performance outweigh the minimal storage savings benefits. Product dimension tables can have thousands of entries. With so many
rows, it is not too useful to request a pull-down list of the product descriptions. It is essential to have the ability to constrain on one attribute, such as flavor, and then another attribute, such as package type, before attempting to display the product descriptions. Any attributes, regardless of whether they belong to a single hierarchy, should be used freely for browsing and drilling up or down. Many product dimension attributes are standalone low-cardinality attributes, not part of explicit hierarchies.
The existence of an operational product master helps create and maintain the product dimension, but a number of transformations and administrative steps must occur to convert the operational master file into the dimension table, including the following:
Remap the operational product code to a surrogate key. As we discussed in Chapter 3, this meaningless surrogate primary key is needed to avoid havoc caused by duplicate use of an operational product code over time. It also might be necessary to integrate product information sourced from different operational systems. Finally, as you just learned in Chapter 5: Procurement, the surrogate key is needed to track type 2 product attribute changes.
Add descriptive attribute values to augment or replace operational codes.
You shouldn’t accept the excuse that the business users are familiar with the
operational codes. The only reason business users are familiar with codes is that
they have been forced to use them! The columns in a product dimension are
the sole source of query constraints and report labels, so the contents must be
legible. Cryptic abbreviations are as bad as outright numeric codes; they also
should be augmented or replaced with readable text. Multiple abbreviated codes
in a single column should be expanded and separated into distinct attributes.
Quality check the attribute values to ensure no misspellings, impossible
values, or multiple variations. BI applications and reports rely on the precise
contents of the dimension attributes. SQL will produce another line in a report
if the attribute value varies in any way based on trivial punctuation or spelling differences. You should ensure that the attribute values are completely
populated because missing values easily cause misinterpretations. Incomplete
or poorly administered textual dimension attributes lead to incomplete or
poorly produced reports.
Document the attribute definitions, interpretations, and origins in the metadata. Remember that the metadata is analogous to the DW/BI encyclopedia. You must be vigilant about populating and maintaining the metadata repository.
Customer Dimension
The customer dimension contains one row for each discrete location to which you
ship a product. Customer dimension tables can range from moderately sized (thou-
sands of rows) to extremely large (millions of rows) depending on the nature of the
business. A typical customer dimension is shown in Figure 6-5.
Customer Dimension: Customer Key (PK), Customer ID (Natural Key), Customer Name, Customer Ship To Address, Customer Ship To City, Customer Ship To County, Customer Ship To City-State, Customer Ship To State, Customer Ship To ZIP, Customer Ship To ZIP Region, Customer Ship To ZIP Sectional Center, Customer Bill To Name, Customer Bill To Address, Customer Organization Name, Customer Corporate Parent Name, Customer Credit Rating
Figure 6-5: Sample customer dimension.
Several independent hierarchies typically coexist in a customer dimension. The natural geographic hierarchy is clearly defined by the ship-to location. Because the ship-to location is a point in space, any number of geographic hierarchies may be defined by nesting more expansive geographic entities around the point. In the United States, the usual geographic hierarchy is city, county, and state. It is often useful to include a city-state attribute because the same city name exists in multiple states. The ZIP code identifies a secondary geographic breakdown. The first digit of the ZIP code identifies a geographic region of the United States (for example, 0 for the Northeast and 9 for certain western states), whereas the first three digits of the ZIP code identify a mailing sectional center.
Although these geographic characteristics may be captured and managed in a
single master data management system, you should embed the attributes within
the respective dimensions rather than relying on an abstract, generic geography/
location dimension that includes one row for every point in space independent of
the dimensions. We’ll talk more about this in Chapter 11: Telecommunications.
Another common hierarchy is the customer’s organizational hierarchy, assuming
the customer is a corporate entity. For each customer ship-to address, you might
have a customer bill-to and customer parent corporation. For every row in the
customer dimension, both the physical geographies and organizational affiliation are well defined, even though the hierarchies roll up differently.
NOTE It is natural and common, especially for customer-oriented dimensions, for a dimension to simultaneously support multiple independent hierarchies. The hierarchies may have different numbers of levels. Drilling up and drilling down within each of these hierarchies must be supported in a dimensional model.
The alert reader may have a concern with the implied assumption that multiple
ship-tos roll up to a single bill-to in a many-to-one relationship. The real world may
not be quite this clean and simple. There are always a few exceptions involving
ship-to addresses that are associated with more than one bill-to. Obviously, this
breaks the simple hierarchical relationship assumed in Figure 6-5. If this is a rare
occurrence, it would be reasonable to generalize the customer dimension so that the
grain of the dimension is each unique ship-to/bill-to combination. In this scenario,
if there are two sets of bill-to information associated with a given ship-to location,
then there would be two rows in the dimension, one for each combination. On the
other hand, if many of the ship-tos are associated with many bill-tos in a robust
many-to-many relationship, then the ship-to and bill-to customers probably need to
be handled as separate dimensions that are linked together by the fact table. With
either approach, exactly the same information is preserved. We’ll spend more time
on organizational hierarchies, including the handling of variable depth recursive
relationships, in Chapter 7: Accounting.
Single Versus Multiple Dimension Tables
Another potential hierarchy in the customer dimension might be the manufacturer’s sales organization. Designers sometimes question whether sales organization attributes should be modeled as a separate dimension or added to the customer dimension. If sales reps are highly correlated with customers in a one-to-one or many-to-one relationship, combining the sales organization attributes with the customer attributes in a single dimension is a viable approach. The resulting dimension is only as big as the larger of the two dimensions. The relationships between sales teams and customers can be browsed efficiently in the single dimension without traversing the fact table.
However, sometimes the relationship between sales organization and customer
is more complicated. The following factors must be taken into consideration:
Is the one-to-one or many-to-one relationship actually a many-to-many?
As we discussed earlier, if the many-to-many relationship is an exceptional
condition, then you may still be tempted to combine the sales rep attributes
into the customer dimension, knowing multiple surrogate keys are needed to
handle these rare many-to-many occurrences. However, if the many-to-many
relationship is the norm, you should handle the sales rep and customer as
separate dimensions.
Does the sales rep and customer relationship vary over time or under the
influence of another dimension? If so, you’d likely create separate dimensions
for the rep and customer.
Is the customer dimension extremely large? If there are millions of customer
rows, you’d be more likely to treat the sales rep as a separate dimension rather
than forcing all sales rep analysis through a voluminous customer dimension.
Do the sales rep and customer dimensions participate independently in
other fact tables? Again, you’d likely keep the dimensions separate. Creating a
single customer dimension with sales rep attributes exclusively around order
data may cause users to be confused when they’re analyzing other processes
involving sales reps.
Does the business think about the sales rep and customer as separate
things? This factor may be tough to discern and impossible to quantify. But
there’s no sense forcing two critical dimensions into a single blended dimen-
sion if this runs counter to the business’s perspectives.
When entities have a fixed, time-invariant, strongly correlated relationship, they should be modeled as a single dimension. In most other cases, the design likely will be simpler and more manageable when the entities are separated into two dimensions (while remembering the general guidelines concerning too many dimensions). If you’ve already identified 25 dimensions in your schema, you should consider combining dimensions, if possible.
When the dimensions are separate, some designers want to create a little table with just the two dimension keys to show the correlation without using the order fact table. In many scenarios, this two-dimension table is unnecessary. There is no reason to avoid the fact table to respond to this relationship inquiry. Fact tables are incredibly efficient because they contain only dimension keys and measurements, along with the occasional degenerate dimension. The fact table is created specifically to represent the correlations and many-to-many relationships between dimensions.
As we discussed in Chapter 5, you could capture the customer’s currently assigned sales rep by including the relevant descriptors as type 1 attributes. Alternatively, you could use the slowly changing dimension (SCD) type 5 technique by embedding a type 1 foreign key to a sales rep dimension outrigger within the customer dimension; the current values could be presented as if they’re included on the customer dimension via a view declaration.
Factless Fact Table for Customer/Rep Assignments
Before we leave the topic of sales rep assignments to customers, users sometimes
want the ability to analyze the complex assignment of sales reps to customers over
time, even if no order activity has occurred. In this case, you could construct a factless fact table, as illustrated in Figure 6-6, to capture the sales rep coverage. The coverage table would provide a complete map of the historical assignments of sales reps to customers, even if some of the assignments never resulted in a sale. This factless fact table contains dual date keys for the effective and expiration dates of each assignment. The expiration date on the current rep assignment row would reference a special date dimension row that identifies a future, undetermined date.
Sales Rep-Customer Assignment Fact: Assignment Effective Date Key (FK), Assignment Expiration Date Key (FK), Sales Rep Key (FK), Customer Key (FK), Customer Assignment Counter (=1)
Associated dimensions: Date (views for 2 roles), Sales Rep, Customer
Figure 6-6: Factless fact table for sales rep assignments to customers.
You may want to compare the assignments fact table with the order transactions
fact table to identify rep assignments that have not yet resulted in order activity. You
would do so by leveraging SQL’s capabilities to perform set operations (for example,
selecting all the reps in the coverage table and subtracting all the reps in the orders
table) or by writing a correlated subquery.
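A sketch of the set-operation approach, using assumed physical names for the Figure 6-6 and Figure 6-2 tables:

-- sketch only: rep/customer assignments with no corresponding order activity
-- (some platforms spell except as minus)
select a.sales_rep_key, a.customer_key
from sales_rep_customer_assignment_fact a
except
select f.sales_rep_key, f.customer_key
from order_line_transaction_fact f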
Deal Dimension
The deal dimension is similar to the promotion dimension from Chapter 3. The deal dimension describes the incentives offered to customers that theoretically affect the customers’ desire to purchase products. This dimension is also sometimes referred to as the contract. As shown in Figure 6-7, the deal dimension describes the full combination of terms, allowances, and incentives that pertain to the particular order line item.
Deal Dimension: Deal Key (PK), Deal ID (NK), Deal Description, Deal Terms Description, Deal Terms Type Description, Allowance Description, Allowance Type Description, Special Incentive Description, Special Incentive Type Description, Local Budget Indicator
Figure 6-7: Sample deal dimension.
The same issues you faced in the retail promotion dimension also arise with this
deal dimension. If the terms, allowances, and incentives are usefully correlated, it
makes sense to package them into a single deal dimension. If the terms, allowances,
and incentives are quite uncorrelated and you end up generating the Cartesian
product of these factors in the dimension, it probably makes sense to split the deal
dimension into its separate components. Again, this is not an issue of gaining or
losing information because the schema contains the same information in both cases.
The issues of user convenience and administrative complexity determine whether
to represent these deal factors as multiple dimensions. In a very large fact table,
with hundreds of millions or billions of rows, the desire to reduce the number of
keys in the fact table composite key favors treating the deal attributes as a single
dimension, assuming this meshes with the business users’ perspectives. Certainly
any deal dimension smaller than 100,000 rows would be tractable in this design.
Degenerate Dimension for Order Number
Each line item row in the order fact table includes the order number as a degenerate
dimension. Unlike an operational header/line or parent/child database, the order
number in a dimensional model is typically not tied to an order header table. You
can triage all the interesting details from the order header into separate dimensions
such as the order date and customer ship-to. The order number is still useful for
several reasons. It enables you to group the separate line items on the order and
answer questions such as “What is the average number of line items on an order?”
The order number is occasionally used to link the data warehouse back to the
operational world. It may also play a role in the fact table’s primary key. Because
the order number sits in the fact table without joining to a dimension table, it is a
degenerate dimension.
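For example, the average number of line items per order reduces to a simple grouping on the degenerate order number; a sketch with an assumed physical fact table name:

-- sketch only: average number of line items per order
select avg(line_count) as average_lines_per_order
from (select order_number, count(*) as line_count
      from order_line_transaction_fact
      group by order_number) order_line_counts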
NOTE Degenerate dimensions typically are reserved for operational transaction identifiers. They should not be used as an excuse to stick cryptic codes in the fact table without joining to dimension tables for descriptive decodes.
Although there is likely no analytic purpose for the order transaction line num-
ber, it may be included in the fact table as a second degenerate dimension given its
potential role in the primary key, along with the linkage to the operational system
of record. In this case, the primary key for the line item grain fact table would be
the order number and line number.
Sometimes data elements belong to the order itself and do not naturally fall into
other dimension tables. In this situation, the order number is no longer a degenerate
dimension but is a standard dimension with its own surrogate key and attributes.
However, designers with a strong operational background should resist the urge to
simply dump the traditional order header information into an order dimension. In
almost all cases, the header information belongs in other analytic dimensions that
can be associated with the line item grain fact table rather than merely being cast off into a dimension that closely resembles the operational order header record.
Junk Dimensions
When modeling complex transactional source data, you often encounter a number of miscellaneous indicators and flags that are populated with a small range of discrete values. You have several rather unappealing options for handling these low cardinality flags and indicators, including:

Ignore the flags and indicators. You can ask the obligatory question about eliminating these miscellaneous flags because they seem rather insignificant, but this notion is often vetoed quickly because someone occasionally needs them. If the indicators are incomprehensible or inconsistently populated, perhaps they should be left out.

Leave the flags and indicators unchanged on the fact row. You don’t want to store illegible cryptic indicators in the fact table. Likewise, you don’t want to store bulky descriptors on the fact row, which would cause the table to swell alarmingly. It would be a shame to leave a handful of textual indicators on the row.

Make each flag and indicator into its own dimension. Adding separate foreign keys to the fact table is acceptable if the resulting number of foreign keys is still reasonable (no more than 20 or so). However, if the list of foreign keys is already lengthy, you should avoid adding more clutter to the fact table.

Store the flags and indicators in an order header dimension. Rather than treating the order number as a degenerate dimension, you could make it a regular dimension with the low cardinality flags and indicators as attributes. Although this approach accurately represents the data relationships, it is ill-advised, as described below.
An appropriate alternative approach for tackling these flags and indicators is to study them carefully and then pack them into one or more junk dimensions. A junk dimension is akin to the junk drawer in your kitchen. The kitchen junk drawer is a dumping ground for miscellaneous household items, such as rubber bands, paper clips, batteries, and tape. Although it may be easier to locate the rubber bands if a separate kitchen drawer is dedicated to them, you don’t have adequate storage capacity to do so. Besides, you don’t have enough stray rubber bands, nor do you need them frequently, to warrant the allocation of a single-purpose storage space.
The junk drawer provides you with satisfactory access while still retaining storage
space for the more critical and frequently accessed dishes and silverware. In the
dimensional modeling world, the junk dimension nomenclature is reserved for DW/
BI professionals. We typically refer to the junk dimension as a transaction indicator
or transaction profile dimension when talking with the business users.
NOTE A junk dimension is a grouping of low-cardinality flags and indicators. By creating a junk dimension, you remove the flags from the fact table and place them into a useful dimensional framework.
If a single junk dimension has 10 two-value indicators, such as cash versus credit payment type, there would be a maximum of 1,024 (2^10) rows. It probably isn’t interesting to browse among these flags within the dimension because every flag may occur with every other flag. However, the junk dimension is a useful holding place for constraining or reporting on these flags. The fact table would have a single, small surrogate key for the junk dimension.
On the other hand, if you have highly uncorrelated attributes that take on more numerous values, it may not make sense to lump them together into a single junk dimension. Unfortunately, the decision is not entirely formulaic. If you have five indicators that each take on only three values, a single junk dimension is the best route for these attributes because the dimension has only 243 (3^5) possible rows. However, if the five uncorrelated indicators each have 100 possible values, we’d
suggest creating separate dimensions because there are now 10 billion (100^5) possible combinations.
Figure 6-8 illustrates sample rows from an order indicator dimension. A subtle
issue regarding junk dimensions is whether you should create rows for the full
Cartesian product of all the combinations beforehand or create junk dimension
rows for the combinations as you encounter them in the data. The answer depends
on how many possible combinations you expect and what the maximum number
could be. Generally, when the number of theoretical combinations is high and you
don’t expect to encounter them all, you build a junk dimension row at extract time
whenever you encounter a new combination of flags or indicators.
Now that junk dimensions have been explained, contrast them to the handling of the flags and indicators as attributes in an order header dimension. If you want to analyze order facts where the order type is Inbound (refer to Figure 6-8’s junk dimension rows), the fact table would be constrained to order indicator key equals 1, 2, 5, 6, 9, 10, and probably a few others. On the other hand, if these attributes were stored in an order header dimension, the constraint on the fact table would be an enormous list of all order numbers with an inbound order type.
Order Indicator Key | Payment Type Description | Payment Type Group | Order Type | Commission Credit Indicator
1  | Cash       | Cash   | Inbound  | Commissionable
2  | Cash       | Cash   | Inbound  | Non-Commissionable
3  | Cash       | Cash   | Outbound | Commissionable
4  | Cash       | Cash   | Outbound | Non-Commissionable
5  | Visa       | Credit | Inbound  | Commissionable
6  | Visa       | Credit | Inbound  | Non-Commissionable
7  | Visa       | Credit | Outbound | Commissionable
8  | Visa       | Credit | Outbound | Non-Commissionable
9  | MasterCard | Credit | Inbound  | Commissionable
10 | MasterCard | Credit | Inbound  | Non-Commissionable
11 | MasterCard | Credit | Outbound | Commissionable
12 | MasterCard | Credit | Outbound | Non-Commissionable
Figure 6-8: Sample rows of order indicator junk dimension.
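To make the contrast concrete, here is a sketch of the junk dimension constraint described above; it assumes the fact table carries an order indicator key (not shown in Figure 6-2) and hypothetical physical names:

-- sketch only: constrain facts to inbound orders via the junk dimension
select sum(f.extended_order_line_net_dollar_amount) as inbound_net_dollar_amount
from order_line_transaction_fact f
join order_indicator_dimension oi
    on oi.order_indicator_key = f.order_indicator_key
where oi.order_type = 'Inbound'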
Header/Line Pattern to Avoid
There are two common design mistakes to avoid when you model header/line data dimensionally. Unfortunately, both of these patterns still accurately represent the data relationships, so they don’t stick out like a sore thumb. Perhaps equally unfortunate is that both patterns often feel more comfortable to data modelers and ETL team members with significant transaction processing experience than the patterns we advocate. We’ll discuss the first common mistake here; the other is covered in the section “Another Header/Line Pattern to Avoid.”
Figure 6-9 illustrates a header/line modeling pattern we frequently observe when
conducting design reviews. In this example, the operational order header is virtually
replicated in the dimensional model as a dimension. The header dimension contains
all the data from its operational equivalent. The natural key for this dimension is
the order number. The grain of the fact table is one row per order line item, but
there’s not much dimensionality associated with it because most descriptive context
is embedded in the order header dimension.
Although this design accurately represents the header/line relationship, there are obvious flaws. The order header dimension is likely very large, especially relative to the fact table itself. If there are typically five line items per order, the dimension is 20 percent as large as the fact table; there should be orders of magnitude differences between the size of a fact table and its associated dimensions. Also, dimension tables don’t normally grow at nearly the same rate as the fact table. With this design, you would add one row to the dimension table and an average of five rows to the fact table for every new order. Any analysis of the order’s interesting characteristics,
such as the customer, sales rep, or deal involved, would need to traverse this large
dimension table.
Order Header Dimension (1 row per Order Header): Order Number (PK), Order Date, Order Month, ..., Requested Ship Date, Requested Ship Month, ..., Customer ID, Customer Name, ..., Sales Rep Number, Sales Rep Name, ..., Deal ID, Deal Description, ...
Order Line Transaction Fact (1 row per Order Line): Order Number (FK), Product Key (FK), Order Line Number (DD), Order Line Quantity, Extended Order Line Gross Dollar Amount, Extended Order Line Discount Dollar Amount, Extended Order Line Net Dollar Amount
Associated dimension: Product Dimension
Figure 6-9: Pattern to avoid: treating transaction header as a dimension.
Multiple Currencies
Suppose you track the orders of a large multinational U.S.-based company with sales offices around the world. You may be capturing order transactions in more than 15 different currencies. You certainly wouldn’t want to include columns in the fact table for each currency.
The most common analytic requirement is that order transactions be expressed in both the local transaction currency and the standardized corporate currency, such as U.S. dollars in this example. To satisfy this need, each order fact would be replaced with a pair of facts, one for the applicable local currency and another for the equivalent standard corporate currency, as illustrated in Figure 6-10. The conversion rate used to construct each fact row with the dual metrics would depend on the business’s requirements. It might be the rate at the moment the order was captured, an end of day rate, or some other rate based on defined business rules. This technique would preserve the transactional metrics, plus allow all transactions to easily roll up to the corporate currency without complicated reporting application coding. The metrics in standard currency would be fully additive. The local currency metrics would be additive only for a single specified currency; otherwise, you’d be trying to sum Japanese yen, Thai baht, and British pounds. You’d also supplement
the fact table with a currency dimension to identify the currency type associated
with the local currency facts; a currency dimension is needed even if the location
of the transaction is otherwise known because the location does not necessarily
guarantee which currency was used.
Order Line Transaction Fact: Order Date Key (FK), Requested Ship Date Key (FK), Customer Key (FK), Product Key (FK), Sales Rep Key (FK), Deal Key (FK), Local Currency Dimension Key (FK), Order Number (DD), Order Line Number (DD), Order Line Quantity, Extended Order Line Gross USD Amount, Extended Order Line Discount USD Amount, Extended Order Line Net USD Amount, Extended Order Line Gross Local Currency Amount, Extended Order Line Discount Local Currency Amount, Extended Order Line Net Local Currency Amount
Local Currency Dimension: Local Currency Key (PK), Local Currency Name, Local Currency Abbreviation
Figure 6-10: Metrics in multiple currencies within the fact table.
This technique can be expanded to support other relatively common examples.
If the business’s sales offices roll up into a handful of regional centers, you could
supplement the fact table with a third set of metrics representing the transactional
amounts converted into the appropriate regional currency. Likewise, the fact table
columns could represent currencies for the customer ship-to and customer bill-to,
or the currencies as quoted and shipped.
In each of the scenarios, the fact table could physically contain a full set of metrics
in one currency, along with the appropriate currency conversion rate(s) for that row.
Rather than burdening the business users with appropriately multiplying or dividing by the stored rate, the intra-row extrapolation should be done in a view behind
the scenes; all reporting applications would access the facts via this logical layer.
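A sketch of such a view, under the assumption that the fact table physically stores the local currency amounts plus a hypothetical usd_conversion_rate column on each row:

-- sketch only: derive the corporate-currency metrics behind the scenes
create view order_line_currency_fact as
select f.*,
    f.extended_order_line_gross_local_currency_amount * f.usd_conversion_rate
        as extended_order_line_gross_usd_amount,
    f.extended_order_line_discount_local_currency_amount * f.usd_conversion_rate
        as extended_order_line_discount_usd_amount,
    f.extended_order_line_net_local_currency_amount * f.usd_conversion_rate
        as extended_order_line_net_usd_amount
from order_line_transaction_fact f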
Sometimes the multi-currency support requirements are more complicated than
just described. You may need to allow a manager in any country to see order volume
in any currency. In this case, you can embellish the initial design with an additional
currency conversion fact table, as shown in Figure 6-11. The dimensions in this
fact table represent currencies, not countries, because the relationship between
currencies and countries is not one-to-one. The more common needs of the local
sales rep and sales management in headquarters would be met simply by querying
the orders fact table, but those with less predictable requirements would use the
currency conversion table in a specially crafted query. Navigating the currency
conversion table is obviously more complicated than using the converted metrics
on the orders fact table.
Currency Conversion Fact: Conversion Date Key (FK), Source Currency Key (FK), Destination Currency Key (FK), Source-Destination Exchange Rate, Destination-Source Exchange Rate
Figure 6-11: Tracking multiple currencies with daily currency exchange fact table.
Within each currency conversion fact table row, the amount expressed in local
currency is absolutely accurate because the sale occurred in that currency on that
day. The equivalent U.S. dollar value would be based on a conversion rate to U.S.
dollars for that day. The conversion rate table contains the combinations of rel-
evant currency exchange rates going in both directions because the symmetric
rates between two currencies are not equal. It is unlikely this conversion fact table
needs to include the full Cartesian product of all possible currency combinations.
Although there are approximately 100 unique currencies globally, there wouldn’t
need to be 10,000 daily rows in this currency fact table as there’s not a meaningful
market for every possible pair; likewise, all theoretical combinations are probably
overkill for the business users.
The use of a currency conversion table may also be required to support the busi-
ness’s need for multiple rates, such as an end of month or end of quarter close rate,
which may not be defined until long after the transactions have been loaded into
the orders fact table.
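As a sketch of that specially crafted query, local currency amounts can be restated into any requested destination currency by joining through the conversion fact table; the alignment of the conversion date with the order date is a simplifying assumption, as are the physical names:

-- sketch only: restate local-currency order amounts into a chosen destination currency
select sum(f.extended_order_line_net_local_currency_amount
    * c.source_destination_exchange_rate) as net_amount_in_destination_currency
from order_line_transaction_fact f
join currency_conversion_fact c
    on c.conversion_date_key = f.order_date_key
    and c.source_currency_key = f.local_currency_dimension_key
join local_currency_dimension dest
    on dest.local_currency_key = c.destination_currency_key
where dest.local_currency_abbreviation = 'EUR'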
Transaction Facts at Different Granularity
It is quite common in header/line operational data to encounter facts of differing granularity. On an order, there may be a shipping charge that applies to the entire order. The designer’s first response should be to try to force all the facts down to the lowest level, as illustrated in Figure 6-12. This procedure is broadly referred to as allocating. Allocating the parent order facts to the child line item level is critical if you want the ability to slice and dice and roll up all order facts by all dimensions, including product.
Unfortunately, allocating header-level facts down to the line item level may entail
a political wrestling match. It is wonderful if the entire allocation issue is handled by
the finance department, not by the DW/BI team. Getting organizational agreement
on allocation rules is often a controversial and complicated process. The DW/BI team
shouldn’t be distracted and delayed by the inevitable organizational negotiation.
Fortunately, in many companies, the need to rationally allocate costs has already
been recognized. A task force, independent of the DW/BI project, already may have
established activity-based costing measures. This is just another name for allocating.
Order Header Transaction Fact: Order Date Key (FK), Requested Ship Date Key (FK), Customer Key (FK), Sales Rep Key (FK), Deal Key (FK), Order Number (PK), Order Shipping Charges Dollar Amount
Order Line Transaction Fact: Order Date Key (FK), Requested Ship Date Key (FK), Customer Key (FK), Product Key (FK), Sales Rep Key (FK), Deal Key (FK), Order Number (DD), Order Line Number (DD), Order Line Quantity, Extended Order Line Gross Dollar Amount, Extended Order Line Discount Dollar Amount, Extended Order Line Net Dollar Amount, Allocated Order Line Shipping Charges Dollar Amount (allocated from the header)
Figure 6-12: Allocating header facts to line items.
If the shipping charges and other header-level facts cannot be successfully allocated, they must be presented in an aggregate table for the overall order. We clearly prefer the allocation approach, if possible, because the separate higher-level fact table has some inherent usability issues. Without allocations, you cannot explore header facts by product because the product isn’t identified in a header-grain fact table. If you are successful in allocating facts down to the lowest level, the problem goes away.
WARNING You shouldn’t mix fact granularities such as order header and order line facts within a single fact table. Instead, either allocate the higher-level facts to a more detailed level or create two separate fact tables to handle the differently grained facts. Allocation is the preferred approach.
Optimally, the business data stewards obtain enterprise consensus on the allocation
rules. But sometimes organizations refuse to agree. For example, the finance department may want to allocate the header freight charged based on the extended gross
order amount on each line; meanwhile, the logistics group wants the freight charge
to be allocated based on the weight of the line’s products. In this case, you would
have two allocated freight charges on every order line fact table row; the uniquely
calculated metrics would also be uniquely labeled. Obviously, agreeing on a single,
standard allocation scheme is preferable.
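As a sketch of the allocation itself, the header shipping charge can be spread across an order’s lines in proportion to each line’s extended gross amount during ETL; the staging table names are assumptions, standing in for wherever the operational header and line records land:

-- sketch only: allocate the header shipping charge to lines by gross-amount share
select l.order_number,
    l.order_line_number,
    h.order_shipping_charges_dollar_amount
        * l.extended_order_line_gross_dollar_amount
        / sum(l.extended_order_line_gross_dollar_amount)
            over (partition by l.order_number)
        as allocated_order_line_shipping_charges_dollar_amount
from order_line_staging l
join order_header_staging h
    on h.order_number = l.order_number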
Design teams sometimes attempt to devise alternative techniques for handling header/line facts at different granularity, including the following:
Repeat the unallocated header fact on every line. This approach is fraught
with peril given the risk of overstating the header amount when it’s summed
on every line.
Store the unallocated amount on the transaction’s first or last line. This tactic eliminates the risk of overcounting, but if the first or last lines are excluded from the query results due to a filter constraint on the product dimension, it appears there were no header facts associated with this transaction.
Set up a special product key for the header fact. Teams who adopt this
approach sometimes recycle an existing line fact column. For example, if
product key = 99999, then the gross order metric is a header fact, like the
freight charge. Dimensional models should be straightforward and legible. You
don’t want to embed complexities requiring a business user to wear a special
decoder ring to navigate the dimensional model successfully.
Another Header/Line Pattern to Avoid
The second header/line pattern to avoid is illustrated in Figure 6-13. In this example,
the order header is no longer treated as a monolithic dimension but as a fact table
instead. The header’s associated descriptive information is grouped into dimensions surrounding the order fact. The line item fact table (identical in structure and granularity to the first diagram) joins to the header fact based on the order number.
Order Header Transaction Fact: Order Date Key (FK), Requested Ship Date Key (FK), Customer Key (FK), Sales Rep Key (FK), Deal Key (FK), Order Number (PK), Extended Order Total Gross Dollar Amount, Extended Order Total Discount Dollar Amount, Extended Order Total Net Dollar Amount, Order Total Shipping Charges Dollar Amount
Order Line Transaction Fact: Order Number (FK), Order Line Number (DD), Product Key (FK), Order Line Quantity, Extended Order Line Gross Dollar Amount, Extended Order Line Discount Dollar Amount, Extended Order Line Net Dollar Amount
Associated dimensions: Order Date, Requested Ship Date, Deal, Customer, and Sales Rep (joined to the header fact); Product (joined to the line fact)
Figure 6-13: Pattern to avoid: not inheriting header dimensionality in line facts.
Again, this design accurately represents the parent/child relationship of the order
header and line items, but there are still flaws. Every time the user wants to slice
and dice the line facts by any of the header attributes, a large header fact table needs
to be associated with an even larger line fact table.
Invoice Transactions
In a manufacturing company, invoicing typically occurs when products are shipped
from your facility to the customer. Visualize shipments at the loading dock as boxes
of product are placed into a truck destined for a particular customer address. The
invoice associated with the shipment is created at this time. The invoice has mul-
tiple line items, each corresponding to a particular product being shipped. Various
prices, discounts, and allowances are associated with each line item. The extended
net amount for each line item is also available.
Although you don’t show it on the invoice to the customer, a number of other
interesting facts are potentially known about each product at the time of shipment.
You certainly know list prices; manufacturing and distribution costs may be avail-
able as well. Thus you know a lot about the state of your business at the moment
of customer invoicing.
In the invoice fact table, you can see all the company’s products, customers, contracts and deals, off-invoice discounts and allowances, revenue generated by customers, variable and fixed costs associated with manufacturing and delivering products (if available), money left over after delivery of product (profit contribution), and customer satisfaction metrics such as on-time shipment.
NOTE For any company that ships products to customers or bills customers
for services rendered, the optimal place to start a DW/BI project typically is with
invoices. We often refer to invoicing as the most powerful data because it combines
the company’s customers, products, and components of profitability.
You should choose the grain of the invoice fact table to be the individual invoice
line item. A sample invoice fact table associated with manufacturer shipments is
illustrated in Figure 6-14.
As expected, the invoice fact table contains a number of dimensions from earlier
in this chapter. The conformed date dimension table again would play multiple
roles in the fact table. The customer, product, and deal dimensions also would
conform, so you can drill across fact tables using common attributes. If a single
order number is associated with each invoice line item, it would be included as a
second degenerate dimension.
Shipment Invoice Line Transaction Fact: Invoice Date Key (FK), Requested Ship Date Key (FK), Actual Ship Date Key (FK), Customer Key (FK), Product Key (FK), Sales Rep Key (FK), Deal Key (FK), Warehouse Key (FK), Shipper Key (FK), Service Level Key (FK), Invoice Number (DD), Invoice Line Number (DD), Invoice Line Quantity, Extended Invoice Line Gross Dollar Amount, Extended Invoice Line Allowance Dollar Amount, Extended Invoice Line Discount Dollar Amount, Extended Invoice Line Net Dollar Amount, Extended Invoice Line Fixed Mfg Cost Dollar Amount, Extended Invoice Line Variable Mfg Cost Dollar Amount, Extended Invoice Line Storage Cost Dollar Amount, Extended Invoice Line Distribution Cost Dollar Amount, Extended Invoice Line Contribution Dollar Amount, Shipment On-Time Counter, Requested to Actual Ship Lag
Associated dimensions: Date (views for 3 roles), Customer, Product, Sales Rep, Deal, Warehouse, Shipper, Service Level
Figure 6-14: Shipment invoice fact table.
The shipment invoice fact table also contains some interesting new dimensions.
The warehouse dimension contains one row for each manufacturer warehouse loca-
tion. This is a relatively simple dimension with name, address, contact person, and
storage facility type. The attributes are somewhat reminiscent of the store dimension
from Chapter 3. The shipper dimension describes the method and carrier by which
the product was shipped from the manufacturer to the customer.
Service Level Performance as Facts,
Dimensions, or Both
The fact table in Figure 6-14 includes several critical dates intended to capture
shipment service levels. All these dates are known when the operational invoicing
process occurs. Delivering the multiple event dates in the invoicing fact table with
corresponding role-playing date dimensions allows business users to filter, group,
and trend on any of these dates. But sometimes the business requirements are more
demanding.
You could include an additional on-time counter in the fact table that’s set to an
additive zero or one depending on whether the line shipped on time. Likewise, you
could include lag metrics representing the number of days, positive or negative,
between the requested and actual ship dates. As described later in this chapter, the
lag calculation may be more sophisticated than the simple difference between dates.
In addition to the quantitative service metrics, you could also include a qualita-
tive assessment of performance by adding either a new dimension or adding more
columns to the junk dimension. Either way, the attribute values might look similar
to those shown in Figure 6-15.
Service Level Key   Service Level Description   Service Level Group
1                   On-time                     On-time
2                   1 day early                 Early
3                   2 days early                Early
4                   3 days early                Early
5                   > 3 days early              Too early
6                   1 day late                  Late
7                   2 days late                 Late
8                   3 days late                 Late
9                   > 3 days late               Too late
Figure 6-15: Sample qualitative service level descriptors.
If service level performance at the invoice line is closely watched by business
users, you may embrace all the patterns just described, since quantitative metrics
with qualitative text provide different perspectives on the same performance.
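For example, a BI query can combine the additive on-time counter and lag fact from Figure 6-14
with the qualitative service level group. The sketch below assumes hypothetical physical table
and column names loosely based on the figure; your actual names will differ.

-- Hypothetical sketch: on-time shipment rate and average ship lag by customer,
-- using the additive counter and the qualitative service level group.
SELECT c.customer_name,
       sl.service_level_group,
       SUM(f.shipment_on_time_counter)     AS on_time_lines,
       COUNT(*)                            AS total_lines,
       AVG(f.requested_to_actual_ship_lag) AS avg_ship_lag_days
FROM   shipment_invoice_line_fact f
JOIN   customer_dim      c  ON c.customer_key       = f.customer_key
JOIN   service_level_dim sl ON sl.service_level_key = f.service_level_key
JOIN   date_dim          d  ON d.date_key           = f.invoice_date_key
WHERE  d.calendar_year = 2013
GROUP  BY c.customer_name, sl.service_level_group;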
Profit and Loss Facts
If your organization has tackled activity-based costing or implemented a robust
enterprise resource planning (ERP) system, you might be in a position to identify
many of the incremental revenues and costs associated with shipping finished prod-
ucts to the customer. It is traditional to arrange these revenues and costs in sequence
from the top line, which represents the undiscounted value of the products shipped
to the customer, down to the bottom line, which represents the money left over after
discounts, allowances, and costs. This list of revenues and costs is referred to as a
profit and loss (P&L) statement. You typically don't attempt to carry it all the way
to a complete view of company profit including general and administrative costs.
For this reason, the bottom line in the P&L statement is referred to as contribution.
Keeping in mind that each row in the invoice fact table represents a single line
item on the invoice, the elements of the P&L statement shown in Figure 6-14 have
the following interpretations:
Quantity shipped: Number of cases of the particular line item’s product.
The use of multiple equivalent quantities with different units of measure is
discussed in the section “Multiple Units of Measure.”
Extended gross amount: Also known as extended list price because it is the
quantity shipped multiplied by the list unit price. This and all subsequent
dollar values are extended amounts or, in other words, unit rates multiplied by
the quantity shipped. This insistence on additive values simplifies most access
and reporting applications. It is relatively rare for a business user to ask for
the unit price from a single fact table row. When the user wants an average
price drawn from many rows, the extended prices are first added, and then
the result is divided by the sum of the quantities (see the example query
following this list).
Extended allowance amount: Amount subtracted from the invoice line gross
amount for deal-related allowances. The allowances are described in the
adjoined deal dimension. The allowance amount is often called an off-invoice
allowance. The actual invoice may have several allowances for a given line
item; the allowances are combined together in this simplified example. If
the allowances need to be tracked separately and there are potentially many
simultaneous allowances on a given line item, an allowance detail fact table
could augment the invoice line fact table, serving as a drill-down for details
on the allowance total in the invoice line fact table.
Extended discount amount: Amount subtracted for volume or payment term
discounts. The discount descriptions are found in the deal dimension. As
discussed earlier regarding the deal dimension, the decision to describe the
allowances and discount types together is the designer’s prerogative. It makes
sense to do this if allowances and discounts are correlated and business users
want to browse within the deal dimension to study the relationships between
allowances and discounts.
All allowances and discounts in this fact table are represented at the line
item level. As discussed earlier, some allowances and discounts may be cal-
culated operationally at the invoice level, not at the line item level. An effort
should be made to allocate them down to the line item. An invoice P&L state-
ment that does not include the product dimension poses a serious limitation
on your ability to present meaningful contribution slices of the business.
Extended net amount: Amount the customer is expected to pay for this line
item before tax. It is equal to the gross invoice amount less the allowances
and discounts.
The facts described so far likely would be displayed to the customer on the invoice
document. The following cost amounts, leading to a bottom line contribution, are
for internal consumption only.
Extended fixed manufacturing cost: Amount identified by manufacturing as
the pro rata fixed manufacturing cost of the invoice line's product.
Extended variable manufacturing cost: Amount identified by manufacturing
as the variable manufacturing cost of the product on the invoice line. This
amount may be more or less activity-based, reflecting the actual location and
time of the manufacturing run that produced the product being shipped to
the customer. Conversely, this number may be a standard value set by a com-
mittee. If the manufacturing costs or any of the other storage and distribution
costs are averages of averages, the detailed P&Ls may become meaningless.
The DW/BI system may illuminate this problem and accelerate the adoption
of activity-based costing methods.
Extended storage cost: Cost charged to the invoice line for storage prior to
being shipped to the customer.
Extended distribution cost: Cost charged to the invoice line for transportation
from the point of manufacture to the point of shipment. This cost is notori-
ous for not being activity-based. The distribution cost possibly can include
freight to the customer if the company pays the freight, or the freight cost
can be presented as a separate line item in the P&L.
Contribution amount: Extended net invoice less all the costs just discussed. This
is not the true bottom line of the overall company because general and admin-
istrative expenses and other financial adjustments have not been made, but it is
important nonetheless. This column sometimes has alternative labels, such as
margin, depending on the company culture.
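As promised above, here is a sketch of such a rollup query. It assumes hypothetical, shortened
physical names for the Figure 6-14 columns; the key point is that the average net unit price is
derived from summed extended amounts and summed quantities, never by averaging unit prices.

-- Hypothetical sketch: P&L rollup by customer from the invoice line grain.
SELECT c.customer_name,
       SUM(f.extended_gross_amount)        AS gross,
       SUM(f.extended_allowance_amount)    AS allowances,
       SUM(f.extended_discount_amount)     AS discounts,
       SUM(f.extended_net_amount)          AS net,
       SUM(f.extended_contribution_amount) AS contribution,
       SUM(f.extended_net_amount)
         / NULLIF(SUM(f.invoice_line_quantity), 0) AS avg_net_unit_price
FROM   shipment_invoice_line_fact f
JOIN   customer_dim c ON c.customer_key = f.customer_key
GROUP  BY c.customer_name;

Swapping the customer dimension for product or deal yields product or deal profitability with
exactly the same query shape.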
You should step back and admire the robust dimensional model you just built.
You constructed a detailed P&L view of your business, showing all the activity-based
elements of revenue and costs. You have a full equation of profitability. However,
what makes this design so compelling is that the P&L view sits inside a rich dimen-
sional framework of dates, customers, products, and causal factors. Do you want
to see customer profitability? Just constrain and group on the customer dimension
and bring the components of the P&L into the report. Do you want to see product
profitability? Do you want to see deal profitability? All these analyses are equally
easy and take the same analytic form in the BI applications. Somewhat tongue in
cheek, we recommend you not deliver this dimensional model too early in your
career because you will get promoted and won’t be able to work directly on any
more DW/BI systems!
Profitability Words of Warning
We must balance the last paragraph with a more sober note and pass along some
cautionary words of warning. It goes without saying that most of the business users
probably are very interested in granular P&L data that can be rolled up to analyze
customer and product profi tability. The reality is that delivering these detailed
P&L statements often is easier said than done. The problems arise with the cost
facts. Even with advanced ERP implementations, it is fairly common to be unable
to capture the cost facts at this atomic level of granularity. You will face a complex
process of mapping or allocating the original cost data down to the invoice line
level. Furthermore, each type of cost may require a separate extraction from a
source system. Ten cost facts may mean 10 different extract and transformation
programs. Before signing up for mission impossible, be certain to perform a detailed
assessment of what is available and feasible from the source systems. You certainly
don't want the DW/BI team saddled with driving the organization to consensus on
activity-based costing as a side project, on top of managing a number of parallel
extract implementations. If time and organizational patience permit, profitability is
often tackled as a consolidated dimensional model after the components of revenue
and cost have been sourced and delivered separately to business users in the DW/
BI environment.
Audit Dimension
As mentioned, Figure 6-14’s invoice line item design is one of the most powerful
because it provides a detailed look at customers, products, revenues, costs, and
bottom line pro t in one schema. During the building of rows for this fact table,
a wealth of interesting back room metadata is generated, including data quality
indicators, unusual processing requirements, and environment version numbers
that identify how the data was processed during the ETL. Although this metadata is
frequently of interest to ETL developers and IT management, there are times when
it can be interesting to the business users, too. For instance, business users might
want to ask the following:
What is my confidence in these reported numbers?
Were there any anomalous values encountered while processing this source
data?
What version of the cost allocation logic was used when calculating the costs?
What version of the foreign currency conversion rules was used when calcu-
lating the revenues?
These kinds of questions are often hard to answer because the metadata required
is not readily available. However, if you anticipate these kinds of questions, you can
include an audit dimension with any fact table to expose the metadata context that
was true when the fact table rows were built. Figure 6-16 illustrates an example
audit dimension.
The audit dimension is added to the fact table by including an audit dimension
foreign key. The audit dimension itself contains the metadata conditions encountered
when processing fact table rows. It is best to start with a modest audit dimension
design, such as shown in Figure 6-16, both to keep the ETL processing from getting
too complicated and to limit the number of possible audit dimension rows. The first
three attributes (quality indicator, out of bounds indicator, and amount adjusted flag)
are all sourced from a special ETL processing table called the error event table, which
is discussed in Chapter 19: ETL Subsystems and Techniques. The cost allocation and
foreign currency versions are environmental variables that should be available in an
ETL back room status table.
Shipment Invoice Line Transaction Fact: Invoice Date Key (FK), Requested Ship Date Key (FK),
Actual Ship Date Key (FK), Customer Key (FK), Product Key (FK), Sales Rep Key (FK), Deal Key (FK),
Warehouse Key (FK), Shipper Key (FK), Service Level Key (FK), Audit Key (FK), Invoice Number (DD),
Invoice Line Number (DD), Invoice Line Quantity, Extended Invoice Line Gross Dollar Amount, ...
Audit Dimension: Audit Key (PK), Quality Indicator, Out of Bounds Indicator, Amount Adjusted Flag,
Cost Allocation Version, Foreign Currency Version.
Associated dimensions: Date (views for 3 roles), Customer, Product, Sales Rep, Deal, Warehouse,
Shipper, Service Level, Audit.
Figure 6-16: Sample audit dimension included on invoice fact table.
Armed with the audit dimension, some powerful queries can be performed. You
might want to take this morning’s invoice report and ask if any of the reported
numbers were based on out-of-bounds measures. Because the audit dimension is
now just an ordinary dimension, you can just add the out-of-bounds indicator to
your standard report. In the resulting “instrumented” report shown in Figure 6-17,
you see multiple rows showing normal and abnormal out-of-bounds results.
Standard Report:
Product   Warehouse   Invoice Line Quantity   Extended Invoice Line Gross Amount
Axon      East        1,438                   235,000
Axon      West        2,249                   480,000

Instrumented Report (with Out of Bounds Indicator added):
Product   Warehouse   Out of Bounds Indicator   Invoice Line Quantity   Extended Invoice Line Gross Amount
Axon      East        Abnormal                  14                      2,350
Axon      East        Normal                    1,424                   232,650
Axon      West        Abnormal                  675                     144,000
Axon      West        Normal                    1,574                   336,000

Figure 6-17: Audit dimension attribute included on standard report.
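The instrumented report is simply the standard query with one more grouping column drawn
from the audit dimension. A sketch, again with hypothetical physical names:

-- Hypothetical sketch: add the audit dimension's out-of-bounds indicator
-- to an otherwise ordinary product/warehouse rollup.
SELECT p.product_name,
       w.warehouse_name,
       a.out_of_bounds_indicator,
       SUM(f.invoice_line_quantity) AS invoice_line_quantity,
       SUM(f.extended_gross_amount) AS extended_gross_amount
FROM   shipment_invoice_line_fact f
JOIN   product_dim   p ON p.product_key   = f.product_key
JOIN   warehouse_dim w ON w.warehouse_key = f.warehouse_key
JOIN   audit_dim     a ON a.audit_key     = f.audit_key
GROUP  BY p.product_name, w.warehouse_name, a.out_of_bounds_indicator;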
Accumulating Snapshot for Order Fulfillment Pipeline
The order management process can be thought of as a pipeline, especially in a
build-to-order manufacturing business, as illustrated in Figure 6-18. Customers
place an order that goes into the backlog until it is released to manufacturing to be
built. The manufactured products are placed in finished goods inventory and then
shipped to the customers and invoiced. Unique transactions are generated at each
spigot of the pipeline. Thus far we’ve considered each of these pipeline activities as
a separate transaction fact table. Doing so allows you to decorate the detailed facts
generated by each process with the greatest number of detailed dimensions. It also
allows you to isolate analysis to the performance of a single business process, which
is often precisely what the business users want.
Figure 6-18: Order fulfillment pipeline diagram (Orders → Backlog → Release to Manufacturing →
Finished Goods Inventory → Shipment → Invoicing).
However, there are times when business users want to analyze the entire order
fulfillment pipeline. They want to better understand product velocity, or how quickly
products move through the pipeline. The accumulating snapshot fact table provides
this perspective of the business, as illustrated in Figure 6-19. It enables you to see
an updated status and ultimately the final disposition of each order.
The accumulating snapshot complements alternative schemas’ perspectives of
the pipeline. If you're interested in understanding the amount of product flowing
through the pipeline, such as the quantity ordered, produced, or shipped, transac-
tion schemas monitor each of the pipeline's major events. Periodic snapshots would
provide insight into the amount of product sitting in the pipeline, such as the
backorder or finished goods inventories, or the amount of product flowing through
a pipeline spigot during a predefined interval. The accumulating snapshot helps
you better understand the current state of an order, as well as product movement
velocities to identify pipeline bottlenecks and inefficiencies. If you only captured
performance in transaction event fact tables, it would be wildly difficult to calculate
the average number of days to move between milestones.
The accumulating snapshot looks different from the transaction fact tables
designed thus far in this chapter. The reuse of conformed dimensions is to be
expected, but the number of date and fact columns is larger. Each date represents
a major milestone of the fulfillment pipeline. The dates are handled as dimension
roles by creating either physically distinct tables or logically distinct views. The date
dimension needs to have a row for Unknown or To Be Determined because many of
these fact table dates are unknown when a pipeline row is initially loaded. Obviously,
you don’t need to declare all the date columns in the fact table’s primary key.
Order Fulfillment Accumulating Fact: Order Date Key (FK), Backlog Date Key (FK),
Release to Manufacturing Date Key (FK), Finished Inventory Placement Date Key (FK),
Requested Ship Date Key (FK), Scheduled Ship Date Key (FK), Actual Ship Date Key (FK),
Arrival Date Key (FK), Invoice Date Key (FK), Product Key (FK), Customer Key (FK),
Sales Rep Key (FK), Deal Key (FK), Manufacturing Facility Key (FK), Warehouse Key (FK),
Shipper Key (FK), Order Number (DD), Order Line Number (DD), Invoice Number (DD),
Order Quantity, Extended Order Line Dollar Amount, Release to Manufacturing Quantity,
Manufacturing Pass Inspection Quantity, Manufacturing Fail Inspection Quantity,
Finished Goods Inventory Quantity, Authorized to Sell Quantity, Shipment Quantity,
Shipment Damage Quantity, Customer Return Quantity, Invoice Quantity,
Extended Invoice Dollar Amount, Order to Manufacturing Release Lag,
Manufacturing Release to Inventory Lag, Inventory to Shipment Lag, Order to Shipment Lag.
Associated dimensions: Date (views for 9 roles), Product, Customer, Sales Rep, Deal,
Manufacturing Facility, Warehouse, Shipper.
Figure 6-19: Order fulfillment accumulating snapshot fact table.
The fundamental di erence between accumulating snapshots and other fact tables
is that you can revisit and update existing fact table rows as more information becomes
available. The grain of an accumulating snapshot fact table in Figure 6-19 is one row
per order line item. However, unlike the order transaction fact table illustrated in
Figure 6-2 with the same granularity, accumulating snapshot fact rows are modified
while the order moves through the pipeline as more information is collected from
every stage of the cycle.
NOTE Accumulating snapshot fact tables typically have multiple dates repre-
senting the major milestones of the process. However, just because a fact table
has several dates doesn’t dictate that it is an accumulating snapshot. The primary
di erentiator of an accumulating snapshot is that you revisit the fact rows as
activity occurs.
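To make the revisit-and-update behavior concrete, the ETL sketch below shows how an existing
pipeline row might be touched when the shipment milestone occurs. The table, column names,
and literal values are hypothetical and purely illustrative; a real ETL system would update rows
in bulk from the shipment transactions.

-- Hypothetical sketch: when an order line ships, the ETL revisits the
-- existing accumulating snapshot row rather than inserting a new one.
UPDATE order_fulfillment_accumulating_fact
SET    actual_ship_date_key      = 20130714,   -- surrogate key for the ship date (illustrative)
       shipment_quantity         = 96,
       inventory_to_shipment_lag = 3,
       order_to_shipment_lag     = 11
WHERE  order_number = 'ORD-10482'
  AND  order_line_number = 2;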
The accumulating snapshot technique is especially useful when the product mov-
ing through the pipeline is uniquely identified, such as an automobile with a vehicle
identification number, electronics equipment with a serial number, lab specimens
with an identification number, or process manufacturing batches with a lot num-
ber. The accumulating snapshot helps you understand throughput and yield. If the
granularity of an accumulating snapshot is at the serial or lot number, you can see
the disposition of a discrete product as it moves through the manufacturing and test
pipeline. The accumulating snapshot fits most naturally with short-lived processes
with a definite beginning and end. Long-lived processes, such as bank accounts,
are typically better modeled with periodic snapshot fact tables.
Accumulating Snapshots and Type 2 Dimensions
Accumulating snapshots present the latest state of a workflow or pipeline. If the
dimensions associated with an accumulating snapshot contain type 2 attributes,
the fact table should be updated to reference the most current surrogate dimension
key for active pipelines. When a single fact table pipeline row is complete, the row
is typically not revisited to reflect future type 2 changes.
Lag Calculations
The lengthy list of date columns captures the spans of time over which the order is
processed through the fulfillment pipeline. The numerical difference between any
two of these dates is a number that can be usefully averaged over all the dimensions.
These date lag calculations represent basic measures of fulfillment efficiency. You
could build a view on this fact table that calculated a large number of these date
differences and presented them as if they were stored in the underlying table. These
view columns could include metrics such as orders to manufacturing release lag,
manufacturing release to finished goods lag, and order to shipment lag, depending
on the date spans monitored by the organization.
Rather than calculating a simple difference between two dates via a view, the
ETL system may calculate elapsed times that incorporate more intelligence, such
as workday lags that account for weekends and holidays rather than just the raw
number of days between milestone dates. The lag metrics may also be calculated
by the ETL system at a lower level of granularity (such as the number of hours or
minutes between milestone events based on operational timestamps) for short-lived
and closely monitored processes.
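A simple version of the lag view described above might look like the following sketch. The names
are hypothetical, it assumes each role-playing date view exposes a full_date column, and raw
date arithmetic syntax varies by database; workday or timestamp-based lags would instead be
computed in the ETL and stored as facts.

-- Hypothetical sketch: expose raw date-difference lags as if they were stored facts.
CREATE VIEW order_fulfillment_lags AS
SELECT f.order_number,
       f.order_line_number,
       rel.full_date  - ord.full_date AS order_to_mfg_release_lag,
       fin.full_date  - rel.full_date AS mfg_release_to_inventory_lag,
       ship.full_date - ord.full_date AS order_to_shipment_lag
FROM   order_fulfillment_accumulating_fact f
JOIN   order_date_dim    ord  ON ord.date_key  = f.order_date_key
JOIN   release_date_dim  rel  ON rel.date_key  = f.release_to_manufacturing_date_key
JOIN   finished_date_dim fin  ON fin.date_key  = f.finished_inventory_placement_date_key
JOIN   ship_date_dim     ship ON ship.date_key = f.actual_ship_date_key;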
Multiple Units of Measure
Sometimes, di erent functional organizations within the business want to see the
same performance metrics expressed in di erent units of measure. For instance,
manufacturing managers may want to see the product fl ow in terms of pallets or
shipping cases. Sales and marketing managers, on the other hand, may want to see
the quantities in retail cases, scan units (sales packs), or equivalized consumer units
(such as individual cans of soda).
Designers are tempted to bury the unit-of-measure conversion factors, such as
ship case factor, in the product dimension. Business users are then required to
appropriately multiply (or was it divide?) the order quantity by the conversion factor.
Obviously, this approach places a burden on users, in addition to being susceptible
to calculation errors. The situation is further complicated because the conversion
factors may change over time, so users would also need to determine which factor
is applicable at a specific point in time.
Rather than risk miscalculating the equivalent quantities by placing conversion
factors in a dimension table, they should be stored in the fact table instead. In the
orders pipeline fact table, assume you have 10 fundamental quantity facts, in
addition to five units of measure. If you physically store all the facts expressed in
the different units of measure, you end up with 50 (10 × 5) facts in each fact row.
Instead, you can compromise by building an underlying physical row with 10 quan-
tity facts and 4 unit-of-measure conversion factors. You need only four conversion
factors rather than five because the base facts are already expressed in one of the
units of measure. The physical design now has 14 quantity-related facts (10 + 4), as
shown in Figure 6-20. With this design, you can see performance across the value
chain based on different units of measure.
Of course, you would deliver this fact table to the business users through one
or more views. The extra computation involved in multiplying quantities by con-
version factors is negligible; intra-row computations are very efficient. The most
comprehensive view could show all 50 facts expressed in every unit of measure,
but the view could be simplified to deliver only a subset of the quantities in units
of measure relevant to a user. Obviously, each unit of measure's metrics should be
uniquely labeled.
NOTE Packaging all the facts and conversion factors together in the same fact
table row provides the safest guarantee that these factors will be used correctly.
The converted facts are presented in a view(s) to the users.
Order Fulfillment Accumulating Fact: Date Keys (FKs), Product Key (FK), More FKs...,
Order Quantity Shipping Cases, Release to Manufacturing Quantity Shipping Cases,
Manufacturing Pass Inspection Quantity Shipping Cases, Manufacturing Fail Inspection Quantity
Shipping Cases, Finished Goods Inventory Quantity Shipping Cases, Authorized to Sell Quantity
Shipping Cases, Shipment Quantity Shipping Cases, Shipment Damage Quantity Shipping Cases,
Customer Return Quantity Shipping Cases, Invoice Quantity Shipping Cases,
Pallet Conversion Factor, Retail Cases Conversion Factor, Scan Units Conversion Factor,
Equivalized Consumer Units Conversion Factor.
Figure 6-20: Physical fact table supporting multiple units of measure with conversion
factors.
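A delivery view over the physical table in Figure 6-20 might multiply the stored base quantities
by the in-row conversion factors, for example as in this sketch (hypothetical names, showing
only a few of the 50 possible derived columns):

-- Hypothetical sketch: derive equivalent quantities in other units of measure
-- from the shipping-case base facts and the in-row conversion factors.
CREATE VIEW order_fulfillment_multi_uom AS
SELECT f.*,
       f.order_quantity_shipping_cases    * f.pallet_conversion_factor
         AS order_quantity_pallets,
       f.order_quantity_shipping_cases    * f.retail_cases_conversion_factor
         AS order_quantity_retail_cases,
       f.shipment_quantity_shipping_cases * f.scan_units_conversion_factor
         AS shipment_quantity_scan_units
FROM   order_fulfillment_accumulating_fact f;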
Finally, another side benefit of storing these factors in the fact table is it reduces
the pressure on the product dimension table to issue new product rows to reflect
minor conversion factor modifications. These factors, especially if they evolve rou-
tinely over time, behave more like facts than dimension attributes.
Beyond the Rearview Mirror
Much of what we've discussed in this chapter focuses on effective ways to analyze
historical product movement performance. People sometimes refer to these as rear-
view mirror metrics because they enable you to look backward and see where you’ve
been. As the brokerage industry reminds people, past performance is no guarantee of
future results. Many organizations want to supplement these historical performance
metrics with facts from other processes to help project what lies ahead. For example,
rather than focusing on the pipeline at the time an order is received, organizations
are analyzing the key drivers impacting the creation of an order. In a sales organiza-
tion, drivers such as prospecting or quoting activity can be extrapolated to provide
visibility to the expected order activity volume. Many organizations do a better job
collecting the rearview mirror information than they do the early indicators. As these
front window leading indicators are captured, they can be added gracefully to the
DW/BI environment. They’re just more rows on the enterprise data warehouse bus
matrix sharing common dimensions.
Summary
This chapter covered a lengthy laundry list of topics in the context of the order
management process. Multiples were discussed on several fronts: multiple references
to the same dimension in a fact table (role-playing dimensions), multiple equivalent
units of measure, and multiple currencies. We explored several of the common chal-
lenges encountered when modeling header/line transaction data, including facts at
di erent levels of granularity and junk dimensions, plus design patterns to avoid.
We also explored the rich set of facts associated with invoice transactions. Finally,
the order fulfi llment pipeline illustrated the power of accumulating snapshot fact
tables where you can see the updated status of a speci c product or order as it moves
through a fi nite pi peline.
Accounting
Financial analysis spans a variety of accounting applications, including the gen-
eral ledger, as well as detailed subledgers for purchasing and accounts payable,
invoicing and accounts receivable, and fixed assets. Because we’ve already touched
upon purchase orders and invoices earlier in this book, we’ll focus on the general
ledger in this chapter. Given the need for accurate handling of a company’s financial
records, general ledgers were one of the first applications to be computerized decades
ago. Perhaps some of you are still running your business on a 20-year-old ledger
system. In this chapter, we’ll discuss the data collected by the general ledger, both
in terms of journal entry transactions and snapshots at the close of an accounting
period. We’ll also talk about the budgeting process.
Chapter 7 discusses the following concepts:
Bus matrix snippet for accounting processes
General ledger periodic snapshots and journal transactions
Chart of accounts
Period close
Year-to-date facts
Multiple fiscal accounting calendars
Drilling down through a multi-ledger hierarchy
Budgeting chain and associated processes
Fixed depth position hierarchies
Slightly ragged, variable depth hierarchies
Totally ragged hierarchies of indeterminate depth using a bridge table and
alternative modeling techniques
Shared ownership in a ragged hierarchy
Time varying ragged hierarchies
Consolidated fact tables that combine metrics from multiple business processes
Role of OLAP and packaged analytic financial solutions
Accounting Case Study and Bus Matrix
Because finance was an early adopter of technology, it comes as no surprise that
early decision support solutions focused on the analysis of financial data. Financial
analysts are some of the most data-literate and spreadsheet-savvy individuals. Often
their analysis is disseminated or leveraged by many others in the organization.
Managers at all levels need timely access to key financial metrics. In addition to
receiving standard reports, they need the ability to analyze performance trends, vari-
ances, and anomalies with relative speed and minimal effort. Like many operational
source systems, the data in the general ledger is likely scattered among hundreds of
tables. Gaining access to financial data and/or creating ad hoc reports may require
a decoder ring to navigate through the maze of screens. This runs counter to many
organizations' objective to push fiscal responsibility and accountability to the line
managers.
The DW/BI system can provide a single source of usable, understandable finan-
cial information, ensuring everyone is working off the same data with common
definitions and common tools. The audience for financial data is quite diverse in
many organizations, ranging from analysts to operational managers to executives.
For each group, you need to determine which subset of corporate financial data is
needed, in which format, and with what frequency. Analysts and managers want to
view information at a high level and then drill to the journal entries for more detail.
For executives, financial data from the DW/BI system often feeds their dashboard or
scorecard of key performance indicators. Armed with direct access to information,
managers can obtain answers to questions more readily than when forced to work
through a middleman. Meanwhile, finance can turn their attention to information
dissemination and value-added analysis, rather than focusing on report creation.
Improved access to accounting data allows you to focus on opportunities to better
manage risk, streamline operations, and identify potential cost savings. Although it
has cross-organization impact, many businesses focus their initial DW/BI implemen-
tation on strategic, revenue-generating opportunities. Consequently, accounting data
is often not the first subject area tackled by the DW/BI team. Given its proficiency
with technology, the finance department has often already performed magic with
spreadsheets and desktop databases to create workaround analytic solutions, per-
haps to its short-term detriment, as these imperfect interim fixes are likely stressed
to their limits.
Figure 7-1 illustrates an accounting-focused excerpt from an organization's bus
matrix. The dimensions associated with accounting processes, such as the general
ledger account or organizational cost center, are frequently used solely by these
processes, unlike the core customer, product, and employee dimensions which are
used repeatedly across many diverse business processes.
Figure 7-1: Bus matrix rows for accounting processes. Business processes (rows): General Ledger
Transactions, General Ledger Snapshot, Budget, Commitment, Payments, Actual-Budget Variance.
Common dimensions (columns): Date, Ledger, Account, Organization, Budget Line, Commitment
Profile, Payment Profile.
General Ledger Data
The general ledger (G/L) is a core foundation financial system that ties together the
detailed information collected by subledgers or separate systems for purchasing,
payables (what you owe to others), and receivables (what others owe you). As we
work through a basic design for G/L data, you’ll discover the need for two comple-
mentary schemas with periodic snapshot and transaction fact tables.
General Ledger Periodic Snapshot
We’ll begin by delving into a snapshot of the general ledger accounts at the end of
each fiscal period (or month if the fiscal accounting periods align with calendar
months). Referring back to our four-step process for designing dimensional models
(see Chapter 3: Retail Sales), the business process is the general ledger. The grain
of this periodic snapshot is one row per accounting period for the most granular
level in the general ledger’s chart of accounts.
Chart of Accounts
The cornerstone of the general ledger is the chart of accounts. The ledger’s chart of
accounts is the epitome of an intelligent key because it usually consists of a series of
identifiers. For example, the first set of digits may identify the account, account type
(for example, asset, liability, equity, income, or expense), and other account rollups.
Sometimes intelligence is embedded in the account numbering scheme. For example,
account numbers from 1,000 through 1,999 might be asset accounts, whereas account
numbers ranging from 2,000 to 2,999 may identify liabilities. Obviously, in the data
warehouse, you’d include the account type as a dimension attribute rather than forc-
ing users to filter on the first digit of the account number.
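For example, the ETL can decode the intelligent key once so users never parse account numbers
themselves. A sketch, assuming a hypothetical staging table and the illustrative number ranges
mentioned above:

-- Hypothetical sketch: derive the account type attribute from the
-- intelligent account number during dimension processing.
SELECT account_number,
       account_name,
       CASE
         WHEN account_number BETWEEN 1000 AND 1999 THEN 'Asset'
         WHEN account_number BETWEEN 2000 AND 2999 THEN 'Liability'
         ELSE 'Other'
       END AS account_type
FROM   stg_chart_of_accounts;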
The chart of accounts likely associates the organization cost center with the
account. Typically, the organization attributes provide a complete rollup from cost
center to department to division, for example. If the corporate general ledger com-
bines data across multiple business units, the chart of accounts would also indicate
the business unit or subsidiary company.
Obviously, charts of accounts vary from organization to organization. They’re
often extremely complicated, with hundreds or even thousands of cost centers in
large organizations. In this case study vignette, the chart of accounts naturally
decomposes into two dimensions. One dimension represents accounts in the general
ledger, whereas the other represents the organization rollup.
The organization rollup may be a fixed depth hierarchy, which would be handled
as separate hierarchical attributes in the cost center dimension. If the organization
hierarchy is ragged with an unbalanced rollup structure, you need the more power-
ful variable depth hierarchy techniques described in the section "Ragged Variable
Depth Hierarchies."
If you are tasked with building a comprehensive general ledger spanning multiple
organizations in the DW/BI system, you should try to conform the chart of accounts
so the account types mean the same thing across organizations. At the data level,
this means the master conformed account dimension contains carefully defined
account names. Capital Expenditures and Office Supplies need to have the same
financial meaning across organizations. Of course, this kind of conformed dimen-
sion has an old and familiar name in financial circles: the uniform chart of accounts.
The G/L sometimes tracks financial results for multiple sets of books or sub-
ledgers to support different requirements, such as taxation or regulatory agency
reporting. You can treat this as a separate dimension because it's such a fundamen-
tal filter, but we alert you to carefully read the cautionary note in the next section.
Period Close
At the end of each accounting period, the finance organization is responsible for
finalizing the financial results so that they can be officially reported internally
and externally. It typically takes several days at the end of each period to recon-
cile and balance the books before they can be closed with finance's official stamp
of approval. From there, finance's focus turns to reporting and interpreting the
results. It often produces countless reports and responds to countless variations
on the same questions each month.
Financial analysts are constantly looking to streamline the processes for period-
end closing, reconciliation, and reporting of general ledger results. Although
operational general ledger systems often support these requisite capabilities, they
may be cumbersome, especially if you’re not dealing with a modern G/L. This chap-
ter focuses on easily analyzing the closed financial results, rather than facilitating
the close. However, in many organizations, general ledger trial balances are loaded
into the DW/BI system leveraging the capabilities of the DW/BI presentation area
to find the needles in the general ledger haystack, and then making the appropriate
operational adjustments before the period ends.
The sample schema in Figure 7-2 shows general ledger account balances at the
end of each accounting period, which would be very useful for many kinds of finan-
cial analyses, such as account rankings, trending patterns, and period-to-period
comparisons.
General Ledger Snapshot Fact: Accounting Period Key (FK), Ledger Key (FK), Account Key (FK),
Organization Key (FK), Period End Balance Amount, Period Debit Amount, Period Credit Amount,
Period Net Change Amount.
Accounting Period Dimension: Accounting Period Key (PK), Accounting Period Number,
Accounting Period Description, Accounting Period Fiscal Year.
Ledger Dimension: Ledger Key (PK), Ledger Book Name.
Account Dimension: Account Key (PK), Account Name, Account Category, Account Type.
Organization Dimension: Organization Key (PK), Cost Center Name, Cost Center Number,
Department Name, Department Number, Division Name, Business Unit Name, Company Name.
Figure 7-2: General ledger periodic snapshot.
For the moment, we’re just representing actual ledger facts in the Figure 7-2
schema; we’ll expand our view to cover budget data in the section “Budgeting
Process.” In this table, the balance amount is a semi-additive fact. Although the
balance doesn’t represent G/L activity, we include the fact in the design because
it is so useful. Otherwise, you would need to go back to the beginning of time to
calculate an accurate end-of-period balance.
WARNING The ledger dimension is a convenient and intuitive dimension
that enables multiple ledgers to be stored in the same fact table. However, every
query that accesses this fact table must constrain the ledger dimension to a single
value (for example, Final Approved Domestic Ledger) or the queries will double
count values from the various ledgers in this table. The best way to deploy this
schema is to release separate views to the business users with the ledger dimension
pre-constrained to a single value.
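A pre-constrained view of the kind the warning recommends might look like this sketch
(hypothetical names; the ledger label is the illustrative value used above):

-- Hypothetical sketch: publish one view per ledger so users cannot
-- accidentally double count across ledgers.
CREATE VIEW gl_snapshot_final_domestic AS
SELECT f.*
FROM   general_ledger_snapshot_fact f
JOIN   ledger_dim l ON l.ledger_key = f.ledger_key
WHERE  l.ledger_book_name = 'Final Approved Domestic Ledger';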
The two most important dimensions in the proposed general ledger design are
account and organization. The account dimension is carefully derived from the
uniform chart of accounts in the enterprise. The organization dimension describes
the financial reporting entities in the enterprise. Unfortunately, these two crucial
dimensions almost never conform to operational dimensions such as customer,
product, service, or facility. This leads to a characteristic but unavoidable business
user frustration that the “GL doesn’t tie to my operational reports.” It is best to gently
explain this to the business users in the interview process, rather than promising
to fix it because this is a deep-seated issue in the underlying data.
Year-to-Date Facts
Designers are often tempted to store “to-date” columns in fact tables. They think
it would be helpful to store quarter-to-date or year-to-date additive totals on each
fact row so they don't need to calculate them. Remember that numeric facts must
be consistent with the grain. To-date facts are not true to the grain and are fraught
with peril. When fact rows are queried and summarized in arbitrary ways, these
untrue-to-the-grain facts produce nonsensical, overstated results. They should be
left out of the relational schema design and calculated in the BI reporting application
instead. It’s worth noting that OLAP cubes handle to-date metrics more gracefully.
NOTE In general, “to-date” totals should be calculated, not stored in the
fact table.
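In the BI layer (or an OLAP cube), a year-to-date total can be computed on the fly from the
true-to-the-grain periodic facts, for instance with a window function. A sketch with hypothetical
names based on the Figure 7-2 schema:

-- Hypothetical sketch: compute year-to-date net change at query time
-- instead of storing an untrue-to-the-grain fact.
SELECT a.account_name,
       p.accounting_period_number,
       SUM(f.period_net_change_amount) AS period_net_change,
       SUM(SUM(f.period_net_change_amount)) OVER (
         PARTITION BY a.account_name, p.accounting_period_fiscal_year
         ORDER BY p.accounting_period_number
       ) AS net_change_ytd
FROM   general_ledger_snapshot_fact f
JOIN   account_dim           a ON a.account_key           = f.account_key
JOIN   accounting_period_dim p ON p.accounting_period_key = f.accounting_period_key
GROUP  BY a.account_name, p.accounting_period_fiscal_year, p.accounting_period_number;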
Multiple Currencies Revisited
If the general ledger consolidates data that has been captured in multiple curren-
cies, you would handle it much as we discussed in Chapter 6: Order Management.
With financial data, you typically want to represent the facts both in terms of the
local currency, as well as a standardized corporate currency. In this case, each
fact table row would represent one set of fact amounts expressed in local currency
and a separate set of fact amounts on the same row expressed in the equivalent
corporate currency. Doing so allows you to easily summarize the facts in a com-
mon corporate currency without jumping through hoops in the BI applications.
Of course, you'd also add a currency dimension as a foreign key in the fact table
to identify the local currency type.
General Ledger Journal Transactions
While the end-of-period snapshot addresses a multitude of financial analyses, many
users need to dive into the underlying details. If an anomaly is identified at the
summary level, analysts want to look at the detailed transactions to sort through
the issue. Others need access to the details because the summarized monthly bal-
ances may obscure large disparities at the granular transaction level. Again, you
can complement the periodic snapshot with a detailed journal entry transaction
schema. Of course, the accounts payable and receivable subledgers may contain
transactions at progressively lower levels of detail, which would be captured in
separate fact tables with additional dimensionality.
The grain of the fact table is now one row for every general ledger journal entry
transaction. The journal entry transaction identifies the G/L account and the appli-
cable debit or credit amount. As illustrated in Figure 7-3, several dimensions from
the last schema are reused, including the account and organization. If the ledger
tracks multiple sets of books, you'd also include the ledger/book dimension. You
would normally capture journal entry transactions by transaction posting date,
so use a daily-grained date table in this schema. Depending on the business rules
associated with the source data, you may need a second role-playing date dimension
to distinguish the posting date from the effective accounting date.
General Ledger Journal Entry Fact: Post Date Key (FK), Journal Entry Effective Date/Time,
Ledger Key (FK), Account Key (FK), Organization Key (FK), Debit-Credit Indicator Key (FK),
Journal Entry Number (DD), Journal Entry Amount.
Debit-Credit Indicator Dimension: Debit-Credit Indicator Key (PK), Debit-Credit Indicator Description.
Associated dimensions: Post Date, Ledger, Account, Organization, Debit-Credit Indicator.
Figure 7-3: General ledger journal entry transactions.
The journal entry number is likely a degenerate dimension with no linkage to
an associated dimension table. If the journal entry numbers from the source are
ordered, then this degenerate dimension can be used to order the journal entries
because the calendar date dimension on this fact table is too coarse to provide this
sorting. If the journal entry numbers do not easily support the sort, then an effective
date/time stamp must be added to the fact table. Depending on the source data, you
may have a journal entry transaction type and even a description. In this situation,
you would create a separate journal entry transaction profile dimension (not shown).
Assuming the descriptions are not just freeform text, this dimension would have
significantly fewer rows than the fact table, which would have one row per journal
entry line. The specific journal entry number would still be treated as degenerate.
Each row in the journal entry fact table is identified as either a credit or a debit.
The debit/credit indicator takes on two, and only two, values.
Multiple Fiscal Accounting Calendars
In Figure 7-3, the data is captured by posting date, but users may also want to
summarize the data by fiscal accounting period. Unfortunately, fiscal accounting peri-
ods often do not align with standard Gregorian calendar months. For example, a
company may have 13 four-week accounting periods in a fiscal year that begins on
September 1 rather than 12 monthly periods beginning on January 1. If you deal
with a single fiscal calendar, then each day in a year corresponds to a single calendar
month, as well as a single accounting period. Given these relationships, the calendar
and accounting periods are merely hierarchical attributes on the daily date dimen-
sion. The daily date dimension table would simultaneously conform to a calendar
month dimension table, as well as to a fiscal accounting period dimension table.
In other situations, you may deal with multiple fiscal accounting calendars that
vary by subsidiary or line of business. If the number of unique fiscal calendars is a
fixed, low number, then you can include each set of uniquely labeled fiscal calendar
attributes on a single date dimension. A given row in the daily date dimension would
be identified as belonging to accounting period 1 for subsidiary A but accounting
period 7 for subsidiary B.
In a more complex situation with a large number of different fiscal calendars,
you could identify the official corporate fiscal calendar in the date dimension. You
then have several options to address the subsidiary-specific fiscal calendars. The
most common approach is to create a date dimension outrigger with a multipart key
consisting of the date and subsidiary keys. There would be one row in this table for
each day for each subsidiary. The attributes in this outrigger would consist of fis-
cal groupings (such as fiscal week end date and fiscal period end date). You would
need a mechanism for filtering on a specific subsidiary in the outrigger. Doing so
through a view would then allow the outrigger to be presented as if it were logically
part of the date dimension table.
A second approach for tackling the subsidiary-specific calendars would be to
create separate physical date dimensions for each subsidiary calendar, using a
common set of surrogate date keys. This option would likely be used if the fact
data were decentralized by subsidiary. Depending on the BI tool's capabilities, it
may be easier to either filter on the subsidiary outrigger as described in option
1 or ensure usage of the appropriate subsidiary-specific physical date dimension
table (option 2). Finally, you could allocate another foreign key in the fact table to
a subsidiary fiscal period dimension table. The number of rows in this table would
be the number of fiscal periods (approximately 36 for 3 years) times the number of
unique calendars. This approach simplifies user access but puts additional strain
on the ETL system because it must insert the appropriate fiscal period key during
the transformation process.
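For the outrigger approach (option 1), a per-subsidiary view can present the fiscal attributes as
if they lived on the date dimension. A sketch with hypothetical table, column, and key values:

-- Hypothetical sketch: fold one subsidiary's fiscal calendar outrigger into a
-- date view so BI users see a single, simple date dimension.
CREATE VIEW date_dim_subsidiary_a AS
SELECT d.*,
       o.fiscal_week_end_date,
       o.fiscal_period_end_date,
       o.fiscal_period_number
FROM   date_dim d
JOIN   subsidiary_fiscal_calendar_outrigger o
       ON  o.date_key       = d.date_key
       AND o.subsidiary_key = 101;    -- surrogate key for subsidiary A (illustrative)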
Drilling Down Through a Multilevel Hierarchy
Very large enterprises or government agencies may have multiple ledgers arranged
in an ascending hierarchy, perhaps by enterprise, division, and department. At the
lowest level, department ledger entries may be consolidated to roll up to a single
division ledger entry. Then the division ledger entries may be consolidated to the
enterprise level. This would be particularly common for the periodic snapshot grain
of these ledgers. One way to model this hierarchy is by introducing the parent snap-
shot’s fact table surrogate key in the fact table, as shown in Figure 7-4. In this case,
because you define a parent/child relationship between rows, you add an explicit
fact table surrogate key, a single column numeric identifier incremented as you add
rows to the fact table.
General Ledger Snapshot Fact: Fact Table Surrogate Key (PK), Accounting Period Key (FK),
Ledger Key (FK), Account Key (FK), Organization Key (FK), Parent Snapshot Key (FK),
Period End Balance Amount, Period Debit Amount, Period Credit Amount, Period Net Change Amount.
Associated dimensions: Accounting Period, Ledger, Account, Organization.
Figure 7-4: Design for drilling down through multiple ledgers.
You can use the parent snapshot surrogate key to drill down in your multilayer
general ledger. Suppose that you detect a large travel amount at the top level of the
ledger. You grab the surrogate key for that high-level entry and then fetch all the entries
whose parent snapshot key equals that key. This exposes the entries at the next lower
level that contribute to the original high-level record of interest. The SQL would look
something like this:
SELECT * FROM GL_Fact
WHERE Parent_Snapshot_Key IN
  (SELECT f.Fact_Table_Surrogate_Key
   FROM GL_Fact f, Account a
   WHERE f.Account_Key = a.Account_Key   -- join the fact row to its account dimension row
     AND a.Account = 'Travel' AND f.Amount > 1000)
Financial Statements
One of the primary functions of a general ledger system is to produce the organiza-
tion's official financial reports, such as the balance sheet and income statement. The
operational system typically handles the production of these reports. You wouldn’t
want the DW/BI system to attempt to replace the reports published by the opera-
tional financial systems.
However, DW/BI teams sometimes create complementary aggregated data that
provides simplified access to report information that can be more widely dissemi-
nated throughout the organization. Dimensions in the financial statement schema
would include the accounting period and cost center. Rather than looking at general
ledger account level data, the fact data would be aggregated and tagged with the
appropriate financial statement line number and label. In this manner, managers
could easily look at performance trends for a given line in the financial statement
over time for their organization. Similarly, key performance indicators and financial
ratios may be made available at the same level of detail.
Budgeting Process
Most modern general ledger systems include the capability to integrate budget data
into the general ledger. However, if the G/L either lacks this capability or it has not
been implemented, you need to provide an alternative mechanism for supporting
the budgeting process and variance comparisons.
Within most organizations, the budgeting process can be viewed as a series of
events. Prior to the start of a fiscal year, each cost center manager typically creates a
budget, broken down by budget line items, which is then approved. In reality, bud-
geting is seldom simply a once-per-year event. Budgets are becoming more dynamic
because there are budget adjustments as the year progresses, reflecting changes in
business conditions or the realities of actual spending versus the original budget.
Managers want to see the current budget's status, as well as how the budget has
been altered since the first approved version. As the year unfolds, commitments to
spend the budgeted monies are made. Finally, payments are processed.
As a dimensional modeler, you can view the budgeting chain as a series of fact
tables, as shown in Figure 7-5. This chain consists of a budget fact table, commit-
ments fact table, and payments fact table, where there is a logical flow that starts
with a budget being established for each organization and each account. Then dur-
ing the operational period, commitments are made against the budgets, and finally
payments are made against those commitments.
We'll begin with the budget fact table. For an expense budget line item, each row
identifies what an organization in the company is allowed to spend for what purpose
during a given time frame. Similarly, if the line item reflects an income forecast,
which is just another variation of a budget, it would identify what an organization
intends to earn from what source during a time frame.
You could further identify the grain to be a snapshot of the current status of
each line item in each budget each month. Although this grain has a familiar ring
to it (because it feels like a management report), it is a poor choice as the fact table
Accounting 211
grain. The facts in such a “status report” are all semi-additive balances, rather than
fully additive facts. Also, this grain makes it difficult to determine how much has
changed since the previous month or quarter because you must obtain the rows
from several time periods and then subtract them from each other. Finally, this grain
choice would require the fact table to contain many duplicated rows when nothing
changes in successive months for a given line item.
Budget Fact: Month Key (FK), Organization Key (FK), Account Key (FK), Budget Key (FK),
Budget Amount.
Commitment Fact: Month Key (FK), Organization Key (FK), Account Key (FK), Budget Key (FK),
Commitment Key (FK), Commitment Amount.
Payment Fact: Month Key (FK), Organization Key (FK), Account Key (FK), Budget Key (FK),
Commitment Key (FK), Payment Key (FK), Payment Amount.
Budget Dimension: Budget Key (PK), Budget Name, Budget Version, Budget Approval Date.
Commitment Dimension: Commitment Key (PK), Commitment Description, Commitment Party.
Payment Dimension: Payment Key (PK), Payment Description, Payment Party.
Shared dimensions: Month, Organization, Account.
Figure 7-5: Chain of budget processes.
Instead, the grain you’re interested in is the net change of the budget line item
in an organizational cost center that occurred during the month. Although this
su ces for budget reporting purposes, the accountants eventually need to tie the
budget line item back to a specifi c general ledger account that’s a ected, so you’ll
also go down to the G/L account level.
Given the grain, the associated budget dimensions would include effective
month, organization cost center, budget line item, and G/L account, as illustrated
in Figure 7-6. The organization is identical to the dimension used earlier with the
general ledger data. The account dimension is also a reused dimension. The only
complication regarding the account dimension is that sometimes a single budget
line item impacts more than one G/L account. In that case, you would need to
allocate the budget line to the individual G/L accounts. Because the grain of the
budget fact table is by G/L account, a single budget line for a cost center may be
represented as several rows in the fact table.
Budget Fact: Budget Effective Date Key (FK), Budget Line Item Key (FK), Account Key (FK),
Organization Key (FK), Budget Amount.
Effective Date Dimension: Budget Effective Date Key (PK), Budget Effective Date Month,
Budget Effective Date Year, ...
Budget Line Item Dimension: Budget Line Item Key (PK), Budget Name, Budget Version,
Budget Line Description, Budget Year, Budget Line Subcategory Description,
Budget Line Category Description.
Associated dimensions: Effective Date, Budget Line Item, Account, Organization.
Figure 7-6: Budget schema.
The budget line item identifies the purpose of the proposed spending, such as
employee wages or office supplies. There are typically several levels of summariza-
tion categories associated with a budget line item. All the budget line items may not
have the same number of levels in their summarization hierarchy, such as when some
only have a category rollup, but not a subcategory. In this case, you may populate the
dimension attributes by replicating the category name in the subcategory column to
avoid having line items roll up to a Not Applicable subcategory bucket. The budget
line item dimension would also identify the budget year and/or budget version.
The effective month is the month during which the budget changes are posted.
The first entries for a given budget year would show the effective month when the
budget is first approved. If the budget is updated or modified as the budget year
gets underway, the effective months would occur during the budget year. If you
don't adjust a budget throughout the year, then the only entries would be the first
ones when the budget is initially approved. This is what is meant when the grain
is specified to be the net change. It's critical that you understand this point, or you
won’t understand what is in this budget fact table or how it’s used.
Sometimes budgets are created as annual spending plans; other times, they’re
broken down by month or quarter. Figure 7-6 assumes the budget is an annual
amount, with the budget year identified in the budget line item dimension. If you
need to express the budget data by spending month, you would need to include a
second month dimension table that plays the role of spending month.
The budget fact table has a single budget amount fact that is fully additive. If you
budget for a multinational organization, the budget amount may be tagged with the
expected currency conversion factor for planning purposes. If the budget amount
for a given budget line and account is modified during the year, an additional row
is added to the budget fact table representing the net change. For example, if the
original budget were $200,000, you might have another row in June for a $40,000
increase and then another in October for a negative $25,000 as you tighten your
belt going into year-end.
When the budget year begins, managers make commitments to spend the budget
through purchase orders, work orders, or other forms of contracts. Managers are
keenly interested in monitoring their commitments and comparing them to the
annual budget to manage their spending. We can envision a second fact table for
the commitments (refer to Figure 7-5) that shares the same dimensions, in addi-
tion to dimensions identifying the specific commitment document (purchase order,
work order, or contract) and commitment party. In this case, the fact would be the
committed amount.
Finally, payments are made as monies are transferred to the party named in the
commitment. From a practical point of view, the money is no longer available in
the budget when the commitment is made. But the finance department is interested
in the relationship between commitments and payments because it manages the
company’s cash. The dimensions associated with the payments fact table would
include the commitment fact table dimensions, plus a payment dimension to identify
the type of payment, as well as the payee to whom the payment was actually made.
Referring to the budgeting chain shown in Figure 7-5, the list of dimensions expands
as you move from the budget to commitments to payments.
With this design, you can create a number of interesting analyses. To look at
the current budgeted amount by department and line item, you can constrain
on all dates up to the present, adding the amounts by department and line item.
Because the grain is the net change of the line items, adding up all the entries
over time does exactly the right thing. You end up with the current approved
budget amount, and you get exactly those line items in the given departments
that have a budget.
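As a concrete illustration, here is a minimal sketch of that current-budget query. The table and column names echo Figure 7-6 but are assumptions rather than a specific source system, and the date filter presumes a budget_effective_date attribute in the effective date dimension.

SELECT o.organization_name,
       b.budget_line_description,
       SUM(f.budget_amount) AS current_budget_amount   -- net-change rows sum to the current approved amount
FROM budget_fact f
JOIN organization_dim o
  ON o.organization_key = f.organization_key
JOIN budget_line_item_dim b
  ON b.budget_line_item_key = f.budget_line_item_key
JOIN budget_effective_date_dim d
  ON d.budget_effective_date_key = f.budget_effective_date_key
WHERE d.budget_effective_date <= CURRENT_DATE          -- all entries posted up to the present
GROUP BY o.organization_name, b.budget_line_description;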
To ask for all the changes to the budget for various line items, simply constrain
on a single month. You’ll report only those line items that experienced a change
during the month.
To compare current commitments to the current budget, separately sum the
commitment amounts and budget amounts from the beginning of time to the cur-
rent date (or any date of interest). Then combine the two answer sets on the row
headers. This is a standard drill-across application using multipass SQL. Similarly,
you could drill across commitments and payments.
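A hedged sketch of the two passes follows; budget_fact and commitment_fact echo Figure 7-5's budget chain, but the names are assumptions and the to-date constraint is omitted for brevity. Each pass is summed separately, and the BI tool (or a subsequent step) merges the two answer sets on the shared row headers.

-- Pass 1: budget summed to date by the shared row headers
SELECT organization_key, account_key,
       SUM(budget_amount) AS budget_to_date
FROM budget_fact
GROUP BY organization_key, account_key;

-- Pass 2: commitments summed to date by the same row headers
SELECT organization_key, account_key,
       SUM(commitment_amount) AS commitments_to_date
FROM commitment_fact
GROUP BY organization_key, account_key;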
Dimension Attribute Hierarchies
Although the budget chain use case described in this chapter is reasonably simple,
it contains a number of hierarchies, along with a number of choices for the designer.
Remember, a hierarchy is defined by a series of many-to-one relationships. You likely
have at least four hierarchies: calendar levels, account levels, geographic levels, and
organization levels.
Fixed Depth Positional Hierarchies
In the budget chain, the calendar levels are familiar fixed depth position hierarchies.
As the name suggests, a fixed position hierarchy has a fixed set of levels, all with
meaningful labels. Think of these levels as rollups. One calendar hierarchy may
be day → fiscal period → year. Another could be day → month → year. These two
hierarchies may be different if there is no simple relationship between fiscal periods
and months. For example, some organizations have 5-4-4 fiscal periods, consisting
of a 5-week span followed by two 4-week spans. A single calendar date dimension
can comfortably represent these two hierarchies at the same time in sets of parallel
attributes since the grain of the date dimension is the individual day.
The account dimension may also have a fixed many-to-one hierarchy such as
executive level, director level, and manager level accounts. The grain of the dimen-
sion is the manager level account, but the detailed accounts at the lowest grain roll
up to the director and executive levels.
In a fixed position hierarchy, it is important that each level have a specific name.
That way the business user knows how to constrain and interpret each level.
WARNING Avoid fixed position hierarchies with abstract names such as Level-1,
Level-2, and so on. This is a cheap way to avoid correctly modeling a ragged hierar-
chy. When the levels have abstract names, the business user has no way of knowing
where to place a constraint, or what the attribute values in a level mean in a report.
If a ragged hierarchy attempts to hide within a fixed position hierarchy with abstract
names, the individual levels are essentially meaningless.
Slightly Ragged Variable Depth Hierarchies
Geographic hierarchies present an interesting challenge. Figure 7-7 shows three
possibilities. The simple location has four levels: address, city, state, and country.
The medium complex location adds a zone level, and the complex location adds
both district and zone levels. If you need to represent all three types of locations
in a single geographic hierarchy, you have a slightly variable hierarchy. You can
combine all three types if you are willing to make a compromise. For the medium
location that has no concept of district, you can propagate the city name down into
the district attribute. For the simple location that has no concept of either district or
zone, you can propagate the city name down into both these attributes. The business
data governance representatives may instead decide to propagate labels upward or
even populate the empty levels with Not Applicable. The business representatives
need to visualize the appropriate row label values on a report if the attribute is
grouped on. Regardless of the business rules applied, you have the advantage of a
clean positional design with attribute names that make reasonable sense across all
three geographies. The key to this compromise is the narrow range of geographic
hierarchies, ranging from four levels to only six levels. If the data ranged from
four levels to eight or ten or even more, this design compromise would not work.
Remember the attribute names need to make sense.
Figure 7-7: Sample data values exist simultaneously in a single location dimension containing simple, intermediate, and complex hierarchies. Each location row carries Loc Key (PK), Address+, City, District, Zone, State, and Country attributes; in the simple location the city value is propagated into both the district and zone attributes, in the medium location it is propagated into the district attribute, and the complex location populates all the levels.
Ragged Variable Depth Hierarchies
In the budget use case, the organization structure is an excellent example of a ragged
hierarchy of indeterminate depth. In this chapter, we often refer to the hierarchical struc-
ture as a “tree” and the individual organizations in that tree as “nodes.” Imagine your
enterprise consists of 13 organizations with the rollup structure shown in Figure 7-8.
Each of these organizations has its own budget, commitments, and payments.
For a single organization, you can request a specific budget for an account with a
simple join from the organization dimension to the fact table, as shown in Figure 7-9.
But you also want to roll up the budget across portions of the tree or even all the tree.
Figure 7-9 contains no information about the organizational rollup.
Figure 7-8: Organization rollup structure. Node 1 sits at the top with children 2 and 7; node 2's children are 3 and 4; node 4's children are 5 and 6; node 7's children are 8 and 9; node 9's children are 10 and 13; and node 10's children are 11 and 12.
Figure 7-9: Organization dimension joined to fact table. The Organization Dimension (Organization Key (PK), Organization Name, ...) joins directly to the General Ledger Fact (Posting Date Key (FK), Organization Key (FK), Account Key (FK), Transaction Key (FK), Ledger Key (FK), Transaction ID (DD), Amount, Balance).
The classic way to represent a parent/child tree structure is by placing recur-
sive pointers in the organization dimension from each row to its parent, as shown
in Figure 7-10. The original definition of SQL did not provide a way to evaluate
these recursive pointers. Oracle implemented a CONNECT BY function that traversed
these pointers in a downward fashion starting at a high-level parent in the tree
and progressively enumerated all the child nodes in lower levels until the tree
was exhausted. But the problem with Oracle CONNECT BY and other more general
approaches, such as SQL Server’s recursive common table expressions, is that the
representation of the tree is entangled with the organization dimension because
these approaches depend on the recursive pointer embedded in the data. It is imprac-
tical to switch from one rollup structure to another because many of the recursive
pointers would have to be destructively modified. It is also impractical to maintain
organizations as type 2 slowly changing dimension attributes because changing the
key for a high-level node would ripple key changes down to the bottom of the tree.
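For reference, a downward traversal in the classic design might look like the following sketch, written as an ANSI-style recursive common table expression; the org_dim table and its organization_parent_key pointer follow Figure 7-10, though the exact names are assumptions (and SQL Server omits the RECURSIVE keyword).

WITH RECURSIVE subtree AS (
  SELECT organization_key, organization_name
  FROM org_dim
  WHERE organization_key = 1                          -- start at the chosen high-level parent
  UNION ALL
  SELECT child.organization_key, child.organization_name
  FROM org_dim child
  JOIN subtree parent
    ON child.organization_parent_key = parent.organization_key   -- follow the recursive pointer downward
)
SELECT * FROM subtree;

Because the traversal depends on the pointer embedded in the dimension rows, switching to an alternative rollup means rewriting the dimension itself, which is exactly the entanglement the bridge table described next avoids.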
The solution to the problem of representing arbitrary rollup structures is to build
a special kind of bridge table that is independent from the primary dimension table
and contains all the information about the rollup. The grain of this bridge table is
each path in the tree from a parent to all the children below that parent, as shown
in Figure 7-11. The first column in the map table is the primary key of the parent,
and the second column is the primary key of the child. A row must be constructed
from each possible parent to each possible child, including a row that connects the
parent to itself.
Organization Dimension
Organization Key (PK)
Organization Name
...
Organization Parent Key (FK)
Recursive
Pointer
Figure 7-10: Classic parent/child recursive design.
The example tree depicted in Figure 7-8 results in 43 rows in Figure 7-11. There
are 13 paths from node number 1, 5 paths from node number 2, one path from node
number 3 to itself, and so on.
The highest parent flag in the map table means the particular path comes from
the highest parent in the tree. The lowest child flag means the particular path ends
in a “leaf node” of the tree.
If you constrain the organization dimension table to a single row, you can join
the dimension table to the map table to the fact table, as shown in Figure 7-12. For
example, if you constrain the organization table to node number 1 and simply fetch
an additive fact from the fact table, you get 13 hits on the fact table, which traverses
the entire tree in a single query. If you perform the same query except constrain the
map table lowest child flag to true, then you fetch only the additive fact from the
seven leaf nodes, numbers 3, 5, 6, 8, 11, 12, and 13. Again, this answer was computed without
traversing the tree at query time!
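A sketch of that rollup query follows, using the table names shown in Figure 7-12 (the names are assumptions); adding the flag constraint restricts the answer to the leaf nodes.

SELECT SUM(f.amount) AS subtree_amount
FROM org_dim o
JOIN org_map_bridge b
  ON b.parent_organization_key = o.organization_key   -- the bridge fans out to every descendant
JOIN general_ledger_fact f
  ON f.organization_key = b.child_organization_key
WHERE o.organization_key = 1;                          -- constrain the dimension to a single row
-- add: AND b.lowest_child_flag = TRUE   to roll up only the leaf nodes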
NOTE The article “Building Hierarchy Bridge Tables” (available at www.kimballgroup.com
under the Tools and Utilities tab for this book title) provides a code example for
building the hierarchy bridge table described in this section.
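As a minimal sketch of the idea (not the Kimball Group utility itself): seed the bridge with the zero-depth self paths and let a recursive common table expression walk downward. The source table org_parent(org_key, parent_org_key) holding the recursive pointers is an assumption, dialects attach the WITH clause to INSERT differently, and the two flags would be derived in a follow-up step.

WITH RECURSIVE paths AS (
  SELECT org_key AS parent_org_key,
         org_key AS child_org_key,
         0       AS depth
  FROM org_parent                                      -- every node connects to itself
  UNION ALL
  SELECT p.parent_org_key, c.org_key, p.depth + 1
  FROM paths p
  JOIN org_parent c
    ON c.parent_org_key = p.child_org_key              -- extend each path one level down
)
INSERT INTO org_map_bridge
  (parent_organization_key, child_organization_key, depth_from_parent)
SELECT parent_org_key, child_org_key, depth
FROM paths;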
You must be careful when using the map bridge table to constrain the organization
dimension to a single row, or else you risk overcounting the children and grandchil-
dren in the tree. For example, if instead of a constraint such as “Node Organization
Number = 1” you constrain on “Node Organization Location = California,” you
would have this problem. In this case you need to craft a custom query, rather than
a simple join, with the following constraint:
GLfact.orgkey in (select distinct bridge.childkey
from innerorgdim, bridge
where innerorgdim.state = 'California' and
innerorgdim.orgkey = bridge.parentkey)
Sample Organization Map bridge table rows for Figure 7-8: each row carries Parent Organization Key (FK), Child Organization Key (FK), Depth from Parent, Highest Parent Flag, and Lowest Child Flag, with one row for every parent-to-child path in the tree, including the zero-depth path from each node to itself (43 rows in all for the 13-node example; only the 13 paths descending from node 1 carry a true highest parent flag).
Figure 7-11: Organization map bridge table sample rows.
Figure 7-12: Joining organization map bridge table to fact table. The Organization Dimension (Organization Key (PK), Organization Name, ...) joins through the Organization Map Bridge (Parent Organization Key (FK), Child Organization Key (FK), Depth from Parent, Highest Parent Flag, Lowest Child Flag) to the General Ledger Fact (Posting Date Key (FK), Organization Key (FK), Account Key (FK), Transaction Profile Key (FK), Ledger Version Key (FK), Transaction ID (DD), Amount, Balance).
Shared Ownership in a Ragged Hierarchy
The map table can represent partial or shared ownership, as shown in Figure 7-13.
For instance, suppose node 10 is 50 percent owned by node 6 and 50 percent
owned by node 11. In this case, any budget or commitment or payment attributed
to node 10 flows upward through node 6 with a 50 percent weighting and also
upward through node 11 with a 50 percent weighting. You now need to add extra
path rows to the original 43 rows to accommodate the connection of node 10 up
to node 6 and its parents. All the relevant path rows ending in node 10 now need
a 50 percent weighting in the ownership percentage column in the map table.
Other path rows not ending in node 10 do not have their ownership percentage
column changed.
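A sketch of the resulting weighted rollup, using the Figure 7-13 names (assumed) and a percent ownership stored as a fraction:

SELECT SUM(f.amount * b.percent_ownership) AS owned_amount   -- scale each path by its ownership share
FROM org_dim o
JOIN org_map_bridge b
  ON b.parent_organization_key = o.organization_key
JOIN general_ledger_fact f
  ON f.organization_key = b.child_organization_key
WHERE o.organization_key = 6;                                 -- node 6 sees only 50 percent of node 10's amounts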
Figure 7-13: Bridge table for ragged hierarchy with shared ownership. Identical to Figure 7-12 except the Organization Map Bridge gains a Percent Ownership column.
Time Varying Ragged Hierarchies
The ragged hierarchy bridge table can accommodate slowly changing hierarchies
with the addition of two date/time stamps, as shown in Figure 7-14. When a given
node no longer is a child of another node, the end effective date/time of the old
relationship must be set to the date/time of the change, and new path rows inserted
into the bridge table with the correct begin effective date/time.
Figure 7-14: Bridge table for time varying ragged hierarchies. Identical to Figure 7-12 except the Organization Map Bridge gains Begin Effective Date/Time and End Effective Date/Time columns.
WARNING When using the bridge table in Figure 7-14, the query must always
constrain to a single date/time to “freeze” the bridge table to a single consistent
view of the hierarchy. Failing to constrain in this way would result in multiple
paths being fetched that could not exist at the same time.
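For example, a rollup frozen as of a chosen date might look like this sketch, using the Figure 7-14 column names (assumed):

SELECT SUM(f.amount) AS subtree_amount
FROM org_dim o
JOIN org_map_bridge b
  ON  b.parent_organization_key = o.organization_key
  AND DATE '2013-06-30' BETWEEN b.begin_effective_date AND b.end_effective_date   -- freeze the hierarchy
JOIN general_ledger_fact f
  ON f.organization_key = b.child_organization_key
WHERE o.organization_key = 1;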
Modifying Ragged Hierarchies
The organization map bridge table can easily be modified. Suppose you want to
move nodes 4, 5, and 6 from their original location reporting up to node 2 to a new
location reporting up to node 9, as shown in Figure 7-15.
In the static case in which the bridge table only reflects the current rollup struc-
ture, you merely delete the higher level paths in the tree pointing into the group
of nodes 4, 5, and 6. Then you attach nodes 4, 5, and 6 into the parents 1, 7, and 9.
Here is the static SQL:
Delete from Org_Map where child_org in (4, 5, 6) and
parent_org not in (4, 5, 6)
Insert into Org_Map (parent_org, child_org)
select distinct parent_org, 4 from Org_Map where parent_org in (1, 7, 9)
Insert into Org_Map (parent_org, child_org)
select distinct parent_org, 5 from Org_Map where parent_org in (1, 7, 9)
Insert into Org_Map (parent_org, child_org)
select distinct parent_org, 6 from Org_Map where parent_org in (1, 7, 9)
Figure 7-15: Changes to Figure 7-8's organization structure. Node 4 (together with its children 5 and 6) now reports up to node 9 rather than node 2, leaving node 3 as node 2's only child.
In the time varying case in which the bridge table has the pair of date/time
stamps, the logic is similar. You can find the higher level paths in the tree point-
ing into the group of nodes 4, 5, and 6 and set their end effective date/times to the
moment of the change. Then you attach nodes 4, 5, and 6 into the parents 1, 7, and
9 with the appropriate date/times. Here is the time varying SQL:
Update Org_Map set end_eff_date = #December 31, 2012#
where child_org in (4, 5, 6) and parent_org not in (4, 5, 6)
and #Jan 1, 2013# between begin_eff_date and end_eff_date
Insert into Org_Map
(parent_org, child_org, begin_eff_date, end_eff_date)
values (1, 4, #Jan 1, 2013#, #Dec 31, 9999#)
Insert into Org_Map
(parent_org, child_org, begin_eff_date, end_eff_date)
values (7, 4, #Jan 1, 2013#, #Dec 31, 9999#)
Insert into Org_Map
(parent_org, child_org, begin_eff_date, end_eff_date)
values (9, 4, #Jan 1, 2013#, #Dec 31, 9999#)
Identical insert statements for nodes 5 and 6 …
This simple recipe for changing the bridge table avoids nightmarish scenarios
when changing other types of hierarchical models. In the bridge table, only the
paths directly involved in the change are affected. All other paths are untouched.
In most other schemes with clever node labels, a change in the tree structure can
affect many or even all the nodes in the tree, as shown in the next section.
Alternative Ragged Hierarchy Modeling Approaches
In addition to using recursive pointers in the organization dimension, there are at
least two other ways to model a ragged hierarchy, both involving clever columns
placed in the organization dimension. There are two disadvantages to these schemes
compared to the bridge table approach. First, the definition of the hierarchy is locked
into the dimension and cannot easily be replaced. Second, both of these schemes are
vulnerable to a relabeling disaster in which a large part of the tree must be relabeled
due to a single small change. Textbooks (like this one!) usually show a tiny example,
but you need to tread cautiously if there are thousands of nodes in your tree.
One scheme adds a pathstring attribute to the organization dimension table,
as shown in Figure 7-16. The values of the pathstring attribute are shown within
each node. In this scenario, there is no bridge table. At each level, the pathstring
starts with the full pathstring of the parent and then adds the letters A, B, C, and
so on, from left to right under that parent. The final character is a “+” if the node
has children and is a period if the node has no children. The tree can be navigated
by using wild cards in constraints against the pathstring, for example,
A* retrieves the whole tree where the asterisk is a variable length wild card.
*. retrieves only the leaf nodes.
?+ retrieves the topmost node where the question mark is a single character
wild card.
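In SQL, the LIKE operator's % and _ wild cards play the roles of the asterisk and question mark; a sketch against an assumed pathstring column in the organization dimension:

SELECT * FROM org_dim WHERE pathstring LIKE 'A%';   -- the whole tree
SELECT * FROM org_dim WHERE pathstring LIKE '%.';   -- only the leaf nodes (pathstring ends in a period)
SELECT * FROM org_dim WHERE pathstring LIKE '_+';   -- the topmost node (a single character followed by +)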
Figure 7-16: Alternate ragged hierarchy design using pathstring attribute. Sample pathstring values: A+ at the top; AA+ and AB+ below it; AAA., AAB+, ABA., and ABB+ at the next level; AABA., AABB., ABBA+, and ABBB. below those; and ABBAA. and ABBAB. at the bottom.
The pathstring approach is fairly sensitive to relabeling ripples caused by orga-
nization changes; if a new node is inserted somewhere in the tree, all the nodes to
the right of that node under the same parent must be relabeled.
Another similar scheme, known to computer scientists as the modified preordered
tree traversal approach, numbers the tree as shown in Figure 7-17. Every node has a
pair of numbers that identifies all the nodes below that point. The whole tree can be
enumerated by using the node numbers in the topmost node. If the values in each node
have the names Left and Right, then all the nodes in the example tree can be found with
the constraint “Left between 1 and 26.” Leaf nodes can be found where Left and Right
differ by 1, meaning there aren't any children. This approach is even more vulnerable
to the relabeling disaster than the pathstring approach because the entire tree must
be carefully numbered in sequence, top to bottom and left to right. Any change to the
tree causes the entire rest of the tree to the right to be relabeled.
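Expressed as SQL against assumed left_number and right_number columns, the two constraints from this example are simply:

SELECT * FROM org_dim
WHERE left_number BETWEEN 1 AND 26;          -- every node under the topmost node (1,26)

SELECT * FROM org_dim
WHERE right_number = left_number + 1;        -- leaf nodes: Left and Right differ by 1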
Figure 7-17: Alternative ragged hierarchy design using the modified preordered tree traversal approach. Every node carries a (Left, Right) number pair; the topmost node is (1,26), and leaf nodes such as (3,4), (6,7), and (8,9) have Right values that exceed their Left values by exactly 1.
Advantages of the Bridge Table Approach for Ragged Hierarchies
Although the bridge table requires more ETL work to set up and more work when
querying, it offers exceptional flexibility for analyzing ragged hierarchies of
indeterminate depth. In particular, the bridge table allows
Alternative rollup structures to be selected at query time
Shared ownership rollups
Time varying ragged hierarchies
Limited impact when nodes undergo slowly changing dimension (SCD)
type 2 changes
Limited impact when the tree structure is changed
You can use the organization hierarchy bridge table to fetch a fact across all three
fact tables in the budget chain. Figure 7-18 shows how an organization map table
can connect to the three budget chain fact tables. This would allow a drill-across
report such as finding all the travel budgets, commitments, and payments made by
all the lowest leaf nodes in a complex organizational structure.
Figure 7-18: Drilling across and rolling up the budget chain. A single Organization Map bridge (Parent Organization Key (FK), Child Organization Key (FK), Depth from Parent, Highest Parent Flag, Lowest Child Flag) and conformed Organization Dimension (Organization Key, Cost Center Number, Cost Center Name, ...) connect to all three fact tables: the Budget Fact (Month Key, Organization Key, Account Key, Budget Key, Budget Amount), the Commitment Fact (which adds Commitment Key and carries Commitment Amount), and the Payment Fact (which further adds Payment Key and carries Payment Amount).
Consolidated Fact Tables
In the last section, we discussed comparing metrics generated by separate business
processes by drilling across fact tables, such as budget and commitments. If this
type of drill-across analysis is extremely common in the user community, it likely
makes sense to create a single fact table that combines the metrics once rather than
relying on business users or their BI reporting applications to stitch together result
sets, especially given the inherent issues of complexity, accuracy, tool capabilities,
and performance.
Most typically, business managers are interested in comparing actual to budget
variances. At this point, you can presume the annual budgets and/or forecasts have
been broken down by accounting period. Figure 7-19 shows the actual and budget
amounts, as well as the variance (which is a calculated difference) by the common
dimensions.
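A hedged sketch of populating such a consolidated row set during ETL: the two processes are first aggregated to the common grain (the actual_summary and budget_summary names are assumptions), then merged with an outer join so the variance is computed once.

INSERT INTO budget_variance_fact
  (accounting_period_key, account_key, organization_key,
   accounting_period_actual_amount, accounting_period_budget_amount,
   accounting_period_budget_variance)
SELECT COALESCE(a.accounting_period_key, b.accounting_period_key),
       COALESCE(a.account_key,           b.account_key),
       COALESCE(a.organization_key,      b.organization_key),
       COALESCE(a.actual_amount, 0),
       COALESCE(b.budget_amount, 0),
       COALESCE(a.actual_amount, 0) - COALESCE(b.budget_amount, 0)   -- the calculated variance
FROM actual_summary a
FULL OUTER JOIN budget_summary b
  ON  b.accounting_period_key = a.accounting_period_key
  AND b.account_key           = a.account_key
  AND b.organization_key      = a.organization_key;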
Figure 7-19: Actual versus budget consolidated fact table. The Budget Variance Fact (Accounting Period Key (FK), Account Key (FK), Organization Key (FK), Accounting Period Actual Amount, Accounting Period Budget Amount, Accounting Period Budget Variance) joins to the Accounting Period, Account, and Organization dimensions.
Again, in a multinational organization, you would likely see the actual amounts
in both local and the equivalent standard currency, based on the effective conversion
rate. In addition, you may convert the actual results based on the planned currency
conversion factor. Given the unpredictable nature of currency fluctuations, it is
useful to monitor performance based on both the effective and planned conversion
rates. In this manner, remote managers aren't penalized for currency rate changes
outside their control. Likewise, finance can better understand the big picture impact
of unexpected currency conversion fluctuations on the organization's annual plan.
Fact tables that combine metrics from multiple business processes at a com-
mon granularity are referred to as consolidated fact tables. Although consolidated
fact tables can be useful, both in terms of performance and usability, they often
represent a dimensionality compromise as they consolidate facts at the “least
common denominator” of dimensionality. One potential risk associated with
consolidated fact tables is that project teams sometimes base designs solely on
the granularity of the consolidated table, while failing to meet user requirements
that demand the ability to dive into more granular data. These schemas run into
serious problems if project teams attempt to force a one-to-one correspondence
to combine data with different granularity or dimensionality.
NOTE When facts from multiple business processes are combined in a consoli-
dated fact table, they must live at the same level of granularity and dimensionality.
Because the separate facts seldom naturally live at a common grain, you are forced
to eliminate or aggregate some dimensions to support the one-to-one correspon-
dence, while retaining the atomic data in separate fact tables. Project teams should
not create artificial facts or dimensions in an attempt to force-fit the consolidation
of differently grained fact data.
Role of OLAP and Packaged Analytic Solutions
While discussing financial dimensional models in the context of relational data-
bases, it is worth noting that multidimensional OLAP vendors have long played a
role in this arena. OLAP products have been used extensively for financial reporting,
budgeting, and consolidation applications. Relational dimensional models often feed
financial OLAP cubes. OLAP cubes can deliver fast query performance that is critical
for executive usage. The data volumes, especially for general ledger balances or finan-
cial statement aggregates, do not typically overwhelm the practical size constraints
of a multidimensional product. OLAP is well suited to handle complicated organiza-
tional rollups, as well as complex calculations, including inter-row manipulations.
Most OLAP vendors provide finance-specific capabilities, such as financial functions
(for example, net present value or compound growth), the appropriate handling of
financial statement data (in the expected sequential order such as income before
expenses), and the proper treatment of debits and credits depending on the account
type, as well as more advanced functions such as financial consolidation. OLAP
cubes often also readily support complex security models, such as limiting access
to detailed data while providing more open access to summary metrics.
Given the standard nature of general ledger processing, purchasing a general
ledger package rather than attempting to build one from scratch has been a popu-
lar route for years. Nearly all the operational packages also offer a complementary
analytic solution, sometimes in partnership with an OLAP vendor. In many cases,
precanned solutions based on the vendor's cumulative experience are a sound way
to jump start a financial DW/BI implementation with potentially reduced cost and
risk. The analytic solutions often have tools to assist with the extraction and staging
of the operational data, as well as tools to assist with analysis and interpretation.
However, when leveraging packaged solutions, you need to be cautious in order to
avoid stovepipe applications. You could easily end up with separate financial, CRM,
human resources, and ERP packaged analytic solutions from as many different
vendors, none of which integrate with other internal data. You need to conform
dimensions across the entire DW/BI environment, regardless of whether you build
a solution or implement packages. Packaged analytic solutions can turbocharge a
DW/BI implementation; however, they do not alleviate the need for conformance.
Most organizations inevitably rely on a combination of building, buying, and inte-
grating for a complete solution.
Summary
In this chapter, we focused primarily on financial data in the general ledger, both in
terms of periodic snapshots as well as journal entry transactions. We discussed the
handling of common G/L data challenges, including multiple currencies, multiple
fiscal years, unbalanced organizational trees, and the urge to create to-date totals.
We used the familiar organization rollup structure to show how to model complex
ragged hierarchies of indeterminate depth. We introduced a special bridge table for
these hierarchies, and compared this approach to others.
We explored the series of events in a budgeting process chain. We described the
use of “net change” granularity in this situation rather than creating snapshots of
the budget data totals. We also discussed the concept of consolidated fact tables
that combine the results of separate business processes when they are frequently
analyzed together.
Finally, we discussed the natural fit of OLAP products for financial analysis. We
also stressed the importance of integrating analytic packages into the overall DW/
BI environment through the use of conformed dimensions.
Customer Relationship Management
Long before the customer relationship management (CRM) buzzword existed,
organizations were designing and developing customer-centric dimensional
models to better understand their customers’ behavior. For decades, these models
were used to respond to management’s inquiries about which customers were solic-
ited, who responded, and what was the magnitude of their response. The business
value of understanding the full spectrum of customers’ interactions and transactions
has propelled CRM to the top of the charts. CRM not only includes familiar resi-
dential and commercial customers, but also citizens, patients, students, and many
other categories of people and organizations whose behavior and preferences are
important. CRM is a mission-critical business strategy that many view as essential
to an organization’s survival.
In this chapter we start with a CRM overview, including its operational and ana-
lytic roles. We then introduce the basic design of the customer dimension, including
common attributes such as dates, segmentation attributes, repeated contact roles,
and aggregated facts. We discuss customer name and address parsing, along with
international considerations. We remind you of the challenges of modeling complex
hierarchies when we describe various kinds of customer hierarchies.
Chapter 8 discusses the following concepts:
CRM overview
Customer name and address parsing, including international considerations
Handling of dates, aggregated facts, and segmentation behavior attributes and
scores in a customer dimension
Outriggers for low cardinality attributes
Bridge tables for sparse attributes, along with trade-offs of bridge tables versus
a positional design
Bridge tables for multiple customer contacts
Behavior study groups to capture customer cohort groups
Step dimensions to analyze sequential customer behavior
Timespan fact tables with effective and expiration dates
Embellishing fact tables with dimensions for satisfaction or abnormal scenarios
Integrating customer data via master data management or partial conformity
during the downstream ETL processing
Warnings about fact-to-fact table joins
Reality check on real time, low latency requirements
Because this chapter’s customer-centric modeling issues and patterns are relevant
across industries and functional areas, we have not included a bus matrix.
CRM Overview
Regardless of the industry, organizations have flocked to the concept of CRM.
They’ve jumped on the bandwagon in an attempt to migrate from a product-centric
orientation to one that is driven by customer needs. Although all-encompassing
terms such as customer relationship management sometimes seem ambiguous and/
or overly ambitious, the premise behind CRM is far from rocket science. It’s based
on the simple notion that the better you know your customers, the better you can
maintain long-lasting, valuable relationships with them. The goal of CRM is to
maximize relationships with your customers over their lifetime. It entails focus-
ing all aspects of the business, from marketing, sales, operations, and service, on
establishing and sustaining mutually beneficial customer relations. To do so, the
organization must develop a single, integrated view of each customer.
CRM promises significant returns for organizations that embrace it, both for
increased revenue and operational efficiencies. Switching to a customer-driven
perspective can lead to increased sales effectiveness and closure rates, revenue
growth, enhanced sales productivity at reduced cost, improved customer profit-
ability margins, higher customer satisfaction, and increased customer retention.
Ultimately, every organization wants more loyal, more profitable customers. As it
often requires a sizeable investment to attract new customers, you can't afford to
have the profitable ones leave.
In many organizations, the view of the customer varies depending on the product
line, business unit, business function, and/or geographic location. Each group may
use different customer data in different ways with different results. The evolution
from the existing silos to a more integrated perspective obviously requires organi-
zational commitment. CRM is like a stick of dynamite that knocks down the silo
walls. It requires the right integration of business processes, people resources, and
application technology to be effective.
Over the past decade, the explosive growth of social media, location tracking tech-
nology, network usage monitoring, multimedia applications, and sensor networks
has provided an ocean of customer behavioral data that even Main Street enterprises
recognize as providing actionable insights. Although much of this data lies outside
the comfort zone of relational databases, the new “big data” techniques can bring
this data back into the DW/BI fold. Chapter 21: Big Data Analytics discusses the
best practices for bringing this new kind of big data into the DW/BI environment.
But setting aside the purely technological challenges, the real message is the need
for profound integration. You must step up to the challenge of integrating as many
as 100 customer-facing data sources, most of which are external. These data sources
are at di erent grains, have incompatible customer attributes, and are not under
your control. Any questions?
Because it is human nature to resist change, it comes as no surprise that people-
related issues often challenge CRM implementations. CRM involves brand new
ways of interacting with customers and often entails radical changes to the sales
channels. CRM requires new information flows based on the complete acquisition
and dissemination of customer “touch point” data. Often organization structures and
incentive systems are dramatically altered.
In Chapter 17: Kimball DW/BI Lifecycle Overview, we’ll stress the importance of
having support from both senior business and IT management for a DW/BI initiative.
This advice also applies to a CRM implementation because of its cross-functional
focus. CRM requires a clear business vision. Without business strategy, buy-in, and
authorization to change, CRM becomes an exercise in futility. Neither IT nor the
business community can successfully implement CRM on its own; CRM demands
a joint commitment of support.
Operational and Analytic CRM
It could be said that CRM suffers from a split personality syndrome because it needs
to address both operational and analytic requirements. Effective CRM relies on the
collection of data at every interaction you have with a customer and then leveraging
that breadth of data through analysis.
On the operational front, CRM calls for the synchronization of customer-facing
processes. Often operational systems must either be updated or supplemented to coor-
dinate across sales, marketing, operations, and service. Think about all the customer
interactions that occur during the purchase and usage of a product or service, from
the initial prospect contact, quote generation, purchase transaction, fulfillment, pay-
ment transaction, and on-going customer service. Rather than thinking about these
processes as independent silos (or multiple silos varying by product line), the CRM
mindset is to integrate these customer activities. Key customer metrics and charac-
teristics are collected at each touch point and made available to the others.
As data is created on the operational side of the CRM equation, you obviously
need to store and analyze the historical metrics resulting from the customer
interaction and transaction systems. Sounds familiar, doesn’t it? The DW/BI system
sits at the core of CRM. It serves as the repository to collect and integrate the
breadth of customer information found in the operational systems, as well as from
external sources. The data warehouse is the foundation that supports the panoramic
360-degree view of your customers.
Analytic CRM is enabled via accurate, integrated, and accessible customer data
in the DW/BI system. You can measure the effectiveness of decisions made in the
past to optimize future interactions. Customer data can be leveraged to better iden-
tify up-sell and cross-sell opportunities, pinpoint inefficiencies, generate demand,
and improve retention. In addition, the historical, integrated data can be leveraged
to generate models or scores that close the loop back to the operational world.
Recalling the major components of a DW/BI environment from Chapter 1: Data
Warehousing, Business Intelligence, and Dimensional Modeling Primer, you can
envision the model results pushed back to where the relationship is operationally
managed (such as the rep, call center, or website), as illustrated in Figure 8-1. The
model output can translate into specific proactive or reactive tactics recommended
for the next point of customer contact, such as the appropriate next product offer or
anti-attrition response. The model results are also retained in the DW/BI environ-
ment for subsequent analysis.
Figure 8-1: Closed loop analytic CRM. Customer data is collected in the operational source systems, integrated via ETL, stored in the data presentation area, and then analyzed, reported on, and modeled by BI applications, with the model results fed back to the operational collection systems.
In other situations, information must feed back to the operational website or call
center systems on a more real-time basis. In this case, the closed loop is much tighter
than Figure 8-1 because it’s a matter of collection and storage, and then feedback
to the collection system. Today’s operational processes must combine the current
view with a historical view, so a decision maker can decide, for example, whether
to grant credit to a customer in real time, while considering the customer's lifetime
history. But generally, the integration requirements for operational CRM are not as
far reaching as for analytic CRM.
Obviously, as the organization becomes more centered on the customer, so must
the DW/BI system. CRM will inevitably drive change in the data warehouse. DW/BI
environments will grow even more rapidly as you collect more and more informa-
tion about your customers. ETL processes will grow more complicated as you match
and integrate data from multiple sources. Most important, the need for a conformed
customer dimension becomes even more paramount.
Customer Dimension Attributes
The conformed customer dimension is a critical element for effective CRM. A well-
maintained, well-deployed conformed customer dimension is the cornerstone of
sound CRM analysis.
The customer dimension is typically the most challenging dimension for any
DW/BI system. In a large organization, the customer dimension can be extremely
deep (with many millions of rows), extremely wide (with dozens or even hundreds
of attributes), and sometimes subject to rapid change. The biggest retailers, credit
card companies, and government agencies have monster customer dimensions whose
size exceeds 100 million rows. To further complicate matters, the customer dimen-
sion often represents an amalgamation of data from multiple internal and external
source systems.
In this next section, we focus on numerous customer dimension design con-
siderations. We’ll begin with name/address parsing and other common customer
attributes, including coverage of dimension outriggers, and then move on to other
interesting customer attributes. Of course, the list of customer attributes is typically
quite lengthy. The more descriptive information you capture about your customers,
the more robust the customer dimension, and the more interesting the analyses.
Name and Address Parsing
Regardless of whether you deal with individual human beings or commercial enti-
ties, customers’ name and address attributes are typically captured. The operational
handling of name and address information is usually too simplistic to be very useful
in the DW/BI system. Many designers feel a liberal design of general purpose col-
umns for names and addresses, such as Name-1 through Name-3 and Address-1
through Address-6, can handle any situation. Unfortunately, these catchall columns
are virtually worthless when it comes to better understanding and segmenting
the customer base. Designing the name and location columns in a generic way
can actually contribute to data quality problems. Consider the sample design in
Figure 8-2 with general purpose columns.
Name: Ms. R. Jane Smith, Atty
Address 1: 123 Main Rd, North West, Ste 100A
Address 2: PO Box 2348
City: Kensington
State: Ark.
ZIP Code: 88887-2348
Phone Number: 888-555-3333 x776 main, 555-4444 fax
Figure 8-2: Sample customer name/address data in overly general columns.
In this design, the name column is far too limited. There is no consistent mecha-
nism for handling salutations, titles, or suffixes. You can't identify what the person's
first name is, or how she should be addressed in a personalized greeting. If you
look at additional sample data from this operational system, you would potentially
find multiple customers listed in a single name attribute. You might also find addi-
tional descriptive information in the name column, such as Confidential, Trustee,
or UGMA (Uniform Gift to Minors Act).
In the sample address attributes, inconsistent abbreviations are used in various
places. The address columns may contain enough room for any address, but there
is no discipline imposed by the columns that can guarantee conformance with
postal authority regulations or support address matching and latitude/longitude
identification.
Instead of using a few, general purpose columns, the name and location attributes
should be broken down into as many elemental parts as possible. The extract process
needs to perform significant parsing on the original dirty names and addresses. After
the attributes have been parsed, they can be standardized. For example, Rd would
become Road and Ste would become Suite. The attributes can also be verified, such
as verifying the ZIP code and associated state combination is correct. Fortunately,
there are name and address data cleansing and scrubbing tools available in the
market to assist with parsing, standardization, and verification.
A sample set of name and location attributes for individuals in the United States is
shown in Figure 8-3. Every attribute is filled in with sample data to make the design
clearer, but no single real instance would look like this. Of course, the business data
governance representatives should be involved in determining the analytic value of
these parsed data elements in the customer dimension.
Salutation: Ms.
Informal Greeting Name: Jane
Formal Greeting Name: Ms. Smith
First and Middle Names: R. Jane
Surname: Smith
Suffix: Jr.
Ethnicity: English
Title: Attorney
Street Number: 123
Street Name: Main
Street Type: Road
Street Direction: North West
City: Kensington
District: Cornwall
Second District: Berkeleyshire
State: Arkansas
Region: South
Country: United States
Continent: North America
Primary Postal Code: 88887
Secondary Postal Code: 2348
Postal Code Type: United States
Office Telephone Country Code: 1
Office Telephone Area Code: 888
Office Telephone Number: 5553333
Office Extension: 776
Mobile Telephone Country Code: 1
Mobile Telephone Area Code: 509
Mobile Telephone Number: 5554444
E-mail: RJSmith@ABCGenIntl.com
Web Site: www.ABCGenIntl.com
Public Key Authentication: X.509
Certificate Authority: Verisign
Unique Individual Identifier: 7346531
Figure 8-3: Sample customer name/address data with parsed name and address elements.
Commercial customers typically have multiple addresses, such as physical and
shipping addresses; each of these addresses would follow much the same logic as
the address structure shown in Figure 8-3.
International Name and Address Considerations
International display and printing typically requires representing foreign characters,
including not just the accented characters from western European alphabets, but
also Cyrillic, Arabic, Japanese, Chinese, and dozens of other less familiar writing
systems. It is important to understand this is not a font problem. This is a character
set problem. A font is simply an artist’s rendering of a set of characters. There are
hundreds of fonts available for standard English, but standard English has a rela-
tively small character set that is enough for anyone’s use unless you are a professional
typographer. This small character set is usually encoded in American Standard Code
for Information Interchange (ASCII), a 7-bit encoding whose common 8-bit extensions
allow a maximum of 256 possible characters. Only approximately 100 of these characters
have a standard interpretation that can be invoked from a normal English keyboard, but
this is usually enough for English-speaking computer users. It should be clear that
ASCII is woefully inadequate for representing the thousands of characters needed
for non-English writing systems.
An international body of system architects, the Unicode Consortium, defined a
standard known as Unicode for representing characters and alphabets in almost all
the world's languages and cultures. Their work can be accessed on the web at
www.unicode.org. The Unicode Standard, version 6.2.0 has defined specific interpretations
for 110,182 possible characters and now covers the principal written languages
of the Americas, Europe, the Middle East, Africa, India, Asia, and Pacifica. Unicode
is the foundation you must use for addressing international character sets.
But it is important to understand that implementing Unicode solutions is done in
the foundation layers of your systems. First, the operating system must be Unicode-
compliant. Fortunately, the most current releases of all the major operating systems
are Unicode-compliant.
Above the operating system, all the devices that capture, store, transmit, and
print characters must be Unicode-compliant. Data warehouse back room tools must
be Unicode-compliant, including sort packages, programming languages, and auto-
mated ETL packages. Finally, the DW/BI applications, including database engines,
BI application servers and their report writers and query tools, web servers, and
browsers must all be Unicode-compliant. The DW/BI architect should not only talk
to the vendors of each package in the data pipeline, but also should conduct various
end-to-end tests. Capture some names and addresses with Unicode characters at
the data capture screens of one of the legacy applications, and send them through the
system. Get them to print out of a final report or a final browser window from
the DW/BI system and see if the special characters are still there. That simple
test will cut through a lot of the confusion. Note that even when you do this, the
same character, such as an a-umlaut, sorts differently in different countries such as
Norway and Germany. Even though you can’t solve all the variations in international
collating sequences, at least both the Norwegians and the Germans will agree that
the character is an a-umlaut.
Customer geographic attributes become more complicated if you deal with cus-
tomers from multiple countries. Even if you don’t have international customers, you
may need to contend with international names and addresses somewhere in the
DW/BI system for international suppliers and human resources personnel records.
NOTE Customer dimensions sometimes include a full address block attribute.
This is a specially crafted column that assembles a postally-valid address for the
customer including mail stop, ZIP code, and other attributes needed to satisfy postal
authorities. This attribute is useful for international locations where addresses
have local idiosyncrasies.
International DW/BI Goals
After committing to a Unicode foundation, you need to keep the following goals in
mind, in addition to the name and address parsing requirements discussed earlier:
Universal and consistent. As they say, in for a penny, in for a pound. If you
are going to design a system for international use, you want it to work around
the world. You need to think carefully if BI tools are to produce translated ver-
sions of reports in many languages. It may be tempting to provide translated
versions of dimensions for each language, but translated dimensions give rise
to some subtle problems.
Sorting sequences will be different, so either the reports will be sorted
differently or all reports except those in the “root” language will appear
to be unsorted.
If the attribute cardinalities are not faithfully preserved across lan-
guages, then either group totals will not be the same across reports, or
some groups in various languages will contain duplicated row headers
that look like mistakes. To avoid the worst of these problems, you
should translate dimensions after the report is run; the report first
needs to be produced in a single root language, and then the report
face needs to be translated into the intended target languages.
All the BI tool messages and prompts need to be translated for the
benefit of the business user. This process is known as localization and
is further discussed in Chapter 12: Transportation.
End-to-end data quality and downstream compatibility. The data warehouse
cannot be the only step in the data pipeline that worries about the integrity
of international names and addresses. A proper design requires support from
the first step of capturing the name and the address, through the data cleaning
and storage steps, to the final steps of performing geographic and demographic
analysis and printing reports.
Cultural correctness. In many cases, foreign customers and partners will see
the results from your DW/BI system in some form. If you don't understand
which name is a first name and which is a last name, and if you don't under-
stand how to refer to a person, you run the risk of insulting these individuals,
or at the very least, looking stupid. When outputs are punctuated improperly, or
misspelled, your foreign customers and partners will wish they were doing
business with a local company, rather than you.
Real-time customer response. DW/BI systems can play an operational role
by supporting real-time customer response systems. A customer service rep-
resentative may answer the telephone and may have 5 seconds or less to wait
for a greeting to appear on the screen that the data warehouse recommends
using with the customer. The greeting may include a proper salutation and
a proper use of the customer’s title and name. This greeting represents an
excellent use of a hot response cache that contains precalculated responses
for each customer.
Other kinds of addresses. We are in the midst of a revolution in communication
and networking. If you are designing a system for identifying international
names and addresses, you must anticipate the need to store electronic names,
security tokens, and internet addresses.
Similar to international addresses, telephone numbers must be presented
di erently depending on where the phone call originates. You need to provide
attributes to represent the complete foreign dialing sequence, complete domestic
dialing sequence, and local dialing sequence. Unfortunately, complete foreign dial-
ing sequences vary by origin country.
Customer-Centric Dates
Customer dimensions often contain dates, such as the date of the first purchase,
date of last purchase, and date of birth. Although these dates initially may be SQL
date type columns, if you want to summarize these dates by your unique calendar
attributes, such as seasons, quarters, and fiscal periods, the dates should be
changed to foreign key references to the date dimension. You need to be careful
that all such dates fall within the span of the corporate date dimension. These date
dimension roles are declared as semantically distinct views, such as a First Purchase
Date dimension table with unique column labels. The system behaves as if there
is another physical date table. Constraints on any of these tables have nothing to
do with constraints on the primary date dimension table. This design, as shown
in Figure 8-4, is an example of a dimension outrigger, which is discussed in the
section “Outrigger for Low Cardinality Attribute Set.”
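A sketch of declaring such a role as a view over the single physical date dimension follows; the outrigger column names track Figure 8-4, while the source column names in date_dim are assumptions.

CREATE VIEW date_of_1st_purchase_dim AS
SELECT date_key       AS date_of_1st_purchase_key,
       full_date      AS date_of_1st_purchase,
       month_name     AS date_of_1st_purchase_month,
       year_number    AS date_of_1st_purchase_year,
       fiscal_month   AS date_of_1st_purchase_fiscal_month,
       fiscal_quarter AS date_of_1st_purchase_fiscal_quarter,
       fiscal_year    AS date_of_1st_purchase_fiscal_year,
       season         AS date_of_1st_purchase_season
FROM date_dim;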
Figure 8-4: Date dimension outrigger. The Customer Dimension (Customer Key (PK), Customer ID (Natural Key), Customer Salutation, Customer First Name, Customer Surname, Customer City, Customer State, Date of 1st Purchase Key (FK)) joins to the Date of 1st Purchase Dimension outrigger (Date of 1st Purchase Key (PK), Date of 1st Purchase, Date of 1st Purchase Month, Date of 1st Purchase Year, Date of 1st Purchase Fiscal Month, Date of 1st Purchase Fiscal Quarter, Date of 1st Purchase Fiscal Year, Date of 1st Purchase Season), while the Fact Table carries Transaction Date Key (FK), Customer Key (FK), more FKs, and facts.
Aggregated Facts as Dimension Attributes
Business users are often interested in constraining the customer dimension based on
aggregated performance metrics, such as filtering on all customers who spent more
than a certain dollar amount during last year. Or to make matters worse, perhaps
they want to constrain based on how much the customer has purchased in a lifetime.
Providing aggregated facts as dimension attributes is sure to be a crowd-pleaser with
the business users. They could issue a query to identify all customers who satisfied the
spending criteria and then issue another fact query to analyze the behavior for
that customer dimension subset. But rather than all that, you can instead store
an aggregated fact as a dimension attribute. This allows business users to simply
constrain on the spending attribute just like they might on a geographic attribute.
These attributes are meant to be used for constraining and labeling; they’re not to be
used in numeric calculations. Although there are query usability and performance
advantages of storing these attributes, the main burden falls on the back room ETL
processes to ensure the attributes are accurate, up-to-date, and consistent with the
actual fact rows. These attributes can require significant care and feeding. If you
opt to include some aggregated facts as dimension attributes, be certain to focus on
those that will be frequently used. Also strive to minimize the frequency with which
these attributes need to be updated. For example, an attribute for last year’s spending
would require much less maintenance than one providing year-to-date behavior.
Rather than storing attributes down to the specific dollar value, they are sometimes
replaced (or supplemented) with more meaningful descriptive values, such as High
Spender as discussed in the next section. These descriptive values minimize your
vulnerability that the numeric attributes might not tie back to the appropriate fact
tables. In addition, they ensure that all users have a consistent definition for high
spenders, for example, rather than resorting to their own individual business rules.
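A hedged sketch of the periodic ETL step that refreshes such an attribute, here a last-year spend amount plus a banded label like the High Spender value just described; every table name, column name, and band threshold is illustrative, and the UPDATE ... FROM form shown is PostgreSQL-style.

UPDATE customer_dim
SET last_year_spend_amount = s.total_spend,
    last_year_spend_band   = CASE WHEN s.total_spend >= 10000 THEN 'High Spender'
                                  WHEN s.total_spend >=  1000 THEN 'Medium Spender'
                                  ELSE 'Low Spender' END
FROM (SELECT f.customer_key, SUM(f.sales_amount) AS total_spend
      FROM sales_fact f
      JOIN date_dim d ON d.date_key = f.transaction_date_key
      WHERE d.year_number = 2012                    -- last year's activity only
      GROUP BY f.customer_key) s
WHERE customer_dim.customer_key = s.customer_key;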
Segmentation Attributes and Scores
Some of the most powerful attributes in a customer dimension are segmentation
classifications. These attributes obviously vary greatly by business context. For an
individual customer, they may include:
Gender
Ethnicity
Age or other life stage classifications
Income or other lifestyle classifications
Status (such as new, active, inactive, and closed)
Referring source
Business-specific market segment (such as a preferred customer identifier)
Similarly, many organizations score their customers to characterize them.
Statistical segmentation models typically generate these scores which cluster cus-
tomers in a variety of ways, such as based on their purchase behavior, payment
behavior, propensity to churn, or probability to default. Each customer is tagged
with a resultant score.
Behavior Tag Time Series
One popular approach for scoring and profiling customers looks at the recency (R),
frequency (F), and intensity (I) of the customer's behavior. These are known as the
RFI measures; sometimes intensity is replaced with monetary (M), so it's also known
as RFM. Recency is how many days it has been since the customer last ordered
or visited your site. Frequency is how many times the customer has ordered or
visited, typically in the past year. And intensity is how much money the customer
has spent over the same time period. When dealing with a large customer base,
every customer's behavior can be modeled as a point in an RFI cube, as depicted
in Figure 8-5. In this figure, the scales along each axis are quintiles, from 1 to 5,
which spread the actual values into even groups.
If you have millions of points in the cube, it becomes di cult to see meaning-
ful clusters of these points. This is a good time to ask a data mining professional
where the meaningful clusters are. The data mining professional may come back
with a list of behavior tags like the following, which is drawn from a slightly more
complicated scenario that includes credit behavior and returns:
A: High volume repeat customer, good credit, few product returns
B: High volume repeat customer, good credit, many product returns
C: Recent new customer, no established credit pattern
D: Occasional customer, good credit
E: Occasional customer, poor credit
F: Former good customer, not seen recently
G: Frequent window shopper, mostly unproductive
H: Other
Figure 8-5: Recency, frequency, intensity (RFI) cube. Each of the three axes (recency, frequency, and intensity) is scaled in quintiles from 1 (lowest) to 5 (highest).
Now you can look at the customers’ time series data and associate each customer
in each reporting period with the nearest cluster. The data miner can help do this.
Thus, the last 10 observations of a customer named John Doe could look like:
John Doe: C C C D D A A A B B
This time series of behavior tags is unusual because although it comes from
a regular periodic measurement process, the observed “values” are textual. The
behavior tags are not numeric and cannot be computed or averaged, but they can
be queried. For example, you may want to find all the customers who were an A
sometime in the fifth, fourth, or third prior period and were a B in the second or first
prior period. Perhaps you are concerned by progressions like this and fear losing a
valuable customer because of the increasing number of returns.
Behavior tags should not be stored as regular facts. The main use of behavior tags
is formulating complex query patterns like the example in the previous paragraph.
If the behavior tags were stored in separate fact rows, such querying would be
extremely difficult, requiring a cascade of correlated subqueries. The recommended
way to handle behavior tags is to build an explicit time series of attributes in the
customer dimension. This is another example of a positional design. BI interfaces
are simple because the columns are in the same table, and performance is good
because you can build bitmapped indexes on them.
In addition to the separate columns for each behavior tag time period, it would
be a good idea to create a single attribute with all the behavior tags concatenated
together, such as CCCDDAAABB. This column would support wild card searches
for exotic patterns, such as "a D followed by a B."
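Both query styles are straightforward in SQL. The sketch below assumes hypothetical column names (behavior_tag_period_1 through behavior_tag_period_5 for the most recent prior periods, and behavior_tag_history for the concatenated string):

-- Customers who were an A in the fifth, fourth, or third prior period
-- and a B in the second or first prior period (positional columns).
SELECT customer_id
FROM customer_dim
WHERE 'A' IN (behavior_tag_period_5, behavior_tag_period_4, behavior_tag_period_3)
  AND 'B' IN (behavior_tag_period_2, behavior_tag_period_1);

-- Wild card search on the concatenated tag string: a D followed eventually by a B.
SELECT customer_id
FROM customer_dim
WHERE behavior_tag_history LIKE '%D%B%';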
NOTE In addition to the customer dimension’s time series of behavior tags, it
would be reasonable to include the contemporary behavior tag value in a mini-
dimension to analyze facts by the behavior tag in effect when the fact row was
loaded.
Relationship Between Data Mining and DW/BI System
The data mining team can be a great client of the data warehouse, and especially
great users of customer behavior data. However, there can be a mismatch between
the velocity at which the data warehouse can deliver data and the velocity at which
the data miners can consume it. For example, a decision tree tool can process
hundreds of records per second, but a big drill-across report that produces "customer
observations" can never deliver data at such speeds. Consider the following seven-way
drill-across report that might produce millions of customer observations from census,
demographic, external credit, internal credit, purchases, returns, and website data:
SELECT Customer Identifier, Census Tract, City, County, State,
Postal Code, Demographic Cluster, Age, Sex, Marital Status,
Years of Residency, Number of Dependents, Employment Profile,
Education Profile, Sports Magazine Reader Flag,
Personal Computer Owner Flag, Cellular Telephone Owner Flag,
Current Credit Rating, Worst Historical Credit Rating,
Best Historical Credit Rating, Date First Purchase,
Date Last Purchase, Number Purchases Last Year,
Change in Number Purchases vs. Previous Year,
Total Number Purchases Lifetime, Total Value Purchases Lifetime,
Number Returned Purchases Lifetime, Maximum Debt,
Average Age Customer's Debt Lifetime, Number Late Payments,
Number Fully Paid, Times Visited Web Site,
Change in Frequency of Web Site Access,
Number of Pages Visited Per Session,
Average Dwell Time Per Session, Number Web Product Orders,
Value Web Product Orders, Number Web Site Visits to Partner Web
Sites, Change in Partner Web Site Visits
FROM *** WHERE *** ORDER BY *** GROUP BY ***
Data mining teams would love this data! For example, a big file of millions of these
observations could be analyzed by a decision tree tool, where the tool is aimed
at the Total Value Purchases Lifetime column listed above. In this
analysis, the decision tree tool would determine which of the other columns "predict
the variance" of the target field. Maybe the answer is Best Historical Credit Rating
and Number of Dependents. Armed with this answer, the enterprise now has a
simple way to predict who is going to be a good lifetime customer, without needing
to know all the other data content.
But the data mining team wants to use these observations over and over for
di erent kinds of analyses perhaps with neural networks or case-based reasoning
tools. Rather than producing this answer set on demand as a big, expensive query,
this answer set should be written to a file and given to the data mining team to
analyze on its servers.
Counts with Type 2 Dimension Changes
Businesses frequently want to count customers based on their attributes without
joining to a fact table. If you used type 2 to track customer dimension changes,
you need to be careful to avoid overcounting because you may have multiple rows
in the customer dimension for the same individual. Doing a COUNT DISTINCT on a
unique customer identifier is a possibility, assuming the attribute is indeed unique
and durable. A current row indicator in the customer dimension is also helpful to
do counts based on the most up-to-date descriptive values for a customer.
Things get more complicated if you need to do a customer count at a given historical
point in time using effective and expiration dates in the customer dimension. For
example, if you need to know the number of customers you had at the beginning of
2013, you could constrain the row effective date <= '1/1/2013' and row expiration
date >= '1/1/2013' to restrict the result set to only those rows that were valid on
1/1/2013. Note the comparison operators are dependent on the business rules used
to set the row effective/expiration dates. In this example, the row expiration date on
the no-longer-valid customer row is 1 day less than the effective date on the new row.
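As a sketch, assuming the expiration date is set one day before the next row's effective date as just described, and using illustrative column names:

-- Distinct customer count as of January 1, 2013, against a type 2 customer dimension.
SELECT COUNT(DISTINCT customer_id) AS customer_count
FROM customer_dim
WHERE row_effective_date <= DATE '2013-01-01'
  AND row_expiration_date >= DATE '2013-01-01';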
Outrigger for Low Cardinality Attribute Set
In Chapter 3: Retail Sales, we encouraged designers to avoid snowflaking, where low
cardinality columns in the dimension are removed to separate normalized tables,
which then link back into the original dimension table. Generally, snowflaking is
not recommended in a DW/BI environment because it almost always makes the
user presentation more complex, in addition to negatively impacting browsing
performance. In spite of this prohibition against snowflaking, there are some special
situations in which it is permissible to build a dimension outrigger that begins to
look like a snowflaked table.
In Figure 8-6, the dimension outrigger is a set of data from an external data provider
consisting of 150 demographic and socio-economic attributes regarding the
customers' county of residence. The data for all customers residing in a given county
is identical. Rather than repeating this large block of data for every customer within
a county, opt to model it as an outrigger. There are several reasons for bending the
"no snowflake" rule. First, the demographic data is available at a significantly different
grain than the primary dimension data, and it's not as analytically valuable.
It is loaded at different times than the rest of the data in the customer dimension.
Also, you save significant space in this case if the underlying customer dimension
is large. If you have a query tool that insists on a classic star schema with no
snowflakes, the outrigger can be hidden under a view declaration.
County Demographics Outrigger Dimension
County Demographics Key (PK)
Total Population
Population under 5 Years
% Population under 5 Years
Population under 18 Years
% Population under 18 Years
Population 65 Years and Older
% Population 65 Years and Older
Female Population
% Female Population
Male Population
% Male Population
Number of High School Graduates
Number of College Graduates
Number of Housing Units
Home Ownership Rate
...
Customer Dimension
Customer Key (PK)
Customer ID (Natural Key)
Customer Salutation
Customer First Name
Customer Surname
Customer City
Customer County
County Demographics Key (FK)
Customer State
...
Fact Table
Customer Key (FK)
More FKs ...
Facts ...
Figure 8-6: Dimension outrigger for cluster of low cardinality attributes.
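For example, the join to the outrigger can be buried in a view so BI tools see a single flat customer dimension. The sketch below uses the Figure 8-6 tables with illustrative snake_case column spellings and an abbreviated column list:

-- Hide the county demographics outrigger behind a view.
CREATE VIEW customer_dim_flat AS
SELECT c.customer_key,
       c.customer_id,
       c.customer_city,
       c.customer_county,
       c.customer_state,
       d.total_population,
       d.home_ownership_rate
       -- ... remaining demographic attributes ...
FROM customer_dim c
JOIN county_demographics_dim d
  ON c.county_demographics_key = d.county_demographics_key;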
WARNING Dimension outriggers are permissible, but they should be the
exception rather than the rule. A red warning flag should go up if your design
is riddled with outriggers; you may have succumbed to the temptation to overly
normalize the design.
Customer Hierarchy Considerations
One of the most challenging aspects of dealing with commercial customers is mod-
eling their internal organizational hierarchy. Commercial customers often have a
nested hierarchy of entities ranging from individual locations or organizations up
through regional offices, business unit headquarters, and ultimate parent companies.
These hierarchical relationships may change frequently as customers reorganize
themselves internally or are involved in acquisitions and divestitures.
NOTE In Chapter 7: Accounting, we described how to handle fixed hierarchies,
slightly variable hierarchies, and ragged hierarchies of indeterminate depth.
Chapter 7 focuses on financial cost center rollups, but the techniques transfer
directly to customer hierarchies. If you skipped Chapter 7, you need to backtrack
to read that chapter to make sense of the following recommendations.
Although it is relatively uncommon, the lucky ones amongst us are sometimes
confronted with a customer hierarchy that has a highly predictable, fixed number
of levels. Suppose you track a maximum of three rollup levels, such as the ultimate
corporate parent, business unit headquarters, and regional headquarters. In this
case, you have three distinct attributes in the customer dimension corresponding
to these three levels. For commercial customers with complicated organizational
hierarchies, you'd populate all three levels to appropriately represent the three
different entities involved at each rollup level. This is the fixed depth hierarchy
approach from Chapter 7.
By contrast, if another customer had a mixture of one-, two-, and three-level
organizations, you'd duplicate the lower-level value to populate the higher-level
attributes. In this way, all regional headquarters would sum to the sum of all business
unit headquarters, which would sum to the sum of all ultimate corporate parents.
You can report by any level of the hierarchy and see the complete customer base
represented. This is the slightly variable hierarchy approach.
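A tiny sketch of this duplication rule, with made-up company names and a simplified column list; it is illustrative only:

-- Slightly variable hierarchy: lower-level values are copied upward so every row
-- is populated at all three levels (hypothetical data).
INSERT INTO customer_dim
  (customer_key, customer_name, regional_hq, business_unit_hq, ultimate_parent)
VALUES
  (1001, 'Acme West', 'Acme West Region', 'Acme Industrial', 'Acme Holdings'),
  (1002, 'Brite Paints', 'Brite Paints', 'Brite Paints', 'Brite Paints');
-- The second row is a one-level organization, so its own name is duplicated into
-- the higher-level attributes.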
But in many cases, complex commercial customer hierarchies are ragged hierarchies
with an indeterminate depth, so you must use a ragged hierarchy modeling
technique, as described in Chapter 7. For example, if a utility company is devising a
custom rate plan for all the utility consumers that are part of a huge customer with
many levels of offices, branch locations, manufacturing locations, and sales locations,
you cannot use a fixed hierarchy. As pointed out in Chapter 7, the worst design
is a set of generically named levels such as Level-1, Level-2, and so on. This makes for
an unusable customer dimension because you don't know how to constrain against
these levels when you have a ragged hierarchy of indeterminate depth.
Bridge Tables for Multivalued Dimensions
A fundamental tenet of dimensional modeling is to decide on the grain of the fact
table, and then carefully add dimensions and facts to the design that are true to
the grain. For example, if you record customer purchase transactions, the grain of
the individual purchase is natural and physically compelling. You do not want to
change that grain. Thus you normally require any dimension attached to this fact
table to take on a single value because then there's a clean single foreign key in the
fact table that identifies a single member of the dimension. Dimensions such as
the customer, location, product or service, and time are always single valued. But
you may have some "problem" dimensions that take on multiple values at the grain
of the individual transaction. Common examples of these multivalued dimensions
include:
Demographic descriptors drawn from a multiplicity of sources
Contact addresses for a commercial customer
Professional skills of a job applicant
Hobbies of an individual
Diagnoses or symptoms of a patient
Optional features for an automobile or truck
Joint account holders in a bank account
Tenants in a rental property
When faced with a multivalued dimension, there are two basic choices: a posi-
tional design or bridge table design. Positional designs are very attractive because
the multivalued dimension is spread out into named columns that are easy to query.
For example, if modeling the hobbies of an individual as previously mentioned,
you could have a hobby dimension with named columns for all the hobbies gath-
ered from your customers, including stamp collecting, coin collecting, astronomy,
photography, and many others! Immediately you can see the problem. The posi-
tional design approach isn’t very scalable. You can easily run out of columns in
your database, and it is awkward to add new columns. Also if you have a column
for every possible hobby, then any single individual’s hobby dimension row will
contain mostly null values.
The bridge table approach to multivalued dimensions is powerful but comes
with a big compromise. The bridge table removes the scalability and null value
objections because rows in the bridge table exist only if they are actually needed,
and you can add hundreds or even thousands of hobbies in the previous example.
But the resulting table design requires a complex query that must be hidden from
direct view by the business users.
WARNING Be aware that complex queries using bridge tables may require SQL
that is beyond the normal reach of BI tools.
In the next two sections, we illustrate multivalued bridge table designs that
fit with the customer-centric topics of this chapter. We will revisit multivalued
bridges in Chapter 9: Human Resources Management, Chapter 10: Financial
Services, Chapter 13: Education, Chapter 14: Healthcare, and Chapter 16: Insurance.
We’ll then describe how to build these bridges in Chapter 19: ETL Subsystems and
Techniques.
Bridge Table for Sparse Attributes
Organizations are increasingly collecting demographics and status information
about their customers, but the traditional fixed column modeling approach for
handling these attributes becomes difficult to scale with hundreds of attributes.
Positional designs have a named column for each attribute. BI tool interfaces are
easy to construct for positional attributes because the named columns are easily
presented in the tool. Because many columns contain low cardinality contents, the
query performance using these attributes can be very good if bitmapped indexes
are placed on each column. Positional designs can be scaled up to perhaps 100 or
so columns before the databases and user interfaces become awkward or hard to
maintain. Columnar databases are well suited to these kinds of designs because
new columns can be easily added with minimal disruption to the internal storage
of the data, and the low-cardinality columns containing only a few discrete values
are dramatically compressed.
When the number of different attributes grows beyond your comfort zone, and
if new attributes are added frequently, a bridge table is recommended. Ultimately,
when you have a very large and expanding set of demographics indicators, using
outriggers or mini-dimensions simply does not gracefully scale. For example, you
may collect loan application information as a set of open-ended name-value pairs,
as shown in Figure 8-7. Name-value pair data is interesting because the values can
be numeric, textual, a file pointer, a URL, or even a recursive reference to enclosed
name-value pair data.
Over a period of time, you could collect hundreds or even thousands of different
loan application variables. For a true name-value pair data source, the value field
itself can be stored as a text string to handle the open-ended modality of the values,
which is interpreted by the analysis application. In these situations, whenever
the number of variables is open-ended and unpredictable, a bridge table design is
appropriate, as shown in Figure 8-8.
Loan Application Name-Value Pair Data
Photograph: <image>
Primary Income: $72345
Other Taxable Income: $2345
Tax-Free Income: $3456
Long Term Gains: $2367
Garnished Wages: $789
Pending Judgment Potential: $555
Alimony: $666
Jointly Owned Real Estate Appraised Value: $123456
Jointly Owned Real Estate Image: <image>
Jointly Owned Real Estate MLS Listing: <URL>
Percentage Ownership Real Estate: 50
Number Dependents: 4
Pre-existing Medical Disability: Back Injury
Number of Weeks Lost to Disability: 6
Employer Disability Support Statement: <document archive>
Previous Bankruptcy Declaration Type: 11
Years Since Bankruptcy: 8
Spouse Financial Disclosure: <name-value pair>
... 100 more name-value pairs...
Figure 8-7: Loan application name-value pair data.
Loan Application Fact
Application Date Key (FK)
Applicant Key (FK)
Loan Type Key (FK)
Application ID (DD)
Loan Officer Key (FK)
Underwriter Key (FK)
Branch Key (FK)
Status Key (FK)
Application Disclosure Key (FK)
Application Disclosure Dimension
Application Disclosure Key (PK)
Application Disclosure Description
Application Disclosure Bridge
Application Disclosure Key (FK)
Disclosure Item Key (FK)
Disclosure Item Dimension
Disclosure Item Key (PK)
Item Name
Item Value Type
Item Value Text String
Figure 8-8: Bridge table for wide and sparse name-value pair data set.
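A hedged sketch of how the Figure 8-8 bridge is traversed at query time, using the figure's tables with illustrative snake_case spellings:

-- Find loan applications that disclosed a prior bankruptcy by walking
-- fact -> disclosure group -> bridge -> disclosure item.
SELECT f.application_id,
       i.item_name,
       i.item_value_text_string
FROM loan_application_fact f
JOIN application_disclosure_bridge b
  ON f.application_disclosure_key = b.application_disclosure_key
JOIN disclosure_item_dim i
  ON b.disclosure_item_key = i.disclosure_item_key
WHERE i.item_name = 'Previous Bankruptcy Declaration Type';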
Bridge Table for Multiple Customer Contacts
Large commercial customers have many points of contact, including decision mak-
ers, purchasing agents, department heads, and user liaisons; each point of contact is
associated with a specific role. Because the number of contacts is unpredictable but
possibly large, a bridge table design is a convenient way to handle this situation, as
shown in Figure 8-9. Some care should be taken not to overdo the contact dimension
and make it a dumping ground for every employee, citizen, salesperson, or other
human being the organization interacts with. Restrict the dimension to this use
case: contacts who are part of the customer relationship.
Customer Dimension
Customer Key (PK)
Customer Name
Customer Type
Customer Contact Group (FK)
Date of First Contact
...
Contact Group Dimension
Contact Group Key (PK)
Contact Group Name
Contact Group Bridge
Contact Group Key (FK)
Contact Key (FK)
Contact Role
Contact Dimension
Contact Key (PK)
Contact Name
Contact Street Address
...
Figure 8-9: Bridge table design for multiple contacts.
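The Figure 8-9 bridge is queried the same way; this sketch lists every contact and role for one commercial customer (the customer name and snake_case spellings are illustrative):

-- List all contacts and their roles for a given commercial customer.
SELECT cu.customer_name,
       co.contact_name,
       b.contact_role
FROM customer_dim cu
JOIN contact_group_bridge b ON cu.customer_contact_group = b.contact_group_key
JOIN contact_dim co ON b.contact_key = co.contact_key
WHERE cu.customer_name = 'Acme Holdings';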
Complex Customer Behavior
Customer behavior can be very complex. In this section, we'll discuss the han-
dling of customer cohort groups and capturing sequential behavior. We’ll also cover
precise timespan fact tables and tagging fact events with indicators of customer
satisfaction or abnormal scenarios.
Behavior Study Groups for Cohorts
With customer analysis, simple queries such as how much was sold to custom-
ers in this geographic area in the past year rapidly evolve to much more complex
inquiries, such as how many customers bought more this past month than their
average monthly purchase amount from last year. The latter question is too complex
for business users to express in a single SQL request. Some BI tool vendors allow
embedded subqueries, whereas others have implemented drill-across capabilities
in which complex requests are broken into multiple select statements and then
combined in a subsequent pass.
In other situations, you may want to capture the set of customers from a query or
exception report, such as the top 100 customers from last year, customers who spent
more than $1,000 last month, or customers who received a specific test solicitation, and
then use that group of customers, called a behavior study group, for subsequent analyses
without reprocessing to identify the initial condition. To create a behavior study group,
run a query (or series of queries) to identify the set of customers you want to further
analyze, and then capture the customer durable keys of the identified set as an actual
physical table consisting of a single customer key column. By leveraging the customers'
durable keys, the study group dimension is impervious to type 2 changes to the
customer dimension which may occur after the study group members are identified.
NOTE The secret to building complex behavioral study group queries is to
capture the keys of the customers or products whose behavior you are tracking.
You then use the captured keys to subsequently constrain other fact tables without
having to rerun the original behavior analysis.
You can now use this special behavior study group dimension table of customer
keys whenever you want to constrain any analysis on any table to that set of
specially defined customers. The only requirement is that the fact table contains a
customer key reference. The use of the behavior study group dimension is shown
in Figure 8-10.
Customer Behavior Study
Group Dimension
Customer ID (Durable Key)
Customer Dimension
Customer Key (PK)
Customer ID (Durable Key)
...
POS Retail Sales Transaction Fact
Date Key (FK)
Customer Key (FK)
More FKs ...
Sales Quantity
Sales Dollar Amount
Figure 8-10: Behavior study group dimension joined to customer dimension’s
durable key.
The behavior study group dimension is attached with an equijoin to the customer
dimension's durable key (refer to Customer ID in Figure 8-10). This can even be
done in a view that hides the explicit join to the behavior dimension. In this way,
the resulting dimensional model looks and behaves like an uncomplicated star. If the
special dimension table is hidden under a view, it should be labeled to uniquely
identify it as being associated with the top 100 customers, for example. Virtually
any BI tool can now analyze this specially restricted schema without paying syn-
tax or user-interface penalties for the complex processing that defined the original
subset of customers.
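A sketch of the full round trip with hypothetical table and column names: capture the durable keys of last year's top 100 spenders into a study group table, then use that table to constrain any fact table carrying the customer key. The syntax for the top-100 limit varies by DBMS.

-- Step 1: capture durable keys of the study group (top 100 spenders in 2012).
CREATE TABLE top100_study_group AS
SELECT c.customer_id
FROM sales_fact f
JOIN customer_dim c ON f.customer_key = c.customer_key
JOIN date_dim d ON f.date_key = d.date_key
WHERE d.calendar_year = 2012
GROUP BY c.customer_id
ORDER BY SUM(f.sales_dollar_amount) DESC
FETCH FIRST 100 ROWS ONLY;

-- Step 2: constrain any analysis to the study group via the durable key.
SELECT d.calendar_year, SUM(f.sales_dollar_amount) AS study_group_sales
FROM sales_fact f
JOIN customer_dim c ON f.customer_key = c.customer_key
JOIN top100_study_group sg ON c.customer_id = sg.customer_id
JOIN date_dim d ON f.date_key = d.date_key
GROUP BY d.calendar_year;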
NOTE The exceptional simplicity of study group tables allows them to be com-
bined with union, intersection, and set difference operations. For example, a set of
problem customers this month can be intersected with the set of problem custom-
ers from last month to identify customers who were problems for two consecutive
months.
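For example, with two previously captured study group tables (the table names are illustrative):

-- Customers who were problem customers in both consecutive months.
SELECT customer_id FROM problem_customers_june
INTERSECT
SELECT customer_id FROM problem_customers_july;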
Study groups can be made even more powerful by including an occurrence date
as a second column correlated with each durable key. For example, a panel study
of consumer purchases can be conducted where consumers enter the study when
they exhibit some behavior such as switching brands of peanut butter. Then fur-
ther purchases can be tracked after the event to see if they switched brands again.
To make this work, the purchase events must carry accurate time stamps so the
behavior can be placed in the correct sequence.
Like many design decisions, this one represents certain compromises. First,
this approach requires a user interface for capturing, creating, and administering
real physical behavior study group tables in the data warehouse. After a complex
exception report has been defined, you need the ability to capture the resulting keys
into an applet to create the special behavior study group dimension. These study
group tables must live in the same space as the primary fact table because they are
going to be joined directly to the customer dimension table. This obviously affects
the DBA's responsibilities.
Step Dimension for Sequential Behavior
Most DW/BI systems do not have good examples of sequential processes. Usually
measurements are taken at a particular place watching the stream of customers or
products going by. Sequential measurements, by contrast, need to follow a customer
or a product through a series of steps, often measured by different data capture
systems. Perhaps the most familiar example of a sequential process comes from
web events where a session is constructed by collecting individual page events on
multiple web servers tied together via a customer’s cookie. Understanding where
an individual step fits in the overall sequence is a major challenge when analyzing
sequential processes.
By introducing a step dimension, you can place an individual step into the context
of an overall session, as shown in Figure 8-11.
Figure 8-11: Step dimension to capture sequential activities. The transaction fact table (Transaction Date Key (FK), Customer Key (FK), Session Key (FK), Transaction Number (DD), more FKs, facts) carries three roles of the step dimension: Session Step Key (FK), Purchase Step Key (FK), and Abandon Step Key (FK). The step dimension contains Step Key (PK), Total Number Steps, This Step Number, and Steps Until End. Sample step dimension rows:

Step Key   Total Number Steps   This Step Number   Steps Until End
    1              1                    1                 0
    2              2                    1                 1
    3              2                    2                 0
    4              3                    1                 2
    5              3                    2                 1
    6              3                    3                 0
    7              4                    1                 3
    8              4                    2                 2
    9              4                    3                 1
   10              4                    4                 0
The step dimension is an abstract dimension defined in advance. The first row in
the dimension is used only for one-step sessions, where the current step is the first
step and there are no more steps remaining. The next two rows in the step dimension
are used for two-step sessions. The first row (Step Key = 2) is for step number 1 where
there is one more step to go, and the next row (Step Key = 3) is for step number 2
where there are no more steps. The step dimension can be prebuilt to accommodate
sessions of at least 100 steps. In Figure 8-11 you can see the step dimension can be
associated with a transaction fact table whose grain is the individual page event. In this
example, the step dimension has three roles. The first role is the overall session. The
second role is a successful purchase subsession, where a sequence of page events
leads to a confirmed purchase. The third role is the abandoned shopping cart, where
the sequence of page events is terminated without a purchase.
Using the step dimension, a specific page can immediately be placed into one or
more understandable contexts (overall session, successful purchase, and abandoned
shopping cart). But even more interestingly, a query can constrain exclusively
to the first page of successful purchases. This is a classic web event query, where
the "attractant" page of successful sessions is identified. Conversely, a query could
constrain exclusively to the last page of abandoned shopping carts, where the
customer is about to decide to go elsewhere.
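A sketch of the "attractant page" query using the purchase-step role from Figure 8-11; the page dimension and snake_case spellings are assumptions:

-- Count successful-purchase subsessions by their first page.
SELECT p.page_description,
       COUNT(*) AS first_page_count
FROM page_event_fact f
JOIN step_dim s ON f.purchase_step_key = s.step_key
JOIN page_dim p ON f.page_key = p.page_key
WHERE s.this_step_number = 1
GROUP BY p.page_description
ORDER BY first_page_count DESC;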
Another approach for modeling sequential behavior takes advantage of specific
fixed codes for each possible step. If you track customer product purchases in a
retail environment, and if each product can be encoded, for instance, as a 5-digit
number, then you can create a single wide text column for each customer with the
sequence of product codes. You separate the codes with a unique non-numeric
character. Such a sequence might look like:
11254|45882|53340|74934|21399|93636|36217|87952|…etc.
Now using wild cards you can search for specific products bought sequentially,
or bought with other products intervening, or situations in which one product was
bought but another was never bought. Modern relational DBMSs can store and
process wide text fields of 64,000 characters or more with wild card searches.
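Under the hypothetical encoding above, such patterns translate directly into LIKE predicates on an assumed product_sequence column:

-- Product 53340 bought at some point before product 74934 (other products may intervene).
SELECT customer_id
FROM customer_dim
WHERE product_sequence LIKE '%53340%74934%';

-- Product 53340 bought, but product 74934 never bought.
SELECT customer_id
FROM customer_dim
WHERE product_sequence LIKE '%53340%'
  AND product_sequence NOT LIKE '%74934%';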
Timespan Fact Tables
In more operational applications, you may want to retrieve the exact status of a cus-
tomer at some arbitrary instant in the past. Was the customer on fraud alert when
denied an extension of credit? How long had he been on fraud alert? How many
times in the past two years has he been on fraud alert? How many customers were on
fraud alert at some point in the past two years? All these questions can be addressed
if you carefully manage the transaction fact table containing all customer events. The
key modeling step is to include a pair of date/time stamps, as shown in Figure 8-12.
The first date/time stamp is the precise moment of the transaction, and the second
date/time stamp is the exact moment of the next transaction. If this is done correctly,
then the time history of customer transactions maintains an unbroken sequence
of date/time stamps with no gaps. Each actual transaction enables you to associate
both demographics and status with the customer. Dense transaction fact tables are
interesting because you potentially can change the demographics and especially
the status each time a transaction occurs.
Date Dimension
Demographics Dimension
Customer Dimension
Status Dimension
Customer Transaction Fact
Transaction Date Key (FK)
Customer Key (FK)
Demographics Key (FK)
Status Key (FK)
Transaction Number (DD)
More FKs ...
Begin Effective Date/Time
End Effective Date/Time
Amount
Figure 8-12: Twin date/time stamps in a timespan fact table.
The critical insight is that the pair of date/time stamps on a given transaction
defines a span of time in which the demographics and the status are constant.
Queries can take advantage of this “quiet” span of time. Thus if you want to know
what the status of the customer “Jane Smith” was on July 18, 2013 at 6:33 am, you
can issue the following query:
Select Customer_dim.Customer_Name, Status_dim.Status_Description
From Transaction_Fact, Customer_dim, Status_dim
Where Transaction_Fact.Customer_Key = Customer_dim.Customer_Key
And Transaction_Fact.Status_Key = Status_dim.Status_Key
And Customer_dim.Customer_Name = 'Jane Smith'
And #July 18, 2013 6:33:00# >= Transaction_Fact.Begin_Eff_DateTime
And #July 18, 2013 6:33:00# < Transaction_Fact.End_Eff_DateTime
These date/time stamps can be used to perform tricky queries on your customer
base. If you want to find all the customers who were on fraud alert sometime in the
year 2013, issue the following query:
Select Customer_dim.Customer_Name
From Transaction_Fact, Customer_dim, Status_dim
Where <joins>
And Status_dim.Status_Description = 'Fraud Alert'
And Transaction_Fact.Begin_Eff_DateTime <= 12/31/2013:23:59:59
And Transaction_Fact.End_Eff_DateTime >= 1/1/2013:0:0:0
Amazingly, this one query handles all the possible cases of begin and end effective
date/times straddling the beginning or end of 2013, being entirely contained
within 2013, or completely straddling 2013.
You can even count the number of days each customer was on fraud alert in 2013:
Select Customer_dim.Customer_Name,
sum(least(12/31/2013:23:59:59, Transaction_Fact.End_Eff_DateTime)
- greatest(1/1/2013:0:0:0, Transaction_Fact.Begin_Eff_DateTime))
From Transaction_Fact, Customer_dim, Status_dim
Where <joins>
And Status_dim.Status_Description = 'Fraud Alert'
And Transaction_Fact.Begin_Eff_DateTime <= 12/31/2013:23:59:59
And Transaction_Fact.End_Eff_DateTime >= 1/1/2013:0:0:0
Group By Customer_dim.Customer_Name
Back Room Administration of Dual Date/Time Stamps
For a given customer, the date/time stamps on the sequence of transactions must
form a perfect unbroken sequence with no gaps. It is tempting to make the end
effective date/time stamp one "tick" less than the begin effective date/time stamp
of the next transaction, so the query SQL can use the BETWEEN syntax rather
than the uglier constraints shown above. However, in many situations the little gap
defined by that tick could be significant if a transaction could fall within the gap.
By making the end effective date/time exactly equal to the begin date/time of the
next transaction, you eliminate this risk.
Using the pair of date/time stamps requires a two-step process whenever a new
transaction row is entered. In the first step, the end effective date/time stamp of
the new, most current transaction is set to a fictitious date/time far in the future.
Although it would be semantically correct to insert NULL for this date/time, nulls
become a headache when you encounter them in constraints because they can cause
a database error when you ask if the field is equal to a specific value. By using a
fictitious date/time far in the future, this problem is avoided.
In the second step, after the new transaction is entered into the database, the ETL
process must retrieve the previous transaction and set its end effective date/time to
the date/time of the newly entered transaction. Although this two-step process is a
noticeable cost of this twin date/time approach, it is a classic and desirable trade-off
between extra ETL overhead in the back room and reduced query complexity
in the front room.
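The two steps sketched in SQL with illustrative names; the far-future sentinel value is an assumption, not a mandated choice:

-- Step 1: insert the new transaction with a fictitious far-future end stamp.
INSERT INTO customer_transaction_fact
  (customer_key, transaction_number, begin_eff_datetime, end_eff_datetime, amount)
VALUES
  (1001, 98765, TIMESTAMP '2013-07-18 06:33:00', TIMESTAMP '9999-12-31 23:59:59', 250.00);

-- Step 2: close out the previously current row for the same customer; the strict
-- inequality on the begin stamp excludes the row just inserted.
UPDATE customer_transaction_fact
SET end_eff_datetime = TIMESTAMP '2013-07-18 06:33:00'
WHERE customer_key = 1001
  AND end_eff_datetime = TIMESTAMP '9999-12-31 23:59:59'
  AND begin_eff_datetime < TIMESTAMP '2013-07-18 06:33:00';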
Tagging Fact Tables with Satisfaction Indicators
Although profitability might be the most important key performance indicator in
many organizations, customer satisfaction is a close second. And in organizations
without profit metrics, such as government agencies, satisfaction is (or should be)
number one.
Satisfaction, like profitability, requires integration across many sources.
Virtually every customer-facing process is a potential source of satisfaction
information, whether the source is sales, returns, customer support, billing, website
activity, social media, or even geopositioning data.
Satisfaction data can be either numeric or textual. In Chapter 6: Order
Management, you saw how classic measures of customer satisfaction could be
modeled both ways simultaneously. The on-time measures could be both additive
numeric facts as well as textual attributes in a service level dimension. Other purely
numeric measures of satisfaction include numbers of product returns, numbers
of lost customers, numbers of support calls, and product attitude metrics from
social media.
Figure 8-13 illustrates a frequent flyer satisfaction dimension that could be added
to the flight activity fact tables described in Chapter 12. Textual satisfaction data is
generally modeled in two ways, depending on the number of satisfaction attributes
and the sparsity of the incoming data. When the list of satisfaction attributes is
bounded and reasonably stable, a positional design is very effective, as shown in
Figure 8-13.
Satisfaction Dimension
Satisfaction Key (PK)
Delayed Arrival Indicator
Diversion to Other Airport Indicator
Lost Luggage Indicator
Failure to Get Upgrade Indicator
Middle Seat Indicator
Personnel Problem Indicator
Figure 8-13: Positional satisfaction dimension for airline frequent flyers.
Tagging Fact Tables with Abnormal
Scenario Indicators
Accumulating snapshot fact tables depend on a series of dates that implement the
"standard scenario" for the pipeline process. For order fulfillment, you may have
the steps of order created, order shipped, order delivered, order paid, and order
returned as standard steps in the order scenario. This kind of design is successful
when 90 percent or more of the orders progress through these steps (hopefully
without the return) without any unusual exceptions.
But if an occasional situation deviates from the standard scenario, you don't
have a good way to reveal what happened. For example, maybe when the order
was shipped, the delivery truck had a flat tire. A decision was made to unload the
delivery to another truck, but unfortunately it began to rain and the shipment was
water damaged. Then it was refused by the customer, and ultimately there was a
lawsuit. None of these unusual steps are modeled in the standard scenario in the
accumulating snapshot. Nor should they be!
The way to describe unusual departures from the standard scenario is to add
a delivery status dimension to the accumulating snapshot fact table. For the case
of the weird delivery scenario, you tag this order fulfillment row with the status
Weird. Then if the analyst wants to see the complete story, the analyst can join to a
companion transaction fact table through the order number and line number that
has every step of the story. The transaction fact table joins to a transaction dimension,
which indeed has Flat Tire, Damaged Shipment, and Lawsuit as transactions.
Even though this transaction dimension will grow over time with unusual entries,
it is well bounded and stable.
Customer Data Integration Approaches
In typical environments with many customer-facing processes, you need to choose
between two approaches: a single customer dimension derived from all the versions
of customer source system records or multiple customer dimensions tied together
by conformed attributes.
Master Data Management Creating a Single
Customer Dimension
In some cases, you can build a single customer dimension that is the "best of breed"
choice among a number of available customer data sources. It is likely that such
a conformed customer dimension is a distillation of data from several operational
systems within your organization. But it would be typical for a unique customer to
have multiple identifiers in multiple touch point systems. To make matters worse,
data entry systems often don't incorporate adequate validation rules. Obviously, an
operational CRM objective is to create a unique customer identifier and restrict the
creation of unnecessary identifiers. In the meantime, the DW/BI team will likely
be responsible for sorting out and integrating the disparate sources of customer
information.
Some organizations are lucky enough to have a centralized master data manage-
ment (MDM) system that takes responsibility for creating and controlling the single
enterprise-wide customer entity. But such centralization is rare in the real world.
More frequently, the data warehouse extracts multiple incompatible customer data
files and builds a "downstream" MDM system. These two styles of MDM are
illustrated in Figure 8-14.
Figure 8-14: Two styles of master data management. In the enterprise MDM style, the operational applications feed a centralized MDM system that in turn supplies the EDW; in the downstream MDM style, the operational applications feed the EDW directly, and the MDM system is built downstream in the data warehouse.
Unfortunately, there's no secret weapon for tackling this data consolidation. The
attributes in the customer dimension should represent the "best" source available
in the enterprise. A national change of address (NCOA) process should be integrated
to ensure address changes are captured. Much of the heavy lifting associated with
customer data consolidation demands customer matching or deduplicating logic.
Removing duplicates or invalid addresses from large customer lists is critical to
eliminate the costs associated with redundant, misdirected, or undeliverable
communication, avoid misleading customer counts, and improve customer satisfaction
through higher quality communication.
The science of customer matching is more sophisticated than it might first appear.
It involves fuzzy logic, address parsing algorithms, and enormous look-up directories
to validate address elements and postal codes, which vary significantly by
country. There are specialized, commercially available software and service offerings
that perform individual customer or commercial entity matching with remarkable
accuracy. Often these products match the address components to standardized
census codes, such as state codes, county codes, census tracts, block groups,
metropolitan statistical areas (MSAs), and latitude/longitude, which facilitate the merging
of external data. As discussed in Chapter 10, there are also householding capabilities
to group or link customers sharing similar name and/or address information. Rather
than merely performing intrafile matching, some services maintain an enormous
external reference file of everyone in the United States to match against. Although
these products and services are expensive and/or complex, it's worthwhile to make
the investment if customer matching is strategic to the organization. In the end,
effective consolidation of customer data depends on a balance of capturing the data
as accurately as possible in the source systems, coupled with powerful data
cleansing/merging tools in the ETL process.
Partial Conformity of Multiple Customer Dimensions
Enterprises today build customer knowledge stores that collect all the internal and
external customer-facing data sources they can find. A large organization could
have as many as 20 internal data sources and 50 or more external data sources,
all of which relate in some way to the customer. These sources can vary wildly in
granularity and consistency. Of course, there is no guaranteed high-quality customer
key defined across all these data sources and no consistent attributes. You don't have
any control over these sources. It seems like a hopeless mess.
In Chapter 4: Inventory, we laid the groundwork for conformed dimensions,
which are the required glue for achieving integration across separate data sources.
In the ideal case, you examine all the data sources and define a single comprehensive
dimension which you attach to all the data sources, either simultaneously
within a single tablespace or by replicating across multiple tablespaces. Such a
single comprehensive conformed dimension becomes a wonderful driver for creating
integrated queries, analyses, and reports by making consistent row labels available
for drill-across queries.
But in the extreme integration world with dozens of customer-related dimensions
of different granularity and different quality, such a single comprehensive customer
dimension is impossible to build. Fortunately, you can implement a lighter weight
kind of conformed customer dimension. Remember the essential requirement for
two dimensions to be conformed is they share one or more specially administered
attributes that have the same column names and data values. Instead of requiring
dozens of customer-related dimensions to be identical, you only require they share
the specially administered conformed attributes.
Not only have you taken the pressure off the data warehouse by relaxing the
requirement that all the customer dimensions in your environment be equal from
top to bottom, but in addition you can proceed in an incremental and agile way
to plant the specially administered conformed attributes in each of the customer-
related dimensions. For example, suppose you start by defining a fairly high-level
categorization of customers called customer category. You can proceed methodically
across all the customer-related dimensions, planting this attribute in each dimension
without changing the grain of any target dimension and without invalidating any
existing applications that depend on those dimensions. Over a period of time, you
gradually increase the scope of integration as you add the special attributes to the
separate customer dimensions attached to different sources. At any point in time,
you can stop and perform drill-across reports using the dimensions where you have
inserted the customer category attribute.
When the customer category attribute has been inserted into as many of the
customer-related dimensions as possible, you can then define more conformed attri-
butes. Geographic attributes such as city, county, state, and country should be even
easier than the customer category. Over a period of time, the scope and power of the
conformed customer dimensions let you do increasingly sophisticated analyses. This
incremental development with its closely spaced deliverables fits an agile approach.
Avoiding Fact-to-Fact Table Joins
DW/BI systems should be built process-by-process, not department-by-department,
on a foundation of conformed dimensions to support integration. You can imagine
querying the sales or support fact tables to better understand a customer's purchase
or service history.
Because the sales and support tables both contain a customer foreign key, you
can further imagine joining both fact tables to a common customer dimension to
simultaneously summarize sales facts along with support facts for a given customer.
Unfortunately, the many-to-one-to-many join will return the wrong answer in a
relational environment due to the differences in fact table cardinality, even when
the relational database is working perfectly. There is no combination of inner, outer,
left, or right joins that produces the desired answer when the two fact tables have
incompatible cardinalities.
Consider the case in which you have a fact table of customer solicitations,
and another fact table with the customer responses to solicitations, as shown in
Figure 8-15. There is a one-to-many relationship between customer and solicita-
tion, and another one-to-many relationship between customer and response. The
solicitation and response fact tables have different cardinalities; in other words, not
every solicitation results in a response (unfortunately for the marketing department)
and some responses are received for which there is no solicitation. Simultaneously
joining the solicitations fact table to the customer dimension, which is, in turn,
joined to the responses fact table, does not return the correct answer in a relational
DBMS due to the cardinality differences. Fortunately, this problem is easily avoided.
You simply apply the drill-across technique explained in Chapter 4 to query the
solicitations table and responses table in separate queries and then outer join the two
answer sets. The drill-across approach has additional benefits for better controlling
performance parameters, in addition to supporting queries that combine data from
fact tables in different physical locations.
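A sketch of that drill-across pattern for the Figure 8-15 tables: aggregate each fact table separately to a common grain, then full outer join the answer sets (the snake_case spellings are illustrative):

-- Drill across: aggregate each fact table separately, then merge the answer sets.
WITH solicitation_totals AS (
    SELECT c.customer_id, COUNT(*) AS solicitation_count
    FROM customer_solicitation_fact s
    JOIN customer_dim c ON s.customer_key = c.customer_key
    GROUP BY c.customer_id
), response_totals AS (
    SELECT c.customer_id, COUNT(*) AS response_count
    FROM customer_response_fact r
    JOIN customer_dim c ON r.customer_key = c.customer_key
    GROUP BY c.customer_id
)
SELECT COALESCE(st.customer_id, rt.customer_id) AS customer_id,
       COALESCE(st.solicitation_count, 0) AS solicitation_count,
       COALESCE(rt.response_count, 0) AS response_count
FROM solicitation_totals st
FULL OUTER JOIN response_totals rt ON st.customer_id = rt.customer_id;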
Customer Solicitation Fact
Solicitation Date Key (FK)
Customer Key (FK)
More FKs ...
Solicitation Facts ...
Customer Response Fact
Response Date Key (FK)
Customer Key (FK)
More FKs ...
Response Facts ...
Customer Dimension
Customer Key (PK)
Customer ID (Natural Key)
...
Figure 8-15: Many-to-one-to-many joined tables should not be queried with a single
SELECT statement.
WARNING Be very careful when simultaneously joining a single dimension
table to two fact tables of different cardinality. In many cases, relational engines
return the "wrong" answer.
If business users are frequently combining data from multiple business processes,
a final approach is to define an additional fact table that combines the data
once into a consolidated fact table, rather than relying on users to consistently
and accurately combine the data on their own, as described in Chapter 7. Merely
using SQL to drill across fact tables to combine the results makes more sense when
the underlying processes are less closely correlated. Of course, when constructing the
consolidated fact table, you still need to establish business rules to deal with
the differing cardinality. For example, does the consolidated fact table include all the
solicitations and responses or only those where both a solicitation and a response
occurred?
Low Latency Reality Check
The behavior of a customer in the last few hours or minutes can be extremely inter-
esting. You may even want to make decisions while dealing with the customer in
real time. But you need to be thoughtful in recognizing the costs and limitations
of low latency data. Generally, data quality suffers as the data is delivered closer
to real time.
Business users may automatically think that the faster the information arrives in
the DW/BI system, the better. But decreasing the latency increases the data quality
problems. Figure 20-6 summarizes the issues that arise as data is delivered faster.
In the conventional batch world, perhaps downloading a batch file once every 24
hours, you typically get complete transaction sets. For example, if a commercial
customer places an order, they may have to pass a credit check and verify the final
commitment. The batch download includes orders only where all these steps have
taken place. In addition, because the batch download is processed just once every 24
hours, the ETL team has the time to run the full spectrum of data quality checks,
as we'll describe in Chapter 19: ETL Subsystems and Techniques.
If the data is extracted many times per day, then the guarantee of complete
transaction sets may have to be relinquished. The customer may have placed the
order but has not passed the credit check. Thus there is the possibility that results
may have to be adjusted after the fact. You also may not run the full spectrum of
data quality checks because you don’t have time for extensive multitable lookups.
Finally, you may have to post data into the data warehouse when all the keys have
not been resolved. In this case, temporary dimensional entries may need to be used
while waiting for additional data feeds.
Finally, if you deliver data instantaneously, you may be getting only transaction
fragments, and you may not have time to perform any data quality checks or other
processing of the data.
Low latency data delivery can be very valuable, but the business users need to
be informed about these trade-offs. An interesting hybrid approach is to provide
low latency intraday delivery but then revert to a batch extract at night, thereby
correcting various data problems that could not be addressed during the day. We
discuss the impact of low latency requirements on the ETL system in Chapter 20:
ETL System Design and Development Process and Tasks.
Summary
In this chapter, we focused exclusively on the customer, beginning with an over-
view of customer relationship management (CRM) basics. We then delved into
design issues surrounding the customer dimension table. We discussed name and
address parsing where operational fields are decomposed to their basic elements
so that they can be standardized and validated. We explored several other types
of common customer dimension attributes, such as dates, segmentation attributes,
and aggregated facts. Dimension outriggers that contain a large block of relatively
low-cardinality attributes were described.
This chapter introduced the use of bridge tables to handle unpredictable, sparsely
populated dimension attributes, as well as multivalued dimension attributes.
We also explored several complex customer behavior scenarios, including sequential
activities, timespan fact tables, and tagging fact events with indicators to identify
abnormal situations.
We closed the chapter by discussing alternative approaches for consistently iden-
tifying customers and consolidating a rich set of characteristics from the source
data, either via operational master data management or downstream processing in
the ETL back room with potentially partial conformity. Finally, we touched on the
challenges of low latency data requirements.
Human Resources
Management
This chapter, which focuses on human resources (HR) data, is the last in the series
dealing with cross-industry business applications. Similar to the accounting
and finance data described in Chapter 7: Accounting, HR information is dissemi-
nated broadly throughout the organization. Organizations want to better understand
their employees’ demographics, skills, earnings, and performance to maximize their
impact. In this chapter we’ll explore several dimensional modeling techniques in the
context of HR data.
Chapter 9 discusses the following concepts:
Dimension tables to track employee profile changes
Periodic headcount snapshots
Bus matrix for a snippet of HR-centric processes
Pros and cons of packaged DW/BI solutions or data models
Recursive employee hierarchies
Multivalued skill keyword attributes handled via dimension attributes, out-
riggers, or bridges
Survey questionnaire data
Text comments
Employee Profile Tracking
Thus far the dimensional models we have designed closely resemble each other;
the fact tables contain key performance metrics that typically can be added across
all the dimensions. It is easy for dimensional modelers to get lulled into a kind of
additive complacency. In most cases, this is exactly how it is supposed to work.
However, with HR employee data, a robust employee dimension on its own supports
numerous metrics required by the business.
To frame the problem with a business vignette, let's assume you work in the HR
department of a large enterprise. Each employee has a detailed HR profile with
at least 100 attributes, including hire date, job grade, salary, review dates, review
outcomes, vacation entitlement, organization, education, address, insurance plan,
and many others. Employees are constantly hired, transferred, and promoted, as
well as adjusting their profiles in a variety of ways.
A high-priority business requirement is to accurately track and analyze
employee profile changes. You might immediately visualize a schema in which
each employee profile change event is captured in a transaction-grained fact table,
as depicted in Figure 9-1. The granularity of this somewhat generalized fact
table would be one row per employee profile transaction. Because no numeric
metrics are associated with changes made to employee profiles, such as a new
address or job grade promotion, the fact table is factless.
Figure 9-1: Initial draft schema for tracking employees' profile changes. The employee transaction fact table, with one row per employee profile transaction, contains Transaction Date Key (FK), Transaction Date/Time, Employee Key (FK), and Employee Transaction Type Key (FK), joined to the transaction date, employee, and employee transaction type dimensions.
In this draft schema, the dimensions include the transaction date, transaction
type, and employee. The transaction type dimension refers to the reason code that
caused the creation of this particular row, such as a promotion or address change.
The employee dimension is extremely wide with many attribute columns.
We envision using the type 2 slowly changing dimension technique for tracking
changed profile attributes within the employee dimension. Consequently, with every
employee profile transaction in the Figure 9-1 fact table, you would also create a
new type 2 row in the employee dimension that represents the employee's profile as
a result of the profile change event. This new row continues to accurately describe
the employee until the next employee transaction occurs at some indeterminate
time in the future. The alert reader is quick to point out that the employee profile
transaction fact table and type 2 employee dimension table have the same number of
rows; plus they are almost always joined to one another. At this point, dimensional
modeling alarms should be going off. You certainly don't want to have as many rows
in a fact table as you do in a related dimension table.
Instead of using the initial schema, you can simplify the design by embellishing
the employee dimension table to make it more powerful and thereby doing
away with the profile transaction event fact table. As depicted in Figure 9-2, the
employee dimension contains a snapshot of the employee profile characteristics
following the employee's profile change. The transaction type description becomes
a change reason attribute in the employee dimension to track the cause for the
profile change. In some cases, the affected characteristics are numeric. If the numeric
attributes are summarized rather than simply constrained upon, they belong in a
fact table instead.
Figure 9-2: Employee dimension with profile characteristics.
[The dimension includes the surrogate employee key, durable employee ID, name, address, job grade, salary, education, original hire date, last review date, appraisal rating, health insurance plan, vacation plan, change reason code and description, row effective and expiration date/times, and a current row indicator.]
As you'd expect, the surrogate employee key is the primary key of the dimen-
sion table; the durable natural employee ID used in the HR operational system to
persistently identify an employee is included as a dimension attribute.
Precise Effective and Expiration Timespans
As discussed in Chapter 5: Procurement with the coverage of slowly changing
dimension techniques, you should include two columns on the employee dimension
to capture when a specific row is effective and then expired. These columns define
a precise timespan during which the employee's profile is accurate. Historically,
when daily data latency was the norm, the effective and expiration columns were
dates. However, if you load data from any business process on a more frequent basis,
the columns should be date/time stamps so that you can associate the appropriate
employee profile row, which may differ between 9 a.m. and 9 p.m. on the same day,
to operational events.
The expiration attribute for the current row is set to a future date. When the row
needs to be expired because the ETL system has detected a new profile of attributes,
the expiration attribute is typically set to "just before" the new row's effective stamp,
meaning either the prior day, minute, or second.
If the employee's profile is accurately changed for a period of time, and then the
employee reverts back to an earlier set of characteristics, a new employee dimension
row is inserted. You should resist the urge to simply revisit the earlier profile
row and modify the expiration date because multiple dimension rows would be
effective at the same time.
The current row indicator enables the most recent status of any employee to be
retrieved quickly. If a new profile row occurs for this employee, the indicator in the
former profile row needs to be updated to indicate it is no longer the current profile.
On its own, a date/time stamped type 2 employee dimension answers a number
of interesting HR inquiries. You can choose an exact historical point in time and ask
how many employees you have and what their detailed profiles were at that specific
moment by constraining the date/time to be equal to or greater than the effective
date/time and strictly less than the expiration date/time. The query can perform
counts and constraints against all the rows returned from these constraints.
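For instance, a point-in-time headcount query might look like the following sketch. The table and column names (employee_dimension, row_effective_datetime, row_expiration_datetime) are hypothetical, not a specific implementation:

-- Employee headcount by job grade as of an exact historical moment
SELECT job_grade,
       COUNT(*) AS employee_count
FROM employee_dimension
WHERE row_effective_datetime <= TIMESTAMP '2013-06-30 17:00:00'
  AND row_expiration_datetime > TIMESTAMP '2013-06-30 17:00:00'
GROUP BY job_grade;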
Dimension Change Reason Tracking
When a dimension row contains type 2 attributes, you can embellish it with a change
reason. In this way, some ETL-centric metadata is embedded with the actual data.
The change reason attribute could contain a two-character abbreviation for each
changed attribute on a dimension row. For example, the change reason attribute
value for a last name change could be LN or a more legible value, such as Last
Name, depending on the intended usage and audience. If someone asks how many
peoplechanged ZIP codes last year, the SELECT statement would include a LIKE
operator and wild cards, such as "WHERE ChangeReason LIKE '%ZIP%’".
Because multiple dimension attributes may change concurrently and be repre-
sented by a single new row in the dimension, the change reason would be multi-
valued. As we’ll explore later in the chapter when discussing employee skills, the
multiple reason codes could be handled as a single text string attribute, such as
|Last Name|ZIP| or via a multivalued bridge table.
NOTE The effective and expiration date/time stamps, along with a reason code
description, on each row of a type 2 slowly changing dimension allow very precise
time slicing of the dimension by itself.
Finally, employee profile changes may be captured in the underlying source
system by a set of micro-transactions corresponding to each individual employee
attribute change. In the DW/BI system, you may want to encapsulate the series of
micro-transactions from the source system and treat them as a super transaction,
such as an employee promotion, because it would be silly to treat these artificial
micro-transactions as separate type 2 changes. The new type 2 employee dimension
row would reflect all the relevant changed attributes in one step. Identifying
these super transactions may be tricky. Obviously, the best way to identify them is
to ensure the HR operational application captures the higher level action.
Profile Changes as Type 2 Attributes or Fact Events
We just described the handling of employee attribute changes as slowly changing
dimension type 2 attributes with profile effective and expiration dates within the
employee dimension. Designers sometimes wholeheartedly embrace this pattern
and try to leverage it to capture every employee-centric change. This results in a
dimension table with potentially hundreds of attributes and millions of rows for
a 100,000-employee organization given the attributes’ volatility.
Tracking changes within the employee dimension table enables you to easily
associate the employee's accurate profile with multiple business processes. You
simply load these fact tables with the employee key in effect when the fact event
occurred, and filter and group based on the full spectrum of employee attributes.
But the pendulum can swing too far. You probably shouldn’t use the employee
dimension to track every employee review event, every benefit participation event,
or every professional development event. As illustrated in Figure 9-4’s bus matrix
in the next section, many of these events involve other dimensions, like an event
date, organization, benefit description, reviewer, approver, exit interviewer, separa-
tion reasons, and the list goes on. Consequently, most of them should be handled
as separate process-centric fact tables. Although many human resources events are
factless, capturing them within a fact table enables business users to easily count
or trend by time periods and all the other associated dimensions.
It’s certainly common to include the outcome of these HR events, like the job
grade resulting from a promotion, as an attribute on the employee dimension. But
designers sometimes err by including lots of foreign keys to outriggers for the
reviewer, benefit, separation reason, and other dimensions within the employee
dimension, resulting in an overloaded dimension that's difficult to navigate.
Headcount Periodic Snapshot
In addition to profiling employees in HR, you also want to report statuses of the
employees on a regular basis. Business managers are interested in counts, statistics,
and totals, including number of employees, salary paid, vacation days taken, vaca-
tion days accrued, number of new hires, and number of promotions. They want
to analyze the data by all possible slices, including time and organization, plus
employee characteristics.
As shown in Figure 9-3, the employee headcount periodic snapshot consists of an
ordinary looking fact table with three dimensions: month, employee, and organiza-
tion. The month dimension table contains the usual descriptors for the corporate cal-
endar at the month grain. The employee key corresponds to the employee dimension
row in effect at the end of the last day of the given reporting month to guarantee the
month-end report is a correct depiction of the employees' profiles. The organization
dimension contains a description of the organization to which the employee belongs
at the close of the relevant month.
Figure 9-3: Employee headcount periodic snapshot.
[The Employee Headcount Snapshot Fact table, with one row per employee per month, carries foreign keys to the month, organization, and employee dimensions and facts for employee count, new hire count, transfer count, promotion count, salary paid, overtime paid, retirement fund paid, retirement fund employee contribution, and vacation days accrued, taken, and balance.]
The facts in this headcount snapshot consist of monthly numeric metrics and
counts that may be difficult to calculate from the employee dimension table alone.
These monthly counts and metrics are additive across all the dimensions or dimen-
sion attributes, except for any facts labeled as balances. These balances, like all
balances, are semi-additive and must be averaged across the month dimension after
adding across the other dimensions.
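As an illustration of this semi-additive handling, a query sketch follows; the table and column names are hypothetical and the balance fact is assumed to be stored as a decimal value:

-- Average monthly vacation days balance by organization:
-- sum across employees, then average across the months in the reporting period
SELECT o.organization_name,
       SUM(f.vacation_days_balance) / COUNT(DISTINCT f.month_key) AS avg_monthly_balance
FROM employee_headcount_snapshot_fact f
JOIN organization_dimension o ON o.organization_key = f.organization_key
WHERE f.month_key BETWEEN 201301 AND 201312
GROUP BY o.organization_name;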
Bus Matrix for HR Processes
Although an employee dimension with precise type 2 slowly changing dimension
tracking coupled with a monthly periodic snapshot of core HR performance metrics
is a good start, it just scratches the surface when it comes to tracking HR data.
Figure 9-4 illustrates other processes that HR professionals and functional manag-
ers are likely keen to analyze. We’ve embellished this preliminary bus matrix with
the type of fact table that might be used for each process; however, your source
data realities and business requirements may warrant a different or complementary
treatment.
Figure 9-4: Bus matrix rows for HR processes.
[The matrix lists HR business processes (Employee Hiring, Employee Requisition Pipeline, Employee "On Board" Pipeline, Employee Position Snapshot, Employee Compensation, Employee Headcount Snapshot, Employee Performance Review, Employee Performance Review Pipeline, Employee Disciplinary Action Pipeline, Employee Separations, Employee Prof Dev Completed Courses, Employee Benefits Eligibility, Employee Benefits Application, Employee Benefit Participation, and Employee Benefit Accruals), grouped into hiring, employee management, and benefits processes. Each process is mapped against common dimensions (date, position, employee, manager, organization, and benefit) and tagged with its fact table type: transaction, periodic snapshot, or accumulating snapshot.]
Some of these business processes capture performance metrics, but many result
in factless fact tables, such as benefit eligibility or participation.
Packaged Analytic Solutions
and Data Models
Many organizations purchase a vendor solution to address their operational HR
application needs. Most of these products offer an add-on DW/BI solution. In addi-
tion, other vendors sell standard data models, potentially with prebuilt data loaders
for the popular HR application products.
Vendors and proponents argue these standard, prebuilt solutions and models
allow for more rapid, less risky implementations by reducing the scope of the data
modeling and ETL development effort. After all, every HR department hires employ-
ees, signs them up for benefits, compensates them, reviews them, and eventually
processes employee separations. Why bother reinventing the wheel by designing
custom data models and solutions to support these common business processes
when you can buy a standard data model or complete solution instead?
Although there are undoubtedly common functions, especially within the HR
space, businesses typically have unique peculiarities. To handle these nuances,
most application software vendors introduce abstractions in their products, which
enable them to be more easily "customized."
These abstractions, like the party table and associated apparatus to describe each
role or generic attribute column names rather than more meaningful labels, provide
flexibility to adapt to a variety of business situations. Although implementation
adaptability is a win for vendors who want their products to address a broad range of
potential customers’ business scenarios, the downside is the associated complexity.
HR professionals who live with the vendor’s product 24x7 are often willing to
adjust their vocabulary to accommodate the abstractions. But these abstractions can
feel like a foreign language for less-immersed functional managers. Delivering data
to the business via a packaged DW/BI solution or industry-standard data model may
bypass the necessary translations into the business’s vernacular.
Besides the reliance on the vendor’s terminology instead of incorporating the
business’s vocabulary in the DW/BI solution, another potential sharp corner is
the integration of source data from other domains. Can you readily conform the
dimensions in the vendor solution or industry model with other internally avail-
able master data? If not, the packaged model is destined to become another isolated
stovepipe data set. Clearly, this outcome is unappealing, although it may be less of
an obstacle if all your operational systems are supported by the same ERP vendor,
or you’re a small organization without an IT shop doing independent development.
What can you realistically expect to gain from a packaged model? Prebuilt generic
models can help identify core business processes and associate common dimensions.
That provides some comfort for DW/BI teams feeling initially overwhelmed by the
design task. After a few days or weeks studying the standard model, most teams
gain enough confidence to want to customize the schema for their data.
However, is this knowledge worth the price tag associated with the packaged
solution or data model? You could likely gain the same insight by spending a few
weeks with the business users. You’d not only improve your understanding of the
business’s needs, but also begin bonding business users to the DW/BI initiative.
It’s also worth mentioning that just because a packaged model or solution costs
thousands of dollars doesn’t mean it exhibits generally accepted dimensional modeling
best practices. Unfortunately, some standard models embody common dimensional
modeling design flaws; this isn't surprising if the model's designers focused more on
best practices for source system data capture rather than those required for BI report-
ing and analytics. It's difficult to design a predefined generic model, even if the vendor
owns the data capture source code.
Recursive Employee Hierarchies
A common employee characteristic is the name of the employee’s manager. You
could simply embed this attribute along with the other attributes in the employee
dimension. But if the business users want more than the manager's name, more
complex structures are necessary.
One approach is to include the manager’s employee key as another foreign key in
the fact table, as shown in Figure 9-5. This manager employee key joins to a role-
playing employee dimension where every attribute name refers to “manager” to
di erentiate the manager’s profi le from the employee’s. This approach associates the
employee and their manager whenever a row is inserted into a fact table. BI analyses
can easily fi lter and group by either employee or manager attributes with virtually
identical query performance because both dimensions provide symmetrical access
to the fact table. The downside of this approach is these dual foreign keys must be
embedded in every fact table to support managerial reporting.
Figure 9-5: Dual role-playing employee and manager dimensions.
[The Employee Separation Fact table carries a separation count and foreign keys to the separation date, organization, employee, manager, and separation profile dimensions; the manager dimension is a role-playing copy of the employee dimension, and the separation profile dimension holds the separation type and reason descriptions.]
Another option is to include the manager’s employee key as an attribute on the
employee’s dimension row. The manager key would join to an outrigger consisting of
a role play on the employee dimension where all the attributes reference “manager”
to differentiate them from the employee's characteristics, as shown in Figure 9-6.
Figure 9-6: Manager role-playing dimension as an outrigger.
[The employee dimension carries a manager key joining to a manager role-playing outrigger; both dimensions include row effective and expiration dates and a current row indicator, and the Employee Separation Fact references only the employee dimension.]
If the manager’s foreign key in the employee dimension is designated as a type 2
attribute, then new employee rows would be generated with each manager change.
However, we encourage you to think carefully about the underlying ETL business rules.
Change Tracking on Embedded Manager Key
Let's walk through an example. Abby is Hayden's manager. With the outrigger
approach just described, Hayden’s employee dimension row would include an
attribute linking to Abby’s row in the manager role-play employee dimension. If
Hayden’s manager changes, and assuming the business wants to track these histori-
cal changes, then treating the manager foreign key as a type 2 and creating a new
row for Hayden to capture his new profile with a new manager would be appropriate.
However, think about the desired outcome if Abby were still Hayden’s manager,
but her employee profile changes, perhaps caused by something as innocuous as
a home address change. If the home address is designated as a type 2 attribute,
this move would spawn a new employee dimension row for Abby. If the manager
key is also designated as a type 2 attribute, then Abby’s new employee key would
also spawn a new dimension row for Hayden. Now imagine Abby is the CEO of a
large organization. A type 2 change in her profile would ripple through the entire
table; you'd end up replicating a new profile row for every employee due to a single
type 2 attribute change on the CEO's profile.
Does the business want to capture these manager profile changes? If not, perhaps
the manager key on the employee's row should be the manager's durable natural key
linked to a role-playing dimension limited to just the current row for each manager's
durable natural key in the dimension.
If you designate the manager's key in the employee dimension to be a type 1
attribute, it would always associate an employee with her current manager. Although
this simplistic approach obliterates history, it may completely satisfy the business
user’s needs.
Drilling Up and Down Management Hierarchies
Adding an attribute, either a textual label or a foreign key to a role-playing dimen-
sion, to an employee dimension row is appropriate for handling the fixed depth,
many-to-one employee-to-manager relationship. However, more complex approaches
might be required if the business wants to navigate a deeper recursive hierarchy,
such as identifying an employee’s entire management chain or drilling down to
identify the activity for all employees who directly or indirectly work for a given
manager.
If you use an OLAP tool to query employee data, the embedded manager key
on every employee dimension row may suffice. Popular OLAP products contain a
parent/child hierarchy structure that works smoothly with variable depth recursive
hierarchies. In fact, this is one of the strengths of OLAP products.
However, if you want to query the recursive employee/manager relationship in
the relational environment, you must use Oracle’s nonstandard CONNECT BY syntax
or SQL's recursive common table expression (CTE) syntax. Both approaches are
virtually unworkable for business users armed with a BI reporting tool.
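For reference, a recursive query that walks down from a single manager might look like the following sketch, written in standard SQL recursive CTE syntax with hypothetical table, column, and key names; it illustrates why this syntax is rarely exposed to ad hoc BI users:

-- All employees directly or indirectly reporting to the manager with key 12345
WITH RECURSIVE reporting_chain AS (
    SELECT employee_key, manager_key
    FROM employee_dimension
    WHERE employee_key = 12345          -- the starting manager
    UNION ALL
    SELECT e.employee_key, e.manager_key
    FROM employee_dimension e
    JOIN reporting_chain r ON e.manager_key = r.employee_key
)
SELECT employee_key FROM reporting_chain;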
So you’re left with the options described in Chapter 7 for dealing with vari-
able depth customer hierarchies. In Figure 9-7, the employee dimension from
Figure 9-6 relates to the fact table through a bridge table. The bridge table has
one row for each manager and each employee who is directly or indirectly in
their management chain, plus an additional row relating the manager to himself. The
bridge joins shown in Figure 9-7 enable you to drill down within a manager’s
chain of command.
Figure 9-7: Bridge table to drill down into a manager's reporting structure.
[The Management Hierarchy Bridge contains a manager key, employee key, number of levels from the top, and bottom and top flags; it connects the manager dimension to the employee key in the Employee Separation Fact table.]
As previously described, there are several disadvantages to this approach. The
bridge table is somewhat challenging to build, plus it contains many rows, so query
performance can suffer. The BI user experience is complicated for ad hoc queries,
although we've seen analysts effectively use it. Finally, if users want to aggregate
information up rather than down a management chain, the join paths must be
reversed.
Once again, the situation is further complicated if you want to track employee
profile changes in conjunction with the bridge table. If the manager and employee reflect
employee profiles with type 2 changes, the bridge table will experience rapid growth,
especially when senior management profile changes cause new keys to ripple across
the organization.
You could use durable natural keys in the bridge table, instead of the employee
keys which capture type 2 profile changes. Limiting the relationship to the man-
agement hierarchy's current profiles is one thing. However, if the business wants
to retain a history of employee/manager rollups, you need to embellish the bridge
table with effective and expiration dates that capture the effective timespan for each
employee/manager relationship.
The propagation of new rows in this bridge table using durable keys is substan-
tially reduced compared to the Figure 9-7 bridge because new rows are added when
reporting relationships change, not when any type 2 employee attribute is modified.
A bridge table built on durable keys is easier to manage, but quite challenging to
navigate, especially given the need to associate the relevant organizational structures
with the event dates in the fact table. Given the complexities, the bridge table should
be buried within a canned BI application for all but a small subset of power BI users.
The alternative approaches discussed in Chapter 7 for handling recursive hierar-
chies, like the pathstring attribute, are also relevant to the management hierarchy
conundrum. Unfortunately, there’s no silver bullet solution for handling these com-
plex structures in a simple and fast way.
Multivalued Skill Keyword Attributes
Let’s assume the IT department wants to supplement the employee dimension with
technical skillset proficiency information. You could consider these technical skills,
such as programming languages, operating systems, or database platforms, to be key-
words describing employees. Each employee is tagged with a number of skill keywords.
You want to search the IT employee population by their descriptive skills.
If the technical skills of interest were a finite number, you could include them
as individual attributes in the employee dimension. The advantage of using posi-
tional dimension attributes, such as a Linux attribute with domain values such as
Linux Skills and No Linux Skills, is they're easy to query and deliver fast query
performance. This approach works well to a point but falls apart when the number
of potential skills expands.
Skill Keyword Bridge
More realistically, each employee will have a variable, unpredictable number of
skills. In this case, the skill keyword attribute is a prime candidate to be a multi-
valued dimension. Skill keywords, by their nature, are open-ended; new skills are
added regularly as domain values. We’ll show two logically equivalent modeling
schemes for handling open-ended sets of skills.
Figure 9-8 shows a multivalued dimension design for handling the skills as an
outrigger bridge table to the employee dimension table. As you’ll see in Chapter 14:
Healthcare, sometimes the multivalued bridge table is joined directly to a fact table.
Figure 9-8: Skills group keyword bridge table.
[The employee dimension carries an employee skill group key; the Employee Skill Group Bridge associates each group key with individual skill keys in the skills dimension, which holds the skill description and category. The Employee Headcount Snapshot Fact references the employee dimension.]
The skills group bridge identifies a given set of skill keywords. IT employees who
are proficient in Oracle, Unix, and SQL would be assigned the same skills group key.
In the skills group bridge table, there would be three rows for this particular group,
one for each of the associated skill keywords (Oracle, Unix, and SQL).
AND/OR Query Dilemma
Assuming you built the schema shown in Figure 9-8, you are still left with a serious
query problem. Query requests against the skill keywords fall into two categories. The
OR queries (for example, Unix or Linux experience) can be satisfied by a simple OR
constraint on the skills description attribute in the skills dimension table. However,
AND queries (for example, Unix and Linux experience) are difficult because the AND
constraint is a constraint across two rows in the skills dimension. SQL is notoriously
poor at handling constraints across rows. The answer is to create SQL code using
unions and intersections, probably in a custom interface that hides the complex logic
from the business user. The SQL code would look like this:
(SELECT Employee.Employee_ID, Employee.Employee_Name
FROM Employee, SkillBridge, Skills
WHERE Employee.SkillGroupKey = SkillBridge.SkillGroupKey AND
SkillBridge.SkillKey = Skills.SkillKey AND
Skills.SkillDescription = 'UNIX')
UNION -- substitute INTERSECT here for the AND query
(SELECT Employee.Employee_ID, Employee.Employee_Name
FROM Employee, SkillBridge, Skills
WHERE Employee.SkillGroupKey = SkillBridge.SkillGroupKey AND
SkillBridge.SkillKey = Skills.SkillKey AND
Skills.SkillDescription = 'LINUX')
Using UNION lists employees with Unix or Linux experience, whereas using
INTERSECT identifies employees with both Unix and Linux experience.
Skill Keyword Text String
You can remove the many-to-many bridge and the need for union/intersection SQL
by simplifying the design. One approach would be to add a skills list outrigger to
the employee dimension containing one long text string concatenating all the skill
keywords for that list key. You would need a special delimiter such as a backslash or
vertical bar at the beginning of the skills text string and after each skill in the list.
Thus the skills string containing Unix and C++ would look like |Unix|C++|. This
outrigger approach presumes a number of employees share a common list of skills.
If the lists are not reused frequently, you could collapse the skills list outrigger by
simply including the skills list text string as an employee dimension attribute, as
shown in Figure 9-9.
Figure 9-9: Delimited skills list string.
[The employee dimension carries an Employee Skill Group List attribute containing the delimited skill keywords; the Employee Headcount Snapshot Fact references the employee dimension.]
Text string searches can be challenging because of the ambiguity caused by
searching on uppercase or lowercase. Is it UNIX or Unix or unix? You can resolve
this by coercing the skills list to uppercase with the standard SQL UPPER function.
With the design in Figure 9-9, the AND/OR dilemma can be addressed in a single
SELECT statement. The OR constraint looks like this:
UPPER(skill_list) LIKE '%|UNIX|%' OR UPPER(skill_list) LIKE '%|LINUX|%'
Meanwhile, the AND constraint has exactly the same structure:
UPPER(skill_list) LIKE '%|UNIX|%' AND UPPER(skill_list) LIKE '%|LINUX|%'
The % symbol is a wild card pattern-matching character defined in SQL that
matches zero or more characters. The vertical bar delimiter is used explicitly in the
constraints to exactly match the desired keywords and not get erroneous matches.
The keyword list approach shown in Figure 9-9 can work in any relational database
because it is based on standard SQL. Although the text string approach facilitates
AND/OR searching, it doesn’t support queries that count by skill keyword.
Survey Questionnaire Data
HR departments often collect survey data from employees, especially when gather-
ing peer and/or management review data. The department analyzes questionnaire
responses to determine the average rating for a reviewed employee and within a
department.
To handle questionnaire data in a dimensional model, a fact table with one row
for each question on a respondent’s survey is typically created, as illustrated in
Figure 9-10. Two role-playing employee dimensions in the schema correspond to the
responding employee and reviewed employee. The survey dimension has descriptors
about the survey instrument. The question dimension provides the question and
its categorization; presumably, the same question is asked on multiple surveys. The
survey and question dimensions can be useful when searching for specific topics in
a broad database of questionnaires. The response dimension contains the responses
and perhaps categories of responses, such as favorable or hostile.
Figure 9-10: Survey schema.
[The Employee Evaluation Survey Fact table carries foreign keys to survey sent and received dates, survey, responding employee, reviewed employee, question, and response category dimensions, a survey number degenerate dimension, and the response. The survey dimension describes the survey title, type, objective, and review year; the question dimension holds the question label, category, and objective; and the response category dimension holds the response category description.]
Creating the simple schema in Figure 9-10 supports robust slicing and dicing
of survey data. Variations of this schema design would be useful for analyzing all
types of survey data, including customer satisfaction and product usage feedback.
Text Comments
Facts are typically thought of as continuously valued numeric measures; dimension
attributes, on the other hand, are drawn from a discrete list of domain values. So
how do you handle textual comments, such as a manager’s remarks on a perfor-
mance review or freeform feedback on a survey question, which seem to defy clean
classification into the fact or dimension category? Although IT professionals may
instinctively want to simply exclude them from a dimensional design, business
users may demand they’re retained to further describe the performance metrics.
After it's been confirmed the business is unwilling to relinquish the text com-
ments, you should determine if the comments can be parsed into well-behaved
dimension attributes. Although there are sometimes opportunities to categorize
the text, such as a compliment versus complaint, the full text verbiage is typically
also required.
Because freeform text takes on so many potential values, designers are some-
times tempted to store the text comment within the fact table. Although cognizant
that fact tables are typically limited to foreign keys, degenerate dimensions, and
numeric facts, they contend the text comment is just another degenerate dimension.
Unfortunately, text comments don't qualify as degenerate dimensions.
Freeform text fields shouldn't be stored in the fact table because they just add
bulky clutter to the table. Depending on the database platform, this relatively low
value bulk may get dragged along on every operation involving the fact table’s much
more valuable performance metrics.
Rather than treating the comments as textual metrics, we recommend retaining
them outside the fact table. The comments should either be captured in a separate
comments dimension (with a corresponding foreign key in the fact table) or as
an attribute on a transaction-grained dimension table. In some situations, identi-
cal comments are observed multiple times. At a minimum, this typically occurs
with the No Comment comment. If the cardinality of the comments is less than
the number of transactions, the text should be captured in a comments dimension.
Otherwise, if there’s a unique comment for every event, it’s treated as a transaction
dimension attribute. In either case, regardless of whether the comments are handled
in a comment or transaction dimension, the query performance when this sizeable
dimension is joined to the fact table will be slow. However, by the time users are
viewing comments, they've likely significantly filtered their query as they can real-
istically read only a limited number of comments. Meanwhile, the more common
analyses focusing on the fact table’s performance metrics won’t be burdened by the
extra weight of the textual comments on every fact table query.
Summary
In this chapter, we discussed several concepts in the context of HR data. First,
we further elaborated on the advantages of embellishing an employee dimension
table. In the world of HR, this single table is used to address a number of ques-
tions regarding the status and profile of the employee base at any point in time. We
drafted a bus matrix representing multiple processes within the HR arena and high-
lighted a core headcount snapshot fact table, along with the potential advantages
and disadvantages of vendor-designed solutions and data models. The handling of
managerial rollups and multivalued dimension attributes was discussed. Finally,
we provided a brief overview regarding the handling of survey or questionnaire
data, along with text comments.
Financial Services
The financial services industry encompasses a wide variety of businesses, includ-
ing credit card companies, brokerage firms, and mortgage providers. In this
chapter, we’ll primarily focus on the retail bank since most readers have some degree
of personal familiarity with this type of financial institution. A full-service bank offers
a breadth of products, including checking accounts, savings accounts, mortgage
loans, personal loans, credit cards, and safe deposit boxes. This chapter begins with
a very simplistic schema. We then explore several schema extensions, including the
handling of the bank’s broad portfolio of heterogeneous products that vary signifi-
cantly by line of business.
We want to remind you that industry focused chapters like this one are not
intended to provide full-scale industry solutions. Although various dimensional
modeling techniques are discussed in the context of a given industry, the techniques
are certainly applicable to other businesses. If you don't work in financial services,
you still need to read this chapter. If you do work in financial services, remember
that the schemas in this chapter should not be viewed as complete.
Chapter 10 discusses the following concepts:
Bus matrix snippet for a bank
Dimension triage to avoid the “too few dimensions” trap
Household dimensions
Bridge tables to associate multiple customers with an account, along with
weighting factors
Multiple mini-dimensions in a single fact table
Dynamic value banding of facts for reporting
Handling heterogeneous products across lines of business, each with unique
metrics and/or dimension attributes, as supertype and subtype schemas
Hot swappable dimensions
Banking Case Study and Bus Matrix
The bank's initial goal is to better analyze the bank's accounts. Business users want the
ability to slice and dice individual accounts, as well as the residential household groupings
to which they belong. One of the bank's major objectives is to market more effectively by
offering additional products to households that already have one or more accounts with
the bank. Figure 10-1 illustrates a portion of a bank’s bus matrix.
Figure 10-1: Subset of bus matrix rows for a bank.
[The matrix lists banking business processes (New Business Solicitation, Lead Tracking, Account Application Pipeline, Account Initiation, Account Transactions, Account Monthly Snapshot, and Account Servicing Activities) mapped against common dimensions: date, prospect, customer, household, branch, account, and product.]
After conducting interviews with managers and analysts around the bank, the
following set of requirements was developed:
Business users want to see five years of historical monthly snapshot data on
every account.
Every account has a primary balance. The business wants to group different
types of accounts in the same analyses and compare primary balances.
Every type of account (known as products within the bank) has a set of cus-
tom dimension attributes and numeric facts that tend to be quite different
from product to product.
Every account is deemed to belong to a single household. There is a surprising
amount of volatility in the account/household relationships due to changes
in marital status and other life stage factors.
In addition to the household identification, users are interested in demographic
information both as it pertains to individual customers and households. In
addition, the bank captures and stores behavior scores relating to the activity
or characteristics of each account and household.
Dimension Triage to Avoid Too Few
Dimensions
Based on the previous business requirements, the grain and dimensionality of the
initial model begin to emerge. You can start with a fact table that records the pri-
mary balances of every account at the end of each month. Clearly, the grain of the
fact table is one row for each account each month. Based on that grain declaration,
you can initially envision a design with only two dimensions: month and account.
These two foreign keys form the fact table primary key, as shown in Figure 10-2.
A data-centric designer might argue that all the other descriptive information, such
as household, branch, and product characteristics should be embedded as descriptive
attributes of the account dimension because each account has only one household,
branch, and product associated with it.
Figure 10-2: Balance snapshot with too few dimensions.
[The Month Account Snapshot Fact carries only month end date and account foreign keys plus the primary month ending balance; the account dimension is loaded with account, primary customer, product, household, status, and branch attributes.]
Although this schema accurately represents the many-to-one and many-to-many
relationships in the snapshot data, it does not adequately reflect the natural business
dimensions. Rather than collapsing everything into the huge account dimension table,
additional analytic dimensions such as product and branch mirror the instinctive
way users think about their business. These supplemental dimensions provide much
smaller points of entry to the fact table. Thus, they address both the performance and
usability objectives of a dimensional model. Finally, given a big bank may have mil-
lions of accounts, you should worry about type 2 slowly changing dimension effects
potentially causing this huge dimension to mushroom into something unmanage-
able. The product and branch attributes are convenient groups of attributes to remove
from the account dimension to cut down on the row growth caused by type 2 change
tracking. In the section “Mini-Dimensions Revisited,” the changing demographics
and behavioral attributes will be squeezed out of the account dimension for the same
reasons.
The product and branch dimensions are two separate dimensions as there is
a many-to-many relationship between products and branches. They both change
slowly, but on different rhythms. Most important, business users think of them as
distinct dimensions of the banking business.
In general, most dimensional models end up with between five and 20 dimen-
sions. If you are at or below the low end of this range, you should be suspicious
that dimensions may have been inadvertently left out of the design. In this case,
carefully consider whether any of the following kinds of dimensions are appropriate
supplements to your initial dimensional model:
Causal dimensions, such as promotion, contract, deal, store condition, or even
weather. These dimensions, as discussed in Chapter 3: Retail Sales, provide
additional insight into the cause of an event.
Multiple date dimensions, especially when the fact table is an accumulating
snapshot. Refer to Chapter 4: Inventory for sample fact tables with multiple
date stamps.
Degenerate dimensions that identify operational transaction control numbers,
such as an order, an invoice, a bill of lading, or a ticket, as initially illustrated
in Chapter 3.
Role-playing dimensions, such as when a single transaction has several busi-
ness entities associated with it, each represented by a separate dimension. In
Chapter 6: Order Management, we described role playing to handle multiple
dates.
Status dimensions that identify the current status of a transaction or monthly
snapshot within some larger context, such as an account status.
An audit dimension, as discussed in Chapter 6, to track data lineage and
quality.
Junk dimensions of correlated indicators and flags, as described in Chapter 6.
These dimensions can typically be added gracefully to a design, even after the
DW/BI system has gone into production because they do not change the grain of
the fact table. The addition of these dimensions usually does not alter the existing
dimension keys or measured facts in the fact table. All existing applications should
continue to run without change.
NOTE Any descriptive attribute that is single-valued in the presence of the
measurements in the fact table is a good candidate to be added to an existing
dimension or to be its own dimension.
Based on further study of the bank’s requirements, you can ultimately choose the
following dimensions for the initial schema: month end date, branch, account, pri-
mary customer, product, account status, and household. As illustrated in Figure 10-3,
at the intersection of these seven dimensions, you take a monthly snapshot and
record the primary balance and any other metrics that make sense across all prod-
ucts, such as transaction count, interest paid, and fees charged. Remember account
balances are just like inventory balances in that they are not additive across any
measure of time. Instead, you must average the account balances by dividing the
balance sum by the number of time periods.
Figure 10-3: Supertype snapshot fact table for all accounts.
[The Monthly Account Snapshot Fact carries foreign keys to the month end date, branch, account, primary customer, product, account status, and household dimensions, with facts for primary month ending balance, average daily balance, number of transactions, interest paid, and fees charged. The household dimension includes address, income, homeownership, and presence of children attributes; the account status dimension includes the status description and group.]
NOTE In this chapter we use the basic object-oriented terms supertype and
subtype to refer respectively to the single fact table covering all possible account
types, as well as the multiple fact tables containing specific details of each individual
account type. In past writings these have been called core and custom fact tables,
but it is time to change to the more familiar and accepted terminology.
The product dimension consists of a simple hierarchy that describes all the
bank's products, including the name of the product, type, and category. The need to
construct a generic product categorization in the bank is the same need that causes
grocery stores to construct a generic merchandise hierarchy. The main difference
between the bank and grocery store examples is that the bank also develops a large
number of subtype product attributes for each product type. We’ll defer discussion
regarding the handling of these subtype attributes until the “Supertype and Subtype
Schemas for Heterogeneous Products” section at the end of the chapter.
The branch dimension is similar to the facility dimensions we discussed earlier
in this book, such as the retail store or distribution center warehouse.
The account status dimension is a useful dimension to record the condition of
the account at the end of each month. The status records whether the account is
active or inactive, or whether a status change occurred during the month, such as a
new account opening or account closure. Rather than whipsawing the large account
dimension, or merely embedding a cryptic status code or abbreviation directly in
the fact table, we treat status as a full-fledged dimension with descriptive status
decodes, groupings, and status reason descriptions, as appropriate. In many ways,
you could consider the account status dimension to be another example of a mini-
dimension, as we introduced in Chapter 5: Procurement.
Household Dimension
Rather than focusing solely on the bank’s accounts, business users also want the
ability to analyze the bank’s relationship with an entire economic unit, referred to as
a household. They are interested in understanding the overall profile of a household,
the magnitude of the existing relationship with the household, and what additional
products should be sold to the household. They also want to capture key demographics
regarding the household, such as household income, whether they own or rent their
home, whether they are retirees, and whether they have children. These demographic
attributes change over time; as you might suspect, the users want to track the changes.
If the bank focuses on accounts for commercial entities, rather than consumers, simi-
lar requirements to identify and link corporate “households” are common.
From the bank’s perspective, a household may be comprised of several accounts
and individual account holders. For example, consider John and Mary Smith as a
single household. John has a checking account, whereas Mary has a savings account.
In addition, they have a joint checking account, credit card, and mortgage with
the bank. All five of these accounts are considered to be a part of the same Smith
household, despite the fact that minor inconsistencies may exist in the operational
name and address information.
The process of relating individual accounts to households (or the commercial
business equivalent) is not to be taken lightly. Householding requires the devel-
opment of business rules and algorithms to assign accounts to households. There
are specialized products and services to do the matching necessary to determine
household assignments. It is very common for a large financial services organization
to invest significant resources in specialized capabilities to support its household-
ing needs.
The decision to treat account and household as separate dimensions is somewhat
a matter of the designer’s prerogative. Even though they are intuitively correlated,
you decide to treat them separately because of the size of the account dimension and
the volatility of the account constituents within a household dimension, as men-
tioned earlier. In a large bank, the account dimension is huge, with easily over 10
million rows that group into several million households. The household dimension
provides a somewhat smaller point of entry into the fact table, without traversing
a 10 million-row account dimension table. Also, given the changing nature of the
relationship between accounts and households, you elect to use the fact table to
capture the relationship, rather than merely including the household attributes on
each account dimension row. In this way, you avoid using the type 2 slowly chang-
ing dimension technique with a 10-million row account dimension.
Multivalued Dimensions and Weighting Factors
As you just saw in the John and Mary Smith example, an account can have one, two,
or more individual account holders, or customers, associated with it. Obviously, the
customer cannot be included as an account attribute (beyond the designation of a
primary customer/account holder); doing so violates the granularity of the dimen-
sion table because more than one individual can be associated with an account.
Likewise, you cannot include a customer as an additional dimension in the fact
table; doing so violates the granularity of the fact table (one row per account per
month), again because more than one individual can be associated with any given
account. This is another classic example of a multivalued dimension. To link an
individual customer dimension to an account-grained fact table requires the use of
an account-to-customer bridge table, as shown in Figure 10-4. At a minimum, the
primary key of the bridge table consists of the surrogate account and customer keys.
The time stamping of bridge table rows, as discussed in Chapter 7: Accounting, for
time-variant relationships is also applicable in this scenario.
Figure 10-4: Account-to-customer bridge table with weighting factor.
[The Account-to-Customer Bridge contains the account key, customer key, and a weighting factor, linking the account dimension (referenced by the Monthly Account Snapshot Fact) to the customer dimension.]
If an account has two account holders, then the associated bridge table has two
rows. You assign a numerical weighting factor to each account holder such that the
sum of all the weighting factors is exactly 1.00. The weighting factors are used to
allocate any of the numeric additive facts across individual account holders. In this
way you can add up all numeric facts by individual holder, and the grand total will
be the correct grand total amount. This kind of report is a correctly weighted report.
The weighting factors are simply a way to allocate the numeric additive facts
across the account holders. Some would suggest changing the grain of the fact table
to be account snapshot by account holder. In this case you would take the weight-
ing factors and physically multiply them against the original numeric facts. This is
rarely done for three reasons. First, the size of the fact table would be multiplied
by the average number of account holders. Second, some fact tables have more than
one multivalued dimension. The number of rows would get out of hand in this situ-
ation, and you would start to question the physical significance of an individual row.
Finally, you may want to see the unallocated numbers, and it is hard to reconstruct
these if the allocations have been combined physically with the numeric facts.
If you choose not to apply the weighting factors in a given query, you can still
summarize the account snapshots by individual account holder, but in this case you
get what is called an impact report. A question such as, “What is the total balance
of all individuals with a specific demographic profile?" would be an example of an
impact report. Business users understand impact analyses may result in overcount-
ing because the facts are associated with both account holders.
In Figure 10-4, an SQL view could be defined combining the fact table and the
account-to-customer bridge table so these two tables, when combined, would appear
to BI tools as a standard fact table with a normal customer foreign key. Two views
could be defined, one using the weighting factors and one not using the weighting
factors.
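For example, the weighted view might be sketched as follows; the table and column names are hypothetical, and the weighted version multiplies each additive fact by the bridge table's weighting factor:

-- Weighted view: additive facts allocated across individual account holders
CREATE VIEW customer_account_snapshot_weighted AS
SELECT f.month_end_date_key,
       b.customer_key,
       f.primary_month_ending_balance * b.weighting_factor AS weighted_balance,
       f.fees_charged * b.weighting_factor AS weighted_fees_charged
FROM monthly_account_snapshot_fact f
JOIN account_to_customer_bridge b ON b.account_key = f.account_key;
-- The unweighted "impact" view would omit the weighting factor multiplication.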
NOTE An open-ended, many-valued attribute can be associated with a dimen-
sion row by using a bridge table to associate the many-valued attributes with the
dimension.
In some financial services companies, the individual customer is identified and
associated with each transaction. For example, credit card companies often issue
unique card numbers to each cardholder. John and Mary Smith may have a joint
credit card account, but the numbers on their respective pieces of plastic are unique.
In this case, there is no need for an account-to-customer bridge table because the
atomic transaction facts are at the discrete customer grain; account and customer
would both be foreign keys in this fact table. However, the bridge table would be
required to analyze metrics that are naturally captured at the account level, such
as the credit card billing data.
Mini-Dimensions Revisited
Similar to the discussion of the customer dimension in Chapter 8: Customer
Relationship Management, there are a wide variety of attributes describing the
bank’s accounts, customers, and households, including monthly credit bureau attri-
butes, external demographic data, and calculated scores to identify their behavior,
retention, profitability, and delinquency characteristics. Financial services organiza-
tions are typically interested in understanding and responding to changes in these
attributes over time.
As discussed earlier, it's unreasonable to rely on slowly changing dimension tech-
nique type 2 to track changes in the account dimension given the dimension row
count and attribute volatility, such as the monthly update of credit bureau attributes.
Instead, you can break off the browseable and changeable attributes into multiple
mini-dimensions, such as credit bureau and demographics mini-dimensions, whose
keys are included in the fact table, as illustrated in Figure 10-5. The type 4 mini-
dimensions enable you to slice and dice the fact table, while readily tracking attribute
changes over time, even though they may be updated at different frequencies. Although
mini-dimensions are extremely powerful, be careful to avoid overusing the technique.
Account-oriented financial services are a good environment for using mini-dimensions
because the primary fact table is a very long-running periodic snapshot. Thus every
month a fact table row is guaranteed to exist for every account, providing a home for
all the associated foreign keys. You can always see the account together with all the
mini-dimensions for any month.
Figure 10-5: Multiple mini-dimensions associated with a fact table.
[The fact table carries foreign keys to the customer dimension (relatively constant attributes), a customer demographics mini-dimension (age band, income band, marital status), and a customer risk profile mini-dimension (risk cluster, delinquency cluster).]
NOTE Mini-dimensions should consist of correlated clumps of attributes; each
attribute shouldn’t be its own mini-dimension or you end up with too many dimen-
sions in the fact table.
As described in Chapter 4, one of the compromises associated with mini-dimen-
sions is the need to band attribute values to maintain reasonable mini-dimension row
counts. Rather than storing extremely discrete income amounts, such as $31,257.98,
you store income ranges, such as $30,000 to $34,999 in the mini-dimension.
Similarly, the profitability scores may range from 1 through 1200, which you band
into fixed ranges such as less than or equal to 100, 101 to 150, and 151 to 200, in
the mini-dimension.
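For instance, the banding logic applied when preparing the mini-dimension rows might be sketched with a simple CASE expression; the column names are hypothetical and the band boundaries are purely illustrative:

-- Derive a banded income range from a discrete income amount during ETL
SELECT customer_id,
       CASE
           WHEN annual_income < 30000 THEN 'Under $30,000'
           WHEN annual_income < 35000 THEN '$30,000 - $34,999'
           WHEN annual_income < 40000 THEN '$35,000 - $39,999'
           ELSE '$40,000 and above'
       END AS income_band
FROM customer_source;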
Most organizations find these banded attribute values support their routine ana-
lytic requirements; however, there are two situations in which banded values may
be inadequate. First, data mining analysis often requires discrete values rather than
fixed bands to be effective. Second, a limited number of power analysts may want
to analyze the discrete values to determine if the bands are appropriate. In this case,
you still maintain the banded value mini-dimension attributes to support consistent
day-to-day analytic reporting but also store the key discrete numeric values as facts
in the fact table. For example, if each account’s profi tability score were recalculated
each month, you would assign the appropriate profi tability range mini-dimension for
that score each month. In addition, you would capture the discrete profi tability score
as a fact in the monthly account snapshot fact table. Finally, if needed, the current
profi tability range or score could be included in the account dimension where any
changes are handled by deliberately overwriting the type 1 attribute. Each of these
data elements should be uniquely labeled so that they are distinguishable. Designers
must always carefully balance the incremental value of including such somewhat
redundant facts and attributes versus the cost in terms of additional complexity for
both the ETL processing and BI presentation.
Adding a Mini-Dimension to a Bridge Table
In the bank account example, the account-to-customer bridge table can get very
large. If you have 20 million accounts and 25 million customers, the bridge table can
grow to hundreds of millions of rows after a few years if both the account dimen-
sion and the customer dimension are slowly changing type 2 dimensions (where
you track history by issuing new rows with new keys).
Now the experienced dimensional modeler asks, “What happens when my cus-
tomer dimension turns out to be a rapidly changing monster dimension?” This could
happen when rapidly changing demographics and status attributes are added to the
customer dimension, forcing numerous type 2 additions to the customer dimension.
Now the 25-million row customer dimension threatens to become several hundred
million rows.
The standard response to a rapidly changing monster dimension is to split off the
rapidly changing demographics and status attributes into a type 4 mini-dimension,
often called a demographics dimension. This works great when this dimension attaches
directly to the fact table along with a customer dimension because it stabilizes the
large customer dimension and keeps it from growing every time a demographics or
status attribute changes. But can you get this same advantage when the customer
dimension is attached to a bridge table, as in the bank account example?
The solution is to add a foreign key reference in the bridge table to the demo-
graphics dimension, as shown in Figure 10-6.
[Figure 10-6 shows the fact table (Month Key (FK), Account Key (FK), more FKs, facts) joined to an Account Dimension (Account Key (PK), Account Number (NK), ...); the Account-to-Customer Bridge (Account Key (FK), Customer Key (FK), Demographics Key (FK)) links each account to a Customer Dimension (Customer Key (PK), Customer Name, ...) and a Demographics Dimension (Demographics Key (PK), Age Band, Income Band, Marital Status).]
Figure 10-6: Account-to-customer bridge table with an added mini-dimension.
The way to visualize the bridge table is that it links every account to its associ-
ated customers and their demographics. The key for the bridge table now consists
of the account key, customer key, and demographics key.
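As a rough illustration, a query that reports monthly balances by customer age band might traverse the bridge as in the following SQL sketch; the table and column names loosely follow Figure 10-6, while the balance fact, the month key value, and the absence of a weighting factor are simplifying assumptions.

-- Sketch: roll up the monthly balance fact by age band through the bridge.
-- Names loosely follow Figure 10-6; the balance column and the month key value
-- are assumptions, and a bridge weighting factor would normally be applied to
-- avoid overcounting accounts that have multiple customers.
SELECT d.age_band,
       SUM(f.balance) AS total_balance
FROM fact_table f
JOIN account_to_customer_bridge b ON b.account_key = f.account_key
JOIN demographics_dimension d ON d.demographics_key = b.demographics_key
WHERE f.month_key = 201301
GROUP BY d.age_band;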
Depending on how frequently new demographics are assigned to each customer, the bridge table will perhaps grow significantly. In the above design, because the grain of the root bank account fact table is month by account, the bridge table should be limited to changes recorded only at month ends. This takes some of the change tracking pressure off the bridge table.
Dynamic Value Banding of Facts
Suppose business users want the ability to perform value band reporting on a stan-
dard numeric fact, such as the account balance, but are not willing to live with the
predefined bands in a dimension table. They may want to create a report based on
the account balance snapshot, as shown in Figure 10-7.
Balance Range    Number of Accounts    Total of Balances
0-1000           456,783               $229,305,066
1001-2000        367,881               $552,189,381
2001-5000        117,754               $333,479,328
5001-10000        52,662               $397,229,466
10001 and up       8,437               $104,888,784
Figure 10-7: Report rows with dynamic value band groups.
Using the schema in Figure 10-3, it is difficult to create this report directly from the fact table. SQL has no generalization of the GROUP BY clause that clumps additive values into ranges. To further complicate matters, the ranges are of unequal size and have textual names, such as "10001 and up." Also, users typically need the flexibility to redefine the bands at query time with different boundaries or levels of precision.
The schema design shown in Figure 10-8 enables on-the-fly value band reporting. The band definition table can contain as many sets of different reporting bands as desired. The name of a particular group of bands is stored in the band group column. The band definition table is joined to the balance fact using a pair of less-than and greater-than joins. The report uses the band range name as the row header and sorts the report on the sort order attribute.
[Figure 10-8 shows the Monthly Account Snapshot Fact (Month End Date Key (FK), Account Key (FK), Product Key (FK), more FKs, Primary Month Ending Balance) joined through range conditions to a Band Definition Table (Band Group Key (PK), Band Group Sort Order (PK), Band Group Name, Band Range Name, Band Lower Value, Band Upper Value).]
Figure 10-8: Dynamic value band reporting.
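Expressed in SQL, the report in Figure 10-7 might be produced by a query along these lines; the table and column names follow Figure 10-8, while the band group name and the date key value are illustrative assumptions.

-- Sketch of the on-the-fly value band report, using the pair of range joins
-- described above. The 'Balance Bands' group name and the month end date key
-- value are illustrative assumptions.
SELECT b.band_range_name AS balance_range,
       COUNT(*) AS number_of_accounts,
       SUM(f.primary_month_ending_balance) AS total_of_balances
FROM monthly_account_snapshot_fact f
JOIN band_definition_table b
  ON f.primary_month_ending_balance >= b.band_lower_value
 AND f.primary_month_ending_balance < b.band_upper_value
WHERE b.band_group_name = 'Balance Bands'
  AND f.month_end_date_key = 20130131
GROUP BY b.band_range_name, b.band_group_sort_order
ORDER BY b.band_group_sort_order;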
Controlling the performance of this query can be a challenge. A value band query is by definition very lightly constrained. The example report needed to scan the balances of more than 1 million accounts. Perhaps only the month dimension was constrained to the current month. Furthermore, the funny joins to the value banding table are not the basis of a nice restricting constraint because they are grouping the 1 million balances. In this situation, you may need to place an index directly on the balance fact. The performance of a query that constrains or groups on the value of a fact, such as the balance, will be improved enormously if the DBMS can efficiently sort and compress the individual fact. This approach was pioneered by the Sybase IQ columnar database product in the early 1990s and is now becoming a standard indexing option on several of the competing columnar DBMSs.
Supertype and Subtype Schemas
for Heterogeneous Products
In many financial service businesses, a dilemma arises because of the heterogeneous nature of the products or services offered by the institution. As mentioned in the introduction, a typical retail bank offers a myriad of products, from checking accounts to credit cards, to the same customers. Although every account at the bank has a primary balance and interest amount associated with it, each product type has many special attributes and measured facts that are not shared by the other products. For instance, checking accounts have minimum balances, overdraft limits, service charges, and other measures relating to online banking; time deposits such as certificates of deposit have few attribute overlaps with checking, but have maturity dates, compounding frequencies, and current interest rates.
Business users typically require two different perspectives that are difficult to present in a single fact table. The first perspective is the global view, including the
ability to slice and dice all accounts simultaneously, regardless of their product
type. This global view is needed to plan appropriate customer relationship manage-
ment cross-sell and up-sell strategies against the aggregate customer/household base
spanning all possible products. In this situation, you need the single supertype fact
table (refer to Figure 10-3) that crosses all the lines of business to provide insight
into the complete account portfolio. Note, however, that the supertype fact table
can present only a limited number of facts that make sense for virtually every line
of business. You cannot accommodate incompatible facts in the supertype fact table
because there may be several hundred of these facts when all the possible account
types are considered. Similarly, the supertype product dimension must be restricted
to the subset of common product attributes.
The second perspective is the line-of-business view that focuses on the in-depth
details of one business, such as checking. There is a long list of special facts and attri-
butes that make sense only for the checking business. These special facts cannot be
included in the supertype fact table; if you did this for each line of business in a retail
bank, you would end up with hundreds of special facts, most of which would have
null values in any specific row. Likewise, if you attempt to include line-of-business
attributes in the account or product dimension tables, these tables would have hun-
dreds of special attributes, almost all of which would be empty for any given row. The
resulting tables would resemble Swiss cheese, littered with data holes. The solution
to this dilemma for the checking department in this example is to create a subtype
schema for the checking line of business that is limited to just checking accounts, as
shown in Figure 10-9.
Chapter 10
294
[Figure 10-9 shows the Checking Account Fact (Month Key (FK), Account Key (FK), Primary Customer Key (FK), Branch Key (FK), Household Key (FK), Product Key (FK), Balance, Change in Balance, Total Deposits, Total Withdrawals, Number Transactions, Max Backup Reserve, Number Overdraws, Total Overdraw Penalties, Count Local ATM Transactions, Count Foreign ATM Transactions, Count Online Transactions, Days Below Minimum, plus 10 more facts) joined to a Checking Account Dimension (Account Key (PK), Account Number (NK), Account Address Attributes, Account Open Date, ...) and a Checking Product Dimension (Product Key (PK), Product Code (NK), Product Description, Premium Flag, Checking Type, Interest Payment Type, Overdraft Policy, plus 12 more attributes).]
Figure 10-9: Line-of-business subtype schema for checking products.
Now both the subtype checking fact table and corresponding checking account dimension are widened to describe all the specific facts and attributes that make sense only for checking products. These subtype schemas must also contain the supertype facts and attributes to avoid joining tables from the supertype and subtype schemas for the complete set of facts and attributes. You can also build separate subtype fact and account tables for the other lines of business to support their in-depth analysis requirements. Although creating account-specific schemas sounds complex, only the DBA sees all the tables at once. From the business users' perspective, either it's a cross-product analysis that relies on the single supertype fact table and its attendant supertype account table, or the analysis focuses on a particular account type and only one of the subtype line-of-business schemas is utilized. In general, it makes less sense to combine data from more than one subtype schema because, by definition, the accounts' facts and attributes are disjoint (or nearly so).
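The two perspectives can be sketched in SQL as follows: the first query spans the complete portfolio through the supertype schema, and the second dives into checking-only detail through the subtype schema of Figure 10-9. The table and column names are illustrative renderings of the figures, not definitive table definitions.

-- Cross-product view: only the common facts and attributes, via the supertype
-- schema (table names are illustrative).
SELECT p.product_description,
       SUM(f.balance) AS total_balance
FROM account_monthly_snapshot_fact f
JOIN product_dimension p ON p.product_key = f.product_key
GROUP BY p.product_description;

-- Line-of-business view: checking-specific facts and attributes, via the
-- subtype schema sketched in Figure 10-9.
SELECT d.checking_type,
       SUM(f.total_overdraw_penalties) AS total_overdraw_penalties
FROM checking_account_fact f
JOIN checking_product_dimension d ON d.product_key = f.product_key
GROUP BY d.checking_type;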
The keys of the subtype account dimensions are the same keys used in the supertype account dimension, which contains all possible account keys. For example, if the bank offers a "$500 minimum balance with no per check charge" checking account, this account would be identified by the same surrogate key in both the supertype and subtype checking account dimensions. Each subtype account dimension is a shrunken conformed dimension with a subset of rows from the supertype
account dimension table; each subtype account dimension contains attributes specific to a particular account type.
This supertype/subtype design technique applies to any business that offers
widely varied products through multiple lines of business. If you work for a technol-
ogy company that sells hardware, software, and services, you can imagine building
supertype sales fact and product dimension tables to deliver the global customer
perspective. The supertype tables would include all facts and dimension attributes
that are common across lines of business. The supertype tables would then be
supplemented with schemas that do a deep dive into subtype facts and attributes that
vary by business. Again, a specific product would be assigned the same surrogate
product key in both the supertype and subtype product dimensions.
NOTE A family of supertype and subtype fact tables is needed when a business has heterogeneous products that have naturally different facts and descriptors, but a single customer base that demands an integrated view.
If the lines of business in your retail bank are physically separated so each has its
own location, the subtype fact and dimension tables will likely not reside in the same
space as the supertype fact and dimension tables. In this case, the data in the super-
type fact table would be duplicated exactly once to implement all the subtype tables.
Remember that the subtype tables provide a disjoint partitioning of the accounts,
so there is no overlap between the subtype schemas.
Supertype and Subtype Products with Common Facts
The supertype and subtype product technique just discussed is appropriate for
fact tables where a single logical row contains many product-specific facts. On the
other hand, the metrics captured by some business processes, such as the bank’s
new account solicitations, may not vary by line of business. In this case, you do
not need line-of-business fact tables; one supertype fact table suffices. However,
you still can have a rich set of heterogeneous products with diverse attributes. In
this case, you would generate the complete portfolio of subtype account dimension
tables, and use them as appropriate, depending on the nature of the application.
In a cross product analysis, the supertype account dimension table would be used
because it can span any group of accounts. In a single account type analysis, you
could optionally use the subtype account dimension table instead of the supertype
dimension if you wanted to take advantage of the subtype attributes specific to
that account type.
Hot Swappable Dimensions
A brokerage house may have many clients who track the stock market. All of them
access the same fact table of daily high-low-close stock prices. But each client has a
confidential set of attributes describing each stock. The brokerage house can sup-
port this multi-client situation by having a separate copy of the stock dimension
for each client, which is joined to the single fact table at query time. We call these
hot swappable dimensions. To implement hot swappable dimensions in a relational
environment, referential integrity constraints between the fact table and the various
stock dimension tables probably must be turned off to allow the switches to occur
on an individual query basis.
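A hedged sketch of how a single query swaps in one client's private dimension copy follows; the table and column names are assumptions used only to illustrate the pattern.

-- Client A's query joins the shared price fact to client A's confidential copy
-- of the stock dimension; client B's query would reference
-- stock_dimension_client_b instead. All names here are illustrative.
SELECT s.internal_rating,
       AVG(f.closing_price) AS average_close
FROM daily_stock_price_fact f
JOIN stock_dimension_client_a s ON s.stock_key = f.stock_key
WHERE f.date_key BETWEEN 20130101 AND 20130131
GROUP BY s.internal_rating;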
Summary
We began this chapter by discussing the situation in which a fact table has too few
dimensions and provided suggestions for ferreting out additional dimensions using
a triage process. Approaches for handling the often complex relationship between
accounts, customers, and households were described. We also discussed the use of
multiple mini-dimensions in a single fact table, which is fairly common in financial
services schemas.
We illustrated a technique for clustering numeric facts into arbitrary value bands
for reporting purposes through the use of a separate band table. Finally, we provided recommendations for any organization that offers heterogeneous products to the same set of customers. In this case, we create a supertype fact table that contains performance metrics that are common across all lines of business. The companion dimension table contains rows for the complete account portfolio, but the attributes are limited to those that are applicable across all accounts. Multiple subtype schemas, one for each line of business, complement the supertype schema with account-specific facts and attributes.
Telecommunications
This chapter unfolds a bit differently than preceding chapters. We begin with
a case study overview but we won’t be designing a dimensional model from
scratch this time. Instead, we’ll step into a project midstream to conduct a design
review, looking for opportunities to improve the initial draft schema. The bulk of
this chapter focuses on identifying design flaws in dimensional models.
We’ll use a billing vignette drawn from the telecommunications industry as the
basis for the case study; it shares similar characteristics with the billing data gener-
ated by a utilities company. At the end of this chapter we’ll describe the handling
of geographic location information in the data warehouse.
Chapter 11 discusses the following concepts:
Bus matrix snippet for telecommunications company
Design review exercise
Checklist of common design mistakes
Recommended tactics when conducting design reviews
Retrofitting existing data structures
Abstract geographic location dimensions
Telecommunications Case Study
and Bus Matrix
Given your extensive experience in dimensional modeling (10 chapters so far), you’ve
recently been recruited to a new position as a dimensional modeler on the DW/BI
team for a large wireless telecommunications company. On your first day, after a few
hours of human resources paperwork and orientation, you’re ready to get to work.
The DW/BI team is anxious for you to review its initial dimensional design. So
far it seems the project is off to a good start. The business and IT sponsorship com-
mittee appreciates that the DW/BI program must be business-driven; as such, the
committee was fully supportive of the business requirements gathering process.
Based on the requirements initiative, the team drafted an initial data warehouse bus
matrix, illustrated in Figure 11-1. The team identified several core business processes
and a number of common dimensions. Of course, the complete enterprise matrix
would have a much larger number of rows and columns, but you’re comfortable that
the key constituencies’ major data requirements have been captured.
[Figure 11-1 shows a bus matrix with business process rows (Purchasing, Internal Inventory, Channel Inventory, Service Activation, Product Sales, Promotion Participation, Call Detail Traffic, Customer Billing, Customer Support Calls, Repair Work Orders) and common dimension columns (Date, Product, Customer, Service Line #, Switch, Employee, Support Call Profile, Rate Plan, Sales Organization); X marks indicate which dimensions participate in each process.]
Figure 11-1: Sample bus matrix rows for telecommunications company.
The sponsorship committee decided to focus on the customer billing process for the
initial DW/BI project. Business management determined better access to the metrics
resulting from the billing process would have a significant impact on the business.
Management wants the ability to see monthly usage and billing metrics (otherwise
known as revenue) by customer, sales organization, and rate plan to perform sales
channel and rate plan analyses. Fortunately, the IT team felt it was feasible to tackle
this business process during the first warehouse iteration.
Some people in the IT organization thought it would be preferable to tackle
individual call and message detail traffic, such as every call initiated or received by
every phone. Although this level of highly granular data would provide interesting
insights, it was determined by the joint sponsorship committee that the associated
data presents more feasibility challenges while not delivering as much short-term
business value.
Based on the sponsors’ direction, the team looked more closely at the customer
billing data. Each month, the operational billing system generates a bill for each
phone number, also known as a service line. Because the wireless company has
millions of service lines, this represents a significant amount of data. Each service
line is associated with a single customer. However, a customer can have multiple
wireless service lines, which appear as separate line items on the same bill; each
service line has its own set of billing metrics, such as the number of minutes,
number of text messages, amount of data, and monthly service charges. There is a
single rate plan associated with each service line on a given bill, but this plan can
change as customers’ usage habits evolve. Finally, a sales organization and channel
is associated with each service line to evaluate the ongoing billing revenue stream
generated by each channel partner.
Working closely with representatives from the business and other DW/BI team
members, the data modeler designed a fact table with the grain being one row per
bill each month. The team proudly unrolls its draft dimensional modeling master-
piece, as shown in Figure 11-2, and expectantly looks at you.
What do you think? Before moving on, please spend several minutes studying the
design in Figure 11-2. Try to identify the design flaws and suggest improvements
before reading ahead.
[Figure 11-2 shows the draft Billing Fact (Bill # (FK), Customer ID (FK), Sales Org Number (FK), Sales Channel ID (FK), Rate Plan Code (FK), Rate Plan Type Code, Call Count, Total Minute Count, Night-Weekend Minute Count, Roam Minute Count, Message Count, Data MB Used, Month Service Charge, Prior Month Service Charge, Year-to-Date Service Charge, Message Charge, Data Charge, Roaming Charge, Taxes, Regulatory Charges) joined to a Bill Dimension (Bill #, Service Line Number, Bill Date), a Rate Plan Dimension (Rate Plan Code (PK and NK), Rate Plan Abbreviation, Plan Minutes Allowed, Plan Messages Allowed, Plan Data MB Allowed, Night-Weekend Minute Ind), a Service Line Dimension (Service Line Number (PK), Service Line Area Code, Service Line Activation Date), a Customer Dimension (Customer ID (PK and NK), Customer Name, Customer Address, Customer City, Customer State, Customer Zip, Orig Authorization Credit Score), a Sales Org Dimension (Sales Org Number (PK and NK), Sales Channel ID), and a Sales Channel Dimension (Sales Channel ID (PK and NK), Sales Channel Name).]
Figure 11-2: Draft schema prior to design review.
General Design Review Considerations
Before we discuss the specific issues and potential recommendations for the Figure
11-2 schema, we’ll outline the design issues commonly encountered when conduct-
ing design reviews. Not to insinuate that the DW/BI team in this case study has
stepped into all these traps, but it may be guilty of violating several. Again, the
design review exercise will be a more effective learning tool if you take a moment
to jot down your personal ideas regarding Figure 11-2 before proceeding.
Balance Business Requirements and Source Realities
Dimensional models should be designed based on a blended understanding of the business's needs, along with the operational source systems' data realities. While requirements are collected from the business users, the underlying source data should be profiled. Models driven solely by requirements inevitably include data elements that can't be sourced. Meanwhile, models driven solely by source system data analysis inevitably omit data elements that are critical to the business's analytics.
Focus on Business Processes
As reinforced for 10 chapters, dimensional models should be designed to mirror an organization's primary business process events. Dimensional models should not be designed solely to deliver specific reports or answer specific questions. Of course, business users' analytic questions are critical input because they help identify which processes are priorities for the business. But dimensional models designed to produce a specific report or answer a specific question are unlikely to withstand the test of time, especially when the questions and report formats are slightly modified. Dimensional models that more fully describe the underlying business process are more resilient to change. Process-centric dimensional models also address the analytic needs of multiple business departments; the same is definitely not true when models are designed to answer a single department's specific need.
After the base processes have been built, it may be useful to design complemen-
tary schemas, such as summary aggregations, accumulating snapshots that look
across a workflow of processes, consolidated fact tables that combine facts from
multiple processes to a common granularity, or subset fact tables that provide access
to a limited subset of fact data for security or data distribution purposes. Again, these
are all secondary complements to the core process-centric dimensional models.
Granularity
The first question to always ask during a design review is, "What's the grain of the fact table?" Surprisingly, you often get inconsistent answers to this inquiry from a design team. Declaring a clear and concise definition of the grain of the fact table is critical to a productive modeling effort. The project team and business liaisons must share a common understanding of this grain declaration; without this agreement, the design effort will spin in circles.
Of course, if you've read this far, you're aware we strongly believe fact tables should be built at the lowest level of granularity possible for maximum flexibility and extensibility, especially given the unpredictable filtering and grouping required by business user queries. Users typically don't need to see a single row at a time,
but you can’t predict the somewhat arbitrary ways they’ll want to screen and roll
up the details. The definition of the lowest level of granularity possible depends on
the business process being modeled. In this case, you want to implement the most
granular data available for the selected billing process, not just the most granular
data available in the enterprise.
Single Granularity for Facts
After the fact table granularity has been established, facts should be identified
that are consistent with the grain declaration. To improve performance or reduce
query complexity, aggregated facts such as year-to-date totals sometimes sneak into
the fact row. These totals are dangerous because they are not perfectly additive.
Although a year-to-date total reduces the complexity and run time of a few specific
queries, having it in the fact table invites double counting the year-to-date column
(or worse) when more than one date is included in the query results. It is important
that once the grain of a fact table is chosen, all the additive facts are presented at
a uniform grain.
You should prohibit text fields, including cryptic indicators and flags, from the fact table. They almost always take up more space in the fact table than a surrogate key. More important, business users generally want to query, constrain, and report against these text fields. You can provide quicker responses and more flexible access by handling these textual values in a dimension table, along with descriptive rollup attributes associated with the flags and indicators.
Dimension Granularity and Hierarchies
Each of the dimensions associated with a fact table should take on a single value
with each row of fact table measurements. Likewise, each of the dimension attri-
butes should take on one value for a given dimension row. If the attributes have a
many-to-one relationship, this hierarchical relationship can be represented within
a single dimension. You should generally look for opportunities to collapse or
denormalize dimension hierarchies whenever possible.
Experienced data modelers often revert to the normalization techniques they’ve
applied countless times in operational entity-relationship models. These modelers
often need to be reminded that normalization is absolutely appropriate to support
transaction processing and ensure referential integrity. But dimensional models
support analytic processing. Normalization in the dimensional model negatively
impacts the model's twin objectives of understandability and performance. Although
normalization is not forbidden in the extract, transform, and load (ETL) system
where data integrity must be ensured, it does place an additional burden on the
dimension change handling subsystems.
Sometimes designers attempt to deal with dimension hierarchies within the fact
table. For example, rather than having a single foreign key to the product dimension,
they include separate foreign keys for the key elements in the product hierarchy,
such as brand and category. Before you know it, a compact fact table turns into an
unruly centipede fact table joining to dozens of dimension tables. If the fact table
has more than 20 or so foreign keys, you should look for opportunities to combine
or collapse dimensions.
Elsewhere, normalization appears with the snowflaking of hierarchical relationships into separate dimension tables linked to one another. We generally also discourage this practice. Although snowflaking may reduce the disk space consumed by dimension tables, the savings are usually insignificant when compared with the entire data warehouse environment and seldom offset the disadvantages in ease of use or query performance.
Throughout this book we have occasionally discussed outriggers as permissible snowflakes. Outriggers can play a useful role in dimensional designs, but keep in mind that the use of outriggers for a cluster of relatively low-cardinality attributes should be the exception rather than the rule. Be careful not to overuse the outrigger technique across your schemas.
Finally, we sometimes review dimension tables that contain rows for both atomic and
hierarchical rollups, such as rows for both products and brands in the same dimension
table. These dimensions typically have a telltale “level” attribute to distinguish between
their base and summary rows. This pattern was prevalent and generally accepted decades
ago prior to aggregate navigation capabilities. However, we discourage its continued
use given the strong likelihood of user confusion and the risk of overcounting if the
level indicator in every dimension is not constrained in every query.
Date Dimension
Every fact table should have at least one foreign key to an explicit date dimension.
Design teams sometimes join a generic date dimension to a fact table because they
know it’s the most common dimension but then can’t articulate what the date refers
to, presenting challenges for the ETL team and business users alike. We encourage
a meaningful date dimension table with robust date rollup and filter attributes.
Fixed Time Series Buckets Instead of Date Dimension
Designers sometimes avoid a date dimension table altogether by representing a time
series of monthly buckets of facts on a single fact table row. Legacy operational systems
may contain metric sets that are repeated 12 times on a single record to represent
month 1, month 2, and so on. There are several problems with this approach. First,
the hard-coded identity of the time slots is inflexible. When you fill up all the buck-
ets, you are left with unpleasant choices. You could alter the table to expand the row.
Otherwise, you could shift everything over by one column, dropping the oldest data,
but this wreaks havoc with existing query applications. The second problem with
this approach is that all the attributes of the date are now the responsibility of the
application, not the database. There is no date dimension in which to place calendar
event descriptions for constraining. Finally, the fixed slot approach is inefficient if
measurements are taken only in a particular time period, resulting in null columns
in many rows. Instead, these recurring time buckets should be presented as separate
rows in the fact table.
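The contrast between the two approaches can be sketched in DDL; the table and column names below are illustrative assumptions, not drawn from the case study.

-- Anti-pattern: hard-coded monthly buckets on a single row; adding a
-- thirteenth month requires altering the table or shifting columns.
CREATE TABLE legacy_account_balances (
    account_id       INTEGER,
    balance_month_01 DECIMAL(12,2),
    balance_month_02 DECIMAL(12,2),
    -- columns for months 03 through 11 omitted for brevity
    balance_month_12 DECIMAL(12,2)
);

-- Preferred: one fact row per account per period, joined to a real date
-- dimension that carries all the calendar attributes.
CREATE TABLE account_monthly_snapshot_fact (
    month_end_date_key INTEGER NOT NULL,  -- FK to the date dimension
    account_key        INTEGER NOT NULL,  -- FK to the account dimension
    balance            DECIMAL(12,2)
);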
Degenerate Dimensions
Rather than treating operational transaction numbers such as the invoice or order
number as degenerate dimensions, teams sometimes want to create a separate
dimension table for the transaction number. In this case, attributes of the transac-
tion number dimension include elements from the transaction header record, such
as the transaction date and customer.
Remember, transaction numbers are best treated as degenerate dimensions. The
transaction date and customer should be captured as foreign keys on the fact table,
not as attributes in a transaction dimension. Be on the lookout for a dimension
table that has as many (or nearly as many) rows as the fact table; this is a warning
sign that there may be a degenerate dimension lurking within a dimension table.
Surrogate Keys
Instead of relying on operational keys or identifiers, we recommend the use of sur-
rogate keys as the dimension tables’ primary keys. The only permissible deviation
from this guideline applies to the highly predictable and stable date dimension. If
you are unclear about the reasons for pursuing this strategy, we suggest backtrack-
ing to Chapter 3: Retail Sales to refresh your memory.
Dimension Decodes and Descriptions
All identifiers and codes in the dimension tables should be accompanied by descrip-
tive decodes. This practice often seems counterintuitive to experienced data modelers
who have historically tried to reduce data redundancies by relying on look-up codes.
In the dimensional model, dimension attributes should be populated with the values
that business users want to see on BI reports and application pull-down menus. You
need to dismiss the misperception that business users prefer to work with codes. To
convince yourself, stroll down to their offices to see the decode listings filling their
bulletin boards or lining their computer monitors. Most users do not memorize the
codes outside of a few favorites. New hires are rendered helpless when assaulted with
a lengthy list of meaningless codes.
The good news is that decodes can usually be sourced from operational systems
with relatively minimal additional effort or overhead. Occasionally, the descriptions
are not available from an operational system but need to be provided by business
partners. In these cases, it is important to determine an ongoing maintenance strat-
egy to maintain data quality.
Finally, project teams sometimes opt to embed labeling logic in the BI tool's semantic layer rather than supporting it via dimension table attributes. Although
some BI tools provide the ability to decode within the query or reporting applica-
tion, we recommend that decodes be stored as data elements instead. Applications
should be data-driven to minimize the impact of decode additions and changes.
Of course, decodes that reside in the database also ensure greater report labeling
consistency because most organizations ultimately utilize multiple BI products.
Conformity Commitment
Last, but certainly not least, design teams must commit to using shared conformed
dimensions across process-centric models. Everyone needs to take this pledge
seriously. Conformed dimensions are absolutely critical to a robust data architec-
ture that ensures consistency and integration. Without conformed dimensions,
you inevitably perpetuate incompatible stovepipe views of performance across the
organization. By the way, dimension tables should conform and be reused whether
you drink the Kimball Kool-Aid or embrace a hub-and-spoke architectural alterna-
tive. Fortunately, operational master data management systems are facilitating the
development and deployment of conformed dimensions.
Design Review Guidelines
Before diving into a review of the draft model in Figure 11-2, let’s review some prac-
tical recommendations for conducting dimensional model design reviews. Proper
advance preparation increases the likelihood of a successful review process. Here
are some suggestions when setting up for a design review:
Invite the right players. The modeling team obviously needs to participate,
but you also want representatives from the BI development team to ensure
that proposed changes enhance usability. Perhaps most important, it’s critical
that folks who are very knowledgeable about the business and their needs
are sitting at the table. Although diverse perspectives should participate in a
review, don’t invite 25 people to the party.
Designate someone to facilitate the review. Group dynamics, politics, and the
design challenges will drive whether the facilitator should be a neutral resource
or involved party. Regardless, their role is to keep the team on track toward a
common goal. Effective facilitators need the right mix of intelligence, enthusiasm, confidence, empathy, flexibility, assertiveness (and a sense of humor).
Agree on the review’s scope. Ancillary topics will inevitably arise during the
review, but agreeing in advance on the scope makes it easier to stay focused
on the task at hand.
Block time on everyone’s calendar. We typically conduct dimensional model
reviews as a focused two-day effort. The entire review team needs to be present for the full two days. Don't allow players to float in and out to accommodate
other commitments. Design reviews require undivided attention; it’s disrup-
tive when participants leave intermittently.
Reserve the right space. The same conference room should be blocked for
the full two days. Optimally, the room should have a large white board; it’s
especially helpful if the white board drawings can be saved or printed. If a
white board is unavailable, have flip charts on hand. Don't forget markers
and tape; drinks and food also help.
Assign homework. For example, ask everyone involved to make a list of their
top five concerns, problem areas, or opportunities for improvement with the
existing design. Encourage participants to use complete sentences when mak-
ing their list so that it’s meaningful to others. These lists should be sent to the
facilitator in advance of the design review for consolidation. Soliciting advance
input gets people engaged and helps avoid “group think” during the review.
After the team gathers to focus on the review, we recommend the following
tactics:
Check attitudes at the door. Although it’s easier said than done, don’t be
defensive about prior design decisions. Embark on the review thinking change
is possible; don’t go in resigned to believing nothing can be done to improve
the situation.
Ban technology unless needed for the review process. Laptops and smart-
phones should also be checked at the door (at least figuratively). Allowing participants to check e-mail during the sessions is no different than having
them leave to attend an alternative meeting.
Exhibit strong facilitation skills. Review ground rules and ensure everyone is
openly participating and communicating. The facilitator must keep the group
on track and ban side conversations and discussions that are out of scope or
spiral into the death zone.
Ensure a common understanding of the current model. Don’t presume every-
one around the table already has a comprehensive perspective. It may be
worthwhile to dedicate the first hour to walking through the current design
and reviewing objectives before delving into potential improvements.
Designate someone to act as scribe. The scribe should take copious notes about both the discussions and the decisions being made.
Start with the big picture. Just as when you design from a blank slate, begin
with the bus matrix. Focus on a single, high-priority business process, define
its granularity and then move out to the corresponding dimensions. Follow
this same “peeling back the layers of the onion” method with a design review,
starting with the fact table and then tackling dimension-related issues. But
don’t defer the tough stu to the afternoon of the second day.
Remind everyone that business acceptance is critical. Business acceptance is
the ultimate measure of DW/BI success. The review should focus on improv-
ing the business users’ experience.
Sketch out sample rows with data values. Viewing sample data during the
review sessions helps ensure everyone has a common understanding of
the recommended improvements.
Close the meeting with a recap. Don’t let participants leave the room with-
out clear expectations about their assignments and due dates, along with an
established time for the next follow-up.
After the team completes the design review meeting, here are a few recommenda-
tions to wrap up the process:
Assign responsibility for any remaining open issues. Commit to wrestling
these issues to the ground following the review, even though this can be chal-
lenging without an authoritative party involved.
Don’t let the team’s hard work gather dust. Evaluate the cost/benefi t for the
potential improvements; some changes will be more painless (or painful)
than others. Action plans for implementing the improvements then need to
be developed.
Anticipate future reviews. Plan to reevaluate models every 12 to 24 months.
Try to view inevitable changes to the design as signs of success, rather than
failure.
Draft Design Exercise Discussion
Now that you’ve reviewed the common dimensional modeling mistakes frequently
encountered during design reviews, refer to the draft design in Figure 11-2. Several
opportunities for improvement should immediately jump out at you.
The first thing to focus on is the grain of the fact table. The team stated the
grain is one row for each bill each month. However, based on your understanding
from the source system documentation and data profiling effort, the lowest level of
billing data would be one row per service line on a bill. When you point this out,
the team initially directs you to the bill dimension, which includes the service line
number. However, when reminded that each service line has its own set of billing
metrics, the team agrees the more appropriate grain declaration would be one row
per service line per bill. The service line key is moved into the fact table as a foreign
key to the service line dimension.
While discussing the granularity, the bill dimension is scrutinized, especially
because the service line key was just moved into the fact table. As the draft model
was originally drawn in Figure 11-2, every time a bill row is loaded into the fact
table, a row also would be loaded into the bill dimension table. It doesn’t take much
to convince the team that something is wrong with this picture. Even with the
modified granularity to include service line, you would still end up with nearly as
many rows in both the fact and bill dimension tables because many customers are
billed for one service line. Instead, the bill number should be treated as a degenerate
dimension. At the same time, you move the bill date into the fact table and join it
to a robust date dimension playing the role of bill date in this schema.
You’ve probably been bothered since fi rst looking at the design by the double
joins on the sales channel dimension table. The sales channel hierarchy has been
unnecessarily snow aked. You opt to collapse the hierarchy by including the sales
channel identi ers (hopefully along with more meaningful descriptors) as additional
attributes in the sales organization dimension table. In addition, you can eliminate
the unneeded sales channel foreign key in the fact table.
The design inappropriately treats the rate plan type code as a textual fact. Textual
facts are seldom a sound design choice. In this case study, the rate plan type code
and its decode can be treated as rollup attributes in the rate plan dimension table.
The team spent some time discussing the relationship between the service line
and the customer, sales organization, and rate plan dimensions. Because there is
a single customer, sales organization, and rate plan associated with a service line
number, the dimensions theoretically could be collapsed and modeled as service
line attributes. However, collapsing the dimensions would result in a schema with
just two dimensions: bill date and service line. The service line dimension already
has millions of rows in it and is rapidly growing. In the end, you opt to treat the
customer, sales organization, and rate plan as separate entities (or mini-dimensions)
of the service line.
Surrogate keys are used inconsistently throughout the design. Many of the draft
dimension tables use operational identifiers as primary keys. You encourage the
team to implement surrogate keys for all the dimension primary keys and then
reference them as fact table foreign keys.
The original design was riddled with operational codes and identifiers. Adding
descriptive names makes the data more legible to the business users. If required
by the business, the operational codes can continue to accompany the descriptors
as dimension attributes.
Finally, you notice that there is a year-to-date metric stored in the fact table.
Although the team felt this would enable users to report year-to-date figures more
easily, in reality, year-to-date facts can be confusing and prone to error. You opt
to remove the year-to-date fact. Instead, users can calculate year-to-date amounts
on the fly by using a constraint on the year in the date dimension or leveraging the
BI tool’s capabilities.
After two exhausting days, the initial review of the design is complete. Of
course, there’s more ground to cover, including the handling of changes to the
dimension attributes. In the meantime, everyone on the team agrees the revamped
design, illustrated in Figure 11-3, is a vast improvement. You've earned your first
week’s pay.
[Figure 11-3 shows the revised Billing Fact (Bill Date Key (FK), Customer Key (FK), Service Line Key (FK), Sales Organization Key (FK), Rate Plan Key (FK), Bill Number (DD), Call Count, Total Minute Count, Night-Weekend Minute Count, Roam Minute Count, Message Count, Data MB Used, Month Service Charge, Message Charge, Data Charge, Roaming Charge, Taxes, Regulatory Charges) joined to a Bill Date Dimension, a Rate Plan Dimension (Rate Plan Key (PK), Rate Plan Code, Rate Plan Name, Rate Plan Abbreviation, Rate Plan Type Code, Rate Plan Type Description, Plan Minutes Allowed, Plan Messages Allowed, Plan Data MB Allowed, Night-Weekend Minute Ind), a Service Line Dimension (Service Line Key (PK), Service Line Number (NK), Service Line Area Code, Service Line Activation Date), a Customer Dimension (Customer Key (PK), Customer ID (NK), Customer Name, Customer Address, Customer City, Customer State, Customer Zip, Orig Authorization Credit Score), and a Sales Organization Dimension (Sales Organization Key (PK), Sales Organization Number, Sales Organization Name, Sales Channel ID, Sales Channel Name).]
Figure 11-3: Draft schema following design review.
Remodeling Existing Data Structures
It’s one thing to conduct a review and identify opportunities for improvement.
However, implementing the changes might be easier said than done if the design
has already been physically implemented.
For example, adding a new attribute to an existing dimension table feels like a
minor enhancement. It is nearly pain-free if the business data stewards declare it
to be a slowly changing dimension type 1 attribute. Likewise if the attribute is to
be populated starting now with no attempt to backfill historically accurate values
beyond a Not Available attribute value; note that while this tactic is relatively easy
to implement, it presents analytic challenges and may be deemed unacceptable. But
if the new attribute is a designated type 2 attribute with the requirement to capture
historical changes, this seemingly simple enhancement just got much more com-
plicated. In this scenario, rows need to be added to the dimension table to capture
the historical changes in the attribute, along with the other dimension attribute
changes. Some fact table rows then need to be recast so the appropriate dimension
table row is associated with the fact table’s event. This most robust approach con-
sumes surprisingly more effort than you might initially imagine.
Much less surprising is the effort required to take an existing dimensional model
and convert it into a structure that leverages newly created conformed dimensions.
As discussed in Chapter 4: Inventory, at a minimum, the fact table’s rows must be
completely reprocessed to reference the conformed dimension keys. The task is
obviously more challenging if there are granularity or other major issues.
In addition to thinking about the data-centric challenges of retrofitting existing
data structures, there are also unwanted ripples in the BI reporting and analytic
applications built on the existing data foundation. Using views to buffer the BI appli-
cations from the physical data structures provides some relief, but it’s typically not
adequate to avoid unpleasant whipsawing in the BI environment.
When considering enhancements to existing data structures, you must evaluate the costs of tackling the changes alongside the perceived benefits. In many cases, you'll determine improvements need to be made despite the pain. Similarly, you may determine the best approach is to decommission the current structures to put them out of their misery and tackle the subject area with a fresh slate. Finally, in some situations, the best approach is to simply ignore the suboptimal data structures because the costs compared to the potential benefits don't justify the remodeling and schema improvement effort. Sometimes, the best time to consider a remodeling effort is when other changes, such as a source system conversion or migration to a new BI tool standard, provide a catalyst.
Geographic Location Dimension
Let’s shift gears and presume you work for a phone company with land lines tied to
a specific physical location. The telecommunications and utilities industries have a
very well-developed notion of location. Many of their dimensions contain a precise
geographic location as part of the attribute set. The location may be resolved to a
physical street, city, state, ZIP code, latitude, and longitude. Latitude and longitude
geo-coding can be leveraged for geospatial analysis and map-centric visualization.
Some designers imagine a single master location table where address data is stan-
dardized and then the location dimension is attached as an outrigger to the service
line telephone number, equipment inventory, network inventory (including poles
and switch boxes), real estate inventory, service location, dispatch location, right of
way, and customer entities. In this scenario, each row in the master location table
is a specific point in space that rolls up to every conceivable geographic grouping.
Standardizing the attributes associated with points in space is valuable. However,
this is a back room ETL task; you don't need to unveil to the business users the single resultant table containing all the addresses the organization interacts with.
Geographic information is naturally handled as attributes within multiple dimen-
sions, not as a standalone location dimension or outrigger. There is typically little
overlap between the geographic locations embedded in various dimensions. You
would pay a performance price for consolidating all the disparate addresses into a
single dimension.
Operational systems often embrace data abstraction, but you should typically
avoid generic abstract dimensions, such as a generalized location dimension in the
DW/BI presentation area because they negatively impact the ease-of-use and query
performance objectives. These structures are more acceptable behind the scenes in
the ETL back room.
Summary
This chapter provided the opportunity to conduct a design review using an example
case study. It provided recommendations for conducting effective design reviews, along with a laundry list of common design flaws to scout for when performing a review. We encourage you to use this laundry list to review your own draft schemas when searching for potential improvements.
Transportation
Voyages occur whenever a person or thing travels from one point to another,
perhaps with stops in the middle. Obviously, voyages are a fundamental
concept for organizations in the travel industry. Shippers and internal logistical
functions also relate to the discussion, as well as package delivery services and car
rental companies. Somewhat unexpectedly, many of this chapter's schemas are also
applicable to telecommunications network route analyses; a phone network can
be thought of as a map of possible voyages that a call makes between origin and
destination phone numbers.
In this chapter we’ll draw on an airline case study to explore voyages and routes
because many readers are familiar (perhaps too familiar) with the subject matter.
The case study lends itself to a discussion of multiple fact tables at different granu-
larities. We’ll also elaborate on dimension role playing and additional date and time
dimension considerations. As usual, the intended audience for this chapter should
not be limited to the industries previously listed.
Chapter 12 discusses the following concepts:
Bus matrix snippet for an airline
Fact tables at different levels of granularity
Combining correlated role-playing dimensions
Country-specific date dimensions
Dates and times in multiple time zones
Recap of localization issues
Airline Case Study and Bus Matrix
We’ll begin by exploring a simpli ed bus matrix, and then dive into the fact tables
associated with fl ight activity.
Figure 12-1 shows a snippet of an airline's bus matrix. This example includes an additional column to capture the degenerate dimension associated with most of the business process events. Like most organizations, airlines are keenly interested in revenue. In this industry, the sale of a ticket represents unearned revenue; revenue is earned when a passenger takes a flight between origin and destination airports.
[Figure 12-1 shows bus matrix rows for Reservations, Issued Tickets, Unearned Revenue & Availability, Flight Activity, Frequent Flyer Account Credits, Customer Care Interactions, Frequent Flyer Communications, Maintenance Work Orders, and Crew Scheduling against dimension columns (Date, Time, Airport, Passenger, Fare Basis, Aircraft, Booking Channel, Class of Service, Communication Profile) plus a Transaction ID # column of degenerate identifiers (Conf #, Ticket #, Case #, Work Order #); X marks indicate which dimensions participate in each process.]
Figure 12-1: Subset of bus matrix row for an airline.
The business and DW/BI team representatives decide the first deliverable should focus on flight activity. The marketing department wants to analyze what flights the company's frequent flyers take, what fare basis they pay, how often they upgrade, how they earn and redeem their frequent flyer miles, whether they respond to special fare promotions, how long their overnight stays are, and what proportion of these frequent flyers have gold, platinum, aluminum, or titanium status. The first
project doesn’t focus on reservation or ticketing activity data that didn’t result in a
passenger boarding a plane. The DW/BI team will contend with those other sources
of data in subsequent phases.
Multiple Fact Table Granularities
When it comes to the grain as you work through the four-step design process, this case presents multiple potential levels of fact table granularity, each having different associated metrics.
At the most granular level, the airline captures data at the leg level. The leg represents an aircraft taking off at one airport and landing at another without any intermediate stops. Capacity planning and flight scheduling analysts are interested in this discrete level of information because they can look at the number of seats to calculate load factors by leg. Operational aircraft flight metrics are captured at the leg level, such as flight duration and the number of minutes late at departure and arrival. Perhaps there's even a dimension to easily identify on-time arrivals.
The next level of granularity corresponds to a segment. Segments refer to a single flight number (such as Delta flight number 40 or DL0040) flown by a single aircraft. Segments may have one or more legs associated with them; in most cases segments are composed of just one leg with a single take-off and landing. If you take a flight from San Francisco to Minneapolis with a stop in Denver but no aircraft or flight number change, you have flown one segment (SFO-MSP) but two legs (SFO-DEN and DEN-MSP). Conversely, if the flight flew nonstop from San Francisco to Minneapolis, you would have flown one segment as well as one leg. The segment represents the line item on an airline ticket coupon; passenger revenue and mileage credit is determined at the segment level. So although some airline departments focus on leg level operations, the marketing and revenue groups focus on segment-level metrics.
Next, you can analyze flight activity by trip. The trip provides an accurate picture of customer demand. In the prior example, assume the flights from San Francisco to Minneapolis required the flyer to change aircraft in Denver. In this case, the trip from San Francisco to Minneapolis would entail two segments corresponding to the two involved aircraft. In reality, the passenger just asked to go from San Francisco to Minneapolis; the fact that she needs to stop in Denver is merely a necessary evil. For this reason, sales and marketing analysts are also interested in trip level data. Finally, the airline collects data for the itinerary, which is equivalent to the entire airline ticket or reservation confirmation number.
The DW/BI team and business representatives decide to begin at the segment-level
grain. This represents the lowest level of data with meaningful revenue metrics.
Alternatively, you could lean on the business for rules to allocate the segment-level
metrics down to the leg, perhaps based on the mileage of each leg within the seg-
ment. The data warehouse inevitably will tackle the more granular leg level data for
the capacity planners and ight schedulers at some future point. The conforming
dimensions built during this fi rst iteration will be leveraged at that time.
There will be one row in the fact table for each boarding pass collected from
passengers. The dimensionality associated with this data is quite extensive, as
illustrated in Figure 12-2. The schema extensively uses the role-playing technique.
The multiple date, time, and airport dimensions link to views of a single underly-
ing physical date, time, and airport dimension table, respectively, as we discussed
originally in Chapter 6: Order Management.
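The role-playing views can be defined directly in the database. The following SQL is only a minimal sketch, not drawn from the book's figures; the physical table names date_dim and airport_dim and their column names are hypothetical stand-ins for whatever your dimensions actually contain.

-- One physical date dimension table exposed through one view per role
CREATE VIEW scheduled_departure_date AS
SELECT date_key       AS scheduled_departure_date_key,
       calendar_date  AS scheduled_departure_date,
       calendar_month AS scheduled_departure_calendar_month
FROM date_dim;

CREATE VIEW actual_departure_date AS
SELECT date_key       AS actual_departure_date_key,
       calendar_date  AS actual_departure_date,
       calendar_month AS actual_departure_calendar_month
FROM date_dim;

-- The same pattern applies to the time-of-day and airport roles,
-- for example a segment origin airport view over a single airport table
CREATE VIEW segment_origin_airport AS
SELECT airport_key  AS segment_origin_airport_key,
       airport_code AS segment_origin_airport_code,
       airport_name AS segment_origin_airport_name
FROM airport_dim;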
Time-of-Day Dimension (views for 2 roles)
Airport Dimension (views for 2 roles)
Date Dimension (views for 2 roles)
Passenger Dimension
Passenger Profile Dimension
Class of Service Flown Dimension
Booking Channel Dimension
Aircraft Dimension
Fare Basis Dimension
Segment-Level Flight Activity Fact
Scheduled Departure Date Key (FK)
Scheduled Departure Time Key (FK)
Actual Departure Date Key (FK)
Actual Departure Time Key (FK)
Passenger Key (FK)
Passenger Profile Key (FK)
Segment Origin Airport Key (FK)
Segment Destination Airport Key (FK)
Aircraft Key (FK)
Class of Service Flown Key (FK)
Fare Basis Key (FK)
Booking Channel Key (FK)
Confirmation Number (DD)
Ticket Number (DD)
Segment Sequence Number (DD)
Flight Number (DD)
Base Fare Revenue
Passenger Facility Charges
Airport Tax
Government Tax
Baggage Charges
Upgrade Fees
Transaction Fees
Segment Miles Flown
Segment Miles Earned
Figure 12-2: Initial segment flight activity schema.
The passenger dimension is a garden-variety customer dimension with rich attributes captured about the most valuable frequent flyers. Interestingly, frequent flyers are motivated to help maintain this dimension accurately because they want to ensure they're receiving appropriate mileage credit. For a large airline, this dimension has tens to hundreds of millions of rows.

Marketing wants to analyze activity by the frequent flyer tier, which can change during the course of a year. In addition, you learned during the requirements process that the users are interested in slicing and dicing based on the flyers' home airports, whether they belong to the airline's airport club at the time of each flight, and their lifetime mileage tier. Given the change tracking requirements, coupled with the size of the passenger dimension, you opt to create a separate passenger profile mini-dimension, as we discussed in Chapter 5: Procurement, with one row for each unique combination of frequent flyer elite tier, home airport, club membership status, and lifetime mileage tier. Sample rows for this mini-dimension are illustrated in Figure 12-3. You considered treating these attributes as slowly changing type 2 attributes, especially because the attributes don't rapidly change. But given the number of passengers, you opt for a type 4 mini-dimension instead. As it turns out, marketing analysts often leverage this mini-dimension for their analysis and reporting without touching the millions of passenger dimension rows.
Passenger Profile Key | Frequent Flyer Tier | Home Airport | Club Membership Status | Lifetime Mileage Tier
1    | Basic       | ATL | Non-Member  | Under 100,000 miles
2    | Basic       | ATL | Club Member | Under 100,000 miles
3    | Basic       | BOS | Non-Member  | Under 100,000 miles
...  | ...         | ... | ...         | ...
789  | MidTier     | ATL | Non-Member  | 100,000-499,999 miles
790  | MidTier     | ATL | Club Member | 100,000-499,999 miles
791  | MidTier     | BOS | Non-Member  | 100,000-499,999 miles
...  | ...         | ... | ...         | ...
2468 | WarriorTier | ATL | Club Member | 1,000,000-1,999,999 miles
2469 | WarriorTier | ATL | Club Member | 2,000,000-2,999,999 miles
2470 | WarriorTier | BOS | Club Member | 1,000,000-1,999,999 miles
...  | ...         | ... | ...         | ...
Figure 12-3: Passenger mini-dimension sample rows.
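Because the mini-dimension key sits directly in the fact table, tier-level analysis never needs to touch the large passenger dimension. The query below is a hedged sketch using hypothetical snake_case names for the tables and columns shown in Figures 12-2 and 12-3.

-- Revenue and miles by frequent flyer tier and club membership status
SELECT pp.frequent_flyer_tier,
       pp.club_membership_status,
       SUM(f.base_fare_revenue)   AS total_base_fare_revenue,
       SUM(f.segment_miles_flown) AS total_segment_miles_flown
FROM segment_flight_activity_fact f
JOIN passenger_profile_dim pp
  ON f.passenger_profile_key = pp.passenger_profile_key
GROUP BY pp.frequent_flyer_tier,
         pp.club_membership_status;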
The aircraft dimension contains information about each plane flown. The origin and destination airports associated with each flight are called out separately to simplify the user's view of the data and make access more efficient.

The class of service flown describes whether the passenger sat in economy, premium economy, business, or first class. The fare basis dimension describes the terms surrounding the fare. It would identify whether it's an unrestricted fare, a 21-day advance purchase fare with change and cancellation penalties, or a 10 percent off fare due to a special promotion.

The sales channel dimension identifies how the ticket was purchased, whether through a travel agency, directly from the airline's phone number, city ticket office, or website, or via another internet travel services provider. Although the sales channel relates to the entire ticket, each segment should inherit ticket-level dimensionality. In addition, several operational numbers are associated with the flight activity data, including the itinerary number, ticket number, flight number, and segment sequence number.

The facts captured at the segment level of granularity include the base fare revenue, passenger facility charges, airport and government taxes, other ancillary charges and fees, segment miles flown, and segment miles awarded (in those cases in which a minimum number of miles are awarded regardless of the flight distance).
Linking Segments into Trips
Despite the powerful dimensional framework you just designed, you cannot easily answer one of the most important questions about your frequent flyers, namely, "Where are they going?" The segment grain masks the true nature of the trip. If you fetch all the segments of a trip and sequence them by segment number, it is still
nearly impossible to discern the trip start and endpoints. Most complete itinerar-
ies start and end at the same airport. If a lengthy stop were used as a criterion for
a meaningful trip destination, it would require extensive and tricky processing at
the BI reporting layer whenever you try to summarize trips.
The answer is to introduce two more airport role-playing dimensions, trip origin and trip destination, while keeping the grain at the flight segment level. These are determined during data extraction by looking on the ticket for any stop of more than four hours, which is the airline's official definition of a stopover. You need to exercise some caution when summarizing data by trip in this schema. Some of the dimensions, such as fare basis or class of service flown, don't apply at the trip level. On the other hand, it may be useful to see how many trips from San Francisco to Minneapolis included an unrestricted fare on a segment.
In addition to linking segments into trips on the segment flight activity schema, if the business users are constantly looking at information at the trip level, rather than by segment, you might create an aggregate fact table at the trip grain. Some of the earlier dimensions discussed, such as class of service and fare basis, obviously would not be applicable. The facts would include aggregated metrics like trip total base fare or trip total taxes, plus additional facts that would appear only in this complementary trip summary table, such as the number of segments in the trip. However, you would go to the trouble of creating this aggregate table only if there were obvious performance or usability issues when you use the segment-level table as the basis for rolling up the same reports. If a typical trip consists of three segments, you would see at best a threefold performance improvement with such an aggregate table, meaning it may not be worth the bother.
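If such a trip-grain aggregate is warranted, it can be derived entirely from the segment-level fact table. The statement below is only a sketch under assumed names; a production version would also carry conformed date keys and would typically be maintained incrementally by the ETL system rather than rebuilt with CREATE TABLE AS.

-- Trip-grain aggregate derived from the segment-level facts (hypothetical names)
CREATE TABLE trip_flight_activity_agg AS
SELECT trip_origin_airport_key,
       trip_destination_airport_key,
       passenger_key,
       passenger_profile_key,
       confirmation_number,
       SUM(base_fare_revenue)            AS trip_base_fare_revenue,
       SUM(airport_tax + government_tax) AS trip_total_taxes,
       SUM(segment_miles_flown)          AS trip_miles_flown,
       COUNT(*)                          AS trip_segment_count
FROM segment_flight_activity_fact
GROUP BY trip_origin_airport_key,
         trip_destination_airport_key,
         passenger_key,
         passenger_profile_key,
         confirmation_number;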
Related Fact Tables
As discussed earlier, you would likely create a leg-grained flight activity fact table to satisfy the more operational needs surrounding the departure and arrival of each flight. Metrics at the leg level might include actual and blocked flight durations, departure and arrival delays, and departure and arrival fuel weights.

In addition to the flight activity, there will be fact tables to capture reservations and issued tickets. Given the focus on maximizing revenue, there might be a revenue and availability snapshot for each flight; it could provide snapshots for the final 90 days leading up to a flight departure with cumulative unearned revenue and remaining availability per class of service for each scheduled flight. The snapshot might include a dimension supporting the concept of "days prior to departure" to facilitate the comparison of similar flights at standard milestones, such as 60 days prior to scheduled departure.
Extensions to Other Industries
Using the airline case study to illustrate a voyage schema makes intuitive sense because most people have boarded a plane at one time or another. We'll briefly touch on several other variations on this theme.
Cargo Shipper
The schema for a cargo shipper looks quite similar to the airline schemas just developed. Suppose a transoceanic shipping company transports bulk goods in containers from foreign to domestic ports. The items in the containers are shipped from an original shipper to a final consignor. The trip can have multiple stops at intermediate ports. It is possible the containers may be off-loaded from one ship to another at a port. Likewise, it is possible one or more of the legs may be by truck rather than ship.
As illustrated in Figure 12-4, the grain of the fact table is the container on a specific bill-of-lading number on a particular leg of its trip. The ship mode dimension identifies the type of shipping company and specific vessel. The container dimension describes the size of the container and whether it requires electrical power or refrigeration. The commodity dimension describes the item in the container. Almost anything that can be shipped can be described by harmonized commodity codes, which are a kind of master conformed dimension used by agencies, including U.S. Customs. The consignor, foreign transporter, foreign consolidator, shipper, domestic consolidator, domestic transporter, and consignee are all roles played by a master business entity dimension that contains all the possible business parties associated with a voyage. The bill-of-lading number is a degenerate dimension. We assume the fees and tariffs are applicable to the individual leg of the voyage.
Travel Services
If you work for a travel services company, you can complement the flight activity schema with fact tables to track associated hotel stays and rental car usage. These schemas would share several common dimensions, such as the date and customer. For hotel stays, the grain of the fact table is the entire stay, as illustrated in Figure 12-5. The grain of a similar car rental fact table would be the entire rental episode. Of course, if constructing a fact table for a hotel chain rather than a travel services company, the schema would be much more robust because you'd know far more about the hotel property characteristics, the guest's use of services, and associated detailed charges.
Shipping Transport Fact
Voyage Departure Date Key (FK)
Leg Departure Date Key (FK)
Voyage Origin Port Key (FK)
Voyage Destination Port Key (FK)
Leg Origin Port Key (FK)
Leg Destination Port Key (FK)
Ship Mode Key (FK)
Container Key (FK)
Commodity Key (FK)
Consignor Key (FK)
Foreign Transporter Key (FK)
Foreign Consolidator Key (FK)
Shipper Key (FK)
Domestic Consolidator Key (FK)
Domestic Transporter Key (FK)
Consignee Key (FK)
Bill-of-Lading Number (DD)
Leg Fee
Leg Tariffs
Leg Miles
Date Dimension (views for 2 roles)
Port Dimension (views for 4 roles)
Ship Mode Dimension
Container Dimension
Commodity Dimension
Business Entity Dimension (views for 7 roles)
Figure 12-4: Shipper schema.
Travel Services Hotel Stay Fact
Reservation Date Key (FK)
Arrival Date Key (FK)
Departure Date Key (FK)
Customer Key (FK)
Hotel Property Key (FK)
Sales Channel Key (FK)
Confirmation Number (DD)
Ticket Number (DD)
Number of Nights
Extended Room Charge
Tax Charge
Date Dimension (views for 3 roles)
Customer Dimension
Sales Channel Dimension
Hotel Property Dimension
Figure 12-5: Travel services hotel stay schema.
Combining Correlated Dimensions
We stated previously that if a many-to-many relationship exists between two groups
of dimension attributes, they should be modeled as separate dimensions with sepa-
rate foreign keys in the fact table. Sometimes, however, you encounter situations
where these dimensions can be combined into a single dimension rather than treat-
ing them as two separate dimensions with two separate foreign keys in the fact table.
Class of Service
The Figure 12-2 draft schema includes the class of service flown dimension. Following a design checkpoint with the business community, you learn the users also want to analyze the booking class purchased. In addition, the business users want to easily filter and report on activity based on whether an upgrade or downgrade occurred. Your initial reaction might be to include a second role-playing dimension and foreign key in the fact table to support both the purchased and flown class of service. In addition, you would need a third foreign key for the upgrade indicator; otherwise, the BI application would need to include logic to identify numerous scenarios as upgrades, including economy to premium economy, economy to business, economy to first, premium economy to business, and so on.

In this situation, however, there are only four rows in the class dimension table to indicate first, business, premium economy, and economy classes. Likewise, the upgrade indicator dimension also would have just three rows in it, corresponding to upgrade, downgrade, or no class change. Because the row counts are so small, you can elect instead to combine the dimensions into a single class of service dimension, as illustrated in Figure 12-6.
Class of Service Key | Class Purchased | Class Flown | Purchased-Flown Group | Class Change Indicator
1  | Economy      | Economy      | Economy-Economy           | No Class Change
2  | Economy      | Prem Economy | Economy-Prem Economy      | Upgrade
3  | Economy      | Business     | Economy-Business          | Upgrade
4  | Economy      | First        | Economy-First             | Upgrade
5  | Prem Economy | Economy      | Prem Economy-Economy      | Downgrade
6  | Prem Economy | Prem Economy | Prem Economy-Prem Economy | No Class Change
7  | Prem Economy | Business     | Prem Economy-Business     | Upgrade
8  | Prem Economy | First        | Prem Economy-First        | Upgrade
9  | Business     | Economy      | Business-Economy          | Downgrade
10 | Business     | Prem Economy | Business-Prem Economy     | Downgrade
11 | Business     | Business     | Business-Business         | No Class Change
12 | Business     | First        | Business-First            | Upgrade
13 | First        | Economy      | First-Economy             | Downgrade
14 | First        | Prem Economy | First-Prem Economy        | Downgrade
15 | First        | Business     | First-Business            | Downgrade
16 | First        | First        | First-First               | No Class Change
Figure 12-6: Combined class dimension sample rows.
The Cartesian product of the separate class dimensions results in a 16-row dimension table (4 class purchased rows times 4 class flown rows). You also have the opportunity in this combined dimension to describe the relationship between
the purchased and flown classes, such as a class change indicator. Think of this combined class of service dimension as a type of junk dimension, introduced in Chapter 6. In this case study, the attributes are tightly correlated. Other airline fact tables, such as inventory availability or ticket purchases, would invariably reference a conformed class dimension table with just four rows.
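One way to populate the combined dimension is to cross join a conformed four-row class table with itself, deriving the group and indicator attributes as you go. This is only a sketch; the class_dim table with class_name and class_sort_order columns (1 = Economy through 4 = First) is an assumption, not something defined in the chapter.

-- Generate the 16-row combined class of service dimension from a 4-row class table
INSERT INTO class_of_service_dim
  (class_of_service_key, class_purchased, class_flown,
   purchased_flown_group, class_change_indicator)
SELECT ROW_NUMBER() OVER (ORDER BY p.class_sort_order, f.class_sort_order),
       p.class_name,
       f.class_name,
       p.class_name || '-' || f.class_name,
       CASE
         WHEN p.class_sort_order = f.class_sort_order THEN 'No Class Change'
         WHEN p.class_sort_order < f.class_sort_order THEN 'Upgrade'
         ELSE 'Downgrade'
       END
FROM class_dim p            -- purchased role
CROSS JOIN class_dim f;     -- flown role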
NOTE In most cases, role-playing dimensions should be treated as separate logi-
cal dimensions created via views on a single physical table. In isolated situations,
it may make sense to combine the separate dimensions into a single dimension,
notably when the data volumes are extremely small or there is a need for additional
attributes that depend on the combined underlying roles for context and meaning.
Origin and Destination
Likewise, consider the pros and cons of combining the origin and destination airport
dimensions. In this situation the data volumes are more significant, so separate role-playing origin and destination dimensions seem more practical. However, the business users may need additional attributes that depend on the combination of origin and destination. In addition to accessing the characteristics of each airport, business users also want to analyze flight activity data by the distance between the city-pair airports, as well as the type of city pair (such as domestic or trans-Atlantic). Even the seemingly simple question regarding the total activity between San Francisco (SFO) and Denver (DEN), regardless of whether the flights originated in SFO or
DEN, presents some challenges with separate origin and destination dimensions.
SQL experts could surely answer the question programmatically with separate air-
port dimensions, but what about the less empowered? Even if experts can derive
the correct answer, there’s no standard label for the nondirectional city-pair route.
Some reporting applications may label it SFO-DEN, whereas others might opt for
DEN-SFO, San Fran-Denver, Den-SF, and so on. Rather than embedding inconsis-
tent labels in BI reporting application code, the attribute values should be stored
in a dimension table, so common standardized labels can be used throughout the
organization. It would be a shame to go to the bother of creating a data warehouse
and then allowing application code to implement inconsistent reporting labels. The
business sponsors of the DW/BI system won’t tolerate that for long.
To satisfy the need to access additional city-pair route attributes, you have two
options. One is merely to add another dimension to the fact table for the city-pair
route descriptors, including the directional route name, nondirectional route name,
type, and distance, as shown in Figure 12-7. The other alternative is to combine
the origin and destination airport attributes, plus the supplemental city-pair route
attributes, into a single dimension. Theoretically, the combined dimension could
have as many rows as the Cartesian product of all the origin and destination air-
ports. Fortunately, in real life the number of rows is much smaller than this theo-
retical limit because airlines don't operate flights between every airport where they
have a presence. However, with a couple dozen attributes about the origin airport,
plus a couple dozen identical attributes about the destination airport, along with
attributes about the route, you would probably be more tempted to treat them as
separate dimensions.
City-Pair Route Key | Directional Route Name | Non-Directional Route Name | Route Distance in Miles | Route Distance Band | Dom-Intl Ind | Transocean Ind
1 | BOS-JFK | BOS-JFK | 191   | Less than 200 miles   | Domestic      | Non-Oceanic
2 | JFK-BOS | BOS-JFK | 191   | Less than 200 miles   | Domestic      | Non-Oceanic
3 | BOS-LGW | BOS-LGW | 3,267 | 3,000 to 3,500 miles  | International | Transatlantic
4 | LGW-BOS | BOS-LGW | 3,267 | 3,000 to 3,500 miles  | International | Transatlantic
5 | BOS-NRT | BOS-NRT | 6,737 | More than 6,000 miles | International | Transpacific
6 | NRT-BOS | BOS-NRT | 6,737 | More than 6,000 miles | International | Transpacific
Figure 12-7: City-pair route dimension sample rows.
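With the city-pair route dimension in place, the nondirectional question posed earlier becomes a simple constraint on a stored attribute. The query below is a sketch with hypothetical names; the literal label depends on whatever standard the organization adopts for the nondirectional route name.

-- Total activity between two cities regardless of flight direction
SELECT r.nondirectional_route_name,
       COUNT(*)                 AS segments_flown,
       SUM(f.base_fare_revenue) AS total_base_fare_revenue
FROM segment_flight_activity_fact f
JOIN city_pair_route_dim r
  ON f.city_pair_route_key = r.city_pair_route_key
WHERE r.nondirectional_route_name = 'DEN-SFO'   -- the single standardized label
GROUP BY r.nondirectional_route_name;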
Sometimes designers suggest using a bridge table containing the origin and
destination airport keys to capture the route information. Although the origin
and destination represent a many-to-many relationship, in this case, you can
cleanly represent the relationship within the existing fact table rather than
using a bridge.
More Date and Time Considerations
From the earliest chapters in this book we’ve discussed the importance of having a
verbose date dimension, whether at the individual day, week, or month granular-
ity, that contains descriptive attributes about the date and private labels for fiscal periods and work holidays. In this final section, we'll introduce several additional
considerations for dealing with date and time dimensions.
Country-Specific Calendars as Outriggers
If the DW/BI system serves multinational needs, you must generalize the standard
date dimension to handle multinational calendars in an open-ended number of coun-
tries. The primary date dimension contains generic calendar attributes about the date,
regardless of the country. If your multinational business spans Gregorian, Hebrew,
Islamic, and Chinese calendars, you would include four sets of days, months, and
years in this primary dimension.
Country-specific date dimensions supplement the primary date table. The key to the supplemental dimension is the primary date key, along with the country code. The table would include country-specific date attributes, such as holiday or season names, as illustrated in Figure 12-8. This approach is similar to the handling of multiple fiscal accounting calendars, as described in Chapter 7: Accounting.
Fact
Date Key (FK)
More FKs ...
Facts ...

Date Dimension
Date Key (PK)
Date
Day of Week
Day Number in Epoch
Week Number in Epoch
Month Number in Epoch
Day Number in Calendar Month
Day Number in Calendar Year
Day Number in Fiscal Month
Last Day in Fiscal Month Indicator
Calendar Month
Calendar Month Number in Year
Calendar Year-Month (YYYY-MM)
Calendar Quarter
Calendar Year-Quarter
Calendar Year
Fiscal Month
Fiscal Month Number in Year
Fiscal Year-Month
Fiscal Quarter
Fiscal Year-Quarter
Fiscal Year
...

Country-Specific Date Outrigger
Date Key (FK)
Country Key (FK)
Country Name
Civil Name
Civil Holiday Flag
Civil Holiday Name
Religious Holiday Flag
Religious Holiday Name
Weekday Indicator
Season Name
Figure 12-8: Country-specific calendar outrigger.
You can join this table to the main calendar dimension as an outrigger or directly to the fact table. If you provide an interface that requires the user to specify a country name, then the attributes of the country-specific supplement can be viewed as logically appended to the primary date table, allowing them to view the calendar through the eyes of a single country at a time. Country-specific calendars can be
messy to build in their own right; things get even more complicated if you need to deal with local holidays that occur on different days in different parts of a country.
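When the outrigger is joined through the primary date dimension and constrained to one country, the country-specific attributes behave as if they were columns on the date table itself. The query below is a sketch under assumed physical names and flag values.

-- Activity on one country's civil holidays (hypothetical tables, columns, and flag values)
SELECT cal.civil_holiday_name,
       SUM(f.base_fare_revenue) AS holiday_revenue
FROM flight_activity_fact f
JOIN date_dim d
  ON f.departure_date_key = d.date_key
JOIN country_specific_date_outrigger cal
  ON cal.date_key = d.date_key
WHERE cal.country_name = 'Japan'            -- the user must pick a single country
  AND cal.civil_holiday_flag = 'Holiday'
GROUP BY cal.civil_holiday_name;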
Date and Time in Multiple Time Zones
When operating in multiple countries or even just multiple time zones, you’re faced
with a quandary concerning transaction dates and times. Do you capture the date and
time relative to local midnight in each time zone, or do you express the time period
relative to a standard, such as the corporate headquarters date/time, Greenwich Mean
Time (GMT), or Coordinated Universal Time (UTC), also known as Zulu time in the
aviation world? To fully satisfy users’ requirements, the correct answer is probably
both. The standard time enables you to see the simultaneous nature of transactions
across the business, whereas the local time enables you to understand transaction
timing relative to the time of day.
Contrary to popular belief, there are more than 24 time zones (corresponding
to the 24 hours of the day) in the world. For example, there is a single time zone in China despite its longitudinal span. Likewise, there is a single time zone in India, offset from UTC by 5.5 hours. In Australia, there are three time zones, with its Central time zone offset by one-half hour. Meanwhile, Nepal and some other nations use a one-quarter-hour offset. The situation gets even more unpleasant when you account
for switches to and from daylight saving time.
Given the complexities, it's unreasonable to think that merely providing a UTC offset in a fact table can support equivalized dates and times. Likewise, the offset can't reside in a time or airport dimension table because the offset depends on both
location and date. The recommended approach for expressing dates and times in
multiple time zones is to include separate date and time-of-day dimensions corre-
sponding to the local and equivalized dates, as shown in Figure 12-9. The time-of-day
dimensions, as discussed in Chapter 3: Retail Sales, support time period groupings
such as shift numbers or rush period time block designations.
Flight Activity Fact
Departure Date Key (FK)
GMT Departure Date Key (FK)
Departure Time-of-Day Key (FK)
GMT Departure Time-of-Day Key (FK)
More FKs ...
Degenerate Dimensions ...
Facts ...
Date Dimension
(2 views for roles)
Time-of-Day Dimension
(2 views for roles)
Figure 12-9: Local and equivalized date/time across time zones.
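With both sets of keys in the fact table, a single query can filter on local wall-clock time while aggregating on the equivalized calendar. The sketch below assumes hypothetical role-playing view names and an hour_of_day attribute in the time-of-day dimension.

-- Late-night local departures (22:00-06:00 local) counted by GMT departure date
SELECT gmt_d.gmt_departure_date,
       COUNT(*) AS late_night_departures
FROM flight_activity_fact f
JOIN departure_time_of_day local_t
  ON f.departure_time_of_day_key = local_t.departure_time_of_day_key
JOIN gmt_departure_date gmt_d
  ON f.gmt_departure_date_key = gmt_d.gmt_departure_date_key
WHERE local_t.hour_of_day >= 22
   OR local_t.hour_of_day < 6
GROUP BY gmt_d.gmt_departure_date;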
Localization Recap
We have discussed the challenges of international DW/BI systems in several chapters
of the book. In addition to the international time zones and calendars discussed in the
previous two sections, we have also talked about multi-currency reporting in Chapter
6 and multi-language support in Chapter 8: Customer Relationship Management.
All these database-centric techniques fall under the general theme of localiza-
tion. Localization in the larger sense also includes the translation of user interface
text embedded in BI tools. BI tool vendors implement this form of localization with text
databases containing all the text prompts and labels needed by the tool, which can
then be configured for each local environment. Of course, this can become quite
complicated because text translated from English to most European languages results
in text strings that are longer than their English equivalents, which may force a
redesign of the BI application. Also, Arabic text reads from right to left, and many
Asian languages are completely different.
A serious international DW/BI system built to serve business users in many
countries needs to be thoughtfully designed to account for a selected set of these
localization issues. But perhaps it is worth thinking about how airport control tow-
ers and airplane pilots around the world deal with language incompatibilities when
communicating critical messages about flight directions and altitudes. They all use
one language (English) and unit of measure (feet).
Summary
In this chapter we turned our attention to airline trips or routes; we briefly touched
on similar scenarios drawn from the shipping and travel services industries. We
examined the situation in which we have multiple fact tables at multiple granularities
with multiple grain-specific facts. We also discussed the possibility of combining
dimensions into a single dimension table for cases in which the row count volumes
are extremely small or when there are additional attributes that depend on the com-
bined dimensions. Again, combining correlated dimensions should be viewed as the
exception rather than the rule.
We wrapped up this chapter by discussing several date and time dimension tech-
niques, including country-specific calendar outriggers and the handling of absolute
and relative dates and times.
Education
We step into the world of an educational institution in this chapter, looking first
at the applicant pipeline as an accumulating snapshot. When accumulating
snapshot fact tables were introduced in Chapter 4: Inventory, a product movement
pipeline illustrated the concept; order fulfillment workflows were captured in an
accumulating snapshot in Chapter 6: Order Management. In this chapter, rather than
watching products or orders move through various states, an accumulating snapshot
is used to monitor prospective student applicants as they progress through admis-
sions milestones.
The other primary concept discussed in this chapter is the factless fact table. We’ll
explore several case study illustrations drawn from higher education to further elabo-
rate on these special fact tables and discuss the analysis of events that didn’t occur.
Chapter 13 discusses the following concepts:
Example bus matrix snippet for a university or college
Applicant tracking and research grant proposals as accumulating snapshot
fact tables
Factless fact tables for admission events, course registrations, facilities management, and student attendance
Handling of nonexistent events
University Case Study and Bus Matrix
In this chapter you’re working for a university, college, or other type of educational
institution. Someone at a higher education client once remarked that running a
university is akin to operating all the businesses needed to support a small vil-
lage. Universities are simultaneously a real estate property management company
(residential student housing), restaurant with multiple outlets (dining halls), retailer
(bookstore), events management and ticketing agency (athletics and speaker events),
police department (campus security), professional fundraiser (alumni development),
consumer financial services company (financial aid), investment firm (endowment management), venture capitalist (research and development), job placement firm
(career planning), construction company (buildings and facilities maintenance), and
medical services provider (health clinic). In addition to these varied functions, higher
education institutions are obviously also focused on attracting high caliber students
and talented faculty to create a robust educational environment.
The bus matrix snippet in Figure 13-1 covers several core processes within an
educational institution. Traditionally, there has been less focus on revenue and profit in higher education, but with ever-escalating costs and competition, universities and colleges cannot ignore these financial metrics. They want to attract and retain
students who align with their academic and other institutional objectives. There’s
a strong interest in analyzing what students are “buying” in terms of courses each
term and the associated academic outcomes. Colleges and universities want to
understand many aspects of the student’s experience, along with maintaining an
ongoing relationship well beyond graduation.
Accumulating Snapshot Fact Tables
Chapter 4 used an accumulating snapshot fact table to track products identified by
serial or lot numbers as they move through various inventory stages in a warehouse.
Take a moment to recall the distinguishing characteristics of an accumulating snap-
shot fact table:
A single row represents the complete history of a workflow or pipeline
instance.
Multiple dates represent the standard pipeline milestone events.
The accumulating snapshot facts often include metrics corresponding to
each milestone, plus status counts and elapsed durations.
Each row is revisited and updated whenever the pipeline instance changes;
both foreign keys and measured facts may be changed during the fact row
updates.
Applicant Pipeline
Now envision these same accumulating snapshot characteristics as applied to the
prospective student admissions pipeline. For those who work in other industries,
there are obvious similarities to tracking job applicants through the hiring process
or sales prospects as they are qualified and become customers.
[Figure 13-1 shows a bus matrix. Its row labels, a mix of process group headings and individual business processes, are: Admission Events, Student Lifecycle Processes, Financial Processes, Applicant Pipeline, Financial Aid Awards, Student Enrollment/Profile Snapshot, Student Residential Housing, Student Course Registration & Outcomes, Student Course Instructor Evaluations, Student Activities, Career Placement Activities, Advancement Contacts, Advancement Pledges & Gifts, Budgeting, Endowment Tracking, GL Transactions, Payroll, Procurement, Employee Management Processes, Employee Headcount Snapshot, Employee Hiring & Separations, Employee Benefits & Compensation, Administrative Processes, Facilities Utilization, Energy Consumption & Waste Management, Work Orders, Staff Performance Management, Faculty Appointment Management, Research Proposal Pipeline, Research Expenditures, and Faculty Publications. Its conformed dimension columns are Date/Term, Applicant-Student-Alum, Employee (Faculty, Staff), Facility, Account, Course, and Department. The individual X marks indicating which processes use which dimensions are not recoverable from this extraction.]
Figure 13-1: Subset of bus matrix rows for educational institution.
In the case of applicant tracking, prospective students progress through a stan-
dard set of admissions hurdles or milestones. Perhaps you’re interested in tracking
activities around key dates, such as initial inquiry, campus visit, application submitted, application file completed, admissions decision notification, and enrolled or withdrawn. At any point in time, admissions and enrollment management analysts are interested in how many applicants are at each stage in the pipeline. The process is much like a funnel, where many inquiries enter the pipeline, but far fewer progress through to the final stage. Admission personnel also would like to analyze the
applicant pool by a variety of characteristics.
The grain of the applicant pipeline accumulating snapshot is one row per prospec-
tive student; this granularity represents the lowest level of detail captured when the
prospect enters the pipeline. As more information is collected while the prospective
student progresses toward application, acceptance, and enrollment, you continue
to revisit and update the fact table row, as illustrated in Figure 13-2.
Applicant Dimension
Applicant Key (PK)
Applicant Name
Applicant Address Attributes ...
High School
High School GPA
High School Type
SAT Math Score
SAT Verbal Score
SAT Writing Score
ACT Composite Score
Number of AP Credits
Gender
Date of Birth
Ethnicity
Full time-Part time Indicator
Application Source
Intended Major
...

Date Dimension (views for 6 roles)
Date Key (PK)
...
Term
Academic Year-Term
Academic Year

Application Status Dimension
Application Status Key
Application Status Code
Application Status Description
Application Status Category

Applicant Pipeline Fact
Initial Inquiry Date Key (FK)
Campus Visit Date Key (FK)
Application Submitted Date Key (FK)
Application File Completed Date Key (FK)
Admission Decision Notification Date Key (FK)
Applicant Enroll-Withdraw Date Key (FK)
Applicant Key (FK)
Application Status Key (FK)
Application ID (DD)
Inquiry Count
Campus Visit Count
Application Submitted Count
Application Completed Count
Admit Early Decision Count
Admit Regular Decision Count
Waitlist Count
Defer to Regular Decision Count
Deny Count
Enroll Early Decision Count
Enroll Regular Decision Count
Admit Withdraw Count
Figure 13-2: Student applicant pipeline as an accumulating snapshot.
Like earlier accumulating snapshots, there are multiple dates in the fact table
corresponding to the standard milestone events. You want to analyze the prospect’s
progress by these dates to determine the pace of movement through the pipeline and
spot bottlenecks. This is especially important if you see a significant lag involving
a candidate whom you’re interested in recruiting. Each of these dates is treated as a
role-playing dimension, with a default surrogate key to handle the unknown dates
for new and in-process rows.
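Because each milestone is a role-playing date, pipeline pacing questions reduce to simple date arithmetic between the roles. The query below is only a sketch; the view and column names are hypothetical, and the date subtraction syntax varies by database platform.

-- Average days from initial inquiry to submitted application, by academic year
SELECT sub.academic_year,
       AVG(sub.full_date - inq.full_date) AS avg_days_inquiry_to_application
FROM applicant_pipeline_fact f
JOIN initial_inquiry_date inq
  ON f.initial_inquiry_date_key = inq.initial_inquiry_date_key
JOIN application_submitted_date sub
  ON f.application_submitted_date_key = sub.application_submitted_date_key
WHERE f.application_submitted_count = 1   -- skip rows still carrying the unknown-date surrogate
GROUP BY sub.academic_year;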
The applicant dimension contains many interesting attributes about prospective
students. Analysts are interested in slicing and dicing by applicant characteristics
such as geography, incoming credentials (grade point average, college admissions test
scores, advanced placement credits, and high school), gender, date of birth, ethnicity,
preliminary major, application source, and a multitude of others. Analyzing these char-
acteristics at various stages of the pipeline can help admissions personnel adjust their
strategies to encourage more (or fewer) students to proceed to the next mile marker.
The facts in the applicant pipeline fact table include a variety of counts that are
closely monitored by admissions personnel. If available, this table could include esti-
mated probabilities that the prospect will apply and subsequently enroll if accepted
to predict admission yields.
Alternative Applicant Pipeline Schemas
Accumulating snapshots are appropriate for short-lived processes that have a defined beginning and end, with standard intermediate milestones. This type of fact table enables you to see an updated status and ultimately final disposition of each applicant. However, because accumulating snapshot rows are updated, they do not preserve applicant counts and statuses at critical points in the admissions calendar, such as the early decision notification date. Given the close scrutiny of these numbers, analysts might also want to retain snapshots at several important cut-off dates.
Alternatively, you could build an admission transaction fact table with one row per
transaction per applicant for counting and period-to-period comparisons.
Research Grant Proposal Pipeline
The research proposal pipeline is another education-based example of an accumu-
lating snapshot. Faculty and administration are interested in viewing the lifecycle
of a grant proposal as it progresses through the pipeline from preliminary proposal
to grant approval and award receipt. This would support analysis of the number of
outstanding proposals in each stage of the pipeline by faculty, department, research
topic area, or research funding source. Likewise, you could see success rates by
various attributes. Having this information in a common repository would allow
it to be leveraged by a broader university population.
Factless Fact Tables
So far we’ve largely designed fact tables with very similar structures. Each fact table
typically has 5 to approximately 20 foreign key columns, followed by one to poten-
tially several dozen numeric, continuously valued, preferably additive facts. The
facts can be regarded as measurements taken at the intersection of the dimension
key values. From this perspective, the facts are the justification for the fact table,
and the key values are simply administrative structure to identify the facts.
There are, however, a number of business processes whose fact tables are simi-
lar to those we’ve been designing with one major distinction. There are no mea-
sured facts! We introduced factless fact tables while discussing promotion events
in Chapter 3: Retail Sales, as well as in Chapter 6 to describe sales rep/customer
assignments. There are numerous examples of factless events in higher education.
Admissions Events
You can envision a factless fact table to track each prospective student’s attendance
at an admission event, such as a high school visit, college fair, alumni interview or
campus overnight, as illustrated in Figure 13-3.
Admissions Event Attendance Fact
Admissions Event Date Key (FK)
Planned Enroll Term Key (FK)
Applicant Key (FK)
Applicant Status Key (FK)
Admissions Officer Key (FK)
Admission Event Key (FK)
Admissions Event Attendance Count (=1)

Admissions Event Date Dimension
Planned Enroll Term Dimension
Applicant Dimension
Application Status Dimension
Admissions Officer Dimension
Admission Event Dimension
Figure 13-3: Admission event attendance as a factless fact table.
Course Registrations
Similarly, you can track student course registrations by term using a factless fact
table. The grain would be one row for each registered course by student and term,
as illustrated in Figure 13-4.
Term Dimension
In this fact table, the data is at the term level rather than at the more typical cal-
endar day, week, or month granularity. The term dimension still should conform
to the calendar date dimension. In other words, each date in the daily calendar
dimension should identify the term (for example, Fall), term and academic year
(for example, Fall 2013), and academic year (for example, 2013-2014). The column
labels and values must be identical for the attributes common to both the calendar
date and term dimensions.
Student Dimension and Change Tracking
The student dimension is an expanded version of the applicant dimension discussed
in the first scenario. You still want to retain some information garnered from the
application process (for example, geography, credentials, and intended major) but
supplement it with on-campus information, such as part-time or full-time status,
residence, athletic involvement indicator, declared major, and class level status (for
example, sophomore).
Course Registration Event Fact
Term Key (FK)
Student Key (FK)
Course Key (FK)
Instructor Key (FK)
Course Registration Count (=1)

Term Dimension
Term Key (PK)
Term
Academic Year-Term
Academic Year

Student Dimension
Student Key (PK)
Student ID (NK)
...

Course Dimension
Course Key (PK)
Course Name
Course Department
Course Format
Course Credit Hours

Instructor Dimension
Instructor Key (PK)
Instructor Employee ID (NK)
Instructor Name
Instructor Address Attributes...
Instructor Type
Instructor Tenure Indicator
Instructor Original Hire Date
Instructor Years of Service
Figure 13-4: Course registration events as a factless fact table.
As discussed in Chapter 5: Procurement, you could imagine placing some of
these attributes in a type 4 mini-dimension because factions throughout the uni-
versity are interested in tracking changes to them, especially for declared major,
class level, and graduation attainment. People in administration and academia are
keenly interested in academic progress and retention rates by class, school, depart-
ment, and major. Alternatively, if there’s a strong demand to preserve the students’
profiles at the time of course registration, plus filter and group by the students' current characteristics, you should consider handling the student information as a slowly changing dimension type 7 with dual student dimension keys in the fact table, as also described in Chapter 5. The surrogate student key would link to a dimension table with type 2 attributes; the student's durable identifier would link
to a view of the complete student dimension containing only the current row for
each student.
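A minimal sketch of this type 7 arrangement follows, using hypothetical snake_case names; the fact table carries both the type 2 surrogate key and the durable student identifier, and a view exposes only the current dimension rows.

-- Fact table with dual student keys (type 7)
CREATE TABLE course_registration_fact (
  term_key                  INTEGER NOT NULL,
  student_key               INTEGER NOT NULL,  -- type 2 surrogate: profile as of registration
  student_durable_id        INTEGER NOT NULL,  -- durable identifier: today's profile via the view below
  course_key                INTEGER NOT NULL,
  instructor_key            INTEGER NOT NULL,
  course_registration_count INTEGER NOT NULL   -- always 1
);

-- Current-rows-only view over the complete student dimension
CREATE VIEW current_student_dim AS
SELECT *
FROM student_dim
WHERE current_row_indicator = 'Current';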
Artificial Count Metric
A fact table represents the robust set of many-to-many relationships among dimen-
sions; it records the collision of dimensions at a point in time and space. This course
registration fact table could be queried to answer a number of interesting questions regarding registration for the college's academic offerings, such as: Which students registered for which courses? How many declared engineering majors are taking an out-of-major finance course? How many students have registered for a given faculty member's courses during the last three years? How many students have registered
for more than one course from a given faculty member? The only peculiarity in
these examples is that you don't have a numeric fact tied to this registration data.
As such, analyses of this data will be based largely on counts.
NOTE Events are modeled as fact tables containing a series of keys, each
representing a participating dimension in the event. Event tables sometimes have
no variable measurement facts associated with them and hence are called factless
fact tables.
The SQL for performing counts in this factless fact is asymmetric because of
the absence of any facts. When counting the number of registrations for a faculty
member, any key can be used as the argument to the COUNT function. For example:
select faculty, count(term_key)... group by faculty
This gives the simple count of the number of student registrations by faculty,
subject to any constraints that may exist in the WHERE clause. An oddity of SQL is
that you can count any key and still get the same answer because you are counting
the number of keys that fly by the query, not their distinct values. You would need
to use a COUNT DISTINCT if you want to count the unique instances of a key rather
than the number of keys encountered.
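For example, counting rows versus distinct students gives two different answers to two different questions. The query below is a sketch using hypothetical snake_case versions of the Figure 13-4 names.

-- Registrations versus distinct students taught, by instructor
SELECT i.instructor_name,
       COUNT(*)                      AS registration_count,
       COUNT(DISTINCT f.student_key) AS distinct_student_count
FROM course_registration_fact f
JOIN instructor_dim i
  ON f.instructor_key = i.instructor_key
GROUP BY i.instructor_name;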
The inevitable confusion surrounding the SQL statement, although not a serious
semantic problem, causes some designers to create an artificial implied fact, perhaps
called course registration count (as opposed to “dummy”), that is always populated
by the value 1. Although this fact does not add any information to the fact table, it
makes the SQL more readable, such as:
select faculty, sum(registration_count)... group by faculty
At this point the table is no longer strictly factless, but the “1” is nothing more
than an artifact. The SQL will be a bit cleaner and more expressive with the regis-
tration count. Some BI query tools have an easier time constructing this query with
a few simple user gestures. More important, if you build a summarized aggregate
table above this fact table, you need a real column to roll up to meaningful aggre-
gate registration counts. And finally, if deploying to an OLAP cube, you typically
include an explicit count column (always equal to 1) for complex counts because
the dimension join keys are not explicitly revealed in a cube.
If a measurable fact does surface during the design, it can be added to the schema,
assuming it is consistent with the grain of student course registrations by term. For
example, you could add tuition revenue, earned credit hours, and grade scores to
this fact table, but then it’s no longer a factless fact table.
Multiple Course Instructors
If courses are taught by a single instructor, you can associate an instructor key to
the course registration events, as shown in Figure 13-4. However, if some courses
are co-taught, then the instructor is a dimension attribute that takes on multiple values for the
fact table’s declared grain. You have several options:
Alter the grain of the fact table to be one row per instructor per course reg-
istration per student per term. Although this would address the multiple
instructors associated with a course, it’s an unnatural granularity that would
be extremely prone to overstated registration count errors.
Add a bridge table with an instructor group key in either the fact table or as
an outrigger on the course dimension, as introduced in Chapter 8: Customer
Relationship Management. There would be one row in this table for each instruc-
tor who teaches courses on his own. In addition, there would be two rows for
each instructor team; these rows would associate the same group key with
individual instructor keys. The concatenation of the group key and instructor
key would uniquely identify each bridge table row. As described in Chapter 10:
Financial Services, you could assign a weighting factor to each row in the bridge
if the teaching workload allocation is clearly defined (see the bridge table sketch following this list). This approach would be susceptible to the potential overstatement issues surrounding the bridge table usage described in Chapter 10.
Concatenate the instructor names into a single, delimited attribute on the
course dimension, as discussed in Chapter 9: Human Resources Management.
This option enables users to easily label reports with a single dimension attri-
bute, but it would not support analysis of registration events by instructor
characteristics.
If one of the instructors is identified as the primary instructor, then her instructor key could be handled as a single foreign key in the fact table, joined to a dimension where the attributes were prefaced with "primary" for differentiation.
Course Registration Periodic Snapshots
The grain of the fact table illustrated in Figure 13-4 is one row for each regis-
tered course by student and term. Some users at the college or university might be
interested in periodic snapshots of the course registration events at key academic
calendar dates, such as preregistration, start of the term, course drop/add deadline,
and end of the term. In this case, the fact table’s grain would be one row for each
student’s registered courses for a term per snapshot date.
Facility Utilization
The second type of factless fact table deals with coverage, which can be illustrated
with a facilities management scenario. Universities invest a tremendous amount
of capital in their physical plant and facilities. It would be helpful to understand
which facilities were being used for what purpose during every hour of the day
during each term. For example, which facilities were used most heavily? What
was the average occupancy rate of the facilities as a function of time of day?
Does utilization drop off significantly on Fridays when no one wants to attend (or teach) classes?

Again, the factless fact table comes to the rescue. In this case you'd insert one
row in the fact table for each facility for standard hourly time blocks during each
day of the week during a term regardless of whether the facility is being used.
Figure 13-5 illustrates the schema.
Facility Utilization Fact
Term Key (FK)
Day of Week Key (FK)
Time-of-Day Hour Key (FK)
Facility Key (FK)
Owner Department Key (FK)
Assigned Department Key (FK)
Utilization Status Key (FK)
Facility Count (=1)

Time-of-Day Hour Dimension
Time-of-Day Hour Key (PK)
Time-of-Day Hour
Day Part Indicator

Facility Dimension
Facility Key (PK)
Facility Building Name - Room
Facility Building Name
Facility Building Address attributes...
Facility Type
Facility Floor
Facility Square Footage
Facility Capacity
Projector Indicator
Vent Indicator
...

Term Dimension
Day of Week Dimension
Department Dimension (2 views for roles)
Utilization Status Dimension
Figure 13-5: Facilities utilization as a coverage factless fact table.
The facility dimension would include all types of descriptive attributes about the
facility, such as the building, facility type (for example, classroom, lab, or office), square footage, capacity, and amenities (for example, whiteboard or built-in projector). The utilization status dimension would include a text descriptor with values of Available or Utilized. Meanwhile, multiple organizations may be involved in facilities utilization. For example, one organization might own the facility during a time block, but the same or a different organization might be assigned as the
facility user.
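Because the coverage table contains a row for every facility and time block whether or not the space was used, occupancy rates fall out of a simple average. The query below is a sketch with hypothetical names mirroring Figure 13-5.

-- Average occupancy rate by hour of day for one term
SELECT t.time_of_day_hour,
       AVG(CASE WHEN u.utilization_status = 'Utilized' THEN 1.0 ELSE 0.0 END) AS occupancy_rate
FROM facility_utilization_fact f
JOIN time_of_day_hour_dim t
  ON f.time_of_day_hour_key = t.time_of_day_hour_key
JOIN utilization_status_dim u
  ON f.utilization_status_key = u.utilization_status_key
JOIN term_dim tm
  ON f.term_key = tm.term_key
WHERE tm.academic_year_term = 'Fall 2013'   -- hypothetical label
GROUP BY t.time_of_day_hour
ORDER BY t.time_of_day_hour;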
Student Attendance
You can visualize a similar schema to track student attendance in a course. In this
case, the grain would be one row for each student who walks through the course’s
classroom door each day. This factless fact table would share a number of the same
dimensions discussed with registration events. The primary difference would be that the granularity is by calendar date in this schema rather than merely by term. This dimensional model, illustrated in Figure 13-6, allows business users to answer questions such as: Which courses were the most heavily attended? Which courses suffered the least attendance attrition over the term? Which students attended which courses? Which faculty member taught the most students?
Student Attendance Fact
Date Key (FK)
Student Key (FK)
Course Key (FK)
Instructor Key (FK)
Facility Key (FK)
Attendance Count

Date Dimension
Student Dimension
Course Dimension
Instructor Dimension
Facility Dimension
Figure 13-6: Student attendance fact table.
Explicit Rows for What Didn’t Happen
Perhaps people are interested in monitoring students who were registered for a
course but didn’t show up. In this example you can envision adding explicit rows to
the fact table for attendance events that didn’t occur. The fact table would no longer
be factless as there is an attendance metric equal to either 1 or 0.
Adding rows is viable in this scenario because the non-attendance events have the
same exact dimensionality as the attendance events. Likewise, the fact table won’t
grow at an alarming rate, presuming (or perhaps hoping) the no-shows are a small
percentage of the total students registered for a course. Although this approach is
reasonable in this scenario, creating rows for events that didn’t happen is ridiculous
in many other situations, such as adding rows to a customer's sales transactions for
promoted products that weren’t purchased by the customer.
What Didn’t Happen with Multidimensional OLAP
Multidimensional OLAP databases do an excellent job of helping users understand
what didn't happen. When the cube is constructed, multidimensional databases
handle the sparsity of the transaction data while minimizing the overhead burden
of storing explicit zeroes. As such, at least for fact cubes that are not too sparse, the
event and nonevent data is available for user analysis while reducing some of the
complexities just discussed in the relational star schema world.
More Educational Analytic Opportunities
Many of the business processes described in earlier chapters, such as procurement
and human resources, are obviously applicable to the university environment given
the desire to better monitor and manage costs. Research grants and alumni contri-
butions are key sources of revenue, in addition to the tuition revenue.
Research grant analysis is often a variation of financial analysis, as discussed in
Chapter 7: Accounting, but at a lower level of detail, much like a subledger. The grain
would include additional dimensions to further describe the research grant, such as
the corporate or governmental funding source, research topic, grant duration, and
faculty investigator. There is a strong need to better understand and manage the
budgeted and actual spending associated with each research project. The objective
is to optimize the spending so a surplus or deficit situation is avoided, and funds
are deployed where they will be most productive. Likewise, understanding research
spending rolled up by various dimensions is necessary to ensure proper institutional
control of such monies.
Better understanding the university’s alumni is much like better understanding
a customer base, as described in Chapter 8. Obviously, there are many interesting
characteristics that would be helpful in maintaining a relationship with your alumni,
such as geographic, demographic, employment, interests, and behavioral information,
in addition to the data you collected about them as students (for example, affiliations,
residential housing, school, major, length of time to graduate, and honors designa-
tions). Improved access to a broad range of attributes about the alumni population
would allow the institution to better target messages and allocate resources. In addi-
tion to alumni contributions, alumni relationships can be leveraged for potential
recruiting, job placement, and research opportunities. To this end, a robust CRM
operational system should track all the touch points with alumni to capture mean-
ingful data for the DW/BI analytic platform.
Summary
In this chapter we focused on two primary concepts. First, we looked at the accu-
mulating snapshot fact table to track application or research grant pipelines. Even
though the accumulating snapshot is used much less frequently than the more com-
mon transaction and periodic snapshot fact tables, it is very useful for tracking the
current status of a short-lived process with standard milestones. As we described,
accumulating snapshots are often complemented with transactional or periodic
snapshot tables.
Second, we explored several examples of factless fact tables. These fact tables
capture the relationship between dimensions in the case of an event or coverage,
but are unique in that no measurements are collected to serve as actual facts. We
also discussed the handling of situations in which you want to track events that didn't occur.
Healthcare
The healthcare industry is undergoing tremendous change as it seeks to improve patient outcomes while simultaneously improving operational efficiency. The challenges are plentiful as organizations attempt to integrate their
clinical and administrative information. Healthcare data presents several interesting
dimensional design patterns that we’ll explore in this chapter.
Chapter 14 discusses the following concepts:
Example bus matrix snippet for a healthcare organization
Accumulating snapshot fact table to handle the claims billing and payment
pipeline
Dimension role playing for multiple dates and physicians
Multivalued dimensions, such as patient diagnoses
Supertype and subtype handling of healthcare charges
Treatment of textual comments
Measurement type dimension for sparse, heterogeneous measurements
Handling of images with dimensional schemas
Facility/equipment inventory utilization as transactions and periodic snapshots
Healthcare Case Study and Bus Matrix
In the face of unprecedented consumer focus and governmental policy regulations,
coupled with internal pressures, healthcare organizations need to leverage informa-
tion more e ectively to impact both patient outcomes and operational e ciencies.
Healthcare organizations typically wrestle with many disparate systems to collect
their clinical, financial, and operational performance metrics. This information
needs to be better integrated to deliver more effective patient care, while concur-
rently managing costs and risks. Healthcare analysts want to better understand
which procedures deliver the best outcomes, while identifying opportunities to
impact resource utilization, including labor, facilities, and associated equipment
and supplies. Large healthcare consortiums with networks of physicians, clinics,
hospitals, pharmacies, and laboratories are focused on these requirements, espe-
cially as both the federal government and private payers are encouraging providers
to assume more responsibility for the quality and cost of their healthcare services.
Figure 14-1 illustrates a sample snippet of a healthcare organization's bus matrix.
Business process rows: Patient Encounter Workflow; Clinical Events; Billing/Revenue Events; Procedures; Physician Orders; Medications; Lab Test Results; Disease/Case Management Participation; Patient Reported Outcomes; Patient Satisfaction Surveys; Inpatient Facility Charges; Outpatient Professional Charges; Claims Billing; Claims Payments; Collections and Write-Offs; Operational Events; Bed Inventory Utilization; Facilities Utilization; Supply Procurement; Supply Utilization; Workforce Scheduling.
Conformed dimension columns: Date, Patient, Physician, Diagnosis, Payer, Employee, Facility, Procedure.
Figure 14-1: Subset of bus matrix rows for a healthcare consortium.
Traditionally, healthcare insurance payers have leveraged claims information to
better understand their risk, improve underwriting policies, and detect potential
fraudulent activity. Payers have historically been more sophisticated than health-
care provider organizations in leveraging data analytically, perhaps in part because
their prime data source, claims, was more reliably captured and structured than
providers' data. However, claims data is both a benefit and a curse for payers' analytic
efforts because it historically hasn't provided a robust, granular clinical picture.
Increasingly, healthcare payers are partnering with providers to leverage detailed
patient information to support more predictive analysis. In many ways, the needs
and objectives of the providers and payers are converging, especially with the push
for shared-risk delivery models.
Every patient’s episode of care with a healthcare organization generates mounds
of information. Patient-centric transactional data falls into two prime categories:
administrative and clinical. The claims billing data provides detail on a patient bill
from a physician’s o ce, clinic, hospital, or laboratory. The clinical medical record,
on the other hand, is more comprehensive and includes not only the services result-
ing in charges, but also the laboratory test results, prescriptions, physician’s notes
or orders, and sometimes outcomes.
The issues of conforming common dimensions remain exactly the same for
healthcare as in other industries. Obviously, the most important conformed dimen-
sion is the patient. In Chapter 8: Customer Relationship Management, we described
the need for a 360-degree view of customers. It’s easy to argue that a 360-degree
view of patients is even more critical given the stakes; adoption of patient electronic
medical record (EMR) and electronic health record (EHR) systems clearly focuses on
this objective.
Other dimensions that must be conformed include:
Date
Responsible party
Employer
Health plan
Payer (primary and secondary)
Physician
Procedure
Equipment
Lab test
Medication
Diagnosis
Facility (o ce, clinic, outpatient facility, and hospital)
In the healthcare arena, some of these dimensions are hard to conform, whereas
others are easier than they look at first glance. The patient dimension has historically
been challenging, at least in the United States, because of the lack of a reliable national
identity number and/or consistent patient identifier across facilities and physicians.
To further complicate matters, the Health Insurance Portability and Accountability Act
(HIPAA) includes strict privacy and security requirements to protect the confidential
nature of patient information. Operational process improvements, like electronic
medical records, are ensuring more consistent master patient identification.
The diagnosis and treatment dimensions are considerably more structured and
predictable than you might expect because the insurance industry and government
have mandated their content. For example, diagnosis and disease classifications fol-
low the International Classification of Diseases (ICD) standard for consistent reporting.
Similarly, the Healthcare Common Procedure Coding System (HCPCS) is based on the
American Medical Association’s Current Procedural Terminology (CPT) to describe
medical, surgical, and diagnostic services, along with supplies and devices. Dentists
use the Current Dental Terminology (CDT) code set, which is updated and distributed
by the American Dental Association.
Finally, beyond integrated patient-centric clinical and financial information,
healthcare organizations also want to analyze operational information regarding
the utilization of their workforce, facilities, and supplies. Much of the discussion
from earlier chapters about human resources, inventory management, and procure-
ment processes is also applicable to healthcare organizations.
Claims Billing and Payments
Imagine you work in the healthcare consortium's billing organization. You receive
the primary charges from the physicians and facilities, prepare bills for the respon-
sible payers, and track the progress of the claims payments received.
The dimensional model for the claims billing process must address a number of
business objectives. You want to analyze the billed dollar amounts by every avail-
able dimension, including patient, physician, facility, diagnosis, procedure, and
date. You want to see how these claims have been paid and what percentage of the
claims have not been collected. You want to see how long it takes to get paid, and
the current status of all unpaid claims.
As we discussed in Chapter 4: Inventory, whenever a source business process is consid-
ered for inclusion in the DW/BI system, there are three essential grain choices. Remember
the fact table’s granularity determines what constitutes a fact table row. In other words,
what is the measurement event being recorded?
The transaction grain is the most fundamental. In the healthcare billing example,
the transaction grain would include every billing transaction from the physicians
and facilities, as well as every claim payment transaction received. We’ll talk more
about these fact tables in a moment.
The periodic snapshot is the grain of choice for long-running time series, such
as bank accounts and insurance policies. However, the periodic snapshot doesn’t
do a good job of capturing the behavior of relatively short-lived processes, such as
orders or medical claims billing.
The accumulating snapshot grain is chosen to analyze the claims billing and pay-
ment workflow. A single fact table row represents a single line on a medical claim.
Furthermore, the row represents the accumulated history of the line item from the
moment of creation to the current state. When anything about the line changes, the row
is revisited and modified appropriately. From the point of view of the billing organiza-
tion, let’s assume the standard scenario of a claim includes:
Treatment date
Primary insurance billing date
Secondary insurance billing date
Responsible party billing date
Last primary insurance payment date
Last secondary insurance payment date
Last responsible party payment date
Zero balance date
These dates describe the normal claim workflow. An accumulating snapshot
does not attempt to fully describe unusual situations. Business users undoubt-
edly need to see all the details of messy claim payment scenarios because multiple
payments are sometimes received for a single line, or conversely, a single payment
sometimes applies to multiple claims. Companion transaction schemas inevitably
will be needed. In the meantime, the purpose of the accumulating snapshot grain
is to place every claim into a standard framework so that the analytic objectives
described earlier can be satisfied easily.
With a clear understanding that an individual fact table row represents the accu-
mulated history of a line item on a claim bill, you can identify the dimensions by
carefully listing everything known to be true in the context of this row. In this
hypothetical scenario, you know the patient, responsible party, physician, physi-
cian organization, procedure, facility, diagnosis, primary insurance organization,
secondary insurance organization, and master patient bill ID number, as shown in
Figure 14-2.
The interesting facts accumulated over the claim line’s history include the billed
amount, primary insurance paid amount, secondary insurance paid amount, respon-
sible party paid amount, total paid amount (calculated), amount sent to collections,
amount written off, amount remaining to be paid (calculated), length of stay, number
of days from billing to initial primary insurance, secondary insurance, and respon-
sible party payments, and finally, number of days to zero balance.
Claims Billing and Payment Workflow Fact
Treatment Date Key (FK)
Primary Insurance Billing Date Key (FK)
Secondary Insurance Billing Date Key (FK)
Responsible Party Billing Date Key (FK)
Last Primary Insurance Payment Date Key (FK)
Last Secondary Insurance Payment Date Key (FK)
Last Responsible Party Payment Date Key (FK)
Zero Balance Date Key (FK)
Patient Key (FK)
Physician Key (FK)
Physician Organization Key (FK)
Procedure Key (FK)
Facility Key (FK)
Primary Diagnosis Key (FK)
Primary Insurance Organization Key (FK)
Secondary Insurance Organization Key (FK)
Responsible Party Key (FK)
Employer Key (FK)
Master Bill ID (DD)
Billed Amount
Primary Insurance Paid Amount
Secondary Insurance Paid Amount
Responsible Party Paid Amount
Total Paid Amount
Sent to Collections Amount
Written Off Amount
Unpaid Balance Amount
Length of Stay
Bill to Initial Primary Insurance Payment Lag
Bill to Initial Secondary Insurance Payment Lag
Bill to Initial Responsible Party Payment Lag
Bill to Zero Balance Lag
Patient Dimension
Procedure Dimension
Primary Diagnosis Dimension
Responsible Party Dimension
Physician Dimension
Physician Organization Dimension
Facility Dimension
Insurance Organization Dimension (views for 2 roles)
Employer Dimension
Date Dimension (views for 8 roles)
Figure 14-2: Accumulating snapshot fact table for medical claim billing and payment
workflow.
A row is initially created in this fact table when the charge transactions are
received from the physicians or facilities and the initial bills are generated. On a
given bill, perhaps the primary insurance company is billed, but the secondary
insurance and responsible party are not billed, pending a response from the pri-
mary insurance company. For a period of time after the row is first entered into
the fact table, the last seven dates are not applicable. Because the surrogate date
keys in the fact table must not be null, they will point to a date dimension row
reserved for a To Be Determined date.
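For illustration, a minimal SQL sketch of this convention follows; the table names, column names, and the reserved key value are hypothetical stand-ins rather than the exact structures of Figure 14-2, and the remaining date keys are omitted for brevity.

-- Reserved date dimension row for dates that are not yet known
INSERT INTO date_dim (date_key, full_date, date_description)
VALUES (99999999, NULL, 'To Be Determined');

-- When a claim line row is first created, only the treatment date is known;
-- the other date keys default to the reserved key rather than NULL
INSERT INTO claims_billing_payment_fact
  (treatment_date_key, primary_insurance_billing_date_key,
   secondary_insurance_billing_date_key, responsible_party_billing_date_key,
   last_primary_insurance_payment_date_key, zero_balance_date_key,
   patient_key, master_bill_id, billed_amount)
VALUES
  (20130315, 99999999, 99999999, 99999999, 99999999, 99999999,
   12345, 'A1001', 1250.00);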
In the weeks after creation of the row, some payments are received. Bills are then
sent to the secondary insurance company and responsible party. Each time these
events take place, the same fact table row is revisited, and the appropriate keys and
facts are destructively updated. This destructive updating poses some challenges
for the database administrator. If most of the accumulating rows stabilize and stop
changing within a given timeframe, a physical reorganization of the database at
that time can recover disk storage and improve performance. If the fact table is
partitioned on the treatment date key, the physical clustering or partitioning prob-
ably will be well preserved throughout these changes because the treatment date
is not revisited and changed.
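A destructive update of this kind might look like the following sketch; the column names and the claim line identifier are hypothetical, since Figure 14-2 identifies the claim only by its degenerate master bill ID.

-- Revisit the existing accumulating snapshot row when the first primary
-- insurance payment arrives, 35 days after billing
UPDATE claims_billing_payment_fact
SET    last_primary_insurance_payment_date_key       = 20130419,
       primary_insurance_paid_amount                 = primary_insurance_paid_amount + 900.00,
       bill_to_initial_primary_insurance_payment_lag = 35
WHERE  master_bill_id = 'A1001'
  AND  claim_line_number = 1;  -- hypothetical line identifier within the bill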
Date Dimension Role Playing
Accumulating snapshot fact tables always involve multiple date stamps, like the eight
foreign keys pointing to the date dimension in Figure 14-2. The eight date foreign
keys should not join to a single instance of the date dimension table. Instead, create
eight views on the single underlying date dimension table, and join the fact table
separately to these eight views, as if they were eight independent date dimension
tables. The eight view de nitions should cosmetically relabel the column names to
be distinguishable, so BI tools accessing the views present understandable column
names to the business users.
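For example, two of the eight role-playing views might be defined as follows, assuming illustrative column names (date_key, full_date, calendar_month, calendar_year) in the underlying date dimension; the remaining six views follow the same pattern.

CREATE VIEW treatment_date AS
SELECT date_key       AS treatment_date_key,
       full_date      AS treatment_date,
       calendar_month AS treatment_calendar_month,
       calendar_year  AS treatment_calendar_year
FROM   date_dim;

CREATE VIEW zero_balance_date AS
SELECT date_key       AS zero_balance_date_key,
       full_date      AS zero_balance_date,
       calendar_month AS zero_balance_calendar_month,
       calendar_year  AS zero_balance_calendar_year
FROM   date_dim;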
Although the role-playing behavior of the date dimension is a common charac-
teristic of accumulating snapshot fact tables, other dimensions in Figure 14-2 play
roles in similar ways, such as the payer dimension. In the section “Supertypes and
Subtypes for Charges,” the physician dimension will play multiple roles depending
on whether the physician is the referring physician, attending physician, or working
in a consulting or assisting capacity.
Multivalued Diagnoses
Normally the dimensions surrounding a fact table take on a single value in the
context of the fact event. However, there are situations where multivaluedness is
natural and unavoidable. The diagnosis dimension in healthcare fact tables is a
good example. At the moment of a procedure or lab test, the patient has one or more
diagnoses. Electronic medical record applications facilitate the physician’s selection
of multiple diagnoses well beyond the historical practice of providing the minimal
coding needed for reimbursement; the result is a richer, more complete picture of
the severity of the patient’s medical condition. There is strong analytic incentive to
retain the multivalued diagnoses, along with the other financial performance data,
especially as organizations do more comparative utilization and cost benchmarking.
If there were always a maximum of three diagnoses, for instance, you might be
tempted to create three diagnosis foreign keys in the fact table with correspond-
ing dimensions, almost as if they were roles. However, diagnoses don’t behave like
independent roles. And unfortunately, there are often more than three diagnoses,
especially for hospitalized elderly patients who may present 20 simultaneous diag-
noses! Diagnoses don't fit into well-defined roles other than potentially the primary
admitting and discharging diagnoses. Finally, a design with multiple diagnosis
foreign keys would make for very inefficient BI applications because the query
doesn’t know which dimensional slot to constrain for a particular diagnosis.
The design shown in Figure 14-3 handles the open-ended nature of multiple diag-
noses. The diagnosis foreign key in the fact table is replaced with a diagnosis group
key. This diagnosis group key is connected by a many-to-many join to a diagnosis
group bridge table, which contains a separate row for each individual diagnosis in
a particular group.
Diagnosis Dimension
Diagnosis Key (PK)
Diagnosis Code (NK)
Diagnosis Description
Diagnosis Section Code
Diagnosis Section Description
Diagnosis Category Code
Diagnosis Category Description
More FKs ...
Diagnosis Group Key (FK)
Master Bill ID (DD)
Facts ...
Diagnosis Group Key (FK)
Diagnosis Key (FK)
Claim Billing Line Item Fact
Diagnosis Group Bridge
Figure 14-3: Bridge table to handle multivalued diagnoses.
If a patient has three diagnoses, he is assigned a diagnosis group with three cor-
responding rows in the bridge table. In Chapter 10: Financial Services, we described
the use of a weighting factor on each bridge table row to allocate the fact table’s
metrics accordingly. However, in the case of multiple patient diagnoses, it’s virtu-
ally impossible to weight their impact on a patient’s treatment or bill, beyond the
potential determination of a primary diagnosis. Without a realistic way of assigning
weighting factors, the analysis of diagnosis codes must largely focus on impact ques-
tions like “What is the total billed amount for procedures involving the diagnosis of
congestive heart failure?” Most healthcare analysts understand that impact analysis may
result in overcounting as the same metrics are associated with multiple diagnoses.
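As a sketch of such an impact query, using hypothetical snake_case renderings of the tables in Figure 14-3:

-- Total billed amount for claim lines involving a congestive heart failure
-- diagnosis, reached through the diagnosis group bridge
SELECT SUM(f.billed_amount) AS impact_billed_amount
FROM   claim_billing_line_item_fact f
       JOIN diagnosis_group_bridge b
         ON b.diagnosis_group_key = f.diagnosis_group_key
       JOIN diagnosis_dim d
         ON d.diagnosis_key = b.diagnosis_key
WHERE  d.diagnosis_description = 'Congestive heart failure';

If a single claim line carried more than one diagnosis satisfying the filter, its billed amount would be summed once per matching bridge row, which is precisely the overcounting behavior just described.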
NOTE Weighting factors in multivalued bridge tables provide an elegant way
to prorate numeric facts to produce correctly weighted reports. However, these
weighting factors are by no means required in a dimensional design. If there is no
agreement or enthusiasm within the business community for the weighting factors,
they should be left out. Also, in a schema with more than one multivalued dimen-
sion, it is not worth trying to decide how multiple weighting factors would interact.
If the many-to-many join in Figure 14-3 causes problems for a modeling tool that
insists on proper foreign-key-to-primary-key relationships, the equivalent design
of Figure 14-4 can be used. In this case an extra table whose primary key is a diag-
nosis group is inserted between the fact and bridge tables. There is likely no new
information in this extra table, unless there were labels for a cluster of diagnoses,
such as the Kimball Syndrome, but now both the fact table and bridge table have
conventional many-to-one joins in all directions.
Diagnosis Dimension
Diagnosis Key (PK)
Diagnosis Code (NK)
Diagnosis Description
Diagnosis Section Code
Diagnosis Section Description
Diagnosis Category Code
Diagnosis Category Description
Foreign Keys ...
Diagnosis Group Key (FK)
Master Bill ID (DD)
Facts ...
Diagnosis Group Key (FK)
Diagnosis Key (FK)
Claim Billing Line Item Fact
Diagnosis Group Bridge
Diagnosis Group Key (PK)
Diagnosis Group Dimension
Figure 14-4: Diagnosis group dimension to create a primary key relationship.
If a unique diagnosis group is created for every patient encounter, the number
of rows could become astronomical and many of the groups would be identical.
Probably a better approach is to have a portfolio of diagnosis groups that are repeat-
edly used. Each set of diagnoses would be looked up in the master diagnosis group
table during the ETL. If the existing group is found, it is used; if not found, a new
diagnosis group is created. Chapter 19: ETL Subsystems and Techniques provides
guidance for creating and administering bridge tables.
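One way to implement this lookup, sketched below, assumes each diagnosis group row carries a deterministic signature column (the sorted, delimited list of its diagnosis codes); that column and the specific ICD codes are illustrative additions, not part of the figures in this chapter.

-- 1. Look for an existing group matching the incoming set of diagnosis codes
SELECT diagnosis_group_key
FROM   diagnosis_group_dim
WHERE  diagnosis_group_signature = 'E11.9|I50.9|N18.3';

-- 2. If no row is returned, assign the next group key, then create the group
--    and its bridge rows
INSERT INTO diagnosis_group_dim (diagnosis_group_key, diagnosis_group_signature)
VALUES (4872, 'E11.9|I50.9|N18.3');

INSERT INTO diagnosis_group_bridge (diagnosis_group_key, diagnosis_key)
SELECT 4872, diagnosis_key
FROM   diagnosis_dim
WHERE  diagnosis_code IN ('E11.9', 'I50.9', 'N18.3');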
In an inpatient hospital stay scenario, the diagnosis group may be unique to each
patient if it evolves over time during the patient’s stay. In this case you would supple-
ment the bridge table with two date stamps to capture begin and end dates. Although
the twin date stamps complicate updates to the diagnosis group bridge table, they
are useful for change tracking, as described more fully in Chapter 7: Accounting.
Supertypes and Subtypes for Charges
We've described a design for billed healthcare treatments to cover both inpatient and
outpatient claims. In reality, healthcare charges resemble the supertype and subtype
pattern described in Chapter 10. Facility charges for inpatient hospital stays differ
from professional charges for outpatient treatments in clinics and doctor offices.
If you were focused exclusively on hospital stays, it would be reasonable to
tweak the Figure 14-2 dimensional structure to incorporate more hospital-specific
information. Figure 14-5 shows a revised set of dimensions specialized for hospital
stays, with the new dimensions bolded.
Treatment Date Key (FK)
Primary Insurance Billing Date Key (FK)
Secondary Insurance Billing Date Key (FK)
Responsible Party Billing Date Key (FK)
Last Primary Insurance Payment Date Key (FK)
Last Secondary Insurance Payment Date Key (FK)
Last Responsible Party Payment Date Key (FK)
Zero Balance Date Key (FK)
Patient Key (FK)
Admitting Physician Key (FK)
Admitting Physician Organization Key (FK)
Attending Physician Key (FK)
Attending Physician Organization Key (FK)
Procedure Key (FK)
Facility Key (FK)
Admitting Diagnosis Group Key (FK)
Discharge Diagnosis Group Key (FK)
Primary Insurance Organization Key (FK)
Secondary Insurance Organization Key (FK)
Responsible Party Key (FK)
Employer Key (FK)
Master Bill ID (DD)
Facts...
Inpatient Hospital Claim Billing and Payment Workflow Fact
Figure 14-5: Accumulating snapshot for hospital stay charges.
Referring to Figure 14-5, you can see two roles for the physician: admitting physi-
cian and attending physician. The figure shows physician organizations for both roles
because physicians may represent different organizations in a hospital setting. With
more complex surgical events, such as a heart transplant operation, whole teams of
specialists and assistants are assembled. In this case, you could include a key in the
fact table for the primary responsible physician; the other physicians and medical
staff would be linked to the fact row via a group key to a multivalued bridge table.
You also have two multivalued diagnosis dimensions on each fact table row. The
admitting diagnosis group is determined at the beginning of the hospital stay and
should be the same for every treatment row that is part of the same hospital stay.
The discharge diagnosis group is not known until the patient is discharged.
Electronic Medical Records
Many healthcare organizations are moving from paper-based processes to elec-
tronic medical records. In the United States, federally mandated quality goals to
support improved population health management may be achievable only with
their adoption. Healthcare providers are aggressively implementing electronic
health record systems; the movement is significantly impacting healthcare DW
/BI initiatives.
Electronic medical records can present challenges for data warehouse environ-
ments because of their extreme variability and potentially extreme volumes. Patients’
medical record data comes in many different forms, ranging from numeric data to
freeform text comments entered by a healthcare professional to images and photo-
graphs. We’ll further discuss unstructured data in Chapter 21: Big Data Analytics;
electronic medical and/or health records may become a classic use case for big data.
One thing is certain. The amount and variability of electronic data in the healthcare
industry will continue to grow.
Measure Type Dimension for Sparse Facts
As designers, we may be tempted to strive for a more standardized framework that could
be extended to handle data variability. For example, you could potentially handle the
variability of lab test results with a measurement type dimension describing what
the fact row means, or in other words, what the generic fact represents. The unit
of measure for a given numeric entry is found in the associated measurement type
dimension row, along with any additivity restrictions, as shown in Figure 14-6.
Lab Test Measurement Type Dimension
Lab Test Measurement Type Key (PK)
Lab Test Measurement Type Description
Lab Test Measurement Type Unit of Measure
Order Date Key (FK)
Test Date Key (FK)
Patient Key (FK)
Physican Key (FK)
Lab Test Key (FK)
Lab Test Measurement Type Key (FK)
Observed Test Result Value
Lab Test Result Facts
Figure 14-6: Lab test observations with measurement type dimension.
This approach is superbly flexible; you can add new measurement types simply by
adding new rows in the measurement type dimension, not by altering the structure
of the fact table. This approach also eliminates the nulls in the classic positional fact
table design because a row exists only if the measurement exists. However, there
are trade-o s. Using a measurement type dimension may generate lots of new fact
table rows because the grain is “one row per measurement per event” rather than the
more typical “one row per event.” If a lab test results in 10 numeric measurements,
there are now 10 rows in the fact table rather than a single row in the classic design.
For extremely sparse situations, such as clinical laboratory or manufacturing test
environments, this is a reasonable compromise. However, as the density of the facts
grows, you end up spewing out too many fact rows. At this point you no longer have
sparse facts and should return to the classic fact table design with fixed columns.
Moreover, this measurement type approach may complicate BI data access appli-
cations. In the relational star schema, combining two numbers that were captured
as part of a single event is more difficult with this approach because now you must
fetch two rows from the fact table. SQL likes to perform arithmetic functions within
a row, not across rows. In addition, you must be careful not to mix incompatible
amounts in a calculation because all the numeric measures reside in a single amount
column. It’s worth noting that multidimensional OLAP cubes are more tolerant of
performing calculations across measurement types.
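For example, computing a ratio of two measurements captured in the same lab test event requires pivoting across fact rows, as in the following sketch; the table and column names are snake_case renderings of Figure 14-6, and the specific measurement type descriptions are invented for illustration.

SELECT f.patient_key,
       f.test_date_key,
       MAX(CASE WHEN m.lab_test_measurement_type_description = 'LDL cholesterol'
                THEN f.observed_test_result_value END)
     / MAX(CASE WHEN m.lab_test_measurement_type_description = 'HDL cholesterol'
                THEN f.observed_test_result_value END) AS ldl_to_hdl_ratio
FROM   lab_test_result_facts f
       JOIN lab_test_measurement_type_dimension m
         ON m.lab_test_measurement_type_key = f.lab_test_measurement_type_key
GROUP BY f.patient_key, f.test_date_key;

In the classic positional design, the same ratio would be a simple expression between two columns of a single fact row.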
Freeform Text Comments
Freeform text comments, such as clinical notes, are sometimes associated with fact
table events. Although text comments are not very analytically potent unless they’re
parsed into well-behaved dimension attributes, business users are often unwilling
to part with them given the embedded nuggets of information.
Textual comments should not be stored in a fact table directly because they waste
space and rarely participate in queries. Some designers think it's permissible to store
textual fields in the fact table, as long as they're referred to as degenerate dimensions.
Degenerate dimensions are most typically used for operational transaction control
numbers and identifiers; it's not an acceptable approach or pattern for contending
with bulky text fields. Storing freeform comments in the fact table adds clutter that
may negatively impact the performance of analysts’ more typical quantitative queries.
The unbounded text comments should either be stored in a separate comments
dimension or treated as attributes in a transaction event dimension. A key consider-
ation when evaluating these two approaches is the text field's cardinality. If there's
nearly a unique comment for every fact table event, storing the textual field in a trans-
action dimension makes the most sense. However, in many cases, No Comment is
associated with numerous fact rows. Because the number of unique text comments in
this scenario is much smaller than the number of unique transactions, it would make
more sense to store the textual data in a comments dimension with an associated
foreign key in the fact table. In either case, queries involving both the text comments
and fact metrics will perform relatively poorly given the need to resolve joins between
two voluminous tables. Often business users want to drill into text comments for
further investigation after highly selective fact table query filters have been applied.
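Such a drill-down might be expressed as in this sketch, assuming a comments dimension keyed into the claim fact table and hypothetical filter values:

-- Apply highly selective fact filters first, then expose the comment text
SELECT f.billed_amount,
       c.comment_text
FROM   claim_billing_line_item_fact f
       JOIN comment_dim c
         ON c.comment_key = f.comment_key
WHERE  f.treatment_date_key BETWEEN 20130101 AND 20130107
  AND  f.facility_key = 17
  AND  c.comment_text <> 'No Comment';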
Images
Sometimes the data captured in a patient’s electronic medical record is an image,
in addition to either quantitative numbers or qualitative notes. There are trade-offs
between capturing a JPEG filename in the fact table to refer to an associated image
versus embedding the image as a blob directly in the database. The advantage of
using a JPEG filename is that other image creation, viewing, and editing programs
can freely access the image. The disadvantage is that a separate database of graphic
files must be maintained in synchrony with the fact table.
Facility/Equipment Inventory Utilization
In addition to financial and clinical data, healthcare organizations are also keenly
interested in more operationally oriented metrics, such as utilization and availability
of their assets, whether referring to patient beds or surgical operating theatres. In
Chapter 4, we discussed product inventory data as transaction events as well as
periodic snapshots. Facility or equipment inventories in a healthcare organization
can be handled similarly.
For example, you can envision a bed utilization periodic snapshot with every bed’s
status at regularly recurring points in time, perhaps at midnight, the start of every
shift, or even more frequently throughout the day. In addition to a snapshot date and
potentially time-of-day, this factless fact table would include foreign keys to identify
the patient, attending physician, and perhaps an assigned nurse on duty.
Conversely, you can imagine treating the bed inventory data as a transaction
fact table with one row per movement into and out of a hospital bed. This may be a
simplistic transaction fact table with transaction date and time dimension foreign
keys, along with dimensions to describe the type of movement, such as filled or
vacated. In the case of operating room utilization and availability, you can envision
a lengthier list of statuses, such as pre-operation, post-operation, or downtime,
along with time durations.
If the inventory changes are not terribly volatile, such as the beds in a rehabilita-
tion or eldercare inpatient environment, you should consider a timespan fact table,
as discussed in Chapter 8, with row effective and expiration dates and times to
represent the various states of a bed over a period of time.
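A timespan fact table of this kind answers point-in-time questions with simple comparisons on the effective and expiration date/times, as in this sketch with assumed table and column names:

-- Which beds were occupied, and by whom, at midnight on June 30?
SELECT f.facility_key,
       f.bed_key,
       f.patient_key
FROM   bed_state_timespan_fact f
WHERE  f.bed_state = 'Occupied'
  AND  f.row_effective_datetime  <= TIMESTAMP '2013-06-30 00:00:00'
  AND  f.row_expiration_datetime >  TIMESTAMP '2013-06-30 00:00:00';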
Dealing with Retroactive Changes
As DW/BI practitioners, we have well-developed techniques for accurately capturing
the historical flow of data from our enterprise's source applications. Numeric mea-
surements go into fact tables, which are surrounded with contemporary descriptions
of what you know is true at the time of the measurements, packaged as dimension
tables. The descriptions of patient, physician, facility, and payer evolve as slowly
changing dimensions whenever these entities change their descriptions.
However, in the healthcare industry, especially with legacy operational systems,
you often need to contend with late arriving data that should have been loaded into
the data warehouse weeks or months ago. For example, you might receive data
regarding patient procedures that occurred several weeks ago, or updates to patient
profiles that were back-dated as effective several months ago. The more delayed the
incoming records are, the more challenging the DW/BI system’s ETL processing
becomes. We’ll discuss these late arriving fact and dimension scenarios in Chapter
19. Unfortunately, these patterns are common in healthcare DW/BI environments;
in fact, they may be the dominant modes of processing rather than specialized
techniques for outlier cases. Eventually, more effective source data capture systems
should reduce the frequency of these late arriving data anomalies.
Summary
Healthcare provides a wealth of dimensional design examples. In this chapter, the
enterprise data warehouse bus matrix illustrated the critical linkages between a
healthcare organization’s administrative and clinical data. We used an accumulating
snapshot grain fact table with role-playing date dimensions for the healthcare claim
billing and payment pipeline. We also saw role playing used for the physician and
payer dimensions in other fact tables of this chapter.
Healthcare schemas are littered with multivalued dimensions, especially the
diagnosis dimension. Complex surgical events might also use multivalued bridge
tables to represent the teams of involved physicians and other sta members. The
bridge tables used with healthcare data seldom contain weighting factors, as dis-
cussed in earlier chapters, because it is extremely difficult to establish weighting
business rules, beyond the designation of a “primary” relationship.
We discussed medical records and test results, suggesting a measurement type
dimension to organize sparse, heterogeneous measurements into a single, uniform
framework. We also discussed the handling of text comments and linked images.
Transaction and periodic snapshot fact tables were used to represent facility or
equipment inventory utilization and availability. In closing, we touched upon ret-
roactive fact and dimension changes that are often all too common with healthcare
performance data.
Electronic Commerce
A web-intensive business's clickstream data records the gestures of every web
visitor. In its most elemental form, the clickstream is every page event recorded
by each of the company’s web servers. The clickstream contains a number of new
dimensions, such as page, session, and referrer, which are not found in other data
sources. The clickstream is a torrent of data; it can be difficult and exasperating for
DW/BI professionals. Does it connect to the rest of the DW/BI system? Can its dimen-
sions and facts be conformed in the enterprise data warehouse bus architecture?
We start this chapter by describing the raw clickstream data source and designing
its relevant dimensional models. We discuss the impact of Google Analytics, which
can be thought of as an external data warehouse delivering information about your
website. We then integrate clickstream data into a larger matrix of more conven-
tional processes for a web retailer, and argue that the profitability of the web sales
channel can be measured if you allocate the right costs back to the individual sales.
Chapter 15 discusses the following concepts:
Clickstream data and its unique dimensionality
Role of external services such as Google Analytics
Integrating clickstream data with the other business processes on the bus
matrix
Assembling a complete view of profitability for a web enterprise
Clickstream Source Data
The clickstream is not just another data source that is extracted, cleaned, and
dumped into the DW/BI environment. The clickstream is an evolving collection of
data sources. There are a number of server log file formats for capturing clickstream
data. These log file formats have optional data components that, if used, can be very
helpful in identifying visitors, sessions, and the true meaning of behavior.
Because of the distributed nature of the web, clickstream data often is collected
simultaneously by di erent physical servers, even when the visitor thinks they are
interacting with a single website. Even if the log fi les collected by these separate
servers are compatible, a very interesting problem arises in synchronizing the log
les after the fact. Remember that a busy web server may be processing hundreds
of page events per second. It is unlikely the clocks on separate servers will be in
synchrony to one-hundredth of a second.
You also obtain clickstream data from different parties. Besides your own log
files, you may get clickstream data from referring partners or from internet service
providers (ISPs). Another important form of clickstream data is the search specifica-
tion given to a search engine that then directs the visitor to the website.
Finally, if you are an ISP providing web access to directly connected customers,
you have a unique perspective because you see every click of your captive custom-
ers, which may allow more powerful and invasive analyses of the customers' sessions.
The most basic form of clickstream data from a normal website is stateless. That
is, the log shows an isolated page retrieval event but does not provide a clear tie to
other page events elsewhere in the log. Without some kind of contextual help, it is
di cult or impossible to reliably identify a complete visitor session.
The other big frustration with basic clickstream data is the anonymity of the
session. Unless visitors agree to reveal their identity in some way, you often cannot
be sure who they are, or if you have ever seen them before. In certain situations,
you may not distinguish the clicks of two visitors who are simultaneously brows-
ing the website.
Clickstream Data Challenges
Clickstream data contains many ambiguities. Identifying visitor origins, visitor
sessions, and visitor identities is something of an interpretive art. Browser caches
and proxy servers make these identifications more challenging.
Identifying the Visitor Origin
If you are very lucky, your site is the default home page for the visitor’s browser.
Every time he opens his browser, your home page is the fi rst thing he sees. This is
pretty unlikely unless you are the webmaster for a portal site or an intranet home
page, but many sites have buttons which, when clicked, prompt visitors to set their
URL as the browser’s home page. Unfortunately there is no easy way to determine
from a log whether your site is set as a browser’s home page.
A visitor may be directed to your site from a search at a portal such as Yahoo! or
Google. Such referrals can come either from the portal’s index, for which you may
have paid a placement fee, or from a word or content search.
For some websites, the most common source of visitors is from a browser book-
mark. For this to happen, the visitor must have previously bookmarked your site,
and this can occur only after the site’s interest and trust levels cross the visitor’s
bookmark threshold.
Finally, your site may be reached as a result of a clickthrough: a deliberate click
on a text or graphical link from another site. This may be a paid-for referral via a
banner ad, or a free referral from an individual or cooperating site. In the case of
clickthroughs, the referring site will almost always be identifiable as a field in the
web event record. Capturing this crucial clickstream data is important to verify the
e cacy of marketing programs. It also provides crucial data for auditing invoices
you may receive from clickthrough advertising charges.
Identifying the Session
Most web-centric analyses require every visitor session (visit) to have its own unique
identity tag, similar to a supermarket receipt number. This is the session ID. Records
for every individual visitor action in a session, whether they are derived from the
clickstream or an application interaction, must contain this tag. But keep in mind
that the operational application, such as an order entry system, generates this session
ID, not the web server.
The basic protocol for the web, Hypertext Transfer Protocol (HTTP), is stateless;
that is, it lacks the concept of a session. There are no intrinsic login or logout actions
built into the HTTP protocol, so session identity must be established in some other
way. There are several ways to do this:
1. In many cases, the individual hits comprising a session can be consolidated by
collating time-contiguous log entries from the same host (IP address). If the
log contains a number of entries with the same host ID in a short period of
time (for example, one hour), you can reasonably assume the entries are for
the same session. This method breaks down for websites with large numbers
of visitors because dynamically assigned IP addresses may be reused immedi-
ately by di erent visitors over a brief time period. Also, di erent IP addresses
may be used within the same session for the same visitor. This approach also
presents problems when dealing with browsers that are behind some firewalls.
Notwithstanding these problems, many commercial log analysis products use
this method of session tracking, and it requires no cookies or special web
server features.
2. Another much more satisfactory method is to let the web server place a
session-level cookie into the visitor’s web browser. This cookie will last as
long as the browser is open and in general won’t be available in subsequent
Chapter 15
356
browser sessions. The cookie value can serve as a temporary session ID not
only to the browser, but also to any application that requests the session
cookie from the browser. But using a transient cookie has the disadvantage
that you can’t tell when the visitor returns to the site at a later time in a new
session.
3. HTTP’s secure sockets layer (SSL) o ers an opportunity to track a visitor
session because it may include a login action by the visitor and the exchange
of encryption keys. The downside to using this method is that to track the
session, the entire information exchange needs to be in high-overhead SSL,
and the visitor may be put off by security advisories that can pop up using
certain browsers. Also, each host must have its own unique security certificate.
4. If page generation is dynamic, you can try to maintain visitor state by plac-
ing a session ID in a hidden field of each page returned to the visitor. This
session ID can be returned to the web server as a query string appended to
a subsequent URL. This method of session tracking requires a great deal of
control over the website’s page generation methods to ensure the thread of
a session ID is not broken. If the visitor clicks links that don’t support this
session ID ping-pong, a single session may appear to be multiple sessions.
This approach also breaks down if multiple vendors supply content in a single
session unless those vendors are closely collaborating.
5. Finally, the website may establish a persistent cookie in the visitor’s machine
that is not deleted by the browser when the session ends. Of course, it’s pos-
sible the visitor will have his browser set to refuse cookies, or may manually
clean out his cookie file, so there is no absolute guarantee that even a per-
sistent cookie will survive. Although any given cookie can be read only by
the website that caused it to be created, certain groups of websites can agree
to store a common ID tag that would let these sites combine their separate
notions of a visitor session into a “super session.”
In summary, the most reliable method of session tracking from web server log
records is obtained by setting a persistent cookie in the visitor’s browser. Less reli-
able, but good results can be obtained by setting a session level and a nonpersistent
cookie and by associating time-contiguous log entries from the same host. The latter
method requires a robust algorithm in the log postprocessor to ensure satisfactory
results and to decide when not to take the results seriously.
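As an illustration of the time-contiguous collation approach, the following sketch derives session numbers from a raw web log staging table using a 30-minute inactivity gap; the staging table and column names are assumptions, and a production postprocessor would still need the safeguards described above.

-- A new session starts whenever more than 30 minutes elapse between
-- successive hits from the same host (IP address)
SELECT host_ip,
       event_datetime,
       SUM(new_session_flag) OVER (PARTITION BY host_ip
                                   ORDER BY event_datetime
                                   ROWS UNBOUNDED PRECEDING) AS derived_session_number
FROM  (SELECT host_ip,
              event_datetime,
              CASE WHEN LAG(event_datetime) OVER (PARTITION BY host_ip
                                                  ORDER BY event_datetime) IS NULL
                     OR event_datetime > LAG(event_datetime)
                                           OVER (PARTITION BY host_ip
                                                 ORDER BY event_datetime)
                                         + INTERVAL '30' MINUTE
                   THEN 1 ELSE 0 END AS new_session_flag
       FROM page_event_log) flagged;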
Identifying the Visitor
Identifying a specific visitor who logs into your site presents some of the most
challenging problems facing a site designer, webmaster, or manager of the web
analytics group.
Web visitors want to be anonymous. They may have no reason to trust you,
the internet, or their computer with personal identification or credit card
information.
If you request visitors’ identity, they may not provide accurate information.
You can’t be sure which family member is visiting your site. If you obtain
an identity by association, for instance from a persistent cookie left during a
previous visit, the identification is only for the computer, not for the specific
visitor. Any family member or company employee may have been using that
particular computer at that moment in time.
You can’t assume an individual is always at the same computer. Server-
provided cookies identify a computer, not an individual. If someone accesses
the same website from an office computer, home computer, and mobile device,
a different website cookie is probably put into each machine.
Clickstream Dimensional Models
Before designing clickstream dimensional models, let’s consider all the dimensions
that may have relevance in a clickstream environment. Any single dimensional
model will not use all the dimensions at once, but it is nice to have a portfolio
of dimensions waiting to be used. The list of dimensions for a web retailer could
include:
Date
Time of day
Part
Vendor
Status
Carrier
Facilities location
Product
Customer
Media
Promotion
Internal organization
Employee
Page
Event
Session
Referral
All the dimensions in the list, except for the last four shown in bold, are familiar
dimensions, most of which we have already used in earlier chapters of this book.
But the last four are the unique dimensions of the clickstream and warrant some
careful attention.
Page Dimension
The page dimension describes the page context for a web page event, as illustrated
in Figure 15-1. The grain of this dimension is the individual page. The definition
of page must be flexible enough to handle the evolution of web pages from static
page delivery to highly dynamic page delivery in which the exact page the customer
sees is unique at that instant in time. We assume even in the case of the dynamic
page that there is a well-defined function that characterizes the page, and we will
use that to describe the page. We will not create a page row for every instance of a
dynamic page because that would yield a dimension with an astronomical number
of rows. These rows also would not differ in interesting ways. You want a row in this
dimension for each interesting distinguishable type of page. Static pages probably get
their own row, but dynamic pages would be grouped by similar function and type.
Page Dimension Attribute: Sample Data Values/Definitions
Page Key: Surrogate values (1..N)
Page Source: Static, Dynamic, Unknown, Corrupted, Inapplicable, ...
Page Function: Portal, Search, Product description, Corporate information, ...
Page Template: Sparse, Dense, ...
Item Type: Product SKU, Book ISBN number, Telco rate type, ...
Graphics Type: GIF, JPG, Progressive disclosure, Size pre-declared, ...
Animation Type: Similar to graphics type
Sound Type: Similar to graphics type
Page File Name: Optional application dependent name
Figure 15-1: Page dimension attributes and sample data values.
When the definition of a static page changes because it is altered by the web-
master, the page dimension row can either be type 1 overwritten or treated with
an alternative slowly changing technique. This decision is a matter of policy for
the data warehouse and depends on whether the old and new descriptions of the
page differ materially, and whether the old definition should be kept for historical
analysis purposes.
Website designers, data governance representatives from the business, and the
DW/BI architects need to collaborate to assign descriptive codes and attributes to
each page served by the web server, whether the page is dynamic or static. Ideally,
the web page developers supply descriptive codes and attributes with each page
they create and embed these codes and attributes into the optional fields of the
web log files. This crucial step is at the foundation of the implementation of this
page dimension.
Before leaving the page dimension, we want to point out that some internet com-
panies track the more granular individual elements on each page of their websites,
including graphical elements and links. Each element generates its own row for each
visitor for each page request. A single complex web page can generate hundreds of
rows each time the page is served to a visitor. Obviously, this extreme granularity
generates astronomical amounts of data, often exceeding 10 terabytes per day!
Similarly, gaming companies may generate a row for every gesture made by every
online game player, which again can result in hundreds of millions of rows per day.
In both cases, the most atomic fact table will have extra dimensions describing the
graphical element, link, or game situation.
Event Dimension
The event dimension describes what happened on a particular page at a particular
point in time. The main interesting events are Open Page, Refresh Page, Click Link,
and Enter Data. You want to capture that information in this small event dimension,
as illustrated in Figure 15-2.
Event Dimension Attribute: Sample Data Values/Definitions
Event Key: Surrogate values (1..N)
Event Type: Open page, Refresh page, Click link, Unknown, Inapplicable
Event Content: Application-dependent fields eventually driven by XML tags
Figure 15-2: Event dimension attributes and sample data values.
Session Dimension
The session dimension provides one or more levels of diagnosis for the visitor’s
session as a whole, as shown in Figure 15-3. For example, the local context of the
session might be Requesting Product Information, but the overall session context
might be Ordering a Product. The success status would diagnose whether the mis-
sion was completed. The local context may be decidable from just the identity of
the current page, but the overall session context probably can be judged only by
processing the visitor’s complete session at data extract time. The customer status
attribute is a convenient place to label the customer for periods of time, with labels
that are not clear either from the page or immediate session. These statuses may be
derived from auxiliary business processes in the DW/BI system, but by placing these
labels deep within the clickstream, you can directly study the behavior of certain
types of customers. Do not put these labels in the customer dimension because they
may change over very short periods of time. If there are a large number of these
statuses, consider creating a separate customer status mini-dimension rather than
embedding this information in the session dimension.
Session Dimension Attribute: Sample Data Values/Definitions
Session Key: Surrogate values (1..N)
Session Type: Classified, Unclassified, Corrupted, Inapplicable
Local Context: Page-derived context like Requesting Product Information
Session Context: Trajectory-derived context like Ordering a Product
Action Sequence: Summary label for overall sequence of actions during session
Success Status: Identifies whether overall session mission was accomplished
Customer Status: New customer, High value customer, About to cancel, In default
Figure 15-3: Session dimension attributes and sample data values.
This dimension groups sessions for analysis, such as:
How many customers consulted your product information before ordering?
How many customers looked at your product information and never ordered?
How many customers did not finish ordering? Where did they stop?
Referral Dimension
The referral dimension, illustrated in Figure 15-4, describes how the customer
arrived at the current page. The web server logs usually provide this information.
The URL of the previous page is identified, and in some cases additional information
is present. If the referrer was a search engine, usually the search string is specified.
It may not be worthwhile to put the raw search specification into your database
because the search specifications are so complicated and idiosyncratic that an ana-
lyst may not be able to query them usefully. You can assume some kind of simplified
and cleaned specification is placed in the specification attribute.
Referral Dimension Attribute: Sample Data Values/Definitions
Referral Key: Surrogate values (1..N)
Referral Type: Intra site, Remote site, Search engine, Corrupted, Inapplicable
Referring URL: www.organization-site.com/linkspage
Referring Site: www.organization-site.com
Referring Domain: www.organization-site.com
Search Type: Simple text match, Complex logical match
Specification: Actual spec used (useful if simple text, otherwise questionable)
Target: Meta tags, Body text, Title (where search found its match)
Figure 15-4: Referral dimension attributes and sample data values.
Clickstream Session Fact Table
Now that you have a portfolio of useful clickstream dimensions, you can design
the primary clickstream dimensional models based on the web server log data.
This business process can then be integrated into the family of other web retailing
subject areas.
With an eye toward keeping the first fact table from growing astronomically,
you should choose the grain to be one row for each completed customer session.
This grain is significantly higher than the underlying web server logs, which record
each individual page event, including individual pages as well as each graphical
element on each page. While we typically encourage designers to start with the
most granular data available in the source system, this is a purposeful deviation
from our standard practices. Perhaps you have a big site recording more than 100
million page fetches per day, and 1 billion micro page events (graphical elements),
but you want to start with a more manageable number of rows to be loaded each
day. We assume for the sake of argument that the 100 million page fetches boil
down to 20 million complete visitor sessions. This could arise if an average visitor
session touched 5 pages.
The dimensions that are appropriate for this first fact table are calendar date, time
of day, customer, page, session, and referrer. Finally, you can add a set of measured
facts for this session including session seconds, pages visited, orders placed, units
ordered, and order dollars. The completed design is shown in Figure 15-5.
Date Dimension (2 views for roles)
Clickstream Session Fact
Universal Date Key (FK)
Universal Date/Time
Local Date Key (FK)
Local Date/Time
Customer Key (FK)
Entry Page Key (FK)
Session Key (FK)
Referrer Key (FK)
Session ID (DD)
Session Seconds
Pages Visited
Orders Placed
Order Quantity
Order Dollar Amount
Entry Page Dimension
Customer Dimension
Session Dimension
Referrer Dimension
Figure 15-5: Clickstream fact table design for complete sessions.
There are a number of interesting aspects to this design. You may wonder why
there are two connections from the calendar date dimension to the fact table and
two date/time stamps. This is a case in which both the calendar date and the time
of day must play two different roles. Because you are interested in measuring the
precise times of sessions, you must meet two conflicting requirements. First, you
want to make sure you can synchronize all session dates and times internationally
across multiple time zones. Perhaps you have other date and time stamps from
other web servers or nonweb systems elsewhere in the DW/BI environment. To
achieve true synchronization of events across multiple servers and processes, you
must record all session dates and times, uniformly, in a single time zone such as
Greenwich Mean Time (GMT) or Coordinated Universal Time (UTC). You should
interpret the session date and time combinations as the beginning of the session.
Because you have the dwell time of the session as a numeric fact, you can tell when
the session ended, if that is of interest.
The other requirement you meet with this design is to record the date and time of
the session relative to the visitor’s wall clock. The best way to represent this informa-
tion is with a second calendar date foreign key and date/time stamp. Theoretically,
you could represent the time zone of the customer in the customer dimension table,
but constraints to determine the correct wall clock time would be horrendously
complicated. The time difference between two cities (such as London and Sydney)
can change by as much as two hours at different times of the year depending on
when these cities go on and off daylight savings time. This is not the business of
the BI reporting application to work out. It is the business of the database to store
this information, so it can be constrained in a simple and direct way.
The two role-playing calendar date dimension tables are views on a single under-
lying table. The column names are massaged in the view definition, so they are
slightly different when they show up in the user interface pick lists of BI tools.
Note that the use of views makes the two instances of each table semantically
independent.
We modeled the exact instant in time with a full date/time stamp rather than a
time-of-day dimension. Unlike the calendar date dimension, a time-of-day dimen-
sion would contain few if any meaningful attributes. You don’t have labels for each
hour, minute, or second. Such a time-of-day dimension could be ridiculously large
if its grain were the individual second or millisecond. Also, the use of an explicit
date/time stamp allows direct arithmetic between different date/time stamps to
calculate precise time gaps between sessions, even those crossing days. Calculating
time gaps using a time-of-day dimension would be awkward.
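For instance, the gap between a customer's successive sessions can be computed directly from the universal date/time stamps, as in this sketch against the fact table of Figure 15-5 (column names rendered in snake_case; the exact interval arithmetic syntax varies by database):

SELECT customer_key,
       universal_date_time AS session_start,
       universal_date_time
         - LAG(universal_date_time) OVER (PARTITION BY customer_key
                                          ORDER BY universal_date_time)
                             AS gap_since_prior_session
FROM   clickstream_session_fact;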
The inclusion of the page dimension in Figure 15-5 may seem surprising given
the grain of the design is the customer session. However, in a given session, a very
interesting page is the entry page. The page dimension in this design is the page the
session started with. In other words, how did the customer hop onto your bus just
now? Coupled with the referrer dimension, you now have an interesting ability to
analyze how and why the customer accessed your website. A more elaborate design
would also add an exit page dimension.
You may be tempted to add the causal dimension to this design, but if the causal
dimension focuses on individual products, it would be inappropriate to add it to
this design. The symptom that the causal dimension does not mesh with this design
is the multivalued nature of the causal factors for a given complete session. If you
run ad campaigns or special deals for several products, how do you represent this
multivalued situation if the customer's session involves several products? The right place for a product-oriented causal dimension will be in the more fine-grained table described in the next fact table example. Conversely, a more broadly focused market conditions dimension that describes conditions affecting all products would be
appropriate for a session-grained fact table.
The session seconds fact is the total number of seconds the customer spent on the
site during this session. There will be many cases in which you can’t tell when the
customer left. Perhaps the customer typed in a new URL. This won’t be detected by
conventional web server logs. (If the data is collected by an ISP that can see every click across sessions, this particular issue goes away.) Or perhaps the customer got up from the chair and didn't return for an hour. Or perhaps the customer just
closed the browser without making any more clicks. In all these cases, your extract
software needs to assign a small and nominal number of seconds to this last session
step, so the analysis is not unrealistically distorted.
We purposely designed this first clickstream fact table to focus on complete visitor
sessions while keeping the size under control. The next schema drops down to the
lowest practical granularity you can support in the data warehouse: the individual
page event.
Clickstream Page Event Fact Table
The granularity of the second clickstream fact table is the individual page event in
each customer session; the underlying micro events recording graphical elements
such as JPGs and GIFs are discarded (unless you are Yahoo! or eBay as described
previously). With simple static HTML pages, you can record only one interesting
event per page view, namely the page view. As websites employ dynamically created
XML-based pages, with the ability to establish an on-going dialogue through the
page, the number and type of events will grow.
This fact table could become astronomical in size. You should resist the urge
to aggregate the table up to a coarser granularity because that inevitably involves
dropping dimensions. Actually, the first clickstream fact table represents just such
an aggregation; although it is a worthwhile fact table, analysts cannot ask questions
about visitor behavior or individual pages.
Having chosen the grain, you can choose the appropriate dimensions. The list of
dimensions includes calendar date, time of day, customer, page, event, session, ses-
sion ID, step (three roles), product, referrer, and promotion. The completed design
is shown in Figure 15-6.
Figure 15-6: Clickstream fact table design for individual page use. The Clickstream Page Event Fact table carries Universal Date Key (FK) and Universal Date/Time, Local Date Key (FK) and Local Date/Time, Customer Key (FK), Page Key (FK), Event Key (FK), Session Key (FK), Session ID (DD), Session Step Key (FK), Purchase Step Key (FK), Abandonment Step Key (FK), Product Key (FK), Referrer Key (FK), Promotion Key (FK), Page Seconds, Order Quantity, and Order Dollar Amount. It joins to the date dimension (2 views for roles), the step dimension (3 views for roles, carrying Step Number and Steps Until End), and the customer, page, event, session, product, referrer, and promotion dimensions.
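A minimal DDL sketch of this fact table, mirroring the columns in Figure 15-6 (the data types are illustrative assumptions, not part of the original design):

    CREATE TABLE clickstream_page_event_fact (
        universal_date_key   INTEGER   NOT NULL,  -- FK to universal date role view
        universal_date_time  TIMESTAMP NOT NULL,
        local_date_key       INTEGER   NOT NULL,  -- FK to local date role view
        local_date_time      TIMESTAMP NOT NULL,
        customer_key         INTEGER   NOT NULL,
        page_key             INTEGER   NOT NULL,
        event_key            INTEGER   NOT NULL,
        session_key          INTEGER   NOT NULL,
        session_id           BIGINT    NOT NULL,  -- degenerate dimension
        session_step_key     INTEGER   NOT NULL,  -- step dimension, overall session role
        purchase_step_key    INTEGER   NOT NULL,  -- step dimension, purchase subsession role
        abandonment_step_key INTEGER   NOT NULL,  -- step dimension, abandonment subsession role
        product_key          INTEGER   NOT NULL,
        referrer_key         INTEGER   NOT NULL,
        promotion_key        INTEGER   NOT NULL,
        page_seconds         INTEGER,
        order_quantity       INTEGER,
        order_dollar_amount  DECIMAL(18,2)
    );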
Figure 15-6 looks similar to the first design, except for the addition of the page,
event, promotion, and step dimensions. This similarity between fact tables is typical
of dimensional models. One of the charms of dimensional modeling is the “boring”
similarity of the designs. But that is where they get their power. When the designs
have a predictable structure, all the software up and down the DW/BI chain, from
extraction, to database querying, to the BI tools, can exploit this similarity to great
advantage.
The two roles played by the calendar date and date/time stamps have the same
interpretation as in the first design. One role is the universal synchronized time,
and the other role is the local wall clock time as measured by the customer. In this
fact table, these dates and times refer to the individual page event.
The page dimension refers to the individual page. This is the main difference in
grain between the two clickstream fact tables. In this fact table you can see all the
pages accessed by the customers.
As described earlier, the session dimension describes the outcome of the session.
A companion column, the session ID, is a degenerate dimension that does not have
a join to a dimension table. This degenerate dimension is a typical dimensional
modeling construct. The session ID is simply a unique identifier, with no semantic content, that serves to group together the page events of each customer session in an unambiguous way. You did not need a session ID degenerate dimension in the first fact table, but it is included as a "parent key" if you want to easily link to
the individual page event fact table. We recommend the session dimension be at a
higher level of granularity than the session ID; the session dimension is intended
to describe classes and categories of sessions, not the characteristics of each indi-
vidual session.
A product dimension is shown in this design under the assumption this website
belongs to a web retailer. A financial services site probably would have a similar
dimension. A consulting services site would have a service dimension. An auction
site would have a subject or category dimension describing the nature of the items
being auctioned. A news site would have a subject dimension, although with dif-
ferent content than an auction site.
You should accompany the product dimension with a promotion dimension so
you can attach useful causal interpretations to the changes in demand observed
for certain products.
For each page event, you should record the number of seconds that elapse before
the next page event. Call this page seconds to contrast it with session seconds in
the first fact table. This is a simple example of paying attention to conformed facts. If you call both of these measures simply "seconds," you risk having these seconds inappropriately added or combined. Because these seconds are not precisely equivalent, you should name them differently as a warning. In this particular case, you would expect the page seconds for a session in this second fact table to add up to the session seconds in the first fact table.
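A data quality check along these lines can make that expectation explicit. This is only a sketch, assuming the session-grained table is named clickstream_session_fact and that both tables carry the session ID degenerate dimension:

    -- Sessions whose page seconds do not roll up to the recorded session seconds.
    SELECT
        pe.session_id,
        SUM(pe.page_seconds)   AS summed_page_seconds,
        MAX(s.session_seconds) AS recorded_session_seconds
    FROM clickstream_page_event_fact pe
    JOIN clickstream_session_fact s
      ON s.session_id = pe.session_id
    GROUP BY pe.session_id
    HAVING SUM(pe.page_seconds) <> MAX(s.session_seconds);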
The final facts are units ordered and order dollars. These columns will be zero or null for many rows in this fact table if the specific page event is not the event that places the order. Nevertheless, it is highly attractive to provide these columns because they tie the all-important web revenue directly to behavior. If the units ordered and order dollars were only available through the production order entry system elsewhere in the DW/BI environment, it would be inefficient to perform the
revenue-to-behavior analysis across multiple large tables. In many database management systems, these null facts are handled efficiently and may take up literally zero space in the fact table.
Step Dimension
Because the fact table grain is the individual page event, you can add the powerful
step dimension described in Chapter 8: Customer Relationship Management. The
step dimension, originally shown in Figure 8-11, provides the position of the specific
page event within the overall session.
The step dimension becomes particularly powerful when it is attached to the fact
table in various roles. Figure 15-6 shows three roles: overall session, purchase subses-
sion, and abandonment subsession. A purchase subsession, by definition, ends in a
successful purchase. An abandonment subsession is one that fails to complete a pur-
chase transaction for some reason. Using these roles of the step dimension allows some
very interesting queries. For example, if the purchase step dimension is constrained to
step number 1, the query returns nothing but the starting page for successful purchase
experiences. Conversely, if the abandonment step dimension is constrained to zero
steps remaining, the query returns nothing but the last and presumably most unfulfilling pages visited in unsuccessful purchase sessions. Although the whole design
shown in Figure 15-6 is aimed at product purchases, the step dimension technique
can be used in the analysis of any sequential process.
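A sketch of the first of these queries, assuming the step dimension's purchase-subsession role is exposed as a view named purchase_step_dimension with role-prefixed column names, and that the page dimension carries a descriptive page attribute (all names illustrative):

    -- Starting pages of sessions that ended in a successful purchase:
    -- constrain the purchase-step role to step number 1.
    SELECT
        pg.page_description,
        COUNT(*) AS page_events
    FROM clickstream_page_event_fact f
    JOIN purchase_step_dimension ps
      ON ps.purchase_step_key = f.purchase_step_key
    JOIN page_dimension pg
      ON pg.page_key = f.page_key
    WHERE ps.purchase_step_number = 1
    GROUP BY pg.page_description
    ORDER BY page_events DESC;

Constraining the abandonment-step role to zero steps remaining follows the same pattern.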
Aggregate Clickstream Fact Tables
Both clickstream fact tables designed thus far are pretty large. There are many
business questions that would be forced to summarize millions of rows from these
tables. For example, if you want to track the total visits and revenue from major
demographic groups of customers accessing your website on a month-by-month
basis, you can certainly do that with either fact table. In the session-grained fact
table, you would constrain the calendar date dimension to the appropriate time span
(say January, February, and March of the current year). You would then create row
headers from the demographics type attribute in the customer dimension and the
month attribute in the calendar dimension (to separately label the three months
in the output). Finally, you would sum the Order Dollars and count the number of
sessions. This all works fine. But it is likely to be slow without help from an aggre-
gate table. If this kind of query is frequent, the DBA will be encouraged to build an
aggregate table, as shown in Figure 15-7.
You can build this table directly from your first fact table, whose grain is the
individual session. To build this aggregate table, you group by month, demographic
type, entry page, and session outcome. You count the number of sessions, and sum
Electronic Commerce 367
all the other additive facts. This results in a drastically smaller fact table, almost
certainly less than 1% of the original session-grained fact table. This reduction in
size translates directly to a corresponding increase in performance for most queries.
In other words, you can expect queries directed to this aggregate table to run at
least 100 times as fast.
Figure 15-7: Aggregate clickstream fact table. The Session Aggregate Fact table contains Universal Month Key (FK), Demographic Key (FK), Entry Page Key (FK), Session Outcome Key (FK), Number of Sessions, Session Seconds, Pages Visited, Orders Placed, Order Quantity, and Order Dollar Amount, joined to the month, demographic, entry page, and session outcome dimensions.
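A sketch of how this aggregate might be populated from the session-grained fact table, assuming that table is named clickstream_session_fact and carries the corresponding additive facts, that the universal date role view also exposes a universal_month_key rollup attribute, that the demographic and session outcome keys can be looked up from the customer and session dimensions, and that the entry page key is simply the session's page key (all names illustrative):

    INSERT INTO session_aggregate_fact
    SELECT
        d.universal_month_key,
        c.demographic_key,
        f.page_key                 AS entry_page_key,
        s.session_outcome_key,
        COUNT(*)                   AS number_of_sessions,
        SUM(f.session_seconds)     AS session_seconds,
        SUM(f.pages_visited)       AS pages_visited,
        SUM(f.orders_placed)       AS orders_placed,
        SUM(f.order_quantity)      AS order_quantity,
        SUM(f.order_dollar_amount) AS order_dollar_amount
    FROM clickstream_session_fact f
    JOIN universal_date_dimension d ON d.universal_date_key = f.universal_date_key
    JOIN customer_dimension c       ON c.customer_key = f.customer_key
    JOIN session_dimension s        ON s.session_key = f.session_key
    GROUP BY
        d.universal_month_key,
        c.demographic_key,
        f.page_key,
        s.session_outcome_key;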
Although it may not have been obvious, we followed a careful discipline in build-
ing the aggregate table. This aggregate fact table is connected to a set of shrunken
rollup dimensions directly related to the original dimensions in the more granular
fact tables. The month dimension is a conformed subset of the calendar day dimen-
sion’s attributes. The demographic dimension is a conformed subset of customer
dimension attributes. You should assume the page and session tables are unchanged;
a careful design of the aggregation logic could suggest a conformed shrinking of
these tables as well.
Google Analytics
Google Analytics (GA) is a service provided by Google that is best described as an
external data warehouse that provides many insights about how your website is used.
To use GA, you modify each page of your website to include a GA tracking code (GATC) embedded in a JavaScript snippet located in the HTML <head> declaration of each page to be tracked. When a visitor accesses the page, information is sent to the Analytics service at Google, as long as the visitor has JavaScript enabled. Virtually all of the information described in this chapter can be collected through GA, with the exception of personally identifiable information (PII), which is forbidden by GA's terms of service. GA can be combined with Google's AdWords service to track ad campaigns and conversions (sales). Reportedly, GA is used by more than 50% of the most popular websites on the internet.
Data from GA can be viewed in a BI tool dashboard online directly from the under-
lying GA databases, or data can be delivered to you in a wide variety of standard and
custom reports, making it possible to build your own local business process schema
surrounding this data.
Interestingly, GA's detailed technical documentation correctly describes the data elements that can be collected through the service as either dimensions or measures. Someone at Google has been reading our books…
Integrating Clickstream into the Web Retailer's Bus Matrix
This section considers the business processes needed by a web-based computer retailer.
The retailer's enterprise data warehouse bus matrix is illustrated in Figure 15-8. Note
the matrix lists business process subject areas, not individual fact tables. Typically, each
matrix row results in a suite of closely associated fact tables and/or OLAP cubes, which
all represent a particular business process.
The Figure 15-8 matrix has a number of striking characteristics. There are a
lot of check marks. Some of the dimensions, such as date/time, organization, and
employee appear in almost every business process. The product and customer dimen-
sions dominate the middle part of the matrix, where they are attached to business
processes that describe customer-oriented activities. At the top of the matrix, suppli-
ers and parts dominate the processes of acquiring the parts that make up products
and building them to order for the customer. At the bottom of the matrix, you have
classic infrastructure and cost driver business processes that are not directly tied
to customer behavior.
The web visitor clickstream subject area sits squarely among the customer-
oriented processes. It shares the date/time, product, customer, media, causal, and
service policy dimensions with several other business processes nearby. In this
sense it should be obvious that the web visitor clickstream data is well integrated
into the fabric of the overall DW/BI system for this retailer. Applications tying the
web visitor clickstream will be easy to integrate across all the processes sharing
these conformed dimensions because separate queries to each fact table can be
combined across individual rows of the report.
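For example, a drill-across sketch combining the session-grained clickstream with a hypothetical product orders fact table on the conformed month attribute (both fact table names and the order-date role view are assumptions for illustration):

    -- Each fact table is summarized separately to the same conformed row header
    -- (calendar month), and the two result sets are then merged row by row.
    SELECT
        COALESCE(clicks.calendar_month, orders.calendar_month) AS calendar_month,
        clicks.number_of_sessions,
        orders.order_dollar_amount
    FROM
        (SELECT d.universal_calendar_month AS calendar_month,
                COUNT(*) AS number_of_sessions
         FROM clickstream_session_fact f
         JOIN universal_date_dimension d
           ON d.universal_date_key = f.universal_date_key
         GROUP BY d.universal_calendar_month) clicks
    FULL OUTER JOIN
        (SELECT d.calendar_month,
                SUM(f.order_dollar_amount) AS order_dollar_amount
         FROM product_order_fact f
         JOIN order_date_dimension d
           ON d.order_date_key = f.order_date_key
         GROUP BY d.calendar_month) orders
      ON orders.calendar_month = clicks.calendar_month;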
The web visitor clickstream business process contains the four special click-
stream dimensions not found in the others. These dimensions do not pose a problem
for applications. Instead, the ability of the web visitor clickstream data to bridge
between the web world and the brick-and-mortar world is exactly the advantage
you are looking for. You can constrain and group on attributes from the four web
dimensions and explore the e ect on the other business processes. For example, you
can see what kinds of web experiences produce customers who purchase certain
kinds of service policies and then invoke certain levels of service demands.
Figure 15-8: Bus matrix for web retailer. The matrix rows are the retailer's business processes, organized under the headings Supply Chain Management, Customer Relationship Management, and Operations: Supplier Purchase Orders, Supplier Deliveries, Part Inventories, Product Assembly Bill of Materials, Product Assembly to Order, Product Promotions, Advertising, Customer Communications, Customer Inquiries, Web Visitor Clickstream, Product Orders, Service Policy Orders, Product Shipments, Customer Billing, Customer Payments, Product Returns, Product Support, Service Policy Responses, Employee Labor, Human Resources, Facilities Operations, and Web Site Operations. The columns are the shared dimensions: Date and Time, Part, Vendor, Carrier, Facility, Product, Customer, Media, Promotion, Service Policy, Internal Organization, Employee, and Clickstream (4 dims).
Finally, it should be pointed out that the matrix serves as a kind of communica-
tions vehicle for all the business teams and senior management to appreciate the
need to conform dimensions and facts. A given column in the matrix is, in effect,
an invitation list to the meeting for conforming the dimension!
Profitability Across Channels Including Web
After the DW/BI team successfully implements the initial clickstream fact tables
and ties them to the sales transaction and customer communication business pro-
cesses, the team may be ready to tackle the most challenging subject area of all:
web profitability.
You can tackle web profitability as an extension of the sales transaction process. Fundamentally, you are allocating all the activity and infrastructure costs down to each sales transaction. You could, as an alternative, try to build web profitability on top of the clickstream, but this would involve an even more controversial allo-
cation process in which you allocate costs down to each session. It would be hard
to assign activity and infrastructure costs to a session that has no obvious product
involvement and leads to no immediate sale.
A big benefit of extending the sales transaction fact table is that you get a view of profitability across all your sales channels, not just the web. In a way, this should
be obvious because you know that you must sort out the costs and assign them to
the various channels.
The grain of the profit and loss facts is each individual line item sold on a sales ticket to a customer at a point in time, whether it's a single sales ticket or single web purchasing session. This is the same as the grain of the sales transaction business process and includes all channels, assumed to be store sales, telesales, and web sales.
The dimensions of the profit and loss facts are also the same as the sales transaction fact table: date, time, customer, channel, product, promotion, and ticket number (degenerate). The big difference between the profitability and sales transaction fact
tables is the breakdown of the costs, as illustrated in Figure 15-9.
Before discussing the allocation of costs, let us examine the format of the profit and loss facts. It is organized as a simple profit and loss (P&L) statement (refer to Figure 6-14). The first fact is the familiar units sold. All the other facts are dollar values beginning with the value of the sale as if it were sold at the list or catalog price, referred to as gross revenue. Assuming sales often take place at lower prices, you would account for any difference with a manufacturer's allowance, a marketing promotion that is a price reduction, or a markdown done to move the inventory. When these effects are taken into account, you can calculate the net revenue, which is the true net price the customer pays times the number of units purchased.
The rest of the P&L consists of a series of subtractions, where you calculate progressively more far-reaching versions of profit. You can begin by subtracting the product manufacturing cost if you manufacture it, or equivalently, the product
acquisition cost if it is acquired from a supplier. Then subtract the product storage
cost. At this point, many enterprises call this partial result the gross profit. You can divide this gross profit by the gross revenue to get the gross margin ratio.
Figure 15-9: Profit and loss facts across sales channels, including web sales. The Profitability Fact table contains Universal Date Key (FK), Universal Time of Day Key (FK), Local Date Key (FK), Local Time of Day Key (FK), Customer Key (FK), Channel Key (FK), Product Key (FK), Promotion Key (FK), Ticket Number (DD), Units Sold, Gross Revenue, Manufacturing Allowance, Marketing Promotion, Sales Markdown, Net Revenue, Manufacturing Cost, Storage Cost, Gross Profit, Freight Cost, Special Deal Cost, Other Overhead Cost, and Net Profit, joined to the date and time-of-day dimensions (2 views each for roles) and the customer, channel, product, and promotion dimensions.
Obviously, the columns called net revenue and gross profit are calculated directly from the columns immediately preceding them in the fact table. But should you explicitly store these columns in the database? The answer depends on whether you provide access to this fact table through a view or whether users or BI applications directly access the physical fact table. The structure of the P&L is sufficiently complex that, as the data warehouse provider, you don't want to risk the important measures like net revenue and gross profit being computed incorrectly. If you provide all access through views, you can easily provide the computed columns without physically storing them. But if your users are allowed to access the underlying physical table, you should include net revenue, gross profit, and net profit as physical columns.
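A sketch of the view-based approach, computing the intermediate P&L measures from the stored revenue and cost columns shown in Figure 15-9 (the view and physical table names are illustrative):

    CREATE VIEW profitability_pnl AS
    SELECT
        universal_date_key,
        universal_time_of_day_key,
        local_date_key,
        local_time_of_day_key,
        customer_key,
        channel_key,
        product_key,
        promotion_key,
        ticket_number,
        units_sold,
        gross_revenue,
        manufacturing_allowance,
        marketing_promotion,
        sales_markdown,
        -- Net revenue: gross revenue less allowances, promotions, and markdowns.
        gross_revenue - manufacturing_allowance - marketing_promotion - sales_markdown
            AS net_revenue,
        manufacturing_cost,
        storage_cost,
        -- Gross profit: net revenue less manufacturing and storage costs.
        gross_revenue - manufacturing_allowance - marketing_promotion - sales_markdown
            - manufacturing_cost - storage_cost
            AS gross_profit,
        freight_cost,
        special_deal_cost,
        other_overhead_cost,
        -- Net profit: gross profit less the remaining allocated costs.
        gross_revenue - manufacturing_allowance - marketing_promotion - sales_markdown
            - manufacturing_cost - storage_cost
            - freight_cost - special_deal_cost - other_overhead_cost
            AS net_profit
    FROM profitability_fact;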
Below the gross profit you can continue subtracting various costs. Typically, the DW/BI team must separately source or estimate each of these costs. Remember the actual entries in any given fact table row are the fractions of these total costs allocated all the way down to the individual fact row grain. Often there is significant pressure on the DW/BI team to deliver the profitability business process. Or to put
it another way, there is tremendous pressure to source all these costs. But how good
are the costs in the various underlying data sets? Sometimes a cost is only available
as a national average, computed for an entire year. Any allocation scheme is going
to assign a kind of pro forma value that has no real texture to it. Other costs will be
broken down a little more granularly, perhaps to calendar quarter and by geographic
region (if relevant). Finally, some costs may be truly activity-based and vary in a
highly dynamic, responsive, and realistic way over time.
Website system costs are an important cost driver in electronic commerce busi-
nesses. Although website costs are classic infrastructure costs, and are therefore
di cult to allocate directly to the product and customer activity, this is a key step
in developing a web-oriented P&L statement. Various allocation schemes are pos-
sible, including allocating the website costs to various product lines by the number
of pages devoted to each product, allocating the costs by pages visited, or allocating
the costs by actual web-based purchases.
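As a minimal sketch of the pages-visited scheme, assuming a hypothetical monthly website cost of $250,000 allocated to products in proportion to their share of page events for that month (table names follow the earlier clickstream example; the month label is also hypothetical):

    -- Allocate one month's website infrastructure cost to products in proportion
    -- to the number of page events touching each product.
    SELECT
        f.product_key,
        250000.00 * COUNT(*) / SUM(COUNT(*)) OVER () AS allocated_website_cost
    FROM clickstream_page_event_fact f
    JOIN local_date_dimension d
      ON d.local_date_key = f.local_date_key
    WHERE d.local_calendar_month = 'January 2013'   -- hypothetical month label
    GROUP BY f.product_key;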
The DW/BI team cannot be responsible for implementing activity-based costing
(ABC) in a large organization. When the team is building a profitability dimensional model, the team gets the best cost data available at the moment and publishes the P&L. Perhaps some of the numbers are simple rule-of-thumb ratios. Others may be highly detailed activity-based costs. Over time, as the sources of cost improve, the DW/BI team incorporates these new sources and notifies the users that the business
rules have improved.
Before leaving this design, it is worthwhile putting it in perspective. When a
P&L structure is embedded in a rich dimensional framework, you have immense
power. You can break down all the components of revenue, cost, and profit for every conceivable slice and dice provided by the dimensions. You can answer not only what is profitable, but also why, because you can see all the components of the P&L, including:
How profitable is each channel (web sales, telesales, and store sales)? Why?
How profitable are your customer segments? Why?
How profitable is each product line? Why?
How profitable are your promotions? Why?
When is your business most profitable? Why?
The symmetric dimensional approach enables you to combine constraints from many dimensions, allowing compound versions of the profitability analyses like:
Who are the profitable customers in each channel? Why?
Which promotions work well on the web but do not work well in other channels? Why?
Summary
The web retailer case study used in this chapter is illustrative of any business with a significant web presence. Besides tackling the clickstream subject area at multiple levels of granularity, the central challenge is effectively integrating the clickstream data into the rest of the business. We discussed ways to address the identification challenges associated with the web visitor, their origin, and session boundaries, along with the special dimensions unique to clickstream data, including the session, page, and step dimensions.
In the next chapter, we’ll turn our attention to the primary business processes
in an insurance company as we recap many of the dimensional modeling patterns
presented throughout this book.
Insurance
We bring together concepts from nearly all the previous chapters to build
a DW/BI system for a property and casualty insurance company in this
final case study. If you are from the insurance industry and jumped directly to
this chapter for a quick fix, please accept our apology, but this material depends
heavily on ideas from the earlier chapters. You’ll need to turn back to the beginning
of the book to have this chapter make any sense.
As has been our standard procedure, this chapter launches with background
information for a business case. While the requirements unfold, we’ll draft the enter-
prise data warehouse bus matrix, much like we would in a real-life requirements
analysis effort. We'll then design a series of dimensional models by overlaying the
core techniques learned thus far.
Chapter 16 reviews the following concepts:
Requirements-driven approach to dimensional design
Value chain implications, along with an example bus matrix snippet for an
insurance company
Complementary transaction, periodic snapshot, and accumulating snapshot
schemas
Dimension role playing
Handling of slowly changing dimension attributes
Mini-dimensions for dealing with large, rapidly changing dimension attributes
Multivalued dimension attributes
Degenerate dimensions for operational control numbers
Audit dimensions to track data lineage
Heterogeneous supertypes and subtypes to handle products with varied attri-
butes and facts
Junk dimensions for miscellaneous indicators
Conformed dimensions and facts
Consolidated fact tables combining metrics from separate business processes
Factless fact tables
Common mistakes to avoid when designing dimensional models
Insurance Case Study
Imagine working for a large property and casualty insurer that offers automobile, homeowner, and personal property insurance. You conduct extensive interviews with business representatives and senior management from the claims, field operations, underwriting, finance, and marketing departments. Based on these interviews, you learn the industry is in a state of flux. Nontraditional players are leveraging alternative channels. Meanwhile, the industry is consolidating due to globalization, deregulation, and demutualization challenges. Markets are changing, along with customer needs. Numerous interviewees tell us information is becoming an even more important strategic asset. Regardless of the functional area, there is a strong desire to use information more effectively to identify opportunities more quickly
and respond most appropriately.
The good news is that internal systems and processes already capture the bulk
of the data required. Most insurance companies generate tons of nitty-gritty opera-
tional data. The bad news is the data is not integrated. Over the years, political and
IT boundaries have encouraged the construction of tall barriers around isolated
islands of data. There are multiple disparate sources for information about the
company’s products, customers, and distribution channels. In the legacy opera-
tional systems, the same policyholder may be identified several times in separate automobile, home, and personal property applications. Traditionally, this segmented approach to data was acceptable because the different lines of business functioned
largely autonomously; there was little interest in sharing data for cross-selling and
collaboration in the past. Now within our case study, business management is
attempting to better leverage this enormous amount of inconsistent and somewhat
redundant data.
Besides the inherent issues surrounding data integration, business users lack the
ability to access data easily when needed. In an attempt to address this shortcom-
ing, several groups within the case study company rallied their own resources and
hired consultants to solve their individual short-term data needs. In many cases, the
same data was extracted from the same source systems to be accessed by separate
organizations without any strategic overall information delivery strategy.
It didn't take long to recognize the negative ramifications associated with separate analytic data repositories because performance results presented at executive meetings differed depending on the data source. Management understood this independent route was not viable as a long-term solution because of the lack of integration, large volumes of redundant data, and difficulty in interpreting and reconciling the results. Given the importance of information in this brave new insurance world, management was motivated to deal with the cost implications surrounding the development, support, and analytic inefficiencies of these supposed data warehouses that merely
proliferated operational data islands.
Senior management chartered the chief information officer (CIO) with the responsibility and authority to break down the historical data silos to "achieve information nirvana." They charged the CIO with the fiduciary responsibility to manage and leverage the organization's information assets more effectively. The CIO developed
an overall vision that wed an enterprise strategy for dealing with massive amounts
of data with a response to the immediate need to become an information-rich orga-
nization. In the meantime, an enterprise DW/BI team was created to begin designing
and implementing the vision.
Senior management has been preaching about a transformation to a more cus-
tomer-centric focus, instead of the traditional product-centric approach, in an effort
to gain competitive advantage. The CIO jumped on that bandwagon as a catalyst
for change. The folks in the trenches have pledged intent to share data rather than
squirreling it away for a single purpose. There is a strong desire for everyone to
have a common understanding of the state of the business. They’re clamoring to
get rid of the isolated pockets of data while ensuring they have access to detail and
summary data at both the enterprise and line-of-business levels.
Insurance Value Chain
The primary value chain of an insurance company is seemingly short and simple. The
core processes are to issue policies, collect premium payments, and process claims.
The organization is interested in better understanding the metrics spawned by each
of these events. Users want to analyze detailed transactions relating to the formula-
tion of policies, as well as transactions generated by claims processing. They want
to measure performance over time by coverage, covered item, policyholder, and
sales distribution channel characteristics. Although some users are interested in
the enterprise perspective, others want to analyze the heterogeneous nature of the
insurance company's individual lines of business.
Obviously, an insurance company is engaged in many other external pro-
cesses, such as the investment of premium payments or compensation of contract
agents, as well as a host of internally focused activities, such as human resources,
finance, and purchasing. For now, we will focus on the core business related to
policies and claims.
The insurance value chain begins with a variety of policy transactions. Based
on your current understanding of the requirements and underlying data, you opt
to handle all the transactions impacting a policy as a single business process (and
fact table). If this perspective is too simplistic to accommodate the metrics, dimen-
sionality, or analytics required, you should handle the transaction activities as
separate fact tables, such as quoting, rating, and underwriting. As discussed in
Chapter 5: Procurement, there are trade-o s between creating separate fact tables
for each natural cluster of transaction types versus lumping the transactions into
a single fact table.
There is also a need to better understand the premium revenue associated with
each policy on a monthly basis. This will be key input into the overall profit picture.
The insurance business is very transaction intensive, but the transactions themselves
do not represent little pieces of revenue, as is the case with retail or manufactur-
ing sales. You cannot merely add up policy transactions to determine the revenue
amount. The picture is further complicated in insurance because customers pay in
advance for services. This same advance-payment model applies to organizations
o ering magazine subscriptions or extended warranty contracts. Premium payments
must be spread across multiple periods because the company earns the revenue over
time as it provides insurance coverage. The complex relationship between policy
transactions and revenue measurements often makes it impossible to answer rev-
enue questions by crawling through the individual transactions. Not only is such
crawling time-consuming, but also the logic required to interpret the effect of different transaction types on revenue can be horrendously complicated. The natural conflict between the detailed transaction view and the snapshot perspective almost
always requires building both kinds of fact tables in the warehouse. In this case,
the premium snapshot is not merely a summarization of the policy transactions; it
is quite a separate thing that comes from a separate source.
Draft Bus Matrix
Based on the interview findings, along with an understanding of the key source
systems, the team begins to draft an enterprise data warehouse bus matrix with the
core policy-centric business processes as rows and core dimensions as columns.
Two rows are defined in the matrix, one corresponding to the policy transactions
and another for the monthly premium snapshot.
As illustrated in Figure 16-1, the core dimensions include date, policyholder,
employee, coverage, covered item, and policy. When drafting the matrix, don’t
attempt to include all the dimensions. Instead, try to focus on the core common
dimensions that are reused in more than one schema.
Figure 16-1: Initial draft bus matrix, with two process rows (Policy Transactions and Premium Snapshot) and the core dimensions as columns: Date, Month, Policyholder, Agent, Employee, Coverage, Covered Item, and Policy.
Policy Transactions
Let's turn our attention to the first row of the matrix by focusing on the transactions for creating and altering a policy. Assume the policy represents a set of coverages sold to the policyholder. Coverages can be considered the insurance company's products. Homeowner coverages include fire, flood, theft, and personal liability; automobile coverages include comprehensive, collision damage, uninsured motorist, and personal liability. In a property and casualty insurance company, coverages apply to a specific covered item, such as a particular house or car. Both the coverage and covered item are carefully identified in the policy. A particular covered item
usually has several coverages listed in the policy.
Agents sell policies to policyholders. Before the policy can be created, a pricing
actuary determines the premium rate that will be charged given the specific coverages, covered items, and qualifications of the policyholder. An underwriter, who takes ultimate responsibility for doing business with the policyholder, makes the final approval.
The operational policy transaction system captures the following types of
transactions:
Create policy, alter policy, or cancel policy (with reason)
Create coverage on covered item, alter coverage, or cancel coverage (with
reason)
Rate coverage or decline to rate coverage (with reason)
Underwrite policy or decline to underwrite policy (with reason)
The grain of the policy transaction fact table should be one row for each indi-
vidual policy transaction. Each atomic transaction should be embellished with
as much context as possible to create a complete dimensional description of the
transaction. The dimensions associated with the policy transaction business pro-
cess include the transaction date, e ective date, policyholder, employee, coverage,
covered item, policy number, and policy transaction type. Now let’s further discuss
the dimensions in this schema while taking the opportunity to reinforce concepts
from earlier chapters.
Dimension Role Playing
There are two dates associated with each policy transaction. The policy transac-
tion date is the date when the transaction was entered into the operational system,
whereas the policy transaction e ective date is when the transaction legally takes
e ect. These two foreign keys in the fact table should be uniquely named. The two
independent dimensions associated with these keys are implemented using a single
physical date table. Multiple logically distinct tables are then presented to the user
through views with unique column names, as described originally in Chapter 6:
Order Management.
Slowly Changing Dimensions
Insurance companies typically are very interested in tracking changes to dimensions
over time. You can apply the three basic techniques for handling slowly changing
dimension (SCD) attributes to the policyholder dimension, as introduced in Chapter 5.
With the type 1 technique, you simply overwrite the dimension attribute’s prior
value. This is the simplest approach to dealing with attribute changes because the
attributes always represent the most current descriptors. For example, perhaps the
business agrees to handle changes to the policyholder’s date of birth as a type 1
change based on the assumption that any changes to this attribute are intended as
corrections. In this manner, all fact table history for this policyholder appears to
have always been associated with the updated date of birth.
Because the policyholder’s ZIP code is key input to the insurer’s pricing and risk
algorithms, users are very interested in tracking ZIP code changes, so the type 2
technique is used for this attribute. Type 2 is the most common SCD technique
when there’s a requirement for accurate change tracking over time. In this case, when
the ZIP code changes, you create a new policyholder dimension row with a new
surrogate key and updated geographic attributes. Do not go back and revisit the fact
table. Historical fact table rows, prior to the ZIP code change, still reflect the old surrogate key. Going forward, you use the policyholder's new surrogate key, so new fact table rows join to the post-change dimension profile. Although this technique is
extremely graceful and powerful, it places more burdens on ETL processing. Also,
the number of rows in the dimension table grows with each type 2 SCD change.
Given there might already be more than 1 million rows in your policyholder dimen-
sion table, you may opt to use a mini-dimension for tracking ZIP code changes,
which we will review shortly.
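A minimal sketch of the type 2 processing for a ZIP code change, assuming the policyholder dimension carries the usual SCD housekeeping columns (row effective/expiration dates and a current row indicator) and using hypothetical key and ZIP code values:

    -- Expire the current dimension row for this policyholder...
    UPDATE policyholder_dimension
    SET row_expiration_date   = CURRENT_DATE,
        current_row_indicator = 'Expired'
    WHERE policyholder_id = 'P-000123'          -- durable natural key (hypothetical)
      AND current_row_indicator = 'Current';

    -- ...and insert a new row with a new surrogate key and the new ZIP code.
    INSERT INTO policyholder_dimension
        (policyholder_key, policyholder_id, policyholder_zip_code,
         row_effective_date, row_expiration_date, current_row_indicator)
    VALUES
        (2000457, 'P-000123', '94025',          -- new surrogate key and ZIP (hypothetical)
         CURRENT_DATE, DATE '9999-12-31', 'Current');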
Finally, let's assume each policyholder is classified as belonging to a particular segment. Perhaps nonresidential policyholders were historically categorized as either commercial or government entities. Going forward, the business users want more detailed classifications to differentiate between large multinational, middle market, and small business commercial customers, in addition to nonprofit organizations and governmental agencies. For a period of time, users want the ability to analyze results by either the historical or new segment classifications. In this case you could use a type 3 approach to track the change for a period of time by adding a column, labeled Historical for differentiation, to retain the old classifications. The new classification values would populate the segment attribute that has been a permanent fixture on the policyholder dimension. This approach, although not extremely common, allows you to see performance by either the current or historical segment maps. This is useful when there's been an en masse change, such as the customer classification realignment. Obviously, the type 3 technique becomes
overly complex if you need to track more than one version of the historical map or
before-and-after changes for multiple dimension attributes.
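A sketch of the type 3 change, again against a hypothetical policyholder dimension; the realignment mapping itself (here driven by an assumed annual_revenue attribute) is purely illustrative:

    -- Preserve the old classification in a new "historical" column...
    ALTER TABLE policyholder_dimension ADD COLUMN historical_segment VARCHAR(50);

    UPDATE policyholder_dimension
    SET historical_segment = segment;

    -- ...then repopulate the primary segment attribute from the new scheme,
    -- for example splitting the old Commercial category by size.
    UPDATE policyholder_dimension
    SET segment = 'Small Business Commercial'
    WHERE segment = 'Commercial'
      AND annual_revenue < 10000000;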
Mini-Dimensions for Large or Rapidly Changing Dimensions
As mentioned earlier, the policyholder dimension qualifies as a large dimension with
more than 1 million rows. It is often important to accurately track content values
for a subset of attributes. For example, you need an accurate description of some
policyholder and covered item attributes at the time the policy was created, as well
as at the time of any adjustment or claim. As discussed in Chapter 5, the practical
way to track changing attributes in large dimensions is to split the closely monitored,
more rapidly changing attributes into one or more type 4 mini-dimensions directly
linked to the fact table with a separate surrogate key. The use of mini-dimensions
has an impact on the e ciency of attribute browsing because users typically want
to browse and constrain on these changeable attributes. If all possible combina-
tions of the attribute values in the mini-dimension have been created, handling a
mini-dimension change simply means placing a di erent key in the fact table row
from a certain point in time forward. Nothing else needs to be changed or added
to the database.
The covered item is the house, car, or other specific insured item. The cov-
ered item dimension contains one row for each actual covered item. The covered
item dimension is usually somewhat larger than the policyholder dimension, so
it’s another good place to consider deploying a mini-dimension. You do not want
to capture the variable descriptions of the physical covered objects as facts because
most are textual and are not numeric or continuously valued. You should make every
e ort to put textual attributes into dimension tables because they are the target of
textual constraints and the source of report labels.
Multivalued Dimension Attributes
We discussed multivalued dimension attributes when we associated multiple skills
with an employee in Chapter 9: Human Resources Management. In Chapter 10:
Financial Services, we associated multiple customers with an account, and then
in Chapter 14: Healthcare, we modeled a patient’s multiple diagnoses. In this
case study, you’ll look at another multivalued modeling situation: the relationship
between commercial customers and their industry classifications.
Each commercial customer may be associated with one or more Standard Industrial Classification (SIC) or North American Industry Classification System (NAICS) codes. A large, diversified commercial customer could be represented by a dozen or more classification codes. Much like you did with Chapter 14's diagnosis group, a bridge table ties together all the industry classification codes within a group. This industry classification bridge table joins directly to either the fact table or the customer dimension as an outrigger. It enables you to report fact table metrics by any industry classification. If the commercial customer's industry breakdown is proportionally identified, such as 50 percent agricultural services, 30 percent dairy products, and
20 percent oil and gas drilling, a weighting factor should be included on each
bridge table row. To handle the case in which no valid industry code is associated
with a given customer, you simply create a special bridge table row that represents
Unknown.
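A sketch of a weighted report using such a bridge, assuming the policyholder dimension carries an industry group key as an outrigger, with illustrative table and column names throughout:

    -- Policy transaction dollars allocated to industry classifications
    -- using the bridge table's weighting factor.
    SELECT
        ind.industry_description,
        SUM(f.policy_transaction_dollar_amount * b.weighting_factor)
            AS weighted_transaction_amount
    FROM policy_transaction_fact f
    JOIN policyholder_dimension p
      ON p.policyholder_key = f.policyholder_key
    JOIN industry_group_bridge b
      ON b.industry_group_key = p.industry_group_key
    JOIN industry_dimension ind
      ON ind.industry_key = b.industry_key
    GROUP BY ind.industry_description;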
Numeric Attributes as Facts or Dimensions
Let’s move on to the coverage dimension. Large insurance companies have dozens
or even hundreds of separate coverage products available to sell for a given type of
covered item. The actual appraised value of a specific covered item, like someone's house, is a continuously valued numeric quantity that can even vary for a given item over time, so treat it as a legitimate fact. In the dimension table, you could store a more descriptive value range, such as $250,000 to $299,999 Appraised Value, for grouping and filtering. The basic coverage limit is likely to be more standardized
and not continuously valued, like Replacement Value or Up to $250,000. In this
case, it would also be treated as a dimension attribute.
Degenerate Dimension
The policy number will be handled as a degenerate dimension if you have extracted
all the policy header information into other dimensions. You obviously want to avoid
creating a policy transaction fact table with just a small number of keys while embed-
ding all the descriptive details (including the policyholder, dates, and coverages) in
an overloaded policy dimension. In some cases, there may be one or two attributes
that still belong to the policy and not to another dimension. For example, if the
underwriter establishes an overall risk grade for the policy based on the totality of
the coverages and covered items, then this risk grade probably belongs in a policy
dimension. Of course, then the policy number is no longer a degenerate dimension.
Low Cardinality Dimension Tables
The policy transaction type dimension is a small dimension for the transaction types
listed earlier with reason descriptions. A transaction type dimension might contain
less than 50 rows. Even though this table is both narrow in terms of the number
of columns and shallow in terms of the number of rows, the attributes should still
be handled in a dimension table; if the textual characteristics are used for query
filtering or report labeling, then they belong in a dimension.
Audit Dimension
You have the option to associate ETL process metadata with transaction fact rows
by including a key that links to an audit dimension row created by the extract
process. As discussed in Chapter 6, each audit dimension row describes the data
lineage of the fact row, including the time of the extract, source table, and extract
software version.
Policy Transaction Fact Table
The policy transaction fact table in Figure 16-2 illustrates several characteristics of
a classic transaction grain fact table. First, the fact table consists almost entirely
of keys. Transaction schemas enable you to analyze behavior in extreme detail. As
you descend to lower granularity with atomic data, the fact table naturally sprouts
more dimensionality. In this case, the fact table has a single numeric fact; interpreta-
tion of the fact depends on the corresponding transaction type dimension. Because
there are different kinds of transactions in the same fact table, in this scenario, you cannot label the fact more specifically.
Figure 16-2: Policy transaction schema. The Policy Transaction Fact table contains Policy Transaction Date Key (FK), Policy Effective Date Key (FK), Policyholder Key (FK), Employee Key (FK), Coverage Key (FK), Covered Item Key (FK), Policy Transaction Type Key (FK), Policy Transaction Audit Key (FK), Policy Number (DD), Policy Transaction Number (DD), and Policy Transaction Dollar Amount, joined to the date dimension (2 views for roles) and the policyholder, employee, coverage, covered item, policy transaction type, and policy transaction audit dimensions.
Heterogeneous Supertype and Subtype Products
Although there is strong support for an enterprise-wide perspective at our insur-
ance company, the business users don’t want to lose sight of their line-of-business
specifics. Insurance companies typically are involved in multiple, very different lines of business. For example, the detailed parameters of homeowners' coverages differ significantly from automobile coverages. And these both differ substantially from personal property coverage, general liability coverage, and other types of insurance. Although all coverages can be coded into the generic structures used so far in this chapter, insurance companies want to track numerous specific attributes that make
sense only for a particular coverage and covered item. You can generalize the initial
schema developed in Figure 16-2 by using the supertype and subtype technique
discussed in Chapter 10.
Figure 16-3 shows a schema to handle the specific attributes that describe automobiles and their coverages. For each line of business (or coverage type), subtype dimension tables for both the covered item and associated coverage are created. When a BI application needs the specific attributes of a single coverage type, it uses the appropriate subtype dimension tables.
Notice in this schema that you don't need separate line-of-business fact tables
because the metrics don’t vary by business, but you’d likely put a view on the
supertype fact table to present only rows for a given subtype. The subtype dimen-
sion tables are introduced to handle the special line-of-business attributes. No
new keys need to be generated; logically, all we are doing is extending existing
dimension rows.
Complementary Policy Accumulating Snapshot
Finally, before leaving policy transactions, you should consider the use of an accu-
mulating snapshot to capture the cumulative effect of the transactions. In this
scenario, the grain of the fact table likely would be one row for each coverage and
covered item on a policy. You can envision including policy-centric dates, such as
quoted, rated, underwritten, effective, renewed, and expired. Likewise, multiple
employee roles could be included on the fact table for the agent and underwriter.
Many of the other dimensions discussed would be applicable to this schema, with
the exception of the transaction type dimension. The accumulating snapshot likely
would have an expanded fact set.
Figure 16-3: Policy transaction schema with subtype automobile dimension tables. The Policy Transaction Fact table carries the same keys and the Policy Transaction Dollar Amount fact as Figure 16-2, and joins to an Automobile Coverage Dimension (Coverage Key (PK), Coverage Description, Line of Business Description, Limit, Deductible, Rental Car Coverage, Windshield Coverage) and an Automobile Covered Item Dimension (Covered Item Key (PK), Covered Item Description, Vehicle Manufacturer, Vehicle Make, Vehicle Year, Vehicle Classification, Vehicle Engine Size, Vehicle Appraised Value Range).
As discussed in Chapter 4: Inventory, an accumulating snapshot is effective for
representing information about a pipeline process’s key milestones. It captures the
cumulative lifespan of a policy, covered items, and coverages; however, it does not
store information about each and every transaction that occurred. Unusual trans-
actional events or unexpected outliers from the standard pipeline would likely be
masked with an accumulating perspective. On the other hand, an accumulating
snapshot, sourced from the transactions, provides a clear picture of the durations
or lag times between key process events.
Premium Periodic Snapshot
The policy transaction schema is useful for answering a wide range of questions.
However, the blizzard of transactions makes it di cult to quickly determine the
status or fi nancial value of an in-force policy at a given point in time. Even if all
the necessary detail lies in the transaction data, a snapshot perspective would
require rolling the transactions forward from the beginning of history taking into
account complicated business rules for when earned revenue is recognized. Not
only is this nearly impractical on a single policy, but it is ridiculous to think about
generating summary top line views of key performance metrics in this manner.
The answer to this dilemma is to create a separate fact table that operates as a
companion to the policy transaction table. In this case, the business process is the
monthly policy premium snapshot. The granularity of the fact table is one row per
coverage and covered item on a policy each month.
Conformed Dimensions
Of course, when designing the premium periodic snapshot table, you should strive to
reuse as many dimensions from the policy transaction table as possible. Hopefully,
you have become a conformed dimension enthusiast by now. As described in Chapter
4, conformed dimensions used in separate fact tables either must be identical or
must represent a shrunken subset of the attributes from the granular dimension.
The policyholder, covered item, and coverage dimensions would be identical.
The daily date dimension would be replaced with a conformed month dimension
table. You don’t need to track all the employees who were involved in policy trans-
actions on a monthly basis; it may be useful to retain the involved agent, especially
because field operations are so focused on ongoing revenue performance analysis.
The transaction type dimension would not be used because it does not apply at the
periodic snapshot granularity. Instead, you introduce a status dimension so users
can quickly discern the current state of a coverage or policy, such as new policies
or cancellations this month and over time.
Conformed Facts
While we’re on the topic of conformity, you also need to use conformed facts. If the
same facts appear in multiple fact tables, such as facts common to this snapshot fact
table as well as the consolidated fact table we’ll discuss later in this chapter, then
they must have consistent definitions and labels. If the facts are not identical, then they need to be given different names.
Pay-in-Advance Facts
Business management wants to know how much premium revenue was written (or
sold) each month, as well as how much revenue was earned. Although a policyholder
may contract and pay for coverages on covered items for a period of time, the revenue
is not earned until the service is provided. In the case of the insurance company,
the revenue from a policy is earned month by month as long as the policyholder
doesn’t cancel. The correct calculation of a metric like earned premium would
mean fully replicating all the business rules of the operational revenue recognition
system within the BI application. Typically, the rules for converting a transaction
amount into its monthly revenue impact are complex, especially with mid-month
coverage upgrades and downgrades. Fortunately, these metrics can be sourced from
a separate operational system.
As illustrated in Figure 16-4, we include two premium revenue metrics in the
periodic snapshot fact table to handle the different definitions of written versus
earned premium. Simplistically, if an annual policy for a given coverage and cov-
ered item was written on January 1 for a cost of $600, then the written premium
for January would be $600, but the earned premium is $50 ($600 divided by 12
months). In February, the written premium is zero and the earned premium is still
$50. If the policy is canceled on March 31, the earned premium for March is $50,
while the written premium is a negative $450. Obviously, at this point the earned
revenue stream comes to a crashing halt.
Figure 16-4: Periodic premium snapshot schema. The Premium Snapshot Fact table contains Month End Snapshot Date Key (FK), Policyholder Key (FK), Agent Key (FK), Coverage Key (FK), Covered Item Key (FK), Policy Status Key (FK), Policy Number (DD), Written Premium Revenue Amount, and Earned Premium Revenue Amount, joined to the month end, policyholder, agent, coverage, covered item, and policy status dimensions.
Pay-in-advance business scenarios typically require the combination of transac-
tion and monthly snapshot fact tables to answer questions of transaction frequency
and timing, as well as questions of earned income in a given month. You can almost
never add enough facts to a snapshot schema to do away with the need for a trans-
action schema, or vice versa.
Heterogeneous Supertypes and Subtypes Revisited
We are again confronted with the need to look at the snapshot data with more spe-
cific line-of-business attributes, and grapple with snapshot facts that vary by line of business. Because the custom facts for each line are incompatible with each other, most of the fact row would be filled with nulls if you include all the line-of-business
facts on every row. In this scenario, the answer is to separate the monthly snap-
shot fact table physically by line of business. You end up with the single supertype
monthly snapshot schema and a series of subtype snapshots, one for each line of
business or coverage type. Each of the subtype snapshot fact tables is a copy of a
segment of the supertype fact table for just those coverage keys and covered item
keys belonging to a particular line of business. We include the supertype facts as
a convenience so analyses within a coverage type can use both the supertype and
custom subtype facts without accessing two large fact tables.
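The physical separation can be sketched as two parallel table definitions. This is a
minimal sketch only; the physical names, data types, and the two auto-specific custom
facts are hypothetical illustrations rather than columns prescribed by the case study.

-- Supertype snapshot: one row per policy, coverage, and covered item per month,
-- carrying only the facts common to every line of business.
CREATE TABLE premium_snapshot_fact (
    month_end_snapshot_date_key     INTEGER      NOT NULL,
    policyholder_key                INTEGER      NOT NULL,
    agent_key                       INTEGER      NOT NULL,
    coverage_key                    INTEGER      NOT NULL,
    covered_item_key                INTEGER      NOT NULL,
    policy_status_key               INTEGER      NOT NULL,
    policy_number                   VARCHAR(20)  NOT NULL,
    written_premium_revenue_amount  DECIMAL(12,2),
    earned_premium_revenue_amount   DECIMAL(12,2)
);

-- Auto subtype snapshot: same grain, restricted to auto coverages, repeating the
-- supertype facts for convenience and adding auto-only custom facts (hypothetical).
CREATE TABLE auto_premium_snapshot_fact (
    month_end_snapshot_date_key     INTEGER      NOT NULL,
    policyholder_key                INTEGER      NOT NULL,
    agent_key                       INTEGER      NOT NULL,
    coverage_key                    INTEGER      NOT NULL,
    covered_item_key                INTEGER      NOT NULL,
    policy_status_key               INTEGER      NOT NULL,
    policy_number                   VARCHAR(20)  NOT NULL,
    written_premium_revenue_amount  DECIMAL(12,2),
    earned_premium_revenue_amount   DECIMAL(12,2),
    collision_deductible_amount     DECIMAL(12,2),
    uninsured_motorist_limit_amount DECIMAL(12,2)
);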
Multivalued Dimensions Revisited
Automobile insurance provides another opportunity to discuss multivalued dimen-
sions. Often multiple insured drivers are associated with a policy. You can construct
a bridge table, as illustrated in Figure 16-5, to capture the relationship between the
insured drivers and policy. In this case the insurance company can assign realistic
weighting factors based on each driver's share of the total premium cost.
Figure 16-5: Bridge table for multiple drivers on a policy. The Premium Snapshot Fact
table (Month End Snapshot Date Key, Policyholder Key, Policy Key, additional foreign
keys, Written Premium Revenue Amount, and Earned Premium Revenue Amount) joins to the
Policy-Insured Driver Bridge (Policy Key, Insured Driver Key, Weighting Factor), which
joins to the Insured Driver Dimension (Insured Driver Key, Insured Driver Name,
Insured Driver Address Attributes, Insured Driver Date of Birth, and Insured Driver
Risk Segment).
Because these relationships may change over time, you can add effective and
expiration dates to the bridge table. Before you know it, you end up with a factless
fact table to capture the evolving relationships between a policy, policyholder,
covered item, and insured driver over time.
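For reporting, the weighting factor is applied as facts are joined through the bridge,
so premium is allocated across drivers without double counting. The following is a
minimal sketch with hypothetical physical names based on Figure 16-5; the effective
and expiration date columns reflect the time-variant bridge just described.

-- Earned premium allocated to insured drivers by risk segment.
-- The weighting factor apportions each policy's premium across its drivers.
SELECT
    d.insured_driver_risk_segment,
    SUM(f.earned_premium_revenue_amount * b.weighting_factor) AS weighted_earned_premium
FROM premium_snapshot_fact f
JOIN month_end_dim m
  ON m.month_end_snapshot_date_key = f.month_end_snapshot_date_key
JOIN policy_insured_driver_bridge b
  ON b.policy_key = f.policy_key
 -- qualify the bridge rows to those in effect for the snapshot month
 AND m.month_end_date BETWEEN b.effective_date AND b.expiration_date
JOIN insured_driver_dim d
  ON d.insured_driver_key = b.insured_driver_key
GROUP BY d.insured_driver_risk_segment;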
More Insurance Case Study Background
Unfortunately, the insurance business has a downside. We learn from the inter-
viewees that there's more to life than collecting premium revenue payments. The
main costs in this industry result from claim losses. After a policy is in effect, then
a claim can be made against a specific coverage and covered item. A claimant, who
may be the policyholder or a new party not previously known to the insurance
company, makes the claim. When the insurance company opens a new claim, a
reserve is usually established. The reserve is a preliminary estimate of the insurance
company’s eventual liability for the claim. As further information becomes known,
this reserve can be adjusted.
Before the insurance company pays any claim, there is usually an investigative
phase where the insurance company sends out an adjuster to examine the covered
item and interview the claimant, policyholder, or other individuals involved. The
investigative phase produces a stream of task transactions. In complex claims,
various outside experts may be required to pass judgment on the claim and the
extent of the damage.
In most cases, after the investigative phase, the insurance company issues a
number of payments. Many of these payments go to third parties such as doctors,
lawyers, or automotive body shop operators. Some payments may go directly to
the claimant. It is important to clearly identify the employee responsible for every
payment made against an open claim.
The insurance company may take possession of the covered item after replacing
it for the policyholder or claimant. If the item has any remaining value, salvage pay-
ments received by the insurance company are a credit against the claim accounting.
Eventually, the payments are completed and the claim is closed. If nothing
unusual happens, this is the end of the transaction stream generated by the claim.
However, in some cases, further claim payments or claimant lawsuits may force
a claim to be reopened. An important measure for an insurance company is how
often and under what circumstances claims are reopened.
In addition to analyzing the detailed claims processing transactions, the insur-
ance company also wants to understand what happens over the life of a claim. For
example, the time lag between the claim open date and the first payment date is an
important measure of claims processing efficiency.
Updated Insurance Bus Matrix
With a better understanding of the claims side of the business, the draft matrix from
Figure 16-1 needs to be revisited. Based on the new requirements, you add another
row to the matrix to accommodate claim transactions, as shown in Figure 16-6.
Many of the dimensions identified earlier in the project will be reused; you add new
columns to the matrix for the claim, claimant, and third-party payee.
Figure 16-6: Updated insurance bus matrix. The matrix rows are Policy Transactions,
Premium Snapshot, and Claim Transactions; the dimension columns are Date, Month,
Policyholder, Agent, Coverage, Covered Item, Policy, Employee, Claim, Claimant, and
3rd Party Payee.
Detailed Implementation Bus Matrix
DW/BI teams sometimes struggle with the level of detail captured in an enterprise
data warehouse bus matrix. In the planning phase of an architected DW/BI project, it
makes sense to stick with rather high-level business processes (or sources). Multiple
fact tables at di erent levels of granularity may result from each of these business
process rows. In the subsequent implementation phase, you can take a subset of
the matrix to a lower level of detail by refl ecting all the fact tables or OLAP cubes
resulting from the process as separate matrix rows. At this point the matrix can
be enhanced by adding columns to refl ect the granularity and metrics associated
with each fact table or cube. Figure 16-7 illustrates a more detailed implementation
bus matrix.
Claim Transactions
The operational claim processing system generates a slew of transactions, including
the following transaction task types:
Open claim, reopen claim, close claim
Set reserve, reset reserve, close reserve
Set salvage estimate, receive salvage payment
Adjuster inspection, adjuster interview
Open lawsuit, close lawsuit
Make payment, receive payment
Subrogate claim
When updating the Figure 16-6 bus matrix, you determine that this schema uses
a number of dimensions developed for the policy world. You again have two role-
playing dates associated with the claim transactions. Unique column labels should
distinguish the claim transaction and effective dates from those associated with
policy transactions. The employee is the employee involved in the transactional
task. As mentioned in the business case, this is particularly interesting for payment
authorization transactions. The claim transaction type dimension would include
the transaction types and groupings just listed.
As shown in Figure 16-8, there are several new dimensions in the claim transac-
tion fact table. The claimant is the party making the claim, typically an individual.
The third-party payee may be either an individual or commercial entity. Both the
claimant and payee dimensions usually are dirty dimensions because of the diffi-
culty of reliably identifying them across claims. Unscrupulous potential payees
may go out of their way not to identify themselves in a way that would easily tie
them to other claims in the insurance company’s system.
Figure 16-7: Detailed implementation bus matrix. Each business process row from
Figure 16-6 is broken down into its fact tables/OLAP cubes, with granularity and facts:
Policy Transactions: Corporate Policy Transactions (1 row for every policy
transaction; Policy Transaction Amount), Auto Policy Transactions (1 row per auto
policy transaction; Policy Transaction Amount), and Home Policy Transactions (1 row
per home policy transaction; Policy Transaction Amount).
Claim Transactions: Claim Transactions (1 row for every claim task transaction; Claim
Transaction Amount), Claim Workflow (1 row per claim; Original Reserve, Estimate,
Current Reserve, Claim Paid, Salvage Collected, and Subro Collected Amounts; Loss to
Open, Open to Estimate, Open to 1st Payment, Open to Subro, and Open to Closed Lags;
# of Transactions), and Accident Involvements (1 row per loss party and affiliation
on an auto claim; Accident Involvement Count).
Policy Premium Snapshot: Corporate Policy Premiums (1 row for every policy, covered
item and coverage per month), Auto Policy Premiums (1 row per auto policy, covered
item and coverage per month), and Home Policy Premiums (1 row per home policy,
covered item and coverage per month), each with Written Premium Revenue and Earned
Premium Revenue Amounts.
The matrix columns also mark the participating dimensions for each row: Date
(transaction and effective roles), Policyholder, Coverage, Employee, Agent, Policy,
Covered Item, Claimant, 3rd Party Payee, and Claim.
Figure 16-8: Claim transaction schema. The Claim Transaction Fact table contains
Claim Transaction Date Key, Claim Transaction Effective Date Key, Policyholder Key,
Claim Transaction Employee Key, Agent Key, Coverage Key, Covered Item Key, Claimant
Key, 3rd Party Payee Key, Claim Transaction Type Key, Claim Profile Key, Claim Key,
Policy Number (DD), Claim Transaction Number (DD), and Claim Transaction Dollar
Amount. It joins to the Date dimension (2 views for roles), Employee dimension (2
views for roles), and the Policyholder, Agent, Coverage, Covered Item, Claimant, 3rd
Party Payee, Claim Transaction Type, Claim Profile, and Claim dimensions.
Transaction Versus Profile Junk Dimensions
Beyond the reused dimensions from the policy-centric schemas and the new claim-
centric dimensions just listed, there are a large number of indicators and descriptions
related to a claim. Designers are sometimes tempted to dump all these descriptive
attributes into a claim dimension. This approach makes sense for high-cardinality
descriptors, such as the specific address where the loss occurred or a narrative
describing the event. However, in general, you should avoid creating dimensions
with the same number of rows as the fact table.
As we described in Chapter 6, low-cardinality codified data, like the method
used to report the loss or an indicator denoting whether the claim resulted from a
catastrophic event, are better handled in a junk dimension. In this case, the junk
dimension would more appropriately be referred to as the claim profile dimension
with one row per unique combination of profile attributes. Grouping or filtering on
the profile attributes would yield faster query responses than if they were alterna-
tively handled as claim dimension attributes.
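One common way to build such a junk dimension is to load only the combinations of
low-cardinality attributes actually observed in the source. The sketch below assumes
a hypothetical claim_staging table and three illustrative profile attributes; it shows
an initial load, with subsequent loads inserting only combinations not already present.

-- Initial load of the claim profile junk dimension: one row per observed
-- combination of low-cardinality claim descriptors (attribute names are illustrative).
INSERT INTO claim_profile_dim
    (claim_profile_key, loss_report_method, catastrophe_indicator, litigation_indicator)
SELECT
    ROW_NUMBER() OVER (ORDER BY loss_report_method,
                                catastrophe_indicator,
                                litigation_indicator) AS claim_profile_key,
    loss_report_method,
    catastrophe_indicator,
    litigation_indicator
FROM (
    SELECT DISTINCT loss_report_method, catastrophe_indicator, litigation_indicator
    FROM claim_staging
) AS observed_combinations;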
Claim Accumulating Snapshot
Even with a robust transaction schema, there is a whole class of urgent business
questions that can’t be answered using only transaction detail. It is di cult to
derive claim-to-date performance measures by traversing through every detailed
claim task transaction from the beginning of the claim’s history and appropriately
applying the transactions.
On a periodic basis, perhaps at the close of each day, you can roll forward all the
transactions to update an accumulating claim snapshot incrementally. The granu-
larity is one row per claim; the row is created once when the claim is opened and
then is updated throughout the life of a claim until it is finally closed.
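The nightly roll-forward can be expressed as an update keyed on the claim. The sketch
below uses PostgreSQL-style UPDATE ... FROM syntax and hypothetical physical names;
only two of the accumulating measures are shown, and the payment transaction type
label and date key value are illustrative.

-- Roll today's claim transactions forward into the accumulating snapshot.
UPDATE claim_workflow_fact
SET claim_paid_to_date_dollar_amount =
        claim_workflow_fact.claim_paid_to_date_dollar_amount + t.paid_today,
    number_of_claim_transactions =
        claim_workflow_fact.number_of_claim_transactions + t.transactions_today
FROM (
    SELECT f.claim_key,
           SUM(CASE WHEN ty.claim_transaction_type = 'Make Payment'
                    THEN f.claim_transaction_dollar_amount ELSE 0 END) AS paid_today,
           COUNT(*) AS transactions_today
    FROM claim_transaction_fact f
    JOIN claim_transaction_type_dim ty
      ON ty.claim_transaction_type_key = f.claim_transaction_type_key
    WHERE f.claim_transaction_date_key = 20130415  -- today's load date (illustrative)
    GROUP BY f.claim_key
) AS t
WHERE claim_workflow_fact.claim_key = t.claim_key;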
Many of the dimensions are reusable, conformed dimensions, as illustrated in
Figure 16-9. You should include more dates in this fact table to track the key claim
milestones and deliver time lags. These lags may be the raw difference between
two dates, or they may be calculated in a more sophisticated way by accounting for
only workdays in the calculations. A status dimension is added to quickly identify
all open, closed, or reopened claims, for example. Transaction-specific dimensions
such as employee, payee, and claim transaction type are suppressed, whereas the
list of additive, numeric measures has been expanded.
Figure 16-9: Claim accumulating snapshot schema. The Claim Workflow Fact table
contains seven role-playing date keys (Claim Open, Claim Loss, Claim Estimate, Claim
1st Payment, Claim Most Recent Payment, Claim Subrogation, and Claim Close Date
Keys), plus Policyholder, Claim Supervisor, Agent, Coverage, Covered Item, Claimant,
Claim Status, Claim Profile, and Claim Keys and the Policy Number (DD). Its facts are
the Original Reserve, Estimate, Current Reserve to Date, Claim Paid to Date, Salvage
Collected to Date, and Subro Payment Collected to Date Dollar Amounts; the Claim Loss
to Open, Open to Estimate, Open to 1st Payment, Open to Subrogation, and Open to
Closed Lags; and the Number of Claim Transactions. It joins to the Date dimension (7
views for roles), Employee dimension (2 views for roles), and the Policyholder,
Coverage, Covered Item, Claimant, Claim Status, Claim Profile, and Claim dimensions.
Accumulating Snapshot for Complex Workflows
Accumulating snapshot fact tables are typically appropriate for predictable workflows
with well-established milestones. They usually have five to 10 key milestone dates
representing the pipeline’s start, completion, and key events in between. However,
sometimes workflows are less predictable. They still have a definite start and end
date, but the milestones in between are numerous and less stable. Some occurrences
may skip over some intermediate milestones, but there’s no reliable pattern.
In this situation, the first task is to identify the key dates that link to role-playing
date dimensions. These dates represent the most important milestones. The start and
end dates for the process would certainly qualify; in addition, you should consider
other commonly occurring critical milestones. These dates (and their associated
dimensions) will be used extensively for BI application filtering.
However, if the number of additional milestones is both voluminous and unpre-
dictable, they can’t all be handled as additional date foreign keys in the fact table.
Typically, business users are more interested in the lags between these milestones,
rather than filtering or grouping on the dates themselves. If there were a total of
20 potential milestone events, there would be 190 potential lag durations: event
A-to-B, A-to-C, … (19 possible lags from event A), B-to-C, … (18 possible lags from
event B), and so on. Instead of physically storing 190 lag metrics, you can get away
with just storing 19 of them and then calculate the others. Because every pipeline
occurrence starts by passing through milestone A, which is the workflow begin
date, you could store all 19 lags from the anchor event A and then calculate the
other variations. For example, if you want to know the lag from B-to-C, take
the A-to-C lag value and subtract the A-to-B lag. If there happens to be a null for one
of the lags involved in a calculation, then the result also needs to be null because
one of the events never occurred. But such a null result is handled gracefully if you
are counting or averaging that lag across a number of claim rows.
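In SQL, the derived lags fall out naturally because arithmetic on a NULL stored lag
yields NULL, and aggregate functions then ignore those rows. A minimal sketch against
the Figure 16-9 fact table, using hypothetical physical column names for the stored
anchor lags:

-- Derive the estimate-to-first-payment lag from two lags stored off the anchor
-- (open) milestone; a NULL stored lag propagates, so claims missing either
-- milestone simply drop out of the average.
SELECT
    AVG(claim_open_to_1st_payment_lag - claim_open_to_estimate_lag)
        AS avg_estimate_to_first_payment_lag
FROM claim_workflow_fact;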
Timespan Accumulating Snapshot
An accumulating snapshot does a great job presenting a workflow's current state,
but it obliterates the intermediate states. For example, a claim can move in and out
of various states such as opened, denied, closed, disputed, opened again, and closed
again. The claim transaction fact table will have separate rows for each of these
events, but as discussed earlier, it doesn’t accumulate metrics across transactions;
trying to re-create the evolution of a workflow from these transactional events would
be a nightmare. Meanwhile, a classic accumulating snapshot doesn’t allow you to
re-create the claim workflow at any arbitrary date in the past.
Alternatively, you could add effective and expiration dates to the accumulating
snapshot. In this scenario, instead of destructively updating each row as changes
occur, you add a new row that preserves the state of a claim for a span of time.
Similar to a type 2 slowly changing dimension, the fact row includes the following
additional columns:
Snapshot start date
Snapshot end date (updated when a new row for a given claim is added)
Snapshot current flag (updated when a new row is added)
Most users are only interested in the current view provided by a classic accumu-
lating snapshot; you can meet their needs by defining a view that filters the historical
snapshot rows based on the current flag. The minority of users and reports who
need to look at the pipeline as of any arbitrary date in the past can do so by filtering
on the snapshot start and end dates.
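A minimal sketch of both access paths follows, assuming a hypothetical physical name
for the timespan fact table and 'Y'/'N' values for the current flag:

-- Current view: what a classic accumulating snapshot would show.
CREATE VIEW current_claim_snapshot AS
SELECT *
FROM claim_workflow_timespan_fact
WHERE snapshot_current_flag = 'Y';

-- Point-in-time view: the pipeline as it stood on an arbitrary historical date.
SELECT *
FROM claim_workflow_timespan_fact
WHERE DATE '2013-06-30' BETWEEN snapshot_start_date AND snapshot_end_date;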
The timespan accumulating snapshot fact table is more complicated to main-
tain than a standard accumulating snapshot, but the logic is similar. Where the
classic accumulating snapshot updates a row, the timespan snapshot updates
the administrative columns on the row formerly known as current, and inserts
a new row.
Periodic Instead of Accumulating Snapshot
In cases where a claim is not so short-lived, such as with long-term disability or
bodily injury claims that have a multiyear life span, you may represent the snapshot
as a periodic snapshot rather than an accumulating snapshot. The grain of the peri-
odic snapshot would be one row for every active claim at a regular snapshot interval,
such as monthly. The facts would represent numeric, additive facts that occurred
during the period such as amount claimed, amount paid, and change in reserve.
Policy/Claim Consolidated Periodic Snapshot
With the fact tables designed thus far, you can deliver a robust perspective of the pol-
icy and claim transactions, in addition to snapshots from both processes. However,
the business users are also interested in profit metrics. Although premium revenue
and claim loss financial metrics could be derived by separately querying two fact
tables and then combining the results set, you opt to go the next step in the spirit
of ease of use and performance for this common drill-across requirement.
You can construct another fact table that brings together the premium revenue
and claim loss metrics, as shown in Figure 16-10. This table has a reduced set
of dimensions corresponding to the lowest level of granularity common to both
processes. As discussed in Chapter 7: Accounting, this is a consolidated fact table
because it combines data from multiple business processes. It is best to develop
consolidated fact tables after the base metrics have been delivered in separate atomic
dimensional models.
Figure 16-10: Policy/claim consolidated fact table. The Consolidated Premium/Loss
Fact table (Month End Snapshot Date Key, Policyholder Key, Coverage Key, Covered Item
Key, Agent Key, Policy Status Key, Claim Status Key, Policy Number (DD), Written
Premium Revenue, Earned Premium Revenue, Claim Paid, and Claim Collected Dollar
Amounts) joins to the Month End Date, Policyholder, Coverage, Covered Item, Agent,
Policy Status, and Claim Status dimensions.
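The consolidated table essentially stores the result of the drill-across query that
business users would otherwise run themselves. A sketch of that query follows; the
physical names are hypothetical, the date dimension is assumed to carry a month-end
rollup key, and the 'Make Payment' transaction type label is illustrative. Each
process is summarized separately to the common grain, and the two result sets are
then merged on the conformed dimension keys.

-- Drill-across premium and claim payments at the month/policyholder/coverage grain.
SELECT
    COALESCE(p.month_key, c.month_key)               AS month_key,
    COALESCE(p.policyholder_key, c.policyholder_key) AS policyholder_key,
    COALESCE(p.coverage_key, c.coverage_key)         AS coverage_key,
    p.earned_premium,
    c.claim_paid
FROM (
    SELECT month_end_snapshot_date_key AS month_key,
           policyholder_key,
           coverage_key,
           SUM(earned_premium_revenue_amount) AS earned_premium
    FROM premium_snapshot_fact
    GROUP BY month_end_snapshot_date_key, policyholder_key, coverage_key
) AS p
FULL OUTER JOIN (
    SELECT d.month_end_date_key AS month_key,   -- assumed month rollup in the date dimension
           f.policyholder_key,
           f.coverage_key,
           SUM(f.claim_transaction_dollar_amount) AS claim_paid
    FROM claim_transaction_fact f
    JOIN date_dim d
      ON d.date_key = f.claim_transaction_date_key
    JOIN claim_transaction_type_dim ty
      ON ty.claim_transaction_type_key = f.claim_transaction_type_key
    WHERE ty.claim_transaction_type = 'Make Payment'
    GROUP BY d.month_end_date_key, f.policyholder_key, f.coverage_key
) AS c
  ON  c.month_key = p.month_key
  AND c.policyholder_key = p.policyholder_key
  AND c.coverage_key = p.coverage_key;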
Factless Accident Events
We earlier described factless fact tables as the collision of keys at a point in space
and time. In the case of an automobile insurer, you can record literal collisions using
a factless fact table. In this situation, the fact table registers the many-to-many cor-
relations between the loss parties and loss items, or put in laymen’s terms, all the
correlations between the people and vehicles involved in an accident.
Two new dimensions appear in the factless fact table shown in Figure 16-11. The
loss party captures the individuals involved in the accident, whereas the loss party
role identifies them as passengers, witnesses, legal representation, or some other
capacity. As we did in Chapter 3: Retail Sales, we include a fact that is always valued
at 1 to facilitate counting and aggregation. This factless fact table can represent
complex accidents involving many individuals and vehicles because the number
of involved parties with various roles is open-ended. When there is more than one
claimant or loss party associated with an accident, you can optionally treat these
dimensions as multivalued dimensions using claimant group and loss party group
bridge tables. This has the advantage that the grain of the fact table is preserved
as one record per accident claim. Either schema variation could answer questions
such as, “How many bodily injury claims did you handle where ABC Legal Partners
represented the claimant and EZ-Dent-B-Gone body shop performed the repair?”
Figure 16-11: Factless fact table for accident involvements. The Accident Involvement
Fact table (Claim Loss Date Key, Policyholder Key, Coverage Key, Covered Item Key,
Claimant Key, Loss Party Key, Loss Party Role Key, Claim Profile Key, Claim Number
(DD), Policy Number (DD), and Accident Involvement Count (=1)) joins to the Claim Loss
Date, Policyholder, Coverage, Covered Item, Claimant, Loss Party, Loss Party Role, and
Claim Profile dimensions.
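Queries against this table typically just count involvements. A minimal sketch with
hypothetical physical names based on Figure 16-11:

-- Accident involvement counts by loss party role; the constant 1 fact makes
-- counting and aggregation straightforward.
SELECT
    r.loss_party_role_description,
    SUM(f.accident_involvement_count) AS involvement_count
FROM accident_involvement_fact f
JOIN loss_party_role_dim r
  ON r.loss_party_role_key = f.loss_party_role_key
GROUP BY r.loss_party_role_description
ORDER BY involvement_count DESC;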
Common Dimensional Modeling Mistakes
to Avoid
As we close this final chapter on dimensional modeling techniques, we thought it
would be helpful to establish boundaries beyond which designers should not go.
Thus far in this book, we’ve presented concepts by positively stating dimensional
modeling best practices. Now rather than reiterating the to-dos, we focus on not-
to-dos by elaborating on dimensional modeling techniques that should be avoided.
We’ve listed the not-to-dos in reverse order of importance; be aware, however, that
even the less important mistakes can seriously compromise your DW/BI system.
Mistake 10: Place Text Attributes in a Fact Table
The process of creating a dimensional model is always a kind of triage. The numeric
measurements delivered from an operational business process source belong in the
fact table. The descriptive textual attributes comprising the context of the mea-
surements go in dimension tables. In nearly every case, if an attribute is used for
constraining and grouping, it belongs in a dimension table. Finally, you should make
a field-by-field decision about the leftover codes and pseudo-numeric items, placing
them in the fact table if they are more like measurements and used in calculations
or in a dimension table if they are more like descriptions used for filtering and
labeling. Don't lose your nerve and leave true text, especially comment fields, in
the fact table. You need to get these text attributes off the main runway of the data
warehouse and into dimension tables.
Mistake 9: Limit Verbose Descriptors to Save Space
You might think you are being a conservative designer by keeping the size of the
dimensions under control. However, in virtually every data warehouse, the dimen-
sion tables are geometrically smaller than the fact tables. Having a 100 MB product
dimension table is insignificant if the fact table is a hundred or a thousand times
as large! Our job as designers of easy-to-use dimensional models is to supply as
much verbose descriptive context in each dimension as possible. Make sure every
code is augmented with readable descriptive text. Remember the textual attributes
in the dimension tables provide the browsing, constraining, or filtering parameters
in BI applications, as well as the content for the row and column headers in reports.
Mistake 8: Split Hierarchies into Multiple Dimensions
A hierarchy is a cascaded series of many-to-one relationships. For example, many
products roll up to a single brand; many brands roll up to a single category. If a
dimension is expressed at the lowest level of granularity, such as product, then all
the higher levels of the hierarchy can be expressed as unique values in the product
row. Business users understand hierarchies. Your job is to present the hierarchies
in the most natural and efficient manner in the eyes of the users, not in the eyes of
a data modeler who has focused his entire career on designing third normal form
entity-relationship models for transaction processing systems.
A fixed depth hierarchy belongs together in a single physical flat dimension
table, unless data volumes or velocity of change dictate otherwise. Resist the urge
to snowflake a hierarchy by generating a set of progressively smaller subdimen-
sion tables. Finally, if more than one rollup exists simultaneously for a dimension,
in most cases it's perfectly reasonable to include multiple hierarchies in the same
dimension as long as the dimension has been defined at the lowest possible grain
(and the hierarchies are uniquely labeled).
Mistake 7: Ignore the Need to Track Dimension
Changes
Contrary to popular belief, business users often want to understand the impact of
changes on at least a subset of the dimension tables’ attributes. It is unlikely users
will settle for dimension tables with attributes that always reflect the current state
of the world. Three basic techniques track slowly moving attribute changes; don’t
rely on type 1 exclusively. Likewise, if a group of attributes changes rapidly, you
can split a dimension to capture the more volatile attributes in a mini-dimension.
Mistake 6: Solve All Performance Problems with
More Hardware
Aggregates, or derived summary tables, are a cost-effective way to improve query
performance. Most BI tool vendors have explicit support for the use of aggregates.
Adding expensive hardware should be done as part of a balanced program that
includes building aggregates, partitioning, creating indices, choosing query-efficient
DBMS software, increasing real memory size, increasing CPU speed, and adding
parallelism at the hardware level.
Mistake 5: Use Operational Keys to Join Dimensions
and Facts
Novice designers are sometimes too literal minded when designing the dimension
tables’ primary keys that connect to the fact tables’ foreign keys. It is counterpro-
ductive to declare a suite of dimension attributes as the dimension table key and
then use them all as the basis of the physical join to the fact table. This includes
the unfortunate practice of declaring the dimension key to be the operational key,
along with an effective date. All types of ugly problems will eventually arise. The
dimension’s operational or intelligent key should be replaced with a simple integer
surrogate key that is sequentially numbered from 1 to N, where N is the total number
of rows in the dimension table. The date dimension is the sole exception to this rule.
Mistake 4: Neglect to Declare and Comply with the
Fact Grain
All dimensional designs should begin by articulating the business process that
generates the numeric performance measurements. Second, the exact granularity
of that data must be specified. Building fact tables at the most atomic, granular level
will gracefully resist the ad hoc attack. Third, surround these measurements with
dimensions that are true to that grain. Staying true to the grain is a crucial step in
the design of a dimensional model. A subtle but serious design error is to add helpful
facts to a fact table, such as rows that describe totals for an extended timespan or a
large geographic area. Although these extra facts are well known at the time of the
individual measurement and would seem to make some BI applications simpler, they
cause havoc because all the automatic summations across dimensions overcount
these higher-level facts, producing incorrect results. Each different measurement
grain demands its own fact table.
Mistake 3: Use a Report to Design the Dimensional
Model
A dimensional model has nothing to do with an intended report! Rather, it is a model
of a measurement process. Numeric measurements form the basis of fact tables.
The dimensions appropriate for a given fact table are the context that describes the
circumstances of the measurements. A dimensional model is based solidly on
the physics of a measurement process and is quite independent of how a user chooses
to define a report. A project team once confessed it had built several hundred fact
tables to deliver order management data to its business users. It turned out each
fact table had been constructed to address a specific report request; the same data
was extracted many, many times to populate all these slightly different fact tables.
Not surprisingly, the team was struggling to update the databases within the nightly
batch window. Rather than designing a quagmire of report-centric schemas, the
team should have focused on the measurement processes. The users’ requirements
could have been handled with a well-designed schema for the atomic data along
with a handful (not hundreds) of performance-enhancing aggregations.
Mistake 2: Expect Users to Query Normalized
Atomic Data
The lowest level data is always the most dimensional and should be the foundation of
a dimensional design. Data that has been aggregated in any way has been deprived
of some of its dimensions. You can’t build a dimensional model with aggregated data
and expect users and their BI tools to seamlessly drill down to third normal form
(3NF) data for the atomic details. Normalized models may be helpful for preparing
the data in the ETL kitchen, but they should never be used for presenting the data
to business users.
Mistake 1: Fail to Conform Facts and Dimensions
This final not-to-do should be presented as two separate mistakes because they
are both so dangerous to a successful DW/BI design, but we’ve run out of mistake
numbers to assign, so they’re lumped into one.
It would be a shame to get this far and then build isolated data repository stove-
pipes. We refer to this as snatching defeat from the jaws of victory. If you have a
numeric measured fact, such as revenue, in two or more databases sourced from
di erent underlying systems, then you need to take special care to ensure the techni-
cal defi nitions of these facts exactly match. If the defi nitions do not exactly match,
then they shouldn’t both be referred to as revenue. This is conforming the facts.
Finally, the single most important design technique in the dimensional modeling
arsenal is conforming dimensions. If two or more fact tables are associated with the
same dimension, you must be fanatical about making these dimensions identical or
carefully chosen subsets of each other. When you conform dimensions across fact
tables, you can drill across separate data sources because the constraints and row
headers mean the same thing and match at the data level. Conformed dimensions
are the secret sauce needed for building distributed DW/BI environments, adding
unexpected new data sources to an existing warehouse, and making multiple incom-
patible technologies function together harmoniously. Conformed dimensions also
allow teams to be more agile because they’re not re-creating the wheel repeatedly;
this translates into a faster delivery of value to the business community.
Summary
In this final case study we designed a series of insurance dimensional models rep-
resenting the culmination of many important concepts developed throughout this
book. Hopefully, you now feel comfortable and confident using the vocabulary and
tools of a dimensional modeler.
With dimensional modeling mastered, in the next chapter we discuss all the other
activities that occur during the life of a successful DW/BI project. Before you go forth
to be dimensional, it's useful to have this holistic perspective and understanding,
even if your job focus is limited to modeling.
Kimball DW/BI
Lifecycle Overview
The gears shift rather dramatically in this chapter. Rather than focusing on
Kimball dimensional modeling techniques, we turn your attention to every-
thing else that occurs during the course of a data warehouse/business intelligence
design and implementation project. In this chapter, we’ll cover the life of a DW/BI
project from inception through ongoing maintenance, identifying best practices at
each step, as well as potential vulnerabilities. More comprehensive coverage of the
Kimball Lifecycle is available in The Data Warehouse Lifecycle Toolkit, Second Edition
by Ralph Kimball, Margy Ross, Warren Thornthwaite, Joy Mundy, and Bob Becker
(Wiley, 2008). This chapter is a crash course drawn from the complete text, which
weighs in at a hefty 600+ pages.
You may perceive this chapter's content is only applicable to DW/BI project
managers, but we feel differently. Implementing a DW/BI system requires tightly
integrated activities. We believe everyone on the project team, including the ana-
lysts, architects, designers, and developers, needs a high-level understanding of the
complete Lifecycle.
This chapter provides an overview of the entire Kimball Lifecycle approach; spe-
cific recommendations regarding dimensional modeling and ETL tasks are deferred
until subsequent chapters. We will dive into the collaborative modeling workshop
process in Chapter 18: Dimensional Modeling Process and Tasks, then make a simi-
lar plunge into ETL activities in Chapter 20: ETL System Design and Development
Process and Tasks.
Chapter 17 covers the following concepts:
Kimball Lifecycle orientation
DW/BI program/project planning and ongoing management
Tactics for collecting business requirements, including prioritization
Process for developing the technical architecture and selecting products
Physical design considerations, including aggregation and indexing
BI application design and development activities
Recommendations for deployment, ongoing maintenance, and future growth
Lifecycle Roadmap
When driving to a place we’ve never been to before, most of us rely on a roadmap,
albeit displayed via a GPS. Similarly, a roadmap is extremely useful if we’re about
to embark on the unfamiliar journey of data warehousing and business intelligence.
The authors of The Data Warehouse Lifecycle Toolkit drew on decades of experience
to develop the Kimball Lifecycle approach. When we first introduced the Lifecycle
in 1998, we referred to it as the Business Dimensional Lifecycle, a name that rein-
forced our key tenets for data warehouse success: Focus on the business’s needs,
present dimensionally structured data to users, and tackle manageable, iterative
projects. In the 1990s, we were one of the few organizations emphasizing these
core principles, so the moniker differentiated our methods from others. We are still
very firmly wed to these principles, which have since become generally accepted
industry best practices, but we renamed our approach to be the Kimball Lifecycle
because that’s how most people refer to it.
The overall Kimball Lifecycle approach is encapsulated in Figure 17-1. The dia-
gram illustrates task sequence, dependency, and concurrency. It serves as a roadmap
to help teams do the right thing at the right time. The diagram does not reflect an
absolute timeline; although the boxes are equally wide, there's a vast difference in
the time and effort required for each major activity.
Figure 17-1: Kimball Lifecycle diagram. Program/Project Planning leads into Business
Requirements Definition, which feeds three parallel tracks: Technical Architecture
Design followed by Product Selection & Installation; Dimensional Modeling followed by
Physical Design and then ETL Design & Development; and BI Application Design followed
by BI Application Development. The tracks converge at Deployment, which is followed by
Maintenance and Growth, with Program/Project Management spanning the entire Lifecycle.
NOTE Given the recent industry focus on agile methodologies, we want to
remind readers about the discussion of the topic in Chapter 1: Data Warehousing,
Business Intelligence, and Dimensional Modeling Primer. The Kimball Lifecycle
approach and agile methodologies share some common doctrines: Focus on busi-
ness value, collaborate with the business, and develop incrementally. However,
we also feel strongly that DW/BI system design and development needs to be built
on a solid data architecture and governance foundation, driven by the bus archi-
tecture. We also believe most situations warrant the bundling of multiple agile
“deliverables” into a more full-function release before being broadly deployed to
the general business community.
Roadmap Mile Markers
Before diving into specifics, take a moment to orient yourself to the roadmap. The
Lifecycle begins with program/project planning, as you would expect. This module
assesses the organization’s readiness for a DW/BI initiative, establishes the prelimi-
nary scope and justification, obtains resources, and launches the program/project.
Ongoing project management serves as a foundation to keep the remaining activi-
ties on track.
The second major task in Figure 17-1 focuses on business requirements defini-
tion. There's a two-way arrow between program/project planning and the business
requirements definition due to the interplay between these activities. Aligning the
DW/BI initiative with business requirements is absolutely crucial. Best-of-breed
technologies won’t salvage a DW/BI environment that fails to focus on the business.
Business users and their requirements have an impact on almost every design and
implementation decision made during the course of a DW/BI project. In Figure 17-1’s
roadmap, this is reflected by the three parallel tracks that follow.
The top track of Figure 17-1 deals with technology. Technical architecture design
establishes the overall framework to support the integration of multiple technolo-
gies. Using the capabilities identified in the architecture design as a shopping list,
you then evaluate and select specific products. Notice that product selection is not
the first box on the roadmap. One of the most frequent mistakes made by novice
teams is to select products without a clear understanding of what they’re trying to
accomplish. This is akin to grabbing a hammer whether you need to pound a nail
or tighten a screw.
The middle track emanating from business requirements definition focuses on
data. It begins by translating the requirements into a dimensional model, as we’ve
been practicing. The dimensional model is then transformed into a physical struc-
ture. The focus is on performance tuning strategies, such as aggregation, indexing,
and partitioning, during the physical design. Last but not least, the ETL system is
designed and developed. As mentioned earlier, the equally sized boxes don’t rep-
resent equally sized efforts; this becomes obvious with the workload differential
between the physical design and the demanding ETL-centric activities.
The final set of tasks spawned by the business requirements is the design and
development of the BI applications. The DW/BI project isn’t done when you deliver
data. BI applications, in the form of parameter-driven templates and analyses, will
satisfy a large percentage of the business users’ analytic needs.
The technology, data, and BI application tracks, along with a healthy dose of
education and support, converge for a well-orchestrated deployment. From there,
on-going maintenance is needed to ensure the DW/BI system remains healthy.
Finally, you handle future growth by initiating subsequent projects, each return-
ing to the beginning of the Lifecycle all over again.
Now that you have a high-level understanding of the overall roadmap, we’ll
describe each of the boxes in Figure 17-1 in more detail.
Lifecycle Launch Activities
The following sections outline best practices, and pitfalls to avoid, as you launch
a DW/BI project.
Program/Project Planning and Management
Not surprisingly, the DW/BI initiative begins with a series of program and project
planning activities.
Assessing Readiness
Before moving ahead with a DW/BI effort, it is prudent to take a moment to assess
the organization's readiness to proceed. Based on our cumulative experience from
hundreds of client engagements, three factors differentiate projects that were pre-
dominantly smooth sailing versus those that entailed a constant struggle. These
factors are leading indicators of DW/BI success; we’ll describe the characteristics
in rank order of importance.
The most critical readiness factor is to have a strong executive business sponsor.
Business sponsors should have a clear vision for the DW/BI system’s potential impact
on the organization. Optimally, business sponsors have a track record of success
with other internal initiatives. They should be politically astute leaders who can
convince their peers to support the e ort. It’s a much riskier scenario if the chief
information officer (CIO) is the designated sponsor; we much prefer visible com-
mitment from a business partner-in-crime instead.
The second readiness factor is having a strong, compelling business motivation for
tackling the DW/BI initiative. This factor often goes hand in hand with sponsorship.
The DW/BI project needs to solve critical business problems to garner the resources
required for a successful launch and healthy lifespan. Compelling motivation typi-
cally creates a sense of urgency, whether the motivation is from external sources,
such as competitive factors, or internal sources, such as the inability to analyze
cross-organization performance following acquisitions.
The third factor when assessing readiness is feasibility. There are several aspects
of feasibility, including technical and resource feasibility, but data feasibility is the
most crucial. Are you collecting real data in real operational source systems to sup-
port the business requirements? Data feasibility is a major concern because there
is no short-term fix if you're not already collecting reasonably clean source data at
the right granularity.
Scoping and Justification
When you're comfortable with the organization's readiness, it's time to put boundar-
ies around an initial project. Scoping requires the joint input of the IT organization
and business management. The scope of a DW/BI project should be both mean-
ingful to the business organization and manageable for the IT organization. You
should initially tackle projects that focus on data from a single business process;
save the more challenging, cross-process projects for a later phase. Remember to
avoid the Law of Too when scoping—too brief of a timeline for a project with too
many source systems and too many users in too many locations with too diverse
analytic requirements.
Justification requires an estimation of the benefits and costs associated with the
DW/BI initiative. Hopefully, the anticipated benefits grossly outweigh the costs. IT
usually is responsible for deriving the expenses. DW/BI systems tend to expand
rapidly, so be sure the estimates allow room for short-term growth. Unlike opera-
tional system development where resource requirements tail off after production,
ongoing DW/BI support needs will not decline appreciably over time.
The business community should have prime responsibility for determining the
anticipated financial benefits. DW/BI environments typically are justified based on
increased revenue or profit opportunities rather than merely focusing on expense
reduction. Delivering "a single version of the truth" or "flexible access to information"
isn't sufficient financial justification. You need to peel back the layers to determine
the quantifiable impact of improved decision making made possible by these sound
bites. If you are struggling with justification, this is likely a symptom that the initia-
tive is focused on the wrong business sponsor or problem.
Staffing
DW/BI projects require the integration of a cross-functional team with resources
from both the business and IT communities. It is common for the same person to
fill multiple roles on the team; the assignment of named resources to roles depends
on the project’s magnitude and scope, as well as the individual’s availability, capac-
ity, and experience.
From the business side of the house, we'll need representatives to fill the fol-
lowing roles:
Business sponsor. The sponsor is the DW/BI system’s ultimate client, as well
as its strongest advocate. Sponsorship sometimes takes the form of an execu-
tive steering committee, especially for cross-enterprise initiatives.
Business driver. In a large organization, the sponsor may be too far removed
or inaccessible to the project team. In this case, the sponsor sometimes del-
egates less strategic DW/BI responsibilities to a middle manager in the orga-
nization. This driver should possess the same characteristics as the sponsor.
Business lead. The business project lead is a well-respected person who is
highly involved in the project, communicating with the project manager on
a daily basis. Sometimes the business driver fills this role.
Business users. Optimally, the business users are the enthusiastic fans of the
DW/BI environment. You need to involve them early and often, beginning
with the project scope and business requirements. From there, you must
find creative ways to maintain their interest and involvement throughout the
project. Remember, business user involvement is critical to DW/BI acceptance.
Without business users, the DW/BI system is a technical exercise in futility.
Several positions are staffed from either the business or IT organizations. These
straddlers can be technical resources who understand the business or business
resources who understand technology:
Business analyst. This person is responsible for determining the business
needs and translating them into architectural, data, and BI application
requirements.
Data steward. This subject matter expert is often the current go-to resource
for ad hoc analysis. They understand what the data means, how it is used,
and where data inconsistencies are lurking. Given the need for organizational
consensus around core dimensional data, this can be a politically challenging
role, as we described in Chapter 4: Inventory.
BI application designer/developer. BI application resources are responsible
for designing and developing the starter set of analytic templates, as well as
providing ongoing BI application support.
The following roles are typically staffed from the IT organization:
Project manager. The project manager is a critical position. This person
should be comfortable with and respected by business executives, as well
as technical resources. The project manager’s communication and project
management skills must be stellar.
Technical architect. The architect is responsible for the overall technical
architecture. This person develops the plan that ties together the required
technical functionality and helps evaluate products on the basis of the overall
architecture.
Data architect/modeler. This resource likely comes from a transactional
data background with heavy emphasis on normalization. This person should
embrace dimensional modeling concepts and be empathetic to the require-
ments of the business rather than focused strictly on saving space or reducing
the ETL workload.
Database administrator. Like the data modeler, the database administrator
must be willing to set aside some traditional database administration truisms,
such as having only one index on a relational table.
Metadata coordinator. This person helps establish the metadata repository
strategy and ensures that the appropriate metadata is collected, managed,
and disseminated.
ETL architect/designer. This role is responsible for designing the ETL envi-
ronment and processes.
ETL developer. Based on direction from the ETL architect/designer, the devel-
oper builds and automates the processes, likely using an ETL tool.
We want to point out again that this is a list of roles, not people. Especially in
smaller shops, talented individuals will fill many of these roles simultaneously.
Developing and Maintaining the Plan
The DW/BI project plan identifies all the necessary Lifecycle tasks. A detailed task
list is available on the Kimball Group website at www.kimballgroup.com; check out
the Tools & Utilities tab under The Data Warehouse Lifecycle Toolkit, Second Edition
book title.
Any good project manager knows key team members should develop estimates
of the effort required for their tasks; the project manager can't dictate the amount of
time allowed and expect conformance. The project plan should identify acceptance
checkpoints with business representatives after every major roadmap milestone and
deliverable to ensure the project remains on track.
DW/BI projects demand broad communication. Although project manag-
ers typically excel at intra-team communications, they should also establish
a communication strategy describing the frequency, forum, and key messaging
for other constituencies, including the business sponsors, business community,
and other IT colleagues.
Finally, DW/BI projects are vulnerable to scope creep largely due to a strong
need to satisfy business users’ requirements. You have several options when con-
fronted with changes: Increase the scope (by adding time, resources, or budget),
play the zero-sum game (by retaining the original scope by giving up something
in exchange), or say “no” (without actually saying “no” by handling the change as
an enhancement request). The most important thing about scope decisions is that
they shouldn’t be made in an IT vacuum. The right answer depends on the situa-
tion. Now is the time to leverage the partnership with the business to arrive at an
answer that everyone can live with.
Business Requirements Definition
Collaborating with business users to understand their requirements and ensure
their buy-in is absolutely essential to successful data warehousing and business
intelligence. This section focuses on back-to-basics techniques for gathering busi-
ness requirements.
Requirements Preplanning
Before sitting down with business representatives to collect their requirements, we
suggest the following to ensure productive sessions:
Choose the Forum
Business user requirements sessions are typically interwoven with source system
expert data discovery sessions. This dual-pronged approach gives you insight into
the needs of the business with the realities of the data. However, you don’t ask
business representatives about the granularity or dimensionality of their critical
data. You need to talk to them about what they do, why they do it, how they make
decisions, and how they hope to make decisions in the future. Like organizational
therapy, you’re trying to detect the issues and opportunities.
There are two primary techniques for gathering requirements: interviews or facili-
tated sessions. Both have their advantages and disadvantages. Interviews encourage
individual participation and are also easier to schedule. Facilitated sessions may
reduce the elapsed time to gather requirements but require more time commitment
from each participant.
Based on our experience, surveys are not a reasonable tool for gathering require-
ments because they are flat and two-dimensional. The self-selected respondents
answer only the questions we've thought to ask in advance; there's no option to probe
more deeply. In addition, survey instruments do not help forge the bond between
business users and the DW/BI initiative that we strive for.
We generally use a hybrid approach with interviews to gather the details and then
facilitation to bring the group to consensus. Although we’ll describe this hybrid
approach in more detail, much of the discussion applies to pure facilitation as well.
The requirements gathering forum choice depends on the team’s skills, the organi-
zation’s culture, and what the business users have already been subjected to. One
size definitely does not fit all.
Identify and Prepare the Requirements Team
Regardless of the approach, you need to identify and prepare the involved project
team members. If you’re doing interviews, you need to identify a lead interviewer
whose primary responsibility is to ask great open-ended questions. Meanwhile, the
interview scribe takes copious notes. Although a tape recorder may provide more
complete coverage of each interview, we don’t use one because it changes the meet-
ing dynamics. Our preference is to have a second person in the room with another
brain and a set of eyes and ears rather than relying on technology. We often invite
one or two additional project members (depending on the number of interviewees)
as observers, so they can hear the users’ input directly.
Before sitting down with business users, you need to make sure you’re approach-
ing the sessions with the right mindset. Don’t presume you already know it all;
you will definitely learn more about the business during the sessions. On the other
hand, you should do some homework by researching available sources, such as the
annual report, website, and internal organization chart.
Because the key to getting the right answers is asking the right questions, we
recommend drafting questionnaires. The questionnaire should not be viewed as
a script; it is a tool to organize your thoughts and serve as a fallback device in
case your mind goes blank during the session. The questionnaire will be updated
throughout the interview process as the team becomes better versed in the busi-
ness’s subject matter.
Select, Schedule, and Prepare Business Representatives
If this is your first foray into DW/BI, or an effort to develop a cohesive strategy for
dealing with existing data stovepipes, you should talk to business people represent-
ing a reasonable horizontal spectrum of the organization. This coverage is critical
to formulating the enterprise data warehouse bus matrix blueprint. You need to
understand the common data and vocabulary across core business functions to build
an extensible environment.
Within the target user community, you should cover the organization verti-
cally. DW/BI project teams naturally gravitate toward the business’s power analysts.
Although their insight is valuable, you can’t ignore senior executives and middle
management. Otherwise, you are vulnerable to being overly focused on the tactical
here-and-now and lose sight of the group’s strategic direction.
Scheduling the business representatives can be the most onerous requirements
task; be especially nice to the department’s administrative assistants. We prefer
to meet with executives individually. Meeting with a homogeneous group of two to
three people is appropriate for those lower on the organization chart. Allow 1 hour
for individual meetings and 1½ hours for small groups. The scheduler needs to
allow ½ hour between meetings for debriefing and other necessities. Interviewing
is extremely taxing due to the focus required. Consequently, don’t schedule more
than three to four sessions in a day.
When it comes to preparing the interviewees, the business sponsor should
communicate with them, stressing their commitment to the effort and the impor-
tance of everyone’s participation. The interviewees should be asked to bring copies
of their key reports and analyses to the session. This communication disseminates
a consistent message about the project, plus conveys the business’s ownership of the
initiative. Occasionally interviewees will be reluctant to bring the business's "crown
jewel" reports to the meeting, especially with an outside consultant. However, almost
always we have found these people will enthusiastically race back to their offices at
the end of the interview to bring back those same reports.
Collecting Business Requirements
It's time to sit down face-to-face to gather the business's requirements. The
process usually flows from an introduction through structured questioning to a
final wrap-up.
Launch
Responsibility for introducing the session should be established prior to gathering in
a conference room. The designated kickoff person should script the primary talking
points for the first few minutes when the tone of the interview meeting is set. The
introduction should convey a crisp, business-centric message and not ramble with
hardware, software, and other technical jargon.
Interview Flow
The objective of an interview is to get business users to talk about what they do and
why they do it. A simple, nonthreatening place to begin is to ask about job responsi-
bilities and organizational fit. This is a lob-ball that interviewees can easily respond
to. From there, you typically ask about their key performance metrics. Determining
how they track progress and success translates directly into the dimensional model;
they’re telling you about their key business processes and facts without you asking
those questions directly.
If you meet with a person who has more hands-on data experience, you should
probe to better understand the business’s dimensionality. Questions like “How do
you distinguish between products (or agents, providers, or facilities)?” or “How
do you naturally categorize products?” help identify key dimension attributes and
hierarchies.
If the interviewee is more analytic, ask about the types of analysis currently
generated. Understanding the nature of these analyses and whether they are ad
hoc or standardized provides input into the BI tool requirements, as well as the
BI application design process. Hopefully, the interviewee brought along copies of
key spreadsheets and reports. Rather than stashing them in a folder, it is helpful to
understand how the interviewee uses the analysis today, as well as opportunities for
improvement. Contrary to the advice of some industry pundits, you cannot design
an extensible analytic environment merely by getting users to agree on their top five
reports. The users' questions are bound to change; consequently, you must resist
the temptation to narrow your design focus to a supposed top five.
If you meet with business executives, don't dive into these tactical details. Instead,
ask them about their vision for better leveraging information throughout the orga-
nization. Perhaps the project team is envisioning a totally ad hoc environment,
whereas business management is more interested in the delivery of standardized
analyses. You need to ensure the DW/BI deliverable matches the business demand
and expectations.
Ask each interviewee about the impact of improved access to information. You
likely already received preliminary project funding, but it never hurts to capture
more potential, quantifiable benefits.
Wrap-Up
As the interview is coming to a conclusion, ask each interviewee about their suc-
cess criteria for the project. Of course, each criterion should be measurable. “Easy
to use” and “fast” mean something different to everyone, so the interviewees need to
articulate specifics, such as their expectations regarding the amount of training
required to run predefined BI reports.
At this point, always make a broad disclaimer. The interviewees must understand
that just because you discussed a capability in the meeting doesn’t guarantee it’ll
be included in the first phase of the project. Thank interviewees for their brilliant
insights, and let them know what’s happening next and what their involvement will be.
Conducting Data-Centric Interviews
While we’re focused on understanding the business’s requirements, it is helpful to
intersperse sessions with the source system data gurus or subject matter experts
to evaluate the feasibility of supporting the business needs. These data-focused
interviews are quite different from the ones just described. The goal is to ascertain
whether the necessary core data exists before momentum builds behind the
requirements. In these data-centric interviews, you may go so far as to ask for some
initial data profiling results, such as domain values and counts of a few critical
data fields, to be provided subsequently, just to ensure you are not standing on
quicksand. A more complete data audit will occur during the dimensional modeling
process. Try to learn enough at this point to manage the organization's expectations
appropriately.
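A preliminary profiling request of this kind can be satisfied with a simple grouped count against the source. The following sketch assumes a hypothetical source table named order_header with a critical order_status field; substitute whatever tables and fields the data gurus identify.

-- Domain values and row counts for one critical source field (hypothetical names).
SELECT order_status,
       COUNT(*) AS row_count
FROM   order_header
GROUP BY order_status
ORDER BY row_count DESC;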
Documenting Requirements
Immediately following the interview, the interview team should debrief. You must
ensure everyone is on the same page about what was learned. It is also helpful if
everyone reviews their notes shortly after the session to fill in gaps while the inter-
view is still fresh. Abbreviations and partial sentences in the notes become incom-
prehensible after a few days! Likewise, examine the reports gathered to gain further
insight into the dimensionality that must be supported in the data warehouse.
At this point, it is time to document what you’ve heard. Although documentation
is everyone’s least favorite activity, it is critical for both user validation and project
team reference materials. There are two potential levels of documentation resulting
from the requirements process. The first is to write up each individual interview;
this activity is time-consuming because the write-up should not be merely a stream-
of-consciousness transcript but should make sense to someone who wasn’t in the
interview. The more critical documentation is a consolidated findings document.
This document is organized around key business processes. Because you tackle
DW/BI projects on a process-by-process basis, it is appropriate to structure the
business's requirements into the same buckets that will, in turn, become imple-
mentation efforts.
When writing up the findings document, you should begin with an executive
summary, followed by a project overview covering the process used and participants
involved. The bulk of the document centers on the business processes; for each
process, describe why business users want to analyze the process’s performance met-
rics, what capabilities they want, their current limitations, and potential benefits or
impact. Commentary about the feasibility of tackling each process is also important.
As described in Chapter 4 and illustrated in Figure 4-11, the processes are some-
times unveiled in an opportunity/stakeholder matrix to convey the impact across
the organization. In this case, the rows of the opportunity matrix identify business
processes, just like a bus matrix. However, in the opportunity matrix, the columns
identify the organizational groups or functions. Surprisingly, this matrix is usually
quite dense because many groups want access to the same core performance metrics.
Prioritizing Requirements
The consolidated findings document serves as the basis for presentations back to
senior management and other requirements participants. Inevitably you uncov-
ered more than can be tackled in a single iteration, so you need to prioritize. As
discussed with project scope, don’t make this decision in a vacuum; you need to
leverage (or foster) your partnership with the business community to establish
appropriate priorities.
The requirements wrap-up presentation is positioned as a findings review and
prioritization meeting. Participants include senior business representatives (who
optimally participated in the interviews), as well as the DW/BI manager and other
senior IT management. The session begins with an overview of each identified
business process. You want everyone in the room to have a common understanding
of the opportunities. Also review the opportunity/stakeholder matrix, as well as a
simplified bus matrix.
After the findings have been presented, it is time to prioritize using the prioriti-
zation grid, illustrated in Figure 17-2. The grid’s vertical axis refers to the potential
impact or value to the business. The horizontal axis conveys feasibility. Each of the
finding's business process themes is placed on the grid based on the representatives'
composite agreement regarding impact and feasibility. It’s visually obvious where
you should begin; projects that warrant immediate attention are located in the
upper-right corner because they’re high-impact projects, as well as highly feasible.
Projects in the lower-left cell should be avoided like the plague; they’re missions
impossible that do little for the business. Likewise, projects in the lower-right cell
don’t justify short-term attention, although project teams sometimes gravitate here
because these projects are doable but not very crucial. Finally, projects in the upper-
left cell represent meaningful opportunities. These projects have large potential
business payback but are currently infeasible. While the DW/BI project team focuses
on projects in the shaded upper-right corner, other IT teams should address the
current feasibility limitations of those in the upper left.
[Figure 17-2 plots business process themes BP1 through BP6 on a grid with feasibility (low to high) on the horizontal axis and potential business impact (low to high) on the vertical axis.]
Figure 17-2: Prioritization grid based on business impact and feasibility.
Lifecycle Technology Track
On the Kimball Lifecycle roadmap in Figure 17-1, the business requirements defi-
nition is followed immediately by three concurrent tracks focused on technology,
data, and BI applications, respectively. In the next several sections we’ll zero in on
the technology track.
Technical Architecture Design
Much like a blueprint for a new home, the technical architecture is the blueprint for
the DW/BI environment’s technical services and infrastructure. As the enterprise
data warehouse bus architecture introduced in Chapter 4 supports data integra-
tion, the architecture plan is an organizing framework to support the integration
of technologies and applications.
Like housing blueprints, the technical architecture consists of a series of models
that unveil greater detail regarding each major component. In both situations, the
architecture enables you to catch problems on paper (such as having the dish-
washer too far from the sink) and minimize mid-project surprises. It supports the
coordination of parallel efforts while speeding development through the reuse of
modular components. The architecture identifies immediately required components
versus those that will be incorporated at a later date (such as the deck and screened
porch). Most important, the architecture serves as a communication tool. Home
construction blueprints enable the architect, general contractor, subcontractors,
and homeowner to communicate from a common document. Likewise, the DW/BI
technical architecture supports communication regarding a consistent set of techni-
cal requirements within the team, upward to management, and outward to vendors.
In Chapter 1, we discussed several major components of the architecture, includ-
ing ETL and BI services. In this section, we focus on the process of creating the
architecture design.
DW/BI teams typically approach the architecture design process from opposite
ends of the spectrum. Some teams simply don't understand the benefits of an archi-
tecture and feel that the topic and tasks are too nebulous. They’re so focused on
delivery that the architecture feels like a distraction and impediment to progress,
so they opt to bypass architecture design. Instead, they piece together the technical
components required for the first iteration with chewing gum and baling wire, but
the integration and interfaces get taxed as more data, more users, or more func-
tionality are added. Eventually, these teams often end up rebuilding because the
non-architected structure couldn’t withstand the stresses. At the other extreme,
some teams want to invest two years designing the architecture while forgetting
that the primary purpose of a DW/BI environment is to solve business problems,
not address any plausible (and not so plausible) technical challenge.
Neither end of the spectrum is healthy; the most appropriate response lies
somewhere in the middle. We've identified the following eight-step process to help
navigate these architectural design waters. Every DW/BI system has a technical
architecture; the question is whether it is planned and explicit or merely implicit.
Establish an Architecture Task Force
It is useful to create a small task force of two to three people focused on architecture
design. Typically, it is the technical architect, along with the ETL architect/designer
and BI application architect/designer who ensure both back room and front room
representation.
Collect Architecture-Related Requirements
As illustrated in Figure 17-1, defining the technical architecture is not the first box
in the Lifecycle diagram. The architecture is created to support business needs; it’s
not meant to be an excuse to purchase the latest, greatest products. Consequently,
key input into the design process should come from the business requirements
definition. However, you should listen to the business's requirements with a slightly
different filter to drive the architecture design. The primary focus is uncovering the
architectural implications associated with the business's needs. Listen closely for
timing, availability, and performance requirements.
You should also conduct additional interviews within the IT organization. These
are technology-focused sessions to understand current standards, planned techni-
cal directions, and nonnegotiable boundaries. In addition, you should uncover les-
sons learned from prior information delivery projects, as well as the organization's
willingness to accommodate operational change on behalf of the DW/BI initiative,
such as identifying updated transactions in the source system.
Document Architecture Requirements
After leveraging the business requirements process and conducting supplemental IT
interviews, you need to document your findings. We recommend using a simplistic
tabular format, just listing each business requirement impacting the architecture,
along with a laundry list of architectural implications. For example, if there is a need
to deliver global sales performance data on a nightly basis, the technical implica-
tions might include 24/7 worldwide availability, data mirroring for loads, robust
metadata to support global access, adequate network bandwidth, and sufficient ETL
horsepower to handle the complex integration of operational data.
Create the Architecture Model
After the architecture requirements have been documented, you should begin for-
mulating models to support the identified needs. At this point, the architecture
team often sequesters itself in a conference room for several days of heavy thinking.
The architecture requirements are grouped into major components, such as ETL, BI,
metadata, and infrastructure. From there the team drafts and refines the high-level
architectural model. This drawing is similar to the front elevation page on housing
blueprints. It illustrates what the architecture will look like from the street, but it
can be dangerously simplistic because significant details are embedded in the pages
that follow.
Determine Architecture Implementation Phases
Like the homeowner's dream house, you likely can't implement all aspects of the
technical architecture at once. Some are nonnegotiable mandatory capabilities,
whereas others are nice-to-haves. Again, refer back to the business requirements
to establish architecture priorities because you must minimally provide the archi-
tectural elements needed to deliver the initial project.
Design and Specify the Subsystems
A large percentage of the needed functionality will likely be met by the major tool
vendor's standard offerings, but there are always a few subsystems that may not
be found in off-the-shelf products. You must define these subsystems in enough
detail, so either someone can build them for you or you can evaluate products against
your needs.
Create the Architecture Plan
The technical architecture needs to be documented, including the planned imple-
mentation phases, for those who were not sequestered in the conference room. The
technical architecture plan document should include adequate detail so skilled
professionals can proceed with construction of the framework, much like carpen-
ters frame a house based on the blueprint. However, it doesn’t typically reference
specific products, except those already in-house.
Review and Finalize the Technical Architecture
Eventually we come full circle with the architecture design process. The architecture
task force needs to communicate the architecture plan at varying levels of detail
to the project team, IT colleagues, and business leads. Following the review, docu-
mentation should be updated and put to use immediately in the product selection
process.
Product Selection and Installation
In many ways the architecture plan is similar to a shopping list for selecting products
that fit into the plan's framework. The following six tasks associated with DW/BI
product selection are quite similar to any technology selection.
Understand the Corporate Purchasing Process
The first step before selecting new products is to understand the internal hardware
and software purchase processes.
Develop a Product Evaluation Matrix
Using the architecture plan as a starting point, a spreadsheet-based evaluation
matrix should be developed that identifies the evaluation criteria, along with weight-
ing factors to indicate importance; the more specific the criteria, the better. If the
criteria are too vague or generic, every vendor will say they can satisfy your needs.
Conduct Market Research
To become informed buyers when selecting products, you should do market research
to better understand the players and their offerings. A request for proposal (RFP) is
a classic product evaluation tool. Although some organizations have no choice about
their use, you should avoid this technique, if possible. Constructing the RFP and
evaluating responses are tremendously time-consuming for the team. Meanwhile,
vendors are motivated to respond to the questions in the most positive light, so the
response evaluation is often more of a beauty contest. In the end, the value of
the expenditure may not warrant the effort.
Evaluate a Short List of Options
Despite the plethora of products available in the market, usually only a small number
of vendors can meet both functionality and technical requirements. By comparing
preliminary scores from the evaluation matrix, you can focus on a narrow list of
vendors and disqualify the rest. After dealing with a limited number of vendors, you
can begin the detailed evaluations. Business representatives should be involved in
this process if you’re evaluating BI tools. As evaluators, you should drive the process
rather than allow the vendors to do the driving, sharing relevant information from
the architecture plan, so the sessions focus on your needs rather than on product
bells and whistles. Be sure to talk with vendor references, both those formally pro-
vided and those elicited from your informal network.
If Necessary, Conduct a Prototype
After performing the detailed evaluations, sometimes a clear winner bubbles to the
top, often based on the team’s prior experience or relationships. In other cases,
the leader emerges due to existing corporate commitments such as site licenses or
legacy hardware purchases. In either situation, when a sole candidate emerges as the
winner, you can bypass the prototype step (and the associated investment in both
time and money). If no vendor is the apparent winner, you should conduct a prototype
with no more than two products. Again, take charge of the process by developing a
limited yet realistic business case study.
Select Product, Install on Trial, and Negotiate
It is time to select a product. Rather than immediately signing on the dotted line,
preserve your negotiating power by making a private, not public, commitment to a
single vendor. Instead of informing the vendor that you’re completely sold, embark
on a trial period where you have the opportunity to put the product to real use in
your environment. It takes significant energy to install a product, get trained, and
begin using it, so you should walk down this path only with the vendor you fully
intend to buy from; a trial should not be pursued as another tire-kicking exercise.
As the trial draws to a close, you have the opportunity to negotiate a purchase that’s
beneficial to all parties involved.
Lifecycle Data Track
In the Figure 17-1 Kimball Lifecycle diagram, the middle track following the busi-
ness requirements definition focuses on data. We turn your attention in that direc-
tion throughout the next several sections.
Dimensional Modeling
Given this book's focus for the first 16 chapters, we won't spend any time discuss-
ing dimensional modeling techniques here. The next chapter provides detailed
recommendations about the participants, process, and deliverables surrounding our
iterative workshop approach for designing dimensional models in collaboration with
business users. It’s required reading for anyone involved in the modeling activity.
Physical Design
The dimensional models developed and documented via a preliminary source-to-
target mapping need to be translated into a physical database. With dimensional
modeling, the logical and physical designs bear a close resemblance; you don’t
want the database administrator to convert your lovely dimensional schema into a
normalized structure during the physical design process.
Physical database implementation details vary widely by platform and project.
In addition, hardware, software, and tools are evolving rapidly, so the following
physical design activities and considerations merely scratch the surface.
Develop Naming and Database Standards
Table and column names are key elements of the users’ experience, both for navi-
gating the data model and viewing BI applications, so they should be meaningful
to the business. You must also establish standards surrounding key declarations
and the permissibility of nulls.
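To illustrate where such standards land, the following sketch of a hypothetical customer dimension uses business-friendly column names, declares the surrogate primary key, and disallows nulls in the attributes (missing values would instead carry an explicit label such as Unknown).

CREATE TABLE DimCustomer (
    CustomerKey     INT          NOT NULL,   -- surrogate primary key
    CustomerID      VARCHAR(20)  NOT NULL,   -- durable natural key from the source system
    CustomerName    VARCHAR(100) NOT NULL,
    CustomerCity    VARCHAR(50)  NOT NULL,   -- holds 'Unknown' rather than NULL
    CustomerSegment VARCHAR(30)  NOT NULL,
    CONSTRAINT PK_DimCustomer PRIMARY KEY (CustomerKey)
);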
Develop Physical Database Model
This model should be initially built in the development server where it will be used
by the ETL development team. There are several additional sets of tables that need
to be designed and deployed as part of the DW/BI system, including staging tables to
support the ETL system, auditing tables for ETL processing and data quality, and
structures to support secure access to a subset of the data warehouse.
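As one small example of these supporting structures, an ETL auditing table, sketched here with hypothetical names, records a row per load batch so data quality and lineage can be reported later.

CREATE TABLE EtlAuditBatch (
    AuditBatchKey   INT          NOT NULL,   -- surrogate key for the load batch
    TargetTableName VARCHAR(128) NOT NULL,
    BatchStartTime  TIMESTAMP    NOT NULL,
    BatchEndTime    TIMESTAMP,
    RowsInserted    INT,
    RowsRejected    INT,
    BatchStatus     VARCHAR(20)  NOT NULL,   -- e.g., Succeeded or Failed
    CONSTRAINT PK_EtlAuditBatch PRIMARY KEY (AuditBatchKey)
);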
Develop Initial Index Plan
In addition to understanding how the relational database’s query optimizer
and indexes work, the database administrator also needs to be keenly aware that
DW/BI requirements differ significantly from OLTP requirements. Because dimen-
sion tables have a single column primary key, you'll have a unique index on that
key. If bitmapped indexes are available, you typically add single column bitmapped
indexes to dimension attributes used commonly for filtering and grouping, espe-
cially those attributes that will be jointly constrained; otherwise, you should evalu-
ate the usefulness of B-tree indexes on these attributes. Similarly, the first fact table
index will typically be a B-tree or clustered index on the primary key; placing the
date foreign key in the index’s leading position speeds both data loads and queries
because the date is frequently constrained. If the DBMS supports high-cardinality
bitmapped indexes, these can be a good choice for individual foreign keys in the
fact tables because they are more agnostic than clustered indexes when the user
constrains dimensions in unexpected ways. The determination of other fact table
indexes depends on the index options and optimization strategies within the plat-
form. Although OLAP database engines also use indexes and have a query optimizer,
unlike the relational world, the database administrator has little control in these
environments.
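Expressed as SQL, an initial index plan along these lines might look roughly like the following; the table and column names are illustrative, and the bitmapped syntax assumes a platform, such as Oracle, that supports it.

-- Unique index on the dimension's single-column surrogate primary key.
CREATE UNIQUE INDEX UX_DimCustomer_Key ON DimCustomer (CustomerKey);

-- Bitmapped index on a commonly filtered, low-cardinality dimension attribute.
CREATE BITMAP INDEX BX_DimCustomer_Segment ON DimCustomer (CustomerSegment);

-- Fact table B-tree index on the primary key columns with the date key leading,
-- because the date is frequently constrained in both loads and queries.
CREATE INDEX IX_FactSales_PK ON FactSales (DateKey, ProductKey, CustomerKey);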
Design Aggregations, Including OLAP Database
Contrary to popular belief, adding more hardware isn’t necessarily the best weapon
in the performance-tuning arsenal; leveraging aggregate tables is a far more cost-
e ective alternative. Whether using OLAP technology or relational aggregation
tables, aggregates need to be designed in the DW/BI environment, as we’ll further
explore in Chapter 19: ETL Subsystems and Techniques, and Chapter 20. When
performance metrics are aggregated, you either eliminate dimensions or associate
the metrics with a shrunken rollup dimension that conforms to the atomic base
dimension. Because you can’t possibly build, store, and administer every theoreti-
cal aggregation, two primary factors need to be evaluated. First, think about the
business users' access patterns derived from the requirements findings, as well as
from input gained by monitoring actual usage patterns. Second, assess the data’s
statistical distribution to identify aggregation points that deliver bang for the buck.
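As a sketch, a relational aggregate at the month and product grain, with the customer dimension eliminated and the date rolled up to a conformed month attribute, might be built like this (illustrative names only).

-- Aggregate fact table at month/product grain, derived from the atomic fact table.
CREATE TABLE FactSalesMonthly AS
SELECT d.MonthKey,
       f.ProductKey,
       SUM(f.SalesQuantity) AS SalesQuantity,
       SUM(f.SalesAmount)   AS SalesAmount
FROM   FactSales f
JOIN   DimDate   d ON f.DateKey = d.DateKey
GROUP BY d.MonthKey, f.ProductKey;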
Finalize Physical Storage Details
This includes the nuts-and-bolts storage structures of blocks, files, disks, partitions,
and table spaces or databases. Large fact tables are typically partitioned by activity
date, with data segmented by month into separate partitions while appearing to
users as a single table. Partitioning by date delivers data loading, maintenance, and
query performance advantages.
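A monthly range-partitioned fact table might be declared roughly as follows; the syntax shown is Oracle-style and the table and column names are illustrative.

-- Fact table partitioned by month on the activity date key.
CREATE TABLE FactSales (
    DateKey       INT           NOT NULL,
    ProductKey    INT           NOT NULL,
    CustomerKey   INT           NOT NULL,
    SalesQuantity INT           NOT NULL,
    SalesAmount   NUMERIC(12,2) NOT NULL
)
PARTITION BY RANGE (DateKey) (
    PARTITION p201301 VALUES LESS THAN (20130201),   -- January 2013
    PARTITION p201302 VALUES LESS THAN (20130301),   -- February 2013
    PARTITION p201303 VALUES LESS THAN (20130401)    -- March 2013
);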
The aggregation, indexing and other performance tuning strategies will evolve as
actual usage patterns are better understood, so be prepared for the inevitable ongo-
ing modifications. However, you must deliver appropriately indexed and aggregated
data with the initial rollout to ensure the DW/BI environment delivers reasonable
query performance from the start.
ETL Design and Development
The Lifecycle’s data track wraps up with the design and development of the ETL
system. Chapter 19 describes the factors, presented as 34 subsystems, which must
be considered during the design. Chapter 20 then provides more granular guidance
about the ETL system design and development process and associated tasks. Stay
tuned for more details regarding ETL.
Lifecycle BI Applications Track
The final set of parallel activities following the business requirements definition in
Figure 17-1 is the BI application track where you design and develop the applica-
tions that address a portion of the users’ analytic requirements. As a BI application
developer once said, “Remember, this is the fun part!” You're finally using the
investment in technology and data to help business users make better decisions.
Although some may feel that the data warehouse should be a completely ad hoc,
self-service query environment, delivering parameter-driven BI applications will
satisfy a large percentage of the business community’s needs. For many business
users, “ad hoc” implies the ability to change the parameters on a report to create
their personalized version. There’s no sense making every user start from scratch.
Constructing a set of BI applications establishes a consistent analytic framework for
the organization, rather than allowing each spreadsheet to tell a slightly different
story. BI applications also serve to capture the analytic expertise of the organiza-
tion, from monitoring performance to identifying exceptions, determining causal
factors, and modeling alternative responses; this encapsulation provides a jump
start for the less analytically inclined.
BI Application Specification
Following the business requirements definition, you need to review the findings
and collected sample reports to identify a starter set of approximately 10 to 15 BI
reports and analytic applications. You want to narrow the initial focus to the most
critical capabilities to manage expectations and ensure on-time delivery. Business
community input will be critical to this prioritization process. Although 15 appli-
cations may not sound like much, numerous analyses can be created from a single
template merely by changing variables.
Before you start designing the initial applications, it’s helpful to establish stan-
dards, such as common pull-down menus and consistent output look and feel.
Using these standards, you specify each application template and capture sufficient
information about the layout, input variables, calculations, and breaks, so both the
application developer and business representatives share a common understanding.
During the BI application specification activity, you should also consider the appli-
cations' organization. You need to identify structured navigational paths to access
the applications, reflecting the way users think about their business. Leveraging
customizable information portals or dashboards is the dominant strategy for
disseminating access.
BI Application Development
When you move into the development phase for the BI applications, you again
need to focus on standards; naming conventions, calculations, libraries, and cod-
ing standards should be established to minimize future rework. The application
development activity can begin when the database design is complete, the BI tools
and metadata are installed, and a subset of historical data has been loaded. The BI
application template specifications should be revisited to account for the inevitable
changes to the model since the specifications were completed.
Each BI tool has product-specific tricks that can cause it to jump through hoops
backward. Rather than trying to learn the techniques via trial and error, we sug-
gest investing in appropriate tool-specific education or supplemental resources for
the development team.
While the BI applications are being developed, several ancillary benefits result. BI
application developers, armed with a robust access tool, will quickly find needling
problems in the data haystack despite the quality assurance performed by the ETL
application. This is one reason we prefer to start the BI application development
activity prior to the supposed completion of the ETL system. The developers also
will be the first to realistically test query response times. Now is the time to review
the preliminary performance-tuning strategies.
The BI application quality assurance activities cannot be completed until the data
is stabilized. You must ensure there is adequate time in the schedule beyond the final
ETL cutoff to allow for an orderly wrap-up of the BI application development tasks.
Lifecycle Wrap-up Activities
The following sections provide recommendations to ensure your project comes to
an orderly conclusion, while ensuring you’re poised for future expansion.
Deployment
The technology, data, and BI application tracks converge at deployment.
Unfortunately, this convergence does not happen naturally but requires substantial
preplanning. Perhaps more important, successful deployment demands the courage
and willpower to honestly assess the project’s preparedness to deploy. Deployment
is similar to serving a large holiday meal to friends and relatives. It can be difficult
to predict exactly how long it will take to cook the meal’s main entrée. Of course, if
the entrée is not done, the cook is forced to slow down the side dishes to compensate
for the lag before calling everyone to the table.
In the case of DW/BI deployment, the data is the main entrée. “Cooking” the
data in the ETL kitchen is the most unpredictable task. Unfortunately, even if the
data isn’t fully cooked, you often still proceed with the DW/BI deployment because
you told the warehouse guests they'd be served on a specific date and time. Because
you're unwilling to slow down the pace of deployment, you march into their offices
with undercooked data. No wonder users sometimes refrain from coming back for
a second helping.
Although testing has undoubtedly occurred during the DW/BI development tasks,
you need to perform end-to-end system testing, including data quality assurance,
operations processing, performance, and usability testing. In addition to critically
assessing the readiness of the DW/BI deliverables, you also need to package it with
education and support for deployment. Because the user community must adopt
the DW/BI system for it to be deemed successful, education is critical. The DW/
BI support strategy depends on a combination of management’s expectations and
the realities of the deliverables. Support is often organized into a tiered structure.
The first tier is website and self-service support; the second tier is provided by the
power users residing in the business area; centralized support from the DW/BI team
provides the fi nal line of defense.
Maintenance and Growth
You made it through deployment, so now you’re ready to kick back and relax. Not
so quickly! Your job is far from complete after you deploy. You need to continue
to manage the existing environment by investing resources in the following areas:
Support. User support is immediately crucial following the deployment to
ensure the business community gets hooked. You can’t sit back in your cubicle
and assume that no news from the business community is good news. If
you’re not hearing from them, chances are no one is using the DW/BI system.
Relocate (at least temporarily) to the business community so the users have
easy access to support resources. If problems with the data or BI applications
are uncovered, be honest with the business to build credibility while taking
immediate action to correct the problems. If the DW/BI deliverable is not of
high quality, the unanticipated support demands for data reconciliation and
application rework can be overwhelming.
Education. You must provide a continuing education program for the DW/
BI system. The curriculum should include formal refresher and advanced
courses, as well as repeat introductory courses. More informal education can
be o ered to the developers and power users to encourage the interchange
of ideas.
Technical support. The DW/BI system needs to be treated as a production
environment with service level agreements. Of course, technical support
should proactively monitor performance and system capacity trends. You
don’t want to rely on the business community to tell you that performance
has degraded.
Program support. The DW/BI program lives on beyond the implementation
of a single phase. You must closely monitor and then market your success.
Communication with the varied DW/BI constituencies must continue. You
must also ensure that existing implementations continue to address the needs
of the business. Ongoing checkpoint reviews are a key tool to assess and
identify opportunities for improvement.
If you’ve done your job correctly, inevitably there will be demand for growth,
either for new users, new data, new BI applications, or major enhancements to
existing deliverables. Unlike traditional systems development initiatives, DW/BI
change should be viewed as a sign of success, not failure. As we advised earlier when
discussing project scoping, the DW/BI team should not make decisions about these
growth options in a vacuum; the business needs to be involved in the prioritization
process. This is a good time to leverage the prioritization grid illustrated in
Figure 17-2. If you haven’t done so already, an executive business sponsorship com-
mittee should be established to set DW/BI priorities that align with the organization's
overall objectives. After new priorities have been identified, you go back to
the beginning of the Lifecycle and do it all again, leveraging and building on the
technical, data, and BI application foundations that have already been established,
while turning your attention to the new requirements.
Common Pitfalls to Avoid
Although we can provide positive recommendations about data warehousing and
business intelligence, some readers better relate to a listing of common pitfalls.
Here is our favorite top 10 list of common errors to avoid while building a DW/
BI system. These are all quite lethal errors; one alone may be sufficient to bring
down the initiative:
Pitfall 10: Become overly enamored with technology and data rather than
focusing on the business’s requirements and goals.
Pitfall 9: Fail to embrace or recruit an influential, accessible, and reasonable
senior management visionary as the business sponsor of the DW/BI effort.
Pitfall 8: Tackle a galactic multiyear project rather than pursuing more man-
ageable, although still compelling, iterative development efforts.
Pitfall 7: Allocate energy to construct a normalized data structure, yet run
out of budget before building a viable presentation area based on dimensional
models.
Pitfall 6: Pay more attention to back room operational performance and ease-
of-development than to front room query performance and ease of use.
Pitfall 5: Make the supposedly queryable data in the presentation area overly
complex. Database designers who prefer a more complex presentation should
spend a year supporting business users; they’d develop a much better appre-
ciation for the need to seek simpler solutions.
Pitfall 4: Populate dimensional models on a standalone basis without regard
to a data architecture that ties them together using shared, conformed
dimensions.
Pitfall 3: Load only summarized data into the presentation area’s dimensional
structures.
Pitfall 2: Presume the business, its requirements and analytics, and the under-
lying data and the supporting technology are static.
Pitfall 1: Neglect to acknowledge that DW/BI success is tied directly to busi-
ness acceptance. If the users haven’t accepted the DW/BI system as a founda-
tion for improved decision making, your efforts have been exercises in futility.
Summary
This chapter provided a high-speed tour of the Kimball Lifecycle approach for DW/
BI projects. We touched on the key processes and best practices. Although each
project is a bit different from the next, they all require attention to the major tasks
discussed to ensure a successful initiative.
The next chapter provides much more detailed coverage of the Kimball Lifecycle's
collaborative workshop approach for iteratively designing dimensional models with
business representatives. Chapters 19 and 20 delve into ETL system design consid-
erations and recommended development processes.
Dimensional
Modeling Process
and Tasks
We’ve described countless dimensional modeling patterns in Chapters 1
through 16 of this book. Now it’s time to turn your attention to the tasks
and tactics of the dimensional modeling process.
This chapter, condensed from content in The Data Warehouse Lifecycle Toolkit,
Second Edition (Wiley, 2008), begins with a practical discussion of preliminary
preparation activities, such as identifying the participants (including business
representatives) and arranging logistics. The modeling team develops an initial
high-level model diagram, followed by iterative detailed model development, review,
and validation. Throughout the process, you are reconfirming your understanding
of the business’s requirements.
Chapter 18 reviews the following concepts:
Overview of the dimensional modeling process
Tactical recommendations for the modeling tasks
Key modeling deliverables
Modeling Process Overview
Before launching into the dimensional modeling design effort, you must involve
the right players. Most notably, we strongly encourage the participation of business
representatives during the modeling sessions. Their involvement and collaboration
strongly increases the likelihood that the resultant model addresses the business’s
needs. Likewise, the organization’s business data stewards should participate, espe-
cially when you’re discussing the data they’re responsible for governing.
Creating a dimensional model is a highly iterative and dynamic process. After
a few preparation steps, the design effort begins with an initial graphical model
derived from the bus matrix, identifying the scope of the design and clarifying the
grain of the proposed fact tables and associated dimensions.
After completing the high-level model, the design team dives into the dimension
tables with attribute definitions, domain values, sources, relationships, data quality
concerns, and transformations. After the dimensions are identified, the fact tables
are modeled. The last phase of the process involves reviewing and validating the
model with interested parties, especially business representatives. The primary
goals are to create a model that meets the business requirements, verify that data
is available to populate the model, and provide the ETL team with a solid starting
source-to-target mapping.
Dimensional models unfold through a series of design sessions with each pass
resulting in a more detailed and robust design that’s been repeatedly tested against the
business needs. The process is complete when the model clearly meets the business’s
requirements. A typical design requires three to four weeks for a single
business process dimensional model, but the time required can vary depending
on the team’s experience, the availability of detailed business requirements, the
involvement of business representatives or data stewards authorized to drive to orga-
nizational consensus, the complexity of the source data, and the ability to leverage
existing conformed dimensions.
Figure 18-1 shows the dimensional modeling process flow. The key inputs to the
dimensional modeling process are the preliminary bus matrix and detailed busi-
ness requirements. The key deliverables of the modeling process are the high-level
dimensional model, detailed dimension and fact table designs, and issues log.
[Figure 18-1 depicts the flow from preparation, to the high-level dimensional model, to detailed dimensional model development, to model review and validation, to final design documentation, with iteration and testing throughout.]
Figure 18-1: Dimensional modeling process flow diagram.
Although the graphic portrays a linear progression, the process is quite iterative.
You will make multiple passes through the dimensional model starting at a high
level and drilling into each table and column, filling in the gaps, adding more detail,
and changing the design based on new information.
If an outside expert is engaged to help guide the dimensional modeling effort,
insist they facilitate the process with the team rather than disappearing for a few
weeks and returning with a completed design. This ensures the entire team under-
stands the design and associated trade-offs. It also provides a learning opportunity,
so the team can carry the model forward and independently tackle the next model.
Get Organized
Before beginning to model, you must appropriately prepare for the dimensional
modeling process. In addition to involving the right resources, there are also basic
logistical considerations to ensure a productive design effort.
Identify Participants, Especially Business
Representatives
The best dimensional models result from a collaborative team effort. No single
individual is likely to have the detailed knowledge of the business requirements and
the idiosyncrasies of the source systems to effectively create the model themselves.
Although the data modeler facilitates the process and has primary responsibility
for the deliverables, we believe it’s critically important to get subject matter experts
from the business involved to actively collaborate; their insights are invaluable,
especially because they are often the individuals who have historically figured
out how to get data out of the source systems and turned it into valuable analytic
information. Although involving more people in the design activities increases the
risk of slowing down the process, the improved richness and completeness of
the design justifies the additional overhead.
It’s always helpful to have someone with keen knowledge of the source system
realities involved. You might also include some physical DBA and ETL team rep-
resentatives so they can learn from the insights uncovered during the modeling
e ort and resist the temptations to apply third normal form (3NF) concepts or defer
complexities to the BI applications in an e ort to streamline the ETL processing.
Remember the goal is to trade o ETL processing complexity for simplicity and
predictability at the BI presentation layer.
Before jumping into the modeling process, you should take time to consider
the ongoing stewardship of the DW/BI environment. If the organization has
an active data governance and stewardship initiative, it is time to tap into that
function. If there is no preexisting stewardship program, it’s time to initiate
it. An enterprise DW/BI effort committed to dimensional modeling must also
be committed to a conformed dimension strategy to ensure consistency across
business processes. An active data stewardship program helps the organization
achieve its conformed dimension strategy. Agreeing on conformed dimensions
in a large enterprise can be a challenge; the difficulty is usually less a techni-
cal issue and more an organizational communication and consensus building
challenge.
Di erent groups across the enterprise are often committed to their own pro-
prietary business rules and defi nitions. Data stewards must work closely with the
interested groups to develop common business rules and de nitions, and then
cajole the organization into embracing the common rules and defi nitions to develop
enterprise consensus. Over the years, some have criticized the concept of conformed
dimensions as being “too hard.” Yes, it’s di cult to get people in di erent corners
of the business to agree on common attribute names, defi nitions, and values, but
that’s the crux of unifi ed, integrated data. If everyone demands their own labels and
business rules, then theres no chance of delivering the single version of the truth
promised by DW/BI systems. And fi nally, one of the reasons the Kimball approach is
sometimes criticized as being hard from people who are looking for quick solutions
is because we have spelled out the detailed steps for actually getting the job done.
In Chapter 19: ETL Subsystems and Techniques, these down-in-the-weeds details
are discussed in the coverage of ETL subsystems 17 and 18.
Review the Business Requirements
Before the modeling begins, the team must familiarize itself with the business
requirements. The first step is to carefully review the requirements documentation,
as we described in Chapter 17: Kimball DW/BI Lifecycle Overview. It's the modeling
team's responsibility to translate the business requirements into a flexible dimen-
sional model that can support a broad range of analysis, not just specific reports.
Some designers are tempted to skip the requirements review and move directly into
the design, but the resulting models are typically driven exclusively by the source
data without considering the added value required by the business community.
Having appropriate business representation on the modeling team helps further
avoid this data-driven approach.
Leverage a Modeling Tool
Before jumping into the modeling activities, it’s helpful to have a few tools in place.
Using a spreadsheet as the initial documentation tool is effective because it enables
you to quickly and easily make changes as you iterate through the modeling process.
After the model begins to firm up in the later stages of the process, you can con-
vert to whatever modeling tool is used in your organization. Most modeling tools
are dimensionally aware with functions to support the creation of a dimensional
model. When the detailed design is complete, the modeling tools can help the DBA
forward engineer the model into the database, including creating the tables, indexes,
partitions, views, and other physical elements of the database.
Leverage a Data Profiling Tool
Throughout the modeling process, the team needs to develop an ever-increasing
understanding of the source data's structure, content, relationships, and derivation
rules. You need to verify that the data exists in a usable state, or at least that its flaws
can be managed, and understand what it takes to convert it into the dimensional model.
Data profiling uses query capabilities to explore the actual content and relation-
ships in the source system rather than relying on perhaps incomplete or outdated
documentation. Data profiling can be as simple as writing some SQL statements
or as sophisticated as a special purpose tool. The major ETL vendors include data
profiling capabilities in their products.
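In its simplest form, that profiling is just a handful of queries against the source. The following sketch, using hypothetical source tables, checks the null rate of a candidate attribute and looks for orphaned references between two tables.

-- Null rate for a candidate dimension attribute in the source.
SELECT COUNT(*) AS total_rows,
       SUM(CASE WHEN region_code IS NULL THEN 1 ELSE 0 END) AS null_region_rows
FROM   customer_master;

-- Order rows whose customer is missing from the customer master (orphans).
SELECT COUNT(*) AS orphan_order_rows
FROM   order_header o
LEFT JOIN customer_master c ON o.customer_id = c.customer_id
WHERE  c.customer_id IS NULL;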
Leverage or Establish Naming Conventions
The issue of naming conventions inevitably arises during the creation of the dimen-
sional model. The data model’s labels must be descriptive and consistent from a
business perspective. Table and column names become key elements of the BI appli-
cations' interface. A column name such as “Description” may be perfectly clear in
the context of a data model but communicates nothing in the context of a report.
Part of the process of designing a dimensional model is agreeing on common
definitions and common labels. Naming is complex because different business groups
have different meanings for the same name and different names with the same
meaning. People are reluctant to give up the familiar and adopt a new vocabulary.
Spending time on naming conventions is one of those tiresome tasks that seem to
have little payback but is worth it in the long run.
Large organizations often have an IT function that owns responsibility for nam-
ing conventions. A common approach is to use a naming standard with three parts:
prime word, qualifiers (if appropriate), and class word; for example, in a column named
Customer Birth Date, Customer is the prime word, Birth a qualifier, and Date the class
word. Leverage the work of this
IT function, understanding that sometimes existing naming conventions need to
be extended to support more business-friendly table and column names. If the
organization doesn’t already have a set of naming conventions, you must establish
them during the dimensional modeling.
Coordinate Calendars and Facilities
Last, but not least, you need to schedule the design sessions on participants’ cal-
endars. Rather than trying to reserve full days, it’s more realistic to schedule
morning and afternoon sessions that are two to three hours in duration for three or
four days each week. This approach recognizes that the team members have other
responsibilities and allows them to try to keep up in the hours before, after, and
between design sessions. The design team can leverage the unscheduled time to
research the source data and confirm requirements, as well as allow time for the
data modeler to update the design documentation prior to each session.
As we mentioned earlier, the modeling process typically takes three to four weeks
for a single business process, such as sales orders, or a couple of tightly related
business processes such as healthcare facility and professional claim transactions
in a set of distinct but closely aligned fact tables. There are a multitude of factors
impacting the magnitude of the effort. Ultimately, the availability of previously
existing core dimensions allows the modeling effort to focus almost exclusively on
the fact table's performance metrics, which significantly reduces the time required.
Finally, you must reserve appropriate facilities. It is best to set aside a dedicated
conference room for the duration of the design effort—no easy task in most orga-
nizations where meeting room facilities are always in short supply. Although we’re
dreaming, big floor-to-ceiling whiteboards on all four walls would be nice, too! In
addition to a meeting facility, the team needs some basic supplies, such as self-stick
flip chart paper. A laptop projector is often useful during the design sessions and
is absolutely required for the design reviews.
Design the Dimensional Model
As outlined in Chapter 3: Retail Sales, there are four key decisions made during the
design of a dimensional model:
Identify the business process.
Declare the grain of the business process.
Identify the dimensions.
Identify the facts.
The first step of identifying the business process is typically determined at the
conclusion of the requirements gathering. The prioritization activity described in
Chapter 17 establishes which bus matrix row (and hence business process) will be
modeled. With that grounding, the team can proceed with the design tasks.
The modeling effort typically works through the following sequence of tasks and
deliverables, as illustrated in Figure 18-1:
High-level model defining the model's scope and granularity
Detailed design with table-by-table attributes and metrics
Review and validation with IT and business representatives
Finalization of the design documentation
As with any data modeling effort, dimensional modeling is an iterative process.
You will work back and forth between business requirements and source details to
further refine the model, changing the model as you learn more.
This section describes each of these major tasks. Depending on the design team’s
experience and exposure to dimensional modeling concepts, you might begin with basic
dimensional modeling education before kicking off the effort to ensure everyone is on
the same page regarding standard dimensional vocabulary and best practices.
Reach Consensus on High-Level Bubble Chart
The initial task in the design session is to create a high-level dimensional model
diagram for the target business process. Creating the first draft is relatively straight-
forward because you start with the bus matrix. Although an experienced designer
could develop the initial high-level dimensional model and present it to the team
for review, we recommend against this approach because it does not allow the entire
team to participate in the process.
The high-level diagram graphically represents the business process’s dimension
and fact tables. Shown in Figure 18-2, we often refer to this diagram as the bubble
chart for obvious reasons. This entity-level graphical model clearly identifies the
grain of the fact table and its associated dimensions to a non-technical audience.
[Figure 18-2 shows an Orders fact bubble with a grain of 1 row per order line, surrounded by its dimensions: Order Date, Due Date, Sold To Customer, Ship To Customer, Bill To Customer, Currency, Promotion, Channel, Product, Order Profile, and Sales Person.]
Figure 18-2: Sample high-level model diagram.
Declaring the grain requires the modeling team to consider what is needed to
meet the business requirements and what is possible based on the data collected
by the source system. The bubble chart must be rooted in the realities of available
physical data sources. A single row of the bus matrix may result in multiple bubble
charts, each corresponding to a unique fact table with unique granularity.
Most of the major dimensions will fall out naturally after you determine the
grain. One of the powerful effects of a clear fact table grain declaration is you can
precisely visualize the associated dimensionality. Choosing the dimensions may
also cause you to rethink the grain declaration. If a proposed dimension doesn’t
match the grain of the fact table, either the dimension must be left out, the grain
of the fact table changed, or a multivalued design solution needs to be considered.
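To make the grain declaration concrete, the Figure 18-2 design would eventually translate into a fact table along the following lines. This is only a sketch with illustrative column names, including a hypothetical order number carried as a degenerate dimension.

CREATE TABLE FactOrders (
    OrderDateKey      INT           NOT NULL,
    DueDateKey        INT           NOT NULL,
    SoldToCustomerKey INT           NOT NULL,
    ShipToCustomerKey INT           NOT NULL,
    BillToCustomerKey INT           NOT NULL,
    ProductKey        INT           NOT NULL,
    ChannelKey        INT           NOT NULL,
    PromotionKey      INT           NOT NULL,
    CurrencyKey       INT           NOT NULL,
    SalesPersonKey    INT           NOT NULL,
    OrderProfileKey   INT           NOT NULL,
    OrderNumber       VARCHAR(20)   NOT NULL,  -- degenerate dimension (hypothetical)
    OrderLineNumber   INT           NOT NULL,  -- grain: one row per order line
    OrderQuantity     INT           NOT NULL,
    OrderAmount       NUMERIC(12,2) NOT NULL
);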
Figure 18-2’s graphical representation serves several purposes. It facilitates
discussion within the design team before the team dives into the detailed design,
ensuring everyone is on the same page before becoming inundated with minutiae.
It’s also a helpful introduction when the team communicates with interested stake-
holders about the project, its scope, and data contents.
To aid in understanding, it is helpful to retain consistency across the high-level
model diagrams for a given business process. Although each fact table is documented
on a separate page, arranging the associated dimensions in a similar sequence across
the bubble charts is useful.
Develop the Detailed Dimensional Model
After completing the high-level bubble chart designs, it's time to focus on the details.
The team should meet on a very regular basis to define the detailed dimensional
model, table by table, column by column. The business representatives should
remain engaged during these interactive sessions; you need their feedback on attri-
butes, filters, groupings, labels, and metrics.
It's most effective to start with the dimension tables and then work on the fact
tables. We suggest launching the detailed design process with a couple of straight-
forward dimensions; the date dimension is always a favorite starting point. This
enables the modeling team to achieve early success, develop an understanding of
the modeling process, and learn to work together as a team.
The detailed modeling identifies the interesting and useful attributes within each
dimension and appropriate metrics for each fact table. You also want to capture the
sources, definitions, and preliminary business rules that specify how these attributes
and metrics are populated. Ongoing analyses of the source system and systematic
data profiling during the design sessions help the team better understand the reali-
ties of the underlying source data.
Identify Dimensions and Their Attributes
During the detailed design sessions, key conformed dimensions are defined. Because
the DW/BI system is an enterprise resource, these definitions must be acceptable
across the enterprise. The data stewards and business analysts are key resources to
achieve organizational consensus on table and attribute naming, descriptions, and
definitions. The design team can take the lead in driving the process and leverag-
ing naming conventions, if available. But it is ultimately a business task to agree on
standard business definitions and names; the column names must make sense to
the business users. This can take some time, but it is an investment that will deliver
huge returns for the users’ understanding and willingness to accept the dimensional
model. Don’t be surprised if the governance steering committee must get involved
to resolve conformed dimension definition and naming issues.
At this point, the modeling team often also wrestles with the potential inclu-
sion of junk dimensions or mini-dimensions in a dimensional model. It may not
be apparent that these more performance-centric patterns are warranted until the
team is deeply immersed in the design.
Identify the Facts
Declaring the grain crystallizes the discussion about the fact table’s metrics because
the facts must all be true to the grain. The data profiling effort identifies the counts
and amounts generated by the measurement event’s source system. However, fact
tables are not limited to just these base facts. There may be additional metrics the
business wants to analyze that are derived from the base facts.
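For example, if the order line grain carries extended sales amount and extended cost amount as base facts, the business may also want gross profit, which is derived directly from them. A minimal illustrative sketch in Python (the column names are hypothetical, not drawn from any particular source system):

# Hypothetical base facts at a single grain: one row per order line.
order_line_facts = [
    {"order_number": 1001, "extended_sales_amount": 250.00, "extended_cost_amount": 180.00},
    {"order_number": 1002, "extended_sales_amount": 75.00, "extended_cost_amount": 60.00},
]

for row in order_line_facts:
    # A derived fact must be true to the same grain as the base facts it comes from.
    row["gross_profit_amount"] = row["extended_sales_amount"] - row["extended_cost_amount"]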
Identify Slowly Changing Dimension Techniques
After the dimension and fact tables from the high-level model diagram have been
initially drafted, you then circle back to the dimension tables. For each dimension
table attribute, you define how source system data changes will be reflected in the
dimension table. Again, input from the business data stewards is critical to estab-
lishing appropriate rules. It’s also helpful to ask the source system experts if they
can determine whether a data element change is due to a source data correction.
Document the Detailed Table Designs
The key deliverables of the detailed modeling phase are the design work-
sheets, as shown in Figure 18-3; a digital template is available on our website at
www.kimballgroup.com under the Tools and Utilities Tab for The Data Warehouse
Lifecycle Toolkit, Second Edition. The worksheets capture details for communication
to interested stakeholders including other analytical business users, BI applica-
tion developers, and most important, the ETL developers who will be tasked with
populating the design.
[Figure 18-3 shows a filled-in design worksheet for the DimOrderProfile dimension table (display name OrderProfile), described as the “junk” dimension for miscellaneous information about order transactions, used in the Orders schema, with roughly 12 rows. For each target column the worksheet records the column name, description, datatype, size, example values, and SCD type, along with the source system, source table, source field name, source datatype, and ETL rules. Sample rows include OrderProfileKey (smallint surrogate primary key, derived), OrderMethod (varchar 8; Phone, Fax, Internet; sourced from OEI OrderHeader Ord_Meth with the rule 1=Phone, 2=Fax, 3=Internet; SCD type 1), OrderSource (varchar 12; Reseller, Direct Sales; from OEI OrderHeader Ord_Src with R=Reseller, D=Direct Sales; SCD type 1), and CommissionInd (varchar 14; Commission, Non-Commission; from OEI OrderHeader Comm_Code with 0=Non-Commission, 1=Commission; SCD type 1).]
Figure 18-3: Sample detailed dimensional design worksheet.
Each dimension and fact table should be documented in a separate worksheet. At
a minimum, the supporting information required includes the attribute/fact name,
description, sample values, and a slowly changing dimension type indicator for every
dimension attribute. In addition, the detailed fact table design should identify each
foreign key relationship, appropriate degenerate dimensions, and rules for each fact
to indicate whether it’s additive, semi-additive, or non-additive.
The dimensional design worksheet is the first step toward creating the source-
to-target mapping document. The physical design team will further flesh out the
mapping with physical table and column names, data types, and key declarations.
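To make the worksheet contents concrete, the sketch below shows how a single attribute row from the worksheet in Figure 18-3 might be captured as structured metadata on its way into the source-to-target mapping; the representation is illustrative only, not a prescribed format.

# Illustrative representation of one dimension attribute row from the detailed
# design worksheet, feeding the source-to-target mapping document.
order_method_attribute = {
    "table_name": "DimOrderProfile",
    "column_name": "OrderMethod",
    "description": "Method used to place order (phone, fax, internet)",
    "datatype": "varchar",
    "size": 8,
    "example_values": ["Phone", "Fax", "Internet"],
    "scd_type": 1,
    "source_system": "OEI",
    "source_table": "OrderHeader",
    "source_field": "Ord_Meth",
    "source_datatype": "int",
    "etl_rule": "1=Phone, 2=Fax, 3=Internet",
}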
Track Model Issues
Any issues, definitions, transformation rules, and data quality challenges discovered
during the design process should be captured in an issues tracking log. Someone
should be assigned the task of capturing and tracking issues during the sessions; the
project manager, if they’re participating in the design sessions, often handles this
responsibility because they’re typically adept at keeping the list updated and encour-
aging progress on resolving open issues. The facilitator should reserve adequate
time at the end of every session to review and validate new issue entries and their
assignments. Between design sessions, the design team is typically busy profiling
data, seeking clarification and agreement on common definitions, and meeting with
source system experts to resolve outstanding issues.
Maintain Updated Bus Matrix
During the detailed modeling process, there are often new discoveries about the
business process being modeled. Frequently, these findings result in the intro-
duction of new fact tables to support the business process, new dimensions, or
the splitting or combining of dimensions. You must keep the bus matrix updated
throughout the design process because it is a key communication and planning
tool. As discussed in Chapter 16: Insurance, the detailed bus matrix often captures
additional information about each fact table’s granularity and metrics.
Review and Validate the Model
Once the design team is confident about the model, the process moves into the
review and validation phase to get feedback from other interested parties, including:
IT resources, such as DW/BI team members not involved in the modeling
e ort, source system experts, and DBAs
Analytical or power business users not involved in the modeling e ort
Broader business user community
IT Reviews
Typically, the first review of the detailed dimensional model is with peers in the
IT organization. This audience is often composed of reviewers who are intimately
familiar with the target business process because they wrote or manage the system
that runs it. They are also at least partly familiar with the target data model because
you’ve already been pestering them with source data questions.
IT reviews can be challenging because the participants often lack an understand-
ing of dimensional modeling. In fact, most of them probably fancy themselves as
proficient 3NF modelers. Their tendency will be to apply transaction processing-
oriented modeling rules to the dimensional model. Rather than spending the bulk
of your time debating the merits of different modeling disciplines, it is best to proac-
tively provide some dimensional modeling education as part of the review process.
When everyone has the basic concepts down, you should begin with a review
of the bus matrix. This gives everyone a sense of the project scope and overall data
architecture, demonstrates the role of conformed dimensions, and shows the relative
business process priorities. Next, illustrate how the selected row on the matrix trans-
lates directly into the high-level dimensional model diagram. This gives everyone the
entity-level map of the model and serves as the guide for the rest of the discussion.
Most of the review session should be spent going through the dimension and fact
table worksheet details. It is also a good idea to review any remaining open issues
for each table as you work through the model.
Changes to the model will likely result from this meeting. Remember to assign
the task of capturing the issues and recommendations to someone on the team.
Core User Review
In many projects, this review is not required because the core business users are
members of the modeling team and are already intimately knowledgeable about the
dimensional model. Otherwise, this review meeting is similar in scope and structure
to the IT review meeting. The core business users are more technical than typical
business users and can handle details about the model. In smaller organizations,
we often combine the IT review and core user review into one session.
Broader Business User Review
This session is as much education as it is design review. You want to educate people
without overwhelming them, while at the same time illustrating how the dimen-
sional model supports their business requirements. You should start with the bus
matrix as the enterprise DW/BI data roadmap, review the high-level model bubble
charts, and finally, review the critical dimensions, such as customer and product.
Sometimes the bubble charts are supplemented with diagrams similar to Figure
18-4 to illustrate the hierarchical drill paths within a dimension.
[Figure 18-4 depicts the hierarchical attribute relationships in a product dimension: Department # and Department Name, Category Code and Category Name, SKU # and Product Name, along with related attribute pairs such as Color Code/Color Name, Brand Code/Brand Name, and Package Type ID/Package Type Description.]
Figure 18-4: Illustration of hierarchical attribute relationships for business users.
Be sure to allocate time during this education/review to illustrate how the model
can be used to answer a broad range of questions about the business process. We
often pull some examples from the requirements document and walk through how
they would be answered.
Finalize the Design Documentation
After the model is in its final form, the design documentation should be compiled
from the design team’s working papers. This document typically includes:
Brief description of the project
High-level data model diagram
Detailed dimensional design worksheet for each fact and dimension table
Open issues
Summary
Dimensional modeling is an iterative design process requiring the cooperative
e ort of people with a diverse set of skills, including business representatives. The
design e ort begins with an initial graphical model pulled from the bus matrix and
presented at the entity level. The detailed modeling process drills down into the
defi nitions, sources, relationships, data quality problems, and required transforma-
tions for each table. The primary goals are to create a model that meets the business
requirements, verify the data is available to populate the model, and provide the
ETL team with a clear direction.
The task of determining column and table names is interwoven into the design
process. The organization as a whole must agree on the names, definitions, and
derivations of every column and table in the dimensional model. This is more of a
political process than a technical one, which requires the full attention of the most
diplomatic team member. The resulting column names exposed through the BI tool
must make sense to the business community.
The detailed modeling effort is followed by several reviews. The end result is
a dimensional model that has been successfully tested against both the business
needs and data realities.
ETL Subsystems and
Techniques
The extract, transformation, and load (ETL) system consumes a disproportionate
share of the time and effort required to build a DW/BI environment. Developing
the ETL system is challenging because so many outside constraints put pressure on its
design: the business requirements, source data realities, budget, processing windows,
and skill sets of the available staff. Yet it can be hard to appreciate just why the ETL
system is so complex and resource-intensive. Everyone understands the three let-
ters: You get the data out of its original source location (E), you do something to it
(T), and then you load it (L) into a final set of tables for the business users to query.
When asked about the best way to design and build the ETL system, many design-
ers say, “Well, that depends.” It depends on the source; it depends on limitations of
the data; it depends on the scripting languages and ETL tools available; it depends
on the staff’s skills; and it depends on the BI tools. But the “it depends” response
is dangerous because it becomes an excuse to take an unstructured approach to
developing an ETL system, which in the worst-case scenario results in an undif-
ferentiated spaghetti-mess of tables, modules, processes, scripts, triggers, alerts, and
job schedules. This “creative” design approach should not be tolerated. With the
wisdom of hindsight from thousands of successful data warehouses, a set of ETL best
practices have emerged. There is no reason to tolerate an unstructured approach.
Careful consideration of these best practices has revealed 34 subsystems are
required in almost every dimensional data warehouse back room. No wonder the
ETL system takes such a large percentage of the DW/BI development resources!
This chapter is drawn from The Data Warehouse Lifecycle Toolkit, Second Edition
(Wiley, 2008). Throughout the chapter we’ve sprinkled pointers to resources on
the Kimball Group’s website for more in-depth coverage of several ETL techniques.
Chapter 19 reviews the following concepts:
Requirements and constraints to be considered before designing the
ETL system
Three subsystems focused on extracting data from source systems
Five subsystems to deal with value-added cleaning and conforming, including
dimensional structures to monitor quality errors
Thirteen subsystems to deliver data into now-familiar dimensional structures,
such as a subsystem to implement slowly changing dimension techniques
Thirteen subsystems to help manage the production ETL environment
Round Up the Requirements
Establishing the architecture of an ETL system begins with one of the toughest
challenges: rounding up the requirements. By this we mean gathering and under-
standing all the known requirements, realities, and constraints affecting the ETL
system. The list of requirements can be pretty overwhelming, but it’s essential to
lay them on the table before launching into the development of the ETL system.
The ETL system requirements are mostly constraints you must live with and adapt
your system to. Within the framework of these requirements, there are opportuni-
ties to make your own decisions, exercise judgment, and leverage creativity, but
the requirements dictate the core elements that the ETL system must deliver. The
following ten sections describe the major requirements areas that impact the design
and development of the ETL system.
Before launching the ETL design and development effort, you should provide a
short response for each of the following ten requirements. We have provided
a sample checklist (as a note) for each to get you started. The point of this exercise
is to ensure you visit each of these topics because any one of them can be a show-
stopper at some point in the project.
Business Needs
From an ETL designer’s view, the business needs are the DW/BI system users’ infor-
mation requirements. We use the term business needs somewhat narrowly here to
mean the information content that business users need to make informed business
decisions. Because the business needs directly drive the choice of data sources and
their subsequent transformation in the ETL system, the ETL team must understand
and carefully examine the business needs.
NOTE You should maintain a list of the key performance indicators (KPIs)
uncovered during the business requirements definition that the project intends to
support, as well as the drill-down and drill-across targets required when a business
user needs to investigate “why?” a KPI changed.
Compliance
Changing legal and reporting requirements have forced many organizations to
seriously tighten their reporting and provide proof that the reported numbers are
accurate, complete, and have not been tampered with. Of course, DW/BI systems in
regulated businesses, such as telecommunications, have complied with regulatory
reporting requirements for years. But certainly the whole tenor of financial report-
ing has become much more rigorous for everyone.
NOTE In consultation with your legal department or chief compliance officer (if
you have one!) and the BI delivery team, you should list all data and final reports
subject to compliance restrictions. List those data inputs and data transformation
steps for which you must maintain the “chain of custody” showing and proving
that final reports were derived from the original data delivered from your data
sources. List the data for which you must provide proof of security for the copies under
your control, both offline and online. List those data copies you must archive, and
list the expected usable lifetime of those archives. Good luck with all this. This is
why you are paid so well….
Data Quality
Three powerful forces have converged to put data quality concerns near the top of
the list for executives. First, the long-term cultural trend that says, “If only I could
see the data, then I could manage my business better” continues to grow; today’s
knowledge workers believe instinctively that data is a crucial requirement for them
to function in their jobs. Second, most organizations understand their data sources
are profoundly distributed, typically around the world, and that effectively integrat-
ing a myriad of disparate data sources is required. And third, the sharply increased
demands for compliance mean careless handling of data will not be overlooked or
excused.
NOTE You should list those data elements whose quality is known to be unac-
ceptable, and list whether an agreement has been reached with the source systems
to correct the data before extraction. List those data elements discovered during
data profiling, which will be continuously monitored and flagged as part of the
ETL process.
Security
Security awareness has increased significantly in the last few years across IT but
often remains an afterthought and an unwelcome burden to most DW/BI teams.
The basic rhythms of the data warehouse are at odds with the security mentality;
the data warehouse seeks to publish data widely to decision makers, whereas the
security interests assume data should be restricted to those with a need to know.
Additionally, security must be extended to physical backups. If the media can easily
be removed from the backup vault, then security has been compromised as effec-
tively as if the online passwords were compromised.
During the requirements roundup, the DW/BI team should seek clear guidance
from senior management as to what aspects of the DW/BI system carry extra secu-
rity sensitivity. If these issues have never been examined, it is likely the question
will be tossed back to the team. That is the moment when an experienced security
manager should be invited to join the design team. Compliance requirements are
likely to overlap security requirements; it may be wise to combine these two topics
during the requirements roundup.
NOTE You should expand the compliance checklist to encompass known secu-
rity and privacy requirements.
Data Integration
Data integration is a huge topic for IT because, ultimately, it aims to make all sys-
tems seamlessly work together. The “360 degree view of the enterprise” is a famil-
iar name for data integration. In many cases, serious data integration must take
place among the organization’s primary transaction systems before data arrives at
the data warehouse’s back door. But rarely is that data integration complete, unless the
organization has a comprehensive and centralized master data management (MDM)
system, and even then it’s likely other important operational systems exist outside
the primary MDM system.
Data integration usually takes the form of conforming dimensions and con-
forming facts in the data warehouse. Conforming dimensions means establishing
common dimensional attributes across separated databases, so drill-across reports
can be generated using these attributes. Conforming facts means making agree-
ments on common business metrics such as key performance indicators (KPIs)
across separated databases so these numbers can be compared mathematically by
calculating differences and ratios.
NOTE You should use the bus matrix of business processes to generate a priority
list for conforming dimensions (columns of the bus matrix). Annotate each row
of the bus matrix with whether there is a clear executive demand for the busi-
ness process to participate in the integration process, and whether the ETL team
responsible for that business process has agreed.
Data Latency
Data latency describes how quickly source system data must be delivered to the
business users via the DW/BI system. Obviously, data latency requirements have a
huge effect on the ETL architecture. Clever processing algorithms, parallelization,
and potent hardware can speed up traditional batch-oriented data flows. But at
some point, if the data latency requirement is sufficiently urgent, the ETL system’s
architecture must convert from batch to microbatch or streaming-oriented. This
switch isn’t a gradual or evolutionary change; it’s a major paradigm shift in which
almost every step of the data delivery pipeline must be re-implemented.
NOTE You should list all legitimate and well-vetted business demands for data
that must be provided on a daily basis, on a many times per day basis, within a few
seconds, or instantaneously. Annotate each demand with whether the business
community understands the data quality trade-offs associated with their particu-
lar choice. Near the end of Chapter 20: ETL System Design and Development
Process and Tasks, we discuss data quality compromises caused by low latency
requirements.
Archiving and Lineage
Archiving and lineage requirements were hinted at in the previous compliance and
security sections. Even without the legal requirements for saving data, every data
warehouse needs various copies of old data, either for comparisons with new
data to generate change capture records or for reprocessing. We recommend staging
the data (writing it to disk) after each major activity of the ETL pipeline: after it’s
been extracted, cleaned and conformed, and delivered.
So when does staging turn into archiving where the data is kept indefinitely on
some form of permanent media? Our simple answer is a conservative answer. All
staged data should be archived unless a conscious decision is made that specific
data sets will never be recovered in the future. It’s almost always less problematic
to read the data from permanent media than it is to reprocess the data through
the ETL system at a later time. And, of course, it may be impossible to reprocess
the data according to the old processing algorithms if enough time has passed
or the original extraction cannot be re-created.
And while we are at it, each staged/archived data set should have accompanying
metadata describing the origins and processing steps that produced the data. Again,
the tracking of this lineage is explicitly required by certain compliance requirements
but should be part of every archiving situation.
NOTE You should list the data sources and intermediate data steps that will be
archived, together with retention policies, and compliance, security, and privacy
constraints.
BI Delivery Interfaces
The final step for the ETL system is the handoff to the BI applications. We take a
strong and disciplined position on this handoff. We believe the ETL team, working
closely with the modeling team, must take responsibility for the content and struc-
ture of the data that makes the BI applications simple and fast. This attitude is more
than a vague motherhood statement. We believe it’s irresponsible to hand off data to
the BI application in such a way as to increase the complexity of the application, slow
down the query or report creation, or make the data seem unnecessarily complex
to the business users. The most elementary and serious error is to hand across a
full-blown, normalized physical model and walk away from the job. This is why we
go to such lengths to build dimensional structures that comprise the final handoff.
The ETL team and data modelers need to closely work with the BI application
developers to determine the exact requirements for the data handoff. Each BI tool
has certain sensitivities that should be avoided and certain features that can be
exploited if the physical data is in the right format. The same considerations apply
to data prepared for OLAP cubes.
NOTE You should list all fact and dimension tables that will be directly exposed
to your BI tools. This should come directly from the dimensional model specifi-
cation. List all OLAP cubes and special database structures required by BI tools.
List all known indexes and aggregations you have agreed to build to support BI
performance.
Available Skills
Some ETL system design decisions must be made on the basis of available resources
to build and manage the system. You shouldn’t build a system that depends on
critical C++ processing modules if those programming skills aren’t in-house or can’t
be reasonably acquired. Likewise, you may be much more confident in building
the ETL system around a major vendor’s ETL tool if you already have those skills
in-house and know how to manage such a project.
Consider the big decision of whether to hand code the ETL system or use a ven-
dor’s ETL package. Technical issues and license costs aside, don’t go off in a direction
that your employees and managers find unfamiliar without seriously considering
the decision’s long-term implications.
NOTE You should inventory your department’s operating system, ETL tool,
scripting language, programming language, SQL, DBMS, and OLAP skills so you
understand how exposed you are to a shortage or loss of these skills. List those
skills required to support your current systems and your likely future systems.
Legacy Licenses
Finally, in many cases, major design decisions will be made implicitly by senior
management’s insistence that you use existing legacy licenses. In many cases, this
requirement is one you can live with because the environmental advantages are clear
to everyone. But in a few cases, the use of a legacy license for ETL development is
a mistake. This is a difficult position to be in, and if you feel strongly enough, you
may need to bet your job. If you must approach senior management and challenge
the use of an existing legacy license, be well prepared in making the case, and be
willing to accept the final decision or possibly seek employment elsewhere.
NOTE You should list your legacy operating system, ETL tool, scripting lan-
guage, programming language, SQL, DBMS, and OLAP licenses and whether their
exclusive use is mandated or merely recommended.
The 34 Subsystems of ETL
With an understanding of the existing requirements, realities, and constraints,
you’re ready to learn about the 34 critical subsystems that form the architec-
ture for every ETL system. This chapter describes all 34 subsystems with equal
emphasis. The next chapter then describes the practical steps of implementing
those subsystems needed for each particular situation. Although we have adopted
the industry vernacular, ETL, to describe these steps, the process really has four
major components:
Extracting. Gathering raw data from the source systems and usually writing
it to disk in the ETL environment before any significant restructuring of the
data takes place. Subsystems 1 through 3 support the extracting process.
Cleaning and conforming. Sending source data through a series of processing
steps in the ETL system to improve the quality of the data received from the
source, and merging data from two or more sources to create and enforce con-
formed dimensions and conformed metrics. Subsystems 4 through 8 describe
the architecture required to support the cleaning and conforming processes.
Delivering. Physically structuring and loading the data into the presentation
server’s target dimensional models. Subsystems 9 through 21 provide the
capabilities for delivering the data to the presentation server.
Managing. Managing the related systems and processes of the ETL
environment in a coherent manner. Subsystems 22 through 34 describe the
components needed to support the ongoing management of the ETL system.
Extracting: Getting Data into the Data
Warehouse
To no surprise, the initial subsystems of the ETL architecture address the issues of
understanding your source data, extracting the data, and transferring it to the data
warehouse environment where the ETL system can operate on it independent of
the operational systems. Although the remaining subsystems focus on the trans-
forming, loading, and system management within the ETL environment, the initial
subsystems interface to the source systems for access to the required data.
Subsystem 1: Data Profiling
Data profiling is the technical analysis of data to describe its content, consistency,
and structure. In some sense, any time you perform a SELECT DISTINCT investiga-
tive query on a database field, you are doing data profiling. There are a variety of
tools specifically designed to do powerful profiling. It probably pays to invest in
a tool rather than roll your own because the tools enable many data relationships
to be easily explored with simple user interface gestures. You can be much more
productive in the data profiling stages of a project using a tool rather than hand
coding all the data content questions.
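As a flavor of what even simple hand-coded profiling involves, here is a minimal sketch in Python against a staged copy of the source; the database, table, and column names are hypothetical. A dedicated profiling tool performs this kind of analysis across every column and relationship with far less effort.

import sqlite3

# Minimal hand-rolled profiling of a single column: row count, null count,
# and distinct values.
conn = sqlite3.connect("staged_source_extract.db")    # hypothetical staged copy
table, column = "order_header", "ord_meth"             # hypothetical names

row_count, = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
null_count, = conn.execute(
    f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL").fetchone()
distinct_values = [v for (v,) in conn.execute(
    f"SELECT DISTINCT {column} FROM {table} ORDER BY 1")]

print(f"{table}.{column}: {row_count} rows, {null_count} nulls, "
      f"{len(distinct_values)} distinct values: {distinct_values[:20]}")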
Data profiling plays two distinct roles: strategic and tactical. As soon as a can-
didate data source is identified, a light profiling assessment should be made to
determine its suitability for inclusion in the data warehouse and provide an early
go/no go decision. Ideally, this strategic assessment should occur immediately after
identifying a candidate data source during the business requirements analysis. Early
disqualification of a data source is a responsible step that can earn you respect from
the rest of the team, even if it is bad news. A late revelation that the data source
doesn’t support the mission can knock the DW/BI initiative off its tracks (and be a
potentially fatal career outcome for you), especially if this revelation occurs months
into a project.
After the basic strategic decision is made to include a data source in the project, a
lengthy tactical data profiling effort should occur to squeeze out as many problems
as possible. Usually, this task begins during the data modeling process and extends
into the ETL system design process. Sometimes, the ETL team is expected to include
a source with content that hasn’t been thoroughly evaluated. Systems may support
the needs of the production processes, yet present ETL challenges, because fields
that aren’t central to production processing may be unreliable and incomplete for
analysis purposes. Issues that show up in this subsystem result in detailed specifi-
cations that are either 1) sent back to the originator of the data source as requests
for improvement or 2) form requirements for the data quality processing described
in subsystems 4 through 8.
The profiling step provides the ETL team with guidance as to how much data
cleaning machinery to invoke and protects them from missing major project mile-
stones due to the unexpected diversion of building systems to deal with dirty data.
Do the data profiling upfront! Use the data profiling results to set the business
sponsors’ expectations regarding realistic development schedules, limitations in the
source data, and the need to invest in better source data capture practices.
Subsystem 2: Change Data Capture System
During the data warehouse’s initial historic load, capturing source data content
changes is not important because you load all data from a point in time forward.
However, many data warehouse tables are so large that they cannot be refreshed dur-
ing every ETL cycle. You must have a capability to transfer only the relevant changes
to the source data since the last update. Isolating the latest source data is called
change data capture (CDC). The idea behind CDC is simple enough: Just transfer
the data that has changed since the last load. But building a good CDC system is
not as easy as it sounds. The key goals for the change data capture subsystem are:
Isolate the changed source data to allow selective processing rather than a
complete refresh.
Capture all changes (deletions, edits, and insertions) made to the source data,
including changes made through nonstandard interfaces.
Tag changed data with reason codes to distinguish error corrections from
true updates.
Support compliance tracking with additional metadata.
Perform the CDC step as early as possible, preferably before a bulk data
transfer to the data warehouse.
Capturing data changes is far from a trivial task. You must carefully evaluate
your strategy for each data source. Determining the appropriate strategy to identify
changed data may take some detective work. The data profiling tasks described
earlier can help the ETL team make this determination. There are several ways to
capture source data changes, each effective in the appropriate situation, including:
Audit Columns
In some cases, the source system includes audit columns that store the date and time
a record was added or modified. These columns are usually populated via database
triggers that are fired off automatically as records are inserted or updated. Sometimes,
for performance reasons, the columns are populated by the source application instead
of database triggers. When these fields are loaded by any means other than database
triggers, pay special attention to their integrity, analyzing and testing each column
to ensure that it’s a reliable source to indicate change. If you uncover any NULL
values, you must find an alternative approach for detecting change. The most com-
mon situation that prevents the ETL system from using audit columns is when the
fields are populated by the source application, but the DBA team allows back-end
scripts to modify data. If this occurs in your environment, you face a high risk of
missing changed data during the incremental loads. Finally, you need to understand
what happens when a record is deleted from the source because querying the audit
column may not capture this event.
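A minimal sketch of audit-column-based change data capture in Python, assuming a hypothetical last_modified_date audit column on the source table and a watermark recorded by the previous load; none of these names come from a real system.

import sqlite3

# Hypothetical audit-column CDC: pull only rows touched since the last
# successful load, using a watermark kept in an ETL control table.
conn = sqlite3.connect("source_system.db")             # hypothetical source copy
last_load_time = "2013-01-15 02:00:00"                 # read from the control table in practice

# Verify the audit column is reliably populated before trusting it.
nulls, = conn.execute(
    "SELECT COUNT(*) FROM order_header WHERE last_modified_date IS NULL").fetchone()
if nulls:
    raise RuntimeError("Audit column contains NULLs; choose another CDC strategy")

changed_rows = conn.execute(
    "SELECT * FROM order_header "
    "WHERE created_date > ? OR last_modified_date > ?",
    (last_load_time, last_load_time)).fetchall()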
Timed Extracts
With a timed extract, you typically select all rows where the create or modified
date fields equal SYSDATE-1, meaning all of yesterday’s records. Sounds perfect,
right? Wrong. Loading records based purely on time is a common mistake made by
inexperienced ETL developers. This process is horribly unreliable. Time-based data
selection loads duplicate rows when it is restarted from mid-process failures. This
means manual intervention and data cleanup is required if the process fails for any
reason. Meanwhile, if the nightly load process fails to run and skips a day, there’s a
risk that the missed data will never make it into the data warehouse.
Full Diff Compare
A full diff compare keeps a full snapshot of yesterday’s data, and compares it, record
by record, against today’s data to find what changed. The good news is this tech-
nique is thorough: You are guaranteed to find every change. The obvious bad news is
that, in many cases, this technique is very resource-intensive. If a full diff compare
is required, try to do the comparison on the source machine, so you don’t have
to transfer the entire table or database into the ETL environment. Of course, the
source support folks may have an opinion about this. Also, investigate using cyclic
redundancy checksum (CRC) algorithms to quickly tell if a complex record has
changed without examining each individual field.
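A minimal sketch of the CRC idea in Python: compute a checksum over each record's non-key fields, then compare today's checksums against yesterday's snapshot to flag inserts and updates, while keys that disappear indicate deletes. The record layout is hypothetical.

import zlib

def record_crc(record: dict, key: str = "order_id") -> int:
    # Checksum over all non-key fields; any change in the record changes the CRC.
    payload = "|".join(f"{k}={record[k]}" for k in sorted(record) if k != key)
    return zlib.crc32(payload.encode("utf-8"))

# Hypothetical snapshots: yesterday's staged copy versus today's extract.
yesterday_rows = [{"order_id": 1, "status": "open"}, {"order_id": 2, "status": "open"}]
today_rows = [{"order_id": 1, "status": "shipped"}, {"order_id": 3, "status": "open"}]

yesterday_crc = {r["order_id"]: record_crc(r) for r in yesterday_rows}
inserts_and_updates = [r for r in today_rows
                       if record_crc(r) != yesterday_crc.get(r["order_id"])]
deleted_keys = set(yesterday_crc) - {r["order_id"] for r in today_rows}

print(inserts_and_updates)   # order 1 changed, order 3 is new
print(deleted_keys)          # order 2 disappeared from the source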
Database Log Scraping
Log scraping effectively takes a snapshot of the database redo log at a scheduled
point in time (usually midnight) and scours it for transactions affecting the tables
of interest for the ETL load. Sniffing involves polling the redo log, capturing
transactions on-the-fly. Scraping the log for transactions is probably the messiest of
all techniques. It’s not uncommon for transaction logs to get full and prevent new
transactions from processing. When this happens in a production transaction envi-
ronment, the knee-jerk reaction from the responsible DBA may be to empty the log
so that business operations can resume, but when a log is emptied, all transactions
within it are lost. If you’ve exhausted all other techniques and find log scraping
is your last resort for finding new or changed records, persuade the DBA to create
a special log to meet your specific needs.
Message Queue Monitoring
In a message-based transaction system, the queue is monitored for all transactions
against the tables of interest. The contents of the stream are similar to what you get
with log sniffing. One benefit of this process is relatively low overhead, assuming
the message queue is already in place. However, there may be no replay feature
on the message queue. If the connection to the message queue is lost, you lose data.
Subsystem 3: Extract System
Obviously, extracting data from the source systems is a fundamental component
of the ETL architecture. If you are extremely lucky, all the source data will be in a
single system that can be readily extracted using an ETL tool. In the more common
situation, each source might be in a different system, environment, and/or DBMS.
The ETL system might be expected to extract data from a wide variety of systems
involving many different types of data and inherent challenges. Organizations need-
ing to extract data from mainframe environments often run into issues involving
COBOL copybooks, EBCDIC to ASCII conversions, packed decimals, redefines,
OCCURS fields, and multiple and variable record types. Other organizations might
need to extract from sources in relational DBMS, flat files, XML sources, web logs,
or a complex ERP system. Each presents a variety of possible challenges. Some
sources, especially older legacy systems, may require the use of different procedural
languages than the ETL tool can support or the team is experienced with. In this
situation, request that the owner of the source system extract the data into a flat
file format.
NOTE Although XML-formatted data has many advantages because it is self-
describing, you may not want it for large, frequent data transfers. The payload
portion of a typical XML formatted file can be less than 10 percent of the total
file. The exception to this recommendation could be where the XML payload is
a complex, deeply hierarchical XML structure, such as an industry standard data
exchange. In these cases, the DW/BI team must decide whether to “shred” the XML
into a large number of destination tables or persist the XML structure within the
data warehouse. Recent advances in RDBMS vendors’ support for XML via XPath
have made this latter option feasible.
There are two primary methods for getting data from a source system: as a file
or a stream. If the source is an aging mainframe system, it is often easier to extract
into files and then move those files to the ETL server.
NOTE If the source data is unstructured, semistructured, or even hyperstruc-
tured “big data,” then rather than loading such data as an un-interpretable RDBMS
“blob,” it is often more effective to create a MapReduce/Hadoop extract step that
behaves as an ETL fact extractor from the source data, directly delivering loadable
RDBMS data.
If you use an ETL tool and the source data is in a database (not necessarily an
RDBMS), you may set up the extract as a stream where the data flows out of the
source system, through the transformation engine, and into the staging database
as a single process. By contrast, an extract-to-file approach consists of three or four
discrete steps: extract to the file, move the file to the ETL server, transform the file
contents, and load the transformed data into the staging database.
NOTE Although the stream extract is more appealing, extracts to file have some
advantages. They are easy to restart at various points. As long as you save the extract
file, you can rerun the load without impacting the source system. You can easily
encrypt and compress the data before transferring across the network. Finally, it
is easy to verify that all data has moved correctly by comparing file row counts
before and after the transfer. Generally, we recommend a data transfer utility such
as FTP to move the extracted file.
Data compression is important if large amounts of data need to be transferred over
a significant distance or through a public network. In this case, the communications
link is often the bottleneck. If too much time is spent transmitting the data, com-
pression can reduce the transmission time by 30 to 50 percent or more, depending
on the nature of the original data file.
Data encryption is important if data is transferred through a public network, or
even internally in some situations. If this is the case, it is best to send everything
through an encrypted link and not worry about what needs to be secure and what
doesn’t. Remember to compress before encrypting because encrypted files do not
compress very well.
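A minimal sketch of the compress-then-encrypt ordering in Python, using the standard gzip module and the third-party cryptography package (assumed to be installed); in practice the encryption key would come from your key management process rather than being generated inline, and the file names are hypothetical.

import gzip
from cryptography.fernet import Fernet   # third-party package, assumed installed

with open("daily_extract.csv", "rb") as f:       # hypothetical extract file
    raw = f.read()

compressed = gzip.compress(raw)                  # compress first...
key = Fernet.generate_key()                      # ...key management in practice
encrypted = Fernet(key).encrypt(compressed)      # ...then encrypt for transfer

with open("daily_extract.csv.gz.enc", "wb") as f:
    f.write(encrypted)

# Encrypting first would leave high-entropy bytes that barely compress at all.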
Cleaning and Conforming Data
Cleaning and conforming data are critical ETL system tasks. These are the steps
where the ETL system adds value to the data. The other activities, extracting and
delivering data, are obviously necessary, but they simply move and load the data.
The cleaning and conforming subsystems actually change data and enhance its
value to the organization. In addition, these subsystems can be architected to create
metadata used to diagnose what’s wrong with the source systems. Such diagnoses
can eventually lead to business process reengineering initiatives to address the root
causes of dirty data and improve data quality over time.
Improving Data Quality Culture and Processes
It is tempting to blame the original data source for any and all errors that appear
downstream. If only the data entry clerks were more careful! We are only slightly
more forgiving of keyboard-challenged salespeople who enter customer and product
information into their order forms. Perhaps you can fix the data quality problem by
imposing constraints on the data entry user interfaces. This approach provides a
hint about how to think about fixing data quality because a technical solution often
avoids the real problem. Suppose Social Security number fields for customers were
often blank or filled with garbage on an input screen. Someone comes up with the brilliant
idea to require input in the 999-99-9999 format, and to cleverly disallow nonsensi-
cal entries such as all 9s. What happens? The data entry clerks are forced to supply
valid Social Security numbers to progress to the next screen, so when they don’t have
the customer’s number, they type in an artificial number that passes the roadblock.
Michael Hammer, in his revolutionary book Reengineering the Corporation
(Collins, revised 2003), struck the heart of the data quality problem with a bril-
liant observation. Paraphrasing Hammer: “Seemingly small data quality issues are,
in reality, important indications of broken business processes.” Not only does this
insight correctly focus your attention on the source of data quality problems, but it
also shows you the way to the solution.
Technical attempts to address data quality will not prevail unless they are part
of an overall quality culture that must come from the top of an organization. The
famous Japanese car manufacturing quality attitude permeates every level of those
organizations, and quality is embraced enthusiastically by all levels, from the CEO
to the assembly line worker. To cast this in a data context, imagine a company such
as a large drugstore chain, where a team of buyers contracts with thousands of sup-
pliers to provide the inventory. The buyers have assistants, whose job it is to enter
the detailed descriptions of everything purchased by the buyers. These descriptions
contain dozens of attributes. But the problem is the assistants have a deadly job and
are judged on how many items they enter per hour. The assistants have almost no
awareness of who uses their data. Occasionally, the assistants are scolded for obvious
errors. But more insidiously, the data given to the assistants is itself incomplete and
unreliable. For example, there are no formal standards for toxicity ratings, so there
is significant variation over time and over product categories for this attribute. How
does the drugstore improve data quality? Here is a nine-step template, not only for
the drugstore, but for any organization addressing data quality:
Declare a high-level commitment to a data quality culture.
Drive process reengineering at the executive level.
Spend money to improve the data entry environment.
Spend money to improve application integration.
Spend money to change how processes work.
Promote end-to-end team awareness.
Promote interdepartmental cooperation.
Publicly celebrate data quality excellence.
Continuously measure and improve data quality.
At the drugstore, money needs to be spent to improve the data entry system, so
it provides the content and choices needed by the buyers’ assistants. The company’s
executives need to assure the buyers’ assistants that their work is important and
affects many decision makers in a positive way. Diligent efforts by the assistants
should be publicly praised and rewarded. And end-to-end team awareness and
appreciation of the business value derived from quality data is the final goal.
Subsystem 4: Data Cleansing System
The ETL data cleansing process is often expected to fix dirty data, yet at the same
time the data warehouse is expected to provide an accurate picture of the data as it
was captured by the organization’s production systems. Striking the proper balance
between these conflicting goals is essential.
One of our goals in describing the cleansing system is to offer a comprehensive
architecture for cleansing data, capturing data quality events, as well as measuring
and ultimately controlling data quality in the data warehouse. Some organizations
may find this architecture challenging to implement, but we are convinced it is
important for the ETL team to make a serious effort to incorporate as many of these
capabilities as possible. If you are new to ETL and find this a daunting challenge,
you might well wonder, “What’s the minimum I should focus on?” The answer is
to start by undertaking the best possible data profiling analysis. The results of that
effort can help you understand the risks of moving forward with potentially dirty
or unreliable data and help you determine how sophisticated your data cleansing
system needs to be.
The purpose of the cleansing subsystems is to marshal technology to support
data quality. Goals for the subsystem should include:
Early diagnosis and triage of data quality issues
Requirements for source systems and integration efforts to supply better data
Provide specific descriptions of data errors expected to be encountered in ETL
Framework for capturing all data quality errors and precisely measuring data
quality metrics over time
Attachment of quality confidence metrics to final data
Quality Screens
The heart of the ETL architecture is a set of quality screens that act as diagnostic
filters in the data flow pipelines. Each quality screen is a test. If the test against the
data is successful, nothing happens and the screen has no side effects. But if the test
fails, then it must drop an error event row into the error event schema and choose
to either halt the process, send the offending data into suspension, or merely tag
the data.
Although all quality screens are architecturally similar, it is convenient to divide
them into three types, in ascending order of scope. Jack Olson, in his seminal book
Data Quality: The Accuracy Dimension (Morgan Kaufmann, 2002), classified data
quality screens into three categories: column screens, structure screens, and busi-
ness rule screens.
Column screens test the data within a single column. These are usually simple,
somewhat obvious tests, such as testing whether a column contains unexpected
null values, if a value falls outside of a prescribed range, or if a value fails to adhere
to a required format.
Structure screens test the relationship of data across columns. Two or more
attributes may be tested to verify they implement a hierarchy, such as a series of
many-to-one relationships. Structure screens also test foreign key/primary key rela-
tionships between columns in two tables, and also include testing whole blocks of
columns to verify they implement valid postal addresses.
Business rule screens implement more complex tests that do not fit the simpler
column or structure screen categories. For example, a customer profile may be tested
for a complex time-dependent business rule, such as requiring a lifetime platinum
frequent flyer to have been a member for at least five years and have flown more
than 2 million miles. Business rule screens also include aggregate threshold data
quality checks, such as checking to see if a statistically improbable number of MRI
examinations have been ordered for minor diagnoses like a sprained elbow. In this
case, the screen throws an error only after a threshold of such MRI exams is reached.
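A minimal sketch in Python of how the three screen types might be expressed as tests against an incoming row; the column names and rules are purely illustrative, and each failure would drop a row into the error event schema described in subsystem 5.

import re

def column_screen(row):
    # Column screen: single-column tests, such as null checks and format checks.
    errors = []
    if row.get("zip_code") is None:
        errors.append("zip_code is null")
    elif not re.fullmatch(r"\d{5}(-\d{4})?", row["zip_code"]):
        errors.append("zip_code fails required format")
    return errors

def structure_screen(row, valid_zip_prefixes_by_state):
    # Structure screen: cross-column relationship, such as state/zip consistency.
    prefix = (row.get("zip_code") or "")[:3]
    if prefix not in valid_zip_prefixes_by_state.get(row.get("state"), set()):
        return ["zip_code does not belong to state"]
    return []

def business_rule_screen(row):
    # Business rule screen: a more complex, business-defined test.
    if row.get("tier") == "lifetime platinum" and row.get("lifetime_miles", 0) < 2_000_000:
        return ["lifetime platinum customer with fewer than 2 million miles"]
    return []

row = {"zip_code": "94025", "state": "CA", "tier": "lifetime platinum", "lifetime_miles": 1_500_000}
for err in column_screen(row) + structure_screen(row, {"CA": {"940", "941"}}) + business_rule_screen(row):
    print("error event:", err)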
Responding to Quality Events
We have already remarked that each quality screen has to decide what happens
when an error is thrown. The choices are: 1) halting the process; 2) sending the
o ending record(s) to a suspense fi le for later processing; and 3) merely tagging
the data and passing it through to the next step in the pipeline. The third choice is
by far the best choice, whenever possible. Halting the process is obviously a pain
because it requires manual intervention to diagnose the problem, restart or resume
the job, or abort completely. Sending records to a suspense fi le is often a poor solu-
tion because it is not clear when or if these records will be fi xed and re-introduced
to the pipeline. Until the records are restored to the data fl ow, the overall integrity
of the database is questionable because records are missing. We recommend not
using the suspense fi le for minor data transgressions. The third option of tagging
the data with the error condition often works well. Bad fact table data can be tagged
with the audit dimension, as described in subsystem 6. Bad dimension data can
also be tagged using an audit dimension, or in the case of missing or garbage data
can be tagged with unique error values in the attribute itself.
Subsystem 5: Error Event Schema
The error event schema is a centralized dimensional schema whose purpose is to
record every error event thrown by a quality screen anywhere in the ETL pipeline.
Although we focus on data warehouse ETL processing, this approach can be used in
generic data integration (DI) applications where data is being transferred between
legacy applications. The error event schema is shown in Figure 19-1.
The main table is the error event fact table. Its grain is every error thrown (pro-
duced) by a quality screen anywhere in the ETL system. Remember the grain of
a fact table is the physical description of why a fact table row exists. Thus every
quality screen error produces exactly one row in this table, and every row in the
table corresponds to an observed error.
The dimensions of the error event fact table include the calendar date of the
error, the batch job in which the error occurred, and the screen that produced
the error. The calendar date is not a minute and second time stamp of the error,
but rather provides a way to constrain and summarize error events by the usual
attributes of the calendar, such as weekday or last day of a fiscal period. The error
date/time fact is a full relational date/time stamp that specifies precisely when the
error occurred. This format is useful for calculating the time interval between error
events because you can take the difference between two date/time stamps to get the
number of seconds separating events.
[Figure 19-1 shows the error event schema: an error event fact table (Error Event Key PK, Error Event Date Key FK, Screen Key FK, Batch Key FK, Error Date/Time, Severity Score) joined to date, batch, and screen dimensions, where the screen dimension carries Screen Type, ETL Module, Screen Processing Definition, and Exception Action, plus a lower grain error event detail fact table (Error Event Key FK, Error Event Date Key FK, Screen Key FK, Batch Key FK, Error Date/Time, Table Key FK, Field Key FK, Record Identifier Key FK, Error Condition).]
Figure 19-1: Error event schema.
The batch dimension can be generalized to be a processing step in cases in which
data is streamed, rather than batched. The screen dimension identifies precisely
what the screen criterion is and where the code for the screen resides. It also defines
what to do when the screen throws an error. (For example, halt the process, send
the record to a suspense file, or tag the data.)
The error event fact table also has a single column primary key, shown as the
error event key. This surrogate key, like dimension table primary keys, is a simple
integer assigned sequentially as rows are added to the fact table. This key column
is necessary in those situations in which an enormous burst of error rows is added
to the error event fact table all at once. Hopefully this won’t happen to you.
The error event schema includes a second error event detail fact table at a lower
grain. Each row in this table identifies an individual field in a specific record that
participated in an error. Thus a complex structure or business rule error that triggers
a single error event row in the higher level error event fact table may generate many
rows in this error event detail fact table. The two tables are tied together by the error
event key, which is a foreign key in this lower grain table. The error event detail
table identifies the table, record, field, and precise error condition. Thus a complete
description of complex multi-field, multi-record errors is preserved by these tables.
The error event detail table could also contain a precise date/time stamp to
provide a full description of aggregate threshold error events where many records
generate an error condition over a period of time. You should now appreciate that
each quality screen has the responsibility for populating these tables at the time
of an error.
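To make the two grains concrete, here is an illustrative sketch of the rows a single structure screen failure might produce; all keys and values are hypothetical.

# One failed screen produces exactly one error event fact row...
error_event = {
    "error_event_key": 50001,              # sequentially assigned surrogate key
    "error_event_date_key": 20130115,
    "screen_key": 14,                      # the screen that threw the error
    "batch_key": 872,
    "error_date_time": "2013-01-15 02:14:37",
    "severity_score": 3,
}

# ...and possibly several detail rows, one per field that participated in the error.
error_event_details = [
    {"error_event_key": 50001, "table_key": 7, "field_key": 113,
     "record_identifier_key": 998877, "error_condition": "state/zip mismatch"},
    {"error_event_key": 50001, "table_key": 7, "field_key": 114,
     "record_identifier_key": 998877, "error_condition": "state/zip mismatch"},
]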
Subsystem 6: Audit Dimension Assembler
The audit dimension is a special dimension that is assembled in the back room
by the ETL system for each fact table, as we discussed in Chapter 6: Order
Management. The audit dimension in Figure 19-2 contains the metadata context at
the moment when a specific fact table row is created. You might say we have elevated
metadata to real data! To visualize how audit dimension rows are created, imagine
this shipments fact table is updated once per day from a batch file. Suppose today
you have a perfect run with no errors flagged. In this case, you would generate only
one audit dimension row, and it would be attached to every fact row loaded today.
All the categories, scores, and version numbers would be the same.
[Figure 19-2 shows a shipments fact table (Ship Date Key, Customer Key, Product Key, and other foreign keys; an Audit Key foreign key; Order Number and Order Line Number degenerate dimensions; and the facts) joined to an audit dimension containing Audit Key (PK), Overall Quality Rating, Complete Flag, Validation Flag, Out Of Bounds Flag, Screen Failed Flag, Record Modified Flag, ETL Master Version Number, and Allocation Version Number.]
Figure 19-2: Sample audit dimension attached to a fact table.
Now let’s relax the strong assumption of a perfect run. If you had some fact
rows whose discount dollars triggered an out-of-bounds error, then one more audit
dimension row would be needed to flag this condition.
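A minimal sketch of the idea in Python: the load assembles one audit dimension row per distinct quality context observed in the batch and stamps the matching audit key onto each fact row; the attribute values and the out-of-bounds rule are illustrative only.

# Two audit contexts for today's load: clean rows and rows that tripped an
# out-of-bounds screen. Each fact row carries the matching audit key.
audit_rows = {
    "clean": {"audit_key": 901, "overall_quality_rating": "Good",
              "out_of_bounds_flag": "N", "etl_master_version_number": "8.3"},
    "out_of_bounds": {"audit_key": 902, "overall_quality_rating": "Suspect",
                      "out_of_bounds_flag": "Y", "etl_master_version_number": "8.3"},
}

def attach_audit_key(fact_row):
    context = "out_of_bounds" if fact_row["discount_dollars"] > 10_000 else "clean"
    fact_row["audit_key"] = audit_rows[context]["audit_key"]
    return fact_row

facts = [attach_audit_key(r) for r in [{"order_number": 1, "discount_dollars": 50},
                                       {"order_number": 2, "discount_dollars": 25_000}]]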
Subsystem 7: Deduplication System
Often dimensions are derived from several sources. This is a common situation
for organizations that have many customer-facing source systems that create and
manage separate customer master tables. Customer information may need to be
merged from several lines of business and outside sources. Sometimes, the data can
be matched through identical values in some key column. However, even when a
definitive match occurs, other columns in the data might contradict one another,
requiring a decision on which data should survive.
Unfortunately, there is seldom a universal column that makes the merge operation
easy. Sometimes, the only clues available are the similarity of several columns. The
different sets of data being integrated and the existing dimension table data may need
to be evaluated on different fields to attempt a match. Sometimes, a match may be
based on fuzzy criteria, such as names and addresses that may nearly match except
for minor spelling differences.
Survivorship is the process of combining a set of matched records into a unified
image that combines the highest quality columns from the matched records into a
conformed row. Survivorship involves establishing clear business rules that define
the priority sequence for column values from all possible source systems to enable the
creation of a single row with the best-survived attributes. If the dimensional design
is fed from multiple systems, you must maintain separate columns with back refer-
ences, such as natural keys, to all participating source systems used to construct
the row.
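As an illustration of the survivorship rules just described, here is a minimal Python sketch that picks the best value for each column from a set of matched source records according to a per-column source priority list, carrying back references to every source natural key. The source names, column names, and priorities are hypothetical assumptions.

```python
# Illustrative survivorship sketch: combine matched records into one conformed row,
# choosing each column value from the highest-priority source that supplies it.
# Source names, column names, and priorities are assumptions for this example.

SURVIVORSHIP_RULES = {
    "name":    ["crm", "billing", "web"],    # CRM assumed most trustworthy for names
    "address": ["billing", "crm", "web"],    # billing assumed most trustworthy for addresses
    "email":   ["web", "crm", "billing"],
}

def survive(matched_records):
    """matched_records: list of dicts, each with 'source' and 'natural_key' keys plus attribute columns."""
    by_source = {rec["source"]: rec for rec in matched_records}
    survived = {
        # Keep back references (natural keys) to every participating source system.
        "source_natural_keys": {rec["source"]: rec["natural_key"] for rec in matched_records}
    }
    for column, priority in SURVIVORSHIP_RULES.items():
        for source in priority:
            value = by_source.get(source, {}).get(column)
            if value:                      # first non-empty value in the priority sequence wins
                survived[column] = value
                break
    return survived

# Example usage with two matched customer records:
print(survive([
    {"source": "crm", "natural_key": "C-1001", "name": "Jane Smyth", "email": ""},
    {"source": "billing", "natural_key": "B-778", "name": "J. Smith", "address": "12 Elm St"},
]))
```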
There are a variety of data integration and data standardization tools to consider
if you have difficult deduplicating, matching, and survivorship data issues. These
tools are quite mature and in widespread use.
Subsystem 8: Conforming System
Conforming consists of all the steps required to align the content of some or all the
columns in a dimension with columns in similar or identical dimensions in other
parts of the data warehouse. For instance, in a large organization you may have fact
tables capturing invoices and customer service calls that both utilize the customer
dimension. It is highly likely the source systems for invoices and customer service
have separate customer databases. It is likely there will be little guaranteed consis-
tency between the two sources of customer information. The data from these two
customer sources needs to be conformed to make some or all the columns describing
customer share the same domains.
NOTE The process of creating conformed dimensions aligns with an agile
approach. For two dimensions to be conformed, they must share at least one
common attribute with the same name and same contents. You can start with a
single conformed attribute such as Customer Category and systematically add this
column in a nondisruptive way to customer dimensions in each of the customer-
facing processes. As you augment each customer-facing process, you expand the
list of processes that are integrated and can participate in drill-across queries. You
can also incrementally grow the list of conformed attributes, such as city, state, and
country. All this can be staged to align with a more agile implementation approach.
Figure 19-3: Deduplicating and survivorship processing for conformed dimension process. [Figure: each source (Source 1, Source 2, Source 3) is extracted using an adapter, cleaned and locally deduplicated, then conformed and survived; the results are merged and globally deduped into a conformed dimension ready for delivery through a replication engine. Special contents: (1) dimension version number, (2) back pointers to all source natural keys.]
The conforming subsystem is responsible for creating and maintaining the con-
formed dimensions and conformed facts described in Chapter 4: Inventory. To
accomplish this, incoming data from multiple systems needs to be combined and
integrated, so it is structurally identical, deduplicated, filtered of invalid data, and
standardized in terms of content rows in a conformed image. A large part of the
conforming process is the deduplicating, matching, and survivorship processes
previously described. The conforming process flow combining the deduplicating
and survivorship processing is shown in Figure 19-3.
The process of defining and delivering conformed dimensions and facts is
described later in subsystems 17 (dimension manager) and 18 (fact provider).
Delivering: Prepare for Presentation
The primary mission of the ETL system is the handoff of the dimension and fact
tables in the delivery step. For this reason, the delivery subsystems are the most
pivotal subsystems in the ETL architecture. Although there is considerable variation
in source data structures and cleaning and conforming logic, the delivery process-
ing techniques for preparing the dimensional table structures are more defined and
disciplined. Use of these techniques is critical to building a successful dimensional
data warehouse that is reliable, scalable, and maintainable.
Many of these subsystems focus on dimension table processing. Dimension tables
are the heart of the data warehouse. They provide the context for the fact tables and
hence for all the measurements. Although dimension tables are usually smaller than
the fact tables, they are critical to the success of the DW/BI system as they provide the
entry points into the fact tables. The delivering process begins with the cleaned and
conformed data resulting from the subsystems just described. For many dimensions,
the basic load plan is relatively simple: You perform basic transformations to the
data to build dimension rows for loading into the target presentation table. This
typically includes surrogate key assignment, code lookups to provide appropriate
descriptions, splitting or combining columns to present the appropriate data val-
ues, or joining underlying third normal form table structures into denormalized
flat dimensions.
Preparing fact tables is certainly important because fact tables hold the key mea-
surements of the business that the users want to see. Fact tables can be large and
time-consuming to load. However, preparing fact tables for presentation is typically
more straightforward.
Subsystem 9: Slowly Changing Dimension Manager
One of the more important elements of the ETL architecture is the capability to
implement slowly changing dimension (SCD) logic. The ETL system must determine
how to handle an attribute value that has changed from the value already stored
in the data warehouse. If the revised description is determined to be a legitimate
and reliable update to previous information, the appropriate SCD technique must
be applied.
As described in Chapter 5: Procurement, when the data warehouse receives
notification that an existing row in a dimension has changed, there are three basic
responses: type 1 overwrite, type 2 add a new row, and type 3 add a new column.
The SCD manager should systematically handle the time variance in the dimen-
sions using these three techniques, as well as the other SCD techniques. In addition,
the SCD manager should maintain appropriate housekeeping columns for type 2
changes. Figure 19-4 shows the overall processing flow for handling surrogate key
management for processing SCDs.
Figure 19-4: Processing flow for SCD surrogate key management. [Figure: the source extract is CRC-compared against the master dimension cross-reference. New records not in the cross-reference are inserted with newly assigned surrogate keys and dates/indicator. If the CRCs match, the record is ignored. If the CRCs differ, the specific changed field(s) are found: type 1 or 3 fields update the dimension attribute in place, while type 2 fields insert a new row with a new surrogate key and dates/indicator, update the prior most-recent row, and update the most recent surrogate key map.]
The change data capture process described in subsystem 2 obviously plays an
important role in presenting the changed data to the SCD process. Assuming the
change data capture process has effectively delivered appropriate changes, the SCD
process can take the appropriate actions.
Type 1: Overwrite
The type 1 technique is a simple overwrite of one or more attributes in an existing
dimension row. You take the revised data from the change data capture system
and overwrite the dimension table contents. Type 1 is appropriate when correcting
data or when there is no business need to keep the history of previous values. For
instance, you may receive a corrected customer address. In this case, overwriting is
the right choice. Note that if the dimension table includes type 2 change tracking,
you should overwrite the affected column in all existing rows for that particular
customer. Type 1 updates must be propagated forward from the earliest permanently
stored staging tables to all affected staging tables, so if any of them are used to re-
create the final load tables, the effect of the overwrite is preserved.
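A minimal sketch of a type 1 overwrite follows, assuming a generic DB-API connection and hypothetical table and column names. Because the WHERE clause uses the durable natural key, the corrected value is applied to all rows for the customer, including any prior type 2 history rows.

```python
# Minimal type 1 overwrite sketch. Table and column names are assumptions;
# the "?" parameter style matches the standard-library sqlite3 driver.

def apply_type1_overwrite(conn, natural_key, new_address):
    with conn:  # commit on success, roll back on error
        conn.execute(
            "UPDATE customer_dimension "
            "   SET address = ? "
            " WHERE customer_natural_key = ?",
            (new_address, natural_key),
        )

# Usage with sqlite3 (illustrative only):
# import sqlite3
# conn = sqlite3.connect("warehouse.db")
# apply_type1_overwrite(conn, "CUST-1001", "45 New Street")
```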
Some ETL tools contain UPDATE else INSERT functionality. This functionality
may be convenient for the developer but can be a performance killer. For maximum
performance, existing row UPDATEs should be segregated from new row INSERTs.
If type 1 updates cause performance problems, consider disabling database logging
or use of the DBMS bulk loader.
Type 1 updates invalidate any aggregates built upon the changed column, so the
dimension manager (subsystem 17) must notify the affected fact providers (subsys-
tem 18) to drop and rebuild the affected aggregates.
Type 2: Add New Row
The type 2 SCD is the standard technique for accurately tracking changes in dimen-
sions and associating them correctly with fact rows. Supporting type 2 changes
requires a strong change data capture system to detect changes as soon as they occur.
For type 2 updates, copy the previous version of the dimension row and create a
new dimension row with a new surrogate key. If there is not a previous version of
the dimension row, create a new one from scratch. Then update this row with the
columns that have changed and add any other columns that are needed. This is
the main workhorse technique for handling dimension attribute changes that need
to be tracked over time.
The type 2 ETL process must also update the most recent surrogate key map table,
assuming the ETL tool doesn’t automatically handle this. These little two-column
tables are of immense importance when loading fact table data. Subsystem 14, the
surrogate key pipeline, supports this process.
Refer to Figure 19-4 to see the lookup and key assignment logic for handling a
changed dimension row during the extract process. In this example, the change
data capture process (subsystem 2) uses a CRC compare to determine which
rows have changed in the source data since the last update. If you are lucky, you
already know which dimension records have changed and can omit this CRC
compare step. After you identify rows that have changes in type 2 attributes,
you can generate a new surrogate key from the key sequence and update the
surrogate key map table.
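One common way to implement the CRC compare is to hash the concatenated dimension attributes and compare the result to the hash stored from the prior extract. The sketch below uses Python's zlib.crc32 and assumes a simple in-memory map of prior CRCs keyed by natural key; the row layout is hypothetical.

```python
import zlib

# Illustrative CRC-compare change detection (subsystem 2 feeding the SCD process).
# prior_crcs is assumed to map natural key -> CRC from the previous extract.

def detect_changes(source_rows, prior_crcs, attribute_columns):
    new_rows, changed_rows, unchanged = [], [], []
    for row in source_rows:
        concatenated = "|".join(str(row[col]) for col in attribute_columns)
        crc = zlib.crc32(concatenated.encode("utf-8"))
        previous = prior_crcs.get(row["natural_key"])
        if previous is None:
            new_rows.append(row)              # brand new dimension member
        elif previous != crc:
            changed_rows.append(row)          # something changed; inspect fields for type 1/2/3 handling
        else:
            unchanged.append(row)             # CRCs match: ignore
        prior_crcs[row["natural_key"]] = crc  # remember for the next load
    return new_rows, changed_rows, unchanged
```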
When a new type 2 row is created, you need at least a pair of time stamps, as
well as an optional change description attribute. The pair of time stamps defines a
span of time from the beginning effective time to the ending effective time when
the complete set of dimension attributes is valid. A more sophisticated treatment
of a type 2 SCD row involves adding five ETL housekeeping columns. Referring to
Figure 19-4, this also requires the type 2 ETL process to find the prior effective row
and make appropriate updates to these housekeeping columns:
Change Date (change date as foreign key to date dimension outrigger)
Row E ective Date/Time (exact date/time stamp of change)
Row End Date/Time (exact date/time stamp of next change, defaults to
12/31/9999 for most current dimension row)
Reason for Change column (optional attribute)
Current Flag (current/expired)
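The following sketch shows one way the type 2 insert and the update of the prior most-recent row might look. The housekeeping column names mirror the list above; the surrogate key assignment, row structure, and flag values are assumptions for illustration.

```python
from datetime import datetime

END_OF_TIME = datetime(9999, 12, 31)  # default end date for the most current dimension row

def apply_type2_change(prior_row, changed_attributes, new_surrogate_key,
                       change_reason="Attribute change"):
    """Expire the prior most-recent row and create its type 2 successor.
    prior_row is a dict holding the current dimension row; returns (expired_row, new_row)."""
    now = datetime.now()

    expired_row = dict(prior_row)
    expired_row["row_end_datetime"] = now        # close out the old row's validity span
    expired_row["current_flag"] = "Expired"

    new_row = dict(prior_row)                    # copy the previous version ...
    new_row.update(changed_attributes)           # ... and apply the changed attributes
    new_row["customer_key"] = new_surrogate_key  # brand new surrogate key
    new_row["row_effective_datetime"] = now
    new_row["row_end_datetime"] = END_OF_TIME    # open-ended until the next change
    new_row["current_flag"] = "Current"
    new_row["change_reason"] = change_reason

    return expired_row, new_row
```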
NOTE It is possible that back-end scripts are run within the transaction data-
base to modify data without updating the respective metadata fields such as the
last_modified_date. Using these fields for the dimension time stamps can cause
inconsistent results in the data warehouse. Always use the system or as-of date to
derive the type 2 effective time stamps.
The type 2 process does not change history as the type 1 process does; thus
type 2 changes don’t require rebuilding affected aggregate tables as long as the
change was made “today” and not backward in time.
NOTE Kimball Design Tip #80 (available at www.kimballgroup.com under the
Tools and Utilities tab for this book title) provides in-depth guidance on adding a
row change reason code attribute to dimension tables.
Type 3: Add New Attribute
The type 3 technique is designed to support attribute “soft” changes that allow a
user to refer either to the old value of the attribute or the new value. For example, if
a sales team is assigned to a newly named sales region, there may be a need to track
the old region assignment, as well as the new one. The type 3 technique requires the
ETL system to alter the dimension table to add a new column to the schema, if this
situation was not anticipated. Of course, the DBA assigned to work with the ETL
team will in all likelihood be responsible for this change. You then need to push
the existing column values into the newly created column and populate the original
column with the new values provided to the ETL system. Figure 19-5 shows how
a type 3 SCD is implemented.
Figure 19-5: Type 3 SCD process. [Figure: the product dimension row (Prod Key 1127648, Prod ID A 107B, Denim pants, size 38, Category Men's wear, Color Blue) has a Prior Category field added; the old Category value (Men's wear) is transferred to Prior Category, and the Category column is overwritten with the new value (Leisure wear).]
Similar to the type 1 process, type 3 change updates invalidate any aggregates
built upon the changed column; the dimension manager must notify the affected
fact providers, so they drop and rebuild the affected aggregates.
Type 4: Add Mini-Dimension
The type 4 technique is used when a group of attributes in a dimension change suf-
ficiently rapidly so that they are split off to a mini-dimension. This situation is some-
times called a rapidly changing monster dimension. Like type 3, this situation calls
for a schema change, hopefully done at design time. The mini-dimension requires
its own unique primary key, and both the primary key of the main dimension and
the primary key of the mini-dimension must appear in the fact table. Figure 19-6
shows how a type 4 SCD is implemented.
Figure 19-6: Type 4 SCD process. [Figure: the Sales Fact table (Activity Date Key, Customer Key, Demographics Key, Product Key, Promotion Key, Sales Dollars, Sales Units) carries separate foreign keys to the Customer Dimension (Customer Key PK, customer attributes) and the Demographics mini-dimension (Demographics Key PK, demographics attributes).]
Type 5: Add Mini-Dimension and Type 1 Outrigger
The type 5 technique builds on the type 4 mini-dimension by also embedding a type
1 reference to the mini-dimension in the primary dimension. This allows accessing
the current values in the mini-dimension directly from the base dimension without
linking through a fact table. The ETL team must add the type 1 key reference in
the base dimension and must overwrite this key reference in all copies of the base
dimension whenever the current status of the mini-dimension changes over time.
Figure 19-7 shows how a type 5 SCD is implemented.
Figure 19-7: Type 5 SCD process. [Figure: as in type 4, the Sales Fact table references both the Customer and Demographics dimensions, but the Customer Dimension additionally carries a Demographics Key (FK, SCD 1) outrigger reference to the current mini-dimension row.]
Type 6: Add Type 1 Attributes to Type 2 Dimension
The type 6 technique has an embedded attribute that is an alternate value of a
normal type 2 attribute in the base dimension. Usually such an attribute is simply
a type 3 alternative reality, but in this case the attribute is systematically overwrit-
ten whenever the attribute is updated. Figure 19-8 shows how a type 6 SCD is
implemented.
Type 7: Dual Type 1 and Type 2 Dimensions
The type 7 technique is a normal type 2 dimension paired with a specially con-
structed fact table that has both a normal foreign key to the dimension for type 2
historical processing, and also a foreign durable key (FDK in Figure 19-9) that is
used alternatively for type 1 current processing, connected to the durable key in
the dimension table labeled PDK. The dimension table also contains a current row
indicator that indicates whether the particular row is the one to be used for current
SCD 1 perspective. The ETL team must augment a normally constructed fact table
with this constant value foreign durable key. Figure 19-9 shows how a type 7 SCD
is implemented.
Figure 19-8: Type 6 SCD process. [Figure: the Customer Dimension carries both a Customer Category (SCD 2) attribute and a Current Customer Category (SCD 1) attribute alongside the other customer attributes referenced by the Sales Fact table.]
Figure 19-9: Type 7 SCD process. [Figure: the Sales Fact table carries both a Customer Key (FK) joined to the Customer Dimension's Customer Key (PK) for the SCD 2 perspective, and a Customer Durable Key (FDK) joined to the dimension's Customer Durable Key (PDK), restricted to rows where the Current Row Indicator is current/true, for the SCD 1 perspective.]
Subsystem 10: Surrogate Key Generator
As you recall from Chapter 3: Retail Sales, we strongly recommend the use of sur-
rogate keys for all dimension tables. This implies you need a robust mechanism for
producing surrogate keys in the ETL system. The surrogate key generator should
independently generate surrogate keys for every dimension; it should be independent
of database instance and able to serve distributed clients. The goal of the surrogate
key generator is to generate a meaningless key, typically an integer, to serve as the
primary key for a dimension row.
Although it may be tempting to create surrogate keys via database triggers,
this technique may create performance bottlenecks. If the DBMS is used to assign
surrogate keys, it is preferable for the ETL process to directly call the database
sequence generator. For improved efficiency, consider having the ETL tool generate
and maintain the surrogate keys. Avoid the temptation of concatenating the opera-
tional key of the source system and a date/time stamp. Although this approach seems
simple, it is fraught with problems and ultimately will not scale.
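If the ETL tool owns key assignment, the generator can be as simple as a thread-safe counter per dimension, seeded from the highest key already in use. The sketch below is one such minimal approach; the class, seeding mechanism, and dimension names are assumptions for illustration.

```python
import itertools
import threading

class SurrogateKeyGenerator:
    """Hands out meaningless integer keys, independently for each dimension."""

    def __init__(self, seed_values):
        # seed_values: dict of dimension name -> highest key already in use in the warehouse
        self._counters = {dim: itertools.count(start + 1) for dim, start in seed_values.items()}
        self._lock = threading.Lock()     # safe to call from parallel ETL threads

    def next_key(self, dimension):
        with self._lock:
            return next(self._counters[dimension])

# Usage: seed from the warehouse, then assign keys during the dimension load.
keygen = SurrogateKeyGenerator({"customer": 120_456, "product": 8_912})
new_customer_key = keygen.next_key("customer")   # returns 120457
```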
Subsystem 11: Hierarchy Manager
It is normal for a dimension to have multiple, simultaneous, embedded hierarchi-
cal structures. These multiple hierarchies simply coexist in the same dimension as
dimension attributes. All that is necessary is that every attribute be single valued in
the presence of the dimension's primary key. Hierarchies are either fixed or ragged.
A fixed depth hierarchy has a consistent number of levels and is simply modeled and
populated as separate dimension attributes for each of the levels. Slightly ragged hier-
archies like postal addresses are most often modeled as a fixed hierarchy. Profoundly
ragged hierarchies are typically found with organization structures that are unbal-
anced and of indeterminate depth. The data model and ETL solution required to
support these needs require the use of a bridge table containing the organization map.
Snowflakes or normalized data structures are not recommended for the presenta-
tion level. However, the use of a normalized design may be appropriate in the ETL
staging area to assist in the maintenance of the ETL data flow for populating and
maintaining the hierarchy attributes. The ETL system is responsible for enforcing
the business rules to assure the hierarchy is populated appropriately in the dimen-
sion table.
Subsystem 12: Special Dimensions Manager
The special dimensions manager is a catch-all subsystem: a placeholder in the ETL
architecture for supporting an organization's specific dimensional design character-
istics. Some organizations’ ETL systems require all the capabilities discussed here,
whereas others will be concerned with few of these design techniques:
Date/Time Dimensions
The date and time dimensions are unique in that they are completely specified
at the beginning of the data warehouse project, and they don’t have a conventional
source. This is okay! Typically, these dimensions are built in an afternoon with a
spreadsheet. But in a global enterprise environment, even this dimension can be
challenging when taking into account multiple financial reporting periods or mul-
tiple cultural calendars.
Junk Dimensions
Junk dimensions are made up from text and miscellaneous flags left over in the
fact table after you remove all the critical attributes. There are two approaches for
creating junk dimensions in the ETL system. If the theoretical number of rows in
the dimension is fixed and known, the junk dimension can be created in advance.
In other cases, it may be necessary to create newly observed junk dimension rows
on-the-fly while processing fact row input. As illustrated in Figure 19-10, this pro-
cess requires assembling the junk dimension attributes and comparing them to the
existing junk dimension rows to see if the row already exists. If not, a new dimen-
sion row must be assembled, a surrogate key created, and the row loaded into the
junk dimension on-the-fly during the fact table load process.
Figure 19-10: Architecture for building junk dimension rows. [Figure: the codes and indicators encountered in the daily load (code 1 ... code M, ind 1 ... ind P) are combined into a single row, compared to the existing junk dimension rows, and inserted if new; the surrogate key for that junk dimension row is then loaded as the junk key (FK) in the fact table.]
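A minimal Python sketch of the on-the-fly lookup-or-create step mirrored in Figure 19-10 follows; the attribute tuple, lookup structure, and key sequence are assumptions for illustration.

```python
# Illustrative on-the-fly junk dimension builder mirroring Figure 19-10.
# The flag/indicator values, lookup structure, and key sequence are assumptions.

def resolve_junk_key(codes_and_indicators, junk_lookup, next_key):
    """codes_and_indicators: tuple of the leftover codes/flags from one incoming fact row.
    junk_lookup: dict mapping that tuple -> existing junk dimension surrogate key.
    Returns (junk_key, newly_created_row_or_None, next_key)."""
    if codes_and_indicators in junk_lookup:
        return junk_lookup[codes_and_indicators], None, next_key

    # Combination not seen before: assemble a new junk dimension row on the fly.
    junk_key = next_key
    junk_lookup[codes_and_indicators] = junk_key
    new_row = {"junk_key": junk_key, "attributes": codes_and_indicators}
    return junk_key, new_row, next_key + 1

# Usage while processing fact input:
lookup, key_seq = {}, 1
key, new_row, key_seq = resolve_junk_key(("PAID", "Y", "WEB"), lookup, key_seq)
```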
NOTE Kimball Design Tip #113 (available at www.kimballgroup.com under the
Tools and Utilities tab for this book title) provides more in-depth guidance on
building and maintaining junk dimension tables.
Mini-Dimensions
As we just discussed in subsystem 9, mini-dimensions are a technique used to
track dimension attribute changes in a large dimension when the type 2 technique
is infeasible, such as a customer dimension. From an ETL perspective, creation of
the mini-dimension is similar to the junk dimension process previously described.
Again, there are two alternatives: building all valid combinations in advance or rec-
ognizing and creating new combinations on-the-fly. Although junk dimensions are
usually built from the fact table input, mini-dimensions are built from dimension
table inputs. The ETL system is responsible for maintaining a multicolumn surrogate
key lookup table to identify the base dimension member and appropriate mini-
dimension row to support the surrogate pipeline process described in Subsystem 14,
Surrogate Key Pipeline. Keep in mind that very large, complex customer dimensions
often require several mini-dimensions.
NOTE Kimball Design Tip #127 (available at www.kimballgroup.com under the
Tools and Utilities tab for this book title) provides more in-depth guidance on
building and maintaining mini-dimension tables.
Shrunken Subset Dimensions
Shrunken dimensions are conformed dimensions that are a subset of rows and/
or columns of one of your base dimensions. The ETL data flow should build
conformed shrunken dimensions from the base dimension, rather than indepen-
dently, to assure conformance. The primary key for the shrunken dimension,
however, must be independently generated; if you attempt to use a key from an
“example” base dimension row, you will get into trouble if this key is retired or
superseded.
NOTE Kimball Design Tip #137 (available at www.kimballgroup.com under the
Tools and Utilities tab for this book title) provides more in-depth guidance on
building shrunken dimension tables.
Small Static Dimensions
A few dimensions are created entirely by the ETL system without a real outside
source. These are usually small lookup dimensions where an operational code is
translated into words. In these cases, there is no real ETL processing. The lookup
dimension is simply created directly by the ETL team as a relational table in its
final form.
User Maintained Dimensions
Often the warehouse requires that totally new “master” dimension tables be created.
These dimensions have no formal system of record; rather they are custom descrip-
tions, groupings, and hierarchies created by the business for reporting and analysis
purposes. The ETL team often ends up with stewardship responsibility for these
dimensions, but this is typically not successful because the ETL team is not aware of
changes that occur to these custom groupings, so the dimensions fall into disrepair
and become ineffective. The best-case scenario is to have the appropriate business
user department agree to own the maintenance of these attributes. The DW/BI team
needs to provide a user interface for this maintenance. Typically, this takes the form
of a simple application built using the company’s standard visual programming tool.
The ETL system should add default attribute values for new rows, which the user
owner needs to update. If these rows are loaded into the warehouse before they are
changed, they still appear in reports with whatever default description is supplied.
NOTE The ETL process should create a unique default dimension attribute
description that shows someone hasn’t yet done their data stewardship job. We
favor a label that concatenates the phrase Not Yet Assigned with the surrogate
key value: “Not Yet Assigned 157.” That way, multiple unassigned values do not
inadvertently get lumped together in reports and aggregate tables. This also helps
identify the row for later correction.
Subsystem 13: Fact Table Builders
Fact tables hold the measurements of an organization. Dimensional models are
deliberately built around these numerical measurements. The fact table builder
subsystem focuses on the ETL architectural requirements to effectively build the
three primary types of fact tables: transaction, periodic snapshot, and accumulating
snapshot. An important requirement for loading fact tables is maintaining refer-
ential integrity with the associated dimension tables. The surrogate key pipeline
(subsystem 14) is designed to help support this need.
Transaction Fact Table Loader
The transaction grain represents a measurement event defined at a particular instant.
A line item on an invoice is an example of a transaction event. A scanner event at
a cash register is another. In these cases, the time stamp in the fact table is very
simple. It's either a single daily grain foreign key or a pair consisting of a daily grain
foreign key together with a date/time stamp, depending on what the source system
provides and the analyses require. The facts in this transaction table must be true
to the grain and should describe only what took place in that instant.
Transaction grain fact tables are the largest and most detailed of the three types
of fact tables. The transaction fact table loader receives data from the changed data
capture system and loads it with the proper dimensional foreign keys. The pure
addition of the most current records is the easiest case: simply bulk loading new
rows into the fact table. In most cases, the target fact table should be partitioned by
time to ease the administration and speed the performance of the table. An audit
key, sequential ID, or date/time stamp column should be included to allow backup
or restart of the load job.
The addition of late arriving data is more difficult, requiring additional process-
ing capabilities described in subsystem 16. In the event it is necessary to update
existing rows, this process should be handled in two phases. The first step is to
insert the corrected rows without overwriting or deleting the original rows, and
then delete the old rows in a second step. Using a sequentially assigned single sur-
rogate key for the fact table makes it possible to perform the two steps of insertion
followed by deletion.
Periodic Snapshot Fact Table Loader
The periodic snapshot grain represents a regular repeating measurement or set of
measurements, like a bank account monthly statement. This fact table also has a
single date column, representing the overall period. The facts in this periodic snap-
shot table must be true to the grain and should describe only measures appropriate
to the timespan defined by the period. Periodic snapshots are a common fact table
type and are frequently used for account balances, monthly financial reporting,
and inventory balances. The periodicity of a periodic snapshot is typically daily,
weekly, or monthly.
Periodic snapshots have similar loading characteristics to those of transaction
grain fact tables. The same processing applies for inserts and updates. Assuming
data is promptly delivered to the ETL system, all records for each periodic load can
cluster in the most recent time partition. Traditionally, periodic snapshots have
been loaded en masse at the end of the appropriate period.
For example, a credit card company might load a monthly account snapshot table
with the balances in effect at the end of the month. More frequently, organizations
will populate a hot rolling periodic snapshot. In addition to the rows loaded at the
end of every month, there are special rows loaded with the most current balances
in effect as of the previous day. As the month progresses, the current month rows
are continually updated with the most current information and continue in this
manner rolling through the month. Note that the hot rolling snapshot can some-
times be difficult to implement if the business rules for calculating the balances
at the period end are complex. Often these complex calculations are dependent
on other periodic processing outside the data warehouse, and there is not enough
information available to the ETL system to perform these complex calculations on
a more frequent basis.
Accumulating Snapshot Fact Table Loader
The accumulating snapshot grain represents the current evolving status of a process
that has a finite beginning and end. Usually, these processes are of short duration
and therefore don’t lend themselves to the periodic snapshot. Order processing is
the classic example of an accumulating snapshot. The order is placed, shipped,
and paid for within one reporting period. The transaction grain provides too much
detail separated into individual fact table rows, and the periodic snapshot just is
the wrong way to report this data.
The design and administration of the accumulating snapshot is quite di erent
from the first two fact table types. All accumulating snapshot fact tables have a
set of dates which describe the typical process workflow. For instance, an order
might have an order date, actual ship date, delivery date, final payment date, and
return date. In this example, these five dates appear as five separate date-valued
foreign surrogate keys. When the order row is first created, the first of these dates
is well defined, but perhaps none of the others have yet happened. This same
fact row is subsequently revisited as the order winds its way through the order
pipeline. Each time something happens, the accumulating snapshot fact row is
destructively modified. The date foreign keys are overwritten, and various facts
are updated. Often the first date remains inviolate because it describes when
the row was created, but all the other dates may well be overwritten, sometimes
more than once.
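A sketch of that destructive update follows: the same fact row is revisited and its milestone date keys and facts are overwritten in place as events occur. The milestone names and row layout are illustrative assumptions.

```python
# Illustrative accumulating snapshot update: overwrite the milestone date keys and
# facts in place as the order moves through its pipeline. Column names are assumptions.

MILESTONE_DATE_KEYS = ["order_date_key", "ship_date_key", "delivery_date_key",
                       "final_payment_date_key", "return_date_key"]

def apply_milestone(snapshot_row, milestone, date_key, fact_updates=None):
    """snapshot_row: dict for one order's accumulating snapshot fact row."""
    if milestone not in MILESTONE_DATE_KEYS:
        raise ValueError(f"Unknown milestone: {milestone}")
    # The order date is set once at row creation and left inviolate thereafter.
    if milestone == "order_date_key" and snapshot_row.get("order_date_key"):
        return snapshot_row
    snapshot_row[milestone] = date_key            # overwrite the date foreign key
    for fact, value in (fact_updates or {}).items():
        snapshot_row[fact] = value                # update various facts as they become known
    return snapshot_row

# Usage: revisit the same fact row each time something happens to the order.
row = {"order_number": "SO-1001", "order_date_key": 20130105}
apply_milestone(row, "ship_date_key", 20130108, {"ship_charge_dollars": 12.50})
```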
Many RDBMSs utilize variable row lengths. Repeated updates to accumulating
snapshot fact rows may cause the rows to grow due to these variable row lengths,
a ecting the residency of disk blocks. It may be worthwhile to occasionally drop
and reload rows after the update activity to improve performance.
An accumulating snapshot fact table is an effective way to represent finite
processes with well-defined beginnings and endings. However, the accumulating
snapshot by definition is the most recent view. Often it makes sense to utilize all
three fact table types to meet various needs. Periodic history can be captured with
periodic extracts, and all the infinite details involved in the process can be captured
in an associated transaction grain fact table. The presence of many situations that
violate standard scenarios or involve repeated looping through the process would
prohibit the use of an accumulating snapshot.
Subsystem 14: Surrogate Key Pipeline
Every ETL system must include a step for replacing the operational natural keys
in the incoming fact table row with the appropriate dimension surrogate
keys. Referential integrity (RI) means that for each foreign key in the fact table,
an entry exists in the corresponding dimension table. If there’s a row in a sales
fact table for product surrogate key 323442, you need to have a row in the product
dimension table with the same key, or you won’t know what you’ve sold. You have a
sale for what appears to be a nonexistent product. Even worse, without the product
key in the dimension, a business user can easily construct a query that will omit
this sale without even realizing it.
The key lookup process should result in a match for every incoming natural key
or a default value. In the event there is an unresolved referential integrity failure dur-
ing the lookup process, you need to feed these failures back to the responsible ETL
process for resolution, as shown in Figure 19-11. Likewise, the ETL process needs to
resolve any key collisions that might be encountered during the key lookup process.
Figure 19-11: Replacing fact records' operational natural keys with dimension surrogate keys. [Figure: the incoming fact table with natural key IDs (date_ID, product_ID, store_ID, promo_ID, dollar_sales, unit_sales, dollar_cost) passes through successive lookups against the date, product, store, and promotion dimensions, replacing each ID with its surrogate key (date_key, product_key, store_key, promo_key) before the fact rows are loaded into the DBMS; referential integrity failures and key collisions are routed out of the pipeline for resolution.]
After the fact table data has been processed and just before loading into the pre-
sentation layer, a surrogate key lookup needs to occur to substitute the operational
natural keys in the fact table record with the proper current surrogate key. To pre-
serve referential integrity, always complete the updating of the dimension tables
first. In that way, the dimension tables are always the legitimate source of primary
keys you must replace in the fact table (refer to Figure 19-11).
The most direct approach is to use the actual dimension table as the source for
the most current value of the surrogate key corresponding to each natural key. Each
time you need the current surrogate key, look up all the rows in the dimension with
the natural key equal to the desired value, and then select the surrogate key that
aligns with the historical context of the fact row using the current row indicator or
begin and end effective dates. Current hardware environments offer nearly unlimited
addressable memory, making this approach practical.
During processing, each natural key in the incoming fact record is replaced with
the correct current surrogate key. Don’t keep the natural key in the fact row—the
fact table needs to contain only the surrogate key. Do not write the input data to
disk until all fact rows have passed all the processing steps. If possible, all required
dimension tables should be pinned in memory, so they can be randomly accessed
as each incoming record presents its natural keys.
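A minimal in-memory sketch of the current-key portion of the pipeline follows: each dimension's natural-key-to-surrogate-key map is pinned in a dictionary, and unresolved natural keys are routed to a referential integrity failure list rather than being loaded. The column naming convention and data structures are assumptions for illustration.

```python
# Minimal surrogate key pipeline sketch (current keys only). The dimension lookup maps
# are assumed small enough to pin in memory, keyed by natural key.

def run_surrogate_key_pipeline(fact_rows, lookups):
    """lookups: dict of {dimension_name: {natural_key: surrogate_key}}.
    Returns (rows ready to load, referential integrity failures)."""
    ready, ri_failures = [], []
    for row in fact_rows:
        working = dict(row)
        keyed_row, failed = {}, False
        for dimension, lookup in lookups.items():
            natural_key = working.pop(f"{dimension}_id", None)
            surrogate = lookup.get(natural_key)
            if surrogate is None:
                # Unresolved RI failure: feed back to the responsible ETL process.
                ri_failures.append({"dimension": dimension, "natural_key": natural_key, "row": row})
                failed = True
                break
            keyed_row[f"{dimension}_key"] = surrogate   # natural key replaced, not retained
        if not failed:
            keyed_row.update(working)   # whatever remains are the facts and degenerate dimensions
            ready.append(keyed_row)
    return ready, ri_failures
```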
As illustrated at the bottom of Figure 19-11, the surrogate key pipeline needs to
handle key collisions in the event you attempt to load a duplicate row. This is an
example of a data quality problem appropriate for a traditional structure data qual-
ity screen, as discussed in subsystem 4. In the event a key collision is recognized,
the surrogate key pipeline process needs to choose to halt the process, send the
o ending data into suspension, or apply appropriate business rules to determine
if it is possible to correct the problem, load the row, and write an explanatory row
into the error event schema.
Note a slightly different process is needed to perform surrogate key lookups if
you need to reload history or if you have a lot of late arriving fact rows because you
don't want to map the most current value to a historical event. In this case, you need
to create logic to find the surrogate key that applied at the time the fact record was
generated. This means finding the surrogate key where the fact transaction date is
between the key's effective start date and end date.
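In that historical case, the lookup scans the type 2 rows for the natural key and picks the one whose effective date range brackets the fact's transaction date, roughly as in the sketch below (row layout and date types assumed).

```python
# Illustrative historical surrogate key lookup for late arriving facts: choose the
# type 2 dimension row whose validity span brackets the fact's transaction date.
# Assumes row_effective_date and row_end_date are comparable date values.

def historical_surrogate_key(dimension_rows_for_natural_key, transaction_date):
    for row in dimension_rows_for_natural_key:
        if row["row_effective_date"] <= transaction_date < row["row_end_date"]:
            return row["surrogate_key"]
    return None   # no row in effect at that time: route to the RI failure handler
```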
When the fact table natural keys have been replaced with surrogate keys, the
fact row is ready to load. The keys in the fact table row have been chosen to be
proper foreign keys, and the fact table is guaranteed to have referential integrity
with respect to the dimension tables.
Subsystem 15: Multivalued Dimension
Bridge Table Builder
Sometimes a fact table must support a dimension that takes on multiple values
at the lowest granularity of the fact table, as described in Chapter 8: Customer
Relationship Management. If the grain of the fact table cannot be changed to directly
support this dimension, then the multivalued dimension must be linked to the fact
table via a bridge table. Bridge tables are common in the healthcare industry, in
sales commission environments, and for supporting variable depth hierarchies, as
discussed in subsystem 11.
The challenge for the ETL team is building and maintaining the bridge table. As
multivalued relationships to the fact row are encountered, the ETL system has the
choice of either making each set of observations a unique group or reusing groups
when an identical set of observations occurs. Unfortunately, there is no simple
answer for the right choice. In the event the multivalued dimension has type 2
attributes, the bridge table must also be time varying, such as a patient’s time vari-
ant set of diagnoses.
One of the bridge table constructs presented in Chapter 10: Financial Services
was the inclusion of a weighting factor to support properly weighted reporting from
the bridge table. In many cases, the weighting factor is a familiar allocation factor,
but in other cases, the identification of the appropriate weighting factor can be prob-
lematic because there may be no rational basis for assigning the weighting factor.
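When no better allocation basis exists, a common fallback is to weight each member of the group equally so the weights sum to (approximately) 1. The sketch below builds a bridge group that way; the group key assignment and member keys are assumptions for illustration.

```python
# Illustrative bridge table group builder with equal weighting factors.
# Group key assignment and member keys are assumptions for this example.

def build_bridge_group(group_key, member_keys):
    """Return bridge rows linking one group key to each member, with equal weights."""
    weight = round(1.0 / len(member_keys), 6)
    return [
        {"group_key": group_key, "member_key": member, "weighting_factor": weight}
        for member in member_keys
    ]

# Usage: a patient visit with three diagnoses gets three bridge rows weighted roughly 1/3 each.
bridge_rows = build_bridge_group(group_key=501, member_keys=[1101, 1102, 1103])
```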
NOTE Kimball Design Tip #142 (available at www.kimballgroup.com under the
Tools and Utilities tab for this book title) provides more in-depth guidance on
building and maintaining bridge tables.
Subsystem 16: Late Arriving Data Handler
Data warehouses are usually built around the ideal assumption that measured activ-
ity (fact records) arrive in the data warehouse at the same time as the context of
the activity (dimension records). When you have both the fact records and the
correct contemporary dimension rows, you have the luxury of first maintaining
the dimension keys and then using these up-to-date keys in the accompanying fact
rows. However, for a variety of reasons, the ETL system may need to process late
arriving fact or dimension data.
In some environments, there may need to be special modifications to the stan-
dard processing procedures to deal with late arriving facts, namely fact records
that come into the warehouse very much delayed. This is a messy situation because
you have to search back in history to decide which dimension keys were in effect
when the activity occurred. In addition, you may need to adjust any semi-additive
balances in subsequent fact rows. In a heavily compliant environment, it is also
necessary to interface with the compliance subsystem because you are about to
change history.
Late arriving dimensions occur when the activity measurement (fact record)
arrives at the data warehouse without its full context. In other words, the statuses of
the dimensions attached to the activity measurement are ambiguous or unknown for
some period of time. If you are living in the conventional batch update cycle of one or
more days’ latency, you can usually just wait for the dimensions to be reported. For
example, the identification of the new customer may come in a separate feed delayed
by several hours; you may just be able to wait until the dependency is resolved.
But in many situations, especially real-time environments, this delay is not
acceptable. You cannot suspend the rows and wait for the dimension updates to
occur; the business requirements demand that you make the fact row visible before
knowing the dimensional context. The ETL system needs additional capabilities
to support this requirement. Using customer as the problem dimension, the ETL
system needs to support two situations. The first is to support late arriving type 2
dimension updates. In this situation, you need to add the revised customer row to
the dimension with a new surrogate key and then go in and destructively modify
any subsequent fact rows' foreign key to the customer table. The effective dates for
the affected dimension rows also need to be reset. In addition, you need to scan
forward in the dimension to see if there have been any subsequent type 2 rows for
this customer and change this column in any affected rows.
The second situation occurs when you receive a fact row with what appears to be a
valid customer natural key, but you have not yet loaded this customer in the customer
dimension. It would be possible to load this row pointing to a default row in the
dimension table. This approach has the same unpleasant side effect discussed earlier
of requiring destructive updates to the fact rows' foreign keys when the dimension
updates are finally processed. Alternatively, if you believe the customer is a valid, but
not yet processed customer, you should assign a new customer surrogate key with
a set of dummy attribute values in a new customer dimension row. You then return
to this dummy dimension row at a later time and make type 1 overwrite changes to
its attributes when you get complete information on the new customer. At least this
step avoids destructively changing any fact table keys.
There is no way to avoid a brief provisional period in which the dimensions
are “not quite right.” But these maintenance steps can minimize the impact of the
unavoidable updates to the keys and other columns.
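The second situation described above is often handled by inserting a placeholder dimension row keyed by the incoming natural key, to be corrected later with type 1 overwrites. A minimal sketch follows; the attribute names are assumptions, and the key generator is assumed to behave like the one sketched for subsystem 10.

```python
# Illustrative handler for a fact row whose customer natural key is not yet in the
# customer dimension: create a placeholder row to be type 1 corrected later.
# Attribute names are assumptions; keygen is any object with a next_key(dimension) method.

def resolve_or_create_customer(natural_key, customer_lookup, keygen):
    surrogate = customer_lookup.get(natural_key)
    if surrogate is not None:
        return surrogate, None
    surrogate = keygen.next_key("customer")
    placeholder = {
        "customer_key": surrogate,
        "customer_natural_key": natural_key,
        # Dummy attributes, overwritten (type 1) once the real customer record arrives.
        "customer_name": f"Not Yet Assigned {surrogate}",
        "current_flag": "Current",
    }
    customer_lookup[natural_key] = surrogate
    return surrogate, placeholder
```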
Subsystem 17: Dimension Manager System
The dimension manager is a centralized authority who prepares and publishes con-
formed dimensions to the data warehouse community. A conformed dimension is
by necessity a centrally managed resource: Each conformed dimension must have a
single, consistent source. It is the dimension manager’s responsibility to administer
and publish the conformed dimension(s) for which he has responsibility. There
may be multiple dimension managers in an organization, each responsible for a
dimension. The dimension manager's responsibilities include the following ETL
processing:
Implement the common descriptive labels agreed to by the data stewards and
stakeholders during the dimension design.
Add new rows to the conformed dimension for new source data, generating
new surrogate keys.
Add new rows for type 2 changes to existing dimension entries, generating
new surrogate keys.
Modify rows in place for type 1 changes and type 3 changes, without chang-
ing the surrogate keys.
Update the version number of the dimension if any type 1 or type 3 changes
are made.
Replicate the revised dimension simultaneously to all fact table providers.
It is easier to manage conformed dimensions in a single tablespace DBMS on a
single machine because there is only one copy of the dimension table. However,
managing conformed dimensions becomes more difficult in multiple tablespace,
multiple DBMS, or multimachine distributed environments. In these situations, the
dimension manager must carefully manage the simultaneous release of new versions
of the dimension to every fact provider. Each conformed dimension should have a
version number column in each row that is overwritten in every row whenever the
dimension manager releases the dimension. This version number should be utilized
to support any drill-across queries to assure that the same release of the dimension
is being utilized.
Subsystem 18: Fact Provider System
The fact provider is responsible for receiving conformed dimensions from the dimen-
sion managers. The fact provider owns the administration of one or more fact tables
and is responsible for their creation, maintenance, and use. If fact tables are used
in any drill-across applications, then by definition the fact provider must be using
conformed dimensions provided by the dimension manager. The fact provider's
responsibilities are more complex and include:
Receive or download replicated dimension from the dimension manager.
In an environment in which the dimension cannot simply be replicated but
must be locally updated, the fact provider must process dimension records
marked as new and current to update current key maps in the surrogate key
pipeline and also process any dimension records marked as new but postdated.
Add all new rows to fact tables after replacing their natural keys with correct
surrogate keys.
Modify rows in all fact tables for error correction, accumulating snapshots,
and late arriving dimension changes.
Remove aggregates that have become invalidated.
Recalculate a ected aggregates. If the new release of a dimension does not
change the version number, aggregates have to be extended to handle only
newly loaded fact data. If the version number of the dimension has changed,
the entire historical aggregate may have to be recalculated.
ETL Subsystems and Techniques 481
Quality assure all base and aggregate fact tables. Be satisfied the aggregate
tables are correctly calculated.
Bring updated fact and dimension tables online.
Inform users that the database has been updated. Tell them if major changes
have been made, including dimension version changes, postdated records
being added, and changes to historical aggregates.
Subsystem 19: Aggregate Builder
Aggregates are the single most dramatic way to affect performance in a large data
warehouse environment. Aggregations are like indexes; they are specific data struc-
tures created to improve performance. Aggregates can have a significant impact
on performance. The ETL system needs to effectively build and use aggregates
without causing significant distraction or consuming extraordinary resources and
processing cycles.
You should avoid architectures in which aggregate navigation is built into the
proprietary query tool. From an ETL viewpoint, the aggregation builder needs to
populate and maintain aggregate fact table rows and shrunken dimension tables
where needed by aggregate fact tables. The fastest update strategy is incremental,
but a major change to a dimension attribute may require dropping and rebuild-
ing the aggregate. In some environments, it may be faster to dump data out of the
DBMS and build aggregates with a sort utility rather than building the aggregates
inside the DBMS. Additive numeric facts can be aggregated easily at extract time
by calculating break rows in one of the sort packages. Aggregates must always be
consistent with the atomic base data. The fact provider (subsystem 18) is respon-
sible for taking aggregates off-line when they are not consistent with the base data.
User feedback on the queries that run slowly is critical input to designing aggrega-
tions. Although you can depend on informal feedback to some extent, a log of frequently
attempted slow-running queries should be captured. You should also try to identify the
nonexistent slow-running queries that never made it into the log because they never
run to completion, or aren’t even attempted due to known performance challenges.
Subsystem 20: OLAP Cube Builder
OLAP servers present dimensional data in an intuitive way, enabling a range of
analytic users to slice and dice data. OLAP is a sibling of dimensional star schemas
in the relational database, with intelligence about relationships and calculations
defi ned on the server that enable faster query performance and more interesting
analytics from a broad range of query tools. Don’t think of an OLAP server as a
competitor to a relational data warehouse, but rather an extension. Let the relational
database do what it does best: Provide storage and management.
The relational dimensional schema should be viewed as the foundation for OLAP
cubes if you elect to include them in your architecture. The process of feeding data
from the dimensional schema is an integral part of the ETL system; the relational
schemas are the best and preferred source for OLAP cubes. Because many OLAP
systems do not directly address referential integrity or data cleaning, the pre-
ferred architecture is to load OLAP cubes after the completion of conventional ETL
processes. Note that some OLAP tools are more sensitive to hierarchies than rela-
tional schemas. It is important to strongly enforce the integrity of hierarchies within
dimensions before loading an OLAP cube. Type 2 SCDs fit an OLAP system well
because a new surrogate key is just treated as a new member. Type 1 SCDs that restate
history do not fit OLAP well. Overwrites to an attribute value can cause all the cubes
using that dimension to be reprocessed in the background, become corrupted, or be
dropped. Read this last sentence again.
Subsystem 21: Data Propagation Manager
The data propagation manager is responsible for the ETL processes required to
present conformed, integrated enterprise data from the data warehouse presenta-
tion server to other environments for special purposes. Many organizations need
to extract data from the presentation layer to share with business partners, cus-
tomers, and/or vendors for strategic purposes. Similarly, some organizations are
required to submit data to various government organizations for reimbursement
purposes, such as healthcare organizations that participate in the Medicare pro-
gram. Many organizations have acquired package analytic applications. Typically,
these applications cannot be pointed directly against the existing data warehouse
tables, so data needs to be extracted from the presentation layer and loaded into
proprietary data structures required by the analytic applications. Finally, most
data mining tools do not run directly against the presentation server. They need
data extracted from the data warehouse and fed to the data mining tool in a
specific format.
All the situations previously described require extraction from the DW/BI
presentation server, possibly some light transformation, and loading into a target
format—in other words ETL. Data propagation should be considered a part of the
ETL system; ETL tools should be leveraged to provide this capability. What is dif-
ferent in this situation is that the requirements of the target are not negotiable; you
must provide the data as specified by the target.
Managing the ETL Environment
A DW/BI environment can have a great dimensional model, well-deployed BI applica-
tions, and strong management sponsorship. But it cannot be a success until it can be
relied upon as a dependable source for business decision making. One of the goals
for the DW/BI system is to build a reputation for providing timely, consistent, and
reliable data to empower the business. To achieve this goal, the ETL system must
constantly work toward fulfilling three criteria:
Reliability. The ETL processes must consistently run. They must run to
completion to provide data on a timely basis that is trustworthy at any level
of detail.
Availability. The data warehouse must meet its service level agreements
(SLAs). The warehouse should be up and available as promised.
Manageability. A successful data warehouse is never done. It constantly grows
and changes along with the business. The ETL processes need to gracefully
evolve as well.
The ETL management subsystems are the key components of the architecture
to help achieve the goals of reliability, availability, and manageability. Operating
and maintaining a data warehouse in a professional manner is not much different
than any other systems operations: Follow standard best practices, plan for disaster,
and practice. Most of the requisite management subsystems that follow might be
familiar to you.
Subsystem 22: Job Scheduler
Every enterprise data warehouse should have a robust ETL scheduler. The entire ETL
process should be managed, to the extent possible, through a single metadata-driven
job control environment. Major ETL tool vendors package scheduling capabilities
into their environments. If you elect not to use the scheduler included with the ETL
tool, or do not use an ETL tool, you need to utilize existing production scheduling
or perhaps manually code the ETL jobs to execute.
Scheduling is much more than just launching jobs on a schedule. The scheduler
needs to be aware of and control the relationships and dependencies between ETL
jobs. It needs to recognize when a file or table is ready to be processed. If the orga-
nization is processing in real time, you need a scheduler that supports your selected
real-time architecture. The job control process must also capture metadata regard-
ing the progress and statistics of the ETL process during its execution. Finally, the
scheduler should support a fully automated process, including notifying the problem
escalation system in the event of any situation that requires resolution.
The infrastructure to manage this can be as basic (and labor-intensive) as a set of
SQL stored procedures, or as sophisticated as an integrated tool designed to manage
and orchestrate multiplatform data extract and loading processes. If you use an ETL
tool, it should provide this capability. In any case, you need to set up an environ-
ment for creating, managing, and monitoring the ETL job stream.
The job control services needed include:
Job definition. The first step in creating an operations process is to have some
way to define a series of steps as a job and to specify some relationship among
jobs. This is where the execution flow of the ETL process is written. In many
cases, if the load of a given table fails, it can impact your ability to load tables
that depend on it. For example, if the customer table is not properly updated,
loading sales facts for new customers that did not make it into the customer
table is risky. In some databases, it is impossible.
Job scheduling. At a minimum, the environment needs to provide standard
capabilities, such as time- and event-based scheduling. ETL processes are often
based on some upstream system event, such as the successful completion of
the general ledger close or the successful application of sales adjustments to
yesterday's sales figures. This includes the ability to monitor database flags,
check for the existence of files, and compare creation dates.
Metadata capture. No self-respecting systems person would tolerate a black
box scheduling system. The folks responsible for running the loads will
demand a workfl ow monitoring system (subsystem 27) to understand what
is going on. The job scheduler needs to capture information about what step
the load is on, what time it started, and how long it took. In a handcrafted
ETL system, this can be accomplished by having each step write to a log file.
The ETL tool should capture this data every time an ETL process executes.
Logging. This means collecting information about the entire ETL process,
not just what is happening at the moment. Log information supports the
recovery and restarting of a process in case of errors during the job execution.
Logging to text files is the minimum acceptable level. We prefer a system that
logs to a database because the structure makes it easier to create graphs and
reports. It also makes it possible to create time series studies to help analyze
and optimize the load process.
Notification. After the ETL process has been developed and deployed, it
should execute in a hands-off manner. It should run without human inter-
vention, without fail. If a problem does occur, the control system needs to
interface to the problem escalation system (subsystem 30).
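As a toy illustration of these services, the sketch below runs jobs in dependency order, captures start and finish times, and stops if a prerequisite fails. Real ETL tools provide far richer versions of this; everything here (job names, structures, the escalation hook) is an assumption made only to make the concepts concrete.

```python
import logging
from datetime import datetime

logging.basicConfig(level=logging.INFO)

# Toy job-control sketch: jobs declare dependencies; run them in dependency order,
# capture timing metadata, and escalate if a job fails.

def run_jobs(jobs, dependencies):
    """jobs: {name: callable}; dependencies: {name: [prerequisite names]}."""
    completed, log = set(), []
    remaining = dict(jobs)
    while remaining:
        runnable = [n for n in remaining if all(d in completed for d in dependencies.get(n, []))]
        if not runnable:
            raise RuntimeError(f"Unsatisfiable dependencies for: {sorted(remaining)}")
        for name in runnable:
            started = datetime.now()
            logging.info("Starting %s", name)
            try:
                remaining.pop(name)()
            except Exception:
                logging.exception("Job %s failed; escalating", name)   # problem escalation hook
                raise
            completed.add(name)
            log.append({"job": name, "started": started, "finished": datetime.now()})
    return log

# Usage: the fact load waits for the customer dimension load.
run_jobs({"load_customer_dim": lambda: None, "load_sales_facts": lambda: None},
         {"load_sales_facts": ["load_customer_dim"]})
```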
NOTE Somebody needs to know if anything unforeseen happened during the
load, especially if a response is critical to continuing the process.
Subsystem 23: Backup System
The data warehouse is subject to the same risks as any other computer system. Disk
drives will fail, power supplies will go out, and sprinkler systems will accidentally
turn on. In addition to these risks, the warehouse also has a need to keep more
data for longer periods of time than operational systems. Although typically not
managed by the ETL team, the backup and recovery process is often designed as
part of the ETL system. Its goal is to allow the data warehouse to get back to work
after a failure. This includes backing up the intermediate staging data necessary to
restart failed ETL jobs. The archive and retrieval process is designed to enable user
access to older data that has been moved out of the main warehouse onto a less
costly, usually lower-performing media.
Backup
Even if you have a fully redundant system with a universal power supply, fully RAIDed
disks, and parallel processors with failover, some system crisis will eventually visit.
Even with perfect hardware, someone can always drop the wrong table (or database).
At the risk of stating the obvious, it is better to prepare for this than to handle it on
the fly. A full-scale backup system needs to provide the following capabilities:
High performance. The backup needs to fit into the allotted timeframe. This
may include online backups that don't impact performance significantly,
including real-time partitions.
Simple administration. The administration interface should provide tools that
easily allow you to identify objects to back up (including tables, tablespaces,
and redo logs), create schedules, and maintain backup verification and logs
for subsequent restore.
Automated, lights-out operations. The backup facility must provide storage
management services, automated scheduling, media and device handling,
reporting, and notification.
The backup for the warehouse is usually a physical backup. This is an image
of the database at a certain point in time, including indexes and physical layout
information.
Archive and Retrieval
Deciding what to move out of the warehouse is a cost-benefit issue. It costs money
to keep the data around—it takes up disk space and slows the load and query
times. On the other hand, the business users just might need this data to do some
critical historical analyses. Likewise an auditor may request archived data as part
of a compliance procedure. The solution is not to throw the data away but to put
it some place that costs less but is still accessible. Archiving is the data security
blanket for the warehouse.
As of this writing, the cost of online disk storage is dropping so rapidly that it
makes sense to plan for many archiving tasks to simply write to disk. Especially if
disk storage is handled by a separate IT resource, the requirement to “migrate and
refresh” is replaced by “refresh.” You need to make sure that you can interpret the
data at various points in the future.
How long it takes the data to get stale depends on the industry, the business, and
the particular data in question. In some cases, it is fairly obvious when older data
has little value. For example, in an industry with rapid evolution of new products
and competitors, history doesn’t necessarily help you understand today or predict
tomorrow.
After a determination has been made to archive certain data, the issue becomes
“what are the long-term implications of archiving data?” Obviously, you need
to leverage existing mechanisms to physically move the data from its current media to
another media and ensure it can be recovered, along with an audit trail that accounts
for the accesses and alterations to the data. But what does it mean to “keep” old data?
Given increasing audit and compliance concerns, you may face archival require-
ments to preserve this data for five, 10, or perhaps even 50 years. What media should
you utilize? Will you be able to read that media in future years? Ultimately, you
may find yourself implementing a library system capable of archiving and regularly
refreshing the data, and then migrating it to more current structures and media.
Finally, if you are archiving data from a system that is no longer going to be used,
you may need to “sunset” the data by extracting it from the system and writing it
in a vanilla format that is independent of the original application. You might need
to do this if the license to use the application will terminate.
Subsystem 24: Recovery and Restart System
After the ETL system is in production, failures can occur for countless reasons
beyond the control of the ETL process. Common causes of ETL production failures
include:
Network failure
Database failure
Disk failure
Memory failure
Data quality failure
Unannounced system upgrade
To protect yourself from these failures, you need a solid backup system
(subsystem 23) and a companion recovery and restart system. You must plan for
unrecoverable errors during the load because they will happen. The system should
anticipate this and provide crash recovery, stop, and restart capability. First, look
for appropriate tools and design processes to minimize the impact of a crash. For
example, a load process should commit relatively small sets of records at a time and
keep track of what has been committed. The size of the set should be adjustable
because the transaction size has performance implications on different DBMSs.
The recovery and restart system is used, of course, for either resuming a job that
has halted or for backing out the whole job and restarting it. This system is significantly
dependent on the capabilities of the backup system. When a failure occurs,
the initial knee-jerk reaction is to attempt to salvage whatever has processed and
restart the process from that point. This requires an ETL tool with solid and reli-
able checkpoint functionality, so it can perfectly determine what has processed and
what has not to restart the job at exactly the right point. In many cases, it may be
best to back out any rows that have been loaded as part of the process and restart
from the beginning.
We often recommend designing fact tables with a single column primary sur-
rogate key. This surrogate key is a simple integer that is assigned in sequence as
rows are created to be added to the fact table. With the fact table surrogate key,
you can easily resume a load that is halted or back out all the rows in the load by
constraining on a range of surrogate keys.
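To illustrate, if the load batch logs the first and last fact surrogate keys it assigns (the table
and column names here are hypothetical), backing out a partial load reduces to a single
constrained delete:

-- Back out the rows written by the failed batch, identified by its surrogate key range.
DELETE FROM sales_fact
WHERE fact_surrogate_key BETWEEN 500001 AND 537500;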
NOTE Fact table surrogate keys have a number of uses in the ETL back room.
First, as previously described, they can be used as the basis for backing out or
resuming an interrupted load. Second, they provide immediate and unambiguous
identification of a single fact row without needing to constrain multiple dimen-
sions to fetch a unique row. Third, updates to fact table rows can be replaced by
inserts plus deletes because the fact table surrogate key is now the actual key for
the fact table. Thus, a row containing updated columns can be inserted into the
fact table without overwriting the row it is to replace. When all such insertions are
complete, then the underlying old rows can be deleted in a single step. Fourth, the
fact table surrogate key is an ideal parent key to be used in a parent/child design.
The fact table surrogate key appears as a foreign key in the child, along with the
parent’s dimension foreign keys.
The longer an ETL process runs, the more you must be aware of vulnerabilities
due to failure. Designing a modular ETL system made up of efficient processes that
are resilient against crashes and unexpected terminations can reduce the risk of
a failure resulting in a massive recovery effort. Careful consideration of when to
physically stage data by writing it to disk, along with carefully crafted points of
recovery and load date/time stamps or sequential fact table surrogate keys enable
you to specify appropriate restart logic.
Subsystem 25: Version Control System
The version control system is a “snapshotting” capability for archiving and recover-
ing all the logic and metadata of the ETL pipeline. It controls check-out and check-in
processing for all ETL modules and jobs. It should support source comparisons to
reveal di erences between versions. This system provides a librarian function for
saving and restoring the complete ETL context of a single version. In certain highly
compliant environments, it will be equally important to archive the complete ETL
system context alongside the relevant archived and backup data. Note that master
version numbers need to be assigned for the overall ETL system, just like software
release version numbers.
NOTE You have a master version number for each part of the ETL system as
well as one for the system as a whole, don’t you? And you can restore yesterday’s
complete ETL metadata context if it turns out there is a big mistake in the current
release? Thank you for reassuring us.
Subsystem 26: Version Migration System
After the ETL team gets past the difficult process of designing and developing the ETL
process and completes the creation of the jobs required to load the data warehouse, the
jobs must be bundled and migrated to the next environment—from development to
test and on to production—according to the lifecycle adopted by the organization. The
version migration system needs to interface to the version control system to control
the process and back out a migration if needed. It should provide a single interface
for setting connection information for the entire version.
Most organizations isolate the development, testing, and production environments.
You need to be able to migrate a complete version of the ETL pipeline from devel-
opment, into test, and finally into production. Ideally, the test system is identically
configured to its corresponding production system. Everything done to the production
system should have been designed in development and the deployment script tested
on the test environment. Every back room operation should go through rigorous
scripting and testing, whether deploying a new schema, adding a column, changing
indexes, changing the aggregate design, modifying a database parameter, backing
up, or restoring. Centrally managed front room operations such as deploying new BI
tools, deploying new corporate reports, and changing security plans should be equally
rigorously tested and scripted if the BI tools allow it.
Subsystem 27: Workflow Monitor
Successful data warehouses are consistently and reliably available, as agreed to with
the business community. To achieve this goal, the ETL system must be constantly
monitored to ensure the ETL processes are operating efficiently and the warehouse
is being loaded on a consistently timely basis. The job scheduler (subsystem 22)
should capture performance data every time an ETL process is initiated. This data
is part of the process metadata captured in the ETL system. The workflow monitor
leverages the metadata captured by the job scheduler to provide a dashboard and
reporting system taking many aspects of the ETL system into consideration. You’ll
want to monitor job status for all job runs initiated by the job scheduler including
pending, running, completed and suspended jobs, and capture the historical data
to support trending performance over time. Key performance measures include
the number of records processed, summaries of errors, and actions taken. Most
ETL tools capture the metrics for measuring ETL performance. Be sure to trigger
alerts whenever an ETL job takes significantly more or less time to complete than
indicated by the historical record.
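As a minimal sketch of such an alert, assuming the scheduler writes run timings to a
hypothetical etl_job_run table, a monitoring query might compare the latest run of each
job against its own trailing history:

-- Flag today's runs that took more than 50 percent longer or shorter than the 90-day average.
SELECT r.job_name, r.run_minutes, h.avg_minutes
FROM etl_job_run r
JOIN (SELECT job_name, AVG(run_minutes) AS avg_minutes
      FROM etl_job_run
      WHERE run_date >= CURRENT_DATE - 90
      GROUP BY job_name) h
  ON h.job_name = r.job_name
WHERE r.run_date = CURRENT_DATE
  AND (r.run_minutes > 1.5 * h.avg_minutes OR r.run_minutes < 0.5 * h.avg_minutes);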
In combination with the job scheduler, the workflow monitor should also track
performance and capture measurements of the performance of infrastructure com-
ponents including CPU usage, memory allocation and contention, disk utilization
and contention, bu er pool usage, database performance, and server utilization and
contention. Much of this information is process metadata about the ETL system
and should be considered as part of the overall metadata strategy (subsystem 34).
The workflow monitor has a more significant strategic role than you might sus-
pect. It is the starting point for the analysis of performance problems across the
ETL pipeline. ETL performance bottlenecks can occur in many places, and a good
workflow monitor shows where the bottlenecks are occurring. Chapter 20 discusses
many ways to improve performance in the ETL pipeline, but this list is more or less
ordered starting with the most important bottlenecks:
Poorly indexed queries against a source system or intermediate table
SQL syntax causing wrong optimizer choice
Insu cient random access memory (RAM) causing thrashing
Sorting in the RDBMS
Slow transformation steps
Excessive I/O
Unnecessary writes followed by reads
Dropping and rebuilding aggregates from scratch rather than incrementally
Filtering (change data capture) applied too late in the pipeline
Untapped opportunities for parallelizing and pipelining
Unnecessary transaction logging, especially if doing updates
Network traffic and file transfer overhead
Subsystem 28: Sorting System
Certain common ETL processes call for data to be sorted in a particular order,
such as aggregating and joining flat file sources. Because sorting is such a funda-
mental ETL processing capability, it is called out as a separate subsystem to ensure
it receives proper attention as a component of the ETL architecture. There are a
variety of technologies available to provide sorting capabilities. An ETL tool can
undoubtedly provide a sort function, the DBMS can provide sorting via the SQL
ORDER BY clause, and there are a number of sort utilities available.
Sorting simple delimited text files with a dedicated sort package is awesomely
fast. These packages typically allow a single read operation to produce up to eight
different sorted outputs. Sorting can produce aggregates where each break row of a
given sort is a row for the aggregate table, and sorting plus counting is often a good
way to diagnose data quality issues.
The key is to choose the most efficient sort resource to support the requirements
within your infrastructure. The easy answer for most organizations is to simply
utilize the ETL tool's sort function. However, in some situations it may be more
efficient to use a dedicated sort package, although ETL and DBMS vendors claim to
have made up much of the performance differences.
Subsystem 29: Lineage and Dependency Analyzer
Two increasingly important elements being requested of the ETL system are the
ability to track both the lineage and dependencies of data in the DW/BI system:
Lineage. Beginning with a specific data element in an intermediate table or BI
report, identify the source of that data element, other upstream intermediate
tables containing that data element and its sources, and all transformations
that data element and its sources have undergone.
Dependency. Beginning with a specific data element in a source table or an
intermediate table, identify all downstream intermediate tables and final BI
reports containing that data element or its derivations and all transformations
applied to that data element and its derivations.
Lineage analysis is often an important component in a highly compliant environ-
ment where you must explain the complete processing flow that changed any data
result. This means the ETL system must display the ultimate physical sources and
all subsequent transformations of any selected data element, chosen either from the
middle of the ETL pipeline or on a final delivered report. Dependency analysis is
important when assessing changes to a source system and the downstream impacts
on the data warehouse and ETL system. This implies the ability to display all affected
downstream data elements and final report fields affected by a potential change in
any selected data element, chosen either in the middle of the ETL pipeline or an
original source (dependency).
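As a sketch of what dependency analysis can look like, assuming the ETL tool exposes its
source-to-target mappings in a hypothetical etl_column_map table, a recursive query can walk
from a source column to every downstream column derived from it:

-- List all downstream columns ultimately derived from ORDERS.ORDER_AMT (names are illustrative).
WITH RECURSIVE downstream AS (
    SELECT target_table, target_column
    FROM etl_column_map
    WHERE source_table = 'ORDERS' AND source_column = 'ORDER_AMT'
    UNION ALL
    SELECT m.target_table, m.target_column
    FROM etl_column_map m
    JOIN downstream d
      ON m.source_table = d.target_table AND m.source_column = d.target_column
)
SELECT DISTINCT target_table, target_column FROM downstream;

Reversing the direction of the join walks upstream instead, which is the lineage question.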
Subsystem 30: Problem Escalation System
Typically, the ETL team develops the ETL processes and the quality assurance
team tests them thoroughly before they are turned over to the group responsible
for day-to-day systems operations. To make this work, the ETL architecture needs
to include a proactively designed problem escalation system similar to what is in
place for other production systems.
After the ETL processes have been developed and tested, the first level of oper-
ational support for the ETL system should be a group dedicated to monitoring
production applications. The ETL development team becomes involved only if the
operational support team cannot resolve a production problem.
Ideally, you have developed ETL processes, wrapped them into an automated
scheduler, and have robust workflow monitoring capabilities peering into the ETL
processes as they execute. The execution of the ETL system should be a hands-off
operation. It should run like clockwork without human intervention and without
fail. If a problem does occur, the ETL process should automatically notify the
problem escalation system of any situation that needs attention or resolution.
This automatic feed may take the form of simple error logs, operator notification
messages, supervisor notification messages, and system developer messages. The
ETL system may notify an individual or a group depending on the severity of the
situation or the processes involved. ETL tools can support a variety of messag-
ing capabilities including e-mail alerts, operator messages, and notifications to
mobile devices.
Each notification event should be written to a database used to understand the
types of problems that arise, their status, and resolution. This data forms part of
the process metadata captured by the ETL system (subsystem 34). You need to
ensure that organizational procedures are in place for proper escalation, so every
problem is resolved appropriately.
In general, the support structure for the ETL system should follow a fairly stan-
dard support structure. First level support is typically a help desk that is the first
point of contact when a user notices an error. The help desk is responsible for
resolution whenever feasible. If the help desk cannot resolve the issue, the second
level support is notified. This is typically a systems administrator or DBA on
the production control technical staff capable of supporting general infrastructure
failures. The ETL manager is the third level support and should be knowledgeable
enough to support most issues that arise in the ETL production process. Finally, when all
else fails, the ETL developer should be called in to analyze the situation and assist
with resolution.
Subsystem 31: Parallelizing/Pipelining System
The goal of the ETL system, in addition to providing high quality data, is to load the
data warehouse within the allocated processing window. In large organizations with
huge data volumes and a large portfolio of dimensions and facts, loading the data
within these constraints can be a challenge. The parallelizing/pipelining system pro-
vides capabilities to enable the ETL system to deliver within these time constraints.
The goal of this system is to take advantage of multiple processors or grid computing
resources commonly available. It is highly desirable, and in many cases necessary, that
parallelizing and pipelining be automatically invoked for every ETL process unless
specific conditions preclude it from processing in such a manner, such as waiting on
a condition in the middle of the process.
Parallelizing is a powerful performance technique at every stage of the ETL
pipeline. For example, the extraction process can be parallelized by logically parti-
tioning on ranges of an attribute. Verify that the source DBMS handles parallelism
correctly and doesn’t spawn confl icting processes. If possible, choose an ETL tool
that handles parallelizing of intermediate transformation processes automatically.
In some tools, it is necessary to hand-create parallel processes. This is fine until
you add additional processors, and the ETL system then can’t take advantage of the
greater parallelization opportunities unless you modify the ETL modules by hand
to increase the number of parallel flows.
Subsystem 32: Security System
Security is an important consideration for the ETL system. A serious security breach
is much more likely to come from within the organization than from someone hack-
ing in from the outside. Although we don’t like to think it, the folks on the ETL
team present as much of a potential threat as any group inside the organization. We
recommend administering role-based security on all data and metadata in the ETL
system. To support compliance requirements, you may need to prove that a version
of an ETL module hasn’t been changed or show who made changes to a module.
You should enforce comprehensive authorized access to all ETL data and metadata
by individual and role. In addition, you’ll want to maintain a historical record of
all accesses to ETL data and metadata by individual and role. Another issue to be
careful of is the bulk data movement process. If you move data across the network,
even if it is within the company firewall, it pays to be careful. Make sure to use data
encryption or a file transfer utility that uses a secure transfer protocol.
Another back room security issue to consider is administrator access to the
production warehouse server and software. We’ve seen situations where no one on
the team had security privileges; in other cases, everyone had access to everything.
Obviously, many members of the team should have privileged access to the devel-
opment environment, but the production warehouse should be strictly controlled.
On the other hand, someone from the DW/BI team needs to be able to reset the
warehouse machine if something goes wrong. Finally, the backup media should be
guarded. The backup media should have as much security surrounding them as
the online systems.
Subsystem 33: Compliance Manager
In highly compliant environments, supporting compliance requirements is a sig-
nificant new requirement for the ETL team. Compliance in the data warehouse
involves “maintaining the chain of custody” of the data. In the same way a police
department must carefully maintain the chain of custody of evidence to argue that
the evidence has not been changed or tampered with, the data warehouse must also
carefully guard the compliance-sensitive data entrusted to it from the moment it
arrives. Furthermore, the data warehouse must always show the exact condition and
content of such data at any point in time that it may have been under the control
of the data warehouse. The data warehouse must also track who had authorized
access to the data. Finally, when the suspicious auditor looks over your shoulder,
you need to link back to an archived and time-stamped version of the data as it
was originally received, which you have stored remotely with a trusted third party.
If the data warehouse is prepared to meet all these compliance requirements, then
the stress of being audited by a hostile government agency or lawyer armed with a
subpoena should be greatly reduced.
The compliance requirements may mean you cannot actually change any data, for
any reason. If data must be altered, then a new version of the altered records must
be inserted into the database. Each row in each table therefore must have begin
and end time stamps that accurately represent the span of time when the record
was the “current truth.” The big impact of these compliance requirements on the
data warehouse can be expressed in simple dimensional modeling terms. Type 1
and type 3 changes are dead. In other words, all changes become inserts. No more
deletes or overwrites.
Figure 19-12 shows how a fact table can be augmented so that overwrite changes
are converted into a fact table equivalent of a type 2 change. The original fact table
consisted of the lower seven columns starting with activity date and ending with
net dollars. The original fact table allowed overwrites. For example, perhaps there
is a business rule that updates the discount and net dollar amounts after the row is
originally created. In the original version of the table, history is lost when the over-
write change takes place, and the chain of custody is broken.
[Figure 19-12 depicts the compliance-enabled transaction grain fact table. The five added compliance columns are Fact Table Surrogate Key (PK), Begin Version Date/Time (PK), End Version Date/Time, Change Reference Key (FK), and Source Reference Key (FK). The original fact table columns are Activity Date Key (FK), Activity Date/Time, Customer Key (FK), Service Key (FK), Gross Dollars, Discount Dollars, and Net Dollars.]
Figure 19-12: Compliance-enabled transaction fact table.
To convert the fact table to be compliance-enabled, five columns are added, as
listed in Figure 19-12. A fact table surrogate key is created for each original unmodified fact
table row. This surrogate key, like a dimension table surrogate key, is just a unique
integer that is assigned as each original fact table row is created. The begin version
date/time stamp is the exact time of creation of the fact table row. Initially, the end
version date/time is set to a fictitious date/time in the future. The change reference
is set to “original,” and the source reference is set to the operational source.
When an overwrite change is needed, a new row is added to the fact table with the
same fact table surrogate key, and the appropriate regular columns changed, such
as discount dollars and net dollars. The begin version date/time column is set to the
exact date/time when the change in the database takes place. The end version date/
time is set to a fictitious date/time in the future. The end version date/time of the
original fact row is now set to the exact date/time when the change in the database
takes place. The change reference now provides an explanation for the change, and
the source reference provides the source of the revised columns.
Referring to the design in Figure 19-12, a specific moment in time can be selected
and the fact table constrained to show exactly what the rows contained at that
moment. The alterations to a given row can be examined by constraining to a spe-
cific fact table surrogate key and sorting by begin version date/time.
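For example, point-in-time and row-history queries against a compliance-enabled table might
look like the following sketch (the table and column names are representative of Figure 19-12,
not prescribed):

-- Show the fact rows exactly as they stood at a chosen moment.
SELECT *
FROM compliance_fact
WHERE begin_version_datetime <= TIMESTAMP '2013-06-30 23:59:59'
  AND end_version_datetime > TIMESTAMP '2013-06-30 23:59:59';

-- Show the full change history of a single fact row.
SELECT *
FROM compliance_fact
WHERE fact_surrogate_key = 1234567
ORDER BY begin_version_datetime;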
The compliance machinery is a significant addition to a normal fact table (refer to
Figure 19-12). If the compliance-enabled table is actually used for only demonstrating
compliance, then a normal version of the fact table with just the original columns
can remain as the main operational table, with the compliance-enabled table existing
only in the background. The compliance-enabled table doesn’t need to be indexed
for performance because it will not be used in a conventional BI environment.
For heaven’s sake, don’t assume that all data is now subject to draconian compli-
ance restrictions. It is essential you receive firm guidelines from the chief compliance
officer before taking any drastic steps.
The foundation of a compliance system is the interaction of several of the subsys-
tems already described married to a few key technologies and capabilities:
Lineage analysis. Show where a final piece of data came from to prove the
original source data plus the transformations including stored procedures
and manual changes. This requires full documentation of all the transforms and
the technical ability to rerun the transforms against the original data.
Dependency analysis. Show where an original source data element was ever
used.
Version control. It may be necessary to rerun the source data through the
ETL system in e ect at the time, requiring the exact version of the ETL system
for any given data source.
Backup and restore. Of course, the requested data may have been archived years
ago and need to be restored for audit purposes. Hopefully, you archived the
proper version of the ETL system alongside the data, so both the data and
the system can be restored. It may be necessary to prove the archived data hasn’t
been altered. During the archival process, the data can be hash-coded and the
hash and data separated. Have the hash codes archived separately by a trusted
third party. Then, when demanded, restore the original data, hash code it again,
and then compare to the hash codes retrieved from the trusted third party to
prove the authenticity of the data.
Security. Show who has accessed or modified the data and transforms. Be
prepared to show roles and privileges for users. Guarantee the security log
can't be altered by using write-once media.
Audit dimension. The audit dimension ties runtime metadata context directly
with the data to capture quality events at the time of the load.
Subsystem 34: Metadata Repository Manager
The ETL system is responsible for the use and creation of much of the metadata
involved in the DW/BI environment. Part of the overall metadata strategy should
be to specifically capture ETL metadata, including the process metadata, technical
metadata, and business metadata. Develop a balanced strategy between doing noth-
ing and doing too much. Make sure there’s time in the ETL development tasks to
capture and manage metadata. And finally, make sure someone on the DW/BI team
is assigned the role of metadata manager and owns the responsibility for creating
and implementing the metadata strategy.
Summary
In this chapter we have introduced the key building blocks of the ETL system. As
you may now better appreciate, building an ETL system is unusually challenging;
the ETL system must address a number of demanding requirements. This chapter
identified and reviewed the 34 subsystems of ETL and gathered these subsystems
into four key areas that represent the ETL process: extracting, cleaning and con-
forming, delivering, and managing. Careful consideration of all the elements of
the ETL architecture is the key to success. You must understand the full breadth
of requirements and then set an appropriate and e ective architecture in place.
ETL is more than simply extract, transform, and load; it’s a host of complex and
important tasks. In the next chapter we will describe the processes and tasks for
building the ETL system.
ETL System Design and Development Process and Tasks
Developing the extract, transformation, and load (ETL) system is the hidden
part of the iceberg for most DW/BI projects. So many challenges are buried in
the data sources and systems that developing the ETL application invariably takes
more time than expected. This chapter is structured as a 10-step plan for creating the
data warehouse’s ETL system. The concepts and approach described in this chapter,
based on content from The Data Warehouse Lifecycle Toolkit, Second Edition (Wiley,
2008), apply to systems based on an ETL tool, as well as hand-coded systems.
Chapter 20 discusses the following concepts:
ETL system planning and design considerations
Recommendations for one-time historic data loads
Development tasks for incremental load processing
Real-time data warehousing considerations
ETL Process Overview
This chapter follows the flow of planning and implementing the ETL system. We
implicitly discuss the 34 ETL subsystems presented in Chapter 19: ETL Subsystems
and Techniques, broadly categorized as extracting data, cleaning and conforming,
delivering for presentation, and managing the ETL environment.
Before beginning the ETL system design for a dimensional model, you should
have completed the logical design, drafted your high-level architecture plan, and
drafted the source-to-target mapping for all data elements.
The ETL system design process is critical. Gather all the relevant information,
including the processing burden the extracts will be allowed to place on the opera-
tional source systems, and test some key alternatives. Does it make sense to host the
transformation process on the source system, target system, or its own platform?
What tools are available on each, and how effective are they?
Develop the ETL Plan
ETL development starts out with the high-level plan, which is independent of any
specific technology or approach. However, it's a good idea to decide on an ETL tool
before doing any detailed planning; this can avoid redesign and rework later in the
process.
Step 1: Draw the High-Level Plan
We start the design process with a very simple schematic of the known pieces of the
plan: sources and targets, as shown in Figure 20-1. This schematic is for a fictitious
utility company’s data warehouse, which is primarily sourced from a 30-year-old
COBOL system. If most or all the data comes from a modern relational transaction
processing system, the boxes often represent a logical grouping of tables in the
transaction system model.
[Figure 20-1 depicts the known sources and targets for the fictitious utility company. Sources: the Customer Master (RDBMS), the Geography Master (RDBMS, about 15,000 geographies), the DMR system (a COBOL flat file with 2,000 fields and one row per customer), a Meters database (MS Access), and a Date spreadsheet. Targets: the Customer, Geography, Electric Meter, Meter Read Date, Usage Month, and Electricity Usage tables. Annotations on the schematic flag open issues, including slowly changing customer demographics and account status; 25 million customers with roughly 10,000 new or changed customers per day; referential integrity checks on the dimension lookups; geography labels needing cosmetic work; "unbucketizing" 13 months stored in one row; processing 750,000 customers per day; missed meter reads, estimated bills, and restatements of bills; pre-1972 meter types missing from the Meters group's system; 73 known meter types; and the unanswered question of how, and by whom, the meter data is maintained.]
Figure 20-1: Example high-level data staging plan schematic.
As you develop the detailed ETL system specification, the high-level view requires
additional details. Figure 20-1 deliberately highlights contemporary questions and
unresolved issues; this plan should be frequently updated and released. You might
sometimes keep two versions of the diagram: a simple one for communicating
with people outside the team and a detailed version for internal DW/BI team
documentation.
Step 2: Choose an ETL Tool
There are a multitude of ETL tools available in the data warehouse marketplace.
Most of the major database vendors offer an ETL tool, usually at additional licensing
cost. There are also excellent ETL tools available from third-party vendors.
ETL tools read data from a range of sources, including flat files, ODBC, OLE DB,
and native database drivers for most relational databases. The tools contain func-
tionality for defining transformations on that data, including lookups and other
kinds of joins. They can write data into a variety of target formats. And they all
contain some functionality for managing the overall logic fl ow in the ETL system.
If the source systems are relational, the transformation requirements are straight-
forward, and good developers are on staff, the value of an ETL tool may not be
immediately obvious. However, there are several reasons that using an ETL tool is
an industry standard best practice:
Self-documentation that comes from using a graphical tool. A hand-coded
system is usually an impenetrable mess of staging tables, SQL scripts, stored
procedures, and operating system scripts.
Metadata foundation for all steps of the ETL process.
Version control for multi-developer environments and for backing out and
restoring consistent versions.
Advanced transformation logic, such as fuzzy matching algorithms, inte-
grated access to name and address deduplication routines, and data mining
algorithms.
Improved system performance at a lower level of expertise. Relatively few SQL
developers are truly expert on how to use the relational database to manipulate
extremely large data volumes with excellent performance.
Sophisticated processing capabilities, including automatically parallel-
izing tasks, and automatic fail-over when a processing resource becomes
unavailable.
One-step conversion of virtualized data transformation modules into their
physical equivalents.
Don’t expect to recoup the investment in an ETL tool on the fi rst phase of the
DW/BI project. The learning curve is steep enough that developers sometimes feel
the project could have been implemented faster by coding. The big advantages
come with future phases, and particularly with future modifications to existing
systems.
Step 3: Develop Default Strategies
With an overall idea of what needs to happen and what the ETL tool’s infrastructure
requires, you should develop a set of default strategies for the common activities in
the ETL system. These activities include:
Extract from each major source system. At this point in the design process,
you can determine the default method for extracting data from each source
system. Will you normally push from the source system to a flat file, extract
in a stream, use a tool to read the database logs, or another method? This
decision can be modified on a table-by-table basis. If using SQL to access
source system data, make sure the native data extractors are used rather than
ODBC, if that’s an option.
Archive extracted and staged data. Extracted or staged data, before it’s been
transformed, should be archived for at least a month. Some organizations
permanently archive extracted and staged data.
Police data quality for dimensions and particularly facts. Data quality must
be monitored during the ETL process rather than waiting for business users
to find data problems. Chapter 19 describes a comprehensive architecture
for measuring and responding to data quality issues in ETL subsystems 4
through 8.
Manage changes to dimension attributes. In Chapter 19, we described the
logic required to manage dimension attribute changes in ETL subsystem 9.
Ensure the data warehouse and ETL system meet the system availability
requirements. The first step to meeting availability requirements is to docu-
ment them. You should document when each data source becomes available
and block out high-level job sequencing.
Design the data auditing subsystem. Each row in the data warehouse tables
should be tagged with auditing information that describes how the data
entered the system.
Organize the ETL staging area. Most ETL systems stage the data at least once
or twice during the ETL process. By staging, we mean the data will be written
to disk for a later ETL step and for system recovery and archiving.
Step 4: Drill Down by Target Table
After overall strategies for common ETL tasks have been developed, you should start
drilling into the detailed transformations needed to populate each target table in the
data warehouse. As you're finalizing the source-to-target mappings, you also perform
more data profiling to thoroughly understand the necessary data transformations
for each table and column.
Ensure Clean Hierarchies
It’s particularly important to investigate whether hierarchical relationships in the
dimension data are perfectly clean. Consider a product dimension that includes
a hierarchical rollup from product stock keeping unit (SKU) to product category.
In our experience, the most reliable hierarchies are well managed in the source
system. The best source systems normalize the hierarchical levels into multiple
tables, with foreign key constraints between the levels. In this case, you can be
confident the hierarchies are clean. If the source system is not normalized, especially
if the source for the hierarchies is an Excel spreadsheet on a business user's
desktop, then you must either clean it up or acknowledge that it is not a hierarchy.
Develop Detailed Table Schematics
Figure 20-2 illustrates the level of detail that's useful for the table-specific drilldown;
it’s for one of the tables in the utility company example previously illustrated.
[Figure 20-2 depicts the detailed load flow for the electricity usage fact table. The Detailed Master Record (DMR) extract contains 80 columns, including 13 monthly usage buckets, with one row per customer meter, sorted by customer; it is an EBCDIC, tab-delimited file named xdmr_yyyymmdd.dat, and no transformations are performed on the mainframe in order to minimize mainframe coding and load. The file is compressed, encrypted, and transferred via ftp. After dimension processing, the rows are unbucketized (Usage1 keyed to yyyymm, Usage2 keyed to yyyymm-1, and so on, stopping when yyyymm precedes the subscribe date), yielding roughly 13 times as many rows; because the data is presorted by customer, the customer key lookup is performed on the same pass. Successive staging steps (fact_stage1 through fact_stage4) then sort the data and look up the meter, geography, and read date surrogate keys before the bulk load into electric_usage_fact.]
Figure 20-2: Example draft detailed load schematic for the fact table.
All the dimension tables must be processed before the key lookup steps for the fact
table. The dimension tables are usually independent from each other, but sometimes
they also have processing dependencies. It’s important to clarify these dependencies,
as they become fixed points around which the job control flows.
Develop the ETL Specification Document
We've walked through some general strategies for high-level planning and the physi-
cal design of the ETL system. Now it's time to pull everything together and develop
a detailed specification for the entire ETL system.
All the documents developed so far, including the source-to-target mappings, data
profiling reports, and physical design decisions, should be rolled into the first sections of
the ETL specification. Then document all the decisions discussed in this chapter,
including:
Default strategy for extracting from each major source system
Archiving strategy
Data quality tracking and metadata
Default strategy for managing changes to dimension attributes
System availability requirements and strategy
Design of the data auditing subsystem
Locations of staging areas
The next section of the ETL specification describes the historic and incremen-
tal load strategies for each table. A good specification includes between two and
10 pages of detail for each table, and documents the following information and
decisions:
Table design (column names, data types, keys, and constraints)
Historic data load parameters (number of months) and volumes (row counts)
Incremental data volumes, measured as new and updated rows per load cycle
Handling of late arriving data for facts and dimensions
Load frequency
Handling of slowly changing dimension (SCD) changes for each dimension
attribute
Table partitioning, such as monthly
Overview of data sources, including a discussion of any unusual source char-
acteristics, such as an unusually brief access window
Detailed source-to-target mapping
Source data profiling, including at least the minimum and maximum values
for each numeric column, count of distinct values in each column, and inci-
dence of NULLs
Extract strategy for the source data (for example, source system APIs, direct
query from database, or dump to flat files)
Dependencies, including which other tables must be loaded before this table
is processed
Document the transformation logic. It’s easiest to write this section as pseudo
code or a diagram, rather than trying to craft complete sentences.
Preconditions to avoid error conditions. For example, the ETL system must
check for file or database space before proceeding.
Cleanup steps, such as deleting working files
An estimate of whether this portion of the ETL system will be easy, medium,
or difficult to implement
NOTE Although most people would agree that all the items described in the
ETL system specification document are necessary, it's a lot of work to pull this
document together, and even more work to keep it current as changes occur.
Realistically, if you pull together the "one-pager" high-level flow diagram, data
model and source-to-target maps, and a five-page description of what you plan to
do, you’ll get a better start than most teams.
Develop a Sandbox Source System
During the ETL development process, the source system data needs to be inves-
tigated at great depth. If the source system is heavily loaded, and there isn’t some
kind of reporting instance for operational queries, the DBAs may be willing to set
up a static snapshot of the database for the ETL development team. Early in the
development process, it's convenient to poke around sandbox versions of the source
systems without worrying about launching some kind of killer query.
It’s easy to build a sandbox source system that simply copies the original; build a
sandbox with a subset of data only if the data volumes are extremely large. On the
plus side, this sandbox could become the basis of training materials and tutorials
after the system is deployed into production.
Develop One-Time Historic Load Processing
After the ETL specification has been created, you typically focus on developing
the ETL process for the one-time load of historic data. Occasionally, the same ETL
code can perform both the initial historic load and ongoing incremental loads, but
more often you build separate ETL processes for the historic and ongoing loads. The
historic and incremental load processes have a lot in common, and depending on
the ETL tool, significant functionality can be reused from one to the other.
Step 5: Populate Dimension Tables with Historic Data
In general, you start building the ETL system with the simplest dimension tables.
After these dimension tables have been successfully built, you tackle the historic
loads for dimensions with one or more columns managed as SCD type 2.
Populate Type 1 Dimension Tables
The easiest type of table to populate is a dimension table for which all attributes
are managed as type 1 overwrites. With a type 1–only dimension, you extract the
current value for each dimension attribute from the source system.
Dimension Transformations
Even the simplest dimension table may require substantial data cleanup and will
certainly require surrogate key assignment.
Simple Data Transformations
The most common, and easiest, form of data transformation is data type conversion.
All ETL tools have rich functions for data type conversion. This task can be tedious,
but it is seldom onerous. We strongly recommend replacing NULL values with
default values within dimension tables; as we have discussed previously, NULLs
can cause problems when they are directly queried.
Combine from Separate Sources
Often dimensions are derived from several sources. Customer information may
need to be merged from several lines of business and from outside sources. There
is seldom a universal key pre-embedded in the various sources that makes this
merge operation easy.
Most consolidation and deduplicating tools and processes work best if names
and addresses are first parsed into their component pieces. Then you can use a
set of passes with fuzzy logic that account for misspellings, typos, and alternative
spellings such as I.B.M., IBM, and International Business Machines. In most orga-
nizations, there is a large one-time project to consolidate existing customers. This
is a tremendously valuable role for master data management systems.
Decode Production Codes
A common merging task in data preparation is looking up text equivalents for pro-
duction codes. In some cases, the text equivalents are sourced informally from a
nonproduction source such as a spreadsheet. The code lookups are usually stored in
a table in the staging database. Make sure the ETL system includes logic for creat-
ing a default decoded text equivalent for the case in which the production code is
missing from the lookup table.
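For example, a left outer join to the staged lookup table, substituting a default description
when the code is missing, might look like this sketch (table and column names are illustrative):

-- Decode production codes; codes absent from the lookup get a visible default value.
SELECT s.product_code,
       COALESCE(l.product_code_description,
                'Unknown code: ' || s.product_code) AS product_code_description
FROM staging_sales s
LEFT OUTER JOIN code_lookup l
  ON l.product_code = s.product_code;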
Validate Many-to-One and One-to-One Relationships
The most important dimensions probably have one or more rollup paths, such as
products rolling up to product model, subcategory, and category, as illustrated in
Figure 20-3. These hierarchical rollups need to be perfectly clean.
Product Dimension
Product Key (PK)
Product SKU
Product Name
Product Description
Product Model
Product Model Description
Subcategory Description
Category Description
Category Manager
Figure 20-3: Product dimension table with a hierarchical relationship.
Many-to-one relationships between attributes, such as a product to product
model, can be verified by sorting on the "many" attribute and verifying that each
value has a unique value on the “one” attribute. For example, this query returns
the products that have more than one product model:
SELECT Product_SKU,
count(*) as Row_Count,
count(distinct Product_Model) as Model_Count
FROM StagingDatabase.Product
GROUP BY Product_SKU
HAVING count(distinct Product_Model) > 1 ;
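A one-to-one relationship is verified the same way, checking both directions. For example,
if Product_SKU and Product_Name should correspond exactly, any rows returned by either
of the following queries indicate a violation:

SELECT Product_SKU
FROM StagingDatabase.Product
GROUP BY Product_SKU
HAVING count(distinct Product_Name) > 1;

SELECT Product_Name
FROM StagingDatabase.Product
GROUP BY Product_Name
HAVING count(distinct Product_SKU) > 1;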
Database administrators sometimes want to validate many-to-one relationships
by loading data into a normalized snowflake version of the dimension table in the
staging database, as illustrated in Figure 20-4. Note that the normalized version
requires individual keys at each of the hierarchy levels. This is not a problem if the
source system supplies the keys, but if you normalize the dimension in the ETL
environment, you need to create them.
The snowflake structure has some value in the staging area: It prevents you from
loading data that violates the many-to-one relationship. However, in general, the
relationships should be pre-verified as just described, so that you never attempt to
load bad data into the dimension table. After the data is pre-verified, it's not tremen-
dously important whether you make the database engine reconfirm the relationship
at the moment you load the table.
If the source system for a dimensional hierarchy is a normalized database, it’s
usually unnecessary to repeat the normalized structure in the ETL staging area.
However, if the hierarchical information comes from an informal source such as a
spreadsheet managed by the marketing department, you may benefit from normal-
izing the hierarchy in the ETL system.
[Figure 20-4 depicts the normalized version of the product hierarchy as four snowflaked tables: a Product table (Product Key PK, Product SKU, Product Name, Product Description, Product Model Key FK) referencing a Product Model table (Product Model Key PK, Product Model Number, Product Model Description, Product Subcategory Key FK), which references a Product Subcategory table (Product Subcategory Key PK, Product Subcategory Description, Product Category Key FK), which references a Product Category table (Product Category Key PK, Product Category Description, Product Category Manager).]
Figure 20-4: Snowflaked hierarchical relationship in the product dimension.
Dimension Surrogate Key Assignment
After you are confident you have dimension tables with one row for each true unique
dimension member, the surrogate keys can be assigned. You maintain a table in the
ETL staging database that matches production keys to surrogate keys; you can use
this key map later during fact table processing.
Surrogate keys are typically assigned as integers, increasing by one for each
new key. If the staging area is in an RDBMS, surrogate key assignment is elegantly
accomplished by creating a sequence. Although syntax varies among the relational
engines, the process is first to create a sequence and then to populate the key map
table.
Here’s the syntax for the one-time creation of the sequence:
create sequence dim1_seq cache 1000; -- choose an appropriate cache level
And then here’s the syntax to populate the key map table:
insert into dim1_key_map (production_key_id, dim1_key)
select production_key_id, dim1_seq.NEXTVAL
from dim1_extract_table;
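The key map table itself can be as simple as the following sketch; the exact data type of the
production key depends on the source system:

-- Staging table that maps each production (natural) key to its assigned surrogate key.
create table dim1_key_map (
    production_key_id varchar(30) not null, -- natural key from the source system
    dim1_key integer not null, -- surrogate key assigned from dim1_seq
    primary key (production_key_id)
);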
Dimension Table Loading
After the dimension data is properly prepared, the load process into the target tables
is fairly straightforward. Even though the first dimension table is usually small, use
the database’s bulk or fast-loading utility or interface. You should use fast-loading
techniques for most table inserts. Some databases have extended the SQL syntax to
include a BULK INSERT statement. Others have published an API to load data into
the table from a stream.
The bulk load utilities and APIs come with a range of parameters and transforma-
tion capabilities including the following:
Turn o logging. Transaction logging adds signifi cant overhead and is not
valuable when loading data warehouse tables. The ETL system should be
designed with one or more recoverability points where you can restart pro-
cessing should something go wrong.
Bulk load in fast mode. However, most of the database engines’ bulk load
utilities or APIs require several stringent conditions on the target table to bulk
load in fast mode. If these conditions are not met, the load should not fail; it
simply will not use the “fast” path.
Presort the file. Sorting the file in the order of the primary index significantly
speeds up indexing.
Transform with caution. In some cases, the loader supports data conversions,
calculations, and string and date/time manipulation. Use these features care-
fully and test performance. In some cases, these transformations cause the
loader to switch out of high-speed mode into a line-by-line evaluation of the
load file. We recommend using the ETL tool to perform most transformations.
Truncate table before full refresh. The TRUNCATE TABLE statement is the most
e cient way to delete all the rows in the table. It’s commonly used to clean out
a table from the staging database at the beginning of the day’s ETL processing.
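A typical full-refresh pattern for a staging table, hedged on the exact bulk load syntax of
your database engine, looks like this sketch:

-- Empty the staging table without logging individual row deletes.
TRUNCATE TABLE stage_product;

-- Reload it with the engine's bulk interface; syntax varies by DBMS
-- (this form resembles SQL Server's BULK INSERT).
BULK INSERT stage_product
FROM 'C:\etl\extracts\product_20130630.dat'
WITH (FIELDTERMINATOR = '|', TABLOCK);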
Load Type 2 Dimension Table History
Recall from Chapter 5: Procurement, that dimension attribute changes are typically
managed as type 1 (overwrite) or type 2 (track history by adding new rows to the
dimension table). Most dimension tables contain a mixture of type 1 and type 2
attributes. More advanced SCD techniques are described in Chapter 5.
During the historic load, you need to re-create history for dimension attributes
that are managed as type 2. If business users have identified an attribute as impor-
tant for tracking history, they want that history going back in time, not just from the
date the data warehouse is implemented. It's usually difficult to re-create dimension
attribute history, and sometimes it’s completely impossible.
This process is not well suited for standard SQL processing. It’s better to use
a database cursor construct or, even better, a procedural language such as Visual
Basic, C, or Java to perform this work. Most ETL tools enable script processing on
the data as it fl ows through the ETL system.
When you’ve completely reconstructed history, make a fi nal pass through the
data to set the row end date column. It’s important to ensure there are no gaps in
the series. We prefer to set the row end date for the older version of the dimension
member to the day before the row effective date for the new row if these row dates
have a granularity of a full day. If the effective and end dates are actually precise
date/time stamps accurate to the minute or second, then the end date/time must be
set to exactly the begin date/time of the next row so that no gap exists between rows.
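A final pass to set gap-free end dates can often be expressed with a window function; the
sketch below assumes full-day granularity, a hypothetical customer_dim staging table, and
PostgreSQL-style UPDATE syntax:

-- Set each row's end date to the day before the next row's effective date for the same customer;
-- the current row receives a far-future end date.
UPDATE customer_dim d
SET row_end_date = x.new_end_date
FROM (SELECT customer_key,
             COALESCE(LEAD(row_effective_date)
                        OVER (PARTITION BY customer_id ORDER BY row_effective_date) - 1,
                      DATE '9999-12-31') AS new_end_date
      FROM customer_dim) x
WHERE x.customer_key = d.customer_key;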
Populate Date and Other Static Dimensions
Every data warehouse database should have a date dimension, usually at the granu-
larity of one row for each day. The date dimension should span the history of the
data, starting with the oldest fact transaction in the data warehouse. It’s easy to set
up the date dimension for the historic data because you know the date range of the
historic fact data being loaded. Most projects build the date dimension by hand,
typically in a spreadsheet.
A handful of other dimensions will be created in a similar way. For example, you
may create a budget scenario dimension that holds the values Actual and Budget.
Business data governance representatives should sign off on all constructed dimen-
sion tables.
Step 6: Perform the Fact Table Historic Load
The one-time historic fact table load differs fairly significantly from the ongoing
incremental processing. The biggest worry during the historic load is the sheer
volume of data, sometimes thousands of times bigger than the daily incremental
load. On the other hand, you have the luxury of loading into a table that’s not in
production. If it takes several days to load the historic data, that’s usually tolerable.
Historic Fact Table Extracts
As you identify records that fall within the basic parameters of the extract, make
sure these records are useful for the data warehouse. Many transaction systems
keep operational information in the source system that may not be interesting from
a business point of view.
It’s also a good idea to accumulate audit statistics during this step. As the extract
creates the results set, it is often possible to capture various subtotals, totals, and
row counts.
Audit Statistics
During the planning phase for the ETL system, you identified various measures of
data quality. These are usually calculations, such as counts and sums, that you com-
pare between the data warehouse and source systems to cross-check the integrity
of the data. These numbers should tie backward to operational reports and forward
to the results of the load process in the warehouse. The tie back to the operational
system is important because it is what establishes the credibility of the warehouse.
NOTE There are scenarios in which it's difficult or impossible for the warehouse
to tie back to the source system perfectly. In many cases, the data warehouse extract
includes business rules that have not been applied to the source systems. Even
more vexing are errors in the source system! Also, differences in timing make it
even more difficult to cross-check the data. If it's not possible to tie the data back
exactly, you need to explain the differences.
Fact Table Transformations
In most projects, the fact data is relatively clean. The ETL system developer spends
a lot of time improving the dimension table content, but the facts usually require
a fairly modest transformation. This makes sense because in most cases the facts
come from transaction systems used to operate the organization.
The most common transformations to fact data include transformation of null
values, pivoting or unpivoting the data, and precomputing derived calculations. All
fact rows then enter the surrogate key pipeline to exchange the natural keys for the
dimension surrogate keys managed in the ETL system.
Null Fact Values
All major database engines explicitly support a null value. In many source systems,
however, the null value is represented by a special value of what should be a legitimate
fact. Perhaps the special value of –1 is understood to represent null. For most fact
table metrics, the –1 in this scenario should be replaced with a true NULL. A null
value for a numeric measure is reasonable and common in the fact table. Nulls do
the right thing in calculations of sums and averages across fact table rows. It’s only
in the dimension tables that you should strive to replace null values with specially
crafted default values. Finally, you should not allow any null values in the fact table
columns that reference the dimension table keys. These foreign key columns should
always be defined as NOT NULL.
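A minimal sketch of these two rules in PostgreSQL-flavored SQL, assuming an illustrative staged table stg_sales whose source marks missing measures with –1, and a target sales_fact table:

-- Replace the source system's -1 "null marker" with a true NULL
-- so that SUM and AVG behave correctly across fact rows.
UPDATE stg_sales
SET    sales_amount = NULL
WHERE  sales_amount = -1;

-- Foreign keys to the dimension tables, by contrast, must never be NULL.
ALTER TABLE sales_fact ALTER COLUMN date_key     SET NOT NULL;
ALTER TABLE sales_fact ALTER COLUMN customer_key SET NOT NULL;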
Improve Fact Table Content
As we have stressed, all the facts in the final fact table row must be expressed in
the same grain. This means there must be no facts representing totals for the year
in a daily fact table or totals for some geography larger than the fact table’s grain.
If the extract includes an interleaving of facts at different grains, the transforma-
tion process must eliminate these aggregations, or move them into the appropriate
aggregate tables.
The fact row may contain derived facts; although, in many cases it is more efficient
to calculate derived facts in a view or an online analytical processing (OLAP) cube
rather than in the physical table. For instance, a fact row that contains revenues and
costs may want a fact representing net profit. It is very important that the net profit
value be correctly calculated every time a user accesses it. If the data warehouse
forces all users to access the data through a view, it would be fine to calculate the
net profit in that view. If users are allowed to see the physical table, or if they often
filter on net profit and thus you’d want to index it, precomputing it and storing it
physically is preferable.
Similarly, if some facts need to be simultaneously presented with multiple units
of measure, the same logic applies. If business users access the data through a view
or OLAP database, then the various versions of the facts can efficiently be calculated
at access time.
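For example, a view along the following lines computes net profit consistently at access time; the table and column names are illustrative assumptions, not a prescribed design:

-- Derived fact computed in a view so every user sees the same calculation.
CREATE VIEW sales_fact_v AS
SELECT date_key,
       customer_key,
       product_key,
       revenue_amount,
       cost_amount,
       revenue_amount - cost_amount AS net_profit_amount
FROM   sales_fact;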
Pipeline the Dimension Surrogate Key Lookup
It is important that referential integrity (RI) is maintained between the fact table
and dimension tables; you must never have a fact row that references a dimension
member that doesn’t exist. Therefore, you should not have a null value for any
foreign key in the fact table nor should any fact row violate referential integrity to
any dimension.
The surrogate key pipeline is the final operation before you load data into the
target fact table. All other data cleaning, transformation, and processing should
be complete. The incoming fact data should look just like the target fact table in
the dimensional model, except it still contains the natural keys from the source
system rather than the warehouse’s surrogate keys. The surrogate key pipeline is
the process that exchanges the natural keys for the surrogate keys and handles any
referential integrity errors.
Dimension table processing must complete before the fact data enters the sur-
rogate key pipeline. Any new dimension members or type 2 changes to existing
dimension members must have already been processed, so their keys are available
to the surrogate key pipeline.
First let’s discuss the referential integrity problem. It’s a simple matter to con-
firm that each natural key in the historic fact data is represented in the dimension
tables. This is a manual step. The historic load is paused at this point, so you can
investigate and fix any referential integrity problems before proceeding. The dimen-
sion table is either fixed, or the fact table extract is redesigned to filter out spurious
rows, as appropriate.
Now that you’re confident there will be no referential integrity violations, you can
design the historic surrogate key pipeline, as shown in Figure 19-11 in the previous
chapter. In this scenario, you need to include BETWEEN logic on any dimension
with type 2 changes to locate the dimension row that was in effect when the histori-
cal fact measurement occurred.
There are several approaches for designing the historic load’s surrogate key pipe-
line for best performance; the design depends on the features available in your ETL
tool, the data volumes you’re processing, and your dimensional design. In theory,
you could define a query that joins the fact staging table and each dimension table
on the natural keys, returning the facts and surrogate keys from each dimension
table. If the historic data volumes are not huge, this can actually work quite well,
assuming you staged the fact data in the relational database and indexed the dimen-
sion tables to support this big query. This approach, sketched after the following list, has several benefits:
It leverages the power of the relational database.
It performs the surrogate key lookups on all dimensions in parallel.
It simplifies the problem of picking up the correct dimension key for type 2
dimensions. The join to type 2 dimensions must include a clause specifying
that the transaction date falls between the row effective date and row end date
for that image of the dimension member in the table.
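A minimal sketch of that big query, assuming an illustrative staging table holding natural keys, a type 1 product dimension, and a type 2 customer dimension; the BETWEEN clause picks the customer row in effect on each transaction date:

-- Exchange natural keys for surrogate keys in one relational pass.
-- product_dim is treated as pure type 1; customer_dim carries type 2 history.
SELECT d.date_key,
       p.product_key,
       c.customer_key,
       s.quantity,
       s.sales_amount
FROM   stg_sales_historic s
JOIN   date_dim     d ON d.full_date             = s.transaction_date
JOIN   product_dim  p ON p.product_natural_key   = s.product_code
JOIN   customer_dim c ON c.customer_natural_key  = s.customer_id
                     AND s.transaction_date BETWEEN c.row_effective_date
                                                AND c.row_end_date;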
No one would be eager to try this approach if the historic fact data volumes
were large, in the hundreds of gigabytes to terabyte range. The complex join to
the type 2 dimension tables creates the greatest demands on the system. Many
dimensional designs include a fairly large number of (usually small) dimension
tables that are fully type 1, and a smaller number of dimensions containing type
2 attributes. You could use this relational technique to perform the surrogate key
lookups for all the type 1 dimensions in one pass and then separately handle the
type 2 dimensions. You should ensure the effective date and end date columns
are properly indexed.
An alternative to the database join technique described is to use the ETL tool’s
lookup operator.
When all the fact source keys have been replaced with surrogate keys, the fact
row is ready to load. The keys in the fact table row have been chosen to be proper
foreign keys to the respective dimension tables, and the fact table is guaranteed to
have referential integrity with respect to the dimension tables.
Assign Audit Dimension Key
Fact tables often include an audit key on each fact row. The audit key points to an
audit dimension that describes the characteristics of the load, including relatively
static environment variables and measures of data quality. The audit dimension can
be quite small. An initial design of the audit dimension might have just two envi-
ronment variables (master ETL version number and profit allocation logic number),
and only one quality indicator whose values are Quality Checks Passed and Quality
Problems Encountered. Over time, these variables and diagnostic indicators can be
made more detailed and more sophisticated. The audit dimension key is added to the
fact table either immediately after or immediately before the surrogate key pipeline.
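A minimal PostgreSQL-flavored sketch of this step; the audit_dim columns and the example key value 1089 are illustrative assumptions:

-- Create one audit row describing this load, then stamp its key on every
-- fact row loaded in the batch.
INSERT INTO audit_dim (etl_version, allocation_logic_version, quality_indicator, load_timestamp)
VALUES ('7.3', '2.1', 'Quality Checks Passed', CURRENT_TIMESTAMP)
RETURNING audit_key;          -- suppose the ETL job captures the value 1089

INSERT INTO sales_fact (date_key, product_key, customer_key, sales_amount, audit_key)
SELECT date_key, product_key, customer_key, sales_amount, 1089
FROM   stg_sales_keyed;       -- rows that have already passed the surrogate key pipeline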
Fact Table Loading
The main concern when loading the fact table is load performance. Some database
technologies support fast loading with a specified batch size. Look at the docu-
mentation for the fast-loading technology to see how to set this parameter. You
can experiment to find the ideal batch size for the size of the rows and the server’s
memory configuration. Most people don’t bother to get so precise and simply choose
a number like 10,000 or 100,000 or 1 million.
Aside from using the bulk loader and a reasonable batch size (if appropriate for
the database engine), the best way to improve the performance of the historic load
is to load into a partitioned table, ideally loading multiple partitions in parallel. The
steps for loading into a partitioned table, sketched after the following list, include:
1. Disable foreign key (referential integrity) constraints between the fact table
and each dimension table before loading data.
2. Drop or disable indexes on the fact table.
3. Load the data using fast-loading techniques.
4. Create or enable fact table indexes.
5. If necessary, perform steps to stitch together the table’s partitions.
6. Confirm each dimension table has a unique index on the surrogate
key column.
7. Enable foreign key constraints between the fact table and dimension tables.
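A minimal PostgreSQL-flavored sketch of the pattern; the object names, partition layout, and file paths are illustrative assumptions, and the exact bulk-load and constraint commands vary by platform:

-- 1 and 2: remove constraints and indexes so the bulk load is not slowed down.
ALTER TABLE sales_fact DROP CONSTRAINT IF EXISTS fk_sales_fact_date;
DROP INDEX IF EXISTS ix_sales_fact_date_key;

-- 3: bulk load each yearly partition (COPY is PostgreSQL's fast loader).
COPY sales_fact_2011 FROM '/staging/sales_2011.csv' WITH (FORMAT csv);
COPY sales_fact_2012 FROM '/staging/sales_2012.csv' WITH (FORMAT csv);

-- 4: rebuild indexes after the data is in place.
CREATE INDEX ix_sales_fact_date_key ON sales_fact (date_key);

-- 6 and 7: confirm dimension keys are unique, then re-enable referential integrity.
CREATE UNIQUE INDEX IF NOT EXISTS ux_date_dim_key ON date_dim (date_key);
ALTER TABLE sales_fact
    ADD CONSTRAINT fk_sales_fact_date FOREIGN KEY (date_key) REFERENCES date_dim (date_key);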
Develop Incremental ETL Processing
One of the biggest challenges with the incremental ETL process is identifying new,
changed, and deleted rows. After you have a stream of inserts, modifications, and
deletions, the ETL system can apply transformations following virtually identical
business rules as for the historic data loads.
The historic load for dimensions and facts consisted largely or entirely of inserts.
In incremental processing, you primarily perform inserts, but updates for dimen-
sions and some kinds of fact tables are inevitable. Updates and deletes are expensive
operations in the data warehouse environment, so we’ll describe techniques to
improve the performance of these tasks.
Step 7: Dimension Table Incremental Processing
As you might expect, the incremental ETL system development begins with the
dimension tables. Dimension incremental processing is very similar to the historic
processing previously described.
Dimension Table Extracts
In many cases, there is a customer master file or product master file that can serve
as the single source for a dimension. In other cases, the raw source data is a mixture
of dimensional and fact data.
Often it’s easiest to pull the current snapshots of the dimension tables in their
entirety and let the transformation step determine what has changed and how
to handle it. If the dimension tables are large, you may need to use the fact table
technique described in the section “Step 8: Fact Table Incremental Processing” for
identifying the changed record set. It can take a long time to look up each entry in
a large dimension table, even if it hasn’t changed from the existing entry.
If possible, construct the extract to pull only rows that have changed. This is
particularly easy and valuable if the source system maintains an indicator of the
type of change.
Identify New and Changed Dimension Rows
The DW/BI team may not be successful in pushing the responsibility for identify-
ing new, updated, and deleted rows to the source system owners. In this case, the
ETL process needs to perform an expensive comparison operation to identify new
and changed rows.
When the incoming data is clean, it’s easy to find new dimension rows. The raw
data has an operational natural key, which must be matched to the same column in
the current dimension row. Remember, the natural key in the dimension table is an
ordinary dimensional attribute and is not the dimension’s surrogate primary key.
You can find new dimension members by performing a lookup from the incoming
stream to the master dimension, comparing on the natural key. Any rows that fail
the lookup are new dimension members and should be inserted into the dimen-
sion table.
If the dimension contains any type 2 attributes, set the row effective date column
to the date the dimension member appeared in the system; this is usually yesterday
if you are processing nightly. Set the row end date column to the default value for
current rows. This should be the largest date, very far in the future, supported by
the system. You should avoid using a null value in this second date column because
relational databases may generate an error or return the special value Unknown if
you attempt to compare a specific value to a NULL.
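A minimal sketch combining the new-member lookup and these default dates, with illustrative table and column names; the customer_key surrogate is assumed to be populated by an identity or sequence:

-- Rows in the incoming extract whose natural key is not yet in the dimension
-- are new members and are simply inserted with default row dates.
INSERT INTO customer_dim (customer_natural_key, customer_name, city,
                          row_effective_date, row_end_date, is_row_current)
SELECT s.customer_id,
       s.customer_name,
       s.city,
       CURRENT_DATE - 1,            -- appeared yesterday in a nightly load
       DATE '9999-12-31',           -- far-future end date for current rows
       TRUE
FROM   stg_customer s
LEFT JOIN customer_dim d ON d.customer_natural_key = s.customer_id
WHERE  d.customer_natural_key IS NULL;   -- the lookup failed: a new member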
The next step is to determine if the incoming dimension row has changed. The
simplest technique is to compare column by column between the incoming data and
the current corresponding member stored in the master dimension table.
If the dimension is large, with more than a million rows, the simple technique
of column-wise comparison may be too slow, especially if there are many columns
in the dimension table. A popular alternative method is to use a hash or checksum
function to speed the comparison process. You can add two new housekeeping
columns to the dimension table: hash type1 and hash type2. You should place
a hash of a concatenation of the type 1 attributes in the hash type1 column and
similarly for hash type2. Hashing algorithms convert a very long string into a
much shorter string that is close to unique. The hashes are computed and stored
in the dimension table. Then compute hashes on the incoming rowset in exactly
the same way, and compare them to the stored values. The comparison on a single,
relatively short string column is far more efficient than the pair-wise comparison
on dozens of separate columns. Alternatively, the relational database engine may
have syntax such as EXCEPT that enables a high-performance query to find the
changed rows.
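Two minimal sketches of the comparison, one using a hash and one using EXCEPT; md5 is PostgreSQL-flavored, the column names are illustrative, and hash_type2 stands for the stored housekeeping hash of the type 2 attributes described above:

-- Hash-based comparison: one short string compare instead of dozens of columns.
SELECT s.customer_id
FROM   stg_customer s
JOIN   customer_dim d
       ON  d.customer_natural_key = s.customer_id
       AND d.is_row_current = TRUE
WHERE  md5(COALESCE(s.customer_name, '')  || '|' ||
           COALESCE(s.marital_status, '') || '|' ||
           COALESCE(s.city, '')) <> d.hash_type2;

-- EXCEPT-based alternative: any row returned differs from the current dimension image.
SELECT customer_id, customer_name, marital_status, city FROM stg_customer
EXCEPT
SELECT customer_natural_key, customer_name, marital_status, city
FROM   customer_dim
WHERE  is_row_current = TRUE;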
As a general rule, you do not delete dimension rows that have been deleted in
the source system because these dimension members probably still have fact table
data associated with them in the data warehouse.
Process Changes to Dimension Attributes
The ETL application contains business rules to determine how to handle an attri-
bute value that has changed from the value already stored in the data warehouse.
If the revised description is determined to be a legitimate and reliable update to
previous information, then the techniques of slowly changing dimensions must
be used.
The first step in preparing a dimension row is to decide if you already have
that row. If all the incoming dimensional information matches the correspond-
ing row in the dimension table, no further action is required. If the dimensional
information has changed, then you can apply changes to the dimension, such as
type 1 or type 2.
NOTE You may recall from Chapter 5 that there are three primary methods for
tracking changes in attribute values, as well as a set of advanced hybrid techniques.
Type 3 requires a change in the structure of the dimension table, creating a new
set of columns to hold the “previous” versus “current” versions of the attributes.
This type of structural change is seldom automated in the ETL system; it’s more
likely to be handled as a one-time change in the data model.
The lookup and key assignment logic for handling a changed dimension record
during the extract process is shown in Figure 20-5. In this case, the logic flow does
not assume the incoming data stream is limited only to new or changed rows.
Figure 20-5 depicts this logic as a decision flow applied to each incoming row. If the row
is new, it is added to the dimension and processing of that row ends. If the row matches an
existing member but has no changes, processing ends. If the row has type 1 changes, the
type 1 attributes are updated, usually on all existing rows for that entity. If the row has
type 2 changes, the existing current row is expired by setting its row end date and
current-row indicator, and a new dimension row is added for the entity with a new surrogate
key, a row start date of yesterday, and the current-row indicator set to true.
Figure 20-5: Logic flow for handling dimension updates.
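A minimal SQL sketch of the type 2 branch of this flow for one changed member, with illustrative names and literal values; production systems usually express this in the ETL tool or a MERGE statement:

-- Type 2 change detected for natural key '12345' in a nightly load:
-- (1) expire the existing current row; its end date is the day before the
--     new row's effective date ...
UPDATE customer_dim
SET    row_end_date   = CURRENT_DATE - 2,
       is_row_current = FALSE
WHERE  customer_natural_key = '12345'
  AND  is_row_current = TRUE;

-- (2) ... then add a new row, with a fresh surrogate key, effective yesterday.
INSERT INTO customer_dim (customer_natural_key, customer_name, city,
                          row_effective_date, row_end_date, is_row_current)
VALUES ('12345', 'Jane Doe', 'Portland',
        CURRENT_DATE - 1, DATE '9999-12-31', TRUE);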
Step 8: Fact Table Incremental Processing
Most data warehouse databases are too large to entirely replace the fact tables in a
single load window. Instead, new and updated fact rows are incrementally processed.
NOTE It is much more efficient to incrementally load only the records that
have been added or updated since the previous load. This is especially true in a
journal-style system where history is never changed and only adjustments in the
current period are allowed.
The ETL process for fact table incremental processing differs from the historic
load. The historic ETL process doesn’t need to be fully automated; you can stop the
process to examine the data and prepare for the next step. The incremental process-
ing, by contrast, must be fully automated.
Fact Table Extract and Data Quality Checkpoint
As soon as the new and changed fact rows are extracted from the source system, a
copy of the untransformed data should be written to the staging area. At the same
time, measures of data quality on the raw extracted data are computed. The staged
data serves three purposes:
Archive for auditability
Provide a starting point after data quality verification
Provide a starting point for restarting the process
Fact Table Transformations and Surrogate Key Pipeline
The surrogate key pipeline for the incremental fact data is similar to that for the
historic data. The key difference is that the error handling for referential integrity
violations must be automated. There are several methods for handling referential
integrity violations:
Halt the load. This is seldom a useful solution; although, it’s often the default
in many ETL tools.
Throw away error rows. There are situations in which a missing dimen-
sion value is a signal that the data is irrelevant to the business requirements
underlying the data warehouse.
Write error rows to a file or table for later analysis. Design a mechanism
for moving corrected rows into a suspense file. This approach is not a good
choice for a financial system, where it is vital that all rows be loaded.
Fix error rows by creating a dummy dimension row and returning its sur-
rogate key to the pipeline. The most attractive error handling for referential
integrity violations in the incremental surrogate key pipeline is to create a
dummy dimension row on-the-fly for the unknown natural key (see the sketch
after this list). The natural key is the only piece of information that you may
have about the dimension member; all the other attributes must be set to
default values. This dummy dimension row will be corrected with type 1
updates when the detailed information about that dimension member becomes available.
Fix error rows by mapping to a single unknown member in each dimen-
sion. This approach is not recommended. The problem is that all error rows
are mapped to the same dimension member, for any unknown natural key
values in the fact table extract.
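A minimal PostgreSQL-flavored sketch of the recommended dummy-row technique, with illustrative names; in practice this runs inside the ETL tool's lookup error path:

-- A fact row arrived with customer_id 'C-998877' that is not yet in the dimension.
-- Insert a placeholder member carrying only the natural key; every other attribute
-- gets a default value. A later type 1 update fills in the real attributes.
INSERT INTO customer_dim (customer_natural_key, customer_name, city,
                          row_effective_date, row_end_date, is_row_current)
VALUES ('C-998877', 'Unknown (awaiting source detail)', 'Unknown',
        CURRENT_DATE - 1, DATE '9999-12-31', TRUE)
RETURNING customer_key;   -- the new surrogate key is handed back to the pipeline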
For most systems, you perform the surrogate key lookups against a query, view,
or physical table that subsets the dimension table. The dimension table rows are
filtered, so the lookup works against only the current version of each dimension
member.
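For example, a view along these lines restricts the lookup to current rows; the names are illustrative:

-- The incremental surrogate key lookup joins to this view rather than to the
-- full dimension, so only the current image of each member is considered.
CREATE VIEW customer_dim_current AS
SELECT customer_key,
       customer_natural_key
FROM   customer_dim
WHERE  is_row_current = TRUE;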
Late Arriving Facts and the Surrogate Key Pipeline
In most data warehouses, the incremental load process begins soon after midnight
and processes all the transactions that occurred the previous day. However, there
are scenarios in which some facts arrive late. This is most likely to happen when
the data sources are distributed across multiple machines or even worldwide, and
connectivity or latency problems prevent timely data collection.
If all the dimensions are managed completely as type 1 overwrites, late arriving
facts present no special challenges. But most systems have a mixture of type 1 and
type 2 attributes. The late arriving facts must be associated with the version of the
dimension member that was in effect when the fact occurred. That requires a lookup
in the dimension table using the row begin and end effective dates.
Incremental Fact Table Load
In the historic fact load, it’s important that data loads use fast-load techniques. In
most data warehouses, these fast-load techniques may not be available for the incre-
mental load. The fast-load technologies often require stringent conditions on the
target table (for example, empty or unindexed). For the incremental load, it’s usually
faster to use non-fast-load techniques than to fully populate or index the table. For
small to medium systems, insert performance is usually adequate.
If your fact table is very large, you should already have partitioned the fact table
for manageability reasons. If incremental data is always loading into an empty
partition, you should use fast-load techniques. With daily loads, you would create
365 new fact table partitions each year. This is probably too many partitions for a
fact table with long history, so consider implementing a process to consolidate daily
partitions into weekly or monthly partitions.
Load Snapshot Fact Tables
The largest fact tables are usually transactional. Transaction fact tables are typically
loaded only through inserts. Periodic snapshot fact tables are usually loaded at
month end. Data for the current month is sometimes updated each day for current-
month-to-date. In this scenario, monthly partitioning of the fact table makes it easy
to reload the current month with excellent performance.
Accumulating snapshot fact tables monitor relatively short-lived processes, such
as filling an order. The accumulating snapshot fact table is characterized by many
updates for each fact row over the life of the process. This table is expensive to
maintain; although accumulating snapshots are almost always much smaller than
the other two types of fact tables.
Speed Up the Load Cycle
Processing only data that has been changed is one way to speed up the ETL cycle.
This section lists several additional techniques.
More Frequent Loading
Although it is a huge leap to move from a monthly or weekly process to a nightly one,
it is an effective way to shorten the load window. Every nightly process involves 1/30
the data volume of a monthly one. Most data warehouses are on a nightly load cycle.
If nightly processing is too expensive, consider performing some preprocessing
on the data throughout the day. During the day, data is moved into a staging database
or operational data store where data cleansing tasks are performed. After midnight,
you can consolidate multiple changes to dimension members, perform final data
quality checks, assign surrogate keys, and move the data into the data warehouse.
Parallel Processing
Another way to shorten the load time is to parallelize the ETL process. This can
happen in two ways: multiple steps running in parallel and a single step running
in parallel.
Multiple load steps. The ETL job stream is divided into several independent
jobs submitted together. You need to think carefully about what goes into
each job; the primary goal is to create independent jobs.
Parallel execution. The database itself can also identify certain tasks it can
execute in parallel. For example, creating an index can typically be parallel-
ized across as many processors as are available on the machine.
NOTE There are good ways and bad ways to break processing into parallel steps.
One simple way to parallelize is to extract all source data together, then load and
transform the dimensions, and then simultaneously check referential integrity
between the fact table and all dimensions. Unfortunately, this approach is likely
to be no faster—and possibly much slower—than the even simpler sequential
approach because each step launches parallel processes that compete for the same
system resources such as network bandwidth, I/O, and memory. To structure
parallel jobs well, you need to account not just for logically sequential steps but
also for system resources.
Parallel Structures
You can set up a three-way mirror or clustered configuration on two servers to
maintain a continuous load data warehouse, with one server managing the loads
and the second handling the queries. The maintenance window is reduced to a
few minutes daily to swap the disks attached to each server. This is a great way to
provide high system availability.
Depending on the requirements and available budget, there are several similar
techniques you can implement for tables, partitions, and databases. For example, you
can load into an offline partition or table, and swap it into active duty with minimum
downtime. Other systems have two versions of the data warehouse database, one for
loading and one for querying. These are less effective, but less expensive, versions
of the functionality provided by clustered servers.
Step 9: Aggregate Table and OLAP Loads
An aggregate table is logically easy to build. It’s simply the results of a really big
aggregate query stored as a table. The problem with building aggregate tables from
a query on the fact table, of course, occurs when the fact table is just too big to
process within the load window.
If the aggregate table includes an aggregation along the date dimension, perhaps
to monthly grain, the aggregate maintenance process is more complex. The cur-
rent month of data must be updated, or dropped and re-created, to incorporate the
current day’s data.
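A minimal sketch of the drop-and-re-create approach for the current month, with illustrative names and literal month values:

-- Rebuild only the current month of the monthly aggregate after the nightly load.
DELETE FROM sales_fact_monthly_agg
WHERE  month_key = 201312;

INSERT INTO sales_fact_monthly_agg (month_key, product_key, sales_amount, quantity)
SELECT 201312,
       f.product_key,
       SUM(f.sales_amount),
       SUM(f.quantity)
FROM   sales_fact f
JOIN   date_dim d ON d.date_key = f.date_key
WHERE  d.calendar_year = 2013 AND d.calendar_month = 12
GROUP BY f.product_key;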
A similar problem occurs if the aggregate table is defined on a dimension attribute
that is overwritten as a type 1. Any type 1 change in a dimension attribute affects
all fact table aggregates and OLAP cubes that are defined on that attribute. An ETL
process must “back out” the facts from the old aggregate level and move them to
the new one.
It is extremely important that the aggregate management system keep aggrega-
tions in sync with the underlying fact data. You do not want to create a system that
returns a different result set if the query is directed to the underlying detail facts
or to a precomputed aggregation.
Step 10: ETL System Operation and Automation
The ideal ETL operation runs the regular load processes in a lights-out manner,
without human intervention. Although this is a difficult outcome to attain, it is
possible to get close.
Schedule Jobs
Scheduling jobs is usually straightforward. The ETL tool should contain function-
ality to schedule a job to kick off at a certain time. Most ETL tools also contain
functionality to conditionally execute a second task if the first task successfully
completed. It’s common to set up an ETL job stream to launch at a certain time, and
then query a database or filesystem to see if an event has occurred.
You can also write a script to perform this kind of job control. Every ETL tool
has a way to invoke a job from the operating system command line. Many orga-
nizations are very comfortable using scripting languages, such as Perl, to manage
their job schedules.
Automatically Handle Predictable Exceptions and Errors
Although it’s easy enough to launch jobs, it’s a harder task to make sure they run to
completion, gracefully handling data errors and exceptions. Comprehensive error
handling is something that needs to be built into the ETL jobs from the outset.
Gracefully Handle Unpredictable Errors
Some errors are predictable, such as receiving an early arriving fact or a NULL value
in a column that’s supposed to be populated. For these errors, you can generally
design your ETL system to fix the data and continue processing. Other errors are
completely unforeseen and range from receiving data that’s garbled to experiencing
a power outage during processing.
We look for ETL tool features and system design practices to help recover from
the unexpected. We generally recommend outfitting fact tables with a single column
surrogate key that is assigned sequentially to new records that are being loaded. If
a large load job unexpectedly halts, the fact table surrogate key allows the load to
resume from a reliable point, or back out the load by constraining on a contiguous
range of the surrogate keys.
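A minimal sketch of backing out a halted load by its contiguous key range; the key boundaries would come from the ETL job's logging, and the names are illustrative:

-- The halted load had been assigned fact surrogate keys 500000001 through 500087342.
-- Backing it out is a simple constrained delete; the load then resumes from the
-- next available key.
DELETE FROM sales_fact
WHERE  fact_key BETWEEN 500000001 AND 500087342;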
Real-Time Implications
Real-time processing is an increasingly common requirement in data warehousing.
There is a strong possibility that your DW/BI system will have a real-time require-
ment. Some business users expect the data warehouse to be continuously updated
throughout the day and grow impatient with stale data. Building a real-time
DW/BI system requires gathering a very precise understanding of the true business
requirements for real-time data and identifying an appropriate ETL architecture,
incorporating a variety of technologies married with a solid platform.
Real-Time Triage
Asking business users if they want “real-time” delivery of data is a frustrating
exercise for the DW/BI team. Faced with no constraints, most users will say, “That
sounds good; go for it!” This kind of response is almost worthless.
To avoid this situation, we recommend dividing the real-time design challenge
into three categories, called instantaneous, intra-day, and daily. We use these terms
when we talk to business users about their needs and then design our data delivery
pipelines differently for each option. Figure 20-6 summarizes the issues that arise
as data is delivered faster.
Figure 20-6 contrasts the three delivery speeds. Daily delivery uses batch processing ETL:
the ETL system waits for a file-ready signal, loads a conventional fact table time partition,
and works from a reconciled, complete transaction set; column screens, structure screens,
and business rule screens can all be applied, and the results are final. Intra-day delivery
uses micro-batch ETL: the system probes with queries or subscribes to a message bus, loads
a daily hot fact table time partition, and works from provisional individual transactions;
column screens and structure screens apply, and results are updated and corrected nightly.
Instantaneous delivery uses streaming EII/ETL: the user presentation is driven from the
source application, separate from the fact table, and works from provisional transaction
fragments; only column screens apply, and results are updated and possibly repudiated nightly.
Figure 20-6: Data quality trade-offs with low latency delivery.
Instantaneous means the data visible on the screen represents the true state of
the source transaction system at every instant. When the source system status
changes, the screen instantly and synchronously responds. An instantaneous real-
time system is usually implemented as an enterprise information integration (EII)
solution, where the source system itself is responsible for supporting the update of
remote users’ screens and servicing query requests. Obviously, such a system must
limit the complexity of the query requests because all the processing is done on the
source system. EII solutions typically involve no caching of data in the ETL pipeline
because EII solutions by definition have no delays between the source systems and
the users’ screens. Some situations are plausible candidates for an instantaneous
real-time solution. Inventory status tracking may be a good example, where the deci-
sion maker has the right to commit available inventory to a customer in real time.
Intra-day means the data visible on the screen is updated many times per day but
is not guaranteed to be the absolute current truth. Most of us are familiar with stock
market quote data that is current to within 15 minutes but is not instantaneous.
The technology for delivering frequent real-time data (as well as the slower daily
data) is distinctly different from instantaneous real-time delivery. Frequently deliv-
ered data is usually processed as micro-batches in a conventional ETL architecture.
This means the data undergoes the full gamut of change data capture, extract, stag-
ing to file storage in the ETL back room of the data warehouse, cleaning and error
checking, conforming to enterprise data standards, assigning of surrogate keys,
and possibly a host of other transformations to make the data ready to load into the
presentation server. Almost all these steps must be omitted or drastically reduced
in an EII solution. The big difference between intra-day and daily delivered data is
in the first two steps: change data capture and extract. To capture data many times
per day from the source system, the data warehouse usually must tap into a high
bandwidth communications channel, such as message queue traffic between legacy
applications, an accumulating transaction log file, or low level database triggers
coming from the transaction system every time something happens.
Daily means the data visible on the screen is valid as of a batch file download or
reconciliation from the source system at the end of the previous working day. There
is a lot to recommend daily data. Quite often processes are run on the source system
at the end of the working day that correct the raw data. When this reconciliation
becomes available, that signals the ETL system to perform a reliable and stable
download of the data. If you have this situation, you should explain to the busi-
ness users what compromises they will experience if they demand instantaneous
or intra-day updated data. Daily updated data usually involves reading a batch file
prepared by the source system or performing an extract query when a source system
readiness flag is set. This, of course, is the simplest extract scenario because you
wait for the source system to be ready and available.
Real-Time Architecture Trade-Offs
Responding to real-time requirements means you need to change the DW/BI archi-
tecture to get data to the business users’ screens faster. The architectural choices
involve trade-offs that affect data quality and administration.
You can assume the overall goals for ETL system owners are not changed or com-
promised by moving to real-time delivery. You can remain just as committed to data
quality, integration, security, compliance, backup, recovery, and archiving as you
were before starting to design a real-time system. If you agree with this statement,
then read the following very carefully! The following sections discuss the typical
trade-offs that occur as you implement a more real-time architecture:
Replace Batch Files
Consider replacing a batch file extract with reading from a message queue or trans-
action log file. A batch file delivered from the source system may represent a clean
and consistent view of the source data. The batch file may contain only those records
resulting from completed transactions. Foreign keys in the batch files are prob-
ably resolved, such as when the file contains an order from a new customer whose
complete identity may be delivered with the batch file. Message queue and log file
data, on the other hand, is raw instantaneous data that may not be subject to any
corrective process or business rule enforcement in the source system. In the worst
case, this raw data may 1) be incorrect or incomplete because additional transac-
tions may arrive later; 2) contain unresolved foreign keys that the DW/BI system
has not yet processed; and 3) require a parallel batch-oriented ETL data flow to cor-
rect or even replace the hot real-time data each 24 hours. And if the source system
subsequently applies complex business rules to the input transactions first seen in
the message queues or the log files, then you really don’t want to recapitulate these
business rules in the ETL system!
Limit Data Quality Screens
Consider restricting data quality screening only to column screens and simple
decode lookups. As the time to process data moving through the ETL pipeline is
reduced, it may be necessary to eliminate more costly data quality screening, espe-
cially structure screens and business rule screens. Remember that column screens
involve single field tests and/or simple lookups to replace or expand known values.
Even in the most aggressive real-time applications, most column screens should
survive. But structure screens and business rule screens by definition require mul-
tiple fields, multiple records, and possibly multiple tables. You may not have time to
pass an address block of fields to an address analyzer. You may not check referential
integrity between tables. You may not be able to perform a remote credit check
through a web service. All this may require informing the users of the provisional
and potentially unreliable state of the raw real-time data and may require that you
implement a parallel, batch-oriented ETL pipeline that overwrites the real-time data
periodically with properly checked data.
Post Facts with Dimensions
You should allow early arriving facts to be posted with old copies of dimensions. In
the real-time world, it is common to receive transaction events before the context
(such as the identity of the customer) of those transactions is updated. In other
words, the facts arrive before the dimensions. If the real-time system cannot wait
for the dimensions to be resolved, then old copies of the dimensions must be used if
they are available, or generic empty versions of the dimensions must be used other-
wise. If and when revised versions of the dimensions are received, the data warehouse
may decide to post those into the hot partition or delay updating the dimension until
a batch process takes over, possibly at the end of the day. In any case, the users need
to understand there may be an ephemeral window of time where the dimensions
don’t exactly describe the facts.
Eliminate Data Staging
Some real-time architectures, especially EII systems, stream data directly from the
production source system to the users’ screens without writing the data to perma-
nent storage in the ETL pipeline. If this kind of system is part of the DW/BI team’s
responsibility, then the team should have a serious talk with senior management
about whether backup, recovery, archiving, and compliance responsibilities can be
met, or whether those responsibilities are now the sole concern of the production
source system.
Real-Time Partitions in the Presentation Server
To support real-time requirements, the data warehouse must seamlessly extend its
existing historical time series right up to the current instant. If the customer has
placed an order in the last hour, you need to see this order in the context of the
entire customer relationship. Furthermore, you need to track the hourly status of
this most current order as it changes during the day. Even though the gap between
the production transaction processing systems and the DW/BI system has shrunk
in most cases to 24 hours, the insatiable needs of your business users require the
data warehouse to fill this gap with real-time data.
One design solution for responding to this crunch is building a real-time parti-
tion as an extension of the conventional, static data warehouse. To achieve real-time
reporting, a special partition is built that is physically and administratively separated
from the conventional data warehouse tables. Ideally, the real-time partition is a true
database partition where the fact table in question is partitioned by activity date.
In either case, the real-time partition ideally should meet the following tough
set of requirements:
Contain all the activity that has occurred since the last update of the static
data warehouse.
Link as seamlessly as possible to the grain and content of the static data ware-
house fact tables, ideally as a true physical partition of the fact table.
Be indexed so lightly that incoming data can continuously be “dribbled in.”
Ideally, the real-time partition is completely unindexed; however, this may
not be possible in certain RDBMSs where indexes have been built that are not
logically aligned with the partitioning scheme.
Support highly responsive queries even in the absence of indexes by pinning
the real-time partition in memory.
The real-time partition can be used effectively with both transaction and periodic
snapshot fact tables. We have not found this approach needed with accumulating
snapshot fact tables.
Transaction Real-Time Partition
If the static data warehouse fact table has a transaction grain, it contains exactly
one row for each individual transaction in the source system from the beginning
of “recorded history.” The real-time partition has exactly the same dimensional
structure as its underlying static fact table. It contains only the transactions that
have occurred since midnight when you last loaded the regular fact tables. The real-
time partition may be completely unindexed, both because you need to maintain a
continuously open window for loading and because there is no time series because
only today’s data is kept in this table.
In a relatively large retail environment experiencing 10 million transactions
per day, the static fact table would be pretty big. Assuming each transaction grain
row is 40 bytes wide (seven dimensions plus three facts, all packed into 4-byte col-
umns), you accumulate 400 MB of data each day. Over a year, this would amount to
approximately 150 GB of raw data. Such a fact table would be heavily indexed and
supported by aggregates. But the daily real-time slice of 400 MB should be pinned
in memory. The real-time partition can remain biased toward very fast-loading
performance but at the same time provide speedy query performance.
Periodic Snapshot Real-Time Partition
If the static data warehouse fact table has a periodic grain (say, monthly), then the
real-time partition can be viewed as the current hot rolling month. Suppose you
are a big retail bank with 15 million accounts. The static fact table has the grain of
account by month. A 36-month time series would result in 540 million fact table
rows. Again, this table would be extensively indexed and supported by aggregates
to provide good query performance. The real-time partition, on the other hand, is
just an image of the current developing month, updated continuously as the month
progresses. Semi-additive balances and fully additive facts are adjusted as frequently
as they are reported. In a retail bank, the supertype fact table spanning all account
types is likely to be quite narrow, with perhaps four dimensions and four facts,
resulting in a real-time partition of 480 MB. The real-time partition again can be
pinned in memory.
On the last day of the month, the periodic real-time partition can, with luck,
just be merged onto the less volatile fact table as the most current month, and the
process can start again with an empty real-time partition.
Summary
The previous chapter introduced 34 subsystems that are possible within a compre-
hensive ETL implementation. In this chapter, we provided detailed practical advice
for actually building and deploying the ETL system. Perhaps the most interesting
perspective is to separate the initial historical loads from the ongoing incremental
loads. These processes are quite different.
In general we recommend using a commercial ETL tool as opposed to maintaining
a library of scripts, even though the ETL tools can be expensive and have a signifi-
cant learning curve. ETL systems, more than any other part of the DW/BI edifice,
are legacy systems that need to be maintainable and scalable over long periods of
time and over changes of personnel.
We concluded this chapter with some design perspectives for real-time (low
latency) delivery of data. Not only are the real-time architectures different from
conventional batch processing, but data quality is compromised as the latency is
progressively lowered. Business users need to be thoughtful participants in this
design trade-off.
Big Data Analytics
In this chapter, we introduce big data in all its glory and show how it expands
the mission of the DW/BI system. We conclude with a comprehensive list of big
data best practices.
Chapter 21 discusses the following concepts:
Comparison of two architectural approaches for tackling big data analytics
Management, architecture, modeling, and governance best practices for deal-
ing with big data
Big Data Overview
What is big data? Its bigness is actually not the most interesting characteristic. Big data
is structured, semistructured, unstructured, and raw data in many different formats, in
some cases looking totally different than the clean scalar numbers and text you have
stored in your data warehouses for the last 30 years. Much big data cannot be analyzed
with anything that looks like SQL. But most important, big data is a paradigm shift
in how you think about data assets, where you collect them, how you analyze them,
and how you monetize the insights from the analysis.
The big data movement has gathered momentum as a large number of use cases
have been recognized that fall into the category of big data analytics. These use
cases include:
Search ranking
Ad tracking
Location and proximity tracking
Causal factor discovery
Social CRM
Document similarity testing
Genomics analysis
Cohort group discovery
In-flight aircraft status
Smart utility meters
Building sensors
Satellite image comparison
CAT scan comparison
Financial account fraud detection and intervention
Computer system hacking detection and intervention
Online game gesture tracking
Big science data analysis
Generic name-value pair analysis
Loan risk analysis and insurance policy underwriting
Customer churn analysis
Given the breadth of potential use cases, this chapter focuses on the architectural
approaches for tackling big data, along with our recommended best practices, but
not specific dimensional designs for each use case.
Conventional RDBMSs and SQL simply cannot store or analyze this wide range
of use cases. To fully address big data, a candidate system would have to be capable
of the following:
1. Scaling to easily support petabytes (thousands of terabytes) of data.
2. Being distributed across thousands of processors, potentially geographically
dispersed and potentially heterogeneous.
3. Storing the data in the original captured formats while supporting query and
analysis applications without converting or moving the data.
4. Subsecond response time for highly constrained standard SQL queries.
5. Embedding arbitrarily complex user-defined functions (UDFs) within process-
ing requests.
6. Implementing UDFs in a wide variety of industry-standard procedural
languages.
7. Assembling extensive libraries of reusable UDFs crossing most or all the use
cases.
8. Executing UDFs as relation scans over petabyte-sized data sets in a few
minutes.
9. Supporting a wide variety of data types growing to include images, waveforms,
arbitrarily hierarchical data structures, and collections of name-value pairs.
10. Loading data to be ready for analysis, at very high rates, at least gigabytes per
second.
11. Integrating data from multiple sources during the load process at very high
rates (GB/sec).
12. Loading data into the database before declaring or discovering its structure.
13. Executing certain streaming analytic queries in real time on incoming load
data.
14. Updating data in place at full load speeds.
15. Joining a billion-row dimension table to a trillion-row fact table without
preclustering the dimension table with the fact table.
16. Scheduling and executing complex multi-hundred node workflows.
17. Being configured without being subject to a single point of failure.
18. Having failover and process continuation when processing nodes fail.
19. Supporting extreme, mixed workloads including thousands of geographically
dispersed online users and programs executing a variety of requests ranging
from ad hoc queries to strategic analysis, while loading data in batch and
streaming fashion.
In response to these challenges, two architectures have emerged: extended
RDBMSs and MapReduce/Hadoop.
Extended RDBMS Architecture
Existing RDBMS vendors are extending the classic relational data types to include
some of the new data types required by big data, as shown by the arrows in the
Figure 21-1.
Figure 21-1 shows the familiar DW/BI architecture: source systems feeding the back room
ETL system (extract, clean, conform, deliver, along with ETL management services and ETL
data stores), which feeds the front room presentation server and BI applications (queries,
standard reports, analytic applications, dashboards, operational BI, data mining and models),
all resting on metadata, infrastructure, and security. The big data extensions, shown by the
arrows, broaden the source systems beyond operational and ODS systems, ERP systems, user
desktops, MDM systems, external suppliers, RDBMSs, flat files and XML documents, and message
queues to include proprietary formats, complex structures, unstructured text, images and
video, and name-value pairs; extend the presentation server’s atomic dimensional models and
conformed dimensions with specially crafted UDFs embedded in the DBMS inner loop; and extend
the BI applications to general purpose programs.
Figure 21-1: Relational DBMS architecture showing big data extensions.
Existing RDBMSs must open their doors to loading and processing a much
broader range of data types including complex structures such as vectors, matrices,
and custom hyperstructured data. At the other end of the spectrum, the RDBMSs
need to load and process unstructured and semistructured text, as well as images,
video, and collections of name-value pairs, sometimes called data bags.
But it is not sufficient for RDBMSs to merely host the new data types as blobs to be
delivered at some later time to a BI application that can interpret the data, although
this alternative has always been possible. To really own big data, RDBMSs must
allow the new data types to be processed within the DBMS inner loop by means of
specially crafted user-defined functions (UDFs) written by business user analysts.
Finally, a valuable use case is to process the data twice through the RDBMS,
where in the first pass the RDBMS is used as a fact extractor on the original data,
and then in the second pass, these results are automatically fed back to the RDBMS
input as conventional relational rows, columns, and data types.
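A purely illustrative sketch of what this inner-loop processing might look like; sentiment_score is a hypothetical user-defined function and the table names are assumptions, not features of any particular engine:

-- Hypothetical UDF applied inside the query: the first pass extracts numeric
-- facts from unstructured text, and the results land as ordinary rows and columns.
INSERT INTO tweet_sentiment_fact (tweet_date_key, brand_key, sentiment_score)
SELECT t.tweet_date_key,
       t.brand_key,
       sentiment_score(t.tweet_text)    -- user-defined function over raw text
FROM   raw_tweets t;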
MapReduce/Hadoop Architecture
The alternative architecture, MapReduce/Hadoop, is an open source top-level Apache
project with many components. MapReduce is a processing framework originally
developed by Google in the early 2000s for performing web page searches across
thousands of physically separated machines. The MapReduce approach is extremely
general. Complete MapReduce systems can be implemented in a variety of languages;
the most significant implementation is in Java. MapReduce is actually a UDF execu-
tion framework, where the “F” can be extraordinarily complex. The most significant
implementation of MapReduce is Apache Hadoop, known simply as Hadoop. The
Hadoop project has thousands of contributors and a whole industry of diverse appli-
cations. Hadoop runs natively on its own Hadoop distributed file system (HDFS) and
can also read and write to Amazon S3 and others. Conventional database vendors
are also implementing interfaces to allow Hadoop jobs to be run over massively
distributed instances of their databases.
NOTE A full discussion of the MapReduce/Hadoop architecture is beyond the
scope of this book. Interested readers are invited to study the in depth big data
resources available on our website at www.kimballgroup.com.
Comparison of Big Data Architectures
The two big data architecture approaches have separate long-term advantages and
are likely to coexist far into the future. At the time of this writing, the characteristics
of the two architectures are summarized in Figure 21-2.
The extended relational DBMS is mostly proprietary and expensive, requires the data to be
structured, is great for speedy indexed lookups, and offers deep support for relational
semantics and transaction processing, but only indirect support for complex data structures
and for iteration and complex branching. MapReduce/Hadoop is open source and less expensive,
does not require the data to be structured, is great for massive full data scans, offers only
indirect support for relational semantics (for example, Hive) and little or no support for
transaction processing, but deep support for complex data structures and for iteration and
complex branching.
Figure 21-2: Comparison of relational DBMS and MapReduce/Hadoop architectures.
Recommended Best Practices for Big Data
Although the big data marketplace is anything but mature, the industry now has a
decade of accumulated experience. In that time, a number of best practices specific
to big data have emerged. This section attempts to capture these best practices,
steering a middle ground between high-level motherhood admonitions versus down-
in-the-weeds technical minutiae specific to a single tool.
Having said that, one should recognize that the industry has a well-tested set of
best practices developed over the last 30 years for relationally-based data warehouses
that surely are relevant to big data. We list them briefly. They are to:
Drive the choice of data sources feeding the data warehouse from business
needs.
Focus incessantly on user interface simplicity and performance.
Think dimensionally: Divide the world into dimensions and facts.
Integrate separate data sources with conformed dimensions.
Track time variance with slowly changing dimensions (SCDs).
Anchor all dimensions with durable surrogate keys.
In the remainder of this section, we divide big data best practices into four cat-
egories: management, architecture, data modeling, and governance.
Management Best Practices for Big Data
The following best practices apply to the overall management of a big data
environment.
Structure Big Data Environments Around Analytics
Consider structuring big data environments around analytics and not ad hoc que-
rying or standard reporting. Every step in the data pathway from original source
to analyst’s screen must support complex analytic routines implemented as user-
defined functions (UDFs) or via a metadata-driven development environment that
can be programmed for each type of analysis. This includes loaders, cleansers,
integrators, user interfaces, and finally BI tools, as further discussed in the archi-
tectural best practices section.
Delay Building Legacy Environments
It’s not a good idea to attempt building a legacy big data environment at this time.
The big data environment is changing too rapidly to consider building a long-lasting
legacy foundation. Rather, plan for disruptive changes coming from every direc-
tion: new data types, competitive challenges, programming approaches, hardware,
networking technology, and services offered by literally hundreds of new big data
providers. For the foreseeable future, maintain a balance among several imple-
mentation approaches including Hadoop, traditional grid computing, pushdown
optimization in an RDBMS, on-premise computing, cloud computing, and even
the mainframe. None of these approaches will be the single winner in the long
run. Platform as a service (PaaS) providers offer an attractive option that can help
assemble a compatible set of tools.
Think of Hadoop as a flexible, general purpose environment for many forms of
ETL processing, where the goal is to add sufficient structure and context to big data
so that it can be loaded into an RDBMS. The same data in Hadoop can be accessed
and transformed with Hive, Pig, HBase, and MapReduce code written in a variety
of languages, even simultaneously.
This demands flexibility. Assume you will reprogram and rehost all your big
data applications within two years. Choose approaches that can be reprogramed
and rehosted. Consider using a metadata-driven codeless development environment
to increase productivity and help insulate from underlying technology changes.
Build From Sandbox Results
Consider embracing sandbox silos and building a practice of productionizing sand-
box results. Allow data scientists to construct their data experiments and prototypes
using their preferred languages and programming environments. Then, after proof
of concept, systematically reprogram these implementations with an IT turnover
team. Here are a couple of examples to illustrate this recommendation:
The production environment for custom analytic programming might be MatLab
within PostgreSQL or SAS within a Teradata RDBMS, but the data scientists might be
building their proofs of concept in a wide variety of their own preferred languages
and architectures. The key insight here: IT must be uncharacteristically tolerant
of the range of technologies the data scientists use and be prepared in many cases
to re-implement the data scientists’ work in a standard set of technologies that can
be supported over the long haul. The sandbox development environment might
be custom R code directly accessing Hadoop, but controlled by a metadata-driven
ETL tool. Then when the data scientist is ready to hand over the proof of
concept, much of the logic could immediately be redeployed under the ETL tool to
run in a grid computing environment that is scalable, highly available, and secure.
Try Simple Applications First
You can put your toe in the water with a simple big data application, such as backup and archiving. While starting a big data program, searching for valuable business use cases with limited risk, and assembling the requisite big data skills, consider using Hadoop as a low-cost, flexible backup and archiving technology. Hadoop can store and retrieve data in the full range of formats, from totally unstructured to highly structured specialized formats. This approach may also enable you to address the sunsetting challenge, where original applications may not be available in the distant future (perhaps because of licensing restrictions); you can dump data from those applications into your own documented format.
Architecture Best Practices for Big Data
The following best practices affect the overall structure and organization of your
big data environment.
Plan a Data Highway
You should plan for a logical data highway with multiple caches of increasing latency. Physically implement only those caches appropriate for your environment. The data highway can have as many as five caches of increasing data latency, each with its distinct analytic advantages and trade-offs, as shown in Figure 21-3.
Figure 21-3: Big data caches of increasing latency and data quality. The five caches are Raw Source (Immediate), Real Time Cache (Seconds), Business Activity Cache (Minutes), Top Line Cache (24 Hours), and DW and Long Time Series (Daily, Periodic, Annual).
Here are potential examples of the five data caches:
Raw source applications: Credit card fraud detection, immediate complex
event processing (CEP) including network stability and cyber attack detection.
Real time applications: Web page ad selection, personalized price promotions,
on-line games monitoring.
Business activity applications: Low-latency KPI dashboards pushed to users,
trouble ticket tracking, process completion tracking, “fused” CEP reporting,
customer service portals and dashboards, and mobile sales apps.
Top line applications: Tactical reporting, promotion tracking, midcourse cor-
rections based on social media buzz. Top line refers to the common practice
by senior managers of seeing a quick top line review of what has happened
in the enterprise over the past 24 hours.
Data warehouse and long time series applications: All forms of reporting,
ad hoc querying, historical analysis, master data management, large scale
temporal dynamics, and Markov chain analysis.
Each cache that exists in a given environment is physical and distinct from the
other caches. Data moves from the raw source down this highway through ETL
processes. There may be multiple paths from the raw source to intermediate caches.
For instance, data could go to the real-time cache to drive a zero latency-style user
interface, but at the same time be extracted directly into a daily top line cache that
would look like a classic operational data store (ODS). Then the data from this ODS
could feed the data warehouse. Data also flows in the reverse direction along the highway. We'll discuss implementing backflows later in this section.
Much of the data along this highway must remain in nonrelational formats rang-
ing from unstructured text to complex multistructured data, such as images, arrays,
graphs, links, matrices, and sets of name-value pairs.
Build a Fact Extractor from Big Data
It's a good idea to use big data analytics as a fact extractor to move data to the next cache. For example, the analysis of unstructured text tweets can produce a whole set of numerical, trendable sentiment measures including share of voice, audience engagement, conversation reach, active advocates, advocate influence, advocacy impact, resolution rate, resolution time, satisfaction score, topic trends, sentiment ratio, and idea impact.
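A minimal sketch of such a fact extractor, assuming a naive keyword-based scoring and illustrative field names (this is not a production sentiment algorithm, only an outline of turning raw text into trendable numeric measures):

    # Turn raw tweets into numeric, trendable measures for the next cache.
    from collections import Counter

    POSITIVE = {"awesome", "great", "love", "wow"}
    NEGATIVE = {"terrible", "hate", "awful", "broken"}

    def extract_sentiment_facts(tweets):
        """tweets: iterable of dicts like {"text": ..., "retweets": int}."""
        counts = Counter()
        for t in tweets:
            words = {w.strip(".,!?").lower() for w in t["text"].split()}
            if words & POSITIVE:
                counts["positive"] += 1
            if words & NEGATIVE:
                counts["negative"] += 1
            counts["mentions"] += 1
            counts["engagement"] += t.get("retweets", 0)
        scored = counts["positive"] + counts["negative"]
        return {
            "share_of_voice": counts["mentions"],       # trendable per load window
            "audience_engagement": counts["engagement"],
            "sentiment_ratio": counts["positive"] / scored if scored else None,
        }

    facts = extract_sentiment_facts([
        {"text": "Wow! That is awesome!", "retweets": 3},
        {"text": "This update is terrible", "retweets": 0},
    ])

The numeric outputs, not the raw text, are what move down the highway to the more structured caches.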
Build Comprehensive Ecosystems
You can use big data integration to build comprehensive ecosystems that integrate conventional structured RDBMS data, documents, e-mails, and in-house, business-oriented social networking. One of the potent messages from big data is the ability to integrate disparate data sources of different modalities. You get streams of data from new data producing channels such as social networks, mobile devices, and automated alert processes. Imagine a big financial institution handling millions of accounts, tens of millions of associated paper documents, and thousands of professionals both within the organization and in the field as partners or customers. Now set up a secure social network of all the trusted parties to communicate as business is conducted. Much of this communication is significant and should be saved in a queryable way. You could capture all this information in Hadoop, dimensionalize it (as you see in the following modeling best practices), use it in the course of business, and then back it up and archive it.
Plan for Data Quality
You can plan for data quality to be better further along the data highway. This is the classic trade-off of latency versus quality. Analysts and business users must accept the reality that very low latency (that is, immediate) data is unavoidably dirty because there are limits to how much cleansing and diagnosing can be done in very short time intervals. Tests and corrections on individual field contents can be performed at the fastest data transfer rates. Tests and corrections on structural relationships among fields and across data sources are necessarily slower. Tests and corrections involving complex business rules range from being instantaneous (such as a set of dates being in a certain order) to taking arbitrarily long times (such as waiting to see if a threshold of unusual events has been exceeded). And finally, slower ETL processes, such as those feeding the daily top line cache, often are built on fundamentally more complete data, for example where incomplete transaction sets and repudiated transactions have been eliminated. In this case, the instantaneous data feeds simply do not have the correct information.
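A minimal sketch of the latency-versus-quality contrast, assuming illustrative record fields and thresholds: the first screen is cheap enough to run at full transfer speed, while the second needs a whole batch and therefore belongs further down the highway.

    from datetime import date

    def field_screen(record):
        """Cheap per-field tests that can run at full transfer speed."""
        return record.get("amount", 0) >= 0 and record.get("currency") in {"USD", "EUR"}

    def business_rule_screen(records, threshold=3):
        """Slower rule: flag the batch if too many unusual events accumulate."""
        unusual = sum(1 for r in records if r["amount"] > 10_000)
        return unusual < threshold

    batch = [
        {"amount": 120, "currency": "USD", "order_date": date(2013, 1, 5)},
        {"amount": 25_000, "currency": "EUR", "order_date": date(2013, 1, 6)},
    ]
    clean_now = [r for r in batch if field_screen(r)]   # feasible for the real-time cache
    batch_ok = business_rule_screen(clean_now)          # feasible for the daily top line cache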
Add Value to Data as Soon as Possible
You should apply filtering, cleansing, pruning, conforming, matching, joining, and diagnosing at the earliest touch points possible. This is a corollary of the previous best practice. Each step on the data highway provides more time to add value to the data. Filtering, cleansing, and pruning the data reduces the amount transferred to the next cache and eliminates irrelevant or corrupted data. To be fair, there is a school of thought that applies cleansing logic only at analysis run time because cleansing might delete "interesting outliers." Conforming takes the active step of placing highly administered enterprise attributes into major entities such as customer, product, and date. The existence of these conformed attributes allows high value joins to be made across separate application domains. A shorter name for this step is "integration!" Diagnosing allows many interesting attributes to be added to data, including special confidence tags and textual identifiers representing behavior clusters identified by a data mining professional.
Implement Backflow to Earlier Caches
You should implement backflows, especially from the data warehouse, to earlier caches
on the data highway. The highly administered dimensions in the data warehouse, such
as customer, product, and date, should be connected back to data in earlier caches.
Ideally, all that is needed are unique durable keys for these entities in all the caches.
The corollary here is that Job One in each ETL step from one cache to the next is to
replace idiosyncratic proprietary keys with the unique durable keys so that analysis
in each cache can take advantage of the rich upstream content with a simple join on
the unique durable key. Can this ETL step be performed even when transferring raw
source data into the real time cache in less than a second? Maybe….
Dimension data is not the only data to be transferred back down the highway
toward the source. Derived data from fact tables, such as historical summaries and
complex data mining findings, can be packaged as simple indicators or grand totals
and then transferred to earlier caches on the data highway.
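A minimal sketch of this backflow, assuming hypothetical key and field names: a derived indicator computed in the warehouse is pushed back to the real-time cache and joined on the enterprise durable key.

    # Warehouse-derived rollup, keyed by the durable customer key.
    warehouse_rollup = {
        "CUST-000017": "platinum",
        "CUST-000042": "bronze",
    }

    realtime_cache = [
        {"durable_customer_key": "CUST-000017", "event": "page_view"},
        {"durable_customer_key": "CUST-000042", "event": "add_to_cart"},
    ]

    # Backflow step: annotate low-latency events with the warehouse-derived tier.
    for event in realtime_cache:
        event["value_tier"] = warehouse_rollup.get(event["durable_customer_key"], "unknown")

The join is only this simple because both caches carry the same durable surrogate key, as discussed in the modeling best practices later in this chapter.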
Implement Streaming Data
You should implement streaming data analytics in selected data flows. An interesting angle on low latency data is the need to begin serious analysis on the data as it streams in, but possibly far before the data transfer process terminates. There is significant interest in streaming analysis systems, which allow SQL-like queries to process the data as it flows into the system. In some use cases, when the results of a streaming query surpass a threshold, the analysis can be halted without running the job to the bitter end. An academic effort, known as continuous query language (CQL), has made impressive progress in defining the requirements for streaming data processing, including clever semantics for dynamically moving time windows on the streaming data. Look for CQL language extensions and streaming data query capabilities in the load programs for both RDBMSs and HDFS deployed data sets. An ideal implementation would allow streaming data analysis to take place while the data is loaded at gigabytes per second.
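A minimal sketch of a CQL-style moving time window, approximated in plain Python rather than an actual streaming engine; the window length, threshold, and event shape are illustrative assumptions.

    from collections import deque

    WINDOW_SECONDS = 60
    THRESHOLD = 100

    def monitor(stream):
        """stream yields (epoch_seconds, value) tuples in arrival order."""
        window = deque()
        total = 0.0
        for ts, value in stream:
            window.append((ts, value))
            total += value
            # Slide the window: discard events older than WINDOW_SECONDS.
            while window and window[0][0] < ts - WINDOW_SECONDS:
                _, old = window.popleft()
                total -= old
            if total > THRESHOLD:
                # Halt the analysis before the data transfer terminates.
                return ts, total
        return None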
Avoid Boundary Crashes
You should design for far limits on scalability to avoid a boundary crash. In the early days of computer programming, when machines had pathetically small hard drives and real memories, boundary crashes were common and were the bane of applications development. When the application ran out of disk space or real memory, the developer resorted to elaborate measures, usually requiring significant programming that added nothing to the application's primary function. Boundary crashes for normal database applications have more or less been eliminated, but big data raises this issue again. Hadoop is an architecture that dramatically reduces programming scalability concerns because you can, for the most part, indefinitely add commodity hardware. Of course, even commodity hardware must be provisioned, plugged in, and have high bandwidth network connections. The lesson is to plan far ahead for scaling out to huge volumes and throughputs.
Move Prototypes to a Private Cloud
Consider performing big data prototyping on a public cloud and then moving to a private cloud. The advantage of a public cloud is that it can be provisioned and scaled up instantly. In those cases in which the sensitivity of the data allows quick in-and-out prototyping, this can be effective. Just remember not to leave a huge data set online with the public cloud provider over the weekend when the programmers have gone home! However, keep in mind that in some cases in which you are trying to exploit data locality with rack-aware MapReduce processes, you may not be able to use a public cloud service because it may not provide the data storage control needed.
Strive for Performance Improvements
Search for and expect tenfold to hundredfold performance improvements over time, recognizing the paradigm shift for analysis at high speeds. The openness of the big data marketplace has encouraged hundreds of special purpose, tightly coded solutions for specific kinds of analysis. This is a giant blessing and a curse. When freed from being controlled by a big vendor's RDBMS optimizer and inner loop, smart developers can implement spot solutions that are truly 100 times as fast as standard techniques. For instance, some impressive progress has been made on the infamous "big join" problem in which a billion-row dimension is joined to a trillion-row fact table. The challenge is that these individual spot solutions may not be part of a unified single architecture.
One very current big data theme is visualization of data sets. "Flying around" a petabyte of data requires spectacular performance! Visualization of big data is an exciting new area of development that enables both analysis and discovery of unexpected features and data profiling.
Another exciting application that imposes huge performance demands is "semantic zooming without pre-aggregations," in which the analyst descends from a highly aggregated level to progressively more detailed levels in unstructured or semistructured data, analogous to zooming in on a map.
The important lesson behind this best practice is that revolutionary advances in your power to consume and analyze big data can result from 10x to 100x performance gains, and you have to be prepared to add these developments to your suite of tools.
Monitor Compute Resources
You should separate big data analytic workloads from the conventional data ware-
house to preserve service level agreements. If your big data is hosted in Hadoop,
it probably doesn’t compete for resources with your conventional RDBMS-based
data warehouse. However, be cautious if your big data analytics run on the data
warehouse machine because big data requirements change rapidly and inevitably
in the direction of requiring more compute resources.
Exploit In-Database Analytics
Remember to exploit the unique capabilities of in-database analytics. The major RDBMS players all significantly invest in in-database analytics. After you pay the price of loading data into relational tables, SQL can be combined with analytic extensions in extremely powerful ways. In particular, PostgreSQL, an open source database, has extensible syntax for adding powerful user defined functions in the inner loop.
Data Modeling Best Practices for Big Data
The following best practices affect the logical and physical structures of the data.
Think Dimensionally
By thinking dimensionally, we mean dividing the world into dimensions and facts. Business users find the concept of dimensions to be natural and obvious. No matter what the format of the data, the basic associated entities such as customer, product, service, location, or time can always be found. In the following best practice you see how, with a little discipline, dimensions can be used to integrate data sources. But before getting to the integration finish line, you must identify the dimensions in each data source and attach them to every low-level atomic data observation. This process of dimensionalization is a good application for big data analytics. For example, a single Twitter tweet "Wow! That is awesome!" may not seem to contain anything worth dimensionalizing, but with some analysis you often can get customer (or citizen or patient), location, product (or service or contract or event), marketplace condition, provider, weather, cohort group (or demographic cluster), session, triggering prior event, final outcome, and the list goes on. Some form of automated dimensionalizing is required to stay ahead of the high-velocity streams of data. As we point out in a subsequent best practice, incoming data should be fully dimensionalized at the earliest extraction step in as close to real time as possible.
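A minimal sketch of automated dimensionalizing at the extraction step, assuming hypothetical lookup tables, field names, and a resolve_customer() helper; real identity resolution would be far more involved.

    def resolve_customer(handle, customer_lookup):
        """Identity resolution stub: map a social handle to a durable customer key."""
        return customer_lookup.get(handle.lower(), "CUST-UNKNOWN")

    def dimensionalize_tweet(tweet, customer_lookup, campaign_lookup):
        return {
            "fact_text": tweet["text"],                            # degenerate payload
            "customer_key": resolve_customer(tweet["handle"], customer_lookup),
            "date_key": tweet["created_at"][:10],                  # e.g. "2013-06-01"
            "location_key": tweet.get("geo", "UNKNOWN"),
            "campaign_key": campaign_lookup.get(tweet.get("hashtag"), "NONE"),
        }

    row = dimensionalize_tweet(
        {"text": "Wow! That is awesome!", "handle": "@Sam",
         "created_at": "2013-06-01T09:15:00", "hashtag": "#launch"},
        customer_lookup={"@sam": "CUST-000017"},
        campaign_lookup={"#launch": "CAMP-2013-06"},
    )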
Integrate Separate Data Sources with Conformed Dimensions
Conformed dimensions are the glue that holds together separate data sources and enables them to be combined in a single analysis. Conformed dimensions are perhaps the most powerful best practice from the conventional DW/BI world that should be inherited by big data.
The basic idea behind conformed dimensions is the presence of one or more enterprise attributes (fields) in the versions of dimensions associated with separate data sources. For instance, every customer-facing process in an enterprise will have some variation of a customer dimension. These variations of the customer dimension may have different keys, different field definitions, and even different granularity. But even in the worst cases of incompatible data, one or more enterprise attributes can be defined that can be embedded in all the customer dimension variations. For instance, a customer demographic category is a plausible choice. Such a descriptor could be attached to nearly every customer dimension, even those at higher levels of aggregation. After this has been done, analyses on this customer demographic category can cross every participating data source with a simple sort-merge process after separate queries are run against the different data sources. Best of all, the step of introducing the enterprise attributes into the separate databases can be done in an incremental, agile, and nondisruptive way as described in Chapter 8: Customer Relationship Management and Chapter 19: ETL Subsystems and Techniques. All existing analysis applications will continue to run as the conformed dimension content is rolled out.
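A minimal sketch of drilling across two sources on a conformed attribute (customer demographic category): each source answers its own query, and the answer sets are then sort-merged on the shared attribute. The data values are illustrative assumptions; real queries would run in each source system.

    sales_by_demo = {"urban young": 120_000, "suburban family": 340_000}    # source 1
    support_calls_by_demo = {"urban young": 45, "suburban family": 130}     # source 2

    drill_across = [
        {
            "demographic_category": demo,
            "sales_dollars": sales_by_demo.get(demo),
            "support_calls": support_calls_by_demo.get(demo),
        }
        for demo in sorted(set(sales_by_demo) | set(support_calls_by_demo))
    ]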
Anchor Dimensions with Durable Surrogate Keys
If there is one lesson we have learned in the data warehouse world, it is not to anchor major entities such as customer, product, and time with the natural keys defined by a specific application. These natural keys turn out to be a snare and a delusion in the real world. They are incompatible across applications, they are poorly administered, and they are administered by someone else who may not have the interests of the data warehouse at heart. The first step in every data source is to augment the natural key coming from a source with an enterprisewide durable surrogate key. Durable means there is no business rule that can change the key. The durable key belongs to the DW/BI system, not to the data source. Surrogate means the keys themselves are simple integers either assigned in sequence or generated by a robust hashing algorithm that guarantees uniqueness. An isolated surrogate key has no applications content. It is just an identifier.
The big data world is filled with obvious dimensions that must possess durable surrogate keys. Earlier in this chapter when we proposed pushing data backward down the data highway, we relied on the presence of the durable surrogate keys to make this process work. We also stated that Job One on every data extraction from a raw source was to embed the durable surrogate keys in the appropriate dimensions.
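A minimal sketch of the two key-assignment styles just described, a sequence and a hash of the source system plus natural key; the key format is an illustrative assumption.

    import hashlib
    import itertools

    _sequence = itertools.count(1)

    def sequence_key():
        # Simple integer assigned in sequence by the DW/BI system.
        return next(_sequence)

    def hashed_key(source_system, natural_key):
        # Stable integer derived from a hash of the source and natural key.
        digest = hashlib.sha256(f"{source_system}|{natural_key}".encode("utf-8")).hexdigest()
        return int(digest[:15], 16)

    # The same natural key always yields the same durable key, regardless of source reloads.
    assert hashed_key("crm", "A-1001") == hashed_key("crm", "A-1001")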
Expect to Integrate Structured and Unstructured Data
Big data considerably broadens the integration challenge. Much big data will never
end up in a relational database; rather it will stay in Hadoop or a grid. But after you
are armed with conformed dimensions and durable surrogate keys, all forms of data
can be combined in single analyses. For example, a medical study can select a group
of patients with certain demographic and health status attributes and then combine
their conventional DW/BI data with image data (photographs, X-rays, EKGs, and
so on), free form text data (physician’s notes), social media sentiments (opinions of
treatment), and cohort group linkages (patients with similar situations, and doctors with similar patients).
Use Slowly Changing Dimensions
You should track time variance with slowly changing dimensions (SCDs). Tracking
time variance of dimensions is an old and venerable best practice from the data
warehouse world. Chapter 5: Procurement makes a powerful case for using SCD
techniques for handling time variance. This is just as important in the big data
world as it is in the conventional data warehouse world.
Declare Data Structure at Analysis Time
You must get used to not declaring data structures until analysis time. One of the charms of big data is putting off declaring data structures at the time of loading into Hadoop or a data grid. This brings many advantages. The data structures may not be understood at load time. The data may have such variable content that a single data structure either makes no sense or forces you to modify the data to fit into a structure. If you can load data into Hadoop, for instance, without declaring its structure, you can avoid a resource intensive step. And finally, different analysts may legitimately see the same data in different ways. Of course, there is a penalty in some cases because data without a declared structure may be difficult or impossible to index for rapid access, as in an RDBMS. However, most big data analysis algorithms process entire data sets without expecting precise filtering of subsets of the data.
This best practice conflicts with traditional RDBMS methodologies, which put a lot of emphasis on modeling the data carefully before loading. But this does not lead to a deadly conflict. For data destined for an RDBMS, the transfer from a Hadoop or data grid environment and from a name-value pair structure into RDBMS named columns can be thought of as a valuable ETL step.
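A minimal sketch of declaring structure at analysis time (schema on read), assuming illustrative record and field names: raw records are stored as-is, and each analyst projects only the fields that matter to that analysis.

    import json

    raw_lines = [
        '{"customer": "CUST-000017", "amount": 120.0, "channel": "web"}',
        '{"customer": "CUST-000042", "rare_field": "postage stamp", "amount": 10000}',
    ]

    def project(record, fields):
        """Apply an analyst-chosen structure to an otherwise schemaless record."""
        return {f: record.get(f) for f in fields}

    parsed = [json.loads(line) for line in raw_lines]
    revenue_view = [project(r, ["customer", "amount"]) for r in parsed]    # one analyst's schema
    channel_view = [project(r, ["customer", "channel"]) for r in parsed]   # a different schema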
Load Data as Simple Name-Value Pairs
Consider building technology around name-value pair data sources. Big data sources are filled with surprises. In many cases, you open the fire hose and discover unexpected or undocumented data content, which you must nevertheless load at gigabytes per second. The escape from this problem is to load this data as simple name-value pairs. For example, if an applicant were to disclose her financial assets, as illustrated with Figures 8-7 and 8-8, she might declare something unexpected such as "rare postage stamp = $10,000." In a name-value pair data set, this would be loaded gracefully, even though you had never seen "rare postage stamp" and didn't know what to do with it at load time. Of course, this practice meshes nicely with the previous practice of deferring the declaration of data structures until past load time.
Many MapReduce programming frameworks require data to be presented as name-value pairs, which makes sense given the completely general nature of big data.
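A minimal sketch of loading arbitrary attributes as name-value pairs so that unexpected content such as "rare postage stamp" lands gracefully. The table and column names are illustrative assumptions, and sqlite3 merely stands in for whatever target store you use.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE applicant_assets (applicant_id TEXT, name TEXT, value TEXT)")

    declared_assets = {"checking account": "4500", "rare postage stamp": "10000"}
    conn.executemany(
        "INSERT INTO applicant_assets VALUES (?, ?, ?)",
        [("APP-001", name, value) for name, value in declared_assets.items()],
    )

    # Later ETL can pivot well-understood names into real columns; unknown names keep flowing.
    rows = conn.execute("SELECT name, value FROM applicant_assets").fetchall()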
Rapidly Prototype Using Data Virtualization
Consider using data virtualization to allow rapid prototyping and schema alterations. Data virtualization is a powerful technique for declaring different logical data structures on underlying physical data. Standard view definitions in SQL are a good example of data virtualization. In theory, data virtualization can present a data source in any format the analyst needs. But data virtualization trades off the cost of computing at run time against the cost of ETL to build physical tables before run time. Data virtualization is a powerful way to prototype data structures and make rapid alterations or provide distinct alternatives. The best data virtualization strategy is to expect to materialize the virtual schemas when they have been tested and vetted and the analysts want the performance improvements of actual physical tables.
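A minimal sketch of the virtualize-then-materialize strategy, assuming hypothetical table and view names and using sqlite3 as a stand-in RDBMS: prototype with a view, then turn the vetted definition into a physical table for performance.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page_events (session_id TEXT, page TEXT, dwell_seconds REAL)")
    conn.executemany(
        "INSERT INTO page_events VALUES (?, ?, ?)",
        [("S1", "home", 12.5), ("S1", "cart", 40.0)],
    )

    # Virtual schema: cheap to alter while the analysts iterate.
    conn.execute("""
        CREATE VIEW session_summary AS
        SELECT session_id, COUNT(*) AS pages, SUM(dwell_seconds) AS total_dwell
        FROM page_events GROUP BY session_id
    """)

    # Once vetted, materialize the same definition as a physical table.
    conn.execute("CREATE TABLE session_summary_physical AS SELECT * FROM session_summary")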
Data Governance Best Practices for Big Data
The following best practices apply to managing big data as a valuable enterprise
asset.
There is No Such Thing as Big Data Governance
Now that we have your attention, the point is that data governance must be a com-
prehensive approach for the entire data ecosystem, not a spot solution for big data
in isolation. Data governance for big data should be an extension of the approach
used to govern all the enterprise data. At a minimum, data governance embraces
privacy, security, compliance, data quality, metadata management, master data
management, and the business glossary that exposes definitions and context to
the business community.
Dimensionalize the Data before Applying Governance
Here is an interesting challenge big data introduces: You must apply data governance principles even when you don't know what to expect from the content of the data. You may receive data arriving at gigabytes per minute, often as name-value pairs with unexpected content. The best chance at classifying data in ways that are important to your data governance responsibilities is to dimensionalize it as fully as possible at the earliest stage in the data pipeline. Parse it, match it, and apply identity resolution on the fly. We made this same point when arguing for the benefits of data integration, but here we advocate against even using the data before this dimensionalizing step.
Privacy is the Most Important Governance Perspective
If you analyze data sets that include identifying information about individuals or organizations, privacy is the most important governance perspective. Although every aspect of data governance looms as critically important, in these cases, privacy carries the most responsibility and business risk. Egregious episodes of compromising the privacy of individuals or groups can damage your reputation, diminish marketplace trust, expose you to civil lawsuits, and get you in trouble with the law. At the least, for most forms of analysis, personal details must be masked, and data aggregated enough to not allow identification of individuals. At the time of this writing, special attention must be paid when storing sensitive data in Hadoop because after data is written to Hadoop, Hadoop doesn't manage updates very well. Data should either be masked or encrypted on write (persistent data masking) or masked on read (dynamic data masking).
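A minimal sketch of persistent data masking on write, assuming illustrative field names and simplistic salt handling; this is an outline of the idea, not a complete privacy solution. Personal details are replaced with a one-way token before the record is stored, so analysis can still group by individual without exposing identity.

    import hashlib

    SALT = "rotate-and-store-this-secret-elsewhere"   # assumption: managed outside the data store

    def mask_on_write(record):
        masked = dict(record)
        for field in ("name", "email"):
            if field in masked:
                token = hashlib.sha256((SALT + masked[field]).encode("utf-8")).hexdigest()[:16]
                masked[field] = token
        return masked

    stored = mask_on_write({"name": "Pat Example", "email": "pat@example.com", "amount": 120.0})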
Don’t Choose Big Data over Governance
Don't put off data governance completely in the rush to use big data. Even for exploratory big data prototype projects, maintain a checklist of issues to consider when going forward. You don't want an ineffective bureaucracy, but maybe you can strive to deliver an agile bureaucracy!
Summary
Big data brings a host of changes and opportunities to IT, and it is easy to think that a whole new set of rules must be created. But with the benefit of big data experience, many best practices have emerged. Many of these practices are recognizable extensions from the DW/BI world, and admittedly quite a few are new and novel ways of thinking about data and the mission of IT. But the recognition that the mission has expanded is welcome and is in some ways overdue. The current explosion of data-collecting channels, new data types, and new analytic opportunities means the list of best practices will continue to grow in interesting ways.
Index
Symbols
3NF (third normal form) models, 7
ERDs (entity-relationship diagrams), 8
normalized 3NF structures, 8
4-step dimensional design process, 38, 70–72
A
abnormal scenario indicators, 255–256
abstract generic dimensions, 66
geographic location dimension, 310
accessibility goals, 3
accidents (insurance case study), factless fact
tables, 396
accounting case study, 202
budgeting, 210–213
fact tables, consolidated, 224–225
G/L (general ledger), 203
chart of accounts, 203–204
currencies, 206
nancial statements, 209–210
scal calendar, multiple, 208
hierarchies, 209
journal entries, 206–207
period close, 204–206
periodic snapshot, 203
year-to-date facts, 206
hierarchies
xed depth, 214
modifying, ragged, 221
ragged, alternative modeling approaches,
221–223
ragged, bridge table approach, 223
ragged, modifying, 220–221
ragged, shared ownership, 219
ragged, time varying, 220
ragged, variable depth, 215–217
variable depth, 214–215
OLAP and, 226
accumulating grain fact tables, 12
accumulating snapshots, 44, 118–119,
194–196
claims (insurance case study), 393
complex workfl ows, 393–394
timespan accumulating snapshot,
394–395
ETL systems, 475
fact tables, 121, 326–329
complementary fact tables, 122
milestones, 121
OLAP cubes, 121–122
updates, 121–122
healthcare case study, 343
policy (insurance case study), 384–385
type 2 dimensions and, 196
activity-based costing measures, 184
additive facts, 11, 42
add mini dimension and type 1 outrigger
(SCD type 5), 55
add mini-dimension (SCD type 4), 55
multiple, 156–159
add new attribute (SCD type 3), 55, 154–155
multiple, 156
add new row (SCD type 2), 54, 150–152
e ective date, 152–153
expiration date, 152–153
type 1 in same dimension, 153
addresses
ASCII, 236
CRM and, customer dimension, 233–238
Unicode, 236–238
add type 1 attributes to type 2 dimension
(SCD type 6), 56
admissions events (education case study),
330
aggregate builder, ETL system, 481
aggregated facts
as attributes, 64
CRM and, customer dimension, 239–240
aggregate fact tables, 45
clickstream data, 366–367
aggregate OLAP cubes, 8, 45
aggregate tables, ETL system development,
519
agile development, 34–35
conformed dimensions and, 137–138
airline case study, 311
bus matrix, 311–315
calendars as outriggers, 321–323
class of service fl own dimension, 319–320
destination airport dimension, 320–321
fact tables, granularity, 312–316
origin dimension, 320–321
passenger dimension, 314
sales channel dimension, 315
segments, linking to trips, 315–316
time zones, multiple, 323
aliasing, 171
allocated facts, 60
allocating, 184–186
allocations, profi t and loss fact tables, 60
ALTER TABLE command, 17
analytics
big data management, 531
GA (Google Analytics), 367
in-database, big data and, 537
analytic solutions, packaged, 270–271
AND queries, skill keywords bridge, 275
architecture
big data best practices
backfl ow, 535–536
boundary crashes, 536
compute resources, 537
data highway planning, 533–534
data quality planning, 535
data value, 535
ecosystems, 534
fact extractor, 534
in-database analytics, 537
performance improvements, 537
prototypes, 536
streaming data, 536
DW/BI alternatives, 26–29
enterprise data warehouse bus architecture,
22, 123–125
hub-and-spoke CIF architecture, 28–29
hybrid hub-and-spoke Kimball architecture,
29
independent data mart architecture, 26–27
MapReduce/Hadoop, 530
RDBMS, extension, 529–530
real-time processing, 522–524
archiving, 447–448, 485–486
artifi cial keys, 98
ASCII (American Standard Code for
Information Interchange), 236
atomic grain data, 17, 74
attributes
aggregated facts as, 64
bridge tables, CRM and, 247
changes, 514
detailed dimension model, 437
expiration, 266
ags, 48
indicators, 48
null, 48, 92
numeric values as, 59
pathstring, ragged/variable depth
hierarchies, 57
product dimensions, 132
SCD type 3 (add new attribute), 154–155
multiple, 156
audit columns, CDC (change data capture),
452
audit dimensions, 66, 192–193, 284, 495
assembler, 460
insurance case study, 383
key assignment, 511–512
automation, ETL system development
errors, 520
exceptions, 520
job scheduling, 520
B
backfl ow, big data and, 535–536
backups, 495
backup system, ETL systems, 485
archiving, 485–486
compliance manager, 493–495
dependency, 490–491
high performance, 485
lights-out operations, 485
lineage, 490–491
metadata repository, 495
parallelizing/pipelining system, 492
problem escalation system, 491–492
recovery and restart system, 486–488
retrieval, 485–486
security system, 492–493
simple administration, 485
sorting system, 490
version control system, 488
version migration system, 488
workfl ow monitor, 489–490
banking case study, 282
bus matrix, 282–296
dimensions
household, 286–287
mini-dimensions, 289–291
multivalued, weighting, 287–289
too few, 283–286
facts, value banding, 291–292
heterogeneous products, 293–295
hot swappable dimensions, 296
user perspective, 293
behavior
customers, CRM and, 249–251
sequential, step dimension and, 251–252
study groups, 64, 249
behavior tags
facts, 241
time series, 63, 240–242
BI application design/development
(Lifecycle), 408, 423 –424
BI applications, 22
BI (business intelligence) delivery interfaces,
448
big data
architecture best practices
backfl ow, 535–536
boundary crashes, 536
compute resources, 537
data highway planning, 533–534
data quality planning, 535
data value, 535
ecosystems, 534
fact extractor, 534
in-database analytics, 537
performance improvements, 537
prototypes, 536
streaming data, 536
data governance best practices, 541
dimensionalizing and, 541
privacy, 541–542
data modeling best practices
data structure declaration, 540
data virtualization, 540
dimension anchoring, 539
integrating sources and confi ned
dimensions, 538
name-value pairs, 540
SCDs (slowly changing dimensions), 539
structured/unstructured data integration,
539
thinking dimensionally, 538
management best practices
analytics and, 531
legacy environments and, 532
sandbox results and, 532–533
sunsetting and, 533
overview, 527–529
blobs, 530
boundary crashes, big data and, 536
bridge tables
customer contacts, CRM and, 248
mini-dimensions, 290–291
multivalued
CRM and, 245–246
time varying, 63
multivalued dimensions, 63, 477–478
ragged hierarchies and, 223
ragged/variable depth hierarchies, 57
sparse attributes, CRM and, 247
bubble chart, dimension modeling and,
435–436
budget fact table, 210
budgeting process, 210–213
bus architecture, 124–125
enterprise data warehouse bus architecture,
52
business analyst, 408
Business Dimensional Lifecycle, 404
business-driven governance, 136–137
business driver, 408
business initiatives, 70
business lead, 408
business motivation, Lifecycle planning, 407
business processes
characteristics, 70–71
dimensional modeling, 39, 300
retail sales case study, 74
value chain, 111–112
business representatives, dimensional
modeling, 431–432
business requirements
dimensional modeling, 432
Lifecycle, 405, 410
documentation, 414
forum selection, 410–411
interviews, 412–414
launch, 412
prioritization, 414–415
representatives, 411–412
team, 411
business rule screens, 458
business sponsor, 408
Lifecycle planning, 406
business users, 408
perspectives, 293
bus matrix
accounting, 202
airline, 311
banking, 282
detailed implementation bus matrix, 53
dimensional modeling and, 439
enterprise data warehouse bus matrix, 52
healthcare case study, 339–340
HR (human resources), 268–269
insurance, 378–389
detailed implementation, 390
inventory, 113–119
opportunity/stakeholder matrix, 127
order management, 168
procurement, 142–147
telecommunications, 297–299
university, 325–326
web retailers, clickstream integration,
368–370
C
calculation lag, 196–197
calendar date dimensions, 48
calendars, country-specifi c as outriggers,
321–323
cannibalization, 90
cargo shipper schema, 317
case studies
accounting, 202
budgeting, 210–213
consolidated fact tables, 224–225
G/L (general ledger), 203–210
hierarchies, 214–223
OLAP and, 226
airline, 311
calendars as outriggers, 321–323
class of service fl own dimension, 319–320
destination airport dimension, 320–321
fact table granularity, 312–316
origin dimension, 320–321
passenger dimension, 314
sales channel dimension, 315
time zones, multiple, 323
CRM (customer relationship management)
analytic, 231–233
bridge tables, 245–248
complex customer behavior, 249–251
customer data integration, 256–260
customer dimension and, 233–245
fact tables, abnormal scenario indicators,
255–256
fact tables, satisfaction indicators,
254–255
fact tables, timespan, 252–254
low latency data, 260–261
operational, 231–233
step dimension, sequential behavior,
251–252
education, 325–326
accumulating snapshot fact table,
326–329
additional uses, 336
admissions events, 330
applicant pipeline, 326–329
attendance, 335
change tracking, 330
course registrations, 330–333
facility use, 334
instructors, multiple, 333
metrics, artifi cial count, 331–332
research grant proposal, 329
student dimensions, 330
term dimensions, 330
electronic commerce
clickstream data, 353–370
profi tability, sales transactions and,
370–372
nancial services, 282, 287–295
dimensions, household, 286–287
dimensions, too few, 283–286
healthcare, 339–340
billing, 342–344
claims, 342–344
date dimension, 345
diagnosis dimension, 345–347
EMRs (electronic medical records),
341–348
HCPCS (Healthcare Common Procedure
Coding System), 342
HIPAA (Health Insurance Portability and
Accountability Act), 341
ICD (International Classifi cation of
Diseases), 342
images, 350
inventory, 351
measure type dimension, 349–350
payments, 342–344
retroactive changes, 351–352
subtypes, 347–348
supertypes, 347–348
text comments, 350
HR (Human Resources Management)
bus matrix, 268
employee hierarchies, 271–272
employee profi les, 263–267
hierarchies, 273–274
managers key, 272–273
packaged data models, 270–271
periodic snapshots, 267–268
skill keywords, 274–277
survey questionnaire, 277–278
insurance, 375–377
accident events factless fact table, 396
accumulating snapshot, 384–385
bus matrix, 378, 389–390
claim transactions, 390–396
conformed dimensions, 386
conformed facts, 386
degenerate dimension, 383
dimensions, 380
dimensions, audit, 383
dimensions, low cardinality, 383
dimensions, multivalued, 388
junk dimensions, 392
mini-dimensions, 381–382
multivalued dimensions, 382
NAICS (North American Industry
Classifi cation System), 382
numeric attributes, 382
pay-in-advance facts, 386–387
periodic snapshot, 385
policy transaction fact table, 383
policy transactions, 379–380
premiums, periodic snapshot, 386–388
SCDs (slowly changing dimensions),
380–381
SIC (Standard Industry Classifi cation),
382
supertype/subtype products, 384, 387
timespan accumulating snapshot, 394
value chain, 377–378
inventory
accumulating snapshot, 118–119
fact tables, 115–116
periodic snapshot, 112–114
semi-additive facts, 114–115
transactions, 116–118
order management, 167
accumulating snapshots, 194–196
audit dimension, 192–193
customer dimension, 174–175
deal dimension, 177–179
header/line pattern, 186
header/line patterns, 181–182
invoice transactions, 187
junk dimensions, 179–180
lag calculations, 196
multiple currencies, 182–184
product dimension, 172–173
profi t and loss facts, 189–191
transaction granularity, 184–186
transactions, 168–171
units of measure, multiple, 197–198
procurement, 141–142
bus matrix, 142–143
complementary procurement snapshot
fact table, 147
transactions, 142–145
retail sales, 72–73
business process selection, 74
dimensions, selecting, 76
facts, 76–77
facts, derived, 77–78
facts, non-additive, 78
fact tables, 79
frequent shopper program, 96
grain declaration, 74–75
POS schema, 94
retail schema extensibility, 95–97
telecommunications, 297–299
causal dimension, 89–90, 284
CDC (change data capture)
ETL system, 451
audit columns, 452
di compare, 452
log scraping, 453
message queue monitoring, 453
timed extracts, 452
centipede fact tables, 58, 108–109
change reasons, 266–267
change tracking, 147–148
education case study, 330
HR (human resources) case study,
embedded managers key, 272–273
SCDs, 148
chart of accounts (G/L), 203–204
uniform chart of accounts, 204
checkpoints, data quality, 516
CIF (Corporate Information Factory), 28–29
CIO (chief information o cer), 377
claim transactions (insurance case study),
390
claim accumulating snapshot, 393–394
junk dimensions and, 392
periodic snapshot, 395–396
timespan accumulating snapshot, 394–395
class of service fl own dimension (airline case
study), 319–320
cleaning and conforming, ETL systems, 450
audit dimension assembler, 460
conforming system, 461–463
data cleansing system, 456
quality event responses, 458
quality screens, 457–458
data quality improvement, 455–456
deduplication system, 460–461
error event schema, 458–460
clickstream data, 353–354
dimensional models, 357–358
aggregate fact tables, 366–367
customer, 361–362
date, 361–362
event dimension, 359
GA (Google Analytics), 367
page dimension, 358–359
page event fact table, 363–366
referral dimension, 360
session dimension, 359–360
session fact table, 361–363
step dimension, 366
time, 361–362
session IDs, 355–356
visitor identifi cation, 356–357
visitor origins, 354–355
web retailer bus matrix integration,
368–370
collaborative design workshops, 38
column screens, 457
comments, survey questionnaire (HR), 278
common dimensions, 130
compliance, ETL system, 445
compliance manager, ETL system, 493–495
composite keys, 12
computer resources, big data and, 537
conformed dimensions, 51, 130, 304
agile movement and, 137–138
drill across, 130–131
grain, 132
identical, 131–132
insurance case study, 386
limited conformity, 135
shrunken on bus matrix, 134
shrunken rollup dimensions, 132
shrunken with row subset, 132–134
conformed facts, 42, 139
insurance case study, 386
inventory case study, 138–139
conforming system, ETL system, 461–463
consistency
adaptability, 4
goals, 3
consolidated fact tables, 45
accounting case study, 224–225
contacts, bridge tables, 248
contribution amount (P&L statement), 191
correctly weighted reports, 288
cost, activity-based costing measures, 184
COUNT DISTINCT, 243
country-specifi c calendars as outriggers,
321–323
course registrations (education case study),
330
CRM (customer relationship management),
229
analytic, 231–233
bridge tables
customer contacts, 248
multivalued, 245–246
sparse attributes, 247
complex customer behavior, 249–251
customer data integration, 256
multiple customer dimension conformity,
258–259
single customer dimension, 256–258
customer dimension and, 233
addresses, 233–236
addresses, international, 236–238
counts with Type 2, 243
dates, 238
facts, aggregated, 239–240
hierarchies, 244–245
names, 233–236
names, international, 236–238
outriggers, low cardinality attribute set
and, 243–244
scores, 240–243
segmentation, 240–243
facts
abnormal scenario indicators, 255–256
satisfaction indicators, 254–255
timespan, 252–254
low latency data, 260–261
operational, 231–233
overview, 230–231
social media and, 230
step dimension, sequential behavior,
251–252
currency, multiple
fact tables, 60
G/L (general ledger), 206
order transactions, 182–184
current date attributes, dimension tables,
82–83
customer contacts, bridge tables, 248
customer dimension, 158, 174–175
clickstream data, 361–362
CRM and, 233
addresses, 233–236
addresses, international, 236–238
counts with Type 2, 243
dates, 238
facts, aggregated, 239–240
hierarchies, 244–245
names, 233–236
names, international, 236–238
outriggers, low cardinality attribute set
and, 243–244
scores, 240–243
segmentation, 240–243
factless fact tables, 176
hierarchies, 174–175
multiple, partial conformity, 258–259
single, 256–258
single versus multiple dimension tables,
175–176
customer matching, 257
customer relationship management case study. See CRM, 230
D
data architect/modeler, 409
data bags, 530
database administrator, 409
data cleansing system, ETL system, 456
quality event responses, 458
quality screens, 457–458
data compression, ETL system, 454
data governance, 135–136
big data best practices, 541
dimensionalizing, 541
privacy, 541–542
business-driven governance, 136–137
objectives, 137
data handlers, late arriving, 478–479
data highway planning, 533–534
data integration
conformed dimensions, 130–138
CRM and, 256
multiple customer dimension conformity,
258–259
single customer dimension, 256–258
ETL system, 444–446
MDM (master data management), 256
structure/unstructured data, 539
value chain integration, 111–112
data latency, ETL system, 447
data mart, independent data mart
architecture, 26–27
data mining
DW/BI system and, 242–243
null tracking, 92
data modeling, big data best practices
data structure declaration, 540
data virtualization, 540
dimension anchoring, 539
integrating sources and conformed
dimensions, 538
name-value pairs, 540
SCDs (slowly changing dimensions), 539
structured/unstructured data integration,
539
thinking dimensionally, 538
data models, packaged, 270–271
data profi ling
ETL system, 450–451
tools, 433
data propagation, ETL system, 482
data quality
checkpoints, 516
ETL system, 445
improvement, 455–456
planning, big data and, 535
data steward, 408
data structure, analysis time, 540
data value, big data and, 535
data virtualization, big data and, 540
data warehousing versus operational
processing, 2
date dimension, 79–81, 284, 302
calendar date, 48
clickstream data, 361–362
current date attributes, 82–83
xed time series buckets and, 302–303
healthcare case study, 345
populating, 508
relative date attributes, 82–83
role playing, 171
smart keys, 101–102
textual attributes, 82
time-of-day, 83
dates
CRM and, customer dimension, 238
dimension tables, 89
timespan fact tables, 252–254
transaction fact table, 170–171
foreign key, 170
role playing, 171
date/time
GMT (Greenwich Mean Time), 323
time zones, multiple, 323
UTC (Coordinated Universal Time), 323
date/time dimensions, 470
date/time stamp dimensions, 284
deal dimensions, 177–178
decision-making goals, 4
decodes, dimensions, 303–304
decoding production codes, 504
deduplication system, 460–461
degenerate dimension, 47, 284, 303
insurance case study, 383
order number, 178–179
retail sales case study, 93–94
surrogate keys, 101
telecommunications case study, 303
transaction numbers, 93–94
demand planning, 142
demographics dimension, 291
size, 159
denormalized fl attened dimensions, 47
dependency analysis, 495
dependency, ETL, 490–491
deployment
Lifecycle, 424
OLAP, 9
derived facts, 77–78
descriptions, dimensions, 303–304
descriptive context, dimensions for, 40
destination airport dimension (airline case
study), 320–321
detailed implementation bus matrix, 53, 390
detailed table design documentation,
437–439
diagnosis dimension (healthcare case study),
345–347
di compare, CDC (change data capture),
452
dimensional modeling, 7
3NF (third normal form) models, 7–8
4-step design process
business process, 70–71
dimensions, 72
facts, 72
grain, 71
atomic grain data, 17
benefi ts of thinking dimensionally, 32–33
business processes, 300
business representatives, 431–432
calendar coordination, 433–434
clickstream data, 357–367
data profi ling tools, 433
design
bubble chart, 435–436
detailed model development, 436–439
documentation fi nalization, 441
validation, 440–441
dimension tables, 13
attributes, 13–14
hierarchical relationships, 15
snowfl aking, 15
extensibility, 16
facts
additive facts, 11
composite keys, 12
FK (foreign keys), 12
grains, 10
numeric facts, 11
textual facts, 12
fact tables, 10–12
grain categories, 12
fundamentals
business processes, 39
business requirement gathering, 37–38
collaborative workshops, 38
data realities gathering, 37–38
descriptive context, 40
facts, 40
four-step dimensional design process, 38
grain, 39
model extensions, 41
star schemas, 40
Lifecycle data track, 420
mistakes to avoid, 397–401
myths, 30
departmental versus enterprise, 31
integration, 32
predictable use, 31–32
scalability, 31
summary data, 30
naming conventions, 433
OLAP (online analytical processing) cube, 8
deployment considerations, 9
overview, 429–431
participant identifi cation, 431–432
reports, 17
simplicity in, 16
sources, 300
star schemas, 8
terminology, 15
tools, 432
dimensional thinking, big data and, 538
dimension manager system, 479–480
dimensions
anchoring, big data and, 539
attributes, 514
aggregated facts as, 64
bridge tables, CRM and, 247
changes, 514
detailed dimension model, 437
expiration, 266
ags, 48
indicators, 48
null, 48, 92
numeric values as, 59
pathstring, ragged/variable depth
hierarchies, 57
product dimensions, 132
SCD type 3 (add new attribute), 154–156
See also attributes, 48
audit dimension, 66, 192–193, 284
assembler, 460
insurance case study, 383
average number in model, 284
calendar date, 48
causal, 89–90, 284
change reasons, 266–267
class of service fl own (airline case study),
319–320
conformed, 51, 130, 304
agile movement and, 137–138
drill across, 130–131
grain, 132
identical, 131–132
insurance case study, 386
limited conformity, 135
shrunken, bus matrix and, 134
shrunken rollup dimensions, 132
shrunken with row subset, 132–134
customer dimension, 158, 174–175
conformity, 258–259
CRM and, 233–245
factless fact tables, 176
hierarchies, 174–175
single, 256–258
single versus multiple dimension tables,
175–176
data governance, big data and, 541
date dimension, 48, 284, 302
xed time series buckets and, 302–303
healthcare case study, 345
populating, 508
role playing, 171
date/time stamp, 284
deal dimension, 177–178
decodes, 303–304
degenerate, 47, 284, 303
order number, 178–179
demographic, 291
size, 159
denormalized fl attened, 47
descriptions, 303–304
destination airport (airline case study),
320–321
detailed dimension model, 437
diagnosis (healthcare case study), 345–347
dimensional design models, 72
drilling across, 51
event dimension, clickstream data, 359
generic, abstract, 66
geographic location, 310
granularity, hierarchies and, 301–302
hierarchies
xed depth position hierarchies, 56
ragged/variable depth with hierarchy
bridge tables, 57
ragged/variable depth with pathstring
attributes, 57
slightly ragged/variable depth, 57
hot swappable, 66, 296
household, 286–287
insurance case study, 380
degenerate dimension, 383
mini-dimensions, 381–382
multivalued dimensions, 382
numeric attributes, 382
SCDs (slowly changing dimensions),
380–381
junk dimensions, 49, 179–180, 284
keys, natural, 162
late arriving, 67
low cardinality, insurance case study, 383
measure type, 65
healthcare case study, 349–350
mini-dimensions, 289–290
bridge tables, 290–291
insurance case study, 381–382
type 5 SCD and, 160
multivalued
bridge table builder, 477–478
bridge tables and, 63
insurance case study, 382–388
weighting, 287–289
origin (airline case study), 320–321
outrigger, 50
page dimension, clickstream data, 358–359
passenger (airline case study), 314
product dimension
characteristics, 172–173
operational product master, 173
order transactions, 172–173
rapidly changing monster dimension, 55
referral dimension, clickstream data, 360
retail sales case study, 76
role-playing, 284
sales channel, airline case study, 315
service level performance, 188–189
session dimension, clickstream data,
359–360
shrunken, 51
shrunken rollup, 132
special dimensions manager, ETL systems,
470
date/time dimensions, 470
junk dimensions, 470
mini-dimensions, 471
shrunken subset, 472
static, 472
user-maintained, 472–473
static dimension, population, 508
status, 284
step dimension, 65
clickstream data, 366
sequential behavior, 251–252
student (education case study), 330
term (education case study), 330
text comments, 65
too few, 283–286
transaction profi le dimension, 49, 179
transformations
combine from separate sources, 504
decode production codes, 504
relationship validation, 504–505
simple data, 504
surrogate key assignment, 506
value chain, 52
dimension surrogate keys, 46
dimension tables, 13
attributes, 13–14
calendar date dimensions, 48
changed rows, 513–514
date dimension, 79–81
current date attributes, 82–83
smart keys, 101–102
textual attributes, 82
time-of-day, 83
dates, 89
degenerate dimensions, 47
surrogate keys, 101
transaction numbers, 93–94
denormalized fl attened dimensions, 47
drilling down, 47
durable keys, 46
extracts, 513
fact tables, centipede, 108–109
ags, 48, 82
hierarchical relationships, 15
hierarchies, multiple, 48, 88–89
historic data population, 503–506
holiday indicator, 82
indicators, 48, 82
junk dimensions, 49
loading, 506–507
loading history, 507–508
natural keys, 46, 98–101
new rows, 513–514
null attributes, 48
outrigger dimensions, 50
outriggers, 106–107
product dimension, 83–84
attributes with embedded meaning, 85
drilling down, 86–87
many-to-one hierarchies, 84–85
numeric values, 85–86
promotion dimension, 89–91
null items, 92
role-playing, 49
snowfl aking, 15, 50, 104–106
store dimension, 87–89
structure, 46
supernatural keys, 46, 101
surrogate keys, 46, 98–100
transaction profi le dimensions, 49
weekday indicator, 82
dimension terminology, 15
dimension-to-dimension table joins, 62
documentation
detailed table design, 437–439
dimensional modeling, 441
ETL development, 502–503
sandbox source system, 503
Lifecycle architecture requirements, 417
Lifecycle business requirements, 414
draft design
exercise discussion, 306–308
remodeling existing structures, 309
drill across, 51, 130–131
drill down, 47, 86–87
ETL development, 500
hierarchies, 501
table schematics, 501
G/L (general ledger) hierarchy, 209
management hierarchies, 273–274
dual date/time stamps, 254
dual type 1 and type 2 dimensions
(SCD type 7), 56
duplication, deduplication system, 460–461
durable keys, 46
supernatural keys, 101
DW/BI, 1
alternative architecture, 26–29
data mining and, 242–243
goals, 3
international goals, 237–238
Kimball architecture, 18
BI applications, 22
ETL (extract, transformation, and load)
system, 19–21
hybrid hub-and-spoke Kimball, 29
operational source systems, 18
presentation area, 21–22
restaurant metaphor, 23–26
publishing metaphor for DW/BI managers,
5–7
system users, 2
dynamic value bands, 64, 291
E
ecosystems, big data and, 534
case study, 325–326
education
accumulating snapshot fact table, 326–329
additional uses, 336
admissions events, 330
applicant pipeline, 326–329
attendance, 335
bus matrix, 325–326
change tracking, 330
course registrations, 330–333
facility use, 334
instructors, multiple, 333
metrics, artifi cial count, 331–332
research grant proposal, 329
student dimension, 330–332
term dimension, 330
e ective date, SCD type 2, 152–153
EHR (electronic health record), 341
electronic commerce case study, 353–372
embedded managers key (HR), 272–273
embedding attribute meaning, 85
employee hierarchies, recursive, 271–272
employee profi les, 263–265
dimension change reasons, 266–267
e ective time, 265–266
expiration, 265–266
fact events, 267
type 2 attributes, 267
EMRs (electronic medical records),
healthcare case study, 341, 348
enterprise data warehouse bus architecture,
22, 52, 123–125
enterprise data warehouse bus matrix, 52,
125–126
columns, 126
hierarchy levels, 129
common mistakes, 128–129
opportunity/stakeholder matrix, 127
procurement, 142–143
retrofi tting existing models, 129–130
rows
narrowly defi ned, 128
overly encompassing, 128
overly generalized, 129
shrunken conformed dimensions, 134
uses, 126–127
ERDs (entity-relationship diagrams), 8
error event schema, ETL system, 458–460
error event schemas, 68
ETL (extract, transformation, and load)
system, 19–21, 443
archiving, 447–448
BI, delivery, 448
business needs, 444
cleaning and conforming, 450
audit dimension assembler, 460
conforming system, 461–463
data cleansing system, 456–458
data quality, improvement, 455–456
deduplication system, 460–461
error event schema, 458–460
compliance, 445
data integration, 446
data latency, 447
data propagation manager, 482
data quality, 445
delivering, 450, 463
aggregate builder, 481
dimension manager system, 479–480
fact provider system, 480–481
fact table builders, 473–475
hierarchy manager, 470
late arriving data handler, 478–479
multivalued dimension bridge table
builder, 477–478
SCD manager, 464–468
special dimensions manager, 470–473
surrogate key generator, 469–470
surrogate key pipeline, 475–477
design, 443
Lifecycle data track, 422
developer, 409
development, 498
activities, 500
aggregate tables, 519
default strategies, 500
drill down, 500–501
high-level plan, 498
incremental processing, 512–519
OLAP loads, 519
one-time historic load data, 503–512
specifi cation document, 502–503
system operation and automation, 520
tools, 499
ETL architect/designer, 409
extracting, 450
CDC (change data capture), 451–453
data profi ling, 450–451
extract system, 453–455
legacy licenses, 449
lineage, 447–448
managing, 450, 483
backup system, 485–495
job scheduler, 483–484
OLAP cube builder, 481–482
process overview, 497
security, 446
skills, 448
subsystems, 449
event dimension, clickstream data, 359
expiration date, type 2 SCD, 152–153
extended allowance amount (P&L
statement), 190
extended discount amount (P&L statement),
190
extended distribution cost (P&L statement),
191
extended fi xed manufacturing cost (P&L
statement), 190
extended gross amount (P&L statement),
189
extended net amount (P&L statement), 190
extended storage cost (P&L statement), 191
extended variable manufacturing cost (P&L
statement), 190
extensibility in dimensional modeling, 16
extracting, ETL systems, 450
CDC (change data capture), 451
audit columns, 452
di compare, 452
log scraping, 453
message queue monitoring, 453
timed extracts, 452
data profi ling, 450–451
extract system, 453–455
extraction, 19
extract system, ETL system, 453–455
F
fact extractors, 530
big data and, 534
factless fact tables, 44, 97–98, 176
accidents (insurance case study), 396
admissions (education case study), 330
attendance (education case study), 335
course registration (education case study),
330–333
facility use (education case study), 334
order management case study, 176
fact provider system
ETL system, 480–481
facts, 10, 12, 72, 79
abnormal scenario indicators, 255–256
accumulating snapshots, 44, 121–122,
326–329
additive facts, 11, 42
aggregate, 45
as attributes, 64
clickstream data, 366–367
CRM and customer dimension, 239–240
allocated facts, 60
allocating, 184–186
behavior tags, 241
budget, 210
builders, ETL systems, 473–475
centipede, 58, 108–109
compliance-enabled, 494
composite keys, 12
conformed, 42, 138–139
consolidated, 45
currency, multiple, 60
derived, 77–78
detailed dimension model, 437
dimensional modeling process and, 40
drill across, 130–131
employee profi les, 267
enhanced, 115–116
FK (foreign keys), 12
grains, 10, 12
granularity, airline bus matrix, 312–315
header/line fact tables, 59
historic, 508
incremental processing, 515, 519
invoice, 187–188
joins, avoiding, 259–260
lag/duration facts, 59
late arriving, 62
loading, 512
mini-dimension demographics key, 158
multiple units of measure, 61
non-additive, 42, 78
normalization, order transactions, 169–170
null, 42, 92
numeric facts, 11
numeric values, 59, 85–86
page event, clickstream data, 363–366
partitioning, smart keys, 102
pay-in-advance, insurance case study,
386–387
periodic snapshots, 43, 120–122
policy transactions (insurance case study),
383
profitability, 370–372
profit and loss, 189–192
profit and loss, allocations and, 60
real-time, 68
referential integrity, 12
reports, 17
retail sales case study, identifying, 76–79
satisfaction indicators, 254–255
semi-additive, 42, 114–115
service level performance, 188–189
session, clickstream data, 361–363
set difference, 97
shrunken rollup dimensions, 132
single granularity and, 301
snapshot, complementary procurement,
147
structure, 41–42
subtype, 67, 293–295
supertype, 67, 293–295
surrogate keys, 58, 102–103
textual facts, 12
terminology, 15
time-of-day, 83
timespan, 252–254
timespan tracking, 62
transactions, 43, 120
dates, 170–171
single versus multiple, 143–145
transformations, 509–512
value banding, 291–292
year-to-date, 206
YTD (year-to-date), 61
fact-to-fact joins, avoiding with multipass
SQL, 61
feasibility in Lifecycle planning, 407
financial services case study, 281
bus matrix, 282
dimensions
hot-swappable, 296
household, 286–287
mini-dimensions, 289–291
multivalued, weighting, 287–289
too few, 283–286
facts, value banding, 291–292
heterogeneous products, 293–295
OLAP, 226
user perspective, 293
financial statements (G/L), 209–210
fiscal calendar, G/L (general ledger), 208
fixed depth position hierarchies, 56, 214
fixed time series buckets, date dimensions
and, 302–303
FK (foreign keys). See foreign keys (FK), 12
flags
as textual attributes, 48
dimension tables, 82
junk dimensions and, 179–180
flattened dimensions, denormalized, 47
flexible access to information, 407
foreign keys (FK)
demographics dimensions, 291
fact tables, 12
managers employee key as, 271–272
mini-dimension keys, 158
null, 92
order transactions, 170
referential integrity, 12
forum, Lifecycle business requirements,
410–411
frequent shopper program, retail sales
schema, 96
FROM clause, 18
G
GA (Google Analytics), 367
general ledger. See G/L (general ledger), 203
generic dimensions, abstract, 66
geographic location dimension, 310
G/L (general ledger), 203
chart of accounts, 203–204
currencies, multiple, 206
financial statements, 209–210
fiscal calendar, multiple, 208
hierarchies, drill down, 209
journal entries, 206–207
period close, 204–206
periodic snapshot, 203
year-to-date facts, 206
GMT (Greenwich Mean Time), 323
goals of DW/BI, 3–4
Google Analytics (GA), 367
governance
business-driven, 136–137
objectives, 137
grain, 39
accumulating snapshots, 44
atomic grain data, 74
budget fact table, 210
conformed dimensions, 132
declaration, 71
retail sales case study, 74–75
dimensions, hierarchies and, 301–302
fact tables, 10
accumulating snapshot, 12
periodic snapshot, 12
transaction, 12
periodic snapshots, 43
single, facts and, 301
transaction fact tables, 43
granularity, 300
GROUP BY clause, 18
growth
Lifecycle, 425–426
market growth, 90
H
Hadoop, MapReduce/Hadoop, 530
HCPCS (Healthcare Common Procedure
Coding System), 342
HDFS (Hadoop distributed file system), 530
headcount periodic snapshot, 267–268
header/line fact tables, 59
header/line patterns, 181–182, 186
healthcare case study, 339–340
billing, 342–344
claims, 342–344
date dimension, 345
diagnosis dimension, 345–347
EMRs (electronic medical records), 341,
348
HCPCS (Healthcare Common Procedure
Coding System), 342
HIPAA (Health Insurance Portability and
Accountability Act), 341
ICD (International Classification of
Diseases), 342
images, 350
inventory, 351
measure type dimension, 349–350
payments, 342–344
retroactive changes, 351–352
subtypes, 347–348
supertypes, 347–348
text comments, 350
heterogeneous products, 293–295
hierarchies
accounting case study, 214–223
customer dimension, 174–175, 244–245
dimension granularity, 301–302
dimension tables, multiple, 88–89
drill down, ETL development, 501
employees, 271–272
ETL systems, 470
fixed-depth positional hierarchies, 56
G/L (general ledger), drill down, 209
management, drilling up/down, 273–274
many-to-one, 84–85
matrix columns, 129
multiple, 48
nodes, 215
ragged/variable depth, 57
slightly ragged/variable depth, 57
trees, 215–216
high performance backup, 485
HIPAA (Health Insurance Portability and
Accountability Act), 341
historic fact tables
extracts, 508
statistics audit, 508
historic load data, ETL development,
503–512
dimension table population, 503–506
holiday indicator, 82
hot response cache, 238
hot swappable dimensions, 66, 296
household dimension, 286–287
HR (human resources) case study, 263
bus matrix, 268–269
employee profiles, 263–265
dimension change reasons, 266–267
e ective time, 265–266
expiration, 265–266
fact events, 267
type 2 attributes, 267
hierarchies
management, 273–274
recursive, 271–272
managers key
as foreign key, 271–272
embedded, 272–273
packaged analytic solutions, 270–271
packaged data models, 270–271
periodic snapshots, headcount, 267–268
skill keywords, 274
bridge, 275
text string, 276–277
survey questionnaire, 277
text comments, 278
HTTP (Hypertext Transfer Protocol),
355–356
hub-and-spoke CIF architecture, 28–29
hub-and-spoke Kimball hybrid architecture,
29
human resources management case study.
See HR (human resources), 263
hybrid hub-and-spoke Kimball architecture,
29
hybrid techniques, SCDs, 159, 164
SCD type 5 (add mini-dimension and type
1 outrigger), 55, 160
SCD type 6 (add type 1 attributes to type 2
dimension), 56, 160–162
SCD type 7 (dual type 1 and type 2
dimension), 56, 162–163
hyperstructured data, 530
I
ICD (International Classification of
Diseases), 342
identical conformed dimensions, 131–132
images, healthcare case study, 350
impact reports, 288
incremental processing, ETL system
development, 512
changed dimension rows, 513–514
dimension attribute changes, 514
dimension table extracts, 513
fact tables, 515–519
new dimension rows, 513–514
in-database analytics, big data and, 537
independent data mart architecture, 26–27
indicators
abnormal, fact tables, 255–256
as textual attributes, 48
dimension tables, 82
junk dimensions and, 179–180
satisfaction, fact tables, 254–255
Inmon, Bill, 28–29
insurance case study, 375–377
accidents, factless fact tables, 396
accumulating snapshot, complementary
policy, 384–385
bus matrix, 378–389
detailed implementation, 390
claim transactions, 390
claim accumulating snapshot, 393–394
junk dimensions and, 392
periodic snapshot, 395–396
timespan accumulating snapshot,
394–395
conformed dimensions, 386
conformed facts, 386
dimensions, 380
audit, 383
degenerate, 383
low cardinality, 383
mini-dimensions, 381–382
multivalued, 382, 388
SCDs (slowly changing dimensions),
380–381
NAICS (North American Industry
Classification System), 382
numeric attributes, 382
pay-in-advance facts, 386–387
periodic snapshot, 385
policy transactions, 379–380, 383
premiums, periodic snapshot, 386–388
SIC (Standard Industry Classification), 382
supertype/subtype products, 384, 387
value chain, 377–378
integer keys, 98
sequential surrogate keys, 101
integration
conformed dimensions, 130–138
customer data, 256
customer dimension conformity, 258–259
single customer dimension, 256, 257, 258
dimensional modeling myths, 32
value chain, 122–123
international names/addresses, customer
dimension, 236–238
interviews, Lifecycle business requirements,
412–413
data-centric, 413–414
inventory case study, 112–114
accumulating snapshot, 118–119
fact tables, enhanced, 115–116
periodic snapshot, 112–114
semi-additive facts, 114–115
transactions, 116–118
inventory, healthcare case study, 351
invoice transaction fact table, 187–188
J
job scheduler, ETL systems, 483–484
job scheduling, ETL operation and
automation, 520
joins
dimension-to-dimension table joins, 62
fact tables, avoiding, 259–260
many-to-one-to-many, 259–260
multipass SQL to avoid fact-to-fact joins, 61
journal entries (G/L), 206–207
junk dimensions, 49, 179–180, 284
airline case study, 320
ETL systems, 470
insurance case study, 392
order management case study, 179–180
justification for program/project planning,
407
K
keys
dimension surrogate keys, 46
durable, 46
foreign, 92, 291
managers key (HR), 272–273
natural keys, 46, 98–101, 162
supernatural keys, 101
smart keys, 101–102
subtype tables, 294–295
supernatural, 46
supertype tables, 294–295
surrogate, 58, 98–100, 303
assigning, 506
degenerate dimensions, 101
ETL system, 475–477
fact tables, 102–103
generator, 469–470
lookup pipelining, 510–511
keywords, skill keywords, 274
bridge, 275
text string, 276–277
Kimball Dimensional Modeling Techniques.
See dimensional modeling
Kimball DW/BI architecture, 18
BI applications, 22
ETL (extract, transformation, and load)
system, 19–21
hub-and-spoke hybrid, 29
presentation area, 21–22
restaurant metaphor, 23–26
source systems, operational source systems,
18
Kimball Lifecycle, 404
DW/BI initiative and, 404
KPIs (key performance indicators), 139
L
lag calculations, 196–197
lag/duration facts, 59
late arriving data handler, ETL system,
478–479
late arriving dimensions, 67
late arriving facts, 62
launch, Lifecycle business requirements, 412
Law of Too, 407
legacy environments, big data management,
532
legacy licenses, ETL system, 449
Lifecycle
BI applications, 406
development, 423–424
specification, 423
business requirements, 405, 410
documentation, 414
forum selection, 410–411
interviews, 412–413
interviews, data-centric, 413–414
launch, 412
prioritization, 414–415
representatives, 411–412
team, 411
data, 405
dimensional modeling, 420
ETL design/development, 422
physical design, 420–422
deployment, 424
growth, 425–426
maintenance, 425–426
pitfalls, 426
products
evaluation matrix, 419
market research, 419
prototypes, 419
program/project planning, 405–406
business motivation, 407
business sponsor, 406
development, 409–410
feasibility, 407
justification, 407
planning, 409–410
readiness assessment, 406–407
scoping, 407
sta ng, 408–409
technical architecture, 405, 416–417
implementation phases, 418
model creation, 417
plan creation, 418
requirements, 417
requirements collection, 417
subsystems, 418
task force, 417
lift, promotion, 89
lights-out operations, backup, 485
limited conformed dimensions, 135
lineage analysis, 495
lineage, ETL system, 447–448, 490–491
loading fact tables, incremental, 517
localization, 237, 324
location, geographic location dimension, 310
log scraping, CDC (change data capture),
453
low cardinality dimensions, insurance case
study, 383
low latency data, CRM and, 260–261
M
maintenance, Lifecycle, 425–426
management
ETL systems, 450, 483
backup system, 485–495
job scheduler, 483–484
management best practices, big data
analytics, 531
legacy environments, 532
sandbox results, 532–533
sunsetting and, 533
management hierarchies, drilling up/down,
273–274
managers, publishing metaphor, 5–7
many-to-one hierarchies, 84–85
many-to-one relationships, 175–176
many-to-one-to-many joins, 259–260
MapReduce/Hadoop, 530
market growth, 90
master dimensions, 130
MDM (master data management), 137, 256,
446
meaningless keys, 98
measurement, multiple, 61
measure type dimension, 65
healthcare case study, 349–350
message queue monitoring, CDC (change
data capture), 453
metadata coordinator, 409
metadata repository, ETL system, 495
migration, version migration system, ETL,
488
milestones, accumulating snapshots, 121
mini-dimension and type 1 outrigger (SCD
type 5), 160
mini-dimensions, 289–290
bridge tables, 290–291
ETL systems, 471
insurance case study, 381–382
type 4 SCD, 156–159
modeling
benefits of thinking dimensionally, 32–33
dimensional, 7–12
atomic grain data, 17
dimension tables, 13–15
extensibility, 16
myths, 30–32
reports, 17
simplicity in, 16
terminology, 15
multipass SQL, avoiding fact-to-fact table
joins, 61
multiple customer dimension, partial
conformity, 258–259
multiple units of measure, 61, 197–198
multivalued bridge tables
CRM and, 245–246
time varying, 63
multivalued dimensions
bridge table builder, 477–478
bridge tables and, 63
CRM and, 245–247
education case study, 325–333
financial services case study, 287–289
healthcare case study, 345–348
HR (human resources) case study, 274–275
insurance case study, 382–388
weighting factors, 287–289
myths about dimensional modeling, 30
departmental versus enterprise, 31
integration, 32
predictable use, 31–32
scalability, 31
summary data, 30
N
names
ASCII, 236
CRM and, customer dimension, 233–238
Unicode, 236–238
name-value pairs, 540
naming conventions, 433
natural keys, 46, 98–101, 162
supernatural keys, 101
NCOA (national change of address), 257
nodes (hierarchies), 215
non-additive facts, 42, 78
non-natural keys, 98
normalization, 28, 301
facts
centipede, 108–109
order transactions, 169–170
outriggers, 106–107
snowflaking, 104–106
normalized 3NF structures, 8
null attributes, 48
null fact values, 509
null values
fact tables, 42
foreign keys, 92
number attributes, insurance case study, 382
numeric facts, 11
numeric values
as attributes, 59, 85–86
as facts, 59, 85–86
O
o -invoice allowance (P&L) statement, 190
OLAP (online analytical processing) cube,
8, 40
accounting case study, 226
accumulating snapshots, 121–122
aggregate, 45
cube builder, ETL system, 481–482
deployment considerations, 9
employee data queries, 273
financial schemas, 226
Lifecycle data physical design, 421
loads, ETL system, 519
what didn’t happen, 335
one-to-one relationships, 175–176
operational processing versus data
warehousing, 2
operational product master, product
dimensions, 173
operational source systems, 18
operational system users, 2
opportunity/stakeholder matrix, 53, 127
order management case study, 167–168
accumulating snapshot, 194–196
type 2 dimensions and, 196
allocating, 184–186
audit dimension, 192–193
bus matrix, 168
currency, multiple, 182–184
customer dimension, 174–175
factless fact tables, 176
single versus multiple dimension tables,
175–176
date, 170–171
foreign keys, 170
role playing, 171
deal dimension, 177–178
degenerate dimension, order number and,
178–179
fact normalization, 169–170
header/line patterns, 181–186
junk dimensions, 179–180
product dimension, 172–173
order number, degenerate dimensions,
178–179
order management case study, role playing,
171
origin dimension (airline case study),
320–321
OR, skill keywords bridge, 275
outrigger dimensions, 50, 89, 106–107
calendars as, 321–323
low cardinality attribute set and, 243–244
type 5 and type 1 SCD, 160
overwrite (type 1 SCD), 54, 149–150
add to type 2 attribute, 160–162
type 2 in same dimension, 153
P
packaged analytic solutions, 270–271
packaged data models, 270–271
page dimension, clickstream data, 358–359
page event fact table, clickstream data,
363–366
parallelizing/pipelining system, 492
parallel processing, fact tables, 518
parallel structures, fact tables, 519
parent/child schemas, 59
parent/child tree structure hierarchy, 216
partitioning
fact tables, smart keys, 102
real-time processing, 524–525
passenger dimension, airline case study, 314
pathstring, ragged/variable depth hierarchies,
57
pay-in-advance facts, insurance case study,
386–387
payment method, retail sales, 93
performance measurement, fact tables, 10, 12
additive facts, 11
grains, 10–12
numeric facts, 11
textual facts, 12
period close (G/L), 204–206
periodic snapshots, 43, 112–114
education case study, 329, 333
ETL systems, 474
fact tables, 120–121
complementary fact tables, 122
G/L (general ledger), 203
grain fact tables, 12
headcount, 267–268
healthcare case study, 342
insurance case study, 385
claims, 395–396
premiums, 386–387
inventory case study, 112–114
procurement case study, 147
perspectives of business users, 293
physical design, Lifecycle data track, 420
aggregations, 421
database model, 421
database standards, 420
index plan, 421
naming standards, 420–421
OLAP database, 421
storage, 422
pipelining system, 492
planning, demand planning, 142
P&L (profit and loss) statement
contribution, 189–191
granularity, 191–192
policy transactions (insurance case study),
379–380
fact table, 383
PO (purchase orders), 142
POS (point-of-sale) system, 73
POS schema, retail sales case study, 94
transaction numbers, 93–94
presentation area, 21–22
prioritization, Lifecycle business
requirements, 414–415
privacy, data governance and, 541–542
problem escalation system, 491–492
procurement case study, 141–142
bus matrix, 142–143
snapshot fact table, 147
transactions, 142–145
product dimension, 83–84
attributes with embedded meaning, 85
characteristics, 172–173
drilling down, 86–87
many-to-one hierarchies, 84–85
numeric values, 85–86
operational product master, 173
order transactions, 172–173
operational product master, 173
production codes, decoding, 504
products
heterogeneous, 293–295
Lifecycle
evaluation matrix, 419
market research, 419
prototypes, 419
profit and loss facts, 189–191, 370–372
allocations and, 60
granularity, 191–192
program/project planning (Lifecycle),
405–406
business motivation, 407
business sponsor, 406
development, 409–410
feasibility, 407
justifi cation, 407
planning, 409–410
readiness assessment, 406–407
scoping, 407
sta ng, 408–409
task list, 409
project manager, 409
promotion dimension, 89–91
null values, 92
promotion lift, 89
prototypes
big data and, 536
Lifecycle, 419
publishing metaphor for DW/BI managers,
5–7
Q
quality events, responses, 458
quality screens, ETL systems, 457–458
questionnaire, HR (human resources), 277
text comments, 278
R
ragged hierarchies
alternative modeling approaches, 221–223
bridge table approach, 223
modifying, 220–221
pathstring attributes, 57
shared ownership, 219
time varying, 220
variable depth, 215–217
rapidly changing monster dimension, 55
RDBMS (relational database management
system), 40
architecture extension, 529–530
blobs, 530
fact extractor, 530
hyperstructured data, 530
real-time fact tables, 68
real-time processing, 520–522
architecture, 522–524
partitions, 524–525
rearview mirror metrics, 198
recovery and restart system, ETL system,
486–488
recursive hierarchies, employees, 271–272
reference dimensions, 130
referential integrity, 12
referral dimension, clickstream data, 360
relationships
dimension tables, 15
many-to-one, 175–176
many-to-one-to-many joins, 259–260
one-to-one, 175–176
validation, 504–505
relative date attributes, 82–83
remodeling existing data structures, 309
reports
correctly weighted, 288
dimensional models, 17
dynamic value banding, 64
fact tables, 17
impact, 288
value band reporting, 291–292
requirements for dimensional modeling, 432
restaurant metaphor for Kimball architecture,
23–26
retail sales case study, 72–73, 92
business process selection, 74
dimensions, selecting, 76
facts, 76–77
derived, 77–78
non-additive, 78
fact tables, 79
frequent shopper program, 96
grain declaration, 74–75
payment method, 93
POS (point-of-sale) system, 73
POS schema, 94
retail schema extensibility, 95–97
SKUs, 73
retain original (SCD type 0), 54, 148–149
retrieval, 485–486
retroactive changes, healthcare case study,
351–352
reviewing dimensional model, 440, 441
RFI measures, 240
RFP (request for proposal), 419
role playing, dimensions, 49, 89, 171, 284
airline case study, 313
bus matrix and, 171
healthcare case study, 345
insurance case study, 380
order management case study, 170
S
sales channel dimension, airline case study,
315
sales reps, factless fact tables, 176
sales transactions, web profitability and,
370–372
sandbox results, big data management,
532–533
sandbox source system, ETL development,
503
satisfaction indicators in fact tables, 254–255
scalability, dimensional modeling myths, 31
SCDs (slowly changing dimensions), 53, 148,
464–465
big data and, 539
detailed dimension model, 437
hybrid techniques, 159–164
insurance case study, 380–381
type 0 (retain original), 54, 148–149
type 1 (overwrite), 54, 149–150
ETL systems, 465
type 2 in same dimension, 153
type 2 (add new row), 54, 150–152
accumulating snapshots, 196
customer counts, 243
e ective date, 152–153
ETL systems, 465–466
expiration date, 152–153
type 1 in same dimension, 153
type 3 (add new attribute), 55, 154–155
ETL systems, 467
multiple, 156
type 4 (add mini-dimension), 55, 156–159
ETL systems, 467
type 5 (add mini-dimension and type 1
outrigger), 55, 160
ETL systems, 468
type 6 (add type 1 attributes to type 2
dimension), 56, 160–162
ETL systems, 468
type 7 (dual type 1 and type 2 dimension),
56, 162–164
ETL systems, 468
scheduling jobs, ETL operation and
automation, 520
scoping for program/project planning, 407
scoring, CRM and customer dimension,
240–243
screening
ETL systems
business rule screens, 458
column screens, 457
structure screens, 457
quality screens, 457–458
security, 495
ETL system, 446, 492–493
goals, 4
segmentation, CRM and customer dimension,
240–243
segments, airline bus matrix granularity, 313
linking to trips, 315–316
SELECT statement, 18
semi-additive facts, 42, 114–115
sequential behavior, step dimension, 65,
251–252
sequential integers, surrogate keys, 101
service level performance, 188–189
session dimension, clickstream data, 359–360
session fact table, clickstream data, 361–363
session IDs, clickstream data, 355–356
set difference, 97
shared dimensions, 130
shipment invoice fact table, 188
shrunken dimensions, 51
conformed
attribute subset, 132
on bus matrix, 134
row subsets and, 132–134
rollup, 132
subsets, ETL systems, 472
simple administration backup, 485
simple data transformation, dimensions, 504
single customer dimension, data integration
and, 256–258
single granularity, facts and, 301
single version of the truth, 407
skill keywords, 274
bridge, 275
AND queries, 275
OR queries, 275
text string, 276–277
skills, ETL system, 448
SKUs (stock keeping units), 73
slightly ragged/variable depth hierarchies, 57
slowly changing dimensions. See SCDs, 148
smart keys
date dimensions, 101–102
fact tables, partitioning, 102
snapshots
accumulating, 44, 118–119, 194–196
claims (insurance case study), 393–395
education case study, 326
ETL systems, 475
fact tables, 121–122, 326–329
fact tables, complementary, 122
healthcare case study, 343
inventory case study, 118–119
order management case study, 194–196
procurement case study, 147
type 2 dimensions and, 196
incremental processing, 517
periodic, 43
education case study, 329
ETL systems, 474
fact tables, 120–121
fact tables, complementary, 122
G/L (general ledger), 203
headcounts, 267–268
insurance case study, 385, 395–396
inventory case study, 112–114
premiums (insurance case study),
386–388
snowflaking, 15, 50, 104–106, 470
outriggers, 106–107
social media, CRM (customer relationship
management) and, 230
sorting
ETL, 490
international information, 237
source systems, operational, 18
special dimensions manager, ETL systems,
470
date/time dimensions, 470
junk dimensions, 470
mini-dimensions, 471
shrunken subset, 472
static, 472
user-maintained, 472–473
specification document, ETL development,
502–503
sandbox source system, 503
SQL multipass to avoid fact-to-fact table
joins, 61
sta ng for program/project planning,
408–409
star joins, 16
star schemas, 8, 40
static dimensions
ETL systems, 472
population, 508
statistics, historic fact table audit, 508
status dimensions, 284
step dimension, 65
clickstream data, 366
sequential behavior, 251–252
stewardship, 135–136
storage, Lifecycle data, 422
store dimension, 87–89
strategic business initiatives, 70
streaming data, big data and, 536
strings, skill keywords, 276–277
structure screens, 457
student dimension (education case study),
330
study groups, behavior, 64
subsets, shrunken subset dimensions, 472
subtypes, 293–294
fact tables
keys, 294–295
supertype common facts, 295
healthcare case study, 347–348
insurance case study, 384, 387
schemas, 67
summary data, dimensional modeling and, 30
sunsetting, big data management, 533
supernatural keys, 46, 101
supertypes
fact tables, 293–294
keys, 294–295
subtype common facts, 295
healthcare case study, 347–348
insurance case study, 384–387
schemas, 67
surrogate keys, 58, 98–100, 303
assignment, 506
degenerate dimensions, 101
dimension tables, 98–100
ETL system, 475–477
generator, 469–470
fact tables, 102–103
fact table transformations, 516
late arriving facts, 517
lookup pipelining, 510–511
survey questionnaire (HR), 277
text comments, 278
synthetic keys, 98
T
tags, behavior, in time series, 63
team building, Lifecycle business
requirements, 411
representatives, 411–412
technical application design/development
(Lifecycle), 406
technical architect, 409
technical architecture (Lifecycle), 405,
416–417
architecture implementation phases, 418
model creation, 417
plan creation, 418
requirements
collection, 417
documentation, 417
requirements collection, 417
subsystems, 418
task force, 417
telecommunications case study, 297–299
term dimension (education case study), 330
text comments
dimensions, 65
healthcare case study, 350
text strings, skill keywords, 276–277
text, survey questionnaire (HR) comments,
278
textual attributes, dimension tables, 82
textual facts, 12
The Data Warehouse Toolkit (Kimball), 2, 80
third normal form (3NF) models, 7
entity-relationship diagrams (ERDs), 8
normalized 3NF structures, 8
time
GMT (Greenwich Mean Time), 323
UTC (Coordinated Universal Time), 323
timed extracts, CDC (change data capture),
452
time dimension, 80
clickstream data, 361–362
timeliness goals, 4
time-of-day
dimension, 83
fact, 83
time series
behavior tags, 63, 240–242
fixed time series buckets, date dimensions
and, 302–303
time shifting, 90
timespan fact tables, 252–254
dual date/time stamps, 254
timespan tracking in fact tables, 62
time varying multivalued bridge tables, 63
time zones
airline case study, 323
GMT (Greenwich Mean Time), 323
multiple, 65
number of, 323
UTC (Coordinated Universal Time), 323
tools
dimensional modeling, 432
data profi ling tools, 433
ETL development, 499
transactions, 43, 120, 179
claim transactions (insurance case study),
390
claim accumulating snapshot, 393–394
junk dimensions and, 392
periodic snapshot, 395–396
timespan accumulating snapshot,
394–395
fact tables, 12, 143–145
healthcare case study, 342
inventory transactions, 116–118
invoice transactions, 187–188
journal entries (G/L), 206–207
numbers, degenerate dimensions, 93–94
order management case study
allocating, 184–186
date, 170–171
deal dimension, 177–178
degenerate dimension, 178–179
header/line patterns, 181–182, 186
junk dimensions, 179–180
product dimension, 172–173
order transactions, 168
audit dimension, 192–193
customer dimension, 174–176
fact normalization, 169–170
multiple currency, 182–184
policies (insurance case study), 379–380
procurement, 142–143
transaction profile dimension, 49, 179
transportation, 311
airline case study, 311–323
cargo shipper schema, 317
localization and, 324
travel services flight schema, 317
travel services flight schema, 317
trees (hierarchies), 215
parent/child structure, 216
type 0 (retain original) SCD, 54
retain original, 148–149
type 1 (overwrite) SCD, 54
add to type 2 dimension, 160–162
ETL system, 465
overwrite, 149–150
type 2 in same dimension, 153
type 2 (add new row) SCD, 54, 150–152
accumulating snapshots, 196
customer counts, 243
e ective date, 152–153
employee profi le changes, 267
ETL system, 465–466
expiration date, 152–153
type 1 in same dimension, 153
type 3 (add new attribute) SCD, 55, 154–155
ETL system, 467
multiple, 156
type 4 (add mini-dimension) SCD, 55,
156–159
ETL system, 467
type 5 (add mini-dimension and type 1
outrigger) SCD, 55, 160
ETL system, 468
type 6 (add type 1 attributes to type 2
dimension) SCD, 56, 160–162
ETL system, 468
type 7 (dual type 1 and type 2 dimension)
SCD, 56, 162–163
as of reporting, 164
ETL system, 468
U
Unicode, 236–238
uniform chart of accounts, 204
units of measure, multiple, 197–198
updates, accumulating snapshots, 121–122
user-maintained dimensions, ETL systems,
472–473
UTC (Coordinated Universal Time), 323
V
validating dimension model, 440–441
validation, relationships, 504–505
value band reporting, 291–292
value chain, 52
insurance case study, 377–378
integration, 122–123
inventory case study, 111–112
variable depth hierarchies
pathstring attributes, 57
ragged, 215–217
slightly ragged, 214–215
variable depth/ragged hierarchies with
bridge tables, 57
variable depth/slightly ragged hierarchies,
57
version control, 495
ETL system, 488
version migration system, ETL system, 488
visitor identification, web sites, 356–357
W
weekday indicator, 82
WHERE clause, 18
workflow monitor, ETL system, 489–490
workshops, dimensional modeling, 38
X–Y–Z
YTD (year-to-date) facts, 61
G/L (general ledger), 206
www.kimballgroup.com
Learn More. Get More. Do More.
The Kimball Group is the source for dimensional data
warehouse and business intelligence consulting and education.
After all, we wrote the books!
Subscribe to Kimball DESIGN TIPS for practical,
reliable guidance
Attend KIMBALL UNIVERSITY for courses consistent with
the instructors’ best-selling Toolkit books
Work with Kimball CONSULTANTS to leverage our
decades of real-world experience
Visit www.kimballgroup.com for more information.
KIMBALL GROUP
Consulting | Kimball University