
The Data Warehouse Toolkit
Second Edition
The Complete Guide to Dimensional Modeling
Ralph Kimball
Margy Ross
Wiley Computer Publishing
John Wiley & Sons, Inc.
Publisher: Robert Ipsen
Editor: Robert Elliott
Assistant Editor: Emilie Herman
Managing Editor: John Atkins
Associate New Media Editor: Brian Snapp
Text Composition: John Wiley Composition Services
Designations used by companies to distinguish their products are often claimed as trade-
marks. In all instances where John Wiley & Sons, Inc., is aware of a claim, the product names
appear in initial capital or ALL CAPITAL LETTERS. Readers, however, should contact the
appropriate companies for more complete information regarding trademarks and registration.
This book is printed on acid-free paper.
Copyright © 2002 by Ralph Kimball and Margy Ross. All rights reserved.
Published by John Wiley and Sons, Inc.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system or transmitted
in any form or by any means, electronic, mechanical, photocopying, recording, scanning
or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States
Copyright Act, without either the prior written permission of the Publisher, or authoriza-
tion through payment of the appropriate per-copy fee to the Copyright Clearance Center,
222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4744. Requests
to the Publisher for permission should be addressed to the Permissions Department,
John Wiley & Sons, Inc., 605 Third Avenue, New York, NY 10158-0012, (212) 850-6011, fax
(212) 850-6008, E-Mail: PERMREQ@WILEY.COM.
This publication is designed to provide accurate and authoritative information in regard to
the subject matter covered. It is sold with the understanding that the publisher is not
engaged in rendering professional services. If professional advice or other expert assistance is
required, the services of a competent professional person should be sought.
Library of Congress Cataloging-in-Publication Data:
Kimball, Ralph.
The data warehouse toolkit : the complete guide to dimensional modeling /
Ralph Kimball, Margy Ross. — 2nd ed.
p. cm.
“Wiley Computer Publishing.”
Includes index.
ISBN 0-471-20024-7
1. Database design. 2. Data warehousing. I. Ross, Margy, 1959– II. Title.
QA76.9.D26 K575 2002
658.4'038'0285574—dc21 2002002284
Printed in the United States of America.
10 9 8 7 6 5 4 3 2 1
Acknowledgments xv
Introduction xvii
Chapter 1 Dimensional Modeling Primer 1
Different Information Worlds 2
Goals of a Data Warehouse 2
The Publishing Metaphor 4
Components of a Data Warehouse 6
Operational Source Systems 7
Data Staging Area 8
Data Presentation 10
Data Access Tools 13
Additional Considerations 14
Dimensional Modeling Vocabulary 16
Fact Table 16
Dimension Tables 19
Bringing Together Facts and Dimensions 21
Dimensional Modeling Myths 24
Common Pitfalls to Avoid 26
Summary 27
Chapter 2 Retail Sales 29
Four-Step Dimensional Design Process 30
Retail Case Study 32
Step 1. Select the Business Process 33
Step 2. Declare the Grain 34
Step 3. Choose the Dimensions 35
Step 4. Identify the Facts 36
Dimension Table Attributes 38
Date Dimension 38
Product Dimension 42
Store Dimension 45
Promotion Dimension 46
Degenerate Transaction Number Dimension 50
Retail Schema in Action 51
Retail Schema Extensibility 52
Resisting Comfort Zone Urges 54
Dimension Normalization (Snowflaking) 55
Too Many Dimensions 57
Surrogate Keys 58
Market Basket Analysis 62
Summary 65
Chapter 3 Inventory 67
Introduction to the Value Chain 68
Inventory Models 69
Inventory Periodic Snapshot 69
Inventory Transactions 74
Inventory Accumulating Snapshot 75
Value Chain Integration 76
Data Warehouse Bus Architecture 78
Data Warehouse Bus Matrix 79
Conformed Dimensions 82
Conformed Facts 87
Summary 88
Chapter 4 Procurement 89
Procurement Case Study 89
Procurement Transactions 90
Multiple- versus Single-Transaction Fact Tables 91
Complementary Procurement Snapshot 93
Slowly Changing Dimensions 95
Type 1: Overwrite the Value 95
Type 2: Add a Dimension Row 97
Type 3: Add a Dimension Column 100
Hybrid Slowly Changing Dimension Techniques 102
Predictable Changes with Multiple Version Overlays 102
Unpredictable Changes with Single Version Overlay 103
More Rapidly Changing Dimensions 105
Summary 105
Chapter 5 Order Management 107
Introduction to Order Management 108
Order Transactions 109
Fact Normalization 109
Dimension Role-Playing 110
Product Dimension Revisited 111
Customer Ship-To Dimension 113
Deal Dimension 116
Degenerate Dimension for Order Number 117
Junk Dimensions 117
Multiple Currencies 119
Header and Line Item Facts with Different Granularity 121
Invoice Transactions 122
Profit and Loss Facts 124
Profitability—The Most Powerful Data Mart 126
Profitability Words of Warning 127
Customer Satisfaction Facts 127
Accumulating Snapshot for the Order Fulfillment Pipeline 128
Lag Calculations 130
Multiple Units of Measure 130
Beyond the Rear-View Mirror 132
Fact Table Comparison 132
Transaction Fact Tables 133
Periodic Snapshot Fact Tables 134
Accumulating Snapshot Fact Tables 134
Designing Real-Time Partitions 135
Requirements for the Real-Time Partition 136
Transaction Grain Real-Time Partition 136
Periodic Snapshot Real-Time Partition 137
Accumulating Snapshot Real-Time Partition 138
Summary 139
Chapter 6 Customer Relationship Management 141
CRM Overview 142
Operational and Analytical CRM 143
Packaged CRM 145
Customer Dimension 146
Name and Address Parsing 147
Other Common Customer Attributes 150
Dimension Outriggers for a Low-Cardinality Attribute Set 153
Large Changing Customer Dimensions 154
Implications of Type 2 Customer Dimension Changes 159
Customer Behavior Study Groups 160
Commercial Customer Hierarchies 161
Combining Multiple Sources of Customer Data 168
Analyzing Customer Data from Multiple Business Processes 169
Summary 170
Chapter 7 Accounting 173
Accounting Case Study 174
General Ledger Data 175
General Ledger Periodic Snapshot 175
General Ledger Journal Transactions 177
Financial Statements 180
Budgeting Process 180
Consolidated Fact Tables 184
Role of OLAP and Packaged Analytic Solutions 185
Summary 186
Chapter 8 Human Resources Management 187
Time-Stamped Transaction Tracking in a Dimension 188
Time-Stamped Dimension with Periodic Snapshot Facts 191
Audit Dimension 193
Keyword Outrigger Dimension 194
AND/OR Dilemma 195
Searching for Substrings 196
Survey Questionnaire Data 197
Summary 198
Chapter 9 Financial Services 199
Banking Case Study 200
Dimension Triage 200
Household Dimension 204
Multivalued Dimensions 205
Minidimensions Revisited 206
Arbitrary Value Banding of Facts 207
Point-in-Time Balances 208
Heterogeneous Product Schemas 210
Heterogeneous Products with Transaction Facts 215
Summary 215
Chapter 10 Telecommunications and Utilities 217
Telecommunications Case Study 218
General Design Review Considerations 220
Granularity 220
Date Dimension 222
Degenerate Dimensions 222
Dimension Decodes and Descriptions 222
Surrogate Keys 223
Too Many (or Too Few) Dimensions 223
Draft Design Exercise Discussion 223
Geographic Location Dimension 226
Location Outrigger 226
Leveraging Geographic Information Systems 227
Summary 227
Chapter 11 Transportation 229
Airline Frequent Flyer Case Study 230
Multiple Fact Table Granularities 230
Linking Segments into Trips 233
Extensions to Other Industries 234
Cargo Shipper 234
Travel Services 235
Combining Small Dimensions into a Superdimension 236
Class of Service 236
Origin and Destination 237
More Date and Time Considerations 239
Country-Specific Calendars 239
Time of Day as a Dimension or Fact 240
Date and Time in Multiple Time Zones 240
Summary 241
Chapter 12 Education 243
University Case Study 244
Accumulating Snapshot for Admissions Tracking 244
Factless Fact Tables 246
Student Registration Events 247
Facilities Utilization Coverage 249
Student Attendance Events 250
Other Areas of Analytic Interest 253
Summary 254
Chapter 13 Health Care 255
Health Care Value Circle 256
Health Care Bill 258
Roles Played By the Date Dimension 261
Multivalued Diagnosis Dimension 262
Extending a Billing Fact Table to Show Profitability 265
Dimensions for Billed Hospital Stays 266
Complex Health Care Events 267
Medical Records 269
Fact Dimension for Sparse Facts 269
Going Back in Time 271
Late-Arriving Fact Rows 271
Late-Arriving Dimension Rows 273
Summary 274
Chapter 14 Electronic Commerce 277
Web Client-Server Interactions Tutorial 278
Why the Clickstream Is Not Just Another Data Source 281
Challenges of Tracking with Clickstream Data 282
Specific Dimensions for the Clickstream 287
Clickstream Fact Table for Complete Sessions 292
Clickstream Fact Table for Individual Page Events 295
Aggregate Clickstream Fact Tables 298
Integrating the Clickstream Data Mart into the
Enterprise Data Warehouse 299
Electronic Commerce Profitability Data Mart 300
Summary 303
Chapter 15 Insurance 305
Insurance Case Study 306
Insurance Value Chain 307
Draft Insurance Bus Matrix 309
Policy Transactions 309
Dimension Details and Techniques 310
Alternative (or Complementary) Policy
Accumulating Snapshot 315
Policy Periodic Snapshot 316
Conformed Dimensions 316
Conformed Facts 316
Heterogeneous Products Again 318
Multivalued Dimensions Again 318
More Insurance Case Study Background 319
Updated Insurance Bus Matrix 320
Claims Transactions 322
Claims Accumulating Snapshot 323
Policy/Claims Consolidated Snapshot 324
Factless Accident Events 325
Common Dimensional Modeling Mistakes to Avoid 326
Summary 330
Chapter 16 Building the Data Warehouse 331
Business Dimensional Lifecycle Road Map 332
Road Map Major Points of Interest 333
Project Planning and Management 334
Assessing Readiness 334
Scoping 336
Justification 336
Staffing 337
Developing and Maintaining the Project Plan 339
Business Requirements Definition 340
Requirements Preplanning 341
Collecting the Business Requirements 343
Postcollection Documentation and Follow-up 345
Lifecycle Technology Track 347
Technical Architecture Design 348
Eight-Step Process for Creating the Technical Architecture 348
Product Selection and Installation 351
Lifecycle Data Track 353
Dimensional Modeling 353
Physical Design 355
Aggregation Strategy 356
Initial Indexing Strategy 357
Data Staging Design and Development 358
Dimension Table Staging 358
Fact Table Staging 361
Lifecycle Analytic Applications Track 362
Analytic Application Specification 363
Analytic Application Development 363
Deployment 364
Maintenance and Growth 365
Common Data Warehousing Mistakes to Avoid 366
Summary 369
Chapter 17 Present Imperatives and Future Outlook 371
Ongoing Technology Advances 372
Political Forces Demanding Security and Affecting Privacy 375
Conflict between Beneficial Uses and Insidious Abuses 375
Who Owns Your Personal Data? 376
What Is Likely to Happen? Watching the Watchers . . . 377
How Watching the Watchers Affects Data
Warehouse Architecture 378
Designing to Avoid Catastrophic Failure 379
Catastrophic Failures 380
Countering Catastrophic Failures 380
Intellectual Property and Fair Use 383
Cultural Trends in Data Warehousing 383
Managing by the Numbers
across the Enterprise 383
Increased Reliance on Sophisticated Key
Performance Indicators 384
Behavior Is the New Marquee Application 385
Packaged Applications Have Hit Their High Point 385
Application Integration Has to Be Done by Someone 386
Data Warehouse Outsourcing Needs a Sober Risk Assessment 386
In Closing 387
Glossary 389
Index 419
Acknowledgments
First of all, we want to thank the thousands of you who have read our Toolkit
books, attended our courses, and engaged us in consulting projects. We have
learned as much from you as we have taught. As a group, you have had a pro-
foundly positive impact on the data warehousing industry. Congratulations!
This book would not have been written without the assistance of our business
partners. We want to thank Julie Kimball of Ralph Kimball Associates for her
vision and determination in getting the project launched. While Julie was the
catalyst who got the ball rolling, Bob Becker of DecisionWorks Consulting
helped keep it in motion as he drafted, reviewed, and served as a general
sounding board. We are grateful to them both because they helped an enor-
mous amount.
We wrote this book with a little help from our friends, who provided input or
feedback on specific chapters. We want to thank Bill Schmarzo of Decision-
Works, Charles Hagensen of Attachmate Corporation, and Warren Thorn-
thwaite of InfoDynamics for their counsel on Chapters 6, 7, and 16, respectively.
Bob Elliott, our editor at John Wiley & Sons, and the entire Wiley team have
supported this project with skill, encouragement, and enthusiasm. It has been
a pleasure to work with them. We also want to thank Justin Kestelyn, editor-
in-chief at Intelligent Enterprise for allowing us to adapt materials from sev-
eral of Ralph’s articles for inclusion in this book.
To our families, thanks for being there for us when we needed you and for giv-
ing us the time it took. Spouses Julie Kimball and Scott Ross and children Sara
Hayden Smith, Brian Kimball, and Katie Ross all contributed a lot to this book,
often without realizing it. Thanks for your unconditional support.
Introduction
The data warehousing industry certainly has matured since Ralph Kimball published the first edition of The Data Warehouse Toolkit (Wiley) in 1996. Although
large corporate early adopters paved the way, since then, data warehousing
has been embraced by organizations of all sizes. The industry has constructed
thousands of data warehouses. The volume of data continues to grow as we
populate our warehouses with increasingly atomic data and update them with
greater frequency. Vendors continue to blanket the market with an ever-
expanding set of tools to help us with data warehouse design, development,
and usage. Most important, armed with access to our data warehouses, busi-
ness professionals are making better decisions and generating payback on
their data warehouse investments.
Since the first edition of The Data Warehouse Toolkit was published, dimen-
sional modeling has been broadly accepted as the dominant technique for data
warehouse presentation. Data warehouse practitioners and pundits alike have
recognized that the data warehouse presentation must be grounded in sim-
plicity if it stands any chance of success. Simplicity is the fundamental key that
allows users to understand databases easily and software to navigate data-
bases efficiently. In many ways, dimensional modeling amounts to holding the
fort against assaults on simplicity. By consistently returning to a business-
driven perspective and by refusing to compromise on the goals of user under-
standability and query performance, we establish a coherent design that
serves the organization’s analytic needs. Based on our experience and the
overwhelming feedback from numerous practitioners from companies like
your own, we believe that dimensional modeling is absolutely critical to a suc-
cessful data warehousing initiative.
Dimensional modeling also has emerged as the only coherent architecture for
building distributed data warehouse systems. When we use the conformed
dimensions and conformed facts of a set of dimensional models, we have a
practical and predictable framework for incrementally building complex data
warehouse systems that have no center.
For all that has changed in our industry, the core dimensional modeling tech-
niques that Ralph Kimball published six years ago have withstood the test of
time. Concepts such as slowly changing dimensions, heterogeneous products,
factless fact tables, and architected data marts continue to be discussed in data
warehouse design workshops around the globe. The original concepts have
been embellished and enhanced by new and complementary techniques. We
decided to publish a second edition of Kimball’s seminal work because we felt
that it would be useful to pull together our collective thoughts on dimensional
modeling under a single cover. We have each focused exclusively on decision
support and data warehousing for over two decades. We hope to share the
dimensional modeling patterns that have emerged repeatedly during the
course of our data warehousing careers. This book is loaded with specific,
practical design recommendations based on real-world scenarios.
The goal of this book is to provide a one-stop shop for dimensional modeling
techniques. True to its title, it is a toolkit of dimensional design principles and
techniques. We will address the needs of those just getting started in dimen-
sional data warehousing, and we will describe advanced concepts for those of
you who have been at this a while. We believe that this book stands alone in its
depth of coverage on the topic of dimensional modeling.
Intended Audience
This book is intended for data warehouse designers, implementers, and man-
agers. In addition, business analysts who are active participants in a ware-
house initiative will find the content useful.
Even if you’re not directly responsible for the dimensional model, we believe
that it is important for all members of a warehouse project team to be comfort-
able with dimensional modeling concepts. The dimensional model has an
impact on most aspects of a warehouse implementation, beginning with the
translation of business requirements, through data staging, and finally, to the
unveiling of a data warehouse through analytic applications. Due to the broad
implications, you need to be conversant in dimensional modeling regardless
whether you are responsible primarily for project management, business
analysis, data architecture, database design, data staging, analytic applica-
tions, or education and support. We’ve written this book so that it is accessible
to a broad audience.
For those of you who have read the first edition of this book, some of the famil-
iar case studies will reappear in this edition; however, they have been updated
significantly and fleshed out with richer content. We have developed vignettes
for new industries, including health care, telecommunications, and electronic
commerce. In addition, we have introduced more horizontal, cross-industry
case studies for business functions such as human resources, accounting, pro-
curement, and customer relationship management.
The content in this book is mildly technical. We discuss dimensional modeling
in the context of a relational database primarily. We presume that readers have
basic knowledge of relational database concepts such as tables, rows, keys,
and joins. Given that we will be discussing dimensional models in a non-
denominational manner, we won’t dive into specific physical design and
tuning guidance for any given database management systems.
Chapter Preview
The book is organized around a series of business vignettes or case studies. We
believe that developing the design techniques by example is an extremely
effective approach because it allows us to share very tangible guidance. While
not intended to be full-scale application or industry solutions, these examples
serve as a framework to discuss the patterns that emerge in dimensional mod-
eling. In our experience, it is often easier to grasp the main elements of a
design technique by stepping away from the all-too-familiar complexities of
one’s own applications in order to think about another business. Readers of
the first edition have responded very favorably to this approach.
The chapters of this book build on one another. We will start with basic con-
cepts and introduce more advanced content as the book unfolds. The chapters
are to be read in order by every reader. For example, Chapter 15 on insurance
will be difficult to comprehend unless you have read the preceding chapters
on retailing, procurement, order management, and customer relationship management.
Those of you who have read the first edition may be tempted to skip the first
few chapters. While some of the early grounding regarding facts and dimen-
sions may be familiar turf, we don’t want you to sprint too far ahead. For
example, the first case study focuses on the retailing industry, just as it did in
the first edition. However, in this edition we advocate a new approach, mak-
ing a strong case for tackling the atomic, bedrock data of your organization.
You’ll miss out on this rationalization and other updates to fundamental con-
cepts if you skip ahead too quickly.
Navigation Aids
We have laced the book with tips, key concepts, and chapter pointers to make
it more usable and easily referenced in the future. In addition, we have pro-
vided an extensive glossary of terms.
You can find the tips sprinkled throughout this book by flipping through the chapters
and looking for the lightbulb icon.
We begin each chapter with a sidebar of key concepts, denoted by the key icon.
Purpose of Each Chapter
Before we get started, we want to give you a chapter-by-chapter preview of the
concepts covered as the book unfolds.
Chapter 1: Dimensional Modeling Primer
The book begins with a primer on dimensional modeling. We explore the com-
ponents of the overall data warehouse architecture and establish core vocabu-
lary that will be used during the remainder of the book. We dispel some of the
myths and misconceptions about dimensional modeling, and we discuss the
role of normalized models.
Chapter 2: Retail Sales
Retailing is the classic example used to illustrate dimensional modeling. We
start with the classic because it is one that we all understand. Hopefully, you
won’t need to think very hard about the industry because we want you to
focus on core dimensional modeling concepts instead. We begin by discussing
the four-step process for designing dimensional models. We explore dimen-
sion tables in depth, including the date dimension that will be reused repeat-
edly throughout the book. We also discuss degenerate dimensions,
snowflaking, and surrogate keys. Even if you’re not a retailer, this chapter is
required reading because it is chock full of fundamentals.
Chapter 3: Inventory
We remain within the retail industry for our second case study but turn our
attention to another business process. This case study will provide a very vivid
example of the data warehouse bus architecture and the use of conformed
dimensions and facts. These concepts are critical to anyone looking to con-
struct a data warehouse architecture that is integrated and extensible.
Chapter 4: Procurement
This chapter reinforces the importance of looking at your organization’s value
chain as you plot your data warehouse. We also explore a series of basic and
advanced techniques for handling slowly changing dimension attributes.
Chapter 5: Order Management
In this case study we take a look at the business processes that are often the
first to be implemented in data warehouses as they supply core business per-
formance metrics—what are we selling to which customers at what price? We
discuss the situation in which a dimension plays multiple roles within a
schema. We also explore some of the common challenges modelers face when
dealing with order management information, such as header/line item con-
siderations, multiple currencies or units of measure, and junk dimensions with
miscellaneous transaction indicators. We compare the three fundamental
types of fact tables: transaction, periodic snapshot, and accumulating snap-
shot. Finally, we provide recommendations for handling more real-time ware-
housing requirements.
Chapter 6: Customer Relationship Management
Numerous data warehouses have been built on the premise that we need to bet-
ter understand and service our customers. This chapter covers key considera-
tions surrounding the customer dimension, including address standardization,
managing large volume dimensions, and modeling unpredictable customer
hierarchies. It also discusses the consolidation of customer data from multiple sources.
Chapter 7: Accounting
In this totally new chapter we discuss the modeling of general ledger informa-
tion for the data warehouse. We describe the appropriate handling of year-to-
date facts and multiple fiscal calendars, as well as the notion of consolidated
dimensional models that combine data from multiple business processes.
Chapter 8: Human Resources Management
This new chapter explores several unique aspects of human resources dimen-
sional models, including the situation in which a dimension table begins to
behave like a fact table. We also introduce audit and keyword dimensions, as
well as the handling of survey questionnaire data.
Chapter 9: Financial Services
The banking case study explores the concept of heterogeneous products in
which each line of business has unique descriptive attributes and performance
metrics. Obviously, the need to handle heterogeneous products is not unique
to financial services. We also discuss the complicated relationships among
accounts, customers, and households.
Chapter 10: Telecommunications and Utilities
This new chapter is structured somewhat differently to highlight considera-
tions when performing a data model design review. In addition, we explore
the idiosyncrasies of geographic location dimensions, as well as opportunities
for leveraging geographic information systems.
Chapter 11: Transportation
In this case study we take a look at related fact tables at different levels of gran-
ularity. We discuss another approach for handling small dimensions, and we
take a closer look at date and time dimensions, covering such concepts as
country-specific calendars and synchronization across multiple time zones.
Chapter 12: Education
We look at several factless fact tables in this chapter and discuss their impor-
tance in analyzing what didn’t happen. In addition, we explore the student
application pipeline, which is a prime example of an accumulating snapshot
fact table.
Chapter 13: Health Care
Some of the most complex models that we have ever worked with are from the
health care industry. This new chapter illustrates the handling of such com-
plexities, including the use of a bridge table to model multiple diagnoses and
providers associated with a patient treatment.
Chapter 14: Electronic Commerce
This chapter provides an introduction to modeling clickstream data. The con-
cepts are derived from The Data Webhouse Toolkit (Wiley 2000), which Ralph
Kimball coauthored with Richard Merz.
Chapter 15: Insurance
The final case study serves to illustrate many of the techniques we discussed
earlier in the book in a single set of interrelated schemas. It can be viewed
as a pulling-it-all-together chapter because the modeling techniques will be
layered on top of one another, similar to overlaying overhead projector transparencies.
Chapter 16: Building the Data Warehouse
Now that you are comfortable designing dimensional models, we provide a
high-level overview of the activities that are encountered during the lifecycle
of a typical data warehouse project iteration. This chapter could be considered
a lightning tour of The Data Warehouse Lifecycle Toolkit (Wiley 1998) that we
coauthored with Laura Reeves and Warren Thornthwaite.
Chapter 17: Present Imperatives and Future Outlook
In this final chapter we peer into our crystal ball to provide a preview of what
we anticipate data warehousing will look like in the future.
We’ve supplied a detailed glossary to serve as a reference resource. It will help
bridge the gap between your general business understanding and the case
studies derived from businesses other than your own.
Companion Web Site
You can access the book’s companion Web site at www.kimballuniversity.com.
The Web site offers the following resources:
■■ Register for Design Tips to receive ongoing, practical guidance about
dimensional modeling and data warehouse design via electronic mail on a
periodic basis.
■■ Link to all Ralph Kimball’s articles from Intelligent Enterprise and its
predecessor, DBMS Magazine.
■■ Learn about Kimball University classes for quality, vendor-independent
education consistent with the authors’ experiences and writings.
The goal of this book is to communicate a set of standard techniques for
dimensional data warehouse design. Crudely speaking, if you as the reader
get nothing else from this book other than the conviction that your data ware-
house must be driven from the needs of business users and therefore built and
presented from a simple dimensional perspective, then this book will have
served its purpose. We are confident that you will be one giant step closer to
data warehousing success if you buy into these premises.
Now that you know where we are headed, it is time to dive into the details.
We’ll begin with a primer on dimensional modeling in Chapter 1 to ensure that
everyone is on the same page regarding key terminology and architectural
concepts. From there we will begin our discussion of the fundamental tech-
niques of dimensional modeling, starting with the tried-and-true retail industry.
Dimensional Modeling Primer
In this first chapter we lay the groundwork for the case studies that follow.
We’ll begin by stepping back to consider data warehousing from a macro per-
spective. Some readers may be disappointed to learn that it is not all about
tools and techniques—first and foremost, the data warehouse must consider
the needs of the business. We’ll drive stakes in the ground regarding the goals
of the data warehouse while observing the uncanny similarities between the
responsibilities of a data warehouse manager and those of a publisher. With
this big-picture perspective, we’ll explore the major components of the ware-
house environment, including the role of normalized models. Finally, we’ll
close by establishing fundamental vocabulary for dimensional modeling. By
the end of this chapter we hope that you’ll have an appreciation for the need
to be half DBA (database administrator) and half MBA (business analyst) as
you tackle your data warehouse.
Chapter 1 discusses the following concepts:
■■ Business-driven goals of a data warehouse
■■ Data warehouse publishing
■■ Major components of the overall data warehouse
■■ Importance of dimensional modeling for the data
warehouse presentation area
■■ Fact and dimension table terminology
■■ Myths surrounding dimensional modeling
■■ Common data warehousing pitfalls to avoid
Different Information Worlds
One of the most important assets of any organization is its information. This
asset is almost always kept by an organization in two forms: the operational
systems of record and the data warehouse. Crudely speaking, the operational
systems are where the data is put in, and the data warehouse is where we get
the data out.
The users of an operational system turn the wheels of the organization. They
take orders, sign up new customers, and log complaints. Users of an opera-
tional system almost always deal with one record at a time. They repeatedly
perform the same operational tasks over and over.
The users of a data warehouse, on the other hand, watch the wheels of the orga-
nization turn. They count the new orders and compare them with last week’s
orders and ask why the new customers signed up and what the customers
complained about. Users of a data warehouse almost never deal with one row
at a time. Rather, their questions often require that hundreds or thousands of
rows be searched and compressed into an answer set. To further complicate
matters, users of a data warehouse continuously change the kinds of questions
they ask.
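The contrast can be sketched with a tiny illustration (the table, columns, and data here are hypothetical, invented for this sketch rather than drawn from the book): an operational-style query fetches a single record, while a warehouse-style question sweeps every row and compresses the result into a small answer set.

```python
import sqlite3

# Hypothetical miniature sales fact table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_fact (
        date_key INTEGER, product_key INTEGER,
        quantity INTEGER, dollar_amount REAL);
    INSERT INTO sales_fact VALUES
        (1, 10, 2, 5.00), (1, 11, 1, 3.50),
        (2, 10, 4, 10.00), (2, 11, 2, 7.00);
""")

# Operational style: deal with one record at a time.
one_row = conn.execute(
    "SELECT * FROM sales_fact WHERE date_key = 1 AND product_key = 10"
).fetchone()

# Warehouse style: search all the rows and compress them into an answer set.
answer_set = conn.execute(
    "SELECT product_key, SUM(dollar_amount) FROM sales_fact "
    "GROUP BY product_key ORDER BY product_key"
).fetchall()
print(answer_set)  # [(10, 15.0), (11, 10.5)]
```

The second query is the shape of nearly every warehouse question: many rows in, a few summarized rows out, and the question itself changes from day to day.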
In the first edition of The Data Warehouse Toolkit (Wiley 1996), Ralph Kimball
devoted an entire chapter to describe the dichotomy between the worlds of
operational processing and data warehousing. At this time, it is widely recog-
nized that the data warehouse has profoundly different needs, clients, struc-
tures, and rhythms than the operational systems of record. Unfortunately, we
continue to encounter supposed data warehouses that are mere copies of the
operational system of record stored on a separate hardware platform. While
this may address the need to isolate the operational and warehouse environ-
ments for performance reasons, it does nothing to address the other inherent
differences between these two types of systems. Business users are under-
whelmed by the usability and performance provided by these pseudo data
warehouses. These imposters do a disservice to data warehousing because
they don’t acknowledge that warehouse users have drastically different needs
than operational system users.
Goals of a Data Warehouse
Before we delve into the details of modeling and implementation, it is helpful
to focus on the fundamental goals of the data warehouse. The goals can be
developed by walking through the halls of any organization and listening to
business management. Inevitably, these recurring themes emerge:
■■ “We have mountains of data in this company, but we can’t access it.”
■■ “We need to slice and dice the data every which way.”
■■ “You’ve got to make it easy for business people to get at the data directly.”
■■ “Just show me what is important.”
■■ “It drives me crazy to have two people present the same business metrics
at a meeting, but with different numbers.”
■■ “We want people to use information to support more fact-based decision
making.”
Based on our experience, these concerns are so universal that they drive the
bedrock requirements for the data warehouse. Let’s turn these business man-
agement quotations into data warehouse requirements.
The data warehouse must make an organization’s information easily acces-
sible. The contents of the data warehouse must be understandable. The
data must be intuitive and obvious to the business user, not merely the
developer. Understandability implies legibility; the contents of the data
warehouse need to be labeled meaningfully. Business users want to sepa-
rate and combine the data in the warehouse in endless combinations, a
process commonly referred to as slicing and dicing. The tools that access the
data warehouse must be simple and easy to use. They also must return
query results to the user with minimal wait times.
The data warehouse must present the organization’s information consis-
tently. The data in the warehouse must be credible. Data must be carefully
assembled from a variety of sources around the organization, cleansed,
quality assured, and released only when it is fit for user consumption.
Information from one business process should match with information
from another. If two performance measures have the same name, then they
must mean the same thing. Conversely, if two measures don’t mean the
same thing, then they should be labeled differently. Consistent information
means high-quality information. It means that all the data is accounted for
and complete. Consistency also implies that common definitions for the
contents of the data warehouse are available for users.
The data warehouse must be adaptive and resilient to change. We simply
can’t avoid change. User needs, business conditions, data, and technology
are all subject to the shifting sands of time. The data warehouse must be
designed to handle this inevitable change. Changes to the data warehouse
should be graceful, meaning that they don’t invalidate existing data or
applications. The existing data and applications should not be changed or
disrupted when the business community asks new questions or new data
is added to the warehouse. If descriptive data in the warehouse is modi-
fied, we must account for the changes appropriately.
Dimensional Modeling Primer 3
The data warehouse must be a secure bastion that protects our information
assets. An organization’s informational crown jewels are stored in the data
warehouse. At a minimum, the warehouse likely contains information
about what we’re selling to whom at what price—potentially harmful
details in the hands of the wrong people. The data warehouse must effec-
tively control access to the organization’s confidential information.
The data warehouse must serve as the foundation for improved decision
making. The data warehouse must have the right data in it to support deci-
sion making. There is only one true output from a data warehouse: the deci-
sions that are made after the data warehouse has presented its evidence.
These decisions deliver the business impact and value attributable to the
warehouse. The original label that predates the data warehouse is still the
best description of what we are designing: a decision support system.
The business community must accept the data warehouse if it is to be
deemed successful. It doesn’t matter that we’ve built an elegant solution
using best-of-breed products and platforms. If the business community has
not embraced the data warehouse and continued to use it actively six
months after training, then we have failed the acceptance test. Unlike an
operational system rewrite, where business users have no choice but to use
the new system, data warehouse usage is sometimes optional. Business
user acceptance has more to do with simplicity than anything else.
As this list illustrates, successful data warehousing demands much more than
being a stellar DBA or technician. With a data warehousing initiative, we have
one foot in our information technology (IT) comfort zone, while our other foot
is on the unfamiliar turf of business users. We must straddle the two, modify-
ing some of our tried-and-true skills to adapt to the unique demands of data
warehousing. Clearly, we need to bring a bevy of skills to the party to behave
like we’re a hybrid DBA/MBA.
The Publishing Metaphor
With the goals of the data warehouse as a backdrop, let’s compare our respon-
sibilities as data warehouse managers with those of a publishing editor-in-
chief. As the editor of a high-quality magazine, you would be given broad
latitude to manage the magazine’s content, style, and delivery. Anyone with
this job title likely would tackle the following activities:
■■ Identify your readers demographically.
■■ Find out what the readers want in this kind of magazine.
■■ Identify the “best” readers who will renew their subscriptions and buy
products from the magazine’s advertisers.
■■ Find potential new readers and make them aware of the magazine.
■■ Choose the magazine content most appealing to the target readers.
■■ Make layout and rendering decisions that maximize the readers’ pleasure.
■■ Uphold high quality writing and editing standards, while adopting a
consistent presentation style.
■■ Continuously monitor the accuracy of the articles and advertiser’s claims.
■■ Develop a good network of writers and contributors as you gather new
input to the magazine’s content from a variety of sources.
■■ Attract advertising and run the magazine profitably.
■■ Publish the magazine on a regular basis.
■■ Maintain the readers’ trust.
■■ Keep the business owners happy.
We also can identify items that should be nongoals for the magazine editor-in-
chief. These would include such things as building the magazine around the
technology of a particular printing press, putting management’s energy into
operational efficiencies exclusively, imposing a technical writing style that
readers don’t easily understand, or creating an intricate and crowded layout
that is difficult to peruse and read.
By building the publishing business on a foundation of serving the readers
effectively, your magazine is likely to be successful. Conversely, go through
the list and imagine what happens if you omit any single item; ultimately, your
magazine would have serious problems.
The point of this metaphor, of course, is to draw the parallel between being a
conventional publisher and being a data warehouse manager. We are con-
vinced that the correct job description for a data warehouse manager is pub-
lisher of the right data. Driven by the needs of the business, data warehouse
managers are responsible for publishing data that has been collected from a
variety of sources and edited for quality and consistency. Your main responsi-
bility as a data warehouse manager is to serve your readers, otherwise known
as business users. The publishing metaphor underscores the need to focus out-
ward to your customers rather than merely focusing inward on products and
processes. While you will use technology to deliver your data warehouse, the
technology is at best a means to an end. As such, the technology and tech-
niques you use to build your data warehouses should not appear directly in
your top job responsibilities.
Let’s recast the magazine publisher’s responsibilities as data warehouse man-
ager responsibilities:
■■ Understand your users by business area, job responsibilities, and com-
puter tolerance.
■■ Determine the decisions the business users want to make with the help of
the data warehouse.
■■ Identify the “best” users who make effective, high-impact decisions using
the data warehouse.
■■ Find potential new users and make them aware of the data warehouse.
■■ Choose the most effective, actionable subset of the data to present in the
data warehouse, drawn from the vast universe of possible data in your
organization.
■■ Make the user interfaces and applications simple and template-driven,
explicitly matching to the users’ cognitive processing profiles.
■■ Make sure the data is accurate and can be trusted, labeling it consistently
across the enterprise.
■■ Continuously monitor the accuracy of the data and the content of the
delivered reports.
■■ Search for new data sources, and continuously adapt the data warehouse
to changing data profiles, reporting requirements, and business priorities.
■■ Take a portion of the credit for the business decisions made using the data
warehouse, and use these successes to justify your staffing, software, and
hardware expenditures.
■■ Publish the data on a regular basis.
■■ Maintain the trust of business users.
■■ Keep your business users, executive sponsors, and boss happy.
If you do a good job with all these responsibilities, you will be a great data
warehouse manager! Conversely, go down through the list and imagine what
happens if you omit any single item. Ultimately, your data warehouse would
have serious problems. We urge you to contrast this view of a data warehouse
manager’s job with your own job description. Chances are the preceding list is
much more oriented toward user and business issues and may not even sound
like a job in IT. In our opinion, this is what makes data warehousing interesting.
Components of a Data Warehouse
Now that we understand the goals of a data warehouse, let’s investigate the
components that make up a complete warehousing environment. It is helpful
to understand the pieces carefully before we begin combining them to create a
data warehouse. Each warehouse component serves a specific function. We
need to learn the strategic significance of each component and how to wield it
effectively to win the data warehousing game. One of the biggest threats to
data warehousing success is confusing the components’ roles and functions.
As illustrated in Figure 1.1, there are four separate and distinct components to
be considered as we explore the data warehouse environment—operational
source systems, data staging area, data presentation area, and data access tools.
Operational Source Systems
These are the operational systems of record that capture the transactions of the
business. The source systems should be thought of as outside the data ware-
house because presumably we have little to no control over the content and for-
mat of the data in these operational legacy systems. The main priorities of the
source systems are processing performance and availability. Queries against
source systems are narrow, one-record-at-a-time queries that are part of the nor-
mal transaction flow and severely restricted in their demands on the opera-
tional system. We make the strong assumption that source systems are not
queried in the broad and unexpected ways that data warehouses typically are
queried. The source systems maintain little historical data, and if you have a
good data warehouse, the source systems can be relieved of much of the
responsibility for representing the past. Each source system is often a natural
stovepipe application, where little investment has been made to sharing com-
mon data such as product, customer, geography, or calendar with other opera-
tional systems in the organization. It would be great if your source systems
were being reengineered with a consistent view. Such an enterprise application
integration (EAI) effort will make the data warehouse design task far easier.
Figure 1.1 Basic elements of the data warehouse. [Figure: operational source
systems feed the data staging area (clean, combine, and standardize; a data
store of flat files and relational tables; sorting and sequential processing),
which loads the presentation area’s data marts (Data Mart #1 with atomic and
summary data based on a single business process; Data Mart #2 and beyond
similarly designed), tied together by the DW bus of conformed facts and
dimensions and queried by data access tools such as ad hoc query tools,
report writers, and data mining applications.]
Data Staging Area
The data staging area of the data warehouse is both a storage area and a set of
processes commonly referred to as extract-transformation-load (ETL). The data
staging area is everything between the operational source systems and the
data presentation area. It is somewhat analogous to the kitchen of a restaurant,
where raw food products are transformed into a fine meal. In the data ware-
house, raw operational data is transformed into a warehouse deliverable fit for
user query and consumption. Similar to the restaurant’s kitchen, the backroom
data staging area is accessible only to skilled professionals. The data ware-
house kitchen staff is busy preparing meals and cannot simultaneously
respond to customer inquiries. Customers aren’t invited to eat in the
kitchen. It certainly isn’t safe for customers to wander into the kitchen. We
wouldn’t want our data warehouse customers to be injured by the dangerous
equipment, hot surfaces, and sharp knives they may encounter in the kitchen,
so we prohibit them from accessing the staging area. Besides, things happen in
the kitchen that customers just shouldn’t be privy to.
The key architectural requirement for the data staging area is that it is off-limits to
business users and does not provide query and presentation services.
Extraction is the first step in the process of getting data into the data ware-
house environment. Extracting means reading and understanding the source
data and copying the data needed for the data warehouse into the staging area
for further manipulation.
Once the data is extracted to the staging area, there are numerous potential
transformations, such as cleansing the data (correcting misspellings, resolving
domain conflicts, dealing with missing elements, or parsing into standard for-
mats), combining data from multiple sources, deduplicating data, and assign-
ing warehouse keys. These transformations are all precursors to loading the
data into the data warehouse presentation area.
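These transformation steps can be sketched in a few lines of Python; the field names, cleansing rules, and records below are invented for illustration, not drawn from any real system:

```python
# Hypothetical extracted customer records, with typical source-system problems:
# inconsistent casing and whitespace, a nonstandard state code, and a duplicate.
extracted = [
    {"cust_id": "A1", "name": " alice smith ", "state": "CA"},
    {"cust_id": "A1", "name": "Alice Smith", "state": "CA"},   # duplicate
    {"cust_id": "B2", "name": "Bob Jones", "state": "Calif"},  # domain conflict
]

STATE_FIXES = {"Calif": "CA"}  # illustrative domain-conflict resolution

def cleanse(rec):
    return {
        "cust_id": rec["cust_id"],
        "name": " ".join(rec["name"].split()).title(),  # parse into standard format
        "state": STATE_FIXES.get(rec["state"], rec["state"]),
    }

# Deduplicate on the natural key, then assign warehouse surrogate keys.
staged, seen = [], set()
for rec in map(cleanse, extracted):
    if rec["cust_id"] not in seen:
        seen.add(rec["cust_id"])
        rec["customer_key"] = len(staged) + 1  # surrogate key, independent of source
        staged.append(rec)
```

The surrogate key assigned at the end is deliberately independent of the source system's identifier, which is what lets the warehouse absorb source-key changes later.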
Unfortunately, there is still considerable industry consternation about whether
the data that supports or results from this process should be instantiated in
physical normalized structures prior to loading into the presentation area for
querying and reporting. These normalized structures sometimes are referred
to in the industry as the enterprise data warehouse; however, we believe that this
terminology is a misnomer because the warehouse is actually much more
encompassing than this set of normalized tables. The enterprise’s data ware-
house more accurately refers to the conglomeration of an organization’s data
warehouse staging and presentation areas. Thus, throughout this book, when
we refer to the enterprise data warehouse, we mean the union of all the diverse
data warehouse components, not just the backroom staging area.
The data staging area is dominated by the simple activities of sorting and
sequential processing. In many cases, the data staging area is not based on
relational technology but instead may consist of a system of flat files. After you
validate your data for conformance with the defined one-to-one and many-to-
one business rules, it may be pointless to take the final step of building a full-
blown third-normal-form physical database.
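Validating a many-to-one business rule requires nothing more than a scan of the staged data. A sketch, assuming a made-up rule that each product maps to exactly one brand:

```python
from collections import defaultdict

# Illustrative staging rows: (product, brand). The many-to-one rule says
# each product should carry exactly one brand; a violation flags bad source data.
rows = [
    ("SKU-1", "Acme"),
    ("SKU-2", "Acme"),
    ("SKU-1", "Acme"),
    ("SKU-3", "Zenith"),
]

brands_per_product = defaultdict(set)
for product, brand in rows:
    brands_per_product[product].add(brand)

# Products mapped to more than one brand violate the rule.
violations = {p: b for p, b in brands_per_product.items() if len(b) > 1}
```

An empty `violations` dict means the batch conforms and can proceed to loading; a flat-file pipeline with checks like this often makes a full 3NF staging database unnecessary.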
However, there are cases where the data arrives at the doorstep of the data
staging area in a third-normal-form relational format. In these situations, the
managers of the data staging area simply may be more comfortable perform-
ing the cleansing and transformation tasks using a set of normalized struc-
tures. A normalized database for data staging storage is acceptable. However,
we continue to have some reservations about this approach. The creation of
both normalized structures for staging and dimensional structures for presen-
tation means that the data is extracted, transformed, and loaded twice—once
into the normalized database and then again when we load the dimensional
model. Obviously, this two-step process requires more time and resources for
the development effort, more time for the periodic loading or updating of
data, and more capacity to store the multiple copies of the data. At the bottom
line, this typically translates into the need for larger development, ongoing
support, and hardware platform budgets. Unfortunately, some data ware-
house project teams have failed miserably because they focused all their
energy and resources on constructing the normalized structures rather than
allocating time to development of a presentation area that supports improved
business decision making. While we believe that enterprise-wide data consis-
tency is a fundamental goal of the data warehouse environment, there are
equally effective and less costly approaches than physically creating a normal-
ized set of tables in your staging area, if these structures don’t already exist.
It is acceptable to create a normalized database to support the staging processes;
however, this is not the end goal. The normalized structures must be off-limits to
user queries because they defeat understandability and performance. As soon as a
database supports query and presentation services, it must be considered part of the
data warehouse presentation area. By default, normalized databases are excluded
from the presentation area, which should be strictly dimensionally structured.
Regardless of whether we’re working with a series of flat files or a normalized
data structure in the staging area, the final step of the ETL process is the load-
ing of data. Loading in the data warehouse environment usually takes the
form of presenting the quality-assured dimensional tables to the bulk loading
facilities of each data mart. The target data mart must then index the newly
arrived data for query performance. When each data mart has been freshly
loaded, indexed, supplied with appropriate aggregates, and further quality
assured, the user community is notified that the new data has been published.
Publishing includes communicating the nature of any changes that have
occurred in the underlying dimensions and new assumptions that have been
introduced into the measured or calculated facts.
Data Presentation
The data presentation area is where data is organized, stored, and made avail-
able for direct querying by users, report writers, and other analytical applica-
tions. Since the backroom staging area is off-limits, the presentation area is the
data warehouse as far as the business community is concerned. It is all the
business community sees and touches via data access tools. The prerelease
working title for the first edition of The Data Warehouse Toolkit originally was
Getting the Data Out. This is what the presentation area with its dimensional
models is all about.
We typically refer to the presentation area as a series of integrated data marts.
A data mart is a wedge of the overall presentation area pie. In its most sim-
plistic form, a data mart presents the data from a single business process.
These business processes cross the boundaries of organizational functions.
We have several strong opinions about the presentation area. First of all, we
insist that the data be presented, stored, and accessed in dimensional schemas.
Fortunately, the industry has matured to the point where we’re no longer
debating this mandate. The industry has concluded that dimensional model-
ing is the most viable technique for delivering data to data warehouse users.
Dimensional modeling is a new name for an old technique for making data-
bases simple and understandable. In case after case, beginning in the 1970s, IT
organizations, consultants, end users, and vendors have gravitated to a simple
dimensional structure to match the fundamental human need for simplicity.
Imagine a chief executive officer (CEO) who describes his or her business as,
“We sell products in various markets and measure our performance over
time.” As dimensional designers, we listen carefully to the CEO’s emphasis on
product, market, and time. Most people find it intuitive to think of this busi-
ness as a cube of data, with the edges labeled product, market, and time. We
can imagine slicing and dicing along each of these dimensions. Points inside
the cube are where the measurements for that combination of product, market,
and time are stored. The ability to visualize something as abstract as a set of
data in a concrete and tangible way is the secret of understandability. If this
perspective seems too simple, then good! A data model that starts by being
simple has a chance of remaining simple at the end of the design. A model that
starts by being complicated surely will be overly complicated at the end.
Overly complicated models will run slowly and be rejected by business users.
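The cube image can be made concrete with a dictionary keyed by (product, market, period); the products, markets, and measurements below are invented for illustration:

```python
# A tiny "cube": measurements addressed by (product, market, period).
cube = {
    ("Widget", "East", "2024-Q1"): 120,
    ("Widget", "West", "2024-Q1"): 80,
    ("Gadget", "East", "2024-Q1"): 50,
    ("Widget", "East", "2024-Q2"): 130,
}

def slice_by(cube, product=None, market=None, period=None):
    """Slice and dice: keep the points matching every constraint given."""
    return {
        k: v for k, v in cube.items()
        if (product is None or k[0] == product)
        and (market is None or k[1] == market)
        and (period is None or k[2] == period)
    }

widget_sales = sum(slice_by(cube, product="Widget").values())  # roll up one slice
east_q1 = slice_by(cube, market="East", period="2024-Q1")      # dice two ways
```

Constraining one edge of the cube is a slice; constraining two is a dice; summing what remains is a rollup. Every dimensional query is some combination of these moves.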
Dimensional modeling is quite different from third-normal-form (3NF) mod-
eling. 3NF modeling is a design technique that seeks to remove data redun-
dancies. Data is divided into many discrete entities, each of which becomes a
table in the relational database. A database of sales orders might start off with
a record for each order line but turns into an amazingly complex spiderweb
diagram as a 3NF model, perhaps consisting of hundreds or even thousands of
normalized tables.
The industry sometimes refers to 3NF models as ER models. ER is an acronym
for entity relationship. Entity-relationship diagrams (ER diagrams or ERDs) are
drawings of boxes and lines to communicate the relationships between tables.
Both 3NF and dimensional models can be represented in ERDs because both
consist of joined relational tables; the key difference between 3NF and dimen-
sional models is the degree of normalization. Since both model types can be
presented as ERDs, we’ll refrain from referring to 3NF models as ER models;
instead, we’ll call them normalized models to minimize confusion.
Normalized modeling is immensely helpful to operational processing perfor-
mance because an update or insert transaction only needs to touch the data-
base in one place. Normalized models, however, are too complicated for data
warehouse queries. Users can’t understand, navigate, or remember normal-
ized models that resemble the Los Angeles freeway system. Likewise, rela-
tional database management systems (RDBMSs) can’t query a normalized
model efficiently; the complexity overwhelms the database optimizers, result-
ing in disastrous performance. The use of normalized modeling in the data
warehouse presentation area defeats the whole purpose of data warehousing,
namely, intuitive and high-performance retrieval of data.
There is a common syndrome in many large IT shops. It is a kind of sickness
that comes from overly complex data warehousing schemas. The symptoms
might include:
■■ A $10 million hardware and software investment that is performing only a
handful of queries per day
■■ An IT department that is forced into a kind of priesthood, writing all the
data warehouse queries
■■ Seemingly simple queries that require several pages of single-spaced
Structured Query Language (SQL) code
■■ A marketing department that is unhappy because it can’t access the system
directly (and still doesn’t know whether the company is profitable in a
given market)
■■ A restless chief information officer (CIO) who is determined to make some
changes if things don’t improve dramatically
Fortunately, dimensional modeling addresses the problem of overly complex
schemas in the presentation area. A dimensional model contains the same infor-
mation as a normalized model but packages the data in a format whose design
goals are user understandability, query performance, and resilience to change.
Our second stake in the ground about presentation area data marts is that they
must contain detailed, atomic data. Atomic data is required to withstand
assaults from unpredictable ad hoc user queries. While the data marts also
may contain performance-enhancing summary data, or aggregates, it is not
sufficient to deliver these summaries without the underlying granular data in
a dimensional form. In other words, it is completely unacceptable to store only
summary data in dimensional models while the atomic data is locked up in
normalized models. It is impractical to expect a user to drill down through
dimensional data almost to the most granular level and then lose the benefits
of a dimensional presentation at the final step. In Chapter 16 we will see that
any user application can descend effortlessly to the bedrock granular data by
using aggregate navigation, but only if all the data is available in the same,
consistent dimensional form. While users of the data warehouse may look
infrequently at a single line item on an order, they may be very interested in
last week’s orders for products of a given size (or flavor, package type, or man-
ufacturer) for customers who first purchased within the last six months (or
reside in a given state or have certain credit terms). We need the most finely
grained data in our presentation area so that users can ask the most precise
questions possible. Because users’ requirements are unpredictable and con-
stantly changing, we must provide access to the exquisite details so that they
can be rolled up to address the questions of the moment.
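The point about granularity can be illustrated directly: the ad hoc question below is only answerable because the order lines were kept at the atomic level. The rows and attribute names are invented:

```python
from datetime import date

# Atomic order lines: one row per line item, with descriptive attributes.
order_lines = [
    {"product_size": "16oz", "first_purchase": date(2024, 3, 1), "qty": 2},
    {"product_size": "16oz", "first_purchase": date(2023, 1, 15), "qty": 5},
    {"product_size": "8oz",  "first_purchase": date(2024, 4, 2), "qty": 1},
    {"product_size": "16oz", "first_purchase": date(2024, 5, 20), "qty": 3},
]

# "Orders for 16oz products from customers who first purchased recently" --
# answerable only from atomic detail, not from a presummarized table.
cutoff = date(2024, 1, 1)
answer = sum(
    line["qty"] for line in order_lines
    if line["product_size"] == "16oz" and line["first_purchase"] >= cutoff
)
```

Had the data been stored only as, say, monthly product totals, the first-purchase attribute would already be gone and this question could never be asked.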
All the data marts must be built using common dimensions and facts, which
we refer to as conformed. This is the basis of the data warehouse bus architec-
ture, which we’ll elaborate on in Chapter 3. Adherence to the bus architecture
is our third stake in the ground regarding the presentation area. Without
shared, conformed dimensions and facts, a data mart is a standalone stovepipe
application. Isolated stovepipe data marts that cannot be tied together are the
bane of the data warehouse movement. They merely perpetuate incompatible
views of the enterprise. If you have any hope of building a data warehouse
that is robust and integrated, you must make a commitment to the bus archi-
tecture. In this book we will illustrate that when data marts have been
designed with conformed dimensions and facts, they can be combined and
used together. The data warehouse presentation area in a large enterprise data
warehouse ultimately will consist of 20 or more very similar-looking data
marts. The dimensional models in these data marts also will look quite similar.
Each data mart may contain several fact tables, each with 5 to 15 dimension
tables. If the design has been done correctly, many of these dimension tables
will be shared from fact table to fact table.
Using the bus architecture is the secret to building distributed data warehouse
systems. Let’s be real—most of us don’t have the budget, time, or political
power to build a fully centralized data warehouse. When the bus architecture
is used as a framework, we can allow the enterprise data warehouse to
develop in a decentralized (and far more realistic) way.
Data in the queryable presentation area of the data warehouse must be dimen-
sional, must be atomic, and must adhere to the data warehouse bus architecture.
If the presentation area is based on a relational database, then these dimen-
sionally modeled tables are referred to as star schemas. If the presentation area
is based on multidimensional database or online analytic processing (OLAP)
technology, then the data is stored in cubes. While the technology originally
wasn’t referred to as OLAP, many of the early decision support system ven-
dors built their systems around the cube concept, so today’s OLAP vendors
naturally are aligned with the dimensional approach to data warehousing.
Dimensional modeling is applicable to both relational and multidimensional
databases. Both have a common logical design with recognizable dimensions;
however, the physical implementation differs. Fortunately, most of the recom-
mendations in this book pertain, regardless of the database platform. While
the capabilities of OLAP technology are improving continuously, at the time of
this writing, most large data marts are still implemented on relational data-
bases. In addition, most OLAP cubes are sourced from or drill into relational
dimensional star schemas using a variation of aggregate navigation. For these
reasons, most of the specific discussions surrounding the presentation area are
couched in terms of a relational platform.
Contrary to the original religion of the data warehouse, modern data marts
may well be updated, sometimes frequently. Incorrect data obviously should
be corrected. Changes in labels, hierarchies, status, and corporate ownership
often trigger necessary changes in the original data stored in the data marts
that comprise the data warehouse, but in general, these are managed-load
updates, not transactional updates.
Data Access Tools
The final major component of the data warehouse environment is the data
access tool(s). We use the term tool loosely to refer to the variety of capabilities
that can be provided to business users to leverage the presentation area for
analytic decision making. By definition, all data access tools query the data in
the data warehouse’s presentation area. Querying, obviously, is the whole
point of using the data warehouse.
A data access tool can be as simple as an ad hoc query tool or as complex as a
sophisticated data mining or modeling application. Ad hoc query tools, as
powerful as they are, can be understood and used effectively only by a small
percentage of the potential data warehouse business user population. The
majority of the business user base likely will access the data via prebuilt
parameter-driven analytic applications. Approximately 80 to 90 percent of the
potential users will be served by these canned applications that are essentially
finished templates that do not require users to construct relational queries
directly. Some of the more sophisticated data access tools, like modeling or
forecasting tools, actually may upload their results back into operational
source systems or the staging/presentation areas of the data warehouse.
Additional Considerations
Before we leave the discussion of data warehouse components, there are sev-
eral other concepts that warrant discussion.
Metadata is all the information in the data warehouse environment that is not
the actual data itself. Metadata is akin to an encyclopedia for the data ware-
house. Data warehouse teams often spend an enormous amount of time talk-
ing about, worrying about, and feeling guilty about metadata. Since most
developers have a natural aversion to the development and orderly filing of
documentation, metadata often gets cut from the project plan despite every-
one’s acknowledgment that it is important.
Metadata comes in a variety of shapes and forms to support the disparate
needs of the data warehouse’s technical, administrative, and business user
groups. We have operational source system metadata including source
schemas and copybooks that facilitate the extraction process. Once data is in
the staging area, we encounter staging metadata to guide the transformation
and loading processes, including staging file and target table layouts, trans-
formation and cleansing rules, conformed dimension and fact definitions,
aggregation definitions, and ETL transmission schedules and run-log results.
Even the custom programming code we write in the data staging area is metadata.
Metadata surrounding the warehouse DBMS accounts for such items as the
system tables, partition settings, indexes, view definitions, and DBMS-level
security privileges and grants. Finally, the data access tool metadata identifies
business names and definitions for the presentation area’s tables and columns
as well as constraint filters, application template specifications, access and
usage statistics, and other user documentation. And of course, if we haven’t
included it already, don’t forget all the security settings, beginning with source
transactional data and extending all the way to the user’s desktop.
The ultimate goal is to corral, catalog, integrate, and then leverage these dis-
parate varieties of metadata, much like the resources of a library. Suddenly, the
effort to build dimensional models appears to pale in comparison. However,
just because the task looms large, we can’t simply ignore the development of a
metadata framework for the data warehouse. We need to develop an overall
metadata plan while prioritizing short-term deliverables, including the pur-
chase or construction of a repository for keeping track of all the metadata.
Operational Data Store
Some of you probably are wondering where the operational data store (ODS)
fits in our warehouse components diagram. Since there’s no single universal
definition for the ODS, if and where it belongs depend on your situation. ODSs
are frequently updated, somewhat integrated copies of operational data. The
frequency of update and degree of integration of an ODS vary based on the
specific requirements. In any case, the O is the operative letter in the ODS acronym.
Most commonly, an ODS is implemented to deliver operational reporting,
especially when neither the legacy nor more modern on-line transaction pro-
cessing (OLTP) systems provide adequate operational reports. These reports
are characterized by a limited set of fixed queries that can be hard-wired in a
reporting application. The reports address the organization’s more tactical
decision-making requirements. Performance-enhancing aggregations, signifi-
cant historical time series, and extensive descriptive attribution are specifically
excluded from the ODS. The ODS as a reporting instance may be a stepping-
stone to feed operational data into the warehouse.
In other cases, ODSs are built to support real-time interactions, especially in cus-
tomer relationship management (CRM) applications such as accessing your
travel itinerary on a Web site or your service history when you call into customer
support. The traditional data warehouse typically is not in a position to support
the demand for near-real-time data or immediate response times. Similar to the
operational reporting scenario, data inquiries to support these real-time interac-
tions have a fixed structure. Interestingly, this type of ODS sometimes leverages
information from the data warehouse, such as a customer service call center
application that uses customer behavioral information from the data warehouse
to precalculate propensity scores and store them in the ODS.
In either scenario, the ODS can be either a third physical system sitting between
the operational systems and the data warehouse or a specially administered hot
partition of the data warehouse itself. Every organization obviously needs
operational systems. Likewise, every organization would benefit from a data
warehouse. The same cannot be said about a physically distinct ODS unless the
other two systems cannot answer your immediate operational questions.
Clearly, you shouldn’t allocate resources to construct a third physical system
unless your business needs cannot be supported by either the operational data-
collection system or the data warehouse. For these reasons, we believe that the
trend in data warehouse design is to deliver the ODS as a specially adminis-
tered portion of the conventional data warehouse. We will further discuss hot-
partition-style ODSs in Chapter 5.
Finally, before we leave this topic, some have defined the ODS to mean the
place in the data warehouse where we store granular atomic data. We believe
that this detailed data should be considered a natural part of the data ware-
house’s presentation area and not a separate entity. Beginning in Chapter 2, we
will show how the lowest-level transactions in a business are the foundation
for the presentation area of the data warehouse.
Dimensional Modeling Vocabulary
Throughout this book we will refer repeatedly to fact and dimension tables.
Contrary to popular folklore, Ralph Kimball didn’t invent this terminology. As
best as we can determine, the terms dimensions and facts originated from a joint
research project conducted by General Mills and Dartmouth University in the
1960s. In the 1970s, both AC Nielsen and IRI used these terms consistently to
describe their syndicated data offerings, which could be described accurately
today as dimensional data marts for retail sales data. Long before simplicity
was a lifestyle trend, the early database syndicators gravitated to these con-
cepts for simplifying the presentation of analytic information. They under-
stood that a database wouldn’t be used unless it was packaged simply.
It is probably accurate to say that a single person did not invent the dimensional ap-
proach. It is an irresistible force in the design of databases that will always result
when the designer places understandability and performance as the highest goals.
Fact Table
A fact table is the primary table in a dimensional model where the numerical
performance measurements of the business are stored, as illustrated in Figure
1.2. We strive to store the measurement data resulting from a business process
in a single data mart. Since measurement data is overwhelmingly the largest
part of any data mart, we avoid duplicating it in multiple places around the
enterprise.
Figure 1.2 Sample fact table: the Daily Sales Fact Table, with Date Key (FK),
Product Key (FK), Store Key (FK), Quantity Sold, and Dollar Sales Amount.
We use the term fact to represent a business measure. We can imagine standing
in the marketplace watching products being sold and writing down the quan-
tity sold and dollar sales amount each day for each product in each store. A
measurement is taken at the intersection of all the dimensions (day, product,
and store). This list of dimensions defines the grain of the fact table and tells us
what the scope of the measurement is.
A row in a fact table corresponds to a measurement. A measurement is a row in a
fact table. All the measurements in a fact table must be at the same grain.
The most useful facts are numeric and additive, such as dollar sales amount.
Throughout this book we will use dollars as the standard currency to make the
case study examples more tangible—please bear with the authors and substi-
tute your own local currency if it doesn’t happen to be dollars.
Additivity is crucial because data warehouse applications almost never
retrieve a single fact table row. Rather, they bring back hundreds, thousands,
or even millions of fact rows at a time, and the most useful thing to do with so
many rows is to add them up. In Figure 1.2, no matter what slice of the data-
base the user chooses, we can add up the quantities and dollars to a valid total.
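To make additivity concrete, here is a small sketch in Python using SQLite. The table and column names follow Figure 1.2; the data values are hypothetical. Whichever dimension the user slices by, summing the additive facts yields a valid total:

```python
# Hypothetical fact rows; schema follows the Daily Sales Fact Table in Figure 1.2.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE daily_sales_facts (
    date_key INTEGER, product_key INTEGER, store_key INTEGER,
    quantity_sold INTEGER, dollar_sales_amount REAL)""")
rows = [
    (20020101, 1, 10, 3, 7.50),
    (20020101, 2, 10, 1, 4.00),
    (20020102, 1, 20, 2, 5.00),
]
conn.executemany("INSERT INTO daily_sales_facts VALUES (?,?,?,?,?)", rows)

# Slice by store: the additive facts roll up to a valid total per store.
totals = {store: (qty, dollars)
          for store, qty, dollars in conn.execute(
              """SELECT store_key, SUM(quantity_sold), SUM(dollar_sales_amount)
                 FROM daily_sales_facts GROUP BY store_key""")}
# totals[10] == (4, 11.5); totals[20] == (2, 5.0)
```

Grouping by date or by product instead would be equally valid, which is exactly the symmetry the additive design buys us.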
We will see later in this book that there are facts that are semiadditive and still
others that are nonadditive. Semiadditive facts can be added only along some
of the dimensions, and nonadditive facts simply can’t be added at all. With
nonadditive facts we are forced to use counts or averages if we wish to sum-
marize the rows or are reduced to printing out the fact rows one at a time. This
would be a dull exercise in a fact table with a billion rows.
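A brief, hypothetical illustration of a semiadditive fact: an end-of-day account balance can be summed across the account dimension for a single day, but summing the same account's balance across days double-counts; across time we must average instead.

```python
# Hypothetical end-of-day balances keyed by (date, account).
balances = {
    ("day1", "A"): 100.0, ("day1", "B"): 50.0,
    ("day2", "A"): 100.0, ("day2", "B"): 70.0,
}

# Valid: total balance on day1, summed across the account dimension.
day1_total = sum(v for (d, a), v in balances.items() if d == "day1")  # 150.0

# Invalid: summing account A across the date dimension would give a
# meaningless 200.0; we average across the time periods instead.
a_values = [v for (d, a), v in balances.items() if a == "A"]
a_avg = sum(a_values) / len(a_values)  # 100.0
```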
The most useful facts in a fact table are numeric and additive.
We often describe facts as continuously valued mainly as a guide for the
designer to help sort out what is a fact versus a dimension attribute. The dol-
lar sales amount fact is continuously valued in this example because it can take
on virtually any value within a broad range. As observers, we have to stand
out in the marketplace and wait for the measurement before we have any idea
what the value will be.
It is theoretically possible for a measured fact to be textual; however, the con-
dition arises rarely. In most cases, a textual measurement is a description of
something and is drawn from a discrete list of values. The designer should
make every effort to put textual measures into dimensions because they can be
correlated more effectively with the other textual dimension attributes and
will consume much less space. We do not store redundant textual information
in fact tables. Unless the text is unique for every row in the fact table, it belongs
in the dimension table. A true text fact is rare in a data warehouse because the
unpredictable content of a text fact, like a free text comment, makes it nearly
impossible to analyze.
In our sample fact table (see Figure 1.2), if there is no sales activity on a given
day in a given store for a given product, we leave the row out of the table. It is
very important that we do not try to fill the fact table with zeros representing
nothing happening because these zeros would overwhelm most of our fact
tables. By only including true activity, fact tables tend to be quite sparse.
Despite their sparsity, fact tables usually make up 90 percent or more of the
total space consumed by a dimensional database. Fact tables tend to be deep in
terms of the number of rows but narrow in terms of the number of columns.
Given their size, we are judicious about fact table space utilization.
As we develop the examples in this book, we will see that all fact table grains
fall into one of three categories: transaction, periodic snapshot, and accumu-
lating snapshot. Transaction grain fact tables are among the most common. We
will introduce transaction fact tables in Chapter 2, periodic snapshots in Chap-
ter 3, and accumulating snapshots in Chapter 5.
All fact tables have two or more foreign keys, as designated by the FK notation
in Figure 1.2, that connect to the dimension tables’ primary keys. For example,
the product key in the fact table always will match a specific product key in the
product dimension table. When all the keys in the fact table match their respec-
tive primary keys correctly in the corresponding dimension tables, we say that
the tables satisfy referential integrity. We access the fact table via the dimension
tables joined to it.
The fact table itself generally has its own primary key made up of a subset of
the foreign keys. This key is often called a composite or concatenated key. Every
fact table in a dimensional model has a composite key, and conversely, every
table that has a composite key is a fact table. Another way to say this is that in
a dimensional model, every table that expresses a many-to-many relationship
must be a fact table. All other tables are dimension tables.
Fact tables express the many-to-many relationships between dimensions in dimen-
sional models.
Only a subset of the components in the fact table composite key typically is
needed to guarantee row uniqueness. There are usually about a half dozen
dimensions that have robust many-to-many relationships with each other and
uniquely identify each row. Sometimes there are as few as two dimensions,
such as the invoice number and the product key. Once this subset has been
identified, the rest of the dimensions take on a single value in the context of the
fact table row’s primary key. In other words, they go along for the ride. In most
cases, there is no advantage to introducing a unique ROWID key to serve as
the primary key in the fact table. Doing so makes your fact table larger, while
any index on this artificial ROWID primary key would be worthless. However,
such a key may be required to placate the database management system, espe-
cially if you can legitimately, from a business perspective, load multiple iden-
tical rows into the fact table.
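The foreign keys, referential integrity, and composite primary key described above can be sketched as DDL. The following uses SQLite for illustration; the table and column names follow Figure 1.2, while the exact data types are our assumption:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce referential integrity
conn.executescript("""
CREATE TABLE date_dimension    (date_key    INTEGER PRIMARY KEY);
CREATE TABLE product_dimension (product_key INTEGER PRIMARY KEY);
CREATE TABLE store_dimension   (store_key   INTEGER PRIMARY KEY);

-- The fact table's own primary key is a composite of its foreign keys.
CREATE TABLE daily_sales_facts (
    date_key            INTEGER NOT NULL REFERENCES date_dimension,
    product_key         INTEGER NOT NULL REFERENCES product_dimension,
    store_key           INTEGER NOT NULL REFERENCES store_dimension,
    quantity_sold       INTEGER,
    dollar_sales_amount REAL,
    PRIMARY KEY (date_key, product_key, store_key));
""")
conn.execute("INSERT INTO date_dimension VALUES (20020101)")
conn.execute("INSERT INTO product_dimension VALUES (1)")
conn.execute("INSERT INTO store_dimension VALUES (10)")

# This row matches a primary key in every dimension table, so the
# schema satisfies referential integrity.
conn.execute("INSERT INTO daily_sales_facts VALUES (20020101, 1, 10, 3, 7.5)")

# A row pointing at a nonexistent product key is rejected.
try:
    conn.execute("INSERT INTO daily_sales_facts VALUES (20020101, 99, 10, 1, 2.0)")
except sqlite3.IntegrityError:
    pass  # foreign key constraint failed, as expected
```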
Dimension Tables
Dimension tables are integral companions to a fact table. The dimension tables
contain the textual descriptors of the business, as illustrated in Figure 1.3. In a
well-designed dimensional model, dimension tables have many columns or
attributes. These attributes describe the rows in the dimension table. We strive
to include as many meaningful textlike descriptions as possible. It is not
uncommon for a dimension table to have 50 to 100 attributes. Dimension
tables tend to be relatively shallow in terms of the number of rows (often far
fewer than 1 million rows) but are wide with many large columns. Each
dimension is defined by its single primary key, designated by the PK notation
in Figure 1.3, which serves as the basis for referential integrity with any given
fact table to which it is joined.
Dimension attributes serve as the primary source of query constraints, group-
ings, and report labels. In a query or report request, attributes are identified as
the by words. For example, when a user states that he or she wants to see dollar
sales by week by brand, week and brand must be available as dimension attributes.
Dimension table attributes play a vital role in the data warehouse. Since they
are the source of virtually all interesting constraints and report labels, they are
key to making the data warehouse usable and understandable. In many ways,
the data warehouse is only as good as the dimension attributes. The power of
the data warehouse is directly proportional to the quality and depth of the
dimension attributes. The more time spent providing attributes with verbose
business terminology, the better the data warehouse is. The more time spent
populating the values in an attribute column, the better the data warehouse is.
The more time spent ensuring the quality of the values in an attribute column,
the better the data warehouse is.
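The by words translate directly into grouping columns. A small sketch of "dollar sales by week by brand" follows, again using SQLite; the brand values echo the sample report in Figure 1.5, but the week column and data values are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dimension    (date_key INTEGER PRIMARY KEY, week TEXT);
CREATE TABLE product_dimension (product_key INTEGER PRIMARY KEY,
                                brand_description TEXT);
CREATE TABLE sales_facts (date_key INTEGER, product_key INTEGER,
                          dollar_sales_amount REAL);
INSERT INTO date_dimension VALUES (1, 'W01'), (2, 'W01'), (3, 'W02');
INSERT INTO product_dimension VALUES (1, 'Clean Fast'), (2, 'More Power');
INSERT INTO sales_facts VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 8.0);
""")

# "Dollar sales by week by brand" -> GROUP BY week, brand.
report = list(conn.execute(
    """SELECT d.week, p.brand_description, SUM(f.dollar_sales_amount)
       FROM sales_facts f
       JOIN date_dimension d ON f.date_key = d.date_key
       JOIN product_dimension p ON f.product_key = p.product_key
       GROUP BY d.week, p.brand_description"""))
```

The dimension attributes (week, brand) supply the report's row labels, while the fact table supplies the summed numeric values.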
Dimension tables are the entry points into the fact table. Robust dimension attrib-
utes deliver robust analytic slicing and dicing capabilities. The dimensions imple-
ment the user interface to the data warehouse.
The best attributes are textual and discrete. Attributes should consist of real
words rather than cryptic abbreviations. Typical attributes for a product
dimension would include a short description (10 to 15 characters), a long
description (30 to 50 characters), a brand name, a category name, packaging
type, size, and numerous other product characteristics. Although the size is
probably numeric, it is still a dimension attribute because it behaves more like
a textual description than like a numeric measurement. Size is a discrete and
constant descriptor of a specific product.
Sometimes when we are designing a database it is unclear whether a numeric
data field extracted from a production data source is a fact or dimension
attribute. We often can make the decision by asking whether the field is a mea-
surement that takes on lots of values and participates in calculations (making
it a fact) or is a discretely valued description that is more or less constant and
participates in constraints (making it a dimensional attribute). For example,
the standard cost for a product seems like a constant attribute of the product
but may be changed so often that eventually we decide that it is more like a
measured fact. Occasionally, we can’t be certain of the classification. In such
cases, it may be possible to model the data field either way, as a matter of
designer’s prerogative.
Figure 1.3 Sample dimension table: the Product Dimension Table, with Product Key
(PK), Product Description, SKU Number (Natural Key), Brand Description, Category
Description, Department Description, Package Type Description, Package Size, Fat
Content Description, Diet Type Description, Weight Units of Measure, Storage Type,
Shelf Life Type, Shelf Width, Shelf Height, Shelf Depth, and many more.
We strive to minimize the use of codes in our dimension tables by replacing
them with more verbose textual attributes. We understand that you may have
already trained the users to make sense of operational codes, but going for-
ward, we’d like to minimize their reliance on miniature notes attached to their
computer monitor for code translations. We want to have standard decodes for
the operational codes available as dimension attributes so that the labeling on
data warehouse queries and reports is consistent. We don’t want to encourage
decodes buried in our reporting applications, where inconsistency is
inevitable. Sometimes operational codes or identifiers have legitimate busi-
ness significance to users or are required to communicate back to the opera-
tional world. In these cases, the codes should appear as explicit dimension
attributes, in addition to the corresponding user-friendly textual descriptors.
We have identified operational, natural keys in the dimension figures, as
appropriate, throughout this book.
Operational codes often have intelligence embedded in them. For example, the
first two digits may identify the line of business, whereas the next two digits
may identify the global region. Rather than forcing users to interrogate or fil-
ter on the operational code, we pull out the embedded meanings and present
them to users as separate dimension attributes that can be filtered, grouped, or
reported on easily.
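This decoding is typically done once in the ETL process. A minimal sketch follows; the code layout, lookup values, and function name are all hypothetical, invented only to show the embedded digits becoming explicit attributes:

```python
# Hypothetical smart operational code: first two digits identify the
# line of business, the next two the global region.
LINE_OF_BUSINESS = {"01": "Grocery", "02": "Hardware"}
GLOBAL_REGION = {"10": "North America", "20": "Europe"}

def decode_product_code(code: str) -> dict:
    """Split a smart code into separate, filterable dimension attributes."""
    return {
        "operational_code": code,  # keep the original code as well
        "line_of_business": LINE_OF_BUSINESS[code[0:2]],
        "global_region": GLOBAL_REGION[code[2:4]],
    }

row = decode_product_code("0120A734")
# row["line_of_business"] == "Grocery", row["global_region"] == "Europe"
```

Users can now filter or group on the decoded attributes directly instead of interrogating substrings of the raw code.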
Dimension tables often represent hierarchical relationships in the business. In
our sample product dimension table, products roll up into brands and then
into categories. For each row in the product dimension, we store the brand and
category description associated with each product. We realize that the hierar-
chical descriptive information is stored redundantly, but we do so in the spirit
of ease of use and query performance. We resist our natural urge to store only
the brand code in the product dimension and create a separate brand lookup
table. This would be called a snowflake. Dimension tables typically are highly
denormalized. They are usually quite small (less than 10 percent of the total
data storage requirements). Since dimension tables typically are geometrically
smaller than fact tables, improving storage efficiency by normalizing or
snowflaking has virtually no impact on the overall database size. We almost
always trade off dimension table space for simplicity and accessibility.
Bringing Together Facts and Dimensions
Now that we understand fact and dimension tables, let’s bring the two build-
ing blocks together in a dimensional model. As illustrated in Figure 1.4, the
fact table consisting of numeric measurements is joined to a set of dimension
tables filled with descriptive attributes. This characteristic starlike structure is
often called a star join schema. This term dates back to the earliest days of rela-
tional databases.
Figure 1.4 Fact and dimension tables in a dimensional model: the Daily Sales Facts
table, with Date, Product, and Store foreign keys, joined to the Date, Product, and
Store dimension tables.
The first thing we notice about the resulting dimensional schema is its sim-
plicity and symmetry. Obviously, business users benefit from the simplicity
because the data is easier to understand and navigate. The charm of the design
in Figure 1.4 is that it is highly recognizable to business users. We have
observed literally hundreds of instances where users agree immediately that
the dimensional model is their business. Furthermore, the reduced number of
tables and use of meaningful business descriptors make it less likely that mis-
takes will occur.
The simplicity of a dimensional model also has performance benefits. Data-
base optimizers will process these simple schemas more efficiently with fewer
joins. A database engine can make very strong assumptions about first con-
straining the heavily indexed dimension tables, and then attacking the fact
table all at once with the Cartesian product of the dimension table keys satis-
fying the user’s constraints. Amazingly, using this approach it is possible to
evaluate arbitrary n-way joins to a fact table in a single pass through the fact
table’s index.
Finally, dimensional models are gracefully extensible to accommodate change.
The predictable framework of a dimensional model withstands unexpected
changes in user behavior. Every dimension is equivalent; all dimensions are
symmetrically equal entry points into the fact table. The logical model has no
built-in bias regarding expected query patterns. There are no preferences for
the business questions we’ll ask this month versus the questions we’ll ask next
month. We certainly don’t want to adjust our schemas if business users come
up with new ways to analyze the business.
We will see repeatedly in this book that the most granular or atomic data has
the most dimensionality. Atomic data that has not been aggregated is the
most expressive data; this atomic data should be the foundation for every
fact table design in order to withstand business users’ ad hoc attacks where
they pose unexpected queries. With dimensional models, we can add com-
pletely new dimensions to the schema as long as a single value of that
dimension is defined for each existing fact row. Likewise, we can add new,
unanticipated facts to the fact table, assuming that the level of detail is con-
sistent with the existing fact table. We can supplement preexisting dimension
tables with new, unanticipated attributes. We also can break existing dimen-
sion rows down to a lower level of granularity from a certain point in time
forward. In each case, existing tables can be changed in place either simply
by adding new data rows in the table or by executing an SQL ALTER TABLE
command. Data would not have to be reloaded. All existing data access
applications continue to run without yielding different results. We’ll
examine this graceful extensibility of our dimensional models more fully in
Chapter 2.
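Graceful extensibility is easy to demonstrate. The sketch below uses SQLite's ALTER TABLE with illustrative names: a new, unanticipated attribute is added in place, no data is reloaded, and a preexisting query still yields the same result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_dimension "
             "(product_key INTEGER PRIMARY KEY, brand_description TEXT)")
conn.execute("INSERT INTO product_dimension VALUES (1, 'Clean Fast')")

# A preexisting data access query, run before the schema change.
old = conn.execute("SELECT brand_description FROM product_dimension").fetchall()

# Add a new, unanticipated attribute in place; no reload required.
conn.execute("ALTER TABLE product_dimension "
             "ADD COLUMN package_type_description TEXT")

# The same query still runs and returns the same result.
new = conn.execute("SELECT brand_description FROM product_dimension").fetchall()
assert old == new == [('Clean Fast',)]
```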
Another way to think about the complementary nature of fact and dimension
tables is to see them translated into a report. As illustrated in Figure 1.5,
dimension attributes supply the report labeling, whereas the fact tables supply
the report’s numeric values.
Finally, as we’ve already stressed, we insist that the data in the presentation
area be dimensionally structured. However, there is a natural relationship
between dimensional and normalized models. The key to understanding the
relationship is that a single normalized ER diagram often breaks down into
multiple dimensional schemas. A large normalized model for an organization
may have sales calls, orders, shipment invoices, customer payments, and
product returns all on the same diagram. In a way, the normalized ER diagram
does itself a disservice by representing, on a single drawing, multiple business
processes that never coexist in a single data set at a single point in time. No
wonder the normalized model seems complex.
If you already have an existing normalized ER diagram, the first step in con-
verting it into a set of dimensional models is to separate the ER diagram into
its discrete business processes and then model each one separately. The second
step is to select those many-to-many relationships in the ER diagrams that con-
tain numeric and additive nonkey facts and designate them as fact tables. The
final step is to denormalize all the remaining tables into flat tables with single-
part keys that join directly to the fact tables. These tables become the dimen-
sion tables.
Figure 1.5 Dragging and dropping dimensional attributes and facts into a simple report.
Dimensional Modeling Myths
Despite the general acceptance of dimensional modeling, some mispercep-
tions continue to be disseminated in the industry. We refer to these miscon-
ceptions as dimensional modeling myths.
Myth 1. Dimensional models and data marts are for summary data only. This first
myth is the root cause of many ill-designed dimensional models. Because
we can’t possibly predict all the questions asked by business users, we
need to provide them with queryable access to the most detailed data so
that they can roll it up based on the business question at hand. Data at the
lowest level of detail is practically impervious to surprises or changes. Our
data marts also will include commonly requested summarized data in
dimensional schemas. This summary data should complement the granu-
lar detail solely to provide improved performance for common queries, but
not attempt to serve as a replacement for the details.
A related corollary to this first myth is that only a limited amount of histor-
ical data should be stored in dimensional structures. There is nothing
about a dimensional model that prohibits the storage of substantial history.
The amount of history available in data marts must be driven by the busi-
ness’s requirements.
Myth 2. Dimensional models and data marts are departmental, not enterprise, solu-
tions. Rather than drawing boundaries based on organizational depart-
ments, we maintain that data marts should be organized around business
processes, such as orders, invoices, and service calls. Multiple business
functions often want to analyze the same metrics resulting from a single
business process. We strive to avoid duplicating the core measurements in
multiple databases around the organization.
Supporters of the normalized data warehouse approach sometimes draw
spiderweb diagrams with multiple extracts from the same source feeding
into multiple data marts. The illustration supposedly depicts the perils of
proceeding without a normalized data warehouse to feed the data marts.
These supporters caution about increased costs and potential inconsisten-
cies as changes in the source system of record would need to be rippled to
each mart’s ETL process.
This argument falls apart because no one advocates multiple extracts from
the same source. The spiderweb diagrams fail to appreciate that the data
marts are process-centric, not department-centric, and that the data is
extracted once from the operational source and presented in a single place.
Clearly, the operational system support folks would frown on the multiple-
extract approach. So do we.
Myth 3. Dimensional models and data marts are not scalable. Modern fact tables
have many billions of rows in them. The dimensional models within our
data marts are extremely scalable. Relational DBMS vendors have
embraced data warehousing and incorporated numerous capabilities into
their products to optimize the scalability and performance of dimensional models.
A corollary to myth 3 is that dimensional models are only appropriate for
retail or sales data. This notion is rooted in the historical origins of dimen-
sional modeling but not in its current-day reality. Dimensional modeling
has been applied to virtually every industry, including banking, insurance,
brokerage, telephone, newspaper, oil and gas, government, manufacturing,
travel, gaming, health care, education, and many more. In this book we use
the retail industry to illustrate several early concepts mainly because it is
an industry to which we have all been exposed; however, these concepts
are extremely transferable to other businesses.
Myth 4. Dimensional models and data marts are only appropriate when there is a
predictable usage pattern. A related corollary is that dimensional models
aren’t responsive to changing business needs. On the contrary, because of
their symmetry, the dimensional structures in our data marts are extremely
flexible and adaptive to change. The secret to query flexibility is building
the fact tables at the most granular level. In our opinion, the source of
myth 4 is the designer struggling with fact tables that have been prema-
turely aggregated based on the designer’s unfortunate belief in myth 1
regarding summary data. Dimensional models that only deliver summary
data are bound to be problematic. Users run into analytic brick walls when
they try to drill down into details not available in the summary tables.
Developers also run into brick walls because they can’t easily accommo-
date new dimensions, attributes, or facts with these prematurely summa-
rized tables. The correct starting point for your dimensional models is to
express data at the lowest detail possible for maximum flexibility and extensibility.
Myth 5. Dimensional models and data marts can’t be integrated and therefore lead
to stovepipe solutions. Dimensional models and data marts most certainly
can be integrated if they conform to the data warehouse bus architecture.
Presentation area databases that don’t adhere to the data warehouse bus
architecture will lead to standalone solutions. You can’t hold dimensional
modeling responsible for the failure of some organizations to embrace one
of its fundamental tenets.
Common Pitfalls to Avoid
While we can provide positive recommendations about dimensional data
warehousing, some readers better relate to a listing of common pitfalls or traps
into which others have already stepped. Borrowing from a popular late-night
television show, here is our favorite top 10 list of common errors to avoid while
building your data warehouse. These are all quite lethal errors—one alone
may be sufficient to bring down your data warehouse initiative. We’ll further
elaborate on these in Chapter 16; however, we wanted to plant the seeds early
on while we have your complete attention.
Pitfall 10. Become overly enamored with technology and data rather than
focusing on the business’s requirements and goals.
Pitfall 9. Fail to embrace or recruit an influential, accessible, and reasonable
management visionary as the business sponsor of the data warehouse.
Pitfall 8. Tackle a galactic multiyear project rather than pursuing more man-
ageable, while still compelling, iterative development efforts.
Pitfall 7. Allocate energy to construct a normalized data structure, yet run
out of budget before building a viable presentation area based on dimen-
sional models.
Pitfall 6. Pay more attention to backroom operational performance and ease
of development than to front-room query performance and ease of use.
Pitfall 5. Make the supposedly queryable data in the presentation area overly
complex. Database designers who prefer a more complex presentation
should spend a year supporting business users; they’d develop a much
better appreciation for the need to seek simpler solutions.
Pitfall 4. Populate dimensional models on a standalone basis without regard
to a data architecture that ties them together using shared, conformed dimensions.
Pitfall 3. Load only summarized data into the presentation area’s dimen-
sional structures.
Pitfall 2. Presume that the business, its requirements and analytics, and the
underlying data and the supporting technology are static.
Pitfall 1. Neglect to acknowledge that data warehouse success is tied directly
to user acceptance. If the users haven’t accepted the data warehouse as a
foundation for improved decision making, then your efforts have been
exercises in futility.
In this chapter we discussed the overriding goals for the data warehouse and
the differences between data warehouses and operational source systems. We
explored the major components of the data warehouse and discussed the per-
missible role of normalized ER models in the staging area, but not as the end
goal. We then focused our attention on dimensional modeling for the presen-
tation area and established preliminary vocabulary regarding facts and
dimensions. Stay tuned as we put these concepts into action in our first case
study in the next chapter.
Dimensional Modeling Primer 27
Retail Sales
The best way to understand the principles of dimensional modeling is to work
through a series of tangible examples. By visualizing real cases, we can hold
the particular design challenges and solutions in our minds much more effectively than if they are presented abstractly. In this book we will develop examples from a range of businesses to help you move past the details of your own business and come up with the right design.
To learn dimensional modeling, please read all the chapters in this book, even if
you don’t manage a retail business or work for a telecommunications firm. The
chapters are not intended to be full-scale solution handbooks for a given indus-
try or business function. Each chapter is a metaphor for a characteristic set of
dimensional modeling problems that comes up in nearly every kind of busi-
ness. Universities, insurance companies, banks, and airlines alike surely will
need the techniques developed in this retail chapter. Besides, thinking about
someone else’s business is refreshing at times. It is too easy to let historical
complexities derail us when we are dealing with data from our own compa-
nies. By stepping outside our own organizations and then returning with a
well-understood design principle (or two), it is easier to remember the spirit of
the design principles as we descend into the intricate details of our businesses.
Chapter 2 discusses the following concepts:
■■ Four-step process for designing dimensional models
■■ Transaction-level fact tables
■■ Additive and non-additive facts
■■ Sample dimension table attributes
■■ Causal dimensions, such as promotion
■■ Degenerate dimensions, such as the transaction ticket number
■■ Extending an existing dimension model
■■ Snowflaking dimension attributes
■■ Avoiding the “too many dimensions” trap
■■ Surrogate keys
■■ Market basket analysis
Four-Step Dimensional Design Process
Throughout this book we will approach the design of a dimensional database
by consistently considering four steps in a particular order. The meaning of
these four steps will become more obvious as we proceed with the various
designs, but we’ll provide initial definitions at this time.
1. Select the business process to model. A process is a natural business activ-
ity performed in your organization that typically is supported by a source
data-collection system. Listening to your users is the most efficient means
for selecting the business process. The performance measurements that
they clamor to analyze in the data warehouse result from business mea-
surement processes. Example business processes include raw materials
purchasing, orders, shipments, invoicing, inventory, and general ledger.
It is important to remember that we’re not referring to an organizational
business department or function when we talk about business processes.
For example, we’d build a single dimensional model to handle orders
data rather than building separate models for the sales and marketing
departments, which both want to access orders data. By focusing on busi-
ness processes, rather than on business departments, we can deliver con-
sistent information more economically throughout the organization. If we
establish departmentally bound dimensional models, we’ll inevitably
duplicate data with different labels and terminology. Multiple data flows
into separate dimensional models will make us vulnerable to data incon-
sistencies. The best way to ensure consistency is to publish the data once.
A single publishing run also reduces the extract-transformation-load
(ETL) development effort, as well as the ongoing data management and
disk storage burden.
2. Declare the grain of the business process. Declaring the grain means speci-
fying exactly what an individual fact table row represents. The grain con-
veys the level of detail associated with the fact table measurements. It
provides the answer to the question, “How do you describe a single row
in the fact table?”
Example grain declarations include:
■■ An individual line item on a customer’s retail sales ticket as measured
by a scanner device
■■ A line item on a bill received from a doctor
■■ An individual boarding pass to get on a flight
■■ A daily snapshot of the inventory levels for each product in a warehouse
■■ A monthly snapshot for each bank account
Data warehouse teams often try to bypass this seemingly unnecessary
step of the process. Please don’t! It is extremely important that everyone
on the design team is in agreement regarding the fact table granularity.
It is virtually impossible to reach closure in step 3 without declaring the
grain. We also should warn you that an inappropriate grain declaration
will haunt a data warehouse implementation. Declaring the grain is a crit-
ical step that can’t be taken lightly. Having said this, you may discover in
steps 3 or 4 that the grain statement is wrong. This is okay, but then you
must return to step 2, redeclare the grain correctly, and revisit steps 3 and
4 again.
3. Choose the dimensions that apply to each fact table row. Dimensions fall
out of the question, “How do businesspeople describe the data that results
from the business process?” We want to decorate our fact tables with a
robust set of dimensions representing all possible descriptions that take
on single values in the context of each measurement. If we are clear about
the grain, then the dimensions typically can be identified quite easily.
With the choice of each dimension, we will list all the discrete, textlike
attributes that will flesh out each dimension table. Examples of common
dimensions include date, product, customer, transaction type, and status.
4. Identify the numeric facts that will populate each fact table row. Facts are
determined by answering the question, “What are we measuring?” Business
users are keenly interested in analyzing these business process performance
measures. All candidate facts in a design must be true to the grain defined in
step 2. Facts that clearly belong to a different grain must be in a separate fact
table. Typical facts are numeric additive figures such as quantity ordered or
dollar cost amount.
Retail Sales 31
Throughout this book we will keep these four steps in mind as we develop
each of the case studies. We’ll apply a user’s understanding of the business to
decide what dimensions and facts are needed in the dimensional model.
Clearly, we need to consider both our business users’ requirements and the
realities of our source data in tandem to make decisions regarding the four
steps, as illustrated in Figure 2.1. We strongly encourage you to resist the
temptation to model the data by looking at source data files alone. We realize
that it may be much less intimidating to dive into the file layouts and copy-
books rather than interview a businessperson; however, they are no substitute
for user input. Unfortunately, many organizations have attempted this path-
of-least-resistance data-driven approach, but without much success.
Retail Case Study
Let’s start with a brief description of the retail business that we’ll use in this
case study to make dimension and fact tables more understandable. We begin
with this industry because it is one to which we can all relate. Imagine that we
work in the headquarters of a large grocery chain. Our business has 100 gro-
cery stores spread over a five-state area. Each of the stores has a full comple-
ment of departments, including grocery, frozen foods, dairy, meat, produce,
bakery, floral, and health/beauty aids. Each store has roughly 60,000 individ-
ual products on its shelves. The individual products are called stock keeping
units (SKUs). About 55,000 of the SKUs come from outside manufacturers and
have bar codes imprinted on the product package. These bar codes are called
universal product codes (UPCs). UPCs are at the same grain as individual SKUs.
Each different package variation of a product has a separate UPC and hence is
a separate SKU.
Figure 2.1 Key input to the four-step dimensional design process. (The figure shows business requirements and data realities feeding the dimensional model: 1. Business Process, 2. Grain, 3. Dimensions, 4. Facts.)
The remaining 5,000 SKUs come from departments such as meat, produce,
bakery, or floral. While these products don’t have nationally recognized UPCs,
the grocery chain assigns SKU numbers to them. Since our grocery chain is
highly automated, we stick scanner labels on many of the items in these other
departments. Although the bar codes are not UPCs, they are certainly SKU numbers.
Data is collected at several interesting places in a grocery store. Some of the
most useful data is collected at the cash registers as customers purchase prod-
ucts. Our modern grocery store scans the bar codes directly into the point-of-
sale (POS) system. The POS system is at the front door of the grocery store
where consumer takeaway is measured. The back door, where vendors make
deliveries, is another interesting data-collection point.
At the grocery store, management is concerned with the logistics of ordering,
stocking, and selling products while maximizing profit. The profit ultimately
comes from charging as much as possible for each product, lowering costs for
product acquisition and overhead, and at the same time attracting as many
customers as possible in a highly competitive pricing environment. Some of
the most significant management decisions have to do with pricing and pro-
motions. Both store management and headquarters marketing spend a great
deal of time tinkering with pricing and promotions. Promotions in a grocery
store include temporary price reductions, ads in newspapers and newspaper
inserts, displays in the grocery store (including end-aisle displays), and
coupons. The most direct and effective way to create a surge in the volume of
product sold is to lower the price dramatically. A 50-cent reduction in the price
of paper towels, especially when coupled with an ad and display, can cause
the sale of the paper towels to jump by a factor of 10. Unfortunately, such a big
price reduction usually is not sustainable because the towels probably are
being sold at a loss. As a result of these issues, the visibility of all forms of pro-
motion is an important part of analyzing the operations of a grocery store.
Now that we have described our business case study, we’ll begin to design the
dimensional model.
Step 1. Select the Business Process
The first step in the design is to decide what business process(es) to model by
combining an understanding of the business requirements with an under-
standing of the available data.
The first dimensional model built should be the one with the most impact—it should
answer the most pressing business questions and be readily accessible for data collection.
In our retail case study, management wants to better understand customer
purchases as captured by the POS system. Thus the business process we’re
going to model is POS retail sales. This data will allow us to analyze what
products are selling in which stores on what days under what promotional conditions.
Step 2. Declare the Grain
Once the business process has been identified, the data warehouse team faces
a serious decision about the granularity. What level of data detail should be
made available in the dimensional model? This brings us to an important
design tip.
Preferably you should develop dimensional models for the most atomic information
captured by a business process. Atomic data is the most detailed information col-
lected; such data cannot be subdivided further.
Tackling data at its lowest, most atomic grain makes sense on multiple fronts.
Atomic data is highly dimensional. The more detailed and atomic the fact
measurement, the more things we know for sure. All those things we know for
sure translate into dimensions. In this regard, atomic data is a perfect match
for the dimensional approach.
Atomic data provides maximum analytic flexibility because it can be con-
strained and rolled up in every way possible. Detailed data in a dimensional
model is poised and ready for the ad hoc attack by business users.
Of course, you can always declare higher-level grains for a business process
that represent an aggregation of the most atomic data. However, as soon as we
select a higher-level grain, we’re limiting ourselves to fewer and/or poten-
tially less detailed dimensions. The less granular model is immediately vul-
nerable to unexpected user requests to drill down into the details. Users
inevitably run into an analytic wall when not given access to the atomic data.
As we’ll see in Chapter 16, aggregated summary data plays an important role
as a performance-tuning tool, but it is not a substitute for giving users access
to the lowest-level details. Unfortunately, some industry pundits have been
confused on this point. They claim that dimensional models are only appro-
priate for summarized data and then criticize the dimensional modeling
approach for its supposed need to anticipate the business question. This mis-
understanding goes away when detailed, atomic data is made available in a
dimensional model.
In our case study, the most granular data is an individual line item on a POS
transaction. To ensure maximum dimensionality and flexibility, we will proceed
with this grain. It is worth noting that this granularity declaration represents a
change from the first edition of this text. Previously, we focused on POS data, but
rather than representing transaction line item detail in the dimensional model,
we elected to provide sales data rolled up by product and promotion in a store
on a day. At the time, these daily product totals represented the state of the art
for syndicated retail sales databases. It was unreasonable to expect then-current
hardware and software to deal effectively with the volumes of data associated
with individual POS transaction line items.
Providing access to the POS transaction information gives us a very
detailed look at store sales. While users probably are not interested in analyz-
ing single items associated with a specific POS transaction, we can’t predict all
the ways that they’ll want to cull through that data. For example, they may
want to understand the difference in sales on Monday versus Sunday. Or they
may want to assess whether it’s worthwhile to stock so many individual sizes
of certain brands, such as cereal. Or they may want to understand how many
shoppers took advantage of the 50-cents-off promotion on shampoo. Or they
may want to determine the impact in terms of decreased sales when a com-
petitive diet soda product was promoted heavily. While none of these queries
calls for data from one specific transaction, they are broad questions that
require detailed data sliced in very precise ways. None of them could have
been answered if we elected only to provide access to summarized data.
A data warehouse almost always demands data expressed at the lowest possible
grain of each dimension not because queries want to see individual low-level rows,
but because queries need to cut through the details in very precise ways.
Step 3. Choose the Dimensions
Once the grain of the fact table has been chosen, the date, product, and store
dimensions fall out immediately. We assume that the calendar date is the date
value delivered to us by the POS system. Later, we will see what to do if we
also get a time of day along with the date. Within the framework of the pri-
mary dimensions, we can ask whether other dimensions can be attributed to
the data, such as the promotion under which the product is sold. We express
this as another design principle:
A careful grain statement determines the primary dimensionality of the fact table. It
is then often possible to add more dimensions to the basic grain of the fact table,
where these additional dimensions naturally take on only one value under each
combination of the primary dimensions. If the additional dimension violates the
grain by causing additional fact rows to be generated, then the grain statement must
be revised to accommodate this dimension.
Figure 2.2 Preliminary retail sales schema.
“TBD” means “to be determined.”
In our case study we’ve decided on the following descriptive dimensions:
date, product, store, and promotion. In addition, we’ll include the POS trans-
action ticket number as a special dimension. More will be said on this later in
the chapter.
We begin to envision the preliminary schema as illustrated in Figure 2.2.
Before we delve into populating the dimension tables with descriptive attrib-
utes, let’s complete the final step of the process. We want to ensure that you’re
comfortable with the complete four-step process—we don’t want you to lose
sight of the forest for the trees at this stage of the game.
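The preliminary schema of Figure 2.2 can also be expressed as DDL. The following is a minimal sketch using Python's sqlite3 module; the table and column names follow the figure, while the data types and snake_case spellings are illustrative assumptions, not the book's prescription:

```python
import sqlite3

# In-memory database purely for illustration; any relational engine would do.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dimension (
    date_key INTEGER PRIMARY KEY
    -- date attributes TBD
);
CREATE TABLE product_dimension (
    product_key INTEGER PRIMARY KEY
    -- product attributes TBD
);
CREATE TABLE store_dimension (
    store_key INTEGER PRIMARY KEY
    -- store attributes TBD
);
CREATE TABLE promotion_dimension (
    promotion_key INTEGER PRIMARY KEY
    -- promotion attributes TBD
);
-- The fact table carries one foreign key per dimension plus the
-- POS transaction number; the measured facts are still TBD.
CREATE TABLE pos_retail_sales_transaction_fact (
    date_key INTEGER NOT NULL REFERENCES date_dimension (date_key),
    product_key INTEGER NOT NULL REFERENCES product_dimension (product_key),
    store_key INTEGER NOT NULL REFERENCES store_dimension (store_key),
    promotion_key INTEGER NOT NULL REFERENCES promotion_dimension (promotion_key),
    pos_transaction_number INTEGER NOT NULL
    -- facts TBD
);
""")
```

Note how the star shape falls directly out of the four-step process: one fact table at the declared grain, surrounded by the chosen dimensions.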
Step 4. Identify the Facts
The fourth and final step in the design is to make a careful determination of
which facts will appear in the fact table. Again, the grain declaration helps
anchor our thinking. Simply put, the facts must be true to the grain: the indi-
vidual line item on the POS transaction in this case. When considering poten-
tial facts, you again may discover that adjustments need to be made to either
our earlier grain assumptions or our choice of dimensions.
The facts collected by the POS system include the sales quantity (e.g., the num-
ber of cans of chicken noodle soup), per unit sales price, and the sales dollar
amount. The sales dollar amount equals the sales quantity multiplied by the unit
price. More sophisticated POS systems also provide a standard dollar cost for
the product as delivered to the store by the vendor. Presuming that this cost fact
is readily available and doesn’t require a heroic activity-based costing initiative,
we’ll include it in the fact table. Our fact table begins to take shape in Figure 2.3.
Three of the facts, sales quantity, sales dollar amount, and cost dollar amount,
are beautifully additive across all the dimensions. We can slice and dice the fact
table with impunity, and every sum of these three facts is valid and correct.
(Figure 2.2 detail: the POS Retail Sales Transaction Fact table contains Date Key (FK), Product Key (FK), Store Key (FK), Promotion Key (FK), POS Transaction Number, and Facts TBD; it joins to the Date, Product, Store, and Promotion dimension tables, each with its own primary key and Attributes TBD.)
Figure 2.3 Measured facts in the retail sales schema.
We can compute the gross profit by subtracting the cost dollar amount from
the sales dollar amount, or revenue. Although computed, this gross profit is
also perfectly additive across all the dimensions—we can calculate the gross
profit of any combination of products sold in any set of stores on any set of
days. Dimensional modelers sometimes question whether a calculated fact
should be stored physically in the database. We generally recommend that it
be stored physically. In our case study, the gross profit calculation is straight-
forward, but storing it eliminates the possibility of user error. The cost of a user
incorrectly representing gross profit overwhelms the minor incremental stor-
age cost. Storing it also ensures that all users and their reporting applications
refer to gross profit consistently. Since gross profit can be calculated from adja-
cent data within a fact table row, some would argue that we should perform
the calculation in a view that is indistinguishable from the table. This is a rea-
sonable approach if all users access the data via this view and no users with ad
hoc query tools can sneak around the view to get at the physical table. Views
are a reasonable way to minimize user error while saving on storage, but the
DBA must allow no exceptions to data access through the view. Likewise,
some organizations want to perform the calculation in the query tool. Again,
this works if all users access the data using a common tool (which is seldom
the case in our experience).
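The view-based alternative can be sketched in a few lines. This is a minimal SQLite illustration of the approach, not a reference implementation; the table and column names are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_fact (
    date_key INTEGER,
    product_key INTEGER,
    sales_quantity INTEGER,
    sales_dollar_amount REAL,
    cost_dollar_amount REAL
);
-- The view derives gross profit from adjacent columns in the same row,
-- so to query tools it is indistinguishable from a stored fact.
CREATE VIEW sales_fact_v AS
SELECT date_key,
       product_key,
       sales_quantity,
       sales_dollar_amount,
       cost_dollar_amount,
       sales_dollar_amount - cost_dollar_amount AS gross_profit_dollar_amount
FROM sales_fact;
""")
conn.execute("INSERT INTO sales_fact VALUES (20020101, 1, 2, 7.98, 5.50)")
row = conn.execute(
    "SELECT gross_profit_dollar_amount FROM sales_fact_v"
).fetchone()
```

The caveat in the text applies: this only eliminates user error if every access path goes through `sales_fact_v` rather than the base table.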
The gross margin can be calculated by dividing the gross profit by the dollar
revenue. Gross margin is a nonadditive fact because it can’t be summarized
along any dimension. We can calculate the gross margin of any set of products,
stores, or days by remembering to add the revenues and costs before dividing.
This can be stated as a design principle:
Percentages and ratios, such as gross margin, are nonadditive. The numerator and
denominator should be stored in the fact table. The ratio can be calculated in a data
access tool for any slice of the fact table by remembering to calculate the ratio of
the sums, not the sum of the ratios.
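The ratio-of-the-sums rule is easy to demonstrate with two hypothetical fact rows:

```python
# Two hypothetical fact rows: (gross profit dollars, revenue dollars).
rows = [(2.0, 10.0), (6.0, 12.0)]

# Correct: sum the numerator and denominator first, then divide.
margin = sum(p for p, r in rows) / sum(r for p, r in rows)   # 8 / 22

# Wrong: averaging the per-row margins gives a different, misleading answer.
naive = sum(p / r for p, r in rows) / len(rows)              # (0.2 + 0.5) / 2
```

Here the correct margin is 8/22, about 36.4 percent, while the sum-of-the-ratios approach yields 35 percent; with more skewed rows the gap can be far larger.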
(Figure 2.3 detail: the POS Retail Sales Transaction Fact table now contains Date Key (FK), Product Key (FK), Store Key (FK), Promotion Key (FK), POS Transaction Number, Sales Quantity, Sales Dollar Amount, Cost Dollar Amount, and Gross Profit Dollar Amount, joined to the same four dimension tables with attributes TBD.)
Unit price is also a nonadditive fact. Attempting to sum up unit price across any
of the dimensions results in a meaningless, nonsensical number. In order to ana-
lyze the average selling price for a product in a series of stores or across a period
of time, we must add up the sales dollars and sales quantities before dividing the
total dollars by the total quantity sold. Every report writer or query tool in the
data warehouse marketplace should automatically perform this function cor-
rectly, but unfortunately, some still don’t handle it very gracefully.
At this early stage of the design, it is often helpful to estimate the number of
rows in our largest table, the fact table. In our case study, it simply may be a
matter of talking with a source system guru to understand how many POS
transaction line items are generated on a periodic basis. Retail traffic fluctuates
significantly from day to day, so we’ll want to understand the transaction activ-
ity over a reasonable period of time. Alternatively, we could estimate the num-
ber of rows added to the fact table annually by dividing the chain’s annual
gross revenue by the average item selling price. Assuming that gross revenues
are $4 billion per year and that the average price of an item on a customer ticket
is $2.00, we calculate that there are approximately 2 billion transaction line
items per year. This is a typical engineer’s estimate that gets us surprisingly
close to sizing a design directly from our armchairs. As designers, we always
should be triangulating to determine whether our calculations are reasonable.
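That armchair estimate reduces to a single division, using the figures assumed above:

```python
annual_gross_revenue = 4_000_000_000   # $4 billion per year (assumed)
average_item_price = 2.00              # dollars per line item (assumed)

# Rows added to the fact table per year, one per transaction line item.
line_items_per_year = annual_gross_revenue / average_item_price
```

The result, roughly 2 billion rows per year, can then be triangulated against the source system guru's transaction counts.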
Dimension Table Attributes
Now that we’ve walked through the four-step process, let’s return to the
dimension tables and focus on filling them with robust attributes.
Date Dimension
We will start with the date dimension. The date dimension is the one dimen-
sion nearly guaranteed to be in every data mart because virtually every data
mart is a time series. In fact, date is usually the first dimension in the underly-
ing sort order of the database so that the successive loading of time intervals of
data is placed into virgin territory on the disk.
For readers of the first edition of The Data Warehouse Toolkit (Wiley 1996), this
dimension was referred to as the time dimension in that text. Rather than stick-
ing with that more ambiguous nomenclature, we use the date dimension in this
book to refer to daily-grained dimension tables. This helps distinguish the date
and time-of-day dimensions, which we’ll discuss later in this chapter.
Unlike most of our other dimensions, we can build the date dimension table in
advance. We may put 5 or 10 years of rows representing days in the table so
that we can cover the history we have stored, as well as several years in the
future. Even 10 years’ worth of days is only about 3,650 rows, which is a rela-
tively small dimension table. For a daily date dimension table in a retail envi-
ronment, we recommend the partial list of columns shown in Figure 2.4.
Each column in the date dimension table is defined by the particular day that
the row represents. The day-of-week column contains the name of the day, such
as Monday. This column would be used to create reports comparing the busi-
ness on Mondays with Sunday business. The day number in calendar month
column starts with 1 at the beginning of each month and runs to 28, 29, 30, or
31, depending on the month. This column is useful for comparing the same day
each month. Similarly, we could have a month number in year (1, ... , 12). The
day number in epoch is effectively a Julian day number (that is, a consecutive
day number starting at the beginning of some epoch). We also could include
absolute week and month number columns.
Figure 2.4 Date dimension in the retail sales schema.
(Figure detail — Date Dimension columns: Date Key (PK), Full Date Description, Day of Week, Day Number in Epoch, Week Number in Epoch, Month Number in Epoch, Day Number in Calendar Month, Day Number in Calendar Year, Day Number in Fiscal Month, Day Number in Fiscal Year, Last Day in Week Indicator, Last Day in Month Indicator, Calendar Week Ending Date, Calendar Week Number in Year, Calendar Month Name, Calendar Month Number in Year, Calendar Year-Month (YYYY-MM), Calendar Quarter, Calendar Year-Quarter, Calendar Half Year, Calendar Year, Fiscal Week, Fiscal Week Number in Year, Fiscal Month, Fiscal Month Number in Year, Fiscal Year-Month, Fiscal Quarter, Fiscal Year-Quarter, Fiscal Half Year, Fiscal Year, Holiday Indicator, Weekday Indicator, Selling Season, Major Event, SQL Date Stamp, … and more. The date dimension joins to the POS Retail Sales Transaction Fact table alongside the Product, Store, and Promotion dimensions.)
All these integers support simple
date arithmetic between days across year and month boundaries. For reporting,
we would want a month name with values such as January. In addition, a year-
month (YYYY-MM) column is useful as a report column header. We likely also
will want a quarter number (Q1, ... , Q4), as well as a year quarter, such as 2001-
Q4. We would have similar columns for the fiscal periods if they differ from
calendar periods.
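Because every column is a deterministic function of the day it describes, the table can be generated mechanically in advance. A sketch of a handful of the Figure 2.4 attributes in Python follows; the yyyymmdd-style integer key and the column spellings are illustrative assumptions, not the book's prescription:

```python
from datetime import date, timedelta

def date_dimension_row(d: date) -> dict:
    """Derive a subset of the Figure 2.4 attributes for a single day."""
    quarter = (d.month - 1) // 3 + 1
    return {
        "date_key": d.year * 10000 + d.month * 100 + d.day,  # illustrative key
        "full_date_description": d.strftime("%B %d, %Y"),
        "day_of_week": d.strftime("%A"),
        "day_number_in_calendar_month": d.day,
        "calendar_month_name": d.strftime("%B"),
        "calendar_year_month": d.strftime("%Y-%m"),
        "calendar_quarter": f"Q{quarter}",
        # Decoded labels, not cryptic Y/N flags, so they read well on reports.
        "weekday_indicator": "Weekend" if d.weekday() >= 5 else "Weekday",
    }

# Build 10 years of rows in advance -- still only about 3,650 of them.
start = date(2002, 1, 1)
rows = [date_dimension_row(start + timedelta(days=i)) for i in range(3650)]
```

Fiscal periods, holiday indicators, selling seasons, and major events would be merged in from business-rule tables rather than computed, since they are not derivable from the calendar alone.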
The holiday indicator takes on the values of Holiday or Nonholiday. Remem-
ber that the dimension table attributes serve as report labels. Simply populat-
ing the holiday indicator with a Y or an N would be far less useful. Imagine a
report where we’re comparing holiday sales for a given product versus non-
holiday sales. Obviously, it would be helpful if the columns had meaningful
values such as Holiday/Nonholiday versus a cryptic Y/N. Rather than decod-
ing cryptic flags into understandable labels in a reporting application, we pre-
fer that the decode be stored in the database so that a consistent value is
available to all users regardless of their reporting environment.
A similar argument holds true for the weekday indicator, which would have a
value of Weekday or Weekend. Saturdays and Sundays obviously would be
assigned the Weekend value. Of course, multiple date table attributes can be
jointly constrained, so we can easily compare weekday holidays with week-
end holidays, for example.
The selling season column is set to the name of the retailing season, if any.
Examples in the United States could include Christmas, Thanksgiving, Easter,
Valentine’s Day, Fourth of July, or None. The major event column is similar to
the season column and can be used to mark special outside events such as
Super Bowl Sunday or Labor Strike. Regular promotional events usually are
not handled in the date table but rather are described more completely by
means of the promotion dimension, especially since promotional events are
not defined solely by date but usually are defined by a combination of date,
product, and store.
Some designers pause at this point to ask why an explicit date dimension table
is needed. They reason that if the date key in the fact table is a date-type field,
then any SQL query can directly constrain on the fact table date key and use
natural SQL date semantics to filter on month or year while avoiding a sup-
posedly expensive join. This reasoning falls apart for several reasons. First of
all, if our relational database can’t handle an efficient join to the date dimen-
sion table, we’re already in deep trouble. Most database optimizers are quite
efficient at resolving dimensional queries; it is not necessary to avoid joins like
the plague. Also, on the performance front, most databases don’t index SQL
date calculations, so queries constraining on an SQL-calculated field wouldn’t
take advantage of an index.
In terms of usability, the typical business user is not versed in SQL date seman-
tics, so he or she would be unable to directly leverage inherent capabilities
associated with a date data type. SQL date functions do not support filtering
by attributes such as weekdays versus weekends, holidays, fiscal periods, sea-
sons, or major events. Presuming that the business needs to slice data by these
nonstandard date attributes, then an explicit date dimension table is essential.
At the bottom line, calendar logic belongs in a dimension table, not in the
application code. Finally, we’re going to suggest that the date key is an integer
rather than a date data type anyway. An SQL-based date key typically is 8 bytes,
so you’re wasting 4 bytes in the fact table for every date key in every row. More
will be said on this later in this chapter.
Figure 2.5 illustrates several rows from a partial date dimension table.
Data warehouses always need an explicit date dimension table. There are many
date attributes not supported by the SQL date function, including fiscal periods,
seasons, holidays, and weekends. Rather than attempting to determine these non-
standard calendar calculations in a query, we should look them up in a date dimen-
sion table.
If we wanted to access the time of the transaction for day-part analysis (for
example, activity during the evening after-work rush or third shift), we’d han-
dle it through a separate time-of-day dimension joined to the fact table. Date
and time are almost completely independent. If we combined the two dimen-
sions, the date dimension would grow significantly; our neat date dimension
with 3,650 rows to handle 10 years of data would expand to 5,256,000 rows if
we tried to handle time by minute in the same table (or via an outrigger). Obvi-
ously, it is preferable to create a 3,650-row date dimension table and a separate
1,440-row time-of-day by minute dimension.
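A time-of-day dimension at minute grain is likewise small enough to build in advance. A sketch under assumed day-part bands (the band names and boundaries are invented for illustration):

```python
from datetime import time

# One row per minute of the day: 24 * 60 = 1,440 rows, kept separate from
# the 3,650-row date dimension rather than combined into a 5,256,000-row table.
time_of_day_rows = [
    {
        "time_of_day_key": m,
        "time_label": time(hour=m // 60, minute=m % 60).strftime("%H:%M"),
        # Illustrative day-part bands for day-part analysis.
        "day_part": "Third Shift" if m < 6 * 60
                    else "Day" if m < 18 * 60
                    else "Evening",
    }
    for m in range(24 * 60)
]
```

Joining this small dimension to the fact table alongside the date dimension keeps date and time independent, as the text recommends.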
In Chapter 5 we’ll discuss the handling of multiple dates in a single schema.
We’ll explore international date and time considerations in Chapters 11
and 14.
Figure 2.5 Date dimension table detail. (The figure shows sample rows for January 1–8, 2002, with columns including Date Key, Full Date Description, and Day of Week.)
Retail Sales 41
Product Dimension
The product dimension describes every SKU in the grocery store. While a typ-
ical store in our chain may stock 60,000 SKUs, when we account for different
merchandising schemes across the chain and historical products that are no
longer available, our product dimension would have at least 150,000 rows
and perhaps as many as a million rows. The product dimension is almost
always sourced from the operational product master file. Most retailers
administer their product master files at headquarters and download a subset
of the file to each store’s POS system at frequent intervals. It is headquarters’
responsibility to define the appropriate product master record (and unique
SKU number) for each new UPC created by packaged goods manufacturers.
Headquarters also defines the rules by which SKUs are assigned to such items
as bakery goods, meat, and produce. We extract the product master file into
our product dimension table each time the product master changes.
An important function of the product master is to hold the many descriptive
attributes of each SKU. The merchandise hierarchy is an important group of
attributes. Typically, individual SKUs roll up to brands. Brands roll up to
categories, and categories roll up to departments. Each of these is a many-to-
one relationship. This merchandise hierarchy and additional attributes are
detailed for a subset of products in Figure 2.6.
For each SKU, all levels of the merchandise hierarchy are well defined. Some
attributes, such as the SKU description, are unique. In this case, there are at
least 150,000 different values in the SKU description column. At the other
extreme, there are only perhaps 50 distinct values of the department attribute.
Thus, on average, there are 3,000 repetitions of each unique value in the
department attribute. This is all right! We do not need to separate these
repeated values into a second normalized table to save space. Remember that
dimension table space requirements pale in comparison with fact table space requirements.
Figure 2.6 Product dimension table detail.
Figure 2.7 Product dimension in the retail sales schema.
Many of the attributes in the product dimension table are not part of the mer-
chandise hierarchy. The package-type attribute, for example, might have values
such as Bottle, Bag, Box, or Other. Any SKU in any department could have one
of these values. It makes perfect sense to combine a constraint on this attribute
with a constraint on a merchandise hierarchy attribute. For example, we could
look at all the SKUs in the Cereal category packaged in Bags. To put this another
way, we can browse among dimension attributes whether or not they belong to
the merchandise hierarchy, and we can drill up and drill down using attributes
whether or not they belong to the merchandise hierarchy. We can even have
more than one explicit hierarchy in our product dimension table.
A recommended partial product dimension for a retail grocery data mart
would look similar to Figure 2.7.
A reasonable product dimension table would have 50 or more descriptive
attributes. Each attribute is a rich source for constraining and constructing row
headers. Viewed in this manner, we see that drilling down is nothing more
than asking for a row header that provides more information. Let’s say we
have a simple report where we’ve summarized the sales dollar amount and
quantity by department.
Department      Sales Dollar    Sales
Description     Amount          Quantity
Bakery          $12,331         5,088
Frozen Foods    $31,776         15,565
(Figure 2.7, referenced above, shows the POS Retail Sales Transaction Fact table, with its Date, Product, Store, and Promotion foreign keys, the POS Transaction Number, and the Sales Quantity, Sales Dollar Amount, Cost Dollar Amount, and Gross Profit Dollar Amount facts, joined to a Product dimension whose attributes include Product Description, SKU Number (Natural Key), Brand Description, Category Description, Department Description, Package Type Description, Package Size, Fat Content, Diet Type, Weight Units of Measure, Storage Type, Shelf Life Type, Shelf Width, Shelf Height, Shelf Depth, and more.)
If we want to drill down, we can drag virtually any other attribute, such as
brand, from the product dimension into the report next to department, and we
automatically drill down to this next level of detail. A typical drill down within
the merchandise hierarchy would look like this:
Department      Brand          Sales Dollar    Sales
Description     Description    Amount          Quantity
Bakery          Baked Well     $3,009          1,138
Bakery          Fluffy         $3,024          1,476
Bakery          Light          $6,298          2,474
Frozen Foods    Coldpack       $5,321          2,640
Frozen Foods    Freshlike      $10,476         5,234
Frozen Foods    Frigid         $7,328          3,092
Frozen Foods    Icy            $2,184          1,437
Frozen Foods    QuickFreeze    $6,467          3,162
Or we could drill down by the fat-content attribute, even though it isn’t in the
merchandise hierarchy roll-up.
Department                     Sales Dollar    Sales
Description     Fat Content    Amount          Quantity
Bakery          Non-Fat        $6,298          2,474
Bakery          Reduced Fat    $5,027          2,086
Bakery          Regular Fat    $1,006          528
Frozen Foods    Non-Fat        $5,321          2,640
Frozen Foods    Reduced Fat    $10,476         5,234
Frozen Foods    Regular Fat    $15,979         7,691
We have belabored the examples of drilling down in order to make a point,
which we will express as a design principle.
Drilling down in a data mart is nothing more than adding row headers from the
dimension tables. Drilling up is removing row headers. We can drill down or up on
attributes from more than one explicit hierarchy and with attributes that are part of
no hierarchy.
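This design principle is visible directly in SQL. Assuming hypothetical table and column names matching the schema in Figure 2.7, drilling from the department report down to brand means nothing more than adding one row header to the select list and GROUP BY:

```sql
-- Summary report: sales by department.
SELECT p.department_description,
       SUM(f.sales_dollar_amount) AS sales_dollar_amount,
       SUM(f.sales_quantity)      AS sales_quantity
FROM   pos_retail_sales_fact f
JOIN   product_dimension p ON f.product_key = p.product_key
GROUP  BY p.department_description;

-- Drill down: add a row header (brand) from the product dimension.
SELECT p.department_description,
       p.brand_description,
       SUM(f.sales_dollar_amount) AS sales_dollar_amount,
       SUM(f.sales_quantity)      AS sales_quantity
FROM   pos_retail_sales_fact f
JOIN   product_dimension p ON f.product_key = p.product_key
GROUP  BY p.department_description, p.brand_description;
```

Substituting fat_content for brand_description gives the non-hierarchy drill down shown above; the SQL pattern is identical.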
The product dimension is one of the two or three primary dimensions in
nearly every data mart. Great care should be taken to fill this dimension with
as many descriptive attributes as possible. A robust and complete set of
dimension attributes translates into user capabilities for robust and complete
analysis. We’ll further explore the product dimension in Chapter 4, where
we’ll also discuss the handling of product attribute changes.
Store Dimension
The store dimension describes every store in our grocery chain. Unlike the
product master file that is almost guaranteed to be available in every large
grocery business, there may not be a comprehensive store master file. The
product master needs to be downloaded to each store every time there’s a
new or changed product. However, the individual POS systems do not
require a store master. Information technology (IT) staffs frequently must
assemble the necessary components of the store dimension from multiple
operational sources at headquarters.
The store dimension is the primary geographic dimension in our case study.
Each store can be thought of as a location. Because of this, we can roll stores up
to any geographic attribute, such as ZIP code, county, and state in the United
States. Stores usually also roll up to store districts and regions. These two dif-
ferent hierarchies are both easily represented in the store dimension because
both the geographic and store regional hierarchies are well defined for a single
store row.
It is not uncommon to represent multiple hierarchies in a dimension table. Ideally,
the attribute names and values should be unique across the multiple hierarchies.
A recommended store dimension table for the grocery business is shown in
Figure 2.8.
Figure 2.8 Store dimension in the retail sales schema.
(Figure 2.8, referenced above, shows the same fact table joined to the Date, Product, and Promotion dimensions and to a Store dimension whose attributes include Store Name, Store Number (Natural Key), Store Street Address, City, County, State, Zip Code, Store Manager, Store District, Store Region, Floor Plan Type, Photo Processing Type, Financial Service Type, Selling Square Footage, Total Square Footage, First Open Date, Last Remodel Date, and more.)
The floor plan type, photo processing type, and financial service type are all
short text descriptors that describe the particular store. These should not be
one-character codes but rather should be 10- to 20-character standardized
descriptors that make sense when viewed in a pull-down list or used as a
report row header.
The column describing selling square footage is numeric and theoretically
additive across stores. One might be tempted to place it in the fact table. How-
ever, it is clearly a constant attribute of a store and is used as a report constraint
or row header more often than it is used as an additive element in a summa-
tion. For these reasons, we are confident that selling square footage belongs in
the store dimension table.
The first open date and last remodel date typically are join keys to copies of the
date dimension table. These date dimension copies are declared in SQL by the
VIEW construct and are semantically distinct from the primary date dimen-
sion. The VIEW declaration would look like
Now the system acts as if there is another physical copy of the date dimension
table called FIRST_OPEN_DATE. Constraints on this new date table have
nothing to do with constraints on the primary date dimension table. The first
open date view is a permissible outrigger to the store dimension. Notice that
we have carefully relabeled all the columns in the view so that they cannot be
confused with columns from the primary date dimension. We will further dis-
cuss outriggers in Chapter 6.
Promotion Dimension
The promotion dimension is potentially the most interesting dimension in our
schema. The promotion dimension describes the promotion conditions under
which a product was sold. Promotion conditions include temporary price
reductions, end-aisle displays, newspaper ads, and coupons. This dimension
is often called a causal dimension (as opposed to a casual dimension) because
it describes factors thought to cause a change in product sales.
Managers at both headquarters and the stores are interested in determining
whether a promotion is effective or not. Promotions are judged on one or more
of the following factors:
■■ Whether the products under promotion experienced a gain in sales during
the promotional period. This is called the lift. The lift can only be measured
if the store can agree on what the baseline sales of the promoted products
would have been without the promotion. Baseline values can be estimated
from prior sales history and, in some cases, with the help of sophisticated
mathematical models.
■■ Whether the products under promotion showed a drop in sales just prior
to or after the promotion, canceling the gain in sales during the promotion
(time shifting). In other words, did we transfer sales from regularly priced
products to temporarily reduced-priced products?
■■ Whether the products under promotion showed a gain in sales but other
products nearby on the shelf showed a corresponding sales decrease
(cannibalization).
■■ Whether all the products in the promoted category of products experi-
enced a net overall gain in sales taking into account the time periods
before, during, and after the promotion (market growth).
■■ Whether the promotion was profitable. Usually the profit of a promotion
is taken to be the incremental gain in profit of the promoted category over
the baseline sales taking into account time shifting and cannibalization, as
well as the costs of the promotion, including temporary price reductions,
ads, displays, and coupons.
The causal conditions potentially affecting a sale are not necessarily tracked
directly by the POS system. The transaction system keeps track of price reduc-
tions and markdowns. The presence of coupons also typically is captured with
the transaction because the customer either presents coupons at the time of
sale or does not. Ads and in-store display conditions may need to be linked
from other sources.
The various possible causal conditions are highly correlated. A temporary
price reduction usually is associated with an ad and perhaps an end-aisle
display. Coupons often are associated with ads. For this reason, it makes
sense to create one row in the promotion dimension for each combination of
promotion conditions that occurs. Over the course of a year, there may be
1,000 ads, 5,000 temporary price reductions, and 1,000 end-aisle displays,
but there may only be 10,000 combinations of these three conditions affect-
ing any particular product. For example, in a given promotion, most of
the stores would run all three promotion mechanisms simultaneously, but
a few of the stores would not be able to deploy the end-aisle displays. In
this case, two separate promotion condition rows would be needed, one for
the normal price reduction plus ad plus display and one for the price reduc-
tion plus ad only. A recommended promotion dimension table is shown in
Figure 2.9.
Figure 2.9 Promotion dimension in the retail sales schema.
From a purely logical point of view, we could record very similar information
about the promotions by separating the four major causal mechanisms (price
reductions, ads, displays, and coupons) into four separate dimensions rather
than combining them into one dimension. Ultimately, this choice is the
designer’s prerogative. The tradeoffs in favor of keeping the four dimensions
together include the following:
■■ Since the four causal mechanisms are highly correlated, the combined sin-
gle dimension is not much larger than any one of the separated dimen-
sions would be.
■■ The combined single dimension can be browsed efficiently to see how the
various price reductions, ads, displays, and coupons are used together.
However, this browsing only shows the possible combinations. Browsing
in the dimension table does not reveal which stores or products were
affected by the promotion. This information is found in the fact table.
The tradeoffs in favor of separating the four causal mechanisms into distinct
dimension tables include the following:
■■ The separated dimensions may be more understandable to the business
community if users think of these mechanisms separately. This would be
revealed during the business requirement interviews.
■■ Administration of the separate dimensions may be more straightforward
than administering a combined dimension.
Keep in mind that there is no difference in the information content in the data
warehouse between these two choices.
(Figure 2.9, referenced above, shows the fact table joined to the Date, Product, and Store dimensions and to a Promotion dimension whose attributes include Promotion Name, Price Reduction Type, Promotion Media Type, Ad Type, Display Type, Coupon Type, Ad Media Name, Display Provider, Promotion Cost, Promotion Begin Date, Promotion End Date, and more.)
Typically, many sales transaction line items involve products that are not being
promoted. We will need to include a row in the promotion dimension, with its
own unique key, to identify “No Promotion in Effect” and avoid a null promo-
tion key in the fact table. Referential integrity is violated if we put a null in a
fact table column declared as a foreign key to a dimension table. In addition to
the referential integrity alarms, null keys are the source of great confusion to
our users because they can’t join on null keys.
You must avoid null keys in the fact table. A proper design includes a row in the
corresponding dimension table to identify that the dimension is not applicable
to the measurement.
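One way to implement this "not applicable" row is a deliberate insert with a reserved surrogate key. The key value 0 and the attribute values below are hypothetical conventions, not prescribed by the design:

```sql
-- A deliberate row in the promotion dimension so that fact rows for
-- unpromoted sales carry a real foreign key rather than a NULL.
INSERT INTO promotion_dimension
       (promotion_key, promotion_name, price_reduction_type,
        promotion_media_type, ad_type, display_type, coupon_type)
VALUES (0, 'No Promotion in Effect', 'No Reduction',
        'No Media', 'No Ad', 'No Display', 'No Coupon');
```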
Promotion Coverage Factless Fact Table
Regardless of the handling of the promotion dimension, there is one important
question that cannot be answered by our retail sales schema: What products
were on promotion but did not sell? The sales fact table only records the SKUs
actually sold. There are no fact table rows with zero facts for SKUs that didn’t
sell because doing so would enlarge the fact table enormously. In the relational
world, a second promotion coverage or event fact table is needed to help
answer the question concerning what didn’t happen. The promotion coverage
fact table keys would be date, product, store, and promotion in our case study.
This obviously looks similar to the sales fact table we just designed; however,
the grain would be significantly different. In the case of the promotion cover-
age fact table, we’d load one row in the fact table for each product on promo-
tion in a store each day (or week, since many retail promotions are a week in
duration) regardless of whether the product sold or not. The coverage fact
table allows us to see the relationship between the keys as defined by a pro-
motion, independent of other events, such as actual product sales. We refer to
it as a factless fact table because it has no measurement metrics; it merely cap-
tures the relationship between the involved keys. To determine what products
were on promotion but didn't sell requires a two-step process. First, we'd
query the promotion coverage table to determine the universe of products that
were on promotion on a given day. We’d then determine what products sold
from the POS sales fact table. The answer to our original question is the set dif-
ference between these two lists of products. Stay tuned to Chapter 12 for more
complete coverage of factless fact tables; we’ll illustrate the promotion cover-
age table and provide the set difference SQL. If you’re working with data in a
multidimensional online analytical processing (OLAP) cube environment, it is
often easier to answer the question regarding what didn’t sell because the cube
typically contains explicit cells for nonbehavior.
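Although the full treatment waits for Chapter 12, the two-step set difference can be sketched as follows. Table names and literal key values are hypothetical, and EXCEPT can be replaced with NOT IN or an outer join on platforms that lack it:

```sql
-- Step 1: the universe of products on promotion in one store on one day.
SELECT pc.product_key
FROM   promotion_coverage_fact pc
WHERE  pc.date_key = 1042 AND pc.store_key = 17
EXCEPT
-- Step 2: remove the products that actually sold that day in that store.
SELECT f.product_key
FROM   pos_retail_sales_fact f
WHERE  f.date_key = 1042 AND f.store_key = 17;
```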
Degenerate Transaction Number
The retail sales fact table contains the POS transaction number on every line
item row. In a traditional parent-child database, the POS transaction number
would be the key to the transaction header record, containing all the informa-
tion valid for the transaction as a whole, such as the transaction date and store
identifier. However, in our dimensional model, we have already extracted this
interesting header information into other dimensions. The POS transaction
number is still useful because it serves as the grouping key for pulling together
all the products purchased in a single transaction.
Although the POS transaction number looks like a dimension key in the fact
table, we have stripped off all the descriptive items that might otherwise fall in
a POS transaction dimension. Since the resulting dimension is empty, we refer
to the POS transaction number as a degenerate dimension (identified by the DD
notation in Figure 2.10). The natural operational ticket number, such as the
POS transaction number, sits by itself in the fact table without joining to a
dimension table. Degenerate dimensions are very common when the grain of
a fact table represents a single transaction or transaction line item because the
degenerate dimension represents the unique identifier of the parent. Order
numbers, invoice numbers, and bill-of-lading numbers almost always appear
as degenerate dimensions in a dimensional model.
Degenerate dimensions often play an integral role in the fact table’s primary
key. In our case study, the primary key of the retail sales fact table consists of
the degenerate POS transaction number and product key (assuming that the
POS system rolls up all sales for a given product within a POS shopping cart
into a single line item). Often, the primary key of a fact table is a subset of the
table’s foreign keys. We typically do not need every foreign key in the fact
table to guarantee the uniqueness of a fact table row.
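In DDL terms, the degenerate dimension is simply a column with no corresponding dimension table. A sketch with hypothetical names, assuming the fact table and composite key described above:

```sql
CREATE TABLE pos_retail_sales_fact (
    date_key                   INTEGER NOT NULL REFERENCES date_dimension,
    product_key                INTEGER NOT NULL REFERENCES product_dimension,
    store_key                  INTEGER NOT NULL REFERENCES store_dimension,
    promotion_key              INTEGER NOT NULL REFERENCES promotion_dimension,
    pos_transaction_number     INTEGER NOT NULL,  -- degenerate dimension (DD):
                                                  -- no dimension table to join
    sales_quantity             INTEGER,
    sales_dollar_amount        DECIMAL(9,2),
    cost_dollar_amount         DECIMAL(9,2),
    gross_profit_dollar_amount DECIMAL(9,2),
    -- A subset of the keys suffices for uniqueness.
    PRIMARY KEY (pos_transaction_number, product_key)
);
```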
Operational control numbers such as order numbers, invoice numbers, and bill-of-
lading numbers usually give rise to empty dimensions and are represented as degen-
erate dimensions (that is, dimension keys without corresponding dimension tables)
in fact tables where the grain of the table is the document itself or a line item in the document.
If, for some reason, one or more attributes are legitimately left over after all the
other dimensions have been created and seem to belong to this header entity,
we would simply create a normal dimension record with a normal join. How-
ever, we would no longer have a degenerate dimension.
Figure 2.10 Querying the retail sales schema.
Retail Schema in Action
With our retail POS schema designed, let’s illustrate how it would be put to use
in a query environment. A business user might be interested in better under-
standing weekly sales dollar volume by promotion for the snacks category dur-
ing January 2002 for stores in the Boston district. As illustrated in Figure 2.10,
we would place query constraints on month and year in the date dimension,
district in the store dimension, and category in the product dimension.
If the query tool summed the sales dollar amount grouped by week-ending
date and promotion, the query results would look similar to those below. You
can plainly see the relationship between the dimensional model and the asso-
ciated query. High-quality dimension attributes are crucial because they are
the source of query constraints and result set labels.
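Assuming hypothetical attribute and table names matching Figure 2.10, the query behind such results might look like:

```sql
SELECT d.calendar_week_ending_date,
       pr.promotion_name,
       SUM(f.sales_dollar_amount) AS sales_dollar_amount
FROM   pos_retail_sales_fact f
JOIN   date_dimension      d  ON f.date_key      = d.date_key
JOIN   store_dimension     s  ON f.store_key     = s.store_key
JOIN   product_dimension   p  ON f.product_key   = p.product_key
JOIN   promotion_dimension pr ON f.promotion_key = pr.promotion_key
WHERE  d.calendar_month       = 'January'
AND    d.calendar_year        = 2002
AND    s.store_district       = 'Boston'
AND    p.category_description = 'Snacks'
GROUP  BY d.calendar_week_ending_date, pr.promotion_name;
```

Notice that every constraint and every result column comes from dimension attributes; only the summed amount comes from the fact table.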
Calendar Week                              Sales
Ending Date        Promotion Name          Dollar Amount
January 6, 2002    No Promotion            22,647
January 13, 2002   No Promotion            4,851
January 20, 2002   Super Bowl Promotion    7,248
January 27, 2002   Super Bowl Promotion    13,798
If you are using a data access tool with more functionality, the results may
appear as a cross-tabular report. Such reports are more appealing to business
users than the columnar data resulting from an SQL statement.
(Figure 2.10, referenced above, shows the POS Retail Sales Transaction Fact table, with the POS Transaction Number marked as a degenerate dimension (DD), joined to the Date dimension (Day of Week; Calendar Week Ending Date; Calendar Month, Quarter, Half Year, and Year; Holiday Indicator; and more), the Product dimension (Product Description; SKU Number; Brand, Subcategory, Category, and Department Descriptions; Package Type; Fat Content; Diet Type; and more), the Promotion dimension (Promotion Name, Promotion Media Type, Promotion Begin and End Dates, and more), and the Store dimension (Store Name, Store Number, Store District, Store Region, First Open Date, Last Remodel Date, and more).)
                   Super Bowl Promotion    No Promotion
Calendar Week      Sales Dollar            Sales Dollar
Ending Date        Amount                  Amount
January 6, 2002    0                       22,647
January 13, 2002   0                       4,851
January 20, 2002   7,248                   0
January 27, 2002   13,798                  0
Retail Schema Extensibility
Now that we’ve completed our first dimensional model, let’s turn our atten-
tion to extending the design. Assume that our retailer decides to implement a
frequent shopper program. Now, rather than knowing that an unidentified
shopper had 26 items in his or her shopping cart, we’re able to see exactly what
a specific shopper, say, Julie Kimball, purchases on a weekly basis. Just imag-
ine the interest of business users in analyzing shopping patterns by a multi-
tude of geographic, demographic, behavioral, and other differentiating
shopper characteristics.
The handling of this new frequent shopper information is relatively straight-
forward. We’d create a frequent shopper dimension table and add another
foreign key in the fact table. Since we can’t ask shoppers to bring in all
their old cash register receipts to tag our historical sales transactions with
their new frequent shopper number, we’d substitute a shopper key corre-
sponding to a “Prior to Frequent Shopper Program” description on our his-
torical fact table rows. Likewise, not everyone who shops at the grocery store
will have a frequent shopper card, so we’d also want to include a “Frequent
Shopper Not Identified” row in our shopper dimension. As we discussed
earlier with the promotion dimension, we must avoid null keys in the fact table.
As we embellished our original design with a frequent shopper dimension, we
also could add dimensions for the time of day and clerk associated with the
transaction, as illustrated in Figure 2.11. Any descriptive attribute that has a
single value in the presence of the fact table measurements is a good candidate
to be added to an existing dimension or be its own dimension. The decision
regarding whether a dimension can be attached to a fact table should be a
binary yes/no based on the declared grain. If you are in doubt, it’s time to
revisit step 2 of the design process.
Figure 2.11 Embellished retail sales schema.
Our original schema gracefully extends to accommodate these new dimen-
sions largely because we chose to model the POS transaction data at its most
granular level. The addition of dimensions that apply at that granularity did
not alter the existing dimension keys or facts; all preexisting applications con-
tinue to run without unraveling or changing. If we had declared originally that
the grain would be daily retail sales (transactions summarized by day, store,
product, and promotion) rather than at transaction line detail, we would not
have been able to easily incorporate the frequent-shopper, time-of-day, or clerk
dimensions. Premature summarization or aggregation inherently limits our
ability to add supplemental dimensions because the additional dimensions
often don’t apply at the higher grain.
Obviously, there are some changes that can never be handled gracefully. If a
data source ceases to be available and there is no compatible substitute, then
the data warehouse applications depending on this source will stop working.
However, the predictable symmetry of dimensional models allows them to
absorb some rather significant changes in source data and/or modeling
assumptions without invalidating existing applications. We’ll describe several
of these unexpected modification categories, starting with the simplest:
(Figure 2.11, referenced above, shows the fact table extended with Frequent Shopper Key, Clerk Key, and Time of Day Key foreign keys, joined to a Frequent Shopper dimension (Frequent Shopper Name, Address, City, State, Zip Code, Segment, and more), a Clerk dimension (Clerk Name, Clerk Job Grade, Clerk Supervisor, Date of Hire, and more), and a Time of Day dimension (AM/PM Indicator, Day Part Segment, and more), alongside the original Date, Product, Store, and Promotion dimensions.)
New dimension attributes. If we discover new textual descriptors of a prod-
uct, for example, we add these attributes to the dimension as new columns.
All existing applications will be oblivious to the new attributes and con-
tinue to function. If the new attributes are available only after a specific
point in time, then “Not Available” or its equivalent should be populated
in the old dimension records.
New dimensions. As we just illustrated in Figure 2.11, we can add a dimen-
sion to an existing fact table by adding a new foreign key field and popu-
lating it correctly with values of the primary key from the new dimension.
New measured facts. If new measured facts become available, we can add
them gracefully to the fact table. The simplest case is when the new facts
are available in the same measurement event and at the same grain as the
existing facts. In this case, the fact table is altered to add the new columns,
and the values are populated into the table. If the ALTER TABLE statement
is not viable, then a second fact table must be defined with the additional
columns and the rows copied from the first. If the new facts are only avail-
able from a point in time forward, then null values need to be placed in the
older fact rows. A more complex situation arises when new measured facts
occur naturally at a different grain. If the new facts cannot be allocated or
assigned to the original grain of the fact table, it is very likely that the new
facts belong in their own fact table. It is almost always a mistake to mix
grains in the same fact table.
Dimension becoming more granular. Sometimes it is desirable to increase
the granularity of a dimension. In most cases, the original dimension
attributes can be included in the new, more granular dimension because
they roll up perfectly in a many-to-one relationship. The more granular
dimension often implies a more granular fact table. There may be no alter-
native but to drop the fact table and rebuild it. However, all the existing
applications would be unaffected.
Addition of a completely new data source involving existing dimensions
as well as unexpected new dimensions. Almost always, a new source of
data has its own granularity and dimensionality, so we create a new fact
table. We should avoid force-fitting new measurements into an existing
fact table of consistent measurements. The existing applications will still
work because the existing fact and dimension tables are untouched.
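The "new measured facts" case above relies on an ordinary ALTER TABLE. A sketch, with a hypothetical new fact column:

```sql
-- Gracefully add a new fact at the same grain. Preexisting queries are
-- unaffected; rows loaded before this fact existed carry NULL here.
ALTER TABLE pos_retail_sales_fact
    ADD COLUMN coupon_discount_dollar_amount DECIMAL(9,2);
```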
Resisting Comfort Zone Urges
With our first dimensional design behind us, let’s directly confront several of the
natural urges that tempt modelers coming from a more normalized background.
We’re consciously breaking some traditional modeling rules because we’re
focused on delivering business value through ease of use and performance, not
on transaction processing efficiencies.
Dimension Normalization
The flattened, denormalized dimension tables with repeating textual values
may make a normalization modeler uncomfortable. Let’s revisit our case
study product dimension table. The 150,000 products roll up into 50 distinct
departments. Rather than redundantly storing the 20-byte department
description in the product dimension table, modelers with a normalized
upbringing want to store a 2-byte department code and then create a new
department dimension for the department decodes. In fact, they would feel
more comfortable if all the descriptors in our original design were normalized
into separate dimension tables. They argue that this design saves space
because we’re only storing cryptic codes in our 150,000-row dimension table,
not lengthy descriptors.
In addition, some modelers contend that the normalized design for the dimen-
sion tables is easier to maintain. If a department description changes, they’d
only need to update the one occurrence rather than the 3,000 repetitions in our
original product dimension. Maintenance often is addressed by normalization
disciplines, but remember that all this happens back in the staging area, long
before the data is loaded into a presentation area’s dimensional schema.
Dimension table normalization typically is referred to as snowflaking. Redun-
dant attributes are removed from the flat, denormalized dimension table and
placed in normalized secondary dimension tables. Figure 2.12 illustrates
the partial snowflaking of our original schema. If the schema were fully
snowflaked, it would appear as a full third-normal-form entity-relationship
diagram. The contrast between Figure 2.12 and the earlier design in Figure 2.10
is startling. While the fact tables in both figures are identical, the plethora of
dimension tables (even in our simplistic representation) is overwhelming.
Figure 2.12 Partially snowflaked product dimension.
(Figure 2.12, referenced above, shows the same fact table joined to a Product dimension in which Brand Key, Package Type Key, and Storage Type Key have become foreign keys to secondary Brand, Package Type, and Storage Type dimension tables; Brand in turn references a Category dimension, Category references a Department dimension, and Storage Type references a Shelf Life Type dimension.)
Retail Sales 55
While snowflaking is a legal extension of the dimensional model, in general,
we encourage you to resist the urge to snowflake given our two primary
design drivers, ease of use and performance.
■■ The multitude of snowflaked tables makes for a much more complex pre-
sentation. Users inevitably will struggle with the complexity. Remember
that simplicity is one of the primary objectives of a denormalized dimen-
sional model.
■■ Likewise, database optimizers will struggle with the complexity of the
snowflaked schema. Numerous tables and joins usually translate into
slower query performance. The complexities of the resulting join specifi-
cations increase the chances that the optimizer will get sidetracked and
choose a poor strategy.
■■ The minor disk space savings associated with snowflaked dimension
tables are insignificant. If we replaced the 20-byte department description
in our 150,000-row product dimension table with a 2-byte code, we’d save
a whopping 2.7 MB (150,000 x 18 bytes), but we may have a 10-GB fact
table! Dimension tables are almost always geometrically smaller than fact
tables. Efforts to normalize most dimension tables in order to save disk
space are a waste of time.
■■ Snowflaking slows down the users’ ability to browse within a dimension.
Browsing often involves constraining one or more dimension attributes
and looking at the distinct values of another attribute in the presence of
these constraints. Browsing allows users to understand the relationship
between dimension attribute values.
Obviously, a snowflaked product dimension table would respond well if
we just wanted a list of the category descriptions. However, if we wanted
to see all the brands within a category, we’d need to traverse the brand
and category dimensions. If we then wanted to also list the package types
for each brand in a category, we’d be traversing even more tables. The
SQL needed to perform these seemingly simple queries is quite complex,
and we haven’t even touched the other dimensions or fact table.
■■ Finally, snowflaking defeats the use of bitmap indexes. Bitmap indexes are
very useful when indexing low-cardinality fields, such as the category
and department columns in our product dimension tables. They greatly
speed the performance of a query or constraint on the single column in
question. Snowflaking inevitably would interfere with your ability to
leverage this performance-tuning technique.
The dimension tables should remain as flat tables physically. Normalized,
snowflaked dimension tables penalize cross-attribute browsing and prohibit the use
of bit-mapped indexes. Disk space savings gained by normalizing the dimension ta-
bles typically are less than 1 percent of the total disk space needed for the overall
schema. We knowingly sacrifice this dimension table space in the spirit of perfor-
mance and ease-of-use advantages.
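To make the browsing penalty concrete, here is a small sketch using SQLite from Python. The table names, columns, and rows are invented for illustration: a flat product dimension answers "which brands are in this category?" from a single table, while the snowflaked equivalent already needs a join for the same simple browse.

```python
import sqlite3

# Illustrative miniature of the flat vs. snowflaked product dimension;
# all names and rows here are invented for this sketch.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product_dim (           -- flat, denormalized dimension
    product_key INTEGER PRIMARY KEY,
    brand_description TEXT,
    category_description TEXT);
INSERT INTO product_dim VALUES
    (1, 'Brand X', 'Frozen Foods'),
    (2, 'Brand Y', 'Frozen Foods'),
    (3, 'Brand X', 'Snacks');

CREATE TABLE category_dim (          -- snowflaked outriggers
    category_key INTEGER PRIMARY KEY,
    category_description TEXT);
CREATE TABLE brand_dim (
    brand_key INTEGER PRIMARY KEY,
    brand_description TEXT,
    category_key INTEGER REFERENCES category_dim);
INSERT INTO category_dim VALUES (10, 'Frozen Foods'), (20, 'Snacks');
INSERT INTO brand_dim VALUES
    (100, 'Brand X', 10), (200, 'Brand Y', 10), (300, 'Brand X', 20);
""")

# Browse with the flat dimension: one table, no joins.
flat = con.execute("""
    SELECT DISTINCT brand_description FROM product_dim
    WHERE category_description = 'Frozen Foods'
    ORDER BY brand_description""").fetchall()

# The same browse against the snowflake already requires a join.
snow = con.execute("""
    SELECT DISTINCT b.brand_description
    FROM brand_dim b
    JOIN category_dim c ON b.category_key = c.category_key
    WHERE c.category_description = 'Frozen Foods'
    ORDER BY b.brand_description""").fetchall()

print(flat)            # [('Brand X',), ('Brand Y',)]
print(flat == snow)    # True: same answer, more tables and joins
```

With more levels (package type, department, and so on), each additional snowflaked attribute adds another join to an otherwise trivial browse.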
There are times when snowflaking is permissible, such as our earlier example
with the date outrigger on the store dimension, where a clump of correlated
attributes is used repeatedly in various independent roles. We just urge you to
be conservative with snowflaked designs and use them only when they are
obviously called for.
Too Many Dimensions
The fact table in a dimensional schema is naturally highly normalized and
compact. There is no way to further normalize the extremely complex many-
to-many relationships among the keys in the fact table because the dimensions
are not correlated with each other. Every store is open every day. Sooner or
later, almost every product is sold on promotion in most or all of our stores.
Interestingly, while uncomfortable with denormalized dimension tables, some
modelers are tempted to denormalize the fact table. Rather than having a sin-
gle product foreign key on the fact table, they include foreign keys for the fre-
quently analyzed elements on the product hierarchy, such as brand,
subcategory, category, and department. Likewise, the date key suddenly turns
into a series of keys joining to separate week, month, quarter, and year dimen-
sion tables. Before you know it, our compact fact table has turned into an
unruly monster that joins to literally dozens of dimension tables. We affection-
ately refer to these designs as centipedes because the fact tables appear to have
nearly 100 legs, as shown in Figure 2.13. Clearly, the centipede design has
stepped into the too-many-dimensions trap.
Remember, even with its tight format, the fact table is the behemoth in a
dimensional design. Designing a fact table with too many dimensions leads to
significantly increased fact table disk space requirements. While we’re willing
to use extra space for dimension tables, fact table space consumption concerns
us because it is our largest table by orders of magnitude. There is no way to
index the enormous multipart key effectively in our centipede example. The
numerous joins are an issue for both usability and query performance.
Figure 2.13 Centipede fact table with too many dimensions.

POS Retail Sales Transaction Fact: Date Key (FK), Week Key (FK), Month Key (FK), Quarter Key (FK), Year Key (FK), Fiscal Year Key (FK), Fiscal Month Key (FK), Product Key (FK), Brand Key (FK), Subcategory Key (FK), Category Key (FK), Department Key (FK), Package Type Key (FK), Store Key (FK), Store County Key (FK), Store State Key (FK), Store District Key (FK), Store Region Key (FK), Store Floor Plan Key (FK), Promotion Key (FK), Promotion Reduction Type Key (FK), Promotion Media Type Key (FK), POS Transaction Number (DD), Sales Quantity, Sales Dollar Amount, Cost Dollar Amount, Gross Profit Dollar Amount. Each foreign key joins to its own dimension table: Date, Week, Month, Quarter, Year, Fiscal Year, Fiscal Month, Product, Brand, Subcategory, Category, Department, Package Type, Store, Store County, Store State, Store District, Store Region, Store Floor Plan, Promotion, Promotion Reduction Type, and Promotion Media Type.

Most business processes can be represented with fewer than 15 dimensions in
the fact table. If our design has 25 or more dimensions, we should look for
ways to combine correlated dimensions into a single dimension. Perfectly cor-
related attributes, such as the levels of a hierarchy, as well as attributes with a
reasonable statistical correlation, should be part of the same dimension. You
have made a good decision to combine dimensions when the resulting new
single dimension is noticeably smaller than the Cartesian product of the sepa-
rate dimensions.

A very large number of dimensions typically is a sign that several dimensions are not
completely independent and should be combined into a single dimension. It is a di-
mensional modeling mistake to represent elements of a hierarchy as separate di-
mensions in the fact table.

Surrogate Keys

We strongly encourage the use of surrogate keys in dimensional models rather
than relying on operational production codes. Surrogate keys go by many
other aliases: meaningless keys, integer keys, nonnatural keys, artificial keys, syn-
thetic keys, and so on. Simply put, surrogate keys are integers that are assigned
sequentially as needed to populate a dimension. For example, the first product
record is assigned a product surrogate key with the value of 1, the next prod-
uct record is assigned product key 2, and so forth. The surrogate keys merely
serve to join the dimension tables to the fact table.
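A minimal sketch of sequential surrogate key assignment as it might run in the staging area; the class name and SKU values are invented for illustration, not a prescribed implementation.

```python
# Hypothetical sketch: hand out meaningless sequential integers for
# natural keys during dimension loading.
class SurrogateKeyGenerator:
    def __init__(self):
        self._next_key = 1
        self._assigned = {}     # natural key -> surrogate key

    def key_for(self, natural_key):
        # The first time we see a natural key, assign the next integer;
        # afterward, always return the same surrogate key.
        if natural_key not in self._assigned:
            self._assigned[natural_key] = self._next_key
            self._next_key += 1
        return self._assigned[natural_key]

gen = SurrogateKeyGenerator()
print(gen.key_for("SKU-0001"))   # 1, the first product record seen
print(gen.key_for("SKU-0002"))   # 2
print(gen.key_for("SKU-0001"))   # 1 again; the mapping is stable
```

The surrogate key carries no meaning from the natural key, which is exactly the point.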
Modelers sometimes are reluctant to give up their natural keys because they
want to navigate the fact table based on the operational code while avoiding a
join to the dimension table. Remember, however, that dimension tables are our
entry points to the facts. If the fifth through ninth characters in the operational
code identify the manufacturer, then the manufacturer’s name should be
included as a dimension table attribute. In general, we want to avoid embed-
ding intelligence in the data warehouse keys because any assumptions that we
make eventually may be invalidated. Likewise, queries and data access appli-
cations should not have any built-in dependency on the keys because the logic
also would be vulnerable to invalidation.
Every join between dimension and fact tables in the data warehouse should be
based on meaningless integer surrogate keys. You should avoid using the natural op-
erational production codes. None of the data warehouse keys should be smart,
where you can tell something about the row just by looking at the key.
Initially, it may be faster to implement a dimensional model using operational
codes, but surrogate keys definitely will pay off in the long run. We sometimes
think of them as being similar to a flu shot for the data warehouse—like an
immunization, there’s a small amount of pain to initiate and administer surro-
gate keys, but the long-run benefits are substantial.
One of the primary benefits of surrogate keys is that they buffer the data ware-
house environment from operational changes. Surrogate keys allow the ware-
house team to maintain control of the environment rather than being
whipsawed by operational rules for generating, updating, deleting, recycling,
and reusing production codes. In many organizations, historical operational
codes (for example, inactive account numbers or obsolete product codes) get
reassigned after a period of dormancy. If account numbers get recycled fol-
lowing 12 months of inactivity, the operational systems don’t miss a beat
because their business rules prohibit data from hanging around for that long.
The data warehouse, on the other hand, will retain data for years. Surrogate
keys provide the warehouse with a mechanism to differentiate these two sep-
arate instances of the same operational account number. If we rely solely on
operational codes, we also are vulnerable to key overlap problems in the case
of an acquisition or consolidation of data. Surrogate keys allow the data ware-
house team to integrate data from multiple operational source systems, even if
they lack consistent source keys.
There are also performance advantages associated with the use of surrogate
keys. The surrogate key is as small an integer as possible while ensuring that it
will accommodate the future cardinality or maximum number of rows in the
dimension comfortably. Often the operational code is a bulky alphanumeric
character string. The smaller surrogate key translates into smaller fact tables,
smaller fact table indices, and more fact table rows per block input-output
operation. Typically, a 4-byte integer is sufficient to handle most dimension
situations. A 4-byte integer is a single integer, not four decimal digits. It has 32
bits and therefore can handle approximately 2 billion positive values (2^31 - 1) or
approximately 4 billion total positive and negative values (-2^31 to +2^31 - 1). As we said, this is
more than enough for just about any dimension. Remember, if you have a
large fact table with 1 billion rows of data, every byte in each fact table row
translates into another gigabyte of storage.
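The capacity and storage arithmetic above can be restated in a few lines; the figures come straight from the text.

```python
# Restating the 4-byte integer capacity and storage arithmetic.
positive_values = 2**31 - 1        # largest signed 32-bit integer
total_values = 2**32               # full signed range, about 4 billion
print(f"{positive_values:,}")      # 2,147,483,647

# Every byte added to each row of a billion-row fact table costs a gigabyte.
rows = 1_000_000_000
print(rows * 1 / 10**9, "GB")      # 1.0 GB per extra byte per row
```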
As we mentioned earlier, surrogate keys are used to record dimension con-
ditions that may not have an operational code, such as the “No Promotion in
Effect” condition. By taking control of the warehouse’s keys, we can assign
a surrogate key to identify this condition despite the lack of an operational code.
Similarly, you may find that your dimensional models have dates that are yet
to be determined. There is no SQL date value for “Date to be Determined” or
“Date Not Applicable.” This is another reason we advocate using surrogate
keys for your date keys rather than SQL date data types (as if our prior ratio-
nale wasn’t convincing enough).
The date dimension is the one dimension where surrogate keys should be
assigned in a meaningful, sequential order. In other words, January 1 of the
first year would be assigned surrogate key value 1, January 2 would be
assigned surrogate key 2, February 1 would be assigned surrogate key 32, and
so on. We don’t want to embed extensive calendar intelligence in these keys
(for example, YYYY-MM-DD) because doing so may encourage people to
bypass the date lookup dimension table. And, of course, in using this smart
format, we would again have no way to represent “Hasn’t happened yet” and
other common date situations. We just want our fact table rows to be in
sequential order. Treating the surrogate date key as a date sequence number
will allow the fact table to be physically partitioned on the basis of the date
key. Partitioning a large fact table on the basis of date is highly effective
because it allows old data to be removed gracefully and new data to be loaded
and indexed without disturbing the rest of the fact table.
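As an illustration of date keys as sequence numbers, the following sketch assumes a hypothetical warehouse epoch; the epoch date itself is an invented example, standing in for "January 1 of the first year."

```python
from datetime import date

# EPOCH is an invented example of January 1 of the first year.
EPOCH = date(2002, 1, 1)

def date_key(d):
    """January 1 of the first year gets key 1; keys grow by 1 per day."""
    return (d - EPOCH).days + 1

print(date_key(date(2002, 1, 1)))    # 1
print(date_key(date(2002, 1, 2)))    # 2
print(date_key(date(2002, 2, 1)))    # 32

# Because the key is a day sequence number, partitioning the fact table
# by date reduces to simple integer range comparisons on this key.
```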
Finally, surrogate keys are needed to support one of the primary techniques
for handling changes to dimension table attributes. This is actually one of the
most important reasons to use surrogate keys. We’ll devote a whole section in
Chapter 4 to using surrogate keys for slowly changing dimensions.
Of course, some effort is required to assign and administer surrogate keys, but
it’s not nearly as intimidating as many people imagine. We’ll need to establish
and maintain a cross-reference table in the staging area that will be used to
substitute the appropriate surrogate key on each fact and dimension table row.
In Chapter 16 we lay out a flow diagram for administering and processing sur-
rogate keys in our dimensional schemas.
Before we leave the topic of keys, we want to discourage the use of concate-
nated or compound keys for dimension tables. We can’t create a truly surro-
gate key simply by gluing together several natural keys or by combining the
natural key with a time stamp. Also, we want to avoid multiple parallel joins
between the dimension and fact tables, sometimes referred to as double-barreled
joins, because they have an adverse impact on performance.
While we don’t typically assign surrogate keys to degenerate dimensions, you
should evaluate each situation to determine if one is required. A surrogate key
is necessary if the transaction control numbers are not unique across locations
or get reused. For example, our retailer’s POS system may not assign unique
transaction numbers across stores. The system may wrap back to zero and
reuse previous control numbers once its maximum has been reached. Also,
your transaction control number may be a bulky 24-byte alphanumeric column.
In such cases, it would be advantageous to use a surrogate key. Technically, con-
trol number dimensions modeled in this way are no longer degenerate.
For the moment, let’s assume that the first version of the retail sales schema rep-
resents both the logical and physical design of our database. In other words, the
relational database contains only five actual tables: retail sales fact table and
date, product, store, and promotion dimension tables. Each of the dimension
tables has a primary key, and the fact table has a composite key made up of
these four foreign keys, in addition to the degenerate transaction number. Per-
haps the most striking aspect of the design at this point is the simplicity of the
fact table. If the four foreign keys are tightly administered consecutive integers,
we could reserve as little as 14 bytes for all four keys (4 bytes each for date,
product, and promotion and 2 bytes for store). The transaction number might
require an additional 8 bytes. If the four facts in the fact table were each 4-byte
integers, we would need to reserve only another 16 bytes. This would make our
fact table row only 38 bytes wide. Even if we had a billion rows, the fact table
would occupy only about 38 GB of primary data space. Such a streamlined fact
table row is a very typical result in a dimensional design.
Our embellished retail sales schema, illustrated in Figure 2.11, has three addi-
tional dimensions. If we allocate 4 bytes each for shopper and clerk and 2 bytes
for the time of day (to the nearest minute), then our fact table width grows to
only 48 bytes. Our billion-row fact table occupies just 48 GB.
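The row-width arithmetic for both versions of the schema can be restated directly:

```python
# Restating the fact table row-width arithmetic from the text.
base_keys = 4 + 4 + 4 + 2     # date, product, promotion (4 bytes each), store (2)
transaction_number = 8        # degenerate POS transaction number
facts = 4 * 4                 # four 4-byte integer facts
base_row = base_keys + transaction_number + facts
print(base_row)               # 38 bytes per fact row

# Embellished schema: add shopper (4), clerk (4), and time-of-day (2) keys.
embellished_row = base_row + 4 + 4 + 2
print(embellished_row)        # 48 bytes

rows = 1_000_000_000          # a billion-row fact table
print(rows * base_row / 10**9, "GB")         # 38.0
print(rows * embellished_row / 10**9, "GB")  # 48.0
```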
Market Basket Analysis
The retail sales schema tells us in exquisite detail what was purchased at each
store and under what conditions. However, the schema doesn’t allow us to
very easily analyze which products were sold in the same market basket
together. This notion of analyzing the combination of products that sell
together is known by data miners as affinity grouping but more popularly is
called market basket analysis. Market basket analysis gives the retailer insights
about how to merchandise various combinations of items. If frozen pasta din-
ners sell well with cola products, then these two products perhaps should be
located near each other or marketed with complementary pricing. The concept
of market basket analysis can be extended easily to other situations. In the
manufacturing environment, it is useful to see what products are ordered
together because we may want to offer product bundles with package pricing.
The retail sales fact table cannot be used easily to perform market basket
analyses because SQL was never designed to constrain and group across line
item fact rows. Data mining tools and some OLAP products can assist with
market basket analysis, but in the absence of these tools, we’ll describe a more
direct approach below. Be forewarned that this is a rather advanced technique;
if you are not doing market basket analysis today, simply skim this section to
get a general sense of the techniques involved.
In Figure 2.14 we illustrate a market basket fact table that was derived from
retail sales transactions. The market basket fact table is a periodic snapshot
representing the pairs of products purchased together during a specified time
period. The facts include the total number of baskets (customer tickets) that
included products A and B, the total number of product A dollars and units in
this subset of purchases, and the total number of product B dollars and units
purchased. The basket count is a semiadditive fact. For example, if a customer
ticket contains line items for pasta, soft drinks, and peanut butter in the mar-
ket basket fact table, this single order is counted once on the pasta-soft drinks
fact row, once on the row for the pasta-peanut butter combination, and so on.
Obviously, care must be taken to avoid summarizing purchase counts for more
than one product.
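A sketch of how pairwise market basket rows might be derived from customer tickets during ETL; the ticket data is invented, and a production process would also accumulate the dollar and unit facts for each product in the pair.

```python
from collections import Counter
from itertools import combinations

# Invented customer tickets; each is the set of products on one
# POS transaction.
tickets = [
    {"pasta", "soft drinks", "peanut butter"},
    {"pasta", "soft drinks"},
    {"soft drinks", "chips"},
]

basket_count = Counter()
for ticket in tickets:
    # Each ticket contributes once to every product pair it contains;
    # this is why the basket count is semiadditive and must not be
    # summed across more than one product.
    for pair in combinations(sorted(ticket), 2):
        basket_count[pair] += 1

print(basket_count[("pasta", "soft drinks")])    # 2 baskets held both
print(basket_count[("chips", "soft drinks")])    # 1
```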
Figure 2.14 Market basket fact table populated from purchase transactions.

POS Retail Sales Transaction Fact (grain = 1 row per POS transaction line): Date Key (FK), Product Key (FK), Store Key (FK), Promotion Key (FK), POS Transaction Number (DD), Sales Quantity, Sales Dollar Amount, Cost Dollar Amount, Gross Profit Dollar Amount

POS Market Basket Fact (grain = 1 row for each pair of products sold on a day by store and promotion): Date Key (FK), Product A Key (FK), Product B Key (FK), Store Key (FK), Promotion Key (FK), Basket Count, Sales Quantity Product A, Sales Quantity Product B, Sales Dollar Amount Product A, Sales Dollar Amount Product B
You will notice that there are two generalized product keys (product keys A
and B) in the market basket fact table. Here we have built a single product
dimension table that contains entries at multiple levels of the hierarchy, such
as individual products, brands, and categories. This specialized variant of our
normal product dimension table contains a small number of rather generic
attributes. The surrogate keys for the various levels of the product hierarchy
have been assigned so that they don’t overlap.
Conceptually, the idea of recording market basket correlations is simple, but
the sheer number of product combinations makes the analysis challenging. If
we have N products in our product portfolio and we attempt to build a table
with every possible pair of product keys encountered in product orders, we
will approach N^2 product combinations [actually N x (N - 1) for the mathematicians
among you]. In other words, if we have 10,000 products in our portfolio,
there would be nearly 100,000,000 pairwise combinations. The number
of possible combinations quickly approaches absurdity when we’re dealing
with a large number of products. If a retail store sells 100,000 SKUs, there are
10 billion possible SKU combinations.
The key to realistic market basket analysis is to remember that the primary
goal is to understand the meaningful combinations of products sold together.
Thinking about our market basket fact table, we would first be interested in
rows with high basket counts. Since these product combinations are
observed frequently, they warrant further investigation. Second, we would
look for situations where the dollars or units for products A and B were in
reasonable balance. If the dollars or units are far out of balance, all we’ve
done is find high-selling products coupled with insignificant secondary
products, which wouldn’t be very helpful in making major merchandising or
promotion decisions.
In order to avoid the combinatorial explosion of product pairs in the market
basket fact table, we rely on a progressive pruning algorithm. We begin at the
top of the product hierarchy, which we’ll assume is category. We first enumer-
ate all the category-to-category market basket combinations. If there are 25 cat-
egories, this first step generates 625 market basket rows. We then prune this
list for further analysis by selecting only the rows that have a reasonably high
order count and where the dollars and units for products A and B (which are
categories at this point) are reasonably balanced. Experimentation will tell you
what the basket count threshold and balance range should be.
We then push down to the next level of detail, which we’ll assume is brand.
Starting with the pruned set of combinations from the last step, we drill down
on product A by enumerating all combinations of brand (product A) by cate-
gory (product B). Similarly, we drill down one level of the hierarchy for prod-
uct B by looking at all combinations of brand (product A) by brand (product B).
Again, we prune the lists to those with the highest basket count frequencies and
dollar or unit balance and then drill down to the next level in the hierarchy.
As we descend the hierarchy, we produce rows with smaller and smaller bas-
ket counts. Eventually, we find no basket counts greater than the reasonable
threshold for relevance. It is permissible to stop at any time once we’ve satis-
fied the analyst’s curiosity. One of the advantages of this top-down approach
is that the rows found at each point are those with the highest relevance and
impact. Progressively pruning the list provides more focus to already relevant
results. One can imagine automating this process, searching the product hier-
archy downward, ignoring the low basket counts, and always striving for bal-
anced dollars and units with the high basket counts. The process could halt
when the number of product pairs reached some desired threshold or when
the total activity expressed in basket count, dollars, or units reached some
lower limit.
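The progressive pruning just described can be sketched as follows. The hierarchy, pair statistics, and thresholds here are all invented stand-ins; a real implementation would read them from the market basket fact table and tune the thresholds by experimentation, as the text advises.

```python
# Invented stand-ins for the product hierarchy and the market basket rows.
hierarchy = {                       # category -> brands (two levels only)
    "Frozen Foods": ["Brand X", "Brand Y"],
    "Beverages": ["Brand C"],
    "Snacks": ["Brand Z"],
}
# (product A, product B) -> (basket count, dollars A, dollars B),
# where A and B may sit at any level of the hierarchy
stats = {
    ("Frozen Foods", "Beverages"): (900, 5000.0, 4600.0),
    ("Frozen Foods", "Snacks"): (50, 400.0, 380.0),     # low basket count
    ("Brand X", "Beverages"): (600, 3300.0, 3100.0),
    ("Brand Y", "Beverages"): (40, 1700.0, 90.0),       # badly out of balance
}

def keep(pair, min_baskets=100, max_ratio=3.0):
    """Prune rule: high basket count and reasonably balanced dollars."""
    count, da, db = stats.get(pair, (0, 0.0, 0.0))
    if count < min_baskets or min(da, db) <= 0:
        return False
    return max(da, db) / min(da, db) <= max_ratio

# Step 1: enumerate and prune category-to-category combinations.
categories = list(hierarchy)
step1 = [(a, b) for a in categories for b in categories
         if a != b and keep((a, b))]

# Step 2: drill down on product A (brand x category) and prune again.
step2 = [(brand, cat_b)
         for cat_a, cat_b in step1
         for brand in hierarchy[cat_a]
         if keep((brand, cat_b))]

print(step1)   # [('Frozen Foods', 'Beverages')]
print(step2)   # [('Brand X', 'Beverages')]
```

Each step starts from the survivors of the previous one, so the combinatorial explosion never materializes.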
A variation on this approach could start with a specific category, brand, or
even a product. Again, the idea would be to combine this specific product first
with all the categories and then to work down the hierarchy. Another twist
would be to look at the mix of products purchased by a given customer during
a given time period, regardless of whether they were in the same basket. In
any case, much of the hard work associated with market basket analysis has
been off-loaded to the staging area’s ETL processes in order to simplify the
ultimate query and presentation aspects of the analysis.
Summary

In this chapter we got our first exposure to designing a dimensional model.
Regardless of industry, we strongly encourage the four-step process for tack-
ling dimensional model designs. Remember that it is especially important that
we clearly state the grain associated with our dimensional schema. Loading
the fact table with atomic data provides the greatest flexibility because we can
summarize that data “every which way.” As soon as the fact table is restricted
to more aggregated information, we’ll run into walls when the summarization
assumptions prove to be invalid. Also remember that it is vitally important to
populate our dimension tables with verbose, robust descriptive attributes.
In the next chapter we’ll remain within the retail industry to discuss tech-
niques for tackling a second business process within the organization, ensur-
ing that we’re leveraging our earlier efforts while avoiding stovepipes.
Inventory
In Chapter 2 we developed a dimensional model for the sales transactions in a
large grocery chain. We remain within the same industry in this chapter but
move up the value chain to tackle the inventory process. The designs devel-
oped in this chapter apply to a broad set of inventory pipelines both inside and
outside the retail industry.
Even more important, this chapter provides a thorough discussion of the data
warehouse bus architecture. The bus architecture is essential to creating an
integrated data warehouse from a distributed set of related business processes.
It provides a framework for planning the overall warehouse, even though we
will build it incrementally. Finally, we will underscore the importance of using
common, conformed dimensions and facts across the warehouse’s dimen-
sional models.
Chapter 3 discusses the following concepts:
■■ Value chain implications
■■ Inventory periodic snapshot model, as well as transaction and accumulating
snapshot models
■■ Semi-additive facts
■■ Enhanced inventory facts
■■ Data warehouse bus architecture and matrix
■■ Conformed dimensions and facts
Introduction to the Value Chain
Most organizations have an underlying value chain consisting of their key
business processes. The value chain identifies the natural, logical flow of an
organization’s primary activities. For example, in the case of a retailer, the
company may issue a purchase order to a product manufacturer. The products
are delivered to the retailer’s warehouse, where they are held in inventory. A
delivery is then made to an individual store, where again the products sit in
inventory until a consumer makes a purchase. We have illustrated this subset
of a retailer’s value chain in Figure 3.1. Obviously, products sourced from a
manufacturer that delivers directly to the retail store would bypass the ware-
housing steps of the value chain.
Operational source systems typically produce transactions or snapshots at
each step of the value chain, generating interesting performance metrics along
the way. The primary objective of most analytic decision support systems is to
monitor the performance results of key processes. Since each business process
produces unique metrics at unique time intervals with unique granularity and
dimensionality, each process typically spawns one or more fact tables. To this
end, the value chain provides high-level insight into the overall enterprise
data warehouse. We’ll devote more time to this topic later in this chapter.
Figure 3.1 Subset of a retailer's value chain: the retailer issues a purchase order, deliveries arrive at the retailer's warehouse, and deliveries then arrive at the retail stores.
Inventory Models
In the meantime, we’ll delve into several complementary inventory models.
The first is the inventory periodic snapshot. Every day (or at some other regu-
lar time interval), we measure the inventory levels of each product and place
them as separate rows in a fact table. These periodic snapshot rows appear
over time as a series of data layers in the dimensional model, much like geo-
logic layers represent the accumulation of sediment over long periods of time.
We’ll explore this common inventory model in some detail. We’ll also discuss
briefly a second inventory model where we record every transaction that has
an impact on inventory levels as products move through the warehouse.
Finally, in the third model, we’ll touch on the inventory accumulating snap-
shot, where we build one fact table row for each product delivery and update
the row until the product leaves the warehouse. Each of the three inventory
models tells a different story. In some inventory applications, two or even all
three models may be appropriate simultaneously.
Inventory Periodic Snapshot
Let’s return to our retail case study. Optimized inventory levels in the stores
can have a major impact on chain profitability. Making sure the right product
is in the right store at the right time minimizes out-of-stocks (where the prod-
uct isn’t available on the shelf to be sold) and reduces overall inventory carry-
ing costs. The retailer needs the ability to analyze daily quantity-on-hand
inventory levels by product and store.
It is time to put the four-step process for designing dimensional models to
work again. The business process we’re interested in analyzing is the retail
store inventory. In terms of granularity, we want to see daily inventory by
product at each individual store, which we assume is the atomic level of detail
provided by the operational inventory system. The dimensions immediately
fall out of this grain declaration: date, product, and store. We are unable to
envision additional descriptive dimensions at this granularity. Inventory typi-
cally is not associated with a retail promotion dimension. Although a store
promotion may be going on while the products are sitting in inventory, the
promotion usually is not associated with the product until it is actually sold.
After the promotion has ended, the products still may be sitting in inventory.
Typically, promotion dimensions are associated with product movement, such
as when the product is ordered, received, or sold.
The simplest view of inventory involves only a single fact: quantity on
hand. This leads to an exceptionally clean dimensional design, as shown in
Figure 3.2.
Inventory 69
Figure 3.2 Store inventory periodic snapshot schema.
The date dimension table in this case study is identical to the table developed
in the earlier case for retail store sales. The product and store dimensions also
may be identical. Alternatively, we may want to further decorate these dimen-
sion tables with additional attributes that would be useful for inventory analy-
sis. For example, the product dimension could be enhanced to include
columns such as the minimum reorder quantity, assuming that they are con-
stant and discrete descriptors of each product stock keeping unit (SKU). Like-
wise, in the store dimension, in addition to the selling square-footage attribute
we discussed in Chapter 2, we also might include attributes to identify the
frozen and refrigerated storage square footages. We’ll talk more about the
implications of adding these dimension attributes later in this chapter.
If we are analyzing inventory levels at the retailer’s warehouse rather than at
the store location, the schema would look quite similar to Figure 3.2. Obvi-
ously, the store dimension would be replaced with a warehouse dimension.
When monitoring inventory levels at the warehouse, normally we do not
retain the store dimension as a fourth dimension unless the warehouse inven-
tory has been allocated to a specific store.
Even a schema as simple as this one can be very useful. Numerous insights can
be derived if inventory levels are measured frequently for many products in
many storage locations. If we’re analyzing the in-store inventory levels of a
mass merchandiser, this database could be used to balance store inventories
each night after the stores close.
This periodic snapshot fact table faces a serious challenge that Chapter 2’s
sales transaction fact table did not. The sales fact table was reasonably sparse
because only about 10 percent of the products in each of our hypothetical
stores actually sold each day. If a product didn’t sell in a store on a given day,
then there was no row in the fact table for that combination of keys. Inventory,
on the other hand, generates dense snapshot tables. Since the retailer strives to
avoid out-of-stock situations where the product is not available for sale, there
is a row in the fact table for virtually every product in every store every day.
Figure 3.2 contents:
Date Dimension: Date Key (PK), Date Attributes ...
Store Inventory Snapshot Fact: Date Key (FK), Product Key (FK), Store Key (FK), Quantity on Hand
Product Dimension: Product Key (PK), Product Attributes ...
Store Dimension: Store Key (PK), Store Attributes ...
We may well include the zero measurements as explicit records. For our gro-
cery retailer with 60,000 products stocked in 100 stores, we would insert
approximately 6 million rows (60,000 products x 100 stores) with each fact
table load. With a row width of just 14 bytes, the fact table would grow by 84
MB each time we append more fact table rows. A year’s worth of daily snap-
shots would consume over 30 GB. The denseness of inventory snapshots
sometimes mandates some compromises.
Perhaps the most obvious compromise is to reduce the snapshot frequencies
over time. It may be acceptable to keep the last 60 days of inventory at the
daily level and then revert to less granular weekly snapshots for historical
data. In this way, instead of retaining 1,095 snapshots during a 3-year period,
the number could be reduced to 208 total snapshots (60 daily + 148 weekly
snapshots in two separate fact tables given their unique periodicity). We have
reduced the total data size by more than a factor of 5.
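The sizing arithmetic above can be made explicit in a short Python sketch. The figures (60,000 SKUs, 100 stores, 14-byte rows, a 60-day daily window within 3 years) are the ones from the text; everything else is derived.

```python
# Dense periodic snapshot sizing, using the chapter's figures.
PRODUCTS = 60_000
STORES = 100
ROW_BYTES = 14

rows_per_load = PRODUCTS * STORES              # 6,000,000 rows per daily snapshot
mb_per_load = rows_per_load * ROW_BYTES / 1e6  # 84 MB appended per load
gb_per_year = mb_per_load * 365 / 1e3          # roughly 30.7 GB per year

# The compromise: keep 60 daily snapshots, then weekly snapshots for the
# remainder of a 3-year window.
daily_snapshots = 60
weekly_snapshots = round((3 * 365 - daily_snapshots) / 7)  # about 148
total_snapshots = daily_snapshots + weekly_snapshots       # 208 instead of 1,095
reduction_factor = (3 * 365) / total_snapshots             # better than 5x
```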
Semiadditive Facts
We stressed the importance of fact additivity in Chapter 2. When we modeled
the flow of product past a point at the checkout cash register, only the products
that actually sold were measured. Once a product was sold, it couldn’t be
counted again in a subsequent sale. This made most of the measures in the
retail sales schema perfectly additive across all dimensions.
In the inventory snapshot schema, the quantity on hand can be summarized
across products or stores and result in a valid total. Inventory levels, however,
are not additive across dates because they represent snapshots of a level or bal-
ance at one point in time. It is not possible to tell whether yesterday’s inven-
tory is the same or different from today’s inventory solely by looking at
inventory levels. Because inventory levels (and all forms of financial account
balances) are additive across some dimensions but not all, we refer to them as
semiadditive facts.
The semiadditive nature of inventory balance facts is even more understand-
able if we think about our checking account balances. On Monday, let’s pre-
sume that you have $50 in your account. On Tuesday, the balance remains
unchanged. On Wednesday, you deposit another $50 into your account so that
the balance is now $100. The account has no further activity through the end of
the week. On Friday, you can’t merely add up the daily balances during the
week and declare that your balance is $400 (based on $50 + 50 + 100 + 100 +
100). The most useful way to combine account balances and inventory levels
across dates is to average them (resulting in an $80 average balance in the
checking example). We are all familiar with our bank referring to the average
daily balance on our monthly account summary.
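The checking-account arithmetic can be sketched in a few lines of Python, using the balances given above:

```python
# Daily balances for the week described in the text.
daily_balances = {"Mon": 50, "Tue": 50, "Wed": 100, "Thu": 100, "Fri": 100}

# Summing a semiadditive fact across dates yields a meaningless "total."
naive_sum = sum(daily_balances.values())           # 400, not a valid balance

# Averaging over the number of time periods yields the useful number.
average_balance = naive_sum / len(daily_balances)  # 80, the average daily balance
```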
Inventory 71
All measures that record a static level (inventory levels, financial account balances,
and measures of intensity such as room temperatures) are inherently nonadditive
across the date dimension and possibly other dimensions. In these cases, the mea-
sure may be aggregated usefully across time, for example, by averaging over the
number of time periods.
The last few words in this design principle contain a trap. Unfortunately, you
cannot use the SQL AVG function to calculate the average over time. The SQL
AVG function averages over all the rows received by the query, not just the
number of dates. For example, if a query requested the average inventory for
a cluster of three products in four stores across seven dates (that is, what is the
average daily inventory of a brand in a geographic region during a given
week), the SQL AVG function would divide the summed inventory value by
84 (3 products x 4 stores x 7 dates). Obviously, the correct answer is to divide
the summed inventory value by 7, which is the number of daily time periods.
Because SQL has no standard functionality such as an AVG_DATE_SUM oper-
ator that would compute the average over just the date dimension, inventory
calculations are burdened with additional complexity. A proper inventory
application must isolate the date constraint and retrieve its cardinality alone
(in this case, the 7 days comprising the requested week). Then the application
must divide the final summed inventory value by the date constraint cardinal-
ity. This can be done with an embedded SQL call within the overall SQL state-
ment or by querying the date dimension separately and then storing the
resulting value in an application that is passed to the overall SQL statement.
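As a sketch of that application logic (in Python rather than embedded SQL, with hypothetical rows for 3 products in 4 stores across 7 dates), the key step is dividing by the cardinality of the date constraint alone:

```python
from itertools import product

# Hypothetical snapshot rows: 3 products x 4 stores x 7 dates = 84 rows,
# each with a quantity on hand of 10 for simplicity.
dates = [f"2002-06-{d:02d}" for d in range(1, 8)]
rows = [
    {"date": d, "product": p, "store": s, "qty_on_hand": 10}
    for d, p, s in product(dates, ["A", "B", "C"], ["S1", "S2", "S3", "S4"])
]

summed_qty = sum(r["qty_on_hand"] for r in rows)

# Wrong: SQL AVG semantics divide by the number of rows received (84).
avg_over_rows = summed_qty / len(rows)

# Right: isolate the date constraint's cardinality (7) and divide by it.
date_cardinality = len({r["date"] for r in rows})
avg_daily_inventory = summed_qty / date_cardinality
```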
Enhanced Inventory Facts
The simplistic view of inventory we developed in our periodic snapshot fact
table allows us to see a time series of inventory levels. For most inventory
analysis, quantity on hand isn’t enough. Quantity on hand needs to be used in
conjunction with additional facts to measure the velocity of inventory move-
ment and develop other interesting metrics such as the number of turns, num-
ber of days’ supply, and gross margin return on inventory (GMROI,
pronounced “jem-roy”).
If we added quantity sold (or equivalently, quantity depleted or shipped if
we’re dealing with a warehouse location) to each inventory fact row, we could
calculate the number of turns and days’ supply. For daily inventory snapshots,
the number of turns measured each day is calculated as the quantity sold
divided by the quantity on hand. For an extended time span, such as a year,
the number of turns is the total quantity sold divided by the daily average
quantity on hand. The number of days’ supply is a similar calculation. Over a
time span, the number of days’ supply is the final quantity on hand divided by
the average quantity sold.
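The turn and days'-supply calculations just described can be sketched as follows; the daily snapshot values are hypothetical:

```python
# One week of hypothetical daily snapshots for a single product and store.
daily_qty_sold = [20, 25, 15, 30, 10, 22, 18]
daily_qty_on_hand = [200, 180, 190, 160, 210, 188, 170]

# Over a time span: turns = total quantity sold / daily average quantity on hand.
total_sold = sum(daily_qty_sold)
avg_on_hand = sum(daily_qty_on_hand) / len(daily_qty_on_hand)
turns = total_sold / avg_on_hand

# Days' supply = final quantity on hand / average quantity sold.
avg_sold = total_sold / len(daily_qty_sold)
days_supply = daily_qty_on_hand[-1] / avg_sold
```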
In addition to the quantity sold, we probably also can supply the extended value
of the inventory at cost, as well as the value at the latest selling price. The differ-
ence between these two values is the gross profit, of course. The gross margin is
equal to the gross profit divided by the value at the latest selling price.
Finally, we can multiply the number of turns by the gross margin to get the
GMROI, as expressed in the following formula:
GMROI = [total quantity sold x (value at latest selling price - value at cost)] / [daily average quantity on hand x value at the latest selling price]
Although this formula looks complicated, the idea behind GMROI is simple. By
multiplying the gross margin by the number of turns, we create a measure of the
effectiveness of our inventory investment. A high GMROI means that we are
moving the product through the store quickly (lots of turns) and are making
good money on the sale of the product (high gross margin). A low GMROI means
that we are moving the product slowly (low turns) and aren’t making very much
money on it (low gross margin). The GMROI is a standard metric used by inven-
tory analysts to judge a company’s quality of investment in its inventory.
If we want to be more ambitious than our initial design in Figure 3.2, then we
should include the quantity sold, value at cost, and value at the latest selling
price columns in our snapshot fact table, as illustrated in Figure 3.3. Of course,
if some of these metrics exist at different granularity in separate fact tables, a
requesting application would need to retrieve all the components of the
GMROI computation at the same level.
Notice that quantity on hand is semiadditive but that the other measures in
our advanced periodic snapshot are all fully additive across all three dimen-
sions. The quantity sold amount is summarized to the particular grain of the
fact table, which is daily in this case. The value columns are extended, additive
amounts. We do not store GMROI in the fact table because it is not additive.
We can calculate GMROI from the constituent columns across any number of
fact rows by adding the columns up before performing the calculation, but we
are dead in the water if we try to store GMROI explicitly because we can’t use-
fully combine GMROIs across multiple rows.
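A minimal sketch of that calculation, with hypothetical daily snapshot rows: the additive constituent columns are summed across the rows first, and only then is the GMROI ratio applied.

```python
# Hypothetical enhanced-snapshot rows for one product/store over two days.
rows = [
    {"qty_sold": 10, "qty_on_hand": 100, "value_at_cost": 60.0, "value_at_price": 100.0},
    {"qty_sold": 20, "qty_on_hand": 80,  "value_at_cost": 120.0, "value_at_price": 200.0},
]

def gmroi(rows, num_days):
    """Sum the additive columns across rows, then compute the ratio."""
    total_sold = sum(r["qty_sold"] for r in rows)
    avg_on_hand = sum(r["qty_on_hand"] for r in rows) / num_days
    cost = sum(r["value_at_cost"] for r in rows)
    price = sum(r["value_at_price"] for r in rows)
    # turns x gross margin, per the formula in the text.
    return total_sold * (price - cost) / (avg_on_hand * price)

result = gmroi(rows, num_days=2)
```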
Figure 3.3 Enhanced inventory periodic snapshot to support GMROI analysis.
Date Dimension: Date Key (PK), Date Attributes
Store Inventory Snapshot Fact: Date Key (FK), Product Key (FK), Store Key (FK), Quantity on Hand, Quantity Sold, Dollar Value at Cost, Dollar Value at Latest Selling Price
Product Dimension: Product Key (PK), Product Attributes
Store Dimension: Store Key (PK), Store Attributes
The periodic snapshot is the most common inventory schema. We’ll touch
briefly on two alternative perspectives to complement the inventory snapshot
just designed. For a change of pace, rather than describing these models in the
context of the retail in-store inventory, we’ll move up the value chain to dis-
cuss the inventory located in our warehouses.
Inventory Transactions
A second way to model an inventory business process is to record every trans-
action that affects inventory. Inventory transactions at the warehouse might
include the following:
■■ Receive product
■■ Place product into inspection hold
■■ Release product from inspection hold
■■ Return product to vendor due to inspection failure
■■ Place product in bin
■■ Authorize product for sale
■■ Pick product from bin
■■ Package product for shipment
■■ Ship product to customer
■■ Receive product from customer
■■ Return product to inventory from customer return
■■ Remove product from inventory
Each inventory transaction identifies the date, product, warehouse, vendor,
transaction type, and in most cases, a single amount representing the inven-
tory quantity impact caused by the transaction. Assuming that the granularity
of our fact table is one row per inventory transaction, the resulting schema is
illustrated in Figure 3.4.
Figure 3.4 Warehouse inventory transaction model.
Warehouse Inventory Transaction Fact: Date Key (FK), Product Key (FK), Warehouse Key (FK), Vendor Key (FK), Inventory Transaction Type Key (FK), Inventory Transaction Dollar Amount
Warehouse Dimension: Warehouse Key (PK), Warehouse Name, Warehouse Address, Warehouse City, Warehouse State, Warehouse Zip, Warehouse Zone, Warehouse Total Square Footage, and more
Inventory Transaction Type Dimension: Inventory Transaction Type Key (PK), Inventory Transaction Type Description, Inventory Transaction Type Group
Also joined: Date Dimension, Product Dimension, Vendor Dimension
Even though the transaction-level fact table is again very simple, it contains the
most detailed information available about inventory because it mirrors fine-
scale inventory manipulations. The transaction-level fact table is useful for mea-
suring the frequency and timing of specific transaction types. For instance, only
a transaction-grained inventory fact table can answer the following questions:
■■ How many times have we placed a product into an inventory bin on the
same day we picked the product from the same bin at a different time?
■■ How many separate shipments did we receive from a given vendor, and
when did we get them?
■■ On which products have we had more than one round of inspection
failures that caused return of the product to the vendor?
Even so, it is impractical to use this table as the sole basis for analyzing inven-
tory performance. Although it is theoretically possible to reconstruct the exact
inventory position at any moment in time by rolling all possible transactions
forward from a known inventory position, it is too cumbersome and impracti-
cal for broad data warehouse questions that span dates or products.
Remember that there’s more to life than transactions alone. Some form of snapshot
table to give a more cumulative view of a process often accompanies a transaction
fact table.
Inventory Accumulating Snapshot
The final inventory model that we’ll explore briefly is the accumulating snap-
shot. In this model we place one row in the fact table for a shipment of a par-
ticular product to the warehouse. In a single fact table row we track the
disposition of the product shipment until it has left the warehouse. The accu-
mulating snapshot model is only possible if we can reliably distinguish prod-
ucts delivered in one shipment from those delivered at a later time. This
approach is also appropriate if we are tracking disposition at very detailed lev-
els, such as by product serial number or lot number.
Let’s assume that the inventory goes through a series of well-defined events or
milestones as it moves through the warehouse, such as receiving, inspection,
bin placement, authorization to sell, picking, boxing, and shipping. The phi-
losophy behind the accumulating snapshot fact table is to provide an updated
status of the product shipment as it moves through these milestones. Each fact
table row will be updated until the product leaves the warehouse. As illus-
trated in Figure 3.5, the inventory accumulating snapshot fact table with its
multitude of dates and facts looks quite different from the transaction or peri-
odic snapshot schemas.
Figure 3.5 Warehouse inventory accumulating snapshot.
Accumulating snapshots are the third major type of fact table. They are inter-
esting both because of the multiple date-valued foreign keys at the beginning
of the key list and also because we revisit and modify the same fact table
records over and over. Since the accumulating snapshot is rarely used for long-running, continuously replenished inventory processes, we'll defer more detailed coverage to Chapter 5. The alert reader will notice the four nonadditive metrics at the
end of the fact table. Again, stay tuned for Chapter 5.
Value Chain Integration
Now that we’ve completed the design of three inventory model variations,
let’s revisit our earlier discussion about the retailer’s value chain. Both the
business and IT organizations typically are very interested in value chain inte-
gration. Low-level business analysts may not feel much urgency, but those in
the higher ranks of management are very aware of the need to look across the
business to better evaluate performance. Numerous data warehouse projects
have focused recently on management’s need to better understand customer
relationships from an end-to-end perspective. Obviously, this requires the
ability to look consistently at customer information across processes, such as
Figure 3.5 contents:
Warehouse Inventory Accumulating Fact: Date Received Key (FK), Date Inspected Key (FK), Date Placed in Inventory Key (FK), Date Authorized to Sell Key (FK), Date Picked Key (FK), Date Boxed Key (FK), Date Shipped Key (FK), Date of Last Return Key (FK), Product Key (FK), Warehouse Key (FK), Vendor Key (FK), Quantity Received, Quantity Inspected, Quantity Returned to Vendor, Quantity Placed in Bin, Quantity Authorized to Sell, Quantity Picked, Quantity Boxed, Quantity Shipped, Quantity Returned by Customer, Quantity Returned to Inventory, Quantity Damaged, Quantity Lost, Quantity Written Off, Unit Cost, Unit List Price, Unit Average Price, Unit Recovery Price
Joined dimensions: Product, Warehouse, Vendor, and a separate date dimension for each of the eight date keys (Date Received, Date Inspected, Date Placed in Inventory, Date Authorized to Sell, Date Picked, Date Boxed, Date Shipped, Date of Last Return)
quotes, orders, invoicing, payments, and customer service. Even if your man-
agement’s vision is not so lofty, business users certainly are tired of getting
reports that don’t match from different systems or teams.
IT managers know all too well that integration is needed to deliver on the
promises of data warehousing. Many consider it their fiduciary responsibility
to manage the organization’s information assets. They know that they’re not
fulfilling their responsibilities if they allow standalone, nonintegrated data-
bases to proliferate. In addition to better addressing the business’s needs, the
IT organization also benefits from integration because it allows the organiza-
tion to better leverage scarce resources and gain efficiencies through the use of
reusable components.
Fortunately, the folks who typically are most interested in integration also
have the necessary organizational influence and economic willpower to make
it happen. If they don’t place a high value on integration, then you’re facing a
much more serious organizational challenge. It shouldn’t be the sole responsi-
bility of the data warehouse manager to garner organizational consensus for
an integrated warehouse architecture across the value chain. The political sup-
port of senior management is very important. It takes the data warehouse
manager off the hook and places the burden of the decision-making process on
senior management’s shoulders, where it belongs.
In Chapters 1 and 2 we modeled data from several processes of the value
chain. While separate fact tables in separate data marts represent the data from
each process, the models share several common business dimensions, namely,
date, product, and store. We’ve logically represented this dimension sharing in
Figure 3.6. Using shared, common dimensions is absolutely critical to design-
ing data marts that can be integrated. They allow us to combine performance
measurements from different processes in a single report. We use multipass
SQL to query each data mart separately, and then we outer join the query
results based on a common dimension attribute. This linkage, often referred to
as drill across, is straightforward if the dimension table attributes are identical.
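A minimal sketch of drill across, assuming each mart has already been queried separately and returned results grouped by a conformed Brand attribute (the brand names and amounts are hypothetical):

```python
# Result sets from two separate data mart queries, keyed by the
# conformed dimension attribute (Brand).
sales_results = {"Brand X": 1200.0, "Brand Y": 800.0}    # from the sales mart
inventory_results = {"Brand X": 340.0, "Brand Z": 95.0}  # from the inventory mart

# Full outer join on the conformed attribute: keep every brand that
# appears in either result set, with None where a mart returned nothing.
all_brands = sorted(set(sales_results) | set(inventory_results))
report = {
    brand: (sales_results.get(brand), inventory_results.get(brand))
    for brand in all_brands
}
```

Because the Brand attribute is labeled and valued identically in both marts, the two result sets line up row for row; with nonconformed dimensions this join would silently drop or mismatch rows.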
Figure 3.6 Sharing dimensions between business processes.
Fact tables: POS Retail Sales Transaction Fact, Retail Inventory Snapshot Fact, Warehouse Inventory Transaction Fact
Shared dimensions: Date, Product, Store, Promotion, Warehouse, Vendor
Data Warehouse Bus Architecture
Obviously, building the enterprise’s data warehouse in one step is too daunt-
ing, yet building it as isolated pieces defeats the overriding goal of consistency.
For long-term data warehouse success, we need to use an architected, incre-
mental approach to build the enterprise’s warehouse. The approach we advo-
cate is the data warehouse bus architecture.
The word bus is an old term from the electrical power industry that is now
used commonly in the computer industry. A bus is a common structure to
which everything connects and from which everything derives power. The bus
in your computer is a standard interface specification that allows you to plug
in a disk drive, CD-ROM, or any number of other specialized cards or devices.
Because of the computer’s bus standard, these peripheral devices work
together and usefully coexist, even though they were manufactured at differ-
ent times by different vendors.
By defining a standard bus interface for the data warehouse environment, separate
data marts can be implemented by different groups at different times. The separate
data marts can be plugged together and usefully coexist if they adhere to the standard.
If we think back to the value chain diagram in Figure 3.1, we can envision
many business processes plugging into the data warehouse bus, as illustrated
in Figure 3.7. Ultimately, all the processes of an organization’s value chain will
create a family of dimensional models that share a comprehensive set of com-
mon, conformed dimensions. We’ll talk more about conformed dimensions
later in this chapter, but for now, assume that the term means similar.
Figure 3.7 Sharing dimensions across the value chain.
Bus dimensions: Date, Store, Promotion, Product, Warehouse, Vendor, Shipper
Business processes: Store Sales, Store Inventory, Purchase Orders
The data warehouse bus architecture provides a rational approach to decom-
posing the enterprise data warehouse planning task. During the limited-
duration architecture phase, the team designs a master suite of standardized
dimensions and facts that have uniform interpretation across the enterprise.
This establishes the data architecture framework. We then tackle the imple-
mentation of separate data marts in which each iteration closely adheres to
the architecture. As the separate data marts come on line, they fit together like
the pieces of a puzzle. At some point, enough data marts exist to make good
on the promise of an integrated enterprise data warehouse.
The bus architecture allows data warehouse managers to get the best of both
worlds. They have an architectural framework that guides the overall design,
but the problem has been divided into bite-sized data mart chunks that can be
implemented in realistic time frames. Separate data mart development teams
follow the architecture guidelines while working fairly independently and asynchronously.
The bus architecture is independent of technology and the database platform.
All flavors of relational and online analytical processing (OLAP)-based data
marts can be full participants in the data warehouse bus if they are designed
around conformed dimensions and facts. Data warehouses will inevitably
consist of numerous separate machines with different operating systems and
database management systems (DBMSs). If designed coherently, they will
share a uniform architecture of conformed dimensions and facts that will
allow them to be fused into an integrated whole.
Data Warehouse Bus Matrix
The tool we use to create, document, and communicate the bus architecture is
the data warehouse bus matrix, which we’ve illustrated in Figure 3.8.
Figure 3.8 Sample data warehouse bus matrix.
Matrix rows (business processes): Retail Sales, Retail Inventory, Retail Deliveries, Warehouse Inventory, Warehouse Deliveries, Purchase Orders
Working in a tabular fashion, we lay out the business processes of the organi-
zation as matrix rows. It is important to remember that we are identifying the
business processes closely identified with sources of data, not the organiza-
tion’s business departments. The matrix rows translate into data marts based
on the organization’s primary activities. We begin by listing the data marts
that are derived from a single primary source system, commonly known as
first-level data marts. These data marts are recognizable complements to their
operational source.
The rows of the bus matrix correspond to data marts. You should create separate
matrix rows if the sources are different, the processes are different, or if the matrix
row represents more than what can reasonably be tackled in a single implementa-
tion iteration.
Once it is time to begin a data mart development project, we recommend start-
ing the actual implementation with first-level data marts because they mini-
mize the risk of signing up for an implementation that is too ambitious. Most
of the overall risk of failure comes from biting off too much of the extract-
transformation-load (ETL) data staging design and development effort. In
many cases, first-level data marts provide users with enough interesting data
to keep them happy and quiet while the data mart teams keep working on
more difficult issues.
Once we’ve fully enumerated the list of first-level data marts, then we can
identify more complex multisource marts as a second step. We refer to these
data marts as consolidated data marts because they typically cross business
processes. While consolidated data marts are immensely beneficial to the orga-
nization, they are more difficult to implement because the ETL effort grows
alarmingly with each additional major source that’s integrated into a single
dimensional model. It is prudent to focus on the first-level data marts as
dimensional building blocks before tackling the task of consolidating. In some
cases the consolidated data mart is actually more than a simple union of data
sets from the first-level data marts.
Profitability is a classic example of a consolidated data mart where separate
revenue and cost factors are combined from different process marts to provide
a complete view of profitability. While a highly granular profitability mart is
exciting because it provides visibility into product and customer profit perfor-
mance, it is definitely not the first mart you should attempt to implement. You
could easily drown while attempting to stage all the components of revenue
and cost. If you are absolutely forced to focus on profitability as your first mart,
then you should begin by allocating costs on a rule-of-thumb basis rather than
doing the complete job of sourcing all the underlying cost detail. Even so,
attempting to get organization consensus on allocation rules may be a project
showstopper given the sensitive (and perhaps wallet-impacting) nature of the
allocations. One of the project prerequisites, outside the scope of the warehouse
project team’s responsibilities, should be business agreement on the allocation
rules. It is safe to say that it is best to avoid dealing with the complexities of
profitability until you have some data warehousing successes under your belt.
The columns of the matrix represent the common dimensions used across the
enterprise. It is often helpful to create a comprehensive list of dimensions
before filling in the matrix. When you start with a large list of potential dimen-
sions, it becomes a useful creative exercise to determine whether a given
dimension possibly could be associated with a data mart.
The shaded cells indicate that the dimension column is related to the business
process row. The resulting matrix will be surprisingly dense. Looking across
the rows is revealing because you can see the dimensionality of each data mart
at a glance. However, the real power of the matrix comes from looking at the
columns as they depict the interaction between the data marts and common dimensions.
The matrix is a very powerful device for both planning and communication.
Although it is relatively straightforward to lay out the rows and columns, in
the process, we’re defining the overall data architecture for the warehouse. We
can see immediately which dimensions warrant special attention given their
participation in multiple data marts. The matrix helps prioritize which dimen-
sions should be tackled first for conformity given their prominent roles.
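That prioritization can be read straight off the matrix once it is captured as data. A sketch, with illustrative cells rather than the exact contents of Figure 3.8:

```python
from collections import Counter

# Each matrix row (business process) mapped to its shaded dimension
# columns. The specific cells here are illustrative.
bus_matrix = {
    "Retail Sales":        {"Date", "Product", "Store", "Promotion"},
    "Retail Inventory":    {"Date", "Product", "Store"},
    "Warehouse Inventory": {"Date", "Product", "Warehouse", "Vendor"},
    "Purchase Orders":     {"Date", "Product", "Warehouse", "Vendor", "Shipper"},
}

# Reading down the columns: dimensions used by more than one data mart
# are the ones to conform first.
usage = Counter(dim for dims in bus_matrix.values() for dim in dims)
high_priority = sorted(d for d, n in usage.items() if n > 1)
```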
The matrix allows us to communicate effectively within and across data mart
teams, as well as upward and outward throughout the organization. The
matrix is a succinct deliverable that visually conveys the entire plan at once. It
is a tribute to its simplicity that the matrix can be used effectively to directly
communicate with senior IT and business management.
Creating the data warehouse bus matrix is one of the most important up-front deliv-
erables of a data warehouse implementation. It is a hybrid resource that is part tech-
nical design tool, part project management tool, and part communication tool.
It goes without saying that it is unacceptable to build separate data marts that
ignore a framework to tie the data together. Isolated, independent data marts
are worse than simply a lost opportunity for analysis. They deliver access to
irreconcilable views of the organization and further enshrine the reports that
cannot be compared with one another. Independent data marts become legacy
implementations in their own right; by their very existence, they block the
development of a coherent warehouse environment.
So what happens if you’re not starting with a blank data warehousing slate?
Perhaps several data marts have been constructed already without regard to
an architecture of conformed dimensions. Can you rescue your stovepipes and
convert them to the bus architecture? To answer this question, you should start
first with an honest appraisal of your existing nonintegrated data marts. This
typically entails a series of face-to-face meetings with the separate teams
(including the clandestine teams within business organizations) to determine
the gap between the current environment and the organization’s architected
goal. Once the gap is understood, you need to develop an incremental plan to
convert the data marts to the enterprise architecture. The plan needs to be sold
internally. Senior IT and business management must understand the current
state of data chaos, the risks of doing nothing, and the benefits of moving for-
ward according to your game plan. Management also needs to appreciate that
the conversion will require a commitment of support, resources, and funding.
If an existing data mart is based on a sound dimensional design, perhaps you
can simply map an existing dimension to a standardized version. The original
dimension table would be rebuilt using a cross-reference map. Likewise, the
fact table also would need to be reprocessed to replace the original dimension
keys with the conformed dimension keys. Of course, if the original and con-
formed dimension tables contain different attributes, rework of the preexisting
queries is inevitable. More typically, existing data marts are riddled with
dimensional modeling errors beyond just the lack of adherence to standard-
ized dimensions. In some cases, the stovepipe data mart already has outlived
its useful life. Isolated data marts often are built for a specific functional area.
When others try to leverage the environment, they typically discover that the
data mart was implemented at an inappropriate level of granularity and is
missing key dimensionality. The effort required to retrofit these data marts
into the warehouse architecture may exceed the effort to start over from
scratch. As difficult as it is to admit, stovepipe data marts often have to be shut
down and rebuilt in the proper bus architecture framework.
Conformed Dimensions
Now that you understand the importance of the bus architecture, let’s further
explore the standardized conformed dimensions that serve as the cornerstone of
the warehouse bus. Conformed dimensions are either identical or strict mathe-
matical subsets of the most granular, detailed dimension. Conformed dimen-
sions have consistent dimension keys, consistent attribute column names,
consistent attribute definitions, and consistent attribute values (which translates
into consistent report labels and groupings). Dimension tables are not con-
formed if the attributes are labeled differently or contain different values. If a
customer or product dimension is deployed in a nonconformed manner, then
either the separate data marts cannot be used together or, worse, attempts to use
them together will produce invalid results.
Conformed dimensions come in several different flavors. At the most basic level,
conformed dimensions mean the exact same thing with every possible fact table
to which they are joined. The date dimension table connected to the sales facts is
identical to the date dimension table connected to the inventory facts. In fact, the
conformed dimension may be the same physical table within the database.
However, given the typical complexity of our warehouse’s technical environ-
ment with multiple database platforms, it is more likely that the dimensions are
duplicated synchronously in each data mart. In either case, the date dimensions
in both data marts will have the same number of rows, same key values, same
attribute labels, same attribute definitions, and same attribute values. There is
consistent data content, data interpretation, and user presentation.
Most conformed dimensions are defined naturally at the most granular level
possible. The grain of the customer dimension naturally will be the individual
customer. The grain of the product dimension will be the lowest level at which
products are tracked in the source systems. The grain of the date dimension
will be the individual day.
Sometimes dimensions are needed at a rolled-up level of granularity. Perhaps
the roll-up dimension is required because the fact table represents aggregated
facts that are associated with aggregated dimensions. This would be the case if
we had a weekly inventory snapshot in addition to our daily snapshot. In
other situations, the facts simply may be generated by another business
process at a higher level of granularity. One business process, such as sales,
captures data at the atomic product level, whereas forecasting generates data
at the brand level. You couldn’t share a single product dimension table across
the two business process schemas because the granularity is different. The
product and brand dimensions still would conform if the brand table were a
strict subset of the atomic product table. Attributes that are common to both
the detailed and rolled-up dimension tables, such as the brand and category
descriptions, should be labeled, defined, and valued identically in both tables,
as illustrated in Figure 3.9.
Roll-up dimensions conform to the base-level atomic dimension if they are a strict
subset of that atomic dimension.
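The derivation of such a roll-up can be sketched in a few lines of Python. The attribute names and the key-assignment scheme are illustrative assumptions, not a prescribed implementation:

```python
# Hypothetical derivation of a conformed roll-up: the brand dimension keeps
# only the attributes shared with the atomic product dimension, valued
# identically, so the two tables conform by construction.
ROLLUP_ATTRIBUTES = ("brand", "subcategory", "category", "department")

def derive_brand_dimension(product_rows):
    seen, brand_rows = set(), []
    for row in product_rows:
        profile = tuple(row[attr] for attr in ROLLUP_ATTRIBUTES)
        if profile not in seen:           # one row per distinct brand profile
            seen.add(profile)
            brand_rows.append({"brand_key": len(brand_rows) + 1,
                               **{attr: row[attr] for attr in ROLLUP_ATTRIBUTES}})
    return brand_rows

products = [
    {"product_key": 1, "description": "IntelliKidz 1.0", "brand": "Intelli",
     "subcategory": "Learning", "category": "Software", "department": "Education"},
    {"product_key": 2, "description": "IntelliKidz 2.0", "brand": "Intelli",
     "subcategory": "Learning", "category": "Software", "department": "Education"},
]
```

Because the brand rows are built strictly from the atomic product rows, every shared attribute is guaranteed to be labeled, defined, and valued identically in both tables.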
We may encounter other legitimate conformed dimension subsets with dimen-
sion tables at the same level of granularity. For example, in the inventory snap-
shot schema we added supplemental attributes to the product and store
dimensions that may not be useful to the sales transaction schema. The prod-
uct dimension tables used in these two data marts still conform if the keys and
common columns are identical. Of course, given that the supplemental
attributes were limited to the inventory data mart, we would be unable to look
across processes using these add-on attributes.
Figure 3.9 Conforming roll-up dimension subsets.
Another case of conformed dimension subsetting occurs when two dimen-
sions are at the same level of detail but one represents only a subset of rows.
For example, we may have a corporate product dimension that contains data
for our full portfolio of products across multiple disparate lines of business,
as illustrated in Figure 3.10. Analysts in the separate businesses may want to
view only their subset of the corporate dimension, restricted to the product
rows for their business. By using a subset of rows, they aren’t encumbered
with the entire product set for the organization. Of course, the fact table
joined to this subsetted dimension must be limited to the same subset of
products. If a user attempts to use a subset dimension while accessing a fact
table consisting of the complete product set, he or she may encounter unex-
pected query results. Technically, referential integrity would be violated. We
need to be cognizant of the potential opportunity for user confusion or error
with dimension row subsetting. We will further elaborate on dimension sub-
sets when we discuss heterogeneous products in Chapter 9.
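A simple sanity check along these lines can catch the violation before users do. The function name and data shapes are our own invention:

```python
# Hypothetical referential-integrity check for a row-subsetted dimension:
# every product key in the fact table must exist in the subset dimension,
# or queries against the pair will silently drop rows.
def fact_matches_dimension_subset(fact_rows, dimension_keys):
    return {row["product_key"] for row in fact_rows} <= set(dimension_keys)

apparel_dimension_keys = [101, 102]
apparel_facts = [{"product_key": 101, "sales": 25.0}]
assert fact_matches_dimension_subset(apparel_facts, apparel_dimension_keys)
```

A fact row referencing a product outside the subset, such as an appliance product key, would cause the check to fail, flagging exactly the user-confusion scenario described above.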
Product Dimension: Product Key (PK), Product Description, SKU Number
(Natural Key), Brand Description, Subcategory Description, Category
Description, Department Description, Package Type Description, Package Size,
Fat Content Description, Diet Type Description, Weight Units of Measure,
Storage Type, Shelf Life Type, Shelf Width, Shelf Height, Shelf Depth,
and more
Brand Dimension: Brand Key (PK), Brand Description, Subcategory Description,
Category Description, Department Description
The conformed date dimension in our daily sales and monthly forecasting sce-
nario is a unique example of both row and column dimension subsetting. Obvi-
ously, we can’t simply use the same date dimension table because of
the difference in roll-up granularity. However, the month dimension may
consist of strictly the month-end daily date table rows with the exclusion of
all columns that don’t apply at the monthly granularity. Excluded columns
would include daily date columns such as the date description, day number in
epoch, weekday/weekend indicator, week-ending date, holiday indicator, day
number within year, and others. You might consider including a month-end
indicator on the daily date dimension to facilitate creation of this monthly table.
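The creation of the monthly table from the daily table can be sketched as follows; the column names are illustrative, not a fixed list:

```python
# Hypothetical build of the month dimension as a row-and-column subset of
# the daily date dimension: keep only the month-end rows (flagged on the
# daily table) and drop the columns that apply solely at daily grain.
DAILY_ONLY_COLUMNS = {"date", "day_number_in_epoch", "weekday_indicator",
                      "week_ending_date", "holiday_indicator",
                      "day_number_in_year", "month_end_indicator"}

def derive_month_dimension(daily_rows):
    return [{col: value for col, value in row.items()
             if col not in DAILY_ONLY_COLUMNS}
            for row in daily_rows if row["month_end_indicator"]]

daily = [
    {"date": "2002-01-30", "month": "January 2002", "fiscal_period": "F1",
     "weekday_indicator": "Weekday", "month_end_indicator": False},
    {"date": "2002-01-31", "month": "January 2002", "fiscal_period": "F1",
     "weekday_indicator": "Weekday", "month_end_indicator": True},
]
monthly = derive_month_dimension(daily)
```

The surviving columns, such as month and fiscal period, remain labeled and valued identically in both tables, so the two date dimensions conform.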
Conformed dimensions will be replicated either logically or physically through-
out the enterprise; however, they should be built once in the staging area. The
responsibility for each conformed dimension is vested in a group we call the
dimension authority. The dimension authority has responsibility for defining,
maintaining, and publishing a particular dimension or its subsets to all the data
mart clients who need it. They take responsibility for staging the gold-standard
dimension data. Ultimately, this may involve sourcing from multiple opera-
tional systems to publish a complete, high-quality dimension table.
Figure 3.10 Conforming dimension subsets at the same granularity. Drilling
across (conforming) both appliance products and apparel products requires
using attributes common to both types.
The major responsibility of the centralized dimension authority is to establish, main-
tain, and publish the conformed dimensions to all the client data marts.
Once a set of master conformed dimensions has been defined for the enter-
prise, it is extremely important that the data mart teams actually use these
dimensions. The commitment to use conformed dimensions is more than a
technical decision; it is a business policy decision that is key to making the
enterprise data warehouse function. Agreement on conformed dimensions
faces far more political challenges than technical hurdles. Given the political
issues surrounding them, conformed dimensions must be supported from the
outset by the highest levels of the organization. Business executives must
stress the importance to their teams, even if the conformed dimension causes
some compromises. The CIO also should appreciate the importance of con-
formed dimensions and mandate that each data mart team takes the pledge to
always use them.
Obviously, conformed dimensions require implementation coordination.
Modifications to existing attributes or the addition of new attributes must be
reviewed with all the data mart teams employing the conformed dimension.
You will also need to determine your conformed dimension release strategy.
Changes to identical dimensions should be replicated synchronously to all
associated data marts. This push approach to dimension publishing maintains
the requisite consistency across the organization.
Now that we’ve preached about the importance of conformed dimensions,
we’ll discuss the situation where it may not be realistic or necessary to estab-
lish conformed dimensions for the organization. If you are a conglomerate
with subsidiaries that span widely varied industries, there may be little point
in trying to integrate. If you don’t want to cross-sell the same customers from
one line of business to another, sell products that span lines of business, or
assign products from multiple lines of business to a single salesperson, then it
may not make sense to attempt a comprehensive data warehouse architecture.
There likely isn’t much perceived business value to conform your dimensions.
The willingness to seek a common definition for product or customer is a
major litmus test for an organization theoretically intent on building an enter-
prise data warehouse. If the organization is unwilling to agree on common
definitions across all data marts, the organization shouldn’t attempt to build a
data warehouse that spans these marts. You would be better off building sep-
arate, self-contained data warehouses for each subsidiary.
In our experience, while many organizations find it currently mission impos-
sible to combine data across their disparate lines of business, some degree of
integration is typically an ultimate goal. Rather than throwing your hands in
the air and declaring that it can’t possibly be done, we suggest starting down
the path toward conformity. Perhaps there are a handful of attributes that can
be conformed across disparate lines of business. Even if it is merely a product
description, category, and line of business attribute that is common to all busi-
nesses, this least-common-denominator approach is still a step in the right
direction. You don’t have to get all your businesses to agree on everything
related to a dimension before proceeding.
Conformed Facts
Thus far we have talked about the central task of setting up conformed dimen-
sions to tie our data marts together. This is 90 percent of the up-front data
architecture effort. The remaining effort goes into establishing conformed
fact definitions.
Revenue, profit, standard prices, standard costs, measures of quality, measures
of customer satisfaction, and other key performance indicators (KPIs) are facts
that must be conformed. In general, fact table data is not duplicated explicitly
in multiple data marts. However, if facts do live in more than one location,
such as in first-level and consolidated marts, the underlying definitions and
equations for these facts must be the same if they are to be called the same
thing. If they are labeled identically, then they need to be defined in the same
dimensional context and with the same units of measure from data mart to
data mart.
We must be disciplined in our data naming practices. If it is impossible to conform a
fact exactly, then you should give different names to the different interpretations.
This makes it less likely that incompatible facts will be used in a calculation.
Sometimes a fact has a natural unit of measure in one fact table and another
natural unit of measure in another fact table. For example, the flow of product
down the retail value chain may best be measured in shipping cases at the
warehouse but in scanned units at the store. Even if all the dimensional con-
siderations have been taken into account correctly, it would be difficult to use
these two incompatible units of measure in one drill-across report. The usual
solution to this kind of problem is to refer the user to a conversion factor
buried in the product dimension table and hope that the user can find the con-
version factor and use it correctly. This is unacceptable in terms of both over-
head and vulnerability to error. The correct solution is to carry the fact in both
units of measure so that a report can easily glide down the value chain, pick-
ing off comparable facts. We’ll talk more about multiple units of measure in
Chapter 5.
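Carrying both units of measure is a load-time computation. A minimal sketch, assuming a hypothetical units-per-case conversion factor taken from the product dimension:

```python
# Hypothetical staging step: rather than hoping users find and correctly
# apply the conversion factor buried in the product dimension, compute both
# units of measure once, at load time, and carry both on the fact row.
def add_both_units(fact_row, units_per_case):
    row = dict(fact_row)                  # leave the incoming row untouched
    row["quantity_in_units"] = row["quantity_in_cases"] * units_per_case
    return row

shipment = {"product_key": 12345, "quantity_in_cases": 10}
loaded = add_both_units(shipment, units_per_case=24)
assert loaded["quantity_in_units"] == 240
```

With both columns physically present, a drill-across report can pick off shipping cases at the warehouse and scanned units at the store without any user-side arithmetic.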
Summary
Inventory is an important process to measure and monitor in many industries.
In this chapter we developed dimensional models for the three complemen-
tary views of inventory. Either the periodic or accumulating snapshot model
will serve as a good stand-alone depiction of inventory. The periodic snapshot
would be chosen for long-running, continuously replenished inventory sce-
narios. The accumulating snapshot would be used for one-time, finite inven-
tory situations with a definite beginning and end. More in-depth inventory
applications will want to augment one or both of these models with the trans-
action model.
We introduced key concepts surrounding the data warehouse bus architecture
and matrix. Each business process of the value chain, supported by a primary
source system, translates into a data mart, as well as a row in the bus matrix.
The data marts share a surprising number of standardized, conformed dimen-
sions. Developing and adhering to the bus architecture is an absolute must if
you intend to build a data warehouse composed of an integrated set of data
marts.
Procurement
We’ll explore the procurement process in this chapter. This topic has obvious cross-
industry appeal because it is applicable to anyone who acquires products or ser-
vices for either use or resale. In addition to developing several purchasing
models in this chapter, we will provide in-depth coverage of the techniques for
handling changes to our dimension table attributes. While the descriptive attrib-
utes in dimension tables are relatively static, they are subject to change over
time. Product lines are restructured, causing product hierarchies to change. Cus-
tomers move, causing their geographic information to change. Sales reps are
realigned, causing territory assignments to change. We’ll discuss several
approaches to dealing with these inevitable changes in our dimension tables.
Chapter 4 discusses the following concepts:
■■ Value chain reinforcement
■■ Blended versus separate transaction schemas
■■ Slowly changing dimension techniques, both basic and advanced
Procurement Case Study
Thus far we have studied downstream retail sales and inventory processes in the
value chain. We understand the importance of mapping out the data warehouse
bus architecture where conformed dimensions are used across process-centric
fact tables. In this chapter we’ll extend these concepts as we work our way fur-
ther up the value chain to the procurement process.
For many companies, procurement is a critical business activity. Effective pro-
curement of products at the right price for resale is obviously important to
retailers such as our grocery chain. Procurement also has strong bottom-line
implications for any large organization that buys products as raw materials for
manufacturing. Significant cost-savings opportunities are associated with
reducing the number of suppliers and negotiating agreements with preferred
vendors.
Demand planning drives efficient materials management. Once demand is
forecasted, procurement’s goal is to source the appropriate materials/prod-
ucts in the most economical manner. Procurement involves a wide range of
activities from negotiating contracts to issuing purchase requisitions and pur-
chase orders (POs) to tracking receipts and authorizing payments. The follow-
ing list gives you a better sense of a procurement organization’s common
analytic requirements:
■■ Which materials or products are purchased most frequently? How many
vendors supply these products? At what prices? In what units of measure
(such as bulk or drum)?
■■ Looking at demand across the enterprise (rather than at a single physical
location), are there opportunities to negotiate favorable pricing by consoli-
dating suppliers, single sourcing, or making guaranteed buys?
■■ Are our employees purchasing from the preferred vendors or skirting the
negotiated vendor agreements (maverick spending)?
■■ Are we receiving the negotiated pricing from our vendors (vendor con-
tract purchase price variance)?
■■ How are our vendors performing? What is the vendor’s fill rate? On-time
delivery performance? Late deliveries outstanding? Percent of orders
backordered? Rejection rate based on receipt inspection?
Procurement Transactions
As we begin working through the four-step design process, we first decide that
procurement is the business process to be modeled. We study the process in
detail and observe a flurry of procurement transactions, such as purchase requi-
sitions, purchase orders, shipping notifications, receipts, and payments. Similar
to the approach we took in Chapter 3 with the inventory transactions, we first
elect to build a fact table with the grain of one row per procurement transaction.
We identify transaction date, product, vendor, contract terms, and procurement
transaction type as our key dimensions. Procured units and transaction amount
are the facts. The resulting design looks similar to Figure 4.1.
Figure 4.1 Procurement fact table with multiple transaction types.
If we are still working for the same grocery retailer, then the transaction date
and product dimensions are the same conformed dimensions we developed
originally in Chapter 2. If we’re working with manufacturing procurement,
the raw materials products likely are located in a separate raw materials
dimension table rather than included in the product dimension for salable
products. The vendor, contract terms, and procurement transaction type
dimensions are new to this schema. The vendor dimension contains one row
for each vendor, along with interesting descriptive attributes to support a vari-
ety of vendor analyses. The contract terms dimension contains one row for
each generalized set of terms negotiated with a vendor, similar to the promo-
tion dimension in Chapter 2. The procurement transaction type dimension
allows us to group or filter on transaction types, such as purchase orders. The
contract number is a degenerate dimension. It would be used to determine the
volume of business conducted under each negotiated contract.
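For readers who think in DDL, the star in Figure 4.1 might be declared as follows. This SQLite sketch is illustrative only; the table and column names follow the figure, but the DDL and sample query are ours:

```python
# Illustrative SQLite rendering of the Figure 4.1 star (abbreviated to the
# fact table and the transaction type dimension). The query shows the
# transaction type dimension used to filter on purchase orders.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE procurement_transaction_type_dimension (
    procurement_transaction_type_key INTEGER PRIMARY KEY,
    procurement_transaction_type_description TEXT,
    procurement_transaction_type_category TEXT);
CREATE TABLE procurement_transaction_fact (
    procurement_transaction_date_key INTEGER,
    product_key INTEGER,
    vendor_key INTEGER,
    contract_terms_key INTEGER,
    procurement_transaction_type_key INTEGER
        REFERENCES procurement_transaction_type_dimension,
    contract_number TEXT,                 -- degenerate dimension
    procurement_transaction_quantity REAL,
    procurement_transaction_dollar_amount REAL);
""")
conn.execute("INSERT INTO procurement_transaction_type_dimension "
             "VALUES (1, 'Purchase Order', 'Order')")
conn.execute("INSERT INTO procurement_transaction_fact "
             "VALUES (20020115, 12345, 7, 3, 1, 'C-100', 50, 1250.00)")
total = conn.execute("""
    SELECT SUM(f.procurement_transaction_dollar_amount)
    FROM procurement_transaction_fact f
    JOIN procurement_transaction_type_dimension t
         USING (procurement_transaction_type_key)
    WHERE t.procurement_transaction_type_description = 'Purchase Order'
""").fetchone()[0]
```

Grouping by the contract number degenerate dimension in the same fashion would yield the volume of business under each negotiated contract.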
Multiple- versus Single-Transaction
Fact Tables
As we review the initial procurement schema design with business users, we
are made aware of several new details. First of all, we learn that the business
users describe the various procurement transactions differently. To the busi-
ness, purchase orders, shipping notices, warehouse receipts, and vendor pay-
ments are all viewed as separate and unique processes.
It turns out that several of the procurement transactions actually come from
different source systems. There is no single procurement system to source all
the procurement transactions. Instead, there is a purchasing system that pro-
vides purchase requisitions and purchase orders, a warehousing system that
provides shipping notices and warehouse receipts, and an accounts payable
system that deals with vendor payments.
Procurement Transaction Fact: Procurement Transaction Date Key (FK), Product
Key (FK), Vendor Key (FK), Contract Terms Key (FK), Procurement Transaction
Type Key (FK), Contract Number (DD), Procurement Transaction Quantity,
Procurement Transaction Dollar Amount
Date Dimension; Product Dimension
Vendor Dimension: Vendor Key (PK), Vendor Name, Vendor Street Address,
Vendor City, Vendor Zip, Vendor State/Province, Vendor Country, Vendor
Status, Vendor Minority Ownership Flag, Vendor Corporate Parent, and more
Contract Terms Dimension: Contract Terms Key (PK), Contract Terms
Description, Contract Terms Type
Procurement Transaction Type Dimension: Procurement Transaction Type Key
(PK), Procurement Transaction Type Description, Procurement Transaction
Type Category
We further discover that several of our transaction types have different dimen-
sionality. For example, discounts taken are applicable to vendor payments but
not to the other transaction types. Similarly, the name of the warehouse clerk
who received the goods at the warehouse applies to receipts but doesn’t make
sense elsewhere.
We also learn about a variety of interesting control numbers, such as purchase
order and payment check numbers, that are created at various steps in the pro-
curement process. These control numbers are perfect candidates for degener-
ate dimensions. For certain transaction types, more than one control number
may apply.
While we sort through these new details, we are faced with a design decision.
Should we build a blended transaction fact table with a transaction type
dimension to view all our procurement transactions together, or do we build
separate fact tables for each transaction type? This is a common design
quandary that surfaces in many transactional situations, not just procurement.
As dimensional modelers, we need to make design decisions based on a thor-
ough understanding of the business requirements weighed against the trade-
offs of the available options. In this case, there is no simple formula to make
the definite determination of whether to use a single or multiple fact tables. A
single fact table may be the most appropriate solution in some situations,
whereas multiple fact tables are most appropriate in others. When faced with
this design decision, we look to the following considerations to help us sort
things out:
■■ First, what are the users’ analytic requirements? As designers, our goal is
to reduce complexity in order to present the data in the most effective form
for the business users. How will the business users most commonly ana-
lyze this data? Do the required analyses often require multiple transaction
types together, leading us to consider a single blended fact table? Or do
they more frequently look solely at a single transaction type in an analysis,
causing us to favor separate fact tables for each type of transaction?
■■ Are there really multiple unique business processes? In our procurement
example, it seems that buying products (purchase orders) is distinctly dif-
ferent from receiving products (receipts). The existence of separate control
numbers for each step in the process is a clue that we are dealing with
separate processes. Given this situation, we would lean toward separate
fact tables. In Chapter 3’s inventory example, all the varied inventory
transactions clearly related to a single inventory process, resulting in a
single fact table design.
■■ Are multiple source systems involved? In our example, we’re dealing
with three separate source systems: purchasing, warehousing, and
accounts payable. Again, this would suggest separate fact tables. The data
staging activities required to source the single-transaction fact table from
three separate source systems are likely daunting.
■■ What is the dimensionality of the facts? In our procurement example we
discovered several dimensions that applied to some transaction types but
not to others. This would again lead us to separate fact tables.
In our hypothetical case study we decide to implement multiple transaction
fact tables as illustrated in Figure 4.2. We have separate fact tables for purchase
requisitions, purchase orders, shipping notices, warehouse receipts, and ven-
dor payments. We arrived at this decision because the users view these activi-
ties as separate and distinct business processes, the data comes from different
source systems, and there is unique dimensionality for the various transaction
types. Multiple fact tables allow us to provide richer, more descriptive dimen-
sions and attributes. As we progress from purchase requisitions all the way to
vendor payments, we inherit date dimensions and degenerate dimensions
from the previous steps. The single fact table approach would have required
generalization of the labeling for some dimensions. For example, purchase
order date and receipt date likely would have been generalized to transaction
date. Likewise, purchasing agent and receiving clerk would become
employee. In another organization with different business requirements,
source systems, and data dimensionality, the single blended fact table may be
more appropriate.
We understand that multiple fact tables may require more time to manage and
administer because there are more tables to load, index, and aggregate. Some
would argue that this approach increases the complexity of the data staging
processes. In fact, it may simplify the staging activities. Since the operational
data exist in separate source systems, we would need multiple staging
processes in either fact table scenario. Loading the data into separate fact
tables likely will be less complex than attempting to integrate data from the
multiple sources into a single fact table.
Complementary Procurement Snapshot
Separate from the decision regarding procurement transaction fact tables, we
may find that we also need to develop some sort of snapshot fact table to fully
address the needs of the business. As we suggested in Chapter 3, an accumu-
lating snapshot that crosses processes would be extremely useful if the busi-
ness is interested in monitoring product movement as it proceeds through the
procurement pipeline (including the duration or lag at each stage). We’ll spend
more time on this topic in Chapter 5.
Figure 4.2 Multiple fact tables for procurement processes.
Purchase Requisition Fact: Requisition Date Key (FK), Requested Date Key
(FK), Product Key (FK), Vendor Key (FK), Contract Terms Key (FK), Requested
By Key (FK), Contract Number (DD), Purchase Requisition Number (DD),
Purchase Requisition Quantity, Purchase Requisition Dollar Amount
Purchase Order Fact: Requisition Date Key (FK), Requested Date Key (FK),
Purchase Order Date Key (FK), Product Key (FK), Vendor Key (FK), Contract
Terms Key (FK), Requested By Key (FK), Purchase Agent Key (FK), Contract
Number (DD), Purchase Requisition Number (DD), Purchase Order Number (DD),
Purchase Order Quantity, Purchase Order Dollar Amount
Shipping Notices Fact: Shipping Notification Date Key (FK), Ship Date Key
(FK), Requested Date Key (FK), Product Key (FK), Vendor Key (FK), Contract
Terms Key (FK), Requested By Key (FK), Purchase Agent Key (FK), Contract
Number (DD), Purchase Requisition Number (DD), Purchase Order Number (DD),
Shipping Notification Number (DD), Shipped Quantity
Warehouse Receipts Fact: Warehouse Receipt Date Key (FK), Ship Date Key
(FK), Requested Date Key (FK), Product Key (FK), Vendor Key (FK), Received
Condition Key (FK), Warehouse Clerk Key (FK), Purchase Requisition Number
(DD), Purchase Order Number (DD), Shipping Notification Number (DD),
Received Quantity
Vendor Payment Fact: Payment Date Key (FK), Ship Date Key (FK), Warehouse
Receipt Date Key (FK), Product Key (FK), Vendor Key (FK), Contract Terms Key
(FK), Discount Taken Key (FK), Contract Number (DD), Purchase Requisition
Number (DD), Purchase Order Number (DD), Shipping Notification Number (DD),
Accounts Payable Check Number (DD), Vendor Payment Quantity, Vendor Gross
Payment Dollar Amount, Vendor Payment Discount Dollar Amount, Vendor Net
Payment Dollar Amount
Shared dimensions: Date Dimension, Product Dimension, Vendor Dimension,
Contract Terms Dimension, Received Condition Dimension, Employee Dimension,
Discount Taken Dimension
Slowly Changing Dimensions
Up to this point we have pretended that each dimension is logically indepen-
dent from all the other dimensions. In particular, dimensions have been
assumed to be independent of time. Unfortunately, this is not the case in the
real world. While dimension table attributes are relatively static, they are not
fixed forever. Dimension attributes change, albeit rather slowly, over time.
Dimensional designers must engage business representatives proactively to
help determine the appropriate change-handling strategy. We can’t simply
jump to the conclusion that the business doesn’t care about dimension changes
just because its representatives didn’t mention it during the requirements
process. While we’re assuming that accurate change tracking is unnecessary,
business users may be assuming that the data warehouse will allow them to
see the impact of each and every dimension change. Even though we may not
want to hear that change tracking is a must-have because we are not looking
for any additional development work, it is obviously better to receive the mes-
sage sooner rather than later.
When we need to track change, it is unacceptable to put everything into the
fact table or make every dimension time-dependent to deal with these
changes. We would quickly talk ourselves back into a full-blown normalized
structure with the consequential loss of understandability and query perfor-
mance. Instead, we take advantage of the fact that most dimensions are nearly
constant over time. We can preserve the independent dimensional structure
with only relatively minor adjustments to contend with the changes. We refer
to these nearly constant dimensions as slowly changing dimensions. Since Ralph
Kimball first introduced the notion of slowly changing dimensions in 1994,
some IT professionals—in a never-ending quest to speak in acronymese—have
termed them SCDs.
For each attribute in our dimension tables, we must specify a strategy to han-
dle change. In other words, when an attribute value changes in the operational
world, how will we respond to the change in our dimensional models? In the
following section we’ll describe three basic techniques for dealing with
attribute changes, along with a couple hybrid approaches. You may decide
that you need to employ a combination of these techniques within a single
dimension table.
Type 1: Overwrite the Value
With the type 1 response, we merely overwrite the old attribute value in the
dimension row, replacing it with the current value. In so doing, the attribute
always reflects the most recent assignment.
Let’s assume that we work for an electronics retailer. The procurement buyers
are aligned along the same departmental lines as the store, so the products
being acquired roll up into departments. One of the procured products is Intel-
liKidz software. The existing row in the product dimension table for Intel-
liKidz looks like the following:
Product Key   Product Description   Department   SKU Number (Natural Key)
12345         IntelliKidz 1.0       Education    ABC922-Z
Of course, there would be numerous additional descriptive attributes in the
product dimension, but we’ve abbreviated the column listing given our page
space constraints. As we discussed earlier, a surrogate product key is the pri-
mary key of the table rather than just relying on the stock keeping unit (SKU)
number. Although we have demoted the SKU number to being an ordinary
product attribute, it still has a special significance because it remains the nat-
ural key. Unlike all other product attributes, the natural key must remain invi-
olate. Throughout the discussion of all three SCD types, we assume that the
natural key of a dimension remains constant.
Suppose that a new merchandising person decides that IntelliKidz should be
moved from the Education software department to the Strategy department
on January 15, 2002, in an effort to boost sales. With the type 1 response, we’d
simply update the existing row in the dimension table with the new depart-
ment description. The updated row would look like the following:
Product Key   Product Description   Department   SKU Number (Natural Key)
12345         IntelliKidz 1.0       Strategy     ABC922-Z
In this case, no dimension or fact table keys were modified when IntelliKidz’s
department changed. The rows in the fact table still reference product key
12345, regardless of IntelliKidz’s departmental location. When sales take off
following the move to the Strategy department, we have no information to
explain the performance improvement because the historical and more
recently loaded data both appear as if IntelliKidz has always rolled up into
the Strategy department.
The type 1 response is the simplest approach to dealing with dimension
attribute changes. The advantage of type 1 is that it is fast and easy. In the
dimension table, we merely overwrite the preexisting value with the current
assignment. The fact table is left untouched. The problem with a type 1 response
is that we lose all history of attribute changes. Since overwriting obliterates his-
torical attribute values, we’re left solely with the attribute values as they exist
today. A type 1 response obviously is appropriate if the attribute change is a cor-
rection. It also may be appropriate if there is no value in keeping the old descrip-
tion. We need input from the business to determine the value of retaining the old
attribute value; we shouldn’t make this determination on our own in an IT vac-
uum. Too often project teams use a type 1 response as the default response for
dealing with slowly changing dimensions and end up totally missing the mark
if the business needs to track historical changes accurately.
The type 1 response is easy to implement, but it does not maintain any history of
prior attribute values.
Before we leave the topic of type 1 changes, there’s one more easily overlooked
catch that you should be aware of. When we used a type 1 response to deal
with the relocation of IntelliKidz, any preexisting aggregations based on the
department value will need to be rebuilt. The aggregated data must continue
to tie to the detailed atomic data, where it now appears that IntelliKidz has
always rolled up into the Strategy department.
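The mechanics of a type 1 overwrite can be sketched as follows; the helper is hypothetical, standing in for whatever update your staging tool performs:

```python
# Hypothetical type 1 response: overwrite the attribute in place. The fact
# table is untouched, but the prior value is lost, and any aggregations
# built on the old value must be rebuilt.
def type_1_overwrite(dimension, product_key, attribute, new_value):
    old_value = dimension[product_key][attribute]
    dimension[product_key][attribute] = new_value   # history is lost here
    return old_value

product_dimension = {12345: {"description": "IntelliKidz 1.0",
                             "department": "Education", "sku": "ABC922-Z"}}
old = type_1_overwrite(product_dimension, 12345, "department", "Strategy")
assert product_dimension[12345]["department"] == "Strategy"
```

Note that the returned old value exists only transiently; once the overwrite is committed, nothing in the warehouse records that IntelliKidz ever belonged to the Education department.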
Type 2: Add a Dimension Row
We made the claim earlier in this book that one of the primary goals of the data
warehouse was to represent prior history correctly. A type 2 response is the
predominant technique for supporting this requirement when it comes to
slowly changing dimensions.
Using the type 2 approach, when IntelliKidz’s department changed, we issue
a new product dimension row for IntelliKidz to reflect the new department
attribute value. We then would have two product dimension rows for Intel-
liKidz, such as the following:
Product Key   Product Description   Department   SKU Number (Natural Key)
12345         IntelliKidz 1.0       Education    ABC922-Z
25984         IntelliKidz 1.0       Strategy     ABC922-Z
Now we see why the product dimension key can’t be the SKU number natural
key. We need two different product surrogate keys for the same SKU or phys-
ical barcode. Each of the separate surrogate keys identifies a unique product
attribute profile that was true for a span of time. With type 2 changes, the fact
table is again untouched. We don’t go back to the historical fact table rows to
modify the product key. In the fact table, rows for IntelliKidz prior to January
15, 2002, would reference product key 12345 when the product rolled into the
Education department. After January 15, the IntelliKidz fact rows would have
product key 25984 to reflect the move to the Strategy department until we are
forced to make another type 2 change. This is what we mean when we say that
type 2 responses perfectly partition or segment history to account for the change.
If we constrain only on the department attribute, then we very precisely dif-
ferentiate between the two product profiles. If we constrain only on the prod-
uct description, that is, IntelliKidz 1.0, then the query automatically will fetch
both IntelliKidz product dimension rows and automatically join to the fact
table for the complete product history. If we need to count the number of prod-
ucts correctly, then we would just use the SKU natural key attribute as the
basis of the distinct count rather than the surrogate key. The natural key field
becomes a kind of reliable glue that holds the separate type 2 records for a sin-
gle product together. Alternatively, a most recent row indicator might be
another useful dimension attribute to allow users to quickly constrain their
query to only the current profiles.
The type 2 response is the primary technique for accurately tracking slowly changing
dimension attributes. It is extremely powerful because the new dimension row auto-
matically partitions history in the fact table.
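A minimal sketch of the type 2 response, again in SQLite with illustrative names; it shows why a distinct product count must be based on the SKU natural key rather than the surrogate key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE product_dim (
        product_key INTEGER PRIMARY KEY,   -- surrogate key
        description TEXT,
        department  TEXT,
        sku         TEXT)                  -- natural key
""")
# Type 2: the old row stays; a new surrogate key captures the new profile.
conn.executemany("INSERT INTO product_dim VALUES (?, ?, ?, ?)", [
    (12345, "IntelliKidz 1.0", "Education", "ABC922-Z"),  # before the move
    (25984, "IntelliKidz 1.0", "Strategy",  "ABC922-Z"),  # after the move
])

# Counting surrogate keys double-counts the product...
key_count = conn.execute(
    "SELECT COUNT(DISTINCT product_key) FROM product_dim").fetchone()[0]
# ...so the distinct count is based on the SKU natural key instead.
sku_count = conn.execute(
    "SELECT COUNT(DISTINCT sku) FROM product_dim").fetchone()[0]
```

Fact rows loaded before the change carry key 12345 and rows loaded afterward carry 25984, so history partitions itself with no fact table updates.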
It certainly would feel natural to include an effective date stamp on a dimen-
sion row with type 2 changes. The date stamp would refer to the moment
when the attribute values in the row become valid or invalid in the case of
expiration dates. Effective and expiration date attributes are necessary in the
staging area because we’d need to know which surrogate key is valid when
we’re loading historical fact records. In the dimension table, these date stamps
are helpful extras that are not required for the basic partitioning of history. If
you use these extra date stamps, just remember that there is no need to con-
strain on the effective date in the dimension table in order to get the right
answer. This is often a point of confusion in the design and use of type 2 slowly
changing dimensions.
While including effective and expiration date attributes may feel comfortable to
database designers, we should be aware that the effective date on the dimen-
sion table may have little to do with the dates in the fact table. Attempting to
constrain on the dimension row effective date actually may yield an incorrect
result. Perhaps version 2.0 of IntelliKidz software will be released on May 1,
2002. A new operational SKU code (and corresponding data warehouse surro-
gate key) would be created for the new product. This isn’t a type 2 change
because the product is a completely new physical entity. However, if we look at
a fact table for the retailer, we don’t see such an abrupt partitioning of history.
The old version 1.0 of the software inevitably will continue to be sold in stores
after May 1, 2002, until the existing inventory is depleted. The new version 2.0
will appear on the shelves on May 1 and gradually will supersede the old ver-
sion. There will be a transition period where both versions of the software will
move past the cash registers in any given store. Of course, the product overlap
period will vary from store to store. The cash registers will recognize both oper-
ational SKU codes and have no difficulty handling the sale of either version. If
we had an effective date on the product dimension row, we wouldn’t dare con-
strain on this date to partition sales because the date has no relevance. Even
worse, using such a constraint may give us the wrong answer.
Nevertheless, the effective/expiration date stamps in the dimension may be
useful for more advanced analysis. The dates support very precise time slicing
of the dimension by itself. The row effective date is the first date the descrip-
tive profile is valid. The row expiration date would be one day less than the
row effective date for the next assignment, or the date the product was retired
from the catalog. We could determine what the product catalog looked like as
of December 31, 2001, by constraining a product table query to retrieve all
rows where the row effective date is less than or equal to December 31, 2001,
and the row expiration date is greater than or equal to December 31, 2001.
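Assuming ISO-formatted date columns (so string comparison matches date order), the point-in-time query might be sketched like this; the table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE product_dim (
        product_key         INTEGER PRIMARY KEY,
        description         TEXT,
        department          TEXT,
        row_effective_date  TEXT,
        row_expiration_date TEXT)
""")
conn.executemany("INSERT INTO product_dim VALUES (?, ?, ?, ?, ?)", [
    (12345, "IntelliKidz 1.0", "Education", "2001-06-01", "2002-01-14"),
    (25984, "IntelliKidz 1.0", "Strategy",  "2002-01-15", "9999-12-31"),
])

# The catalog as of December 31, 2001: effective <= as-of date <= expiration.
as_of = "2001-12-31"
rows = conn.execute("""
    SELECT product_key, department
    FROM product_dim
    WHERE row_effective_date <= ? AND row_expiration_date >= ?
""", (as_of, as_of)).fetchall()
```

Only the profile in force on the as-of date comes back; this slices the dimension by itself and never touches the fact table.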
We’ll further discuss opportunities to leverage effective and expiration dates
when we delve into the human resources schema in Chapter 8.
The type 2 response is the workhorse technique to support analysis using his-
torically accurate attributes. This response perfectly segments fact table his-
tory because prechange fact rows use the prechange surrogate key. Another
type 2 advantage is that we can gracefully track as many dimension changes
as required. Unlike the type 1 approach, there is no need to revisit preexisting
aggregation tables when using the type 2 approach.
Of course, the type 2 response to slowly changing dimensions requires the use
of surrogate keys, but you’re already using them anyhow, right? It is not suffi-
cient to use the underlying operational key with two or three version digits
because you’ll be vulnerable to the entire list of potential operational key
issues discussed in Chapter 2. Likewise, it is certainly inadvisable to append
an effective date to the otherwise primary key of the dimension table to
uniquely identify each version. With the type 2 response, we create a new
dimension row with a new single-column primary key to uniquely identify the
new product profile. This single-column primary key establishes the linkage
between the fact and dimension tables for a given set of product characteris-
tics. There’s no need to create a confusing secondary join based on effective or
expiration dates, as we have pointed out.
We recognize that some of you may be concerned about the administration of
surrogate keys to support type 2 changes. In Chapter 16 we’ll discuss a work-
flow for managing surrogate keys while accommodating type 2 changes in
more detail. In the meantime, we want to put your mind somewhat at ease
about the administrative burden. When we’re staging dimension tables, we’re
often handed a complete copy of the latest, greatest source data. It would be
wonderful if only the changes since the last extract, or deltas, were delivered
to the staging area, but more typically, the staging application has to find the
changed dimensions. A field-by-field comparison of each dimension row to
identify the changes between yesterday’s and today’s versions would be
extremely laborious, especially if we have 100 attributes in a several-million-
row dimension table. Rather than checking each field to see if something has
changed, we instead compute a checksum for the entire row all at once. A
cyclic redundancy checksum (CRC) algorithm helps us quickly recognize that
a wide, messy row has changed without looking at each of its constituent
fields. In our staging area we calculate the checksum for each row in a dimen-
sion table and add it to the row as an administrative column. At the next data
load, we compute the CRCs on the incoming records to compare with the prior
CRCs. If the CRCs match, all the attributes on both rows are identical; there’s
no need to check every field. Obviously, any new rows would trigger the cre-
ation of a new product dimension row. Finally, when we encounter a changed
CRC, then we’ll need to deal with the change based on our dimension-change
strategy. If we’re using a type 2 response for all the attributes, then we’d just
create another new row. If we’re using a combination of techniques, then we’d
have to look at the fields in more detail to determine the appropriate action.
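The CRC comparison can be sketched as follows. The `row_crc` helper and separator choice are illustrative, and a production version would store yesterday's checksum as an administrative column rather than recomputing it:

```python
import zlib

def row_crc(attrs, sep="\x1f"):
    # Checksum the whole attribute set at once instead of comparing
    # field by field; sep must be a character that never appears in the data.
    return zlib.crc32(sep.join(str(v) for v in attrs).encode("utf-8"))

# Yesterday's and today's dimension extracts, keyed by natural key (SKU).
yesterday = {"ABC922-Z": ("IntelliKidz 1.0", "Education")}
today     = {"ABC922-Z": ("IntelliKidz 1.0", "Strategy"),
             "XYZ113-A": ("MathBlast 2.0",   "Education")}

new_rows, changed_rows = [], []
for sku, attrs in today.items():
    if sku not in yesterday:
        new_rows.append(sku)            # brand-new product: create a dimension row
    elif row_crc(attrs) != row_crc(yesterday[sku]):
        changed_rows.append(sku)        # changed: apply the chosen SCD response
```

Matching checksums mean the rows are identical and no field-level comparison is needed; only the rows flagged as changed get the more expensive attribute-by-attribute inspection.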
Since the type 2 technique spawns new dimension rows, one downside of this
approach is accelerated dimension table growth. Hence it may be an inappro-
priate technique for dimension tables that already exceed a million rows. We’ll
discuss an alternative approach for handling change in large, multimillion-
row dimension tables when we explore the customer dimension in Chapter 6.
Type 3: Add a Dimension Column
While the type 2 response partitions history, it does not allow us to associate
the new attribute value with old fact history or vice versa. With the type 2
response, when we constrain on Department = Strategy, we will not see Intel-
liKidz facts from before January 15, 2002. In most cases, this is exactly what we want.
However, sometimes we want the ability to see fact data as if the change never
occurred. This happens most frequently with sales force reorganizations. Dis-
trict boundaries have been redrawn, but some users still want the ability to see
today’s sales in terms of yesterday’s district lines just to see how they would
have done under the old organizational structure. For a few transitional
months, there may be a desire to track history in terms of the new district
names and conversely to track new data in terms of old district names. A type
2 response won’t support this requirement, but the type 3 response comes to
the rescue.
In our software example, let’s assume that there is a legitimate business need
to track both the old and new values of the department attribute both forward
and backward for a period of time around the change. With a type 3 response,
we do not issue a new dimension row, but rather we add a new column to cap-
ture the attribute change. In the case of IntelliKidz, we alter the product
dimension table to add a prior department attribute. We populate this new col-
umn with the existing department value (Education). We then treat the depart-
ment attribute as a type 1 response, where we overwrite to reflect the current
value (Strategy). All existing reports and queries switch over to the new
department description immediately, but we can still report on the old depart-
ment value by querying on the prior department attribute.
Product Key   Product Description   Department   Prior Department   SKU Number (Natural Key)
12345         IntelliKidz 1.0       Strategy     Education          ABC922-Z
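A sketch of the type 3 steps in SQLite (illustrative names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE product_dim (
        product_key INTEGER PRIMARY KEY,
        description TEXT,
        department  TEXT,
        sku         TEXT)
""")
conn.execute("INSERT INTO product_dim VALUES "
             "(12345, 'IntelliKidz 1.0', 'Education', 'ABC922-Z')")

# Type 3: no new row. Add a column, preserve the old value there,
# then overwrite the main attribute (a type 1 response) with the new value.
conn.execute("ALTER TABLE product_dim ADD COLUMN prior_department TEXT")
conn.execute("UPDATE product_dim SET prior_department = department")
conn.execute("UPDATE product_dim SET department = 'Strategy'")

row = conn.execute("SELECT department, prior_department FROM product_dim "
                   "WHERE product_key = 12345").fetchone()
```

Existing queries pick up the new department automatically, while the prior department column keeps the old view of the world available.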
Type 3 is appropriate when there’s a strong need to support two views of the
world simultaneously. Some designers call this an alternate reality. This often
occurs when the change or redefinition is soft or when the attribute is a
human-applied label rather than a physical characteristic. Although the
change has occurred, it is still logically possible to act as if it has not. The type
3 response is distinguished from the type 2 response because both the current
and prior descriptions can be regarded as true at the same time. In the case of
a sales reorganization, management may want the ability to overlap and ana-
lyze results using either map of the sales organization for a period of time.
Another common variation occurs when your users want to see the current
value in addition to retaining the original attribute value rather than the prior.
The type 3 response is used rather infrequently. Don’t be fooled into thinking
that the higher type number associated with the type 3 response indicates that
it is the preferred approach. The techniques have not been presented in good,
better, and best practice sequence. There is a time and place where each of
them is the most appropriate response.
The type 3 slowly changing dimension technique allows us to see new and historical
fact data by either the new or prior attribute values.
A type 3 response is inappropriate if you want to track the impact of numerous
intermediate attribute values. Obviously, there are serious implementation
and usage limitations to creating attributes that reflect the prior minus 1, prior
minus 2, and prior minus 3 states of the world, so we give up the ability to ana-
lyze these intermediate values. If there is a need to track a myriad of unpre-
dictable changes, then a type 2 response should be used instead in most cases.
Hybrid Slowly Changing Dimension Techniques
In this section we’ll discuss two hybrid approaches that combine basic slowly
changing dimension techniques. Many IT professionals become enamored of
these techniques because they seem to provide the best of all worlds. However,
the price we pay for greater flexibility is often greater complexity. While some
IT professionals are easily impressed by elegant flexibility, our business users
are just as easily turned off by complexity. You should not pursue these options
unless the business agrees that they are needed to address their requirements.
Predictable Changes with
Multiple Version Overlays
This technique is used most frequently to deal with sales organization realign-
ments, so we’ll depart from our IntelliKidz example to present the concept in
a more realistic scenario. Consider the situation where a sales organization
revises the map of its sales districts on an annual basis. Over a 5-year period,
the sales organization is reorganized five times. On the surface, this may seem
like a good candidate for a type 2 approach, but we discover through business
user interviews that they have a more complex set of requirements, including
the following capabilities:
■■ Report each year’s sales using the district map for that year.
■■ Report each year’s sales using a district map from an arbitrary
different year.
■■ Report an arbitrary span of years’ sales using a single district map from
any chosen year. The most common version of this requirement would
be to report the complete span of fact data using the current district map.
We cannot address this set of requirements with a standard type 2 response
because it partitions history. A year of fact data can only be reported using the
assigned map at that point in time with a type 2 approach. The requirements
can’t be met with a standard type 3 response because we want to support more
than two simultaneous maps.
Figure 4.3 Sample dimension table with multiple version overlays.
In this case we take advantage of the regular, predictable nature of these
changes by generalizing the type 3 approach to have five versions of the district
attribute for each sales rep. The sales rep dimension would include the attrib-
utes shown in Figure 4.3.
Each sales rep dimension row would include all prior district assignments.
The business user could choose to roll up the sales facts with any of the five
district maps. If a sales rep were hired in 2000, the dimension attributes for
1998 and 1999 would contain values along the lines of “Not Applicable.”
We label the most recent assignment as “Current District.” This attribute will
be used most frequently; we don’t want to modify our existing queries and
reports to accommodate next year’s change. When the districts are redrawn
next, we’d alter the table to add a district 2002 attribute. We’d populate this
column with the current district values and then overwrite the current
attribute with the 2003 district assignments.
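The annual rollover can be sketched like this, with hypothetical rep and district values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales_rep_dim (
        sales_rep_key    INTEGER PRIMARY KEY,
        sales_rep_name   TEXT,
        current_district TEXT,
        district_2001    TEXT,
        district_2000    TEXT)
""")
conn.execute("INSERT INTO sales_rep_dim VALUES "
             "(1, 'Pat Smith', 'West', 'Northwest', 'North')")

# Districts are redrawn for 2003: preserve the 2002 map in a new column,
# then overwrite Current District with the 2003 assignment.
conn.execute("ALTER TABLE sales_rep_dim ADD COLUMN district_2002 TEXT")
conn.execute("UPDATE sales_rep_dim SET district_2002 = current_district")
conn.execute("UPDATE sales_rep_dim SET current_district = 'Pacific'")

row = conn.execute(
    "SELECT current_district, district_2002, district_2001 FROM sales_rep_dim"
).fetchone()
```

Because Current District keeps its name from year to year, existing queries and reports survive the realignment untouched; only users who want a specific historical map reach for the year-labeled columns.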
Unpredictable Changes with
Single-Version Overlay
This final approach is relevant if you’ve been asked to preserve historical accu-
racy surrounding unpredictable attribute changes while supporting the ability
to report historical data according to the current values. None of the standard
slowly changing dimension techniques enable this requirement independently.
In the case of the electronics retailer’s product dimension, we would have two
department attributes on each row. The current department column represents
the current assignment; the historical department column represents the his-
torically accurate department attribute value.
[Figure 4.3 — Sales Rep Dimension: Sales Rep Key, Sales Rep Name, Sales Rep Address..., Current District, District 2001, District 2000, District 1999, District 1998, and more]
When IntelliKidz software is procured initially, the product dimension row
would look like the following:
Product Key   Product Description   Current Department   Historical Department   SKU Number (Natural Key)
12345         IntelliKidz 1.0       Education            Education               ABC922-Z
When the departments are restructured and IntelliKidz is moved to the Strat-
egy department, we’d use a type 2 response to capture the attribute change by
issuing a new row. In this new dimension row for IntelliKidz, the current
department will be identical to the historical department. For all previous
instances of IntelliKidz dimension rows, the current department attribute will
be overwritten to reflect the current structure. Both IntelliKidz rows would
identify the Strategy department as the current department.
Product Key   Product Description   Current Department   Historical Department   SKU Number (Natural Key)
12345         IntelliKidz 1.0       Strategy             Education               ABC922-Z
25984         IntelliKidz 1.0       Strategy             Strategy                ABC922-Z
In this manner we’re able to use the historical attribute to segment history and
see facts according to the departmental roll-up at that point in time. Mean-
while, the current attribute rolls up all the historical fact data for product keys
12345 and 25984 into the current department assignment. If IntelliKidz were
then moved into the Critical Thinking software department, our product table
would look like the following:
Product Key   Product Description   Current Department   Historical Department   SKU Number (Natural Key)
12345         IntelliKidz 1.0       Critical Thinking    Education               ABC922-Z
25984         IntelliKidz 1.0       Critical Thinking    Strategy                ABC922-Z
31726         IntelliKidz 1.0       Critical Thinking    Critical Thinking       ABC922-Z
With this hybrid approach, we issue a new row to capture the change (type
2) and add a new column to track the current assignment (type 3), where
subsequent changes are handled as a type 1 response. Someone once sug-
gested that we refer to this combo approach as type 6 (2 + 3 + 1). This tech-
nique allows us to track the historical changes accurately while also
supporting the ability to roll up history based on the current assignments.
We could further embellish (and complicate) this strategy by supporting
additional static department roll-up structures, in addition to the current
department, as separate attributes.
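A sketch of the combined (type 6) maintenance logic; `move_department` is a hypothetical helper, not a routine from the book:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE product_dim (
        product_key           INTEGER PRIMARY KEY,
        description           TEXT,
        current_department    TEXT,
        historical_department TEXT,
        sku                   TEXT)
""")
conn.execute("INSERT INTO product_dim VALUES "
             "(12345, 'IntelliKidz 1.0', 'Education', 'Education', 'ABC922-Z')")

def move_department(conn, sku, new_key, description, new_dept):
    # Type 2: issue a new row; its historical and current departments match.
    conn.execute("INSERT INTO product_dim VALUES (?, ?, ?, ?, ?)",
                 (new_key, description, new_dept, new_dept, sku))
    # Type 1 overwrite of the type 3 "current" column on every row for this SKU.
    conn.execute("UPDATE product_dim SET current_department = ? WHERE sku = ?",
                 (new_dept, sku))

move_department(conn, "ABC922-Z", 25984, "IntelliKidz 1.0", "Strategy")
move_department(conn, "ABC922-Z", 31726, "IntelliKidz 1.0", "Critical Thinking")

rows = conn.execute("""
    SELECT product_key, current_department, historical_department
    FROM product_dim ORDER BY product_key
""").fetchall()
```

Constraining on the historical column segments facts by the department in force at the time; constraining on the current column rolls all three surrogate keys up into today's assignment.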
Again, while this powerful technique may be naturally appealing to some
readers, it is important that we always consider the users’ perspective as we
strive to arrive at a reasonable balance between flexibility and complexity.
More Rapidly Changing Dimensions
In this chapter we’ve focused on the typically rather slow, evolutionary
changes to our dimension tables. What happens, however, when the rate of
change speeds up? If a dimension attribute changes monthly, then we’re no
longer dealing with a slowly changing dimension that can be handled reason-
ably with the techniques just discussed. One powerful approach for handling
more rapidly changing dimensions is to break off these rapidly changing
attributes into one or more separate dimensions. In our fact table we would
then have two foreign keys—one for the primary dimension table and another
for the rapidly changing attribute(s). These dimension tables would be associ-
ated with one another every time we put a row in the fact table. Stay tuned for
more on this topic when we cover customer dimensions in Chapter 6.
In this chapter we discussed several approaches to handling procurement
data. Effectively managing procurement performance can have a major impact
on an organization’s bottom line.
We also introduced several techniques to deal with changes to our dimension
table attributes. The slowly changing responses range from merely overwriting
the value (type 1), to adding a new row to the dimension table (type 2), to the
least frequently used approach in which we add a column to the table (type 3).
We also discussed several powerful, albeit more complicated, hybrid
approaches that combine the basic techniques.
Order Management
Order management consists of several critical business processes, including
order, shipment, and invoice processing. These processes spawn important
business metrics, such as sales volume and invoice revenue, that are key per-
formance indicators for any organization that sells products or services to
others. In fact, these foundation metrics are so crucial that data warehouse
teams most frequently tackle one of the order management processes for their
initial data warehouse implementation. Clearly, the topics in this case study
transcend industry boundaries.
In this chapter we’ll explore several different order management transactions,
including the common characteristics and complications you might encounter
when dimensionally modeling these transactions. We’ll elaborate on the con-
cept of an accumulating snapshot to analyze the order-fulfillment pipeline
from initial order through release to manufacturing, into finished goods inven-
tory, and finally to product shipment and invoicing. We’ll close the chapter by
comparing and contrasting the three types of fact tables: transaction, periodic
snapshot, and accumulating snapshot. For each of these fact table types, we’ll
also discuss the handling of real-time warehousing requirements.
Chapter 5 discusses the following concepts:
■■ Orders transaction schema
■■ Fact table normalization considerations
■■ Date dimension role-playing
■■ More on product dimensions
■■ Ship-to / bill-to customer dimension considerations
■■ Junk dimensions
■■ Multiple currencies and units of measure
■■ Handling of header and line item facts with different granularity
■■ Invoicing transaction schema with profit and loss facts
■■ Order fulfillment pipeline as accumulating snapshot schema
■■ Lag calculations
■■ Comparison of transaction, periodic snapshot, and accumulating snapshot fact tables
■■ Special partitions to support the demand for near-real-time data warehousing
Introduction to Order Management
If we take a closer look at the order management function, we see that
it’s composed of a series of business processes. In its most simplistic form,
we can envision a subset of the data warehouse bus matrix that resembles
Figure 5.1.
As we saw in earlier chapters, the data warehouse bus matrix closely corre-
sponds to the organization’s value chain. In this chapter we’ll focus specifi-
cally on the order and invoice rows of the matrix. We’ll also describe an
accumulating snapshot fact table that combines data from multiple order man-
agement processes.
Figure 5.1 Subset of data warehouse bus matrix for order management processes.
Figure 5.2 Order transaction fact table.
Order Transactions
The first process we’ll explore is order transactions. As companies have grown
through acquisition, they often find themselves with multiple operational
order transaction processing systems in the organization. The existence of
multiple source systems often creates a degree of urgency to integrate the dis-
parate results in the data warehouse rather than waiting for the long-term
application integration.
The natural granularity for an order transaction fact table is one row for each line
item on an order. The facts associated with this process typically include the order
quantity, extended gross order dollar amount, order discount dollar amount, and
extended net order dollar amount (which is equal to the gross order amount less
the discounts). The resulting schema would look similar to Figure 5.2.
Fact Normalization
Rather than storing a list of facts, as in Figure 5.2, some designers want to fur-
ther normalize the fact table so that there’s a single, generic fact amount, along
with a dimension that identifies the type of fact. The fact dimension would
indicate whether it is the gross order amount, order discount amount, or some
other measure. This technique may make sense when the set of facts is
sparsely populated for a given fact row and no computations are made
between facts. We have used this technique to deal with manufacturing qual-
ity test data, where the facts vary widely depending on the test conducted.
However, we generally resist the urge to further normalize the fact table. As we
see with orders data, facts usually are not sparsely populated within a row. In
this case, if we were to normalize the facts, we’d be multiplying the number of
rows in the fact table by the number of fact types. For example, assume that we
started with 10 million order line fact table rows, each with six keys and four
facts.

[Figure 5.2 — Order Transaction Fact: Order Date Key (FK), Requested Ship Date Key (FK), Product Key (FK), Customer Ship To Key (FK), Sales Rep Key (FK), Deal Key (FK), Order Number (DD), Order Line Number (DD), Order Quantity, Gross Order Dollar Amount, Order Deal Discount Dollar Amount, Net Order Dollar Amount; joined to Order Date, Requested Ship Date, Product, Customer Ship To, Sales Rep, and Deal dimension tables, with each date dimension carrying the date, day of week, month, and more.]

If we normalized the facts, we’d end up with 40 million fact rows, each
with seven keys and one fact. In addition, if we are performing any arithmetic
function between the facts (such as discount amount as a percentage of gross
order amount), it is far easier if the facts are in the same row because SQL makes
it difficult to perform a ratio or difference between facts in different rows. In
Chapter 13 we’ll explore a situation where a fact dimension makes more sense.
Dimension Role-Playing
By now we all know that a date dimension is found in every fact table because
we are always looking at performance over time. In a transaction-grained fact
table, the primary date column is the transaction date, such as the order date.
Sometimes we also discover other dates associated with each transaction, such
as the requested ship date for the order.
Each of the dates should be a foreign key in the fact table. However, we cannot
simply join these two foreign keys to the same date dimension table. SQL
would interpret such a two-way simultaneous join as requiring both the dates
to be identical, which isn’t very likely.
Even though we cannot literally join to a single date dimension table, we can
build and administer a single date dimension table behind the scenes. We cre-
ate the illusion of two independent date tables by using views. We are careful
to uniquely label the columns in each of the SQL views. For example, order
month should be uniquely labeled to distinguish it from requested ship
month. If we don’t practice good data housekeeping, we could find ourselves
in the uncomfortable position of not being able to tell the columns apart when
both are dragged into a report.
As we briefly described in Chapter 2, you would define the order date and
requested ship date views as follows:
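A sketch of what such views might look like, shown here in SQLite with illustrative column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE date_dim (
        date_key  INTEGER PRIMARY KEY,
        full_date TEXT,
        month     TEXT)
""")

# One physical date table behind the scenes; each role gets its own view
# with uniquely labeled columns so reports can tell the roles apart.
conn.execute("""
    CREATE VIEW order_date AS
    SELECT date_key  AS order_date_key,
           full_date AS order_date,
           month     AS order_month
    FROM date_dim
""")
conn.execute("""
    CREATE VIEW requested_ship_date AS
    SELECT date_key  AS requested_ship_date_key,
           full_date AS requested_ship_date,
           month     AS requested_ship_month
    FROM date_dim
""")

cols = [c[0] for c in conn.execute("SELECT * FROM order_date").description]
```

The fact table then joins its order date key to the order_date view and its requested ship date key to the requested_ship_date view, each with its own independent constraints.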
We now have two unique date dimensions that can be used as if they were inde-
pendent with completely unrelated constraints. We refer to this as role-playing
because the date dimension simultaneously serves different roles in a single
fact table. We’ll see additional examples of dimension role-playing sprinkled
throughout this book.
Role-playing in a data warehouse occurs when a single dimension simultaneously
appears several times in the same fact table. The underlying dimension may exist as
a single physical table, but each of the roles should be presented to the data access
tools in a separately labeled view.
To handle the multiple dates, some designers are tempted to create a single
date table with a key for each unique order date and requested ship date com-
bination. This approach falls apart on several fronts. First, our clean and sim-
ple daily date table with approximately 365 rows per year would balloon in
size if it needed to handle all the date combinations. Second, such a combina-
tion date table would no longer conform to our other frequently used daily,
weekly, and monthly date dimensions.
Product Dimension Revisited
A product dimension has participated in each of the case study vignettes pre-
sented so far in this book. The product dimension is one of the most common
and most important dimension tables you’ll encounter in a dimensional model.
The product dimension describes the complete portfolio of products sold by a
company. In most cases, the number of products in the portfolio turns out to be
surprisingly large, at least from an outsider’s perspective. For example, a
prominent U.S. manufacturer of dog and cat food tracks nearly 20,000 manu-
facturing variations of its products, including retail products everyone (or
every dog and cat) is familiar with, as well as numerous specialized products
sold through commercial and veterinary channels. We’ve worked with
durable goods manufacturers who sell literally millions of unique products.
Most product dimension tables share the following characteristics:
Numerous verbose descriptive columns. For manufacturers, it’s not unusual
to maintain 100 or more descriptors about the products they sell. Dimen-
sion table attributes naturally describe the dimension row, do not vary
because of the influence of another dimension, and are virtually constant
over time, although as we just discussed in Chapter 4, some attributes do
change slowly over time.
One or more attribute hierarchies in addition to many nonhierarchical
attributes. It is too limiting to think of products as belonging to a single
hierarchy. Products typically roll up according to multiple defined hierar-
chies. All the hierarchical data should be presented in a single flattened,
denormalized product dimension table. We resist creating normalized
snowflaked sub-tables for the product dimension. The costs of a more com-
plicated presentation and slower intradimension browsing performance
outweigh the minimal storage savings benefits. It is misleading to think
about browsing in a small dimension table, where all the relationships can
be imagined or visualized. Real product dimension tables have thousands
of entries, and the typical user does not know the relationships intimately.
If there are 20,000 dog and cat foods in the product dimension, it is not too
useful to request a pull-down list of the product descriptions. It would be
essential, in this example, to have the ability to constrain on one attribute,
such as flavor, and then another attribute, such as package type, before
attempting to display the product description listings. Notice that the first
two constraints were not drawn strictly from a product hierarchy. Any of
the product attributes, regardless of whether they belong to a hierarchy,
should be used freely for drilling down and up. In fact, most of the attrib-
utes in a large product table are standalone low-cardinality attributes, not
part of explicit hierarchies.
The existence of an operational product master aids in maintenance of the
product dimension, but a number of transformations and administrative steps
must occur to convert the operational master file into the dimension table, including:
Remap the operational product key to a surrogate key. As we discussed in
Chapter 2, this smaller, more efficient join key is needed to avoid havoc
caused by duplicate use of the operational product key over time. It also
might be necessary to integrate product information sourced from different
operational systems. Finally, as we just learned in Chapter 4, the surrogate
key is needed to track changing product attributes in cases where the oper-
ational system has not generated a new product master key.
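A minimal sketch of this remapping, assuming an in-memory key map (a real ETL pipeline would persist the map between loads) and hypothetical source-system names:

```python
class SurrogateKeyGenerator:
    """Assigns small integer surrogate keys to operational natural keys."""

    def __init__(self):
        self.next_key = 1
        self.key_map = {}  # (source_system, operational_key) -> surrogate key

    def lookup(self, source_system, operational_key):
        """Assign a new surrogate key the first time a natural key is seen."""
        natural = (source_system, operational_key)
        if natural not in self.key_map:
            self.key_map[natural] = self.next_key
            self.next_key += 1
        return self.key_map[natural]

gen = SurrogateKeyGenerator()
print(gen.lookup("ERP", "A-100"))  # 1
print(gen.lookup("ERP", "A-200"))  # 2
print(gen.lookup("ERP", "A-100"))  # 1  (same natural key, same surrogate)
print(gen.lookup("CRM", "A-100"))  # 3  (same code from a different system)
```

Keying the map on the source system as well as the operational code is what lets products from different operational systems be integrated without key collisions.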
Add readable text strings to augment or replace numeric codes in the opera-
tional product master. We don’t accept the excuse that the businesspeople
are familiar with the codes. The only reason businesspeople are familiar
with codes is that they have been forced to use them! Remember that the
columns in a product dimension table are the sole source of query con-
straints and report labels, so the contents must be legible. Keep in mind that
cryptic abbreviations are as bad as outright numeric codes; they also should
be augmented or replaced with readable text. Multiple abbreviated codes in
a single field should be expanded and separated into distinct fields.
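As an illustration of expanding packed codes into readable fields, here is a sketch; the code values and decode tables are invented for the example:

```python
# Hypothetical decode tables; real ones would come from business definitions.
FLAVOR_CODES = {"BF": "Beef", "CK": "Chicken", "TN": "Tuna"}
PACKAGE_CODES = {"CN": "Can", "PC": "Pouch"}

def expand_product_code(combined_code):
    """Split a combined code like 'BF-CN' into two readable attribute fields."""
    flavor_code, package_code = combined_code.split("-")
    return {
        "flavor": FLAVOR_CODES[flavor_code],
        "package_type": PACKAGE_CODES[package_code],
    }

print(expand_product_code("BF-CN"))  # {'flavor': 'Beef', 'package_type': 'Can'}
```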
Quality assure all the text strings to ensure that there are no misspellings,
impossible values, or cosmetically different versions of the same
attribute. In addition to automated procedures, a simple backroom
technique for flushing out minor misspellings of attribute values is to just
sort the distinct values of the attribute and look down the list. Spellings
that differ by a single character usually will sort next to each other and can
be found with a visual scan of the list. This supplemental manager’s qual-
ity assurance check should be performed occasionally to monitor data
quality. Data access interfaces and reports rely on the precise contents of
the dimension attributes. SQL will happily produce another line in a report
if the attribute value varies in any way based on trivial punctuation or
spelling differences. We also should ensure that the attribute values are
completely populated because missing values easily cause misinterpreta-
tions. Incomplete or poorly administered textual dimension attributes lead
to incomplete or poorly produced reports.
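The sort-and-scan technique can be partially automated. This sketch flags adjacent distinct values that differ by a single character; it only catches same-length, single-substitution misspellings, and the attribute values are hypothetical:

```python
def suspicious_pairs(values):
    """Sort the distinct attribute values and flag adjacent entries that
    differ by exactly one character -- likely misspellings of each other."""
    distinct = sorted(set(values))
    pairs = []
    for a, b in zip(distinct, distinct[1:]):
        if len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1:
            pairs.append((a, b))
    return pairs

flavors = ["Chicken", "Chicken", "Chickon", "Beef", "Tuna"]
print(suspicious_pairs(flavors))  # [('Chicken', 'Chickon')]
```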
Document the product attribute definitions, interpretations, and origins in
the data warehouse’s metadata. Remember that the metadata is analogous
to the data warehouse encyclopedia. We must be vigilant about populating
and maintaining the metadata.
Customer Ship-To Dimension
The customer ship-to dimension contains one row for each discrete location to
which we ship a product. Customer ship-to dimension tables can range from
moderately sized (thousands of rows) to extremely large (millions of rows)
depending on the nature of the business. A typical customer ship-to dimension
is shown in Figure 5.3.
Figure 5.3 Sample customer ship-to dimension.

[Figure 5.3 depicts an Order Transaction Fact table with foreign keys to the Order Date, Requested Ship Date, Product, Customer Ship To, Sales Rep, and Deal dimensions; degenerate dimensions Order Number (DD) and Order Line Number (DD); and the facts Order Quantity, Gross Order Dollar Amount, Order Deal Discount Dollar Amount, and Net Order Dollar Amount. The Customer Ship To dimension contains Customer Ship To Key (PK), Customer Ship To ID (natural key), and attributes for ship-to name, address, city, state, ZIP + 4, ZIP, ZIP region, and ZIP sectional center; customer bill-to name and bill-to address attributes; customer organization name, corporate parent name, and credit rating; and assigned sales rep name, sales rep team name, sales district, and sales region.]
Several separate and independent hierarchies typically coexist in a customer
ship-to dimension. The natural geographic hierarchy is clearly defined by the
ship-to location. Since the ship-to location is a point in space, any number of
geographic hierarchies may be defined by nesting ever-larger geographic enti-
ties around the point. In the United States, the usual geographic hierarchy is
city, county, and state. The U.S. ZIP code identifies a secondary geographic
breakdown. The first digit of the ZIP code identifies a geographic region of the
United States (for example, 0 for the Northeast and 9 for certain western states),
whereas the first three digits of the ZIP code identify a mailing sectional center.
Another common hierarchy is the customer’s organizational hierarchy, assum-
ing that the customer is a corporate entity. For each customer ship-to, we
might have a customer bill-to and customer corporation. For every base-level
row in the customer ship-to dimension, both the physical geography and the
customer organizational affiliation are well defined, even though the hierar-
chies roll up differently.
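Because the region and sectional center are simple prefixes of the ZIP code, deriving this secondary geographic breakdown during staging is straightforward; a small Python sketch:

```python
def zip_rollup(zip_code):
    """Derive the secondary geographic breakdown from a U.S. ZIP code:
    the first digit gives the region, the first three digits give the
    mailing sectional center."""
    return {
        "zip": zip_code,
        "sectional_center": zip_code[:3],
        "region": zip_code[:1],
    }

print(zip_rollup("94025"))
# {'zip': '94025', 'sectional_center': '940', 'region': '9'}
```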
It is natural and common, especially for customer-oriented dimensions, for a dimen-
sion to simultaneously support multiple independent hierarchies. The hierarchies
may have different numbers of levels. Drilling up and drilling down within each of
these hierarchies must be supported in a data warehouse.
The alert reader may have a concern with the implied assumption that multi-
ple ship-tos roll up to a single bill-to in a many-to-one relationship. The real
world is rarely quite this clean and simple. There are always a few exceptions
involving ship-tos that are associated with more than one bill-to. Obviously,
this breaks the simple hierarchical relationship that we have assumed in the
earlier denormalized customer ship-to dimension. If this is a rare occurrence,
it would be reasonable to generalize the customer ship-to dimension so that
the grain of the dimension is each unique ship-to/bill-to combination. If there
are two sets of bill-to information associated with a given ship-to location,
then there would be two rows in the dimension, one for each combination. On
the other hand, if many of the ship-tos are associated with many bill-tos in a
robust many-to-many relationship, then ship-to and bill-to probably need to
be handled as separate dimensions that are linked together by the fact table.
This is the designer’s prerogative. With either approach, exactly the same
information is preserved at the fact table order line-item level. We’ll spend
more time on customer organizational hierarchies, including the handling of
recursive customer parent-child relationships, in Chapter 6.
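The ship-to/bill-to combination grain can be sketched as follows, with one surrogate key issued per unique combination; the location and bill-to names are hypothetical:

```python
# Source rows observed during extract (ship-to, bill-to), hypothetical data.
shipments = [
    ("Ship-To 1", "Bill-To A"),
    ("Ship-To 2", "Bill-To A"),
    ("Ship-To 2", "Bill-To B"),  # exception: second bill-to for one ship-to
    ("Ship-To 2", "Bill-To A"),
]

# One dimension row (and surrogate key) per unique ship-to/bill-to combination.
dimension = {}
for combo in shipments:
    if combo not in dimension:
        dimension[combo] = len(dimension) + 1

print(dimension)
# {('Ship-To 1', 'Bill-To A'): 1, ('Ship-To 2', 'Bill-To A'): 2,
#  ('Ship-To 2', 'Bill-To B'): 3}
```

Ship-To 2 appears twice in the dimension, once for each of its bill-tos, so no information is lost despite the exception to the hierarchy.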
Another potential independent hierarchy in the customer ship-to dimension
might be the manufacturer’s sales organization. Designers sometimes ques-
tion whether sales organization attributes should be modeled as a separate
dimension or the attributes just should be added to the existing customer
dimension. Similar to the preceding discussion about bill-tos, the designer
should use his or her judgment. If sales reps are highly correlated with cus-
tomer ship-tos in a one-to-one or many-to-one relationship, combining the
sales organization attributes with the customer ship-to dimension is a viable
approach. The resulting dimension is only about as big as the larger of the two
dimensions. The relationships between sales teams and customers can be
browsed efficiently in the single dimension without traversing the fact table.
However, sometimes the relationship between sales organization and customer ship-to is more complicated. The following factors must be taken into consideration:
The one-to-one or many-to-one relationship may turn out to be a many-to-
many relationship. As we discussed earlier, if the many-to-many relation-
ship is an exceptional condition, then we may still be tempted to combine
the sales rep attributes into the ship-to dimension, knowing that we’d need
to treat these rare many-to-many occurrences by issuing another surrogate
ship-to key.
If the relationship between sales rep and customer ship-to varies over time
or under the influence of a fourth dimension such as product, then the
combined dimension is in reality some kind of fact table itself! In this
case, we’d likely create separate dimensions for the sales rep and the cus-
tomer ship-to.
If the sales rep and customer ship-to dimensions participate independently
in other business process fact tables, we’d likely keep the dimensions
separate. Creating a single customer ship-to dimension with sales rep
attributes exclusively around orders data may make some of the other
processes and relationships difficult to express.
When entities have a fixed, time-invariant, strongly correlated relationship,
they obviously should be modeled as a single dimension. In most other cases,
your design likely will be simpler and more manageable when you separate
the entities into two dimensions (while remembering the general guidelines
concerning too many dimensions). If you’ve already identified 25 dimensions
in your schema, you should give strong consideration to combining dimen-
sions, if possible.
When the dimensions are separate, some designers want to create a little table
with just the two dimension keys to show the correlation without using the fact
table. This two-dimension table is unnecessary. There is no reason to avoid the
fact table to respond to this relationship inquiry. Fact tables are incredibly effi-
cient because they contain only dimension keys and measurements. The fact
table was created specifically to represent the correlation between dimensions.
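To make the point concrete, here is a sketch using SQLite with hypothetical keys: the distinct rep/customer pairings come straight from the fact table, with no helper table required:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_fact (customer_key INT, sales_rep_key INT, order_qty INT)"
)
conn.executemany(
    "INSERT INTO order_fact VALUES (?, ?, ?)",
    [(1, 10, 5), (1, 10, 2), (2, 11, 7), (2, 10, 1)],
)

# Which sales reps have dealt with which customers? Ask the fact table directly.
pairs = conn.execute(
    "SELECT DISTINCT customer_key, sales_rep_key FROM order_fact ORDER BY 1, 2"
).fetchall()
print(pairs)  # [(1, 10), (2, 10), (2, 11)]
```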
Before we leave the topic of sales rep assignments to customers, users some-
times want the ability to analyze the complex assignment of sales reps to cus-
tomers over time, even if no order activity has occurred. In this case, we could
construct a factless fact table, as we briefly introduced in Chapter 2, to capture
the sales rep coverage. The coverage table would provide a complete map of
the historical assignments of sales reps to customers, even if some of the
assignments never resulted in a sale. As we’ll learn in Chapter 13, we’d likely
include effective and expiration dates in the sales rep coverage table because
coverage assignments change over time.
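A sketch of such a factless coverage table in SQLite, with hypothetical keys and dates; note that the rows carry no numeric measures, only dimension keys plus the effective and expiration dates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Factless fact table: dimension keys and dates only, no numeric facts.
conn.execute("""CREATE TABLE rep_coverage (
    sales_rep_key INT, customer_key INT,
    effective_date TEXT, expiration_date TEXT)""")
conn.executemany("INSERT INTO rep_coverage VALUES (?, ?, ?, ?)", [
    (10, 1, "2001-01-01", "2001-06-30"),
    (11, 1, "2001-07-01", "9999-12-31"),
    (10, 2, "2001-01-01", "9999-12-31"),
])

# Who covered customer 1 on a given date, regardless of any order activity?
row = conn.execute(
    """SELECT sales_rep_key FROM rep_coverage
       WHERE customer_key = 1
         AND ? BETWEEN effective_date AND expiration_date""",
    ("2001-08-15",),
).fetchone()
print(row)  # (11,)
```

ISO-formatted date strings compare correctly as text, which keeps the effective/expiration test simple in this sketch.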
Deal Dimension
The deal dimension is similar to the promotion dimension from Chapter 2. The
deal dimension describes the incentives that have been offered to the customer
that theoretically affect the customer's desire to purchase products. This
dimension is also sometimes referred to as the contract. As shown in Figure 5.4,
the deal dimension describes the full combination of terms, allowances, and
incentives that pertain to the particular order line item.
The same issues that we faced in the retail promotion dimension also arise
with this deal dimension. If the terms, allowances, and incentives are usefully
correlated, then it makes sense to package them into a single deal dimension.
If the terms, allowances, and incentives are quite uncorrelated and we find
ourselves generating the Cartesian product of these factors in the dimension,
then it probably makes sense to split such a deal dimension into its separate
components. Once again, this is not an issue of gaining or losing information,
since the database contains the same information in both cases, but rather the
issues of user convenience and administrative complexity determine whether
to represent these deal factors as multiple dimensions. In a very large fact
table, with tens of millions or hundreds of millions of rows, the desire to
reduce the number of keys in the fact table composite key would favor keep-
ing the deal dimension as a single dimension. Certainly any deal dimension
smaller than 100,000 rows would be tractable in this design.
Figure 5.4 Sample deal dimension.

[Figure 5.4 depicts the same Order Transaction Fact table joined to the Order Date, Requested Ship Date, Product, Customer Ship To, Sales Rep, and Deal dimensions. The Deal dimension contains Deal Key (PK), Deal Description, Deal Terms Description, Deal Terms Type Description, Allowance Description, Allowance Type Description, Special Incentive Description, and Special Incentive Type Description.]
Degenerate Dimension for Order Number
Each line item row in the orders fact table includes the order number as a
degenerate dimension, as we introduced in Chapter 2. Unlike a transactional
parent-child database, the order number in our dimensional models is not tied
to an order header table. We have stripped all the interesting details from the
order header into separate dimensions such as the order date, customer ship-to,
and other interesting fields. The order number is still useful because it allows us
to group the separate line items on the order. It enables us to answer such ques-
tions as the average number of line items on an order. In addition, the order
number is used occasionally to link the data warehouse back to the operational
world. Since the order number is left sitting by itself in the fact table without
joining to a dimension table, it is referred to as a degenerate dimension.
Degenerate dimensions typically are reserved for operational transaction identifiers.
They should not be used as an excuse to stick a cryptic code in the fact table without
joining to a descriptive decode in a dimension table.
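For example, the average number of line items per order falls out of a simple grouping on the degenerate order number; a SQLite sketch with hypothetical columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_fact (order_number TEXT, order_line INT, qty INT)"
)
conn.executemany("INSERT INTO order_fact VALUES (?, ?, ?)", [
    ("ORD-1", 1, 5), ("ORD-1", 2, 3), ("ORD-1", 3, 1),
    ("ORD-2", 1, 2),
])

# Average number of line items per order, grouped on the degenerate dimension.
avg_lines = conn.execute("""
    SELECT AVG(line_count) FROM
      (SELECT COUNT(*) AS line_count FROM order_fact GROUP BY order_number)
""").fetchone()[0]
print(avg_lines)  # 2.0
```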
If the designer decides that certain data elements actually do belong to the
order itself and do not usefully fall into another natural business dimension,
then order number is no longer a degenerate dimension but rather is a normal
dimension with its own surrogate key and attribute columns. However,
designers with a strong parent-child background should resist the urge simply
to lump the traditional order header information into an order dimension. In
almost all cases, the header information belongs in other analytic dimensions
rather than merely being dumped into a dimension that closely resembles the
transaction order header table.
Junk Dimensions
When we’re confronted with a complex operational data source, we typically
perform triage to quickly identify fields that are obviously related to dimen-
sions, such as date stamps or attributes. We then identify the numeric mea-
surements in the source data. At this point, we are often left with a number of
miscellaneous indicators and flags, each of which takes on a small range of discrete values. The designer is faced with several rather unappealing options:
Leave the flags and indicators unchanged in the fact table row. This could
cause the fact table row to swell alarmingly. It would be a shame to create a
nice tight dimensional design with five dimensions and five facts and then
leave a handful of uncompressed textual indicator columns in the row.
Make each flag and indicator into its own separate dimension. Doing so
could cause our 5-dimension design to balloon into a 25-dimension design.
Strip out all the flags and indicators from the design. Of course, we ask the
obligatory question about removing these miscellaneous flags because they
seem rather insignificant, but this notion is often vetoed quickly because
someone might need them. It is worthwhile to examine this question care-
fully. If the indicators are incomprehensible, noisy, inconsistently popu-
lated, or only of operational significance, they should be left out.
An appropriate approach for tackling these flags and indicators is to study
them carefully and then pack them into one or more junk dimensions. You can
envision the junk dimension as being akin to the junk drawer in your kitchen.
The kitchen junk drawer is a dumping ground for miscellaneous household
items, such as rubber bands, paper clips, batteries, and tape. While it may be
easier to locate the rubber bands if we dedicated a separate kitchen drawer to
them, we don’t have adequate storage capacity to do so. Besides, we don’t
have enough stray rubber bands, nor do we need them very frequently, to war-
rant the allocation of a single-purpose storage space. The junk drawer pro-
vides us with satisfactory access while still retaining enough kitchen storage
for the more critical and frequently accessed dishes and silverware.
A junk dimension is a convenient grouping of typically low-cardinality flags and indi-
cators. By creating an abstract dimension, we remove the flags from the fact table
while placing them into a useful dimensional framework.
A simple example of a useful junk dimension would be to remove 10 two-value
indicators, such as the cash versus credit payment type, from the order fact table
and place them into a single dimension. At the worst, you would have 1,024 (2¹⁰)
rows in this junk dimension. It probably isn’t very interesting to browse among
these flags within the dimension because every flag occurs with every other flag
if the database is large enough. However, the junk dimension is a useful holding
place for constraining or reporting on these flags. Obviously, the 10 foreign keys
in the fact table would be replaced with a single small surrogate key.
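Pre-building every combination of the ten indicators can be sketched with itertools.product; the flag names and values here are placeholders:

```python
from itertools import product

# Ten hypothetical two-valued order indicators (names are illustrative).
indicators = {f"flag_{i}": ("Yes", "No") for i in range(10)}

# Pre-building every combination yields exactly 2**10 = 1,024 rows.
junk_rows = [
    dict(zip(indicators, combo), junk_key=key)
    for key, combo in enumerate(product(*indicators.values()), start=1)
]
print(len(junk_rows))  # 1024
```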
On the other hand, if you have highly uncorrelated attributes that take on
more numerous values, then it may not make sense to lump them together into
a single junk dimension. Unfortunately, the decision is not entirely formulaic.
If you have five indicators that each take on only three values, the single junk
dimension is the best route for these attributes because the dimension has only
243 (3⁵) possible rows. However, if the five uncorrelated indicators each have
100 possible values, we’d suggest the creation of separate dimensions because
you now have 100 million (100⁵) possible combinations.
Figure 5.5 Sample rows of an order indicator junk dimension, with columns such as Payment Type, Commission Credit, and Order Type.
We’ve illustrated sample rows from an order indicator dimension in Figure 5.5. A
subtle issue regarding junk dimensions is whether you create rows for all the
combinations beforehand or create junk dimension rows for the combinations as
you actually encounter them in the data. The answer depends on how many pos-
sible combinations you expect and what the maximum number could be. Gener-
ally, when the number of theoretical combinations is very high and you don’t
think you will encounter them all, you should build a junk dimension row at
extract time whenever you encounter a new combination of flags or indicators.
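Building rows only as combinations are encountered can be sketched as a lazy key assignment; the two indicator fields are hypothetical:

```python
# Sketch: assign junk dimension keys lazily, only for combinations that
# actually appear in the extracted order rows.
junk_keys = {}  # (payment_type, order_type) -> surrogate key

def junk_key_for(payment_type, order_type):
    combo = (payment_type, order_type)
    if combo not in junk_keys:          # new combination seen at extract time
        junk_keys[combo] = len(junk_keys) + 1
    return junk_keys[combo]

print(junk_key_for("Cash", "Standard"))  # 1
print(junk_key_for("Credit", "Rush"))    # 2
print(junk_key_for("Cash", "Standard"))  # 1  (reuses the existing row)
```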
Another interesting application of the junk dimension technique is to use it to
handle the infrequently populated, open-ended comments field sometimes
attached to a fact row. Optimally, the comments have been parameterized in a
dimension so that they can be used for robust analysis. Even if this is not the
case, users still may feel that the comments field is meaningful enough to
include in the data warehouse. In this case, a junk dimension simply contains
all the distinct comments. The junk dimension is noticeably smaller than the
fact table because the comments are relatively rare. Of course, you will need a
special surrogate key that points to the “No Comment” row in the dimension
because most of your fact table rows will use this key.
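A sketch of the comment handling, assuming a reserved key of 0 for the "No Comment" row (the specific key value is an arbitrary choice for this example):

```python
NO_COMMENT_KEY = 0  # special surrogate key used by most fact rows

comment_dim = {}  # distinct comment text -> surrogate key

def comment_key_for(comment):
    """Return the junk dimension key for a comment, adding a row for any
    comment text not seen before; empty comments map to the reserved key."""
    if not comment:
        return NO_COMMENT_KEY
    if comment not in comment_dim:
        comment_dim[comment] = len(comment_dim) + 1
    return comment_dim[comment]

print(comment_key_for(None))             # 0
print(comment_key_for("Rush delivery"))  # 1
print(comment_key_for("Rush delivery"))  # 1  (distinct comments stored once)
```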
Multiple Currencies
Suppose that we are tracking the orders of a large multinational California-
based company with sales offices around the world. We may be capturing
order transactions in more than 15 different currencies. We certainly wouldn’t
want to include columns in the fact table for each currency because theoreti-
cally there are an open-ended number of currencies.
The most obvious requirement is that order transactions be expressed in both
local currency and the standardized corporate currency, such as U.S. dollars in
this example. To satisfy this need, we would replace each underlying order
fact with a pair of facts, one for the applicable local currency and another for
the equivalent standard corporate currency. This would allow all transactions
to easily roll up to the corporate currency without complicated application
coding. We’d also supplement the fact table with an additional currency
dimension to identify the currency type associated with the local-currency
facts. A currency dimension is needed even if the location of the transaction is
otherwise known because the location does not necessarily guarantee which
currency was used.
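The pairing of local-currency and corporate-currency facts can be sketched as follows; the exchange rates here are illustrative only, and a real implementation would source them from a rate table keyed by transaction date:

```python
# Hypothetical exchange rates to the corporate currency (U.S. dollars).
usd_rates = {"EUR": 1.10, "JPY": 0.0085, "USD": 1.0}

def order_fact_row(local_amount, currency_code):
    """Store each order fact twice: once in local currency, once in USD,
    with a currency key identifying the local currency."""
    return {
        "local_order_amount": local_amount,
        "currency_key": currency_code,  # points to the currency dimension
        "usd_order_amount": round(local_amount * usd_rates[currency_code], 2),
    }

print(order_fact_row(200.0, "EUR"))
# {'local_order_amount': 200.0, 'currency_key': 'EUR',
#  'usd_order_amount': 220.0}
```

Because every row carries the standardized USD amount, transactions roll up to the corporate total with a plain SUM, while the local amounts remain available for in-country reporting.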
However, you may