Packtpub.Pentaho.3.2.Data.Integration.Beginners.Guide.Apr.2010


Cover
Copyright
Credits
Foreword
The Kettle Project
About the Author
About the Reviewers
Table of Contents
Preface
Chapter 1: Getting started with Pentaho Data Integration
- Pentaho Data Integration and Pentaho BI Suite
  - Exploring the Pentaho Demo
- Pentaho Data Integration
  - Using PDI in real world scenarios
- Installing PDI
- Time for action – installing PDI
- Launching the PDI graphical designer: Spoon
- Time for action – starting and customizing Spoon
  - Spoon
    - Setting preferences in the Options window
    - Storing transformations and jobs in a repository
  - Creating your first transformation
- Time for action – creating a hello world transformation
  - Directing the Kettle engine with transformations
    - Exploring the Spoon interface
    - Running and previewing the transformation
- Time for action – running and previewing the hello_world transformation
- Installing MySQL
- Time for action – installing MySQL on Windows
- Time for action – installing MySQL on Ubuntu
- Summary
Chapter 2: Getting Started with Transformations
- Reading data from files
- Time for action – reading results of football matches from files
  - Input files
    - Input steps
  - Reading several files at once
- Time for action – reading all your files at a time using a single Text file input step
- Time for action – reading all your files at a time using a single Text file input step and regular expressions
  - Regular expressions
  - Grids
- Sending data to files
- Time for action – sending the results of matches to a plain file
- Getting system information
- Time for action – updating a file with news about examinations
- Time for action – running the examination transformation from a terminal window
- XML files
- Time for action – getting data from an XML file with information about countries
- Summary
Chapter 3: Basic data manipulation
- Basic calculations
- Time for action – reviewing examinations by using the Calculator step
  - Adding or modifying fields by using different PDI steps
    - The Calculator step
    - The Formula step
- Time for action – reviewing examinations by using the Formula step
- Calculations on groups of rows
- Time for action – calculating World Cup statistics by grouping data
  - Group by step
- Filtering
- Time for action – counting frequent words by filtering
  - Filtering rows using the Filter rows step
- Looking up data
- Time for action – finding out which language people speak
  - The Stream lookup step
- Summary
Chapter 4: Controlling the Flow of Data
- Splitting streams
- Time for action – browsing new PDI features by copying a dataset
  - Copying rows
  - Distributing rows
- Time for action – assigning tasks by distributing
- Splitting the stream based on conditions
- Time for action – assigning tasks by filtering priorities with the Filter rows step
  - PDI steps for splitting the stream based on conditions
- Time for action – assigning tasks by filtering priorities with the Switch/Case step
- Merging streams
- Time for action – gathering progress and merging all together
  - PDI options for merging streams
- Time for action – giving priority to Bouchard by using Append Stream
- Summary
Chapter 5: Transforming Your Data with JavaScript Code and the JavaScript Step
- Doing simple tasks with the JavaScript step
- Time for action – calculating scores with JavaScript
- Time for action – testing the calculation of averages
  - Testing the script using the Test script button
- Enriching the code
- Time for action – calculating flexible scores by using variables
- Reading and parsing unstructured files
- Time for action – changing a list of house descriptions with JavaScript
  - Looking at previous rows
- Avoiding coding by using purpose-built steps
- Summary
Chapter 6: Transforming the Row Set
- Converting rows to columns
- Time for action – enhancing a films file by converting rows to columns
  - Converting row data to column data by using the Row denormalizer step
  - Aggregating data with a Row denormalizer step
- Time for action – calculating total scores by performances by country
  - Using Row denormalizer for aggregating data
- Normalizing data
- Time for action – enhancing the matches file by normalizing the dataset
  - Modifying the dataset with a Row Normalizer step
  - Summarizing the PDI steps that operate on sets of rows
- Generating a custom time dimension dataset by using Kettle variables
- Time for action – creating the time dimension dataset
  - Getting variables
- Time for action – getting variables for setting the default starting date
  - Using the Get Variables step
- Summary
Chapter 7: Validating Data and Handling Errors
- Capturing errors
- Time for action – capturing errors while calculating the age of a film
  - Using PDI error handling functionality
  - Aborting a transformation
- Time for action – aborting when there are too many errors
  - Aborting a transformation using the Abort step
  - Fixing captured errors
- Time for action – treating errors that may appear
  - Treating rows coming to the error stream
- Avoiding unexpected errors by validating data
- Time for action – validating genres with a Regex Evaluation step
  - Validating data
- Time for action – checking films file with the Data Validator
  - Defining simple validation rules using the Data Validator
  - Cleansing data
- Summary
Chapter 8: Working with Databases
- Introducing the Steel Wheels sample database
  - Connecting to the Steel Wheels database
- Time for action – creating a connection with the Steel Wheels database
  - Connecting with Relational Database Management Systems
  - Exploring the Steel Wheels database
- Time for action – exploring the sample database
  - A brief word about SQL
    - Exploring any configured database with the PDI Database explorer
- Querying a database
- Time for action – getting data about shipped orders
  - Getting data from the database with the Table input step
  - Using the SELECT statement for generating a new dataset
    - Making flexible queries by using parameters
- Time for action – getting orders in a range of dates by using parameters
  - Making flexible queries by using Kettle variables
- Time for action – getting orders in a range of dates by using variables
- Sending data to a database
- Time for action – loading a table with a list of manufacturers
  - Inserting new data into a database table with the Table output step
  - Inserting or updating data by using other PDI steps
- Time for action – inserting new products or updating existent ones
- Time for action – testing the update of existing products
  - Inserting or updating data with the Insert/Update step
- Eliminating data from a database
- Time for action – deleting data about discontinued items
  - Deleting records of a database table with the Delete step
- Summary
Chapter 9: Performing Advanced Operations with Databases
- Preparing the environment
- Time for action – populating the Jigsaw database
  - Exploring the Jigsaw database model
- Looking up data in a database
  - Doing simple lookups
- Time for action – using a Database lookup step to create a list of products to buy
  - Looking up values in a database with the Database lookup step
  - Doing complex lookups
- Time for action – using a Database join step to create a list of suggested products to buy
  - Joining data from the database to the stream data by using a Database join step
- Introducing dimensional modeling
- Loading dimensions with data
- Time for action – loading a region dimension with a Combination lookup/update step
- Time for action – testing the transformation that loads the region dimension
  - Describing data with dimensions
    - Loading Type I SCD with a Combination lookup/update step
  - Keeping a history of changes
- Time for action – keeping a history of product changes with the Dimension lookup/update step
- Time for action – testing the transformation that keeps a history of product changes
  - Keeping an entire history of data with a Type II slowly changing dimension
  - Loading Type II SCDs with the Dimension lookup/update step
- Summary
Chapter 10: Creating Basic Task Flows
- Introducing PDI jobs
- Time for action – creating a simple hello world job
  - Executing processes with PDI jobs
    - Using Spoon to design and run jobs
  - Using the transformation job entry
- Receiving arguments and parameters in a job
- Time for action – customizing the hello world file with arguments and parameters
  - Using named parameters in jobs
- Running jobs from a terminal window
- Time for action – executing the hello world job from a terminal window
- Using named parameters and command-line arguments in transformations
- Time for action – calling the hello world transformation with fixed arguments and parameters
- Deciding between the use of a command-line argument and a named parameter
- Running job entries under conditions
- Time for action – sending a sales report and warning the administrator if something is wrong
  - Changing the flow of execution on the basis of conditions
  - Creating and using a file results list
- Summary
Chapter 11: Creating Advanced Transformations and Jobs
- Enhancing your processes with the use of variables
- Time for action – updating a file with news about examinations by setting a variable with the name of the file
  - Setting variables inside a transformation
- Enhancing the design of your processes
- Time for action – generating files with top scores
  - Reusing part of your transformations
- Time for action – calculating the top scores with a subtransformation
  - Creating and using subtransformations
  - Creating a job as a process flow
- Time for action – splitting the generation of top scores by copying and getting rows
  - Transferring data between transformations by using the copy/get rows mechanism
  - Nesting jobs
- Time for action – generating the files with top scores by nesting jobs
  - Running a job inside another job with a job entry
    - Understanding the scope of variables
- Iterating jobs and transformations
- Time for action – generating custom files by executing a transformation for every input row
  - Executing for each row
- Summary
Chapter 12: Developing and Implementing a Simple Datamart
- Exploring the sales datamart
  - Deciding the level of granularity
- Loading the dimensions
- Time for action – loading dimensions for the sales datamart
- Extending the sales datamart model
- Loading a fact table with aggregated data
- Time for action – loading the sales fact table by looking up dimensions
  - Getting the information from the source with SQL queries
  - Translating the business keys into surrogate keys
- Getting facts and dimensions together
- Time for action – loading the fact table using a range of dates obtained from the command line
- Time for action – loading the sales star
- Getting rid of administrative tasks
- Time for action – automating the loading of the sales datamart
- Summary
Chapter 13: Taking it Further
- PDI best practices
- Getting the most out of PDI
- Integrating PDI and the Pentaho BI suite
- PDI Enterprise Edition and Kettle Developer Support
- Summary
Appendix A: Working with Repositories
- Creating a repository
- Time for action – creating a PDI repository
  - Creating repositories to store your transformations and jobs
- Working with the repository storage system
- Time for action – logging into a repository
- Examining and modifying the contents of a repository with the Repository explorer
- Migrating from a file-based system to a repository-based system and vice-versa
- Summary
Appendix B: Pan and Kitchen: Launching Transformations and Jobs from the Command Line
- Running transformations and jobs stored in files
- Running transformations and jobs from a repository
  - Specifying command line options
- Checking the exit code
- Providing options when running Pan and Kitchen
Appendix C: Quick Reference: Steps and Job Entries
- Transformation steps
- Job entries
Appendix D: Spoon Shortcuts
- General shortcuts
- Designing transformations and jobs
- Grids
- Repositories
Appendix E: Introducing PDI 4 Features
- Agile BI
- Visual improvements for designing transformations and jobs
  - Experiencing the mouse-over assistance
- Time for action – creating a hop with the mouse-over assistance
- Enterprise features
- Summary
Appendix F: Pop Quiz Answers
- Chapter 1
- Chapter 2
  - formatting data
- Chapter 3
  - concatenating strings
- Chapter 4
  - data movement (copying and distributing)
  - splitting a stream
- Chapter 5
  - finding the seven errors
- Chapter 6
  - using Kettle variables inside transformations
- Chapter 7
  - PDI error handling
- Chapter 8
- Chapter 9
  - loading slowly changing dimensions
  - loading type III slowly changing dimensions
- Chapter 10
  - defining PDI jobs
- Chapter 11
  - using the Add sequence step
  - deciding the scope of variables
- Chapter 12
  - modifying a star model and loading the star with PDI
- Chapter 13
  - remote execution and clustering
Index

Pentaho 3.2 Data Integration

Beginner's Guide

Explore, transform, validate, and integrate your data with ease

María Carina Roldán

BIRMINGHAM - MUMBAI

or transmied in any form or by any means, without the prior wrien permission of the

publisher, except in the case of brief quotaons embedded in crical arcles or reviews.

Every eort has been made in the preparaon of this book to ensure the accuracy of the

informaon presented. However, the informaon contained in this book is sold without

warranty, either express or implied. Neither the author, Packt Publishing, nor its dealers or

distributors will be held liable for any damages caused or alleged to be caused directly or

indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: April 2010

Producon Reference: 1050410

Published by Packt Publishing Ltd.

32 Lincoln Road

Olton

Birmingham, B27 6PA, UK.

ISBN 978-1-847199-54-6

www.packtpub.com

Cover Image by Parag Kadam (paragvkadam@gmail.com)

Credits

Author

María Carina Roldán

Reviewers

Jens Bleuel

Roland Bouman

Ma Casters

James Dixon

Will Gorman

Gretchen Moran

Acquision Editor

Usha Iyer

Development Editor

Reshma Sundaresan

Technical Editors

Gaurav Datar

Rukhsana Khambaa

Copy Editor

Sanchari Mukherjee

Editorial Team Leader

Gagandeep Singh

Project Team Leader

Lata Basantani

Project Coordinator

Poorvi Nair

Proofreader

Sandra Hopper

Indexer

Rekha Nair

Graphics

Geetanjali Sawant

Producon Coordinator

Shantanu Zagade

Cover Work

Shantanu Zagade

Foreword

If we look back at what has happened in the data integration market over the last 10 years we can see a lot of change. In the first half of that decade there was an explosion in the number of data integration tools and in the second half there was a big wave of consolidations. This consolidation wave put an ever growing amount of data integration power in the hands of only a few large billion dollar companies. For any person, company or project in need of data integration, this meant either paying large amounts of money or doing hand-coding of their solution.

During that exact same period, we saw web servers, programming languages, operating systems, and even relational databases turn into a commodity in the ICT market place. This was driven among other things by the availability of open source software such as Apache, GNU, Linux, MySQL, and many others. For the ICT market, this meant that more services could be deployed at a lower cost. If you look closely at what has been going on in those last 10 years, you will notice that most companies increasingly deployed more ICT services to end-users. These services get more and more connected over an ever growing network. Pretty much anything ranging from tiny mobile devices to huge cloud-based infrastructure is being deployed and all those can contain data that is valuable to an organization.

The job of any person that needs to integrate all this data is not easy. Complexity of information services technology usually increases exponentially with the number of systems involved. Because of this, integrating all these systems can be a daunting and scary task that is never complete. Any piece of code lives in what can be described as a software ecosystem that is always in a state of flux. Like in nature, certain ecosystems evolve extremely fast where others change very slowly over time. However, like in nature all ICT systems change.

What is needed is another wave of commodification in the area of data integration and business intelligence in general. This is where Pentaho comes in.

Pentaho tries to provide answers to these problems by making the integration software available as open source, accessible, easy to use, and easy to maintain for users and developers alike. With every release of our software we try to make things easier, better, and faster. However, even if things can be done with nice user interfaces, there is still a huge number of possibilities and options to choose from.

As the founder of the project I've always liked the fact that Kettle users had a lot of choice. Choice translates into creativity, and creativity often delivers good solutions that are comfortable to the person implementing them. However, this choice can be daunting to any beginning Kettle developer. With thousands of options to choose from, it can be very hard to get started.

This is above all others the reason why I'm very happy to see this book come to life. It will be a great and indispensable help for everyone that is taking steps into the wonderful world of data integration with Kettle. As such, I hope you see this book as an open invitation to get started with Kettle in the wonderful world of data integration.

Ma Casters

Chief Data Integraon at Pentaho

Kele founder

The Kettle Project

Whether there is a migraon to do, an ETL process to run, or a need for massively loading

data into a database, you have several soware tools, ranging from expensive and

sophiscated to free open source and friendly ones, which help you accomplish the task.

Ten years ago, the scenario was clearly different. By 2000, Matt Casters, a Belgian business intelligence consultant, had been working for a while as a data warehouse architect and administrator. As such, he was one of quite a number of people who, no matter if the company they worked for was big or small, had to deal with the difficulties involved in bridging the gap between information technology and business needs. What made it even worse at that time was that ETL tools were prohibitively expensive and everything had to be crafted by hand. The last employer he worked for didn't think that writing a new ETL tool would be a good idea. This was one of the motivations for Matt to become an independent contractor and to start his own company. That was in June 2001.

At the end of that year, he told his wife that he was going to write a new piece of software for himself to do ETL tasks. It was going to take up some time left and right in the evenings and weekends. Surprised, she asked how long it would take him to get it done. He replied that it would probably take five years and that he perhaps would have something working in three.

Working on that started in early 2003. Matt's main goals for writing the software included learning about databases, ETL processes, and data warehousing. This would in turn improve his chances on a job market that was pretty volatile. Ultimately, it would allow him to work full time on the software.

Another important goal was to understand what the tool had to do. Matt wanted a scalable and parallel tool, and wanted to isolate rows of data as much as possible.

The last but not least goal was to pick the right technology that would support the tool. The first idea was to build it on top of KDE, the popular Unix desktop environment. Trolltech, the people behind Qt, the core UI library of KDE, had released database plans to create drivers for popular databases. However, the lack of decent drivers for those databases drove Matt to change plans and use Java. He picked Java because he had some prior experience as he had written a Japanese Chess (Shogi) database program when Java 1.0 was released. To Sun's credit, this software still runs and is available at http://ibridge.be/shogi/.

Aer a year of development, the tool was capable of reading text les, reading from

databases, wring to databases and it was very exible. The experience with Java was not

100% posive though. The code had grown unstructured, crashes occurred all too oen, and

it was hard to get something going with the Java graphic library used at that moment, the

Abstract Window Toolkit (AWT); it looked bad and it was slow.

As for the library, Matt decided to start using the newly released Standard Widget Toolkit (SWT), which helped solve part of the problem. As for the rest, Kettle was a complete mess. It was time to ask for help. The help came from Wim De Clercq, a senior enterprise Java architect, co-owner of Ixor (www.ixor.be) and also a friend of Matt. At various intervals over the next few years, Wim involved himself in the project, giving advice to Matt about good practices in Java programming. Listening to that advice meant performing massive amounts of code changes. As a consequence, it was not unusual to spend weekends doing nothing but refactoring code and fixing thousands of errors because of that. But, bit by bit, things kept going in the right direction.

At that same me, Ma also showed the results to his peers, colleagues, and other senior

BI consultants to hear what they thought of Kele. That was how he got in touch with the

Flemish Trac Centre (www.verkeerscentrum.be/verkeersinfo/kaart) where billions

of rows of data had to be integrated from thousands of data sources all over Belgium. All of

a sudden, he was being paid to deploy and improve Kele to handle that job. The diversity of

test cases at the trac center helped to improve Kele dramacally. That was somewhere in

2004 and Kele was by its version 1.2.

While working at the Flemish Traffic Centre, Matt also posted messages on Javaforge (www.javaforge.com) to let people know they could download a free copy of Kettle for their own use. He got a few reactions. Despite some of them being remarkably negative, most were positive. The most interesting response came from a nice guy called Jens Bleuel in Germany who asked if it was possible to integrate third-party software into Kettle. In his specific case, he needed a connector to link Kettle with the German SAP software (www.sap.com). Kettle didn't have a plugin architecture, so Jens' question made Matt think about a plugin system, and that was the main motivation for developing version 2.0.

For various reasons including the birth of Matt's son Sam and a lot of consultancy work, it took around a year to release Kettle version 2.0. It was a fairly complete release with advanced support for slowly changing dimensions and junk dimensions (Chapter 9 explains those concepts), the ability to connect to thirteen different databases, and, most importantly, support for plugins. Matt contacted Jens to let him know the news and Jens was really interested. It was a very memorable moment for Matt and Jens as it took them only a few hours to get a new plugin going that read data from an SAP/R3 server. There was a lot of excitement, and they agreed to start promoting the sales of Kettle from the Kettle.be website and from Proratio (www.proratio.de), the company Jens worked for.

Those were days of improvements, requests, people interested in the project. However, it became too much to handle. Doing development and sales all by themselves was no fun after a while. As such, Matt thought about open sourcing Kettle early in 2005 and by late summer he made his decision. Jens and Proratio didn't mind and the decision was final.

When they nally open sourced Kele on December 2005, the response was massive. The

downloadable package put up on Javaforge got downloaded around 35000 mes during rst

week only. The news got spread all over the world prey quickly.

What followed was a ood of messages, both private and on the forum. At its peak in March

2006, Ma got over 300 messages a day concerning Kele.

In no me, he was answering quesons like crazy, allowing people to join the development

team and working as a consultant at the same me. Added to this, the birth of his daughter

Hannelore in February 2006 was too much to deal with.

Fortunately, good mes came. While Ma was trying to handle all that, a discussion was

taking place at the Pentaho forum (http://forums.pentaho.org/) concerning the ETL

tool that Pentaho should support. They had selected Enhydra Octopus, a Java-based ETL

soware, but they didn't have a strong reliance on a specic tool.

While Jens was evaluang all sorts of open source BI packages, he came across that thread.

Ma replied immediately persuading people at Pentaho to consider including Kele. And

he must be convincing because the answer came quickly and was posive. James Dixon,

Pentaho founder and CTO, opened Kele the possibility to be the premier and only ETL

tool supported by Pentaho. Later on, Ma came in touch with one of the other Pentaho

founders, Richard Daley, who oered him a job. That allowed Ma to focus full-me on

Kele. Four years later, he's sll happily working for Pentaho as chief architect for data

integraon, doing the best eort to deliver Kele 4.0. Jens Bleuel, who collaborated with

Ma since the early versions, is now also part of the Pentaho team.

About the Author

María Carina was born in a small town in the Patagonia region in Argentina. She earned her Bachelor degree in Computer Science at UNLP in La Plata and then moved to Buenos Aires where she has lived since 1994 working in IT.

She has been working as a BI consultant for the last 10 years. At the beginning she worked with Cognos suite. However, over the last three years, she has been dedicated, full time, to developing Pentaho BI solutions both for local and several Latin-American companies, as well as for a French automotive company in the last months.

She is also an acve contributor to the Pentaho community.

At present, she lives in Buenos Aires, Argentina, with her husband Adrián and children Camila and Nicolás.

Wring my rst book in a foreign language and working on a full me job

at the same me, not to menon the upbringing of two small kids, was

denitely a big challenge. Now I can tell that it's not impossible.

I dedicate this book to my husband and kids; I'd like to thank them for all their support and tolerance over the last year. I'd also like to thank my colleagues and friends who gave me encouraging words throughout the writing process.

Special thanks to the people at Packt; working with them has been really pleasant.

I'd also like to thank the Pentaho community and developers for making Kettle the incredible tool it is. Thanks to the technical reviewers who, with their very critical eye, contributed to make this a book suited to the audience.

Finally, I'd like to thank Matt Casters who, despite his busy schedule, was willing to help me from the first moment he knew about this book.

About the Reviewers

Jens Bleuel is a Senior Consultant and Engineer at Pentaho. He is also working as a project leader, trainer, and product specialist in the services and support department. Before he joined Pentaho in mid 2007, he was a software developer and project leader, and his main business was Data Warehousing and its architecture, along with designing and developing user friendly tools. He studied business economics, attended a grammar school for electronics, and has been programming in a wide range of environments such as Assembler, C, Visual Basic, Delphi, .NET, and these days mainly in Java. His customer focus is on the wholesale market and consumer goods industries. Jens is 40 years old and lives with his wife and two boys in Mainz, Germany (near the nice Rhine river). In his spare time, he practices Tai-Chi, Qigong, and photography.

Roland Bouman has been working in the IT industry since 1998, mostly as a database and web application developer. He has also worked for MySQL AB (later Sun Microsystems) as certification developer and as curriculum developer.

Roland mainly focuses on open source web technology, databases, and Business Intelligence. He's an active member of the MySQL and Pentaho communities and can often be found speaking at worldwide conferences and events such as the MySQL user conference, the O'Reilly Open Source conference (OSCON), and at Pentaho community events.

Roland is co-author of the MySQL 5.1 Cluster DBA Certification Study Guide (Vervante, ISBN: 595352502) and Pentaho Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL (Wiley, ISBN: 978-0-470-48432-6). He also writes on a regular basis for the Dutch Database Magazine (DBM).

Roland is @rolandbouman on Twitter and maintains a blog at http://rpbouman.blogspot.com/.

Ma Casters has been an independent senior BI consultant for almost two decades. In that

period he led, designed, and implemented numerous data warehouses and BI soluons for

large and small companies. In that capacity, he always had the need for ETL in some form

or another. Almost out of pure necessity, he has been busy wring the ETL tool called Kele

(a.k.a. Pentaho Data Integraon) for the past eight years. First, he developed the tool mostly

on his own. Since the end of 2005 when Kele was declared an open source technology,

development took place with the help of a large community.

Since the Kele project was acquired by Pentaho in early 2006, he has been Chief of Data

Integraon at Pentaho as the lead architect, head of development, and spokesperson for the

Kele community.

I would like to personally thank the complete community for their help in making Kettle the success it is today. In particular, I would like to thank Maria for taking the time to write this nice book as well as the many articles on the Pentaho wiki (for example, the Kettle tutorials), and her appreciated participation on the forum. Many thanks also go to my employer Pentaho, for their large investment in open source BI in general and Kettle in particular.

James Dixon is the Chief Geek and one of the co-founders of Pentaho Corporation, the leading commercial open source Business Intelligence company. He has worked in the business intelligence market since graduating in 1992 from Southampton University with a degree in Computer Science. He has served as Software Engineer, Development Manager, Engineering VP, and CTO at multiple business intelligence software companies. He regularly uses Pentaho Data Integration for internal projects and was involved in the architectural design of PDI V3.0.

He lives in Orlando, Florida, with his wife Tami and son Samuel.

I would like to thank my co-founders, my parents, and my wife Tami for all their support and tolerance of my odd working hours.

I would like to thank my son Samuel for all the opportunities he gives me to prove I'm not as clever as I think I am.

Will Gorman is an Engineering Team Lead at Pentaho. He works on a variety of Pentaho's products, including Reporting, Analysis, Dashboards, Metadata, and the BI Server. Will started his career at GE Research and earned his Masters degree in Computer Science at Rensselaer Polytechnic Institute in Troy, New York. Will is the author of Pentaho Reporting 3.5 for Java Developers (ISBN: 3193), published by Packt Publishing.

Gretchen Moran is a graduate of University of Wisconsin – Stevens Point with a Bachelor's
degree in Computer Information Systems with a minor in Data Communications. Gretchen
began her career as a corporate data warehouse developer in the insurance industry and
joined Arbor Software/Hyperion Solutions in 1999 as a commercial developer for the
Hyperion Analyzer and Web Analytics team. Gretchen has been a key player with Pentaho
Corporation since its inception in 2004. As Community Leader and core developer, Gretchen
managed the explosive growth of Pentaho's open source community for her first 2 years
with the company. Gretchen has contributed to many of the Pentaho projects, including the
Pentaho BI Server, Pentaho Data Integration, Pentaho Metadata Editor, Pentaho Reporting,
Pentaho Charting, and others.

Thanks Doug, Anthony, Isabella and Baby Jack for giving me my favorite

challenges and crowning achievements—being a wife and mom.

Table of Contents

Preface

Chapter 1: Getting started with Pentaho Data Integration

Pentaho Data Integration and Pentaho BI Suite

Exploring the Pentaho Demo

Pentaho Data Integration

Using PDI in real world scenarios

Loading data warehouses or data marts

Integrating data

Data cleansing

Migrating information

Exporting data

Integrating PDI using Pentaho BI

Installing PDI

Time for action – installing PDI

Launching the PDI graphical designer: Spoon

Time for action – starting and customizing Spoon

Spoon

Setting preferences in the Options window

Storing transformations and jobs in a repository

Creating your first transformation

Time for action – creating a hello world transformation

Directing the Kettle engine with transformations

Exploring the Spoon interface

Running and previewing the transformation

Time for action – running and previewing the hello_world transformation

Installing MySQL

Time for action – installing MySQL on Windows

Time for action – installing MySQL on Ubuntu

Summary


Chapter 2: Getting Started with Transformations

Reading data from files

Time for action – reading results of football matches from files

Input files

Input steps

Reading several files at once

Time for action – reading all your files at a time using a single Text file input step

Time for action – reading all your files at a time using a single Text file input step and regular expressions

Regular expressions

Grids

Sending data to files

Time for action – sending the results of matches to a plain file

Output files

Output steps

Some data definitions

Rowset

Streams

The Select values step

Getting system information

Time for action – updating a file with news about examinations

Getting information by using Get System Info step

Data types

Date fields

Numeric fields

Running transformations from a terminal window

Time for action – running the examination transformation from a terminal window

XML files

Time for action – getting data from an XML file with information about countries

What is XML

PDI transformation files

Getting data from XML files

XPath

Configuring the Get data from XML step

Kettle variables

How and when you can use variables

Summary


Chapter 3: Basic data manipulation

Basic calculations

Time for action – reviewing examinations by using the Calculator step

Adding or modifying fields by using different PDI steps

The Calculator step

The Formula step

Time for action – reviewing examinations by using the Formula step

Calculations on groups of rows

Time for action – calculating World Cup statistics by grouping data

Group by step

Filtering

Time for action – counting frequent words by filtering

Filtering rows using the Filter rows step

Looking up data

Time for action – finding out which language people speak

The Stream lookup step

Summary

Chapter 4: Controlling the Flow of Data

Splitting streams

Time for action – browsing new PDI features by copying a dataset

Copying rows

Distributing rows

Time for action – assigning tasks by distributing

Splitting the stream based on conditions

Time for action – assigning tasks by filtering priorities with the Filter rows step

PDI steps for splitting the stream based on conditions

Time for action – assigning tasks by filtering priorities with the Switch/Case step

Merging streams

Time for action – gathering progress and merging all together

PDI options for merging streams

Time for action – giving priority to Bouchard by using Append Stream

Summary

Chapter 5: Transforming Your Data with JavaScript Code and the JavaScript Step

Doing simple tasks with the JavaScript step

Time for action – calculating scores with JavaScript

Using the JavaScript language in PDI

Inserting JavaScript code using the Modified Java Script Value step

Adding fields


Modifying fields

Turning on the compatibility switch

Testing your code

Time for action – testing the calculation of averages

Testing the script using the Test script button

Enriching the code

Time for action – calculating flexible scores by using variables

Using named parameters

Using the special Start, Main, and End scripts

Using transformation predefined constants

Reading and parsing unstructured files

Time for action – changing a list of house descriptions with JavaScript

Looking at previous rows

Avoiding coding by using purpose-built steps

Summary

Chapter 6: Transforming the Row Set

Converting rows to columns

Time for action – enhancing a films file by converting rows to columns

Converting row data to column data by using the Row denormalizer step

Aggregating data with a Row denormalizer step

Time for action – calculating total scores by performances by country

Using Row denormalizer for aggregating data

Normalizing data

Time for action – enhancing the matches file by normalizing the dataset

Modifying the dataset with a Row Normalizer step

Summarizing the PDI steps that operate on sets of rows

Generating a custom time dimension dataset by using Kettle variables

Time for action – creating the time dimension dataset

Getting variables

Time for action – getting variables for setting the default starting date

Using the Get Variables step

Summary

Chapter 7: Validating Data and Handling Errors

Capturing errors

Time for action – capturing errors while calculating the age of a film

Using PDI error handling functionality

Aborting a transformation

Time for action – aborting when there are too many errors

Aborting a transformation using the Abort step

Fixing captured errors


Time for action – treating errors that may appear

Treating rows coming to the error stream

Avoiding unexpected errors by validating data

Time for action – validating genres with a Regex Evaluation step

Validating data

Time for action – checking films file with the Data Validator

Defining simple validation rules using the Data Validator

Cleansing data

Summary

Chapter 8: Working with Databases

Introducing the Steel Wheels sample database

Connecting to the Steel Wheels database

Time for action – creating a connection with the Steel Wheels database

Connecting with Relational Database Management Systems

Exploring the Steel Wheels database

Time for action – exploring the sample database

A brief word about SQL

Exploring any configured database with the PDI Database explorer

Querying a database

Time for action – getting data about shipped orders

Getting data from the database with the Table input step

Using the SELECT statement for generating a new dataset

Making flexible queries by using parameters

Time for action – getting orders in a range of dates by using parameters

Making flexible queries by using Kettle variables

Time for action – getting orders in a range of dates by using variables

Sending data to a database

Time for action – loading a table with a list of manufacturers

Inserting new data into a database table with the Table output step

Inserting or updating data by using other PDI steps

Time for action – inserting new products or updating existent ones

Time for action – testing the update of existing products

Inserting or updating data with the Insert/Update step

Eliminating data from a database

Time for action – deleting data about discontinued items

Deleting records of a database table with the Delete step

Summary

Chapter 9: Performing Advanced Operations with Databases

Preparing the environment

Time for action – populating the Jigsaw database

Exploring the Jigsaw database model


Looking up data in a database

Doing simple lookups

Time for action – using a Database lookup step to create a list of products to buy

Looking up values in a database with the Database lookup step

Doing complex lookups

Time for action – using a Database join step to create a list of suggested products to buy

Joining data from the database to the stream data by using a Database join step

Introducing dimensional modeling

Loading dimensions with data

Time for action – loading a region dimension with a Combination lookup/update step

Time for action – testing the transformation that loads the region dimension

Describing data with dimensions

Loading Type I SCD with a Combination lookup/update step

Keeping a history of changes

Time for action – keeping a history of product changes with the Dimension lookup/update step

Time for action – testing the transformation that keeps a history of product changes

Keeping an entire history of data with a Type II slowly changing dimension

Loading Type II SCDs with the Dimension lookup/update step

Summary

Chapter 10: Creating Basic Task Flows

Introducing PDI jobs

Time for action – creating a simple hello world job

Executing processes with PDI jobs

Using Spoon to design and run jobs

Using the transformation job entry

Receiving arguments and parameters in a job

Time for action – customizing the hello world file with arguments and parameters

Using named parameters in jobs

Running jobs from a terminal window

Time for action – executing the hello world job from a terminal window

Using named parameters and command-line arguments in transformations

Time for action – calling the hello world transformation with fixed arguments and parameters

Deciding between the use of a command-line argument and a named parameter

Running job entries under conditions


Time for action – sending a sales report and warning the administrator if something is wrong

Changing the flow of execution on the basis of conditions

Creating and using a file results list

Summary

Chapter 11: Creating Advanced Transformations and Jobs

Enhancing your processes with the use of variables

Time for action – updating a file with news about examinations by setting a variable with the name of the file

Setting variables inside a transformation

Enhancing the design of your processes

Time for action – generating files with top scores

Reusing part of your transformations

Time for action – calculating the top scores with a subtransformation

Creating and using subtransformations

Creating a job as a process flow

Time for action – splitting the generation of top scores by copying and getting rows

Transferring data between transformations by using the copy/get rows mechanism

Nesting jobs

Time for action – generating the files with top scores by nesting jobs

Running a job inside another job with a job entry

Understanding the scope of variables

Iterating jobs and transformations

Time for action – generating custom files by executing a transformation for every input row

Executing for each row

Summary

Chapter 12: Developing and Implementing a Simple Datamart

Exploring the sales datamart

Deciding the level of granularity

Loading the dimensions

Time for action – loading dimensions for the sales datamart

Extending the sales datamart model

Loading a fact table with aggregated data

Time for action – loading the sales fact table by looking up dimensions

Getting the information from the source with SQL queries

Translating the business keys into surrogate keys

Obtaining the surrogate key for a Type I SCD

Obtaining the surrogate key for a Type II SCD

Obtaining the surrogate key for the Junk dimension

Obtaining the surrogate key for the Time dimension


Getting facts and dimensions together

Time for action – loading the fact table using a range of dates obtained from the command line

Time for action – loading the sales star

Getting rid of administrative tasks

Time for action – automating the loading of the sales datamart

Summary

Chapter 13: Taking it Further

PDI best practices

Getting the most out of PDI

Extending Kettle with plugins

Overcoming real world risks with some remote execution

Scaling out to overcome bigger risks

Integrating PDI and the Pentaho BI suite

PDI as a process action

PDI as a datasource

More about the Pentaho suite

PDI Enterprise Edition and Kettle Developer Support

Summary

Appendix A: Working with Repositories

Creating a repository

Time for action – creating a PDI repository

Creating repositories to store your transformations and jobs

Working with the repository storage system

Time for action – logging into a repository

Logging into a repository by using credentials

Defining repository user accounts

Creating transformations and jobs in repository folders

Creating database connections, partitions, servers, and clusters

Backing up and restoring a repository

Examining and modifying the contents of a repository with the Repository explorer

Migrating from a file-based system to a repository-based system and vice-versa

Summary

Appendix B: Pan and Kitchen: Launching Transformations and Jobs from the Command Line

Running transformations and jobs stored in files

Running transformations and jobs from a repository

Specifying command line options


Checking the exit code

Providing options when running Pan and Kitchen

Log details

Named parameters

Arguments

Variables

Appendix C: Quick Reference: Steps and Job Entries

Transformation steps

Job entries

Appendix D: Spoon Shortcuts

General shortcuts

Designing transformations and jobs

Grids

Repositories

Appendix E: Introducing PDI 4 Features

Agile BI

Visual improvements for designing transformations and jobs

Experiencing the mouse-over assistance

Time for action – creating a hop with the mouse-over assistance

Using the mouse-over assistance toolbar

Experiencing the sniff-testing feature

Experiencing the job drill-down feature

Experiencing even more visual changes

Enterprise features

Summary

Appendix F: Pop Quiz Answers

Chapter 1

PDI data sources

PDI prerequisites

PDI basics

Chapter 2

formatting data

Chapter 3

concatenating strings

Chapter 4

data movement (copying and distributing)

splitting a stream

Chapter 5

finding the seven errors


Chapter 6

using Kettle variables inside transformations

Chapter 7

PDI error handling

Chapter 8

defining database connections

database datatypes versus PDI datatypes

Insert/Update step versus Table Output/Update steps

filtering the first 10 rows

Chapter 9

loading slowly changing dimensions

loading type III slowly changing dimensions

Chapter 10

defining PDI jobs

Chapter 11

using the Add sequence step

deciding the scope of variables

Chapter 12

modifying a star model and loading the star with PDI

Chapter 13

remote execution and clustering

Index

Preface

Pentaho Data Integration (aka Kettle) is an engine along with a suite of tools responsible
for the processes of Extracting, Transforming, and Loading, better known as the ETL
processes. PDI not only serves as an ETL tool, but it's also used for other purposes such as
migrating data between applications or databases, exporting data from databases to flat
files, data cleansing, and much more. PDI has an intuitive, graphical, drag-and-drop design
environment, and its ETL capabilities are powerful. However, getting started with PDI can be
difficult or confusing. This book provides the guidance needed to overcome that difficulty,
covering the key features of PDI. Each chapter introduces new features, allowing you to
gradually get involved with the tool.

By the end of the book, you will have not only experimented with all kinds of examples, but
will also have built a basic but complete datamart with the help of PDI.

How to read this book

Although it is recommended that you read all the chapters, you don't need to. The book
allows you to tailor the PDI learning process according to your particular needs.

The first four chapters, along with Chapter 7 and Chapter 10, cover the core concepts. If
you don't know PDI and want to learn just the basics, reading those chapters would suffice.
Besides, if you need to work with databases, you could include Chapter 8 in the roadmap.

If you already know the basics, you can improve your PDI knowledge by reading chapters 5,
6, and 11.

Finally, if you already know PDI and want to learn how to use it to load or maintain a
datawarehouse or datamart, you will find all that you need in chapters 9 and 12.

While Chapter 13 is useful for anyone who is willing to take it further, all the appendices are
valuable resources for anyone who reads this book.


What this book covers

Chapter 1, Getting started with Pentaho Data Integration serves as the most basic
introduction to PDI, presenting the tool. The chapter includes instructions for installing PDI
and gives you the opportunity to play with the graphical designer (Spoon). The chapter also
includes instructions for installing a MySQL server.

Chapter 2, Getting Started with Transformations introduces one of the basic components
of PDI: transformations. Then, it focuses on the explanation of how to work with files. It
explains how to get data from simple input sources such as txt, csv, xml, and so on, do a
preview of the data, and send the data back to any of these common output formats. The
chapter also explains how to read command-line parameters and system information.

Chapter 3, Basic Data Manipulation explains the simplest and most commonly used ways of
transforming data, including performing calculations, adding constants, counting, filtering,
ordering, and looking for data.

Chapter 4, Controlling the Flow of Data explains different options that PDI offers to combine
or split flows of data.

Chapter 5, Transforming Your Data with JavaScript Code and the JavaScript Step explains how
JavaScript coding can help in the treatment of data. It shows why you need to code inside
PDI, and explains in detail how to do it.

Chapter 6, Transforming the Row Set explains the ability of PDI to deal with some
sophisticated problems, such as normalizing data from pivoted tables, in a simple fashion.

Chapter 7, Validating Data and Handling Errors explains the different options that PDI has to
validate data, and how to treat the errors that may appear.

Chapter 8, Working with Databases explains how to use PDI to work with databases. The
list of topics covered includes connecting to a database, previewing and getting data, and
inserting, updating, and deleting data. As database knowledge is not presumed, the chapter
also covers fundamental concepts of databases and the SQL language.

Chapter 9, Performing Advanced Operations with Databases explains how to perform
advanced operations with databases, including those specially designed to load
datawarehouses. A primer on datawarehouse concepts is also given in case you are not
familiar with the subject.

Chapter 10, Creating Basic Task Flows serves as an introduction to processes in PDI. Through
the creation of simple jobs, you will learn what jobs are and what they are used for.

Chapter 11, Creating Advanced Transformations and Jobs deals with advanced concepts that
will allow you to build complex PDI projects. The list of covered topics includes nesting jobs,
iterating on jobs and transformations, and creating subtransformations.


Chapter 12, Developing and implementing a simple datamart presents a simple datamart
project, and guides you to build the datamart by using all the concepts learned throughout
the book.

Chapter 13, Taking it Further gives a list of best PDI practices and recommendations for
going beyond.

Appendix A, Working with repositories guides you step by step in the creation of a PDI
database repository and then gives instructions to work with it.

Appendix B, Pan and Kitchen: Launching Transformations and Jobs from the Command Line is
a quick reference for running transformations and jobs from the command line.

Appendix C, Quick Reference: Steps and Job Entries serves as a quick reference to steps and
job entries used throughout the book.

Appendix D, Spoon Shortcuts is an extensive list of Spoon shortcuts useful for saving time
when designing and running PDI jobs and transformations.

Appendix E, Introducing PDI 4 features quickly introduces you to the architectural and
functional features included in Kettle 4, the version that was under development while
writing this book.

Appendix F, Pop Quiz Answers, contains answers to pop quiz questions.

What you need for this book

PDI is a multiplatform tool. This means no matter what your operating system is, you will
be able to work with the tool. The only prerequisite is to have JVM 1.5 or a higher version
installed. It is also useful to have Excel or Calc along with a nice text editor.

Having an Internet connection while reading is extremely useful as well. Several links are
provided throughout the book that complement what is explained. Besides, there is the
PDI forum where you may search or post doubts if you are stuck with something.

Who this book is for

This book is for software developers, database administrators, IT students, and everyone
involved or interested in developing ETL solutions or, more generally, doing any kind of data
manipulation. If you have never used PDI before, this will be a perfect book to start with.

You will find this book to be a good starting point if you are a database administrator, a data
warehouse designer, an architect, or any person who is responsible for data warehouse
projects and needs to load data into them.


You don't need to have any prior data warehouse or database experience to read this book.

Fundamental database and data warehouse technical terms and concepts are explained in

an easy-to-understand language.

Conventions

In this book, you will find a number of styles of text that distinguish between different
kinds of information. Here are some examples of these styles, and an explanation of
their meaning.

Code words in text are shown as follows: "You read the examination.txt file, and did
some calculations to see how the students did."

New terms and important words are shown in bold. Words that you see on the screen, in
menus or dialog boxes for example, appear in our text like this: "Edit the Sort rows step by
double-clicking it, click the Get Fields button, and adjust the grid."

Warnings or important notes appear in a box like this.

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book:
what you liked or may have disliked. Reader feedback is important for us to develop titles
that you really get the most out of.

To send us general feedback, simply drop an email to feedback@packtpub.com, and
mention the book title in the subject of your message.

If there is a book that you need and would like to see us publish, please send us a note in the
SUGGEST A TITLE form on www.packtpub.com or email suggest@packtpub.com.

If there is a topic that you have expertise in and you are interested in either writing or
contributing to a book, see our author guide on www.packtpub.com/authors.


Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you

to get the most from your purchase.

Downloading the example code for the book

Visit http://www.packtpub.com/files/code/9546_Code.zip to

directly download the example code.

The downloadable files contain instructions on how to use them.

Errata

Although we have taken every care to ensure the accuracy of our contents, mistakes do
happen. If you find a mistake in one of our books (maybe a mistake in text or code), we
would be grateful if you would report this to us. By doing so, you can save other readers
from frustration, and help us to improve subsequent versions of this book. If you find any
errata, please report them by visiting http://www.packtpub.com/support, selecting
your book, clicking on the let us know link, and entering the details of your errata.
Once your errata are verified, your submission will be accepted and the errata added
to any list of existing errata. Any existing errata can be viewed by selecting your title
from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt,
we take the protection of our copyright and licenses very seriously. If you come across any
illegal copies of our works in any form on the Internet, please provide us with the location
address or website name immediately so that we can pursue a remedy.

Please contact us at copyright@packtpub.com with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you
valuable content.

Questions

You can contact us at questions@packtpub.com if you are having a problem with any

aspect of the book, and we will do our best to address it.

Getting Started with Pentaho Data Integration

Pentaho Data Integration is an engine along with a suite of tools responsible
for the processes of extracting, transforming, and loading, best known as the
ETL processes. This book is meant to teach you how to use PDI.
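As a rough illustration of the three ETL stages named above, the following Python sketch extracts rows from CSV text, transforms them, and loads them into a list that stands in for a target table. It is not PDI code: PDI expresses these stages graphically, and the sample data, field names, and function names here are assumptions made only for this example.

```python
import csv
import io

# Hypothetical sample input; the field names are invented for this sketch.
SOURCE = "name,score\nann,7\nbob,9\n"

def extract(text):
    """Extract: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: capitalize names and convert scores to integers."""
    return [{"name": r["name"].capitalize(), "score": int(r["score"])} for r in rows]

def load(rows, target):
    """Load: append the transformed rows to the target (a plain list here)."""
    target.extend(rows)
    return target

table = load(transform(extract(SOURCE)), [])
print(table)
```

Each function maps loosely onto one stage, so replacing the source or the target only touches one function.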

In this chapter you will:

Learn what Pentaho Data Integration is

Install the software and start working with the PDI graphical designer

Install MySQL, a database engine that you will use when you start working with databases

Pentaho Data Integration and Pentaho BI Suite

Before introducing PDI, let's talk about Pentaho BI Suite. The Pentaho Business Intelligence
Suite is a collection of software applications intended to create and deliver solutions for
decision making. The main functional areas covered by the suite are:

Analysis: The analysis engine serves multidimensional analysis. It's provided by the
Mondrian OLAP server and the JPivot library for navigation and exploring.




Reporting: The reporting engine allows designing, creating, and distributing reports
in various known formats (HTML, PDF, and so on) from different kinds of sources.
The reports created in Pentaho are based mainly on the JFreeReport library, but it's
possible to integrate reports created with external reporting libraries such as Jasper
Reports or BIRT.

Data Mining: Data mining is running data through algorithms in order to understand
the business and do predictive analysis. Data mining is possible thanks to the
Weka Project.

Dashboards: Dashboards are used to monitor and analyze Key Performance
Indicators (KPIs). A set of tools incorporated into the BI Suite in the latest version
allows users to create interesting dashboards, including graphs, reports, analysis
views, and other Pentaho content, without much effort.

Data integration: Data integration is used to integrate scattered information
from different sources (applications, databases, files) and make the integrated
information available to the final user. Pentaho Data Integration, our main
concern, is the engine that provides this functionality.

All this functionality can be used standalone as well as integrated. In order to run analysis,
reports, and so on integrated as a suite, you have to use the Pentaho BI Platform. The
platform has a solution engine, and offers critical services such as authentication,
scheduling, security, and web services.




This set of software and services forms a complete BI Platform, which makes Pentaho Suite
the world's leading open source Business Intelligence Suite.

Exploring the Pentaho Demo

Despite being out of the scope of this book, it's worth briefly introducing the Pentaho
Demo. The Pentaho BI Platform Demo is a preconfigured installation that lets you explore
several capabilities of the Pentaho platform. It includes sample reports, cubes, and
dashboards for Steel Wheels. Steel Wheels is a fictional store that sells all kinds of scale
replicas of vehicles.

The demo can be downloaded from http://sourceforge.net/projects/pentaho/files/.
Under the Business Intelligence Server folder, look for the latest stable
version. The file you have to download is named biserver-ce-3.5.2.stable.zip for
Windows and biserver-ce-3.5.2.stable.tar.gz for other systems.

In the same folder you will find a file named biserver-getting_started-ce-3.5.0.pdf.
The file is a guide that introduces you to the platform and gives you some
guidance on how to install and run it. The guide even includes a mini tutorial on building
a simple PDI input-output transformation.

You can find more about Pentaho BI Suite at www.pentaho.org.

Pentaho Data Integration

Most of the Pentaho engines, including the engines mentioned earlier, were created as
community projects and later adopted by Pentaho. The PDI engine is no exception: Pentaho
Data Integration is the new denomination for the business intelligence tool born as Kettle.

The name Kettle didn't come from the recursive acronym Kettle Extraction,
Transportation, Transformation, and Loading Environment it has now, but from
KDE Extraction, Transportation, Transformation and Loading Environment,
as the tool was planned to be written on top of KDE, as mentioned in the
introduction of the book.

In April 2006 the Kettle project was acquired by the Pentaho Corporation and Matt Casters,
Kettle's founder, also joined the Pentaho team as a Data Integration Architect.


When Pentaho announced the acquisition, James Dixon, the Chief Technology Officer, said:

We reviewed many alternatives for open source data integration, and Kettle clearly
had the best architecture, richest functionality, and most mature user interface.
The open architecture and superior technology of the Pentaho BI Platform
and Kettle allowed us to deliver integration in only a few days, and make that
integration available to the community.

By joining forces with Pentaho, Kettle benefited from a huge developer community, as well
as from a company that would support the future of the project.

From that moment the tool has grown constantly. Every few months a new release is
available, bringing users improvements in performance and existing functionality,
new functionality, ease of use, and great changes in look and feel. The following is a timeline
of the major events related to PDI since its acquisition by Pentaho:

June 2006: PDI 2.3 is released. Numerous developers had joined the project and
there were bug fixes provided by people in various regions of the world. Among
other changes, the version included enhancements for large scale environments
and multilingual capabilities.

February 2007: Almost seven months after the last major revision, PDI 2.4 is
released including remote execution and clustering support (more on this in
Chapter 13), enhanced database support, and a single designer for the two
main elements you design in Kettle: jobs and transformations.

May 2007: PDI 2.5 is released including many new features, the main feature being
the advanced error handling.

November 2007: PDI 3.0 emerges totally redesigned. Its major library changed to
gain massive performance. The look and feel also changed completely.

October 2008: PDI 3.1 comes with an easier-to-use tool, along with a lot of new
functionalities as well.

April 2009: PDI 3.2 is released with a really large number of changes for a
minor version: new functionality, visualization improvements, performance
improvements, and a huge pile of bug fixes. The main change in this version was the
incorporation of dynamic clustering (see Chapter 13 for details).

In 2010 PDI 4.0 will be released, delivering mostly improvements with regard to
enterprise features such as version control.

Most users sll refer to PDI as Kele, its further name. Therefore, the names PDI,

Pentaho Data Integraon, and Kele will be used interchangeably throughout

the book.




Using PDI in real world scenarios

Paying aenon to its name, Pentaho Data Integraon, you could think of PDI as a tool to

integrate data.

In you look at its original name, K.E.T.T.L.E., then you must conclude that it is a tool used

for ETL processes which, as you may know, are most frequently seen in data warehouse

environments.

In fact, PDI not only serves as a data integrator or an ETL tool, but is such a powerful tool

that it is common to see it used for those and for many other purposes. Here you have

some examples.

Loading datawarehouses or datamarts

The loading of a datawarehouse or a datamart involves many steps, and there are many variants depending on business area or business rules. However, in every case, the process involves the following steps:

Extracting information from one or different databases, text files, and other sources. The extraction process may include the task of validating and discarding data that doesn't match expected patterns or rules.

Transforming the obtained data to meet the business and technical needs required on the target. Transformation implies tasks such as converting data types, doing some calculations, filtering irrelevant data, and summarizing.

Loading the transformed data into the target database. Depending on the requirements, the loading may overwrite the existing information, or may add new information each time it is executed.

Geng Started with Pentaho Data Integraon

[ 12 ]

Kele comes ready to do every stage of this loading process. The following sample

screenshot shows a simple ETL designed with Kele:

Integrating data

Imagine two similar companies that need to merge their databases in order to have a unified view of the data, or a single company that has to combine information from a main ERP application and a CRM application, though they're not connected. These are just two of hundreds of examples where data integration is needed. Integrating data is not just a matter of gathering and mixing data; some conversions, validation, and transport of data have to be done. Kettle is meant to do all those tasks.

Data cleansing

Why do we need data to be correct and accurate? There are many reasons: for the efficiency of business, to generate trusted conclusions in data mining or statistical studies, to succeed when integrating data, and so on. Data cleansing is about ensuring that the data is correct and precise. This can be ensured by verifying if the data meets certain rules, discarding or correcting those that don't follow the expected pattern, setting default values for missing data, eliminating information that is duplicated, normalizing data to conform to minimum and maximum values, and so on. These are tasks that Kettle makes possible, thanks to its vast set of transformation and validation capabilities.
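The cleansing tasks listed above can be made concrete with a small Python sketch. The field names, the allowed age range, and the default value are hypothetical choices for the example, not rules taken from Kettle.

```python
def cleanse(rows, min_age=0, max_age=120, default_country="unknown"):
    """Apply a few typical cleansing rules to a list of dictionaries."""
    seen = set()
    clean = []
    for row in rows:
        name = (row.get("name") or "").strip()
        if not name:  # rule: a name is required, discard the row otherwise
            continue
        try:
            age = int(row.get("age"))
        except (TypeError, ValueError):  # rule: age must be numeric
            continue
        age = max(min_age, min(max_age, age))  # force into the allowed range
        country = row.get("country") or default_country  # default for missing data
        key = (name.lower(), age, country)
        if key in seen:  # eliminate duplicated information
            continue
        seen.add(key)
        clean.append({"name": name, "age": age, "country": country})
    return clean

dirty = [
    {"name": " Ann ", "age": "34", "country": "AR"},
    {"name": "ann", "age": "34", "country": "AR"},   # duplicate of the first row
    {"name": "Bob", "age": "250", "country": None},  # age out of range, no country
    {"name": "", "age": "20", "country": "IT"},      # missing name
    {"name": "Eve", "age": "n/a", "country": "FR"},  # non-numeric age
]
cleaned = cleanse(dirty)
print(cleaned)
```

Of the five input rows, two survive: the first one unchanged apart from trimming, and Bob's row with the age limited to 120 and the country set to the default.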


Migrating information

Think of a company of any size that uses a commercial ERP application. One day the owners realize that the licences are consuming an important share of the budget and so they decide to migrate to an open source ERP. The company will no longer have to pay licences, but if they want to make the change, they will have to migrate the information. Obviously it is not an option to start from scratch, or type the information by hand. Kettle makes the migration possible, thanks to its ability to interact with most kinds of sources and destinations such as plain files, and commercial and free databases and spreadsheets.

Exporting data

Somemes you are forced by government regulaons to export certain data to be processed

by legacy systems. You can't just print and deliver some reports containing the required data.

The data has to have a rigid format, with columns that have to obey some rules (size, format,

content), dierent records for heading and tail, just to name some common demands. Kele

has the power to take crude data from the source and generate these kinds of ad hoc reports.
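To show what such a rigid format can look like, here is a minimal Python sketch that writes fixed-width records with a heading record and a tail record. The record layout (a one-character record type, a 10-character text column, and an 8-digit zero-padded number) and the hard-coded date are assumptions invented for the example.

```python
def fixed(value, width, align="<", fill=" "):
    """Pad or truncate a value so it occupies exactly `width` characters."""
    text = str(value)[:width]
    return format(text, f"{fill}{align}{width}")

def export(records):
    """Build a heading record, one detail record per row, and a tail record."""
    lines = ["H" + fixed("SALES", 10) + fixed("20100401", 8)]
    total = 0
    for rec in records:
        total += rec["amount_cents"]
        lines.append(
            "D"
            + fixed(rec["customer"], 10)
            + fixed(rec["amount_cents"], 8, align=">", fill="0")
        )
    lines.append(
        "T"
        + fixed(len(records), 10, align=">", fill="0")
        + fixed(total, 8, align=">", fill="0")
    )
    return lines

records = [
    {"customer": "Acme Ltd", "amount_cents": 12550},
    {"customer": "Globex Corporation", "amount_cents": 990},
]
lines = export(records)
print("\n".join(lines))
```

Every line has the same length, long names are truncated to fit their column, and the tail record carries the record count and the total so the receiving system can check the file.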

Integrating PDI using Pentaho BI

The previous examples show typical uses of PDI as a standalone application. However, Kettle may be used as part of a process inside the Pentaho BI Platform. There are many things embedded in the Pentaho application that Kettle can do: preprocessing data for an on-line report, sending mails in a scheduled fashion, or generating spreadsheet reports.

You'll find more on this in Chapter 13. However, the use of PDI integrated with the BI Suite is beyond the scope of this book.

Pop quiz – PDI data sources

Which of the following aren't valid sources in Kettle:

1. Spreadsheets

2. Free database engines

3. Commercial database engines

4. Flat files

5. None of the above

Geng Started with Pentaho Data Integraon

[ 14 ]

Installing PDI

In order to work with PDI you need to install the software. It's a simple task; let's do it.

Time for action – installing PDI

These are the instrucons to install Kele, whatever your operang system.

The only prerequisite to install PDI is to have JRE 5.0 or higher installed. If you don't have it,

please download it from http://www.javasoft.com/ and install it before proceeding.

Once you have checked the prerequisite, follow these steps:

1. From http://community.pentaho.com/sourceforge/ follow the link to

Pentaho Data Integraon (Kele). Alternavely, go directly to the download page

http://sourceforge.net/projects/pentaho/files/Data Integration.

2. Choose the newest stable release. At this me, it is 3.2.0.

3. Download the le that matches your plaorm. The preceding screenshot should

help you.

4. Unzip the downloaded le in a folder of your choice

—C:/Kettle or /home/your_dir/kettle.


5. If your system is Windows, you're done. Under UNIX-like environments, it's recommended that you make the scripts executable. Assuming that you chose Kettle as the installation folder, execute the following commands:

cd Kettle

chmod +x *.sh

What just happened?

You have installed the tool in just a few minutes. Now you have all you need to start working.

Pop quiz – PDI prerequisites

Which of the following are mandatory to run PDI? You may choose more than one option.

1. Kettle

2. Pentaho BI platform

3. JRE

4. A database engine

Launching the PDI graphical designer: Spoon

Now that you've installed PDI, you must be eager to do some stuff with data. That will be possible only inside a graphical environment. PDI has a desktop designer tool named Spoon. Let's see how it feels to work with it.

Time for action – starting and customizing Spoon

In this tutorial you're going to launch the PDI graphical designer and get familiarized with its main features.

1. Start Spoon.

If your system is Windows, type the following command:

Spoon.bat

In other plaorms such as Unix, Linux, and so on, type:

Spoon.sh

If you didn't make spoon.sh executable, you may type:

sh Spoon.sh



Geng Started with Pentaho Data Integraon

[ 16 ]

2. As soon as Spoon starts, a dialog window appears asking for the repository connection data. Click the No Repository button. The main window appears. You will see a small window with the tip of the day. After reading it, close that window.

3. A welcome! window appears with some useful links for you to see.

4. Close the welcome window. You can open that window later from the main menu.

5. Click Options... from the Edit menu. A window appears where you can change various general and visual characteristics. Uncheck the circled checkboxes:

6. Select the tab window Look & Feel.


7. Change the Grid size and Preferred Language settings as follows:

8. Click the OK button.

9. Restart Spoon in order to apply the changes. You should see neither the repository dialog nor the welcome window. You should see the following screen instead:

Geng Started with Pentaho Data Integraon

[ 18 ]

What just happened?

You ran Spoon, the graphical designer of PDI, for the first time, and applied some custom configuration.

From the Look & Feel configuration window, you changed the size of the dotted grid that appears in the canvas area while you are working. You also changed the preferred language. In the Options tab window, you chose not to show either the repository dialog or the welcome window at startup. These changes were applied as you restarted the tool, not before.

The second time you launched the tool, the repository dialog didn't show up. When the main window appeared, all the visible texts were shown in French, which was the selected language, and instead of the welcome window, there was a blank screen.

Spoon

The tool that you're exploring in this section is PDI's desktop design tool. With Spoon you design, preview, and test all your work, that is, transformations and jobs. When you see PDI screenshots, what you are really seeing are Spoon screenshots. The other PDI components that you will meet in the following chapters are executed from terminal windows.

Setting preferences in the Options window

In the tutorial you changed some preferences in the Options window. There are several look and feel characteristics you can change beyond those you changed. Feel free to experiment with these settings.

Remember to restart Spoon in order to see the changes applied.

If you choose any language as preferred language other than English, you should select a different language as alternative. If you do so, every name or description not translated to your preferred language will be shown in the alternative language.

Just for the curious people: Italian and French are the overall winners of the list of languages to which the tool has been translated from English. Below them follow Korean, Argentinean Spanish, Japanese, and Chinese.


One of the sengs you changed was the appearance of the welcome window at start up.

The welcome window has many useful links, all related with the tool: wiki pages, news,

forum access, and more. It's worth exploring them.

You don't have to change the sengs again to see the welcome window.

You can open it from the menu Help | Show the Welcome Screen.

Storing transformations and jobs in a repository

The rst me you launched Spoon, you chose No Repository. Aer that, you congured

Spoon to stop asking you for the Repository opon. You must be curious about what the

repository is and why not to use it. Let's explain it.

As said, the results of working with PDI are Transformaons and Jobs. In order to save the

Transformaons and Jobs, PDI oers two methods:

Repository: When you use the repository method you save jobs and

transformaons in a repository. A repository is a relaonal database specially

designed for this purpose.

Files: The les method consists of saving jobs and transformaons as regular XML

les in the lesystem, with extension kjb and ktr respecvely.

The following diagram summarizes this:

[Diagram: Spoon (design, preview, run) and the Kettle engine work with transformations and jobs, stored either in a repository or in the file system (.ktr and .kjb files); the two storage methods are exclusive.]



Geng Started with Pentaho Data Integraon

[ 20 ]

You cannot mix the two methods (files and repository) in the same project. Therefore, you must choose the method when you start the tool.

Why did we choose not to work with a repository, or in other words, to work with files? This is mainly for the following two reasons:

Working with files is more natural and practical for most users.

Working with a repository requires minimum database knowledge and that you also have access to a database engine from your computer. Having both preconditions would allow you to learn working with both methods. However, it's likely that you don't have them.

Throughout this book, we will use the file method. For details of working with repositories, please refer to Appendix A.

Creating your rst transformation

Unl now, you've seen the very basic elements of Spoon. For sure, you must be waing to do

some interesng task beyond looking around. It's me to create your rst transformaon.

Time for action – creating a hello world transformation

How about starng by saying Hello to the World? Not original but enough for a very rst

praccal exercise. Here is how you do it:

1. Create a folder named pdi_labs under the folder of your choice.

2. Open Spoon.

3. From the main menu select File | New Transformaon.

4. At the le-hand side of the screen, you'll see a tree of Steps. Expand the Input

branch by double-clicking it.

5. Le-click the Generate Rows icon.




6. Without releasing the button, drag-and-drop the selected icon to the main canvas. The screen will look like this:

7. Double-click the Generate Rows step that you just put in the canvas and fill the text boxes and grid as follows:

8. From the Steps tree, double-click the Flow branch.

9. Click the Dummy icon and drag-and-drop it to the main canvas.

Geng Started with Pentaho Data Integraon

[ 22 ]

10. Click the Generate Rows step and, holding the Shift key down, drag the cursor towards the Dummy step. Release the button. The screen should look like this:

11. Right-click somewhere on the canvas to bring up a contextual menu.

12. Select New note. A note editor appears.

13. Type some description such as Hello World! and click OK.

14. From the main menu, select Transformation | Configuration. A window appears to specify transformation properties. Fill the Transformation name with a simple name such as hello_world. Fill the Description field with a short description such as My first transformation. Finally provide a clearer explanation in the Extended description text box and click OK.

15. From the main menu, select File | Save.

16. Save the transformation in the folder pdi_labs with the name hello_world.

17. Select the Dummy step by left-clicking it.

18. Click on the Preview button in the menu above the main canvas.


19. A debug window appears. Click the Quick Launch button.

20. The following window appears to preview the data generated by the transformation:

21. Close the preview window and click the Run button.

22. A window appears. Click Launch.

Geng Started with Pentaho Data Integraon

[ 24 ]

23. The execuon results are shown in the boom of the screen. The Logging tab

should look as follows:

What just happened?

You've just created your first transformation.

First, you created a new transformation. From the tree on the left, you dragged two steps and dropped them into the canvas. Finally, you linked them with a hop.

With the Generate Rows step, you created 10 rows of data with the message Hello World!. The Dummy step simply served as a destination of those rows.

After creating the transformation, you did a preview. The preview allowed you to see the content of the created data, that is, the 10 rows with the message Hello World!


Finally, you ran the transformation. You could see the results of the execution at the bottom of the window. There is a tab named Step Metrics with information about what happens with each step in the transformation. There is also a Logging tab showing the complete detail of what happened.

Directing the Kettle engine with transformations

As shown in the following diagram, a transformation is an entity made of steps linked by hops. These steps and hops build paths through which data flows. The data enters or is created in a step, the step applies some kind of transformation to it, and finally the data leaves that step. Therefore, it's said that a transformation is data-flow oriented.

[Diagram: a transformation made of steps (Step1, Step2, ..., StepN) linked by hops, with data flowing from input to output.]

A transformaon itself is not a program nor an executable le. It is just plain XML. The

transformaon contains metadata that tells the Kele engine what to do.

A step is the minimal unit inside a transformation. A big set of steps is available. These steps are grouped in categories such as the input and flow categories that you saw in the example. Each step is conceived to accomplish a specific function, going from reading a parameter to normalizing a dataset. Each step has a configuration window. These windows vary according to the functionality of the steps and the category to which they belong. What all steps have in common are the name and description:

Step property    Description

Name    A representative name inside the transformation.

Description    A brief explanation that allows you to clarify the purpose of the step. It's not mandatory but it is useful.

A hop is a graphical representation of data flowing between two steps: an origin and a destination. The data that flows through that hop constitutes the output data of the origin step and the input data of the destination step.
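The relationship between steps and hops can be imitated with Python generators. This is only an analogy under simplifying assumptions (one field per row, rows passed one at a time); the function names are hypothetical and do not correspond to anything in Kettle's code.

```python
def generate_rows(limit, message):
    """Origin step: creates `limit` rows, each with a single field."""
    for _ in range(limit):
        yield {"message": message}

def dummy(rows):
    """Destination step: passes every row through unchanged."""
    for row in rows:
        yield row

# The "hop": the output of the origin step is the input of the destination step.
hop = generate_rows(10, "Hello World!")
result = list(dummy(hop))
print(len(result), result[0])
```

As in the hello_world transformation, the first step creates the data and the second one merely receives it.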

Geng Started with Pentaho Data Integraon

[ 26 ]

Exploring the Spoon interface

As you just saw, Spoon is the tool with which you create, preview, and run transformations. The following screenshot shows you the basic work areas:

The words canvas and work area will be used interchangeably throughout

the book.

Viewing the transformation structure

If you click the View icon in the upper left corner of the screen, the tree will change to show the structure of the transformation currently being edited.


Running and previewing the transformation

The Preview funconality allows you to see a sample of the data produced for selected steps.

In the previous example, you previewed the output of the Dummy Step. The Run opon

eecvely runs the whole transformaon.

Whether you preview or run a transformaon, you'll get an execuon results window

showing what happened. Let's explain it through an example.

Time for action – running and previewing the hello_world transformation

Let's do some tesng and explore the results:

1. Open the hello_world transformaon.

2. Edit the Generate Rows step, and change the limit from 10 to 1000 so that it

generates 1,000 rows.

3. Select the Logging tab window at the boom of the screen.

4. Click on Run.

5. In the Log level drop-down list, select RowLevel detail.

6. Click on Launch.

7. You can see how the logging window shows every task in a very detailed way.

8. Edit the Generate Rows step, and change the limit to 10,000 so that it generates

10,000 rows.

9. Select the Step Metrics.

Geng Started with Pentaho Data Integraon

[ 28 ]

10. Run the transformation.

11. You can see how the numbers change as the rows travel through the steps.

What just happened?

You did some tests with the hello_world transformation and saw the results in the Execution Results window.

Previewing the results in the Execution Results window

The Execuon Results window shows you what is happening while you preview or run

a transformaon.

The Logging tab shows the execuon of your transformaon, step by step. By default, the

level of the logging detail is Basic but you can change it to see dierent levels of detail—from

a minimal logging (level Minimal) to a very detailed one (level RowLevel).

The Step Metrics tab shows, for each step of the transformaon, the executed operaons

and several status and informaon columns. You may be interested in the following columns:

Column Descripon

Read Contains the number of rows coming from previous steps

Written Contains the number of rows leaving from this step toward the next

Input Number of rows read from a le or table

Output Number of rows written to a le or table

Errors Errors in the execution. If there are errors, the whole row becomes red

Active Tells the current status of the execution

In the example, you can see that the Generate Rows step writes rows, which then are read

by the Dummy step. The Dummy step also writes the same rows, but in this case those

go nowhere.
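The meaning of the Read and Written columns can be imitated with a small Python sketch that counts rows as they pass through each step. The Step class below is a hypothetical helper written for this illustration; it is not part of PDI and only mimics those two counters.

```python
class Step:
    """Wraps a row-processing function and counts rows read and written."""

    def __init__(self, name, func):
        self.name = name
        self.func = func
        self.read = 0
        self.written = 0

    def _count_read(self, rows):
        for row in rows:
            self.read += 1
            yield row

    def run(self, rows=None):
        # A step with no incoming hop receives None and reads nothing.
        source = self._count_read(rows) if rows is not None else None
        for row in self.func(source):
            self.written += 1
            yield row

def generate(_unused):
    for _ in range(1000):
        yield {"message": "Hello World!"}

def passthrough(rows):
    yield from rows

gen = Step("Generate Rows", generate)
dum = Step("Dummy", passthrough)
for _ in dum.run(gen.run()):
    pass  # rows written by the last step go nowhere

for step in (gen, dum):
    print(f"{step.name:15} read={step.read:5} written={step.written:5}")
```

The first step reads nothing and writes 1,000 rows, while the second reads and writes the same 1,000 rows, which mirrors the behavior described for Generate Rows and Dummy.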


Pop quiz – PDI basics

For each of the following, decide if the sentence is true or false:

1. There are several graphical tools in PDI, but Spoon is the most used.

2. You can choose to save Transformations either in files or in a database.

3. To run a Transformation, an executable file has to be generated from Spoon.

4. The grid size option in the Look & Feel window allows you to resize the work area.

5. To create a transformation, you have to provide external data.

Installing MySQL

Before skipping to the next chapter, let's devote some minutes to the installation of MySQL.

In Chapter 8 you will begin working with databases from PDI. In order to do that, you will need access to some database engine. As MySQL is the world's most popular open source database, it was the database engine chosen for the database-related tutorials in the book.

In this section you will learn to install the MySQL database engine both in Windows and Ubuntu, the most popular distribution of Linux these days. As the procedures for installing the software are different, a separate explanation is given for each system.

Time for action – installing MySQL on Windows

In order to install MySQL on your Windows system, please follow these instructions:

1. Open an internet browser and type http://dev.mysql.com/downloads/mysql/.

2. Select the Microsoft Windows platform and download the mysql-essential package that matches your system: 32-bit or 64-bit.

3. Double-click the downloaded file. A wizard will guide you through the process.

4. When asked about the setup type, select Typical.

5. Several screens follow. When the wizard is complete you'll have the option to configure the server. Check Configure the MySQL Server now and click Finish.

Geng Started with Pentaho Data Integraon

[ 30 ]

6. A new wizard will be launched that lets you configure the server.

7. When asked about the configuration type, select Standard Configuration.

8. When prompted, set the Windows options as shown in the next screenshot:

9. When prompted for the security options, provide a password for the root user. You'll have to retype the password.

Provide a password that you can remember. You'll need it later to connect to the MySQL server.


10. In the next window click on Execute to proceed with the configuration. When the configuration is done, you'll see this:

11. Click on Finish. After installing MySQL it is recommended that you install the GUI tools for administering and querying the database.

12. Open an Internet browser and type http://dev.mysql.com/downloads/gui-tools/.

13. Look for the Windows downloads and download the Windows (x86) package.

14. Double-click the downloaded file. A wizard will guide you through the process.

15. When asked about the setup type, select Complete.

16. Several screens follow. Just follow the wizard instructions.

17. When the wizard ends, you'll have the GUI tools added to the MySQL menu.

Geng Started with Pentaho Data Integraon

[ 32 ]

What just happened?

You downloaded and installed MySQL on your Windows system. You also installed MySQL GUI tools, a software package that includes an administrator and a query browser utility and that will make your life easier when working with the database.

Time for action – installing MySQL on Ubuntu

This tutorial shows you the procedure to install MySQL on Ubuntu.

In order to follow the tutorial you need to be connected to

the Internet.

Please follow these instructions:

1. Check that you have access to the Internet.

2. Open the Synaptic package manager from System | Administration | Synaptic Package Manager.

3. Under Quick search type mysql-server and click on the Search button.

4. Among the results, locate mysql-server-5.1, click in the tiny square to the left, and select Mark for Installation.

5. You'll be prompted for confirmation. Click on Mark.


6. Now search for a package named mysql-admin.

7. When found, mark it for installation in the same way.

8. Click on Apply on the main toolbar.

9. A window shows up asking for confirmation. Click on Mark again. What follows is the download process followed by the installation process.

10. At a particular moment a window appears asking you for a password for the root user, the administrator of the database. Enter a password of your choice. You'll have to enter it twice.

Think of a password that you can remember. You'll need it later to connect to the MySQL server.

11. When the process ends, you will see the changes applied.

Geng Started with Pentaho Data Integraon

[ 34 ]

12. Under Applicaons a new menu will also be added to access the GUI tools.

What just happened?

You installed MySQL server and GUI Tools in your Ubuntu system.

The previous direcons are for standard installaons. For custom installaons,

instrucons related to other operang systems, or for troubleshoong, please

check the MySQL documentaon at—http://dev.mysql.com/doc/

refman/5.1/en/installing.html.

Summary

In this rst chapter, you were introduced to Pentaho Data Integraon. Specically, you learned

what Pentaho Data Integraon is and you installed the tool. You were also introduced to

Spoon, the graphical designer of PDI, and you created your rst transformaon.

As an addional exercise, you installed a MySQL server and the MySQL GUI tools. You will

need this soware when you start working with databases in Chapter 8.

Now that you've learned the basics, you're ready to begin creang your own transformaons

to explore real data. That is the topic of the next chapter.

Getting Started with Transformations

In the previous chapter you used the graphical designer Spoon to create your first transformation: Hello world. Now you will start creating your own transformations to explore data from the real world. Data is everywhere; in particular you will find data in files. Product lists, logs, survey results, and statistical information are just a sample of the different kinds of information usually stored in files. In this chapter you will create transformations to get data from files, and also to send data back to files. This in turn will allow you to learn the basic PDI terminology related to data.

Reading data from files

Despite being the most primitive format used to store data, files are broadly used and they exist in several flavors such as fixed width, comma-separated values, spreadsheet, or even free format files. PDI has the ability to read data from all types of files; in this first tutorial let's see how to use PDI to get data from text files.

Geng Started with Transformaons

[ 36 ]

Time for action – reading results of football matches from files

Suppose you have collected several football statistics in plain files. Your files look like this:

Group 1|02/June|Argentina|2-1|Hungary

Group 1|06/June|Italy|3-1|Hungary

Group 1|06/June|Argentina|2-1|France

Group 1|10/June|France|3-1|Hungary

Group 1|10/June|Italy|1-0|Argentina

-------------------------------------------

World Cup 78

Group 1

You don't have one, but many files, all with the same structure. You now want to unify all the information in one single file. Let's begin by reading the files.

1. Create the folder named pdi_files. Inside it, create the input and output subfolders.

2. By using any text editor, type the file shown and save it under the name group1.txt in the folder named input, which you just created. You can also download the file from Packt's official website.

3. Start Spoon.

4. From the main menu select File | New Transformation.

5. Expand the Input branch of the steps tree.

6. Drag the Text file input icon to the canvas.

7. Double-click the Text file input icon and give a name to the step.

8. Click the Browse... button and search for the file group1.txt.

9. Select the file. The textbox File or directory will be temporarily populated with the full path of the file, for example, C:\pdi_files\input\group1.txt.


10. Click the Add button. The full text will be moved from the File or directory textbox to the grid. The configuration window should look as follows:

11. Select the Content tab and fill it like this:

Geng Started with Transformaons

[ 38 ]

12. Select the Fields tab. Click the Get Fields button. The screen should look like this:

13. In the small window that proposes a number of sample lines, click OK.

14. Close the scan results window.

15. Change the second row. Under the Type column select Date, and under the Format column, type dd/MMM.

16. The result value is text, not a number, so change the fourth row too. Under the Type column select String.

17. Click the Preview rows button, and then the OK button.

18. The previewed data should look like the following:


19. Expand the Transform branch of the steps tree.

20. Drag the Select values icon to the canvas.

21. Create a hop from the Text file input step to the Select values step.

Remember that you do it by selecting the first step, then dragging toward the second while holding down the Shift key.

22. Double-click the Select values step icon and give a name to the step.

23. Select the Remove tab.

24. Click the Get fields to remove button.

25. Delete every row except the first and the last one by left-clicking them and pressing Delete.

26. The tab window looks like this:

27. Click OK.

28. From the Flow branch of the steps tree, drag the Dummy icon to the canvas.

29. Create a hop from the Select values step to the Dummy step. Your transformation should look like the following:

Geng Started with Transformaons

[ 40 ]

30. Congure the transformaon by pressing Ctrl+T and giving a name and a descripon to

the transformaon.

31. Save the transformaon by pressing Ctrl+S.

32. Select the Dummy step.

33. Click the Preview buon located on the transformaon toolbar:

34. Click the Quick Launch buon.

35. The following window appears, showing the nal data:

What just happened?

You read your plain file with results of football matches into a transformation.

By using a Text file input step, you told Kettle the full path to your file, along with the characteristics of the file so that Kettle was able to read the data correctly: you specified that the file had a header, had three rows at the end that should be ignored, and specified the name and type of the columns.

After reading the file, you used a Select values step to remove columns you didn't need: the first and the last column.


With those two simple steps, you were able to preview the data in your file from inside the transformation.

Another thing you may have noticed is the use of shortcuts instead of the menu options, for example, to save the transformation.

Many of the menu options can be accessed more quickly by using shortcuts. The available shortcuts for the menu options are mentioned as part of the name of the operation, for example, Run F9.

For a full shortcut reference please check Appendix D.

Input files

Files are one of the most used input sources. PDI can take data from several types of files, with very few limitations.

When you have a file to work with, the first thing you have to do is to specify where the file is, how it looks, and what kinds of values it contains. That is exactly what you did in the first tutorial of this chapter.

With the information you provide, Kettle can create the dataset to work with in the current transformation.

Input steps

There are several steps that allow you to take a file as the input data. All those steps, such as Text file input, Fixed file input, Excel Input, and so on, are under the Input step category.

Despite the obvious differences that exist between these types of files, the ways to configure the steps have much in common. The following are the main properties you have to specify for an input step:

Name of the step: It is mandatory and must be different for every step in the transformation.

Name and location of the file: These must be specified, of course. At the moment you create the transformation, it's not mandatory that the file exists. However, if it does, you will find it easier to configure this step.

Content type: This data includes the delimiter character, type of encoding, whether a header is present, and so on. The list depends on the kind of file chosen. In every case, Kettle proposes default values, so you don't have to enter too much data.




Fields: Kettle has the facility to get the definitions automatically by clicking the Get Fields button. However, Kettle doesn't always guess the data types, size, or format as expected. So, after getting the fields you may change what you consider more appropriate, as you did in the tutorial.

Filtering: Some steps allow you to filter the data: skip blank rows, read only the first n rows, and so on.

After configuring an input step, you can preview the data just as you did, by clicking the Preview rows button. This is useful to discover if there is something wrong in the configuration. In that case, you can make the adjustments and preview again, until your data looks fine.

Reading several files at once

Until now you used an input step to read one file. But you have several files, all with the very same structure. That will not be a problem because with Kettle it is possible to read more than one file at a time.

Time for action – reading all your files at a time using a single Text file input step

To read all your files follow the next steps:

1. Open the transformation, double-click the input step, and add the other files in the same way you added the first.

2. After clicking the Preview rows button, you will see this:




What just happened?

You read several files at once. By putting the names of all the input files in the grid, you could get the content of every specified file, one after the other.
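If it helps to picture what "one after the other" means, here is a minimal Python sketch of the same idea. It is only an illustration, not how Kettle works internally: it assumes semicolon-separated files with a single header line each, and the function name read_files and the sample contents are made up for the example.

```python
import csv
import os
import tempfile


def read_files(paths, delimiter=";"):
    """Return the header and the data rows of all files, one file after the other."""
    header, rows = None, []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.reader(f, delimiter=delimiter)
            file_header = next(reader)  # every file repeats the same header line
            if header is None:
                header = file_header
            rows.extend(reader)         # append this file's rows to the result
    return header, rows


# Two tiny sample files (hypothetical content) to try the function.
tmp = tempfile.mkdtemp()
samples = {
    "group1.txt": "Date;Home;Away\n02/Jun;Italy;France\n",
    "group2.txt": "Date;Home;Away\n01/Jun;Germany FR;Poland\n",
}
paths = []
for name, text in samples.items():
    path = os.path.join(tmp, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    paths.append(path)

header, rows = read_files(paths)
print(header)
print(rows)
```

The rows keep the order in which the files were listed, which mirrors the order of the lines in the grid.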

Time for action – reading all your files at a time using a single Text file input step and regular expressions

You could do the same thing you did above by using a different notation.

Follow these instructions:

1. Open the transformation and edit the configuration window of the input step.

2. Delete the lines with the names of the files.

3. In the first row of the grid, type C:\pdi_files\input\ under the File/Directory column, and group[1-4]\.txt under the Wildcard (Reg.Exp.) column.

4. Click the Show filename(s)... button. You'll see the list of files that match the expression.

5. Close the tiny window and click Preview rows to confirm that the rows shown belong to the four files that match the expression you typed.


What just happened?

In this particular case, all filenames follow a pattern: group1.txt, group2.txt, and so on. In order to specify the names of the files, you used a regular expression. In the File/Directory column you put the static part of the names, while in the Wildcard (Reg.Exp.) column you put the regular expression with the pattern that a file must follow to be considered: the text group followed by a number between 1 and 4, and then .txt. Then, all files that matched the expression were considered as input files.
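You can try the expression outside Spoon with a few lines of Python. This is a sketch under two assumptions: that the expression has to match the whole filename (which is why fullmatch is used here), and that the directory contains the made-up names listed below. Python's regular expression syntax agrees with Java's for this simple pattern.

```python
import re

# The expression typed in the Wildcard (Reg.Exp.) column.
pattern = re.compile(r"group[1-4]\.txt")

# Hypothetical directory listing, just for the test.
names = ["group1.txt", "group4.txt", "group5.txt", "group1.txt.bak", "groups.txt"]

# Keep only the names that match the expression from start to end.
selected = [name for name in names if pattern.fullmatch(name)]
print(selected)
```

Only group1.txt and group4.txt are selected: group5.txt has a digit outside the 1-4 range, and the other two names don't fit the pattern as a whole.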

Regular expressions

There are many places inside Kettle where you may or have to provide a regular expression. A regular expression is much more than specifying the known wildcards ? and *.

Here you have some examples of regular expressions you may use to specify filenames:

The following regular expression ...    Matches ...                                                                   Examples
.*\.txt                                 Any txt file                                                                  thisisaValidExample.txt
test(19|20)\d\d-(0[1-9]|1[012])\.txt    Any txt file beginning with test followed by a date using the format yyyy-mm  test2009-12.txt, test2009-01.txt
(?i)test.+\.txt                         Any txt file beginning with test, upper or lower case                         TeSTcaseinsensitive.tXt

Please note that the * wildcard doesn't work the same as it does on

the command line. If you want to match any character, the * has to be

preceded by a dot.
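The following Python sketch checks the example expressions and shows why a lone * is not enough. Kettle relies on Java regular expressions; Python's re module is used here only as a convenient stand-in, on the assumption that both engines treat these particular patterns the same way.

```python
import re

# Expression -> filename that should match.
examples = {
    r".*\.txt": "thisisaValidExample.txt",
    r"test(19|20)\d\d-(0[1-9]|1[012])\.txt": "test2009-12.txt",
    r"(?i)test.+\.txt": "TeSTcaseinsensitive.tXt",
}
results = {p: bool(re.fullmatch(p, name)) for p, name in examples.items()}
print(results)

# Month 13 is rejected by the yyyy-mm expression.
bad_month = re.fullmatch(r"test(19|20)\d\d-(0[1-9]|1[012])\.txt", "test2009-13.txt")

# A bare * has nothing to repeat, so the command-line style *.txt is not a valid expression.
try:
    re.compile(r"*\.txt")
    bare_star_ok = True
except re.error:
    bare_star_ok = False
print(bare_star_ok)
```

In a regular expression the dot means "any character" and the * means "repeated zero or more times", so .* is the equivalent of the command-line *.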

Here are some useful links in case you want to know more about regular expressions:

Regular Expression Quick Start: http://www.regular-expressions.info/quickstart.html

The Java Regular Expression Tutorial: http://java.sun.com/docs/books/tutorial/essential/regex/

Java Regular Expression Pattern Syntax: http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html




Troubleshooting reading files

Despite the simplicity of reading files with PDI, obstacles and errors appear. Many times the solution is simple but difficult to find if you are new to PDI. Here you have a list of common problems and possible solutions for you to take into account while reading and previewing a file:

Problem: You get the message Sorry, no rows found to be previewed.
Diagnostic: This happens when the input file doesn't exist or is empty. It also may happen if you specified the input files with regular expressions and there is no file that matches the expression.
Possible solutions: Check the name of the input files. Verify the syntax used, check that you didn't put spaces or any strange character as part of the name. If you used regular expressions, check the syntax. Also verify that you put the filename in the grid. If you just put it in the File or directory textbox, Kettle will not read it.

Problem: When you preview the data you see a grid with blank lines.
Diagnostic: The file contains empty lines, or you forgot to get the fields.
Possible solutions: Check the content of the file. Also check that you got the fields in the Fields tab.

Problem: You see the whole line under the first defined field.
Diagnostic: You didn't set the proper separator and Kettle couldn't split the different fields.
Possible solutions: Check and fix the separator in the Content tab.

Problem: You see strange characters.
Diagnostic: You left the default content but your file has a different format or encoding.
Possible solutions: Check and fix the Format and Encoding in the Content tab. If you are not sure of the format, you can specify mixed.

Problem: You don't see all the lines you have in the file.
Diagnostic: You are previewing just a sample (100 lines by default). Or you put a limit to the number of rows to get. Another problem may be that you set the wrong number of header or footer lines.
Possible solutions: When you preview, you see just a sample. This is not a problem. If you raise the previewed number of rows and still have few lines, check the Header, Footer and Limit options in the Content tab.

Problem: Instead of rows of data, you get a window headed ERROR with an extract of the log.
Diagnostic: Different errors may happen, but the most common has to do with problems in the definition of the fields.
Possible solutions: You could try to understand the log and fix the definition accordingly. For example if you see: Couldn't parse field [Integer] with value [Italy]. The error is that PDI found the text Italy in a field that you defined as Integer. If you made a mistake, you could fix it. On the other hand, if the file has errors, you could read all fields as String and you will not get the error again. In chapter 7 you will learn how to overcome these situations.

Grids

Grids are tables used in many places in Spoon to enter or display information. You already saw grids in several configuration windows: Text file input, Text file output, and Select values.

Many grids contain field information. Examples of these grids are the Fields tab window in the Text file input and Text file output steps, or the main configuration window of the Select values step. In these cases, the grids are usually accompanied by a Get Fields button. The Get Fields button is a facility to avoid typing. When you press that button, Kettle fills the grid with all the available fields.

For example, when reading a file, the Get Fields button fills the grid with the columns of the incoming file. When using a Select values step or a Text file output step, the Get Fields button fills the grid with all the fields entering from a previous step.

Every time you see a Get Fields button, consider it as a shortcut to avoid typing. Kettle will bring the fields available to the grid; you will only have to check the information brought and make minimal changes.

There are many places in Spoon where the grid serves also to edit other kinds of information. One example of that is the grid where you specify the list of files in a Text file input step. No matter what kind of grid you are editing, there is always a contextual menu, which you may access by right-clicking on a row. That menu offers editing options to copy, paste, or move rows of the grid.


When the number of rows in the grid is big, use shortcuts! Most of the editing options of a grid have shortcuts that make the editing work easier and quicker. You'll find a full list of shortcuts for editing grids in Appendix E.

Have a go hero – explore your own files

Try to read your own text files from Kettle. You must have several files with different kinds of data, different separators, and with or without header or footer. You can also search for files over the Internet; there are plenty of files there to download and play with. After configuring the input step, do a preview. If the data is not shown properly, fix the configuration and preview again until you are sure that the data is read as expected. If you have trouble reading the files, please refer to the Troubleshooting reading files section seen earlier for diagnosis and possible ways to solve the problems.

Sending data to files

Now you know how to bring data into Kettle. You didn't bring the data just to preview it; you probably want to do some transformation on the data, to finally send it to a final destination such as another plain file. Let's learn how to do this last task.

Time for action – sending the results of matches to a plain file

In the previous tutorial, you read all your "results of matches" files. Now you want to send the data coming from all files to a single output file.

1. Create a new transformation.

2. Drag a Text file input step to the canvas and configure it just as you did in the previous tutorial.

3. Drag a Select values step to the canvas and create a hop from the Text file input step to the Select values step.

4. Double-click the Select values step.

5. Click the Get fields to select button.


6. Modify the fields as follows:

7. Expand the Output branch of the steps tree.

8. Drag the Text file output icon to the canvas.

9. Create a hop from the Select values step to the Text file output step.

10. Double-click the Text file output step and give it a name.

11. In the file name type: C:/pdi_files/output/wcup_first_round.

Note that the path contains forward slashes. If your system is Windows, you may use back or forward slashes. PDI will recognize both notations.

12. In the Content tab, leave the default values.

13. Select the Fields tab and configure it as follows:


14. Click OK.

15. Give a name and description to the transformation.

16. Save the transformation.

17. Click Run and then Launch.

18. Once the transformation is finished, check the file generated. It should have been created as C:/pdi_files/output/wcup_first_round.txt and should look like this:

Match Date;Home Team;Away Team;Result

02/06;Italy;France;2-1

02/06;Argentina;Hungary;2-1

06/06;Italy;Hungary;3-1

06/06;Argentina;France;2-1

10/06;France;Hungary;3-1

10/06;Italy;Argentina;1-0

01/06;Germany FR;Poland;0-0

02/06;Tunisia;Mexico;3-1

06/06;Germany FR;Mexico;6-0

…

What just happened?

You gathered information from several files and sent all the data to a single file. Before sending the data out, you used a Select values step to select the data you wanted for the file and to rename the fields so that the header of the destination file looks clearer.

Output files

We saw that PDI could take data from several types of files. The same applies to output data. The data you have in a transformation can be sent to different types of files. All you have to do is redirect the flow of data towards an Output step.


Output steps

There are several steps that allow you to send the data to a file. All those steps are under the Output step category: Text file output and Excel Output are examples of them.

For an Output step, just like you do for an Input step, you also have to define:

Name of the step: It is mandatory and must be different for every step in the transformation.

Name and location of the file: These must be specified. If you specify an existing file, the file will be replaced by a new one (unless you check the Append checkbox present in some of the output steps).

Content type: This data includes the delimiter character, type of encoding, whether to put a header, and so on. The list depends on the kind of file chosen. If you check Header, the header will be built with the names of the fields.

If you don't like the names of the fields as header names in your file, you may use a Select values step just to rename those fields.

Fields: Here you specify the list of fields that have to be sent to the file, and provide some format instructions. Just like in the input steps, you may use the Get Fields button to fill the grid. In this case, the grid is going to be filled based on the data that arrives from the previous step. You are not forced to send every piece of data coming to the output step, nor to send the fields in the same order.
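The properties listed above can be pictured with a small Python sketch. It is an analogy only, not Kettle's implementation: the function write_text_file and its parameters are invented for the example, and an in-memory buffer stands in for the output file.

```python
import csv
import io


def write_text_file(rows, fields, out, header=True, separator=";"):
    """Write the chosen fields of every row, in the order given by `fields`."""
    writer = csv.writer(out, delimiter=separator, lineterminator="\n")
    if header:
        writer.writerow(fields)  # the header is built from the field names
    for row in rows:
        writer.writerow([row[name] for name in fields])


# Rows arriving from the previous step (two rows taken from the matches example).
rows = [
    {"Match Date": "02/06", "Home Team": "Italy", "Away Team": "France", "Result": "2-1"},
    {"Match Date": "02/06", "Home Team": "Argentina", "Away Team": "Hungary", "Result": "2-1"},
]

out = io.StringIO()  # stands in for the output file
# Only three of the four incoming fields are sent.
write_text_file(rows, ["Home Team", "Away Team", "Result"], out)
print(out.getvalue())
```

With a real file, opening it in mode "w" would replace an existing file, while mode "a" would play the role of the Append checkbox.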

Some data definitions

From Kettle's point of view, data can be anything ready to be processed by software (for example files or data in databases). Whatever the subject or origin of the data, and whatever its format, Kettle transformations can get the data for further processing and delivering.

Rowset

Transformations deal with datasets, that is, data presented in a tabular form, where:

Each column represents a field. A field has a name and a data type. The data type can be any of the common data types: number (float), string, date, Boolean, integer, or big number.

Each row corresponds to a given member of the dataset. All rows in a dataset have the same structure, that is, all rows have the same fields, in the same order. A field in a row may be null, but it has to be present.
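To make the definition concrete, here is a minimal Python model of a rowset. The class Field and the function is_valid_rowset are names invented for this sketch; the sample values come from the examination file used later in this chapter, except for the null, which is added only to illustrate the rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    type: type


# The structure shared by every row: field names and data types, in a fixed order.
fields = [Field("student_code", str), Field("name", str), Field("writing", int)]

rows = [
    ("80711-85", "William Miller", 81),
    ("20362-34", "Jennifer Martin", None),  # a null value: the field is still present
]


def is_valid_rowset(fields, rows):
    """Check that every row has all the fields, in order, with null or correctly typed values."""
    for row in rows:
        if len(row) != len(fields):
            return False
        for field, value in zip(fields, row):
            if value is not None and not isinstance(value, field.type):
                return False
    return True


print(is_valid_rowset(fields, rows))
```

A row with a missing field, such as ("80711-85",), would not pass the check, while a row holding a null in one of its fields does.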




The dataset is called a rowset. The following is an example of a rowset. It is the rowset generated in the World Cup tutorial:

Streams

Once the data is read, it travels from step to step, through the hops that link those steps. Nothing happens in the hops except data flowing. The real manipulation of data, as well as the modification of a stream by adding or removing columns, occurs in the steps.

Right-click on the Select values step of the transformation you created. In the contextual menu select Show output fields. You'll see this:

This window shows the metadata of the data that leaves this step, that is, the name, type, and other properties of each field leaving this step towards the following step.

In the same way, if you select Show input fields, you will see the metadata of the data that left the previous step.




The Select values step

The Select values step allows you to select, rename, and delete fields, or change the metadata of a field. The step has three tabs:

Select & Alter: This tab is used to select the fields to keep, and also to rename or reorder them. This is how we used it in the last exercise.

Remove: This tab is useful to discard undesirable fields. We used it in the matches exercise to drop the first and last fields. Alternatively, we could use the Select & Alter tab, and specify the fields that we want to keep. Both are equivalent for that purpose.

Meta-data: This tab is used when you want to change the definition of a field, such as telling Kettle to interpret a string field as a date. We will see examples of this later in this book.

You may use only one of the Select values step tabs at a time. Kettle will not restrain you from filling more than one tab, but that could lead to unexpected behavior.
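The first two tabs can be imitated in Python to see why they are equivalent for dropping fields. This is only a sketch: rows are modeled as dictionaries, and the function names and field names (id, notes, and so on) are made up for the illustration.

```python
def select_and_alter(row, selection):
    """Keep only the listed fields, in the listed order, renaming where a new name is given."""
    return {new or old: row[old] for old, new in selection}


def remove(row, unwanted):
    """Discard the listed fields and keep the rest untouched."""
    return {name: value for name, value in row.items() if name not in unwanted}


# One incoming row with hypothetical field names.
row = {"id": 1, "match_date": "02/Jun", "home_team": "Italy",
       "away_team": "France", "result": "2-1", "notes": ""}

# Dropping the first and last fields with the Remove approach...
kept_by_remove = remove(row, {"id", "notes"})

# ...gives the same fields as listing the ones to keep with the Select & Alter approach.
kept_by_select = select_and_alter(
    row, [("match_date", None), ("home_team", None), ("away_team", None), ("result", None)]
)
print(kept_by_remove == kept_by_select)

# Select & Alter can also rename and reorder in the same pass.
renamed = select_and_alter(row, [("home_team", "Home Team"), ("match_date", "Match Date")])
print(renamed)
```

Each function does one job per call, which is in the spirit of the advice above about using a single tab at a time.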

Have a go hero – extending your transformations by writing output files

Supposing you read your own files in the previous section, modify your transformations to write some or all of the data back into files, this time changing the format, headers, number or order of fields, and so on. The objective is to get some experience and see what happens. After some tests, you will feel confident with input and output files, and be ready to move forward.

Getting system information

Until now, you have learned how to read data from known files, and send data back to files. What if you don't know beforehand the name of the file to process? There are several ways to handle this with Kettle. Let's learn the simplest.




Time for action – updating a file with news about examinations

Imagine you are responsible for collecting the results of an annual examination that is being taken in a language school. The examination evaluates writing, reading, speaking, and listening skills. Every professor gives the exam to the students, the students take the examination, the professors grade the examinations on a scale of 0-100 for each skill, and write the results in a text file, like the following:

student_code;name;writing;reading;speaking;listening

80711-85;William Miller;81;83;80;90

20362-34;Jennifer Martin;87;76;70;80

75283-17;Margaret Wilson;99;94;90;80

83714-28;Helen Thomas;89;97;80;80

61666-55;Maria Thomas;88;77;70;80

All the files follow that pattern.

When a professor has the file ready, he/she sends it to you, and you have to integrate the results in a global list. Let's do it with Kettle.

1. Before starting, be sure to have a file ready to read. Type it or download the sample files from Packt's official website.

2. Create the file where the news will be appended. Type this:

---------------------------------------------------------
Annual Language Examinations
Testing writing, reading, speaking and listening skills
---------------------------------------------------------
student_code;name;writing;reading;speaking;listening;file_processed;process_date

Save the file as C:/pdi_files/output/examination.txt.

3. Create a new transformation.

4. Expand the Input branch of the steps tree.

5. Drag the Get System Info and Text file input icons to the canvas.

6. Expand the Output branch of the steps tree, and drag a Text file output step to the canvas.


7. Link the steps as follows:

8. Double-click the first Get System Info step icon and give it a name.

9. Fill the grid as follows:

10. Click OK.

11. Double-click the Text file input step icon and configure it as shown here:


12. Select the Content tab.

13. Check the Include filename in output? checkbox and type file_processed in the Filename fieldname textbox.

14. Check the Add filenames to result checkbox.

15. Select the Fields tab and click the Get Fields button to fill the grid.

16. Click OK.

17. Double-click the second Get System Info step icon and give it a name.

18. Add a field named process_date, and from the list of choices select system date (fixed).

19. Double-click the Text file output step icon and give it a name.

20. Type C:/pdi_files/output/examination as the filename.

21. In the Fields tab, press the Get Fields button to fill the grid.

22. Change the format of the Date row to yy/MM/dd.

23. Give a name and description to the transformation and save it.

24. Press F9 to run the transformation.

25. Fill in the argument grid, writing the full path of the file created.

26. Click Launch.


27. The output file should look like this:

---------------------------------------------------------
Annual Language Examinations
Testing writing, reading, speaking and listening skills
---------------------------------------------------------
student_code;name;writing;reading;speaking;listening;file_processed;process_date
80711-85;William Miller;81;83;80;90;C:\exams\exam1.txt;28-05-2009
20362-34;Jennifer Martin;87;76;70;80;C:\exams\exam1.txt;28-05-2009
75283-17;Margaret Wilson;99;94;90;80;C:\exams\exam1.txt;28-05-2009
83714-28;Helen Thomas;89;97;80;80;C:\exams\exam1.txt;28-05-2009
61666-55;Maria Thomas;88;77;70;80;C:\exams\exam1.txt;28-05-2009

28. Run the transformation again.

29. This time fill the argument grid with the name of a second file.

30. Click Launch.

31. Verify that the data from this second file was appended to the previous data in the output file.

What just happened?

You read a file whose name is known only at runtime, and fed a destination file by appending the contents of the input file.

The first Get System Info step tells Kettle to take the first command-line argument, and assume that it is the name of the file to read.

In the Text file input step, you didn't specify the name of the file, but told Kettle to take the name of the file from the field coming from the previous step, which holds the argument read.

With the second Get System Info step you just took the date from the system, which you used later to enrich the data sent to the destination file.

The destination file is appended with new data every time you run the transformation. Beyond the basic required data (student code and grades), the name of the processed file and the date on which the data is being appended are added as part of the data.
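The whole flow can be summarized in a short Python sketch. It is a rough equivalent written for illustration, not what Kettle executes: the function append_exam_file is an invented name, temporary files replace the real paths, and the date is written with the yy/MM/dd mask set in step 22 (%y/%m/%d in Python).

```python
import csv
import os
import tempfile
from datetime import date


def append_exam_file(input_path, output_path, process_date):
    """Append every data row of input_path to output_path, adding the filename and the date."""
    with open(input_path, newline="", encoding="utf-8") as src, \
            open(output_path, "a", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter=";")
        writer = csv.writer(dst, delimiter=";", lineterminator="\n")
        next(reader)  # skip the header of the input file
        for row in reader:
            writer.writerow(row + [input_path, process_date.strftime("%y/%m/%d")])


# Sample files in a temporary folder (one student only, to keep the example short).
tmp = tempfile.mkdtemp()
exam = os.path.join(tmp, "exam1.txt")
out = os.path.join(tmp, "examination.txt")
with open(exam, "w", encoding="utf-8") as f:
    f.write("student_code;name;writing;reading;speaking;listening\n"
            "80711-85;William Miller;81;83;80;90\n")
with open(out, "w", encoding="utf-8") as f:
    f.write("student_code;name;writing;reading;speaking;listening;file_processed;process_date\n")

# In a real run, the filename would be the first command-line argument (sys.argv[1])
# and the date would be date.today().
append_exam_file(exam, out, date(2009, 5, 28))
append_exam_file(exam, out, date(2009, 5, 28))  # a second run appends again

with open(out, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

Because the output file is opened in append mode, each run adds rows after the existing ones instead of replacing the file.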

When you don't specify the name and location of a file (like in this example), or when the real file is not available at design time, you won't be able to use the Get Fields button, nor preview to see if the step is well configured. The trick is to configure the step by using a real file identical to the expected one. After the step is configured, change the name and location of the file as needed.


Getting information by using Get System Info step

The Get System Info step allows you to get different information from the system. In this exercise, you took the system date and an argument. If you look at the available list, you will see more than just these two options.

Here we used the step in two different ways:

As a resource to take the name of the file from the command line

To add a field to the dataset

The use of this step will be clearer with a picture.

In this example, the Text file input step doesn't know the name or the location of the file. It takes it from the previous step, which is a Get System Info step. As the Get System Info serves as a supplier of information, the hop that leaves the step changes its look and feel to show the situation.




The second time the Get System Info is used, its function is simply to add a field to the incoming dataset.

Data types

Every field must have a data type. The data type can be any of the common data types: number (float), string, date, Boolean, integer, or big number. Strings are simple, just text for which you may specify a length. Date and numeric fields have more variants, and are worth a separate explanation.

Date fields

Date is one of the main data types available in Kettle. In the matches tutorial, you have an example of a date field: the match date field. Its values were 2/Jun, 6/Jun, 10/Jun. Take a look at how you defined that field in the Text file input step. You defined the field as a date field with format dd/MMM. What does it mean? To Kettle it means that it has to interpret the field as a date, where the first two positions represent the day, then there is a slash, and finally there is the month in letters (that's the meaning of the last three positions).

Generally speaking, when a date field is created, like the text input field of the example, you have to define the format of the data so that Kettle can recognize in the field the different components of the date. There are several formats that may be defined for a date, all of them combinations of letters that represent date or time components. Here are the most basic ones:

Letters    Meaning
y          Year
M          Month
d          Day
H          Hour (0-23)
m          Minutes
s          Seconds

Now let's see the other end of the same transformation: the output step. Here you set another format for the same field: dd/MM. According to the table, this means the date has to have two positions for the day, then a slash, and then two positions for the month. Here, the format specification represents the mask you want to apply when the date is shown. Instead of 2/Jun, 6/Jun, 10/Jun, in the output file, you expect to see 02/06, 06/06, 10/06.
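The two roles of a date format, reading and showing, can be tried in Python. Keep in mind that this is a translation, not Kettle's own notation: Python uses %d, %b, and %m where the Java-style masks use dd, MMM, and MM, and the mapping below is an assumption made for the example. Since the values carry no year, an arbitrary placeholder year is added before parsing, and English month names are assumed.

```python
from datetime import datetime

PLACEHOLDER_YEAR = "1970"  # arbitrary: the source values have no year


def reformat_match_date(text):
    """Read a value such as 2/Jun (mask dd/MMM) and show it with the mask dd/MM."""
    parsed = datetime.strptime(f"{text}/{PLACEHOLDER_YEAR}", "%d/%b/%Y")  # input format
    return parsed.strftime("%d/%m")                                       # output mask


shown = [reformat_match_date(value) for value in ["2/Jun", "6/Jun", "10/Jun"]]
print(shown)
```

The input format tells the program how to recognize the parts of the date, while the output mask only decides how the same date is displayed.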

In the examination tutorial, you also have a Date field: the process date. When you created it, you didn't specify a format because you took the system date which, by definition, is a date and Kettle knows it. But when writing this date to the output file, again you defined a format, in this case it was yyyy/MM/dd.


In general, when you are writing a date, the format attribute is used to format the data before sending it to the destination. In case you don't specify a format, Kettle sets a default format.

As said earlier, there are more combinations to define the format of a date field. For a complete reference, check the Sun Java API documentation located at http://java.sun.com/javase/6/docs/api/java/text/SimpleDateFormat.html.

Numeric fields

Numeric fields are present in almost all Kettle transformations. In the Examination example, you encountered numeric fields for the first time. The input file had four numeric fields. As the numbers were all integers, you didn't set a specific format. When you have more elaborate fields such as numbers with separators, dollar signs, and so on, you should specify a format to tell Kettle how to interpret the number. If you don't, Kettle will do its best to interpret the number, but this could lead to unexpected results.

At the other end of the flow, when writing to the output text file, you may specify the format in which you want the number to be shown.

There are several formats you may apply to a numeric field. The format is basically a combination of predefined symbols, each with a special meaning. The following are the most used symbols:

Symbol    Meaning
#         Digit. Leading zeros are not shown
0         Digit. If the digit is not present, zero is displayed in its place
.         Decimal separator
-         Minus sign
%         Field has to be multiplied by 100 and shown as a percentage

These symbols are not used alone. In order to specify the format of your numbers, you have to combine them. Suppose that you have a numeric field whose value is 99.55; the following table shows you the same value after applying different formats to it:

Format     Result
#          100
0          100
#.#        99.6
#.##       99.55
#.000      99.550
000.000    099.550
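If you want to see how the symbols combine, the following Python function imitates a small subset of these masks: only #, 0, and the decimal separator. It is a rough imitation written for this example, not the Java class that Kettle relies on; the name apply_format is made up, the value is passed as text so that it is handled as an exact decimal, and ties are rounded half-even, which is the documented default of Java's DecimalFormat.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def apply_format(value, mask):
    """Format a decimal value (given as text) with a mask made of #, 0 and one optional dot."""
    int_mask, _, frac_mask = mask.partition(".")
    max_decimals = len(frac_mask)        # every # or 0 after the dot allows one decimal
    min_decimals = frac_mask.count("0")  # each 0 after the dot forces a decimal to be shown
    min_int_digits = int_mask.count("0") # each 0 before the dot forces an integer digit

    quantum = Decimal(1).scaleb(-max_decimals)  # 1, 0.1, 0.01, ...
    rounded = Decimal(value).quantize(quantum, rounding=ROUND_HALF_EVEN)

    int_part, _, frac_part = format(rounded, "f").partition(".")
    frac_part = frac_part.rstrip("0").ljust(min_decimals, "0")
    int_part = int_part.zfill(min_int_digits)
    return int_part + ("." + frac_part if frac_part else "")


masks = ["#", "0", "#.#", "#.##", "#.000", "000.000"]
formatted = [apply_format("99.55", mask) for mask in masks]
print(formatted)
```

For the value 99.55 the function reproduces the results of the table, which makes it a handy playground before trying other masks in Spoon.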


If you don't specify a format for your numbers, you may still provide a Length and Precision. Length is the total number of significant figures, while precision is the number of floating-point digits.

If you specify neither format nor length or precision, Kettle behaves as follows. While reading, it does its best to interpret the incoming number, and when writing, it sends the data as it comes without applying any format.

For a complete reference on number formats, you can check the Sun Java API documentation available at http://java.sun.com/javase/6/docs/api/java/text/DecimalFormat.html.

Running transformations from a terminal window

In the examination exercise, you specified that the name of the input file will be taken from the first command-line argument. That means when executing the transformation, the filename has to be supplied as an argument. Until now, you only ran transformations from inside Spoon. In the last exercise, you provided the argument by typing it in a dialog window. Now it is time to learn how to run transformations with or without arguments from a terminal window.

Time for action – running the examination transformation from

a terminal window

Before executing the transformation from a terminal window, make sure that you have a new examination file to process, let's say exam3.txt. Then follow these instructions:

1. Open a terminal window and go to the directory where Kettle is installed.

On Windows systems type:

C:\pdi-ce>pan.bat /file:c:\pdi_labs\examinations.ktr c:\pdi_files\input\exam3.txt

On Unix, Linux, and other Unix-based systems type:

/home/yourself/pdi-ce/pan.sh /file:/home/yourself/pdi_labs/examinations.ktr c:/pdi_files/input/exam3.txt

If your transformation is in another folder, modify the command accordingly.




2. You will see how the transformation runs, showing you the log in the terminal.

3. Check the output file. The contents of exam3.txt should be at the end of the file.

What just happened?

You executed a transformation with Pan, the program that runs transformations from terminal windows. As part of the command, you specified the name of the transformation file and provided the name of the file to process, which was the only argument expected by the transformation. As a result, you got the same as if you had run the transformation from Spoon: a small file appended to the global file.
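If you ever need to launch Pan from a script, the command can be assembled as a list of arguments. The sketch below only builds and prints the command; the helper pan_command is an invented name, the paths follow the Linux example above, and the location of exam3.txt under /home/yourself is an assumption made for the illustration.

```python
def pan_command(pan_path, transformation, *arguments):
    """Build the Pan call: the /file: option followed by the command-line arguments."""
    return [pan_path, f"/file:{transformation}", *arguments]


cmd = pan_command(
    "/home/yourself/pdi-ce/pan.sh",
    "/home/yourself/pdi_labs/examinations.ktr",
    "/home/yourself/pdi_files/input/exam3.txt",  # hypothetical location of the exam file
)
print(" ".join(cmd))

# To really run it (this requires a PDI installation at the paths above):
# import subprocess
# subprocess.run(cmd, check=True)
```

The first argument after the /file: option is the one the first Get System Info step receives as the name of the file to read.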

When you are designing transformations, you run them with Spoon; you don't use Pan. Pan is mainly used as part of batch processes, for example processes that run every night in a scheduled fashion.

Appendix B tells you all the details about using Pan.

Have a go hero – using different date formats

Change the main transformation of the last tutorial so that the process_date is saved with a full format, that is, including day of week (Monday, Tuesday, and so on), month in letters (January, February, and so on), and time.


Have a go hero – formatting 99.55

Create a transformation to see for yourself the different formats for the number 99.55. Test the formats shown in the Numeric fields section and try some other options as well.

To test this, you will need a dataset with a single row and a single field: the number. You can generate it with a Generate rows step.

Pop quiz – formatting data

Suppose that you read a file where the first column is a numeric identifier: 1, 2, 3, and so on. You read the field as a Number. Now you want to send the data back to a file. Despite being a number, this field is regular text to you because it is a code. How do you define the field in the Text file output step? (You may choose more than one option.)

a. As a Number. In the format, you put #.

b. As a String. In the format, you put #.

c. As a String. You leave the format blank.

XML les

Even if you're not a system developer, you must have heard about XML les. XML les

or documents are not only used to store data, but also to exchange data between

heterogeneous systems over the Internet. PDI has many features that enable you to

manipulate XML les. In this secon you will learn to get data from those les.

Time for action – getting data from an XML le with information

about countries

In this tutorial you will build an Excel le with basic informaon about countries. The source

will be an XML le that you can download from the Packt website.

1. If you work under Windows, open the kettle.properties file located in the C:/Documents and Settings/yourself/.kettle folder and add the following line:

LABSOUTPUT=c:/pdi_files/output

On the other hand, if you work under Linux (or similar), open the kettle.properties file located in the /home/yourself/.kettle folder and add the following line:

LABSOUTPUT=/home/yourself/pdi_files/output

2. Make sure that the directory specified in kettle.properties exists.

3. Save the file.

4. Restart Spoon.

5. Create a new transformation.

6. Give a name to the transformation and save it in the same directory where you have all the other transformations.

7. From the Packt website, download the resources folder containing a file named countries.xml. Save the folder in your working directory. For example, if your transformations are in pdi_labs, the file will be in pdi_labs/resources/.

The last two steps are important. Don't skip them! If you do,

some of the following steps will fail.

8. Take a look at the le. You can edit it with any text editor, or you can double-click it to

see it within an explorer. In any case, you will see informaon about countries. This is

just the extract for a single country:

<?xml version="1.0" encoding="UTF-8"?>

<world>

...

<name>Argentina</name>

<capital>Buenos Aires</capital>

<name>Spanish</name>

</language>

<name>Italian</name>

</language>

Geng Started with Transformaons

[ 64 ]

<name>Indian Languages</name>

</language>

</country>

...

</world>

9. From the Input steps, drag a Get data from XML step to the canvas.

10. Open the configuration window for this step by double-clicking it.

11. In the File or directory textbox, press Ctrl+Space. A drop-down list appears as shown in the next screenshot:

12. Select Internal.Transformation.Filename.Directory. The textbox gets filled with this text.

13. Complete the text so that you can read ${Internal.Transformation.Filename.Directory}/resources/countries.xml.

14. Click on the Add button. The full path is moved to the grid.

15. Select the Content tab and click Get XPath nodes.

16. In the list that appears, select /world/country/language.

17. Select the Fields tab and fill the grid as follows:

18. Click Preview rows, and you should see something like this:

19. Click OK.

20. From the Output steps, drag an Excel Output step to the canvas.

21. Create a hop from the Get data from XML step to the Excel Output step.

22. Open the configuration window for this step by double-clicking it.

23. In the Filename textbox press Ctrl+Space.

24. From the drop-down list, select ${LABSOUTPUT}.

25. By the side of that text type /countries_info. The complete text should be ${LABSOUTPUT}/countries_info.

26. Select the Fields tab and click the Get Fields button to fill the grid.

27. Click OK. This is your final transformation.

28. Save the transformation.

29. Run the transformation.

30. Check that the countries_info.xls file has been created in the output directory and contains the information you previewed in the input step.

What just happened?

You got informaon about countries from an XML le and saved it in a more readable

format—an Excel spreadsheet—for the common people.

To get the informaon, you used a Get data from XML step. As the source le was

taken from a folder relave to the folder where you stored the transformaon, you set

the directory to ${Internal.Transformation.Filename.Directory}. When

the transformaon ran, Kele replaced ${Internal.Transformation.Filename.

Directory} with the real path of the transformaon: c:/pdi_labs/.

In the same way, you didn't put a xed value for the path of the nal Excel le. As directory,

you used ${LABSOUTPUT}. When the transformaon ran, Kele replaced ${LABSOUTPUT}

with the value you wrote in the kettle.properties le. The output le was then saved in

that folder: c:/pdi_files/output.

Chapter 2

[ 67 ]

What is XML

XML stands for EXtensible Markup Language. It is basically a language designed to describe data. XML files or documents contain information wrapped in tags. Look at this piece of XML taken from the countries file:

<?xml version="1.0" encoding="UTF-8"?>
<world>
...
<country>
<name>Argentina</name>
<capital>Buenos Aires</capital>
<language isofficial="T">
<name>Spanish</name>
</language>
<language isofficial="F">
<name>Italian</name>
</language>
<language isofficial="F">
<name>Indian Languages</name>
</language>
</country>
...
</world>

The rst line in the document is the XML declaraon. It denes the XML version of the

document, and should always be present.

Below the declaraon is the body of the document. The body is a set of nested elements.

An element is a logical piece enclosed by a start-tag and a matching end-tag—for example,

<country> </country>.

Within the start-tag of an element, you may have aributes. An aribute is a markup

construct consisng of a name/value pair—for example, isofficial="F".

These are the most basic terminology related to XML les. If you want to know more about

XML, you can visit http://www.w3schools.com/xml/.

Geng Started with Transformaons

[ 68 ]

PDI transformation les

Despite the .ktr extension, PDI transformaons are just XML les. As such, you are able to

explore them inside and recognize dierent XML elements. Look the following sample text:

<?xml version="1.0" encoding="UTF-8"?>

<info>

<name>hello_world</name>

<description>My first transformation</description>

<extended_description>

This transformation generates 10 rows

with the message Hello World.

</extended_description>

...

</transformation>

This is an extract from the hello_world.ktr le. Here you can see the root element

named transformation, and some inner elements such as info and name.

Note that if you copy a step by selecng it in the Spoon canvas and pressing Ctrl+C , and then

pass it to a text editor, you can see its XML denion. If you copy it back to the canvas, a new

idencal step will be added to your transformaon.

Getting data from XML les

In order to get data from an XML le, you have to use the Get Data From XML input step.

To tell PDI which informaon to get from the le, it is required that you use a parcular

notaon named XPath.

XPath

XPath is a set of rules used for getting information from an XML document. In XPath, XML documents are treated as trees of nodes. There are several types of nodes; elements, attributes, and texts are some of them. As an example, world, country, and isofficial are some of the nodes in the sample file.

Among the nodes there are relationships. A node has a parent, zero or more children, siblings, ancestors, and descendants depending on where the other nodes are in the hierarchy.

In the sample countries file, country is the parent of the elements name, capital, and language. These three elements are children of country.

To select a node in an XML document, you have to use a path expression relative to a current node.

The following table has some examples of path expressions that you may use to specify fields. The examples assume that the current node is language.

Path expression | Description | Sample expression
node_name | Selects all child nodes of the node named node_name. | percentage: This expression selects all child nodes of the node percentage. It looks for the node percentage inside the current node language.
. | Selects the current node. | language
.. | Selects the parent of the current node. | ../capital: This expression selects all child nodes of the node capital. It doesn't look in the current node (language), but inside its parent, which is country.
@ | Selects an attribute. | @isofficial: This expression gets the attribute isofficial in the current node language.

Note that the expressions name and ../name are not the same. The first selects the name of the language, while the second selects the name of the country.

For more information on XPath, follow this link: http://www.w3schools.com/XPath/.

Conguring the Get data from XML step

In order to specify the name and locaon of an XML le, you have to ll the File tab just as

you do in any le input step. What is dierent here is how you get the data.

The rst thing you have to do is select the path that will idenfy the current node. You do

it by lling the Loop XPath textbox in the Content tab. You can type it by hand, or you can

select it from the list of available paths by Clicking the Get XPath nodes buon.

Once you have selected a path, PDI will generate one row of data for every found path.

In the tutorial you selected /world/country/language. Then PDI generates one row for

each /world/country/language element in the le.

Aer selecng the loop XPath, you have to specify the elds to get. In order to do that,

you have to ll the grid in the Fields tab by using XPath notaon as explained in the

preceding secon.

Geng Started with Transformaons

[ 70 ]

Note that if you click the Get elds buon, PDI will ll the grid with the child nodes of the

current node. If you want to get some other node, you have to type its XPath by hand.

Also note the notaon for the aributes. To get an aribute, you can use the @ notaon as

explained, or you can simply type the name of the aribute without @ and select Aribute

under the Element column, as you did in the tutorial.
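The idea of a loop XPath that yields one row per matching node, with each field resolved relative to that node, can be sketched in a few lines of Python. This is only an analogy to what the step does, not its implementation. The sample document and the second country are made up for illustration, and rows_for_languages is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the shape of countries.xml; the values are illustrative only.
SAMPLE = """<world>
  <country>
    <name>Argentina</name>
    <capital>Buenos Aires</capital>
    <language isofficial="T"><name>Spanish</name></language>
    <language isofficial="F"><name>Italian</name></language>
  </country>
  <country>
    <name>Uruguay</name>
    <capital>Montevideo</capital>
    <language isofficial="T"><name>Spanish</name></language>
  </country>
</world>"""


def rows_for_languages(xml_text):
    """Emit one row (a dict) per /world/country/language element."""
    root = ET.fromstring(xml_text)  # root is <world>
    rows = []
    for country in root.findall("country"):
        for language in country.findall("language"):  # the loop node
            rows.append({
                "country": country.findtext("name"),       # ../name
                "capital": country.findtext("capital"),    # ../capital
                "language": language.findtext("name"),     # name
                "isofficial": language.get("isofficial"),  # @isofficial
            })
    return rows


rows = rows_for_languages(SAMPLE)
for row in rows:
    print(row)
```

Three language elements in the input produce three rows, and the country fields repeat on every row that belongs to the same country.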

Kettle variables

In the last tutorial, you used the string ${Internal.Transformation.Filename.Directory} to identify the folder where the current transformation was saved. You also used the string ${LABSOUTPUT} to define the destination folder of the output file.

Both strings, ${Internal.Transformation.Filename.Directory} and ${LABSOUTPUT}, are Kettle variables, that is, keywords linked to a value. You use the name of a variable, and when the transformation runs, the name of the variable is replaced by its value.

The first of these two variables is an environment variable, and it is not the only one available. Other known environment variables are ${user.home}, ${java.io.tmpdir}, and ${java.home}. All these variables are ready to use any time you need.

The second variable is a variable you defined in the kettle.properties file. In this file you may define as many variables as you want. The only thing you have to keep in mind is that those variables will be available inside Spoon only after you restart it.

These two kinds of variables, environment variables and variables defined in the kettle.properties file, are the most primitive kinds of variables found in PDI. All of these variables are string variables and their scope is the Java virtual machine.

How and when you can use variables

Any me you see a red dollar sign by the side of a textbox, you may use a variable. Inside the

textbox you can mix variable names with stac text, as you did in the tutorial when you put

the name of the desnaon as ${LABSOUTPUT}/countries_info.

To see all the available variables, you have to posion the cursor in the textbox, press

Ctrl+Space, and a full list is displayed for you to select the variable of your choice. If you put

the mouse cursor over any of the variables for a second, the actual value of the variable will

be shown.

If you know the name of the variable, you don't need to select it from the list. You may type

its name, by using either of these notaons—${<name>} or %%<name>%%.
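To make the replacement of variable names by values concrete, here is a small Python sketch of the idea. It is not Kettle's code; it only mimics the two notations with a regular expression. The choice to leave unknown names untouched is an assumption of this sketch, and substitute is a hypothetical function name.

```python
import re

# Matches ${NAME} or %%NAME%%; exactly one of the two groups captures the name.
VARIABLE = re.compile(r"\$\{([^}]+)\}|%%([^%]+)%%")


def substitute(text, variables):
    """Replace every known variable reference in text with its value."""
    def replace(match):
        name = match.group(1) or match.group(2)
        # Assumption of this sketch: unknown names are left untouched.
        return variables.get(name, match.group(0))
    return VARIABLE.sub(replace, text)


variables = {"LABSOUTPUT": "c:/pdi_files/output"}  # as defined in kettle.properties
print(substitute("${LABSOUTPUT}/countries_info", variables))
print(substitute("%%LABSOUTPUT%%/countries_info", variables))
```

Both calls print the same path, because the two notations refer to the same variable.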


Have a go hero – exploring XML files

Now you can explore by yourself. On the Packt website there are some sample XML files. Download them and try this:

• Read the customer.xml file and create a list of customers.

• Read the tomcat-users.xml file and get the users and their passwords.

• Read the areachart.xml and get the color palette, that is, the list of colors used.

The customer file is included in the Pentaho Report Designer software package. The others come with the Pentaho BI package. This software has many XML files for you to use. If you are interested you can download the software from http://sourceforge.net/projects/pentaho/files/.

Have a go hero – enhancing the output countries file

Modify the transformation in the tutorial so that the Excel output uses a template. The template will be an Excel file with the header and format already applied, and will be located in a folder inside the pdi_labs folder.

Templates are configured in the Content tab of the Excel configuration window. In order to set the name for the template, use internal variables.

Have a go hero – documenting your work

As explained, transformations are nothing more than XML files. Now you'll create a new transformation that will take as input the transformations you've created so far, and will create a simple Excel spreadsheet with the name and description of all your transformations. If you keep this sheet updated by running the transformation on a regular basis, it will be easier to find a particular transformation you created in the past.

To get data from the transformation files, use the Get data from XML step. As wildcard, use .*\.ktr. Doing so, you'll get all the files. On the other hand, as Loop XPath, use /transformation/info.

Summary

In this chapter you learned how to get data from files and put data back into files. Specifically, you learned how to:

• Get data from plain files and also from XML files

• Put data into text files and Excel files

• Get information from the operating system such as command-line arguments and system date

We also discussed the following:

• The main PDI terminology related to data, for example datasets, data types, and streams

• The Select values step, a commonly used step for selecting, reordering, removing and changing data

• How and when to use Kettle variables

• How to run transformations from a terminal with the Pan command

Now that you know how to get data into a transformation, you are ready to start manipulating data. This is going to happen in the next chapter.



Basic Data Manipulation

In the previous chapter, you learned how to get data into PDI. Now you're ready to begin transforming that data. This chapter explains the simplest and most used ways of transforming data. We will cover the following:

• Executing basic operations

• Filtering and sorting of data

• Looking up data outside the main stream of data

By the end of this chapter, you will be able to do simple but meaningful transformations on different types of data.

Basic calculations

You already know how to create a transformation and read data from an external source. Now, taking that data as a starting point, you will begin to do basic calculations.

Time for action – reviewing examinations by using the Calculator step

Can you recollect the exercise about examinations you did in the previous chapter? You created an incremental file with examination results. The final file looked like the following:

---------------------------------------------------------
Annual Language Examinations
Testing writing, reading, speaking and listening skills
---------------------------------------------------------
student_code;name;writing;reading;speaking;listening;file_processed;process_date
80711-85;William Miller; 81;83;80;90;C:\pdi_files\input\first_turn.txt;28-05-2009
20362-34;Jennifer Martin; 87;76;70;80;C:\pdi_files\input\first_turn.txt;28-05-2009
75283-17;Margaret Wilson; 99;94;90;80;C:\pdi_files\input\first_turn.txt;28-05-2009
83714-28;Helen Thomas; 89;97;80;80;C:\pdi_files\input\first_turn.txt;28-05-2009
61666-55;Maria Thomas; 88;77;70;80;C:\pdi_files\input\first_turn.txt;28-05-2009
...

Now you want to convert all grades in the scale 0-100 to a new scale from 0 to 5. Also, you want to take the average grade to see how the students did.

1. Create a new transformaon, give it a name and descripon, and save it.

2. By using a Text le input step, read the examination.txt le. Give the name and

locaon of the le, check the Content tab to see that everything matches your le, and

ll the Fields tab as here:

Chapter 3

[ 75 ]

3. Do a preview just to conrm that the step is well congured.

Noce that you have several lines as header. Because the

names of the elds are not in the rst row, you won't be able

to use the Get Fields buon successfully. You will have to write

the elds manually, or you can avoid it by doing the following:

Congure the step with a copy of the le that doesn't have the

extra heading, just the heading row with the names of the elds.

Then, restore the name of your le in the File tab, adjust the

number of headings in the Content tab, and your step is ready.

4. Use the Select values step to remove the elds you will not use—file_processed

and process_date.

Basic Data Manipulaon

[ 76 ]

5. Drag another Select values step to the canvas. Select the Meta-data tab and change the

meta-data of the numeric elds like here:

6. Near the upper-le corner of the screen, above the step tree, there is a textbox for

searching. Type calc in the textbox. While you type, a lter is applied to show you only

the steps that contain, in their name or descripon, the text you typed. You should be

seeing this:

7. Among the steps you see, select the Calculator step and drag it to the canvas.

Chapter 3

[ 77 ]

8. To remove the lter, clear the typed text.

9. Create a hop from the Text le input step to the Calculator step.

10. Edit the Calculator step and ll the grid as follows:

11. To ll the Calculaon column, simply select the operaon from the list provided. Be sure

to ll every column in the grid like shown in the screenshot.

You don't have to feel like you are doing data entry instead

of learning PDI. You can avoid typing by copying and pasng

similar rows, and then xing the values properly. Appendix D

has a list of shortcuts you can use when eding grids like these.

12. Leave the Calculator step selected and click the Preview this transformaon buon

followed by the Quick Launch buon. You should see something similar to the

following screenshot:

Basic Data Manipulaon

[ 78 ]

The numbers may vary according to the contents of your le.

13. Edit the calculator again and change the content of the Remove column like here:

14. From the Transform category of steps, add a Sort rows step and create a hop from the

Calculator step to this new step.

15. Edit the Sort rows step by double-clicking it, click the Get Fields buon, and adjust the

grid as follows:

16. Click OK.

Chapter 3

[ 79 ]

17. Drag a third Select values step, create a hop from the Sort rows step to this new step,

and use it to keep only the elds by which you ordered the data:

18. From the Flow category of steps, add a Dummy step and create a hop from the last

Select values step to this.

19. Select the Dummy step and do a preview.

20. The nal preview looks like the following screenshot:

Basic Data Manipulaon

[ 80 ]

If you get an error or a dierent result, review the explanaon and make

sure that you followed the instrucons correctly. Do a preview on each

step to discover in which one you have the problem. If you realize that

the problem is in any of the steps that read the input les, please refer

to the Troubleshoong reading les secon in Chapter 2.

What just happened?

You read the examination.txt file, and did some calculations to see how the students did. You did the calculations by using the Calculator step.

First of all, you removed the fields you didn't need from the stream of data.

After that, you did the following calculations:

By dividing by 20, you converted all grades from the scale 0-100 to the scale 0-5.

Then, you calculated the average of the grades for the four skills: writing, reading, listening, and speaking. You created two auxiliary fields, aux1 and aux2, to calculate partial sums. After that, you created the field total with the sum of aux1 and aux2, another auxiliary field with the number 4, and finally the avg as the division of the total by the field four.

In order to obtain the new grades, as well as the average with two decimal positions, you need the result of the operation to be of a numeric type with precision 2. Therefore, you had to change the metadata, by adding a Select values step before the Calculator. With the Select values you changed the type of the numeric fields from integer to number, that is, floating-point numbers. If you didn't, the quotients would have been rounded to integer numbers. You can try and see for yourself!
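The effect of the data type on the division can be illustrated outside PDI. The following Python sketch contrasts integer division with floating-point division for one student from the sample file. It only illustrates why the type matters: Python's // operator drops the decimals, which may not match the exact rounding Kettle applies to integers, and rescale is a hypothetical helper.

```python
def rescale(grade, as_number=True):
    """Convert a 0-100 grade to the 0-5 scale by dividing by 20."""
    # as_number=False imitates keeping the field as an integer.
    return grade / 20 if as_number else grade // 20


# Grades of the first student in the sample file.
grades = {"writing": 81, "reading": 83, "speaking": 80, "listening": 90}

as_integer = {skill: rescale(value, as_number=False) for skill, value in grades.items()}
as_number = {skill: rescale(value) for skill, value in grades.items()}

print(as_integer)  # every decimal part is lost
print(as_number)   # decimals are kept
```

With integer fields the four grades become indistinguishable, which is why the metadata change has to happen before the Calculator.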

The rst me you edited the calculator, you set the eld Remove to N for every row in the

calculator grid. By doing this, you could preview every eld created in the calculator, even

the auxiliary ones such as the elds twenty, aux1, and aux2. You then changed the eld to

Y so that the auxiliary elds didn't pass to the next step.

Aer doing the calculaons, you sorted the data by using a Sort rows step. You specied the

order by avg descending, then by student_code ascending.

Chapter 3

[ 81 ]

Sorng data

For small datasets, the sorng algorithm runs mainly using the JVM memory.

When the number of rows exceeds 5,000, it works dierently. Every ve

thousand rows, the process sorts them and writes them to a temporary le.

When there are no more rows, it does a merge sort on all those les and gives

you back the sorted dataset. You can conclude that for huge datasets a lot

of reading and wring operaons are done on your disk, which slows down

the whole transformaon. Fortunately, you can change the number of rows

in memory (5,000 by default) by seng a new value in the Sort size (rows in

memory) textbox. The bigger this number, the faster the sorng process.

Note that a sort size that works in your system may not work in a machine with

a dierent conguraon. To avoid that risk, you can use a dierent approach.

In the Sort rows conguraon window, you can set a Free memory threshold

(in %) value. The process begins to use temporary les when the percentage

of available memory drops below the indicated threshold. The lower the

percentage, the faster the process.

As it's not possible to know the exact amount of free memory, it's not

recommended to set a very small free memory threshold. You denitely

shouldn't use that opon in complex transformaons or when there is more

than one sort going on, as you could sll run out of memory.
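The chunk-and-merge strategy described above is a classic external merge sort. The Python sketch below shows the general technique under simple assumptions: rows are plain strings without line breaks, and sort_size plays the role of the Sort size (rows in memory) setting. It is not the code of the Sort rows step, and external_sort is a hypothetical name.

```python
import heapq
import os
import tempfile


def external_sort(rows, sort_size=5000):
    """Sort an iterable of strings, buffering at most sort_size rows at a time."""
    chunk_paths = []
    chunk = []

    def flush():
        # Sort the rows buffered so far and write them to a temporary file.
        chunk.sort()
        fd, path = tempfile.mkstemp(suffix=".tmp", text=True)
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.writelines(row + "\n" for row in chunk)
        chunk_paths.append(path)
        chunk.clear()

    for row in rows:
        chunk.append(row)
        if len(chunk) >= sort_size:
            flush()

    if not chunk_paths:
        # Small dataset: everything fit in memory, no temporary files needed.
        return sorted(chunk)

    if chunk:
        flush()

    # Merge the sorted temporary files, reading them line by line.
    handles = [open(path, encoding="utf-8") for path in chunk_paths]
    try:
        merged = heapq.merge(*handles, key=lambda line: line.rstrip("\n"))
        return [line.rstrip("\n") for line in merged]
    finally:
        for handle in handles:
            handle.close()
        for path in chunk_paths:
            os.remove(path)


print(external_sort(["d", "a", "c", "b", "e"], sort_size=2))
```

For brevity the function returns a list; a version meant for really large inputs would yield the merged rows one at a time instead of collecting them. A larger sort_size means fewer temporary files and less disk traffic, which is the trade-off the text describes.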

The two nal steps were added to keep only the elds of interest, and to preview the result

of the transformaon. You can change the Dummy step for any of the output steps you

already know.

You've used the Dummy step several mes but sll nothing has been said

about it. Mainly it was because it does nothing! However, you can use it as a

placeholder for tesng purposes as in the last exercise.

Note that in this tutorial you used the Select values step in three dierent ways:

To remove elds by using the Remove tab.

To change the meta-data of some elds by using the Meta-data tab.

To select and rename elds by using the Select tab.

Remember that the Select values step's tabs are exclusive! You can't use more

than one in the same step!



Basic Data Manipulaon

[ 82 ]

Besides calculaon, in this tutorial you did something you hadn't before—searching the

step tree.

When you don't remember where a step is in the steps tree, or when you just

want to nd if there is a step that does some kind of operaon, you could simply

type the search criterion in the textbox above the steps tree. PDI does a search

and lters all the steps that have that text as part of their name or descripon.

Adding or modifying elds by using different PDI steps

In this tutorial you used the Calculator step to create new elds and add them to

your dataset. The Calculator is one the many steps that PDI has to create new elds by

combining existent ones. Usually you will nd these steps under the Transform category

of the steps tree. The following table describes some of them (the examples refer to the

examinaon le):

Step Descripon Example

Split Fields Split a single eld into two

or more. You have to give

the character that acts as

separator.

Split the name into two elds: Name and

Last Name. The separator would be a space

character.

Add constants Add one or more constants

to the input rows

Add two constants: four and twenty. Then

you could use them in the Calculator step

without dening the auxiliary elds.

Replace in string Replace all occurrences of

a text in a string eld with

another text

Replace the – in the student code by a /.

For example: 108418-95 would become

108418/95.

Number range Create a new eld based on

ranges of values. Applies to

a numeric eld.

Create a new eld called exam_range with

two ranges: Range A with the students with

average grade below 3.5, and Range B with

students with average grade greater or equal

to 3.5.

Value Mapper Creates a correspondence

between the values of

a eld and a new set of

values.

Suppose you calculated the average grade as

an integer number ranging from 0 to 5. You can

map the average to A, B, C, D, like this:

Old value: 5; New value: A

Old value: 3, 4; New value: B

Old value: 1, 2; New value: C

Old value: 0; New value: D

Chapter 3

[ 83 ]

Step Descripon Example

User Dened Java

Expression

Creates a new eld by

using a Java expression that

involves one or more elds.

This step may eventually

replace any of the above

but it's only recommended

for those familiar with Java.

Create a ag (a Boolean eld) that tells if a

student passed. A student passes if his/her

average grade is above 4.5.

The expression to use could be:

(((writing+reading+speaking+

listening)/4)>4.5)?true:false

Any of these steps, when added to your transformation, is executed for every row in the stream. It takes the row, identifies the fields needed to do its task, calculates the new field(s), and adds them to the dataset.

For details on a particular step, don't hesitate to visit the Wiki page for steps: http://wiki.pentaho.com/display/EAI/Pentaho+Data+Integration+v3.2.+Steps
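The examples in the table are row-by-row operations, so each one corresponds to a small function applied to a single row. The Python sketch below restates four of them to show what each step computes. The function names are hypothetical, and the code imitates the examples in the table rather than the steps' real options.

```python
def split_name(full_name):
    """Split Fields: break the name at the first space into Name and Last Name."""
    first, _, last = full_name.partition(" ")
    return first, last


def replace_in_code(student_code):
    """Replace in string: swap every - for a / in the student code."""
    return student_code.replace("-", "/")


def exam_range(avg):
    """Number range: Range A below 3.5, Range B for 3.5 and above."""
    return "Range A" if avg < 3.5 else "Range B"


# Value Mapper: integer average (0 to 5) mapped to a letter.
GRADE_LETTERS = {5: "A", 4: "B", 3: "B", 2: "C", 1: "C", 0: "D"}

row = {"student_code": "108418-95", "name": "William Miller", "avg": 4.175, "int_avg": 4}

print(split_name(row["name"]))
print(replace_in_code(row["student_code"]))
print(exam_range(row["avg"]))
print(GRADE_LETTERS[row["int_avg"]])
```

Each function reads only the fields it needs from the current row and produces new values, which mirrors how these steps add fields to the dataset.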

The Calculator step

The Calculator step you used in the tutorial allows you to do simple calculations not only on numeric fields, but also on dates and text. The Calculator step is not the only means to do calculations, but it is the simplest and quickest.

The step has a grid where you can add all the fields you want. Every row represents an operation that involves from one up to three operands (depending on the selected operation). When you select an operation, the description of the operation itself tells you which arguments it needs. For example:

• If you select Set constant field to value A, you have to provide a constant value under the column named A.

• If you select A/B, the operation needs two arguments, and you have to provide them by indicating the fields to use in the columns named A and B respectively.

The result of every operation becomes a new field in your dataset, unless you set the Remove column to Y. The name of the new field is the one you type under the New field column.

For each and every row of the dataset, the operations defined in the Calculator are calculated in the order in which they appear. Therefore, you may create auxiliary fields and then use them in rows of the Calculator grid that are below them. That is what you did in the tutorial when you defined the auxiliary fields aux1 and aux2 and then used them in the field total.
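The top-to-bottom evaluation of the grid, with auxiliary fields dropped at the end, can be mimicked in a few lines of Python. In this sketch each tuple stands for one grid row: new field, calculation, A, B, and the Remove flag. The grid follows the description of the tutorial (aux1, aux2, total, four, avg), but which rows are flagged for removal is an assumption, and run_calculator is a hypothetical function, not PDI code.

```python
import operator

# Supported calculations in this sketch; the labels imitate the Calculator list.
OPERATIONS = {"A + B": operator.add, "A / B": operator.truediv}

# (new_field, calculation, A, B, remove)
GRID = [
    ("aux1", "A + B", "writing", "reading", True),
    ("aux2", "A + B", "speaking", "listening", True),
    ("total", "A + B", "aux1", "aux2", True),
    ("four", "constant", 4, None, True),
    ("avg", "A / B", "total", "four", False),
]


def run_calculator(row, grid):
    """Evaluate the grid in order, then drop the fields flagged for removal."""
    work = dict(row)
    removed = set()
    for new_field, calculation, a, b, remove in grid:
        if calculation == "constant":
            work[new_field] = a
        else:
            # Operands may be fields created by earlier rows of the grid.
            work[new_field] = OPERATIONS[calculation](work[a], work[b])
        if remove:
            removed.add(new_field)
    return {name: value for name, value in work.items() if name not in removed}


row = {"writing": 4.05, "reading": 4.15, "speaking": 4.0, "listening": 4.5}
result = run_calculator(row, GRID)
print(result)
```

Moving the total row above aux1 in GRID would raise a KeyError, which is the sketch's way of showing why the order of the rows matters.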



Basic Data Manipulaon

[ 84 ]

Just like every grid in Kele, you have a contextual menu (and its corresponding shortcuts)

that lets you manipulate the rows by deleng, moving, copying and pasng, and so on.

The Formula step

The Formula step is another step you can use for doing calculaons. Let's give it a try by

using it in the examinaon tutorial.

Time for action – reviewing examinations by using the

Formula step

In this tutorial you will redo the previous exercise, but this me you will do the calculaons

with the Formula step.

1. Open the transformaon you just nished.

2. Delete from the transformaon the Calculator step, and put in its place a Formula

step. You will nd it under the Scripng category of steps.

3. Add a eld named writing.

4. When you click the cell under the Formula column, a window appears to edit the

formula for the new eld.

5. In the upper area of the window, type [writing]/20. You will noce that the

sentence is red if it is incomplete or the syntax is incorrect. In that case, the error is

shown below the eding area, like in the following example:

Chapter 3

[ 85 ]

6. As soon as the formula is complete and correct, the red color disappears.

7. Click OK.

8. The formula you typed will be displayed in the cell you clicked.

9. Set Number as the type for the new eld, and type writing in the Replace value

column.

10. Add three more elds to the grid in the same way you added this eld so that the

grid looks like the following:

11. Click OK.

12. Add a second Formula step.

13. Add a eld named avg and click the Formula cell to edit it.

14. Expand the Mathemacal category of funcons to the leside of the window, and click

the AVERAGE funcon.

Basic Data Manipulaon

[ 86 ]

15. The explanaon of the selected funcon appears to guide you.

16. In the eding area, type average([writing];[reading];[speaking];

[listening]).

17. Click OK.

18. Set the Value type to Number.

19. Click OK.

20. Create a hop from this step to the Sort rows step.

21. Edit the last Select values step.

22. Click Get elds to select.

23. A queson appears to ask you what to do. Click Clear and add all.

24. The grid is reloaded with the modied elds.

25. Click on the Dummy step and do a preview.

26. There should be no dierence with what you had in the Calculator version of

the tutorial:

Chapter 3

[ 87 ]

What just happened?

You read the examination.txt file, and did some calculations using the Formula step to see how the students did.

It may happen that the preview window shows you fewer decimal positions than expected. This is a preview issue. One way to see the numbers with more decimals is to send them to an output file with a proper format and look at the numbers in the file.

As you saw, you have quite a lot of functions available for building formulas and expressions. To reference a field you have to use square brackets, like in [writing]. You may reference only the current fields of the row. Unlike in the Calculator step, a formula cannot use fields defined in previous rows of the same grid, so you needed two Formula steps to replace a single Calculator. But you saved auxiliary fields because the Formula allows you to type complex formulas in a single field without using partial calculations.

When the calculations are not simple, that is, they require resolving a complex formula or involve many operands, then you might prefer the Formula step over the Calculator.
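The two-stage structure of the Formula version can be sketched in Python as two functions applied one after the other. Each function reads only the fields that already exist in the incoming row, which is why the average has to live in a second stage. The function names are hypothetical, and statistics.mean merely stands in for the average function of the tutorial.

```python
from statistics import mean

SKILLS = ("writing", "reading", "speaking", "listening")


def first_formula_step(row):
    """Counterpart of [writing]/20 and friends, each replacing its own field."""
    return {**row, **{skill: row[skill] / 20 for skill in SKILLS}}


def second_formula_step(row):
    """Counterpart of average([writing];[reading];[speaking];[listening])."""
    return {**row, "avg": mean(row[skill] for skill in SKILLS)}


row = {"student_code": "80711-85", "writing": 81, "reading": 83,
       "speaking": 80, "listening": 90}
result = second_formula_step(first_formula_step(row))
print(result["avg"])
```

The whole average is a single expression, with no aux1, aux2, or total fields to create and remove afterwards.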

The Formula step uses the library LibFormula. The syntax used in LibFormula is based on the OpenFormula standard. For more information on OpenFormula, you may visit http://wiki.oasis-open.org/office/About_OpenFormula.

Have a go hero – listing students and their examinations results

Let's play a lile with the examinaon le. Suppose you decide that only those students

whose average grade was above 3.9 will pass the examinaon; the others will not. List the

students ordered by average (desc.), last name (asc.), and name (asc.). The output list should

have the following elds:

Student code

Name

Last Name

Passed (yes/no)

average grade

Pop quiz – concatenating strings

Suppose that you want to create a new field as the student_code plus the name of the student separated by a space, as for example 867432-94 Linda Rodriguez. Which of the following are possible solutions for your problem:

a. Use a Calculator, using the calculation a+b+c, where a is student_code, b is a space, and c is the name field.

b. Use a Formula, using as formula [student_code]+" "+[name]

c. Use a Formula, using as formula [student_code]&" "&[name]

You may choose more than one option.

Calculations on groups of rows

You just learned to do simple operations for every row of a dataset. Now you are ready to go beyond. Suppose you have a list of daily temperatures of a given country over a year. You may want to know the overall average temperature, the average temperature by region, or the coldest day of the year. When you work with data, these types of calculations are a common requirement. In this section you will learn to address those requirements with PDI.




Time for action – calculating World Cup statistics by grouping data

Let's forget the examinations for a while, and return to the World Cup tutorial from the previous chapter. The file you obtained from that tutorial was a list of results of football matches. These are sample rows of the final file:

Match Date;Home Team;Away Team;Result
02/06;Italy;France;2-1
02/06;Argentina;Hungary;2-1
06/06;Italy;Hungary;3-1
06/06;Argentina;France;2-1
10/06;France;Hungary;3-1
10/06;Italy;Argentina;1-0
...

Now you want to use that information to obtain some statistics, such as the maximum number of goals per match on a given day. To do it, follow these instructions:

1. Create a new transformation, give it a name and description, and save it.

2. By using a Text file input step, read the wcup_first_round.txt file you generated in Chapter 2. Give the name and location of the file, check the Content tab to see that everything matches your file, and fill the Fields tab.

3. Do a preview just to confirm that the step is well configured.

4. From the Transform category of steps, select a Split Fields step, drag it to the work area, and create a hop from the Text file input to this step.

5. Double-click the Split Fields step and fill the grid as shown in the following screenshot:


6. Add a Calculator step to the transformation, create a hop from the Split Fields step to this step, and edit the step to create the following new fields:

7. Add a Sort rows step to the transformation, create a hop from the Calculator step to this step, and sort the fields by Match_Date.

8. Expand the Statistics category of steps, and drag a Group by step to the canvas. Create a hop from the Sort rows step to this new step.

9. Edit the Group by step and fill the configuration window as shown next:


10. When you click the OK button, a window appears to warn you that this step needs the input to be sorted on the specified keys, the Match_Date field in this case. Click I understand, and don't worry because you already sorted the data in the previous step.

11. Add a final Dummy step.

12. Select the Dummy and the Group by steps: left-click one and, holding down the Shift key, left-click the other.

13. Click the Preview this transformation button. You will see the following:

14. Click Quick Launch. The following window appears:

15. Double-click the Sort rows step. A window appears with the data coming out of the Sort rows step.

16. Double-click the Dummy step. A window appears with the data coming out of the Dummy step.


17. If you rearrange the preview windows, you can see both preview windows at a time, and understand better what happened with the numbers. The following would be the data shown in the windows:

What just happened?

You opened a file with results from several matches and got some statistics from it.

In the file, there was a column with the match result in the format n-m, with n being the goals of the home team and m being the goals of the away team. With the Split Fields step, you split this field in two, one with each of these two numbers.

With the Calculator you did two things:

You created a new field with the total number of goals for each match.

You created a description for the match.




Note that in order to create a description, you used the + operator to concatenate strings rather than to add numbers.

Aer that, you ordered the data by match date with a Sort rows step.

In the preview window of the Sort rows step, you could see all the calculated elds: home

team goals, away team goals, match goals, and descripon.
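The Split Fields and Calculator steps above can be sketched in Python as follows. This is only an illustration of the logic, not how Kettle implements it. The field names Match_Date and goals appear in the tutorial; home_goals, away_goals, match, Home_Team, and Away_Team are names assumed here because the screenshots that define the real ones are not reproduced.

```python
# Two sample rows taken from the tutorial file wcup_first_round.txt.
rows = [
    {"Match_Date": "02/06", "Home_Team": "Italy", "Away_Team": "France", "Result": "2-1"},
    {"Match_Date": "06/06", "Home_Team": "Italy", "Away_Team": "Hungary", "Result": "3-1"},
]


def split_and_calculate(row):
    """Mimic Split Fields followed by Calculator for a single row."""
    # Split Fields: break the "n-m" result on "-" into two integers.
    home_goals, away_goals = (int(part) for part in row["Result"].split("-"))

    enriched = dict(row)
    enriched["home_goals"] = home_goals  # assumed field name
    enriched["away_goals"] = away_goals  # assumed field name
    # Calculator, first new field: + adds numbers.
    enriched["goals"] = home_goals + away_goals
    # Calculator, second new field: + concatenates strings (assumed field name).
    enriched["match"] = row["Home_Team"] + "-" + row["Away_Team"]
    return enriched


matches = [split_and_calculate(row) for row in rows]
```

Note how the same + operator adds the two goal counts but joins the two team names, which is the point made in the note above.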

Finally, you did some statistical calculations:

First, you grouped the rows by match date. You did this by typing Match_Date in the upper grid of the Group by step.

Then, for every match date, you calculated some statistics. You did the calculations by adding rows in the lower grid of the step, one for every statistic you needed.

Let's see how it works. Because the Group by step was preceded by a Sort rows step, the rows came to the step already ordered. When the rows arrive at the Group by step, Kettle creates groups based on the field(s) indicated in the upper grid, the Match_Date field in this case. The following drawing shows this idea:




Then, for every group, the fields that you put in the lower grid are calculated. Let's see, for example, the group for the match date 03/06. For the rows in this group, Kettle calculated the following:

Matches: The number of matches played on 03/06. There were 4.

Sum of goals: The total number of goals scored on 03/06. There were 3+2+3+4=12.

Maximum: The maximum number of goals scored in a single match played on 03/06. The maximum among 3, 2, 3, and 4 was 4.

Teams: The descriptions of the teams which played on 03/06, separated by ; : Austria-Spain; Sweden-Brazil; Netherlands-Iran; Peru-Scotland.

The same calculations were made for every group. You can verify the details by looking in the preview window.

Look at the Step Metrics tab in the Execution Results area of the screen:

Note that 24 rows entered the Group by step and only 7 came out of that step towards the Dummy step. That is because after the grouping, you no longer have the detail of matches. The output of the Group by step is your new data now: one row for every group created.
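As a rough illustration of what the Sort rows and Group by steps produce together, the following Python sketch applies the same four aggregates to the six sample rows listed at the start of the tutorial. It is a sketch of the idea, not Kettle's implementation; the tuple layout and the function name group_by_date are choices made here.

```python
from itertools import groupby
from operator import itemgetter

# (Match_Date, match description, goals), derived from the six sample rows
# of the tutorial file.
rows = [
    ("02/06", "Italy-France", 3),
    ("02/06", "Argentina-Hungary", 3),
    ("06/06", "Italy-Hungary", 4),
    ("06/06", "Argentina-France", 3),
    ("10/06", "France-Hungary", 4),
    ("10/06", "Italy-Argentina", 1),
]


def group_by_date(sorted_rows):
    """Return one output row per match date, like the Group by step."""
    result = []
    for match_date, group in groupby(sorted_rows, key=itemgetter(0)):
        group = list(group)
        goals = [g for _, _, g in group]
        result.append({
            "Match_Date": match_date,
            "Matches": len(group),          # Number of values
            "Sum of goals": sum(goals),     # Sum
            "Maximum": max(goals),          # Maximum
            "Teams": "; ".join(desc for _, desc, _ in group),  # Concatenate strings
        })
    return result


# Sort rows first, then group, as in the transformation.
summary = group_by_date(sorted(rows, key=itemgetter(0)))
```

Six detail rows go in and three summary rows come out, one per match date, which mirrors the drop from 24 rows to 7 seen in the Step Metrics tab.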

Group by step

The Group by step allows you to create groups of rows and calculate new fields over those groups.

In order to define the groups, you have to specify which field(s) are the keys. For every combination of values for those fields, Kettle builds a new group.

In the tutorial you grouped by a single field, Match_date. Then for every value of Match_date, Kettle created a different group.




The Group by step operates on consecutive rows. Suppose that the rows are already sorted by date, but those with date 10/06 are above the rest. The step traverses the dataset and each time the value of any of the grouping fields changes, it creates a new group. If you see it this way, you will notice that the step will work even if the data is not sorted by the grouping field, as long as the rows that belong to the same group arrive together.

As you probably don't know how the data is ordered, it is safer and

recommended that you sort the data by using a Sort rows step just

before using a Group by step.
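The "consecutive rows" behavior can be illustrated with Python's itertools.groupby, which also starts a new group whenever the key changes. This is an analogy chosen for the sketch, not a description of Kettle's code, and the sample dates are made up for the illustration.

```python
from itertools import groupby


def count_consecutive(values):
    """Count rows per group, starting a new group whenever the value changes."""
    return [(key, len(list(group))) for key, group in groupby(values)]


# Not sorted, but rows with the same date are together: grouping still works.
contiguous = count_consecutive(["10/06", "10/06", "02/06", "02/06"])

# Same dates scattered through the data: every change starts a new group,
# so each date shows up more than once in the output.
scattered = count_consecutive(["10/06", "02/06", "10/06", "02/06"])

# Sorting first always brings equal keys together.
after_sort = count_consecutive(sorted(["10/06", "02/06", "10/06", "02/06"]))
```

The scattered input yields four groups instead of two, which is the situation the Sort rows step protects you from.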

Once you have defined the groups, you are free to specify new fields to be calculated for every group. Every new field is defined as an aggregate function over some of the existing fields.

Let's review some of the fields you created in the tutorial:

The Matches field is the result of applying the Number of values function over the field Match_date.

The Sum of goals field is the result of applying the Sum function over the field goals.

The Maximum field is the result of applying the Maximum function over the field goals.

Finally, you have the option to calculate aggregate functions over the whole dataset. You do this by leaving the upper grid blank. Following the same example, you could calculate the total number of matches and the average number of goals for all those matches. This is how you do it:




The following is what you get:

In any case, as a result of the Group by step, you will no longer have the detailed rows,

unless you check the Include all rows? checkbox.
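Leaving the upper grid blank amounts to treating the whole dataset as a single group. A minimal Python sketch of that idea, using the goal totals of the six sample rows from the tutorial file (the function and key names are assumptions of this sketch):

```python
# Goals per match for the six sample rows: 2-1, 2-1, 3-1, 2-1, 3-1, 1-0.
goals_per_match = [3, 3, 4, 3, 4, 1]


def aggregate_all(values):
    """Aggregate over the whole dataset: no grouping key, a single output row."""
    return {
        "matches": len(values),
        "average_goals": sum(values) / len(values),
    }


totals = aggregate_all(goals_per_match)
```

Whatever the number of input rows, the result here is exactly one row.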

Have a go hero – calculating statistics for the examinations

Here you have one more task related to the examinations file. Create a new transformation, read the file, and calculate:

The number of students who passed

The number of students who failed

The average writing, reading, speaking, and listening grade obtained by students who passed

The average writing, reading, speaking, and listening grade obtained by students who failed

The minimum and maximum average grade among students who passed

The minimum and maximum average grade among students who failed

Use the Number range step to define the range of the average grade; then use a Group by step to calculate the statistics.




Have a go hero – listing the languages spoken by country

Read the file with countries' information you used in Chapter 2. Build a file where each row has two columns: the name of a country and the list of spoken languages in that country.

As aggregate, use the option Concatenate strings separated by.

Filtering

Unl now you learned how to accomplish several kinds of calculaons that enriched the set

of data. There is sll another kind of operaon that is frequently used, and does not have to

do with enriching the data but with discarding data. It is ltering unwanted data. Now you

will learn how to discard rows under given condions.

Time for action – counting frequent words by filtering

Let's suppose you have some plain text files, and you want to know what is said in them. You don't want to read them, so you decide to count the times that words appear in the text, and see the most frequent ones to get an idea of what the files are about.

Before starting, you'll need at least one text file to play with. The text file used in this tutorial is named smcng10.txt and is available for you to download from the Packt website.

Let's work:

1. Create a new transformation.

2. By using a Text file input step, read your file. The trick here is to put as a separator a sign you are not expecting in the file, for example |. By doing so, the entire line would be recognized as a single field. Configure the Fields tab by defining a single string field named line.

3. From the Transform category of steps, drag to the canvas a Split field to rows step, and create a hop from the Text file input step to this new step.


4. Configure the step like this:

5. With this last step selected, do a preview. Your preview window should look like this:

6. Close the preview window.

7. Expand the Flow category of steps, and drag a Filter rows step to the work area.

8. Create a hop from the last step to the Filter rows step.

9. Edit the Filter rows step by double-clicking it.


10. Click the <field> textbox to the left of the = sign. The list of fields appears. Select word.

11. Click the = sign. A list of operations appears. Select IS NOT NULL.

12. The window looks like the following:

13. Click OK.

14. From the Transform category of steps drag a Sort rows step to the canvas, and

create a hop from the Filter rows step to this new step.

15. Sort the rows by word.

16. From the Statistics category, drag a Group by step, and create a hop from the Sort rows step to this step.

17. Configure the grids in the Group by configuration window as shown:


18. Add a Calculator step, create a hop from the last step to this, and calculate the new field len_word representing the length of the words. For that, use the calculator function Return the length of a string A and select word from the drop-down menu for Field A.

19. Expand the Flow category and drag another Filter rows step to the canvas.

20. Create a hop from the Calculator step to this step and edit it.

21. Click <field> and select counter.

22. Click the = sign, and select >.

23. Click <value>. A small window appears.

24. In the Value textbox of the little window, enter 2.

25. Click OK.

26. Position the mouse cursor over the icon in the upper-right corner of the window. When the text Add condition shows up, click on the icon.

27. A new blank condition is shown below the one you created.

28. Click on null = [] and create the condition len_word>3, in the same way you created the condition counter>2.

29. Click OK.


30. The final condition looks like this:

31. Add one more Filter rows step to the transformation and create a hop from the last step to this new step.

32. On the left side of the condition, select word.

33. As comparator select IN LIST.

34. At the end of the condition, inside the textbox value, type the following:

a;an;and;the;that;this;there;these.

35. Click the upper-left square above the condition and the word NOT will appear.

36. The condition looks like the following:


37. Add a Sort rows step, create a hop from the previous step to this step, and sort the rows in the descending order of counter.

38. Add a Dummy step at the end of the transformation, and create a hop from the last step to the Dummy step.

39. With the Dummy step selected, preview the transformation. The following is what you should see now:

What just happened?

You read a regular plain file and arranged the words that appear in the file in some particular fashion.

The first thing you did was to read the plain file and split the lines so that every word became a new row in the dataset. Consider, for example, the following line:

subsidence; comparison with the Portillo chain.


The splitting of this line resulted in the following rows being generated:

Thus, a new field named word became the basis for your transformation.

First of all, you discarded rows with null words. You did it by using a filter with the condition word IS NOT NULL. Then, you counted the words by using the Group by step you learned in the previous tutorial. Once you counted the words, you discarded those rows where the word appeared too few times (counter not greater than 2), was too short (length less than 4), or was too common (comparing to a list you typed).

Once you applied all those filters, you sorted the rows in the descending order of the number of times the word appeared in the file so that you could see the most frequent words.

Scrolling down the preview window a little to skip some prepositions, pronouns, and other very common words that have nothing to do with a specific subject, you found words such as shells, strata, formation, South, elevation, porphyritic, Valley, tertiary, calcareous, plain, North, rocks, and so on. If you had to guess, you would say that this was a book or article about geology, and you would be right. The text taken for this exercise was Geological Observations on South America by Charles Darwin.
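The whole word-counting flow can be summarized in a few lines of Python. This is a hedged sketch of the logic only: it assumes that the Split field to rows step was configured with a single space as delimiter (the screenshot with the actual setting is not reproduced here) and that empty strings play the role of null words. The function name frequent_words and its parameters are invented for the illustration.

```python
from collections import Counter

# The list typed in the third Filter rows step.
COMMON_WORDS = {"a", "an", "and", "the", "that", "this", "there", "these"}


def frequent_words(lines, min_count=2, min_length=3):
    """Mimic the word-counting transformation on a list of text lines."""
    # Split field to rows: one row per word (assumed delimiter: a single space).
    words = [word for line in lines for word in line.split(" ")]
    # First Filter rows: word IS NOT NULL (empty strings stand in for nulls).
    words = [word for word in words if word]
    # Sort rows + Group by: count the occurrences of every word.
    counter = Counter(words)
    # Second Filter rows: counter > 2 AND len_word > 3.
    kept = [(w, c) for w, c in counter.items() if c > min_count and len(w) > min_length]
    # Third Filter rows: NOT word IN LIST.
    kept = [(w, c) for w, c in kept if w not in COMMON_WORDS]
    # Final Sort rows: descending order of counter.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


sample = [
    "these these these rocks rocks rocks rocks",
    "strata strata strata  sun sun sun the",
]
result = frequent_words(sample)
```

In the made-up sample, "these" is dropped because it is in the list of common words, "sun" because it is too short, and "the" because it appears only once.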

Filtering rows using the Filter rows step

The Filter rows step allows you to filter rows based on conditions and comparisons.

The step checks the condition for every row. Then it applies a filter, letting pass only the rows for which the condition is true. The other rows are lost.

In the counting words exercise, you used the Filter rows step several times so you already have an idea of how it works. Let's review it.


In the Filter rows setting window you have to enter a condition. The following table summarizes the different kinds of conditions you may enter:

Condition | Description | Example
A single field followed by IS NULL or IS NOT NULL | Checks whether the value of a field in the stream is null | word IS NOT NULL
A field, a comparator, and a constant | Compares a field in the stream against a constant value | counter > 2
Two fields separated by a comparator | Compares two fields in the stream | line CONTAINS word

You can combine conditions as shown here:

counter > 2
AND
len_word > 3

You can also create subconditions such as:

(
counter > 2
AND
len_word > 3
)
OR
(word IN LIST geology; sun)

In this last example, the condition lets the word geology pass even if it appears only once. It also lets the word sun pass, despite its length.
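Expressed as a Python predicate, the last example reads as follows. The sketch assumes that the two sub-conditions are joined with OR, which is what the explanation above implies; the function name passes_filter and the constant ALWAYS_KEEP are invented for the illustration.

```python
# Words that must pass regardless of their count or length.
ALWAYS_KEEP = {"geology", "sun"}


def passes_filter(word, counter, len_word):
    """(counter > 2 AND len_word > 3) OR (word IN LIST geology; sun)."""
    return (counter > 2 and len_word > 3) or word in ALWAYS_KEEP
```

For example, passes_filter("geology", 1, 7) is true even though the word appears only once, while passes_filter("rock", 1, 4) is false.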

When editing conditions, you always have a contextual menu which allows you to add and delete sub-conditions, change the order of existing conditions, and more.

Maybe you wonder what the Send 'true' data to step: and Send 'false' data to step: textboxes are for. Be patient, you will learn how to use them in Chapter 4.

Have a go hero – playing with filters

Now it is your turn to try filtering rows. Modify the counting_words transformation in the following way:

Alter the Filter rows steps. By using a Formula step create a flag (a Boolean field) that evaluates the different conditions (counter>2, and so on). Then use only one Filter rows step that filters the rows for which the flag is true. Test it and verify that the results are the same as before the change.




In the Formula editing window, use the options under the Logic category. Then in the Filter rows step, you can type true or Y as the value against which you compare the flag.

Add a sub-condition to avoid excluding some words, just like the one in the example: (word in list geology; sun). Change the list of words and test the filter to see that the results are as expected.

Have a go hero – counting words and discarding those that are commonly used

If you take a look at the results in the tutorial, you may notice that some words appear more than once in the final list because of special signs such as . , ) or ", or because of lower or upper case letters. For example, look how many times the word rock appears: rock (99 occurrences) - rock, (51 occurrences) - rock. (11 occurrences) - rock." (1 occurrence) - rock: (6 occurrences) - rock; (2 occurrences). You can fix this and make the word rock appear only once: before grouping the words, remove all extra signs and convert all words to lower case or upper case, so they are grouped as expected.

Try one or more of the following steps: Formula, Calculator, Replace in string.
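To make the goal of the cleanup concrete, here is what the normalization does when written in Python. It is only an illustration of the expected result, not the PDI solution: it strips punctuation from both ends of a word and lowers the case, using names (normalize, variants) invented for this sketch.

```python
import string


def normalize(word):
    """Remove leading and trailing punctuation and convert to lower case."""
    return word.strip(string.punctuation).lower()


# The variants of "rock" mentioned above, plus a capitalized one.
variants = ["rock", "rock,", "rock.", 'rock."', "rock:", "rock;", "Rock"]
normalized = {normalize(word) for word in variants}
```

After the cleanup all the variants collapse into the single word rock, so a Group by on the cleaned field counts them together.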

Looking up data

Unl now, you have been working with a single stream of data. When you did calculaons or

created condions to compare elds, you only involved elds of your stream. Usually, this is

not enough, and you need data from other sources. In this secon you will learn to look up

data outside your stream.

Time for action – finding out which language people speak

An International Musical Contest will take place and 24 countries will participate, each presenting a duet. Your task is to hire interpreters so the contestants can communicate in their native language. In order to do that, you need to find out the language they speak:

1. Create a new transformation.

2. By using a Get Data From XML step, read the countries.xml file that contains information about countries that you used in Chapter 2.




To avoid configuring the step again, you can open the transformation that reads this file, copy the Get data from XML step, and paste it here.

3. Drag a Filter rows step to the canvas.

4. Create a hop from the Get data from XML step to the Filter rows step.

5. Edit the Filter rows step and create the condition isofficial = T.

6. Click the Filter rows step and do a preview. The list of previewed rows will show the countries along with the official languages:

Now let's create the main flow of data:

7. From the book website download the list of contestants. It looks like this:

ID;Country;Duet
1;Russia;Mikhail Davydova
;;Anastasia Davydova
2;Spain;Carmen Rodriguez
;;Francisco Delgado
3;Japan;Natsuki Harada
;;Emiko Suzuki
4;China;Lin Jiang
;;Wei Chiu
5;United States;Chelsea Thompson
;;Cassandra Sullivan
6;Canada;Mackenzie Martin
;;Nathan Gauthier
7;Italy;Giovanni Lombardi
;;Federica Lombardi


8. In the same transformation, drag a Text file input step to the canvas and read the downloaded file.

The ID and country have values only in the first of the two lines for each country. In order to repeat the values in the second line, use the flag Repeat in the Fields tab. Set it to Y.
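The effect of the Repeat flag is a "fill down": when a field is empty, the value of the previous row is reused. A minimal Python sketch of that behavior, using the first rows of the contestants file (the function name fill_down is an assumption of this sketch, and empty strings stand in for the missing values):

```python
def fill_down(rows, fields):
    """Repeat the last non-empty value of the given fields in the rows that lack it."""
    last_seen = {}
    filled = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        for field in fields:
            if new_row.get(field):
                last_seen[field] = new_row[field]
            else:
                new_row[field] = last_seen.get(field)
        filled.append(new_row)
    return filled


contestants = [
    {"ID": "1", "Country": "Russia", "Duet": "Mikhail Davydova"},
    {"ID": "", "Country": "", "Duet": "Anastasia Davydova"},
    {"ID": "2", "Country": "Spain", "Duet": "Carmen Rodriguez"},
    {"ID": "", "Country": "", "Duet": "Francisco Delgado"},
]
filled = fill_down(contestants, ["ID", "Country"])
```

After the fill, every contestant row carries its own ID and country, which is what the lookup in the following steps needs.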

9. Expand the Lookup category of steps.

10. Drag a Stream lookup step to the canvas.

11. Create a hop from the Text file input you just created, to the Stream lookup step.

12. Create another hop from the Filter rows step to the Stream lookup step.

13. Edit the Stream lookup step by double-clicking it.

14. In the Lookup step drop-down list, select Filter official languages, the step that brings the list of languages.

15. Fill the grids in the configuration window as follows:

Note that Country Name is a field coming from the text file stream, while the country field comes from the countries stream.


16. Click OK.

17. The hop that goes from the Filter rows step to the Stream lookup step changes its look

and feel, to show that this is the stream where the Stream lookup is going to look:

18. After the Stream lookup, add a Filter rows step.

19. In the Filter rows step, type the condition language IS NOT NULL.

20. By using a Select values step, rename the fields Duet, Country Name, and language to Name, Country, and Language.

21. Drag a Text file output step to the canvas and create the file people_and_languages.txt with the selected fields.

22. Save the transformation.

23. Run the transformation and check the final file, which should look like this:

Name|Country|Language
Mikhail Davydova|Russia|
Anastasia Davydova|Russia|
Carmen Rodriguez|Spain|Spanish
Francisco Delgado|Spain|Spanish
Natsuki Harada|Japan|Japanese
Emiko Suzuki|Japan|Japanese
Lin Jiang|China|Chinese
Wei Chiu|China|Chinese
Chelsea Thompson|United States|English
Cassandra Sullivan|United States|English
Mackenzie Martin|Canada|French
Nathan Gauthier|Canada|French
Giovanni Lombardi|Italy|Italian
Federica Lombardi|Italy|Italian

What just happened?

First of all, you read a file with information about countries and the languages spoken in those countries.

Then you read a list of people along with the country they come from. For every row in this list, you told Kettle to look for the country (Country Name field) in the countries stream (country field), and to give you back a language and the percentage of people who speak that language (language and percentage fields). Let's explain it with a sample row: the row for Francisco Delgado from Spain. When this row gets to the Stream lookup step, Kettle looks in the list of countries for a row with the country Spain. It finds it. Then, it returns the value of the columns language and percentage: Spanish and 74.4.

Now take another sample row, the row with the country Russia. When the row gets to the Stream lookup step, Kettle looks for it in the list of countries, but it doesn't find it. So what you get as language is a null string.

Whether the country is found or not, two new fields are added to your stream: language and percentage.

After the Stream lookup step, you discarded the rows where language is null, that is, those whose country wasn't found in the list of countries.

With the successful rows you generated an output file.
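The lookup just described can be sketched in Python with a dictionary built from the secondary stream. This is an illustration under stated assumptions, not Kettle's implementation: the function stream_lookup and its parameters are invented here, and the lookup data contains only the Spain row, because that is the only row whose values (Spanish, 74.4) are given in the text.

```python
def stream_lookup(main_rows, lookup_rows, key_pairs, return_fields, default=None):
    """Add return_fields from lookup_rows to each main row, matching on key_pairs.

    key_pairs is a list of (field in main stream, field in lookup stream).
    """
    # Load the whole lookup stream into memory, indexed by its key fields.
    index = {}
    for lookup_row in lookup_rows:
        key = tuple(lookup_row[lookup_field] for _, lookup_field in key_pairs)
        # Assumption of this sketch: with duplicate keys the last row wins.
        index[key] = lookup_row

    result = []
    for row in main_rows:
        key = tuple(row[field] for field, _ in key_pairs)
        match = index.get(key)
        enriched = dict(row)
        for name in return_fields:
            # The new fields are always added; they stay null when nothing is found.
            enriched[name] = match[name] if match is not None else default
        result.append(enriched)
    return result


countries = [{"country": "Spain", "language": "Spanish", "percentage": 74.4}]
people = [
    {"Duet": "Francisco Delgado", "Country Name": "Spain"},
    {"Duet": "Mikhail Davydova", "Country Name": "Russia"},
]
looked_up = stream_lookup(
    people, countries, [("Country Name", "country")], ["language", "percentage"]
)
# Equivalent of the final Filter rows step: language IS NOT NULL.
with_language = [row for row in looked_up if row["language"] is not None]
```

The row for Spain comes back with Spanish and 74.4, the row for Russia comes back with null values, and only the former survives the final filter.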

The Stream lookup step

The Stream lookup step allows you to look up data in a secondary stream.

You tell Kettle which of the incoming streams is the stream used to look up, by selecting the right choice in the Lookup step list.

The upper grid in the configuration window allows you to specify the names of the fields that are used to look up.


In the left column, Field, you indicate the field of your main stream. You can fill this column by using the Get Fields button, and deleting all the fields you don't want to use for the search.

In the right column, Lookup Field, you indicate the field of the secondary stream.

When a row of data comes to the step, a lookup is made to see if there is a row in the secondary stream for which every pair (Field, LookupField) in the grid has the value of Field equal to the value of LookupField. If there is one, the lookup will be successful.

In the lower grid, you specify the names of the secondary stream fields that you want back as a result of the lookup. You can fill this column by using the Get lookup fields button, and deleting all the fields you don't want to retrieve.

After the lookup, new fields are added to your dataset, one for every row of this grid.

For the rows for which the lookup is successful, the values for the new fields will be taken from the lookup stream.

For the others, the fields will remain null, unless you set a default value.


When you use a Stream lookup, all lookup data is loaded into memory. Then the stream lookup is made using a hash table algorithm. Even if you don't know how this algorithm works, it is important that you know the implications of using this step:

First, if the data where you look is huge, you take the risk of running out of memory.

Second, only one row is returned per key. If the key you are looking for is present more than once in the lookup stream, only one row will be returned. For example, in the tutorial, where more than one official language is spoken in some countries, you get just one. Sometimes you don't care, but on some occasions this is not acceptable and you have to try some other methods. You'll learn other ways to do this later in the book.
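The second implication follows directly from how a hash table stores one entry per key. The Python sketch below shows the effect with a dictionary; which of the duplicate rows survives (here, the last one loaded) and the order of the sample rows are assumptions of this sketch, not documented behavior of the step.

```python
# Lookup stream with two official languages for Canada.
lookup_rows = [
    ("Canada", "English"),
    ("Canada", "French"),
    ("Spain", "Spanish"),
]

# One entry per key: a later row with the same country replaces the earlier one.
one_per_key = {}
for country, language in lookup_rows:
    one_per_key[country] = language

# Keeping every row per key needs a different structure, for example a list per key.
all_per_key = {}
for country, language in lookup_rows:
    all_per_key.setdefault(country, []).append(language)
```

Three rows were loaded but one_per_key holds only two entries, so a lookup for Canada can return a single language, whereas all_per_key keeps both.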

Have a go hero – counting words more precisely

The tutorial where you counted the words in a file worked pretty well, but you may have noticed that it has some details you can fix or enhance.

You discarded a very small list of words, but there are many more that are quite usual in English: prepositions, pronouns, auxiliary verbs, and many more. So here is the challenge: get a list of commonly used words and save it in a file. Instead of excluding words from a small list as you did with a Filter rows step, exclude the words that are in your common words file.

Use a Stream lookup step.

Test the transformation with the same file, and also with other files, and verify that you get better results with all these changes.




Summary

This chapter covered the simplest and most common ways of transforming data. Specifically, it covered how to:

Use different transformation steps to calculate new fields

Use the Calculator and the Formula steps

Filter and sort data

Calculate statistics on groups of rows

Look up data

After learning basic manipulation of data, you may now create more complex transformations, where the streams begin to split and merge. That is the core subject of the next chapter.



Controlling the Flow of Data

In the previous chapter, you learned the basics of transforming data. Basically you read data from some file, did some transformation to the data, and sent the data to a different output. This is the simplest scenario. Think of a different situation. Suppose you collect results from a survey. You receive several files with the data and those files have different formats. You have to merge those files somehow and generate a unified view of the information. You also want to put aside the rows of data whose content is irrelevant. Finally, based on the rows that interest you, you want to create another file with some statistics. This kind of requirement is very common. In this chapter you will learn how to implement it with PDI.

Splitting streams

Unl now, you have been working with simple, straight ows of data. When you deal with

real problems, those simple ows are not enough. Many mes, the rows of your dataset

have to take dierent paths. This situaon is handled very easily, and you will learn how to

do it in this secon.


Time for action – browsing new PDI features by copying a dataset

Before starting, let's introduce the Pentaho BI Platform Tracking site. At the tracking site you can see the current Pentaho roadmap and browse their issue tracking system. The PDI page for that site is http://jira.pentaho.com/browse/PDI.

In this exercise, you will export the list of proposed new features for PDI from the site, and generate detailed and summarized files from that information.

1. Access the main Pentaho tracking site page: http://jira.pentaho.com.

2. In the main menu, click on FIND ISSUES.

3. On the left side, select the following filters:

Project: Pentaho Data Integration {Kettle}

Issue Type: New Feature

Status: Open

4. At the bottom of the filter list, click View >>. A list of found issues will appear.




Chapter 4


5. Above the list, select Current field to export the list to an Excel file.

6. Save the file to the folder of your choice.

The Excel file exported from the JIRA site is a Microsoft Excel 97-2003 Worksheet. PDI doesn't recognize this version of worksheets. So, before proceeding, open the file with Excel or Calc and convert it to Excel 97/2000/XP.

7. Create a transformation.

8. Read the file by using an Excel Input step. After selecting the file, click on the Sheets tab, and fill it as shown in the next screenshot so that it skips the header rows and the first column:

9. Click the Fields tab and fill the grid by clicking the Get fields from header row... button.


10. Click Preview rows just to be sure that you are reading the file properly. You should see all the contents of the Excel file except the three heading lines.

11. Click OK.

12. Add a Filter rows step to drop the rows where the Summary field is null.

13. After the Filter rows step, add a Value Mapper step and fill it as shown here:

14. After the Value Mapper step, add a Sort rows step and order the rows by priority_order (asc.), Summary (asc.).


15. After that, add an Excel Output step, and configure it to send the priority_order and Summary fields to an Excel file named new_features.xls.

16. Drag a Group by step to the canvas.

17. Create a new hop from the Sort rows step to the Group by step.

18. A warning window appears asking you to decide whether you wish to Copy or Distribute.

19. Click Copy to send the rows toward both output steps.

20. The hops leaving the Sort rows step change to show you the decision you made. So far you have this:

21. Configure the Group by step as shown:

22. Add a new Excel Output step to the canvas and create a hop from the Group by step to this new step.


23. Configure the Excel Output step to send the Priority and Quantity fields to an Excel file named new_features_summarized.xls.

24. Save the transformation and run it.

25. Verify that both files, new_features.xls and new_features_summarized.xls, have been created.

26. The first file should look like this:

27. And the second file should look like this:


What just happened?

Aer exporng an Excel le with the PDI new features from the JIRA site, you read the le and

created two Excel les—one with a list of the issues and the other with a summary of the list.

The rst steps of the transformaon are well known—read a le, lter null rows, map a eld,

and sort.

Note that the mapping creates a new eld to give an order to the Priority

eld so that the more severe issues are rst in the list, while the minor priories

remain at the end of the list.

You linked the Sort rows step to two dierent steps. This caused PDI to ask you what to

do with the rows leaving the step. By answering Copy, you told PDI to create a copy of

the dataset. Aer that, two idencal copies le the Sort rows step, each to a dierent

desnaon step.

From the moment you copied the dataset, those copies became independent, each following

its way. The rst copy was sent to a detailed Excel le. The other copy was used to create a

summary of the elds, which then was sent to another Excel le.
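The sort-then-summarize logic of this tutorial can be sketched in plain JavaScript, outside PDI. This is only an illustration: the sample rows and the priority_order values are made up (the real mapping is the one you typed in the Value Mapper step), and the helper names sortRows and groupByPriority are hypothetical.

```javascript
// Hypothetical sample rows; the real data comes from the JIRA export.
var rows = [
  { Priority: "Minor",    priority_order: 3, Summary: "Tooltip typo" },
  { Priority: "Critical", priority_order: 1, Summary: "Crash on save" },
  { Priority: "Minor",    priority_order: 3, Summary: "Align icons" },
  { Priority: "Critical", priority_order: 1, Summary: "Broken hop" }
];

// Sort rows: priority_order ascending, then Summary ascending.
function sortRows(input) {
  return input.slice().sort(function (a, b) {
    if (a.priority_order !== b.priority_order) {
      return a.priority_order - b.priority_order;
    }
    return a.Summary < b.Summary ? -1 : (a.Summary > b.Summary ? 1 : 0);
  });
}

// Group by: count consecutive rows that share the same Priority.
// This only works on sorted input, which is why the sort comes first.
function groupByPriority(sorted) {
  var groups = [];
  sorted.forEach(function (row) {
    var last = groups[groups.length - 1];
    if (last && last.Priority === row.Priority) {
      last.Quantity += 1;
    } else {
      groups.push({ Priority: row.Priority, Quantity: 1 });
    }
  });
  return groups;
}

var sorted = sortRows(rows);           // detailed list (first Excel file)
var summary = groupByPriority(sorted); // summary (second Excel file)
console.log(summary);
```

The sorted array plays the role of the detailed file, and the summary array plays the role of the summarized file with its Priority and Quantity fields.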

Copying rows

At any place in a transformation, you may decide to split the main stream into two or more streams. When you do so, you have to decide what to do with the data that leaves the last step: copy or distribute.

To copy means that the whole dataset is copied to each of the destination steps. Once the rows are sent to those steps, each follows its own way.

When you copy, the hops that leave the step from which you are copying change visually to indicate the copy action.

In the tutorial, you created two copies of the main dataset. You could have created more than two, like in this example:

When you split the stream into two or more streams, you can do whatever you want with each one as if they had never been the same. The transformations you apply to any of those output streams will not modify the data in the others.

You shouldn't assume a particular order in the execution of the output streams of a step. All the output streams receive the rows in sync and you don't have control over the order in which they are executed.
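The independence of copied streams can be pictured with a small plain JavaScript sketch. It is not how PDI is implemented; copyRows is a hypothetical helper and the rows are invented sample data.

```javascript
// Send a full, independent copy of the dataset to each target stream.
function copyRows(rows, targetCount) {
  var targets = [];
  for (var t = 0; t < targetCount; t++) {
    targets.push(rows.map(function (row) {
      var clone = {};
      Object.keys(row).forEach(function (field) { clone[field] = row[field]; });
      return clone;
    }));
  }
  return targets;
}

// Hypothetical sample rows.
var source = [
  { Priority: "Critical", Summary: "Crash on save" },
  { Priority: "Minor", Summary: "Tooltip typo" }
];

var streams = copyRows(source, 2);

// Changing a row in one stream leaves the other stream untouched.
streams[0][0].Summary = "CRASH ON SAVE";
console.log(streams[0][0].Summary);
console.log(streams[1][0].Summary);
```

Each target receives every row, and a change made in one output stream is not visible in the others.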

Have a go hero – recalculating statistics

Do you remember the exercise from Chapter 3 where you calculated some statistics? You created two transformations. One was to generate a file with students that failed. The other was to create a file with some statistics such as average grade, number of students who failed, and so on.

Now you can do all that work in a single transformation, reading the file once.

Distributing rows

As said, when you split a stream, you can copy or distribute the rows. You already saw that copy is about creating copies of the whole dataset and sending each of them to each output stream. To distribute means the rows of the dataset are distributed among the destination steps. Let's see how it works through a modified exercise.

Time for action – assigning tasks by distributing

Let's suppose you want to distribute the issues among three programmers so that each of them implements a subset of the new features.

1. Select Transformation | Copy transformation to clipboard in the main menu. Close the transformation and select Transformation | Paste transformation from clipboard. A new transformation is created identical to the one you copied. Change the description and save the transformation under a different name.

2. Now delete all the steps after the Sort rows step.

3. Change the filter step to keep only the unassigned issues: Assignee field equal to the string Unassigned. The condition looks like the next screenshot:

4. From the Transform category of steps, drag an Add sequence step to the canvas and create a hop from the Sort rows step to this new step.

5. Double-click the Add sequence step and replace the content of the Name of value textbox with nr.

6. Drag three Excel Output steps to the canvas.

7. Link the Add sequence step to one of these steps.

Configure the Excel Output step to send the fields nr, Priority, and Summary to an Excel file named f_costa.xls (the name of one of the programmers). The Fields tab should look like this:

8. Create a hop from the Add sequence step to the second Excel Output step. When asked to decide between Copy and Distribute, select Distribute.

9. Configure the step like before, but name the file as b_bouchard.xls (the second programmer).

10. Create a hop from the Add sequence step to the last Excel Output step.

11. Configure this last step like before, but name the file as a_mercier.xls (the last programmer).

12. The transformation should look like the following:

13. Run the transformation and look at the execution tab window to see what happened:

14. To see which rows belong to which of the created files, open any of them. It should look like this:

What just happened?

You distributed the issues among three programmers.

In the execution window, you could see that 84 rows leave the Add sequence step, and 28 arrive at each of the Excel Output steps, that is, a third of the rows goes to each of them. You verified that when you explored the Excel files.

In the transformation, you added an Add sequence step that did nothing more than add a sequential number to the rows. This sequence helps you recognize that one out of every three rows was sent to each file.

Here you saw a practical example for the distributing option. When you distribute, the destination steps receive the rows in turn. For example, if you have three target steps, the first row goes to the first target step, the second row goes to the second step, the third row goes to the third step, the fourth row now goes to the first step, and so on.
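The turn-taking described above is a round-robin scheme. A minimal plain JavaScript sketch, assuming the 84 numbered rows of the tutorial and a hypothetical distributeRows helper, could look like this:

```javascript
// Hand out rows to the targets in turn: row 1 to target 1, row 2 to target 2, ...
function distributeRows(rows, targetCount) {
  var targets = [];
  for (var t = 0; t < targetCount; t++) {
    targets.push([]);
  }
  rows.forEach(function (row, index) {
    targets[index % targetCount].push(row);
  });
  return targets;
}

// 84 rows numbered like the nr field created by the Add sequence step.
var numbered = [];
for (var nr = 1; nr <= 84; nr++) {
  numbered.push({ nr: nr });
}

var files = distributeRows(numbered, 3);
console.log(files.map(function (file) { return file.length; })); // 28 rows per target
console.log(files[0].slice(0, 3).map(function (row) { return row.nr; })); // 1, 4, 7
```

The first target ends up with the rows numbered 1, 4, 7, and so on, which matches the pattern you can recognize in the nr column of each Excel file.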

As you could see, when distributing, the hop leaving the step from which you distribute is plain; it doesn't change its look and feel.

Although this example shows clearly how the Distribute... method works, this is not how you will regularly use this option. The Distribute... option is mainly used for performance reasons. Throughout this book you will always use the Copy... option. To avoid being asked for the action to take every time you create more than one hop leaving a step, you can set the Copy... option as default; you do this by opening the PDI options window (Edit | Options... from the main menu) and unchecking the option Show "copy or distribute" dialog?. Remember that to see the change applied, you will have to restart Spoon.

Once you have changed this option, the default method is copying rows. If you want to distribute rows, you can change the action by right-clicking the step from which you want to copy or distribute, selecting Data Movement... in the contextual menu that appears, and then selecting the desired option.

Pop quiz – data movement (copying and distributing)

Look at the following transformations:

If you do a preview on the steps named Preview, which of the following is true:

a. The number of rows you see in (a) is greater than or equal to the number of rows you see in (b)

b. The number of rows you see in (b) is greater than or equal to the number of rows you see in (a)

c. The dataset you see in (a) is exactly the same as you see in (b) no matter what data you have in the Excel file.

You can create a transformation and test each option to check the results for yourself. To be sure you understand correctly where and when the rows take one or other way, you can preview every step in the transformation, not just the last one.

Splitting the stream based on conditions

In the previous section you learned to split the main stream of data into two or more streams. The whole dataset was copied or distributed among the destination steps. Now you will learn how to set conditions so that the rows take one way or another depending on those conditions.

Time for action – assigning tasks by filtering priorities with the Filter rows step

Following with the JIRA subject, let's do a more realistic distribution of tasks among programmers. Let's assign the serious tasks to our most experienced programmer, and the remaining tasks to others.

1. Create a new transformation.

2. Read the JIRA file and filter the unassigned tasks, just as you did in the previous tutorial.

3. Add a Filter rows step and two Excel Output steps to the canvas, and link them to the other steps as follows:

4. Configure one of the Excel Output steps to send the fields Priority and Summary to an Excel file named b_bouchard.xls (the name of the senior programmer).

5. Configure the other Excel Output step to send the fields Priority and Summary to an Excel file named new_features_to_develop.xls.

6. Double-click the Filter rows step to edit it.

7. Enter the condition Priority = Critical OR Priority = Severe.

8. From the first drop-down list, Send 'true' data to step, select the step that creates the b_bouchard.xls Excel file.

9. From the other drop-down list, Send 'false' data to step, select the step that creates the new_features_to_develop.xls Excel file.

10. Click OK.

11. The hops leaving the Filter rows step change to show which way a row will take, depending on the result of the condition.

12. Save the transformation.

13. Run the transformation, and verify that the two Excel files were created.

14. The files should look like this:

What just happened?

You sent the list of PDI new features to two Excel files: one file with the critical issues and the other file with the rest of the issues.

In the Filter rows step, you set a condition to evaluate whether the priority of a task was severe or critical. For every row coming to the filter, the condition was evaluated. The rows that had a severe or critical priority were sent toward the Excel Output step that creates the b_bouchard.xls file. The rows with another priority were sent toward the other Excel Output step, the one that creates the new_features_to_develop.xls file.

PDI steps for splitting the stream based on conditions

When you have to make a decision, and upon that decision split the stream in two, you can use the Filter rows step as you did in this last exercise. In this case, the Filter rows step acts as a decision maker. It has a condition and two possible destinations. For every row coming to the filter, the step evaluates the condition. If the result of the condition is true, it sends the row toward the step selected in the first drop-down list of the configuration window: Send 'true' data to step.

If the result of the condition is false, it sends the row toward the step selected in the second drop-down list of the configuration window: Send 'false' data to step.
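The decision made by the Filter rows step can be sketched in plain JavaScript as a function that sends every row to one of two lists. The sample rows and the names filterRows and isSerious are hypothetical; only the condition itself comes from the tutorial.

```javascript
// Evaluate the condition once per row and route the row accordingly.
function filterRows(rows, condition) {
  var result = { trueRows: [], falseRows: [] };
  rows.forEach(function (row) {
    if (condition(row)) {
      result.trueRows.push(row);   // Send 'true' data to step
    } else {
      result.falseRows.push(row);  // Send 'false' data to step
    }
  });
  return result;
}

// The condition of the tutorial: Priority = Critical OR Priority = Severe.
function isSerious(row) {
  return row.Priority === "Critical" || row.Priority === "Severe";
}

// Hypothetical sample rows.
var tasks = [
  { Priority: "Critical", Summary: "Crash on save" },
  { Priority: "Minor", Summary: "Tooltip typo" },
  { Priority: "Severe", Summary: "Broken hop" },
  { Priority: "Medium", Summary: "Align icons" }
];

var split = filterRows(tasks, isSerious);
console.log(split.trueRows.length);  // rows for b_bouchard.xls
console.log(split.falseRows.length); // rows for new_features_to_develop.xls
```

Every row ends up in exactly one of the two outputs, which is the difference between filtering and copying.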

Sometimes you have to make nested decisions; consider the next figure for example:

In the transformation shown in the preceding diagram, the conditions are as simple as testing if a field is equal to a value. In situations like this you have a simpler way of accomplishing the same thing.

Time for action – assigning tasks by filtering priorities with the Switch/Case step

Let's use a Switch/Case step to replace the nested Filter rows steps shown in the preceding diagram.

1. Create a transformation like the following:

2. You will find the Switch/Case step in the Flow category of steps.

To save time, you can take the last transformation you created as the starting point.

3. Note that the hops arriving at the Excel Output steps look strange. They are dotted orange lines. This look and feel shows you that the target steps are unreachable. In this case, it means that you still have to configure the Switch/Case step. Double-click it and fill it in as shown here:

4. Save the transformation and run it.

5. Open the generated Excel files to see that the transformation distributed the tasks among the files based on the given conditions.

What just happened?

In this tutorial you learned to use the Switch/Case step. This step routes rows of data to one or more target steps based on the value encountered in a given field.

In the Switch/Case step configuration window, you told Kettle where to send the row depending on a condition. The condition to evaluate was the equality of the field set in Field name to switch and the value indicated in the grid. In this case, the field name to switch is Priority, and the values against which it will be compared are the different values for priorities: Severe, Critical, and so on. Depending on the values of the Priority field, the rows will be sent to any of the target steps. For example, the rows where Priority=Medium will be sent toward the target step New Features for Federica Costa.

Note that it is possible to specify the same target step more than once.

The Default target step represents the step where the rows that don't match any of the case values are sent. In this example, the rows with a priority not present in the list will be sent to the step New Features without priority.
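The routing performed by the Switch/Case step can be sketched as a lookup table with a fallback. In this plain JavaScript sketch only the Medium case and the default target are taken from the text; the other case values and the target name New Features for Benjamin Bouchard are assumptions made up for illustration, as are the sample rows and the switchCase helper.

```javascript
// Case value -> target step. Only the Medium entry comes from the tutorial text;
// the Critical and Severe entries are assumed, and reuse the same target twice.
var caseTargets = {
  "Critical": "New Features for Benjamin Bouchard",
  "Severe": "New Features for Benjamin Bouchard",
  "Medium": "New Features for Federica Costa"
};
var defaultTarget = "New Features without priority";

// Route each row by comparing one field against the case values.
function switchCase(rows, fieldName, targets, fallback) {
  var routed = {};
  rows.forEach(function (row) {
    var value = row[fieldName];
    var known = Object.prototype.hasOwnProperty.call(targets, value);
    var target = known ? targets[value] : fallback;
    if (!routed[target]) {
      routed[target] = [];
    }
    routed[target].push(row);
  });
  return routed;
}

// Hypothetical sample rows.
var features = [
  { Priority: "Critical", Summary: "Crash on save" },
  { Priority: "Medium", Summary: "Align icons" },
  { Priority: "Severe", Summary: "Broken hop" },
  { Priority: "", Summary: "Untriaged idea" }
];

var routedRows = switchCase(features, "Priority", caseTargets, defaultTarget);
Object.keys(routedRows).forEach(function (target) {
  console.log(target + ": " + routedRows[target].length);
});
```

Compared with nested filters, a single table of values keeps all the decisions in one place.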

Have a go hero – listing languages and countries

Open the transformation you created in the Finding out which language people speak tutorial in Chapter 3. If you run the transformation and check the content of the output file, you'll notice that there are missing languages. Modify the transformation so that it generates two files: one with the rows where there is a language, that is, the rows for which the lookup didn't fail, and another file with the list of countries not found in the countries.xml file.

Pop quiz – splitting a stream

Continuing with the contestant exercise, suppose that the number of interpreters you will hire depends on the number of people that speak each language:

Number of people that speak the language    Number of interpreters
Less than 3                                 1
Between 3 and 6                             2
More than 6                                 3

You want to create a file with the languages with a single interpreter, another file with the languages with two interpreters, and a final file with the languages with three interpreters. Which of the following would solve your situation when it comes to splitting the languages into three output streams:

a. A Number range step followed by a Switch/Case step.

b. A Switch/Case step.

c. Both

In order to figure out the answer, create a transformation and count the number of people that speak each language. You'll have to use a Sort rows step followed by a Group by step. After that, try to develop each alternative solution and see what happens.

Merging streams

You've just seen how the rows of a dataset can take different paths. Here you will learn the opposite: how data coming from different places is merged into a single stream.

Time for action – gathering progress and merging all together

Suppose that you delivered the Excel files you generated in the Assigning tasks by filtering priorities tutorial earlier in the chapter. You gave the b_bouchard.xls file to Benjamin Bouchard, the senior programmer. You also gave the other Excel file to a project leader who is going to assign the tasks to different programmers. Now they are giving you back the worksheets, with a new column indicating the progress of the development. In the case of the shared file, there is also a column with the name of the programmer who is working on every issue. Your task is now to unify those sheets.

Here is what the Excel files look like:

1. Create a new transformation.

2. Drag an Excel Input step to the canvas and read one of the files.

3. Add a Filter rows step to keep only the rows where the progress is not null, that is, the rows belonging to tasks whose development has been started.

4. After the filter, add a Sort rows step, and configure it to order the rows by Progress, in descending order.

5. Add another Excel Input step, read the other file, and filter and sort the rows just like you did before. Your transformation should look like this:

6. From the Transform category of steps, select the Add constants step and drag it onto the canvas.

7. Link the step to the stream that reads B. Bouchard's file; edit the step and add a new field named Programmer, with type String and value Benjamin Bouchard.

8. After this step, add a Select values step and reorder the fields so that they remain in a specific order (Priority, Summary, Programmer, Progress) to resemble the other stream.

9. Now, from the Transform category add an Add sequence step, name the new field ID, and link the step with the Select values step.

10. Create a hop from the Sort rows step of the other stream to the Add sequence step. Your transformation should look like the one shown next:

11. Select the Add sequence step and do a preview. You will see this:

What just happened?

You read two similar Excel files and merged them into one single dataset.

First of all, you read, filtered, and sorted the files as usual. Then you altered the stream belonging to B. Bouchard, so it looked similar to the other. You added the field Programmer, and reordered the fields.

After that, you used an Add sequence step to create a single dataset containing the rows of both streams, with the rows numbered.

PDI options for merging streams

You can create a union of two or more streams anywhere in your transformation. To create a union of two or more data streams, you can use any step. The step unifies the data, takes the incoming streams in any order, and then it completes its task in the same way as if the data came from a single stream.

In the example, you used an Add sequence step as the step to join two streams. The step gathered the rows from the two streams, and then proceeded to number the rows in a new field named ID.
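The whole tutorial (add a constant, reorder the fields, unify, number the rows) can be sketched in plain JavaScript as follows. The sample rows and the helper names addConstant, selectValues, and addSequence are hypothetical, and the sequence is assumed to start at 1. Note that the sketch fixes the order of the union, whereas in PDI that order is not under your control when you unify with an arbitrary step.

```javascript
// Copy a row and append one more field at the end.
function appendField(row, name, value) {
  var out = {};
  Object.keys(row).forEach(function (field) { out[field] = row[field]; });
  out[name] = value;
  return out;
}

// Add constants: the same value for every row.
function addConstant(rows, name, value) {
  return rows.map(function (row) { return appendField(row, name, value); });
}

// Select values: keep the fields in the given order.
function selectValues(rows, fieldOrder) {
  return rows.map(function (row) {
    var out = {};
    fieldOrder.forEach(function (field) { out[field] = row[field]; });
    return out;
  });
}

// Add sequence: number the rows (assumed to start at 1).
function addSequence(rows, name) {
  return rows.map(function (row, index) { return appendField(row, name, index + 1); });
}

// Hypothetical sample rows for the two files.
var bouchardRows = [
  { Priority: "Critical", Summary: "Crash on save", Progress: 80 }
];
var otherRows = [
  { Priority: "Minor", Summary: "Tooltip typo", Programmer: "Federica Costa", Progress: 50 }
];

var layout = ["Priority", "Summary", "Programmer", "Progress"];
var bouchardUnified = selectValues(
  addConstant(bouchardRows, "Programmer", "Benjamin Bouchard"), layout);

var merged = addSequence(bouchardUnified.concat(otherRows), "ID");
merged.forEach(function (row) {
  console.log(Object.keys(row).join(", ") + " -> ID " + row.ID);
});
```

Both kinds of rows leave the last step with the same fields in the same order, plus the new ID field.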

This is only one example of how you can mix streams together. As said, any step can be used to unify two streams. Whichever step you use, the most important thing to keep in mind is that you cannot mix rows that have a different layout. The rows must have the same length, the same data types, and the same fields in the same order.

Fortunately, there is a trap detector that provides warnings at design time if a step is receiving mixed layouts.

You can try this out. Delete the Select values step. Create a hop from the Add constants step to the Add sequence step. A warning message appears as shown next:

In this case, the third field of the first stream, Programmer (String), does not have the same name or the same type as the third field of the second stream, Progress (Number).

Note that PDI warns you but it doesn't prevent you from mixing row layouts when creating the transformation.

If you want Kettle to prevent you from running transformations with mixed row layouts, you can check the option Enable safe mode in the window that shows up when you dispatch the transformation. Keep in mind that doing this will cause a performance drop.
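The kind of check that such a warning implies can be sketched in plain JavaScript: compare two layouts position by position, by name and by type. This is only an illustration of the rule, not PDI's actual trap detector; findLayoutMismatch is a hypothetical helper, and the types of Priority and Summary are assumed to be String.

```javascript
// Return a description of the first mismatch, or null if the layouts match.
function findLayoutMismatch(layoutA, layoutB) {
  if (layoutA.length !== layoutB.length) {
    return "different number of fields";
  }
  for (var i = 0; i < layoutA.length; i++) {
    var a = layoutA[i];
    var b = layoutB[i];
    if (a.name !== b.name || a.type !== b.type) {
      return "field " + (i + 1) + ": " + a.name + " (" + a.type + ") vs " +
             b.name + " (" + b.type + ")";
    }
  }
  return null;
}

// Layout of the stream read from the shared file.
var sharedLayout = [
  { name: "Priority", type: "String" },
  { name: "Summary", type: "String" },
  { name: "Programmer", type: "String" },
  { name: "Progress", type: "Number" }
];

// Layout of Bouchard's stream without the Select values step:
// the constant field Programmer was appended at the end.
var bouchardLayout = [
  { name: "Priority", type: "String" },
  { name: "Summary", type: "String" },
  { name: "Progress", type: "Number" },
  { name: "Programmer", type: "String" }
];

console.log(findLayoutMismatch(sharedLayout, bouchardLayout));
console.log(findLayoutMismatch(sharedLayout, sharedLayout));
```

The first call reports the mismatch in the third field, and the second call returns null because the layouts are identical.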

When you use an arbitrary step to unify, the rows remain in the same order as they were in their original stream, but the streams are joined in any order. Take a look at the example's preview. The rows of Bouchard's stream as well as the rows of the other stream remained sorted within their original groups. However, whether Bouchard's rows appeared before or after the rows of the other stream was just a matter of chance. You didn't decide the order of the streams; PDI decided it for you. If you care about the order in which the union is made, there are some steps that can help you. Here are the options you have:

If you want to ...: Append two or more streams, and don't care about the order
You can do this ...: Use any step. The selected step will take all the incoming streams in any order, and then will proceed with its specific task.

If you want to ...: Append two streams in a given order
You can do this ...: Use the Append streams step from the Flow category. It helps to decide which stream goes first.

If you want to ...: Merge two streams ordered by one or more fields
You can do this ...: Use a Sorted Merge step from the Joins category. This step allows you to decide on which field(s) to order the incoming rows before sending them to the destination step(s). The input streams must be sorted on that field(s).

If you want to ...: Merge two streams keeping the newest when there are duplicates
You can do this ...: Use a Merge Rows (diff) step from the Joins category. You tell PDI the key fields, that is, the fields that say that a row is the same in both streams. You also give PDI the fields to compare when the row is found in both streams. PDI tries to match rows of both streams, based on the key fields. Then it creates a field that will act as a flag, and fills it as follows:

- If a row was only found in the first stream, the flag is set to deleted.
- If a row was only found in the second stream, the flag is set to new.
- If the row was found in both streams, and the fields to compare are the same, the flag is set to identical.
- If the row was found in both streams, and at least one of the fields to compare is different, the flag is set to changed.
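The four flag values of the last option can be illustrated with a simplified, in-memory sketch in plain JavaScript. It only mimics the flag rules listed above; it is not the step's real algorithm. The helper name mergeRowsDiff, the flag field name flag, the sample rows, and the choice of returning the row of the second stream for changed rows are all assumptions.

```javascript
// Compare two lists of rows by key fields and flag every row.
function mergeRowsDiff(first, second, keyFields, compareFields) {
  function keyOf(row) {
    return keyFields.map(function (field) { return String(row[field]); }).join("|");
  }
  function withFlag(row, flag) {
    var out = {};
    Object.keys(row).forEach(function (field) { out[field] = row[field]; });
    out.flag = flag; // "flag" is an assumed field name
    return out;
  }

  var secondByKey = Object.create(null);
  second.forEach(function (row) { secondByKey[keyOf(row)] = row; });

  var matched = Object.create(null);
  var result = [];

  first.forEach(function (row) {
    var key = keyOf(row);
    if (!(key in secondByKey)) {
      result.push(withFlag(row, "deleted")); // only in the first stream
      return;
    }
    matched[key] = true;
    var newer = secondByKey[key];
    var same = compareFields.every(function (field) {
      return row[field] === newer[field];
    });
    result.push(same ? withFlag(row, "identical") : withFlag(newer, "changed"));
  });

  second.forEach(function (row) {
    if (!(keyOf(row) in matched)) {
      result.push(withFlag(row, "new")); // only in the second stream
    }
  });
  return result;
}

// Hypothetical sample rows: an old and a new version of a task list.
var oldTasks = [
  { Summary: "Crash on save", Progress: 50 },
  { Summary: "Tooltip typo", Progress: 10 },
  { Summary: "Align icons", Progress: 0 }
];
var newTasks = [
  { Summary: "Crash on save", Progress: 80 },
  { Summary: "Tooltip typo", Progress: 10 },
  { Summary: "Broken hop", Progress: 20 }
];

var flagged = mergeRowsDiff(oldTasks, newTasks, ["Summary"], ["Progress"]);
flagged.forEach(function (row) {
  console.log(row.Summary + ": " + row.flag);
});
```

Here Summary plays the role of the key field and Progress is the field to compare.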



Let's try one of these options.

Time for action – giving priority to Bouchard by using Append streams

Suppose you want Bouchard's rows before the other rows. You can modify the transformation as follows:

1. From the Flow category of steps, drag an Append streams step to the canvas. Rearrange the steps and hops so the transformation looks like this:

2. Edit the Append streams step and select as the Head hop the one belonging to Bouchard's rows, and as the Tail hop the other. Doing this, you indicate to PDI how it has to order the streams.

3. Do a preview on the Add sequence step. You should see this:

What just happened?

You changed the transformation to give priority to Bouchard's issues.

You did it by using the Append streams step. By indicating that the head hop was the one coming from Bouchard's file, you got the expected order: first the rows with the tasks assigned to Bouchard, sorted by progress descending, and then the rows with the tasks assigned to other programmers, also sorted by progress descending.
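Conceptually, appending streams in a given order is a concatenation in which you choose which stream goes first. A tiny plain JavaScript sketch, with invented sample rows and a hypothetical appendStreams helper:

```javascript
// All the rows of the head stream first, then all the rows of the tail stream.
// The internal order of each stream is preserved.
function appendStreams(head, tail) {
  return head.concat(tail);
}

// Hypothetical rows, each stream already sorted by Progress descending.
var bouchardTasks = [
  { Programmer: "Benjamin Bouchard", Progress: 80 },
  { Programmer: "Benjamin Bouchard", Progress: 30 }
];
var otherTasks = [
  { Programmer: "Federica Costa", Progress: 90 },
  { Programmer: "Adrien Mercier", Progress: 10 }
];

var ordered = appendStreams(bouchardTasks, otherTasks);
ordered.forEach(function (row, index) {
  console.log((index + 1) + ". " + row.Programmer + " " + row.Progress);
});
```

Note that the result is not sorted globally by progress: the row with progress 90 comes after Bouchard's rows, because the head stream always goes first.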

Whether you use arbitrary steps or some of the special steps mentioned here to merge streams, don't forget to verify the layouts of the streams you are merging. Pay attention to the warnings of the trap detector and avoid mixing row layouts.

Have a go hero – sorting and merging all tasks

Modify the previous exercise so that the final output is sorted by priority. Try two possible solutions:

- Sort the input streams on their own and then use a Sorted Merge step.
- Merge the stream with a Dummy step and then sort.

Which one do you think would give the best performance?

Refer to the Sort rows step issues in Chapter 3.

In which circumstances would you use the other option?

Have a go hero – trying to find missing countries

As you saw in the countries exercises, there are missing countries in the countries.xml file. In fact, the countries are there, but with different names. For example, Russia in the contestant file is Russian Federation in the XML file. Modify the transformation that looks for the language. Split the stream in two: one for the rows where a language was found and the other for the rows where no language was found. For this last stream, use a Value Mapper step to rename the countries you identified as wrong, that is, rename Russia as Russian Federation. Then look again for a language now with the new name. Finally, merge the two streams and create the output file with the result.



Summary

In this chapter, you learned different options that PDI offers to combine or split flows of data.

The chapter covered the following:

- Copying and distributing rows
- Splitting streams based on conditions
- Merging independent streams in different ways

With the concepts you learned in the initial chapters, the range of tasks you are able to perform is already broad. In the next chapter, you will learn how to insert JavaScript code in your transformations not only as an alternative to perform some of those tasks, but also as a way to accomplish other tasks that are complicated or even unthinkable to carry out with regular PDI steps.



Chapter 5: Transforming Your Data with JavaScript Code and the JavaScript Step

Whichever transformation you need to do on your data, there is a good chance that PDI steps are able to do the job. Despite that, it may happen that there are no proper steps that serve your requirements, or that an apparently minor transformation consumes a lot of steps linked in a very confusing arrangement that is difficult to test or understand. Putting colorful icons here and there is fun and practical, but there are some situations like the ones described above where you inevitably will have to code. This chapter explains how to do it with JavaScript and the special JavaScript step.

In this chapter you will learn how to:

- Insert and test JavaScript code in your transformations
- Distinguish situations where coding is the best option from those where there are better alternatives

Doing simple tasks with the JavaScript step

One of the traditional steps inside PDI is the JavaScript step that allows you to code inside PDI. In this section you will learn how to use it for doing simple tasks.



Time for action – calculating scores with JavaScript

The International Musical Contest mentioned in Chapter 4 has already taken place. Each duet performed twice. The first time technical skills were evaluated, while in the second, the focus was on artistic performance.

Each performance was assessed by a panel of five judges who awarded a mark out of a possible 10.

The following is the detailed list of scores:

Note that the fields don't fit in the screen, so the lines are wrapped and dotted lines are added for you to distinguish each line.

Now you have to calculate, for each evaluated skill, the overall score as well as an average score.

1. Download the sample file from the Packt website.

2. Create a transformation and drag a Fixed file input step to the canvas to read the file.

3. Fill the configuration window as follows:

4. Press the Get Fields button. A window appears to help you define the columns.

5. Click between the fields to add markers that define the limits. The window will look like this:

6. Click on Next >. A new window appears for you to configure the fields.

7. Click on the first field at the left of the window and change the name to Performance. Verify that the type is set to String.

8. To the right, you will see a preview of the data for the field.

9. Select each field to the left of the window, change the names, and adjust the types. Set ID, Country, Duet, and Skill fields as String, and fields from Judge 1 to Judge 5 as Integer.

10. Go back and forth between these two windows as many times as you need until you are done with the definitions of the fields.

11. Click on Finish.

12. The grid at the bottom is now filled.

13. Set the column Trim type to both for every field.

14. The window should look like the following:

15. Click on Preview the transformation. You should see this:

16. From the Scripting category of steps, select a Modified Java Script Value step and drag it to the canvas.

17. Link the step to the Fixed file input step, and double-click it to configure it.

18. Most of the configuration window is blank, which is the editing area. Type the following text in it:

var totalScore;
var wAverage;
totalScore = Judge1 + Judge2 + Judge3 + Judge4 + Judge5;
wAverage = 0.35 * Judge1 + 0.35 * Judge2
         + 0.10 * Judge3 + 0.10 * Judge4 + 0.10 * Judge5;

19. Click on the Get variables button.

20. The grid under the editing area gets filled with the two variables defined in the code. The window looks like this:

21. Click on OK.

22. Keep the JavaScript step selected and do a preview.

23. This is what the final data looks like:

What just happened?

You read the detailed list of scores and added two fields with the overall score and an average score for each evaluated skill.

In order to read the file, you used a step you hadn't used before: the Fixed file input step. You configured the step with the help of a wizard. You could have also filled the field grid manually if you wanted to.

After reading the file, you used a JavaScript step to create new fields. The code you typed was pure JavaScript code. In this case, you typed simple code to calculate the total score and a weighted average combining the fields from Judge 1 to Judge 5.

Note that the average was defined by giving more weight, that is, more importance, to the scores coming from Judge 1 and Judge 2.

For example, consider the first line of the file. This is how the new fields were calculated:

totalScore = Judge1 + Judge2 + Judge3 + Judge4 + Judge5 = 8 + 8 + 9 + 8 + 9 = 42

wAverage = 0.35*Judge1 + 0.35*Judge2 + 0.10*Judge3 + 0.10*Judge4 + 0.10*Judge5 = 0.35*8 + 0.35*8 + 0.10*9 + 0.10*8 + 0.10*9 = 8.2
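You can check this arithmetic outside PDI with a few lines of plain JavaScript. In the sketch below the five scores of the first line are assigned by hand to variables named like the fields; inside the JavaScript step those values would come from the incoming row instead.

```javascript
// Scores of the first line of the file, typed in by hand for this check.
var Judge1 = 8, Judge2 = 8, Judge3 = 9, Judge4 = 8, Judge5 = 9;

var totalScore;
var wAverage;

totalScore = Judge1 + Judge2 + Judge3 + Judge4 + Judge5;
wAverage = 0.35 * Judge1 + 0.35 * Judge2
         + 0.10 * Judge3 + 0.10 * Judge4 + 0.10 * Judge5;

console.log(totalScore);          // 42
console.log(wAverage.toFixed(1)); // 8.2
```

The weights add up to 1 (0.35 + 0.35 + 0.10 + 0.10 + 0.10), so the weighted average stays on the same 0 to 10 scale as the individual marks. The result is rounded to one decimal for display because floating-point arithmetic may produce a value that differs from 8.2 in the last digits.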

In order to add these new fields to your dataset, you brought them to the grid at the bottom of the window.

Note that this is not the only way to do calculations in PDI. All you did with the JavaScript step can also be done with other steps.

Using the JavaScript language in PDI

JavaScript is a scripting language primarily used in website development. However, inside PDI you use just the core language; you neither run a web browser nor do you care about HTML.

There are many available JavaScript engines. PDI uses the Rhino engine, from Mozilla. Rhino is an open source implementation of the core JavaScript language; it doesn't contain objects or methods related to manipulation of web pages. If you are interested in knowing more about Rhino, you can visit https://developer.mozilla.org/en/Rhino_Overview.

The core language is not too different from other languages you might know. It has basic statements, block statements (statements enclosed by curly brackets), conditional statements (if..else and switch case), and loop statements (for, do..while, and while). If you are interested in the language itself, you can access a good JavaScript guide following this link: https://developer.mozilla.org/En/Core_JavaScript_1.5_Guide.

Besides the basics, an interesting feature included in the PDI implementation is E4X, a programming language extension that allows you to manipulate XML objects inside JavaScript. You can find an E4X tutorial as well as a reference manual at https://developer.mozilla.org/En/E4X/Processing_XML_with_E4X.

Finally, there is a complete tutorial and reference at http://www.w3schools.com/js/. Despite being quite oriented to web development, which is not your concern, it is clear, complete, and has plenty of examples.

Inserting JavaScript code using the Modified Java Script Value step

The Modified Java Script Value step (JavaScript step in short) allows you to insert JavaScript code inside your transformation. The code you type here is executed once per row coming to the step.

Let's explore its dialog window.

Most of the window is occupied by the editing area. It's there that you write JavaScript code using the standard syntax of the language and the functions and fields from the tree to the left of the window.

The Transform Functions branch of the tree contains a rich list of functions, ready to use.

The functions are grouped by category.

- String, Numeric, Date, and Logic categories contain usual JavaScript functions.

  This is not a full list of JavaScript functions. You are allowed to use JavaScript functions even if they are not in this list.

- The Special category contains a mix of utility functions. Most of them are not JavaScript functions but Kettle functions. You will use some of them later in this chapter.

- Finally, the File category, as its name suggests, contains a list of functions that do simple verifications or actions related to files and folders, for example, fileExist() or createFolder().

To add a function to your script, double-click on it, drag it to the location in your script where you wish to use it, or just type it.

If you are not sure about how to use a particular function or what a function does, just right-click on the function and select Sample. A new script window appears with a description of the function and sample code showing how to use it.

The Input fields branch contains the list of the fields coming from previous steps. To see and use the value of a field for the current row, you need to double-click on it or drag it to the code area. You can also type it by hand as you did in the tutorial.

When you use one of these fields in the code, it is treated as a JavaScript variable. As such, the name of the field has to follow the conventions for a variable name: for example, it cannot contain dots, and it must start with a letter, an underscore, or a dollar sign.

As Kettle is quite permissive with names, you can have fields in your stream whose names are not valid to be used inside JavaScript code.

If you intend to use a field with a name that doesn't follow the name rules, rename it just before the JavaScript step with a Select values step. If you use that field without renaming it, you will not be warned when coding, but you'll get an error or unexpected results when you execute the transformation.
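The naming rule can be checked before the fields reach the JavaScript step. What follows is a minimal sketch in plain JavaScript that runs outside PDI. The helper isValidJsFieldName is hypothetical (it is not a Kettle function), the field names are made up, and the check only covers ASCII names without looking at reserved words.

```javascript
// Hypothetical helper, not part of PDI: returns true when a field name
// can be used directly as a JavaScript variable name.
// Assumption: only ASCII letters, digits, underscore, and dollar sign are
// considered, and reserved words such as "var" are not checked.
function isValidJsFieldName(name) {
  return /^[A-Za-z_$][A-Za-z0-9_$]*$/.test(name);
}

// Field names as they might appear in a stream (illustrative values).
var fieldNames = ['Judge1', 'total.score', '1stScore', 'Skill'];

// Names that would need a Select values step before the JavaScript step.
var needRenaming = fieldNames.filter(function (name) {
  return !isValidJsFieldName(name);
});
console.log(needRenaming);
```

Any name reported by such a check is a candidate for renaming with a Select values step.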

The Output fields branch is a list of the fields that will leave the step.



Adding fields

At the bottom of the window, there is a grid where you put the fields you created in the code. This is how you add a new field:

1. Define the field as a variable in the code, for example, var totalScore.
2. Fill the grid manually or by clicking the Get variables button. A new row will be filled for every variable you defined in the code.

That was exactly what you did for the new fields, totalScore and wAverage.

In the JavaScript code you can create and use all the variables you need without declaring them. However, if you intend to add a variable as a field in your stream, the declaration with the var statement is mandatory.

The variables you define in the JavaScript code are not Kettle variables. JavaScript variables are local to the step, and have nothing to do with the Kettle variables you know.

Modifying fields

Instead of adding a field, you may want to change the value, and possibly the data type, of an existing field. You can do that, but not directly in the code.

Imagine that you wanted to change the field Skill, converting it to uppercase. To accomplish this, double-click the JavaScript step and add the following two lines:

var uSkill;
uSkill = upper(Skill);

Add the new field to the grid at the bottom:

By renaming uSkill to Skill and setting the Replace value 'Fieldname' or 'Rename to' to Y, the uSkill field is renamed to Skill and replaces the old Skill field.

Don't use the setValue() function to change existing fields. It may cause problems and remains just for compatibility reasons.

Turning on the compatibility switch

In the JavaScript window, you might have seen the Compatibility mode checkbox. This checkbox, unchecked by default, causes JavaScript to work like it did in version 2.5 of the JavaScript engine. With that version, you could modify the values and their types directly in the code, which allows mixing data types, thus causing many problems.

Old JavaScript programs run in compatibility mode. However, when creating new code, you should make use of the new engine; that is, you should leave the compatibility mode turned off.

Do not check the compatibility switch. Leaving it unchecked, you will have cleaner, faster, and safer code.

Have a go hero – adding and modifying fields to the contest data

Take the contest file as source and do the following:

Add a field named average. For the first performance, calculate the average as a weighted average, just like you did in the tutorial. For the second performance, calculate the field as a regular average, that is, the sum of the five scores divided by five.

Modify the Performance field. Replace Duet 1st Performance and Duet 2nd Performance by 1st and 2nd.

There is no single way to code this, but here you have a list of functions or statements you can use: if...else, indexOf(), substr()

Testing your code

After you type a script, you may want to test it. You can do it from inside the JavaScript configuration window. Let's try it:



Time for action – testing the calculation of averages

Let's test the code you've just created.

1. Double-click the JavaScript step.
2. Click on the Test script button.
3. A window appears to create a set of rows for testing. Fill it like here:
4. Click on Preview the transformation. A window appears showing five identical rows with the provided sample values. Close the preview window.
5. Click on OK to test the code.

A window appears with the result that will appear when we execute the script with the test data.

What just happened?

You tested the code of the JavaScript step.

You clicked on the Test script button, and created a dataset that served as the basis for testing the script. You previewed the test dataset.

After that, you did the test itself. A window appeared showing you what the created dataset looks like after the execution of the script: the totalScore and wAverage fields were added, and the Skill field was converted to uppercase.

Testing the script using the Test script button

The Test script button allows you to check that the script does what it is intended to do. It actually generates a transformation in the background with two steps: a Generate Rows step sending data to a copy of the JavaScript step. Just after clicking on the button, you are allowed to fill the Generate Rows window with the test dataset.

The first thing that the test function does is to verify that the code is properly written; that is, that there are no syntax errors in the code. Try deleting the last parenthesis in the code and click on the Test script button. When you click OK to see the result of the execution, instead of a dataset you will see an error window.

If the script is syntactically correct, what follows is the preview of the JavaScript step of the background transformation, that is, the JavaScript code applied to the test dataset.

If you don't see any error and the previewed data shows the expected results, you are done. If not, you can check the code, fix it, and test it again until you see that the step works properly.

Have a go hero – testing the new calculation of the average

Open the transformation of the previous Hero section, and test:

The weighted average code

The regular code

To test one or the other, simply change the test data. Don't touch your code!



Enriching the code

In the previous section, you learned how to insert code in your transformation by using a JavaScript step. In this section, you will see how to use variables from outside to give flexibility to your code. You will also learn how to take control of the rows from inside the JavaScript step.

Time for action – calculating flexible scores by using variables

Suppose that by the time you are creating the transformation, the weights for calculating the weighted average are unknown. You can modify the transformation by using parameters. Let's do it:

1. Open the transformation of the previous section and save it with a new name.
2. Press Ctrl+T to open the Transformation properties dialog window.
3. Select the Parameters tab and fill it like here:
4. Replace the JavaScript step by a new one and double-click it.
5. Expand the Transform Scripts branch of the tree at the left of the window.
6. Right-click the script named Script 1, select Rename, and type main as the new name.

7. Position the mouse cursor over the editing window and right-click to bring up the following contextual menu:
8. Select Add new to add the script, which will execute before your main code.
9. A new script window appears. The script is added to the list of scripts under Transform Scripts.
10. Bring up the contextual menu again, but this time clicking on the title of the new script. Select Set Start Script.
11. Right-click the script in the tree list, and rename the new script as Start.
12. In the editing area of the new script, type the following code to bring the transformation parameters to the JavaScript code:

w1 = str2num(getVariable('WEIGHT1',0));
w2 = str2num(getVariable('WEIGHT2',0));
w3 = str2num(getVariable('WEIGHT3',0));
w4 = str2num(getVariable('WEIGHT4',0));
w5 = str2num(getVariable('WEIGHT5',0));
writeToLog('Getting weights...');

13. Select the main script by clicking on its title and type the following code:

var wAverage;
wAverage = w1 * Judge1 + w2 * Judge2
         + w3 * Judge3 + w4 * Judge4 + w5 * Judge5;
writeToLog('row:' + getProcessCount('r') + ' wAverage:' + num2str(wAverage));
if (wAverage >= 7)
    trans_Status = CONTINUE_TRANSFORMATION;
else
    trans_Status = SKIP_TRANSFORMATION;

14. Click Get variables to add the wAverage variable to the grid.
15. Close the JavaScript window.
16. With the JavaScript step selected, click on the Preview this transformation button.
17. When the preview window appears, click on Configure.
18. In the window that shows up, modify the parameters as follows:
19. Click Launch.

20. The preview window shows this data:

21. The log window shows this:

...
2009/07/23 14:46:54 - wAverage with Param..0 - Getting weights...
2009/07/23 14:46:54 - wAverage with Param..0 - row:1 wAverage:8
2009/07/23 14:46:54 - wAverage with Param..0 - row:2 wAverage:8
2009/07/23 14:46:54 - wAverage with Param..0 - row:3 wAverage:7.5
2009/07/23 14:46:54 - wAverage with Param..0 - row:4 wAverage:8
2009/07/23 14:46:54 - wAverage with Param..0 - row:5 wAverage:7.5
...

What just happened?

You modified the code of the JavaScript step to use parameters.

First, you created five parameters for the transformation, containing the weights for the calculation.

Then in the JavaScript step, you created a Start script to read the variables. That script executed once, before the main script. Note that you didn't declare the variables. You could have done it, but it's not mandatory unless you intend to add them as output fields.

In the main script, the script that is executed for every row, you typed the code to calculate the average by using those variables instead of fixed numbers.

After the calculation of the average, you kept only the rows for which the average was greater than or equal to 7. You did it by setting the value of trans_Status to CONTINUE_TRANSFORMATION for the rows you wanted to keep, and to SKIP_TRANSFORMATION for the rows you wanted to discard.

In the preview window, you could see that the average was calculated as a weighted average of the scores you provided, and that only the rows with an average greater than or equal to 7 were kept.

Using named parameters

The parameters that you put in the transformation dialog window are called named parameters. They can be used throughout the transformation as regular variables, as if you had created them before, for example, in the kettle.properties file.

From the point of view of the transformation, the main difference between variables defined in the kettle.properties file and named parameters is that the named parameters have a default value that can be changed at the time you run the transformation.

In this case, the default values for the variables defined as named parameters WEIGHT1 to WEIGHT5 were 0.35, 0.35, 0.10, 0.10, and 0.10, the same that you had used in previous exercises. But when you executed, you changed the default and used 0.50, 0.50, 0, 0, and 0 instead. This caused the formula for calculating the weighted average to work as an average of the first two scores. Take, for example, the numbers for the first row of the file. Consider the following code line:

wAverage = w1 * Judge1 + w2 * Judge2 + w3 * Judge3 + w4 * Judge4 + w5 * Judge5;

It was calculated as:

wAverage = 0.50 * 8 + 0.50 * 8 + 0 * 9 + 0 * 8 + 0 * 9;

giving a weighted average equal to 8.
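The effect of overriding the defaults can be reproduced outside PDI. The following is a small sketch in plain JavaScript; the function weightedAverage and the variable names are illustrative, while the scores and weights are the ones quoted in the text.

```javascript
// Weighted average of the judge scores; weights and scores are arrays
// of the same length (illustrative helper, not a Kettle function).
function weightedAverage(weights, scores) {
  var total = 0;
  for (var i = 0; i < scores.length; i++) {
    total += weights[i] * scores[i];
  }
  return total;
}

var firstRow = [8, 8, 9, 8, 9];                       // Judge1 to Judge5
var defaultWeights = [0.35, 0.35, 0.10, 0.10, 0.10];  // named parameter defaults
var runtimeWeights = [0.50, 0.50, 0, 0, 0];           // values typed at run time

console.log(weightedAverage(runtimeWeights, firstRow));            // 8
console.log(weightedAverage(defaultWeights, firstRow).toFixed(2)); // 8.20
```

With the run-time weights only the first two judges count, which is why the result equals their plain average.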

Note that the named parameters are ready to use throughout the transformation as regular variables. You can see and use them at any place where the icon with the dollar sign is present.

If you want to use a named parameter or any other Kettle variable such as LABSINPUT or java.io.tmpdir inside the JavaScript code, you have to use the getVariable() function as you did in the Start script.

When you run the transformation from the command line, you also have the possibility to specify values for named parameters. For details about this, check Appendix B.

Using the special Start, Main, and End scripts

The JavaScript step allows you to create multiple scripts. The Transform Scripts list displays all the scripts of the step.

In the tutorial, you added a special script named Start and used it to read the variables. The Start Script is a script that executes only once, before the execution of the main script you already know.

The Main script, the script that is created by default, executes for every row. As this script is executed after the start script, all variables defined in the start script are accessible in the main script. As an example of this, in the tutorial you used the start script to set values for the variables w1 through w5. Then in the main script you used those variables.

It is also possible to have an End Script that executes at the end of the execution of the step, that is, after the main script has been executed for all rows.

When you create a Start or an End script, don't forget to give it a name so that you can recognize it. If you don't, you may get confused because nothing in the step shows you the type of the scripts.

Beyond main, start, and end scripts, you can use extra scripts to avoid overloading the main script with code. The code in the extra scripts will be available after the execution of the special function LoadScriptFromTab().

Note that in the exercises, you wrote some text to the log by using the writeToLog() function. That had the only purpose of showing you that the start script executed at the beginning and the main script executed for every row. You can see this sequence in the execution log.
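The order of execution described above can be pictured with a short simulation. This is only a sketch in plain JavaScript, not Kettle code: runStep, the context object, and the sample rows are all hypothetical and simply mimic one Start call, one Main call per row, and one End call.

```javascript
// Plain JavaScript simulation (not Kettle code) of the order in which the
// Start, Main, and End scripts run. All names here are illustrative.
function runStep(rows, scripts) {
  var context = {};                    // variables shared by the three scripts
  var log = [];
  scripts.start(context, log);         // executed once, before the first row
  rows.forEach(function (row) {
    scripts.main(context, row, log);   // executed once per row
  });
  scripts.end(context, log);           // executed once, after the last row
  return log;
}

var executionLog = runStep([{ Judge1: 8 }, { Judge1: 6 }], {
  start: function (ctx, log) { ctx.w1 = 0.5; log.push('Getting weights...'); },
  main: function (ctx, row, log) { log.push('row value: ' + ctx.w1 * row.Judge1); },
  end: function (ctx, log) { log.push('Done'); }
});
console.log(executionLog.join('\n'));
```

The value set in the start function is visible in every call to the main function, which mirrors how w1 through w5 were used in the tutorial.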

Using transformation predefined constants

In the tree to the left-hand side of the JavaScript window, under Transformation Constants, you have a list of predefined constants. You can use those constants to change the value of the predefined variable, trans_Status, such as:

trans_Status = SKIP_TRANSFORMATION

Here is how it works:

Value of the trans_Status variable    Effect on the current row
SKIP_TRANSFORMATION                   The current row is removed from the dataset
CONTINUE_TRANSFORMATION               The current row is retained
ERROR_TRANSFORMATION                  The current row causes abortion of the transformation

In other words, you can use those constants to control what will happen to the rows. In the exercise you put:

if (wAverage >= 7)
    trans_Status = CONTINUE_TRANSFORMATION;
else
    trans_Status = SKIP_TRANSFORMATION;

This means a row where the average is greater than or equal to 7 will continue its way to the following steps. On the contrary, a row with a lower average will be discarded.
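To see the filtering effect in isolation, here is a minimal sketch in plain JavaScript that runs outside PDI. The two constants are stand-ins defined only for this example (in PDI they are predefined by the step), statusFor is a hypothetical helper, and the averages are made-up values.

```javascript
// Illustrative stand-ins; in PDI these constants are predefined by the step.
var SKIP_TRANSFORMATION = 'skip';
var CONTINUE_TRANSFORMATION = 'continue';

// Mimics the main script: decides the status for one row.
function statusFor(wAverage) {
  return wAverage >= 7 ? CONTINUE_TRANSFORMATION : SKIP_TRANSFORMATION;
}

var averages = [8, 6.5, 7, 5];   // made-up sample values

// Only the rows whose status is CONTINUE_TRANSFORMATION reach the next step.
var kept = averages.filter(function (wAverage) {
  return statusFor(wAverage) === CONTINUE_TRANSFORMATION;
});
console.log(kept);   // keeps 8 and 7
```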

Pop quiz – finding the 7 errors

Look at the following screenshot:

Does it look good? Well, it is not. There are seven errors in it. Can you find them?

Have a go hero – keeping the top 10 performances

Modify the last tutorial. By using a JavaScript step, keep the top 10 performances, that is,

the 10 performances with the best average.

Sort the data using a regular Sort rows step. Give the getProcessCount() function a try.

Have a go hero – calculating scores with Java code

If you are a Java programmer, or just curious, you will like to know that you can access Java libraries from inside the JavaScript step. On the book site there is a JAR file named pdi_chapter_5.jar. The JAR file contains a class with two methods, w_average() and r_average(), for calculating a weighted average and a regular average.

Here is what you have to do:

1. Download the file from Packt's site, copy it to the libext folder inside the PDI installation folder, and restart Spoon.
2. Replace the JavaScript calculation of the averages by a call to one of these methods. You'll have to specify the complete name of the class. Consider the next line for example:

wAverage = Packages.Averages.w_average(Judge1, Judge2, Judge3, Judge4, Judge5);

3. Preview the transformation and verify that it works properly.

The Java file is available as well. You can change it by adding new methods and trying them from PDI.

Likewise, you can try using any Java objects, as long as they are in PDI's classpath. Don't forget to type the complete name as in the following examples:

java.lang.Character.isDigit(c);
var my_date = new java.util.Date();
var val = Math.floor(Math.random()*100);

Reading and parsing unstructured files

It is marvelous to have input files where the information is well formed; that is, the number of columns and the type of its data is precise, all rows follow the same pattern, and so on. However, it is common to find input files where the information has little or no structure, or the structure doesn't follow the matrix (n rows by m columns) you expect. In this section you will learn how to deal with such files.

Time for action – changing a list of house descriptions with JavaScript

You won the lottery and decided to invest the money in a new house. You asked a real-estate agency for a list of candidate houses for you and it gave you this:

...
Property Code: MCX-011
Status: Active
5 bedrooms
5 baths
Style: Contemporary
Basement
Laundry room
Fireplace
2 car garage
Central air conditioning
More Features: Attic, Clothes dryer, Clothes washer, Dishwasher
Property Code: MCX-012
4 bedrooms
3 baths
Fireplace
Attached parking
More Features: Alarm System, Eat in Kitchen, Powder Room
Property Code: MCX-013
3 bedrooms
...

You want to compare the properties before visiting them, but you're finding it hard to do so because the file doesn't have a precise structure. Fortunately, you have the JavaScript step, which will help you to give the file some structure.

1. Create a new transformation.
2. Get the sample file from Packt site and read it with a Text file input step. Uncheck the Header checkbox and create a single field named text.
3. Do a preview. You should see the content of the file under a single column named text. Add a JavaScript step after the input step and double-click it to edit it.
4. In the editing area, type the following JavaScript code to create a field with the code of the property:

var prop_code;
posCod = indexOf(text,'Property Code:');
if (posCod>=0)
    prop_code = trim(substr(text,posCod+15));

5. Click Get variables to add the prop_code variable to the grid under the code.
6. Click OK.
7. With the JavaScript step selected, do a preview. You should see this:

What just happened?

You read a file where each house was described in several rows. You added to every row the code of the house to which that row belonged. In order to obtain the property code, you identified the lines with a code, and then you cut the Property Code: text with the substr function and discarded the leading spaces with trim.

Looking at previous rows

The code you wrote may seem a little strange at the beginning, but it is not. It creates a variable named prop_code, which will be used to create a new field to identify the properties. When the JavaScript code detects a property header row such as Property Code: MCX-002, it sets the variable prop_code to the code it finds in that line, MCX-002 in this case.

Until a new header row appears, the prop_code variable keeps that value. Thus all the rows following a row like the one shown above will have the same value for the variable prop_code.

The variable is then used to create a new field, which will contain, for every row, the code for the house to which it belongs.

This is an example of how you can keep values from previous rows to be used in the current row.

Note that here you use JavaScript to see and use values from previous rows, but you can't modify them! JavaScript always works on the current row.
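The carry-forward behavior can be tried outside PDI with standard string methods. The sketch below is plain JavaScript, not the Kettle script itself: processLine and the lines array are illustrative, and indexOf, substring, and trim stand in for the Kettle functions of the same purpose.

```javascript
// Plain JavaScript sketch (outside PDI) of carrying a value forward from a
// previous row. propCode lives outside the per-row function, so it keeps
// its value until the next header line replaces it.
var HEADER = 'Property Code:';
var propCode = null;

function processLine(text) {
  var posCod = text.indexOf(HEADER);
  if (posCod >= 0) {
    // Keep what follows the header label, without surrounding spaces.
    propCode = text.substring(posCod + HEADER.length).trim();
  }
  return { text: text, prop_code: propCode };
}

// A few lines taken from the sample listing.
var lines = [
  'Property Code: MCX-011',
  'Status: Active',
  '5 bedrooms',
  'Property Code: MCX-012',
  '4 bedrooms'
];
var rows = lines.map(processLine);
rows.forEach(function (row) {
  console.log(row.prop_code + ' | ' + row.text);
});
```

Each output row carries the code of the most recent header line, while the earlier rows themselves are never changed.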

Have a go hero – enhancing the houses file

Modify the exercise from the tutorial by doing the following:

1. After keeping the property code, discard the rows that headed each property description.
2. Create two new fields named feature and description. Fill the feature field with the feature described in the row (Exterior construction) and the description field with the description of that feature (Brick). If you think that it is not worth keeping some features (Living Room), you may discard some rows. Also discard the original field text. Here you have a sample house description showing a possible output after the changes:

prop_code; Feature; Description
MCX-023;bedrooms;4
MCX-023;baths;4
MCX-023;Style;Colonial
MCX-023;family room;yes
MCX-023;basement;yes
MCX-023;fireplace;yes
MCX-023;Heating features;Hot Water Heater
MCX-023;Central air conditioning present;yes
MCX-023;Exterior construction;Stucco
MCX-023;Waterview;yes
MCX-023;More Features;Attic, Living/Dining Room, Eat-In-Kitchen

Have a go hero – fill gaps in the contest file

Take a look at the contest file. Each performance occupies two rows, one showing each evaluated skill. The name of the country appears only in the first row.

Open the first version of the contest transformation and modify it to fill the column Country where it is blank.

Avoiding coding by using purpose-built steps

You saw through the exercises how powerful the JavaScript step is for helping you in your transformations. In older versions of PDI, coding JavaScript was the only means you had for doing specific tasks. In the latest releases of PDI, actual steps appeared that eliminate the need for coding in many cases. Here you have some examples of that:

Formula: You saw it in Chapter 3. Before the appearance of this step, there were a lot of functions, such as the text functions, that you could only solve with JavaScript.

Analytic Query: This step offers a way to retrieve information from rows before or after the current.

Split field to rows: The step is used to create several rows from a single string value. You used this step in Chapter 3 to create a new row for each word found in a file.

Analytic Query and Split field to rows are examples of steps that eliminated not only the need for coding, but also the need for accessing internal objects and functions such as Clone() or putRow() that you probably saw in old sample code or when browsing the PDI forum. The use of those objects and functions can lead to odd behavior and data corruption, and so their use is strongly discouraged.



Despite the appearance of new steps, you still have the choice to do the tasks with code. In fact, quite a lot of tasks you do with regular PDI steps may also be done with JavaScript, by using the JavaScript step. This is a temptation to programmers who end up with transformations having plenty of JavaScript steps.

Whenever there is a step that does what you want to do, you should prefer that step to coding.

Why should you prefer to use a specific step rather than code? Here are some reasons:

Coding takes more time to develop. You don't have to waste your time coding if there are steps that can solve your problem.

Code is hard to maintain. If you have to modify or fix a transformation, it will be much easier to tackle the change if the transformation is a bunch of colorful steps with meaningful names than if the transformation consists of just a couple of JavaScript icons.

A bunch of icons is self-documented. A JavaScript step is like Pandora's box. Until you open it, you don't know exactly what it does, or whether it contains just a line of code or thousands.

JavaScript is inherently slow. Faster alternatives for simple expressions are the User Defined Java Expression and Calculator steps. They are typically more than twice as fast. The next PDI release will feature a User Defined Java Class step. One of the purposes of this step, intended to be used by Java developers, is to overcome the drawbacks of JavaScript.

On the contrary, there are situations where you may prefer or have to use JavaScript. Let's enumerate some of them:

To handle unstructured input data

For accessing Java libraries

When you need to use a function provided by the JavaScript language that is not provided by any of the regular PDI steps

When the JavaScript code saves a lot of regular PDI steps (as well as screen space), and you think it is not worth showing the details of what those steps do

In the end, it is up to you to choose one or the other option. The following exercise will help you a little in the recognition of pros and cons.



Have a go hero – creating alternative solutions

Redo the following Hero exercises you did in this chapter:

Adding and modifying fields to the contest data
Keeping the top 10 performances
Enhancing the houses file
Filling gaps in the contest file

Do these exercises without using JavaScript when possible. In each case, compare both versions, keeping in mind the following:

Time to develop
Maintenance
Documentation
Capability to handle unstructured data
Number of steps required
Performance

Decide which option you would choose.

To keep the first 10 performances, use an Add Sequence step. To fill the gaps, use an Analytic Query step.

Summary

In this chapter, you learned to code JavaScript into PDI. Specifically, you learned:

What the JavaScript step is and how to use it
How to modify fields and add new fields to your dataset from inside your JavaScript step
How to deal with unstructured input data

You also considered the pros and cons of coding JavaScript inside your transformations, as well as alternative ways to do things, avoiding writing code when possible.

As a bonus, you learned the concept of named parameters.

If you feel confident with all you've learned until now, you are certainly ready to move on to the next chapter, where you will learn in a simple fashion how to solve some sophisticated problems such as normalizing data from pivot tables.



Chapter 6: Transforming the Row Set

So far, you have been working with simple datasets, that is, datasets where each row represented a different entity (for example a student) and each column represented a different attribute for that entity (for example student name). There are occasions when your dataset doesn't resemble such a simple format, and working with it as is may be complicated or even impossible. On other occasions your data simply does not have the structure you like or the structure you need.

Whatever your situation, you have to transform the dataset into an appropriate format, and the solution is not always about changing or adding fields, or about filtering or adding rows. Sometimes it has to do with twisting the whole dataset. In this chapter you will learn how to:

Convert rows to columns
Convert columns to rows
Operate on sets of rows

You will also be introduced to a core subject in data warehousing: Time dimensions.

Converting rows to columns

In most datasets each row belongs to a different element such as a different match or a different student. However, there are datasets where a single row doesn't completely describe one element. Take, for example, the real-estate file from Chapter 5. Every house was described through several rows. A single row gave incomplete information about the house. The ideal situation would be one in which all the attributes for the house were in a single row. With PDI you can convert the data into this alternative format. You will learn how to do it in this section.



Time for action – enhancing a films file by converting rows to columns

In this tutorial we will work with a file that contains a list of all French movies ever made. Each movie is described through several rows. This is what it looks like:

...
Caché
Year: 2005
Director:Michael Haneke
Cast: Daniel Auteuil, Juliette Binoche, Maurice Bénichou

Jean de Florette
Year: 1986
Genre: Historical drama
Director: Claude Berri
Produced by: Pierre Grunstein
Cast: Yves Montand, Gérard Depardieu, Daniel Auteuil

Le Ballon rouge
Year: 1956
Genre: Fantasy | Comedy | Drama
...

In order to process the information of the file, it would be better if the rows belonging to each movie were merged into a single row. Let's work on that.

1. Download the file from the Packt website.
2. Create a transformation and read the file with a Text file input step.
3. In the Content tab of the Text file input step put : as separator. Also uncheck the Header and the No empty rows options.
4. In the Fields tab enter two string fields: feature and description. Do a preview of the input file to see if it is well configured. You should see two columns: feature with the texts to the left of the colons, and description with the text to the right of the colons.
5. Add a JavaScript step and type the following code that will create the film field:

var film;
if (getProcessCount('r') == 1) film = '';
if (feature == null)
    film = '';
else if (film == '')
    film = feature;

6. Click on the Get variables button to add to the dataset the field film.
7. Add a Filter rows step with the condition description IS NOT NULL.
8. With the Filter rows step selected, do a preview. This is what you should see:
9. After the filter step, add a Row denormalizer step. You can find it under the Transform category.
10. Double-click the step and fill it like here:

11. From the Utility category select an If field value is null step.
12. Double-click it, check the Select fields option, and fill the Fields grid as follows:
13. With this last step selected, do a preview. You will see this:

What just happened?

You read a file with a selection of films in which each film was described through several rows.

First of all, you created a new field with the name of the film by using a small piece of JavaScript code. If you look at the code, you will note that the empty rows are key for calculating the new field. They are used in order to distinguish between one film and the next, and that is the reason for unchecking the No empty rows option. When the code executes for an empty row, it sets the film to an empty value. Then, when it executes for the first line of a film (film == '' in the code), it sets the new value for the film field. When the code executes for other lines, it does nothing, as the film already has the right value.

After that, you used a Row denormalizer step to translate the description of films from rows to columns, so the final dataset had a single row per film.

Finally, you used a new step to replace some null fields with the text n/a.
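The way the film field is derived, together with the Filter rows condition, can be reproduced outside PDI. The following is a sketch in plain JavaScript under stated assumptions: assignFilm and the input array are illustrative, and an empty line of the file is represented by a row whose feature and description are both null.

```javascript
// Plain JavaScript sketch (outside PDI) of how the film field is derived.
// An empty row (feature is null) resets film; the first non-empty row
// after a reset supplies the film title.
var film = '';

function assignFilm(row) {
  if (row.feature === null) {
    film = '';
  } else if (film === '') {
    film = row.feature;
  }
  return { film: film, feature: row.feature, description: row.description };
}

// Illustrative rows, as the Text file input step might deliver them.
var input = [
  { feature: 'Caché', description: null },
  { feature: 'Year', description: '2005' },
  { feature: 'Director', description: 'Michael Haneke' },
  { feature: null, description: null },
  { feature: 'Jean de Florette', description: null },
  { feature: 'Year', description: '1986' }
];

// Equivalent of the Filter rows step: description IS NOT NULL.
var output = input.map(assignFilm).filter(function (row) {
  return row.description !== null;
});
console.log(output);
```

The title rows and the empty rows drop out in the filter, and every remaining row is tagged with the film it belongs to.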

Converting row data to column data by using the Row denormalizer step

The Row denormaliser step converts the incoming dataset into a new dataset by moving information from rows to columns according to the values of a key field.

To understand how the Row denormaliser works, let's do a sketch of the desired final dataset:

FILM    YEAR    GENRE    DIRECTOR    ACTORS
(one film per row)

Here, a film is described by using a single row. On the contrary, in your input file the description for every film was spread over several rows.

To tell PDI how to combine a group of rows into a single one, there are three things you have to think about:

Among the input fields there must be a key field. Depending on the value of that key field, you decide how the new fields will be filled. In your example, the key field is feature. Depending on the value of the column feature, you will send the value of the field description to some of the new fields: Year, Genres, Director, or Actors.

You have to decide which field or fields make up the groups of rows. In our example, that field is film. All rows with the same value for the field film make up a group.

Decide the rules that have to be applied in order to fill the new target fields. All rules follow this pattern: If the value for the key field is equal to A, then put the value of the field B into the new field C.

A sample rule could be: If the value for the field feature (our key field) is equal to Directed by, put the value of the field description into the new field Director.





Transforming the Row Set

Once you are clear about these three things, all you have to do is fill the Row denormaliser configuration window to tell PDI how to do this task.

1. Fill the key field textbox with the name of the key field. In the example, the field is feature.

2. Fill the upper grid with the fields that make up the grouping. In this case, it is film.

The dataset must be sorted on the grouping fields. If not, you will get unexpected results.

3. Finally, fill the lower grid. This grid contains the rules for the new fields. Fill it following this example:

To add this rule ... | Fill a row like this ...
If the value for the key field is equal to A, put the value of the field B into the new field C. | Key value: A; Value fieldname: B; Target fieldname: C

This is how you fill the row for the sample rule:

To add this rule ... | Fill a row like this ...
If the value for the field feature (our key field) is equal to Directed by, put the value of the field description into the new field Director. | Key value: Directed by; Value fieldname: description; Target fieldname: Director

For every rule you must fill a different row in the target fields' grid.

Let's see how the Row denormalizer works for the following sample rows:

PDI creates an output row for the film Manon Des Sources. Then it processes every row looking for values to fill the new fields.

Let's take the first row. The value for the key field feature is Directed by. PDI searches in the target fields' grid to see if there is an entry where the Key value is Directed by; it finds it.

Then it puts the value of the field description as the content for the target field Director. The output row is now like this:

Now take the second row. The value for the key field feature is Produced by.

PDI searches in the target fields' grid to see if there is an entry where the Key value is Produced by. It cannot find it, and the information for this row is lost.

The following screenshot shows the rule applied to the third sample row. It also shows what the final output row looks like:

Note that the presence of rows is not mandatory for every key value entered in the target fields' grid. If an entry in the grid is not used, the target field is created anyway but it remains empty.

In this sample film, the year was not present, so the field Year remained empty.
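The behavior just described can be mimicked with a few lines of Python. This is a minimal sketch of the idea, not PDI's implementation: the function denormalise, the key values other than Directed by, and the placeholder descriptions are assumptions made for illustration.

```python
def denormalise(rows, key_field, group_field, rules):
    """Turn rows into columns.

    rules is a list of (key_value, value_field, target_field) tuples.
    The input rows must already be sorted by group_field.
    """
    targets = [target for _, _, target in rules]
    output = []
    current = None
    for row in rows:
        if current is None or current[group_field] != row[group_field]:
            # A new group starts: create its output row with empty targets
            current = {group_field: row[group_field]}
            current.update({target: None for target in targets})
            output.append(current)
        for key_value, value_field, target in rules:
            if row[key_field] == key_value:
                current[target] = row[value_field]
        # Rows whose key value matches no rule are simply dropped
    return output


# Only "Directed by" comes from the text; the other key values are hypothetical
rules = [
    ("Year", "description", "Year"),
    ("Genre", "description", "Genres"),
    ("Directed by", "description", "Director"),
    ("Actors", "description", "Actors"),
]
sample = [
    {"film": "Manon Des Sources", "feature": "Directed by", "description": "Director A"},
    {"film": "Manon Des Sources", "feature": "Produced by", "description": "Producer A"},
    {"film": "Manon Des Sources", "feature": "Actors", "description": "Actor A, Actor B"},
]
result = denormalise(sample, "feature", "film", rules)
```

In the result, Director and Actors are filled, the Produced by row leaves no trace, and Year and Genres exist but stay empty, which matches the three cases discussed above.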

Have a go hero – houses revisited

Take the output file for the Hero exercise to enhance the houses file from the previous chapter. You can also download the sample file from the Packt site. Create a transformation that reads that file and generates the following output:

Aggregating data with a Row denormalizer step

In the previous section, you learned how to use the Row denormalizer step to combine several rows into one. The Row denormalizer step can also be used to take as input a dataset and generate as output a new dataset with aggregated or consolidated data. Let's see it with an example.

Time for action – calculating total scores by performances by country

Let's work now with the contest file from Chapter 5. You will need the output file for the Hero exercise "Fill gaps in the contest file" from that chapter. If you don't have it, you can download it from the Packt website.

In this tutorial, we will calculate the total score for each performance by country.

1. Create a new transformation.

2. Read the file with a Text file input step and do a preview to see that the step is well configured. You should see this:

3. With a Select values step, keep only the following columns: Country, Performance, and totalScore.

4. With a Sort Rows step, sort the data by Country in ascending order.

5. After the Sort Rows step, put a Row denormalizer step.

6. Double-click this last step to configure it.

7. As the key field put Performance, and as group fields put Country.

8. Fill the target fields' grid as shown:

9. Close the window.

10. With the Row denormalizer step selected, do a preview. You will see this:

What just happened?

You read the contest file, grouped the data by country, and then created a new column for every performance. As values for those new fields you put the sum of the scores by performance and by country.

Using Row denormalizer for aggregating data

The purpose for which you used the Row denormaliser step in this tutorial was different from the purpose in the previous tutorial. In this case, you put the countries in rows, the performances in columns, and in the cells you put sums. The final dataset was a kind of cross tab like those you create with the DataPilot tool in OpenOffice, or the Pivot in Excel. The big difference is that here the final dataset is not interactive because, in essence, PDI is not. Another difference is that here you have to know the names or elements for the columns in advance.

Let's explain how the Row denormalizer step works in these cases. Basically, the way it works is much the same as before:

The step groups the rows by the grouping fields and creates a new output row for each group.

The novelty here is the aggregation of values. When more than one row in the group matches the value for the key field, PDI calculates the new output field as the result of applying an aggregate function to all the values. The aggregate functions available are the same you already saw when you learned the Group by step: sum, minimum, first value, and so on. Take a look at the following sample rows:

The first two rows had 1st as the value for the key field Performance. According to the rule of the Row denormaliser step, the values for the field totalScore of these two rows go to the new target field score_1st_performance. As the rule applies to two rows, the values for those rows have to be added, as Sum was the selected aggregation function.

So, the output data for this sample group is this:

The value for the new field score_1st_performance is 77, which is the sum of 38 and 39, the values of the field totalScore for the input rows where Performance was "1st."
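The aggregation can be sketched in Python as follows. This is an illustration under stated assumptions, not PDI's code: the function denormalise_sum, the country name, and the score of the 2nd performance are hypothetical; only the values 38 and 39 and the target field names come from the text.

```python
def denormalise_sum(rows, key_field, group_field, value_field, rules):
    """Denormalize rows, summing values when several rows match a rule.

    rules maps a key value to the name of its target field.
    """
    groups = {}  # dictionaries keep insertion order in Python 3.7+
    for row in rows:
        name = row[group_field]
        if name not in groups:
            groups[name] = {group_field: name}
            for target in rules.values():
                groups[name][target] = None  # empty until a row matches
        target = rules.get(row[key_field])
        if target is not None:
            previous = groups[name][target]
            value = row[value_field]
            groups[name][target] = value if previous is None else previous + value
    return list(groups.values())


rules = {"1st": "score_1st_performance", "2nd": "score_2nd_performance"}
sample = [
    {"Country": "Country A", "Performance": "1st", "totalScore": 38},
    {"Country": "Country A", "Performance": "1st", "totalScore": 39},
    {"Country": "Country A", "Performance": "2nd", "totalScore": 40},  # hypothetical row
]
result = denormalise_sum(sample, "Performance", "Country", "totalScore", rules)
```

The two rows with Performance equal to 1st are added into score_1st_performance, giving 38 + 39 = 77, while the single 2nd row goes unchanged into score_2nd_performance.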

Please note the difference between the Row denormaliser and the Group by step for aggregating. With the Row denormaliser step, you generate another new field for each interesting key value. Using the Group by step for the tutorial, you couldn't have created the two columns shown in the preceding screenshot: score_1st_performance and score_2nd_performance.

Have a go hero – calculating scores by skill by continent

Create a new transformation. Read the contest file and generate the following output:

To get the continent for each country, download the countries.txt file from the Packt website and get the information with a Stream lookup step.

Normalizing data

Some datasets are nice to see but complicated to process further. Take a look at the matches file we saw in Chapter 3:

Match Date;Home Team;Away Team;Result

02/06;Italy;France;2-1

02/06;Argentina;Hungary;2-1

06/06;Italy;Hungary;3-1

06/06;Argentina;France;2-1

10/06;France;Hungary;3-1

10/06;Italy;Argentina;1-0

...

Imagine you want to answer these questions:

1. How many teams played?

2. Which team scored the most goals?

3. Which team won all matches it played?

The dataset is not prepared to answer those questions, at least not in an easy way. If you want to answer those questions in a simple way, you will first have to normalize the data, that is, convert it to a suitable format before proceeding. Let's work on it.

Time for action – enhancing the matches file by normalizing the dataset

Now you will convert the matches file you generated in Chapter 2 to a format suitable for answering the proposed questions.

1. Search on your disk for the file you created in Chapter 2, or download it from the Packt website.

2. Create a new transformation and read the file by using a Text file input step.

3. With a Split Fields step, split the Result field in two: home_t_goals and away_t_goals. (Do you remember having done this in Chapter 3?)

4. From the Transform category of steps, drag a Row Normalizer step to the canvas.

5. Create a hop from the last step to this new one.

6. Double-click the Row Normalizer step to edit it and fill the window as follows:

7. With the Row Normalizer selected, do a preview. You should see this:

What just happened?

You read the matches file and converted the dataset to a new one where both the home team and the away team appeared under a new column named team, together with another new column named goals holding the goals scored by each team. With this new format, it is really easy now to answer the questions proposed at the beginning of the section.

Modifying the dataset with a Row Normalizer step

The Row Normalizer step modifies your dataset, so it becomes more suitable for processing. Usually this involves transforming columns into rows.

To understand how it works, let's take as an example the file from the tutorial. Here is a sketch of what we want to have at the end:

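[Figure: sketch of the desired dataset, with the columns MATCH DATE, TEAM, and GOALS. Each match takes two rows, one for the home team and one for the away team; for example, Italy and France for the 1st match on 02/06, and Argentina and Hungary for the 2nd match on 02/06.]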

What we have now is this:

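[Figure: sketch of the current dataset, with the columns Match Date, Home Team, Goals, Away Team, and Goals. Each match takes a single row; for example, 02/06, Italy, France for the 1st match and 02/06, Argentina, Hungary for the 2nd match.]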

Now it is just a matter of creating a correspondence between the old columns and the new ones.

Just follow these steps and you have the work done:

Step | Example
Identify the new desired fields. Give them a name. | team, goals.
Look at the old fields and identify which ones you want to translate to the new fields. | Home_Team, home_t_goals, Away_Team, away_t_goals.
From that list, identify the columns you want to keep together in the same row, creating a sort of classification of the fields. Give each group a name. Also, give a name to the classification. | You want to keep together the fields Home_Team and home_t_goals. So, you create a group with those fields, and name it home. Likewise, you create a group named away with the fields Away_Team and away_t_goals. Name the classification as class.
Define a correspondence between the fields identified above, and the new fields. | The old field Home_Team goes to the new field team. The old field home_t_goals goes to the new field goals. The old field Away_Team goes to the new field team. The old field away_t_goals goes to the new field goals.

Transcribe all these definitions to the Row Normalizer configuration window as shown below:

In the fields grid, insert one row for each of the fields you want to normalize.

Once you normalize, you have a new dataset where the fields for the groups you defined were converted to rows.

The number of rows in the new dataset is equal to the number of rows in the old dataset multiplied by the number of groups defined. In the tutorial, the final number is 24 rows x 2 groups = 48 rows.

Note that the fields not mentioned in the configuration of the Row Normalizer (the Match_Date field in the example) are kept without changes. They are simply duplicated for each new row.

In the tutorial, every group was made up of two fields: Home_Team and home_t_goals for the first group, and Away_Team and away_t_goals for the second. When you normalize, a group may have just one field, two fields (as in this example), or more than two fields.
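To make the correspondence concrete, here is a minimal Python sketch of the same idea. It is not the step's implementation: the function normalise and the way the groups are passed are assumptions made for illustration, while the field names and the two sample matches come from the tutorial.

```python
def normalise(rows, type_field, groups):
    """Turn columns into rows.

    groups maps each group name to a dictionary {old_field: new_field}.
    """
    grouped_fields = {old for mapping in groups.values() for old in mapping}
    output = []
    for row in rows:
        # Fields not mentioned in any group are copied to every new row
        kept = {k: v for k, v in row.items() if k not in grouped_fields}
        for type_value, mapping in groups.items():
            new_row = dict(kept)
            new_row[type_field] = type_value
            for old, new in mapping.items():
                new_row[new] = row[old]
            output.append(new_row)
    return output


groups = {
    "home": {"Home_Team": "team", "home_t_goals": "goals"},
    "away": {"Away_Team": "team", "away_t_goals": "goals"},
}
matches = [
    {"Match_Date": "02/06", "Home_Team": "Italy", "home_t_goals": 2,
     "Away_Team": "France", "away_t_goals": 1},
    {"Match_Date": "02/06", "Home_Team": "Argentina", "home_t_goals": 2,
     "Away_Team": "Hungary", "away_t_goals": 1},
]
normalised = normalise(matches, "class", groups)
```

Each input row produces one output row per group, so the two sample matches become four rows, and a 24-row input with two groups would become 48 rows.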

Summarizing the PDI steps that operate on sets of rows

The Row Normaliser and Row denormalizer steps you learned in this chapter are some of the PDI steps which, rather than treating single rows, operate on sets of rows. The following table gives you an overview of the main PDI steps that fall into this particular group of steps:

Step | Purpose
Group by | Builds aggregates such as Sum, Maximum, and so on, on groups of rows.
Univariate Statistics | Computes some simple statistics. It complements the Group by. It has fewer capabilities than that step but provides more aggregate functions such as median and percentiles.
Split Fields | Splits a single field into more than one. Actually it doesn't operate on a set of rows, but it's common to use it combined with some of the steps in this table. For example: You could use a Group by step to concatenate a field, followed by a Split Fields step that splits that concatenated field into several columns.
Row Normaliser | Transforms columns into rows making the dataset more suitable for processing.
Row denormaliser | Moves information from rows to columns according to the values of a key field.
Row flattener | Flattens consecutive rows. You could achieve the same by using a Group by to concatenate the field to flatten, followed by a Split Field step.
Sort rows | Sorts rows based on field values. Alternatively, it can keep only unique rows.
Split field to rows | Splits a single string field and creates a new row for each split term.
Unique rows | Removes consecutive duplicate rows and leaves only unique occurrences.

For examples on using these steps or for getting more information about them, please refer to Appendix C, Quick reference: Steps and Job Entries.

Have a go hero – verifying the benefits of normalization

Extend the transformation and answer the questions proposed at the beginning of the section:

How many teams played?

Which team scored the most goals?

Which team won all matches it played?

For answering the third question, you'll have to modify the Row Normalizer step as well.

If you are not convinced that the normalizing process makes the work easier, you can try to answer the questions without normalizing. That effort will definitely convince you!

Have a go hero – normalizing the Films file

Consider the output of the first Time for action section in this chapter. Generate the following output:

You have two options here:

Modify the tutorial by sending the output to a new file. Then use that new file to do this exercise.

Extend the stream in the original transformation by adding new steps after the Row Denormalizer step.



After doing the exercise, think about this: Does it make sense to denormalize and then normalize again? What is the difference between the original file and the output of this exercise? Could you have done the same without denormalizing and normalizing?

Have a go hero – calculating scores by judge

Take the contest file and generate the following output, where the columns represent the minimum, maximum, and average score given by every judge:

This exercise may appear difficult at first, but here's a clue: After reading the file, use a Group by step to calculate all the values you need for your final output. Leave the group field empty so that the step groups all rows in the dataset.

Generating a custom time dimension dataset by using Kettle variables

Dimensions are sets of attributes useful for describing a business. A list of products along with their shape, color, or size is a typical example of a dimension. The time dimension is a special dimension used for describing a business in terms of when things happened. Just think of a time dimension as a list of dates along with attributes describing those dates. For example, given the date 05/08/2009, you know that it is a day of August, it belongs to the third quarter, and it is a Wednesday. These are some of the attributes for that date.

In the following tutorial you will create a transformation that generates the dataset for a time dimension. The dataset for a time dimension has one row for every date in a given range of dates and one column for each attribute of the date.
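As a small illustration of what "attributes of a date" means, the following Python sketch derives a few of them for the date used in the example. The function describe and the attribute names are hypothetical; they are not the field names used in the tutorial.

```python
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
# date.weekday() returns 0 for Monday and 6 for Sunday
WEEK_DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
             "Saturday", "Sunday"]


def describe(day):
    """Return one time-dimension row: the date plus some of its attributes."""
    return {
        "date": day,
        "year": day.year,
        "month": day.month,
        "month_desc": MONTHS[day.month - 1],
        "day_of_month": day.day,
        "quarter": (day.month - 1) // 3 + 1,
        "week_day_desc": WEEK_DAYS[day.weekday()],
    }


attrs = describe(date(2009, 8, 5))  # 05/08/2009 read as dd/MM/yyyy
```

For this date the sketch yields August, quarter 3, and Wednesday, the same attributes mentioned above.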

Time for action – creating the time dimension dataset

In this tutorial we will create a simple dataset for a time dimension.

First we will create a stream with the days of the week:

1. Create a new transformation.

2. Press Ctrl+T to access the Transformation settings window.

3. Select the Parameters tab and fill it as shown in the next screenshot:

4. Expand the Job category of steps.

5. Drag a Get Variables step to the canvas, double-click the step, and fill the window as shown here:

6. After the Get Variables step, add a Split Fields step and use it to split the field week_days into seven String fields named sun, mon, tue, wed, thu, fri, and sat. As Delimiter, set a comma (,).

7. Add one more Split Fields step and use it to split the field week_days_short into seven String fields named sun_sh, mon_sh, tue_sh, wed_sh, thu_sh, fri_sh, and sat_sh. As Delimiter, set a comma (,).

8. After this last step, add a Row Normalizer step.

9. Double-click the Row Normalizer step and fill it as follows:

10. Keep the Row Normalizer step selected and do a preview. You will see this:

Now let's build the main stream:

1. Drag a Generate Rows step, an Add sequence step, a Calculator step, and a Filter

rows step to the canvas.

2. Link them so you get this:

3. Double-click the Generate Rows step and use it to generate 45000 lines. Add a single Date field named first_day. As Format select yyyyMMdd and as Value write 19000101.

4. Double-click the Add sequence step. In the Name of value textbox, type days.

5. Double-click the Calculator step and fill the window as shown next:

6. Double-click the Filter rows step and add the filter date <= 31/12/2020. When you enter the date 31/12/2020, make sure to set the Type to Date and the Conversion format to dd/MM/yyyy. After the Filter rows step add a Stream lookup step.

7. Create two hops: one from the Filter rows step to the Stream lookup step and the other from the Row Normalizer step to the Stream lookup step.

8. Double-click the Stream lookup step. In the upper grid add a row, setting week_day under the Field column and w_day under the LookupField column. Use the lower grid to retrieve the String fields week_desc and week_short_desc. Finally, after the Stream lookup step, add a Select values step.

9. Use the Select values step to remove the unused fields first_day and days. Create a hop from the Stream lookup step to this step.

10. With the Select values step selected, click the preview button.

11. When the preview window appears click on Configure.

12. Fill the column value in the Parameters grid of the transformation execution window as follows:

13. Click the Launch button. You will see this:

What just happened?

You generated data for a time dimension with dates ranging from 01/01/1900 through 31/12/2020. Time dimensions are meant to answer questions related to time such as: Do I sell more on Mondays or on Fridays? Am I selling more this quarter than the same quarter last year? The list of attributes you need to include in your time dimension depends on the kind of question you want to answer. Typical fields in a time dimension include: year, month (a number between 1 and 12), description of month, day of month, week day, and quarter.
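The main stream of the tutorial (Generate Rows, Add sequence, Calculator, Filter rows) can be pictured with the following Python sketch. It is an illustration only: the function generate_dates is hypothetical, and it assumes that the day offsets start at 0 so that the first generated date is the starting date itself.

```python
from datetime import date, timedelta


def generate_dates(first_day, rows, last_day):
    """Generate consecutive dates and keep those up to last_day."""
    out = []
    for days in range(rows):                        # Generate Rows + sequence
        current = first_day + timedelta(days=days)  # Calculator: date + days
        if current <= last_day:                     # Filter rows
            out.append(current)
    return out


dates = generate_dates(date(1900, 1, 1), 45000, date(2020, 12, 31))
```

The range from 01/01/1900 through 31/12/2020 holds 44195 days, so generating 45000 rows is enough to cover it, and the filter discards the rest.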

In the tutorial you created a few attributes, but you could have added many more. Among the attributes included you had the week day. The week descriptions were taken from named parameters, which allowed you to set the language of the week descriptions at the time you ran the transformation. In the tutorial you specified Portuguese descriptions. If you had left the parameters grid empty, the transformation would have used the English descriptions that you put as default.

Let's explain how you built the stream with the number and descriptions for the days of the week. First, you created a dataset by getting the variables with the descriptions for the days of the week. After creating the dataset, you split the descriptions and, by using the Row Normalizer step, you converted that row into a list of rows, one for every day of the week. In other words, you created a single row with all the descriptions for the days of the week. Then you normalized it to create the list of days.
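The secondary stream can be sketched in Python as follows. This is only an illustration: the function week_day_lookup, the English default values, and the numbering of the days (1 for Sunday) are assumptions; the field names w_day, week_desc, and week_short_desc are the ones used in the tutorial.

```python
def week_day_lookup(week_days, week_days_short, delimiter=","):
    """Split two delimited strings and build one row per day of the week."""
    names = week_days.split(delimiter)        # first Split Fields step
    short = week_days_short.split(delimiter)  # second Split Fields step
    # The normalizing part: one row per (name, short name) pair
    return [
        {"w_day": number, "week_desc": name, "week_short_desc": short_name}
        for number, (name, short_name) in enumerate(zip(names, short), start=1)
    ]


# Hypothetical default values for the two named parameters
lookup = week_day_lookup(
    "Sunday,Monday,Tuesday,Wednesday,Thursday,Friday,Saturday",
    "Sun,Mon,Tue,Wed,Thu,Fri,Sat",
)
```

The result is a seven-row lookup dataset built entirely from two strings, with no external file involved.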

This method used for creating the list of days of a week is very useful when you have to create a very small dataset. It avoids the creation of external files to hold that data.

The transformation you created was inspired by the sample transformation General - Populate date dimension.ktr found in the samples/transformations folder inside the PDI installation folder. You can take a look at that transformation. It builds the dataset in a slightly different way, also by using Row Normalizer steps.

Getting variables

To create the secondary stream of the tutorial, you used a Get Variables step. The Get Variables step allows you to get the value of one or more variables. In this tutorial you read two variables that had been defined as named parameters.

When put as the first step of a stream like in this case, this step creates a dataset with one single row and as many fields as read variables.

The following is the dataset created by the Get Variables step in the time dimension tutorial:

When put in the middle of a stream, this step adds to the incoming dataset as many fields as the number of variables it reads. Let's see how it works.

Time for action – getting variables for setting the default starting date

Let's modify the transformation so that the starting date depends on a parameter.

1. Press Ctrl+T to open the transformation settings window.

2. Add a parameter named START_DATE with default value 01/12/1999.

3. Add a Get variables step between the Calculator step and the Filter rows step.

4. Edit the Get variables step and add a new field named start_date. Under Variable write ${START_DATE}. As Type select Date, and under Format select or type dd/MM/yyyy.

5. Modify the filter step so the condition is now: date>=start_date and date<=31/12/2020.

6. Modify the Select values step to remove the start_date field.

7. With the Select values step selected do a preview. You will see this:

What just happened?

You added a starting date as a named parameter. Then you read that variable into a new field and used it to keep only the dates that are greater than or equal to its value.

Using the Get Variables step

As you just saw, the Get Variables step allows you to get the value of one or more variables. In the main tutorial you saw how to use the step at the beginning of a stream. Now you saw how to use it in the middle. The following is the dataset after the Get Variables step for this last exercise:

With the Get Variables step, you can read any Kettle variable: variables defined in the kettle.properties file, internal variables such as ${user.dir}, named parameters as in this tutorial, or variables defined in another transformation (you haven't yet learned about these variables but you will soon).

As you know, the type of Kettle variables is String by default. However, at the time you get a variable, you can change its metadata. As an example of that, in this last exercise you converted ${START_DATE} to a Date by using the mask dd/MM/yyyy.

Note that you specified the variables as ${name of the variable}. You could have used %%name of the variable%% also. The full specification of the name of a variable allows you to mix variables with plain text.

Suppose that instead of a date you create a parameter named YEAR with default value 1950. In the Get variables step you may specify 01/01/${YEAR} as the value.

When you execute the transformation, this text will be expanded to 01/01/1950 or to 01/01/ plus the year you enter if you overwrite the default value.
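The expansion of variables mixed with plain text can be imitated with a short Python sketch. This is not Kettle's code: the function expand is hypothetical, and leaving unknown names untouched is an assumption of the sketch.

```python
import re

# Matches ${NAME} or %%NAME%%
_VARIABLE = re.compile(r"\$\{([^}]+)\}|%%([^%]+)%%")


def expand(text, variables):
    """Replace every ${NAME} or %%NAME%% in text with its value."""
    def replace(match):
        name = match.group(1) or match.group(2)
        # Assumption: a name that is not defined is left as it was written
        return variables.get(name, match.group(0))
    return _VARIABLE.sub(replace, text)


default_value = expand("01/01/${YEAR}", {"YEAR": "1950"})
overwritten = expand("01/01/%%YEAR%%", {"YEAR": "2001"})
```

Here default_value becomes 01/01/1950, and overwritten shows the same text expanded with another year written in the %% notation.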

Note that the purpose of using the Get Variable step is to have the values of variables as fields in the dataset. Otherwise, you don't need to use this step for using a variable. You just use it wherever you see a dollar sign icon.

Have a go hero – enhancing the time dimension

Modify the time dimension generation by doing the following:

Add the following fields to the dataset, taking as model the generation of weeks: Name of month, Short name of month, and Quarter.

Add two more parameters: start_year and end_year. Modify the transformation so that it generates dates only between those years. In other words, you have to discard dates out of that range. You may assume that the parameters will be between 1900 and 2020.

Pop quiz – using Kettle variables inside transformations

There are some Kettle predefined variables that hold information about the logged-in user: user.country, user.language, etc. The following tasks involve the use of some of those variables. Which of the tasks can be accomplished without using a Get Variables step or a JavaScript step? (Remember from the previous chapter that you can also get the value of a Kettle variable with a JavaScript step.)

a. Create a file named hello_<user>.txt, where <user> is the name of the logged-in user.

b. Create a file named hello.txt that contains a single line with the text Hello, <user>!, <user> being the name of the logged-in user.

c. Write to the log (by using the Write to log step) a greeting message like Hello, user!. The message has to be written in a different language depending on the language of the logged-in user.

d. All of the above

e. None of the above

Summary

In this chapter, you learned to transform your dataset by applying two magical steps: Row Normalizer and Row denormalizer. These two steps aren't the kind of steps you use every day such as a Filter Rows or a Select values step. But when you need to do the kind of task they achieve, you are really grateful that these steps exist. They do a complex task in quite a simple way. You also learned what a time dimension is and how to create a dataset for a time dimension.

So far, you've been learning to transform data. In the next chapter, you will set that kind of learning aside for a while. The chapter will be devoted to an essential subject when it comes to working in production environments and dealing with real data: data validation and error handling.



Validating Data and Handling Errors

So far, you have been working alone in front of your own computer. In the "Time for action" exercises, the step-by-step instructions along with the error-free sample data helped you create and run transformations free of errors. During the "Have a go hero" exercises, you likely encountered numerous errors, but tips and troubleshooting notes were there to help you get rid of them.

This is quite different from real scenarios, mainly for two reasons:

Real data has errors, a fact that can't be avoided. If you fail to heed it, the transformations that run with your sample data will probably crash when running with real data.

In most cases, your final work is run by an automated process rather than by a user. Therefore, if a transformation crashes, there will be nobody to fix the problem.

In this chapter you will learn about the options that PDI offers to treat errors and validate data so that your transformations are well prepared to be run in a production environment.

Capturing errors

Suppose that you are running or previewing a transformation from Spoon. As you already know, if an error occurs it is shown in the Logging window inside the Execution Results pane. As a consequence, you can look at the error, try to fix it, and run the transformation again. This is far from what happens in real life. As said, transformations in real scenarios are supposed to be automated. Therefore, it is not acceptable to have a transformation that crashes without someone noticing it and reacting to that situation. On the contrary, it's your duty to do everything you can to trap errors that may happen, avoiding unexpected crashes when possible. In this section you will learn how to do that.



Time for action – capturing errors while calculating the age of a film

In this tutorial you will use the output of the denormalizing process from the previous chapter. You will calculate the age of the films and classify them according to their age.

1. Get the file with the films. You can take the transformation that denormalized the data and generate the file with a Text file output step, or you can take a sample file from the Packt website.

2. Create a new transformation and read the file with a Text file input step.

3. Do a preview of the data. You will see the following:

4. After the Text file input step, add a Get System Info step.

5. Edit the step, add a new field named today, and choose Today 00:00:00 as its value.

6. Add a JavaScript step.

7. Edit the step and type the following piece of code:

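var diff;  // age of the film, in years

film_date = str2date('01/01/' + Year, 'dd/MM/yyyy');  // take January 1 of the film's year

diff = dateDiff(film_date, today, "y");  // years between the film date and today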

8. Click on Get variables to add diff as a new field.

Chapter 7

9. Add a Number range step, edit it, and fill its window as follows:

10. With a Sort rows step, sort the data by diff.

11. Finally, add a Group by step and double-click to edit it.

12. As group field put age_of_film. In the Aggregates grid create a field named number_of_films to hold the number of films with that age. Put film as the Subject and select Number of values (N) as the Type.

13. Add a Dummy step at the end and do a preview. You will be surprised by an error

like this:

14. Look at the logging window. It looks like this:

15. Now drag a Write to log step to the canvas from the Utility category.

16. Create a hop from the JavaScript step to this new step.

17. Select the JavaScript step, right-click it to bring up a contextual menu, and select Define error handling....

18. The error handling settings window appears. Fill it as shown:

19. Click on OK.

20. Save the transformation and do a new preview on the Dummy step. You will see this:

21. The logging window will show you this:

... - Bad rows.0 -

... - Bad rows.0 - ------------> Linenr 1-------------------------

... - Bad rows.0 - null

... - Bad rows.0 -

... - Bad rows.0 - Javascript error:

... - Bad rows.0 - Could not apply the given format dd/MM/yyyy

on the string for 01/01/null : Format.parseObject(String) failed

(script#4)

... - Bad rows.0 -

... - Bad rows.0 - --> 4:0

... - Bad rows.0 - SCR-001

... - Bad rows.0 -

... - Bad rows.0 - ====================

... - Bad rows.0 -

... - Bad rows.0 - ------------> Linenr 2-------------------------

... - Bad rows.0 - null

... - Bad rows.0 -

... - Bad rows.0 - Javascript error:

... - Bad rows.0 - Could not apply the given format dd/MM/yyyy

on the string for 01/01/null : Format.parseObject(String) failed

(script#4)

... - Bad rows.0 -

... - Bad rows.0 - --> 4:0

...

The date was cut from the log for clarity of the log messages.

22. Now do a preview on the Write to log step. This is what you see:

What just happened?

You created a transformation to read a list of films and group them according to their age, that is, how old the movie is. You were surprised by an unexpected error caused by the rows in which the year was undefined. Then you implemented error handling to capture that error and to prevent the transformation from aborting. With the treatment of the error, you split the stream in two:

The rows that caused the error went to a new stream that wrote information about the error to the log

The rows that passed the JavaScript step without problem went through the main path

Using PDI error handling functionality

With the error handling functionality, you can capture errors that would otherwise cause the transformation to halt. Instead of aborting the transformation, PDI sends the rows that cause the errors to a different stream for further treatment.

You don't need to implement error handling in every step, only in those where errors are most likely to occur when running the transformation. A typical situation where you should consider handling errors is a JavaScript step. Code that works perfectly at design time might fail when executed against real data, where the most common errors are related to data type conversions or indexes out of range. Another common use of error handling is when working with databases (you will see more on this later in the book).

To configure the error handling, right-click the step and select Define error handling....



Note that not all steps support error handling. The Define error handling... option is available only when you right-click a step that supports it.

After opening the settings window, you have to fill it in just as you did in the tutorial. You have to specify the target step for the bad rows along with the names of the extra fields to be added as part of the treatment of errors:

| Field | Description |
|---|---|
| Nr of errors fieldname | Name for the field that will have the number of errors |
| Error fields fieldname | Name for the field that will have the name of the field(s) that caused the errors |
| Error codes fieldname | Name for the field that will have the error code |
| Error descriptions fieldname | Name for the field that will have the error description |

The first two are trivial. The last two deserve an explanation. The values for the error code and description fields are the same as those you see in the Logging tab when you don't trap the error. In the tutorial there was a JavaScript error with code SCR-001 and description JavaScript error: Could not apply the given format.... You saw this code as well as its description in the Logging tab when you didn't trap the error and the transformation crashed, and in the preview you made at the end of the error stream. This particular error was a JavaScript one, but the kind of error you get always depends on the kind of step where it occurs.

You are not forced to fill in all the textboxes in the error settings window. Only the fields for which you provide a name will be added to the dataset. By doing a preview on the target step, you can see the extra fields that were added.
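The idea of routing failing rows to a second stream can be sketched outside PDI in plain JavaScript. This is only an illustration under stated assumptions: parseYear, processWithErrorHandling, and the row layout are hypothetical names, not PDI APIs. Only the code SCR-001 and the field names err_code and err_desc come from the tutorial.

```javascript
// Hypothetical row-level step: throws when the year cannot be used.
function parseYear(row) {
  if (row.year === null || Number.isNaN(Number(row.year))) {
    throw new Error("Could not parse year: " + row.year);
  }
  return { ...row, year: Number(row.year) };
}

// Sends good rows to the main stream and failing rows to an error stream.
// Only the error fields that were given a name are added to the bad rows.
function processWithErrorHandling(rows, step, fieldNames) {
  const main = [];
  const errors = [];
  for (const row of rows) {
    try {
      main.push(step(row));
    } catch (e) {
      const bad = { ...row };
      if (fieldNames.code) bad[fieldNames.code] = "SCR-001"; // code seen in the tutorial log
      if (fieldNames.desc) bad[fieldNames.desc] = e.message;
      errors.push(bad);
    }
  }
  return { main, errors };
}

const streams = processWithErrorHandling(
  [{ film: "A", year: "1994" }, { film: "B", year: null }],
  parseYear,
  { code: "err_code", desc: "err_desc" }
);
console.log(streams.main.length, streams.errors.length); // 1 1
```

Leaving a name out of fieldNames mirrors leaving a textbox empty in the settings window: that field is simply not added to the error rows.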

Aborting a transformation

You can handle errors by detecting bad rows and sending them to an extra stream. But when there are too many errors, or when the errors are severe, the best option is to cancel the whole transformation. Let's see how to force a transformation to abort in such a situation.

Time for action – aborting when there are too many errors

1. Open the transformation from the previous tutorial and save it under a different name.

2. From the Flow category, drag an Abort step to the canvas.

3. Create a hop from the Write to log step to the Abort step.

4. Double-click the Abort step. Enter 5 as Abort threshold. As Abort message, type

Too many errors calculating age of film!.

5. Click OK.

6. Select the Dummy step and do a preview. As a result, a warning window shows up informing you that there were no rows to display. In the Step Metrics tab, the Abort after 5 errors line becomes red to show you that there was an error:

7. The log looks like this:

... - Bad rows.0 -
... - Bad rows.0 - ====================
... - Abort after 5 errors.0 - Row nr 6 causing abort : [Trois couleurs - Blanc], [null], [Comedy | Drama], [Krzysztof Kieslowski], [Zbigniew Zamachowski, Julie Delpy], [2009/08/18 00:00:00.000], [
... - Abort after 5 errors.0 - Javascript error:
... - Abort after 5 errors.0 - Could not apply the given format dd/MM/yyyy on the string for 01/01/null : Format.parseObject(String) failed (script#4)
... - Abort after 5 errors.0 -
... - Abort after 5 errors.0 - --> 4:0], [SCR-001]
... - Abort after 5 errors.0 - Too many errors calculating age of film!
... - Abort after 5 errors.0 - Finished processing (I=0, O=0, R=6, W=6, U=0, E=1)
...
... - Spoon - The transformation has finished!!
... - error_handling_with_abort - ERROR (version 3.2.0-GA, build 10572 from 2009-05-12 08.45.26 by buildguy) : Errors detected!

What just happened?

You forced the transformation to abort after five erroneous rows: the arrival of the sixth bad row at the Abort step stopped the execution.

Aborting a transformation using the Abort step

With the Abort step, you force a transformation to abort. Its main use is in error handling.

You can use the Abort step to force the abort as soon as a row arrives at it, or after a certain number of rows, as you did in the tutorial. To choose between the two, you use the Abort threshold option. If the threshold is 0, the Abort step will abort as soon as the first row arrives. If the threshold is N, the Abort step will abort the transformation when row number N+1 arrives at it.

Beyond error handling, you may use the Abort step in any unexpected situation. Examples could be when you expect parameters and they are not present, or when an incoming file is empty when it shouldn't be. In situations like these, you can force an abnormal end of the execution just by adding an Abort step after the step that detects the anomaly.
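The threshold rule can be modeled in a few lines of plain JavaScript. This is a sketch of the rule as described above, not of how PDI implements the Abort step; runWithAbort and its return value are hypothetical.

```javascript
// Hypothetical model of the Abort threshold: up to `threshold` rows are
// tolerated, and the arrival of row number threshold + 1 aborts the run.
function runWithAbort(rows, threshold, message) {
  const seen = [];
  for (let i = 0; i < rows.length; i++) {
    const rowNumber = i + 1;
    if (rowNumber > threshold) {
      return { aborted: true, abortedAtRow: rowNumber, message: message, seen: seen };
    }
    seen.push(rows[i]);
  }
  return { aborted: false, abortedAtRow: null, message: null, seen: seen };
}

const outcome = runWithAbort(
  ["r1", "r2", "r3", "r4", "r5", "r6"],
  5,
  "Too many errors calculating age of film!"
);
console.log(outcome.abortedAtRow); // 6
```

With a threshold of 0 the same function reports an abort at row 1, which matches the "abort as soon as a row arrives" case.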

Fixing captured errors

In the Time for action – capturing errors while calculating the age of a film section of this chapter, you sent the bad rows to the log. However, when you capture errors, you can send the bad rows to any step, as long as the step knows how to treat those rows. Let's see an example of that.

Time for action – treating errors that may appear

1. Open the transformation from the tutorial and save it under a different name.

2. From the Transform category, drag the Add constants step to the canvas.

3. Create a hop from the Write to log step to the Add constants step.

4. Add an Integer constant named diff with value 999, and a String constant

named age_of_film with value unknown.

5. After the Add constants step, add a Select values step and use it to remove the fields err_code and err_desc.

6. Create a hop from the Select values step to the Sort rows step. Your transformation should look like this:

Note that you are merging two streams. Those streams must have the same metadata. If you get a trap detector warning, please verify that you executed these instructions exactly as explained.

7. Select the Dummy step and do a preview. You will see this:

What just happened?

You modified the transformation so that you didn't end up discarding the erroneous rows. In the error stream (the stream after the red dotted line), you fixed the rows by putting default values in the new fields. After that, you returned the rows to the main stream.

Treating rows coming to the error stream

If the errors are not severe enough to discard the rows, if you can somehow guess what data was supposed to be there instead of the error, or if you have default values for erroneous data, you can do your best to fix the errors and send the rows back to the main stream.

Instead of discarding the rows with no year information, you fixed them and sent them back to the main stream. The Group by step grouped them under a separate category named unknown.

There are no rules for what to do with bad rows when you handle errors. You always have the option to discard the bad rows or to try to fix them. Sometimes you can fix only a few and discard the rest. It always depends on your particular data and business rules.
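As a rough illustration of fixing rows and merging them back, here is a plain JavaScript sketch. The default values (999 and unknown) and the field names come from the tutorial; fixBadRow, mergeStreams, and countByAge are hypothetical helpers that only imitate what the Add constants, Select values, and Group by steps do.

```javascript
// Add the default values and drop the error fields, so that the fixed rows
// have the same layout as the rows in the main stream.
function fixBadRow(row) {
  const fixed = { ...row, diff: 999, age_of_film: "unknown" };
  delete fixed.err_code;
  delete fixed.err_desc;
  return fixed;
}

// Merge the fixed error rows back into the main stream.
function mergeStreams(mainRows, errorRows) {
  return mainRows.concat(errorRows.map(fixBadRow));
}

// Count rows per category, similar in spirit to a Group by step.
function countByAge(rows) {
  const counts = {};
  for (const row of rows) {
    counts[row.age_of_film] = (counts[row.age_of_film] || 0) + 1;
  }
  return counts;
}

const merged = mergeStreams(
  [{ film: "A", diff: 15, age_of_film: "old" }],
  [{ film: "B", err_code: "SCR-001", err_desc: "Javascript error" }]
);
console.log(countByAge(merged)); // { old: 1, unknown: 1 }
```

Removing err_code and err_desc before merging is what keeps the two streams compatible, which is the point of the note about metadata in the tutorial.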

Pop quiz – PDI error handling

What does the PDI error-handling functionality do?

a. Prevents unexpected errors from happening

b. Captures errors that happen and discards erroneous rows so you can continue working with valid data

c. Captures errors that happen and sends erroneous rows to a new stream, letting you decide what to do with them

Have a go hero – capturing errors while seeing who wins

On the Packt website you will find a modified football match file named wcup_modified.txt. This modified file has some intentional errors. Download the file and do the following:

1. Create a transformation and read the file with a Text file input step. Set all fields as String.

2. Add a JavaScript step and type the following code in it:

    // Split a result such as 2-1 into home goals and away goals
    var result_desc;
    result_split = Result.split('-');
    home_g = str2num(result_split[0]);
    away_g = str2num(result_split[1]);

    // Describe who wins the match
    if (home_g > away_g)
        result_desc = Home_Team + ' wins';
    else if (home_g < away_g)
        result_desc = Away_Team + ' wins';
    else
        result_desc = 'Nobody wins';

3. In the grid below the code, add the string variable result_desc.

4. Do a preview on the JavaScript step and see what happens.

5. Now try either of the following two solutions:

- Handle the errors and discard the rows that cause those errors. Abort if there are more than 10 errors.

- Handle the errors and fix the transformation by setting a default result description for the rows that cause the errors.

Avoiding unexpected errors by validating data

To avoid unexpected errors, or simply to meet your requirements, it is common practice to validate your data before processing it. Let's do some validations.

Time for action – validating genres with a Regex Evaluation step

In this tutorial you will read the modified films file and validate the Genres field.

1. Create a new transformation.

2. Read the modified films file just as you did in the previous tutorial.

3. In the Content tab, check the Rownum in output? option and fill the Rownum fieldname with the text rownum.

4. Do a preview. You should see this:

5. After the Text file input step, add a Regex Evaluation step. You will find it under the Scripting category of steps.



6. Under the Step settings box, select Genres as the Field to evaluate, and type genres_ok as the Result Fieldname.

7. In the Regular expression textbox, type the following: [A-Za-z\s\-]*(\|[A-Za-z\s\-]*)*

8. Add a Filter rows step, an Add constants step, and two Text file output steps, and link them as shown next:

9. Edit the Add constants step.

10. Add a String constant named err_code with value GEN_INV and a String constant named err_desc with value Invalid list of genres.

11. Configure the Text file output step after the Add constants step to create the ${LABSOUTPUT}/films_err.txt file, with the fields rownum, err_code, and err_desc.

12. Configure the other Text file output step to create the ${LABSOUTPUT}/films_ok.txt file, with the fields film, Year, Genres, Director, and Actors.

13. Double-click the Filter rows step and add the condition genres_ok = Y, Y being a Boolean value. Send true data to the stream that generates the films_ok.txt file. Send false data to the other stream.

14. Run the transformation.

15. Check the generated files. The films_err.txt file looks like the following:

rownum;err_code;err_desc

12;GEN_INV;Invalid list of genres

18;GEN_INV;Invalid list of genres

20;GEN_INV;Invalid list of genres

21;GEN_INV;Invalid list of genres

22;GEN_INV;Invalid list of genres

33;GEN_INV;Invalid list of genres

34;GEN_INV;Invalid list of genres

...

The films_ok.txt file looks like this:

film;Year;Genres;Director;Actors
Persepolis;2007;Animation | Comedy | Drama | History;Vincent Paronnaud, Marjane Satrapi;Chiara Mastroianni, Catherine Deneuve, Danielle Darrieux
Trois couleurs - Rouge;1994;Drama;Krzysztof Kieslowski;Irène Jacob, Jean-Louis Trintignant, Frédérique Feder, Jean-Pierre Lorit, Samuel Le Bihan
Les Misérables;1933;Drama | History;Raymond Bernard;
...

What just happened?

You read the films file and checked that the Genres field was a list of strings separated by |. You created two files:

- One file with the valid rows.

- Another file with the rows with an invalid Genres field. Note that the rownum field you added when you read the file is used here for identifying the wrong lines.

In order to check the validity of the Genres field, you used a regular expression. The expression you typed accepts any combination of letters, spaces, or hyphens separated by a pipe. The * symbol allows empty genres as well. For a detailed explanation of regular expressions, please refer to Chapter 2.
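If you want to experiment with the expression outside Spoon, the following plain JavaScript sketch may help. It assumes that the whole field has to match, so the pattern is anchored with ^ and $. The genresOk function and the sample values are made up for the illustration.

```javascript
// The pattern from the tutorial, anchored so that the whole value must match.
const GENRES_PATTERN = /^[A-Za-z\s\-]*(\|[A-Za-z\s\-]*)*$/;

function genresOk(genres) {
  return GENRES_PATTERN.test(genres);
}

console.log(genresOk("Animation | Comedy | Drama")); // true
console.log(genresOk("Sci-fi|Drama"));               // true: hyphens are accepted
console.log(genresOk(""));                           // true: * allows empty genres
console.log(genresOk("Comedy, Drama"));              // false: a comma is not accepted
console.log(genresOk("Drama | 1994"));               // false: digits are not accepted
```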

Validating data

As said, you would validate data mainly for two reasons:

- To prevent the transformation from aborting because of unexpected errors

- To check that your data meets some pre-existing requirements

For example, consider some of the sample data from previous chapters:

- In the match file, the results field had to be a string formed by two numbers separated by a -

- In the real estate file, the flag for Fireplace had to be Yes or No

- In the contest file, the name of the country had to be a valid country, not a random string

If your data doesn't meet these requirements, you may not get any errors, but you will still be working with invalid data.



In the last tutorial you validated just one of the fields. If you want to validate more than one field, there is a specific step that simplifies that work: the Data Validator.

Time for action – checking films file with the Data Validator

Let's validate not only the Genres field, but also the Year field.

1. Open the last transformation and save it under a new name.

2. Delete all steps except the Text file input and Text file output steps.

3. In the Fields tab of the Text file input step, change the Type of the Year from Integer to String.

4. From the Validation category, add a Data Validator step. Also add a Select values step. Link all steps as follows:

5. Double-click the Data Validator step.

6. Check the Report all errors, not only the first option found at the top of the window. This will enable the Output one row, concatenate errors with separator option. Check this option too, and fill the textbox to the right with a slash /. Click on New validation and type genres as the name of the validation rule.

7. Click on OK.

8. Click on genres. The right half of the window is filled with checkboxes and textboxes where you will define the rule.

9. Fill the header of the rule definition as follows:

10. In the Regular expression expected to match textbox, type [A-Za-z\s\-]*(\|[A-Za-z\s\-]*)*

11. Click on New validation and type year as the name of the validation rule.

12. Click on OK.

13. Click on year and fill the header of the rule definition as follows:

14. In the Data block, select the Only numeric data expected checkbox.

15. Click on OK.

16. Right-click the Data Validator step and select Define error handling....

17. Fill the error handling settings window as follows: As Target step, select the step that generates the file with invalid rows. Check the Enable the error handling? checkbox. Type err_desc as Error description field name, err_field as Error fields, and err_code as Error codes. Click on OK.

18. Use the Select values step to change the metadata for the Year from String to Integer.

19. Save the transformation and run it.

20. Check the generated files. The films_err.txt file now has more detail, as you validated two fields.

rownum;err_code;err_desc

9;YEAR_NULL;Year invalid or absent

12;GEN_INV;Invalid list of genres

18;GEN_INV;Invalid list of genres

20;GEN_INV;Invalid list of genres

21;GEN_INV;Invalid list of genres

22;GEN_INV;Invalid list of genres

33;GEN_INV;Invalid list of genres

34;GEN_INV;Invalid list of genres

47;YEAR_NULL/GEN_INV;Year invalid or absent/Invalid list of genres

48;YEAR_NULL/GEN_INV;Year invalid or absent/Invalid list of genres

49;YEAR_NULL;Year invalid or absent

...

21. The films_ok.txt file, in turn, should have fewer rows, as the films with an invalid or absent year are no longer sent to this file.

What just happened?

You used the Data Validator step to validate both the genres list and the year. You created a file with the good rows, and another file with information showing which errors were found.

Defining simple validation rules using the Data Validator

The Data Validator step, or DV for short, allows you to define simple validation rules that describe the expected characteristics of the incoming fields. The good thing about the DV step is that it concentrates several validations into a single step, and it also supports error handling.

For every validation rule, you have to specify these fields:

| Field | Description |
|---|---|
| Name of the field to validate | Name of the incoming field whose value will be validated with this rule |
| Error code | The error code to pass to error handling. If omitted, a default is set |
| Error description | The error description to pass to error handling. If omitted, a default is set |

The error code and error description are useful to identify which field was erroneous when you have more than one validation rule in a single DV step.

It is possible for more than one field to cause a row to be passed to error handling. In that case, you can generate one output row per error or a single row with all error descriptions concatenated. In the tutorial you chose the latter option.
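The "single row with concatenated errors" behavior can be pictured with a small plain JavaScript sketch. The error codes, descriptions, and the / separator are the ones used in the tutorial; the rules array and the validateRow function are hypothetical and only approximate what the DV step does.

```javascript
// Two hypothetical rules modeled on the tutorial. Codes and descriptions are
// the ones that appear in films_err.txt.
const rules = [
  { field: "Year", code: "YEAR_NULL", desc: "Year invalid or absent",
    ok: (v) => v !== null && /^\d+$/.test(v) },
  { field: "Genres", code: "GEN_INV", desc: "Invalid list of genres",
    ok: (v) => v !== null && /^[A-Za-z\s\-]*(\|[A-Za-z\s\-]*)*$/.test(v) },
];

// Returns null for a valid row. Otherwise returns one error row with all the
// codes and descriptions concatenated with the given separator.
function validateRow(row, separator) {
  const failed = rules.filter((rule) => !rule.ok(row[rule.field]));
  if (failed.length === 0) return null;
  return {
    rownum: row.rownum,
    err_code: failed.map((rule) => rule.code).join(separator),
    err_desc: failed.map((rule) => rule.desc).join(separator),
  };
}

const bad = validateRow({ rownum: 47, Year: null, Genres: "Comedy, Drama" }, "/");
console.log(bad.err_code); // YEAR_NULL/GEN_INV
console.log(bad.err_desc); // Year invalid or absent/Invalid list of genres
```

Generating one output row per error instead would simply mean returning one object per failed rule rather than joining them.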

In the settings window, once you select a validation rule, you have two blocks of settings: the Data block, where you define the expected data for a field, and the Type block, where you validate whether a field matches a given type.

In the Data block you set the actual validation rule for a field. The following table summarizes the kinds of validations you may apply in this block:

| Validation | Data block options |
|---|---|
| Allowing (only) null values | Null allowed? / Only null values allowed? |
| Making sure that the length of the selected field is within a range of values | Max string length / Min string length. You may use one or both at the same time. |
| Making sure that the value of the selected field is within a range of values | Maximum value / Minimum value. You may use one or both at the same time. |
| Making sure that the selected field matches a pattern | Only numeric data expected; Expected start string; Expected end string; Regular expression expected to match |
| Making sure that the selected field doesn't match a pattern | Not allowed start string; Not allowed end string; Regular expression not allowed to match |
| Making sure that the selected field is one of the values in a given list | Allowed values (when you have a fixed list); Read allowed values from another step? (when the list comes from another stream) |

In the tutorial, you used just a couple of the options from this long list. For the validation of the genres, you used a regular expression that the field had to match. For the year, you checked that the field wasn't null and that it contained only numeric data.

Let's briefly explain what you did to validate the year. You read the year as a String. Then with the DV you checked that it contained only numeric data. If the data was valid, you changed the metadata to Integer after the row left the DV step.

Why didn't you simply validate whether the year was an Integer? Because the type validation just checks that the year is an Integer field, rather than checking whether it can be converted into an integer. In this case, the year is of type String because you read it as a String in the Text file input step.

What would happen if you read the year as an Integer? The invalid values would cause an error in the Text file input step, and the rows would never arrive at the DV step to be validated.

The Type block allows you to validate the type of an incoming field. This just checks the real data type, rather than checking whether the field can be converted into a given data type.
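The difference between checking a type and checking whether a value can be converted is easy to see in plain JavaScript. This is an analogy only: isIntegerType and isConvertibleToInteger are hypothetical functions, not what the DV step runs internally.

```javascript
// Type check: looks only at the type the value already has.
function isIntegerType(value) {
  return Number.isInteger(value);
}

// Convertibility check: a string made of digits only can become an integer.
function isConvertibleToInteger(value) {
  return typeof value === "string" && /^\d+$/.test(value);
}

const year = "1994"; // read as a String, as in the tutorial
console.log(isIntegerType(year));          // false: the value is a string
console.log(isConvertibleToInteger(year)); // true: it contains only digits
console.log(parseInt(year, 10) + 1);       // 1995, once the value is converted
```

This is why the tutorial reads the year as a String, validates that it holds only numeric data, and converts it to Integer afterwards.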

Have a go hero – validating the football matches file

From the Packt website, download the valid_countries.txt file. Modify the transformation from the previous "Hero" section by doing the following things.

After reading the file, apply the following validation rules:

| Field | Validation rule |
|---|---|
| Match_Date | dd/mm, where dd/mm is a valid date. |
| Home_Team | Belongs to the list of countries in the valid_countries.txt file. |
| Away_Team | Belongs to the list of countries in the valid_countries.txt file. |
| Result | n-n where n is a number. |

Also validate that Home_Team is different from Away_Team.

Use a Data Validator step when possible.

Send the bad rows to a file of invalid data and the good rows to the JavaScript step.

Test your transformation and check that every validation rule is applied as expected.

Cleansing data

While validation mainly means rejecting data, data cleansing detects and tries to fix not only invalid data, but also data considered illegal or inaccurate in a specific domain.

For example, consider a field representing a year. A year containing non-numeric symbols should always be considered invalid and then rejected.

Now look at the films example. In this specific case, the year might not be important to you. If you find a non-numeric value, you could just replace it with a null year, meaning unknown, and keep the data.

On the other hand, the simple rule that looks for numeric values is not enough. A year equal to 1084 should also be considered invalid, as it is impossible for a film to have been made at that time. However, as it is a common error to type 0 instead of 9, you may assume that there was a human mistake and you could replace the 0 in 1084 with a 9 automatically.

Doing data cleansing actually involves trying to detect and deal with these kinds of situations, knowing in advance the rules that apply.
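The two cleansing rules just described can be written down as a short plain JavaScript sketch. The range limits MIN_YEAR and MAX_YEAR are assumptions chosen only for this example, and cleanseYear is a hypothetical function; your own business rules decide the real limits.

```javascript
const MIN_YEAR = 1880; // assumed lower bound, chosen only for this sketch
const MAX_YEAR = 2010; // assumed upper bound, chosen only for this sketch

function inRange(year) {
  return year >= MIN_YEAR && year <= MAX_YEAR;
}

// Returns a cleansed year, or null when the year has to be treated as unknown.
function cleanseYear(raw) {
  if (raw === null || !/^\d+$/.test(raw)) return null; // non-numeric: unknown
  const year = parseInt(raw, 10);
  if (inRange(year)) return year;
  // Assume a typing mistake: 0 typed instead of 9 in the second digit.
  if (/^10\d\d$/.test(raw)) {
    const corrected = parseInt("19" + raw.slice(2), 10);
    if (inRange(corrected)) return corrected;
  }
  return null; // no rule applies
}

console.log(cleanseYear("1994")); // 1994
console.log(cleanseYear("1084")); // 1984
console.log(cleanseYear("19x4")); // null
```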

Data cleansing, also known as data cleaning or data scrubbing, may be done manually or automatically depending on the complexity of the cleansing. With PDI you can use the automated option. For the validating part of the process, you can use any of the steps or mechanisms explained above. For the cleaning part you can use any PDI step that suits, but there are some steps that are particularly useful.

| Step | Purpose |
|---|---|
| If field value is null | If a field is null, it changes its value to a constant. It can be applied to all fields of the same data type, or to particular fields. |
| Null if... | Sets a field value to null if it is equal to a constant value. |
| Number range | Creates ranges based on a numeric field. An example of use is converting floating numbers to a discrete scale such as 0, 0.25, 0.50, and so on. |
| Value Mapper | Maps values of a field from one value to another. For example, you can use this step to convert yes/no, true/false, or 0/1 values to Y/N. |
| Stream lookup | Looks up values coming from another stream. In data cleansing, you can use it to set a default value if your field is not in a given list. |
| Database lookup | Same as Stream lookup but looking in a database table. |
| Unique rows | Removes consecutive duplicate rows and leaves only unique occurrences. |

For examples that use these steps, or for more information about them, please refer to Appendix C, Job Entries and Steps Reference.

Have a go hero – cleansing films data

From the Packt website, download the fix_genres.txt file. The file has the following lines:

erroneous;fixed

commedy;comedy

sci-fi; science fiction

science-fiction; science fiction

musical;music

historical;history

Create a new transformation and do the following:

Read the modified films file that you have used throughout the chapter. Validate that the genre is a list of strings separated by |. Send the bad rows to a file of bad rows. So far, this is the same as you did in the last two tutorials. Now clean the genres in the lists.

For every genre:

1. Check that it is not null. If it is null, discard it.

2. Split composed genres in two. For example, Historical drama becomes historical and drama.

3. Standardize the descriptions:

   - Remove trailing spaces.

   - Change the descriptions to lower case.

4. Check that it is not misspelled. To do that, use the miss_genres.txt file. If the genre is in the list, replace the text with the correct description.

5. After all this cleaning, add a Dummy step and preview the results.

To validate each genre, you can split the Genres field into rows. After the cleansing, you can recover the original lines by grouping the rows, using as aggregate Concatenate strings separated by | to concatenate the validated genres.
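The split, clean, and regroup idea from the tip can be tried out in plain JavaScript before you build it with PDI steps. This sketch is not a solution to the exercise: it skips the splitting of composed genres, it trims spaces on both sides, and cleanGenres is a hypothetical function. The corrections are the ones listed in the fix_genres.txt lines shown earlier.

```javascript
// Corrections copied from the fix_genres.txt listing (values trimmed).
const FIXES = {
  "commedy": "comedy",
  "sci-fi": "science fiction",
  "science-fiction": "science fiction",
  "musical": "music",
  "historical": "history",
};

function cleanGenres(genres) {
  return genres
    .split("|")                          // one genre per element, like splitting into rows
    .map((g) => g.trim().toLowerCase())  // standardize the descriptions
    .filter((g) => g !== "")             // discard empty genres
    .map((g) => (Object.prototype.hasOwnProperty.call(FIXES, g) ? FIXES[g] : g))
    .join("|");                          // regroup, concatenating with |
}

console.log(cleanGenres("Commedy | Sci-fi |  | Drama ")); // comedy|science fiction|drama
```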

Summary

In this chapter, you learned two essential subjects when it comes to the running of transformations by nontechnical users, in production environments, with real data: validating data and handling errors.

In the next chapter, we go back to development, this time with a subject that most of you must have been waiting for since Chapter 1: working with databases.



Working with Databases

Database systems are the main mechanism used by most organizations to store and administer organizational data. Online sales, bank-related operations, customer service history, and credit card transactions are some examples of data stored in databases.

This is the first of two chapters fully dedicated to working with databases. This chapter provides an overview of the main database concepts. It also covers the following topics:

- Connecting to databases

- Previewing and getting data from a database

- Inserting, updating, and deleting data from a database

Introducing the Steel Wheels sample database

As you were told in the first chapter, there is a Pentaho Demo that includes data for a fictional store named Steel Wheels, and you can download it from the Internet. This data is stored in a database that is going to be the starting point for you to learn how to work with databases in PDI. Before beginning to work with databases, let's briefly introduce the Steel Wheels database along with some database definitions.



A relational database is a collection of items stored in tables. Typically, all items stored in a table belong to a particular type of data. The following table lists some of the tables in the Steel Wheels database:

| Table | Content |
|---|---|
| CUSTOMERS | Steel Wheels' customers |
| EMPLOYEES | Steel Wheels' employees |
| PRODUCTS | Products sold by Steel Wheels |
| OFFICES | Steel Wheels' offices |
| ORDERS | Information about sales orders |
| ORDERDETAILS | Details about the sales orders |

The items stored in the tables represent an entity or a concept in the real world. As an example, the CUSTOMERS table stores items representing customers. The ORDERS table stores items that represent sales orders in the real world.

In technical terms, a table is uniquely identified by a name such as CUSTOMERS, and contains columns and rows of data.

You can think of a table as a PDI dataset. You have fields (the columns of the table) and rows (the records of the table).

The columns, just like the fields in a PDI dataset, have metadata describing their name, type, and length. The records hold the data for those columns; each record represents a different instance of the items in the table. As an example, the table CUSTOMERS describes the customers with the columns CUSTOMERNUMBER, CUSTOMERNAME, CONTACTLASTNAME, and so forth. Each record of the table CUSTOMERS belongs to a different Steel Wheels customer.

A table usually has a primary key. A primary key or PK is a combination of one or more columns that uniquely identifies each record of the table. In the sample table, CUSTOMERS, the primary key is made up of a single column, CUSTOMERNUMBER. This means there cannot be two customers with the same customer number.

Tables in a relational database are usually related to one another. For example, the CUSTOMERS and ORDERS tables are related to convey the fact that real-world customers have placed one or more real-world orders. In the database, the ORDERS table has a column named CUSTOMERNUMBER with the number of the customer who placed the order. As said, CUSTOMERNUMBER is the column that uniquely identifies a customer in the CUSTOMERS table. Thus, there is a relationship between both tables. This kind of relationship between columns in two tables is called a foreign key or FK.
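If you think of tables as PDI datasets, the primary key and foreign key ideas can be checked with a few lines of plain JavaScript. The sample rows and the ORDERNUMBER column below are invented for the illustration; only the table names and the CUSTOMERNUMBER and CUSTOMERNAME columns come from the text. The functions isPrimaryKey and isForeignKey are hypothetical helpers.

```javascript
// Invented sample rows standing for the CUSTOMERS and ORDERS tables.
const customers = [
  { CUSTOMERNUMBER: 1, CUSTOMERNAME: "Customer A" },
  { CUSTOMERNUMBER: 2, CUSTOMERNAME: "Customer B" },
];
const orders = [
  { ORDERNUMBER: 10, CUSTOMERNUMBER: 1 },
  { ORDERNUMBER: 11, CUSTOMERNUMBER: 1 },
  { ORDERNUMBER: 12, CUSTOMERNUMBER: 2 },
];

// Primary key: no two rows may share the same value in the column.
function isPrimaryKey(rows, column) {
  return new Set(rows.map((r) => r[column])).size === rows.length;
}

// Foreign key: every value in the child column must exist in the parent table.
function isForeignKey(childRows, column, parentRows) {
  const keys = new Set(parentRows.map((r) => r[column]));
  return childRows.every((r) => keys.has(r[column]));
}

console.log(isPrimaryKey(customers, "CUSTOMERNUMBER"));         // true
console.log(isPrimaryKey(orders, "CUSTOMERNUMBER"));            // false: customer 1 placed two orders
console.log(isForeignKey(orders, "CUSTOMERNUMBER", customers)); // true
```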

Connecting to the Steel Wheels database

The first thing you have to do in order to work with a database is tell PDI how to access the database. Let's learn how to do it.

Time for action – creating a connection with the Steel Wheels database

In this first database tutorial, you will download the sample database and create a connection for accessing it from PDI.

The Pentaho BI demo includes the sample data. So, if you have already downloaded the demo as explained in Chapter 1, just skip the first three steps. If the Pentaho BI demo is running on your machine, the database server is running as well. In that case, skip the first four steps.

1. Go to the Pentaho Download site: http://sourceforge.net/projects/pentaho/files/.

2. Under Business Intelligence Server | 1.7.1-stable, look for the file named pentaho_sample_data-1.7.1.zip and download it.

3. Unzip the downloaded file.

4. Run start_hypersonic.bat under Windows or start_hypersonic.sh under Unix-based operating systems. If you downloaded the sample data, you will find these scripts in the folder named pentaho-data. If you downloaded the Pentaho BI server instead, you will find them in the folder named data. The following screen is displayed when the database server starts:

Don't close this window. It would cause the database

server to stop.

5. Open Spoon and create a new transformation.

6. Click on the View option that appears in the upper-left corner of the screen.

7. Right-click the Database connections option and click on New.

8. Fill the Database Connection dialog window as follows:

9. Click on the Test button. The following window shows up:

If you get an error message instead of the Message window shown in the previous screenshot, please recheck the data you entered in the connection window. Also verify that the database is running, that is, that the terminal window is still open and doesn't show an error message. If you see an error, or if you don't see the terminal, please start the database server again as explained at the beginning of the tutorial.

10. Click on OK to close the test window.

11. Click on OK again to close the database definition window. A new database connection is added to the tree.

12. Right-click on the database connection and click on Share. The connection is available in all transformations you create from now onwards. The shared connections are shown in bold letters.

13. Save the transformation.

What just happened?

You created and tested a connection to the Pentaho Sample database. Finally, you shared the connection so that it could be reused in other transformations.

Connecting with Relational Database Management Systems

Even if you've never worked with databases, you must have heard terms such as MySQL, Oracle, DB2, or MS SQL Server. These are just some of the many Relational Database Management Systems (RDBMS) on the market. An RDBMS is software that lets you create and administer relational databases.

In the tutorial you worked with HyperSQL DataBase (HSQLDB), just another RDBMS, formerly known as Hypersonic DB. HSQLDB has a small, fast database engine written in Java. HSQLDB is currently being used in many open source software projects such as OpenOffice.org 3.1 as well as in commercial projects and products such as Mathematica. You can get more information about HSQLDB at http://hsqldb.org/.

PDI has the ability to connect with both commercial RDBMSes such as Oracle or MS SQL Server and free RDBMSes such as MySQL. In order to get connected to a particular database, you have to define a connection to it.

A database connection describes all parameters needed to connect PDI to a database. To create a connection, you must give the connection a name and fill in at least the general settings:

Seng Descripon Steel Wheels sample

Connecon

type

Type of database system: HSQLDB, Oracle, MySQL,

Firebird, and so on.

HSQLDB

Method of

access

Nave (JDBC), ODBC, JNDI, or OCI. The available opons

depend on the type of DB.

Native (JDBC)

Host name Name or IP address for the host where the database is. localhost

Database

name

Idenes the database to which you want to connect. sampledata

Port number PDI sets as default the most usual port number for the

selected type of database. You can change it of course.

9001

User Name /

Password

Name of the user and password to connect to the

database.

pentaho_admin /

password

If you don't find your database engine in the list, you will still be able to connect to it by specifying the Generic database option as the connection type. In that case, you have to provide a connection URL and the driver class name.

Aer creang a connecon, you can click the Test buon to check that the connecon has

been dened correctly and that you can reach them from PDI.


The database connecons will be available just in the transformaon where you dened

them, unless you share it for reuse as you did in the tutorial. Normally, you share

connecons because you know that you will use them later in many transformaons.

The informaon about shared connecons is stored in a le named shared.xml, located in the

same folder as the kettle.properties le.

When you have shared connecons and you save the transformaon, the connecon

informaon is saved in the transformaon itself.

If there is more than one shared connecon, all of them will be saved

along with the transformaon, even if the transformaon doesn't use

them all. To avoid this, go to the eding opons and check the Only

save used connecons to XML? opon. This opon limits the XML

content of a transformaon to just the used connecons.

Pop quiz – defining database connections

Which options do you have to connect to the same database in several transformations?

a. Define the connection in each transformation that needs it
b. Define a connection once and share it
c. Either of the above options
d. Neither of the above options

Have a go hero – connecting to your own databases

You must have access to a database, whether local or in the network to which you are logged in. Get the connection information for the database. From PDI, create a connection to the database and test it to verify that you can access it from PDI.

Exploring the Steel Wheels database

In the previous secon, you learned about what RDBMSs are and how to connect to an

RDBMS from PDI. Before beginning to work with the data in a database, it would be useful to

get familiarized with that database. In this secon, you will learn to explore databases with

the PDI Database explorer.


Time for action – exploring the sample database

Let's explore the sample database:

1. Open the transformaon you just created.

2. Right-click the connecon in the Database connecons list and select Explore in the

contextual menu. The Database explorer on connecon window opens.

3. Expand the Tables node of the tree and select CUSTOMERS. This is how the

explorer looks:

4. Click on the Open SQL for [CUSTOMERS] opon.

5. The following SQL editor window appears:


6. Modify the text in the window so that you have the following:

SELECT

CUSTOMERNUMBER

, CUSTOMERNAME

, CITY

, COUNTRY

FROM CUSTOMERS

7. Click on Execute. You will see the following result:

8. Close the preview window (the window that shows the result of the execution) and the SQL editor window.

9. Click on OK to close the database explorer window.

What just happened?

You explored the Pentaho sample database with the PDI Database explorer.

A brief word about SQL

Before explaining the details of the database explorer, it's worth giving an introduction to SQL, a central topic in relational database terminology.

SQL, that is, Structured Query Language, is the language that lets you access and manipulate databases in an RDBMS.

SQL can be divided into two parts: DDL and DML.

The DDL, that is, Data Definition Language, is the branch of the language that basically allows creating or deleting databases and tables.

The following is an example of DDL. It is the DDL statement that creates the CUSTOMERS table.

CREATE TABLE CUSTOMERS

(

CUSTOMERNUMBER INTEGER

, CUSTOMERNAME VARCHAR(50)

, CONTACTLASTNAME VARCHAR(50)

, CONTACTFIRSTNAME VARCHAR(50)

, PHONE VARCHAR(50)

, ADDRESSLINE1 VARCHAR(50)

, ADDRESSLINE2 VARCHAR(50)

, CITY VARCHAR(50)

, STATE VARCHAR(50)

, POSTALCODE VARCHAR(15)

, COUNTRY VARCHAR(50)

, SALESREPEMPLOYEENUMBER INTEGER

, CREDITLIMIT BIGINT

)

;

This DDL statement tells the database to create the table CUSTOMERS with the column CUSTOMERNUMBER of the type INTEGER, the column CUSTOMERNAME of the type VARCHAR with length 50, and so on.

Note that INTEGER, VARCHAR, and BIGINT are HSQLDB types of data, not PDI ones.

The DML, that is, Data Manipulation Language, allows you to retrieve data from a database. It also lets you insert, update, or delete data from the database.

The statement you typed in the SQL editor is an example of DML:

SELECT

CUSTOMERNUMBER

, CUSTOMERNAME

, CITY

, COUNTRY

FROM CUSTOMERS

This statement is asking the database to retrieve all the rows of the CUSTOMERS table, showing only the CUSTOMERNUMBER, CUSTOMERNAME, CITY, and COUNTRY columns. After you clicked Execute, PDI queried the database and showed you a window with the data you had asked for.

If you had left the following statement:

SELECT * FROM CUSTOMERS

the window would have shown you all the columns of the CUSTOMERS table.
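
The difference between an explicit column list and SELECT * can be checked outside PDI with a short Python sketch. This is only an illustration under stated assumptions: SQLite (through Python's standard sqlite3 module) stands in for HSQLDB, the table is a cut-down version of CUSTOMERS, and the two rows are invented sample data.

```python
import sqlite3

# SQLite in memory stands in for HSQLDB; the rows below are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CUSTOMERS ("
    " CUSTOMERNUMBER INTEGER, CUSTOMERNAME VARCHAR(50), PHONE VARCHAR(50),"
    " CITY VARCHAR(50), COUNTRY VARCHAR(50))"
)
conn.executemany(
    "INSERT INTO CUSTOMERS VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Example Toys", "555-0100", "Madrid", "Spain"),
        (2, "Sample Models", "555-0101", "Lyon", "France"),
    ],
)

# Explicit column list: only the four named columns come back.
cursor = conn.execute(
    "SELECT CUSTOMERNUMBER, CUSTOMERNAME, CITY, COUNTRY FROM CUSTOMERS"
)
selected_columns = [col[0] for col in cursor.description]

# SELECT *: every column of the table comes back.
cursor = conn.execute("SELECT * FROM CUSTOMERS")
all_columns = [col[0] for col in cursor.description]
rows = cursor.fetchall()

print(selected_columns)
print(all_columns)
```

Both queries return every row; only the set of columns changes.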

SELECT is the statement that allows you to retrieve data from one or more tables. It is the most commonly used DML statement and you're going to use it a lot when working with databases in PDI. You will learn more about the SELECT statement in the next section of this chapter.

Other important DML statements are:

- INSERT: This allows you to insert rows in a table
- UPDATE: This allows you to update the values in rows of a table
- DELETE: This statement is used to remove rows from a table
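
As a quick illustration of these three statements, the following Python sketch runs one of each against a small table. It is a sketch under assumptions: SQLite (Python's standard sqlite3 module) stands in for HSQLDB, the table is a reduced CUSTOMERS table, and the values are invented.

```python
import sqlite3

# SQLite in memory stands in for HSQLDB; all values are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CUSTOMERS ("
    " CUSTOMERNUMBER INTEGER, CUSTOMERNAME VARCHAR(50), CITY VARCHAR(50))"
)

# INSERT adds rows to the table.
conn.execute(
    "INSERT INTO CUSTOMERS (CUSTOMERNUMBER, CUSTOMERNAME, CITY)"
    " VALUES (1, 'Example Toys', 'Madrid')"
)
conn.execute(
    "INSERT INTO CUSTOMERS (CUSTOMERNUMBER, CUSTOMERNAME, CITY)"
    " VALUES (2, 'Sample Models', 'Lyon')"
)

# UPDATE changes values in the rows that match the WHERE clause.
conn.execute("UPDATE CUSTOMERS SET CITY = 'Paris' WHERE CUSTOMERNUMBER = 2")

# DELETE removes the rows that match the WHERE clause.
conn.execute("DELETE FROM CUSTOMERS WHERE CUSTOMERNUMBER = 1")

remaining = conn.execute(
    "SELECT CUSTOMERNUMBER, CUSTOMERNAME, CITY FROM CUSTOMERS"
).fetchall()
print(remaining)
```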

It is important to understand the meaning of these basic statements, but you are not forced to learn them, as PDI offers you ways to insert, update, and delete without typing any SQL statement.

Although SQL is a standard, each database engine has its own version of the SQL language. However, all database engines support the main commands.

When you type SQL statements in PDI, try to keep the code within the standard. Your transformations will then be reusable in case you have to change the database engine.

If you are interested in learning more about SQL, there are a lot of tutorials on the Internet.

The following are a few useful links with tutorials and SQL references:

http://www.sqlcourse.com/

http://www.w3schools.com/SQl/

http://sqlzoo.net/

Unl now, you have used only HSQLDB. In the tutorials to come, you will also work with the

MySQL database engine. So, you may be interested in specic documentaon for MySQL,

which you can nd at http://dev.mysql.com/doc/. You can nd even more informaon

in books; there are plenty of books available about both SQL language and MySQL databases.




Exploring any congured database with the PDI Database explorer

The database explorer allows you to explore any congured database. When you open the

database explorer, the rst thing you see is a tree with the dierent objects of the database.

As soon as you select a database table, all buons to the right side become available for you

to explore that table. The following are the funcons oered by the buons at the right side

of the database explorer:

Opon Meaning

Preview rst 100 rows of ... Return the rst 100 rows of the selected table, or all the rows

if the table has less that 100. This opon shows all columns of

the table.

Preview rst...rows of ... The same as the previous opon, but here you decide the

number of rows to show.

Number of rows of ... Tells you the total number of records in the table.

Show layout of ... Shows you the metadata for the columns of the table.

Generate DDL Shows you the DDL statement that creates the selected table.

Generate DDL for other

connecon

It lets you select another existent connecon. Then it shows

you the DDL just like the previous opon. The dierence is that

the DDL is wrien with the syntax of the database engine of the

selected connecon.

Open SQL for ... Lets you edit a SELECT statement to query the table. Here you

decide which columns and rows to retrieve.

Truncate table Deletes all rows from the selected table.

In the tutorial you opened the Database explorer from the contextual menu in the Database connections tree. You can also open it by clicking the Explore option in the database definition window.

Have a go hero – exploring the sample data in depth

In the tutorial you just tried the Open SQL button. Feel free to try the other buttons to explore not only the CUSTOMERS table but also the rest of the tables found in the Steel Wheels database.

Have a go hero – exploring your own databases

In the previous section, there was a Hero exercise that asked you to connect to your own databases. If you have done that, then use a database connection defined by you and explore the database. See if you can recognize the different objects of the database. Run some previews to verify that everything looks as expected.

Querying a database

So far you have just connected to a database. You haven't yet worked with the data. Now is the time to do that.

Time for action – getting data about shipped orders

Let's continue working with the sample data.

1. Create a new transformaon.

2. Select the Design view.

3. Expand the input category of steps and drag a Table Input step to the canvas.

4. Double-click the step.

5. Click on the Get SQL select statement... buon. The database explorer

window appears.

6. Expand the tables list and select ORDERS.

7. Click on OK.

8. PDI asks if you want to include the eld names in the SQL. Answer Yes.

9. The SQL box gets lled with a SELECT SQL statement.

SELECT

ORDERNUMBER

, ORDERDATE

, REQUIREDDATE

, SHIPPEDDATE

, STATUS

, COMMENTS

, CUSTOMERNUMBER

FROM ORDERS


10. At the end of the SQL statement, add the following clause:

WHERE STATUS = 'Shipped'

11. Click Preview and then OK. The following window appears:

12. Close the window and click OK to close the step configuration window.
13. After the Table input step, add a Calculator step, a Number range step, a Sort rows step, and a Select values step, and link them as follows:
14. With the Calculator step, add an Integer field to calculate the difference between the shipped date and the required date. Use the calculation Date A – Date B (in days) and name the field diff_days. Use the Number range step to classify the delays in delivery.
15. Use the Sort rows step to sort the rows by the diff_days field.
16. Use the Select values step to select the delivery, ORDERNUMBER, REQUIREDDATE, and SHIPPEDDATE fields.
17. With the Select values step selected, do a preview. The following is how the final data will look:

What just happened?

From the sample database, you got information about shipped orders. After you read the data from the database, you classified the orders based on the time it took to do the shipment.
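
The logic of the Calculator, Number range, Sort rows, and Select values steps can be sketched in a few lines of Python. This is only an illustration: the three orders are invented, the sketch assumes that Date A is the shipped date and Date B the required date, and the ranges and labels in classify are hypothetical, since the tutorial defines the real ones in a dialog that is not reproduced in the text.

```python
from datetime import date

# Invented rows standing in for the output of the Table input step.
orders = [
    {"ORDERNUMBER": 1, "REQUIREDDATE": date(2004, 12, 10), "SHIPPEDDATE": date(2004, 12, 8)},
    {"ORDERNUMBER": 2, "REQUIREDDATE": date(2004, 12, 10), "SHIPPEDDATE": date(2004, 12, 15)},
    {"ORDERNUMBER": 3, "REQUIREDDATE": date(2004, 12, 10), "SHIPPEDDATE": date(2004, 12, 10)},
]


def classify(diff_days):
    """Hypothetical ranges; the real ones are configured in the Number range step."""
    if diff_days <= 0:
        return "on time"
    if diff_days <= 3:
        return "slightly late"
    return "late"


for order in orders:
    # Calculator: shipped date minus required date, in days.
    order["diff_days"] = (order["SHIPPEDDATE"] - order["REQUIREDDATE"]).days
    # Number range: turn the number into a label.
    order["delivery"] = classify(order["diff_days"])

# Sort rows: order by diff_days.
orders.sort(key=lambda order: order["diff_days"])

# Select values: keep only the four fields named in the tutorial.
kept_fields = ("delivery", "ORDERNUMBER", "REQUIREDDATE", "SHIPPEDDATE")
result = [{name: order[name] for name in kept_fields} for order in orders]

for row in result:
    print(row)
```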

Getting data from the database with the Table input step

The Table input step is the main step to get data from a database. In order to use it, you have to specify the connection with the database. In the tutorial you didn't explicitly specify one because there was just one connection and PDI put it as the default value. The connection was available because you shared it before. If you hadn't, you would have had to create it here again.

The output of a Table input step is a regular dataset. Each column of the SQL query leads to a new field, and the rows generated by the execution of the query become the rows of the dataset.

As the data types of the databases are not exactly the same as the PDI data types, when getting data from a table, PDI implicitly converts the metadata of the new fields.

For example, consider the ORDERS table. Open the Database explorer and look at the DDL definition for the table. Then right-click the Table input step and select Show output fields to see the metadata of the created dataset. The following table shows you how the metadata was translated:

Table columns | Database data type | PDI metadata
ORDERNUMBER, CUSTOMERNUMBER | INTEGER | Integer(9)
ORDERDATE, REQUIREDDATE, SHIPPEDDATE | TIMESTAMP | Date
STATUS | VARCHAR(15) | String(15)
COMMENTS | TEXT | String(214748364)

Once the data comes out of the Table input step and the metadata is adjusted, PDI forgets that it comes from a database. It treats it just as regular data, no matter if it came from a database or any other data source.

Using the SELECT statement for generating a new dataset

The SQL area of a Table input step is where you write the SELECT statement that will

generate the new dataset. As said before, SELECT is the statement that you use to retrieve

data from one or more tables in your database.

The simplest SELECT statement is as follows:

SELECT <values>

FROM <table name>

Here <table name> is the name of the table that will be queried to get the result set and

<values> is the list of the desired columns of that table, separated by commas.

This is another simple SELECT statement:

SELECT ORDERNUMBER, ORDERDATE, STATUS

FROM ORDERS


If you want to select all columns, you can just put a * as here:

SELECT *

FROM ORDERS

There are some oponal clauses that you can add to a SELECT statement. The most

commonly used among the oponal clauses are WHERE and ORDER BY. The WHERE clause

limits the list of retrieved records, while ORDER BY is used to retrieve the rows sorted by

one or more columns.

Another common clause is DISTINCT that can be used to return only dierent records.

Let's see some sample SELECT statements:

Sample statement | Output
SELECT ORDERNUMBER, ORDERDATE FROM ORDERS WHERE SHIPPEDDATE IS NULL | Returns the number and order date for the orders that have not been shipped.
SELECT * FROM EMPLOYEES WHERE JOBTITLE = 'Sales Rep' ORDER BY LASTNAME, FIRSTNAME | Returns all columns for the employees whose job is sales representative, ordered by last name and first name.
SELECT PRODUCTNAME FROM PRODUCTS WHERE PRODUCTLINE LIKE '%Cars%' | Returns the list of products whose product line contains cars, for example, Classic cars and Vintage cars.
SELECT DISTINCT CUSTOMERNUMBER FROM PAYMENTS WHERE AMOUNT > 80000 | Returns the list of customer numbers who have made payments with checks above USD80,000. The customers who have paid more than once with a check above USD80,000 appear more than once in the PAYMENTS table, but only once in this result set.

You can try these statements in the database explorer to check that the result sets are

as explained.
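
If you prefer to experiment outside the database explorer, the following Python sketch runs two of the sample statements against a tiny stand-in database. Everything about the data is an assumption: SQLite (Python's standard sqlite3 module) replaces HSQLDB, the tables contain only the columns the queries need, and the rows are invented. An ORDER BY is added only to make the output order predictable.

```python
import sqlite3

# SQLite in memory stands in for the Steel Wheels database; data is invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PRODUCTS (PRODUCTNAME VARCHAR(70), PRODUCTLINE VARCHAR(50))")
conn.execute("CREATE TABLE PAYMENTS (CUSTOMERNUMBER INTEGER, AMOUNT REAL)")
conn.executemany(
    "INSERT INTO PRODUCTS VALUES (?, ?)",
    [("Model A", "Classic Cars"), ("Model B", "Vintage Cars"), ("Model C", "Planes")],
)
conn.executemany(
    "INSERT INTO PAYMENTS VALUES (?, ?)",
    [(101, 90000.0), (101, 85000.0), (102, 1200.0), (103, 81000.0)],
)

# WHERE with LIKE: product lines containing the text "Cars".
car_products = [
    row[0]
    for row in conn.execute(
        "SELECT PRODUCTNAME FROM PRODUCTS"
        " WHERE PRODUCTLINE LIKE '%Cars%' ORDER BY PRODUCTNAME"
    )
]

# DISTINCT: customer 101 paid twice above 80000 but is listed once.
big_payers = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT CUSTOMERNUMBER FROM PAYMENTS"
        " WHERE AMOUNT > 80000 ORDER BY CUSTOMERNUMBER"
    )
]

print(car_products)
print(big_payers)
```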

When you add a Table input step, it comes with a default SELECT statement for you

to complete.

SELECT <values> FROM <table name> WHERE <conditions>

If you need to query a single table, you can take advantage of the Get SQL select statement... button, which generates the full statement for you. After you get the statement, you can modify it at will by adding, say, WHERE or ORDER BY clauses just as you did in the tutorial. If you need to write more complex queries, you will have to do it manually.


You can write any SELECT query as long as it is a valid SQL statement for

the selected type of database. Remember that every database engine has

its own dialect of the language.

Whether simple or complex, you may need to pass some parameters to the query. You can do it in a couple of ways. Let's explain this with two practical examples.

Making flexible queries by using parameters

One of the ways you have to make your queries more flexible is by passing them some parameters. In the following tutorial you will learn how to do it.

Time for action – getting orders in a range of dates by using parameters

Now you will modify your transformation so that it shows orders in a range of dates.

1. Open the transformation from the previous tutorial and save it under a new name.
2. From the Input category, add a Get System Info step.
3. Double-click it and use the step to get the command line argument 1 and command line argument 2 values. Name the fields date_from and date_to respectively. Create a hop from the Get System Info step to the Table input step.
4. Double-click the Table input step.
5. Modify the SELECT statement as follows:

SELECT

ORDERNUMBER

, ORDERDATE

, REQUIREDDATE

, SHIPPEDDATE

FROM ORDERS

WHERE STATUS = 'Shipped'

AND ORDERDATE BETWEEN ? AND ?

6. In the drop-down list to the right of Insert data from step, select the incoming step.
7. Click OK.
8. With the Select values step selected, click the Preview button.
9. Click on Configure.
10. Fill the Arguments grid. To the right of the argument 01, type 2004-12-01. To the right of the argument 02, type 2004-12-10.
11. Click OK. The following window appears:

What just happened?

You modified the transformation from the previous tutorial to get orders in a range of dates coming from the command line.

Adding parameters to your queries

You can make your queries more exible by adding parameters. Let's explain how you do it.

The rst thing to do is obtain the elds that will be plugged as parameters. You can get them

from any source by using any number of steps, as long as you create a hop from the last step

toward the Table input step.

In the tutorial you just used a Get System Info step that read the parameters from the

command line.

Once you have the parameters for the query, you have to change the Table input step

conguraon. In the Insert data from step opon, you have to select the name of the step

that the parameters will come from. In the query, you have to put a queson mark (?) for

each incoming parameter.

When you execute the transformaon, the queson marks are replaced, one by one, with

the data that comes to the Table input step.


Let's see how it works in the tutorial. The following is the output of the Get System Info step:

In the SQL statement, you have two question marks. The first is replaced by the value of the date_from field and the second is replaced by the value of the date_to field. Now the SQL statement becomes:

SELECT

ORDERNUMBER

, ORDERDATE

, REQUIREDDATE

, SHIPPEDDATE

FROM ORDERS

WHERE STATUS = 'Shipped'

AND ORDERDATE BETWEEN '2004-12-01' AND '2004-12-10'

Here 2004-12-01 and 2004-12-10 are the values you entered as arguments for the transformation.

The replacement of the markers respects the order of the incoming fields.

When you use question marks to parameterize a query, keep the following in mind: the number of fields coming to a Table input step must be exactly the same as the number of question marks found in the query.
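
Question-mark placeholders are not specific to PDI; many database APIs use the same idea. The Python sketch below shows positional parameters with the standard sqlite3 module as an analogy. It is not PDI's implementation: SQLite stands in for HSQLDB, the dates are stored as ISO text, and the four orders are invented. The tuple incoming_row plays the role of the row arriving from the Get System Info step.

```python
import sqlite3

# SQLite in memory stands in for HSQLDB; the orders are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ORDERS (ORDERNUMBER INTEGER, ORDERDATE TEXT, STATUS VARCHAR(15))")
conn.executemany(
    "INSERT INTO ORDERS VALUES (?, ?, ?)",
    [
        (1, "2004-11-29", "Shipped"),
        (2, "2004-12-02", "Shipped"),
        (3, "2004-12-09", "Cancelled"),
        (4, "2004-12-10", "Shipped"),
    ],
)

query = (
    "SELECT ORDERNUMBER FROM ORDERS"
    " WHERE STATUS = 'Shipped' AND ORDERDATE BETWEEN ? AND ?"
    " ORDER BY ORDERNUMBER"
)

# The values are bound in order: first ? gets date_from, second ? gets date_to.
incoming_row = ("2004-12-01", "2004-12-10")
matching = [row[0] for row in conn.execute(query, incoming_row)]
print(matching)

# One value for two question marks is rejected, mirroring the rule that the
# number of incoming fields must equal the number of question marks.
try:
    conn.execute(query, ("2004-12-01",))
    mismatch_rejected = False
except sqlite3.ProgrammingError:
    mismatch_rejected = True
print(mismatch_rejected)
```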

Making exible queries by using Kettle variables

Another way you have to make your queries exible is by using Kele variables. Let's explain

how you do it using an example.


Time for action – getting orders in a range of dates by using variables

In this tutorial you will do the same as you did in the previous tutorial, but using another method.

1. Open the main transformation we created in the Time for action – getting data about shipped orders section and save it under a new name.
2. Double-click the Table input step.
3. Modify the SELECT statement as follows:

SELECT

ORDERNUMBER

, ORDERDATE

, REQUIREDDATE

, SHIPPEDDATE

FROM ORDERS

WHERE STATUS = 'Shipped'

AND ORDERDATE BETWEEN '${DATE_FROM}' AND '${DATE_TO}'

4. Tick the Replace variables in script? checkbox.
5. Save the transformation.
6. With the Select values step selected, click the Preview button.
7. Click on Configure.
8. Fill the Variables grid in the settings dialog window: type 2004-12-01 to the right of the DATE_FROM option and 2004-12-10 to the right of the DATE_TO option.
9. Click OK. The following window appears:


What just happened?

You modied the transformaon from the previous tutorial, so the range of dates is taken

from two variables—DATE_FROM and DATE_TO. The nal result set was exactly the same you

got in the previous version of the transformaon.

Using Kettle variables in your queries

As an alternave to the use of posional parameters, you can use Kele variables. Instead

of geng the parameters from an incoming step, you check the opon Replace variables in

script? and replace the queson marks by names of variables. The nal result is the same.

PDI replaces the names of the variables by their values. Only aer that, it sends the SQL

statement to the database engine to be evaluated.

The advantage of using posional parameters over the variables is quite obvious—you don't

have to dene the variables in advance.

On the contrary, Kele variables have several advantages over the use of queson marks:

You can use the same variable more than once in the same query.

You can use variables for any poron of the query, not just the values. For example,

you could have the following query:

SELECT ORDERNUMBER FROM ${ORDER_TABLE}

Then the result will vary upon the content of the variable ${ORDER_TABLE}. In the

case of this example, the variable could be ORDERS or ORDERDETAILS.

A query with variables is easier to understand and less error prone than a query

with posional parameters. When you use posional parameters, it's quite common

to get confused and make mistakes.

Note that in order to provide parameters to a statement in a

Table input step, it's perfectly possible to combine both methods:

posional parameters and Kele variables.

Pop quiz – database datatypes versus PDI datatypes

Aer you read data from the database with a Table Input step, what happens to the data

types of that data:

a. They remain unchanged

b. PDI converts the database data types to internal data types

c. It depends on how you dened the database connecon




Have a go hero – querying the sample data

Based on the sample data:

Create a transformaon to list the oces of Steel Wheels located in USA. Modify the

transformaon so that the country is entered by command line.

Create a transformaon that lists the contact informaon of clients whose credit

limit is above USD100,000. Modify the transformaon so that the threshold is

100000 by default, but can be modied when you run the transformaon.

(Hint: Use named parameters.)

Create a transformaon that generates two Excel les—one with a list of planes

and the other with a list of ships. Include the code, name, and descripon of

the products.

Sending data to a database

By now you know how to get data from a database. Now you will learn how to insert data into it. For the next tutorials we will use a MySQL database, so before proceeding make sure you have MySQL installed and running.

If you haven't yet installed MySQL, please refer to Chapter 1. It has basic instructions on installing MySQL, both on Windows and on Linux operating systems.

Time for action – loading a table with a list of manufacturers

Suppose you love jigsaw puzzles and decided to open a store for selling them. You have made all the arrangements and the only missing thing is the software. You have already acquired software to handle your business, but you still have one hard task to do: insert data into the database, that is, load the database with the basic information about the products you are about to sell.

As this is the first of several tutorials in which you will interact with that database, the first thing you have to do is to create the database.

For MySQL-specific tasks such as the creation of a database, we will use the MySQL Query Browser, included in the MySQL GUI Tools software. If you don't have it or don't like it, you can accomplish the same tasks by using the MySQL Command Line Client or any other GUI tool.




1. From the Packt website, download the script file js.sql.
2. Launch the MySQL Query Browser.
3. A dialog window appears asking you for the connection information. Enter localhost as Server Host, and as Username and Password, enter the name and password of the user you created when you installed the software.
4. Click on OK.
5. From the File menu, select Open Script....
6. Locate the downloaded file and open it.
7. Click on the Execute button or press Ctrl+Enter.
8. In the Schemata tab window, a new database, js, appears.
9. Right-click the name of the database and select Make Default Schema.
10. In the Schemata tab window, expand the js tree and you will see the tables of the database.
11. Close the script window.

Now that the database has been created, let's load some data into it:

1. From the Packt website, download the manufacturers.xls file.
2. Open Spoon and create a new transformation.
3. Create a connection to the created database. Under Connection Type, select MySQL. In the Settings frame, insert the same values you provided for the connection in MySQL Query Browser: enter localhost as Host Name and js (the database you just created) as Database Name, and as User Name and Password, enter the name and password of the user you created when you installed MySQL. For other settings in the window, leave the default values. Test the connection to see if it has been properly created.

The main reasons for a failed test are erroneous data provided in the settings window or a server that is not running. If the test fails, please read the error message to know exactly what the error was and act accordingly.

4. Right-click the database connection and share it.


5. Drag an Excel Input step to the canvas and use it to read the manufacturers.xls file.
6. Click on Preview Rows to check that you are reading the file properly. You should see the following:
7. From the Output category of steps, drag a Table output step to the canvas.
8. Create a hop from the Excel Input step to the Table output step.
9. Double-click the Table output step and fill the main settings window as follows: select js as Connection and, as Target table, browse and select the table manufacturers or type it. Check the Specify database fields option.

It is not mandatory but recommended in this particular exercise that you also check the Truncate table option. Otherwise, the output table will have duplicate records if you run the transformation more than once.

10. Select the Database fields tab.


11. Fill the grid as follows:
12. Click OK.
13. After the Table output step, add a Write to log step.
14. Right-click the Table output step and select Define error handling....
15. Fill the error handling settings window. As Target step, select the Write to log step. Check the Enable the error handling? option. Enter db_err_desc as Error descriptions fieldname, db_err_field as Error fields fieldname, and db_err_cod as Error codes fieldname.
16. Click OK. The following is your final transformation:
17. Save the transformation and run it.


18. Take a look at the Steps Metrics tab window. You will see the following:

19. Now look at the Logging tab window. The following is what you see:

20. Switch to MySQL Query Browser.

21. In the Schemata window, double-click the manufacturers table.

22. The query entry box is filled with a basic SELECT statement for that table such as:

SELECT * FROM manufacturers m;


23. Click Execute. The following result set is shown:

What just happened?

In the rst part of the tutorial, you created the Jigsaw Puzzle database.

In Spoon, you created a connecon to the new database.

Finally, you created a transformaon that read an Excel le with a list of puzzle

manufacturers and inserted that data into the manufacturers table. Note that not

all rows were inserted. The row that couldn't be inserted was reported in the log.

In the data for the tutorial, there was a descripon too long to be inserted in the table. That

was properly reported in the log because you implemented error handling. Doing that, you

avoided the aboron of the transformaon due to errors like that. As you learned in the

previous chapter, when a row causes an error, it is up to you to decide what to do with that

row. In this case, the row was sent to the log and wasn't inserted. Other possible opons

for you are:

Chapter 8

[ 245 ]

Fixing the problem in the Excel le and rerunning the transformaon

Validang the data and xing it properly (for example, cung the descripons)

before the data arrives to the Table output step

Sending the full data for the erroneous rows to a le, xing manually the data in the

le, and creang a transformaon that inserts only this data

Inserting new data into a database table with the Table output step

The Table output step is the main PDI step for inserting new data into a database table.

The use of this step is simple. You have to enter the name of the database connection and the name of the table where you want to insert data. The names for the connection and the table are mandatory, but as you can see, there are some extra settings for the Table output step.

The Database fields tab lets you specify the mapping between the dataset stream fields and the table fields.

In the tutorial the dataset had two fields: CODE and NAME. The table has two columns named man_code and man_desc.

As the names are different, you have to explicitly indicate that the CODE field is to be written in the table field named man_code, and that the NAME field is to be written in the table field named man_desc.
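
The mapping between stream fields and table columns can be pictured as a simple lookup that drives an INSERT statement. The Python sketch below illustrates the idea; it is not how PDI is implemented. SQLite (Python's standard sqlite3 module) stands in for MySQL, the column types are hypothetical, and the two rows are invented.

```python
import sqlite3

# Stream field -> table column, as filled in the Database fields tab.
field_mapping = {"CODE": "man_code", "NAME": "man_desc"}

conn = sqlite3.connect(":memory:")
# Hypothetical column types; only the column names come from the tutorial.
conn.execute("CREATE TABLE manufacturers (man_code CHAR(3), man_desc VARCHAR(50))")

# Invented rows standing in for the output of the Excel Input step.
stream_rows = [
    {"CODE": "AAA", "NAME": "Example Puzzles"},
    {"CODE": "BBB", "NAME": "Sample Games"},
]

# Build the INSERT from the mapping: table columns on one side,
# one placeholder per mapped stream field on the other.
columns = ", ".join(field_mapping.values())
placeholders = ", ".join("?" for _ in field_mapping)
insert_sql = f"INSERT INTO manufacturers ({columns}) VALUES ({placeholders})"

for row in stream_rows:
    conn.execute(insert_sql, [row[field] for field in field_mapping])

stored = conn.execute(
    "SELECT man_code, man_desc FROM manufacturers ORDER BY man_code"
).fetchall()
print(insert_sql)
print(stored)
```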

The following are some important tips and warnings about the use of the Table output step:

- If the names of the fields in the PDI stream are equal to the names of the columns in the table, you don't have to specify the mapping. In that case, you have to leave the Specify database fields checkbox unchecked and make sure that all the fields coming to the Table output step exist in the table.
- Before sending data to the Table output step, check your transformation against the definition of the table. All the mandatory columns that don't have a default value must have a corresponding field in the PDI stream coming to the Table output step.
- Check the data types for the fields you are sending to the table. It is possible that a PDI field type and the table column data type don't match. In that case, fix the problem before sending the data to the table. You can, for example, use the Metadata tab of a Select values step to change the data type of the data.




In the Table output step, you may have noted a button named SQL. This button generates the DDL to create the output table. In the tutorial, the output table, manufacturers, already existed. But if you want to create the table from scratch, this button allows you to do it based on the database fields you provided in the step.

Inserting or updating data by using other PDI steps

The Table output step provides the simplest but not the only way to insert data into a

database table. In this secon, you will learn some alternaves for feeding a table with PDI.

Time for action – inserting new products or updating existent ones

So far, you created the Jigsaw Puzzles database and loaded a list of puzzle manufacturers. It's time to start loading information about the products you will sell: puzzles.

Suppose that, in order to show you what they are selling, the suppliers provide you with the lists of products made by the manufacturers themselves. Fortunately, they don't give you the lists on paper; they give you either plain files or spreadsheets. In this tutorial, you will take the list of products offered by the manufacturer Classic DeLuxe and load it into the products table.

1. From the Packt website, download the sample lists of products.

2. Open Spoon and create a new transformation.

3. Add a Text file input step and configure it to read the productlist_LUX_200908.txt file.

Pay attention to the each field. It's the price of the product and must be configured as a Number with format $0.00.

4. Preview the file. You should see the following:

Chapter 8


5. In the Selected Files grid, replace the text productlist_LUX_200908.txt with ${PRODUCTLISTFILE}.

6. Click on OK.

7. After the Text file input step, add an Add constants step.

8. Use it to add a String constant named man_code with value LUX.

9. From the Output category of steps, drag an Insert/Update step to the canvas. Create a hop from the Add constants step to this new step.

10. Double-click the step. Select js as Connection. As Target table, browse and select products. In the upper grid of the window, add the conditions pro_code = prodcod and man_code = man_code. Click the Edit mapping button. The mapping dialog window shows up.

11. Under the Source fields list, click on prodcod, under the Target fields list click on pro_code, and then click the Add button. Again, under the Source fields list click on title, under the Target fields list click on pro_name, and then finally click Add. Proceed with the mapping until you get the following:

12. Click OK.


13. Fill the Update column for the price row with the value Y. Fill the rest of the column with the value N. The final grid looks like this:

14. After the Insert/Update step, add a Write to log step.

15. Right-click the Insert/Update step and select Define error handling....

16. Fill the error handling settings window just as you did in the previous tutorial.

17. Save the transformation and run it by pressing the F9 key.

18. In the settings window, assign the PRODUCTLISTFILE variable with the value productlist_LUX_200908.txt.

19. Click on Launch.

20. When the transformation ends, check the Step Metrics. You will see the following:


21. Switch to the SQL Query Browser application.

22. Type the following in the query entry box:

SELECT * FROM products p;

23. Click on Execute. The following result set is shown:

What just happened?

You populated the products table with data found in text files. For inserting the data, you used the Insert/Update step.

As this was the first time you dealt with the products table, before you ran the transformation, the table was empty. After running the transformation, you could see how all products in the file were inserted in the table.

Time for action – testing the update of existing products

In the preceding tutorial, you used an Insert/Update step, but only inserted records. Let's try the transformation again to see how the update option works.

1. If you closed the transformation, please open it.

2. Press F9 to launch the transformation again.


3. As the value for the PRODUCTLISTFILE variable, insert productlist_LUX_200909.txt.

4. Click Launch.

5. When the transformation ends, check the Step Metrics tab. You will see the following:

6. Switch to the SQL Query Browser application and click Execute to run the query again. This time you will see this:


What just happened?

You reran the transformation that was created in the previous tutorial, this time using a different input file. In this file there were new products and some products were removed from the list, whereas some had their descriptions, categories, and prices modified.

When you ran the transformation for the second time, the new products were added to the table. Also, the modified prices of the products were updated. In the Step Metrics tab window, you can see the number of inserted records (Output column) and the number of updated ones (Update column).

Note that as the supplier may give you updated lists of products with different file names, you used a variable for the name of the file. Doing so, you were able to reuse the transformation for reading different files each time.

Inserting or updating data with the Insert/Update step

While the Table output step allows you to insert brand new data, the Insert/Update step allows you to do both, insert and update data, in a single step.

The rows coming to the Insert/Update step can be new data or can be data that already exists in the table. Depending on the case, the Insert/Update step behaves differently. Let's see each case in detail:

For each incoming row, the first thing the step does is use the lookup condition you put in the upper grid to check if the row already exists in the table.

In the tutorial you wrote two conditions: pro_code = prodcod and man_code = man_code. Doing so, you told the step to look for a row in the products table for which the table column pro_code is equal to the field prodcod of your row, and the table column man_code is equal to the field with the same name of your row.

If the lookup fails, that is, the row doesn't exist, the step inserts the row in the table by using the mapping you put in the lower grid.

The first time you ran the tutorial transformation, the table was empty. There were no rows against which to compare. In this case, all the lookups failed and, consequently, all rows were inserted.


This insert operation is exactly the same as the one you could have done with a Table output step. That implies that here you also have to be careful about the following:

All the mandatory columns that don't have a default value must be present in the Update Field grid, including the keys you used in the upper grid

The data types for the fields you are sending to the table must match the data type for the columns of the table

If the lookup succeeds, the step updates the table, replacing the old values with the new ones. This update is made only for the fields where you put Y as the value for the Update column in the lower grid.

If you don't want to perform any update operation, you can check the Don't perform any updates option.
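The lookup-then-insert-or-update behavior described above can be approximated in a few lines. This is only a sketch under stated assumptions: SQLite replaces MySQL, the price column name pro_price and the stream field name price are hypothetical, and the sample names and prices are made up. Only the price is flagged for update, as in the tutorial.

```python
import sqlite3


def insert_update(conn, row, perform_updates=True):
    """Look up by (pro_code, man_code); insert if missing, else update the price."""
    found = conn.execute(
        "SELECT 1 FROM products WHERE pro_code = ? AND man_code = ?",
        (row["prodcod"], row["man_code"]),
    ).fetchone()
    if found is None:
        # Lookup failed: insert the row using the field-to-column mapping.
        conn.execute(
            "INSERT INTO products (pro_code, man_code, pro_name, pro_price) "
            "VALUES (?, ?, ?, ?)",
            (row["prodcod"], row["man_code"], row["title"], row["price"]),
        )
        return "inserted"
    if perform_updates:
        # Lookup succeeded: update only the column flagged Y (the price).
        conn.execute(
            "UPDATE products SET pro_price = ? WHERE pro_code = ? AND man_code = ?",
            (row["price"], row["prodcod"], row["man_code"]),
        )
        return "updated"
    return "skipped"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (pro_code TEXT, man_code TEXT, pro_name TEXT, pro_price REAL)"
)
# Made-up names and prices, used only to exercise both branches.
first = insert_update(
    conn, {"prodcod": "CLBO1007", "man_code": "LUX", "title": "Old name", "price": 10.0}
)
second = insert_update(
    conn, {"prodcod": "CLBO1007", "man_code": "LUX", "title": "New name", "price": 12.5}
)
final_row = conn.execute("SELECT pro_name, pro_price FROM products").fetchone()
```

After the second call the stored name is still the old one while the price has changed, because the name column was not flagged for update.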

The second time you ran the tutorial, you had two types of products in the file: products that already existed in the database and new products. For example, consider the following row found in the second file:

PDI looks for a row in the table where pro_code is equal to CLTR1001 and man_code is equal to LUX (the field added with the Add constants step). It doesn't find it. Then it inserts a new row with the data coming from the file.

Take another sample row:

Puzzles in a Box

PDI looks for a row in the table where pro_code is equal to CLBO1007 and man_code is equal to LUX. It finds the following:

There are two differences between the old and the new versions of the product. Both the name and the price have changed.




As you configured the Insert/Update step to update only the price column, the update operation does so. The new record in the table after the execution of the transformation is this:

Have a go hero – populating a films database

From the Packt website, download the films.sql script file. Run the script in MySQL. A new database will be created to hold film data.

Browse the folder where you have the files for Chapter 7 and get the French films file. You will use it to populate the following tables of the films database: GENRES, PEOPLE, and FILMS.

Now follow these instructions:

1. Create a connection to the database.

2. In order to populate the GENRES table, you have to build a list of genres, no duplicates! For the primary key, GEN_ID, you don't have a value in the file. Create the key with an Add sequence step.

3. The table, PEOPLE, will have the names of both actors and directors. In order to populate that table, you will have to create a single list of people, no duplicates here either! To generate the primary key, use the same method as before.

4. Finally, populate the FILMS table with the whole list of films found in the file. Don't forget to handle errors so that you can detect bad rows.

Have a go hero – creating the time dimension

Now you're going to finish what you started back in Chapter 6: the creation of a time dimension.

From the Packt website, download the js_dw.sql script file. Run the script in MySQL. A new database named js_dw will be created.

Now you are going to modify the time_dimension.ktr transformation to load the time dataset into the lk_time table.


The following are some tips:

Create a connection to the created database

Find a correspondence between each field in the dataset and each column in the LK_TIME table

Use a Table output step to send the dataset to the table

After running the transformation, check if all rows were inserted as expected.

Pay attention to the main field in the time dimension: date. In the transformation the date is a field whose type is Date. However, in the table the type for the date field is CHAR(8). This column is meant to hold the date as a String with the format YYYYMMDD, for example 20090915.

As explained, the data types of the data you send to the table have to match the data types in the table. In this case, as the types don't match, you will have to use a Select values step and change the metadata of the date field from Date to String.
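To make the target format concrete, here is a small Python sketch of the same Date-to-String conversion. It only illustrates the YYYYMMDD layout; in PDI the conversion is done in the Select values step, and the function name date_to_char8 is hypothetical.

```python
from datetime import date


def date_to_char8(value: date) -> str:
    """Format a date as an eight-character YYYYMMDD string."""
    return value.strftime("%Y%m%d")


# The sample date from the text: September 15, 2009.
formatted = date_to_char8(date(2009, 9, 15))
```

The result always has eight characters, so it fits a CHAR(8) column.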

Have a go hero – populating the products table

This exercise has two parts. The first is intended to enrich the transformation you created in the tutorial. The transformation processed the product list files supplied by the Classics DeLuxe manufacturer. In the file, there was some extra information that you could put in the table, such as the number of pieces of a puzzle. However, the data didn't come ready to use. Consider, for example, this text: 500 pieces each. In order to get the number of pieces, you need to do some transformation. Modify the transformation so that you can enrich the data in the products table.
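As a hint for the kind of parsing involved, the following Python sketch pulls the number out of a text such as 500 pieces each with a regular expression. It is one possible approach, not the book's solution; the function name extract_pieces is hypothetical, and inside PDI you would use the steps of your choice to get the same effect.

```python
import re
from typing import Optional


def extract_pieces(text: str) -> Optional[int]:
    """Return the integer that precedes the word 'pieces', or None if absent."""
    match = re.search(r"(\d+)\s+pieces", text)
    return int(match.group(1)) if match else None


pieces = extract_pieces("500 pieces each")
```

Returning None for texts without a piece count keeps bad rows detectable instead of silently storing a wrong number.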

The second part of the exercise has to do with populating the products table with products from other manufacturers. Unfortunately, you can't expect all manufacturers to share the same structure for the list of products. Not only does the structure change, but the kind of information they give you can also vary. On the Packt website, you have several sample product files belonging to different manufacturers. Explore them, analyze them to see if you can identify the different data you need for the products table, and load all the products into the database by using a different transformation for each manufacturer.




The following are some tips:

Take as a model the transformation for the tutorial. You may reuse most of it.

You don't have to worry about the stock columns or the pro_type column because they already have default values.

Use the comments in the file to identify potential values for the pro_packaging, pro_shape and pro_style columns. Use the pro_packaging field for values such as 2 puzzles in a box. Use the pro_shape field for values such as Panoramic Puzzle or 3D Puzzle. Use the pro_style field for values such as Glow in the Dark or Wooden Puzzle.

You can leave the pro_description empty or put in it whatever you feel fits: a fixed string such as Best in market!, the full comment found in the file, or whatever your imagination says.

Pop quiz – Insert/Update step versus Table Output/Update steps

In the last tutorial you read a file and used an Insert/Update step to populate the products table. Look at the following variant of the transformation:

Suppose you use this transformation instead of the original. Compared to the results you got in the tutorial, after the execution of this version of the transformation, the products table will have:

a. The same number of records

b. More records

c. Fewer records

d. It depends on the contents of the file




Pop quiz – filtering the first 10 rows

The following SELECT statement:

SELECT TOP 10 * FROM CUSTOMERS

gives you the first ten customers in the CUSTOMERS table of the sample database.

Suppose you want to get the first ten products in the PRODUCTS table of the Jigsaw Puzzles database. Which of the following statements would do that?

a. SELECT TOP 10 * FROM products

b. SELECT * FROM products WHERE ROWNUM<11

c. SELECT * FROM products LIMIT 10

d. Any of the above statements

Eliminating data from a database

Deleting information from a database is not the most common operation with databases, but it is an important one. Now you will learn how to do it with PDI.

Time for action – deleting data about discontinued items

Suppose a manufacturer informs you about the categories of products that will no longer be available. You don't want to keep products in your database that you will not sell, so you use PDI to delete them.

1. From the Packt website, download the LUX_discontinued.txt file.

2. Create a new transformation.

3. With a Text file input step, read the file.

4. Preview the file. You will see the following:


5. After the Text file input step, add an Add constants step to add a String constant named man_code with value LUX.

6. Expand the Output category of steps and drag a Delete step to the canvas.

7. Create a hop from the Add constants step to the Delete step.

8. Double-click the Delete step. Select js as Connection and, as Target table, browse and select products. In the grid add the conditions man_code = man_code and pro_theme LIKE category. After the Delete step, add a Write to log step.

9. Right-click the Delete step and define the error handling just like you did in each of the previous tutorials in this chapter.

10. Save the transformation.

11. Before running the transformation, open the Database Explorer.

12. Under the js connection, locate the products table and click Open SQL for [products].

13. In the simple SQL editor type:

SELECT pro_theme, pro_name FROM js.products p
ORDER BY pro_theme, pro_name;

14. Click on Execute. You will see the following result set:


15. Close the preview data window and the results of the SQL window.

16. Minimize the database explorer window.

17. The database explorer is collapsed at the bottom of the Spoon window.

18. Run the transformation.

19. Look at the Step Metrics. The following is what you should see:

20. Maximize the database explorer window.

21. In the SQL editor window click Execute again. This time you will see this:


What just happened?

You deleted from the products table all products belonging to the categories found in the LUX_discontinued.txt file.

Note that to query the list of products, you used the PDI Database explorer. You could have done the same by using MySQL Query Browser.

Deleting records of a database table with the Delete step

The Delete step allows you to delete records of a database table based on a given condition. For each row coming to the step, PDI deletes the records that match the condition set in its configuration window.

Let's see how it worked in the tutorial. The following is the dataset coming to the Delete step:

For each of these two rows PDI performs a new delete operation.

For the first row, the records deleted from the products table are those where man_code is equal to LUX and pro_theme is like FAMOUS LANDMARKS.

For the second row, the records deleted from the products table are those where man_code is equal to LUX and pro_theme is like COUNTRYSIDE.

You can verify the performed operations by comparing the result sets you got in the database explorer before and after running the transformation.

Just for your information, you could have done the same task with the following DELETE statements:

DELETE FROM products
WHERE man_code = 'LUX' AND pro_theme LIKE 'FAMOUS LANDMARKS';

DELETE FROM products
WHERE man_code = 'LUX' AND pro_theme LIKE 'COUNTRYSIDE';


In the Step Metrics result, you may notice that the Update column for the Delete step has value 2. This number is the number of delete operations, not the number of deleted records, which was actually a bigger number.
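The difference between delete operations and deleted records can be seen in a small Python sketch. It assumes SQLite in place of MySQL, and the product names and the extra ANIMALS theme are made up for illustration; only the two category values and the LUX code come from the tutorial.

```python
import sqlite3


def delete_per_row(conn, incoming_rows):
    """Run one DELETE per incoming row; return (operations, deleted_records)."""
    operations = 0
    deleted = 0
    for row in incoming_rows:
        cursor = conn.execute(
            "DELETE FROM products WHERE man_code = ? AND pro_theme LIKE ?",
            (row["man_code"], row["category"]),
        )
        operations += 1             # one operation per incoming row
        deleted += cursor.rowcount  # records removed by this operation
    conn.commit()
    return operations, deleted


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (man_code TEXT, pro_theme TEXT, pro_name TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        ("LUX", "FAMOUS LANDMARKS", "Puzzle A"),  # made-up product names
        ("LUX", "FAMOUS LANDMARKS", "Puzzle B"),
        ("LUX", "COUNTRYSIDE", "Puzzle C"),
        ("LUX", "ANIMALS", "Puzzle D"),
    ],
)
incoming = [
    {"man_code": "LUX", "category": "FAMOUS LANDMARKS"},
    {"man_code": "LUX", "category": "COUNTRYSIDE"},
]
operations, deleted = delete_per_row(conn, incoming)
remaining = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
```

Here two incoming rows produce two operations, while the number of deleted records depends on how many products matched each condition.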

Have a go hero – deleting old orders

Create a transformation that asks for a date from the command line and deletes all orders from the Steel Wheels database whose order dates are before the given date.

Summary

This chapter discussed how to use PDI to work with databases. Specifically, the chapter covered the following:

Introduction to the Pentaho Sample Data Steel Wheels, the starting point for you to learn basic database theory

Creating connections from PDI to different database engines

Exploring databases with the PDI Database explorer

Basics of SQL

Performing CRUD (Create, Read, Update, and Delete) operations on databases

In the next chapter you will continue working with databases. You will learn some advanced concepts, including data warehouse-specific operations.



Performing Advanced Operations with Databases

In this chapter you will learn about advanced operations with databases. The first part of the chapter includes:

Populating the Jigsaw puzzle database so that it is prepared for the rest of the activities in the chapter

Doing simple lookups in a database

Doing complex lookups

The second part of the chapter is fully devoted to data warehouse-related concepts. The list of the topics that will be covered includes:

Introducing dimensional modeling

Loading dimensions

Preparing the environment

In order to learn the concepts of this chapter, a database with little or no data is useless. Therefore, the first thing you'll do is populate your Jigsaw puzzle database.

Time for action – populating the Jigsaw database

To load data massively into your Jigsaw database, you must have the Jigsaw database created and the MySQL server running. You already know how to do this. If not, please refer to Chapter 1 for the installation of MySQL and Chapter 8 for the creation of the Jigsaw database.




This tutorial will overwrite all your data in the js database. If you don't want to overwrite the data in your js database, you could simply create a new database with a different name and run the js.sql script to create the tables in your new database.

After checking that everything is in order, follow these instructions:

1. From Packt's website download the js_data.sql script file.

2. Launch the MySQL query browser.

3. From the File menu select Open Script....

4. Locate the downloaded file and open it.

5. At the beginning of the script file you will see this line:

USE js;

If you created a new database, replace the name js with the name of your new database.

6. Click on the Execute button.

7. At the bottom of the screen, you'll see a progress message.

8. When the script execution ends, verify that the database has been populated. Execute some SELECT statements such as:

SELECT * FROM cities

All tables must have records.

Having populated the database, let's prepare the Spoon environment:

1. Edit the kettle.properties file located in the PDI home directory. Add the following variables: DB_HOST, DB_NAME, DB_USER, DB_PASS, and DB_PORT. As values put the settings for your connection to the Jigsaw database. Use the following lines as a guide:

DB_HOST=localhost

DB_NAME=js

DB_USER=root

DB_PASS=1234

DB_PORT=3306

Chapter 9


2. Add the following variables: DW_HOST, DW_NAME, DW_USER, DW_PASS, and DW_PORT. As values, put the settings for your connection to the js_dw database (the database you created in Chapter 8 to load the time dimension). Here are some sample lines for you to use:

DW_HOST=localhost

DW_NAME=js_dw

DW_USER=root

DW_PASS=1234

DW_PORT=3306

Save the file.

3. Included in the downloaded material is a file named shared.xml. Copy it to your PDI home directory (the same directory where the kettle.properties file is), overwriting the existing file.

Before overwriting the file, please take a backup, as this will delete any shared connections you might have created.

4. Launch Spoon. If it was running, restart it so that it recognizes the changes in the kettle.properties file.

5. Create a new transformation.

If you don't see the shared database connections js and js_dw, please verify that you copied the shared.xml file to the right folder.

6. Right-click the js database connection and select Edit. In the Settings frame, instead of fixed values, you will see variables: ${DB_HOST} for Host Name, ${DB_NAME} for Database Name, and so on.

7. Test the connection.

8. Repeat the steps for the js_dw shared connection: Right-click the database connection and select Edit. In the Settings frame, you will see the variables you defined in the kettle.properties file: ${DW_HOST}, ${DW_NAME}, and so on.

9. Test the js_dw connection.


If any of the database tests fail, please check that the connection variables you put in the kettle.properties file are correct. Also check that the MySQL database server is running.

What just happened?

In this tutorial you prepared the environment for working in the rest of the chapter. You did two different things:

First, you ran a script that emptied all the js database tables and loaded data into them.

Then, you redefined the database connections to the databases js and js_dw.

Note that the names for the connections don't have to match the names of the databases. This can benefit you in the following way: If you created a database with a different name for the Jigsaw puzzle database, your connection may still be named js, and all code you download from the Packt website should work without touching anything but the kettle.properties file.

You edited the kettle.properties file by adding variables with the database connection values such as host name, database name, and so on. Then you edited the database connections. There you saw that the database settings didn't have values but variable names: the variables you had defined in the kettle.properties file. For shared connections, PDI takes the database definition from the shared.xml file.
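The idea of keeping settings as ${NAME} placeholders and resolving them from a properties file can be sketched as follows. This is a simplified Python illustration, not Kettle's own parser: the helper names are hypothetical, and the URL template is an assumption made for the example rather than something taken from shared.xml.

```python
import re


def parse_properties(text):
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, value = line.split("=", 1)
            props[key.strip()] = value.strip()
    return props


def substitute(template, props):
    """Replace each ${NAME} with its value; leave unknown names untouched."""
    return re.sub(
        r"\$\{(\w+)\}",
        lambda m: props.get(m.group(1), m.group(0)),
        template,
    )


sample = """
DB_HOST=localhost
DB_NAME=js
DB_PORT=3306
"""
props = parse_properties(sample)
# Illustrative template only; the real connection definition lives in shared.xml.
resolved = substitute("jdbc:mysql://${DB_HOST}:${DB_PORT}/${DB_NAME}", props)
```

Changing DB_NAME in the properties text is enough to point the same template at a different database, which is the benefit the note above describes.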

Note that you didn't save the transformation you created. That was intentional. The only purpose for creating it was to be able to see the shared connections.

Exploring the Jigsaw database model

The information in this section allows you to understand the organization of the data in the Jigsaw database. In the first place, you have an ERD. An ERD, or entity relationship diagram, is a graphical representation that allows you to see how the tables in a database are related to each other. The following is the ERD for the js database:




The following table contains a brief explanation of what each table is for:

Table name | Content
manufacturers | Information about manufacturers of the products.
products | It is about the products you sell such as puzzles and accessories. The table has descriptive information and data about prices and stock. The pro_type column has the type of product (puzzle, glue, and so on). Several of the columns apply only to puzzles, such as shape or pieces.
buy_methods | It contains information about the list of methods for buying, for example, in store, by telephone, and so on.
payment_methods | Information about the list of methods of payment such as cash, check, credit card, and so on.
countries | The list of countries.
cities | The list of cities.
customers | A list of customers. A customer has a number, a name, and an address.
invoices | The header of invoices including date, customer number, and total amount. The invoice dates range from 2004 to 2010.


Looking up data in a database

You already know how to create, update, and delete data from a database. It's now time to learn to look up data. Lookup is the act of searching for information in a database. You can look up a column of a single table or you can do more complex lookups. Let's begin with the simplest way of looking up.

Doing simple lookups

Sometimes you need to get information from a database table based on the data you have in your main stream. Let's see how you can do it.

Time for action – using a Database lookup step to create a list of products to buy

Suppose you have an online system for your customers to order products. On a daily basis, the system creates a file with the orders information. Now you will check if you have stock for the ordered products and make a list of the products you'll have to buy.

1. Create a new transformation.

2. From the Input category of steps, drag a Get data from XML step to the canvas.

3. Use it to read the orders.xml file. In the Content tab, fill the Loop XPath option with the /orders/order string. In the Fields tab get the fields.

4. Do a preview. You will see the following:


To keep this exercise simple, the file contains a single product per order.

5. Add a Sort rows step and use it to sort the data by man_code, prod_code.

6. Add a Group by step and double-click it.

7. Use the upper grid for grouping by man_code and prod_code.

8. Use the lower grid for adding a field with the number of orders in each group. As Name write quantity, as Subject ordernumber, and as Type write Number of Values (N). Expand the Lookup category of steps.

9. Drag a Database lookup step to the canvas and create a hop from the Group by step toward this step.

10. Double-click the Database lookup step.

11. As Connection, select js and in Lookup table, browse the database and select products or just type its name.

12. Fill the grids as follows:

If you don't see both grids, just resize the window. This is one of the few configuration steps that lack the scrollbar to the right side.

Also remember that with all grids in PDI, you always have the option to populate the grids by using the Get Fields and Get lookup fields buttons respectively.


13. Click on OK.

14. Add a Filter rows step to pass only the rows where pro_stock<quantity.

15. Add a Text file output step to send the manufacturer code, the product code, the product name, and the ordered quantity to a file named products_to_buy.txt.

16. Run the transformation.

17. The file should have the following content:

man_code;prod_code;pro_name;quantity

EDU;ED13_93;Times Square;1

RAV;RVZ50031;Disney World Map;2

RAV;RVZ50106;Star Wars Clone Wars;1

What just happened?

You processed a file with orders. You grouped and counted the ordered products by product code. Then with the Database lookup step, you looked up the products table for the record belonging to the ordered product. You added the name and stock for the products to your stream. After that, you kept only the rows for which the stock was lower than the units your customers ordered. With the rows that passed, you created a list of products to buy.
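The sort, group, and count part of this flow can be mimicked in Python as a rough analogy. The order rows below are made up (only the field names and product codes follow the tutorial), and the function name count_orders is hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Made-up orders; one product per order, as in the tutorial file.
orders = [
    {"ordernumber": 1, "man_code": "RAV", "prod_code": "RVZ50031"},
    {"ordernumber": 2, "man_code": "EDU", "prod_code": "ED13_93"},
    {"ordernumber": 3, "man_code": "RAV", "prod_code": "RVZ50031"},
]

key = itemgetter("man_code", "prod_code")


def count_orders(rows):
    """Sort by (man_code, prod_code), then count the orders in each group."""
    result = []
    for (man_code, prod_code), group in groupby(sorted(rows, key=key), key=key):
        result.append(
            {"man_code": man_code, "prod_code": prod_code, "quantity": len(list(group))}
        )
    return result


grouped = count_orders(orders)
```

Note that itertools.groupby only merges adjacent rows with equal keys, so the rows are sorted first, in the same order in which the tutorial places the Sort rows step before the Group by step.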

Looking up values in a database with the Database lookup step

The Database lookup step allows you to look up values in a database table. In the upper grid of the settings window, you specify the keys to look up. In the example you look for a record that has the same product code and manufacturer code as the codes coming in the stream.

In the lower grid you put the name of the table columns you want back. Those fields are added to the output stream. In this case, you added the name and the stock of the product.

The step returns only one row even if it doesn't find a matching record or if it finds more than one. When the step doesn't find a record with the given conditions, it returns null for all the added fields, unless you specify a default value for those new fields.

Note that this behavior is quite similar to the Stream lookup step's behavior. You search for a match and, if a record is found, the step returns you the specified fields. If not, the new fields are filled with default values. Besides the fact that the data is searched in a database, the new thing here is that you specify the comparator to be used: =, <, >, and so on. The Stream lookup step looks only for equal values.

As all the products in the file existed in your database, the step found a record for every row, adding to your stream two fields: the name and the stock for the product. You can check it by doing a preview on the Database lookup step.

After the Database lookup step, you used a Filter rows step to discard the rows where the stock was enough to cover the required quantity of products. You can avoid adding this step by refining the lookup configuration. In the upper grid you could add the condition pro_stock<quantity and check the Do not pass the row if the lookup fails checkbox; you now get a different result. The step will look not only for the product, but also for the condition pro_stock<quantity. If it doesn't find a record that matches, that is, the lookup fails, the Do not pass the row if the lookup fails checkbox does its work and filters the row out. With these changes, you don't have to use the extra Filter rows step, nor add the pro_stock field to the stream unless you need it for another use.

As a final remark: if the lookup returns more than one row, only the first is returned. You have the option to abort the whole transformation if this happens; simply check the Fail on multiple results? checkbox.
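The lookup rules just described (one row out per row in, defaults when nothing matches, first match when several do, optional failure on multiple results) can be sketched in Python. This is an approximation under assumptions: SQLite replaces MySQL, the stock value is made up, and the function and exception names are hypothetical.

```python
import sqlite3


class MultipleResultsError(Exception):
    """Raised when a lookup matches several rows and that is not allowed."""


def database_lookup(conn, row, defaults=None, fail_on_multiple=False):
    """Return the row extended with pro_name and pro_stock from products."""
    defaults = defaults or {"pro_name": None, "pro_stock": None}
    matches = conn.execute(
        "SELECT pro_name, pro_stock FROM products "
        "WHERE man_code = ? AND pro_code = ?",
        (row["man_code"], row["prod_code"]),
    ).fetchmany(2)  # two rows are enough to detect a multiple match
    if len(matches) > 1 and fail_on_multiple:
        raise MultipleResultsError(row)
    if not matches:
        return {**row, **defaults}  # no match: fill the new fields with defaults
    name, stock = matches[0]        # several matches: keep only the first
    return {**row, "pro_name": name, "pro_stock": stock}


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (man_code TEXT, pro_code TEXT, pro_name TEXT, pro_stock INTEGER)"
)
# The stock value 0 is made up for this sketch.
conn.execute("INSERT INTO products VALUES ('EDU', 'ED13_93', 'Times Square', 0)")
hit = database_lookup(conn, {"man_code": "EDU", "prod_code": "ED13_93", "quantity": 1})
miss = database_lookup(conn, {"man_code": "EDU", "prod_code": "NOPE", "quantity": 1})
```

A later filter on pro_stock < quantity would keep the hit row in this example, since the assumed stock is lower than the ordered quantity.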

Making a performance difference when looking up data in a database

Database lookups are costly and can severely impact transformation performance. However, performance can be significantly improved by using the cache feature of the Database lookup step. To enable the cache feature, just check the Enable cache? option.

This is how it works: Think of the cache as a buffer of high-speed memory that temporarily holds frequently requested data. By enabling the cache option, Kettle will look first in the cache and then in the database.

If the table where you look up has few records, you could preload the cache with all the data in the lookup table. You do it by checking the Load all data from table option. This will give you the best performance.

On the contrary, if the number of rows in the lookup table is too large to fit entirely into memory, instead of caching the whole table you can tell Kettle the maximum number of rows to hold in cache. You do it by specifying the number in the Cache size in rows textbox. The bigger this number, the faster the lookup process.

Be careful when setting the cache options. If you have a large table or don't have much memory, you risk running out of memory.
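The look-in-cache-first idea with a bounded size can be illustrated with a short Python class. The text does not say how Kettle decides which rows to drop when the cache is full, so the least-recently-used policy below is purely an assumption, as are the class name and the dictionary standing in for the products table.

```python
from collections import OrderedDict


class CachedLookup:
    """Wrap a lookup function with a bounded least-recently-used cache."""

    def __init__(self, fetch, max_rows):
        self.fetch = fetch        # function that would query the database
        self.max_rows = max_rows  # plays the role of "Cache size in rows"
        self.cache = OrderedDict()
        self.db_calls = 0

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as recently used
            return self.cache[key]
        self.db_calls += 1               # cache miss: go to the database
        value = self.fetch(key)
        self.cache[key] = value
        if len(self.cache) > self.max_rows:
            self.cache.popitem(last=False)  # evict the oldest entry
        return value


# A dict stands in for the products table in this sketch.
fake_table = {
    ("EDU", "ED13_93"): "Times Square",
    ("RAV", "RVZ50031"): "Disney World Map",
}
lookup = CachedLookup(fake_table.get, max_rows=100)
names = [lookup.get(("EDU", "ED13_93")) for _ in range(3)]
```

Three requests for the same key reach the stand-in table only once, which is where the performance gain comes from; the memory cost grows with max_rows.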

Have a go hero – preparing the delivery of the products

Create a new transformation and do the following. Taking as source the orders file, create a list of the customers who ordered products. Include their name, last name, and full address. Order the data by country name.

You will need two Database lookup steps: one for getting the customers' information and the other to get the name of the country.


Have a go hero – refining the transformation

Modify the original transformation. As the file may have been manipulated, it may contain invalid data. Apply the following treatment:

Verify that there is a customer with the given number. If the customer doesn't exist, discard the row. Use the Do not pass the row if the lookup fails checkbox.

In the rows that passed, verify that there is a product with the given manufacturer and product codes. If the data is valid, check the stock and proceed. If not, make a list so that the cases can be handled later by the customer care department.

Doing complex lookups

The Database lookup step is very useful and quite simple, but it lets you search only for columns of a specific table. Let's now try a step that allows you to do more complex searches.

Time for action – using a Database join step to create a list of suggested products to buy

If your customers ordered a product that is out of stock and you don't want to let them down, you can suggest some alternative puzzles for them to buy.

1. Open the transformation of the previous tutorial and save it under a new name.

2. Delete the Text file output step.

3. Double-click the Group by step and add an aggregated field named customers with the list of customers separated by (,). Under Subject, select idcus and as Type, select Concatenate strings separated by ,.

4. Double-click the Database lookup step. In the Values to return from the lookup table grid, add pro_theme as value in the String field.

5. Add a Select values step. Use it to select the fields customers, quantity, pro_theme, and pro_name. Also rename quantity as quantity_param and pro_theme as theme_param. From the Lookup category, drag a Database join step to the canvas. Create a hop from the Select values step to this step.

6. Double-click the Database join step.

7. Select js as Connection.



Chapter 9


8. In the SQL frame type the following statement:

SELECT man_code

, pro_code

, pro_name

FROM products

WHERE pro_theme like ?

AND pro_stock>=?

9. In the Number of rows to return textbox, type 4.

10. Fill the grid as shown:

11. Click on OK. The transformation looks like this:

12. With the last step selected, do a Preview.

13. You should see this:


14. In the Step Metrics you should see this:

What just happened?

You took the list of orders and filtered those for which you ran out of products. For the customers that ordered those products you built a list of four alternative puzzles to buy.

The selection of the puzzles was based on the theme. To filter the suggested puzzles, you used the theme of the ordered product.

The second parameter in the Database join step, the ordered quantity, was used to offer only alternatives for products for which there is sufficient stock.

Joining data from the database to the stream data by using a Database join step

With the Database join step, you can combine your incoming stream with data from your database, based on given conditions. The conditions are put as parameters in the query you write in the Database join step.

Note that this is not really a database join as the name suggests; it is a

join of data from the database to the stream data.

In the tutorial you used two parameters: the theme and the quantity ordered. With those parameters, you queried the list of products with the same theme:

where pro_theme like ?

and for which you have stock:

and pro_stock>=?


You set the parameters as question marks. This works like the question marks in a Table input step you learned in the last chapter: the parameters are replaced positionally. The difference is that here you define the list and the order of the parameters. You do it in the small grid at the bottom of the settings window. This means you aren't forced to use all the incoming fields as parameters, and that you also may change the order.

Just as you do in a Table input step, instead of using positional parameters, you can use Kettle variables by using the ${} notation and checking the Replace variables checkbox.

You don't need to add the Select values step to discard fields and rename the parameters. You did it just to have fewer fields in the final screenshot so that it was easier to understand the output of the Database join step.

The step will give you back the manufacturer code, the product code, and the product name

for the matching records.

As you cannot do a preview here, you can write and try your query inside a

Table input step or in MySQL Query Browser. When you are done, just copy

and paste the query here.

So far, you did the same as you could have done with a Database lookup step: looking for a record with a given condition, and adding new fields to the stream. However, there is a big difference here: you put 4 as the Number of rows to return. This means for each incoming row, the step will give you back up to four results. The following shows you this:

Note that if you had left the Number of rows to return empty, the step would have returned all found rows.
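To make the behavior concrete, here is a rough Python sketch of what a step like this does for every incoming row: it binds the chosen fields positionally to the question marks, runs the query, and emits one output row per match, up to the row limit. It runs against an in-memory SQLite database rather than the js database; the sample products, the database_join helper, and the simplified incoming row are hypothetical and only meant to mirror the tutorial.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE products (
    man_code TEXT, pro_code TEXT, pro_name TEXT,
    pro_theme TEXT, pro_stock INTEGER)""")
conn.executemany("INSERT INTO products VALUES (?, ?, ?, ?, ?)", [
    ("M1", "P1", "Castle at dawn", "Castles", 10),   # invented sample data
    ("M1", "P2", "Castle at dusk", "Castles", 0),
    ("M2", "P3", "Old fortress",   "Castles", 5),
    ("M2", "P4", "Harbour",        "Boats",   7),
])

# The query typed in the tutorial, with positional parameters.
QUERY = """SELECT man_code, pro_code, pro_name
           FROM products
           WHERE pro_theme LIKE ?
             AND pro_stock >= ?"""

def database_join(rows, param_fields, max_rows=None):
    """For each incoming row, run QUERY and yield one output row per match."""
    for row in rows:
        # The order of param_fields decides which field feeds which "?".
        params = [row[name] for name in param_fields]
        cursor = conn.execute(QUERY, params)
        matches = cursor.fetchall() if max_rows is None else cursor.fetchmany(max_rows)
        for man_code, pro_code, pro_name in matches:
            yield {**row, "man_code": man_code,
                   "pro_code": pro_code, "pro_name": pro_name}

stream = [{"customers": "1,2", "theme_param": "Castles", "quantity_param": 3}]
result = list(database_join(stream, ["theme_param", "quantity_param"], max_rows=4))
limited = list(database_join(stream, ["theme_param", "quantity_param"], max_rows=1))
```

With max_rows=4 the single incoming row produces two output rows, one per Castles puzzle with enough stock. With max_rows=1 it produces only one, and with max_rows=None all matches are returned.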


You may need to use a Database join step in several situations:

When, as the result of the lookup, there is more than one row for each incoming row. This was the case in the tutorial.

When you have to look in a combination of tables. Look at the following SQL statement:

SELECT co.country_name

FROM customers cu

, cities ci

, countries co

WHERE cu.city_id = ci.city_id

AND ci.cou_id = co.cou_id

AND cu.cus_id = 1000

This statement returns the name of the country where the customer with id 1000 lives. If you want to look up the countries where a list of customers live, you can do it with a statement like this by using a Database join step.

When you want to look for an aggregate result. Look at this sample query:

SELECT pro_theme

, count(*) quant

FROM products

GROUP BY pro_theme

ORDER BY pro_theme

This statement returns the number of puzzles by theme. If you have a list of themes and you want to find out how many puzzles you have for each theme, you can use a query like this also by using a Database join step.

The last option in the list can also be developed without using the Database join step. You could execute the SELECT statement with a Table Input step, and then look for the calculated quantity by using a Stream lookup step.
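The alternative just described, running the aggregate query once and then matching the stream against its result in memory, could look roughly like this in Python. The sample rows and the default of 0 for a theme with no puzzles are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (pro_code TEXT, pro_theme TEXT)")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [("P1", "Castles"), ("P2", "Castles"), ("P3", "Boats")])

# Role of the Table input step: run the aggregate query a single time.
counts = dict(conn.execute(
    "SELECT pro_theme, count(*) AS quant FROM products GROUP BY pro_theme"))

# Role of the Stream lookup step: match each incoming theme in memory.
themes = ["Boats", "Castles", "Space"]
result = [(theme, counts.get(theme, 0)) for theme in themes]
```

The database is queried once regardless of how many themes arrive in the stream, whereas a Database join step would run one query per incoming row.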

As you can see, this is another situation where PDI offers more than one way to do the same thing. Sometimes it is a matter of taste. In general, you should test each option and choose the method which gives you the best performance.




Have a go hero – rebuilding the list of customers

Redo the Hero exercise preparing the delivery of the products, this time using a Database join step. Try to discover which one is preferable from the point of view of performance. If you don't see any difference, try with a bigger number of records in the main stream. You will have to create your own dataset for this test.

Introducing dimensional modeling

So far you have dealt with the Jigsaw puzzles database, a database used for daily operational work. In the real world, a database like this is maintained by an On-Line Transaction Processing (OLTP) system. The users of an OLTP system perform operational tasks: sell products, process orders, control stock, and so on.

As a counterpart, a datawarehouse is a nonoperational database; it is a specialized database designed for decision support purposes. Users of a datawarehouse analyze the data, and they do it from different points of view.

The most used technique for delivering data to datawarehouse users is dimensional modeling. This technique makes databases simple and understandable.

The primary table in a dimensional model is the fact table. A fact table stores numerical measurements of the business such as quantity of products sold, amount represented by the sold products, discounts, taxes, number of invoices, number of claims, and anything that can be measured. These measurements are referred to as facts.

A fact is useless without the dimension tables. Dimension tables contain the textual descriptors of the business. Typical dimensions are product, time, customers, and regions. The fact along with all the surrounding dimension tables make a star-like structure often called a star schema.
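As a small illustration of how a fact table and a dimension table work together, the following Python sketch builds a two-table star in an in-memory SQLite database and groups the facts by a dimension attribute. The table names ft_sales_demo and lk_regions_demo, their columns, and the figures are invented for this example and are not part of the book's datamart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension table: textual descriptors (hypothetical layout).
CREATE TABLE lk_regions_demo (id INTEGER PRIMARY KEY, city TEXT, country TEXT);
-- Fact table: numerical measurements plus a key pointing to the dimension.
CREATE TABLE ft_sales_demo (id_region INTEGER, quantity INTEGER, amount REAL);

INSERT INTO lk_regions_demo VALUES
    (1, 'Lisbon', 'Portugal'), (2, 'Porto', 'Portugal'), (3, 'Madrid', 'Spain');
INSERT INTO ft_sales_demo VALUES
    (1, 2, 30.0), (2, 1, 12.5), (3, 4, 60.0), (1, 1, 15.0);
""")

# The facts are summed; the dimension supplies the grouping attribute.
sales_by_country = conn.execute("""
    SELECT d.country, SUM(f.quantity), SUM(f.amount)
    FROM ft_sales_demo f
    JOIN lk_regions_demo d ON d.id = f.id_region
    GROUP BY d.country
    ORDER BY d.country""").fetchall()
```

The same fact rows could be grouped by city instead of country just by changing the column taken from the dimension, which is what analyzing the data from different points of view amounts to.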

Datawarehouse is a very broad concept. In this book we will deal with datamarts. While a datawarehouse represents a global vision of an enterprise, a datamart holds the data from a single business process.

Data stored in datawarehouses and datamarts usually comes from different sources, the operational database being the main one. The process that takes the information from the source, transforms it in several ways, and finally loads the data into the datamart or datawarehouse is the already mentioned ETL process. As said, PDI is a perfect tool for accomplishing that task.

In the rest of this chapter, you will learn how to load dimension tables with PDI. This will build the basis for the final project of the book: Loading a full datamart.


Through the tutorials you will learn more about this. However, the terminology introduced here constitutes just a preamble to dimensional modeling. There is much more you can learn. If you are really interested in the subject, you should start by reading The Data Warehouse Toolkit (Second Edition) by Ralph Kimball and Margy Ross. The book is undoubtedly the best guide to dimensional modeling.

Loading dimensions with data

A dimension is an entity that describes your business; customers and products are examples of dimensions. A very special dimension is the time dimension that you already know. A dimension table (no surprises here) is a table that contains information about a dimension. In this section you will learn to load dimension tables, that is, fill dimension tables with data.

Time for action – loading a region dimension with a Combination lookup/update step

In this tutorial you will load a dimension that stores geographical information.

1. Launch Spoon.

2. Create a new transformation.

3. Drag a Table input step to the canvas and double-click it.

4. As connection select js.

5. In the SQL area type the following query:

SELECT ci.city_id, city_name, country_name

FROM cities ci, countries co

WHERE ci.cou_id = co.cou_id

6. Click on OK.

7. Expand the Data Warehouse category of steps.

8. Select the Combination lookup/update step and drag it to the canvas.

9. Create a hop from the Table input step to this new step.

10. Double-click the Combination lookup/update step.

11. As Connection select dw.


12. As Target table browse and select lk_regions or simply type it.

13. Enter id as Technical key field and lastupdate as Date of last update field.

14. Click OK.

15. After the Combination lookup/update step, add an Update step.

16. Double-click the Update step.

17. Select dw as Connection and lk_regions as Target table.

18. Fill the upper grid adding the condition id = id. The id to the left is the table id, while the id to the right is the stream id.

19. Fill the lower grid: Add one row with the values city and city_name. Add a second row with the values country and country_name. This will update the table columns city and country with the values city_name and country_name coming in the stream.

20. Now create another stream: Add to the canvas a Generate Rows step, a Table output step, and a Dummy step.

21. Link the steps in the order you added them.

22. Edit the Generate Rows step and set Limit to 1.

23. Add four fields in this order: An Integer field named id with value 0, a String field named city with value N/A, another String named country with value N/A, and an Integer field named id_js with value 0. Double-click the Table Output step.

24. Select dw as Connection and lk_regions as Target table.

25. Click on OK.

26. In the Table output step, enable error handling and send the bad rows to the Dummy step.


27. The transformation looks like this:

28. Save the transformation and run it.

29. The Step metrics looks like this:

30. Explore the js_dw database and do a preview of the lk_regions table. You should

see this:


What just happened?

You loaded the region dimension with geographical information: cities and countries.

Note that you took information from the operational database js and loaded a table in another database js_dw.

Before running the transformation, the dimension table lk_regions was empty. When the transformation ran, all cities were inserted in the dimension table.

Besides the records with cities from the cities table, you also inserted a special record with values n/a for the descriptive fields. You did it in the second stream added to the transformation.

Note that the dimension table lk_regions has a column named region that you didn't update because you don't have data for that column. The column is filled with a default value set in the DDL definition of the table.

Time for action – testing the transformation that loads the region dimension

1. In the previous tutorial you loaded a dimension that stores geographical information. You ran it once, causing the insertion of one record for each city and a special record with values n/a for the descriptive fields. Let's apply some changes in the operational database, and run the transformation again to see what happens.

2. Launch MySQL Query Browser.

3. Type the following statement to change the names of the countries to upper case:

UPDATE countries SET country_name = UCASE(country_name)

4. Execute it.

5. If the transformation created in the last tutorial is not open, open it again.

6. Run the transformation.


7. The Step Metrics looks like this:

8. Explore the js_dw database again and do a preview of the lk_regions table. This time you will see the following:

What just happened?

After changing the letter case for the names of the countries in the transactional database js, you again ran the transformation that updates the Regions dimension. This time the descriptions for the dimension table were updated.

As for the special record with values n/a for the descriptive fields, it had been created the first time the transformation ran. This time, as the record already existed, the row passed on to the Dummy step.


Describing data with dimensions

A dimension table contains descriptions about a particular entity or category of your business. Dimensions are one of the basic blocks of a datawarehouse or a datamart. A dimension has the purpose of grouping, filtering, and describing data.

Think of a typical report you would like to have: sales grouped by region, by customer, by method of payment, ordered by date. The word by lets you identify potential dimensions: regions, customers, methods of payment, and date.

Best practices say that a dimension table must have its own technical key column, different from the business key column used in the operational database. This technical key is known as a surrogate key. In the lk_regions dimension table the surrogate key is the column named id.

While in the operational database the key may be a string such as the manufacturer code in the manufacturers table, surrogate keys are always integers. Another good practice is to have a special record for unavailable data. In the case of the regions example, this implies that besides one record for every city, you should have a record with key equal to zero, and n/a or unknown or something that represents invalid data for all the descriptive attributes.

Along with the descriptive attributes that you save in a dimension, you usually keep the business key so that you can match the data in the dimension table with the data in the source database. The following screenshot depicts typical columns in a dimension table:


In the tutorial, you took information from the cities and countries tables and used that data to load the regions dimension. When there were changes in the transactional database, the changes were translated to the dimension table, overwriting the old values. A dimension where changes may occur from time to time is called a Slowly Changing Dimension or SCD for short. If, when you update an SCD, you don't preserve historical values but overwrite the old values, the dimension is called a Type I slowly changing dimension (Type I SCD).

Loading Type I SCD with a Combination lookup/update step

In the tutorial, you loaded a Type I SCD by using a Combination lookup/update step. The Combination lookup/update, or Combination L/U for short, looks in the dimension table for a record that matches the key fields you put in the upper grid in the settings window. If the combination exists, the step returns the surrogate key of the found record. If it doesn't exist, the step generates a new surrogate key and inserts a row with the key fields and the generated surrogate key. In any case, the surrogate key is added to the output stream.
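The lookup-or-insert logic described above, followed by a separate update of the descriptive columns, can be sketched in Python as follows. This is a simplified model and not PDI's code: it uses an in-memory SQLite table shaped like lk_regions without the lastupdate and region columns, generates keys as table maximum plus one, and the city and country values are invented.

```python
import sqlite3

dw = sqlite3.connect(":memory:")
dw.execute("""CREATE TABLE lk_regions (
    id INTEGER PRIMARY KEY,
    city TEXT DEFAULT 'N/A',        -- non-key columns need defaults or nulls
    country TEXT DEFAULT 'N/A',
    id_js INTEGER)""")

def combination_lookup(city_id):
    """Return the surrogate key for a business key, inserting a row if needed."""
    row = dw.execute("SELECT id FROM lk_regions WHERE id_js = ?",
                     (city_id,)).fetchone()
    if row:
        return row[0]                            # combination already exists
    # Key generation in the style of "table maximum + 1".
    new_id = dw.execute(
        "SELECT COALESCE(MAX(id), 0) + 1 FROM lk_regions").fetchone()[0]
    # Only the key columns are written here, as in the Combination L/U step.
    dw.execute("INSERT INTO lk_regions (id, id_js) VALUES (?, ?)",
               (new_id, city_id))
    return new_id

def load(rows):
    """Process (city_id, city_name, country_name) rows; return surrogate keys."""
    keys = []
    for city_id, city_name, country_name in rows:
        key = combination_lookup(city_id)
        # Role of the extra Update step: overwrite the descriptive columns.
        dw.execute("UPDATE lk_regions SET city = ?, country = ? WHERE id = ?",
                   (city_name, country_name, key))
        keys.append(key)
    return keys

first = load([(7001, "Sample City", "Sample Country"),
              (7002, "Other City", "Sample Country")])
second = load([(7001, "Sample City", "SAMPLE COUNTRY")])
country_now = dw.execute(
    "SELECT country FROM lk_regions WHERE id_js = 7001").fetchone()[0]
row_count = dw.execute("SELECT COUNT(*) FROM lk_regions").fetchone()[0]
```

The second load finds the existing row for business key 7001, reuses its surrogate key, and overwrites the country, which is the Type I behavior: no new row and no history.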

Be aware that in the Combination lookup/update step the following options do not refer to fields in the stream, but to columns in the table: Dimension field, Technical key field, and Date of last update field. You should read Dimension column, Technical key column, and Date of last update column. Also note that the term Technical refers to the surrogate key.

Let's see how the Combination lookup/update step works with an example. Look at the following screenshot:


The record to the right of the Table input icon is a sample city among the cities that the Table input step gets from the js database.

With the Combination L/U step, PDI looks for a record in the lk_regions table in the dw database, where id_js is equal to the field city_id in the incoming stream, which is 7001. The first time you run the transformation, the dimension table is empty, so the lookup fails. This causes PDI to generate a new surrogate key according to what you put in the Technical key field area of the settings window.

You told PDI that the column that holds the surrogate key is the column named id. You also told PDI that in order to generate the key, the value should be equal to the maximum key found in the target table plus one. In this example, it generates a key equal to 7. You may also use a sequence or an auto increment field if the database engine allows it. If that is not the case, those options are disabled.

Then PDI generates the key and inserts the record you can see to the right of the Combination L/U step in the drawing. Note that the record contains only values for the key fields and the technical key field.

The Combination L/U step put the returned technical key in the output stream. Then you used that key for updating the descriptions for city and country with the use of an Update step. After that step, the record is fully generated, as shown in the record to the right of the Update icon.

As the Combination L/U only maintains the key information, if you have non-key columns in the table you must update them with an extra Update step.

Note that those columns must have a default value or must allow null values. If neither of these conditions is true, the insert operation will fail.

After converting all the country names in the source database to upper case, you run the transformation again.

This time the incoming record for the same city is this:

PDI looks for a record in the lk_regions table, in the dw database, where id_js is equal to 7001. It finds it. It is the record inserted the first time you ran the transformation, as explained above.

Then, the Combination L/U simply returns the key field, adding it to the output stream.


Then you use the key that the step added to update the descriptions for city and country. After the Update step, the old values for city and country name are overwritten by the new ones:

Have a go hero – adding regions to the Region Dimension

Modify the transformation that loads the Region dimension to fill the region column. Get the values from the regions.xls file you can find among the downloaded material for this chapter. To add the region information to your stream, use a Stream lookup step.

While you are playing with dimensions, you may want to throw away all the inserted data and start over again. For doing that, simply explore the database and use the Truncate table option. You can do the same in MySQL Query Browser. For the lk_regions dimension, you could execute any of the following:

DELETE FROM lk_regions or TRUNCATE TABLE lk_regions

Have a go hero – loading the manufacturers dimension

Create a transformation that loads the manufacturers dimension, lk_manufacturers. Here you have the table definition and some guidance for loading:

Column      Description
id          Surrogate key.
name        Name of the manufacturer.
id_js       Business key. Here you have to store the manufacturer's code
            (man_code field of the source table manufacturers).
lastupdate  Date of dimension update (system date).


Have a go hero – loading a mini-dimension

A mini-dimension is a dimension where you store the frequently analyzed or frequently changing attributes of a large dimension. Look at the products in the Jigsaw puzzles database. There are several puzzle attributes you may be interested in when you analyze the sales, for example, number of puzzles in a single pack, number of pieces of the puzzles, material of the product, and so on. Instead of creating a big dimension with all puzzle attributes, you can create a mini-dimension that stores only a selection of attributes. There would be one row in this mini-dimension for each unique combination of the selected attributes encountered in the products table, not one row per puzzle.

In this exercise, you'll have to load a mini-dimension with puzzle attributes. Here you have the definition of the table that will hold the mini-dimension data:

Column       Description
id           Surrogate key
glowsInDark  Y/N
is3D         Y/N
wooden       Y/N
isPanoramic  Y/N
nrPuzzles    Number of puzzles in a single pack
nrPieces     Number of pieces of the puzzle

Take as a starting point the following query:

SELECT DISTINCT pro_type

, pro_packaging

, pro_shape

, pro_style

FROM products

WHERE pro_type = 'PUZZLE'

Use the output stream for creating the fields you need for the dimension. For example, for the field is3D, you'll have to check the value of the pro_shape field.

Once you have all the fields you need, insert the records in the dimension table by using a Combination L/U step. In this mini-dimension, the key is made by all the fields of the table. As a consequence, you don't need an extra Update step.


Keeping a history of changes

The Region dimension is a typical Type I SCD dimension. If some description changes, as the country names did, it makes no sense to keep the old values. The new values simply overwrite the old ones. This is not always the best choice. Sometimes you would like to keep a history of the changes. Now you will learn to load a dimension that keeps a history.

Time for action – keeping a history of product changes with the Dimension lookup/update step

Let's load a puzzles dimension along with the history of the changes in puzzle attributes:

1. Create a new transformation.

2. Drag a Table input step to the work area and double-click it.

3. Select js as Connection.

4. Type the following query in the SQL area:

SELECT pro_code

, man_code

, pro_name

, pro_theme

FROM products

WHERE pro_type LIKE 'PUZZLE'

5. Click on OK.

6. Add an Add constants step, and create a hop from the Table input step toward it.

7. Use the step to add a Date field named changedate. As Format type dd/MM/yyyy, and as Value, type 01/10/2009.

8. Expand the Data Warehouse category of steps.

9. Select the Dimension lookup/update step and drag it to the canvas.

10. Create a hop from the Add constants step to this new step.

11. Double-click the Dimension lookup/update step.

12. As Connection select dw.

13. As Target table type lk_puzzles.


14. Fill the Key fields as shown:

15. Select id as Technical key field.

16. In the frame Creation of technical key, leave the default to Use table maximum + 1.

17. As Version field, select version.

18. As Stream Datefield, select changedate.

19. As Date range start field, select start_date.

20. As Table daterange end, select end_date.

21. Select the Fields tab and fill it like this:

22. Close the settings window.

23. Save the transformation, and run it.

24. Explore the js_dw database and do a preview of the lk_puzzles table.


25. You should see this:

What just happened?

You loaded the puzzle dimension with the name and theme of the puzzles you sell. The dimension table has the usual columns for a dimension: technical id (field id), fields that store the key fields in the table of the operational database (prod_code and man_code), and columns for the puzzle attributes (name and theme). It also has some extra fields specially designed to keep history.

When you ran the transformation, all records were inserted in the dimension table. Also a special record was automatically inserted for unavailable data.

So far, there is nothing new except for a few extra columns with dates. In the next tutorial, you will learn more about those columns.

Time for action – testing the transformation that keeps a history of product changes

1. In the previous tutorial you loaded a dimension with products by using a Dimension lookup/update step. You ran the transformation once, causing the insertion of one record for each product and a special record with values n/a for the descriptive fields. Let's apply some changes in the operational database, and run the transformation again to see how the Dimension lookup/update step keeps history.

2. In MySQL Query Browser, open the script update_jumbo_products.sql and

run it.

3. Switch to Spoon.

4. If the transformation created in the last tutorial is not open, open it again.


5. Run the transformation. Explore the js_dw database again. Press Open SQL for [lk_puzzles] and type the following statement:

SELECT *

FROM lk_puzzles

WHERE id_js_man = 'JUM'

ORDER BY id_js_prod

, version

6. You will see this:

What just happened?

After making some changes in the operational database, you ran the transformation for a second time. The modifications you made caused the insertion of new records recreating the history of the puzzle attributes.

Keeping an entire history of data with a Type II slowly changing dimension

Type II SCDs differ from Type I SCDs in that a Type II keeps the whole history of the data of your dimension. Typical examples of attributes for which you would like to keep a history are sales territories that change over time, categories of products that are reclassified from time to time, and promotions that you apply to products and are valid in a given range of dates.

There are no rules that dictate whether or not you keep the history in a dimension. It's the final user who decides based on his requirements.


In the puzzle dimension, you kept information about the changes for the name and theme attributes. Let's see how the history is kept for this sample dimension.

Each puzzle is to be represented by one or more records, each with the information valid during a certain period of time, as in the following example:

[Figure: timeline from 1900 to 2199 showing the two versions of the sample puzzle]

VERSION: 1
Surrogate Key: 664
Business Key: JUM, JUMB0107
Valid From: 01-01-1900
To: 01-10-2009
Fields:
Name: Cindrellas Grand Arrival
Theme: Castles

VERSION: 2 (current)
Surrogate Key: 1031
Business Key: JUM, JUMB0107
Valid From: 01-10-2009
To: 31-12-2199
Fields:
Name: Cindrellas Grand Arrival
Theme: Disney

The history is kept in three extra fields in the dimension table: version, date_from, and date_to.

The version field is an automatically incremented value that maintains a revision number of the records for a particular puzzle.

The date range is used to indicate the period of applicability of the data.

In the tutorial you also had a current field, which acted as a flag to show if a record is the record valid in the present day.

The sample puzzle, Cinderellas Grand Arrival, was classified in the category Castles until October 1, 2009. After that date, the puzzle was reclassified as a Disney puzzle. This is the second version of the puzzle, as indicated by the column version. It's also the current version, as indicated by the column current.


In general, if you have to implement a Type II SCD with PDI, your dimension table must have the first three fields: version, date from, and date to. The current flag is optional.

Loading Type II SCDs with the Dimension lookup/update step

Type II SCDs can be loaded by using the Dimension lookup/update step. The Dimension lookup/update, or Dimension L/U for short, looks in the dimension for a record that matches the information you put in the Keys grid of the settings window.

If the lookup fails, it inserts a new record. If a record is found, the step inserts or updates records depending on how you configured the step.

Let's explain how the Dimension L/U works with the following sample puzzle in the js database:

The first time you run the transformation, the step looks in the dimension for a record where id_js_prod is equal to JUMBO107 and id_js_man is equal to JUM. Not only that, the period from start_date to end_date of the found record must contain the value of the stream datefield, which is 01/10/2009.

Because you never loaded this table before, the table was empty and so the lookup failed. As a result, the step inserts the following record:

Note the values that the step put for the special fields:

The version for the new record is 1, the current flag is set to true, and the start_date and end_date take as values the dates you put in the Min.year and Max.year: 01/01/1900 and 31/12/2199.


Aer making some modicaons to the operaonal database, you ran the transformaon

again. Look at the following screenshot:

The puzzle informaon changed. As you see to the right of the Table input step, the puzzle is

now classied as a Disney puzzle.

This me the lookup succeeds. There is a record for which the keys match and the period

from start_date to end_date of the found record, 01/01/1900 to 31/12/2199,

obviously contains the value of the stream datefield, 01/10/2009.

Once found, the step compares the elds you put in the Fields tab—name and theme in the

dimension table against pro_name and pro_theme in the incoming stream.

As there is a dierence in the theme eld, the step inserts a new record, and modies the

current—it changes the validity dates and sets the current ag to false. Now this puzzle has

two versions in the dimension table, as you see below the Dimension L/U icon in the drawing.

These update and insert operaons are made for all records that changed.

For the records that didn't change, dimension records are found but as nothing changed,

nothing is inserted or updated.
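The walkthrough above can be condensed into a short Python sketch. It is a simplified model of a Type II load, not the step's actual implementation: it works on an in-memory SQLite copy of lk_puzzles, stores dates as ISO strings so they compare correctly as text, names the flag column is_current, and assumes that a validity range includes its start date and excludes its end date.

```python
import sqlite3

MIN_DATE, MAX_DATE = "1900-01-01", "2199-12-31"   # Min.year and Max.year bounds

dw = sqlite3.connect(":memory:")
dw.execute("""CREATE TABLE lk_puzzles (
    id INTEGER PRIMARY KEY, id_js_prod TEXT, id_js_man TEXT,
    name TEXT, theme TEXT, version INTEGER,
    start_date TEXT, end_date TEXT, is_current TEXT)""")

def dimension_lookup_update(pro_code, man_code, name, theme, changedate):
    """Type II load: add a new version when a tracked attribute changes."""
    found = dw.execute(
        """SELECT id, name, theme, version FROM lk_puzzles
           WHERE id_js_prod = ? AND id_js_man = ?
             AND start_date <= ? AND ? < end_date""",
        (pro_code, man_code, changedate, changedate)).fetchone()
    if found is None:
        version, start = 1, MIN_DATE              # first time this key is seen
    else:
        key, old_name, old_theme, old_version = found
        if (old_name, old_theme) == (name, theme):
            return key                            # nothing changed
        # Close the validity range of the version that was found.
        dw.execute(
            "UPDATE lk_puzzles SET end_date = ?, is_current = 'N' WHERE id = ?",
            (changedate, key))
        version, start = old_version + 1, changedate
    new_key = dw.execute(
        "SELECT COALESCE(MAX(id), 0) + 1 FROM lk_puzzles").fetchone()[0]
    dw.execute("INSERT INTO lk_puzzles VALUES (?, ?, ?, ?, ?, ?, ?, ?, 'Y')",
               (new_key, pro_code, man_code, name, theme,
                version, start, MAX_DATE))
    return new_key

args = ("JUMB0107", "JUM", "Cindrellas Grand Arrival")
k1 = dimension_lookup_update(*args, "Castles", "2009-10-01")  # first load
k2 = dimension_lookup_update(*args, "Disney", "2009-10-01")   # theme changed
k3 = dimension_lookup_update(*args, "Disney", "2009-10-01")   # no change
history = dw.execute(
    """SELECT version, theme, start_date, end_date, is_current
       FROM lk_puzzles ORDER BY version""").fetchall()
```

After the three calls the table holds two versions of the puzzle: the Castles version closed on the change date, and the Disney version open until the maximum date. The third call finds the current version, sees no difference, and writes nothing.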

Take a note about the stream date: The field you put here is key to the loading process of the dimension, as its value is interpreted by PDI as the effective date of the change. In the tutorial, you put a fixed date, 01/10/2009. In real situations you should use the effective or last changed date of the data if that date is available. If it is not available, leave the field blank. PDI will use the system date.


In this example, you filled the column Type of SCD update with the option Insert for every field. Doing so, you loaded a pure Type II SCD, that is, a dimension that keeps track of all changes in all fields.

In the sample puzzles dimension, you kept a history of changes both in the theme and in the name. For the sample puzzle, the theme was changed from Castles to Disney. If, after some time, you query the sales and notice that the sales for that puzzle increased after the change, then you may conclude that the customers are more interested in Disney puzzles than in castle puzzles. The possibility of creating these kinds of reports is a good reason for maintaining a Type II SCD.

On the other hand, if the name of the puzzle changes, you may not be so interested in knowing what the name was before. Fortunately, you may change the configuration and create a Hybrid SCD. Instead of selecting Insert for every field, you may select Update or Punch through:

When there is a change in a field for which you chose Update, the new value overwrites the old value in the last dimension record version, this being the usual behavior in Type I SCDs.

When there is a change in a field for which you chose Punch through, the new data overwrites the old value in all record versions.

Note that by selecting Punch through for all the fields, the Dimension L/U step allows you to load a Type I SCD dimension. When you build a Type I SCD you are not interested in range dates. Thus, you can leave the Stream datefield textbox empty. The current date is assumed by default.

In practice, Type I, Type II, and Hybrid SCDs are all used. The choice of the type of SCD depends on the business needs.
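To compare the three update types side by side, here is a minimal Python sketch. It only mimics the effect on the record versions of one entity; validity dates are left out for brevity, and the function name apply_change, the sample names, and the list-of-dicts layout are assumptions made for the example, not part of PDI.

```python
def apply_change(versions, field, new_value, update_type):
    """Apply a changed field to the versions of one dimension entity.

    versions: list of dicts ordered from oldest to newest version.
    update_type: "Insert", "Update" or "Punch through".
    """
    current = versions[-1]
    if current[field] == new_value:
        return versions  # nothing changed, nothing to do
    if update_type == "Insert":
        # Type II behavior: keep the old version and add a new one.
        newer = dict(current, version=current["version"] + 1)
        newer[field] = new_value
        versions.append(newer)
    elif update_type == "Update":
        # Overwrite the value only in the last version.
        current[field] = new_value
    elif update_type == "Punch through":
        # Overwrite the value in every version.
        for version in versions:
            version[field] = new_value
    else:
        raise ValueError(f"unknown update type: {update_type}")
    return versions


history = [{"version": 1, "name": "Castle", "theme": "Castles"}]
apply_change(history, "theme", "Disney", "Insert")            # two versions now
apply_change(history, "name", "Magic Castle", "Update")       # only version 2 renamed
apply_change(history, "name", "Cinderella", "Punch through")  # both versions renamed
```

A hybrid dimension simply uses a different update type per field, for example Insert for the theme and Update for the name.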

Besides all those insert and update operations, the Dimension L/U step automatically inserts in the dimension a record for unavailable data.

In order to insert the special record with key equal to zero, all fields must have default values or allow nulls. If neither of these conditions is true, the automatic insertion will fail.

In order to load a dimension with the Dimension L/U step, your table has to have columns for the version, date from, and date to. The step automatically maintains those columns. You simply have to put their names in the right textbox in the settings window.

Besides those fields, your dimension table may have a column for the current flag, and another column for the date of last insert or update. To fill those optional columns, you have to add them in the Fields tab as you did in the tutorial.



Have a go hero – keeping a history just for the theme of a product

Modify the loading of the products dimension so that it only keeps a history of the theme. If the name of the product changes, just overwrite the old values. Modify some data in the js database and run your transformation to confirm that it works as expected.

Have a go hero – loading a Type II SCD dimension

As you saw in the Hero exercise to add regions to the Region Dimension, the countries were grouped in three: Spain, Rest of Europe, Rest of the World.

As the sales rose in several countries of the world, you decided to regroup the countries in more than three groups. However, you want to do it starting in 2008. For older sales you prefer to keep seeing the sales grouped by the original categories.

This is what you will do: Use the table named lk_regions_2 to create a Type II Region dimension. Here is a guide to follow:

Create a transformation that loads the dimension. You will take the stream date (the date you use for loading the dimension) from the command line. If the command line argument is empty, use the present day.

As the name for the sheet with the region definition, use a named parameter.

Stream date

If the command line argument is present, remember to change it to Date before using it. You do that with a Select values step.

Note that you have to define the format of the entered data in advance. Suppose that you want to enter as argument the date January 1, 2008. If you chose the format dd-mm-yyyy, you'll have to enter the argument as 01-01-2008.

In case the command line argument is absent, you can get the default with a Get System Info step. Note that the system date you add with this step is already a Date field.
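The rule in this tip (convert the argument with a format agreed on in advance, or fall back to the system date) can be sketched in Python as follows. This is not what PDI runs internally; the function name stream_date is hypothetical, and the Python format string %d-%m-%Y is assumed here to stand for the day-month-year format of the example.

```python
from datetime import date, datetime


def stream_date(argument, fmt="%d-%m-%Y"):
    """Return the date to use as the stream date.

    argument: the text typed on the command line, or None/"" if absent.
    fmt: the format agreed on in advance (day-month-year here).
    """
    if not argument:
        return date.today()  # default: the present day, already a date
    # The argument arrives as text, so it has to be converted to a date.
    return datetime.strptime(argument, fmt).date()
```

Calling stream_date("01-01-2008") gives January 1, 2008, while calling it with an empty argument gives today's date.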

Now just follow these steps:

1. Run the transformation by using the regions.xls file. Don't worry about the command line argument. Check that the dimension was loaded as expected. There has to be a single record for every city.

2. Run the transformation again. This time use the regions2008.xls file as source for the region column. As command-line argument, enter January 1st, 2008. Remember to type the date in the expected format (check the preceding tip). Explore the dimension table. There have to be two records for each country: one valid before 2008 and one valid after that date.

3. Modify the sheet to create a new grouping for the American countries. Use your imagination for this task! Run the transformation for the third time. This time use the sheet you created and as date, type the present day (or leave the argument blank). Explore the dimension table. Now each city for the countries you regrouped has to have three versions, where the current is the version you created. The other cities should continue to have two versions each, because nothing related to those cities changed.

Pop quiz – loading slowly changing dimensions

Suppose you have DVDs with the French films in the catalog you've created so far. You rent those DVDs and keep the rental information in the database. Now you will design a dimensional model for that data.

1. You begin by designing a dimension to store the names of the films. How do you create the Films dimension:

a. As a Type I SCD
b. As a Type II SCD
c. You will decide when you have rented enough films so you make the right decision.

2. In order to create that dimension, you could use:

a. A Dimension L/U step
b. A Combination L/U step
c. Either of the above
d. Neither of the above

Pop quiz – loading type III slowly changing dimensions

Type III SCDs are dimensions that store the immediately preceding and current value for a descriptive field of the dimension. Each entity is stored in a single record. The field for which you want to keep the previous value has two columns assigned in the record: one for the current value and the other for the old. Sometimes, it is possible to have a third column holding the date of effective change.

Type III SCDs are appropriate when you don't want to keep all the history, but mainly when you need to support two views of the attribute simultaneously: the previous and the current. Suppose you have an Employees dimension. Among the attributes you have their position. People are promoted from time to time and you want to keep these changes in the dimension; however, you are not interested in knowing all the intermediate positions the employees have been through. In this case, you may implement a Type III SCD.

The question is, how would you load a Type III SCD with PDI:

a. With a Dimension L/U step, configuring it properly
b. By using a Database lookup step to get the previous value. Then with a Dimension L/U step or a Combination L/U step to insert or update the records.
c. You can't load Type III SCDs with PDI

It's worth saying that Type III SCDs are used rather infrequently and cannot always be automated. Sometimes they are used to represent human-applied changes and the implementation has to be made manually.
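To make the Type III record layout concrete, here is a minimal Python sketch of a single Employees record with one column for the current position, one for the previous position, and one for the date of the change. It is not PDI code and it is independent of the quiz options; the function name promote, the column names, and the employee data are invented for illustration.

```python
from datetime import date


def promote(employee, new_position, change_date):
    """Type III update: move the current value to the previous column."""
    if employee["position"] == new_position:
        return employee  # nothing changed
    employee["previous_position"] = employee["position"]
    employee["position"] = new_position
    employee["position_changed_on"] = change_date
    return employee


employee = {"name": "Ann", "position": "Analyst",
            "previous_position": None, "position_changed_on": None}
promote(employee, "Senior Analyst", date(2009, 3, 1))
promote(employee, "Manager", date(2010, 1, 15))
# Only the last two positions survive; "Analyst" is no longer stored.
```

There is still a single record for the employee, which is the main difference with the several versions kept by a Type II SCD.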

Summary

In this chapter you learned to perform some advanced operations on databases.

First, you populated the Jigsaw database in order to have data for the activities in the chapter. Then, you learned to do simple and complex searches in a database.

Then you were introduced to dimensional concepts and learned what dimensions are and how to load them with PDI. You learned about Type I, Type II, Type III SCDs and mini-dimensions. You still have to learn when and how to use those dimensions. You will do so in Chapter 12.

The steps you learned in this and the preceding chapter are far from being the full list of steps that PDI offers to work with databases. However, taking into account all you learned, you are now ready to use PDI for implementing most of your database requirements. In the next chapter, you will switch to a totally different yet core subject needed to work with PDI: jobs.

Creating Basic Task Flows

So far you have been working with data. You got data from a file, a sheet, or a database, transformed it somehow, and sent it back to some file or table in a database. You did it by using PDI transformations. A PDI transformation does not run in isolation. Usually, it is embedded in a bigger process. Here are some examples:

Download a file, clean it, load the information of the file in a database, and fill an audit file with the result of the operation.

Generate a daily report and transfer the report to a shared repository.

Update a data warehouse. If something goes wrong, notify the administrator by e-mail.

All these examples are typical processes of which a transformation is only a piece. These types of processes can be implemented by PDI Jobs. In this chapter, you will learn to build basic jobs. These are the topics that will be covered:

Introduction to jobs

Executing tasks depending upon conditions

Introducing PDI jobs

A PDI job is analogous to a process. As with processes in real life, there are basic jobs and there are jobs that do really complex tasks. Let's start by creating a job in the first group: a hello world job.



Creang Basic Task Flows

[ 298 ]

Time for action – creating a simple hello world job

In this tutorial, you will create a very simple job so that you get an idea of what jobs are about.

Although you will now learn how to create a job, for this tutorial you first have to create a transformation.

1. Open Spoon.
2. Create a new transformation.
3. Drag a Generate rows step to the canvas and double-click it.
4. Add a String value named message, with the value Hello, World!.
5. Click on OK.
6. Add a Text file output step and create a hop from the Generate rows step to this new step.
7. Double-click the step.
8. Type ${LABSOUTPUT}/chapter10/hello as filename.
9. In the Fields tab, add the only field in the stream: message.
10. Click on OK.
11. Inside the folder where you save your work, create a folder named transformations.
12. Save the transformation with the name hello_world_file.ktr in the folder you just created. The following is your final transformation:

Now you are ready to create the main job.

13. Select File | New | Job or press Ctrl+Alt+N. A new job is created.

14. Press Ctrl+J. The Job properties window appears.
15. Give a name and description to the job.

Creang Basic Task Flows

[ 300 ]

16. Save the job in the folder where you created the transformations folder, with the name hello_world.kjb.
17. To the left of the screen, there is a tree with job entries. Expand the General category of job entries, select the START entry, and drag it to the work area.
18. Expand the File management category, select the Create a folder entry, and drag it to the canvas.
19. Select both entries. With the mouse cursor over the second entry, right-click and select New hop. A new hop is created.

Just like in a transformation, you have several ways to create hops. For more detail, please refer to the Time for action – creating a hello world transformation section in Chapter 1 where hops were introduced or to Appendix D, Spoon Shortcuts.

20. Double-click the Create a folder icon.
21. In the textbox next to the Folder name option, type ${LABSOUTPUT}/chapter10 and click on OK. From the General category, drag a transformation job entry to the canvas.
22. Create a hop from the Create a folder entry to the transformation entry.
23. Double-click the transformation job entry.
24. Position the cursor in the Transformation filename textbox, press Ctrl+Space, and select ${Internal.Job.Filename.Directory}.

This variable is the counterpart to the variable ${Internal.Transformation.Filename.Directory} you already know. ${Internal.Job.Filename.Directory} evaluates to the directory where the job resides.

Creang Basic Task Flows

[ 302 ]

25. Click on the icon to the right of the textbox. The following dialog window shows up:
26. As you can see, the ${Internal.Job.Filename.Directory} variable provides a convenient starting place for looking up the transformation file. Select the hello_world_file.ktr transformation and click OK.
27. Now the Transformation filename has the full path to the transformation. Replace the directory part of that path with ${Internal.Job.Filename.Directory} again, so that the final text for the Transformation filename field is as shown in the following screenshot:
28. Click on OK.
29. Press Ctrl+S to save the job.

30. Press F9 to run the job. The following window shows up:

Remember that in the initial chapters, you defined the LABSOUTPUT variable in the kettle.properties file. You should see its value in the Variables grid. If you removed the variable from that file, provide a value here.

31. Click on Launch.
32. At the bottom of the screen, you'll see the Execution results. The Job metrics screen looks as follows:

Creang Basic Task Flows

[ 304 ]

33. Select the Logging tab. It looks like this:

34. Explore the folder pointed to by your ${LABSOUTPUT} variable, for example, c:/pdi_files/output. You should see a new folder named chapter10.
35. Inside the chapter10 folder, you should see a file named hello.txt.
36. Explore the file. It should have the following content:

Message

Hello, World!

What just happened?

First of all, you created a transformation that generated a simple file with the message Hello, World!. The file was configured to be created in a folder named chapter10.

After that, you created a PDI Job. The job was built to create a folder named chapter10 and then to execute the hello_world_file transformation.

When you ran the job, the chapter10 folder was created, and inside it, a file with the Hello, World! message was generated.

Executing processes with PDI jobs

A Job is a PDI entity designed for the execution of processes. In the tutorial, you ran a simple process that created a folder and then generated a file in that folder. A more complex example could be the one that truncates all the tables in a database and loads data in all the tables from a set of text files. Other examples involve sending e-mails, transferring files, and executing shell scripts.

The unit of execution inside a job is called a job entry. In Spoon you can see the entries grouped into categories according to the purpose of the entries. In the tutorial, you used job entries from two of those categories: General and File management.

Most of the job entries in the File management category have a self-explanatory name such as Create a folder, and their use is quite intuitive. Feel free to experiment with them!

As to the General category, it contains many of the most used entries. Among them is the START job entry that you used. A job must start with a START job entry.

Don't forget to start your sequence of job entries with a START. A job can have any mix of job entries and hops, as long as they start with this special kind of job entry.

A Hop is a graphical representation that links two job entries. The direction of the hop defines the order of execution of the job entries it links. Besides, the execution of the destination job entry does not begin until the job entry that precedes it has finished. Look, for example, at the job in the tutorial. There is an entry that creates a folder, followed by an entry that executes a transformation. First of all, the job creates the folder. Once the folder has been created, the execution of the transformation begins. This allows the transformation to assume that the folder exists. So, it safely creates a file in that folder.

A hop connects only two job entries. However, a job entry may be reached by more than one hop. Also, more than one hop may leave a job entry.
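The idea that an entry does not begin until the preceding one has finished can be pictured with a small Python sketch. It is a deliberate simplification and not the way the Kettle engine is written: it follows a single chain of hops, with one hop leaving each entry, and the function name run_job and the messages are invented for the example.

```python
def run_job(entries, hops, start="START"):
    """Run job entries one after another, following a single chain of hops.

    entries: dict mapping an entry name to a function that does the work.
    hops: dict mapping an entry name to the entry its hop leads to.
    Returns the names of the entries in the order they were executed.
    """
    executed = []
    name = start
    while name is not None:
        entries[name]()        # this call returns before the next entry begins
        executed.append(name)
        name = hops.get(name)  # follow the hop, if there is one
    return executed


log = []
entries = {
    "START": lambda: log.append("job started"),
    "Create a folder": lambda: log.append("folder created"),
    "Transformation": lambda: log.append("file written in folder"),
}
hops = {"START": "Create a folder", "Create a folder": "Transformation"}
order = run_job(entries, hops)
```

Because the folder entry has completely finished before the transformation entry is called, the last entry can rely on the folder being there.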

A job, like a transformation, is neither a program nor an executable file. It is simply plain XML. The job contains metadata that tells the Kettle engine which processes to run and the order of execution of those processes. Therefore, it is said that a job is flow-control oriented.

Creang Basic Task Flows

[ 306 ]

Using Spoon to design and run jobs

As you just saw, with Spoon you not only create, preview, and run transformations, but you also create and run jobs.

You are already familiar with this graphical tool, so you don't need too much explanation about the basic work areas. So, let's do a brief review.

The following table describes the main differences you will notice while designing a job compared to designing a transformation:

Area | Description
Design tree | You don't see a list of steps but a list of job entries (even though the word Steps still appears at the top of the list).
Job menu | You no longer see some options that only make sense while working with datasets. One of them is the Preview button.
Job metrics tab (Execution results window) | Instead of a Step Metrics tab, you have this tab. Here you can see metrics for each job entry.

If you click the View icon in the upper-left corner of the screen, the tree will change to show the structure of the job currently being edited.

Using the transformation job entry

The transformation job entry allows you to call a transformation from a job.

There are several situations where you may need to use a transformation job entry.

In the tutorial, you had a transformation that generated a file in a given folder. You called the transformation from a job that created that folder in advance. In this case, the job and the transformation performed complementary tasks.

Sometimes the job just keeps your work organized. Consider the transformations that loaded the dimension tables for the js database. As you will usually run them together, you can embed them into a single job as shown in this figure:

Creang Basic Task Flows

[ 308 ]

The only task done by this job is to keep the transformations together. Although the picture implies the entries are run simultaneously, that is not the case.

Job entries typically execute sequentially, this being one of the central differences between jobs and transformations.

When you link two entries with a hop, you force an order of execution. On the contrary, when you create a job as shown in the preceding figure, you needn't give an order and the entries still run in sequence, one entry after another depending on the creation sequence.

Launching job entries in parallel

As the transformations that load dimensions are not dependent on each other, as an option, you can ask the START entry to launch them simultaneously. To do that, right-click the START entry and select Launch next entries in parallel. Once selected, the arrows to the next job entries will be shown in dashed lines. This option is available in any entry, not just in the START entry.
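The difference between the default sequential behavior and launching the next entries in parallel can be illustrated outside PDI with a short Python sketch. It only mirrors the idea with threads; the function load_dimension is a stand-in invented for the example and does not load anything.

```python
from concurrent.futures import ThreadPoolExecutor


def load_dimension(name):
    """Stand-in for a transformation that loads one dimension table."""
    return f"{name} loaded"


dimensions = ["manufacturers", "products", "regions"]

# Default behavior: one entry after another.
sequential = [load_dimension(name) for name in dimensions]

# Parallel launch: start all the loads at once and wait for all of them.
with ThreadPoolExecutor(max_workers=len(dimensions)) as pool:
    parallel = list(pool.map(load_dimension, dimensions))
```

Both approaches produce the same results here precisely because the three loads do not depend on each other, which is the condition mentioned above for launching entries in parallel.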

The jobs explained earlier are just two examples of how and when you use a transformation job entry. Note that many transformations perform their tasks by themselves. In that case you are not forced to embed them into jobs. It makes no sense to have a job with just a START entry, followed by a transformation job entry. You can still execute those transformations alone, as you used to do until now.

Pop quiz – defining PDI jobs

1. A job is:

a. A big transformation that groups smaller transformations
b. An ordered group of task definitions
c. An unordered group of task definitions

2. For each of the following sentences select True or False. A job allows you to:

a. Send e-mails
b. Compare folders
c. Run transformations
d. Truncate database tables
e. Transfer files with FTP

Have a go hero – loading the dimension tables

Create a job that loads the main dimension tables in the Jigsaw database—manufacturers,

products, and regions. Test the job.

Receiving arguments and parameters in a job

Jobs, as well as transformations, are more flexible when receiving parameters from outside. You already learned to parameterize your transformations by using named parameters and command-line arguments. Let's extend these concepts to jobs.

Time for action – customizing the hello world file with arguments and parameters

Let's create a more flexible version of the job you did in the previous section.

1. Create a new transformation.
2. Press Ctrl+T to bring up the Transformation properties window.
3. Select the Parameters tab.
4. Add a named parameter HELLOFOLDER. Insert chapter10 as the default value.
5. Click on OK.
6. Drag a Get System Info step to the canvas.
7. Double-click the step.
8. Add a field named yourname. Select command line argument 1 as the Type.
9. Click on OK.
10. Now add a Formula step located in the Scripting category of steps.
11. Use the step to add a String field named message. As Formula, type "Hello, " & [yourname] & "!".
12. Finally, add a Text file output step.
13. Use the step to send the message data to a file. Enter ${LABSOUTPUT}/${HELLOFOLDER}/hello as the name of the file.
14. Save the transformation in the transformations folder you created in the previous tutorial, under the name hello_world_param.ktr.

Creang Basic Task Flows

[ 310 ]

15. Open the hello_world.kjb job you created in the previous tutorial and save it under a new job named hello_world_param.kjb.
16. Press Ctrl+J to open the Job properties window.
17. Select the Parameters tab.
18. Add the same named parameter you added in the transformation.
19. Click on OK.
20. Double-click the Create a folder entry.
21. Change the Folder name textbox content to ${LABSOUTPUT}/${HELLOFOLDER}.
22. Double-click the Transformation entry.
23. Change the transformation filename textbox to point to the new transformation: ${Internal.Job.Filename.Directory}/transformations/hello_world_param.ktr.
24. Click on OK.
25. Save the job and run it.
26. Fill the dialog window with a value for the named parameter and a value for the command-line argument.

27. Click on Launch.

28. When the execution finishes, check the output folder. The folder named my_folder, which you initially specified as a named parameter, should be created.
29. Inside that folder there should be a file named hello.txt. This time the content of the file has been customized with the name you provided:

Hello, pdi student!

What just happened?

You created a transformation that generated a hello.txt file in a folder given as the named parameter. The content of the file is a customized "Hello" message that gets the name of the reader from the command line.

In the main job you also defined a named parameter, the same that you defined in the transformation. The job needs the parameter to create the folder.

When you ran the job, you provided both the command-line argument and the named parameter in the job dialog window that shows up when you launch the execution. Then a folder was created with the name you gave, and a file was generated containing the name you typed as argument.

Creang Basic Task Flows

[ 312 ]

Using named parameters in jobs

You can use named parameters in jobs in the same way you do in transformations. You define them in the Job properties window. You provide names and default values, and then you use them just as regular variables. The places where you can use variables, just as in a transformation, are identified with a dollar sign to the right of the textboxes. In the tutorial, you used a named parameter in the Create a folder job entry. In this particular example, you used the same named parameter both in the main job and in the transformation called by the job. So, you defined the named parameter HELLOFOLDER in two places: in the Job settings window and in the Transformation properties window.

If a named parameter is used only in the transformation, you don't need to define it in the job that calls the transformation.
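The way a text such as ${LABSOUTPUT}/${HELLOFOLDER} turns into a real path can be sketched in Python. This is an illustration of the substitution idea only, not Kettle's implementation: the function name resolve is hypothetical, and for simplicity the sketch keeps the LABSOUTPUT variable and the HELLOFOLDER named parameter in the same dictionary, although in the tutorial they are defined in different places.

```python
import re

VARIABLE = re.compile(r"\$\{([^}]+)\}")  # matches ${NAME}


def resolve(text, defaults, supplied=None):
    """Replace each ${NAME} in text with its value.

    defaults: names with their default values.
    supplied: values given at launch time; they win over the defaults.
    In this sketch, unknown names are left untouched.
    """
    values = {**defaults, **(supplied or {})}
    return VARIABLE.sub(lambda m: values.get(m.group(1), m.group(0)), text)


defaults = {"LABSOUTPUT": "c:/pdi_files/output", "HELLOFOLDER": "chapter10"}
folder = "${LABSOUTPUT}/${HELLOFOLDER}"

default_path = resolve(folder, defaults)
custom_path = resolve(folder, defaults, {"HELLOFOLDER": "my_folder"})
```

With the default value, the folder resolves to c:/pdi_files/output/chapter10; with the value typed at launch time, it resolves to c:/pdi_files/output/my_folder.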

Have a go hero – backing up your work

Suppose you want to back up your output files regularly, that is, the files in your ${LABSOUTPUT} directory. Build a job that creates a ZIP file with all your output files. For the name and location of the ZIP file, use two named parameters.

Use the Zip file job entry located in the File management category.

Running jobs from a terminal window

In the main tutorial of this section, both the job and the transformation called by the job used a named parameter. The transformation also required a command-line argument. When you executed the job from Spoon, you provided both the parameter and the argument in the job dialog window. You will now learn to launch the job and provide that information from a terminal window.

Time for action – executing the hello world job from a terminal window

In order to run the job from a terminal window, follow these instructions:

1. Open a terminal window.
2. Go to the directory where Kettle is installed.

On Windows systems type:

C:\pdi-ce>kitchen /file:c:/pdi_labs/hello_world_param.kjb Maria -param:"HELLOFOLDER=my_work" /norep

On Unix, Linux, and other Unix-like systems type:

/home/yourself/pdi-ce/kitchen.sh /file:/home/yourself/pdi_labs/hello_world_param.kjb Maria -param:"HELLOFOLDER=my_work" /norep

3. If your job is in another folder, modify the command accordingly. You may also replace the name Maria with your name, of course. If your name has spaces, enclose the whole argument within "".
4. You will see how the job runs, following the log in the terminal:
5. Go to the output folder, that is, the folder pointed to by your LABSOUTPUT variable.
6. A folder named my_work should have been created.
7. Check the content of the folder. A file named hello.txt should be there. Edit the file. You should see the following:

Hello, Maria!





Creang Basic Task Flows

[ 314 ]

What just happened?

You ran the job with Kitchen, the program that executes jobs from the terminal window. After the name of the command, kitchen.bat or kitchen.sh, depending on the platform, you provided the following:

The full path to the job file: /file:c:/pdi_labs/hello_world_param.kjb
A command-line argument: Maria
A named parameter: -param:"HELLOFOLDER=my_work"
The switch /norep to tell Kettle not to connect to a repository

After running the job, you could see that the folder had been created and a file with a custom "Hello" message had been generated.

Here you used some of the options available when you run Kitchen. Appendix B tells you all the details about using Kitchen for running jobs.
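If you ever need to launch Kitchen from a script, the pieces listed above can be assembled programmatically. The following Python sketch only builds the list of command-line elements and does not run anything. It uses only the switches shown in this tutorial (/file, -param, and /norep); the function name kitchen_command is invented for the example, and the paths are the sample paths used earlier.

```python
def kitchen_command(kitchen, job_file, arguments=(), parameters=None):
    """Build the list of elements for a Kitchen call like the one in the tutorial."""
    command = [kitchen, f"/file:{job_file}"]
    command.extend(arguments)                       # command-line arguments, in order
    for name, value in (parameters or {}).items():  # named parameters
        command.append(f"-param:{name}={value}")
    command.append("/norep")                        # do not connect to a repository
    return command


command = kitchen_command(
    "/home/yourself/pdi-ce/kitchen.sh",
    "/home/yourself/pdi_labs/hello_world_param.kjb",
    arguments=["Maria"],
    parameters={"HELLOFOLDER": "my_work"},
)
# Passing this list to subprocess.run(command) would launch the job;
# the sketch stops at building it.
```

Keeping each element as a separate list item avoids the quoting needed when you type the command by hand, for instance around a name that contains spaces.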

Have a go hero – experiencing Kitchen

Run the hello_world_param.kjb job from Kitchen, with and without providing

arguments and parameters. See what happens in each case.

Using named parameters and command-line arguments

in transformations

As you know, transformations accept both arguments from the command line and named parameters. When you run a transformation from Spoon, you supply the values for arguments and named parameters in the transformation dialog window that shows up when you launch the execution. From a terminal window, you provide those values in the Pan command line.

In this chapter you learned to run a transformation embedded in a job. Here, the methods you have for supplying named parameters and arguments needed by the transformation are quite similar. From Spoon you supply the values in the job dialog window that shows up when you launch the job execution. From the terminal window you provide the values in the Kitchen command line.

Whether you run a job from Spoon or from Kitchen, the named parameters and arguments you provide are unique and shared by the main job and all transformations called by that job. Each transformation, as well as the main job, may or may not use them according to their needs.



There is still another way in which you can pass parameters and arguments to a transformation. Let's see it by example.

Time for action – calling the hello world transformation with fixed arguments and parameters

This time you will call the parameterized transformation from a new job.

1. Open the hello_world.kjb job you created in the first section and save it as hello_world_fixedvalues.kjb.
2. Double-click the Create a folder job entry.
3. Replace the chapter10 string by the string fixedfolder.
4. Double-click the transformation job entry.
5. Change the Transformation filename to ${Internal.Job.Filename.Directory}/transformations/hello_world_param.ktr.
6. Fill the Argument tab as follows.
7. Click the Parameters tab and fill it as follows:
8. Click on OK.
9. Save the job.

Creang Basic Task Flows

[ 316 ]

10. Open a terminal window and go to the directory where Kettle is installed.

On Windows systems type:

C:\pdi-ce>kitchen /file:c:/pdi_labs/hello_world_fixedvalues.kjb /norep

On Unix, Linux, and other Unix-like systems type:

/home/yourself/pdi-ce/kitchen.sh /file:/home/yourself/pdi_labs/hello_world_fixedvalues.kjb /norep

11. When the execution finishes, check the output folder. A folder named fixedfolder has been created.
12. In that folder, you can see a hello.txt with the following content:

Hello, reader!

What just happened?

You reused the transformation that expects an argument and a named parameter from the command line. This time you created a job that called the transformation and set both the parameter and the argument in the transformation job entry setting window.

Then you ran the job from a terminal window, without typing any arguments or parameters. It didn't make any difference for the transformation. Whether you provide parameters and arguments from the command line or you set constant values in a transformation job entry, the transformation does its job: creating a file with a custom message in the folder with the name given by the ${HELLOFOLDER} parameter.

Instead of running from the terminal window, you could have run the job by pressing F9 and then clicking Launch, without typing anything in either the parameter or the argument grid. The final result should be exactly the same.

Have a go hero – saying hello again and again

Modify the hello_world_param.kjb job so that it generates three files in the default ${HELLOFOLDER}, each saying "hello" to a different person.

After the creation of the folder, use three transformation job entries. Provide different arguments for each.

Run the job to see that it works as expected.



Have a go hero – loading the time dimension from a job

In Chapter 6, you built a transformation that created the data for a time dimension. Then in Chapter 8, you finished the transformation loading the data into a time dimension table. The transformation had several named parameters, one of them being START_DATE. Create a job that loads a time dimension with dates starting at 01/01/2000. In technical jargon, create a job that calls your transformation and passes it a value for the START_DATE parameter.

Deciding between the use of a command-line argument and a named parameter

Both command-line arguments and named parameters are means for creating more flexible jobs and transformations. The following table summarizes the differences and the reasons for using one or the other. In the first column, the word argument refers to the external value you will use in your job or transformation. That argument could be implemented as a named parameter or as a command-line argument.

Situation | Solution using named parameters | Solution using arguments
It is desirable to have a default for the argument | Named parameters are perfect in this case. You provide default values at the time you define them. | Before using the command-line argument, you have to evaluate if it was provided in the command line. If not, you have to set the default value at that moment.
The argument is mandatory | You don't have means to determine if the user provided a value for the named parameter. | To know if the user provided a value for the command-line argument, you just get the command-line argument and compare it to a null value.
You need several arguments but it is probable that not all of them are present. | If you don't have a value for a named parameter, you are not forced to enter it when you run the job or transformation. | Let's suppose that you expect three command-line arguments. If you have a value only for the third, you still have to provide empty values for the first and the second.
You need several arguments and it is highly probable that all of them are present. | The command line would be too long. It will help explain clearly the purpose of each parameter, but typing the command line would be tedious. | The command line is simple as you just list the values one after the other. However, there is a risk: you may unintentionally enter the values unordered, which could lead to unexpected results.
You want to use the argument in several places | You can do it, but you must ensure that the value will not be overwritten in the middle of the execution. | You can get the command-line argument by using a Get System Info step as many times as you need.
You need to use the value in a place where a variable is needed | Named parameters are ready to be used as Kettle variables. | First, you need to set a variable with the command-line argument value. Usually this requires creating additional transformations to be run before any other job or transformation.

Depending on your particular situation, you would prefer one or the other solution. Note that you can mix both as you did in the previous tutorials.
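The trade-offs in the table can be illustrated outside PDI. The following Python sketch is only an analogy and is not PDI code: resolve_named and resolve_positional are hypothetical helpers, and the default value my_work is made up for the example. It shows why defaults come for free with named parameters, while positional arguments need empty placeholders and an explicit null check.

```python
# Analogy only: these helpers are hypothetical and are not part of PDI.

def resolve_named(defaults, supplied):
    """Named parameters: start from the declared defaults, then override by name."""
    values = dict(defaults)
    for name, value in supplied.items():
        if name not in defaults:
            raise KeyError(f"unknown named parameter: {name}")
        values[name] = value
    return values


def resolve_positional(names, args, defaults=None):
    """Command-line arguments: values are matched purely by position.

    A missing or empty argument is treated as null (None); the caller has to
    apply a default, or reject the run, explicitly.
    """
    defaults = defaults or {}
    values = {}
    for index, name in enumerate(names):
        raw = args[index] if index < len(args) else None
        if raw is None or raw == "":
            raw = defaults.get(name)  # explicit fallback, may still be None
        values[name] = raw
    return values


# Named parameter with a default: nothing needs to be passed.
print(resolve_named({"HELLOFOLDER": "my_work"}, {}))

# Only the third positional argument is known, so the first two must be
# supplied as empty placeholders to keep the positions aligned.
print(resolve_positional(["first", "second", "third"], ["", "", "Maria"]))
```

Swapping two positional values would go unnoticed by resolve_positional, which is the risk the last column of the table warns about.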

Have a go hero – analysing the use of arguments and named parameters

In the Time for action – customizing the hello world file with fixed arguments and parameters section, you created a transformation that used an argument and a named parameter. Based on the preceding table, try to understand why the folder was defined as a named parameter and the name of the person you want to say Hello to was defined as a command-line argument. Would you have applied the same approach?

Running job entries under conditions

A job may contain any number of entries. Not all of them always execute. Some of them execute depending on the result of previous entries in the flow. Let's see it in practice.

Time for action – sending a sales report and warning the administrator if something is wrong

Now you will build a sales report and send it by e-mail. In order to follow the tutorial, you will need two simple prerequisites:

As the report will be based on the Jigsaw database you created in Chapter 8, you will need the MySQL server running.

In order to send e-mails, you will need at least one valid Gmail account. Sign up for an account. Alternatively, if you are familiar with your own SMTP configuration, you could use it instead.




Once you've checked these prerequisites, you are ready to start.

1. Create a new transformation.

2. Add a Get System Info step. Use it to add a field named today. As Type, select Today 00:00:00.

3. Now add a Table input step.

4. Double-click the step.

5. As Connection, select js, the name of the connection to the jigsaw puzzles database.

Note that if the connection is not shared, you will have to define it.

6. In the SQL frame, type the following statement:

SELECT pay_code
, COUNT(*) quantity
, SUM(inv_price) amount
FROM invoices
WHERE inv_date = ?
GROUP BY pay_code

7. In the drop-down list to the right of Insert data from step, select the name of the Get System Info step.

8. Finally, add an Excel Output step.

9. Double-click the step.

10. Enter ${LABSOUTPUT}/sales_ as Filename.

11. Check the Specify Date time format option. In the Date time format drop-down list, select yyyyMMdd.


12. Make sure you don't uncheck the Add filenames to result option. Click on OK. Fill the Fields tab as here:

13. Save the transformation under the transformations folder you created in a previous tutorial, with the name sales_report.ktr.

14. Create a new job by pressing Ctrl+Alt+N.

15. Add a START job entry.

16. After the START entry, add a Transformation entry.

17. Double-click the Transformation entry.

18. Enter ${Internal.Job.Filename.Directory}/transformations/sales_report.ktr as the transformation filename, either by hand or by browsing the folder and selecting the file.

19. Click on OK.

20. Expand the Mail category of entries and drag a Mail entry to the canvas.

21. Create a hop from the transformation entry to the Mail entry.

22. Double-click the Mail entry.

23. Fill the main tab Addresses with the destination and the sender e-mail addresses, that is, provide values for the Destination address, Sender name, and Sender address textboxes. If you have two accounts to play with, put one of them as destination and the other as sender. If not, use the same e-mail twice.

24. Select the Server tab and fill the SMTP Server frame as follows: enter smtp.gmail.com as SMTP Server and 465 as Port.


25. Fill the Authentication frame. Check the Use authentication? checkbox. Fill the Authentication user and Authentication password textboxes. For example, if your account is pdi_account@gmail.com, then as user enter pdi_account and as password provide your e-mail password.

26. Check the Use secure authentication? option. In Secure connection type, leave the default, SSL. Select the Email Message tab. In the Message Settings frame, check the Only send comment in mail body? option.

27. Fill the Message frame, providing a subject and a comment for the e-mail: enter Sales report as Subject and Please check the attachment as Comment. Select the Attached Files tab and check the Attach file(s) to message? option.

28. In the Select file type list, select the type General.

29. Click OK.

30. Drag another Mail job entry to the canvas.

31. Create a hop from the transformation entry to this new entry. This hop will appear in red.

32. Double-click the new entry.

33. Fill the Destination and Sender frames with destination and sender e-mail addresses. If you have another account to use as destination, use it here. Select the Server tab and fill it exactly as you did in the other Mail entry.

34. Select the Email Message tab. In the Subject textbox, type Error generating sales report.

35. Click on OK.

36. Save the job and run it.

37. Once the job has finished, log into your account. You should have received a mail!


38. Open the e-mail. This is what you should see:

39. Click on the Open as a Google spreadsheet option. You will see the following:


40. Simulate being an intruder and do something that makes your transformation fail. You could, for example, stop MySQL or add some strange characters in the SQL statement.

41. Run the job again.

42. Check the administrator e-mail, that is, the mail you put as destination in the second Mail job entry.

43. The following is the e-mail you received this time:

What just happened?

You generated an Excel file with a crosstab report of sales on a particular day. If the file is generated successfully, an e-mail is sent with the Excel file attached. If some error occurs, an e-mail reporting the problem is sent to the administrator.
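The question mark in the SQL statement of step 6 is a placeholder that receives the today field coming from the Get System Info step. The same idea can be tried outside PDI. The sketch below is an illustration under stated assumptions: an in-memory SQLite table stands in for the MySQL invoices table, the payment codes CASH and CARD and all amounts are made up, and an ORDER BY clause was added only to make the printed order predictable.

```python
import sqlite3
from datetime import date

# Assumption: a tiny in-memory SQLite table stands in for the MySQL invoices
# table; only the three columns used by the query are modeled.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (inv_date TEXT, pay_code TEXT, inv_price REAL)")

today = date.today().isoformat()  # plays the role of the "today" field
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?)",
    [
        (today, "CASH", 10.0),
        (today, "CASH", 5.5),
        (today, "CARD", 20.0),
        ("1999-12-31", "CARD", 99.0),  # older invoice, must be filtered out
    ],
)

QUERY = """
SELECT pay_code
     , COUNT(*) quantity
     , SUM(inv_price) amount
  FROM invoices
 WHERE inv_date = ?
 GROUP BY pay_code
 ORDER BY pay_code
"""


def sales_for(day):
    """Run the report query, binding `day` to the ? placeholder."""
    return conn.execute(QUERY, (day,)).fetchall()


for row in sales_for(today):
    print(row)
```

Binding the date as a parameter, rather than pasting it into the SQL text, is what lets the same statement run unchanged every day.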


If you skipped Chapter 8 and still know nothing about databases with PDI, don't miss this exercise. Instead of the proposed sales report, create a transformation that generates any Excel file. The contents of the sheet are not the key here. Just make sure you leave the Add filenames to result option checked in the Excel output configuration window. Then proceed as explained.

In this example you used Gmail accounts for sending e-mails from a PDI job. You can use any mail server as long as you have access to the information required in the Server tab.
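For readers who want to see what the Mail entry settings amount to, here is a rough equivalent using Python's standard library. It is a sketch, not what PDI runs internally: build_report_message and send are hypothetical names, the host and port simply repeat the values typed in the Server tab, and whether a given mail provider accepts them depends on that provider's current policy.

```python
import smtplib
from email.message import EmailMessage
from pathlib import Path


def build_report_message(sender, recipient, attachments):
    """Build the message the first Mail entry sends: subject, comment, files."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Sales report"
    msg.set_content("Please check the attachment")
    for path in attachments:
        path = Path(path)
        msg.add_attachment(
            path.read_bytes(),
            maintype="application",
            subtype="octet-stream",
            filename=path.name,
        )
    return msg


def send(msg, user, password, host="smtp.gmail.com", port=465):
    """Send over SSL with authentication, mirroring the Server tab settings."""
    with smtplib.SMTP_SSL(host, port) as server:
        server.login(user, password)
        server.send_message(msg)
```

Building the message and sending it are kept apart so that the message can be inspected without contacting any server.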

Changing the flow of execution on the basis of conditions

The execution of any job entry either succeeds or fails.

In particular, the job entries under the category Conditions just evaluate something, and success or failure depends upon the result of the evaluation.

For example, the job entry File Exists succeeds if the file you put in its window exists. Otherwise, it fails.

Whichever the job entry, you can use the result of its execution to decide which of the entries following it execute and which don't.

In the tutorial, you included a transformation job entry. If the transformation runs without problem, this entry succeeds. Then the execution follows the green hop to the first Mail job entry.

If, while running the transformation, some error occurs, the transformation entry fails. Then the execution follows the red path toward the e-mail to the administrator.

So, when you create a job, you not only arrange the entries and hops according to the expected order of execution, you also specify under which condition each job entry runs.


You can define the conditions in the hops. The following table lists the possibilities:

Color of the hop | What the color represents | The interpretation

Black | Unconditional execution | The destination entry executes no matter the result of the previous entry.

Green | Execution upon success | The destination entry executes only if the previous job entry is successful.

Red | Execution upon failure | The destination entry executes only if the previous job entry failed.

At any hop, you can define the condition under which the destination job entry will execute. By default, the first hop that leaves an entry is created green, whereas the second hop is created red. You can change the color, that is, the behavior of the hop. Just right-click on the hop, select Evaluation, and then the condition.

One exception is the hop or hops that leave the START entry. You cannot edit them. The destination job entries execute unconditionally, that is, always.

Another exception is the special entry Dummy that does nothing, not even allowing you to decide if the job entries after it run or not. They always run.
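The three hop colors boil down to a small decision rule. The Python sketch below models it for the sales report job; every name in it (should_follow, run_job, the entry names) is hypothetical and chosen for illustration, and the model deliberately ignores loops and parallel execution.

```python
# Hypothetical model of job hops; the names below are illustrative, not PDI APIs.
UNCONDITIONAL, ON_SUCCESS, ON_FAILURE = "black", "green", "red"


def should_follow(hop_color, previous_succeeded):
    """Decide whether the destination entry of a hop executes."""
    if hop_color == UNCONDITIONAL:
        return True
    if hop_color == ON_SUCCESS:
        return previous_succeeded
    if hop_color == ON_FAILURE:
        return not previous_succeeded
    raise ValueError(f"unknown hop color: {hop_color}")


def run_job(entries, hops, start="START"):
    """Run entries reachable from `start`, following hops whose condition holds.

    entries: name -> callable returning True (success) or False (failure)
    hops:    list of (source, color, destination)
    Returns the names of the executed entries, in execution order.
    """
    executed = []
    pending = [start]
    while pending:
        name = pending.pop(0)
        succeeded = entries[name]()
        executed.append(name)
        for source, color, destination in hops:
            if source == name and should_follow(color, succeeded):
                pending.append(destination)
    return executed


hops = [
    ("START", UNCONDITIONAL, "sales_report"),
    ("sales_report", ON_SUCCESS, "mail_report"),
    ("sales_report", ON_FAILURE, "mail_admin"),
]


def make_entries(transformation_ok):
    """Build the job entries; only the transformation result varies."""
    return {
        "START": lambda: True,
        "sales_report": lambda: transformation_ok,
        "mail_report": lambda: True,
        "mail_admin": lambda: True,
    }


print(run_job(make_entries(True), hops))   # follows the green hop
print(run_job(make_entries(False), hops))  # follows the red hop
```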

Have a go hero – refining the sales report

Here we will modify the job that sends the e-mail containing the sales report.

1. Modify the transformation so that the file is generated in the temporary folder ${java.io.tmpdir}. If there is no sale for today, don't generate the file. You do this by checking the Do not create file at start option in the Excel output step.

2. Send the e-mail only if there were sales, that is, only if the file exists.

3. After sending the e-mail with the report attached, delete the file.


Use these new job entries: File Exists from the Conditions category and Delete file from the File management category.

Creating and using a file results list

In the tutorial you configured two Mail job entries. In the mail that follows the green hop, you attached the Excel file generated by the transformation. However, you didn't explicitly specify the name of the file to attach. How could PDI realize that you wanted to attach that file? It could because of the Add filenames to result checkbox in the Excel output configuration window. By checking that option, you added the name of the Excel file to a special list named File result.

When PDI hits an e-mail entry where Attach file(s) to message? is checked, it attaches to the e-mail all files in the File result list.

Most of the transformation steps that read or write files have this checkbox, and it is checked by default. The following sample belongs to a Text file input step:

Each time you use one of these steps you are adding names of files to this list, unless you uncheck the checkbox.


There are also several job entries in the File management and the File transfer categories that add one or more files to the File result list. Consider the following Copy Files… entry screen:

As with the Mail entry, there are some other entries that use the File result list. One example is Copy or Move result filenames. This entry copies or moves the files whose names are in this special list named File result.
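The File result list behaves like a shared list that producers append to and consumers read. The following Python sketch models that idea only; the FileResult class, the mail_attachments function, and the file names are hypothetical and do not correspond to PDI classes.

```python
# Hypothetical model of the File result list; not PDI code.

class FileResult:
    """Collects filenames as steps and job entries read or write files."""

    def __init__(self):
        self.filenames = []

    def add(self, filename, add_to_result=True):
        # Mirrors the "Add filenames to result" checkbox: when unchecked,
        # the file is still processed, but its name is not recorded.
        if add_to_result:
            self.filenames.append(filename)


def mail_attachments(result, attach_files):
    """Mirrors "Attach file(s) to message?": attach everything in the list."""
    return list(result.filenames) if attach_files else []


result = FileResult()
result.add("sales_20100401.xls")                   # Excel output, checkbox checked
result.add("debug_dump.txt", add_to_result=False)  # another step, checkbox unchecked

print(mail_attachments(result, attach_files=True))
```

Because every checked step contributes to the same list, a Mail entry may attach more files than intended; unchecking the option on steps whose files should stay private keeps the list clean.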

Have a go hero – sharing your work

Suppose you want to share your PDI work with a friend. Send him or her some of your ktr files by mail.

Use the Add filenames to result job entry located in the File management category to build the File result list. Then send the e-mail with the files attached.

Summary

In this chapter, you learned the basics about PDI jobs: what a job is, what you can do with a job, and how jobs are different from transformations. In particular, you learned to use a job for running one or more transformations.

You also saw how to use named parameters in jobs, and how to supply parameters and arguments to transformations when they are run from jobs.

In the next chapter, you will learn to create jobs that are a little more elaborate than the jobs you created here, which will give you more power to implement all types of processes.

Creating Advanced Transformations and Jobs

Iterating over a list of items (files, people, codes, and so on), implementing a process flow, and developing a reusable procedure are very common requirements in real world projects. Implementing these kinds of needs in PDI is not intuitive, but it's not complicated either. It's just a matter of learning the right techniques that we will see in this chapter. Among other things, you will learn to implement process flows, nest jobs, and iterate the execution of jobs and transformations.

Enhancing your processes with the use of variables

For the tutorials in this chapter, you will take as your starting point a Time for action tutorial you did in Chapter 2 that involves updating a file with news about examinations. You are responsible for collecting the results of an annual examination where writing, reading, speaking, and listening skills are evaluated. The professors grade the examinations of their students on a scale of 0-100 for each skill, and generate text files with the information. Then they send the files to you for integrating the results in a global list.

In the initial chapters, you were learning the basics of PDI. You were worried about how to do simple stuff such as reading a file or doing simple calculations. In this chapter, you will go beyond that and take care of the details such as making a decision if the filename expected as a command-line argument is not provided or if it doesn't exist.


Time for action – updating a file with news about examinations by setting a variable with the name of the file

The transformation in the Time for action from Chapter 2 that we just talked about reads a file provided by a professor, simply by taking the name of the file from the command line, and appends the file to the global one. Let's enhance that work.

1. Copy the examination files you used in Chapter 2 to the input files and folder defined in your kettle.properties file. If you don't have them, download them from the Packt website.

2. Open Spoon and create a new transformation.

3. Use a Get System Info step to get the first command-line argument. Name the field as filename.

4. Add a Filter rows step and create a hop from the Get System Info step to this step.

5. From the Flow category drag an Abort step to the canvas, and from the Job category of steps drag a Set Variables step.

6. From the Filter rows step, create two hops: one to the Abort step and the other to the Set Variables step. Double-click the Abort step. As Abort message, put File name is mandatory.

7. Double-click the Set Variables step and click on Get Fields. The window will be filled as shown here:

8. Click on OK.


9. Double-click the Filter rows step. Add the following filter: filename IS NOT NULL. In the drop-down list to the right of Send 'true' data to step, select the Set Variables step, whereas in the drop-down list to the right of Send 'false' data to step, select the Abort step.

10. The final transformation looks like this:

11. Save the transformation in the transformations folder under the name getting_filename.ktr.

12. Open the transformation named examinations.ktr that was created in Chapter 2 or download it from the Packt website. Save it in the transformations folder under the name examinations_2.ktr.

13. Delete the Get System Info step.

14. Double-click the Text file input step.

15. In the Accept filenames from previous steps frame, uncheck the Accept filenames from previous step option.

16. Under File/Directory in the Selected files grid, type ${FILENAME}. Save the transformation.

17. Create a new job.

18. From the General category, drag a START entry and a Transformation entry to the canvas and link them.

19. Save the job as examinations.kjb.


20. Double-click the Transformation entry. As Transformation filename, put the name of the first transformation that you created: ${Internal.Job.Filename.Directory}/transformations/getting_filename.ktr.

21. Click on OK.

Remember that you can avoid typing that long variable name by pressing Ctrl+Space and selecting the variable from the list.

22. From the Conditions category, drag a File Exists entry to the canvas and create a hop from the Transformation entry to this new one.

23. Double-click the File Exists entry.

24. Write ${FILENAME} in the File name textbox and click on OK.

25. Add a new Transformation entry and create a hop from the File Exists entry to this one.

26. Double-click the entry and, as Transformation filename, put the name of the second transformation you created: ${Internal.Job.Filename.Directory}/transformations/examinations_2.ktr.

27. Add a Write To Log entry, and create a hop from the File Exists entry to this. The hop should be red, so that it is followed when the File Exists entry fails. If not, right-click the hop and change the evaluation condition to Follow when result is false.

28. Double-click the entry and fill all the textboxes as shown:


29. Add two entries: an abort and a success. Create hops to these new entries as shown next:

30. Save the job.

31. Press F9 to run the job.

32. Set the logging level to Minimal logging and click on Launch.

33. The job fails. The following is what you should see in the Logging tab in the Execution results window:


34. Press F9 again. This time set Basic logging as the logging level.

35. In the arguments grid, write the name of a fictitious file, for example, c:/pdi_files/input/nofile.txt.

36. Click on Launch. This is what you see now in the Logging tab window:

37. Press F9 for the third time. Now provide a real examination filename such as c:/pdi_files/input/exam1.txt.

38. Click on Launch. This time you see no errors. The examination file is appended to the global file:


What just happened?

You enhanced the transformation you created in Chapter 2 for appending an examination file to a global examination file. This time you embedded the transformation in a job. The first transformation checks that the argument is not null. In that case, it sets a variable with the name provided. The main job verifies that the file exists. If everything is all right, then the second transformation performs the main task: it appends the given file to the global file.

Note that you changed the logging levels just according to what you needed to see: the highlighted lines in the earlier explanation.

You may choose any logging level you want depending on the level of detail you want to see.

Setting variables inside a transformation

So far, you had defined variables only in the kettle.properties file or inside Spoon while you were designing a transformation. In this last exercise, you learned to define your own variables at run time. You set a variable with the name of the file provided as a command-line argument. You used that variable in the main job to check if the file existed. Then you used the variable again in the main transformation. There you used it as the name of the file to read.

This example showed you how to set a variable with the value of a command-line argument. This is not always the case. The value you set in a variable can originate in different ways: it can be a value coming from a table in a database, a value defined with a Generate rows step, a value calculated with a Formula or a Calculator step, and so on.

The variables you define with a Set variables step can be used in the same way and the same places where you use any Kettle variable. Just take precautions to avoid using these variables in the same transformation where you have set them.

The variables defined in a transformation are not available for use until you leave that transformation.
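The control flow of the examinations job can be summarized in a few lines of ordinary code. The Python sketch below is an analogy under stated assumptions: the function names, the sample row, and the exact wording of the "does not exist" message are made up, while the abort message File name is mandatory comes from the tutorial. Note how the variable only becomes usable once getting_filename has returned, echoing the rule that a variable is not available inside the transformation that sets it.

```python
import os
import tempfile


def getting_filename(args):
    """First transformation: abort on a null argument, else set FILENAME."""
    filename = args[0] if args else None
    if not filename:
        raise RuntimeError("File name is mandatory")  # the Abort step
    return {"FILENAME": filename}                      # the Set Variables step


def examinations_job(args, global_file):
    """Job: set the variable, check that the file exists, then append it."""
    variables = getting_filename(args)
    filename = variables["FILENAME"]
    if not os.path.exists(filename):                   # the File Exists entry
        return f"The file {filename} does not exist"   # Write To Log, then abort
    with open(filename) as src, open(global_file, "a") as dst:
        dst.write(src.read())                          # the second transformation
    return "success"


workdir = tempfile.mkdtemp()
exam = os.path.join(workdir, "exam1.txt")
global_file = os.path.join(workdir, "examination.txt")
with open(exam, "w") as handle:
    handle.write("student_1;80;75;90;60\n")

print(examinations_job([exam], global_file))
print(examinations_job([os.path.join(workdir, "nofile.txt")], global_file))
```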

Have a go hero – enhancing the examination tutorial even more

Modify the job in the tutorial to avoid processing the same file twice. If the file is successfully appended to the global file, rename the original file by changing the extension to processed. For example, after processing the exam1.txt file rename it to exam1.processed.


Aer verifying if the le exists, also check whether the .processed version exists. If it

exists, put a proper message in the log and abort. If someone accidently tries to process

a le that is already processed, it will be ignored.

Besides the variable with the lename, create a variable with the name

for the processed le. To build this name, simply manipulate the given

name with some PDI steps.

Have a go hero – enhancing the jigsaw database update process

In the Time for action – inserting new products or updating existent ones section in Chapter 8, you read a file with a list of products belonging to the manufacturer Classic DeLuxe. The list was expected as a named parameter. Enhance that process. Create a job that first validates the existence of the provided file. If the file doesn't exist, put the proper error message in the log. If it exists, process the list. Then move the processed file to a folder named processed.

You don't need to create a transformation to set a variable with the name of the file. As it is expected as a named parameter, it is already available as a variable.

Have a go hero – executing the proper jigsaw database update process

In the hero exercise in Chapter 8 that involves populating the products table, you created different transformations for updating the products, one for each manufacturer. Now you will put all that work together.

Create a job that accepts two arguments: the name of the file to process and the code of the manufacturer to which the file belongs.

Create a transformation that validates that the code provided belongs to an existing manufacturer. If the code exists, set a variable named TRANSFORMATION_FILE with the name of the transformation that knows how to process the file for that manufacturer.

The transformation must also check that the name provided is not null. If it is not null, set a variable named FILENAME with the name supplied.

Then, in the job, check that the file exists. If it exists and the manufacturer code is valid, run the proper transformation. In order to do so, put ${TRANSFORMATION_FILE} as the name of the transformation in the transformation job entry dialog window. Now test your job.


Enhancing the design of your processes

When your jobs or transformations begin to grow, you may find them a little disorganized or jumbled up. It's now time to do some rework. Let's see an example of this.

Time for action – generating files with top scores

In this tutorial, you will read the examination global file and generate four files, one for each particular skill. The files will contain the top 10 scores for each skill. The scores will not be the original ones, but converted to a scale with values in the range 0-5.

As you must be already quite confident with PDI, some explanations in this section will not have the full details. Instead, the general explanation will be focused on the structure of the jobs and transformations.

1. Create a new transformation and save it in the transformations folder under the name top_scores.ktr.

2. Use a Text file input step to read the global examination file generated in the previous tutorial.

3. After the Text file input step, add the following steps and link them in the same order:

A Select values step to remove the unused fields: file_processed and process_date.

A Split Fields step to split the name of the students in two: name and last name.

A Formula step to convert name and last name to uppercase.

With the same Formula step, change the scale of the scores. Replace each skill field (writing, reading, speaking, and listening) with the same value divided by 20, for example, [writing]/20. You have already done this in Chapter 3.




4. Do a preview on the final step to check that you are doing well. You should see this:

5. After the last Formula step, add and link in this order the following steps:

A Sort rows step to order the rows in descending order by the writing field.

A JavaScript step to filter the first 10 rows. Remember that you learned to do this in the chapter devoted to JavaScript. You do it by typing the following piece of code:

// Keep each row by default; skip every row read after the tenth.
trans_Status = CONTINUE_TRANSFORMATION;
if (getProcessCount('r')>10) trans_Status = SKIP_TRANSFORMATION;

An Add sequence step to add a field named seq_w. Leave the defaults so that the field contains the values 1, 2, 3 …

A Select values step to rename the field seq_w as position and the field writing as score. Specify this change in the Select & Alter tab, and check the option Include unspecified fields, ordered.

A Text file output step to generate a file named writing_top10.txt at the location specified by the ${LABSOUTPUT} variable. In the Fields tab, put the following fields: position, student_code, student_name, student_lastname, and score.

6. Save the transformation, as you've added a lot of steps and don't want to lose your work.




7. Repeat step number 5, but this time sort by the reading field, rename the sequence seq_r as position and the field reading as score, and send the data to the reading_top10.txt file.

To save time, you can copy all those steps, paste them, and do the proper adjustments.

8. Repeat the same procedure for the speaking field and the listening field.

9. This is what the transformation looks like:

10. Save the transformation.


11. Run the transformation. Four files should have been generated. All the files should look similar. Let's check the writing_top10.txt file (the names and values may vary depending on the examination files that you have appended to the global file):

What just happened?

You read the big file with examination results and generated four files with information about the top scores, one file for each skill.

Beyond having used the Add sequence step for the first time, there was nothing new. However, there are several improvements you can make to this transformation. The next tutorials are meant to teach you some tricks.
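The chain of steps built for each skill (rescale, sort, keep the first ten rows, add a sequence) can be expressed compactly in code, which makes it easier to see that the four branches differ only in the field they use. The Python sketch below is an illustration, not a translation of the transformation: top_scores is a hypothetical function, the sample students are made up, and only the student_code field is carried along for brevity.

```python
def top_scores(rows, skill, limit=10):
    """Rescale one skill to 0-5, sort descending, keep `limit` rows, add a rank."""
    rescaled = [
        {
            "student_code": row["student_code"],
            "score": row[skill] / 20,  # 0-100 scale converted to 0-5
        }
        for row in rows
    ]
    rescaled.sort(key=lambda row: row["score"], reverse=True)  # Sort rows
    top = rescaled[:limit]                                     # JavaScript filter
    for position, row in enumerate(top, start=1):              # Add sequence
        row["position"] = position
    return top


# Made-up sample data: 12 students with writing scores 100, 95, ..., 45.
rows = [{"student_code": f"S{i:02d}", "writing": 100 - 5 * i} for i in range(12)]

for row in top_scores(rows, "writing"):
    print(row["position"], row["student_code"], row["score"])
```

Calling top_scores with "reading", "speaking", or "listening" would reuse the same logic, provided the rows contain those fields.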

Pop quiz – using the Add Sequence step

In the previous tutorial, you used different names for the sequences and then you renamed all of them to position. Which of the following options gives you the same results you got in the tutorial?

a. Using position as the name of the sequence in all Add sequence steps

b. Joining the four streams with a single Add sequence step and then splitting the stream back into four streams by using the Distribute method you learned in Chapter 4

c. Joining the four streams with a single Add sequence step and then splitting the stream back into four streams by using a Switch case step that distributes the rows properly

d. All of them

e. None of them


Reusing part of your transformations

As you noticed, the sequence of steps used to get the ranks is almost identical for the four skills. You could have avoided copying and pasting or doing the same work several times by moving those steps to a subtransformation. Let's do it.

Time for action – calculating the top scores with a subtransformation

Let's modify the transformation that calculates the top scores to avoid unnecessary duplication of steps:

1. Under the transformations folder, create a new folder named subtransformations.

2. Create a new transformation and save it in that new folder with the name scores.ktr.

3. Expand the Mapping category of steps. Select a Mapping input specification step and drag it to the work area.

4. Double-click the step and fill it like this:

5. Add a Sort rows step and use it to sort the score field in descending order.

6. Add a JavaScript step and type the following code to filter the top 10 rows:

trans_Status = CONTINUE_TRANSFORMATION;
if (getProcessCount('r')>10) trans_Status = SKIP_TRANSFORMATION;

7. Add an Add sequence step to add a sequence field named seq.


8. Finally, add a Mapping output specification step. You will find it in the Mapping category of steps. Your transformation looks like this:

9. Save the transformation.

10. Open the transformation top_scores.ktr and save it as top_scores_with_subtransformations.ktr.

11. Modify the writing stream. Delete all steps except the Text file output step, that is, delete the Sort rows, JavaScript, Add sequence, and Select values steps.

12. Drag a Mapping (sub-transformation) step to the canvas and put it in the place where all the deleted steps were. You should have this:

13. Double-click the Mapping step.


14. In the Mapping transformation frame, select the option named Use a file for the mapping transformation. In the textbox below it, type ${Internal.Transformation.Filename.Directory}/subtransformations/scores.ktr. Select the Input tab, check the Is this the main data path? option, and fill the grid as shown:

15. Select the Output tab and fill the grid as shown:

16. Click on OK.

17. Repeat the steps 11 to 16 for the other streams: reading, speaking, and listening. The only difference is what you put in the Input tab of the Mapping steps. Instead of writing, you should put reading, speaking, and listening.

Note that you added four Mapping (sub-transformation) steps, but you only need one subtransformation file.


18. The final transformation looks as follows:

19. Save the transformation.

20. Press F9 to run the transformation.

21. Select Minimal logging and click on Launch. The Logging window looks like the following:


22. The output files should have been generated and should look exactly the same as before. This time let's check the reading_top10.txt file (the names and values may vary depending on the examination files that you appended to the global file):

What just happened?

You took the bunch of steps that calculate the top scores and moved it to a subtransformation. Then, in the main transformation, you simply called the subtransformation four times, each time using a different field.

It's worth saying that the Text file output step could also have been moved to the subtransformation. However, instead of simplifying the work, it would have complicated it. This is because the names of the files are different in each case and, in order to build that name, it would have been necessary to add some extra logic.

Creating and using subtransformations

Subtransformations are, as the name suggests, transformations inside transformations.

The PDI proper name for a subtransformation is mapping. However, as the word mapping is also used with other meanings in PDI, we will use the old, more intuitive name subtransformation.

In the tutorial, you created a subtransformation to isolate a task that you needed to apply four times. This is a common reason for creating a subtransformation: to isolate a functionality that is likely to be needed more than once. Then you called the subtransformation by using a single step.


Let's see how subtransformations work. A subtransformation is like a regular transformation, but it has input and output steps connecting it to the transformations that use it. The Mapping input specification step defines the entry point to the subtransformation. Here you specify just the fields needed by the subtransformation. The Mapping output specification step simply defines where the flow ends.

The presence of the Mapping input specification and Mapping output specification steps is the only thing that makes a subtransformation different from a regular transformation.

In the sample subtransformation you created in the tutorial, you defined a single field named score. You sorted the rows by that field, filtered the top 10 rows, and added a sequence to identify the rank, a number from 1 to 10.

You call or execute a subtransformation by using a Mapping (sub-transformation) step. In order to execute the subtransformation successfully, you have to establish a relationship between your fields and the fields defined in the subtransformation.

Let's first see how to define the relationship between your data and the input specification. For the sample subtransformation, you have to define which of your fields is to be used as the input field score defined in the input specification. You can do it in an Input tab in the Mapping step dialog window. In the first Mapping step, you told the subtransformation to use the field writing as its score field.

If you look at the output fields coming out of the Mapping step, you will no longer see the writing field but a field named score. It is the same writing field, renamed as score. If you don't want your fields to be renamed, simply check the Ask these values to be renamed back on output? option found in the Input tab. That will cause the field to be renamed back to its original name, writing in this example.

Let's now see how to define the relationship between your data and the output specification. If the subtransformation creates new fields, you may want to add them to your main dataset. To add a field created in the subtransformation to your dataset, you use an Output tab of the Mapping step dialog window. In the tutorial, you were interested in adding the sequence. So, you configured the Output tab, telling the subtransformation to retrieve the field named seq in the subtransformation but renamed as position. This causes a new field named position to be added to your stream.

If you want the subtransformation to simply transform the incoming stream without adding new fields, or if you are not interested in the fields added in the subtransformation, you don't have to create an Output tab.
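The renaming that happens on the way into and out of a Mapping step can be pictured with a small Python sketch. This is only an illustration of the idea, not how PDI implements mappings: the functions top_scores and call_mapping, the rename_back flag, and the sample students are hypothetical names invented for this example.

```python
def top_scores(rows, limit=10):
    """Stand-in for the subtransformation: sort by score, keep the top rows, add a rank."""
    ranked = sorted(rows, key=lambda r: r["score"], reverse=True)[:limit]
    return [{**row, "seq": i} for i, row in enumerate(ranked, start=1)]


def call_mapping(rows, input_field, rename_back=False):
    """Stand-in for a Mapping step (hypothetical helper, not a PDI API)."""
    # Input tab: the caller's field is presented to the subtransformation as "score".
    renamed = [
        {("score" if key == input_field else key): value for key, value in row.items()}
        for row in rows
    ]
    result = []
    for row in top_scores(renamed):
        out = dict(row)
        # Output tab: the field "seq" is retrieved under the name "position".
        out["position"] = out.pop("seq")
        if rename_back:
            # Equivalent of asking for the values to be renamed back on output.
            out[input_field] = out.pop("score")
        result.append(out)
    return result


students = [
    {"name": "Ana", "writing": 4.5, "reading": 3.0},
    {"name": "Luis", "writing": 4.9, "reading": 4.1},
    {"name": "Eva", "writing": 3.2, "reading": 4.8},
]

by_writing = call_mapping(students, "writing")
by_reading = call_mapping(students, "reading", rename_back=True)
```

In by_writing the rows carry a score field instead of writing, while in by_reading the original field name is restored; both gain the new position field.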


The following screenshot summarizes what was explained just now. The upper and lower grids show the datasets before and after the streams have flowed through the subtransformation.

The subtransformation in the tutorial allowed you to reuse a bunch of steps that were present in several places, avoiding doing the same task several times. Another common situation where you may use subtransformations is the one where you have a transformation with too many steps. If you can identify a subset of steps that accomplish a specific purpose, you may move those steps to a subtransformation. Doing so, your transformation will become cleaner and easier to understand.

Have a go hero – refining the subtransformation

Modify the subtransformation in the following way:

Add a new field named below_first. The field should have the difference between the score in the current row and the maximum score. For example, if the maximum score is 5 and the current score is 4.85, the value for the field should be 0.15.

Modify the main transformation by adding the new field to all output files.


Have a go hero – counting words more precisely (second version)

Combine the following Hero exercises from Chapter 3:

Counting words, discarding those that are commonly used

Counting words more precisely

Create a subtransformation that receives a String value and cleans it. Remove extra signs that may appear as part of the string such as . , ) or ". Then convert the string to lower case.

Also create a flag that tells whether the string is a valid word. Remember that the word is valid if its length is at least 3 and if it is not in a given list of common words.

Retrieve the modified word and the flag.

Modify the main transformation by using the subtransformation. After the subtransformation step, filter the words by looking at the flag.

Creating a job as a process flow

With the implementation of a subtransformation, you simplified much of the transformation. But you still have some reworking to do. In the main transformation, you basically do two things. First you read the source data from a file and prepare it for further processing. Then, after the preparation of the data, you generate the files with the top scores. To have a clearer vision of these two tasks, you can split the transformation in two, creating a job as a process flow. Let's see how to do that.

Time for action – splitting the generation of top scores by copying and getting rows

Now you will split your transformation into two smaller transformations so that each one performs a specific task. Here are the instructions.

1. Open the transformation from the previous tutorial. Select all steps related to the preparation of data, that is, all steps from the Text file input step up to the Formula step.

2. Copy the steps and paste them in a new transformation.

3. Expand the Job category of steps.




4. Select a Copy rows to result step, drag it to the canvas, and create a hop from the last step to this new one. Your transformation looks like this:

5. Save the transformation in the transformations folder with the name top_scores_flow_preparing.ktr.

6. Go back to the original transformation and select the rest of the steps, that is, the Mapping and the Text file output steps.

7. Copy the steps and paste them in a new transformation.

8. From the Job category of steps select a Get rows from result step, drag it to the canvas, and create a hop from this step to each of the Mapping steps. Your transformation looks like this:

9. Save the transformation in the transformations folder with the name top_scores_flow_processing.ktr.

10. In the top_scores_flow_preparing transformation, right-click the step Copy rows to result and select Show output fields.


11. The grid with the output dataset shows up.

12. Select all rows. Press Ctrl+C to copy the rows.

13. In the top_scores_flow_processing transformation, double-click the step Get rows from result.

14. Press Ctrl+V to paste the values. You have the following result:

15. Save the transformation.


16. Create a new Job.

17. Add a START and two transformation entries to the canvas and link them one after the other.

18. Double-click the first transformation. Put ${Internal.Job.Filename.Directory}/transformations/top_scores_flow_preparing.ktr as the name of the transformation.

19. Double-click the second transformation. Put ${Internal.Job.Filename.Directory}/transformations/top_scores_flow_processing.ktr as the name of the transformation.

20. Your job looks like the following:

21. Save the job. Press F9 to open the Job properties window and click on Launch. Again, the four files should have been generated, with the very same information.

What just happened?

You split the main transformation in two: one for the preparation of data and the other for the generation of the files. Then you embedded the transformations into a job that executed them one after the other. By using the Copy rows to result step, you sent the flow of data outside the transformation, and by using the Get rows from result step, you picked up that data to continue with the flow. The final result was the same as before the change.

Notice that you split the last version of the transformation, the one with the subtransformations inside. You could have split the original instead. The result would have been exactly the same.


Transferring data between transformations by using the copy/get rows mechanism

The copy/get rows mechanism allows you to transfer data between two transformations, creating a process flow. The following drawing shows you how it works:

[Figure: Transformation A ends with a Copy rows step; the data being transferred flows to Transformation B, which starts with a Get rows step.]

The Copy rows to result step transfers your rows of data to the outside of the transformation. You can then pick up that data by using a Get rows from result step. In the preceding image, Transformation A copies the rows and Transformation B, which executes right after Transformation A, gets the rows. If you created a single transformation with all steps from Transformation A followed by all steps from Transformation B, you would get the same result.

The copy of the dataset is made in memory. It's useful when you have small datasets. For bigger datasets, you should prefer saving the data in a temporary file or database table in the first transformation, and then creating the dataset from the file or table in the second transformation.

The Serialize to file/De-serialize from file steps are very useful for this, as the data and the metadata are saved together.
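The two hand-over styles described above can be sketched in Python. This is a simplified model under stated assumptions, not PDI code: the Result class, the two transformation functions, the sample rows, and the file name are all hypothetical, and pickle merely stands in for the idea of saving data and metadata together.

```python
import pickle
import tempfile
from pathlib import Path


class Result:
    """Hypothetical stand-in for the place where copied rows are kept."""

    def __init__(self):
        self.rows = []


def transformation_a(result):
    rows = [
        {"student_code": "1001", "writing": 82},
        {"student_code": "1002", "writing": 57},
    ]
    # "Copy rows to result": the whole dataset is kept in memory.
    result.rows = list(rows)


def transformation_b(result):
    # "Get rows from result": pick up whatever the previous transformation copied.
    return [row["student_code"] for row in result.rows]


result = Result()
transformation_a(result)
codes = transformation_b(result)

# For bigger datasets: write the rows together with their field metadata
# to a temporary file, and rebuild the dataset from that file afterwards.
payload = {"fields": ["student_code", "writing"], "rows": result.rows}
tmp_file = Path(tempfile.mkdtemp()) / "prepared_rows.bin"
tmp_file.write_bytes(pickle.dumps(payload))
restored = pickle.loads(tmp_file.read_bytes())
```

The in-memory variant is the simplest, but everything copied has to fit in memory at once; the file-based variant trades a little I/O for that.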


There is no limit to the number of transformations that can be chained using this mechanism. Look at the following image:

[Figure: a chain of transformations. Transformation A copies rows; Transformation B gets the rows and copies rows again; and so on up to Transformation N, which gets the rows.]

As you can see, you may have a transformation that copies the rows, followed by another that gets the rows and copies again, followed by a third transformation that gets the rows, and so on.

Have a go hero – modifying the flow

Modify the last exercise in the following way:

Include just the students who had an average score above 70.

Note that you have to modify just the transformation that prepares the information, without caring about what the second process does with that data.

Generate just the top five scores for every skill.

Note that you have to modify just the transformation (or the subtransformation) that processes the information, without caring about how the list of students was built.

Create each file in a different transformation. The transformations execute one after the other.




This exercise requires that you modify the flow. Each transformation gets the rows from the previous transformation, then generates a file, and copies the rows to the result to be used by the next transformation.

Nesting jobs

Suppose that every time you append a file with examination results, you want to generate updated files with the top 10 scores. You can do it manually, running one job after the other, or you can nest jobs.

Time for action – generating the files with top scores by nesting jobs

Let's modify the job that updates the global examination file, so that at the end it generates updated top scores files:

1. Open the examinations job you created in the first tutorial of this chapter.

2. After the last transformation job entry, add a Job job entry. You will find it under the General category of entries.

3. Double-click the Job job entry.

4. Type ${Internal.Job.Filename.Directory}/top_scores_flow.kjb as Job filename.

5. Click on OK.

6. Save the job.

7. Pick an examination that you have not yet appended to the global file, for example, exam5.txt.

8. Press F9.

9. In the Arguments grid, type the full path of the chosen file: c:/pdi_files/input/exam5.txt.

10. Click on Launch.


11. In the Job metrics tab of the Execution results window, you will see the following:

12. Also, the chosen file should have been added to the global file, and updated files with top scores should have been generated.

What just happened?

You modified the job that updates the global examination file by including the generation of the files with top scores as part of the process. You did it by using a Job job entry whose task is to run a job inside a job.

In the Job metrics, you could see a hierarchy showing the details of the nested job as a sub-tree of that hierarchy.

Running a job inside another job with a job entry

The job entry, Job, allows you to run a job inside a job. Just like any job entry, this entry may end successfully or fail. Depending on that result, the main job decides which of the entries that follow it will execute. None of the entries following the job entry starts until the nested job ends its execution. There is no limit to the levels of nesting. You may call a job, which calls a job, which again calls a job, and so on. Usually you will not need more than two or three levels.


As with a transformation job entry, you must specify the location and name of the job file. If the job (or any transformation inside the nested job) uses arguments or has defined named parameters, you have the possibility of providing fixed values just as you do in a Transformation job entry, by filling the Arguments and Parameters tabs.
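The way a nested job blocks its parent until it finishes, and reports a single success or failure back, can be sketched in Python. This is a deliberately simplified model: the function run_job and the entry names are hypothetical, and instead of following success and failure hops the sketch simply stops at the first failing entry.

```python
def run_job(entries, depth=0, log=None):
    """Run entries in order and return (succeeded, log).

    Each entry is a (name, action) pair. The action is either a callable
    returning True/False, or a list of entries that plays the role of a nested job.
    """
    log = [] if log is None else log
    for name, action in entries:
        if isinstance(action, list):
            # Nested job: it runs to completion before the outer job moves on.
            ok, _ = run_job(action, depth + 1, log)
        else:
            ok = action()
        log.append((depth, name, ok))
        if not ok:
            # Simplification: treat any failure as the end of this job.
            return False, log
    return True, log


top_scores_flow = [
    ("top_scores_flow_preparing", lambda: True),
    ("top_scores_flow_processing", lambda: True),
]
examinations = [
    ("append examination file", lambda: True),
    ("Job: top_scores_flow", top_scores_flow),
]

succeeded, log = run_job(examinations)
failed, failed_log = run_job([("first", lambda: False), ("second", lambda: True)])
```

The depth recorded in the log mirrors the sub-tree you see in the Job metrics tab: entries of the nested job sit one level below the entry that called them.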

Understanding the scope of variables

By nesting jobs, you implicitly create a relationship between the jobs. Look at the following diagram:

Here you can see how a job, and even a transformation, may have parents and grandparents. The main job is called the root job. This hierarchy is useful to understand the scope of variables. When you define a variable, you have the option to set the scope, that is, to define the places where the variable is visible.


The following table explains which jobs and transformations can access the variable depending on the variable's scope.

| Variable scope type | Visibility of the variable |
| Valid in the parent job | Can be seen by the job that called the transformation and any transformation called by this job. |
| Valid in the grand-parent job | Can be seen by the job that called the transformation, the job that called that job, and any transformation called by any of these jobs. |
| Valid in the root job | Can be seen by all jobs in the chain starting with the main job, and any transformation called by any of these jobs. |
| Valid in the Java Virtual Machine | Seen by all the jobs and transformations run from the same Java Virtual Machine. For example, suppose that you define a variable with scope in the Java Virtual Machine. If you run the transformation from Spoon, then the variable will be available in all jobs and transformations you run from Spoon as long as you don't exit Spoon. |
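As a rough way to reason about the table, the scopes can be modelled as "how far up the chain of calling jobs the variable travels". The following Python sketch only illustrates that reading; the function jobs_that_see, the scope labels, and the job names are hypothetical and do not correspond to any PDI API.

```python
def jobs_that_see(chain, scope):
    """Return the jobs in which the variable is visible.

    chain lists job names from the root job down to the job that directly
    called the transformation setting the variable.
    """
    levels = {"parent": 1, "grand-parent": 2}
    if scope in ("root", "jvm"):
        # Every job in the chain sees the variable. With the JVM scope it is
        # additionally visible to unrelated jobs run from the same JVM,
        # which this sketch does not model.
        return list(chain)
    return list(chain[-levels[scope]:])


chain = ["root_job", "nested_job", "calling_job"]

parent_scope = jobs_that_see(chain, "parent")
grandparent_scope = jobs_that_see(chain, "grand-parent")
root_scope = jobs_that_see(chain, "root")
```

With only two levels of nesting, the grand-parent and root scopes cover the same jobs, which is worth keeping in mind when choosing between them.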

Pop quiz – deciding the scope of variables

In the first tutorial you created a transformation that set a variable with the name of a file. For the scope, you left the default value: Valid in the root job. Which of the following scope types could you have chosen while getting the same results (you may select more than one)?

a. Valid in the parent job

b. Valid in the grand-parent job

c. Valid in the Java Virtual Machine

In general, if you have doubts about which scope type to use, you can use Valid in the root job and you will be fine. Simply ensure that you are not using the same variable name for different purposes.

Iterating jobs and transformations

It may happen that you develop a job or a transformation to be executed several times, once for each different row of your data. Suppose that you have to send a custom e-mail to a list of customers. You would build a job that, for a given customer, gets the relevant data such as name or e-mail account and sends the e-mail. You would then run the job manually several times, once for each customer. Instead of doing that, PDI allows you to execute the job automatically once for each customer in your list.


The same applies to transformations. If you have to execute the same transformation several times, once for each row of a set of data, you can do it by iterating the execution. The next Time for action tutorial shows you how to do this.

Time for action – generating custom files by executing a transformation for every input row

Suppose that 60 is the threshold below which a student must retake the examination. Let's find out the list of students with a score below 60, that is, those who didn't succeed in the writing examination. Then, let's create one file per student telling him/her about this.

First of all, let's create a transformation that generates the list of students who will take the examination:

1. Create a new transformation.

2. Drag a Text file input, a Filter rows, and a Select values step to the canvas and link them in that order.

3. Use the Text file input step to read the global examination file.

4. Use the Filter rows step to keep only those students with a writing score below 60.

5. With the Select values step, keep just the student_code and name values.

6. After this last step, add a Copy rows to result step.

7. Do a preview on this last step. You will see the following (the exact names and values depend on the number of files you have appended to the global file):

8. Save the transformation in the transformations folder with the name students_list.ktr.


Now let's create a transformation that generates a single file. This transformation will be executed for each student in the list shown in the preceding screenshot:

1. Create a new transformation.

2. Drag a Get rows from result step to the canvas.

3. Double-click the Get rows from result step and use it to define two String fields: a field named student_code and another field named name.

4. Add a Formula step and create a hop from the Get rows from result step to this new step.

5. Use the Formula step to create a new String field named text. As value, type: "You'll have to take the examination again, " & [name] & ".".

6. After the Formula step, add a Delay row step. You will find it under the Utility category of steps.

7. Finally, add a Text file output step, and double-click the step to configure it.

8. As filename type ${LABSOUTPUT}/hello. Check the option Include time in filename?.

9. In the Content tab, uncheck Header. As Field, select the field text.

10. This is how your final transformation looks:

11. Save the transformation in the transformations folder under the name hello_each.ktr.

You can't test this transformation alone. If you want to test it, just temporarily replace the Get rows from result step with a Generate rows step, generate a single row with fixed values for the fields, and run the transformation.


Let's create a job that puts everything together:

1. Create a job.

2. Drag a START, a Delete files, and two transformation entries to the canvas, and link them one after the other as shown:

3. Save the job.

4. Double-click the Delete files entry. Fill the Files/Folders: grid with a single row: under File/Folder type ${LABSOUTPUT} and under Wildcard (RegExp) type hello.*\.txt. This regular expression matches all .txt files whose names start with the string "hello" in the ${LABSOUTPUT} folder.

5. Double-click the first transformation entry. As Transformation filename, put ${Internal.Job.Filename.Directory}/transformations/students_list.ktr and click on OK.

6. Double-click the second transformation entry. As Transformation filename, put ${Internal.Job.Filename.Directory}/transformations/hello_each.ktr.

7. Check the option Execute for every input row? and click on OK.

8. Save the job and press F9 to run it.

9. When the execution finishes, explore the folder pointed to by your ${LABSOUTPUT} variable. You should see one file for each student in the list. The files are named hello_<hhmmddss>.txt where <hhmmddss> is the time in your system at the moment that the file was generated. The generated files look like the following:


What just happened?

You built a list of students who had to retake the writing examination and, for each student, you generated a file with a custom message.

First, you created a transformation that built the list of the students and copied the rows outside the transformation by using the Copy rows to result step.

Then you created another transformation that gets a row from the result and generates a file with a custom hello message.

Finally, you created the main job. First of all, the job deletes all files just in case you run the job more than once. Then it calls the first transformation and then executes the transformation that generates the file once for every copied row, that is, once for every student. Each time the transformation gets the rows from the result, it gets a single row with information about a single student and generates a file with the message for that student.
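To see which files the wildcard hello.*\.txt used in the Delete files entry would select, you can try the same expression in Python. This is a sketch under one assumption: that the expression has to match the whole file name, which is why re.fullmatch is used here. The file names in the list are made up for the example.

```python
import re

# The same regular expression typed in the Wildcard (RegExp) column.
WILDCARD = re.compile(r"hello.*\.txt")

# Hypothetical content of the output folder.
file_names = [
    "hello_101530.txt",
    "hello.txt",
    "hello_notes.csv",
    "say_hello.txt",
    "reading_top10.txt",
]

# Keep only the names that the expression matches from start to end.
to_delete = [name for name in file_names if WILDCARD.fullmatch(name)]
```

Note the escaped dot in \.txt: without the backslash, the dot would match any character, not just a literal period.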

Before proceeding with the details of the execute-for-each-row mechanism, let's briefly explain the new step used here: the Delay row step, which is used to deliberately slow down a transformation. For each incoming row, the step waits for the amount of time indicated in its settings window which, by default, is 1 second. After that time, the row is given to the next step.

In this tutorial, the Delay row step is used to ensure that each time the transformation executes, the name of the file is different. As part of the name for the file, you put the time of your system including hours, minutes, and seconds. By waiting for a second, you can be sure that in every execution of the transformation the name of the file will be different from the name of the previous file.
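The reason for the one-second delay becomes clear if you build such file names yourself. The Python sketch below assumes a name made of a base, an underscore, and the time as hours, minutes, and seconds; the function output_name and the exact format are assumptions for illustration, not the format PDI is guaranteed to use.

```python
from datetime import datetime, timedelta


def output_name(base, moment):
    """Build a file name that embeds the time with a resolution of one second."""
    return f"{base}_{moment.strftime('%H%M%S')}.txt"


first_run = datetime(2010, 4, 1, 10, 15, 30, 200000)

# Two executions within the same second produce the same name,
# so the second file would overwrite the first one.
same_second = output_name("hello", first_run + timedelta(milliseconds=500))

# Waiting a full second guarantees a different name.
one_second_later = output_name("hello", first_run + timedelta(seconds=1))
```

Here output_name("hello", first_run) and same_second are identical, whereas one_second_later differs, which is exactly what the Delay row step secures.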

Executing for each row

The Execute for every input row? option you have in the transformation entry settings window allows you to run the transformation once for every row copied in a previous transformation by using the Copy rows to result step. PDI executes the transformation as many times as the number of copied rows, one after the other. Each time the transformation executes and gets the rows from the result, it actually gets a different row.

Note that in the transformation you don't limit the number of incoming rows. You simply assume that you are receiving a single row. If you forget to set the Execute for every input row? option in the job, the transformation will run but you will get unexpected results.


This drawing shows you the mechanism for a dataset with three rows:

[Figure: Transformation A ends with a Copy rows step. Transformation B, which starts with a Get rows step, is executed three times: once with the 1st row, once with the 2nd row, and once with the 3rd row.]

Transformation A in the example copies three rows. Then transformation B is executed three times: first for the first copied row, then for the second, and finally for the third row.
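The mechanism can be mimicked with a few lines of Python. This is only a sketch of the idea: students_list, hello_each, and run_for_every_row are hypothetical functions named after the tutorial's transformations, and the sample students are invented.

```python
def students_list(rows):
    """Plays the role of transformation A: build the list and copy it to the result."""
    return [
        {"student_code": r["student_code"], "name": r["name"]}
        for r in rows
        if r["writing"] < 60
    ]


def hello_each(result_rows):
    """Plays the role of transformation B: one message per row it receives."""
    return [f"You'll have to take the examination again, {r['name']}." for r in result_rows]


def run_for_every_row(transformation, result_rows):
    # One execution per copied row; each execution sees exactly one row.
    return [transformation([row]) for row in result_rows]


students = [
    {"student_code": "1001", "name": "Ana", "writing": 75},
    {"student_code": "1002", "name": "Luis", "writing": 52},
    {"student_code": "1003", "name": "Eva", "writing": 48},
]

copied = students_list(students)
per_row = run_for_every_row(hello_each, copied)

# Without the option, a single execution receives all the copied rows at once.
all_at_once = hello_each(copied)
```

With the option set, each execution produces a single message (one file per student); without it, one execution handles every row, which corresponds to the unexpected results mentioned above.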


If you look at the log in the tutorial, you can see it working:

The transformation that builds the list of students copies four rows to the results. Then the main job executes the second transformation four times, once for each of those students.


The following sketch shows it clearly:

This mechanism of executing for every input row also applies to jobs. To execute a single job several times, once for every copied row, you have to check the Execute for every input row? option that you have in the job entry settings window.

Have a go hero – processing several files at once

Modify the first tutorial about Updating a file with news about examinations. But this time accept a folder as parameter. Then process all the text files in that folder, ordered by date of the file. For each processed file, put a line in the log telling the name of the processed file.

You can use the following hint. Create a first transformation that, instead of validating the parameter as a file, validates it as a folder. In order to do that, use the File exists step inside the Lookup category of steps.


If the folder exists, use a Get File Names step. That step allows you to retrieve the list of filenames in a given folder, including the attributes for those files. To define which files to get, use the options in the box Filenames from field. Sort the list by file date and copy the names to the results.

In the second transformation, executed for every input row, get a row from the result, then use a Text file input step accepting the name from the previous step, and proceed as usual.

As you may find it difficult to use steps you never used before, you may download a working version of the first transformation. You'll find it among the material for this chapter.

Have a go hero – building lists of products to buy

This exercise is related to the JS database.

Create a transformation to find out the manufacturers for the products that have been sold best in the current month. Take the first three manufacturers in the list.

Create another transformation that, for every manufacturer in that list, builds a file with a list of products out of stock.

Hint

The first transformation must copy the rows to the result. The second transformation must execute for every input row. Start the transformation with a Get rows from result step, then a Table Input step that receives as parameter a manufacturer's code. The SQL to use could be something like:

SELECT *

FROM products

WHERE code_man LIKE ? AND pro_stock<pro_stock_min
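To get a feel for how a query with a ? placeholder is run once per manufacturer, here is a small Python sketch using the standard sqlite3 module. It is not PDI code: the in-memory table, the pro_code column, and the sample rows are hypothetical, while code_man, pro_stock, and pro_stock_min come from the query in the hint. The placeholder is left unquoted so that the driver binds the value; inside quotes it would be treated as the literal text '?'.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products ("
    "pro_code TEXT, code_man TEXT, pro_stock INTEGER, pro_stock_min INTEGER)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?)",
    [
        ("P1", "RAV", 2, 10),   # below the minimum stock
        ("P2", "RAV", 50, 10),  # enough stock
        ("P3", "EDU", 0, 5),    # below the minimum stock
    ],
)

QUERY = "SELECT * FROM products WHERE code_man LIKE ? AND pro_stock < pro_stock_min"


def out_of_stock(manufacturer_code):
    # The value of the incoming row is bound to the ? placeholder.
    return conn.execute(QUERY, (manufacturer_code,)).fetchall()


# One execution per copied row, as with "Execute for every input row?".
manufacturers = [{"code_man": "RAV"}, {"code_man": "EDU"}]
shortages = {m["code_man"]: out_of_stock(m["code_man"]) for m in manufacturers}

# A quoted placeholder is just a string literal and matches no manufacturer here.
quoted = conn.execute(
    "SELECT * FROM products WHERE code_man LIKE '?' AND pro_stock < pro_stock_min"
).fetchall()
```

Each manufacturer code in the list yields its own result set, which is what the second transformation would write to a separate file.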

Have a go hero – e-mail students to let them know how they did

Suppose some students have asked you to send them an e-mail to tell them how they did in the examination. Get the list of students from a file you'll find inside the resources, find out their scores, and send them an e-mail with that information.

Hint

Create a transformation that builds the list of students that have asked you to send them the examination results, along with their e-mail and scores, and copies the rows to the result.


Create a job that does the following: Call a transformation that gets a row from a result with the name, e-mail, and scores for a single student. Use that information to create variables needed to send an e-mail, for example Subject. After calling that transformation, use a Mail entry to send the e-mail by using the defined variables.

Create a main job. Execute the transformation that builds the list followed by the job described above, executing it for every input row.

To test the job that sends e-mails, you may temporarily replace the Get rows from result step with a Generate rows step with fixed values.

To test the main job, replace the e-mail accounts in the file with accounts you have access to.

Summary

In this chapter you learned techniques to combine jobs and transformations in different ways.

First, you learned to define your own variables at run time. You defined variables in one transformation and then used them in other jobs and/or transformations. You also learned to define different scopes for those variables.

After that, you learned to isolate part of a transformation as a subtransformation. You also learned to implement process flows by copying and getting rows, and how to nest jobs. By using all these PDI capabilities, your work will look cleaner and will be more organized.

Finally, you learned to iterate the execution of jobs and transformations.

Let's say that this was a really productive chapter. By now, you should be equipped with enough knowledge to use PDI for developing most of your requirements.

You're now ready for the next chapter, where you will develop the final project that will allow you to review a little of everything you've learned throughout the book.

Developing and Implementing a Simple Datamart

In this chapter you will develop a simple but complete process of loading a datamart while reviewing all concepts you learned throughout the book.

The chapter will cover the following:

Introduction to a sales datamart based on the Jigsaw puzzles database

Loading the dimensions of the sales datamart

Loading the fact table for the sales datamart

Automating what has been done

Exploring the sales datamart

In Chapter 9, you were introduced to star schemas. In short, a star schema consists of a central table known as the fact table, surrounded by dimension tables. While the fact has indicators of your business such as sales in dollars, the dimensions have descriptive information for the attributes of your business such as time, customers, and products.

A star that addresses a specific department's needs or that is built for use by a particular group of users is called a datamart. You can have datamarts focused on customer relationship management, inventory, human resources management, budget, and more. In this chapter, you will load a datamart focused on sales.

Sometimes the term datamart is confused with datawarehouse. However, datamarts and datawarehouses are not the same.




The main difference between datamarts and datawarehouses is that datawarehouses address the needs of the whole organization, whereas a datamart addresses the needs of a particular department.

Datawarehouses contain information from multiple subject areas, allowing you to have a global vision of your business. Therefore, they are oriented to the company's staff such as executives or managers.

The following star represents your sales datamart: a central fact named SALES, surrounded by six dimensions:

[Figure: star schema with the central fact SALES surrounded by six dimensions: Product Type, Time, Payment Method, Buy Method, Manufacturer, and Region.]

The following is a brief description of the dimensions in your SALES star:

| Dimension | Description |
| Time | The date on which the sales occurred |
| Regions | The geographical area where the products were sold |
| Manufacturers | The name of the manufacturers that build the products sold |
| Payment method | Cash, Check, and so on |
| Buy method | Internet, by telephone, and so on |
| Product type | Puzzle, glue, frame, and so on |

In real models you may find two types of dimensions related to time: a dimension holding calendar day attributes and a separate dimension with attributes such as hours, minutes, and seconds.


Let's now look at the ERD (entity-relationship diagram) for the database that represents this model. The fact table is represented by a table named ft_sales.

The following table shows you the correspondence between the dimensions in the model

and the tables in the database:

| Dimension | Table |
| Manufacturers | lk_manufacturer |
| Time | lk_time |
| Regions | lk_regions_2 |
| Payment method | lk_junk_sales |
| Buy method | lk_junk_sales |
| Product type | none |

As you can see, there is no one-to-one relationship between the dimensions in the model and the tables in the database.

A one-to-one relationship between a dimension and a database table is not required, but may coincidentally exist.

The first three dimensions have their corresponding tables.

The payment and buy method dimensions share a junk dimension. A junk dimension is an abstract dimension that groups unrelated low-cardinality flags, indicators, and attributes. Each of those items could technically be a dimension on its own, but grouping them into a junk dimension has the advantage of keeping your database model simple and it also saves space.
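One common way to fill a junk dimension is to store every combination of the grouped attributes once and let the fact table point to the combination. The Python sketch below illustrates that idea only; the column names id_junk, pay_method, and buy_method and the sample values are assumptions made for the example, not the actual layout of lk_junk_sales.

```python
from itertools import product

payment_methods = ["Cash", "Check"]
buy_methods = ["Internet", "Telephone"]

# One row per combination, each with its own surrogate key.
junk_dimension = [
    {"id_junk": key, "pay_method": pay, "buy_method": buy}
    for key, (pay, buy) in enumerate(product(payment_methods, buy_methods), start=1)
]

# When loading a sale, look up the key of its combination of attributes.
junk_lookup = {
    (row["pay_method"], row["buy_method"]): row["id_junk"] for row in junk_dimension
}
sale_key = junk_lookup[("Check", "Internet")]
```

A single key column in the fact table then replaces two separate dimension keys, which is where the simplification and the saving of space come from.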


The last dimension, product type, doesn't have a separate table. It is so simple that it isn't worth creating a dimension table. Instead, its values are stored in a dedicated field in the fact table. This kind of dimension is called a degenerate dimension.

Deciding the level of granularity

The level of detail in your star model is called grain. The granularity is directly related to the

types of quesons you expect your model to answer. Let's see some examples.

The product-related informaon your model has is the manufacturer and the kind of product

(puzzle, glue, and so on). Thus, it allows you to ask quesons such as:

Beyond puzzles, which type of product sells best?

Do you sell more products manufactured by Ravensburger than products manufactured by Educa Jigsaws?

What if you want to know the names of the top ten products sold? You simply can't, as that level of detail is not stored in the model. For answering this type of question, you need a lower level of granularity. You could have that by adding a product dimension where each record represents a particular product.

Now let's see the time dimension. Each record in that dimension represents a particular calendar day. This allows you to answer questions such as: how many products did you sell every day in the last four months?

If you were not interested in daily, but in monthly information, you could have designed a model with a higher level of granularity by creating a time dimension with just one record per month.

Understanding the level of granularity of your model is key to the process of loading the fact table, as you will see when you load the sales fact table.
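The effect of choosing a coarser grain can be sketched in a few lines of Python. The daily figures below are invented for illustration; the point is that once the data is stored per month, the per-day detail can no longer be recovered.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily sales: (calendar day, quantity sold).
daily_sales = [
    (date(2009, 7, 7), 3),
    (date(2009, 7, 8), 2),
    (date(2009, 8, 1), 5),
]

# Coarser grain: one total per (year, month) instead of one per day.
monthly_sales = defaultdict(int)
for day, quantity in daily_sales:
    monthly_sales[(day.year, day.month)] += quantity

print(dict(monthly_sales))
```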

Loading the dimensions

As you saw, the sales star model consists of a fact surrounded by the dimension tables. In order to load the star, first you have to load the dimensions. You already learned how to load dimension tables. Here you will load the dimensions for the sales star.




Time for action – loading dimensions for the sales datamart

In this tutorial, you will load each dimension for the sales datamart and enclose them in a single job. Before starting, check the following things:

Check that the database engine is up and that both the js and the js_dw databases are accessible from PDI.

If your time dimension table, lk_time, has data, truncate the table. You may do it by using the Truncate table [lk_time] option in the database explorer.

You may reuse the js_dw database in which you have been loading data in previous chapters. There is no problem with that. However, creating a whole new database is preferred so that you can see how the entire process works.

The explanation will be focused on the general process. For details of creating a transformation that loads a particular type of dimension, please refer to Chapter 9. You can also download the full material for this chapter, where the transformations and jobs are ready to browse and try.

1. Create a new transformation and use it to load the manufacturer dimension. This is a Type I SCD dimension. The data for the dimension comes from the manufacturers table in the js database. The dimension table in js_dw is lk_manufacturer. Use the following screenshot as a guide:

2. Save the transformation in the lk_transformations folder.

3. Create a new transformation and use it to load the regions dimension.




You already loaded this dimension in the Time for action – loading a region dimension with a Combination lookup/update step section in Chapter 9. The load of the region field was part of a Hero exercise in that chapter. If you did it, you may skip this step.

4. The region dimension is a Type II SCD dimension. The data for the dimension comes from the city and country tables. The information about regions is in Excel files that you can download from the Packt web site. The dimension table in js_dw is lk_regions_2. Use the following screenshot as a guide:

5. Save the transformation in the lk_transformations folder.

6. Create a new transformation and use it to load the time dimension.

You already created the dataset for the time dimension in the Time for action – creating the time dimension dataset section in Chapter 6. Then in Chapter 8 the loading of the data into a table was part of a Hero exercise. If you have done it, you may skip this step.

The dimension table in js_dw is lk_time.

7. Save the transformation in the lk_transformations folder.


Now you will create a job to put it all together:

8. Create a new job and save it in the same folder where you created the lk_transformations folder.

9. Drag a START entry and two Transformation job entries to the canvas.

10. Create a hop from the START entry to each of the transformation entries. You have the following:

11. Use one of the transformation entries to execute the transformation that loads the manufacturer dimension.

12. Use the other transformation entry to execute the transformation that loads the region dimension.

13. Add an Evaluate rows number in a table entry to the canvas. You'll find it under the Conditions category.

14. Create a hop from the START entry towards this new entry.

15. Double-click the new entry and fill it in as shown:


16. After this entry, add another transformation entry and use it to execute the transformation that loads the time dimension.

17. Finally, from the General category add a Success entry.

18. Create a hop from the Evaluate… step to this entry. The hop should be red, meaning that this step executes when the evaluation fails.

19. Your final job looks like this:

20. Save the job.

21. Run the job. The manufacturer and regions dimensions should be loaded. You can verify it by exploring the tables from the PDI explorer or in MySQL query browser.

22. In the logging window, you'll see that the evaluation succeeded and so the time dimension is also loaded:


23. You can check it by exploring the table.

24. Run the job again. This time the evaluation fails, and the transformation that loads the time dimension is not executed.

What just happened?

You created the transformations to load the dimensions you need for your sales star.

As already explained in Chapter 10, the job entries connected to the START entry run one after the other, not in parallel as the arrangement in the work area might suggest.

As for the time dimension, once it is loaded, you don't need to load it again. Therefore, you put an evaluation entry to check whether the table had already been loaded. The first time you ran the job, there were no records, so the time dimension was loaded. The second time, the time dimension had already been loaded, so the evaluation failed, which prevented the execution of the transformation that loads the time dimension.


Note that you put in a Success entry to avoid the job failing after the failed evaluation.
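The load-once logic of the job can be sketched outside PDI as follows. This is only an illustration of the control flow, using Python's built-in sqlite3 module and a simplified, hypothetical one-column lk_time table; it is not how PDI implements the Evaluate rows number in a table entry.

```python
import sqlite3

def load_time_dimension_once(conn, rows):
    """Load lk_time only if it is empty; return True if rows were inserted."""
    (count,) = conn.execute("SELECT COUNT(*) FROM lk_time").fetchone()
    if count > 0:
        return False  # already loaded: skip, but do not treat it as a failure
    conn.executemany("INSERT INTO lk_time (dt) VALUES (?)", rows)
    conn.commit()
    return True

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lk_time (dt TEXT PRIMARY KEY)")  # simplified, hypothetical layout
sample_rows = [("20090707",), ("20090708",)]

first_run = load_time_dimension_once(conn, sample_rows)
second_run = load_time_dimension_once(conn, sample_rows)
print(first_run, second_run)
```

Returning False instead of raising an error plays the role of the Success entry: skipping the load is an expected outcome, not a failure of the whole process.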

Extending the sales datamart model

You may, and usually do, have more than one fact table sharing some of the dimensions. Look at the following diagram:

[Diagram: two star models, SALES and PUZZLES SALES. Dimensions shown: Product Type, Payment Method, Buy Method, Manufacturer, Region, Time, Theme, Pieces, Packaging, Glows in the Dark, 3D Puzzle, Wooden Puzzle, and Panoramic Puzzle.]

It shows two stars sharing three dimensions: Regions, Manufacturers, and Time. The star model to the left is the sales star model you already know. The star model to the right doesn't have data for accessories, but does have more detail for puzzles, such as the number of pieces they have or the category or theme they belong to. When you have more than one fact table sharing dimensions as here, you have what is called a constellation.

The following table summarizes the dimensions added to the datamart:

Dimension            Description
Pieces               Number of pieces of the puzzle, grouped in the following ranges: 0-25, 26-100, and so on
Theme                Classification of the puzzle in any of the following categories: Fantasy, Castles, Landscapes, and so on
Glows in the dark    Yes/No
3D puzzle            Yes/No
Wooden puzzle        Yes/No
Panoramic puzzle     Yes/No
Packaging            Number of puzzles packed together: 1, 2, 3, 4


The following is the updated ERD for the database that represents the model:

The new fact table is represented by a table named ft_puzz_sales.

The following table shows you the correspondence between the dimensions added to the

model and the tables in the database.

Dimension            Table
Pieces               lk_pieces
Theme                lk_puzzles
Glows in the dark    lk_mini_prod
3D puzzle            lk_mini_prod
Wooden puzzle        lk_mini_prod
Panoramic puzzle     lk_mini_prod
Packaging            lk_mini_prod

The following Hero exercise allows you to practice what you learned in the tutorial, but this time applied to the puzzle star model.


Have a go hero – loading the dimensions for the puzzles star model

In this exercise you will load some of the dimensions that were added to the model.

Create a transformation that loads the lk_pieces dimension. You may create any range you like. The following table may help you in the creation:

min     max      description
0       25       Under 25
26      100      26-100
101     1000     101-1000
1001    2000     1001-2000
2001    99999    >2000

Create another transformation that loads the lk_puzzles dimension. This is a Type II SCD, and you have already loaded it in Chapter 9. If you have the transformation that does it, half of your work is done.

Finally, modify the job in the tutorial by adding the execution of these new transformations. Note that the lk_pieces dimension has to be loaded just once.

Loading a fact table with aggregated data

Now that you have data in your dimensions, you are ready to load the sales fact table. In this section, you will learn how to do it.

Time for action – loading the sales fact table by looking up dimensions

Let's load the sales fact table, ft_sales, with sales information for a given range of dates. Before doing this exercise, be sure that you have already loaded the dimensions. You did it in the previous tutorial.

Also check that the database engine is up and that both the js and the js_dw databases are accessible from PDI. If everything is in order, you are ready to start:

1. Create a new transformation.

2. Drag a Table input step to the canvas.

3. Double-click the step. Select js as Connection, that is, the connection to the operational database.




4. In the SQL frame type the following query:

SELECT i.inv_date
      ,d.man_code
      ,cu.city_id
      ,pr.pro_type product_type
      ,b.buy_desc
      ,p.pay_desc
      ,sum(d.cant_prod) quantity
      ,sum(d.price) amount
FROM invoices i
    ,invoices_detail d
    ,customers cu
    ,buy_methods b
    ,payment_methods p
    ,products pr
WHERE i.invoice_number = d.invoice_number
  AND i.cus_id = cu.cus_id
  AND i.buy_code = b.buy_code
  AND i.pay_code = p.pay_code
  AND d.pro_code = pr.pro_code
  AND d.man_code = pr.man_code
  AND i.inv_date BETWEEN cast('${DATE_FROM}' as date)
                     AND cast('${DATE_TO}' as date)
GROUP BY i.inv_date
        ,d.man_code
        ,cu.city_id
        ,pr.pro_type
        ,b.buy_desc
        ,p.pay_desc

5. Check the Replace variables in script? option and click OK.

Let's retrieve the surrogate key for the manufacturer:

6. From the Lookup category, drag a Database lookup step to the canvas.

7. Create a hop from the Table input step to this new step.

8. Double-click the Database lookup step.

9. Select dw as Connection, that is, the connection to the datamart database.

10. Click on Browse... and select the lk_manufacturers table.

11. Fill the upper grid with the following condition: id_js = man_code.


12. Fill the lower grid: under Field type id, as New name type id_manufacturer, as Default type 0, and as Type select Integer.

13. Click on OK.

Now you will get the surrogate key for the region:

14. From the Data Warehouse category drag a Dimension lookup/update step to the canvas.

15. Create a hop from the Database lookup step to this new step.

16. Double-click the Dimension lookup/update step.

17. As Connection select dw.

18. Browse and select the lk_regions_2 table.

19. Fill the Keys grid as shown next:

20. Select id as Technical key field. In the new name textbox, type id_region.

21. As Stream Datefield select inv_date.

22. As Date range start field and Table daterange end select start_date and end_date respectively.

23. Select the Fields tab and fill it in as shown here:


Now it's time to generate the surrogate key for the junk dimension:

24. From the Data Warehouse category drag a Combination lookup/update step to the canvas.

25. Create a hop from the Dimension lookup/update step to this new step.

26. Double-click the Combination lookup/update step.

27. Select dw as Connection.

28. Browse and select the lk_junk_sales table.

29. Fill the grid as shown:

30. As Technical key field type id. In the Creation of technical key frame, leave the default value Use table maximum + 1.

31. Click OK.

32. Add a Select values step and use it to rename the field id to id_junk_sales.

Finally, let's make some adjustments and send the data to the fact table:

33. Add another Select values step to change the metadata of the inv_date field as shown:


34. Add a Table output step and double-click it.

35. Select dw as Connection.

36. Browse and select the ft_sales table.

37. Check the Specify database fields option, select the Database fields grid, and fill it in as shown:

Remember that you can avoid typing by using the Get fields button.

38. Click on OK. The following is your final transformation. Press Ctrl+S to save it.


39. Press F9 to run it.

40. In the settings window, provide some values for the date range.

41. Click on Launch.

42. The fact table should have been loaded. To check it, open the database explorer and run

the following query:

SELECT * FROM ft_sales

You will get the following:

43. To verify that only the sales between the provided dates were processed, run the

following query:

SELECT MIN(dt), MAX(dt) FROM ft_sales

44. You will get the following:


What just happened?

You loaded the sales fact table with the sales in a given range of dates.

First of all you got the information from the source database. You did it by typing an SQL query in a Table input step. You already know how a Table input step works.

As said, a fact table has foreign keys to the primary keys of the dimension tables. The query you wrote gave you business keys. So, after getting the data from the source, you translated the business keys into surrogate keys. You did it in different ways depending on the kind of each related dimension.

Finally, you inserted the obtained data into the fact table ft_sales.

Getting the information from the source with SQL queries

You already know how to use a Table input step to get information from any database. However, the query in the tutorial may have looked strange or long compared with the queries you wrote in previous chapters. There is nothing mysterious in that query: it's simply a matter of knowing what to put in it. Let's explain it in detail.

The first thing you have to do in order to load the fact table is to look at the grain. As mentioned at the beginning of the chapter, the grain, or level of detail, of the fact is implicitly expressed in terms of the dimensions.

Looking at the model, you can see the following dimensions, along with their level of detail:

Dimension         Level of detail (most atomic data)
Manufacturers     manufacturer
Regions           city
Time              day
Product Type      product type
Payment method    payment method
Buy method        buy method

Does this have anything to do with loading the fact? Well, the answer is yes. This is because the numbers you have to put as measures in the numeric fields must be aggregated according to the dimensions. These are the measurements: quantity, representing the number of products sold, and Sales, representing the amounts.

So, in order to feed the table, what you need to take from the source is the sum of quantity and the sum of sales for every combination of manufacturer, day, city, product type, payment method, and buy method.


In SQL terms you do it with a query such as the one you wrote in the Table input step. The query is not as complicated as it may seem at first. Let's dissect the query, beginning with the FROM clause.

FROM invoices i
    ,invoices_detail d
    ,customers cu
    ,buy_methods b
    ,payment_methods p
    ,products pr

These are the tables to take the information from. The word following the name of the table is an alias for the table, for example, pr for the table products. The alias is used to distinguish fields that have the same name but are in different tables.

The database engine takes all the records for all the listed tables, side by side, and creates all the possible combinations of records, where each new record has all the fields for all the tables.

WHERE i.invoice_number = d.invoice_number
  AND i.cus_id = cu.cus_id
  AND i.buy_code = b.buy_code
  AND i.pay_code = p.pay_code
  AND d.pro_code = pr.pro_code
  AND d.man_code = pr.man_code

These conditions represent the join between tables. A join limits the number of records you have when combining tables as explained above. For example, consider the following condition:

i.cus_id = cu.cus_id

This condition implies that out of all the records, the engine keeps only those where the customer ID in the table invoices is the same as the customer ID in the table customers.
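The two-stage description above (combine everything, then keep the matching records) can be mimicked in plain Python. The rows are invented and reduced to the columns involved in the join; a real database engine reaches the same result with far more efficient strategies than building every combination.

```python
from itertools import product

# Hypothetical rows; only the columns needed for the join are shown.
invoices = [
    {"invoice_number": 1, "cus_id": 10},
    {"invoice_number": 2, "cus_id": 20},
]
customers = [
    {"cus_id": 10, "city_id": 261},
    {"cus_id": 20, "city_id": 300},
]

# FROM invoices i, customers cu  -> every possible pairing of rows
all_pairs = list(product(invoices, customers))

# WHERE i.cus_id = cu.cus_id     -> keep only the matching pairs
joined = [{**i, **cu} for i, cu in all_pairs if i["cus_id"] == cu["cus_id"]]

print(len(all_pairs), len(joined))
```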

AND i.inv_date BETWEEN cast('${DATE_FROM}' as date)

AND cast('${DATE_TO}' as date)

This condition simply filters the sales in the given range. The cast function converts a string to a date.


Different engines have different ways to cast or convert fields from one data type to another. If you are using an engine different from MySQL, you may have to check your database documentation and fix this part of the query.

GROUP BY i.inv_date
        ,d.man_code
        ,cu.city_id
        ,pr.pro_type
        ,b.buy_desc
        ,p.pay_desc

By using the GROUP BY clause, you ask the SQL engine to return just one record for each different combination of the listed fields.

Finally, look at the fields following the SELECT clause:

SELECT i.inv_date
      ,d.man_code
      ,cu.city_id
      ,pr.pro_type product_type
      ,b.buy_desc
      ,p.pay_desc
      ,sum(d.cant_prod) quantity
      ,sum(d.price) amount

These fields are the business keys you need (date of sale, manufacturer, city, and so on), one for each dimension in the sales model. Note the word product_type after the pro_type field. This is an alias for the field. By using an alias, the field is renamed in the output.

As you can see, with the exception of the two aggregated fields, the fields you put after the SELECT clause are exactly the same as you put in the GROUP BY clause. When you have a GROUP BY clause in your sentence, after the SELECT clause you can put only those fields that are listed in the GROUP BY clause or aggregate functions such as the following:

,sum(d.cant_prod) quantity

,sum(d.price) amount

sum() is an aggregate function that gives you the sum of the column you put into brackets. Therefore, these last two fields are the sum of the cant_prod field and the sum of the price field for all the grouped records. These two fields give you the measures for your fact table.


To confirm that the GROUP BY works as explained, let's explore one example. Remove the sum() functions from the query, leaving just the fields, and remove the GROUP BY clause as well. Do a preview setting 2009-07-07 as both DATE_FROM and DATE_TO. You will see the following:

As you can see, on the same day, in the same city, you sold two products of the same type, made by the same manufacturer, by using the same payment and buy method. In the fact table you will not save two records, but a single record. Restore the original query and do a preview. You will see the following:

Here you can see that the GROUP BY clause has grouped those two records into a single one. For quantity and amount it summed the individual values.

Note that the GROUP BY clause, along with the aggregate functions, does the same as you could have done by using a Sort rows step to sort by the listed fields, followed by a Group by step to get the sum of the numeric fields.

Wherever the database can do the operations, for performance reasons it's recommended that you let the database engine do them.
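The same experiment can be reproduced without the js database. The following sketch uses Python's built-in sqlite3 module with a single hypothetical sales table (a simplification of the joined source tables) holding two detail rows that share every grouping column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (inv_date TEXT, man_code TEXT, city_id INTEGER, "
    "cant_prod INTEGER, price REAL)"
)
# Two hypothetical detail rows that share every grouping column.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [("2009-07-07", "M1", 261, 1, 10.0), ("2009-07-07", "M1", 261, 2, 25.0)],
)

ungrouped = conn.execute("SELECT inv_date, man_code, city_id FROM sales").fetchall()
grouped = conn.execute(
    "SELECT inv_date, man_code, city_id, "
    "SUM(cant_prod) AS quantity, SUM(price) AS amount "
    "FROM sales GROUP BY inv_date, man_code, city_id"
).fetchall()

print(len(ungrouped), grouped)
```

Without the GROUP BY the query returns both detail rows; with it, they collapse into a single row whose quantity and amount are the sums of the individual values.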


Translating the business keys into surrogate keys

You already have the transactional data for the fact table. But that data contains business keys. Look at the field definitions for your fact table:

dt              CHAR(8) NOT NULL,
id_manufacturer INT(10) NOT NULL,
id_region       INT(4) NOT NULL,
id_junk_sales   INT(10) NOT NULL,
product_type    CHAR(10) NOT NULL,
quantity        INT(6) DEFAULT 0 NOT NULL,
amount          NUMERIC(8,2) DEFAULT 0 NOT NULL

id_manufacturer, id_region, and id_junk_sales are foreign keys to surrogate keys. So, before inserting the data into the fact, for each business key you have to find the proper surrogate key. Depending on the kind of dimension referenced by each ID in the fact table, you get those IDs in a different way. Let's see in the following sections how you do it in each case.

Obtaining the surrogate key for a Type I SCD

For getting the surrogate key in the case of a Type I SCD such as the Manufacturer one, you used a Database lookup step. You are already familiar with this step, so understanding how to use it is easy.

In the first grid you provided the business keys. The key to look up in the incoming stream is man_code, whereas the key to look up in the dimension table is stored in the field id_js.

With the Database lookup step you returned the field named id, which is the field that stores the surrogate key. You renamed it to id_manufacturer, as this is the name you need for the fact table.

If the key is not found, you use 0 as default, that is, the record in the dimension reserved for unknown values.
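Conceptually, this lookup is a keyed search with a default. A minimal Python sketch follows, with a dictionary standing in for the dimension table; the manufacturer codes and surrogate keys are hypothetical.

```python
UNKNOWN_KEY = 0  # surrogate key of the row reserved for unknown values

# Hypothetical contents of the dimension table: business key (id_js) -> surrogate key (id)
manufacturer_keys = {"RAV": 1, "EDU": 2}

def lookup_manufacturer(man_code):
    """Return the surrogate key for man_code, or 0 when there is no match."""
    return manufacturer_keys.get(man_code, UNKNOWN_KEY)

print(lookup_manufacturer("RAV"), lookup_manufacturer("XXX"))
```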


The following screenshot shows you how it works:

Obtaining the surrogate key for a Type II SCD

In the case of a Type II SCD such as the Region dimension, you used the same step that was used to load the dimension table: a Dimension L/U step. The difference is that here you unchecked the Update the dimension? option. By doing that, the step behaves just as a database lookup: you provide the keys to look up, and the step returns the fields you put both in the Fields tab and in the Technical key field option. Unlike a plain database lookup, however, this step requires time information. By using that time information, PDI finds and returns, from the Type II SCD, the proper record in time:


Here you give PDI the names for the columns that store the date ranges: start_date and end_date. You also give it the name of the stream field to use in order to compare the dates, in this case inv_date, that is, the date of the sale.

Look at the following screenshot to understand how the lookup works:

The step has to get the surrogate key for the city with ID 261. There are two records for that city. The key is in finding the proper record, the record valid on 07/07/2009. So, PDI compares the date provided against the start_date and end_date fields and returns the surrogate key 582, for which the city is classified as belonging to the Nordic Countries region.

If no record is found for the given keys on the given date, the step retrieves the ID 0, which is used for the unknown data.
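The date-range lookup can be sketched in Python as follows. Only the surrogate key 582, the city ID 261, and the Nordic Countries region come from the example above; the older version of the city, its key, its region name, and all the range boundaries are hypothetical. The sketch also assumes that a version is valid from start_date up to, but not including, end_date.

```python
from datetime import date

# Hypothetical versions of city 261; only key 582 and its region come from the example above.
region_versions = [
    {"id": 123, "city_id": 261, "region": "Scandinavia",
     "start_date": date(1900, 1, 1), "end_date": date(2008, 1, 1)},
    {"id": 582, "city_id": 261, "region": "Nordic Countries",
     "start_date": date(2008, 1, 1), "end_date": date(2199, 12, 31)},
]

def lookup_region(city_id, inv_date):
    """Return the surrogate key of the version valid on inv_date, or 0 if none."""
    for row in region_versions:
        if row["city_id"] == city_id and row["start_date"] <= inv_date < row["end_date"]:
            return row["id"]
    return 0

print(lookup_region(261, date(2009, 7, 7)))
```

A sale dated before the hypothetical change of region would resolve to the older surrogate key, which is exactly what keeps historical facts attached to the classification that was valid at the time.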


Obtaining the surrogate key for the Junk dimension

The payment and buy methods are stored in a junk dimension. A junk dimension can be loaded by using a Combination L/U step. You learned how to use this step in the Time for action named Loading a region dimension with a Combination lookup/update step in Chapter 9. As all the fields in a junk dimension are part of the primary key, you don't need an extra Update step to load it.

In the tutorial, you loaded the dimension at the same time you loaded the fact. You know from Chapter 9 that when you use a Combination L/U step, the step returns the generated key. So, the use of the step here for loading and getting the key at the same time fits perfectly.

If the dimension had been loaded previously, instead of a Combination L/U step you could have used a Database lookup step by putting the key fields in the upper grid and the key in the lower grid of the Database lookup configuration window.
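The look-up-or-insert behavior described here can be sketched in Python. A dictionary stands in for the lk_junk_sales table, the attribute values are hypothetical, and the key generation imitates the Use table maximum + 1 option; this illustrates the idea rather than PDI's actual implementation.

```python
junk_sales = {}  # (pay_desc, buy_desc) -> surrogate key; stands in for lk_junk_sales

def lookup_or_insert(pay_desc, buy_desc):
    """Return the key for the combination, creating it as maximum + 1 if missing."""
    combination = (pay_desc, buy_desc)
    if combination not in junk_sales:
        junk_sales[combination] = max(junk_sales.values(), default=0) + 1
    return junk_sales[combination]

first = lookup_or_insert("cash", "in store")
second = lookup_or_insert("credit card", "internet")
repeat = lookup_or_insert("cash", "in store")
print(first, second, repeat)
```

A combination seen for the first time gets a new key, while a repeated combination returns the key generated earlier, so the dimension and the fact can be loaded in the same pass.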

Obtaining the surrogate key for the Time dimension

You already obtained the surrogate keys for Type I and Type II SCDs and for the Junk dimension. Finally, there is the Time dimension. As its key is the date in string format, the method for getting the surrogate key is simply changing the metadata from date to string by using the proper format. Once again, if you had used a regular surrogate key instead of the date, for getting the surrogate key you would have to use a Database lookup step.
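In code, this conversion is a single formatting call. The sketch below assumes the key uses the yyyyMMdd layout; the source only tells us that the dt column is CHAR(8), so treat the exact format as an assumption.

```python
from datetime import date

def time_key(inv_date):
    """Format a date as an 8-character string, for example 20090707."""
    return inv_date.strftime("%Y%m%d")

print(time_key(date(2009, 7, 7)))
```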

The following table summarizes the different possibilities:

Dimension type    Method for getting the surrogate key                             Sample dimension
Type I SCD        Database lookup step.                                            Manufacturer
Type II SCD       Dimension L/U step.                                              Regions
Junk and Mini     Combination L/U step if you load the dimension at the same       Sales Junk dimension
                  time as you load the fact (as in the tutorial).
                  Database lookup step if the dimension is already loaded.
Degenerate        As you don't have a table nor key to translate, you just         Product Type
                  store the data as a field in the fact. You don't have to
                  worry about getting surrogate keys.
Time              Change the metadata to the proper format if you use date as      Time
                  the key (as in the tutorial).
                  Dimension L/U step if you use a normal surrogate key.


Pop quiz – modifying a star model and loading the star with PDI

Suppose you want to make some modifications to your star model. What are the changes you'll have to make in each case?

1. Instead of using a region dimension that keeps history of the changes (Type II SCD), you want to use a classic region dimension (Type I).

   a. As table for the region dimension:
      i. You reuse the table lk_regions_2.
      ii. You use a different table.
      iii. Any of the above.

   b. As field with the foreign key in the fact table:
      i. You reuse the id_region field.
      ii. You create a new field.

   c. For getting the surrogate key:
      i. You keep using the Dimension lookup/update step.
      ii. You replace the Dimension lookup/update step by another step.
      iii. It depends on how your dimension table looks.

2. You want to change the grain for the Time dimension; you are interested in monthly information.

   a. As table for the time dimension:
      i. You reuse the table lk_time.
      ii. You use a different table.
      iii. Any of the above.

   b. As field with the foreign key in the fact table:
      i. You reuse the dt field.
      ii. You create a new field.

   c. For getting the surrogate key:
      i. You keep using the Select values step and changing the metadata.
      ii. You use another method.


3. You decided to create a new table for the product type dimension. The table will have the following columns: id, product_type_description, and product_type. As data you would have, for example: 1, puzzle, puzzle for the product type puzzle, or 2, glue, accessory for the product type glue.

   a. As field with the foreign key in the fact table:
      i. You reuse the product_type field.
      ii. You create a new field.

   b. For getting the surrogate key:
      i. You use a Combination lookup/update step
      ii. You use a Dimension lookup/update step
      iii. You use a Database lookup/update step

Have a go hero – loading a puzzles fact table

In the previous Hero exercise you were asked to load the dimensions for the puzzle star model. Now you will load the fact table.

To load the fact table you'll need to build a query taking data from the source. Try to figure out what the query looks like. Then you may try writing the query by yourself, or you may cheat; this query will serve you as a starting point:

SELECT i.inv_date
      ,d.man_code
      ,cu.city_id
      ,pr.pro_theme
      ,pr.pro_pieces
      ,pr.pro_packaging
      ,pr.pro_shape
      ,pr.pro_style
      ,SUM(d.cant_prod) quantity
FROM invoices i
    ,invoices_detail d
    ,customers cu
    ,products pr
WHERE i.invoice_number = d.invoice_number
  AND i.cus_id = cu.cus_id
  AND d.pro_code = pr.pro_code
  AND d.man_code = pr.man_code
  AND pr.pro_type like 'PUZZLE'
  AND i.inv_date BETWEEN cast('${DATE_FROM}' as date)
                     AND cast('${DATE_TO}' as date)
GROUP BY i.inv_date
        ,d.man_code
        ,cu.city_id
        ,pr.pro_theme
        ,pr.pro_pieces
        ,pr.pro_packaging
        ,pr.pro_shape
        ,pr.pro_style

After that, look for the surrogate keys for dimensions of Type I and II.

Here you have a mini-dimension. You may load it at the same time you load the fact, as you did in the tutorial with the Junk dimension. Also, make sure that you properly modify the metadata for the time field.

Insert the data into the fact, and check whether the data was loaded as expected.

Getting facts and dimensions together

Loading the star involves both loading the dimensions and loading the fact. You already loaded

the dimensions and the fact separately. In the following two tutorials, you will put it all together:

Time for action – loading the fact table using a range of dates obtained from the command line

Now you will get the range of dates from the command line and load the fact table using

that range:

1. Create a new transformation.

2. With a Get system info step, get the first two arguments from the command line and name them date_from and date_to.

3. By using a couple of steps, check that the arguments are not null, have the proper format (yyyy-mm-dd), and are valid dates.

4. If something is wrong with the arguments, abort.

5. If the arguments are valid, use a Set variables step to set two variables named DATE_FROM and DATE_TO.

6. Save the transformation in the same folder in which you saved the transformation that loads the fact table.

7. Test the transformation by providing valid and invalid arguments to see that it works as expected.


8. Create a job and save it in the same folder in which you saved the job that loads the dimensions.

9. Drag to the canvas a START and two transformation job entries, and link them one after the other.

10. Use the first transformation entry to execute the transformation you just created.

11. Use the second transformation entry to execute the transformation that loads the fact table.

12. This is how your job should look:

13. Save the job.

14. Press F9 to run the job.

15. Fill the job settings window as follows:

16. Click on Launch.


17. When the execution finishes, explore the database to check that the data for the given dates was loaded in the fact table. You will see this:

What just happened?

You built a main job that loads the sales fact table. First, it reads from the command line the range of dates to be used for loading the fact and validates it. If the dates are not valid, the process aborts. If they are valid, the fact table is loaded for the dates in that range.
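The three checks made on each argument (not null, proper format, valid date) can be sketched in Python as follows. The function names are hypothetical; in PDI you would build the equivalent logic with steps, as described in the tutorial.

```python
import re
from datetime import datetime

def parse_date_argument(value):
    """Return a date for a valid yyyy-mm-dd string; raise ValueError otherwise."""
    if value is None or not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        raise ValueError(f"expected yyyy-mm-dd, got {value!r}")
    return datetime.strptime(value, "%Y-%m-%d").date()  # rejects e.g. 2009-02-30

def validate_range(date_from, date_to):
    """Validate both arguments and return them as a (start, end) pair of dates."""
    return parse_date_argument(date_from), parse_date_argument(date_to)

print(validate_range("2009-09-01", "2009-09-30"))
```

Raising an exception here corresponds to aborting the transformation: nothing downstream runs unless both arguments pass all three checks.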

Time for action – loading the sales star

You already created a job for loading the dimensions and another job for loading the fact.

In this tutorial, you will put them together in a single main job:

1. Create a new job in the same folder in which you saved those jobs. Name this job load_dm_sales.kjb.

2. Drag to the canvas a START and two job entries, and link them one after the other.

3. Use the first job entry to execute the job that loads the dimensions.

4. Use the second job entry to execute the job you just created for loading the fact table.

5. Save the job. This is how it looks:


6. Press F9 to run the job.

7. As arguments, provide a new range of dates: 2009-09-01, 2009-09-30. Then press Launch.

8. The dimensions will be loaded first, followed by the loading of the fact table.

9. The Job metrics tab in the Execution results window shows you the whole process running:


10. Exploring the database, you'll see once again the data updated:

What just happened?

You built a main job that loads the sales datamart. First, it loads the dimensions. After that, it loads the fact table by filtering sales in a range of dates coming from the command line.

Have a go hero – enhancing the loading process of the sales fact table

Fact tables are rarely updated. Usually you just insert new data. However, after loading a fact table, you may detect that there were errors in the source. It could also happen that some data arrives late to the system. To take those situations into account, you should be able to reprocess data that was already processed. To avoid duplicates in the fact table, make the following modification to the loading process:

After getting the start and end date and before loading the fact table, delete the records that may have been inserted in a previous execution for the given range of dates.
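The delete-then-insert idea can be sketched outside PDI as follows. This is only an illustration under stated assumptions: the table name ft_sales, its columns, and the function reload_fact are hypothetical, and an in-memory SQLite database stands in for the real datamart. In PDI you would do the equivalent with steps and job entries rather than with Python.

```python
import sqlite3


def reload_fact(conn, start_date, end_date, new_rows):
    """Delete the rows in [start_date, end_date], then insert new_rows.

    Both statements run in one transaction, so a failure leaves the
    previously loaded data untouched.
    """
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "DELETE FROM ft_sales WHERE sale_date BETWEEN ? AND ?",
            (start_date, end_date),
        )
        conn.executemany(
            "INSERT INTO ft_sales (sale_date, quantity) VALUES (?, ?)",
            new_rows,
        )


def demo():
    """Load the same range twice and return the final row count."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical, simplified fact table; dates are ISO strings so that
    # BETWEEN compares them correctly.
    conn.execute("CREATE TABLE ft_sales (sale_date TEXT, quantity INTEGER)")
    rows = [("2009-09-01", 5), ("2009-09-02", 3)]
    reload_fact(conn, "2009-09-01", "2009-09-30", rows)
    reload_fact(conn, "2009-09-01", "2009-09-30", rows)  # reprocess the range
    return conn.execute("SELECT COUNT(*) FROM ft_sales").fetchone()[0]
```

Because the delete covers exactly the range being reloaded, running the load twice for the same dates leaves the same rows as running it once.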

Have a go hero – loading the puzzles sales star

Modify the main job so that it also loads the puzzle fact table.

Make sure that the job that loads the dimensions includes all the dimensions needed for both fact tables. Also, take care that you don't read and validate the arguments twice.

Have a go hero – loading the facts once a month

Modify the whole solution so that the fact tables are loaded once a month. Don't modify the model! You still want to have daily information in the fact tables; what you want to do is simply replace the daily updating process with a monthly process. Ask for a single parameter as yyyymm and validate it. Replace the old parameters START_DATE and END_DATE with this new one, wherever you use them.
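As a rough guide to what "validate it" could mean for a yyyymm parameter, here is a minimal sketch in Python. The function name month_range is hypothetical, and returning the first and last day of the month is an assumption about how you might derive the range to load. In PDI you would express the same checks with steps of your choice.

```python
import calendar
from datetime import date


def month_range(yyyymm):
    """Validate a yyyymm string and return (first_day, last_day).

    Both dates are returned as yyyy-MM-dd strings. A ValueError is
    raised when the parameter is not six digits or the month is invalid.
    """
    if len(yyyymm) != 6 or not (yyyymm.isascii() and yyyymm.isdigit()):
        raise ValueError(f"expected six digits (yyyymm), got {yyyymm!r}")
    year, month = int(yyyymm[:4]), int(yyyymm[4:])
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in {yyyymm!r}")
    # monthrange returns (weekday of first day, number of days in month)
    last_day = calendar.monthrange(year, month)[1]
    return (
        date(year, month, 1).isoformat(),
        date(year, month, last_day).isoformat(),
    )
```

For example, month_range("200909") yields the same range of dates you typed by hand in the previous tutorial.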


Getting rid of administrative tasks

The solution you built during the chapter loads both dimensions and fact in a star model for a given range of dates. Now suppose that you want to keep your datamart always updated. Would you sit every day in front of your computer, and run the same job over and over again? You probably would, but you know that it wouldn't be a good idea. There are better ways to do this. Let's see how you can get rid of that task.

Time for action – automating the loading of the sales datamart

Suppose that every day you want to update your sales datamart by adding the information about the sales for the day before. Let's make some modifications to the jobs and transformations you created so that the job can run automatically.

In order to test the changes, you'll have to change the date for your system. Set the current

date as 2009-10-02.

1. Create a new transformation.

2. Drag to the canvas a Get system data step and fill it in as shown here:

3. With a Select values step, change the metadata of both fields: as type put String and as format, yyyy-MM-dd.

4. Add a Set variables step and use the two fields to create two variables named START_DATE and END_DATE.

5. Save the transformation in the same folder where you saved the transformation that loads the fact table.


6. Modify the job that loads the fact table so that, instead of executing the transformation that takes the range of dates from the command line, it executes this one. The job looks like this:

7. Save it.
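What the new transformation computes can be sketched in a few lines of Python. This assumes that both fields hold yesterday's date, which is consistent with the goal of loading the sales for the day before. The function name yesterday_range is hypothetical.

```python
from datetime import date, timedelta


def yesterday_range(today):
    """Return (START_DATE, END_DATE) covering the day before `today`.

    Both values are formatted like the yyyy-MM-dd mask used in the
    Select values step.
    """
    yesterday = today - timedelta(days=1)
    text = yesterday.strftime("%Y-%m-%d")
    return text, text
```

With the system date set to 2009-10-02 as suggested above, yesterday_range(date(2009, 10, 2)) gives 2009-10-01 for both values.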

Now let's create the scripts for executing the job from the command line:

1. Create a folder named log in the folder of your choice.

2. Open a terminal window.

3. Create a new file with your favorite text editor.

4. If your system is not Windows, go to step 7.

5. Under Windows systems, type the following:

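rem Build the date and time stamps used in the log file name
for /f "tokens=1-3 delims=/- " %%a in ('date /t') do set XDate=%%c%%b%%a
for /f "tokens=1-2 delims=: " %%a in ('time /t') do set XTime=%%a.%%b
rem Folder of the main job and folder for the logs
set path_etl=C:\pdi_labs
set path_log=C:\logs
rem Move to the PDI installation folder
c:
cd \
cd pdi-ce
rem Run the job and append its output to a dated log
kitchen.bat /file:%path_etl%\load_dm_sales.kjb /level:Detailed >> %path_log%\sales_"%XDate% %XTime%".log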

6. Save the file as dm_sales.bat in a folder of your choice. Skip the following two steps.


7. Under Linux, Unix, and similar systems, type the following:

UNXETL=/pdi_labs
UNXLOG=/logs
cd /pdi-ce
kitchen.sh /file:$UNXETL/load_dm_sales.kjb /level:Detailed >> $UNXLOG/sales_`date +%y%m%d-%H%M`.log

8. Save the file as dm_sales.sh in a folder of your choice.

Irrespective of your system, please replace the names of the folders in the script with the names of your own folders, that is, path_etl or UNXETL (the folder where your main job is), path_log or UNXLOG (the folder you just created), and pdi-ce (the folder where PDI is installed).

Now let's test what you've done:

1. Execute the batch you created:

Under Windows, type: dm_sales.bat

Under Unix-like systems, type: sh dm_sales.sh

2. When the prompt in the command window is available, it means that the batch ended. Check the log folder. You'll find a new file with the extension log, named sales followed by the date and hour, for example: sales_0210Fri 06.46.log.

3. Edit the log. You'll see the full log for the execution of the job. Within the lines, you'll see these:

INFO 02-10 17:46:39,015 - Set Variables DATE_FROM and DATE_TO.0 - Set variable DATE_FROM to value [2009-10-01]
INFO 02-10 17:46:39,015 - Set Variables DATE_FROM and DATE_TO.0 - Set variable DATE_TO to value [2009-10-01]


4. Also check the fact table. The fact should have data for the sales made yesterday:

Don't forget to restore the date in your system!

What just happened?

You modified the job that loads the sales datamart so that it always loads the sales from the day before. You also created a script that embeds the execution of the Kitchen command and sends the result to a log. The name of the log is different for every day; this allows you to keep a history of logs.

To understand exactly the full Kitchen command line you put into the scripts, please refer to Appendix B, Pan and Kitchen: Launching Transformations and Jobs from the Command Line.

Doing all this, you don't have to worry about providing dates for the process, nor running Spoon, nor remembering the syntax of the Kitchen command. Not only that, if you use a system utility such as cron in Unix or the scheduler in Windows to schedule this script to run every day after midnight, you are done. You got rid of all the administrative tasks!
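If you would rather maintain one script for both platforms, the logic of the two scripts above could be sketched in Python roughly as follows. This is an assumption-laden illustration, not something the book's scripts require: the function names are hypothetical, the folders are parameters you must supply, and the /file: and /level: options are simply the ones used in the scripts above.

```python
import os
import subprocess
from datetime import datetime

JOB_FILE = "load_dm_sales.kjb"  # main job created in this chapter


def log_file_name(now):
    """Name the log after the run time, e.g. sales_091002-1746.log."""
    return "sales_" + now.strftime("%y%m%d-%H%M") + ".log"


def kitchen_command(pdi_dir, etl_dir, windows):
    """Build the Kitchen command line with the options used above."""
    launcher = "kitchen.bat" if windows else "kitchen.sh"
    return [
        os.path.join(pdi_dir, launcher),
        "/file:" + os.path.join(etl_dir, JOB_FILE),
        "/level:Detailed",
    ]


def run_job(pdi_dir, etl_dir, log_dir, now=None):
    """Run the job from the PDI folder and append its output to a dated log."""
    now = now or datetime.now()
    command = kitchen_command(pdi_dir, etl_dir, windows=(os.name == "nt"))
    log_path = os.path.join(log_dir, log_file_name(now))
    with open(log_path, "a") as log:
        result = subprocess.run(
            command, cwd=pdi_dir, stdout=log, stderr=subprocess.STDOUT
        )
    return result.returncode
```

A call such as run_job("/pdi-ce", "/pdi_labs", "/logs") would then play the role of dm_sales.sh, and the returned exit code lets a scheduler notice a failed run.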

Have a go hero – creating a backup of your work automatically

Choose a folder where you usually save your work (it could be, for example, the pdi_labs folder). Create a job that zips your work under the name backup_yyyymmdd.zip, where yyyymmdd represents the system date. Test the job.


Then create a .bat or .sh file that executes your job, sending the log to a file. Test the script. Finally, schedule the script to be executed weekly.

Have a go hero – enhancing the automated process by sending an e-mail if an error occurs

Modify the main job so that if something goes wrong, it sends you an e-mail reporting the problem. Doing so, you don't have to worry about checking the daily log to see if everything went fine. Unless there is a problem with the e-mail server, you'll be notified whenever some error occurs.

Summary

In this chapter you created a set of jobs and transformations that loads a sales datamart. Specifically, you learned how to load a fact table and to embed that process into a bigger one: the process that loads a full datamart.

You also learned to automate PDI processes, which is useful to get rid of tedious and repetitive manual tasks. In particular, you automated the loading of your sales datamart.

Beyond that, you must have found this chapter useful for reviewing all you learned since the first chapter. If you can't wait for more, read the next chapter. There you will find useful information for going further.

Taking it Further

The lessons learned in previous chapters gave you the basics of PDI. If you liked working with PDI and intend to use it in your own projects, there is much more, ranging from applying best practices to using PDI integrated with the Pentaho BI Suite.

This chapter points you in the right direction for taking it further. The chapter begins by giving you some advice to take into account in your daily work with PDI. After that, it introduces some advanced PDI concepts so that you know to what extent you can use the tool beyond the basics.

PDI best practices

If you intend to work seriously with PDI, knowing how to accomplish different tasks is not enough. Here are some guidelines that will help you go in the right direction.

Outline your ideas on paper before creating a transformation or a job:

Don't drop steps randomly on the canvas trying to get things working. You could end up with a transformation or job that is difficult to understand and even useless.

Document your work:

Write at least a simple description in the transformations and jobs settings windows. Replace the default names of steps and job entries with meaningful ones. Use notes to clarify the purpose of the transformations and jobs. Doing this, your work will be quite self-documented.




Make your jobs and transformations clear to understand:

Arrange the elements in the canvas so that it doesn't look like a puzzle to solve. Memorize the shortcuts for arrangement and alignment, and use them regularly. You'll find a full list in Appendix D, Spoon shortcuts.

Organize PDI elements in folders:

Don't save all the transformations and jobs in the same folder. Organize them according to their purpose.

Make your work flexible and reusable:

Make use of arguments, variables, and named parameters. If you identify tasks that are going to be used in several situations, create subtransformations.

Make your work portable (ready for deployment):

This involves making sure that even if you move your work to another machine or another folder, or the paths to source or destination files change, or the connection properties to the databases change, everything works either with minimal changes or without changes. To ensure that, don't use fixed names but variables. If you know the values for the variables beforehand, define the variables in the kettle.properties file. For the names of the transformations and jobs, use relative paths: use the ${Internal.Job.Filename.Directory} and ${Internal.Transformation.Filename.Directory} variables.
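To make the "variables instead of fixed names" advice concrete, here is a small Python sketch of the idea behind a properties file and ${...} placeholders. It is only an illustration and does not claim to reproduce how Kettle parses kettle.properties or resolves variables; the variable name LABS_DIR and both function names are hypothetical.

```python
import re


def parse_properties(text):
    """Read simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


_VARIABLE = re.compile(r"\$\{([^}]+)\}")


def substitute(template, variables):
    """Replace each ${NAME} with its value; unknown names are left as-is."""
    return _VARIABLE.sub(
        lambda match: variables.get(match.group(1), match.group(0)),
        template,
    )
```

With a properties text containing LABS_DIR=/pdi_labs, the path ${LABS_DIR}/load_dm_sales.kjb resolves to /pdi_labs/load_dm_sales.kjb. Moving the work to another machine then means changing one line in the properties file rather than every job and transformation.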

Avoid overloading your transformations:

A transformation should do a precise task. If it doesn't, think of splitting it in two or more, or create subtransformations. Doing this will make your transformation clearer and also reusable in the case of subtransformations.

Handle errors:

Try to figure out the kind of errors that may happen and trap them by validating and handling errors, and taking appropriate actions such as fixing data, taking alternative paths, sending friendly messages to the log files, and so on.

Do everything you can to optimize the PDI performance:

You can find a full checklist at http://wiki.pentaho.com/display/COM/PDI+Performance+tuning+check-list. As of version 3.1.0, PDI introduced a tool for tracking the performance of individual steps in a transformation. You can find more information at http://wiki.pentaho.com/display/EAI/Step+performance+monitoring.



Chapter 13


Keep track of jobs and transformations history:

You can use a versioning system such as Subversion. Doing so, you can recover older versions of your jobs and transformations or examine the history of how they changed. For more on Subversion, visit http://subversion.tigris.org/.

Bookmark the forum page and visit it frequently. The PDI forum is available at http://forums.pentaho.org/forumdisplay.php?f=135.

The following is the main PDI forum page:

If you get stuck with something, search for a solution in the forum. If you don't find what you're looking for, create a new thread, explain your doubts or scenario clearly, and you'll get a prompt answer, as the Pentaho community, and particularly the PDI one, is quite active.




Getting the most out of PDI

Throughout the book you learned, step by step, how to use PDI for accomplishing several kinds of tasks: reading from different kinds of sources, writing back to them, transforming data in several ways, loading data into databases, and even loading a full data mart. You already have the knowledge and the experience to do anything you want or need with PDI from now on. However, PDI offers you some more features that may be useful for you as well. The following sections introduce them and guide you so that you know where to look in case you want to put them into practice.

Extending Kettle with plugins

As you could see while learning Kettle, there is a large set of steps and job entries to choose from when designing jobs and transformations. The number rises above 200 between steps and entries! If you still feel like you need more, there are more options: plugins.

Kettle plugins are basically steps or job entries that you install separately. The available plugins are listed at http://wiki.pentaho.org/display/EAI/List+of+Available+Pentaho+Data+Integration+Plugins.

Most of the listed plugins can be downloaded and used for free. Some are so popular or useful that they end up becoming standard steps of PDI, for example, the Formula step that you used several times throughout the book.

There are other plugins that come as a trial version and you have to pay to use them.

It's also possible for you to develop your own plugins. The only prerequisite is knowing how to develop code in Java. If you are interested in the subject, you can get more information at http://wiki.pentaho.com/display/EAI/Writing+your+own+Pentaho+Data+Integration+Plug-In.

It's no coincidence that the author of those pages is Jens Bleuel. Jens used the plugin architecture back in 2004, in order to connect Kettle with SAP, when he was working at Proratio. The plugin support was incorporated in Kettle 2.0 and the PRORATIO - SAP Connector, today available as a commercial plugin, was one of the first developed Kettle plugins.

You should know that 3.x plugins no longer work on Kettle 4.0.


Have a go hero – listing the top 10 students by using the Head plugin step

Browse the plugin page and look for a plugin named Head. As described in the page, this plugin is a step that keeps the first x rows of the stream. Download the plugin and install it. The installation process is really straightforward. You have to copy a couple of *.jar files to the libext directory inside the PDI installation folder, add the environment variable for PDI to find the libraries, and restart Spoon. The downloaded file includes documentation with full instructions. Once installed, Head will appear as a new step within the Transformation category of steps as shown here:

Create a transformation that reads the examination file that was used in the Time for action – reviewing examination by using the Calculator step section in Chapter 3 and some other chapters as well. Generate an output file with the top 10 students by average score in descending order. In order to keep the top 10, use the Head plugin.

Before knowing of the existence of this plugin, you used to do this kind of filtering by using the JavaScript step. Another way to do it is by using an Add sequence step followed by a Filter rows step. Note that none of these methods use an ad hoc step.
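The Add sequence plus Filter rows alternative mentioned above follows a simple pattern: sort, number the rows, and keep those whose number does not exceed the limit. A minimal Python sketch of that pattern, with a hypothetical function name and made-up sample rows, might look like this:

```python
def top_n(rows, key, n):
    """Keep the first n rows after sorting by `key` in descending order."""
    ordered = sorted(rows, key=key, reverse=True)
    # Number the sorted rows starting at 1 (the role of Add sequence)...
    numbered = enumerate(ordered, start=1)
    # ...and keep only those numbered up to n (the role of Filter rows).
    return [row for sequence, row in numbered if sequence <= n]
```

For instance, with rows of the form (student, average), top_n(rows, key=lambda row: row[1], n=10) returns the ten highest averages in descending order, which is what the Head step gives you after a sort.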


Overcoming real world risks with some remote execution

In order to learn to use Kettle, you used very simple and small sets of data. It's worth saying that all you learned can also be applied to processing huge files and databases with millions of records. However, that doesn't come for free! When you deal with such datasets, there are many risks: your transformations slow down, you may run out of memory, and so on.

The first step in trying to overcome those problems is to do some remote execution. Suppose you have to process a huge file located on a remote machine and that the only thing you have to do with that file is to get some statistics such as the maximum, minimum, and average value for a particular piece of data in the file. If you do it in the classic way, the data in the file would travel across the network to be processed by Kettle on your machine, loading the network unnecessarily.

PDI offers you the possibility to execute the tasks remotely. The remote execution capability allows you to run the transformation on the machine where the data resides. Doing so, the only data that would travel through the network will be the calculated data.
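The benefit described above comes from how little needs to leave the remote machine. The following Python sketch, with a hypothetical function name, shows the kind of computation involved: the values are read one at a time where the file lives, and only three numbers are produced at the end. It illustrates the idea and is not how Carte or Kettle implement it.

```python
def summarize(values):
    """Compute min, max, and average in one pass without storing the values."""
    count = 0
    total = 0.0
    lowest = None
    highest = None
    for value in values:
        count += 1
        total += value
        if lowest is None or value < lowest:
            lowest = value
        if highest is None or value > highest:
            highest = value
    if count == 0:
        raise ValueError("no values to summarize")
    return {"min": lowest, "max": highest, "avg": total / count}
```

However many rows the file holds, the result that has to cross the network is a single small record.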

This kind of remote execution is done by Carte, a simple server that you can install on a remote machine and that does nothing but run jobs and transformations on demand. Therefore, it is called a slave server. You can start, monitor, and stop the execution of jobs and transformations remotely as depicted here:

[Figure: SPOON starts, monitors, and stops transformations and jobs across the network; CARTE executes them. Each side runs its own Kettle engine.]

You don't need to download additional software because Carte is distributed as part of the Kettle software. For documentation on Carte, follow this link: http://wiki.pentaho.com/display/EAI/Carte+User+Documentation.


Scaling out to overcome bigger risks

As mentioned above, PDI can handle huge volumes of data. However, the bigger the volume or complexity of your tasks, the bigger the risks. The solution lies not only in executing remotely; in order to enhance your performance and avoid undesirable situations, you'd better increase your power. You basically have two options: you can either scale up or scale out. Scaling up involves buying a better processor, more memory, or disks with more capacity. Scaling out means providing more processing power by distributing the work over multiple machines.

With PDI you can scale out by executing jobs and transformations in a cluster. A cluster is a group of Carte instances or slave servers that collectively execute a job or a transformation. One of those servers is designated as the master and takes care of controlling the execution across the cluster. Each server executes only a portion of the whole task.

[Figure: transformations and jobs sent over the network to a cluster of CARTE instances (MASTER, Slave server1, Slave server2, ... Slave serverN), each with its own Kettle engine executing part of the work.]
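The phrase "each server executes only a portion of the whole task" can be illustrated with a toy Python sketch. It splits a list of values into portions, computes a partial result per portion, and merges the partial results as a master might. All names are hypothetical and everything runs in a single process; how a real Kettle cluster distributes rows is not shown here.

```python
def partition(rows, workers):
    """Deal the rows out round-robin into `workers` portions."""
    return [rows[index::workers] for index in range(workers)]


def partial_result(rows):
    """What one worker reports back for its non-empty portion."""
    return {"count": len(rows), "sum": sum(rows), "min": min(rows), "max": max(rows)}


def merge(partials):
    """Combine the workers' partial results into the final statistics."""
    partials = list(partials)
    count = sum(part["count"] for part in partials)
    total = sum(part["sum"] for part in partials)
    return {
        "count": count,
        "min": min(part["min"] for part in partials),
        "max": max(part["max"] for part in partials),
        "avg": total / count,
    }
```

Note that the average is merged from the sums and counts rather than from per-portion averages, which would give a wrong answer when the portions differ in size.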

The list of servers that make up a cluster may be known in advance, or you can have dynamic clusters, that is, clusters where the slave servers are known only at run time. This feature allows you to hire resources, for example, server machines provided as a service over the Internet, and run your jobs and transformations over those servers in a dynamic cluster. This kind of Internet service is quite new and is known as cloud computing, Amazon EC2 being one of the most popular.

If you are interested in the subject, there is an interesting paper named Pentaho Data Integration: Scaling Out Large Data Volume Processing in the Cloud or on Premise, presented by the Pentaho partner Bayon Technologies. You can download it from http://www.bayontechnologies.com.


Pop quiz – remote execution and clustering

For each of the following, decide if the sentence is true or false:

a. Carte is a graphical tool for designing jobs and transformations that are going to be run remotely.

b. In order to run a transformation remotely you have to define a cluster.

c. When you have very complex transformations or huge datasets you have to execute in a cluster because PDI doesn't support that load in a single machine.

d. To run a transformation in a cluster you have to know the list of servers in advance.

e. If you want to run jobs or transformations remotely or in a cluster you need the PDI Enterprise Edition.

Integrating PDI and the Pentaho BI suite

In this book you learned to use PDI standalone, but as mentioned in the first chapter, it is possible to use it integrated with the rest of the suite. There are a couple of options for doing so.

PDI as a process action

In Chapter 1 you were introduced to the Pentaho platform. Everything in the Pentaho platform is made by action sequences. An action sequence is, as its name suggests, a sequence of atomic actions that together accomplish small business processes.

Look at the following sample with regard to the Puzzle business:

Consider that you regularly receive updated price lists (one for each manufacturer) and you drop the files in a given folder. When you decide to hike the prices, you process one of those files and get a web-based report with the updated prices. You can implement that process with an action sequence.

[Figure: Get list of files to process -> Prompt for the file -> Update prices -> Run web-based report]

There are four atomic actions in this sequence. You already know how to do the first and third actions (building the list of available price lists and updating the prices) with PDI. You can create transformations or jobs that perform these tasks and then use them as actions in an action sequence. The following is a sample screenshot of Design Studio, the action sequence editor:


The screenshot shows what the action sequence editor looks like while editing the explained action sequence. In the tree on the left side, you can see the list of actions, while the right section allows you to configure each action. The action being edited in the screenshot is the PDI transformation that updates the prices.

PDI as a datasource

You already created several transformations that, after doing some data manipulation, generated plain files or Excel sheets. What if, instead of these types of output files, you wanted the same data displayed as a more attractive, colorful, and interactive web-based report? You can't do it with PDI alone. With the newest Pentaho report engine you can take the data that came out of a transformation and use it as the data source for your report. Having the data, the reporting tool allows you to generate any kind of output.


If you want to learn more about Pentaho reporting, you can start by visiting the wiki at http://wiki.pentaho.com/display/Reporting/Pentaho+Reporting+Community+Documentation. Or you can buy the book Pentaho Reporting 3.5 for Java Developers (ISBN: 3193), authored by Will Gorman, published by Packt Publishing. Despite its name, it is not just a book for developers; it's a great book for those who are unfamiliar with the tool and who want to learn how to create reports with it.

Data coming out of a transformation can also be used as source data for a CDF dashboard. A dashboard is an application that shows you visual indicators such as charts, traffic lights, or dials. A CDF dashboard is a dashboard created with a toolkit, known as Community Dashboard Framework, which is developed by members of the Pentaho community. The CDF dashboards, recently incorporated as part of the Pentaho suite, accept many types of data sources, PDI transformations being one of them. The only restriction (at least for now) is that they only accept transformations stored in a repository (see Chapter 1 and Appendix A for details). For more about CDF here is a link to the wiki page: http://wiki.pentaho.com/display/COM/Community+Dashboard+Framework.

More about the Pentaho suite

The options mentioned earlier for using PDI integrated with other components of the suite are a good starting point to begin working with the Pentaho BI suite. By putting those examples into practice, you can gradually become familiar with the suite.

There is much more to learn once you get started. Look at the following sample screen:


This represents a multidimensional view of your sales data mart. Here you can see cross-tab information for puzzle sales in August and September for three specific manufacturers across different regions, countries, and cities. It looks really useful for exploring your sales numbers, doesn't it? Well, this is just an example of what you can do with Pentaho beyond using PDI and the reporting and dashboard tools mentioned earlier.

For more about the suite, you can visit the wiki page http://wiki.pentaho.com/ or the Pentaho site (www.pentaho.com). If, instead of browsing here and there, you prefer to read it all in a single place, there is also a new book that brings you a good introduction to the whole suite. The book is titled Pentaho Solutions (Wiley publishing), authored by Roland Bouman and Jos van Dongen, two seasoned Pentaho community members.

PDI Enterprise Edition and Kettle Developer Support

Pentaho offers an Enterprise Edition of the Pentaho BI Suite and also of PDI. The PDI Enterprise Edition adds an Enterprise Console for performance monitoring, remote administration, and alerting. There are also a growing number of extra plugins for enterprise customers. In addition to the PDI extensions, customers get services and support, indemnification, software maintenance (fix versions, e.g. 3.2.2), and a knowledge base with additional technical resources.

Since the end of 2009, Pentaho also offers Kettle Developer Support for the Community Edition. With this, you can get direct assistance from the product experts for the design, development, and testing phases of the ETL lifecycle. This option is perfect for getting started, removing roadblocks, and troubleshooting ETL processes.

For further information, check the Pentaho site (www.pentaho.com).


Summary

This chapter provided you with a list of best practices to apply while working with PDI. If you follow the given advice, your work will not only be useful, but also flexible, reusable, documented, and neatly presented.

You were introduced to PDI plugins, a mechanism that allows you to customize the tool. A quick review of remote execution and clustering was given for those interested in developing PDI in large environments.

Finally, an introduction was given showing you how PDI can be used not only as a standalone tool but can also be integrated with the Pentaho BI suite.

Some links and references were provided for those of you who, after reading the book and particularly this chapter, are eager to learn more.

I hope you enjoyed reading the book and learning PDI, and will start using PDI to solve all your data requirements.

Working with Repositories

Spoon allows you to store your transformations and jobs under two different configurations: file based and repository based. In contrast to the file-based configuration, which keeps the transformations and jobs in XML format as *.ktr and *.kjb files in the local file system, the repository-based configuration keeps the same information in tables in a relational database.

While working with the file-based system is simple and practical, the repository-based system can be convenient in some situations. The following is a list of some of the distinctive repository features:

Repositories implement security. In order to work with a repository, you need credentials. You can create users and profiles with different permissions on the repository; however, keep in mind that the kind of permissions you may apply is limited.

Repositories are prepared for basic team development. The elements you create (transformations, jobs, database connections, and so on) are shared by all repository users as soon as you create them.

If you want to use PDI as the input source in dashboards made with the CDF (refer to Chapter 13 for details), the only way you have is by working with repositories.

PDI 4, in its Enterprise version, will include a lot of new repository features such as version control.




Before you decide on working with a repository, you have to be aware of the file-based system benefits that you may lose out on. Here are some examples:

When working with the repository-based system, you need access to the repository database. If, for some reason, you cannot access the database (due to a network problem or any other issue), you will not be able to work. You don't have this restriction when working with files: you need only the software and the .ktr/.kjb files.

When working with repositories, it is difficult to keep track of the changes. On the other hand, when you work with the file system, it's easier to know which jobs or transformations are modified. If you use Subversion, you even have version control.

Suppose you want to search and replace some text in all jobs and transformations. If you are working with repositories, you would have to do it for each table in the repository database. When working with the file-based system, this task is quite simple: you could create an Eclipse project, load the root directory of your jobs and transformations, and do the task by using the Eclipse utilities.
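Eclipse is only one way to do a search and replace over file-based jobs and transformations. Since .ktr and .kjb files are XML text, a short script can do the same. The Python sketch below is one possible approach under stated assumptions: the function name is hypothetical, the files are assumed to be UTF-8 encoded, and it performs a plain text replacement without understanding the XML. Try it on a copy of your work first.

```python
from pathlib import Path

KETTLE_SUFFIXES = {".ktr", ".kjb"}


def replace_in_kettle_files(root, old, new, encoding="utf-8"):
    """Replace `old` with `new` in every .ktr/.kjb file under `root`.

    Returns the list of files that were changed.
    """
    changed = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in KETTLE_SUFFIXES:
            continue
        # newline="" keeps the original line endings untouched
        with open(path, encoding=encoding, newline="") as handle:
            text = handle.read()
        if old in text:
            with open(path, "w", encoding=encoding, newline="") as handle:
                handle.write(text.replace(old, new))
            changed.append(path)
    return changed
```

The returned list tells you which files to review and commit if you keep your work under version control.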

This appendix explains how to create a repository and how to work with it. You can give repositories a try and decide for yourself which method, repository-based or file-based, suits you best.

Creating a repository

If you want to work with the repository-based configuration, you have to create a repository in advance.

Time for action – creating a PDI repository

To create a repository, follow these steps:

1. Open MySQL Command Line Client.

2. In the command window, type the following:

CREATE DATABASE PDI_REPO;

3. Open Spoon.

4. If the repository dialog appears, skip to step 6.

5. Open the repository dialog from the Repository | Connect to repository menu.

6. Click on New to create a new repository. The repository information dialog shows up. Click on New to create a new database connection.



Appendix A


7. The database connection window appears. Define a connection to the database you have just created and give a name to the connection, PDI_REPO_CONN in this case.

If you want to refer to the steps on creating the database connection, check out the Time for action – creating a connection to the Steel Wheels database section in Chapter 8.

8. Test the connection to see that it is properly configured.

9. Click OK to close the database connection window. The Select database connection box will show the created connection.

10. Give the name MY_REPO to the repository. As description, type My first repository.

11. Click on Create or Upgrade.

12. PDI will ask you if you are sure you want to create the repository on the specified database connection. Answer Yes if you are sure of the settings you entered.

13. A dialog appears asking if you want to do a dry run to evaluate the generated SQL before execution.

14. Answer No unless you want to preview the SQL that will create the repository. A progress window appears showing you the progress while the repository is being created.

15. Finally, you see a window with the message Kettle created the repository on the specified connection. Close the dialog window.


16. Click on OK to close the repository information window. You will be back in the repository dialog, this time with a new repository available in the repository drop-down list.

17. If you want to start working with the created repository, please refer to the Working with the repository storage system section. If not, click on No Repository. This will close the window.

What just happened?

In MySQL you created a new database named PDI_REPO. Then you used that database to

create a PDI repository.

Creating repositories to store your transformations and jobs

A Kettle repository is a database that provides you with a storage system for your transformations and jobs. The repository is the alternative to the *.ktr and *.kjb file-based system.

In order to create a new repository, a database must have been created previously. In the tutorial, the repository was created in a MySQL RDBMS. However, you can create your repositories in any relational database.

The PDI repository database should be used exclusively for its purpose!

Note that if the repository has already been created from another machine or by another user, that is, another profile in the operating system, you don't have to create the repository again. In that case, just define the connection to the repository but don't create it again. In other words, follow all the instructions but don't click the Create or Upgrade button.

Once you have created a repository, its name, description, and connection information are stored in a file named repositories.xml, which is located in the PDI home directory. The repository database is populated with a bunch of tables with familiar names such as transformation, job, steps, and steps_type.


Note that you may have more than one repository: different repositories for different projects, different repositories for different versions of a project, a repository just for testing new PDI features, another for serious development, and so on. Therefore, it is important that you give the repositories meaningful names and descriptions so that you don't get confused if you have more than one.

Working with the repository storage system

In order to work with a repository, you must have created at least one. If you haven't, please refer to the section Creating a repository.

If you already have a repository and you want to work with it, the first thing you have to do is to log into it. The next tutorial helps you do this.

Time for action – logging into a repository

To log into an existing repository, follow these instructions:

1. Launch Spoon.

2. If the repository dialog window doesn't show up, select Repository | Connect to

repository from the main menu. The repository dialog window appears.

3. In the drop-down list, select the repository you want to log into.

4. Type your username and password. If you have never created any users, use the

default username and password—admin and admin. Click on OK.

5. You will now be logged into the repository. You will see the name of the repository in the upper-left corner of Spoon.

What just happened?

You opened Spoon and logged into a repository. In order to do that, you provided the name of the repository and proper credentials. Once you did it, you were ready to start working with the repository.


Logging into a repository by using credentials

If you want to work with the repository storage system, you have to log into the repository

before you begin your work. In order to do that, you have to choose the repository and

provide a repository username and password.

The repository dialog that allows you to log into the repository can be opened from the main Spoon menu. If you intend to log into the repository often, you'd better select Edit | Options... and check the general option Show repository dialog at startup?. This will cause the repository dialog to always show up when you launch Spoon.

It is possible to log into the repository automatically. Let's assume you have a repository named MY_REPO and you use the default user. Add the following lines to the kettle.properties file:

KETTLE_REPOSITORY=MY_REPO

KETTLE_USER=admin

KETTLE_PASSWORD=admin

The next time you launch Spoon, you will be logged into the repository automatically.

For details about the kettle.properties file, refer to the section on Kettle variables in Chapter 2.

Because the login information is exposed, auto login is not recommended.
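Since auto login keeps the password in plain text, it may help to check whether a given kettle.properties file enables it. The following Python sketch is only an illustration: it assumes the file consists of simple key=value lines, and the function names and the sample content are hypothetical, not part of PDI.

```python
# The three keys named in this appendix for automatic login
AUTO_LOGIN_KEYS = ("KETTLE_REPOSITORY", "KETTLE_USER", "KETTLE_PASSWORD")


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def auto_login_enabled(props):
    """Return True when all three auto-login keys have a value."""
    return all(props.get(key) for key in AUTO_LOGIN_KEYS)


# Hypothetical file content, matching the example lines in the text
sample = """\
# sample kettle.properties content
KETTLE_REPOSITORY=MY_REPO
KETTLE_USER=admin
KETTLE_PASSWORD=admin
"""

props = parse_properties(sample)
if auto_login_enabled(props):
    print("Auto login is configured; the password is stored in plain text.")
```

In practice the text would be read from the real kettle.properties file instead of the sample string.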

Defining repository user accounts

To log into a repository, you need a user account. Every repository user has a profile that dictates the permissions that the user has on the repository. There are three predefined profiles:

Profile | Permissions
Read-only | Cannot create or modify any element in the repository
User | Can create, modify, and delete any object in the repository except users and profiles
Administrator | Has full permissions, including creating new users and profiles


There are also two predefined users:

• admin: A user with Administrator profile. This is the user you used to log into the repository for the first time. It has full permissions on the repository.
• guest: A user with Read-only profile.

If you have Administrator profile, you can create, modify, rename, or delete users and profiles from the Repository explorer. For details, please refer to the section Examining and modifying the contents of a repository with the Repository explorer, later in this appendix. Any user may change his/her own user information both from the Repository explorer and from the Repository | Edit current user menu option.

Creating transformations and jobs in repository folders

In a repository, the jobs and transformations are organized in folders. A folder in a repository fulfills the same purpose as a folder in your drive: it allows you to keep your work organized. Once you create a folder, you can save both transformations and jobs in it.

While connected to a repository you design, preview, and run jobs and transformations just as you do with files. However, there are some differences when it comes to opening, creating, or saving your work. So, let's summarize how you do those tasks when logged into a repository:

Task: Open a transformation / job
Procedure: Select File | Open. The Repository explorer shows up. Navigate the repository until you find the transformation or job you want to open. Double-click it.

Task: Create a folder
Procedure: Select Repository | Explore repository, expand the transformation or job tree, locate the parent folder, right-click and create the folder. Alternatively, double-click the parent folder.

Task: Create a transformation
Procedure: Select File | New | Transformation or press Ctrl+N.

Task: Create a job
Procedure: Select File | New | Job or press Ctrl+Alt+N.

Task: Save a transformation
Procedure: Press Ctrl+T. Give a name to the transformation. In the Directory textbox, select the folder where the transformation is going to be saved. Press Ctrl+S. The transformation will now be saved in the selected directory under the given name.

Task: Save a job
Procedure: Press Ctrl+J. Give a name to the job. In the Directory textbox, select the folder where the job is going to be saved. Press Ctrl+S. The job will be saved in the selected directory under the given name.




Creating database connections, partitions, servers, and clusters

Besides users, profiles, jobs, and transformations, there are some additional PDI elements that you can define:

Element | Description
Database connections | Connection definitions to relational databases. These are covered in Chapter 8.
Partition schemas | Partitioning is a mechanism by which you send individual rows to different copies of the same step, for example, based on a field value. This is an advanced topic not covered in this book.
Slave servers | Slave servers are installed in remote machines to execute jobs and transformations remotely. They are introduced in Chapter 13.
Clusters | Clusters are groups of slave servers that collectively execute a job or a transformation. They are also introduced in Chapter 13.

All these elements can also be created, modified, and deleted from the Repository explorer.

Once you create any of these elements, it is automatically shared by all repository users.

Backing up and restoring a repository

A PDI repository is a database. As such, you may back it up regularly with the utilities provided by the RDBMS. However, PDI offers you a method for creating a backup in an XML file.

You create a backup from the Repository explorer. Right-click the name of the repository and select Export all objects to an XML file. You will be asked for the name and location of the XML file that will contain the backup data. In order to back up a single folder, instead of right-clicking the repository name, right-click the name of the folder.

You can also restore a backup made in an XML file from the Repository explorer. Right-click the name of the repository and select Import all objects from an XML file. You will be asked for the name and location of the XML file that contains the backup.

Examining and modifying the contents of a repository

with the Repository explorer

The Repository explorer shows you a tree view of the repository to which you are

connected. From the main Spoon menu, select Repository | Explore Repository and you

get to the explorer window. The following screenshot shows you a sample Repository

explorer screen:


In the tree you can see: Database connections, Partition schemas, Slave servers (slaves in the tree), Clusters, Transformations, Jobs, Users, and Profiles.

You can sort the different elements by name, user, changed date, or description by just clicking on the appropriate column header: Name, User, Changed date, or Description. The sort is made within each folder.

The Repository explorer not only shows you these elements, but also allows you to create, modify, rename, and delete them. The following table summarizes the available actions:

Action: Create a new element (any but transformations and jobs)
Procedure: Double-click the name of the element at the top of the list. Alternatively, right-click any element in its category and select the New option.
Example: In order to create a new user, double-click the word Users at the top of the users list, or right-click any user and select New User.

Action: Open an element for editing
Procedure: Right-click it and select the Open option. Alternatively, double-click it.
Example: In order to edit a job, double-click it, or right-click and select Open job.

Action: Delete an element
Procedure: Right-click it and select the Delete option.
Example: In order to delete a user, right-click it and select Delete user.


When you explore the repository, you don't see jobs and transformations mixed. Consequently, the whole folder tree appears twice: once under Transformations and then under Jobs.

In order to confirm your work, click on Commit changes. If you make a mistake, click on Rollback changes.

Migrating from a file-based system to a repository-based system and vice-versa

No matter which storage system you are using, file based or repository based, you may want to move your work to the other system. The following tables summarize the procedure for doing that:

Migrating from file-based configuration to repository-based configuration:

PDI element: Transformations or jobs
Procedure for migrating from file to repository: From File | Import from an XML file, browse to locate the .ktr/.kjb file to import and open it. Once the file has been imported, you can save it into the repository as usual.

PDI element: Database connections, partition schemas, slaves, and clusters
Procedure for migrating from file to repository: When importing from XML a job or transformation that uses a database connection, the connection is imported as well. The same applies to partitions, slave servers, and clusters.

Migrating from repository-based configuration to file-based configuration:

PDI element: Single transformation or job
Procedure for migrating from repository to file: Open the job or transformation, select File | Export to an XML file, browse to the folder where you want to save the job or transformation, and save it. Once it has been exported, it will be available to work with under the file storage method or to import from another repository.

PDI element: All transformations saved in a folder
Procedure for migrating from repository to file: In the Repository explorer, right-click the name of the folder and select Export transformations. You will be asked to select the directory where the folder along with all its subfolders and transformations will be exported to. If you right-click the name of the repository or the root folder in the transformation tree, you may export all the transformations.

PDI element: All jobs saved in a folder
Procedure for migrating from repository to file: In the Repository explorer, right-click the name of the folder and select Export Jobs. You will be asked to select the directory where the folder along with all its subfolders and jobs will be exported to. If you right-click the name of the repository or the root folder in the job tree, you may export all the jobs.

PDI element: Database connections, partition schemas, slaves, and clusters
Procedure for migrating from repository to file: When exporting to XML a job or transformation that uses a database connection, the connection is exported as well (it's saved as part of the KTR/KJB file). The same applies to partitions, slave servers, and clusters.

You have to be logged into the repository in order to perform any of the explained operations.

If you share a database connection, a partition schema, a slave server, or a cluster, it will be available for using both from a file and from any repository, as the shared elements are always saved in the shared.xml file in the Kettle home directory.

Summary

This appendix covered the basic concepts for working with repositories. Besides the topics covered here, working with repositories is pretty much the same as working with files.

Although the tutorials in this book were explained assuming that you work with files, all of them can be implemented under a repository-based configuration with minimal changes. For example, instead of saving a transformation in c:\pdi_labs\hello.ktr, you could save it in a folder named pdi_labs with the name hello. Besides these tiny details, you shouldn't have any trouble in developing and testing the exercises.

Pan and Kitchen: Launching

Transformations and Jobs from the

Command Line

All the transformations and jobs you design in Spoon end up being used as part of batch processes, for example, processes that run every night in a scheduled fashion. When it comes to running them in that way, you need Pan and Kitchen.

Pan is a command line program that lets you launch the transformations designed in Spoon, both from .ktr files and from a repository.

The counterpart to Pan is Kitchen, which allows you to run jobs both from .kjb files and from a repository.

This appendix shows you the different options you have to run these commands.

Running transformations and jobs stored in files

In order to run a transformation or job stored as a .ktr / .kjb file, follow these steps:

1. Open a terminal window.

2. Go to the Kettle installation directory.

3. Run the proper command according to the following table:

Running a ... | Windows | Unix-like system
transformation | pan.bat /file:<ktr file name> | pan.sh /file:<ktr file name>
job | kitchen.bat /file:<kjb file name> | kitchen.sh /file:<kjb file name>




When specifying the .ktr/.kjb filename, you must include the full path. If the name contains spaces, surround it with double quotes.

Here are some examples:

Suppose that you work with Windows and that your Kettle installation directory is c:\pdi-ce. In order to execute a transformation stored in the file c:\pdi_labs\hello.ktr, you have to type the following commands:

cd \pdi-ce

pan.bat /file:"c:\pdi_labs\hello.ktr"

Suppose that you work with a Unix-like system and that your Kettle installation directory is /home/yourself/pdi-ce. In order to execute a job stored in the file /home/pdi_labs/hellojob.kjb, you have to type the following commands:

cd /home/yourself/pdi-ce

kitchen.sh /file:"/home/pdi_labs/hellojob.kjb"

If you have a repository with auto login (refer to Appendix A), add /norep as part of the command. This prevents PDI from logging into the repository.

Running transformations and jobs from a repository

In order to run a transformation or job stored in a repository, follow these steps:

1. Open a terminal window.

2. Go to the Kettle installation directory.

3. Run the proper command according to the following table:

Running a transformation on Windows:
pan.bat /rep:<value>
    /user:<user>
    /pass:<value>
    /trans:<value>
    /dir:<value>

Running a transformation on a Unix-like system:
pan.sh /rep:<value>
    /user:<user>
    /pass:<value>
    /trans:<value>
    /dir:<value>

Running a job on Windows:
kitchen.bat /rep:<value>
    /user:<user>
    /pass:<value>
    /job:<value>
    /dir:<value>

Running a job on a Unix-like system:
kitchen.sh /rep:<value>
    /user:<user>
    /pass:<value>
    /job:<value>
    /dir:<value>




In the preceding table:

• rep is the name of the repository to log into

• user and pass are the credentials to log into the repository

• trans and job are the names of the transformation or job to run

• dir is the name of the directory where the transformation or job is located

The parameters are shown on different lines for you to clearly identify all the options. When you type the command, you have to write all the parameters on the same line.

Suppose that you work on Windows, you have a repository named MY_REPO, and you log into the repository with user PDI_USER and password 1234. To run a transformation named Hello located in a directory named MY_WORK in that repository, type the following:

pan.bat /rep:"MY_REPO" /user:"PDI_USER" /pass:"1234" /trans:"Hello" /dir:"/MY_WORK/"

If you defined auto-login, you don't need to provide the repository information (the rep, user, and pass command line parameters) as part of the command.
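When such commands are generated by a script, the options can be assembled programmatically. The following Python sketch builds the argument list from the options described in this section; the function name and its parameters are hypothetical helpers, not part of PDI, and only the option names (rep, user, pass, trans, job, dir) come from the text.

```python
def repository_command(kind, name, directory, rep=None, user=None,
                       password=None, windows=False):
    """Build the Pan or Kitchen argument list for an object in a repository.

    kind is "transformation" (Pan, /trans) or "job" (Kitchen, /job).
    Leave rep, user, and password as None when auto login is defined.
    """
    script, option = {
        "transformation": ("pan", "trans"),
        "job": ("kitchen", "job"),
    }[kind]
    args = [script + (".bat" if windows else ".sh")]
    # Repository credentials are only added when they are given
    for opt, value in (("rep", rep), ("user", user), ("pass", password)):
        if value is not None:
            args.append("/%s:%s" % (opt, value))
    args.append("/%s:%s" % (option, name))
    args.append("/dir:%s" % directory)
    return args


cmd = repository_command("transformation", "Hello", "/MY_WORK/",
                         rep="MY_REPO", user="PDI_USER", password="1234",
                         windows=True)
# All parameters end up on one line, as the command requires
print(" ".join(cmd))
# prints: pan.bat /rep:MY_REPO /user:PDI_USER /pass:1234 /trans:Hello /dir:/MY_WORK/
```

Keeping the arguments as a list avoids manual quoting when the list is later handed to a process launcher.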

Specifying command line options

In the examples provided in this appendix, all options are specified by using the /option:value syntax, for example, /trans:"Hello".

Instead of /, you can also use -. Between the name of the option and the value, you can also use =. This means the options /trans:"Hello" and -trans="Hello" are equivalent.

You may use any combination of /, -, :, and =.

In Windows, the use of - and = may cause problems; it's recommended that you use the /option:value syntax.

If there are spaces in the values, you can use single quotes (') or double quotes (") to keep the values together. If there are no spaces, the quotes are optional.
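To make the equivalence of the two syntaxes concrete, here is a small Python sketch that reduces an option written in either style to the same name and value. It only illustrates the rules stated above; it is not how Pan or Kitchen parse their command line, and the pattern and function name are assumptions made for this example.

```python
import re

# One leading "/" or "-", the option name, then ":" or "=" and the value
OPTION_PATTERN = re.compile(r"^[/-](?P<name>[A-Za-z]+)[:=](?P<value>.*)$")


def parse_option(token):
    """Return (name, value) for an option token, or None if it is not one."""
    match = OPTION_PATTERN.match(token)
    if match is None:
        return None
    value = match.group("value")
    # Drop one pair of matching surrounding quotes, if present
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
        value = value[1:-1]
    return match.group("name"), value


# Both spellings describe the same option
print(parse_option('/trans:"Hello"') == parse_option('-trans="Hello"'))
```

A plain argument such as 20091001 has no leading / or -, so the function returns None for it.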


Checking the exit code

Both Pan and Kitchen return an exit code based on how the execution went. To check the exit code of Pan or Kitchen under Windows, type the following command:

echo %ERRORLEVEL%

To check the exit code of Pan or Kitchen under Unix-like systems, type the following command:

echo $?

If you get a zero, it means that there are no errors, whereas a value greater than zero implies failure. To understand the meaning of the error, please refer to the Pan / Kitchen documentation; URL references are provided at the end of the appendix.
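When Pan or Kitchen is launched from a script rather than typed by hand, the same exit code can be read programmatically. The following Python sketch shows the idea; so that it runs without PDI installed, it uses the Python interpreter itself as a stand-in command, which is an assumption made purely for illustration.

```python
import subprocess
import sys


def run_and_check(args):
    """Run a command; return its exit code and whether it was zero."""
    completed = subprocess.run(args)
    return completed.returncode, completed.returncode == 0


# Stand-in command that exits with code 2. In practice args would be a
# Pan or Kitchen command line like the pan.bat and kitchen.sh examples
# in this appendix.
stand_in = [sys.executable, "-c", "import sys; sys.exit(2)"]
code, ok = run_and_check(stand_in)
print("exit code:", code, "success:", ok)
```

A scheduler wrapper could use the boolean to decide whether to continue with the next batch step or to raise an alert.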

Providing options when running Pan and Kitchen

When you execute a transformation or a job with Spoon, you have the option to provide additional information such as named parameters. The following Spoon dialog window shows you an example of that:


When you execute the transformation or job with Pan or Kitchen respectively, you provide this same information as options in the command line. This is how you do it compared side-by-side with Spoon:

Log details

Spoon: You specify the log level in the drop-down list inside the Details box. When the transformation or job runs, the log is shown in the Execution Results window.
Pan/Kitchen option: /level:<logging level>, where the logging level can be one of the following: Error, Nothing, Minimal, Basic, Detailed, Debug, or Rowlevel.
Example: /level:Detailed
The log appears in the terminal window, but you can use the command language of your operating system to redirect it to a file.

Named parameters

Spoon: You specify the named parameters in the Parameters box. The window shows you the name of the defined named parameters for you to fill the values or keep the default values.
Pan/Kitchen option: /param:<parameter name>=
Example: /param:"REPORTS_FOLDER=c:\my_rep\"

Arguments

Spoon: You specify the command line arguments in the Arguments grid. Each line corresponds to a different argument.
Pan/Kitchen option: You type them in order as part of the command.
Example: 20091001 20091031

Variables

Spoon: The grid named Variables shows the variables used in the transformation/job as well as their current values. At the time of the execution, you can type different values.
Pan/Kitchen: You cannot set variables either in the Pan or in the Kitchen command. The variables have to exist. You may define them in the kettle.properties file. To get the details of this file, refer to the Kettle Variables section in Chapter 2.


Suppose that the sample transformation shown in the screenshot is located at c:\pdi_labs\sales_report.ktr. Then the following Pan command

pan.bat /file:"c:\pdi_labs\sales_report.ktr" 20091001 20091031 /level:Detailed > c:\pdi_labs\logs\sales_report.log

executes the transformation with the same options shown in the screenshot. The command redirects the log to the file c:\pdi_labs\logs\sales_report.log.

Besides these, both Pan and Kitchen have additional options. For a full list and more examples, visit the Pan and Kitchen documentation at http://wiki.pentaho.com/display/EAI/Pan+User+Documentation and http://wiki.pentaho.com/display/EAI/Kitchen+User+Documentation.
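The pieces described in this section (file, arguments, log level, named parameters, and log redirection) can also be combined from a script. The Python sketch below is one possible way to do it; the helper names are hypothetical, and the option names are the ones listed in this appendix.

```python
import subprocess


def pan_file_command(ktr_path, arguments=(), level=None, params=None,
                     script="pan.bat"):
    """Build a Pan argument list: /file, arguments, /level, and /param."""
    args = [script, "/file:" + ktr_path]
    args.extend(arguments)  # command line arguments, in order
    if level is not None:
        args.append("/level:" + level)
    for name, value in (params or {}).items():
        args.append("/param:%s=%s" % (name, value))
    return args


def run_with_log(args, log_path):
    """Run the command and write its standard output to log_path,
    which mirrors the '> file' redirection used in the example."""
    with open(log_path, "w") as log:
        return subprocess.run(args, stdout=log).returncode


cmd = pan_file_command(r"c:\pdi_labs\sales_report.ktr",
                       arguments=["20091001", "20091031"],
                       level="Detailed")
print(cmd)
```

Calling run_with_log(cmd, log_path) from the Kettle installation directory would then launch the transformation and return its exit code; that call is left out here because it requires a PDI installation.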

Quick Reference:

Steps and Job Entries

This appendix summarizes the purpose of the steps and job entries used in the tutorials throughout the book. For each of them, you can see the name of the Time for action section where it was introduced and also a reference to the chapters where you can find more examples that use it.

How to use this reference

Suppose you are inside Spoon, editing a transformation. If the transformation uses a step that you don't know and you want to understand what it does or how to use it, double-click the step and take note of the title of the settings window; that title is the name of the step. Then search for that name in the transformation steps reference table. The steps are listed in alphabetical order so that you can find them quickly. The last column will take you to the place in the book where the step is explained.

The same applies to jobs. If you see an unknown entry in a job, double-click the entry and take note of the title of the settings window; that title is the name of the entry. Then search for that name in the job entries reference table. The job entries are also listed in alphabetical order.


Transformation steps

The following table includes all the transformation steps used in the book. For a full list of steps and their descriptions, select Help | Show step plug-in information in Spoon's main menu.

You can also visit http://wiki.pentaho.com/display/EAI/Pentaho+Data+Integration+v3.2.+Steps for a full step reference along with some examples.

Name | Purpose | Time for action
Abort | Aborts a transformation | Aborting when there are too many errors (Chapter 7); also in Chapters 11 and 12
Add constants | Adds one or more constant fields to the stream | Gathering progress and merging all together (Chapter 4); also in Chapters 7, 8, and 9
Add sequence | Gets the next value from a sequence | Assigning tasks by Distributing (Chapter 4); also in Chapters 6 and 11
Append streams | Appends two streams in an ordered way | Giving priority to Bouchard by using Append Stream (Chapter 4)
Calculator | Creates new fields by performing simple calculations | Reviewing examination by using the Calculator step (Chapter 3); also in Chapters 6 and 8
Combination lookup/update | Updates a junk dimension. Alternatively, it can be used to update Type I SCD. | Loading a region dimension with a Combination lookup/update step (Chapter 9); also in Chapter 12
Copy rows to result | Writes rows to the executing job. The information will then be passed to the next entry in the job. | Splitting the generation of top scores by copying and getting rows (Chapter 11)
Data Validator | Validates fields based on a set of rules | Checking films file with the Data Validator (Chapter 7)
Database join | Executes a database query using stream values as parameters | Using a Database join step to create a list of suggested products to buy (Chapter 9)
Database lookup | Looks up values in a database table | Using a Database lookup step to create a list of products to buy (Chapter 9); also in Chapter 12


Delay row | For each incoming row, waits a given time before giving the row to the next step | Generating custom files by executing a transformation for every input row (Chapter 11)
Delete | Deletes data in a database table | Deleting data about discontinued items (Chapter 8)
Dimension lookup/update | Updates or looks up a Type II SCD. Alternatively, it can be used to update Type I SCD or hybrid dimensions. | Keeping a history of product changes with the Dimension lookup/update step (Chapter 9); also in Chapter 12
Dummy (do nothing) | This step type doesn't do anything! However it is used often. | Creating a hello world transformation (Chapter 1); also in Chapters 2, 3, 7, and 9
Excel Input | Reads data from a Microsoft Excel (.xls) file | Browsing PDI new features by copying a dataset (Chapter 4); also in Chapter 8
Excel Output | Writes data to a Microsoft Excel (.xls) file | Getting data from an XML file with information about countries (Chapter 2); also in Chapters 4 and 10
Filter rows | Splits the stream in two upon a given condition. Alternatively, it is used to let pass just the rows that meet the condition. | Counting frequent words by filtering (Chapter 3); also in Chapters 4, 6, 7, 9, 11, and 12
Fixed file input | Reads data from a fixed width file | Calculating Scores with JavaScript (Chapter 5)
Formula | Creates new fields by using formulas. It uses Pentaho's libformula. | Reviewing examination by using the Formula step (Chapter 3); also in Chapters 10 and 11
Generate Rows | Generates a number of equal rows | Creating a hello world transformation (Chapter 1); also in Chapters 6, 9, and 10
Get data from XML | Gets data from XML files | Getting data from an XML file with information about countries (Chapter 2); also in Chapters 3 and 9


Get rows from result | Reads rows from a previous entry in a job | Splitting the generation of top scores by copying and getting rows (Chapter 11)
Get System Info | Gets information from the system like system date, arguments, etc. | Updating a file with news about examinations (Chapter 2); also in Chapters 7, 8, 10, 11, and 12
Get Variables | Takes the values of environment or Kettle variables and adds them as fields in the stream | Creating the time dimension dataset (Chapter 6)
Group by | Builds aggregates in a group by fashion. This works only on a sorted input. If the input is not sorted, only double consecutive rows are handled correctly | Calculating World Cup statistics by grouping data (Chapter 3); also in Chapters 4, 7, and 9
If field value is null | If a field is null, it changes its value to a constant. It can be applied to all fields of a same data type, or to particular fields | Enhancing a films file by converting rows to columns (Chapter 6)
Insert / Update | Updates or inserts rows in a database table | Inserting new products or updating existent ones (Chapter 8)
Mapping (sub-transformation) | Runs a subtransformation | Calculating the top scores with a subtransformation (Chapter 11)
Mapping input specification | Specifies the input interface of a sub-transformation | Calculating the top scores with a subtransformation (Chapter 11)
Mapping output specification | Specifies the output interface of a sub-transformation | Calculating the top scores with a subtransformation (Chapter 11)
Modified Java Script Value | Allows you to code JavaScript to modify or create new fields. It's also possible to code Java | Calculating Scores with JavaScript (Chapter 5); also in Chapters 6, 7, and 11
Number range | Creates ranges based on a numeric field | Capturing errors while calculating the age of a film (Chapter 7); also in Chapter 8


Regex Evaluation | Evaluates a field with a regular expression | Validating Genres with a Regex Evaluation step (Chapter 7); also in Chapter 12
Row denormaliser | Denormalises rows by looking up key-value pairs | Enhancing a films file by converting rows to columns (Chapter 6)
Row Normaliser | Normalises de-normalised data | Enhancing the matches file by normalizing the dataset (Chapter 6)
Select values | Selects, reorders, or removes fields. Also allows you to change the metadata of fields | Reading all your files at a time using a single Text file input step (Chapter 2); also in Chapters 3, 4, 6, 7, 8, 9, 11, and 12
Set Variables | Sets Kettle variables based on a single input row | Updating a file with news about examinations by setting a variable with the name of the file (Chapter 11); also in Chapter 12
Sort rows | Sorts rows based upon field values, ascending or descending | Reviewing examinations by using the Calculator step (Chapter 3); also in Chapters 4, 6, 7, 8, 9, and 11
Split field to rows | Splits a single string field and creates a new row for each split term | Counting frequent words by filtering (Chapter 3)
Split Fields | Splits a single field into more than one | Calculating World Cup statistics by grouping data (Chapter 3); also in Chapters 6 and 11
Stream lookup | Looks up values coming from another stream in the transformation | Finding out which language people speak (Chapter 3); also in Chapter 6
Switch / Case | Switches a row to a certain target step based on the value of a field | Assigning tasks by filtering priorities with the Switch/ Case step (Chapter 4)
Table input | Reads data from a database table | Getting data about shipped orders (Chapter 8); also in Chapters 9, 10, and 12
Table output | Writes data to a database table | Loading a table with a list of manufacturers (Chapter 8); also in Chapters 9 and 12


Text file input | Reads data from a text file | Reading all your files at a time using a single Text file input step (Chapter 2); also in Chapters 3, 5, 6, 7, 8, and 11
Text file output | Writes data to a text file | Sending the results of matches to a plain file (Chapter 2); also in Chapters 3, 7, 9, 10, and 11
Update | Updates data in a database table | Loading a region dimension with a Combination lookup/update step (Chapter 9)
Value Mapper | Maps values of a certain field from one value to another | Browsing PDI new features by copying a dataset (Chapter 4)

Job entries

The following table includes all the job entries used in the book. For a full list of job entries and their descriptions, select Help | Show job entries plug-in information in Spoon's main menu.

You can also visit http://wiki.pentaho.com/display/EAI/Pentaho+Data+Integration+v3.2.+Job+Entries for more information. There you'll find a full job entries reference and some examples as well.

Name | Purpose | Time for action
Abort job | Aborts the job | Updating a file with news about examinations by setting a variable with the name of the file (Chapter 11)
Create a folder | Creates a folder | Creating a simple Hello world job (Chapter 10)
Delete file | Deletes a file | Generating custom files by executing a transformation for every input row (Chapter 11)
Evaluate rows number in a table | Evaluates the content of a table | Loading the dimensions for the sales datamart (Chapter 12)


File Exists | Checks if a file exists | Updating a file with news about examinations by setting a variable with the name of the file (Chapter 11)
Job | Executes a job | Generating the files with top scores by nesting jobs (Chapter 11); also in Chapter 12
Mail | Sends an e-mail | Sending a sales report and warning the administrator if something were wrong (Chapter 10)
Special entries | Start job entry; mandatory at the beginning of a job | Creating a simple Hello world job (Chapter 10); also in Chapters 11 and 12
Success | Forces the success of a job execution | Updating a file with news about examinations by setting a variable with the name of the file (Chapter 11); also in Chapter 12
Transformation | Executes a transformation | Creating a simple Hello world job (Chapter 10); also in Chapters 11 and 12

Note that this appendix is just a quick reference. It is not meant for learning to use PDI. To learn PDI from scratch, you should read the book starting from the first chapter.

Spoon Shortcuts

The following tables summarize the main Spoon shortcuts. Keep this appendix handy; it will save you a lot of time while working with Spoon.

If you are a Mac user, please be aware that a mixture of Windows and Mac keys

is used. Thus, the shortcut keys are not always what you expect. For example, in

some cases you copy with Ctrl+C, while in others you do it with Command+C.

General shortcuts

The following table lists general Spoon shortcuts:

Acon Shortcut

New job Ctrl+Alt+N

New transformaon Ctrl+N

Open a job/transformaon Ctrl+O

Save a job/transformaon Ctrl+S

Close a job/transformaon Ctrl+F4

Run a job/transformaon F9

Preview a transformaon F10

Debug a transformaon Shi+F10

Verify a transformaon F11

Job sengs Ctrl+J

Transformaon sengs Ctrl+T

Search metadata Ctrl+F

Set environment variables Ctrl+Alt+J

Show environment variables Ctrl+L

Show arguments Ctrl+Alt+U


Designing transformations and jobs

The following shortcuts help you design transformations and jobs:

Action | Shortcut
New step/job entry | Drag the step/job entry icon to the work area and drop it there
Edit step/job entry | Double-click
Edit step description | Double-click the middle mouse button
New hop | Click a step and drag toward the second step while holding down the middle mouse button, or while pressing Shift and holding down the left mouse button
Edit a hop | Double-click in transformations, right-click in jobs
Split a hop | Drag a step over the hop until it gets wider
Select some steps/job entries | Ctrl+click
Select all steps | Ctrl+A
Clear selection | Esc
Copy selected steps/job entries to clipboard | Ctrl+C
Paste from clipboard to work area | Ctrl+V
Delete selected steps/job entries | Del
Align selected steps/job entries to top | Ctrl+Up
Align selected steps/job entries to bottom | Ctrl+Down
Align selected steps/job entries to left | Ctrl+Left
Align selected steps/job entries to right | Ctrl+Right
Distribute selected steps/job entries horizontally | Alt+Right
Distribute selected steps/job entries vertically | Alt+Up
Zoom in | Page up
Zoom out | Page down
Zoom 100% | Home
Snap to grid | Alt+Home
Undo | Ctrl+Z
Redo | Ctrl+Y
Show output stream (only available in transformations) | Position the mouse cursor over the step; then press Space bar

Appendix D


Grids

Acon Shortcut

Move a row up Ctrl+Up

Move a row down Ctrl+Down

Resize all columns to see the full values (header included) F3

Resize all columns to see the full values (header excluded) F4

Select all rows Ctrl+A

Clear selecon Esc

Copy selected lines to clipboard Ctrl+C

Paste from clipboard to grid Ctrl+V

Cut selected lines Ctrl+X

Delete selected lines Del

Keep only selected lines Ctrl+K

Undo Ctrl+Z

Redo Ctrl+Y

Repositories

Acon Shortcut

Connect to repository Ctrl+R

Disconnect repository Ctr+D

Explore repository Ctrl+E

Edit current user Ctrl+U

Introducing PDI 4 Features

While wring this book, version 4.0 of PDI was sll under development. Kele 4.0 was

mainly created to provide a new API for the future—the API that is cleaned up, exible, more

pluggable, and so on. Beside those architectural changes, Kele 4.0 also includes some new

funconal features. This appendix will quickly introduce you to those features.

Agile BI

Pentaho Agile Business Intelligence (Agile BI) is a new, iterave design approach to BI

development. Agile BI provides an integrated soluon that enables you, as an ETL designer,

to work iteravely, modeling the data, visualizing it, and nally providing the data to users

for self-service reporng and analysis. Agile BI is delivered as a plugin to Pentaho Data

Integraon. You can learn more about Agile BI at http://wiki.pentaho.com/display/

AGILEBI/Documentation.

Visual improvements for designing transformations and jobs

The new version of the product includes mainly Enterprise or advanced features. There are, however, a couple of novelties in the Community Edition that will catch your attention as soon as you start using the new version of the software. In this section you will learn about those novelties.

Experiencing the mouse-over assistance

The mouse-over assistance is the first new feature you will notice. It assists you while editing jobs and transformations. Let's see it working.


Time for action – creating a hop with the mouse-over assistance

You already know several ways to create a hop between two job entries or two steps. Now you will learn a new way:

1. Create a job and drag two job entries to the canvas. Name the entries A and B.

2. Position the mouse cursor over the entry named A and wait until a tiny toolbar shows up below the entry icon as shown:

3. Click on the output connector (the last icon in the toolbar), and drag toward the entry named B. A grayed hop is displayed.

4. When the mouse cursor is over the B entry, release the mouse button. A hop is created from the A entry to the B entry.

What just happened?

You created a hop between two job entries by using the mouse-over assistance, a feature incorporated in PDI 4.

Using the mouse-over assistance toolbar

When you posion the mouse cursor over a step in a transformaon or a job entry in a job, a

ny toolbar shows up to assist you. The following diagram depicts its opons:

Appendix E


The following table explains each button in this toolbar:

Button | Description
Edit | Equivalent to double-clicking the job entry/step to edit it.
Menu | Equivalent to right-clicking the job entry/step to bring up the contextual menu.
Input connector | Assistant for creating hops directed toward this job entry/step. It is used as shown in the tutorial, but the direction of the created hop is the opposite. If the job entry/step doesn't accept any input (for example, the START job entry or the Generate Rows step), the input connector is disabled.
Output connector | Assistant for creating hops leaving from this job entry/step, as shown in the tutorial.

In the tutorial, you created a simple hop between two job entries. You can create hops between steps in the same way. In this case, depending on the kind of source step, you might be prompted for the kind of hop to create. For example, when leaving a Filter rows step, you will be asked whether the destination step is where you'll send the "true" data, where you'll send the "false" data, or the main output of the step.

Experiencing the sniff-testing feature

The sni-tesng feature allows you to see the rows that are coming into or out of a step

in real me. While a transformaon is running, right-click a step, select Sni test during

execuon | Sni test output rows. A window appears showing you the output data as it

is being processed. In the same way, you can select Sni test during execuon | Sni test

input rows to see the incoming rows.

Note that the sni-tesng feature slows down the transformaon and

its use is recommended just for debugging purposes.

Experiencing the job drill-down feature

In Chapters 10 and 11, you learned how to nest jobs and transformations. You even learned how to create subtransformations. In every case, when you ran the main job or transformation, there was a single log tab showing the log for the main and all nested jobs and transformations.

In PDI 4.0, when a job entry is running, you can drill down into it. Drilling down means opening that entry and seeing what's going on inside that job or transformation. In a separate window, you'll see both the step metrics and the log. If there are more nested transformations or jobs, you can continue drilling down. You can go even further, into a running subtransformation. In any of these jobs or transformations, you may sniff test as well, as described above.


Drilling down is useful, for example, to understand why your jobs or transformations don't behave as expected or to find out where a performance problem is.

You can see the job drill-down and sniff-testing in action in two videos made by Matt Casters, Kettle chief leader and author of these features, at http://www.ibridge.be/?p=179.

Experiencing even more visual changes

Besides the features that we have just seen, there are some other UI improvements worth mentioning:

Enhanced notes editor: Now you can apply different fonts and colors to the notes you create in Spoon.

Color-coded logs: Now it is easier to read a log, as different colors allow you to quickly identify different kinds of log messages.

Revamped Repository explorer: The Repository explorer has been completely redesigned, making this a major UI improvement in Kettle 4.0.

Enterprise features

As said, most of the functional features included in Kettle 4.0 apply only to the Enterprise version of the product. Among those features, the following are the most remarkable:

Job and transformation versioning and locking

Robust security and administration capabilities

Ability to schedule jobs and transformations from Spoon

Enhanced logging architecture for real-time monitoring and debugging of transformations

Summary

This appendix introduced you to the main features included in Kettle 4.0. All the explanations and exercises in this book have been developed, explained, and tested in the latest stable version, 3.2. However, as the new version of the product includes mainly Enterprise or advanced features, working with Kettle 4.0 Community Edition is not so different from working with Kettle 3.2. You can try all the examples and exercises in the book in Kettle 4.0 if you want to. You shouldn't have any difficulties.



Pop Quiz Answers

Chapter 1

PDI data sources

PDI prerequisites

1 1 and 3

PDI basics

1 False (Spoon is the only graphical tool)

2 True

3 False (Spoon doesn't generate code, but interprets transformations and jobs)

4 False (The grid size is intended to line up steps on the screen)

5 False (For example, the transformation in this chapter created the rows of data from scratch; it didn't use external data)


Chapter 2

formatting data

1 (a) and (b). The field is already a Number, so you may define the output field as a Number, taking care of the format you apply. If you define the output field as a String and you don't set a format, Kettle will send the field to the output as 1.0, 2.0, 3.0, etc., which clearly is not the same as your code. Just to confirm this, create a single file and a transformation to see the results for yourself.

Chapter 3

concatenating strings

1 (a) and (c). The Calculator step allows you to use the + operator both for adding numbers and for concatenating text. The Formula step makes a distinction: to add numbers you use +; to concatenate text you have to use & instead.

Chapter 4

data movement (copying and distributing)

1 (b). In the second transformation the rows are copied, so all the unassigned rows reach the dummy step. In the first transformation the rows are distributed, so only half of the rows reach the filter step. When you do the preview, you see only the unassigned tasks for this half; you don't see the unassigned tasks that went to the other stream.
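The difference between copying and distributing rows can be sketched in a few lines of Python. This is only an illustration, not PDI code: the function names, the sample task rows, and the round-robin order used for distribution are assumptions made for the example.

```python
from itertools import cycle


def copy_rows(rows, n_targets):
    """Copy mode: every target stream receives every row."""
    return [list(rows) for _ in range(n_targets)]


def distribute_rows(rows, n_targets):
    """Distribute mode: rows are dealt out to the targets in turn (assumed round-robin)."""
    targets = [[] for _ in range(n_targets)]
    for row, target in zip(rows, cycle(targets)):
        target.append(row)
    return targets


def unassigned(rows):
    """Keep only the tasks that nobody has been assigned to."""
    return [row for row in rows if row["assigned_to"] is None]


# Hypothetical sample data: three of the four tasks are unassigned.
tasks = [
    {"task": "t1", "assigned_to": None},
    {"task": "t2", "assigned_to": None},
    {"task": "t3", "assigned_to": "Ana"},
    {"task": "t4", "assigned_to": None},
]

copied = copy_rows(tasks, 2)
distributed = distribute_rows(tasks, 2)

# With copying, one stream sees every unassigned task;
# with distribution, it sees only those that landed in its share of the rows.
print(len(unassigned(copied[0])), len(unassigned(distributed[0])))
```

Previewing a single stream after distribution therefore shows only part of the unassigned tasks, which is the situation described in the answer.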

splitting a stream

1 (c). Both (a) and (b) solve the situation.

Appendix F


Chapter 5

finding the seven errors

1 The seven errors are:

1. The log type a doesn't exist. Look at the sample provided for the function to see the valid options.

2. The variable uSkill is not defined. Its definition is required if you want to add it to the list of new fields.

3. setValue() causes an error without compatibility mode. To change the value of the Country field, a new variable should be used instead.

4. A variable named average is calculated, but wAverage is used as the new field.

5. It is not trans_status; it is trans_Status.

6. No data type was specified for the totalScore field.

7. The statement writeToLog('Ready to calculate averages...') will be executed for every row. To write the message only once at the beginning, you have to put the statement in a Start script, not in the main script.

Chapter 6

using Kettle variables inside transformations

1 (a). You don't need a Get Variables step in this case. As the name of the file, you simply type hello_${user.name} or hello_%%user.name%%.

In (b) and (c) you need to add the variables ${user.name} and ${user.language} respectively as fields of your dataset. You do it with a Get Variables step.
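As an illustration of the two notations, the following Python sketch expands ${name} and %%name%% references in a text. It is not how Kettle itself is implemented: the function name, the regular expression, the sample value, and the choice to leave unknown variables untouched are all assumptions made for the example.

```python
import re

# Matches either ${name} or %%name%%; the name is captured in group 1 or group 2.
_VARIABLE = re.compile(r"\$\{([^}]+)\}|%%([^%]+)%%")


def expand_variables(text, variables):
    """Replace variable references in text with values from the variables dict."""

    def replace(match):
        name = match.group(1) or match.group(2)
        # Assumption for this sketch: unknown variables are left as written.
        return variables.get(name, match.group(0))

    return _VARIABLE.sub(replace, text)


# Hypothetical value for the user.name variable.
variables = {"user.name": "maria"}

print(expand_variables("hello_${user.name}", variables))
print(expand_variables("hello_%%user.name%%", variables))
```

Both calls produce the same file name, which mirrors the point of answer (a): the variable is resolved in place, without adding it to the dataset as a field.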

Chapter 7

PDI error handling

1 (c). With PDI you cannot avoid unexpected errors, but you can capture them and so prevent the transformation from crashing. After that, discarding or treating the bad rows is up to you.


Chapter 8

dening database connections

1(c)

database datatypes versus PDI datatypes

1(b)

Insert/Update step versus Table Output/Update steps

1 (a). If an incoming row belongs to a product that doesn't exist in the products table, both the Insert/Update step and the Table output step will insert the record.

If an incoming row belongs to a product that already exists in the products table, the Insert/Update step updates it. In the alternative version, the Table output step fails (there cannot be two products with the same value for the primary key), but the failing row goes to the Update step, which updates the record.

If an incoming row contains invalid data (for example, a price with a non-numeric value), none of the Insert/Update, Table output, or Update steps would insert or update the table with this product.

ltering the rst 10 rows

1(c). To limit the number of rows in MySQL you use the clause LIMIT. (a) and (b) are

dialects: (a) is valid in HSQLDB. (b) is valid in Oracle. If you put any of this opons in

a Table Input for querying the js database, the transformaon would fail
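A minimal, runnable sketch of the LIMIT clause follows. To keep it self-contained it uses Python's built-in sqlite3 module instead of MySQL (SQLite also accepts LIMIT); the table layout, column names, and sample rows are invented for the example and are not the book's js database.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (pro_code INTEGER PRIMARY KEY, pro_name TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [(i, f"puzzle {i}") for i in range(1, 26)],  # 25 hypothetical products
)

# LIMIT 10 keeps only the first ten rows of the ordered result.
first_ten = conn.execute(
    "SELECT pro_code, pro_name FROM products ORDER BY pro_code LIMIT 10"
).fetchall()

print(len(first_ten))
conn.close()
```

The same SELECT ... LIMIT 10 form is what you would type in a Table Input step against a MySQL connection.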

Chapter 9

loading slowly changing dimensions

1 (a). The decision about the kind of dimension is not related to the data you have. You just have to know your business, so the last option is out. You don't need to keep history for the name of the film. If the name changes, it is because it was misspelled, or because you want to change the name to upper case, or something like that. It doesn't make sense to keep the old value, so you create a Type I SCD.

2 (c). You can use any of these steps for loading a Type I SCD. In the tutorial for loading a Type I SCD you used a Combination L/U step, but you could have used the others too, as explained above.


loading type III slowly changing dimensions

1 (b). You use a Database lookup step to get the current value stored in the dimension. If there is no data in the dimension table, the lookup fails and returns null; that is not a problem. After that, you compare the found data with the new data and set the proper values for the dimension columns. Then you load the dimension either with a Combination L/U step or with a Dimension lookup step, just as you do for a regular Type I SCD.
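The comparison step described above can be sketched in Python. This is a simplified illustration, assuming a Type III attribute stored in two columns, here called current and previous; the function name and the column names are hypothetical, and in PDI the same logic would be built with steps rather than code.

```python
def type3_columns(stored, incoming_value):
    """Return the values to store for a Type III attribute.

    stored is the row found by the lookup (a dict with 'current' and
    'previous'), or None when the lookup found nothing.
    """
    if stored is None:
        # First load: there is no previous value yet.
        return {"current": incoming_value, "previous": None}
    if stored["current"] == incoming_value:
        # Nothing changed: keep the row as it is.
        return dict(stored)
    # The value changed: the old current value becomes the previous one.
    return {"current": incoming_value, "previous": stored["current"]}


print(type3_columns(None, "Rosario"))
print(type3_columns({"current": "Rosario", "previous": None}, "Cordoba"))
```

A null lookup result is handled as the first-load case, which is why a failed lookup is not a problem.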

Chapter 10

dening PDI jobs

1(b)

2All the given opons are True. Simply explore the Job entries tree and you'll nd the

answers.

Chapter 11

using the Add sequence step

1 (e). None of the proposed solutions gives you the same results you obtained in the tutorial. The Add sequence step gives you the next value in a sequence, which can be a database sequence or a transformation counter. In the tutorial you used a transformation counter. In options (b) and (c), instead of four sequences from 1 to 10, a single sequence from 1 to 40 would have been generated. No matter which method you use for generating the sequence, if you use the same sequence name in more than one Add sequence step, the sequence is the same and is shared by all those steps. Therefore, option (a) also would have generated a single sequence from 1 to 40 shared by the four streams.

Besides these details about the generation of sequences, option (b) introduces an extra inconvenience. By distributing rows, you cannot be sure that the rows will go to the proper stream. PDI would have distributed them in its own fashion.
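The effect of sharing a counter name can be sketched in Python. This is an illustration only: the Counters class, the function name, and the counter names are invented, and the streams are numbered one after another here, whereas in a running transformation the shared values would be interleaved among the streams.

```python
from collections import defaultdict


class Counters:
    """Named counters: every request for the same name advances the same counter."""

    def __init__(self):
        self._last = defaultdict(int)

    def next_value(self, name):
        self._last[name] += 1
        return self._last[name]


def number_streams(streams, counter_names):
    """Number the rows of each stream using the counter name given for that stream."""
    counters = Counters()
    return [
        [counters.next_value(name) for _ in rows]
        for rows, name in zip(streams, counter_names)
    ]


# Four streams of ten rows each.
streams = [["row"] * 10 for _ in range(4)]

shared = number_streams(streams, ["seq"] * 4)
separate = number_streams(streams, ["seq_a", "seq_b", "seq_c", "seq_d"])

print(shared[3][-1])    # highest value when the four streams share one counter name
print(separate[3][-1])  # highest value when each stream has its own counter name
```

With a shared name the values run from 1 to 40 across the four streams; with distinct names each stream gets its own sequence from 1 to 10.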

deciding the scope of variables

1All the opons are valid. In the tutorial you had just a transformaon and its parent

job, that is also the root job. So (a) is valid. The grand-parent job scope includes the

parent job so opon (b) is valid too. Opon (c) includes all the other opons, so it is a

valid opon too.


Chapter 12

modifying a star model and loading the star with PDI

1 (a) iii. As mentioned in Chapter 9, despite being designed for building Type II SCDs, the Dimension L/U step can be used for building Type I SCDs as well. So you have two options: reuse the table (modifying the transformation that loads it) and get the surrogate key with a Dimension L/U step, or use another table without the fields specific to Type II dimensions and get the surrogate key with a DB Lookup step. In either case, you may reuse the id_region field, as it is an integer and serves in any situation.
1 (b) i
1 (c) iii

2 (a) ii. The dimension table has to have one record per month. Therefore a different table is needed. For the key you could use a string with the format yyyymm. If you don't want to change the fact table, you may reuse the dt field, leaving the last two characters blank, but it would be more appropriate to have a string field with just 6 positions. To get the surrogate key, you use a Select values step to change the metadata, but this time you set the new mask yyyymm as the format.
2 (b) ii
2 (c) i

3 (a) ii. The product_type field is a string; it's not the proper field for referencing a surrogate key from a fact table, so you have to define a new field for that purpose. To get the right key, you use a Database lookup step.
3 (b) iii
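The yyyymm key mentioned in answer 2 can be illustrated with a short Python sketch. The function name and the sample date are assumptions made for the example; in PDI the same conversion is done with the format mask of a Select values step, not with code.

```python
from datetime import date


def month_key(day):
    """Build a six-character yyyymm key from a date."""
    return day.strftime("%Y%m")


# Hypothetical sale date: 15 August 2009.
key = month_key(date(2009, 8, 15))
print(key)
```

Every date in the same month maps to the same key, which is what lets the dimension table hold one record per month.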

Chapter 13

remote execution and clustering

1 None of the statements is true.

Index

Symbols

${<variable>} notation 193

%%<variable>%% notation 193

*.kjb format 417

*.ktr format 417

.kjb file jobs 429, 430

.ktr file transformations

running 429, 430

/option:value syntax 431

action sequence 412

administrave tasks

geng rid of 399

sales datamart loading, automang 399-402

work backup, creang automacally 402

Agile BI 447

basic calculaon

calculator step, using 74

data, sorng 81

Dummy step 81

examinaon review, calculator step used 74,

78, 80

eld modicaon, PDI used 82

elds, modifying 82

Select values step, using 81

basic modicaon

Group by step 94

business keys to surrogate keys, sales fact table

junk dimension surrogate key, obtaining 391

time dimension surrogate key, obtaining 391

translating 388-391

Type II SCD surrogate key, obtaining 389, 390

Type I SCD surrogate key, obtaining 388, 389

calculator step used, basic calculation

about 74

average, taking 74-77

eding 78

examinaon, reviewing 80, 81

nal preview 80

preview 78

Select Values step 79

Sort rows Step 78

Carte 410

CDF 414

change history, maintaining

Dimension lookup/update step, using 286-288

steps 286

transformaon, tesng 288, 289

cloud-compung 411

cluster 411, 424

coding

disadvantages 166

command line argument

named parameters, dierenang between

317, 318

passing, to transformaon 315, 316

use, analyzing 318

Community Dashboard Framework. See CDF

Community Edion 415

complex lookups, data

customers list, rebuilding 275

database to stream data, joining 272-274

performing 270

suggested products list, creating 270, 272


columns 218

connecng, with RDBMS 222, 223

constellaon 376

custom me dimension dataset

creang 187-191

generang, Kele variables used 186

Get Variables step 191-193

dashboard

screenshot 414

data

normalizing 180, 181

normalizing, Row Normalizer step used 182, 184

reading, from files 35

data, database

complex lookups, doing 270

looking up 266

simple lookups, doing 266

data, reading

football match results, reading 36-40

from les 35

grids 46

input les 41

input les, properes 41, 42

mulple les, reading at once 42, 43

mulple les reading, single Text le input step

used 43, 44

reading les, troubleshoong 45, 46

regular expressions 44

data, sending to database

data, inserng 246-251

data, updang 246-251

Insert/Update step, using 251-253

inserng, table output step used 245

table list, loading 239-244

data, XML les

Get Data From XML input step, conguring 69

node, selecng 69

obtaining 68

path expression, examples 69

XPath, using 68

database connecons 424

database connecons See also connecng, with

RDBMS

database explorer, using 228

database operaons 261

database querying

data, working with 229-231

data obtaining, table input step used 231, 232

SELECT statement, using 232, 233

data cleaning. See data cleansing

data cleansing

about 213, 214

example 214

PDI step, using 214

data eliminaon, from database

Delete step using 259, 260

steps 256-258

data manipulaon

basic calculaon 73

ltering 97

Data Manipulaon Language. See DML

datamart

about 275, 367

datawarehouse, dierence 368

sales datamart 368

data scrubbing. See data cleansing

dataset

custom me dimension dataset, generang 186

data, normalizing 180

modifying, Row Normalizer step used 182, 184

rows, converng to columns 169

data to les, transferring

about 47

eld 50

eld, deleng 52

eld, selecng 52

eld metadata, changing 52

match results, sending 47, 49

output les 49

row 50

rowset 50

streams 51

data transformaon 141

data type, system informaon

date eld 58, 59

date formats, using 62

number 99.55, formang 62

numeric elds 59, 60

transformaon, execung 60, 61


data validaon

example 208

lms, checking 209, 210

need for 208

simple validaon rules, dening 211-213

datawarehouse

about 275

datamart, dierence 368

DDL

about 226

example 226

degenerate dimension 370

Design Studio

screenshot 413

dimensional modeling

about 275, 276

datamart 275

datawarehouse 275

dimension tables 275

fact table 275

junk dimension 369

mini dimension 285

SCD 282

star schema 275

dimensions, sales datamart

about 186

loading 371-376

Dimension tables 275

dimension tables, with data

about 275

change history, maintaining 286

dimension data, describing 281, 282

loading 276

loading, combinaon lookup/update step used

276-281

DML

about 226

example 226, 227

dynamic clusters 411

E4X 148

Eclipse 418

Enterprise Console 415

Enterprise features 450

enty relaonship diagram. See ERD

ERD 264

errors, capturing

Abort step, using 203

about 195

captured errors, xing 203, 204

error handling funconality, using 200, 201

lm age, calculang 196-199

PDI error handling funconality, acvies 205

rows, treang 205, 206

transformaon, aborng 201-203

ETL 7

exit code

checking, under Unix-based systems 432

checking, under Windows 432

EXtensible Markup Language. See XML

Extracng, Transforming and Loading. See ETL

facts 275

fact table

about 275

loading, date ranges used 394-396

field 50

field modification, basic modification

add constants field 82

calculator step 83

examples 89-92

Formula step 84-87

number range eld 82

replace in string eld 82

split elds 82

student, lisng 88

User Dened Java Expression 83

Value Mapper 82

file

data, reading from 35

file result list, creating 326

file result list, using 326

output files, writing 52

updating 53-56

file-based system

migrating, to repository-based system 426

file result 326

filtering

frequent words, counting 97-102

rows 104, 105

rows, filter rows used 103


spoken language, idenfying 105-109

Stream lookup step 109

word count, discarding commonly used 105

lter rows step

using, for ltering row 103

rst transformaon, Spoon

hello world transformaon, creang 20-24

interface, exploring 26

Kele engine, direcng 25

previewing 27, 28

previewing, results in Execuon Results window

running 27, 28

structure, viewing 26

ow-control oriented 305

foreign keys (FK) 218

formula step 165

grain 370

grid shortcuts

Ctrl+A 445

Ctrl+C 445

Ctrl+Down 445

Ctrl+K 445

Ctrl+Up 445

Ctrl+V 445

Ctrl+X 445

Ctrl+Y 445

Ctrl+Z 445

Del 445

Esc 445

F3 445

F4 445

Group by step, basic modification

about 94

fields, reviewing 95

preview 96

tasks 96

hash table algorithm 111

hop

about 25, 305

creang 298-300

hop color 325

HSQLDB 222

Hybrid SCD 293

HyperSQL DataBase. See HSQLDB

id_junk_sales key 388

id_manufacturer key 388

id_region key 388

installing

MySQL 29

MySQL, on Ubuntu 32-34

MySQL, on Windows 29-31

PDI 14, 15

JavaScript

advantages 166

JavaScript code, inserng

about 148

average calculaons, tesng 152, 153

Clone() funcon 165

code, tesng 151

compability switch, turning on 151

elds, adding 150, 151

elds, modifying 150, 151

getProcessCount() funcon 161

Input elds branch 149

LoadScriptFormTab() funcon 159

new average calculaons, tesng 153

script, tesng 153

Transform Funcons 149

JavaScript step

about 154

End Script 159

JavaScript code, inserng 148

Java code, using 161

Main script 159

named parameters, using 158

scores, calculang 142-147

simple tasks, doing 142

Start Script 159

transformaon predened constants, using

159-161, 453

transformaons, modifying 154-157

using, in PDI 147


jigsaw puzzle database

buy_methods table 265

cies table 265

countries table 265

customers table 265

exploring 264, 265

invoices table 265

manufactures table 265

payment_methods table 265

populang 261-264

products table 265

job

designing, shortcuts 444

exible version, creang 309-311

hello world le, customizing 309-311

hello world job, creang 298-304

named parameters, using 312

processes, execung 305

running, from repository 430, 431

transformaon job entry, using 307, 308

job, creang as process ow

data ow, modifying 353

data transfer, copy/get rows mechanism used

352, 353

transformaon, spling 348-351

job, running from repository

command line opons, specifying 431

steps 430, 431

job, running from terminal window

steps 313

job entry

abort job 440

about 305

create a folder 440

delete le 440

evaluate rows number in a table 440

File Exists 441

Job 441

mail 441

special entries 441

success 441

transformaon 441

job entry, execung

execuon ow, modifying 324, 325

launching, in parallel 308

sales report, sending 318-323

job iteraon

about 357

custom les, execung 358-361

every input row, execung 361-366

jobs, nesng

les, generang 354, 355

job, running inside another job 355

join 385

junk dimension 369

KDE Extracon, Transportaon, Transformaon

and loading Environment. See Kele

Kele 9

kele.properes le 62, 63, 264

Kele 4.0, features

Agile BI 447

Enterprise features 450

visual improvements 447

Kele Developer Support 415

Kele repository 420

Kele variables, XML les

about 70

exploring 71

Get Variable step 193

scope types 357

using 70

variables, geng 192

work documentaon 71

Key Performance Indicators. See KPIs

Kitchen

about 429

arguments 433

documentaon 434

log details 433

named parameters 433

running, opons 432

sales datamart loading, automang 399-402

variables 433

KPIs 8

LoadScriptFromTab() funcon 159


mapping 345

master 411

mini-dimension

loading 285

Modied JavaScript Values step. See JavaScript

step

mouse-over assistance

toolbar, using 448, 449

working 448

MySQL

installing 29

installing, on Ubuntu 32-34

installing, on Windows 29-31

MySQL, installing

on Ubuntu 32-34

on Windows 29-32

named parameters

command line argument, dierenang

between 317, 318

passing, to transformaon 315, 316

use, analysing 318

using 158

OLTP 275

On-Line Transacon Processing. See OLTP

output les

output steps 50

Pan

about 429

arguments 433

documentaon 434

examinaon transformaon, execung from

terminal window 60, 61

log details 433

named parameters 433

running, opons 432

variables 433

paron schemas 424

PDI

about 7

and Pentaho BI Suite 7

best pracces 405

cloud-compung 411

cluster 411

dynamic clusters 411

features 408-411

integrang, with Pentaho BI suite 412

graphic designer, launching 15-18

installing 14, 15

job 297

Kele 9

Kele plug-ins 408, 409

master 411

PDI 2.3 10

PDI 2.4 10

PDI 2.5 10

PDI 3.0 10

PDI 3.1 10

PDI 3.2 10

PDI 4.0 10

real world risk, overcoming 410

scaling out 411

scaling up 411

Spoon 15

using, in real world scenarios 11

PDI, using in real world scenarios

data, cleansing 12

data, exporng 13

data, integrang 12

datamart, loading 11, 12

datawarehouse, loading 11, 12

informaon, migraon 13

integrang, Pentaho BI used 13

PDI best pracces 405-407

PDI elements

clusters 424

database connecons 424

paral schemas 424

slave servers 424

PDI Enterprise Edion 415

PDI features

browsing 114

browsing, dataset copied 114-119

PDI graphic designer. See Spoon


PDI integraon, with Pentaho BI suite

about 412

as datasource 413

as process acon 412, 413

Pentaho suite 414, 415

PDI opons, stream merge

Bouchard’s rows 137, 138

choosing 134, 135

tasks, merging 138

tasks, sorng 138

union, creang 134

PDI, steps

about 184

lms le, normalizing 185, 186

normalize benets, verifying 185

scores, calculang 186

Pentaho Agile Business Intelligence. See Agile BI

Pentaho BI Suite

analysis engine 7

and PDI 7

dashboards 8

data integraon 8

data mining 8

Pentaho BI Plaorm 8

reporng engine 8

Pentaho BI suite integraon, with PDI

about 412-415

as datasource 413

as process acon 412, 413

Pentaho Data Integraon. See PDI

plug-in

Kele plug-in 408, 409

primary key (PK) 218

process execuon, PDI job

about 305

hop 305

job design, comparing with job transformation 306

job entry 305

job running, Spoon used 306

puzzles fact table

loading 393

RDBMS 222

records 218

regular expressions 44

relaonal database 218

Relaonal Database Management System. See

RDBMS

repository

backing up 424

creang 418-420

details, storing 420

features 417

le-based system benets 418

jobs in folders, creang 42

Kele repository 420

logging into 421

logging into, credenals used 422

restoring 424

storage system, working with 421

tasks 423

transformaon in folders, creang 423

user accounts, using 422, 423

shortcuts 445

repository-based system

migrang, to le-based system 426, 427

repository explorer

element, creang 425

element, deleng 425

element, opening 425

using, for content examinaon 424, 425

using, for content modicaon 424, 425

repository shortcuts

Ctrl+D 445

Ctrl+E 445

Ctrl+R 445

Ctrl+U 445

Rhino engine 147

root-job 365

row 50

Row denormalizer

about 173

data, aggregang 176-179

working 173

rows, converng to columns

about 169

data, aggregang 176-179

lms le, enhancing 170-172

Row denormalizer step, using 173-176

total scores, calculang 177-179

rows, Stream split

copying 119, 120


distribung 120

tasks, assigning 121-124

rowset 51

sales datamart

degenerate dimension 370

dimensions 368

dimensions, loading 370

exploring 369

granularity level, determining 370

junk dimension 369

model 376

sales datamart model

about 376, 377

added dimensions 376

added dimensions, loading 378

sales fact table

business keys to surrogate keys, translang 388

informaon obtaining, SQL queries used

384-387

loading 378

scaling out 411

scaling up 411

SCD

about 282

Type II SCD 289

Type I SCD 282

SELECT statement

aggregate funcon 386

Kele variables, advantages 238

Kele variables, using 236, 237

Kele variables, using in queries 238

parameters, adding 235, 236

parameters, using 234, 235

using 232, 233

simple lookups, data

buyers product list, creating 266, 267

database values, looking up 268, 269

performing 266

slave server 410, 424

Slowly Changing Dimensions. See SCD

sni-tesng feature 449

sorng data 74

split eld to rows step 165

Spoon

about 18

files method 19

first transformation, creating 20

jobs, storing 19

launching 16, 17, 18

method, choosing 20

opons window preference, seng 18

repository method 19

shortcuts 443

starng 15

transformaon, storing 19

Spoon Interface

Design view 26, 306

exploring 26

View perspecve 26, 307

transformaon structure 26

Spoon shortcuts

Ctrl+Alt+J 443

Ctrl+Alt+N 443

Ctrl+Alt+U 443

Ctrl+F 443

Ctrl+F4 443

Ctrl+J 443

Ctrl+L 443

Ctrl+N 443

Ctrl+O 443

Ctrl+S 443

Ctrl+T 443

F10 443

F11 443

F9 443

of job design 444

of transformaon design 444

Shi+F10 443

SQL

about 225

cast funcon 385

DDL 226

DML 226

star schema 275

Steel Wheels database

about 217, 218

congured database exploring, database

explorer used 228

connecng, with RDBMS 222, 223

connecng to 219

connecon, creang 219-221

sample database, exploring 224, 225

SQL 225

tables 218

stream, merging

about 131

PDI opons 134, 135

progress, gathering 132-134

Stream lookup step, ltering

using 109, 110

word counng, precisely 111

streams

merging 131

spling 113

spling, based on condion 126

Stream split

PDI features, browsing 113

rows, copying 119, 120

rows, distribung 120-124

Stream split, based on condion

PDI, steps 128

task, assigning 128

tasks assignment, Filter rows step used 126,

127

tasks assignment, Switch/Case step used 129,

130

Structured Query Language. See SQL

subtransformaon, transformaon design

about 345

redening 347

using 345

working 346, 347

scores, calculang 341-345

Subversion 418

surrogate key 281

system informaon

data type 58

examinaon news le, updang 53-56

Get System Info step 57

table 218

me dimension 186

transformaon

command line arguments, using 314-317

designing, shortcuts 444

running, from repository 430, 431

steps 436

named parameters, using 314-317

transformaon, designing

color-coded logs 450

enhanced notes editor 450

job drill-down feature 449

mouse-over assistance 447, 448

mouse-over assistance toolbar, using 448

revamped repository explorer 450

sni-tesng feature 449

transformaon, enhancing

variables, seng 335, 336

variables, using 330-335

transformaon, running from repository

command line opons, specifying 431

steps 430, 431

transformaon design, enhancing

example 337-340

job, creang as process ow 348

jobs, nesng 354

transformaon iteraon. See job iteraon

transformaon steps

abort 436

add constants 436

add sequence 436

analyc query step 165

append streams 436

calculator 436

combinaon lookup/update 436

copy rows to result 436

database join 436

database lookup 436

data Validator 436

delay row 437

delete 437

dimension lookup/update 437

dummy 437

excel input 437

excel output 437

filter rows 437

fixed file input 437

formula 437

generate rows 437

get data from XML 437

get rows from result 438

Get System Info 438

Get Variables 438

Group by 438

If eld value is null 438

Insert / Update 438

mapping (sub-transformaon) 438

mapping input specicaon 438

mapping output specicaon 438

Merge Rows (di) 136

Modied Java Script Value 438

Number range 438

Regex Evaluaon 439

Replace in String 82

Row denormaliser 439

RowFlaener 184

Row Normaliser 439

select values 439

Set Variables 439

Sorted Merge 136

Sort rows 439

Split Fields 439

Split eld to rows 439

stream lookup 439

Switch / Case 439

table input 439

table output 439

text le input 440

text le output 440

update 440

Unique rows 184, 214

Univariate Stascs 184

User Dened Java Expression 83

Value Mapper 440

trap detector 135

Type II SCD

about 289-291

loading, Dimension lookup/update step used 291-294

using, to maintain entire history 289-291, 294

Type I SCD

loading, with combination lookup/update step 282-284

manufactures dimension, loading 284, 285

regions, adding 284

Ubuntu

MySQL, installing 32-34

unexpected errors, avoiding

data, cleansing 213

data, validang 206-210

genres eld, validang 206, 207

unstructured les

contest les, modifying 165

modifying 164

previous rows, viewing 164

reading 162, 163

user accounts, repository

administrator 422

dening 422, 423

predened user, admin 423

predened user, guest 423

read-only 422

user 422

variables. See Kele variables, XML les

Windows

MySQL, installing 29-31

XML

about 67

PDI transformaons les 68

XML les

about 62

basic country informaon, building 62-66

data, obtaining 68

Kele variables 70

Thank you for buying

Pentaho 3.2 Data Integration:

Beginner's Guide

Packt Open Source Project Royalties

When we sell a book written on an Open Source project, we pay a royalty directly to that

project. Therefore by purchasing Pentaho 3.2 Data Integration: Beginner's Guide, Packt will

have given some of the money received to the Pentaho Data Integration project.

In the long term, we see ourselves and you—customers and readers of our books—as part of

the Open Source ecosystem, providing sustainable revenue for the projects we publish on.

Our aim at Packt is to establish publishing royalties as an essential part of the service and

support a business model that sustains Open Source.

If you're working with an Open Source project that you would like us to publish on, and

subsequently pay royalties to, please get in touch with us.

Writing for Packt

We welcome all inquiries from people who are interested in authoring. Book proposals

should be sent to author@packtpub.com. If your book idea is still at an early stage and you

would like to discuss it rst before writing a formal book proposal, contact us; one of our

commissioning editors will get in touch with you.

We're not just looking for published authors; if you have strong technical skills but no writing

experience, our experienced editors can help you develop a writing career, or simply get some

additional reward for your expertise.

About Packt Publishing

Packt, pronounced 'packed', published its first book "Mastering phpMyAdmin for Effective

MySQL Management" in April 2004 and subsequently continued to specialize in publishing

highly focused books on specific technologies and solutions.

Our books and publications share the experiences of your fellow IT professionals in adapting

and customizing today's systems, applications, and frameworks. Our solution-based books

give you the knowledge and power to customize the software and technologies you're using

to get the job done. Packt books are more specific and less general than the IT books you have

seen in the past. Our unique business model allows us to bring you more focused information,

giving you more of what you need to know, and less of what you don't.

Packt is a modern, yet unique publishing company, which focuses on producing quality,

cutting-edge books for communities of developers, administrators, and newbies alike. For

more information, please visit our website: www.PacktPub.com.

Pentaho Reporting 3.5 for

Java Developers

ISBN: 978-1-847193-19-3 Paperback: 384 pages

Create advanced reports, including cross tabs,

sub-reports, and charts that connect to practically any

data source using open source Pentaho Reporting

1. Create great-looking enterprise reports in

PDF, Excel, and HTML with Pentaho's Open

Source Reporting Suite, and integrate report

generation into your existing Java application

with minimal hassle

2. Use data source options to develop advanced

graphs, graphics, cross tabs, and sub-reports

3. Dive deeply into the Pentaho Reporting

Engine's XML and Java APIs to create

dynamic reports

Practical Data Analysis and

Reporting with BIRT

ISBN: 978-1-847191-09-0 Paperback: 312 pages

Use the open-source Eclipse-based Business

Intelligence and Reporting Tools system to design

and create reports quickly

1. Get started with BIRT Report Designer

2. Develop the skills to get the most from it

3. Transform raw data into visual and

interactive content

4. Design, manage, format, and deploy high-quality reports

Please check www.PacktPub.com for information on our titles

Oracle Warehouse Builder 11g:

Getting Started

ISBN: 978-1-847195-74-6 Paperback: 368 pages

Extract, Transform, and Load data to build a

dynamic, operational data warehouse

1. Build a working data warehouse from scratch

with Oracle Warehouse Builder

2. Cover techniques in Extracting, Transforming,

and Loading data into your data warehouse

3. Learn about the design of a data warehouse

by using a multi-dimensional design with an

underlying relational star schema.

Creating your MySQL Database:

Practical Design Tips and

Techniques

ISBN: 978-1-904811-30-5 Paperback: 108 pages

A short guide for everyone on how to structure

your data and set-up your MySQL database tables

efciently and easily

1. How best to collect, name, group, and structure

your data

2. Design your data with future growth in mind

3. Practical examples from initial ideas to final

designs

4. The quickest way to learn how to design good

data structures for MySQL
