Packtpub.Pentaho.3.2.Data.Integration.Beginners.Guide.Apr.2010
User Manual: Pdf
Open the PDF directly: View PDF .
Page Count: 493 [warning: Documents this large are best viewed by clicking the View PDF Link!]
- Cover
- Copyright
- Credits
- Foreword
- The Kettle Project
- About the Author
- About the Reviewers
- Table of Contents
- Preface
- Chapter 1: Getting started with Pentaho Data Integration
- Pentaho Data Integration and Pentaho BI Suite
- Pentaho Data Integration
- Installing PDI
- Time for action – installing PDI
- Launching the PDI graphical designer: Spoon
- Time for action – starting and customizing Spoon
- Time for action – creating a hello world transformation
- Time for action – running and previewing the hello_world
- transformation
- Installing MySQL
- Time for action – installing MySQL on Windows
- Time for action – installing MySQL on Ubuntu
- Summary
- Chapter 2: Getting Started with Transformations
- Reading data from files
- Time for action – reading results of football matches from files
- Time for action – reading all your files at a time using a single
- Text file input step
- Time for action – reading all your files at a time using a single
- Text file input step and regular expressions
- Sending data to files
- Time for action – sending the results of matches to a plain file
- Getting system information
- Time for action – updating a file with news about examinations
- Time for action – running the examination transformation from
- a terminal window
- XML files
- Time for action – getting data from an XML file with information
- about countries
- Summary
- Chapter 3: Basic data manipulation
- Basic calculations
- Time for action – reviewing examinations by using the
- Calculator step
- Time for action – reviewing examinations by using the
- Formula step
- Calculations on groups of rows
- Time for action – calculating World Cup statistics by
- grouping data
- Filtering
- Time for action – counting frequent words by filtering
- Looking up data
- Time for action – finding out which language people speak
- Summary
- Chapter 4: Controlling the Flow of Data
- Splitting streams
- Time for action – browsing new PDI features by copying
- a dataset
- Time for action – assigning tasks by distributing
- Splitting the stream based on conditions
- Time for action – assigning tasks by filtering priorities with the
- Filter rows step
- Time for action – assigning tasks by filtering priorities with the
- Switch/ Case step
- Merging streams
- Time for action – gathering progress and merging all together
- Time for action – giving priority to Bouchard by using
- Append Stream
- Summary
- Chapter 5: Transforming Your Data with JavaScript Code and the JavaScript Step
- Doing simple tasks with the JavaScript step
- Time for action – calculating scores with JavaScript
- Time for action – testing the calculation of averages
- Enriching the code
- Time for action – calculating flexible scores by using variables
- Reading and parsing unstructured files
- Time for action – changing a list of house descriptions with
- JavaScript
- Avoiding coding by using purpose-built steps
- Summary
- Chapter 6: Transforming the Row Set
- Converting rows to columns
- Time for action – enhancing a films file by converting
- rows to columns
- Time for action – calculating total scores by performances
- by country
- Normalizing data
- Time for action – enhancing the matches file by normalizing
- the dataset
- Generating a custom time dimension dataset by using Kettle variables
- Time for action – creating the time dimension dataset
- Time for action – getting variables for setting the default
- starting date
- Summary
- Chapter 7: Validating Data and Handling Errors
- Capturing errors
- Time for action – capturing errors while calculating the age
- of a film
- Time for action – aborting when there are too many errors
- Time for action – treating errors that may appear
- Avoiding unexpected errors by validating data
- Time for action – validating genres with a Regex Evaluation step
- Time for action – checking films file with the Data Validator
- Summary
- Chapter 8: Working with Databases
- Introducing the Steel Wheels sample database
- Time for action – creating a connection with the Steel Wheels
- database
- Time for action – exploring the sample database
- Querying a database
- Time for action – getting data about shipped orders
- Time for action – getting orders in a range of dates by using
- parameters
- Time for action – getting orders in a range of dates by using
- variables
- Sending data to a database
- Time for action – loading a table with a list of manufacturers
- Time for action – inserting new products or updating
- existent ones
- Time for action – testing the update of existing products
- Eliminating data from a database
- Time for action – deleting data about discontinued items
- Summary
- Chapter 9: Performing Advanced Operations with Databases
- Preparing the environment
- Time for action – populating the Jigsaw database
- Looking up data in a database
- Time for action – using a Database lookup step to create a list
- of products to buy
- Time for action – using a Database join step to create a list of
- suggested products to buy
- Introducing dimensional modeling
- Loading dimensions with data
- Time for action – loading a region dimension with a
- Combination lookup/update step
- Time for action – testing the transformation that loads the
- region dimension
- Time for action – keeping a history of product changes with the
- Dimension lookup/update step
- Time for action – testing the transformation that keeps a history
- of product changes
- Summary
- Chapter 10: Creating Basic Task Flows
- Introducing PDI jobs
- Time for action – creating a simple hello world job
- Receiving arguments and parameters in a job
- Time for action – customizing the hello world file with
- arguments and parameters
- Running jobs from a terminal window
- Time for action – executing the hello world job from a terminal
- window
- Using named parameters and command-line arguments in transformations
- Time for action – calling the hello world transformation with
- fixed arguments and parameters
- Deciding between the use of a command-line argument and a named parameter
- Running job entries under conditions
- Time for action – sending a sales report and warning the
- administrator if something is wrong
- Summary
- Chapter 11: Creating Advanced Transformations and Jobs
- Enhancing your processes with the use of variables
- Time for action – updating a file with news about examinations
- by setting a variable with the name of the file
- Enhancing the design of your processes
- Time for action – generating files with top scores
- Time for action – calculating the top scores with a
- subtransformation
- Time for action – splitting the generation of top scores by
- copying and getting rows
- Time for action – generating the files with top scores by
- nesting jobs
- Iterating jobs and transformations
- Time for action – generating custom files by executing a
- transformation for every input row
- Summary
- Chapter 12: Developing and Implementing a Simple Datamart
- Exploring the sales datamart
- Loading the dimensions
- Time for action – loading dimensions for the sales datamart
- Extending the sales datamart model
- Loading a fact table with aggregated data
- Time for action – loading the sales fact table by looking up
- dimensions
- Getting facts and dimensions together
- Time for action – loading the fact table using a range of dates
- obtained from the command line
- Time for action – loading the sales star
- Getting rid of administrative tasks
- Time for action – automating the loading of the sales datamart
- Summary
- Chapter 13: Taking it Further
- Appendix A: Working with Repositories
- Creating a repository
- Time for action – creating a PDI repository
- Working with the repository storage system
- Time for action – logging into a repository
- Examining and modifying the contents of a repository with the Repository explorer
- Migrating from a file-based system to a repository-based system and vice-versa
- Summary
- Appendix B: Pan and Kitchen: Launching Transformations and Jobs from the Command Line
- Appendix C: Quick Reference: Steps and Job Entries
- Appendix D: Spoon Shortcuts
- Appendix E: Introducing PDI 4 Features
- Appendix F: Pop Quiz Answers
- Index
Pentaho 3.2 Data Integration
Beginner's Guide
Explore, transform, validate, and integrate your data with ease
María Carina Roldán
BIRMINGHAM - MUMBAI
Pentaho 3.2 Data Integration
Beginner's Guide
Copyright © 2010 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system,
or transmied in any form or by any means, without the prior wrien permission of the
publisher, except in the case of brief quotaons embedded in crical arcles or reviews.
Every eort has been made in the preparaon of this book to ensure the accuracy of the
informaon presented. However, the informaon contained in this book is sold without
warranty, either express or implied. Neither the author, Packt Publishing, nor its dealers or
distributors will be held liable for any damages caused or alleged to be caused directly or
indirectly by this book.
Packt Publishing has endeavored to provide trademark informaon about all the companies
and products menoned in this book by the appropriate use of capitals. However, Packt
Publishing cannot guarantee the accuracy of this informaon.
First published: April 2010
Producon Reference: 1050410
Published by Packt Publishing Ltd.
32 Lincoln Road
Olton
Birmingham, B27 6PA, UK.
ISBN 978-1-847199-54-6
www.packtpub.com
Cover Image by Parag Kadam (paragvkadam@gmail.com)
Credits
Author
María Carina Roldán
Reviewers
Jens Bleuel
Roland Bouman
Ma Casters
James Dixon
Will Gorman
Gretchen Moran
Acquision Editor
Usha Iyer
Development Editor
Reshma Sundaresan
Technical Editors
Gaurav Datar
Rukhsana Khambaa
Copy Editor
Sanchari Mukherjee
Editorial Team Leader
Gagandeep Singh
Project Team Leader
Lata Basantani
Project Coordinator
Poorvi Nair
Proofreader
Sandra Hopper
Indexer
Rekha Nair
Graphics
Geetanjali Sawant
Producon Coordinator
Shantanu Zagade
Cover Work
Shantanu Zagade
Foreword
If we look back at what has happened in the data integraon market over the last 10
years we can see a lot of change. In the rst half of that decade there was an explosion
in the number of data integraon tools and in the second half there was a big wave of
consolidaons. This consolidaon wave put an ever growing amount of data integraon
power in the hands of only a few large billion dollar companies. For any person, company
or project in need of data integraon, this meant either paying large amounts of money or
doing hand-coding of their soluon.
During that exact same period, we saw web servers, programming languages, operang
systems, and even relaonal databases turn into a commodity in the ICT market place. This
was driven among other things by the availability of open source soware such as Apache,
GNU, Linux, MySQL, and many others. For the ICT market, this meant that more services
could be deployed at a lower cost. If you look closely at what has been going on in those last
10 years, you will noce that most companies increasingly deployed more ICT services to
end-users. These services get more and more connected over an ever growing network.
Prey much anything ranging from ny mobile devices to huge cloud-based infrastructure
is being deployed and all those can contain data that is valuable to an organizaon.
The job of any person that needs to integrate all this data is not easy. Complexity of
informaon services technology usually increases exponenally with the number of systems
involved. Because of this, integrang all these systems can be a daunng and scary task that
is never complete. Any piece of code lives in what can be described as a soware ecosystem
that is always in a state of ux. Like in nature, certain ecosystems evolve extremely fast
where others change very slowly over me. However, like in nature all ICT systems change.
What is needed is another wave of commodicaon in the area of data integraon and
business intelligence in general. This is where Pentaho comes in.
Pentaho tries to provide answers to these problems by making the integraon soware
available as open source, accessible, easy to use, and easy to maintain for users and
developers alike. Every release of our soware we try to make things easier, beer, and
faster. However, even if things can be done with nice user interfaces, there are sll a huge
amount of possibilies and opons to choose from.
As the founder of the project I've always liked the fact that Kele users had a lot of choice.
Choice translates into creavity, and creavity oen delivers good soluons that are
comfortable to the person implemenng them. However, this choice can be daunng to any
beginning Kele developer. With thousands of opons to choose from, it can be very hard to
get started.
This is above all others the reason why I'm very happy to see this book come to life. It will
be a great and indispensable help for everyone that is taking steps into the wonderful world
of data integraon with Kele. As such, I hope you see this book as an open invitaon to get
started with Kele in the wonderful world of data integraon.
Ma Casters
Chief Data Integraon at Pentaho
Kele founder
The Kettle Project
Whether there is a migraon to do, an ETL process to run, or a need for massively loading
data into a database, you have several soware tools, ranging from expensive and
sophiscated to free open source and friendly ones, which help you accomplish the task.
Ten years ago, the scenario was clearly dierent. By 2000, Ma Casters, a Belgian business
intelligent consultant, had been working for a while as a datawarehouse architect and
administrator. As such, he was one of quite a number of people who, no maer if the
company they worked for was big or small, had to deal with the dicules that involve
bridging the gap between informaon technology and business needs. What made it even
worse at that me was that ETL tools were prohibively expensive and everything had to
be craed done. The last employer he worked for, didn't think that wring a new ETL tool
would be a good idea. This was one of the movaons for Ma to become an independent
contractor and to start his own company. That was in June 2001.
At the end of that year, he told his wife that he was going to write a new piece of soware
for himself to do ETL tasks. It was going to take up some me le and right in the evenings
and weekends. Surprised, she asked how long it would take you to get it done. He replied
that it would probably take ve years and that he perhaps would have something working
in three.
Working on that started in early 2003. Ma's main goals for wring the soware included
learning about databases, ETL processes, and data warehousing. This would in turn improve
his chances on a job market that was prey volale. Ulmately, it would allow him to work
full me on the soware.
Another important goal was to understand what the tool had to do. Ma wanted a scalable
and parallel tool, and wanted to isolate rows of data as much as possible.
The last but not least goal was to pick the right technology that would support the tool. The
rst idea was to build it on top of KDE, the popular Unix desktop environment. Trolltech, the
people behind Qt, the core UI library of KDE, had released database plans to create drivers
for popular databases. However, the lack of decent drivers for those databases drove Ma
to change plans and use Java. He picked Java because he had some prior experience as he
had wrien a Japanese Chess (Shogi) database program when Java 1.0 was released. To
Sun's credit, this soware sll runs and is available at http://ibridge.be/shogi/.
Aer a year of development, the tool was capable of reading text les, reading from
databases, wring to databases and it was very exible. The experience with Java was not
100% posive though. The code had grown unstructured, crashes occurred all too oen, and
it was hard to get something going with the Java graphic library used at that moment, the
Abstract Window Toolkit (AWT); it looked bad and it was slow.
As for the library, Ma decided to start using the newly released Standard Widget Toolkit
(SWT), which helped solve part of the problem. As for the rest, Kele was a complete mess.
It was me to ask for help. The help came in hands of Wim De Clercq, a senior enterprise
Java architect, co-owner of Ixor (www.ixor.be) and also friend of Ma. At various intervals
over the next few years, Wim involved himself in the project, giving advices to Ma about
good pracces in Java programming. Listening to that advice meant performing massive
amounts of code changes. As a consequence, it was not unusual to spend weekends doing
nothing but refactoring code and xing thousands of errors because of that. But, bit by bit,
things kept going in the right direcon.
At that same me, Ma also showed the results to his peers, colleagues, and other senior
BI consultants to hear what they thought of Kele. That was how he got in touch with the
Flemish Trac Centre (www.verkeerscentrum.be/verkeersinfo/kaart) where billions
of rows of data had to be integrated from thousands of data sources all over Belgium. All of
a sudden, he was being paid to deploy and improve Kele to handle that job. The diversity of
test cases at the trac center helped to improve Kele dramacally. That was somewhere in
2004 and Kele was by its version 1.2.
While working at Flemish, Ma also posted messages on Javaforge (www.javaforge.com)
to let people know they could download a free copy of Kele for their own use. He got a
few reacons. Despite some of them being remarkably negave, most were posive. The
most interesng response came from a nice guy called Jens Bleuel in Germany who asked if
it was possible to integrate third-party soware into Kele. In his specic case, he needed a
connector to link Kele with the German SAP soware (www.sap.com). Kele didn't have a
plugin architecture, so Jens' queson made Ma think about a plugin system, and that was
the main movaon for developing version 2.0.
For various reasons including the birth of Ma's son Sam and a lot of consultancy work,
it took around a year to release Kele version 2.0. It was a fairly complete release with
advanced support for slowly changing dimensions and junk dimensions (Chapter 9 explains
those concepts), ability to connect to thirteen dierent databases, and the most important
fact being support for plugins. Ma contacted Jens to let him know the news and Jens was
really interested. It was a very memorable moment for Ma and Jens as it took them only a
few hours to get a new plugin going that read data from an SAP/R3 server. There was a lot
of excitement, and they agreed to start promong the sales of Kele from the Kettle.be
website and from Prorao (www.proratio.de), the company Jens worked for.
Those were days of improvements, requests, people interested in the project. However, it
became too much to handle. Doing development and sales all by themselves was no fun
aer a while. As such, Ma thought about open sourcing Kele early in 2005 and by late
summer he made his decision. Jens and Prorao didn't mind and the decision was nal.
When they nally open sourced Kele on December 2005, the response was massive. The
downloadable package put up on Javaforge got downloaded around 35000 mes during rst
week only. The news got spread all over the world prey quickly.
What followed was a ood of messages, both private and on the forum. At its peak in March
2006, Ma got over 300 messages a day concerning Kele.
In no me, he was answering quesons like crazy, allowing people to join the development
team and working as a consultant at the same me. Added to this, the birth of his daughter
Hannelore in February 2006 was too much to deal with.
Fortunately, good mes came. While Ma was trying to handle all that, a discussion was
taking place at the Pentaho forum (http://forums.pentaho.org/) concerning the ETL
tool that Pentaho should support. They had selected Enhydra Octopus, a Java-based ETL
soware, but they didn't have a strong reliance on a specic tool.
While Jens was evaluang all sorts of open source BI packages, he came across that thread.
Ma replied immediately persuading people at Pentaho to consider including Kele. And
he must be convincing because the answer came quickly and was posive. James Dixon,
Pentaho founder and CTO, opened Kele the possibility to be the premier and only ETL
tool supported by Pentaho. Later on, Ma came in touch with one of the other Pentaho
founders, Richard Daley, who oered him a job. That allowed Ma to focus full-me on
Kele. Four years later, he's sll happily working for Pentaho as chief architect for data
integraon, doing the best eort to deliver Kele 4.0. Jens Bleuel, who collaborated with
Ma since the early versions, is now also part of the Pentaho team.
About the Author
María Carina was born in a small town in the Patagonia region in Argenna. She earned
her Bachelor degree in Computer Science at UNLP in La Plata and then moved to Buenos
Aires where she has lived since 1994 working in IT.
She has been working as a BI consultant for the last 10 years. At the beginning she worked
with Cognos suite. However, over the last three years, she has been dedicated, full me, to
developing Pentaho BI soluons both for local and several Lan-American companies, as well
as for a French automove company in the last months.
She is also an acve contributor to the Pentaho community.
At present, she lives in Buenos Aires, Argenna, with her husband Adrián and children
Camila and Nicolás.
Wring my rst book in a foreign language and working on a full me job
at the same me, not to menon the upbringing of two small kids, was
denitely a big challenge. Now I can tell that it's not impossible.
I dedicate this book to my husband and kids; I'd like to thank them for all
their support and tolerance over the last year. I'd also like to thank my
colleagues and friends who gave me encouraging words throughout the
wring process.
Special thanks to the people at Packt; working with them has been
really pleasant.
I'd also like to thank the Pentaho community and developers for making
Kele the incredible tool it is. Thanks to the technical reviewers who,
with their very crical eye, contributed to make this a book suited to
the audience.
Finally, I'd like to thank Ma Casters who, despite his busy schedule, was
willing to help me from the rst moment he knew about this book.
About the Reviewers
Jens Bleuel is a Senior Consultant and Engineer at Pentaho. He is also working as a project
leader, trainer, and product specialist in the services and support department. Before he
joined Pentaho in mid 2007, he was soware developer and project leader, and his main
business was Data Warehousing and the architecture along with designing and developing of
user friendly tools. He studied business economics, was on a grammar school for electronics,
and has been programming in a wide area of environments such as Assembler, C, Visual
Basic, Delphi, .NET, and these days mainly in Java. His customer focus is on the wholesale
market and consumer goods industries. Jens is 40 years old and lives with his wife and two
boys in Mainz, Germany (near the nice Rhine river). In his spare me, he pracces Tai-Chi,
Qigong, and photography.
Roland Bouman has been working in the IT industry since 1998, mostly as a database and
web applicaon developer. He has also worked for MySQL AB (later Sun Microsystems) as
cercaon developer and as curriculum developer.
Roland mainly focuses on open source web technology, databases, and Business Intelligence.
He's an acve member of the MySQL and Pentaho communies and can oen be found
speaking at worldwide conferences and events such as the MySQL user conference, the
O'Reilly Open Source conference (OSCON), and at Pentaho community events.
Roland is co-author of the MySQL 5.1 Cluster DBA Cercaon Study Guide (Vervante,
ISBN: 595352502) and Pentaho Soluons: Business Intelligence and Data Warehousing with
Pentaho and MySQL (Wiley, ISBN: 978-0-470-48432-6). He also writes on a regular basis for
the Dutch Database Magazine (DBM).
Roland is @rolandbouman on Twier and maintains a blog at
http://rpbouman.blogspot.com/.
Ma Casters has been an independent senior BI consultant for almost two decades. In that
period he led, designed, and implemented numerous data warehouses and BI soluons for
large and small companies. In that capacity, he always had the need for ETL in some form
or another. Almost out of pure necessity, he has been busy wring the ETL tool called Kele
(a.k.a. Pentaho Data Integraon) for the past eight years. First, he developed the tool mostly
on his own. Since the end of 2005 when Kele was declared an open source technology,
development took place with the help of a large community.
Since the Kele project was acquired by Pentaho in early 2006, he has been Chief of Data
Integraon at Pentaho as the lead architect, head of development, and spokesperson for the
Kele community.
I would like to personally thank the complete community for their help
in making Kele the success it is today. In parcular, I would like to thank
Maria for taking the me to write this nice book as well as the many
arcles on the Pentaho wiki (for example, the Kele tutorials), and her
appreciated parcipaon on the forum. Many thanks also go to my
employer Pentaho, for their large investment in open source BI in
general and Kele in parcular.
James Dixon is the Chief Geek and one of the co-founders of Pentaho Corporaon—the
leading commercial open source Business Intelligence company. He has worked in the
business intelligence market since graduang in 1992 from Southampton University with a
degree in Computer Science. He has served as Soware Engineer, Development Manager,
Engineering VP, and CTO at mulple business intelligence soware companies. He regularly
uses Pentaho Data Integraon for internal projects and was involved in the architectural
design of PDI V3.0.
He lives in Orlando, Florida, with his wife Tami and son Samuel.
I would like to thank my co-founders, my parents, and my wife Tami for all
their support and tolerance of my odd working hours.
I would like to thank my son Samuel for all the opportunies he gives me to
prove I'm not as clever as I think I am.
Will Gorman is an Engineering Team Lead at Pentaho. He works on a variety of Pentaho's
products, including Reporng, Analysis, Dashboards, Metadata, and the BI Server. Will
started his career at GE Research and earned his Masters degree in Computer Science at
Rensselaer Polytechnic Instute in Troy, New York. Will is the author of Pentaho Reporng
3.5 for Java Developers (ISBN: 3193), published by Packt Publishing.
Gretchen Moran is a graduate of University of Wisconsin – Stevens Point with a Bachelor's
degree in Computer Informaon Systems with a minor in Data Communicaons. Gretchen
began her career as a corporate data warehouse developer in the insurance industry and
joined Arbor Soware/Hyperion Soluons in 1999 as a commercial developer for the
Hyperion Analyzer and Web Analycs team. Gretchen has been a key player with Pentaho
Corporaon since its incepon in 2004. As Community Leader and core developer, Gretchen
managed the explosive growth of Pentaho's open source community for her rst 2 years
with the company. Gretchen has contributed to many of the Pentaho projects, including the
Pentaho BI Server, Pentaho Data Integraon, Pentaho Metadata Editor, Pentaho Reporng,
Pentaho Charng, and others.
Thanks Doug, Anthony, Isabella and Baby Jack for giving me my favorite
challenges and crowning achievements—being a wife and mom.
Table of Contents
Preface 1
Chapter 1: Geng started with Pentaho Data Integraon 7
Pentaho Data Integraon and Pentaho BI Suite 7
Exploring the Pentaho Demo 9
Pentaho Data Integraon 9
Using PDI in real world scenarios 11
Loading data warehouses or data marts 11
Integrang data 12
Data cleansing 12
Migrang informaon 13
Exporng data 13
Integrang PDI using Pentaho BI 13
Installing PDI 14
Time for acon – installing PDI 14
Launching the PDI graphical designer: Spoon 15
Time for acon – starng and customizing Spoon 15
Spoon 18
Seng preferences in the Opons window 18
Storing transformaons and jobs in a repository 19
Creang your rst transformaon 20
Time for acon – creang a hello world transformaon 20
Direcng the Kele engine with transformaons 25
Exploring the Spoon interface 26
Running and previewing the transformaon 27
Time for acon – running and previewing the
hello_world transformaon 27
Installing MySQL 29
Time for acon – installing MySQL on Windows 29
Time for acon – installing MySQL on Ubuntu 32
Summary 34
Table of Contents
[ ii ]
Chapter 2: Geng Started with Transformaons 35
Reading data from les 35
Time for acon – reading results of football matches from les 36
Input les 41
Input steps 41
Reading several les at once 42
Time for acon – reading all your les at a me using a single
Text le input step 42
Time for acon – reading all your les at a me using a single
Text le input step and regular expressions 43
Regular expressions 44
Grids 46
Sending data to les 47
Time for acon – sending the results of matches to a plain le 47
Output les 49
Output steps 50
Some data denions 50
Rowset 50
Streams 51
The Select values step 52
Geng system informaon 52
Time for acon – updang a le with news about examinaons 53
Geng informaon by using Get System Info step 57
Data types 58
Date elds 58
Numeric elds 59
Running transformaons from a terminal window 60
Time for acon – running the examinaon transformaon from
a terminal window 60
XML les 62
Time for acon – geng data from an XML le with informaon
about countries 62
What is XML 67
PDI transformaon les 68
Geng data from XML les 68
XPath 68
Conguring the Get data from XML step 69
Kele variables 70
How and when you can use variables 70
Summary 72
Table of Contents
[ iii ]
Chapter 3: Basic data manipulaon 73
Basic calculaons 73
Time for acon – reviewing examinaons by using the Calculator step 74
Adding or modifying elds by using dierent PDI steps 82
The Calculator step 83
The Formula step 84
Time for acon – reviewing examinaons by using the Formula step 84
Calculaons on groups of rows 88
Time for acon – calculang World Cup stascs by grouping data 89
Group by step 94
Filtering 97
Time for acon – counng frequent words by ltering 97
Filtering rows using the Filter rows step 103
Looking up data 105
Time for acon – nding out which language people speak 105
The Stream lookup step 109
Summary 112
Chapter 4: Controlling the Flow of Data 113
Spling streams 113
Time for acon – browsing new PDI features by copying a dataset 114
Copying rows 119
Distribung rows 120
Time for acon – assigning tasks by distribung 121
Spling the stream based on condions 125
Time for acon – assigning tasks by ltering priories with the Filter rows step 126
PDI steps for spling the stream based on condions 128
Time for acon – assigning tasks by ltering priories with the Switch/ Case step 129
Merging streams 131
Time for acon – gathering progress and merging all together 132
PDI opons for merging streams 134
Time for acon – giving priority to Bouchard by using Append Stream 137
Summary 139
Chapter 5: Transforming Your Data with JavaScript Code and
the JavaScript Step 141
Doing simple tasks with the JavaScript step 141
Time for acon – calculang scores with JavaScript 142
Using the JavaScript language in PDI 147
Inserng JavaScript code using the Modied Java Script Value step 148
Adding elds 150
Table of Contents
[ iv ]
Modifying elds 150
Turning on the compability switch 151
Tesng your code 151
Time for acon – tesng the calculaon of averages 152
Tesng the script using the Test script buon 153
Enriching the code 154
Time for acon – calculang exible scores by using variables 154
Using named parameters 158
Using the special Start, Main, and End scripts 159
Using transformaon predened constants 159
Reading and parsing unstructured les 162
Time for acon – changing a list of house descripons with JavaScript 162
Looking at previous rows 164
Avoiding coding by using purpose-built steps 165
Summary 167
Chapter 6: Transforming the Row Set 169
Converng rows to columns 169
Time for acon – enhancing a lms le by converng rows to columns 170
Converng row data to column data by using the Row denormalizer step 173
Aggregang data with a Row denormalizer step 176
Time for acon – calculang total scores by performances by country 177
Using Row denormalizer for aggregang data 178
Normalizing data 180
Time for acon – enhancing the matches le by normalizing the dataset 180
Modifying the dataset with a Row Normalizer step 182
Summarizing the PDI steps that operate on sets of rows 184
Generang a custom me dimension dataset by using Kele variables 186
Time for acon – creang the me dimension dataset 187
Geng variables 191
Time for acon – geng variables for seng the default starng date 192
Using the Get Variables step 193
Summary 194
Chapter 7: Validang Data and Handling Errors 195
Capturing errors 195
Time for acon – capturing errors while calculang the age of a lm 196
Using PDI error handling funconality 200
Aborng a transformaon 201
Time for acon – aborng when there are too many errors 202
Aborng a transformaon using the Abort step 203
Fixing captured errors 203
Table of Contents
[ v ]
Time for acon – treang errors that may appear 203
Treang rows coming to the error stream 205
Avoiding unexpected errors by validang data 206
Time for acon – validang genres with a Regex Evaluaon step 206
Validang data 208
Time for acon – checking lms le with the Data Validator 209
Dening simple validaon rules using the Data Validator 211
Cleansing data 213
Summary 215
Chapter 8: Working with Databases 217
Introducing the Steel Wheels sample database 217
Connecng to the Steel Wheels database 219
Time for acon – creang a connecon with the Steel Wheels database 219
Connecng with Relaonal Database Management Systems 222
Exploring the Steel Wheels database 223
Time for acon – exploring the sample database 224
A brief word about SQL 225
Exploring any congured database with the PDI Database explorer 228
Querying a database 229
Time for acon – geng data about shipped orders 229
Geng data from the database with the Table input step 231
Using the SELECT statement for generang a new dataset 232
Making exible queries by using parameters 234
Time for acon – geng orders in a range of dates by using parameters 234
Making exible queries by using Kele variables 236
Time for acon – geng orders in a range of dates by using variables 237
Sending data to a database 239
Time for acon – loading a table with a list of manufacturers 239
Inserng new data into a database table with the Table output step 245
Inserng or updang data by using other PDI steps 246
Time for acon – inserng new products or updang existent ones 246
Time for acon – tesng the update of exisng products 249
Inserng or updang data with the Insert/Update step 251
Eliminang data from a database 256
Time for acon – deleng data about disconnued items 256
Deleng records of a database table with the Delete step 259
Summary 260
Chapter 9: Performing Advanced Operaons with Databases 261
Preparing the environment 261
Time for acon – populang the Jigsaw database 261
Exploring the Jigsaw database model 264
Table of Contents
[ vi ]
Looking up data in a database 266
Doing simple lookups 266
Time for acon – using a Database lookup step to create a list of products to buy 266
Looking up values in a database with the Database lookup step 268
Doing complex lookups 270
Time for acon – using a Database join step to create a list of
suggested products to buy 270
Joining data from the database to the stream data by using a Database join step 272
Introducing dimensional modeling 275
Loading dimensions with data 276
Time for acon – loading a region dimension with a
Combinaon lookup/update step 276
Time for acon – tesng the transformaon that loads the region dimension 279
Describing data with dimensions 281
Loading Type I SCD with a Combinaon lookup/update step 282
Keeping a history of changes 286
Time for acon – keeping a history of product changes with the
Dimension lookup/update step 286
Time for acon – tesng the transformaon that keeps a history
of product changes 288
Keeping an enre history of data with a Type II slowly changing dimension 289
Loading Type II SCDs with the Dimension lookup/update step 291
Summary 296
Chapter 10: Creang Basic Task Flows 297
Introducing PDI jobs 297
Time for acon – creang a simple hello world job 298
Execung processes with PDI jobs 305
Using Spoon to design and run jobs 306
Using the transformaon job entry 307
Receiving arguments and parameters in a job 309
Time for acon – customizing the hello world le with
arguments and parameters 309
Using named parameters in jobs 312
Running jobs from a terminal window 312
Time for acon – execung the hello world job from a terminal window 313
Using named parameters and command-line arguments in transformaons 314
Time for acon – calling the hello world transformaon with
xed arguments and parameters 315
Deciding between the use of a command-line argument and a named parameter 317
Running job entries under condions 318
Table of Contents
[ vii ]
Time for acon – sending a sales report and warning the
administrator if something is wrong 318
Changing the ow of execuon on the basis of condions 324
Creang and using a le results list 326
Summary 327
Chapter 11: Creang Advanced Transformaons and Jobs 329
Enhancing your processes with the use of variables 329
Time for acon – updang a le with news about examinaons by seng
a variable with the name of the le 330
Seng variables inside a transformaon 335
Enhancing the design of your processes 337
Time for acon – generang les with top scores 337
Reusing part of your transformaons 341
Time for acon – calculang the top scores with a subtransformaon 341
Creang and using subtransformaons 345
Creang a job as a process ow 348
Time for acon – spling the generaon of top scores by
copying and geng rows 348
Transferring data between transformaons by using the copy /get rows mechanism 352
Nesng jobs 354
Time for acon – generang the les with top scores by nesng jobs 354
Running a job inside another job with a job entry 355
Understanding the scope of variables 356
Iterang jobs and transformaons 357
Time for acon – generang custom les by execung a transformaon
for every input row 358
Execung for each row 361
Summary 366
Chapter 12: Developing and Implemenng a Simple Datamart 367
Exploring the sales datamart 367
Deciding the level of granularity 370
Loading the dimensions 370
Time for acon – loading dimensions for the sales datamart 371
Extending the sales datamart model 376
Loading a fact table with aggregated data 378
Time for acon – loading the sales fact table by looking up dimensions 378
Geng the informaon from the source with SQL queries 384
Translang the business keys into surrogate keys 388
Obtaining the surrogate key for a Type I SCD 388
Obtaining the surrogate key for a Type II SCD 389
Obtaining the surrogate key for the Junk dimension 391
Obtaining the surrogate key for the Time dimension 391
Table of Contents
[ viii ]
Geng facts and dimensions together 394
Time for acon – loading the fact table using a range of dates obtained
from the command line 394
Time for acon – loading the sales star 396
Geng rid of administrave tasks 399
Time for acon – automang the loading of the sales datamart 399
Summary 403
Chapter 13: Taking it Further 405
PDI best pracces 405
Geng the most out of PDI 408
Extending Kele with plugins 408
Overcoming real world risks with some remote execuon 410
Scaling out to overcome bigger risks 411
Integrang PDI and the Pentaho BI suite 412
PDI as a process acon 412
PDI as a datasource 413
More about the Pentaho suite 414
PDI Enterprise Edion and Kele Developer Support 415
Summary 416
Appendix A: Working with Repositories 417
Creang a repository 418
Time for acon – creang a PDI repository 418
Creang repositories to store your transformaons and jobs 420
Working with the repository storage system 421
Time for acon – logging into a repository 421
Logging into a repository by using credenals 422
Dening repository user accounts 422
Creang transformaons and jobs in repository folders 423
Creang database connecons, parons, servers, and clusters 424
Backing up and restoring a repository 424
Examining and modifying the contents of a repository with
the Repository explorer 424
Migrang from a le-based system to a repository-based system and
vice-versa 426
Summary 427
Appendix B: Pan and Kitchen: Launching Transformaons and
Jobs from the Command Line 429
Running transformaons and jobs stored in les 429
Running transformaons and jobs from a repository 430
Specifying command line opons 431
Table of Contents
[ ix ]
Checking the exit code 432
Providing opons when running Pan and Kitchen 432
Log details 433
Named parameters 433
Arguments 433
Variables 433
Appendix C: Quick Reference: Steps and Job Entries 435
Transformaon steps 436
Job entries 440
Appendix D: Spoon Shortcuts 443
General shortcuts 443
Designing transformaons and jobs 444
Grids 445
Repositories 445
Appendix E: Introducing PDI 4 Features 447
Agile BI 447
Visual improvements for designing transformaons and jobs 447
Experiencing the mouse-over assistance 447
Time for acon – creang a hop with the mouse-over assistance 448
Using the mouse-over assistance toolbar 448
Experiencing the sni-tesng feature 449
Experiencing the job drill-down feature 449
Experiencing even more visual changes 450
Enterprise features 450
Summary 450
Appendix F: Pop Quiz Answers 451
Chapter 1 451
PDI data sources 451
PDI prerequisites 451
PDI basics 451
Chapter 2 452
formang data 452
Chapter 3 452
concatenang strings 452
Chapter 4 452
data movement (copying and distribung) 452
spling a stream 452
Chapter 5 453
nding the seven errors 453
Table of Contents
[ x ]
Chapter 6 453
using Kele variables inside transformaons 453
Chapter 7 453
PDI error handling 453
Chapter 8 454
dening database connecons 454
database datatypes versus PDI datatypes 454
Insert/Update step versus Table Output/Update steps 454
ltering the rst 10 rows 454
Chapter 9 454
loading slowly changing dimensions 454
loading type III slowly changing dimensions 455
Chapter 10 455
dening PDI jobs 455
Chapter 11 455
using the Add sequence step 455
deciding the scope of variables 455
Chapter 12 456
modifying a star model and loading the star with PDI 456
Chapter 13 456
remote execuon and clustering 456
Index 457
Preface
Pentaho Data Integraon (aka Kele) is an engine along with a suite of tools responsible
for the processes of Extracng, Transforming, and Loading—beer known as the ETL
processes. PDI not only serves as an ETL tool, but it's also used for other purposes such as
migrang data between applicaons or databases, exporng data from databases to at
les, data cleansing, and much more. PDI has an intuive, graphical, drag-and-drop design
environment, and its ETL capabilies are powerful. However, geng started with PDI can be
dicult or confusing. This book provides the guidance needed to overcome that diculty,
covering the key features of PDI. Each chapter introduces new features, allowing you to
gradually get involved with the tool.
By the end of the book, you will have not only experimented with all kinds of examples, but
will also have built a basic but complete datamart with the help of PDI.
How to read this book
Although it is recommended that you read all the chapters, you don't need to. The book
allows you to tailor the PDI learning process according to your parcular needs.
The rst four chapters, along with Chapter 7 and Chapter 10, cover the core concepts. If
you don't know PDI and want to learn just the basics, reading those chapters would suce.
Besides, if you need to work with databases, you could include Chapter 8 in the roadmap.
If you already know the basics, you can improve your PDI knowledge by reading chapters 5,
6, and 11.
Finally, if you already know PDI and want to learn how to use it to load or maintain a
datawarehouse or datamart, you will nd all that you need in chapters 9 and 12.
While Chapter 13 is useful for anyone who is willing to take it further, all the appendices are
valuable resources for anyone who reads this book.
Preface
[ 2 ]
What this book covers
Chapter 1, Geng started with Pentaho Data Integraon serves as the most basic
introducon to PDI, presenng the tool. The chapter includes instrucons for installing PDI
and gives you the opportunity to play with the graphical designer (Spoon). The chapter also
includes instrucons for installing a MySQL server.
Chapter 2, Geng Started with Transformaons introduces one of the basic components
of PDI—transformaons. Then, it focuses on the explanaon of how to work with les. It
explains how to get data from simple input sources such as txt, csv, xml, and so on, do a
preview of the data, and send the data back to any of these common output formats. The
chapter also explains how to read command-line parameters and system informaon.
Chapter 3, Basic Data Manipulaon explains the simplest and most commonly used ways of
transforming data, including performing calculaons, adding constants, counng, ltering,
ordering, and looking for data.
Chapter 4—Controlling the Flow of Data explains dierent opons that PDI oers to combine
or split ows of data.
Chapter 5, Transforming Your Data with JavaScript Code and the JavaScript Step explains how
JavaScript coding can help in the treatment of data. It shows why you need to code inside
PDI, and explains in detail how to do it.
Chapter 6, Transforming the Row Set explains the ability of PDI to deal with some
sophiscated problems, such as normalizing data from pivoted tables, in a simple fashion.
Chapter 7, Validang Data and Handling Errors explains the dierent opons that PDI has to
validate data, and how to treat the errors that may appear.
Chapter 8, Working with Databases explains how to use PDI to work with databases. The
list of topics covered includes connecng to a database, previewing and geng data, and
inserng, updang, and deleng data. As database knowledge is not presumed, the chapter
also covers fundamental concepts of databases and the SQL language.
Chapter 9, Performing Advanced Operaons with Databases explains how to perform
advanced operaons with databases, including those specially designed to load
datawarehouses. A primer on datawarehouse concepts is also given in case you are not
familiar with the subject.
Chapter 10, Creang Basic Task Flow serves as an introducon to processes in PDI. Through
the creaon of simple jobs, you will learn what jobs are and what they are used for.
Chapter 11, Creang Advanced Transformaons and Jobs deals with advanced concepts that
will allow you to build complex PDI projects. The list of covered topics includes nesng jobs,
iterang on jobs and transformaons, and creang subtransformaons.
Preface
[ 3 ]
Chapter 12, Developing and implemenng a simple datamart presents a simple datamart
project, and guides you to build the datamart by using all the concepts learned throughout
the book.
Chapter 13, Taking it Further gives a list of best PDI pracces and recommendaons for
going beyond.
Appendix A, Working with repositories guides you step by step in the creaon of a PDI
database repository and then gives instrucons to work with it.
Appendix B, Pan and Kitchen: Launching Transformaons and Jobs from the Command Line is
a quick reference for running transformaons and jobs from the command line.
Appendix C, Quick Reference: Steps and Job Entries serves as a quick reference to steps and
job entries used throughout the book.
Appendix D, Spoon Shortcuts is an extensive list of Spoon shortcuts useful for saving me
when designing and running PDI jobs and transformaons.
Appendix E, Introducing PDI 4 features quickly introduces you to the architectural and
funconal features included in Kele 4—the version that was under development while
wring this book.
Appendix F, Pop Quiz Answers, contains answers to pop quiz quesons.
What you need for this book
PDI is a mulplaorm tool. This means no maer what your operang system is, you will
be able to work with the tool. The only prerequisite is to have JVM 1.5 or a higher version
installed. It is also useful to have Excel or Calc along with a nice text editor.
Having an Internet connecon while reading is extremely useful as well. Several links are
provided throughout the book that complement what is explained. Besides, there is the
PDI forum where you may search or post doubts if you are stuck with something.
Who this book is for
This book is for soware developers, database administrators, IT students, and everyone
involved or interested in developing ETL soluons or, more generally, doing any kind of data
manipulaon. If you have never used PDI before, this will be a perfect book to start with.
You will nd this book to be a good starng point if you are a database administrator, a data
warehouse designer, an architect, or any person who is responsible for data warehouse
projects and need to load data into them.
Preface
[ 4 ]
You don't need to have any prior data warehouse or database experience to read this book.
Fundamental database and data warehouse technical terms and concepts are explained in
an easy-to-understand language.
Conventions
In this book, you will nd a number of styles of text that disnguish between dierent
kinds of informaon. Here are some examples of these styles, and an explanaon of
their meaning.
Code words in text are shown as follows: "You read the examination.txt le, and did
some calculaons to see how the students did."
New terms and important words are shown in bold. Words that you see on the screen, in
menus or dialog boxes for example, appear in our text like this: "Edit the Sort rows step by
double-clicking it, click the Get Fields buon, and adjust the grid."
Warnings or important notes appear in a box like this.
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book—
what you liked or may have disliked. Reader feedback is important for us to develop tles
that you really get the most out of.
To send us general feedback, simply drop an email to feedback@packtpub.com, and
menon the book tle in the subject of your message.
If there is a book that you need and would like to see us publish, please send us a note in the
SUGGEST A TITLE form on www.packtpub.com or email suggest@packtpub.com.
If there is a topic that you have experse in and you are interested in either wring or
contribung to a book, see our author guide on www.packtpub.com/authors.
Preface
[ 5 ]
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you
to get the most from your purchase.
Downloading the example code for the book
Visit http://www.packtpub.com/files/code/9546_Code.zip to
directly download the example code.
The downloadable les contain instrucons on how to use them.
Errata
Although we have taken every care to ensure the accuracy of our contents, mistakes do
happen. If you nd a mistake in one of our books—maybe a mistake in text or code—we
would be grateful if you would report this to us. By doing so, you can save other readers
from frustraon, and help us to improve subsequent versions of this book. If you nd any
errata, please report them by vising http://www.packtpub.com/support, selecng
your book, clicking on the let us know link, and entering the details of your errata.
Once your errata are veried, your submission will be accepted and the errata added
to any list of exisng errata. Any exisng errata can be viewed by selecng your tle
from http://www.packtpub.com/support.
Piracy
Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt,
we take the protecon of our copyright and licenses very seriously. If you come across any
illegal copies of our works in any form on the Internet, please provide us with the locaon
address or website name immediately so that we can pursue a remedy.
Please contact us at copyright@packtpub.com with a link to the suspected pirated material.
We appreciate your help in protecng our authors, and our ability to bring you
valuable content.
Questions
You can contact us at questions@packtpub.com if you are having a problem with any
aspect of the book, and we will do our best to address it.
1
Getting Started with Pentaho
Data Integration
Pentaho Data Integraon is an engine along with a suite of tools responsible
for the processes of extracng, transforming, and loading—best known as the
ETL processes. This book is meant to teach you how to use PDI.
In this chapter you will:
Learn what Pentaho Data Integraon is
Install the soware and start working with the PDI graphical designer
Install MySQL, a database engine that you will use when you start working
with databases
Pentaho Data Integration and Pentaho BI Suite
Before introducing PDI, let's talk about Pentaho BI Suite. The Pentaho Business Intelligence
Suite is a collecon of soware applicaons intended to create and deliver soluons for
decision making. The main funconal areas covered by the suite are:
Analysis: The analysis engine serves muldimensional analysis. It's provided by the
Mondrian OLAP server and the JPivot library for navigaon and exploring.
Geng Started with Pentaho Data Integraon
[ 8 ]
Reporng: The reporng engine allows designing, creang, and distribung reports
in various known formats (HTML, PDF, and so on) from dierent kinds of sources.
The reports created in Pentaho are based mainly in the JFreeReport library, but it's
possible to integrate reports created with external reporng libraries such as Jasper
Reports or BIRT.
Data Mining: Data mining is running data through algorithms in order to understand
the business and do predicve analysis. Data mining is possible thanks to the
Weka Project.
Dashboards: Dashboards are used to monitor and analyze Key Performance
Indicators (KPIs). A set of tools incorporated to the BI Suite in the latest version
allows users to create interesng dashboards, including graphs, reports, analysis
views, and other Pentaho content, without much eort.
Data integraon: Data integraon is used to integrate scaered informaon
from dierent sources (applicaons, databases, les) and make the integrated
informaon available to the nal user. Pentaho Data Integraon—our main
concern—is the engine that provides this funconality.
All this funconality can be used standalone as well as integrated. In order to run analysis,
reports, and so on integrated as a suite, you have to use the Pentaho BI Plaorm. The
plaorm has a soluon engine, and oers crical services such as authencaon,
scheduling, security, and web services.
Chapter 1
[ 9 ]
This set of soware and services forms a complete BI Plaorm, which makes Pentaho Suite
the world's leading open source Business Intelligence Suite.
Exploring the Pentaho Demo
Despite being out of the scope of this book, it's worth to briey introduce the Pentaho
Demo. The Pentaho BI Plaorm Demo is a precongured installaon that lets you explore
several capabilies of the Pentaho plaorm. It includes sample reports, cubes, and
dashboards for Steel Wheels. Steel Wheels is a conal store that sells all kind of scale
replicas of vehicles.
The demo can be downloaded from http://sourceforge.net/projects/pentaho/
files/. Under the Business Intelligence Server folder, look for the latest stable
version. The le you have to download is named biserver-ce-3.5.2.stable.zip for
Windows and biserver-ce-3.5.2.stable.tar.gz for other systems.
In the same folder you will nd a le named biserver-getting_started-ce-
3.5.0.pdf. The le is a guide that introduces you the plaorm and gives you some
guidance on how to install and run it. The guide even includes a mini tutorial on building
a simple PDI input-output transformaon.
You can nd more about Pentaho BI Suite at www.pentaho.org.
Pentaho Data Integration
Most of the Pentaho engines, including the engines menoned earlier, were created as
community projects and later adopted by Pentaho. The PDI engine is no excepon—Pentaho
Data Integraon is the new denominaon for the business intelligence tool born as Kele.
The name Kele didn't come from the recursive acronym Kele Extracon,
Transportaon, Transformaon, and Loading Environment it has now, but from
KDE Extracon, Transportaon, Transformaon and Loading Environment,
as the tool was planned to be wrien on top of KDE, as menoned in the
introducon of the book.
In April 2006 the Kele project was acquired by the Pentaho Corporaon and Ma Casters,
Kele's founder, also joined the Pentaho team as a Data Integraon Architect.
Geng Started with Pentaho Data Integraon
[ 10 ]
When Pentaho announced the acquision, James Dixon, the Chief Technology Ocer, said:
We reviewed many alternaves for open source data integraon, and Kele clearly
had the best architecture, richest funconality, and most mature user interface.
The open architecture and superior technology of the Pentaho BI Plaorm
and Kele allowed us to deliver integraon in only a few days, and make that
integraon available to the community.
By joining forces with Pentaho, Kele beneted from a huge developer community, as well
as from a company that would support the future of the project.
From that moment the tool has grown constantly. Every few months a new release is
available, bringing to the users, improvements in performance and exisng funconality,
new funconality, ease of use, and great changes in look and feel. The following is a meline
of the major events related to PDI since its acquision by Pentaho:
June 2006: PDI 2.3 is released. Numerous developers had joined the project and
there were bug xes provided by people in various regions of the world. Among
other changes, the version included enhancements for large scale environments
and mullingual capabilies.
February 2007: Almost seven months aer the last major revision, PDI 2.4 is
released including remote execuon and clustering support (more on this in
Chapter 13), enhanced database support, and a single designer for the two
main elements you design in Kele—jobs and transformaons.
May 2007: PDI 2.5 is released including many new features, the main feature being
the advanced error handling.
November 2007: PDI 3.0 emerges totally redesigned. Its major library changed to
gain massive performance. The look and feel also changed completely.
October 2008: PDI 3.1 comes with an easier-to-use tool, along with a lot of new
funconalies as well.
April 2009: PDI 3.2 is released with a really large number of changes for a
minor version—new funconality, visualizaon improvements, performance
improvements, and a huge pile of bug xes. The main change in this version was the
incorporaon of dynamic clustering (see Chapter 13 for details).
In 2010 PDI 4.0 will be released, delivering mostly improvements with regard to
enterprise features such as version control.
Most users sll refer to PDI as Kele, its further name. Therefore, the names PDI,
Pentaho Data Integraon, and Kele will be used interchangeably throughout
the book.
Chapter 1
[ 11 ]
Using PDI in real world scenarios
Paying aenon to its name, Pentaho Data Integraon, you could think of PDI as a tool to
integrate data.
In you look at its original name, K.E.T.T.L.E., then you must conclude that it is a tool used
for ETL processes which, as you may know, are most frequently seen in data warehouse
environments.
In fact, PDI not only serves as a data integrator or an ETL tool, but is such a powerful tool
that it is common to see it used for those and for many other purposes. Here you have
some examples.
Loading datawarehouses or datamarts
The loading of a datawarehouse or a datamart involves many steps, and there are many
variants depending on business area or business rules. However, in every case, the process
involves the following steps:
Extracng informaon from one or dierent databases, text les, and other sources.
The extracon process may include the task of validang and discarding data that
doesn't match expected paerns or rules.
Transforming the obtained data to meet the business and technical needs required
on the target. Transformaon implies tasks such as converng data types, doing
some calculaons, ltering irrelevant data, and summarizing.
Loading the transformed data into the target database. Depending on the
requirements, the loading may overwrite the exisng informaon, or may
add new informaon each me it is executed.
Geng Started with Pentaho Data Integraon
[ 12 ]
Kele comes ready to do every stage of this loading process. The following sample
screenshot shows a simple ETL designed with Kele:
Integrating data
Imagine two similar companies that need to merge their databases in order to have a unied
view of the data, or a single company that has to combine informaon from a main ERP
applicaon and a CRM applicaon, though they're not connected. These are just two of
hundreds of examples where data integraon is needed. Integrang data is not just a maer
of gathering and mixing data; some conversions, validaon, and transport of data has to be
done. Kele is meant to do all those tasks.
Data cleansing
Why do we need that data be correct and accurate? There are many reasons—for the
eciency of business, to generate trusted conclusions in data mining or stascal studies,
to succeed when integrang data, and so on. Data cleansing is about ensuring that the
data is correct and precise. This can be ensured by verifying if the data meets certain rules,
discarding or correcng those that don't follow the expected paern, seng default values
for missing data, eliminang informaon that is duplicated, normalizing data to conform
minimum and maximum values, and so on—tasks that Kele makes possible, thanks to its
vast set of transformaon and validaon capabilies.
Chapter 1
[ 13 ]
Migrating information
Think of a company of any size that uses a commercial ERP applicaon. One day the owners
realize that the licences are consuming an important share of its budget and so they decide
to migrate to an open source ERP. The company will no longer have to pay licences, but if
they want to do the change, they will have to migrate the informaon. Obviously it is not an
opon to start from scratch, or type the informaon by hand. Kele makes the migraon
possible, thanks to its ability to interact with most kinds of sources and desnaons such as
plain les, and commercial and free databases and spreadsheets.
Exporting data
Somemes you are forced by government regulaons to export certain data to be processed
by legacy systems. You can't just print and deliver some reports containing the required data.
The data has to have a rigid format, with columns that have to obey some rules (size, format,
content), dierent records for heading and tail, just to name some common demands. Kele
has the power to take crude data from the source and generate these kinds of ad hoc reports.
Integrating PDI using Pentaho BI
The previous examples show typical uses of PDI as a standalone applicaon. However, Kele
may be used as part of a process inside the Pentaho BI Plaorm. There are many things
embedded in the Pentaho applicaon that Kele can do—preprocessing data for an on-line
report, sending mails in a schedule fashion, or generang spreadsheet reports.
You'll nd more on this in Chapter 13. However, the use of PDI integrated
with the BI Suite is beyond the scope of this book.
Pop quiz – PDI data sources
Which of the following aren't valid sources in Kele:
1. Spreadsheets
2. Free database engines
3. Commercial database engines
4. Flat les
5. None of the above
Geng Started with Pentaho Data Integraon
[ 14 ]
Installing PDI
In order to work with PDI you need to install the soware. It's a simple task; let's do it.
Time for action – installing PDI
These are the instrucons to install Kele, whatever your operang system.
The only prerequisite to install PDI is to have JRE 5.0 or higher installed. If you don't have it,
please download it from http://www.javasoft.com/ and install it before proceeding.
Once you have checked the prerequisite, follow these steps:
1. From http://community.pentaho.com/sourceforge/ follow the link to
Pentaho Data Integraon (Kele). Alternavely, go directly to the download page
http://sourceforge.net/projects/pentaho/files/Data Integration.
2. Choose the newest stable release. At this me, it is 3.2.0.
3. Download the le that matches your plaorm. The preceding screenshot should
help you.
4. Unzip the downloaded le in a folder of your choice
—C:/Kettle or /home/your_dir/kettle.
Chapter 1
[ 15 ]
5. If your system is Windows, you're done. Under UNIX-like environments, it's
recommended that you make the scripts executable. Assuming that you
chose Kele as the installaon folder, execute the following command:
cd Kettle
chmod +x *.sh
What just happened?
You have installed the tool in just a few minutes. Now you have all you need to start working.
Pop quiz – PDI prerequisites
Which of the following are mandatory to run PDI? You may choose more than one opon.
1. Kele
2. Pentaho BI plaorm
3. JRE
4. A database engine
Launching the PDI graphical designer: Spoon
Now that you've installed PDI, you must be eager to do some stu with data. That will be
possible only inside a graphical environment. PDI has a desktop designer tool named Spoon.
Let's see how it feels to work with it.
Time for action – starting and customizing Spoon
In this tutorial you're going to launch the PDI graphical designer and get familiarized with itsn this tutorial you're going to launch the PDI graphical designer and get familiarized with its
main features.
1. Start Spoon.
If your system is Windows, type the following command:
Spoon.bat
In other plaorms such as Unix, Linux, and so on, type:
Spoon.sh
If you didn't make spoon.sh executable, you may type:
sh Spoon.sh
Geng Started with Pentaho Data Integraon
[ 16 ]
2. As soon as Spoon starts, a dialog window appears asking for the repository
connecon data. Click the No Repository buon. The main window appears. You
will see a small window with the p of the day. Aer reading it, close that window.
3. A welcome! window appears with some useful links for you to see.
4. Close the welcome window. You can open that window later from the main menu.
5. Click Opons... from the Edit menu. A window appears where you can change
various general and visual characteriscs. Uncheck the circled checkboxes:
6. Select the tab window Look Feel.
Chapter 1
[ 17 ]
7. Change the Grid size and Preferred Language sengs as follows:
8. Click the OK buon.
9. Restart Spoon in order to apply the changes. You should neither see the repository
dialog, nor the welcome window. You should see the following screen instead:
Geng Started with Pentaho Data Integraon
[ 18 ]
What just happened?
You ran for the rst me the graphical designer of PDI Spoon, and applied some
custom conguraon.
From the Look Feel conguraon window, you changed the size of the doed grid that
appears in the canvas area while you are working. You also changed the preferred language.
In the Opon tab window, you chose not to show either the repository dialog or the
welcome window at startup. These changes were applied as you restarted the tool, not
before.
The second me you launched the tool, the repository dialog didn't show up. When the
main window appeared, all the visible texts were shown in French, which was the selected
language, and instead of the welcome window, there was a blank screen.
Spoon
This tool that you're exploring in this secon is the PDI's desktop design tool. With Spoon you
design, preview, and test all your work, that is, transformaons and jobs. When you see PDI
screenshots, what you are really seeing are Spoon screenshots. The other PDI components
that you will meet in the following chapters are executed from terminal windows.
Setting preferences in the Options window
In the tutorial you changed some preferences in the Opons window. There are several look
and feel characteriscs you can change beyond those you changed. Feel free to experiment
with this seng.
Remember to restart Spoon in order to see the changes applied.
If you choose any language as preferred language other than English, you
should select a dierent language as alternave. If you do so, every name or
descripon not translated to your preferred language will be shown in the
alternave language.
Just for the curious people: Italian and French are the overall winners of the list of languages
to which the tool has been translated from English. Below them follow Korean, Argennean
Spanish, Japanese, and Chinese.
Chapter 1
[ 19 ]
One of the sengs you changed was the appearance of the welcome window at start up.
The welcome window has many useful links, all related with the tool: wiki pages, news,
forum access, and more. It's worth exploring them.
You don't have to change the sengs again to see the welcome window.
You can open it from the menu Help | Show the Welcome Screen.
Storing transformations and jobs in a repository
The rst me you launched Spoon, you chose No Repository. Aer that, you congured
Spoon to stop asking you for the Repository opon. You must be curious about what the
repository is and why not to use it. Let's explain it.
As said, the results of working with PDI are Transformaons and Jobs. In order to save the
Transformaons and Jobs, PDI oers two methods:
Repository: When you use the repository method you save jobs and
transformaons in a repository. A repository is a relaonal database specially
designed for this purpose.
Files: The les method consists of saving jobs and transformaons as regular XML
les in the lesystem, with extension kjb and ktr respecvely.
The following diagram summarizes this:
exclusive
REPOSITORY FILE SYSTEM
.ktr .kjb
Design, Preview, Run
SPOON
Kettle Engine KETTLE
Transfor
mations Jobs
Transformations Jobs
Design, Preview, Run
Geng Started with Pentaho Data Integraon
[ 20 ]
You cannot mix the two methods (les and repository) in the same project. Therefore, you
must choose the method when you start the tool.
Why did we choose not to work with repository, or in other words, to work with les? This is
mainly for the following two reasons:
Working with les is more natural and praccal for most users.
Working with repository requires minimum database knowledge and that you also
have access to a database engine from your computer. Having both precondions
would allow you to learn working with both methods. However, it's probable that
you haven't.
Throughout this book, we will use the le method. For details of working with repositories,
please refer to Appendix A.
Creating your rst transformation
Unl now, you've seen the very basic elements of Spoon. For sure, you must be waing to do
some interesng task beyond looking around. It's me to create your rst transformaon.
Time for action – creating a hello world transformation
How about starng by saying Hello to the World? Not original but enough for a very rst
praccal exercise. Here is how you do it:
1. Create a folder named pdi_labs under the folder of your choice.
2. Open Spoon.
3. From the main menu select File | New Transformaon.
4. At the le-hand side of the screen, you'll see a tree of Steps. Expand the Input
branch by double-clicking it.
5. Le-click the Generate Rows icon.
Chapter 1
[ 21 ]
6. Without releasing the buon, drag-and-drop the selected icon to the main canvas.
The screen will look like this:
7. Double-click the Generate Rows step that you just put in the canvas and ll the text
boxes and grid as follows:
8. From the Steps tree, double-click the Flow step.
9. Click the Dummy icon and drag-and-drop it to the main canvas.
Geng Started with Pentaho Data Integraon
[ 22 ]
10. Click the Generate Rows step and holding the Shi key down, drag the cursor
towards the Dummy step. Release the buon. The screen should look like this:
11. Right-click somewhere on the canvas to bring up a contextual menu.
12. Select New note. A note editor appears.
13. Type some descripon such as Hello World! and click OK.
14. From the main menu, select Transformaon | Conguraon. A window appears
to specify transformaon properes. Fill the Transformaon name with a simple
name as hello_world. Fill the Descripon eld with a short descripon such as
My rst transformaon. Finally provide a more clear explanaon in the Extended
descripon text box and click OK.
15. From the main menu, select File | Save.
16. Save the transformaon in the folder pdi_labs with the name hello_world.
17. Select the Dummy step by le-clicking it.
18. Click on the Preview buon in the menu above the main canvas.
Chapter 1
[ 23 ]
19. A debug window appears. Click the Quick Launch buon.
20. The following window appears to preview the data generated by the transformaon:
21. Close the preview window and click the Run buon.
22. A window appears. Click Launch.
Geng Started with Pentaho Data Integraon
[ 24 ]
23. The execuon results are shown in the boom of the screen. The Logging tab
should look as follows:
What just happened?
You've just created your rst transformaon.
First, you created a new transformaon. From the tree on the le, you dragged two steps
and drop them into the canvas. Finally, you linked them with a hop.
With the Generate Rows step, you created 10 rows of data with the message Hello World!.
The Dummy step simply served as a desnaon of those rows.
Aer creang the transformaon, you did a preview. The preview allowed you to see the
content of the created data, this is, the 10 rows with the message Hello World!
Chapter 1
[ 25 ]
Finally, you ran the transformaon. You could see the results of the execuon at the boom
of the windows. There is a tab named Step Metrics with informaon about what happens
with each steps in the transformaon. There is also a Logging tab showing a complete detail
of what happened.
Directing the Kettle engine with transformations
As shown in the following diagram, transformaon is an enty made of steps linked by hops.
These steps and hops build paths through which data ows. The data enters or is created in a
step, the step applies some kind of transformaon to it, and nally the data leaves that step.
Therefore, it's said that a transformaon is data-ow oriented.
Steps
Transformation
Output
Input
Hops
Step1 Step2 StepN
...
A transformaon itself is not a program nor an executable le. It is just plain XML. The
transformaon contains metadata that tells the Kele engine what to do.
A step is the minimal unit inside a transformaon. A big set of steps is available. These steps
are grouped in categories such as the input and ow categories that you saw in the example.
Each step is conceived to accomplish a specic funcon, going from reading a parameter to
normalizing a dataset. Each step has a conguraon window. These windows vary according
to the funconality of the steps and the category to which they belong. What all steps have
in common are the name and descripon:
Step property Descripon
Name A representative name inside the transformation.
Description A brief explanation that allows you to clarify the purpose of the step.
It's not mandatory but it is useful.
A hop is a graphical representaon of data owing between two steps—an origin and a
desnaon. The data that ows through that hop constutes the output data of the origin
step and the input data of the desnaon step.
Geng Started with Pentaho Data Integraon
[ 26 ]
Exploring the Spoon interface
As you just saw, the Spoon is the tool using which you create, preview, and run
transformaons. The following screenshot shows you the basic work areas:
The words canvas and work area will be used interchangeably throughout
the book.
Viewing the transformation structure
If you click the View icon in the upper le corner of the screen, the tree will change to show
the structure of the transformaon currently being edited.
Download from Wow! eBook <www.wowebook.com>
Chapter 1
[ 27 ]
Running and previewing the transformation
The Preview funconality allows you to see a sample of the data produced for selected steps.
In the previous example, you previewed the output of the Dummy Step. The Run opon
eecvely runs the whole transformaon.
Whether you preview or run a transformaon, you'll get an execuon results window
showing what happened. Let's explain it through an example.
Time for action – running and previewing the hello_world
transformation
Let's do some tesng and explore the results:
1. Open the hello_world transformaon.
2. Edit the Generate Rows step, and change the limit from 10 to 1000 so that it
generates 1,000 rows.
3. Select the Logging tab window at the boom of the screen.
4. Click on Run.
5. In the Log level drop-down list, select RowLevel detail.
6. Click on Launch.
7. You can see how the logging window shows every task in a very detailed way.
8. Edit the Generate Rows step, and change the limit to 10,000 so that it generates
10,000 rows.
9. Select the Step Metrics.
Geng Started with Pentaho Data Integraon
[ 28 ]
10. Run the transformaon.
11. You can see how the numbers change as the rows travel through the steps.
What just happened?
You did some tests with the hello_world transformaon and saw the results in the
Execuon Results window.
Previewing the results in the Execution Results window
The Execuon Results window shows you what is happening while you preview or run
a transformaon.
The Logging tab shows the execuon of your transformaon, step by step. By default, the
level of the logging detail is Basic but you can change it to see dierent levels of detail—from
a minimal logging (level Minimal) to a very detailed one (level RowLevel).
The Step Metrics tab shows, for each step of the transformaon, the executed operaons
and several status and informaon columns. You may be interested in the following columns:
Column Descripon
Read Contains the number of rows coming from previous steps
Written Contains the number of rows leaving from this step toward the next
Input Number of rows read from a le or table
Output Number of rows written to a le or table
Errors Errors in the execution. If there are errors, the whole row becomes red
Active Tells the current status of the execution
In the example, you can see that the Generate Rows step writes rows, which then are read
by the Dummy step. The Dummy step also writes the same rows, but in this case those
go nowhere.
Chapter 1
[ 29 ]
Pop quiz – PDI basics
For each of the following, decide if the sentence is true or false:
1. There are several graphical tools in PDI, but Spoon is the most used.
2. You can choose to save Transformaons either in les or in a database.
3. To run a Transformaon, an executable le has to be generated from Spoon.
4. The grid size opon in the Look and Feel windows allows you to resize the work area.
5. To create a transformaon, you have to provide external data.
Installing MySQL
Before skipping to the next chapter, let's devote some minutes to the installaon of MySQL.
In Chapter 8 you will begin working with databases from PDI. In order to do that, you will
need access to some database engine. As MySQL is the world's most popular open source
database, it was the database engine chosen for the database-related tutorials in the book.
In this secon you will learn to install the MySQL database engine both in Windows and
Ubuntu, the most popular distribuon of Linux these days. As the procedures for installing
the soware are dierent, a separate explanaon is given for each system.
Time for action – installing MySQL on Windows
In order to install MySQL on your Windows system, please follow these instrucons:
1. Open an internet browser and type http://dev.mysql.com/downloads/mysql/.
2. Select the Microso Windows plaorm and download the mysql-essenal package
that matches your system: 32-bit or 64-bit.
3. Double-click the downloaded le. A wizard will guide you through the process.
4. When asked about the setup type, select Typical.
5. Several screens follow. When the wizard is complete you'll have the opon to
congure the server. Check Congure the MySQL Server now and click Finish.
Geng Started with Pentaho Data Integraon
[ 30 ]
6. A new wizard will be launched that lets you congure the server.
7. When asked about the conguraon type, select Standard Conguraon.
8. When prompted, set the Windows opons as shown in the next screenshot:
9. When prompted for the security opons, provide a password for the root user.
You'll have to retype the password.
Provide a password that you can remember. You'll need it
later to connect to the MySQL server.
Chapter 1
[ 31 ]
10. In the next window click on Execute to proceed with the conguraon. When the
conguraon is done, you'll see this:
11. Click on Finish. Aer installing MySQL it is recommended that you install the GUI
tools for administering and querying the database.
12. Open an Internet browser and type
http://dev.mysql.com/downloads/gui-tools/.
13. Look for the Windows downloads and download the Windows (x86) package.
14. Double-click the downloaded le. A wizard will guide you through the process.
15. When asked about the setup type, select Complete.
16. Several screens follow. Just follow the wizard instrucons.
17. When the wizard ends, you'll have the GUI tools added to the MySQL menu.
Geng Started with Pentaho Data Integraon
[ 32 ]
What just happened?
You downloaded and installed MySQL on your Windows system. You also installed MySQL
GUI tools, a soware package that includes an administrator and a query browser ulity and
that will make your life easier when working with the database.
Time for action – installing MySQL on Ubuntu
This tutorial shows you the procedure to install MySQL on Ubuntu.
In order to follow the tutorial you need to be connected to
the Internet.
Please follow these instrucons:
1. Check that you have access to the Internet.
2. Open the Synapc package manager from System | Administraon | Synapc
Package Manager.
3. Under Quick search type mysql-server and click on the Search buon.
4. Among the results, locate mysql-server-5.1, click in the ny square to the le,
and select Mark for Installaon.
5. You'll be prompted for conrmaon. Click on Mark.
Chapter 1
[ 33 ]
6. Now search for a package named mysql-admin.
7. When found, mark it for installaon in the same way.
8. Click on Apply on the main toolbar.
9. A window shows up asking for conrmaon. Click on Mark again. What follows is
the download process followed by the installaon process.
10. At a parcular moment a window appears asking you for a password for the root
user—the administrator of the database. Enter a password of your choice. You'll
have to enter it twice.
Think of a password that you can remember. You'll need it
later to connect to the MySQL server.
11. When the process ends, you will see the changes applied.
Geng Started with Pentaho Data Integraon
[ 34 ]
12. Under Applicaons a new menu will also be added to access the GUI tools.
What just happened?
You installed MySQL server and GUI Tools in your Ubuntu system.
The previous direcons are for standard installaons. For custom installaons,
instrucons related to other operang systems, or for troubleshoong, please
check the MySQL documentaon at—http://dev.mysql.com/doc/
refman/5.1/en/installing.html.
Summary
In this rst chapter, you were introduced to Pentaho Data Integraon. Specically, you learned
what Pentaho Data Integraon is and you installed the tool. You were also introduced to
Spoon, the graphical designer of PDI, and you created your rst transformaon.
As an addional exercise, you installed a MySQL server and the MySQL GUI tools. You will
need this soware when you start working with databases in Chapter 8.
Now that you've learned the basics, you're ready to begin creang your own transformaons
to explore real data. That is the topic of the next chapter.
2
Getting Started with Transformations
In the previous chapter you used the graphical designer Spoon to create
your rst transformaon: Hello world. Now you will start creang your own
transformaons to explore data from the real world. Data is everywhere; in
parcular you will nd data in les. Product lists, logs, survey results, and
stascal informaon are just a sample of the dierent kinds of informaon
usually stored in les. In this chapter you will create transformaons to get
data from les, and also to send data back to les. This in turn will allow you to
learn the basic PDI terminology related to data.
Reading data from les
Despite being the most primive format used to store data, les are broadly used and they
exist in several avors as xed width, comma-separated values, spreadsheet, or even free
format les. PDI has the ability to read data from all types of les; in this rst tutorial let's
see how to use PDI to get data from text les.
Geng Started with Transformaons
[ 36 ]
Time for action – reading results of football matches from les
Suppose you have collected several football stascs in plain les. Your les look like this:
Group|Date|Home Team |Results|Away Team|Notes
Group 1|02/June|Italy|2-1|France|
Group 1|02/June|Argentina|2-1|Hungary
Group 1|06/June|Italy|3-1|Hungary
Group 1|06/June|Argentina|2-1|France
Group 1|10/June|France|3-1|Hungary
Group 1|10/June|Italy|1-0|Argentina
-------------------------------------------
World Cup 78
Group 1
You don't have one, but many les, all with the same structure. You now want to unify all the
informaon in one single le. Let's begin by reading the les.
1. Create the folder named pdi_files. Inside it, create the input and
output subfolders.
2. By using any text editor, type the le shown and save it under the name
group1.txt in the folder named input, which you just created. You can also
download the le from Packt's ocial website.
3. Start Spoon.
4. From the main menu select File | New Transformaon.
5. Expand the Input branch of the steps tree.
6. Drag the Text le input icon to the canvas.
7. Double-click the text input le icon and give a name to the step.
8. Click the Browse... buon and search the le group1.txt.
9. Select the le. The textbox File or directory will be temporarily populated with the full
path of the le—for example, C:\pdi_files\input\group1.txt.
Chapter 2
[ 37 ]
10. Click the Add buon. The full text will be moved from the File or directory textbox to the
grid. The conguraon window should look as follows:
11. Select the Content tab and ll it like this:
Geng Started with Transformaons
[ 38 ]
12. Select the Fields tab. Click the Get Fields buon. The screen should look like this:
13. In the small window that proposes you a number of sample lines, click OK.
14. Close the scan results window.
15. Change the second row. Under the Type column select Date, and under the Format
column, type dd/MMM.
16. The result value is text, not a number, so change the fourth row too. Under the Type
column select String.
17. Click the Preview rows buon, and then the OK buon.
18. The previewed data should look like the following:
Chapter 2
[ 39 ]
19. Expand the Transform branch of the steps tree.
20. Drag the Select values icon to the canvas.
21. Create a hop from the Text le input step to the Select values step.
Remember that you do it by selecng the rst step, then dragging
toward the second while holding down the Shi key.
22. Double-click the Select values step icon and give a name to the step.
23. Select the Remove tab.
24. Click the Get elds to remove buon.
25. Delete every row except the rst and the last one by le-clicking them and
pressing Delete.
26. The tab window looks like this:
27. Click OK.
28. From the Flow branch of the steps tree, drag the Dummy icon to the canvas.
29. Create a hop from the Select values step to the Dummy step. Your transformaon
should look like the following:
Geng Started with Transformaons
[ 40 ]
30. Congure the transformaon by pressing Ctrl+T and giving a name and a descripon to
the transformaon.
31. Save the transformaon by pressing Ctrl+S.
32. Select the Dummy step.
33. Click the Preview buon located on the transformaon toolbar:
34. Click the Quick Launch buon.
35. The following window appears, showing the nal data:
What just happened?
You read your plain le with results of football matches into a transformaon.
By using a Text le input step, you told Kele the full path to your le, along with the
characteriscs of the le so that Kele was able to read the data correctly—you specied
that the le had a header, had three rows at the end that should be ignored, and specied
the name and type of the columns.
Aer reading the le, you used a Select values step to remove columns you didn't need— the
rst and the last column.
Chapter 2
[ 41 ]
With those two simple steps, you were able to preview the data in your le from inside
the transformaon.
Another thing you may have noced is the use of shortcuts instead of the menu opons—for
example, to save the transformaon.
Many of the menu opons can be accessed more quickly by using shortcuts. The
available shortcuts for the menu opons are menoned as part of the name of
the operaon—for example, Run F9.
For a full shortcut reference please check Appendix D.
Input les
Files are one of the most used input sources. PDI can take data from several types of les,
with very few limitaons.
When you have a le to work with, the rst thing you have to do is to specify where the le
is, how it looks, and what kinds of values it contains. That is exactly what you did in the rst
tutorial of this chapter.
With the informaon you provide, Kele can create the dataset to work within the
current transformaon.
Input steps
There are several steps that allow you to take a le as the input data. All those steps such as
Text le input, Fixed le input, Excel Input, and so on are under the Input step category.
Despite the obvious dierences that exist between these types of les, the ways to congure
the steps have much in common. The following are the main properes you have to specify
for an input step:
Name of the step: It is mandatory and must be dierent for every step in
the transformaon.
Name and locaon of the le: These must be specied of course. At the moment
you create the transformaon, it's not mandatory that the le exists. However, if it
does, you will nd it easier to congure this step.
Content type: This data includes delimiter character, type of encoding, whether a
header is present, and so on. The list depends on the kind of le chosen. In every
case, Kele propose default values, so you don't have to enter too much data.
Geng Started with Transformaons
[ 42 ]
Fields: Kele has the facility to get the denions automacally by clicking the Get
Fields buon. However, Kele doesn't always guess the data types, size, or format
as expected. So, aer geng the elds you may change what you consider more
appropriate, as you did in the tutorial.
Filtering: Some steps allow you to lter the data—skip blank rows, read only the rst
n rows, and so on.
Aer conguring an input step, you can preview the data just as you did, by Clicking
the Preview Rows buon. This is useful to discover if there is something wrong in the
conguraon. In that case, you can make the adjustments and preview again, unl your
data looks ne.
Reading several les at once
Unl now you used an input step to read one le. But you have several les, all with the very
same structure. That will not be a problem because with Kele it is possible to read more
than a le at a me.
Time for action – reading all your les at a time using a single
Text le input step
To read all your les follow the next steps:
1. Open the transformaon, double-click the input step, and add the other les in the
same way you added the rst.
2. Aer Clicking the Preview rows buon, you will see this:
Chapter 2
[ 43 ]
What just happened?
You read several les at once. By pung in the grid the names of all the input les, you could
get the content of every specied le one aer the other.
Time for action – reading all your les at a time using a single
Text le input step and regular expressions
You could do the same thing you did above by using a dierent notaon.
Follow these instrucons:
1. Open the transformaon and edit the conguraon windows of the input step.
2. Delete the lines with the names of the les.
3. In the rst row of the grid, type C:\pdi_files\input\ under the File/Directory
column, and group[1-4]\.txt under the Wildcard (Reg.Exp.) column.
4. Click the Show lename(s)... buon. You'll see the list of les that match
the expression.
5. Close the ny window and click Preview rows to conrm that the rows shown
belong to the four les that match the expression you typed.
Geng Started with Transformaons
[ 44 ]
What just happened?
In this parcular case, all lenames follow a paern—group1.txt, group2.txt, and so
on. In order to specify the names of the les, you used a regular expression. In the column
File/Directory you put the stac part of the names, while in the Wildcard (Reg.Exp.) column
you put the regular expression with the paern that a le must follow to be considered:
the text group followed by a number between 1 and 4, and then .txt. Then, all les that
matched the expression were considered as input les.
Regular expressions
There are many places inside Kele where you may or have to provide a regular expression.
A regular expression is much more than specifying the known wildcards ? and *.
Here you have some examples of regular expressions you may use to specify lenames:
The following regular
expression ...
Matches ... Examples
.*\.txt Any txt le thisisaValidExample.
txt
test(19|20)\d\d-
(0[1-9]|1[012])\.txt
Any txt le beginning with test
followed by a date using the format
yyyy-mm
test2009-12.txt
test2009-01.txt
(?i)test.+\.txt Any txt le beginning with test,
upper or lower case
TeSTcaseinsensitive.
tXt
Please note that the * wildcard doesn't work the same as it does on
the command line. If you want to match any character, the * has to be
preceded by a dot.
Here are some useful links in case you want to know more about regular expressions:
Regular Expression Quick Start:
http://www.regular-expressions.info/quickstart.html
The Java Regular Expression Tutorial:
http://java.sun.com/docs/books/tutorial/essential/regex/
Java Regular Expression Paern Syntax: http://java.sun.com/javase/6/
docs/api/java/util/regex/Pattern.html
Chapter 2
[ 45 ]
Troubleshooting reading les
Despite the simplicity of reading les with PDI, obstacles and errors appear. Many mes
the soluon is simple but dicult to nd if you are new to PDI. Here you have a list of
common problems and possible soluons for you to take into account while reading and
previewing a le:
Problem Diagnosc Possible soluons
You get the message
Sorry, no rows found to
be previewed.
This happens when the input le
doesn't exist or is empty.
It also may happen if you
specied the input les with
regular expressions and there
is no le that matches the
expression.
Check the name of the input les.
Verify the syntax used, check that
you didn't put spaces or any strange
character as part of the name.
If you used regular expressions, check
the syntax.
Also verify that you put the lename
in the grid. If you just put it in the File
or directory textbox, Kele will not
read it.
When you preview the
data you see a grid with
blank lines
The le contains empty lines, or
you forgot to get the elds.
Check the content of the le.
Also check that you got the elds in the
Fields tab.
You see the whole line
under the rst dened
eld.
You didn't set the proper
separator and Kele couldn't split
the dierent elds.
Check and x the separator in the
Content tab.
You see strange
characters.
You le the default content but
your le has a dierent format or
encoding.
Check and x the Format and Encoding
in the Content tab.
If you are not sure of the format, you
can specify mixed.
You don't see all the
lines you have in the le
You are previewing just a sample
(100 lines by default).
Or you put a limit to the number
of rows to get.
Another problem may be that you
set the wrong number of header
or footer lines.
When you preview, you see just a
sample. This is not a problem.
If you raise the previewed number of
rows and sll have few lines, check the
Header, Footer and Limit opons in
the Content tab.
Geng Started with Transformaons
[ 46 ]
Problem Diagnosc Possible soluons
Instead of rows of
data, you get a window
headed ERROR with an
extract of the log
Dierent errors may happen, but
the most common has to do with
problems in the denion of the
elds.
You could try to understand the log
and x the denion accordingly. For
example if you see:
Couldn't parse eld [Integer] with
value [Italy].
The error is that PDI found the text
Italy in a eld that you dened as
Integer.
If you made a mistake, you could x
it. On the other hand, if the le has
errors, you could read all elds as
String and you will not get the error
again. In chapter 7 you will learn how
to overcome these situaons.
Grids
Grids are tables used in many Spoon places to enter or display informaon. You already saw
grids in several conguraon windows—Text le input, Text le output, and Select values.
Many grids contain eld informaon. Examples of these grids are the Field tab window in the
Text Input and Output steps, or the main conguraon window of the Select Values step. In
these cases, the grids are usually accompanied by a Get Fields buon. The Get Fields buon
is a facility to avoid typing. When you press that buon, Kele lls the grid with all the
available elds.
For example, when reading a le, the Get Fields buon lls the grid with the columns of the
incoming le. When using a Select Values step or a File output step, the Get Fields buon
lls the grid with all the elds entering from a previous step.
Every me you see a Get Fields buon, consider it as a shortcut to avoid typing.
Kele will bring the elds available to the grid; you will only have to check the
informaon brought and make minimal changes.
There are many places in Spoon where the grid serves also to edit other kinds of informaon.
One example of that is the grid where you specify the list of les in a Text File Input step. No
maer what kind of grid you are eding, there is always a contextual menu, which you may
access by right-clicking on a row. That menu oers eding opons to copy, paste, or move
rows of the grid.
Chapter 2
[ 47 ]
When the number of rows in the grid is big, use shortcuts! Most of the eding
opons of a grid have shortcuts that make the eding work easier and quicker.
You'll nd a full list of shortcuts for eding grids in Appendix E.
Have a go hero – explore your own les
Try to read your own text les from Kele. You must have several les with dierent kinds of
data, dierent separators, and with or without header or footer. You can also search for les
over the Internet; there are plenty of les there to download and play with. Aer conguring
the input step, do a preview. If the data is not shown properly, x the conguraon and
preview again unl you are sure that the data is read as expected. If you have trouble
reading the les, please refer to the Troubleshoong reading les secon seen earlier for
diagnosis and possible ways to solve the problems.
Sending data to les
Now you know how to bring data into Kele. You didn't bring the data just to preview it; you
probably want to do some transformaon on the data, to nally send it to a nal desnaon
such as another plain le. Let's learn how to do this last task.
Time for action – sending the results of matches to a plain le
In the previous tutorial, you read all your "results of matches" les. Now you want to send
the data coming from all les to a single output le.
1. Create a new transformaon.
2. Drag a Text le input step to the canvas and congure it just as you did in the
previous tutorial.
3. Drag a Select values step to the canvas and create a hop from the Text le input
step to the Select values step.
4. Double-click the Select values step.
5. Click the Get elds to select buon.
Geng Started with Transformaons
[ 48 ]
6. Modify the elds as follows:
7. Expand the Output branch of the steps tree.
8. Drag the Text le output icon to the canvas.
9. Create a hop from the Select values step to the Text le output step.
10. Double-click the Text le output step and give it a name.
11. In the le name type: C:/pdi_files/output/wcup_first_round.
Note that the path contains forward slashes. If your system is Windows,
you may use back or forward slashes. PDI will recognize both notaons.
12. In the Content tab, leave the default values.
13. Select the Fields tab and congure it as follows:
Chapter 2
[ 49 ]
14. Click OK.
15. Give a name and descripon to the transformaon.
16. Save the transformaon.
17. Click Run and then Launch.
18. Once the transformaon is nished, check the le generated. It should have been
created as C:/pdi_files/output/wcup_first_round.txt and should look
like this:
Match Date;Home Team;Away Team;Result
02/06;Italy;France;2-1
02/06;Argentina;Hungary;2-1
06/06;Italy;Hungary;3-1
06/06;Argentina;France;2-1
10/06;France;Hungary;3-1
10/06;Italy;Argentina;1-0
01/06;Germany FR;Poland;0-0
02/06;Tunisia;Mexico;3-1
06/06;Germany FR;Mexico;6-0
…
What just happened?
You gathered informaon from several les and sent all the data to a single le. Before
sending the data out, you used a Select Value step to select the data you wanted for the le
and to rename the elds so that the header of the desnaon le looks clearer.
Output les
We saw that PDI could take data from several types of les. The same applies to output data.
The data you have in a transformaon can be sent to dierent types of les. All you have to
do is redirect the ow of data towards an Output step.
Geng Started with Transformaons
[ 50 ]
Output steps
There are several steps that allow you to send the data to a le. All those steps are under the
Output step category: Text le output and Excel Output are examples of them.
For an Output step, just like you do for an Input step, you also have to dene:
Name of the step: It is mandatory and must be dierent for every step in
the transformaon.
Name and locaon of the le: These must be specied. If you specify an exisng
le, the le will be replaced by a new one (unless you check the Append checkbox
present in some of the output steps).
Content type: This data includes delimiter character, type of encoding, whether to
put a header, and so on. The list depends on the kind of le chosen. If you check
Header, the header will be built with the names of the elds.
If you don't like the names of the elds as header names in your le,
you may use a Select values step just to rename those elds.
Fields: Here you specify the list of elds that has to be sent to the le, and provide
some format instrucons. Just like in the input steps, you may use the Get Fields
buon to ll the grid. In this case, the grid is going to be lled based on the data
that arrives from the previous step. You are not forced to send every piece of data
coming to the output step, nor to send the elds in the same order.
Some data denitions
From the Kele's point of view, data can be anything ready to be processed by soware (for
example les or data in databases). Whichever the subject or origin of the data, whichever
its format, Kele transformaons can get the data for further processing and delivering.
Rowset
Transformaons deals with datasets, that is, data presented in a tabular form, where:
Each column represents a eld. A eld has a name and a data type. The data type
can be any of the common data types—number (oat), string, date, Boolean, integer,
or big number.
Each row corresponds to a given member of the dataset. All rows in a dataset have
the same structure, that is, all rows have the same elds, in the same order. A eld
in a row may be null, but it has to be present.
Chapter 2
[ 51 ]
The dataset is called rowset. The following is an example of rowset. It is the rowset
generated in the World Cup tutorial:
Streams
Once the data is read, it travels from step to step, through the hops that link those steps.
Nothing happens in the hops except data owing. The real manipulaon of data, as well as
the modicaon of a stream by adding or removing columns, occurs in the steps.
Right-click on the Select values step of the transformaon you created. In the contextual
menu select Show output elds. You'll see this:
This window shows the metadata of the data that leaves this step, this is, name, type, and
other properes of each eld leaving this step towards the following step.
In the same way, if you select Show input elds, you will see the metadata of the data that
le the previous step.
Geng Started with Transformaons
[ 52 ]
The Select values step
The Select values step allows you to select, rename, and delete elds, or change the
metadata of a eld. The step has three tabs:
Select & Alter: This tab is also used to rename the elds or reorder them. This is
how we used it in the last exercise.
Remove: This tab is useful to discard undesirable elds. We used it in the matches
exercise to drop the rst and last elds. Alternavely, we could use the Select &
Alter tab, and specify the elds that you want to keep. Both are equivalent for
that purpose.
Meta-data: This tab is used when you want to change the denion of a eld such
as telling Kele to interpret a string eld as a date. We will see examples of this later
in this book.
You may use only one of the Select Values step tabs at a
me. Kele will not restrain you from lling more than one
tab, but that could lead to unexpected behavior.
Have a go hero – extending your transformations by writing output les
Suppose you read your own les in the previous secon, modify your transformaons by
wring some or all the data back into les, however, changing the format, headers, number
or order of elds, and so on this me around. The objecve is to get some experience to see
what happens. Aer some tests, you will feel condent with input and output les, and be
ready to move forward.
Getting system information
Unl now, you have learned how to read data from known les, and send data back to les.
What if you don't know beforehand the name of the le to process? There are several ways
to handle this with Kele. Let's learn the simplest.
Chapter 2
[ 53 ]
Time for action – updating a le with news about examinations
Imagine you are responsible to collect the results of an annual examinaon that is being
taken in a language school. The examinaon evaluates wring, reading, speaking, and
listening skills. Every professor gives the exam to the students, the students take the
examinaon, the professors grade the examinaons in the scale 0-100 for each skill, and
write the results in a text le, like the following:
student_code;name;writing;reading;speaking;listening
80711-85;William Miller;81;83;80;90
20362-34;Jennifer Martin;87;76;70;80
75283-17;Margaret Wilson;99;94;90;80
83714-28;Helen Thomas;89;97;80;80
61666-55;Maria Thomas;88;77;70;80
All the les follow that paern.
When a professor has the le ready, he/she sends it to you, and you have to integrate the
results in a global list. Let's do it with Kele.
1. Before starng, be sure to have a le ready to read. Type it or download the sample les
from the Packt's ocial website.
2. Create the le where the news will be appended. Type this:
---------------------------------------------------------
Annual Language Examinations
Testing writing, reading, speaking and listening skills
---------------------------------------------------------
student_code;name;writing;reading;speaking;listening;file_
processed;process_date
Save the le as C:/pdi_files/output/examination.txt.
3. Create a new transformaon.
4. Expand the Input branch of the steps tree.
5. Drag the Get System Info and Text le input icons to the canvas.
6. Expand the Output branch of the steps tree, and drag a Text le output step to
the canvas.
Geng Started with Transformaons
[ 54 ]
7. Link the steps as follows:
8. Double-click the rst Get System Info step icon and give it a name.
9. Fill the grid as follows:
10. Click OK.
11. Double-click the Text le Input step icon and congure it like here:
Chapter 2
[ 55 ]
12. Select the Content tab.
13. Check the Include lename in output? checkbox and type file_processed in the
Filename eldname textbox.
14. Check the Add lenames to result checkbox.
15. Select the Fields tab and Click the Get Fields buon to ll the grid.
16. Click OK.
17. Double-click the second Get System Info step icon and give it a name.
18. Add a eld named process_date, and from the list of choices select system
date (xed).
19. Double-click the Text le output step icon and give it a name.
20. Type C:/pdi_files/output/examination as the lename.
21. In the Fields tab, press the Get Fields buon to ll the grid.
22. Change the format of the Date row to yy/MM/dd.
23. Give a name and descripon to the transformaon and save it.
24. Press F9 to run the transformaon.
25. Fill in the argument grid, wring the full path of the le created.
26. Click Launch.
Geng Started with Transformaons
[ 56 ]
27. The output le should look like this:
---------------------------------------------------------
Annual Language Examinations
Testing writing, reading, speaking and listening skills
---------------------------------------------------------
student_code;name;writing;reading;speaking;listening;file_
processed;process_date
80711-85;William Miller;81;83;80;90;C:\exams\exam1.txt;28-05-2009
20362-34;Jennifer Martin;87;76;70;80;C:\exams\exam1.txt;28-05-2009
75283-17;Margaret Wilson;99;94;90;80;C:\exams\exam1.txt;28-05-2009
83714-28;Helen Thomas;89;97;80;80;C:\exams\exam1.txt;28-05-2009
61666-55;Maria Thomas;88;77;70;80;C:\exams\exam1.txt;28-05-2009
28. Run the transformaon again.
29. This me ll the argument grid with the name of a second le.
30. Click Launch.
31. Verify that the data from this second le was appended to the previous data in the
output le.
What just happened?
You read a le whose name is known at runme, and fed a desnaon le by appending the
contents of the input le.
The rst Get System Info step tells Kele to take the rst command line argument, and
assume that it is the name of the le to read.
In the Text File Input step, you didn't specify the name of the le, but told Kele to take as
the name of the le, the eld coming from the previous step, which is the read argument.
With the second Get System Info step you just took from the system, the date, which you
used later to enrich the data sent to the desnaon le.
The desnaon le is appended with new data every me you run the transformaon.
Beyond the basic required data (student code and grades), the name of the processed le
and the date on which the data is being appended are added as part of the data.
When you don't specify the name and locaon of a le (like in this example), or
when the real le is not available at design me, you won't be able to use the
Get Fields buon, nor preview to see if the step is well congured. The trick is
to congure the step by using a real le idencal to the expected one. Aer the
step is congured, change the name and locaon of the le as needed.
Chapter 2
[ 57 ]
Getting information by using Get System Info step
The Get System Info step allows you to get dierent informaon from the system. In this
exercise, you took the system date and an argument. If you look to the available list, you
will see more than just these two opons.
Here we used the step in two dierent ways:
As a resource to take the name of the le from the command line
To add a eld to the dataset
The use of this step will be clearer with a picture.
In this example, the Text File Input doesn't know the name or the locaon of the le. It takes
it from the previous step, which is a Get System Info Step. As the Get System Info serves as
a supplier of informaon, the hop that leaves the step changes its look and feel to show
the situaon.
Geng Started with Transformaons
[ 58 ]
The second me the Get System Info is used, its funcon is simply to add a eld to the
incoming dataset.
Data types
Every eld must have a data type. The data type can be any of the common data
types—number (oat), string, date, Boolean, integer, or big number. Strings are simple,
just text for which you may specify a length. Date and numeric elds have more variants,
and are worthy of while a separate explanaon.
Date elds
Date is one the main data types available in Kele. In the matches tutorial, you have an
example of date eld—the match date eld. Its values were 2/Jun, 6/Jun, 10/Jun. Take a
look at how you dened that eld in the Text le input step. You dened the eld as a date
eld with format dd/MMM. What does it mean? To Kele it means that it has to interpret the
eld as a date, where the rst two posions represent the day, then there is a slash, and
nally there is the month in leers (that's the meaning of the three last posions).
Generally speaking, when a date eld is created, like the text input eld of the example, you
have to dene the format of the data so that Kele can recognize in the eld the dierent
components of the date. There are several formats that may be dened for a date, all of
them combinaons of leers that represents date or me components. Here are the most
basic ones:
Leers Meaning
yYear
M Month
d Day
H Hour (0-23)
m Minutes
s Seconds
Now let's see the other end of the same transformaon—the output step. Here you set
another format for the same eld: dd/MM. According the table, this means the date has to
have two posions for the day, then a slash, and then two posions for the month. Here, the
format specicaon represents the mask you want to apply when the date is shown. Instead
of 2/Jun, 6/Jun, 10/Jun, in the output le, you expect to see 02/06, 06/06, 10/06.
In the examinaon tutorial, you also have a Date eld—the process date. When you created
it, you didn't specify a format because you took the system date which, by denion, is a
date and Kele knows it. But when wring this date to the output le, again you dened a
format, in this case it was yyyy/MM/dd.
Chapter 2
[ 59 ]
In general, when you are wring a date, the format aribute is used of format the data
before sending it to the desnaon. In case you don't specify a format, Kele sets a
default format.
As said earlier, there are more combinaons to dene the format to a date eld.
For a complete reference, check the Sun Java API documentaon located at
http://java.sun.com/javase/6/docs/api/java/text/SimpleDateFormat.html.
Numeric elds
Numeric elds are present in almost all Kele transformaons. In the Examinaon example,
you encountered numeric elds for the rst me. The input le had four numeric elds.
As the numbers were all integer, you didn't set a specic format. When you have more
elaborate elds such as numbers with separators, dollar signs, and so on, you should specify
a format to tell Kele how to interpret the number. If you don't, Kele will do its best to
interpret the number, but this could lead to unexpected results.
At the other extreme of the ow, when wring to the output le text, you may specify the
format in which you want the number to be shown.
There are several formats you may apply to a numeric eld. The format is basically a
combinaon of predened symbols, each with a special meaning. The following are
the most used symbols:
Symbol Meaning
#Digit Leading zeros are not shown
0 Digit If the digit is not present, zero is displayed in its place
. Decimal separator
- Minus sign
%Field has to be mulplied by 100 and shown as a percentage
These symbols are not used alone. In order to specify the format of your numbers, you
have to combine them. Suppose that you have a numeric eld whose value is 99.55; the
following table shows you the same value aer applying dierent formats to it:
Format Result
# 100
0 100
#.# 99.6
#.## 99.55
#.000 99.550
000.000 099.550
Geng Started with Transformaons
[ 60 ]
If you don't specify a format for your numbers, you may sll provide a Length and
Precision. Length is the total number of signicant gures, while precision is the number
of oang-point digits.
If you neither specify format nor length or precision, Kele behaves as follow. While reading,
it does its best to interpret the incoming number, and when wring, it sends the data as it
comes without applying any format.
For a complete reference on number formats, you can check the Sun Java API
documentaon available at http://java.sun.com/javase/6/docs/api/java/text/
DecimalFormat.html.
Running transformations from a terminal window
In the examinaon exercise, you specied that the name of the input le will be taken
from the rst command-line argument. That means when execung the transformaon,
the lename has to be supplied as an argument. Unl now, you only ran transformaons
from inside Spoon. In the last exercise, you provided the argument by typing it in a dialog
window. Now it is me to learn how to run transformaons with or without arguments from
a terminal window.
Time for action – running the examination transformation from
a terminal window
Before execung the transformaon from a terminal window, make sure that you have a new
examinaon le to process, let's say exam3.txt. Then follow these instrucons:
1. Open a terminal window and go to the directory where Kele is installed.
On Windows systems type:
C:\pdi-ce>pan.bat /file:c:\pdi_labs\examinations.ktr c:\
pdi_files\input\exam3.txt
On Unix, Linux, and other Unix-based systems type:
/home/yourself/pdi-ce/pan.sh /file:/home/yourself/pdi_labs/
examinations.ktr c:/pdi_files/input/exam3.txt
If your transformaon is in another folder, modify the command
accordingly.
Chapter 2
[ 61 ]
2. You will see how the transformaon runs, showing you the log in the terminal.
3. Check the output le. The contents of exam3.txt should be at the end of the le.
What just happened?
You executed a transformaon with Pan, the program that runs transformaons from
terminal windows. As part of the command, you specied the name of the transformaon
le and provided the name of the le to process, which was the only argument expected by
the transformaon. As a result, you got the same as if you had run the transformaon from
Spoon—a small le appended to the global le.
When you are designing transformaons, you run them with Spoon; you don't use Pan. Pan
is mainly used as part of batch processes, for example processes that run every night in a
scheduled fashion.
Appendix B tells you all the details about using Pan.
Have a go hero – using different date formats
Change the main transformaon of the last tutorial so that the process_date is saved with
a full format, that is, including day of week (Monday, Tuesday, and so on), month in leers
(January, February, and so on), and me.
Geng Started with Transformaons
[ 62 ]
Go for a hero – formatting 99.55
Create a transformaon to see for yourself the dierent formats for the number 99.55. Test
the formats shown in the Numeric elds secon and try some other opons as well.
To test this, you will need a dataset with a single row and a single eld—the
number. You can generate it with a Generate rows step.
Pop quiz–formatting data
Suppose that you read a le where the rst column is a numeric idener: 1, 2, 3, and so on.
You read the eld as a Number. Now you want to send the data back to a le. Despite being
a number, this eld is regular text to you because it is a code. How do you dene the eld in
the Text output step (you may choose more than one opon):
a. As a Number. In the format, you put #.
b. As a String. In the format, you put #.
c. As a String. You leave the format blank.
XML les
Even if you're not a system developer, you must have heard about XML les. XML les
or documents are not only used to store data, but also to exchange data between
heterogeneous systems over the Internet. PDI has many features that enable you to
manipulate XML les. In this secon you will learn to get data from those les.
Time for action – getting data from an XML le with information
about countries
In this tutorial you will build an Excel le with basic informaon about countries. The source
will be an XML le that you can download from the Packt website.
1. If you work under Windows, open the kettle.properties le located in the
C:/Documents and Settings/yourself/.kettle folder and add the
following line:
LABSOUTPUT=c:/pdi_files/output
Chapter 2
[ 63 ]
On the other hand, if you work under Linux (or similar), open the kettle.
properties le located in the /home/yourself/.kettle folder and add the
following line:
LABSOUTPUT=/home/yourself/pdi_files/output
2. Make sure that the directory specied in kettle.properties exists.
3. Save the le.
4. Restart Spoon.
5. Create a new transformaon.
6. Give a name to the transformaon and save it in the same directory you have all the
other transformaons.
7. From the Packt website, download the resources folder containing a file named
countries.xml. Save the folder in your working directory. For example, if your
transformaons are in pdi_labs, the le will be in pdi_labs/resources/.
The last two steps are important. Don't skip them! If you do,
some of the following steps will fail.
8. Take a look at the le. You can edit it with any text editor, or you can double-click it to
see it within an explorer. In any case, you will see informaon about countries. This is
just the extract for a single country:
<?xml version="1.0" encoding="UTF-8"?>
<world>
...
<country>
<name>Argentina</name>
<capital>Buenos Aires</capital>
<language isofficial="T">
<name>Spanish</name>
<percentage>96.8</percentage>
</language>
<language isofficial="F">
<name>Italian</name>
<percentage>1.7</percentage>
</language>
<language isofficial="F">
Geng Started with Transformaons
[ 64 ]
<name>Indian Languages</name>
<percentage>0.3</percentage>
</language>
</country>
...
</world>
9. From the Input steps, drag a Get data from XML step to the canvas.
10. Open the conguraon window for this step by double-clicking it.
11. In the File or directory textbox, press Ctrl+Space. A drop-down list appears as shown in
the next screenshot:
12. Select Internal.Transformation.Filename.Directory. The textbox gets lled
with this text.
13. Complete the text so that you can read ${Internal.Transformation.Filename.
Directory}/resources/countries.xml.
14. Click on the Add buon. The full path is moved to the grid.
15. Select the Content tab and click Get XPath nodes.
16. In the list that appears, select /world/country/language.
Chapter 2
[ 65 ]
17. Select the Fields tab and ll the grid as follows:
18. Click Preview rows, and you should see something like this:
19. Click OK.
20. From the Output steps, drag an Excel Output step to the canvas.
21. Create a hop from the Get data from XML step to the Excel Output step.
22. Open the conguraon window for this step by double-clicking it.
Geng Started with Transformaons
[ 66 ]
23. In the Filename textbox press Ctrl+Space.
24. From the drop-down list, select ${LABSOUTPUT}.
25. By the side of that text type /countries_info. The complete text should be
${LABSOUTPUT}/countries_info.
26. Select the Fields tab and click the Get Fields buon to ll the grid.
27. Click OK. This is your nal transformaon.
28. Save the transformaon.
29. Run the transformaon.
30. Check that the countries_info.xls le has been created in the output directory
and contains the informaon you previewed in the input step.
What just happened?
You got informaon about countries from an XML le and saved it in a more readable
format—an Excel spreadsheet—for the common people.
To get the informaon, you used a Get data from XML step. As the source le was
taken from a folder relave to the folder where you stored the transformaon, you set
the directory to ${Internal.Transformation.Filename.Directory}. When
the transformaon ran, Kele replaced ${Internal.Transformation.Filename.
Directory} with the real path of the transformaon: c:/pdi_labs/.
In the same way, you didn't put a xed value for the path of the nal Excel le. As directory,
you used ${LABSOUTPUT}. When the transformaon ran, Kele replaced ${LABSOUTPUT}
with the value you wrote in the kettle.properties le. The output le was then saved in
that folder: c:/pdi_files/output.
Chapter 2
[ 67 ]
What is XML
XML stands for EXtensible Markup Language. It is basically a language designed to describe
data. XML les or documents contain informaon wrapped in tags. Look at this piece of XML
taken from the countries le:
<?xml version="1.0" encoding="UTF-8"?>
<world>
...
<country>
<name>Argentina</name>
<capital>Buenos Aires</capital>
<language isofficial="T">
<name>Spanish</name>
<percentage>96.8</percentage>
</language>
<language isofficial="F">
<name>Italian</name>
<percentage>1.7</percentage>
</language>
<language isofficial="F">
<name>Indian Languages</name>
<percentage>0.3</percentage>
</language>
</country>
...
</world>
The rst line in the document is the XML declaraon. It denes the XML version of the
document, and should always be present.
Below the declaraon is the body of the document. The body is a set of nested elements.
An element is a logical piece enclosed by a start-tag and a matching end-tag—for example,
<country> </country>.
Within the start-tag of an element, you may have aributes. An aribute is a markup
construct consisng of a name/value pair—for example, isofficial="F".
These are the most basic terminology related to XML les. If you want to know more about
XML, you can visit http://www.w3schools.com/xml/.
Geng Started with Transformaons
[ 68 ]
PDI transformation les
Despite the .ktr extension, PDI transformaons are just XML les. As such, you are able to
explore them inside and recognize dierent XML elements. Look the following sample text:
<?xml version="1.0" encoding="UTF-8"?>
<transformation>
<info>
<name>hello_world</name>
<description>My first transformation</description>
<extended_description>
This transformation generates 10 rows
with the message Hello World.
</extended_description>
...
</transformation>
This is an extract from the hello_world.ktr le. Here you can see the root element
named transformation, and some inner elements such as info and name.
Note that if you copy a step by selecng it in the Spoon canvas and pressing Ctrl+C , and then
pass it to a text editor, you can see its XML denion. If you copy it back to the canvas, a new
idencal step will be added to your transformaon.
Getting data from XML les
In order to get data from an XML le, you have to use the Get Data From XML input step.
To tell PDI which informaon to get from the le, it is required that you use a parcular
notaon named XPath.
XPath
XPath is a set of rules used for geng informaon from an XML document. In XPath, XML
documents are treated as trees of nodes. There are several types of nodes; elements,
aributes, and texts are some of them. As an example, world, country, and isofficial
are some of the nodes in the sample le.
Among the nodes there are relaonships. A node has a parent, zero or more children,
siblings, ancestors, and descendants depending on where the other nodes are in
the hierarchy.
In the sample countries le, country is the the parent of the elements name, capital, and
language. These three elements are children of country.
To select a node in an XML document, you have to use a path expression relave to a
current node.
Chapter 2
[ 69 ]
The following table has some examples of path expressions that you may use to specify
elds. The examples assume that the current node is language.
Path expression Descripon Sample expression
node_name Selects all child nodes of the
node named node_name.
percentage
This expression selects all child nodes of
the node percentage. It looks for the node
percentage inside the current node language.
.Selects the current node language
.. Selects the parent of the
current node
../capital
This expression selects all child nodes of the
node capital. It doesn't look in the current
node (language), but inside its parent, which
is country.
@Selects an aribute @isofficial
This expression gets the aribute isofficial
in the current node language.
Note that the expressions name and ../name are not the same. The
rst selects the name of the language, while the second selects the
name of the country.
For more informaon on XPath, follow this link: http://www.w3schools.com/XPath/.
Conguring the Get data from XML step
In order to specify the name and locaon of an XML le, you have to ll the File tab just as
you do in any le input step. What is dierent here is how you get the data.
The rst thing you have to do is select the path that will idenfy the current node. You do
it by lling the Loop XPath textbox in the Content tab. You can type it by hand, or you can
select it from the list of available paths by Clicking the Get XPath nodes buon.
Once you have selected a path, PDI will generate one row of data for every found path.
In the tutorial you selected /world/country/language. Then PDI generates one row for
each /world/country/language element in the le.
Aer selecng the loop XPath, you have to specify the elds to get. In order to do that,
you have to ll the grid in the Fields tab by using XPath notaon as explained in the
preceding secon.
Geng Started with Transformaons
[ 70 ]
Note that if you click the Get elds buon, PDI will ll the grid with the child nodes of the
current node. If you want to get some other node, you have to type its XPath by hand.
Also note the notaon for the aributes. To get an aribute, you can use the @ notaon as
explained, or you can simply type the name of the aribute without @ and select Aribute
under the Element column, as you did in the tutorial.
Kettle variables
In the last tutorial, you used the string ${Internal.Transformation.Filename.
Directory} to idenfy the folder where the current transformaon was saved. You also
used the string ${LABSOUTPUT} to dene the desnaon folder of the output le.
Both strings, ${Internal.Transformation.Filename.Directory} and
${LABSOUTPUT}, are Kele variables, that is, keywords linked to a value. You use the
name of a variable, and when the transformaon runs, the name of the variable is
replaced by its value.
The rst of these two variables is an environment variable, and it is not the only available.
Other known environment variables are ${user.home}, ${java.io.tmpdir}, and
${java.home}. All these variables are ready to use any me you need.
The second variable is a variable you dened in the kettle.properties le. In this le
you may dene as many variables as you want. The only thing you have to keep in mind is
that those variables will be available inside Spoon aer you restart it.
These two kinds of variables—environment variables and variables dened in the
kettle.properties le—are the most primive kinds of variables found in PDI.
All of these variables are string variables and their scope is the Java virtual machine.
How and when you can use variables
Any me you see a red dollar sign by the side of a textbox, you may use a variable. Inside the
textbox you can mix variable names with stac text, as you did in the tutorial when you put
the name of the desnaon as ${LABSOUTPUT}/countries_info.
To see all the available variables, you have to posion the cursor in the textbox, press
Ctrl+Space, and a full list is displayed for you to select the variable of your choice. If you put
the mouse cursor over any of the variables for a second, the actual value of the variable will
be shown.
If you know the name of the variable, you don't need to select it from the list. You may type
its name, by using either of these notaons—${<name>} or %%<name>%%.
Chapter 2
[ 71 ]
Have a go hero – exploring XML les
Now you can explore by yourself. On the Packt website there are some sample XML les.
Download them and try this:
• Read the customer.xml le and create a list of customers.
• Read the tomcat-users.xml le and get the users and their passwords.
• Read the areachart.xml and get the color palee, that is, the list of colors used.
The customer le is included in the Pentaho Report Designer soware package.
The others come with the Pentaho BI package. This soware has many XML les
for you to use. If you are interested you can download the soware from
http://sourceforge.net/projects/pentaho/files/.
Have a go hero – enhancing the output countries le
Modify the transformaon in the tutorial so that the Excel output uses a template. The
template will be an Excel le with the header and format already applied, and will be located
in a folder inside the pdi_labs folder.
Templates are congured in the Content tab of the Excel conguraon window.
In order to set the name for the template, use internal variables.
Have a go hero – documenting your work
As explained, transformaons are nothing dierent than XML les. Now you'll create a new
transformaon that will take as input the transformaons you've created so far, and will
create a simple Excel spreadsheet with the name and descripon of all your transformaons.
If you keep this sheet updated by running the transformaon on a regular basis, it will be
easier to nd a parcular transformaon you created in the past.
To get data from the transformaons les, use the Get data from XML step.
As wildcard, use .*\.ktr. Doing so, you'll get all the les.
On the other hand, as Loop XPath, use /transformation/info.
Geng Started with Transformaons
[ 72 ]
Summary
In this chapter you learned how to get data from les and put data back into les.
Specically, you learned how to:
Get data from plain les and also from XML les
Put data into text les and Excel les
Get informaon from the operang system such as command-line arguments and
system date
We also discussed the following:
The main PDI terminology related to data, for example datasets, data types,
and streams
The Select values step, a commonly used step for selecng, reordering, removing
and changing data
How and when to use Kele variables
How to run transformaons from a terminal with the Pan command
Now that you know how to get data into a transformaon, you are ready to start
manipulang data. This is going to happen in the next chapter.
3
Basic Data Manipulation
In the previous chapter, you learned how to get data into PDI. Now you're ready to
begin transforming that data. This chapter explains the simplest and most used ways
of transforming data. We will cover the following:
Execung basic operaons
Filtering and sorng of data
Looking up data outside the main stream of data
By the end of this chapter, you will be able to do simple but meaningful transformaons on
dierent types of data.
Basic calculations
You already know how to create a transformaon and read data from an external source.
Now, taking that data as a starng point, you will begin to do basic calculaons.
Basic Data Manipulaon
[ 74 ]
Time for action – reviewing examinations by using the
Calculator step
Can you recollect the exercise about examinaons you did in the previous chapter? You
created an incremental le with examinaon results. The nal le looked like the following:
---------------------------------------------------------
Annual Language Examinations
Testing writing, reading, speaking and listening skills
---------------------------------------------------------
student_code;name;writing;reading;speaking;listening;file_
processed;process_date
80711-85;William Miller; 81;83;80;90;C:\pdi_files\input\first_turn.
txt;28-05-2009
20362-34;Jennifer Martin; 87;76;70;80;C:\pdi_files\input\first_turn.
txt;28-05-2009
75283-17;Margaret Wilson; 99;94;90;80;C:\pdi_files\input\first_turn.
txt;28-05-2009
83714-28;Helen Thomas; 89;97;80;80;C:\pdi_files\input\first_turn.
txt;28-05-2009
61666-55;Maria Thomas; 88;77;70;80;C:\pdi_files\input\first_turn.
txt;28-05-2009
...
Now you want to convert all grades in the scale 0-100 to a new scale from 0 to 5. Also, you
want to take the average grade to see how the students did.
1. Create a new transformaon, give it a name and descripon, and save it.
2. By using a Text le input step, read the examination.txt le. Give the name and
locaon of the le, check the Content tab to see that everything matches your le, and
ll the Fields tab as here:
Chapter 3
[ 75 ]
3. Do a preview just to conrm that the step is well congured.
Noce that you have several lines as header. Because the
names of the elds are not in the rst row, you won't be able
to use the Get Fields buon successfully. You will have to write
the elds manually, or you can avoid it by doing the following:
Congure the step with a copy of the le that doesn't have the
extra heading, just the heading row with the names of the elds.
Then, restore the name of your le in the File tab, adjust the
number of headings in the Content tab, and your step is ready.
4. Use the Select values step to remove the elds you will not use—file_processed
and process_date.
Basic Data Manipulaon
[ 76 ]
5. Drag another Select values step to the canvas. Select the Meta-data tab and change the
meta-data of the numeric elds like here:
6. Near the upper-le corner of the screen, above the step tree, there is a textbox for
searching. Type calc in the textbox. While you type, a lter is applied to show you only
the steps that contain, in their name or descripon, the text you typed. You should be
seeing this:
7. Among the steps you see, select the Calculator step and drag it to the canvas.
Chapter 3
[ 77 ]
8. To remove the lter, clear the typed text.
9. Create a hop from the Text le input step to the Calculator step.
10. Edit the Calculator step and ll the grid as follows:
11. To ll the Calculaon column, simply select the operaon from the list provided. Be sure
to ll every column in the grid like shown in the screenshot.
You don't have to feel like you are doing data entry instead
of learning PDI. You can avoid typing by copying and pasng
similar rows, and then xing the values properly. Appendix D
has a list of shortcuts you can use when eding grids like these.
12. Leave the Calculator step selected and click the Preview this transformaon buon
followed by the Quick Launch buon. You should see something similar to the
following screenshot:
Basic Data Manipulaon
[ 78 ]
The numbers may vary according to the contents of your le.
13. Edit the calculator again and change the content of the Remove column like here:
14. From the Transform category of steps, add a Sort rows step and create a hop from the
Calculator step to this new step.
15. Edit the Sort rows step by double-clicking it, click the Get Fields buon, and adjust the
grid as follows:
16. Click OK.
Chapter 3
[ 79 ]
17. Drag a third Select values step, create a hop from the Sort rows step to this new step,
and use it to keep only the elds by which you ordered the data:
18. From the Flow category of steps, add a Dummy step and create a hop from the last
Select values step to this.
19. Select the Dummy step and do a preview.
20. The nal preview looks like the following screenshot:
Basic Data Manipulaon
[ 80 ]
If you get an error or a dierent result, review the explanaon and make
sure that you followed the instrucons correctly. Do a preview on each
step to discover in which one you have the problem. If you realize that
the problem is in any of the steps that read the input les, please refer
to the Troubleshoong reading les secon in Chapter 2.
What just happened?
You read the examination.txt le, and did some calculaons to see how the students did.
You did the calculaons by using the Calculator step.
First of all, you removed the elds you didn't need from the stream of data.
Aer that, you did the following calculaons:
By dividing by 20, you converted all grades from the scale 0-100 to the scale 0-5.
Then, you calculated the average of the grades for the four skills—wring, reading, listening,
and speaking. You created two auxiliary elds, aux1 and aux2, to calculate paral sums. Aer
that, you created the eld total with the sum of aux1 and aux2, another auxiliary eld with
the number 4, and nally the avg as the division of the total by the eld four.
In order to obtain the new grades, as well as the average with two decimal posions, you
need the result of the operaon to be of a numeric type with precision 2. Therefore, you
had to change the metadata, by adding a Select values step before the Calculator. With the
Select values you changed the type of the numeric elds from integer to number, that is,
oat numbers. If you didn't, the quoents would have been rounded to integer numbers.
You can try and see for yourself!
The rst me you edited the calculator, you set the eld Remove to N for every row in the
calculator grid. By doing this, you could preview every eld created in the calculator, even
the auxiliary ones such as the elds twenty, aux1, and aux2. You then changed the eld to
Y so that the auxiliary elds didn't pass to the next step.
Aer doing the calculaons, you sorted the data by using a Sort rows step. You specied the
order by avg descending, then by student_code ascending.
Chapter 3
[ 81 ]
Sorng data
For small datasets, the sorng algorithm runs mainly using the JVM memory.
When the number of rows exceeds 5,000, it works dierently. Every ve
thousand rows, the process sorts them and writes them to a temporary le.
When there are no more rows, it does a merge sort on all those les and gives
you back the sorted dataset. You can conclude that for huge datasets a lot
of reading and wring operaons are done on your disk, which slows down
the whole transformaon. Fortunately, you can change the number of rows
in memory (5,000 by default) by seng a new value in the Sort size (rows in
memory) textbox. The bigger this number, the faster the sorng process.
Note that a sort size that works in your system may not work in a machine with
a dierent conguraon. To avoid that risk, you can use a dierent approach.
In the Sort rows conguraon window, you can set a Free memory threshold
(in %) value. The process begins to use temporary les when the percentage
of available memory drops below the indicated threshold. The lower the
percentage, the faster the process.
As it's not possible to know the exact amount of free memory, it's not
recommended to set a very small free memory threshold. You denitely
shouldn't use that opon in complex transformaons or when there is more
than one sort going on, as you could sll run out of memory.
The two nal steps were added to keep only the elds of interest, and to preview the result
of the transformaon. You can change the Dummy step for any of the output steps you
already know.
You've used the Dummy step several mes but sll nothing has been said
about it. Mainly it was because it does nothing! However, you can use it as a
placeholder for tesng purposes as in the last exercise.
Note that in this tutorial you used the Select values step in three dierent ways:
To remove elds by using the Remove tab.
To change the meta-data of some elds by using the Meta-data tab.
To select and rename elds by using the Select tab.
Remember that the Select values step's tabs are exclusive! You can't use more
than one in the same step!
Basic Data Manipulaon
[ 82 ]
Besides calculaon, in this tutorial you did something you hadn't before—searching the
step tree.
When you don't remember where a step is in the steps tree, or when you just
want to nd if there is a step that does some kind of operaon, you could simply
type the search criterion in the textbox above the steps tree. PDI does a search
and lters all the steps that have that text as part of their name or descripon.
Adding or modifying elds by using different PDI steps
In this tutorial you used the Calculator step to create new elds and add them to
your dataset. The Calculator is one the many steps that PDI has to create new elds by
combining existent ones. Usually you will nd these steps under the Transform category
of the steps tree. The following table describes some of them (the examples refer to the
examinaon le):
Step Descripon Example
Split Fields Split a single eld into two
or more. You have to give
the character that acts as
separator.
Split the name into two elds: Name and
Last Name. The separator would be a space
character.
Add constants Add one or more constants
to the input rows
Add two constants: four and twenty. Then
you could use them in the Calculator step
without dening the auxiliary elds.
Replace in string Replace all occurrences of
a text in a string eld with
another text
Replace the – in the student code by a /.
For example: 108418-95 would become
108418/95.
Number range Create a new eld based on
ranges of values. Applies to
a numeric eld.
Create a new eld called exam_range with
two ranges: Range A with the students with
average grade below 3.5, and Range B with
students with average grade greater or equal
to 3.5.
Value Mapper Creates a correspondence
between the values of
a eld and a new set of
values.
Suppose you calculated the average grade as
an integer number ranging from 0 to 5. You can
map the average to A, B, C, D, like this:
Old value: 5; New value: A
Old value: 3, 4; New value: B
Old value: 1, 2; New value: C
Old value: 0; New value: D
Chapter 3
[ 83 ]
Step Descripon Example
User Dened Java
Expression
Creates a new eld by
using a Java expression that
involves one or more elds.
This step may eventually
replace any of the above
but it's only recommended
for those familiar with Java.
Create a ag (a Boolean eld) that tells if a
student passed. A student passes if his/her
average grade is above 4.5.
The expression to use could be:
(((writing+reading+speaking+
listening)/4)>4.5)?true:false
Any of these steps when added to your transformaon, are executed for every row in the
stream. It takes the row, idenes the elds needed to do its tasks, calculates the new
eld(s), and adds it to the dataset.
For details on a parcular step, don't hesitate to visit the Wiki page for steps:
http://wiki.pentaho.com/display/EAI/Pentaho+Data+Integration+v3.2.+St
eps
The Calculator step
The Calculator step you used in the tutorial, allows you to do simple calculaons not only
on numeric elds, but also on data and text. The Calculator step is not the only means to do
calculaons, but it is the simplest. It allows you to do simple calculaons in a quick fashion.
The step has a grid where you can add all the elds you want to. Every row represents
an operaon that involves from one up to three operands (depending on the selected
operaon). When you select an operaon, the descripon of the operaon itself tells you
which argument it needs. For example:
If you select Set constant eld to value A, you have to provide a constant value
under the column name A.
If you select A/B, the operaon needs two arguments, and you have to provide
them by indicang the elds to use in the columns named A and B respecvely.
The result of every operaon becomes a new eld in your dataset, unless you set the
Remove column to Y. The name of the new eld is the one you type under the New
eld column.
For each and every row of the data set, the operaons dened in the Calculator are
calculated in the order in which they appear. Therefore, you may create auxiliary elds and
then use them in rows of the Calculator grid that are below them. That is what you did in
the tutorial when you dened the auxiliary elds aux1 and aux2 and then used them in the
eld total.
Basic Data Manipulaon
[ 84 ]
Just like every grid in Kele, you have a contextual menu (and its corresponding shortcuts)
that lets you manipulate the rows by deleng, moving, copying and pasng, and so on.
The Formula step
The Formula step is another step you can use for doing calculaons. Let's give it a try by
using it in the examinaon tutorial.
Time for action – reviewing examinations by using the
Formula step
In this tutorial you will redo the previous exercise, but this me you will do the calculaons
with the Formula step.
1. Open the transformaon you just nished.
2. Delete from the transformaon the Calculator step, and put in its place a Formula
step. You will nd it under the Scripng category of steps.
3. Add a eld named writing.
4. When you click the cell under the Formula column, a window appears to edit the
formula for the new eld.
5. In the upper area of the window, type [writing]/20. You will noce that the
sentence is red if it is incomplete or the syntax is incorrect. In that case, the error is
shown below the eding area, like in the following example:
Chapter 3
[ 85 ]
6. As soon as the formula is complete and correct, the red color disappears.
7. Click OK.
8. The formula you typed will be displayed in the cell you clicked.
9. Set Number as the type for the new eld, and type writing in the Replace value
column.
10. Add three more elds to the grid in the same way you added this eld so that the
grid looks like the following:
11. Click OK.
12. Add a second Formula step.
13. Add a eld named avg and click the Formula cell to edit it.
14. Expand the Mathemacal category of funcons to the leside of the window, and click
the AVERAGE funcon.
Basic Data Manipulaon
[ 86 ]
15. The explanaon of the selected funcon appears to guide you.
16. In the eding area, type average([writing];[reading];[speaking];
[listening]).
17. Click OK.
18. Set the Value type to Number.
19. Click OK.
20. Create a hop from this step to the Sort rows step.
21. Edit the last Select values step.
22. Click Get elds to select.
23. A queson appears to ask you what to do. Click Clear and add all.
24. The grid is reloaded with the modied elds.
25. Click on the Dummy step and do a preview.
26. There should be no dierence with what you had in the Calculator version of
the tutorial:
Chapter 3
[ 87 ]
What just happened?
You read the examination.txt le, and did some calculaons using the Formula step to
see how the students did.
It may happen that the preview window shows you less decimal posions than
expected. This is a preview issue. One of the ways you have to see the numbers
with more decimals is to send the numbers to an output le with a proper
format and see the numbers in the le.
As you saw, you have quite a lot of funcons available for building formulas and expressions.
To reference a eld you have to use square brackets, like in [writing]. You may reference
only the current elds of the row. You have no way to access previous rows of the grid as
you have in the Calculator step and so you needed two Formula steps to replace a single
Calculator. But you saved auxiliary elds because the Formula allows you to type complex
formulas in a single eld without using paral calculaons.
When the calculaons are not simple, that is, they require resolving a complex
formula or involve many operands, then you might prefer the Formula step over
the Calculator.
The Formula step uses the library Libformula. The syntax used in LibFormula is based
on the OpenFormula standard. For more informaon on OpenFormula, you may visit
http://wiki.oasis-open.org/office/About_OpenFormula.
Basic Data Manipulaon
[ 88 ]
Have a go hero – listing students and their examinations results
Let's play a lile with the examinaon le. Suppose you decide that only those students
whose average grade was above 3.9 will pass the examinaon; the others will not. List the
students ordered by average (desc.), last name (asc.), and name (asc.). The output list should
have the following elds:
Student code
Name
Last Name
Passed (yes/no)
average grade
Pop quiz – concatenating strings
Suppose that you want to create a new eld as the student_code plus the name of the
student separated by a space, as for example 867432-94 Linda Rodriguez. Which of the
following are possible soluons for your problem:
a. Use a Calculator, using the calculaon a+b+c, where a is student_code, b is a
space, and c is the name eld.
b. Use a Formula, using as formula [student_code]+" "+[name]
c. Use a Formula, using as formula [student_code]&" "&[name]
You may choose more than one opon.
Calculations on groups of rows
You just learned to do simple operaons for every row of a dataset. Now you are ready to
go beyond. Suppose you have a list of daily temperatures of a given country over a year. You
may want to know the overall average temperature, the average temperature by region,
or the coldest day of the year. When you work with data, these types of calculaons are a
common requirement. In this secon you will learn to address those requirements with PDI.
Chapter 3
[ 89 ]
Time for action – calculating World Cup statistics by
grouping data
Let's forget the examinaons for a while, and retake the World Cup tutorial from the
previous chapter. The le you obtained from that tutorial was a list of results of football
matches. These are sample rows of the nal le:
Match Date;Home Team;Away Team;Result
02/06;Italy;France;2-1
02/06;Argentina;Hungary;2-1
06/06;Italy;Hungary;3-1
06/06;Argentina;France;2-1
10/06;France;Hungary;3-1
10/06;Italy;Argentina;1-0
...
Now you want to take that informaon to obtain some stascs such as the maximum
number of goals per match in a given day. To do it, follow these instrucons:
1. Create a new transformaon, give it a name and descripon, and save it.
2. By using a Text le input step, read the wcup_first_round.txt le you generated
in Chapter 2. Give the name and locaon of the le, check the Content tab to see that
everything matches your le, and ll the Fields tab.
3. Do a preview just to conrm that the step is well congured.
4. From the Transform category of step, select a Split Fields step, drag it to the work area,
and create a hop from the Text le input to this step.
5. Double-click the Split Fields steps and ll the grid like done in the following screenshot:
Basic Data Manipulaon
[ 90 ]
6. Add a Calculator step to the transformaon and create a hop from the Split Fields step
to this step and edit the step to create the following new elds:
7. Add a Sort rows step to the transformaon, create a hop from the Calculator step to this
step, and sort the elds by Match_Date.
8. Expand the Stascs category of steps, and drag a Group by step to the canvas. Create a
hop from the Sort rows step to this new step.
9. Edit the Group by step and ll the conguraon window as shown next:
Chapter 3
[ 91 ]
10. When you click the OK buon, a window appears to warn you that this step
needs the input to be sorted on the specied keys—the Range eld in this case.
Click I understand, and don't worry because you already sorted the data in the
previous step.
11. Add a nal Dummy step.
12. Select the Dummy and the Group by steps, le-click one and holding down the Shi
key, le-click the other.
13. Click the Preview this transformaon buon. You will see the the following:
14. Click Quick Launch. The following window appears:
15. Double-click the Sort rows step. A window appears with the data coming out of the Sort
rows step.
16. Double-click the Dummy step. A window appears with the data coming out of the
Dummy step.
Basic Data Manipulaon
[ 92 ]
17. If you rearrange the preview windows, you can see both preview windows at a me, and
understand beer what happened with the numbers. The following would be the data
shown in the windows:
What just happened?
You opened a le with results from several matches and got some stascs from it.
In the le, there was a column with the match result in the format n-m, with n being the
goals of the home team and m being the goals of the away team. With the Split Fields step,
you split this eld in two—one with each of these two numbers.
With the Calculator you did two things:
You created a new eld with the total number of goals for each match.
You created a descripon for the match.
Chapter 3
[ 93 ]
Note that in order to create a descripon, you used the + operator to
concatenate string rather than add numbers.
Aer that, you ordered the data by match date with a Sort rows step.
In the preview window of the Sort rows step, you could see all the calculated elds: home
team goals, away team goals, match goals, and descripon.
Finally, you did some stascal calculaons:
First, you grouped the rows by match date. You did this by typing Match_Date in the
upper grid of the Group by step.
Then, for every match date, you calculated some stascs. You did the calculaons by
adding rows in the lower grid of the step, one for every stasc you needed.
Let's see how it works. Because the Group by step was preceded by a Sort rows step, the
rows came to the step already ordered. When the rows arrive to the Group by step, Kele
creates groups based on the eld(s) indicated in the upper grid—the Match_Date eld in this
case. The following drawing shows this idea:
Basic Data Manipulaon
[ 94 ]
Then, for every group, the elds that you put in the lower grid are calculated. Let's see, for
example, the group for the match date 03/06. For the rows in this group, Kele calculated
the following:
Matches: The number of matches played on 03/06. There were 4.
Sum of goals: The total number of goals converted on 03/06. There were 3+2+3+4=12.
Maximum: The maximum number of goals converted in a single match played on 03/06.
The maximum among 3, 2, 3, and 4 was 4.
Teams: The descripons of the teams which played on 03/06, separated by ; : Austria-
Spain; Sweden-Brazil; Netherlands-Iran; Peru-Scotland.
The same calculaons were made for every group. You can verify the details by looking in the
preview window.
Look at the Step Metrics tab in the Execuon Results area of the screen:
Note that 24 rows entered the Group by step and only 7 came out of that step towards the
Dummy step. That is because aer the grouping, you no longer have the detail of matches.
The output of the Group by step is your new data now—one row for every group created.
Group by step
The Group by step allows you to create groups of rows and calculate new elds over
those groups.
In order to dene the groups, you have to specify which eld(s) are the keys. For every
combinaon of values for those elds, Kele builds a new group.
In the tutorial you grouped by a single eld Match_date. Then for every value of
Match_date, Kele created a dierent group.
Chapter 3
[ 95 ]
The Group by step operates on consecuve rows. Suppose that the rows are already sorted
by date, but those with date 10/06 are above the rest. The step traverses the dataset and
each me the value for any of the grouping eld changes, it creates a new group. If you
see it this way, you will noce that the step will work even if the data is not sorted by the
grouping eld.
As you probably don't know how the data is ordered, it is safer and
recommended that you sort the data by using a Sort rows step just
before using a Group by step.
Once you have dened the groups, you are free to specify new elds to be calculated
for every group. Every new eld is dened as an aggregate funcon over some of the
existent elds.
Let's review some of the elds you created in the tutorial:
The Matches eld is the result of applying the Number of values funcon over
the eld Match_date.
The Sum of goals eld is the result of applying the Sum funcon over the
eld goals.
The Maximum eld is the result of applying the Maximum funcon over the
eld goals.
Finally, you have the opon to calculate aggregate funcons over the whole dataset. You do
this by leaving the upper grid blank. Following the same example, you could calculate the
total number of matches and the average number of goals for all those matches. This is how
you do it:
Basic Data Manipulaon
[ 96 ]
The following is what you get:
In any case, as a result of the Group by step, you will no longer have the detailed rows,
unless you check the Include all rows? checkbox.
Have a go hero – calculating statistics for the examinations
Here you have one more task related with the examinaons le. Create a new
transformaon, read the le, and calculate:
The number of students who passed
The number of students who failed
The average wring, reading, speaking, and listening grade obtained by students
who passed
The average wring, reading, speaking, and listening grade obtained by students
who failed
The minimum and maximum average grade among students who passed
The minimum and maximum average grade among students who failed
Use the Number range step to dene the range of the average
grade; then use a Group by step to calculate the stascs.
Chapter 3
[ 97 ]
Have a go hero – listing the languages spoken by country
Read the le with countries' informaon you used in Chapter 2. Build a le where each row
has two columns—the name of a country and the list of spoken languages in that country.
As aggregate, use the opon Concatenate strings separated by.
Filtering
Unl now you learned how to accomplish several kinds of calculaons that enriched the set
of data. There is sll another kind of operaon that is frequently used, and does not have to
do with enriching the data but with discarding data. It is ltering unwanted data. Now you
will learn how to discard rows under given condions.
Time for action – counting frequent words by ltering
Let's suppose, you have some plain text les, and you want to know what is said in them. You
don't want to read them, so you decide to count the mes that words appear in the text, and
see the most frequent ones to get an idea of what the les are about.
Before starng, you'll need at least one text le to play with. The text le used in
this tutorial is named smcng10.txt and is available for you to download from
the Packt website.
Let's work:
1. Create a new transformaon.
2. By using a Text le input step, read your le. The trick here is to put as a separator
a sign you are not expecng in the le, for example |. By doing so, the enre line
would be recognized as a single eld. Congure the Fields tab by dening a single
string eld named line.
3. From the Transform category of step, drag to the canvas a Split eld to rows step,
and create a hop from Text le input step to this new step.
Basic Data Manipulaon
[ 98 ]
4. Congure the step like this:
5. With this last step selected, do a preview. Your preview window should look like this:
6. Close the preview window.
7. Expand the Flow category of steps, and drag a Filter rows step to the work area.
8. Create a hop from the last step to the Filter rows step.
9. Edit the Filter rows step by double-clicking it.
Chapter 3
[ 99 ]
10. Click the <field> textbox to the le of the = sign. The list of elds appears.
Select word.
11. Click the = sign. A list of operaons appears. Select IS NOT NULL.
12. The window looks like the following:
13. Click OK.
14. From the Transform category of steps drag a Sort rows step to the canvas, and
create a hop from the Filter rows step to this new step.
15. Sort the rows by word.
16. From the Stascs category, drag a Group by step, and create a hop from the Sort
rows step to this step.
17. Congure the grids in the Group by conguraon window like shown:
Basic Data Manipulaon
[ 100 ]
18. Add a Calculator step, create a hop from the last step to this, and calculate the new
eld len_word represenng the length of the words. For that, use the calculator
funcon Return the length of a string A and select word from the
drop-down menu for Field A.
19. Expand the Flow category and drag another Filter rows step to the canvas.
20. Create a hop from the Calculator step to this step and edit it.
21. Click <field> and select counter.
22. Click the = sign, and select >.
23. Click <value>. A small window appears.
24. In the Value textbox of the lile window, enter 2.
25. Click OK.
26. Posion the mouse cursor over the icon in the upper-right corner of the window.
When the text Add condion shows up, click on the icon.
27. A new blank condion is shown below the one you created.
28. Click on null = [] and create the condion len_word>3, in the same way you
created the condion counter>2.
29. Click OK.
Chapter 3
[ 101 ]
30. The nal condion looks like this:
31. Add one more Filter rows step to the transformaon and create a hop from the last
step to this new step.
32. On the le side of the condion, select word.
33. As comparator select IN LIST.
34. At the end of the condion, inside the textbox value, type the following:
a;an;and;the;that;this;there;these.
35. Click the upper-le square above the condion and the word NOT will appear.
36. The condion looks like the following:
Basic Data Manipulaon
[ 102 ]
37. Add a Sort rows step, create a hop from the previous step to this step, and sort the
rows in the descending order of counter.
38. Add a Dummy step at the end of the transformaon, create a hop from the last step
to the Dummy step.
39. With the Dummy step selected, preview the transformaon. The following is what
you should see now:
What just happened?
You read a regular plain le and arranged the words that appear in the le in some
parcular fashion.
The rst thing you did was to read the plain le and split the lines so that every word became
a new row in the dataset. Consider, for example, the following line:
subsidence; comparison with the Portillo chain.
Chapter 3
[ 103 ]
The spling of this line resulted in the following rows being generated:
Thus, a new eld named word became the basis for your transformaon.
First of all, you discarded rows with null words. You did it by using a lter with the condion
word IS NOT NULL. Then, you counted the words by using the Group by step you learned
in the previous tutorial. Once you counted the words, you discarded those rows where the
word was too short (length less than 4) or too common (comparing to a list you typed).
Once you applied all those lters, you sorted the rows in the descending order of
the number of mes the word appeared in the le so that you could see the most
frequent words.
Scrolling down a lile the preview window to skip some preposions, pronouns, and other
very common words that have nothing to do with a specic subject, you found words such
as shells, strata, formaon, South, elevaon, porphyric, Valley, terary, calcareous, plain,
North, rocks, and so on. If you had to guess, you would say that this was a book or arcle
about geology, and you would be right. The text taken for this exercise was Geological
Observaons on South America by Charles Darwin.
Filtering rows using the Filter rows step
The Filter rows step allows you to lter rows based on condions and comparisons.
The step checks the condion for every row. Then it applies a lter leng pass only the rows
for which the condion is true. The other rows are lost.
In the counng words exercise, you used the Filter rows step several mes so you already
have an idea of how it works. Let's review it.
Basic Data Manipulaon
[ 104 ]
In the Filter rows seng window you have to enter a condion. The following table
summarizes the dierent kinds of condions you may enter:
Condion Descripon Example
A single eld followed by IS NULL or
IS NOT NULL
Checks whether the value of a
eld in the stream is null
word IS NOT NULL
A eld, a comparator, and a constant Compares a eld in the stream
against a constant value.
counter > 2
Two elds separated by a comparator Compares two elds in the
stream
line CONTAINS
word
You can combine condions as shown here:
counter > 2
AND
len_word>3
You can also create subcondions such as:
(
counter > 2
AND
len_word>3
)
OR
(word in list geology; sun)
In this last example, the condion lets the word geology pass even if it appears only once. It
also lets the word sun pass, despite its length.
When eding condions, you always have a contextual menu which allows you to add and
delete sub-condions, change the order of existent condions, and more.
Maybe you wonder what the Send 'true' data to step: and Send 'false' data to step: textboxes
are for. Be paent, you will learn how to use them in Chapter 4.
Have a go hero – playing with lters
Now it is your turn to try ltering rows. Modify the counng_words transformaon in the
following way:
Alter the Filter rows steps. By using a Formula step create a ag (a Boolean eld)
that evaluates the dierent condions (counter>2, and so on). Then use only one
Filter rows step that lters the rows for which the ag is true. Test it and verify that
the results are the same as before the change.
Chapter 3
[ 105 ]
In the Formula eding window, use the opons under the Logic category.
Then in the Filter rows step, you can type true or Y as the value against which
you compare the ag.
Add a sub-condion to avoid excluding some words, just like the one in the example:
(word in list geology; sun). Change the list of words and test the lter to see
that the results are as expected.
Have a go hero – counting words and discarding those that are
commonly used
If you take a look at the results in the tutorial, you may noce that some words appear more
than once in the nal list because of special signs such as . , ) or ", or because of lower
or upper case leers. For example, look how many mes the word rock appears: rock (99
occurrences) - rock,(51 occurrences) – rock. (11 occurrences) – rock." (1 occurrence)
- rock: (6 occurrences) - rock; - (2 occurrences). You can x this and make the word rock
appear only once: Before grouping the words, remove all extra signs and convert all words to
lower case or upper case, so they are grouped as expected.
Try one or more of the following steps: Formula, Calculator, Replace in string.
Looking up data
Unl now, you have been working with a single stream of data. When you did calculaons or
created condions to compare elds, you only involved elds of your stream. Usually, this is
not enough, and you need data from other sources. In this secon you will learn to look up
data outside your stream.
Time for action – nding out which language people speak
An Internaonal Musical Contest will take place and 24 countries will parcipate, each
presenng a duet. Your task is to hire interpreters so the contestants can communicate in
their nave language. In order to do that, you need to nd out the language they speak:
1. Create a new transformaon.
2. By using a Get Data From XML step, read the countries.xml le that contains
informaon about countries that you used in Chapter 2.
Basic Data Manipulaon
[ 106 ]
To avoid conguring the step again, you can open the transformaon
that reads this le, copy the Get data from XML step, and paste it here.
3. Drag a Filter rows step to the canvas.
4. Create a hop from the Get data from XML step to the Filter rows step.
5. Edit the Filter rows step and create the condion- isofficial= T.
6. Click the Filter rows step and do a preview. The list of previewed rows will show the
countries along with the ocial languages:
Now let's create the main ow of data:
7. From the book website download the list of contestants. It looks like this:
ID;Country;Duet
1;Russia;Mikhail Davydova
;;Anastasia Davydova
2;Spain;Carmen Rodriguez
;;Francisco Delgado
3;Japan;Natsuki Harada
;;Emiko Suzuki
4;China;Lin Jiang
;;Wei Chiu
5;United States;Chelsea Thompson
;;Cassandra Sullivan
6;Canada;Mackenzie Martin
;;Nathan Gauthier
7;Italy;Giovanni Lombardi
;;Federica Lombardi
Chapter 3
[ 107 ]
8. In the same transformaon, drag a Text le Input step to the canvas and read the
downloaded le.
The ID and country have values only in the rst of the two
lines for each country. In order to repeat the values in the
second line use the ag Repeat in the Fields tab. Set it to Y.
9. Expand the Lookup category of steps.
10. Drag a Stream lookup step to the canvas.
11. Create a hop from the Text le input you just created, to the Stream lookup step.
12. Create another hop from the Filter rows step to the Stream lookup step.
13. Edit the Stream lookup step by double-clicking it.
14. In the Lookup step drop-down list, select Filter ocial languages, the step that brings
the list of languages.
15. Fill the grids in the conguraon window as follows:
Note that Country Name is a eld coming from the text le stream, while the country
eld comes from the countries stream.
Basic Data Manipulaon
[ 108 ]
16. Click OK.
17. The hop that goes from the Filter rows step to the Stream lookup step changes its look
and feel, to show that this is the stream where the Stream lookup is going to look:
18. Aer the Stream lookup, add a Filter rows step.
19. In the Filter rows step, type the condion language-IS NOT NULL.
20. By using a Select values step, rename the elds Duet, Country Name and
language to Name, Country, and Language.
21. Drag a Text le output step to the canvas and create the le
people_and_languages.txt with the selected elds.
22. Save the transformaon.
23. Run the transformaon and check the nal le, which should look like this:
Name|Country|Language
Mikhail Davydova|Russia|
Anastasia Davydova|Russia|
Carmen Rodriguez|Spain|Spanish
Francisco Delgado|Spain|Spanish
Natsuki Harada|Japan|Japanese
Emiko Suzuki|Japan|Japanese
Chapter 3
[ 109 ]
Lin Jiang|China|Chinese
Wei Chiu|China|Chinese
Chelsea Thompson|United States|English
Cassandra Sullivan|United States|English
Mackenzie Martin|Canada|French
Nathan Gauthier|Canada|French
Giovanni Lombardi|Italy|Italian
Federica Lombardi|Italy|Italian
What just happened?
First of all, you read a le with informaon about countries and the languages spoken in
those countries.
Then you read a list of people along with the country they come from. For every row in this
list, you told Kele to look for the country (Country Name eld) in the countries stream
(country eld), and to give you back a language and the percentage of people that speaks
that language (language and percentage elds). Let's explain it with a sample row: The
row for Francisco Delgado from Spain. When this row gets to the Stream lookup step,
Kele looks in the list of countries for a row with the country Spain. It nds it. Then, it
returns the value of the columns language and percentage: Spanish and 74.4.
Now take another sample row—the row with the country Russia. When the row gets to the
Stream lookup step, Kele looks for it in the list of countries, but it doesn't nd it. So what
you get as language is a null string.
Whether the country is found or not, two new elds are added to your stream—language
and percentage.
Aer the Stream lookup step, you discarded the rows where language is null, that is, those
whose country wasn't found in the list of countries.
With the successful rows you generated an output le.
The Stream lookup step
The Stream lookup step allows you to look up data in a secondary stream.
You tell Kele which of the incoming streams is the stream used to look up, by selecng the
right choice in the Lookup step list.
The upper grid in the conguraon window allows you to specify the names of the elds that
are used to look up.
Basic Data Manipulaon
[ 110 ]
In the le column, Field, you indicate the eld of your main stream. You can ll this
column by using the Get Fields buon, and deleng all the elds you don't want to use
for the search.
In the right column, Lookup Field, you indicate the eld of the secondary stream.
When a row of data comes to the step, a lookup is made to see if there is a row in the
secondary stream for which, every pair (Field, LookupField) in the grid has the value of
Field equal to the value of LookupField. If there is one, the look up will be successful.
In the lower grid, you specify the names of the secondary stream elds that you want back
as a result of the look up. You can ll this column by using the Get lookup elds buon, and
deleng all the elds you don't want to retrieve.
Aer the lookup, new elds are added to your dataset—one for every row of this grid.
For the rows for which the look up is successful, the values for the new elds will be taken
from the lookup stream.
For the others, the elds will remain null, unless you set a default value.
Chapter 3
[ 111 ]
When you use a Stream lookup, all lookup data is loaded into memory. Then the stream
lookup is made using a hash table algorithm. Even if you don't know how this algorithm
works, it is important that you know the implicaons of using this step:
First, if the data where you look is huge, you take the risk of running out
of memory.
Second, only one row is returned per key. If the key you are looking for is present
more than once in the lookup stream, only one will be returned—for example, in the
tutorial where there are more than one ocial languages spoken in a country, you
get just one. Somemes you don't care, but on some occasions this is not acceptable
and you have to try some other methods. You'll learn other ways to do this later in
the book.
Have a go hero – counting words more precisely
The tutorial where you counted the words in a le worked prey well, but you may have
noced that it has some details you can x or enhance.
You discarded a very small list of words, but there are much more that are quite usual
in English—preposions, pronouns, auxiliary verbs, and many more. So here is the challenge:
Get a list of commonly used words and save it in a le. Instead of excluding words from a
small list as you did with a Filter rows step, exclude the words that are in your common
words le.
Use a Stream lookup step.
Test the transformaon with the same le, and also with other les, and verify
that you get beer results with all these changes.
Basic Data Manipulaon
[ 112 ]
Summary
This chapter covered the simplest and most common ways of transforming data. Specically,
it covered how to:
Use dierent transformaon steps to calculate new elds
Use the Calculator and the Formula steps
Filter and sort data
Calculate stascs on groups of rows
Look up data
Aer learning basic manipulaon of data, you may now create more complex
transformaons, where the streams begin to split and merge. That is the core
subject of the next chapter.
4
Controlling the Flow of Data
In the previous chapter, you learned the basics of transforming data. Basically
you read data from some le, did some transformaon to the data, and sent
the data back to a dierent output. This is the simplest scenario. Think of
a dierent situaon. Suppose you collect results from a survey. You receive
several les with the data and those les have dierent formats. You have to
merge those les somehow and generate a unied view of the informaon.
You also want to put aside the rows of data whose content is irrelevant. Finally,
based on the rows that interest you, you want to create another le with some
stascs. This kind of requirement is very common. In this chapter you will
learn how to implement it with PDI.
Splitting streams
Unl now, you have been working with simple, straight ows of data. When you deal with
real problems, those simple ows are not enough. Many mes, the rows of your dataset
have to take dierent paths. This situaon is handled very easily, and you will learn how to
do it in this secon.
Controlling the Flow of Data
[ 114 ]
Time for action – browsing new PDI features by copying
a dataset
Before starng, let's introduce the Pentaho BI Plaorm Tracking site. At the tracking site you
can see the current Pentaho roadmap and browse their issue tracking system. The PDI page
for that site is http://jira.pentaho.com/browse/PDI.
In this exercise, you will export the list of proposed new features for PDI from the site, and
generate detailed and summarized les from that informaon.
1. Access the main Pentaho tracking site page: http://jira.pentaho.com.
2. In the main menu, click on FIND ISSUES.
3. On the le side, select the following lters:
Project: Pentaho Data Integraon {Kele}
Issue Type: New Feature
Status: Open
4. At the boom of the lter list, click View >>. A list of found issues will appear.
Download from Wow! eBook <www.wowebook.com>
Chapter 4
[ 115 ]
5. Above the list, select Current eld to export the list to an Excel le.
6. Save the le to the folder of your choice.
The Excel le exported from the JIRA site is a Microso Excel 97-
2003 Worksheet. PDI doesn't recognize this version of worksheets.
So, before proceeding, open the le with Excel or Calc and convert
it to Excel 97/2000/XP.
7. Create a transformaon.
8. Read the le by using an Excel Input step. Aer selecng the le, click on the Sheets
tab, and ll it as shown in the next screenshot so that it skips the header rows and
the rst column:
9. Click the Fields tab and ll the grid by clicking the Get elds from header
row... buon.
Controlling the Flow of Data
[ 116 ]
10. Click the Preview rows just to be sure that you are reading the le properly. You
should see all the contents of the Excel le except the three heading lines.
11. Click OK.
12. Add a Filter rows step to drop the rows where the Summary eld is null.
13. Aer the Filter rows step, add a Value Mapper step and ll it like here:
14. Aer the Value Mapper step, add a Sort rows step and order the rows by
priority_order (asc.), Summary (asc.).
Chapter 4
[ 117 ]
15. Aer that add an Excel Output step, and congure it to send the priority_order
and Summary elds to an Excel le named new_features.xls.
16. Drag a Group by step to the canvas.
17. Create a new hop from the Sort rows step to the Group by step.
18. A warning window appears asking you to decide whether you wish to Copy or
Distribute.
19. Click Copy to send the rows toward both output steps.
20. The hops leaving the Sort rows step change to show you the decision you made. So
far you have this:
21. Congure the Group by steps like shown:
22. Add a new Excel Output step to the canvas and create a hop from the Group by step
to this new step.
Controlling the Flow of Data
[ 118 ]
23. Congure the Excel Output step to send the Priority and Quantity elds to an
Excel le named new_features_summarized.xls.
24. Save the transformaon and run it.
25. Verify that both les, new_features.xls and new_features_summarized.xls,
have been created.
26. The rst le should look like this:
27. And the second le should look like this:
Chapter 4
[ 119 ]
What just happened?
Aer exporng an Excel le with the PDI new features from the JIRA site, you read the le and
created two Excel les—one with a list of the issues and the other with a summary of the list.
The rst steps of the transformaon are well known—read a le, lter null rows, map a eld,
and sort.
Note that the mapping creates a new eld to give an order to the Priority
eld so that the more severe issues are rst in the list, while the minor priories
remain at the end of the list.
You linked the Sort rows step to two dierent steps. This caused PDI to ask you what to
do with the rows leaving the step. By answering Copy, you told PDI to create a copy of
the dataset. Aer that, two idencal copies le the Sort rows step, each to a dierent
desnaon step.
From the moment you copied the dataset, those copies became independent, each following
its way. The rst copy was sent to a detailed Excel le. The other copy was used to create a
summary of the elds, which then was sent to another Excel le.
Copying rows
At any place in a transformaon, you may decide to split the main stream into two or more
streams. When you do so, you have to decide what to do with the data that leaves the last
step—copy or distribute.
To copy means that the whole dataset is copied to each of the desnaon steps. Once the
rows are sent to those steps, each follows its own way.
When you copy, the hops that leave the step from which you are copying change visually to
indicate the copy acon.
Controlling the Flow of Data
[ 120 ]
In the tutorial, you created two copies of the main dataset. You could have created more
than two, like in this example:
When you split the stream into two or more streams, you can do whatever you want with
each one as if they had never been the same. The transformaons you apply to any of those
output streams will not modify the data in the others.
You shouldn't assume a parcular order in the execuon of the output
streams of a step. All the output streams receive the rows in synch and
you don't have control over the order in which they are executed.
Have a go hero – recalculating statistics
Do you remember the exercise from Chapter 3 where you calculated some stascs? You
created two transformaons. One was to generate a le with students that failed. The other
was to create a le with some stascs such as average grade, number of students who
failed, and so.
Now you can do all that work in a single transformaon, reading the le once.
Distributing rows
As said, when you split a stream, you can copy or distribute the rows. You already saw that
copy is about creang copies of the whole dataset and sending each of them to each output
stream. To distribute means the rows of the dataset are distributed among the desnaon
steps. Let's see how it works through a modied exercise.
Chapter 4
[ 121 ]
Time for action – assigning tasks by distributing
Let's suppose you want to distribute the issues among three programmers so that each of
them implements a subset of the new features.
1. Select Transformaon | Copy transformaon to clipboard in the main menu.
Close the transformaon and select Transformaon | Paste transformaon from
clipboard. A new transformaon is created idencal to the one you copied. Change
the descripon and save the transformaon under a dierent name.
2. Now delete all the steps aer the Sort rows step.
3. Change the lter step to keep only the unassigned issues: Assignee eld equal to
the string Unassigned. The condion looks like the next screenshot:
4. From the Transform category of steps, drag an Add sequence step to the canvas and
create a hop from the Sort rows step to this new step.
5. Double-click the Add sequence step and replace the content of the Name of value
textbox with nr.
6. Drag three Excel Output steps to the canvas.
7. Link the Add sequence step to one of these steps.
Controlling the Flow of Data
[ 122 ]
Congure the Excel Output step to send the elds nr, Priority, and Summary to an Excel
le named f_costa.xls (the name of one of the programmers). The Fields tab should look
like this:
8. Create a hop from the Add sequence step to the second Excel Output step. When
asked to decide between Copy and Distribute, select Distribute.
9. Congure the step like before, but name the le as b_bouchard.xls
(the second programmer).
10. Create a hop from the Add sequence step to the last Excel Output step.
11. Congure this last step like before, but name the le as a_mercier.xls
(the last programmer).
12. The transformaon should look like the following:
Chapter 4
[ 123 ]
13. Run the transformaon and look at the execuon tab window to see
what happened:
14. To see which rows belong to which of the created les, open any of them. It should
look like this:
What just happened?
You distributed the issues among three programmers.
In the execuon window, you could see that 84 rows leave the Add sequence step, and 28
arrive to each of the Excel Output steps, that is, a third of the number of rows to each of
them. You veried that when you explored the Excel les.
In the transformaon, you added an Add sequence step that did nothing more than adding
a sequenal number to the rows. This sequence helps you recognize that one out of every
three rows were distributed to every le.
Controlling the Flow of Data
[ 124 ]
Here you saw a praccal example for the distribung opon. When you distribute, the
desnaon steps receive the rows in turn. For example, if you have three target steps, the
rst row goes to the rst target step, the second row goes to the second step, the third row
goes to the third step, the fourth row now goes to the rst step, and so on.
As you could see, when distribung, the hop leaving the step from which you distribute is
plain; it doesn't change its look and feel.
Despite this example showing clearly how the Distribute… method works, this is not how
you will regularly use this opon. The Distribute… opon is mainly used for performance
reasons. Throughout this book you will always use the Copy… opon. To avoid being asked
for the acon to take every me you create more that one hop leaving a step, you can set
the Copy… opon as default; you do this by opening the PDI opons window (Edit|Opons
… from the main menu) and unchecking the opon Show "copy or distribute" dialog?.
Remember that to see the change applied, you will have to restart Spoon.
Once you have changed this opon, the default method is copying rows. If you want to
distribute rows, you can change the acon by right-clicking the step from which you want
to copy or distribute, selecng Data Movement... in the contextual menu that appears, and
then selecng the desired opon.
Chapter 4
[ 125 ]
Pop quiz – data movement (copying and distributing)
Look at the following transformaons:
If you do a preview on the Steps named Preview, which of the following is true:
a. The number of rows you see in (a) is greater or equal than the number of rows you
see in (b)
b. The number of rows you see in (b) is greater or equal than the number of rows you
see in (a)
c. The dataset you see in (a) is exactly the same as you see in (b) no maer what data
you have in the Excel le.
You can create a transformaon and test each opon to check the results for yourself. To
be sure you understand correctly where and when the rows take one or other way, you can
preview every step in the transformaon, not just the last one.
Splitting the stream based on conditions
In the previous secon you learned to split the main stream of data into two or more
streams. The whole dataset was copied or distributed among the desnaon steps. Now
you will learn how to put condions so that the rows take one way or another depending
on the condions.
Controlling the Flow of Data
[ 126 ]
Time for action – assigning tasks by ltering priorities with the
Filter rows step
Following with the JIRA subject, let's do a more realisc distribuon of tasks among
programmers. Let's assign the serious task to our most experienced programmer,
and the remaining tasks to others.
1. Create a new transformaon.
2. Read the JIRA le and lter the unassigned tasks, just as you did in the
previous tutorial.
3. Add a Filter rows step and two Excel Output steps to the canvas, and link them to
the other steps as follows:
4. Congure one of the Excel Output steps to send the elds, Priority and Summary,
to an Excel le named b_bouchard.xls (the name of the senior programmer).
5. Congure the other Excel Output step to send the elds Priority and Summary to
an Excel le named new_features_to_develop.xls.
6. Double-click the Filter row step to edit it.
7. Enter the condion Priority = Critical OR Priority = Severe.
8. From the rst drop-down list, Send 'true' data to step, select the step that creates
the b_bouchard.xls Excel le.
9. From the other drop-down list, Send 'false' data to step, select the step that creates
the Excel new_features_to_develop.xls Excel le.
10. Click OK.
Chapter 4
[ 127 ]
11. The hops leaving the Filter rows step change to show which way a row will take,
depending on the result of the condion.
12. Save the transformaon.
13. Run the transformaon, and verify that the two Excel les were created.
14. The les should look like this:
Controlling the Flow of Data
[ 128 ]
What just happened?
You sent the list of PDI new features to two Excel les—one le with the crical issues and
the other le with the rest of the issues.
In the Filter row step, you put a condion to evaluate if the priority of a task was severe
or crical. For every row coming to the lter, the condion was evaluated. The rows that
had a severe or crical priority were sent toward the Excel Output step that creates the
b_bouchard.xls le. The rows with another priority were sent towards the other Excel
Output step, the one that creates the new_features_to_develop.xls le.
PDI steps for splitting the stream based on conditions
When you have to make a decision, and upon that decision split the stream in two, you can
use the Filter row step as you did in this last exercise. In this case, the Filter rows step acts as a
decision maker. It has a condion and two possible desnaons. For every row coming to the
lter, the step evaluates the condion. Then if the result of the condion is true, it decides
to send the row toward the step selected in the rst drop-down list of the conguraon
window—Send 'true' data to step.
If the result of the condion is false, it sends the row toward the step selected in the second
drop-down list of the conguraon window: Send 'false' data to step.
Somemes you have to make nested decisions; consider the next gure for example:
In the transformaon shown in the preceding diagram, the condions are as simple as tesng
if a eld is equal to a value. In situaons like this you have a simpler way for accomplishing
the same..
Chapter 4
[ 129 ]
Time for action – assigning tasks by ltering priorities with the
Switch/ Case step
Let's use a Switch/Case step to replace the nested Filter Rows steps shown in the
preceding diagram
1. Create a transformaon like the following:
2. You will nd the Switch/Case step in the Flow category of steps.
To save me, you can take the last transformaon you created
as the starng point.
Controlling the Flow of Data
[ 130 ]
3. Note that the hops arriving to the Excel Output steps look strange. They are doed
orange lines. This look and feel shows you that the target steps are unreachable. In
this case, it means that you sll have to congure the Switch/Case step. Double-click
it and ll it like here:
4. Save the transformaon and run it
5. Open the Excel les generated to see that the transformaon distributed the task among
the les based on the given condions.
What just happened?
In this tutorial you learned to use the Switch/Case step. This step routes rows of data to one
or more target steps based on the value encountered in a given eld.
In the Switch/Case step conguraon window, you told Kele where to send the row
depending on a condion. The condion to evaluate was the equality of the eld set in Field
name to switch and the value indicated in the grid. In this case, the eld name to switch
is Priority, and the values against which it will be compared are the dierent values for
priories: Severe, Crical, and so on. Depending on the values of the Priority eld, the rows
will be sent to any of the target steps. For example, the rows where Priority=Medium, will be
sent toward the target step New Features for Federica Costa.
Note that it is possible to specify the same target step more than once.
The Default target step represents the step where the rows that don't match any of the case
values are sent. In this example, the rows with a priority not present in the list will be sent to
the step New Features without priority.
Chapter 4
[ 131 ]
Have a go hero – listing languages and countries
Open the transformaon you created in the Finding out which language people speak
tutorial in Chapter 3. If you run the transformaon and check the content of the output le,
you'll noce that there are missing languages. Modify the transformaon so that it generates
two les—one with the rows where there is a language, that is, the rows for which the
lookup didn't fail, and another le with the list of countries not found in the countries.
xml le.
Pop quiz – splitting a stream
Connuing with the contestant exercise, suppose that the number of interpreters you will
hire depends on the number of people that speak each language:
Number of people that speaks the language Number of interpreters
Less than 3 1
Between 3 and 6 2
More that 6 3
You want to create a le with the languages with a single interpreter, another le with the
languages with two interpreters, and a nal le with the languages with three interpreters.
Which of the following would solve your situaon when it comes to spling the languages
into three output streams:
a. A Number range step followed by a Switch/Case step.
b. A Switch/Case step.
c. Both
In order to gure out the answer, create a transformaon and count the number
of people that speak each language. You'll have to use a Sort rows step followed
by a Group by step. Aer that, try to develop each alternave soluon and see
what happens.
Merging streams
You've just seen how the rows of a dataset can take dierent paths. Here you will learn the
opposite—how data coming from dierent places is merged into a single stream.
Controlling the Flow of Data
[ 132 ]
Time for action – gathering progress and merging all together
Suppose that you delivered the Excel les you generated in the Assigning tasks by ltering
priories tutorial earlier in the chapter. You gave the b_bouchard.xls to Benjamin
Bouchard, the senior programmer. You also gave the other Excel le to a project leader who
is going to assign the tasks to dierent programmers. Now they are giving you back the
worksheets, with a new column indicang the progress of the development. In the case of
the shared le, there is also a column with the name of the programmer who is working on
every issue. Your task is now to unify those sheets.
Here is what the Excel les look like:
1. Create a new transformaon.
2. Drag an Excel Input step to the canvas and read one of the les.
3. Add a Filter row step to keep only the rows where the progress is not null, that is,
the rows belonging to tasks whose development has been started.
4. Aer the lter, add a Sort rows step, and congure it to order the elds by
Progress, in descending order.
Chapter 4
[ 133 ]
5. Add another Excel Input step, read the other le, and lter and sort the rows just
like you did before. Your transformaon should look like this:
6. From the Transform category of steps, select the Add Constants step and drag it
onto the canvas.
7. Link the step to the stream that reads the B. Bouchard's le; edit the step and add a
new eld named Programmer, with type string and value Benjamin Bouchard.
8. Aer this step, add a Select values step and reorder the elds so that they remain
in a specic order Priority, Summary, Programmer, Progress—to resemble the
other stream.
9. Now, from the Transform category add an Add sequence step, name the new eld
ID, and link the step with the Select values step.
10. Create a hop from the Sort rows step of the other stream to the Add sequence step.
Your transformaon should look like the one shown next:
Controlling the Flow of Data
[ 134 ]
11. Select the Add sequence step and do a preview. You will see this:
What just happened?
You read two similar Excel les and merged them into one single dataset.
First of all, you read, ltered, and sorted the les as usual. Then you altered the stream
belonging to B. Bouchard, so it looked similar to the other. You added the eld Programmer,
and reordered the elds.
Aer that, you used an Add sequence step to create a single dataset containing the rows of
both streams, with the rows numbered.
PDI options for merging streams
You can create a union of two or more streams anywhere in your transformaon. To create a
union of two or more data streams, you can use any step. The step unies the data, takes the
incoming streams in any order, and then it completes its task in the same way as if the data
came from a single stream.
In the example, you used an Add sequence step as the step to join two streams. The step
gathered the rows from the two streams, and then proceeded to numerate the rows with
the sequence name ID.
Chapter 4
[ 135 ]
This is only one example of how you can mix streams together. As said, any step can be used
to unify two streams. Whichever the step, the most important thing you have to have in
mind is that you cannot mix rows that have a dierent layout. The rows have to have the
same lengths, the same data types, and the same elds in the same order.
Fortunately, there is a trap detector that provides warnings at design me if a step is
receiving mixed layouts.
You can try this out. Delete the Select values step. Create a hop from the Add constants step
to the Add sequence step. A warning message appears as shown next:
In this case, the third eld of the rst stream, Programmer (String), does not have the
same name or the same type as the third eld of the second stream, Progress (Number).
Note that PDI warns you but it doesn't prevent you from mixing row layouts
when creang the transformaon.
If you want Kele to prevent you from running transformaons with mixed row
layouts, you can check the opon Enable safe mode in the window that shows
up when you dispatch the transformaon. Have in mind that doing this will
cause a performance drop.
Controlling the Flow of Data
[ 136 ]
When you use an arbitrary step to unify, the rows remain in the same order as they
were in their original stream, but the streams are joined in any order. Take a look at the
example's preview. The rows of the Bouchard's stream as well as the rows of the other
stream remained sorted within its original group. However, whether the Bouchard's stream
appeared before or aer the rows of the other stream was just a maer of chance. You
didn't decide the order of the streams; PDI decided it for you. If you care about the order in
which the union is made, there are some steps that can help you. Here are the opons
you have:
If you want to ... You can do this ...
Append two or more streams, and
don't care about the order
Use any step. The selected step will take all the incoming
streams in any order, and then will proceed with its specic
task.
Append two streams in a given order Use the Append streams step from the Flow category. It
helps to decide which stream goes rst.
Merge two streams ordered by one or
more elds
Use a Sorted Merge step from the Joins category. This
step allows you to decide on which eld(s) to order the
incoming rows before sending them to the desnaon
step(s). The input streams must be sorted on that eld(s).
Merge two streams keeping the newest
when there are duplicates
Use a Merge Rows (di) step from the Joins category.
You tell PDI the key elds, that is, the elds that say that
a row is the same in both streams. You also give PDI the
elds to compare when the row is found in both streams.
PDI tries to match rows of both streams, based on the key
elds. Then it creates a eld that will act as a ag, and lls
it as follows:
If a row was only found in the rst stream, the
ag is set to deleted.
If a row was only found in the second stream, the
ag is set to new.
If the row was found in both streams, and the
elds to compare are the same, the ag is set to
identical.
If the row was found in both streams, and at least
one of the elds to compare is dierent, the ag
is set to changed.
Let's try one of these opons.
Chapter 4
[ 137 ]
Time for action – giving priority to Bouchard by using
Append Stream
Suppose you want the Bouchard's row before the other rows. You can modify the
transformaon as follows:
1. From the Flow category of steps, drag an Append Streams step to the canvas.
Rearrange the steps and hops so the transformaon looks like this:
2. Edit the Append streams step and select as the Head hop the one belonging to the
Bouchard's rows, and as the Tail hop the other. Doing this, you indicate toPDI how it
has to order the streams.
3. Do a preview on the Add sequence step. You should see this:
Controlling the Flow of Data
[ 138 ]
What just happened?
You changed the transformaon to give priority to Bouchard's issues.
You made it by using the Append Streams step. By telling that the head hop was the one coming
from the Bouchard's le, you got the expected order—rst the rows with the tasks assigned
to Bouchard, sorted by progress descending, and then the rows with the tasks assigned to
other programmers, also sorted by progress descending.
Whether you use arbitrary steps or some of the special steps menoned
here to merge streams, don't forget to verify the layouts of the streams
you are merging. Pay aenon to the warnings of the trap detector and
avoid mixing row layouts.
Have a go hero – sorting and merging all tasks
Modify the previous exercise so that the nal output is sorted by priority. Try two possible
soluons:
Sort the input streams on their own and then use a Sorted Merge step.
Merge the stream with a Dummy step and then sort.
Which one do you think would give the best performance?
Refer to the Sort rows step issues in Chapter 3.
In which circumstances would you use the other opon?
Have a go hero – trying to nd missing countries
As you saw in the countries exercises, there are missing countries in the countries.xml
le. In fact, the countries are there, but with dierent names. For example, Russia in the
contestant le is Russian Federation in the XML le. Modify the transformaon that
looks for the language. Split the stream in two—one for the rows where a language was
found and the other for the rows where no language was found. For this last stream, use a
Value Mapper step to rename the countries you idened as wrong, that is, rename Russia
as Russian Federation. Then look again for a language now with the new name. Finally,
merge the two streams and create the output le with the result.
Chapter 4
[ 139 ]
Summary
In this chapter, you learned dierent opons that PDI oers to combine or split ows of data.
The chapter covered the following:
Copying and distribung rows
Spling streams based on condions
Merging independent streams in dierent ways
With the concepts you learned in the inial chapters, the range of tasks you are able to
perform is already broad. In the next chapter, you will learn how to insert JavaScript code in
your transformaons not only as an alternave to perform some of those tasks, but also as
a way to accomplish other tasks that are complicated or even unthinkable to carry out with
regular PDI steps.
5
Transforming Your Data
with JavaScript Code and the
JavaScript Step
Whichever transformaon you need to do on your data, you have a big chance
of nding that PDI steps are able to do the job. Despite that, it may happen that
there are not proper steps that serve your requirements, or that an apparently
minor transformaon consumes a lot of steps linked in a very confusing
arrangement dicult to test or understand. Pung colorful icons here and
there is funny and praccal, but there are some situaons like the ones
described above where you inevitably will have to code. This chapter explains
how to do it with JavaScript and the special JavaScript step.
In this chapter you will learn how to:
Insert and test JavaScript code in your transformaons
Disnguish situaons where coding is the best opon, from those where there are
beer alternaves
Doing simple tasks with the JavaScript step
One of the tradional steps inside PDI is the JavaScript step that allows you to code inside
PDI. In this secon you will learn how to use it for doing simple tasks.
Transforming Your Data with JavaScript Code and the JavaScript Step
[ 142 ]
Time for action – calculating scores with JavaScript
The Internaonal Musical Contest menoned in Chapter 4 has already taken place. Each duet
performed twice. The rst me technical skills were evaluated, while in the second, the focus
was on arsc performance.
Each performance was assessed by a panel of ve judges who awarded a mark out of a
possible 10.
The following is the detailed list of scores:
Note that the elds don't t in the screen, so the lines are wrapped and doed lines are
added for you to disnguish each line.
Now you have to calculate, for each evaluated skill, the overall score as well as an
average score.
1. Download the sample le from the Packt website.
2. Create a transformaon and drag a Fixed le input step to the canvas to read
the le.
Chapter 5
[ 143 ]
3. Fill the conguraon window as follows:
4. Press the Get Fields buon. A window appears to help you dene the columns.
5. Click between the elds to add markers that dene the limits. The window will look
like this:
6. Click on Next >. A new window appears for you to congure the elds.
7. Click on the rst eld at the le of the window and change the name to Performance.
Verify that the type is set to String.
Transforming Your Data with JavaScript Code and the JavaScript Step
[ 144 ]
8. To the right, you will see a preview of the data for the eld.
9. Select each eld to the le of the window, change the names, and adjust the types. Set
ID