Course Guide DPV

Data Science [201400174]
Course year 2018/2019, Quarter 2A
DATE
February 4, 2019
EXCERPT

Data Preparation and Visualization [DPV]

TEACHERS
Maurice van Keulen
Christin Seifert
Mannes Poel
Karin Groothuis-Oudshoorn
Faiza Bukhsh
Miha Lavric
PROJECT OWNERS
Faiza Bukhsh
Karin Groothuis-Oudshoorn
Maurice van Keulen
Mannes Poel
Michel van Putten
Luc Wismans

COURSE COORDINATOR
Christin Seifert
Maurice van Keulen


Topic 1 – Data Preparation and Visualization [DPV]

1.1 Introduction
Data Warehousing, OLAP and Data Visualization are in essence technologies developed for Business Intelligence. They are, however, also effective for data science. The topic will teach (a) data warehousing
techniques for extracting and transforming data (ETL), (b) modeling data for analytic purposes using the
multidimensional modeling approach of OLAP, and (c) data visualisation techniques.
This topic comes in two flavours: you can do it in a tool-based fashion as well as in a programming
language-based fashion (using the programming language R).
The topic uses open source or otherwise (temporarily) free-to-use tools to accomplish these tasks. The methods and workings of these tools are representative of what is used in practice, in the fields of both Business Intelligence and Data Science.

1.1.1 Global description of the practicum and project
The process of obtaining visualisations from raw data follows four steps:
1. Determine the business questions.
2. Design a data warehouse,
i.e., design a database schema that can answer the business questions based on the multidimensional
modelling method. You can in principle use any conceptual data modelling tool for this, but we do
this by hand (i.e., with pen and paper). By setting up a database (we use PostgreSQL Server as
DBMS) with this schema, you have obtained the structure of the data warehouse suitable for analysis
. . . but no data yet.
3. Fill the data warehouse,
i.e., extract relevant data from the raw sources, transform it, clean it, and store it in the database
based on the principles of ETL. You could in principle do this with a self-written program and SQL
statements, but we advocate a kind of visual programming for ETL similar to how many industrial
tools work; the tool is called Pentaho ETL Community Edition also known as Kettle or Spoon. After
this step, one has obtained a filled data warehouse.


4. Visualize the data,
i.e., use a visualisation tool (we use Tableau) that connects to the data warehouse and presents the data
in graphs that effectively answer the business questions.
The practicum assignments follow a different order: one first practices with steps 3 and 4, and then with step 2. The assignments of the two flavours, tool-based and R-based, are essentially the same, but you simply carry them out in a different manner. Therefore, either make the assignments in Section 1.3 or the ones in Section 1.4, depending on the flavour you have chosen. Information on the data sets and database which are used for both flavours can be found in Section 1.2.
If you decide to apply this technology in a project, you are advised to follow the four steps for the given
data set. You can distinguish yourself by formulating original business questions, by paying extra attention
to visualisation or cleaning, or by constructing complex transformations that dig deeper in the data.

1.1.2 Study material and tools
• Data Warehousing Book: "Multidimensional Databases and Data Warehousing", Christian S. Jensen, Torben Bach Pedersen, and Christian Thomsen
• Data Warehouse: PostgreSQL Server (pre-installed)
• Database administration: PhpPgAdmin (pre-installed)
• ETL / data transformation and cleaning:
– Tool-based flavour: Pentaho Data Integrator (Kettle / Spoon)
– R-based flavour: R and R-studio
• Data visualization: Tableau Public

1.1.3 Deliverables and obligatory items
Topic teachers: Maurice van Keulen (general and tool-based part) and Karin Groothuis-Oudshoorn (R-based part).
The practicum assignments specify the deliverables as separate files. Tips:
• Whatever you have written with pen on paper, simply take a picture of it and submit that.
• Kettle/Spoon can export an ETL workflow as .jpg.
• For R, just submit the course files with the program code.
• Tableau can also export a dashboard visualisation as .jpg, but you may also make screenshots, of course.

1.2 Data set and database
The topic comes in two flavours: tool-based and R-based. The assignments of the two flavours tool-based
and R-based are essentially the same, but you simply carry them out in a different manner. Both use the
same data sets and store the cubes on the same database server. This section gives you information about these.
For the practicum assignments, we use CSV-files (BI Raw Data.csv and SuperSales.zip) containing orders
to a warehouse in the US. The files can be found in the Canvas site. The attributes in BI Raw Data.csv have
the following meaning (the ones in SuperSales.zip are a straightforward extension):

Order ID: Unique identifier for the order. Since one order can have more than one order line, i.e., one order can contain purchases for more than one product, there are multiple rows with the same Order ID.
Order Date Year: Year of the order.
Order Date Month: Month of the order.
Order Date Day: Day of the order.
Customer Name: Name of the customer who placed the order.
Customer Country: The customer's country.
Product Name: The name of the product being ordered.
Product Category: The category of the product.
Order Price Total: The total amount of the order.
Product Order Unit Price: The price of one unit of the product.
Product Order Quantity: The number of product units ordered.
Product Order Price Total: The total price of the purchase of all units of that product, i.e., the multiplication of the previous two values.

You will use a DBMS (database management system) to store and share your cubes. A DBMS is software
that runs in the background and manages tables with data (the cubes you create are essentially also stored as
tables with data). All other tools (including Kettle and Tableau) connect to this software to get access to and
manipulate the data. A DBMS has already been pre-installed for you: PostgreSQL 9.5.11 running on server
castle.ewi.utwente.nl. Each group has their own database on this server. You will receive credentials
(username and password) for your database at the introductory lecture. If you have not participated in this
lecture or have otherwise not received your credentials, ask the teaching assistant or contact Maurice van
Keulen (either on Slack or by email m.vankeulen@utwente.nl). For your convenience, your database
already has two schemas: ass2 and ass3 for assignments 2 and 3.
The same server also runs a web-based database administration tool called PhpPgAdmin. You can access it with this link: http://castle.ewi.utwente.nl/phppgadmin. You can easily create, inspect and drop the tables in your database with this tool. (You can also use any other PostgreSQL client such as PgAdmin; host = castle.ewi.utwente.nl, port = 5432.)

1.3 Description of the practical assignments - Tool-based
NB: The assignments in this section pertain to the Tool-based flavour. If you have experience with programming or are interested in learning how to program, you can also decide to choose the R-based flavour. These are essentially the same assignments, but carried out with the programming language R. If you choose the R-based flavour, make the assignments of Section 1.4 instead.

1.3.1 Tools
The tools that you would need to install for the tool-based assignments are:
• Kettle Data Integration (ETL) community edition
• Tableau BI Visualiser
See Section 1.3.6 for solutions to common problems including installation problems.
Installation Pentaho Kettle
• Download Pentaho Community (version 6.0 or higher) from http://community.pentaho.com/ →
“Data integration”. NB: Install the Community Edition, not a trial of the Business version!
• Extract ZIP-file in any directory D
• Start “spoon.sh” (Mac/Unix) or “spoon.bat” (Windows)
• In “View” tab, click on database connections, create new. Connection type: PostgreSQL / Access:
Native (JDBC) / Host Name: castle.ewi.utwente.nl / Database Name: same as user name / Port
Number: 5432 / User Name: your user name / Password: your password.
The installation instructions above do not produce a working tool for everyone. Please see Section 1.3.6 for
some common problems and how to solve them.
Installation Tableau
• Get an academic license for a year at http://www.tableau.com/academic/students or a trial
version at http://www.tableau.com/products/desktop (version 9.2 or higher) and install it as
instructed.
• Connect . . . to a server . . . PostgreSQL. You are likely to get an error message complaining that a driver is missing. Simply download the driver(s) from the specified place and install them. Then try again to connect.

1.3.2 Assignment 1: Facts and dimensions
This is an on-paper assignment. You don't need any tools for it.

Events
dayofweek   year   nrofevents   totalparticipants   location
Friday      2016   2            15000               Euclideslaan
Friday      2016   1            5000                Domplein
Tuesday     2016   5            25000               Domplein
Wednesday   2016   5            30000               Domplein
Tuesday     2016   6            25000               Muntstraat
...

Figure 1.1: Example of a one-table cube containing data on events happening in the city of Utrecht.

1.1 See Figure 1.1 which contains data on events happening in the city of Utrecht. We illustrate the
meaning of the data by explaining what the first row means: there were in total 2 events organised in
Euclideslaan on a Friday in 2016 which both together drew 15000 participants.
(a) Which attributes are the facts; which attributes are the dimensions?
(b) Draw a conceptual star schema for this cube.
1.2 See Figure 1.2 which contains data on the numbers of registered students in The Netherlands. We illustrate the meaning of the data by explaining what the first row of table “Registrations” means: In 2015, there
were 13 female students registered for the bachelor study “Technical Computer Science” at the University
of Twente which is located in the city of Enschede in the province of Overijssel.
(a) Which attributes are the facts; which attributes are the dimensions?
(b) Draw a conceptual star schema for this cube.
Institutes
instID   institute                           cityID
1        University of Twente                153
2        Technical University Eindhoven      772
3        Leiden University                   546
4        Maastricht University               935
5        Transnational University Limburg    935
...

Cities
cityID   city         province
153      Enschede     Overijssel
546      Leiden       Zuid-Holland
772      Eindhoven    Noord-Brabant
935      Maastricht   Limburg
...

Registrations
instID   study                        Phase      year   sex      nrofregistrations
1        Technical Computer Science   Bachelor   2015   Female   13
1        Technical Computer Science   Bachelor   2015   Male     214
1        Technical Computer Science   Bachelor   2016   Female   22
1        Technical Computer Science   Bachelor   2016   Male     270
1        Computer Science             Master     2016   Female   23
1        Computer Science             Master     2016   Male     126
2        Technical Computer Science   Bachelor   2015   Female   19
2        Technical Computer Science   Bachelor   2015   Male     216
2        Technical Computer Science   Bachelor   2016   Female   30
2        Technical Computer Science   Bachelor   2016   Male     280
...

Figure 1.2: Example of a multi-table cube containing data on the numbers of registered students in The Netherlands. (Both examples are simplified versions of actual data from http://data.overheid.nl.)

Figure 1.3: Star schema for Mr Bezos' warehouse. Note: 'orderdate', 'customerid' and 'productid' are together the primary key.

1.3.3 Assignment 2: Re-create the demo from the lecture
Use the data spreadsheet BI Raw Data.csv.
A large warehouse located in the US has many customers for its different products from around the world. The warehouse manager Mr Jack Bezos has many large customers from all over the world ordering different products from his warehouse. We follow the method of Section 1.1.1 where all steps are more or less already given (and demoed at the lecture).

Step 1: Business questions

Mr. Bezos wants to know answers for the following questions
• Who are his top-5 most valued customers?
• What are his top-5 most important products?

Step 2a: Multidimensional model

Notice that Mr. Bezos wants to know something about customers and products. These are dimensions. Time
usually also is a dimension if you are interested in trends, so we also include the dimension ‘orderdate’.
What is it he wants to know? It is ‘most valued’ and ‘most important’, but what does this mean concretely?
How can we measure value and importance? Obviously in terms of money: the most valued customer is the
one who bought for the most money, and the most important product is the one which was sold for the most
money. Therefore, our fact is ‘amount’ (of money), i.e., ‘sales’.
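To make this concrete: once the cube of Step 2b below exists, the first business question boils down to an aggregation of the fact over the customer dimension. A minimal SQL sketch is given below, for illustration only; it assumes the tables of Step 2b in schema ass2, and Tableau will effectively do the same aggregation for you in Step 4.
-- Top-5 most valued customers = customers with the highest total sales
SELECT c.name, SUM(s.sales) AS totalsales
FROM ass2.sales s
JOIN ass2.customer c ON c.customerid = s.customerid
GROUP BY c.name
ORDER BY totalsales DESC
LIMIT 5;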


Figure 1.4: Kettle/Spoon ETL-flow for Mr Bezos’ warehouse

Figure 1.5: Alternative Kettle/Spoon ETL-flow for Mr. Bezos’ warehouse

Step 2b: Create tables in your database

We create a database schema for this multidimensional model given in Figure 1.3. There is already an empty
schema in your PostgreSQL database with the name “ass2” intended to hold the tables for this assignment.
We use the web-based database management tool PhpPgAdmin to create the tables and attributes in that
schema. Open your browser at http://castle.ewi.utwente.nl/phppgadmin to access the tool. Click
on “PostgreSQL” under “Servers” to login and expand the panel on the left. You can create tables by
navigating to “Schemas” → “ass2” → “Tables”.
The database consists of three tables (note: although usually a good thing, we do not create any foreign key declarations, because they would hinder you in re-running your data transformations):
• 'Customer' with attributes customerid: integer (primary key, NOT NULL), name: character varying(100), country: character varying(100)
• 'Product' with attributes productid: integer (primary key, NOT NULL), name: character varying(100), category: character varying(100)
• 'Sales' with attributes orderdate: date (primary key, NOT NULL), customerid: integer (primary key, NOT NULL), productid: integer (primary key, NOT NULL), sales: double precision
Note: The orderid is not a key! A customer can order more than one product in one order which results in
several rows for one order. Furthermore, in its purest form, a cube does not store information on individual
transactions or cases, but only aggregated data for each combination of dimension values. In the case of Mr
Bezos, the cube does not store individual orders, but the sales (fact) for all combinations of the dimensions
orderdate, customer, and product. In other words, the orderdate, customerid, and productid together are the
primary key!
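If you prefer typing SQL over clicking through PhpPgAdmin, a minimal sketch of equivalent CREATE TABLE statements is given below (the types and names follow the list above; the composite primary key on the fact table reflects the note above):
-- Dimension tables
CREATE TABLE ass2.customer (
  customerid integer NOT NULL PRIMARY KEY,
  name       character varying(100),
  country    character varying(100)
);
CREATE TABLE ass2.product (
  productid integer NOT NULL PRIMARY KEY,
  name      character varying(100),
  category  character varying(100)
);
-- Fact table: one row per combination of dimension values
CREATE TABLE ass2.sales (
  orderdate  date NOT NULL,
  customerid integer NOT NULL,
  productid  integer NOT NULL,
  sales      double precision,
  PRIMARY KEY (orderdate, customerid, productid)
);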

Step 3: ETL — Fill the database

We use the Pentaho Data Integration tool called Kettle to extract the data from the .csv file, to transform it
into three output tables (one fact table and two dimension tables), and to load it into the database we just
created.
Create a Kettle ETL flow similar to the one from Figure 1.4 and execute it. If something goes wrong, you
can inspect the Tab “Logging” for error messages. Use the Tab “Preview data” to inspect the intermediary
results of each step.
Some advice on ETL construction method:
• Small do-test steps
Do: Add only one or two small bits, then execute and verify the result, before continuing.
Do not: Add many steps and then don’t know where the mistake is when you receive an error.
• Read the error message carefully
It may contain a lot of gibberish you don’t understand, but part of it may provide clues to what is
wrong.
• GIYF: Google Is Your Friend
You may think Googling is not academic, but the internet is full of information on what may have
caused certain errors and what you can do to fix them.
• Verify
Check your table contents with PhpPgAdmin (Browse) to verify that the data sent by the "Table output" step really arrived properly in the tables of your database.
Kettle supports more powerful components than the ones of Figure 1.4, which can perform frequently used patterns of operations in one go. To get a better idea of such components, also create a Kettle ETL flow similar to the one in Figure 1.5. It does exactly the same thing, but it uses the special "Combination Lookup / Update" component instead of more basic components such as "Sort rows", "Unique rows", "Get Value from Sequence", and "Stream Value Lookup". Try it out as well and inspect the intermediary results of each step using "Preview data".
Step 4: Visualize

Now that we have our data ready in a form suitable for analysis, it is time to address the two business
problems raised in the beginning. We create a dashboard with metrics to measure the required KPIs. We
use the visualization tool Tableau for this.
Connect it to the PostgreSQL database. For this you have to choose PostgreSQL and then you get a
pop-up menu where you can select the server (’castle.ewi.utwente.nl’), the database (’name of your
database’), username and password. If all is well, you should see the three tables you created. Drag them
to the top area, starting with the fact table "Sales", then the other two. Notice that Tableau automatically recognizes the relationships between the tables: a benefit of a database schema conforming to a multidimensional model. Click "Update now" to read the data into Tableau (or use "Update automatically").
To start our dashboard, click on "Sheet 1", drag Customer.Name to the Y-axis and Sales.Sales to the X-axis, and then click "Sort name ascending by sales". You now have a visualization that answers the first business question about the most valued customer. Make another one for product, answering the question about the most important product.
Deliverable: Export the dashboard visualizations as JPG-files as proof that you successfully re-created the
demo of the topic lecture.

1.3.4 Assignment 3: Do it yourself
Use the spreadsheets in SuperSales.zip
We follow the method of Section 1.1.1 again, but now only steps 1 and 2 are given; the assignment is to do
steps 3 and 4 yourself.
Step 1: Business questions

For this new warehouse in Canada, Mr. Bezos wants to know answers for the following questions
• Which products/product categories made the most loss?
• Which products/product categories were shipped really late (more than 2 days)?
• Which products/product categories were returned the most?


Figure 1.6: Star schema for SuperSales

Step 2: Multidimensional model and database structure

Re-create the star schema of Figure 1.6 for Mr. Bezos to answer his business questions. We require that
• The dimension “ReturnStatus” has a separate table containing two rows representing the only two
values “Returned” and “NotReturned”.
• The dimension “Late” is an inlined dimension representing the only two values “Late” and “NotLate”.
Step 3: ETL — Fill the database

The main goal of this assignment is that you do this step yourself.
Note: There is an important complication in this case. Notice that the star schema models a proper cube in
the pure sense of multidimensional modeling. For each combination of dimension values, there should be
exactly one row in the fact table. Suppose that the same customer orders the same product twice on the same
day and both products are not late and not returned. In other words, all dimensions are the same for these
two orders. Therefore, we need to combine the facts of these two orders, so that we end up with exactly one
row for this particular combination of dimension values.
As you know from assignment 2, the suggested way of accomplishing this is by using the Kettle component "Group by". Note that 'combining facts' may mean different things for different facts: sales, profit, orderquantity and shippingcost should be summed, but unitprice should not.
Tips: There is more than one CSV-file for which the information needs to be combined. A handy component to accomplish such a combination is "MergeJoin". The rows of two streams are joined if they have the same value(s) for one or more key attributes (e.g., "OrderID"). Find out what 'join type' means and which one you need here.
For the ‘Late’ dimension you need to do some calculations. Handy components for doing calculations, i.e.,
adding a new attribute with the result of a calculation, are “Calculator” and “Formula”.


Step 4: Visualization

Realize a dashboard that answers these business questions.
Deliverables: Submit pictures of the multi-dimensional model / database schema, the Kettle flow and the
dashboard.

1.3.5 Assignment 4: multidimensional modeling
Case “Mobile app beta tester service”
This is an on-paper assignment. You don’t need any tools for it.
The company Pear sells smartphones. For downloading and updating mobile applications they have the
“Pear app store”. Pear management has a new business idea: a beta tester service for game developers. The
purpose of beta testing is mainly for getting feedback on game play: do you become bored too quickly, is
it challenging enough, etc. The purpose is not so much finding bugs. Good gamers can offer themselves as
beta tester and developers can hire them. Developers pay Pear for a subscription plus a fee for hiring a beta
tester of which a small percentage goes to Pear.
A subscription includes support for developers in finding the “best” beta testers for their particular application. Pear needs to realize a data warehouse for this purpose with all gaming data of the beta testers.
They have in their databases timestamps of start of game app, switch to other app, and relevant events (turn,
next level, completion of game, etc.) as well as scores of completed games, and data on the apps: name,
solitary/multiplayer, name of developer, a fine-grained category, etc.
(You may assume more data to be available if it is reasonable that a company such as Pear would have it; make these assumptions explicit.)
A beta tester receives a small amount of money for joining, in exchange for allowing Pear to disclose their data to developers. A second purpose of the data warehouse is that during a beta test, Pear provides the developer with information on how much time the beta tester has devoted to beta testing the game.
(a) To be able to determine the right fact(s) for the star schema, an important question is "What is good?", i.e., "What makes a particular gamer a good beta tester for a particular kind of game?" More concretely, how can you determine a score that quantifies this 'goodness' or 'suitability'? It obviously needs to be calculated from the available data.
Propose a formula for “goodness score” and explain why a high value is an indicator for a good beta
tester.
(b) What is the business question, or what are the business questions in this case? Formulate them as
accurately as possible.
(c) Give a star schema. Explain your design by describing the most important design choices and considerations.
Deliverables: A document or scan of written answers to the questions including the picture of the star
schema.

1.3.6 Trouble shooting
Running "spoon.sh" or "spoon.bat" doesn't work
Run "spoonDebug" instead. It first asks a few questions: answer Y, Y, and N. Then it will show an error message.
• If the error message refers to not enough memory, then edit the “spoon.sh” or “spoon.bat” and change
the “-Xmx2048m” into “-Xmx1024m”.
• If the error message refers to “No path found” or “unsupported major.minor number”, then Java is
not installed or a too old version of Java is installed. First download and install the latest version of
"Java SE Runtime Environment" from Oracle's website (for Windows choose x64 and exe; for Mac OSX choose dmg). If you still get a "No path found" on Windows, then Pentaho is unable to find your Java installation. Do the following:
1. Right-click This pc, select properties
2. Click Advanced system settings
3. Click Environment Variables
4. Click the upper New button
5. Fill in the variable name with "JAVA_HOME"
6. Click “Browse Directory”
7. Go to Program files and find the java “jre...” directory, select this and click ok
8. Click ok and launch spoon.bat

Null pointer exception when trying to draw a line between two components
This is a known bug of Pentaho Kettle that occurs sporadically but consistently. Restarting and trying again doesn't work. A workaround that does seem to work, however, is drawing the line (called a 'hop' in Kettle) the other way around, i.e., from the target to the source.

The attribute names from the database are not correctly displayed in the dialogue window for the Table Output component
This is also a known bug that sometimes happens. Kettle does not seem to have loaded the table and attribute names from the database, so the "Table Output" component displays the attribute names from the 'flow' instead. If this occurs, simply type them in yourself (exactly); you do not need to select the names from the drop-down menu.

Error "Duplicate entry 'some number' for key 'PRIMARY'"
This means that Kettle is trying to insert a row in the database with a value for the primary key that already exists. This typically occurs in two different situations:
• Your primary key is actually not a primary key, because it is not unique. For example, the orderid
is not unique, because a customer may buy several products in the same order, which will produce
two rows with the same orderid. In this case, change your schema such that another attribute or
combination of attributes is the primary key (in our example, the combination orderid and productid).
• If you run Kettle a second time, the data of the previous run is still in the database. This second run
will try to re-insert the same tuples, which the DBMS refuses because it will produce different rows
with the same value for the primary key. You could empty the database every time, but there is an
easier solution. The ‘table output’ component of Kettle has an option “Truncate table”. Set it, and it
will empty (truncate) the table each time the flow is run.

Error "Field 'a' doesn't have a default value"
Every attribute of a table that has "NOT NULL" switched on must have a value for each row. A DBMS also has a facility to define a 'default value' that is filled in every time you try to store a NULL in the attribute.
The error usually occurs when you forget to specify the attribute 'a' in the Table Output component. Forgetting to specify it means that Kettle does not store a value for that attribute, hence it stores NULL, hence the error message. Go to the "Database fields" tab of the Table Output component and make sure all attributes are specified.

"binary string" vs. "normal string"
Sometimes you may encounter an error message complaining about a "data type error" that also mentions "String". There are two types of string in Spoon: binary string and normal string. If you import data from a CSV file using the "CSV file input" component, the string attributes are actually typed as binary string. Certain data type conversions or filters require string attributes to be normal string. So, if you see such an error message for a certain attribute, you can try to first convert the attribute to a normal string. You can use the "Select values" component for this: it has a "Meta-data" tab where you can set "Binary to Normal?" to "Y".


Conversion errors
Sometimes you may see "Conversion error: null" as a value in a column when you inspect an intermediary result. When reading data using the "CSV file input" component, Spoon will attempt to postpone the conversion of values from string to other data types. So, if there is a column with a date in it that is not really used, the data will never be converted to a date data type. But as soon as Spoon needs to compare values or has to perform an action that requires the values to be non-string data values, it can no longer postpone and will convert the data values. This is called "lazy conversion".
If the conversion fails for some reason, the error will occur at the step that required the conversion, while the actual problem is with "CSV file input". There are also some known bugs that occur due to lazy conversion. The advice is: switch off lazy conversion in this component.

1.4 Description of the practical assignments - R-based

1.4.1 Tools
The tools that you would need to install for the R-based assignments are:
• R and Rstudio
• Tableau BI Visualiser

Installation R and RStudio
• Install the latest version of R from https://cran.r-project.org.
• Install the latest version of RStudio (desktop version, free license) from http://www.rstudio.com/.
• Open RStudio.
• Go to File > New File > R Script: you will see the file in the upper left part of RStudio. In this file you can type syntax and run it later. Do not forget to save it once in a while.
• Go to the lower right pane and click on 'Install'. Now you can install the libraries DBI, RPostgreSQL, readr, dplyr and lubridate.
• After installing these packages you need to activate them, e.g. by typing the following piece of R script, selecting it all and clicking on 'Run':
library(DBI)
library(RPostgreSQL)
library(readr)
library(dplyr)
library(lubridate)
• Instead of installing packages via the menu as described above, you could also have used (e.g.): install.packages("dplyr").
• By typing help() with the name of the function you want the help file for (e.g. help(select)), you will see the documentation of that function in the lower right corner of RStudio in the Help tab.
• It is advised to work in RStudio with projects. So first make a new directory on your laptop with the name 'datascience' (of course you can choose another name) and a subdirectory with the name 'data'. Then in RStudio go to File > New Project, select Existing Directory and navigate to the just-created directory 'datascience'. If you now open a new script file it will automatically be saved in that directory. Moreover, if you put the data files (e.g. BI Raw Data.csv) in the data subdirectory, you only need to specify 'data/filename' when you need to import some data.
• For more background on R we refer to the book "R for Data Science", H. Wickham (see http://r4ds.had.co.nz/).

Installation Tableau
Please follow the instructions for "Installation Tableau" from the 'Tool-based' section (see Section 1.3.1).


1.4.2 Assignment 1: Facts and dimensions
This assignment is the same for Tool-based and R-based. See Section 1.3.2 for the actual assignment.

1.4.3 Assignment 2: Re-create the demo from the lecture
Use the data spreadsheet BI Raw Data.csv.
A large warehouse located in the US has many customers for its different products from around the world. The warehouse manager Mr Jack Bezos has many large customers from all over the world ordering different products from his warehouse. We follow the method of Section 1.1.1 where all steps are more or less already
given (and demoed at the lecture).

Step 1: Business questions

Mr. Bezos wants to know answers for the following questions
• Who are his top-5 most valued customers?
• What are his top-5 most important products?

Step 2a: Multidimensional model

Notice that Mr. Bezos wants to know something about customers and products. These are dimensions. Time
usually also is a dimension if you are interested in trends, so we also include the dimension ‘orderdate’.
What is it he wants to know? It is ‘most valued’ and ‘most important’, but what does this mean concretely?
How can we measure value and importance? Obviously in terms of money: the most valued customer is the
one who bought for the most money, and the most important product is the one which was sold for the most
money. Therefore, our fact is ‘amount’ (of money), i.e., ‘sales’.

Step 2b: Create tables in your database

We create a database schema for this multidimensional model given in Figure 1.3. There is already an empty
schema in your PostgreSQL database with the name “ass2” intended to hold the tables for this assignment.
We use the web-based database management tool PhpPgAdmin to create the tables and attributes in that
schema. Open your browser at http://castle.ewi.utwente.nl/phppgadmin to access the tool. Click
on “PostgreSQL” under “Servers” to login and expand the panel on the left. You can create tables by
navigating to “Schemas” → “ass2” → “Tables”.
The database consists of three tables (note: although usually a good thing, we do not create any foreign key declarations, because they would hinder you in re-running your data transformations):
• 'Customer' with attributes customerid: integer (primary key, NOT NULL), name: character varying(100), country: character varying(100)
• 'Product' with attributes productid: integer (primary key, NOT NULL), name: character varying(100), category: character varying(100)
• 'Sales' with attributes orderdate: date (primary key, NOT NULL), customerid: integer (primary key, NOT NULL), productid: integer (primary key, NOT NULL), sales: double precision
Note: The orderid is not a key! A customer can order more than one product in one order which results in
several rows for one order. Furthermore, in its purest form, a cube does not store information on individual
transactions or cases, but only aggregated data for each combination of dimension values. In the case of Mr
Bezos, the cube does not store individual orders, but the sales (fact) for all combinations of the dimensions
orderdate, customer, and product. In other words, the orderdate, customerid, and productid together are the
primary key!

Step 3: ETL — Fill the database

We use R/RStudio to extract the data from the .csv file, to transform it into three tables (one fact table 'sales' and two dimension tables 'product' and 'customer'), and to load it into the database. First open a new script file to put your R code in. A useful package for working with tables is dplyr. With the package readr you can import data into R, e.g. comma-separated files. So start with installing and loading both packages in R/RStudio:
library(readr)
library(dplyr)
Then we load the data using the function read_delim and put it into the object data0:
data0 <- read_delim(file = "data/BI_Raw_data.csv",
                    delim = ";", col_names = TRUE, col_types = NULL)
head(data0)
Use the function head() to inspect the first few rows of the imported data. You can ignore the fact that in the product or customer names some special characters are not imported correctly.
In Figure 1.5 you can see the ETL flow for preparing the data from the tool-based part. For example, for the product table you first need to select the columns from the data0 object that are attributes of products (with the function select()), i.e. the name (Product_Name) and the category (Product_Category), and then rename the column Product_Name to name and the column Product_Category to category with the function rename().
In the original .csv file each row is a sales transaction, so the product and category columns contain a lot of duplicates. In the product table you want only one row for each product/category combination, plus a productid, the primary key. So for each name and category combination you need to delete the duplicate rows; this can be done by first grouping the rows based on name and category with the function group_by() and then using the function distinct(). Finally, you have to ungroup the data with the function ungroup() and attach a new column with the primary key, named productid, to the table with the functions mutate() and row_number().
An example of all steps to make the product table is as follows (with the pipe operator):
# Step 1: make the product table 'product'
product <- data0 %>%
  select(Product_Name, Product_Category) %>%
  rename(name = Product_Name, category = Product_Category) %>%
  arrange(name, category) %>%
  group_by(name, category) %>%
  distinct() %>%
  ungroup() %>%
  mutate(productid = row_number())
Check that you have now created a table product that contains the required columns and 77 rows. Similarly, you can make the customer table (a sketch is given after the code below). For the sales table, you first need to select the columns from the original data that you need, then join the product and customer tables to add the productid and customerid to the sales table, and finally drop the columns that have become redundant in the sales table. In this way you can join the sales and product table (assuming that you have already created the sales table):
sales <- sales %>%
  full_join(product, by = c("Product_Name" = "name",
                            "Product_Category" = "category")) %>%
  select(-Product_Name, -Product_Category)
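For the customer table mentioned above, the same pattern can be reused. A minimal sketch (the column names Customer_Name and Customer_Country are assumptions based on the attribute list in Section 1.2; adapt them to the actual column names in your data):
# Sketch: build the customer dimension analogously to the product table
customer <- data0 %>%
  select(Customer_Name, Customer_Country) %>%
  rename(name = Customer_Name, country = Customer_Country) %>%
  arrange(name, country) %>%
  group_by(name, country) %>%
  distinct() %>%
  ungroup() %>%
  mutate(customerid = row_number())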


Finally you have to fill the database. From R you can connect to the PostgreSQL database server with the functions dbDriver and dbConnect as follows (first install and load the packages DBI and RPostgreSQL; fill in your own database name, user name and password, and note that the options argument selects the schema ass2):
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, port = 5432, host = "castle.ewi.utwente.nl",
                 dbname = "", user = "", password = "",
                 options = "-c search_path=ass2")
dbWriteTable(con, "product", value = product, overwrite = T, row.names = F)
dbWriteTable(con, "customer", value = customer, overwrite = T, row.names = F)
dbWriteTable(con, "sales", value = sales, overwrite = T, row.names = F)
Check now on the website http://castle.ewi.utwente.nl/phppgadmin/ that you have indeed filled the three
tables in your database.
You can also inspect the tables in the database from R (assuming you have created the tables customer, sales and product):
dbListTables(con)
str(dbReadTable(con,"customer"))
str(dbReadTable(con,"sales"))
str(dbReadTable(con,"product"))
or if a table is in schema ass2:
dbGetQuery(con,
  "SELECT table_name FROM information_schema.tables
   WHERE table_schema='ass2'") ## to get the tables from schema ass2
str(dbReadTable(con, c("ass2", "sales")))

Step 4: Visualize

Now that we have our data ready in a form suitable for analysis, it is time to address the two business
problems raised in the beginning. We create a dashboard with metrics to measure the required KPIs. We
use the visualization tool Tableau for this.
Connect it to the PostgreSQL database. For this you have to choose PostgreSQL and then you get a
pop-up menu where you can select the server (’castle.ewi.utwente.nl’), the database (’name of your
database’), username and password. If all is well, you should see the three tables you created. Drag them
to the top area, starting with the fact table "Sales", then the other two. Notice that Tableau automatically recognizes the relationships between the tables: a benefit of a database schema conforming to a multidimensional model. Click "Update now" to read the data into Tableau (or use "Update automatically").
To start our dashboard, click on "Sheet 1", drag Customer.Name to the Y-axis and Sales.Sales to the X-axis, and then click "Sort name ascending by sales". You now have a visualization that answers the first business question about the most valued customer. Make another one for product, answering the question about the most important product.
Deliverable: Export the dashboard visualizations as JPG-files as proof that you successfully re-created the
demo of the topic lecture.

1.4.4 Assignment 3: Do it yourself
Use the spreadsheets in SuperSales.zip
We follow the method of Section 1.1.1 again, but now only steps 1 and 2 are given; the assignment is to do
steps 3 and 4 yourself.


Step 1: Business questions

For this new warehouse in Canada, Mr. Bezos wants to know answers for the following questions
• Which products/product categories made the most loss?
• Which products/product categories were shipped really late (more than 2 days)?
• Which products/product categories were returned the most?
Step 2: Multidimensional model and database structure

Re-create the star schema of Figure 1.6 for Mr. Bezos to answer his business questions. We require that
• The dimension “ReturnStatus” has a separate table containing two rows representing the only two
values “Returned” and “NotReturned”.
• The dimension “Late” is an inlined dimension representing the only two values “Late” and “NotLate”.
Step 3: ETL — Fill the database

The main goal of this assignment is that you do this step yourself.
Note: There is an important complication in this case. Notice that the star schema models a proper cube in
the pure sense of multidimensional modeling. For each combination of dimension values, there should be
exactly one row in the fact table. Suppose that the same customer orders the same product twice on the same
day and both products are not late and not returned. In other words, all dimensions are the same for these
two orders. Therefore, we need to combine the facts of these two orders, so that we end up with exactly one
row for this particular combination of dimension values.
As you know from assignment 2, the suggested way of accomplishing this is by using the function group_by() followed by summarise(). Note that combining facts may mean different things for different facts: sales, profit, orderquantity and shippingcost should be summed, but unitprice should not.
Tips (a small sketch follows after this list):
• The dates in the CSV file are just characters, but to be able to calculate e.g. the number of days a product is too late you have to transform them into real dates. The function dmy() takes a string and transforms it into the right class. When you have obtained the orderdate and shipdate in the appropriate format, you can calculate the number of days with: interval(orderdate, shipdate) / ddays(). Next you can use the function if_else() to make a column with the two values "Late" and "NotLate".
• There is more than one CSV-file for which the information needs to be combined. A handy function to accomplish such a combination is full_join(). The rows of two streams are joined if they have the same value(s) for one or more key attributes (e.g., "orderid").
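As an illustration of these two tips only, a minimal sketch is given below. The object name orders and the column names orderdate, shipdate, customerid, productid, returnstatusid, sales, profit, orderquantity and shippingcost are assumptions; adapt them to the actual columns in SuperSales.zip.
library(dplyr)
library(lubridate)
# Sketch: derive the 'Late' dimension from two (assumed) date columns read as text
orders <- orders %>%
  mutate(orderdate = dmy(orderdate),
         shipdate  = dmy(shipdate),
         dayslate  = interval(orderdate, shipdate) / ddays(),
         late      = if_else(dayslate > 2, "Late", "NotLate"))
# Sketch: combine facts so that every combination of dimension values
# appears exactly once in the fact table
sales <- orders %>%
  group_by(orderdate, customerid, productid, late, returnstatusid) %>%
  summarise(sales = sum(sales),
            profit = sum(profit),
            orderquantity = sum(orderquantity),
            shippingcost = sum(shippingcost)) %>%
  ungroup()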
Step 4: Visualization

Realize a dashboard that answers these business questions.
Deliverables:
Submit pictures of the multi-dimensional model / database schema and the dashboard. Furthermore, submit
the R code as well as the output of the following R-code:
dbGetQuery(con,
  "SELECT table_name FROM information_schema.tables
   WHERE table_schema='ass3'") ## to get the tables from schema ass3
str(dbReadTable(con, c("ass3", "")))
NB: Repeat the 'str' line for each table you have and fill in the table name where it says "". In the 'dbGetQuery' line you need not fill in anything: "table_name" is a literal column name of information_schema.tables, so write it there literally.


1.4.5 Assignment 4: multidimensional modeling
Case “Mobile app beta tester service”
This assignment is the same for Tool-based and R-based. See Section 1.3.5 for the actual assignment.

