Pattern Recognition
IN4085 laboratory course manual
2018

David M.J. Tax, Marco Loog

Contents

Pattern Recognition

1 Introduction
  1.1 Decision Theory
  1.2 Creating artificial and real datasets
  1.3 Measuring objects within PRTools

2 Classification with Normal Densities
  2.1 Normal Probability Density Function
  2.2 Bayesian Decisions based on Normal Distributions

3 Nonparametric Classification
  3.1 Non-parametric density estimation
  3.2 The scaling problem

4 Linear Classifiers
  4.1 The perceptron
  4.2 Least squares
  4.3 Bias-variance dilemma

5 Linear Classifiers, Part 2
  5.1 Logistic Regression, Logistic Discrimination, Log-linear Classifier, loglc
  5.2 The Support Vector Machine, svc

6 Nonlinearity: Neural Networks, Support Vector Machines, and Kernelization
  6.1 Neural Networks
  6.2 The Nonlinear Support Vector Machine, svc

7 Decision Trees, Combining, and Boosting
  7.1 Decision Trees
  7.2 Combining
  7.3 An AdaBoosting Exercise

8 Classifier evaluation
  8.1 Sources of Variation
  8.2 Learning Curves
  8.3 Cross-Validation
  8.4 The Confusion Matrix
  8.5 Reject Curves
  8.6 Receiver Operating Characteristic

9 Clustering
  9.1 Hierarchical clustering
  9.2 K-means clustering
  9.3 Clustering with a mixture-of-Gaussians

10 Cluster validation
  10.1 Cluster validation

11 Feature Selection
  11.1 Selection Experiments
  11.2 Stuff on Scatter Matrices
  11.3 More...

12 Feature Extraction
  12.1 Principal Component Analysis / Karhunen-Loève Transform
  12.2 Scatter Matrices, Repeating...
  12.3 Supervised Feature Extraction: the Fisher Mapping
  12.4 Kernel PCA

13 Dissimilarity Representation

A Final assignment
  A.1 Case
  A.2 Input: the NIST dataset
  A.3 Expected output: deliverables
  A.4 The benchmark
  A.5 Grading
  A.6 Feedback

B Self-evaluation probability theory and linear algebra

C Introduction to Matlab
  C.1 Getting started with Matlab
  C.2 Mathematics with vectors and matrices
  C.3 Control flow
  C.4 Script and function m-files

D Introduction to PRTools
  D.1 Introduction
  D.2 PRTools
  D.3 Dataset
  D.4 Datafile
  D.5 Mapping
  D.6 Training and testing
  D.7 Example

E Introduction to datafiles
  E.1 Introduction
  E.2 Operations on datafiles
  E.3 Image Processing

Pattern Recognition
Course contents and goals
This course will provide you with a practical introduction to data analysis and pattern recognition. The techniques are discussed at a level such that you will be able to apply them in your
research. The emphasis is on using the computer as a tool for pattern recognition. Starting
from the basis for any pattern recognition application, measurements, the topics discussed
will be:
• classification;
• representations;
• evaluation;
• feature selection and extraction;
• clustering.
After you have successfully completed this course, you should:
• understand pattern recognition theory to such an extent that you are able to read
recent literature on the topic in engineering-oriented journals (e.g. IEEE Tr. on PAMI);
• know which statistical methods to apply to which problems, on which assumptions they
are based and how these methods interrelate;
• be able to construct a learning system to solve a given simple problem, using existing
software.

Laboratory work
In addition to the Monday lectures, there is a practical component in which you practice
the construction of a pattern recognition system. This runs in parallel with the lectures and
is given on Wednesday or Thursday. The laboratory work starts with exercises related to the
lecture of the week (as given in this manual).

Material

Next to the document you are now reading, the book “Pattern Recognition” by Sergios
Theodoridis and Konstantinos Koutroumbas will be used (4th edition, 2009, Academic Press,
ISBN 978-1-59749-272-0).
Although not necessary, it will be useful for you to have access to a computer running Matlab outside the practical course. The toolbox and data sets used during the course can be
downloaded from the course Blackboard site, so that you can experiment and work on your
final exercise at home.

Prior knowledge

A,B

Basic working knowledge of multivariate statistics and linear algebra is required to follow the
course. The basic elements of statistics will be reiterated in the second week. After the first
week you are asked to do the self-test in Appendix B and submit your answers. If you feel
you are deficient in either statistics or linear algebra, please have a good look at Appendices
A and B of the book “Pattern Recognition”.

Examination
The examination of this course will consist of three parts:
1. A number of papers (6-7) will be discussed in the lectures. In preparation, these papers
will have to be read, and five non-trivial questions about each paper should be handed
in on a separate A4 (with your name and student number) at the beginning of the
lecture. These questions are graded as “good” or “insufficient” (20% of the final grade).
2. Lab work is mandatory. As a final assignment, you should construct a working pattern
recognition system to solve a given problem and write a report on it (40% of the grade,
see below).
3. A written exam on pattern recognition theory (40% of the grade).

Final assignment
The final assignment for the practical course is based on a case, in which your group plays
the role of a pattern recognition consultancy company (see Appendix A). Your company is
faced with an assignment by a client company working on automatic bank cheque processing
applications. This client wants you to research the possibility of using pattern recognition
techniques to interpret bank account numbers and the monetary amount. To this end, they
have supplied you with a standard dataset of handwritten digits (“0”, . . . , “9”) put together
by the US National Institute of Standards & Technology, NIST. The digits have already been
segmented (isolated) for you, and are stored as individual images. The problem you will work
on is how to recognize the digits, given these images. The deliverable consists of a report
detailing and motivating the design choices you made, as well as a test of your system on a
set of benchmark data withheld from you by the client.
To prepare for this final assignment and practise with the tools you will need to solve this
problem, the last few exercises for each of the first 9 weeks will consist of some small experiments on the handwritten digit dataset. During the last few weeks you can then work on the
final assignment itself. In the exercises each week, some possible solutions will be presented.

These are definitely not the best solutions, so your final system cannot consist of just the
solutions to the exercises each week!
The basic software package used is PRTools. It is introduced in appendix D. It contains
many routines to analyze patterns in object measurements given by a vector representation.
Preprocessing is needed to convert raw data like images and time signals into the proper
representation. PRTools includes a set of routines that enable such preprocessing for image
data. In addition, private code may be used. It is important to store routines that automatically generate (intermediate) results, as this will enable you to modify and regenerate
datasets later. It may also be useful to recompute results for larger sets of objects.
Depending on your programming skills, these last exercises each week may take some time.
You are advised to take care that you fall at most one to two hours behind schedule.

This manual
This document consists of several parts:
1. exercises for the practical course (Chapters 1-12);
2. the final assignment (Appendix A);
3. a self-test on probability theory (Appendix B);
4. brief introductions to Matlab; to PRTools, a pattern recognition toolbox used
throughout the course; and to datafiles for the construction and treatment of general
measurements (Appendices C-E).
In the first part of the manual, for each week a number of exercises is given. Please read the
text carefully and do the exercises. If you have the idea things are still unclear, ask one of
the assistants.


Notation
Most weeks contain a few optional exercises, indicated like this:
optional

An exercise between these lines is optional. This means that you are not required to do it.
It can give you some extra background and tips. Only work on it if you think the subject is
interesting and if you have sufficient time left.
end optional

Some other notational conventions are:
• Matlab variables and code will be indicated using the teletype font (for example:
x, mean(x)). For larger pieces of Matlab code, we will use the notation:
>> % A piece of test Matlab code
>> x = rand(10,2);
>> mean(x)
>> std(x)
Here, >> is the Matlab prompt. If pieces of code are not preceded by >>, this means it is
wiser to write a script. You can learn more about Matlab scripts in appendix C.
• An alert sign like the one in the margin here indicates it is essential you read the text
next to it carefully.
X, Slides

• A book sign indicates where you can read more on the theory behind the subject discussed. Numbers indicate chapters and sections in “Pattern Recognition”.
“Slides” means the theory is discussed in the slides, which can be downloaded in handout format from the Blackboard site.
It might happen that in an exercise you will encounter a Matlab-command you have never
seen before. If this happens, have a look at the help. In almost all cases there is an explanation
of the command, so you can get an idea of what the function does.


Week 1

Introduction
Objectives When you have done the exercises for this week, you
• should be able to mathematically derive decision boundaries given simple probability density functions,
• should be able to perform simple computations with Bayes’ rule,
• should be familiar with working under Matlab,
• understand some PRTools commands,
• know what an object and a dataset are,
• should be able to construct, investigate, and visualize some simple datasets.

1.1 Decision Theory

Exercise 1.1 (a) Assume that we managed to represent objects from a two-class classification problem by a single feature. We know that the objects from class ω1 have a
Gaussian distribution with µ1 = 0 and σ1^2 = 1/2, and the objects from class ω2 have a
Gaussian distribution with µ2 = 1 and σ2^2 = 1/2. Derive the position of the decision
boundary when both class priors are equal.
(b) Again, assume we have a two-class classification problem in a 1D feature space,
but now assume that objects from class ω1 have a uniform distribution between 0 and
1, and objects from class ω2 have a uniform distribution between 2 and 3. Where is the
decision boundary now?
(c) And where is the decision boundary when the objects from class ω2 have a uniform
distribution between 0.5 and 1.5? (The distribution of ω1 did not change, classes have
equal prior.)
(d) And where is the decision boundary when the objects from class ω2 have a uniform
distribution between 0.5 and 2.5? (The distribution of ω1 did not change, classes have
equal prior.)
2.2

Exercise 1.2 (a) Assume we represent the objects in a two-class classification problem by
a single feature. We know that the objects from class ω1 have a Gaussian distribution
with µ1 = 0 and σ1^2 = 1/2, and the objects from class ω2 have a Gaussian distribution
with µ2 = 1 and σ2^2 = 1/2. Derive the position of the decision boundary when both
class priors are equal, but we have a loss matrix of:


L = [0, 0.5; 1.0, 0]        (1.1)
(b) Assume again we have a two-class classification problem in a 1D feature space, but
now assume that objects from class ω1 have a uniform distribution between 0 and 1,
and objects from class ω2 have a uniform distribution between 0.5 and 2.5. Given the
loss matrix (1.1), where is the decision boundary now?
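The minimum-risk rule can also be checked numerically. If we read the loss matrix in the book's convention, λ_ki = loss of deciding ω_i when the true class is ω_k, then the risk of each decision is r_i(x) = Σ_k λ_ki p(x|ω_k)P(ω_k), and the boundary is where the two risks are equal. A NumPy sketch for part (a), with our own variable names:

```python
import numpy as np

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# L[k, i]: loss of deciding omega_(i+1) when the true class is omega_(k+1)
L = np.array([[0.0, 0.5],
              [1.0, 0.0]])

x = np.linspace(-2.0, 3.0, 100001)
p = np.vstack([gauss(x, 0.0, 0.5),
               gauss(x, 1.0, 0.5)])  # class-conditionals; equal priors of 0.5

# Risk of each decision: r_i(x) = sum_k L[k, i] * p(x|omega_k) * P(omega_k)
r = L.T @ (0.5 * p)

# The risk difference changes sign exactly once here (monotone likelihood ratio).
d = r[0] - r[1]
idx = np.where(np.diff(np.sign(d)) != 0)[0][0]
boundary = 0.5 * (x[idx] + x[idx + 1])
```

Note how the asymmetric losses shift the boundary away from the midpoint between the two class means, toward the class whose misclassification is cheaper.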

Figure 1.1: The class-conditional probabilities of two classes p(x|ω1 ) (dashed blue line) and
p(x|ω2 ) (solid black line) in a 1-dimensional feature space.
Exercise 1.3 In figure 1.1 two triangular-shaped class conditional probability density functions are given. The first one p(x|ω1 ) is indicated by a dashed blue line and the second
p(x|ω2 ) with a solid black line. The class priors are assumed equal here.
(a) Again, use Bayes’ rule to derive the class posterior probabilities of the following objects: x = 3, x = −0.5, x = +0.5. To which class are the objects therefore
assigned?
(b) Which object is on the decision boundary of the Bayes classifier?
Exercise 1.4 Now assume that the prior of class ω1 in figure 1.1 is half that of class ω2.
That means p(ω1 ) = 1/3 and p(ω2 ) = 2/3.
(a) Compute the posterior probabilities for x = 3, x = −0.5, x = 0.5.
(b) Where is the decision boundary of the Bayes classifier now?
Exercise 1.5 Compute the Bayes error for the class distributions given in figure 1.1, where
the classes have equal prior.
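The densities in figure 1.1 are only given graphically, but the Bayes-error computation itself can be sketched numerically: with equal priors, it is the integral of the smaller of the two weighted class-conditional densities. As a concrete illustration we use the Gaussian pair from Exercise 1.1 instead of the triangular densities (the code and names are ours):

```python
import numpy as np

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Bayes error = integral of min(P1*p(x|w1), P2*p(x|w2)) dx, here with P1 = P2 = 0.5.
x = np.linspace(-6.0, 7.0, 200001)
overlap = np.minimum(0.5 * gauss(x, 0.0, 0.5),
                     0.5 * gauss(x, 1.0, 0.5))
bayes_error = np.sum(overlap) * (x[1] - x[0])  # simple Riemann sum
```

The same recipe works for the triangular densities of figure 1.1 once you have written them down as functions of x.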

1.2 Creating artificial and real datasets

The general approach to attack a pattern recognition problem (or a data analysis problem)
is presented in section 1.2 in the book. One key entity in data analysis is the idea of an
object. We always assume we can represent any object by a set of values, often just measurements. Whatever object we are considering, in order to perform operations on an object by
a computer, we have to encode this object by some numbers. An object in real life and in
the computer are therefore two different things. Furthermore, in this course we will use the
convention that each object is represented by a row vector of measurements (thus a 1 × d
array).
When you want to conclude something from data, it is often not from one object, but from
a set of objects. We assume we have a set of objects from which we want to obtain some
knowledge. This set is called a dataset. A dataset is a set of n objects and is stored in an n × d
array.
After you have specified the problem you want to solve, you have to collect examples and do
measurements on these examples. For that, you have to define what features are likely to be
informative for the problem you specified.
Exercise 1.6 (a) Make a fake dataset in Matlab containing 10 objects with 3 measurements each. Invent the measurement values yourself. Below is an example piece of code
which fills the matrix x with 2 objects, each containing 3 measurements:
>> x = [ 0.7 0.3 0.2;
         2.1 4.5 0];

Make your own matrix x.
(b) Compute the means (using mean) and standard deviations (std) of your 3 measurements of 10 objects. What is the difference between mean(x) and mean(x’)?
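Matlab's mean and std operate column-wise, which is why mean(x) returns one value per measurement while mean(x') returns one per object. The same column/row distinction can be checked outside Matlab; a NumPy sketch using the toy matrix from the example above (the comparison to Matlab is ours):

```python
import numpy as np

# Two objects (rows), three measurements (columns), as in the Matlab example.
x = np.array([[0.7, 0.3, 0.2],
              [2.1, 4.5, 0.0]])

per_feature = x.mean(axis=0)  # like Matlab's mean(x):  one mean per measurement
per_object = x.mean(axis=1)   # like Matlab's mean(x'): one mean per object
```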
Exercise 1.7 (a) When a dataset contains just two features per object, it can be visualized
in a scatterplot (we come back to this later). Make a scatterplot by:
plot(x(:,1),x(:,2),’b*’);
Looking at the data matrix you created in Exercise 1.6, find out which object is plotted
where in the plot.
(b) When you look at the scatterplot, can you identify outlier objects, or structure in
the data?
When the dataset is used to train classifiers, it is also required that for each object a class
label is present. This indicates from which class the object originates, according to the expert
(i.e. you).
Exercise 1.8 (a) Invent labels for the objects that you defined in the previous question,
and store them in a Matlab variable lab. The labels can be numbers, like 1 or 2, but you
can also use character strings, like ’banana’ or ’orange’. Create a PRTools dataset
by
>> a = prdataset(x,lab)

Ch.1

Check if the resulting dataset has the correct number of objects, the correct number of
features and correct number of classes (correct here means: what you expect). You can
do that by just typing the variable on the Matlab command line:
>> a

1.2.1 Scatterplot

A scatterplot is the simplest plot you can make: it simply plots the first measurement
against the second measurement. If you have three measurements, you can use 3D plots; if
you have even more, you will have to select at most three of them by hand (although later
we will discuss ways of visualising more measurements at once).
Exercise 1.9 Load the dataset “boomerangs” (use the function boomerangs and choose the
number of objects to generate).
(a) Use scatterd(a) to make a scatterplot of the first two features, and
scatterd(a(:,[2 3])); for the features 2 and 3.
(b) Use scatterd(a,3) for a 3D scatterplot. Find and use the rotate3d button to
view all data.
(c) Use scatterd(a,’legend’) to add a legend with the class labels to the scatterplot.

1.2.2 The definition of an object

Up to now, each object was characterised by a vector of measurements. Sometimes the idea of
an object is somewhat vague and the user has to decide where an object “starts” and “stops”.
Assume we have recorded several sentences of several speakers using a microphone. Now we
have a set of very long time signals, say five seconds, recorded at a sampling rate of about 11
kHz.
Exercise 1.10 (a) Imagine, for instance, that we want to build a speaker recogniser. This
recogniser should identify the person who is speaking from a recorded speech signal. By
how many features is each object now represented?
(b) If you want to classify the speech instead of the speaker, what are then the objects?
How many classes would you need?

1.2.3 Real image data

Pattern recognition studies automatic ways to find patterns in data. Very often these are
images, but other sensor signals like time series (e.g. speech, music, weather) or spectra are
equally possible. In addition, non-sensor data like text, web pages, answers to questionnaires and medical tests are often analyzed by pattern recognition tools. In the first example
we will deal with simple black-and-white images (’blobs’ or ’silhouettes’) as given by the
so-called Kimia dataset (collected by Benjamin B. Kimia).
>> obj = kimia_images    % load 12 objects for 18 classes
>> show(obj,12)

This displays 18 classes of objects, each represented by 12 examples. A single image, e.g.
object 100 can be displayed by
>> figure; show(obj(100,:))
The show command belongs to the PRTools toolbox. There are many more commands in
Matlab to display a single image, e.g. imshow, image and imagesc. Sometimes they demand
a change of the color table (colormap) or even a change of the intensity scale of the images.
In order to distinguish the classes in the Kimia database in an automatic way, a numeric
representation is needed. A first try is to measure the sizes and the perimeters of the blobs.
The following commands do this for the entire database using tools from the Matlab Image
Processing toolbox (in particular the function regionprops):
>> x = im_features(obj,obj,{’Area’,’Perimeter’,’Eccentricity’});
>> +x
The im_features command defines the measurements (it is intended to measure object properties in gray-value images for which the shape and position are given by a binary (black-and-white) image; therefore the obj parameter is supplied twice, as we have just binary images).
The function im features accepts a prdatafile, and it outputs a prdataset. This dataset
can be shown on the screen by +x. A better way for the visualization of 2-dimensional data
(i.e. data describing objects by just two numbers) is a scatter plot.
Exercise 1.11 (a) Take the feature set x from the above piece of code and select just
two classes, e.g. the elephants (class 3) and the camels (class 16) by featset =
seldat(x,[3 16]). You may inspect the values by featset = double(featset).
(b) In order to understand these numbers you may select the original images by
>> a = kimia_images
>> b = seldat(a,[3 16]) % select classes 3 and 16
>> show(b)
A single image, e.g. the first camel, can be inspected by
>> c = data2im(b,13);   % select image 13
>> figure; imagesc(c);  % show it on the screen
>> colormap gray        % apply gray colormap

In this figure the axes show the sizes in pixels. Can you understand the area and
perimeter values of featset?
(c) Inspect the data in a scatterplot by scatterd(x,’legend’). You will now see the
first two features. Note the class names in the legend and the feature labels along the
axis.

1.3 Measuring objects within PRTools

The PRTools package, which will be used extensively in this course, is based on vector representations of objects derived from object measurements. To keep everything within one toolbox,
it has recently been extended (in version 4.1) with possibilities to perform such measurements
on large sets of raw object measurements. This extension is the so-called prdatafile object of
PRTools. An introduction to PRTools is presented in Appendix D. The prdatafile
extension is described in Appendix E. In the following, some introductory examples are presented.
In order to work with PRTools it should be available in the Matlab search path. This
can be verified by help prtools. Various datasets can only be loaded if the prdatasets or
prdatafiles directories are in the path. Use help prdatasets and help prdatafiles to
inspect the list of available datasets.
This example is again based on the Kimia Dataset. It consists of a set of 216 black-and-white
images, silhouettes of toy objects and animals. There are 18 classes of such objects and for
each class 12 images are given. In order to speed up processing, here we will restrict ourselves to
just 6 classes.

Exercise 1.12 Enter the following commands:
>> delfigs                          % delete all figures
>> obj1 = kimia_images              % load 12 objects for 18 classes
>> obj1 = seldat(obj1,[3:3:18]);    % select the classes 3,6,9,12,15 and 18
>> figure(1); show(obj1);           % show raw data
>> obj2 = im_box(obj1,0,1);         % put objects in a square box (aspect
                                    % ratio = 1) to create square images
>> figure(2); show(obj2);           % show data
>> obj3 = im_rotate(obj2);          % rotate images to same orientation
>> figure(3); show(obj3);           % show data
>> obj4 = im_resize(obj3,[20 20]);  % resize images to 20 x 20 pixels
>> figure(4); show(obj4);           % show data
>> obj5 = im_box(obj4,1,0);         % create empty box around objects
>> figure(5); show(obj5);           % show data
>> showfigs                         % show all figures on the screen

The raw data in figure(1) shows the objects in bounding boxes of different sizes, to save
memory space. In figure(2) the aspect ratios are set equal to one, i.e. for each image as
many empty (black) rows or columns are added to make the image square. In figure(3)
the images are rotated, and in figure(4) resampled to a 20 × 20 pixel grid. Some objects
may touch the border of their image, which may cause trouble in further processing, so
finally empty rows and columns are added to all four sides (figure(5)).
Create a file with the above commands. If you want to better inspect what happens,
reduce obj1 further by a command like obj1 = obj1(1:12:37,:), which selects a single
object out of the first 4 classes. Try to understand each of the processing steps.
(a) Find out what the im rotate command does exactly. Why would this be useful/necessary for this set of objects?
(b) Play a little with the image size, i.e. try 10 × 10 or 30 × 30 pixels. What is
optimal?

The data referred to by the variables obj1, obj2, etcetera, are of the datafile data
type. The contents can be inspected by
>> struct(obj1)
ans =
        files: {{1x19 cell} {1x19 cell}}
     rootpath: 'C:/Matlab/prdatafiles/kimia'
         type: 'raw'
      preproc: [1x1 struct]
     postproc: [0x0 mapping]
    prdataset: [72x17907 prdataset]

Objects of the type prdatafile do not store data directly, but just refer to files on disk
and the way they should be processed. In rootpath, the name of the top directory containing
the data is stored. The two cells in the field files contain the names of all sub-directories
and of all the files they contain. Such fields may be better inspected by, e.g., s
= getfiles(obj2); s{1}, s{2}{:}. All preprocessing is stored in the preproc field, e.g.
p = getpreproc(obj5), p(4), p(5). (The selection of classes made for obj1 is stored
somewhere else; we will not discuss it here.)
It should be emphasized that commands like obj5 = im resize(obj4,[20 20]) do not cause
any processing of data. They just store the command in the description of the variable.
Execution is postponed until data is needed (e.g. for display by the show command) or when
a datafile is converted to a dataset. This will be described later.
All preprocessing can be combined in the following way:
>> preproc = im_box([],0,1)*im_rotate*im_resize([],[20 20])*im_box([],1,0);
>> obj5 = obj1 * preproc;
The *-symbol can be interpreted as the “pipe” command in Unix; preproc is thereby roughly a
small routine that consists of a sequence of operations. When it is applied to obj1, PRTools
substitutes the [] entries by the data before the *-symbol.
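The combined preproc object behaves like ordinary function composition with delayed execution: the *-symbol only records the steps, and nothing runs until the pipeline meets actual data. A rough analogy in Python (the Pipe class and all names are entirely ours, not part of PRTools):

```python
# A toy "pipeline": the * operator only records steps; nothing runs until the
# pipeline is applied to actual data (cf. lazy evaluation of PRTools datafiles).
class Pipe:
    def __init__(self, steps):
        self.steps = list(steps)

    def __mul__(self, other):
        return Pipe(self.steps + other.steps)  # concatenate, don't execute

    def __call__(self, data):
        for step in self.steps:                # execution happens only here
            data = step(data)
        return data

def step(f):
    return Pipe([f])

preproc = step(lambda v: v + 1) * step(lambda v: v * 10)  # first +1, then *10
result = preproc(2)  # (2 + 1) * 10 = 30
```

As in PRTools, building the pipeline is cheap; only applying it to data triggers the (possibly expensive) processing.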
Exercise 1.13 We will now measure two properties of the set of objects, the (normalised)
x- and y-coordinates of the centers of gravity:
>> obj = seldat(obj1,2)          % use just class 2
>> obj = obj*preproc             % normalize
>> figure(1); show(obj);         % show data
>> mean_set = im_mean(obj)       % measure (x,y) of centers of gravity
>> mean_data = double(mean_set)  % convert them (unpack)
(a) Do the results correspond to what you expect, given the object images?
The last command, double(mean_set), converts the datafile variable mean_set to
plain numbers (doubles). So in mean_data we now have the measurements of the objects in class 2. These can be further analyzed, e.g. by constructing histograms and
scatterplots.
(b) Plot the histograms of the (x, y) coordinates of the centers of gravity (“means”)
of the objects in class 2.

(c) Construct a scatterplot of these two measurements.
(d) Also measure the means of the objects in class 13.
(e) Add these to the scatterplot with a different color.
This week you saw the basis of pattern recognition: object definition, data collection, and
Bayes’ rule. Starting from good measurement data, the rest of the analysis (visualisation,
clustering, classification, regression) becomes much easier. Starting from poorly defined objects or poorly sampled datasets with insufficient example objects, your analysis becomes very
hard and no clear conclusions will be possible (except that more data is needed).

Take-home messages
• If you collect or receive data, think well about what the objects and
measurements are
• Visualization is a very important method for giving you a first idea
of what your data looks like
• At some point, you will have to convert your measurements to identical features for each object (with different values)
• Use Matlab’s and PRTools’ range of visualisation tools to get an
idea of the distribution of your data


Week 2

Classification with Normal Densities
Objectives When you have done the exercises and read the text for this week, you should
• understand how a normal distribution describes multivariate data,
• understand how to visualize and interpret 1D and 2D normal distributions,
• know how to derive decision boundaries and to visualize them for 2D problems,
• know the maximum likelihood estimates for normal models,
• understand the general ML and MAP solution to estimation.

2.1 Normal Probability Density Function

2.1.1 Shapes in 2D

Exercise 2.1 Have a look at Figures 2.3 to 2.6 in the book.
(a) Are the correlations that can be observed in these figures positive, negative, or
zero?
(b) What would it mean for the correlation if the ellipsoidal configuration in Figure
2.6 runs from the bottom left to the top right corner instead?
Exercise 2.2 (a) Make a rough sketch of the shape of the ellipsoids of a normal density for
which the covariance matrix has the form [1 0; 0 9]. Take this ellipsoid; what is the ratio
of the length of its long axis in comparison with its short axis?
(b) Imagine we rotate the above sketch of ellipsoids clockwise such that their orientation becomes diagonal. Qualitatively, what happens to the variances and the covariances
for this new normal distribution? How do they compare to the original variances and
covariances as given in the covariance matrix above?
(c) Draw a figure visualizing a 2D normal density in which the two variables/features
are completely correlated, i.e., the correlation coefficient between the two variables is 1.
Give the covariance matrix that matches your sketch.
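A candidate answer to (c) can be sanity-checked numerically: complete correlation means the correlation coefficient σ12/(σ1 σ2) equals 1, and the covariance matrix is singular (the density is degenerate, concentrated on a line). A sketch in NumPy; the example matrix is one possible answer of ours, assuming equal variances:

```python
import numpy as np

# A candidate covariance matrix with completely correlated features
# (equal variances chosen purely for illustration).
S = np.array([[1.0, 1.0],
              [1.0, 1.0]])

rho = S[0, 1] / np.sqrt(S[0, 0] * S[1, 1])  # correlation coefficient
det = np.linalg.det(S)                      # 0 for a degenerate (flat) density
```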

2.4.1

optional

Exercise 2.3 (a) Any ideas on how to visualize a 3D normal density?
end optional

2.1.2 Parameter Estimation

2.4.1

Exercise 2.4 Consider Equation (2.27) in the book, which provides the general expression
for a normal distribution in an l-dimensional space. Here we assume that l = 2.
Now assume, in addition, that the covariance matrix Σ is given and equals [ 4 1 ; 1 1 ]. Derive
the maximum likelihood estimate for the 2D mean vector µ.
optional

Exercise 2.5 Consider the MAP solution for the mean, µ̂_MAP, in Example 2.5 (Section 2.5.2)
of the book.

(a) Consider what happens to the estimate for the following limits: σ_µ² → ∞, σ_µ² ↓ 0,
σ² ↓ 0, and N → ∞.
(b) What is the role of the prior over the mean? What happens with the estimate
when there are no observations? What happens to the ML estimate in this case?
end optional

2.1.3

Some Manual Labor

Exercise 2.6 Given the following 3D observations:

[ 1 ; 0 ; 0 ],  [ 0 ; 1 ; 0 ],  [ −1 ; 1 ; 1 ],  [ 2 ; π ; 1 ]
(a) Determine the maximum likelihood estimates for the mean, the variances, and
the covariance matrix (for the normal distribution) using this 3D data. (Provide exact
solutions! No numerical approximate solutions.) (If you don’t know what these maximum
likelihood estimates look like or how to calculate them, then make sure you look this
up! And yes, Wikipedia is probably fine. . . )
(b) Redo the exercise in (a) but now only for the observations made in the first and
third dimension (or for the first and second feature), i.e., determine the mean, variances, and covariance matrix for this 2D data. (Provide exact solutions! No numerical
approximate solutions.)
(c) How do the estimates for the 2D and 3D data relate?
 
We apply the linear transformation [ 2 0 ; 0 1 ] to the 2D data considered in (b).
(d) How do the new feature vectors look/what are the new coordinates?
(e) Precisely what effect does this transformation have on the maximum likelihood
solutions for the mean etc.? What happens to the correlation between the two features?
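After doing the calculations by hand, you may want to sanity-check your exact answers numerically (a sketch; mean and cov are standard Matlab functions):

```matlab
% Numerical sanity check of the ML estimates (do the exercise by hand first!)
X = [1 0 0; 0 1 0; -1 1 1; 2 pi 1];  % the four 3D observations as rows
mu_hat    = mean(X,1)                % ML estimate of the mean
Sigma_hat = cov(X,1)                 % ML covariance (normalized by n, not n-1)
```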

Exercise 2.7 This may be a difficult exercise but it is a rather important one. . .
2.4.1

Again consider Equation (2.27) that provides the general expression for a normal distribution in an l-dimensional space. Consider an l × 1 vector a and an l × l linear
transformation matrix L. Assume we transform our random variables X, which are
distributed according to Equation (2.27), into new random variables Y by means of an
affine transformation as follows:
Y = LX + a .
(a) Show that Y is still normally distributed. Express the new mean and covariance
matrix in terms of µ, Σ, L, and a.
(b) Check your answer with the outcome in exercise 2.6 (d) and (e).
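Such affine-transformation rules can also be illustrated empirically in Matlab; the transform below is an arbitrary example, not the one from the exercise:

```matlab
% Empirical illustration: mean and covariance after an affine transform
L = [2 1; 0 1]; a = [1; -1];   % some arbitrary example transform
X = randn(2,100000);           % columns are samples from a 2D standard normal
Y = L*X + a;                   % affine transform of every sample
mean(Y,2)                      % compare with your expression for the new mean
cov(Y')                        % compare with your expression for the new covariance
```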

2.1.4

Some Programming Labor

Exercise 2.8 Generate 1000 random points from a 2D standard normal distribution using
randn. (Note the discrepancy between the math, in which typically column vectors are
employed, and computational software, in which feature vectors often are row vectors.
This means, in this case, that the matrix with random numbers should be size 1000×2.)
(a) Turn these random points into a prdataset a and scatterd the data.
The command w = gaussm(a) determines estimates for a normal distribution based on
the data set a, and stores the estimated parameters in the mapping w.
(b) Use plotm(w) to visualize the 2D density estimates on top of the scattered data.
(c) Multiply one of the features by 3 and, subsequently, rotate the data over an angle
α (say α = π/3) by means of a rotation matrix (you know. . . [ cos α  −sin α ; sin α  cos α ] . . . ). Scatter
the new data in a new figure and plot the new contours. Compare to the other figure
and see where corresponding points end up.
(d) Investigate for which angle the correlation becomes maximum.
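One possible way to set up (c), assuming PRTools is on your path (feature and angle choices as in the exercise):

```matlab
% Sketch for (c): stretch one feature, rotate, and re-plot the density
x = randn(1000,2);                     % rows are feature vectors
a = prdataset(x);
scatterd(a); plotm(gaussm(a));         % original data and density contours
x(:,1) = 3*x(:,1);                     % multiply the first feature by 3
alpha = pi/3;
R = [cos(alpha) -sin(alpha); sin(alpha) cos(alpha)];
b = prdataset(x*R');                   % rotate the row vectors
figure; scatterd(b); plotm(gaussm(b)); % rotated data in a new figure
```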

optional

Exercise 2.9 (a) Consider the same questions as those in the previous exercise but now
use uniformly distributed data, i.e., use rand to generate the initial data.
end optional

Exercise 2.10 The piece of code below generates 5 data points from a standard normal
distribution, estimates its parameters, and displays the estimated densities through a
contour plot. This is repeated a number of times and all contours are plotted in the
same figure.

(a) Before executing the code: can you predict what the various contour plots
will look like? Can you explain why this happens?
for i = 1:8
    a = prdataset(randn(5,2));
    scatterd([-3 -3; 3 3],'.w') % trick to set the figure axes about right
    plotm(gaussm(a))
    hold on
    pause
end
hold off
(b) Execute the code. Pressing the space bar will add another contour plot. Check
for yourself whether the plot comes close to your expectations. (Don’t fool yourself!
Was this indeed what you expected?)
(c) What happens when the number of random feature vectors increases steadily?
What if we could perform this exercise with an infinite amount of random data points?

2.2

Bayesian Decisions based on Normal Distributions

Exercise 2.11 Consider two 2D normal distributions with different means but equal covariance matrices. Assume the latter to be equal to a multiple of the identity matrix, i.e.,
take Σ to be [ c 0 ; 0 c ] for some c > 0. The class priors are not necessarily equal.
(a) What shapes can the decision boundary take on? Demonstrate this mathematically
starting from the expressions in Section 2.4.2 in the book.
Exercise 2.12 Consider two 2D normal distributions with different means but equal covariance matrices.
(a) What shape does the optimal decision boundary have?
Generate 10 data points per class from a data set that fulfills the above assumptions.
Create a data set a with these 20 points and these two classes. We can estimate linear
discriminants by means of ldc(a) and quadratic discriminant boundaries by means of
qdc(a).
(b) Given the shape of the optimal decision boundary, how does the boundary look
for the trained normal density based quadratic discriminant?
The PRTools command plotc plots the decision boundary.
(c) Scatter the data set a and plot the decision boundary estimated by means of ldc(a)
and qdc(a). Can you explain their difference? You might want to revisit the previous
question (b).
(d) What happens when the number of points per class increases? What happens in
the limit of an infinite number of points?
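One convenient setup for this experiment (gendats happens to produce classes with equal, spherical covariance matrices, which satisfies the assumption above):

```matlab
% Possible setup: equal-covariance classes, linear vs quadratic discriminant
a = gendats([10 10],2,4);   % 10 samples per class in 2D
scatterd(a);
plotc(ldc(a));              % linear normal-density discriminant
plotc(qdc(a),'r');          % quadratic normal-density discriminant, in red
```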
Exercise 2.13 Generate a 2D data set with 2 classes with 10 samples from a uniform distribution (using rand) and 10 samples from a normal distribution (using randn), respectively.

(a) Train an ldc and a qdc on this data, scatter the data, and plot the two decision
boundaries (with plotc). Determine how many points are on the wrong side of the
boundary.
(b) With w a trained classifier and a a data set, the labels that the classifier assigns
to the data points in a can be retrieved using labeld: a*w*labeld. Check your count
from the previous question using this command.
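A possible way to count the errors with labeld (a sketch; it assumes numeric class labels, as you get when constructing the prdataset with a numeric label vector):

```matlab
% Count disagreements between assigned and true labels (sketch)
lab_pred = a*w*labeld;            % labels assigned by the trained classifier w
lab_true = getlabels(a);          % true labels stored in the dataset
nerr = sum(lab_pred ~= lab_true)  % number of misclassified points
```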

Exercise 2.14 Given two normally distributed classes in 1D: one with mean 0, variance 4,
and class prior 9/10, and one with mean 1, variance 1, and class prior 1/10. Determine
the decision boundary for this classification problem.
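As a hint on how to set this up (not a full solution): the decision boundary lies where the prior-weighted class densities are equal, i.e., where

\[
\frac{9}{10}\,\mathcal{N}(x;\,0,\,4) \;=\; \frac{1}{10}\,\mathcal{N}(x;\,1,\,1),
\]

with \(\mathcal{N}(x;\,\mu,\,\sigma^2)\) denoting the normal density. Taking logarithms on both sides turns this into a quadratic equation in x.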

optional

Exercise 2.15 In case one is dealing with a classification problem in which the classes are
multivariate Gaussian, the quadratic classifier is in principle the classifier that will
always perform optimally as long as there is enough data to accurately estimate the
various parameters like means and covariance matrices. When the class distributions
are not necessarily normal, this does not hold true and there are cases in which the
normality-based linear discriminant performs a better discrimination than the
normality-based quadratic discriminant.
(a) Construct an artificial 2D problem for which the decision boundary of ldc(a) is
better than the decision boundary obtained by qdc(a).
(b) Take two uniform rectangular classes. One with bottom left coordinate (-1,-2) and
top right coordinate (0,3), the other with bottom left (0,0) and top right (1,1). This
should do the ‘trick’.

end optional

Exercise 2.16 (a) Show that the classification performance does not change for ldc and qdc
when the data set is linearly (or even affinely) transformed. If this is too complicated,
at least show that scaling the axes does not change the decision on the data.
(b) Construct a simple 1D or 2D example that shows the classifier might essentially
change when a nonlinear transformation of the space is performed.

Take-home messages
• The decision boundary derived from normal distributions takes on
a quadratic form
• Linearly (or affinely) transforming normally distributed data leads
to data that is again normally distributed
• LDA and QDA are invariant to scaling or, generally, to affine transformations of the data
• Parameters are estimated from a finite amount of data and will
therefore vary
• The new means and (co)variances of some data after an affine transform can be determined using some simple rules


Week 3

Nonparametric Classification
Objectives When you have done the exercises for this week, you
• understand what a Parzen density estimator and a Parzen classifier are,
• understand what a k-nearest neighbor density estimator and classifier are,
• know which classifiers are sensitive to rescaling of one of the features.

3.1

Non-parametric density estimation

3.1.1

Histograms

The simplest way to estimate the density of one measurement is by plotting a histogram. It is
easiest to do this for one measurement (Matlab has a command hist), but you can create
a histogram plot of 2 measurements at the same time.
The main problem in using histograms is choosing the right bin size. Making the bins too
broad will hide useful information within single bins; making them too small will result in
bins containing too few samples to be meaningful.
Exercise 3.1 (a) Generate 10 1D Gaussian datasets and calculate the average histogram
and the standard deviation:
for i = 1:10
    a = gauss(n,0);
    h(i,:) = hist(+a,-5:5);
end;
errorbar(-5:5, mean(h), std(h));
with different numbers of samples n (e.g. 10, 100 and 1000).
(b) For what n do you think the histogram starts giving a good impression of the true
density?
(c) Repeat the above for data obtained from a Laplacian distribution (which is more
peaked than the Gaussian); use a = laplace(n,1) to generate n samples. Does the
same hold? Why?

2.5.6

Note how you used 2 dimensions to visualise the 1D histogram. Similarly, we can create 2D
histograms using 3 dimensions. In practice, visualising 3D data or a function of 2D data is
often as far as we can go with computers.
Exercise 3.2 Generate a 2D Gaussian dataset and use the hist2 function provided:
a = gauss(100,[0 0]);
hist2(a);
Note that hist2 is not a standard Matlab function, so it is a bit less flexible than the
standard hist function. To have a look around the histogram, try help rotate3d.
(a) For the standard number of bins (10 × 10), how many samples would you say are
needed to get a good idea of the shape of the histogram? Try this by generating various
numbers of samples and inspecting whether the histogram changes if you repeat the
experiment.
(b) You can also play with the number of bins; see help hist2. For 5 × 5 bins,
how many samples do you need? Is the representation still accurate? And for 20 × 20
bins?

3.1.2

2.5.6

Parzen density estimation

Now we are going to estimate the density using a Parzen density estimator, called parzenm
in PRTools.
Exercise 3.3 (a) We start with creating a simple dataset with:
>> a = gendats([20 20],1,8);
(Type help gendats to understand what type of data we have now.)
(b) We define the width parameter h for the Parzen kernel:
>> h = 0.5;
(c) The function parzenm estimates a density for a given dataset. In most cases a
PRTools prdataset is labeled, and these labels are used in the function parzenm to
estimate a density for each class. To define a Parzen density estimator with a certain
width parameter h on the entire dataset, ignoring labels, type:
>> w = parzenm(prdataset(+a),h);
This mapping can now be plotted along with the data:
>> scatterd(+a); plotm(w,1);
If your graphs look a little “bumpy”, you can increase the grid size PRTools uses for
plotting:
>> gridsize(100);
and try the above again.
(d) Plot the Parzen density estimate for different values of h. What is the best value
of h?

When you want to evaluate a fit of a density model to some data, you have to define an error.
One possibility is to use the log-likelihood (see the lecture slides), defined as:

\[
LL(\mathcal{X}) = \log\left( \prod_i \hat{p}(x_i) \right) = \sum_i \log\left( \hat{p}(x_i) \right) \tag{3.1}
\]

The better the data x fits the probability density model p̂, the higher the values of p̂(x_i)
will be. This will result in a high value of \(\sum_i \log(\hat{p}(x_i))\). When we have different probability
density estimates p̂, we have to use the one which has the highest value of LL.
Note that when we fill in different values for the width parameters h in the Parzen density
estimation, we have different estimates p̂. Using the log-likelihood as a criterion, we can
optimize the value of this free parameter h to maximise LL.
To get an honest estimate of the log-likelihood, we have to evaluate the log-likelihood (3.1) on
a test set. That means that we have to make (or measure) new data from the same distribution
as where the training data came from. When we would evaluate the log-likelihood on the
data on which the probability density was fitted, we would get a too optimistic estimation of
the error. We might conclude that we have fitted the data very well, while actually a new
dataset from the same distribution does not fit in the density at all! Therefore, if you want
to evaluate the performance of an estimated p̂, use an independent test set!

Exercise 3.4 Use the data from the same distribution as in the previous exercise to train a
Parzen density estimator for different values of h. Compute the log-likelihood of this
training set given the estimated densities (for different h):
a = gendats([20 20],1,8);                    % Generate data
hs = [0.01 0.05 0.1 0.25 0.5 1 1.5 2 3 4 5]; % Array of h's to try
for i = 1:length(hs)                         % For each h...
    w = parzenm(prdataset(+a),hs(i));        %   estimate Parzen density
    LL(i) = sum(log(+(a*w)));                %   calculate log-likelihood
end;
plot(hs,LL);                                 % Plot log-likelihood as function of h
(since w is the estimated density mapping, the estimated density p̂ for objects in a
dataset a is given by +(a*w)).
(a) What is the optimal value for h, i.e., the one with maximal likelihood? Is this also
the best density estimate for the dataset?

Exercise 3.5 (a) Use the same data as in the previous exercise, but now split the data into
a training and test set of equal size. Estimate a Parzen density on the training set and
compute the Parzen density for the test set. Compute the log-likelihood on both the
training and test sets for h = [0.1, 0.25, 0.5, 1, 1.5, 2, 3, 4, 5]. Plot these log-likelihood vs.

h curves:
a = gendats([20 20],1,8);               % Generate data
[trn,tst] = gendat(a,0.5);              % Split into trn and tst, both 50%
hs = [0.1 0.25 0.5 1 1.5 2 3 4 5];      % Array of h's to try
for i = 1:length(hs)                    % For each h...
    w = parzenm(prdataset(+trn),hs(i)); %   estimate Parzen density on trn
    Ltrn(i) = sum(log(+(trn*w)));       %   calculate trn log-likelihood
    Ltst(i) = sum(log(+(tst*w)));       %   calculate tst log-likelihood
end;
plot(hs,Ltrn,'b-'); hold on;            % Plot trn log-likelihood as function of h
plot(hs,Ltst,'r-');                     % Plot tst log-likelihood as function of h
What is a good choice of h?
optional

Exercise 3.6 Use the same procedure as in the previous exercise. Change the number of training
objects from 20 per class to 100 per class. What is now the best value of h?
end optional

3.1.3
3.3

Nearest neighbour density estimation

Estimation of the density by the nearest neighbour method is harder, but PRTools contains
an implementation called knnm.
Exercise 3.7 (a) Again generate a dataset a = gendats([20 20],1,8) and apply the
knnm density estimator for k = 1: w = knnm(a,k). Plot the density mapping using
scatterd(a); plotm(w,1). What did you expect to see for k = 1? Why don’t you see
it?
(b) Increase the grid size, e.g. gridsize(500), and plot again.
(c) Try different values of k instead of 1. Judging visually, which do you prefer?
(d) How could you, in theory, optimise k?
Exercise 3.8 Which do you prefer: the Parzen density estimate or the nearest neighbour
density estimate?
Although PRTools contains a nearest neighbour density estimator, for illustration purposes
we can also make one ourselves. The simplest to construct is the 1-nearest neighbour.
Exercise 3.9 (a) Generate 1D data as you did in the previous exercise, but only 5 objects
per class: a = gendats([5 5],1,8). Also generate a grid of data, as follows:
>> gx = [-3:0.2:11]’;

(b) Compute the distances from all the grid points to the training set. You can use
the distm function from PRTools:
D = sqrt(distm(gx,a));
(use the help to find out why you need to take a sqrt). Check the size of the matrix D.
What do a row and a column of this matrix represent?
(c) To find the nearest object in the training set for each of the grid objects, you have
to find the minimum over a row (or a column, depends on how you defined your matrix
D). To find the minimum we first have to order them. To order them along the columns
do sort(+D,1) and along the rows do sort(+D,2). You will get a sorted distance vector
Dsort for each of the columns or rows.
(d) To compute the density for each of the grid objects, you have to use the general
formula:

\[
\hat{p}(x) = \frac{k}{n\,V(x)} \tag{3.2}
\]

where V(x) is the volume of a hypersphere around x whose radius is the distance to the
nearest neighbour (k = 1) and n is the total number of objects in the dataset. In the 1D
case this volume is just two times this distance, so:
phat = 1./(n*2*Dsort(:,1));
(e) Plot this density estimate (scatterd(+a); hold on; plot(gx,phat);) and save
the figure, saveas(gcf,’figknn’,’m’). Are you satisfied with this plot? (Notice the
scale of the y-axis! Restrict the axis with the axis command).
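Steps (a)–(e) combined into one possible script (a sketch; it follows the row-wise sorting convention described above):

```matlab
% 1-NN density estimate on a 1D dataset (sketch of steps (a)-(e))
a  = gendats([5 5],1,8);            % 5 objects per class, 1D
gx = (-3:0.2:11)';                  % grid of evaluation points
D  = sqrt(distm(gx,a));             % distances grid points -> training objects
Dsort = sort(+D,2);                 % sort each row: nearest object comes first
n  = size(a,1);                     % total number of training objects
phat = 1./(n*2*Dsort(:,1));         % 1D volume V(x) = 2 * nearest distance
scatterd(+a); hold on; plot(gx,phat);
```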
Exercise 3.10 (a) How can you generalize the previous procedure to use the k-th nearest
neighbour instead of the first nearest neighbour? Test it by inspecting the density plot.
(b) How would you calculate an optimal value for k?
Exercise 3.11 How should you generalize the procedure in Exercise 3.9 to work in a p-dimensional feature space?

3.1.4

The Parzen and kNN Classifier

In the sections above, two approaches to estimating a density are shown. The first is using
the Parzen window approach, the second is using the nearest neighbor approach. When one
of these approaches is used to estimate the class conditional probabilities, Bayes’ rule can
be used to construct a classifier. In the table below, the corresponding PRTools mapping
names are listed.
PRTools name   description
parzenc        Parzen density classifier
knnc           k-nearest neighbor classifier

Exercise 3.12 Generate a dataset a using the function gendatb, and make a scatterplot
of the dataset. Next, train a Parzen classifier and a k-nearest neighbor classifier, and

plot the decision boundary using plotc (obviously, to train the classifiers, you have to
choose a value for h and for k):
>> w = parzenc(a,h);
>> v = knnc(a,k);
>> plotc(w); plotc(v,’r’);
(a) Try different values for h and k. What are good values?
(b) How would you make an automatic procedure to optimize the values of h and k
for a classification problem?

3.1.5

Naive Bayes Classifier

In PRTools the naive Bayes classifier is called naivebc. In the implementation of PRTools,
for each of the features it estimates the class conditional probabilities using a histogram with
N bins.
Exercise 3.13 Compare the decision boundary of the naivebc with that of the Parzen
classifier and the k-nearest neighbor classifier. Look, for instance, at the datasets gendats,
gendatb and gendatd.
Exercise 3.14 Make exercises 2.39 and 2.41 from the book (Section 2.8, Problems).

3.2

The scaling problem

In this last section we have a quick look at the scaling problem. It appears that some classifiers
are sensitive to the scaling of features. That means that when one of the features is rescaled
to very small or very large values, the classifier will change dramatically. It can even mean
that the classifier is not capable of finding a good solution. Here we will try to find out which
classifiers are sensitive to scaling, and which are not.
Exercise 3.15 (a) Generate a simple 2D dataset (for instance, using gendatb) and plot the
decision boundaries of the six classifiers nmc, ldc, qdc, fisherc, parzenc and knnc. Use k = 1 for the knnc.
(b) Make a new dataset in which the second feature is 10 times larger than the dataset
above. Do this by
newtrain = a;
newtrain(:,2) = 10*newtrain(:,2)
Train six classifiers nmc,ldc,qdc,fisherc,parzenc,knnc and plot the decision boundaries.
(c) Which classifiers are affected by this rescaling of the feature space? Why are they
affected, and others not?
(d) Is it an advantage or a disadvantage?
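A possible script for (a) and (b); note that plotc should also accept a cell array of trained mappings, so all six boundaries can be drawn in one call:

```matlab
% Sketch: compare decision boundaries before and after rescaling one feature
a = gendatb([20 20]);
newtrain = a; newtrain(:,2) = 10*newtrain(:,2);  % second feature 10x larger
for data = {a, newtrain}
    figure; scatterd(data{1});
    plotc({nmc(data{1}), ldc(data{1}), qdc(data{1}), ...
           fisherc(data{1}), parzenc(data{1}), knnc(data{1},1)});
end
```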

Take-home messages
• k-NN density estimation followed by Bayes’ rule results in
the k-NN classifier. Similarly, applying Bayes’ rule to the Parzen
density estimate results in the Parzen classifier,
• Some classifiers are sensitive to the scale of the individual features.
Rescaling of features can result in much better or worse classification
performances,



Week 4

Linear Classifiers
Objectives When you have done the exercises for this week, you
• know what a perceptron classifier does,
• understand the least squares, or Fisher classifier,
• know what the bias-variance dilemma in a classification setting means.

4.1

The perceptron

Exercise 4.1 Make exercise 3.4 from the book (Section 3.8, Problems).

Exercise 4.2 (a) Generate a simple, linearly separable, dataset gendats([20,20],2,6).
Make a scatterplot and check that the dataset is linearly separable. If it is not, generate
a new dataset until it is.
Extract from this dataset the feature matrix X and the label vector (hint: use
getnlab).
(b) Implement the perceptron algorithm, given on pg. 94 of the book. Don’t forget to
add the constant term to the features (as explained at the start of section 3.3 from the
book)!


(c) Train the perceptron on the data that you generated and plot the normal vector
w and the decision boundary (i.e. all points where wT x = 0) in the figure. Does the
weight vector make sense: does it separate the classes?
(d) What happens when the classes in the dataset are not linearly separable? Check
what solution your implementation gives.
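One possible implementation sketch of the perceptron algorithm (the learning rate rho, the iteration cap, and the mapping of PRTools labels {1,2} to {−1,+1} are choices of this sketch, not prescribed by the book):

```matlab
% Minimal perceptron sketch (labels must end up as +/-1)
X = [+a, ones(size(a,1),1)];       % data matrix with constant term appended
y = 2*getnlab(a) - 3;              % map numeric labels {1,2} to {-1,+1}
w = zeros(size(X,2),1);            % initial weight vector
rho = 0.1;                         % learning rate
for iter = 1:1000
    miscl = find(y .* (X*w) <= 0); % indices of misclassified samples
    if isempty(miscl), break; end  % done: all samples correctly classified
    i = miscl(1);
    w = w + rho * y(i) * X(i,:)';  % perceptron update on one misclassified sample
end
```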

4.2

Least squares

Exercise 4.3 Check that equation (3.28) from the book is equal to equation (3.26).
3.3

Exercise 4.4 Implement the Least Squares solution, equation (3.45), in Matlab. Test it on
the same dataset that you have used for the perceptron (exercise 4.2), by plotting the
decision boundary in a scatterplot.
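One way to code the least squares solution (a sketch; it reuses the ±1 label encoding and the appended constant term from the perceptron exercise):

```matlab
% Least squares / pseudoinverse sketch
X = [+a, ones(size(a,1),1)];   % data matrix with constant term appended
y = 2*getnlab(a) - 3;          % map numeric labels {1,2} to {-1,+1}
w = (X'*X) \ (X'*y);           % solves the normal equations X'X w = X'y
```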
Exercise 4.5 The Least Squares solution is implemented in PRTools as well, and it is
called the Fisher classifier, fisherc. Train a Fisher classifier on the same data as you
used before, and compare the two decision boundaries. Are they equal?

4.3

Bias-variance dilemma

(Section 3.5.3 in the book)

The bias-variance dilemma indicates that when a model is fitted on a finite (small) dataset,
a trade-off must be made between the bias and the variance. The bias is the difference
between the average model output and the ‘true’ expected output. The variance is the
average difference between the specific output of a model and the average model output. Or,
loosely speaking, the bias measures whether the model is flexible enough to fit the truth, and
the variance measures how stable the estimate is. In this section we try to identify this in
some classifiers.
Exercise 4.6 Generate a small Banana set a = gendatb([10,10]), train a Fisher classifier
on this, and plot the decision boundary. Now, resample new datasets from the Banana
distribution and train new Fisher classifiers.
Identify which parts of the Banana distribution are always incorrectly classified by all
classifiers. Does this contribute to the bias or the variance?
Identify in which parts of the distribution the classifiers often disagree. Does this
contribute to the bias or the variance?
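The resampling can be done in a short loop (a sketch; the number of repetitions is an arbitrary choice):

```matlab
% Overlay Fisher boundaries trained on fresh Banana samples
scatterd(gendatb([10 10]));            % one sample, just to set up the axes
hold on;
for i = 1:10
    plotc(fisherc(gendatb([10 10])));  % new dataset, new classifier, new boundary
end
```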
Exercise 4.7 Repeat the previous exercise, but instead of using a Fisher classifier, use a
Parzen classifier. How do the variance and the bias change when you switch from the
Fisher classifier to the Parzen classifier?
Take-home messages
• There are several ways of optimizing a linear classifier,
• When you have a limited dataset, you need to find a good trade-off
between the bias and the variance.



Week 5

Linear Classifiers, Part 2
Objectives When you have done the exercises and read the text for this week, you should
• be able to derive the expression for the posteriors in the case of logistic discrimination,
• have an idea of the basic shape of the logistic posterior,
• be able to describe the underlying idea of the linear SVM.

5.1

Logistic Regression, Logistic Discrimination, Log-linear
Classifier, loglc

Exercise 5.1 (a) Consider Equation (3.61) in the book (Section 3.6) with M = 2. Derive an
expression for the posterior p(ω2 |x) (which does not depend on the posterior of the other
class ω1 ).
(b) Sketch the general shape of p(ω2 |x) under the logistic model in 1D. Understand its
asymptotic behavior and how its form depends on its parameters. Also try to understand
how the shape of this function looks in 2D, for example.
You are given samples from a two-class problem in 1D with feature values −1 and 1 in the
one class and 2 and 3 in the other. Consider the objective function in Equation (3.69) in
the book (again with M = 2).
(c) Describe the solutions, in terms of formulæ or qualitatively, that maximize Equation
(3.69).
(d) Why is the solution from (c) not unique?

5.2

The Support Vector Machine, svc

Exercise 5.2 Consider the following 2D two-class data set. Class one contains two points:
[ 0 ; 1 ] and [ 2 ; 3 ]. Class two has a single data point: [ 0 ; 0 ].
(a) Determine the classifier that maximizes the margin on this classification problem,
using a graphical/geometrical reasoning (probably you cannot do the minimization of
the support vector error function by hand). How many support vectors are obtained?

Shift the first point above, [ 0 ; 1 ], to [ 0 ; −1 ].

(b) How does the new maximum margin classifier look? What happened to the number
of support vectors?
Exercise 5.3 (a) Demonstrate, possibly graphically/geometrically, that the support vector
classifier is sensitive to feature scaling. Hint: this can be done in 2D based on a training
set of size 3 (like in (a) and (b)) and a single test point.
Exercise 5.4 (a) Reconsider the 2D configurations from 5.2 (a)–(b) above and compare the
solution of the Fisher classifier to those obtained by means of an SVM. In what cases
do they differ? Do you see the pattern?

optional

Exercise 5.5 Make exercise 3.17 in the book.
Section 3.8
end optional

5.2.1

Linear Classifiers: A First Showdown?

Exercise 5.6 (For this exercise you might want to set up a script. . . )
Generate a 30-dimensional training set trn with 50 samples using gendats (read help
gendats!). Generate, in addition, a test data set tst with 1000 objects.
(a) Train the following classifiers: nmc, ldc, fisherc, loglc, and svc, and test these
with the test set. Hint: testc can be used to test trained classifiers on a test set (again:
read the help!) and with braces one can train (and test) many classifiers in one go. For
instance, trn*{qdc,knnc([],1)} can be used to train a quadratic discriminant and a
1NN on trn simultaneously.
(b) Repeat the previous steps of training and testing a couple of times with newly
generated training sets. Overall, which seems the best performing classifier? Can you
explain this?
(c) Reduce the training set size to 30 objects and redo the experiment. Does your
conclusion change? What happens to ldc when the number of samples is so small
relative to the dimensionality?
(d) Redo (a) to (c) for samples from gendatd instead of gendats. Any remarkable
changes? What is the main difference between the NMC in the case of a sample size of
50 and 30?
(e) Can you construct a classification problem in which a linear SVM would clearly
outperform a linear discriminant classifier? How about the Fisher classifier vs. the SVM?
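A possible skeleton for such a script (a sketch; the class-distance parameter passed to gendats is an arbitrary choice):

```matlab
% Train five linear classifiers on one training set, test them in one go
trn = gendats([25 25],30,2);          % 50 training samples in 30D
tst = gendats([500 500],30,2);        % large independent test set
W = trn*{nmc,ldc,fisherc,loglc,svc};  % train all classifiers at once
tst*W*testc                            % test error of each trained classifier
```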

Take-home messages
• The posteriors of a logistic discriminant have a classical, so-called
sigmoidal shape
• Generally, the C parameter can critically influence the decision
boundary of an SVM
• Where, for instance, the linear discriminant classifier is scaling invariant, for SVM feature scaling matters
• Different linear classifiers perform best under different circumstances, notably, their relative performance is typically training sample size dependent



Week 6

Nonlinearity: Neural Networks,
Support Vector Machines, and
Kernelization
Objectives When you have done the exercises and read the text for this week, you should
• be able to give the link between a kernel function and its (implicit) feature mapping
• be able to kernelize simple, inherently linear classifiers
• have basic understanding of the nonlinearities neural networks can create
• understand the separation potential of linear classifiers in high-dimensional spaces.

6.1

Neural Networks

Exercise 6.1 Consider a 1D data set with two samples in class A (x1 = 0, x2 = 3) and one in
class B (x3 = 2).
(a) Construct a neural network with one hidden layer, using neurons that have a
sigmoid activation function, that can perfectly separate class A and B. How many
hidden units does it need?
(b) What if one allows an arbitrary activation function for every neuron? With how
few neurons can one solve this task consisting of three labeled samples?
Exercise 6.2 The function bpxnc implements a simple multilayer neural network, which can
be trained by backpropagation. For instance, w = bpxnc(a,2,1000) trains on data set
a a network with a single hidden layer with two nodes.
Investigate for very(!) small networks and very(!) small samples sizes from two classes
(XOR?) how such networks behave. (You better type prwaitbar off.) Repeat, for
example, the same line of code that trains some neural network and plot the different
classifiers obtained in the same scatterd of the data.
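One possible setup for this experiment (the tiny XOR-like dataset and the number of repetitions are choices of this sketch):

```matlab
% Repeated training of a tiny network on an XOR-like problem
prwaitbar off;
a = prdataset([0 0; 0 1; 1 0; 1 1],[1 2 2 1]');  % XOR labeling
scatterd(a);
for i = 1:5
    plotc(bpxnc(a,2,1000));   % each run may end up in a different optimum
    hold on;
end
```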

Exercise 6.3 Consider Figure 4.8 in the book. Consider a sample in each of the seven different
regions and label these samples A or B according to the labels in the book, except for the
sample in area 110, which should also be labeled A (instead of B). So, we have four samples
from A and three from B.
(a) Can this problem be solved perfectly with a multilayer perceptron (i.e., using hard
thresholds as activation functions and not smooth sigmoids) with one (1) hidden layer
with three (3) neurons? How can you see this?
(b) How many regions can this network classify correctly at most?
(c) How many different optima does this multilayer perceptron have for which at most
one region is being misclassified? That is, in how many different ways can the network
reach the maximum number of correctly classified regions as found in b? Explain your
answer.
Exercise 6.4 Design a bpxnc that works well (and fairly stable?) on the data set a =
gendatb.
Exercise 6.5 Some arbitrary geometric insight?
(a) In how many parts/polygons can N linear decision boundaries chop up a feature
space when its dimensionality is equal to or greater than N ?
(b) Consider a two-class problem. How many linear decision boundaries does one
typically need to classify N arbitrarily scattered and labeled points in N -dimensional
space?

6.2

The Nonlinear Support Vector Machine, svc

Exercise 6.6 (a) Assume we have two objects, represented by 1-dimensional feature vectors
x and χ. Find a feature mapping φ that leads to the inner product exp(−(x − χ)2 ).
Hints: expand the term −(x − χ)2 and write exp(2xχ) as a series based on the Taylor
series of the exponential.
(b) What is the dimensionality of the space that φ maps a 1-dimensional feature vector
to?
Exercise 6.7 Let’s kernelize nmc.
(a) Express the distance to any class mean in terms of regular inner products between
the test point x and, say, the N_C samples x_i^C from class C.
(b) Kernelize the nearest mean classifier by means of the Gaussian kernel,

\[
K(x, \chi) = \exp\!\left(-\frac{\|x - \chi\|^2}{2\sigma^2}\right).
\]

Can you show that this boils down to something like a Parzen classifier?
You may limit yourself to the two-class case.
Exercise 6.8 The function svc can be used to construct both linear and nonlinear support
vector machines. Check the help to see how kernels different from the linear one can
be employed. Type help proxm to get an overview of the possible kernels and their
parameters.
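As a minimal sketch of what such a call might look like (this assumes the usual PRTools convention in which proxm generates the kernel mapping; verify the exact argument order with help svc and help proxm):

```matlab
% Sketch: SVM with a Gaussian (radial basis) kernel of width 2
a = gendatb([20 20]);         % banana training set
w = svc(a, proxm('r', 2));    % 'r' should select the radial basis kernel
t = gendatb([500 500]);       % large independent test set
e = t*w*testc                 % estimate of the true error
```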

(a) On a = gendatb([20 20]), train an svc with a 'radial basis' kernel, i.e., the
Gaussian kernel, for kernel widths that vary from fairly small (0.1?) to fairly large
(10?). Check with a large (enough) independent banana test set how the performance
varies for the different choices of kernel widths.
(b) How does the kernel width of parzenc relate to the width of the radial basis
function?
(c) Why can the svc, potentially, perform much faster at test time than the Parzen
classifier?
optional

Exercise 6.9 Consider the linear least squares problem as in Subsection 3.4.1 of the book. We aim to kernelize
its solution mapping ŵᵀx.
(a) Express the minimizing solution ŵ explicitly in terms of the training samples.
(b) Using the well-known identity (XXᵀ)⁺X = X(XᵀX)⁺, where ⁺ indicates the
Moore-Penrose pseudoinverse, rewrite ŵᵀx in terms of inner products between training
instances x_i and the test point x.
Exercise 6.10 The next experiment might take too much time or too much memory.
Construct a data set with a sharp corner in it. Take, for instance, X = rand(1000,2);
and a = dataset(X,X(:,1)>.5&X(:,2)>.5). Study how closely a support vector
machine follows the border, and especially the corner, for different kernels and different
settings of the penalty term C. Definitely try the Gaussian kernel with different sigma
parameters in combination with various Cs. (If possible, you might want
to consider using libsvc instead of svc.)
end optional

Take-home messages
• The design and training of multilayer neural networks can be a challenge
• The C parameter can critically influence the decision boundary of
an SVM, also in the nonlinear case
• Neural nets might get stuck in local optima or, generally, bad solutions
• Classifiers like NMC and the Fisher linear discriminant can be kernelized
• The Gaussian kernel enables one to work implicitly (obviously) in
an infinite dimensional space.


Week 7

Decision Trees, Combining, and Boosting
Objectives When you have done the exercises and read the text for this week, you should
• understand that pruning might be crucial for decision tree performance
• understand that one can “overprune” and “underprune”
• understand how bagging can potentially improve the performance of a single “unstable” classifier
• understand the difference between fixed and trained combiners
N.B.! The time has come that, during the computer exercises, you should not only spend time
on the exercises for the current week but also on the final assignment. You can find the latter
in Appendix A.

7.1 Decision Trees

Exercise 7.1 (a) Train a decision tree, using treec, with splitting criterion infcrit on a
gendats data set with 50 samples in every class. Vary the degree of pruning up to, say,
12 (help treec!) and check how the classification error varies with it on a large enough
test data set (i.e., to get a good impression of the true performance of the classifier). (In
addition, you might have to average over different training sets to get a good impression
of the expected variability of the error rate for different degrees of pruning.)
What can be observed?
(b) Can you explain the previous behavior? In what way do you think the classification boundaries differ when comparing a high degree of pruning with a low degree?
You might want to check this using scatterd etc. (Note that you might have to set
gridsize to, say, 1000 or so, and replot the plot.)
Exercise 7.2 (a) Redo the experiments from the previous exercise but now with gendatd.
What changes in results do you note? Can you explain these?

(b) What if the two class means move closer to each other in the first feature dimension?
What kind of change in behavior with varying degrees of pruning do you expect (apart
from an increase of the error rate)?
optional

Exercise 7.3 (a) This is an open problem, I think. . . Redo the experiment in Exercise 7.1
but now with a training set size of 10 per class (so 20 in total). Explain the error rate
behavior with varying pruning degrees.
Exercise 7.4 (a) (Another open problem?) Any idea what the splitting criterion maxcrit
does?
end optional

7.2 Combining

Exercise 7.5 We are going to combine classifiers trained on different representations of handwritten
digits. We compare these to the individual classifiers and the classifier trained
on the data set having all features.
First we construct the various data sets:
a1 = mfeat_kar, [b1,c1,I,J] = gendat(a1,0.3);
a2 = mfeat_zer, b2 = a2(I,:); c2 = a2(J,:);
a3 = mfeat_mor, b3 = a3(I,:); c3 = a3(J,:);
The kar, zer, and mor indicate the type of features we are considering.
(a) Train three nearest mean classifiers on the different data sets:
w1 = nmc(b1)*classc; w2 = nmc(b2)*classc; w3 = nmc(b3)*classc;
(The classc is a trick to make sure that classifiers output posteriors.) Estimate the
three corresponding error rates on the corresponding c-data sets, e.g., for the first data
set: compute c1*w1*testc.
(b) Train and test the same classifier on the data sets b = [b1 b2 b3] and c = [c1
c2 c3], which contain all features from the three data sets above. Compare its error
rate to those from (a).
Given the trained individual classifiers w1, w2, and w3, a fixed combiner can be constructed as follows: v = [w1; w2; w3]*meanc. This is the mean combiner. Other
combiners that PRTools allows are minc, maxc, prodc, medianc, and votec of which it
should be clear what kind of combining rule these implement.
(c) Test some combiners, by means of [c1 c2 c3]*v*testc, and compare the results
to the four earlier error rates you estimated.
(d) Looking back at the exercise, do you understand why the special form of the
gendat command, in which the I and J occur, had to be used? What would we have
been doing if we would just have generated the b and c data sets by applying gendat
to the three a data sets individually? Obviously, you may want to check help gendat
for an initial clue and to see what I and J contain.

Exercise 7.6 We consider the classical use of bagging for which we reconsider exercise 7.1
(a).
(a) Basically, redo the experiments from 7.1 (a), but now just compare the unpruned
tree with its bagged version. For the latter you need something like
baggingc(trn,treec([],'infcrit',0)); check help baggingc for more on bagging.
(b) You may want to check that even when using gendatd instead of gendats some
improvement can be obtained.
(c) What does the “stabilization” of the decision tree by means of bagging effectively
do? You could check a plot of the classification boundary of the single decision tree
with a bagged version to get an idea.
Exercise 7.7 (a) Generate a data set by means of gendath and train two differently behaving
linear classifiers on it. Take, for instance, w1 equal to nmc and w2 to ldc. Generate
posterior probabilities from these by computing p1 = a*w1*classc and p2 =
a*w2*classc and combine them in one data set by p = [p1 p2].
How many features does this new data set have and why is this so?
(b) Scatter plot the first and third dimension of p. What do these dimensions correspond to? Try to understand that the original classifiers correspond to horizontal and
vertical lines at 0.5 in this same scatter plot.
(c) Straight lines not parallel to any of the two axes actually combine the posteriors
from the two classifiers. Are there combiners possible that indeed seem to improve upon
the two individual classifiers?
Exercise 7.8 As simply as fixed combiners can be constructed, one can construct trained
combiners. An example is v = [nmc*classc ldc*classc]*fisherc, which takes the
two classifier outputs of nmc and ldc and combines them through training a fisherc
on top of it. All is simultaneously trained by executing w3 = a*v on a data set a.
(a) Take the data set from 7.7—gendath, construct a training and a test set, and
compare nmc, ldc, and the trained combiner.

7.3 An AdaBoosting Exercise

Exercise 7.9 The AdaBoost combining scheme, adaboostc in PRTools, enables one to experiment with different AdaBoost schemes based on different base classifiers. The classical one is the so-called decision stump (stumpc, check the help).
(a) Generate a standard banana data set a = gendatb and visualize the decision boundaries that are obtained by means of boosting when you combine 1, 10,
100, and 1000 base decision stumps, respectively. N.B. You might want to use
the following form to suppress annoying messages, while adaboostc is training:
adaboostc(a,stumpc,number_of_base_classifiers,[],0). The final 0 silences the
verbose output.

(b) When combining 1000 base classifiers, AdaBoost seems to suffer from what is called
overtraining. How can you maybe see this in the scatter plot? How could you probably
see this from the error rates on an independent test set?
(c) Repeat the experiment but replace the base classifier with nmc and with fisherc.
Are there noteworthy differences between the various solutions obtained? Also compare
results between different base classifiers.
(d) What can we expect AdaBoost to do when we employ it in combination with, say,
a 1NN or an unpruned decision tree? Could one expect any improvements from such
a scheme?
Take-home messages
• Pruning and bagging can improve the performance of decision trees
• Both fixed and trained combiners can outperform individual classifiers in particular settings
• AdaBoost can improve simple classifiers but can also overtrain
• ...
• Don’t forget to work on the final assignment: now and in the weeks
to come.


Week 8

Classifier evaluation
Objectives When you have done the exercises for this week, you
• can explain what sources of performance variability there are,
• understand the difference between train and test error,
• know what a learning curve is,
• explain what crossvalidation is.

8.1 Sources of Variation

Exercise 8.1 In this exercise we investigate the difference in behavior of the error on the
training and the test set. Generate a large test set and study the variations in the
classification error based on repeatedly generated training sets:
>> t = gendath([500 500]);
>> a = gendath([20 20]); t*ldc(a)*testc
Repeat the last line e.g. 30 times.
(a) What causes the variation in the error?
Now do the same for different test sets:
>> a = gendath([20 20]);
>> w = ldc(a);
>> t = gendath([500 500]); t*w*testc
Repeat the last line e.g. 30 times.
(b) Again explain what causes the variance observed in the results.

8.2 Learning Curves

The function cleval allows you to calculate so-called learning curves. These are curves that
plot classification errors against the number of points in the training data set. (Check help
cleval for the details.)
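As a minimal sketch of such a call (the argument order follows the usual PRTools convention; double-check with help cleval whether your version accepts a cell array of classifiers like this):

```matlab
% Sketch: learning curves of three classifiers on one data set
a = gendath([500 500]);                              % Highleyman data
e = cleval(a, {nmc, ldc, qdc}, [4 8 16 32 64], 10);  % training sizes, 10 repetitions
plote(e)                                             % plot all curves in one figure
```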

In this section we are going to study some learning curves for different data sets and different
classifiers. Some of the plots and numbers that will be generated demonstrate certain salient
and/or noteworthy behavior.
After reading a question, and before just implementing the things to be implemented, try to
think about the outcome to be expected. Eventually, say at the end of this course, you should
not be surprised by many of these things anymore!

8.2.1 Staring at Learning Curves

Exercise 8.2 Generate Highleyman classes (gendath) with 1000 samples per class. Enlarge
the feature dimensionality of this set by adding 60 dimensions of class-independent
randomness, i.e., plain noise. Use cleval to generate learning curves for nmc, ldc, and
qdc using 64, 128, 256, and 512 objects in the training set (make sure that you repeat
often enough. . . ). Use plote to plot these curves in a single figure (check the help).
(a) Can you explain the overall behavior of these curves?
(b) Explain why the curves intersect. Which classifier performs best?
(c) What do you expect the limiting behavior of the learning curves to be? That is, if we were
able to train on more and more data?
optional

Exercise 8.3 Redo the previous experiments but substitute gendath with gendats and
gendatd. Don’t forget to add the 60 noise features.
(a) What kind of typical differences do you see when comparing behaviors for different
data sets? Explain these differences.
end optional

Exercise 8.4 (a) Take your favorite data set and study a learning curve for the first nearest
neighbor classifier (1NN). What can be expected of the learning curve of the apparent
error?

8.2.2 Dipping

Exercise 8.5 Study the learning curve for nmc in combination with gendatc for very, very,
very small training set sizes.
(a) What seems to be happening? And is this expected?
optional


8.2.3 I Challenge You

Exercise 8.6 This is more a challenge than an exercise. You might consider skipping it and
moving on to the next exercise. Then again, you might also take on this challenge!
(Anyway, do keep track of time somewhat.)
(a) Create a two-class problem in which ldc, on average, outperforms qdc for large
sample sizes.
end optional

8.2.4 Feature Curves

Like learning curves, which typically plot the classification error against the number of training
samples, feature curves can also be informative. These study how the
classification error varies with the number of features used. We can use the
clevalf command to perform such a study.
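A hedged sketch of such a call (the argument order assumed here may differ per PRTools version; check help clevalf):

```matlab
% Sketch: feature curve for qdc, using 50% of the data for training
a = mfeat_kar;                                % 64-feature digit data set
e = clevalf(a, qdc, [4 8 16 32 64], 0.5, 5);  % feature counts, training fraction, 5 reps
plote(e)                                      % error versus number of features
```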
Exercise 8.7 Load the data set mfeat_kar and make a feature curve for 4, 8, 16, 32, and 64
features using 50% of the data for training based on qdc.
(a) When redoing this experiment, would you expect the same curve? Why? Or why
not?
(b) What, generally, happens to the feature curve when 40% or 80% of the data is
used for training? Can you explain this?

8.3 Cross-Validation

The prcrossval function allows you to perform a cross-validation on a given data set using
a particular classifier (or even a whole bunch of them in one go).
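A minimal sketch of how this might be invoked (the argument order and outputs assumed here follow the common PRTools convention; see help prcrossval for the exact in- and outputs):

```matlab
% Sketch: 10-fold cross-validation, repeated 5 times, for two classifiers
a = gendatb([10 10]);
e = prcrossval(a, {knnc([],1), ldc}, 10, 5)   % estimated error per classifier
```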
Exercise 8.8 Generate a small data set using gendatb, say, with 10 objects per class.
(a) Using n-fold cross-validation, make plots of the error rates for kNN and 1NN for
different values of n. Also calculate the standard deviation of the error estimate, e.g.,
by performing the cross-validation 10 times (check the help).
(b) What do you notice about the estimated error rates? What is the general
trend (maybe you should redo the data generation and the cross-validation a couple
of times).
(c) What happens to the variance of the estimates for varying n? Again, we are interested in the general trend.
(d) Would the observations change if one repeated the experiments with a much
larger dataset? If so, how?

Exercise 8.9 (a) Let us look back at exercise 1 where a Parzen density is estimated on
some data, and use the same data in this exercise. Now let PRTools find the optimal
h using leave-one-out cross-validation. This can simply be performed by not specifying
h, i.e. calling w = parzenm(+a);. To find out what value of h the function actually
uses, you can call h = parzenml(a);. Does the h found correspond to the one you
found to be optimal above?

8.4 The Confusion Matrix

A confusion matrix is of size C × C, where C is the number of classes. An element C(i, j)
encodes the confusion between classes i and j. More precisely, C(i, j) gives the number
of objects from class i that are labeled as class j. Nothing confusing about it.
Exercise 8.10 To create a confusion matrix you may use something like the following code.
>> lab = getlab(test_set);
>> w = fisherc(train_set);
>> lab2 = test_set*w*labeld;
>> confmat(lab,lab2)
An alternative is:
>> confmat(test_set*w)
(a) What is stored in lab and what does lab2 contain?
Exercise 8.11 Load the digits data sets mfeat_zer and mfeat_kar. Split the data up in a
training and a test set (using gendat).
(a) What are the error rates on the test sets?
(b) Use confmat to study the error distributions more closely. Where do the larger
parts of the errors stem from?
(c) Can you explain which of the two feature sets is insensitive to image rotations?
Exercise 8.12 Given a confusion matrix
    ( a  b )
    ( c  d )
for a two-class problem.

(a) Give an estimate of the expected cost (per object) when misclassifying class 1
as class 2 is twice as expensive as the other way around and assuming that a correct
classification incurs no cost.

8.5 Reject Curves

The error rate can be improved simply by not classifying certain objects, typically the ones
that are near the decision boundary and in that sense difficult to classify reliably. Studying
this so-called reject option is facilitated by the functions reject and rejectc.
Given a classification result, e.g. d = tst*ldc(trn), the classification error is found by e =
testc(d). The number of columns in d equals the number of classes and an object is assigned

to the class for which the value in its corresponding row is the largest. Basically, the row
contains the posterior probabilities (or something equivalent) and therefore will typically sum
up to one.
For the rejection of objects, a threshold is used to determine when this largest value is not sufficiently
large. The routine e = reject(d) determines the classification error and the reject
rate for a set of such threshold values. The errors and reject frequencies are stored in e.
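As a minimal sketch of how such a reject curve might be computed and plotted (check help reject and help plote):

```matlab
% Sketch: reject curve of qdc on an independent test set
trn = gendatc([1000 1000]);
tst = gendatc([1000 1000]);
d = tst*qdc(trn);   % posterior-like classifier outputs on the test set
e = reject(d);      % error and reject rate over a range of thresholds
plote(e)            % error as a function of the reject rate
```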

Exercise 8.13 Make two data sets both with 1000 samples using gendatc. One for training,
one for testing.
(a) Train three classifiers fisherc, qdc, and knnc([],1) on the training set and plot
the reject curve of the test set.
(b) Explain what you see. Explain the differences between the curves.
(c) What can we maybe conclude from the shape of the curve belonging to fisherc?
Can you explain why this happens?
(d) Determine the reject rate for which qdc makes an error of (about) 5% and compute
with rejectc the rejecting classifier for this error rate (check the help :-)
(e) Plot the classifier and the rejecting classifier in a scatter-plot. Explain what you
see.

8.6 Receiver Operating Characteristic

The receiver operating characteristic (ROC), which is too often wrongly referred to as the
receiver-operator curve or the like, is a tool to study the effect of changing decision boundaries
in the two-class case due to, for instance, changing misclassification costs or class priors.
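A hedged sketch of how an ROC might be generated with roc (verify with help roc whether your version accepts the classified data set like this):

```matlab
% Sketch: ROC of a trained classifier on an independent test set
a = gendath([50 50]);  t = gendath([500 500]);
e = roc(t*ldc(a));   % ROC computed from the classifier outputs
plote(e)             % error on one class versus error on the other
```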

Exercise 8.14 Generate two moderately sized Highleyman data sets. Train an nmc, an ldc,
and a 3-nn on this data. Using roc, plot the three ROCs.
(a) Which of the three classifiers performs the best?
(b) If one prefers a very small error on class 1, which classifier would be the chosen
one?

Take-home messages
• Cross-validation enables one to compare classifiers even when there
is only a single data set available that contains few objects.
• In order to further one's understanding of the behavior of particular
classifiers, learning curves, both for training and test set, should be
studied.
• Learning curves also allow one to get more understanding of one's
data, e.g. to decide if obtaining more data, possibly in combination
with another classifier, might be successful in lowering the error.
• A confusion matrix can pinpoint more precisely than a simple error
rate where classification errors stem from.
• Errors can be reduced via a reject option. Obviously, rejected objects should be taken care of separately, possibly by a human observer.
• ROCs give yet another view of a problem at hand. They allow one
to study classification behavior under varying priors and costs.


Week 9

Clustering
This week, we will discuss the problem of clustering; this practical session is intended to
familiarise you with the different techniques that were discussed, more specifically hierarchical
clustering, the K-means algorithm and the mixtures-of-Gaussians.


Objectives for this week:
• to learn to use hierarchical clustering on several datasets;
• to learn about the limitations of hierarchical clustering;
• to explore the K-means algorithm by applying it to a variety of datasets;
• to get familiar with mixture-of-Gaussian clustering.

9.1 Hierarchical clustering

Earlier in this course, you learned how to select features, i.e. you learned which variables
“belong together”. This week we will shift the emphasis to determining which objects belong
together in groups or clusters. This clustering process enables us to extract structure from
the data, and to exploit this structure for various purposes such as building a classifier or
creating a taxonomy of objects.
The most difficult part of clustering is to determine whether there is truly any structure
present in the data and if so, what this structure is. To this end, next week we will employ
cluster validity measures to estimate the quality of the clustering we have obtained.
In the lectures we discussed hierarchical clustering at length. There are basically two choices
that need to be made in hierarchical clustering in order to construct a dendrogram:
1. the dissimilarity measure;
2. the type of linkage.
In this course, we will only employ the Euclidean distance between two samples as a measure
of dissimilarity. As you will recall from the lecture, there are three types of linkage: complete,
single and average linkage. Once the dendrogram has been constructed, we need to cut the
dendrogram at an appropriate level to obtain a clustering. Simple interactive routines are
available to create, visualise and cut the dendrograms.
Exercise 9.1 Start with the hall dataset, an artificial dataset with clear structure.
(a) Load the dataset (load hall). Use scatterd to visualise it. How many clusters
are visible in the plot?
(b) What is the most suitable clustering?
An interactive clustering function is provided; it is called interactclust. Look at the help
to familiarise yourself with the inputs. This function performs hierarchical clustering, draws
a dendrogram in Matlab Figure 1 and then waits for user input. The input is provided
by positioning the cross-hairs in Figure 1 at the desired vertical level, corresponding to the
desired number of clusters. The number of vertical stems of the dendrogram intersected by
the horizontal line of the cross-hairs, corresponds to the number of clusters.
After positioning the cross-hairs and clicking on the left-mouse button, the chosen clustering
is displayed in two ways. First, coloured rectangles are drawn on the dendrogram to indicate
samples that are clustered together. In Figure 2, a scatterplot, colour-coded in the same
fashion, is made of the samples, thus revealing the clustering (note that when there are more
than two features in the dataset, the clustering is performed taking all features into account,
but that only the first two features are displayed in Figure 2). The program remains in a
loop where new clusterings can be obtained by left-mouse clicking on the dendrogram. To
terminate the program, right-click anywhere on the dendrogram.
Exercise 9.2 Load dataset rnd. Make a scatterplot. This is a uniformly distributed dataset,
with no apparent cluster structure. We will hierarchically cluster this dataset to get an
idea of what a dendrogram looks like when there is no structure in the data.
(a) Perform hierarchical clustering with interactclust on the rnd dataset with complete linkage: interactclust(+a,'c');
(b) Cut the dendrogram at arbitrary levels to see how the samples are clustered. Look
at the structure of the dendrogram. What is apparent?
(c) Repeat this for single and average linkage. Do you observe the same behavior as
with complete linkage?
Exercise 9.3 (a) Perform hierarchical clustering with interactclust on the hall dataset
with complete linkage: what do the lengths of the vertical stems in the dendrogram tell
us about the clustering?
(b) Cut the dendrogram at different levels, i.e. inspect different numbers of clusters
by left-mouse clicking at the desired level. Can you think of ways in which a good
clustering can be defined?
(c) Can you devise a simple rule-of-thumb (in terms of vertical stem lengths) for finding
a good clustering in the dendrogram?

Exercise 9.4 (a) Now perform single linkage hierarchical clustering
(interactclust(+a,'s');) on the hall dataset. Do you notice any significant
differences with the complete linkage dendrogram?
(b) Do you notice any differences with the average linkage dendrogram
(interactclust(+a,'a'))?
Exercise 9.5 (a) Load and plot the cigars dataset. What is an appropriate number of
clusters? Why?
(b) Perform single linkage hierarchical clustering on the cigars dataset. Can you
obtain the desired clustering with single linkage hierarchical clustering?
(c) Now perform complete linkage hierarchical clustering on the cigars data. Does
the optimal number of clusters obtained with single linkage hierarchical clustering also
produce acceptable results in complete linkage hierarchical clustering? If not, why
not?
Exercise 9.6 (a) Perform complete linkage hierarchical clustering on the messy dataset.
What is an appropriate number of clusters? Why?
(b) Now perform single linkage clustering on this dataset. Select the same number of
clusters that you determined to be optimal with complete linkage. Is it also optimal in
this case?
(c) What is an appropriate number of clusters for the single linkage case? Why?
optional

Exercise 9.7 Perform hierarchical clustering on a dataset of your choice. Bear in mind
that only the first two dimensions of higher dimensional datasets are displayed in the
scatterplot. To visualise other dimensions, reorganise the columns in your dataset so
that the appropriate variables are in columns one and two.
end optional

9.2 K-means clustering

The operation of this algorithm was explained in the lecture. In this practical session, we
will familiarise ourselves with the K-means algorithm by applying it in various settings. A
function kmclust is provided to perform K-means clustering. Take a moment to familiarise
yourself with the input, output and operation of this function.
Exercise 9.8 The initialisation of the K-means algorithm (positioning of the prototypes)
can be done in several ways. Two simple ways are:
1. placing the prototypes uniformly distributed in the domain (approximated by the
bounding box) of the data; or
2. selecting a number of the samples at random as prototypes.
For reasons of simplicity we recommend the second option.
(a) Can you think of possible advantages and disadvantages to these two initialisation
approaches?
(b) Apply kmclust with plotflag and stepflag set on the cigars and messy datasets
for different numbers of prototypes. Is K-means better suited to one of these datasets,
and if so, why?

9.2.1 The local minimum problem

You might recall that the K-means algorithm is a simple iterative way of attempting to find
those prototype positions that correspond to the global minimum of the total within-scatter.
However, there are two factors that could cause this minimisation procedure to get stuck
without having reached the global minimum. First, K-means is a procedure that always
updates the positions of the prototypes such that the within-scatter decreases. Secondly,
the within-scatter is not a monotonically decreasing function of prototype positions. This
implies that from a specific set of non-optimal prototype positions the road to the optimum
(global minimum) contains an initial increase in within-scatter before reaching the optimum.
If K-means ends up in such a position (local minimum), it gets stuck. The following exercise
illustrates this.
Exercise 9.9 (a) Load dataset triclust. Inspect it with the aid of a scatterplot. There
are clearly three equally spaced, spherical clusters.
(b) Run your K-means algorithm several times, for g = 3 clusters, until a solution is
found where there are two prototypes in a single cluster, and the third is positioned
between the remaining two clusters. Why is this a local minimum?
(c) Does hierarchical clustering also suffer from this problem?
(d) What can we do to overcome the local minimum problem?
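For reference, the total within-scatter that K-means tries to minimise can be written down in a few lines of plain Matlab (a sketch, with no toolbox functions; X is the n-by-d data matrix, M the g-by-d matrix of prototypes; the function name is ours):

```matlab
function J = within_scatter(X, M)
% Sum of squared distances of every sample to its nearest prototype
D = zeros(size(X,1), size(M,1));
for k = 1:size(M,1)
    D(:,k) = sum((X - M(k,:)).^2, 2);   % squared distances to prototype k
end
J = sum(min(D, [], 2));                  % each sample counts its nearest prototype
```

K-means alternates between assigning samples to their nearest prototype and moving each prototype to the mean of its assigned samples; both steps can only decrease J, which is exactly why the procedure can get trapped in a local minimum.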

9.3 Clustering with a mixture-of-Gaussians

During the lectures, the concept of clustering based on the quality of a mixture-of-Gaussian
density fit to the data was discussed. The operation of the Expectation-Maximisation (EM)
algorithm, which is employed to estimate the parameters of the mixture model, was also
discussed in detail. In the following set of exercises, mixture model clustering will be explored
with the aid of the function em.
Exercise 9.10 (a) Load the triclust dataset and play around with the function
em (em(data,nmodels,type,0,1,0);). Vary the number of Gaussians employed
(nmodels) in the mixture model, and also vary the type of Gaussian employed. Relate the type ('circular', 'aligned', 'gauss') of the Gaussian to its covariance matrix.
(b) On the cigars dataset, fit an unconstrained Gaussian (type = ’gauss’) mixture
model using the function em. For the number of clusters g, assume the following values:
g = 1, 2, 3, 4, 5. Which g do you expect to be the best?

(c) Repeat this 20 times (without plotting, i.e. setting the plot_mix and plot_resp
arguments to zero) and plot the average log-likelihood (first output argument of em)
and its standard deviation as a function of g (hint: use the errorbar function). What
characteristic of the log-likelihood function indicates the preferable number of models
(clusters)?
(d) Do the same for the rnd dataset (which has no structure) to confirm the relationship
between the log-likelihood curve and the number of clusters. What is the behavior of
the curve for the random dataset?
(e) Now try clustering the messy dataset. What is the best shape to employ for the
Gaussians? What is the optimal number of clusters?

Clustering may be applied for image segmentation, e.g. on the basis of colors.

Exercise 9.11 A full-color image may be segmented by clustering the color feature space.
For example, read the roadsign10 image:
>> im = imread('roadsign10.bmp');
>> show(im)
The three colors may be used to segment the images based on their pixel values only.
Convert the image to a dataset using the im2feat command. Use a small subset of
pixels for finding four clusters in the 3D color space:
>> a = im2feat(im);                    % create a dataset
>> trainingset = gendat(a,500)         % create a small training subset
>> [d,w] = emclust(trainingset,nmc,4)  % cluster the data
The retrieved classifier w may be used to classify all image pixels in the color space:
>> lab = classim(a,w);
>> figure
>> imagesc(lab)                        % view image labels
Finally, we will replace the color map used to render the label image by the mean vector
of each cluster:
>> aa = prdataset(a,lab(:))            % create labeled dataset
>> map = +meancov(aa)                  % compute class means
>> colormap(map)                       % set color map accordingly
(a) Find out the correspondences between the image parts (sky, road, etc.) and the
clusters by visualizing the outputs of the classifier w on the entire image.
(b) Is it possible to improve the result by using more clusters?

Take-home messages
• Clustering is not the same as classification: it does not use the
class labels. Clustering is an unsupervised learning technique, while
classification is a supervised learning technique
• Although the examples here were mostly in 2D, clustering can be
performed for data sets with any number of features
• A mixture-of-Gaussians is a model between a single Gaussian density
and a Parzen density estimate


Week 10

Cluster validation
This week, we will explore ways to guide our choices when judging clustering results and
determining the number of clusters.
Objectives for this week:
• to learn to use cluster validity measures;
• to learn how the within-scatter, jumps in the fusion graph and the Davies-Bouldin
cluster validity measure can be employed to extract the “optimal” number of clusters from a dataset;

10.1 Cluster validation

In the lectures this week, various clustering approaches were presented. Last week you applied
hierarchical clustering with different linkage types and the Euclidean dissimilarity measure
as well as K-means clustering to several datasets. One aspect of clustering that we avoided
last week was the determination of the “optimal” number of clusters in the dataset. More
specifically, given a particular hierarchical clustering, one needs to determine where to cut
the dendrogram to produce a clustering of the data and one needs to know which value of
g to use as input in the K-means clustering algorithm. Last week you had a small taste of
cluster validation when you tried to determine the optimal number of clusters to employ in
the mixture-of-Gaussians clustering algorithm.
This practical session is intended to familiarise you with the different cluster validity measures
that were discussed in the lecture. We will achieve this by:
1. employing the fusion graph as a simple, intuitive measure of clustering tendency in
hierarchical clustering;
2. coding and applying the within-scatter in the K-means clustering algorithm to estimate
the optimal number of clusters;
3. coding and applying the Davies-Bouldin index to various algorithms and datasets.


10.1.1 Fusion graphs

The fusion graph plots the fusion level as a function of the number of clusters (g). For example,
the fusion level at g = 2 represents the (single, complete, average) link distance between the
clusters that are merged to create two clusters from three clusters. A simple heuristic to
determine the number of clusters in hierarchical clustering is to cut the dendrogram at the
point where we observe a large jump in the fusion graph.
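To make the heuristic concrete, the sketch below (Python/NumPy, as a PRTools-free stand-in for interactclust) builds a naive single-linkage hierarchy on a made-up two-cluster dataset, records the fusion level at each number of clusters, and cuts just below the largest jump:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two tight, well-separated clusters
X = np.vstack([rng.normal(0, 0.2, (10, 2)), rng.normal(5, 0.2, (10, 2))])

clusters = [[i] for i in range(len(X))]
fusion = {}                                   # fusion level, indexed by resulting g
while len(clusters) > 1:
    # find the pair of clusters with the smallest single-link distance
    a, b, d = min(
        ((a, b, min(np.linalg.norm(X[i] - X[j])
                    for i in clusters[a] for j in clusters[b]))
         for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
        key=lambda t: t[2])
    clusters[a] += clusters.pop(b)
    fusion[len(clusters)] = d                 # level at which g clusters were formed

gs = sorted(fusion)                           # g = 1, 2, ..., n-1
jumps = [fusion[g] - fusion[g + 1] for g in gs[:-1]]
best_g = gs[int(np.argmax(jumps))] + 1        # cut just below the largest jump
print(best_g)                                 # 2: the jump sits between g = 2 and g = 1
```

For this toy configuration the only large fusion level is the final merge of the two blobs, so the heuristic recovers two clusters.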
Exercise 10.1 (a) Why is this a reasonable heuristic to employ?
The following three exercises focus on the estimation of the number of clusters based on the
fusion graph.
Exercise 10.2 (a) Load the triclust dataset. Perform single linkage hierarchical
clustering and display the fusion graph by setting the third optional argument:
interactclust(triclust,’s’,1). Where do you observe the largest jump?
(b) Cut the dendrogram at this position, to check whether this is a reasonable number
of clusters. Now perform complete linkage hierarchical clustering. Does the fusion graph
give a clear indication of the number of clusters in the data? If not, why not?
Exercise 10.3 (a) Load the hall dataset. Perform single linkage hierarchical clustering
and display the fusion graph. What do you observe in the fusion graph?
Exercise 10.4 (a) Finally, load the messy dataset. Perform single linkage hierarchical clustering. According to the fusion graph, where should the dendrogram be cut?
(b) Does a satisfactory clustering result from cutting the dendrogram at this point?
Motivate.
(c) Now perform complete linkage clustering. Is the clustering suggested by the fusion
graph better or worse than the clustering obtained with single linkage clustering?

10.1.2 The within-scatter criterion

Exercise 10.5 (a) Now, generate a two-cluster, 2D dataset: a = +gendats([50,50],2,d),
for d = 10. Make a scatterplot of the data.
(b) Recall that the function kmclust outputs the within-scatter associated with the
obtained clustering. For the generated two-cluster dataset, compute the cluster within-scatter
for g = [1, 2, 3, 5, 10, 20, 40, 50, 100], and compute the normalised within-scatter
values as a function of g (normalise by dividing all within-scatter values by the within-scatter
for g = 1). Repeat this 10 times, and employ the errorbar function to plot the
average and standard deviation of the cluster within-scatter as a function of g.
(c) What are the most prominent characteristics of this curve?
(d) Now for d = 5, generate a new dataset. For this dataset, generate the same
normalised within-scatter plot as in the previous exercise, i.e. the within-scatter as a
function of g. What is the biggest difference between the datasets? What is the biggest
difference between the within-scatter curves?
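kmclust does this computation internally; as a hedged, PRTools-free sketch, the following NumPy code runs a bare-bones k-means and traces the normalised within-scatter as a function of g for a well-separated two-cluster set:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans_within_scatter(X, g, n_iter=50):
    """Lloyd's k-means; returns sum over clusters of sum_{x in C_j} ||x - mu_j||^2."""
    centers = X[rng.choice(len(X), g, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None]) ** 2).sum(axis=2)
        lab = d.argmin(axis=1)
        centers = np.array([X[lab == j].mean(axis=0) if np.any(lab == j)
                            else centers[j] for j in range(g)])
    d = ((X[:, None, :] - centers[None]) ** 2).sum(axis=2)
    return d.min(axis=1).sum()

# Two well-separated 2D clusters (in the spirit of gendats with a large d)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])
W = [kmeans_within_scatter(X, g) for g in (1, 2, 3, 5, 10)]
W_norm = [w / W[0] for w in W]    # normalise by the g = 1 within-scatter
print([round(w, 3) for w in W_norm])
```

For well-separated data the curve drops sharply at the true number of clusters and then decays only slowly, which is exactly the "knee" the exercise asks you to look for.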

optional

Exercise 10.6 Generate a dataset consisting of a single Gaussian, i.e. use d = 0. Plot
the same within-scatter curve.
(a) How does this curve differ from the other curves?
end optional

10.1.3 The Davies-Bouldin index

D.L. Davies and D.W. Bouldin1 introduced a cluster separation measure which is based on
both the within-scatter of the clusters in a given clustering and the separation between the
clusters. Formally, this measure is known as the Davies-Bouldin index (DBI). It assumes that
clusters are spherical, and that a desirable clustering consists of compact clusters that are
well-separated.
Suppose we wish to compute the DBI for a clustering consisting of n objects assigned to g
clusters. We can compute a score for every possible pair of clusters in this clustering, which
is inversely proportional to the distance between the cluster means and directly proportional
to the sum of the within-scatters in the pair of clusters. This score is given by
R_jk = (σ_j + σ_k) / ‖µ_j − µ_k‖,    j, k = 1, 2, …, g; k ≠ j.    (10.1)

Here µ_j is the mean of all the objects in cluster j and σ_j is the within-scatter of cluster j,
given by:
σ_j = sqrt( (1/n_j) Σ_{x_i ∈ C_j} ‖x_i − µ_j‖² ),    (10.2)

where C_j is the set of objects associated with cluster j and n_j is the number of objects in
cluster j. The score R_jk is small when the means of clusters j and k are far apart and the
sum of the within-scatters for these clusters is small. Since cluster j can be paired with g − 1
other clusters, resulting in g − 1 pair-scores R_jk, k = 1, 2, …, g; k ≠ j, a conservative estimate
of the cluster score for cluster j, when paired with all other clusters, is obtained by taking
the maximal pair-score involving cluster j:
R_j = max_{k = 1, 2, …, g; k ≠ j} R_jk.    (10.3)

The Davies-Bouldin index of the complete clustering is then determined by averaging these
maximal pair-scores over all clusters:

I_DB = (1/g) Σ_{j=1}^{g} R_j.    (10.4)

¹ IEEE Transactions on Pattern Analysis and Machine Intelligence 1, pp. 224–227, 1979.
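Equations (10.1)–(10.4) translate almost literally into code. As a hedged NumPy sketch (in PRTools this computation is provided by the hdb() function used in the next exercise):

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index following Eqs. (10.1)-(10.4)."""
    classes = np.unique(labels)
    mus = np.array([X[labels == c].mean(axis=0) for c in classes])
    sigmas = np.array([np.sqrt(((X[labels == c] - m) ** 2).sum(axis=1).mean())
                       for c, m in zip(classes, mus)])          # Eq. (10.2)
    g = len(classes)
    R = np.zeros((g, g))
    for j in range(g):
        for k in range(g):
            if j != k:                                          # Eq. (10.1)
                R[j, k] = (sigmas[j] + sigmas[k]) / np.linalg.norm(mus[j] - mus[k])
    return R.max(axis=1).mean()                                 # Eqs. (10.3)-(10.4)

# Two compact, well-separated clusters yield a small index
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
labels = np.repeat([0, 1], 50)
print(davies_bouldin(X, labels))    # well below 1 for this configuration
```

Compact, well-separated clusters give a small index, which is why the minimum of the DBI curve is used to pick the number of clusters.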

Exercise 10.7 We will employ the Davies-Bouldin index to evaluate the clustering produced
by hierarchical clustering. We will do so for a range of clusters in order to determine
the optimal number of clusters, i.e. the best level to cut the dendrogram at. To achieve
this we employ the function hdb().
(a) The function hdb() has as inputs a dataset, the linkage type, the distance measure
and the range of clusters (start,stop) for which you wish to evaluate the quality of
the clustering. It computes, for each clustering, the DBI. Familiarize yourself with the
operation of this function.
(b) Load the triclust dataset and make a scatterplot. What do you expect the DBI
curve as a function of the number of clusters to look like?
(c) Now apply hdb to this dataset with Euclidean distance, complete linkage clustering,
starting at 2 clusters and stopping at 10. What is evident from the DBI curve?
(d) Apply hdb to the hall dataset with Euclidean distance, complete linkage clustering,
starting at 2 clusters and stopping at 20. What do you observe in the obtained curve?
Can you explain your observation?
(e) Finally, apply hdb to the cigars dataset with Euclidean distance, complete linkage
clustering, starting at 2 clusters and stopping at 20. Inspect the DBI curve. Do you
detect the expected number of clusters? Now try single linkage clustering. Does this
help? Why (not)?

10.1.4 Log-likelihood and information criteria

Last week, you already looked at how to use log-likelihood to judge the number of clusters in
exercise 9.10.
optional

Exercise 10.8 Repeat exercise 9.10, but generate plots of the Akaike Information Criterion
(AIC) and the Bayesian Information Criterion (BIC) instead.
end optional

Take-home messages
• Some cluster criteria indicate the optimal number of clusters at the
point where the curve flattens, e.g. fusion graphs and within-scatter,
while other criteria reveal the optimal number of clusters at an
extremum
• Most cluster validity measures assume a model of a typical cluster
(e.g. spherical) and will therefore not detect a good clustering if
clusters have a different shape


Week 11

Feature Selection
Objectives When you have done the exercises and read the text for this week, you should
• have an understanding of the computational complexity of feature selection
• have an understanding of the approximative nature of selection schemes
• have an understanding of some basic properties of scatter matrices
• have a basic idea of the concept of VC-dimensions.
Exercise 11.1 Given is a data set with 100 features. Say you want to find the 5 best features.
How many feature sets do you need to check in case you have to rely on an
exhaustive search?
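If you want to check your answer, the count is a single binomial coefficient; a Python one-liner with the standard library:

```python
from math import comb

# Number of distinct 5-feature subsets of 100 features: "100 choose 5"
n_subsets = comb(100, 5)
print(n_subsets)    # 75287520
```

At roughly 7.5 × 10⁷ criterion evaluations, exhaustive search is already painful for this modest setting, which is why the approximate search strategies below exist.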

11.1 Selection Experiments

Exercise 11.2 Check the help of a couple of standard search strategies for feature selection,
e.g. featself, featselb, featsellr, featselp, etc.
(a) Create 100 samples from a 10D “difficult” distribution using a = gendatd(100,10).
Apply a few of these feature selection strategies together with the 'maha-s' criterion
(what does 'maha-s' do?).
(b) Which features are the best two and do the algorithms pick them out? And if not,
do they agree among each other on the optimal choice?
Exercise 11.3 Load the diabetes data set using a = diabetes. This data set is a real-world data set (though an old one). The aim is, or rather, was to predict the presence
of diabetes mellitus in people based on the eight measurements made. Some of the
features included are diastolic blood pressure, triceps skin fold thickness, and age.
(a) Split the data set into 50% for training and the rest for testing. Based on the
training set, what is the optimal 5-dimensional subspace for the fisher classifier? (Hint:
you can loop your way out of this, but it might be handier to use some feature of
nchoosek; read the help carefully.)
(b) What is the optimal 5-dimensional space for a 1NN classifier?

(c) What is the error attained on the test set for these classifiers with their respective
optimal feature spaces?
(d) Check some search strategies and selection criteria (help feateval) and see
whether they are able to pick out the same 5-dimensional subspace. If not, do the
error rates on the test set at least come close when one uses the same classifier?
(e) Do another 50-50 split of the original data set. When redoing some of the feature
selections, do you find back the same feature sets?
Exercise 11.4 This code you get for free:
a = mfeat_zer
[trn,tst] = gendat(a,.04)
clsf = ldc
featnum = [1:5:47]
prwaitbar off
prwarning off
mf = max(featnum)
e0 = clevalf(trn,clsf,featnum,[],1,tst);
rp = randperm(size(a,2));
er = clevalf(trn(:,rp),clsf,featnum,[],1,tst(:,rp));
[w,r] = featseli(trn,'eucl-m',mf)
e1 = clevalf(trn*w,clsf,featnum,[],1,tst*w);
[w,r] = featself(trn,'eucl-m',mf)
e2 = clevalf(trn*w,clsf,featnum,[],1,tst*w);
[w,r] = featselb(trn,'eucl-m',mf)
e3 = clevalf(trn*w,clsf,featnum,[],1,tst*w);
figure
plote({e0,er,e1,e2,e3})
legend({'o','r','i','f','b'})
(a) Run it a couple of times and have a look at the output, both in the figure and in
the command window, and try to understand it. Also have a look at the code of course.
E.g. what does randperm do in the code and what is rp used for? What does featseli
do? And 'eucl-m'?
(b) Can you explain what the backward feature selection does if you want to extract mf
features from dataset a? Using this explanation, can you then explain why the feature
curve for the backward selection looks so poor?
(c) Can you find a selection strategy whose optimal performance is, by and large,
better than that of the best performing one in (a)? Recall that there are not only
so-called filter approaches (see the book, page 285).

11.2 Stuff on Scatter Matrices

Exercise 11.5 Given a set of feature vectors {x_i} for which the sample covariance matrix
equals C and given a linear transformation A. Show that the covariance matrix for the
transformed feature vectors {Ax_i} equals ACA^T.
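Before proving this on paper, you can convince yourself numerically; the identity holds exactly already at the level of sample covariances (NumPy sketch with arbitrary made-up data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))          # 500 feature vectors x_i as rows
A = rng.normal(size=(2, 3))            # some linear transformation

C = np.cov(X, rowvar=False)            # sample covariance of {x_i}
C_t = np.cov(X @ A.T, rowvar=False)    # sample covariance of {A x_i}

# cov(Ax) = A C A^T is exact, not just approximate, for sample estimates
assert np.allclose(C_t, A @ C @ A.T)
```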

Exercise 11.6 Note: this might be a tiresome exercise, but if you have too much time on
your hands… One way or the other, you should know for a fact that Sm = Sb + Sw!
optional

(a) Show that Σ_i = E[(x − µ_i)(x − µ_i)^T] = ½ E[(x − x′)(x − x′)^T], in which x and x′ are
equally distributed, i.e., according to the distribution of class i.
(b) Exercise 5.12 from the book (Ex. 5.12). Show that the mixture scatter matrix is the sum of
the within-class and between-class scatter matrices. The fact proven in (a) can be very
helpful here.
end optional
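The identity Sm = Sb + Sw can likewise be checked numerically before (or alongside) proving it. A NumPy sketch with made-up data, using maximum-likelihood (1/n) scatter estimates, for which the identity is exact:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))                  # made-up data
y = rng.integers(0, 3, size=200)               # three arbitrary class labels
n, dim = X.shape

mu = X.mean(axis=0)
Sw = np.zeros((dim, dim))
Sb = np.zeros((dim, dim))
for c in np.unique(y):
    Xc = X[y == c]
    P = len(Xc) / n                                  # class prior estimate
    mc = Xc.mean(axis=0)
    Sw += P * np.cov(Xc, rowvar=False, bias=True)    # within-class scatter
    Sb += P * np.outer(mc - mu, mc - mu)             # between-class scatter

Sm = np.cov(X, rowvar=False, bias=True)        # mixture scatter
assert np.allclose(Sm, Sb + Sw)                # Sm = Sb + Sw, exactly
```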

Exercise 11.7 (a) Variation on Exercise 5.18 from the book (where the book is slightly
inaccurate; Ex. 5.18). Convince yourself that if the number of classes equals M, the rank of
the between-scatter matrix Sb is at most M − 1. You can either try to prove this, which
might be a bit tough, or you can develop some insight by staring at the formula or by
doing some experiments in, say, Matlab.
(b) In which (two) different cases can the rank of Sb be smaller than M − 1? Note
that this applies similarly to any estimate of a covariance matrix based on M samples.
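A quick numerical experiment, in the spirit the exercise suggests (NumPy; the class means and priors here are arbitrary, made-up values):

```python
import numpy as np

rng = np.random.default_rng(2)
M, dim = 4, 10                       # 4 classes in a 10D feature space
means = rng.normal(size=(M, dim))    # made-up class means
priors = np.full(M, 1 / M)

mu = priors @ means                  # overall mean
Sb = sum(p * np.outer(m - mu, m - mu) for p, m in zip(priors, means))

# The weighted mean-deviations sum to zero, so at most M - 1 of the M
# rank-one terms are linearly independent
print(np.linalg.matrix_rank(Sb))     # M - 1 = 3 here
```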

optional

Exercise 11.8 Make Exercise 5.20 from the book (Ex. 5.20). (Note that Sxw actually equals Sw.)
end optional

11.3 More…

Exercise 11.9 Determine the Vapnik–Chervonenkis dimension for classifiers with a
quadratic decision boundary in 1D and in 2D (Sect. 5.10). (Again, there is no need to find an
actual proof.)

Take-home messages
• Reducing the dimensionality by means of feature selection can improve performance
• Procedures are approximate and will, typically, not find the optimal
features
• Applying feature selection strategies easily becomes rather time consuming
• Especially in the small sample setting, feature selection can be “unstable”
• Sm = Sb + Sw .


Week 12

Feature Extraction
Objectives When you have done the exercises and read the text for this week, you should
• have an understanding of some PCA / KL transform basics
• have an understanding of LDA basics
• have an understanding of some properties of scatter matrices
• know how to kernelize PCA
Exercise 12.1 Make sure that you had enough practice with last week’s exercises.

12.1 Principal Component Analysis / Karhunen-Loève Transform

Exercise 12.2 Load the diabetes data set using a = diabetes. We revisit Exercise 11.3
from the foregoing week.
(a) Consider one of the better performing feature selection strategies and compare
its performance (e.g. using clevalf), for various feature dimensionalities with the
performance of PCA (pcam). Do this based on two quite differently behaving classifiers,
say, a complex and a simple one.
(b) Does a normalization of the individual features lead to an improved error rate
when using PCA?
Exercise 12.3 (a) Load the mfeat_pix data set. Make a scatterplot of this data set. What objects does
this data set describe and what are the features that are used to describe them?
(b) Split up the set into a training set containing 50% of the data. The other 50%
goes into the test set. Calculate a mapping w on the training set that consists of the
first 2 PCs. Display, in two separate figures, the 2-dimensional training set and test set
obtained by means of PCA. What noticeable difference between the scatter plots, from
a classification point of view, can be observed?

(c) Now split up the set into a training set containing 5% of the data. Again the rest
goes into the test set. Determine a PCA on the training set. What is the maximum
number of features that PCA can extract? Why is this the case?
(d) Compare feature curves obtained with the original data sets and some obtained
with the PCA transformed data sets for a couple of classifiers (e.g. using something
like clevalf(trn,classifier,[5:5:90],[],1,tst)). Is there any benefit of the PCA
feature space over the pixel feature space? Do you observe peaking for any of the
classifiers?

12.2 Scatter Matrices, Repeating…

Exercise 12.4 (a) Variation on Exercise 5.18 from the book (where the book is slightly
inaccurate; Ex. 5.18). Convince yourself that if the number of classes equals M, the rank of
the between-scatter matrix Sb is at most M − 1. You can either try to prove this, which
might be a bit tough, or you can develop some insight by staring at the formula or by
doing some experiments in, say, Matlab.
(b) In which (two) different cases can the rank of Sb be smaller than M − 1? Note
that this applies similarly to any estimate of a covariance matrix based on M samples.

12.3 Supervised Feature Extraction: the Fisher Mapping

Exercise 12.5 (a) Consider three arbitrary points in a 2-dimensional feature space. One is
from one class and the two others from the other. That is, we have a two-class problem.
What (obviously?) is the 1-dimensional subspace that is optimal according to the Fisher
mapping/LDA? What value would the Fisher criterion take on in this case? Explain
your answers.
(b) Consider an arbitrary two-class problem in three dimensions with three points in
one class and one in the other. How can one determine a 1-dimensional subspace in
this 3D space that optimizes the Fisher criterion? (As in (a), a geometric view may
seem easiest here.)
(c) Again we take a two-class problem with four objects in 3D but now both classes
contain the same number of points. Again determine the 1D subspace that Fisher gives.
(d) When dealing with two points from two different classes in three dimensions, how
many essentially different Fisher optimal 1D subspaces are there?
Exercise 12.6 Take a = mfeat_pix but now split up the set into a training set containing
10% of the data. Again the rest goes into the test set. Determine a PCA on the training
set.
(a) What is the maximum number of features that Fisher can extract? Why is this
the case?
(b) Calculate a mapping w on the training set that consists of the first 3 components
provided by fisherm. Display the three (3) different 2-dimensional scatters of the

training set after the Fisher mapping (use the 'gridded' option in scatterd). What
happened here? Explain what you see (hint: reconsider the solutions you found in the
previous simple, low-dimensional examples).
(c) Use the mapping from (b) and apply it to the test set. Display, in a different figure,
the three (3) different 2-dimensional scatters. What noticeable difference between the
scatter plots can be observed?
(d) Would you say the Fisher mapping is overtrained?
(e) Does the mapping with PCA look better or worse than the mapping using
Fisher/LDA? Explain your answer.
(f) Can you come up with a combination of PCA and Fisher that performs better than
using PCA or Fisher separately? Does it depend on the choice of classifier?

12.4 Kernel PCA

Exercise 12.7 Assume that the overall mean of a data set is zero and that all feature vectors
are stored in the rows of the matrix X.
(a) Show that the eigenvector of X^T X with the largest eigenvalue is equal to the
largest principal component.
(b) Given that v is a (column) eigenvector with eigenvalue λ of XX^T, show that X^T v
is an eigenvector of X^T X. What is the eigenvalue associated to the eigenvector X^T v?
(c) Using X^T v, a (row) feature vector x from the original space can be mapped onto
the first PC by xX^T v.
(d) Describe now how one can kernelize PCA: describe the procedure to do the actual
calculation. All ingredients are there. (Hint: realize what information XX^T and xX^T
contain.)
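Steps (a)–(c) can be verified numerically. The sketch below only ever touches the Gram matrix XX^T and products xX^T, i.e. inner products between objects, which is exactly what makes the kernelization in (d) possible:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
X -= X.mean(axis=0)                  # zero overall mean; rows are feature vectors

# (b): an eigenvector v of X X^T gives an eigenvector X^T v of X^T X,
#      with the same eigenvalue
G = X @ X.T                          # Gram ("kernel") matrix: inner products only
vals, vecs = np.linalg.eigh(G)
v, lam = vecs[:, -1], vals[-1]       # leading eigenpair
u = X.T @ v
assert np.allclose((X.T @ X) @ u, lam * u)

# (c): projecting a (row) vector x onto the first PC needs only x X^T,
#      i.e. inner products of x with the training objects
x = rng.normal(size=5)
pc1 = np.linalg.eigh(X.T @ X)[1][:, -1]        # unit-length first PC
# x X^T v equals sqrt(lambda) times the projection onto pc1 (up to sign)
assert np.allclose(abs(x @ X.T @ v), np.sqrt(lam) * abs(x @ pc1))
```

Since ‖X^T v‖² = v^T XX^T v = λ, the kernel-side projection differs from the exact PC projection only by the known scale factor √λ.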

Take-home messages
• Reducing the dimensionality by means of feature extraction can improve performance
• Procedures are approximate and will, typically, not find the optimal
features
• Especially in the small sample setting, supervised methods can overtrain
• PCA can be kernelized.



Week 13

Dissimilarity Representation
Objectives When you have done the exercises and read the text for this week, you should
• understand how dissimilarities relate to kernels and features,
• understand how dissimilarities can represent an object,
• get DisTools to work,
• be able to employ DisTools to your pattern recognition problems.
Any feature-based representation a (e.g. a = gendath(100)) can be converted into a
(dis)similarity representation d using the proxm mapping:
>> w = proxm(b,par1,par2); % define some dissimilarity measure
>> d = a*w;                % apply to the data
in which the representation set b is a small set of objects. In d all (dis)similarities between
the objects in a and b are stored (depending on the parameters par1 and par2, see help
proxm). b can be a subset of a. The dataset d can be used in the same way as a feature-based
data set. A dissimilarity-based classifier using a representation set of 5 objects per class can be
constructed and visualised by:
>> b = gendat(a,5);      % the representation set
>> w = proxm(b);         % define a Euclidean distance mapping to the objects in b
>> c = a*w;              % construct a dissimilarity representation for a
>> scatterd(c*pca(c,2)); % compute and plot a PCA
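Stripped of PRTools, the construction is simply: pick a representation set and use the distances to it as new features. A hedged NumPy sketch (Euclidean distances; gendatb itself is not assumed available, so two Gaussian clouds stand in for the banana set):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two-class stand-in for gendatb: 50 objects per class
a = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])

# Representation set b: 5 randomly chosen objects per class
b = np.vstack([a[:50][rng.choice(50, 5, replace=False)],
               a[50:][rng.choice(50, 5, replace=False)]])

# Euclidean dissimilarity representation: one "feature" per representation object
d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
print(d.shape)    # (100, 10): 100 objects described by 10 dissimilarities
```

The resulting matrix d can be fed to any ordinary feature-space classifier, which is exactly how the dissimilarity space is used below.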
Exercise 13.1 Generate a set of 50 objects per class by gendatb.
(a) Compute a dissimilarity matrix using the city-block distance. Visualize the
result by multi-dimensional scaling.
(b) Use a randomly selected representation set of 5 objects per class. Compute a
dissimilarity representation using proxm. Visualize the result by PCA.
(c) Reduce the size of the representation set to 2 objects by feature selection and
inspect the result by scatterd.

In most applications the dissimilarity representation is computed on raw data like images and
not on an already given feature representation. We will use distances resulting from a study
by Zongker and Jain on deformable templates between 10 classes of handwritten digits ('0'-'9').
Each class is given by 200 examples, resulting in a 2000 × 2000 dissimilarity matrix. We
will, however, just use two of the classes. For handling dissimilarity matrices the following commands
from the DisTools toolbox are useful:
• selcdat: selection of some classes from a dataset representing a dissimilarity matrix
• genddat: split of a dissimilarity matrix into a training set and a test set. Note that such
a split affects the numbers of rows as well as columns.
• psem: pseudo-Euclidean embedding of a dissimilarity matrix. Note that this command
returns a signature describing the numbers of positive and negative eigenvalues found
during (pseudo)-Euclidean embedding. (Currently, its use is discouraged.)
• psdistm: computation of squared distances in a pseudo-Euclidean space with given
signature. (Currently, its use is discouraged.)
• testkd: 1-NN classification error based on a given set of distances between a test set
and a training set.
Exercise 13.2 Load the Zongker dataset by readzongker and inspect its size. Select
just two classes (1 and 2) using selcdat. Use for the following experiments training
sets and test sets of 100 objects per class.
(a) Use the entire dissimilarity matrix of the two classes (size 400 x 400) for embedding
by psem into a 10-dimensional space. Project all dissimilarity data into this space.
Generate the training and test set and compute the errors of the 1-NN and qdc classifiers.
optional

(b) Select a random set of 10 objects for representation (5 from each class) and determine the corresponding dissimilarity matrix and use it for embedding. Project all
dissimilarity data into this pseudo-Euclidean space. Generate the training and test set
and compute the errors of the 1-NN and qdc classifiers.
end optional

(c) Use the same representation set for constructing a dissimilarity space (just by
reducing the numbers of columns of the total dissimilarity matrix). Generate a training
set and a test set and compute in the dissimilarity space the errors of the 1-NN and qdc
classifiers.
(d) Compare the above results with 1-NN errors for the same test set using the representation set as training set, as well as using the entire training set.
(e) What representation yields the best classifiers? Which representation is preferred
if for test objects just the computation of 10 dissimilarities can be afforded?

Take-home messages
• There are other object representations than the standard feature
vector representation!



Appendix A

Final assignment
A.1 Case

For the purpose of this final assignment, a case is introduced. Assume your group plays
the role of a pattern recognition consultancy company. Your company is faced with an
assignment by a client company working on automatic bank cheque processing applications.
This client wants you to research the possibility of using pattern recognition techniques to
classify individual digits in bank account numbers and the monetary amount.
They are interested in two scenarios:
1. the pattern recognition system is trained once, and then applied in the field;
2. the pattern recognition system is trained for each batch of cheques to be processed.
In pattern recognition terms, Scenario 1 means the amount of training data available is large
(in this case, at least 200 and at most 1000 objects per class), whereas for Scenario 2 it is
much smaller (at most 10 objects per class).

A.2 Input: the NIST dataset

To help you to construct your system, the client has supplied you with a standard dataset
of handwritten digits put together by the US National Institute of Standards & Technology,
NIST (see http://www.nist.gov/). This set consists of 2800 images of handwritten digits
for each of the 10 classes, “0”, “1”, . . . , “9”. They were scanned from forms filled out by
volunteers, thresholded to black-and-white and automatically segmented into images of 128 ×
128 pixels. Sometimes parts of other digits are visible in an image. In order to save disk space,
we determined for each digit a bounding box, thereby causing images to have different sizes.
The data can be loaded and displayed as:
>> a = prnist([0:9],[1:40:1000])
>> show(a)
This loads, for each of the 10 digits, 25 objects out of the available 1000 into a prdatafile
a. The result is shown in figure A.1.

Figure A.1: Subset of the NIST digit dataset

A.3 Expected output: deliverables

The client requires two deliverables:
• a report detailing the design choices you made for both scenarios, including detailed
motivation. This report should discuss the following:
1. the choice of representation (by pixels, features or dissimilarities);
2. the actual features or dissimilarity measure used;
3. for each representation, whether feature reduction is possible and, if so, what
number of features should be used;
4. for each representation, the optimal classifier and its type, i.e. parametric (nmc,
ldc, qdc, fisherc, loglc), non-parametric (knnc, parzenc) or advanced (neural
networks, support vector classifiers, one-class classifiers, combining classifiers);
5. the estimated performance of the optimal classifier on novel data.
• a test of your system on a set of benchmark data withheld from you by the client (see
below). For Scenario 1, your system should have lower than 5% test error; for Scenario
2, your target is 25% test error.
Note that each choice is expected to be backed up by either evidence or solid reasoning.
In the last exercise(s) for each week you have already experimented somewhat with the handwritten digit dataset. However, it is important to question each step as you construct your
final digit recognition system, for two reasons:

• for Scenario 1, you should use as much training data as possible for the construction of
your system (but at least 200 objects per class), rather than the smaller sets you used
earlier; this may invalidate some of your earlier conclusions;
• you are supposed to optimise the overall performance of the system, rather than just
the individual steps.

A.4 The benchmark

It is common for clients of your company to withhold data from you, which they can then
use to test your system and verify whether the performance you claim to obtain is actually
valid. In this case, the client has supplied you with a function you can use to benchmark your
system yourself, without having access to the actual data used. It goes without saying that
you may not use the test error to optimise your system in any way.
The procedure agreed upon is as follows:
• write a function a = my_rep(m) in which m is a NIST prdatafile set and a the resulting
dataset;
• compute your classifier as PRTools classifier w, in which dimension reduction and
classification are incorporated, so a*w*labeld should generate the proper labels;
• call e = nist_eval(filename,w,n) in which filename is the filename for the my_rep
routine, w is your classifier and n is the desired number of digits per class to be tested
(which you can leave empty if you want to test on all data). A typical call therefore
looks like: e = nist_eval('my_rep',w);. The maximum value of n is 100. Evaluation
may take a long time. The fraction of classification errors is returned in e.
In your report, give the performance obtained using the classifiers selected for both scenarios
and compare these to the performance predicted using the training data. Are there large
differences; if so, what could be their cause?

A.5 Grading

Your report will be graded by the supervisors, acting as the client company. If the minimum
performance requirements (see section A.3) are not met, a passing grade will not be given. The
report is graded on the results obtained, the choices made based on these results, and the
motivation behind these choices. The maximum grade that can be obtained for the report is an 8.
The final two points can be earned as follows:
• Include a section “Live test” in your report, in which you report on the performance
obtained on actual handwritten digits and compare it to the expected performance and
the benchmark performance. To this end, you should:
– use a scanner to read in an image of a sheet of paper on which you have written a
test set of digits;

– segment the individual digits out of the resulting image;
– classify the digit images.
You will have to organize the scanner yourself.
• Include a section “Recommendations” to your report, in which you advise your client
in detail on possible steps to improve performance in light of the overall system they
are constructing. For example:
– will having more training data help?
– would other features help?
– does the application require reject options?
– does a single classifier suffice for all digits on a bank cheque?
– what is the role of time constraints?
– etc.

A.6 Feedback

While you work on the final assignment during the last weeks, the supervisors will of course
be available to answer questions and comment on your results.


Appendix B

Self-evaluation probability theory
and linear algebra
In the course on Pattern Recognition, we assume a certain familiarity with probability theory
and linear algebra. You should be able to answer the following questions without too many
problems and, in principle, only using pen and paper. If you find you have difficulties in doing
so, please consult Theodoridis (Appendices A and B), or some other textbook of your liking,
to refresh your memory.
Exercise B.1 Probability and combinatorics
From an ordinary deck of 52 cards, someone draws a card at random.
(a) What is the probability that it is a black ace or a red queen?
(b) What is the probability that it is a face card or a black card?
(c) What is the probability that it is neither a heart nor a queen?
(d) If two random cards were missing from the deck, what is the probability that it is
a spade?
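You should do these by hand, but a brute-force enumeration of the deck is a handy way to check your answers afterwards (standard-library Python):

```python
from fractions import Fraction
from itertools import product

ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['hearts', 'diamonds', 'spades', 'clubs']   # hearts and diamonds are red
deck = list(product(ranks, suits))                  # 52 cards

def prob(event):
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

red = {'hearts', 'diamonds'}
face = {'J', 'Q', 'K'}

p_a = prob(lambda c: (c[0] == 'A' and c[1] not in red) or
                     (c[0] == 'Q' and c[1] in red))     # black ace or red queen
p_b = prob(lambda c: c[0] in face or c[1] not in red)   # face card or black card
p_c = prob(lambda c: c[1] != 'hearts' and c[0] != 'Q')  # neither heart nor queen
print(p_a, p_b, p_c)    # 1/13 8/13 9/13
```

Part (d) is best settled by a symmetry argument rather than by enumeration.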
Exercise B.2 Expectation, (co)variance, sample estimate, and correlation
The time it takes for a student to finish a statistics aptitude test has probability density
function (PDF) [formula and sub-questions lost in extraction].

Exercise B.x Linear algebra and the pseudo-inverse
Given is an n × N matrix L, with N > n, and various N-dimensional vectors a_i.
(a) What is the dimensionality of La_i?
(b) L+ denotes the Moore-Penrose pseudo-inverse. What is the dimensionality of
L+La_i, and why can't we just use the standard inverse in this expression?
(c) What is the relation between L+La_i and La_i if L is (row-)orthogonal (or, more
exactly, (row-)orthonormal)? What is such a transformation also called? What can you
say about the relation if L is not orthogonal?


Appendix C

Introduction to Matlab

Ela Pekalska and Marjolein van der Glas

C.1 Getting started with Matlab

Matlab is a tool for mathematical (technical) calculations. First, it can be used as a scientific
calculator. Next, it allows you to plot or visualize data in many different ways, perform matrix
algebra, work with polynomials or integrate functions. You can create, execute and save
a sequence of commands to automate your computational process. Finally, Matlab
is also a user-friendly programming language that makes it easy to handle mathematical
calculations. In summary, as a computing/programming environment,
Matlab is especially designed to work with data sets as a whole, such as vectors, matrices
and images. Therefore, PRTools, a toolbox for Pattern Recognition purposes, and DIPLIB,
a toolbox for Image Processing, have been developed under Matlab.
On most systems, after logging in, you can enter Matlab with the system command matlab
and exit with the command exit or quit. Running Matlab creates one or more windows
on your screen. The most important is the Matlab Command Window, which is the place
where you interact with Matlab. The string >> (or EDU>> for the Student Edition) is the
Matlab prompt. When the Command Window is active, a cursor appears after the prompt,
indicating that Matlab is waiting for your command.

C.1.1 Input via the command-line

Matlab is an interactive system; commands followed by Enter are executed immediately.
The results are, if desired, displayed on the screen. Execution of a command is only possible
when the command is typed according to the rules. Table C.1 shows how common
mathematical expressions translate into Matlab commands (a, b, x and y are numbers).
Below you find basic information to help you get started with Matlab:
• Commands in Matlab are executed by pressing Enter or Return. The output will be
displayed on the screen immediately. Try the following:
>> 3 + 7.5
>> 18/4 - (3 * 7)

Note that spaces are not important in Matlab.
• The result of the last performed computation is assigned to the variable ans, which is
an example of a Matlab built-in variable. It can be used on the next command line.
For instance:
>> 14/4
ans =
3.5000
>> ans^(-6)
ans =
5.4399e-04
5.4399e-04 is computer notation for 5.4399 · 10^-4. Note that ans is always overwritten
by the last command.
• You can also define your own variables (more on variables in Section C.2.1), e.g. a and
b as:
>> a = 14/4
a =
3.5000
>> b = a^(-6)
b =
5.4399e-04
• When the command is followed by a semicolon ’;’, the output is suppressed. Compare:
>> 3 + 7.5
>> 3 + 7.5;
• It is possible to execute more than one command at the same time; the commands should
then be separated by commas (to display the output) or by semicolons (to suppress the
output display), e.g.:
>> sin(pi/4), cos(pi); sin(0)
ans =
0.7071
ans =
0
Note that the value of cos(pi) is not printed.
• By default, Matlab displays only 5 digits. The command format long increases this
number to 15, format short reduces it to 5 again. For instance:
>> 312/56
ans =
5.5714
>> format long
>> 312/56
ans =
5.57142857142857
• The output may contain some empty lines; this can be suppressed by the command
format compact. In contrast, the command format loose will insert extra empty
lines.

Mathematical notation    Matlab command

a + b                    a + b
a − b                    a - b
ab                       a * b
a/b                      a / b or b \ a
x^b                      x^b
√x                       sqrt(x) or x^0.5
|x|                      abs(x)
π                        pi
4 · 10^3                 4e3 or 4*10^3
i                        i or j
3 − 4i                   3-4*i or 3-4*j
e, e^x                   exp(1), exp(x)
ln x, log x              log(x), log10(x)
sin x, arctan x, ...     sin(x), atan(x), ...

Table C.1: Translation of mathematical notation to Matlab commands.

• Matlab is case sensitive. For example, if a variable is defined as a, then typing A will
result in an error.
• All text after a percent sign % until the end of a line is treated as a comment. Enter
e.g. the following:
>> sin(3.14159)                % this is an approximation of sin(pi)

You should skip comments while typing the examples.
• Previous commands can be fetched back with the ↑-key. A command can also be
edited; the ←- and →-keys may be used to move around in a line and change it.

C.1.2 Help facilities

Matlab provides assistance through extensive online help. The help command is the simplest
way to get help. It displays the list of all possible topics. To get a more general introduction
to help, try:
>> help help
If you already know the topic or command, you can ask for more specific help. For instance:
>> help elfun
provides information on the elementary math functions of Matlab. The topic you want help
on must be exact. The lookfor command is useful if you do not know the exact name of the
command or topic, e.g.:
>> lookfor inverse
displays a list of all commands, with a short description, whose help text contains the word
inverse. Besides the help and lookfor commands, there is also a separate

mouse-driven help. The helpwin command opens a new window on the screen, which can be
browsed in an interactive way.

C.1.3 Interrupting a command or program

Sometimes you might spot an error in your command or program. Due to this error it can
happen that the command or program does not stop. Pressing Ctrl-C (or Ctrl-Break on
PC) forces Matlab to stop the process. After this, the Matlab prompt (>>) re-appears.
This may take a while, though.

C.1.4 Workspace issues

If you work in the Command Window, Matlab memorizes all commands that you entered and
all variables that you created. These commands and variables reside in the Matlab workspace
and can easily be recalled when needed, e.g. to recall previous commands, the ↑-key
is used. Variables can be inspected with the command who, which gives a list of the variables
present, and whos, which also includes extra information. Assuming that you performed all
commands given in Section C.1, after typing who you would obtain:
>> who
Your variables are:
a    ans    b    x
The command clear <name> deletes the variable <name> from the Matlab workspace; clear
or clear all removes all variables. This is useful when starting a new task. For example:
>> clear a x
>> who
Your variables are:
ans    b

C.1.5 Saving and loading data

The easiest way to save or load Matlab variables is by using (clicking) the File menu-bar,
and then the Save Workspace as... and Load Workspace... items. Alternatively, the
Matlab commands save and load can be used. save allows for saving your workspace
variables either into a binary file or an ASCII file. Binary files automatically get the ’.mat’
extension. For ASCII files, it is recommended to add a ’.txt’ or ’.dat’ extension. Study
the use of the save command:
>> s1 = sin(pi/4);
>> c1 = cos(pi/4); c2 = cos(pi/2);
>> str = 'hello world';        % a string
>> save data                   % saves all variables in binary format to data.mat
>> save numdata s1 c1          % saves numeric variables s1 and c1 to numdata.mat
>> save allcos.dat c* -ascii   % saves c1,c2 in 8-digit ascii format to allcos.dat
The load command allows for loading variables into the workspace. It uses the same syntax
as save. It might be useful to clear the workspace before each load and check which variables

are present in the workspace (use who) after loading. Assuming that you have created the
files above, the variables can now be loaded as:
>> load data                   % loads all variables from data.mat
>> load data s1 c1             % loads only the specified variables from data.mat
It is also possible to read ASCII files that contain rows of space separated values. Such a file
may contain comments that begin with a percent character. The resulting data is placed into
a variable with the same name as the ASCII file (without the extension). Check, for example:
>> load allcos.dat             % loads data from allcos.dat into the variable allcos
>> who                         % lists the variables present in the workspace now

C.2 Mathematics with vectors and matrices

The basic element of Matlab is a matrix (or an array). Special cases are:
• a 1 × 1-matrix: a scalar or a single number;
• a matrix consisting of only one row or one column: a vector.
Note that Matlab may behave differently depending on input, whether it is a number, a
vector or a 2D matrix.

C.2.1 Assignments and variables

Formally, there is no need to declare (i.e. define the name, size and the type of) a new
variable in Matlab. A variable is simply created by an assignment (e.g. a = 15/4). Each
newly created numerical variable is always of the double type. You can change this type by
converting it into e.g. the single type¹.
Bear in mind that undefined values cannot be assigned to a variable. So, the following is not
possible:
>> clear x;                    % to make sure that x does not exist
>> f = x^2 + 4 * sin (x)
It becomes possible by:
>> x = pi / 3; f = x^2 + 4 * sin (x)

Here are some examples of different types of Matlab variables (check their types by using
whos):
>> M = [1 2; 3 4; 5 6]         % a matrix
>> 2t = 8                      % what is the problem with this command?
>> c = 'E'                     % a character
>> str = 'Hello world'         % a string
There is also a number of built-in variables, e.g. pi, eps or i, presented in Table C.2. In
addition to creating variables by assigning values to them, another possibility is to copy one
¹ A variable a is converted into a different type by performing e.g. a = single(a), etc.


Variable name      Value/meaning

ans                the default variable name used for storing the last result
pi                 π = 3.14159...
eps                the smallest positive number that, added to 1, creates a result larger than 1
inf                representation for positive infinity, e.g. 1/0
nan or NaN         representation for not-a-number, e.g. 0/0
i or j             i = j = √−1
nargin/nargout     number of function input/output arguments used
realmin/realmax    the smallest/largest usable positive real number

Table C.2: Built-in variables in Matlab.

variable, e.g. b into another, e.g. a. In this way, the variable a is automatically created (if a
already existed, its previous value is now lost):
>> b = 10.5;
>> a = b;
If min is the name of a function (see Section C.4.2), then a, defined as:
>> b = 5; c = 7;
>> a = min (b,c);
will call that function, with the values b and c as parameters. The result of this function (its
return value) will be written (assigned) into a.

C.2.2

Vectors

Row vectors are lists of numbers separated either by commas or by spaces. They are examples
of simple arrays. The first element has index 1. The number of entries is known as the length of
the vector (the command length exists as well). The entries are referred to as elements or
components. They must be enclosed in [ ]:
>> v = [-1 sin(3) 7]
v =
   -1.0000    0.1411    7.0000
A number of operations can be performed on vectors. A vector can be multiplied by a
scalar, or added/subtracted to/from another vector with the same length, or a number can
be added/subtracted to/from a vector. A particular value can also be changed or displayed.
All these operations are carried out element-by-element. Vectors can also be built from the

already existing ones:
>> v = [-1 2 7]; w = [2 3 4];
>> z = v + w                   % an element-by-element sum
z =
     1     5    11
>> t = [2*v, -w+2]
t =
    -2     4    14     0    -1    -2
>> v(2) = -1                   % change the 2nd element of v
v =
    -1    -1     7

A colon notation is an important shortcut, used when producing row vectors (see Table C.3
and help colon):
>> 2:5
ans =
     2     3     4     5

In general, first:step:last produces a vector of entries starting with the value first,
incrementing by step until it reaches last:
>> -0.2:0.5:1.7
ans =
   -0.2000    0.3000    0.8000    1.3000
>> 5.5:-2:-2.5                 % a negative step is also possible
ans =
    5.5000    3.5000    1.5000   -0.5000   -2.5000

Parts of vectors can be extracted by using a colon notation:
>> r = [-1:2.5:6, 2, 3, -2];   % r = [-1 1.5 4 2 3 -2]
>> r(2:2:6)                    % get the elements of r from the positions 2, 4 and 6
ans =
    1.5     2    -2

To create column vectors, you should separate entries by a semicolon ’;’ or by new lines.
The same operations as on row vectors can be done on column vectors. However, you cannot
for example add a column vector to a row vector. To do that, you need an operation called

transposing, which converts a column vector into a row vector and vice versa:
>> f = [-1; 3; 5]              % a column vector
f =
    -1
     3
     5
>> f'                          % f' is a row vector
ans =
    -1     3     5
>> v = [-1 2 7];               % a row vector
>> f + v                       % you cannot add a column vector f to a row vector v
??? Error using ==> +
Matrix dimensions must agree.
>> f' + v
ans =
    -2     5    12

You can now compute the inner product between two vectors x and y, x^T y = Σ_i x_i y_i, in a
simple way:
>> v = [-1; 2; 7];             % v is now a column vector; f is also a column vector
>> f' * v                      % this is the inner product! f * v' is the outer product
ans =
    42

C.2.3 Matrices

An n × k matrix is a rectangular array of numbers having n rows and k columns. Defining
a matrix in Matlab is similar to defining a vector. The generalization is straightforward, if
you know that a matrix consists of row vectors (or column vectors). Commas or spaces are
used to separate elements in a row, and semicolons are used to separate individual rows. For
example, a matrix can be defined as:
>> A = [1 2 3; 4 5 6; 7 8 9]   % row by row input
A =
     1     2     3
     4     5     6
     7     8     9
Transposing a vector changes it from a row to a column or the other way around. This idea
can be extended to a matrix, where transposing interchanges rows with the corresponding

Command        Result

A(i,j)         A_ij
A(:,j)         j-th column of A
A(i,:)         i-th row of A
A(k:l,m:n)     (l − k + 1) × (n − m + 1) matrix with elements A_ij with k ≤ i ≤ l, m ≤ j ≤ n
v(i:j)         'vector part' (v_i, v_{i+1}, ..., v_j) of vector v

Table C.3: Manipulation of (groups of) matrix elements.

columns, as in the example:
>> A2 = [1:4; -1:2:5], A2'
A2 =
     1     2     3     4
    -1     1     3     5
ans =
     1    -1
     2     1
     3     3
     4     5
>> size(A2), size(A2')         % returns the size (dimensions) of a matrix
ans =
     2     4
ans =
     4     2

There are also built-in matrices of a size specified by the user (see Table C.4). A few examples
are given below:
>> I = eye(2)                  % the 2-by-2 identity matrix
I =
     1     0
     0     1
>> B = randn(1,2)              % a vector of normally distributed random numbers
B =
   -0.4326   -1.6656
>> r = [1 3 -2]; R = diag(r)   % create a diagonal matrix with r on the diagonal
R =
     1     0     0
     0     3     0
     0     0    -2

It is often needed to build a larger matrix from smaller ones:
>> x = [4; -1]; y = [-1 3]; X = [x y']    % X consists of the columns x and y'
X =
     4    -1
    -1     3
>> T = [-1 3 4; 4 5 6]; t = 1:3;
>> A = [[T; t] ones(3,2)]      % add a row vector t, then add two columns of 1's
A =
    -1     3     4     1     1
     4     5     6     1     1
     1     2     3     1     1

It is possible to manipulate (groups of) matrix elements (see Table C.3). A part can be
extracted from a matrix in a similar way as from a vector. Each element in the matrix is
indexed by the row and the column to which it belongs; e.g. A(i,j) denotes the element from
the i-th row and the j-th column of the matrix A.
>> A(3,4)
ans =
     1
>> A(4,3)                      % A is a 3-by-5 matrix! A(4,3) does not exist
??? Index exceeds matrix dimensions.

It is easy to extend a matrix automatically. For the matrix A this can be done e.g. as follows:
>> A(4,3) = -2                 % assign -2 to the position (4,3); the uninitialized
A =                            % elements become zeros
    -1     3     4     1     1
     4     5     6     1     1
     1     2     3     1     1
     0     0    -2     0     0
>> A(4,[1:2,4:5]) = 4:7;       % assign A(4,1)=4, A(4,2)=5, A(4,4)=6 and A(4,5)=7

Different parts of the matrix A can now be extracted:
>> A(:,2)                      % extract the 2nd column of A; what will you get?
>> A(4,:)                      % extract the 4th row of A
ans =
     4     5    -2     6     7
>> A(2:3,[1,5])
ans =
     4     1
     1     1

Table C.4 shows some frequently used matrix operations and functions. A few examples are

given below:
>> B = [1 4 -3; -7 8 4]; C = [1 2; 5 1; 5 6];
>> (B + C')./4                 % add matrices and divide all elements of the result by 4
ans =
    0.5000    2.2500    0.5000
   -1.2500    2.2500    2.5000
>> D = [1 -1 4; 7 0 -1];
>> D = B * D'; D^3             % D^3 is equivalent to D * D * D
ans =
       -4205       38390
        3839     -150087
>> D.^3 - 2                    % for all elements: raise to the power 3, subtract 2
ans =
       -3377         998
          -1     -148879

The concept of an empty matrix [] is also very useful in Matlab. For instance, a few
columns or rows can be removed from a matrix by assigning an empty matrix to it. Try for
example:
>> C = [1 2 3 4; 5 6 7 8; 1 1 1 1];
>> D = C; D(:,2) = []          % now a copy of C is in D; remove the 2nd column of D
>> C([1,3],:) = []             % remove the 1st and 3rd rows from C

C.3 Control flow

A control flow structure is a block of commands that allows conditional code execution and
looping.

C.3.1 Logical and relational operators

To use control flow commands, it is necessary to perform operations that result in logical
values: TRUE or FALSE, which in Matlab correspond to 1 and 0, respectively (see Table C.5
and help relop). The relational operators <, <=, >, >=, == and ~= can be used to compare
two arrays of the same size or an array to a scalar. The logical operators &, | and ~ allow
for the logical combination or negation of relational operators. The logical & and | have equal
precedence in Matlab, which means that these operators associate from left to right. In
addition, three functions are also available: xor, any and all (use help to find out more).

The command find
You can extract all elements from a vector or a matrix satisfying a given condition, e.g.
equal to 1 or larger than 5, by using logical addressing. The same result can be obtained via
the command find, which returns the positions (indices) of such elements. For instance:
>> x = [1 1 3 4 1];
>> i = (x == 1)
i =
     1     1     0     0     1
>> j = find(x == 1)            % j holds the indices of the elements satisfying x == 1
j =
     1     2     5
>> y = x(i), z = x(j)
y =
     1     1     1
z =
     1     1     1
find operates in a similar way on matrices. It first reshapes the matrix A into a column
vector, i.e. all columns are concatenated one after another. Therefore, e.g. k = find(A >
2) is a list of indices of elements larger than 2 and A(k) gives the values. The row and column
indices can be returned by [I,J] = find(A > 2).
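As a small illustration (not in the original text) of find applied to a matrix:

```matlab
>> A = [1 3; 4 2];
>> k = find(A > 2)             % column-major (linear) indices
k =
     2
     3
>> A(k)                        % the values themselves: 4 and 3
>> [I,J] = find(A > 2)         % the corresponding row and column indices
```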

C.3.2 Conditional code execution

Selection control structures, if-blocks, are used to decide which instruction to execute next,
depending on whether an expression is TRUE or not. The general description is given below.
In the examples the command disp is frequently used; this command displays the text
between the quotes on the screen.
• if ... end

Syntax:
if logical_expression
   statement1
   statement2
   ....
end

Example:
if (a > 0)
   b = a;
   disp('a is positive');
end
• if ... else ... end

Syntax:
if logical_expression
   block of statements
   evaluated if TRUE
else
   block of statements
   evaluated if FALSE
end
• if ... elseif ... else ... end

Syntax:
if logical_expression1
   block of statements evaluated
   if logical_expression1 is TRUE
elseif logical_expression2
   block of statements evaluated
   if logical_expression2 is TRUE
else
   block of statements evaluated
   if no other expression is TRUE
end

Example:
if (height > 190)
   disp('very tall');
elseif (height > 170)
   disp('tall');
elseif (height < 150)
   disp('small');
else
   disp('average');
end
Another selection structure is switch, which switches between several cases depending on an
expression, which is either a scalar or a string. Learn more by using help.
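A minimal sketch of a switch block (an illustrative example, not from the original manual):

```matlab
grade = 'B';
switch grade
   case 'A'
      disp('excellent');
   case {'B','C'}              % several cases can be grouped in a cell array
      disp('satisfactory');
   otherwise
      disp('try again');
end
```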

C.3.3 Loops

Iteration control structures, loops, are used to repeat a block of statements until some condition is met:
• the for loop, which repeats a group of statements a fixed number of times:

Syntax:
for index = first:step:last
   block of statements
end

Example:
sumx = 0;
for i=1:length(x)
   sumx = sumx + x(i);
end

You can specify any step, including a negative value. The index of the for-loop can
also be a vector. See some examples of possible variations:

Example 1:
for i=1:2:n
   ...
end

Example 2:
for i=n:-1:3
   ....
end

Example 3:
for x=0:0.5:4
   disp(x^2);
end

Example 4:
for x=[25 9 81]
   disp(sqrt(x));
end
• the while loop, which evaluates a group of commands as long as an expression is TRUE:

Syntax:
while expression
   statement1
   statement2
   statement3
   ...
end

Example:
N = 100;
iter = 1;
msum = 0;
while iter <= N
   msum = msum + iter;
   iter = iter + 1;
end;

C.4 Script and function m-files

C.4.1 Script m-files

Script m-files are useful when the number of commands increases or when you want to change
values of some variables and re-evaluate them quickly. Formally, a script is an external file
that contains a sequence of Matlab commands. However, it is not a function, since there
are no input/output parameters and the script variables remain in the workspace. When
you run a script, the commands in it are executed as if they had been entered through the
keyboard. To create a script, open the Matlab editor (go to the File menu-bar, choose the
Open option and then m-file; the Matlab Editor Window will appear), enter the lines listed
below and save them as sinplot.m:
x = 0:0.2:6; y = sin(x); plot(x,y);
title('Plot of y = sin(x)');
The script can be run by calling sinplot. Note that the script file must be saved in one of the
directories in Matlab's path. The sinplot script affects the workspace. Check:
>> clear                       % all variables are removed from the workspace
>> who                         % no variables present
>> sinplot                     % run the script
>> who
Your variables are:
x    y
These generic names, x and y, may easily be used in further computations, and this can cause
side effects. Side effects occur, in general, when a set of commands changes variables other
than the input arguments. Remember that the commands within a script have access to all
variables in the workspace, and all variables created in the script become part of the
workspace. Therefore, it is better to use function m-files (see Section C.4.2) for a specific
problem to be solved.

C.4.2 Function m-files

Function m-files are true subprograms, since they take input arguments and/or return output
parameters. They can call other functions as well. Variables defined and used inside a
function, other than the input/output arguments, are invisible to other functions and to
the command environment. The general syntax is:
function [outputArgs] = function_name (inputArgs)
outputArgs are enclosed in [ ]:
– a comma-separated list of variable names;
– [ ] is optional when only one argument is present;
– functions without outputArgs are legal².
inputArgs are enclosed in ( ):
– a comma-separated list of variable names;
– functions without inputArgs are legal.
Matlab provides a structure for creating your own functions. The first line should be a
definition of a new function (called a header). After that, a continuous sequence of comment
lines, explaining what the function does, should appear. Since these lines are displayed in
response to the help command, the expected input parameters, returned output parameters
and synopsis should also be described there. Finally, the remainder of the function is called
the body.
Function m-files terminate execution when they reach the end of the file or, alternatively,
when the command return is encountered.

² In other programming languages, functions without output arguments are called procedures.

The name of the function and the name of the file in which it is stored should be identical.
As an example, the function average, stored in a file
average.m, is defined as:
function avr = average (x)
%AVERAGE computes the average value of a vector x
%        and returns it in avr

% Notes: an example of a function
n   = length(x);
avr = sum(x)/n;
return;

The first line (the header) contains the function definition: the output argument avr, the
function name average and the input argument x. The lines starting with % form the help
comment; because of the blank line within the comment, the Notes information will NOT
appear when you ask: help average. The remaining lines form the function body.

Here is another example of a function:
function [avr,sd] = stat(x)
%STAT Simple statistics; computes the average and standard deviation of a vector x.
n   = length(x);
avr = sum(x)/n;
sd  = sqrt(sum((x - avr).^2)/n);
return;
Warning: The functions mean and std already exist in Matlab. As long as a function
name is used as a variable name, Matlab cannot call the function. Many other appealing
names, such as sum or prod, are already reserved by Matlab functions, so be careful when
choosing your names.
When checking the proper use of parameters, the function error may come in useful. It
displays an error message, aborts function execution, and returns to the command environment.
Here is an example:
if (a >= 1)
error (’a must be smaller than 1’);
end;

C.4.3 Scripts vs. functions

The most important difference between a script and a function is that all of a script's
parameters and variables are externally accessible (i.e. in the workspace), whereas function
variables are not. Therefore, a script is a good tool for documenting work, designing
experiments and testing. In general, create a function to solve a given problem for arbitrary
parameters; use a script to run functions for the specific parameters required by the
assignment.
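This division of labour can be sketched with a hypothetical pair of files (the names and contents are illustrative, not from the original manual):

```matlab
% file: zscorevec.m -- a function: solves the problem for arbitrary input
function y = zscorevec(x)
%ZSCOREVEC scales a vector x to zero mean and unit standard deviation
y = (x - mean(x)) / std(x);

% file: runexperiment.m -- a script: fixes the specific parameters
x = randn(100,1) * 5 + 3;      % the particular data for this experiment
y = zscorevec(x);
fprintf('mean: %.2f, std: %.2f\n', mean(y), std(y));
```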

Command              Result

n = rank(A)          n becomes the rank of matrix A
x = det(A)           x becomes the determinant of matrix A
x = size(A)          x becomes a row-vector with 2 elements: the number of rows and columns of A
x = trace(A)         x becomes the trace (sum of diagonal elements) of matrix A
x = norm(v)          x becomes the Euclidean length of vector v

C = A + B            sum of two matrices
C = A - B            subtraction of two matrices
C = A * B            multiplication of two matrices
C = A .* B           'element-by-element' multiplication (A and B are of equal size)
C = A ^ k            power of a matrix (k ∈ Z; can also be used for A^-1)
C = A .^ k           'element-by-element' power of a matrix
C = A'               the transpose of a matrix; A^T
C = A ./ B           'element-by-element' division (A and B are of equal size)
X = A \ B            finds the solution in the least squares sense to the system of equations AX = B
X = B / A            finds the solution of XA = B, analogous to the previous command

C = inv(A)           C becomes the inverse of A
C = null(A)          C is an orthonormal basis for the null space of A obtained from the singular value decomposition
C = orth(A)          C is an orthonormal basis for the range of A
C = rref(A)          C is the reduced row echelon form of A
L = eig(A)           L is a vector containing the (possibly complex) eigenvalues of a square matrix A
[X,D] = eig(A)       produces a diagonal matrix D of eigenvalues and a full matrix X whose columns are the corresponding eigenvectors of A
S = svd(A)           S is a vector containing the singular values of A
[U,S,V] = svd(A)     S is a diagonal matrix with nonnegative diagonal elements in decreasing order; the columns of U and V are the accompanying singular vectors

x = linspace(a,b,n)  generates a vector x of n equally spaced points between a and b
x = logspace(a,b,n)  generates a vector x starting at 10^a and ending at 10^b containing n values
A = eye(n)           A is an n × n identity matrix
A = zeros(n,m)       A is an n × m matrix with zeros (default m = n)
A = ones(n,m)        A is an n × m matrix with ones (default m = n)
A = diag(v)          gives a diagonal matrix with the elements v1, v2, ..., vn on the diagonal
X = tril(A)          X is the lower triangular part of A
X = triu(A)          X is the upper triangular part of A
A = rand(n,m)        A is an n × m matrix with elements uniformly distributed between 0 and 1
A = randn(n,m)       ditto - with elements standard normal distributed
v = max(A)           v is a vector with the maximum value of the elements in each column of A, or v is the maximum of all elements if A is a vector
v = min(A)           ditto - with minimum
v = sum(A)           ditto - with sum

Table C.4: Frequently used matrix operations and functions.

Command          Result

a = (b > c)      a is 1 if b is larger than c. Similar are: <, >= and <=
a = (b == c)     a is 1 if b is equal to c
a = (b ~= c)     a is 1 if b is not equal to c
a = ~b           logical complement: a is 1 if b is 0
a = (b & c)      logical AND: a is 1 if b = TRUE AND c = TRUE
a = (b | c)      logical OR: a is 1 if b = TRUE OR c = TRUE

Table C.5: Relational and logical operations.


Appendix D

Introduction to PRTools
David M.J. Tax and Robert P.W. Duin

D.1 Introduction

The more than 100 routines offered by PRTools in its present state represent a basic set
largely covering the area of statistical pattern recognition. In order to make the evaluation
and comparison of algorithms easier, a set of data generation routines is included, as well
as a small set of standard real-world datasets.
This manual treats just the basic objects and constructs in the PRTools toolbox. It does
not give an exhaustive explanation of all possibilities in the toolbox. To be more explicit, this
manual does not treat:
• feature selection,
• error estimation techniques (cross-validation etc.),
• combining classifiers,
• clustering,
• applications on images.
Note: the manual tries to show the basic constructs and ideas behind the toolbox. When
actually applying the toolbox in practice, the user should frequently use the help. For
each function defined in the PRTools toolbox an extensive help text is present, explaining
the syntax, the arguments, the function and the possible default values for arguments.

D.2 PRTools

PRTools deals with sets of labeled objects and offers routines for generalising such sets into
functions for data mapping and classification. An object is a k-dimensional vector of feature
values. It is assumed that for all objects in a problem all values of the same set of features
are given (and that there are no missing values). The space defined by the actual set of
features is called the feature space. Objects are represented as points or vectors in this space.

A classification function assigns labels to new objects in the feature space. Usually, this is
not done directly, but in a number of stages in which the initial feature space is successively
mapped into intermediate stages, finally followed by a classification. The concept of mapping
spaces and dataset is thereby important and constitutes the basis of many routines in the
toolbox.
PRTools makes use of the possibility offered by Matlab 5 to define ‘classes’ and ‘objects’.
These programmatic concepts should not be confused with the classes and objects as defined
in Pattern Recognition. Two classes have been defined: prdataset and prmapping. A large
number of operators (like *, []) and Matlab commands have been overloaded and thereby
have a special meaning when applied to a dataset and/or a mapping.
In the coming sections it is explained how prdatasets and prmappings can be created,
modified and applied.

D.3 Dataset

The central data structure of PRTools is the prdataset. It primarily consists of a set of
objects represented by a matrix of feature vectors. Attached to this matrix is a set of labels,
one for each object, and a set of feature names. Moreover, a set of a priori probabilities, one
for each class, is stored. In most help files of PRTools, a dataset is denoted by a. In almost
any routine this is one of the inputs. Almost all routines can handle multi-class object sets.

prdataset    Define dataset from data matrix and labels
getdata      Retrieve data from dataset
getlabels    Retrieve object labels from dataset
getfeat      Retrieve feature labels from dataset
genlab       Generate dataset labels
renumlab     Convert labels to numbers

Sets of objects may be given externally or may be generated by one of the data generation
routines of PRTools. Their labels may also be given externally or may be the result of a
cluster analysis.
A dataset containing 10 objects with 5 random measurements can be generated by:
>> data = rand(10,5);
>> a = prdataset(data)
10 by 5 dataset with 0 classes: []
In the previous example no labels were supplied, therefore no classes are detected. Labels can
be added to the dataset by:
>> data = rand(10,5);
>> labs = [’A’; ’A’; ’A’; ’B’; ’A’; ’A’; ’B’; ’B’; ’B’; ’B’];
>> a = prdataset(data,labs)
10 by 5 dataset with 2 classes: [5 5]

In most cases the objects in the dataset are ordered (such that the first n1 objects belong to
class A, the next n2 to class B, etc). In this case it is easier to use the genlab routine for the
generation of the labels:
>> data = rand(10,5);
>> labs = genlab([5 5],[’A’;’B’]);
>> a = prdataset(data,labs)
10 by 5 dataset with 2 classes: [5 5]
Next to class labels, it is also possible to supply feature labels. To set this parameter, the
keyword featlab should precede the value:
>> data = rand(10,3);
>> labs = genlab([5 5],[’A’;’B’]);
>> feats = [’Feat1’;’Feat2’;’Feat3’];
>> a = prdataset(data,labs,’featlab’,feats)
10 by 3 dataset with 2 classes: [5 5]
There are numerous other parameters that can be defined inside the dataset definition: the
class prior probabilities, the dataset name, cost matrices, whether features are numerical or
categorical, etc. See help prdataset for more information about the possible parameters.
All the values can be set by their respective commands. For instance for setting the prior:
>> a = setprior(a,[0.4 0.6]);
Analogously, you can use setcost, setname or setfeatdom.
Various items stored in a dataset can be retrieved by the get series of commands:
>> name = getname(a);
>> prob = getprior(a);
>> nlab = getnlab(a);
>> lablist = getlablist(a);
>> [m,k,c] = getsize(a);
in which nlab are numeric labels for the objects (1, 2, 3, ...) referring to the true labels
stored in the rows of lablist. The size of the dataset is m by k, c is the number of classes
(equal to max(nlab)).
The original data matrix can be retrieved by double(a) or by +a. The labels of the objects in
a can be retrieved by labels = getlab(a), which is equivalent to labels = lablist(nlab,:).
The feature labels can be retrieved by featlist = getfeat(a).
Note: In many respects a prdataset object can be compared with a normal Matlab matrix.
Normal matrix operations like a.^2, abs(a), a(:,3), a(1,:), size(a) etc, can directly
be applied to prdataset objects. In some cases Matlab routines cannot cope with the
prdataset object. In these cases the user has to use +a instead of a.
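A minimal sketch of this use of +a (the data and dimensions here are arbitrary):

```matlab
% build a small labeled dataset
data = rand(10,2);
labs = genlab([5 5],['A';'B']);
a = prdataset(data,labs);
% most matrix operations work on a directly, e.g. a(:,1) or a.^2
b = a.^2;
% when a routine cannot cope with the prdataset object,
% pass the plain data matrix instead
m = mean(+a);
```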
Another way to inspect a dataset a is to make a scatterplot of the objects in the dataset.
For this the function scatterd is supplied. This plots each object in the dataset a in a 2D
graph, using a coloured marker when class labels are supplied. When more than 2 features
are present in the dataset, the first two are used. When you want to make a scatterplot of
two other features, you have to explicitly extract them by scatterd(a(:,[2 5])).

prdataset generation
gauss     Generation of multivariate Gaussian distributed data
gendatb   Generation of banana shaped classes
gendatc   Generation of circular classes
gendatd   Generation of two difficult classes
gendath   Generation of Highleyman classes
gendatl   Generation of Lithuanian classes
gendatm   Generation of many Gaussian distributed classes
gendats   Generation of two Gaussian distributed classes

Exercise D.1 Generate an artificial dataset (one from the prdataset generation table) containing 3 objects per class.
Exercise D.2 Inspect the data matrix in the prdataset object by using the + operator and
make a scatterplot of the dataset.
Exercise D.3 Can you point out which object is plotted where in the scatterplot? Check
if each object is coloured according to its class label.
Datasets can be combined by [a; b] if a and b have equal numbers of features and by [a b]
if they have equal numbers of objects.
Exercise D.4 Make a dataset containing 2 classes with 5 objects per class and call it a1 (for
instance from gendats). Make a second dataset containing 2 classes with 5 objects per
class and call it a2 (for instance from gendatb).
Combine the two datasets using
>> b = [a1; a2]
and make a scatterplot.
Next combine the two datasets using
>> c = [a1 a2]
and again make a scatterplot. How do the two datasets differ? Check your conclusions
by using size(b) and size(c).
Creating subsets of datasets can be done by a(I,J) in which I is a set of indices defining the
desired objects and J is a set of indices defining the desired features. In all these examples
the a priori probabilities set for a remain unchanged.
Exercise D.5 Use the dataset c from the previous exercise to extract feature 2 and 4. Make
a scatterplot and compare it with the data matrix. Where is the first object in the data
matrix plotted in the scatterplot?
prdataset manipulation
gendat    Extraction of subsets of a given dataset
gendatk   Nearest neighbour data generation
gendatp   Parzen density data generation
gendat    Extraction of test set from given dataset
prdata    Read data from file and convert into a dataset

There is a large set of routines for the generation of arbitrary normally distributed classes
(gauss), and for various specific problems (gendatc, gendatd, gendath, gendatm and
gendats). There are two commands for enriching classes by noise injection (gendatk and
gendatp). These are used for the general test set generator gendatt. A given dataset can be
split randomly into a training set and a test set using gendat.
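A minimal sketch of such a random split (the class sizes are chosen arbitrarily):

```matlab
% 50 objects per class from the Highleyman problem
a = gendath([50 50]);
% 20 objects per class go to the training set c,
% the remaining 30 per class to the test set d
[c,d] = gendat(a,[20 20]);
```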
Exercise D.6 Generate a dataset using gendatb containing 5 objects per class.
Enlarge this dataset by generating new datasets using gendatk and gendatp and appending them to your original dataset.
Make a scatterplot of the original and the second dataset, to check that it worked.
Note: have a look at the help if it is not clear how you should use a function!
Exercise D.7 Make a dataset with 100 objects per class. Generate two new datasets
trainset and testset with 60 and 40 objects per class respectively.
Make two scatterplots of the datasets. Do they differ much?

D.4 Datafile

The above described prdataset class refers to data that are stored in the computer memory
and not on disk. It has thereby its limitations w.r.t. the numbers of objects and features that
can be used. Moreover, it is assumed that objects are already represented by a fixed number
of vector elements (features, dissimilarities).
The recently introduced (PRTools version 4.1) class prdatafile offers several ways to start
with objects stored in files as raw images, time signals or mat-files. There is thereby no limitation to the size. A prdatafile is a special type of prdataset by which many operations
defined on a prdataset are supported. In addition there are tools to define several types of
preprocessing filters and feature measurement commands. At some moment all objects in a
prdatafile are represented by the same features and it can be converted to a prdataset. After that all procedures for dimension reduction, cluster analysis, classification and evaluation
can be used.
An important difference between a prdatafile and a prdataset is that operations on the
first are just stored in the prdatafile as a defined processing which is only executed at the
moment it is converted to a prdataset. Operations on a prdataset are directly executed.
D.5 Mapping

Data structures of the class mapping store trained classifiers, feature extraction results, data
scaling definitions, nonlinear projections, etc. They are usually denoted by w.

mapping
prmapping   Define mapping
getlab      Retrieve labels assigned by mapping

A mapping w is often created by training a classifier on some data. For instance, the nearest
mean classifier nmc is trained on some data a by:
>> a = gendatb(20);
>> w = nmc(a)
Nearest Mean, 2 to 2 trained classifier --> affine
w by itself, or display(w) lists the size and type of a classifier as well as the routine which is
used for computing the mapping a*w.
When a mapping is trained, it can be applied to a dataset, using the operator *:
>> b = a*w
Banana Set, 20 by 2 dataset with 2 classes: [7 13]
The result of the operation a*w is again a dataset. It is the classified, rescaled or mapped
result of applying the mapping definition stored in w to a.

mappings and classifiers
classc   Converts a mapping into a classifier
labeld   General classification routine for trained classifiers
testc    General error estimation routine for trained classifiers

All routines operate in multi-class problems. For mappings which change the labels of the
objects (so the mapping is actually a classifier) the routines labeld and testc are useful.
labeld and testc are the general classification and testing routines respectively. They can
handle any classifier from any routine.
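As an illustration (a sketch; any trained classifier can take the place of nmc here):

```matlab
a = gendatb([20 20]);   % training set
b = gendatb([50 50]);   % independent test set
w = nmc(a);             % train the nearest mean classifier
labels = labeld(b*w);   % labels assigned to the test objects
e = testc(b*w);         % estimated classification error on b
```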

Linear and polynomial classifiers
klldc     Linear classifier by KL expansion of common cov matrix
loglc     Logistic linear classifier
fisherc   Fisher's discriminant (minimum least square linear classifier)
ldc       Normal densities based linear classifier (Bayes rule)
nmc       Nearest mean classifier
nmsc      Scaled nearest mean classifier
perlc     Linear classifier by linear perceptron
pfsvc     Pseudo-Fisher support vector classifier
qdc       Normal densities based quadratic (multi-class) classifier
udc       Uncorrelated normal densities based quadratic classifier

Nonlinear classifiers
knnc         k-nearest neighbour classifier (find k, build classifier)
mapk         k-nearest neighbour mapping routine
testk        Error estimation for k-nearest neighbour rule
parzenc      Parzen density based classifier
parzenml     Optimization of smoothing parameter in Parzen density estimation
parzen_map   Parzen mapping routine
testp        Error estimation for Parzen classifier
edicon       Edit and condense training sets
treec        Construct binary decision tree classifier
tree_map     Classification with binary decision tree
bpxnc        Train feed forward neural network classifier by backpropagation
lmnc         Train feed forward neural network by Levenberg-Marquardt rule
rbnc         Train radial basis neural network classifier
neurc        Automatic neural network classifier
rnnc         Random neural network classifier
svc          Support vector classifier

D.6 Training and testing

There are many commands to train and use mappings between spaces of different (or equal)
dimensionalities. For example:
if a is an m by k dataset (m objects in a k-dimensional space)
and w is a k by n mapping (map from k to n dimensions)
then a*w is an m by n dataset (m objects in an n-dimensional space)
Mappings can be linear (e.g. a rotation) as well as nonlinear (e.g. a neural network). Typically
they can be used as classifiers. In that case a k by n mapping maps a k-feature data vector
onto the output space of an n-class classifier (exception: 2-class classifiers like discriminant
functions may be implemented by a mapping to a 1D space, e.g. the distance to the
discriminant, n = 1).
Mappings are of the data type mapping and have a size of [k,n] if they map from k to n
dimensions. Mappings can be instructed to assign labels to the output columns, e.g. the class
names. These labels can be retrieved by
labels = getlab(w); before the mapping, or
labels = getlab(a*w); after the dataset a is mapped by w.
Mappings can be learned from examples, (labeled) objects stored in a dataset a, for instance
by training a classifier:
w3 = ldc(a);       the normal densities based linear classifier
w2 = knnc(a,3);    the 3-nearest neighbour rule
w1 = svc(a,'p',2); the support vector classifier based on a 2nd order
polynomial kernel

optional

The mapping of a test set b by b*w1 can also be written with an untrained classifier, e.g.
v1 = svc([],'p',2): b*(a*v1) first trains v1 on a and then applies the result to b. Irregular
but very handy shorthands are a*v1*b and even a*ldc*b. Note that expressions are evaluated
from left to right, so b*a*v1 may result in an error as the multiplication of the two datasets
(b*a) is executed first.
end optional

D.7 Example

In this example a 2D Highleyman dataset A is generated, 100 objects for each class. Out of
each class 20 objects are used for training, C, and 80 for testing, D. Four classifiers are
computed: a linear one and a quadratic one, both assuming normal densities (which is correct
in this case), a Parzen classifier and a neural network. Note that the data generation uses the
random generator; as a result the numbers only reproduce with the original seed. After
computing and displaying classification results for the test set, a scatterplot is made in which
all classifiers are drawn.
%PREX_PLOTC PRTools example on the dataset scatter and classifier plot
help prex_plotc
echo on
% Generate Highleyman data
A = gendath([100 100]);
% Split the data into training and test sets
[C,D] = gendat(A,[20 20]);
% Compute classifiers
w1 = ldc(C);      % linear
w2 = qdc(C);      % quadratic
w3 = parzenc(C);  % Parzen
w4 = lmnc(C,3);   % neural net
% Store classifiers in a cell array
W = {w1,w2,w3,w4};
% Compute and display errors
disp(D*W*testc);
% Plot the data and classifiers
figure
% Make a scatter-plot
scatterd(A);
% Plot classifiers
plotc(w1,w2,w3,w4);
echo off


Appendix E

Introduction to datafiles
Robert P.W. Duin and David M.J. Tax

E.1 Introduction

Since the introduction of PRTools 4.1 the toolbox has been extended to avoid some of the
limitations that were imposed by the use of the prdataset object. The first limitation of the
dataset object is that all objects in a dataset should have the same (number of) features. In
particular when one starts with raw data, for instance images or time series of different sizes,
the processing has to be done outside PRTools. A second limitation is that datasets can
become very large, so large in fact that the dataset object cannot be stored in memory. To
remove these limitations the prdatafile object is introduced.
A datafile in PRTools is a generalization of the dataset concept, and it inherits most of
the properties of a dataset. The datafile allows the distribution of a large dataset over a set
of files stored on disk. Datafiles are mainly an administration of the files and directories
in which the objects are stored. A datafile is, like a dataset, a set consisting of M objects,
each described by k features. k may be unknown and may vary over different objects, in which
case it is set to zero.
Almost all operations defined for datasets are also defined for datafiles, with a few exceptions.
Operations on datafiles are possible as long as they can be stored (e.g. filtering of images
for raw datafiles, or object selection by gendat). Furthermore, commands that are able to
process objects sequentially, like nmc and testc, can be executed on datafiles. The use of
untrained mappings in combination with datafiles is a problem, as they have to be adapted
to the sequential use of the objects. Mappings that can handle datafiles are indicated in the
Contents file.
Different types of datafiles are defined:
raw Every file is interpreted as a single object in the dataset. These files may, for instance,
be images of different size. It is assumed that objects from different classes are stored
each in a different directory. The directory name is used as the class label.
cell All files should be mat-files containing just a single variable being a cell array. Its
elements are interpreted as objects. The file names will be used as labels during construction. This may be changed by the user afterwards.

pre-cooked In this case the user should supply a command that reads the file and converts
it to a dataset. More than one object may be stored in a single file. Furthermore, the
command should define the class labels for the objects stored in the dataset.
half-baked All files are mat-files, containing a single labeled dataset.
mature This is a datafile created by PRTools, using the savedatafile command after
execution of all preprocessings defined for the datafile.
The most natural way for a user to create a datafile is using the option raw.
Whenever a raw datafile is sufficiently defined by pre- and postprocessing it can be converted
into a dataset. If this is still a large dataset, not suitable for the available memory, it should be
stored by the savedatafile command and is ready for later use. If the dataset is sufficiently
small it can be directly converted into a dataset by prdataset.
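A sketch of this workflow; the directory name mydigits is a hypothetical example, and a raw datafile expects one subdirectory per class:

```matlab
a = prdatafile('mydigits');   % raw datafile defined on a directory
b = im_resize(a,[16,16]);     % preprocessing is stored, not yet executed
c = prdataset(b);             % execution happens here; small result as dataset
% for results too large for memory, use savedatafile instead
% (see help savedatafile)
```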

E.2 Operations on datafiles

The main commands specific for datafiles are:

prdatafile     constructor; it defines a datafile on a directory
addpreproc     adds preprocessing commands (low level command)
addpostproc    adds postprocessing commands (low level command)
filtm          user interface to add preprocessing to a datafile
savedatafile   executes all defined pre- and postprocessing and stores
               the result as a dataset in a set of mat-files
prdataset      conversion to dataset

Most functions that were defined in MeasTools are also available as mappings that operate
on datafiles. In particular, all examples as they were defined in the MeasTools appendix
work in an identical way:
% load digits as measurement variables
>> a = load_nist([5,9],[1:20]); figure; show(a)
% add rows and columns to create a square figure (aspect ratio 1)
>> b = im_box(a,[],1); figure; show(b)
% resample to 16 x 16 pixels
>> c = im_resize(b,[16,16]); figure; show(c)
% compute means
>> d = im_mean(c);
% convert to PRTools dataset and display
>> d = prdataset(d,[],'featlab',char('mean-x','mean-y'));
>> scatterd(d,'legend');

E.3 Image Processing

Before measuring features, images might have to be preprocessed. PRTools contains a
series of mapping routines to facilitate this. If the desired routine is not available, it can
easily be built from standard Matlab or DipImage routines using filtm.
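For instance, if a median filter is needed, it might be wrapped as follows. This is a sketch: the exact call pattern of filtm should be checked with help filtm, and the datafile directory mydigits is hypothetical:

```matlab
a = prdatafile('mydigits');       % hypothetical raw image datafile
b = filtm(a,'medfilt2',{[3 3]});  % apply Matlab's medfilt2 to every image
```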

mappings for image processing
im_gray            Convert any image to a gray-value image
im_invert          Invert an image by subtracting it from its maximum
im_gaussf          Gaussian filter
im_minf            Minimum filter
im_maxf            Maximum filter
im_stretch         Grey-value stretching
im_hist_equalize   Histogram equalization
im_threshold       Thresholding
im_berosion        Binary erosion
im_bdilation       Binary dilation
im_bpropagation    Binary propagation
im_label           Label binary image
im_select_blob     Selection of largest blob in a binary image
im_box             Bounding box for binary image
im_center          Center binary image in center of gravity
im_resize          Resize images
im_scale           Scale binary image
im_rotate          Rotate image
im_fft             Image FFT
im_cols            Conversion of full color images to 3 grey value images

A number of more specific routines for measuring features in images are available:

prfilters for features of images
im_moments   Computes various image moments: central, Hu and Zernike
im_mean      Computes just the centre of gravity
im_profile   Computes the horizontal and vertical image profiles
im_measure   Measures various object features

The last routine, im_measure, computes a single set of features. If the image contains a set
of objects, it first has to be split by im2meas into images containing a single object, in order
to obtain a measurement set in which each object refers to a single physical object. The
im_measure routine makes use of the DipImage package. The user has to take care that this
is loaded.
